Flag Day: FMA for Athlon 64 and Opteron Processors
Date: Sat, 11 Feb 2006 16:01:35 -0800 (PST)
From: Cynthia McGuire <cindi at ozz dot sfbay dot sun dot com>
To: eversholt-interest at sun dot com, fma-interest at sun dot com, on-all at eng dot sun dot com
Subject: Flag Day: FMA for Athlon 64 and Opteron Processors
Today's putback for:
PSARC 2006/020 FMA for Athlon 64 and Opteron Processors
PSARC 2006/028 eversholt language enhancements
6359264 Provide FMA support for AMD64 processors
represents a flag day in that new user and kernel components are
introduced, so you should not mix and match userland and kernel across
this flag day. As usual, BFU can be used to get you a consistent system.
A large number of files have also changed; you should do a clobber build
once you bringover the changes from the FMA putback. Specific flag day
details are described below.
This project brings the same level of FMA support to our Athlon 64 and Opteron
family of platforms that we have for SPARC-based platforms. Detailed
information describing the following features is found at
http://ctg.central.sun.com/wiki/index.php/FMA_x64_cpu/mem. FMA for x64
provides:
Error handling and ereport generation for Machine Check Architecture (MCA)
errors as well as background polling for correctable errors.
Diagnosis for faulty CPUs and DIMMs related to those errors.
Automatic page retire and CPU offline responses to faulty CPUs and DIMMs.
Diagnosis and response activities are fully integrated with the fault
manager daemon, fmd(1M), and the syslog messaging agent to produce a standard
FMA diagnosis message, such as:
SUNW-MSG-ID: AMD-8000-5M, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Feb 7 12:03:02 PST 2006
PLATFORM: Sun Fire X4200 Server, CSN: 0000000000, HOSTNAME: vcr
SOURCE: eft, REV: 1.16
EVENT-ID: cc22e400-1e60-ee9f-81f3-af3d035f4dd8
DESC: The number of errors associated with this CPU has exceeded acceptable
levels. Refer to http://sun.com/msg/AMD-8000-5M for more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. Use
fmdump -v -u to identify the module.
The EVENT-ID can be used to learn more about the diagnosis and impact
on system resources:
# fmdump -v -u cc22e400-1e60-ee9f-81f3-af3d035f4dd8
TIME UUID SUNW-MSG-ID
Feb 07 12:03:02.5062 cc22e400-1e60-ee9f-81f3-af3d035f4dd8 AMD-8000-5M
100% fault.cpu.amd.l2cachedata
Problem in: hc:///motherboard=0/chip=0/cpu=0
Affects: cpu:///cpuid=0
FRU: hc:///motherboard=0/chip=0
# psrinfo
0 faulted since 02/07/2006 12:03:02
1 on-line since 02/07/2006 11:58:29
2 on-line since 02/07/2006 11:58:31
3 on-line since 02/07/2006 11:58:33
This means that CPU 0 has a bad L2 data cache; to repair the problem,
the CPU module should be replaced. Specific details regarding repair
policies for the CPU are found at http://sun.com/msg/AMD-8000-5M.
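If you want to check for faulted CPUs from a script rather than by eye,
the psrinfo output above is easy to parse. The following is only a minimal
sketch in Python, not part of this putback: the helper names parse_psrinfo
and faulted_cpus are made up for illustration, and the sample text is copied
from the output above. On a live system the text would come from running
psrinfo itself.

```python
import re

# Sample copied from the psrinfo output shown above. On a live system this
# text could instead be captured from the psrinfo command.
SAMPLE = """\
0 faulted since 02/07/2006 12:03:02
1 on-line since 02/07/2006 11:58:29
2 on-line since 02/07/2006 11:58:31
3 on-line since 02/07/2006 11:58:33
"""

# One line per processor: "<id> <state> since <timestamp>".
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+since\s+(.+?)\s*$")


def parse_psrinfo(text):
    """Return a dict mapping processor id to its reported state string."""
    states = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            states[int(match.group(1))] = match.group(2)
    return states


def faulted_cpus(text):
    """Return the sorted list of processor ids whose state is 'faulted'."""
    return sorted(
        cpu for cpu, state in parse_psrinfo(text).items() if state == "faulted"
    )


if __name__ == "__main__":
    print(faulted_cpus(SAMPLE))
```

With the sample above, only processor 0 is reported, which matches the
cpu:///cpuid=0 resource named in the fmdump output.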
Flag Day information:
BFU Changes
We have made minor modifications to BFU itself. If you use an outdated
BFU to get to the new archives on a lab test machine where the FMA group
was at one point testing older FMA bits, you may experience problems
with FMA features. Please use the new BFU or update yours from the new
source in usr/src/tools.
BFU Conflict Resolution
You will see conflicts in the following files:
etc/driver_aliases
etc/name_to_major
You will need to resolve all of these conflicts in order to enable FMA.
As usual, the BFU acr utility will do the right thing.
Impact to Platform Developers
This project removes all of the project-private .topo files in favor
of standard topologies for SPARC and x64 systems. This change will
help platform teams deliver consistent topologies for use in their
eft diagnosis rules without having to deliver additional
platform-specific .topo files. Topologies may be viewed with the internal
fmtopo command:
# /usr/lib/fm/fmd/fmtopo -v
Topology Snapshot 22704dd4-d473-e3ac-a03b-af5e98bcabe9
hc:///motherboard=0
ASRU: -
FRU: hc:///motherboard=0
Label: MB
hc:///motherboard=0/chip=0
ASRU: -
FRU: hc:///motherboard=0/chip=0
Label: -
hc:///motherboard=0/chip=0/cpu=0
ASRU: cpu:///cpuid=0
FRU: hc:///motherboard=0/chip=0
Label: -
This output is suitable for inclusion in section 3 of a platform
FMA portfolio.
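For platform teams who want to post-process this listing, for example to
confirm that every node names a FRU, the text is regular enough to parse.
The following is a rough sketch in Python under stated assumptions: the
helper names parse_fmtopo and hc_components are hypothetical, the sample is
copied from the output above, and the parser only handles the simple
hc:///name=instance/... form shown here, not every FMRI variant.

```python
# Sample copied from the fmtopo -v output shown above.
SAMPLE = """\
Topology Snapshot 22704dd4-d473-e3ac-a03b-af5e98bcabe9
hc:///motherboard=0
ASRU: -
FRU: hc:///motherboard=0
Label: MB
hc:///motherboard=0/chip=0
ASRU: -
FRU: hc:///motherboard=0/chip=0
Label: -
hc:///motherboard=0/chip=0/cpu=0
ASRU: cpu:///cpuid=0
FRU: hc:///motherboard=0/chip=0
Label: -
"""


def parse_fmtopo(text):
    """Map each hc:// node to a dict of its ASRU, FRU and Label properties.

    A property printed as "-" is stored as None.
    """
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hc://"):
            # A bare FMRI line starts a new node.
            current = {}
            nodes[line] = current
        elif current is not None and ":" in line:
            # Property line such as "FRU: hc:///motherboard=0".
            key, _, value = line.partition(":")
            value = value.strip()
            current[key.strip()] = None if value == "-" else value
    return nodes


def hc_components(fmri):
    """Split a simple hc:/// FMRI into (name, instance) pairs."""
    path = fmri.split("///", 1)[1]
    pairs = (part.split("=", 1) for part in path.split("/"))
    return [(name, int(instance)) for name, instance in pairs]


if __name__ == "__main__":
    for fmri, props in parse_fmtopo(SAMPLE).items():
        print(hc_components(fmri), props)
```

Keying the result by the node FMRI keeps the lookup direct when you start
from the "Problem in:" line of an fmdump report.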
More Information:
More information about the FMA Program in general is available at
http://fma.eng. Specific information on the x64 FMA project is available
at http://ctg.central.sun.com/wiki/index.php/FMA_x64_cpu/mem. You will want
to take a look if you are an Opteron software or hardware developer who
wants to learn more about the error handling and diagnosis capabilities
offered via Solaris on our Galaxy, Andromeda, Marrakesh and Thumper
families.
As always, if you would like to talk with others interested in fault
management or want to participate in discussions about features and RFEs,
please sign up for fma-interest at sun dot com using netadmin or the fault management
discussion (fm: discuss) forum at http://www.opensolaris.org/os/discussions.
Feel free to send any questions or comments directly to the FMA core team at
fma-core at sun dot com. The same bug categories for FMA-related bugs and RFEs
cover this project:
kernel/fm - Solaris kernel FMA infrastructure
library/fm - Solaris FMA libraries
utility/fm - Solaris utilities fmdump(1M), fmstat(1M), fmadm(1M), fmd(1M)
fma/io - i/o error handling, telemetry, diagnosis engines, agents
fma/cpu - cpu error handling, telemetry, diagnosis engines, agents
fma/mem - memory error handling, telemetry, diagnosis engines, agents
fma/other - incoming triage category, requests for new fma features
If you're not sure which category to use, file a bug in fma/other and we
will be happy to recategorize it for you.
Cindi, humble servant of the FMA x64 I-Team