BIOS stability improvements in early detection and error handling related to memory DIMMs

A soft error occurs when the data and/or ECC bits on the DIMM are incorrect, but the error does not recur once those bits have been corrected.

What is a correctable memory error?

Correctable errors can be detected and corrected if the BIOS and DIMM support this functionality. Correctable errors are generally single-bit errors. An uncorrectable error (UCE), by contrast, can cause the memory controller to trigger an immediate reboot of the system, which also triggers an unexpected HA event in virtualized environments.

During reboot, the BIOS checks the Machine Check registers and determines that the previous reboot was due to a UCE.

The uncorrectable ECC memory error is displayed in the service processor’s system event log (SEL) as shown here:

Memory | Uncorrectable ECC | Asserted | DIMM A0
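A SEL line in this pipe-delimited form is easy to check programmatically. The following is a minimal sketch, assuming the four-field layout of the sample line above (sensor, event, state, location); the `SelEntry` class and the function names are hypothetical helpers for illustration, not part of any Nutanix or IPMI tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelEntry:
    """One SEL line, split into the four fields seen in the sample above."""
    sensor: str    # e.g. "Memory"
    event: str     # e.g. "Uncorrectable ECC"
    state: str     # e.g. "Asserted"
    location: str  # e.g. "DIMM A0"


def parse_sel_line(line: str) -> SelEntry:
    """Split a pipe-delimited SEL line into its four fields (assumed layout)."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return SelEntry(*fields)


def is_uncorrectable(entry: SelEntry) -> bool:
    """True for an asserted uncorrectable ECC memory event."""
    return (
        entry.sensor == "Memory"
        and entry.event == "Uncorrectable ECC"
        and entry.state == "Asserted"
    )


entry = parse_sel_line("Memory | Uncorrectable ECC | Asserted | DIMM A0")
print(entry.location, is_uncorrectable(entry))
```

Real SEL output may carry additional columns such as record IDs and timestamps, so the field count would need adjusting for a given platform.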

Soft errors, as described above, do not indicate any issue with the DIMM.

Products Affected:
NX-G6/G7 platforms with BIOS versions prior to PB42.300, PU42.300, or PW42.300
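To check whether a given BIOS version falls in the affected range, the version strings can be compared numerically. This is only a sketch under stated assumptions: it assumes a version string is a two-letter family prefix (PB, PU, or PW) followed by `major.minor` digits, and that versions within one family order numerically. The names `FIXED_VERSIONS`, `parse_bios_version`, and `is_affected` are hypothetical.

```python
import re

# Fixed versions named in this article, keyed by BIOS family prefix.
FIXED_VERSIONS = {"PB": (42, 300), "PU": (42, 300), "PW": (42, 300)}

# Assumed format: family prefix, major digits, dot, minor digits (e.g. "PB42.300").
_VERSION_RE = re.compile(r"^(P[BUW])(\d+)\.(\d+)$")


def parse_bios_version(version):
    """Return (prefix, (major, minor)) for a string such as 'PU41.002'."""
    match = _VERSION_RE.match(version.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized BIOS version: {version!r}")
    prefix, major, minor = match.groups()
    return prefix, (int(major), int(minor))


def is_affected(version):
    """True if the version is older than the fixed release for its family."""
    prefix, number = parse_bios_version(version)
    return number < FIXED_VERSIONS[prefix]


for candidate in ("PB41.002", "PW42.000", "PU42.300"):
    print(candidate, "affected" if is_affected(candidate) else "not affected")
```

Comparing `(major, minor)` tuples rather than raw strings avoids surprises if the number of digits ever differs between releases.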


Description:

In IPMI, a critical error such as the following is displayed:

[Image: IPMI Error]


The log shows correctable ECC memory errors and, sometimes, uncorrectable memory errors.

So what is the difference between a correctable ECC memory error and an uncorrectable memory error?

While correctable errors do not affect the normal operation of the system, uncorrectable memory errors immediately result in a system crash or shutdown when the system is not configured for Mirroring or RAID AMP modes.

Uncorrectable errors are always multi-bit DIMM errors. The internal Health LED indicates a critical condition, and most systems also show the error on the front of the server or in the logs. Uncorrectable memory errors can typically be isolated to failed DIMM slots rather than to an individual DIMM.
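The single-bit versus multi-bit distinction can be illustrated with a toy single-error-correct, double-error-detect (SECDED) code. The sketch below uses an extended Hamming(8,4) code over 4 data bits purely for illustration; it is not the ECC scheme used by any particular DIMM or memory controller, which protect much wider words. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, indexes 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2            # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd parity: treat as a single-bit error. The syndrome is the flipped
        # position (0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot fix.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1              # one flipped bit: correctable
print(decode(word))
word[6] ^= 1              # a second flipped bit: detected but not correctable
print(decode(word))
```

With one flipped bit the decoder recovers the original data; with two it can only report the fault, which mirrors why correctable errors are survivable and uncorrectable ones are not.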

Nutanix has further improved the ability to detect problematic DIMMs and prevent unnecessary service disruption as a result of uncorrectable memory errors (UECC).


BIOS improvements introduced in PB41.002, PU41.002, and PW42.000 or later include:
• Improved proactive detection of memory errors during Patrol Scrub, with alerts generated in NCC
• Reduction in the correctable memory error threshold required to generate an alert
• Enabling of Adaptive Data Correction (ADC)


BIOS improvements introduced in PB42.300, PU42.300, and PW42.300 include:
• Enabling of Post Package Repair (PPR)
• Patrol Scrub correctable errors integrated into the CECC error handling and operate as a part of the RAS workflow
• Fixed an issue where a watchdog “three-strike” error could cause the host to stop or restart unexpectedly


Resolution:

Latest Stable BIOS Version

Nutanix has released the latest stable BIOS version with stability improvements and many fixes in early detection and error handling related to memory DIMMs.

The latest stable Nutanix NX BIOS versions, which can repair a DIMM after a host reboot, are:

  1. PB42.300 or later
  2. PU42.300 or later
  3. PW42.300 or later

Reference

Nutanix Dimm Error Handling