Wednesday, 30 March 2011

Troubleshooting

Dell™ PowerEdge™ Expandable RAID Controller 5/i and 5/E User's Guide

To get help with your Dell™ PowerEdge™ Expandable RAID Controller (PERC) 5 controller, you can contact your Dell Technical Service representative or access the Dell Support website at support.dell.com.

Virtual Disks Degraded


A redundant virtual disk is in a degraded state when one physical disk has failed or is inaccessible. For example, a RAID 1 virtual disk consisting of two physical disks can sustain one physical disk in a failed or inaccessible state and become a degraded virtual disk.
To recover from a degraded virtual disk, rebuild the physical disk in the inaccessible state. Upon successful completion of the rebuild process, the virtual disk state changes from degraded to optimal. For the rebuild procedure, see Performing a Manual Rebuild of an Individual Physical Disk in RAID Configuration and Management.
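The state change described above can be sketched as a small Python model. This is an illustration only, not controller firmware logic: the function name `virtual_disk_state` and the `tolerated_failures` parameter are assumptions, chosen to match the RAID 1 example of two physical disks that can sustain one inaccessible disk.

```python
def virtual_disk_state(total_disks, inaccessible_disks, tolerated_failures=1):
    """Return the state of a redundant virtual disk (illustrative model)."""
    if not 0 <= inaccessible_disks <= total_disks:
        raise ValueError("inaccessible_disks must be between 0 and total_disks")
    if inaccessible_disks == 0:
        return "optimal"
    if inaccessible_disks <= tolerated_failures:
        # Redundancy still covers the loss; a rebuild returns the disk to optimal.
        return "degraded"
    return "failed"


# RAID 1 virtual disk with two physical disks, as in the example above.
print(virtual_disk_state(2, 0))  # optimal
print(virtual_disk_state(2, 1))  # degraded
print(virtual_disk_state(2, 2))  # failed
```

After a successful rebuild the inaccessible count returns to zero, which is why the model reports optimal again.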

Memory Errors

Memory errors can corrupt cached data, so the controllers are designed to detect and attempt to recover from these memory errors. Single-bit memory errors can be handled by the firmware and do not disrupt normal operation. A notification will be sent if the number of single-bit errors exceeds a threshold value.
Multi-bit errors are more serious, as they result in corrupted data and data loss. The following are the actions that occur in the case of multi-bit errors:
  • If an access to data in cache memory causes a multi-bit error when the controller is started with dirty cache, the firmware will discard the cache contents. The firmware will generate a warning message to the system console to indicate that the cache was discarded and will generate an event.
  • If a multi-bit error occurs at run-time either in code/data or in the cache, the firmware will stop.
  • The firmware will log an event to the firmware internal event log and will log a message during POST indicating that a multi-bit error has occurred.
NOTE: In case of a multi-bit error, contact Dell Technical Support.
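The responses listed above can be summarized as a decision function. The sketch below is a reading aid in Python, not the firmware's implementation: the names `Action` and `classify_memory_error` are hypothetical, and the threshold is passed in by the caller because this guide does not state its value.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue normal operation"
    NOTIFY = "continue and send a notification"
    DISCARD_CACHE = "discard cache, warn on console, log event"
    STOP = "stop firmware, log event"


def classify_memory_error(multi_bit, single_bit_count, threshold,
                          startup_with_dirty_cache=False):
    """Map a memory error to the response described in this section."""
    if not multi_bit:
        # Single-bit errors are handled by the firmware; only the count matters.
        return Action.NOTIFY if single_bit_count > threshold else Action.CONTINUE
    if startup_with_dirty_cache:
        # Multi-bit error on cache access at startup: cache contents are discarded.
        return Action.DISCARD_CACHE
    # Multi-bit error at run time in code/data or cache: the firmware stops.
    return Action.STOP


# The threshold of 10 is a made-up value for illustration only.
print(classify_memory_error(False, 3, threshold=10).value)
print(classify_memory_error(True, 0, threshold=10).value)
```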

General Problems

Table 6-1 describes general problems you might encounter, along with suggested solutions.

Table 6-1. General Problems 

Problem: The device displays in Device Manager but has a yellow bang (exclamation point).
Suggested Solution: Reinstall the driver. See the driver installation procedures in the section Driver Installation.

Problem: The device does not appear in Device Manager.
Suggested Solution: Turn off the system and reseat the controller.

Problem: A No Hard Drives Found message appears during a CD installation of Microsoft® Windows® 2000 Server, Windows Server® 2003, or Windows XP because of one of the following causes:
  1. The driver is not native in the operating system.
  2. The virtual disks are not configured properly.
  3. The controller BIOS is disabled.
Suggested Solution: The corresponding solutions to the three causes of the message are:
  1. Press <F6> to install the RAID Device Driver during installation.
  2. Enter the BIOS Configuration Utility to configure the virtual disks. See the section RAID Configuration and Management for procedures to configure the virtual disks.
  3. Enter the BIOS Configuration Utility to enable the BIOS. See the section Hardware Installation and Configuration for procedures to configure the virtual disks.

Physical Disk Related Issues

Table 6-2 describes physical disk-related problems you might encounter, along with suggested solutions.

Table 6-2. Physical Disk Issues 

Problem: One of the physical disks in the disk array is in the inaccessible state.
Suggested Solution: Perform the following actions to resolve this problem:
  • Check the enclosure or backplane for damage.
  • Check the SAS cables.
  • Reseat the physical disk.
  • Contact Dell Technical Support if the problem persists.

Problem: Cannot rebuild a fault tolerant virtual disk.
Suggested Solution: This could result from any of the following:
  • The replacement disk is too small. Replace the failed disk with a good physical disk with sufficient capacity.

Problem: Fatal errors or data corruption are reported when accessing virtual disks.
Suggested Solution: Contact Dell Technical Support.

Physical Disk Failures and Rebuilds

Table 6-3 describes issues related to physical disk failures and rebuilds.
Table 6-3. Physical Disk Failure and Rebuild Issues 

Issue: Rebuilding a physical disk after one of the physical disks is in an inaccessible state.
Suggested Solution: If you have configured hot spares, the PERC 5 controller automatically tries to use one to rebuild a physical disk that is in an inaccessible state. Manual rebuild is necessary if no hot spares with enough capacity to rebuild the inaccessible physical disks are available. You must insert a physical disk with enough storage into the subsystem before rebuilding the physical disk. You can use the BIOS Configuration Utility or Dell OpenManage™ Storage Management application to perform a manual rebuild of an individual physical disk.
See the section Performing a Manual Rebuild of an Individual Physical Disk in RAID Configuration and Management for procedures to rebuild a single physical disk.

Issue: Rebuilding the physical disks after multiple disks become simultaneously inaccessible.
Suggested Solution: Multiple physical disk errors in a single array typically indicate a failure in cabling or connection and could involve the loss of data. It is possible to recover the virtual disk after multiple physical disks become simultaneously inaccessible. Perform the following steps to recover the virtual disk.
  1. Turn off the system, check cable connections, and reseat physical disks. Follow the safety precautions to prevent electrostatic discharge. Ensure that all the drives are present in the enclosure.
  2. Power up the system, enter the CTRL-R utility, and import the foreign configuration.
If the virtual disk (VD) is redundant and transitioned into DEGRADED state before going OFFLINE, a rebuild operation starts automatically after the configuration is imported. If the VD has gone directly into the OFFLINE state due to a cable pull or power loss situation, the VD is imported in its OPTIMAL state without a rebuild occurring.
You can use the BIOS Configuration Utility or Dell OpenManage Storage Management application to perform a manual rebuild of multiple physical disks.
See the section Performing a Manual Rebuild of an Individual Physical Disk in RAID Configuration and Management for procedures to rebuild a single physical disk.

Issue: A virtual disk fails during rebuild while using a global hot spare.
Suggested Solution: The global hot spare goes back into HOTSPARE state and the virtual disk goes into FAIL state.

Issue: A virtual disk fails during rebuild while using a dedicated hot spare.
Suggested Solution: The dedicated hot spare goes into READY state and the virtual disk goes into FAIL state.

Issue: A physical disk becomes inaccessible during a reconstruction process on a redundant virtual disk that has a hot spare.
Suggested Solution: The rebuild operation for the inaccessible physical disk starts automatically after the reconstruction is completed.

Issue: A physical disk is taking longer than expected to rebuild.
Suggested Solution: A physical disk takes longer to rebuild when the system is under high stress. For example, under heavy load there is one rebuild input/output (I/O) operation for every five host I/O operations.
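To put the one-to-five ratio in perspective, the short Python calculation below works out what share of all I/O operations goes to the rebuild. It assumes the ratio means one rebuild operation out of every six operations in total (one rebuild plus five host); the function name `rebuild_io_share` is made up for this example.

```python
def rebuild_io_share(rebuild_ops=1, host_ops=5):
    """Fraction of all I/O operations that are rebuild operations."""
    return rebuild_ops / (rebuild_ops + host_ops)


share = rebuild_io_share()
print(f"Rebuild share of I/O under load: {share:.1%}")  # 16.7%
```

Under that assumption only about one sixth of the I/O operations advance the rebuild while the host is busy, which is consistent with a rebuild that takes noticeably longer than on an idle system.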

SMART Error

Table 6-4 describes issues related to the Self-Monitoring Analysis and Reporting Technology (SMART). SMART monitors the internal performance of all motors, heads, and physical disk electronics and detects predictable physical disk failures.
NOTE: For information about where to find reports of SMART errors that could indicate hardware failure, see the Dell OpenManage Storage Management documentation.

Table 6-4. SMART Error

Problem: A SMART error is detected on a physical disk in a redundant virtual disk.
Suggested Solution: Perform the following steps:
  1. Force the physical disk offline.
  2. Replace it with a new physical disk of equal or higher capacity.
  3. Perform a rebuild.

Problem: A SMART error is detected on a physical disk in a non-redundant virtual disk.
Suggested Solution: Perform the following steps:
  1. Back up your data.
  2. Delete the virtual disk. See Deleting Virtual Disks for information on deleting a virtual disk.
  3. Replace the affected physical disk with a new physical disk of equal or higher capacity.
  4. Recreate the virtual disk. See Setting Up Virtual Disks for information on creating virtual disks.
  5. Restore the backup.

PERC 5 Post Error Messages

In PERC 5 controllers, the BIOS (read-only memory, ROM) provides INT 13h functionality (disk I/O) for the virtual disks connected to the controller, so that you can boot from or access the physical disks without the need of a driver. Table 6-5 describes the error messages and warnings that display for the BIOS.

Table 6-5. BIOS Errors and Warnings 

Message: BIOS Disabled. No Logical Drives Handled by BIOS
Meaning: This warning displays after you disable the ROM option in the configuration utility. When the ROM option is disabled, the BIOS cannot hook Int13h and cannot provide the ability to boot from the virtual disk. (Int13h is an interrupt signal that supports numerous commands that are sent to the BIOS, then passed to the physical disk. The commands include actions you can perform with a physical disk, such as reading, writing, and formatting.)

Message: Press <Ctrl><R> to Enable BIOS
Meaning: When the BIOS is disabled, you are given the option to enable it by entering the configuration utility. You can change the setting to Enabled in the configuration utility.

Message: Adapter at Baseport xxxx is not responding
where xxxx is the baseport of the controller
Meaning: If the controller does not respond for any reason but is detected by the BIOS, the BIOS displays this warning and continues.
Shut down the system and try to reseat the controller. If this message still occurs, contact Dell Technical Support.

Message: x Virtual Disk(s) Failed
where x is the number of virtual disks failed
Meaning: When the BIOS detects virtual disks in the failed state, it displays this warning. You should check to determine why the virtual disks failed and correct the problem. No action is taken by the BIOS.

Message: x Virtual Disk(s) Degraded
where x is the number of virtual disks degraded
Meaning: When the BIOS detects virtual disks in a degraded state, it displays this warning. You should try to make the virtual disks optimal. No action is taken by the BIOS.

Message: Memory/Battery problems were detected. The adapter has recovered, but cached data was lost. Press any key to continue.
Meaning: This message occurs under the following conditions:
  • The adapter detects that data in the controller cache has not yet been written to the disk subsystem.
  • The controller detects an error-correcting code (ECC) error while performing its cache checking routine during initialization.
  • The controller then discards the cache rather than sending it to the disk subsystem because the data integrity cannot be guaranteed.
To resolve this problem, allow the battery to charge fully. If the problem persists, the battery or adapter DIMM might be faulty. In that case, contact Dell Technical Support.

Message: Firmware is in Fault State
Meaning: Contact Dell Technical Support.

Message: Firmware version inconsistency was detected. The adapter has recovered, but cached data was lost. Press any key to continue.
Meaning: New firmware has been flashed that is incompatible with the previous version. The cache contains data that has not been written to the physical disks and that cannot be recovered. Check data integrity. You may need to restore the data from a backup.

Message: Foreign configuration(s) found on adapter. Press any key to continue, or 'C' to load the configuration utility.
Meaning: When the controller firmware detects a physical disk with existing foreign metadata, it flags the physical disk as foreign and generates an alert indicating that a foreign disk was detected.
You can use the BIOS Configuration Utility to import or clear the foreign configuration.

Message: The foreign configuration message is always present during POST but no foreign configurations are present in the foreign view page in CTRL+R and all virtual disks are in an optimal state.
Meaning: Clear the foreign configuration using CTRL+R or Dell OpenManage™ Server Administrator Storage Management.
If a physical disk that was once a member of a virtual disk is inserted into the system, and that disk's previous location has been taken by a replacement disk through a rebuild, the newly inserted disk must have its foreign configuration flag manually removed.

Message: Previous configuration(s) cleared or missing. Importing configuration created on XX/XX XX.XX. Press any key to continue, or 'C' to load the configuration utility.
Meaning: The controller and physical disks have different configurations. You can use the BIOS Configuration Utility to import or clear the foreign configuration.

Message: There are X enclosures connected to port X but only X may be connected to a single SAS port. Please remove the extra enclosures then restart your system.
Meaning: Too many enclosures are attached to one port. The extra enclosures must be removed and the system restarted.

Message: Invalid SAS topology detected. Please check your cable configurations, repair the problem, and restart your system.
Meaning: The SAS cables for your system are improperly connected. Check the cable connections and fix any problems, then restart the system. You may need to restore your data from a backup.

Message: Multi-bit errors are detected on the controller. DIMM on the controller needs replacement. If you continue, data corruption can occur. Press 'X' to continue or else power off the system and replace the DIMM module and reboot. If you have replaced the DIMM please press 'X' to continue.
Meaning: There are multi-bit ECC errors (MBE). ECC errors are errors that occur in the memory, which can corrupt cached data so that it has to be discarded.
NOTICE: MBE errors are serious, as they result in corrupted data and data loss. In case of MBE errors, contact Dell Technical Support.
NOTE: A similar message appears when multiple single-bit ECC errors are detected on the controller during bootup.

Message: Some configured disks have been removed from your system, or are no longer accessible. Check your cables and ensure all disks are present. Press any key or 'C' to continue.
Meaning: An array has failed. Some configured disks were removed from the system or, if not removed, are no longer accessible for other reasons.
The SAS cables for your system might be improperly connected. Check the cable connections and fix any problems, then restart the system. You may need to restore your data from a backup.
If there are no cable problems, press any key or <C> to continue.

Message: Physical disk removed: Physical Disk {x.x.x} Controller {x}, Connector {x}
Device failed: Physical Disk {x.x.x} Controller {x}, Connector {x}
Meaning: These two messages appear in the event log when you remove a drive. One indicates that the disk was removed and the other indicates that the device has failed. This is expected behavior.
A storage component such as a physical disk or an enclosure has failed. The failed component might have been identified by the controller while performing a task such as a rescan or a check consistency.
Replace the failed component. You can identify which disk has failed by locating the disk that has a red "X" for its status. Perform a rescan after replacing the disk.

Message: Battery is missing or the battery could be fully discharged. If battery is connected and has been allowed to charge for 30 minutes and this message continues to appear, then contact Technical Support for assistance.
Meaning: One of the following conditions applies:
  • The controller battery is missing or damaged.
  • The controller battery is completely discharged and needs to be charged for it to become active. The battery must first be charged and the system must be restarted for the battery to be active again.

Red Hat Enterprise Linux Operating System Errors

Table 6-6 describes an issue related to the Red Hat® Enterprise Linux operating system.
Table 6-6. Linux Operating System Error 

Error Message: kernel: sdb: asking for cache data failed
kernel: sdb: assuming drive cache: write through
Suggested Solution: This error message displays when the Linux Small Computer System Interface (SCSI) mid layer asks for physical disk cache settings. Because the PERC 5 controller firmware manages the virtual disk cache settings on a per controller and a per virtual disk basis, the firmware does not respond to this command. Thus, the Linux SCSI mid layer assumes that the virtual disk's cache policy is write-through. sdb is the device node for a virtual disk. This value changes for each virtual disk.
See the section Setting Up Virtual Disks for more information about write-through cache.
Except for this message, there is no side effect to this behavior. The cache policy of the virtual disk and the I/O throughput are not affected by this message. The cache policy settings for the PERC 5 SAS RAID system remain the settings you have already chosen.

Error Message: Driver does not auto-build into new kernel after customer updates.
Suggested Solution: This error is a generic problem for DKMS and applies to all DKMS-enabled driver packages. This issue occurs when you perform the following steps:
  1. Install a DKMS-enabled driver package.
  2. Run up2date or a similar tool to upgrade the kernel to the latest version.
  3. Reboot into the new kernel.
The driver running in the new kernel is the native driver in the new kernel. The driver package you installed earlier does not take effect in the new kernel.
Perform the following procedure to make the driver auto-build into the new kernel:
  1. Type:
dkms build -m <module_name> -v <module version> -k <kernel version>
  2. Type:
dkms install -m <module_name> -v <module version> -k <kernel version>
  3. Type the following to check whether the driver is successfully installed in the new kernel:
dkms status
The following details appear:
<driver name>, <driver version>, <new kernel version>: installed

Error Message: smartd[2338] Device: /dev/sda, Bad IEC (SMART) mode page, err=-5, skip device
smartd[2338] Unable to register SCSI device /dev/sda at line 1 of file /etc/smartd.conf
Suggested Solution: These error messages are caused by an unsupported command coming directly from the user application. This is a known issue in which user applications try to direct Command Descriptor Blocks to RAID volumes. These error messages have no effect on the user and there is no loss of functionality due to this error.
The Mode Sense/Select command is supported by firmware on the PERC 5. However, the smartd daemon is issuing the command to the virtual disk instead of to the driver IOCTL node. This action is not supported.
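The DKMS procedure in Table 6-6 takes three values: the module name (`-m`), the module version (`-v`), and the kernel version (`-k`). The Python sketch below shows one way to assemble the two commands and to check a status line of the form shown in the table. The helper names `dkms_commands` and `is_installed` are hypothetical, and the module name, module version, and kernel version in the example are placeholder values for illustration, not values taken from this guide.

```python
def dkms_commands(module_name, module_version, kernel_version):
    """Build the dkms build and install commands as argument lists."""
    common = ["-m", module_name, "-v", module_version, "-k", kernel_version]
    return [["dkms", "build"] + common, ["dkms", "install"] + common]


def is_installed(status_line, module_name, module_version, kernel_version):
    """Check a line shaped like '<name>, <version>, <kernel>: installed'."""
    fields, _, state = status_line.rpartition(":")
    parts = [part.strip() for part in fields.split(",")]
    # Compare only the first three fields so extra trailing fields are tolerated.
    return (parts[:3] == [module_name, module_version, kernel_version]
            and state.strip() == "installed")


# Placeholder values for illustration only.
for argv in dkms_commands("megaraid_sas", "1.0.0", "2.6.9-42.EL"):
    print(" ".join(argv))
```

Building the commands as argument lists keeps each value as a single argument if the commands are later passed to a process launcher rather than typed by hand.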

LED Behavior Patterns

The external SAS ports on the PERC 5/E Adapter have a port status LED per x4 SAS port. This bi-color LED displays the status of any external SAS port. The LED indicates whether all links are functional or only partial links are functional. Table 6-7 describes the patterns for the port status.
Table 6-7. LED Behavior Patterns

Port State: Power-on state
LED State: Off

Port State: Reset state
LED State: Off

Port State: All links in port connected
LED State: Green light on

Port State: One or more links are not connected (applicable only in wide port configurations)
LED State: Amber light on

Port State: All links in the port are disconnected or the cable is disconnected
LED State: Off
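The patterns in Table 6-7 amount to a simple lookup on how many links of the port are connected. The Python sketch below restates the table under the assumption of a x4 port (four links); the function name `port_led_state` and its parameters are illustrative and do not correspond to any controller interface.

```python
def port_led_state(connected_links, total_links=4, in_power_on_or_reset=False):
    """Return the port status LED state for an external SAS port."""
    if not 0 <= connected_links <= total_links:
        raise ValueError("connected_links must be between 0 and total_links")
    if in_power_on_or_reset:
        return "off"
    if connected_links == 0:
        # All links disconnected, or the cable is disconnected.
        return "off"
    if connected_links == total_links:
        return "green"
    # Only some links are connected (wide port configurations).
    return "amber"


print(port_led_state(4))  # green
print(port_led_state(2))  # amber
print(port_led_state(0))  # off
```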

Audible Alarm Warnings

An audible alarm is available on the PERC 5/E Adapter to alert you to key critical and warning events involving virtual disk or physical disk problems. You can use the Basic Input/Output System (BIOS) Configuration Utility to enable, disable, or silence the on-board alarm tone.
NOTE: Silencing the alarm stops only the current alarm; future alarms will still sound. To permanently disable the alarm, select the disable alarm option.
Table 6-8 lists the critical and warning events, severity levels of the events, and audible codes.

Table 6-8. Audible Alarm Descriptions

Description                         Severity  Audible Code
Controller alarm enabled            Normal    N/A
Virtual disk failed                 Critical  3 seconds on, 1 second off
Virtual disk degraded               Warning   1 second on, 1 second off
Global hot spare failed             Warning   1 second on, 1 second off
Dedicated hot spare failed          Warning   1 second on, 1 second off
Physical disk failed                Critical  1 second on, 1 second off
Rebuild completed on physical disk  Normal    1 second on, 3 seconds off
Rebuild failed on physical disk     Warning   1 second on, 1 second off
Physical disk offline               Critical  1 second on, 1 second off
NOTE: If the PERC 5/E alarm was already beeping due to a previous failure and a new virtual disk is created on the same controller, then the previous alarm will be silenced. This is expected behavior.
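Because several events in Table 6-8 share the same audible code, a beep pattern on its own narrows the cause down without always identifying it. The Python sketch below encodes the table as a dictionary and lists the events that match a given pattern; the names `ALARM_CODES` and `candidate_events` are made up for this example.

```python
# event description -> (severity, seconds on, seconds off), from Table 6-8
ALARM_CODES = {
    "Virtual disk failed": ("Critical", 3, 1),
    "Virtual disk degraded": ("Warning", 1, 1),
    "Global hot spare failed": ("Warning", 1, 1),
    "Dedicated hot spare failed": ("Warning", 1, 1),
    "Physical disk failed": ("Critical", 1, 1),
    "Rebuild completed on physical disk": ("Normal", 1, 3),
    "Rebuild failed on physical disk": ("Warning", 1, 1),
    "Physical disk offline": ("Critical", 1, 1),
}


def candidate_events(seconds_on, seconds_off):
    """Return the events whose audible code matches the given pattern."""
    return sorted(
        event
        for event, (_severity, on, off) in ALARM_CODES.items()
        if (on, off) == (seconds_on, seconds_off)
    )


print(candidate_events(3, 1))       # ['Virtual disk failed']
print(len(candidate_events(1, 1)))  # 6
```

Only the 3 seconds on, 1 second off and 1 second on, 3 seconds off patterns identify a single event; for the 1 second on, 1 second off pattern, the event log is needed to tell the six candidates apart.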

