Troubleshooting Exadata Storage Server Disk Failures

After a great weekend, we came to the office and performed our daily health checks like every Monday. One of the storage servers (cells) of our Exadata X2-2 (X4270 M2) had lost 11 of its 34 ASM disks. We struck it lucky: all databases were still up despite the loss.

Let's examine what happened to our cell server. When I checked the mailbox, I saw an alert mail from the problematic cell stating that "Disk controller was hung. Cell was power cycled". It looks like the cell's disk controller was not performing well (maybe a bug or a load peak) and forced the server to reboot. But normally a reboot does not end in disk losses.
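
To see exactly what the cell reported, the alert history can also be listed on the cell itself; a minimal check filtered to critical alerts (attribute names may vary slightly between Exadata software versions):

CellCLI> list alerthistory where severity = 'critical' attributes name, beginTime, alertMessage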

I started by checking the cell's physicaldisk status.

CellCLI> list physicaldisk
20:0 XXXXXX normal
20:1 XXXXXX normal
20:2 XXXXXX normal
20:3 XXXXXX failed --> Failed Disk
20:4 XXXXXX normal
20:5 XXXXXX normal
20:6 XXXXXX normal
20:7 XXXXXX import failure --> Import Failure
20:8 XXXXXX normal
20:9 XXXXXX normal
20:10 XXXXXX normal
20:11 XXXXXX normal
FLASH_1_0 1111M00AAA normal
FLASH_1_1 1111M00AAA normal
FLASH_1_2 1111M00AAA normal
FLASH_1_3 1111M00AAA normal
FLASH_2_0 1111M00AAA normal
FLASH_2_1 1111M00AAA normal
FLASH_2_2 1111M00AAA normal
FLASH_2_3 1111M00AAA normal
FLASH_4_0 1111M00AAA normal
FLASH_4_1 1111M00AAA normal
FLASH_4_2 1111M00AAA normal
FLASH_4_3 1111M00AAA normal
FLASH_5_0 1111M00AAA not present --> FMODs of failed flash disk
FLASH_5_1 1111M00AAA not present --> FMODs of failed flash disk
FLASH_5_2 1111M00AAA not present --> FMODs of failed flash disk
FLASH_5_3 1111M00AAA not present --> FMODs of failed flash disk

What I got from the output was that we had one flash disk failure (FLASH_5 and its four FMODs) and one hard disk failure (disk number 3), plus one more hard disk in "import failure" status (disk number 7). But that did not explain 11 failed ASM disks. Flash disks hold the flash cache rather than ASM disks, and each of the two problematic hard disks hosts three grid disks here, so the output only accounted for 6 ASM disks. There had to be something more.
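
To dig deeper into the problematic devices, the physical disk listing can be filtered on status (a sketch; CellCLI accepts != in its WHERE filters):

CellCLI> list physicaldisk where status != normal detail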

I continued by checking the grid disks.

CellCLI> list griddisk attributes name,size,offset,asmdeactivationoutcome,status
DATA_CD_00_exacel13 423G 32M Yes cacheContentLost
DATA_CD_01_exacel13 423G 32M Yes active
DATA_CD_02_exacel13 423G 32M Yes active
DATA_CD_03_exacel13 423G 32M Yes not present
DATA_CD_04_exacel13 423G 32M Yes active
DATA_CD_05_exacel13 423G 32M Yes active
DATA_CD_06_exacel13 423G 32M Yes active
DATA_CD_07_exacel13 423G 32M Yes not present
DATA_CD_08_exacel13 423G 32M Yes active
DATA_CD_09_exacel13 423G 32M Yes active
DATA_CD_10_exacel13 423G 32M Yes active
DATA_CD_11_exacel13 423G 32M Yes cacheContentLost
MORE_CD_02_exacel13 29.125G 528.734375G Yes active
MORE_CD_03_exacel13 29.125G 528.734375G Yes not present
MORE_CD_04_exacel13 29.125G 528.734375G Yes active
MORE_CD_05_exacel13 29.125G 528.734375G Yes active
MORE_CD_06_exacel13 29.125G 528.734375G Yes active
MORE_CD_07_exacel13 29.125G 528.734375G Yes not present
MORE_CD_08_exacel13 29.125G 528.734375G Yes active
MORE_CD_09_exacel13 29.125G 528.734375G Yes active
MORE_CD_10_exacel13 29.125G 528.734375G Yes active
MORE_CD_11_exacel13 29.125G 528.734375G Yes cacheContentLost
RECO_CD_00_exacel13 105.6875G 423.046875G Yes cacheContentLost
RECO_CD_01_exacel13 105.6875G 423.046875G Yes active
RECO_CD_02_exacel13 105.6875G 423.046875G Yes active
RECO_CD_03_exacel13 105.6875G 423.046875G Yes not present
RECO_CD_04_exacel13 105.6875G 423.046875G Yes active
RECO_CD_05_exacel13 105.6875G 423.046875G Yes active
RECO_CD_06_exacel13 105.6875G 423.046875G Yes active
RECO_CD_07_exacel13 105.6875G 423.046875G Yes not present
RECO_CD_08_exacel13 105.6875G 423.046875G Yes active
RECO_CD_09_exacel13 105.6875G 423.046875G Yes active
RECO_CD_10_exacel13 105.6875G 423.046875G Yes active
RECO_CD_11_exacel13 105.6875G 423.046875G Yes cacheContentLost

Five grid disks related to two other physical disks (disk numbers 0 and 11) were in "cacheContentLost" status. I searched Oracle Support for grid disks in "cacheContentLost" status.

Doc ID 2346075.1 was related to our problem. The document was clear, explaining the steps to recover grid disks in the cacheContentLost state.
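
Before following the note, it is worth identifying which flash disk was caching the affected grid disks; on a write-back cell, the cachedBy attribute shows the flash cell disks backing each grid disk (a sketch based on our configuration):

CellCLI> list griddisk where status = 'cacheContentLost' attributes name, status, cachedBy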

When write-back flash cache is active on the storage cells and a flash disk fails, the grid disks cached by the failed flash disk can be holding stale data. If the flash disk failure occurs while the Exadata storage software is running, a resilvering operation is started to resynchronize the stale blocks from the other storage servers.

But if the flash disk failure happens while the storage software is not running, or during the reboot phase of the cell, the resilvering operation is not started and the grid disks are labeled with the "cacheContentLost" state. The grid disks stay offline to prevent the databases from accessing the stale data.
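
Whether a cell is actually running in write-back mode can be verified directly:

CellCLI> list cell attributes flashCacheMode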

In our case, the disk controller hung, which ended up rebooting the server. During the reboot phase the flash disk failure occurred, and the grid disks got stuck in "cacheContentLost". Our team checked the gv$asm_operation view for ongoing rebalance operations; there were no rows. The ASM disks related to those grid disks had already been dropped, since the disk repair time had already passed.
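
For reference, these are the kinds of checks we ran from ASM; the second query shows the disk_repair_time attribute that determines how long ASM waits before force-dropping an offlined disk (a sketch; adjust the diskgroup names to your environment):

SYS@+ASM1> select * from gv$asm_operation;

no rows selected

SYS@+ASM1> select dg.name, a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number = a.group_number and a.name = 'disk_repair_time';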

We decided to recreate the grid disks in "cacheContentLost" state to make them visible to ASM again.

CellCLI> drop griddisk DATA_CD_00_exacel13
CellCLI> drop griddisk RECO_CD_00_exacel13
CellCLI> create griddisk DATA_CD_00_exacel13 CELLDISK=CD_00_exacel13,size=423G,offset=32M
CellCLI> create griddisk RECO_CD_00_exacel13 CELLDISK=CD_00_exacel13,size=105.6875G,offset=423.046875G
SYS@+ASM1> alter diskgroup DATA add disk 'o/192.168.31.21/DATA_CD_00_exacel13' rebalance power 10;
SYS@+ASM1> alter diskgroup RECO add disk 'o/192.168.31.21/RECO_CD_00_exacel13' rebalance power 10;
CellCLI> drop griddisk DATA_CD_11_exacel13
CellCLI> drop griddisk RECO_CD_11_exacel13
CellCLI> drop griddisk MORE_CD_11_exacel13
CellCLI> create griddisk DATA_CD_11_exacel13 CELLDISK=CD_11_exacel13,size=423G,offset=32M
CellCLI> create griddisk RECO_CD_11_exacel13 CELLDISK=CD_11_exacel13,size=105.6875G,offset=423.046875G
CellCLI> create griddisk MORE_CD_11_exacel13 CELLDISK=CD_11_exacel13,size=29.125G,offset=528.734375G
SYS@+ASM1> alter diskgroup DATA add disk 'o/192.168.31.21/DATA_CD_11_exacel13' rebalance power 10;
SYS@+ASM1> alter diskgroup RECO add disk 'o/192.168.31.21/RECO_CD_11_exacel13' rebalance power 10;
SYS@+ASM1> alter diskgroup MORE add disk 'o/192.168.31.21/MORE_CD_11_exacel13' rebalance power 10;
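
We then confirmed the recreated grid disks were active on the cell and visible to ASM again; asmmodestatus moves from SYNCING to ONLINE as the rebalance completes (a sketch):

CellCLI> list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome
SYS@+ASM1> select name, mode_status, state from v$asm_disk where name like '%CD_00%' or name like '%CD_11%';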

After those commands, we were left with one hard disk in import failure, one failed hard disk, and one failed flash disk with 4 FMODs. We opened SRs for the failed flash disk and the failed hard disk, and replaced them with the new spares we had on hand. Normally, no additional steps are required to re-create the cell disks or grid disks after a flash disk or hard disk replacement.
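
After the replacements, the replaced hard disk's grid disks were rebuilt automatically, which can be confirmed as follows (a sketch; the LIKE patterns in CellCLI WHERE clauses are regular expressions):

CellCLI> list physicaldisk where status != normal
CellCLI> list griddisk where name like '.*_CD_03_.*' attributes name, status, asmmodestatus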

Now let's continue with our case: we had only one problematic hard disk left, so only three grid disks (and their ASM disks) were still missing. That disk was in "import failure" status. We executed the command below to check the hard disk's state.

[root@exacel13 trace]# /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 |egrep 'Slot Number|Firmware state'
Slot Number: 7
Firmware state: Unconfigured(good), Spun Up
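
The firmware state alone does not reveal foreign configurations, so we scanned the controller for them explicitly (assuming the standard MegaCli foreign-config syntax):

[root@exacel13 trace]# /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Scan -a0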

The foreign state was not looking good, so I tried to clear the foreign state of that hard disk. The commands below were executed to clear the foreign configuration and reconfigure the RAID virtual drive on that disk.

[root@exacel13 trace]# /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Clear -a0
Foreign configuration 0 is cleared on controller 0.
[root@exacel13 trace]# /opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -R0 [20:7] WB NORA Direct NoCachedBadBBU -strpsz1024 -a0
Adapter 0: Created VD 14
Adapter 0: Configured the Adapter!!

The grid disks for that disk were still in "not present" state, so we decided to go with the reenable command for that physical disk. The commands for the reenable are as follows.

CellCLI> alter physicaldisk 20:7 drop for replacement
Physical disk 20:7 was dropped for replacement.
CellCLI> alter physicaldisk 20:7 reenable
Physical disk 20:7 was reenabled.
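
A final verification confirmed the disk and its three grid disks were back (a sketch):

CellCLI> list physicaldisk 20:7 attributes name, status
CellCLI> list griddisk where name like '.*_CD_07_.*' attributes name, status, asmmodestatus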

Now everything was back to normal. It was really a manic Monday. To avoid experiencing the same situation again, we also decided to update our Exadata storage server image to the latest release. The issue has not happened again since.
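
For reference, the image version currently running on a cell can be checked with imageinfo when planning the upgrade:

[root@exacel13 ~]# imageinfo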

Hope it helps.

