Raid 5 array problem

edited January 2008 in Hardware
Hi, fellow hunters,

Here is a worthy problem, which I hope somebody will take pity on and help me with before I go completely gaga trying to sort it out.

I recently purchased a Highpoint 1820A card to set up a RAID 5 array and had no problems for about 2 months. Unfortunately, I seem to have run into a problem with the RAID setup, and I hope you can advise me on how to resolve it or troubleshoot it further.

Summed up:
After getting an error on one drive, all the other drives in the array fail. After this, whenever the files that were being worked on at the time are read, the same errors occur and the drives fail, regardless of whether the array has been rebuilt.

More detail:
I have a hardware RAID 5 array set up on 8 x 160 GB SATA drives, all on the same controller. I set up the RAID 5 in the RAID card BIOS and then created an ext3 filesystem on it.
It had been working fine until now and was about half full when I started getting the audio alert. At the time only Azureus (a BitTorrent client) was running. The computer would not respond to keyboard or mouse for about 10 seconds and then recovered, but I was unable to access the array.

The RAID GUI tool's error log showed that the hard drive on channel 5 had failed and that the other drives were getting errors because of it. Syslog showed the SCSI read errors shown below. I was unable to access any of the array until I restarted. After backing up everything I could, I found that as soon as I tried to read the latest files on the array (files that were being worked on, and so must have been being written to at the time), the same problem recurred: the audio alert sounded again and all the drives failed the same way.

In the RAID card BIOS the drive on channel 5 was shown as separate from the rest of the array. I tried rebuilding in the BIOS by re-adding the drive on channel 5 and then choosing rebuild. At the rate it was going it would have taken approximately 3 days, so instead I booted up and rebuilt with the GUI RAID tool, which took about 3 hours. Unfortunately, it still failed in exactly the same way as soon as any attempt was made to read the problem files.

If the array is rebuilt, it always seems to fail on channel 5 (channel 4 according to syslog) whenever data is read from the problem files; once that drive fails, the other drives all start getting errors and fail as well. If the array is not rebuilt, it fails on channel 6, 7 or 8 instead and then fails the rest the same way.

I have tried updating the RAID card BIOS and drivers to the latest versions, as well as the Linux kernel (both 2.4 and 2.6), with no luck. I have also run a non-destructive read-write test with badblocks (e2fsck -c -c /dev/sda1), which found nothing. However, during the scan badblocks reported a total of 273508625 blocks, whereas the SCSI errors are reported at sector 1879053343. Is it possible I am not scanning all the drives?

I have a second 1820A card which I have tried with the same results.


Hopefully you could please answer the following questions for me:


Why would this array fail only when accessing certain files on it even after being rebuilt? Why would a read/write test not seem to trigger the errors?

Are there any further tools or procedures you can recommend for pinpointing the problem?

Is there a recommended way for testing the reliability of the array once it is built?

Is there a reason for the raid bios rebuild time taking much longer than the gui tools rebuild?

Any suggestions or advice at all would be happily received as I am becoming quite frustrated with this problem. Especially any recommendations on how to prevent this from happening again or ways of working out what caused this.


Thank you for your time.



Further info & logs:


Spec:
Highpoint 1820A
8 x SATA Seagate 160gb drives
Tagan 440W power supply
NCCH-DL Dual processor board with 2x 2.8 ghz Xeons
1gb RAM
20gb Maxtor PATA drive for operating system - Debian Linux


My message log:

Apr 10 22:26:31 chikyuu kernel: IAL: COMPLETION ERROR, adapter 0, channel 4, flags=104
Apr 10 22:26:31 chikyuu kernel: ATA regs: error 10, sector count 1, LBA low ff, LBA mid ff, LBA high ff, device 4f, status 51
Apr 10 22:26:31 chikyuu kernel: Retry on channel(4)
Apr 10 22:26:31 chikyuu kernel: SCSI error : return code = 0x25050000
Apr 10 22:26:31 chikyuu kernel: end_request: I/O error, dev sda, sector 1879053343
Apr 10 22:26:32 chikyuu kernel: IAL: COMPLETION ERROR, adapter 0, channel 4, flags=104
Apr 10 22:26:32 chikyuu kernel: ATA regs: error 10, sector count 1, LBA low ff, LBA mid ff, LBA high ff, device 4f, status 51
Apr 10 22:26:32 chikyuu kernel: Retry on channel(4)
Apr 10 22:26:32 chikyuu kernel: psmouse.c: Explorer Mouse at isa0060/serio1/input0 lost synchronization, throwing 2 bytes away.
Apr 10 22:26:32 chikyuu kernel: SCSI error : return code = 0x25050000
Apr 10 22:26:32 chikyuu kernel: end_request: I/O error, dev sda, sector 1879053351
Apr 10 22:26:33 chikyuu kernel: IAL: COMPLETION ERROR, adapter 0, channel 4, flags=104
Apr 10 22:26:33 chikyuu kernel: ATA regs: error 10, sector count 1, LBA low ff, LBA mid ff, LBA high ff, device 4f, status 51
Apr 10 22:26:33 chikyuu kernel: Retry on channel(4)


Once it has retried all channels repeatedly and given up, it cycles through SCSI error and end_request messages for several hundred entries. The psmouse entry occurs when the computer freezes temporarily as the problem files are accessed.
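
Since the errors run to several hundred entries, it may help to pull the distinct failing sector numbers out of syslog with a short script. What follows is only a minimal sketch: it assumes the "end_request: I/O error, dev ..., sector N" line format seen in the excerpt above, and the name failed_sectors is hypothetical.

```python
import re

# Matches kernel lines of the form seen in the syslog excerpt, e.g.
# "end_request: I/O error, dev sda, sector 1879053343"
END_REQUEST = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failed_sectors(log_text):
    """Return a sorted list of unique (device, sector) pairs found in log_text."""
    hits = {(m.group(1), int(m.group(2))) for m in END_REQUEST.finditer(log_text)}
    return sorted(hits)


# Two lines copied from the message log above, used as sample input.
sample = """\
Apr 10 22:26:31 chikyuu kernel: end_request: I/O error, dev sda, sector 1879053343
Apr 10 22:26:32 chikyuu kernel: end_request: I/O error, dev sda, sector 1879053351
"""

for dev, sector in failed_sectors(sample):
    print(dev, sector)
```

If the same small set of sectors turns up on every run, that would fit with the failure being tied to particular files rather than striking at random.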


Excerpt from RAIDtools log:

RAID I 04/10/2005 14:17:24 Array 'RAID_5_0' rebuilding started.
RAID I 04/10/2005 16:26:52 Array 'RAID_5_0' rebuilding completed.
RAID I 04/10/2005 21:59:32 User RAID(from 127.0.0.1) exited from system.
RAID I 04/10/2005 22:25:46 User RAID(from 127.0.0.1) logged on system.
RAID E 04/10/2005 22:26:32 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/10/2005 22:26:32 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/10/2005 22:26:35 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/10/2005 22:26:35 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/10/2005 22:26:35 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/10/2005 22:26:37 Disk at Controller1-Channel5-Device1 failed.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel4-Device1.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel3-Device1.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel6-Device1.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel7-Device1.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel8-Device1.
RAID E 04/10/2005 22:26:39 Disk at Controller1-Channel8-Device1 failed.
RAID I 04/10/2005 22:28:00 User RAID(from 127.0.0.1) exited from system.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel3-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel4-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel6-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel7-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel8-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/10/2005 22:43:37 An error occured on the disk at Controller1-Channel2-Device1.
RAID I 04/10/2005 22:43:42 User RAID(from 127.0.0.1) logged on system.
RAID I 04/10/2005 22:44:09 User RAID(from 127.0.0.1) exited from system.
RAID I 04/10/2005 22:48:35 User RAID(from 127.0.0.1) logged on system.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel4-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel3-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel6-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel7-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel8-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/10/2005 22:49:39 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/10/2005 22:49:41 Disk at Controller1-Channel8-Device1 failed.
RAID I 04/10/2005 22:56:18 User RAID(from 127.0.0.1) logged on system.
RAID I 04/10/2005 22:56:24 Array 'RAID_5_0' rebuilding started.
RAID I 04/11/2005 00:50:38 Array 'RAID_5_0' rebuilding completed.
RAID E 04/11/2005 22:04:01 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/11/2005 22:04:03 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/11/2005 22:04:03 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/11/2005 22:04:03 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/11/2005 22:04:05 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/11/2005 22:04:05 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/11/2005 22:04:05 Disk at Controller1-Channel5-Device1 failed.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel7-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel3-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel4-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel6-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel8-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:04:16 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/11/2005 22:04:18 Disk at Controller1-Channel8-Device1 failed.
RAID I 04/11/2005 22:15:45 User RAID(from 127.0.0.1) logged on system.
RAID I 04/11/2005 22:16:21 Deleting RAID 5 Array 'RAID_5_0' succeeded.
RAID I 04/11/2005 22:19:41 User RAID(from 127.0.0.1) logged on system.
RAID I 04/11/2005 22:19:48 Array 'RAID_5_0' rebuilding started.
RAID E 04/11/2005 22:51:52 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:51:56 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:52:01 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:52:05 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:52:09 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:52:14 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/11/2005 22:52:18 Disk at Controller1-Channel1-Device1 failed.
RAID W 04/11/2005 22:52:18 Array 'RAID_5_0' rebuilding failed.
RAID I 04/11/2005 22:52:43 User RAID(from 127.0.0.1) exited from system.
RAID I 04/16/2005 14:39:35 User RAID(from 127.0.0.1) logged on system.
RAID E 04/16/2005 14:41:02 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/16/2005 14:41:05 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/16/2005 14:41:05 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/16/2005 14:41:05 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/16/2005 14:41:07 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/16/2005 14:41:07 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/16/2005 14:41:07 Disk at Controller1-Channel5-Device1 failed.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel8-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel3-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel4-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel6-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel7-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel1-Device1.
RAID E 04/16/2005 14:41:17 An error occured on the disk at Controller1-Channel2-Device1.
RAID E 04/16/2005 14:41:19 Disk at Controller1-Channel7-Device1 failed.
RAID I 04/16/2005 14:47:27 Array 'RAID_5_0' rebuilding started.
RAID I 04/16/2005 14:47:38 User RAID(from 127.0.0.1) logged on system.


The gap of 5 days at the end of the log is due to the much longer time the BIOS takes to rebuild compared with the GUI tool.
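
A log this long is easier to read as a per-channel tally. The sketch below counts "error" and "failed" entries for each channel. It assumes the exact message wording of the RAIDtools excerpt above (including the tool's own spelling "occured"), and the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Patterns follow the wording in the RAIDtools excerpt verbatim,
# including the tool's spelling "occured".
ERROR = re.compile(r"An error occured on the disk at Controller\d+-Channel(\d+)-Device\d+")
FAILED = re.compile(r"Disk at Controller\d+-Channel(\d+)-Device\d+ failed")


def summarize(log_text):
    """Count error and failure entries per channel number."""
    errors, failures = Counter(), Counter()
    for line in log_text.splitlines():
        m = ERROR.search(line)
        if m:
            errors[int(m.group(1))] += 1
            continue
        m = FAILED.search(line)
        if m:
            failures[int(m.group(1))] += 1
    return errors, failures


# Four lines copied from the log above, used as sample input.
sample = """\
RAID E 04/10/2005 22:26:32 An error occured on the disk at Controller1-Channel5-Device1.
RAID E 04/10/2005 22:26:37 Disk at Controller1-Channel5-Device1 failed.
RAID E 04/10/2005 22:26:37 An error occured on the disk at Controller1-Channel4-Device1.
RAID E 04/10/2005 22:26:39 Disk at Controller1-Channel8-Device1 failed.
"""

errors, failures = summarize(sample)
for channel in sorted(set(errors) | set(failures)):
    print(f"channel {channel}: {errors[channel]} errors, {failures[channel]} failures")
```

Run over the full log, a tally like this should show at a glance whether one channel consistently errors first or whether the failures are spread evenly.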


Thanks for any help, this is driving me nuts trying to sort it out. :confused::confused:

Comments

  • Tex Dallas/Ft. Worth
    edited April 2005
    The GUI tools access the controller/drives through drivers that allow optimal DMA access methods, among other things. The BIOS usually uses the slowest, most basic disk access methods.

    I have tons of probs with SATA cables. Especially if they are zip tied up to make it look pretty.

    But eight drives internally, along with CD-ROMs and fans, is stressing most PSUs as well. Especially with the new AMDs or Intels where the CPU draws off the 12 volt line also.

    You're running dual CPUs and nine total hard disks and one or two CD-ROMs off a 440 watt PSU?

    That's bad mojo, bud. The 440 might be OK running only the dual-CPU motherboard (and I bet it's marginal... check the manufacturer's website...) but with nine hard disks and a CD-ROM or two you're begging for trouble.

    I run 600 watt PSUs on my two dual Opterons with some IDE drives and CD-ROMs... but I use a separate PSU for the 8 SCSI drives also.

    You're way underpowered on the PSU.

    Tex
  • GrayFox /dev/urandom Member
    edited April 2005
    I'm going to have to recommend the PC Power and Cooling Turbo-Cool 850 for that rig. Yeah, it's very pricey (retail price: $489.95 USD) but 9 HDDs :eek:
  • edited May 2005
    Thank you for the replies.

    On your advice I am looking into a better power supply. Currently looking at a Thermaltake 680W that is well recommended. The current supply is actually a Tagan 480W (not 440W like I said) but as you say it is close to the limit (the drives only seem to be taking about 10W at load according to my multimeter so usage works out about 440W for everything).

    Though the power supply is a likely culprit for the original failure, I am still trying to work out why it failed the same way each time and how to make sure it does not happen again. Even with a power problem I would not expect it to fail as soon as data is read from particular files. I have disconnected each drive one by one, checked it with SeaTools (Seagate's hard drive tool) and read its SMART data. All drives check out as fine, with not even any bad sectors marked. All cables have been replaced. I have rebuilt the array from scratch and remade the file system.

    I have worked out my confusion with blocks and sectors: a sector is 512 bytes and the ext3 filesystem uses 4096 bytes per block, so simply divide the sector number by 8 to find the block to scan :) . However, this means that the normal Linux scan tools were checking the whole array but not showing any problem.
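
The sector-to-block arithmetic above can be written out as a small helper. This is only a sketch using the 512-byte sector and 4096-byte block sizes quoted in the post. The partition_start_sector parameter is a hypothetical addition: if the sector in the kernel message is counted from the start of the whole disk rather than from the start of the partition, the partition's starting sector would need to be subtracted first, which is an assumption worth checking on the actual system.

```python
SECTOR_BYTES = 512   # sector size quoted in the post
BLOCK_BYTES = 4096   # ext3 block size quoted in the post


def sector_to_block(sector, partition_start_sector=0):
    """Convert a 512-byte sector number to the 4096-byte block that contains it.

    partition_start_sector is hypothetical: leave it at 0 to reproduce the
    plain "divide by 8" rule from the post.
    """
    sectors_per_block = BLOCK_BYTES // SECTOR_BYTES  # 8
    return (sector - partition_start_sector) // sectors_per_block


# Sector from the syslog error and block total reported during the badblocks scan.
block = sector_to_block(1879053343)
print(block)              # 234881667
print(block < 273508625)  # True: the failing block lies inside the scanned range
```

With the numbers from the original post, the failing sector maps to block 234881667, which is below the 273508625 blocks that badblocks reported, so the scan did cover that part of the array.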

    Now I am a bit stuck: although I have a nice clean array set up, I am wary of using it! The only tool which showed up the problem on the array was SeaTools, but that takes about a week (yep... that long) to run through the whole array. I am even open to installing XP on there if there is a good enough tool for checking it! (See how often you find a Linux user that will say that! It would of course be temporary though ;) ) I am currently trying a Windows rescue CD-ROM that has several disk recovery tools on it, but it does not seem very thorough.


    1. Would you recommend the single supply or getting a second, separate one for the drives? Though I will need a bigger case for that!

    2. Are there any commands or tools you might recommend for an exhaustive check of the drives?

    3. Is there any further advice you can give for making sure I do not bump into this problem again?
  • edited January 2008
    I am no expert but I know this:

    1) Change your SATA cables around, taking note of the HDD that has failed previously. From what you said, it is the same HDD that fails time after time? You should easily detect a dodgy cable doing this.

    2) Are the HDDs suitable for running RAID? When you run anything above RAID 0, if one of the HDDs spends more than 10 seconds in its error recovery mode, the controller will drop the disk out of the array, thinking it is shafted. See "RAID-specific, time-limited error recovery (TLER)" at http://westerndigital.com/en/products/Products.asp?DriveID=335
    I had this same problem when running 4 x 250 GB DiamondMax 10 drives in a RAID 5.

    3) The PSU is well underpowered. I have just bought the Zalman 1000 HP for my new setup as I will be running 12 HDDs in total. I only needed 750 watts to do this, but as the 1000 cost only marginally more I figured what the hell. Plus if I want to upgrade further, I have got a PSU that can handle it. By the way, the PSU will only supply the watts that the components draw, so you are better off going too big rather than too small.