| linux.debian.user debian-user@lists.debian.org. |
#1 |
I have a RAID5 on 3 drives with a spare. One drive failed and it
rebuilt itself using the spare, then, before I could replace the spare,
a 2nd drive failed. I shut it down, got some new drives (bigger to be
sure they weren't too small, allowing for differences in drive sizes
reported by drive makers), replaced the bad drives, and rebuilt the
spare with no problem at all. Last night there were thunderstorms all
night and the computer lost power a few times (yes, normally it's
plugged in to a UPS, that's a long story that doesn't affect anything
here).

The RAID is on /dev/hde, f, g, and h, with h as the spare. On reboot,
the array was not being reassembled. I tried this:

    mdadm --assemble --verbose /dev/md0 /dev/hde /dev/hdf /dev/hdg

and I got this:

    mdadm: looking for devices for /dev/md0
    mdadm: /dev/hde is identified as a member of /dev/md0, slot 0.
    mdadm: /dev/hdf is identified as a member of /dev/md0, slot 1.
    mdadm: /dev/hdg is identified as a member of /dev/md0, slot 2.
    mdadm: added /dev/hde to /dev/md0 as 0
    mdadm: added /dev/hdf to /dev/md0 as 1
    mdadm: added /dev/hdg to /dev/md0 as 2
    mdadm: /dev/md0 assembled from 1 drive - not enough to start the array.

I stopped the array and reran the above command with "--run" added to
it. Then I ran:

    mdadm --detail /dev/md0

and got this:

    /dev/md0:
            Version : 00.90.01
      Creation Time : Sat Feb 25 07:10:01 2006
         Raid Level : raid5
        Device Size : 244198464 (232.89 GiB 250.06 GB)
       Raid Devices : 3
      Total Devices : 1
    Preferred Minor : 0
        Persistence : Superblock is persistent

        Update Time : Mon Aug 13 06:00:34 2007
              State : active, degraded
     Active Devices : 1
    Working Devices : 1
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 64K

               UUID : 0dcc0b91:f92304ba:e66cf827:43274a37
             Events : 0.3700484

        Number   Major   Minor   RaidDevice State
           0       0        0        -      removed
           1       0        0        -      removed
           2      34        0        2      active sync   /dev/hdg

I've tried this at different times and gotten different drives.

I've also tried:

    mdadm --detail /dev/hde

and I get:

    /dev/hde:
              Magic : a92b4efc
            Version : 00.90.00
               UUID : 0dcc0b91:f92304ba:e66cf827:43274a37
      Creation Time : Sat Feb 25 07:10:01 2006
         Raid Level : raid5
       Raid Devices : 3
      Total Devices : 3
    Preferred Minor : 0

        Update Time : Wed Aug 8 05:04:55 2007
              State : active
     Active Devices : 3
    Working Devices : 3
     Failed Devices : 0
      Spare Devices : 0
           Checksum : 732dc269 - correct
             Events : 0.3547873

             Layout : left-symmetric
         Chunk Size : 64K

          Number   Major   Minor   RaidDevice State
    this     0      33        0        0      active sync   /dev/hde

       0     0      33        0        0      active sync   /dev/hde
       1     1      33       64        1      active sync   /dev/hdf
       2     2      34        0        2      active sync   /dev/hdg

The major differences come up when I run the same command to examine
hdf, g, and h. With f and g, on the last part, I get this with hdf:

          Number   Major   Minor   RaidDevice State
    this     1      33       64        1      active sync   /dev/hdf

       0     0       0        0        0      removed
       1     1      33       64        1      active sync   /dev/hdf
       2     2      34        0        2      active sync   /dev/hdg

Then with hdg, I get:

          Number   Major   Minor   RaidDevice State
    this     2      34        0        2      active sync   /dev/hdg

       0     0       0        0        0      removed
       1     1       0        0        1      faulty removed
       2     2      34        0        2      active sync   /dev/hdg

and on hdh, I get:

          Number   Major   Minor   RaidDevice State
    this     3      34       64        3      spare   /dev/hdh

       0     0      33        0        0      active sync   /dev/hde
       1     1      33       64        1      active sync   /dev/hdf
       2     2      34        0        2      active sync   /dev/hdg
       3     3      34       64        3      spare   /dev/hdh

I notice the information changes from drive to drive and is
inconsistent.

Mdadm is not telling me which drives are bad when it assembles the array
and I want to verify what is going on. I'd like to get more info to
see if mdadm "officially" sees drives e and f as bad or just drive e,
or none at all (since reports vary according to the drive).

I would have thought, after the first issue with any drive, that the
system would not have had the others in use, since, when booting, it
would have waited for me to hit "Control-D to continue," so I doubt
that there are actually two bad drives.

Any ideas how I can get more information, find out why mdadm is not
rebuilding the RAID, or how to get it to rebuild it? It seems to think
the drives are all okay when it's adding them and doesn't report any
issues with any drives until it's done, then it says there aren't
enough drives to start the RAID.

Hal

--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
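One way to see which members md is likely to treat as stale is to compare the "Events" counter and "Update Time" that each superblock reports. In the output above, hde reports Events 0.3547873 (updated Aug 8) while the assembled array reports 0.3700484 (updated Aug 13), which suggests hde stopped being written days before the power loss. The following is only a sketch: the function name examine_summary is made up for illustration, and it assumes the version 0.90 field labels exactly as they appear in this thread.

```shell
# Print "<events> <update time>" from `mdadm --examine` output read on stdin.
# Assumes the field labels "Events :" and "Update Time :" as shown in this thread.
examine_summary() {
    awk -F' : ' '
        /^[[:space:]]*Events/      { events = $2 }
        /^[[:space:]]*Update Time/ { updated = $2 }
        END { print events, updated }
    '
}

# Hypothetical usage on the live system (needs root and mdadm installed):
#   for d in /dev/hde /dev/hdf /dev/hdg /dev/hdh; do
#       printf '%s: ' "$d"
#       mdadm --examine "$d" | examine_summary
#   done
```

Members whose event counter lags well behind the others are the ones most likely to be left out when the array is assembled.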
|
|
|
#2 |
On Saturday 18 August 2007 01:51, Hal Vaughan wrote:
> I have a RAID5 on 3 drives with a spare. One drive failed and it
> rebuilt itself using the spare, then, before I could replace the spare,
> a 2nd drive failed. I shut it down, got some new drives (bigger to be
> sure they weren't too small, allowing for differences in drive sizes
> reported by drive makers), replaced the bad drives, and rebuilt the
> spare with no problem at all. Last night there were thunderstorms all
> night and the computer lost power a few times (yes, normally it's
> plugged in to a UPS, that's a long story that doesn't affect anything
> here).

(big snip)

> I notice the information changes from drive to drive and is
> inconsistent.
>
> Mdadm is not telling me which drives are bad when it assembles the array
> and I want to verify what is going on. I'd like to get more info to
> see if mdadm "officially" sees drives e and f as bad or just drive e,
> or none at all (since reports vary according to the drive).
>
> I would have thought, after the first issue with any drive, that the
> system would not have had the others in use, since, when booting, it
> would have waited for me to hit "Control-D to continue," so I doubt
> that there are actually two bad drives.
>
> Any ideas how I can get more information, find out why mdadm is not
> rebuilding the RAID or to get it to rebuild it? It seems to think the
> drives are all okay when it's adding them and doesn't report any issues
> with any drives until it's done, then it says there aren't enough
> drives to start the RAID.

1) I don't see that any of your drives are "bad". However they are so
   inconsistent that data recovery is unlikely.

2) If you just need to assemble the array and are not worried about
   recovering the data, I would suggest zeroing each drive's superblock
   using mdadm and then creating a new array.

3) RAID 5 is not resilient against multiple failures. We now use RAID 1.
   RAID 1 is also faster, although it sometimes requires more drives.
   In extreme cases we use RAID 1 with three or more drives.

4) With four drives, rather than RAID 5 with a hot spare, I would create
   two RAID 1 arrays. One could then combine them in RAID 0 or linear
   but I would choose to make them be PVs in an LVM VG.

--Mike Bird
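Suggestion 2 above could look roughly like the following. This is a hedged sketch, not a tested recovery procedure: the helper plan_recreate is hypothetical and only prints the mdadm commands so they can be reviewed first. It assumes the goal is the original layout (RAID5 on three drives plus one hot spare, with the last device listed becoming the spare). Running the printed commands destroys whatever is on the array.

```shell
# Print (do not run) the commands for wiping the md superblocks and creating a
# fresh RAID5 with one hot spare. The last device given becomes the spare.
# DESTRUCTIVE if the printed commands are executed: existing data is lost.
plan_recreate() {
    md=$1
    shift
    for dev in "$@"; do
        echo "mdadm --zero-superblock $dev"
    done
    active=$(( $# - 1 ))
    echo "mdadm --create $md --level=5 --raid-devices=$active --spare-devices=1 $*"
}

# Example: review the plan for the four drives named in this thread.
plan_recreate /dev/md0 /dev/hde /dev/hdf /dev/hdg /dev/hdh
```

Printing the plan instead of executing it leaves room to check the device names against the actual hardware before anything is overwritten.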
|
|
|
#4 |
also sprach Hal Vaughan <hal@thresholddigital.com> [2007.08.18.1051 +0200]:
> a 2nd drive failed. I shut it down, got some new drives (bigger to be
> sure they weren't too small, allowing for differences in drive sizes
> reported by drive makers), replaced the bad drives, and rebuilt the
> spare with no problem at all.

Are you certain the rebuild was completed? Did you --add the drives to
the array after --remove'ing the broken ones?

> State : active, degraded
[...]
>        0       0        0        -      removed
>        1       0        0        -      removed
>        2      34        0        2      active sync   /dev/hdg

This does not look like you did.

> I notice the information changes from drive to drive and is
> inconsistent.

The reason for this is that some of the drives' superblocks have not
been updated because you did not --add them.

I hope you have backups. Otherwise I doubt you'll get your data back
easily.

also sprach Mike Bird <mgb-debian@yosemite.net> [2007.08.18.1537 +0200]:
> 3) RAID 5 is not resilient against multiple failures. We now use RAID 1.
>    RAID 1 is also faster, although it sometimes requires more drives.
>    In extreme cases we use RAID 1 with three or more drives.

RAID 1 is also not resilient to multiple failures.

> 4) With four drives, rather than RAID 5 with a hot spare, I would create
>    two RAID 1 arrays. One could then combine them in RAID 0 or linear
>    but I would choose to make them be PVs in an LVM VG.

... or use RAID 10, if you don't need LVM otherwise. You'll get better
performance with RAID 10 than with RAID1+LVM (or RAID1+linear or
RAID1+RAID0 for that matter).

--
 .''`.   martin f. krafft <madduck@debian.org>
: :'  :  proud Debian developer, author, administrator, and user
`. `'`   http://people.debian.org/~madduck - http://debiansystem.info
  `-  Debian - when you have better things to do than fixing systems

tempt not a desperate man.
                                                -- william shakespeare
|
|
|
#5 |
Also sprach Mike Bird <mgb-debian@yosemite.net> [2007.08.18.1537 +0200]:
> 3) RAID 5 is not resilient against multiple failures. We now use RAID 1.
>    RAID 1 is also faster, although it sometimes requires more drives.
>    In extreme cases we use RAID 1 with three or more drives.

On Monday 20 August 2007 00:36, martin f krafft wrote:
> RAID 1 is also not resilient to multiple failures.

Is the loss of N-1 members of an N-way RAID-1 not survivable?

--Mike Bird
|
|
|
#6 |
also sprach Mike Bird <mgb-debian@yosemite.net> [2007.08.20.1627 +0200]:
> > 3) RAID 5 is not resilient against multiple failures. We now use RAID 1.
> >    RAID 1 is also faster, although it sometimes requires more drives.
> >    In extreme cases we use RAID 1 with three or more drives.
>
> On Monday 20 August 2007 00:36, martin f krafft wrote:
> > RAID 1 is also not resilient to multiple failures.
>
> Is the loss of N-1 members of an N-way RAID-1 not survivable?

Well, yes. I see what you mean now. You can survive N-1 hard drives
failing at once, while with RAID5 a second failure had better not
happen before a spare has taken over.

--
martin f. krafft <madduck@debian.org>
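The trade-off discussed in this exchange can be put into numbers. The sketch below is illustrative only: raid_summary is a made-up helper, "usable" is counted in whole member disks, and "survives" is the number of member failures the level is guaranteed to tolerate. The RAID 10 line assumes two copies of each block and an even number of drives.

```shell
# Usage: raid_summary LEVEL N   prints "usable=<disks> survives=<failures>"
# "survives" is the guaranteed number of tolerated member failures.
raid_summary() {
    level=$1
    n=$2
    case $level in
        1)  echo "usable=1 survives=$(( n - 1 ))" ;;   # N-way mirror
        5)  echo "usable=$(( n - 1 )) survives=1" ;;   # single parity
        10) echo "usable=$(( n / 2 )) survives=1" ;;   # assumes 2 copies, even N
        *)  echo "unsupported level: $level" >&2; return 1 ;;
    esac
}

raid_summary 1 3    # three-way mirror
raid_summary 5 3    # the layout used in this thread, not counting the spare
raid_summary 10 4   # four drives, as suggested earlier in the thread
```

A three-way RAID 1 gives up capacity for the ability to lose two drives, while a three-drive RAID 5 keeps two drives of capacity but tolerates only one failure until a spare has finished resyncing.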
|
|
|
#8 |
On Monday 20 August 2007, martin f krafft wrote:
> also sprach Mike Bird <mgb-debian@yosemite.net> [2007.08.20.1627 +0200]:
> > > 3) RAID 5 is not resilient against multiple failures. We now use
> > > RAID 1. RAID 1 is also faster, although it sometimes requires
> > > more drives. In extreme cases we use RAID 1 with three or more
> > > drives.
> >
> > On Monday 20 August 2007 00:36, martin f krafft wrote:
> > > RAID 1 is also not resilient to multiple failures.
> >
> > Is the loss of N-1 members of an N-way RAID-1 not survivable?
>
> Well, yes. I see what you mean now. You can survive N-1 harddrives
> failing at once while with RAID5, that better not happen before
> a spare could take over.

In this case, I had 4 drives, so if one failed, then the spare should
have been added but that hadn't happened.

I've also tested the two "failed" drives and they are quite functional.
A friend made a point to me that could make a difference. I had not
partitioned the drives since mdadm seems okay without partitions. He
said even if I only use one drive-wide partition, I should still
partition the drives in a RAID first.

Fortunately, this was in a backup system so I can get new drives and
rebuild it from scratch with the larger drives. I've already got ideas
for using the "failed" drives that are proving to be just fine.

Hal
|
|
|
#9 |
also sprach Hal Vaughan <hal@thresholddigital.com> [2007.08.20.2022 +0200]:
> In this case, I had 4 drives, so if one failed, then the spare should
> have been added but that hadn't happened.

I thought your original email said it did resync the spare?

> I've also tested the two "failed" drives and they are quite
> functional. A friend made a point to me that could make
> a difference. I had not partitioned the drives since mdadm seems
> okay without partitions. He said even if I only use one
> drive-wide partition, I should still partition the drives in
> a RAID first.

Your friend is confused.
I don't see why you'd have to do this. The partition table would get
overwritten anyway.

> Fortunately, this was in a backup system so I can get new drives
> and rebuild it from scratch with the larger drives. I've already
> got ideas for using the "failed" drives that are proving to be
> just fine.

Have you inspected the smartctl output and checked for SMART errors?

--
martin f. krafft <madduck@debian.org>
|
|
|
#10 |
On Monday 20 August 2007, martin f krafft wrote:
> also sprach Hal Vaughan <hal@thresholddigital.com> [2007.08.20.2022 +0200]:
> > In this case, I had 4 drives, so if one failed, then the spare
> > should have been added but that hadn't happened.
>
> I thought your original email said it did resync the spare?

It did on the first failure. Then another failed and I turned the
machine off. When I got 2 more drives, I put them in and it rebuilt
the array using 3 of the drives with one as a spare. Then when it
failed this time, it had never started rebuilding the spare.

> > I've also tested the two "failed" drives and they are quite
> > functional. A friend made a point to me that could make
> > a difference. I had not partitioned the drives since mdadm seems
> > okay without partitions. He said even if I only use one
> > drive-wide partition, I should still partition the drives in
> > a RAID first.
>
> Your friend is confused.
> I don't see why you'd have to do this. The partition table would get
> overwritten anyway.

I've noticed, though, that on one system I had originally defined the
raid using /dev/hde1, hdf1, and so on. When I tried to rebuild it
with /dev/hde, hdf, and so on, it would not rebuild.

> > Fortunately, this was in a backup system so I can get new drives
> > and rebuild it from scratch with the larger drives. I've already
> > got ideas for using the "failed" drives that are proving to be
> > just fine.
>
> Have you inspected the smartctl output and checked for SMART errors?

I looked at the logs. Is this a different output and where would I
find it?

Hal
|
|
|
#11 |
also sprach Hal Vaughan <hal@thresholddigital.com> [2007.08.20.2114 +0200]:
> It did on the first failure. Then another failed and I turned the
> machine off. When I got 2 more drives, I put them in and it rebuilt
> the array using 3 of the drives with one as a spare. Then when it
> failed this time, it had never started rebuilding the spare.

This situation *should* be recoverable. Contact me off-list if you'd
be willing to let me log in as root and have a look.

> I've noticed, though, that on one system I had originally defined the
> raid using /dev/hde1, hdf1, and so on. When I tried to rebuild it
> with /dev/hde, hdf, and so on, it would not rebuild.

Sure, partitions have different offsets, so the superblock could not
be found.

> > Have you inspected the smartctl output and checked for SMART
> > errors?
>
> I looked at the logs. Is this a different output and where would
> I find it?

Are these ATA disks? If so, run

    smartctl -l error

The output should be pretty empty.

--
martin f. krafft <madduck@debian.org>
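The remark about offsets can be made concrete. A version 0.90 superblock (the format this array reports) sits in the last 64 KiB-aligned 64 KiB block of the member device, so its position depends on where that device starts and ends. The sketch below computes the position in 512-byte sectors. The function name sb_offset_090, the disk size of 1000000 sectors, and the partition starting at sector 63 are all made-up illustration values, not figures from this thread.

```shell
# Sector offset of a version 0.90 md superblock within a member device,
# given the device size in 512-byte sectors: round the size down to a
# multiple of 128 sectors (64 KiB), then step back one 64 KiB block.
sb_offset_090() {
    size=$1
    echo $(( (size & ~127) - 128 ))
}

# Hypothetical disk of 1000000 sectors used whole:
whole=$(sb_offset_090 1000000)

# Same disk with one partition starting at sector 63 and running to the end;
# add the partition start to get the absolute position on the disk.
part=$(( 63 + $(sb_offset_090 $(( 1000000 - 63 ))) ))

echo "whole-disk member: superblock at sector $whole"
echo "partition member:  superblock at sector $part"
```

With these example numbers the two positions differ, so a superblock written for a partition member is not where mdadm looks when handed the whole disk.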
|
|
|
#12 |
On Monday 20 August 2007, martin f krafft wrote:
> also sprach Hal Vaughan <hal@thresholddigital.com> [2007.08.20.2114 +0200]:
> > It did on the first failure. Then another failed and I turned the
> > machine off. When I got 2 more drives, I put them in and it
> > rebuilt the array using 3 of the drives with one as a spare. Then
> > when it failed this time, it had never started rebuilding the
> > spare.
>
> This situation *should* be recoverable. Contact me off-list if you'd
> be willing to let me log in as root and have a look.

I may end up deciding on that. Right now I'm considering replacing the
250GB drives that seemingly have failed with new 320s, wiping the
install and installing Etch, then just rebuilding the entire array.

Any suggestions or warnings from others so I can make sure this doesn't
happen again are appreciated. Remember, the two drives I've already
removed that mdadm had said were bad have tested out as fine. I suspect
it's more an issue with this system losing power and the RAID not being
unmounted properly, but I'd think it should be able to handle that.

> > I've noticed, though, that on one system I had originally defined
> > the raid using /dev/hde1, hdf1, and so on. When I tried to rebuild
> > it with /dev/hde, hdf, and so on, it would not rebuild.
>
> Sure, partitions have different offsets, so the superblock could not
> be found.
>
> > > Have you inspected the smartctl output and checked for SMART
> > > errors?
> >
> > I looked at the logs. Is this a different output and where would
> > I find it?
>
> Are these ATA disks? if so, run smartctl -l error should be pretty
> empty.

    [root@archive:root]$ smartctl -l error
    bash: smartctl: command not found

Is there a problem with that, or could that be part of the issue?

Thanks!

Hal
|
|
|
#13 |
also sprach Hal Vaughan <hal@thresholddigital.com> [2007.08.21.0003 +0200]:
> Any suggestions or warnings from others so I can make sure this doesn't
> happen again are appreciated. Remember, the two drives I've already
> removed that mdadm had said were bad have tested out as fine. I suspect
> it's more an issue with this system losing power and the RAID not being
> unmounted properly, but I'd think it should be able to handle that.

If the RAID was degraded (2/3 drives) *and* the power was lost, mdadm
won't assemble it without you forcing it to (see README.recipes).

> [root@archive:root]$ smartctl -l error
>
> bash: smartctl: command not found
>
> Is there a problem with that, or could that be part of the issue?

    apt-get install smartmontools

then try again. And enable smartd to check your drives continuously.

--
martin f. krafft <madduck@debian.org>
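Once smartmontools is installed, the error log of each drive can be reduced to a single number for a quick comparison. This is a rough sketch under stated assumptions: smart_error_count is a hypothetical helper, and it assumes the ATA error log uses the wording "No Errors Logged" or "ATA Error Count: N", which may differ between smartmontools versions and drive types. Note that smartctl needs a device argument, for example /dev/hde.

```shell
# Read `smartctl -l error <device>` output on stdin and print the number of
# logged errors, or "unknown" if neither expected phrase is found.
# The matched phrases are assumptions about smartctl's ATA output wording.
smart_error_count() {
    awk '
        /No Errors Logged/  { found = 1; count = 0 }
        /^ATA Error Count:/ { found = 1; count = $4 }
        END { if (found) print count; else print "unknown" }
    '
}

# Hypothetical usage on the live system (needs root):
#   for d in /dev/hde /dev/hdf /dev/hdg /dev/hdh; do
#       printf '%s: ' "$d"
#       smartctl -l error "$d" | smart_error_count
#   done
```

A drive that reports "unknown" should be inspected by hand with the full smartctl output rather than assumed healthy.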
|