The problem
For the last couple of weeks, a backup server I manage has been failing to make backups!
The backup procedure (a script run via the cron daemon) is to rsync data from a primary server to its /backup directory. I was getting cron errors via email, informing me that the previous rsync script had not finished yet when the new one started (the script checks a lock file). This was strange, as the runs are scheduled 12 hours apart; apparently 12 hours were not enough to complete a ~200M data transfer over a 100Mb/s network port. That was really strange.
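The backup script itself is not shown here, but the behaviour described above is the classic lock-file pattern. A minimal sketch of such a cron job, assuming a hypothetical lock file path, would look roughly like this:

#!/bin/bash
# Sketch only: the real script is not part of this post and the lock file
# path below is an assumption.
LOCK=/var/run/backup-rsync.lock

# flock -n fails immediately if the previous run still holds the lock,
# so cron mails the error message instead of starting a second rsync.
exec 200>"$LOCK"
if ! flock -n 200; then
    echo "previous rsync run has not finished yet, aborting" >&2
    exit 1
fi

# same rsync call as further below, minus -P since there is no terminal under cron
/usr/bin/rsync -zravx --safe-links --delete-before --partial --protect-args \
    -e ssh 192.168.2.1:/backup/ /backup/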
This is the second time in less than a year that this server has given me trouble. A couple of months ago I had to remove a faulty disk from the software raid setup and check the system again. My notes on the matter can be found here:
https://balaskas.gr/blog/2016/10/17/linux-raid-mdadm-md0/
Identify the problem
So let us start identifying the problem. A slow rsync can mean a lot of things, especially over ssh. Replacing network cables, reviewing dmesg messages, rebooting servers or even changing the filesystem did not change anything for the better. Time to move on to the disks.
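Roughly the kind of checks this boils down to (eth0 is only a placeholder for the actual interface name):

dmesg -T | grep -iE 'error|fail|ata'   # kernel complaints about disks or controllers
ip -s link show eth0                   # RX/TX error and drop counters on the NIC
ethtool eth0 | grep -i speed           # negotiated link speed of the network port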
Manage and Monitor software RAID devices
On this server, I use raid5 with four hard disks:
# mdadm --verbose --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Feb 26 21:00:17 2014
Raid Level : raid5
Array Size : 2929893888 (2794.16 GiB 3000.21 GB)
Used Dev Size : 976631296 (931.39 GiB 1000.07 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Sun May 7 11:00:32 2017
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : ServerTwo:0 (local to host ServerTwo)
UUID : ef5da4df:3e53572e:c3fe1191:925b24cf
Events : 10496
Number Major Minor RaidDevice State
4 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
6 8 48 2 active sync /dev/sdd
5 8 0 3 active sync /dev/sda
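For a quicker, one-screen view of the same array (and of any rebuild in progress), /proc/mdstat is also handy:

cat /proc/mdstat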
View hardware parameters of hard disk drive
aka test the hard disks:
# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 2490 MB in 2.00 seconds = 1245.06 MB/sec
Timing buffered disk reads: 580 MB in 3.01 seconds = 192.93 MB/sec
# hdparm -Tt /dev/sdb
/dev/sdb:
Timing cached reads: 2520 MB in 2.00 seconds = 1259.76 MB/sec
Timing buffered disk reads: 610 MB in 3.00 seconds = 203.07 MB/sec
# hdparm -Tt /dev/sdc
/dev/sdc:
Timing cached reads: 2512 MB in 2.00 seconds = 1255.43 MB/sec
Timing buffered disk reads: 570 MB in 3.01 seconds = 189.60 MB/sec
# hdparm -Tt /dev/sdd
/dev/sdd:
Timing cached reads: 2 MB in 7.19 seconds = 285.00 kB/sec
Timing buffered disk reads: 2 MB in 5.73 seconds = 357.18 kB/sec
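The hdparm numbers already make the odd one out obvious; SMART data (not shown in this post) would be another way to confirm a dying disk, along these lines:

smartctl -H /dev/sdd    # overall health self-assessment
smartctl -A /dev/sdd    # attributes: reallocated/pending sectors, read errors, etc.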
Root Cause
It seems that one of the disks (/dev/sdd) in the raid5 setup is not performing as well as the others. The same hard disk had a problem a few months ago.
What I did the previous time was to remove the disk, low-level format it and add it back into the same setup. The system rebuilt the raid5 array and after 24 hours everything was performing fine.
However, the same hard disk still seems to have issues. Now it is time for me to remove it and find a replacement disk.
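For reference, on a modern drive that low-level format boils down to a full destructive surface rewrite; if one wanted to repeat it, it would be something like one of the following (both wipe the disk completely):

badblocks -wsv /dev/sdd                             # destructive write-mode test over the whole disk
dd if=/dev/zero of=/dev/sdd bs=1M status=progress   # or simply overwrite it with zeros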
Remove Faulty disk
I need to manually fail and then remove the faulty disk from the raid setup.
Failing the disk
Failing the disk manually means that, unlike the previous time, mdadm has not recognized the disk as failed on its own. I need to tell mdadm that this specific disk is a faulty one:
# mdadm --manage /dev/md0 --fail /dev/sdd
mdadm: set /dev/sdd faulty in /dev/md0
Removing the disk
Now it is time to remove the faulty disk from our raid setup:
# mdadm --manage /dev/md0 --remove /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md0
Show details
# mdadm --verbose --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Feb 26 21:00:17 2014
Raid Level : raid5
Array Size : 2929893888 (2794.16 GiB 3000.21 GB)
Used Dev Size : 976631296 (931.39 GiB 1000.07 GB)
Raid Devices : 4
Total Devices : 3
Persistence : Superblock is persistent
Update Time : Sun May 7 11:08:44 2017
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : ServerTwo:0 (local to host ServerTwo)
UUID : ef5da4df:3e53572e:c3fe1191:925b24cf
Events : 10499
Number Major Minor RaidDevice State
4 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
4 0 0 4 removed
5 8 0 3 active sync /dev/sda
Mounting the Backup
Now it’s time to re-mount the backup directory and re-run the rsync script:
mount /backup/
and run the rsync with verbose and progress parameters to review the status of the syncing:
/usr/bin/rsync -zravxP --safe-links --delete-before --partial --protect-args -e ssh 192.168.2.1:/backup/ /backup/
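For reference, the options in that rsync command do roughly the following:

#  -z               compress file data during the transfer
#  -r               recurse into directories (also implied by -a)
#  -a               archive mode: preserve permissions, times, owners, symlinks, devices
#  -v               verbose output
#  -x               do not cross filesystem boundaries
#  -P               shorthand for --partial --progress
#  --safe-links     ignore symlinks that point outside the copied tree
#  --delete-before  delete extraneous files on the receiver before the transfer starts
#  --partial        keep partially transferred files (already implied by -P)
#  --protect-args   protect remote filenames containing spaces or special characters
#  -e ssh           use ssh as the remote shell/transport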
Everything seems ok.
A replacement order has already been placed.
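When the new disk arrives, adding it back into the array should be just a matter of the following (assuming the replacement shows up under the same device name):

mdadm --manage /dev/md0 --add /dev/sdd
# then watch the raid5 rebuild until it finishes
watch cat /proc/mdstat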
Rsync throughput manages to hit ~10.27MB/s again!
The rsync time for a daily (12h) diff is now back to normal rates:
real 15m18.112s
user 0m34.414s
sys 0m36.850s