What is the time?
Time offset is the amount of time a clock is off (or has drifted) from a reference value. Linux systems count date and time from a fixed starting point: 00:00:00 on 1 January 1970, also known as Unix Time or the epoch. A system defines the current date (time) as the number of seconds that have elapsed since 01.01.1970.
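For example, date can print the current epoch time, and it can also translate an epoch value back to a human-readable date (the first output below is just a sample; yours will differ):

$ date +%s
1494504530

$ date -u -d @0
Thu Jan  1 00:00:00 UTC 1970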
Accurate time is so important that even a few seconds of drift can cause tremendous problems in data centers and applications.
Network Time
To avoid problems with time, systems must synchronize their clocks over the Internet every now and then. This is done by querying a central time server via the Network Time Protocol (NTP). The most common scenario for an infrastructure is to have one or two NTP servers, so that every system inside that infrastructure can synchronize its time from those machines.
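For example, you can manually ask an NTP server for your current offset without touching the clock; ntpdate with the query flag is one common way to do this (pool.ntp.org here is just the public NTP pool, substitute your own server):

# ntpdate -q pool.ntp.org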
Nagios - NRPE
In my case, I have a centralized NTP Daemon that runs on the Nagios Linux machine. That gives me the opportunity to check the EPOCH time of any system in my infrastructure against the time that the Nagios Server has.
Nagios Check
This is the script I am using:
#!/bin/bash
# ebal, Thu, 11 May 2017 12:08:50 +0000

# EPOCH time passed by the caller (eg. the Nagios $TIMET$ macro)
TIME=$1

# thresholds in seconds
WARN=5
CRIT=10

# absolute difference between the local epoch time and the given one
# (the sed strips a leading minus sign)
OFFSET=$( echo $(( $(date -d 'now' +%s) - ${TIME} )) | sed -e 's#-##g' )

if [ "${OFFSET}" -lt "${WARN}" ]; then
    echo "OK"
    exit 0
elif [ "${OFFSET}" -ge "${CRIT}" ]; then
    echo "CRITICAL- ${OFFSET}"
    exit 2
elif [ "${OFFSET}" -lt "${CRIT}" ]; then
    echo "WARNING- ${OFFSET}"
    exit 1
else
    echo "UNKNOWN- ${OFFSET}"
    exit 3
fi
In a nutshell, the script takes an epoch time as its first argument, calculates the absolute difference between its own epoch time and that value, and exits with the standard Nagios return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
Example
./check_time_offset $(date -d 'now + 1 min' +%s)
The output is this:
CRITICAL- 60
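whereas passing the current time stays inside the 5 second warning threshold:

./check_time_offset $(date +%s)
OK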
NRPE Configuration
This is the NRPE configuration to run check_time_offset:
# tail -1 /etc/nrpe.d/time_offset.cfg
command[check_time_offset]=/usr/lib64/nagios/plugins/check_time_offset $ARG1$
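Keep in mind that the nrpe daemon has to re-read its configuration after this change. On a systemd-based machine that is something like the following (the exact service name differs per distribution, e.g. nrpe or nagios-nrpe-server):

# systemctl restart nrpe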
Nagios Configuration
and this is my Nagios service definition that uses the remote NRPE check:
define service{
        use                     service-critical
        hostgroup_name          lnxserver01
        service_description     time_offset
        check_command           check_nrpe!check_time_offset!$TIMET$
}
Take a minute to look a little closer at the nrpe command:
check_nrpe!check_time_offset!$TIMET$
TIMET
I was having problems passing the Nagios epoch time as an argument in the definition of the above service.
Testing the nrpe command as below, I was getting the results I was looking for:
./check_nrpe -H lnxserver01 -c check_time_offset -a $(date -d 'now + 6 sec' +%s)
But is there a way to pass the output of a command as a Nagios argument?
- No
A dear colleague of mine mentioned nagios macros:
$TIMET$ Current time stamp in time_t format (seconds since the UNIX epoch)
Perfect!!!
The problem
For the last couple of weeks, a backup server I manage has been failing to complete its backups!
The backup procedure (a script run via the cron daemon) is to rsync data from a primary server to the backup server's /backup directory. I was getting cron errors via email informing me that the previous rsync run hadn't finished yet when the new one started (the script checks a lock file). This was strange, as the interval between runs is 12 hours, and 12 hours should be more than enough to perform a ~200M data transfer over a 100Mb/s network port; yet somehow they weren't enough. That was really strange.
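The wrapper follows the classic lock file pattern; here is a minimal sketch of the idea (the path and message are illustrative, not my actual script; the rsync line is the one shown later in this post):

#!/bin/bash
# hypothetical lock file location - illustration only
LOCK="/backup/rsync.lock"

# if the previous run still holds the lock, complain and exit;
# cron mails this output, which is exactly the error I was receiving
if [ -e "${LOCK}" ]; then
    echo "previous rsync has not finished yet" >&2
    exit 1
fi

touch "${LOCK}"
/usr/bin/rsync -zravxP --safe-links --delete-before --partial --protect-args -e ssh 192.168.2.1:/backup/ /backup/
rm -f "${LOCK}"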
This is the second time in less than a year that this server has given me problems. A couple of months ago I had to remove a faulty disk from the software raid setup and check the system again. My notes on the matter can be found here:
https://balaskas.gr/blog/2016/10/17/linux-raid-mdadm-md0/
Identify the problem
So let us start identifying the problem. A slow rsync can mean a lot of things, especially over ssh. Replacing network cables, reviewing dmesg messages, rebooting servers, and even changing the filesystem didn't change anything for the better. Time to move on to the disks.
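For the record, grepping the kernel ring buffer for disk trouble is the kind of quick triage I mean, e.g.:

# dmesg | grep -iE 'ata|error|fail'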
Manage and Monitor software RAID devices
On this server, I use raid5 with four hard disks:
# mdadm --verbose --detail /dev/md0

/dev/md0:
        Version : 1.2
  Creation Time : Wed Feb 26 21:00:17 2014
     Raid Level : raid5
     Array Size : 2929893888 (2794.16 GiB 3000.21 GB)
  Used Dev Size : 976631296 (931.39 GiB 1000.07 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sun May 7 11:00:32 2017
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ServerTwo:0 (local to host ServerTwo)
           UUID : ef5da4df:3e53572e:c3fe1191:925b24cf
         Events : 10496

    Number   Major   Minor   RaidDevice State
       4       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       6       8       48        2      active sync   /dev/sdd
       5       8        0        3      active sync   /dev/sda
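A quicker way to glance at the state of all md arrays on a machine is:

# cat /proc/mdstat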
View hardware parameters of hard disk drive
aka test the hard disks:
# hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   2490 MB in 2.00 seconds = 1245.06 MB/sec
 Timing buffered disk reads: 580 MB in 3.01 seconds = 192.93 MB/sec

# hdparm -Tt /dev/sdb

/dev/sdb:
 Timing cached reads:   2520 MB in 2.00 seconds = 1259.76 MB/sec
 Timing buffered disk reads: 610 MB in 3.00 seconds = 203.07 MB/sec

# hdparm -Tt /dev/sdc

/dev/sdc:
 Timing cached reads:   2512 MB in 2.00 seconds = 1255.43 MB/sec
 Timing buffered disk reads: 570 MB in 3.01 seconds = 189.60 MB/sec

# hdparm -Tt /dev/sdd

/dev/sdd:
 Timing cached reads:   2 MB in 7.19 seconds = 285.00 kB/sec
 Timing buffered disk reads: 2 MB in 5.73 seconds = 357.18 kB/sec
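The numbers for /dev/sdd stand out immediately. If smartmontools is installed, the disk's own SMART health report can back this up:

# smartctl -H /dev/sdd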
Root Cause
It seems that one of the disks in the raid5 setup (/dev/sdd) is not performing as well as the others: ~357 kB/sec on buffered reads instead of the ~190-200 MB/sec of its siblings. The same hard disk had a problem a few months ago.
What I did the previous time was to remove the disk, low-level format it, and add it back to the same setup. The system rebuilt the raid5 array and after 24 hours everything was performing fine.
However, the same hard disk still seems to have issues. Now it is time for me to remove it and find a replacement disk.
Remove Faulty disk
I need to manually fail and then remove the faulty disk from the raid setup.
Failing the disk
I have to fail the disk manually because, this time, mdadm has not recognized the disk as failed on its own (as it did previously). I need to tell mdadm that this specific disk is a faulty one:
# mdadm --manage /dev/md0 --fail /dev/sdd
mdadm: set /dev/sdd faulty in /dev/md0
Removing the disk
Now it is time to remove the faulty disk from the raid setup:
# mdadm --manage /dev/md0 --remove /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md0
Show details
# mdadm --verbose --detail /dev/md0

/dev/md0:
        Version : 1.2
  Creation Time : Wed Feb 26 21:00:17 2014
     Raid Level : raid5
     Array Size : 2929893888 (2794.16 GiB 3000.21 GB)
  Used Dev Size : 976631296 (931.39 GiB 1000.07 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sun May 7 11:08:44 2017
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ServerTwo:0 (local to host ServerTwo)
           UUID : ef5da4df:3e53572e:c3fe1191:925b24cf
         Events : 10499

    Number   Major   Minor   RaidDevice State
       4       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       5       8        0        3      active sync   /dev/sda
Mounting the Backup
Now it’s time to re-mount the backup directory and re-run the rsync script:
mount /backup/
and run rsync with the verbose and progress parameters to review the status of the sync:
/usr/bin/rsync -zravxP --safe-links --delete-before --partial --protect-args -e ssh 192.168.2.1:/backup/ /backup/
Everything seems ok.
A replacement order has already been placed.
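Once it arrives, the new disk goes back into the array with mdadm and the raid5 will rebuild itself (the device name below depends on how the new disk enumerates):

# mdadm --manage /dev/md0 --add /dev/sdd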
Rsync transfer rates manage to hit ~10.27MB/s again!
The rsync time for a daily (12h) diff is now back to normal rates:
real    15m18.112s
user    0m34.414s
sys     0m36.850s