rss.png profile for ebal on Stack Exchange, a network of free, community-driven Q&A sites
May
11
2017
time offset and nagios nrpe check

What is the time?

Time offset is the amount of time that is off (or drift) from a specific value. In Linux systems, date is been calculating from the beginning of time. That is 00:00:00 1 January 1970 or as it called Unix Time and systems define date (time) as the number of seconds that have elapsed from 01.01.1970.

It is so important that even a few seconds off can cause tremendous disaster in data centers and applications.

Network Time

To avoid problems with time, systems must and should synchronize their time over the Internet every now and then. This is being done by asking a central NTP server via Network Time Protocol. The most common scenario for infrastructures is to have one or two NTP servers and then all the systems inside this infrastructure can synchronize their time from those machines.

Nagios - NRPE

In my case, I have a centralized NTP Daemon that runs on the Nagios Linux machine. That gives me the opportunity to check the EPOCH time of any system in my infrastructure against the time that the Nagios Server has.

Nagios Check

This is the script I am using:

# ebal, Thu, 11 May 2017 12:08:50 +0000

# EPOCH
TIME=$1
WARN=5
CRIT=10

# seconds
OFFSET=$( echo $(( $(date -d 'now ' +%s) - ${TIME} )) | sed -e 's#-##g' )

if [ "${OFFSET}" -lt "${WARN}" ]; then
        echo "OK"
        exit 0
elif [ "${OFFSET}" -ge "${CRIT}" ]; then
        echo "CRITICAL- ${OFFSET}"
        exit 2
elif [ "${OFFSET}" -lt "${CRIT}" ]; then
        echo "WARNING- ${OFFSET}"
        exit 1
else
        echo "UNKNOWN- ${OFFSET}"
        exit 3
fi

In a nutshell the script gets as the first argument an epoch time and calculate the diff between it’s own epoch time and that.

Example

./check_time_offset $(date -d 'now + 1 min' +%s)

The output is this:

CRITICAL- 60

Nrpe Configuration

This is the configuration for nrpe to run the check_time_offset

# tail -1 /etc/nrpe.d/time_offset.cfg

command[check_time_offset]=/usr/lib64/nagios/plugins/check_time_offset $ARG1$

Nagios Configuration

and this is my nagios configuration setup to use a remote nrpe :

define service{
        use                             service-critical
        hostgroup_name                  lnxserver01
        service_description             time_offset
        check_command                   check_nrpe!check_time_offset!$TIMET$
        }

Take a minute to observer a little better the nrpe command.

check_nrpe!check_time_offset!$TIMET$

TIMET

I was having problems passing the nagios epoch time as an argument on the definition of the above service.

Testing the nrpe command as below, I was getting the results I was looking for:

./check_nrpe -H lnxserver01 -c check_time_offset -a $(date -d 'now + 6 sec' +%s)

But is there a way to pass as a nagios argument the output of a command ?

  • No

A dear colleague of mine mentioned nagios macros:

Standard Macros in Nagios

$TIMET$     Current time stamp in time_t format (seconds since the UNIX epoch)

Perfect !!!

Tag(s): nagios, nrpe, ntp, time
May
07
2017
How a slow disk affects your system

The problem

The last couple weeks, a backup server I am managing is failing to make backups!

The backup procedure (a script via cron daemon) is to rsync data from a primary server to it’s /backup directory. I was getting cron errors via email, informing me that the previous rsync script hasnt already finished when the new one was starting (by checking a lock file). This was strange as the time duration is 12hours. 12 hours werent enough to perform a ~200M data transfer over a 100Mb/s network port. That was really strange.

This is the second time in less than a year that this server is making problems. A couple months ago I had to remove a faulty disk from the software raid setup and check the system again. My notes on the matter, can be found here:

https://balaskas.gr/blog/2016/10/17/linux-raid-mdadm-md0/

Identify the problem

So let us start to identify the problem. A slow rsync can mean a lot of things, especially over ssh. Replacing network cables, viewing dmesg messages, rebooting servers or even changing the filesystem werent changing any things for the better. Time to move on the disks.

Manage and Monitor software RAID devices

On this server, I use raid5 with four hard disks:

# mdadm --verbose --detail /dev/md0


/dev/md0:
        Version : 1.2
  Creation Time : Wed Feb 26 21:00:17 2014
     Raid Level : raid5
     Array Size : 2929893888 (2794.16 GiB 3000.21 GB)
  Used Dev Size : 976631296 (931.39 GiB 1000.07 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sun May  7 11:00:32 2017
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ServerTwo:0  (local to host ServerTwo)
           UUID : ef5da4df:3e53572e:c3fe1191:925b24cf
         Events : 10496

    Number   Major   Minor   RaidDevice State
       4       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       6       8       48        2      active sync   /dev/sdd
       5       8        0        3      active sync   /dev/sda

View hardware parameters of hard disk drive

aka test the hard disks:

# hdparm -Tt /dev/sda


/dev/sda:
 Timing cached reads:   2490 MB in  2.00 seconds = 1245.06 MB/sec
 Timing buffered disk reads: 580 MB in  3.01 seconds = 192.93 MB/sec

# hdparm -Tt /dev/sdb


/dev/sdb:
 Timing cached reads:   2520 MB in  2.00 seconds = 1259.76 MB/sec
 Timing buffered disk reads: 610 MB in  3.00 seconds = 203.07 MB/sec

# hdparm -Tt /dev/sdc

/dev/sdc:
 Timing cached reads:   2512 MB in  2.00 seconds = 1255.43 MB/sec
 Timing buffered disk reads: 570 MB in  3.01 seconds = 189.60 MB/sec

# hdparm -Tt /dev/sdd

/dev/sdd:
 Timing cached reads:     2 MB in  7.19 seconds = 285.00 kB/sec
 Timing buffered disk reads:   2 MB in  5.73 seconds = 357.18 kB/sec

Root Cause

Seems that one of the disks (/dev/sdd) in raid5 setup, is not performing as well as the others. The same hard disk had a problem a few months ago.

What I did the previous time, was to remove the disk, reformatting it in Low Level Format and add it again in the same setup. The system rebuild the raid5 and after 24hours everything was performing fine.

However the same hard disk seems that still has some issues . Now it is time for me to remove it and find a replacement disk.

Remove Faulty disk

I need to manually fail and then remove the faulty disk from the raid setup.

Failing the disk

Failing the disk manually, means that mdadm is not recognizing the disk as failed (as it did previously). I need to tell mdadm that this specific disk is a faulty one:

# mdadm --manage /dev/md0 --fail /dev/sdd
mdadm: set /dev/sdd faulty in /dev/md0

Removing the disk

now it is time to remove the faulty disk from our raid setup:

# mdadm --manage /dev/md0 --remove  /dev/sdd
mdadm: hot removed /dev/sdd from /dev/md0

Show details

# mdadm --verbose --detail /dev/md0

/dev/md0:
        Version : 1.2
  Creation Time : Wed Feb 26 21:00:17 2014
     Raid Level : raid5
     Array Size : 2929893888 (2794.16 GiB 3000.21 GB)
  Used Dev Size : 976631296 (931.39 GiB 1000.07 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sun May  7 11:08:44 2017
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : ServerTwo:0  (local to host ServerTwo)
           UUID : ef5da4df:3e53572e:c3fe1191:925b24cf
         Events : 10499

    Number   Major   Minor   RaidDevice State
       4       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       5       8        0        3      active sync   /dev/sda

Mounting the Backup

Now it’s time to re-mount the backup directory and re-run the rsync script

mount /backup/

and run the rsync with verbose and progress parameters to review the status of syncing

/usr/bin/rsync -zravxP --safe-links --delete-before --partial --protect-args -e ssh 192.168.2.1:/backup/ /backup/

Everything seems ok.

A replacement order has already been placed.

Rsync times manage to hit ~ 10.27MB/s again!

rsync time for a daily (12h) diff is now again in normal rates:

real    15m18.112s
user    0m34.414s
sys     0m36.850s
Tag(s): linux, mdadm, md0