This post is a copy of an old LinkedIn article (published on March 27, 2017), which you can also find here
The Cloud Illustration - Some rights reserved - flickr 2013
In IT operations we deal with failures on a daily basis. Keeping the IT motto "All systems will fail" (nowadays a fact) in mind, that is not always a major issue for an operations team, especially when working with highly available services.
Leaving a server down or in a problematic state is not an option, even on the highest-performing infrastructure. You have to fix the problem and put the server back into rotation/production. After all, it was there in the first place for a reason!
Within the devops methodology, the basic scenario is to remove the faulty server and replace it with a new one. That said, in real life this is not always the case. Most of the time you have to review the failure and identify the reason behind it. You need to create new procedures or review old ones, and follow them to bring your service back to stability. The feedback from this failure is one of the most important things in your value stream.
In our case, the faulty server was up and running, but one of its services was not working properly. This is what we call a partial error. There was no need to remove the entire server from production; we just needed to disable the specific broken service.
After talking with the vendor and reviewing the incident with our colleagues, we concluded that we needed to reinitialize the service using data from our most recent backup. The vendor's suggestion, based on best practices, was to fully stop all services and remove the server from production until the restoration was finished.
That meant a maintenance window (MW) had to be scheduled outside working hours, with engineers available to work on the case. To perform the approved procedure within the MW, the engineers would need enough experience with every step and a rough estimate of the restoration time.
When performing tasks like this, engineers should understand the entire restoration procedure and be able to work through any possible errors. They should know exactly what to do and how to respond; what to check, monitor and validate at the end. They need to make a bulletproof plan and document every step along the way. After all, our devops team must provide us with a full Incident Report. It is also always a good idea to do a dry-run/practice-run and work through all the possible exceptions beforehand.
It was time for our devops team to add virtualization to the mix.
So we decided to give virtualization a try and started with Docker containers. We already knew how to clone a live, running Linux server into a Docker image without downtime. The image was rather big, almost 100G of system data. Size on its own is not a problem, but for a Docker image that is a little too much.
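For the curious, here is a minimal sketch of what such a live clone can look like: a small Python wrapper that streams a tar of the remote root filesystem straight into `docker import`. The hostname, image tag and exclude list are illustrative placeholders, not the actual values from this incident.

```python
#!/usr/bin/env python3
"""Sketch: clone a running Linux server's root filesystem into a Docker image.

Assumes passwordless SSH access to the source host and a local Docker daemon.
Hostname, image tag and exclude list are illustrative, not from the article.
"""
import subprocess

SOURCE_HOST = "faulty-server.example.com"   # hypothetical hostname
IMAGE_TAG = "faulty-server-clone:latest"    # hypothetical image tag

# Runtime/virtual filesystems that should not end up inside the image.
EXCLUDES = ["/proc", "/sys", "/dev", "/run", "/tmp", "/var/tmp"]


def clone_to_docker_image() -> None:
    exclude_flags = " ".join(f"--exclude={path}" for path in EXCLUDES)
    # Stream a tar of the remote root filesystem directly into `docker import`,
    # so nothing is written to disk and the source server stays online.
    remote_tar = f"ssh {SOURCE_HOST} 'tar -cpf - {exclude_flags} --one-file-system /'"
    local_import = f"docker import - {IMAGE_TAG}"
    subprocess.run(f"{remote_tar} | {local_import}", shell=True, check=True)


if __name__ == "__main__":
    clone_to_docker_image()
```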
The next step was to import the latest export of our backup data and start the restoration procedure. After a while it was obvious that the container wasn't performing very well. A couple of hours later the running container failed and gave us an exit 1 error.
Even though it failed, this first attempt in a non-production virtual environment gave our engineers the opportunity to review the restoration procedure. The team was confident enough that it could identify, and even verify, the failure on the "real" server: there were indications of corruption in some of the server's database files.
Strongly believing we were on the right path, the team tried a second iteration of attacking the problem with virtualization. Following a P2V (physical to virtual) procedure, a couple of hours later we had an identical virtual machine of the "real" server. Recent data was imported and the re-initialization procedure was performed once again, this time on the virtual machine.
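The article does not describe the exact P2V tooling, but a rough sketch of one possible approach, streaming the physical disk over SSH with dd and converting it with qemu-img, could look like the following. Host, device and file names are placeholders, and a dedicated tool such as virt-p2v would handle bootloader and driver details more safely.

```python
#!/usr/bin/env python3
"""Sketch: a crude physical-to-virtual (P2V) conversion.

Streams a raw copy of the source disk over SSH with dd, then converts it to
qcow2 with qemu-img so a hypervisor can boot it. Hostname, device and file
paths are hypothetical placeholders.
"""
import subprocess

SOURCE_HOST = "faulty-server.example.com"   # hypothetical hostname
SOURCE_DISK = "/dev/sda"                    # hypothetical source device
RAW_IMAGE = "/var/tmp/faulty-server.raw"
QCOW2_IMAGE = "/var/tmp/faulty-server.qcow2"


def physical_to_virtual() -> None:
    # 1. Stream a raw copy of the physical disk over SSH (needs read access
    #    to the block device on the source host).
    subprocess.run(
        f"ssh {SOURCE_HOST} 'dd if={SOURCE_DISK} bs=4M' > {RAW_IMAGE}",
        shell=True, check=True,
    )
    # 2. Convert the raw image to qcow2 for the hypervisor.
    subprocess.run(
        ["qemu-img", "convert", "-O", "qcow2", RAW_IMAGE, QCOW2_IMAGE],
        check=True,
    )


if __name__ == "__main__":
    physical_to_virtual()
```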
Another couple of hours passed and the restoration procedure was finished. Our devops team ran all the validations and checks, and everything seemed perfectly OK. The virtual machine passed all of our acceptance tests.
Working the problem, and with our earlier suspicion of partial system corruption in mind, we noticed something very interesting: the corruption was actually limited to a few datafiles, about 15G of data! That gave us an idea. We had already done the entire restoration procedure on a virtual machine, so why not just sync those fixed datafiles to the production server?
And that is exactly what we did! A few moments later, the entire server was "almost" in full production mode, with no errors whatsoever. Only one task was left: to catch up with our latest production data. With every other service running perfectly fine, we decided to do the sync in real time. Half an hour later the synchronization was complete, without any further errors.
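Purely as an illustration, a sync of repaired datafiles like this could be scripted along the following lines with rsync over SSH. The paths and the production hostname are hypothetical, not the real ones from this incident; the checksum option is a cautious choice so that transfers do not rely on file timestamps alone.

```python
#!/usr/bin/env python3
"""Sketch: push the repaired datafiles from the virtual machine to production.

Uses rsync over SSH so only differing files are transferred. Paths and the
production hostname are illustrative placeholders.
"""
import subprocess

VM_DATA_DIR = "/srv/app/datafiles/"                           # hypothetical repaired data on the VM
PROD_TARGET = "prod-server.example.com:/srv/app/datafiles/"   # hypothetical production target


def sync_fixed_datafiles() -> None:
    # --archive keeps permissions and ownership intact,
    # --checksum compares file contents instead of trusting timestamps.
    subprocess.run(
        ["rsync", "--archive", "--checksum",
         "--human-readable", "--progress",
         VM_DATA_DIR, PROD_TARGET],
        check=True,
    )


if __name__ == "__main__":
    sync_fixed_datafiles()
```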
Just to be sure, we redid every check and validation we could think of: worked through the logs, reviewed our monitoring, and in the end enabled the "faulty-now-fixed" service on the production server.
In the end, everything played out just fine. Our devops team gained a lot of knowledge (feedback) and there was no need to schedule an MW in the middle of the night. We didn't even need to take the entire server out of rotation, which gave us a great advantage on our Work In Process while we fixed the problem on a virtual machine.