rss.png profile for ebal on Stack Exchange, a network of free, community-driven Q&A sites
Jun
13
2017
Failures will occur, even with ansible and version control systems!

Failures

Every SysAdmin, DevOp, SRE, Computer Engineer or even Developer knows that failures WILL occur. So you need to plan with that constant in mind. Failure causes can be present in hardware, power, operating system, networking, memory or even bugs in software. We often call them system failures but it is possible that a Human can be also the cause of such failure!

Listening to the stories on the latest episode of stack overflow podcast felt compelled to share my own oh-shit moment in recent history.

I am writing this article so others can learn from this story, as I did in the process.

Rolling upgrades

I am a really big fun of rolling upgrades.

I am used to work with server farms. In a nutshell that means a lot of servers connected to their own switch behind routers/load balancers. This architecture gives me a great opportunity when it is time to perform operations, like scheduling service updates in working hours.

eg. Update software version 1.2.3 to 1.2.4 on serverfarm001

The procedure is really easy:

  • From the load balancers, stop any new incoming traffic to one of the servers.
  • Monitor all processes on the server and wait for them to terminate.
  • When all connections hit zero, stop the service you want to upgrade.
  • Perform the service update
  • Testing
  • Monitor logs and possible alarms
  • Be ready to rollback if necessary
  • Send some low traffic and try to test service with internal users
  • When everything is OK, tell the load balancers to send more traffic
  • Wait, monitor everything, test, be sure
  • Revert changes on the load balancers so that the specific server can take equal traffic/connection as the others.

serverfarm.jpg

This procedure is well established in such environments, and gives us the benefit of working with the whole team in working hours without the need of scheduling a maintenance window in the middle of the night, when low customer traffic is reaching us. During the day, if something is not going as planned, we can also reach other departments and work with them, figuring out what is happening.

Configuration Management

We are using ansible as the main configuration management tool. Every file, playbook, role, task of ansible is under a version control system, so that we can review changes before applying them to production. Viewing diffs from a modern web tool can be a lifesaver in these days.

Virtualization

We also use docker images or virtual images as development machines, so that we can perform any new configuration, update/upgrade on those machines and test it there.

Ansible Inventory

To perform service updates with ansible on servers, we are using the ansible inventory to host some metadata (aka variables) for every machine in a serverfarm. Let me give you an example:

[serverfarm001]
server01 version=1.2.3
server02 version=1.2.3
server03 version=1.2.3
server04 version=1.2.4

And performing the update action via ansible limits

eg.

~> ansible-playbook serverfarm001.yml -t update -C -D -l server04

Rollback

When something is not going as planned, we revert the changes on ansible (version control) and re-push the previous changes on a system. Remember the system is not getting any traffic from the front-end routers.

The Update

I was ready to do the update. Nagios was opened, logs were tailed -f

and then:

~> ansible-playbook serverfarm001.yml -t update

The Mistake

I run the ansible-playbook without limiting the server I wanted to run the update !!!

So all new changes passed through all servers, at once!

On top of that, new configuration broke running software with previous version. When the restart notify of service occurred every server simple stopped!!!

Funny thing, the updated machine server04 worked perfectly, but no traffic was reaching through the load balancers to this server.

Activate Rollback

It was time to run the rollback procedure.

Reverting changes from version control is easy. Took me like a moment or something.
Running again:

~> ansible-playbook serverfarm001.yml

and …

Waiting for Nagios

In 2,5 minutes I had fixed the error and I was waiting for nagios to be green again.

Then … Nothing! Red alerts everywhere!

Oh-Shit Moment

It was time for me to inform everyone what I have done.
Explaining to my colleagues and manager the mistake and trying to figuring out what went wrong with the rollback procedure.

Collaboration

On this crucial moment everything else worked like clockwise.

My colleagues took every action to:

  • informing helpdesk
  • looking for errors
  • tailing logs
  • monitor graphs
  • viewing nagios
  • talking to other people
  • do the leg-work in general

and leaving me in piece with calm to figure out what went wrong.

I felt so proud to be part of the team at that specific moment.

If any of you reading this article: Truly thank all guys and gals .

Work-Around

I bypass ansible and copied the correct configuration to all servers via ssh.
My colleagues were telling me the good news and I was going through one by one of ~xx servers.
In 20minutes everything was back in normal.
And finally nagios was green again.

Blameless Post-Mortem

It was time for post-mortem and of course drafting the company’s incident report.

We already knew what happened and how, but nevertheless we need to write everything down and try to keep a good timeline of all steps.
This is not only for reporting but also for us. We need to figure out what happened exactly, do we need more monitoring tools?
Can we place any failsafes in our procedures? Also why the rollback procedure didnt work.

Fixing Rollback

I am writing this paragraph first, but to be honest with you, it took me some time getting to the bottom of this!

Rollback procedure actually is working as planned. I did a mistake with the version control system.

What we have done is to wrap ansible under another script so that we can select the version control revision number at runtime.
This is actually pretty neat, cause it gives us the ability to run ansible with previous versions of our configuration, without reverting in master branch.

The ansible wrapper asks for revision and by default we run it with [tip].

So the correct way to do rollbacks is:

eg.

~> ansible-playbook serverfarm001.yml -rev 238

At the time of problem, I didnt do that. I thought it was better to revert all changes and re-run ansible.
But ansible was running into default mode with tip revision !!

Although I manage pretty well on panic mode, that day my brain was frozen!

Re-Design Ansible

I wrap my head around and tried to find a better solution on performing service updates. I needed to change something that can run without the need of limit in ansible.

The answer has obvious in less than five minutes later:

files/serverfarm001/1.2.3
files/serverfarm001/1.2.4

I need to keep a separated configuration folder and run my ansible playbooks with variable instead of absolute paths.

eg.


- copy: src=files/serverfarm001/{{version}} dest=/etc/service/configuration

That will suffice next time (and actually did!). When the service upgrade is finished, We can simple remove the previous configuration folder without changing anything else in ansible.

Ansible Groups

Another (more simplistic) approach is to create a new group in ansible inventory.
Like you do with your staging Vs production environment.

eg.

[serverfarm001]
server01 version=1.2.3
server02 version=1.2.3
server03 version=1.2.3

[serverfarm001_new]
server04 version=1.2.4

and create a new yml file

---
- hosts: serverfarm001_new

run the ansible-playbook against the new serverfarm001_new group .

Validation

A lot of services nowadays have syntax check commands for their configuration.

You can use this validation process in ansible!

here is an example from ansible docs:

# Update sshd configuration safely, avoid locking yourself out
- template:
    src: etc/ssh/sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: '0600'
    validate: /usr/sbin/sshd -t -f %s
    backup: yes

or you can use registers like this:

  - name: Check named
    shell: /usr/sbin/named-checkconf -t /var/named/chroot
    register: named_checkconf
    changed_when: "named_checkconf.rc == 0"
    notify: anycast rndc reconfig

Conclusion

Everyone makes mistakes. I know, I have some oh-shit moments in my career for sure. Try to learn from these failures and educate others. Keep notes and write everything down in a wiki or in whatever documentation tool you are using internally. Always keep your calm. Do not hide any errors from your team or your manager. Be the first person that informs everyone. If the working environment doesnt make you feel safe, making mistakes, perhaps you should think changing scenery. You will make a mistake, failures will occur. It is a well known fact and you have to be ready when the time is up. Do a blameless post-mortem. The only way a team can be better is via responsibility, not blame. You need to perform disaster-recovery scenarios from time to time and test your backup. And always -ALWAYS- use a proper configuration management tool for all changes on your infrastructure.

post scriptum

After writing this draft, I had a talk with some friends regarding the cloud industry and how this experience can be applied into such environment. The quick answer is you SHOULD NOT.

Working with cloud, means you are mostly using virtualization. Docker images or even Virtual Machines should be ephemeral. When it’s time to perform upgrades (system patching or software upgrades) you should be creating new virtual machines that will replace the old ones. There is no need to do it in any other way. You can rolling replacing the virtual machines (or docker images) without the need of stopping the service in a machine, do the upgrade, testing, put it back. Those ephemeral machines should not have any data or logs in the first place. Cloud means that you can (auto) scale as needed it without thinking where the data are.

thanks for reading.

Tag(s): failures, ansible
Jun
13
2017
Failures will occur, even with ansible and version control systems!

Failures

Every SysAdmin, DevOp, SRE, Computer Engineer or even Developer knows that failures WILL occur. So you need to plan with that constant in mind. Failure causes can be present in hardware, power, operating system, networking, memory or even bugs in software. We often call them system failures but it is possible that a Human can be also the cause of such failure!

Listening to the stories on the latest episode of stack overflow podcast felt compelled to share my own oh-shit moment in recent history.

I am writing this article so others can learn from this story, as I did in the process.

Rolling upgrades

I am a really big fun of rolling upgrades.

I am used to work with server farms. In a nutshell that means a lot of servers connected to their own switch behind routers/load balancers. This architecture gives me a great opportunity when it is time to perform operations, like scheduling service updates in working hours.

eg. Update software version 1.2.3 to 1.2.4 on serverfarm001

The procedure is really easy:

  • From the load balancers, stop any new incoming traffic to one of the servers.
  • Monitor all processes on the server and wait for them to terminate.
  • When all connections hit zero, stop the service you want to upgrade.
  • Perform the service update
  • Testing
  • Monitor logs and possible alarms
  • Be ready to rollback if necessary
  • Send some low traffic and try to test service with internal users
  • When everything is OK, tell the load balancers to send more traffic
  • Wait, monitor everything, test, be sure
  • Revert changes on the load balancers so that the specific server can take equal traffic/connection as the others.

serverfarm.jpg

This procedure is well established in such environments, and gives us the benefit of working with the whole team in working hours without the need of scheduling a maintenance window in the middle of the night, when low customer traffic is reaching us. During the day, if something is not going as planned, we can also reach other departments and work with them, figuring out what is happening.

Configuration Management

We are using ansible as the main configuration management tool. Every file, playbook, role, task of ansible is under a version control system, so that we can review changes before applying them to production. Viewing diffs from a modern web tool can be a lifesaver in these days.

Virtualization

We also use docker images or virtual images as development machines, so that we can perform any new configuration, update/upgrade on those machines and test it there.

Ansible Inventory

To perform service updates with ansible on servers, we are using the ansible inventory to host some metadata (aka variables) for every machine in a serverfarm. Let me give you an example:

[serverfarm001]
server01 version=1.2.3
server02 version=1.2.3
server03 version=1.2.3
server04 version=1.2.4

And performing the update action via ansible limits

eg.

~> ansible-playbook serverfarm001.yml -t update -C -D -l server04

Rollback

When something is not going as planned, we revert the changes on ansible (version control) and re-push the previous changes on a system. Remember the system is not getting any traffic from the front-end routers.

The Update

I was ready to do the update. Nagios was opened, logs were tailed -f

and then:

~> ansible-playbook serverfarm001.yml -t update

The Mistake

I run the ansible-playbook without limiting the server I wanted to run the update !!!

So all new changes passed through all servers, at once!

On top of that, new configuration broke running software with previous version. When the restart notify of service occurred every server simple stopped!!!

Funny thing, the updated machine server04 worked perfectly, but no traffic was reaching through the load balancers to this server.

Activate Rollback

It was time to run the rollback procedure.

Reverting changes from version control is easy. Took me like a moment or something.
Running again:

~> ansible-playbook serverfarm001.yml

and …

Waiting for Nagios

In 2,5 minutes I had fixed the error and I was waiting for nagios to be green again.

Then … Nothing! Red alerts everywhere!

Oh-Shit Moment

It was time for me to inform everyone what I have done.
Explaining to my colleagues and manager the mistake and trying to figuring out what went wrong with the rollback procedure.

Collaboration

On this crucial moment everything else worked like clockwise.

My colleagues took every action to:

  • informing helpdesk
  • looking for errors
  • tailing logs
  • monitor graphs
  • viewing nagios
  • talking to other people
  • do the leg-work in general

and leaving me in piece with calm to figure out what went wrong.

I felt so proud to be part of the team at that specific moment.

If any of you reading this article: Truly thank all guys and gals .

Work-Around

I bypass ansible and copied the correct configuration to all servers via ssh.
My colleagues were telling me the good news and I was going through one by one of ~xx servers.
In 20minutes everything was back in normal.
And finally nagios was green again.

Blameless Post-Mortem

It was time for post-mortem and of course drafting the company’s incident report.

We already knew what happened and how, but nevertheless we need to write everything down and try to keep a good timeline of all steps.
This is not only for reporting but also for us. We need to figure out what happened exactly, do we need more monitoring tools?
Can we place any failsafes in our procedures? Also why the rollback procedure didnt work.

Fixing Rollback

I am writing this paragraph first, but to be honest with you, it took me some time getting to the bottom of this!

Rollback procedure actually is working as planned. I did a mistake with the version control system.

What we have done is to wrap ansible under another script so that we can select the version control revision number at runtime.
This is actually pretty neat, cause it gives us the ability to run ansible with previous versions of our configuration, without reverting in master branch.

The ansible wrapper asks for revision and by default we run it with [tip].

So the correct way to do rollbacks is:

eg.

~> ansible-playbook serverfarm001.yml -rev 238

At the time of problem, I didnt do that. I thought it was better to revert all changes and re-run ansible.
But ansible was running into default mode with tip revision !!

Although I manage pretty well on panic mode, that day my brain was frozen!

Re-Design Ansible

I wrap my head around and tried to find a better solution on performing service updates. I needed to change something that can run without the need of limit in ansible.

The answer has obvious in less than five minutes later:

files/serverfarm001/1.2.3
files/serverfarm001/1.2.4

I need to keep a separated configuration folder and run my ansible playbooks with variable instead of absolute paths.

eg.


- copy: src=files/serverfarm001/{{version}} dest=/etc/service/configuration

That will suffice next time (and actually did!). When the service upgrade is finished, We can simple remove the previous configuration folder without changing anything else in ansible.

Ansible Groups

Another (more simplistic) approach is to create a new group in ansible inventory.
Like you do with your staging Vs production environment.

eg.

[serverfarm001]
server01 version=1.2.3
server02 version=1.2.3
server03 version=1.2.3

[serverfarm001_new]
server04 version=1.2.4

and create a new yml file

---
- hosts: serverfarm001_new

run the ansible-playbook against the new serverfarm001_new group .

Validation

A lot of services nowadays have syntax check commands for their configuration.

You can use this validation process in ansible!

here is an example from ansible docs:

# Update sshd configuration safely, avoid locking yourself out
- template:
    src: etc/ssh/sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: '0600'
    validate: /usr/sbin/sshd -t -f %s
    backup: yes

or you can use registers like this:

  - name: Check named
    shell: /usr/sbin/named-checkconf -t /var/named/chroot
    register: named_checkconf
    changed_when: "named_checkconf.rc == 0"
    notify: anycast rndc reconfig

Conclusion

Everyone makes mistakes. I know, I have some oh-shit moments in my career for sure. Try to learn from these failures and educate others. Keep notes and write everything down in a wiki or in whatever documentation tool you are using internally. Always keep your calm. Do not hide any errors from your team or your manager. Be the first person that informs everyone. If the working environment doesnt make you feel safe, making mistakes, perhaps you should think changing scenery. You will make a mistake, failures will occur. It is a well known fact and you have to be ready when the time is up. Do a blameless post-mortem. The only way a team can be better is via responsibility, not blame. You need to perform disaster-recovery scenarios from time to time and test your backup. And always -ALWAYS- use a proper configuration management tool for all changes on your infrastructure.

post scriptum

After writing this draft, I had a talk with some friends regarding the cloud industry and how this experience can be applied into such environment. The quick answer is you SHOULD NOT.

Working with cloud, means you are mostly using virtualization. Docker images or even Virtual Machines should be ephemeral. When it’s time to perform upgrades (system patching or software upgrades) you should be creating new virtual machines that will replace the old ones. There is no need to do it in any other way. You can rolling replacing the virtual machines (or docker images) without the need of stopping the service in a machine, do the upgrade, testing, put it back. Those ephemeral machines should not have any data or logs in the first place. Cloud means that you can (auto) scale as needed it without thinking where the data are.

thanks for reading.

Tag(s): failures, ansible
Jun
19
2016
vagrant docker ansible

Recently, I had the opportunity to see a presentation on the subject by Alexandros Kosiaris.

I was never fan of vagrant (or even virtualbox) but I gave it a try and below are my personal notes on the matter.
All my notes are based on Archlinux as it is my primary distribution but I think you can try them with every Gnu Linux OS.

Vagrant

So what is Vagrant ?

Vagrant is a wrapper, an abstraction layer to deal with some virtual solutions, like virtualbox, Vmware, hyper-v, docker, aws etc etc etc
With a few lines you can describe what you want to do and then use vagrant to create your enviroment of virtual boxes to work with.

Just for the fun of it, I used docker

Docker

We first need to create and build a proper Docker Image!

The Dockerfile below, is suggesting that we already have an archlinux:latest docker image.
You can use your own dockerfile or docker image.

You need to have an ssh connection to this docker image and you will need -of course- to have a ssh password or a ssh authorized key built in this image for root. If you are using sudo (then even better) dont forget to add the user to sudoers!



# vim Dockerfile 

# sshd on archlinux
#
# VERSION               0.0.2

FROM     archlinux:latest
MAINTAINER  Evaggelos Balaskas < evaggelos _AT_ balaskas _DOT_ gr >

# Update the repositories
RUN  pacman -Syy && pacman -S --noconfirm openssh python2

# Generate host keys
RUN  /usr/bin/ssh-keygen -A

# Add password to root user
RUN  echo 'root:roottoor' | chpasswd

# Fix sshd
RUN  sed -i -e 's/^UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config && echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config

# Expose tcp port
EXPOSE   22

# Run openssh daemon
CMD  ["/usr/sbin/sshd", "-D"]

Again, you dont need to follow this step by the book!
It is an example to understand that you need a proper docker image that you can ssh into it.

Build the docker image:



# docker build -t archlinux:sshd . 

On my PC:



# docker images 

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
archlinux           sshd                1b074ffe98be        7 days ago          636.2 MB
archlinux           latest              c0c56d24b865        7 days ago          534 MB
archlinux           devel               e66b5b8af509        2 weeks ago         607 MB
centos6             powerdns            daf76074f848        3 months ago        893 MB
centos6             newdnsps            642462a8dfb4        3 months ago        546.6 MB
centos7             cloudstack          b5e696e65c50        6 months ago        1.463 GB
centos7             latest              d96affc2f996        6 months ago        500.2 MB
centos6             latest              4ba27f5a1189        6 months ago        489.8 MB

Environment

We can define docker as our default provider with:


# export VAGRANT_DEFAULT_PROVIDER=docker

It is not necessary to define the default provider, as you will see below,
but it is also a good idea - if your forget to declare your vagrant provider later

Before we start with vagrant, let us create a new folder:



# mkdir -pv vagrant
# cd vagrant 

Initialization

We are ready to initialized our enviroment for vagrant:


# vagrant init

A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.

Initial Vagrantfile

A typical vagrant configuration file looks something like this:



# cat Vagrantfile
 cat Vagrantfile
# -*- mode: ruby -*-
# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
  # The most common configuration options are documented and commented below.
  # For a complete reference, please see the online documentation at
  # https://docs.vagrantup.com.

  # Every Vagrant development environment requires a box. You can search for
  # boxes at https://atlas.hashicorp.com/search.
  config.vm.box = "base"

  # Disable automatic box update checking. If you disable this, then
  # boxes will only be checked for updates when the user runs
  # `vagrant box outdated`. This is not recommended.
  # config.vm.box_check_update = false

  # Create a forwarded port mapping which allows access to a specific port
  # within the machine from a port on the host machine. In the example below,
  # accessing "localhost:8080" will access port 80 on the guest machine.
  # config.vm.network "forwarded_port", guest: 80, host: 8080

  # Create a private network, which allows host-only access to the machine
  # using a specific IP.
  # config.vm.network "private_network", ip: "192.168.33.10"

  # Create a public network, which generally matched to bridged network.
  # Bridged networks make the machine appear as another physical device on
  # your network.
  # config.vm.network "public_network"

  # Share an additional folder to the guest VM. The first argument is
  # the path on the host to the actual folder. The second argument is
  # the path on the guest to mount the folder. And the optional third
  # argument is a set of non-required options.
  # config.vm.synced_folder "../data", "/vagrant_data"

  # Provider-specific configuration so you can fine-tune various
  # backing providers for Vagrant. These expose provider-specific options.
  # Example for VirtualBox:
  #
  # config.vm.provider "virtualbox" do |vb|
  #   # Display the VirtualBox GUI when booting the machine
  #   vb.gui = true
  #
  #   # Customize the amount of memory on the VM:
  #   vb.memory = "1024"
  # end
  #
  # View the documentation for the provider you are using for more
  # information on available options.

  # Define a Vagrant Push strategy for pushing to Atlas. Other push strategies
  # such as FTP and Heroku are also available. See the documentation at
  # https://docs.vagrantup.com/v2/push/atlas.html for more information.
  # config.push.define "atlas" do |push|
  #   push.app = "YOUR_ATLAS_USERNAME/YOUR_APPLICATION_NAME"
  # end

  # Enable provisioning with a shell script. Additional provisioners such as
  # Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the
  # documentation for more information about their specific syntax and use.
  # config.vm.provision "shell", inline: <<-SHELL
  #   apt-get update
  #   apt-get install -y apache2
  # SHELL
end

If you try to run this Vagrant configuration file with docker provider,
it will try to boot up base image (Vagrant Default box):



# vagrant up --provider=docker

Bringing machine 'default' up with 'docker' provider...
==> default: Box 'base' could not be found. Attempting to find and install...
    default: Box Provider: docker
    default: Box Version: >= 0
==> default: Box file was not detected as metadata. Adding it directly...
==> default: Adding box 'base' (v0) for provider: docker
    default: Downloading: base
An error occurred while downloading the remote file. The error
message, if any, is reproduced below. Please fix this error and try
again.

Couldn't open file /ebal/Desktop/vagrant/base

Vagrantfile

Put the initial vagrantfile aside and create the below Vagrant configuration file:


Vagrant.configure("2") do |config|
  config.vm.provider "docker" do |d|
    d.image = "archlinux:sshd"
  end
end

That translate to :

Vagrant Provider: docker
Docker Image: archlinux:sshd

Basic commands

Run vagrant to create our virtual box:


#  vagrant up

Bringing machine 'default' up with 'docker' provider...
==> default: Creating the container...
    default:   Name: vagrant_default_1466368592
    default:  Image: archlinux:sshd
    default: Volume: /home/ebal/Desktop/vagrant:/vagrant
    default:
    default: Container created: 4cf4649b47615469
==> default: Starting container...
==> default: Provisioners will not be run since container doesn't support SSH.

ok, we havent yet configured vagrant to use ssh

but we have a running docker instance:



# vagrant status

Current machine states:

default                   running (docker)

The container is created and running. You can stop it using
`vagrant halt`, see logs with `vagrant docker-logs`, and
kill/destroy it with `vagrant destroy`.

that we can verify with docker ps:


#  docker ps -a

CONTAINER ID        IMAGE               COMMAND               CREATED              STATUS              PORTS               NAMES
4cf4649b4761        archlinux:sshd      "/usr/sbin/sshd -D"   About a minute ago   Up About a minute   22/tcp              vagrant_default_1466368592

Destroy

We need to destroy this instance:



#  vagrant destroy

    default: Are you sure you want to destroy the 'default' VM? [y/N] y
==> default: Stopping container...
==> default: Deleting the container...

Vagrant ssh

We need to edit Vagrantfile to add ssh support to our docker :



# vim Vagrantfile

Vagrant.configure("2") do |config|

    config.vm.provider "docker" do |d|
        d.image = "archlinux:sshd"
        d.has_ssh = true
    end

end

and re-up our vagrant box:


#  vagrant up

Bringing machine 'default' up with 'docker' provider...
==> default: Creating the container...
    default:   Name: vagrant_default_1466368917
    default:  Image: archlinux:sshd
    default: Volume: /home/ebal/Desktop/vagrant:/vagrant
    default:   Port: 127.0.0.1:2222:22
    default:
    default: Container created: b4fce563a9f9042c
==> default: Starting container...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 172.17.0.2:22
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Authentication failure. Retrying...
    default: Warning: Authentication failure. Retrying...

Vagrant will try to connect to our docker instance with the user: vagrant and a key.
But our docker image only have a root user and a root password !!


# vagrant status

Current machine states:

default                   running (docker)

The container is created and running. You can stop it using
`vagrant halt`, see logs with `vagrant docker-logs`, and
kill/destroy it with `vagrant destroy`.

#  vagrant destroy

    default: Are you sure you want to destroy the 'default' VM? [y/N] y
==> default: Stopping container...
==> default: Deleting the container...

Vagrant ssh - the Correct way !

We need to edit the Vagrantfile, properly:



# vim Vagrantfile

Vagrant.configure("2") do |config|

    config.ssh.username = 'root'
    config.ssh.password = 'roottoor'

    config.vm.provider "docker" do |d|
        d.image = "archlinux:sshd"
        d.has_ssh = true
    end

end


# vagrant up

Bringing machine 'default' up with 'docker' provider...
==> default: Creating the container...
    default:   Name: vagrant_default_1466369126
    default:  Image: archlinux:sshd
    default: Volume: /home/ebal/Desktop/vagrant:/vagrant
    default:   Port: 127.0.0.1:2222:22
    default:
    default: Container created: 7fef0efc8905bb3a
==> default: Starting container...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 172.17.0.2:22
    default: SSH username: root
    default: SSH auth method: password
    default: Warning: Connection refused. Retrying...
    default:
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!

# vagrant status

Current machine states:

default                   running (docker)

The container is created and running. You can stop it using
`vagrant halt`, see logs with `vagrant docker-logs`, and
kill/destroy it with `vagrant destroy`.

# vagrant ssh-config

Host default
  HostName 172.17.0.2
  User root
  Port 22
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile /tmp/vagrant/.vagrant/machines/default/docker/private_key
  IdentitiesOnly yes
  LogLevel FATAL

# vagrant ssh

[root@7fef0efc8905 ~]# uptime
 20:45:48 up 11:33,  0 users,  load average: 0.53, 0.42, 0.28
[root@7fef0efc8905 ~]#
[root@7fef0efc8905 ~]#
[root@7fef0efc8905 ~]#
[root@7fef0efc8905 ~]# exit
logout
Connection to 172.17.0.2 closed.

Ansible

It is time to add ansible to the mix!

Ansible Playbook

We need to create a basic ansible playbook:



# cat playbook.yml 

---
- hosts: all

  vars:
      ansible_python_interpreter: "/usr/bin/env python2"

  gather_facts: no

  tasks:

    # Install package vim
    - pacman: name=vim state=present

The above playbook, is going to install vim, via pacman (archlinux PACkage MANager)!
Archlinux comes by default with python3 and with ansible_python_interpreter you are declaring to use python2!

Vagrantfile with Ansible



# cat Vagrantfile

Vagrant.configure("2") do |config|

    config.ssh.username = 'root'
    config.ssh.password = 'roottoor'

    config.vm.provider "docker" do |d|
        d.image = "archlinux:sshd"
        d.has_ssh = true
    end

    config.vm.provision "ansible" do |ansible|
        ansible.verbose = "v"
        ansible.playbook = "playbook.yml"
    end

end

Vagrant Docker Ansible



# vagrant up 

Bringing machine 'default' up with 'docker' provider...
==> default: Creating the container...
    default:   Name: vagrant_default_1466370194
    default:  Image: archlinux:sshd
    default: Volume: /home/ebal/Desktop/vagrant:/vagrant
    default:   Port: 127.0.0.1:2222:22
    default:
    default: Container created: 8909eee7007b8d4f
==> default: Starting container...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 172.17.0.2:22
    default: SSH username: root
    default: SSH auth method: password
    default: Warning: Connection refused. Retrying...
    default:
    default: Inserting generated public key within guest...
    default: Removing insecure key from the guest if it's present...
    default: Key inserted! Disconnecting and reconnecting using new SSH key...
==> default: Machine booted and ready!

==> default: Running provisioner: ansible...
    default: Running ansible-playbook...
PYTHONUNBUFFERED=1 ANSIBLE_FORCE_COLOR=true ANSIBLE_HOST_KEY_CHECKING=false ANSIBLE_SSH_ARGS='-o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ControlMaster=auto -o ControlPersist=60s' ansible-playbook --connection=ssh --timeout=30 --limit="default" --inventory-file=/mnt/VB0250EAVER/home/ebal/Desktop/vagrant/.vagrant/provisioners/ansible/inventory -v playbook.yml
Using /etc/ansible/ansible.cfg as config file

PLAY [all] *********************************************************************

TASK [pacman] ******************************************************************
changed: [default] => {"changed": true, "msg": "installed 1 package(s). "}

PLAY RECAP *********************************************************************
default                    : ok=1    changed=1    unreachable=0    failed=0   


# vagrant status

Current machine states:

default                   running (docker)

The container is created and running. You can stop it using
`vagrant halt`, see logs with `vagrant docker-logs`, and
kill/destroy it with `vagrant destroy`.



#  vagrant ssh 

[root@8909eee7007b ~]# vim --version
VIM - Vi IMproved 7.4 (2013 Aug 10, compiled Jun  9 2016 09:35:16)
Included patches: 1-1910
Compiled by Arch Linux

Vagrant Provisioning

The ansible-step is called: provisioning as you may already noticed.

If you make a few changes on this playbook, just type:


#  vagrant provision

and it will re-run the ansible part on this vagrant box !

Apr
08
2015
ansible Jinja2 template example with for loop

Disclaimer: This blog post has one purpose only: be a proof of concept - not the “perfect” ansible playbook.

When managing a server farm, you will -soon enough- start using Jinja templates. Cause -let’s face it- static files are very easy to copy through servers but with templates, you are making magic!

This ansible example will create a bind-format slave zones configuration file.

You need to have in mind, the typical structure of that configuration file:



zone "balaskas.gr" {
    type slave;
    file "sec/balaskas.gr";
    masters {
        158.255.214.14;
    };
};

Let’s start with the actual data. I like to keep my configuration separated from my playbooks. With this approach is easy to re-use your variables in other playbooks.

So my variable list is looking like this:

zones.yml


---
zones:
  - { zone: 'balaskas.gr', master: '158.255.214.14', extras: '' }
  - { zone: 'example.com', master: '1.2.3.4', extras: '' }

My slavezone yml ansible playbook is very similar to this:

SecondaryDNS.yml


SecondaryDNS.yml

---

- hosts: myslavens
  gather_facts: no
  user: root

  vars_files:
    - [ "files/SecondaryDNS/zones.yml" ]

  tasks:
  - name: Create named.slavezone
    template:
      src="files/SecondaryDNS/slavezones.j2"
      dest="/etc/named.slavezones"
      owner=named
      group=named
      mode=0440

...

(This is not the entire playbook, but I am guessing you get the point)

To recap, we want to create a [new (if not exist)] file, with a very specific output for every line in our configuration.
So here is my Jinja2 template file:

slavezones.j2



{% for item in zones %}
zone "{{item.zone}}" { type slave; file "sec/{{item.zone}}"; masters { {{item.master}}; }; {{item.extra}} };
{% endfor %}

This template will loop for every line (item) of our zones.yml and create the desirable output.

And that’s how you can create ansible magic !

Tag(s): ansible
Apr
07
2015
ansible register

So here is a nice ansible trick to trigger a notify if only the exit status of a command is zero (without any errors)



  - name: Check named
    shell: /sbin/named-checkconf
    register: named_checkconf
    changed_when: "named_checkconf.rc == 0"
    notify: rndc reconfig

the named_checkconf contains the below values:


{
"changed": true,
"cmd": ["/sbin/rndc", "reconfig"],
"delta": "0:00:02.438532",
"end": "2015-04-07 15:02:21.349859",
"item": "",
"rc": 0,
"start": "2015-04-07 15:02:18.911327",
"stderr": "",
"stdout": ""
}
Tag(s): ansible, bind