The National Broadband Network (NBN Co) is an Australian government owned enterprise formed in 2009 to build a broadband network which will provide a fiber-optic connection to 93% of Australian homes and businesses, with fixed wireless and satellite services to the remainder. After its creation, the NBN quickly needed to establish a public website in order to disseminate information, and to create a set of business services to allow partners to begin the process of requesting access to the new network. The internal development teams used a combination of bespoke application development, service development, and configuration of commercial off-the-shelf software to provide these services.
Because the NBN was in such a rapid startup mode, the infrastructure necessary to support the development of these websites and services was being created as the development proceeded. In development and test environments, projects could request new nodes and then manage them as they saw fit. In production-like environments, there were a limited number of available nodes that needed to be shared between multiple projects. However, each project focused only on their own infrastructure requirements and issues started to appear. For example:
- Two teams sharing infrastructure would make incompatible changes and only one team’s application would work.
- Nodes were being set up by hand, and changes that were made in development environments to resolve defects were not being applied to later environments, leading to regressions.
- No one had a good idea of what was actually on each machine or who owned it, leading to anxiety about deployments, configuration changes and the ability to audit IT systems effectively.
- Because each node was being created by hand by different individuals, the loss of a node required a considerable amount of work to recover.
A group of developers got together to work on resolving these problems. They set out to accomplish a few goals:
- Stand up production and production-like test systems quickly from scratch.
- Capture all the configuration information in version control so that machines could be provisioned quickly without manual intervention.
- Have a single source of truth to determine what a node should look like so changes and their impacts could be analyzed prior to making them.
- Integrate infrastructure as early as possible. If four teams were going to share a node in production, make them share nodes in test environments as well.
- Make infrastructure changes using the same process as application changes—in particular, applying them in test environments so that their impact can be understood before they are applied to production.
The first implementation that was built started to address some of these goals, but when we came to use it we discovered several problems that needed to be fixed. Some of these issues are discussed in Figure 1, which presents this initial implementation, below.
Figure 1: Initial implementation of infrastructure configuration management system
The initial setup was as follows:
- Puppet was installed on all new servers being provisioned and setup to run in the out of the box configuration—it would poll for configuration changes at 30 minute intervals.
- Puppet usage across all development teams was inconsistent. Some teams had their server setup fully automated, some were partially automated, and some were hand-created masterpieces.
- The Puppetmaster daemons were set up manually.
- Each environment had its own repository, and migration of infrastructure code between environments was done manually.
This implementation lead to the following issues:
- The process to make infrastructure changes ended up being “Check in changes, log on to target server, manually trigger Puppet”
- Since changes were migrated manually between environments, not all changes would get migrated. This led to inconsistencies between environments and configuration drift.
- Environment-specific code bases lead to massive duplication in manifests and modules.
- Lack of automated processes for migrating code and inconsistent implementation of automation across all servers lead to frequent breakages and reduced trust in the infrastructure deployment.
Fixing the issues: Implementing a Deployment Pipeline for Infrastructure
As a result of these problems, a dedicated DevOps team was formed. The purpose of this team was to help development teams automate their server setup and configuration, and to create a better and more reliable process for migrating infrastructure code up to production environments.
By this stage, most teams had implemented a deployment pipeline to move their code from development to production, using a number of different technologies (Capistrano, Maven etc). The DevOps team decided to implement the same pattern for the infrastructure code, and chose to standardise on MCollective to roll out changes to nodes, Puppet for performing the migrations, and Go to manage the deployment pipeline. The aim was to be able to check a change into version control once, and then be able to promote it through test environments and finally into production at the click of a button.
The implementation of this pipeline consisted of the following steps as outlined in diagram 2 below:
Figure 2: Infrastructure deployment pipeline setup
- “Unit testing” of infrastructure code
- RSpec tests against custom Puppet/MCollective code.
- Local compilation of all node manifests using the Puppet API.
- Execution of a system wide Puppet ‘dry-run’ to catch errors and create a report detailing expected changes when applying that revision of the infrastructure code.
- Promoting successful changes up to the next environments
Update environment using the specified configuration
- Instead of packaging up and promoting the entire infrastructure codebase, we simply promoted version control revision numbers through the deployment pipeline.
- When a new revision was migrated to the next environment, we used MCollective to tell the Puppetmaster to update its repository to the specified revision, which was taken from the previous successful run of the upstream environment.
Post-update smoke tests
- Where before we had Puppet running in daemon mode on a 30 minute loop, we now wanted to have Puppet trigger only when the upstream pipelines had succeeded, or when manually requested by a user.
- We implemented an MCollective agent to allow a command to be sent to all the machines in an environment to cause them to execute a Puppet update.
Publish packaged configuration information and update CMDB
- To ensure that the update of the infrastructure code was working as expected, we wrote a limited number of smoke tests to quickly ascertain whether all the expected applications on the nodes were running.
- Provided that the update and smoke tests succeeded on all nodes in the environment, the revision in version control would be made available to update the next environment.
- In addition, each node would also write out a report to a CMDB detailing its current Puppet manifest.
This implementation has several important benefits:
- RSpec tests and local compilation could both be run on a developer’s machine prior to committing code. This enabled much faster feedback for developers so they could quickly catch issues such as syntax errors, cyclical dependencies and missing files.
- Code only had to be checked in once, and then could be migrated all the way to production with a click of a button. If we required a major production fix, we could push it through all stages of the pipeline from development to production in about 20 minutes.
- The output from the dry-run in the first phase of the pipeline provided a detailed report as to what would be changed on each node. This report was used by both developers and the Software Change Control Board to understand the impact of the changes that were going to go into production.
- The use of MCollective to trigger Puppet updates meant that we had complete control over when a particular revision was pushed into a new environment.
- Repeatability and consistency of infrastructure code deployments were much improved. Developers had confidence that what was getting pushed to production was the same baseline that had been tested extensively in pre-production environments. If necessary, we could re-deploy the same revision to an environment and have confidence that the results would be same: infrastructure deployment was idempotent.
There were also some limitations to this model:
- Because Puppet is additive only, there was no ability to revert changes as you would in a typical application deployment. If a change needs to be backed out, you must explicitly add configuration to reverse it, check this configuration in, and promote it to production using the pipeline. This meant that if a breaking change did get deployed into production, typically a manual fix was applied, with the proper fix checked into version control subsequently.
- Certain types of configuration are node-specific and could only be really tested when applied to the environment in which the node resided. We mitigated this risk as much as possible by using Vagrant to stand up virtual clusters on development machines, but these were never exact simulacra of the production environment, so bugs would very occasionally get through.
- Because the dry run report was the first step in promoting a change, it could be hard to dig out this report to determine the impact of a given revision to an environment when the promotion happened potentially days later. One possible solution would have been to implement the dry run report as the last stage in the previous environment’s build steps.
Storing configuration information in version control and using a tool like Puppet to apply infrastructure changes to your environments is a good first step to getting your infrastructure under control. Unfortunately, it is only the first step. In the same way that a business would never check in a change to a codebase and push it directly to production, operation teams need to be aware that simply adding Puppet as an entry point for changes is not enough to ensure success. A poorly-designed infrastructure change applied directly to production via Puppet is just as likely to cause an outage as a manual change made directly on the node.
The deployment pipeline model for software provides a path to production for every code change checked in by a developer. It implements a series of tests and verifications which the change must pass through to ensure that it is ready to go live. The benefit of this setup is that, provided the code change passes all the verifications, the team and the business both have confidence that changes can safely be made to production at any time.
Applying this same pattern to infrastructure changes allowed us to realize the same benefits. Since we had a single path to production, the only way a change could be applied to production was to have it applied and tested in each and every prior environment. In order to verify changes, we ensured that the infrastructure required for a project to run in production was also available in testing environments. Project-specific infrastructure changes could therefore be tested and promoted up to production alongside the application changes that required them.
The pipeline management tool we used—Go—provided both automated and manual configuration for promoting changes between environments. For testing environments, changes were automatically applied after check-in once the unit tests had passed. For controlled environments, changes that had made it through test environments could be promoted on demand. Finally, because our pipeline also produced a report of what it had done, we were able to detect uncontrolled changes and find out which well-meaning developer had been monkeying with an environment manually. This helped us to catch most uncontrolled changes and get them into version control to ensure they would be applied and promoted using our standard, controlled process.
Ultimately our infrastructure deployment pipeline gave us the ability to push changes to production on demand with a very high confidence that the change would work as expected. Reports were produced detailing what changes would take place on which nodes, and then the changes were verified using smoke tests. The teams became so confident that it was not unexpected to see changes being applied to production in the middle of the day. The combination of MCollective, Puppet, Go and the deployment pipeline pattern has enabled NBN Co to treat their infrastructure in the same way they treat any project. A project that can deliver specific requirements on time in a reliable, automated, and repeatable fashion.
About the authors:
Andrew Cunningham has just joined Telstra as a Senior Middleware Engineer after spending 7 years with ThoughtWorks in Canada, USA and Australia. He has worked in a mixture of QA, development and operation roles for a wide variety of industries and projects. He has been focusing on DevOps work for the last three years since discovering that he could play as a systems administrator while still getting to write code.
Andrew Myers joined NBN Co in 2011 to play a part in building a high speed broadband network for all Australians. There he helped grow what was initially a small DevOps experiment, into it’s current state where the automation capabilities he helped design and implement are depended on by multiple teams every day. Preivously he’s worked as a software developer, experiencing the “old ways” first hand, as well as working on the other side of the fence providing technical support for development and collaboration tools. Andrew’s passionate about DevOps because it allows him to combine software development and system administration skills and be involved across the whole software development lifecycle.