The Problem with Separating Data from Puppet Code
You’ve bought Pro Puppet, downloaded a couple of modules from the Puppet Forge (and have written some of your own too), and you’re on your way to implementing your Puppet environment when it hits you: something feels bulky with the way you’ve designed your Puppet code. Your modules may not be portable between environments (development, testing, production) without significant tweaks, each of your node declarations may require a number of variables in order for the code to work, or you’re constantly needing to open up your modules to account for changes in your environment.
There’s GOT to be an easier way to do this, right?
We hear stories from many customers about problems in their Puppet environments, and many of them can be traced back to the way their configuration data is integrated with their Puppet code. Configuration data is the term we use for the environment-specific data that needs to be plugged in to your Puppet code (i.e. variables, class parameters). Take the following bit of Puppet code for example:
```puppet
$dnsserver    = '8.8.8.8'
$searchdomain = 'puppetlabs.vm'

file { '/etc/resolv.conf':
  ensure  => present,
  content => "search ${searchdomain}\nnameserver ${dnsserver}\n",
}
```
The configuration data in this example would be the hard-coded variables $dnsserver and $searchdomain and the Puppet code would be the file resource block declaring /etc/resolv.conf. This example is intentionally kept simple in order to highlight the methods by which you will separate your configuration data from your Puppet code, but imagine code that needs to set different variables in different environments (MySQL servers, databases, usernames, and passwords, for example) and you can see how the above example can quickly become unwieldy. How else can this be done?
Legacy Method – Node Inheritance
The first method that people usually tried was node inheritance. By defining variables in separate node definition blocks, and inheriting from a nested list of definitions, you could SIMULATE data separation with this method. This was the go-to method before Puppet 2.6 was released, and as such we consider it to be a legacy solution that we don’t recommend using with versions of Puppet newer than 0.25 (note that if you’re still using node inheritance, please read this advisory on dynamic scoping).
```puppet
node common {
  $dnsserver    = '8.8.8.8'
  $searchdomain = 'puppetlabs.vm'
}

node production inherits common {
  $dnsserver = '10.13.1.3'
}

node 'agent.puppetlabs.vm' inherits production {
  file { '/etc/resolv.conf':
    content => "search ${searchdomain}\nnameserver ${dnsserver}\n",
  }
}
```
PROS
- It was the easiest method to employ.
- Your data was in one location and, technically, separate from your modules.
CONS
- There was no easy way to find the value of a variable for a specific node.
- FINDING the value of a variable required “human parsing,” or reading through each and every node declaration to trace variable values.
- The data still resided in your Puppet code repository.
- There are better ways to implement this strategy, and this should be considered a legacy solution provided solely for information purposes.
Parameterized Classes
Puppet version 2.6 gave us the ability to pass parameters with class declarations. This allows you to completely remove configuration data from your classes and provide ‘sane’ default values should a class declaration not pass a parameter. While this is an entry-level step in beginning to separate your configuration data from your Puppet code (the data is now in its own class—in this case dns::params), the configuration data is STILL in your Puppet code repository (and thus isn’t a full separation). See below for an example:
```puppet
class dns::params {
  $dnsserver    = '8.8.8.8'
  $searchdomain = 'puppetlabs.vm'
}

class dns(
  $dnsserver    = $dns::params::dnsserver,
  $searchdomain = $dns::params::searchdomain
) inherits dns::params {
  file { '/etc/resolv.conf':
    content => "search ${searchdomain}\nnameserver ${dnsserver}\n",
  }
}
```
PROS
- Class parameters can be defaulted back to a ‘sane’ value as outlined in our Smart Parameter Defaults document.
- Modules that utilize this methodology are more portable—parameters need only be changed in a single ‘params’ class.
CONS
- All logic must be embedded in each module’s ‘params’ class.
- If you use this methodology to keep your configuration data separate, every module must have a ‘params’ class and any logic you introduce (picking different values based on operating system, for example) must be repeated in every module.
- The data isn’t truly separate from your Puppet code as it still resides INSIDE the module (and, technically, your Puppet code repository).
External Node Classifier
Many large sites decide to use an External Node Classifier script to solve the problem of looking up configuration data. External Node Classifiers (also known as ENCs) allow you to provide class declarations, parameters, and variables to Puppet in the form of YAML. The previous example would look like this in YAML:
```yaml
classes:
  - dns
parameters:
  searchdomain: 'puppetlabs.vm'
  dnsserver: '8.8.8.8'
```
PROS
- Flexible – you design how the information lookup is done (query a database, parse a hostname or other Facter fact, etc).
- Can be written in any language: shell, perl, ruby, python, etc…
- Plugs into your existing CMDB (Configuration Management Database) to retrieve information that already exists in another source of truth.
CONS
- You are responsible for writing and maintaining the External Node Classifier script.
- If the script breaks, your Puppet runs are endangered.
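To make the ENC contract concrete, here is a minimal sketch in Python (one of the languages listed above). Puppet runs the script with the node's name as its only argument and reads YAML from standard output. Everything specific here is an assumption made for illustration: the `DEFAULTS` and `NODE_DATA` tables, the `classify` helper, and the per-node override all stand in for whatever database or CMDB you would actually query.

```python
#!/usr/bin/env python3
"""Minimal External Node Classifier sketch (illustrative only)."""
import sys

# Hypothetical data; a real ENC would query a database or CMDB instead.
DEFAULTS = {"searchdomain": "puppetlabs.vm", "dnsserver": "8.8.8.8"}
NODE_DATA = {
    "agent.puppetlabs.vm": {"dnsserver": "10.13.1.3"},
}


def classify(node_name):
    """Return the YAML document describing one node."""
    parameters = dict(DEFAULTS)
    # Node-specific values win over the defaults.
    parameters.update(NODE_DATA.get(node_name, {}))
    lines = ["---", "classes:", "  - dns", "parameters:"]
    for key in sorted(parameters):
        lines.append("  %s: '%s'" % (key, parameters[key]))
    return "\n".join(lines) + "\n"


# Puppet invokes the script with the node name as the single argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.write(classify(sys.argv[1]))
```

On the master, a script like this is typically wired in through the `node_terminus = exec` and `external_nodes` settings in puppet.conf. Note that the sketch builds its YAML by hand only to stay dependency-free; a production script would more likely use a YAML library.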
Extlookup
Extlookup was introduced in Puppet version 2.6.0 as a hierarchical way to look up values of parameters or variables based on a Facter fact value. To use Extlookup, you would first define a data directory holding files named after specific fact values (location, environment, operatingsystem, etc.), and then you would specify a lookup precedence (look for a parameter/variable in a file named after the node’s certname FIRST, and then search in a file named after the node’s environment SECOND, and so on). Finally, you would assign a parameter/variable’s value by calling the built-in (as of Puppet version 2.6.0) ‘extlookup()’ function.
```puppet
$extlookup_datadir    = "/etc/puppetlabs/puppet/data"
$extlookup_precedence = [$environment, 'common']

node 'agent.puppetlabs.vm' {
  include dns
}

class dns {
  $dnsserver    = extlookup('dnsserver')
  $searchdomain = extlookup('searchdomain')

  file { '/etc/resolv.conf':
    content => "search ${searchdomain}\nnameserver ${dnsserver}\n",
  }
}
```
Sample common.csv file used with Extlookup
```
dnsserver,8.8.8.8
searchdomain,puppetlabs.vm
```
PROS
- Extlookup supports a dynamic and hierarchical lookup based on a node’s Facter fact values.
- There could be a single node declaration that would use Extlookup to look up the value of every variable/parameter used in Puppet.
- The extlookup() function is built into Puppet as of version 2.6.0.
CONS
- You must use comma-separated value files (CSV) ONLY for your lookups (i.e. variable, value), so structured data (like arrays and hashes) is not supported.
- Data lookups only return the first-matched value.
- It doesn’t have the ability to concatenate a list of matches returned throughout the full hierarchy.
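The first-match behavior described above can be modeled in a few lines of Python. This is only a sketch of the lookup semantics, not Extlookup's actual implementation: the function name `first_match_lookup`, the in-memory CSV strings, and the production value are all hypothetical.

```python
import csv
import io

# Hypothetical stand-ins for production.csv and common.csv in the data directory.
DATA_FILES = {
    "production": "dnsserver,10.13.1.3\n",
    "common": "dnsserver,8.8.8.8\nsearchdomain,puppetlabs.vm\n",
}


def first_match_lookup(key, precedence, data_files):
    """Walk the precedence list and return the first value found for key."""
    for source in precedence:
        text = data_files.get(source)
        if text is None:
            continue  # no data file exists for this level
        for row in csv.reader(io.StringIO(text)):
            if len(row) >= 2 and row[0] == key:
                return row[1]  # stop at the first match; later files are ignored
    raise KeyError(key)
```

With a precedence of `["production", "common"]`, a lookup of `dnsserver` stops at the production file and never sees the value in common, which is why matches from lower levels cannot be combined.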
Introducing: Hiera
Hiera, short for “hierarchy” and written by R.I. Pienaar, is a pluggable, hierarchical database that can query YAML and JSON files (and any other data serialization for which you write a custom backend), as well as Puppet manifests, for configuration data. Hiera builds upon the model that Extlookup created and also adds support for structured data. With Hiera, you can dynamically look up parameters based on a node’s Facter facts. Let’s look at configuring Hiera for use with the previous example:
The hiera.yaml configuration file:
```yaml
---
:backends:
  - yaml
:hierarchy:
  - "%{environment}"
  - common
:yaml:
  :datadir: /etc/puppetlabs/puppet/hieradata
```
The common.yaml file that Hiera uses for parameter lookup:
```yaml
---
dnsserver: '8.8.8.8'
searchdomain: 'puppetlabs.vm'
```
Puppet code using Hiera:
```puppet
class dns {
  $dnsserver    = hiera('dnsserver')
  $searchdomain = hiera('searchdomain')

  file { '/etc/resolv.conf':
    content => "search ${searchdomain}\nnameserver ${dnsserver}\n",
  }
}
```
PROS
- Data is truly separated from your Puppet code—it exists in an entirely separate directory structure.
- Parameter lookup is hierarchical and dynamic based on Facter facts that describe your node.
- Hiera supports structured data—like arrays and hashes—that can be fed back to Puppet.
- Using Hiera, your Puppet modules contain zero proprietary data (which makes the module much more portable).
- Hiera will be integrated with the next version of Puppet (codenamed Telly).
CONS
- As of this writing Hiera is not YET built into Puppet, so utilizing it requires an initial installation step.
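To illustrate how a hierarchy such as the one in hiera.yaml is resolved, the following Python sketch models two lookup styles: a priority lookup that returns the most specific value, and an array lookup that collects values from every level (the kind of merge Hiera offers through functions such as `hiera_array()`). It is a simplified model under stated assumptions, not Hiera's code: the `DATA` contents, the `ntpservers` key, and all function names are hypothetical.

```python
import re

HIERARCHY = ["%{environment}", "common"]

# Hypothetical parsed contents of production.yaml and common.yaml.
DATA = {
    "production": {"dnsserver": "10.13.1.3", "ntpservers": ["10.13.1.5"]},
    "common": {
        "dnsserver": "8.8.8.8",
        "searchdomain": "puppetlabs.vm",
        "ntpservers": ["0.pool.ntp.org"],
    },
}


def resolve_levels(hierarchy, scope):
    """Replace %{name} tokens with values from the node's scope (its facts)."""
    return [
        re.sub(r"%\{(\w+)\}", lambda m: str(scope.get(m.group(1), "")), level)
        for level in hierarchy
    ]


def priority_lookup(key, scope, data=DATA, hierarchy=HIERARCHY):
    """Return the value from the most specific level that defines key."""
    for level in resolve_levels(hierarchy, scope):
        if key in data.get(level, {}):
            return data[level][key]
    raise KeyError(key)


def array_lookup(key, scope, data=DATA, hierarchy=HIERARCHY):
    """Collect the values for key from every level, most specific first."""
    merged = []
    for level in resolve_levels(hierarchy, scope):
        value = data.get(level, {}).get(key)
        if value is None:
            continue
        merged.extend(value if isinstance(value, list) else [value])
    return merged
```

For a node whose scope is `{"environment": "production"}`, the priority lookup of `dnsserver` returns the production value, while the array lookup of `ntpservers` returns the entries from both levels. A node in an environment with no data file of its own simply falls through to common.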
Conclusion
While there are a myriad of options to solve the problem of configuration data and Puppet code separation, we recommend using Hiera for its ability to adapt to every situation. This post only gives a brief glimpse of its awesome functionality. Stay tuned for a post dedicated to Hiera, where we will be looking in-depth at its usage, flexibility, and advanced features that can simplify the management of your environment whether you’re a sysadmin of 10, 100, or 10,000 nodes!
Additional Resources
- Come to Puppet Camp Atlanta on February 3, and check out Gary’s talk on this topic.
- Read the docs: Smart Parameter Defaults
- Read the docs: External Nodes
- Read the docs: Extlookup
- Check out the Hiera project, and look for a more in-depth introduction next week.
9 Comments
Hiera CON: Requires that you learn YAML and that everyone in your group writes it properly.
Hey Jeff,
YAML is just ONE of the backends that Hiera can use – it can parse JSON or any other structured data you can throw at it (as long as you provide it with a data backend). Of course, like anything else, you would definitely want a precommit hook on your repository to parse your YAML, JSON, or WHATEVER you pass to Hiera to make sure it’s formatted correctly. I personally feel YAML is pretty easy to read (for what you need in Puppet – granted you can do MUCH more with YAML that requires a bit of planning), and I think the trade-off of expressing data separately beats needing to refactor Puppet code when changes need to occur. Thanks for the comment!
Oh, I agree on it being easy and easy to read. I’m saying relative to the others you listed, I think you’re biasing a little by leaving off that Hiera con. None of the others have that requirement (or a requirement like it).
I enjoyed the post. It was nice to see this spelled out clearly.
Great article, Gary!
It clarifies well the whole data management problem in Puppet’s evolution.
One point that I’d underline, and that I personally follow in my modules, is that you don’t need to use hiera() functions in your modules to actually use Hiera…
I mean, in terms of portability you must also consider situations where Hiera is not used, and where a module that calls a hiera() function could not be considered.
For this reason I prefer an approach based on Parameterized Classes with params.pp defaults and the extra of a lookup of an optional top scope variable.
(An example here: https://github.com/example42/puppet-openssh)
The approach is verbose, could be optimized but gives, imho, freedom to choose how you want to manage data:
Variables used by the class can be provided either as arguments or at top scope, where they can be evaluated via an ENC, calling an extlookup / hiera function or even with custom logic based on Puppet selectors (not nice but possible).
The same arguments you pass to the parametrized class when you call it can have values obtained from a hiera function, so I don’t really find a plus in forcing the usage of hiera() directly in the module (well, there’s less verbosity, but that’s a tradeoff I can cope with): you can use it in any case, if you want.
Trying to make modules that should be usable (and possibly reusable by different people) NOW, for different Puppet versions and in most Puppet setups, this has been the best approach I’ve thought about (Dan has actually suggested a very interesting improvement to me, but that’s another story).
my2c
al
Thanks for the comment Al! I agree that trying to make a portable module RIGHT NOW requires additional steps (especially since a data lookup mechanism like Hiera isn’t yet built-in to Puppet), and the params class is a great pattern to follow.
Great article on the topic and clearest explanation of the trade offs I’ve seen. There’s the question of what level of hiera support will exist in future versions of the Puppet Dashboard?
Also, with yet another future best practice we can expect even more disparity between modules available in the forge. What would be a nice addition to the forge would be the ability to filter results based on Puppet version support. Like Al and Gary said perhaps the example should show the hiera lookup function being used at the node level within a parametrized class. That would definitely help with module compatibility.
Appears foreman is planning hiera support at some point too.
At first glance, it appears that the semantics of the hiera() function are equivalent to those of the extlookup() function, supporting a superset of the types of data that can be returned. Why, then, can’t hiera() simply be a new implementation of the extlookup() function, so long as there’s a csv backend?
More succinctly: why does there need to be a new function name?
First Look: Installing and Using Hiera (part 1 of 2) | Puppet Labs
[...] a previous blog post, we introduced use cases for separating configuration data from Puppet code. This post (part one of a two part series) will go in-depth with installing, configuring, and using [...]
Hey Jonathon,
Check out the blog post from today on Hiera. It functions very similarly to extlookup, and it DOES improve on the methodology that was begun in extlookup, but we’re making Hiera a major part of the next version of Puppet, codenamed Telly, (i.e. it won’t be something as simple as adding a function, it will be a major change in how we handle data).