Introducing PuppetDB: Put Your Data to Work
PuppetDB is the next-generation open source storage service for Puppet-produced data. Today, this includes catalogs and facts, and will be extended in the near future. The initial release provides a drop-in replacement for both storeconfigs and inventory service.
We’ve designed PuppetDB to empower Puppet deployments, and built it from the ground up with performance in mind. It’s built on technologies known for their performance, and is highly parallel, making full use of available resources. It also stores all of its data asynchronously, freeing up the master to go compile more catalogs. Beyond that, we’ve devoted copious time to benchmarking and optimizing the performance.
Why PuppetDB?
The most immediate benefit of PuppetDB is improved performance for storeconfigs users, but even for others, it has a lot to offer. As a centralized store, PuppetDB knows about every node, resource, relationship, and fact across your entire infrastructure. All this information is easily queryable, so you can integrate it into your tools and workflow, or just satisfy your curiosity. It also provides a platform on which powerful new tooling will be built.
And if you’re not using storeconfigs, you should be. At its heart, storeconfigs can be thought of as “higher-order Puppet.” It’s a way for multiple nodes to interact with each other through Puppet, which is an immensely powerful feature. In any case where one node needs to know what another node is doing, storeconfigs may help.
For instance, storeconfigs can be used to configure a monitoring service, without knowing upfront any of the nodes or services being monitored. Each node to be monitored can simply define what ought to be checked, and those checks can be collected on the node doing the monitoring. Or it can be used to share SSH authorized keys, by having each node export its key, and collect everyone else’s.
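To make the export-and-collect idea concrete, here is a toy model written in Python rather than Puppet code. It is only a sketch of the concept: the `ResourceStore` class, its methods, and the node and check names are all made up for illustration and are not part of Puppet or PuppetDB.

```python
from collections import defaultdict


class ResourceStore:
    """Toy stand-in for a shared store of exported resources (hypothetical)."""

    def __init__(self):
        # Maps resource type -> list of (exporting node, resource title).
        self._exported = defaultdict(list)

    def export(self, node, resource_type, title):
        """A node declares a resource for other nodes to collect."""
        self._exported[resource_type].append((node, title))

    def collect(self, resource_type):
        """A node gathers every exported resource of the given type."""
        return sorted(self._exported[resource_type])


store = ResourceStore()

# Each monitored node describes its own check; nobody lists the nodes upfront.
for node in ["web1.example.com", "db1.example.com"]:
    store.export(node, "monitoring_check", f"check_ssh_{node}")

# The monitoring node collects whatever checks happen to exist.
checks = store.collect("monitoring_check")
for node, title in checks:
    print(f"{title} (exported by {node})")
```

The point of the model is that the collecting side never names the exporting nodes; adding a third node only requires that node to export its own check.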
Built for performance
Let’s talk about performance. I told you it was a key design goal, but just how much faster is PuppetDB than the existing solution? To find out, I ran an experiment against the old ActiveRecord-based storeconfigs implementation.
I compiled and saved a catalog of 650 resources, using an initially empty PostgreSQL database. Compilation took 5.6 seconds. With nothing in the database, it took 53 seconds to store the catalog. That’s brushing right up against the agent’s timeout, risking an outright failure. With the database now primed, I submitted the same catalog a second time, unmodified, which took 4 seconds.
To find out how PuppetDB performs, we have much more information available to us. The service is highly instrumented to keep metrics on every aspect of its performance, all of which is made available over HTTP and JMX.
This is the PuppetDB dashboard, which uses the HTTP metrics API to give an overview of the current state of the system. The dashboard comes built-in, and updates live, even on your mobile device! Taking a look at this screenshot (taken from our internal PuppetDB instance), we can see the backlog of work, how long command processing is taking, how much work has been done, how large the database is, and much more. And yet this is still only a small subset of the metrics we track and make available.
In particular, we see that the queue is empty, meaning PuppetDB is keeping up with demand. Looking at the number of nodes and resources in the population, we can easily calculate that the average size of a catalog is ~670 resources. The average time to process a command is 394ms. This is around 130x faster than the worst case time of old-school storeconfigs, and 10x better than the case where catalogs are already present. We also see that PuppetDB is responding to storeconfigs queries in only 65ms.
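The two ratios follow directly from the timings quoted above. A quick check in Python, where the constant and function names are chosen purely for illustration:

```python
# Timings quoted in the text, in seconds.
OLD_COLD_STORE_S = 53.0     # old storeconfigs, initially empty database
OLD_WARM_STORE_S = 4.0      # old storeconfigs, catalog already present
PUPPETDB_COMMAND_S = 0.394  # PuppetDB average command processing time


def speedup(old_seconds, new_seconds):
    """Return how many times faster the new timing is than the old one."""
    return old_seconds / new_seconds


cold = speedup(OLD_COLD_STORE_S, PUPPETDB_COMMAND_S)
warm = speedup(OLD_WARM_STORE_S, PUPPETDB_COMMAND_S)
print(f"worst case: {cold:.1f}x, primed case: {warm:.1f}x")
```

Both results land slightly above the rounded figures of 130x and 10x given in the text.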
Admittedly, these numbers are not directly comparable; for instance, the very first catalog stored in PuppetDB may take some extra time, while storing a catalog that is unchanged takes negligible time. But this gives some indication of the improvement we’re talking about. It’s also important to note that all of this storage is asynchronous, freeing up the master to continue serving catalogs. Previously, the master would have been occupied waiting for storeconfigs.
Reliable data store
So we can see that PuppetDB stores your data more quickly, but what about the data itself? After all, that’s what you really care about. PuppetDB makes a few promises about its data: it will be complete, it will be accurate, and it will be current.
Every aspect of the catalog is stored, including edges and unexported resources, which are omitted in old storeconfigs and the popular thin_storeconfigs mode respectively. Nuances of the catalog like resource aliases are also respected, ensuring that every resource and edge is present and accurately represented.
It’s downright difficult to lose your data with PuppetDB. It takes great care not to let that happen, by accepting it into a persistent queue, and trying up to sixteen times (even across service restarts) to handle the command, ensuring that if the data is good, it will make it into the database. And if it somehow still doesn’t make it in, the command will be saved away with plenty of forensic data for later investigation and reprocessing.
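The retry-then-set-aside behavior can be sketched in a few lines of Python. This is a simplified, in-memory model and not PuppetDB’s actual implementation: the real queue is persistent and survives restarts, and the names `process_queue` and `store_catalog` are hypothetical.

```python
from collections import deque

MAX_ATTEMPTS = 16  # the retry limit mentioned in the text


def process_queue(commands, handler, max_attempts=MAX_ATTEMPTS):
    """Run handler on each command, retrying failures up to max_attempts.

    Returns (succeeded, discarded). Each discarded entry keeps the command
    together with the errors seen, for later investigation.
    """
    queue = deque((command, []) for command in commands)
    succeeded, discarded = [], []
    while queue:
        command, errors = queue.popleft()
        try:
            handler(command)
            succeeded.append(command)
        except Exception as exc:  # toy model: treat every error as retryable
            errors.append(repr(exc))
            if len(errors) >= max_attempts:
                discarded.append({"command": command, "errors": errors})
            else:
                queue.append((command, errors))  # back of the queue, retry later
    return succeeded, discarded


attempts = {}


def store_catalog(command):
    """Hypothetical handler: 'good' succeeds on its 3rd try, 'bad' never does."""
    attempts[command] = attempts.get(command, 0) + 1
    if command == "bad" or attempts[command] < 3:
        raise RuntimeError(f"could not store {command}")


succeeded, discarded = process_queue(["good", "bad"], store_catalog)
print(succeeded)
print(len(discarded[0]["errors"]))
```

A command that eventually succeeds is never lost, and one that keeps failing ends up set aside with its error history instead of silently disappearing.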
In that vein, when configured to use PuppetDB, Puppet will refuse to serve catalogs if PuppetDB is down and the catalog can’t be persisted. This means the data PuppetDB has will always be current; an agent will never use a catalog that PuppetDB doesn’t know about.
And it’s secure. All communication between the puppet master and PuppetDB happens over SSL, authenticated with the same certificates as used for communication between puppet master and agents. Similarly, if PuppetDB and its database are separate, it’s a simple matter to secure their connection.
Plays well with others
PuppetDB is a key component of the Puppet Data Library, and brings that to bear in its query API. Resources, facts, nodes, and metrics can all be queried over HTTP. For resources and nodes, there is a simple query language which can be used to form arbitrarily complex requests. The public API is the same one that Puppet itself uses when it makes storeconfigs queries (using the <<||>> operator) against PuppetDB, but it provides a superset of the functionality provided by storeconfigs. The API is fully documented and versioned, for use in scripts, Faces, or custom Puppet functions.
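As a rough illustration of what scripting against an HTTP query API like this could look like, the Python sketch below builds a request URL without sending it. Everything specific here is an assumption made for the example and should be checked against the PuppetDB documentation: the host and port, the `resources` endpoint name, the `query` parameter, the operator-first JSON query shape, and the field names.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse


def build_query_url(base_url, endpoint, query):
    """Return a URL with the query JSON-encoded into a `query` parameter."""
    return f"{base_url}/{endpoint}?{urlencode({'query': json.dumps(query)})}"


# Assumed query shape: operator first, with sub-queries nested under "and".
query = ["and",
         ["=", "type", "Nagios_service"],
         ["=", "exported", True]]

# Hypothetical host and port; nothing is sent over the network here.
url = build_query_url("http://puppetdb.example.com:8080", "resources", query)
print(url)

# Round-trip check: decoding the parameter gives back the same structure.
decoded = json.loads(parse_qs(urlparse(url).query)["query"][0])
```

Keeping the query as a nested list until the last moment makes it easy to compose larger requests from smaller conditions before encoding them.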
PuppetDB is faster, smarter, and has more complete data than ever before. If you’re a current storeconfigs user, there’s no reason not to try it out immediately. If you don’t use storeconfigs (and especially if performance was the reason), now is the time to start. We know that storeconfigs, while being a powerful and important feature, has historically been a pain point for users. One of the goals of PuppetDB is to alleviate that, and personally, I want a world in which everyone uses storeconfigs and loves it. PuppetDB offers great power over and insight into your infrastructure, and it’s only going to get bigger and better.
Learn More
- Download the open beta of PuppetDB from our APT repository, YUM repository, or GitHub.
- Read the PuppetDB documentation.
- Open tickets on the PuppetDB project.
- Come see Deepak Giridharagopal talk in-depth about the development and design decisions, and ask all your usage questions this Saturday at Puppet Camp Los Angeles.

21 Comments
It’s surprising that MySQL isn’t a supported backend, especially when the existing stored config implementation is backed by MySQL and PuppetDB is JVM based so has access to the standard JDBC bindings etc. Are there plans to add it?
There aren’t any plans to add support for MySQL. It lacks several features we do or will depend on, including array columns and recursive queries. We’re committing to PostgreSQL as our preferred database for the future, as it strikes the right balance between the features of Oracle and the price tag of... well, free. We certainly wouldn’t turn away a patch which works around the limitations of MySQL to bring support; we just have no plans to write it. We may eventually add support for other databases which *do* have the features we need.
Incidentally, if you have a small number of nodes, the embedded database should suffice.
Is 250 a “small number of nodes”? Do you have any metrics (compile times, CPU load) on how it compares with MySQL on exported resources?
To be honest, the embedded database is primarily intended for proof-of-concept deployments, so I don’t have much hard data about it.
I did test it out running on my laptop. It seemed to easily keep up with inserting catalogs for 2000 nodes with a 30-minute run interval, with catalogs of around 650 resources. I didn’t test query performance, but I expect that to be fine.
I’m concerned about memory usage, though, as the documentation appears to indicate it will keep the entire database in memory at once. With around 1500 nodes in the database, the memory usage was around 1.5GB. So I’m feeling a little less solid about its viability for production use.
However, it does seem like it should work for 250 nodes, which is about what I was thinking of as a “small number of nodes”. Be generous with your memory allocation. It used around 1MB per node for me (again, 650 resource catalogs), so 512MB seems like a safe starting point. Watch the logs for memory errors, to make sure it’s working alright, particularly until every agent has run once and submitted a catalog.
All that back-and-forth said, if it were my own infrastructure, I would set up Postgres and not have to worry about it. I suggested the embedded database for people who, for whatever institutional reason, can’t use Postgres. If you can, you should.
Sure, I would do that too, but during evaluation it would be easier to leave MySQL running as a quick failsafe.
I assume HSQL keeps the database in RAM for performance but has some layer of on-disk persistence to survive reboots? It wouldn’t be nice to lose all those ACLs that we export with database restarts.
Yes, the embedded database is persistent. And if it’s for evaluation purposes, it’s definitely a good choice.
So, IIUC we’ll need yet another stack (the infamous JVM) to run puppet with that?
Why not work on the current storeconfigs code instead of redesigning things?
Yes, you’ll need a JVM to run PuppetDB. But PuppetDB is a service separate from the puppet master, and can live on its own machine. It’s also completely optional today. You need a JVM if and only if you choose to use PuppetDB. I hope the benefits are sufficient to be worth the trouble; if not, we plan to continue making it better until the JVM is a small price to pay. :)
We developed PuppetDB as a separate service in order to support our long-term architectural goals. It’s bigger than just “new storeconfigs”. We’ve traditionally had a problem of many pieces owning different data. For instance, storeconfigs has the catalogs, the inventory service has the facts, and the Puppet Dashboard has the reports. And they all have wildly different APIs. That makes it difficult to easily integrate with Puppet.
The intent is for PuppetDB to become the single, consistent, authoritative place to find all the data that Puppet produces, and to interact with it in a single way. We’ve started here by consolidating (and seriously improving) catalogs and facts. Hopefully in a fairly short time, PuppetDB’s database will be the only one you need.
Are you planning to use PuppetDB’s MQ for MCollective in the future, or will we need to run two MQ services in a typical puppet master stack?
PuppetDB’s MQ is embedded, not a user-facing component, so it shouldn’t really be an issue. Message queuing is going to become more prevalent in Puppet, but as we introduce it, we’re going to ensure you never have more than one MQ to manage.
Fucking brilliant. Thanks :)
Nick, running a JVM is never a small price to pay. I just don’t understand how you can say “built it from the ground up with performance in mind. It’s built on technologies known for their performance” and then write it in Java which is none of those things. I’ll never understand why Java is so popular in enterprises considering its long and terrible track record. Every sys admin has had Java eat machines for breakfast and you’ve now asked them to trust the most important component of their infrastructure to it. In fact I just spent the last thirty minutes trying to figure out which Java process killed one of my machines. That’s a hard sell.
Also, “agent nodes will be unable to request catalogs if it becomes unavailable” means it’s a single point of failure. Do you have instructions for clustering PuppetDB, or can it be run as a “farm” with each instance of PuppetDB not needing to be aware of the others? We can’t deploy it if it isn’t highly available.
I’ve been looking forward to PuppetDB and I really expected better from Puppet Labs. This is just disheartening…
Russell
Religious arguments aside, our choice of the JVM was a thoroughly researched, tested, and practical decision. We chose it because we decided that it was the best tool for the job, and will continue to guide our technology choices by that principle.
The performance of PuppetDB is something we’ve cared about from the project’s inception. We’ve been constantly benchmarking our performance and deploying builds internally to run our own infrastructure. Our commitment to performance is the reason why PuppetDB has the most extensive instrumentation and monitoring capabilities of any tool in the puppet ecosystem, and why it’s the absolute fastest and most robust way to store catalogs and facts.
I understand that you have some operational concerns. It’s worth noting that PuppetDB’s memory and CPU usage can be tightly controlled via simple configuration switches. Furthermore, we make it very easy to integrate PuppetDB’s performance metrics into your monitoring infrastructure, thereby providing an easy way to keep tabs on nearly all aspects of PuppetDB’s operations.
As mentioned above, we’ve been using PuppetDB internally for quite some time now, and we’ve seen staggering improvements in compilation time without any operational issues. Users that have already put PuppetDB into production have reported similar results. But by all means, don’t take our word for it. We make it easy for responsible systems administrators such as yourself to try PuppetDB out, and back things out if necessary. We have pre-built packages for most popular platforms that make installation quite simple. We encourage you to try it out, and put our performance claims to the test!
Regarding high availability, we use standard relational database tech under-the-hood, so the database itself can be backed up and replicated. Additionally, it’s easy to run several instances of PuppetDB pointing to the same database cluster. Also note that even if PuppetDB goes down, agents will still configure hosts using a cached copy of the catalog. That extreme level of redundancy has been included in Puppet for years now. And if the database is corrupted or wiped for some reason, agents will automatically repopulate PuppetDB with their latest data.
I must agree with what Russell is saying. Plain old SQL is just too fat and slow. Using something like a graph database would give better performance. The LDAP ENC is perfect for central definition of nodes, can easily be scaled with multi-masters and consumers, and would be highly available. Latency issues on writes can be handled via a redirect and a proxy. The beauty would be the wealth of LDAP GUI tools, APIs, and existing command-line tools, such as ldapsearch.
My .02.