Developers need to be able to write software and deploy it, and often require cutting-edge software tools and system libraries. Sysadmins are charged with maintaining stability in the production environment, and so are often resistant to rapid upgrade cycles. This has traditionally pitted us against each other, but it doesn't have to be that way. Using tools like Puppet for maintaining and testing server configuration, Nagios for monitoring, and Jenkins for continuous code integration, Stanford University Library has brokered a peace that has given us the ability to maintain a stable production environment with a rapid upgrade cycle. I'll discuss the individual tools, our server configuration, and the social engineering that got us here.
Bess Sadler manages a software engineering team for the digital library group at Stanford University Libraries. She is a co-founder of several successful open source software projects including Blacklight (http://projectblacklight.org) and Hydra (http://projecthydra.org), which are used by libraries, museums, archives, and cultural institutions around the world.
Bess Sadler: My name is Bess Sadler and I work on the Stanford Digital Repository at Stanford University Library. I'm a librarian but I'm a librarian who has worked both as a systems administrator and as a software developer, and these days I manage a team of software developers and a newly hired DevOps Engineer, Aaron, in here somewhere, hi.
You might not think of a library as being a software development shop, but increasingly libraries need to manage massive amounts of digital information. And we face many of the same problems that start up companies and other software development shops face, but with the added responsibility of handling long-term data preservation.
When people hear I work at Stanford University Library, they often ask, "Didn’t Google scan all your books?" And yes, that is true, and parts of the software that we build manage the long-term preservation of the digital copies of all of those books. But we also do a lot more than that. We also digitize and provide access to rare materials, and we invent new ways of archiving the world’s increasingly digital cultural heritage. And we build infrastructure to support the emerging research needs of scholars at Stanford and around the world.
I build digital libraries and I love my job. Just within Stanford Library, here are a few of the projects I have had a chance to work on – digitized medieval manuscripts. By digitizing these manuscripts and putting them online, we are providing access to cultural treasures that otherwise very few people would have access to. Because many of these manuscripts have been broken up over the years, digitizing them provides a way to reassemble the original volumes virtually. By creating new ways to annotate them, we are helping to enable new kinds of scholarship.
We are also helping to build new kinds of archives, like the Stephen Jay Gould Papers. Stephen Jay Gould was a well-known paleontologist who made major contributions to our understanding of evolution. It used to be that when an important cultural figure left their papers to an archive, they were actually leaving paper and archives are pretty good by now at handling paper. But Stephen Jay Gould left floppy disks and today’s luminaries will leave us hard drives and mail accounts that have been hosted in the cloud.
Coming up with new ways to preserve these archives for researchers is a significant challenge. And finally, I am very proud of the infrastructure that Stanford University Library is building to enable data-driven science. In case you're not following this issue, let me explain that there is a revolution happening in scientific research right now. The way science has typically happened, is that a researcher has a question they want to ask, they go get a funder, they get some money, they gather some data and then they distill that data into a research publication.
Libraries have always played a role in preserving and providing access to the research publication, but what about all of that data that was gathered? Until recently, because there was no plan to preserve it, it’s been thrown away. This is increasingly recognized as unworkable from a scientific point of view, because it prevents reproducibility and creates an atmosphere where research findings can be too easily falsified. But the data is also incredibly valuable in its own right. Increasingly, access to data is being seen as vital for fueling entrepreneurship and economic growth.
By making data more accessible to more researchers, we enable the asking of new kinds of longitudinal and cross-disciplinary research questions. Here, for example is a data set created by Stanford Scientists about habitat for Pacific Salmon. As you might imagine, when you're studying climate change or habitat loss, having past data to compare with current measurements is vital to understanding trends. We have a very successful digital library program and part of the reason for that is that we have benefited greatly from participation in a repository development community known as the Hydra Project.
Hydra is community-built, open-source software for building digital repositories. It's a pretty successful project, and as a project we've been growing fast, from an original five founding partners in 2009 to 19 development partners and countless installations four years later. But in addition to being a software project, Hydra is a community of people who know how to work together. It's a culture, a common approach to building digital libraries that is more valuable than any individual piece of software. And part of that culture increasingly is DevOps. These days I spend a lot of time talking to people about how to implement Hydra.
The number one thing I hear from people who want to adopt our software, but don't think they can, is, "I would love to start using Hydra, but Ruby on Rails is not on our technology stack and there's no way I'm going to be able to talk our systems administrators into it." And that’s when I start to suspect that they have a Vampires versus Werewolves problem. I see it a lot. Chances are good that they have two groups of people, each with their own arcane powers, who have fundamentally different motivations and who are getting in each other's way instead of working together.
Developers, who I tend to talk to more because of the kinds of conferences I attend, are typically motivated by innovation. They care about building new features and pushing the envelope of what's possible. I know we have got a Vampires versus Werewolves situation when I hear anger and frustration because they can't get those new features delivered to production. Systems teams, on the other hand, are charged with maintaining stability and security. They care a lot about uptime, especially since they are often the ones who have to respond to emergency outages on weekends and holidays.
I know we have got a Vampires versus Werewolves problem when the SysAdmins are complaining that they can't adopt new technologies, they refuse to make any changes to the production system, because they don't know what will break. I tend to think of developers as the Werewolves because they're always changing and making a mess, recompiling libraries with no thought for what might be in the supported Linux distro on the server. I tend to think of the SysAdmins as Vampires interested in longevity and stability.
But after I gave an earlier version of this talk a few years ago, the website Coding Horror made a similar analogy in which programmers were the Vampires, because they're frequently up all night, paler than death itself, generally afraid of being exposed to daylight and they think of themselves or at least their code as immortal. And Coding Horror thought that the SysAdmins were like Werewolves, because they may look outwardly ordinary, but are incredibly strong, mostly invulnerable to stuff that would kill regular people and prone to strange transformations during a moon outage.
It doesn’t really matter which way you think of things, as long as you absorb the underlying message. Developers and SysAdmins have different skills and different motivations, which often put us in conflict with each other. Unless we make an effort to work together, we run the risk of losing a lot of energy in fighting each other. When things aren't going well, there can be a lot of anger on both sides, and the danger is that Vampires and Werewolves can get so caught up in fighting each other that they aren't paying attention to the villagers with pitchforks, all those stakeholders who just want you to figure it out and make it work.
If you have never shipped new features or if your systems are unstable, you get villagers with pitchforks. The way out of this situation is DevOps. DevOps is interesting to me as a movement because it isn't just about technology innovation, it's about improving communication and collaboration. DevOps techniques are taking off in popularity because they have a reputation for delivering user-facing value faster and with less risk. The idea is to go from a situation where production releases are rare, high-effort, high-risk and high-stress to one where production releases happen often, easily and with minimal risk or stress. And then it’s a lot easier for everyone to get along. It sounds great, right? But how do we get from here to there, especially if your Vampires and Werewolves are already fighting?
Here is an observation: innovation is about risk, and you don't take risks with people you don't trust. So the first thing you have to do is work on the interpersonal relationships and letting go of the anger. And there can be a lot of anger, especially when you have spent years working in an environment where these two groups have been pitted against each other, and where they mostly interact in high-stress, high-stakes situations, like traditional waterfall production deployments.
Here are some actual quotes I gathered from developers facing this situation. In these conversations, I can hear the anger and frustration and the way that this can devolve into actual enmity.
"I developed this app, I am best able to deploy it and it’s my job on the line, if it doesn’t get deployed."
"They deployed it with the wrong version of Python, of course it didn’t work."
And here is my favorite from someone who is in the habit of throwing applications over the wall to SysAdmins who couldn't get his code running on the production server.
"Oh it's way more fun to just make them look stupid."
This is the "us versus them" attitude that has to change. In order to get out of this situation, what you have to do -- and this is hard -- is let go of the anger. You can have a lot of emotion tied up in the fact that you have worked really hard on a project and your work is not being seen because you can’t get it deployed. That has made me very angry in the past. But you are not going to be able to solve this problem until you let go of the resentment. So I try to get people to take a deep breath, and let go of anger. And then really try to listen to the people on the other side of the fight.
So once you're emotionally prepared, here are the ways you reach out and actually solve this problem. Remember it's not about winning. It's not about getting root. It's not about beating the other side up until they do what you say. It's about really reaching out and building trust and listening to their side of the story, recognizing their motivations, even if they are different from yours. It's about recognizing your common goals and negotiating a way forward together.
Here are a few ways to move forward that have worked for us.
First of all, get to know the people on the other side as people. Ask them out for coffee. Ask questions that will help you understand the situation from their point of view. Before you make any requests, be prepared to demonstrate a show of good faith about the changes that you plan to make to improve your side of the situation. One show of good faith might be ensuring that you have test coverage for your code. I am a total convert to test-driven development. Having good test coverage makes a huge difference, not only in increasing the quality of the code that is produced, but in decreasing the anxiety levels of everyone involved.
Until I had good test coverage, I never realized how much time I had been wasting spinning my wheels fixing bugs I had introduced myself. I recently had to work on a project without test coverage and it felt like walking through a booby-trapped house. I was afraid to touch anything. In contrast, when I see 98% test coverage on a piece of code, as a developer I feel safe incorporating it into my project, and as a SysAdmin I feel safe deploying it to my server. As an added bonus, good testing also makes it easier to grow your development team and to collaborate across institutions. Good tests build trust.
Once you have a test suite, you can start running continuous integration, automatically running your test suite in a variety of environments on a regular basis. This used to mean you had to run your own continuous integration server, like Jenkins, but hosted continuous integration systems like Travis -- and this is a screenshot of one of our projects building on Travis -- make it easier than ever to continually test your code. I used to develop code on my Mac, and then find it behaved unexpectedly when it was deployed to a Linux production environment.
Worse, I often wouldn't realize at first that the software had bugs, because I had no test suite I could run. So I’d only find out there was a problem, sometime later when we received a complaint. Because the SysAdmins were the ones fielding complaints, this did not make them feel good about me. Since we’ve adopted unit tests and continuous integration, this particular problem rarely happens.
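[Editor's illustration, not part of the talk: a minimal sketch of the kind of small, fast unit test that a continuous integration service can run on every commit. It is written in Python only for illustration. The helper slugify_title and its test cases are hypothetical and are not taken from the speaker's projects.]

```python
import unittest


def slugify_title(title: str) -> str:
    """Hypothetical helper: turn a record title into a URL-safe slug."""
    # Replace anything that is not a letter or digit with a space,
    # lowercase the rest, then join the remaining words with hyphens.
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return "-".join(cleaned.split())


class SlugifyTitleTest(unittest.TestCase):
    """Each test pins down one behavior that should hold on every platform."""

    def test_lowercases_and_joins_words(self):
        self.assertEqual(
            slugify_title("Stephen Jay Gould Papers"),
            "stephen-jay-gould-papers",
        )

    def test_drops_punctuation(self):
        self.assertEqual(
            slugify_title("Salmon: Habitat, 2013!"),
            "salmon-habitat-2013",
        )

    def test_empty_title_gives_empty_slug(self):
        self.assertEqual(slugify_title(""), "")
```

[Saved as a test module, this can be run locally or on a CI server with `python -m unittest`; the same command on a Mac laptop and on a Linux build machine gives both sides the same evidence about whether the code works.]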
In addition to testing your code, you want to have regular monitoring for your production applications. I really like to use Nagios for this. Instead of just using it to monitor systems, I like to use it to monitor the production applications that are running too. I don't know if you can see it from the screenshot, but not only are we checking our Puppet agent and SSHD and asking questions like "Is my disk full?" and "Is my CPU overloaded?", we are also checking whether our Solr index is running and whether various parts of our application stack are behaving as expected.
We used to add even more Nagios checks for various parts of an application. But we've recently started using the ‘is_it_working’ gem instead. Our repository applications, like many web-based apps, are actually several applications working together in concert. And if any layer of the stack is down, then the application is broken. By defining parameters for what needs to be running and what normal functioning looks like for each application, it’s much easier to keep an eye on all of the parts of our applications, helping us hit that target between rapid upgrade cycles, quality assurance and stability that’s the hallmark of good DevOps practice.
The goal is to have a way to quickly answer the question, "Are all of our projects functioning correctly right now?" Not just as HTTP responses but the totality of the application. Are our projects responding the way an end user is expecting to see them?
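[Editor's illustration, not part of the talk: the idea of defining what "working" means for each layer and then reporting one overall answer can be sketched roughly as follows. This is Python, not the Ruby gem the speaker mentions, and every name in it (http_ok, run_checks, EXAMPLE_CHECKS, the localhost URLs) is a hypothetical stand-in. The only outside convention it relies on is the standard Nagios plugin exit codes, where 0 means OK and 2 means CRITICAL.]

```python
"""Sketch: one aggregate answer to "is the whole application working?"."""
import urllib.request
from typing import Callable, Dict, Tuple

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Hypothetical helper: True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts are OSErrors
        return False


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every named check; any failing layer makes the application CRITICAL."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:  # a check that crashes counts as a failure
            healthy = False
        if not healthy:
            failed.append(name)
    if failed:
        return CRITICAL, "CRITICAL - failing layers: " + ", ".join(failed)
    return OK, "OK - all {} layers healthy".format(len(checks))


# Hypothetical layers for one repository application; the URLs are placeholders.
EXAMPLE_CHECKS = {
    "web app": lambda: http_ok("http://localhost:3000/"),
    "solr index": lambda: http_ok("http://localhost:8983/solr/admin/ping"),
}
```

[A monitoring command could call run_checks(EXAMPLE_CHECKS), print the message, and exit with the returned code, so that a single Nagios service check covers every layer of the application instead of one check per layer.]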
And next up in our process, if something is wrong, can we do something about it? Another best practice that we are trying to put in place across the board is linking from our Nagios monitoring system to our documentation. You’ve heard the phrase RTFM, Read The Friendly Manual. You can’t read the friendly manual if it doesn't exist or you can’t find it.
A part of DevOps for us -- and we are still working on this -- is coming up with shared procedures for responding to outages. And genuinely making the documentation friendly and something someone might want to – might be able to consult without getting so angry and frustrated that they let their monster side out.
And finally, Puppet. I have been part of Puppet implementations at two institutions now, and I have watched it being rolled out at several Hydra partner institutions. Puppet can feel like the ultimate challenge in vampire-werewolf teambuilding, because for us it has meant that developers have had to learn more about systems administration and SysAdmins have had to learn more about software development.
This process has been challenging, but also very rewarding. A few years ago, building a new application server was expensive, time-consuming and never quite consistent, no matter how hard we tried. Today having migrated most of our applications to Puppet, we are finding it much easier to spin up new boxes. This makes it much easier to scale our services, because we can duplicate an existing application server easily for load balancing or horizontal scaling.
A few of the practices that have helped make our Puppet installation successful have been good code management practices, training and virtual burn down boxes. We manage our Puppet code in git just like we do our application code. Our systems administrators maintain the master branch, merging pull requests from developers and deploying production ready code to our servers. Each developer is able to deploy their own fork of a Puppet manifest to their burn down box, a virtual machine intended specifically for testing Puppet configuration.
Training is an ongoing process, and learning new skills always involves some frustration and a learning curve, but we have found that the investment of time and effort has been very worthwhile. Puppet means including your production deploy strategy as part of the development process. We often used to fall into the trap of focusing all of our attention on software development, and then getting caught off guard again and again when turning that software into a production service took longer and involved more frustration than we expected. These days, we are integrating application development -- we are integrating application deployment into our development process.
By deploying early and often, we keep the process low risk and low stress, and that means that even Vampires and Werewolves can get along. Thanks so much for your time, and I would be happy to answer any questions. Any questions?
Audience: [Inaudible] [0:18:51].
Bess Sadler: How do you get developers to move into the test driven development culture? That’s a really good question. It’s definitely a process, it’s a lot easier now than it was when we were first starting, because we already have a culture of test-driven development. When we bring new people in, you know it’s kind of the expected norm.
When we were getting started, there was a lot of anxiety about "this is going to make it take longer for me to write code." People couldn't immediately see the value in it. I see a lot of anxiety around not really knowing what to test, and when I have someone who is really recalcitrant and does not want to do test-driven development, it's usually because they're feeling insecure about it and they're not sure what to test, and the tests that they are writing are not actually helpful, and that's why it’s easy for them to say writing tests isn't helpful.
There are some really good workshops, like weekend workshops, especially around here in the San Francisco Bay Area, there's so much. But I think you could find such a thing in every major city. The turning point I think is, when you can get to the point where having the tests saves someone. That’s where I see people really turn around in their attitude is the first time that they really catch something because there were tests there.
The other place I see people changing their attitude is when they start having to collaborate with someone new, where it goes from: You are maintaining a piece of code that’s simple enough that you can hold the whole thing in your head and you are the expert on it. To: you genuinely need to be developing this with someone at another institution or someone you don't talk to a lot and you need a way of being able to trust the code submissions that you're getting from them. So for example, because a lot of our code is open source, we are really counting on our ability to be able to distribute that workload across a bunch of institutions. We don't always necessarily know each other, we haven't always necessarily developed code together. By just having rules in place, that say new code commits have to have a test around them, we are able to have a lot more confidence in our ability to integrate code from other places.
And even people who are a little skeptical about the process at first, when they try to do that without tests, they come around pretty quickly.
Any other questions, yeah.
Audience: [Inaudible] [0:21:54].
Bess Sadler: So the question is, who makes the decision about – who is actually writing the Puppet modules and who makes the decision about when something is ready for the deploy? Is that right? So we have got developers actually writing – well it’s a combination of both. Our SysAdmins are writing the Puppet modules for core services, so we don’t want every developer reinventing how we are going to deploy Apache, right. So we have got a way that we deploy Apache. Already we are so far ahead of the game by having a way that that’s done.
And then the developers, when they're spec'ing out an application server, can pull in that Apache config, know that it’s going to be right, and then customize it in the way that they need for their particular application.
They deploy that to a burn-down box, iterate on that code until it's working correctly on the burn-down box, and then submit a pull request saying, you know, this is the code for, you know, such and such an application server. And there's a code review process that the SysAdmins go through to look at that and make sure it's following whatever conventions we decided on.
And then they do that merge and deploy that to a server. So it feels to us like a good balance between giving the developers the access that they need and the control that they need to really be able to tailor the systems to what they need for when they're trying to build something new. And keeping a tight rein on consistency and security and all the reasons why you might want a separation of powers between developers and the system team. Does that make sense?
Bess Sadler: We run VMware, so we – that’s just the system that we run for all of our virtual machines and so the burn down boxes, each developer is assigned a VMware box that they can blow away and re-create via the – we actually have a Puppet recipe for what a burn down box looks like, so that you can very easily get back to starting from scratch state on that.
Audience: [Indiscernible] [0:25:01].
Bess Sadler: It’s not, no it’s a virtual machine.
Yeah, back there.
Bess Sadler: How does it not just turn into a big pool of metrics that you can’t make sense of? We are still working on that part, in at least some places. For the most part though, just having a process, even if it’s a process that needs some refinement and further development, around saying, "It's important for any application that we are going to be supporting in production to have a really good definition of what its proper working state is." That goes a long way. It also was greatly simplified for us recently when we could encapsulate a lot of that logic into the ‘is_it_working’ gem instead of trying to have that logic split out into separate Nagios tests. So that’s been a change in the right direction.
Any other questions, yeah.
Audience: [Indiscernible] [0:26:47]?
Bess Sadler: So the question is, is there a process to work out disagreements, especially when developers think that we should use one thing and SysAdmins think that we should use something else? There’s not really a formal process. I would say that a lot of the way that those conflicts have been eased for us has been getting to a point where there's much less of a division between teams, certainly compared with past institutions that I've worked at, where we didn’t really have respect for each other necessarily, right.
So it was really easy to just say, "Oh, whatever they want to do is stupid." If you can get to a point where you are genuinely listening to the arguments that the other side is making, it's actually a lot easier to come to a shared agreement, because hopefully you are working together toward the thing that’s actually the right solution for the situation. Which isn't to say there are never going to be trade-offs. And you know, any negotiation is going to mean give-and-take on both sides. But I really feel like the most important part of being able to navigate something like that is having genuine respect and a genuine ability to listen to the other side.
Any other questions? Yeah over there?
Audience: You know, let's just assume you have overcome those mutual respect issues, [Indiscernible] [0:28:29] a big happy family with your SysAdmins and developers, some of whom have worked as others in past [Indiscernible] [0:28:36] and that’s never been an issue, but there is another part in the equation, and some institutions have people who have a vested interest and, as you said, are pitted against each other, not that they, you know, naturally hate each other, but that they are pitted there. And somebody is playing "let's you and him fight."
Bess Sadler: Someone is playing what?
Audience: Let's you and him fight. Okay, how do you recognize that and what do you do?
Bess Sadler: So how do you recognize situations where you are genuinely being pitted against someone else even though you are not necessarily in conflict with this person for any – that’s a really good question. I mean, I think that that comes down to, you know, some level of organizational awareness. Like I sometimes just try to come up with ways of asking a question like, "It kind of seems like you want to be angry at him, but I am not." So why, like, and that can defuse a situation. What I encountered a lot, unfortunately, particularly in places, you know, say like academia, where people tend to work in the same place for 20 years, right, is that a lot of the time people are angry about something that happened a decade ago.
And you have to come up with ways -- and sometimes you have to be able to be a little bit creative -- to get people to get past that and recognize that we've actually got a common goal here. A lot of the time when I see people who are being pitted against each other, it’s because of some older grudge, something that's not necessarily even relevant to this situation. And trying to figure out what that’s about and trying to figure out how to defuse that is not necessarily easy or straightforward. I really have found that just inviting people out for coffee or just finding out like, "What is their big problem? And is there a way I can help to ease that big problem?"
Right, is there some contribution that I can make to make their lives easier, even if it's not immediately in my own best interest? That will kind of get people to let their guard down a little bit, if they have tended to be in, like, a defensive posture. That’s something that's worked for me, but it’s a tough situation.
Audience: Right, it’s not so much the anger, the one that they can see over with is much easier to deal with than your typical academic snake. It’s, you know, and you just, you know, that have an agenda that has to do with making sure that these two guys don’t get along. It serves their power plan.
Bess Sadler: Right that’s going to have to be a future talk, I don't have a good answer to that. Yeah, in the red.
Audience: [Inaudible] [0:31:59].
Bess Sadler: So the question is, if you have a really entrenched kind of heavy-process, waterfall approach, what are the ways that you can transition to something more agile, something more in the spirit of DevOps? I’ve had good luck with grassroots efforts. I've had good luck with, you know, you can only take it so far, you don't want to – skunk works can work well for a while, but at a certain point you are going to have to get people in the higher levels of the administration on board. But it can work. And being able to demonstrate success, it does a lot to convince people of your point of view.
So if there are ways that you can spin off a smaller team or spin off a smaller project or something that I’ve seen work in some situations, where there has been maybe some stagnation showing that there is a way that we could be making progress more rapidly. Or where there has been conflict and an inability to work together, finding at least two people who can come up with, you know a different way of doing things, can kind of just spread a new way of doing things.
Of course, it’s easier if you can get everyone on board at the same time. But that, I have rarely seen that happen. It’s more a process of cultural change, and if people are really entrenched and there is a whole department of folks who are really dedicated to doing stuff one way though, that’s hard to get away from unless you can get some buy-in from everyone that there might be a better way to approach this. So having more conversations about it, bringing in some guest speakers, I don’t know. I hope that helps.
Any other questions? We are almost out of time. Okay, well, thank you so much.