Over the last few years, the Nova project has spent a lot of time improving its story around live upgrades. We’ve made a lot of progress, and done so with minimal disruption while we were figuring out how to make it work. As some of this work starts to spread on the wind and pollinate other projects, the details of how all the pieces fit together are hard to communicate. The details of how Nova pulls off some of its tricks in this area live primarily in the heads of a few people.
In OpenStack, projects are expected to maintain a minimum level of what I would call “offline upgrade-ability”. Historically that has meant things like config file compatibility between releases, such that an operator with a valid Juno config should be able to upgrade to Kilo without changes. Database schema migrations (performed offline) have generally been something we know how to do, so that someone with a large deployment does not have to rebuild their data after upgrading. Careful handling of things like deprecations will soon be table stakes.
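To make config file compatibility concrete, here is a minimal sketch of one way a renamed option can keep honoring its old name for a release. It uses only Python's standard `configparser` and is not Nova's actual code. The option names (`timeout`, `rpc_timeout`), the `[service]` section, and the `get_option` helper are all hypothetical, invented for illustration.

```python
import configparser
import warnings

# Hypothetical rename for illustration: the newer release calls the option
# "rpc_timeout", while the older release called it "timeout".
DEPRECATED_NAMES = {"rpc_timeout": "timeout"}


def get_option(conf, section, name, default=None):
    """Look up an option, falling back to its deprecated name if present."""
    if conf.has_option(section, name):
        return conf.get(section, name)
    old_name = DEPRECATED_NAMES.get(name)
    if old_name and conf.has_option(section, old_name):
        # The old name still works, but the operator is told to move on.
        warnings.warn(
            f"Option '{old_name}' is deprecated; use '{name}' instead",
            DeprecationWarning,
        )
        return conf.get(section, old_name)
    return default


# A config file written for the older release, using only the old name.
OLD_CONFIG = """
[service]
timeout = 30
"""

conf = configparser.ConfigParser()
conf.read_string(OLD_CONFIG)

# The newer code asks for the new name and still gets the operator's value.
value = get_option(conf, "service", "rpc_timeout", default="60")
```

The point of the fallback is that the operator's old file keeps working unchanged across the upgrade, and the warning gives them a full release to migrate before the old name is removed.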
The goal of our work on live upgrades is to avoid having to take down any component of Nova for an extended period of time. Upgrades of a large cloud take time, hit roadblocks, and uncover bugs. Any phase that requires some service to be down to make a change means that if you get stuck in that phase, something isn’t working and customers are perusing your competitor’s website while they wait.
In a series of posts to follow, I hope to detail and document some of the mechanics with examples from the Nova code and provide insight into why things work the way they do. The target audience for these posts is developers in other OpenStack projects looking to follow in Nova’s footsteps. As always, not all projects will want to do things the way Nova did, and that’s fine. The details are offered here for people who are interested, but are not intended to define the way all projects should do things.
The approach taken to make live upgrades work on something as complicated as Nova is actually composed of many more specific strategies across many subsystems. It is often claimed that adopting one library or protocol will magically make upgrades work. In my experience, however, there is no silver bullet for this problem. Pulling it off requires substantial changes, most of which are cultural. Some mechanical things need to be done at the database and RPC layers, but in the end, it requires the people writing and reviewing the code to understand the problem space and defend the project against changes that will break upgrades.
What follows this post will hopefully be enough of a roadmap for other projects to wrap their heads around the complexity of the problem and start making progress.