On Tuesday, August 4th we had a significant period of instability and outages in our OS X build environment for both open source and private repositories. We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for OS X building and testing. This is not at all acceptable to us. We are very sorry that it happened, and our entire Engineering team has implemented a number of changes to help prevent similar problems in the future.
As outlined in our May 2015 post on the Xcode 6.3 beta availability, we partner with MacStadium for the infrastructure that provides our OS X build platform. This infrastructure includes an EMC SAN, multiple Xserve hosts, and vSphere 6.0 as the virtualization layer. Building on that infrastructure, we run a set of services which are responsible for the build VM lifecycle: creation, build running, and cleanup. Due to the nature of the information we can get from the vSphere APIs and the long build timeouts we provide for travis-ci.com private repositories, we have a specialized service (vsphere-janitor) which is responsible for the final cleanup phase of build VMs. This separation between the creation/build-running parts of the lifecycle and the cleanup part is important to understanding what contributed to this incident.
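To make that lifecycle split concrete, here is a minimal sketch of how a standalone janitor might sweep and delete expired build VMs independently of the service that creates them. All names, types, and durations are illustrative assumptions for this post, not the actual vsphere-janitor code.

```go
// A minimal sketch of the lifecycle split described above: one service creates
// and boots build VMs, while a separate janitor process periodically sweeps the
// inventory and destroys VMs that have outlived their build timeout. All names,
// types, and durations here are illustrative assumptions, not the actual
// vsphere-janitor code.
package main

import (
	"fmt"
	"time"
)

// VM is a minimal view of a build VM as exposed by the vSphere API.
type VM struct {
	Name      string
	CreatedAt time.Time
}

// Inventory abstracts the few vSphere operations the janitor needs.
type Inventory interface {
	ListBuildVMs() ([]VM, error)
	Destroy(name string) error
}

// sweep destroys every build VM older than maxAge and reports how many it removed.
func sweep(inv Inventory, maxAge time.Duration) (int, error) {
	vms, err := inv.ListBuildVMs()
	if err != nil {
		return 0, fmt.Errorf("listing VMs: %w", err)
	}
	removed := 0
	for _, vm := range vms {
		if time.Since(vm.CreatedAt) > maxAge {
			if err := inv.Destroy(vm.Name); err != nil {
				return removed, fmt.Errorf("destroying %s: %w", vm.Name, err)
			}
			removed++
		}
	}
	return removed, nil
}

// fakeInventory stands in for the real vSphere cluster in this sketch.
type fakeInventory struct{ vms []VM }

func (f *fakeInventory) ListBuildVMs() ([]VM, error) { return f.vms, nil }
func (f *fakeInventory) Destroy(name string) error {
	fmt.Println("destroyed", name)
	return nil
}

func main() {
	inv := &fakeInventory{vms: []VM{
		{Name: "osx-build-1", CreatedAt: time.Now().Add(-3 * time.Hour)},
		{Name: "osx-build-2", CreatedAt: time.Now().Add(-10 * time.Minute)},
	}}
	removed, err := sweep(inv, 2*time.Hour)
	fmt.Println("removed:", removed, "err:", err)
}
```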
We became aware of higher than normal re-queue rates for builds on our OS X platform, which were leading to a large backlog of OS X builds. At 17:45 UTC, we opened a public status incident and began looking into the issue.
As we began investigating, we discovered that the process that manages the creation of build VMs was logging a number of errors about being unable to power on new virtual machines. The power-on operations were failing because the vSphere cluster was unable to satisfy the resource reservations required for the VMs to power on.
When we reviewed the resource utilization on our vSphere infrastructure, we discovered we had over 6000 virtual machines on the Xserve cluster. During normal peak build times, this number shouldn't be more than 200. At 18:46 UTC, we paused all OS X builds in order to stop all work on the infrastructure, and then dug into why vsphere-janitor was not properly deleting VMs.
At this point we recalled that on the previous Friday, July 31st, we had rotated the password for the vSphere account our services use to authenticate with the vSphere API, in response to vSphere's default 90-day password expiration policy. When the rotation happened, the configuration for the vsphere-janitor service did not get updated. This left the service unable to authenticate, and so it was unable to clean up any build VMs. We had therefore been leaking VMs from 2015-07-31 until we discovered the configuration issue on 2015-08-04.
This also revealed a defect in our vsphere-janitor service: when it is unable to authenticate, it continues to report the last detected total number of VMs to our metrics storage instead of reporting 0. We had existing alerting configured to notify us if the number of VMs either dropped below expected low-traffic minimums or spiked above what we should see during peak usage. But due to this defect, the alerts did not trigger, because the metric kept being reported at a number within our good range.
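Here is a minimal sketch of that failure mode and one way to avoid it, using invented metric and function names: the buggy reporter keeps emitting the last successful count when the vSphere call fails, so a threshold alert on the count never fires, while the fixed reporter emits no count on failure and surfaces the error through a dedicated metric instead.

```go
// A sketch of the reporting defect described above, with illustrative names:
// when listing VMs fails (for example due to an authentication error), the
// buggy version keeps emitting the last successful count, so a threshold alert
// on the VM count never fires. A safer version emits no count at all on failure
// and increments a dedicated error metric that can be alerted on.
package main

import (
	"errors"
	"fmt"
)

// Metrics abstracts whatever backend receives the gauge/counter values.
type Metrics interface {
	Gauge(name string, value int)
	Incr(name string)
}

type reporter struct {
	listVMCount func() (int, error) // wraps the vSphere API call
	lastCount   int
}

// reportBuggy mirrors the defect: on error it silently re-reports the stale count.
func (r *reporter) reportBuggy(m Metrics) {
	count, err := r.listVMCount()
	if err == nil {
		r.lastCount = count
	}
	m.Gauge("vsphere.build_vms", r.lastCount) // stale value keeps alerts quiet
}

// reportFixed emits the gauge only on success and surfaces failures explicitly.
func (r *reporter) reportFixed(m Metrics) {
	count, err := r.listVMCount()
	if err != nil {
		m.Incr("vsphere.list_errors") // alert when this grows or the gauge goes stale
		return
	}
	m.Gauge("vsphere.build_vms", count)
}

// stdoutMetrics is a stand-in backend for the sketch.
type stdoutMetrics struct{}

func (stdoutMetrics) Gauge(name string, value int) { fmt.Println("gauge", name, value) }
func (stdoutMetrics) Incr(name string)             { fmt.Println("incr", name) }

func main() {
	r := &reporter{
		listVMCount: func() (int, error) { return 0, errors.New("401 unauthorized") },
		lastCount:   180, // last value seen before the credential rotation
	}
	m := stdoutMetrics{}
	r.reportBuggy(m) // still reports 180, within the "good" range
	r.reportFixed(m) // reports only the error counter
}
```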
Once we realized the configuration issue, we fixed it, restarted the service, and it began cleaning up some of the backlog of build VMs. We kept all new builds paused until 21:01 UTC to give the cleanup process some time to run without competing for resources with new builds, and then enabled builds for the .com queue while cleanup was still running. We intended to then re-enable .org as long as things remained stable. While monitoring the progress of the cleanup, we realized that the number of leaked VMs was simply too high for our cleanup process to handle with other builds running, and that a portion of the abnormally large number of leaked VMs were in an unexpected state that caused our cleanup to skip them: they had been leaked due to a failure to power on, not because they were ready to be deleted as part of the normal build VM lifecycle.
In order to clean them up we needed to be able to run our cleanup in a mode that would also terminate new build VMs, so at 22:10 UTC we paused all builds again and initiated the more aggressive cleanup. The cleanup proceeded successfully and quickly deleted all 6326 VMs.
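As a rough illustration of the two cleanup behaviors described above, the following sketch uses invented state names to show why VMs leaked by power-on failures were skipped by the normal sweep, and what the more aggressive mode changes.

```go
// A sketch of the two cleanup modes described above, with invented state names:
// the normal sweep only removes VMs that finished the regular build lifecycle,
// which is why VMs leaked by power-on failures were skipped, while the
// aggressive mode used during the incident removes every build VM it finds,
// including ones still running.
package main

import "fmt"

type vmState int

const (
	stateRunning       vmState = iota // build in progress
	stateFinished                     // build completed; ready for deletion
	statePowerOnFailed                // never booted; leaked outside the normal lifecycle
)

type buildVM struct {
	Name  string
	State vmState
}

// shouldDelete decides whether a VM is eligible for deletion in the given mode.
func shouldDelete(vm buildVM, aggressive bool) bool {
	if aggressive {
		return true // terminate everything, including running build VMs
	}
	return vm.State == stateFinished
}

func main() {
	vms := []buildVM{
		{Name: "osx-1", State: stateFinished},
		{Name: "osx-2", State: statePowerOnFailed},
		{Name: "osx-3", State: stateRunning},
	}
	for _, vm := range vms {
		fmt.Printf("%s: normal=%v aggressive=%v\n",
			vm.Name, shouldDelete(vm, false), shouldDelete(vm, true))
	}
}
```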
At 22:56 UTC we began resuming builds and very quickly started to see power-on failures for new build VMs, again due to the cluster being unable to satisfy the resource reservations of the new VMs. We once again paused all builds, and at 23:28 UTC we reached out to our infrastructure provider to help investigate these boot failures. At this point we want to note that we're very happy we've partnered with MacStadium for this infrastructure, as their deep expertise in running vSphere and using it for OS X virtualization has been invaluable.
While beginning to work with MacStadium, we also began testing to determine whether a reduced rate of build throughput would let us continue to run OS X builds and start working through the backlog we'd accrued by this point. We determined we were able to operate at a reduced capacity and enabled both .com and .org builds at that capacity at 00:13 UTC.
By 02:53 UTC we were continuing to process the OS X build backlog at a reduced rate, the OS X infrastructure remained stable, and we continued to work with MacStadium on investigating the boot failures. Operating at that reduced rate, we were able to successfully process the entire build backlog, with things remaining stable.
We continued to operate with the reduced capacity configuration into the morning (Pacific time) of Wednesday, August 5th, with things remaining stable. We were able to increase the capacity without any boot errors and maintained sufficient capacity to keep up with the peak demand for that day.
What we were ultimately able to determine is that the way we had configured CPU reservations for our build VMs meant that when we tried to run the maximum number of builds we'd configured and assumed we had capacity for, we were actually unable to boot that many VMs, because there were not enough CPU cycles available on the cluster.
MacStadium helped us understand that for our workload we did not actually want to use the resource reservation system that vSphere implements, and could instead rely on the virtual machine resource sizing to sufficiently guarantee resources for each build VM.
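For a rough sense of the arithmetic involved, here is a back-of-the-envelope sketch with made-up numbers (the real cluster's specs and limits were different): with a hard per-VM CPU reservation, vSphere's admission control refuses to power on additional VMs once the sum of reservations reaches the CPU the cluster can guarantee, no matter how many VMs the build scheduler assumes it has room for.

```go
// A back-of-the-envelope sketch of the admission-control math, using made-up
// numbers rather than the real cluster's specs. With a hard CPU reservation per
// VM, vSphere refuses to power on another VM once the sum of reservations
// reaches the CPU the cluster can guarantee, regardless of how many VMs the
// build scheduler believes it has capacity for.
package main

import "fmt"

func main() {
	const (
		clusterCPUMHz       = 200_000 // CPU the cluster can reserve, in MHz (assumed)
		reservationPerVMMHz = 4_000   // per-VM CPU reservation, in MHz (assumed)
		configuredMaxBuilds = 60      // builds the scheduler was allowed to run (assumed)
	)

	// The most VMs that can actually power on while hard reservations are in place.
	maxWithReservations := clusterCPUMHz / reservationPerVMMHz
	fmt.Println("power-on limit with reservations:", maxWithReservations) // 50

	if configuredMaxBuilds > maxWithReservations {
		fmt.Println("configured maximum exceeds what admission control allows;",
			"power-on failures begin once the reservation budget is spent")
	}

	// Dropping the per-VM reservations removes this admission-control cap; the
	// VMs' configured vCPU sizing is then relied on to share CPU instead.
}
```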
In reviewing our build throughput metrics, we discovered that our normal peak loads had not yet reached the point where the CPU reservation configuration would become an issue. It was only because of the large backlog this incident had accrued that we hit the maximum for the first time and uncovered this configuration limitation.
So at 03:43 UTC on August 6th, 2015, we briefly paused all OS X builds and implemented the recommended configuration changes to the CPU reservations. We then resumed builds, caught up with the small backlog, and began monitoring the infrastructure closely.
We try to operate on the principle of "trust, but verify," so we decided not to give an "all clear" until we'd done some load testing. We performed the load testing during the evening hours (Pacific time) of Thursday, August 6th, and we were able to boot and run more build VMs than our previously configured maximum throughput. We created a build backlog far above anything we've seen during normal usage and were very happy with both how stable the infrastructure remained and how quickly the backlog was processed.
We are preparing plans for more well-defined load tests of our OS X infrastructure, as the OS X/iOS build workload is unique in terms of CPU and IO utilization, and in how long these builds usually take, compared to the majority of our Linux builds. This testing will include multiple build jobs that represent the common types of repositories we see customers building, and we will gather better and more detailed metrics during these load tests. You can look for a scheduled maintenance announcement for this in the future.
We are going to improve the metrics we collect when we create, run, and delete build VMs on the OS X infrastructure, as well as how these metrics are reported under failure conditions. These improvements will help us build better alerts and avoid these kinds of issues in the future.
Some of the factors that contributed to the impact of this incident, in particular how we do a global stop/start of builds and how builds are re-queued, will be improved in ways that result in less impact to running builds. These changes will happen as we build and ship some fundamental improvements to our build agent software.
We couldn't be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.
Thank you for your continued support of Travis CI. We are working hard to make sure we live up to the trust you've placed in us and provide you with an excellent build experience for your open source and private repository builds, as we know the continuous integration and deployment tools we provide are critical to your productivity.
If you have any questions or concerns that were not addressed in this postmortem, please reach out to us at support@travis-ci.com and we'll do our best to answer them.
Happy Building!