On Tuesday, August 4th we had a significant period of instability and outages in our OS X build environment for both open source and private repositories. We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for OS X building and testing. This is not at all acceptable to us. We are very sorry that it happened, and our entire Engineering team has implemented a number of changes to help prevent similar problems in the future.
As outlined in our May 2015 post on the Xcode 6.3 beta availability, we partner with MacStadium for the infrastructure that provides our OS X build platform. This infrastructure includes an EMC SAN, multiple Xserve hosts, and vSphere 6.0 as the virtualization layer. Building on that infrastructure, we run a set of services which are responsible for the build VM lifecycle: creation, build running, and cleanup. Due to the nature of the information we can get from the vSphere APIs and the long build timeouts we provide for travis-ci.com private repositories, we have a specialized service (vsphere-janitor) which is responsible for the final cleanup phase of build VMs. This separation between the creation/build-running parts of the lifecycle and the cleanup part is important to understanding what contributed to this incident.
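To make that lifecycle split concrete, here is a minimal sketch of how a standalone janitor might sweep and delete expired build VMs independently of the service that creates them. All names, types, and durations are illustrative assumptions for this post, not the actual vsphere-janitor code.

```go
// A minimal sketch of the lifecycle split described above: one service creates
// and boots build VMs, while a separate janitor process periodically sweeps the
// inventory and destroys VMs that have outlived their build timeout. All names,
// types, and durations here are illustrative assumptions, not the actual
// vsphere-janitor code.
package main

import (
	"fmt"
	"time"
)

// VM is a minimal view of a build VM as exposed by the vSphere API.
type VM struct {
	Name      string
	CreatedAt time.Time
}

// Inventory abstracts the few vSphere operations the janitor needs.
type Inventory interface {
	ListBuildVMs() ([]VM, error)
	Destroy(name string) error
}

// sweep destroys every build VM older than maxAge and reports how many it removed.
func sweep(inv Inventory, maxAge time.Duration) (int, error) {
	vms, err := inv.ListBuildVMs()
	if err != nil {
		return 0, fmt.Errorf("listing VMs: %w", err)
	}
	removed := 0
	for _, vm := range vms {
		if time.Since(vm.CreatedAt) > maxAge {
			if err := inv.Destroy(vm.Name); err != nil {
				return removed, fmt.Errorf("destroying %s: %w", vm.Name, err)
			}
			removed++
		}
	}
	return removed, nil
}

// fakeInventory stands in for the real vSphere cluster in this sketch.
type fakeInventory struct{ vms []VM }

func (f *fakeInventory) ListBuildVMs() ([]VM, error) { return f.vms, nil }
func (f *fakeInventory) Destroy(name string) error {
	fmt.Println("destroyed", name)
	return nil
}

func main() {
	inv := &fakeInventory{vms: []VM{
		{Name: "osx-build-1", CreatedAt: time.Now().Add(-3 * time.Hour)},
		{Name: "osx-build-2", CreatedAt: time.Now().Add(-10 * time.Minute)},
	}}
	removed, err := sweep(inv, 2*time.Hour)
	fmt.Println("removed:", removed, "err:", err)
}
```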
We became aware of higher than normal re-queue rates for builds on our OS X platform, which were leading to a large backlog of OS X builds. At 17:45 UTC, we opened a public status incident and began looking into the issue.
As we began investigating, we discovered that the process that manages the creation of build VMs was logging a number of errors about being unable to power on new virtual machines. The power-on operations were failing because the vSphere cluster was unable to satisfy the resource reservations required for the VMs to power on.
When we reviewed the resource utilization on our vSphere infrastructure, we discovered we had over 6000 virtual machines on the Xserve cluster. During normal peak build times, this number shouldn't be more than 200. At 18:46 UTC, we paused all OS X builds in order to stop all work on the infrastructure, and then dug into why vsphere-janitor was not properly deleting VMs.
At this point we recalled that on the previous Friday, July 31st, we had rotated the password for the vSphere account our services use to authenticate with the vSphere API, in response to vSphere's default 90-day password expiration policy. When the rotation happened, the configuration for the vsphere-janitor service did not get updated. This left the service unable to authenticate, and so it was unable to clean up any build VMs. We had therefore been leaking VMs from 2015-07-31 until we discovered the configuration issue on 2015-08-04.
This also revealed a defect in our vsphere-janitor service: when it is unable to authenticate, it continues to report the last detected total number of VMs to our metrics storage instead of reporting 0. We had existing alerting configured to notify us if the number of VMs either dropped below expected low-traffic minimums or spiked above what we should see during peak usage. But due to this defect, the alerts did not trigger, because the metric kept being reported at a number within our good range.
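Here is a minimal sketch of that failure mode and one way to avoid it, using invented metric and function names: the buggy reporter keeps emitting the last successful count when the vSphere call fails, so a threshold alert on the count never fires, while the fixed reporter emits no count on failure and surfaces the error through a dedicated metric instead.

```go
// A sketch of the reporting defect described above, with illustrative names:
// when listing VMs fails (for example due to an authentication error), the
// buggy version keeps emitting the last successful count, so a threshold alert
// on the VM count never fires. A safer version emits no count at all on failure
// and increments a dedicated error metric that can be alerted on.
package main

import (
	"errors"
	"fmt"
)

// Metrics abstracts whatever backend receives the gauge/counter values.
type Metrics interface {
	Gauge(name string, value int)
	Incr(name string)
}

type reporter struct {
	listVMCount func() (int, error) // wraps the vSphere API call
	lastCount   int
}

// reportBuggy mirrors the defect: on error it silently re-reports the stale count.
func (r *reporter) reportBuggy(m Metrics) {
	count, err := r.listVMCount()
	if err == nil {
		r.lastCount = count
	}
	m.Gauge("vsphere.build_vms", r.lastCount) // stale value keeps alerts quiet
}

// reportFixed emits the gauge only on success and surfaces failures explicitly.
func (r *reporter) reportFixed(m Metrics) {
	count, err := r.listVMCount()
	if err != nil {
		m.Incr("vsphere.list_errors") // alert when this grows or the gauge goes stale
		return
	}
	m.Gauge("vsphere.build_vms", count)
}

// stdoutMetrics is a stand-in backend for the sketch.
type stdoutMetrics struct{}

func (stdoutMetrics) Gauge(name string, value int) { fmt.Println("gauge", name, value) }
func (stdoutMetrics) Incr(name string)             { fmt.Println("incr", name) }

func main() {
	r := &reporter{
		listVMCount: func() (int, error) { return 0, errors.New("401 unauthorized") },
		lastCount:   180, // last value seen before the credential rotation
	}
	m := stdoutMetrics{}
	r.reportBuggy(m) // still reports 180, within the "good" range
	r.reportFixed(m) // reports only the error counter
}
```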
Once we realized the configuration issue, we fixed it, restarted the service, and it began cleaning up some of the backlog of build VMs. We kept all new builds paused until 21:01 UTC to give the cleanup process some time to run without competing for resources with new builds, and then enabled builds for the .com queue while cleanup was still running. We intended to then re-enable .org as long as things remained stable. While monitoring the progress of the cleanup, we realized that the number of leaked VMs was simply too high for our cleanup process to handle with other builds running, and that a portion of the abnormally large number of leaked VMs were in an unexpected state that caused our cleanup to skip them: they had been leaked due to a failure to power on, not because they were ready to be deleted as part of the normal build VM lifecycle.
In order to clean them up we needed to be able to run our cleanup in a mode that would also terminate new build VMs, so at 22:10 UTC we paused all builds again and initiated the more aggressive cleanup. The cleanup proceeded successfully and quickly deleted all 6326 VMs.
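As a rough illustration of the two cleanup behaviors described above, the following sketch uses invented state names to show why VMs leaked by power-on failures were skipped by the normal sweep, and what the more aggressive mode changes.

```go
// A sketch of the two cleanup modes described above, with invented state names:
// the normal sweep only removes VMs that finished the regular build lifecycle,
// which is why VMs leaked by power-on failures were skipped, while the
// aggressive mode used during the incident removes every build VM it finds,
// including ones still running.
package main

import "fmt"

type vmState int

const (
	stateRunning       vmState = iota // build in progress
	stateFinished                     // build completed; ready for deletion
	statePowerOnFailed                // never booted; leaked outside the normal lifecycle
)

type buildVM struct {
	Name  string
	State vmState
}

// shouldDelete decides whether a VM is eligible for deletion in the given mode.
func shouldDelete(vm buildVM, aggressive bool) bool {
	if aggressive {
		return true // terminate everything, including running build VMs
	}
	return vm.State == stateFinished
}

func main() {
	vms := []buildVM{
		{Name: "osx-1", State: stateFinished},
		{Name: "osx-2", State: statePowerOnFailed},
		{Name: "osx-3", State: stateRunning},
	}
	for _, vm := range vms {
		fmt.Printf("%s: normal=%v aggressive=%v\n",
			vm.Name, shouldDelete(vm, false), shouldDelete(vm, true))
	}
}
```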
At 22:56 UTC we began resuming builds and very quickly started to see power-on failures for new build VMs, again due to the cluster being unable to satisfy the resource reservations of the new VMs. We once again paused all builds, and at 23:28 UTC we reached out to our infrastructure provider to help investigate these boot failures. At this point we want to note that we're very happy we've partnered with MacStadium for this infrastructure, as their deep expertise in running vSphere and using it for OS X virtualization has been invaluable.
While beginning to work with MacStadium, we also began testing to determine whether a reduced rate of build throughput would let us continue to run OS X builds and start working through the backlog we'd accrued by this point. We determined we were able to operate at a reduced capacity and enabled both .com and .org builds at that capacity at 00:13 UTC.
By 02:53 UTC we were continuing to process the OS X build backlog at a reduced rate, the OS X infrastructure remained stable, and we continued to work with MacStadium on investigating the boot failures. Operating at that reduced rate, we were able to successfully process the entire build backlog, with things remaining stable.
We continued to operate with the reduced capacity configuration into the morning (Pacific time) of Wednesday, August 5th, with things remaining stable. We were able to increase the capacity without any boot errors and maintained sufficient capacity to keep up with the peak demand for that day.
What we were ultimately able to determine is that the way we had configured CPU reservations for our build VMs meant that when we tried to run the maximum number of builds we'd configured and assumed we had capacity for, we were actually unable to boot that many VMs, because there were not enough CPU cycles available on the cluster.
MacStadium helped us understand that for our workload we did not actually want to use the resource reservation system that vSphere implements, and could instead rely on the virtual machine resource sizing to sufficiently guarantee resources for each build VM.
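For a rough sense of the arithmetic involved, here is a back-of-the-envelope sketch with made-up numbers (the real cluster's specs and limits were different): with a hard per-VM CPU reservation, vSphere's admission control refuses to power on additional VMs once the sum of reservations reaches the CPU the cluster can guarantee, no matter how many VMs the build scheduler assumes it has room for.

```go
// A back-of-the-envelope sketch of the admission-control math, using made-up
// numbers rather than the real cluster's specs. With a hard CPU reservation per
// VM, vSphere refuses to power on another VM once the sum of reservations
// reaches the CPU the cluster can guarantee, regardless of how many VMs the
// build scheduler believes it has capacity for.
package main

import "fmt"

func main() {
	const (
		clusterCPUMHz       = 200_000 // CPU the cluster can reserve, in MHz (assumed)
		reservationPerVMMHz = 4_000   // per-VM CPU reservation, in MHz (assumed)
		configuredMaxBuilds = 60      // builds the scheduler was allowed to run (assumed)
	)

	// The most VMs that can actually power on while hard reservations are in place.
	maxWithReservations := clusterCPUMHz / reservationPerVMMHz
	fmt.Println("power-on limit with reservations:", maxWithReservations) // 50

	if configuredMaxBuilds > maxWithReservations {
		fmt.Println("configured maximum exceeds what admission control allows;",
			"power-on failures begin once the reservation budget is spent")
	}

	// Dropping the per-VM reservations removes this admission-control cap; the
	// VMs' configured vCPU sizing is then relied on to share CPU instead.
}
```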
In reviewing our build throughput metrics, we discovered that our normal peak loads had not yet reached the point where the CPU reservation configuration would become an issue. It was only because of the large backlog this incident had accrued that we hit the maximum for the first time and uncovered this configuration limitation.
So at 03:43 UTC on August 6th, 2015, we briefly paused all OS X builds and implemented the recommended configuration changes to the CPU reservations. We then resumed builds, caught up with the small backlog, and began monitoring the infrastructure closely.
We try to operate on the principle of "trust, but verify," so we decided not to give an "all clear" until we'd done some load testing. We performed the load testing during the evening hours (Pacific time) of Thursday, August 6th, and we were able to boot and run more build VMs than our previously configured maximum throughput. We created a build backlog far above anything we've seen during normal usage and were very happy with both how stable the infrastructure remained and how quickly the backlog was processed.
We are preparing plans for more well-defined load tests of our OS X infrastructure, as the OS X/iOS build workload is unique in terms of CPU and IO utilization, and in how long these builds usually take, compared to the majority of our Linux builds. This testing will include multiple build jobs that represent the common types of repositories we see customers building, and we will gather better and more detailed metrics during these load tests. You can look for a scheduled maintenance announcement for this in the future.
We are going to improve the metrics we collect when we create, run, and delete build VMs on the OS X infrastructure, as well as how these metrics are reported under failure conditions. These improvements will help us build better alerts and avoid these kinds of issues in the future.
Some of the factors that contributed to the impact of this incident, in particular how we do a global stop/start of builds and how builds are re-queued, will be improved in ways that result in less impact to running builds. These changes will happen as we build and ship some fundamental improvements to our build agent software.
We couldn't be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.
Thank you for your continued support of Travis CI. We are working hard to make sure we live up to the trust you've placed in us and provide you with an excellent build experience for your open source and private repository builds, as we know the continuous integration and deployment tools we provide are critical to your productivity.
If you have any questions or concerns that were not addressed in this postmortem, please reach out to us at support@travis-ci.com and we'll do our best to answer them.
Happy Building!