macOS queue backup & emergency maintenance
Incident Report for Travis CI
Postmortem

We strive to provide the most stable and user-friendly CI platform possible so that you and your teams can focus on shipping amazing open source and commercial software. When any portion of our service is unavailable, we know it can bring your productivity to a screeching halt. As developers building a tool for other developers, we understand firsthand how frustrating and debilitating this can be.

We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for macOS building and testing. This is not at all acceptable to us. We are very sorry that it happened, we are very conscious of the fact that our macOS infrastructure has had ongoing stability and backlog issues, and we are very close to putting into production new infrastructure improvements to ensure a higher level of reliability going forward.

Timeline

The following is a timeline of the events during this outage.

Note: All times are in UTC.

  • Feb 01, 2017 - 01:58 UTC: macOS queues for both public and private repos were backed up. We began working with our macOS infrastructure provider to identify contributing factors.
  • Feb 01, 2017 - 02:27 UTC: We began stopping all job throughput to prevent runaway VM leakage while waiting for further insights from our upstream infrastructure provider.
  • Feb 01, 2017 - 02:52 UTC: Some misbehaving hosts were restarted, thanks to help from our upstream provider. We started bringing job processing capacity back online.
  • Feb 01, 2017 - 04:51 UTC: The underlying VM infrastructure remained unstable, so we continued coordinating with our infrastructure provider to perform a full restart of the entire underlying infrastructure.
  • Feb 01, 2017 - 06:45 UTC: The virtualization platform was fully restarted and we began bringing job processing capacity back online.
  • Feb 01, 2017 - 07:11 UTC: Restarting the platform did not resolve all issues, and we resumed digging into the sources of instability.
  • Feb 01, 2017 - 09:30 UTC: We identified connectivity issues in our macOS workers and stopped all macOS builds to further investigate and fix them.
  • Feb 01, 2017 - 14:07 UTC: We made the difficult decision to proceed with cancelling all pending macOS builds on travis-ci.org, in part to reduce the impact on Linux build throughput and to begin running new builds for users.
  • Feb 01, 2017 - 14:33 UTC: We continued working on fixing the connectivity issue preventing us from restarting macOS build processing on both travis-ci.com and travis-ci.org.
  • Feb 01, 2017 - 15:00 - 17:30 UTC: We provided regular updates as we continued to work on fixing the connectivity issues.
  • Feb 01, 2017 - 17:47 UTC: We were testing further patches to skip jobs older than 6 hours in order to help with the massive backlog (a sketch of this kind of stale-job filter follows the timeline).
  • Feb 01, 2017 - 18:22 UTC: Additional testing was required before we could resume running any builds.
  • Feb 01, 2017 - 19:07 UTC: We resumed running at reduced job processing capacity in production for both public and private repos.
  • Feb 01, 2017 - 19:33 UTC: We increased capacity in production for both public and private repos. Due to ongoing issues with our DHCP setup, we were still limited to less than full capacity.
  • Feb 01, 2017 - 20:38 UTC: Our macOS infrastructure was processing builds normally for both travis-ci.org and travis-ci.com, albeit at a reduced capacity. We continued working on fixing our DHCP issues to be able to restore the full capacity.
  • Feb 01, 2017 - 21:36 UTC: The private repo backlog had dropped steadily over the previous hour, and we expected it to be caught up in less than 90 minutes.
  • Feb 01, 2017 - 22:49 UTC: We saw the backlog level off during peak usage hours.
  • Feb 02, 2017 - 00:13 UTC: The backlog for private repos was still dropping; now below 150.
  • Feb 02, 2017 - 01:18 UTC: The backlog for private repos was still dropping; now below 50.
  • Feb 02, 2017 - 01:35 UTC: Issue Resolved
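
For the 17:47 entry above, here is a minimal sketch of the kind of stale-job filter involved, assuming a hypothetical Job type with a CreatedAt timestamp. It is illustrative only; the actual patches live in our scheduling and worker services and differ in detail.

    // stale.go: sketch of skipping queued jobs older than a cutoff so a
    // large backlog drains faster. All names here are hypothetical.
    package main

    import (
        "fmt"
        "time"
    )

    // Job stands in for a queued build job; only the fields needed for the
    // age check are shown.
    type Job struct {
        ID        int
        CreatedAt time.Time
    }

    // maxAge mirrors the six-hour cutoff mentioned in the timeline.
    const maxAge = 6 * time.Hour

    // shouldDispatch reports whether a queued job is still fresh enough to run.
    func shouldDispatch(j Job, now time.Time) bool {
        return now.Sub(j.CreatedAt) <= maxAge
    }

    func main() {
        now := time.Now()
        queue := []Job{
            {ID: 1, CreatedAt: now.Add(-30 * time.Minute)}, // fresh: dispatched
            {ID: 2, CreatedAt: now.Add(-8 * time.Hour)},    // stale: skipped
        }
        for _, j := range queue {
            if shouldDispatch(j, now) {
                fmt.Printf("dispatching job %d\n", j.ID)
            } else {
                fmt.Printf("skipping stale job %d\n", j.ID)
            }
        }
    }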

Contributing Factors

The major contributing factors in this outage were:

  • Multiple vSphere hosts became unavailable, which strained the whole system and caused a portion of new VM creations to fail. This set off a churn of build job requeues, which kept adding more strain to the entire virtualization platform (a sketch of this requeue dynamic follows the list).
  • Unexpected corruption on one of the pair of hosts that provide NAT and DHCP for our build VM network resulted in complete configuration loss on the other host. This led to us needing to move those services to a different component in our stack while we rebuilt the corrupted hosts from scratch.
  • Existing limitations in how the core of our scheduling backend works meant that a backlog of macOS jobs blocked new Linux builds from starting and running.
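
To illustrate the requeue churn in the first factor above: when every failed VM creation immediately goes back onto the queue, the queue itself amplifies load on an already degraded platform. Below is a minimal sketch, not our actual scheduler code, of the opposite behaviour: bounded retries with a growing delay. Every name and limit in it is hypothetical.

    // Sketch of bounded, backed-off requeues, so failed VM boots do not
    // amplify load on an already degraded platform. All names are illustrative.
    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    var errBootFailed = errors.New("vm boot failed")

    // bootVM stands in for requesting a fresh build VM from the virtualization
    // platform; here it fails for the first few attempts to simulate degradation.
    func bootVM(attempt int) error {
        if attempt < 3 {
            return errBootFailed
        }
        return nil
    }

    // bootWithBackoff retries a failed boot a bounded number of times and
    // doubles the wait between attempts, instead of requeueing immediately.
    func bootWithBackoff(maxAttempts int, baseDelay time.Duration) error {
        delay := baseDelay
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if err := bootVM(attempt); err == nil {
                fmt.Printf("attempt %d: VM booted\n", attempt)
                return nil
            }
            fmt.Printf("attempt %d: boot failed, waiting %s before requeue\n", attempt, delay)
            time.Sleep(delay)
            delay *= 2
        }
        return fmt.Errorf("giving up after %d attempts", maxAttempts)
    }

    func main() {
        if err := bootWithBackoff(5, 100*time.Millisecond); err != nil {
            fmt.Println(err)
        }
    }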

Going forward

  1. We'll be sharing more details in a future blog post, but we've invested in building out a sharded virtualization infrastructure and we'll be migrating our macOS builds to this new infrastructure in the near future. This will give us more fault tolerance and let us spread out load across more isolated components.
  2. We are investing in a newer hardware platform for the vSphere hosts, which will be able to handle load better and should result in improvements in overall build performance.
  3. We identified and deployed a small set of hot fixes during the outage, which have already improved our ability to handle this kind of failure scenario and reduced the number of job requeues that happen during this type of outage. We are discussing further improvements we can make to the key backend services that interact with our macOS virtualization platform.
  4. We will be improving our monitoring of the build VM NAT/DHCP components so we can more quickly detect when they are in a failure state.
  5. We are looking at how we can improve our scheduling to better isolate things so a macOS backlog does not impact Linux builds so dramatically (a rough sketch follows this list).
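
As a rough illustration of item 5: if each platform is scheduled only against its own queue and concurrency budget, a backed-up macOS queue cannot starve Linux jobs. This is a hypothetical sketch, not our actual scheduler; the names and numbers are made up.

    // Sketch of per-platform scheduling isolation: each platform has its own
    // queue and concurrency budget, so one backlog cannot starve the others.
    package main

    import "fmt"

    // queue holds the pending count and concurrency budget for one platform.
    type queue struct {
        name     string
        pending  int // jobs waiting
        capacity int // concurrent jobs this platform may run
    }

    // dispatch starts jobs for each platform against that platform's own
    // budget, so one platform's backlog cannot consume another's capacity.
    func dispatch(queues []queue) {
        for _, q := range queues {
            running := q.pending
            if running > q.capacity {
                running = q.capacity
            }
            fmt.Printf("%s: starting %d of %d pending jobs\n", q.name, running, q.pending)
        }
    }

    func main() {
        dispatch([]queue{
            {name: "macos", pending: 5000, capacity: 80}, // backlogged, but contained
            {name: "linux", pending: 200, capacity: 300}, // unaffected by the macOS backlog
        })
    }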

Summary

We couldn't be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.

We thank you for your continued support of Travis CI. We are working hard to live up to the trust you've placed in us and to provide you with an excellent build experience for your open source and private repository builds, as we know the continuous integration and deployment tools we provide are critical to your productivity.

If you have any questions or concerns that were not addressed in this postmortem, please reach out to us via support@travis-ci.com and we'll do our best to provide you with the answers to your questions or concerns.

Posted Feb 09, 2017 - 01:15 UTC

Resolved
The private repo backlog is clear. The public repo backlog continues to drop, which is typical for this day/hour. Thanks again for waiting! 👋❤️
Posted Feb 02, 2017 - 01:35 UTC
Update
The backlog for private repos is still dropping; now below 50. Thank you again for your patience!
Posted Feb 02, 2017 - 01:18 UTC
Update
The backlog for private repos is still dropping; now below 150. We will update again in an hour. Thank you for your patience! 💖
Posted Feb 02, 2017 - 00:13 UTC
Update
We're seeing the backlog level off during peak usage hours. We will continue to issue updates as we monitor backlog progress.
Posted Feb 01, 2017 - 22:49 UTC
Update
The private repo backlog has dropped steadily over the past hour, and we expect it will be caught up in less than 90 minutes. Thank you again for your patience!
Posted Feb 01, 2017 - 21:36 UTC
Update
Our Mac infrastructure is processing builds normally for both travis-ci.org and travis-ci.com albeit at a reduced capacity. We are working on fixing our DHCP issues to be able to restore the full capacity. We cannot thank you enough for your enduring patience.
Posted Feb 01, 2017 - 20:38 UTC
Update
We have increased capacity in production for both public and private repos. Due to ongoing issues with our DHCP setup, we have limited the cap to less than full capacity.
Posted Feb 01, 2017 - 19:33 UTC
Monitoring
We are now running at reduced job processing capacity in production for both public and private repos.
Posted Feb 01, 2017 - 19:07 UTC
Update
The patches we're testing need additional work. We expect production job capacity to come online in the next hour. Thank you for your patience through these multiple delays.
Posted Feb 01, 2017 - 18:22 UTC
Update
We are in the process of testing further patches to skip jobs older than 6 hours in order to help with the massive backlog. We expect to see jobs flowing again in production within the next 30 minutes.
Posted Feb 01, 2017 - 17:47 UTC
Update
We are on the verge of resuming Mac build processing on travis-ci.org. Thank you for hanging in there with us.
Posted Feb 01, 2017 - 16:45 UTC
Update
We’ve begun performing the necessary networking changes and will begin testing them as soon as they’re completed. We appreciate your continued patience.
Posted Feb 01, 2017 - 16:09 UTC
Update
We have proceeded with limiting the maximum number of concurrent jobs on open source repositories with jobs on our Mac infrastructure. You can find more details about this setting here: https://docs.travis-ci.com/user/customizing-the-build#sts=Limiting-Concurrent-Builds.

This change will help with the throughput of your Linux builds on other repositories while we are getting our Mac infrastructure back up. We will revert this change once things settle. Thank you for your understanding.
Posted Feb 01, 2017 - 15:27 UTC
Update
We are continuing to work on fixing the connectivity issue preventing us from restarting Mac build processing on both travis-ci.com and travis-ci.org. Meanwhile, we are also working on putting stopgap measures in place via our software platform to prevent disruption of our Linux build throughput. Thank you for your enduring patience.
Posted Feb 01, 2017 - 14:33 UTC
Update
We made the difficult decision to proceed with cancelling all pending Mac builds on travis-ci.org. Doing so should improve Linux builds throughput and it will hopefully help us get the Mac infrastructure back on its feet. We are sorry for this drastic measure.
Posted Feb 01, 2017 - 14:07 UTC
Update
We’re still attempting to resolve the connectivity issues. We appreciate your ongoing patience.
Posted Feb 01, 2017 - 11:03 UTC
Update
We’ve identified connectivity issues in our MacOS workers and we’re stopping all Mac builds to further investigate and fix them.
Posted Feb 01, 2017 - 09:30 UTC
Update
Restarting the platform did not resolve all issues, and we are continuing to dig into the sources of instability.
Posted Feb 01, 2017 - 07:11 UTC
Update
The virtualization platform has been fully restarted and we're now bringing job processing capacity back online.
Posted Feb 01, 2017 - 06:45 UTC
Update
The underlying VM infrastructure is still unstable, so we are coordinating with our infrastructure provider to perform a full restart. We will update again once we resume job processing.
Posted Feb 01, 2017 - 04:51 UTC
Identified
Some misbehaving hosts have been restarted thanks to help from our upstream provider. We are bringing job processing capacity back online.
Posted Feb 01, 2017 - 02:52 UTC
Update
We are stopping all job throughput to prevent runaway VM leakage while waiting for further insights from our upstream infrastructure provider.
Posted Feb 01, 2017 - 02:27 UTC
Investigating
MacOS queues for both public and private repos are backed up. We are working with our Mac infrastructure provider to identify contributing factors.
Posted Feb 01, 2017 - 01:58 UTC