Container-based Linux Precise infrastructure emergency maintenance

Incident Report for Travis CI

Postmortem

We strive to provide the most stable and user-friendly CI platform possible so that you and your teams can focus on shipping amazing open source and commercial software. When any portion of our service is unavailable, we know it can bring your productivity to a screeching halt. As developers building a tool for other developers, we understand firsthand how frustrating and debilitating this can be.

We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for our Linux container-based builds. This is not at all acceptable to us. We are very sorry that it happened.

Some background information

A key component of how Travis CI runs users' builds is a service we call worker, which is written in Go. This service is responsible for the execution lifecycle of a build. Those responsibilities include:

requesting new compute environment resources, in the form of virtual machines or containers
executing a build script inside the VM/container
transmitting the output of the build script into our logging system, for display and storage
reporting the status (passed, errored, failed, cancelled, etc) of a job when the job has finished.

This service is built with multiple backends, which lets us be able to use the same service to run builds in Docker, Google Compute Engine, vSphere for macOS, etc.

For our container-based builds, we auto-scale numerous EC2 instances, with each one running the Docker daemon, an instance of our worker service, and a finite number of concurrent jobs, which we refer to as the POOL_SIZE.

In order to provide sufficient and reliable IOPS performance for container-based builds, we use EC2 instances with local SSD storage and a customized direct-lvm storage driver configuration for the Docker daemon. We also deploy the worker service inside a Docker container, on each EC2 instance, that is pulled down during the instance creation and startup.

We have traditionally published and pulled all of our Docker images from quay.io, but had recently switched to publishing/pulling our worker images from the Docker Hub, in support of plans to deploy the Docker registry as a pull through cache, inside our VPC, as the registry only support the Docker Hub as an upstream source for the pull through cache mode.

Additionally, due to the nature of how we provision the auto-scaled EC2 instances which run the builds, because builds are run in a push based fashion, and the manner in which we try to gracefully finish long running builds (travis-ci.com users have a timeout limit of 2 hours.), it is currently a lengthy process to fully recycle our fleet of container-based builds hosts.

So what happened?

Thursday Feb 2nd, we started rolling out a new worker version (v2.6.2) on EC2, for sudo: false, container-based builds.

Friday Feb 3rd, we identified an issue with jobs being incorrectly marked as failed and started the rollback to v2.5.0. We also notified customers affected by this via our support ticketing system. At the time we believed the rollback to be happening successfully, but expected it to take a few hours to be fully completed.

Saturday, Feb 4th, we discovered that the new instances with worker v2.5.0 were not actually making it into service and so builds were still being executed by worker v2.6.2.

The on-call engineer dug into why the v2.5.0 instances were failing to make it into service, identified a missing tag for the worker's image on the Docker Hub, created the missing tag, and then got the rollback working properly.

As mentioned previously, we recently switched where we pull the worker Docker images and this change was made with our v2.6.0 release of the worker.

Due to some stability issues with v2.6.0 and v2.6.1, these versions were not put into production for the container-based infrastructure. It wasn't until v2.6.2 appeared, based on it being stable in our other environments, that we began the upgrade for the container-based builds.

The code change for where we pull the docker images from, was done in one code base, while the publishing and tagging process is done in another code base. We did not foresee needing to rollback the container-based infrastructure to v2.5.0, which was originally only published/tagged on quay.io and not the Docker hub. Publishing it to the Docker hub was done as part of the incident remediation.

We also learned we did not have alerting which included the errors logged when the image pulls were failing to find v2.5.0.

Sunday, Feb 5th, in order to more quickly eliminate the impact of this issue on our users, we chose to declare emergency maintenance and take more disruptive and aggressive actions that would ensure all the EC2 instances were running v2.5.0. At 00:31 UTC the rollback was fully completed and the incident was marked as Resolved.

Contributing Factors

The major contributing factors in this outage were

A change in how our worker's docker backend executes build scripts, so that bash is run explicitly with a login shell. This change appears to have effects on how bash handles exit codes, in a manner that we have fully investigated yet. This change was not detected by our staging environment tests and revealed insufficient diversity in how our tests reflect the variety of builds ou users are running.
It currently can take us multiple hours to fully cycle out instances because of the safe guards we have in place to do this recycling in a fashion that does not interrupt longer running (up to 2 hours for travis-ci.com users) jobs.
The recent move to pulling our worker Docker images from the Docker Hub, instead of quay, in support of using local registry caching.
Missing coverage in terms of alerting for the errors which were being logged when the image pulls failed.

Going forward?

As part of the incident resolution, we created additional alerting, so that we know we'll be notified about similar errors in the future.
We are discussing both incremental and radical ways to change our instance replacement process and tooling, so that we can more quickly deploy and rollback future worker versions.
We are looking at how we can better improve the diversity of our tests in staging, to help better catch these kinds of regressions before they impact user builds.
Bigger picture, we discussing the potential for moving to a more agent/pull based process for running each job, which would let us more easily replace the worker version without requiring us to restart user builds.

Summary

We couldn't be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity for us to improve, and this will be no exception.

We thank you for your continued support of Travis CI, we are working hard to make sure we live up to the trust you've placed in us and provide you with an excellent build experience for your open source and private repository builds, as we know that continuous integration and deployment tools we provide you are critical to the productivity of you all.

If you have any questions or concerns that were not addressed in this postmortem, please reach out to us via support@travis-ci.com and we'll do our best to provide you with the answers to your questions or concerns.

Posted Feb 13, 2017 - 22:56 UTC

Resolved

This incident has been resolved.

Posted Feb 06, 2017 - 00:31 UTC

Monitoring

The rollout is nearing completion with capacity surpassing demand. We don't expect any build start delays for the remainder of the maintenance.

Posted Feb 06, 2017 - 00:16 UTC

Identified

The container-based Linux infrastructure requires emergency maintenance to ensure all instances are running a known working version of the "worker" component. We had started rolling out a newer version this past Thursday, then began rolling back due to reports of mismatched exit code and job status. Please expect partial capacity behavior similar to times of high load during the maintenance, which we expect will take between 1-2 hours.

Posted Feb 05, 2017 - 21:56 UTC