We strive to provide the most stable and user-friendly CI platform possible so that you and your teams can focus on shipping amazing open source and commercial software. When any portion of our service is unavailable, we know it can bring your productivity to a screeching halt. As developers building a tool for other developers, we understand firsthand how frustrating and debilitating this can be.
We want to take the time to explain what happened. We recognize that this was a significant disruption to the workflow and productivity of all of our users who rely on us for our Linux container-based builds. This is not at all acceptable to us. We are very sorry that it happened.
A key component of how Travis CI runs users' builds is a service we call worker, which is written in Go. This service is responsible for the execution lifecycle of a build, from provisioning a build environment through running the build script and reporting its result.
This service is built with multiple backends, which lets us use the same service to run builds on Docker, Google Compute Engine, vSphere for macOS, and so on.
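To make that structure concrete, here is a rough Go sketch of a multi-backend design; the interface and method names are illustrative assumptions, not the worker's actual API:

```go
package worker

import "context"

// Backend abstracts the compute provider a build job runs on.
// Method names here are illustrative, not the worker's real interface.
type Backend interface {
	// Start provisions a VM or container for a single job.
	Start(ctx context.Context) (Instance, error)
}

// Instance is one provisioned build environment.
type Instance interface {
	// UploadScript copies the generated build script onto the instance.
	UploadScript(ctx context.Context, script []byte) error
	// RunScript executes the build script and reports its exit code.
	RunScript(ctx context.Context) (exitCode int, err error)
	// Stop tears the environment down once the job has finished.
	Stop(ctx context.Context) error
}

// Each provider (Docker, Google Compute Engine, vSphere, ...) supplies its
// own implementations; the rest of the service only sees these interfaces.
```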
For our container-based builds, we auto-scale numerous EC2 instances, each running the Docker daemon, an instance of our worker service, and a finite number of concurrent jobs, which we refer to as the POOL_SIZE.
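To illustrate what a fixed POOL_SIZE means in practice, here is a minimal Go sketch of a counting semaphore bounding concurrent jobs; reading the value from an environment variable and the stand-in job loop are assumptions made for the example, not the worker's actual code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

func main() {
	// POOL_SIZE caps how many jobs a single worker instance runs at once.
	poolSize, err := strconv.Atoi(os.Getenv("POOL_SIZE"))
	if err != nil || poolSize < 1 {
		poolSize = 2 // fallback for the sketch only
	}

	sem := make(chan struct{}, poolSize) // counting semaphore
	var wg sync.WaitGroup

	for jobID := 1; jobID <= 10; jobID++ { // stand-in for jobs arriving from the queue
		wg.Add(1)
		sem <- struct{}{} // block until a slot in the pool frees up
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the job ends
			fmt.Printf("running job %d\n", id)
		}(jobID)
	}
	wg.Wait()
}
```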
In order to provide sufficient and reliable IOPS performance for container-based builds, we use EC2 instances with local SSD storage and a customized direct-lvm storage driver configuration for the Docker daemon. We also deploy the worker service inside a Docker container on each EC2 instance; its image is pulled down during instance creation and startup.
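As a simplified illustration of that configuration step, the bootstrap for a build host might write a direct-lvm (devicemapper) daemon configuration before the Docker daemon starts; the thin-pool device and options below are placeholders, not our production values:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
)

func main() {
	// Illustrative direct-lvm (devicemapper) settings; the thin-pool device
	// and the option list are placeholders rather than our real configuration.
	cfg := map[string]interface{}{
		"storage-driver": "devicemapper",
		"storage-opts": []string{
			"dm.thinpooldev=/dev/mapper/docker-thinpool",
			"dm.use_deferred_removal=true",
		},
	}

	data, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	// Written during instance bootstrap, before the Docker daemon is started.
	if err := os.WriteFile("/etc/docker/daemon.json", data, 0644); err != nil {
		log.Fatal(err)
	}
}
```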
We have traditionally published and pulled all of our Docker images from quay.io, but we recently switched to publishing and pulling our worker images from the Docker Hub, in support of plans to deploy a Docker registry as a pull-through cache inside our VPC, as the registry only supports the Docker Hub as an upstream source in pull-through cache mode.
Additionally, because of how we provision the auto-scaled EC2 instances that run the builds, because builds are pushed to workers rather than pulled, and because we try to let long-running builds finish gracefully (travis-ci.com users have a timeout limit of 2 hours), fully recycling our fleet of container-based build hosts is currently a lengthy process.
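To show why draining a host takes so long, here is a rough Go sketch of a graceful-shutdown path that stops accepting new work and waits out in-flight jobs up to the build timeout; the signal handling and names are assumptions for the sketch, not the worker's actual shutdown code:

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// A SIGTERM means the host is being recycled: stop accepting new jobs,
	// but give the jobs already running up to the build timeout to finish.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	const buildTimeout = 2 * time.Hour // travis-ci.com job time limit

	var inFlight sync.WaitGroup
	// ... the job loop would go here, adding each running job to inFlight ...

	<-ctx.Done() // recycle requested
	log.Println("draining: waiting for in-flight jobs")

	done := make(chan struct{})
	go func() { inFlight.Wait(); close(done) }()

	select {
	case <-done:
		log.Println("all jobs finished; shutting down")
	case <-time.After(buildTimeout):
		log.Println("build timeout reached; shutting down anyway")
	}
}
```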
On Thursday, Feb 2nd, we started rolling out a new worker version (v2.6.2) on EC2 for sudo: false, container-based builds.
On Friday, Feb 3rd, we identified an issue with jobs being incorrectly marked as failed and started the rollback to v2.5.0. We also notified the customers affected by this via our support ticketing system. At the time we believed the rollback to be proceeding successfully, but expected it to take a few hours to fully complete.
On Saturday, Feb 4th, we discovered that the new instances with worker v2.5.0 were not actually making it into service, so builds were still being executed by worker v2.6.2.
The on-call engineer dug into why the v2.5.0 instances were failing to make it into service, identified a missing tag for the worker's image on the Docker Hub, created the missing tag, and then got the rollback working properly.
As mentioned previously, we recently switched where we pull the worker Docker images from, and this change was made with our v2.6.0 release of the worker.
Due to some stability issues with v2.6.0 and v2.6.1, these versions were not put into production for the container-based infrastructure. It wasn't until v2.6.2, which had proven stable in our other environments, that we began the upgrade for the container-based builds.
The code change for where we pull the Docker images from was made in one code base, while the publishing and tagging process lives in another. We did not foresee needing to roll back the container-based infrastructure to v2.5.0, which was originally published and tagged only on quay.io and not on the Docker Hub. Publishing it to the Docker Hub was done as part of the incident remediation.
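In practical terms, the remediation amounted to mirroring the existing quay.io tag to the Docker Hub. Here is a minimal sketch of that operation driving the docker CLI from Go; the repository names are placeholders for illustration, not necessarily the exact image references we use:

```go
package main

import (
	"log"
	"os/exec"
)

// mirrorTag copies an existing image tag from one registry to another.
// The repository names passed in below are illustrative placeholders.
func mirrorTag(src, dst string) error {
	for _, args := range [][]string{
		{"pull", src},     // fetch the image that already exists on quay.io
		{"tag", src, dst}, // give it the Docker Hub name it was missing
		{"push", dst},     // publish it so instances can find the tag again
	} {
		if out, err := exec.Command("docker", args...).CombinedOutput(); err != nil {
			log.Printf("docker %v failed: %v\n%s", args, err, out)
			return err
		}
	}
	return nil
}

func main() {
	if err := mirrorTag("quay.io/travisci/worker:v2.5.0", "travisci/worker:v2.5.0"); err != nil {
		log.Fatal(err)
	}
}
```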
We also learned that we did not have alerting covering the errors logged when image pulls failed to find v2.5.0.
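One way to close that gap is to have the pull step itself raise an alert when it fails, rather than only writing to a log that nobody is paged on. A rough sketch; the alerting endpoint and image reference are hypothetical:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os/exec"
)

// pullWithAlert pulls an image and, on failure, forwards the error to an
// alerting hook instead of letting it sit unnoticed in the instance logs.
// The alert URL is a hypothetical stand-in for a real pager/alert endpoint.
func pullWithAlert(image, alertURL string) error {
	out, err := exec.Command("docker", "pull", image).CombinedOutput()
	if err == nil {
		return nil
	}
	msg := fmt.Sprintf("image pull for %s failed: %v\n%s", image, err, out)
	log.Print(msg)
	// Fire an alert so a missing tag pages someone rather than silently
	// keeping new instances out of service.
	resp, postErr := http.Post(alertURL, "text/plain", bytes.NewBufferString(msg))
	if postErr != nil {
		log.Printf("alert delivery failed: %v", postErr)
	} else {
		resp.Body.Close()
	}
	return err
}

func main() {
	if err := pullWithAlert("travisci/worker:v2.5.0", "https://alerts.example.com/hook"); err != nil {
		log.Fatal(err)
	}
}
```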
On Sunday, Feb 5th, in order to eliminate the impact of this issue on our users more quickly, we chose to declare emergency maintenance and take more disruptive and aggressive actions to ensure that all the EC2 instances were running v2.5.0. At 00:31 UTC the rollback was fully completed and the incident was marked as Resolved.
The major contributing factors in this outage were:

- In the new worker releases, bash is run explicitly as a login shell. This change appears to affect how bash handles exit codes, in a manner we have not yet fully investigated (see the sketch after this list). The change was not detected by our staging environment tests, which revealed insufficient diversity in how our tests reflect the variety of builds our users are running.
- The switch to pulling the worker Docker images from the Docker Hub, instead of quay.io, in support of using local registry caching.
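We have not yet pinned down the exact mechanism behind the exit-code difference, but the invocation change itself is easy to show: a login shell (bash -l) additionally sources profile files such as /etc/profile, which can alter the environment the build script runs in. The Go sketch below only illustrates how the two invocations differ and where an exit code is read; the failing command is a stand-in, not our build script:

```go
package main

import (
	"fmt"
	"os/exec"
)

// runScript executes a script via bash and returns its exit code.
// loginShell toggles between "bash -c" and "bash -l -c".
func runScript(script string, loginShell bool) int {
	args := []string{"-c", script}
	if loginShell {
		args = append([]string{"-l"}, args...)
	}
	if err := exec.Command("bash", args...).Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			return exitErr.ExitCode() // exit status reported by bash
		}
		return -1 // bash could not be started at all
	}
	return 0
}

func main() {
	script := "false" // trivially failing command standing in for a build script
	fmt.Println("non-login shell exit code:", runScript(script, false))
	fmt.Println("login shell exit code:", runScript(script, true))
}
```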
We couldn't be more sorry about this incident and the impact that the build outages and delays had on you, our users and customers. We always use problems like these as an opportunity to improve, and this will be no exception.
Thank you for your continued support of Travis CI. We are working hard to live up to the trust you've placed in us and to provide an excellent build experience for your open source and private repository builds, because we know that the continuous integration and deployment tools we provide are critical to your productivity.
If you have any questions or concerns that were not addressed in this postmortem, please reach out to us via support@travis-ci.com and we'll do our best to answer them.