Yesterday we had another outage on travis-ci.com due to issues with our RabbitMQ cluster, which forced us to perform emergency maintenance and left the service degraded for a few hours afterwards. I’m very sorry for this outage, and I would like to take a moment to explain what happened, what we did to fix it, and what we are doing to improve in the future.
On March 11th at around 18:00 UTC, we received reports from customers that their builds weren’t running on travis-ci.com. We looked at our metrics and quickly realised that our RabbitMQ instance had gone offline at 17:30 UTC. We tried to bring it back up, but it wouldn’t start cleanly. One of the remediation actions after Tuesday’s RabbitMQ outage was to upgrade our cluster to run on more powerful servers, so we decided that instead of debugging why the current cluster wasn’t starting, we’d perform emergency maintenance and spin up a new cluster.
At 18:22 UTC we brought our site into maintenance mode, spun up a new RabbitMQ cluster and started distributing the new connection details to our various services. The bulk of them were finished quickly, so we brought our site out of maintenance mode by 18:46 UTC.
After bringing the site back up, we realised that we had to restart all jobs that had been running when the RabbitMQ cluster went down and requeue them on the new cluster. We started this at 19:30 UTC, and by about 19:45 UTC all stuck jobs had been restarted and requeued. Until then, any jobs that had been running as of 17:30 UTC appeared to be frozen.
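The requeue step above boils down to finding every job that was still marked as running when the old cluster died and building a restart message for each one to publish to the new cluster. Here is a minimal sketch of that selection logic — the job fields, states, and cutoff handling are illustrative assumptions, not our actual schema or tooling:

```python
from datetime import datetime, timezone

# When the old RabbitMQ cluster went offline (from the timeline above).
OUTAGE_START = datetime(2015, 3, 11, 17, 30, tzinfo=timezone.utc)

def find_stuck_jobs(jobs, cutoff=OUTAGE_START):
    """Return jobs that were still running when the cluster went down.

    Each job is a dict with illustrative fields: 'id', 'state', and
    'started_at' (a timezone-aware datetime).
    """
    return [
        job for job in jobs
        if job["state"] == "running" and job["started_at"] <= cutoff
    ]

def requeue_payloads(stuck_jobs):
    """Build the messages that would be published to the new cluster's queue."""
    return [{"job_id": job["id"], "action": "restart"} for job in stuck_jobs]
```

In production the payloads would then be published to the new cluster with a RabbitMQ client library (such as pika); the sketch only covers picking which jobs to requeue.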
Rolling out the new connection details to our workers took a little longer. Our legacy Linux infrastructure and Mac infrastructure were up and running at about 19:40 UTC, but the container infrastructure wasn’t fully online until about 20:20 UTC. This was because our rollout scripts deliberately deploy new versions of our software gradually and aren’t optimised for fast emergency rollouts. Once all the workers were running, we monitored all systems carefully and waited for the backlog of jobs to drain.
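To illustrate why a deliberately gradual rollout is slow: workers are deployed in small batches with a pause between batches, trading speed for safety. A rough sketch of that pattern follows — the batch size and pause are made-up parameters, not our actual deploy tooling:

```python
import time

def rolling_deploy(workers, deploy_fn, batch_size=2, pause_seconds=0.0):
    """Deploy to workers a few at a time, pausing between batches.

    A small batch_size with a real pause makes rollouts slow but safe.
    An emergency mode could raise batch_size to len(workers) and drop
    the pause, pushing new connection details out all at once.
    """
    batches = [workers[i:i + batch_size] for i in range(0, len(workers), batch_size)]
    for batch in batches:
        for worker in batch:
            deploy_fn(worker)  # e.g. push new config and restart the worker
        if pause_seconds:
            time.sleep(pause_seconds)
    return len(batches)
```

With five workers and a batch size of two, this runs three batches; with the batch size raised to five, everything goes out in one.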
The first thing we realised during the incident was that no alerts had fired on our end. Thirty minutes after the RabbitMQ instance went offline we had heard nothing from our monitoring system, and we only started investigating because customers were reaching out about stalled builds. Thirty minutes for such a major issue is far too long, and we’re reviewing our monitoring solutions to ensure that if something similar happens again in the future, we’ll know about it much faster.
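One simple way to catch this class of failure sooner is a heartbeat check: if the broker hasn’t been seen alive within some threshold, page someone. A minimal sketch of the decision logic — the five-minute threshold is an assumption for illustration, not our actual alerting configuration:

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, well under the 30 minutes the outage went unnoticed.
ALERT_THRESHOLD = timedelta(minutes=5)

def should_alert(last_heartbeat, now, threshold=ALERT_THRESHOLD):
    """Return True if the broker's last heartbeat is older than the threshold."""
    return (now - last_heartbeat) > threshold
```

In this incident, a check like this polling the broker every minute would have fired around 17:35 UTC instead of relying on customer reports at 18:00 UTC.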
The reason the original RabbitMQ cluster shut down is still unknown, but we’ll be working with our RabbitMQ provider to find the cause and prevent it from happening again.
We’ll also be taking what we learned in both outages this week to improve our services’ resilience against RabbitMQ issues.
The container-based infrastructure worked through the backlog much faster than the others, mainly because it can scale up to higher-than-normal capacity more quickly. We’ll continue working to make it our primary infrastructure and to resolve any issues preventing projects from using it, which will let us work through backlogs and resume normal operations much faster than before.