Yesterday we had another outage on travis-ci.com due to issues with our RabbitMQ cluster, which forced us to perform emergency maintenance and left the service degraded for a few hours afterwards. I’m very sorry for this outage, and I would like to take a moment to explain what happened, what we did to fix it, and what we are doing to improve in the future.
On March 11th at around 18:00 UTC, we received reports from customers that their builds weren’t running on travis-ci.com. We looked at our metrics and quickly realised that our RabbitMQ instance had gone offline at 17:30 UTC. We tried to bring it back up, but it wouldn’t start cleanly. One of the remediation actions after Tuesday’s RabbitMQ outage was to upgrade our cluster to run on more powerful servers, so we decided that instead of debugging why the current cluster wasn’t starting, we’d perform emergency maintenance and spin up a new cluster.
At 18:22 UTC we brought our site into maintenance mode, spun up a new RabbitMQ cluster and started distributing the new connection details to our various services. The bulk of them were finished quickly, so we brought our site out of maintenance mode by 18:46 UTC.
After bringing the site back up, we realised that we had to restart all jobs that had been running when the RabbitMQ cluster went down and requeue them on the new cluster. We started this at 19:30 UTC, and by about 19:45 UTC all stuck jobs had been restarted and requeued. Until then, any jobs that had been running as of 17:30 UTC appeared to be frozen.
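The requeue step above boils down to finding every job that was still marked as running when the old cluster died and building a restart message for each one to publish to the new cluster. Here is a minimal sketch of that selection logic — the job fields, states, and cutoff handling are illustrative assumptions, not our actual schema or tooling:

```python
from datetime import datetime, timezone

# When the old RabbitMQ cluster went offline (from the timeline above).
OUTAGE_START = datetime(2015, 3, 11, 17, 30, tzinfo=timezone.utc)

def find_stuck_jobs(jobs, cutoff=OUTAGE_START):
    """Return jobs that were still running when the cluster went down.

    Each job is a dict with illustrative fields: 'id', 'state', and
    'started_at' (a timezone-aware datetime).
    """
    return [
        job for job in jobs
        if job["state"] == "running" and job["started_at"] <= cutoff
    ]

def requeue_payloads(stuck_jobs):
    """Build the messages that would be published to the new cluster's queue."""
    return [{"job_id": job["id"], "action": "restart"} for job in stuck_jobs]
```

In production the payloads would then be published to the new cluster with a RabbitMQ client library (such as pika); the sketch only covers picking which jobs to requeue.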
Rolling out the new connection details to our workers took a little longer. Our legacy Linux infrastructure and Mac infrastructure were up and running at about 19:40 UTC, but the container infrastructure wasn’t fully online until about 20:20 UTC. This was because our rollout scripts deliberately deploy new versions of our software gradually and aren’t optimised for fast emergency rollouts. Once all the workers were running, we monitored all systems carefully and waited for the backlog of jobs to drain.
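To illustrate why a deliberately gradual rollout is slow: workers are deployed in small batches with a pause between batches, trading speed for safety. A rough sketch of that pattern follows — the batch size and pause are made-up parameters, not our actual deploy tooling:

```python
import time

def rolling_deploy(workers, deploy_fn, batch_size=2, pause_seconds=0.0):
    """Deploy to workers a few at a time, pausing between batches.

    A small batch_size with a real pause makes rollouts slow but safe.
    An emergency mode could raise batch_size to len(workers) and drop
    the pause, pushing new connection details out all at once.
    """
    batches = [workers[i:i + batch_size] for i in range(0, len(workers), batch_size)]
    for batch in batches:
        for worker in batch:
            deploy_fn(worker)  # e.g. push new config and restart the worker
        if pause_seconds:
            time.sleep(pause_seconds)
    return len(batches)
```

With five workers and a batch size of two, this runs three batches; with the batch size raised to five, everything goes out in one.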
The first thing we realised during the incident was that no alerts had fired on our end. Thirty minutes after the RabbitMQ instance went offline we had heard nothing from our monitoring system, and we only started investigating because customers were reaching out about stalled builds. Thirty minutes for such a major issue is far too long, and we’re reviewing our monitoring solutions to ensure that if something similar happens again in the future, we’ll know about it much faster.
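One simple way to catch this class of failure sooner is a heartbeat check: if the broker hasn’t been seen alive within some threshold, page someone. A minimal sketch of the decision logic — the five-minute threshold is an assumption for illustration, not our actual alerting configuration:

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, well under the 30 minutes the outage went unnoticed.
ALERT_THRESHOLD = timedelta(minutes=5)

def should_alert(last_heartbeat, now, threshold=ALERT_THRESHOLD):
    """Return True if the broker's last heartbeat is older than the threshold."""
    return (now - last_heartbeat) > threshold
```

In this incident, a check like this polling the broker every minute would have fired around 17:35 UTC instead of relying on customer reports at 18:00 UTC.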
The reason the original RabbitMQ cluster shut down is still unknown, but we’ll be working with our RabbitMQ provider to find the cause and prevent it from happening again.
We’ll also be taking what we learned in both outages this week to improve our services’ resilience against RabbitMQ issues.
The container-based infrastructure worked through the backlog much faster than the others, mainly because it can scale up to higher-than-normal capacity more quickly. We’ll continue working to make it our primary infrastructure and to resolve any issues preventing projects from using it, which will let us work through backlogs and resume normal operations much faster than before.