Slow .com build processing
Incident Report for Travis CI
Postmortem

Yesterday we had an extensive outage on travis-ci.com that affected log processing and job runs for several hours. We're very sorry for the disruption, and would like to explain what happened and share some insight into what we're doing to prevent it from happening again.

What happened?

On March 10th at 13:39 UTC, the Travis CI team was alerted that log output from jobs was piling up and not being processed on travis-ci.com. Based on this, we opened an incident on our status page at 13:43 UTC, noting that logs were being processed slower than normal. We restarted the log processors (which is normally enough for them to catch up with the backlog) and monitored the metrics. By 14:00 UTC it was clear that this hadn't helped and that the metrics were showing inconsistent data. We checked the host metrics for the RabbitMQ node and saw very high CPU usage, at which point we decided to loop in our RabbitMQ provider.
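
For illustration, here's a rough sketch of the kind of backlog check this involves, using the RabbitMQ management HTTP API. This assumes the management plugin is enabled on its default port; the host, credentials, and threshold are placeholders, not our actual tooling.

    # Sketch: list queues whose backlog has grown past a threshold, via the
    # RabbitMQ management HTTP API. Host and credentials are placeholders.
    import requests

    MGMT_URL = "http://rabbitmq.example.com:15672"
    AUTH = ("monitoring", "secret")

    def queues_with_backlog(threshold=1000):
        resp = requests.get(MGMT_URL + "/api/queues", auth=AUTH)
        resp.raise_for_status()
        # Each queue object reports how many messages are sitting in it.
        return [(q["name"], q.get("messages", 0))
                for q in resp.json()
                if q.get("messages", 0) > threshold]

    for name, depth in queues_with_backlog():
        print(name, depth)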

At first, we thought the high CPU usage on the RabbitMQ node was due to a larger-than-normal number of channels being open on it. We tried to bring this number down by restarting the services with the most channels open (in case a channel leak was the issue) and by reading through our code to find where the channels were being opened. Debugging this took up the bulk of the outage.
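
As a rough sketch of what spotting channel-heavy services can look like (again via the RabbitMQ management HTTP API, with placeholder host and credentials; this isn't our exact tooling):

    # Sketch: tally open channels per connection to spot a possible channel
    # leak. Connections holding an unusually large number of channels are
    # the first candidates for a service restart or closer code review.
    from collections import Counter
    import requests

    MGMT_URL = "http://rabbitmq.example.com:15672"
    AUTH = ("monitoring", "secret")

    channels = requests.get(MGMT_URL + "/api/channels", auth=AUTH).json()
    per_connection = Counter(ch["connection_details"]["name"] for ch in channels)

    for conn_name, count in per_connection.most_common(10):
        print(conn_name, count)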

After working on this for a while, our RabbitMQ provider identified a different issue: two runaway TLS connections on our primary RabbitMQ node were causing the high CPU usage. Once this was found, we deemed the high channel count a red herring and instead focused on the stuck connections. Our provider attempted to remove the connections from the running node, but this was unsuccessful, so we decided we had to restart the node. We started by manually restarting some of our services that don't handle RabbitMQ restarts well, to make sure they were connected to a node we weren't restarting. This involved putting the site into maintenance mode, which was enabled from 17:50 UTC to 18:30 UTC. Once we brought the site out of maintenance at 18:30 UTC, builds started running again, and we monitored everything closely as we worked through the backlog.
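
For context, one way such a removal can be attempted is through the management HTTP API's force-close endpoint, which closes a single connection without touching the rest of the node. A minimal sketch of that approach (placeholder host, credentials, and connection name; not necessarily the method our provider used):

    # Sketch: force-close one RabbitMQ connection via the management HTTP
    # API (DELETE /api/connections/<name>). All values are placeholders.
    import urllib.parse
    import requests

    MGMT_URL = "http://rabbitmq.example.com:15672"
    AUTH = ("monitoring", "secret")

    def close_connection(conn_name, reason="closed by operator"):
        encoded = urllib.parse.quote(conn_name, safe="")
        resp = requests.delete(MGMT_URL + "/api/connections/" + encoded,
                               auth=AUTH,
                               headers={"X-Reason": reason})
        resp.raise_for_status()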

How might we prevent similar issues from occurring again?

When we restarted the stuck RabbitMQ node, we also upgraded it to a larger instance type to give it more CPU and RAM headroom. We're planning to upgrade the other node in the near future.

We will also work on making our services handle RabbitMQ restarts more gracefully, so that we can move much more quickly in future incidents that require restarting a RabbitMQ node.
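
As a sketch of the behaviour we're aiming for, here is what a consumer that survives a broker restart can look like, using the Python pika client as an example (the queue name, host, and retry delay are illustrative, not our actual configuration):

    # Sketch: a consumer loop that reconnects after the broker goes away,
    # instead of crashing and needing a manual restart.
    import time
    import pika

    def handle_message(channel, method, properties, body):
        # Process the message, then acknowledge it.
        channel.basic_ack(delivery_tag=method.delivery_tag)

    def consume_forever(host="rabbitmq.example.com", queue="logs", retry_delay=5):
        while True:
            try:
                connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
                channel = connection.channel()
                channel.queue_declare(queue=queue, durable=True)
                channel.basic_consume(queue=queue, on_message_callback=handle_message)
                channel.start_consuming()
            except pika.exceptions.ConnectionClosedByBroker:
                # Broker restarted (e.g. during maintenance): wait, then reconnect.
                time.sleep(retry_delay)
            except pika.exceptions.AMQPConnectionError:
                # Broker unreachable: keep retrying rather than giving up.
                time.sleep(retry_delay)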

During the incident our status page was updated very infrequently. We are going to review our internal incident response policies to find out how best to ensure the status page is kept up to date throughout an incident.

Conclusion

Again, I'd like to apologize for the impact this outage had on your operations. We strive to provide a stable and reliable service for you all, but this time we fell short, and we are working hard to improve our internal processes and external systems to prevent outages like this from happening again. Thank you all for being awesome!

Posted Mar 11, 2015 - 23:19 UTC

Resolved
We have fully recovered and have worked through the queued backlog of build requests from GitHub.
We are very sorry for these delays today and will have a full postmortem on this incident tomorrow.
Posted Mar 10, 2015 - 23:13 UTC
Monitoring
We are out of maintenance mode and are monitoring everything closely. There's a backlog of builds at the moment, so your builds may not start immediately.
Posted Mar 10, 2015 - 18:34 UTC
Update
We are performing emergency maintenance to bring the service back online as soon as we can.
Posted Mar 10, 2015 - 17:26 UTC
Identified
We have identified the component slowing down our build processing. We are working with our service provider to resolve the issue.
Posted Mar 10, 2015 - 16:54 UTC
Investigating
Logs on .com repositories are being processed slowly.
Posted Mar 10, 2015 - 13:43 UTC