Build delays in Linux, Mac and Windows environments
Incident Report for Travis CI
Postmortem

We've conducted a review of this incident and have some insight to share. First, we'd like to thank affected customers for their patience while we were investigating the issue and want to acknowledge the delayed status update as a learning experience for us so we can improve going forward.

Travis Hub is the component that deals with updates to the job and build status coming in from the workers.

To process and update the jobs’ statuses, we have a process in place subscribed to a RabbitMQ queue. Once we restarted this process, our service recovered. We believe the RabbitMQ connection hung and was unable to recover by itself.
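To illustrate how a hung consumer of this kind could be detected automatically, here is a minimal sketch of a stall watchdog. It is not Hub's actual code: the `ConsumerWatchdog` class, its method names, and the `stall_after` threshold are all hypothetical, and the queue depth is assumed to come from whatever monitoring source is available.

```python
import time
from typing import Callable


class ConsumerWatchdog:
    """Hypothetical helper: flags a consumer as stalled when messages are
    waiting but none have been processed for more than `stall_after` seconds."""

    def __init__(self, stall_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.stall_after = stall_after
        self.clock = clock
        self.last_progress = clock()

    def record_progress(self) -> None:
        # Call this each time a status message is handled successfully.
        self.last_progress = self.clock()

    def is_stalled(self, queue_depth: int) -> bool:
        # An empty queue means idleness is expected, not a hang.
        if queue_depth == 0:
            return False
        return self.clock() - self.last_progress > self.stall_after
```

A supervisor could poll `is_stalled` periodically and restart the consumer process when it returns `True`, which automates the manual restart described above. Checking the queue depth as well as elapsed time avoids restarting a consumer that is merely idle.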

We have witnessed similar behaviour before when networking issues were observed, and there were several reports of worldwide internet networking problems yesterday; however, we cannot confirm the correlation.

A misconfigured alert did not trigger, so we were not notified of the situation. Combined with the inability of this process to recover by itself, this caused many messages to pile up. Once Hub recovered, other parts of the system were overwhelmed by the accumulated messages, which caused a job backlog on both GCE (Linux and Windows) and Mac.
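As a rough illustration of the kind of alert that could catch a growing message backlog, the sketch below evaluates recent queue-depth samples. It is an assumption-laden example, not a description of our alerting setup: the `backlog_alert` function, its parameters, and the idea of sampling queue depth at fixed intervals are all hypothetical.

```python
from typing import Optional, Sequence


def backlog_alert(samples: Sequence[Optional[int]],
                  threshold: int, sustained: int) -> str:
    """Return "ok", "firing", or "no_data" for recent queue-depth samples.

    `samples` is ordered oldest to newest; None marks a failed measurement.
    """
    recent = list(samples)[-sustained:]
    # Missing data is reported explicitly so that a broken metric source
    # cannot be mistaken for a healthy queue.
    if len(recent) < sustained or any(s is None for s in recent):
        return "no_data"
    # Fire only when every recent sample exceeds the threshold, which
    # filters out short spikes.
    if all(s is not None and s > threshold for s in recent):
        return "firing"
    return "ok"
```

Returning a distinct `"no_data"` state is a deliberate design choice in this sketch: an alert that silently stays quiet when its input is missing is hard to tell apart from one that is working and has nothing to report.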

There are two areas that we’re looking into to prevent this situation from happening again. First, we’re improving our alerting and monitoring for Hub so that we’re able to address this situation much sooner. Second, over the next few weeks, the engineering team will analyze the changes needed to make this processing system more reliable, including the possibility of removing RabbitMQ from our architecture altogether. We're also reviewing and working towards improving our incident response process, considering we were late in communicating this incident.

Once again, we apologize for the inconvenience and frustration caused by this incident. We are working towards improved communication and technical changes that should help us mitigate similar problems going forward.

In the meantime, if you have any questions or concerns, please do not hesitate to get in touch at support@travis-ci.com. We’re here to help.

Thank you,

The Travis CI Team

Posted Jun 26, 2019 - 13:33 UTC

Resolved
This incident has been resolved.
Posted Jun 24, 2019 - 18:29 UTC
Update
Our Mac builds have stabilised as well and we are back to normal build times across all infrastructure. We are going to monitor for a few more minutes after which this incident will be marked resolved.
Posted Jun 24, 2019 - 17:44 UTC
Monitoring
Our queues have completely drained of backlogged jobs. We are seeing steady build completion times for our GCE Linux infrastructure and are monitoring our systems.
Posted Jun 24, 2019 - 17:28 UTC
Identified
The backlog of jobs is draining as we have resolved all alerts and notifications related to this incident. We expect this to normalise within the hour. More updates to follow.
Posted Jun 24, 2019 - 16:45 UTC
Investigating
We're looking into delayed startup times for builds in all infrastructures and a high API latency. We're currently investigating and we'll post updates as soon as possible.
Posted Jun 24, 2019 - 15:43 UTC
This incident affected: Builds Processing (Mac Builds, Linux and Windows Builds).