We've conducted a review of this incident and have some insight to share. First, we'd like to thank affected customers for their patience while we were investigating the issue. We also want to acknowledge that our status update was delayed; we are treating this as a learning experience so we can improve going forward.
Travis Hub is the component that deals with updates to the job and build status coming in from the workers.
To process and update job statuses, we have a process in place that is subscribed to a RabbitMQ queue. We believe this process's RabbitMQ connection hung and was unable to recover by itself. Once we restarted the process, our service recovered.
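For readers who want a concrete picture of this failure mode, here is a minimal Python sketch of a watchdog that restarts a consumer that has gone silent. It is not the actual Hub code: the class name `ConsumerWatchdog`, the `restart` callback, and the idle threshold are all hypothetical, and it assumes the consumer calls `record_activity()` on every message or heartbeat, so that silence really does indicate a hung connection rather than an empty queue.

```python
import time
from typing import Callable


class ConsumerWatchdog:
    """Flags a consumer that has shown no activity for too long (hypothetical helper)."""

    def __init__(self, max_idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_idle_seconds = max_idle_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_activity = clock()

    def record_activity(self) -> None:
        # Call this whenever a message or heartbeat is handled.
        self._last_activity = self._clock()

    def is_stalled(self) -> bool:
        return self._clock() - self._last_activity > self.max_idle_seconds

    def check(self, restart: Callable[[], None]) -> bool:
        # Invoke the restart callback if the consumer looks hung.
        # Returns True when a restart was triggered.
        if self.is_stalled():
            restart()
            self._last_activity = self._clock()  # give the new consumer a fresh window
            return True
        return False
```

In such a setup, `check` would be called periodically from a supervisor outside the consumer's own connection, so that a hung connection cannot also hang the check.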
We have witnessed similar behaviour before when networking issues were observed, and there were several reports of worldwide internet networking problems yesterday; however, we cannot confirm the correlation.
Because this process was unable to recover by itself, and because a misconfigured alert did not trigger to notify us of the situation, many messages piled up. Even after Hub recovered, other parts of the system were overwhelmed, which caused a job backlog on both GCE (Linux and Windows) and Mac.
There are two areas that we’re looking into to prevent this situation from happening in the future. First, we’re improving our alerting and monitoring for Hub so that we’re able to address this situation much sooner. Second, over the next few weeks, the engineering team will analyze the changes needed to make this processing system more reliable, including the possibility of removing RabbitMQ from our architecture altogether. We're also reviewing and working towards improving our incident response process, considering we were late in communicating this incident.
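To illustrate the kind of backlog alert discussed above, here is a minimal Python sketch of a rule that fires when queue depth stays above a threshold for several consecutive samples. It is an illustration under stated assumptions, not the actual alerting configuration: the function name `backlog_alert`, the threshold, and the way depth samples are collected are all hypothetical.

```python
from typing import Sequence


def backlog_alert(depths: Sequence[int], threshold: int, sustained: int) -> bool:
    """Return True when the most recent `sustained` samples all exceed `threshold`.

    `depths` is a chronological series of queue-depth samples (hypothetical input).
    """
    if sustained <= 0:
        raise ValueError("sustained must be positive")
    if len(depths) < sustained:
        # Not enough data yet to call the backlog sustained.
        return False
    return all(d > threshold for d in depths[-sustained:])
```

Requiring several consecutive samples avoids paging on a brief spike, at the cost of a short delay before a real backlog is reported.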
Once again, we apologize for the inconvenience and frustration caused by this incident. We are working towards improved ongoing communication and technical changes that should help us mitigate similar problems going forward.
In the meantime, if you have any questions or concerns, please do not hesitate to get in touch at firstname.lastname@example.org. We’re here to help.
The Travis CI Team