Build delays on travis-ci.org OSX builds

Incident Report for Travis CI

Postmortem

tl;dr:

The "Active OS X Builds for Open Source Projects" graph on our status page covers the time window of the reduced capacity. Its plateaus show capacity of 60, 20, and 60 from left to right.

what happened

We began receiving some isolated reports via email and chat at 22:33 UTC of OS X builds being stalled, or what appeared to be a stuck build queue. Our initial response was delayed by 40 minutes while a team member familiar with the OS X setup traveled back home to get online.

At 23:14 UTC we began investigating the reports, and found that the queue looked as though at least one of the worker processes had died.

Upon investigating the logs for the worker processes, it was not clear that one of the worker processes had died. In fact, all worker processes were alive and we initially thought that jobs were being processed by all of them.

A few red herrings came up while investigating logs, and the investigation was hampered by the fact that we have two separate Librato and Papertrail accounts for our public and private repositories. Collapsing these accounts together is part of a long-term plan to consolidate infrastructures.

After following a few such red herrings, we looked more closely at the worker logs and noticed that one worker in particular wasn't producing any log output. Its most recent log message reported that the process was shutting down, but the process had never actually exited, so no respawn was triggered.
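The symptom described here, a worker that is still alive but has stopped logging, can be checked for mechanically. The following is a minimal sketch only, not our actual tooling: the function name, the 10-minute silence limit, the second worker name, and all timestamps are made up for illustration, and it assumes the time of each worker's most recent log line is already available.

```python
from datetime import datetime, timedelta, timezone


def silent_workers(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the names of workers whose newest log line is older than max_silence.

    last_seen maps a worker name to the timestamp of its most recent log line.
    """
    return sorted(name for name, ts in last_seen.items() if now - ts > max_silence)


# Hypothetical data: one worker went quiet hours ago, the other logged a minute ago.
now = datetime(2015, 7, 2, 0, 30, tzinfo=timezone.utc)
last_seen = {
    "travis-worker-org-prod-1": datetime(2015, 7, 1, 22, 20, tzinfo=timezone.utc),
    "travis-worker-org-prod-2": datetime(2015, 7, 2, 0, 29, tzinfo=timezone.utc),
}
print(silent_workers(last_seen, now))
```

A check like this keys on silence, not on process state, so it would still flag a process that the supervisor considers running.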

We restarted the production worker that had gone into a zombie state (travis-worker-org-prod-1) at 00:41 UTC. The total time of the reduced capacity was roughly 4 hours.

what we're doing about it

As currently written, the worker code deals with most process-wide failures by exiting so that the process supervisor (upstart in this case) respawns it. The exact reason why the process became a zombie is not yet known, so we have put alerts in place for when worker capacity drops below the expected threshold.
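To illustrate the kind of capacity alert mentioned here, below is a rough sketch, not the alert we actually configured. It assumes periodic samples of (running jobs, queued jobs) and uses the full capacity of 60 from the graph. The 90% fraction, the three-sample streak, and the function name are hypothetical choices. It only fires while a backlog exists, since a low number of running jobs with an empty queue is just a quiet period.

```python
def capacity_alert(samples, expected=60, min_fraction=0.9, consecutive=3):
    """Return True if capacity looks reduced for `consecutive` samples in a row.

    samples is an iterable of (running, queued) pairs, oldest first.
    A sample counts as degraded only when jobs are waiting and fewer than
    expected * min_fraction jobs are running.
    """
    threshold = expected * min_fraction
    streak = 0
    for running, queued in samples:
        if queued > 0 and running < threshold:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0  # capacity recovered or queue drained: start over
    return False


# Hypothetical samples: capacity drops from 60 to 20 while the queue grows.
degraded = [(60, 5), (20, 30), (20, 45), (20, 60)]
quiet = [(60, 0), (20, 0), (20, 0), (20, 0)]
print(capacity_alert(degraded), capacity_alert(quiet))
```

Requiring several consecutive degraded samples trades a few minutes of detection delay for fewer false alarms from brief dips.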

We already have a plan in place to consolidate our Librato and Papertrail accounts to reduce confusion about which data we're viewing. It is a long process that has broad customer-facing impact, so we don't expect this to be solved for some months. In the meantime, we will be investigating ways to access our Librato graphs for both accounts through chat to help with initial incident investigation. We are also busy bringing more team members up to speed on the OS X setup in order to lower the bus factor.

As always, please reach out to us via support or Twitter with any questions or concerns. Thank you for your patience.

Posted Jul 02, 2015 - 16:17 UTC

Resolved

At this time the backlog has been processed and new builds are running as expected.

Thank you to those who notified us via support email and chat about the delays.

As part of our incident review process we'll be reviewing our capacity monitoring to ensure this kind of delay is detected earlier in the future.

Posted Jul 02, 2015 - 02:37 UTC

Monitoring

At this time travis-ci.org OSX builds are running at full capacity and we're working through the backlog of OSX builds.

We're continuing to monitor things closely until the backlog has been processed.

Posted Jul 02, 2015 - 01:08 UTC

Identified

We've identified an issue that is causing lower capacity and build delays for travis-ci.org OSX builds and are actively working on resolving the issue and restoring full capacity for these builds.

Posted Jul 02, 2015 - 00:53 UTC