The above graph was taken from the "Active OS X Builds for Open Source Projects" graph on our status page, showing a time window of the reduced capacity. The plateaus show capacity of 60, 20, and 60 from left to right.
We began receiving some isolated reports via email and chat at 22:33 UTC of OS X builds being stalled, or what appeared to be a stuck build queue. Our initial response was delayed by 40 minutes while a team member familiar with the OS X setup traveled back home to get online.
At 23:14 UTC we began investigating the reports, and found that the queue looked suspiciously like at least one of the worker processes had died:
Upon investigating the logs for the worker processes, it was not clear that one of the worker processes had died. In fact, all worker processes were alive and we initially thought that jobs were being processed by all of them.
A few red herrings came up while investigating logs, and the investigation was hampered by the fact that we have two separate Librato and Papertrail accounts for our public and private repositories. Collapsing these accounts together is part of a long-term plan to consolidate infrastructures.
After following a few such red herrings, we looked more closely at the worker logs and noticed that one worker in particular wasn't producing any log output. Its most recent log message reported that the process was shutting down, but it had never actually exited, so no respawn was triggered.
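The check that eventually exposed the silent worker can be sketched as a comparison of each worker's most recent log timestamp against a maximum allowed silence. This is only an illustration, not the tooling we used: the function name `find_silent_workers`, the 10-minute threshold, the second worker name, and the timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def find_silent_workers(last_log_at, now, max_silence=timedelta(minutes=10)):
    """Return the names of workers whose latest log line is older than max_silence.

    last_log_at maps a worker name to the timestamp of its most recent log line.
    """
    return sorted(
        name for name, seen in last_log_at.items()
        if now - seen > max_silence
    )


# Illustrative data only: one worker has been silent for hours,
# the other (a hypothetical name) logged a few seconds ago.
now = datetime.now(timezone.utc)
last_log_at = {
    "travis-worker-org-prod-1": now - timedelta(hours=3),
    "travis-worker-org-prod-2": now - timedelta(seconds=30),
}

silent = find_silent_workers(last_log_at, now)
print(silent)
```

A check like this flags a process that is alive but no longer doing work, which a supervisor that only watches for process exit would miss.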
The restart of the production worker that had gone into a zombie state (travis-worker-org-prod-1) happened at 00:41 UTC. The total time of the reduced capacity was roughly 4 hours.
As currently written, the worker code deals with most process-wide failures by exiting so that the process supervisor (upstart in this case) respawns it. The exact reason why the process became a zombie is not yet known, so we have put alerts in place for when worker capacity drops below the expected threshold.
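A capacity alert of this kind can be sketched as a simple threshold comparison. The snippet below is a minimal illustration and not our actual alerting configuration: the function name `capacity_alert` and the `tolerance` parameter are hypothetical, and the values 60 and 20 are taken from the plateaus in the graph above.

```python
def capacity_alert(observed, expected, tolerance=0.0):
    """Return an alert message if observed capacity is below the expected floor.

    tolerance is the fraction of expected capacity that may be missing
    before an alert fires (0.0 means any shortfall alerts).
    Returns None when capacity is at or above the floor.
    """
    floor = expected * (1 - tolerance)
    if observed < floor:
        return f"worker capacity {observed} is below expected {expected}"
    return None


# Capacity during the incident versus normal capacity.
print(capacity_alert(20, 60))
print(capacity_alert(60, 60))
```

Alerting on the observed capacity rather than on process liveness means the alert still fires when a worker is running but has stopped accepting jobs.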
We already have a plan in place to consolidate our Librato and Papertrail accounts to reduce confusion about which data we're viewing. It is a long process that has broad customer-facing impact, so we don't expect this to be solved for some months. In the meantime, we will be investigating ways to access our Librato graphs for both accounts through chat to help with initial incident investigation. We are also busy bringing more team members up to speed on the OS X setup in order to lower the bus factor.