Due to rapid growth, our SAN was becoming overloaded, and the migration to a new SAN solution was (and still may be) a rocky road.
We noticed errors pointing to build VM boot timeouts at 23:21 UTC on the 20th. After discussing this with our infrastructure provider, they showed us that our SAN (a NetApp appliance) was being overloaded by a spike in disk operations per second.
(Graphs: NetApp CPU utilization at 100% for ~8 hours, and elevated IOps on the NetApp over the same period.)
Our infrastructure provider then offered us a different SAN solution (an EMC VMAX), although getting it in place would require some configuration time. As a stopgap, we reduced the number of build VMs booted at any one time and posted a public notice about degraded performance. Over the course of approximately 5 hours we caught up on our build backlog, then logged off for the night, pending access to the VMAX.
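To give a concrete (if simplified) picture of what that stopgap looked like, here is a minimal sketch of boot throttling, assuming a hypothetical boot_build_vm helper in place of our real clone-and-power-on logic; it is not our actual worker code:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Cap how many build VMs are booting concurrently so the SAN isn't hit
# with a burst of clone/boot disk operations all at once.
MAX_CONCURRENT_BOOTS = 4          # illustrative stopgap ceiling
boot_slots = threading.Semaphore(MAX_CONCURRENT_BOOTS)

def boot_build_vm(image_name):
    # Placeholder for the real clone-and-power-on logic.
    time.sleep(1)
    return f"{image_name} booted"

def boot_with_limit(image_name):
    with boot_slots:              # block until a boot slot frees up
        return boot_build_vm(image_name)

if __name__ == "__main__":
    pending = [f"build-image-{n}" for n in range(20)]
    with ThreadPoolExecutor(max_workers=16) as pool:
        for result in pool.map(boot_with_limit, pending):
            print(result)
```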
The next day at peak hours, we received reports of Mac jobs being slow to boot. We confirmed that this was because the SAN was saturated once again, even though we were running at reduced capacity.
(Graph: Mac job requeues skyrocketing due to boot timeouts.)
Additionally, we encountered errors where build images could not be found, due to a misconfiguration. We pushed out a corrected configuration and added an IOps limit to some base VMs, in the hope that this would reduce strain on the NetApp SAN. We also added per-image metrics so that we could track which images were the biggest offenders. After working through our backlog over the course of ~4 hours, we logged off for the night, as the anticipated upgrade to the VMAX SAN wasn't quite ready.
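For those curious how a per-VM IOps limit is applied, vSphere exposes a storage I/O allocation limit on each virtual disk. The following is a rough pyVmomi sketch of reconfiguring a VM's disks with such a limit; it isn't our production code, and the 2,000 IOps figure is purely illustrative:

```python
from pyVmomi import vim

def limit_vm_iops(vm, iops_limit=2000):
    """Apply a per-disk IOps cap to an existing vim.VirtualMachine."""
    device_changes = []
    for device in vm.config.hardware.device:
        if isinstance(device, vim.vm.device.VirtualDisk):
            # limit is in IOps; -1 means unlimited
            device.storageIOAllocation = vim.StorageResourceManager.IOAllocationInfo(
                limit=iops_limit)
            device_changes.append(vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
                device=device))
    spec = vim.vm.ConfigSpec(deviceChange=device_changes)
    return vm.ReconfigVM_Task(spec)   # returns a vSphere task to wait on
```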
Everything looked stable, although we were still operating at reduced capacity to minimize load on the NetApp SAN. At around 17:00 UTC we received word from our infrastructure provider that the VMAX SAN was ready for testing. Our initial tests showed that boot times for images on the VMAX were about the same as on the NetApp appliance, even with the IOps limit removed. We considered this acceptable, as we expected the newer SAN to handle more capacity than its predecessor. We began to move more of our high-demand images to the VMAX.
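Moving an image to the VMAX essentially comes down to a storage relocation of the base VM onto the new datastore. Here is a rough pyVmomi sketch, assuming the VM and the target datastore have already been looked up (again, not our exact tooling):

```python
from pyVmomi import vim
from pyVim.task import WaitForTask

def migrate_image_to_vmax(vm, vmax_datastore):
    """Storage-migrate a base image VM onto the new (VMAX-backed) datastore."""
    spec = vim.vm.RelocateSpec(datastore=vmax_datastore)
    task = vm.RelocateVM_Task(spec)   # moves the VM's disks between datastores
    WaitForTask(task)                 # block until the migration finishes
    return task.info.state
```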
Our infrastructure provider contacted us shortly after we began the image migration to let us know that they were seeing heightened errors related to cloning a particular image. We were also experiencing an elevated rate of boot timeouts, due to an error case we hadn't previously encountered in which VMs were being created but never powered on. While we were attempting to diagnose this, the error case quickly overwhelmed our worker component, which began leaking powered-off VMs, creating thousands of dud VMs without cleaning them up and generally overloading our vSphere cluster.
At that point, we decided to halt all job executions until we could manually clean up the non-booting VMs. After about 30 minutes we had cleaned them all up, at which point we deployed a patched version of the worker.
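The manual cleanup amounted to finding VMs that had been created but never powered on and destroying them. A rough pyVmomi sketch of that kind of sweep, with illustrative connection details and name prefix (not our actual values):

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def cleanup_dud_vms(host, user, password, name_prefix="build-vm-"):
    """Destroy powered-off build VMs that never managed to boot."""
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host=host, user=user, pwd=password, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if (vm.name.startswith(name_prefix)
                    and vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOff):
                WaitForTask(vm.Destroy_Task())   # delete the VM and its disks
    finally:
        Disconnect(si)
```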
We are planning to keep an IOps limit on individual VMs for the foreseeable future, so as to avoid saturating our SAN. Performance over the past day has been stable and we have been able to raise our capacity limits above previous levels. We are continuing to migrate remaining images to the EMC VMAX, and we are working with our infrastructure provider to evaluate further optimizations.