Due to rapid growth, our SAN was becoming overloaded, and the migration to a new SAN solution was (and still may be) a rocky road.
We noticed errors pointing to build VM boot timeouts at 23:21 UTC on the 20th. After discussing this with our infrastructure provider, they showed us that our SAN (a NetApp appliance) was being overloaded by a spike in disk operations per second.
(Graphs: NetApp CPU utilization at 100% for ~8 hours, and elevated IOps on the NetApp over the same period.)
Our infrastructure provider then offered us a different SAN solution (an EMC VMAX), although getting it in place would require some configuration time. As a stopgap, we reduced the number of build VMs booted at any one time and posted a public notice about degraded performance. Over the course of approximately 5 hours we caught up on our build backlog, then logged off for the night, pending access to the VMAX.
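To give a concrete (if simplified) picture of what that stopgap looked like, here is a minimal sketch of boot throttling, assuming a hypothetical boot_build_vm helper in place of our real clone-and-power-on logic; it is not our actual worker code:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Cap how many build VMs are booting concurrently so the SAN isn't hit
# with a burst of clone/boot disk operations all at once.
MAX_CONCURRENT_BOOTS = 4          # illustrative stopgap ceiling
boot_slots = threading.Semaphore(MAX_CONCURRENT_BOOTS)

def boot_build_vm(image_name):
    # Placeholder for the real clone-and-power-on logic.
    time.sleep(1)
    return f"{image_name} booted"

def boot_with_limit(image_name):
    with boot_slots:              # block until a boot slot frees up
        return boot_build_vm(image_name)

if __name__ == "__main__":
    pending = [f"build-image-{n}" for n in range(20)]
    with ThreadPoolExecutor(max_workers=16) as pool:
        for result in pool.map(boot_with_limit, pending):
            print(result)
```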
The next day at peak hours, we received reports of Mac jobs being slow to boot. We confirmed that this was because the SAN was saturated once again, even though we were running at reduced capacity.
(Graph: Mac job requeues skyrocketing due to boot timeouts.)
Additionally, we encountered errors where build images could not be found, due to a misconfiguration. We pushed out a corrected configuration and added an IOps limit to some base VMs, in the hope that this would reduce strain on the NetApp SAN. We also added per-image metrics so that we could track which images were the biggest offenders. After working through our backlog over the course of ~4 hours, we logged off for the night, as the anticipated upgrade to the VMAX SAN wasn't quite ready.
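For those curious how a per-VM IOps limit is applied, vSphere exposes a storage I/O allocation limit on each virtual disk. The following is a rough pyVmomi sketch of reconfiguring a VM's disks with such a limit; it isn't our production code, and the 2,000 IOps figure is purely illustrative:

```python
from pyVmomi import vim

def limit_vm_iops(vm, iops_limit=2000):
    """Apply a per-disk IOps cap to an existing vim.VirtualMachine."""
    device_changes = []
    for device in vm.config.hardware.device:
        if isinstance(device, vim.vm.device.VirtualDisk):
            # limit is in IOps; -1 means unlimited
            device.storageIOAllocation = vim.StorageResourceManager.IOAllocationInfo(
                limit=iops_limit)
            device_changes.append(vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
                device=device))
    spec = vim.vm.ConfigSpec(deviceChange=device_changes)
    return vm.ReconfigVM_Task(spec)   # returns a vSphere task to wait on
```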
Everything looked stable, although we were still operating at reduced capacity to minimize load on the NetApp SAN. At around 17:00 UTC we received word from our infrastructure provider that the VMAX SAN was ready for testing. Our initial tests showed that boot times for images on the VMAX were about the same as on the NetApp appliance, even with the IOps limit removed. We considered this acceptable, as we expected the newer SAN to handle more capacity than its predecessor. We began to move more of our high-demand images to the VMAX.
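Moving an image to the VMAX essentially comes down to a storage relocation of the base VM onto the new datastore. Here is a rough pyVmomi sketch, assuming the VM and the target datastore have already been looked up (again, not our exact tooling):

```python
from pyVmomi import vim
from pyVim.task import WaitForTask

def migrate_image_to_vmax(vm, vmax_datastore):
    """Storage-migrate a base image VM onto the new (VMAX-backed) datastore."""
    spec = vim.vm.RelocateSpec(datastore=vmax_datastore)
    task = vm.RelocateVM_Task(spec)   # moves the VM's disks between datastores
    WaitForTask(task)                 # block until the migration finishes
    return task.info.state
```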
Our infrastructure provider contacted us shortly after we began the image migration to let us know that they were seeing heightened errors related to cloning a particular image. We were also experiencing an elevated rate of boot timeouts, due to an error case we hadn't previously encountered in which VMs were being created but never powered on. While we were attempting to diagnose this, the error case quickly overwhelmed our worker component, which began leaking powered-off VMs, creating thousands of dud VMs without cleaning them up and generally overloading our vSphere cluster.
At that point, we decided to halt all job executions until we could manually clean up the non-booting VMs. After about 30 minutes we had cleaned them all up, at which point we deployed a patched version of the worker.
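The manual cleanup amounted to finding VMs that had been created but never powered on and destroying them. A rough pyVmomi sketch of that kind of sweep, with illustrative connection details and name prefix (not our actual values):

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def cleanup_dud_vms(host, user, password, name_prefix="build-vm-"):
    """Destroy powered-off build VMs that never managed to boot."""
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host=host, user=user, pwd=password, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if (vm.name.startswith(name_prefix)
                    and vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOff):
                WaitForTask(vm.Destroy_Task())   # delete the VM and its disks
    finally:
        Disconnect(si)
```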
We are planning to keep an IOps limit on individual VMs for the foreseeable future, so as to avoid saturating our SAN. Performance over the past day has been stable and we have been able to raise our capacity limits above previous levels. We are continuing to migrate remaining images to the EMC VMAX, and we are working with our infrastructure provider to evaluate further optimizations.