Partial API/logs service outage for travis-ci.org

Incident Report for Travis CI

Postmortem

We've just published an article about this incident, how we dealt with it and what we plan to do to improve our service: https://blog.travis-ci.com/2017-03-06-api-logs-outage

Posted Mar 06, 2017 - 14:48 UTC

Resolved

We've been monitoring our system for a number of hours and things are now stable. Thanks again for your patience over the last few days.

Posted Mar 02, 2017 - 09:49 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Mar 02, 2017 - 04:28 UTC

Update

We've resumed all build processing at this point. Builds are starting and running as expected. Log display via the API and web UI is functional as well. We will be monitoring things closely for the next few hours and into tomorrow. Thank you to everyone for your patience, understanding, and the many kind words via Twitter.

Posted Mar 02, 2017 - 02:36 UTC

Update

The database work is done. We are in the process of resuming services and beginning to process jobs again. We're still verifying things and will post another update once we're confident jobs are being processed as expected.

Posted Mar 02, 2017 - 02:20 UTC

Update

Our database provider has asked to make some changes to the existing primary logs DB that require us to stop processing new jobs temporarily.

As a result, all builds will be paused, and log display will return an error from the API or web UI. We'll post an update once we've resumed builds.

Posted Mar 02, 2017 - 01:48 UTC

Update

We are currently waiting on a new replica logs database to finish provisioning, and we plan to fail over to it once it is ready, which we expect to happen in roughly 5 hours.

Until then, delays in log display and some errors from the API/web UI should be expected. We are sorry for the extended length of this incident and appreciate your patience while we work through it with our database infrastructure provider.

Posted Mar 02, 2017 - 01:07 UTC

Update

We are still working on a fix with our infrastructure provider.

Posted Mar 01, 2017 - 21:41 UTC

Update

We're currently mostly stable, and we're actively working with our infrastructure provider on a more complete fix. Thanks for hanging in there with us!

Posted Mar 01, 2017 - 20:14 UTC

Update

We have found a way to mitigate our degraded API performance in the short term. We continue to monitor performance and wait for the emergency failover database to provision. Logs are still delayed in our web front end, and we will report back as soon as we can.

Posted Mar 01, 2017 - 15:52 UTC

Update

Our ongoing database connection issues are due to emergency maintenance following the recent AWS outage. We are working with our upstream provider to rectify a kernel bug and are currently waiting for a new failover database to be provisioned. We expect this to take some time, and will continue to post updates as we have them.

Posted Mar 01, 2017 - 14:48 UTC

Identified

We have traced the partial outage to an intermittent database connection issue, and we're working to resolve it.

Posted Mar 01, 2017 - 11:53 UTC

Investigating

We are experiencing a partial API outage on travis-ci.org, which is affecting performance of our web front end.

Posted Mar 01, 2017 - 09:16 UTC