Outage Report

Published on October 17, 2013.

Today at 10:03 AM Pacific we began experiencing an outage that resulted in about 15 minutes of intermittent downtime in total. Our TraceView graphs show a clear MySQL anomaly.

[Image: TraceView graph showing the MySQL anomaly]

We use Amazon’s RDS product for managing our MySQL needs and Sentry for tracking exceptions from Django. When we looked into the graphs and error reports, we noticed MySQL was having disk space issues. Our first thought was that we’d run out of disk space on our DB server, but Nagios hadn’t reported any warnings about us getting close to running out of disk. Upon inspecting the RDS graphs we found something strange.

[Image: RDS graph showing the unexpected storage behavior]
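
If you want to keep an eye on this yourself, the free-storage number RDS graphs is also exposed through CloudWatch. Here is a rough sketch of pulling it with the AWS SDK for Python (boto3); the region and instance identifier are placeholders, not our production values:

```python
from datetime import datetime, timedelta

import boto3

# Hypothetical identifiers for illustration only.
REGION = "us-east-1"
DB_INSTANCE_ID = "example-db-instance"

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

# Minimum free storage (in bytes) over 5-minute windows for the last 6 hours.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"] / (1024 ** 3), "GiB free")
```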

Our best guess right now is that some operation caused a large temporary table to be created at 17:00 UTC. Our RDS snapshots are scheduled for 13:30-14:00 UTC, our maintenance window is Tuesdays from 10:30-11:00 UTC, and our weekly full DB backups run on Friday at 16:00 UTC. This leaves us reasonably confident it wasn’t a scheduled maintenance issue.
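
One way to sanity-check the temporary-table theory is MySQL’s own status counters: a jump in Created_tmp_disk_tables around 17:00 UTC would line up with a large temp table spilling to disk. Here is a rough sketch of the kind of query involved, using PyMySQL with placeholder credentials rather than the exact commands we ran:

```python
import pymysql

# Placeholder connection details for illustration.
conn = pymysql.connect(
    host="example-rds-host",
    user="report",
    password="REDACTED",
    database="mysql",
)

try:
    with conn.cursor() as cursor:
        # Counters for temporary tables; Created_tmp_disk_tables counts temp
        # tables that spilled from memory to disk, which eats free storage.
        cursor.execute("SHOW GLOBAL STATUS LIKE 'Created_tmp%'")
        for name, value in cursor.fetchall():
            print(name, value)

        # Thresholds that decide when an in-memory temp table goes to disk.
        cursor.execute(
            "SHOW VARIABLES WHERE Variable_name IN "
            "('tmp_table_size', 'max_heap_table_size')"
        )
        for name, value in cursor.fetchall():
            print(name, value)
finally:
    conn.close()
```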

Regardless, we’ll be taking the site offline tonight in order to dramatically increase the storage available on our RDS servers. Customers can expect about 30 minutes of downtime around 18:00 Pacific (01:00 UTC). Thank you for your patience, and our apologies for any inconvenience this has caused!
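
For the curious, increasing RDS storage comes down to a single API call (or the equivalent console action). A rough sketch with the AWS SDK for Python, using a made-up instance identifier and storage size:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Placeholder instance identifier and new storage size for illustration.
response = rds.modify_db_instance(
    DBInstanceIdentifier="example-db-instance",
    AllocatedStorage=500,   # new size in GiB
    ApplyImmediately=True,  # apply now rather than waiting for the maintenance window
)

# RDS reports the requested change here until it finishes applying.
print(response["DBInstance"]["PendingModifiedValues"])
```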

UPDATE: We have double-checked our cron jobs and confirmed that none run at this time. Additionally, no large API requests were sent in this timeframe; the API also has checks in place that limit the amount of data sent across the wire to mitigate abuse.