Previous incidents
Slower dequeuing and API responses
Resolved Sep 15 at 06:40pm BST
The database load is back to normal.
We're still investigating why this happened; the top theory at the moment is an auto-vacuuming issue, possibly related to transaction ID wraparound.
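For context on the wraparound theory: PostgreSQL uses 32-bit transaction IDs, so autovacuum has to freeze old rows before a table's oldest unfrozen XID consumes too much of the roughly 2-billion-ID horizon, and anti-wraparound vacuums can add significant database load. A minimal sketch of that arithmetic, using Postgres defaults (this is illustrative, not our monitoring code):

```python
# Illustrative sketch of PostgreSQL's transaction-ID wraparound math.
# At any moment roughly 2**31 XIDs are "in the past"; once a table's
# oldest unfrozen XID age crosses autovacuum_freeze_max_age, Postgres
# forces aggressive anti-wraparound vacuums, which can spike load.

WRAPAROUND_HORIZON = 2**31                # ~2.1B XIDs: hard failure point
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000   # Postgres default trigger

def wraparound_headroom(oldest_xid_age: int) -> float:
    """Fraction of the XID horizon already consumed (1.0 = wraparound)."""
    return oldest_xid_age / WRAPAROUND_HORIZON

def needs_aggressive_vacuum(oldest_xid_age: int) -> bool:
    """True once age passes the threshold that forces anti-wraparound vacuums."""
    return oldest_xid_age >= AUTOVACUUM_FREEZE_MAX_AGE
```

For example, a table whose oldest XID is 250 million transactions old has used only about 12% of the horizon, but has already crossed the default threshold that triggers forced vacuuming.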
V4 runs are slow to dequeue in the us-nyc-3 region
Resolved Sep 15 at 01:45pm BST
This issue had two causes: our us-nyc-3 cloud provider took an abnormally long time to spin up new servers and hit capacity limits, and some runs got stuck after restoring from a snapshot, failing to complete within 2 minutes. The stuck runs also occurred mostly in us-nyc-3.
eu-central-1 runs are slow to dequeue
Resolved Sep 10 at 06:18pm BST
We were seeing crashes on multiple servers in the EU only. They were caused by an out-of-memory issue in our "supervisor" process, which meant some servers weren't dequeuing consistently. We've changed some settings to allow it more memory. We're still investigating why these supervisor processes were using so much memory and crashing, and we're monitoring the situation. us-east-1 runs have not been impacted.
eu-central-1 dequeue issues
Resolved Aug 24 at 11:21am BST
This is now resolved.
We're still fully investigating the root cause, but there was too much etcd activity and cleanup wasn't happening fast enough. This caused pods to pile up and Kubernetes API rate limits to be hit, which in turn led to out-of-memory issues.
Our older Digital Ocean cluster handles far higher volume than this, so we need to determine what configuration is different and make the appropriate changes to prevent this from happening again.
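When the Kubernetes API starts rejecting requests under etcd pressure, clients that retry immediately make the pile-up worse. A minimal sketch of the standard mitigation, full-jitter exponential backoff (illustrative only; it is not taken from our controllers):

```python
# Illustrative full-jitter exponential backoff for clients that hit
# API-server rate limits (HTTP 429s). Each retry waits a random delay
# in [0, min(cap, base * 2**attempt)], spreading retries out so a
# struggling API server isn't hammered in lockstep.
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

The cap matters: without it, a long outage produces multi-minute sleeps; with it, retries settle into a bounded, jittered rhythm.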
Slower than normal v3 dequeue times
Resolved Aug 22 at 05:03pm BST
v3 dequeue times have been greatly improved, and we're now seeing sub-500ms p50 dequeue times, down from 10-20s at p50. We ended up backporting all of the dequeue performance improvements we made in v4, and even made some further improvements that we'll be bringing to v4 soon as well.
We're experiencing longer than normal queue times on v4
Resolved Aug 20 at 08:44pm BST
Queue times are back to normal. We had an unprecedented number of new v4 runs. We've adjusted our autoscaling rules across multiple services to account for this. We are looking into how to avoid this happening as v4 scales up.
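For a sense of what "adjusting autoscaling rules" means in practice, here is a minimal sketch of a queue-depth-based scaling rule (the names, ratios, and bounds are hypothetical, not our production configuration):

```python
# Hypothetical queue-depth autoscaling rule: size the worker pool to the
# backlog, clamped between a floor (for burst headroom) and a ceiling
# (for cost/capacity limits). All numbers are illustrative.
import math

def desired_workers(queue_depth: int,
                    runs_per_worker: int = 50,
                    min_workers: int = 2,
                    max_workers: int = 100) -> int:
    """Worker count needed for the current backlog, within [min, max]."""
    target = math.ceil(queue_depth / runs_per_worker)
    return max(min_workers, min(max_workers, target))
```

An unprecedented burst of new runs shows up here as `queue_depth` outrunning the ceiling: raising `max_workers` (or `runs_per_worker` throughput) is the kind of adjustment such a rule needs.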
Run log failures and cascading API failures
Resolved Aug 01 at 01:29am BST
This was resolved at 00:29 BST. Logs and the API started recovering once a valid partition was in place.