v3 runs are starting slower than normal
Resolved
Aug 15 at 10:45pm BST
Runs are processing quickly again and everything will be fully caught up within 30 minutes. The vast majority of organizations caught up an hour ago.
If you were executing a lot of runs during this period, unfortunately it's quite likely that some of them failed. You can filter by Failed, Crashed, and System Failure on the Runs page, then multi-select them and use the bulk actions bar at the bottom of the screen to replay them in bulk.
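If you'd rather script the replay than use the dashboard, something like the sketch below may work with the v3 management SDK. The filter values and method shapes (runs.list, runs.replay, and the status names) are assumptions based on the SDK docs, so check them against your installed version before running this.

```ts
// Rough sketch: replay failed runs via the v3 management SDK instead of the
// dashboard bulk actions bar. Not an official snippet -- verify the status
// filter values and return shapes against your SDK version.
import { runs } from "@trigger.dev/sdk/v3";

async function replayFailedRuns() {
  // Assumed status filter values matching the dashboard filters.
  const failed = runs.list({
    status: ["FAILED", "CRASHED", "SYSTEM_FAILURE"],
  });

  // runs.list auto-paginates, so this iterates every matching run.
  for await (const run of failed) {
    await runs.replay(run.id);
    console.log(`Requested replay of ${run.id}`);
  }
}

replayFailedRuns().catch(console.error);
```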
We're really sorry about this incident and the impact it's had on you all. This wasn't caused by a code change and wasn't a gradual decline in performance, so it was hard to foresee. Some critical system processes in our primary database started failing, causing transactions to lock. Unfortunately, this wasn't obvious at the time. We will be doing a full write-up of this incident tomorrow, and we have an early plan for some tools we're going to use to ensure this doesn't happen again.
Affected services
Trigger.dev cloud
Trigger.dev API
Updated
Aug 15 at 08:40pm BST
The dashboard and API are back to full functionality. We are processing lots of runs again, but there is a large backlog because of the slow processing over the past couple of hours. We're working hard to catch up the queue.
Our primary database had entered an unrecoverable state for an unknown reason, with permanently locked transactions and many critical underlying processes that weren't functioning properly. Unfortunately, this wasn't obvious at the time. We failed over our primary database to one of our replicas, and performance immediately returned to normal.
Sorry folks, this wasn't caused by a code change or a gradual decline in performance, so it was hard to foresee. We've found a specialist database monitoring tool that we're going to use to prevent this from happening again, or at least make it obvious if it ever does.
I'll update here when queue sizes are completely back to normal.
Affected services
Trigger.dev cloud
Trigger.dev API
Updated
Aug 15 at 06:14pm BST
V3 runs are starting slower than normal and some runs are failing because of database transaction timeouts. We're deploying changes to try to fix this, but we have not yet determined the root cause.
Affected services
Trigger.dev cloud
Trigger.dev API
Created
Aug 15 at 01:45pm BST
We're experiencing very high database load, which is causing v3 runs to be queued for longer than normal before starting.
We're investigating the root cause of this and how to alleviate it.
Affected services
Trigger.dev cloud
Trigger.dev API