Degraded

Significant disruption to run starts (runs stuck in queueing)

Mar 07 at 06:48pm GMT
Affected services
Trigger.dev API

Resolved
Mar 07 at 09:15pm GMT

We are confident that most queues have caught up again but are still monitoring the situation.

If you are experiencing unexpected queue times, this is most likely due to plan or custom queue limits. Should it persist, please get in touch.

Updated
Mar 07 at 08:56pm GMT

The service is stable again and metrics are looking good. We're still catching up with a backlog of runs. You may see increased queue times until this is fully resolved. We'll keep you updated.

Updated
Mar 07 at 08:16pm GMT

We managed to clear the huge backlog of VMs from completed runs that hadn't been cleaned up as they normally are. This was causing a number of issues, including the initial drop in runs starting and another drop between 8:00 and 8:13pm UTC.

There's a backlog of runs, and it will unfortunately take some time for everything to catch up. We're actively monitoring the situation and doing what we can to improve throughput.

Updated
Mar 07 at 07:30pm GMT

Runs are executing at normal speed again (as of 7:30pm UTC). There is a backlog of runs to work through.

We're still investigating a problem where VMs for completed runs aren't being cleaned up properly. We believe this is the root cause of the incident. It hasn't happened before, and there have been no code changes.

Updated
Mar 07 at 07:17pm GMT

A significant proportion of runs are not starting. There is an issue in our worker cluster that we are working to diagnose.

There have been no deployments today. At this point we're unsure whether this is a cloud provider issue.

Created
Mar 07 at 06:48pm GMT

We are investigating an issue in our worker cluster.

This is preventing runs from starting promptly.