Previous incidents

June 2024
Jun 13, 2024
1 incident

v3 runs are paused due to network issues

Degraded

Resolved Jun 13 at 01:20pm BST

Runs are operating at full speed.

We think this issue was caused by the clean-up operations that clear completed pods. There are far more runs than a week ago, so the list of completed pods can get very large, putting strain on the system, including internal networking. We've increased the frequency of these clean-up operations and are monitoring the load, including networking. After 15 minutes everything looks normal.
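As an illustration only (the names and shapes here are hypothetical, not our actual cluster code), a clean-up pass like this is what keeps the completed-pod list small; running it more frequently bounds how large the list can grow between sweeps:

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    phase: str  # e.g. "Running", "Succeeded"

def sweep_completed(pods: list[Pod]) -> tuple[list[Pod], list[Pod]]:
    """Split the pod list into (kept, deleted); completed pods are deleted."""
    deleted = [p for p in pods if p.phase == "Succeeded"]
    kept = [p for p in pods if p.phase != "Succeeded"]
    return kept, deleted

pods = [Pod("run-1", "Succeeded"), Pod("run-2", "Running"), Pod("run-3", "Succeeded")]
kept, deleted = sweep_completed(pods)
```

The more runs complete between sweeps, the longer the list each sweep has to scan and transfer, which is the strain described above.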

Jun 12, 2024
1 incident

v3 runs have stopped

Degraded

Resolved Jun 12 at 11:10pm BST

v3 runs are now executing again.

Networking was down because of an issue with BPF. While networking was down, tasks couldn't heartbeat back to the platform. If the platform doesn't receive a heartbeat every 2 minutes, a run is marked as failed. Fewer than 500 runs failed because of this.
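The heartbeat rule above can be sketched as a simple timeout check (a minimal illustration, not our platform code; the run IDs and timestamps are made up):

```python
HEARTBEAT_TIMEOUT_SECS = 120  # a run fails if no heartbeat arrives for 2 minutes

def runs_to_fail(last_heartbeat: dict[str, float], now: float) -> list[str]:
    """Run IDs whose most recent heartbeat is older than the timeout."""
    return [run_id for run_id, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT_SECS]

now = 1_000.0
heartbeats = {"run-a": now - 30, "run-b": now - 300}  # run-b stopped heartbeating
```

When networking is down cluster-wide, every running task misses the window at once, which is why the failures clustered in this incident.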

You can filter the runs list by the "System Failure" status to find these runs, then bulk replay them: select all on the current page, move to the next page, and select all again. Use the bottom bar to replay the selection.

Jun 10, 2024
1 incident

v2 runs are slower than normal to start

Degraded

Resolved Jun 10 at 01:30pm BST

v2 p95 start times have been under 2s for 10 minutes, so we're resolving this incident.

We think this was caused by the large number of schedules that all send an event at midday UTC on Mondays. We're looking into what we can do about that.
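One common mitigation for this kind of thundering herd (an assumption on our part, not a committed fix) is deterministic per-schedule jitter, so schedules that nominally fire at the same instant are spread over a short window:

```python
import hashlib

JITTER_WINDOW_SECS = 300  # hypothetical: spread firing over a 5-minute window

def jitter_offset(schedule_id: str, window: int = JITTER_WINDOW_SECS) -> int:
    """Deterministic delay in [0, window): the same schedule always gets the
    same offset, but different schedules spread across the window."""
    digest = hashlib.sha256(schedule_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % window

offsets = {jitter_offset(f"schedule-{i}") for i in range(1000)}
```

Hashing the schedule ID (rather than picking a random delay each time) keeps each schedule's effective firing time stable from week to week.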

Jun 05, 2024
1 incident

v2 jobs are starting slowly

Degraded

Resolved Jun 05 at 07:35pm BST

Performance metrics for v2 are back to normal. v3 was unimpacted by this issue.

We have identified the underlying performance bottleneck and will publish a full retrospective on this. We can now handle far more v2 load than we could before.

Jun 03, 2024
1 incident

v2 jobs are queued

Degraded

Resolved Jun 03 at 09:00pm BST

Runs have been executing at normal speeds, and queues have been back to their normal size, for a couple of hours, so this incident is marked as resolved.

During this incident, queue times were longer than normal for v2 runs. We've made some minor changes as well as increasing capacity. We're also working on a larger change, which should ship in the next few days, that we think will stop very large spikes in v2 runs from causing these problems.

Jun 01, 2024
1 incident

v2 job backlog

Degraded

Resolved Jun 01 at 09:45pm BST

v2 jobs are now all caught up and processing normally.

May 2024
May 14, 2024
1 incident

The v3 cluster is slow to accept new v3 tasks

Degraded

Resolved May 14 at 08:00am BST

Runs are operating at normal speed again.

There were pods in our cluster in the RunContainerError state, which happens when a run isn’t heartbeating back to the platform. We’ve cleaned these up and are monitoring closely. We’re determining which tasks caused this and what we can do to prevent it from happening in the future.
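To illustrate the cleanup step (a sketch with made-up pod names, not our tooling; the waiting reason mirrors what Kubernetes reports in a container's status):

```python
# Hypothetical pod snapshots; "waiting_reason" stands in for the container
# status waiting reason that Kubernetes reports (e.g. "RunContainerError").
pods = [
    {"name": "task-1", "waiting_reason": None},
    {"name": "task-2", "waiting_reason": "RunContainerError"},
    {"name": "task-3", "waiting_reason": "ImagePullBackOff"},
]

def stuck_pods(pods: list[dict]) -> list[str]:
    """Names of pods stuck in RunContainerError, i.e. candidates for cleanup."""
    return [p["name"] for p in pods if p["waiting_reason"] == "RunContainerError"]
```

In practice the same filter can be expressed against the live cluster state; the point is that only pods in this specific error state were removed.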

May 09, 2024
1 incident

Queues and runs have been processing at good speeds now for several hours on ...

Resolved May 09 at 10:00pm BST

Before we get into what happened, I want to emphasise how important reliability is to us. We've fallen far short of providing you all with a great experience and a reliable service. We're really sorry for the problems this has caused.

All paying customers will get a full refund for the entirety of May.

What caused this?

This issue started with a huge number of queued events in v2. Our internal system that handles concurrency limits on v2 was slowing down the pro...

April 2024
No incidents reported