Previous incidents
Dashboard and API degraded
Resolved Jun 21 at 12:00am BST
API and Dashboard services have been fully restored for the last 15+ minutes, and we think we've found the cause. We'll continue to monitor the situation.
Our API and dashboard response times are elevated
Resolved Jun 20 at 03:35pm BST
API and cloud response times are back to normal.
Some v3 runs are failing
Resolved Jun 19 at 11:56am BST
v3 runs are back to normal.
An abnormal number of runs ended up in the System Failure state because some of the data being passed from the workers back to the platform was in an unexpected format.
We are bulk replaying runs that were impacted.
v3 runs are paused due to network issues
Resolved Jun 13 at 01:20pm BST
Runs are operating at full speed.
We think this issue was caused by the clean-up operation that clears completed pods. There are far more runs than a week ago, so that list can get very large, putting strain on the system, including internal networking. We've increased the frequency of the clean-up and are monitoring the load, including networking. After 15 minutes everything looks normal.
v3 runs have stopped
Resolved Jun 12 at 11:10pm BST
v3 runs are now executing again.
Networking was down because of an issue with BPF. While networking was down, tasks couldn't send heartbeats back to the platform. If the platform doesn't receive a heartbeat every 2 minutes, it fails the run. Fewer than 500 runs in total were failed because of this.
To find these, filter the runs list by the "System Failure" status. You can then bulk replay them: select all, move to the next page, and select all again, then use the bottom bar to replay them.
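If you'd rather script the replay, here's a rough TypeScript sketch using the SDK. Treat the helper names (runs.list, runs.replay) and the "SYSTEM_FAILURE" status value as illustrative and check the SDK docs for the exact API:

// Sketch only: runs.list, runs.replay and the "SYSTEM_FAILURE" status value
// are assumptions here — check the SDK reference for the exact names.
import { runs } from "@trigger.dev/sdk/v3";

async function replaySystemFailures() {
  // Iterate over runs that ended in System Failure (the list is paginated).
  for await (const run of runs.list({ status: ["SYSTEM_FAILURE"] })) {
    await runs.replay(run.id); // queue a fresh attempt of the same run
    console.log(`Replayed ${run.id}`);
  }
}

replaySystemFailures().catch(console.error);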
v2 runs are slower than normal to start
Resolved Jun 10 at 01:30pm BST
v2 p95 start times have been under 2s for 10 minutes, so we're resolving this issue.
We think this is because there are a lot of schedules that send an event at midday UTC on a Monday. We're looking into what we can do about that.
v2 jobs are starting slowly
Resolved Jun 05 at 07:35pm BST
Performance metrics for v2 are back to normal. v3 was unimpacted by this issue.
We have identified the underlying performance bottleneck and will publish a full retrospective on this. We can now handle far more v2 load than we could before.
v2 jobs are queued
Resolved Jun 03 at 09:00pm BST
Runs have been executing at normal speeds and queues have been back to their normal size for a couple of hours, so this is now marked as resolved.
During this incident, queue times were longer than normal for v2 runs. We've made some minor changes as well as increasing capacity. We're also working on a larger change, due to ship in the next few days, that we think will stop very large v2 run spikes from causing these problems.
v2 job backlog
Resolved Jun 01 at 09:45pm BST
v2 jobs are now all caught up and processing normally.
The v3 cluster is slow to accept new v3 tasks
Resolved May 14 at 08:00am BST
Runs are operating at normal speed again.
There were pods in our cluster in the RunContainerError state, which happens when a run isn't heartbeating back to the platform. We've cleaned these up and are monitoring closely. We're determining which tasks caused this and what we can do to prevent it from happening in the future.
Queues and runs have been processing at good speeds now for several hours on ...
Resolved May 09 at 10:00pm BST
Before we get into what happened, I want to emphasise how important reliability is to us. We've fallen very short of providing you all with a great experience and a reliable service, and we're really sorry for the problems this has caused.
All paying customers will get a full refund for the entirety of May.
What caused this?
This issue started with a huge number of queued events in v2. Our internal system that handles concurrency limits on v2 was slowing down the pro...