Previous incidents
Our emails aren't sending (downstream provider issue)
Resolved Aug 24 at 03:45pm BST
Resend is back online, so magic link and alert emails are working again.
1 previous update
Some v3 runs are crashing with triggerAndWait
Resolved Aug 20 at 07:40pm BST
A fix has been deployed and tested. Just confirming that it has fixed all instances of this issue.
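For reference, this affected tasks that use the v3 SDK's triggerAndWait to run a child task and wait for its result. A minimal sketch of that pattern is below; the task ids and payload are made up for illustration, and the exact result shape can vary between SDK versions.

```ts
import { task } from "@trigger.dev/sdk/v3";

// Hypothetical child task, purely for illustration
export const childTask = task({
  id: "child-task",
  run: async (payload: { message: string }) => {
    return { echoed: payload.message };
  },
});

// Hypothetical parent task: triggerAndWait triggers the child run and
// waits for it to finish before continuing
export const parentTask = task({
  id: "parent-task",
  run: async (payload: { message: string }) => {
    const result = await childTask.triggerAndWait({ message: payload.message });

    if (result.ok) {
      return result.output; // the child task's return value
    }

    throw new Error("Child run failed");
  },
});
```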
1 previous update
v3 runs are starting slower than normal
Resolved Aug 15 at 10:45pm BST
Runs are processing very quickly now and everything will be fully caught up in 30 minutes. The vast majority of organizations caught up an hour ago.
If you were executing a lot of runs during this period, unfortunately it is quite likely that some of them failed. You can filter by Failed, Crashed and System Failure on the Runs page, then multi-select them and use the bulk actions bar at the bottom of the screen to replay them in bulk.
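If you'd rather do this from a script than the dashboard, something like the sketch below should work. It assumes the v3 SDK exposes runs.list with a status filter and runs.replay, and that the status values are FAILED, CRASHED and SYSTEM_FAILURE; please check the SDK reference for your version before relying on it.

```ts
import { runs } from "@trigger.dev/sdk/v3";

// Assumes TRIGGER_SECRET_KEY is set so the SDK can authenticate, that runs.list
// supports a status filter with these values, and that it auto-paginates when
// iterated with `for await`. Verify against the SDK reference for your version.
async function replayFailedRuns() {
  for await (const run of runs.list({ status: ["FAILED", "CRASHED", "SYSTEM_FAILURE"] })) {
    await runs.replay(run.id); // replays the run with its original payload
    console.log(`Replayed run ${run.id}`);
  }
}

replayFailedRuns().catch((err) => {
  console.error("Replay script failed", err);
});
```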
We're really sorry about this incident and the impact it's...
3 previous updates
v3 runs are slower than normal to start
Resolved Jul 18 at 11:14pm BST
v3 runs are now operating at full speed.
We were unable to spin up more servers, so the total throughput of v3 runs was limited. This reduced concurrency, so runs started more slowly than normal, although they were still being fairly distributed between orgs.
We fixed the underlying issue that was preventing servers from spinning up.
2 previous updates
Dashboard/API is down
Resolved Jul 11 at 11:55am BST
The platform is working correctly again and runs will pick back up. Some runs that were in progress may have failed.
This was caused by a bad migration that didn't cause the deployment to fail automatically, so it rolled out to the instances.
1 previous update
Dashboard and API degraded
Resolved Jun 21 at 12:00am BST
API and Dashboard services have been fully restored for the last 15+ minutes, and we think we have found the issue. We'll continue to monitor the situation.
1 previous update
Our API and dashboard response times are elevated
Resolved Jun 20 at 03:35pm BST
API and cloud response times are back to normal.
1 previous update
Some v3 runs are failing
Resolved Jun 19 at 11:56am BST
v3 runs are back to normal.
There was an abnormally high number of runs in the System Failure state because some of the data being passed from the workers back to the platform was in an unexpected format.
We are bulk replaying runs that were impacted.
2 previous updates
v3 runs are paused due to network issues
Resolved Jun 13 at 01:20pm BST
Runs are operating at full speed.
We think this issue was caused by the clean-up operations that clear completed pods. There are far more runs than a week ago, so that list can get very large, putting strain on the system, including internal networking. We've increased the clean-up frequency and are monitoring the load, including networking. After 15 minutes everything looks normal.
2 previous updates
v3 runs have stopped
Resolved Jun 12 at 11:10pm BST
v3 runs are now executing again.
Networking was down because of an issue with BPF. While networking was down, tasks couldn't send heartbeats back to the platform. If the platform doesn't receive a heartbeat every 2 minutes, the run fails. Fewer than 500 runs in total were failed because of this.
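Purely to illustrate the rule described above (this is not the platform's actual code): a watchdog that fails a run when no heartbeat has arrived within the 2-minute window could look roughly like this.

```ts
// Illustrative sketch only, not the platform's implementation.
const HEARTBEAT_TIMEOUT_MS = 2 * 60 * 1000; // fail a run after 2 minutes without a heartbeat

const lastHeartbeatAt = new Map<string, number>(); // runId -> timestamp of last heartbeat

// Called whenever a worker reports a heartbeat for a run
export function recordHeartbeat(runId: string) {
  lastHeartbeatAt.set(runId, Date.now());
}

// Periodically sweep and fail any run whose heartbeat is too old
export function sweep(failRun: (runId: string) => void) {
  const now = Date.now();
  for (const [runId, at] of lastHeartbeatAt) {
    if (now - at > HEARTBEAT_TIMEOUT_MS) {
      failRun(runId); // e.g. mark the run as "System Failure"
      lastHeartbeatAt.delete(runId);
    }
  }
}

// Check every 30 seconds
setInterval(() => sweep((id) => console.log(`Failing run ${id}: no heartbeat for 2 minutes`)), 30_000);
```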
You can filter by the "System Failure" status in the runs list to find these, then bulk replay them: select all, move to the next page and select all again, and replay them using the bottom bar.
1 previous update
v2 runs are slower than normal to start
Resolved Jun 10 at 01:30pm BST
v2 p95 start times have been under 2s for the last 10 minutes, so we're resolving this incident.
We think this is because there are a lot of schedules that send an event at midday UTC on a Monday. We're looking into what we can do about that.
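For context, a v2 schedule that fires at midday UTC on Mondays is typically defined with a cron trigger, roughly as sketched below; the client id, job id and names here are made up, and the exact setup depends on how your TriggerClient is configured.

```ts
import { TriggerClient, cronTrigger } from "@trigger.dev/sdk";

// Hypothetical client; in a real project the id and API key come from your own setup.
const client = new TriggerClient({
  id: "my-app",
  apiKey: process.env.TRIGGER_API_KEY,
});

// "0 12 * * 1" = 12:00 UTC every Monday, so every job defined with this
// expression fires at exactly the same moment.
client.defineJob({
  id: "monday-midday-job", // made-up id for illustration
  name: "Monday midday job",
  version: "1.0.0",
  trigger: cronTrigger({ cron: "0 12 * * 1" }),
  run: async (payload, io, ctx) => {
    await io.logger.info("Scheduled run started", { scheduledAt: payload.ts });
  },
});
```

Because every schedule with the same expression fires at the same instant, a large enough number of them produces the kind of midday spike described above.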
2 previous updates
v2 jobs are starting slowly
Resolved Jun 05 at 07:35pm BST
Performance metrics for v2 are back to normal. v3 was not affected by this issue.
We have identified the underlying performance bottleneck and will publish a full retrospective on this. We can now handle far more v2 load than we could before.
1 previous update
v2 jobs are queued
Resolved Jun 03 at 09:00pm BST
Runs have been executing at normal speeds and queues have been back to their normal size for a couple of hours, so this is marked as resolved.
During this incident, queue times for v2 runs were longer than normal. We've made some minor changes as well as increasing capacity. We're also working on a larger change, due to ship in the next few days, that we think should stop very large v2 run spikes from causing these problems.
3 previous updates
v2 job backlog
Resolved Jun 01 at 09:45pm BST
v2 jobs are now all caught up and processing normally.
3 previous updates