Previous incidents

August 2024
Aug 24, 2024
1 incident

Our emails aren't sending (downstream provider issue)

Degraded

Resolved Aug 24 at 03:45pm BST

Resend is back online, so magic link and alert emails are working again.


Aug 20, 2024
1 incident

Some v3 runs are crashing with triggerAndWait

Degraded

Resolved Aug 20 at 07:40pm BST

A fix has been deployed and tested. Just confirming that it has fixed all instances of this issue.
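For reference, the affected pattern is a parent task calling a child task with triggerAndWait. The sketch below is illustrative only and assumes the standard v3 task API; the task ids and payloads are made up.

```ts
import { task } from "@trigger.dev/sdk/v3";

// Hypothetical child task, shown only to illustrate the affected pattern.
export const childTask = task({
  id: "child-task",
  run: async (payload: { url: string }) => {
    return { fetched: payload.url };
  },
});

// Hypothetical parent task that uses triggerAndWait, the call path this
// incident affected.
export const parentTask = task({
  id: "parent-task",
  run: async (payload: { url: string }) => {
    const result = await childTask.triggerAndWait({ url: payload.url });

    if (result.ok) {
      return result.output;
    }

    throw new Error("Child task failed");
  },
});
```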


Aug 15, 2024
1 incident

v3 runs are starting slower than normal

Degraded

Resolved Aug 15 at 10:45pm BST

Runs are processing very quickly now and everything will be fully caught up within 30 minutes. The vast majority of organizations caught up an hour ago.

If you were executing a lot of runs during this period, unfortunately it is quite likely that some of them failed. You can filter by Failed, Crashed, and System Failure on the Runs page, then multi-select them and use the bulk actions bar at the bottom of the screen to replay them in bulk.
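If you prefer to do this programmatically, something like the sketch below may work; the SDK method names, status values, and pagination behaviour shown here are assumptions rather than anything confirmed in this update.

```ts
import { runs } from "@trigger.dev/sdk/v3";

// Hypothetical script: filter for runs that ended in a failed-like state and
// replay each one. Method names, filters, and statuses are assumptions.
async function replayAffectedRuns() {
  for await (const run of runs.list({
    status: ["FAILED", "CRASHED", "SYSTEM_FAILURE"],
  })) {
    await runs.replay(run.id);
    console.log(`Replayed run ${run.id}`);
  }
}

replayAffectedRuns().catch(console.error);
```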

We're really sorry about this incident and the impact it's...


July 2024
Jul 18, 2024
1 incident

v3 runs are slower than normal to start

Degraded

Resolved Jul 18 at 11:14pm BST

v3 runs are now operating at full speed.

We were unable to spin up more servers, so the total throughput of v3 runs was limited. This reduced concurrency, and runs started slower than normal, although they were still being fairly distributed between orgs.

We managed to fix the underlying issue that was causing servers not to spin up.


Jul 11, 2024
1 incident

Dashboard/API is down

Downtime

Resolved Jul 11 at 11:55am BST

The platform is working correctly again. Runs will pick back up. Some runs that were in progress may have failed.

This was caused by a bad migration that didn't cause the deployment to fail automatically, so it rolled out to the instances.


June 2024
Jun 20, 2024
2 incidents

Dashboard and API degraded

Degraded

Resolved Jun 21 at 12:00am BST

API and Dashboard services have been fully restored for the last 15+ minutes, and we think we have found the issue. We'll continue to monitor the situation.


Our API and dashboard response times are elevated

Degraded

Resolved Jun 20 at 03:35pm BST

API and cloud response times are back to normal.


Jun 19, 2024
1 incident

Some v3 runs are failing

Degraded

Resolved Jun 19 at 11:56am BST

v3 runs are back to normal.

There was an abnormal number of runs in the System Failure state because some of the data being passed from the workers back to the platform was in an unexpected format.

We are bulk replaying runs that were impacted.
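To illustrate the failure mode, the sketch below shows how a strictly typed result schema rejects a payload in an unexpected shape; the schema itself is hypothetical and not the platform's real format.

```ts
import { z } from "zod";

// Hypothetical worker-result schema, purely to illustrate "unexpected format".
const WorkerResult = z.object({
  runId: z.string(),
  status: z.enum(["COMPLETED", "FAILED"]),
  output: z.unknown().optional(),
});

// A well-formed result parses cleanly.
WorkerResult.parse({ runId: "run_123", status: "COMPLETED" });

// An unexpectedly shaped result is rejected, which is the kind of mismatch
// that pushed runs into the System Failure state.
const bad = WorkerResult.safeParse({ run_id: "run_123", state: "done" });
console.log(bad.success); // false
```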


Jun 13, 2024
1 incident

v3 runs are paused due to network issues

Degraded

Resolved Jun 13 at 01:20pm BST

Runs are operating at full speed.

We think this issue was caused by the clean-up operations that clear completed pods. There are far more runs than a week ago, so that list can get very large, putting strain on the system, including internal networking. We've increased the clean-up frequency and are monitoring the load, including networking. After 15 minutes everything looks normal.


Jun 12, 2024
1 incident

v3 runs have stopped

Degraded

Resolved Jun 12 at 11:10pm BST

v3 runs are now executing again.

Networking was down because of an issue with BPF. While networking was down, tasks couldn't heartbeat back to the platform. If the platform doesn't receive a heartbeat within 2 minutes, the run fails. Fewer than 500 runs in total failed because of this.
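To illustrate the heartbeat rule described above (not the platform's actual implementation), a run whose last heartbeat is older than 2 minutes is treated as failed:

```ts
// Hypothetical illustration of the heartbeat rule described above; names and
// structure are made up for this sketch.
const HEARTBEAT_TIMEOUT_MS = 2 * 60 * 1000;

type TrackedRun = { id: string; lastHeartbeatAt: number };

// Return the runs that would be failed because their last heartbeat is older
// than the 2-minute timeout.
function findTimedOutRuns(runs: TrackedRun[], now: number): TrackedRun[] {
  return runs.filter((run) => now - run.lastHeartbeatAt > HEARTBEAT_TIMEOUT_MS);
}

// Example: a run that last heartbeated 3 minutes ago would be failed.
const stale = findTimedOutRuns(
  [{ id: "run_123", lastHeartbeatAt: Date.now() - 3 * 60 * 1000 }],
  Date.now()
);
console.log(stale.map((run) => run.id)); // ["run_123"]
```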

You can filter by the "System Failure" status in the runs list to find these, then bulk replay them: select all, move to the next page, select all again, and replay them using the bottom bar.


Jun 10, 2024
1 incident

v2 runs are slower than normal to start

Degraded

Resolved Jun 10 at 01:30pm BST

v2 p95 start times have been under 2s for 10 minutes, so we're resolving this incident.

We think this is because there are a lot of schedules that send an event at midday UTC on a Monday. We're looking into what we can do about that.
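For context, the kind of schedule that contributes to this spike is a cron trigger set to midday UTC on Mondays. The sketch below uses the v2 SDK's cronTrigger; the client id and job details are illustrative.

```ts
import { TriggerClient, cronTrigger } from "@trigger.dev/sdk";

// Illustrative v2 job; the client id and job metadata are hypothetical.
const client = new TriggerClient({ id: "my-project" });

client.defineJob({
  id: "weekly-report",
  name: "Weekly report",
  version: "1.0.0",
  // "0 12 * * 1" = 12:00 UTC every Monday. Many schedules firing at the same
  // instant produce the spike described above.
  trigger: cronTrigger({ cron: "0 12 * * 1" }),
  run: async (payload, io, ctx) => {
    // ...job work...
  },
});
```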


Jun 05, 2024
1 incident

v2 jobs are starting slowly

Degraded

Resolved Jun 05 at 07:35pm BST

Performance metrics for v2 are back to normal. v3 was unimpacted by this issue.

We have identified the underlying performance bottleneck and will publish a full retrospective on this. We can now handle far more v2 load than we could before.


Jun 03, 2024
1 incident

v2 jobs are queued

Degraded

Resolved Jun 03 at 09:00pm BST

Runs have been executing at normal speed and queues have been back to their normal size for a couple of hours, so this incident is marked as resolved.

During this incident, queue times were longer than normal for v2 runs. We've made some minor changes as well as increasing capacity. We're also working on a larger change, shipping in the next few days, that we think will prevent very large v2 run spikes from causing these problems.


Jun 01, 2024
1 incident

v2 job backlog

Degraded

Resolved Jun 01 at 09:45pm BST

v2 jobs are now all caught up and processing normally.
