Previous incidents

September 2025
Sep 26, 2025
1 incident

us-east-1 runs are slow to start executing

Degraded

Resolved Sep 26, 2025 at 05:35am UTC

Service in us-east-1 has fully recovered.

Sep 24, 2025
1 incident

Large machines are slow to dequeue runs

Degraded

Resolved Sep 24, 2025 at 04:30pm UTC

Larger machines are now dequeuing faster.

We are going to change the packing algorithm we use, which will make this much less likely to happen in the future. That work will begin this week.

Sep 15, 2025
2 incidents

Slower dequeuing and API responses

Degraded

Resolved Sep 15, 2025 at 05:40pm UTC

The database load is back to normal.

We're still investigating why this happened; the current top theory is an auto-vacuuming issue, possibly related to transaction wraparound.
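
For context on why wraparound forces heavy vacuuming (illustrative background, not data from this incident): PostgreSQL transaction IDs are 32-bit, so roughly 2^31 XIDs can be "in the past" before wraparound, and autovacuum must freeze old tuples well before that. A rough sketch of the headroom arithmetic, with assumed example numbers:

```python
# Sketch: headroom before anti-wraparound autovacuum and hard wraparound.
# The 200M default for autovacuum_freeze_max_age matches PostgreSQL's
# documented default; the 250M example age is purely illustrative.

WRAPAROUND_LIMIT = 2**31  # ~2.1 billion XIDs can be "in the past"

def wraparound_headroom(datfrozenxid_age: int,
                        autovacuum_freeze_max_age: int = 200_000_000) -> dict:
    """Given age(datfrozenxid) for a database, report how much headroom
    remains before forced (aggressive) autovacuum and hard wraparound."""
    return {
        "until_forced_autovacuum": max(0, autovacuum_freeze_max_age - datfrozenxid_age),
        "until_wraparound": max(0, WRAPAROUND_LIMIT - datfrozenxid_age),
        "aggressive_vacuum_running": datfrozenxid_age >= autovacuum_freeze_max_age,
    }

# Example: a database whose oldest unfrozen XID is 250M transactions old,
# i.e. already past the forced-autovacuum threshold
status = wraparound_headroom(250_000_000)
```

Once `age(datfrozenxid)` crosses `autovacuum_freeze_max_age`, Postgres launches aggressive vacuums regardless of other settings, which can produce exactly this kind of sudden database load.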

V4 runs are slow to dequeue in the us-nyc-3 region

Degraded

Resolved Sep 15, 2025 at 12:45pm UTC

This issue had two causes: our us-nyc-3 cloud provider took an abnormally long time to spin up new servers and had capacity issues, and some runs got stuck after restoring from a snapshot, failing to complete within 2 minutes. The latter also mostly happened in us-nyc-3.

Sep 10, 2025
1 incident

eu-central-1 runs are slow to dequeue

Degraded

Resolved Sep 10, 2025 at 05:18pm UTC

We were seeing crashes on multiple servers in the EU only. They were related to an out-of-memory issue in our “supervisor”, which meant some servers weren’t dequeuing consistently. We’ve changed some settings to allow for more memory. We're still investigating why these supervisor processes were using too much memory and crashing, and we're monitoring the situation. us-east-1 runs have not been impacted.

August 2025
Aug 24, 2025
1 incident

eu-central-1 dequeue issues

Degraded

Resolved Aug 24, 2025 at 10:21am UTC

This is now resolved.

We're still investigating the full root cause, but there was too much etcd activity and cleanup wasn't happening fast enough. This had downstream impacts: pods piled up, Kubernetes API rate limits were hit, and out-of-memory issues followed.

Our older DigitalOcean cluster handles far higher volume than this, so we need to determine which configuration is different and make the appropriate changes to prevent this from happening again.

Aug 21, 2025
1 incident

Slower than normal v3 dequeue times

Degraded

Resolved Aug 22, 2025 at 04:03pm UTC

v3 dequeue times have greatly improved: we're now seeing sub-500ms p50 dequeue times, down from 10-20s previously. We ended up backporting all of the dequeue performance improvements we made in v4, and even made some improvements that we'll be bringing to v4 soon as well.

Aug 20, 2025
1 incident

We're experiencing longer than normal queue times on v4

Degraded

Resolved Aug 20, 2025 at 07:44pm UTC

Queue times are back to normal. We had an unprecedented number of new v4 runs. We've adjusted our autoscaling rules across multiple services to account for this. We are looking into how to avoid this happening as v4 scales up.

Aug 01, 2025
1 incident

Run log failures and cascading API failures

Degraded

Resolved Aug 01, 2025 at 12:29am UTC

This was resolved at 00:29 UTC. Logs and the API started recovering once a valid partition was in place.

July 2025
Jul 23, 2025
1 incident

Runs are missing from the dashboard and runs.list is degraded

Degraded

Resolved Jul 23, 2025 at 02:45pm UTC

The dashboard/runs.list is back to normal. We're working on and deploying multiple changes that will reduce and prevent these kinds of issues from happening.

Jul 22, 2025
1 incident

Runs are missing from the dashboard and runs.list is degraded

Degraded

Resolved Jul 22, 2025 at 11:14pm UTC

The runs list is now fully operational. There is still missing data that we will be backfilling ASAP.

Jul 18, 2025
1 incident

Some runs list calls impacted by ClickHouse server crashes

Degraded

Resolved Jul 18, 2025 at 03:26pm UTC

We've opened a case with ClickHouse Cloud to try to understand why this happened.

Jul 15, 2025
1 incident

Batches with more than 20 runs are slow to process

Degraded

Resolved Jul 15, 2025 at 02:15pm UTC

Batches are processing as normal now. We have increased future capacity.

This was caused by a runaway loop of batches from a customer; this part of the system didn't have enough capacity to process them all fast enough.

We are updating how we process and rate-limit batches to prevent this from happening again, and improving internal alerting for similar issues in the future.
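
As a sketch of the kind of per-customer rate limiting described above (a token bucket is one common choice; the class, names, and numbers here are illustrative assumptions, not our actual implementation):

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a normal burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example (hypothetical limits): 5 batches/second sustained, bursts of 10.
per_customer = TokenBucket(rate=5, capacity=10)
allowed_now = per_customer.allow()
```

A runaway submission loop then degrades gracefully: the offending customer's requests are rejected or queued once their bucket empties, instead of consuming all shared processing capacity.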

Jul 13, 2025
1 incident

Realtime not processing updates

Downtime

Resolved Jul 13, 2025 at 04:37pm UTC

Realtime is sending updates again. The attached storage stopped working, and restarting the AWS task didn't help. A hard reset brought it back to a healthy state.

We're looking into how to prevent this from happening again.

Jul 04, 2025
1 incident

Realtime not sending updates

Downtime

Resolved Jul 04, 2025 at 09:34am UTC

This is resolved – Realtime is sending updates again.

Restarting Electric released and reacquired the Postgres replication slot. We're discussing why this happened to try to prevent it in the future.
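
For background on why a stuck slot matters (illustrative, not data from this incident): a replication slot that isn't advancing pins its restart_lsn, so Postgres must retain all WAL from that point onward. LSNs are reported as hex "hi/lo" pairs; converting them to byte offsets makes the retained amount easy to compute:

```python
# Sketch: how much WAL a Postgres server must retain for a pinned
# replication slot. LSN values below are made up for illustration.

def lsn_to_bytes(lsn: str) -> int:
    """Convert a Postgres LSN like '16/B374D848' to an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def retained_wal_bytes(current_lsn: str, slot_restart_lsn: str) -> int:
    """WAL bytes kept on disk for a slot whose restart_lsn is not advancing."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(slot_restart_lsn)

# Example: a slot stuck exactly 1 GiB behind the current write position
lag = retained_wal_bytes("16/F374D848", "16/B374D848")
```

In production these two values would come from `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`; monitoring that difference is one way to catch a slot that a consumer like Electric has stopped advancing.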
