Previous incidents
Dashboard unreliable as we work through ClickHouse issues
Resolved Nov 28, 2025 at 7:54pm UTC
We have recovered the ClickHouse instance: the dashboard is responsive and serving queries again, and data ingestion is back online. Some OTel data was lost during the downtime, but we don't yet know the extent of it as we continue to recover and investigate.
1 previous update
Realtime is down
Resolved Nov 23, 2025 at 12:38pm UTC
Realtime is working again. ElectricSQL was only working for the first 10 seconds after the server came up. We updated some settings around the startup sequencing, which means it is now working properly, and we are working with the Electric team to determine why this was happening.
1 previous update
Deployments failing due to upstream provider
Resolved Oct 28, 2025 at 9:20pm UTC
Depot resolved the issue on their side. Deployments are now working normally again.
1 previous update
us-east-1 slow dequeues for large machines
Resolved Oct 27, 2025 at 9:21pm UTC
The issue is now resolved. We scaled up massively to absorb a load spike from one of our customers. Mainly runs on large machines were impacted.
1 previous update
Dashboard runs list is delayed
Resolved Oct 22, 2025 at 11:29am UTC
We are still working with ClickHouse Cloud to identify the root cause of the issue, but we have pushed out a fix that has brought the runs list back online and showing live results in the meantime. We suspect the issue is related to our ClickHouse server being "rotated" into another AZ, which caused network degradation and caused our replication pipeline from PostgreSQL to ClickHouse to fall behind, unable to keep up with changes. The fix was to stop sending payload data to ClickHouse wh...
2 previous updates
Deploys are still impacted by the us-east-1 outage
Resolved Oct 20, 2025 at 9:00pm UTC
Our remote build provider Depot.dev was able to fully recover after ongoing EC2 instance issues in us-east-1; AWS had several regressions and limited service availability throughout the day.
1 previous update
Runs list and dashboard logs are impacted by the AWS us-east-1 outage
Resolved Oct 20, 2025 at 5:12pm UTC
Our ClickHouse instance is back online and serving queries. The runs list is now working, but some requests syncing run state to ClickHouse are queued and waiting to finish. We should be all caught up in about 5-10 minutes.
1 previous update
Realtime is slow to update
Resolved Oct 20, 2025 at 1:11pm UTC
Realtime is now back to normal operation, serving live updates. We'll continue to monitor the situation.
2 previous updates
AWS outage in us-east-1 is causing service disruption
Resolved Oct 20, 2025 at 9:33am UTC
AWS us-east-1's latest update has stated that they are seeing "significant signs of recovery" and we're seeing the same in our worker cluster, as now image pulls are working and both cold and warm starts are now processing as normal. We'll continue to closely monitor the situation.
2 previous updates
us-east-1 slow dequeues for large machines
Resolved Oct 5, 2025 at 5:14pm UTC
Runs on large-2x machines were processing more slowly due to extremely high volume. One of our customers caused an infinite loop in their large-machine tasks, which kept triggering more runs. Runs on small and medium machines were unlikely to have been impacted by this.
1 previous update
us-east-1 slow dequeues for cold starts
Resolved Oct 2, 2025 at 6:56pm UTC
This resolved shortly after a problematic control plane rollout.
1 previous update
us-east-1 runs are slow to start executing
Resolved Sep 26, 2025 at 5:35am UTC
Service in us-east-1 has fully recovered
2 previous updates
Large machines are slow to dequeue runs
Resolved Sep 24, 2025 at 4:30pm UTC
Larger machines are now dequeuing faster.
We are going to change the packing algorithm we use which means this will be much less likely to happen in the future. That work will begin this week.
1 previous update
Slower dequeuing and API responses
Resolved Sep 15, 2025 at 5:40pm UTC
The database load is back to normal.
We're still investigating why this happened; the top theory at the moment is an auto-vacuuming issue, possibly related to transaction wraparound.
1 previous update
V4 runs are slow to dequeue in the us-nyc-3 region
Resolved Sep 15, 2025 at 12:45pm UTC
This issue was caused by our us-nyc-3 cloud provider taking an abnormally long time to spin up new servers, combined with capacity issues, along with some runs getting stuck after restoring from a snapshot and not completing within 2 minutes, which also mostly happened in us-nyc-3.
2 previous updates
eu-central-1 runs are slow to dequeue
Resolved Sep 10, 2025 at 5:18pm UTC
We were seeing crashes on multiple servers in the EU only. It was related to an out-of-memory issue in our "supervisor", which meant some servers weren't dequeuing consistently. We've changed some settings to allow for more memory. We're still investigating why these supervisor processes were using too much memory and crashing, and we're monitoring the situation. us-east-1 runs have not been impacted.
1 previous update