Previous incidents

November 2024
Nov 23, 2024
1 incident

V3 runs are processing slowly

Degraded

Resolved Nov 23 at 12:55am GMT

Runs are processing normally again, and queues should come down quickly.

etcd, the key-value database backing our Kubernetes cluster, was rejecting new writes. Increasing its maximum database size, restarting it, and changing a few other settings resolved the issue.
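
For context, a hedged sketch (our own illustration, not an exact record of the fix): etcd refuses new writes once its backend database reaches the configured maximum size (roughly 2 GiB by default, set via --quota-backend-bytes), so comparing the current database size against that limit, and checking for active alarms, is a quick way to confirm the condition. The endpoint address and quota value below are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Hypothetical endpoint; replace with the cluster's real etcd address.
	const endpoint = "127.0.0.1:2379"

	// Assumed quota for illustration: etcd's default --quota-backend-bytes is ~2 GiB.
	const quotaBytes int64 = 2 * 1024 * 1024 * 1024

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatalf("connect to etcd: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Report how close the backend database is to the configured maximum size.
	status, err := cli.Status(ctx, endpoint)
	if err != nil {
		log.Fatalf("status: %v", err)
	}
	fmt.Printf("db size: %d bytes (%.1f%% of %d-byte quota)\n",
		status.DbSize, float64(status.DbSize)/float64(quotaBytes)*100, quotaBytes)

	// Any active alarm (e.g. NOSPACE) means etcd is rejecting writes until the
	// quota is raised and the alarm is disarmed.
	alarms, err := cli.AlarmList(ctx)
	if err != nil {
		log.Fatalf("alarm list: %v", err)
	}
	for _, a := range alarms.Alarms {
		fmt.Printf("active alarm on member %d: %v\n", a.MemberID, a.Alarm)
	}
}
```

As general etcd behaviour (not specific to this incident), after the quota is raised and etcd restarted, any NOSPACE alarm also has to be disarmed before writes resume.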

1 previous update

Nov 22, 2024
1 incident

Realtime (beta) is offline

Downtime

Resolved Nov 22 at 07:36pm GMT

Realtime is back online.

We've made some configuration changes and have some more reliability fixes in progress to make this rock solid.

1 previous update

Nov 08, 2024
1 incident

V2 runs are processing slowly

Degraded

Resolved Nov 08 at 04:30pm GMT

V2 queues are now caught up; any remaining queued runs are due to concurrency limits.

V3 was not impacted at any point during this period.

We restarted all V2 worker servers and V2 runs started processing again. We are still investigating the underlying cause to prevent this from happening again. There were no code changes or deploys during this period, and overall V2 load wasn't unusual.

2 previous updates

October 2024
Oct 31, 2024
1 incident

Realtime service degraded

Degraded

Resolved Nov 01 at 12:38am GMT

Realtime is recovering after a restart and clearing of the consumer cache, but the underlying issue has not yet been resolved. We're still working on a fix and will update as we make progress.

1 previous update

Oct 23, 2024
1 incident

Dashboard instability and slower run processing

Degraded

Resolved Oct 25 at 06:47pm BST

The networking issues with our worker cluster's cloud provider are no longer occurring. Networking has been back to full speed for the past 10 minutes, and runs are processing quickly.

3 previous updates

September 2024
Sep 24, 2024
1 incident

Some processing is slower than normal

Degraded

Resolved Sep 24 at 08:28pm BST

This issue is resolved and everything is back to normal.

This issue was caused by an exceptionally large number of v3 run alerts, generated by a run that was repeatedly failing (due to user code, not a Trigger.dev system problem). The volume of alerts caused us to hit Slack's rate limits, which slowed processing down further.

We have scaled up the system that handles these alerts so it can better cope with spikes like this. We've also changed the retry settings for sending Slack alerts so they are retried less aggressively.
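
As a general illustration (not the exact change shipped here), "less aggressive" retrying for Slack alerts usually means capping the number of attempts and honouring the Retry-After header Slack returns with HTTP 429 responses, rather than retrying immediately. A minimal sketch in Go against Slack's chat.postMessage endpoint; the token and channel values are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// postSlackAlert sends one alert message to Slack's chat.postMessage API,
// retrying a bounded number of times and honouring the Retry-After header
// that Slack returns on HTTP 429 (rate limited) responses.
func postSlackAlert(token, channel, text string) error {
	body, err := json.Marshal(map[string]string{"channel": channel, "text": text})
	if err != nil {
		return err
	}

	const maxAttempts = 3 // cap retries instead of retrying aggressively
	backoff := 2 * time.Second

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequest("POST", "https://slack.com/api/chat.postMessage", bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Authorization", "Bearer "+token)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()

		if resp.StatusCode == http.StatusTooManyRequests {
			// Slack tells us how long to wait; fall back to our own backoff if absent.
			wait := backoff
			if s := resp.Header.Get("Retry-After"); s != "" {
				if secs, err := strconv.Atoi(s); err == nil {
					wait = time.Duration(secs) * time.Second
				}
			}
			time.Sleep(wait)
			backoff *= 2
			continue
		}

		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("slack returned status %d", resp.StatusCode)
		}
		return nil
	}
	return fmt.Errorf("slack alert dropped after %d rate-limited attempts", maxAttempts)
}

func main() {
	// Hypothetical values for illustration only.
	if err := postSlackAlert("xoxb-example-token", "#alerts", "Run failed: example alert"); err != nil {
		fmt.Println("alert not delivered:", err)
	}
}
```

Note that Slack can also report errors in the JSON body of a 200 response; handling that is omitted here for brevity.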

1 previous update