Status information about the cloud.trigger.dev API and Dashboard
Status Check • May 12, 2024 08:03:07 UTC
Incidents from the last 7 days, or that have not yet been resolved.
May 09, 2024 21:00
resolved
May 10, 2024 21:17
Before we get into what happened, I want to emphasise how important reliability is to us. We fell far short of providing you with a great experience and a reliable service, and we're really sorry for the problems this has caused.
All paying customers will get a full refund for the entirety of May.
This issue started with a huge number of queued events in v2. Our internal system that enforces concurrency limits on v2 slowed their processing, but it also added database load to the underlying v2 queuing engine, Graphile Worker. Graphile Worker is powered by Postgres, so this created a vicious cycle: the bigger the backlog, the more load on the database, and the slower the queue drained.
We've scaled many orders of magnitude in the past year, and all our normal upgrades and tuning didn't work here. We upgraded servers and, most importantly, the database (twice). We tuned some parameters. But as the backlog grew, it became harder to recover from, because of the concurrency limits built into the v2 system. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out very spiky demand.
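As a rough illustration of why this becomes a vicious cycle, here's a hypothetical sketch of a Postgres-backed concurrency check. This is not our actual code, and the table and column names (runs, status, environment_id) are illustrative only; the point is that every attempt to start a queued run issues extra queries against the same primary database that Graphile Worker polls.

```ts
// Hypothetical sketch: a per-environment concurrency limiter backed by
// the same Postgres that hosts the job queue. Table/column names are
// illustrative, not Trigger.dev's actual schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function tryStartRun(
  environmentId: string,
  runId: string,
  limit: number
): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Count currently executing runs for this environment.
    const { rows } = await client.query(
      `SELECT count(*)::int AS active
         FROM runs
        WHERE environment_id = $1 AND status = 'EXECUTING'`,
      [environmentId]
    );
    if (rows[0].active >= limit) {
      // At the limit: leave the run queued. With a large backlog this
      // check re-runs constantly, adding read load to the primary while
      // the job stays in the queue table: the vicious cycle.
      await client.query("ROLLBACK");
      return false;
    }
    await client.query(`UPDATE runs SET status = 'EXECUTING' WHERE id = $1`, [
      runId,
    ]);
    await client.query("COMMIT");
    return true;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```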
Today we reached a new level of scale that exposed weaknesses in parts of the system that had worked well up until now. Reliability is never finished, so work continues tomorrow.
monitoring
May 10, 2024 16:40
Runs have been operating with normal queue times for a couple of hours on both v2 and v3. We're continuing to monitor this and are pushing some more changes that should improve throughput.
v3 was only impacted because v2 runs had spiked load on the shared database.
Summary of the issue: we use Graphile Worker, a Postgres-based queuing system. It is the core of v2 and has scaled really well for us. We've been upgrading the database and the workers that pull jobs over the past 12 months as demand has increased, and it has worked great.
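For context, this is roughly what a minimal Graphile Worker setup looks like (the task name, payload, and connection string here are placeholders, not our production configuration). The worker polls a jobs table in Postgres, which is why queue throughput is tied directly to database health.

```ts
// Minimal Graphile Worker example. Jobs live in a Postgres table; the
// runner polls that table and executes matching tasks.
import { run, quickAddJob } from "graphile-worker";

async function main() {
  const runner = await run({
    connectionString: process.env.DATABASE_URL, // placeholder
    concurrency: 5, // how many jobs this worker runs in parallel
    taskList: {
      // "executeRun" and its payload shape are illustrative only.
      executeRun: async (payload, helpers) => {
        const { runId } = payload as { runId: string };
        helpers.logger.info(`Executing run ${runId}`);
      },
    },
  });

  // Enqueueing a job is a single INSERT into the same database.
  await quickAddJob(
    { connectionString: process.env.DATABASE_URL },
    "executeRun",
    { runId: "run_123" }
  );

  await runner.promise;
}

main().catch(console.error);
```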
Yesterday the performance of the database started getting worse. We're 95% confident this is because of the Graphile queue table. The queue table has to live in the primary database (the one that handles writes). The queue's performance problems slowed down other queries too, which is why v2 and v3 runs were slow to process, and why dashboard performance was degraded for everyone as well: some pages in the dashboard use our read replica, but some have to use the primary.
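A simplified sketch of that read split (the pool setup, environment variables, and queries are illustrative, not our actual code): reads that tolerate replica lag go to the replica, while writes and read-after-write queries must hit the primary, so contention on the primary degrades both run processing and parts of the dashboard.

```ts
// Illustrative primary/replica routing. Pool names, env vars, and the
// "runs" table are assumptions for the sketch.
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const replica = new Pool({ connectionString: process.env.READ_REPLICA_URL });

// Dashboard-style read that can tolerate slight staleness: replica.
export async function listRecentRuns(projectId: string) {
  const { rows } = await replica.query(
    `SELECT id, status, created_at
       FROM runs
      WHERE project_id = $1
      ORDER BY created_at DESC
      LIMIT 50`,
    [projectId]
  );
  return rows;
}

// Writes must go to the primary: the same database under pressure
// from the queue table.
export async function completeRun(runId: string) {
  await primary.query(
    `UPDATE runs SET status = 'COMPLETED', completed_at = now() WHERE id = $1`,
    [runId]
  );
}
```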
We performed the usual upgrades (and more), but that didn't fully fix the problem. It helped a bit, but it didn't return us to a steady state where runs were processing quickly again. Since then we've been upgrading and changing settings on various parts of the infrastructure to improve throughput without overloading the database.
identified
May 09, 2024 21:00
A large increase in usage has caused a big spike in database load and run backlog, which we are working to rectify. Service will be degraded, especially the dashboard, while we fix this.
Dashboard & API uptime over the last 45 days: 99.95%