Incidents | Trigger.dev
Incidents reported on the status page for Trigger.dev
https://status.trigger.dev/

Deployments recovered (Wed, 28 May 2025 16:56:53 +0000)
https://status.trigger.dev/#8a1c2385e59f625e534c6dbdaa58bf6e50c3cf728d7b130b905e803b1fc01b64

Deployments went down (Wed, 28 May 2025 14:57:07 +0000)
https://status.trigger.dev/#8a1c2385e59f625e534c6dbdaa58bf6e50c3cf728d7b130b905e803b1fc01b64

v4 dequeue performance degradation (Mon, 26 May 2025 13:44:00 -0000)
https://status.trigger.dev/incident/580382#576250e9ed29474c014322a87a0922c0a1a6514aa371dca4f2d159d15632302c
v4 dequeue performance has now improved again, and we're working on two things: a short term fix to prevent this from happening again, to be deployed today, and a long term fix for dequeue performance, hopefully shipping this week, which will vastly improve dequeue performance and scaling.

v4 dequeue performance degradation (Mon, 26 May 2025 12:50:00 -0000)
https://status.trigger.dev/incident/580382#0fed09d97fa8743f4dc3c189deaaf2e8e338c04341772401ece26c57b00aaa1a
Dequeues in v4 are slow and we're investigating the cause and trying to deploy a mitigation.

Deployments recovered (Thu, 22 May 2025 09:20:50 +0000)
https://status.trigger.dev/#094161573745754835ad8c20533b9e6338a0e62ea92dcc643f85c87dc3f3651f

Deployments went down (Thu, 22 May 2025 08:19:21 +0000)
https://status.trigger.dev/#094161573745754835ad8c20533b9e6338a0e62ea92dcc643f85c87dc3f3651f

Deployments recovered (Wed, 21 May 2025 16:03:37 +0000)
https://status.trigger.dev/#ce6d0fd019512610a8248c928dde345e751bbd4819c2c38e16483c42b5f01ab3

Deployments went down (Wed, 21 May 2025 15:55:47 +0000)
https://status.trigger.dev/#ce6d0fd019512610a8248c928dde345e751bbd4819c2c38e16483c42b5f01ab3

Trigger.dev cloud recovered (Sat, 17 May 2025 08:37:55 +0000)
https://status.trigger.dev/#fff5fcef25516337d664832e6dab9c9728482a5e87c964e8a156959f9523c649

Realtime recovered (Sat, 17 May 2025 08:37:34 +0000)
https://status.trigger.dev/#3a2b8c5d631a57e52257491a0b86db1cf4862620649dd29cacf1711a1bbaf53d
Trigger.dev API recovered (Sat, 17 May 2025 08:37:23 +0000)
https://status.trigger.dev/#0ba93068f87c34bef2f483586054c9c427597332e70661a2bc3c809aae504f16

Trigger.dev OpenTelemetry recovered (Sat, 17 May 2025 08:37:13 +0000)
https://status.trigger.dev/#d97dcf9a73389365c258fa4eaf87bf105d60dc016dbd5ac7dc1492304e810da1

Trigger.dev cloud went down (Sat, 17 May 2025 08:35:13 +0000)
https://status.trigger.dev/#fff5fcef25516337d664832e6dab9c9728482a5e87c964e8a156959f9523c649

Trigger.dev OpenTelemetry went down (Sat, 17 May 2025 08:34:14 +0000)
https://status.trigger.dev/#d97dcf9a73389365c258fa4eaf87bf105d60dc016dbd5ac7dc1492304e810da1

Realtime went down (Sat, 17 May 2025 08:34:04 +0000)
https://status.trigger.dev/#3a2b8c5d631a57e52257491a0b86db1cf4862620649dd29cacf1711a1bbaf53d

Trigger.dev API went down (Sat, 17 May 2025 08:33:54 +0000)
https://status.trigger.dev/#0ba93068f87c34bef2f483586054c9c427597332e70661a2bc3c809aae504f16

Deployments recovered (Wed, 14 May 2025 19:53:28 +0000)
https://status.trigger.dev/#d31827403b8b27c64f6c8bf1ebfa82c53df36549f4ea918b4d7049b7a22d4d75

Deployments went down (Wed, 14 May 2025 19:32:00 +0000)
https://status.trigger.dev/#d31827403b8b27c64f6c8bf1ebfa82c53df36549f4ea918b4d7049b7a22d4d75

Deployments recovered (Tue, 13 May 2025 09:24:09 +0000)
https://status.trigger.dev/#1910f2d6c52dd41f785b262fbeb0c7bf355a84ded74718327bc627ef90c949b6

Deployments went down (Tue, 13 May 2025 09:19:37 +0000)
https://status.trigger.dev/#1910f2d6c52dd41f785b262fbeb0c7bf355a84ded74718327bc627ef90c949b6

Deployments recovered (Thu, 08 May 2025 21:29:08 +0000)
https://status.trigger.dev/#f767027231ec485e1fa6c8cbd277c216234aefdf56be0f07b4dea4e4ec9a45a2

Deployments went down (Thu, 08 May 2025 17:40:07 +0000)
https://status.trigger.dev/#f767027231ec485e1fa6c8cbd277c216234aefdf56be0f07b4dea4e4ec9a45a2

Deployments recovered (Tue, 06 May 2025 19:02:08 +0000)
https://status.trigger.dev/#1c8384f8ce026d5b86e6d55bd66e5a8d12db52a42170b0d011328220efe82545

Deployments went down (Tue, 06 May 2025 16:14:33 +0000)
https://status.trigger.dev/#1c8384f8ce026d5b86e6d55bd66e5a8d12db52a42170b0d011328220efe82545

v3 runs dequeuing slower than normal (Tue, 06 May 2025 14:28:00 -0000)
https://status.trigger.dev/incident/557448#90cd0508deff5411ef332711b33ada54551777ce35254e15278336267eed5202
Queues are back to nominal length and have been for some time. This issue was caused by a huge influx of queues, which meant we weren't considering them all when selecting queues for dequeuing. We have increased some settings to make this better and we're looking at what we can do in the future to make this scale better for the next 10–100x multiple of queues.
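To make the failure mode above concrete: if a dequeuer only ever considers a bounded sample of queues per pass, a sudden 10–100x jump in the total queue count means many queues are rarely looked at. The sketch below is hypothetical; the names QUEUE_SAMPLE_SIZE and selectQueuesForDequeue are invented for illustration and this is not the actual Trigger.dev dequeuer. It only shows the kind of setting the update says was increased.

```ts
// Illustrative sketch of sampling-based queue selection, not Trigger.dev's code.

interface QueueSnapshot {
  id: string;
  size: number;            // runs currently waiting
  running: number;          // runs currently executing
  concurrencyLimit: number; // max runs allowed to execute at once
}

// How many candidate queues each dequeue pass considers. If the total number of
// queues grows 10-100x while this stays small, many queues are never sampled,
// which is the imbalance described in the incident update above.
const QUEUE_SAMPLE_SIZE = Number(process.env.QUEUE_SAMPLE_SIZE ?? 1000);

function selectQueuesForDequeue(allQueues: QueueSnapshot[]): QueueSnapshot[] {
  // Crude shuffle for illustration, then keep a bounded sample so one pass stays
  // cheap, filtering to queues that still have concurrency headroom.
  const shuffled = [...allQueues].sort(() => Math.random() - 0.5);
  return shuffled
    .slice(0, QUEUE_SAMPLE_SIZE)
    .filter((q) => q.size > 0 && q.running < q.concurrencyLimit);
}
```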
v3 runs dequeuing slower than normal (Tue, 06 May 2025 13:50:00 -0000)
https://status.trigger.dev/incident/557448#aa082ebaae28f8c074ca23078c6977bf46f18dd219a60153a6b8d68e8b421e22
We're dequeuing runs very fast again since the config update. Queues are coming down, mostly capped by the concurrency limits on queues.

v3 runs dequeuing slower than normal (Tue, 06 May 2025 13:29:00 -0000)
https://status.trigger.dev/incident/557448#522fe92084e4c932b9355cfeec3df266b77d1066817e83faaf294a3a2030e6ef
We have identified the issue. A huge flood of new queues has caused an imbalance in the fair queue algorithm. We're deploying a config env var change to the dequeuer now that should fix this. We're also figuring out how we can prevent this issue happening in the future if we hit another order of magnitude in the total queue count.

v3 runs dequeuing slower than normal (Tue, 06 May 2025 13:16:00 -0000)
https://status.trigger.dev/incident/557448#537c5f3081accc93acd19ad042a135b46709781b200574a5f22e02377c4ab6fd
Version 3 queues aren't processing as fast as normal. Runs are still processing but not at the normal speed so some queues are getting longer. We're investigating why this is happening and will provide updates.

v3 runs dequeuing slower than normal (Thu, 24 Apr 2025 16:05:00 -0000)
https://status.trigger.dev/incident/550447#eddcb8a9db378d4925503dc10b4b0090425596b06ad640f34e1edbe57c0a8120
Queues have been operating at full speed since 16:05 UTC. We have found an edge case in the dequeue algorithm that can cause slower dequeue times. We're looking into a fix.
v3 runs dequeuing slower than normal (Thu, 24 Apr 2025 15:50:00 -0000)
https://status.trigger.dev/incident/550447#2c8594844ded712c8333733594a45141077c918f70cd0bebd75c33e19212e0b6
Runs are processing fast again now and we are quickly catching up. We are looking into why this happened; it's something to do with dequeuing v3 runs.

v3 runs dequeuing slower than normal (Thu, 24 Apr 2025 15:30:00 -0000)
https://status.trigger.dev/incident/550447#825ab97595809431541ea26bff1a49ae937c616202e7adc6bc9e243b0c350374
Queues are processing but slower than normal. We're investigating and will update when we know more.

Deployments recovered (Fri, 11 Apr 2025 16:25:57 +0000)
https://status.trigger.dev/#6cae707326a29dfdca3b37570d1fec3a33ef2a7979ffdbc1efe6d03ab26c06db

Deployments went down (Fri, 11 Apr 2025 16:00:56 +0000)
https://status.trigger.dev/#6cae707326a29dfdca3b37570d1fec3a33ef2a7979ffdbc1efe6d03ab26c06db

Deployments recovered (Mon, 07 Apr 2025 19:33:16 +0000)
https://status.trigger.dev/#21674a88f5d2d1144c34745f252f2ca345bb6c015a4945da692f85ccae6767c6

Runs are dequeuing slower than normal (Mon, 07 Apr 2025 19:09:00 -0000)
https://status.trigger.dev/incident/541357#14c5ce1c71beeae8d4b60aa2d87e88edd39fbc55fcc6a89d0c2eb525aab7d623
Runs have been dequeuing quickly for some time now, so we're marking this as resolved. We're continuing to monitor it closely. Runs dequeued for the entire period but queue times were longer than normal, across all customers. The vast majority of queues have already reduced back to normal length or will soon. We suspect this was caused by an underlying Digital Ocean networking issue that meant our Kubernetes control plane nodes were slow to create and delete pods. We are trying to figure out if there's anything we can do in the short term to reduce the likelihood of this happening again. We are planning to move our primary US worker cluster to AWS, which is where we already host our database, dashboard, API, and all other services apart from the worker cluster. This decision was made to increase reliability and improve response times.

Deployments went down (Mon, 07 Apr 2025 18:20:47 +0000)
https://status.trigger.dev/#21674a88f5d2d1144c34745f252f2ca345bb6c015a4945da692f85ccae6767c6

Runs are dequeuing slower than normal (Mon, 07 Apr 2025 16:56:00 -0000)
https://status.trigger.dev/incident/541357#0202d2876befab333063fbea101f0b7a0694b784209730b9cfd9e610fd511b8e
Run queues are processing slower than normal due to an issue with our Kubernetes control planes. We are investigating and will update this status as soon as we know more.
Deployments recovered (Fri, 04 Apr 2025 17:43:19 +0000)
https://status.trigger.dev/#9ee3b8880891e850d48d262238dc53323fe164899c4088217c70dff0f457a899

Deployments went down (Fri, 04 Apr 2025 16:29:18 +0000)
https://status.trigger.dev/#9ee3b8880891e850d48d262238dc53323fe164899c4088217c70dff0f457a899

Deployments recovered (Fri, 28 Mar 2025 15:19:57 +0000)
https://status.trigger.dev/#3d71098344fe8abb0730f65820edc64555b2d9609452a007ff0287dccd1ef938

Deployments went down (Fri, 28 Mar 2025 15:05:27 +0000)
https://status.trigger.dev/#3d71098344fe8abb0730f65820edc64555b2d9609452a007ff0287dccd1ef938

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:48:00 -0000)
https://status.trigger.dev/incident/532422#b5e3541e6835939fb1378c51eaef78154fa07baab15abd02d22614e82999f4c5
Cloudflare R2 is back online and uploads of large payloads and outputs have resumed. We'll continue to monitor the situation.

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:43:00 -0000)
https://status.trigger.dev/incident/532422#744a88a12c86ed784717193521a9a6656ac6542d4cf0104de1827171318381b5
We've investigated trying to temporarily switch from R2 to AWS S3 but unfortunately we cannot easily do that without breaking backwards compatibility. We're continuing to keep an eye on the Cloudflare Discord for any changes or expected timelines, of which there are none yet.

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:11:00 -0000)
https://status.trigger.dev/incident/532422#6aff514423860e46764ccdc7de284dc44763a4fb96c1a3a2cd70dfac817db4dc
There is now an incident report on Cloudflare Status: https://www.cloudflarestatus.com/incidents/v6c7l22vglw5

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:02:00 -0000)
https://status.trigger.dev/incident/532422#81978f799314c98474b5515884a1c5c0d6dfa5315ae5690392e4e9cea1568adf
We've confirmed an ongoing issue with Cloudflare R2 that started approx 20 minutes ago, with this message from their support:
> There is an ongoing R2 outage at the moment, expect GET/PUT requests and uploads to intermittently fail.

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 21:55:00 -0000)
https://status.trigger.dev/incident/532422#9ae796254688cf699542c895cf412e37f0d1d08c2ae024e7716061565caece95
We're currently experiencing an issue with a downstream provider (Cloudflare R2) which we use to store large task payloads and outputs. We're investigating this at the moment and hope to provide more information shortly.
Deployments recovered (Wed, 12 Mar 2025 14:32:49 +0000)
https://status.trigger.dev/#3b95d5cea1db82676370690791dd77e5ba51061069a6ff0bd0f19aac6385cb67

Deployments went down (Wed, 12 Mar 2025 11:59:48 +0000)
https://status.trigger.dev/#3b95d5cea1db82676370690791dd77e5ba51061069a6ff0bd0f19aac6385cb67

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 21:15:00 -0000)
https://status.trigger.dev/incident/524477#b1d5c4643ae92b31afab7994c7b1a0f4d0b2c9e910531c6523dfba5ced23c85d
We are confident that most queues have caught up again but are still monitoring the situation. If you are experiencing unexpected queue times this is most likely due to plan or custom queue limits. Should this persist, please get in touch.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 20:56:00 -0000)
https://status.trigger.dev/incident/524477#dfdac39024dad2d3d701af8acf516dfb395ab50e09dfd2036488bff4a3a69192
The service is stable again and metrics are looking good. We're still catching up with a backlog of runs. You may see increased queue times until this is fully resolved. We'll keep you updated.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 20:16:00 -0000)
https://status.trigger.dev/incident/524477#43ffc749ad14e37bf75ebb023fb2f31bc8822f96b72591b965de3eafe1400f78
We managed to clear the huge backlog of VMs that were completed but hadn't been cleaned up like they normally are. This was causing a lot of issues, including the initial drop in runs starting and another drop between 8:00 and 8:13pm UTC. There's a backlog of runs and it will unfortunately take a bit of time for everything to catch up. We're actively monitoring the situation and doing what we can to improve throughput.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 19:30:00 -0000)
https://status.trigger.dev/incident/524477#356485aecb07798b8938dc7a6fab2fbf24e739c70ac33947ff7a75aacc660ad2
Runs are executing at the normal speed (from 7.30pm UTC). There is a backlog of runs to work through. There's still a problem we're looking at where the completed run VMs aren't being cleared properly. We think that's what caused this issue in the first place. It hasn't happened before and there have been no code changes.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 19:17:00 -0000)
https://status.trigger.dev/incident/524477#a1183aa6c52507a6c8c8be0dab60f11ab8909725a44c8957b085d90ff7d1bbbe
A significant proportion of runs are not starting. There is an issue in our worker cluster and we are trying to diagnose it. There have been no deployments today. We're unsure at this point if this is a cloud provider issue or not.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 18:48:00 -0000)
https://status.trigger.dev/incident/524477#95a49db3c93b064dae392196377e792acf52a9f6c84f522e6be3a161a17014cc
There is an issue in the worker cluster that we are investigating. It is causing runs not to start quickly.
Deployments recovered (Fri, 07 Mar 2025 11:41:40 +0000)
https://status.trigger.dev/#3bc8d584e6fb8d6a4307bc80c8e30be20c4a9001db66507bce25ed218825171e

Deployments went down (Fri, 07 Mar 2025 11:03:55 +0000)
https://status.trigger.dev/#3bc8d584e6fb8d6a4307bc80c8e30be20c4a9001db66507bce25ed218825171e

Uncached deploys are causing runs to be queued (Tue, 04 Mar 2025 12:40:00 -0000)
https://status.trigger.dev/incident/522499#6a7b56529abf610d1a00a7f5e6534b92af23ba0185b57bd8a6f72d4579cd8e78
We tracked this down to a broken deploy pipeline which reverted one of our internal components to a previous version. This caused a required environment variable to be ignored. We have applied a hotfix and will be making more permanent changes to prevent this from happening again.

Uncached deploys are causing runs to be queued (Tue, 04 Mar 2025 12:25:00 -0000)
https://status.trigger.dev/incident/522499#4c91a51eedb4e236d0caf65406a5c72f2ec5dde0aedf20eb4de6c5d99c8db790
All runs should be dequeuing again. Runs that were impacted by this will start dequeuing again; these runs were in the queue for around 30 mins. This issue was caused by new servers not pulling the correct authentication details for pulling Docker images. We are trying to determine why this happened, as we verified the env vars were set correctly but they weren't being picked up. We're going to continue actively monitoring this until we're happy that it's fully resolved.

Uncached deploys are causing runs to be queued (Tue, 04 Mar 2025 11:30:00 -0000)
https://status.trigger.dev/incident/522499#3ec8f888b56a6cfd39b2f348833f92dd203c700395b4767586901b095ca7f208
Some runs associated with new deploys are not processing from queued to executing. This is an issue with pulling new deploys through the Docker image cache. Deploys from before 11:30 UTC should not be impacted (unless they haven't done a run in the past few days). We're working on a fix for this.

Slow queue times and some runs system failing (Sun, 09 Feb 2025 18:10:00 -0000)
https://status.trigger.dev/incident/510033#23aaffab65b269d1ca4f9791893dd98dd8ff6eb2d7ee8825cd5df3956da2d253
Queue times and timeout errors are back to their previous levels. Note that start times of containers are still slower than they should be, especially if you aren't doing a lot of runs. There's a GitHub issue about slow start times and what we're doing to make them consistently fast: https://github.com/triggerdotdev/trigger.dev/issues/1685 It will start with a new Docker image caching layer that will ship tomorrow.
## What's causing these problems
We've had a more than 10x increase in load in the past 7 days. Some things that have worked well for the past few months now work less well at this new scale. Some of those issues can compound under high load and cause more significant issues.
Slow queue times and some runs system failing (Sun, 09 Feb 2025 17:35:00 -0000)
https://status.trigger.dev/incident/510033#296c0d34d68b5481175a2ed5b3d3cfd92f25da990ecb84f20e7f741b8884bbd6
We have made some adjustments to stop this issue and we're still looking into it.

Slow queue times (Thu, 30 Jan 2025 10:10:00 -0000)
https://status.trigger.dev/incident/504368#86d3eec3d2bd41318c1e54abeee40f44eb9367d29fcdf6eed5a17bd5f3769b44
Queue processing performance is back to normal because there's been a reduction in demand. We have identified the underlying bottleneck and are working on a permanent fix. This shouldn't be a major change and should be live soon. There is a high degree of contention on an update when a single queue's concurrencyLimit is different on every call to trigger a task. This is an edge case we haven't seen anyone do before.
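To make the contention pattern above concrete, here is a hedged sketch. It assumes a per-trigger queue option roughly like the one the update describes; the triggerTask function, its option shape, and the SQL in the comment are invented stand-ins, not the Trigger.dev SDK or schema. Passing a different concurrencyLimit on every trigger forces the shared queue record to be rewritten each time, so concurrent triggers serialize on the same row lock; pinning the limit once avoids that.

```ts
// Illustrative sketch of the contention anti-pattern, not Trigger.dev's code.
// Assume each trigger carrying a different concurrencyLimit results in something like:
//
//   UPDATE task_queues SET concurrency_limit = $1 WHERE id = $2;
//
// so thousands of triggers all contend on the same queue row's lock.

// Hypothetical client shape, for illustration only.
declare function triggerTask(
  name: string,
  payload: unknown,
  options?: { queue?: { name: string; concurrencyLimit?: number } }
): Promise<void>;

// Anti-pattern: a limit that changes on every call, so every trigger rewrites the
// same queue row.
async function noisyTrigger(i: number) {
  await triggerTask("process-item", { i }, {
    queue: { name: "items", concurrencyLimit: 1 + (i % 50) },
  });
}

// Lower-contention alternative: fix the limit once on the queue and omit it per call.
async function quietTrigger(i: number) {
  await triggerTask("process-item", { i }, { queue: { name: "items" } });
}
```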
Deploys are failing with a 520 status code (Fri, 24 Jan 2025 19:28:00 -0000)
https://status.trigger.dev/incident/500864#54a0223b12fbed954c1bf19a19e547575ca6e03d9aff670b2a254061a10efdeb
**Important: Upgrade to 3.3.12+ in order to deploy again**
If you use npx you can upgrade the CLI and all of the packages by running: npx trigger.dev@latest update
This should download 3.3.12 (or newer) of the CLI and then prompt you to update the other packages too. If you have pinned a specific version (e.g. in GitHub Actions) you may need to manually update your package.json file or a workflow file. Read our full package upgrading guide here: https://trigger.dev/docs/upgrading-packages

Deploys are failing with a 520 status code (Fri, 24 Jan 2025 13:17:00 -0000)
https://status.trigger.dev/incident/500864#bf35dac25abe349e5b9f7d7675dcec9f1c87bc25118dd7e2ee9604249bdfb156
The 520 errors are still happening with our Digital Ocean container registry, which is preventing the vast majority of deploys from working. We are speaking to their engineers but so far they haven't diagnosed a fix. There have been no code changes from our side which caused this issue – it comes from a change somewhere in one of the third parties we use. This is our deploy pipeline:
1. The Trigger.dev CLI runs the deploy command
2. Docker builds happen on Depot.dev
3. Depot pushes the image to our registry proxy (where we add registry credentials)
4. Our registry proxy pushes to Digital Ocean Container Registry (with the auth credentials)
From our logs we think the issue is that Digital Ocean have changed something that means they're now rejecting pushes they were accepting before. We're working on two solutions in parallel:
1. Switching our container registry to something else, probably Docker Hub. This isn't super simple because we need to make it work with our proxy so we don't leak security credentials to users.
2. Stop using the proxy. This means generating temporary credentials that can only push to your project's repository. Those temporary tokens will be sent to the CLI, which means Depot.dev can push directly. AWS ECR supports doing this. Unfortunately this will require an updated CLI package to deploy…
We'll provide another update as soon as we have more information.

Deploys are failing with a 520 status code (Thu, 23 Jan 2025 23:20:00 -0000)
https://status.trigger.dev/incident/500864#c176d306b698ec88e8ecebae6cc75ca2cd728c91234c3af92b638d1cbf55e1a0
We have confirmed that this is an issue pushing new images to our Digital Ocean Docker container registry. It's returning 520 errors. We have submitted priority support tickets and are speaking to the Digital Ocean engineering teams about the problem. Currently we're waiting on them.

Deploys are failing with a 520 status code (Thu, 23 Jan 2025 21:10:00 -0000)
https://status.trigger.dev/incident/500864#48310e57a1f571c005f09bd4be88eff73636d037439f1cf0c2dbf30e571398e9
We're investigating why this is happening.

Realtime is degraded (Thu, 02 Jan 2025 12:07:00 -0000)
https://status.trigger.dev/incident/489157#5d597f41958b63edb6110871081af0ac71593bce0b0c71becf8c0d0cca2e9d88
The ElectricSQL team created a fix for the Postgres transaction wraparound issue and we've confirmed that it is now deployed and fixed on Prod.

Realtime is degraded (Wed, 01 Jan 2025 21:58:00 -0000)
https://status.trigger.dev/incident/489157#8a7d4151c80165ca3e67c0f5b0f04cec4d6f022c11ecc5c4c436abef9e0d1363
We've narrowed down the exact cause of the issue that is happening in Electric and are awaiting their engineer to get some sleep and submit a patch in the morning. Electric "subscribes" to updates in our database and compares the "transaction ID" or "xid" of the update in the PostgreSQL WAL with the "current xmin". Our database has a "current xmin" which is very large (4.3b) and which is over the maximum transaction ID of 2^32. This is normal and is a feature of the transaction ID wraparound logic in PostgreSQL. So new transactions have "wrapped around" back to 0 and now have lower transaction IDs than the current snapshot, and Electric isn't modulo'ing the current snapshot xmin before comparing the transaction ID, so it is basically detecting all transactions as "old" even though they are newer.
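The comparison bug described above comes down to comparing 32-bit transaction IDs as plain integers instead of modulo 2^32. Below is a minimal, hypothetical sketch of wraparound-aware comparison, in the spirit of Postgres's own TransactionIdPrecedes; it is not ElectricSQL's actual patch, and the example values are made up to show the failure.

```ts
// Illustrative only: wraparound-aware comparison of 32-bit Postgres transaction IDs.

// Returns true if xid `a` logically precedes xid `b`, treating the 32-bit xid space
// as a circle: the signed 32-bit difference decides which side of the wrap `a` is on.
function xidPrecedes(a: number, b: number): boolean {
  const diff = (a - b) | 0; // coerce the difference to signed 32-bit
  return diff < 0;
}

// Naive comparison: treats a post-wraparound xid (small number) as "older" than a huge
// pre-wraparound xmin (~4.3 billion), which is the bug described in the update above.
function naivePrecedes(a: number, b: number): boolean {
  return a < b;
}

// Example: a new transaction just after wraparound vs. a snapshot xmin near 2^32.
const snapshotXmin = 4_294_967_000; // close to 2^32
const newXid = 1_200;               // wrapped around back towards 0
console.log(naivePrecedes(newXid, snapshotXmin)); // true  -> wrongly treated as old
console.log(xidPrecedes(newXid, snapshotXmin));   // false -> correctly treated as newer
```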
Realtime is degraded (Wed, 01 Jan 2025 15:58:00 -0000)
https://status.trigger.dev/incident/489157#777020e56b63f758587f0a5361d0bfa0ed927ee3f782caca9197a2baf0e69d1a
The very helpful and responsive engineers of Electric SQL (which powers Realtime) have possibly found a reason and are working on a reproduction. We're ready to deploy updates to our electric server as soon as those are made available to us. The short story is it's related to a transaction ID wraparound issue, but we'll update when we know more.

Realtime is degraded (Wed, 01 Jan 2025 04:11:00 -0000)
https://status.trigger.dev/incident/489157#c71c0588d4e73dd61895a9ff9ff7cf0334c6c06def5eb7fd9056f85f3a73a85e
Nothing we have tried has fixed the issue. We're working with the electric team but unfortunately the timing is not great... most people are either sleeping or celebrating new years 🎉. We're hoping to be able to get this issue resolved ASAP given those constraints, and will update again soon.

Realtime is degraded (Wed, 01 Jan 2025 00:23:00 -0000)
https://status.trigger.dev/incident/489157#64c63030b895171adf3fa1d32b9ef335e1127269f7a6078b6d9e5cb6e018a030
We've exhausted our current runbook related to fixing Realtime issues and the service remains degraded. We're continuing to investigate and will update as soon as we are able to provide more information.

Realtime is degraded (Tue, 31 Dec 2024 17:30:00 -0000)
https://status.trigger.dev/incident/489157#4f694f6e956b477b3d3f685e9044fff8b884c27f2e19699dbc1398791be512c4
Some runs are not being updated by Realtime. All other systems are operating normally.

Deploys are failing due to a downstream provider (Mon, 02 Dec 2024 22:09:00 -0000)
https://status.trigger.dev/incident/471215#cfb1cc8da3aa94fcbb900565d21b816e4bb5354a7ae07a8f71e8612ff3bb57dd
We have moved all deploys to Europe on Depot as a temporary fix while they fix the underlying issue in the US. Deploys will be a bit slower than normal and the first one won't use the cache, but they should work.

Deploys are failing due to a downstream provider (Mon, 02 Dec 2024 21:00:00 -0000)
https://status.trigger.dev/incident/471215#ca375c9fabefdd300329aa31573cc1e9849f9cd1647f01afbb14a0a044438583
Many deploys of tasks are failing due to an issue with a downstream provider. They're working on a fix. You can follow their status updates here: https://status.depot.dev/ We'll update when we know more. This only impacts deploys, everything else is functioning normally.

V3 runs are processing slowly (Sat, 23 Nov 2024 00:55:00 -0000)
https://status.trigger.dev/incident/466240#7802873d664a43ff807f05ea2f118c9e3f79e7387d312893592b69d34ac77fbe
Runs are processing normally again, queues should come down fast. The Kubernetes database etcd didn't allow new values. Increasing max sizes, restarting, and changing some other settings worked.

V3 runs are processing slowly (Sat, 23 Nov 2024 00:14:00 -0000)
https://status.trigger.dev/incident/466240#96df1076cb767d3cafd5d8a79853db0216775e20667d13ef69621a229deb146f
Our primary worker cluster is experiencing Kubernetes issues that are preventing some pods from being created.

Realtime (beta) is offline (Fri, 22 Nov 2024 19:36:00 -0000)
https://status.trigger.dev/incident/466075#066c3aa3162f168154bd451aeee6b2b4afb57e7c905365a36fca09dcbf165f3e
Realtime is back online. We've made some configuration changes and have some more reliability fixes in progress to make this rock solid.

Realtime (beta) is offline (Fri, 22 Nov 2024 17:45:00 -0000)
https://status.trigger.dev/incident/466075#ab2d5a46bd6f1cfbf9a31ca9ce9347fcb0ab2d51ef5ff7491e9b6a766a330f51
Our Realtime (beta) v3 feature is offline. We suspect it was causing degradation of other core services and so made the decision to disable it. We are actively working on this so we can get it back online.

V2 runs are processing slowly (Fri, 08 Nov 2024 16:30:00 -0000)
https://status.trigger.dev/incident/458173#fe0650f4a60ba75f685cc82ede99a80513aca6b2190194c708c3ff8d1da4832e
V2 queues are caught up. Now any queued runs are due to concurrency limits. V3 was not impacted during the entire period. We restarted all V2 worker servers and V2 runs started processing again. We are still investigating the underlying cause to prevent this happening again. There were no code changes or deploys during this period and the overall V2 load wasn't unusual.
V2 runs are processing slowly (Fri, 08 Nov 2024 16:08:00 -0000)
https://status.trigger.dev/incident/458173#e1ff88f9875037a0cc1a6bd8eed98aed885955985a86076a948b63f6b00540fd
V2 runs are now dequeuing quickly and the backlog is catching up to normal. We'll update when there are nominal queue times.

V2 runs are processing slowly (Fri, 08 Nov 2024 15:25:00 -0000)
https://status.trigger.dev/incident/458173#aef3136f68c53e4bccd89665d4b453e2f3785709f5d5da830a3c754996149666
V2 runs are in the queue for longer than normal. We're investigating what's causing this and working on a fix.

Realtime service degraded (Fri, 01 Nov 2024 00:38:00 -0000)
https://status.trigger.dev/incident/454137#8b5ebd7b2108e07e692bd26355056e399e29795dbe7c3596c55059b8189b876a
Realtime is recovering after a restart and a clearing of the consumer cache, but the underlying issue has not been solved. We're still working on a fix and will update as we make progress.

Realtime service degraded (Thu, 31 Oct 2024 23:52:00 -0000)
https://status.trigger.dev/incident/454137#9658f921e8122bab3c8d8b4904566fe5060ad0ce5d26559ff8929c9622dabdba
We've just discovered an issue with our realtime service, where our realtime server is crashing and is not able to consume new changes from the database, and thus not able to send new updates out through our realtime system. We're working on a fix but don't have an ETA at this time.

Dashboard instability and slower run processing (Fri, 25 Oct 2024 17:47:00 -0000)
https://status.trigger.dev/incident/449476#aa82725fe6aa7d1c6d51c08c3edc2f5924a11c9b6ee7cc0629b420cde204249a
The networking issues from our worker cluster cloud provider are no longer happening. Networking has been back to full speed for the past 10 minutes and runs are processing fast.

Dashboard instability and slower run processing (Fri, 25 Oct 2024 17:34:00 -0000)
https://status.trigger.dev/incident/449476#bee76cc4fff4db3be6086b82e808e3e34a4343037ed46487e5674a38262199b6
v3 runs are processing slowly. We think this is due to an intermittent networking issue with our worker cluster cloud provider. We are investigating and escalating this issue with them. This isn't due to a code change.

Dashboard instability and slower run processing (Wed, 23 Oct 2024 17:45:00 -0000)
https://status.trigger.dev/incident/449476#57a1c6ef21e2cff8744e041f45ab48966414f5c5e98152b59bcc4a9f1bb7e5b8
The API and dashboard have been back to normal for some time. Runs are processing fast. We are working on stopping this from happening again. We have identified the JSON data that caused this Node.js crash but it's not at all clear why it crashed V8, as it's valid JSON.
Dashboard instability and slower run processing (Wed, 23 Oct 2024 17:25:00 -0000)
https://status.trigger.dev/incident/449476#e2fba50651e2337d3bf2094ac7f90587113a82d03da82cd5d755f3402d9409f3
A crash caused some dashboard instability and has slowed run processing down. We're working to get all run queues back to their normal nominal size. We know what caused the crash (some unexpected user data that somehow crashed Node.js). We have stopped the user from doing this temporarily until we have a permanent fix in place.

Some processing is slower than normal (Tue, 24 Sep 2024 19:28:00 -0000)
https://status.trigger.dev/incident/434184#c291b2ad5013c365a4b9127478bd3cd8de7f7ebe2a59fe6087c0848b149b5a95
This issue is resolved, everything is back to normal. This issue was caused by an exceptionally large number of v3 run alerts, caused by a run that was failing (from user code, not a Trigger.dev system problem). This caused us to hit Slack rate limits, which slowed the processing down more. We have scaled up the system that deals with this so it can handle it better. We've also changed the retry settings for sending Slack alerts so they don't retry as aggressively.
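For illustration, here is a hedged sketch of the kind of less-aggressive retry behaviour the update above describes: honour Slack's Retry-After header on 429 responses and cap the number of attempts instead of retrying immediately. The function name, attempt count, and backoff values are assumptions, not Trigger.dev's actual alert code.

```ts
// Illustrative only: rate-limit-aware retries for alert delivery.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function postSlackAlert(webhookUrl: string, text: string, maxAttempts = 4): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(webhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    });

    if (res.ok) return;

    if (res.status === 429) {
      // Slack tells us how long to wait; fall back to exponential backoff if absent.
      const retryAfter = Number(res.headers.get("retry-after") ?? 0);
      const backoffMs = retryAfter > 0 ? retryAfter * 1000 : 2 ** attempt * 1000;
      await sleep(backoffMs);
      continue;
    }

    // Non-retryable error: give up rather than hammering the API.
    throw new Error(`Slack alert failed with status ${res.status}`);
  }

  throw new Error("Slack alert failed after all retry attempts");
}
```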
Some processing is slower than normal (Tue, 24 Sep 2024 19:00:00 -0000)
https://status.trigger.dev/incident/434184#e9b363c66b20ac68733076dfb9535845138c1683d590d1e799349edcd77eb8c8
Impacted:
- v2 run queues are processing slower than normal
- v3 alerts are taking longer than normal to send
- v3 scheduled runs may start slightly later than scheduled
- v3 triggerAndWait/batchTriggerAndWaits are taking longer than normal to continue their parent runs
We are working to get service back to normal.

Our emails aren't sending (downstream provider issue) (Sat, 24 Aug 2024 14:45:00 -0000)
https://status.trigger.dev/incident/418604#e4af28f1361b09db5268f552f3483bd2fba6f07c07b61c00ac171ae6e8827109
Resend is back online so magic link and alert emails are working again.

Our emails aren't sending (downstream provider issue) (Sat, 24 Aug 2024 12:00:00 -0000)
https://status.trigger.dev/incident/418604#ad909c2c21439b75db4c92783bc9902e5cc35700033c4a133252b9c58029345b
Resend, our email provider, is currently down in the US. You can follow their status here: https://resend-status.com/incidents We'll update when we know more. This impacts:
- Magic link login
- Email alerts

Some v3 runs are crashing with triggerAndWait (Tue, 20 Aug 2024 18:40:00 -0000)
https://status.trigger.dev/incident/416656#d3a8f212b91b5d0dba30342e1674d43223fe7b72ae5e699cfeaad3fb03498706
A fix has been deployed and tested. We're just confirming that it has fixed all instances of this issue.

Some v3 runs are crashing with triggerAndWait (Tue, 20 Aug 2024 17:00:00 -0000)
https://status.trigger.dev/incident/416656#4373e3e298e342d148be0fd2d93655b282cd0e4b1f0b2279ff14e7e22c11e99f
You'll see the status as CRASHED and the error will say: "Invalid run status for execution: WAITING_TO_RESUME". We've diagnosed the issue and are looking to ship a fix quickly.

v3 runs are starting slower than normal (Thu, 15 Aug 2024 21:45:00 -0000)
https://status.trigger.dev/incident/414402#34f97ddc14335244a67960322243b042bf565b8610edde579b7ac3d7b7859f9a
Runs are processing very fast now and everything will be fully caught up in 30 mins. The vast majority of organizations caught up an hour ago. If you were executing a lot of runs during this period, unfortunately it is quite likely that some of them failed. You can filter by Failed, Crashed and System Failure on the Runs page. Then you can multi-select them and use the bulk actions bar at the bottom of the screen to mass replay them. We're really sorry about this incident and the impact it's had on you all. This wasn't caused by a code change and wasn't a gradual decline in performance, so it was hard to foresee. Some critical system processes in our primary database started failing, causing locking transactions. This wasn't obvious at the time unfortunately. We will be doing a full write-up of this incident tomorrow and we have an early plan of some tools we're going to use to ensure this doesn't happen again.
v3 runs are starting slower than normal (Thu, 15 Aug 2024 19:40:00 -0000)
https://status.trigger.dev/incident/414402#c3e970afee5b70da611e0982abd7f8b69e137c9a5ccbe3a80c6337ffa46625c3
The dashboard and API are back to full functionality. We are processing lots of runs again but there is a big backlog because of the slow processing over the past couple of hours. We're working hard to catch the queue up. Our primary database had entered into an unrecoverable state for an unknown reason, with permanently locked transactions and many critical underlying processes that weren't functioning properly. Unfortunately this wasn't obvious. We switched our primary database ("failover") to one of our replicas and performance immediately returned to normal. Sorry folks, this wasn't caused by a code change or a gradual decline in performance, so it was hard to foresee. We have found a specialist database monitoring tool that we are going to use to prevent this from happening again, or hopefully make it obvious if it ever does. I'll update here when queue sizes are completely back to normal.
v3 runs are starting slower than normal (Thu, 15 Aug 2024 17:14:00 -0000)
https://status.trigger.dev/incident/414402#b8313a880f632070ef8755337d2106a09657a2b568e037f0bc23a5ebfff16634
V3 runs are starting slower than normal and some runs are failing because of database transaction timeouts. We're deploying changes to try and fix this but have not determined the root cause yet.

v3 runs are starting slower than normal (Thu, 15 Aug 2024 12:45:00 -0000)
https://status.trigger.dev/incident/414402#7bd13f517300eca286d35ad9eb6411b98ffba3c73ce094e9ccaa2b1446f4d199
We're experiencing very high database load which is causing v3 runs to be queued for longer than normal before starting. We're investigating the root cause of this and how to alleviate it.

v3 runs are slower than normal to start (Thu, 18 Jul 2024 22:14:00 -0000)
https://status.trigger.dev/incident/400241#5e6bd0054354fb4e0ce951204125aa7e4295c82659a4ecf3a7d3a3fb10e810c2
v3 runs are now operating at full speed. We were unable to spin up more servers and so the total throughput of v3 runs was limited. This caused a lack of concurrency and so runs started slower than normal, although they were being fairly distributed between orgs. We managed to fix the underlying issue that was causing servers not to spin up.

v3 runs are slower than normal to start (Thu, 18 Jul 2024 21:22:00 -0000)
https://status.trigger.dev/incident/400241#fc4225a1c42b99305ae80925cd6cc93ea5a6664d62df49dc576abae225798a43
Our v3 cluster isn't able to spin up new servers so we can process more runs at once – this is why our concurrency is lower and start times are slower than normal. This seems to be an underlying issue with our cloud provider.

v3 runs are slower than normal to start (Thu, 18 Jul 2024 20:50:00 -0000)
https://status.trigger.dev/incident/400241#f84823ad744b461c913187be6a94a48787bc79e1e6cdb35a82657d88717d4bce
v3 runs are processing but they're starting slower than normal. We're working on a fix for this.

Dashboard/API is down (Thu, 11 Jul 2024 10:55:00 -0000)
https://status.trigger.dev/incident/396747#0651a085c0f5a3e11883b6d0f438130797945d43509b583bbc66961824f63c50
The platform is working correctly again. Runs will pick back up. Some runs that were in progress may have failed. This was caused by a bad migration that didn't cause the deployment to fail automatically, and so it rolled out to the instances.
Dashboard/API is down (Thu, 11 Jul 2024 10:47:00 -0000)
https://status.trigger.dev/incident/396747#68ea9f859f91fb5265c3fbcf720b1986051de37feaf2cc852cb46fc3d912e155
The dashboard and API are down due to a deployment. We're working to fix this.

Dashboard and API degraded (Thu, 20 Jun 2024 23:00:00 -0000)
https://status.trigger.dev/incident/387209#084703501c3495720c44eb25af68107261cd748c7cd4dcc8771c6e0a228c72f4
We're continuing to monitor the situation but API and Dashboard services have been fully restored for the last 15+ minutes, and we think we have found the issue. We'll continue to monitor.

Dashboard and API degraded (Thu, 20 Jun 2024 21:28:00 -0000)
https://status.trigger.dev/incident/387209#9afa9fa43c559230a5530e9807262c76d675f02dbec0685ef4ef6f480e856b0f
We're once again dealing with an issue that is causing a CPU spike in our production API and Dashboard server instances. We're investigating and will update when we have any more news to share.

Our API and dashboard response times are elevated (Thu, 20 Jun 2024 14:35:00 -0000)
https://status.trigger.dev/incident/387013#5a90d63723377a9539abd559eafe7c36426a2b9c21948cbd454c630dcdbc93e3
API and cloud response times are back to normal.
Our API and dashboard response times are elevated (Thu, 20 Jun 2024 14:05:00 -0000)
https://status.trigger.dev/incident/387013#845cef828c2070857f8c4febfd05a2a11d9502360bbdb4a580373085e6fc47bf
We're experiencing higher than normal response times in the API and the dashboard. v2 and v3 run queue times are normal and runs are still executing. General system load is normal. We are investigating what's causing this and will update; it looks like it's one of our providers.

Some v3 runs are failing (Wed, 19 Jun 2024 10:56:00 -0000)
https://status.trigger.dev/incident/386440#649950ac4bdd9b171d151612b2702b07895c6a11ca4a337c730cf6b7bdd63795
v3 runs are back to normal. There was an abnormal number of runs in the System Failure state because some of the data being passed from the workers back to the platform was in an unexpected format. We are bulk replaying runs that were impacted.

Some v3 runs are failing (Wed, 19 Jun 2024 10:50:00 -0000)
https://status.trigger.dev/incident/386440#a26d263bf20006e6650931d20d74bcc3217e5db69ff203e7c7decc08fae13085
v3 package version `3.0.0-beta.38` (that was released an hour ago) is throwing a `sendWithAck() timeout` error at the end of an attempt when trying to send the logs. We've released a new package `3.0.0-beta.39` that should fix this.

Some v3 runs are failing (Wed, 19 Jun 2024 10:35:00 -0000)
https://status.trigger.dev/incident/386440#15c5b6960309c3af74affcf34331187127fc98de556740b6c65fd4aca28bce52
We're investigating why more v3 runs are failing than normal.

v3 runs are paused due to network issues (Thu, 13 Jun 2024 12:20:00 -0000)
https://status.trigger.dev/incident/383771#a581fe9e8d84197481ff9939fb61b6f468c2dec4058599f3eac745a0a85a0b2a
Runs are operating at full speed. We think this issue was caused by the clean-up operations that clear completed pods. There are far more runs than a week ago, so that list can get very large, causing a strain on the system including internal networking. We've increased the clean-up frequency and are monitoring the load, including networking. After 15 mins everything seems normal.

v3 runs are paused due to network issues (Thu, 13 Jun 2024 12:15:00 -0000)
https://status.trigger.dev/incident/383771#a973384888024747be31822f6c40df51fae9723c1b3903f85470cbdc708037f6
v3 runs are processing with slightly reduced capacity in our cluster. Some nodes that we've isolated have network issues. We're still trying to diagnose the root cause to prevent this from happening again.

v3 runs are paused due to network issues (Thu, 13 Jun 2024 11:19:00 -0000)
https://status.trigger.dev/incident/383771#aaf42730491319254d2ac0c7de30fb74ec3a25ddd484cd935521a43c090385b5
There's a networking issue in our cluster. The BPF networking change we made yesterday hasn't fully fixed the problems. We're working to get runs executing as quickly as possible and then figure out the root cause of this issue so it doesn't happen again.

v3 runs have stopped (Wed, 12 Jun 2024 22:10:00 -0000)
https://status.trigger.dev/incident/383449#f9cd9a052d0d268d28b922058aad644efd101d1d9dba87f3ff2e7ccc52f98fff
v3 runs are now executing again. Networking was down because of an issue with BPF. While networking was down, tasks couldn't heartbeat back to the platform. If the platform doesn't receive a heartbeat every 2 mins then a run will fail. Fewer than 500 runs in total were failed because of this. You can filter by status "System Failure" in the runs list to find these and then bulk replay them: select all, move to the next page and select all again, then replay them using the bottom bar.
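As a rough illustration of the heartbeat mechanism described in the update above (a worker pings the platform, and a run is failed if no heartbeat arrives within 2 minutes), here is a hypothetical sketch. The 2 minute window comes from the update; all other names and intervals are assumptions and this is not Trigger.dev's implementation.

```ts
// Illustrative only: worker-side heartbeat loop plus a platform-side check.

const HEARTBEAT_INTERVAL_MS = 30_000;    // how often the worker pings (assumed)
const HEARTBEAT_TIMEOUT_MS = 2 * 60_000; // platform-side failure threshold (2 mins)

// Worker side: keep heartbeating while the task executes; returns a stop function.
function startHeartbeat(runId: string, send: (runId: string) => Promise<void>) {
  const timer = setInterval(() => {
    // If networking is down, this call fails and the platform stops hearing from us.
    send(runId).catch(() => {});
  }, HEARTBEAT_INTERVAL_MS);
  return () => clearInterval(timer);
}

// Platform side: periodically collect runs whose last heartbeat is too old so they
// can be marked as System Failure.
function findRunsToFail(lastHeartbeatAt: Map<string, number>, now = Date.now()): string[] {
  return [...lastHeartbeatAt.entries()]
    .filter(([, at]) => now - at > HEARTBEAT_TIMEOUT_MS)
    .map(([runId]) => runId);
}
```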
v3 runs have stopped (Wed, 12 Jun 2024 21:36:00 -0000)
https://status.trigger.dev/incident/383449#9a6bb7463cbee3ea4c840bb355947338fe90f322e5c2a55ceb7c462480fe85d7
v3 runs have stopped because of a networking issue in our cluster. We're working to diagnose if this is an issue with our cloud provider and are trying to reset things.

v2 runs are slower than normal to start (Mon, 10 Jun 2024 12:30:00 -0000)
https://status.trigger.dev/incident/382219#b79ef4d852780e17ff4cb3c20e7aa779e19ba7fa7068468cd56e09237644886a
v2 p95 start times have been under 2s for 10 mins, so we're resolving this issue. We think this is because there are a lot of schedules that send an event at midday UTC on a Monday. We're looking into what we can do about that.

v2 runs are slower than normal to start (Mon, 10 Jun 2024 12:24:00 -0000)
https://status.trigger.dev/incident/382219#d2ebe1f2ef82e2f55f3129aa39c7514caf1e90f63087c4286f0ff6dcb517573d
v2 p95 start times are now under 1 second.

v2 runs are slower than normal to start (Mon, 10 Jun 2024 12:16:00 -0000)
https://status.trigger.dev/incident/382219#010be3326161d0a53fa97bf7dd8d4637657bb2ea8f86574c362e480c346f44e2
Between 12:15–12:18 UTC p95 queue times were up to 1 min. They've come down to 1.9s p95 now. Monitoring and will update.

v2 jobs are starting slowly (Wed, 05 Jun 2024 18:35:00 -0000)
https://status.trigger.dev/incident/379960#d5294d4dee8985fe6d794319f7c07e258cf35812ca5c2cf004bdc3e7fb48825c
Performance metrics for v2 are back to normal. v3 was unimpacted by this issue. We have identified the underlying performance bottleneck and will publish a full retrospective on this. We can now handle far more v2 load than we could before.

v2 jobs are starting slowly (Wed, 05 Jun 2024 15:00:00 -0000)
https://status.trigger.dev/incident/379960#83e0133a2d11e2a916cfc9f2a5dee1bb16e55c1b9d15f649e567e6d91eb953a3
v2 runs are starting with delays of a couple of minutes. We're working on a fix for this.

v2 jobs are queued (Mon, 03 Jun 2024 20:00:00 -0000)
https://status.trigger.dev/incident/378831#92d8496175b68cc9097d1d08fc08d0ae4e0cc5117dafa3a5f9ece694d8c39909
Runs have been executing at normal speeds and queues have been down to normal size for a couple of hours, so this is marked as resolved. During this incident queue times were longer than normal for v2 runs. We've made some minor changes as well as increasing capacity. We are working on a larger change, which should ship in the next few days, that we think should mean very large v2 run spikes don't cause these problems.
v2 jobs are queued https://status.trigger.dev/incident/378831 Mon, 03 Jun 2024 17:18:00 -0000 https://status.trigger.dev/incident/378831#daca17f391480da3dc64dfec85df784db0d5a58db65dca53668e613312f0a5cf v2 queues are getting smaller quickly now. v3 is still operating normally. v2 jobs are queued https://status.trigger.dev/incident/378831 Mon, 03 Jun 2024 16:42:00 -0000 https://status.trigger.dev/incident/378831#b353011fd980544118576e2f3f7535c7a171c9ef2f5179d065e4d11512b2a541 More capacity is now online and job queues are catching up, but not all the way. We're currently assessing whether it's possible to increase capacity even more. v2 jobs are queued https://status.trigger.dev/incident/378831 Mon, 03 Jun 2024 16:29:00 -0000 https://status.trigger.dev/incident/378831#400beb1080356463147317e1a8994ed44dbe3eccbea9521c08de8143951f0f1d We are currently struggling to keep up with v2 job capacity and some jobs have been backed up. We're working on increasing capacity now. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 20:45:00 -0000 https://status.trigger.dev/incident/377985#eb22eea3e59c463625e8047e1c255187026f9cfdda48fa773173d784b7626e99 v2 jobs are now all caught up and processing normally. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 17:54:00 -0000 https://status.trigger.dev/incident/377985#860ab9d18a047ba51590110c34e28d52f3ff6778223b5494990deb56653abe8e v2 jobs have started to catch up and should be back to normal operation soon. We'll keep an eye on it. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 17:45:00 -0000 https://status.trigger.dev/incident/377985#d70515877427e10423cf99fb17aa7b5992917af7c112d24a2f9885bcc95fc038 v2 jobs are still processing, but slowly. We're working on rolling out additional capacity to handle the backlog. v3 tasks aren't affected. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 17:36:00 -0000 https://status.trigger.dev/incident/377985#af60bc1368293c48e37c0970bdf342863dfb99af8694951b6da8a0bef509b2b9 We're currently experiencing an issue that is causing v2 job runs to be excessively delayed. We're working on fixing the issue and will update again shortly as we make progress. The v3 cluster is slow to accept new v3 tasks https://status.trigger.dev/incident/369236 Tue, 14 May 2024 07:00:00 -0000 https://status.trigger.dev/incident/369236#3749f587a8c00096c336d3c0c8be0b59acf881c426a3972a16a7a76f65ec5e13 Runs are operating at normal speed again. There were pods in our cluster in the RunContainerError state; this happens when a run isn't heartbeating back to the platform. We're closely monitoring and have cleaned these up. We're determining which tasks caused this and what we can do to prevent it from happening in the future.
The v3 cluster is slow to accept new v3 tasks https://status.trigger.dev/incident/369236 Tue, 14 May 2024 05:00:00 -0000 https://status.trigger.dev/incident/369236#e482ab5670e533a0989716a06265a5c2cd2688092f0d5d7cecd9f421c6319b6f The v3 cluster is slow to accept new v3 tasks. Queues and runs have been processing at good speeds now for several hours on v2 and v3 https://status.trigger.dev/incident/369241 Thu, 09 May 2024 21:00:00 -0000 https://status.trigger.dev/incident/369241#322cccdadc68287cdc1eaa108173ae0ba3df5e4a757e12dd84f8e35aa8da71b1 Before we get into what happened, I want to emphasise how important reliability is to us. This has fallen very short of providing you all with a great experience and a reliable service. We're really sorry for the problems this has caused for you all. **All paying customers will get a full refund for the entirety of May.**

## What caused this?

This issue started with a huge number of queued events in v2. Our internal system that handles concurrency limits on v2 was slowing down the processing of them, but it was also causing more database load on the underlying v2 queuing engine, Graphile Worker. Graphile is powered by Postgres, so this caused a vicious cycle. We've scaled many orders of magnitude in the past year, and all our normal upgrades and tuning didn't work here. We upgraded servers and, most importantly, the database (twice). We tuned some parameters. But as the backlog grew it became harder to recover from because of the concurrency limits built into the v2 system. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out very spiky demand.

## What was impacted?

- Queues got very long for v2 and processed slowly.
- Queues got long for v3. The queuing system for v3 is built on Redis, so that was fine, but the actual run data lives in Postgres, which couldn't be read because of the v2 issues. Also, we use Graphile to trigger v3 scheduled tasks.
- The dashboard was very slow to load or was showing timeout errors (ironic, I know).
- When we took the brakes off the v2 concurrency filter, it caused a massive number of runs to happen very quickly. Mostly this was fine, but in some cases it caused downstream issues in runs.
- When we took the brakes off the v2 concurrency filter, it also meant that in some cases v2 concurrency limits weren't respected.

## What we've done (so far)

- We upgraded some hardware. This means we can process more runs, but it didn't help us escape the spiral.
- We modified the v2 concurrency filter so it reschedules runs that are over the limit with a slight delay (a rough sketch of this change follows below). Before, it was thrashing the database with the same runs and could cause huge load in edge cases like this.
- We've upgraded Graphile Worker, the core of the v2 queuing system, to v0.16.6. This has a lot of performance improvements, so we can cope with more load than before.
- We have far better diagnostic tools than before.

Today we reached a new level of scale that highlighted some things that up until now had been working well. Reliability is never finished, so work continues tomorrow.
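A minimal sketch of the reschedule-with-delay change mentioned in the list above. It is illustrative only: the real v2 concurrency filter sits on top of Graphile Worker and Postgres rather than an in-memory counter, and the limit and delay values below are assumptions.

```ts
// Minimal sketch of "reschedule over-limit runs with a slight delay".
// Illustrative only: not the actual v2 concurrency filter.

const CONCURRENCY_LIMIT = 10;      // assumed per-project limit
const RESCHEDULE_DELAY_MS = 2_000; // the "slight delay" (assumed value)

const executing = new Map<string, number>(); // projectId -> runs in flight

async function startRunWithLimit(
  projectId: string,
  execute: () => Promise<void>
): Promise<void> {
  const inFlight = executing.get(projectId) ?? 0;

  if (inFlight >= CONCURRENCY_LIMIT) {
    // Before the fix: over-limit runs were reconsidered immediately,
    // hammering the queue's database with the same runs over and over.
    // After the fix: push them back by a short delay instead.
    setTimeout(() => {
      void startRunWithLimit(projectId, execute);
    }, RESCHEDULE_DELAY_MS);
    return;
  }

  executing.set(projectId, inFlight + 1);
  try {
    await execute();
  } finally {
    executing.set(projectId, (executing.get(projectId) ?? 1) - 1);
  }
}

// Example: the 11th simultaneous run for a project waits ~2s before retrying.
for (let i = 0; i < 11; i++) {
  void startRunWithLimit("project-a", async () => {
    await new Promise((resolve) => setTimeout(resolve, 5_000)); // fake work
  });
}
```

The point of the delay is simply that an over-limit run waits before being reconsidered instead of being re-polled immediately, which is what was thrashing the database during the incident.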