Incidents | Trigger.dev
Incidents reported on the status page for Trigger.dev
https://status.trigger.dev/

Deployments recovered (Wed, 28 May 2025 16:56:53 +0000)
https://status.trigger.dev/#8a1c2385e59f625e534c6dbdaa58bf6e50c3cf728d7b130b905e803b1fc01b64

Deployments went down (Wed, 28 May 2025 14:57:07 +0000)
https://status.trigger.dev/#8a1c2385e59f625e534c6dbdaa58bf6e50c3cf728d7b130b905e803b1fc01b64

v4 dequeue performance degradation (Mon, 26 May 2025 13:44:00 -0000)
https://status.trigger.dev/incident/580382#576250e9ed29474c014322a87a0922c0a1a6514aa371dca4f2d159d15632302c
v4 dequeue performance has now improved again, and we're working on two things: a short term fix to prevent this from happening again, to be deployed today, and a long term fix for dequeue performance, hopefully shipping this week, which will vastly improve dequeue performance and scaling.

v4 dequeue performance degradation (Mon, 26 May 2025 12:50:00 -0000)
https://status.trigger.dev/incident/580382#0fed09d97fa8743f4dc3c189deaaf2e8e338c04341772401ece26c57b00aaa1a
Dequeues in v4 are slow and we're investigating the cause and trying to deploy a mitigation.

Deployments recovered (Thu, 22 May 2025 09:20:50 +0000)
https://status.trigger.dev/#094161573745754835ad8c20533b9e6338a0e62ea92dcc643f85c87dc3f3651f

Deployments went down (Thu, 22 May 2025 08:19:21 +0000)
https://status.trigger.dev/#094161573745754835ad8c20533b9e6338a0e62ea92dcc643f85c87dc3f3651f

Deployments recovered (Wed, 21 May 2025 16:03:37 +0000)
https://status.trigger.dev/#ce6d0fd019512610a8248c928dde345e751bbd4819c2c38e16483c42b5f01ab3

Deployments went down (Wed, 21 May 2025 15:55:47 +0000)
https://status.trigger.dev/#ce6d0fd019512610a8248c928dde345e751bbd4819c2c38e16483c42b5f01ab3

Trigger.dev cloud recovered (Sat, 17 May 2025 08:37:55 +0000)
https://status.trigger.dev/#fff5fcef25516337d664832e6dab9c9728482a5e87c964e8a156959f9523c649

Realtime recovered (Sat, 17 May 2025 08:37:34 +0000)
https://status.trigger.dev/#3a2b8c5d631a57e52257491a0b86db1cf4862620649dd29cacf1711a1bbaf53d
Trigger.dev API recovered (Sat, 17 May 2025 08:37:23 +0000)
https://status.trigger.dev/#0ba93068f87c34bef2f483586054c9c427597332e70661a2bc3c809aae504f16

Trigger.dev OpenTelemetry recovered (Sat, 17 May 2025 08:37:13 +0000)
https://status.trigger.dev/#d97dcf9a73389365c258fa4eaf87bf105d60dc016dbd5ac7dc1492304e810da1

Trigger.dev cloud went down (Sat, 17 May 2025 08:35:13 +0000)
https://status.trigger.dev/#fff5fcef25516337d664832e6dab9c9728482a5e87c964e8a156959f9523c649

Trigger.dev OpenTelemetry went down (Sat, 17 May 2025 08:34:14 +0000)
https://status.trigger.dev/#d97dcf9a73389365c258fa4eaf87bf105d60dc016dbd5ac7dc1492304e810da1

Realtime went down (Sat, 17 May 2025 08:34:04 +0000)
https://status.trigger.dev/#3a2b8c5d631a57e52257491a0b86db1cf4862620649dd29cacf1711a1bbaf53d

Trigger.dev API went down (Sat, 17 May 2025 08:33:54 +0000)
https://status.trigger.dev/#0ba93068f87c34bef2f483586054c9c427597332e70661a2bc3c809aae504f16

Deployments recovered (Wed, 14 May 2025 19:53:28 +0000)
https://status.trigger.dev/#d31827403b8b27c64f6c8bf1ebfa82c53df36549f4ea918b4d7049b7a22d4d75

Deployments went down (Wed, 14 May 2025 19:32:00 +0000)
https://status.trigger.dev/#d31827403b8b27c64f6c8bf1ebfa82c53df36549f4ea918b4d7049b7a22d4d75

Deployments recovered (Tue, 13 May 2025 09:24:09 +0000)
https://status.trigger.dev/#1910f2d6c52dd41f785b262fbeb0c7bf355a84ded74718327bc627ef90c949b6

Deployments went down (Tue, 13 May 2025 09:19:37 +0000)
https://status.trigger.dev/#1910f2d6c52dd41f785b262fbeb0c7bf355a84ded74718327bc627ef90c949b6

Deployments recovered (Thu, 08 May 2025 21:29:08 +0000)
https://status.trigger.dev/#f767027231ec485e1fa6c8cbd277c216234aefdf56be0f07b4dea4e4ec9a45a2

Deployments went down (Thu, 08 May 2025 17:40:07 +0000)
https://status.trigger.dev/#f767027231ec485e1fa6c8cbd277c216234aefdf56be0f07b4dea4e4ec9a45a2

Deployments recovered (Tue, 06 May 2025 19:02:08 +0000)
https://status.trigger.dev/#1c8384f8ce026d5b86e6d55bd66e5a8d12db52a42170b0d011328220efe82545

Deployments went down (Tue, 06 May 2025 16:14:33 +0000)
https://status.trigger.dev/#1c8384f8ce026d5b86e6d55bd66e5a8d12db52a42170b0d011328220efe82545

v3 runs dequeuing slower than normal (Tue, 06 May 2025 14:28:00 -0000)
https://status.trigger.dev/incident/557448#90cd0508deff5411ef332711b33ada54551777ce35254e15278336267eed5202
Queues are back to nominal length and have been for some time. This issue was caused by a huge influx of queues, which meant we weren't considering them all when selecting queues for dequeuing. We have increased some settings to make this better and we're looking at what we can do in the future to make this scale better for the next 10–100x multiple of queues.
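To make the failure mode above concrete: if a dequeuer only ever considers a bounded sample of queues per pass, a sudden 10–100x jump in the total queue count means many queues are rarely looked at. The sketch below is hypothetical; the names QUEUE_SAMPLE_SIZE and selectQueuesForDequeue are invented for illustration and this is not the actual Trigger.dev dequeuer. It only shows the kind of setting the update says was increased.

```ts
// Illustrative sketch of sampling-based queue selection, not Trigger.dev's code.

interface QueueSnapshot {
  id: string;
  size: number;            // runs currently waiting
  running: number;          // runs currently executing
  concurrencyLimit: number; // max runs allowed to execute at once
}

// How many candidate queues each dequeue pass considers. If the total number of
// queues grows 10-100x while this stays small, many queues are never sampled,
// which is the imbalance described in the incident update above.
const QUEUE_SAMPLE_SIZE = Number(process.env.QUEUE_SAMPLE_SIZE ?? 1000);

function selectQueuesForDequeue(allQueues: QueueSnapshot[]): QueueSnapshot[] {
  // Crude shuffle for illustration, then keep a bounded sample so one pass stays
  // cheap, filtering to queues that still have concurrency headroom.
  const shuffled = [...allQueues].sort(() => Math.random() - 0.5);
  return shuffled
    .slice(0, QUEUE_SAMPLE_SIZE)
    .filter((q) => q.size > 0 && q.running < q.concurrencyLimit);
}
```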
v3 runs dequeuing slower than normal (Tue, 06 May 2025 13:50:00 -0000)
https://status.trigger.dev/incident/557448#aa082ebaae28f8c074ca23078c6977bf46f18dd219a60153a6b8d68e8b421e22
We're dequeuing runs very fast again since the config update. Queues are coming down, mostly capped by the concurrency limits on queues.

v3 runs dequeuing slower than normal (Tue, 06 May 2025 13:29:00 -0000)
https://status.trigger.dev/incident/557448#522fe92084e4c932b9355cfeec3df266b77d1066817e83faaf294a3a2030e6ef
We have identified the issue. A huge flood of new queues has caused an imbalance in the fair queue algorithm. We're deploying a config env var change to the dequeuer now that should fix this. We're also figuring out how we can prevent this issue happening in the future if we hit another order of magnitude in the total queue count.

v3 runs dequeuing slower than normal (Tue, 06 May 2025 13:16:00 -0000)
https://status.trigger.dev/incident/557448#537c5f3081accc93acd19ad042a135b46709781b200574a5f22e02377c4ab6fd
Version 3 queues aren't processing as fast as normal. Runs are still processing but not at the normal speed so some queues are getting longer. We're investigating why this is happening and will provide updates.

v3 runs dequeuing slower than normal (Thu, 24 Apr 2025 16:05:00 -0000)
https://status.trigger.dev/incident/550447#eddcb8a9db378d4925503dc10b4b0090425596b06ad640f34e1edbe57c0a8120
Queues have been operating at full speed since 16:05 UTC. We have found an edge case in the dequeue algorithm that can cause slower dequeue times. We're looking into a fix.
v3 runs dequeuing slower than normal (Thu, 24 Apr 2025 15:50:00 -0000)
https://status.trigger.dev/incident/550447#2c8594844ded712c8333733594a45141077c918f70cd0bebd75c33e19212e0b6
Runs are processing fast again now and we are quickly catching up. We are looking into why this happened; it's something to do with dequeuing v3 runs.

v3 runs dequeuing slower than normal (Thu, 24 Apr 2025 15:30:00 -0000)
https://status.trigger.dev/incident/550447#825ab97595809431541ea26bff1a49ae937c616202e7adc6bc9e243b0c350374
Queues are processing but slower than normal. We're investigating and will update when we know more.

Deployments recovered (Fri, 11 Apr 2025 16:25:57 +0000)
https://status.trigger.dev/#6cae707326a29dfdca3b37570d1fec3a33ef2a7979ffdbc1efe6d03ab26c06db

Deployments went down (Fri, 11 Apr 2025 16:00:56 +0000)
https://status.trigger.dev/#6cae707326a29dfdca3b37570d1fec3a33ef2a7979ffdbc1efe6d03ab26c06db

Deployments recovered (Mon, 07 Apr 2025 19:33:16 +0000)
https://status.trigger.dev/#21674a88f5d2d1144c34745f252f2ca345bb6c015a4945da692f85ccae6767c6

Runs are dequeuing slower than normal (Mon, 07 Apr 2025 19:09:00 -0000)
https://status.trigger.dev/incident/541357#14c5ce1c71beeae8d4b60aa2d87e88edd39fbc55fcc6a89d0c2eb525aab7d623
Runs have been dequeuing quickly for some time now, so we're marking this as resolved. We're continuing to monitor it closely. Runs dequeued for the entire period but queue times were longer than normal, across all customers. The vast majority of queues have already reduced back to normal length or will soon. We suspect this was caused by an underlying Digital Ocean networking issue that meant our Kubernetes control plane nodes were slow to create and delete pods. We are trying to figure out if there's anything we can do in the short term to reduce the likelihood of this happening again. We are planning to move our primary US worker cluster to AWS, which is where we already host our database, dashboard, API, and all other services apart from the worker cluster. This decision was made to increase reliability and improve response times.

Deployments went down (Mon, 07 Apr 2025 18:20:47 +0000)
https://status.trigger.dev/#21674a88f5d2d1144c34745f252f2ca345bb6c015a4945da692f85ccae6767c6

Runs are dequeuing slower than normal (Mon, 07 Apr 2025 16:56:00 -0000)
https://status.trigger.dev/incident/541357#0202d2876befab333063fbea101f0b7a0694b784209730b9cfd9e610fd511b8e
Run queues are processing slower than normal due to an issue with our Kubernetes control planes. We are investigating and will update this status as soon as we know more.
Deployments recovered (Fri, 04 Apr 2025 17:43:19 +0000)
https://status.trigger.dev/#9ee3b8880891e850d48d262238dc53323fe164899c4088217c70dff0f457a899

Deployments went down (Fri, 04 Apr 2025 16:29:18 +0000)
https://status.trigger.dev/#9ee3b8880891e850d48d262238dc53323fe164899c4088217c70dff0f457a899

Deployments recovered (Fri, 28 Mar 2025 15:19:57 +0000)
https://status.trigger.dev/#3d71098344fe8abb0730f65820edc64555b2d9609452a007ff0287dccd1ef938

Deployments went down (Fri, 28 Mar 2025 15:05:27 +0000)
https://status.trigger.dev/#3d71098344fe8abb0730f65820edc64555b2d9609452a007ff0287dccd1ef938

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:48:00 -0000)
https://status.trigger.dev/incident/532422#b5e3541e6835939fb1378c51eaef78154fa07baab15abd02d22614e82999f4c5
Cloudflare R2 is back online and uploads of large payloads and outputs have resumed. We'll continue to monitor the situation.

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:43:00 -0000)
https://status.trigger.dev/incident/532422#744a88a12c86ed784717193521a9a6656ac6542d4cf0104de1827171318381b5
We've investigated trying to temporarily switch from R2 to AWS S3 but unfortunately we cannot easily do that without breaking backwards compatibility. We're continuing to keep an eye on the Cloudflare Discord for any changes or expected timelines, of which there are none yet.

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:11:00 -0000)
https://status.trigger.dev/incident/532422#6aff514423860e46764ccdc7de284dc44763a4fb96c1a3a2cd70dfac817db4dc
There is now an incident report on Cloudflare Status: https://www.cloudflarestatus.com/incidents/v6c7l22vglw5

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 22:02:00 -0000)
https://status.trigger.dev/incident/532422#81978f799314c98474b5515884a1c5c0d6dfa5315ae5690392e4e9cea1568adf
We've confirmed an ongoing issue with Cloudflare R2 that started approx 20 minutes ago, with this message from their support:
> There is an ongoing R2 outage at the moment, expect GET/PUT requests and uploads to intermittently fail.

Tasks with large payloads or outputs are sometimes failing (Fri, 21 Mar 2025 21:55:00 -0000)
https://status.trigger.dev/incident/532422#9ae796254688cf699542c895cf412e37f0d1d08c2ae024e7716061565caece95
We're currently experiencing an issue with a downstream provider (Cloudflare R2) which we use to store large task payloads and outputs. We're investigating this at the moment and hope to provide more information shortly.
Deployments recovered (Wed, 12 Mar 2025 14:32:49 +0000)
https://status.trigger.dev/#3b95d5cea1db82676370690791dd77e5ba51061069a6ff0bd0f19aac6385cb67

Deployments went down (Wed, 12 Mar 2025 11:59:48 +0000)
https://status.trigger.dev/#3b95d5cea1db82676370690791dd77e5ba51061069a6ff0bd0f19aac6385cb67

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 21:15:00 -0000)
https://status.trigger.dev/incident/524477#b1d5c4643ae92b31afab7994c7b1a0f4d0b2c9e910531c6523dfba5ced23c85d
We are confident that most queues have caught up again but are still monitoring the situation. If you are experiencing unexpected queue times this is most likely due to plan or custom queue limits. Should this persist, please get in touch.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 20:56:00 -0000)
https://status.trigger.dev/incident/524477#dfdac39024dad2d3d701af8acf516dfb395ab50e09dfd2036488bff4a3a69192
The service is stable again and metrics are looking good. We're still catching up with a backlog of runs. You may see increased queue times until this is fully resolved. We'll keep you updated.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 20:16:00 -0000)
https://status.trigger.dev/incident/524477#43ffc749ad14e37bf75ebb023fb2f31bc8822f96b72591b965de3eafe1400f78
We managed to clear the huge backlog of VMs that were completed but hadn't been cleaned up like they normally are. This was causing a lot of issues, including the initial drop in runs starting and another drop between 8:00 and 8:13pm UTC. There's a backlog of runs and it will unfortunately take a bit of time for everything to catch up. We're actively monitoring the situation and doing what we can to improve throughput.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 19:30:00 -0000)
https://status.trigger.dev/incident/524477#356485aecb07798b8938dc7a6fab2fbf24e739c70ac33947ff7a75aacc660ad2
Runs are executing at the normal speed (from 7.30pm UTC). There is a backlog of runs to work through. There's still a problem we're looking at where the completed run VMs aren't being cleared properly. We think that's what caused this issue in the first place. It hasn't happened before and there have been no code changes.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 19:17:00 -0000)
https://status.trigger.dev/incident/524477#a1183aa6c52507a6c8c8be0dab60f11ab8909725a44c8957b085d90ff7d1bbbe
A significant proportion of runs are not starting. There is an issue in our worker cluster and we are trying to diagnose it. There have been no deployments today. We're unsure at this point if this is a cloud provider issue or not.

Significant disruption to run starts (runs stuck in queueing) (Fri, 07 Mar 2025 18:48:00 -0000)
https://status.trigger.dev/incident/524477#95a49db3c93b064dae392196377e792acf52a9f6c84f522e6be3a161a17014cc
There is an issue in the worker cluster that we are investigating. It is causing runs not to start quickly.
Deployments recovered (Fri, 07 Mar 2025 11:41:40 +0000)
https://status.trigger.dev/#3bc8d584e6fb8d6a4307bc80c8e30be20c4a9001db66507bce25ed218825171e

Deployments went down (Fri, 07 Mar 2025 11:03:55 +0000)
https://status.trigger.dev/#3bc8d584e6fb8d6a4307bc80c8e30be20c4a9001db66507bce25ed218825171e

Uncached deploys are causing runs to be queued (Tue, 04 Mar 2025 12:40:00 -0000)
https://status.trigger.dev/incident/522499#6a7b56529abf610d1a00a7f5e6534b92af23ba0185b57bd8a6f72d4579cd8e78
We tracked this down to a broken deploy pipeline which reverted one of our internal components to a previous version. This caused a required environment variable to be ignored. We have applied a hotfix and will be making more permanent changes to prevent this from happening again.

Uncached deploys are causing runs to be queued (Tue, 04 Mar 2025 12:25:00 -0000)
https://status.trigger.dev/incident/522499#4c91a51eedb4e236d0caf65406a5c72f2ec5dde0aedf20eb4de6c5d99c8db790
All runs should be dequeuing again. Runs that were impacted by this will start dequeuing again; these runs were in the queue for around 30 mins. This issue was caused by new servers not pulling the correct authentication details for pulling Docker images. We are trying to determine why this happened, as we verified the env vars were set correctly but they weren't being picked up. We're going to continue actively monitoring this until we're happy that it's fully resolved.

Uncached deploys are causing runs to be queued (Tue, 04 Mar 2025 11:30:00 -0000)
https://status.trigger.dev/incident/522499#3ec8f888b56a6cfd39b2f348833f92dd203c700395b4767586901b095ca7f208
Some runs associated with new deploys are not processing from queued to executing. This is an issue with pulling new deploys through the Docker image cache. Deploys from before 11:30 UTC should not be impacted (unless they haven't done a run in the past few days). We're working on a fix for this.

Slow queue times and some runs system failing (Sun, 09 Feb 2025 18:10:00 -0000)
https://status.trigger.dev/incident/510033#23aaffab65b269d1ca4f9791893dd98dd8ff6eb2d7ee8825cd5df3956da2d253
Queue times and timeout errors are back to their previous levels. Note that start times of containers are still slower than they should be, especially if you aren't doing a lot of runs. There's a GitHub issue about slow start times and what we're doing to make them consistently fast: https://github.com/triggerdotdev/trigger.dev/issues/1685 It will start with a new Docker image caching layer that will ship tomorrow.
## What's causing these problems
We've had a more than 10x increase in load in the past 7 days. Some things that have worked well for the past few months now work less well at this new scale. Some of those issues can compound under high load and cause more significant issues.
Slow queue times and some runs system failing (Sun, 09 Feb 2025 17:35:00 -0000)
https://status.trigger.dev/incident/510033#296c0d34d68b5481175a2ed5b3d3cfd92f25da990ecb84f20e7f741b8884bbd6
We have made some adjustments to stop this issue and we're still looking into it.

Slow queue times (Thu, 30 Jan 2025 10:10:00 -0000)
https://status.trigger.dev/incident/504368#86d3eec3d2bd41318c1e54abeee40f44eb9367d29fcdf6eed5a17bd5f3769b44
Queue processing performance is back to normal because there's been a reduction in demand. We have identified the underlying bottleneck and are working on a permanent fix. This shouldn't be a major change and should be live soon. There is a high degree of contention on an update when a single queue's concurrencyLimit is different on every call to trigger a task. This is an edge case we haven't seen anyone do before.
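To make the contention pattern above concrete, here is a hedged sketch. It assumes a per-trigger queue option roughly like the one the update describes; the triggerTask function, its option shape, and the SQL in the comment are invented stand-ins, not the Trigger.dev SDK or schema. Passing a different concurrencyLimit on every trigger forces the shared queue record to be rewritten each time, so concurrent triggers serialize on the same row lock; pinning the limit once avoids that.

```ts
// Illustrative sketch of the contention anti-pattern, not Trigger.dev's code.
// Assume each trigger carrying a different concurrencyLimit results in something like:
//
//   UPDATE task_queues SET concurrency_limit = $1 WHERE id = $2;
//
// so thousands of triggers all contend on the same queue row's lock.

// Hypothetical client shape, for illustration only.
declare function triggerTask(
  name: string,
  payload: unknown,
  options?: { queue?: { name: string; concurrencyLimit?: number } }
): Promise<void>;

// Anti-pattern: a limit that changes on every call, so every trigger rewrites the
// same queue row.
async function noisyTrigger(i: number) {
  await triggerTask("process-item", { i }, {
    queue: { name: "items", concurrencyLimit: 1 + (i % 50) },
  });
}

// Lower-contention alternative: fix the limit once on the queue and omit it per call.
async function quietTrigger(i: number) {
  await triggerTask("process-item", { i }, { queue: { name: "items" } });
}
```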
Deploys are failing with a 520 status code (Fri, 24 Jan 2025 19:28:00 -0000)
https://status.trigger.dev/incident/500864#54a0223b12fbed954c1bf19a19e547575ca6e03d9aff670b2a254061a10efdeb
**Important: Upgrade to 3.3.12+ in order to deploy again**
If you use npx you can upgrade the CLI and all of the packages by running: npx trigger.dev@latest update
This should download 3.3.12 (or newer) of the CLI and then prompt you to update the other packages too. If you have pinned a specific version (e.g. in GitHub Actions) you may need to manually update your package.json file or a workflow file. Read our full package upgrading guide here: https://trigger.dev/docs/upgrading-packages

Deploys are failing with a 520 status code (Fri, 24 Jan 2025 13:17:00 -0000)
https://status.trigger.dev/incident/500864#bf35dac25abe349e5b9f7d7675dcec9f1c87bc25118dd7e2ee9604249bdfb156
The 520 errors are still happening with our Digital Ocean container registry, which is preventing the vast majority of deploys from working. We are speaking to their engineers but so far they haven't diagnosed a fix. There have been no code changes from our side which caused this issue – it comes from a change somewhere in one of the third parties we use. This is our deploy pipeline:
1. The Trigger.dev CLI runs the deploy command
2. Docker builds happen on Depot.dev
3. Depot pushes the image to our registry proxy (where we add registry credentials)
4. Our registry proxy pushes to Digital Ocean Container Registry (with the auth credentials)
From our logs we think the issue is that Digital Ocean have changed something that means they're now rejecting pushes they were accepting before. We're working on two solutions in parallel:
1. Switching our container registry to something else, probably Docker Hub. This isn't super simple because we need to make it work with our proxy so we don't leak security credentials to users.
2. Stop using the proxy. This means generating temporary credentials that can only push to your project's repository. Those temporary tokens will be sent to the CLI, which means Depot.dev can push directly. AWS ECR supports doing this. Unfortunately this will require an updated CLI package to deploy…
We'll provide another update as soon as we have more information.

Deploys are failing with a 520 status code (Thu, 23 Jan 2025 23:20:00 -0000)
https://status.trigger.dev/incident/500864#c176d306b698ec88e8ecebae6cc75ca2cd728c91234c3af92b638d1cbf55e1a0
We have confirmed that this is an issue pushing new images to our Digital Ocean Docker container registry. It's returning 520 errors. We have submitted priority support tickets and are speaking to the Digital Ocean engineering teams about the problem. Currently we're waiting on them.

Deploys are failing with a 520 status code (Thu, 23 Jan 2025 21:10:00 -0000)
https://status.trigger.dev/incident/500864#48310e57a1f571c005f09bd4be88eff73636d037439f1cf0c2dbf30e571398e9
We're investigating why this is happening.

Realtime is degraded (Thu, 02 Jan 2025 12:07:00 -0000)
https://status.trigger.dev/incident/489157#5d597f41958b63edb6110871081af0ac71593bce0b0c71becf8c0d0cca2e9d88
The ElectricSQL team created a fix for the Postgres transaction wraparound issue and we've confirmed that it is now deployed and fixed on Prod.

Realtime is degraded (Wed, 01 Jan 2025 21:58:00 -0000)
https://status.trigger.dev/incident/489157#8a7d4151c80165ca3e67c0f5b0f04cec4d6f022c11ecc5c4c436abef9e0d1363
We've narrowed down the exact cause of the issue that is happening in Electric and are awaiting their engineer to get some sleep and submit a patch in the morning. Electric "subscribes" to updates in our database and compares the "transaction ID" or "xid" of the update in the PostgreSQL WAL with the "current xmin". Our database has a "current xmin" which is very large (4.3b) and which is over the maximum transaction ID of 2^32. This is normal and is a feature of the transaction ID wraparound logic in PostgreSQL. So new transactions have "wrapped around" back to 0 and now have lower transaction IDs than the current snapshot, and Electric isn't modulo'ing the current snapshot xmin before comparing the transaction ID, so it is basically detecting all transactions as "old" even though they are newer.
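The comparison bug described above comes down to comparing 32-bit transaction IDs as plain integers instead of modulo 2^32. Below is a minimal, hypothetical sketch of wraparound-aware comparison, in the spirit of Postgres's own TransactionIdPrecedes; it is not ElectricSQL's actual patch, and the example values are made up to show the failure.

```ts
// Illustrative only: wraparound-aware comparison of 32-bit Postgres transaction IDs.

// Returns true if xid `a` logically precedes xid `b`, treating the 32-bit xid space
// as a circle: the signed 32-bit difference decides which side of the wrap `a` is on.
function xidPrecedes(a: number, b: number): boolean {
  const diff = (a - b) | 0; // coerce the difference to signed 32-bit
  return diff < 0;
}

// Naive comparison: treats a post-wraparound xid (small number) as "older" than a huge
// pre-wraparound xmin (~4.3 billion), which is the bug described in the update above.
function naivePrecedes(a: number, b: number): boolean {
  return a < b;
}

// Example: a new transaction just after wraparound vs. a snapshot xmin near 2^32.
const snapshotXmin = 4_294_967_000; // close to 2^32
const newXid = 1_200;               // wrapped around back towards 0
console.log(naivePrecedes(newXid, snapshotXmin)); // true  -> wrongly treated as old
console.log(xidPrecedes(newXid, snapshotXmin));   // false -> correctly treated as newer
```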
Realtime is degraded (Wed, 01 Jan 2025 15:58:00 -0000)
https://status.trigger.dev/incident/489157#777020e56b63f758587f0a5361d0bfa0ed927ee3f782caca9197a2baf0e69d1a
The very helpful and responsive engineers of Electric SQL (which powers Realtime) have possibly found a reason and are working on a reproduction. We're ready to deploy updates to our electric server as soon as those are made available to us. The short story is it's related to a transaction ID wraparound issue, but we'll update when we know more.

Realtime is degraded (Wed, 01 Jan 2025 04:11:00 -0000)
https://status.trigger.dev/incident/489157#c71c0588d4e73dd61895a9ff9ff7cf0334c6c06def5eb7fd9056f85f3a73a85e
Nothing we have tried has fixed the issue. We're working with the electric team but unfortunately the timing is not great... most people are either sleeping or celebrating new years 🎉. We're hoping to be able to get this issue resolved ASAP given those constraints, and will update again soon.

Realtime is degraded (Wed, 01 Jan 2025 00:23:00 -0000)
https://status.trigger.dev/incident/489157#64c63030b895171adf3fa1d32b9ef335e1127269f7a6078b6d9e5cb6e018a030
We've exhausted our current runbook related to fixing Realtime issues and the service remains degraded. We're continuing to investigate and will update as soon as we are able to provide more information.

Realtime is degraded (Tue, 31 Dec 2024 17:30:00 -0000)
https://status.trigger.dev/incident/489157#4f694f6e956b477b3d3f685e9044fff8b884c27f2e19699dbc1398791be512c4
Some runs are not being updated by Realtime. All other systems are operating normally.

Deploys are failing due to a downstream provider (Mon, 02 Dec 2024 22:09:00 -0000)
https://status.trigger.dev/incident/471215#cfb1cc8da3aa94fcbb900565d21b816e4bb5354a7ae07a8f71e8612ff3bb57dd
We have moved all deploys to Europe on Depot as a temporary fix while they fix the underlying issue in the US. Deploys will be a bit slower than normal and the first one won't use the cache, but they should work.

Deploys are failing due to a downstream provider (Mon, 02 Dec 2024 21:00:00 -0000)
https://status.trigger.dev/incident/471215#ca375c9fabefdd300329aa31573cc1e9849f9cd1647f01afbb14a0a044438583
Many deploys of tasks are failing due to an issue with a downstream provider. They're working on a fix. You can follow their status updates here: https://status.depot.dev/ We'll update when we know more. This only impacts deploys, everything else is functioning normally.

V3 runs are processing slowly (Sat, 23 Nov 2024 00:55:00 -0000)
https://status.trigger.dev/incident/466240#7802873d664a43ff807f05ea2f118c9e3f79e7387d312893592b69d34ac77fbe
Runs are processing normally again, queues should come down fast. The Kubernetes database etcd didn't allow new values. Increasing max sizes, restarting, and changing some other settings worked.

V3 runs are processing slowly (Sat, 23 Nov 2024 00:14:00 -0000)
https://status.trigger.dev/incident/466240#96df1076cb767d3cafd5d8a79853db0216775e20667d13ef69621a229deb146f
Our primary worker cluster is experiencing Kubernetes issues that are preventing some pods from being created.

Realtime (beta) is offline (Fri, 22 Nov 2024 19:36:00 -0000)
https://status.trigger.dev/incident/466075#066c3aa3162f168154bd451aeee6b2b4afb57e7c905365a36fca09dcbf165f3e
Realtime is back online. We've made some configuration changes and have some more reliability fixes in progress to make this rock solid.

Realtime (beta) is offline (Fri, 22 Nov 2024 17:45:00 -0000)
https://status.trigger.dev/incident/466075#ab2d5a46bd6f1cfbf9a31ca9ce9347fcb0ab2d51ef5ff7491e9b6a766a330f51
Our Realtime (beta) v3 feature is offline. We suspect it was causing degradation of other core services and so made the decision to disable it. We are actively working on this so we can get it back online.

V2 runs are processing slowly (Fri, 08 Nov 2024 16:30:00 -0000)
https://status.trigger.dev/incident/458173#fe0650f4a60ba75f685cc82ede99a80513aca6b2190194c708c3ff8d1da4832e
V2 queues are caught up. Now any queued runs are due to concurrency limits. V3 was not impacted during the entire period. We restarted all V2 worker servers and V2 runs started processing again. We are still investigating the underlying cause to prevent this happening again. There were no code changes or deploys during this period and the overall V2 load wasn't unusual.
V2 runs are processing slowly (Fri, 08 Nov 2024 16:08:00 -0000)
https://status.trigger.dev/incident/458173#e1ff88f9875037a0cc1a6bd8eed98aed885955985a86076a948b63f6b00540fd
V2 runs are now dequeuing quickly and the backlog is catching up to normal. We'll update when there are nominal queue times.

V2 runs are processing slowly (Fri, 08 Nov 2024 15:25:00 -0000)
https://status.trigger.dev/incident/458173#aef3136f68c53e4bccd89665d4b453e2f3785709f5d5da830a3c754996149666
V2 runs are in the queue for longer than normal. We're investigating what's causing this and working on a fix.

Realtime service degraded (Fri, 01 Nov 2024 00:38:00 -0000)
https://status.trigger.dev/incident/454137#8b5ebd7b2108e07e692bd26355056e399e29795dbe7c3596c55059b8189b876a
Realtime is recovering after a restart and a clearing of the consumer cache, but the underlying issue has not been solved. We're still working on a fix and will update as we make progress.

Realtime service degraded (Thu, 31 Oct 2024 23:52:00 -0000)
https://status.trigger.dev/incident/454137#9658f921e8122bab3c8d8b4904566fe5060ad0ce5d26559ff8929c9622dabdba
We've just discovered an issue with our realtime service, where our realtime server is crashing and is not able to consume new changes from the database, and thus not able to send new updates out through our realtime system. We're working on a fix but don't have an ETA at this time.

Dashboard instability and slower run processing (Fri, 25 Oct 2024 17:47:00 -0000)
https://status.trigger.dev/incident/449476#aa82725fe6aa7d1c6d51c08c3edc2f5924a11c9b6ee7cc0629b420cde204249a
The networking issues from our worker cluster cloud provider are no longer happening. Networking has been back to full speed for the past 10 minutes and runs are processing fast.

Dashboard instability and slower run processing (Fri, 25 Oct 2024 17:34:00 -0000)
https://status.trigger.dev/incident/449476#bee76cc4fff4db3be6086b82e808e3e34a4343037ed46487e5674a38262199b6
v3 runs are processing slowly. We think this is due to an intermittent networking issue with our worker cluster cloud provider. We are investigating and escalating this issue with them. This isn't due to a code change.

Dashboard instability and slower run processing (Wed, 23 Oct 2024 17:45:00 -0000)
https://status.trigger.dev/incident/449476#57a1c6ef21e2cff8744e041f45ab48966414f5c5e98152b59bcc4a9f1bb7e5b8
The API and dashboard have been back to normal for some time. Runs are processing fast. We are working on stopping this from happening again. We have identified the JSON data that caused this Node.js crash but it's not at all clear why it crashed V8, as it's valid JSON.
Dashboard instability and slower run processing (Wed, 23 Oct 2024 17:25:00 -0000)
https://status.trigger.dev/incident/449476#e2fba50651e2337d3bf2094ac7f90587113a82d03da82cd5d755f3402d9409f3
A crash caused some dashboard instability and has slowed run processing down. We're working to get all run queues back to their normal nominal size. We know what caused the crash (some unexpected user data that somehow crashed Node.js). We have stopped the user from doing this temporarily until we have a permanent fix in place.

Some processing is slower than normal (Tue, 24 Sep 2024 19:28:00 -0000)
https://status.trigger.dev/incident/434184#c291b2ad5013c365a4b9127478bd3cd8de7f7ebe2a59fe6087c0848b149b5a95
This issue is resolved, everything is back to normal. This issue was caused by an exceptionally large number of v3 run alerts, caused by a run that was failing (from user code, not a Trigger.dev system problem). This caused us to hit Slack rate limits, which slowed the processing down more. We have scaled up the system that deals with this so it can handle it better. We've also changed the retry settings for sending Slack alerts so they don't retry as aggressively.
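For illustration, here is a hedged sketch of the kind of less-aggressive retry behaviour the update above describes: honour Slack's Retry-After header on 429 responses and cap the number of attempts instead of retrying immediately. The function name, attempt count, and backoff values are assumptions, not Trigger.dev's actual alert code.

```ts
// Illustrative only: rate-limit-aware retries for alert delivery.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function postSlackAlert(webhookUrl: string, text: string, maxAttempts = 4): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(webhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    });

    if (res.ok) return;

    if (res.status === 429) {
      // Slack tells us how long to wait; fall back to exponential backoff if absent.
      const retryAfter = Number(res.headers.get("retry-after") ?? 0);
      const backoffMs = retryAfter > 0 ? retryAfter * 1000 : 2 ** attempt * 1000;
      await sleep(backoffMs);
      continue;
    }

    // Non-retryable error: give up rather than hammering the API.
    throw new Error(`Slack alert failed with status ${res.status}`);
  }

  throw new Error("Slack alert failed after all retry attempts");
}
```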
Some processing is slower than normal (Tue, 24 Sep 2024 19:00:00 -0000)
https://status.trigger.dev/incident/434184#e9b363c66b20ac68733076dfb9535845138c1683d590d1e799349edcd77eb8c8
Impacted:
- v2 run queues are processing slower than normal
- v3 alerts are taking longer than normal to send
- v3 scheduled runs may start slightly later than scheduled
- v3 triggerAndWait/batchTriggerAndWaits are taking longer than normal to continue their parent runs
We are working to get service back to normal.

Our emails aren't sending (downstream provider issue) (Sat, 24 Aug 2024 14:45:00 -0000)
https://status.trigger.dev/incident/418604#e4af28f1361b09db5268f552f3483bd2fba6f07c07b61c00ac171ae6e8827109
Resend is back online so magic link and alert emails are working again.

Our emails aren't sending (downstream provider issue) (Sat, 24 Aug 2024 12:00:00 -0000)
https://status.trigger.dev/incident/418604#ad909c2c21439b75db4c92783bc9902e5cc35700033c4a133252b9c58029345b
Resend, our email provider, is currently down in the US. You can follow their status here: https://resend-status.com/incidents We'll update when we know more. This impacts:
- Magic link login
- Email alerts

Some v3 runs are crashing with triggerAndWait (Tue, 20 Aug 2024 18:40:00 -0000)
https://status.trigger.dev/incident/416656#d3a8f212b91b5d0dba30342e1674d43223fe7b72ae5e699cfeaad3fb03498706
A fix has been deployed and tested. We're just confirming that it has fixed all instances of this issue.

Some v3 runs are crashing with triggerAndWait (Tue, 20 Aug 2024 17:00:00 -0000)
https://status.trigger.dev/incident/416656#4373e3e298e342d148be0fd2d93655b282cd0e4b1f0b2279ff14e7e22c11e99f
You'll see the status as CRASHED and the error will say: "Invalid run status for execution: WAITING_TO_RESUME". We've diagnosed the issue and are looking to ship a fix quickly.

v3 runs are starting slower than normal (Thu, 15 Aug 2024 21:45:00 -0000)
https://status.trigger.dev/incident/414402#34f97ddc14335244a67960322243b042bf565b8610edde579b7ac3d7b7859f9a
Runs are processing very fast now and everything will be fully caught up in 30 mins. The vast majority of organizations caught up an hour ago. If you were executing a lot of runs during this period, unfortunately it is quite likely that some of them failed. You can filter by Failed, Crashed and System Failure on the Runs page. Then you can multi-select them and use the bulk actions bar at the bottom of the screen to mass replay them. We're really sorry about this incident and the impact it's had on you all. This wasn't caused by a code change and wasn't a gradual decline in performance, so it was hard to foresee. Some critical system processes in our primary database started failing, causing locking transactions. This wasn't obvious at the time unfortunately. We will be doing a full write-up of this incident tomorrow and we have an early plan of some tools we're going to use to ensure this doesn't happen again.
v3 runs are starting slower than normal (Thu, 15 Aug 2024 19:40:00 -0000)
https://status.trigger.dev/incident/414402#c3e970afee5b70da611e0982abd7f8b69e137c9a5ccbe3a80c6337ffa46625c3
The dashboard and API are back to full functionality. We are processing lots of runs again but there is a big backlog because of the slow processing over the past couple of hours. We're working hard to catch the queue up. Our primary database had entered into an unrecoverable state for an unknown reason, with permanently locked transactions and many critical underlying processes that weren't functioning properly. Unfortunately this wasn't obvious. We switched our primary database ("failover") to one of our replicas and performance immediately returned to normal. Sorry folks, this wasn't caused by a code change or a gradual decline in performance, so it was hard to foresee. We have found a specialist database monitoring tool that we are going to use to prevent this from happening again, or hopefully make it obvious if it ever does. I'll update here when queue sizes are completely back to normal.
v3 runs are starting slower than normal (Thu, 15 Aug 2024 17:14:00 -0000)
https://status.trigger.dev/incident/414402#b8313a880f632070ef8755337d2106a09657a2b568e037f0bc23a5ebfff16634
V3 runs are starting slower than normal and some runs are failing because of database transaction timeouts. We're deploying changes to try and fix this but have not determined the root cause yet.

v3 runs are starting slower than normal (Thu, 15 Aug 2024 12:45:00 -0000)
https://status.trigger.dev/incident/414402#7bd13f517300eca286d35ad9eb6411b98ffba3c73ce094e9ccaa2b1446f4d199
We're experiencing very high database load which is causing v3 runs to be queued for longer than normal before starting. We're investigating the root cause of this and how to alleviate it.

v3 runs are slower than normal to start (Thu, 18 Jul 2024 22:14:00 -0000)
https://status.trigger.dev/incident/400241#5e6bd0054354fb4e0ce951204125aa7e4295c82659a4ecf3a7d3a3fb10e810c2
v3 runs are now operating at full speed. We were unable to spin up more servers and so the total throughput of v3 runs was limited. This caused a lack of concurrency and so runs started slower than normal, although they were being fairly distributed between orgs. We managed to fix the underlying issue that was causing servers not to spin up.

v3 runs are slower than normal to start (Thu, 18 Jul 2024 21:22:00 -0000)
https://status.trigger.dev/incident/400241#fc4225a1c42b99305ae80925cd6cc93ea5a6664d62df49dc576abae225798a43
Our v3 cluster isn't able to spin up new servers so we can process more runs at once – this is why our concurrency is lower and start times are slower than normal. This seems to be an underlying issue with our cloud provider.

v3 runs are slower than normal to start (Thu, 18 Jul 2024 20:50:00 -0000)
https://status.trigger.dev/incident/400241#f84823ad744b461c913187be6a94a48787bc79e1e6cdb35a82657d88717d4bce
v3 runs are processing but they're starting slower than normal. We're working on a fix for this.

Dashboard/API is down (Thu, 11 Jul 2024 10:55:00 -0000)
https://status.trigger.dev/incident/396747#0651a085c0f5a3e11883b6d0f438130797945d43509b583bbc66961824f63c50
The platform is working correctly again. Runs will pick back up. Some runs that were in progress may have failed. This was caused by a bad migration that didn't cause the deployment to fail automatically, and so it rolled out to the instances.
Dashboard/API is down (Thu, 11 Jul 2024 10:47:00 -0000)
https://status.trigger.dev/incident/396747#68ea9f859f91fb5265c3fbcf720b1986051de37feaf2cc852cb46fc3d912e155
The dashboard and API are down due to a deployment. We're working to fix this.

Dashboard and API degraded (Thu, 20 Jun 2024 23:00:00 -0000)
https://status.trigger.dev/incident/387209#084703501c3495720c44eb25af68107261cd748c7cd4dcc8771c6e0a228c72f4
We're continuing to monitor the situation but API and Dashboard services have been fully restored for the last 15+ minutes, and we think we have found the issue. We'll continue to monitor.

Dashboard and API degraded (Thu, 20 Jun 2024 21:28:00 -0000)
https://status.trigger.dev/incident/387209#9afa9fa43c559230a5530e9807262c76d675f02dbec0685ef4ef6f480e856b0f
We're once again dealing with an issue that is causing a CPU spike in our production API and Dashboard server instances. We're investigating and will update when we have any more news to share.

Our API and dashboard response times are elevated (Thu, 20 Jun 2024 14:35:00 -0000)
https://status.trigger.dev/incident/387013#5a90d63723377a9539abd559eafe7c36426a2b9c21948cbd454c630dcdbc93e3
API and cloud response times are back to normal.
Our API and dashboard response times are elevated (Thu, 20 Jun 2024 14:05:00 -0000)
https://status.trigger.dev/incident/387013#845cef828c2070857f8c4febfd05a2a11d9502360bbdb4a580373085e6fc47bf
We're experiencing higher than normal response times in the API and the dashboard. v2 and v3 run queue times are normal and runs are still executing. General system load is normal. We are investigating what's causing this and will update; it looks like it's one of our providers.

Some v3 runs are failing (Wed, 19 Jun 2024 10:56:00 -0000)
https://status.trigger.dev/incident/386440#649950ac4bdd9b171d151612b2702b07895c6a11ca4a337c730cf6b7bdd63795
v3 runs are back to normal. There was an abnormal number of runs in the System Failure state because some of the data being passed from the workers back to the platform was in an unexpected format. We are bulk replaying runs that were impacted.

Some v3 runs are failing (Wed, 19 Jun 2024 10:50:00 -0000)
https://status.trigger.dev/incident/386440#a26d263bf20006e6650931d20d74bcc3217e5db69ff203e7c7decc08fae13085
v3 package version `3.0.0-beta.38` (that was released an hour ago) is throwing a `sendWithAck() timeout` error at the end of an attempt when trying to send the logs. We've released a new package `3.0.0-beta.39` that should fix this.

Some v3 runs are failing (Wed, 19 Jun 2024 10:35:00 -0000)
https://status.trigger.dev/incident/386440#15c5b6960309c3af74affcf34331187127fc98de556740b6c65fd4aca28bce52
We're investigating why more v3 runs are failing than normal.

v3 runs are paused due to network issues (Thu, 13 Jun 2024 12:20:00 -0000)
https://status.trigger.dev/incident/383771#a581fe9e8d84197481ff9939fb61b6f468c2dec4058599f3eac745a0a85a0b2a
Runs are operating at full speed. We think this issue was caused by the clean-up operations that clear completed pods. There are far more runs than a week ago, so that list can get very large, causing a strain on the system including internal networking. We've increased the clean-up frequency and are monitoring the load, including networking. After 15 mins everything seems normal.

v3 runs are paused due to network issues (Thu, 13 Jun 2024 12:15:00 -0000)
https://status.trigger.dev/incident/383771#a973384888024747be31822f6c40df51fae9723c1b3903f85470cbdc708037f6
v3 runs are processing with slightly reduced capacity in our cluster. Some nodes that we've isolated have network issues. We're still trying to diagnose the root cause to prevent this from happening again.

v3 runs are paused due to network issues (Thu, 13 Jun 2024 11:19:00 -0000)
https://status.trigger.dev/incident/383771#aaf42730491319254d2ac0c7de30fb74ec3a25ddd484cd935521a43c090385b5
There's a networking issue in our cluster. The BPF networking change we made yesterday hasn't fully fixed the problems. We're working to get runs executing as quickly as possible and then figure out the root cause of this issue so it doesn't happen again.

v3 runs have stopped (Wed, 12 Jun 2024 22:10:00 -0000)
https://status.trigger.dev/incident/383449#f9cd9a052d0d268d28b922058aad644efd101d1d9dba87f3ff2e7ccc52f98fff
v3 runs are now executing again. Networking was down because of an issue with BPF. While networking was down, tasks couldn't heartbeat back to the platform. If the platform doesn't receive a heartbeat every 2 mins then a run will fail. Fewer than 500 runs in total were failed because of this. You can filter by status "System Failure" in the runs list to find these and then bulk replay them: select all, move to the next page and select all again, then replay them using the bottom bar.
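As a rough illustration of the heartbeat mechanism described in the update above (a worker pings the platform, and a run is failed if no heartbeat arrives within 2 minutes), here is a hypothetical sketch. The 2 minute window comes from the update; all other names and intervals are assumptions and this is not Trigger.dev's implementation.

```ts
// Illustrative only: worker-side heartbeat loop plus a platform-side check.

const HEARTBEAT_INTERVAL_MS = 30_000;    // how often the worker pings (assumed)
const HEARTBEAT_TIMEOUT_MS = 2 * 60_000; // platform-side failure threshold (2 mins)

// Worker side: keep heartbeating while the task executes; returns a stop function.
function startHeartbeat(runId: string, send: (runId: string) => Promise<void>) {
  const timer = setInterval(() => {
    // If networking is down, this call fails and the platform stops hearing from us.
    send(runId).catch(() => {});
  }, HEARTBEAT_INTERVAL_MS);
  return () => clearInterval(timer);
}

// Platform side: periodically collect runs whose last heartbeat is too old so they
// can be marked as System Failure.
function findRunsToFail(lastHeartbeatAt: Map<string, number>, now = Date.now()): string[] {
  return [...lastHeartbeatAt.entries()]
    .filter(([, at]) => now - at > HEARTBEAT_TIMEOUT_MS)
    .map(([runId]) => runId);
}
```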
v3 runs have stopped (Wed, 12 Jun 2024 21:36:00 -0000)
https://status.trigger.dev/incident/383449#9a6bb7463cbee3ea4c840bb355947338fe90f322e5c2a55ceb7c462480fe85d7
v3 runs have stopped because of a networking issue in our cluster. We're working to diagnose if this is an issue with our cloud provider and are trying to reset things.

v2 runs are slower than normal to start (Mon, 10 Jun 2024 12:30:00 -0000)
https://status.trigger.dev/incident/382219#b79ef4d852780e17ff4cb3c20e7aa779e19ba7fa7068468cd56e09237644886a
v2 p95 start times have been under 2s for 10 mins, so we're resolving this issue. We think this is because there are a lot of schedules that send an event at midday UTC on a Monday. We're looking into what we can do about that.

v2 runs are slower than normal to start (Mon, 10 Jun 2024 12:24:00 -0000)
https://status.trigger.dev/incident/382219#d2ebe1f2ef82e2f55f3129aa39c7514caf1e90f63087c4286f0ff6dcb517573d
v2 p95 start times are now under 1 second.

v2 runs are slower than normal to start (Mon, 10 Jun 2024 12:16:00 -0000)
https://status.trigger.dev/incident/382219#010be3326161d0a53fa97bf7dd8d4637657bb2ea8f86574c362e480c346f44e2
Between 12:15–12:18 UTC p95 queue times were up to 1 min. They've come down to 1.9s p95 now. Monitoring and will update.

v2 jobs are starting slowly (Wed, 05 Jun 2024 18:35:00 -0000)
https://status.trigger.dev/incident/379960#d5294d4dee8985fe6d794319f7c07e258cf35812ca5c2cf004bdc3e7fb48825c
Performance metrics for v2 are back to normal. v3 was unimpacted by this issue. We have identified the underlying performance bottleneck and will publish a full retrospective on this. We can now handle far more v2 load than we could before.

v2 jobs are starting slowly (Wed, 05 Jun 2024 15:00:00 -0000)
https://status.trigger.dev/incident/379960#83e0133a2d11e2a916cfc9f2a5dee1bb16e55c1b9d15f649e567e6d91eb953a3
v2 runs are starting with delays of a couple of minutes. We're working on a fix for this.

v2 jobs are queued (Mon, 03 Jun 2024 20:00:00 -0000)
https://status.trigger.dev/incident/378831#92d8496175b68cc9097d1d08fc08d0ae4e0cc5117dafa3a5f9ece694d8c39909
Runs have been executing at normal speeds and queues have been down to normal size for a couple of hours, so this is marked as resolved. During this incident queue times were longer than normal for v2 runs. We've made some minor changes as well as increasing capacity. We are working on a larger change, which should ship in the next few days, that we think should mean very large v2 run spikes don't cause these problems.
v2 jobs are queued https://status.trigger.dev/incident/378831 Mon, 03 Jun 2024 17:18:00 -0000 https://status.trigger.dev/incident/378831#daca17f391480da3dc64dfec85df784db0d5a58db65dca53668e613312f0a5cf v2 queues are getting smaller quickly now. v3 is still operating normally. v2 jobs are queued https://status.trigger.dev/incident/378831 Mon, 03 Jun 2024 16:42:00 -0000 https://status.trigger.dev/incident/378831#b353011fd980544118576e2f3f7535c7a171c9ef2f5179d065e4d11512b2a541 More capacity is now online and job queues are catching up, but not all the way. We're currently assessing whether it's possible to increase capacity even more. v2 jobs are queued https://status.trigger.dev/incident/378831 Mon, 03 Jun 2024 16:29:00 -0000 https://status.trigger.dev/incident/378831#400beb1080356463147317e1a8994ed44dbe3eccbea9521c08de8143951f0f1d We are currently struggling to keep up with v2 job capacity and some jobs have been backed up. We're working on increasing capacity now. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 20:45:00 -0000 https://status.trigger.dev/incident/377985#eb22eea3e59c463625e8047e1c255187026f9cfdda48fa773173d784b7626e99 v2 jobs are now all caught up and processing normally. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 17:54:00 -0000 https://status.trigger.dev/incident/377985#860ab9d18a047ba51590110c34e28d52f3ff6778223b5494990deb56653abe8e v2 jobs have started to catch up and should be back to normal operation soon. We'll keep an eye on it. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 17:45:00 -0000 https://status.trigger.dev/incident/377985#d70515877427e10423cf99fb17aa7b5992917af7c112d24a2f9885bcc95fc038 v2 jobs are still processing, but slowly. We're working on rolling out additional capacity to handle the backlog. v3 tasks aren't affected. v2 job backlog https://status.trigger.dev/incident/377985 Sat, 01 Jun 2024 17:36:00 -0000 https://status.trigger.dev/incident/377985#af60bc1368293c48e37c0970bdf342863dfb99af8694951b6da8a0bef509b2b9 We're currently experiencing an issue that is causing v2 job runs to be excessively delayed. We're working on fixing the issue and will update again shortly as we make progress. The v3 cluster is slow to accept new v3 tasks https://status.trigger.dev/incident/369236 Tue, 14 May 2024 07:00:00 -0000 https://status.trigger.dev/incident/369236#3749f587a8c00096c336d3c0c8be0b59acf881c426a3972a16a7a76f65ec5e13 Runs are operating at normal speed again. There were pods in our cluster in the RunContainerError state; this happens when a run isn't heartbeating back to the platform. We're closely monitoring and have cleaned these up. We're determining which tasks caused this and what we can do to prevent it from happening in the future.
The v3 cluster is slow to accept new v3 tasks https://status.trigger.dev/incident/369236 Tue, 14 May 2024 05:00:00 -0000 https://status.trigger.dev/incident/369236#e482ab5670e533a0989716a06265a5c2cd2688092f0d5d7cecd9f421c6319b6f The v3 cluster is slow to accept new v3 tasks. Queues and runs have been processing at good speeds now for several hours on v2 and v3 https://status.trigger.dev/incident/369241 Thu, 09 May 2024 21:00:00 -0000 https://status.trigger.dev/incident/369241#322cccdadc68287cdc1eaa108173ae0ba3df5e4a757e12dd84f8e35aa8da71b1 Before we get into what happened, I want to emphasise how important reliability is to us. This has fallen very short of providing you all with a great experience and a reliable service. We're really sorry for the problems this has caused for you all. **All paying customers will get a full refund for the entirety of May.**

## What caused this?

This issue started with a huge number of queued events in v2. Our internal system that handles concurrency limits on v2 was slowing down the processing of them, but it was also causing more database load on the underlying v2 queuing engine, Graphile Worker. Graphile is powered by Postgres, so this caused a vicious cycle. We've scaled many orders of magnitude in the past year, and all our normal upgrades and tuning didn't work here. We upgraded servers and, most importantly, the database (twice). We tuned some parameters. But as the backlog grew it became harder to recover from because of the concurrency limits built into the v2 system. Ordinarily this limiter distributes v2 runs fairly and prevents very high load by smoothing out very spiky demand.

## What was impacted?

- Queues got very long for v2 and processed slowly.
- Queues got long for v3. The queuing system for v3 is built on Redis, so that was fine, but the actual run data lives in Postgres, which couldn't be read because of the v2 issues. Also, we use Graphile to trigger v3 scheduled tasks.
- The dashboard was very slow to load or was showing timeout errors (ironic, I know).
- When we took the brakes off the v2 concurrency filter, it caused a massive number of runs to happen very quickly. Mostly this was fine, but in some cases it caused downstream issues in runs.
- When we took the brakes off the v2 concurrency filter, it also meant that in some cases v2 concurrency limits weren't respected.

## What we've done (so far)

- We upgraded some hardware. This means we can process more runs, but it didn't help us escape the spiral.
- We modified the v2 concurrency filter so it reschedules runs that are over the limit with a slight delay (a rough sketch of this change follows below). Before, it was thrashing the database with the same runs and could cause huge load in edge cases like this.
- We've upgraded Graphile Worker, the core of the v2 queuing system, to v0.16.6. This has a lot of performance improvements, so we can cope with more load than before.
- We have far better diagnostic tools than before.

Today we reached a new level of scale that highlighted some things that up until now had been working well. Reliability is never finished, so work continues tomorrow.
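A minimal sketch of the reschedule-with-delay change mentioned in the list above. It is illustrative only: the real v2 concurrency filter sits on top of Graphile Worker and Postgres rather than an in-memory counter, and the limit and delay values below are assumptions.

```ts
// Minimal sketch of "reschedule over-limit runs with a slight delay".
// Illustrative only: not the actual v2 concurrency filter.

const CONCURRENCY_LIMIT = 10;      // assumed per-project limit
const RESCHEDULE_DELAY_MS = 2_000; // the "slight delay" (assumed value)

const executing = new Map<string, number>(); // projectId -> runs in flight

async function startRunWithLimit(
  projectId: string,
  execute: () => Promise<void>
): Promise<void> {
  const inFlight = executing.get(projectId) ?? 0;

  if (inFlight >= CONCURRENCY_LIMIT) {
    // Before the fix: over-limit runs were reconsidered immediately,
    // hammering the queue's database with the same runs over and over.
    // After the fix: push them back by a short delay instead.
    setTimeout(() => {
      void startRunWithLimit(projectId, execute);
    }, RESCHEDULE_DELAY_MS);
    return;
  }

  executing.set(projectId, inFlight + 1);
  try {
    await execute();
  } finally {
    executing.set(projectId, (executing.get(projectId) ?? 1) - 1);
  }
}

// Example: the 11th simultaneous run for a project waits ~2s before retrying.
for (let i = 0; i < 11; i++) {
  void startRunWithLimit("project-a", async () => {
    await new Promise((resolve) => setTimeout(resolve, 5_000)); // fake work
  });
}
```

The point of the delay is simply that an over-limit run waits before being reconsidered instead of being re-polled immediately, which is what was thrashing the database during the incident.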