Run Lists and Logs Are Degr...

Resolved
Jul 1, 2026 at 11:22am UTC

Update: the full postmortem is now published: https://trigger.dev/blog/incident-report-jun-30-2026

Updated
Jun 30, 2026 at 3:54pm UTC

This incident has been resolved. We are continuing to monitor, and a postmortem will be posted soon.

Updated
Jun 30, 2026 at 3:43pm UTC

First, the important part: your runs are safe. Runs execute and are stored in Postgres as normal. This issue is isolated to ClickHouse, which we use as a read-optimised view store for querying runs. Triggering, execution, and storage have been unaffected throughout. This is a visibility issue, not a data loss one.

The fix is to downgrade ClickHouse to 25.12. That rollback runs at the ClickHouse Cloud infrastructure level, so we're coordinating with the ClickHouse Cloud team to apply it. The rollback is agreed and queued.

What you're seeing

While the view was degraded, the runs list, filtering, bulk actions, the query API, and all error monitoring and alerting were slow or unavailable, and dashboard charts, traces, logs, and span detail were blank or incomplete for affected runs. Run execution, billing, and your Postgres-stored run data were unaffected throughout.

Dashboard and API queries against run data are slow or failing. There was a window this morning where these reads were down entirely.
A small percentage of runs created since June 17 aren't yet in the correct state in the view. The runs ran and are stored. The view of them is catching up.
Run logs and spans have recovered. Run views are still catching up.

What's causing it

ClickHouse moved from 25.12 to 26.2 on June 17. The new version added a cap on JSON column complexity (input_format_binary_max_type_complexity, defaulting to 1000), measured across the combined outputs in each batched insert. Once a batch went over the cap, its inserts stopped landing in the view. Then this morning, June 30, memory and CPU on the instance climbed sharply as the backlog of unmerged data caught up with it. Reads degraded until the instance ran out of memory, which caused today's outage. The 25.12 rollback resolves the root cause.
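
To illustrate why a cap measured across a whole batch can reject inserts whose individual rows look harmless, here is a minimal sketch. It is not ClickHouse's actual measurement: it assumes, purely for illustration, that "complexity" can be approximated by the number of distinct JSON paths in a batch. The names batchComplexity, collectPaths, and makeOutput are hypothetical, and the limit of 1000 is the default quoted above.

```typescript
// Illustrative only: approximates "complexity" as the number of distinct
// JSON paths across a batch. This is an assumption, NOT how ClickHouse
// computes input_format_binary_max_type_complexity.

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Default cap quoted in the update above.
const MAX_TYPE_COMPLEXITY = 1000;

// Record every distinct leaf path (e.g. "user.address.city") found in a value.
function collectPaths(value: Json, prefix: string, paths: Set<string>): void {
  if (Array.isArray(value)) {
    // All elements of an array share one path segment.
    for (const item of value) collectPaths(item, `${prefix}[]`, paths);
    return;
  }
  if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      collectPaths(child, prefix ? `${prefix}.${key}` : key, paths);
    }
    return;
  }
  // Scalars and null are leaves.
  paths.add(prefix);
}

// Count distinct paths across ALL outputs in one batch.
function batchComplexity(outputs: Json[]): number {
  const paths = new Set<string>();
  for (const output of outputs) collectPaths(output, "", paths);
  return paths.size;
}

// Hypothetical helper: build an output with `keyCount` uniquely named keys.
function makeOutput(prefix: string, keyCount: number): Json {
  const out: { [key: string]: Json } = {};
  for (let i = 0; i < keyCount; i++) out[`${prefix}_${i}`] = i;
  return out;
}

const a = makeOutput("a", 600);
const b = makeOutput("b", 600);

console.log(batchComplexity([a]) > MAX_TYPE_COMPLEXITY);    // false: 600 paths
console.log(batchComplexity([a, b]) > MAX_TYPE_COMPLEXITY); // true: 1200 paths
```

Under this simplified model, each output stays under the cap on its own, but two differently shaped outputs in the same batch push the combined count over it. Outputs that share the same shape add no new paths, so the batch size alone does not determine whether the cap is hit.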

What we've done so far

Cleared the data that couldn't merge so healthy merges can continue. Run logs and spans recovered quickly; run views are slower and still catching up.
Scaled the cluster horizontally to add headroom.
Lined up the 25.12 rollback with ClickHouse Cloud.

Updated
Jun 30, 2026 at 2:08pm UTC

We're continuing to monitor API response times, which are steadily improving. Our mitigation efforts remain ongoing.

Updated
Jun 30, 2026 at 1:17pm UTC

Runs are triggering and executing normally. Performance of the runs list (runs.list API + dashboard) is now improving. Run logs and spans are also improving. We're actively monitoring the rollout of the remediation. Next update in 30 mins.

Updated
Jun 30, 2026 at 12:36pm UTC

Runs are triggering and executing normally. The runs list (runs.list API + dashboard) is slow and may be delayed. Run logs and spans may be delayed. Cause identified, remediation in progress. Next update in 30 mins.

Updated
Jun 30, 2026 at 11:41am UTC

The root cause has been identified and remediation work is in progress.

Updated
Jun 30, 2026 at 11:06am UTC

We believe slow ClickHouse replication is due to long-running merges involving deeply nested output objects. We have vertically scaled the cluster and are working with ClickHouse support on mitigations.

Updated
Jun 30, 2026 at 10:19am UTC

Spans and logs are also degraded. This is due to slow replication into our ClickHouse cluster.

We are investigating.

Updated
Jun 30, 2026 at 10:11am UTC

The runs.list() SDK function and Runs dashboard page are degraded. This is due to slow replication into our ClickHouse cluster.

We are investigating.

Created
Jun 30, 2026 at 9:40am UTC

Runs are executing normally.

The replication of runs to ClickHouse, which powers the Runs list, is impacted. We're investigating.