OpenTelemetry logs and spans ingestion issues
Resolved
Dec 02 at 11:18am GMT
We've published a full post-mortem on this incident here: https://trigger.dev/blog/clickhouse-too-many-parts-postmortem
Updated
Dec 01 at 04:38pm GMT
We have pushed a fix that will prevent this ingestion issue from happening again.

The issue stemmed from a "Too many parts" error when inserting new data into ClickHouse. This is a common ClickHouse failure mode: a poorly designed partition key caused inserts to create very small parts, which exhausted the server's merge capacity and led to merge failures.

To mitigate this, we initially created a new table, free of the partition key problem, for new runs to write their OTel data into. That worked, but because old runs were still writing to the old, poorly designed table, its tiny parts continued to exhaust the server's merge capacity, stalling merges on the new table as well. We immediately increased the number of ClickHouse replicas, which bought us some merge headroom and allowed inserts to resume. It was still a race against the clock: we needed to ship a robust fix to the existing table before merge capacity was exhausted again. That fix has now shipped, and we're monitoring the ClickHouse server's merge capacity, which is holding up.
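For illustration, here is a minimal ClickHouse sketch of this failure mode; the table and column names are hypothetical, not our actual schema. Partitioning by a high-cardinality key (such as a run ID) scatters each insert across many partitions, producing lots of tiny parts for the merge scheduler to work through, while a coarse time-based key keeps the part count bounded:

```sql
-- Anti-pattern: one partition per run, so every insert creates tiny
-- parts in many partitions and merges can't keep up.
CREATE TABLE otel_spans_bad
(
    run_id     String,
    trace_id   String,
    span_id    String,
    start_time DateTime64(9),
    body       String
)
ENGINE = MergeTree
PARTITION BY run_id              -- high cardinality: one partition per run
ORDER BY (run_id, start_time);

-- Safer: a low-cardinality, time-based partition key, so inserts within
-- the same day land in one partition and merge cheaply.
CREATE TABLE otel_spans_good
(
    run_id     String,
    trace_id   String,
    span_id    String,
    start_time DateTime64(9),
    body       String
)
ENGINE = MergeTree
PARTITION BY toDate(start_time)  -- low cardinality: one partition per day
ORDER BY (run_id, start_time);

-- Merge pressure of this kind can be watched via system.parts,
-- e.g. by counting active parts per partition:
SELECT table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY active_parts DESC
LIMIT 10;
```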
Updated
Dec 01 at 04:08pm GMT
OTel ingestion has been operational since 1PM UTC, but we are still finalizing a more permanent fix before we move the service from degraded to resolved.
Created
Dec 01 at 02:08pm GMT
We are currently having issues with our ingestion of OpenTelemetry logs and spans after rolling out a fix for the ClickHouse issue that occurred over the weekend. We're investigating.