Resolved

v3 runs are paused due to network issues

Jun 13, 2024 at 11:19am UTC

Affected services

Dashboard

Resolved
Jun 13, 2024 at 12:20pm UTC

Runs are operating at full speed.

We think this issue was caused by the clean-up operations that clear completed pods. There are far more runs than a week ago, so the list of completed pods can grow very large, which strains the system, including internal networking. We've increased the frequency of the clean-up and are monitoring the load, including networking. After 15 minutes, everything seems normal.
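
A rough sketch of why clean-up frequency matters, assuming runs complete at a steady rate and each clean-up pass clears every completed pod. The function name and all numbers are hypothetical and for illustration only; they are not measurements from the cluster.

```python
def peak_completed_pods(completions_per_minute, cleanup_interval_minutes):
    """Approximate number of completed pods waiting just before a clean-up pass.

    Assumes a steady completion rate and that each pass clears the whole list.
    """
    return completions_per_minute * cleanup_interval_minutes


# Hypothetical workloads: ten times more runs than a week ago.
last_week = peak_completed_pods(completions_per_minute=100, cleanup_interval_minutes=60)
this_week = peak_completed_pods(completions_per_minute=1000, cleanup_interval_minutes=60)

# Same higher workload, but with clean-up running more often.
more_frequent = peak_completed_pods(completions_per_minute=1000, cleanup_interval_minutes=5)

print(last_week, this_week, more_frequent)
```

Under these assumptions the backlog grows in proportion to both the completion rate and the interval between passes, so running the clean-up more often keeps the list short even as the number of runs rises.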

Updated
Jun 13, 2024 at 12:15pm UTC

v3 runs are processing with slightly reduced capacity in our cluster. Some nodes that we've isolated have network issues. We're still trying to diagnose the root cause to prevent this from happening again.

Created
Jun 13, 2024 at 11:19am UTC

There's a networking issue in our cluster. The BPF networking change we made yesterday hasn't fully fixed the problems.

We're working to get runs executing as quickly as possible, and will then figure out the root cause of this issue so it doesn't happen again.