Windows builds constantly failing with IOException: connection was forcibly closed by the remote host #4467
Comments
@sluongng any chance you're able to reproduce this on Windows? This feels like an error that Bazel would handle gracefully / retry on Linux (this suggests that this is the Windows error message for an RST stream). I believe our servers send an RST stream when they're restarting (and the 3 errors linked are from around the time of our release yesterday, when all of the apps are started, which happened around noon PST). The question for me is why Bazel isn't able to successfully retry the BES stream on Windows like it seems to be able to on Linux / Mac (otherwise we could never restart our servers / perform releases).
I have not seen this before while trying to build BuildBuddy on Windows 🤔 Gonna reach out to Tweag folks on Slack to see whether I can reproduce it with rules_haskell... though I doubt that this is specific to a repo, there might be some flags/options specific to rules_haskell CI (or their GitHub Actions runner) that could affect this.
Seems like it happened during a release. You could try running a Windows build against a local server and restarting it mid-build.
Slightly hard since our server does not build on Windows 😅 Tried to stop and restart the server on my MacBook 2 times mid-build but could not reproduce the issue 🤔
I tried a bunch of approaches and could not replicate the exact error. @avdv It would be helpful if you could help capture the java log in It's likely that the 5 hours build duration overlapping with our weekly release time frame was the root cause of this: our old server received the stream -> was interrupted for upgrade -> Bazel retried to another server that would also get interrupted later. If this is indeed the case, bumping |
Thank you!
I'll try to reproduce and attach the files.
Discussed with @avdv on Slack and tested it myself in https://github.com/sluongng/rules_haskell/actions/runs/5785750975/job/15678998276 https://app.buildbuddy.io/invocation/c912e3eb-c949-4872-9246-47a743b7425d It seems that the issue is unrelated to our weekly release. I could not reproduce it on an AWS EC2 Windows VM, but it seems to be easy to reproduce on GitHub Actions' managed runners (on Azure).
One thing I've noticed is that the Github action runners on Linux seem to have pretty flaky networking, and that setting this kernel param seems to help:
I wonder if there is an equivalent for Windows. Based on this it seems the default is 2 hours and it can be tweaked: https://serverfault.com/a/735768
I have made a PR here where I uploaded log files as artifacts. See here. Looking at the "windows-latest rules_haskell bzlmod" job's logs:
So the build started at 05:23:49 and the first disconnection happened at 05:36:46. Then it happened again after 7, 5, 4 and 7 minutes. I'll just try to bump the retries to a high number. Thank you @siggisim, I'll look into the TCP keepalive setting too.
I did set the
Setting |
This remedies an issue (buildbuddy-io/buildbuddy#4467) where the build would fail eventually because the connections to the remote get closed intermittently.
Just to note (seems I only mentioned this in Slack), we are mostly seeing this error with PRs from forks (where the BuildBuddy API secret is not set, so we only have a read-only cache). Also, I made a PR adding the max retries setting and it failed with an upload timeout error again.
I'll try to increase the timeout and see if it makes a difference (no high hopes, I guess 60s should be plenty already).
https://www.buildbuddy.io/docs/troubleshooting-slow-upload#the-build-event-protocol-upload-timed-out We do recommend setting the BES timeout value to 600s here. I think it should help with cases like yours, but it would be better to know where in the network stack the failure occurred 🤔
Thank you @sluongng, but it seems like it did not help: https://github.com/tweag/rules_haskell/actions/runs/5834070440/job/15838104954?pr=1935 I'll retry and try to get a log file. Edit: the job ran successfully on the third try.
We have recently helped a few customers troubleshoot their AWS network connectivity. Folks who connect to BuildBuddy through a NAT Gateway are often subject to the gateway's TCP timeout, especially when it comes to a long-running gRPC stream. It's unknown how GitHub sets up the managed Windows runners, but I guess that they are subject to similar limitations. On the Bazel side, there are the --grpc_keepalive_time and --grpc_keepalive_timeout flags that might help with this behavior. There is also --experimental_remote_execution_keepalive, but I don't think RBE is relevant here. Also, here is how you can tune the setting via sysctl on Linux: https://www.buildbuddy.io/docs/troubleshooting-rbe#warning-remote-cache-unavailable-io-exception. It seems like you can achieve a similar effect on Windows: https://serverfault.com/questions/735515/tcp-timeout-for-established-connections-in-windows. However, since we are discussing GitHub-managed runners, I think these are less relevant.
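For reference, a minimal sketch of how the keepalive flags mentioned above could be set in a `.bazelrc`. The durations are illustrative assumptions, not values tested in this thread, and whether gRPC keepalive pings also cover the BES upload stream is not confirmed here. The `--bes_timeout=600s` line uses the value recommended in the BuildBuddy troubleshooting doc linked earlier.

```
# .bazelrc (sketch; the 30s/20s durations are assumptions, tune them for your runner)

# Send a keepalive ping on an outgoing gRPC connection after 30s without
# read activity, so an idle stream is less likely to hit a gateway timeout.
build --grpc_keepalive_time=30s

# Consider the connection dead if a keepalive ping gets no reply within 20s.
build --grpc_keepalive_timeout=20s

# Wait up to 600s for the BES/BEP upload to complete after the build finishes.
build --bes_timeout=600s
```

The idea is to keep the ping interval well below whatever idle timeout the intermediate network applies; since that timeout is unknown for the managed runners, the interval here is only a guess.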
We are using Bazel 6.2.0 currently, but the issue existed before (I tried to follow suggestions in #992 but no dice).
The issue only happens for the Windows builds, Linux and Darwin are fine.
CI runs: 1 (bb), 2 (bb), 3 (bb)
Note, we have 4 jobs in the workflow running on Windows, and one of them succeeded: https://github.com/tweag/rules_haskell/actions/runs/5738410548/job/15564769980?pr=1925
The successful job took a bit over one hour, the failing jobs took >5 hours. Could this be just some sort of timeout?
We are using:
Should we try something else? Or tune different parameters? Thank you for any suggestions!