flake in dhcp-proxy test: parser ran out of data-- not enough byte #618
I run into this issue pretty often with some pods that use a macvlan network:
Even with latest netavark version:
The DHCP server is dnsmasq in all cases, as far as I know.
Are you able to set up packet captures when that happens? It would be good to know whether we actually get the DHCP packets in time. I would think so, given that it flakes in our CI and we test only locally, so the chances for packet loss are very low.
I got a minimal POC on this. Steps to reproduce:
The service restart should fail and the logs show the error below:
The issue can be reproduced easily from now on with quick subsequent service restarts.
Some further observations on this after spending half a day on it:
Here, the green section is a successful DHCP DORA process. The error message can be traced further down the stack. Given the empty payload in the error message, this could originate from the recv buffer at https://github.com/nispor/mozim/blob/v0.2.3/src/socket.rs#L128 being passed empty to the relevant DHCP decoder above it.
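To illustrate why an empty recv buffer produces this exact failure mode, here is a minimal sketch (not mozim's actual code) of a length-checked read, the kind a binary decoder performs on every field. The function name `read_u8` and the error text are illustrative assumptions:

```rust
// Minimal sketch: a bounds-checked read that fails the same way when the
// buffer handed to the decoder is empty or too short.
fn read_u8(buf: &[u8], offset: usize) -> Result<u8, String> {
    buf.get(offset)
        .copied()
        .ok_or_else(|| "parser ran out of data -- not enough byte".to_string())
}

fn main() {
    let empty: &[u8] = &[];
    // An empty recv buffer trips the bounds check on the very first field.
    println!("{:?}", read_u8(empty, 0));
}
```

The point is that the decoder itself is behaving correctly; the bug is upstream, in whatever hands it an empty buffer.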
After enabling debug logs for netavark-dhcp-proxy service the error can be traced in mozim:
I suppose that comes from here:
Thanks for the details. I don't have much time to run this down currently.
I'm still tracing it; I'll try to make a fix if I find the problem.
After analyzing the payloads in the full mozim debug logs:
I converted and inspected the packet that the raw socket received. The crash happens after netavark/mozim sends the packet highlighted in red, because it expects the packet highlighted in green. I suppose the fix should be in mozim, here: https://github.com/nispor/mozim/blob/v0.2.3/src/client.rs#L512, so that it does not process the first packet it receives unconditionally, but instead inspects it, checks whether it is a valid DHCP packet, and skips it if it isn't.
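The proposed skip-invalid-packets approach can be sketched as a cheap validity check before decoding. This is a hypothetical helper, not mozim's API; it only checks for the BOOTP/DHCP magic cookie, which per RFC 2131 sits after the 236-byte fixed BOOTP header:

```rust
// Hypothetical pre-filter: only hand payloads to the DHCP decoder if they
// carry the BOOTP/DHCP magic cookie (0x63825363) at the expected offset.
const DHCP_MAGIC_COOKIE: [u8; 4] = [0x63, 0x82, 0x53, 0x63];
const COOKIE_OFFSET: usize = 236; // fixed BOOTP header length before options

fn looks_like_dhcp(payload: &[u8]) -> bool {
    payload
        .get(COOKIE_OFFSET..COOKIE_OFFSET + 4)
        .map_or(false, |cookie| cookie == DHCP_MAGIC_COOKIE)
}

fn main() {
    // A short, unrelated frame picked up by the raw socket: skipped, not parsed.
    let stray = vec![0u8; 60];
    assert!(!looks_like_dhcp(&stray));

    // A minimal well-formed payload: would be passed on to the real decoder.
    let mut dhcp = vec![0u8; 240];
    dhcp[236..240].copy_from_slice(&DHCP_MAGIC_COOKIE);
    assert!(looks_like_dhcp(&dhcp));
}
```

A receive loop using such a filter would simply `continue` on invalid payloads instead of failing the whole DORA exchange on the first stray frame.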
That sounds logical.
Created a PR in mozim that should fix the above: nispor/mozim#33
mozim 0.2.4 published. |
@cathay4t Thank you!
I'm getting this behavior in a prod situation. The way it appears is that containers just won't start, with that same error message. I have macvlan DHCP networks on a couple of real VLANs (also dnsmasq).

I also noticed that the dhcp-proxy process starts using an inappropriate amount of CPU after some period of time, and when it's in that state it will fail all requests consistently. If I kill the proxy, then most (not all) containers will start again.

I can consistently repro at least a couple of failures when I nuke and repave all of my containers on that box. I'll build and test in the next few days to see if this fixes it. Thanks @agorgl!
I'm running the rust-nightly release flavor now. I'm definitely suffering from #811, as there were 13708 threads created to manage ~15 container leases over 3 days.
I was able to restart all of my containers without any crashes after this patch (for the first time with netavark), so thanks again @agorgl! I also have a tenth of the threads (~1300) after twice the time.
new flake:
https://cirrus-ci.com/task/5066529801240576