-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unconfiguring of 'router bgp <ASN> vrf <VRF_NAME>' fails because VRF disable ignores BGP_VRF_AUTO instances #17903
Comments
Could you test the latest versions and also please test with this patch #17652. |
It does appear on master as well:
Trying with the patch in a moment EDIT - with a patch it also does appear:
|
I just pulled https://github.com/piotrsuchy/tinynetlab, changed iteration count from 100 to 1000, and ran
Am I missing something to reproduce this? |
I provided a virtual machine for Donatas to run the reproduction on. Because we rely on the race condition it might be the case that it happens with a smaller probability on 'faster' machines. |
Description
Configs and exact scripts to reproduce this using two docker containers bridged are below.
Genesis of the issue:
Normally, when we do:
It happens in that order - logs looking like this:
However once every X times when this sequence is done, the reload happens before BGP finishes learning of the addition of VRF, VXLAN and kernel. The logs look like this in that case:
So you can see that there are two times that a VRF is created - and this order is flipped:
Correct - no race condition:
Race condition case:
So the problem occurs, when if we have a race condition case and the order is flipped, when we go to the next step - removal of interfaces and then reloading of FRR config to the one from before (frr-no-vrf-host1.conf) by:
It fails on reloading the config, because the step 1) doesn't properly remove the whole state of the VRF - the l3vni is lingering, because bgp_vrf_disable() uses
bgp = bgp_lookup_by_name(vrf->name;
which includes this check:So even though it's a legit bgp instance we would like to get - it's skipped. Here's the commit that introduced it:
#16159
I would like to know if it's possible to somehow revert this PR, or fix it in an other way, that wouldn't introduce this kind of behavior in case of a race condition that we see.
Example logs of an issue appearing using a reproduction script:
Version
How to reproduce
Here everything is described:
https://github.com/piotrsuchy/tinynetlab/tree/main/reproduction_setups/unconfigure_l3vni_upstream
Running just one bash script sets up the whole environment, using docker images from dockerhub and extra setup steps.
Expected behavior
I expect, similarly to FRR-8.4.2, that even if there is a race condition, if we disable the VRF, it properly removes the l3vni.
Actual behavior
When a race condition happens, the bgp instance of the VRF is an 'AUTO_VRF_BGP' and when we ask for a vrf disable, it is skipped:
f153b9a
That means there is a subsequent reload fails because it says:
"Please unconfigure l3vni"
And the "router bgp vrf <VRF_NAME>" is not deleted from the config.
Additional context
No response
Checklist
The text was updated successfully, but these errors were encountered: