[orchagent]: VXLAN: Fix oper_status and tunnel encapsulation TTL #3383

Open
wants to merge 2 commits into
base: master
from


bradh352
Contributor

@bradh352 bradh352 commented Nov 20, 2024

What I did

This fixes 2 issues tracked across a range of open tickets, building upon patches created by others with modifications as requested by @VladimirKuk.

The first issue this resolves is that the status shown for remote VTEPs is wrong, which makes debugging nearly impossible:

# show vxlan remotevtep
+------------+------------+-------------------+--------------+
| SIP        | DIP        | Creation Source   | OperStatus   |
+============+============+===================+==============+
| 172.16.0.1 | 172.16.0.2 | EVPN              | oper_down    |
+------------+------------+-------------------+--------------+
Total count : 1

The remote VTEP is really up.

Original PR for that is #2080.

This also fixes sonic-net/sonic-buildimage#10004, or at least the error message that hinders debugging.

The next issue is in reachability across VXLANs. This fixes IP/MAC learning via ARP. The original PR for that is #3216; however, it appears to have its origins in
sonic-net/sonic-buildimage#10050, which goes into greater detail about the issue itself. It is also discussed in kamelnetworks/sonic#9, and there is another similar patch here: kamelnetworks@02ee3e3

Why I did it

Fixes #3216
Fixes #2080
Fixes sonic-net/sonic-buildimage#10050
Fixes sonic-net/sonic-buildimage#10004

How I verified it

Pulled into my private sonic-swss fork:
https://github.com/bradh352/sonic-swss/commits/bradh352/master

Which is pulled in by my private sonic-buildimage fork:
https://github.com/bradh352/sonic-buildimage/tree/bradh352/master

That fork is built automatically when changes are made. The uploaded sonic-broadcom.bin asset is then installed onto Dell S5248F and N3248TE switches and tested.

Details if related

Signed-off-by: Brad House (@bradh352)

This should also be backported to 202405

@bradh352 bradh352 requested a review from prsunny as a code owner November 20, 2024 14:41

linux-foundation-easycla bot commented Nov 20, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.


@VladimirKuk VladimirKuk left a comment

Looks good to me

@bradh352
Contributor Author

@VladimirKuk any idea if those test failures are actually a symptom of the patch itself? Or is it just common for things to fail in the tests from time to time?

@VladimirKuk

Tests do fail from time to time.
At least to me, these tests are unrelated to the change.

@bradh352
Contributor Author

@prsunny please review

@lukasstockner

Thank you for pushing this forward!
For the record, we've been running these changes in production for ~2 years without issues, so I'd be quite confident that they work as expected.

@bradh352
Contributor Author

@prsunny ping

@bradh352 bradh352 force-pushed the vxlan-fixes branch 3 times, most recently from 1ad6c1e to 89b365a Compare December 4, 2024 17:16
@bradh352
Contributor Author

bradh352 commented Dec 4, 2024

@VladimirKuk I ended up having to sprinkle your suggestion in 2 places to get it fully working.

@@ -275,7 +275,7 @@ create_tunnel(
sai_ip_address_t *dst_ip,
sai_object_id_t underlay_rif,
bool p2p,
sai_uint8_t encap_ttl=0)
Collaborator

Why is this removed? IIRC, it was done on purpose, to skip adding this attribute in scenarios where the underlying implementation doesn't support this setting.

Contributor Author

That is a static function that only exists in that file and is never called without the encap_ttl argument. Both public functions set the default value, so the argument is optional there. I can of course set a proper default value here if you'd prefer; it's just never used, so I didn't see a point.

Contributor Author

Alternatively, we can go back to a default of 0 everywhere to explicitly mean "choose something for me", then check for 0 in this function and set it to 64. Some old revisions did that, but @VladimirKuk suggested changing it. I personally don't care either way; both approaches definitely fix the issue at hand.
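
For reference, the "0 means choose something for me" alternative could look roughly like the sketch below. This is only an illustration, not the code in this PR: `kDefaultEncapTtl` and `resolve_encap_ttl` are hypothetical names, and the only value taken from the discussion is the fallback of 64.

```cpp
#include <cstdint>

// Hypothetical constant: the fallback TTL mentioned in this thread.
constexpr std::uint8_t kDefaultEncapTtl = 64;

// Hypothetical helper: 0 is treated as "choose something for me" and is
// replaced by the default; any non-zero value is passed through unchanged.
constexpr std::uint8_t resolve_encap_ttl(std::uint8_t requested)
{
    return requested == 0 ? kDefaultEncapTtl : requested;
}
```

With this shape, callers that never pass a TTL still end up with a non-zero value, while callers that pass an explicit TTL keep it.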

Contributor Author

@prsunny how to move forward on this?

@bradh352
Contributor Author

bradh352 commented Dec 7, 2024

@prsunny how would you like to proceed on this? I think this is a critical issue since community SONiC doesn't support ARP/ND suppression so things just don't work at all without this.

@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@bradh352
Contributor Author

@prsunny looks like the internal msconflict is resolved now since pvst was merged to main. Please review and hopefully merge. Thanks!

@mssonicbld
Collaborator

/azp run

@bradh352
Contributor Author

Rebased on top of current master to get rid of merged changes that are now upstream.

@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@bradh352
Contributor Author

I can't tell from the output if somehow the test failures are related to this PR. I'm not familiar enough with the test case output to differentiate expected failures from real ones.

@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…onic-net#3383)

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…onic-net#3383)

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…onic-net#3383)

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Dec 24, 2024
…onic-net#3383)

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 2, 2025
…onic-net#3383)

bradh352 added a commit to bradh352/sonic-swss that referenced this pull request Jan 2, 2025
…onic-net#3383)

@prsunny prsunny requested a review from srj102 January 6, 2025 18:46
@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 7, 2025
…onic-net#3383)

@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@bradh352
Contributor Author

bradh352 commented Jan 7, 2025

@prsunny I figured out the test case failure and corrected it. The test specifically checks tunnel attributes, and we added 2 new ones.

@prsunny prsunny requested a review from dgsudharsan January 7, 2025 17:45
github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 9, 2025
…onic-net#3383)

@bradh352
Contributor Author

bradh352 commented Jan 9, 2025

@abdosi, @judyjoseph, @srj102 any review comments? This is a pretty large blocker for VXLAN EVPN

attr.id = SAI_TUNNEL_ATTR_ENCAP_TTL_MODE;
attr.value.s32 = SAI_TUNNEL_TTL_MODE_PIPE_MODEL;
tunnel_attrs.push_back(attr);
attr.id = SAI_TUNNEL_ATTR_ENCAP_TTL_MODE;
Contributor

The current behaviour of this helper function allows for the uniform as well as the pipe model, with the default being uniform (encap_ttl=0).
With this change, SONiC forces tunnels to be in pipe mode, which is a step back.

Hence I would say stick with the current implementation.
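
To make the behaviour described above concrete, here is a rough, self-contained sketch. It is an assumption-laden illustration rather than the actual helper: `TtlMode`, `AttrId`, `TunnelAttr` and `build_encap_ttl_attrs` are stand-ins invented for this example, whereas the real code works with `sai_attribute_t` and the `SAI_TUNNEL_ATTR_ENCAP_TTL_MODE` attribute shown in the diff.

```cpp
#include <cstdint>
#include <vector>

// Stand-in types for illustration only; they are not the SAI definitions.
enum class TtlMode { Uniform, Pipe };
enum class AttrId { EncapTtlMode, EncapTtlVal };

struct TunnelAttr {
    AttrId id;
    std::int32_t value;
};

// Hypothetical helper mirroring the described behaviour:
//  - encap_ttl == 0: add no TTL attributes, so the tunnel stays in the
//    default (uniform) model, as described in this thread.
//  - encap_ttl != 0: select the pipe model and set an explicit TTL value.
std::vector<TunnelAttr> build_encap_ttl_attrs(std::uint8_t encap_ttl)
{
    std::vector<TunnelAttr> attrs;
    if (encap_ttl != 0) {
        attrs.push_back({AttrId::EncapTtlMode,
                         static_cast<std::int32_t>(TtlMode::Pipe)});
        attrs.push_back({AttrId::EncapTtlVal, encap_ttl});
    }
    return attrs;
}
```

Keeping the zero case attribute-free is what lets platforms that do not support the setting skip it entirely.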

Contributor Author

What's the solution to the issue at hand, then? What we're seeing is that ARP across the tunnel no longer works because it gets a TTL of 0 and is dropped, so machines are unable to find each other across the network. At least, that is what happens on Broadcom Trident 3.

Contributor Author

Or are you saying to just try setting a TTL > 0 in Uniform mode? So just don't change the mode attribute and see what happens?

Contributor

Is the ARP TTL issue on Trident3 still seen with the latest SAI versions?

Contributor Author

It was as of 3 weeks ago on master, anyhow. I haven't tried without this patch since then.

github-actions bot pushed a commit to bradh352/sonic-swss that referenced this pull request Jan 11, 2025
…onic-net#3383)

This fixes 2 issues across a range of open tickets building upon patches
created by others with modifications as requested by @VladimirKuk.

The first issue this resolves is the status shown for remote vteps
which in the fact that it is wrong makes debugging nearly impossible:
```
+------------+------------+-------------------+--------------+
| SIP        | DIP        | Creation Source   | OperStatus   |
+============+============+===================+==============+
| 172.16.0.1 | 172.16.0.2 | EVPN              | oper_down    |
+------------+------------+-------------------+--------------+
Total count : 1
```

The VTEP is really up.

Original PR for that is sonic-net#2080.

Also fixes sonic-net/sonic-buildimage#10004
or at least the error message which hurts debugging.

The next issue is in reachabiity across VXLANs.  This fixes IP/MAC
learning via ARP.  The original PR for that is sonic-net#3216, however it
appears it has its origins in
sonic-net/sonic-buildimage#10050
which goes into greater detail about the issue itself.  Also there
is talk about it here kamelnetworks/sonic#9 as well as another
similar patch here: kamelnetworks@02ee3e3

Fixes sonic-net#3216
Fixes sonic-net#2080
Fixes sonic-net/sonic-buildimage#10050
Fixes sonic-net/sonic-buildimage#10004
Signed-off-by: Brad House (@bradh352)
6 participants