Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix orch stuch when removing vlan member (#3294) #3295

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Yang-Yongzhi
Copy link

What I did
Fixes #3294

The root cause of this issue is the two data struct of vlan member info in orchagent is not in sync.

Why I did it
Fix the bug

How I did it
The return value of setPortPvid() is does not matter, so we ignore it.

So we get the m_members" and "m_portVlanMember" in sync all the time.

How I verified it

  1. enable the swss to debug log level
  2. paste the configuration that triggered this bug.
sudo config vlan member del 1000 EthernetXX
sudo config interface ip add EthernetXX 1.1.1.1/24

Details if related

When one deletes one interface from a vlan, then makes the interface to router interface(add a ip addr to interface)

            +----------+
            | bash cli |
            +----+-----+
                 |
               +---+
              / CFG \
              \ DB  /
               +---+
                 |
   +-------------+-----------+
   |                         |
+----------+            +----------+
| vlanmgrd |            | intfmgrd |
+--+-------+            +----+-----+
   |                         |
   +-------------+-----------+
                 |
               +---+
              / APP \
              \ DB  /
               +---+
                 |
            +----+-----+
            |   orch   |
            +----------+
  • The path of removing vlan member is left path as above.
  • The path of creating a router interface is right path as above.

We have no way to ensure that above 2 config arrives at Orch in order.
It's possible that the "creating a router interface" arrives at first. If so, then the "removing vlan member" will be failed.

There are 2 data struct storing the vlan member info in orchagent. (They should be in sync all the time.)

  • "class Port" in port.h

    std::set<std::string> m_members;
    

    The instance of "class Port" is vlan interface. it's "m_members" is a set of vlan member, like EhternetXX, EthernetYY...

  • "class PortsOrch" in portsorch.h

    map<string, vlan_members_t> m_portVlanMember;  
    

    The instance of "class PortsOrch" is EthernetXX. It's m_portVlanMember is a map of vlan info, like Vlan100, Vlan200...

Please take a look at PortsOrch::removeVlanMember().

bool PortsOrch::removeVlanMember(Port &vlan, Port &port, string > e d_point_ip)
{
    ...........
    sai_object_id_t vlan_member_id;
    sai_vlan_tagging_mode_t sai_tagging_mode;
    auto vlan_member = m_portVlanMember[port.m_alias].find(vlan.m_lan_info.vlan_id);

    /* Assert the port belongs to this VLAN */
    assert (vlan_member != m_portVlanMember[port.m_alias].end());
    sai_tagging_mode = vlan_member->second.vlan_mode;
    vlan_member_id = vlan_member->second.vlan_member_id;

    sai_status_t status = sai_vlan_api->remove_vlan_member(vlan_member_id);
    if (status != SAI_STATUS_SUCCESS)
    {
       ...........
    }
    m_portVlanMember[port.m_alias].erase(vlan_member);
    if (m_portVlanMember[port.m_alias].empty())
    {
        m_portVlanMember.erase(port.m_alias);
    }
    ..........
    /* Restore to default pvid if this port joined this VLAN in untagged mode previously */
    if (sai_tagging_mode == SAI_VLAN_TAGGING_MODE_UNTAGGED)
    {
        if (!setPortPvid(port, DEFAULT_PORT_VLAN_ID))
        {
            return false;
        }
    }

    m_portList[port.m_alias] = port;
    vlan.m_members.erase(port.m_alias);
    m_portList[vlan.m_alias] = vlan;
    ............
    return true;
}

if setPortPvid() retrun fail(because the port is already a router interface). So the "m_members" and "m_portVlanMember" will be not in sync.

In the next enter of removeVlanMember() with same params.
The iterator "vlan_member" is point to the end, but the assert() doesn't work because NOS is relase version.(if NOS is in debug version, the assert() will trigger abort())
When m_portVlanMember[port.m_alias].erase(vlan_member) erase this end iterator, the c++ std lib will be stuck, occupy CPU 100%.

What I did

ignore the returned value of setPortPvid()
Copy link

linux-foundation-easycla bot commented Sep 20, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Yang-Yongzhi
Copy link
Author

Yang-Yongzhi commented Dec 24, 2024

@prsunny
Hi Prince Sunny,
Could you please have a reviewer on this fix?

@prsunny prsunny requested a review from dgsudharsan January 7, 2025 01:20
@prsunny
Copy link
Collaborator

prsunny commented Jan 7, 2025

Is it because the two commands are executed at the same time? what if there is a delay b/w the two commands?

@Yang-Yongzhi
Copy link
Author

Is it because the two commands are executed at the same time? what if there is a delay b/w the two commands?

@prsunny thanks for your reply.
Is it because the two commands are executed at the same time?
===> actually, the two commands are not executed at the same time. BUT, the delay b/w the two commands is very short.
what if there is a delay b/w the two commands?
===> there is no issue if there is a delay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[orchagent] orch get stuck when removing vlan member
3 participants