Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

portMap: use nft sets instead of linear rules #1139

Open
aroradaman opened this issue Jan 15, 2025 · 3 comments
Open

portMap: use nft sets instead of linear rules #1139

aroradaman opened this issue Jan 15, 2025 · 3 comments

Comments

@aroradaman
Copy link

aroradaman commented Jan 15, 2025

The portmap plugins (nftables mode) adds a rule for each portmap/DNAT, with nftables we can benefit from verdict maps which will be an efficient lookup compared to linear match on each portmap rule.

if useHostIP {
tx.Add(&knftables.Rule{
Chain: hostIPHostPortsChain,
Rule: knftables.Concat(
ipX, "daddr", e.HostIP,
e.Protocol, "dport", e.HostPort,
"dnat to", net.JoinHostPort(containerNet.IP.String(), strconv.Itoa(e.ContainerPort)),
),
Comment: &config.ContainerID,
})
} else {
tx.Add(&knftables.Rule{
Chain: hostPortsChain,
Rule: knftables.Concat(
e.Protocol, "dport", e.HostPort,
"dnat to", net.JoinHostPort(containerNet.IP.String(), strconv.Itoa(e.ContainerPort)),
),
Comment: &config.ContainerID,
})
}

@aroradaman
Copy link
Author

cc @danwinship

@danwinship
Copy link
Contributor

The comment at the top of portmap_nftables.go explains why it doesn't do this:

// The nftables portmap implementation is fairly similar to the iptables implementation:
// we add a rule for each mapping, with a comment containing a hash of the container ID,
// so that we can later reliably delete the rules we want. (This is important because in
// edge cases, it's possible the plugin might see "ADD container A with IP 192.168.1.3",
// followed by "ADD container B with IP 192.168.1.3" followed by "DEL container A with IP
// 192.168.1.3", and we need to make sure that the DEL causes us to delete the rule for
// container A, and not the rule for container B.) The iptables implementation actually
// uses a separate chain per container but there's not really any need for that...
//
// As with pkg/ip/ipmasq_nftables_linux.go, it would be more nftables-y to have a chain
// with a single rule doing a lookup against a map with an element per mapping, rather
// than having a chain with a rule per mapping. But there's no easy, non-racy way to say
// "delete the element 192.168.1.3 from the map, but only if it was added for container A,
// not if it was added for container B".

(The apparently-out-of-order ADD/DEL can happen when you have a chain of plugins, because if a plugin in the chain returns an error, then the DEL is aborted and tried again later. So you can have a situation where the IPAM plugin successfully processes the "DEL A", but then the operation is aborted before the portmap plugin receives the DEL request. Then before the runtime retries the "DEL A", it processes the "ADD B", and at that point since the IPAM plugin already processed "DEL A", it considers A's old IP to be free, and so might reuse it for B. So then the first time the portmap plugin sees the "DEL A" request would be after it had already processed the "ADD B" request.)

It's possible there's some way around this with sets/maps but I couldn't think of one...

@danwinship
Copy link
Contributor

It's possible there's some way around this with sets/maps but I couldn't think of one...

(that wasn't more work than it was worth)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants