Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(libp2p): shared TCP listeners and AutoTLS.AutoWSS #10565

Merged
merged 30 commits into from
Dec 20, 2024

Conversation

aschmahmann
Copy link
Contributor

@aschmahmann aschmahmann commented Oct 30, 2024

Part of #10560, relies on libp2p/go-libp2p#2984

lidel's TODOs

@aschmahmann aschmahmann requested a review from a team as a code owner October 30, 2024 20:59
@aschmahmann aschmahmann force-pushed the feat/libp2p-sharedtcp branch from 8cd0e01 to 215ae53 Compare October 30, 2024 21:23
@aschmahmann aschmahmann force-pushed the feat/libp2p-sharedtcp branch 2 times, most recently from 702101e to 60d8d5c Compare November 5, 2024 18:19
@aschmahmann aschmahmann force-pushed the feat/libp2p-sharedtcp branch from 60d8d5c to 7833e64 Compare November 12, 2024 02:56
@aschmahmann aschmahmann force-pushed the feat/libp2p-sharedtcp branch from f9c50bd to 590bed0 Compare November 12, 2024 04:55
@lidel lidel mentioned this pull request Nov 29, 2024
54 tasks
@lidel lidel force-pushed the feat/libp2p-sharedtcp branch from 0eb8455 to 10eed69 Compare December 3, 2024 22:55
@lidel lidel force-pushed the feat/libp2p-sharedtcp branch from 10eed69 to 1831a3d Compare December 3, 2024 23:14
go.mod Outdated
@@ -274,3 +274,5 @@ require (
gopkg.in/yaml.v3 v3.0.1 // indirect
lukechampine.com/blake3 v1.3.0 // indirect
)

replace github.com/libp2p/go-libp2p => github.com/libp2p/go-libp2p v0.37.1-0.20241202220543-9024f8e8c86e
Copy link
Member

@lidel lidel Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got this to the state that is as close to go-libp2p release as possible, filling this here so I won't forget:

  • remove from all go.mod files and switch to go-libp2p release

This adds AutoTLS.AutoWSS flag that is set to true by default.

It will check if Addresses.Swarm contain explicit /ws listener,
and if not found, it will append one per every /tcp listener

This way existing TCP ports are reused without any extra configuration,
but we don't break user's who have custom / explicit /ws listener
already.

I also moved logger around, to include Addresses.Swarm inspection
results in `autotls` logger.
@lidel lidel changed the title feat(libp2p): enable shared TCP listeners feat(libp2p): shared TCP listeners and AutoTLS.AutoWSS Dec 4, 2024
@lidel
Copy link
Member

lidel commented Dec 4, 2024

Added AutoTLS.AutoWSS flag in 9b887ea.

It simplifies things by removing the need for manually touching Addresses.Swarm.
If you don't have any /ws listener, Kubo will add one to every /tcp port you have, without any extra steps (AutoTLS.AutoWSS=true is the implicit default, and works only if no /ws is found).

I plan to ship this in RC1 as soon we have a new go-libp2p release,
but if you can give it a try in spare time, early feedback is appreciated.

core/node/libp2p/transport.go Outdated Show resolved Hide resolved
@lidel lidel marked this pull request as draft December 4, 2024 13:40
go.mod Outdated
@@ -274,3 +274,5 @@ require (
gopkg.in/yaml.v3 v3.0.1 // indirect
lukechampine.com/blake3 v1.3.0 // indirect
)

replace github.com/libp2p/go-libp2p => github.com/libp2p/go-libp2p v0.37.1-0.20241202220543-9024f8e8c86e
Copy link
Member

@lidel lidel Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be a regression between 9b887ea (this PR) and 433444b (latest master) caused by TCP reuse logic in go-libp2p bumped here.

I did curl localhost:5001/debug/pprof/profile > ipfs.cpuprof and CPU profile seems to confirm the source is TCP reuse from libp2p/go-libp2p#2984? (cc @aschmahmann @MarcoPolo):

cpu

How to reproduce

  • Start ipfs daemon
  • Open http://127.0.0.1:5001/webui#/peers and wait a minute:
    • this will trigger connections to peers to fetch any missing block get for geoip data (if not cached locally) – acting like accelerator for any underlying bugs in TCP port reuse
    • 💢 staging-2024-12-04-9b887e (go-libp2p master v0.37.1-0.20241202220543-9024f8e8c86e) will grow CPU use once ipfs-webui Peers screen is opened. CPU peaks within 1minute.
    • 💚 master-2024-12-03-433444b (go-libp2p v0.37.2) does not trigger high CPU load

Here are docker images for repro convenience:

  • docker run --rm -it --net=host ipfs/kubo:staging-2024-12-04-9b887ea
  • docker run --rm -it --net=host ipfs/kubo:master-2024-12-03-433444b

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libp2p/go-libp2p#3080 seems to fix the CPU spin.

Tested in 2adb2f1 – need to investigate why CI fails.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am debugging sharness with make -O -j 1 because CI runs with -j 10 and logs are mangled.

Copy link
Member

@lidel lidel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting bizzare. Tests with the same go-libp2p and boxo version pass in #10631.

The only go.mod diff between a436f4e (this PR, ci failing due to timeout after 20m) and 070e6ae (#10631, ci green, sharness run takes 5minutes), is:

-       github.com/ipshipyard/p2p-forge v0.1.0
+       github.com/ipshipyard/p2p-forge v0.2.0

other than that, we don't change anything in default config other than enabling shared tcp.
Will disable (b5cfd6d) and re-run, to see if that causes issue somewhere.

Copy link
Member

@lidel lidel Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 👉 Sharness passes green if we disable libp2p.ShareTCPListener() (b5cfd6d passed in job here).

Something related to libp2p.ShareTCPListener() makes sharness hang and timeout if run in 10 parallel threads (cd test/sharness and time make -O -j 10).

Copy link
Member

@lidel lidel Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking for the cause, but I was able to reproduce locally by using the same settings as CI:

$ export TEST_DOCKER=1
$ export TEST_PLUGIN=0
$ export TEST_FUSE=0
$ export TEST_VERBOSE=1
$ export TEST_JUNIT=1
$ export TEST_EXPENSIVE=1
$ export IPFS_CHECK_RCMGR_DEFAULTS=1
$ export CONTINUE_ON_S_FAILURE=1
$ time make -O -j 10 test_sharness coverage/sharness_tests.coverprofile test/sharness/test-results/sharness.xml

# [...]
ok 1717 - add a few entries to big_dir/ to retrigger sharding

expecting success:
    kill -0 $IPFS_PID

ok 1718 - 'ipfs daemon' is still running

expecting success:
    test_kill_repeat_10_sec $IPFS_PID

ok 1719 - 'ipfs daemon' can be killed

# passed all 1719 test(s)
1..1719

# hangs..

It hanged for a few minutes and finished eventually, but took 3x longer than regular run (15m vs 5m).
Re-run and it hanged for 20m and i had to kill it, so there is some factor of randomness to the hang.

Looking at process tree spawned by make, the t0250-files-api.sh and t0260-sharding.sh t0181-private-network.sh t0182-circuit-relay.sh and t0320-pubsub.sh seem to hang the longest in semi-random order.

Will resume tomorrow unless someone else has any idea.

In the meantime, i've re-run https://github.com/ipfs/kubo/actions/runs/12420930073 to double-confirm disabling libp2p.ShareTCPListener() fixes sharness.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run two tests to narrow down the search:

  • 🟢 7a4ec80 disabled libp2p.ShareTCPListener() and sharness finished in 8m successfully (log)
  • 🔴 858e10a enabled libp2p.ShareTCPListener() and sharness timeouted after 20m (log)

Maybe increasing timeout would help, but taking 2x as long feels like a bug.

Copy link
Member

@lidel lidel Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, running sequentially (#10565 (comment)) found the problem:

  • Enabling libp2p.ShareTCPListener() break PNET and ./t0181-private-network.sh hangs – kinda makes sense, PNET is TCP-only feature of libp2p and we mess up with TCP.

I'll now confirm this is the only one failing,
and if other tests pass, will disable TCP sharing when PNET is enabled for now.

I'll also see if we can set timeout per test suite / unit, to avoid this debugging horror in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news, PNET was the only reason sharness failed. Disabled port sharing when PNET is enabled (bb87df3) and CI is green again:

I'm going forward with Kubo 0.33.0-rc1.

gammazero and others added 4 commits December 19, 2024 14:16
once again, after master changes, for a good measure

(this commit should make sharness pass)
this re-enabled TCP muxing

(if sharness fails due to timeout, we have bug related to
libp2p.ShareTCPListener() somehow)
hoping to reproduce circuit-relay.sh repo.lock error found locally
@lidel lidel marked this pull request as ready for review December 20, 2024 17:35
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharness issues are resolved (#10565 (comment)) and docs are updated to include LIBP2P_TCP_MUX=false as an escape hatch for people playing with this feature in Kubo 0.33.0-rc1, including our staging/cluster tests.

I'm going to merge as soon CI finishes and start Kubo 0.33.0-rc1 dance.
Will post progress in #10580 (comment)

@lidel lidel merged commit 397c346 into master Dec 20, 2024
15 checks passed
@lidel lidel deleted the feat/libp2p-sharedtcp branch December 20, 2024 17:41
@lidel lidel mentioned this pull request Jan 9, 2025
17 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants