relay_v2 performance anomaly on small machines #2619
Replies: 14 comments 3 replies
-
Very much appreciate the detailed bug report @winksaville. Sorry for the late reply. Can you post the logs when you run the relay server with
:D The help is very welcome! |
Beta Was this translation helpful? Give feedback.
-
Below is the log captured after:
Here is doing the
relay_v2.master-22fbce.stable-1.60.0-7737e0.debug.1.rust_log-debug.log.zip |
Beta Was this translation helpful? Give feedback.
-
For good measure here is a trace log, again taken after booting:
The lookup:
relay_v2.master-22fbce.stable-1.60.0-7737e0.debug.1.rust_log-trace.log.zip |
Beta Was this translation helpful? Give feedback.
-
One other tidbit, which I'm sure you know, but in the offhand case you don't. The timeout occurs 20secs after starting. I added the UTC time to my prompt so if you look at the prompt after the timeout and subtract 20secs that will be the approximate time when the initial request from my desktop,
If you take
There is more information in the RUST_LOG=trace run, so here is that information. And we see a 6 second gap (weird I was so consistent) between lines 472 and 473:
|
Beta Was this translation helpful? Give feedback.
-
One other observation, looking at your libp2p-perf benchmark, I think there is something wrong when go-libp2p is twice as fast as rustlibp2p. Has the reason for this been investigated? |
Beta Was this translation helpful? Give feedback.
-
@mxinden , Ping to bump this up in your inbox :) |
Beta Was this translation helpful? Give feedback.
-
Sorry for the late reply here. I haven't found time to debug this yet.
Note that the numbers on the libp2p-perf page are based on a run on localhost only. I would question whether they have much value beyond comparing concrete changes. I encourage folks to run libp2p-perf between two machines with reasonable network delays. This is not to say that rust-libp2p is slower or faster than go-libp2p. Cross referencing #2601 here which might be related. My first suspicion is that we are missing a wake up somewhere. In case you do want to investigate further, would you mind testing with https://github.com/mxinden/rust-libp2p-server/ as a relay server? |
Beta Was this translation helpful? Give feedback.
-
I understand this doesn't have high priority. I'm willing to do experiments/investigations so if you can spend a small amount of time reviewing results and suggesting additional E/I's I think we can make progress. It does seem we're missing wake up's. In one experiment I saw a wake up happen shortly after the 20sec timeout, but since the initiator had timed out it was to late and the connection couldn't be established. Anyway, I didn't get a chance to run your rust-libp2p-server today as I spent the day trying to learn more about how the server polls, didn't make much progress but have a little better feel. I'll try your server tomorrow. |
Beta Was this translation helpful? Give feedback.
-
Your server has a Timeout error too:
I forked your repo and using master:
Built it with stable 1.60.0:
And did a debug build:
I then copied it to the small server on digital ocean:
Here is how I ran it with RUST_LOG=trace piping the output to a file:
Here is the log, I've not even looked at it yet, but it behaves exactly like the example/relay_v2 server. mxinden-libp2p-server.master-f88de4.stable-1.60.0-7737e0.debug.1.rust_log-trace.log.zip |
Beta Was this translation helpful? Give feedback.
-
My initial hypothesis was that the epoll code was on the main thread and should be moved to a separate high priority thread. That was wrong. It is on it's own thread, which I've determined by outputing the current thread id, although probably not high priority, but I haven't checked:
And here is the log where you can see the
I still "feel" there is too much polling, but I think its because the information epoll it can provide isn't being used. It seems to me the server is using epoll simply as a "signal" that something is ready and then server then starts polling everything to find out what that was. This is just a feeling/guess from my interpretation of what I see in the extra verbose logs I've created and could be totally wrong. |
Beta Was this translation helpful? Give feedback.
-
A couple of notes, again thanks for the follow-ups.
|
Beta Was this translation helpful? Give feedback.
-
Based on the suggestion at the Community call today, I tested a version of relay_v2 with Tokio. I ran relay_v2-tokio with Here are the quick and dirty changes I made as suggested in the call: This further suggests the bug is in It was great seeing and talking with everyone today, TXS! |
Beta Was this translation helpful? Give feedback.
-
Quick update, I've created issue Missing waker.wake events #78 and PR #79 on async-io. If anyone wants to experiment with that async-io version with libp2p I've created this branch on my fork which simply adds a
From my testing this fixes the timeout issue I was seeing. |
Beta Was this translation helpful? Give feedback.
-
Using Tokio is good enough for me. But if I'm correct, IMHO, async-io should be removed as a dependency from libp2p. @mxinden, thoughts? |
Beta Was this translation helpful? Give feedback.
-
I’ve had trouble getting the hole_punching working. Specifically, I run
libp2p-lookup direct --address /ip4/$RELAY_SERVER_IP/tcp/4001
in step 3 of the “Setting up the relay server” section it fails sometimes.The first time I run
relay_v2
and then usedlibp2p-lookup
it works. But, most of the time, if I stoprelay_v2
and restart it and then trylibp2p-lookup
again I get a Lookup failed: Timeout error.Here are the steps I use to reproduce the problem.
I built
libp2p
master on a ubuntu 20.04 virtualbox:Copied it to the smallest cheapest Digital Ocean single CPU VM which is named
do-sfo3-01
rebooted do-sfo3-01
log back in:
Run relay_v2:
And then from my desktop,
3900x
, I usedlibp2p-lookup
to accessrelay_v2
and all is well:I then used wireshark on the desktop to grab the contents on the wire. Everything looks fine!
I then stop
relay_v2
using Ctrl-C and run it again:I then run
libp2p-lookup
a second time from my desktop, but this time we see Lookup failed: Timeout:Here is the wireshark screenshot:
And here is the
.zip
pcapng file:wireshark-timeout-libp2p-relay_v2.all.pcapng.zip
My interpretation of packet 32 is that
do-sfo3-01
received the packet and that there is a performance issue withrelay_v2
that it could not "accept" the packet. Further more, the release version with excessivetrace!
logging can fail the same way, although typically it does work.Anyway, with the debugging I've done it seems to me there is too much polling being done on the main thread. My suggested fix is that the epoll code should be on high priority thread all by itself. Of course, this is the opinion of a retired programmer with no-expertise with
libp2p
:)Beta Was this translation helpful? Give feedback.
All reactions