Laggy Message Receiving #1012
Replies: 2 comments 4 replies
-
It appears to be related specifically to the flushing of the last couple thousand messages in the streams. When I do nothing more than increase the number of messages from 16,650 to 67,500, the lag no longer appears at ~15,000 messages sent; it appears at ~63,500. The first 63,500 messages take just a couple of seconds, and then the remaining ~2,500 take around a minute. Is there a reason to expect the rate at which already-sent messages are received to slow down once no new messages are being sent? Does a lack of forward pressure alter the rate at which messages are dequeued?
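To make the forward-pressure question concrete, here is a minimal in-process sketch (assuming only tokio and tokio-stream; this is not the repro code): a bounded mpsc channel feeding a `Stream`, the same producer/consumer shape as an outbound queue behind a bidirectional gRPC stream. The producer's sends complete as soon as the buffer accepts them, and the drain rate is set entirely by how fast the consumer is polled, not by whether new messages are still being produced:

```rust
use tokio_stream::{wrappers::ReceiverStream, StreamExt};

#[tokio::main]
async fn main() {
    // Bounded channel feeding a Stream: the same shape as an outbound
    // message queue behind a bidirectional gRPC stream.
    let (tx, rx) = tokio::sync::mpsc::channel::<u32>(1024);

    // Producer: each send completes as soon as the buffer accepts it, so a
    // burst of sends can finish long before anything has been dequeued.
    tokio::spawn(async move {
        for i in 0..16_650u32 {
            if tx.send(i).await.is_err() {
                break;
            }
        }
        // tx is dropped here, which closes the stream below.
    });

    // Consumer: the dequeue rate is set by how fast this loop is polled,
    // not by whether the producer is still sending.
    let mut stream = ReceiverStream::new(rx);
    let mut received = 0u32;
    while let Some(_msg) = stream.next().await {
        received += 1;
    }
    println!("received {received} messages");
}
```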
-
This issue ended up being specific to my use of Docker containers. When I perform exactly the same test on my local machine without containers, there is no lag in messaging at all. I haven't figured out what about the Docker containers is causing the problem, but I am at least unblocked for my performance testing.
-
100% reproducible, self-contained code here. After installing Docker and Docker Compose, just run `./run.sh`.
It creates a mesh network of 75 nodes, each in its own Docker container. They connect to each other over bidirectional gRPC streams. Each node generates three signatures and then sends those signatures to the 74 other nodes, for a total of 75 × 3 × 74 = 16,650 messages. Even in the absence of any computationally expensive work, I'm finding that most (~15,000) messages are received immediately, and then the remaining ~1,600 take ~60 seconds with 0% CPU usage. This is slower than I would expect.
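For reference, here is the shape of the test modeled in-process (a hypothetical sketch, not the repro code: tokio channels stand in for the gRPC streams, and Docker is out of the picture):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    const NODES: usize = 75;
    const SIGS: usize = 3;

    // One inbox per node; unbounded channels stand in for the gRPC streams.
    let (txs, mut rxs): (Vec<_>, Vec<_>) = (0..NODES)
        .map(|_| mpsc::unbounded_channel::<(usize, usize)>())
        .unzip();

    // Each node broadcasts its 3 "signatures" to the other 74 nodes.
    for me in 0..NODES {
        for sig in 0..SIGS {
            for (peer, tx) in txs.iter().enumerate() {
                if peer != me {
                    tx.send((me, sig)).unwrap();
                }
            }
        }
    }
    drop(txs); // close all channels so the receive loops below terminate

    // Drain every inbox and count deliveries.
    let mut total = 0usize;
    for rx in rxs.iter_mut() {
        while rx.recv().await.is_some() {
            total += 1;
        }
    }
    assert_eq!(total, NODES * SIGS * (NODES - 1)); // 75 * 3 * 74 = 16,650
    println!("delivered {total} messages");
}
```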
Within a second or two of starting, all 225 unique messages are created and all 16,650 messages are sent. 15,450 messages are received as soon as they are sent, but the remaining messages then languish for ~60 seconds: sometimes no progress is made for a handful of seconds, and sometimes they trickle in. They all eventually arrive 100% of the time, but I'm confused by this unexpected performance. It's about 63 seconds to receive 16,650 messages that were all sent within the first second, and it's never CPU-bound. I also extensively debugged my lock usage; there are 0 mutexes in the code and 0 RwLock::write locks acquired during this execution.
All 16,650 sends happen at either this line or the one right below it. The first one is received here and the second one here.
The bottleneck is definitely somewhere between sending a message and receiving it, so I thought I'd ask whether it could be tonic-related, since tonic is responsible for the network communication over the bidirectional streams.
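One way to localize the delay to the transport (a hypothetical sketch, not something in the repro; it assumes the containers' clocks are reasonably in sync) is to stamp each message with a wall-clock send time and log the delta on receipt. A delta that grows only for the tail messages would point at queuing between the sender's flush and the receiver's poll rather than at slow send-side or receive-side work:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Stand-in for a protobuf message with a field like `uint64 sent_unix_micros`.
struct Stamped {
    sent_unix_micros: u64,
    payload: Vec<u8>,
}

// Attach the wall-clock send time just before the message goes out.
fn stamp(payload: Vec<u8>) -> Stamped {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
    Stamped { sent_unix_micros: now.as_micros() as u64, payload }
}

// Measure send-to-receive delay in the consumer's receive loop.
// saturating_sub guards against small clock skew between containers.
fn transport_delay(msg: &Stamped) -> Duration {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
    Duration::from_micros((now.as_micros() as u64).saturating_sub(msg.sent_unix_micros))
}

fn main() {
    let msg = stamp(vec![0u8; 64]);
    println!("delay: {:?} ({} bytes)", transport_delay(&msg), msg.payload.len());
}
```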
I've tested on both an 8-core machine and a 32-core machine; same behavior.
Appreciate any feedback.