Laggy Message Receiving #1012
Replies: 2 comments 4 replies
-
It appears to be related specifically to the flushing of the last couple thousand messages in the streams. When I do nothing more than increase the number of messages from 16,650 to 67,500, the lag no longer appears at ~15,000 messages sent; it appears at ~63,500. The first 63,500 messages take just a couple of seconds, and then the remaining ~2,500 take around a minute. Is there a reason to expect the rate at which already-sent messages are received to slow down once no new messages are being sent? Does a lack of forward pressure alter the rate at which messages are dequeued?
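To make the forward-pressure question concrete, here is a minimal in-process sketch (assuming only tokio and tokio-stream; this is not the repro code): a bounded mpsc channel feeding a `Stream`, the same producer/consumer shape as an outbound queue behind a bidirectional gRPC stream. The producer's sends complete as soon as the buffer accepts them, and the drain rate is set entirely by how fast the consumer is polled, not by whether new messages are still being produced:

```rust
use tokio_stream::{wrappers::ReceiverStream, StreamExt};

#[tokio::main]
async fn main() {
    // Bounded channel feeding a Stream: the same shape as an outbound
    // message queue behind a bidirectional gRPC stream.
    let (tx, rx) = tokio::sync::mpsc::channel::<u32>(1024);

    // Producer: each send completes as soon as the buffer accepts it, so a
    // burst of sends can finish long before anything has been dequeued.
    tokio::spawn(async move {
        for i in 0..16_650u32 {
            if tx.send(i).await.is_err() {
                break;
            }
        }
        // tx is dropped here, which closes the stream below.
    });

    // Consumer: the dequeue rate is set by how fast this loop is polled,
    // not by whether the producer is still sending.
    let mut stream = ReceiverStream::new(rx);
    let mut received = 0u32;
    while let Some(_msg) = stream.next().await {
        received += 1;
    }
    println!("received {received} messages");
}
```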
-
This issue ended up being specific to my use of Docker containers. When I perform exactly the same test on my local machine without containers, there is no lag in messaging at all. I haven't figured out what about the Docker containers is causing the problem, but I am at least unblocked for my performance testing.
-
100% reproducible, self-contained code here. After installing Docker and Docker Compose, just run `./run.sh`.
It creates a mesh network of 75 nodes, each in its own Docker container. They connect to each other over bidirectional gRPC streams. Each node generates three signatures and then sends those signatures to the 74 other nodes, for a total of 75 × 3 × 74 = 16,650 messages. Even in the absence of any computationally expensive work, I'm finding that most (~15,000) messages are received immediately, and then the remaining ~1,600 take ~60 seconds with 0% CPU usage. This is slower than I would expect.
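For reference, here is the shape of the test modeled in-process (a hypothetical sketch, not the repro code: tokio channels stand in for the gRPC streams, and Docker is out of the picture):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    const NODES: usize = 75;
    const SIGS: usize = 3;

    // One inbox per node; unbounded channels stand in for the gRPC streams.
    let (txs, mut rxs): (Vec<_>, Vec<_>) = (0..NODES)
        .map(|_| mpsc::unbounded_channel::<(usize, usize)>())
        .unzip();

    // Each node broadcasts its 3 "signatures" to the other 74 nodes.
    for me in 0..NODES {
        for sig in 0..SIGS {
            for (peer, tx) in txs.iter().enumerate() {
                if peer != me {
                    tx.send((me, sig)).unwrap();
                }
            }
        }
    }
    drop(txs); // close all channels so the receive loops below terminate

    // Drain every inbox and count deliveries.
    let mut total = 0usize;
    for rx in rxs.iter_mut() {
        while rx.recv().await.is_some() {
            total += 1;
        }
    }
    assert_eq!(total, NODES * SIGS * (NODES - 1)); // 75 * 3 * 74 = 16,650
    println!("delivered {total} messages");
}
```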
Within a second or two of starting, all 225 unique messages are created and all 16,650 messages are sent. 15,450 messages are received as soon as they are sent, but the remaining messages then languish for ~60 seconds: sometimes no progress is made for a handful of seconds, and sometimes they trickle in. They all eventually arrive 100% of the time, but I'm confused by this unexpected performance. It's about 63 seconds to receive 16,650 messages that were all sent within the first second, and it's never CPU-bound. I also extensively debugged my lock usage; there are 0 mutexes in the code and 0 RwLock::write locks acquired during this execution.
All 16,650 sends happen at either this line or the one right below it. The first one is received here and the second one here.
The bottleneck is definitely somewhere between sending a message and receiving it, so I thought I'd ask whether it could be tonic-related, since tonic is responsible for the network communication over the bidirectional streams.
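One way to localize the delay to the transport (a hypothetical sketch, not something in the repro; it assumes the containers' clocks are reasonably in sync) is to stamp each message with a wall-clock send time and log the delta on receipt. A delta that grows only for the tail messages would point at queuing between the sender's flush and the receiver's poll rather than at slow send-side or receive-side work:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Stand-in for a protobuf message with a field like `uint64 sent_unix_micros`.
struct Stamped {
    sent_unix_micros: u64,
    payload: Vec<u8>,
}

// Attach the wall-clock send time just before the message goes out.
fn stamp(payload: Vec<u8>) -> Stamped {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
    Stamped { sent_unix_micros: now.as_micros() as u64, payload }
}

// Measure send-to-receive delay in the consumer's receive loop.
// saturating_sub guards against small clock skew between containers.
fn transport_delay(msg: &Stamped) -> Duration {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();
    Duration::from_micros((now.as_micros() as u64).saturating_sub(msg.sent_unix_micros))
}

fn main() {
    let msg = stamp(vec![0u8; 64]);
    println!("delay: {:?} ({} bytes)", transport_delay(&msg), msg.payload.len());
}
```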
I've tested on both an 8-core machine and a 32-core machine; same behavior.
Appreciate any feedback.