ROUTER/DEALER proxy adds 130x overhead on Windows #2328

Closed
jakecobb opened this issue Jan 30, 2017 · 12 comments

@jakecobb
Contributor

I have the following setup and see the problem under both ZeroMQ 4.1.5 and 4.2.0. Remote clients connect REQ sockets over TCP to a server process that binds a ROUTER socket to the TCP address and uses zmq_proxy to forward messages to a DEALER socket bound to an inproc address. Several worker threads connect REP sockets to that inproc address.

On Windows, this adds a huge overhead to every request-reply round trip; on OS X the overhead is much smaller. To demonstrate the problem, I created local_lat_rdr.cpp, a modified version of local_lat. It is compatible with remote_lat and by default uses just one worker thread. Using more than one worker thread does not affect performance; the overhead seems to come purely from the ROUTER/DEALER proxy.

Here is the remote_lat output for the two cases on Windows:

>rem Server side: .\local_lat.exe tcp://127.0.0.1:9999 1 1000
>.\remote_lat.exe tcp://127.0.0.1:9999 1 1000
message size: 1 [B]
roundtrip count: 1000
average latency: 229.400 [us]

>rem Server side: .\local_lat_rdr.exe tcp://127.0.0.1:9999 1 1000
>.\remote_lat.exe tcp://127.0.0.1:9999 1 1000
message size: 1 [B]
roundtrip count: 1000
average latency: 30068.248 [us]

There is some variation in the numbers, but the overhead is always about 130X. Compare the same code running on OS X:

$ # Server side: ./local_lat tcp://127.0.0.1:9999 1 1000
$ ./remote_lat tcp://127.0.0.1:9999 1 1000
message size: 1 [B]
roundtrip count: 1000
average latency: 29.328 [us]

$ # Server side: ./local_lat_rdr tcp://127.0.0.1:9999 1 1000
$ ./remote_lat tcp://127.0.0.1:9999 1 1000
message size: 1 [B]
roundtrip count: 1000
average latency: 153.570 [us]

So on OS X I'm seeing about 5X overhead. That stays in sub-millisecond territory, so it isn't really noticeable to the client, but on Windows the proxy adds 30 milliseconds to every request, which is very noticeable to the client.
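The overhead factors quoted above are just the proxied average latency divided by the direct one. The C++ sketch below reproduces that arithmetic from the reported remote_lat averages; `LatencyRun`, `overhead_factor` and `added_ms` are hypothetical helper names for illustration, not part of libzmq or its perf tools.

```cpp
#include <cassert>

// Average round-trip latencies (microseconds) as reported by remote_lat above.
struct LatencyRun {
    double direct_us;   // server side: local_lat (no proxy in the path)
    double proxied_us;  // server side: local_lat_rdr (ROUTER -> zmq_proxy -> DEALER -> REP worker)
};

// How many times slower the proxied round trip is than the direct one.
inline double overhead_factor(const LatencyRun& run) {
    assert(run.direct_us > 0.0);
    return run.proxied_us / run.direct_us;
}

// Extra time the proxy adds to each round trip, converted to milliseconds.
inline double added_ms(const LatencyRun& run) {
    return (run.proxied_us - run.direct_us) / 1000.0;
}

// Numbers copied from the two benchmark runs above.
const LatencyRun kWindows{229.400, 30068.248};
const LatencyRun kOsx{29.328, 153.570};
```

For the Windows run this gives roughly 131X and about 29.8 ms added per round trip; for OS X it gives roughly 5.2X and about 0.12 ms, which matches the "130X" and "5X" figures in the report.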

@bjovke
Contributor

bjovke commented Apr 10, 2017

@jakecobb Can you try the test with the latest master branch? There were some changes regarding this.

@jakecobb
Contributor Author

Excellent, it appears to be fixed. Here are the numbers on master (at commit 6480721):

>rem Server side: local_lat.exe tcp://127.0.0.1:6666 1 1000
>remote_lat tcp://127.0.0.1:6666 1 1000
message size: 1 [B]
roundtrip count: 1000
average latency: 177.448 [us]

>rem Server side: local_lat_rdr.exe tcp://127.0.0.1:6666 1 1000
>remote_lat tcp://127.0.0.1:6666 1 1000
message size: 1 [B]
roundtrip count: 1000
average latency: 209.502 [us]

Fixed by #2518.
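To put the master numbers in perspective, the sketch below compares them with each other and with the earlier proxied run on Windows. The constant names and the `ratio` helper are hypothetical; the latencies are copied from the outputs in this thread. Note that the "before" run used port 9999 and a 4.1.5/4.2.0 build, so this is only a rough comparison.

```cpp
#include <cassert>

// Average round-trip latencies (microseconds) reported by remote_lat on Windows.
const double kDirectMasterUs  = 177.448;    // local_lat, master at 6480721
const double kProxiedMasterUs = 209.502;    // local_lat_rdr, master at 6480721
const double kProxiedBeforeUs = 30068.248;  // local_lat_rdr, before the fix

// Ratio of two latencies: a result > 1 means slow_us is that many times slower.
inline double ratio(double slow_us, double fast_us) {
    assert(fast_us > 0.0);
    return slow_us / fast_us;
}
```

On master the proxy costs about 1.18X (roughly 32 us per round trip) instead of about 131X, and the proxied round trip itself is roughly 143X faster than before.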

@bjovke
Contributor

bjovke commented Apr 10, 2017

@jakecobb Hello. There's a critical issue introduced by #2518; it is corrected by #2523. Please fetch the master branch again once #2523 is merged.
Thank you.

@jakecobb
Contributor Author

OK, will do. By the way, OS X performance has also improved: the proxy now adds less than 2X overhead instead of 5X.

@bjovke
Contributor

bjovke commented Apr 10, 2017

If you were using zmq_proxy() or zmq_proxy_steerable() with the DRAFT API, the first part of this PR improved that path: the proxy was rewritten so that it doesn't use the zmq_poll() code. The zmq_poll() changes affect only Windows.

@jakecobb
Contributor Author

I'm not using DRAFT. Doesn't the allocation change from 319eb27 affect all platforms?

@bjovke
Contributor

bjovke commented Apr 10, 2017

I only know that it is automatically switched on for Windows; I haven't checked other platforms.

@bjovke
Contributor

bjovke commented Apr 10, 2017

Yes, you're right: #2494 does affect all platforms. It no longer allocates and frees fd_set with new/delete, so there is certainly an improvement; I just didn't measure how much.
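The general idea of avoiding per-call new/delete can be illustrated with a small-buffer container: storage for up to S elements lives inline (on the stack when the object is a local variable) and the heap is used only when more is requested. This is a generic sketch of that technique under that assumption, not libzmq's actual code; `SmallBuffer` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical helper: keeps up to S elements in an inline buffer and falls
// back to a single heap allocation only when more than S are requested.
template <typename T, std::size_t S>
class SmallBuffer {
public:
    explicit SmallBuffer(std::size_t n) : size_(n) {
        if (n > S) {
            heap_.reset(new T[n]());  // rare path: one heap allocation
            data_ = heap_.get();
        } else {
            data_ = inline_;          // common path: no new/delete at all
        }
    }
    SmallBuffer(const SmallBuffer&) = delete;
    SmallBuffer& operator=(const SmallBuffer&) = delete;

    T& operator[](std::size_t i) { assert(i < size_); return data_[i]; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return heap_ != nullptr; }

private:
    T inline_[S] = {};             // sizeof(T) * S bytes, always reserved
    std::unique_ptr<T[]> heap_;    // empty unless n > S
    T* data_;
    std::size_t size_;
};
```

The inline buffer costs sizeof(T) * S bytes of stack whenever such an object exists, even if only a few elements are used. With large element types such as Windows fd_set structures, that is the stack-for-speed trade-off the thread goes on to discuss.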

@bjovke
Contributor

bjovke commented Apr 10, 2017

Also, with #2494, applications on Windows that use a lot of stack could run into a problem.
An additional 192 kB is now allocated on the stack while this function runs, and with the default 1 MB stack on Windows, an application that starts crashing may need its stack size increased and a recompile. However, this memory is held only briefly, while zmq_poll() runs.

@bjovke
Contributor

bjovke commented Apr 10, 2017

But here some stack is sacrificed for speed, and I think the default stack size is larger on platforms other than Windows.

@jakecobb
Copy link
Contributor Author

I've explicitly unset ENABLE_DRAFT in CMake, so the numbers I show here are all without DRAFT. ENABLE_EVENTFD is set for these tests.

@bjovke
Contributor

bjovke commented Apr 10, 2017

So the improvement you see is definitely from #2494 (319eb27). Thank you for the information.
