Replies: 13 comments 5 replies
-
What would be the point of waiting to collect a bunch of cqes? In the time wasted you could have finished those cqes and created new ones. If you can, keep a counter. The trick would be to use multiple small functions depending on the usage, vs one general function.
-
Just to reduce wakeups, and thus lower CPU usage.
Right for send()/write(), though it's not very effective for recv()/read(), which can be unpredictable. E.g. a read of a tun/pipe and a recv of a network socket could both arrive within a short timespan (say 1ms), so batching would help here.
As you mentioned, a counter of send()/write() operations can be used.
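As a rough sketch of that counting idea in C with liburing (the `nr_sends`/`recv_pending` names are mine, just for illustration):

```c
#include <liburing.h>

/* Sketch: size the wait by the number of short, predictable ops in
 * flight (sends/writes), plus one if a long op (recv/read) is pending.
 * `nr_sends` and `recv_pending` are illustrative names, not from the
 * thread. */
static void submit_and_reap(struct io_uring *ring, unsigned nr_sends,
                            int recv_pending)
{
    unsigned nr_wait = nr_sends + (recv_pending ? 1 : 0);

    /* single syscall: submit the sqes and sleep until nr_wait cqes */
    io_uring_submit_and_wait(ring, nr_wait);
}
```

The obvious caveat is the recv side: a recv can complete at any time, so the count really only works for sends/writes, as noted above.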
-
Currently the code below is how I am writing the event manager (limited to the language I am using). I am not sure what you mean by "might wake up twice"! I don't see it happening, unless it's something that io_uring/the kernel does, which is out of my control.

```python
while counter := (io_uring_submit(ring) + counter - cq_ready):
    # get a count of how many event(s) are ready and fill `cqe`
    while not (cq_ready := io_uring_peek_batch_cqe(ring, cqe, counter)):
        # wait for at least `1` event to be ready
        io_uring_wait_cqe(ring, cqe)
    for i in range(cq_ready):
        ...  # do stuff with `cqe[i]`
    io_uring_cq_advance(ring, cq_ready)  # free seen entries
```

Depending on the language you are using, you can improve on the above code to get better results.
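For instance, the same drain pattern in C with liburing's helpers might look like this (a sketch, not a drop-in equivalent of the code above):

```c
#include <liburing.h>

/* Sketch: submit, block until at least one completion, then drain
 * everything that is ready in a single pass. */
static void event_loop_tick(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    unsigned head, seen = 0;

    io_uring_submit(ring);
    io_uring_wait_cqe(ring, &cqe);        /* wait for at least 1 event */

    io_uring_for_each_cqe(ring, head, cqe) {
        /* do stuff with cqe->res / io_uring_cqe_get_data(cqe) */
        seen++;
    }
    io_uring_cq_advance(ring, seen);      /* free seen entries */
}
```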
-
During idle … Now if …
-
And I have another idea: could it be possible to mask certain successful cqes (e.g. send/write), so that they will not wake up the waiter?
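For what it's worth, something close to this exists in newer kernels, if I'm not mistaken: the `IOSQE_CQE_SKIP_SUCCESS` sqe flag (added around 5.17) suppresses the cqe when a request completes successfully. A minimal sketch, assuming `ring`, `fd`, `buf` and `len` are set up elsewhere:

```c
#include <liburing.h>

/* Sketch: suppress cqes for sends that succeed, so only failures
 * (and other request types) wake the waiter. ring/fd/buf/len are
 * assumed to exist elsewhere. */
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
io_uring_prep_send(sqe, fd, buf, len, 0);
io_uring_sqe_set_flags(sqe, IOSQE_CQE_SKIP_SUCCESS); /* no cqe on success */
io_uring_submit(ring);
```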
-
Anytime you use anything with … I get that you want to wait 1ms to collect cqes by using … In my example, correction: I use … I can't really use …
You would have to talk to @isilence or @axboe about that, and your other ideas.
-
It's a perfectly valid question to ask; CQ batching is important for performance. The pure number of syscalls is not representative, it's more interesting what they do, i.e. whether there was real waiting and how many times the task was actually scheduled out/in. When there are long operations, like recv, then it's indeed pretty hard to batch, and it usually degrades to:

```c
nr_wait = nr_sends;
if (nr_recv != 0)
    nr_wait++;
io_uring_wait(nr_cqe_to_wait);
```

If you have submit separated from wait:

```c
submit();
while (peek) { handle(cqe); }; // or for_each_cqe();
io_uring_wait(1); // wait receives and other long ops
```

Another approach we're trying is to specify the minimum time it's supposed to wait. To give an idea:

```c
submit();
...
sleep(min_time_to_wait);
io_uring_wait(1);
```

I think Jens had patches to make it a bit more controllable and efficient, where that sleep() is moved inside the io_uring syscall. It was discussed before, and it works, but the folks who were thinking about how to use it in production systems are not super excited, though. I hope we'll have a more generic solution.
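Spelled out with real liburing calls, that last idea might look like the sketch below; `min_wait` is just an illustrative knob, and the patches mentioned above would move the sleep into the io_uring syscall itself:

```c
#include <liburing.h>
#include <time.h>

/* Sketch of the "wait at least a minimum time" idea: submit, sleep so
 * completions can accumulate, then take one wakeup for the whole batch. */
static void submit_and_reap_batched(struct io_uring *ring,
                                    const struct timespec *min_wait)
{
    struct io_uring_cqe *cqe;
    unsigned head, seen = 0;

    io_uring_submit(ring);
    nanosleep(min_wait, NULL);         /* let cqes pile up */
    io_uring_wait_cqe(ring, &cqe);     /* one wakeup for the batch */

    io_uring_for_each_cqe(ring, head, cqe)
        seen++;                        /* handle each cqe here */
    io_uring_cq_advance(ring, seen);
}
```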
-
Moving wait out of the kernel and into … After you start to batch cqes on the io_uring side, you will soon want a priority/heap queue, since you will want to process certain tasks sooner than others. Then, if you have 100k+ entries to go through, you now have connection/timeout/limit issues to deal with, ...
-
Be a little careful.
Wish you good luck! Reference: …
-
It would be less efficient if some sends were punted to io-wq or delayed, right?
I suppose the timestamp is based on the submitter, rather than on the first cqe. Though couldn't a NAPI/IRQ-defer style probably be more efficient? It wouldn't be so hard for io_uring to init a timeout request inside the kernel once the first cqe arrives.
Most output operations can return immediately, so they probably think masking is not conspicuous. But for a large chunk of sends flooding small socket buffers, wouldn't it be a lengthy pain to waste time? Anyway, I believe both solutions mentioned are mutually complementary.
-
@pyhd Just tested …
-
@pyhd I actually liked your idea of checking … The manual doesn't make …
-
@isilence @pyhd You guys are talking about …
-
`io_uring_submit_and_wait_timeout` can be used to collect more cqes, but it is costly to repeat if the interval is very short, especially during idle. So I just wonder if there is any better method to collect more cqes in one wait, like IRQ coalescing: i.e. once the first cqe has arrived, a timer (e.g. 1ms) can be kicked off to wake up `wait_cqes` later. In other words, the delay is based on the timestamp of the first cqe. Now the problem is the additional timer request, which means one more syscall. Then my question is: could it be possible to kick the timer in the kernel? Even a new blocking function like `io_uring_wait_cqe(s)_batch/delay/coalesce`? Apart from that, is there a better solution?
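A userspace approximation of the above, to make the idea concrete (the in-kernel timer would remove the extra syscall; the 1ms figure is just the example value):

```c
#include <liburing.h>

/* Sketch: block for the first completion, then give further
 * completions up to ~1ms to coalesce before draining the queue. */
static unsigned reap_coalesced(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };
    unsigned head, seen = 0;

    io_uring_wait_cqe(ring, &cqe);                 /* first cqe arrives */
    io_uring_wait_cqes(ring, &cqe, 2, &ts, NULL);  /* up to 1ms for one
                                                      more; may return
                                                      -ETIME, that's fine */
    io_uring_for_each_cqe(ring, head, cqe)
        seen++;                                     /* handle each cqe */
    io_uring_cq_advance(ring, seen);
    return seen;
}
```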