Benchmark results for storage to SSD with/without liburing #1123
Comments
What does concurrency mean for the non-io_uring case? Are you comparing 1 thread for io_uring vs multiple threads for the other one?
0 or 1 represents false or true. For example
represents 48 threads each working on small buffers and then queuing the buffers for storage with
So, let me confirm: "concurrency = 2, use uring = 1" means there are 2 threads, each thread has an io_uring instance, and each thread keeps one request in flight, i.e. QD=1? Similar to this:
void thread_fn(ring) {
    while (1) {
        sqe = get_sqe();
        prep_write(sqe);
        io_uring_submit(ring);
        cqe = io_uring_wait();
        handle(cqe);
    }
}
All serializing threads share the same See code below: the first thread will create a queue, and all the others will share that queue.
Do you have a reproducer? It's not clear what the tool is doing, and without understanding that, any performance reasoning would be futile. When I said separate rings, I meant separate struct io_uring instances, each separately initialised with
This is a moderately complex C++ project. If you're interested, I can share it with you.
Please do share it. Even if it's complicated, just being able to run what you are running helps: tracing will often tell us a lot about how it's done without needing to fully read and comprehend the sample program.
Great, here is the project.
And please also include how you are running it. The goal is to make this as trivial as possible for someone to reproduce :-)
I've added an INSTALL file here that details how to run with default settings.
It would be nice if simple benchmarks were added to
@YoSTEALTH, not in liburing per se, but for storage there is fio/t/io_uring; someone may even adapt it to liburing and submit a patch.
@boxerab, hopefully we'll find time to look at it, but even without really looking at the numbers I suspect the comparison is not apples to apples. When you compare synchronous with asynchronous, it's usually one of:
a) There are N threads in both cases, and both run QD1 per thread (i.e. one request is in flight per thread). In this case the asynchronous interface basically runs synchronously, which is not good.
b) The asynchronous interface runs just 1 thread but QD=N, i.e. executes all requests in parallel. In this case the async interface may well lose in throughput and/or latency, but the key is that it takes much less CPU.
I don't understand which case is yours. There can be more options: a combination of the previous two, or for instance N threads generate IO requests and send them to a single IO thread, which executes them via io_uring. But a lot depends on how it is actually implemented.
This is how the benchmark works, but QD=4 for all concurrency levels. Perhaps it should match the concurrency. I will measure CPU usage for each configuration and see how that looks.
Here are results comparing CPU usage for uring vs synchronous with O_DIRECT. Timing and CPU usage are both identical.
I've found the same issue in #912, and it still hasn't been resolved. I've just tested the same case with the newest liburing on Ubuntu with kernel 6.8.5 and got the same bad results as before.
Glad to hear other people are testing this workflow. I hope this can be fixed eventually. In the current situation it doesn't make sense for me to use uring in my application.
They closed the issue, so I thought they might have solved it. But when I tested it again, I found the issue still exists. I wrote a very simple demo to test it in https://github.com/acl-dev/demo/tree/master/c/file ; anyone can use it to test liburing's write performance for file IO.
I've added a file read comparison between sys read and liburing in https://github.com/acl-dev/demo/tree/master/c/file/main.c, and found that their read efficiency is similar. The read and write comparison is below:
./file -n 100000
uring_write: open file.txt ok, fd=3
uring_write: write char=0
uring_write: write char=1
uring_write: write char=2
uring_write: write char=3
uring_write: write char=4
uring_write: write char=5
uring_write: write char=6
uring_write: write char=7
uring_write: write char=8
uring_write: write char=9
close file.txt ok, fd=3
uring write, total write=100000, cost=1541.28 ms, speed=64881.18
-------------------------------------------------------
sys_write: open file.txt ok, fd=3
sys_write: write char=0
sys_write: write char=1
sys_write: write char=2
sys_write: write char=3
sys_write: write char=4
sys_write: write char=5
sys_write: write char=6
sys_write: write char=7
sys_write: write char=8
sys_write: write char=9
close file.txt ok, fd=3
sys write, total write=100000, cost=80.58 ms, speed=1240925.73
========================================================
uring_read: read open file.txt ok, fd=3
uring_read: char[0]=0
uring_read: char[1]=1
uring_read: char[2]=2
uring_read: char[3]=3
uring_read: char[4]=4
uring_read: char[5]=5
uring_read: char[6]=6
uring_read: char[7]=7
uring_read: char[8]=8
uring_read: char[9]=9
close fd=3
uring read, total read=100000, cost=84.52 ms, speed=1183179.91
-------------------------------------------------------
sys_read: open file.txt ok, fd=3
sys_read: char[0]=0
sys_read: char[1]=1
sys_read: char[2]=2
sys_read: char[3]=3
sys_read: char[4]=4
sys_read: char[5]=5
sys_read: char[6]=6
sys_read: char[7]=7
sys_read: char[8]=8
sys_read: char[9]=9
sys read, total read=100000, cost=67.22 ms, speed=1487586.09
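To make the gap in the output above easier to read, here is a small sketch that recomputes throughput and relative cost from the reported totals. The function names are hypothetical and not part of the demo; the only inputs are the count and cost figures printed above. Dividing 1541.28 ms by 80.58 ms shows the uring write run took roughly 19 times as long as the sys write run, while the uring read run took about 1.26 times as long as the sys read run.

```c
#include <stdio.h>

/* Operations per second from an operation count and elapsed milliseconds. */
double ops_per_sec(double count, double cost_ms)
{
    return count / (cost_ms / 1000.0);
}

/* How many times longer run A took than run B. */
double slowdown(double cost_ms_a, double cost_ms_b)
{
    return cost_ms_a / cost_ms_b;
}

/* Print derived figures for the four runs quoted in the output above. */
void print_summary(void)
{
    const double n = 100000.0;          /* value passed with -n */
    const double uring_write = 1541.28; /* ms, from the log */
    const double sys_write = 80.58;     /* ms, from the log */
    const double uring_read = 84.52;    /* ms, from the log */
    const double sys_read = 67.22;      /* ms, from the log */

    printf("uring write: %.2f ops/s\n", ops_per_sec(n, uring_write));
    printf("sys write:   %.2f ops/s\n", ops_per_sec(n, sys_write));
    printf("write slowdown (uring/sys): %.2fx\n",
           slowdown(uring_write, sys_write));
    printf("read slowdown (uring/sys):  %.2fx\n",
           slowdown(uring_read, sys_read));
}
```

Calling print_summary() from a main function prints the derived figures. The recomputed speeds can differ slightly from the logged ones because the logged cost values are rounded to two decimals.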
Great. It would be interesting to compare with running this test on xfs or btrfs.
Hello, here are some benchmark results I have compiled for disk storage with/without liburing, testing both buffered and direct IO.
Kernel: 6.7
File system: xfs
Disk: NVMe SSD
CPU: 48 thread AMD
The benchmark essentially takes a series of small buffers, does some work on each buffer, then stores the results to disk. fsync is called at the end. Unfortunately, liburing is slightly slower than blocking I/O. The benchmark project itself can be found here.
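For readers who want a concrete picture of the blocking side of such a comparison, the following is a minimal sketch of the pattern described above: process a series of small buffers, write each one, and call fsync once at the end. It is not taken from the benchmark project. The function name, the memset stand-in for the per-buffer work, and the buffered (non-O_DIRECT) open flags are all assumptions made for illustration.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical blocking baseline: process `count` buffers of `size` bytes,
 * write each with write(2), and call fsync(2) once at the end.
 * Returns the total number of bytes written, or -1 on error.
 */
long store_buffers_blocking(const char *path, int count, size_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    char *buf = malloc(size);
    if (!buf) {
        close(fd);
        return -1;
    }

    long total = 0;
    for (int i = 0; i < count; i++) {
        /* Stand-in for the work done on each buffer (an assumption). */
        memset(buf, 'a' + (i % 26), size);

        /* write(2) may write fewer bytes than requested; loop until done. */
        size_t done = 0;
        while (done < size) {
            ssize_t n = write(fd, buf + done, size - done);
            if (n < 0) {
                free(buf);
                close(fd);
                return -1;
            }
            done += (size_t)n;
        }
        total += (long)size;
    }

    /* Flush to stable storage once, after all buffers are written. */
    int rc = fsync(fd);
    free(buf);
    if (close(fd) < 0 || rc < 0)
        return -1;
    return total;
}
```

A direct I/O variant would additionally need O_DIRECT in the open flags and suitably aligned buffers, which this sketch leaves out to stay short.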
Note: 0 or 1 below represents false or true.