man/io_uring_internal: Add man page about relevant internals for users
Adds a man page with details about the inner workings of io_uring that
are likely to be useful for users as they relate to frequently misused
flags of io_uring such as IOSQE_ASYNC and the taskrun flags. This
mostly describes what needs to be done on the kernel side for each
request, who does the work and most notably what the async punt is.

Signed-off-by: Constantin Pestka <[email protected]>
CPestka committed Oct 5, 2024
1 parent 206650f commit 3ea13c6
Showing 1 changed file with 225 additions and 0 deletions.
225 changes: 225 additions & 0 deletions man/io_uring_internals.7
@@ -0,0 +1,225 @@
.TH io_uring_internals 7 2024-10-05 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring_internals \- io_uring internals relevant for users
.SH SYNOPSIS
.nf
.B "#include <linux/io_uring.h>"
.fi
.PP
.SH DESCRIPTION
.PP
.B io_uring
is a Linux-specific, asynchronous API that allows the submission of requests to
the kernel that are otherwise typically performed via a syscall. Requests are
passed to the kernel via a shared ring buffer, the
.I Submission Queue
(SQ), and completion notifications are passed back to the application via the
.I Completion Queue
(CQ). An important detail here is that after a request has been submitted to
the kernel, some CPU time has to be spent in kernel space to perform the
required submission and completion related tasks.
The mechanism used to provide this CPU time, as well as which process provides
it and when, differs between
.I io_uring
and the traditional API provided by regular syscalls.

.PP
.SH Traditional Syscall Driven I/O
.PP
For regular syscalls, the CPU time for these tasks is provided directly by the
process issuing the syscall, with the submission side tasks in kernel space
being executed directly after the context switch. For polled I/O, the CPU time
for completion related tasks is subsequently provided by the same process as
well. For interrupt driven I/O, the CPU time is provided, depending on the
driver in question, either by the traditional top and bottom half IRQ approach
or by threaded IRQ handling. In this case, the CPU time for completion tasks is
thus provided by the CPU on which the hardware interrupt arrives and, if
threaded IRQ handling is used, by the CPU on which the dedicated kernel worker
thread gets scheduled.

.PP
.SH The Submission Side Work
.PP

The tasks required in kernel space on the submission side mostly consist of
checking the SQ for newly arrived SQEs, parsing them and checking them for
validity and permissions, and then passing them on to the responsible
subsystem, such as a block device driver. An important note here is that
.I io_uring
guarantees that the process of submitting the request to the responsible
subsystem, and thus in this case the
.IR io_uring_enter (2)
syscall made to submit the new requests, will never block. However,
.I io_uring
relies on the capabilities of the responsible subsystem to perform the
submission without blocking.
.I io_uring
will first attempt to submit the request without blocking.
If this fails, e.g. due to the respective subsystem not supporting non-blocking
submissions,
.I io_uring
will
.I async punt
the request, i.e. off-load it to the
.I IO work queue
(IO WQ) (see description below).

.PP
.SH The Completion Side Work
.PP

The tasks required in kernel space on the completion side mostly come in the
form of various request type dependent tasks, such as copying buffers, parsing
packet headers etc., as well as posting a CQE to the CQ to inform the
application of the completion of the request.

.PP
.SH Who does the work
.PP

One of the primary motivations behind
.I io_uring
was to reduce or entirely avoid the overhead of the syscalls needed to provide
the required CPU time in kernel space. The mechanism that
.I io_uring
utilizes to achieve this differs depending on the configuration, with different
trade-offs between configurations with respect to e.g. CPU efficiency and
latency.

With the default configuration, the primary mechanism to provide the kernel
space CPU time in
.I io_uring
is also a syscall:
.IR io_uring_enter (2).
This still differs from requests made via their respective syscall directly,
such as
.IR read (2),
in the sense that it allows for batching in a more flexible way than is
possible via e.g.
.IR readv (2),
as different syscall types can be freely mixed and matched, and chains of
dependent requests, such as a
.IR send (2)
followed by a
.IR recv (2),
can be submitted with one syscall. Furthermore, it is possible to both process
requests for submission and process arrived completions within the same
.IR io_uring_enter (2)
call. Applications can set the flag
.I IORING_ENTER_GETEVENTS
to process any arrived completions in addition to any pending submissions, and
to optionally wait until a specified number of completions have arrived before
returning.
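.PP
As a minimal sketch of such a chain, the following C fragment fills two SQEs so
that a recv is only started once the preceding send has completed. The helper
name prep_send_then_recv and the user_data values are assumptions made for
illustration. The SQE fields, opcodes and the IOSQE_IO_LINK flag are taken from
<linux/io_uring.h>. Obtaining the two SQE slots from the mapped SQ and
publishing them to the kernel is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: prepare a send that is linked to a following recv.
 * The caller provides the two SQE slots, e.g. two consecutive SQ entries. */
static void prep_send_then_recv(struct io_uring_sqe *send_sqe,
                                struct io_uring_sqe *recv_sqe,
                                int sockfd,
                                const void *out, unsigned int out_len,
                                void *in, unsigned int in_len)
{
    assert(send_sqe != NULL && recv_sqe != NULL);

    memset(send_sqe, 0, sizeof(*send_sqe));
    send_sqe->opcode = IORING_OP_SEND;
    send_sqe->fd = sockfd;
    send_sqe->addr = (uint64_t)(uintptr_t)out;
    send_sqe->len = out_len;
    /* Link: the next SQE is only started after this one has completed. */
    send_sqe->flags = IOSQE_IO_LINK;
    send_sqe->user_data = 1; /* arbitrary tag, echoed back in the CQE */

    memset(recv_sqe, 0, sizeof(*recv_sqe));
    recv_sqe->opcode = IORING_OP_RECV;
    recv_sqe->fd = sockfd;
    recv_sqe->addr = (uint64_t)(uintptr_t)in;
    recv_sqe->len = in_len;
    /* Last element of the chain, so no link flag here. */
    recv_sqe->user_data = 2;
}
```

Both SQEs could then be submitted with a single io_uring_enter(2) call with
to_submit set to 2. Passing IORING_ENTER_GETEVENTS with min_complete set to 2
in that same call would additionally wait for both completions.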

If polled I/O is used, all completion related work is performed during the
.IR io_uring_enter (2)
call. For interrupt driven I/O, the CPU receiving the hardware interrupt
schedules the remaining work, including posting the CQE, to be performed via
task work. Any outstanding task work is performed during any user-kernel space
transition. By default, after scheduling the task work, the CPU that received
the hardware interrupt interrupts the user space process via an inter-processor
interrupt (IPI), which causes it to enter the kernel and thus perform the
scheduled work. While this ensures timely delivery of the CQE, it is a
relatively disruptive and high-overhead operation. To avoid this, applications
can configure
.I io_uring
via
.I IORING_SETUP_COOP_TASKRUN
to elide the IPI. Applications must then ensure that they perform a syscall
every so often to be able to observe new completions, but benefit from eliding
the overhead of the IPIs. Additionally,
.I io_uring
can be configured to inform an application that it should now perform a
syscall to reap new completions by setting
.IR IORING_SETUP_TASKRUN_FLAG .
This results in
.I io_uring
setting
.I IORING_SQ_TASKRUN
in the SQ flags once the application should do so. This mechanism can be
restricted further via
.IR IORING_SETUP_DEFER_TASKRUN ,
which results in the task work only being executed when
.IR io_uring_enter (2)
is called with
.I IORING_ENTER_GETEVENTS
set, rather than at any user-kernel space transition. This gives the
application more control over when the work is executed, thus enabling e.g.
more opportunities for batching.
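.PP
The taskrun configurations described above could be selected roughly as
follows. This is a sketch under stated assumptions: the enum taskrun_mode and
both helper names are hypothetical, and the #ifndef fallbacks only exist so the
fragment builds against older UAPI headers. Pairing IORING_SETUP_DEFER_TASKRUN
with IORING_SETUP_SINGLE_ISSUER follows the requirement documented in
io_uring_setup(2).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <linux/io_uring.h>

/* Fallbacks for older UAPI headers, values as in <linux/io_uring.h>. */
#ifndef IORING_SETUP_COOP_TASKRUN
#define IORING_SETUP_COOP_TASKRUN (1U << 8)
#endif
#ifndef IORING_SETUP_TASKRUN_FLAG
#define IORING_SETUP_TASKRUN_FLAG (1U << 9)
#endif
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER (1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN (1U << 13)
#endif
#ifndef IORING_SQ_TASKRUN
#define IORING_SQ_TASKRUN (1U << 2)
#endif

/* Hypothetical selector for the three behaviours described in the text. */
enum taskrun_mode { TASKRUN_DEFAULT, TASKRUN_COOP, TASKRUN_DEFER };

/* Fill the params that would be passed to io_uring_setup(2). */
static void init_taskrun_params(struct io_uring_params *p,
                                enum taskrun_mode mode)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));

    switch (mode) {
    case TASKRUN_COOP:
        /* No IPI; the kernel flags pending task work in the SQ flags. */
        p->flags |= IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
        break;
    case TASKRUN_DEFER:
        /* Task work only runs in io_uring_enter(2) with GETEVENTS. */
        p->flags |= IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER;
        break;
    case TASKRUN_DEFAULT:
        break;
    }
}

/* True if the kernel asks the application to enter the kernel so that
 * pending task work (and thus CQE posting) can run. */
static bool task_work_pending(unsigned int sq_flags)
{
    return (sq_flags & IORING_SQ_TASKRUN) != 0;
}
```

The sq_flags value is the flags word of the mapped SQ ring, found at the offset
that io_uring_setup(2) reports in sq_off.flags.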

.PP
.SH Submission Queue Polling
.PP

SQ polling introduces a dedicated kernel thread that performs essentially all
submission and completion related tasks: fetching SQEs from the SQ, submitting
requests, polling requests if configured for I/O poll, and posting CQEs.
Notably, async punted requests are still processed by the IO WQ, so as not to
hinder the progress of other requests. If the SQ thread does not have any work
to do for a user-supplied timeout, it goes to sleep. SQ polling removes the need
for any syscall during operation, besides waking up the SQ thread after long
periods of inactivity, and thus reduces per-request overheads at the cost of a
high constant upkeep cost.
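.PP
A sketch of how an application might set up SQ polling and decide whether a
wakeup syscall is needed is given below. The helper names are hypothetical.
IORING_SETUP_SQPOLL, the sq_thread_idle field (in milliseconds),
IORING_SQ_NEED_WAKEUP and IORING_ENTER_SQ_WAKEUP are not named in the text
above and are taken from io_uring_setup(2) and io_uring_enter(2).

```c
#include <assert.h>
#include <string.h>
#include <linux/io_uring.h>

/* Fill the params for io_uring_setup(2) so that a kernel SQ thread is
 * created, which goes to sleep after idle_ms milliseconds without work. */
static void init_sqpoll_params(struct io_uring_params *p, unsigned int idle_ms)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_SQPOLL;
    p->sq_thread_idle = idle_ms;
}

/* Given the current SQ ring flags, return the flags to pass to
 * io_uring_enter(2). A return value of 0 means the SQ thread is awake
 * and no syscall is required to get new SQEs picked up. */
static unsigned int sqpoll_enter_flags(unsigned int sq_flags)
{
    if (sq_flags & IORING_SQ_NEED_WAKEUP)
        return IORING_ENTER_SQ_WAKEUP;
    return 0;
}
```

After publishing new SQEs, the application would read the SQ ring flags and
only call io_uring_enter(2) when sqpoll_enter_flags() returns a non-zero value.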

.PP
.SH IO Work Queue
.PP

The IO WQ is a kernel thread pool used to execute any requests that cannot be
submitted in a non-blocking way to the underlying subsystem, due to missing
support in said subsystem. After either the SQ poll thread or a user space
thread calling
.IR io_uring_enter (2)
fails the initial attempt to submit the request without blocking, it passes the
request on to an IO WQ thread that then performs the blocking submission. While
this mechanism ensures that
.IR io_uring ,
unlike e.g. AIO, never blocks on any of the submission paths, it is, as the
name of this mechanism, the async punt, suggests, not ideal. The blocking
nature of the submission, the passing of the request to another thread, and
the scheduling of the IO WQ threads are all overheads that are ideally avoided.
Significant IO WQ activity can thus be seen as an indicator that
something is very likely going wrong. Similarly, the flag
.I IOSQE_ASYNC
should only be used if the user knows that a request will always, or is very
likely to, async punt. It should not be used to ensure that the submission will
not block, as
.I io_uring
guarantees to never block in any case.

.PP
.SH Kernel Thread Management
.PP

Each user space process utilizing
.I io_uring
possesses an
.I io_uring
context, which manages all
.I io_uring
instances created within said process via
.IR io_uring_setup (2).
By default, both the SQ poll thread and the IO WQ thread pool are
dedicated to each
.I io_uring
instance. They are thus not shared within a process and are never shared between
different processes. However, sharing these between two or more instances can
be achieved during setup via
.IR IORING_SETUP_ATTACH_WQ .
The threads of the IO WQ are created lazily in response to requests being async
punted and fall into two accounts: the
bounded account, responsible for requests with a generally bounded execution
time, such as block I/O, and the unbounded account, for requests with unbounded
execution time, such as recv operations.
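By default, the number of bounded workers is limited to the number of SQ
entries the ring was set up with or 4 times the number of online CPUs,
whichever is smaller, while unbounded workers are only limited by RLIMIT_NPROC.
Both limits can be adjusted via
.IR IORING_REGISTER_IOWQ_MAX_WORKERS .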
Their CPU affinity can be adjusted via
.IR IORING_REGISTER_IOWQ_AFF .
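.PP
The sharing and worker limit settings above could be applied along the
following lines. Both helper names are hypothetical, and the raw syscall(2)
invocation is just one possible way to reach io_uring_register(2). The wq_fd
field and the two-element array layout (bounded first, unbounded second) are
taken from io_uring_setup(2) and io_uring_register(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* Fill the params for a new ring that should share the backend threads
 * of an already existing ring instead of getting its own. */
static void init_attached_params(struct io_uring_params *p,
                                 int existing_ring_fd)
{
    assert(p != NULL && existing_ring_fd >= 0);
    memset(p, 0, sizeof(*p));
    p->flags = IORING_SETUP_ATTACH_WQ;
    p->wq_fd = (unsigned int)existing_ring_fd;
}

/* Set the maximum number of IO WQ workers for both accounts.
 * Returns 0 on success and -1 with errno set on failure. */
static int set_iowq_max_workers(int ring_fd, unsigned int bounded,
                                unsigned int unbounded)
{
    /* Index 0 is the bounded account, index 1 the unbounded account. */
    unsigned int counts[2] = { bounded, unbounded };

    return (int)syscall(__NR_io_uring_register, ring_fd,
                        IORING_REGISTER_IOWQ_MAX_WORKERS, counts, 2);
}
```

The params filled by init_attached_params() would be passed to
io_uring_setup(2) when creating the second ring.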

.SH SEE ALSO
.BR io_uring (7),
.BR io_uring_enter (2),
.BR io_uring_register (2),
.BR io_uring_setup (2)
