diff --git a/man/io_uring_internals.7 b/man/io_uring_internals.7
new file mode 100644
index 000000000..0f2658e17
--- /dev/null
+++ b/man/io_uring_internals.7
@@ -0,0 +1,225 @@
+.TH io_uring_internals 7 2024-10-5 "Linux" "Linux Programmer's Manual"
+.SH NAME
+io_uring_internals
+.SH SYNOPSIS
+.nf
+.B "#include "
+.fi
+.PP
+.SH DESCRIPTION
+.PP
+.B io_uring
+is a Linux-specific, asynchronous API that allows the submission of requests to
+the kernel that are typically otherwise performed via a syscall. Requests are
+passed to the kernel via a shared ring buffer, the
+.I Submission Queue
+(SQ), and completion notifications are passed back to the application via the
+.I Completion Queue
+(CQ). An important detail here is that after a request has been submitted to
+the kernel, some CPU time has to be spent in kernel space to perform the
+required submission- and completion-related tasks.
+The mechanism used to provide this CPU time, as well as which process does so
+and when, is different in
+.I io_uring
+than for the traditional API provided by regular syscalls.
+
+.PP
+.SH Traditional Syscall Driven I/O
+.PP
+For regular syscalls, the CPU time for these tasks is directly provided by the
+process issuing the syscall, with the submission side tasks in kernel space
+being directly executed after the context switch. The time for completion
+related tasks is likewise provided directly by that process in the case of
+polled I/O. In the case of interrupt driven I/O the CPU time is provided,
+depending on the driver in question, either by the traditional top and bottom
+half IRQ approach or by threaded IRQ handling. The CPU time for completion
+tasks is thus in this case provided by the CPU on which the hardware
+interrupt arrives, as well as the CPU to which the dedicated kernel worker
+thread for the threaded IRQ handling gets scheduled, if that is used.
+
+.PP
+.SH The Submission Side Work
+.PP
+
+The tasks required in kernel space on the submission side are mostly checking
+the SQ for newly arrived SQEs, parsing them, checking them for validity and
+permissions, and then passing them on to the responsible subsystem, such as a
+block device driver. An important note here is that
+.I io_uring
+guarantees that the process of submitting the request to the responsible
+subsystem, and thus in this case the
+.IR io_uring_enter (2)
+syscall made to submit the new requests, will never block. However,
+.I io_uring
+relies on the capabilities of the responsible subsystem to perform the
+submission without blocking.
+.I io_uring
+will first attempt to submit the request without blocking.
+If this fails, e.g. due to the respective subsystem not supporting non-blocking
+submissions,
+.I io_uring
+will
+.I async punt
+the request, i.e. off-load it to the
+.I IO work queue
+(IO WQ) (see description below).
+
+.PP
+.SH The Completion Side Work
+.PP
+
+The tasks required in kernel space on the completion side mostly come in the
+form of various request-type-dependent tasks, such as copying buffers, parsing
+packet headers etc., as well as posting a CQE to the CQ to inform the
+application of the completion of the request.
+
+.PP
+.SH Who does the work
+.PP
+
+One of
+the primary motivations behind
+.I io_uring
+was to reduce or entirely avoid the overheads of syscalls to provide the
+required CPU time in kernel space. The mechanism that
+.I io_uring
+utilizes to achieve this differs depending on the configuration, with different
+trade-offs between configurations with respect to e.g. CPU efficiency and latency.
+
+With the default configuration the primary mechanism to provide the kernel space
+CPU time in
+.I io_uring
+is also a syscall:
+.IR io_uring_enter (2).
+This still differs from requests made via their respective syscall directly,
+such as
+.IR read (2),
+in the sense that it allows for batching in a more flexible way than is e.g.
+possible via
+.IR readv (2),
+as different syscall types can be freely mixed and matched, and chains of
+dependent requests, such as a
+.IR send (2)
+followed by a
+.IR recv (2),
+can be submitted with one syscall. Furthermore, it is possible to both process
+requests for submission and process arrived completions within the same
+.IR io_uring_enter (2)
+call. Applications can set the flag
+.I IORING_ENTER_GETEVENTS
+to process, in addition to any pending submissions, any arrived
+completions and
+optionally wait until a specified number of completions have arrived before
+returning.
+
+If polled I/O is used, all completion-related work is performed during the
+.IR io_uring_enter (2)
+call. For interrupt-driven I/O, the CPU receiving the hardware interrupt
+schedules the remaining work, including posting the CQE, to be
+performed via task work. Any outstanding task work is performed during any
+user-kernel space transition. By default, the CPU that received the hardware
+interrupt will, after scheduling the task work, interrupt a user space process
+via an inter-processor interrupt (IPI), which will cause it to enter the kernel
+and thus perform the scheduled work. While this ensures a timely delivery of
+the CQE, it is a relatively disruptive and high-overhead operation. To avoid
+this, applications can configure
+.I io_uring
+via
+.I IORING_SETUP_COOP_TASKRUN
+to elide the IPI. Applications must now ensure that they perform a syscall
+every so often to be able to observe new completions, but benefit from eliding
+the overheads of the IPIs. Additionally,
+.I io_uring
+can be configured to inform an application that it should now
+perform a syscall to reap new completions by setting
+.IR IORING_SETUP_TASKRUN_FLAG .
+This will result in
+.I io_uring
+setting
+.I IORING_SQ_TASKRUN
+in the SQ flags once the application should do so.
This mechanism can be
+restricted further via
+.IR IORING_SETUP_DEFER_TASKRUN ,
+which results in the task work only being executed when
+.IR io_uring_enter (2)
+is called with
+.I IORING_ENTER_GETEVENTS
+set, rather than at any user-kernel space transition, which gives the
+application more control over when the work is executed, thus enabling e.g.
+more opportunities for batching.
+
+.PP
+.SH Submission Queue Polling
+.PP
+
+SQ polling introduces a dedicated kernel thread that performs essentially all
+submission- and completion-related tasks: fetching SQEs from the SQ,
+submitting requests, polling requests (if configured for I/O polling), and
+posting CQEs. Notably, async punted requests are still processed by the IO WQ,
+so as not to hinder the progress of other requests. If the SQ thread does not
+have any work to do for a user-supplied timeout, it goes to sleep. SQ polling
+removes the need for any syscall during operation, besides waking up the SQ
+thread after long periods of inactivity, and thus reduces per-request overheads
+at the price of a high constant upkeep cost.
+
+.PP
+.SH IO Work Queue
+.PP
+
+The IO WQ is a kernel thread pool used to execute any requests that cannot be
+submitted in a non-blocking way to the underlying subsystem, due to missing
+support in said subsystem. After either the SQ poll thread or a user space
+thread calling
+.IR io_uring_enter (2)
+fails the initial attempt to submit the request without blocking, it passes the
+request on to an IO WQ thread that then performs the blocking submission. While
+this mechanism ensures that
+.IR io_uring ,
+unlike e.g. AIO, never blocks on any of the submission paths, it is, as the
+name of this mechanism, the async punt, suggests, not ideal. The blocking
+nature of the submission, the passing of the request to another thread, as
+well as the scheduling of the IO WQ threads are all overheads that are ideally
+avoided. Significant IO WQ activity can thus be seen as an indicator that
+something is very likely going wrong.
Similarly, the flag
+.I IOSQE_ASYNC
+should only be used if the user knows that a request will always, or is very
+likely to, async punt, and not to ensure that the submission will not block, as
+.I io_uring
+guarantees that it never blocks in any case.
+
+.PP
+.SH Kernel Thread Management
+.PP
+
+Each user space process utilizing
+.I io_uring
+possesses an
+.I io_uring
+context, which manages all
+.I io_uring
+instances created within said process via
+.IR io_uring_setup (2).
+By default, both the SQ poll thread and the IO WQ thread pool are
+dedicated to each
+.I io_uring
+instance and are thus not shared within a process and are never shared between
+different processes. However, sharing these between two or more instances can
+be achieved during setup via
+.IR IORING_SETUP_ATTACH_WQ .
+The threads of the IO WQ are created lazily in response to requests being async
+punted and fall into two accounts: the
+bounded account, responsible for requests with a generally bounded execution
+time, such as block I/O, and the unbounded account, for requests with unbounded
+execution time, such as recv operations.
+The maximum thread count of the accounts is by default 2 * NPROC and can be
+adjusted via
+.IR IORING_REGISTER_IOWQ_MAX_WORKERS .
+Their CPU affinity can be adjusted via
+.IR IORING_REGISTER_IOWQ_AFF .
+
+.EE
+.SH SEE ALSO
+.BR io_uring (7),
+.BR io_uring_enter (2),
+.BR io_uring_register (2),
+.BR io_uring_setup (2)
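The three task-work delivery modes described under "Who does the work" can be sketched in C as below. This is a minimal illustration under stated assumptions, not code from the patch, liburing, or the kernel: the names taskrun_mode, taskrun_setup_flags and should_enter_kernel are hypothetical, and the fallback #define values only mirror <linux/io_uring.h> so the sketch compiles without that header. Pairing IORING_SETUP_DEFER_TASKRUN with IORING_SETUP_SINGLE_ISSUER follows the requirement documented in io_uring_setup(2), which the text above does not mention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fallback values mirroring <linux/io_uring.h>. Real code should include
 * that header (or liburing.h) instead of redefining them. */
#ifndef IORING_SETUP_COOP_TASKRUN
#define IORING_SETUP_COOP_TASKRUN  (1U << 8)
#endif
#ifndef IORING_SETUP_TASKRUN_FLAG
#define IORING_SETUP_TASKRUN_FLAG  (1U << 9)
#endif
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER (1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN (1U << 13)
#endif
#ifndef IORING_SQ_TASKRUN
#define IORING_SQ_TASKRUN          (1U << 2)
#endif

/* Hypothetical labels for the task-work delivery modes in the text. */
enum taskrun_mode {
    TASKRUN_DEFAULT, /* kernel forces task work to run via an IPI */
    TASKRUN_COOP,    /* no IPI; task work runs on any kernel entry */
    TASKRUN_DEFER    /* task work runs only in io_uring_enter(2) with
                        IORING_ENTER_GETEVENTS set */
};

/* Return the io_uring_setup(2) flags that select the given mode. */
uint32_t taskrun_setup_flags(enum taskrun_mode mode)
{
    switch (mode) {
    case TASKRUN_DEFAULT:
        return 0;
    case TASKRUN_COOP:
        /* TASKRUN_FLAG asks the kernel to raise IORING_SQ_TASKRUN
         * in the SQ flags when task work is pending. */
        return IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG;
    case TASKRUN_DEFER:
        /* Per io_uring_setup(2), DEFER_TASKRUN needs SINGLE_ISSUER. */
        return IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER;
    }
    assert(!"unknown taskrun_mode");
    return 0;
}

/* Given a snapshot of the SQ ring flags, report whether the application
 * should enter the kernel so that pending task work can run. */
bool should_enter_kernel(uint32_t sq_flags)
{
    return (sq_flags & IORING_SQ_TASKRUN) != 0;
}
```

An application would pass the result of taskrun_setup_flags() in the flags field of struct io_uring_params at setup time, and call should_enter_kernel() on the SQ flags word read from the mapped SQ ring before deciding whether a syscall is needed. How that word is read is left out of the sketch.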