In preparation for Christmas, the ELFs were thinking about the work that has to be done after the wishlists of all the children have arrived. And having many new and young elves, only a few hundred years old, came to the wishlist-processing team, the idea came up to build a new tool for sorting those wishes. And there are a lot of wishes! So efficient sorting is crucial. But sorting is not only about executing the algorithm itself, but they also have to bring in those wishes into the sorting office and push the sorted stream of wishes out to the workbench area.
For you, today will be a little bit easier or your time budget in comparison to yesterday, but we will learn about scatter and getter I/O under Linux and reason a little bit about system call batching. Today, we will not think about the sorting problem, but about getting data in and out to an I/O device (e.g., a file). With our well-known system call write(2) and its cousin pwrite(2), which takes an explicit file-offset, we can write the contents of a buffer into a file. However, at the end of an algorithm, the output data is often scattered across the memory. Only using write(2) we have two options: (1) combine all buffers into an large output buffer and flush that buffer with a single system call to disk, or (2) we could issue many small writes to write each result buffer separately. Clearly, both methods have drawbacks and induce either memcpy(3)-costs (1) or system call overheads (2).
To tackle this issue, Linux (and even POSIX) provide system calls that allow us to specify the source (or destination) buffer in a scattered manner. So instead of passing a single pointer and a buffer length, we pass an vector of (pointer, length)-tuples. Technically, this vector is an array of struct iovec
s:
struct iovec iov[2];
iov[0].iov_base = buf0;
iov[0].iov_len = 20;
iov[1].iov_base = buf1;
iov[1].iov_len = 30;
int nwritten = writev(fd, iov, 2);
In this small example, we create an iovec
with two fragments that describes a scattered buffer of 50 bytes that is written with a single writev(2) system call. As the kernel does not know about our C language level and the size of iov
but only gets a pointer, we have to tell it the number of array elements (2
).
And with this interface and its cousins (readv(2), preadv
, pwritev
), we can now batch many small writes, and avoid the system call overhead, into a single large write. An honorable mention in that man page deserve the preadv2()
and the pwritev2
variants, which demonstrate a common pattern that occurred during the development of Unix: Come up with a new system call, establish its API, notice that you want to influence its behavior a little bit, and add a new version 2
that takes a flags
argument. So to sum it up, in the read()/write()
universe, we have multiple pre- and suffixes to choose from:
p
: offset into the file is given explicitlyv
: take a scattered buffer instead of an contiguous buffer2
: mum, I fucked it up again and we now have a new system call that takes an flags argument. (Looking at you openat(2) and openat2(2).64
: Shit, computers are no longer 32-bit machines.In the kernel, handling of those vectors is quite easy as a library already exists to iterate over those vectors. And interesting example for that is do_loop_readv_writev(), which is a fallback mechanism if a file system does not provide file_operations
for readv/writev
. In that fallback case, the kernel just falls back to iterating itself over the vector issuing many small reads or writes. While this might sound like our solution (2) from above, this still saves the system call overhead for entering and leaving the kernel.
Write a program that reads lines from stdin
, sorts them with qsort(3) and write the sorted lines with a single system-call to stdout
. Special requirement: Once a line is read into memory, you are not allowed to move it around as this would be way to inefficient for handling that many wishes.
pselect6_time64
is an actual system call and I am not kidding.Last modified: 2023-12-01 15:52:28.167685, Last author: , Permalink: /p/advent-08-writev
Vacancies of TU Braunschweig
Career Service' Job Exchange
Merchandising
Term Dates
Courses
Degree Programmes
Information for Freshman
TUCard
Technische Universität Braunschweig
Universitätsplatz 2
38106 Braunschweig
P. O. Box: 38092 Braunschweig
GERMANY
Phone: +49 (0) 531 391-0