Christmas is also the season in which the kitchen in the ELF village is a very busy place. All the food and all the cookies for Christmas Eve are prepared ahead of time and stored for the big feast when Santa comes home from the tour. And there is one thing that Santa is especially fond of, the only thing that both quenches the thirst and fills the stomach: gravy! Immense amounts of gravy to drench meat and vegetables alike. But making this special ELF gravy is a peculiar process, as it involves multiple processing and filter steps to ensure the delicious clarity of the brown molten gold.
However, last year, while the individual steps of the process were already optimized, there were a few difficulties with the overall gravy production: the transfers stalled and blocked because there was only a single ELF with a single cart that moved the proto-gravy around. For this year, the ELF senate decided to come up with a better gravy-transportation process.
Today, we want to look at two different system call interfaces, combine them, and help our elves with the gravy problem. We will start with epoll(7), which is a more modern variant of select(2), and then look at splice(2), which allows the user to directly redirect data from one pipe to another one. In essence, our ELF gets a mechanism to determine which process of the gravy production is producing proto-gravy at its output (epoll), and a mechanism to directly establish pipelines between the output and the input buckets of the gravy production without loading or unloading the cart.
Let's first have a look at epoll() and the problem that it solves: With select(), we passed a bit set of queried file descriptors to the kernel and got back the same set, where only the bits of the "ready" descriptors are set. So we have to rebuild the poll set before every select() invocation, and we have to iterate over the whole set to determine which descriptors are actually ready. Both steps take time and are inefficient, as the set of polled descriptors is usually large and does not change that often.
To solve this, epoll(7) was created in Linux 2.5.44 (2002): In a nutshell, we create an epoll descriptor with epoll_create(2), which is a file descriptor itself, and register the file descriptors that are interesting for us with epoll_ctl(2). With epoll_wait(2), we query the set of registered descriptors, and the kernel returns one or multiple "events" (becoming ready is an event) that we can work on. This is fundamentally different from select in two respects: (1) Unless we register a file descriptor explicitly as EPOLLONESHOT, it remains registered even after it has produced an event. (2) The return value is a dense stream of events with far fewer elements than the number of registered fds. In its functionality, epoll is similar to the poll(2) system call, which already solves (1) and allows us to reuse the set of interesting fds. Nevertheless, poll still forces us to iterate over the whole fd set to find the interesting ones.
In essence, an epoll event loop looks like this:
// Create the epoll object
int epoll_fd = epoll_create1(0);

// Register interesting FDs (ev describes the events we wait for)
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);

// The event loop
while (true) {
    struct epoll_event events[10];
    int n = epoll_wait(epoll_fd, events, 10, /* timeout */ -1);
    for (int i = 0; i < n; i++) {
        process(&events[i]);
    }
}
However, one thing is special about epoll: it does not tell us the file descriptor that produced an event, but only hands back the user-defined data (epoll_data_t) that we gave it when registering the fd. This could be the file descriptor number, but it could also be a pointer to some fancy context object.
The other part of today's topic are Unix pipes (pipe(7)) that we want to splice(2) together. For this, we first have to understand that a Unix pipe, which was championed by Doug McIlroy, is a connection between two file descriptors. These fds are not connected to a real file, but if we write some data into one fd, the data can be read from the other fd. This comes with the constraint that pipes are unidirectional: we have a "write end" and a "read end". And this is reflected in the API to create a pipe object with pipe(2) (or, with flags, pipe2(2)):
int pipe(int pipefd[2]);
....
int P[2];
pipe(P);
// P[0]: read end
// P[1]: write end
If you use your shell to spawn a command with pipes, this system call is used by the shell to establish the connection between the processes. For example, for cat file | grep bar, the shell invokes pipe2() once, passes the write end (P[1]) as stdout to the cat process and the read end (P[0]) as stdin to the grep process. Thereby, both processes communicate directly with each other, and the shell is no longer involved.
However, sometimes we want more fine-grained control over how and when data is moved between two processes, for example, if we want to measure the amount of gravy that flows between two production steps. For this, the shell must become an intermediary, for example by creating two pipe pairs: one from cat to the shell and one from the shell to grep. Nevertheless, it is not desirable to read() the data into the shell's address space first, just to write() it out again to the grep pipe. And this is where splice(2) comes in handy:
ssize_t splice(int fd_in, off64_t *off_in,
int fd_out, off64_t *off_out,
size_t len, unsigned int flags);
splice() connects two file descriptors (one or both of them must be a pipe) for up to len bytes. Furthermore, we can add some flags and set input and output file offsets. As a result, we get the number of bytes that were actually moved.
Write a single-threaded program that takes multiple filter commands as separate arguments:
seq 1 100000000000000 | ./epoll "grep [0-5]" "grep 1$" > /dev/null
The program spawns both filters, splices its stdin data with splice() to the first filter, splices the first filter's output to the second filter's input, and so on, until the output of the last filter is written to stdout of the ./epoll program. Meanwhile, the epoll program should record and regularly print the throughput of the filter pipeline to determine how much gravy is flowing. This is similar to the behavior of the pv(1) program. As output, you should expect something like:
[grep [0-5]] Started filter as pid 458367
[grep 1$] Started filter as pid 458368
73.63MiB/s 72.70MiB/s 7.29MiB/s
85.22MiB/s 85.06MiB/s 8.52MiB/s
92.25MiB/s 92.25MiB/s 9.22MiB/s
92.44MiB/s 92.46MiB/s 9.25MiB/s
Some hints:
- Use epoll to find those pipe ends that are ready and have data.
- For comparison, implement a variant copy_splice that uses read()/write() instead of splice().
- splice eagerly when data is ready and do not try to identify read ends that are ready to accept new data.

Last modified: 2023-12-01 15:52:28.025765, Permalink: /p/advent-10-epoll