Well, I do not know how to bring up this topic in an acceptable way, but there are some ELFs that are addicted to gifts. They are surrounded by so many gifts and so many temptations that some of them just snap, go nuts, and ask for more and more gifts, which they then hoard in their private cave. As this problem can stay below the radar for a long time, the ELF elders, who want to help those ELFs, decided to create a detection scheme to identify ELFs in need. For this, it is sufficient to detect whether an ELF asks his supervisor for some specific things that only gift-addicted ELFs use for storing gifts in their cave.
Software has security issues. Always. Every complex piece of software contains bugs and vulnerabilities that might be exploitable by an evil attacker. While it is good and noble to remove bugs from software and to use safer programming languages (like Rust), software will never be perfect. Therefore, it is necessary to tackle the problem on the architectural level as well and limit the amount of damage that an attacker can do if they find an exploit. Since exploits, like shellcode, often try to spread in the system by issuing destructive system calls, we could restrict the system call interface of a specific program to a subset of all available operations. Or: why does my in-memory key-value store require more system calls than accept/read/write/close/mmap/munmap?
And this is where seccomp(2) comes into the picture. With this system call, a process restricts its own system call interface to the subset of required operations during its initialization. For example, a key-value store could ban everything besides [accept, read, write, close, mmap, munmap], so exploits cannot issue execve(2) to start a different executable. As the list of allowed system calls cannot be widened again, the kernel enforces a strict system call sandbox.
In this filter mode (SECCOMP_SET_MODE_FILTER), seccomp can not only reject forbidden system calls but also inspect their arguments. For example, one could allow accept only on a specific file descriptor. However, for this, the user has to write the filter as a BPF program (see bpf(2)). The Berkeley Packet Filter is an in-kernel virtual machine that was originally designed to filter network packets efficiently. In essence, BPF allows us to execute small programs within the kernel, and because the kernel checks every program before accepting it, this does not endanger the kernel itself.
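To make the argument inspection a bit more concrete, here is a minimal sketch of such a filter, written by hand as classic BPF instructions. The helper names `restrict_write_to` and `write_is_blocked` are made up for this illustration. The sketch assumes a little-endian machine (it only looks at the low 32 bits of the first argument), it omits the architecture check that a real filter should start with, and it allows every system call other than write, so it is a demonstration and not a sandbox.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: after this call, write(2) only works on allowed_fd;
 * on every other descriptor it fails with EPERM. Returns 0 on success. */
int restrict_write_to(int allowed_fd) {
    struct sock_filter prog[] = {
        /* [0] A = system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* [1] not write? jump to [5] and allow it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 3),
        /* [2] A = low 32 bits of the first argument (little-endian assumed) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
        /* [3] fd matches? jump to [5], otherwise fall through to [4] */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)allowed_fd, 1, 0),
        /* [4] reject: the system call returns -1 with errno = EPERM */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        /* [5] allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = (unsigned short)(sizeof(prog) / sizeof(prog[0])),
        .filter = prog,
    };
    /* Unprivileged processes must set no_new_privs before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER, 0, &fprog);
}

/* Hypothetical demo: install the filter in a forked child and try to write
 * to probe_fd. Returns 1 if the write was rejected with EPERM, 0 if it was
 * not, and -1 if the setup failed. */
int write_is_blocked(int allowed_fd, int probe_fd) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (restrict_write_to(allowed_fd) != 0)
            _exit(2);
        ssize_t n = write(probe_fd, "x", 1);
        _exit(n == -1 && errno == EPERM ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The filter is installed in a forked child because, as described above, it can never be removed again. Calling `write_is_blocked(1, 2)` should report that a write to stderr is rejected while stdout stays usable.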
Since understanding seccomp and bpf would be a little much for a single day, we will only experiment with the SECCOMP_SET_MODE_STRICT mode of seccomp. In this mode, a process can restrict itself to a very small subset of system calls (read, write, exit, sigreturn(2)). The core idea of today's exercise is to execute a given function within a separate seccomp-protected process, which returns its result via a pipe to the calling process.
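The effect of strict mode can be observed with a small experiment. The following sketch (the helper name `run_in_strict_mode` is an assumption for this illustration, not part of the exercise template) forks a child, switches it to strict mode, and optionally lets it issue a system call that is not on the allow list. The kernel answers such a call by killing the process with SIGKILL.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a child in seccomp strict mode and return its
 * wait status (to be inspected with WIFEXITED/WIFSIGNALED), or -1 on error.
 * If misbehave is non-zero, the child issues a forbidden system call. */
int run_in_strict_mode(int misbehave) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* From here on, only read, write, exit, and sigreturn are allowed. */
        if (syscall(__NR_seccomp, SECCOMP_SET_MODE_STRICT, 0, NULL) != 0)
            _exit(2); /* still unrestricted, so exit_group is fine here */
        if (misbehave)
            syscall(__NR_getpid); /* not on the allow list: SIGKILL */
        /* Raw exit system call; glibc's _exit() would use exit_group. */
        syscall(__NR_exit, 0);
        /* not reached */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

A well-behaved child should show up in the parent as a normal exit with status 0, a misbehaving one as terminated by SIGKILL.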
Since processes in seccomp's strict mode cannot create new file descriptors but can only use existing ones with read() and write(), we have to close all file descriptors but the write-end of the pipe in our protected child process. Since Linux offers no system call to enumerate the open entries of the file-descriptor table, we would have to iterate over all possible file descriptors and invoke close() on them:

    for (int i = 1; i < INT_MAX; i++)
        close(i);

However, as INT_MAX is usually a large number, Linux learned the close_range(2) system call with version 5.9. With this system call, a process can close a whole range of file descriptors at once, which is very useful for our sandboxing and containerizing use cases.
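As a small illustration of close_range(2), the following sketch wraps the raw system call. The helper names `close_fds` and `fd_is_open` are made up for this example. It goes through syscall(2) so that it also builds with glibc versions before 2.34, which do not ship a close_range() wrapper; the fallback number 436 is an assumption that holds for x86-64 and most other architectures and is only used if the installed kernel headers are older than 5.9.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436 /* assumption for old headers, see text */
#endif

/* Hypothetical helper: close every descriptor in [first, last] with a
 * single system call. Returns 0 on success, -1 on error (errno is set,
 * e.g. to ENOSYS on kernels older than 5.9). */
int close_fds(unsigned int first, unsigned int last) {
    return (int)syscall(__NR_close_range, first, last, 0u);
}

/* Hypothetical helper: 1 if fd refers to an open descriptor, else 0. */
int fd_is_open(int fd) {
    return fcntl(fd, F_GETFD) != -1;
}
```

To close everything from descriptor 3 upwards, one would call `close_fds(3, ~0U)`; descriptors in the range that are not open are simply skipped.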
secure_func_t spawn_secure(void (*func)(void*, int), void* arg);
int complete_secure(secure_func_t f, char *buf, size_t buflen);
Complete these two functions:
1. spawn_secure forks the current process, installs a seccomp filter, closes all file descriptors but the write-end of the pipe with close_range, and calls the given function.
2. complete_secure reads from the read-end into buf and waits for the child to complete.
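The communication pattern behind these two functions can be sketched without any sandboxing. The following example is not the exercise's solution: `struct child_handle`, `start_child`, `collect_child`, and the two demo functions are hypothetical names, and the real secure_func_t from the template may look different. It only shows how a child hands its result to the parent through a pipe and how the parent collects it.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handle; the exercise's secure_func_t may look different. */
struct child_handle {
    pid_t pid;   /* child process, -1 on error */
    int read_fd; /* read end of the result pipe */
};

/* Fork a child that runs func(arg, write_fd); the parent keeps the read end. */
struct child_handle start_child(void (*func)(void *, int), void *arg) {
    struct child_handle h = { .pid = -1, .read_fd = -1 };
    int fds[2];
    if (pipe(fds) != 0)
        return h;
    h.pid = fork();
    if (h.pid == 0) {
        close(fds[0]); /* child only writes */
        func(arg, fds[1]);
        _exit(0);
    }
    close(fds[1]); /* parent only reads; needed to ever see EOF */
    if (h.pid < 0)
        close(fds[0]);
    else
        h.read_fd = fds[0];
    return h;
}

/* Read until EOF or until buf is full, then reap the child. Returns the
 * number of bytes read, or -1 if the child did not exit with status 0. */
int collect_child(struct child_handle h, char *buf, size_t buflen) {
    if (h.pid < 0)
        return -1;
    size_t total = 0;
    ssize_t n;
    while (total < buflen &&
           (n = read(h.read_fd, buf + total, buflen - total)) > 0)
        total += (size_t)n;
    close(h.read_fd);
    int status;
    if (waitpid(h.pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (int)total;
}

/* Demo children: one reports a result, one fails. */
void say_hallo(void *arg, int fd) {
    (void)arg;
    if (write(fd, "Hallo", 5) != 5)
        _exit(1);
}

void give_up(void *arg, int fd) {
    (void)arg;
    (void)fd;
    _exit(1);
}
```

In the actual exercise, the child would additionally move the write end, close all other descriptors, and enter strict mode before calling the function, so a forbidden system call would surface in the parent as a failed child.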
The output of the program should look like:
$ ./seccomp
ok: Hallo
fail failed: -1
Use dup2(2) to move the write-end of the pipe fd pair to a good position in the file-descriptor table.
Use syscall(__NR_exit, 0) to terminate the child, as glibc's _exit() actually calls exit_group(2), which strict mode does not allow.
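The dup2(2) hint can be sketched as follows. The helper name `move_fd` is an assumption for this illustration. The idea is that once the write end sits at a known, low slot of the file-descriptor table, everything above it forms one contiguous range that a single close_range(2) call can close.

```c
#include <unistd.h>

/* Hypothetical helper: make the open file behind fd available as
 * descriptor target and release the old number. If target was open
 * before, dup2() silently closes it first. Returns target, or -1 on error. */
int move_fd(int fd, int target) {
    if (fd == target)
        return target; /* already in place, nothing to do */
    if (dup2(fd, target) < 0)
        return -1;
    close(fd); /* the old slot is no longer needed */
    return target;
}
```

Which slot counts as a "good position" is a design choice; a low slot directly below the range that is going to be closed is one plausible option.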
OpenBSD's pledge(2) system call is a more usable alternative to seccomp. However, there are efforts to build pledge as a library on top of seccomp.
Last modified: 2023-12-01 15:52:27.777846, Last author: , Permalink: /p/advent-17-seccomp