ariadne.space/content/blog/actually-bsd-kqueue-is-a-mo...

73 lines
8.0 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters!

This file contains invisible Unicode characters that may be processed differently from what appears below. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to reveal hidden characters.

---
title: "actually, BSD kqueue is a mountain of technical debt"
date: "2021-06-06"
---
A side effect of [the whole freenode kerfluffle](https://ariadne.space/2021/05/20/the-whole-freenode-kerfluffle/) is that I've been looking at IRCD again.  IRC, is of course a very weird and interesting place, and the smaller community of people who run IRCDs are largely weirder and even more interesting.
However, in that community of IRCD administrators there happens to be a few incorrect systems programming opinions that have been cargo culted around for years.  This particular blog is about one of these bikesheds, namely the _kqueue vs epoll debate_.
You've probably heard it before.  It goes something like this, _"BSD is better for networking, because it has kqueue.  Linux has nothing like kqueue, epoll doesn't come close."_  While I agree that epoll doesn't come close, I think that's actually a feature that has lead to a much more flexible and composable design.
## In the beginning...
Originally, IRCD like most daemons used `select` for polling sockets for readiness, as this was the first polling API available on systems with BSD sockets.  The `select` syscall works by taking a set of three bitmaps, with each bit describing a file descriptor number: bit 1 refers to file descriptor 1 and so on.  The bitmaps are the `read_set`, `write_set` and `err_set`, which map to sockets that can be read, written to or have errors accordingly.  Due to design defects with the `select` syscalls, it can only support up to `FD_SETSIZE` file descriptors on most systems.  This can be mitigated by making `fd_set` an arbitrarily large bitmap and depending on `fdmax` to be the upper bound, which is what WinSock has traditionally done on Windows.
The `select` syscall clearly had some design deficits that negatively affected scalability, so AT&T introduced the `poll` syscall in System V UNIX.  The `poll` syscall takes an array of `struct pollfd` of user-specified length, and updates a bitmap of flags in each `struct pollfd` entry with the current status of each socket.  Then you iterate over the `struct pollfd` list.  This is naturally a lot more efficient than `select`, where you have to iterate over all file descriptors up to `fdmax` and test for membership in each of the three bitmaps to ascertain each socket's status.
It can be argued that `select` was bounded by `FD_SETSIZE` (which is usually 1024 sockets), while `poll` begins to have serious scalability issues at around `10240` sockets.  These arbitrary benchmarks have been referred to as the C1K and C10K problems accordingly.  Dan Kegel has a [very lengthy post on his website](http://www.kegel.com/c10k.html) about his experiences mitigating the C10K problem in the context of running an FTP site.
## Then there was kqueue...
In July 2000, Jonathan Lemon introduced kqueue into FreeBSD, which quickly propagated into the other BSD forks as well.  kqueue is a kernel-assisted event notification system using two syscalls: `kqueue` and `kevent`.  The `kqueue` syscall creates a handle in the kernel represented as a file descriptor, which a developer uses with `kevent` to add and remove _event filters_.  Event filters can match against file descriptors, processes, filesystem paths, timers, and so on.
This design allows for a single-threaded server to process hundreds of thousands of connections at once, because it can register all of the sockets it wishes to monitor with the kernel and then lazily iterate over the sockets as they have events.
Most IRCDs have supported `kqueue` for the past 15 to 20 years.
## And then epoll...
In October 2002, Davide Libenzi got [his `epoll` patch](http://www.xmailserver.org/linux-patches/nio-improve.html) merged into Linux 2.5.44.  Like with kqueue, you use the `epoll_create` syscall to create a kernel handle which represents the set of descriptors to monitor.  You use the `epoll_ctl` syscall to add or remove descriptors from that set.  And finally, you use `epoll_wait` to wait for kernel events.
In general, the scalability aspects are the same to the application programmer: you have your sockets, you use `epoll_ctl` to add them to the kernel's `epoll` handle, and then you wait for events, just like you would with `kevent`.
Like `kqueue`, most IRCDs have supported `epoll` for the past 15 years.
## What is a file descriptor, anyway?
To understand the argument I am about to make, we need to talk about _file descriptors_.  UNIX uses the term _file descriptor_ a lot, even when referring to things which are clearly _not_ files, like network sockets.  Outside the UNIX world, a file descriptor is usually referred to as a _kernel handle_.  Indeed, in Windows, kernel-managed resources are given the `HANDLE` type, which makes this relationship more clear.  Essentially, a kernel handle is basically an opaque reference to an object in kernel space, and the astute reader may notice some similarities to the [object-capability model](https://en.wikipedia.org/wiki/Object-capability_model) as a result.
Now that we understand that file descriptors are actually just kernel handles, we can now talk about `kqueue` and `epoll`, and why `epoll` is actually the correct design.
## The problem with event filters
The key difference between `epoll` and `kqueue` is that `kqueue` operates on the notion of _event filters_ instead of _kernel handles_.  This means that any time you want `kqueue` to do something new, you have to add a new type of _event filter_.
[FreeBSD presently has 10 different event filter types](https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2): `EVFILT_READ`, `EVFILT_WRITE`, `EVFILT_EMPTY`, `EVFILT_AIO`, `EVFILT_VNODE`, `EVFILT_PROC`, `EVFILT_PROCDESC`, `EVFILT_SIGNAL`, `EVFILT_TIMER` and `EVFILT_USER`.  Darwin has additional event filters concerning monitoring Mach ports.
Other than `EVFILT_READ`, `EVFILT_WRITE` and `EVFILT_EMPTY`, all of these different event filter types are related to entirely different concerns in the kernel: they don't monitor kernel handles, but instead other specific subsystems than sockets.
This makes for a powerful API, but one which lacks [composability](https://en.wikipedia.org/wiki/Composability).
## `epoll` is better because it is composable
It is possible to do almost everything that `kqueue` can do on FreeBSD in Linux, but instead of having a single monolithic syscall to handle _everything_, Linux takes the approach of providing syscalls which allow almost anything to be represented as a _kernel handle_.
Since `epoll` strictly monitors _kernel handles_, you can register _any_ kernel handle you have with it and get events back when its state changes.  As a comparison to Windows, this basically means that `epoll` is a kernel-accelerated form of `WaitForMultipleObjects` in the Win32 API.
You are probably wondering how this works, so here's a table of commonly used `kqueue` event filters and the Linux syscall used to get a kernel handle for use with `epoll`.
| BSD event filter | Linux equivalent |
| --- | --- |
| `EVFILT_READ`, `EVFILT_WRITE`, `EVFILT_EMPTY` | Pass the socket with `EPOLLIN` etc. |
| `EVFILT_VNODE` | `inotify` |
| `EVFILT_SIGNAL` | `signalfd` |
| `EVFILT_TIMER` | `timerfd` |
| `EVFILT_USER` | `eventfd` |
| `EVFILT_PROC`, `EVFILT_PROCDESC` | `pidfd`, alternatively bind processes to a `cgroup` and monitor `cgroup.events` |
| `EVFILT_AIO` | `aiocb.aio_fildes` (treat as socket) |
Hopefully, as you can see, `epoll` can automatically monitor _any_ kind of kernel resource without having to be modified, due to its composable design, which makes it superior to `kqueue` from the perspective of having less technical debt.
Interestingly, [FreeBSD has added support for Linux's `eventfd` recently](https://www.freebsd.org/cgi/man.cgi?query=eventfd&apropos=0&sektion=2&manpath=FreeBSD+13.0-RELEASE+and+Ports&arch=default&format=html), so it appears that they may take `kqueue` in this direction as well.  Between that and FreeBSD's [process descriptors](https://www.freebsd.org/cgi/man.cgi?query=procdesc&sektion=4&apropos=0&manpath=FreeBSD+13.0-RELEASE+and+Ports), it seems likely.