---
title: "Monitoring for process completion in 2021"
date: "2021-09-20"
---

A historical defect in the `ifupdown` suite has been the lack of proper supervision of the processes it runs in order to bring interfaces up and down. Specifically, it is possible in historical `ifupdown` for a process to hang forever, at which point the system will fail to finish configuring interfaces. As interface configuration is part of the boot process, this means that the boot process can potentially hang forever and fail to complete. Accordingly, we have [introduced correct supervision of processes run by `ifupdown-ng` in the upcoming version 0.12](https://github.com/ifupdown-ng/ifupdown-ng/pull/161), with a 5-minute timeout.

Because `ifupdown-ng` is intended to be portable, we had to implement two versions of the process completion monitoring routine. The portable version is a busy loop, which sleeps for 50 milliseconds between iterations, and the non-portable version uses Linux process descriptors, a feature introduced in Linux 5.3. On earlier kernels, `ifupdown-ng` will downgrade to using the portable implementation. There are also a couple of other ways that one can monitor for process completion using notifications, but they were not appropriate for the `ifupdown-ng` design.

## Busy-waiting with `waitpid(2)`

The portable version, as previously noted, uses a busy loop which sleeps for short durations of time. A naive version of a routine which does this would look something like:

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* return true if process exited successfully, false in any other case */
bool
monitor_with_timeout(pid_t child_pid, int timeout_sec)
{
    int status;
    int ticks = 0;

    while (ticks < timeout_sec * 10)
    {
        /* waitpid returns the child PID on success */
        if (waitpid(child_pid, &status, WNOHANG) == child_pid)
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;

        /* sleep 100ms */
        usleep(100000);
        ticks++;
    }

    /* timeout exceeded, kill the child process and error */
    kill(child_pid, SIGKILL);
    /* best-effort reap: WNOHANG means we never block here */
    waitpid(child_pid, &status, WNOHANG);
    return false;
}
```

This approach, however, has some performance drawbacks. If the process has not already completed by the time monitoring begins, you will be delayed by at least 100ms. In the case of `ifupdown-ng`, almost all processes are very short-lived, so this is not a major issue. However, we can do better by tightening the event loop. Another optimization is to split the sleep into two steps, giving the initial call to `waitpid` a better chance of reaping the completed process:

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* return true if process exited successfully, false in any other case */
bool
monitor_with_timeout(pid_t child_pid, int timeout_sec)
{
    int status;
    int ticks = 0;

    while (ticks < timeout_sec * 20)
    {
        /* sleep 50usec to allow the child PID to complete */
        usleep(50);

        /* waitpid returns the child PID on success */
        if (waitpid(child_pid, &status, WNOHANG) == child_pid)
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;

        /* sleep 49.95ms */
        usleep(49950);
        ticks++;
    }

    /* timeout exceeded, kill the child process and error */
    kill(child_pid, SIGKILL);
    /* best-effort reap: WNOHANG means we never block here */
    waitpid(child_pid, &status, WNOHANG);
    return false;
}
```

This works fairly well in practice: there is no performance regression on the `ifupdown-ng` test suite with this implementation.

## The self-pipe trick

In the early 90s, Daniel J. Bernstein described [the self-pipe trick](https://cr.yp.to/docs/selfpipe.html), which allows process completion notifications to be delivered via a pollable file descriptor. It is portable to any POSIX-compliant system, and can be used with `poll` or whatever event mechanism you prefer. It works by installing a signal handler for `SIGCHLD` that writes to a descriptor obtained with `pipe(2)`. The downside of this approach is that you have to write quite a bit of code, and you have to track which pipe FD is associated with which PID. It also wastes a file descriptor per process, since you have a file descriptor for both sides of the pipe.

## Linux's `signalfd`

What if we could turn delivery of signals into a pollable file descriptor? This is precisely what Linux's `signalfd` does. The basic idea here is to open a `signalfd`, associate `SIGCHLD` with it, and then do the `waitpid(2)` call when `SIGCHLD` is received at the `signalfd`. The downside of this approach is similar to that of the self-pipe trick: you have to keep global state in order to accomplish it, as there can only be a single `SIGCHLD` handler.

## Process descriptors

[FreeBSD introduced support for process descriptors in 2010](http://lackingrhoticity.blogspot.com/2010/10/process-descriptors-in-freebsd-capsicum.html) as part of the Capsicum framework. A process descriptor is an opaque handle to a specific process in the kernel. This is helpful as it avoids race conditions involving the recycling of PIDs. And since they are kernel handles, they can be waited on with `kqueue` like other kernel objects, by using `EVFILT_PROCDESC`.

There have been a few attempts to introduce process descriptors to Linux over the years. The attempt which [finally succeeded was Christian Brauner's `pidfd` API](https://lwn.net/Articles/801319/), completely landing in Linux 5.4, although parts of it were functional in prior releases. Like FreeBSD's process descriptors, a `pidfd` is an opaque reference to a specific `struct task_struct` in the kernel, and is also pollable, making it quite suitable for notification monitoring.

A problem with using the `pidfd` API, however, is that it is not presently implemented in either glibc or musl, which means that applications will need to provide stub implementations of the API themselves for now. This issue with having to write our own stub aside, the solution is quite elegant:

```c
#include <poll.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__linux__) && defined(__NR_pidfd_open)

static inline int
local_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int) syscall(__NR_pidfd_open, pid, flags);
}

/* return true if process exited successfully, false in any other case */
bool
monitor_with_timeout(pid_t child_pid, int timeout_sec)
{
    int status;
    int pidfd = local_pidfd_open(child_pid, 0);
    if (pidfd < 0)
        return false;

    struct pollfd pfd = {
        .fd = pidfd,
        .events = POLLIN,
    };

    /* poll(2) returns the number of ready FDs; if it is less than
     * one, our process has timed out (or poll itself failed).
     */
    if (poll(&pfd, 1, timeout_sec * 1000) < 1)
    {
        close(pidfd);
        kill(child_pid, SIGKILL);
        waitpid(child_pid, &status, WNOHANG);
        return false;
    }

    /* if poll did return a ready FD, process completed. */
    close(pidfd);
    if (waitpid(child_pid, &status, WNOHANG) != child_pid)
        return false;

    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

#endif
```
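The `__NR_pidfd_open` check above only says that the build headers know about the syscall. The running kernel may still be older than 5.3, in which case the call fails. One way to decide at runtime whether to use the portable busy loop instead is a small probe like the one below. This is a sketch only: `pidfd_open_available` is a hypothetical name, and treating every failure as "not available" is an assumption made for simplicity, not necessarily how `ifupdown-ng` handles it.

```c
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* hypothetical probe: true if pidfd_open(2) works on the running kernel */
bool
pidfd_open_available(void)
{
#if defined(__linux__) && defined(__NR_pidfd_open)
    /* try to open a pidfd for ourselves; an old kernel fails with ENOSYS */
    int fd = (int) syscall(__NR_pidfd_open, getpid(), 0);

    if (fd < 0)
        return false;

    close(fd);
    return true;
#else
    /* headers without pidfd_open: only the portable path can be used */
    return false;
#endif
}
```

A caller could run the probe once at startup and then pick either the `pidfd` routine or the busy loop for every process it supervises.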

It will be interesting to see process supervisors (and other programs which perform short-lived supervision) adopt these new APIs. As for me, I will probably prepare patches to include `pidfd_open` and the other syscalls in musl as soon as possible.