--- title: 'How efficient can cat(1) be?' date: '2022-07-17' --- There have been a few initiatives in recent years to implement a new userspace base system for Linux distributions as an alternative to the GNU coreutils and BusyBox. Recently, one of the authors of one of these proposed implementations made the pitch in a few IRC channels [that her cat implementation][lc], which was derived from OpenBSD’s implementation, was the most efficient. But is it actually? [lc]: https://vimuser.org/cat.c.txt ## Understanding what `cat` actually does At the most basic level, `cat` takes one or more files and dumps them to `stdout`. But do we need to actually use `stdio` for this? Actually, we don’t, and most competent `cat` implementations at least use `read(2)` and `write(2)` if not more advanced approaches. If we consider `cat` as a form of buffer copy between an arbitrary file descriptor and `STDOUT_FILENO`, we can understand what the most efficient strategy to use for `cat` would be: splicing. Anything which isn’t doing splicing, after all, involves unnecessary buffer copies, and thus cannot be the most efficient. To get the best performance out of spliced I/O, we have to have some prerequisites: * The source and destination file descriptors should be unbuffered. * Any intermediate buffer should be a multiple of the filesystem block size. In general, to avoid doing a `stat` syscall, we can assume that a multiple of `PAGE_SIZE` is likely acceptable. ## A simple `cat` implementation The simplest way to implement `cat` is the way that it is done in BSD: using `read` and `write` on an intermediate buffer. This results in two buffer copies, but has the best portability. Most implementations of `cat` work this way, as it generally offers good enough performance. ```c /* This program is released into the public domain. */ #include #include #include #include #include #include #include void dumpfile(const char *path) { int srcfd = STDIN_FILENO; char buf[PAGE_SIZE * 16]; ssize_t nread, nwritten; size_t offset; /* POSIX allows - to represent stdin. */ if (*path != '-') { srcfd = open(path, O_RDONLY); if (srcfd < 0) err(EXIT_FAILURE, "open %s", path); } while ((nread = read(srcfd, buf, sizeof buf)) >= 1) { for (offset = 0; nread > 0; nread -= nwritten, offset += nwritten) { if ((nwritten = write(STDOUT_FILENO, buf + offset, nread)) <= 0) err(EXIT_FAILURE, "write stdout"); } } if (srcfd != STDIN_FILENO) (void) close(srcfd); } int main(int argc, const char *argv[]) { int i; for (i = 1; i < argc; i++) dumpfile(argv[i]); return EXIT_SUCCESS; } ``` ## Implementing spliced I/O Linux has no shortage of ways to perform spliced I/O. For our `cat` implementation, we have two possible ways to do it. The first possible option is the venerable `sendfile` syscall, which was [originally added to improve the file serving performance of web servers][sf-origin]. Originally, `sendfile` required the destination file descriptor to be a socket, but this restriction was removed in Linux 2.6.33. Unfortunately, sendfile is not perfect: because it only supports file descriptors which can be memory mapped, we must use a different strategy when using copying from `stdin`. [sf-origin]: https://yarchive.net/comp/linux/sendfile.html ```c /* This program is released into the public domain. */ #include #include #include #include #include #include #include #include #include bool spliced_copy(int srcfd) { ssize_t nwritten; off_t offset = 0; do { nwritten = sendfile(STDOUT_FILENO, srcfd, &offset, PAGE_SIZE * 16); if (nwritten < 0) return false; } while (nwritten > 0); return true; } void copy(int srcfd) { char buf[PAGE_SIZE * 16]; size_t nread, nwritten, offset; while ((nread = read(srcfd, buf, sizeof buf)) >= 1) { for (offset = 0; nread > 0; nread -= nwritten, offset += nwritten) { if ((nwritten = write(STDOUT_FILENO, buf + offset, nread)) <= 0) err(EXIT_FAILURE, "write stdout"); } } } void dumpfile(const char *path) { int srcfd = STDIN_FILENO; char buf[PAGE_SIZE * 16]; /* POSIX allows - to represent stdin. */ if (*path != '-') { srcfd = open(path, O_RDONLY); if (srcfd < 0) err(EXIT_FAILURE, "open %s", path); } /* Fall back to traditional copy if the spliced version fails. */ if (!spliced_copy(srcfd)) copy(srcfd); if (srcfd != STDIN_FILENO) (void) close(srcfd); } int main(int argc, const char *argv[]) { int i; int stdout_flags; stdout_flags = fcntl(STDOUT_FILENO, F_GETFL); if (stdout_flags < 0) err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_GETFL)"); stdout_flags &= ~O_APPEND; if (fcntl(STDOUT_FILENO, F_SETFL, stdout_flags) < 0) err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_SETFL)"); for (i = 1; i < argc; i++) dumpfile(argv[i]); return EXIT_SUCCESS; } ``` Another approach is to use `splice` and a pipe. This allows for true zero-copy I/O in userspace, as a pipe is simply implemented as a 64KB ring buffer in the kernel. In this case, we just use two splice operations per block of data we want to copy: one to move the data to the pipe and another to move the data from the pipe to the output file. ```c /* This program is released into the public domain. */ #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #define BLOCK_SIZE ((PAGE_SIZE * 16) - 1) bool spliced_copy(int srcfd) { int pipefd[2]; ssize_t nread, nwritten; off_t in_offset = 0; bool ret = true; if (pipe(pipefd) < 0) err(EXIT_FAILURE, "pipe"); do { nread = splice(srcfd, &in_offset, pipefd[1], NULL, BLOCK_SIZE, SPLICE_F_MOVE | SPLICE_F_MORE); if (nread <= 0) { ret = nread < 0 ? false : true; goto out; } nwritten = splice(pipefd[0], NULL, STDOUT_FILENO, NULL, BLOCK_SIZE, SPLICE_F_MOVE | SPLICE_F_MORE); if (nwritten < 0) { ret = false; goto out; } } while (nwritten > 0); out: close(pipefd[0]); close(pipefd[1]); return ret; } void copy(int srcfd) { char buf[PAGE_SIZE * 16]; size_t nread, nwritten, offset; while ((nread = read(srcfd, buf, sizeof buf)) >= 1) { for (offset = 0; nread > 0; nread -= nwritten, offset += nwritten) { if ((nwritten = write(STDOUT_FILENO, buf + offset, nread)) <= 0) err(EXIT_FAILURE, "write stdout"); } } } void dumpfile(const char *path) { int srcfd = STDIN_FILENO; char buf[PAGE_SIZE * 16]; /* POSIX allows - to represent stdin. */ if (*path != '-') { srcfd = open(path, O_RDONLY); if (srcfd < 0) err(EXIT_FAILURE, "open %s", path); (void) posix_fadvise(srcfd, 0, 0, POSIX_FADV_SEQUENTIAL); } /* Fall back to traditional copy if the spliced version fails. */ if (!spliced_copy(srcfd)) copy(srcfd); if (srcfd != STDIN_FILENO) (void) close(srcfd); } int main(int argc, const char *argv[]) { int i; int stdout_flags; stdout_flags = fcntl(STDOUT_FILENO, F_GETFL); if (stdout_flags < 0) err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_GETFL)"); stdout_flags &= ~O_APPEND; if (fcntl(STDOUT_FILENO, F_SETFL, stdout_flags) < 0) err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_SETFL)"); for (i = 1; i < argc; i++) dumpfile(argv[i]); return EXIT_SUCCESS; } ``` ## Honorable mention: `copy_file_range` While `copy_file_range` is not really that relevant to a `cat` implementation, if both the source and output files are normal files, you can use it to get even faster performance than using splice, as the kernel handles all of the details on its own. An optimized `cat` might try this strategy and then downgrade to `splice`, `sendfile`, and the normal `read` and `write` loop. ## Performance comparison To measure the performance of each strategy, we can simply use `dd` as a sink, running each cat program piped into `dd of=/dev/null bs=64K iflag=fullblock`. The runs in the table below are averaged across 1000 runs on a 8GB RAM Linode, using a 4GB file in tmpfs. | Strategy | Throughput | |----------------------------------------|------------| | `cat-simple` (`read` and `write` loop) | 3.6 GB/s | | `cat-sendfile` | 6.4 GB/s | | `cat-splice` | 11.6 GB/s | If you are interested in using these implementations in your own `cat` implementation, you may do so under any license terms you wish.