---
title: 'How efficient can cat(1) be?'
date: '2022-07-17'
---

There have been a few initiatives in recent years to implement a
new userspace base system for Linux distributions as an
alternative to the GNU coreutils and BusyBox. Recently, the
author of one of these proposed implementations pitched [her
`cat` implementation][lc], derived from OpenBSD's, in a few IRC
channels as the most efficient one around. But is it actually?

  [lc]: https://vimuser.org/cat.c.txt

## Understanding what `cat` actually does

At the most basic level, `cat` takes one or more files and dumps
them to `stdout`. But do we actually need `stdio` for this? We
do not, and most competent `cat` implementations at least use
`read(2)` and `write(2)`, if not more advanced approaches.

If we consider `cat` as a form of buffer copy between an
arbitrary file descriptor and `STDOUT_FILENO`, the most
efficient strategy becomes clear: splicing. Anything which isn't
splicing involves unnecessary buffer copies, and thus cannot be
the most efficient.

To get the best performance out of spliced I/O, a few
prerequisites have to be met:

  * The source and destination file descriptors should be
    unbuffered.

  * Any intermediate buffer should be a multiple of the
    filesystem block size. In general, to avoid doing a `stat`
    syscall, we can assume that a multiple of `PAGE_SIZE` is
    likely acceptable.

## A simple `cat` implementation

The simplest way to implement `cat` is the way it is done in
BSD: using `read` and `write` on an intermediate buffer. This
results in two buffer copies, but has the best portability. Most
implementations of `cat` work this way, as it generally offers
good enough performance.
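As an aside, the rule of thumb about sizing buffers as a multiple of `PAGE_SIZE` can also be satisfied without relying on a compile-time constant at all. Here is a minimal sketch using `sysconf(3)`; the helper name `pick_bufsize` is mine and does not appear in any of the programs in this post:

```c
#include <stddef.h>
#include <unistd.h>

/* Return an I/O buffer size that is a whole multiple of the
 * page size, queried at runtime with sysconf(3). Sixteen pages
 * is an arbitrary factor, mirroring the buffers used in the
 * programs below. */
size_t pick_bufsize(void)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page < 0)
		page = 4096; /* conservative fallback */

	return (size_t)page * 16;
}
```

A buffer of this size would then typically be heap-allocated, since large stack arrays of runtime-determined size are best avoided.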
```c
/* This program is released into the public domain. */
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Not every libc exposes PAGE_SIZE via limits.h. */
#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

void dumpfile(const char *path)
{
	int srcfd = STDIN_FILENO;
	char buf[PAGE_SIZE * 16];
	ssize_t nread, nwritten;
	size_t offset;

	/* POSIX allows - to represent stdin. */
	if (*path != '-')
	{
		srcfd = open(path, O_RDONLY);
		if (srcfd < 0)
			err(EXIT_FAILURE, "open %s", path);
	}

	while ((nread = read(srcfd, buf, sizeof buf)) >= 1)
	{
		for (offset = 0; nread > 0; nread -= nwritten, offset += nwritten)
		{
			if ((nwritten = write(STDOUT_FILENO, buf + offset, nread)) <= 0)
				err(EXIT_FAILURE, "write stdout");
		}
	}

	if (srcfd != STDIN_FILENO)
		(void) close(srcfd);
}

int main(int argc, const char *argv[])
{
	int i;

	for (i = 1; i < argc; i++)
		dumpfile(argv[i]);

	return EXIT_SUCCESS;
}
```

## Implementing spliced I/O

Linux has no shortage of ways to perform spliced I/O. For our
`cat` implementation, there are two obvious candidates.

The first option is the venerable `sendfile` syscall, which was
[originally added to improve the file serving performance of web
servers][sf-origin]. Originally, `sendfile` required the
destination file descriptor to be a socket, but this restriction
was removed in Linux 2.6.33. Unfortunately, `sendfile` is not
perfect: because the source must be a file descriptor which can
be memory mapped, we must fall back to a different strategy when
copying from `stdin`.

```c
/* This program is released into the public domain.
 */
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Not every libc exposes PAGE_SIZE via limits.h. */
#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

bool spliced_copy(int srcfd)
{
	ssize_t nwritten;
	off_t offset = 0;

	do
	{
		nwritten = sendfile(STDOUT_FILENO, srcfd, &offset,
				    PAGE_SIZE * 16);
		if (nwritten < 0)
			return false;
	} while (nwritten > 0);

	return true;
}

void copy(int srcfd)
{
	char buf[PAGE_SIZE * 16];
	ssize_t nread, nwritten;
	size_t offset;

	while ((nread = read(srcfd, buf, sizeof buf)) >= 1)
	{
		for (offset = 0; nread > 0;
		     nread -= nwritten, offset += nwritten)
		{
			if ((nwritten = write(STDOUT_FILENO,
					      buf + offset, nread)) <= 0)
				err(EXIT_FAILURE, "write stdout");
		}
	}
}

void dumpfile(const char *path)
{
	int srcfd = STDIN_FILENO;

	/* POSIX allows - to represent stdin. */
	if (*path != '-')
	{
		srcfd = open(path, O_RDONLY);
		if (srcfd < 0)
			err(EXIT_FAILURE, "open %s", path);
	}

	/* Fall back to a traditional copy if the spliced version fails. */
	if (!spliced_copy(srcfd))
		copy(srcfd);

	if (srcfd != STDIN_FILENO)
		(void) close(srcfd);
}

int main(int argc, const char *argv[])
{
	int i;
	int stdout_flags;

	/* sendfile(2) refuses to write to a descriptor opened O_APPEND. */
	stdout_flags = fcntl(STDOUT_FILENO, F_GETFL);
	if (stdout_flags < 0)
		err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_GETFL)");
	stdout_flags &= ~O_APPEND;
	if (fcntl(STDOUT_FILENO, F_SETFL, stdout_flags) < 0)
		err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_SETFL)");

	for (i = 1; i < argc; i++)
		dumpfile(argv[i]);

	return EXIT_SUCCESS;
}
```

Another approach is to use `splice` and a pipe. This allows for
true zero-copy I/O in userspace, as a pipe is simply implemented
as a 64KB ring buffer in the kernel. In this case, we use two
splice operations per block of data: one to move the data into
the pipe, and another to move it from the pipe to the output
file.

```c
/* This program is released into the public domain.
 */
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Not every libc exposes PAGE_SIZE via limits.h. */
#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif

#define BLOCK_SIZE (PAGE_SIZE * 16)

bool spliced_copy(int srcfd)
{
	int pipefd[2];
	ssize_t nread, nwritten;
	off_t in_offset = 0;
	bool ret = true;

	if (pipe(pipefd) < 0)
		err(EXIT_FAILURE, "pipe");

	for (;;)
	{
		nread = splice(srcfd, &in_offset, pipefd[1], NULL,
			       BLOCK_SIZE, SPLICE_F_MOVE | SPLICE_F_MORE);
		if (nread <= 0)
		{
			ret = nread == 0;
			goto out;
		}

		/* Drain the pipe completely, so that no data is
		 * left sitting in it when we reach end of file. */
		while (nread > 0)
		{
			nwritten = splice(pipefd[0], NULL, STDOUT_FILENO,
					  NULL, nread,
					  SPLICE_F_MOVE | SPLICE_F_MORE);
			if (nwritten <= 0)
			{
				ret = false;
				goto out;
			}

			nread -= nwritten;
		}
	}

out:
	close(pipefd[0]);
	close(pipefd[1]);

	return ret;
}

void copy(int srcfd)
{
	char buf[PAGE_SIZE * 16];
	ssize_t nread, nwritten;
	size_t offset;

	while ((nread = read(srcfd, buf, sizeof buf)) >= 1)
	{
		for (offset = 0; nread > 0;
		     nread -= nwritten, offset += nwritten)
		{
			if ((nwritten = write(STDOUT_FILENO,
					      buf + offset, nread)) <= 0)
				err(EXIT_FAILURE, "write stdout");
		}
	}
}

void dumpfile(const char *path)
{
	int srcfd = STDIN_FILENO;

	/* POSIX allows - to represent stdin. */
	if (*path != '-')
	{
		srcfd = open(path, O_RDONLY);
		if (srcfd < 0)
			err(EXIT_FAILURE, "open %s", path);

		(void) posix_fadvise(srcfd, 0, 0, POSIX_FADV_SEQUENTIAL);
	}

	/* Fall back to a traditional copy if the spliced version fails.
 */
	if (!spliced_copy(srcfd))
		copy(srcfd);

	if (srcfd != STDIN_FILENO)
		(void) close(srcfd);
}

int main(int argc, const char *argv[])
{
	int i;
	int stdout_flags;

	/* splice(2) refuses to write to a descriptor opened O_APPEND. */
	stdout_flags = fcntl(STDOUT_FILENO, F_GETFL);
	if (stdout_flags < 0)
		err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_GETFL)");
	stdout_flags &= ~O_APPEND;
	if (fcntl(STDOUT_FILENO, F_SETFL, stdout_flags) < 0)
		err(EXIT_FAILURE, "fcntl(STDOUT_FILENO, F_SETFL)");

	for (i = 1; i < argc; i++)
		dumpfile(argv[i]);

	return EXIT_SUCCESS;
}
```

## Honorable mention: `copy_file_range`

While `copy_file_range` is not usually relevant to a `cat`
implementation, when both the source and the destination are
regular files it can deliver even better performance than
`splice`, as the kernel handles all of the details on its own.
An optimized `cat` might try this strategy first, then degrade
to `splice`, `sendfile`, and finally the plain `read` and
`write` loop.

## Performance comparison

To measure the performance of each strategy, we can simply use
`dd` as a sink, running each `cat` program piped into
`dd of=/dev/null bs=64K iflag=fullblock`.

The numbers in the table below are averaged across 1000 runs on
an 8GB RAM Linode, using a 4GB file in tmpfs.

| Strategy                               | Throughput |
|----------------------------------------|------------|
| `cat-simple` (`read` and `write` loop) | 3.6 GB/s   |
| `cat-sendfile`                         | 6.4 GB/s   |
| `cat-splice`                           | 11.6 GB/s  |

If you are interested in using these implementations in your own
`cat` implementation, you may do so under any license terms you
wish.
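For completeness, the `copy_file_range` strategy mentioned above might be sketched roughly as follows, assuming a kernel and libc that expose `copy_file_range(2)` (Linux 4.5+, glibc 2.27+). The helper `cfr_copy` is hypothetical and not part of the programs above:

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy all of srcfd into dstfd using copy_file_range(2).
 * Returns false if the kernel refuses (for example, when one
 * side is not a regular file), so that the caller can fall back
 * to splice, sendfile, or a plain read/write loop. */
bool cfr_copy(int srcfd, int dstfd)
{
	off_t off_in = 0;
	ssize_t n;

	do
	{
		/* The kernel may copy fewer bytes than requested,
		 * so loop until end of file (0) or error (<0). */
		n = copy_file_range(srcfd, &off_in, dstfd, NULL,
				    65536, 0);
		if (n < 0)
			return false;
	} while (n > 0);

	return true;
}
```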