---
title: "Rootless Containers on Alpine Part 1: Prep Work"
date: 2022-11-08T19:30:15+11:00
lastmod: 2022-11-09T10:30:15+11:00
draft: false
showSummary: true
summary: "We recently murdered a server's terminal via `do_distro_upgrade`, and thought it'd be a good time to learn more about containers and Alpine."
series:
  - "Rootless Containers on Alpine Linux"
series_order: 1
---

# Part One: Prep Work

## Background

> **(Ashe)**
> So. We recently murdered a server's terminal via `do_distro_upgrade`.
>
> **(Tammy)** Was it really that bad?
>
> **(Ashe)** Yes.

```
% man 7z
WARNING: terminal is not fully functional
-  (press RETURN)%
```

It was in fact *that bad*. So we figured, well, we can spend a few hours, days, whatever fixing this...

> **(Tammy)** Or we could just build a new server!
>
> **(Ashe)** Right.

So, after asking some friends for their opinions, we settled on Alpine Linux. And why not also migrate all of our pm2 workloads to containers while we're at it? We've been meaning to learn more about containers for a while now. So off we go!

## Prep Work

We need a few things before we actually set up rootless containers. We'll be following along with the [Official Rootless Containers Tutorial](https://rootlesscontaine.rs/getting-started/common/), making adjustments as necessary.

### Login Information

Most Rootless Container implementations use `$XDG_RUNTIME_DIR` to find the user's runtime directory (usually some subdir of `/run/user/`, keyed by the user's ID). Systemd-based Linux distros handle this automatically, but Alpine uses [OpenRC](https://wiki.alpinelinux.org/wiki/OpenRC), which does not. While Alpine doesn't provide a tutorial for Rootless Containers, we can adapt some of the prep work done for [Wayland](https://wiki.alpinelinux.org/wiki/Wayland) to get OpenRC to set `$XDG_RUNTIME_DIR` for us.
We just create `/etc/profile.d/xdg_runtime_dir.sh` like so:

```sh
if test -z "${XDG_RUNTIME_DIR}"; then
  export XDG_RUNTIME_DIR=/tmp/$(id -u)-runtime-dir
  if ! test -d "${XDG_RUNTIME_DIR}"; then
    mkdir "${XDG_RUNTIME_DIR}"
    chmod 0700 "${XDG_RUNTIME_DIR}"
  fi
fi
```

And, log out and then back in...

```sh
~ ❯ env
[...]
XDG_RUNTIME_DIR=/tmp/1000-runtime-dir
[...]
```

With that done, we can move on to our next steps.

### Sysctl

There's some sysctl config required for older distros, but Alpine doesn't need it.

### User Namespace Configuration

Rootless Containers use User Namespaces, subUIDs, and subGIDs, so we'll need to have those working. The apk package `shadow-subids` provides that functionality for us.

```
~ ❯ apk info shadow-subids
shadow-subids-4.10-r3 description:
Utilities for using subordinate UIDs and GIDs

shadow-subids-4.10-r3 webpage:
https://github.com/shadow-maint/shadow

shadow-subids-4.10-r3 installed size:
140 KiB
```

### Sub-ID Counts

Rootless Containers generally expect `/etc/subuid` and `/etc/subgid` to contain at least 65,536 sub-IDs for each user. `shadow-subids` does create these files for us, but leaves them empty by default, so let's go ahead and populate them. The [page on subIDs](https://rootlesscontaine.rs/getting-started/common/subuid/) provides a handy Python script to do that for us, which we'll edit slightly so it's not writing directly to system files:

```python
f = open("subuid", "w")
for uid in range(1000, 65536):
    f.write("%d:%d:65536\n" % (uid, uid * 65536))
f.close()

f = open("subgid", "w")
for uid in range(1000, 65536):
    f.write("%d:%d:65536\n" % (uid, uid * 65536))
f.close()
```

This is probably overkill for our use-case, but that's also fine.

> **(Doll)** So this one just runs script and copies to /etc/?
>
> **(Ashe)** Yes Doll, that's right.

With that done, we can move on to the last prep step.

### CGroups V2

To limit the resources that a container can use, we need to enable CGroups V2.
In OpenRC, this can be done by changing some options in `/etc/rc.conf`. To enable CGroups in general, we need to set `rc_controller_cgroups` to `YES`:

```sh
# This switch controls whether or not cgroups version 1 controllers are
# individually mounted under
# /sys/fs/cgroup in hybrid or legacy mode.
rc_controller_cgroups="YES"
```

From here, we can enable CGroups V2 by setting `rc_cgroup_mode` to `unified`:

```sh
# This sets the mode used to mount cgroups.
# "hybrid" mounts cgroups version 2 on /sys/fs/cgroup/unified and
# cgroups version 1 on /sys/fs/cgroup.
# "legacy" mounts cgroups version 1 on /sys/fs/cgroup
# "unified" mounts cgroups version 2 on /sys/fs/cgroup
rc_cgroup_mode="unified"
```

> **(Doll)** Doll confused.
>
> **(Ashe)** So was I, for a bit. Despite what `rc.conf` says, cgroups V2 does *not* seem to be enabled on Alpine
> unless `rc_cgroup_mode` is set to `unified`. The [Alpine Wiki](https://wiki.alpinelinux.org/wiki/OpenRC#cgroups\_v2)
> seems to agree here, but isn't super clear. We'll find out if this is sufficient.

The next step is configuring the controllers we want to use:

```sh
# This is a list of controllers which should be enabled for cgroups version 2
# when hybrid mode is being used.
# Controllers listed here will not be available for cgroups version 1.
rc_cgroup_controllers="cpuset cpu io memory hugetlb pids"
```

Finally, we can add the cgroups service to a runlevel so that it's started automatically at boot:

```sh
doas rc-update add cgroups
```

From here, we can reboot and continue on. If you don't want to reboot, you can start the cgroups service manually:

```sh
doas rc-service cgroups start
```

## Creating a group for our container users

We'll quickly create a group for all users who'll be using rootless containers here. In Alpine, this is as simple as `doas addgroup ctr`. We'll make use of this later.
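Adding users to the group is just as simple. BusyBox's `adduser` appends an existing user to an existing group when given both names; the username `tammy` below is just an example:

```sh
# BusyBox adduser syntax: `adduser <user> <group>` adds an existing
# user to an existing group (username here is just an example).
doas adduser tammy ctr
# The new membership takes effect on the user's next login.
```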
## Installing containerd and friends

First up we'll need to install `containerd` (to host our containers) and `slirp4netns` (to provide unprivileged networking inside the containers' network namespaces, with lower overhead than VPNKit), so we just:

```sh
doas apk add containerd
doas apk add slirp4netns
```

Next, we need to install `nerdctl` and `rootlesskit`. Both of these are currently only found in the `testing` repo for Alpine. We can pull them in without subscribing to the entire testing repo like so:

```sh
doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ nerdctl
doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ rootlesskit
```

## Configuring the Rootless containerd service

We'll be using nerdctl as our containerd controller of choice. It comes with a rootless containerd.service, but since Alpine doesn't use systemd, we'll have to adapt this into an rc service. We spent some time trying to adapt the [install script](https://github.com/containerd/nerdctl/blob/48f189a53a24c12838433f5bb5dd57f536816a8a/extras/rootless/containerd-rootless-setuptool.sh) nerdctl provides to our purposes, but that's a bit excessive for what we need, so we'll just do it the "[hard way](https://github.com/containerd/containerd/blob/main/docs/rootless.md)".

> **(Tammy)** Wait, this isn't the "hard way", is it?
>
> **(Ashe)** Nope. Adapting a 500 line script would be hard and annoying. We're better served by just doing it manually,
> and providing instructions for anyone following along.

So in that vein:

### Getting containerd running in rootlesskit

First, let's get containerd running at the CLI, and then we can make it into an OpenRC script.
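Before that, it's worth a quick sanity check that everything we installed actually landed on `$PATH` (assuming all four tools accept the usual `--version` flag; your version numbers will differ):

```sh
# Quick sanity check that the tools we just installed are on $PATH.
containerd --version
nerdctl --version
rootlesskit --version
slirp4netns --version
```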
We'll need a `config.toml`, but it can be pretty minimal:

```toml
version = 2
root = "/home/tammy/.local/share/containerd"
state = "/tmp/1000-runtime-dir/containerd"

[grpc]
  address = "/tmp/1000-runtime-dir/containerd/containerd.sock"
```

First try:

```sh
~ ❯ rootlesskit --net=slirp4netns --copy-up=/etc --copy-up=/run \
  --state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
  sh -c "rm -f /run/containerd; exec containerd -c config.toml"
BusyBox v1.35.0 (2022-08-01 15:14:44 UTC) multi-call binary.

Usage: ip [OPTIONS] address|route|link|tunnel|neigh|rule [ARGS]

OPTIONS := -f[amily] inet|inet6|link | -o[neline]

ip addr add|del IFADDR dev IFACE | show|flush [dev IFACE] [to PREFIX]
ip route list|flush|add|del|change|append|replace|test ROUTE
ip link set IFACE [up|down] [arp on|off] [multicast on|off]
        [promisc on|off] [mtu NUM] [name NAME] [qlen NUM] [address MAC]
        [master IFACE | nomaster] [netns PID]
ip tunnel add|change|del|show [NAME]
        [mode ipip|gre|sit] [remote ADDR] [local ADDR] [ttl TTL]
ip neigh show|flush [to PREFIX] [dev DEV] [nud STATE]
ip rule [list] | add|del SELECTOR ACTION
[rootlesskit:parent] error: failed to setup network &{logWriter:0xc00014aa00 binary:slirp4netns mtu:65520 ipnet:<nil> disableHostLoopback:true apiSocketPath: enableSandbox:false enableSeccomp:false enableIPv6:false ifname:tap0 infoMu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} info:<nil>}: setting up tap tap0: executing [[nsenter -t 28611 -n -m -U --preserve-credentials ip tuntap add name tap0 mode tap] [nsenter -t 28611 -n -m -U --preserve-credentials ip link set tap0 up]]: exit status 1
[rootlesskit:child ] error: parsing message from fd 3: EOF
```

> **(Doll)** That looks like it broke, Miss.
>
> **(Ashe)** *sigh*, yeah, that's broken alright. That output looks like `ip` didn't like the command supplied to it,
> so let's find out what that was.

Some troubleshooting later, it looks like this is down to BusyBox's implementation of the `ip` commands.
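One possible workaround, which we haven't verified and mention only as a sketch: Alpine's full `iproute2` package replaces the BusyBox `ip` applet with an implementation that does have the `tuntap` subcommand rootlesskit is trying to call:

```sh
# Possible (unverified) workaround: the full iproute2 package replaces
# BusyBox's ip applet, and real ip(8) supports `ip tuntap`.
doas apk add iproute2
```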
We've raised [an issue](https://github.com/rootless-containers/slirp4netns/issues/304), and we'll see how that goes. In the meantime, we'll just have to use native networking. This means we can't apply firewall rules per-container, which is moderately annoying, but won't actually hinder deployment; it just makes securing it more tedious.

So let's try without the `--net=slirp4netns` (omitting anything that's INFO):

```sh
~ ❯ rootlesskit --copy-up=/etc --copy-up=/run \
  --state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
  sh -c "rm -f /run/containerd; exec containerd -c config.toml"
WARN[2022-11-03T11:32:53.207241941+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T11:32:53.227691744+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
```

A few things of note here:

```sh
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
[...]
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
```

The warning tells us that containerd tried to create `/opt/containerd`, but was unable to.
This is easy enough to fix:

```sh
~ ❯ doas mkdir /opt/containerd
~ ❯ doas chmod 2770 /opt/containerd
~ ❯ doas chown root:ctr /opt/containerd # Replace the user and group here as necessary
```

The error is more interesting. CRI here stands for [Container Runtime Interface](https://github.com/containerd/cri), and it seems to be used for Kubernetes. Since we won't be using Kubernetes here, we can just disable it by adding `disabled_plugins = ["io.containerd.grpc.v1.cri"]` to our `config.toml`.

> **(Tammy)** If you *are* interested in Kubernetes, make sure to check out our
> [Home Server Build-Out]({{< ref "home-server-build-out" >}}) series.
> We're planning on setting up an entire cloud environment there.

Let's try that again (cutting out any info stuff):

```sh
[...]
WARN[2022-11-03T16:18:35.425339343+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T16:18:35.427868986+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
ERRO[2022-11-03T16:18:35.430061527+11:00] failed to initialize a tracing processor "otlp" error="no OpenTelemetry endpoint: skip plugin"
containerd successfully booted in 0.024502s
[...]
```

That's cleaned up those issues, but we still have two warnings about `devmapper`, and `containerd` couldn't find an OpenTelemetry endpoint. We'll be skipping OpenTelemetry for now, but that sounds like a fun topic for a second blog post alongside setting up Grafana.

> **(Doll)** Doll will remember! Will remind Miss' to make a post about this!

### Setting up devmapper

`devmapper` is one of a few [snapshotters](https://github.com/containerd/containerd/tree/main/docs/snapshotters) that `containerd` can use. It's not the most performant (that honour goes to `overlayfs`), but it is one of the most robust, and least likely to break. This is more important to us than pure performance.
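As an aside, if you want to see which snapshotters a running containerd actually exposes, `ctr` (which ships with containerd) can list its registered plugins. The socket path below is the one from our `config.toml`; adjust it for your setup:

```sh
# List containerd's registered plugins, filtered to snapshotters.
# The --address matches the grpc address set in our config.toml.
ctr --address /tmp/1000-runtime-dir/containerd/containerd.sock \
    plugins ls | grep snapshotter
```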
If you're following along at home, you'll have to decide which storage driver is best for your use-case.

Following the [setup guide](https://github.com/containerd/containerd/blob/main/docs/snapshotters/devmapper.md), we'll need `dmsetup` installed. Under Alpine, this is provided by the `device-mapper` package, which we already have installed. We've also got a 100GB block device attached to this VPS, so let's get that provisioned too.

#### Partitioning our block device

We can use `fdisk` to partition our block device. `fdisk -l` lists all devices and partitions.

```
~ ❯ doas fdisk -l
Disk /dev/vda: 25 GB, 26843545600 bytes, 52428800 sectors
52012 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Device    Boot StartCHS    EndCHS      StartLBA     EndLBA    Sectors Size  Id Type
/dev/vda1 *    2,0,33      205,3,19        2048     206847     204800 100M  83 Linux
/dev/vda2      205,3,20    1023,15,63    206848   52428799   52221952 24.9G 8e Linux LVM

Disk /dev/vdb: 100 GB, 107374182400 bytes, 209715200 sectors
208050 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/vdb doesn't contain a valid partition table

Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/dm-1: 23 GB, 24670896128 bytes, 48185344 sectors
2999 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/dm-1 doesn't contain a valid partition table
```

We know that our VPS has a 25GB disk, so `/dev/vdb` is our 100GB block device. We can partition it with `doas fdisk /dev/vdb`. Let's see how we do that:

```sh
~ ❯ doas fdisk /dev/vdb
Device contains neither a valid DOS partition table, nor Sun, SGI, OSF or GPT disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that the previous content
won't be recoverable.
The number of cylinders for this disk is set to 208050. There is nothing wrong with that, but this is larger than 1024, and could in certain setups cause problems with: 1) software that runs at boot time (e.g., old versions of LILO) 2) booting and partitioning software from other OSs (e.g., DOS FDISK, OS/2 FDISK) Command (m for help): n Partition type p primary partition (1-4) e extended p Partition number (1-4): 1 First sector (63-209715199, default 63): Using default value 63 Last sector or +size{,K,M,G,T} (63-209715199, default 209715199): Using default value 209715199 Command (m for help): w The partition table has been altered. Calling ioctl() to re-read partition table ``` Running `fdisk -l` again: ```sh [...] Device Boot StartCHS EndCHS StartLBA EndLBA Sectors Size Id Type /dev/vdb1 0,1,1 1023,15,63 63 209715199 209715137 99.9G 83 Linux Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors 250 cylinders, 255 heads, 63 sectors/track Units: sectors of 1 * 512 = 512 bytes [...] ``` Looks like that worked. #### Adding the formatted block device into LVM Let's get this added into LVM. First, we need to create a physical volume with the `pvcreate` command: ```sh ~ ❯ doas pvcreate /dev/vdb1 Physical volume "/dev/vdb1" successfully created. ``` Let's create a new Volume Group for our workload data. There are two reasons for this: 1. This will make it easier to extend in the future; and 2. Our block device is spinning rust, and we don't necessarily want to mix SSDs with spinning rust. 
With that in mind, we'll leave the existing VG, `vg0`, as the volume group for programs and container images:

```sh
~ ❯ doas vgcreate data /dev/vdb1
  Volume group "data" successfully created
~ ❯ doas vgdisplay data
  --- Volume group ---
  VG Name               data
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <100.00 GiB
  PE Size               4.00 MiB
  Total PE              25599
  Alloc PE / Size       0 / 0
  Free  PE / Size       25599 / <100.00 GiB
  VG UUID               679FIe-aF9e-yBRy-bRH6-wRlY-KPgz-yUpXL9
```

> **(Doll)** Is it working Miss? Doll wants to see websites in treasure chests go zoom!
>
> **(Ashe)** **Containers**, dear Doll. And yes, yes it is.
> Only a few more steps and we'll be ready to start bringing things online, don't worry.

Speaking of, next we need to create our logical volumes. We'll create two: one for our container scratch storage, and one for persistent storage. We'll size scratch at 30GiB, and persistent at 70GiB. Let's get that done:

```sh
~ ❯ doas lvcreate -n persist --size 70G data
  Logical volume "persist" created.
~ ❯ doas lvcreate -n scratch --size 30G data
  Volume group "data" has insufficient free space (7679 extents): 7680 required.
```

> **(Selene)** Oh interesting. What happened there?
>
> **(Ashe)** Our theoretically 100GiB device has one extent less than 100GiB, so we couldn't divide it into exactly 30/70.
>
> **(Tammy)** Wait is that why `fdisk` said the device was 99.9G?
>
> **(Ashe)** Good catch. Yeah. 100GiB doesn't divide evenly into whole cylinders, so we end up with one cylinder
> too few, and therefore—
>
> **(Tammy)** One extent too few! Sneaky!
>
> **(Ashe)** Yup. Actually, now that I look at it again, I forgot to make space for the metadata, so this works out
> nicely.
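For anyone who wants to check that arithmetic, the extent counts from `vgdisplay` line up exactly (a back-of-the-envelope sketch using the 4 MiB PE size reported above):

```sh
# Back-of-the-envelope check of the failed lvcreate, using the
# 4 MiB PE size and Total PE count from vgdisplay above.
pe_mib=4
total_pe=25599                          # Total PE for the "data" VG
persist_pe=$(( 70 * 1024 / pe_mib ))    # 70 GiB -> 17920 extents
scratch_pe=$(( 30 * 1024 / pe_mib ))    # 30 GiB -> 7680 extents
free_pe=$(( total_pe - persist_pe ))    # extents left after "persist"
echo "scratch needs $scratch_pe extents, only $free_pe free"
# → scratch needs 7680 extents, only 7679 free
```

Which matches the `insufficient free space (7679 extents): 7680 required` error exactly.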
#### Creating our nerdctl thin pool

Docker and nerdctl can control a block device directly, using it as a storage driver via device-mapper, so we'll be letting nerdctl do that for its mainline storage, and using our "persist" volume for nerdctl volumes (which are persistent). For this we'll need `device-mapper`, `lvm2-dmeventd`, and `thin-provisioning-tools`, so we'll `apk add` those in.

> **(Ashe)** I'm going to skip showing the terminal output for installing packages from here on in to save space.
> I'm sure you've gotten the idea by now.

First up is creating a thin pool, which we'll do as follows:

```sh
~ ❯ doas lvcreate --wipesignatures y -n scratch data -l 95%FREE
  Logical volume "scratch" created.
~ ❯ doas lvcreate --wipesignatures y -n scratchmeta data -l 10%FREE
  Logical volume "scratchmeta" created.
~ ❯ doas lvconvert -y --zero n -c 512K --thinpool data/scratch --poolmetadata data/scratchmeta
  Thin pool volume with chunk size 512.00 KiB can address at most 126.50 TiB of data.
  WARNING: Converting data/scratch and data/scratchmeta to thin pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
  Converted data/scratch and data/scratchmeta to thin pool.
~ ❯
```

So what did we do here?

> **(Doll)** Ooh! Ooh! Doll knows! Miss created one LV, umm, Logical Volume, taking up 95% of the free space, and one
> taking up 10% of the free space... remaining free space? So ummm, ummm, 152 MiB?
>
> **(Ashe)** That's right! What next?
>
> **(Doll)** We umm. Combine the two into one? This one is confuse.
>
> **(Ashe)** Okay, I'll try to keep it simple. A normal (thick) pool allocates all of its data when we create it. So
> all the space is reserved ahead of time. You can write to whatever bit of it you want, whenever you want.
> Imagine something like a notebook you bought. A thin pool isn't like that. It initialises a small area
> with zeroes, but otherwise leaves the rest of the device alone. Like you have a page, and you ask the store for
> another blank page every time you get close to filling up your page.
> So, what would happen if I wrote a 100M file that was all zeroes?
>
> **(Selene)** Let's see if I understand. Well, you'd write the file metadata, and allocate some space... Wait, who's
> keeping track of the size of the volume?
>
> **(Ashe)** Precisely, Selene. You need a metadata volume that contains information about the assigned blocks in
> the thin pool, since it wasn't allocated all at once. So we create a pool for that, and then combine the two into our
> final thin pool.

That done, we can configure autoextension by creating `/etc/lvm/profile/data-scratch.profile`:

```sh
activation {
    thin_pool_autoextend_threshold=80
    thin_pool_autoextend_percent=10
}
```

Apply said profile with `doas lvchange --metadataprofile data-scratch data/scratch`, and check if the thin pool is being monitored:

```sh
~ ❯ doas lvs -o+seg_monitor
  LV      VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Monitor
  persist data -wi-a-----  70.00g
  scratch data twi---t--- <28.50g
  lv_root vg0  -wi-ao---- <22.98g
  lv_swap vg0  -wi-ao----   1.92g
```

Looks good. Were the LV not monitored, we would see `not monitored` at the end of the `scratch data` line. Were that the case, we could fix it with `doas lvchange --monitor y data/scratch`.

#### Formatting the new Logical Volume

Our final step is to format the LV we'll be using for persistent volumes. We'll be using plain-old ext4 for this, as we neither need nor want to get fancy here.
```sh
~ ❯ doas mkfs.ext4 /dev/data/persist
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done
Creating filesystem with 18349056 4k blocks and 4587520 inodes
Filesystem UUID: c0a59a7b-1969-4476-9d2c-11af32628337
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424

Allocating group tables: done
Writing inode tables: done
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done
```

#### Mounting our new logical volumes and setting up automount

Final step. Mounting the drive is relatively simple:

```sh
~ ❯ doas mkdir /data
~ ❯ doas chmod 2770 /data
~ ❯ doas mount /dev/data/persist /data
~ ❯ doas chown root:ctr /data -R
```

From here, we can configure `/etc/fstab` so it's automatically mounted at boot. To achieve that, we'll add the following line to `/etc/fstab`:

```fstab
/dev/data/persist /data ext4 rw,relatime 0 0
```

We don't need to mount the scratch LV (Logical Volume), as containerd will be controlling that directly. And we should be good to go. The last thing to do is add a minimal devmapper config to our `config.toml`:

```toml
[...]
[plugins]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = "/opt/containerd/devmapper"
    pool_name = "data-scratch"
    base_image_size = "1024MB"
[...]
```

Let's see what happens when we launch `containerd` again:

```sh
WARN[2022-11-07T00:33:26.218437232+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="dmsetup version error: Library version:   1.02.170 (2020-03-24)
/dev/mapper/control: open failed: Permission denied
Failure to communicate with kernel device-mapper driver.
Check that device-mapper is available in the kernel.
Incompatible libdevmapper 1.02.170 (2020-03-24) and kernel driver (unknown version).
Command failed.
: exit status 1"
```

> **(Tammy)** That doesn't look great.
>
> **(Ashe)** No. It does not. Hmm. Let's investigate.
>
> **(Ashe)** Ah. Found it.
Looks like devmapper isn't supported in
> [rootless configs](https://github.com/containerd/nerdctl/blob/main/docs/rootless.md#snapshotters). Now we know.

{{< alert >}}
**(Ashe)** Rootless containerd does **not** support the devmapper snapshotter!
{{< /alert >}}

> **(Octavia)** And on that bombshell, I think it's about time we wrapped this up. Looks like we'll have to make this
> into a series.
>
> **(Tammy)** Hopefully we'll have better luck next time.