snek-tech-blog/content/posts/rootless-containers-alpine.md


title: Rootless Containers on Alpine
date: 2022-11-08T19:30:15+11:00
lastmod: 2022-11-09T15:30:15+11:00
draft: false
showSummary: true
summary: We recently murdered a server's terminal via `do_distro_upgrade`, and thought it'd be a good time to learn more about containers and Alpine.
series: Rootless Containers on Alpine Linux
series_order: 1

Part One: Prep Work

Background

(Ashe) So. We recently murdered a server's terminal via do_distro_upgrade.

(Tammy) Was it really that bad?

(Ashe) Yes.

%  man 7z
WARNING: terminal is not fully functional
-  (press RETURN)%

It was in fact that bad. So we figured, well, we can spend a few hours, days, whatever fixing this...

(Tammy) Or we could just build a new server!

(Ashe) Right.

So, after asking some friends about their opinions, we settled on Alpine Linux. And why not also migrate all of our pm2 workloads to containers while we're at it? We've been meaning to learn more about containers for a while now.

So off we go!

Prep Work

We need a few things before we actually set up rootless containers. We'll be following along with the Official Rootless Containers Tutorial, making adjustments as necessary.

Login Information

Most Rootless Container implementations use $XDG_RUNTIME_DIR to find the user's ID and where their runtime lives (usually some subdir of /run/user/). Systemd-based Linux distros set this automatically, but Alpine uses OpenRC, which does not.

While Alpine doesn't provide a tutorial for Rootless Containers, we can adapt some of the prep work done for Wayland to get OpenRC to set $XDG_RUNTIME_DIR for us.

We just create /etc/profile.d/xdg_runtime_dir.sh like so:

if test -z "${XDG_RUNTIME_DIR}"; then
  export XDG_RUNTIME_DIR=/tmp/$(id -u)-runtime-dir
  if ! test -d "${XDG_RUNTIME_DIR}"; then
    mkdir "${XDG_RUNTIME_DIR}"
    chmod 0700 "${XDG_RUNTIME_DIR}"
  fi
fi

And, log out and then back in...

~  env
[...]
XDG_RUNTIME_DIR=/tmp/1000-runtime-dir
[...]
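If you'd like a quick sanity check on the directory itself, here's a minimal sketch. The XDG Base Directory spec wants the runtime dir owned by the user with mode 0700, and that's all this checks. The function name check_runtime_dir is ours, not part of any tool.

```python
import os
import stat


def check_runtime_dir(path, uid):
    """Return a list of problems found with a candidate XDG_RUNTIME_DIR."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return ["%s does not exist" % path]

    problems = []
    if not stat.S_ISDIR(info.st_mode):
        problems.append("%s is not a directory" % path)
    if info.st_uid != uid:
        problems.append("%s is owned by uid %d, expected %d" % (path, info.st_uid, uid))
    mode = stat.S_IMODE(info.st_mode)
    if mode != 0o700:
        problems.append("%s has mode %o, expected 700" % (path, mode))
    return problems
```

Calling check_runtime_dir(os.environ["XDG_RUNTIME_DIR"], os.getuid()) after logging back in should give an empty list if the profile script did its job.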

With that done, we can move on to our next steps.

Sysctl

There's some sysctl config required for older distros, but this isn't required for Alpine.

User Namespace Configuration

Rootless Containers use User Namespaces, subUIDs, and subGIDs, so we'll need to have those working. The apk package shadow-subids provides that functionality for us.

~  apk info shadow-subids
shadow-subids-4.10-r3 description:
Utilities for using subordinate UIDs and GIDs

shadow-subids-4.10-r3 webpage:
https://github.com/shadow-maint/shadow

shadow-subids-4.10-r3 installed size:
140 KiB

Sub-ID Counts

Rootless Containers generally expect /etc/subuid and /etc/subgid to contain at least 65,536 sub-IDs for each user. shadow-subids does create these files for us, but leaves them empty by default, so let's go ahead and populate them. The page on subIDs provides a handy Python script to do that for us, which we'll edit slightly so it's not writing directly to system files:

# Write one "id:start:count" line per possible user ID into local files
# named subuid and subgid, to be copied into /etc/ afterwards.
for name in ("subuid", "subgid"):
    with open(name, "w") as f:
        for uid in range(1000, 65536):
            f.write("%d:%d:65536\n" % (uid, uid * 65536))

This is probably overkill for our use-case, but that's also fine.
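Before copying the files into /etc/, it can be worth checking that a given user really ends up with at least 65,536 sub-IDs. A rough sketch, assuming the owner:start:count line format the script above writes; parse_subid and subid_count are names we made up for this.

```python
def parse_subid(text):
    """Parse subuid/subgid-style text into {owner: [(start, count), ...]}."""
    ranges = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        owner, start, count = line.split(":")
        ranges.setdefault(owner, []).append((int(start), int(count)))
    return ranges


def subid_count(ranges, owner):
    """Total number of subordinate IDs allocated to one owner."""
    return sum(count for _start, count in ranges.get(owner, []))
```

For example, subid_count(parse_subid(open("subuid").read()), "1000") should come back as 65536 for the files generated above.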

(Doll) So this one just runs script and copies to /etc/?

(Ashe) Yes Doll, that's right.

With that done, we can move on to the last prep step.

CGroups V2

To limit resources that a container can use, we need to enable CGroups V2. In OpenRC, this can be done by changing some options in /etc/rc.conf.

To enable CGroups in general, we need to set rc_controller_cgroups to YES:

# This switch controls whether or not cgroups version 1 controllers are
# individually mounted under
# /sys/fs/cgroup in hybrid or legacy mode.
rc_controller_cgroups="YES"

From here, we can enable CGroups V2 by setting rc_cgroup_mode to unified:

# This sets the mode used to mount cgroups.
# "hybrid" mounts cgroups version 2 on /sys/fs/cgroup/unified and
# cgroups version 1 on /sys/fs/cgroup.
# "legacy" mounts cgroups version 1 on /sys/fs/cgroup
# "unified" mounts cgroups version 2 on /sys/fs/cgroup
rc_cgroup_mode="unified"

(Doll) Doll confused.

(Ashe) So was I, for a bit. Despite what rc.conf says, cgroups V2 does not seem to be enabled on Alpine unless rc_cgroup_mode is set to unified. The Alpine Wiki seems to agree here, but isn't super clear. We'll find out if this is sufficient.

Next step is configuring the controllers we want to use:

# This is a list of controllers which should be enabled for cgroups version 2
# when hybrid mode is being used.
# Controllers listed here will not be available for cgroups version 1.
rc_cgroup_controllers="cpuset cpu io memory hugetlb pids"

Finally, we can add cgroups to a runlevel so that it's started automatically at boot:

rc-update add cgroups

From here, we can reboot, and continue on. If you don't want to reboot, you can start the cgroup service manually:

rc-service cgroups start
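To find out whether that was sufficient, one option is to look for the cgroup.controllers file, which only exists at the root of a cgroups V2 (unified) hierarchy. A small sketch under that assumption; the function name is ours, and the root argument is only there so it can be pointed at a test directory.

```python
import os


def cgroup2_controllers(root="/sys/fs/cgroup"):
    """Return the controllers listed in cgroup.controllers, or None if absent."""
    path = os.path.join(root, "cgroup.controllers")
    try:
        with open(path) as f:
            return f.read().split()
    except FileNotFoundError:
        # No such file: this is not a cgroups V2 mount point.
        return None
```

If cgroup2_controllers() returns None after the service has started, /sys/fs/cgroup is probably still in hybrid or legacy mode.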

Creating a group for our container users

We'll quickly create a group for all users who'll be using rootless containers here. In Alpine, this is as simple as doas addgroup ctr. We'll make use of this later.

Installing containerd and friends

First up we'll need to install containerd (to host our containers) and slirp4netns (to provide user-mode networking for the container's network namespace, with lower overhead than VPNKit), so we just:

doas apk add containerd
doas apk add slirp4netns

Next, we need to install nerdctl and rootlesskit. Both of these are currently only found inside the testing repo for Alpine. We can pull them in without subscribing to the entire testing repo like so:

doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ nerdctl
doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ rootlesskit

Configuring the Rootless containerd service

We'll be using nerdctl as our containerd controller of choice. It comes with a rootless containerd.service, but since Alpine doesn't use systemd, we'll have to adapt this into an rc service.

We spent some time trying to adapt the install script nerdctl provides to our purposes, however this is a bit excessive for what we need, so we'll just do it the "hard way".

(Tammy) Wait, this isn't the "hard way", is it?

(Ashe) Nope. Adapting a 500 line script would be hard and annoying. We're better served by just doing it manually, and providing instructions for anyone following along. So in that vein:

Getting containerd running in rootlesskit

First, let's get containerd running at the CLI, and then we can make it into an OpenRC Script. We'll need a config.toml, but it can be pretty minimal:

version = 2
root = "/home/tammy/.local/share/containerd"
state = "/tmp/1000-runtime-dir/containerd"

[grpc]
  address = "/tmp/1000-runtime-dir/containerd/containerd.sock"
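The paths in that config are specific to our user (tammy, UID 1000). If you're following along with a different user, something like this sketch could fill them in for you. render_config and CONFIG_TEMPLATE are our own hypothetical names; the template is just the config above with the two per-user paths swapped for placeholders.

```python
CONFIG_TEMPLATE = """\
version = 2
root = "{home}/.local/share/containerd"
state = "{runtime_dir}/containerd"

[grpc]
  address = "{runtime_dir}/containerd/containerd.sock"
"""


def render_config(home, runtime_dir):
    """Fill in the per-user paths of the minimal config.toml."""
    return CONFIG_TEMPLATE.format(home=home, runtime_dir=runtime_dir)
```

render_config(os.path.expanduser("~"), os.environ["XDG_RUNTIME_DIR"]) (with os imported) would produce the version for whoever is logged in.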

First try:

~  rootlesskit --net=slirp4netns --copy-up=/etc --copy-up=/run \
  --state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
  sh -c "rm -f /run/containerd; exec containerd -c config.toml"

BusyBox v1.35.0 (2022-08-01 15:14:44 UTC) multi-call binary.

Usage: ip [OPTIONS] address|route|link|tunnel|neigh|rule [ARGS]

OPTIONS := -f[amily] inet|inet6|link | -o[neline]

ip addr add|del IFADDR dev IFACE | show|flush [dev IFACE] [to PREFIX]
ip route list|flush|add|del|change|append|replace|test ROUTE
ip link set IFACE [up|down] [arp on|off] [multicast on|off]
	[promisc on|off] [mtu NUM] [name NAME] [qlen NUM] [address MAC]
	[master IFACE | nomaster] [netns PID]
ip tunnel add|change|del|show [NAME]
	[mode ipip|gre|sit] [remote ADDR] [local ADDR] [ttl TTL]
ip neigh show|flush [to PREFIX] [dev DEV] [nud STATE]
ip rule [list] | add|del SELECTOR ACTION
[rootlesskit:parent] error: failed to setup network &{logWriter:0xc00014aa00 binary:slirp4netns mtu:65520 ipnet:<nil> disableHostLoopback:true apiSocketPath: enableSandbox:false enableSeccomp:false enableIPv6:false ifname:tap0 infoMu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} info:<nil>}: setting up tap tap0: executing [[nsenter -t 28611 -n -m -U --preserve-credentials ip tuntap add name tap0 mode tap] [nsenter -t 28611 -n -m -U --preserve-credentials ip link set tap0 up]]: exit status 1
[rootlesskit:child ] error: parsing message from fd 3: EOF

(Doll) That looks like it broke, Miss.

(Ashe) sigh, yeah, that's broken alright. That output looks like ip didn't like the command supplied to it, so let's find out what that was.

Some troubleshooting later, it looks like this is to do with BusyBox's implementation of the ip commands. We've raised an issue, and we'll see how that goes. In the meantime, we'll just have to use native networking. This means we can't apply firewall rules per-container, which is moderately annoying, but won't actually hinder deployment. It just makes securing the deployment more annoying.
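If you want to know whether your own system would hit the same thing, a rough check is to see whether ip resolves to the BusyBox binary. This sketch assumes BusyBox applets are installed as symlinks to a file named busybox, which is how Alpine ships them by default; resolves_to_busybox is a name we made up.

```python
import os
import shutil


def resolves_to_busybox(command, path=None):
    """True if `command` on the search path resolves to a file named busybox."""
    found = shutil.which(command, path=path)
    if found is None:
        return False
    # Follow symlinks to the real binary and compare its file name.
    return os.path.basename(os.path.realpath(found)) == "busybox"
```

resolves_to_busybox("ip") returning True suggests the tuntap call above would be handled by BusyBox's ip.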

So let's try without the --net=slirp4netns (omitting anything that's INFO):

~  rootlesskit --copy-up=/etc --copy-up=/run \
--state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
sh -c "rm -f /run/containerd; exec containerd -c config.toml"
WARN[2022-11-03T11:32:53.207241941+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="devmapper not configured"
WARN[2022-11-03T11:32:53.227691744+11:00] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt  error="mkdir /opt/containerd: permission denied"
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods  error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
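We trimmed the INFO lines by hand, but it's easy enough to script. A minimal sketch, assuming every log line starts with a four-letter level followed by a bracketed timestamp, as in the output above; drop_info is our own helper name.

```python
import re

# Matches the level prefix at the start of a line, e.g. "WARN[" or "ERRO[".
LEVEL_RE = re.compile(r"^([A-Z]{4})\[")


def drop_info(lines):
    """Keep every line except those whose level prefix is INFO."""
    kept = []
    for line in lines:
        match = LEVEL_RE.match(line)
        if match and match.group(1) == "INFO":
            continue
        kept.append(line)
    return kept
```

Lines without a recognisable prefix are kept, so continuation lines from multi-line errors don't get lost.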

A few things of note here:

WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt  error="mkdir /opt/containerd: permission denied"
[...]
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods  error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"

The warning tells us that it tried to create /opt/containerd, but was unable to. This is easy enough to fix:

~  doas mkdir /opt/containerd
~  doas chmod 2770 /opt/containerd
~  doas chown root:ctr /opt/containerd #Replace the username and group here as necessary

The error is more interesting. CRI here stands for Container Runtime Interface, and it seems to be used for Kubernetes. Since we won't be using Kubernetes here, we can just disable it by adding disabled_plugins = ["io.containerd.grpc.v1.cri"] to our config.toml.

(Tammy) If you are interested in Kubernetes, make sure to check out our [Home Server Build-Out]({{< ref "home-server-build-out" >}}) series. We're planning on setting up an entire cloud environment there.

Let's try that again (cutting out any info stuff):

[...]
WARN[2022-11-03T16:18:35.425339343+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="devmapper not configured"
WARN[2022-11-03T16:18:35.427868986+11:00] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
ERRO[2022-11-03T16:18:35.430061527+11:00] failed to initialize a tracing processor "otlp"  error="no OpenTelemetry endpoint: skip plugin"
containerd successfully booted in 0.024502s
[...]

That's cleaned up those issues, but we still have two warnings about devmapper, and containerd couldn't find an OpenTelemetry endpoint.

We'll be skipping OpenTelemetry for now, but that sounds like a fun topic for a second blog post alongside setting up Grafana.

(Doll) Doll will remember! Will remind Miss' to make a post about this!

Setting up devmapper

devmapper is one of a few snapshotters that containerd can use. It's not the most performant (that honour goes to overlayfs), but it is one of the most robust, and least likely to break. This is more important to us than pure performance. If you're following along at home, you'll have to decide which storage driver is best for your use-case.

Following the setup guide, we'll need dmsetup installed. Under Alpine, this is provided by the device-mapper package, which we already have installed.

We've also got a 100GB block device attached to this VPS, so let's get that provisioned too.

Mounting and Formatting our block device

We can use fdisk to partition our block device. fdisk -l lists all devices and partitions.

~  doas fdisk -l
Disk /dev/vda: 25 GB, 26843545600 bytes, 52428800 sectors
52012 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Device  Boot StartCHS    EndCHS        StartLBA     EndLBA    Sectors  Size Id Type
/dev/vda1 *  2,0,33      205,3,19          2048     206847     204800  100M 83 Linux
/dev/vda2    205,3,20    1023,15,63      206848   52428799   52221952 24.9G 8e Linux LVM
Disk /dev/vdb: 100 GB, 107374182400 bytes, 209715200 sectors
208050 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/vdb doesn't contain a valid partition table
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 23 GB, 24670896128 bytes, 48185344 sectors
2999 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/dm-1 doesn't contain a valid partition table

We know that our VPS has a 25GB disk, so /dev/vdb is our 100GB block device. We can partition it with doas fdisk /dev/vdb. Let's see how we do that:

~  doas fdisk /dev/vdb
Device contains neither a valid DOS partition table, nor Sun, SGI, OSF or GPT disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that the previous content
won't be recoverable.


The number of cylinders for this disk is set to 208050.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): n
Partition type
   p   primary partition (1-4)
   e   extended
p
Partition number (1-4): 1
First sector (63-209715199, default 63):
Using default value 63
Last sector or +size{,K,M,G,T} (63-209715199, default 209715199):
Using default value 209715199

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table

Running fdisk -l again:

[...]
Device  Boot StartCHS    EndCHS        StartLBA     EndLBA    Sectors  Size Id Type
/dev/vdb1    0,1,1       1023,15,63          63  209715199  209715137 99.9G 83 Linux
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
[...]

Looks like that worked.

Adding the formatted block device into LVM

Let's get this added into LVM. First, we need to create a physical volume with the pvcreate command:

~  doas pvcreate /dev/vdb1
  Physical volume "/dev/vdb1" successfully created.

Let's create a new Volume Group for our workload data. There are two reasons for this:

  1. This will make it easier to extend in the future; and
  2. Our block device is spinning rust, and we don't necessarily want to mix SSDs with spinning rust.

With that in mind, we'll leave the existing VG, vg0, as the volume group for programs and container images, and create a new one named data:

~  doas vgcreate data /dev/vdb1
  Volume group "data" successfully created
~  doas vgdisplay data
  --- Volume group ---
  VG Name               data
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <100.00 GiB
  PE Size               4.00 MiB
  Total PE              25599
  Alloc PE / Size       0 / 0   
  Free  PE / Size       25599 / <100.00 GiB
  VG UUID               679FIe-aF9e-yBRy-bRH6-wRlY-KPgz-yUpXL9
   

(Doll) Is it working Miss? Doll wants to see websites in treasure chests go zoom!

(Ashe) Containers, dear Doll. And yes, yes it is. Only a few more steps and we'll be ready to start bringing things online, don't worry.

Speaking of, next we need to create our logical volumes. We'll create two. One for our container scratch storage, and one for persistent storage. We'll size scratch at 30GiB, and persistent at 70GiB. Let's get that done:

~  doas lvcreate -n persist --size 70G data
  Logical volume "persist" created.
~  doas lvcreate -n scratch --size 30G data
  Volume group "data" has insufficient free space (7679 extents): 7680 required.

(Selene) Oh interesting. What happened there?

(Ashe) Our theoretically 100GiB device has one extent less than 100GiB, so we couldn't divide it into exactly 30/70.

(Tammy) Wait is that why fdisk said the device was 99.9G?

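(Ashe) Good catch, and it's related. The partition starts at sector 63 rather than sector 0, and LVM keeps a little space at the start of the physical volume for its own header, so there's slightly less than 25,600 extents' worth of room left, and therefore...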

(Tammy) One extent too few! Sneaky!

(Ashe) Yup. Actually, now that I look at it again, I forgot to make space for the metadata, so this works out nicely.
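For anyone who wants to check the numbers, the arithmetic behind that error message works out like this. The sketch uses the PE Size and Total PE values from the vgdisplay output above; extents_for_gib is just a helper name we picked.

```python
PE_SIZE_MIB = 4     # "PE Size" from vgdisplay
TOTAL_PE = 25599    # "Total PE" from vgdisplay


def extents_for_gib(gib, pe_size_mib=PE_SIZE_MIB):
    """Number of physical extents needed for a volume of `gib` GiB."""
    return gib * 1024 // pe_size_mib


persist = extents_for_gib(70)       # extents used by the 70G "persist" LV
free_after = TOTAL_PE - persist     # extents left in the VG afterwards
scratch = extents_for_gib(30)       # extents a 30G "scratch" LV would need

print(persist, free_after, scratch)  # prints: 17920 7679 7680
```

7679 free against 7680 required is exactly the one-extent shortfall lvcreate complained about.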

Creating our nerdctl thin pool

Docker and nerdctl can control a block device directly to use as a storage driver via device-mapper, so we'll be letting nerdctl do that for its mainline storage, and using our "persistent" pool for nerdctl volumes (which are persistent).

For this we'll need device-mapper, lvm2-dmeventd, and thin-provisioning-tools, so we'll apk add those in.

(Ashe) I'm going to skip showing the terminal output for installing packages from here on in to save space. I'm sure you've gotten the idea by now.

First up is creating a thin pool, which we'll do as follows:

~  doas lvcreate --wipesignatures y -n scratch data -l 95%FREE
  Logical volume "scratch" created.
~  doas lvcreate --wipesignatures y -n scratchmeta data -l 10%FREE
  Logical volume "scratchmeta" created.
~  doas lvconvert -y --zero n -c 512K --thinpool data/scratch --poolmetadata data/scratchmeta
  Thin pool volume with chunk size 512.00 KiB can address at most 126.50 TiB of data.
  WARNING: Converting data/scratch and data/scratchmeta to thin pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
  Converted data/scratch and data/scratchmeta to thin pool.
~ 

So what did we do here?

(Doll) Ooh! Ooh! Doll knows! Miss created one LV, umm, Logical Volume, taking up 95% of the free space, and one taking up 10% of the free space... remaining free space? So ummm, ummm, 152 MiB?

(Ashe) That's right! What next?

(Doll) We umm. Combine the two into one? This one is confuse.

(Ashe) Okay, I'll try to keep it simple. A normal (thick) pool allocates all of its data when we create it. So all the space is reserved ahead of time. You can write to whatever bit of it you want, whenever you want. Imagine something like a notebook you bought. A thin pool isn't like that. It initialises a small area with zeroes, but otherwise leaves the rest of the device alone. Like you have a page, and you ask the store for another blank page every time you get close to filling up your page. So, what would happen if I wrote a 100M file that was all zeroes?

(Selene) Let's see if I understand. Well, you'd write the file metadata, and allocate some space... Wait who's keeping track of the size of the volume?

(Ashe) Precisely, Selene. You need a metadata volume that contains information about the assigned blocks in the thin pool, since it wasn't allocated all at once. So we create a pool for that, and then combine the two into our final thin pool.
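Doll's 152 MiB figure can be double-checked with a few lines of arithmetic. This sketch assumes LVM rounds a %FREE size down to whole extents, which is consistent with the sizes reported above but is our assumption rather than something we looked up.

```python
PE_SIZE_MIB = 4
free = 7679                      # extents left after creating "persist"

scratch = free * 95 // 100       # -l 95%FREE
free -= scratch
scratchmeta = free * 10 // 100   # -l 10%FREE of what remains

scratch_gib = scratch * PE_SIZE_MIB / 1024
scratchmeta_mib = scratchmeta * PE_SIZE_MIB

print(scratch, scratchmeta)          # prints: 7295 38
print(scratch_gib, scratchmeta_mib)  # prints: 28.49609375 152
```

That lines up with both the <28.50g that lvs reports for scratch and Doll's 152 MiB for scratchmeta.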

That done, we can configure autoextension by creating /etc/lvm/profile/data-scratch.profile:

activation {
        thin_pool_autoextend_threshold=80
        thin_pool_autoextend_percent=10
}
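To get a feel for what those two settings do: once pool usage goes above the threshold percentage, the pool is grown by the given percentage of its current size. Here's a toy model of that rule. It's a simplification we wrote for illustration, not how dmeventd is implemented, and it ignores the fact that a real extension needs free extents in the volume group.

```python
def autoextend(size_gib, used_gib, threshold=80, percent=10):
    """Grow the pool by `percent` of its size while usage exceeds `threshold` percent."""
    while used_gib / size_gib * 100 > threshold:
        size_gib += size_gib * percent / 100
    return size_gib
```

With our roughly 28.5 GiB pool, autoextend(28.5, 20) leaves the size alone (about 70% used), while autoextend(28.5, 24) grows it once to about 31.35 GiB, after which usage is back under 80%.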

Apply said profile with doas lvchange --metadataprofile data-scratch data/scratch, and check if the thin pool is being monitored:

~  doas lvs -o+seg_monitor
  LV      VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Monitor
  persist data -wi-a-----  70.00g
  scratch data twi---t--- <28.50g
  lv_root vg0  -wi-ao---- <22.98g
  lv_swap vg0  -wi-ao----   1.92g

Looks good. Were the LV not monitored, we would see not monitored at the end of the scratch data line. Were that the case, we could fix that with doas lvchange --monitor y data/scratch.

Formatting the new Logical Volume

Our final step is to format the LV we'll be using for persistent volumes. We'll be using plain-old ext4 for this, as I neither need nor want to get fancy here.

~  doas mkfs.ext4 /dev/data/persist
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done
Creating filesystem with 18349056 4k blocks and 4587520 inodes
Filesystem UUID: c0a59a7b-1969-4476-9d2c-11af32628337
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
	4096000, 7962624, 11239424

Allocating group tables: done
Writing inode tables: done
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done

Mounting our new logical drives and setting up automount

Final step. Mounting the drive is relatively simple:

~  doas mkdir /data
~  doas chmod 2770 /data
~  doas mount /dev/data/persist /data
~  doas chown root:ctr /data -R

From here, we can configure /etc/fstab so the volume is automatically mounted at boot.

To achieve that, we'll add the following line to /etc/fstab:

/dev/data/persist	/data		    ext4	rw,relatime 0 0

We don't need to mount the scratch LV (Logical Volume) as containerd will be controlling that directly.

And we should be good to go.

Last thing to do is add a minimal devmapper config to our config.toml:

[...]
[plugins]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = "/opt/containerd/devmapper"
    pool_name = "data-scratch"
    base_image_size = "1024MB"
[...]

Let's see what happens when we launch containerd again:

WARN[2022-11-07T00:33:26.218437232+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="dmsetup version
error: Library version:   1.02.170 (2020-03-24)
/dev/mapper/control: open failed: Permission denied
Failure to communicate with kernel device-mapper driver.
Check that device-mapper is available in the kernel.
Incompatible libdevmapper 1.02.170 (2020-03-24) and kernel driver (unknown version).
Command failed.

: exit status 1"

(Tammy) That doesn't look great.

(Ashe) No. It does not. Hmm. Let's investigate.

(Ashe) Ah. Found it. Looks like devmapper isn't supported in rootless configs. Now we know.

{{< alert >}} (Ashe) Rootless containerd does not support the devmapper snapshotter! {{< /alert >}}
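The first complaint in that log is that /dev/mapper/control couldn't be opened. A quick way to test for that specific symptom as an unprivileged user is sketched below. The function name is ours, and passing this check would not by itself mean the devmapper snapshotter works rootless; it only mirrors the "open failed: Permission denied" line.

```python
import os


def can_open_dm_control(path="/dev/mapper/control"):
    """True if the current user may read and write the device-mapper control node."""
    return os.access(path, os.R_OK | os.W_OK)
```

Run as our regular user, we'd expect can_open_dm_control() to return False here, matching the error above.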

An Interlude, Some Tea, And a Break

(Octavia) And on that bombshell, I think it's about time we wrapped this up. Ashe is looking pretty grumpy. Looks like we'll have to make this into a series.

(Tammy) Hopefully we'll have better luck next time.

(Doll) This one hopes you'll all join us next time!