We'll be using nerdctl as our containerd controller of choice. It comes with a rootless containerd.service,
but since Alpine doesn't use systemd, we'll have to adapt this into an OpenRC service.
We spent some time trying to adapt the [install script](https://github.com/containerd/nerdctl/blob/48f189a53a24c12838433f5bb5dd57f536816a8a/extras/rootless/containerd-rootless-setuptool.sh)
nerdctl provides for our purposes, but that's excessive for what we need,
so we'll just do it the "[hard way](https://github.com/containerd/containerd/blob/main/docs/rootless.md)".
**(Tammy)** Wait, this isn't the "hard way", is it?
**(Ashe)** Nope. Adapting a 500-line script would be hard and annoying. We're better served by just doing it manually,
and providing instructions for anyone following along. So in that vein:
### Getting containerd running in rootlesskit
First, let's get `containerd` running at the CLI, and then we can turn it into an OpenRC script.
We'll need a `config.toml`, but it can be pretty minimal:
```sh
sh -c "rm -f /run/containerd; exec containerd -c config.toml"
WARN[2022-11-03T11:32:53.207241941+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T11:32:53.227691744+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
```
A few things of note here:
```sh
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
[...]
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
```
The warning tells us that `containerd` tried to create `/opt/containerd`, but wasn't allowed to. This is easy enough to fix:
```sh
~ ❯ doas mkdir /opt/containerd
~ ❯ doas chmod 2770 /opt/containerd
~ ❯ doas chown root:ctr /opt/containerd # Replace the group here as necessary
```
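A quick aside on that `2770` mode: the leading `2` is the setgid bit, which makes files created inside the directory inherit its group (`ctr` here), and `770` grants full access to owner and group only. A small sketch on a throwaway directory, nowhere near `/opt`:

```sh
# Decode mode 2770 on a scratch directory (setgid + rwx for owner and group).
dir=$(mktemp -d)
chmod 2770 "$dir"
stat -c '%a %A' "$dir"   # 2770 drwxrws--- : the 's' is the setgid bit
rmdir "$dir"
```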
The error is more interesting. CRI here stands for [Container Runtime Interface](https://github.com/containerd/cri), which
is the API Kubernetes uses to talk to container runtimes. Since we won't be using Kubernetes here, we can just disable it by adding
`disabled_plugins = ["io.containerd.grpc.v1.cri"]` to our `config.toml`.
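For reference, our `config.toml` at this point might look something like the following. This is a sketch: the `disabled_plugins` line comes straight from above, but the `root` and `state` paths are our assumptions for a rootless setup, not values from the transcript.

```toml
version = 2

# Paths writable by our unprivileged user (assumed locations; adjust for your user).
root = "/home/ctr/.local/share/containerd"
state = "/run/user/1000/containerd"

# Disable the Kubernetes CRI plugin, as discussed above.
disabled_plugins = ["io.containerd.grpc.v1.cri"]
```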
**(Tammy)** If you *are* interested in Kubernetes, make sure to check out our [Home Server Build-Out]({{< ref "home-server-build-out" >}}) series. We're planning on setting up an entire cloud environment there.
Let's try that again (trimming the INFO lines):
```sh
[...]
WARN[2022-11-03T16:18:35.425339343+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T16:18:35.427868986+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
ERRO[2022-11-03T16:18:35.430061527+11:00] failed to initialize a tracing processor "otlp" error="no OpenTelemetry endpoint: skip plugin"
containerd successfully booted in 0.024502s
[...]
```
That's cleared up those issues, but we still have two warnings about `devmapper`,
and `containerd` couldn't find an OpenTelemetry endpoint.
We'll be skipping OpenTelemetry for now, but that sounds like a fun topic for a second blog post alongside setting up
Grafana.
**(Doll)** Doll will remember! Will remind Miss' to make a post about this!
### Setting up devmapper
`devmapper` is one of a few [snapshotters](https://github.com/containerd/containerd/tree/main/docs/snapshotters)
that `containerd` can use. It's not the most performant (that honour goes to `overlayfs`), but it is one of
the most robust and least likely to break. That matters more to us than raw performance.
If you're following along at home, you'll have to decide which storage driver is best for your use-case.
Following the [setup guide](https://github.com/containerd/containerd/blob/main/docs/snapshotters/devmapper.md),
we'll need `dmsetup` installed. Under Alpine, this is provided by the `device-mapper` package,
which we already have installed.
We've also got a 100GB block device attached to this VPS, so let's get that provisioned too.
#### Mounting and Formatting our block device
We can use `fdisk` to partition our block device. `fdisk -l` lists all devices and partitions.
```sh
~ ❯ doas fdisk -l
Disk /dev/vda: 25 GB, 26843545600 bytes, 52428800 sectors
52012 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Device    Boot StartCHS    EndCHS      StartLBA    EndLBA   Sectors  Size Id Type
/dev/vda1 *    2,0,33      205,3,19        2048    206847    204800  100M 83 Linux
/dev/vda2      205,3,20    1023,15,63    206848  52428799  52221952 24.9G 8e Linux LVM
Disk /dev/vdb: 100 GB, 107374182400 bytes, 209715200 sectors
208050 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Disk /dev/vdb doesn't contain a valid partition table
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 23 GB, 24670896128 bytes, 48185344 sectors
2999 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Disk /dev/dm-1 doesn't contain a valid partition table
```
We know that our VPS has a 25GB disk, so `/dev/vdb` is our 100GB block device. We can partition it with
`doas fdisk /dev/vdb`. Let's see how we do that:
```sh
~ ❯ doas fdisk /dev/vdb
Device contains neither a valid DOS partition table, nor Sun, SGI, OSF or GPT disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that the previous content
won't be recoverable.
The number of cylinders for this disk is set to 208050.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): n
Partition type
   p   primary partition (1-4)
   e   extended
p
Partition number (1-4): 1
First sector (63-209715199, default 63):
Using default value 63
Last sector or +size{,K,M,G,T} (63-209715199, default 209715199):
Using default value 209715199
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table
```
Running `fdisk -l` again:
```sh
[...]
Device    Boot StartCHS EndCHS      StartLBA    EndLBA   Sectors  Size Id Type
/dev/vdb1      0,1,1    1023,15,63        63 209715199 209715137 99.9G 83 Linux
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
[...]
```
Looks like that worked.
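If you'd rather not drive `fdisk` interactively, the same answers from the session above can be piped in. A sketch, and a destructive one, so double-check the device path before running it:

```sh
# n = new partition, p = primary, 1 = partition number,
# two empty lines accept the default first/last sectors, w = write.
printf 'n\np\n1\n\n\nw\n' | doas fdisk /dev/vdb
```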
#### Adding the formatted block device into LVM
Let's get this added into LVM. First, we need to create a physical volume with the `pvcreate`
command:
```sh
~ ❯ doas pvcreate /dev/vdb1
Physical volume "/dev/vdb1" successfully created.
```
Let's create a new Volume Group for our workload data. There are two reasons for this:
1. This will make it easier to extend in the future; and
2. Our block device is spinning rust, and we don't necessarily want to mix SSDs with spinning rust.
With that in mind, we'll leave the existing VG, `vg0`, as the volume group for programs and container images:
```sh
~ ❯ doas vgcreate data /dev/vdb1
Volume group "data" successfully created
~ ❯ doas vgdisplay data
  --- Volume group ---
  VG Name               data
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <100.00 GiB
  PE Size               4.00 MiB
  Total PE              25599
  Alloc PE / Size       0 / 0
  Free  PE / Size       25599 / <100.00 GiB
  VG UUID               679FIe-aF9e-yBRy-bRH6-wRlY-KPgz-yUpXL9
```
> **(Doll)** Is it working Miss? Doll wants to see websites in treasure chests go zoom!
>
> **(Ashe)** **Containers**, dear Doll. And yes, yes it is.
> Only a few more steps and we'll be ready to start bringing things online, don't worry.
Speaking of, next we need to create our logical volumes. We'll create two. One for our container scratch storage, and
one for persistent storage. We'll size scratch at 30GiB, and persistent at 70GiB. Let's get that done:
```sh
~ ❯ doas lvcreate -n persist --size 70G data
Logical volume "persist" created.
~ ❯ doas lvcreate -n scratch --size 30G data
Volume group "data" has insufficient free space (7679 extents): 7680 required.
```
> **(Selene)** Oh interesting. What happened there?
>
> **(Ashe)** Our theoretically 100GiB device has one extent less than 100GiB, so we couldn't divide it into exactly 30/70.
>
> **(Tammy)** Wait is that why `fdisk` said the device was 99.9G?
>
> **(Ashe)** Good catch. Yeah. 100GiB doesn't divide evenly into 504KiB cylinders (16 heads × 63 sectors × 512 bytes),
> so we end up with one cylinder too few, and therefore—
>
> **(Tammy)** One extent too few! Sneaky!
>
> **(Ashe)** Yup. Actually, now that I look at it again, I forgot to make space for the metadata, so this works out
> nicely.
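The arithmetic the three of them just walked through, spelled out (extent counts taken from the `vgdisplay` and `lvcreate` output above):

```sh
# LVM extents here are 4 MiB, so GiB convert to extents at 1024/4 = 256 per GiB.
echo $(( 100 * 1024 / 4 ))         # a full 100 GiB would be 25600 extents
echo $(( 25599 - 70 * 1024 / 4 ))  # extents free after "persist": 7679
echo $(( 30 * 1024 / 4 ))          # extents a 30 GiB "scratch" needs: 7680
```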
#### Creating our nerdctl thin pool
Docker and nerdctl can drive a block device directly via device-mapper as a storage backend, so we'll be letting
nerdctl do that for its mainline storage, and using our `persist` volume for nerdctl volumes (which are persistent).
For this we'll need `device-mapper`, `lvm2-dmeventd`, and `thin-provisioning-tools`, so we'll `doas apk add` those in.
**(Ashe)** I'm going to skip showing the terminal output for installing packages from here on in to save space. I'm sure
you've gotten the idea by now.
First up is creating a thin pool, which we'll do as follows:
```sh
~ ❯ doas lvcreate --wipesignatures y -n scratch data -l 95%FREE
Logical volume "scratch" created.
~ ❯ doas lvcreate --wipesignatures y -n scratchmeta data -l 10%FREE
Logical volume "scratchmeta" created.
~ ❯ doas lvconvert -y --zero n -c 512K --thinpool data/scratch --poolmetadata data/scratchmeta
Thin pool volume with chunk size 512.00 KiB can address at most 126.50 TiB of data.
WARNING: Converting data/scratch and data/scratchmeta to thin pool's data and metadata volumes with metadata wiping.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Converted data/scratch and data/scratchmeta to thin pool.
~ ❯
```
So what did we do here?
> **(Doll)** Ooh! Ooh! Doll knows! Miss created one LV, umm, Logical Volume, taking up 95% of the free space, and one
> taking up 10% of the free space... remaining free space? So ummm, ummm, 152 MiB?
>
> **(Ashe)** That's right! What next?
>
> **(Doll)** We umm. Combine the two into one? This one is confuse.
>
> **(Ashe)** Okay, I'll try to keep it simple. A normal (thick) pool allocates all of its data when we create it. So
> all the space is reserved ahead of time. You can write to whatever bit of it you want, whenever you want.
> Imagine something like a notebook you bought. A thin pool isn't like that. It initialises a small area
> with zeroes, but otherwise leaves the rest of the device alone. Like you have a page, and you ask the store for
> another blank page every time you get close to filling up your page.
> So, what would happen if I wrote a 100M file that was all zeroes?
>
> **(Selene)** Let's see if I understand. Well, you'd write the file metadata, and allocate some space... Wait who's
> keeping track of the size of the volume?
>
> **(Ashe)** Precisely, Selene. You need a metadata volume that contains information about the assigned blocks in
> the thin pool, since it wasn't allocated all at once. So we create a pool for that, and then combine the two into our
> final thin pool.
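Ashe's notebook analogy can be poked at without any LVM, using sparse files, which work on the same allocate-on-write idea (just an illustration; this isn't the devmapper mechanism itself):

```sh
# A sparse file has an apparent size but allocates blocks only when written,
# much like a thin pool hands out extents on demand.
f=$(mktemp)
truncate -s 100M "$f"         # set the apparent size without writing any data
du -h --apparent-size "$f"    # reports 100M
du -h "$f"                    # reports (almost) nothing actually allocated
rm "$f"
```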
That done, we can configure autoextension by creating `/etc/lvm/profile/data-scratch.profile`:
```sh
activation {
    thin_pool_autoextend_threshold=80
    thin_pool_autoextend_percent=10
}
```
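To put numbers on those two settings: once pool usage crosses the threshold, `dmeventd` grows the pool by the given percentage. A rough sketch with our roughly 28 GiB scratch pool (size approximated from the `95%FREE` allocation earlier):

```sh
pool_gib=28                                  # approximate scratch pool size (assumed)
echo $(( pool_gib * 80 / 100 ))              # GiB of use that triggers an extension: 22
echo $(( pool_gib + pool_gib * 10 / 100 ))   # pool size in GiB after one extension: 30
```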
Apply said profile with `doas lvchange --metadataprofile data-scratch data/scratch`, and check if the thin pool is being