---
title: "Rootless Containers on Alpine"
date: 2022-11-08T19:30:15+11:00
lastmod: 2022-11-09T15:30:15+11:00
draft: false
showSummary: true
summary: "We recently murdered a server's terminal via `do_distro_upgrade`, and thought it'd be a good time to learn
more about containers and Alpine."
series:
- "Rootless Containers on Alpine Linux"
series_order: 1
---

# Part One: Prep Work

## Background

> **(Ashe)**
> So. We recently murdered a server's terminal via `do_distro_upgrade`.
>
> **(Tammy)** Was it really that bad?
>
> **(Ashe)** Yes.

```
% man 7z
WARNING: terminal is not fully functional
- (press RETURN)%
```
It was in fact *that bad*. So we figured, well, we can spend a few hours, days, whatever fixing this...

> **(Tammy)** Or we could just build a new server!
>
> **(Ashe)** Right.

So, after asking some friends for their opinions, we settled on Alpine Linux. And why not also migrate all of our
pm2 workloads to containers while we're at it? We've been meaning to learn more about containers for a while now.

So off we go!

## Prep Work

We need a few things before we actually set up rootless containers. We'll be following along with the
[Official Rootless Containers Tutorial](https://rootlesscontaine.rs/getting-started/common/),
making adjustments as necessary.

### Login Information

Most Rootless Container implementations use `$XDG_RUNTIME_DIR` to find the user's ID and where their runtime lives
(usually some subdir of `/run/user/`).
Systemd-based Linux distros handle this automatically, but Alpine uses
[OpenRC](https://wiki.alpinelinux.org/wiki/OpenRC), which does not.

While Alpine doesn't provide a tutorial for Rootless Containers, we can adapt some of the prep work done for
[Wayland](https://wiki.alpinelinux.org/wiki/Wayland) to get OpenRC to set `$XDG_RUNTIME_DIR` for us.

We just create `/etc/profile.d/xdg_runtime_dir.sh` like so:
```sh
if test -z "${XDG_RUNTIME_DIR}"; then
  export XDG_RUNTIME_DIR=/tmp/$(id -u)-runtime-dir
  if ! test -d "${XDG_RUNTIME_DIR}"; then
    mkdir "${XDG_RUNTIME_DIR}"
    chmod 0700 "${XDG_RUNTIME_DIR}"
  fi
fi
```
Then, log out and back in...
```sh
~ ❯ env
[...]
XDG_RUNTIME_DIR=/tmp/1000-runtime-dir
[...]
```
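If you'd rather not log out and back in, sourcing the file from a POSIX shell should do the same job for that session:
```sh
. /etc/profile.d/xdg_runtime_dir.sh
```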
With that done, we can move onto our next steps.

### Sysctl
There's some sysctl config required on older distros, but none of it is needed on Alpine.

### User Namespace Configuration
Rootless Containers use User Namespaces, subUIDs, and subGIDs, so we'll need to have those working.
The apk package `shadow-subids` provides that functionality for us.
```
~ ❯ apk info shadow-subids
shadow-subids-4.10-r3 description:
Utilities for using subordinate UIDs and GIDs

shadow-subids-4.10-r3 webpage:
https://github.com/shadow-maint/shadow

shadow-subids-4.10-r3 installed size:
140 KiB
```
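If it isn't already on your system, installing it is a one-liner:
```sh
doas apk add shadow-subids
```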
### Sub-ID Counts
Rootless Containers generally expect `/etc/subuid` and `/etc/subgid` to contain at least 65,536 sub-IDs for each user.
`shadow-subids` does create these files for us, but leaves them empty by default, so let's go ahead and populate them.
The [page on subIDs](https://rootlesscontaine.rs/getting-started/common/subuid/) provides a handy Python script
to do that for us, which we'll edit slightly so it's not writing directly to system files:
```python
# Give each UID from 1000 up its own non-overlapping block of 65,536 sub-IDs,
# writing to local files rather than straight to /etc/.
for filename in ("subuid", "subgid"):
    with open(filename, "w") as f:
        for uid in range(1000, 65536):
            f.write("%d:%d:65536\n" % (uid, uid * 65536))
```
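Putting that into practice looks something like this (assuming we saved the script as `make_subids.py`; the filename is ours, pick whatever you like):
```sh
python3 make_subids.py          # writes ./subuid and ./subgid
doas cp subuid /etc/subuid      # install them where the tooling expects
doas cp subgid /etc/subgid
```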
This is probably overkill for our use-case, but that's also fine.

> **(Doll)** So this one just runs script and copies to /etc/?
>
> **(Ashe)** Yes Doll, that's right.

With that done, we can move onto the last prep step.

### CGroups V2
To limit the resources a container can use, we need to enable CGroups V2.
In OpenRC, this can be done by changing some options in `/etc/rc.conf`.

To enable CGroups in general, we need to set `rc_controller_cgroups` to `YES`:
```sh
# This switch controls whether or not cgroups version 1 controllers are
# individually mounted under
# /sys/fs/cgroup in hybrid or legacy mode.
rc_controller_cgroups="YES"
```
From here, we can enable CGroups V2 by setting `rc_cgroup_mode` to `unified`:
```sh
# This sets the mode used to mount cgroups.
# "hybrid" mounts cgroups version 2 on /sys/fs/cgroup/unified and
# cgroups version 1 on /sys/fs/cgroup.
# "legacy" mounts cgroups version 1 on /sys/fs/cgroup
# "unified" mounts cgroups version 2 on /sys/fs/cgroup
rc_cgroup_mode="unified"
```

> **(Doll)** Doll confused.
>
> **(Ashe)** So was I, for a bit. Despite what `rc.conf` says, cgroups V2 does *not* seem to be enabled on Alpine
> unless `rc_cgroup_mode` is set to `unified`. The [Alpine Wiki](https://wiki.alpinelinux.org/wiki/OpenRC#cgroups_v2)
> seems to agree here, but isn't super clear. We'll find out if this is sufficient.

Next step is configuring the controllers we want to use:
```sh
# This is a list of controllers which should be enabled for cgroups version 2
# when hybrid mode is being used.
# Controllers listed here will not be available for cgroups version 1.
rc_cgroup_controllers="cpuset cpu io memory hugetlb pids"
```
Finally, we can add cgroups to a runlevel so that it's started automatically at boot:
```sh
rc-update add cgroups
```
From here, we can reboot and continue on. If you don't want to reboot, you can start the cgroup service manually:
```sh
rc-service cgroups start
```
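Either way, once the service is up we can sanity-check that the unified hierarchy is actually mounted; under cgroups V2, `/sys/fs/cgroup` should contain a `cgroup.controllers` file listing the enabled controllers:
```sh
cat /sys/fs/cgroup/cgroup.controllers   # should list: cpuset cpu io memory hugetlb pids
```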
## Creating a group for our container users

We'll quickly create a group for all users who'll be using rootless containers here. In Alpine, this is as simple as
`doas addgroup ctr`. We'll make use of this later.

## Installing containerd and friends
First up, we'll need to install `containerd` (to host our containers) and
`slirp4netns` (to provide networking inside the container's network namespace with lower overhead than VPNKit), so we just:
```sh
doas apk add containerd
doas apk add slirp4netns
```

Next, we need to install `nerdctl` and `rootlesskit`. Both of these are currently only found inside
the `testing` repo for Alpine. We can pull them in without subscribing to the entire testing repo like so:
```sh
doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ nerdctl
doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ rootlesskit
```

## Configuring the Rootless containerd service
We'll be using nerdctl as our containerd controller of choice. It comes with a rootless containerd.service,
but since Alpine doesn't use systemd, we'll have to adapt this into an rc service.

We spent some time trying to adapt the [install script](https://github.com/containerd/nerdctl/blob/48f189a53a24c12838433f5bb5dd57f536816a8a/extras/rootless/containerd-rootless-setuptool.sh)
nerdctl provides to our purposes; however, this is a bit excessive for what we need,
so we'll just do it the "[hard way](https://github.com/containerd/containerd/blob/main/docs/rootless.md)".

> **(Tammy)** Wait, this isn't the "hard way", is it?
>
> **(Ashe)** Nope. Adapting a 500-line script would be hard and annoying. We're better served by just doing it manually,
> and providing instructions for anyone following along. So in that vein:

### Getting containerd running in rootlesskit
First, let's get containerd running at the CLI, and then we can make it into an OpenRC script.
We'll need a `config.toml`, but it can be pretty minimal:
```toml
version = 2
root = "/home/tammy/.local/share/containerd"
state = "/tmp/1000-runtime-dir/containerd"

[grpc]
  address = "/tmp/1000-runtime-dir/containerd/containerd.sock"
```
First try:
```sh
~ ❯ rootlesskit --net=slirp4netns --copy-up=/etc --copy-up=/run \
    --state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
    sh -c "rm -f /run/containerd; exec containerd -c config.toml"

BusyBox v1.35.0 (2022-08-01 15:14:44 UTC) multi-call binary.

Usage: ip [OPTIONS] address|route|link|tunnel|neigh|rule [ARGS]

OPTIONS := -f[amily] inet|inet6|link | -o[neline]

ip addr add|del IFADDR dev IFACE | show|flush [dev IFACE] [to PREFIX]
ip route list|flush|add|del|change|append|replace|test ROUTE
ip link set IFACE [up|down] [arp on|off] [multicast on|off]
        [promisc on|off] [mtu NUM] [name NAME] [qlen NUM] [address MAC]
        [master IFACE | nomaster] [netns PID]
ip tunnel add|change|del|show [NAME]
        [mode ipip|gre|sit] [remote ADDR] [local ADDR] [ttl TTL]
ip neigh show|flush [to PREFIX] [dev DEV] [nud STATE]
ip rule [list] | add|del SELECTOR ACTION
[rootlesskit:parent] error: failed to setup network &{logWriter:0xc00014aa00 binary:slirp4netns mtu:65520 ipnet:<nil> disableHostLoopback:true apiSocketPath: enableSandbox:false enableSeccomp:false enableIPv6:false ifname:tap0 infoMu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} info:<nil>}: setting up tap tap0: executing [[nsenter -t 28611 -n -m -U --preserve-credentials ip tuntap add name tap0 mode tap] [nsenter -t 28611 -n -m -U --preserve-credentials ip link set tap0 up]]: exit status 1
[rootlesskit:child ] error: parsing message from fd 3: EOF
```
> **(Doll)** That looks like it broke, Miss.
>
> **(Ashe)** *sigh*, yeah, that's broken alright. That output looks like ip didn't like the command supplied to it,
> so let's find out what that was.

Some troubleshooting later, it looks like this is to do with BusyBox's implementation of the ip commands. We've raised
[an issue](https://github.com/rootless-containers/slirp4netns/issues/304), and we'll see how that goes.
In the meantime, we'll just have to use native networking. This means we can't apply firewall rules per-container, which
won't actually hinder deployment, but does make securing it more annoying.

So let's try without the `--net=slirp4netns` (omitting anything that's INFO):
```sh
~ ❯ rootlesskit --copy-up=/etc --copy-up=/run \
    --state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
    sh -c "rm -f /run/containerd; exec containerd -c config.toml"
WARN[2022-11-03T11:32:53.207241941+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T11:32:53.227691744+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
```
A few things of note here:
```sh
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
[...]
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
```

The warning tells us that it tried to create /opt/containerd, but was unable to. This is easy enough to fix:
```sh
~ ❯ doas mkdir /opt/containerd
~ ❯ doas chmod 2770 /opt/containerd
~ ❯ doas chown root:ctr /opt/containerd # Replace the username and group here as necessary
```
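For that group ownership to do anything for our day-to-day user, the user needs to be in the `ctr` group we created earlier (we're using `tammy` as the example user; swap in your own), followed by a re-login so the new membership applies:
```sh
doas addgroup tammy ctr   # busybox addgroup: add an existing user to an existing group
```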
The error is more interesting. CRI here stands for [Container Runtime Interface](https://github.com/containerd/cri), and
it seems to be used for Kubernetes. Since we won't be using Kubernetes here, we can just disable it by adding
`disabled_plugins = ["io.containerd.grpc.v1.cri"]` to our `config.toml`.
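
For reference, after that change our minimal `config.toml` looks something like this:
```toml
version = 2
root = "/home/tammy/.local/share/containerd"
state = "/tmp/1000-runtime-dir/containerd"

disabled_plugins = ["io.containerd.grpc.v1.cri"]

[grpc]
  address = "/tmp/1000-runtime-dir/containerd/containerd.sock"
```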
> **(Tammy)** If you *are* interested in Kubernetes, make sure to check out our
> [Home Server Build-Out]({{< ref "home-server-build-out" >}}) series.
> We're planning on setting up an entire cloud environment there.

Let's try that again (cutting out any info stuff):
```sh
[...]
WARN[2022-11-03T16:18:35.425339343+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T16:18:35.427868986+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
ERRO[2022-11-03T16:18:35.430061527+11:00] failed to initialize a tracing processor "otlp" error="no OpenTelemetry endpoint: skip plugin"
containerd successfully booted in 0.024502s
[...]
```
That's cleaned up those issues, but we still have two warnings about `devmapper`,
and `containerd` couldn't find an OpenTelemetry endpoint.

We'll be skipping OpenTelemetry for now, but that sounds like a fun topic for a second blog post, alongside setting up
Grafana.

> **(Doll)** Doll will remember! Will remind Miss to make a post about this!

### Setting up devmapper
`devmapper` is one of a few [snapshotters](https://github.com/containerd/containerd/tree/main/docs/snapshotters)
that `containerd` can use. It's not the most performant (that honour goes to `overlayfs`), but it is one of
the most robust, and least likely to break. This is more important to us than pure performance.
If you're following along at home, you'll have to decide which storage driver is best for your use-case.

Following the [setup guide](https://github.com/containerd/containerd/blob/main/docs/snapshotters/devmapper.md),
we'll need `dmsetup` installed. Under Alpine, this is provided by the `device-mapper` package,
which we already have installed.

We've also got a 100GB block device attached to this VPS, so let's get that provisioned too.

#### Partitioning our block device

We can use `fdisk` to partition our block device. `fdisk -l` lists all devices and partitions.

```
~ ❯ doas fdisk -l
Disk /dev/vda: 25 GB, 26843545600 bytes, 52428800 sectors
52012 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Device    Boot StartCHS   EndCHS      StartLBA    EndLBA    Sectors   Size  Id Type
/dev/vda1 *    2,0,33     205,3,19        2048    206847     204800   100M  83 Linux
/dev/vda2      205,3,20   1023,15,63    206848  52428799   52221952  24.9G  8e Linux LVM
Disk /dev/vdb: 100 GB, 107374182400 bytes, 209715200 sectors
208050 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/vdb doesn't contain a valid partition table
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 23 GB, 24670896128 bytes, 48185344 sectors
2999 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes

Disk /dev/dm-1 doesn't contain a valid partition table
```
We know that our VPS has a 25GB disk, so `/dev/vdb` is our 100GB block device. We can partition it with
`doas fdisk /dev/vdb`. Let's see how we do that:

```sh
~ ❯ doas fdisk /dev/vdb
Device contains neither a valid DOS partition table, nor Sun, SGI, OSF or GPT disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that the previous content
won't be recoverable.


The number of cylinders for this disk is set to 208050.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): n
Partition type
   p   primary partition (1-4)
   e   extended
p
Partition number (1-4): 1
First sector (63-209715199, default 63):
Using default value 63
Last sector or +size{,K,M,G,T} (63-209715199, default 209715199):
Using default value 209715199

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table

```
Running `fdisk -l` again:
```sh
[...]
Device    Boot StartCHS   EndCHS      StartLBA    EndLBA    Sectors   Size  Id Type
/dev/vdb1      0,1,1      1023,15,63        63 209715199  209715137  99.9G  83 Linux
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
[...]
```

Looks like that worked.

#### Adding the partitioned block device into LVM

Let's get this added into LVM. First, we need to create a physical volume with the `pvcreate`
command:
```sh
~ ❯ doas pvcreate /dev/vdb1
  Physical volume "/dev/vdb1" successfully created.
```
Let's create a new Volume Group for our workload data. There are two reasons for this:
1. This will make it easier to extend in the future; and
2. Our block device is spinning rust, and we don't necessarily want to mix SSDs with spinning rust.

With that in mind, we'll leave the existing VG, `vg0`, as the volume group for programs and container images:
```sh
~ ❯ doas vgcreate data /dev/vdb1
  Volume group "data" successfully created
~ ❯ doas vgdisplay data
  --- Volume group ---
  VG Name               data
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <100.00 GiB
  PE Size               4.00 MiB
  Total PE              25599
  Alloc PE / Size       0 / 0
  Free PE / Size        25599 / <100.00 GiB
  VG UUID               679FIe-aF9e-yBRy-bRH6-wRlY-KPgz-yUpXL9

```
> **(Doll)** Is it working Miss? Doll wants to see websites in treasure chests go zoom!
>
> **(Ashe)** **Containers**, dear Doll. And yes, yes it is.
> Only a few more steps and we'll be ready to start bringing things online, don't worry.

Speaking of, next we need to create our logical volumes. We'll create two: one for our container scratch storage, and
one for persistent storage. We'll size scratch at 30GiB, and persistent at 70GiB. Let's get that done:

```sh
~ ❯ doas lvcreate -n persist --size 70G data
  Logical volume "persist" created.
~ ❯ doas lvcreate -n scratch --size 30G data
  Volume group "data" has insufficient free space (7679 extents): 7680 required.
```

> **(Selene)** Oh interesting. What happened there?
>
> **(Ashe)** Our theoretically 100GiB device has one extent less than 100GiB, so we couldn't divide it into exactly 30/70.
>
> **(Tammy)** Wait, is that why `fdisk` said the device was 99.9G?
>
> **(Ashe)** Good catch. Yeah. 100GiB doesn't divide evenly into 504KiB cylinders, so we end up with one cylinder
> too few, and therefore—
>
> **(Tammy)** One extent too few! Sneaky!
>
> **(Ashe)** Yup. Actually, now that I look at it again, I forgot to make space for the metadata, so this works out
> nicely.

#### Creating our nerdctl thin pool

Docker and nerdctl can control a block device directly to use as a storage driver via device-mapper, so we'll be letting
nerdctl do that for its mainline storage, and using our "persistent" pool for nerdctl volumes (which are persistent).

For this we'll need `device-mapper`, `lvm2-dmeventd`, and `thin-provisioning-tools`, so we'll `apk add` those in.

> **(Ashe)** I'm going to skip showing the terminal output for installing packages from here on in to save space.
> I'm sure you've gotten the idea by now.

First up is creating a thin pool, which we'll do as follows:
```sh
~ ❯ doas lvcreate --wipesignatures y -n scratch data -l 95%FREE
  Logical volume "scratch" created.
~ ❯ doas lvcreate --wipesignatures y -n scratchmeta data -l 10%FREE
  Logical volume "scratchmeta" created.
~ ❯ doas lvconvert -y --zero n -c 512K --thinpool data/scratch --poolmetadata data/scratchmeta
  Thin pool volume with chunk size 512.00 KiB can address at most 126.50 TiB of data.
  WARNING: Converting data/scratch and data/scratchmeta to thin pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
  Converted data/scratch and data/scratchmeta to thin pool.
~ ❯
```
So what did we do here?

> **(Doll)** Ooh! Ooh! Doll knows! Miss created one LV, umm, Logical Volume, taking up 95% of the free space, and one
> taking up 10% of the free space... remaining free space? So ummm, ummm, 152 MiB?
>
> **(Ashe)** That's right! What next?
>
> **(Doll)** We umm. Combine the two into one? This one is confuse.
>
> **(Ashe)** Okay, I'll try to keep it simple. A normal (thick) pool allocates all of its data when we create it. So
> all the space is reserved ahead of time. You can write to whatever bit of it you want, whenever you want.
> Imagine something like a notebook you bought. A thin pool isn't like that. It initialises a small area
> with zeroes, but otherwise leaves the rest of the device alone. It's like you have a single page, and you ask the store
> for another blank page every time you get close to filling up the one you're on.
> So, what would happen if I wrote a 100M file that was all zeroes?
>
> **(Selene)** Let's see if I understand. Well, you'd write the file metadata, and allocate some space... Wait, who's
> keeping track of the size of the volume?
>
> **(Ashe)** Precisely, Selene. You need a metadata volume that contains information about the assigned blocks in
> the thin pool, since it wasn't allocated all at once. So we create a pool for that, and then combine the two into our
> final thin pool.
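
For the curious, Doll's numbers check out. The VG uses 4MiB extents, so, working from the PE counts in the `vgdisplay` output above:
```
persist                : 70GiB                      = 17920 PE
free after persist     : 25599 - 17920              = 7679 PE  (~30.0GiB)
scratch (-l 95%FREE)   : 7679 * 0.95, rounded down  = 7295 PE  (~28.5GiB)
scratchmeta (-l 10%FREE of what's left):
                         (7679 - 7295) * 0.10       = 38 PE    (152MiB)
```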
That done, we can configure autoextension by creating `/etc/lvm/profile/data-scratch.profile`:
```sh
activation {
  thin_pool_autoextend_threshold=80
  thin_pool_autoextend_percent=10
}
```
Apply said profile with `doas lvchange --metadataprofile data-scratch data/scratch`, and check if the thin pool is being
monitored:
```sh
~ ❯ doas lvs -o+seg_monitor
  LV      VG   Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Monitor
  persist data -wi-a-----  70.00g
  scratch data twi---t--- <28.50g
  lv_root vg0  -wi-ao---- <22.98g
  lv_swap vg0  -wi-ao----   1.92g
```
Looks good. Were the LV not monitored, we would see `not monitored` at the end of the `scratch data` line. Were that the
case, we could fix it with `doas lvchange --monitor y data/scratch`.

#### Formatting the new Logical Volume

Our final step is to format the LV we'll be using for persistent volumes.
We'll be using plain-old ext4 for this, as I don't need nor want to get fancy here.

```sh
~ ❯ doas mkfs.ext4 /dev/data/persist
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done
Creating filesystem with 18349056 4k blocks and 4587520 inodes
Filesystem UUID: c0a59a7b-1969-4476-9d2c-11af32628337
Superblock backups stored on blocks:
  32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
  4096000, 7962624, 11239424

Allocating group tables: done
Writing inode tables: done
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done
```
#### Mounting our new logical drives and setting up automount

Final step. Mounting the drive is relatively simple:
```sh
~ ❯ doas mkdir /data
~ ❯ doas chmod 2770 /data
~ ❯ doas mount /dev/data/persist /data
~ ❯ doas chown root:ctr /data -R
```
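A quick `df` should confirm the mount landed where we expect:
```sh
df -h /data   # should show /dev/mapper/data-persist mounted on /data
```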
From here, we can configure `/etc/fstab` so it's automatically mounted at boot.

To achieve that, we'll add the following line to `/etc/fstab`:
```fstab
/dev/data/persist /data ext4 rw,relatime 0 0
```
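Since the volume is already mounted, a low-risk way to check the new entry parses is to unmount it and remount it by mount point, which makes `mount` look it up in `/etc/fstab`:
```sh
doas umount /data
doas mount /data   # resolved via the fstab entry we just added
```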
We don't need to mount the scratch LV (Logical Volume), as containerd will be controlling that directly.

And we should be good to go.

The last thing to do is add a minimal devmapper config to our `config.toml`:
```toml
[...]
[plugins]
  [plugins."io.containerd.snapshotter.v1.devmapper"]
    root_path = "/opt/containerd/devmapper"
    pool_name = "data-scratch"
    base_image_size = "1024MB"
[...]
```

Let's see what happens when we launch `containerd` again:
```sh
WARN[2022-11-07T00:33:26.218437232+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="dmsetup version
error: Library version: 1.02.170 (2020-03-24)
/dev/mapper/control: open failed: Permission denied
Failure to communicate with kernel device-mapper driver.
Check that device-mapper is available in the kernel.
Incompatible libdevmapper 1.02.170 (2020-03-24) and kernel driver (unknown version).
Command failed.

: exit status 1"
```

> **(Tammy)** That doesn't look great.
>
> **(Ashe)** No. It does not. Hmm. Let's investigate.
>
> **(Ashe)** Ah. Found it. Looks like devmapper isn't supported in
> [rootless configs](https://github.com/containerd/nerdctl/blob/main/docs/rootless.md#snapshotters). Now we know.

{{< alert >}}
**(Ashe)** Rootless containerd does **not** support the devmapper snapshotter!
{{< /alert >}}

## An Interlude, Some Tea, And a Break

> **(Octavia)** And on that bomb-shell, I think it's about time we wrapped this up. Ashe is looking pretty grumpy.
> Looks like we'll have to make this into a series.
>
> **(Tammy)** Hopefully we'll have better luck next time.
>
> **(Doll)** This one hopes you'll all join us next time!