## Treehouse Postmortem: total Treehouse Social outage and brief user data loss caused by operator error during migration on 2024-12-22
- authored-by: mxshift, kouhai
- reviewed-by: adrienne
### What's involved? (background)
- Treehouse Social: the Treehouse community's central, self-hosted Mastodon server.
- Treehouse infrastructure migration: a project to migrate Treehouse off of legacy donated infrastructure to logically separated infrastructure, accounts and all.
- Treehouse storage: the storage infrastructure (object storage/S3 API) that powers the Treehouse Social Mastodon instance's cache (cache.Treehouse.systems).
- Treehouse Minio: a Minio server located on the legacy donated infrastructure.
- Treehouse Tigris: the cloud object storage "buckets" used by Treehouse infrastructure, in a dedicated account paid for by Treehouse staff.
- `hil1`: the OVH dedicated server that powers Treehouse Social.
- `mastodon-deployment`: the Docker Compose project on `hil1` which serves Treehouse Social. This directory is owned by the default SSH user and contains the Docker Compose file, the production `.env` file, and bind mounts for Postgres and Redis/Valkey persistence.
### What happened? (summary)
As part of our infrastructure migration, a staff operator was tasked with copying data from the Treehouse-hosted Minio to the Treehouse Tigris bucket. Because of the large number of objects in the bucket, listing objects to compare one deployment against the other was prohibitively slow and expensive. As a result, the operator elected to use an `s5cmd` container environment and write a batch-processing script to speed up the migration.

This script processed groups of 20k files by downloading them to a temporary directory inside the bind-mounted `mastodon-deployment` directory, uploading them to Tigris, then deleting the temporary directory.

The final deletion command had a bug which caused the deletion of the entire `mastodon-deployment` directory.
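
The actual script and its bug are not reproduced here; the sketch below only illustrates the general shape of such a batch loop and the classic failure mode for this shape of script. The endpoints, bucket name, paths, and `batches.txt` manifest are assumptions for illustration, not the tooling that actually ran.

```sh
#!/usr/bin/env bash
# Hypothetical sketch of a batched Minio -> Tigris copy; names and paths
# are assumptions, not the script that ran during the incident.
set -euo pipefail

SRC_ENDPOINT="https://minio.internal.example"    # assumed Minio endpoint
DST_ENDPOINT="https://fly.storage.tigris.dev"    # assumed Tigris endpoint
WORK_DIR="/mastodon-deployment/migration-tmp"    # scratch dir inside the bind mount

while read -r batch_file; do
    mkdir -p "$WORK_DIR"

    # Download one batch of ~20k objects from Minio using an s5cmd command file...
    s5cmd --endpoint-url "$SRC_ENDPOINT" run "$batch_file"

    # ...then upload the batch to the Tigris bucket.
    s5cmd --endpoint-url "$DST_ENDPOINT" cp "$WORK_DIR/*" "s3://treehouse-cache/"

    # The final deletion step. If the variable is empty or mis-expanded, this
    # becomes an rm -rf of the bind-mounted parent directory -- the class of
    # bug behind this outage. A "${WORK_DIR:?}" guard would refuse to run.
    rm -rf "$WORK_DIR"
done < batches.txt
```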
#### How did we recover?
This issue was detected immediately, and other Treehouse staff were engaged as quickly as possible.

Six directory entries were immediately lost: the Postgres, Redis, Barman, and elasticsearch directories; the Docker Compose file; and the production `.env` file.
- Postgres was recoverable from our recently-emplaced Postgres backups. This took a significant amount of time due to the database's size.
- Redis was recovered via a manual `redis-cli` snapshot operation (see the sketch after this list). This took an insignificant amount of time after a brief documentation search.
- elasticsearch was rebuilt with no attempt at recovery. It is considered a non-essential, rebuildable service, and we eventually reconstructed the index using `tootctl`.
- Barman was rebuilt with no attempt at recovery. We lost some user data (several minutes of posts) due to this decision.
- `docker-compose.yml` was fully reconstructed from `docker inspect` output and an old copy saved on a staff operator's laptop. Some formatting and comments were lost.
- The production environment file was reconstructed from `docker inspect` output.
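
For reference, the Redis snapshot and the later index rebuild were along the following lines; the container and service names (`mastodon-redis`, `web`) are assumptions, and the exact commands used during the incident may have differed.

```sh
# Ask the running Redis to write an RDB snapshot into its own data directory...
docker exec mastodon-redis redis-cli BGSAVE

# ...or stream the RDB out of the container directly and copy it to the host.
docker exec mastodon-redis redis-cli --rdb /tmp/dump.rdb
docker cp mastodon-redis:/tmp/dump.rdb ./dump.rdb

# Rebuild the elasticsearch index from Postgres once the stack is back up.
docker compose exec web bin/tootctl search deploy
```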
After data restoration, we restarted the Docker Compose stack to complete recovery.

### Why was this an issue? (root causes)
- The final deletion command ran as root, overriding the POSIX filesystem permissions of the other directories in the bind-mounted `mastodon-deployment` directory.
- The scripting environment inside the container used the container's default `root` user, which mapped to `root` on the host. This meant the deletion took full effect (see the sketch after this list).
- This operational hack was used because of the poor performance of listing cached objects on Minio, which pushed us toward creative solutions that would reduce the duration of Treehouse Social cache impairment by speeding up the migration.
- We provided all of `mastodon-deployment` as a bind mount to our tooling container, instead of a more appropriately scoped subdirectory.
- We took approximately two hours and ten minutes to recover, as the deleted data was not immediately recoverable.
- We lost a small amount of user data:
  - Our backups rely on Postgres asynchronous replication and periodically upload data to Treehouse Tigris.
  - Data that isn't fully replicated may be lost in the event of an outage.
  - Loss of Barman means loss of any replicated but not-yet-uploaded data.
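
The uid-mapping problem is easy to demonstrate. In this sketch (image and paths are illustrative), anything the container's default `root` does through a bind mount happens as real `root` on the host; running the tooling container as the unprivileged host user is one mitigation.

```sh
# Default behaviour: container root is host root, so writes and deletions
# through a bind mount bypass the host's ownership and permissions.
docker run --rm -v "$PWD:/work" alpine:3 sh -c 'id -u && touch /work/owned-by-root'
ls -ln owned-by-root    # owned by uid 0 on the host

# One mitigation: run the tooling container as the unprivileged host user.
docker run --rm --user "$(id -u):$(id -g)" -v "$PWD:/work" alpine:3 \
    sh -c 'touch /work/owned-by-me'
ls -ln owned-by-me      # owned by your own uid/gid
```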
### What went well?
- We had working database backups
  - Database backups (base + WALs) had everything except some of the latest transactions
- The database restore process was fairly straightforward (see the sketch after this list)
- The rest of the stack came up easily once Postgres and Redis were restored
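
As a rough illustration of that restore path: a minimal Barman sketch, assuming a server definition named `hil1-pg` and a destination path under the rebuilt deployment directory (both hypothetical); the actual commands and options used may have differed.

```sh
barman check hil1-pg           # confirm the server definition and archiving are healthy
barman list-backup hil1-pg     # pick a base backup ID (or use "latest")

# Restore the latest base backup plus archived WALs into the new data directory.
barman recover hil1-pg latest /srv/mastodon-deployment/postgres
```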
### What went wrong?
- No Redis backups
- No backups of `docker-compose.yml` or `.env.production`
### Where did we get lucky?
- All of the Docker containers were still running, so [docker.io/red5d/docker-autocompose](https://hub.docker.com/r/red5d/docker-autocompose) was able to rapidly generate a `docker-compose.yml` from the running containers (see the sketch after this list). This included the contents of `.env.production` mixed in with each container's default environment settings, but did not include `depends_on` and `healthcheck` settings
- A staff operator had a snapshot of `docker-compose.yml` from February that provided the `depends_on` and `healthcheck` settings
- Redis was still running and, as an in-memory database, still held all of its data, so we could dump its contents
- We had working Postgres backups; we explicitly made the call to do this first, before migrating anything, to reduce migration risks
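
For the record, the reconstruction was along these lines; the container names are assumptions, and docker-autocompose only needs access to the Docker socket plus the names of the running containers.

```sh
# Regenerate a compose file from the still-running containers.
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    docker.io/red5d/docker-autocompose \
    mastodon-web mastodon-sidekiq mastodon-streaming > docker-compose.reconstructed.yml

# Dump a container's environment (production .env contents mixed with image
# defaults) for reconstructing .env.production.
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' mastodon-web
```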
### What can we realistically fix soon?
- Redis backups (see the sketch after this list)
- `docker-compose.yml` backups
- Stop making non-critical production changes
  - Just let the Minio -> Tigris copy plod along for a week, or however long it takes
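
A minimal sketch of the kind of nightly job the first two items imply; the paths, container name, and backup bucket are assumptions, not existing Treehouse tooling.

```sh
#!/usr/bin/env bash
set -euo pipefail

DEPLOY_DIR="/home/deploy/mastodon-deployment"   # assumed deployment path
BACKUP_BUCKET="s3://treehouse-backups"          # assumed backup bucket

# Snapshot Redis; BGSAVE runs in the background, so this wait is crude and a
# real job should poll LASTSAVE instead.
docker exec mastodon-redis redis-cli BGSAVE
sleep 30

# Copy the RDB and the compose file somewhere that is not on hil1.
s5cmd cp "$DEPLOY_DIR/redis/dump.rdb"     "$BACKUP_BUCKET/redis/dump-$(date +%F).rdb"
s5cmd cp "$DEPLOY_DIR/docker-compose.yml" "$BACKUP_BUCKET/config/docker-compose-$(date +%F).yml"
```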
### What fixes should we work towards, longer term?
- Move to Podman so that guest root != host root (see the sketch after this list)
- Offline backup of `.env.production`
  - Tricky, due to the sensitive nature of this file. We do not have a secure secrets store yet!
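
Rootless Podman gives us that property directly: the container's `root` maps to the invoking host user (plus a subordinate uid range), so a runaway deletion through a bind mount is bounded by that user's permissions. A quick way to see the mapping:

```sh
# Inside a rootless Podman user namespace, uid 0 exists only in the container;
# on the host it is the invoking user / a subuid range.
podman unshare cat /proc/self/uid_map

# Files created by container root through a bind mount land on the host owned
# by the unprivileged invoking user, not by host root.
podman run --rm -v "$PWD:/work" alpine:3 touch /work/created-by-container-root
ls -ln created-by-container-root
```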
### What did we learn?
- Docker running containers such that `guest root == host root` is a loaded gun
- Have backups? Test backups
- Back up your configuration
- Minimize bind mount exposure when performing operational tasks (see the sketch after this list)
  - Use volumes if feasible
  - Minimize the mounted risk surface
  - Mark the bind-mount as read-only if possible
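
As a concrete illustration of the last few points (the paths are hypothetical): mount only the scratch subdirectory a task needs to write, mark anything it merely reads as read-only, and leave sibling directories such as the Postgres and Redis persistence out of the container entirely.

```sh
# Only the scratch subdirectory is writable; the batch manifests are read-only;
# nothing else in mastodon-deployment is visible to the container at all.
docker run --rm \
    -v /home/deploy/mastodon-deployment/migration-tmp:/work \
    -v /home/deploy/mastodon-deployment/batches:/batches:ro \
    alpine:3 sh -c 'touch /work/ok && touch /batches/nope || echo "read-only, as intended"'
```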