feat: add gpa postmortem

2025-10-19 16:17:58 +02:00
parent c18e2a7ecf
commit 57243489c8
3 changed files with 274 additions and 0 deletions


@@ -0,0 +1,252 @@
---
title: "Postmortem - how to completely screw up an update"
date: 2025-10-19T12:05:10+02:00
lastmod: 2025-10-19T16:00:04+02:00
draft: false
image: "uploads/postmortem.png"
categories: [ 'English' ]
tags: [ 'backup', 'postmortem', 'fediverse', 'gotosocial' ]
---
The fediverse instance [gay-pirate-assassins.de](https://gay-pirate-assassins.de) was down for a couple of days. This
postmortem will outline what went wrong and what I did to prevent things from going that wrong in the future.
# Timeline
* 2025-10-05 17:26: [Update announcement](https://gay-pirate-assassins.de/@moanos/statuses/01K6TFQ1HVPAR6AYN08XYQ7XFV)
* 2025-10-05 ~17:45: Update started
* 2025-10-05 ~18:00: Services restart
* 2025-10-05 ~18:00: GoToSocial doesn't come up
* 2025-10-12 ~10:00: Issue is found
* 2025-10-12 10:30: Issue is fixed
* 2025-10-12 10:31: GoToSocial is started, migrations start
* 2025-10-12 15:38: Migrations finished successfully
* 2025-10-12 15:38: Service available again
* 2025-10-12 18:36: [Announcement sent](https://gay-pirate-assassins.de/@moanos/statuses/01K7CMGF7S2TE39792CMADGEPJ)
All times are given in CEST.
## The beginning: An update goes wrong
I run a small fediverse server with a few users, called [gay-pirate-assassins](https://gay-pirate-assassins.de/), which is powered by [GoToSocial](https://gotosocial.org/).
The (amazing) GoToSocial devs released `v0.20.0-rc1` and `v0.20.0-rc2`. Since the new features seemed pretty cool, I'm
impatient, and the second release candidate seemed stable,
I decided to update to `v0.20.0-rc2`. So I started a backup (via borgmatic), waited for it to finish and confirmed it ran
successfully.
Then I changed the version number in the [mash](https://github.com/mother-of-all-self-hosting/mash-playbook) Ansible
playbook I use. I also pulled the newest version of the playbook and its roles because I wanted to update all services
that run on the server. I checked
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md),
didn't see anything relevant and started the update. It went through and GoToSocial started up just fine.
But the instance start page showed me 0 users, 0 posts and 0 federated instances. **Something had gone horribly wrong!**
## Migrations
It was pretty clear to me that the migrations had gone wrong.
The [GoToSocial Migration notes](https://codeberg.org/superseriousbusiness/gotosocial/releases/tag/v0.20.0-rc1)
specifically mentioned long-running migrations that could take several hours. I assumed that somehow, during the running
database migration, the service must have restarted and left the DB in a broken state. This had happened to me before.
Well, that's what backups are for, so let's pull one.
## Backups
Backups for this server are done in two ways:
* via postgres-backup: Backups of the database are written to disk
* via [borgmatic](https://torsion.org/borgmatic/): Backups via borg are written to backup nodes, one of them at my home
They run every night automatically, monitored by [Healthchecks](https://healthchecks.io/). I triggered a manual run
before the update, so that is the one I mounted using [Vorta](https://vorta.borgbase.com/).
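For context, such a manual run boils down to roughly the following. This is plain borgmatic; the playbook actually wraps it in a systemd unit, so the exact invocation on the host differs:
```
# trigger a backup run and print per-archive statistics
borgmatic create --stats --verbosity 1
# confirm the newest archive was created just now
borgmatic info --archive latest
```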
And then the realization.
```
mash-postgres:5432 $ ls -lh
total 2.1M
-r-------- 1 moanos root 418K Oct 05 04:03 gitea
-r-------- 1 moanos root 123K Oct 05 04:03 healthchecks
-r-------- 1 moanos root 217K Oct 05 04:03 ilmo
-r-------- 1 moanos root 370K Oct 05 04:03 notfellchen
-r-------- 1 moanos root 67K Oct 05 04:03 oxitraffic
-r-------- 1 moanos root 931 Oct 05 04:03 prometheus_postgres_exporter
-r-------- 1 moanos root 142K Oct 05 04:03 semaphore
-r-------- 1 moanos root 110K Oct 05 04:03 vaultwarden
-r-------- 1 moanos root 669K Oct 05 04:03 woodpecker_ci_server
```
Fuck. The database gay-pirate-assassins is not there. Why?
To explain that I have to tell you how it *should* work: Services deployed by the mash-playbook are automatically wired
to the database and reverse proxy by a complex set of Ansible variables. This is great, because adding a service can
therefore be as easy as adding
```
healthchecks_enabled: true
healthchecks_hostname: health.hyteck.de
```
to the `vars.yml` file.
This will then configure the postgres database automatically, based on the `group_vars`. They look like this:
```
mash_playbook_postgres_managed_databases_auto_itemized:
  - |-
    {{
      ({
        'name': healthchecks_database_name,
        'username': healthchecks_database_username,
        'password': healthchecks_database_password,
      } if healthchecks_enabled and healthchecks_database_hostname == postgres_connection_hostname and healthchecks_database_type == 'postgres' else omit)
    }}
```
Note that a healthchecks database is only added to the managed databases if `healthchecks_enabled` is `True`.
This is really useful for backups because the borgmatic configuration also pulls the list
`mash_playbook_postgres_managed_databases_auto_itemized`. Therefore, you do not need to specify which databases to back
up; it simply backs up all managed databases.
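In hindsight, a cheap sanity check would have been to look at which databases actually ended up in the rendered borgmatic configuration on the host. A sketch; the config path is an assumption about where the playbook puts it:
```
# list the database entries borgmatic was configured with
# (adjust the path to wherever the playbook renders borgmatic's config)
grep -E 'postgresql_databases:|- name:' /mash/backup-borg/config/borgmatic.yaml
```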
However, the database for gay-pirate-assassins was not managed. In the playbook it's only possible to configure a
service once: you cannot manage multiple GoToSocial instances in the same `vars.yml`. In the past, I had two instances
of GoToSocial running on the server. I therefore
followed [the how-to "Running multiple instances of the same service on the same host"](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/docs/running-multiple-instances.md).
Basically this means that an additional `vars.yml` must be created and treated as a completely different server.
Databases must be created manually, as they are not managed.
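For illustration, manual creation looks roughly like this. It's a sketch, not the exact commands from back then; the `postgres` superuser name is an assumption, while the container name matches the setup above:
```
# create role and database for the extra instance by hand
docker exec -it mash-postgres psql -U postgres \
  -c "CREATE USER \"gay-pirate-assassins\" WITH PASSWORD 'XXXX';" \
  -c "CREATE DATABASE \"gay-pirate-assassins\" OWNER \"gay-pirate-assassins\";"
```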
With that knowledge you can understand that when I say the database for gay-pirate-assassins was not managed,
it means it was not included in the list of databases to be backed up. The backup service thought it ran successfully
because it backed up everything it knew of.
So this left me with a three-month-old backup. Unacceptable.
## Investigating
So the existing database needed to be rescued. I SSHed into the server and checked the database. It looked completely
normal.
I asked the devs if they could provide me with the migrations, as they had done in the past. However, they pointed
out that the migrations were too complex for that approach. They suggested deleting the latest migration entry to force a
re-run of the migrations.
Here is where I was confused, because this was the `bun_migrations` table:
```
gay-pirate-assassins=# SELECT * FROM bun_migrations ORDER BY id DESC LIMIT 5;
id | name | group_id | migrated_at
-----+----------------+----------+-------------------------------
193 | 20250324173534 | 20 | 2025-04-23 20:00:33.955776+00
192 | 20250321131230 | 20 | 2025-04-23 19:58:06.873134+00
191 | 20250318093828 | 20 | 2025-04-23 19:57:50.540568+00
190 | 20250314120945 | 20 | 2025-04-23 19:57:30.677481+00
```
The last migration ran in April, when I updated to `v0.19.1`. Strange.
At this point I went on vacation and paused the investigation, not only because the vacation was great, but also because I was
bamboozled by this state.
---
After my vacation I came back and made manual backups of the database:
```
$ docker run -e PGPASSWORD="XXXX" -it --rm --network mash-postgres postgres pg_dump -U gay-pirate-assassins -h mash-postgres gay-pirate-assassins > manual-backup/gay-pirate-assassins-2025-10-13.sql
```
Then I deleted the last migration, as I was advised:
```
DELETE FROM bun_migrations WHERE id=193;
```
and restarted the server. While watching the server come up it hit me in the face:
```
Oct 12 08:31:29 s3 mash-gpa-gotosocial[2251925]: timestamp="12/10/2025 08:31:29.905" func=bundb.sqliteConn level=INFO msg="connected to SQLITE database with address file:/opt/gotosocial/sqlite.db?_pragma=busy_timeout%281800000%29&_pragma=journal_mode%>
Oct 12 13:38:46 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:46.588" func=router.(*Router).Start.func1 level=INFO msg="listening on 0.0.0.0:8080"
```
The server was **starting from a completely different database**! That explains why
* the last migration was never run
* the server showed me 0 users, 0 posts and 0 federated instances, even though the postgres database had plenty of those
All of a sudden, an SQLite database was configured. This happened because
of [this commit](https://github.com/mother-of-all-self-hosting/ansible-role-gotosocial/commit/df34af385f9765bda8f160f6985a47cb7204fe96),
which introduced SQLite support and set it as the default. This was not mentioned in
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md).
So what happened is that the config changed, the server was restarted, and an empty SQLite DB was initialized. The
postgres DB never even started to migrate.
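With hindsight, a change like this can be caught without relying on the Changelog by diffing the role itself between updates. A sketch; `defaults/main.yml` is where Ansible roles usually keep such defaults, which is an assumption about this role's layout:
```
# clone the role and review how its defaults changed over time
git clone https://github.com/mother-of-all-self-hosting/ansible-role-gotosocial
git -C ansible-role-gotosocial log -p --follow -- defaults/main.yml
```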
## Fixing
To fix it, I did the following:
1. Configure the playbook to use postgres for GoToSocial:
```
# vars.yml
gotosocial_database_type: postgres
```
2. Run the playbook to configure GoToSocial (without starting the service)
```
just run-tags install-gotosocial
```
3. Check that the configuration is correct (see the sketch below)
4. Start the service
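For step 3, the quickest check is to look at the database settings in the rendered GoToSocial config before starting the service. A sketch; the config path is an assumption and depends on how the playbook is set up:
```
# the db-* keys should point at postgres, not sqlite
# (adjust the path to wherever the playbook renders GoToSocial's config.yaml)
grep -E '^db-(type|address|database)' /mash/gotosocial/config.yaml
```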
The migrations took several hours, but after that everything looked stable again. I don't think there are any lasting
consequences. However, the server was unavailable for several days.
## Learnings
I believe the main issue here was not the config change that went unnoticed by me. While I'd ideally notice stuff
like this, the server is a hobby, and I'll continue not to check every config option that changed.
The larger issue was the backup. Having a working backup would have made this easy to solve. And there are other, less
fortunate failure modes where I'd be completely lost without a backup. So to make sure this doesn't happen again, I did/will do the
following:
### 1. Mainstream the config
As explained, I used a specific non-mainstream setup in the Ansible playbook because, in the past, I ran two instances
of GoToSocial on the server. After shutting down one of them, I never moved gay-pirate-assassins into the main
config. This meant important parts of the configuration had to be done manually, which I botched.
So in the past week I cleaned things up: gay-pirate-assassins is now part of the main `vars.yml` and will benefit from all
relevant automations.
### 2. Checking backups
I was confident in my backups because
* they run every night very consistently. If they fail, e.g. because of a network outage, I reliably get a warning.
* I verified that the backup job ran successfully prior to upgrading
The main problem was me assuming that a successful run of the backup command meant a successful backup. Everyone will
tell you that a backup that is not tested is not to be trusted. And they are right. However, doing frequent test-restores
exceeds my time and server capacity. So what I'll do instead is the following (sketched below):
* mount the backup before an upgrade
* `tail` the backup file as created by postgres-backup and ensure the data is from the same day
* check media folders for the last changed image
This is not a 100% guarantee, but I'd argue it's a pretty good compromise for now. As I mount backups more often and
the process becomes quicker, I'll re-evaluate doing a test-restore at least semi-regularly.
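A minimal sketch of that pre-upgrade check, assuming a layout like mine; the paths inside the mounted archive are placeholders:
```
MNT=/mnt/backup-check
borgmatic mount --archive latest --mount-point "$MNT"

# a plain SQL dump ends with "-- PostgreSQL database dump complete";
# zcat -f also handles compressed dumps (archive-internal path is a placeholder)
zcat -f "$MNT"/path/to/postgres-backups/gay-pirate-assassins* | tail -n 5

# the newest file in the media storage should be from around now
find "$MNT"/path/to/gotosocial/storage -type f -printf '%T+ %p\n' | sort | tail -n 1

borgmatic umount --mount-point "$MNT"
```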
## Conclusion
I fucked up, but I was lucky that my error was recoverable and no data was lost. Next time it will hopefully not be due
to luck, but to better planning!
Any questions? Let me know!


@@ -0,0 +1,22 @@
```
"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.266" func=cache.(*Caches).Start level=INFO msg="start: 0xc002476008"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.303" func=bundb.pgConn level=INFO msg="connected to POSTGRES database"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.328" func=migrations.init.110.func1 level=INFO msg="creating statuses column thread_id_new"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.872" func=bundb.queryHook.AfterQuery level=WARN duration=6.528757799s query="SELECT count(*) FROM \"statuses\"" msg="SLOW DATABASE QUERY"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.873" func=migrations.init.110.func1 level=WARN msg="rethreading 4611812 statuses, this will take a *long* time"
Oct 12 09:33:38 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:38.111" func=migrations.init.110.func1 level=INFO msg="[~0.02% done; ~137 rows/s] migrating threads"
Oct 12 09:33:44 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:44.618" func=migrations.init.110.func1 level=INFO msg="[~0.04% done; ~171 rows/s] migrating threads"
```
```
Oct 12 13:38:08 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:08.726" func=migrations.init.110.func1 level=INFO msg="[~99.98% done; ~148 rows/s] migrating stragglers"
Oct 12 13:38:10 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:10.309" func=migrations.init.110.func1 level=INFO msg="[~99.99% done; ~162 rows/s] migrating stragglers"
Oct 12 13:38:12 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:12.192" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~141 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.711" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~136 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.714" func=migrations.init.110.func1 level=INFO msg="dropping temporary thread_id_new index"
```

Binary file not shown.
