feat: add gpa postmortem
	
		
			
	
		
	
	
		
	
		
content/post/postmortem-gpa/index.md | 252 | Normal file
							@@ -0,0 +1,252 @@
---
title: "Postmortem - how to completely screw up an update"
date: 2025-10-19T12:05:10+02:00
lastmod: 2025-10-19T16:00:04+02:00
draft: false
image: "uploads/postmortem.png"
categories: [ 'English' ]
tags: [ 'backup', 'postmortem', 'fediverse', 'gotosocial' ]
---

The fediverse instance [gay-pirate-assassins.de](https://gay-pirate-assassins.de) was down for a couple of days. This
postmortem outlines what went wrong and what I did to prevent things from going that wrong in the future.

# Timeline

* 2025-10-05 17:26: [Update announcement](https://gay-pirate-assassins.de/@moanos/statuses/01K6TFQ1HVPAR6AYN08XYQ7XFV)
* 2025-10-05 ~17:45: Update started
* 2025-10-05 ~18:00: Services restart
* 2025-10-05 ~18:00: GoToSocial doesn't come up
* 2025-10-12 ~10:00: Issue is found
* 2025-10-12 10:30: Issue is fixed
* 2025-10-12 10:31: GoToSocial is started, migrations start
* 2025-10-12 15:38: Migrations finished successfully
* 2025-10-12 15:38: Service available again
* 2025-10-12 18:36: [Announcement sent](https://gay-pirate-assassins.de/@moanos/statuses/01K7CMGF7S2TE39792CMADGEPJ)

All times are given in CEST.

## The beginning: An update goes wrong

I run a small fediverse server with a few users called [gay-pirate-assassins](https://gay-pirate-assassins.de/), which is powered by [GoToSocial](https://gotosocial.org/).
The (amazing) GoToSocial devs released `v0.20.0-rc1` and `v0.20.0-rc2`. The new features seemed pretty cool, I'm
impatient, and the second release candidate seemed stable, so I decided to update to `v0.20.0-rc2`.
I started a backup (via borgmatic), waited for it to finish and confirmed it ran successfully.
Then I changed the version number in the [mash](https://github.com/mother-of-all-self-hosting/mash-playbook) Ansible
playbook I use. I also pulled the newest version of the playbook and its roles because I wanted to update all services
that run on the server. I checked
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md),
didn't see anything noteworthy, and started the update. It went through and GoToSocial started up just fine.
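
For context, the whole update boils down to a handful of commands. Roughly this (the `just` recipe names here are from
memory and may differ in your checkout, so treat it as a sketch):

```
# back up first and confirm the run reports success
borgmatic create --stats

# bump the GoToSocial version in vars.yml, then pull the newest playbook and its roles
git pull
just roles   # recipe name is an assumption

# run the playbook for all enabled services
just install-all   # recipe name is an assumption
```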

But the instance start page showed me 0 users, 0 posts and 0 federated instances. **Something has gone horribly wrong!**

## Migrations

It was pretty clear to me that the migrations had gone wrong.
The [GoToSocial migration notes](https://codeberg.org/superseriousbusiness/gotosocial/releases/tag/v0.20.0-rc1)
specifically mentioned long-running migrations that could take several hours. I assumed that somehow, during the running
database migration, the service must have restarted and left the DB in a broken state. This had happened to me before.

Well, that's what backups are for, so let's pull one.

## Backups

Backups for this server are done in two ways:

* via postgres-backup: Backups of the database are written to disk
* via [borgmatic](https://torsion.org/borgmatic/): Backups via borg are written to backup nodes, one of them at my home

They run every night automatically, monitored by [Healthchecks](https://healthchecks.io/). I had triggered a manual run
before the update, so that is the one I mounted using [Vorta](https://vorta.borgbase.com/).
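
Vorta is just a GUI for borg; the same mount can be done with borgmatic on the CLI, roughly like this (the mount point
is arbitrary and `latest` picks the newest archive; a sketch, not a full restore procedure):

```
# list archives and mount the most recent one read-only
borgmatic list
borgmatic mount --archive latest --mount-point /mnt/backup-check

# ...poke around in /mnt/backup-check...

borgmatic umount --mount-point /mnt/backup-check
```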

And then the realization.

```
mash-postgres:5432 $ ls -lh
total 2.1M
-r-------- 1 moanos root 418K Oct 05 04:03 gitea
-r-------- 1 moanos root 123K Oct 05 04:03 healthchecks
-r-------- 1 moanos root 217K Oct 05 04:03 ilmo
-r-------- 1 moanos root 370K Oct 05 04:03 notfellchen
-r-------- 1 moanos root  67K Oct 05 04:03 oxitraffic
-r-------- 1 moanos root  931 Oct 05 04:03 prometheus_postgres_exporter
-r-------- 1 moanos root 142K Oct 05 04:03 semaphore
-r-------- 1 moanos root 110K Oct 05 04:03 vaultwarden
-r-------- 1 moanos root 669K Oct 05 04:03 woodpecker_ci_server
```

Fuck. The database gay-pirate-assassins is not there. Why?

To explain that, I have to tell you how it *should* work: Services deployed by the mash-playbook are automatically wired
to the database and reverse proxy by a complex set of Ansible variables. This is great, because adding a service can
therefore be as easy as adding

```
healthchecks_enabled: true
healthchecks_hostname: health.hyteck.de
```

to the `vars.yml` file.

This will then configure the Postgres database automatically, based on the `group_vars`. They look like this:

```
mash_playbook_postgres_managed_databases_auto_itemized:
    - |-
      {{
          ({
          'name': healthchecks_database_name,
          'username': healthchecks_database_username,
          'password': healthchecks_database_password,
          } if healthchecks_enabled and healthchecks_database_hostname == postgres_connection_hostname and healthchecks_database_type == 'postgres' else omit)
      }}
```

Note that a healthchecks database is only added to the managed databases if `healthchecks_enabled` is `true`.

This is really useful for backups because the borgmatic configuration also pulls the list
`mash_playbook_postgres_managed_databases_auto_itemized`. Therefore, you do not need to specify which databases to back
up; it just backs up all managed databases.
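
In hindsight, a cheap sanity check would have caught this: compare the databases that actually exist in Postgres with
the dump files postgres-backup writes. A rough sketch (the superuser name and dump directory are assumptions about my
setup, adjust to yours):

```
# every non-template database that exists in the Postgres container
docker exec mash-postgres psql -U postgres -Atc \
  "SELECT datname FROM pg_database WHERE NOT datistemplate;" | sort > /tmp/dbs-in-postgres

# every dump file the nightly postgres-backup job wrote (directory is an assumption)
ls /mash/postgres-backup | sort > /tmp/dbs-dumped

# anything printed here exists in Postgres but is missing from the backups
comm -23 /tmp/dbs-in-postgres /tmp/dbs-dumped
```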

However, the database for gay-pirate-assassins was not managed. In the playbook it's only possible to configure a
service once: you cannot manage multiple GoToSocial instances in the same `vars.yml`. In the past, I had two instances
of GoToSocial running on the server. I therefore
followed [the how-to "Running multiple instances of the same service on the same host"](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/docs/running-multiple-instances.md).

Basically this means that an additional `vars.yml` must be created that is treated as a completely different server.
Databases must be created manually, as they are not managed.

With that knowledge you can understand that when I say the database for gay-pirate-assassins was not managed,
it means it was not included in the list of databases to be backed up. The backup service thought it ran successfully,
because it backed up everything it knew of.

So this left me with a three-month-old backup. Unacceptable.

## Investigating

So the existing database needed to be rescued. I SSHed into the server and checked the database. It looked completely
normal.
I asked the devs if they could provide me with the migrations, as they had done in the past. However, they pointed
out that the migrations are too complex for that approach. They suggested deleting the latest migration to force a
re-run of the migrations.

Here is where I was confused, because this was the `bun_migrations` table:

```
gay-pirate-assassins=# SELECT * FROM bun_migrations ORDER BY id DESC LIMIT 5;
 id  |      name      | group_id |          migrated_at
-----+----------------+----------+-------------------------------
 193 | 20250324173534 |       20 | 2025-04-23 20:00:33.955776+00
 192 | 20250321131230 |       20 | 2025-04-23 19:58:06.873134+00
 191 | 20250318093828 |       20 | 2025-04-23 19:57:50.540568+00
 190 | 20250314120945 |       20 | 2025-04-23 19:57:30.677481+00
```

The last migration ran in April, when I updated to `v0.19.1`. Strange.

At this point I went on vacation and paused investigations, not only because the vacation was great, but also because I
was bamboozled by this state.

---

After my vacation I came back and made a manual backup of the database:

```
$ docker run -e PGPASSWORD="XXXX" -it --rm --network mash-postgres postgres pg_dump -U gay-pirate-assassins -h mash-postgres gay-pirate-assassins > manual-backup/gay-pirate-assassins-2025-10-13.sql
```
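
Had the rescue gone sideways, that dump could have been fed back in the same way, roughly (same container and network
names as above, `-i` instead of `-it` so stdin can be redirected; a sketch, not something I ended up needing):

```
$ docker run -e PGPASSWORD="XXXX" -i --rm --network mash-postgres postgres psql -U gay-pirate-assassins -h mash-postgres gay-pirate-assassins < manual-backup/gay-pirate-assassins-2025-10-13.sql
```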

Then I deleted the latest migration, as I was advised:

```
DELETE FROM bun_migrations WHERE id=193;
```

and restarted the server. While watching the server come up, it hit me in the face:

```
Oct 12 08:31:29 s3 mash-gpa-gotosocial[2251925]: timestamp="12/10/2025 08:31:29.905" func=bundb.sqliteConn level=INFO msg="connected to SQLITE database with address file:/opt/gotosocial/sqlite.db?_pragma=busy_timeout%281800000%29&_pragma=journal_mode%>
Oct 12 13:38:46 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:46.588" func=router.(*Router).Start.func1 level=INFO msg="listening on 0.0.0.0:8080"
```

The server is **starting from a completely different database**! That explains why

* the last migration was never done
* the server showed me 0 users, 0 posts and 0 federated instances, even though the Postgres database had plenty of those

All of a sudden an SQLite database was configured. This happened because
of [this commit](https://github.com/mother-of-all-self-hosting/ansible-role-gotosocial/commit/df34af385f9765bda8f160f6985a47cb7204fe96),
which introduced SQLite support and set it as the default. This was not mentioned in
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md).

So what happened is that the config changed, the server was restarted, and an empty SQLite DB was initialized. The
Postgres DB never started to migrate.

## Fixing

To fix it, I did the following:

1. Configure the playbook to use Postgres for GoToSocial:

```
# vars.yml
gotosocial_database_type: postgres
```

2. Run the playbook to configure GoToSocial (but without starting the service):

```
just run-tags install-gotosocial
```

3. Check that the configuration is correct (see the sketch below)
4. Start the service
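
For step 3, I mainly wanted to see that the rendered GoToSocial config pointed at Postgres again before starting
anything. Roughly what I checked (the config path is an assumption about where the role renders it; the service name is
the one from my logs):

```
# the rendered config should now say db-type: postgres, not sqlite
grep -E 'db-(type|address)' /mash/gpa-gotosocial/config.yaml   # path is an assumption

# only then start the service and watch it come up
systemctl start mash-gpa-gotosocial.service
journalctl -fu mash-gpa-gotosocial.service
```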

The migrations took several hours, but after that everything looked stable again. I don't think there are any lasting
consequences. However, the server was unavailable for several days.

## Learnings

I believe the main issue here was not the config change that went unnoticed by me. While I'd ideally notice stuff
like this, the server is a hobby, and I'll continue to not check every config option that changed.

The larger issue was the backup. Having a backup would have made this easy to solve. And there are other, less lucky
problems where I'd be completely lost without a backup. So to make sure this doesn't happen again, I did/will do the
following:

### 1. Mainstream the config

As explained, I used a specific non-mainstream setup in the Ansible playbook because, in the past, I ran two instances
of GoToSocial on the server. After shutting down one of them, I never moved gay-pirate-assassins to be part of the main
config. This meant important parts of the configuration had to be done manually, which I botched.

So in the past week I cleaned up, and gay-pirate-assassins is now part of the main `vars.yml` and will benefit from all
relevant automations.

### 2. Checking backups

I was confident in my backups because

* they run every night very consistently. If they fail, e.g. because of a network outage, I reliably get a warning.
* I verified a successful run of the backup job prior to upgrading.

The main problem was me assuming that a successful run of the backup command meant a successful backup. Everyone will
tell you that a backup that is not tested is not to be trusted. And they are right. However, doing frequent test-restores
exceeds my time and server capacity. So what I'll do instead is the following (a rough sketch is shown after the list):

* mount the backup before an upgrade
* `tail` the backup file as created by postgres-backup and ensure the data is from the same day
* check the media folders for the last changed image
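
As a shell sketch of that routine (the mount point, dump path and media path are assumptions about my layout, nothing
the playbook guarantees):

```
# mount the most recent borg archive
borgmatic mount --archive latest --mount-point /mnt/backup-check

# the end of the dump should contain data, and the file date should be from today
tail -n 20 /mnt/backup-check/mash/postgres-backup/gay-pirate-assassins   # path is an assumption
ls -lh /mnt/backup-check/mash/postgres-backup/gay-pirate-assassins

# the newest file in the media folder should also be recent
find /mnt/backup-check/mash/gpa-gotosocial/data -type f -printf '%T@ %p\n' | sort -n | tail -n 1

borgmatic umount --mount-point /mnt/backup-check
```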

This is not a 100% guarantee, but I'd argue it's a pretty good compromise for now. As mounting backups becomes a more
frequent and therefore faster routine, I'll re-evaluate doing a test-restore at least semi-regularly.

## Conclusion

I fucked up, but I was lucky that my error was recoverable and no data was lost. Next time this will hopefully not be
due to luck, but to better planning!

Any questions? Let me know!

content/post/postmortem-gpa/logs.md | 22 | Normal file
@@ -0,0 +1,22 @@

```
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.266" func=cache.(*Caches).Start level=INFO msg="start: 0xc002476008"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.303" func=bundb.pgConn level=INFO msg="connected to POSTGRES database"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.328" func=migrations.init.110.func1 level=INFO msg="creating statuses column thread_id_new"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.872" func=bundb.queryHook.AfterQuery level=WARN duration=6.528757799s query="SELECT count(*) FROM \"statuses\"" msg="SLOW DATABASE QUERY"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.873" func=migrations.init.110.func1 level=WARN msg="rethreading 4611812 statuses, this will take a *long* time"
Oct 12 09:33:38 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:38.111" func=migrations.init.110.func1 level=INFO msg="[~0.02% done; ~137 rows/s] migrating threads"
Oct 12 09:33:44 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:44.618" func=migrations.init.110.func1 level=INFO msg="[~0.04% done; ~171 rows/s] migrating threads"
```

```
Oct 12 13:38:08 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:08.726" func=migrations.init.110.func1 level=INFO msg="[~99.98% done; ~148 rows/s] migrating stragglers"
Oct 12 13:38:10 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:10.309" func=migrations.init.110.func1 level=INFO msg="[~99.99% done; ~162 rows/s] migrating stragglers"
Oct 12 13:38:12 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:12.192" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~141 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.711" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~136 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.714" func=migrations.init.110.func1 level=INFO msg="dropping temporary thread_id_new index"
```