feat: add gpa postmortem

content/post/postmortem-gpa/index.md (new file, 252 lines)

---
title: "Postmortem - how to completely screw up an update"
date: 2025-10-19T12:05:10+02:00
lastmod: 2025-10-19T16:00:04+02:00
draft: false
image: "uploads/postmortem.png"
categories: [ 'English' ]
tags: [ 'backup', 'postmortem', 'fediverse', 'gotosocial' ]
---

The fediverse instance [gay-pirate-assassins.de](https://gay-pirate-assassins.de) was down for a couple of days. This
postmortem will outline what went wrong and what I did to prevent things from going that wrong in the future.

## Timeline

* 2025-10-05 17:26: [Update announcement](https://gay-pirate-assassins.de/@moanos/statuses/01K6TFQ1HVPAR6AYN08XYQ7XFV)
* 2025-10-05 ~17:45: Update started
* 2025-10-05 ~18:00: Services restart
* 2025-10-05 ~18:00: GoToSocial doesn't come up
* 2025-10-12 ~10:00: Issue is found
* 2025-10-12 10:30: Issue is fixed
* 2025-10-12 10:31: GoToSocial is started, migrations start
* 2025-10-12 15:38: Migrations finished successfully
* 2025-10-12 15:38: Service available again
* 2025-10-12 18:36: [Announcement sent](https://gay-pirate-assassins.de/@moanos/statuses/01K7CMGF7S2TE39792CMADGEPJ)

All times are given in CEST.

## The beginning: An update goes wrong

I run a small fediverse server with a few users called [gay-pirate-assassins](https://gay-pirate-assassins.de/), which is powered by [GoToSocial](https://gotosocial.org/).
The (amazing) GoToSocial devs released `v0.20.0-rc1` and `v0.20.0-rc2`. The new features seemed pretty cool, I'm
impatient, and the second release candidate seemed stable,
so I decided to update to `v0.20.0-rc2`. I started a backup (via borgmatic), waited for it to finish and confirmed it ran
successfully.
Then I changed the version number in the [mash](https://github.com/mother-of-all-self-hosting/mash-playbook) Ansible
playbook I use and pulled the newest version of the playbook and its roles, because I wanted to update all services
that run on the server. I checked
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md),
didn't see anything concerning and then started the update. It went through and GoToSocial started up just fine.

But the instance start page showed me 0 users, 0 posts and 0 federated instances. **Something had gone horribly wrong!**

## Migrations

It was pretty clear to me that the migrations had gone wrong.
The [GoToSocial migration notes](https://codeberg.org/superseriousbusiness/gotosocial/releases/tag/v0.20.0-rc1)
specifically mentioned long-running migrations that could take several hours. I assumed that somehow, during the running
database migration, the service must have restarted and left the DB in a broken state. This had happened to me before.

Well, that's what backups are for, so let's pull one.

## Backups

Backups for this server are done in two ways:

* via postgres-backup: Backups of the database are written to disk
* via [borgmatic](https://torsion.org/borgmatic/): Backups via borg are written to backup nodes, one of them at my home

They run every night automatically, monitored by [Healthchecks](https://healthchecks.io/). I triggered a manual run
before the update, so that was the one I mounted using [Vorta](https://vorta.borgbase.com/).

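For reference, the CLI equivalent of that manual run and mount looks roughly like this (just a sketch; Vorta does the
same through its GUI, and the mount point is a placeholder):

```
# trigger the backup job by hand and wait for it to finish
borgmatic --verbosity 1

# mount the most recent archive to inspect its contents
mkdir -p /mnt/backup-check
borgmatic mount --archive latest --mount-point /mnt/backup-check
```
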
And then the realization.

```
mash-postgres:5432 $ ls -lh
total 2.1M
-r-------- 1 moanos root 418K Oct 05 04:03 gitea
-r-------- 1 moanos root 123K Oct 05 04:03 healthchecks
-r-------- 1 moanos root 217K Oct 05 04:03 ilmo
-r-------- 1 moanos root 370K Oct 05 04:03 notfellchen
-r-------- 1 moanos root 67K Oct 05 04:03 oxitraffic
-r-------- 1 moanos root 931 Oct 05 04:03 prometheus_postgres_exporter
-r-------- 1 moanos root 142K Oct 05 04:03 semaphore
-r-------- 1 moanos root 110K Oct 05 04:03 vaultwarden
-r-------- 1 moanos root 669K Oct 05 04:03 woodpecker_ci_server
```

Fuck. The database `gay-pirate-assassins` is not there. Why?

To explain that I have to tell you how it *should* work: Services deployed by the mash-playbook are automatically wired
to the database and reverse proxy by a complex set of Ansible variables. This is great, because adding a service can
therefore be as easy as adding

```
healthchecks_enabled: true
healthchecks_hostname: health.hyteck.de
```

to the `vars.yml` file.

This will then configure the postgres database automatically, based on the `group_vars`. They look like this:

```
mash_playbook_postgres_managed_databases_auto_itemized:
  - |-
    {{
      ({
        'name': healthchecks_database_name,
        'username': healthchecks_database_username,
        'password': healthchecks_database_password,
      } if healthchecks_enabled and healthchecks_database_hostname == postgres_connection_hostname and healthchecks_database_type == 'postgres' else omit)
    }}
```

Note that a healthchecks database is only added to the managed databases if `healthchecks_enabled` is `True`.

This is really useful for backups because the borgmatic configuration also pulls the list
`mash_playbook_postgres_managed_databases_auto_itemized`. Therefore, you do not need to specify which databases to back
up; it just backs up all managed databases.

However, the database for gay-pirate-assassins was not managed. In the playbook it's only possible to configure a
service once. You cannot manage multiple GoToSocial instances in the same `vars.yml`. In the past, I had two instances
of GoToSocial running on the server. I therefore
followed [the how-to of "Running multiple instances of the same service on the same host"](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/docs/running-multiple-instances.md).

Basically this means that an additional `vars.yml` must be created that is treated as a completely different server.
Databases must be created manually as they are not managed.

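For such an unmanaged instance, the manual database setup boils down to something like the following sketch (container
name as above; role name and password are placeholders):

```
# create a role and a database for the extra, unmanaged instance by hand
docker exec mash-postgres psql -U postgres -c \
  "CREATE ROLE \"gay-pirate-assassins\" LOGIN PASSWORD 'REDACTED';"
docker exec mash-postgres psql -U postgres -c \
  "CREATE DATABASE \"gay-pirate-assassins\" OWNER \"gay-pirate-assassins\";"
```

And exactly because this happens outside the playbook, nothing else, borgmatic included, ever learns that this database
exists.
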
With that knowledge you can understand that when I say the database for gay-pirate-assassins was not managed,
this means it was not included in the list of databases to be backed up. The backup service thought it ran successfully
because it backed up everything it knew of.

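In hindsight, a simple comparison of what Postgres actually hosts against what got dumped would have exposed the gap
immediately. A rough sketch (the container name matches my setup above, the dump path is a placeholder):

```
# databases that actually exist in the postgres container
docker exec mash-postgres psql -U postgres -c '\l'

# databases that postgres-backup dumped last night
# every database in the first list should show up here
ls -lh /mash/postgres-backup/latest/
```
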
So this left me with a three-month-old backup. Unacceptable.

## Investigating

So the existing database needed to be rescued. I SSHed into the server and checked the database. It looked completely
normal.
I asked the devs if they could provide me with the migrations, as they already did in the past. However, they pointed
out that the migrations are too difficult for that approach. They suggested deleting the most recent migration entry to
force a re-run of the migrations.

Here is where I was confused, because this was the `bun_migrations` table:

```
gay-pirate-assassins=# SELECT * FROM bun_migrations ORDER BY id DESC LIMIT 5;
 id  |      name      | group_id |          migrated_at
-----+----------------+----------+-------------------------------
 193 | 20250324173534 |       20 | 2025-04-23 20:00:33.955776+00
 192 | 20250321131230 |       20 | 2025-04-23 19:58:06.873134+00
 191 | 20250318093828 |       20 | 2025-04-23 19:57:50.540568+00
 190 | 20250314120945 |       20 | 2025-04-23 19:57:30.677481+00
```

The last migration ran in April, when I updated to `v0.19.1`. Strange.

At this point I went on vacation and paused investigations, not only because the vacation was great, but also because I
was bamboozled by this state.

---

After my vacation I came back and made a manual backup of the database.

```
$ docker run -e PGPASSWORD="XXXX" -it --rm --network mash-postgres postgres pg_dump -U gay-pirate-assassins -h mash-postgres gay-pirate-assassins > manual-backup/gay-pirate-assassins-2025-10-13.sql
```

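Should the next step have gone wrong, restoring that dump would roughly be the reverse operation (a sketch, not
something I ended up needing):

```
# feed the plain-SQL dump back into a fresh/empty database via psql
docker run -e PGPASSWORD="XXXX" -i --rm --network mash-postgres postgres \
  psql -U gay-pirate-assassins -h mash-postgres gay-pirate-assassins \
  < manual-backup/gay-pirate-assassins-2025-10-13.sql
```
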
Then I deleted the last migration, as I was advised:

```
DELETE FROM bun_migrations WHERE id=193;
```

and restarted the server. While watching the server come up, it hit me in the face:

```
Oct 12 08:31:29 s3 mash-gpa-gotosocial[2251925]: timestamp="12/10/2025 08:31:29.905" func=bundb.sqliteConn level=INFO msg="connected to SQLITE database with address file:/opt/gotosocial/sqlite.db?_pragma=busy_timeout%281800000%29&_pragma=journal_mode%>
Oct 12 13:38:46 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:46.588" func=router.(*Router).Start.func1 level=INFO msg="listening on 0.0.0.0:8080"
```

The server is **starting from a completely different database**! That explains why

* the last migration was never run
* the server showed me 0 users, 0 posts and 0 federated instances even though the postgres database had plenty of those

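In hindsight, a single grep over the journal would have surfaced this on day one; roughly (assuming the systemd unit is
named like the log prefix above):

```
# which database backend did GoToSocial actually connect to?
journalctl -u mash-gpa-gotosocial --since "2025-10-05" | grep -E 'sqliteConn|pgConn'
```
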
All of a sudden a SQLite database was configured. This happened because
of [this commit](https://github.com/mother-of-all-self-hosting/ansible-role-gotosocial/commit/df34af385f9765bda8f160f6985a47cb7204fe96),
which introduced SQLite support and set it as the default. This was not mentioned in
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md).

So what happened is that the config changed, the server was restarted, and an empty SQLite DB was initialized. The
postgres DB never started to migrate.

## Fixing

To fix it, I did the following:

1. Configure the playbook to use postgres for GoToSocial:

   ```
   # vars.yml
   gotosocial_database_type: postgres
   ```

2. Run the playbook to configure GoToSocial (without starting the service):

   ```
   just run-tags install-gotosocial
   ```

3. Check that the configuration is correct (see the sketch below)
4. Start the service

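For step 3, the check essentially boils down to making sure the rendered GoToSocial config points at postgres again; a
rough sketch (the config path is a placeholder, the expected values are the ones from my setup):

```
# the generated config should reference postgres, not sqlite
grep -E '^db-(type|address|database):' /mash/gotosocial/config.yaml
# expected something like:
# db-type: postgres
# db-address: mash-postgres
# db-database: gay-pirate-assassins
```
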
The migrations took several hours, but after that, everything looked stable again. I don't think there are any lasting
consequences. However, the server was unavailable for several days.

## Learnings

I believe the main issue here was not the config change that went unnoticed by me. While I'd ideally notice stuff
like this, the server is a hobby, and I'll continue not to check every config option that changes.

The larger issue was the backup. Having a backup would have made this easy to solve. And there are other, less fortunate
failure modes where I'd be completely lost without a backup. So to make sure this doesn't happen again, I did/will do the
following:

### 1. Mainstream the config

As explained, I used a specific non-mainstream setup in the Ansible playbook because, in the past, I ran two instances
of GoToSocial on the server. After shutting down one of them, I never moved gay-pirate-assassins to be part of the main
config. This means important parts of the configuration had to be done manually, which I botched.

So in the past week I cleaned up, and gay-pirate-assassins is now part of the main `vars.yml` and will benefit from all
relevant automations.

### 2. Checking backups

I was confident in my backups because:

* they run every night very consistently. If they fail, e.g. because of a network outage, I reliably get a warning.
* I verified a successful run of the backup job prior to upgrading

The main problem was that I assumed a successful run of the backup command meant a successful backup. Everyone will tell
you that a backup that is not tested is not to be trusted. And they are right. However, doing frequent test-restores
exceeds my time and server capacity. So what I'll do instead is the following:

* mount the backup before an upgrade
* `tail` the backup file as created by postgres-backup and ensure the data is from the same day
* check media folders for the last changed image

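Concretely, that pre-upgrade check might look roughly like this (a sketch; all paths below are placeholders for my
actual layout):

```
# mount the most recent borg archive
borgmatic mount --archive latest --mount-point /mnt/backup-check

# the postgres-backup dump should exist and contain today's data
tail -n 20 /mnt/backup-check/mash/postgres-backup/gay-pirate-assassins

# the most recently changed media file should also be recent
find /mnt/backup-check/mash/gotosocial/storage -type f -printf '%T@ %p\n' | sort -n | tail -n 5

# unmount when done
borgmatic umount --mount-point /mnt/backup-check
```
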
This is not a 100% guarantee, but I'd argue it's a pretty good compromise for now. As I mount backups more often and
therefore get faster at it, I'll re-evaluate doing a test-restore at least semi-regularly.

## Conclusion

I fucked up, but I was lucky that my error was recoverable and no data was lost. Next time this will hopefully be due
not to luck, but to better planning!

Any questions? Let me know!

content/post/postmortem-gpa/logs.md (new file, 22 lines)

```
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.266" func=cache.(*Caches).Start level=INFO msg="start: 0xc002476008"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.303" func=bundb.pgConn level=INFO msg="connected to POSTGRES database"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.328" func=migrations.init.110.func1 level=INFO msg="creating statuses column thread_id_new"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.872" func=bundb.queryHook.AfterQuery level=WARN duration=6.528757799s query="SELECT count(*) FROM \"statuses\"" msg="SLOW DATABASE QUERY"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.873" func=migrations.init.110.func1 level=WARN msg="rethreading 4611812 statuses, this will take a *long* time"
Oct 12 09:33:38 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:38.111" func=migrations.init.110.func1 level=INFO msg="[~0.02% done; ~137 rows/s] migrating threads"
Oct 12 09:33:44 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:44.618" func=migrations.init.110.func1 level=INFO msg="[~0.04% done; ~171 rows/s] migrating threads"
```

```
Oct 12 13:38:08 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:08.726" func=migrations.init.110.func1 level=INFO msg="[~99.98% done; ~148 rows/s] migrating stragglers"
Oct 12 13:38:10 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:10.309" func=migrations.init.110.func1 level=INFO msg="[~99.99% done; ~162 rows/s] migrating stragglers"
Oct 12 13:38:12 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:12.192" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~141 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.711" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~136 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.714" func=migrations.init.110.func1 level=INFO msg="dropping temporary thread_id_new index"
```

static/uploads/postmortem.png (new binary file, 86 KiB; not shown)