feat: add gpa postmortem
	
		
			
	
		
	
	
		
	
		
content/post/postmortem-gpa/index.md | 252 | Normal file
							@@ -0,0 +1,252 @@
---
title: "Postmortem - how to completely screw up an update"
date: 2025-10-19T12:05:10+02:00
lastmod: 2025-10-19T16:00:04+02:00
draft: false
image: "uploads/postmortem.png"
categories: [ 'English' ]
tags: [ 'backup', 'postmortem', 'fediverse', 'gotosocial' ]
---

The fediverse instance [gay-pirate-assassins.de](https://gay-pirate-assassins.de) was down for a couple of days. This
postmortem outlines what went wrong and what I did to prevent things from going that wrong in the future.

# Timeline

* 2025-10-05 17:26: [Update announcement](https://gay-pirate-assassins.de/@moanos/statuses/01K6TFQ1HVPAR6AYN08XYQ7XFV)
* 2025-10-05 ~17:45: Update started
* 2025-10-05 ~18:00: Services restart
* 2025-10-05 ~18:00: GoToSocial doesn't come up
* 2025-10-12 ~10:00: Issue is found
* 2025-10-12 10:30: Issue is fixed
* 2025-10-12 10:31: GoToSocial is started, migrations start
* 2025-10-12 15:38: Migrations finished successfully
* 2025-10-12 15:38: Service available again
* 2025-10-12 18:36: [Announcement sent](https://gay-pirate-assassins.de/@moanos/statuses/01K7CMGF7S2TE39792CMADGEPJ)

All times are given in CEST.

## The beginning: An update goes wrong

I run a small fediverse server with a few users called [gay-pirate-assassins](https://gay-pirate-assassins.de/), which is powered by [GoToSocial](https://gotosocial.org/).
The (amazing) GoToSocial devs released `v0.20.0-rc1` and `v0.20.0-rc2`. The new features seemed pretty cool, I'm
impatient, and the second release candidate seemed stable, so I decided to update to `v0.20.0-rc2`.
I started a backup (via borgmatic), waited for it to finish and confirmed it ran successfully.
Then I changed the version number in the [mash](https://github.com/mother-of-all-self-hosting/mash-playbook) Ansible
playbook I use. I also pulled the newest version of the playbook and its roles because I wanted to update all services
that run on the server. I checked
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md),
didn't see anything noteworthy, and started the update. It went through and GoToSocial started up just fine.
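
For context, the whole update boils down to a handful of commands. Roughly this (the `just` recipe names here are from
memory and may differ in your checkout, so treat it as a sketch):

```
# back up first and confirm the run reports success
borgmatic create --stats

# bump the GoToSocial version in vars.yml, then pull the newest playbook and its roles
git pull
just roles   # recipe name is an assumption

# run the playbook for all enabled services
just install-all   # recipe name is an assumption
```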

But the instance start page showed me 0 users, 0 posts and 0 federated instances. **Something has gone horribly wrong!**

## Migrations

It was pretty clear to me that the migrations had gone wrong.
The [GoToSocial migration notes](https://codeberg.org/superseriousbusiness/gotosocial/releases/tag/v0.20.0-rc1)
specifically mentioned long-running migrations that could take several hours. I assumed that somehow, during the running
database migration, the service must have restarted and left the DB in a broken state. This had happened to me before.

Well, that's what backups are for, so let's pull one.

## Backups

Backups for this server are done in two ways:

* via postgres-backup: Backups of the database are written to disk
* via [borgmatic](https://torsion.org/borgmatic/): Backups via borg are written to backup nodes, one of them at my home

They run every night automatically, monitored by [Healthchecks](https://healthchecks.io/). I had triggered a manual run
before the update, so that is the one I mounted using [Vorta](https://vorta.borgbase.com/).
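
Vorta is just a GUI for borg; the same mount can be done with borgmatic on the CLI, roughly like this (the mount point
is arbitrary and `latest` picks the newest archive; a sketch, not a full restore procedure):

```
# list archives and mount the most recent one read-only
borgmatic list
borgmatic mount --archive latest --mount-point /mnt/backup-check

# ...poke around in /mnt/backup-check...

borgmatic umount --mount-point /mnt/backup-check
```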

And then the realization.

```
mash-postgres:5432 $ ls -lh
total 2.1M
-r-------- 1 moanos root 418K Oct 05 04:03 gitea
-r-------- 1 moanos root 123K Oct 05 04:03 healthchecks
-r-------- 1 moanos root 217K Oct 05 04:03 ilmo
-r-------- 1 moanos root 370K Oct 05 04:03 notfellchen
-r-------- 1 moanos root  67K Oct 05 04:03 oxitraffic
-r-------- 1 moanos root  931 Oct 05 04:03 prometheus_postgres_exporter
-r-------- 1 moanos root 142K Oct 05 04:03 semaphore
-r-------- 1 moanos root 110K Oct 05 04:03 vaultwarden
-r-------- 1 moanos root 669K Oct 05 04:03 woodpecker_ci_server
```

Fuck. The database gay-pirate-assassins is not there. Why?

To explain that, I have to tell you how it *should* work: Services deployed by the mash-playbook are automatically wired
to the database and reverse proxy by a complex set of Ansible variables. This is great, because adding a service can
therefore be as easy as adding

```
healthchecks_enabled: true
healthchecks_hostname: health.hyteck.de
```

to the `vars.yml` file.

This will then configure the Postgres database automatically, based on the `group_vars`. They look like this:

```
mash_playbook_postgres_managed_databases_auto_itemized:
    - |-
      {{
          ({
          'name': healthchecks_database_name,
          'username': healthchecks_database_username,
          'password': healthchecks_database_password,
          } if healthchecks_enabled and healthchecks_database_hostname == postgres_connection_hostname and healthchecks_database_type == 'postgres' else omit)
      }}
```

Note that a healthchecks database is only added to the managed databases if `healthchecks_enabled` is `true`.

This is really useful for backups because the borgmatic configuration also pulls the list
`mash_playbook_postgres_managed_databases_auto_itemized`. Therefore, you do not need to specify which databases to back
up; it just backs up all managed databases.
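
In hindsight, a cheap sanity check would have caught this: compare the databases that actually exist in Postgres with
the dump files postgres-backup writes. A rough sketch (the superuser name and dump directory are assumptions about my
setup, adjust to yours):

```
# every non-template database that exists in the Postgres container
docker exec mash-postgres psql -U postgres -Atc \
  "SELECT datname FROM pg_database WHERE NOT datistemplate;" | sort > /tmp/dbs-in-postgres

# every dump file the nightly postgres-backup job wrote (directory is an assumption)
ls /mash/postgres-backup | sort > /tmp/dbs-dumped

# anything printed here exists in Postgres but is missing from the backups
comm -23 /tmp/dbs-in-postgres /tmp/dbs-dumped
```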

However, the database for gay-pirate-assassins was not managed. In the playbook it's only possible to configure a
service once: you cannot manage multiple GoToSocial instances in the same `vars.yml`. In the past, I had two instances
of GoToSocial running on the server. I therefore
followed [the how-to "Running multiple instances of the same service on the same host"](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/docs/running-multiple-instances.md).

Basically this means that an additional `vars.yml` must be created that is treated as a completely different server.
Databases must be created manually, as they are not managed.

With that knowledge you can understand that when I say the database for gay-pirate-assassins was not managed,
it means it was not included in the list of databases to be backed up. The backup service thought it ran successfully,
because it backed up everything it knew of.

So this left me with a three-month-old backup. Unacceptable.

## Investigating

So the existing database needed to be rescued. I SSHed into the server and checked the database. It looked completely
normal.
I asked the devs if they could provide me with the migrations, as they had done in the past. However, they pointed
out that the migrations are too complex for that approach. They suggested deleting the latest migration to force a
re-run of the migrations.

Here is where I was confused, because this was the `bun_migrations` table:

```
gay-pirate-assassins=# SELECT * FROM bun_migrations ORDER BY id DESC LIMIT 5;
 id  |      name      | group_id |          migrated_at
-----+----------------+----------+-------------------------------
 193 | 20250324173534 |       20 | 2025-04-23 20:00:33.955776+00
 192 | 20250321131230 |       20 | 2025-04-23 19:58:06.873134+00
 191 | 20250318093828 |       20 | 2025-04-23 19:57:50.540568+00
 190 | 20250314120945 |       20 | 2025-04-23 19:57:30.677481+00
```

The last migration ran in April, when I updated to `v0.19.1`. Strange.

At this point I went on vacation and paused investigations, not only because the vacation was great, but also because I
was bamboozled by this state.

---

After my vacation I came back and made a manual backup of the database:

```
$ docker run -e PGPASSWORD="XXXX" -it --rm --network mash-postgres postgres pg_dump -U gay-pirate-assassins -h mash-postgres gay-pirate-assassins > manual-backup/gay-pirate-assassins-2025-10-13.sql
```
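
Had the rescue gone sideways, that dump could have been fed back in the same way, roughly (same container and network
names as above, `-i` instead of `-it` so stdin can be redirected; a sketch, not something I ended up needing):

```
$ docker run -e PGPASSWORD="XXXX" -i --rm --network mash-postgres postgres psql -U gay-pirate-assassins -h mash-postgres gay-pirate-assassins < manual-backup/gay-pirate-assassins-2025-10-13.sql
```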

Then I deleted the latest migration, as I was advised:

```
DELETE FROM bun_migrations WHERE id=193;
```

and restarted the server. While watching the server come up, it hit me in the face:

```
Oct 12 08:31:29 s3 mash-gpa-gotosocial[2251925]: timestamp="12/10/2025 08:31:29.905" func=bundb.sqliteConn level=INFO msg="connected to SQLITE database with address file:/opt/gotosocial/sqlite.db?_pragma=busy_timeout%281800000%29&_pragma=journal_mode%>
Oct 12 13:38:46 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:46.588" func=router.(*Router).Start.func1 level=INFO msg="listening on 0.0.0.0:8080"
```

The server is **starting from a completely different database**! That explains why

* the last migration was never done
* the server showed me 0 users, 0 posts and 0 federated instances, even though the Postgres database had plenty of those

All of a sudden an SQLite database was configured. This happened because
of [this commit](https://github.com/mother-of-all-self-hosting/ansible-role-gotosocial/commit/df34af385f9765bda8f160f6985a47cb7204fe96),
which introduced SQLite support and set it as the default. This was not mentioned in
the [Changelog](https://github.com/mother-of-all-self-hosting/mash-playbook/blob/main/CHANGELOG.md).

So what happened is that the config changed, the server was restarted, and an empty SQLite DB was initialized. The
Postgres DB never started to migrate.

## Fixing

To fix it, I did the following:

1. Configure the playbook to use Postgres for GoToSocial:

```
# vars.yml
gotosocial_database_type: postgres
```

2. Run the playbook to configure GoToSocial (but without starting the service):

```
just run-tags install-gotosocial
```

3. Check that the configuration is correct (see the sketch below)
4. Start the service
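
For step 3, I mainly wanted to see that the rendered GoToSocial config pointed at Postgres again before starting
anything. Roughly what I checked (the config path is an assumption about where the role renders it; the service name is
the one from my logs):

```
# the rendered config should now say db-type: postgres, not sqlite
grep -E 'db-(type|address)' /mash/gpa-gotosocial/config.yaml   # path is an assumption

# only then start the service and watch it come up
systemctl start mash-gpa-gotosocial.service
journalctl -fu mash-gpa-gotosocial.service
```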

The migrations took several hours, but after that everything looked stable again. I don't think there are any lasting
consequences. However, the server was unavailable for several days.

## Learnings

I believe the main issue here was not the config change that went unnoticed by me. While I'd ideally notice stuff
like this, the server is a hobby, and I'll continue to not check every config option that changed.

The larger issue was the backup. Having a backup would have made this easy to solve. And there are other, less lucky
problems where I'd be completely lost without a backup. So to make sure this doesn't happen again, I did/will do the
following:

### 1. Mainstream the config

As explained, I used a specific non-mainstream setup in the Ansible playbook because, in the past, I ran two instances
of GoToSocial on the server. After shutting down one of them, I never moved gay-pirate-assassins to be part of the main
config. This meant important parts of the configuration had to be done manually, which I botched.

So in the past week I cleaned up, and gay-pirate-assassins is now part of the main `vars.yml` and will benefit from all
relevant automations.

### 2. Checking backups

I was confident in my backups because

* they run every night very consistently. If they fail, e.g. because of a network outage, I reliably get a warning.
* I verified a successful run of the backup job prior to upgrading.

The main problem was me assuming that a successful run of the backup command meant a successful backup. Everyone will
tell you that a backup that is not tested is not to be trusted. And they are right. However, doing frequent test-restores
exceeds my time and server capacity. So what I'll do instead is the following (a rough sketch is shown after the list):

* mount the backup before an upgrade
* `tail` the backup file as created by postgres-backup and ensure the data is from the same day
* check the media folders for the last changed image
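
As a shell sketch of that routine (the mount point, dump path and media path are assumptions about my layout, nothing
the playbook guarantees):

```
# mount the most recent borg archive
borgmatic mount --archive latest --mount-point /mnt/backup-check

# the end of the dump should contain data, and the file date should be from today
tail -n 20 /mnt/backup-check/mash/postgres-backup/gay-pirate-assassins   # path is an assumption
ls -lh /mnt/backup-check/mash/postgres-backup/gay-pirate-assassins

# the newest file in the media folder should also be recent
find /mnt/backup-check/mash/gpa-gotosocial/data -type f -printf '%T@ %p\n' | sort -n | tail -n 1

borgmatic umount --mount-point /mnt/backup-check
```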

This is not a 100% guarantee, but I'd argue it's a pretty good compromise for now. As mounting backups becomes a more
frequent and therefore faster routine, I'll re-evaluate doing a test-restore at least semi-regularly.

## Conclusion

I fucked up, but I was lucky that my error was recoverable and no data was lost. Next time this will hopefully not be
due to luck, but to better planning!

Any questions? Let me know!

content/post/postmortem-gpa/logs.md | 22 | Normal file
@@ -0,0 +1,22 @@

```
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.266" func=cache.(*Caches).Start level=INFO msg="start: 0xc002476008"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.303" func=bundb.pgConn level=INFO msg="connected to POSTGRES database"
Oct 12 09:33:25 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:25.328" func=migrations.init.110.func1 level=INFO msg="creating statuses column thread_id_new"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.872" func=bundb.queryHook.AfterQuery level=WARN duration=6.528757799s query="SELECT count(*) FROM \"statuses\"" msg="SLOW DATABASE QUERY"
Oct 12 09:33:31 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:31.873" func=migrations.init.110.func1 level=WARN msg="rethreading 4611812 statuses, this will take a *long* time"
Oct 12 09:33:38 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:38.111" func=migrations.init.110.func1 level=INFO msg="[~0.02% done; ~137 rows/s] migrating threads"
Oct 12 09:33:44 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 09:33:44.618" func=migrations.init.110.func1 level=INFO msg="[~0.04% done; ~171 rows/s] migrating threads"
```

```
Oct 12 13:38:08 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:08.726" func=migrations.init.110.func1 level=INFO msg="[~99.98% done; ~148 rows/s] migrating stragglers"
Oct 12 13:38:10 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:10.309" func=migrations.init.110.func1 level=INFO msg="[~99.99% done; ~162 rows/s] migrating stragglers"
Oct 12 13:38:12 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:12.192" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~141 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.711" func=migrations.init.110.func1 level=INFO msg="[~100.00% done; ~136 rows/s] migrating stragglers"
Oct 12 13:38:13 s3 mash-gpa-gotosocial[2304549]: timestamp="12/10/2025 13:38:13.714" func=migrations.init.110.func1 level=INFO msg="dropping temporary thread_id_new index"
```