Need a Postgres expert to help fix the Peertube DB. Willing to throw some cash to someone who helps.
Yes,
@omnipotens I messed up my last migration, but if anyone wants a second pair of eyes I'd be delighted to join.
@omnipotens cc @esm @esm 👀
@kainoa @omnipotens @esm i'd probably be down to help with postgres stuff, whether or not i'm an "expert" at it is debatable though lol

@esm @kainoa @esm

The issue is the database got corrupted

"videoRedundancy" failed: PQgetResult() failed.
pg_dump: error: Error message from server: ERROR: could not access status of transaction 3004004288
DETAIL: Could not open file "pg_xact/0B30": No such file or directory.
pg_dump: error: The command was: COPY public."videoRedundancy" (id, "expiresOn", "fileUrl", url, strategy, "videoFileId", "actorId", "createdAt", "updatedAt", "videoStreamingPlaylistId") TO stdout;

@omnipotens @kainoa @esm first of all i'd suggest restoring from a backup if you have one. if you don't for whatever reason, you should set up an auto-backup script that uses pg_dumpall after this

give a restart a try; if that doesn't fix it, you could try messing with pg_dump flags. --data-only and --schema-only are good ones to try (separately)
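For the auto-backup idea, a minimal sketch (the script path, backup directory, and 14-day retention are my assumptions, not anything PeerTube ships with):

```shell
# Minimal nightly-backup sketch using pg_dumpall, as suggested above.
# /tmp/pg_backup.sh and /var/backups/postgres are placeholder paths.
cat > /tmp/pg_backup.sh <<'EOF'
#!/bin/sh
# Dump every database in the cluster, compressed, named by date.
BACKUP_DIR=/var/backups/postgres
mkdir -p "$BACKUP_DIR"
sudo -u postgres pg_dumpall | gzip > "$BACKUP_DIR/all-$(date +%F).sql.gz"
# Keep two weeks of dumps.
find "$BACKUP_DIR" -name 'all-*.sql.gz' -mtime +14 -delete
EOF
chmod +x /tmp/pg_backup.sh
# Then schedule it, e.g. in root's crontab: 0 3 * * * /tmp/pg_backup.sh
```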

@esm @kainoa @esm

The data-only failed.

How to determine which foreign keys are circular? As highlighted by warning from pg_dump
(Database Administrators Stack Exchange)

@omnipotens @kainoa @esm hmmmm, it seems like it keeps failing on the videoRedundancy table specifically; I'm not too familiar with peertube but after looking at the docs I think it might just be possible to recreate that table

do the actual postgres db daemon logs say anything whenever you attempt to pg_dump? we may be able to isolate and recreate what's affected

@esm @kainoa @esm

This is what comes up in the logs when I try a dump

@omnipotens @kainoa @esm about the same then

are you able to get the indexes in the videoRedundancy table, in either the current db or the backup? if not then i'd try dropping that specific table and restoring it from the november backup

@esm @kainoa @esm

Sorry, my database skills are very limited and I am not too sure how to do that. I have not really messed with DBs in 10 years.

@omnipotens @kainoa @esm ah, do you have access to some sort of UI like pgAdmin, Adminer, or DBeaver? if not, then try running this in the postgres shell (sudo -u postgres psql) to try and get the indexes on the table:

SELECT indexname, indexdef FROM pg_indexes WHERE tablename = "peertube_prod.public.videoRedundancy";

@esm @kainoa @esm

peertube_prod=> SELECT indexname, indexdef FROM pg_indexes WHERE tablename = "peertube_prod.public.videoRedundancy";
ERROR: column "peertube_prod.public.videoRedundancy" does not exist
LINE 1: ...xname, indexdef FROM pg_indexes WHERE tablename = "peertube_...
^

@omnipotens @kainoa @esm apologies, try replacing "peertube_prod.public.videoRedundancy" with 'videoRedundancy'

@esm @kainoa @esm

same
peertube_prod=> SELECT indexname, indexdef FROM pg_indexes WHERE tablename = "videoRedundancy";
ERROR: column "videoRedundancy" does not exist
LINE 1: ...xname, indexdef FROM pg_indexes WHERE tablename = "videoRedu...

@omnipotens @kainoa @esm single quotes, not double, postgres is a bit picky with that
peertube_prod=> SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'vi - Pastebin.com


@omnipotens @kainoa @esm hmmm, so the table isn't completely toast, it just seems like the data is messed up

after backing up the current data directory as wolf480pl mentioned earlier (not a dump, just copying the /var/lib/pgsql or /var/lib/postgresql directory outright), give pg_restore --data-only -d peertube_prod -t videoRedundancy <november.dump> a shot (replace with the path to the database backup from november)
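Spelled out, those two steps might look like this (a sketch: DRYRUN=echo so the commands are only printed, and the data-directory and dump paths are placeholders to adjust):

```shell
# Sketch of the steps above; paths are hypothetical.
NOV_DUMP=/backups/peertube-nov2022.dump   # placeholder for the November dump
DRYRUN=echo                               # drop this to actually execute
# 1) with postgres stopped, copy the whole data directory aside first
$DRYRUN cp -a /var/lib/postgresql /var/lib/postgresql.bak
# 2) restore only that table's rows from the old dump
$DRYRUN pg_restore --data-only -d peertube_prod -t videoRedundancy "$NOV_DUMP"
```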

@esm @kainoa @esm

Here is a bit more logs

@omnipotens @esm @esm could you try putting them on a pastebin? It's hard to read from screenshots
==> /var/log/postgresql/postgresql-12-main.log <==2023-05-30 18:29:01.074 - Pastebin.com

@esm @omnipotens @esm in my opinion I'd say you're far better than most 👑

@esm @kainoa @esm

At this point if we lose a little bit of data I'd be ok, but the last successful backup was Nov 2022.

I cannot even take a backup now.

@omnipotens @esm @kainoa @esm
That pg_xact (formerly known as pg_clog) directory contains some kind of "transaction metadata log" (which is different from WAL but I don't quite get in what way) and it seems like it's very important for postgres...

The warning about circular dependencies seems like a red herring, but the missing file in pg_xact seems like a severe issue.

I suggest you first make a copy of your postgres data directory (the one that contains subdirs like base, pg_xlog, etc.)
1/

@omnipotens @esm @kainoa @esm

And the backup is so that if you lose data while trying to fix it, you can try again.

The problem looks similar to the error described here:
https://dba.stackexchange.com/questions/325104/error-could-not-access-status-of-transaction-could-not-open-file-pg-xact-0029

Except there, restarting the database fixed it.

If this is something that postgres's builtin recovery process can fix, a restart should help.

It's also possible that you trying to dump that table takes too long and creates the linked issue, but then I'd expect the number in the filename to increase


@wolf480pl @esm @kainoa @esm

yeah, restarting it did not help, tried that already.

@omnipotens @esm @kainoa @esm

Do you have a backup of the postgres data directory?

@wolf480pl @esm @kainoa @esm

Looks like the last one available was Nov2022

@omnipotens @esm @kainoa @esm

Make another one now.
In addition to the one you already have.

Yes, it's corrupted (apparently), but if we try fixing it and make it worse, this will let you try again.

@omnipotens @esm @kainoa @esm
Ideas for *after* you've made a backup of the postgres data directory:

- make sure nothing other than your psql is connected to postgres (check pg_stat_activity to see what is connected)
- disable triggers [1]
- try getting a single row from the table:
select * from "videoRedundancy" limit 1;

If this doesn't cause the error, try selecting all the rows (no limit), if that works try pg_dump
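As concrete commands, the checklist above might look like this (a sketch: DRYRUN=echo keeps it from touching a live cluster; the database and table names are taken from earlier in the thread):

```shell
DRYRUN=echo   # drop this to actually run the queries
# who else is connected?
$DRYRUN psql -U postgres -d peertube_prod -c \
  'SELECT pid, usename, application_name FROM pg_stat_activity;'
# one-row probe of the damaged table (double quotes: mixed-case identifier)
$DRYRUN psql -U postgres -d peertube_prod -c \
  'SELECT * FROM "videoRedundancy" LIMIT 1;'
```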

@wolf480pl @esm @kainoa @esm

I am starting to think it is hosed

@omnipotens @esm @kainoa @esm
There are levels of hosed.

I'm no expert but I suspect most of the data from that table is still there, the question is how much effort you're willing to spend to recover it.

If you're getting this error on a 1-row select from that table even if that's the first query after restart, then I don't know how to proceed, but I suspect someone on postgres mailing list does.

@omnipotens @esm @kainoa @esm

Ok so we followed up in DMs and it turned out that:

1. The transaction ID from the error message was higher than txid_current() - it was in the future. It couldn't've been real, must've been postgres reading garbage where it'd expect transaction_id, similar to the "garden-variety data corruption" described in [1].

/cont

[1]: https://www.postgresql.org/message-id/11462.1238032941%40sss.pgh.pa.us

Re: could not access status of transaction

Tom Duffey <[email protected]> writes: > One of our databases suffered a problem yesterday during a normal > update, something we …

PostgreSQL Mailing List Archives

@omnipotens @esm @kainoa @esm

2. Looking at the table "heap" file (base/<db-oid>/<tbl-oid>) with hexdump, and searching for the little-endian hex encoding of the corrupted txid, it appeared only in one place.
Compared to rows seen earlier in the file, it seemed like 16 bytes earlier than a row should start. Possibly the row pointers at the start of a page got borked during a hardware failure while the rows were being moved around.

/cont
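The byte-order arithmetic in step 2 can be checked mechanically. This sketch derives the little-endian byte pattern for the txid from the original error; the hexdump command at the end is illustrative, with placeholder OIDs:

```shell
# Corrupted txid from the pg_dump error above.
TXID=3004004288
printf 'txid hex: %08X\n' "$TXID"                    # → txid hex: B30D77C0
# Little-endian byte order, as it would appear in the heap file:
printf '%08X' "$TXID" | fold -w2 | tac | tr '\n' ' '; echo   # → C0 77 0D B3
# Sanity check: each pg_xact segment covers 1048576 xids (2 status bits per
# xid, 8192-byte pages, 32 pages per segment), so the segment number is:
printf 'pg_xact segment: %04X\n' $((TXID / 1048576)) # → pg_xact segment: 0B30
# which matches the missing "pg_xact/0B30" file from the error.
# Then search the heap file for the byte pattern, e.g.:
# hexdump -C base/<db-oid>/<tbl-oid> | grep -i 'c0 77 0d b3'
```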

@omnipotens @esm @kainoa @esm

3. Trying to add fake pg_xact files as described in the "pg_clog problems" slide of [2] didn't help: it found another corrupted txid, one that appeared in many places in the file. The number of rows that I could select with `select * from <table> limit <n>` before getting an error did not go up. I suspected all rows on that page had wrong offsets.

/cont

[2]: https://thebuild.com/presentations/worst-day-fosdem-2014.pdf
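As I understand it, the fake-pg_xact trick in step 3 amounts to creating a full CLOG segment whose status bits all read "committed". A sketch against a scratch file (a real segment would live at pg_xact/<seg> inside the data directory; the segment name here is illustrative):

```shell
# A CLOG segment is 256 kB: 32 pages x 8 kB, two status bits per xid.
# Byte 0x55 = binary 01 01 01 01, i.e. four "committed" statuses per byte.
seg=$(mktemp)    # scratch stand-in for e.g. pg_xact/0B30
dd if=/dev/zero bs=262144 count=1 2>/dev/null | tr '\0' '\125' > "$seg"
wc -c < "$seg"              # → 262144
od -An -tx1 -N4 "$seg"      # first bytes: 55 55 55 55
```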

@omnipotens @esm @kainoa @esm

4. Overwriting the whole page with zeros (stopping postgres, copying the file to the side, dd-ing into it with conv=notrunc and the right bs/seek/count, verifying, copying the file back) as advised in the "really bad heap damage" slide of [2] did make it possible to read more rows. There was another error with a different corrupted txid, which was also unique, and much much later in the file. Zeroing out that page too allowed reading the rest of the table.
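The dd incantation in step 4 is easy to get wrong, so here is a sketch against a scratch file instead of a real base/<db-oid>/<tbl-oid> heap file; the damaged page number is a made-up example:

```shell
BS=8192          # default PostgreSQL block size
PAGE=1           # hypothetical damaged page number (0-based)
f=$(mktemp)
# build a 3-page scratch "heap file" of 0xFF bytes
dd if=/dev/zero bs=$BS count=3 2>/dev/null | tr '\0' '\377' > "$f"
cp "$f" "$f.bak"   # save the original page contents first, as in step 5
# zero exactly one page in place; conv=notrunc preserves the rest of the file
dd if=/dev/zero of="$f" bs=$BS seek=$PAGE count=1 conv=notrunc 2>/dev/null
# verify: only that page's bytes should differ from the backup
ndiff=$( { cmp -l "$f" "$f.bak" || true; } | wc -l )
echo "ndiff=$ndiff"   # → ndiff=8192
```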

@omnipotens @esm @kainoa @esm

5. I did save the copy of those pages before zeroing them out. They contain string fields with URLs, which are easy to spot and could be used as an anchor to guess correct row offsets and extract data from those rows. Maybe even extract correct txids, but I wouldn't know how to check if these rows are live or dead, so could end up with extra data that shouldn't be there.

@omnipotens @esm @kainoa @esm

Also, the bitfields in the row pointers confused me, maybe if I understood them it'd be possible to fix those two pages in-place. Or maybe not.

Main takeaway: it's usually recoverable, if you're not working under stress.

Yesterday I helped linuxrocks.online folks fix their peertube's postgres db after a corruption and an ominous "ERROR: could not access status of transaction <number>" errors.

Turns out postgres's physical data layout isn't all that complicated.

See upthread for details, but reply here as I don't know if folks above want this in their mentions.

@omnipotens were you able to solve the issue?