Need a Postgres expert to help fix the Peertube DB. Willing to throw some cash to someone who helps.
@omnipotens cc @esm @esm 👀
@kainoa @omnipotens @esm i'd probably be down to help with postgres stuff, whether or not i'm an "expert" at it is debatable though lol

@esm @kainoa @esm

At this point If we loose a little bit of data I be ok but the last successful backup was nov2022.

I cannot even take a backup now.

@omnipotens @esm @kainoa @esm
That pg_xact (formerly known as pg_clog) directory contains some kind of "transaction metadata log" (which is different from WAL but I don't quite get in what way) and it seems like it's very important for postgres...

The warning about circular dependencies seems like a red herring, but the missing file in pg_xact seems like a severe issue.

I suggest you first make a copy of your postgres data directory (the one that contains subdirs like base, pg_xlog, etc.
1/

@omnipotens @esm @kainoa @esm

And the backup is so that if you lose data while trying to fix it, you can try again.

The problem looks similar to the error described here:
https://dba.stackexchange.com/questions/325104/error-could-not-access-status-of-transaction-could-not-open-file-pg-xact-0029

Except there, restarting the database fixed it.

If this is something that postgres's builtin recovery process can fix, a restart should help.

It's also possible that you trying to dump that table takes to long and creates the linked issue, but then I'd expect the number in the filename to increase

"ERROR: could not access status of transaction Could not open file "pg_xact/0029": No such file or directory" when attempting to LISTEN

When attempting to LISTEN to any channel, PostgreSQL says that Could not open file "pg_xact/0029": No such file or directory. This is a freshly restored database, so it can't be database

Database Administrators Stack Exchange

@wolf480pl @esm @kainoa @esm

yeah restarting it did not help tried that already.

@omnipotens @esm @kainoa @esm

Do you have a backup of the postgres data directory?

@wolf480pl @esm @kainoa @esm

Looks like the last one available was Nov2022

@omnipotens @esm @kainoa @esm

Make another one now.
In addition to the one you already have.

Yes, it's corrupted (apparently), but if we try fixing it and make it worse, this will let you try again.

@omnipotens @esm @kainoa @esm
Ideas for *after* you've made a backup of the postgres data directory:

- make sure nothing other than your psql is connected to postgres (check pg_stat_activity to see what is connected)
- disable triggers [1]
- try getting a single row from the table:
select * from videosomething limit 1;

If this doesn't cause the error, try selecting all the rows (no limit), if that works try pg_dump

@wolf480pl @esm @kainoa @esm

I am starting to think it is hosed

@omnipotens @esm @kainoa @esm
There are levels of hosed.

I'm no expert but I suspect most of the data from that table is still there, the question is how much effort you're willing to spend to recover it.

If you're getting this error on a 1-row select from that table even if that's the first query after restart, then I don't know how to proceed, but I suspect someone on postgres mailing list does.

@omnipotens @esm @kainoa @esm

Ok so we followed up in DMs and it turned out that:

1. The transaction ID from the error message was higher than txid_current() - it was in the future. It couldn't've been real, must've been postgres reading garbage where it'd expect transaction_id, similar to the "garden-variety data corruption" described in [1].

/cont

[1]: https://www.postgresql.org/message-id/11462.1238032941%40sss.pgh.pa.us

Re: could not access status of transaction

Tom Duffey <[email protected]> writes: > One of our databases suffered a problem yesterday during a normal > update, something we …

PostgreSQL Mailing List Archives

@omnipotens @esm @kainoa @esm

2. Looking at the table "heap" file (base/<db-oid>/<tbl-oid>) with hexdump, and searching for the little-endian hex encoding of the corrupted txid, it appeared only in one place.
Compared to rows seen earlier in the file, it seemed like 16 bytes earlier than a row should start. Possibly the row pointers at the start of a page got borked during a hardware failure while the rows were being moved around.

/cont

@omnipotens @esm @kainoa @esm

3. Trying to add fake pg_xact files like described in the "pg_clog problems" slide of [2] didn't help - it found another corrupted txid, one that appeared in many places in the file. The number or rows that I could select with `select * from <table> limit <n>` before getting an error did not go up. I suspected all rows on that page had wrong offsets.

/cont

[2]: https://thebuild.com/presentations/worst-day-fosdem-2014.pdf

@omnipotens @esm @kainoa @esm

4. Overwriting the whole page with zeros (stopping postgres, copying the file to the side, dd-ing into it with conv=notrunc and the right bs/seek/count, verifying, copying the file back) as advised in the "really bad heap damage" slide of [2] did make it possible to read more rows. There was another error with a different corrupted txid, which was also unique, and much much later in the file. Zeroing out that page too allowed reading the rest of the table.

@omnipotens @esm @kainoa @esm

5. I did save the copy of those pages before zeroing them out. They contain string fields with URLs, which are easy to spot and could be used as an anchor to guess correct row offsets and extract data from those rows. Maybe even extract correct txids, but I wouldn't know how to check if these rows are live or dead, so could end up with extra data that shouldn't be there.

@omnipotens @esm @kainoa @esm

Also, the bitfields in the row pointers confused me, maybe if I understood them it'd be possible to fix those two pages in-place. Or maybe not.

Main takeaway: it's usually recoverable, if you're not working under stress.

Yesterday I helped linuxrocks.online folks fix their peertube's postgres db after a corruption and an ominous "ERROR: could not access status of transaction <number>" errors.

Turns out postgres's physical data layout isn't all that complicated.

See upthread for details, but reply here as I don't know if folks above want this in their mentions.