The issue is the database got corrupted
"videoRedundancy" failed: PQgetResult() failed.
pg_dump: error: Error message from server: ERROR: could not access status of transaction 3004004288
DETAIL: Could not open file "pg_xact/0B30": No such file or directory.
pg_dump: error: The command was: COPY public."videoRedundancy" (id, "expiresOn", "fileUrl", url, strategy, "videoFileId", "actorId", "createdAt", "updatedAt", "videoStreamingPlaylistId") TO stdout;
@omnipotens @kainoa @esm first of all i'd suggest restoring from a backup if you have one; if you don't for whatever reason, you should set up an auto backup script that uses pg_dumpall after this
give a restart a try; if that doesn't fix it you could try messing with pg_dump flags, --data-only and --schema-only are good ones to try (separately)
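for the auto backup, something like this in /etc/cron.d/ would do (path and schedule are just an example, make sure the directory exists and postgres can write to it):
0 3 * * * postgres pg_dumpall | gzip > /var/backups/postgres/all-$(date +\%F).sql.gz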
When carrying out pg_dump on a (timescale) database running within a local container I have the following: pg_dump: warning: there are circular foreign-key constraints on this table:
@omnipotens @kainoa @esm hmmmm, it seems like it keeps failing on the videoRedundancy table specifically; I'm not too familiar with peertube but after looking at the docs I think it might just be possible to recreate that table
do the actual postgres db daemon logs say anything whenever you attempt to pg_dump? we may be able to isolate and recreate what's affected
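(on systemd distros journalctl -u postgresql usually has them - the unit name can differ, e.g. postgresql@15-main on debian/ubuntu - otherwise look for a log file under /var/log/postgresql/ or in the data directory's log/ subdir)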
@omnipotens @kainoa @esm about the same then
are you able to get the indexes in the videoRedundancy table, in either the current db or the backup? if not then i'd try dropping that specific table and restoring it from the november backup
@omnipotens @kainoa @esm ah, do you have access to some sort of UI like phpMyAdmin, Adminer, or DBeaver? if not, then try running this in the postgres shell (sudo -u postgres psql) to try and get the indexes on the table:
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = "peertube_prod.public.videoRedundancy";
"peertube_prod.public.videoRedundancy" with 'videoRedundancy'@omnipotens @kainoa @esm hmmm, so the table isn't completely toast, it just seems like the data is messed up
after backing up the current data directory as wolf480pl mentioned earlier (not a dump, just copying the /var/lib/pgsql or /var/lib/postgresql directory outright), give pg_restore --data-only -d peertube_prod -t videoRedundancy <november.dump> a shot (replace with the path to the database backup from november)
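for the data directory copy, something along these lines should do (assuming systemd and a stock debian-ish path, adjust to your setup):
sudo systemctl stop postgresql
sudo cp -a /var/lib/postgresql /var/lib/postgresql.bak
sudo systemctl start postgresql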
@omnipotens @esm @kainoa @esm
That pg_xact (formerly known as pg_clog) directory contains the transaction status log - basically a record of which transaction IDs committed and which aborted (so it's different from the WAL) - and it seems like it's very important for postgres...
The warning about circular dependencies seems like a red herring, but the missing file in pg_xact seems like a severe issue.
I suggest you first make a copy of your postgres data directory (the one that contains subdirs like base, pg_xlog, etc.).
1/
And the backup is so that if you lose data while trying to fix it, you can try again.
The problem looks similar to the error described here:
https://dba.stackexchange.com/questions/325104/error-could-not-access-status-of-transaction-could-not-open-file-pg-xact-0029
Except there, restarting the database fixed it.
If this is something that postgres's builtin recovery process can fix, a restart should help.
It's also possible that dumping that table just takes too long and triggers the issue from that link, but then I'd expect the number in the filename to increase.
yeah, restarting it did not help, tried that already.
Do you have a backup of the postgres data directory?
Looks like the last one available was Nov 2022
Make another one now.
In addition to the one you already have.
Yes, it's corrupted (apparently), but if we try fixing it and make it worse, this will let you try again.
@omnipotens @esm @kainoa @esm
Ideas for *after* you've made a backup of the postgres data directory:
- make sure nothing other than your psql is connected to postgres (check pg_stat_activity to see what is connected; example query after this list)
- disable triggers [1] (example after this list)
- try getting a single row from the table:
select * from "videoRedundancy" limit 1;
If this doesn't cause the error, try selecting all the rows (no limit); if that works, try pg_dump
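Rough examples for the pg_stat_activity and triggers points above (run in psql; the quoted table name assumes peertube's schema):
select pid, usename, application_name, state from pg_stat_activity;
alter table "videoRedundancy" disable trigger all;
(and alter table "videoRedundancy" enable trigger all; once you're done)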
I am starting to think it is hosed
@omnipotens @esm @kainoa @esm
There are levels of hosed.
I'm no expert but I suspect most of the data from that table is still there, the question is how much effort you're willing to spend to recover it.
If you're getting this error on a 1-row select from that table even if that's the first query after restart, then I don't know how to proceed, but I suspect someone on the postgres mailing list does.
Ok so we followed up in DMs and it turned out that:
1. The transaction ID from the error message was higher than txid_current() - it was in the future. It couldn't have been real, it must've been postgres reading garbage where it'd expect a transaction ID, similar to the "garden-variety data corruption" described in [1].
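(i.e. comparing the number from the error message with what select txid_current(); returns on that database)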
/cont
[1]: https://www.postgresql.org/message-id/11462.1238032941%40sss.pgh.pa.us
2. Looking at the table "heap" file (base/<db-oid>/<tbl-oid>) with hexdump, and searching for the little-endian hex encoding of the corrupted txid, it appeared only in one place.
Compared to rows seen earlier in the file, it seemed to sit 16 bytes before where a row should start. Possibly the row pointers at the start of a page got borked during a hardware failure while the rows were being moved around.
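Roughly what that search looks like, using the txid from the original error (quick and dirty - it can miss a match that falls across hexdump's column or line breaks):
printf '%08x\n' 3004004288    # b30d77c0, i.e. bytes c0 77 0d b3 on disk (little-endian)
hexdump -C base/<db-oid>/<tbl-oid> | grep 'c0 77 0d b3'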
/cont
3. Trying to add fake pg_xact files like described in the "pg_clog problems" slide of [2] didn't help - it found another corrupted txid, one that appeared in many places in the file. The number of rows that I could select with `select * from <table> limit <n>` before getting an error did not go up. I suspected all rows on that page had wrong offsets.
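For reference, the dummy-segment trick as I understood it (double-check against the slides before trusting my recipe): a pg_xact segment is 256 kB, 2 status bits per txid, and 0x55 bytes mean "committed", so with postgres stopped, from inside the data directory:
dd if=/dev/zero bs=256k count=1 | tr '\0' '\125' > pg_xact/0B30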
/cont
[2]: https://thebuild.com/presentations/worst-day-fosdem-2014.pdf
4. Overwriting the whole page with zeros (stopping postgres, copying the file to the side, dd-ing into it with conv=notrunc and the right bs/seek/count, verifying, copying the file back) as advised in the "really bad heap damage" slide of [2] did make it possible to read more rows. There was another error with a different corrupted txid, which was also unique, and much much later in the file. Zeroing out that page too allowed reading the rest of the table.
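The dd step was something along these lines (placeholders to fill in; 8192 is the default postgres page size, <page-no> is the damaged page's number counted from 0):
dd if=/dev/zero of=<working-copy-of-heap-file> bs=8192 seek=<page-no> count=1 conv=notrunc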
5. I did save the copy of those pages before zeroing them out. They contain string fields with URLs, which are easy to spot and could be used as an anchor to guess correct row offsets and extract data from those rows. Maybe even extract correct txids, but I wouldn't know how to check if these rows are live or dead, so could end up with extra data that shouldn't be there.
Also, the bitfields in the row pointers confused me, maybe if I understood them it'd be possible to fix those two pages in-place. Or maybe not.
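(if anyone wants to poke at those saved pages, a lazy first pass is something like strings <saved-page> | grep http to pull the URLs out)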
Main takeaway: it's usually recoverable, if you're not working under stress.
Yesterday I helped linuxrocks.online folks fix their peertube's postgres db after a corruption and an ominous "ERROR: could not access status of transaction <number>" error.
Turns out postgres's physical data layout isn't all that complicated.
See upthread for details, but reply here as I don't know if folks above want this in their mentions.