Mastodawn

fossilesque Jun 20, 2024

Elsevier

https://mander.xyz/post/14370720

Elsevier - Mander

Passerby6497

That’s where you print the downloaded PDF to a new PDF. New hash and same content, good luck tracing it back to me fucko.
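A minimal sketch of the "new hash" point, using only Python's standard library. The two byte strings are made-up stand-ins for the downloaded file and its re-printed copy, not real PDFs. It only shows that the whole-file digest changes when any byte changes; it says nothing about whether an identifier embedded inside the file survives the re-print.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def same_file(a: bytes, b: bytes) -> bool:
    """True only if both byte strings hash to the same digest."""
    return sha256_hex(a) == sha256_hex(b)


# Hypothetical stand-ins for the downloaded file and the re-printed copy.
original = b"%PDF-1.7\n% producer: publisher\n(same visible text)\n%%EOF\n"
reprinted = b"%PDF-1.7\n% producer: local printer\n(same visible text)\n%%EOF\n"

print(same_file(original, original))   # identical bytes, identical digest
print(same_file(original, reprinted))  # any byte difference changes the digest
```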

Syn_Attck Jun 20, 2024

Now that this is known, it’s not enough to remove metadata from the PDF itself. Each image inside a PDF, for example, can contain its own metadata.

There are multiple ways of removing ALL metadata from a PDF; here are most of the ones I know of.

It will be slow-ish and probably make the file larger, but if you’re sharing a PDF that only you are supposed to have access to, it’s worth it. MAT or exiftool should work.
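One rough way to check whether a cleanup pass left anything obvious behind is to scan the raw bytes for well-known metadata markers. This is only a sketch: the marker list is an illustrative assumption, the function name is made up, and anything stored inside a compressed stream will not be seen. A clean result here does not prove the file is clean.

```python
import re

# Keys of the PDF document information dictionary, plus the XMP packet marker.
# Illustrative list only, not an exhaustive inventory of where metadata can live.
INFO_KEYS = [b"/Author", b"/Creator", b"/Producer", b"/Title",
             b"/Subject", b"/Keywords", b"/CreationDate", b"/ModDate"]
XMP_MARKER = b"<x:xmpmeta"


def find_metadata_markers(pdf_bytes: bytes) -> list[str]:
    """Return the names of metadata markers visible in the raw bytes.

    Only uncompressed objects are covered. Note that /Title is also used by
    bookmarks, so a hit is a reason to look closer, not proof of metadata.
    """
    found = []
    for key in INFO_KEYS:
        # Negative lookahead so that /Title does not match a longer name
        # such as /TitleFont.
        if re.search(re.escape(key) + rb"(?![A-Za-z0-9])", pdf_bytes):
            found.append(key.decode("ascii"))
    if XMP_MARKER in pdf_bytes:
        found.append("XMP packet")
    return found


# Hypothetical fragment of an uncompressed PDF object.
sample = b"%PDF-1.4\n1 0 obj\n<< /Author (Jane Doe) /Producer (SomeTool) >>\nendobj\n"
print(find_metadata_markers(sample))
```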

Removing metadata from a PDF

What commands must I issue irreversibly to remove all metadata from foo.pdf? Assume embedded images are already clean. I got the impression from https://gist.github.com/hubgit/6078384 that exiftoo...

Unix & Linux Stack Exchange

Passerby6497 Jun 20, 2024

Wouldn’t printing the PDF to a new PDF inherently strip the metadata put there by the publisher?

Syn_Attck Jun 20, 2024

Good question. I believe “Print to PDF” isn’t actually “printing” it page by page as if it was a physical printer, but rather just saving the loaded PDF to a PDF file locally.

I’m not an expert in this field, but you could ask on StackExchange, ask the authors of MAT and exiftool, or test it yourself: make a PDF containing a .jpg that carries your own metadata, then extract the image again and let us know here. It would be useful information that I can’t find via search engines. I’m on a smartphone so I can’t do it. If you do, note that according to the linked SE page you won’t be able to recover the original file extension, so if you start from your own .jpg with your own EXIF data, rename the extracted file to .jpg when finished (I believe EXIF is handled differently depending on file type).

There are multiple tools to add exif data to an image but the exiftool website has some good easy examples for our purpose.

exiftool -artist="Phil Harvey" -copyright="2011 Phil Harvey" YourFile.jpg

(do this as the first step before adding to the PDF)
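For the last step of that experiment (checking whether the extracted image still carries EXIF), a small standard-library check could look like the sketch below. It walks the JPEG segment headers and looks for an APP1 segment that starts with the "Exif" tag. The helper names and the synthetic test images are made up for illustration. It does not decode the EXIF fields, and it ignores other containers such as XMP or IPTC.

```python
import struct


def jpeg_has_exif(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP1 segment tagged 'Exif'."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # malformed segment header, stop scanning
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # SOS: image data begins; EOI: end of image
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False


def _segment(marker: int, payload: bytes) -> bytes:
    """Build one JPEG segment (hypothetical helper for the synthetic samples)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


# Synthetic, non-displayable byte strings that only mimic the segment layout.
with_exif = b"\xff\xd8" + _segment(0xE1, b"Exif\x00\x00fake") + b"\xff\xd9"
without_exif = b"\xff\xd8" + _segment(0xE0, b"JFIF\x00") + b"\xff\xd9"

print(jpeg_has_exif(with_exif))
print(jpeg_has_exif(without_exif))
```

Run it once on the .jpg before embedding and once on the extracted file; if the first is True and the second is False, the EXIF block did not survive the round trip.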

How to extract images from a PDF in their original format

I'm using pdfimages -j bar.pdf /tmp/image to extract images from a PDF. My objective is to get them in their raw state as they were added. So If it was a .tif I'd like to get a .tif, if it's a jp...

Stack Overflow

Zacryon Jun 21, 2024

Okay, got it. Print the PDF, then scan it and save as PDF.

Or get some monks to make a handwritten copy, like in the good old times.

Olgratin_Magmatoe Jun 20, 2024

You’d be safer IRL printing it on a printer without yellow ink, then scanning it, then deleting the metadata from the scan.

ChaoticNeutralCzech Jun 20, 2024

I know PDF providers who visibly print the customer’s name or number in the header of every page, along with a short copyright text. I use qpdf --stream-data=uncompress to turn the PDF’s content streams into human-readable, PostScript-like text, and then Python+regex to remove the header texts, which stand out a bit from other PDF elements. The script throws an error if it removes more or fewer elements than there are pages, but that hasn’t happened yet. Processed documents sometimes have screwed-up non-ASCII characters in the Table of Contents for some reason, but I don’t have the originals anymore so IDK if it’s my fault. Still, I wouldn’t share the PDFs except in text-only or printed form, because of any other steganographic shenanigans in the file. I would absolutely torrent them if I could repurchase them under a new identity and verify that the files are identical.
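A rough sketch of what such a script might look like, not the commenter's actual code. It assumes the PDF has already been decompressed, that each header is a single text block, and that the header starts with the made-up wording "Licensed to"; real providers will differ, so the pattern has to be adapted. Deleting bytes also invalidates stream lengths and cross-reference offsets, so the result needs repairing afterwards (qpdf ships a fix-qdf helper for files written in its QDF mode).

```python
import re

# Hypothetical header format: one BT ... ET text block whose string starts
# with "Licensed to". [^()] keeps the match from spanning other text strings.
HEADER_RE = re.compile(rb"\bBT\s[^()]*?\(Licensed to [^)]*\)\s*Tj\s+ET\s*")
# Page objects only; the lookahead excludes the /Pages tree node.
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def strip_headers(pdf_bytes: bytes) -> bytes:
    """Remove one header block per page, or raise if the counts disagree."""
    pages = len(PAGE_RE.findall(pdf_bytes))
    cleaned, removed = HEADER_RE.subn(b"", pdf_bytes)
    if removed != pages:
        raise ValueError(f"removed {removed} headers but found {pages} pages")
    return cleaned


# Synthetic fragment that imitates a decompressed two-page file.
sample = (
    b"1 0 obj << /Type /Pages /Count 2 >> endobj\n"
    b"2 0 obj << /Type /Page >> endobj\n"
    b"3 0 obj << /Type /Page >> endobj\n"
    b"BT /F1 8 Tf 50 800 Td (Licensed to Customer 12345) Tj ET\n"
    b"BT /F1 12 Tf 50 700 Td (Body text) Tj ET\n"
    b"BT /F1 8 Tf 50 800 Td (Licensed to Customer 12345) Tj ET\n"
)

cleaned = strip_headers(sample)
print(cleaned.decode("ascii"))
```

The count check is the useful part: if the pattern is too greedy or too narrow, the script fails loudly instead of silently leaving a header behind.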

BTW, has anyone figured out how to embed Python code in PDF? The whitespace always gets reencoded as x-coordinates so copy&pasting it never preserves indentation. No, you can’t use the Ogham Space Mark (Unicode’s only non-blank character classified as a space) for indentation in Python, I tried.
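The Ogham Space Mark result is easy to reproduce. The sketch below checks how Unicode classifies the character and then asks the Python compiler whether it accepts it as indentation; the helper name is made up for illustration.

```python
import unicodedata

OGHAM_SPACE = "\u1680"  # OGHAM SPACE MARK


def compiles(source: str) -> bool:
    """Return True if Python accepts the source, False on a syntax error."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Unicode general category and Python's own notion of whitespace.
print(unicodedata.category(OGHAM_SPACE))
print(OGHAM_SPACE.isspace())

# Same block, indented once with ASCII spaces and once with the Ogham mark.
print(compiles("if True:\n    x = 1\n"))
print(compiles("if True:\n" + OGHAM_SPACE + "x = 1\n"))
```

So str.isspace() treats the character as whitespace, but the tokenizer does not accept it as indentation.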

IlIllIIIllIlIlIIlI Jun 20, 2024

I’ve seen some that also add background watermarks on random pages, at random locations.