That said and celebrated ;), there are things that #Censor does not yet redact well.

The upstream library #MuPDF (with its #Python bindings in #PyMuPDF) by default supports redaction of text, vector graphics and images only. Testing on a variety of PDF files (thanks to #pypdf, #qpdf, #ghostscript, and their issue reporters, as well as @pdfarranger for their hint) showed me that some vector graphics are not properly redacted, and an upstream issue has been reported for that.

Also, form fields (widgets), signatures and links may be incompletely redacted.

You can find an updated list of “What is redacted? What not?” here: https://codeberg.org/censor/Censor/issues/120

#pdf #redaction #security

meta: What is redacted? What not?

> **Warning**
> The following description is **not** valid for Censor until version 0.4.0. I recommend to update to [version 0.5.0](https://codeberg.org/censor/Censor/releases/tag/v0.5.0) for secure redaction.

## Elements under redaction rectangles

- [x] Text:
  - characters are removed when ...

Codeberg.org
CVE Alert: CVE-2026-27628 - py-pdf - pypdf - RedPacket Security

pypdf is a free and open-source pure-python PDF library. Prior to 6.7.2, an attacker who uses this vulnerability can craft a PDF which leads to an infinite

RedPacket Security
@RomanOnARiver It may have changed in the meantime but back when we used #pypdf it was a lot of trouble; I'd recommend looking into #pikepdf if you run into any issues, it was a gamechanger for us.
But there are really good non-Adobe #PDF libraries, like #pypdf for #python, like #libpoppler. And not to mention stuff like PDF.js - having native PDF support in the browser and not needing the Adobe plugin is huge, we just need to continue that momentum.
As an aside, I really like the #pypdf module for #python. #PDF as a format is super interesting; I would compare it to exFAT, as something that was proprietary and gate-kept for so long and then became an open spec, but it's suffering and has tons of untapped potential because its legacy "owners" still have the most advanced software for it. I'm talking about Acrobat, for example.

#python #programming #coding

Q for other programmers - do you ever, out of caution, do things to prevent issues that probably won't happen? I'm processing PDFs with #pypdf, don't think I have to worry about a Bobby Tables situation but I'm still hitting it with (pseudo code)

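import re

if not base:
    base = "file"

if not ext:
    ext = ".pdf"

# Replace every run of disallowed characters (the "_" replacement is assumed; the pseudo code left it out)
base = re.sub(r"[^A-Za-z0-9_-]+", "_", base)

# Collapse runs of spaces/underscores into one space, then trim the ends
base = re.sub(r"[ _]+", " ", base).strip()

if len(base) > 60:
    base = base[:60].rstrip()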

Is this a waste of time?

While building an automated resume parser, an interesting problem came up when parsing 15,000 resumes with PyPDF. The resumes used a two-column design, but text extraction did not preserve the layout. To avoid this problem, use simple fonts, avoid multi-column designs, and keep contact information at the top. #HồSơ #TựĐộngHóa #PyPDF #ATS #Resume #Automation #PDF #Parsing #SaaS #ỨngDụng #TưVấn #LờiKhuyên #HồSơXinh #TìmViệc #IT #CôngNghệ #Vietnam #JobSearch #ResumeTips #SaaSTips

https://www.reddit.com/r/SaaS/c

I want to write a program to extract a list of clickable links from a PDF page.

#pypdf can list the link positions/sizes and target URLs. But in a PDF document, links are annotations, which are separate data from the document text.

To get the display text of a clickable link in a PDF, is the easiest way to convert the full page to PNG, crop it to the link's bounding box, and run that through OCR? Or am I missing something more reasonable?
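
One possible alternative to OCR, as a rough sketch only: collect the positioned text fragments that pypdf reports during extraction and keep those whose origin falls inside a link's rectangle. The helper names `iter_uri_links`, `text_in_rect` and `print_link_texts` are made up for this example, the 2-point tolerance is arbitrary, and treating the text matrix translation (`tm[4]`, `tm[5]`) as page coordinates is an approximation that may break on rotated or scaled content.

```python
from typing import Any, Iterable, Iterator, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF user space


def _resolve(obj: Any) -> Any:
    """Follow a pypdf indirect reference; plain Python objects pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def iter_uri_links(annots: Iterable[Any]) -> Iterator[Tuple[Rect, str]]:
    """Yield (rect, uri) for each link annotation that carries a URI action."""
    for ref in annots:
        annot = _resolve(ref)
        if annot.get("/Subtype") != "/Link" or "/A" not in annot:
            continue
        action = _resolve(annot["/A"])
        if "/URI" not in action or "/Rect" not in annot:
            continue
        x0, y0, x1, y1 = (float(_resolve(v)) for v in _resolve(annot["/Rect"]))
        # Normalise so that (x0, y0) is always the lower-left corner.
        yield (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)), str(action["/URI"])


def text_in_rect(fragments: Iterable[Tuple[float, float, str]], rect: Rect,
                 tolerance: float = 2.0) -> str:
    """Join the text fragments whose origin (x, y) lies inside `rect`."""
    x0, y0, x1, y1 = rect
    hits = [
        text for x, y, text in fragments
        if x0 - tolerance <= x <= x1 + tolerance
        and y0 - tolerance <= y <= y1 + tolerance
    ]
    return "".join(hits).strip()


def print_link_texts(path: str, page_index: int = 0) -> None:
    """Print 'uri -> visible text' for one page (needs: pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily; third-party dependency

    page = PdfReader(path).pages[page_index]
    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Record where each text fragment starts (approximate page position).
        fragments.append((tm[4], tm[5], text))

    page.extract_text(visitor_text=visitor)
    annots = page["/Annots"] if "/Annots" in page else []
    for rect, uri in iter_uri_links(annots):
        print(uri, "->", text_in_rect(fragments, rect))
```

Calling `print_link_texts("some.pdf")` would print each URL next to whatever text starts inside its rectangle. Fragments that begin outside the box but run into it are missed, so OCR may still be the fallback for awkward layouts.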

#programming #python #IfItWorksItWorks

Parsing PDFs

Sometimes you need to read data out of PDF files. This article shows how to do that with a Python script.

#PDF #Parser #parsen #Auslesen #pypdf #Linux

https://gnulinux.ch/pdf-parsen

Working on a piece of (internal) software tentatively titled "t"; it does some manipulation relating to PDFs. If I'm successful I can save our team about 50%. And what I've come across is that the #pypdf #Python module is really robust and powerful, with a lot of features. It's so strange that software like "Acrobat" has no free software equivalent. I'm pretty sure I can do everything that app does with this module, though I could be wrong.