From Vet AI for Cows to Imperial Gloss: Hardcore MLOps on Free GPUs

In early 2026, news feeds brought alarming reports from Siberia: mass outbreaks of dangerous diseases in cattle had forced the culling of thousands of animals. For many farmers this meant losing their business and their livelihood. We asked ourselves: could affordable computer vision become the first line of defense? A tool that lets a farmer in a remote area run an initial screening (triage) of an animal with an ordinary smartphone and call a veterinarian in time, without waiting for an epidemic to start. That is how the AI-Vet-Scanner project was born (our Space on Hugging Face), which detects signs of disease from a photograph.

https://habr.com/ru/articles/1013214/

#MLOps #Kaggle #Computer_Vision #OpenCV #PyMuPDF #Hugging_Face #dataset #parsing #memory_optimization #SDXL_LoRA


That said and celebrated ;), there are things that #Censor does not yet redact well.

By default, the upstream library #MuPDF (with its #Python bindings in #PyMuPDF) supports only the redaction of text, vector graphics and images. Testing on a variety of PDF files (thanks to #pypdf, #qpdf, #ghostscript, and their issue reporters, as well as @pdfarranger for their hint) showed me that some vector graphics are not properly redacted, and an upstream issue has been reported for that.

Also, form fields (widgets), signatures and links may be incompletely redacted.

You can find an updated list of “What is redacted? What not?” here: https://codeberg.org/censor/Censor/issues/120

#pdf #redaction #security

meta: What is redacted? What not?

> **Warning**
> The following description is **not** valid for Censor until version 0.4.0. I recommend to update to [version 0.5.0](https://codeberg.org/censor/Censor/releases/tag/v0.5.0) for secure redaction.

## Elements under redaction rectangles

- [x] Text:
  - characters are removed when ...

Codeberg.org

“Better safe than sorry”

For release 0.5.0 of #Censor, a lot of work went into improving the security of PDF redaction.

PDF documents are tricky, and irrevocably removing elements from them is even trickier. With this release, before a redacted document is saved, garbage is now properly collected and the document is sanitized, which means that metadata, page thumbnails, etc. are removed.

Also, vector graphics are now removed with a stricter option when they overlap with redaction rectangles. On top of that, I added redaction of PDF annotations.

The user interface was refreshed, with undo and redo buttons in the toolbar and an improved document-saving experience. A crosshair cursor now indicates when you are drawing rectangles.

Thanks to the translators, you may now also talk Czech with Censor!

Get it from @flathub: https://flathub.org/apps/page.codeberg.censor.Censor, or contribute on @Codeberg: https://codeberg.org/censor/Censor

#censorship #redaction #PDF #Codeberg #Flatpak #Flathub #GNOME #python #MuPDF #PyMuPDF #linux

Install Censor on Linux | Flathub

Redact PDF documents

“A historic moment for Censor”

#Censor – the PDF redaction tool for the @gnome desktop – now comes with a new edit history. It allows you to undo and redo redactions using the right-click context menu or keyboard shortcuts. Also, a bug that prevented repeated saving to the same file path was fixed.

Get the new version from @flathub: https://flathub.org/apps/page.codeberg.censor.Censor, and find it on @Codeberg: https://codeberg.org/censor/Censor/releases/tag/v0.4.0

You may now talk Chinese, Dutch, English, Estonian, Finnish, French, German, Italian, and Vietnamese with Censor (thanks a lot to the translators!). If your language is missing from this list I invite you to contribute at Codeberg Translate: https://translate.codeberg.org/engage/censor

#censorship #redaction #PDF #Codeberg #Flatpak #Flathub #GNOME #python #MuPDF #PyMuPDF #Linux


Censor, a new document redaction tool, is here!

It lets you draw black rectangles on PDF documents and permanently remove the text and images below them. Find it on @Codeberg: https://codeberg.org/censor/Censor, get it from @flathub: https://flathub.org/apps/page.codeberg.censor.Censor, or translate it on Codeberg Translate: https://translate.codeberg.org/engage/censor!

It is a free and open-source graphical user interface (GUI) for #Linux and the #GNOME desktop, and uses the #MuPDF library with its #python bindings from the #PyMuPDF module.

#censorship #redaction #PDF #Codeberg #Flatpak #Flathub

Censor

PDF Document Redaction for the GNOME Desktop

Codeberg.org

After struggling to get #python #PyMuPDF to work, and with the deadline close, I shifted to using a combination of other commands.

First I used the #linux #pdftohtml command, which is so much faster than PyMuPDF and packages the result much like a saved website.

Next, with #NeoVim and #RegEx, I formatted the #HTML file so that it could be quickly processed with #NodeJs #cheerio and eventually, through #json, saved in #sqlite.

Is it elegant and automatic? No, though it works!

#JavaScript

A further note while trying to extract and format data from PDFs using #python #PyMuPDF.

I was trying to create a perfect chain of functions that would format all the edge cases into the final desired #HTML format. This is where I quickly realized that running every tweaked version of the functions on the 100-page PDF is quite time-consuming.

Instead, I can run it once and save the results in a #sqlite database, then write #sql queries to post-process the edge cases. That also gives me a good enough way to inspect the contents of each page, compared with the previous method of printing the output to the #terminal and scrolling to the desired page. And in the end, I am one step closer to having the data in a #csv file, which is easily exported with #Dbeaver.
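The extract-once, post-process-in-SQL idea can be sketched with nothing but the standard library. The table layout, the helper names `cache_pages` and `export_csv`, and the sample page texts are all hypothetical. The extraction step is replaced by a small list so the sketch runs without a PDF.

```python
import csv
import io
import sqlite3


def cache_pages(conn, pages):
    """Store (page_number, text) pairs once so later tweaks never re-read the PDF."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (page_no INTEGER PRIMARY KEY, content TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", pages)
    conn.commit()


def export_csv(conn, query):
    """Run a query and return its rows as CSV text with a header line."""
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Stand-in for the slow extraction step. With PyMuPDF this could be
# ((page.number, page.get_text()) for page in doc).
extracted = [(0, "Title\nfirst para-\ngraph"), (1, "Second page")]

conn = sqlite3.connect(":memory:")  # use a file path to keep the cache between runs
cache_pages(conn, extracted)

# An edge case fixed in SQL: rejoin words hyphenated across a line break.
conn.execute("UPDATE pages SET content = replace(content, '-' || char(10), '')")

one_page = conn.execute("SELECT content FROM pages WHERE page_no = 0").fetchone()[0]
csv_text = export_csv(
    conn, "SELECT page_no, length(content) AS chars FROM pages ORDER BY page_no"
)
print(one_page)
print(csv_text)
```

Looking at a single page becomes a one-line `SELECT ... WHERE page_no = ?` instead of scrolling through terminal output, and the same query result can go straight to CSV.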

Currently trying to extract and format data from PDFs using #python #PyMuPDF.

Initially used the `get_text(value)` method with the `"text"` value, only to learn that I could have potentially saved time directly using the `"html"` value, since I have been creating pattern matchers to format the text into #HTML.

After investigation, although the html option exists, the post processing is more strenuous than the initial approach.

My fascination with the `get_text(value)` method is that each value packages the data differently. Whereas `"html"` puts the text in `<p><span>text</span></p>`, `"xhtml"` puts it instead in `<h1>text</h1>`.

I just updated my 2023 post on extracting text from #EPUB files in #Python, and added an evaluation of #PyMuPDF (which also supports EPUB!). Includes link to demo script.

https://www.bitsgalore.org/2023/03/09/extracting-text-from-epub-files-in-python

Extracting text from EPUB files in Python

This post gives an introduction to extracting unformatted text from EPUB files in Python.

bitsgalore.org

Ever felt the need to convert a #PDF into a fixed-layout #EPUB that preserves the table of contents, internal cross-references and hyperlinks? Finding no out-of-the-box solution, I've developed one myself using #Python and the #PyMuPDF library. Here it is, open source, and ready for use:

https://github.com/aourednik/pdf2epub3fixed

My script is particularly suitable for the conversion of complex layout PDFs generated with variants of #TeXLaTeX.
Enjoy!

GitHub - aourednik/pdf2epub3fixed: Convert PDF to fixed-layout EPUB, conserving the table of contents, inner cross-references and hyperlinks.


GitHub