The lossless data compression fairies are having fun with me today...

  • Scan 8.5" x 11" document at 1200dpi @ greyscale
  • -> 60 MiB PNG, thank you
  • Open PNG in GIMP, select a good threshold point, convert to 1bpp
  • -> 514 KiB PNG
  • Wait... 116:1 compression from 8-bit PNG to 1-bit PNG? HOW??
  • convert to pdf
  • "Warning, this file is really huge and may actually be a decompression bomb" lol, ok.
  • -> 515 KiB PDF, nice
  • ocrmypdf foo.pdf document.pdf
  • -> 194 KiB PDF
  • WHAT? HOW?!?
  • pdfimages -png document.pdf foo
  • -> 514 KiB PNG
  • WHAT IS HAPPENING?!?

#PDF #PNG #Compression #greyscale

P.S. I found out that, by default, ocrmypdf uses (lossless) #JBIG2 compression. That's why the result was so well compressed. Also, the resultant PNG file at the end (basically the same PNG that went into the PDF) was converted from JBIG2: pdfimages -png converts images rather than extracting them in their natively stored format (though pdfimages -list will show you what the native format is). I think pdfimages -all will just export the native format, whatever it is, but I haven't tried that yet.
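A rough sanity check on the 8-bit to 1-bit jump, as a small Python sketch. It assumes the scan covers exactly 8.5" x 11" at 1200 dpi and takes the file sizes as reported in the post (60 MiB and 514 KiB, both rounded), so the ratios are only approximate.

```python
# Raw (uncompressed) size of an 8.5" x 11" page scanned at 1200 dpi.
# Assumption: the scan covers exactly the page, with no margins cropped.
WIDTH_IN, HEIGHT_IN, DPI = 8.5, 11.0, 1200

width_px = round(WIDTH_IN * DPI)    # 10200
height_px = round(HEIGHT_IN * DPI)  # 13200
pixels = width_px * height_px

MIB = 1024 * 1024
raw_grey_mib = pixels * 8 / 8 / MIB  # 8 bits per pixel
raw_1bpp_mib = pixels * 1 / 8 / MIB  # 1 bit per pixel

# File sizes as reported in the post (rounded by the author).
png_grey_mib = 60.0
png_1bpp_mib = 514 / 1024

grey_ratio = raw_grey_mib / png_grey_mib
bilevel_ratio = raw_1bpp_mib / png_1bpp_mib

print(f"raw greyscale: {raw_grey_mib:.1f} MiB, PNG ratio {grey_ratio:.1f}:1")
print(f"raw 1bpp:      {raw_1bpp_mib:.1f} MiB, PNG ratio {bilevel_ratio:.1f}:1")
```

Read this way, the jump splits into two parts: a factor of 8 from dropping to 1 bit per pixel, and the rest from the thresholded bitmap being far more compressible than the noisy greyscale scan (long runs of pure white and pure black instead of sensor noise in every pixel).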

Neat.. encountered the #xerox scanner bug live..

In case you don't remember, their #JBIG2 algorithm settings cause small scanned symbols to be confused with symbols already in the compression dictionary and then misprinted.

In this case, the superscript "2" and "3" were turned into full size "2" and "3".

The nonsensical result: a #DnD whip does 1D33 damage 🤣 if only 🤣

Interesting, both the German Federal Office for Information Security and the Swiss Coordination Office for the Permanent Archiving of Electronic Documents advise against the use of #JBIG2 compression in scanned #PDF documents.

This was prompted by the discovery in 2013 of the infamous "swapped characters" bug in Xerox photocopiers:

https://en.wikipedia.org/wiki/JBIG2#:~:text=not%20the%20same.%5B13%5D%5B14%5D%5B15%5D-,Character%20substitution%20errors%20in%20scanned%20documents,-%5Bedit%5D

JBIG2 - Wikipedia

Does anyone know if JPEG XL encoders can be prone to the infamous JBIG2 "Pattern Matching & Substitution" region-reuse issue that so badly affected Xerox copiers?

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning

#JPEG #JPEGXL #JBIG2 #Xerox

Xerox scanners/photocopiers randomly alter numbers in scanned documents

Please see the „condensed time line“ section (the next one) for a time line of how the Xerox saga unfolded. It shows, for example, that I did not push the thing to the public right away, but gave Xerox a lot of time before I did so.

D. Kriesel

@barubary @randomgeek Yep. See https://www.theverge.com/2013/8/6/4594482/xerox-copiers-randomly-replacing-numbers-in-documents

Full account of #Xerox copiers mangling numbers here: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning

They used the #JBIG2 image format’s lossy “pattern matching & substitution” method that substitutes previously-encoded characters if they look enough like the one currently being encoded.

This is a great analogy to how #LLM-based “#AI” works.
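To make the "pattern matching & substitution" step concrete, here is a toy Python sketch. It is not how any real JBIG2 encoder matches symbols: the 5x5 glyphs, the pixel-difference measure, and the max_diff threshold are all made up for illustration. It only shows the general shape of the failure, where a glyph that is "close enough" to a dictionary entry is stored as a reference to that entry.

```python
def hamming(a, b):
    """Count differing pixels between two equally sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, max_diff):
    """Toy symbol-dictionary encoder (hypothetical, for illustration only).

    Each glyph is stored as an index into the dictionary. A glyph within
    max_diff pixels of an existing entry reuses that entry instead of
    being stored itself; max_diff=0 makes the scheme lossless.
    """
    dictionary, indices = [], []
    for glyph in glyphs:
        for i, entry in enumerate(dictionary):
            if hamming(glyph, entry) <= max_diff:
                indices.append(i)
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def parse(rows):
    """Turn rows of '#' and '.' into a tuple-of-tuples bitmap."""
    return tuple(tuple(c == "#" for c in row) for row in rows)


SIX = parse([".###.", "#....", "####.", "#...#", ".###."])
EIGHT = parse([".###.", "#...#", ".###.", "#...#", ".###."])

# Lossless: both glyphs get their own dictionary entry.
_, lossless = encode([SIX, EIGHT], max_diff=0)

# Lossy: the 8 differs from the 6 by only 2 pixels, so it is replaced.
dictionary, lossy = encode([SIX, EIGHT], max_diff=2)
decoded = [dictionary[i] for i in lossy]
print(lossless, lossy, decoded[1] == SIX)
```

With the looser threshold the second glyph decodes as a crisp, perfectly legible 6, which is what makes this kind of error so hard to spot compared with ordinary blur or noise.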

Your Xerox copier could be replacing numbers in your documents

The Verge

Are there any easy-to-use #JBIG2 tools for #Linux?

#GIMP doesn't seem to support it (yet?), and I just can't wrap my head around how to use jbig2enc -- it spits out some data to STDOUT that 'file' can't identify.
I just find the concept of lossy bilevel image compression fascinating, and I'd love to play with it to see how badly it would butcher something like a Floyd-Steinberg dithered image (and how much compression it'd actually accomplish).
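One thing that can be checked regardless of what 'file' reports: a standalone JBIG2 file starts with a fixed 8-byte ID string, while the embedded form of the stream (the kind stored inside a PDF) has no such header. A minimal Python sketch follows; the function name is made up, and which of the two forms jbig2enc writes for a given set of options is something I have not verified.

```python
# 8-byte ID string at the start of a standalone JBIG2 file.
JBIG2_FILE_MAGIC = bytes([0x97, 0x4A, 0x42, 0x32, 0x0D, 0x0A, 0x1A, 0x0A])


def has_jbig2_file_header(data: bytes) -> bool:
    """Return True if data begins with the standalone JBIG2 file ID string.

    A False result does not prove the data is not JBIG2: embedded
    streams (as stored in PDFs) carry no file header at all.
    """
    return data.startswith(JBIG2_FILE_MAGIC)


# Example with in-memory bytes; for a real file, read it in binary mode
# and pass the contents (or just the first 8 bytes) to the function.
print(has_jbig2_file_header(JBIG2_FILE_MAGIC + b"\x01\x00\x00\x00\x01"))
print(has_jbig2_file_header(b"\x89PNG\r\n\x1a\n"))
```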

A #PDF masquerading as a #GIF and containing a #JBIG2 image with a #VM encoding the payload.

A genius who can pull this off, working for #NSOGroup that helps autocrats everywhere (including #India) subvert #democracy with #Pegasus.

What a terrible waste for humanity. ☹️

🤯:
“JBIG2 doesn't have scripting capabilities, but when combined with a vulnerability, it does have the ability to emulate circuits of arbitrary logic gates operating on arbitrary memory. So why not just use that to build your own computer architecture and script that!?”
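As a toy illustration of why a handful of bitwise operators is enough, here is a Python sketch that builds an 8-bit adder out of nothing but AND, OR, XOR and XNOR, the operators the Project Zero write-up names as available when combining JBIG2 bitmaps. The gate functions and the adder layout are my own illustration, not the circuit design used in the actual exploit.

```python
# The four primitive operations, each acting on single bits (0 or 1).
def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def XOR(a, b):
    return a ^ b


def XNOR(a, b):
    return 1 - (a ^ b)


def NOT(a):
    """Inversion falls out of XNOR with a constant 0."""
    return XNOR(a, 0)


def full_adder(a, b, carry_in):
    """Add three bits using only the primitive gates above."""
    partial = XOR(a, b)
    total = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return total, carry_out


def add(x, y, bits=8):
    """Ripple-carry addition of two unsigned integers, modulo 2**bits."""
    result, carry = 0, 0
    for i in range(bits):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result


print(add(100, 55), add(200, 100), NOT(0), NOT(1))
```

Once addition, inversion and comparison can be wired up from gates like this, registers and a full instruction set are "only" a matter of chaining more of them, which is the leap the quote describes.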

#JBIG2

A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution

Posted by Ian Beer & Samuel Groß of Google Project Zero We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit w...