we have chosen to put most of our research into documents in PDF format.

PDFs are a huge pain to make accessible.

most scientists write their papers in Latex, overleaf, etc., which cannot produce accessible PDFs.

to make such PDFs accessible, one uses Adobe Acrobat, which is expensive and proprietary.

increasingly, we post our PDFs to arXiv, which ~forbids accessible PDFs b/c they can't be compiled from source.

~none of our science is accessible.

artifacts (and file formats) have politics.

@jbigham It seems LaTeX is in the process of implementing this: https://www.pdfa.org/presentation/tagged-and-accessible-pdf-with-latex/

But put me in the "PDFs are not worth the hassle" camp anyway. I know some people in the LaTeX project have been working on a new document container format that still offers precise printing, but that can also reflow text for different screen sizes. The name escapes my memory right now. Maybe it'll be really cool. But for current research publication, I'd settle for HTML.

Tagged and Accessible PDF with LaTeX

In Summer 2020 the LaTeX Project Team announced the start of a multi-year project [1, 2] to produce tagged and accessible PDF from existing  LaTeX sources with

@julian this has been in progress for over a decade, so i'm skeptical. it really is hard; buried in that presentation is compatibility with other packages, which turns out to be a super hard problem.
@julian @jbigham Using LaTeXML to convert to HTML with MathJax is what I currently suggest, not least because the LaTeX tagging options tell you not to use them anymore 😭
@quelu @jbigham I think that must have been it, yeah! Thank you! Really tough to find if you can't remember the name. 😀 https://www.tug.org/TUGboat/tb40-2/tb125ruckert-hint.pdf
@julian @jbigham a replacement for pdf would be good
@jbigham I always wondered why science on the web is all pdf when the web is said to have been invented to ease sharing of information for and by scientists.
@simulo lots of reasons, but primarily editing and reading tools, also people like their papers to have a consistent look

@jbigham @simulo That's kind of chicken-and-egg, though, isn't it? If someone _wanted_ to output compliant, accessible HTML there are already standards for this: https://w3c.github.io/scholarly-html/

You can be prescriptive without requiring a proprietary file format...

Scholarly HTML

@scottmmjackson @jbigham @simulo I remember being super excited about PDF as a file format because it broke the exclusive choice of Apple or Microsoft as an environment for your doc
@scottmmjackson @jbigham I'm already happy about research in standard-compliant "normal" HTML (though the semantic infos of scholarly html are nice)
@simulo @jbigham Sure, but to the extent that journals have an interest in things generally looking the same and being printable in a particular fashion, HTML offers a really good basis for that
@scottmmjackson @jbigham @simulo PDF is not a proprietary standard - it is an open standard, maintained by the ISO.
@conrad @jbigham @simulo Adobe still owns a bunch of PDF technologies, and it was built as a proprietary format by Adobe. Ultimately it's proprietary by design, even if the format is "open" now.
@jbigham @simulo I think the “look” is more important than we might think at first. I have often wondered whether/to what extent the perceived scientific value of an academic paper would change as a function of its formatting. The technology for saving a document as an ebook is here. Some journals also offer epub. The question is not so much whether technological solutions exist. Rather it is why academic users haven’t adopted them, whether they are sharing a preprint or downloading an article…

@simulo @profgaelle @jbigham I’m no scientist, but I am a (copy) writer, so I have worked alongside designers for twenty odd years. So believe me when I say this: Design makes a huge difference. The impact on things like perception and readability cannot be overstated. And more often than not it affects content, too. (“Ok, so this info is in the diagram, we can skip the text” or “let’s highlight that and skip the diagram” etc.)

Design and typography matter.

@simulo @profgaelle @jbigham Personally I’d go farther and say that design IS content, and that good writing is design.
@profgaelle @simulo @jbigham All of this is to say: Accessibility isn’t one thing, it hinges on the definition of “accessible”.
@thelovebing @profgaelle @jbigham I agree – and HTML/CSS offer a lot of ways to design documents well; what they do not do well is replicate how a printed document looks (since they are focussed on content for different screen sizes)

@thelovebing @simulo @profgaelle @jbigham

Tech editor here who has also, separately, helped a few nonprofit orgs through the epub process.

I very much agree with your take on this.

@jbigham
@simulo
you forgot the massive and entrenched industries that fought web publishing tooth and nail!

@simulo @jbigham
I think it is in great part a desire to make it exactly correspond to what is printed in the paper journal, for better or worse.

My read of the invention of http in particular was that it was meant to help with the information management needed to run large physics experiments (many thousand pages of internal documentation), and it would probably not have been invented that early just to keep track of the very distilled final product
(https://www.w3.org/History/1989/proposal.html speaks of cern user groups etc)

The original proposal of the WWW, HTMLized

@jbigham @blakereid this is so important. For my last article I used htlatex and tex4ebook to make web and epubs, but it was a struggle to find the tools to do it (https://shostack.org/blog/fast-cheap-good-redux/)
Shostack + Friends Blog > Fast, Cheap and Good, Redux

A new paper on how fast, cheap and good can combine into something we usually discount.

@jbigham For what it's worth the LaTeX team is supposedly working on a long-term solution ... this is the most recent https://www.pdfa.org/presentation/tagged-and-accessible-pdf-with-latex/ (still seems very slow-moving, but what do I know??) (see also https://www.latex-project.org//publications/indexbytopic/pdf/ )

@bbolker @jbigham I'm still always surprised that it has taken so long for LaTeX to do this. Seems like it would be a natural thing to do in documents that have already been created with section tags, figure environments with captions, etc. I've seen references to why this is a hard project and the only one I understand is that there are a million packages, homebrew document styles, and so on. But we are long overdue for at least having accessibility in simple document styles.

@cbischoff @bbolker @jbigham It is a small team, and I believe they are all volunteers.

The majority of the commits are from one person: https://github.com/latex3/latex3

GitHub - latex3/latex3: The expl3 (LaTeX3) Development Repository

@TomSwirly @cbischoff @jbigham I would contribute $$$ to help this happen! From a late-2020 article https://www.latex-project.org/publications/2020-FMi-TUB-tb129mitt-tagpdf.pdf : "A realistic scenario would be that each phase [out of 6] takes between one and two release cycles of LATEX, of which there are two per year. This implies that the project will stretch across four years as a minimum, but it most probably will be somewhat longer. Additional funding will help to ensure timely delivery of each phase ..."
@TomSwirly @cbischoff @jbigham If anyone is still on this thread *and* is on TeX Stack Exchange *and* feels like picking up some bounty points ... https://tex.stackexchange.com/q/663825/11435
Status on the Tagged PDF project

Does aynone know what the status of the Tagged PDF project is? How far has the LaTeX3 Team progressed on the time line from section 3 of "LaTeX Tagged PDF -- Feasibility Evaluation"? Also...

TeX - LaTeX Stack Exchange
@jbigham I wish SIGCHI/ACM would move away from PDFs as the default file format for this and so many other reasons. Let's just do HTML and give people the option to convert to PDF >.<

@jbigham agreed on the thesis, but calling bullshit on "most scientists write their papers in Latex"

I would be hard pressed to believe that most computer scientists or physicists do, much less scientists more broadly.

@BenjaminHimes @jbigham yep, LaTeX is certainly common in some fields, but I have been an author on well over a hundred papers, now, and LaTeX was used in just two of them.
@BenjaminHimes @jbigham I've never seen someone in CV/ML write a paper in something other than LaTeX... For my first Nature submission I had to reformat it from LaTeX to Word; it was appalling what it did to my equations. I don't think it occurred to me that a PDF would not be okay.
How to write academic papers in Markdown

Tired of that silly LaTeX syntax?

Brain Baking
@jbigham Oh, it's not just research, almost all engineering documents hide their most useful data in PDFs. It's so horrible, I was forced to write my thesis on it!1!! PDF -> HTML -> Table Processing -> Knowledge Graphs. Math formulas are still really hard tho. https://salkinium.com/master.pdf
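
To make the "HTML -> Table Processing" step of that pipeline concrete, here is a minimal sketch using only Python's standard library. It is not the method from the thesis (which is not quoted in this thread); the class name `TableRows`, the helper `extract_rows`, and the sample register table are all hypothetical, and it assumes the PDF has already been converted to HTML with real `<table>` markup, which is often the hard part.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def extract_rows(html_text):
    """Return all table rows found in html_text as lists of cell strings."""
    parser = TableRows()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical sample: a tiny register table as it might appear after conversion.
SAMPLE = (
    "<table>"
    "<tr><th>Register</th><th>Offset</th></tr>"
    "<tr><td>CTRL</td><td>0x00</td></tr>"
    "</table>"
)
rows = extract_rows(SAMPLE)
```

The sketch ignores nested tables, `rowspan`/`colspan`, and tables that the converter emitted as positioned text, which are the kinds of edge cases mentioned in the replies.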

@salkinium Oh interesting, though got a 404 when I tried the link.

A former coworker wrote an inhouse tool to pull out table data from TRMs, and I remember him complaining about all the edge cases he encountered, even though the tool only had to support documents from our main SoC supplier.

@Mayabotics Had to pull my thesis temporarily while our paper about it gets peer reviewed anonymously. It's still available in the git history ;-P https://github.com/salkinium/intertubes/tree/4b7af2a4bf2735162cb244f50738609cc04dcd37
GitHub - salkinium/intertubes at 4b7af2a4bf2735162cb244f50738609cc04dcd37

@jbigham hmm, in which way is LaTeX PDF output non-accessible? At least it doesn't produce rasters like, well, Photoshop & other things, so that must be some other problem…

Do you mean text blocks ordering or some other stuff?
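
For context on that question: a tagged PDF declares a structure tree (`/StructTreeRoot`) in its document catalog and sets `/Marked true` in the `/MarkInfo` dictionary, and that structure is what assistive technology typically uses for reading order, headings, and alternative text. Below is a minimal sketch of a check, assuming a crude scan of the raw bytes is good enough; the function name `looks_tagged` is made up for illustration.

```python
import re
import sys
from pathlib import Path


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Crude heuristic: does the raw file mention a structure tree and /Marked true?

    A True result only means the tagging markers are present, not that the tags
    are any good. A False result is inconclusive, because the catalog may sit
    inside a compressed object stream that a plain byte scan cannot see.
    """
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and is_marked


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python check_tags.py paper.pdf
    data = Path(sys.argv[1]).read_bytes()
    print("tagging markers found" if looks_tagged(data) else "no tagging markers found")
```

A proper check would parse the file with a PDF library and then validate the tags themselves; this only distinguishes "no sign of tags at all" from "something is there".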

LMGTFY - Let Me Google That For You

For all those people who find it more convenient to bother you with their question rather than to Google it for themselves.

@jbigham I know how to use Google, thanks.
@jbigham (and it's not meant to search for URLs by URL, that's what the URL bar is.)

@jbigham https://lmgtfy.app/?q=latex+pdf+tags

Don't know how extensive they are, but there are ways to do it, it seems. Sure, they need more exposure, and a push for arXiv to consider making it mandatory.

@jbigham just like surely not everyone using Acrobat knows or will use the accessibility tags properly anyway.

People still want to fake radio buttons in HTML, so there's a long road ahead…

@jbigham makes me wonder how well LibreOffice handles that in the PDFs it generates, as I've been using it for years thinking it was just fine.

Same for pandoc, which I use to generate user manuals through LaTeX from markdown… 🤔

@jbigham are there alternatives to it?
@SWwind lots of them… e.g., Word, epub, HTML, etc, all with tradeoffs.
@jbigham @grimalkina if only we had invented some kind of Markup Language that allowed Text to be annotated, include Hyperlinks, and evolve over time to capture advances and updates in scientific thinking. Maybe we could use that as a shareable format, with each update being marked with different resource identifiers.

@jbigham I now use markdown (specifically quarto) to render work into multiple formats from the same source file.

As you can keep the .tex file while rendering to PDF, submissions where .tex files are still de rigueur are not a barrier.
But I can also, with the same source document, produce a fully accessible document (even with equations), with structure and metadata.
Making it easier to host files in a more accessible format on arXiv and similar is key

@DToher @jbigham Indeed, for some reason, Physical Review takes the tex files but offers only PDF afterwards. Maybe we need to yell at our professional societies more.
@jbigham What makes accessing a pdf document challenging?
@jbigham The solution is to use the PubMed approach: encode documents in XML and then compile to PDF or an accessible format at the user's request. (Personally I think it's a crying shame that EPUB can't handle all of this natively.)
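
A minimal sketch of that "one XML source, many outputs" idea, using only Python's standard library. The element names (`article`, `sec`, `title`, `p`) are a hypothetical mini-schema that loosely echoes JATS-style naming; this is not a valid JATS document and not PubMed's actual pipeline, just an illustration of compiling structured XML to semantic HTML.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document in a made-up mini-schema.
ARTICLE_XML = """<article>
  <title>Accessible science</title>
  <sec>
    <title>Introduction</title>
    <p>PDFs are hard to tag after the fact.</p>
  </sec>
</article>"""


def article_to_html(xml_text: str) -> str:
    """Render the mini-schema as HTML with real headings and sections."""
    root = ET.fromstring(xml_text)
    parts = ["<article>"]
    parts.append(f"<h1>{escape(root.findtext('title', default=''))}</h1>")
    for sec in root.findall("sec"):
        parts.append("<section>")
        parts.append(f"<h2>{escape(sec.findtext('title', default=''))}</h2>")
        for p in sec.findall("p"):
            # itertext() flattens any inline markup inside the paragraph
            text = "".join(p.itertext()).strip()
            parts.append(f"<p>{escape(text)}</p>")
        parts.append("</section>")
    parts.append("</article>")
    return "\n".join(parts)


html_output = article_to_html(ARTICLE_XML)
```

Because headings and sections are explicit in the source, the HTML carries them through without any after-the-fact tagging; a PDF or EPUB writer could be added as another function over the same tree.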
@jbigham This is something I've been trying to legislate about, but it's hard... my state government uses PDFs a lot, too; how should we balance between the interests of supporting accessibility and opposing proprietary software?

@jbigham I understand this frustration. Sometimes I use Linux for some workaround (including using something that will let me create, markup, and edit PDFs)

I hate the hold that Adobe has on this kind of stuff.

@jbigham I'm curious about how hard it would be to teach screenreaders to read TeX, or perhaps make a tool that takes in TeX and spits out some semantic tagged stream. (Or maybe this is just the TeX -> HTML that already exists.)

And I feel like it should be possible to include TeX in PDFs as an embed of some sort... I know it's a hack, but I just wonder if it's a faster road to accessibility.

@jbigham EPUB should be the solution for this; it's based on HTML and offers reflowable and fixed-layout options. Unfortunately, there's no default reader for Windows, while there's the Books app on all Apple products.
@jbigham Do the suggestions at https://libguides.lib.msu.edu/c.php?g=995742&p=8207771 mostly not work, then? (Haven't tried this myself but was planning on trying soon.)
LibGuides: LaTeX: Creating Accessible LaTeX Documents

A basic introduction to writing and managing citations in LaTex.

@jbigham pdf is the devil. like it emulates a piece of paper, but somehow manages to be worse than an actual hard paper copy. i like to do hacking on extracting data from pdf to make the bad go away

@jbigham @freakboy3742 I work in this domain and PDF display and management is a non-trivial exercise.

PDFs held hostage by predatory publishers is a whole other extra barrel of fun.

@jbigham meanwhile, the original design intent of HTML is sitting in the corner, dagger-eyes