Code4Lib 2024 last week was *amazing*. I was really happy with how my talk on Batch Editing went (https://www.youtube.com/watch?v=B0fG0j949GM&list=PLw-ls5JXzeNYT9Y3JZGjl7EFy7vOeXpk-&index=2&t=2222s). I also did a lightning talk on some parallels I see between chess computers and AI in libraries that turned out pretty well (https://www.youtube.com/watch?v=npq8Qm1oeds&list=PLw-ls5JXzeNYT9Y3JZGjl7EFy7vOeXpk-&index=5&t=1558s). #code4lib #c4l24
At Code4Lib 2024, Zoe Tucker and Kristian Allen from UCLA Library did a presentation on their #OpenSource #metadata extraction pipeline for automated indexing of digital materials with complex layouts.
https://yewtu.be/watch?v=tujc_9nVg3o&t=10445
In their second iteration they chose the following components in order to improve the results: PaddleOCR (instead of #Tesseract) for #OCR, Amazon Science ReFinED (instead of #spaCy) for #NER, and Ollama (instead of #ChatGPT and #Gemini) for metadata extraction in Dublin Core or MODS.
Their experimental toolkit, implemented in Python, is available on GitHub as a Docker container running a JupyterLab environment.
https://github.com/UCLALibrary/metadata-extraction-lab
#AIinLibraries #Libraries #GenerativeAI #LLMs #AI #Cataloging #Cataloguing #c4l24
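For anyone wondering how the three stages in that pipeline fit together, here is a minimal sketch of the OCR, NER, then LLM flow. It is not the UCLA code: the `Pipeline` class, the stage signatures, and the `fake_*` stand-ins are all hypothetical, and none of them call the real PaddleOCR, ReFinED, or Ollama APIs. The point is only that each stage is a swappable function, which is what makes a switch like Tesseract to PaddleOCR cheap.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; the real toolkit's interfaces may differ.
Entity = tuple[str, str]                                  # (surface text, label)
OcrStage = Callable[[bytes], str]                         # page image -> plain text
NerStage = Callable[[str], list[Entity]]                  # text -> entities
LlmStage = Callable[[str, list[Entity]], dict[str, str]]  # -> Dublin Core-like fields


@dataclass
class Pipeline:
    """Chains three interchangeable stages: OCR, then NER, then metadata generation."""
    ocr: OcrStage
    ner: NerStage
    llm: LlmStage

    def run(self, image: bytes) -> dict[str, str]:
        text = self.ocr(image)           # 1. recognize text on the page
        entities = self.ner(text)        # 2. find names, places, etc. in that text
        return self.llm(text, entities)  # 3. turn text + entities into a record


# Stand-in stages (assumptions, for illustration only).
def fake_ocr(image: bytes) -> str:
    # Pretend the "image" bytes already contain the recognized text.
    return image.decode("utf-8")


def fake_ner(text: str) -> list[Entity]:
    # Naive rule: every capitalized word counts as a name.
    return [(w.strip(".,"), "NAME") for w in text.split() if w[:1].isupper()]


def fake_llm(text: str, entities: list[Entity]) -> dict[str, str]:
    # Build a tiny Dublin Core-like dict instead of prompting a model.
    return {
        "title": text.split(".")[0],
        "subject": "; ".join(name for name, _ in entities),
    }


pipeline = Pipeline(ocr=fake_ocr, ner=fake_ner, llm=fake_llm)
record = pipeline.run(b"Annual Report. Written by Ada.")
print(record)
```

Swapping a component means passing a different function for `ocr`, `ner`, or `llm`; the rest of the pipeline stays untouched.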
At Code4Lib 2024, Zoe Tucker and Kristian Allen from the UCLA Library presented an #OpenSource #metadata extraction pipeline for automated #indexing of digitized materials with complex layouts.
https://yewtu.be/watch?v=tujc_9nVg3o&t=10445
In a second iteration they settled on the following combination of components to get better results: PaddleOCR (instead of #Tesseract) for #OCR, Amazon Science ReFinED (instead of #spaCy) for #NER, and Ollama (instead of #ChatGPT and #Gemini) for metadata generation in Dublin Core or MODS.
The experimental toolkit is available on GitHub as a Docker container with a JupyterLab environment and was implemented in Python: https://github.com/UCLALibrary/metadata-extraction-lab
#KIinBibliotheken #Bibliotheken #GenerativeKI #LLMs #KI #Erschliessung #Katalogisierung #c4l24
Dear #library technology community, we need to talk about @OCLC. #libraries #code4lib #c4l24

This talk by Christina Cutler at #c4l24 was pretty great. It covered the accessibility of PDFs and the importance of allowing supplementary files to be submitted to institutional repositories and preprint servers:

https://www.youtube.com/live/B0fG0j949GM?feature=shared&t=7442

I learned that arXiv (sorta) recently started generating HTML views of TeX & LaTeX in order to improve accessibility over PDF: https://info.arxiv.org/about/accessible_HTML.html

I will say, and I'm sure this will come up at #LPF24 tomorrow and Thursday:

for all we talk about "open infrastructure," many libraries are adopting tools/platforms that on the technical side are over-engineered, require substantial computing power, and often, considerable attention from people with technical expertise.

To say nothing of the fact that, to load a single page, you are asking your users to download tons of webfonts and JavaScript and jump through all sorts of hoops just to access an ostensibly open textbook. Those users aren't on the latest MacBook Pros with 1+ Gbit connections like many developers or first-world librarians, especially in the US/Canada.

The result of the first part is that our 'open' infrastructure becomes highly centralized, we surrender privacy protections that we (hopefully!!) have for things we host/manage ourselves, and we risk locking ourselves into a vendor/client relationship. Just like with Elsevier/Springer/et al. And lol, what happens in 5-6 years when you can't afford whatever they're charging? Or when the company that you pay for hosting realizes that maintaining this complicated code monster isn't worth the effort?

The result of the second part is that we make access actively harder for people who aren't at ARL Libraries, who don't have lots of resources, who have a hard time reaching sites because their only internet connection outside of campus is their phone.

Compare this to something like @PublicKnowledgeProject's Open Journal Systems, which is a drastically simpler application based on 20+ year old technology that is rock solid.* That is easy to maintain. That is easy to understand. That is secure. Serving pages that are lightweight and easily accessible.

Not based on the popular programming language of the week. Not based on whatever the latest trend is.

Built on technology that very simply: JUST WORKS.

...and yes, we can make it have a slide carousel on the front page.

*technical: it's a (BSD/Linux/Mac/Windows)/Apache/MariaDB/PHP stack

#PKPSprint #LPF24 #C4L24

Poster presentation from #c4l24 on a user experience redesign of a “where I should study” web page
I do hope that #c4l24 and perhaps #code4lib will be trending these days. I have good reasons to not attend or watch online while it's happening, but the short bit I did catch was already very inspiring.

I'm at Code4Lib 2024 in Ann Arbor on an incredibly beautiful day! I'm so excited for a week of nerding out with fellow library technologists.

#code4lib #C4L24

We are looking forward to attending #c4l24 in Ann Arbor, MI next week!
If you are attending, keep an eye out for our table in the exhibitor area - and for our Repository Product Director, Aaron 👨‍💻 @code4lib