A lot of discussion at #tpdl2023 about OCR, Tesseract, and the post-processing steps
@shawnmjones overheard at #tpdl2023 "be more like perma.cc"
In his #tpdl2023 talk "Synthesizing Web Archive Collections Into Big Data: Lessons From Mining Data From Web Archives", @shawnmjones summarizes the status quo: a lack of interoperability among web archives, specifically for large-scale use, aka "think about the robots!".

At #tpdl2023 Shawn Jones (@shawnmjones) is giving a talk on “Synthesizing Web Archive Collections Into Big Data: Lessons From Mining Data From Web Archives”.

He talks about the challenges of mining data from web archives. The authors reviewed 22 web archives and discuss the methods needed to re-synthesize a memento into something close to its original capture, without augmentations.

He especially cares about robots.

Paper is available at: https://doi.org/10.1007/978-3-031-43849-3_19
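
To make "without augmentations" concrete, here is a minimal sketch for one archive only. It assumes a Wayback-style archive such as the Internet Archive's Wayback Machine, where adding the `id_` modifier after the 14-digit timestamp returns the capture without the replay banner or rewritten links. The function name and default prefix are illustrative, other archives may not support this modifier at all (which is rather the point of the talk), and this is not presented as the paper's method.

```python
def raw_memento_url(timestamp: str, original_url: str,
                    archive_prefix: str = "https://web.archive.org/web/") -> str:
    """Build a URI-M that asks a Wayback-style archive for the raw capture.

    timestamp is a 14-digit YYYYMMDDhhmmss string. The `id_` suffix is the
    Wayback Machine's modifier for unaltered content; support elsewhere is
    an assumption that has to be checked per archive.
    """
    if not (timestamp.isdigit() and len(timestamp) == 14):
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return f"{archive_prefix}{timestamp}id_/{original_url}"


if __name__ == "__main__":
    print(raw_memento_url("20230926000000", "https://example.com/"))
```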

At #TPDL2023 right now, @martinklein is presenting “It's Not Just GitHub: Identifying Data and Software Sources Included in Publications”

The authors trained a classifier to classify open-access data and software (OADS) URLs from research papers as dataset or code. Archivists can then take these URLs and preserve the referenced datasets and code for reproducibility.

Paper: https://doi.org/10.1007/978-3-031-43849-3_17
Preprint: https://arxiv.org/abs/2307.14469

Martin Klein (@martinklein) is giving a presentation at #tpdl2023 titled “It’s Not Just GitHub: Identifying Data and Software Sources Included in Publications”. He talks about which repositories researchers publish their data and code in.
Paper is available at https://doi.org/10.1007/978-3-031-43849-3_17

Beatrice Alex is giving the second #TPDL2023 keynote “AI language technologies and digital collections: the need for interdisciplinary communication and co-design and training”

* How can we invite #AI into the #archive?
* AI can provide a lot of positive opportunities.
* To improve its application, we need #interdisciplinary collaborations going forward.
* AI #literacy needs to be taught early in education.

Ref:
* https://www.ed.ac.uk/profile/dr-beatrice-alex
* https://www.ltg.ed.ac.uk
* https://www.ed.ac.uk/usher/clinical-natural-language-processing/people

Dr Beatrice Alex

The University of Edinburgh

Yesterday at #TPDL2023 David Pride presented “CORE-GPT: Combining Open Access research and large language models for credible, trustworthy question answering”

Rather than #ZeroShot question/answering, Pride’s team combines the #CORE #OpenAccess dataset with #ElasticSearch to create #FewShot prompts that leverage the strength of combining #search results with the #LLM’s (#GPT) #summarization abilities to produce an answer to a user’s question including citations.

Ref: https://doi.org/10.1007/978-3-031-43849-3_13
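
A rough sketch of the general idea of combining search results with an LLM prompt, as described in the post above: put the top search hits into the prompt as numbered sources and ask the model to cite them. Everything here (the `Hit` fields, the prompt wording, `max_hits`) is a hypothetical illustration, not CORE-GPT's actual prompt or code. The search and the model call are left out on purpose.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search result; the field names are assumptions for this sketch."""
    title: str
    url: str
    abstract: str


def build_prompt(question: str, hits: list[Hit], max_hits: int = 3) -> str:
    """Turn search hits into a prompt that asks for an answer with [n] citations."""
    sources = [
        f"[{i}] {hit.title} ({hit.url})\n{hit.abstract}"
        for i, hit in enumerate(hits[:max_hits], start=1)
    ]
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources as [n].\n\n"
        + "\n\n".join(sources)
        + f"\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    demo_hits = [Hit("Example paper", "https://example.org/paper", "An example abstract.")]
    print(build_prompt("What does the example paper say?", demo_hits))
```

Numbering the sources in the prompt is what lets the answer's citations be traced back to specific documents.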

Yesterday at #TPDL2023 Gianmaria Silvello presented “How to Cite a Web Ranking and Make it #FAIR”

Researchers often need to cite #SearchEngine results. Unfortunately, search engines change their algorithms and their index all the time. Alessandro Lotta and Gianmaria Silvello presented a #prototype that captures this ranking in a human- and machine-readable #format and posts it to #Zenodo for citing with a #DOI.

My suggestion: include #webarchiving and #webarchives

Ref: https://doi.org/10.1007/978-3-031-43849-3_6
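
To illustrate what a machine-readable capture of a ranking could look like, here is a minimal sketch that freezes a query, the engine name, a capture time, and the ranked URLs into JSON. The field names and the function are assumptions made up for this example, not the format used by the prototype. Depositing the record and minting a DOI are not shown.

```python
import json
from datetime import datetime, timezone


def ranking_record(query, engine, urls, captured_at=None):
    """Serialize a search ranking as JSON; the schema is hypothetical."""
    captured_at = captured_at or datetime.now(timezone.utc)
    record = {
        "query": query,
        "engine": engine,
        "captured_at": captured_at.isoformat(),
        # Rank is stored explicitly so the order survives any reprocessing.
        "results": [{"rank": rank, "url": url}
                    for rank, url in enumerate(urls, start=1)],
    }
    return json.dumps(record, indent=2)


if __name__ == "__main__":
    print(ranking_record("web archiving", "example-engine",
                         ["https://example.org/a", "https://example.org/b"]))
```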

#TPDL2023 @hkroll and Mirjam Cuper from the Institute for Information Systems presented “Aspect-Driven Structuring of Historical Dutch Newspaper Archives”

The authors discussed the challenges of automatically organizing and structuring content in a corpus when the #OCR is unreliable, the #metadata might be inconsistent, and the #licensing restrictions dictate who can see the content.

Ref: https://doi.org/10.1007/978-3-031-43849-3_4