At #TPDL2023 Shawn Jones (@shawnmjones) is giving a talk on “Synthesizing Web Archive Collections Into Big Data: Lessons From Mining Data From Web Archives”.
He talks about the challenges of mining data from web archives. The authors reviewed 22 web archives and discuss the methods needed to re-synthesize a memento into something close to its original capture, without augmentations.
He especially cares about robots.
Paper is available at: https://doi.org/10.1007/978-3-031-43849-3_19
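To make "without augmentations" concrete, here is a minimal sketch of one convention I know from Wayback-style archives: placing the `id_` modifier after the 14-digit timestamp in a URI-M asks for the capture without the archive's banner and link rewriting. This is an assumption on my part about a single family of archives, not the method from the paper, and other archives may need different handling. The function name `raw_urim` is hypothetical.

```python
import re

# Wayback-style URI-M layout (assumed): <prefix>/<14-digit timestamp>[modifier]/<original URI>
_URI_M = re.compile(
    r"^(?P<prefix>.*?/)(?P<ts>\d{14})(?P<mod>[a-z]{2}_)?/(?P<orig>.+)$"
)


def raw_urim(urim: str) -> str:
    """Return the URI-M with the `id_` modifier, or the input unchanged
    if it does not look like a Wayback-style URI-M."""
    m = _URI_M.match(urim)
    if m is None:
        return urim
    # Any existing modifier (e.g. `if_`) is replaced by `id_`.
    return f"{m.group('prefix')}{m.group('ts')}id_/{m.group('orig')}"


if __name__ == "__main__":
    print(raw_urim("https://web.archive.org/web/20230926120000/https://example.com/"))
```

The function only rewrites the string; whether a given archive honors `id_` has to be checked per archive.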
At #TPDL2023 right now, @martinklein is presenting “It's Not Just GitHub: Identifying Data and Software Sources Included in Publications”
The authors trained a classifier that labels open-access data and software (OADS) URLs extracted from research papers as either dataset or code. Archivists can then take these URLs and preserve the referenced datasets and code for reproducibility.
Paper: https://doi.org/10.1007/978-3-031-43849-3_17
Preprint: https://arxiv.org/abs/2307.14469
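To illustrate the general idea of labeling a URL as dataset or code, here is a toy sketch. The talk summary does not say which model or features the authors used, so I swap in a tiny multinomial Naive Bayes over URL tokens. The class name `UrlNaiveBayes`, the training URLs, and the labels are all invented for illustration and are not from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(url: str) -> list[str]:
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class UrlNaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self) -> None:
        self.token_counts: dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def fit(self, urls: list[str], labels: list[str]) -> "UrlNaiveBayes":
        for url, label in zip(urls, labels):
            self.label_counts[label] += 1
            for tok in tokenize(url):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, url: str) -> str:
        total = sum(self.label_counts.values())
        best_label, best_score = "", -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokenize(url):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, not taken from the paper.
TRAIN = [
    ("https://github.com/example/parser", "code"),
    ("https://gitlab.com/example/sim-code", "code"),
    ("https://example.org/software/toolkit.git", "code"),
    ("https://example.org/data/survey.csv", "dataset"),
    ("https://example.org/dataset/records.zip", "dataset"),
    ("https://data.example.net/corpus/download", "dataset"),
]

model = UrlNaiveBayes().fit([u for u, _ in TRAIN], [l for _, l in TRAIN])

if __name__ == "__main__":
    print(model.predict("https://github.com/another/project"))
    print(model.predict("https://example.org/data/results.csv"))
```

A real classifier would need far more training data and features than six URLs; this only shows the shape of the task.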
Beatrice Alex is giving the second #TPDL2023 keynote “AI language technologies and digital collections: the need for interdisciplinary communication and co-design and training”
* How can we invite #AI into the #archive?
* AI can provide a lot of positive opportunities.
* To improve its application, we need #interdisciplinary collaborations going forward.
* AI #literacy needs to be taught early in education.
Ref:
* https://www.ed.ac.uk/profile/dr-beatrice-alex
* https://www.ltg.ed.ac.uk
* https://www.ed.ac.uk/usher/clinical-natural-language-processing/people
Yesterday at #TPDL2023 David Pride presented “CORE-GPT: Combining Open Access research and large language models for credible, trustworthy question answering”
Rather than #ZeroShot question answering, Pride’s team queries the #CORE #OpenAccess dataset through #ElasticSearch and builds #FewShot prompts from the results. This combines the strengths of #search with the #LLM’s (#GPT) #summarization abilities to answer a user’s question with citations.
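As a rough sketch of how search results can be folded into a prompt that asks for cited answers: the function below is my own illustration, not CORE-GPT's actual prompt. The name `build_prompt`, the result fields (`title`, `url`, `abstract`), and the instruction wording are all assumptions, and the search and LLM calls are left out entirely.

```python
def build_prompt(question: str, results: list[dict], max_sources: int = 3) -> str:
    """Assemble a prompt that lists numbered sources before the question.

    `results` is assumed to be a list of dicts with `title`, `url`, and
    `abstract` keys, e.g. the top hits returned by a search index.
    """
    lines = [
        "Answer the question using only the numbered sources below.",
        "Cite sources inline as [n].",
        "",
    ]
    for i, hit in enumerate(results[:max_sources], start=1):
        lines.append(f"[{i}] {hit['title']} ({hit['url']})")
        lines.append(f"    {hit['abstract']}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder search hits; a real system would fetch these from an index.
    hits = [
        {"title": "Paper A", "url": "https://example.org/a", "abstract": "Abstract of A."},
        {"title": "Paper B", "url": "https://example.org/b", "abstract": "Abstract of B."},
    ]
    print(build_prompt("What do papers A and B report?", hits))
```

Numbering the sources in the prompt is what lets the model's answer point back to specific papers.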
Yesterday at #TPDL2023 Gianmaria Silvello presented “How to Cite a Web Ranking and Make it #FAIR”
Researchers often need to cite #SearchEngine results. Unfortunately, search engines change their algorithms and indexes all the time, so a ranking seen today may not be reproducible later. Alessandro Lotta and Gianmaria Silvello presented a #prototype that captures this ranking in a human- and machine-readable #format and posts it to #Zenodo for citing with a #DOI.
My suggestion: include #webarchiving and #webarchives
At #TPDL2023, @hkroll and Mirjam Cuper from the Institute for Information Systems presented “Aspect-Driven Structuring of Historical Dutch Newspaper Archives”
The authors discussed the challenges of automatically organizing and structuring content in a corpus when the #OCR is unreliable, the #metadata might be inconsistent, and the #licensing restrictions dictate who can see the content.