Published at #IRRJ: "CRAWLDoc: A System for Contextual Ranking and Bibliographic Metadata Extraction from Web Resources" by Fabian Karl and Ansgar Scherp. #DocumentRanking, #BibliographicMetadataExtraction, #ScholarlyDataset

https://doi.org/10.54195/irrj.23861

CRAWLDoc: A System for Contextual Ranking and Bibliographic Metadata Extraction from Web Resources | Information Retrieval Research

Hard problems that reduce to document ranking

There are two claims I’d like to make:

1. LLMs can be used effectively for listwise document ranking.
2. Some complex problems can (surprisingly) be solved by transforming them into document ranking problems.

I’ve primarily explored both of these claims in the context of using patch diffing to locate N-day vulnerabilities, a sufficiently domain-specific problem that can be solved using general-purpose language models as comparators in document ranking algorithms. I demonstrated at RVAsec ’24 that listwise document ranking can be used to locate the specific function in a patch diff that actually fixes a vulnerability described by a security advisory. I later wrote on the Bishop Fox blog in greater defense of listwise ranking and published a command-line tool implementation (raink) to prove the idea.
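
To make "a model as a comparator in a ranking algorithm" concrete, here is a minimal Python sketch of one generic listwise scheme: shuffle the candidates, split them into small batches, ask a ranker to order each batch, and average each candidate's normalized position over several passes. This is an illustration under stated assumptions, not raink's actual implementation. The names `listwise_rank`, `BatchRanker`, and `keyword_overlap_ranker` are hypothetical, and the keyword-overlap function is only a stand-in for the place where an LLM call would go.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: given a query and a small batch of documents,
# return the same documents ordered from most to least relevant.
BatchRanker = Callable[[str, Sequence[str]], List[str]]


def keyword_overlap_ranker(query: str, batch: Sequence[str]) -> List[str]:
    """Stand-in for an LLM call: order a batch by words shared with the query."""
    query_words = set(query.lower().split())
    return sorted(
        batch,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )


def listwise_rank(
    query: str,
    docs: Sequence[str],
    rank_batch: BatchRanker,
    batch_size: int = 3,
    passes: int = 10,
    seed: int = 0,
) -> List[str]:
    """Rank unique docs by their average normalized position across batches."""
    rng = random.Random(seed)
    position_sums: Dict[str, float] = defaultdict(float)
    position_counts: Dict[str, int] = defaultdict(int)

    for _ in range(passes):
        shuffled = list(docs)
        rng.shuffle(shuffled)  # new batch groupings on every pass
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            if len(batch) < 2:
                continue  # a lone document carries no comparison signal
            ranked = rank_batch(query, batch)
            for position, doc in enumerate(ranked):
                # Normalize to [0, 1]: 0.0 is best in batch, 1.0 is worst.
                position_sums[doc] += position / (len(ranked) - 1)
                position_counts[doc] += 1

    def average_position(doc: str) -> float:
        if position_counts[doc] == 0:
            return 0.5  # never compared: treat as neutral
        return position_sums[doc] / position_counts[doc]

    # Lower average position means more relevant.
    return sorted(docs, key=average_position)
```

For the patch-diffing case, `docs` would be the changed functions from a diff and `query` the advisory text. Swapping `keyword_overlap_ranker` for a function that prompts a language model with the batch and parses its ordering leaves the rest of the loop unchanged. Small batches keep each prompt short, and repeated shuffled passes smooth out the noise in any single batch ordering.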