Mastodawn

@cxli Interestingly, this study (conducted in 2019) reports that the #ACMDL allows bulk download. I don't know if this feature is just hard to find or if it's been removed since then.

(Maybe @JonathanAldrich would know?)

@etosch @cxli I don't know the history but right now I think they are doing it as a defense against unauthorized LLM training and other things that act like DDOS. It can cause problems for certain kinds of academic use; given this, I'm honestly not sure it's worth the cost.

@JonathanAldrich @cxli I've had several research threads over the past 3-4 years that have more or less stalled out because while the DL seems like the best resource for them, it's just too labor intensive to manually search, click, download, refine the search, exclude papers already read, etc.

@JonathanAldrich @cxli I'm curious what their threat model is for LLMs (aside from DDOS) and how that relates to their costs and revenue. Like, what I really want is a database connection and _maybe_ some UI and querying features. I'd prefer to work locally, but I could also see value in working on an ACM-hosted private notebook (which could become public upon publication). I _do not_ want an "AI research assistant." I would accept certain constrained AI/ML tools, if I understood their affordances.

@JonathanAldrich @cxli What I'm wondering is whether people like me are even the target audience for ACM DL subscriptions. If yes, then surely others would be interested in these features! If no, I'd like to know what our alternatives are.

I'd love to hear any insights you have on this, @JonathanAldrich! I really appreciate having some insight into the mechanics of these orgs.

@etosch @cxli I can't speak definitively, but beyond DDOS (which seems implausible anyway) I think the threat model is simply that a lot of ACM members don't want to allow LLMs to train on their papers. And while ACM wouldn't necessarily mind allowing training on the remainder, ACM would like to get paid for it, enabling us to provide more services to members at lower cost.

@etosch @cxli And the people who don't want LLMs to train on their papers generally feel that way for ethical reasons. It's not an issue for me, but I respect those who feel that way and I think ACM wants to also respect those wishes. Of course, when your whole library is Open Access (and honestly, even if it is not), this is very hard to enforce. It may be a losing battle.

@JonathanAldrich @cxli whoops, you mentioned open access here; this is what happens when I reply to messages one at a time. 🙃

@JonathanAldrich @cxli Hm. I suspect a lot of the ACM members who don't want their work to be training data are also proponents of open access. I don't know if these options are as mutually exclusive as they appear.

I'm also not convinced that firms selling LLM services would have a competitive advantage over what a usable ACMDL UI could provide, but maybe I'm alone here?

@etosch @cxli Yeah LLMs.txt is supposed to allow training limits to coexist with open access (and indeed robots.txt could also be used). But compliance is voluntary. ACM's rate-limiting tools are a backstop--and lawsuits could be another one--but it's hard to be sure how effective they are.
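To make the "compliance is voluntary" point concrete, here is a minimal sketch of how a well-behaved script would consult robots.txt before fetching anything, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleTrainingBot` and `ResearchScript` user agents, and the `dl.example.org` URLs are all hypothetical; this is not the ACM DL's actual policy. The check runs entirely on the client side, so a crawler that skips it faces no technical barrier from robots.txt itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Disallow: /action/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs on a placeholder host.
paper_url = "https://dl.example.org/doi/10.0000/example"
search_url = "https://dl.example.org/action/doSearch"

for agent in ("ExampleTrainingBot", "ResearchScript"):
    for url in (paper_url, search_url):
        print(agent, url, may_fetch(parser, agent, url))
```

Nothing here stops a client from ignoring the answer, which is why server-side measures such as rate limiting end up as the backstop.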

@JonathanAldrich @cxli It's not just rate limiting though! Someone who definitely isn't me has managed to programmatically download papers directly (even open access ones!), but only by jumping through hoops (copying real browser headers, creating sessions, some other stuff I'd have to...ask them about). I assume it wasn't always like this?

@etosch @cxli What is the alternative to manual search? You want to script the search? Would you use LLMs or other tools to read the downloaded papers and automatically decide if they are relevant to your systematic review?

@JonathanAldrich @cxli re: manual search. So right now I need to query through the browser so I have all the right headers. This entails a lot of pointing and clicking. I wish it were easier to refine my search query a little more programmatically. This is really a misalignment between what I want from a stateful system (let's call it a "research session") that has pagination and back buttons and how these systems are typically designed.

@JonathanAldrich @cxli re: LLMs. I guess I might consider using an LLM to verify aspects of the review, but not for the primary research.

Here's an example task I recently tried to do: I wanted to catalogue the benchmarks used in ASPLOS 2026 papers. My query was very simple: just the papers from the proceedings that use the word "benchmark" somewhere. I wanted a table of the names of the suites, domain, units (or "entity types"), size, dates of introduction, and a few other things.

@JonathanAldrich @cxli The problem was that just programmatically extracting a list of the names of the papers and their DOIs doesn't seem possible without (I'm assuming) breaking the ACM DL's terms of service.

@JonathanAldrich @cxli Instead I get my paginated list of papers and I have to click through each of them, copy their names and DOIs, then open them, and then fill out my table.

Jonathan Aldrich Apr 15

@etosch @cxli Yeah I can see why you'd want support for automation. I think it's totally reasonable, but I don't think the ACM has been thinking about these uses.

@JonathanAldrich @cxli Follow up question: do you happen to know if the ACM employs librarians or archivists?

Jonathan Aldrich Apr 15

@etosch @cxli Not sure about staff. I know we have some on the Publications Board. If you are interested I can try to make a connection.

@JonathanAldrich @cxli Yeah, actually that would be very cool, thank you!

Jonathan Aldrich Apr 15

@etosch @cxli Ok, reflecting on this, providing more service from the DL is in the purview of the ACM DL Board, not the Publications Board (which really does publications policy). Some members of the DL Board with specific library expertise include Stephen Downie (UIUC, Ph.D. in Lib/info sci), Michael Ley (creator of DBLP), and Phoebe Ayers (Librarian at MIT).

@JonathanAldrich @cxli For precedent/as an example: This 2006 paper (Empirical evaluation in Computer Science research published by ACM, Wainer et al., https://www.sciencedirect.com/science/article/pii/S0950584909000093) replicates Tichy et al.'s 1995 review of empiricism in CS. Wainer et al. randomly select 200 papers and annotate them (with some exclusions). 1/🧵

@JonathanAldrich @cxli The annotations/classifications they do could benefit from automation, but they don't require LLMs. I'm sure you could get very high accuracy from a small set of features using traditional ML. Hell, if the ACM really wanted to support researchers, they could provide classification/annotation as a service for these kinds of reviews.

@JonathanAldrich @cxli Ah so here is an ACM-published paper that includes a lit review: https://dl.acm.org/doi/pdf/10.1145/3406544

I would love it if the authors' annotations were available through the #ACMDL and linked to papers, supporting queries like, "get all of the empirical papers that don't involve human subjects."

@JonathanAldrich @cxli btw while I'm not sure such a tool is necessary or worth the cost, here is an example of a paper I'd found that uses an AI research assistant that has features that someone might want: https://link.springer.com/article/10.1007/s10115-024-02284-3

Factors influencing open science participation through research data sharing and reuse among researchers: a systematic literature review - Knowledge and Information Systems

@JonathanAldrich @cxli I do think that if the ACM were to provide such a resource, there should be an audit available as part of open access. (Right now Elicit doesn't offer a free trial, nor institutional access, from what I can tell.)

Rob Ricci Apr 15

@JonathanAldrich @etosch @cxli possibly unpopular take: if LLMs should be trained on anything, it should be scientific papers, so if this is ACM's reasoning for not supporting automated workflows, it's doubly harmful

(yes, I know: they want to get paid for it)

@ricci @JonathanAldrich @cxli Counterpoint: what is the purpose of LLMs?

I think I get what's implied --- scientific papers meet a quality metric for training data. However, if your goal is to use LLMs for customer support, they are absolutely the wrong training data!

@ricci @JonathanAldrich @cxli That said, I assume you mean that if LLMs are to be an expert system, they should be trained on appropriate domain expertise. I still think conference papers are the wrong training data. There is an enormous amount of implicit knowledge and norms in academic conference papers that isn't explicitly encoded anywhere. If we want a tool that's functionally closer to LLMs, IMO we need a formal target and more traditional (and constrained!) ML to achieve that.

Rob Ricci Apr 20

@etosch @JonathanAldrich @cxli oops I was going to follow up on this and forgot. Yeah part of my thinking was that they should generally contain on average information that's more likely to be correct than random Internet text. But also I was thinking about availability of text: there's copyright and economic questions around things like published books, but most academics are *happy* to get their papers out there as widely as possible. They're written with the explicit purpose of getting information out there and we're not expecting to get paid for them so some of the thorny issues around other sources of text are not present.

But yeah let's not use them for training customer service LLMs

@cxli FWIW the tl;dr version of this article is in the Discussion section:

> Overall, we found that only 14 of the 28 academic search systems examined are well‐suited to evidence synthesis in the form of systematic reviews...[and...can be used as principal search systems: ACM Digital Library, BASE, ClinicalTrials.gov, Cochrane Library, EbscoHost ..., OVID ..., ProQuest ..., PubMed, ScienceDirect, Scopus, TRID, Virtual Health Library, Web of Science ..., and Wiley Online Library.

cynthia Apr 14

@etosch interesting! i mostly got the sense that most CS work was being published in ACMDL which is why i gravitate there though admittedly i haven't done a lot of Real or Good lit review in CS

@cxli re: citation chains.

Ever notice that older papers have fewer citations? Is it because there's been more growth in the field? Or is it because citation practices have become bloated at best and polluting at worst?

Put another way: perhaps citation practices perform one function during review but two (possibly conflicting) functions once published?

cynthia Apr 14

@etosch i feel mixed on this but i also tend to have a different sense of what needs to be cited bc norms of english citation are different than in CS

crediting authors is very important to me but i tend to value citations that actually Add Substance to a piece of writing but have much less tolerance for citations that serve the purpose of "yes we read this paper don't bother me about it"

@cxli I agree the fields' norms feel qualitatively different, but I also never formally published in English, so my impression is based solely on my professors' feedback at the time!

One of the main pieces of work I really wanted to put together for the Helical project was an encoding of a specific model/hypothesis in some area of computer science that evolved over time, due to experimentation. Citation practices made this a challenging task.

@cxli The fundamental issue was that I couldn't differentiate _why_ a paper was being cited without diffing through the citing text. I'd characterize most of the citations to major papers as "junk."

@cxli As an example, there's a highly cited paper from CCS that does an empirical analysis of different fuzzers. I started painstakingly tracking down which papers cited it as a proxy for "generalized knowledge" being "passed down." I found that the first 20 or so papers cited it when justifying the number of independent runs they used for their fuzzers. They didn't explicitly engage with the paper otherwise.

@cxli Now, it's possible that "do at least 30 runs" or whatever _is_ the main generalized knowledge. I was just surprised to see so little variability in the context for those citations.
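The manual step described above (reading the citing text to see why a paper is cited) can be partly sketched in code. The following is a rough illustration only, assuming plain-text papers with numeric citation markers like `[7]`. The function names, the cue phrases in `CONTEXT_CUES`, and the sample text are all hypothetical, and the naive sentence splitter would miss grouped citations such as `[3, 7]` and break on abbreviations.

```python
import re
from typing import List

# Hypothetical cue phrases per citation context; a real study would need
# a validated coding scheme rather than this ad hoc list.
CONTEXT_CUES = {
    "methodology": ("runs", "trials", "as recommended"),
    "comparison": ("outperform", "compared to", "in contrast"),
    "existence": ("such as", "include"),
}


def citing_sentences(text: str, marker: str) -> List[str]:
    """Return the sentences in `text` that contain the literal citation marker."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if marker in s]


def label_context(sentence: str) -> str:
    """Assign the first context label whose cue phrase appears in the sentence."""
    lowered = sentence.lower()
    for label, cues in CONTEXT_CUES.items():
        if any(cue in lowered for cue in cues):
            return label
    return "other"


# Invented sample text, not taken from any real paper.
SAMPLE = (
    "Fuzzing results vary between campaigns. "
    "Following prior guidance [7], we repeat each experiment for 30 independent runs. "
    "Our tool builds on coverage-guided fuzzing [3]. "
    "We report medians throughout."
)

for marker in ("[7]", "[3]"):
    for sentence in citing_sentences(SAMPLE, marker):
        print(marker, label_context(sentence), "|", sentence)
```

Tallying the labels across many citing papers would give a crude view of how varied the citation contexts are, though the labels are only as good as the cue list.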

@cxli I ran into a similar issue when looking at benchmarks in ML. There's been a proliferation of new datasets, corpora, and "benchmark suites" over the past 5-10 years --- I was even part of a group that made one! (toybox.rs) Our work had a small amount of reuse and most of the citations were like "this thing exists." Looking into similar efforts, I realized this is the norm.

cynthia Apr 14

@etosch yeah i guess my main frustration w CS citation practice is how superficial it feels so much of the time.....? the way papers r written always make me feel like it's super individualistic work that only faintly ekes out connections to other stuff

@cxli STRONG AGREE.

Within smaller subfields, I think this is less true --- for better or for worse, PL has remained "niche" and IMO the community actually cares about getting the "process knowledge" right through apprenticeship.

I would note that I was trained to be the thing I hate and have absolutely contributed to the problem (ish: I'm a research nobody, so I actually just experience the psychological wound of the practice, without the benefit. :P)

Shae Erisson Apr 15

@etosch @cxli I approached citations differently. I wanted to know if cross field citations were beneficial, the only paper I've found for that is:

https://journals.sagepub.com/doi/10.1177/2329496514540131

@shapr @cxli In a lot of ways, computer science isn't exactly a "field." It's unclear what phenomena computer scientists study; classically the answer was "computation" but IMO that's quite a stretch for the vast majority of us to claim. I'd say most of the computer scientists who study computation could just as easily be called mathematicians, philosophers, or logicians, depending on their phenomenological focus.

@shapr @cxli Of course, disciplinary differentiation is a new concept and one that is more about politics than some kind of taxonomy of disciplines. In that context, citation practices are also political.

@shapr @cxli All of that is to say that there are forces at play that incentivize cross-disciplinary citation when it confers legitimacy on an interdisciplinary project and forces that disincentivize cross-disciplinary citation when it might attract accusations of epistemic trespassing.

Shae Erisson Apr 14

@etosch Google scholar, arxiv and the crypto preprint website

@shapr 🤨 What's "the crypto preprint website"?