#googlescholar is more complete, but the accuracy of the metadata drops off. I've found that historic searches (e.g., <1950) are mostly incorrectly dated.

I was curious whether this is corroborated by research and came across: https://pmc.ncbi.nlm.nih.gov/articles/PMC7079055/
...

Emma Tosch, Thought Follower Apr 14

@cxli Interestingly, this study (conducted in 2019) reports that the #ACMDL allows bulk download. I don't know if this feature is just hard to find or if it's been removed since then.

(Maybe @JonathanAldrich would know?)

Jonathan Aldrich Apr 14

@etosch @cxli I don't know the history but right now I think they are doing it as a defense against unauthorized LLM training and other things that act like DDOS. It can cause problems for certain kinds of academic use; given this, I'm honestly not sure it's worth the cost.

Rob Ricci Apr 15

@JonathanAldrich @etosch @cxli possibly unpopular take: if LLMs should be trained on anything, it should be scientific papers, so if this is ACM's reasoning for not supporting automated workflows, it's doubly harmful

(yes, I know: they want to get paid for it)

Emma Tosch, Thought Follower Apr 15

@ricci @JonathanAldrich @cxli Counterpoint: what is the purpose of LLMs?

I think I get what's implied --- scientific papers meet a quality metric for training data. However, if your goal is to use LLMs for customer support, they are absolutely the wrong training data!

Emma Tosch, Thought Follower

@ricci @JonathanAldrich @cxli That said, I assume you mean that if LLMs are to be an expert system, they should be trained on appropriate domain expertise. I still think conference papers are the wrong training data. There is an enormous amount of implicit knowledge and norms in academic conference papers that isn't explicitly encoded anywhere. If we want a tool that's functionally closer to LLMs, IMO we need a formal target and more traditional (and constrained!) ML to achieve that.

Rob Ricci Apr 20

@etosch @JonathanAldrich @cxli oops, I was going to follow up on this and forgot. Yeah, part of my thinking was that they should, on average, contain information that's more likely to be correct than random Internet text. But I was also thinking about availability of text: there are copyright and economic questions around things like published books, but most academics are *happy* to get their papers out there as widely as possible. They're written with the explicit purpose of getting information out there, and we're not expecting to get paid for them, so some of the thorny issues around other sources of text are not present.

But yeah let's not use them for training customer service LLMs