1/ Here's a thought to advance #OpenAccess to research. If it has problems, I think they're worth solving.
#AI #Copyright #Paraphrases #Summaries
🧵
4/ I've always supported knowledge extraction in part because the extracts are paraphrases in this sense. They grab the ideas and leave the expression behind.
https://en.wikipedia.org/wiki/Knowledge_extraction
7/ Of course AI summaries are imperfect. But an AI summary of a specific text is much more accurate than an AI summary of the state of knowledge on a specific question.
Also see my idea for a competition to make AI summaries even more accurate.
https://fediscience.org/@petersuber/109932591113483593
An idea for an annual contest in which #AI writing tools compete to do one task useful to scholars. Researchers from different fields pick one of their own recent full-text research articles. Each tool in that year's contest creates a one-page #summary of the article (without human help). The authors score the accuracy of the summaries of their own articles (without knowing which tool created which summary). The public sees the total scores for each tool, broken down by discipline and language.
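For illustration only, here's a minimal sketch of how the tallying step of such a contest might work. The record layout, tool names, and scores are assumptions invented for the example, not part of the proposal.

```python
from collections import defaultdict

# Each record: (tool, discipline, language, score). Authors score summaries
# blind; tool identities are attached only when the results are tallied.
scores = [
    ("tool-A", "biology", "en", 8),
    ("tool-B", "biology", "en", 6),
    ("tool-A", "history", "fr", 7),
    ("tool-B", "history", "fr", 9),
]

# Accumulate [sum, count] per (tool, discipline, language).
totals = defaultdict(lambda: [0, 0])
for tool, discipline, language, score in scores:
    entry = totals[(tool, discipline, language)]
    entry[0] += score
    entry[1] += 1

# Publish the mean score and number of articles per breakdown.
for (tool, discipline, language), (total, count) in sorted(totals.items()):
    print(f"{tool} | {discipline} | {language}: mean={total / count:.1f} (n={count})")
```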
8/ An AI #summary can easily surpass an #abstract at giving readers the main ideas in an article, and we've long relied on abstracts for that purpose.
Hence, while AI summaries are imperfect, they're already good enough to be useful, and even more useful than abstracts. Improvements will only make them more useful, without putting them any closer to copyright infringement.
10/ #Authors could generate AI summaries of their own paywalled works — in addition to depositing the accepted author manuscripts in OA #repositories.
Scholars could generate summaries of paywalled works that they want to read or share.
Organizations could do it systematically for all paywalled works in a certain category (defined by topic, field, language, date, and so on).
Even publishers could do it for their paywalled works in the same spirit in which they now make abstracts OA. (But I'm not predicting that many will do so; more in post 23 below.)
11/ Current AI summaries might quote sections of copyrighted text verbatim, not as quotations with quotation marks and attribution, but as if they were paraphrases. Or they might track the original language too closely.
It seems that AI tools are getting better at avoiding this problem. At least the companies behind the tools have a strong motivation to do so.
12/ If one tool generates the summary of a specific text, it shouldn't be hard for a second tool to check the summary against the original for quotation disguised as paraphrase. The second tool could even be built into the first so that the summary process iterates until the summary is 'sufficiently far' from the expression of the original.
If this isn't already how AI summarizing takes place, at least creating the second kind of tool should be much easier than creating the first kind.
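To make the second tool concrete, here's a minimal sketch of one way to flag quotation disguised as paraphrase, using verbatim word n-gram overlap. The n-gram size, the threshold, and the summarize() placeholder are illustrative assumptions, not a description of any existing tool.

```python
import re

def word_ngrams(text, n=8):
    """Lowercased word n-grams; long verbatim runs survive this normalization."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def disguised_quotation_score(summary, original, n=8):
    """Fraction of the summary's n-grams that appear verbatim in the original.
    A high score suggests copying rather than genuine paraphrase."""
    summary_grams = word_ngrams(summary, n)
    if not summary_grams:
        return 0.0
    return len(summary_grams & word_ngrams(original, n)) / len(summary_grams)

def summarize_until_clean(original, summarize, threshold=0.05, max_rounds=5):
    """Iterate until the summary is 'sufficiently far' from the original.
    `summarize` stands in for whatever AI tool generates the summary;
    `threshold` is an illustrative cutoff, not an established standard."""
    summary = summarize(original)
    for _ in range(max_rounds):
        if disguised_quotation_score(summary, original) <= threshold:
            break
        summary = summarize(original)  # regenerate and recheck
    return summary
```

A real checker would also need to catch close paraphrase that tracks the original too closely (for example via embedding similarity), which this verbatim check deliberately ignores.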
15/ I should have been clearer in distinguishing the #copyright problem from the #adequacy problem.
My primary interest was in the copyright problem (whether there are any copyright obstacles to the creation or free sharing of AI-generated summaries).
But I may have been hasty on the adequacy problem (whether AI summaries of research articles are good enough to be useful).
16/ Let's distinguish studying research articles from skimming them.
We skim to find out whether research is relevant to our interests and worth a deeper dive. It's a legitimate and important kind of scholarly task. When we find that a work is relevant and decide that it's worthy, then we study it.
19/ AI tools are getting better all the time. So is the adequacy of AI summaries. But I'm not confident that AI summaries will ever be adequate for study, especially for public critique, as opposed to the different levels of skimming.
As their adequacy and usefulness improve, we should remember that we can make them OA without copyright problems.
21/ OA for AI summaries is desirable in at least two cases: (1) when the primary sources are paywalled and (2) when the reader wants to skim.
In the first case, note that deposit of the full-text primary source in an OA #repository (#GreenOA) is better than OA to an AI summary. At least that would support studying, not just skimming. But of course full-text green OA is entirely compatible with OA for AI summaries.
23/ #Publishers may oppose OA for AI summaries for many reasons. For example, they may predict that many readers will not click through from summaries to full texts (OA or paywalled).
That prediction may be true, even for readers who agree that studying requires the full-text primary source.
When I said that publishers could make & post these summaries in the spirit of making abstracts OA, I wasn't predicting that many would do so. But I do believe that some will, and that they will benefit from it.
@kdnyhan
Good question. I'm one of those people who thinks text #mining is already authorized by #FairUse (in the US).
See Mike Carroll, Copyright and the Progress of Science: Why Text and Data Mining Is Lawful (2019).
https://lawreview.law.ucdavis.edu/issues/53/2/articles/files/53-2_Carroll.pdf
But I acknowledge that some people don't accept this view. We might need a more definitive resolution of this question.
You're expecting that a language model, which has no intrinsic understanding of meaning, can summarise an academic article, and that the summary can then be metricated, parsed and verified by the same kind of language model?
Even if this process were statistically reasonable, it could not be trusted: not by authors, co-authors, peer reviewers, or readers, without them verifying the content against the article themselves, because language models are known to be unreliable.
Entirely pointless.
You pretty much did. This verification would not be a regex or simple search; it would have to 'understand' in the sense of asking: "Is this reasonable? Is there meaning left undiscussed or presented unrepresentatively? Does it present findings which are actually the article's findings?"
But we know there is no understanding involved.
If this verification could be trustworthy, it would indeed be part of the process of creating the abstract.
More useful would be a required Explain function.
@petersuber is there good research on this? I’d be very interested in links. I feel like if AI is doing better than abstracts, then doesn’t this mean that abstracts are failing? What is the point of an abstract if it isn’t summarizing the main points of a paper?
But even so, I feel that there is a danger in people reading summaries of papers. This happens with or without AI, but many of the nuances and caveats are lost when readers see only a summary.
@crawfordsm
I haven't seen any research on this idea. If there is some & I missed it, I'll gladly acknowledge it. I posted the idea to start a discussion.
Continued...
@crawfordsm
You're right about the risks of reading summaries. They're like the risks of reading abstracts. That's one reason I want summaries to be labelled as summaries & cite/link to the full works. But if we value abstracts even when we know their limitations, we could/should do the same for summaries.
I'd distinguish skimming from studying. For skimming, abstracts & summaries are good enough, maybe better than full texts. For studying, we should always want the nuances of the full text.
@petersuber Thanks! I had seen plenty of examples of LLMs summarizing for different audiences, but not any research on whether they write better abstracts, so I was hopeful you had seen something.
It is worthwhile to research for more quantified results, but before using any ML tool I always ask 'what value does this add?'
@sennoma
Good question. I don't know the state of the art and haven't systematically tested different tools. But see my experience with #Bard. I gave it a 15-page article of mine (so that I could judge its accuracy) and it created a creditable 200-word summary.
https://fediscience.org/@petersuber/11004086893107987
It could have been better, but I imagine that the state of the art is always improving.
If my article were paywalled and most people only had access to the summary, I'd find that an acceptable second-best.
2/2 So I asked it what it meant by “its own unique perspective.” I got another long answer, including:
“the unique perspective that an AI system adds to text can be very subtle. In other cases, it can be very pronounced. However, the unique perspective is always there, and it is what makes AI-generated text different from human-generated text.”
Huh? Unique perspective? Which is not possible in “human-generated text?”