1/ Here's a thought to advance #OpenAccess to research. If it has problems, I think they're worth solving.

#AI #Copyright #Paraphrases #Summaries

🧵

2/ The non-controversial part: Ideas are not copyrightable. We can restate the ideas in a copyrighted text without infringement. If a paraphrase doesn't use the original expression or track it too closely, then it doesn't infringe. (If it does track the original too closely, it might count as a derivative work.)
3/ The controversial, exciting part: AI-generated summaries are paraphrases in this respect. They capture the ideas without the copyrighted expression. We can make them #OpenAccess without anyone's permission, without infringement, and without fear of liability.

4/ I've always supported knowledge extraction in part because the extracts are paraphrases in this sense. They grab the ideas and leave the expression behind.
https://en.wikipedia.org/wiki/Knowledge_extraction

#Extraction #KnowledgeExtraction


5/ I used to think extraction was the best way to take advantage of the 'paraphrase' method of bypassing copyright barriers. But for purposes of human reading, AI summaries are (will be, can be) much better. Extraction may still be better for pulling #data from articles or creating machine-readable #triples.
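The "machine-readable #triples" mentioned above are subject-predicate-object statements, the standard output format of knowledge extraction. A minimal illustration (the example facts and field names are my own, purely hypothetical):

```python
# Knowledge extraction turns prose into subject-predicate-object triples.
# These example triples are invented for illustration only.
triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "isA", "anti-inflammatory drug"),
]

for subject, predicate, obj in triples:
    print(f"{subject} --{predicate}--> {obj}")
```

Unlike a prose summary, triples like these can be queried, merged across articles, and loaded into databases, which is why extraction may remain the better tool for machine use even if AI summaries win for human reading.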
6/ AI-generated summaries are not themselves copyrightable, at least today, at least in the US. Hence, they cannot carry #OpenLicenses like CC-BY. But that's not a problem if they carry a #PublicDomain mark or dedication instead.

7/ Of course AI summaries are imperfect. But an AI summary of a specific text is much more accurate than an AI summary of the state of knowledge on a specific question.

Also see my idea for a competition to make AI summaries even more accurate.
https://fediscience.org/@petersuber/109932591113483593


An idea for an annual contest in which #AI writing tools compete to do one task useful to scholars. Researchers from different fields pick one of their own recent full-text research articles. Each tool in that year's contest creates a one-page #summary of the article (without human help). The authors score the accuracy of the summaries of their own articles (without knowing which tool created which summary). The public sees the total scores for each tool, broken down by discipline and language.


8/ An AI #summary can easily surpass an #abstract at giving readers the main ideas in an article, and we've long relied on abstracts for that purpose.

Hence, while AI summaries are imperfect, they're already good enough to be useful and even more useful than abstracts. Improvements will only make them more useful, without putting them any closer to copyright infringement.

9/ As an aid to scholarship, a courtesy to publishers, and an obligation to authors, every AI summary should include (1) a statement that it is an AI-generated summary and (2) an accurate citation and working link to the original work.

10/ #Authors could generate AI summaries of their own paywalled works — in addition to depositing the accepted author manuscripts in OA #repositories.

Scholars could generate summaries of paywalled works that they want to read or share.

Organizations could do it systematically for all paywalled works in a certain category (defined by topic, field, language, date, and so on).

Even publishers could do it for their paywalled works in the same spirit in which they now make abstracts OA. (But I'm not predicting that many will do so; more in post 23 below.)

11/ Current AI summaries might quote sections of copyrighted text verbatim, not as quotations with quotation marks and attribution, but as if they were paraphrases. Or they might track the original language too closely.

It seems that AI tools are getting better at avoiding this problem. At least the companies behind the tools have a strong motivation to do so.

12/ If one tool generates the summary of a specific text, it shouldn't be hard for a second tool to check the summary against the original for quotation disguised as paraphrase. The second tool could even be built into the first so that the summary process iterates until the summary is 'sufficiently far' from the expression of the original.

If this isn't already how AI summarizing takes place, at least creating the second kind of tool should be much easier than creating the first kind.
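One simple way the second tool could check for quotation disguised as paraphrase is word n-gram overlap: any long run of words shared verbatim by the summary and the original is a red flag. A minimal sketch of that idea; the function name and the 8-word threshold are my own illustrative choices, not any real tool's API:

```python
# Sketch: flag verbatim word runs shared by a summary and its source text.
# The 8-word window is an arbitrary illustrative threshold.

def shared_ngrams(source, summary, n=8):
    """Return the word n-grams that appear verbatim in both texts."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(source) & ngrams(summary)

source = "the quick brown fox jumps over the lazy dog near the river bank today"
summary = "a fast fox leaps over a sleeping dog by the river"
# A genuine paraphrase shares no long verbatim run with the source:
print(shared_ngrams(source, summary))  # set()
```

A production checker would also want to catch near-verbatim runs (minor word swaps), but even this crude check could drive the iterate-until-'sufficiently far' loop described above.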

13/ Thanks to Kyle Courtney for helpful feedback on this idea.
14/ Update. Thanks for the comments and discussion. I'm extending the thread to add new clarity and a few second thoughts.

15/ I should have been clearer in distinguishing the #copyright problem from the #adequacy problem.

My primary interest was in the copyright problem (whether there are any copyright obstacles to the creation or free sharing of AI-generated summaries).

But I may have been hasty on the adequacy problem (whether AI summaries of research articles are good enough to be useful).

16/ Let's distinguish studying research articles from skimming them.

We skim to find out whether research is relevant to our interests and worth a deeper dive. It's a legitimate and important kind of scholarly task. When we find that a work is relevant and decide that it's worthy, then we study it.

17/ AI summaries are like author abstracts in the sense that they are (can be) good enough for skimming even when they are not good enough for studying. In my experience, some are already better than abstracts for this purpose — partly because some AI summaries are very good and partly because some abstracts are very bad.
18/ To use AI summaries (or abstracts) for studying is a lazy shortcut. To criticize an article based on what you read in an AI summary, without seeing the author's original language or how it might have qualified a key claim or handled your objection, is closer to social-media jousting than scholarship.

19/ AI tools are getting better all the time. So is the adequacy of AI summaries. But I'm not confident that AI summaries will ever be adequate for study, especially for public critique, rather than for different levels of skimming.

As their adequacy and usefulness improve, we should remember that we can make them OA without copyright problems.

20/ We should not confuse the task of summarizing a specific text (available in full for comparison) with the task of summarizing the state of knowledge on a given question or the task of searching the internet for the best works on a given question. Summarizing a specific text is easier than those other tasks.

21/ OA for AI summaries is desirable in at least two cases: (1) when the primary sources are paywalled and (2) when the reader wants to skim.

In the first case, note that deposit of the full-text primary source in an OA #repository (#GreenOA) is better than OA to an AI summary. At least it would support studying, not just skimming. But of course full-text green OA is entirely compatible with OA for AI summaries.

22/ Some people objected that #LLM tools don't "understand" anything. I agree. I didn't say anything to the contrary and hope I didn't assume anything to the contrary. But I do want to say that software can write usefully accurate summaries without semantic understanding. One reason to think so is that some software already does.

23/ #Publishers may oppose OA for AI summaries for many reasons. For example, they may predict that many readers will not click through from summaries to full texts (OA or paywalled).

That prediction may be true, even for readers who agree that studying requires the full-text primary source.

When I said that publishers could make & post these summaries in the spirit of making abstracts OA, I wasn't predicting that many would do so. But I do believe that some will and will benefit from it.

24/ Users could commission short or long summaries — say, 200 words or 2,000 words. They could commission summaries in different languages — say, German or Chinese. They could commission summaries at different levels of intelligibility — say, high-school English or college-level English. In these ways, AI summaries could be much more useful than abstracts.
@petersuber Do you need a license that explicitly allows for data mining to feed the full text of an article into your summarizer?

@kdnyhan
Good question. I'm one of those people who thinks text #mining is already authorized by #FairUse (in the US).

See Mike Carroll, Copyright and the Progress of Science: Why Text and Data Mining Is Lawful (2019).
https://lawreview.law.ucdavis.edu/issues/53/2/articles/files/53-2_Carroll.pdf

But I acknowledge that some people don't accept this view. We might need a more definitive resolution of this question.

@petersuber I am the chief editor for a tech publication. We are getting at least three contributed articles per month that are AI-generated. We use an AI checker, but usually only when we suspect an article was generated. Our tip-offs are false information, repetitive but rephrased paragraphs, contradictory statements of "fact" (one or both being false), and empty phrasing. Summarizing won't help with that, but rejecting would be much easier if authors told us outright that a piece was AI-generated.
@Loucovey
I'm sympathetic to the problems you describe and don't see good solutions. However, the idea I sketched is orthogonal to them. A properly labelled AI summary is not part of the problem (a submission designed to fool editors) and not part of the solution (not a way to detect other AI-generated texts). It's like a greatly improved abstract, designed to help readers find relevant research when they don't have access to the full texts or only need summaries.

@petersuber

You're expecting that a summary of an academic article, produced by a language model with no intrinsic understanding of meaning, can be metricated, parsed, and verified by the same language model?

Even if this process were statistically reasonable, it could not be trusted by authors, co-authors, peer reviewers, or readers without their verifying the content against the article themselves, because language models are known to be unreliable.

Entirely pointless.

@simon_lucy
I didn't say that. I said compare the summary to the original.

@petersuber

You pretty much did. This verification would not be a regex or simple search, it would have to 'understand' in the sense of "is this reasonable, is there meaning which is not discussed and also unrepresentative, does it present findings which are the findings?".

But we know there is no understanding involved.

If this verification could be trustworthy it would indeed be part of the process to create the Abstract.

More useful would be a required Explain function.

@petersuber is there good research on this? I’d be very interested in links. I feel like if AI is doing better than abstracts, then doesn’t this mean that abstracts are failing? What is the point of an abstract if it isn’t summarizing the main points of a paper?

But even so, I feel that there is a danger in people reading summaries of papers. This happens with or without AI, but many of the nuances and caveats end up being lost when only presented with a summary.

@crawfordsm
I haven't seen any research on this idea. If there is some & I missed it, I'll gladly acknowledge it. I posted the idea to start a discussion.

Continued...

@crawfordsm
You're right about the risks of reading summaries. They're like the risks of reading abstracts. That's one reason I want summaries to be labelled as summaries & cite/link to the full works. But if we value abstracts even when we know their limitations, we could/should do the same for summaries.

I'd distinguish skimming from studying. For skimming, abstracts & summaries are good enough, maybe better than full texts. For studying, we should always want the nuances of the full text.

@petersuber Thanks! I had seen plenty of examples of LLMs summarizing for different audiences but no research on whether they write a better abstract, so I was hopeful you had seen something.

It is something worthwhile to research for more quantified results but before using any ML tool, I always ask ‘what value does this add?’

@petersuber "An AI #summary can easily surpass an #abstract at giving readers the main ideas in an article" -- is there an online tool that would let me test this claim for myself? It goes against my intuition, but I'm the furthest thing from an expert on #AI or #LLMs.

@sennoma
Good question. I don't know the state of the art and haven't systematically tested different tools. But see my experience with #Bard. I gave it a 15-page article of mine (so that I could judge its accuracy) and it created a creditable 200-word summary.
https://fediscience.org/@petersuber/11004086893107987

It could have been better, but I imagine that the state of the art is always improving.

If my article were paywalled and most people only had access to the summary, I'd find that an acceptable second-best.

2/2 So I asked it what it meant by "its own unique perspective." I got another long answer, including:
"the unique perspective that an AI system adds to text can be very subtle. In other cases, it can be very pronounced. However, the unique perspective is always there, and it is what makes AI-generated text different from human-generated text."

Huh? Unique perspective? Which is not possible in “human-generated text?”