We are seeing increased commodification of abstracts - perhaps not surprising given their importance in GenAI.

In 2022, SpringerNature had abstracts from 'their' non-open access articles removed from @OpenAlex.

Now Elsevier appears to have done the same.

https://github.com/ourresearch/openalex-guts/commit/b85b3bc77cf9c0f3bd162426a2ba0dacdc951065

Needless to say, neither provides abstracts for these articles to @crossref either: https://i4oa.org/#:~:text=The%20following%20figure.

Is this how open we want research abstracts to be?

#openmetadata #openabstracts
#barcelonadeclaration

do not store closed elsevier abstract · ourresearch/openalex-guts@b85b3bc

@MsPhelps @OpenAlex @crossref Soon coming to a publisher portal near you: empty landing pages!
@hvdsomp @MsPhelps @OpenAlex @crossref No, not empty but full of trackers and data analytics tools.
@RenkeSiems @MsPhelps @OpenAlex @crossref 100% accurate. Landing pages, empty except for trackers and similar monetizing devices. Don’t refrain from citing the landing page by means of its persistent, cite-able DOI though! That’s only #FAIR. The question is to whom.

@MsPhelps @OpenAlex @crossref

It's a bit of a miserable kick in the teeth how the front page of ScienceDirect has been emasculated after all the work was done to convert it from a lumpen mass into a structured scalable and robust vehicle for the crown jewels.

I think the public API still delivers abstracts, and the article page has been restructured with more excerpts in closed articles.

I don't think CrossRef is getting the financial support that it was.

@simon_lucy @OpenAlex @crossref

For Crossref, it's not an issue of financial capacity (at least not directly). It's a decision on the publisher side to deposit abstracts or not. For some, there are technical barriers, but for others, it's a strategic choice.

@MsPhelps @OpenAlex @crossref

I get that. But I notice that CrossRef is still going through issues in delivering services.

The point about Abstracts being harvested for model training would be one thing but ScienceDirect still publish Abstracts which are indexed by search engines. It certainly was the case that recognised search engines, like Google, could index full articles. I'd be very surprised if they couldn't now, because of the impact on page impressions.

@MsPhelps @OpenAlex @crossref

Now I've thought about this a bit more and stirred the memory up, though it's still vague; CrossRef in Elsevier (generally) was thought of as linking and metadata (latterly events) and not content.

Though if Product Management had thought publishing Abstracts to CrossRef would increase Counter metrics they'd doubtless have made it happen.

@simon_lucy @MsPhelps @OpenAlex @crossref Not so sure about that. Both Elsevier and the SN/Holtzbrinck family have profit-generating products built on abstracts (Scopus and Dimensions respectively) and probably don't like all abstracts becoming openly available at scale and machine readable via @OpenAlex, let alone with an open CC0 license. They can't (?) control other publishers sharing abstracts, but they can retain the ones they claim rights over.

@jeroenbosman @MsPhelps @OpenAlex @crossref

Abstracts aren't the driver for Scopus, that's the citation graph, the Abstracts database is one way into it and it's useful for decorating results but it's the graph that matters.

As far as I'm aware Abstracts are still returned from the APIs including the Analytics API, I could be corrected on that :-).

@simon_lucy @jeroenbosman @OpenAlex @crossref

Imo Scopus is about more than the citation graph, and abstracts play an important role in discovery and profiling (also indirectly via Pure) - and E is expanding on that value with eg Scopus AI.

Abstracts (and other metadata) returned via the API are under direct control of E via the terms and conditions, so to me that's not a contradiction.

@simon_lucy @jeroenbosman @OpenAlex @crossref

PS Not denying the importance of the citation graph, by the way, and I'm pretty sure Elsevier would have preferred to keep citations out of the public domain as well, but that other forces (Crossref policy and DORA) forced their hand in the end :)

@MsPhelps @simon_lucy @OpenAlex @crossref Currently looking into Scopus and WoS use cases at our institutions. It would surprise me if discovery using topical search terms (by students and researchers, incl. for systematic reviews) were not the most *frequent* use type of these systems. Of course citation-based discovery, metrics and citation analysis are arguably also important and perhaps the prime reason why institutions hold on to their licenses for these systems, despite their limitations.

@MsPhelps @jeroenbosman @OpenAlex @crossref

I imagine they still spend a lot of effort picking up other publishers' back catalogues as the aim is as complete a corpus as possible.

I can see that 'AI' is now more important for Scopus than it was when I was around.

I don't know if they've added preprints and data set metadata, which were a couple of things I was involved with, along with the attempts to improve disambiguation of authors.

@MsPhelps @OpenAlex @crossref Does this affect the “inverted abstract” that OpenAlex offers? That’s at least a way for papers to be found when doing a keyword search.

@eschares @MsPhelps @OpenAlex @crossref That is a very good question.

And the whole thing is a tale to teach us not to provide all that data to publishers anymore other than in a Diamond Open Access context (or if need be in APC-based Open Access, but with an irrevocable open licence in any case)!

@christof
Exactly! And for that we need science institutions and well established researchers to help push this trend forward, since, as for so many other things, we cannot put that burden on individuals who may not be able to afford doing so for career or financial reasons.
@eschares @MsPhelps @OpenAlex @crossref

@christof @eschares @OpenAlex @crossref

To answer Eric's question: yes, it's about the inverted abstract index in OpenAlex.

That was a clever approach, also used by Microsoft Academic, of presenting abstracts only as separate words plus their positions in the text.
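For readers unfamiliar with the format, here is a minimal Python sketch of how such an inverted index works, assuming the shape OpenAlex uses in its `abstract_inverted_index` field (a mapping from each word to the list of positions where it occurs). The function names and the sample sentence are made up for illustration and are not OpenAlex code.

```python
from typing import Dict, List, Optional


def build_inverted_index(text: str) -> Dict[str, List[int]]:
    """Map each whitespace-separated word to the positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for position, word in enumerate(text.split()):
        index.setdefault(word, []).append(position)
    return index


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> Optional[str]:
    """Reassemble the running text from a word -> positions mapping."""
    if not inverted_index:
        return None  # nothing stored for this work, e.g. a withheld abstract
    word_at: Dict[int, str] = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Sorting the positions restores the original word order.
    return " ".join(word_at[p] for p in sorted(word_at))


# Hypothetical sample, not taken from any real record.
sample_index = build_inverted_index("open abstracts make open science")
print(sample_index)
print(rebuild_abstract(sample_index))
```

The round trip shows why the format still supports keyword search, and also why it is essentially the full abstract in another encoding, which is presumably why publishers treat it the same way as the text itself.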

@MsPhelps @OpenAlex @crossref

So the question (or, one of the questions) is: can an abstract be copyrighted, like the article, or can it not, like metadata? Maybe the third-party creation and dissemination of AI-generated abstracts cannot be prevented by legal constraints?

@anwagnerdreas @MsPhelps @OpenAlex @crossref

perhaps relatedly, they are also starting to paywall parts of the reference list too(!) https://mastodon.social/@rmounce/113202581844074647

@rmounce @anwagnerdreas @OpenAlex @crossref

1) Yes, abstracts straddle the boundary between text and metadata. Crossref considers them copyrightable, and thus exempts them from the non-licensable status of their other metadata.

("Crossref generally provides metadata without restriction; however, some abstracts contained in the metadata may be subject to copyright by publishers or authors" -
https://www.crossref.org/documentation/retrieve-metadata/rest-api/#:~:text=restriction%3B%20however%2C%20some-,abstracts,-contained%20in%20the)

REST API - Crossref

Our publicly available REST API exposes the metadata that members deposit with Crossref when they register their content with us. And it’s not just the bibliographic metadata either: funding data, license information, full-text links, ORCID iDs, abstracts, and Crossmark updates are in members’ metadata too. You can search, facet, filter, or sample metadata from thousands of members, and the results are returned in JSON. Learn more in our REST API documentation.

www.crossref.org
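As a rough illustration of what "depositing an abstract" means on the consumer side: a Crossref work record returned by `https://api.crossref.org/works/{DOI}` carries an `abstract` field (a JATS-tagged string) inside `message` only when the publisher has deposited one. The Python sketch below checks an already-fetched response for that field. The helper names, the DOI and the sample responses are hypothetical, and no network call is made.

```python
import re
from typing import Optional
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def work_url(doi: str) -> str:
    """Build the REST API URL for a single DOI."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def extract_abstract(response: dict) -> Optional[str]:
    """Return the deposited abstract as plain text, or None if absent."""
    raw = response.get("message", {}).get("abstract")
    if not raw:
        return None  # publisher did not deposit an abstract
    # Crude removal of JATS/XML tags; a real pipeline would parse the XML.
    text = re.sub(r"<[^>]+>", " ", raw)
    return " ".join(text.split())


# Hypothetical responses, shaped like the JSON the API returns.
with_abstract = {"message": {"DOI": "10.1234/example",
                             "abstract": "<jats:p>Open abstracts matter.</jats:p>"}}
without_abstract = {"message": {"DOI": "10.1234/example"}}

print(work_url("10.1234/example"))
print(extract_abstract(with_abstract))
print(extract_abstract(without_abstract))
```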

@rmounce @anwagnerdreas @OpenAlex @crossref

2/ OpenAlex, however, distributes the inverted abstract index under CC0 as part of their data. (which can be challenged, and now apparently has been, twice)

@rmounce @anwagnerdreas @OpenAlex @crossref

3/ Regarding AI-generated abstracts, I always wonder whether there are distinct legal aspects to a) using (non-openly licensed) full text to train a model, b) using (non-openly licensed) full text as prompts and c) whether the generated output constitutes derivative use?

@rmounce @anwagnerdreas @OpenAlex @crossref

4/ And that thing with references is Ridiculous (esp. given that E now do provide them to Crossref!)

/end

@MsPhelps @rmounce @OpenAlex @crossref wrt training, that's an obvious problem.

I wonder, though, how the creation of an abstract differs from the extraction of factual information in terms of #IntellectualProperty.
Why would the product of one process (eventually) be a derivative work and the other would not? And is it the creative/copyrightable character of the product that determines whether the process itself is legitimate use, or is it something else? Could the license, for instance, discriminate between these processes and allow one but disallow the other? These are sincere questions...

#FediLaw

@anwagnerdreas @rmounce @OpenAlex @crossref

In my understanding, in copyright terms 'derivative' is about whether the product is itself copyrightable - which is where creating new text would differ from extracting factual information.

1/3

@anwagnerdreas @rmounce @OpenAlex @crossref

Also, I keep coming back to thinking about the difference between a) the process, b) the outcome and c) what is done with the outcome (e.g. whether it's made public, or sold) - e.g. in some cases (not genAI per se or only), the process itself might be legal, but sharing the outcomes might violate the license of the original sources.

(as an aside, I've long wondered about that regarding TDM permissions in general)

2/

@anwagnerdreas @rmounce @OpenAlex @crossref

And regarding the last point: Creative Commons licenses, at least, make no difference between types of usage as long as they are allowed under the license, but there is a current discussion about signalling that through preference signals: https://creativecommons.org/2023/08/31/exploring-preference-signals-for-ai-training/ (h/t @jeroenbosman)

Exploring Preference Signals for AI Training - Creative Commons

One of the motivations for founding Creative Commons (CC) was offering more choices for people who wish to share their works openly. Through engagement with a wide variety of stakeholders, we heard frustrations with the “all or nothing” choices they seemed to face with copyright. Instead they wanted to let the public share and reuse…

Creative Commons
@MsPhelps @anwagnerdreas @rmounce @OpenAlex @crossref In my understanding of copyright, abstracts are outcomes of creative work and thus fall under copyright. Under copyright limitations or fair use you can cite a few lines or paraphrase, but copying and distributing full abstracts would likely constitute a violation, unless of course the copyright holder has given permission (e.g. via an open license). 1/2
@MsPhelps @anwagnerdreas @rmounce @OpenAlex @crossref Of course you can make a summary of any work, which then in itself becomes copyrightable. "The article xx by authors xx researched topic xx answering question xx using xx data and xx method and xx analysis to come to xx results with xx conclusions and xx considerations." That would be your interpretation of what the article is about, but it has to be mostly in your own words/phrasing. I think you can do that at scale and can share it openly.
@MsPhelps @anwagnerdreas @rmounce @OpenAlex @crossref Of course to do that you would need to have legal access to full texts. Finally, it probably would not matter whether you made these summaries manually or using AI, although in the latter case a judge might question whether the result is copyrightable. But maybe the code/prompt is. An automated process would constitute TDM that rightholders can opt out of (using machine readable statements) if it is for non-scientific, commercial purposes.
@MsPhelps @OpenAlex @crossref Unbelievable. (unfortunately not) What a huge disservice to scientometrics