New Preprint Alert!

We're excited to share our latest work on #ChemRxiv! MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures) is a web-based platform for extracting chemical information from scientific papers.

📄 Preprint: https://doi.org/10.26434/chemrxiv-2025-9p1q1

🔗 Try it out: https://marcus.decimer.ai

#Cheminformatics #OpenScience #ChemicalDatabases #AIinScience #ScientificSoftware #ResearchTools

MARCUS: Molecular Annotation and Recognition for Curating Unravelled Structures

The exponential growth of chemical literature necessitates the development of automated tools for extracting and curating molecular information from unstructured scientific publications into open-access chemical databases. Current optical chemical structure recognition (OCSR) and named entity recognition solutions operate in isolation, which limits their scalability for comprehensive literature curation. Here we present MARCUS (Molecular Annotation and Recognition for Curating Unravelled Structures), a tool to aid curators in performing literature curation in the field of natural products. This integrated web-based platform combines automated text annotation, multi-engine OCSR, and direct submission capabilities to the COCONUT database. MARCUS employs a fine-tuned GPT-4 model to extract chemical entities and utilises an ensemble approach integrating DECIMER, MolNexTR, and MolScribe for structure recognition. The platform aims to streamline the data extraction workflow from PDF upload to database submission, significantly reducing curation time. MARCUS bridges the gap between unstructured chemical literature and machine-actionable databases, enabling FAIR data principles and facilitating AI-driven chemical discovery. Through open-source code, accessible models, and comprehensive documentation, the web application enhances accessibility and promotes community-driven development. This approach facilitates unrestricted use and encourages the collaborative advancement of automated chemical literature curation tools. We dedicate MARCUS to Dr Marcus Ennis, the longest-serving curator of the ChEBI database, on the occasion of his 75th birthday.
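The abstract doesn't spell out how the three OCSR engines are combined, but a common way to build such an ensemble is a consensus vote over canonicalised SMILES. Here's a minimal sketch of that idea (the engine outputs are placeholders, not the actual MARCUS, DECIMER, MolNexTR, or MolScribe APIs):

```python
from collections import Counter
from rdkit import Chem

def canonical(smiles: str) -> str | None:
    """Canonicalise a SMILES string so predictions from different engines can be compared."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol else None

def consensus(predictions: list[str]) -> str | None:
    """Majority vote over canonicalised engine outputs (hypothetical combiner)."""
    votes = Counter(c for c in map(canonical, predictions) if c)
    return votes.most_common(1)[0][0] if votes else None

# Placeholder SMILES standing in for the three OCSR engines' outputs
# for the same structure depiction
print(consensus(["CCO", "C(C)O", "OCC"]))  # -> "CCO"
```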


At the request of a journal editor, I reviewed a paper by leading researchers on one of my favorite #chemistry topics - tautomers! This article was featured in the Journal of Chemical Information and Modeling. I am grateful for the #PeerReview certificate presented by the American Chemical Society. It was an honor to be entrusted with this responsibility.

Reminder that I'm #OpenToWork for #cheminformatics or #scientificSoftware development. Let's discuss how my skills can benefit your team.

The 2025_03_1 #RDKit release includes my contribution that speeds up part of generating 2D fingerprints for a molecule by ~75x! So if you generate #chemical fingerprints, now is a good time to upgrade.
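For anyone generating fingerprints today, this is the standard fingerprint-generator usage in RDKit (just an illustration; which internal code path received the ~75x speed-up isn't shown here):

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# Morgan (ECFP-like) generator: radius 2, 2048-bit fingerprint
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp = gen.GetFingerprint(mol)   # ExplicitBitVect
print(fp.GetNumOnBits())
```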

Reminder that I'm #OpenToWork so if you're hiring for #cheminformatics or #scientificSoftware development, let's talk.

#chemistry #DrugDiscovery #pharma #PythonForChemists

https://github.com/rdkit/rdkit/releases/tag/Release_2025_03_1


I'm excited to present "Finding Tautomers" at the first North American #RDKit User Group Meeting in the #Boston #MA area on Friday April 11!
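As a small teaser for the topic (generic RDKit usage, not the talk itself), RDKit can already enumerate and canonicalise tautomers out of the box:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

mol = Chem.MolFromSmiles("CC(=O)CC(=O)C")  # acetylacetone, drawn as the diketo form

enumerator = rdMolStandardize.TautomerEnumerator()
tautomers = list(enumerator.Enumerate(mol))   # keto and enol forms
canonical = enumerator.Canonicalize(mol)      # RDKit's canonical tautomer

print(len(tautomers), Chem.MolToSmiles(canonical))
```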

Reminder that I'm #OpenToWork so if you're in the area and hiring for #cheminformatics or #scientificSoftware development, let me know and we can meet to discuss your needs.

Interested in #MPI and #OpenMP parallel programming to speed up your scientific applications written in #C, #Cpp, #Fortran or #Python (with #numpy)?

Attend our 4-day course in #Mainz at Johannes Gutenberg University (#JGU) from 1 to 4 April 2025!

See our announcement page for further details and to register: https://indico.zdv.uni-mainz.de/event/34/

Note: this is an on-site course.

#RSE #HPC #scientificsoftware

Parallel Programming with MPI and OpenMP (4-Day Workshop)

Dive into the world of high-performance computing with our hands-on workshop, focusing on the programming models MPI and OpenMP. Gain practical experience with Message Passing Interface (MPI) basics and shared memory directives of OpenMP through interactive sessions in C or Fortran. Agenda: A preliminary course outline can be found here. Location: Takes place at the computing centre of the University of Mainz. Detailed travel directions will be provided to accepted participants in advance....
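The interactive sessions are in C or Fortran, but the same MPI concepts carry straight over to #Python via mpi4py. A minimal sketch (illustration only, not course material), run with e.g. `mpirun -n 4 python sum.py`:

```python
# sum.py - split a numpy array across MPI ranks and reduce the partial sums
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

data = np.arange(1_000_000, dtype=np.float64)
chunk = np.array_split(data, size)[rank]   # each rank takes its own slice

local_sum = chunk.sum()
total = comm.reduce(local_sum, op=MPI.SUM, root=0)   # combine partial sums on rank 0

if rank == 0:
    print(f"total = {total}")
```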


The #Energy #Climate & #Environment program at #IIASAVienna had its quarterly meeting last Friday (~100 researchers), so I had to reflect on our role as a community data hub and what to present on behalf of the #ScenarioServices & #ScientificSoftware team.

We developed a new #ScenarioExplorer front-end last year, and we made a lot of progress with our #opensource packages for scenario analysis, validation & data-management.

Step by step towards #OpenScience and reusable, reproducible analysis...

Working with #NUTS administrative EU 🇪🇺 regions is one of the little nuisances in #energysystems modelling and scenario analysis.

So the #IIASA #ScenarioServices team put together a little #opensource #python utility package so that modelers can focus on #freethemodels and don’t have to spend too much time on data-wrangling…
#pysquirrel #ScientificSoftware
https://github.com/iiasa/pysquirrel
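For anyone who hasn't run into NUTS codes: they are hierarchical, a two-letter country code plus one character per level (DE → DE1 → DE11 → DE111). A plain-Python illustration of that structure (not the pysquirrel API; see the repo above for the real thing):

```python
# Illustration of the NUTS hierarchy only - pysquirrel itself is not used here
def nuts_level(code: str) -> int:
    """NUTS level = number of characters after the two-letter country code."""
    return len(code) - 2

def parent(code: str) -> str | None:
    """Parent region, e.g. DE111 -> DE11 -> DE1 -> DE (country level)."""
    return code[:-1] if nuts_level(code) > 0 else None

for code in ["DE", "DE1", "DE11", "DE111"]:
    print(code, "level:", nuts_level(code), "parent:", parent(code))
```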


Here's an ~ official ~ release announcement for #numpydantic

repo: https://github.com/p2p-ld/numpydantic
docs: https://numpydantic.readthedocs.io

Problems: @pydantic is great for modeling data! But at the moment it doesn't support array data out of the box. Often array shape and dtype are as important as whether something is an array at all, but there isn't a good way to specify and validate that with the Python type system. Many data formats and standards couple their implementation very tightly with their schema, making them less flexible, less interoperable, and harder to maintain than they could be. Existing tools for parameterized array types like nptyping and jaxtyping tie their annotations to a specific array library, rather than allowing array specifications that are abstract across implementations.

numpydantic is a super small, few-dep, and well-tested package that provides generic array annotations for pydantic models. Specify an array along with its shape and dtype, then use that model with any array library you'd like! Extending support for new array libraries is just subclassing - no PRs or monkeypatching needed. The type has some magic under the hood that uses pydantic validators to give a uniform array interface to things that don't usually behave like arrays: pass a path to a video file, and that's an array; pass a path to an HDF5 file and a nested array within it, and that's an array. We take advantage of the rest of pydantic's features too, including generating rich JSON schema and smart array dumping.
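A quick sketch of what that looks like in practice (written from memory of the README; the exact shape-spec syntax may differ, so check the docs linked above):

```python
import numpy as np
from pydantic import BaseModel
from numpydantic import NDArray, Shape

class Image(BaseModel):
    # any x-by-y RGB array of uint8, regardless of which array backend produced it
    array: NDArray[Shape["* x, * y, 3 rgb"], np.uint8]

img = Image(array=np.zeros((480, 640, 3), dtype=np.uint8))   # validates shape + dtype
print(img.array.shape)
```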

This is a standalone part of my work with @linkml arrays and rearchitecting neurobio data formats like NWB to be dead simple to use and extend, integrating with the tools you already use and across the experimental process - specify your data in a simple yaml format, and get back high quality data modeling code that is standards-compliant out of the box and can be used with arbitrary backends. One step towards the wild exuberance of FAIR data that is just as comfortable in the scattered scripts of real experimental work as it is in carefully curated archives and high performance computing clusters. Longer term I'm trying to abstract away data store implementations to bring content-addressed p2p data stores right into the python interpreter as simply as if something was born in local memory.

plenty of todos, but hope ya like it.

#linkml #python #NewWork #pydantic #ScientificSoftware


The Chapel team at HPE is looking for scientists to collaborate with.

Are you doing computation for science using #python or similar tools? Interested in trying something different, to run faster or scale further?

Let's make the world a better place together!

See this #blog post for details:
https://chapel-lang.org/blog/posts/python-science-collabs/

Boosts / reposts / etc greatly appreciated.

#ScientificSoftware #OpenSource #OpenScience #science #hpc

Doing science in Python? Wishing for more speed or scalability?

A call for computational science collaborations around Chapel and Python

Time for a re-#introduction!

I'm a #scicomm enthusiast and board member of #Fediscience. My background is in #Biophysics; I did a postdoc in #GeneticEpidemiology, took an industry detour, and have been working in #HPC for some years now.

Interested in #HPC, #bioinformatics, #OpenScience, #workflows (#snakemake), #RDM, #scientificsoftware and #sciencecommunication

My blog can be found here: blogs.fediscience.org and my more political me can be found at @rupdecat.