Noel O'Boyle

Guided by the science.

A contributor to Open Source and commercial #cheminformatics software for many years. Now working in a biotech leading the cheminformatics line as part of a computational chemistry group.

Blog: https://baoilleach.blogspot.com

Hey! We're doing an open source and free culture unconference in Manchester on April 25-26. Affordable, family-friendly, lots of fun maker and podcast-y and coder-ish stuff to discuss. Have you got your ticket yet? There's still time to submit a talk for the main track, as well.

Tell your friends!

https://www.oggcamp.org/

OggCamp 2026

OggCamp is a free software & free culture unconference.

Blogpost that looks at how LLMs have improved at 'reasoning' over time. This is a key capability that enables many scientific workflows.

https://baoilleach.blogspot.com/2026/01/improvement-in-reasoning-performance-of.html

Improvement in reasoning performance of LLMs over time

If you tried using ChatGPT when it first came out and concluded it wasn't much use for a scientific reasoning task, it might be time to try ...

Ensembl is hiring!!

We are on the lookout for a Senior Platform Developer to join our team.

“In this role, you will help shape the Ensembl platform’s technical direction, applying your expertise to build reliable, scalable systems and guide best practices across teams.”

Based in South Cambridgeshire, UK

Please boost or apply!

https://embl.wd103.myworkdayjobs.com/en-US/EMBL/job/Hinxton-Cambridgeshire/Senior-Platform-Developer_JR2824

#jobs #getFediHired #python #devops #science #fediHire

The rise and fall of Stack Overflow is a case in point of the parasitic nature of LLMs. LLMs feed their models on places like Stack Overflow to be useful to users, so users flock to them to avoid the eternal snarky comments and just get an answer to their problem right away. But this is a dead end. No new answers will be generated if no one uses Stack Overflow or similar places.

What goes for Stack Overflow goes essentially for the whole internet. Like a mold growing on food, consuming it, and dying once the food is gone - LLMs will kill large parts of the 'old' internet before long.

The Long Now of the Web: Inside the Internet Archive’s Fight Against Forgetting | HackerNoon

A deep dive into the Internet Archive's custom tech stack.

looking at @dalke's "Superimposed Coding of Count Fingerprints to Binary Fingerprints" https://doi.org/10.26434/chemrxiv-2026-j3hbj

"This paper proposes a novel method based on random superimposed coding to convert count fingerprints to binary fingerprints such that the binary Tanimoto similarity between two binary fingerprints better approximates the multiset Tanimoto similarity between their original count form."

#cheminformatics

Superimposed Coding of Count Fingerprints to Binary Fingerprints

Many cheminformatics workflows use Tanimoto similarity between binary fingerprints. When count fingerprints may be more appropriate, the benefit is often not large enough to justify replacing existing clustering tools, database search components, and other often highly optimized binary methods. This paper proposes a novel method based on random superimposed coding to convert count fingerprints to binary fingerprints such that the binary Tanimoto similarity between two binary fingerprints better approximates the multiset Tanimoto similarity between their original count form. In particular, the k-recall@k score for k-nearest neighbor search is consistently better than hash-based folding or RDKit's count simulation conversion across the four count fingerprint generators in the RDKit, if the number of features is relatively small compared to the binary fingerprint size. For example, with 2048-bit Morgan fingerprints of radius 3, the recall for folded, count simulation, and superimposed coding is approximately 0.80, 0.86, and 0.94, respectively.

ChemRxiv
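To make the two similarity measures in the abstract concrete, here is a minimal Python sketch of binary Tanimoto, multiset (count) Tanimoto, and plain hash-style folding. It does not implement the paper's superimposed coding method. The function names, the integer feature IDs, and the empty-fingerprint convention are assumptions made for illustration only.

```python
from collections import Counter


def binary_tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two sets of 'on' bits."""
    union = len(a | b)
    # Convention assumed for this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def multiset_tanimoto(a: Counter, b: Counter) -> float:
    """Multiset Tanimoto: sum of per-feature minima over sum of maxima."""
    keys = a.keys() | b.keys()
    numerator = sum(min(a[k], b[k]) for k in keys)
    denominator = sum(max(a[k], b[k]) for k in keys)
    return numerator / denominator if denominator else 0.0


def fold_to_bits(counts: Counter, nbits: int) -> set:
    """Simple folding: set one bit per feature ID and discard the counts."""
    return {feature % nbits for feature in counts}


# Hypothetical count fingerprints: feature ID -> number of occurrences.
fp_a = Counter({1: 3, 2: 1, 7: 2})
fp_b = Counter({1: 1, 2: 1, 9: 1})

count_similarity = multiset_tanimoto(fp_a, fp_b)
folded_similarity = binary_tanimoto(fold_to_bits(fp_a, 2048),
                                    fold_to_bits(fp_b, 2048))
print(count_similarity, folded_similarity)
```

In this toy case the folded binary score (0.5) is well above the multiset score (2/7), because folding throws away the counts. That gap is the kind of mismatch a count-to-binary conversion tries to reduce.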

Last chance (closing dates Jan 11) to apply for open positions in my team:
https://www.ebi.ac.uk/about/teams/chemical-biology-services/joining-the-group/

The first is Technical Lead for the team - this is suitable for someone with relevant experience with either a scientific or computing background.

We also have two positions between ourselves and Open Targets as part of a collaboration to develop a resource that captures drug side effect information. This is advertised as NLP Data Scientist/Scientific Data Engineer.

Boosts appreciated!

Joining the group – Chemical Biology Services

ChemRxiv has accepted my #cheminformatics preprint "Superimposed Coding of Count Fingerprints to Binary Fingerprints". It is available at https://chemrxiv.org/engage/chemrxiv/article-details/69442a39e3cb457e13780fbd .

Spent Christmas playing with the OpenAI API for the first time. With careful use of dictionary filters and a less accurate model (gpt-5-nano) as a gatekeeper, I've essentially run all PubMed abstracts through a classification prompt with GPT5.2 for <$20. Skipping the nano model and running directly would still be <$200.

The hardest part is dealing with the batch API, rate limits, etc. There's probably a business in this somewhere, a website that allows biologists to run these analyses over PubMed.
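The gating idea in this post can be sketched as a three-stage cascade: a cheap dictionary filter, then a cheap gatekeeper model, then the expensive classifier. This is only an illustration under assumptions. The function names are hypothetical, the two model stages are plain Python callables standing in for real API calls, and no batching or rate-limit handling is shown.

```python
from typing import Callable, Iterable, Iterator, Tuple


def dictionary_filter(abstract: str, terms: set) -> bool:
    """Stage 1: keep an abstract only if it mentions a dictionary term."""
    text = abstract.lower()
    return any(term in text for term in terms)


def cascade(
    abstracts: Iterable[str],
    terms: set,
    cheap_gate: Callable[[str], bool],
    full_classifier: Callable[[str], str],
) -> Iterator[Tuple[str, str]]:
    """Yield (abstract, label) for abstracts that pass both cheap stages."""
    for abstract in abstracts:
        if not dictionary_filter(abstract, terms):
            continue  # rejected without any model call
        if not cheap_gate(abstract):
            continue  # rejected by the cheaper, less accurate model
        yield abstract, full_classifier(abstract)


# Stubs standing in for the two model calls (assumptions, not real API usage).
def stub_gate(abstract: str) -> bool:
    return "inhibits" in abstract


def stub_classifier(abstract: str) -> str:
    return "drug-target"


abstracts = [
    "Aspirin inhibits COX-1 in platelets.",
    "A survey of medieval farming.",
    "Aspirin is mentioned in this history of pharmacy.",
]
results = list(cascade(abstracts, {"aspirin"}, stub_gate, stub_classifier))
print(results)
```

The ordering carries the cost saving: each stage is cheaper than the next, so the expensive classifier only sees abstracts that survived both earlier checks.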

Ontologies4Chem Workshop

Registration for on-site participation is closed.
However, we are offering the opportunity to participate online.
For details, refer to the agenda.
t1p.de/83bom

#chemistry #Chemie #rdm #researchdata #Forschungsdaten #fairdata #workshop #openscience #ontology