Markus Sitzmann

@markussitzmann
Scientist #cheminformatics, #openscience, #opendata, #opensource, #IT, personal account, views are my own.
#Twitter doesn't load any tweets anymore. Is it happening now? #twitterdown

Papyrus: a large-scale curated dataset aimed at bioactivity predictions | Journal of Cheminformatics

#cheminformatics #chemoinformatics #chemicalStructure #dataset #chemicalDatabase

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00672-x

With the ongoing rapid growth of publicly available ligand–protein bioactivity data, there is a trove of valuable data that can be used to train a plethora of machine-learning algorithms. However, not all data is equal in terms of size and quality and a significant portion of researchers’ time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own. To meet these challenges, we have constructed the Papyrus dataset. Papyrus is comprised of around 60 million data points. This dataset contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with several smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how data can be filtered in a variety of ways and also perform some examples of quantitative structure–activity relationship analyses and proteochemometric modelling. Our ambition is that this pruned data collection constitutes a benchmark set that can be used for constructing predictive models, while also providing an accessible data source for research.

BioMed Central
The new #RDKit blog post revisits, and revises, an old one looking at the impact of fingerprint length (and bit collisions) on machine-learning performance.
https://greglandrum.github.io/rdkit-blog/posts/2022-12-25-colliding-bits-ii-revisited.html
RDKit blog - Colliding bits II, revisited

The impact of bit collisions on machine-learning performance
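
For readers unfamiliar with the term, a bit collision happens when two different structural features are hashed to the same position in a fixed-length fingerprint. The following is a minimal, generic sketch of that effect and not RDKit's actual hashing scheme: the feature names are made-up placeholders, and SHA-256 followed by a modulo is only an assumed stand-in for a real fingerprint hash.

```python
import hashlib


def fold_features(features, n_bits):
    """Hash each feature to an integer and fold it into a bit position (hash mod n_bits)."""
    on_bits = set()
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        on_bits.add(int.from_bytes(digest[:8], "big") % n_bits)
    return on_bits


def count_collisions(features, n_bits):
    """Number of distinct features that did not end up with a bit of their own."""
    distinct = set(features)
    return len(distinct) - len(fold_features(distinct, n_bits))


# Hypothetical feature identifiers standing in for substructure keys.
features = [f"substructure_{i}" for i in range(200)]

for n_bits in (64, 512, 2048, 16384):
    print(n_bits, count_collisions(features, n_bits))
```

With 200 features and only 64 bits, at least 136 features must share a bit. Because each length here divides the next, the collision count cannot increase as the fingerprint gets longer.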

Assessment of chemistry knowledge in large language models that generate code | ChemRxiv https://doi.org/10.26434/chemrxiv-2022-3md3n-v2#.Y5b35r-9_cs.twitter #compchem

In this work, we investigate the question: do code-generating large language models know chemistry? Our results indicate, mostly yes. To evaluate this, we produce a benchmark set of problems, and evaluate these models based on correctness of code by automated testing and evaluation by experts. We find recent LLMs are able to write correct code across a variety of topics in chemistry and their accuracy can be increased by 30 percentage points via prompt engineering strategies, like putting copyright notices at the top of files. These dataset and evaluation tools are open source which can be contributed to or built upon by future researchers, and will serve as a community resource for evaluating the performance of new models as they emerge. We also describe some good practices for employing LLMs in chemistry. The general success of these models demonstrates that their impact on chemistry teaching and research is poised to be enormous.

ChemRxiv

#cheminformatics #chemicalStructure #chemicalDatabase

An algorithm to classify homologous series within compound datasets

Adelene Lai, Jonas Schaub, Christoph Steinbeck & Emma L. Schymanski

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-022-00663-y

An algorithm to classify homologous series within compound datasets - Journal of Cheminformatics

Homologous series are groups of related compounds that share the same core structure attached to a motif that repeats to different degrees. Compounds forming homologous series are of interest in multiple domains, including natural products, environmental chemistry, and drug design. However, many homologous compounds remain unannotated as such in compound datasets, which poses obstacles to understanding chemical diversity and their analytical identification via database matching. To overcome these challenges, an algorithm to detect homologous series within compound datasets was developed and implemented using the RDKit. The algorithm takes a list of molecules as SMILES strings and a monomer (i.e., repeating unit) encoded as SMARTS as its main inputs. In an iterative process, substructure matching of repeating units, molecule fragmentation, and core detection lead to homologous series classification through grouping of identical cores. Three open compound datasets from environmental chemistry (NORMAN Suspect List Exchange, NORMAN-SLE), exposomics (PubChemLite for Exposomics), and natural products (the COlleCtion of Open NatUral producTs, COCONUT) were subject to homologous series classification using the algorithm. Over 2000, 12,000, and 5000 series with CH2 repeating units were classified in the NORMAN-SLE, PubChemLite, and COCONUT respectively. Validation of classified series was performed using published homologous series and structure categories, including a comparison with a similar existing method for categorising PFAS compounds. The OngLai algorithm and its implementation for classifying homologues are openly available at: https://github.com/adelenelai/onglai-classify-homologues .

BioMed Central
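
To illustrate the idea of grouping compounds by what remains once a repeating unit is removed, here is a deliberately simplified sketch. It is not the OngLai algorithm: the published method works on structures (SMILES plus a SMARTS repeating unit, via RDKit substructure matching and fragmentation), whereas this toy only looks at molecular formulas and strips CH2 units arithmetically. All function names are hypothetical, and formula-level grouping cannot tell isomers apart (ethanol and dimethyl ether would land in the same group).

```python
import re
from collections import defaultdict


def parse_formula(formula):
    """Parse a simple molecular formula such as 'C2H6O' into element counts."""
    counts = defaultdict(int)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


def split_off_ch2(formula):
    """Remove as many CH2 units as the formula allows; return (residue, n_units)."""
    counts = parse_formula(formula)
    n = min(counts.get("C", 0), counts.get("H", 0) // 2)
    counts["C"] = counts.get("C", 0) - n
    counts["H"] = counts.get("H", 0) - 2 * n
    residue = "".join(
        f"{el}{num if num > 1 else ''}"
        for el, num in sorted(counts.items())
        if num > 0
    )
    return residue, n


def group_by_residue(formulas):
    """Group formulas that share the same residue, ordered by CH2 count."""
    series = defaultdict(list)
    for formula in formulas:
        residue, n = split_off_ch2(formula)
        series[residue].append((n, formula))
    return {residue: sorted(members) for residue, members in series.items()}


formulas = ["C3H8O", "CH4O", "C2H6O", "C2H4O2", "C3H6O2", "C6H6"]
series = group_by_residue(formulas)
for residue, members in series.items():
    print(residue, members)
```

Here the three alcohols share the formula residue H2O and the two acids share O2. These residues are bookkeeping remainders rather than chemical cores, which is the reason a structure-based method is needed in practice.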
Editorial that other researchers and I wrote for JCIM about #machinelearning in QSAR. We show that ML methods have been used for a long time in chemistry/QSAR.
#compchem
https://pubs.acs.org/doi/10.1021/acs.jcim.2c01422
After one month on Mastodon: almost 100 followers (including a Nobel Prize winner). I am not complaining 🙂
A bit late but anyway 🙂
The number of users by instance in #mastodon is highly unequal. Some servers (mastodon.social) have a significant fraction of the number of users. If the ubiquitous "rich-get-richer" (RGR) phenomenon kicks in, we might end up with a couple of instances having all of the users. But did it kick in? Here is the relationship between instance's size and growth. There is a little bit of RGR, but some of the rapidly-growing instances are small. Decentralization also in how the network grows.