Florian Huber

Professor for data science at HSD Düsseldorf @zdd_duesseldorf
| ML fan & critic | current research mostly #datascience, #machinelearning, #cheminformatics #dataviz #nlp | ✨#openscience #openaccess #rse | living data point 🚲

Delighted that a project which took shape in a #machinelearning
workshop at the @eScienceCenter, initiated and led by #RozaKamioglu
and #DisaSauter, resulted in a data analysis of human #laughter sounds.

--> https://royalsocietypublishing.org/doi/10.1098/rsbl.2024.0543#d1e1021

#OpenAccess #DataScience

I continue to work on #opensource versions of my various teaching materials. Here is the first complete draft of my Python Introduction (for the moment only in German, but an English version is on the to-do list).

--> https://florian-huber.github.io/python-introduction/

#datascience #teaching

New version of my (renamed) textbook: "Hands-on Introduction to Data Science using Python"

Content is now mostly complete. Text and figures will undergo further polishing.

--> https://florian-huber.github.io/data_science_course/book/cover.html

#datascience #opensource #teaching

Hands-on Introduction to Data Science with Python

Nice collaboration led by Niek de Jonge just got published in Journal of Cheminformatics 🚀.

In this work, we implemented and evaluated an extensive cleaning pipeline for MS/MS data.

https://link.springer.com/article/10.1186/s13321-024-00878-1

#Python #matchms #cheminformatics #opensource #openscience

Big thanks to all co-authors for this very nice collaboration!

Reproducible MS/MS library cleaning pipeline in matchms - Journal of Cheminformatics

Mass spectral libraries have proven to be essential for mass spectrum annotation, both for library matching and training new machine learning algorithms. A key step in training machine learning models is the availability of high-quality training data. Public libraries of mass spectrometry data that are open to user submission often suffer from limited metadata curation and harmonization. The resulting variability in data quality makes training of machine learning models challenging. Here we present a library cleaning pipeline designed for cleaning tandem mass spectrometry library data. The pipeline is designed with ease of use, flexibility, and reproducibility as leading principles.

Scientific contribution: This pipeline will result in cleaner public mass spectral libraries that will improve library searching and the quality of machine-learning training datasets in mass spectrometry. This pipeline builds on previous work by adding new functionality for curating and correcting annotated libraries, by validating structure annotations. Due to the high quality of our software, the reproducibility, and improved logging, we think our new pipeline has the potential to become the standard in the field for cleaning tandem mass spectrometry libraries.

SpringerLink

Large SMILES-based Transformer Encoder-Decoder released by @IBM at #icml2024. Trained on 91 million curated SMILES from #pubchem.

--> https://github.com/IBM/materials/tree/main/smi-ted

#cheminformatics

Fantastic talk by @vukosi at #icml2024 on AI in Africa, on African languages, and more generally on AI outside the Western-world bubble. ICML is the perfect spot to highlight the incredible discrepancy between current work on absurdly huge models, with nearly obscene hardware requirements and trained on essentially the entire (English!) internet, and work on languages for which we only have extremely tiny corpora (in communities with minimal access to compute). #NLP #llm

Interesting and entertaining keynote by Soumith Chintala at #icml2024. I agree with his positive attitude towards #opensource and #openscience.

But I don't buy the "embrace capitalism" part.

Let no one say again that nothing is moving in Germany.

#Digitalisierung #deutschlandtempo

New updates of my #datascience introduction course materials. Most additions and changes were made to improve and expand the #NLP chapters, going from basic string handling to TF-IDF followed by n-grams and word vectors.

Rendered version: https://florian-huber.github.io/data_science_course/

GitHub: https://github.com/florian-huber/data_science_course

Here is a small data science analysis of the AfD's European election results in Germany: https://medium.com/@f.huber/europawahl-2024-erkl%C3%A4rungsversuche-mittels-machine-learning-d93ec763509a

I still don't get it ... but that is probably too much to ask of this little bit of machine learning.

Notebook & Data: https://github.com/florian-huber/europawahl_2024_afd_analysis

European election 2024: attempts at an explanation using machine learning

On 9 June, the 2024 European election took place in Germany. Its results were, and still are, the subject of controversial debate. What leaves little room for interpretation, however, is the fact that the far-right AfD…

Medium