Soda - Inria

We are an INRIA research team working on the intersection of machine learning, health, and society.
Website: https://team.inria.fr/soda/
GitHub: https://github.com/soda-inria
Twitter: https://twitter.com/soda_INRIA

🎉 Tool for better documentation! Release of sphinx-gallery, which automatically integrates narrative 🐍 examples into documentation
https://sphinx-gallery.github.io/stable/index.html

Highlight: a light recommender system to show related examples

An illustration of sphinx-gallery:
https://scikit-learn.org/dev/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html
(from @sklearn's gallery). Note the links to function docs.

Sphinx-gallery comes with awesome features such as:
◼ online execution with Binder or JupyterLite
◼ mini-galleries, e.g. to link an object's docstring to its examples
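
For readers who want to try it, here is a minimal sketch of what enabling sphinx-gallery in a Sphinx `conf.py` might look like. The directory names are placeholders, and the `recommender` entry reflects my understanding of the option behind the related-examples feature; check the sphinx-gallery configuration docs before relying on it.

```python
# Sketch of a Sphinx conf.py fragment (paths are hypothetical placeholders)
extensions = [
    "sphinx_gallery.gen_gallery",  # the sphinx-gallery Sphinx extension
]

sphinx_gallery_conf = {
    # Folder holding the example scripts (assumed layout)
    "examples_dirs": "../examples",
    # Folder where the generated gallery pages are written (assumed layout)
    "gallery_dirs": "auto_examples",
    # Assumed option name for the related-examples recommender
    "recommender": {"enable": True},
}
```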

Sphinx-Gallery — Sphinx-Gallery 0.14.0-git documentation

🎓👨‍🦱👩 Post-doc: From missing values to deep learning on sets
https://team.inria.fr/soda/job-offers

with myself and Marine le Morvan
at @Soda_Inria

Come work with us on an exciting topic across statistics and deep learning

Job offers – Soda – Computational and mathematical methods to understand health and society with data

If you want to see the replay of my talk at EuroSciPy 2023 regarding classifier tuning and the misconception behind class imbalance, here you go: https://youtu.be/6YnhoCfArQo

Slides are available at: https://docs.google.com/presentation/d/1IPXbEZpfrynjJMTXjI36rNGuPpOtY6_yJIrQOH6JBpI/edit?usp=sharing

EuroSciPy 2023 - Get the best from your scikit-learn classifier


As a side benefit of this refactoring, the traceback of an exception raised in sequential mode (`n_jobs=1`) is now flatter.

3/4

In the future this will also be extended to `return_as="unordered_generator"`, which will optionally make it possible to aggregate results as soon as they are ready.

This release also includes a new `parallel_config` context manager that extends `parallel_backend`: it can configure all the arguments of the `Parallel` class, not just the backend, using a context manager idiom.

Detailed changelog:
https://github.com/joblib/joblib/blob/master/CHANGES.rst#release-130----20230628

2/4

joblib/CHANGES.rst at master · joblib/joblib

joblib 1.3.0 is out in the wild!

joblib is a library that provides a generic way to call into thread-based, process-based and distributed parallelism (via external backends), plus a way to cache the results of expensive, repeated function calls on disk.

https://joblib.readthedocs.io

This new release provides several major new features, including a `return_as="generator"` argument to the `Parallel` class that makes it possible to aggregate parallel results as they become ready (preserving the submission order).

1/4

Joblib: running Python functions as pipeline jobs — joblib 1.5.3 documentation

The team's annual report is out!
It's our first year and we are still ramping up, but our efforts already reflect our vision:
https://radar.inria.fr/report/2022/soda/index.html

Next year will be even more exciting, as we have much ongoing research in statistical learning, data management, health, and education.

SODA - 2022 - Annual activity report

dirty_cat's TableVectorizer automatically turns complex dataframes into numerical data matrices ready for learning.

Piped with sklearn's HistGradientBoosting, it gives a strong default for learning on tables. Together, they form my go-to learner
https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#example-table-vectorizer
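
A minimal sketch of that pipeline, assuming a dirty_cat version that exposes `TableVectorizer` and a recent scikit-learn. The `build_table_learner` name is made up for illustration, and the regressor variant is an arbitrary choice (a classifier variant also exists).

```python
def build_table_learner():
    """Return an unfitted pipeline: TableVectorizer, then gradient boosting."""
    # Imported inside the function so that defining it does not require
    # dirty_cat or scikit-learn to be installed.
    from dirty_cat import TableVectorizer
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline

    # TableVectorizer encodes the dataframe columns as numbers;
    # the gradient-boosting model then learns on the encoded matrix.
    return make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
```

Usage would look like `build_table_learner().fit(X_train, y_train)`, where `X_train` is a pandas dataframe and `y_train` a numerical target (both names are placeholders).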

Dirty categories: machine learning with non normalized strings

Tabular data can benefit from merging external sources of information.

The FeatureAugmenter is a sklearn transformer to augment a given dataframe by joins on reference tables.
https://dirty-cat.github.io/stable/generated/dirty_cat.FeatureAugmenter.html

fuzzy_join makes it robust to mismatches in vocabulary. Hyperparameter optimization can tune the matching for prediction.

For such external information,
dirty-cat can download embeddings of Wikipedia data on millions of entities: companies, cities, geographic locations...
https://dirty-cat.github.io/stable/auto_examples/07_ken_embeddings_example.html

dirty_cat.FeatureAugmenter

Wrangling categories/entities with typos?

dirty-cat's fuzzy_join function is similar to pandas' merge function, but caters for typos by matching with string similarities across the two tables
https://dirty-cat.github.io/stable/generated/dirty_cat.fuzzy_join.html
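
To illustrate the idea of joining on approximate string matches, here is a toy sketch using only the standard library. It is not dirty_cat's implementation or API: it swaps in `difflib` similarity, and `fuzzy_join_sketch` and the sample rows are made up for illustration.

```python
import difflib


def fuzzy_join_sketch(left_rows, right_rows, key, cutoff=0.6):
    """Left-join two lists of dicts on the closest string match of `key`."""
    right_by_key = {row[key]: row for row in right_rows}
    candidates = list(right_by_key)
    joined = []
    for row in left_rows:
        # Best candidate above the similarity cutoff, if any
        match = difflib.get_close_matches(row[key], candidates, n=1, cutoff=cutoff)
        merged = dict(row)
        if match:
            # Copy over the extra columns of the matched right-hand row
            for col, value in right_by_key[match[0]].items():
                if col != key:
                    merged[col] = value
        joined.append(merged)
    return joined


left = [{"city": "Pariss", "sales": 10}, {"city": "Lndon", "sales": 7}]
right = [
    {"city": "Paris", "country": "France"},
    {"city": "London", "country": "UK"},
]
result = fuzzy_join_sketch(left, right, key="city")
print(result)
```

Rows without a close enough match keep only their original columns, as in a left join.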

The deduplicate function merges multiple morphological variants of the same string, to recover a category from data with typos:
https://dirty-cat.github.io/stable/generated/dirty_cat.deduplicate.html

dirty_cat.fuzzy_join