🧵🗺️ 🦌 💻 A little thread about #MachineLearning education in ecology.

This fall, I am teaching an ML class based on species distributions. Yesterday's activity was a little game where we kept the same data and tweaked the model, trying different combinations of data preparation and classifier to see whether we "liked" the results. It was, essentially, a computer-assisted vibe check.

The data look like this:

The first thing we tried was a PCA followed by a Random Forest because we are so extremely basic.
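In MLJ, that first attempt is a two-step pipeline. A minimal sketch of the idea, using synthetic stand-in data rather than our occurrence data (the model and package names are the standard MLJ ones):

```julia
using MLJ

# Synthetic stand-in for the occurrence data: two interleaved classes
X, y = make_moons(200; noise=0.2)

# Load the component models (PCA from MultivariateStats, RF from DecisionTree)
PCA = @load PCA pkg=MultivariateStats verbosity=0
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree verbosity=0

# Chain them: dimensionality reduction first, then the classifier
pipe = PCA(maxoutdim=2) |> RandomForestClassifier()

mach = machine(pipe, X, y)
fit!(mach; verbosity=0)
yhat = predict_mode(mach, X)  # hard class predictions
```
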

It was very obvious that this was not going to work, so we started thinking a little about how RF generally works. Maybe the PCA is the problem? Let's replace it with a simple standardizer.

Swapping the PCA for a standardizer worked MUCH better! When I asked "So, do we use this model?", one student said, "No -- it works a little too well; this is suspicious".
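The swap is a one-line change in the pipeline (again a sketch on synthetic data; `Standardizer` is MLJ's built-in transformer):

```julia
using MLJ

X, y = make_moons(200; noise=0.2)

RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree verbosity=0

# Same pipeline shape, but standardize the features instead of projecting them
pipe = Standardizer() |> RandomForestClassifier()

mach = machine(pipe, X, y)
fit!(mach; verbosity=0)
yhat = predict_mode(mach, X)
```
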

I love it when the students' instincts lead them to deep insights about models!

So we decided to make a model that worked a little less well. It was time for logistic regression.

Logistic regression was interesting. One of the students said "It looks like a map you would see on Wikipedia", which is true!

It's a big continuous blob of predicted presences. It makes an OK prediction in parts of the range, but misses two important things: there are no observations near the coasts, and the logistic regression misses an inland population to the East.

It gets the "spirit" right, but misses the detail.
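The logistic regression slots into the same pipeline; a sketch using MLJLinearModels (synthetic data again). The smooth linear decision boundary is what produces that continuous "Wikipedia map" blob of presence probabilities:

```julia
using MLJ

X, y = make_moons(200; noise=0.2)

LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

pipe = Standardizer() |> LogisticClassifier()

mach = machine(pipe, X, y)
fit!(mach; verbosity=0)

# Probabilistic predictions: one distribution per site
yhat = predict(mach, X)
phat = pdf.(yhat, levels(y)[1])  # probability of the first class
```
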

So it was time to get serious, and by serious I mean it was time to do a BRT.

The BRT map is "just fine". Inland populations are captured, the coast is predicted as an absence, and it still produces a mostly continuous range. It lies, kind of, at the sweet spot between the logistic regression and the random forest.
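One MLJ-compatible way to fit a boosted-tree model is EvoTrees.jl; this is a sketch of that option, not necessarily the exact implementation we used in class:

```julia
using MLJ

X, y = make_moons(200; noise=0.2)

EvoTreeClassifier = @load EvoTreeClassifier pkg=EvoTrees verbosity=0

# Boosting: many shallow trees fit sequentially, each correcting the last
brt = EvoTreeClassifier(nrounds=100, max_depth=4, eta=0.1)

mach = machine(brt, X, y)
fit!(mach; verbosity=0)
yhat = predict_mode(mach, X)
```
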

And interestingly, when we look at MCC, F1, and all that, it's not the best classifier. It is the best "map", according to biologists, but the RF still beats it by a fair amount.
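Getting those numbers out of MLJ is a one-liner with `evaluate`, which handles the resampling loop; a sketch with cross-validated MCC and F1 on synthetic data:

```julia
using MLJ

X, y = make_moons(200; noise=0.2)

RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree verbosity=0
pipe = Standardizer() |> RandomForestClassifier()

# Cross-validated MCC and F1 score
e = evaluate(pipe, X, y;
    resampling=CV(nfolds=5, shuffle=true, rng=42),
    measures=[matthews_correlation, f1score],
    verbosity=0)

e.measurement  # one aggregated score per measure
```
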

In doing this activity, we learned two things:

1) The validation statistics are important, but we don't do biology on the validation statistics -- we do biology on the prediction, and it makes sense to pick a model that is not "maximally good" when it makes a prediction that looks more realistic.

2) We didn't identify a model that made everyone happy; the last three models each added biologically relevant information, only in different places.

Where to next?

Next week, we will more formally talk about validation with external data (to measure over/underfitting), and see how we can use ensemble models to extract some juice from models that are good but ain't, you know, boss. And after that, each group will get an identical dataset, and try to come up with the best possible classifier.
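MLJ has a built-in wrapper for homogeneous ensembles, which is one way to squeeze some juice out of an OK learner; a sketch (the class may well use a different ensembling strategy):

```julia
using MLJ

X, y = make_moons(200; noise=0.2)

DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# Bag 50 shallow trees: each one is a weak learner, the ensemble averages them
ensemble = EnsembleModel(model=DecisionTreeClassifier(max_depth=3), n=50)

mach = machine(ensemble, X, y)
fit!(mach; verbosity=0)
yhat = predict_mode(mach, X)
```
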

We are going to have fun. All of this work is pure #JuliaLang, notably using MLJ, which is worth checking out.

/🧵

@tpoisot cool stuff!

MLJ is a fantastic library! I’ve been working on an interface to MLJ that turns any supervised model into a conformal model (for predictive uncertainty quantification). It’s early stages, but perhaps you’ll find a chance to play around with it. Any feedback would be much appreciated 🙏

https://github.com/pat-alt/ConformalPrediction.jl
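A sketch of the wrapping idea, based on the package's documented `conformal_model` entry point (early-stage package, so exact keywords may differ):

```julia
using MLJ
using ConformalPrediction

X, y = make_moons(200; noise=0.2)

DecisionTreeClassifier = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# Wrap any supervised MLJ model to get coverage-calibrated prediction sets
conf = conformal_model(DecisionTreeClassifier())

mach = machine(conf, X, y)
fit!(mach; verbosity=0)
yhat = predict(mach, X)
```
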

@patalt WOW yes, I would absolutely play with this! @michael check this out, this seems relevant to just about everything you're up to!
@tpoisot For predicting presences, I have often noticed that RFs tend to overfit, resulting in weird predictions. GAMs seem to work better...
@OMorissette Absolutely. RF will overfit anything you want, especially if there is no tree pruning or there are too many trees. Ensembles of weak learners are a lot more conservative.