So, I'm going to start a thread on trying to better understand #statistics, in case anybody is interested. Boosts and/or clarifications welcome and appreciated!

It's been bugging me for a while that I don't seem to have a good intuitive grasp of statistics, and this despite graduating from an engineering school. I did get courses on probability theory and stuff like Markov chains or EM algorithms, but these were engineering-focused. Case in point: I can't say I "get" confidence intervals. Nor do I understand statistical tests and the p-value outputs that are often presented as "obvious" in other fields. 🧵

#LearningStatistics

One big motivation for this is that people often shove (pseudo-)statistical results under my nose. In some cases it "looks" like they did due diligence but more often than not the significance results look fuzzily fishy — and I can't argue why with confidence because I don't have enough statistical literacy.

A secondary motivation is that many fundamental results, papers, standards and recommendations on #ColourScience are based on statistics around psych and physical tests, whether for good or bad. But these results still elude me for the most part. 🧵

#statistics #LearningStatistics

So, first, maybe means? I know, the things below might be evident; I'm starting from a very low bar, okay?

I hear about means and averages all the time.

One thing that surprised me a few years ago was that the sample mean of a random variable is itself a random variable.

This was not very obvious to me; my naive viewpoint was, I think, colored by the fact that if you're just looking at tests where you decide the sampling (e.g. do a poll on 100 people), well, you can just add 100 more, and the law of large numbers says you should get better results as you add more, right? 🧵

#statistics #LearningStatistics

What made my gears turn a little was: what if, instead of adding more data, you can only take subsets of your samples? For example, you're trying to write a color picker tool for a photo. Different subsets of equal pixel count (the size of the picker tool) come out of a Poisson distribution (plus extra noise after processing)... and their mean will change even if the color is supposed to be uniform.

The sample mean is itself a random variable, which means we can do statistics on it: for example, compute its mean (whose difference from the actual mean is the bias) and its standard deviation (which is called the standard error of the mean). 🧵
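To convince myself, here's a quick stdlib-Python sketch (the true mean of 10 and std dev of 2 are made-up demo numbers): draw many samples, take each one's mean, then look at the distribution of those means:

```python
import random
import statistics

random.seed(0)

# A hypothetical population: a Gaussian with true mean 10 and std dev 2.
TRUE_MEAN, TRUE_SD, N = 10.0, 2.0, 100

# Draw many independent samples of size N and record each sample's mean.
sample_means = [
    statistics.fmean(random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N))
    for _ in range(2000)
]

# The sample mean is itself a random variable: it has its own mean and
# its own standard deviation. That standard deviation is the standard
# error, which theory says should be close to sigma/sqrt(N) = 2/10 = 0.2.
mean_of_means = statistics.fmean(sample_means)
standard_error = statistics.stdev(sample_means)
print(mean_of_means)   # close to 10
print(standard_error)  # close to 0.2
```

The spread of the sample means matches the σ/√N formula, which is where the "standard error" name earns its keep.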

#statistics #LearningStatistics

An intuition I haven't yet verified: when we qualify samples using means and standard deviations, a hidden assumption is often made of a normal (Gaussian) distribution.

This might be what we want (the central limit theorem applies in a lot of cases, and is essentially "throw enough distributions together in a big bowl, mix them up and you end up with a normally distributed smoothie") but this is not always the case.
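Here's a minimal sketch of that smoothie in stdlib Python (the exponential distribution and the sample sizes are arbitrary demo choices): average enough draws from a very skewed distribution and the result starts behaving like a Gaussian:

```python
import random
import statistics

random.seed(1)

# Start from a very non-Gaussian distribution: exponential with mean 1
# (skewed, strictly positive). Average n draws together and the CLT says
# the result should look roughly Gaussian with mean 1 and std dev 1/sqrt(n).
n = 50
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(5000)]

mu = statistics.fmean(means)
sd = statistics.stdev(means)
# Rough normality check: about 68% of the means should fall within one
# standard deviation of the center, as for a Gaussian.
within_1sd = sum(abs(m - mu) < sd for m in means) / len(means)
print(mu, sd, within_1sd)
```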

https://stats.stackexchange.com/questions/493548/when-we-calculate-mean-and-variance-do-we-assume-data-are-normally-distributed has more to say on this, but I'm not fully satisfied because it focuses on the pure theoretical math side of things, not on what people actually interpret it as. 🧵

#statistics #LearningStatistics


Some of the answers in the last link do point out interesting results: the sample mean and variance are optimal estimators for a Gaussian distribution.

https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation adds that the midrange ((min+max)/2) would be optimal for unknown bounded distributions? 🧵
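I tried to sanity-check the midrange claim with a toy simulation (stdlib Python; the uniform distribution is just an arbitrary example of a bounded one): both estimators are unbiased here, but the midrange has much smaller variance:

```python
import random
import statistics

random.seed(2)

# For a uniform distribution on [0, 1] (true mean 0.5), compare two
# estimators of the mean over many repeated samples: the sample mean
# and the midrange (min + max) / 2.
N = 100
sample_mean_vals, midrange_vals = [], []
for _ in range(3000):
    xs = [random.random() for _ in range(N)]
    sample_mean_vals.append(statistics.fmean(xs))
    midrange_vals.append((min(xs) + max(xs)) / 2)

# Both estimators center on 0.5, but for the uniform the midrange has a
# much lower variance than the sample mean: the mean is not universally
# the "best" estimator, it depends on the distribution.
var_mean = statistics.variance(sample_mean_vals)
var_mid = statistics.variance(midrange_vals)
print(var_mean, var_mid)  # midrange variance is noticeably smaller
```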

#statistics #LearningStatistics


The last page is also part of a bunch of wiki pages that are... surely technically correct but difficult to grasp intuitively.

Note the difference between population (1/N) and sample (1/(N-1)) stats. The first has better mean squared error but is biased with respect to the population; the second has worse MSE but is unbiased with respect to the population.

I spent some time trying to grasp that, and came to the conclusion that in practical terms it's not actionable for me yet: I either have large N, or my problem is small but more complex than a mean/var/std and I have no clue how to get an unbiased estimator for that. 🧵
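Still, the 1/N vs 1/(N-1) bias itself is easy to see by simulation (stdlib Python; the true variance of 4 and the tiny sample size are arbitrary demo choices):

```python
import random
import statistics

random.seed(3)

# Draw many small Gaussian samples (true variance 4) and compare the
# 1/N ("population") and 1/(N-1) ("sample") variance formulas.
TRUE_VAR, N = 4.0, 5
pop_vars, samp_vars = [], []
for _ in range(20000):
    xs = [random.gauss(0, 2) for _ in range(N)]
    samp_vars.append(statistics.variance(xs))   # 1/(N-1): unbiased
    pop_vars.append(statistics.pvariance(xs))   # 1/N: biased low

# On average the 1/N version underestimates the true variance by a
# factor (N-1)/N = 0.8 here, while the 1/(N-1) version is on target.
print(statistics.fmean(pop_vars))   # close to 3.2
print(statistics.fmean(samp_vars))  # close to 4.0
```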

#statistics #LearningStatistics

The second big question that has been in my head for a while is: what the heck are confidence intervals?

I used to think it's just standard deviations, but mucked with a fuzzy coefficient. Like, you know, you picked 3 sigma because your predecessor picked 3 sigma or something.

My current understanding is:

- they're built upon an a priori assumption about what the data looks like - for example, that the data comes from a perfect normal distribution (because your predecessor told you it's a good one), and you have a math formula telling you the ideal probabilities

#statistics #LearningStatistics

- it's very easy to get that a priori wrong - notably because real data might not be quite normal, or because it is but there's still a non-zero chance of landing outside the interval anyway (a 95% CI means there's a 5% chance of being outside, and maybe more if your prior is wrong, if your data is correlated, if you don't have enough data and you're far from the asymptotic case, etc.)
- they can also be random variables! CIs have CIs themselves? yes, I know, the shock! just like my epiphany about the mean, it very much sounds like it's random turtles all the way down
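To make the 95% claim concrete, here's a sketch (stdlib Python, known-sigma case, demo numbers made up): repeat the whole experiment many times and count how often the interval actually catches the true mean:

```python
import random
import statistics

random.seed(4)

# What "95% CI" actually promises: across many repetitions of the whole
# experiment, about 95% of the constructed intervals contain the true
# mean. Here sigma is known, so the interval is mean ± 1.96 * sigma/√N.
TRUE_MEAN, SIGMA, N = 10.0, 2.0, 30
half_width = 1.96 * SIGMA / N ** 0.5

hits = 0
trials = 10000
for _ in range(trials):
    m = statistics.fmean(random.gauss(TRUE_MEAN, SIGMA) for _ in range(N))
    if m - half_width <= TRUE_MEAN <= m + half_width:
        hits += 1

coverage = hits / trials
print(coverage)  # close to 0.95
```

Note that each individual interval either contains the true mean or it doesn't; the 95% is a statement about the interval-building procedure, not about any one interval.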

#statistics #LearningStatistics

Shower thought: it's something of a coincidence that means and standard deviations (more generally, moments, for which we have simple and exact formulae from the samples) actually correspond to model parameters for some common distributions.

I think that helps me put in perspective why half of the things I read treat these the same as much more complex problems for which we have to use more convoluted algorithms: optimization routines, maximum likelihood estimates through EM algos, etc.

#statistics #LearningStatistics

A second more involved realization: I wish people writing pages/articles/courses said upfront why statistics textbooks are so full of more complex distributions like Student t and chi-squared, instead of harping for 20 pages about their properties.

I now understand that:
- the mean often follows a Gaussian distribution
- the variance often follows a chi-squared distribution (I think this really needs a good visualization)
- when sigma is known a priori, CIs for the mean of Gaussian samples are built from a Gaussian distribution; when it isn't, from a Student t distribution (the t statistic cancels out both the unknown mean and std. dev.)
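A sketch of that last point (stdlib Python, demo values arbitrary): with small N and unknown sigma, plugging the sample std. dev. into the Gaussian 1.96 recipe gives intervals that are too narrow, which is exactly the gap the Student t critical value fills:

```python
import random
import statistics

random.seed(5)

# With small samples and unknown sigma, naively plugging the *sample*
# std dev into the Gaussian "mean ± 1.96 * s/√N" recipe gives intervals
# that are too narrow: coverage drops below the promised 95%.
TRUE_MEAN, SIGMA, N = 0.0, 1.0, 5
hits = 0
trials = 20000
for _ in range(trials):
    xs = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    m, s = statistics.fmean(xs), statistics.stdev(xs)
    half = 1.96 * s / N ** 0.5
    if m - half <= TRUE_MEAN <= m + half:
        hits += 1

coverage = hits / trials
# Noticeably below 0.95: the Student t critical value for df = 4
# (~2.78, wider than 1.96) is what restores the promised coverage.
print(coverage)
```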

#statistics #LearningStatistics

Buried in the https://en.m.wikipedia.org/wiki/Student's_t-distribution page is this quote, which explains a lot:

"Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the need to use the Student's t distribution. These problems are generally of two kinds: (1) those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored because that is not the point that the author or instructor is then explaining."

#statistics #LearningStatistics


https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/05%3A_Special_Distributions/5.10%3A_The_Student_t_Distribution

"Suppose that Z has the standard normal distribution, V has the chi-squared distribution with n∈(0,∞) degrees of freedom, and that Z and V are independent. Random variable T = Z/√(V/n) has the Student t distribution with n degrees of freedom."

This formula is very reminiscent of the one used to construct CIs of Gaussian samples with known std. dev., just with the sample estimate of sigma instead of an a priori fixed sigma.
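I checked that resemblance numerically (stdlib Python, df = 4 as an arbitrary example): the textbook Z/√(V/n) construction and the "standardize the sample mean with the sample std. dev." statistic land on the same distribution:

```python
import random

random.seed(6)

n = 4              # degrees of freedom
trials = 30000
threshold = 2.776  # two-sided 95% critical value of Student t with 4 df

def t_from_definition():
    # T = Z / sqrt(V/n): Z standard normal, V chi-squared with n df
    # (sum of n squared standard normals), Z and V independent.
    z = random.gauss(0, 1)
    v = sum(random.gauss(0, 1) ** 2 for _ in range(n))
    return z / (v / n) ** 0.5

def t_from_samples():
    # The same statistic as it appears in CI construction: take n+1 = 5
    # Gaussian draws (true mean 0) and standardize the sample mean with
    # the *sample* std dev instead of the true sigma.
    xs = [random.gauss(0, 1) for _ in range(n + 1)]
    m = sum(xs) / len(xs)
    s2 = sum((x - m) ** 2 for x in xs) / n          # 1/(N-1) variance
    return m / (s2 / (n + 1)) ** 0.5

tail_def = sum(abs(t_from_definition()) > threshold for _ in range(trials)) / trials
tail_smp = sum(abs(t_from_samples()) > threshold for _ in range(trials)) / trials
print(tail_def, tail_smp)  # both close to 0.05: same distribution
```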

#statistics #LearningStatistics


So, to recap:

- sample means and standard deviations just happen to be optimal estimators of the parameters of a Gaussian distribution
- Gaussian distributions happen naturally (Central Limit Theorem), especially when mixing several causes to an effect so we can often fall back to them
- to construct a CI one has to build a probability around something independent of the very thing we're trying to estimate (otherwise circular dep!)
- it's easy when sigma is known (literally the CLT), but to build something that depends on neither sigma nor mu we need a bit more elbow grease (Student t)
- when not Gaussian we need moar math

#statistics #LearningStatistics

(and I need to get the famous https://nostarch.com/regression just to see people do double takes when they cross my desk)

The famous "Manga Guide to Regression Analysis" is actually pretty darn good? 😮

It's a beginner book, obviously, but it goes right from "what is a probability distribution" to "we use the normal because it's everywhere" to "hey, chi-squared is for variances" and even "F is for ratios of variances" (hey, a new one in my PokĆ©dex!).

#statistics #LearningStatistics