Most of the Artificial Neural Net simulation research I have seen (say, at venues like NeurIPS) seems to take a *very* simple conceptual approach to the analysis of simulation results - treat everything as independent observations with fixed-effects conditions, when it might be better conceptualised in terms of random effects and repeated measures. Do other people think this? Does anyone have views on whether more complex analyses would be worthwhile, and whether the typical publication venues would accept them? Are there any guides to appropriate analyses for simulation results, e.g. what to do with the results coming from multi-fold cross-validation (I presume the results are not independent across folds because the folds share cases)?

@cogsci #CogSci #CognitiveScience #MathPsych #MathematicalPsychology #NeuralNetworks #MachineLearning

@RossGayler
@cogsci
Yes, you are correct re: the stats. No, nobody seems to care.
@jonny @cogsci
Thanks for the confirmation of the observation. I am asking around elsewhere for an introduction/guide/tutorial on appropriate statistical methods for evaluating computational research studies.
[edited to clarify the topic of my request]
@jonny @cogsci I have just updated that post, clarifying that I am interested in "appropriate statistical methods for evaluating computational research studies" rather than "using simulation studies to evaluate statistical methods".
@RossGayler
@cogsci
Paging @neuralreckoning who works with artificial spiking neural nets

@jonny @cogsci @neuralreckoning
Here is the query I raised in a couple of off-fediverse forums:

I would greatly appreciate pointers to any introduction/guide/tutorial on appropriate statistical methods for evaluating computational research studies.

Context: I am starting to do some work in a research field where the researchers are mostly computer scientists and most of the studies are computational experiments. The statistical analysis of the results generated by those studies ranges from nonexistent to naive - often the results consist only of a table of means, and you might get standard deviations (not standard errors). I would like to do better from a statistical point of view, but (a) it's a very long time since I had to think about analogous issues in non-computational disciplines, and (b) some statistical issues may be specific to computational experiments, e.g. re-using the same random number stream as input, and non-independence between the folds of multi-fold cross-validation. Also, I am concerned that trying to introduce some statistical sophistication will attract negative comments from reviewers - so I want to be able to cite something that points out what the statistical problems are and how to deal with them.

Example: Graph Neural Networks (GNNs) are machine learning models that operate on graphs as input. A typical task is to learn to label graphs (classification of graphs), for example, represent chemical molecular structures as graphs and classify them as mutagenic or not. Researchers develop new GNN algorithms and want to compare their performance to that of other GNN algorithms. There is an archive of graph datasets that is typically used for this comparison. There are many datasets in the archive, but of course the archive is just a convenience sample of all possible graph datasets (if that even makes sense). Each dataset contains some number (not always large) of labelled graphs (cases). The cases are randomly partitioned into train/test sets, the GNN is trained on the train set, and the trained GNN is then evaluated on the test set. The evaluation metric is usually a single-number summary - accuracy (they really love accuracy as the metric). The random train/test partitioning is repeated some small number of times (k-fold validation) to get a distribution of evaluation metrics, and the mean value is reported. This is done independently for each of the GNNs to be compared.
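
For concreteness, the workflow is roughly the following. This is only a minimal sketch with scikit-learn, where an off-the-shelf classifier on synthetic tabular data stands in for a GNN on a graph benchmark; all the names and numbers are illustrative.

```python
# Minimal sketch of the evaluation loop described above (a RandomForestClassifier
# stands in for the GNN, and make_classification stands in for a benchmark dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in "dataset" of labelled cases
X, y = make_classification(n_samples=200, n_features=32, random_state=0)

# Random partitioning into k folds; train on k-1 folds, evaluate accuracy on the held-out fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")

# What typically gets reported: the mean (sometimes with a standard deviation)
print(f"accuracy: {scores.mean():.3f} (sd {scores.std():.3f})")
```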

I have not, so far, seen any discussion of: the convenience-sample nature of the dataset archive, the possible advantages of comparing GNNs on the same train/test partitions, issues around accuracy as a metric, or the statistical dependence between folds in k-fold validation because of shared cases. So I am trying to find resources identifying statistical issues in that kind of research and potential statistical approaches for analysing the results of the experiments.
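
To illustrate the second point: if the competing GNNs were run on identical folds, the per-fold differences could be analysed as paired observations, roughly like this (again a toy sketch; the two scikit-learn classifiers are arbitrary stand-ins for two GNNs and the data are synthetic).

```python
# Toy sketch of a paired comparison on identical folds.
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=32, random_state=0)

# The same fixed splits are used for both models, so per-fold results can be paired
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc_a = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="accuracy")
acc_b = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

# Paired t-test on per-fold differences. Caveat (the dependence issue above): folds
# share training cases, so the differences are not independent and the nominal
# p-value is optimistic; corrected resampled tests exist for this.
diff = acc_a - acc_b
t, p = stats.ttest_rel(acc_a, acc_b)
print(f"mean per-fold difference: {diff.mean():+.3f}, paired t = {t:.2f}, p = {p:.3f}")
```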

@RossGayler
Aha, well yes, it entirely depends on the question at hand and the experimental design. So, e.g., one major distinction is whether you are trying to say something about a model, a family of models, or the data. Parametric statistics is for inference over samples of a definable population, so e.g. a point estimate of accuracy on held-out data is fine if all you're trying to do is make a claim about a single model, since there is no "population" you are sampling from. If you're trying to make a claim about a class of models then you are now sampling from the (usually) real-valued, n-dimensional model space, so there the usual requirements for random sampling within parameter space would apply.
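
To make the class-of-models case concrete, something like this toy sketch: randomly sample configurations (width, learning rate, seed) from the model space and report the spread of performance over the draws rather than a single tuned point estimate. All the ranges and sizes below are made up.

```python
# Toy sketch: treat the model class as a population and sample configurations from it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

accs = []
for _ in range(20):  # 20 random draws from the (toy) model space
    width = int(rng.integers(8, 128))       # hidden layer width
    lr = 10 ** rng.uniform(-4, -2)          # learning rate on a log scale
    seed = int(rng.integers(1_000_000))     # initialisation / shuffling seed
    model = MLPClassifier(hidden_layer_sizes=(width,), learning_rate_init=lr,
                          max_iter=500, random_state=seed)
    accs.append(model.fit(X_tr, y_tr).score(X_te, y_te))

accs = np.array(accs)
print(f"accuracy over sampled models: mean {accs.mean():.3f}, sd {accs.std():.3f}")
```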

Making a claim about the data is much different, because now you have a joint analysis problem of "the effects of my model" and "the effects of the data" (neuroscientists love to treat the SVMs in their "decoding" analyses as neutral and skip that part, making claims about the data by comparing e.g. classification accuracies as if they were only dependent on the data. Even randomly sampling the subspace there doesn't get rid of that problem, because different model architectures, training regimes, etc. have different capacities for classifying different kinds of source data topologies, but I digress.)

For methods questions like this I try to steer clear of domain-specific papers and go to the stats lit or even stats textbooks, because domain-specific papers are translations of translations, and often have, uh, motivated reasoning. For example, the technique "representational similarity analysis" in neuro is wholly unfounded on any kind of mathematical or statistical proof or theory, and yet it flourishes because it sounds sorta OK and allows you to basically "choose your own adventure" to produce whatever result you want.

For k-fold, it's a traditional repeated-measures problem (depending on how you set it up). The benchmarking paradigm re: standard datasets and comparing accuracy is basically fine if the claim you are making is exactly "my model in particular is more accurate on this particular set of benchmarks." You're right that even for that, to get some kind of aggregated accuracy you would want an MLM with dataset as a random effect, but since the difference between datasets is often ill-defined and, as you say, based on convenience, I'm not sure how enlightening it would be.
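
E.g. a toy version of that MLM in statsmodels, with simulated per-fold accuracies for two models across several datasets, dataset as a random intercept, and model as the fixed effect of interest. All the numbers are fabricated, just to show the setup.

```python
# Toy multilevel model: accuracy ~ model, with a random intercept per dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for d in range(8):                          # 8 benchmark "datasets"
    dataset_shift = rng.normal(0, 0.05)     # some datasets are just easier
    for fold in range(10):                  # 10 folds per dataset
        for model, model_shift in [("A", 0.00), ("B", 0.02)]:
            acc = 0.80 + dataset_shift + model_shift + rng.normal(0, 0.02)
            rows.append({"dataset": d, "model": model, "fold": fold, "acc": acc})
df = pd.DataFrame(rows)

# The coefficient on model[T.B] is the aggregated accuracy difference between models,
# with dataset-to-dataset variation absorbed by the random intercept.
fit = smf.mixedlm("acc ~ model", data=df, groups=df["dataset"]).fit()
print(fit.summary())
```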

Would need more information on the specific question you had in mind to recommend lit, and I am not a statistician - I just get annoyed with lazy dogshit and think stats and topology (which is relevant because many neuro problems devolve into estimating metric spaces) are interesting rather than a nuisance.

@jonny
Thanks for the helpful comments.

"One major distinction is whether you are trying to say something about a model, a family of models, or the data."

The typical ML paper appears to be aiming to claim: my (class of) models is (ever so marginally) better than your (class of) models (evaluated on a convenience sample of datasets that is not necessarily relevant to anything anyone would care about in practice - looking at you, MNIST, CIFAR, and friends).

Do you have a citable reference to some discussion of the conceptual issues around "whether you are trying to say something about a model, a family of models, or the data"?

Re your representational similarity analysis digression: yeah - my standard rant is, "What makes you think that the analysis you're doing is answering the question you think you're asking?"

"For methods questions like this I try and steer clear of domain specific papers and go to the stats lit or even stats textbooks"

Yeah I think that's where I have landed. The most helpful references I have received have been to docs by legit statisticians writing about evaluation of predictions in general. I need to find my copy of ESL (https://link.springer.com/book/10.1007/978-0-387-84858-7) and see what they have to say about evaluation.

Re "choose your own adventure": In case you haven't seen it, I like this old paper: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1015&context=marketing_papers

Re k-fold, repeated measures, MLM: Yeah, I'm going to have to refresh my decades-old memory of that stuff, and as you say, the result may be to conclude that it wasn't worth the bother. Unfortunately, you don't know that you've done enough until you have demonstrated that you have done too much (to the great annoyance of my former managers who had views on charging out my time).

@jonny @RossGayler @cogsci I'm very ignorant of statistics, but yeah I agree ML publications are usually pretty poor on this.

@RossGayler If you know that the IID assumption does not hold for your synthetic samples, you should handle them accordingly. Otherwise I don't see any issues here. The following could serve as a primer for model evaluation, as it also discusses alternatives to CV:

https://arxiv.org/abs/1811.12808

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning

The correct use of model evaluation, model selection, and algorithm selection techniques is vital in academic machine learning research as well as in many industrial settings. This article reviews different techniques that can be used for each of these three subtasks and discusses the main advantages and disadvantages of each technique with references to theoretical and empirical studies. Further, recommendations are given to encourage best yet feasible practices in research and applications of machine learning. Common methods such as the holdout method for model evaluation and selection are covered, which are not recommended when working with small datasets. Different flavors of the bootstrap technique are introduced for estimating the uncertainty of performance estimates, as an alternative to confidence intervals via normal approximation if bootstrapping is computationally feasible. Common cross-validation techniques such as leave-one-out cross-validation and k-fold cross-validation are reviewed, the bias-variance trade-off for choosing k is discussed, and practical tips for the optimal choice of k are given based on empirical evidence. Different statistical tests for algorithm comparisons are presented, and strategies for dealing with multiple comparisons such as omnibus tests and multiple-comparison corrections are discussed. Finally, alternative methods for algorithm selection, such as the combined F-test 5x2 cross-validation and nested cross-validation, are recommended for comparing machine learning algorithms when datasets are small.
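
As a minimal example of one of the techniques covered there: a percentile bootstrap confidence interval for test-set accuracy, resampling the per-case correctness indicators (generic sketch on synthetic data, not code from the article).

```python
# Percentile bootstrap CI for held-out accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
correct = (pred == y_te).astype(float)  # 1 if the test case was classified correctly

# Resample the test-set correctness indicators with replacement
rng = np.random.default_rng(0)
boot = np.array([rng.choice(correct, size=correct.size, replace=True).mean()
                 for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"test accuracy {correct.mean():.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```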


@feliks
Thanks - that's exactly the sort of article I was looking for: a bona fide statistician discussing statistical issues in evaluation of ML.

Re the IID assumption: the impetus for my query was coming across a bunch of ML research that was apparently unaware that IID was implicitly assumed by its analyses.