@RossGayler
Aha, well yes, it entirely depends on the question at hand and the experimental design. One major distinction is whether you are trying to say something about a model, a family of models, or the data. Parametric statistics is for inference over samples from a definable population, so e.g. a point estimate of accuracy on held-out data is fine if all you're trying to do is make a claim about a single model, since there is no "population" you are sampling from. If you're trying to make a claim about a class of models, then you are sampling from the (usually) real-valued, n-dimensional model space, so the usual requirements for random sampling within parameter space apply.
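To make the single-model vs. class-of-models distinction concrete, here is a rough sketch. `train_and_score` is a hypothetical stand-in that just simulates an accuracy so the snippet runs; the numbers mean nothing, and a real version would train a model and score it on held-out data.

```python
import random
import statistics

def train_and_score(seed):
    """Hypothetical placeholder for 'train one member of the model class and
    score it on held-out data'. It only SIMULATES an accuracy value."""
    rng = random.Random(seed)
    return min(1.0, max(0.0, rng.gauss(0.80, 0.03)))

# Claim about ONE model: a single held-out point estimate is all there is.
single_model_accuracy = train_and_score(seed=0)

# Claim about a CLASS of models: randomly sample members of the class.
# Only the seed varies here; in practice you would also sample
# hyperparameters, initialisations, and so on.
sampler = random.Random(12345)
seeds = sampler.sample(range(10_000), k=30)
accuracies = [train_and_score(s) for s in seeds]

mean_acc = statistics.mean(accuracies)
se_acc = statistics.stdev(accuracies) / len(accuracies) ** 0.5

print(f"single model: {single_model_accuracy:.3f}")
print(f"model class: mean={mean_acc:.3f}, SE={se_acc:.3f} (n={len(accuracies)})")
```

The standard error only means something in the second case, because only there is there a population of models being sampled.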
Making a claim about the data is much different, because now you have a joint analysis problem of "the effects of my model" and "the effects of the data" (neuroscientists love to treat the SVMs in their "decoding" analyses as neutral and skip that part, making claims about the data by comparing e.g. classification accuracies as if they were only dependent on the data. Even randomly sampling the subspace there doesn't get rid of that problem, because different model architectures, training regimes, etc. have different capacities for classifying different kinds of source data topologies, but I digress).
For methods questions like this I try to steer clear of domain-specific papers and go to the stats lit or even stats textbooks, because domain-specific papers are translations of translations, and often have, uh, motivated reasoning. For example, the technique "representational similarity analysis" in neuro is not founded on any kind of mathematical or statistical proof or theory, and yet it flourishes because it sounds sorta ok and allows you to basically "choose your own adventure" to produce whatever result you want.
For k-fold, it's a traditional repeated-measures problem (depending on how you set it up). The benchmarking paradigm re: standard datasets and comparing accuracy is basically fine if the claim you are making is exactly "my model in particular is more accurate on this particular set of benchmarks." You're right that even for that, to get some kind of aggregated accuracy you would want an MLM (multilevel model) with dataset as a random effect, but since the difference between datasets is often ill-defined and, as you say, based on convenience, I'm not sure how enlightening it would be.
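A minimal sketch of what "repeated measures" means for k-fold, assuming two models scored on the same folds: you analyse the fold-wise differences, not two independent piles of accuracies. The function name and the accuracy values are made up for illustration. Note also that folds share training data, so the differences are not truly independent and a naive paired t statistic like this tends to be optimistic.

```python
import math
import statistics

def paired_fold_summary(acc_a, acc_b):
    """Summarise fold-wise accuracy differences between two models.

    acc_a[i] and acc_b[i] must come from the SAME fold i; that matching is
    what makes this a paired (repeated-measures) comparison.
    """
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need matched accuracies from at least two folds")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean_diff = statistics.mean(diffs)
    se_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))
    # Undefined when every fold shows the identical difference.
    t_stat = mean_diff / se_diff if se_diff > 0 else math.nan
    return {"mean_diff": mean_diff, "se_diff": se_diff,
            "t": t_stat, "df": len(diffs) - 1}

# Hypothetical per-fold accuracies, invented purely for illustration.
model_a = [0.81, 0.79, 0.84, 0.80, 0.82]
model_b = [0.78, 0.77, 0.83, 0.76, 0.80]

summary = paired_fold_summary(model_a, model_b)
print(summary)
```

Aggregating over several datasets would be the next step up (dataset as a random effect in a multilevel model), which this sketch deliberately does not attempt.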
I'd need more information on the specific question you had in mind to recommend lit, and I am not a statistician, I just get annoyed with lazy dogshit and think stats and topology (which is relevant bc many neuro problems devolve into estimating metric spaces) are interesting rather than a nuisance.