
Parametric statistics are not appropriate for mutual information because the values are non-normally distributed.

Source: Cohen, Michael X. (2014). Analyzing Neural Time Series Data. MIT Press. p. 405.

That’s not quite true. Not all parametric statistics depend on normality; indeed the whole class of generalized linear models removes the normality assumption and replaces it with another, appropriate, but still parametric distributional assumption.
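To make that concrete, here is a minimal sketch (Python with statsmodels, and simulated data of my own invention) of a model that is fully parametric yet assumes a Gamma-distributed response rather than a normal one:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Simulated, strictly positive, right-skewed response (think reaction times
# or band power), whose mean depends on a single predictor x.
x = rng.uniform(0.0, 1.0, n)
mu = np.exp(0.5 + 1.2 * x)                # true conditional mean
y = rng.gamma(shape=2.0, scale=mu / 2.0)  # Gamma noise with mean mu

# Fully parametric model, no normality anywhere: Gamma family, log link.
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print(fit.summary())
```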

Parametric does not mean “assumption of normality”, nor does non-parametric mean “assumption free”. Parametric simply means that a given distribution (and hence statistical model) can be described via parameters to a given function or functions (in particular, the probability density function, PDF, and the cumulative distribution function, CDF[1]). For example, the normal distribution is described by the probability density function

$$f\left(x; \mu, \sigma\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{\left(x-\mu\right)^2}{2\sigma^2}}$$

Setting the parameters $\mu$ (population mean) and $\sigma$ (standard deviation) (or equivalently $\sigma^2$, the variance) completely determines the distribution, giving it a nice, easy-to-compute closed form. Nearly every distribution that you can name is ‘parametric’ in this sense: Student’s $t$, Pareto, Gamma, Cauchy, Laplace, Beta, Binomial, etc. So it’s quite possible that there is a parametric statistical model that fits a given problem, even when conditional normality[2] can be completely excluded. In any case, attempts to ‘correct’ a statistical model based on the normality assumption when normality is known not to hold are likely to have other latent issues and tradeoffs.
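As a quick illustration (a sketch in Python/SciPy; the parameter values are arbitrary), each of these families is pinned down entirely by a few numbers plugged into a closed-form density:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 5.0, 50)

# None of these is the normal distribution, but every one is 'parametric':
# a handful of numbers plugged into a closed-form density function.
densities = {
    "normal":  stats.norm.pdf(x, loc=0.0, scale=1.0),    # mu, sigma
    "t":       stats.t.pdf(x, df=4),                     # degrees of freedom
    "gamma":   stats.gamma.pdf(x, a=2.0, scale=1.5),     # shape, scale
    "pareto":  stats.pareto.pdf(x, b=3.0),                # tail index
    "laplace": stats.laplace.pdf(x, loc=0.0, scale=1.0),  # location, scale
}

for name, d in densities.items():
    print(f"{name:>8}: density at x=1 is {d[np.argmin(np.abs(x - 1.0))]:.3f}")
```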

As an alternative to parametric statistics and their distributional assumptions (i.e. that the conditional data distribution can be expressed as a given parametric distribution), non-parametric statistics are often presented as an ‘assumption-free’, if computationally more expensive, alternative. Computer time is cheap nowadays and the loss of power for modern non-parametric methods compared to parametric methods is minimal even when the latter’s assumptions hold, so it seems natural to just use these. (Indeed, Rand Wilcox’s promotion of robust statistics – which go beyond ‘non-parametric’ – is based on this, combined with an observation going back to Tukey that most bell curves are not normal and so the normality assumption may often be invalid.) That’s fine, but these methods also have lots of assumptions, some of them quite strong. For example, the bootstrap assumes that the samples are independent and identically distributed (i.i.d.). In the case of mutual information applied to different EEG channels on the same participant, no two simultaneous measurements from a single person’s scalp can be said to be truly independent (Gauss’ law, spherical conductors, and all that jazz), so this assumption is violated. Violating testing assumptions isn’t the end of the world, but it does mean that you may experience what C programmers dread, namely, undefined behavior.
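For concreteness, here is what the vanilla bootstrap looks like (a sketch with made-up numbers standing in for per-trial mutual-information estimates); the resampling-with-replacement step is exactly where the i.i.d. assumption sneaks in:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up per-trial mutual-information estimates from 30 independent trials.
mi_estimates = rng.gamma(shape=2.0, scale=0.05, size=30)

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resampling with replacement treats the observations as i.i.d.;
    # with dependent observations (e.g. simultaneous channels on one scalp)
    # this step no longer mimics the true sampling distribution.
    resample = rng.choice(mi_estimates, size=mi_estimates.size, replace=True)
    boot_means[b] = resample.mean()

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean MI: [{lo:.3f}, {hi:.3f}]")
```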

Now, I do suspect that a permutation test is probably the best option for comparing (within the significance-testing framework) these information-theoretic measures in EEG work. But that’s not because the data aren’t normally distributed, but rather because we have no idea what the underlying “ground-truth” distribution is!
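Here is a minimal permutation-test sketch (Python; the crude histogram-based MI estimator and the simulated signals are placeholders, not a recommendation of any particular estimator): shuffling one signal destroys the dependence while leaving both marginal distributions untouched, so no parametric form for the MI values is ever assumed.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def binned_mi(a, b, bins=16):
    """Crude histogram-based mutual information between two signals."""
    a_binned = np.digitize(a, np.histogram_bin_edges(a, bins))
    b_binned = np.digitize(b, np.histogram_bin_edges(b, bins))
    return mutual_info_score(a_binned, b_binned)

rng = np.random.default_rng(0)

# Placeholder 'channels': y shares part of its variance with x, so the
# true mutual information is nonzero.
x = rng.standard_normal(1000)
y = 0.6 * x + 0.8 * rng.standard_normal(1000)

observed = binned_mi(x, y)

# Null distribution by permutation: shuffling one signal breaks any
# dependence while keeping each marginal distribution intact.
n_perm = 2000
null = np.array([binned_mi(x, rng.permutation(y)) for _ in range(n_perm)])

p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
print(f"observed MI = {observed:.3f}, permutation p = {p_value:.4f}")
```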

  1. Strictly speaking, for discrete distributions, you have mass functions instead of density functions, but the same holds. 

  2. It is useful to emphasize here that the assumption of normality present in many common statistical models is normality of the conditional response, or equivalently, normality of the residuals (a small simulation of this is sketched below). This is part of why the homoskedasticity assumption is important: heteroskedasticity means the residuals are distributed as a mixture of (normal) distributions and not as a single normal distribution. In practical terms, this messes up your error estimates, which impacts the calculation of $p$-values, etc. In many ways, the field of robust statistics is just a long discussion of how to handle mixture distributions in your error term.
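A small simulation of the point in footnote 2 (Python/SciPy, invented data): the response below is badly non-normal marginally, yet the residuals (i.e. the conditional distribution) are perfectly well behaved.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Bimodal predictor, normal errors: the response y is strongly non-normal
# marginally, but the residuals (the conditional distribution) are normal.
x = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
y = 2.0 * x + rng.normal(0.0, 1.0, 1000)

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

print("Shapiro-Wilk p, y marginally:", stats.shapiro(y).pvalue)          # essentially 0
print("Shapiro-Wilk p, residuals:   ", stats.shapiro(residuals).pvalue)  # typically not small
```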