Someone recently asked an interesting question on Cross Validated about comparing the means and variances of two groups, both of which were of substantive interest to them. Their plan was to use the Shapiro-Wilk test to check the data for normality, then choose between Levene’s test and the *F*-test for the variance comparison, and between the Mann-Whitney test and Welch’s test for the means comparison, depending on the Shapiro-Wilk result.

I gave a somewhat detailed answer and liked it, so I am pasting it here verbatim.

### Location differences

**Mann-Whitney does not test mean differences**

I’ll begin by knocking off the simpler questions. If your primary question about location differences is a difference in means, then you probably do not want a non-parametric test like the Mann-Whitney test. The Mann-Whitney test is a test of stochastic dominance: given two groups A and B, if I were to randomly draw one value from each, which would tend to be greater? If neither group tends to produce the greater value, so that wins and losses cancel out on average, then there is no stochastic dominance; otherwise, one group dominates the other. This test works regardless of non-normality or heteroskedasticity. However, under non-normality, and heteroskedasticity in particular, it is anything but a test of the mean difference: the mean difference can be zero while one group stochastically dominates the other. Given the choice, I personally may be more interested in a test of stochastic dominance, but the test is not commonly explained this way in most applied literature I come across.
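To make the distinction concrete, here is a minimal sketch assuming NumPy and SciPy. The two distributions and the sample size are made up for illustration: both groups have a population mean of zero, yet a random draw from A exceeds a random draw from B well under half the time.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
n = 2000

# Group A: right-skewed, shifted so that its population mean is 0.
a = rng.exponential(scale=1.0, size=n) - 1.0
# Group B: symmetric, population mean 0, much smaller spread.
b = rng.normal(loc=0.0, scale=0.1, size=n)

mean_diff = a.mean() - b.mean()

# U for the first sample counts the (a_i, b_j) pairs with a_i > b_j,
# so U / (n_a * n_b) estimates P(A > B); 0.5 means no dominance.
result = mannwhitneyu(a, b, alternative="two-sided")
prob_a_greater = result.statistic / (len(a) * len(b))

print(f"mean difference:      {mean_diff:.3f}")
print(f"estimated P(A > B):   {prob_a_greater:.3f}")
print(f"Mann-Whitney p-value: {result.pvalue:.3g}")
```

The Mann-Whitney test rejects here even though a test of the mean difference has nothing to find.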

**Non-normality may not be too important**

The next issue is normality in *relation to mean differences*. We will bring up normality again in *relation to variance differences*, as things work somewhat differently there. Unless you expect the data to be extremely non-normal, the sample sizes you have may be large enough to ignore questions of normality. If the data are extremely non-normal or you can hypothesize a theoretical distribution from which they may arise, then maybe it is better to run the model using that distribution, such as Poisson for count data or income. Also, certain transformations might make sense theoretically, such that you can expect to use them even before viewing the data. The point I’m trying to make is that, in relation to mean differences, heteroskedasticity may be more consequential in your situation.

If you have to examine normality, know that all other things held constant, statistical tests improve in their ability to detect differences given greater sample size. Shapiro-Wilk might be able to detect minor deviations from normality at the sample size you have. Additionally, if you do the test at all, you should probably do it on the data after subtracting off the group means. Most importantly though, making future decisions contingent on such preliminary tests can make your eventual decision flawed. I do not know of studies of the sort with normality testing, but there are such studies with heteroskedasticity-testing, see for one:

- Zimmerman, D. W. (2004). A note on preliminary tests of equality of variances. British Journal of Mathematical and Statistical Psychology, 57(1), 173–181. https://doi.org/10.1348/000711004849222
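If you do run Shapiro-Wilk, the point about subtracting off the group means can be sketched as follows, assuming NumPy and SciPy. The data are hypothetical: two normal groups whose means differ, so the pooled data look bimodal even though each group is normal.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
# Hypothetical data: both groups are normal but have different means.
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(5.0, 1.0, 100)])
group = np.repeat([0, 1], 100)

# Subtract each observation's own group mean to obtain residuals.
group_means = np.array([y[group == g].mean() for g in (0, 1)])
residuals = y - group_means[group]

_, p_pooled = shapiro(y)         # pooled data mix two different means
_, p_resid = shapiro(residuals)  # residuals reflect within-group shape
print(f"pooled p = {p_pooled:.3g}, residual p = {p_resid:.3g}")
```

Testing the pooled data confounds the mean difference with non-normality; testing the residuals does not.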

**Dealing with heteroskedasticity**

So if your primary question about location differences is a difference in means, then my recommendation would be to compute the difference in means and use a test that can adjust for possible violations of homoskedasticity. The most developed methods that require very little additional computer time (wild bootstrapping is good but can take eons) are the heteroskedasticity-consistent standard errors from econometrics. I would recommend the HC3, HC4 or HC5 variants. See:

- Hausman, J., & Palmer, C. (2012). Heteroskedasticity-robust inference in finite samples. Economics Letters, 116(2), 232–235. https://doi.org/10.1016/j.econlet.2012.02.007
- Cribari-Neto, F., Souza, T. C., & Vasconcellos, K. L. P. (2007). Inference Under Heteroskedasticity and Leveraged Data. Communications in Statistics - Theory and Methods, 36(10), 1877–1888. https://doi.org/10.1080/03610920601126589

These methods are more recently developed than Welch’s correction and do not require you to know the correct model specification for the variance. So run a regression of the outcome on group: the coefficient is your mean difference, and the robust standard errors correct the p-value for heteroskedasticity. There are methods that allow you to simultaneously model the mean and variance, such as generalized least squares, but normality of the data comes back into play in relation to the test of the variance. I hope the above helps in relation to questions about mean differences.
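The regression with HC3 standard errors can be sketched with NumPy alone. The function name `mean_diff_hc3` and the simulated data are hypothetical; the HC3 formula used is the standard one, in which each squared residual is divided by $(1 - h_i)^2$, with $h_i$ the leverage of observation $i$.

```python
import numpy as np

def mean_diff_hc3(y, group):
    """OLS of y on a 0/1 group dummy, returning the slope and its HC3 SE."""
    y = np.asarray(y, dtype=float)
    group = np.asarray(group, dtype=float)
    X = np.column_stack([np.ones_like(group), group])  # intercept + dummy
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage h_i: diagonal of the hat matrix X (X'X)^{-1} X'.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    # HC3 rescales each squared residual by 1 / (1 - h_i)^2.
    weights = (resid / (1.0 - leverage)) ** 2
    meat = X.T @ (X * weights[:, None])
    cov = xtx_inv @ meat @ xtx_inv  # sandwich covariance estimator
    return beta[1], np.sqrt(cov[1, 1])

rng = np.random.default_rng(2)
# Hypothetical unbalanced, heteroskedastic data; true difference is 0.5.
y = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(0.5, 3.0, 50)])
group = np.repeat([0, 1], [150, 50])

diff, se = mean_diff_hc3(y, group)
print(f"mean difference = {diff:.3f}, HC3 SE = {se:.3f}, z = {diff / se:.2f}")
```

In practice you would likely use a package instead; statsmodels, for example, accepts `cov_type="HC3"` when fitting an OLS model. With a single group dummy, the HC3 variance of the slope reduces to $s_0^2/(n_0 - 1) + s_1^2/(n_1 - 1)$, which is a handy check on the code.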

### Variance differences

**Brief simulation I conducted**

I next turn to the other primary question, about variance differences. I ran some simulations of this a few weeks ago. I assumed that the mean and variance of the outcome were a function of the groups alone, a simplifying assumption that would be met in, say, a randomized trial. I varied:

- the distribution of the data: normal or skewed ($\chi^2_8$ centered and scaled to meet the mean and variance requirements, giving a skew of $\approx 1$ asymptotically). The choice of $\chi^2$ is not ideal for generating skewed data, especially under an unbalanced design, but I think it suffices.
- balanced versus unbalanced design (1:3, so not as extreme as your situation).

And the maximum sample size I considered was 200 persons in both groups.
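For illustration, data of this kind could be generated as below. This is a sketch, not the original simulation script: the function name `draw_group`, the means and standard deviations, and the reading of 200 as the total across both groups are all assumptions.

```python
import numpy as np

def draw_group(rng, n, mean, sd, skewed):
    """Draw n values with the requested mean and SD, normal or skewed."""
    if skewed:
        # chi-square(8) has mean 8 and variance 16, so (x - 8) / 4 is
        # standardized; its population skewness is sqrt(8 / 8) = 1.
        z = (rng.chisquare(df=8, size=n) - 8.0) / 4.0
    else:
        z = rng.standard_normal(n)
    return mean + sd * z

rng = np.random.default_rng(3)
# Unbalanced 1:3 design, 200 observations in total (assumed values).
small = draw_group(rng, 50, mean=0.0, sd=1.0, skewed=True)
large = draw_group(rng, 150, mean=0.0, sd=2.0, skewed=True)
print(f"sample SDs: {small.std(ddof=1):.2f}, {large.std(ddof=1):.2f}")
```

Setting both `sd` values equal gives the null condition for checking error rates; unequal values give the condition for checking power.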

I tested the ability of the methods I considered to maintain the nominal error rate and their statistical power. To knock off power questions now, at sample sizes below OP’s, most methods displayed similar statistical power with regard to detecting variance differences. But not all were able to maintain the nominal error rate. So when I use *performed relatively well* below, I mean maintained the nominal error rate.

**Levene test with median and OLS on squared residuals may be good choices**

The most standard way is the $F$-test. But unless your data are normally distributed, this test behaves very badly, so one can take it off the table. The next standard is Levene’s test. If you are concerned about normality, you can robustify it by conducting Levene’s test using the median in place of the mean in the formula for the test. In the simulations I conducted, this approach seemed to perform relatively well across a variety of situations, and it should be available in major statistical packages. The finding in my own simulations is backed up by the recommendations from the NIST engineering statistics handbook: https://itl.nist.gov/div898/handbook/eda/section3/eda35a.htm.
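As one example of package support, SciPy's `levene` function takes a `center` argument. The sketch below uses made-up skewed data with equal means and standard deviations of 1 and 2.

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(4)
# Hypothetical skewed groups: equal means, SDs of 1 and 2.
a = (rng.chisquare(df=8, size=150) - 8.0) / 4.0
b = 2.0 * (rng.chisquare(df=8, size=50) - 8.0) / 4.0

# center="median" gives the median-centered variant of the test;
# center="mean" gives the original mean-centered version.
stat_median, p_median = levene(a, b, center="median")
stat_mean, p_mean = levene(a, b, center="mean")

print(f"median-centered: W = {stat_median:.2f}, p = {p_median:.3g}")
print(f"mean-centered:   W = {stat_mean:.2f}, p = {p_mean:.3g}")
```

The median-centered statistic is just a one-way ANOVA $F$ computed on the absolute deviations from each group's median.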

However, I also found another option: take the residuals from the first regression you conducted to obtain the mean difference, square them, then regress the squared residuals on the group variable. I found this approach to testing the variance differences to perform well across all conditions in my simulation.
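A sketch of the squared-residual regression, assuming NumPy and SciPy, is below. The function name `variance_diff_test` and the data are hypothetical. Which standard errors to use for this second regression is not something stated above; the sketch uses the classical OLS test, which for a single dummy regressor coincides with the pooled-variance two-sample *t*-test on the squared residuals.

```python
import numpy as np
from scipy.stats import ttest_ind

def variance_diff_test(y, group):
    """Regress squared residuals from the mean model on a 0/1 group dummy."""
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    in_g1 = group == 1
    # Step 1: residuals from regressing y on group are simply the
    # deviations from each group's own mean.
    resid = np.where(in_g1, y - y[in_g1].mean(), y - y[~in_g1].mean())
    sq = resid ** 2
    # Step 2: with one dummy regressor, the OLS slope is the difference
    # between the group means of the squared residuals.
    slope = sq[in_g1].mean() - sq[~in_g1].mean()
    # Classical OLS t-test on that slope (assumed choice of inference).
    test = ttest_ind(sq[in_g1], sq[~in_g1], equal_var=True)
    return slope, test.pvalue

rng = np.random.default_rng(5)
# Hypothetical data: equal means, variances of 1 and 4.
y = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(0.0, 2.0, 50)])
group = np.repeat([0, 1], [150, 50])

slope, p_value = variance_diff_test(y, group)
print(f"estimated variance difference = {slope:.3f}, p = {p_value:.3g}")
```

The slope estimates the variance difference directly, so the result is on an interpretable scale, unlike the Levene statistic.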

### In summary

So to recap, if I were in your situation and wanted to make informed decisions a priori, assuming non-normality is not extreme, I would:

- conduct the standard regression model regressing the data on group membership to obtain the mean difference, and use heteroskedasticity-consistent standard errors for inference
- conduct Levene’s test using the median as center rather than the mean, and possibly also use the regression of squared residuals approach.

Additional methods I tested were Levene’s test using the Hodges-Lehmann median (a nice robust estimator which has a relation to the aforementioned Mann-Whitney) as the center; generalized least squares; and three methods from the structural equation modeling literature: *diagonally-weighted least squares*, mean and variance adjusted OLS, and maximum likelihood with a sandwich estimator, commonly referred to as *MLM*. The methods I focused on in the bulk of the text won out.

I’d be happy to share the simulation scripts on request; there are about six or seven files.
