Discovering Statistics: The Blog: September 2012

My last blog was about the assumption of normality, and this one continues the theme by looking at homogeneity of variance (or homoscedasticity, to give it its even more tongue-twisting name). Just to remind you, I’m writing about assumptions because this paper showed (sort of) that recent postgraduate researchers don’t seem to check them. Also, as I mentioned before, I get asked about assumptions a lot. Before I get hauled up before a court for self-plagiarism, I will be up front and say that this is an edited extract from the new edition of my Discovering Statistics book. If making edited extracts of my book available for free makes me a bad and nefarious person then so be it.

Assumptions: A reminder

Now, I’m even going to self-plagiarize my last blog to remind you that most of the models we fit to data sets are based on the general linear model (GLM). This fact means that any assumption that applies to the GLM (i.e., regression) applies to virtually everything else. You don’t really need to memorize a list of different assumptions for different tests: if it’s a GLM (e.g., ANOVA, regression etc.) then you need to think about the assumptions of regression. The most important ones are:

Linearity
Normality (of residuals)
Homoscedasticity (aka homogeneity of variance)
Independence of errors.

What Does Homoscedasticity Affect?

Like normality, if you’re thinking about homoscedasticity, then you need to think about three things:

Parameter estimates: That could be an estimate of the mean, or a b in regression (and a b in regression can represent differences between means). If we assume equality of variance then the estimates we get using the method of least squares will be optimal.
Confidence intervals: whenever you have a parameter, you usually want to compute a confidence interval (CI) because it’ll give you some idea of what the population value of the parameter is.
Significance tests: we often test parameters against a null value (usually we’re testing whether b is different from 0). For this process to work, we assume that the parameter estimates have a normal distribution.

When Does The Assumption Matter?

With reference to the three things above, let’s look at the effect of heterogeneity of variance/heteroscedasticity:

Parameter estimates: If variances for the outcome variable differ along the predictor variable then the estimates of the parameters within the model will not be optimal. The method of least squares (known as ordinary least squares, OLS), which we normally use, will produce ‘unbiased’ estimates of parameters even when homogeneity of variance can't be assumed, but better estimates can be achieved using different methods, for example, by using weighted least squares (WLS) in which each case is weighted by a function of its variance (typically the reciprocal). Therefore, if all you care about is estimating the parameters of the model in your sample then you don’t need to worry about homogeneity of variance in most cases: the method of least squares will produce unbiased estimates (Hayes & Cai, 2007). However, if you want even better estimates, then use weighted least squares regression to estimate the parameters.
Confidence intervals: unequal variances/heteroscedasticity creates a bias and inconsistency in the estimate of the standard error associated with the parameter estimates in your model (Hayes & Cai, 2007). As such, your confidence intervals and significance tests for the parameter estimates will be biased, because they are computed using the standard error. Confidence intervals can be ‘extremely inaccurate’ when homogeneity of variance/homoscedasticity cannot be assumed (Wilcox, 2010).
Significance tests: same as above.
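
If you'd rather see these points than take my word for them, here is a minimal simulation sketch in plain Python. Everything in it is an assumption made up for illustration: the 'true' model (intercept 2, slope 0.5), the way the error standard deviation grows with the predictor, the sample size and the number of simulated samples. It also cheats by giving WLS the true variances as weights, which you would rarely know in practice. The idea is simply to compare the OLS and WLS slopes across many samples, and to compare how much the OLS slope really varies with what the usual (constant-variance) standard error claims.

```python
import random
import statistics


def fit_line(x, y, w=None):
    """Least-squares (intercept, slope); w holds optional case weights (WLS)."""
    if w is None:
        w = [1.0] * len(x)  # equal weights reduce WLS to ordinary least squares
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def naive_slope_se(x, y):
    """Textbook OLS standard error of the slope (assumes constant variance)."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return (rss / (n - 2) / sxx) ** 0.5


def simulate(n_sims=2000, n=50, true_slope=0.5, seed=1):
    rng = random.Random(seed)
    x = [1 + 9 * i / (n - 1) for i in range(n)]   # predictor runs from 1 to 10
    sd = [0.1 * xi ** 2 for xi in x]              # assumed: error SD grows with x
    weights = [1 / s ** 2 for s in sd]            # WLS weights = 1 / true variance
    ols, wls, ses = [], [], []
    for _ in range(n_sims):
        y = [2 + true_slope * xi + rng.gauss(0, s) for xi, s in zip(x, sd)]
        ols.append(fit_line(x, y)[1])
        wls.append(fit_line(x, y, weights)[1])
        ses.append(naive_slope_se(x, y))
    return {
        "ols_mean": statistics.mean(ols),        # close to true_slope if unbiased
        "wls_mean": statistics.mean(wls),
        "ols_sd": statistics.stdev(ols),         # how much the OLS slope really varies
        "wls_sd": statistics.stdev(wls),
        "naive_se_mean": statistics.mean(ses),   # what the usual formula reports
    }


results = simulate()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Under these made-up settings, both slopes should average out close to 0.5 (so neither is biased), the WLS slopes should be less variable than the OLS ones, and the average textbook standard error should come out smaller than the actual variability of the OLS slope, which is why the confidence intervals and significance tests built on it can't be trusted.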

Summary

If all you want to do is estimate the parameters of your model then homoscedasticity doesn’t really matter: if you have heteroscedasticity then using weighted least squares to estimate the parameters will give you better estimates, but the estimates from ordinary least squares will be ‘unbiased’ (although not as good as WLS).
If you’re interested in confidence intervals around the parameter estimates (bs), or significance tests of the parameter estimates, then homoscedasticity does matter. However, many tests have variants to cope with these situations; for example, the unequal-variances (Welch) version of the t-test, the Brown-Forsythe and Welch adjustments in ANOVA, and numerous robust variants described by Wilcox (2010) and explained, for R, in my book (Field, Miles, & Field, 2012).
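
To give a flavour of what one of these variants does, here is a rough sketch of the unequal-variances (Welch) t-test next to the ordinary pooled-variance t-test, again in plain Python. The two groups are invented numbers (a small group with a big spread and a larger group with a small spread), and the function names are mine rather than anything from a package. The Welch version estimates the standard error from each group's own variance and adjusts the degrees of freedom (the Welch-Satterthwaite approximation) instead of pooling the variances.

```python
import math
import statistics


def pooled_t(a, b):
    """Ordinary independent t-test statistic; assumes equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2


def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se_sq = va / na + vb / nb  # each group contributes its own variance
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_sq)
    df = se_sq ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical data: small, noisy group versus larger, tightly clustered group
group_a = [12, 25, 7, 30, 16]
group_b = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 12, 10]

t_p, df_p = pooled_t(group_a, group_b)
t_w, df_w = welch_t(group_a, group_b)
print(f"pooled: t = {t_p:.2f}, df = {df_p}")
print(f"Welch:  t = {t_w:.2f}, df = {df_w:.2f}")
```

With these invented scores the pooled test produces a noticeably bigger t on many more degrees of freedom than the Welch test does, because pooling lets the large, tidy group mask the variability in the small, noisy one. Each statistic would then be compared against a t-distribution with its own degrees of freedom.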

Declaration

 This blog is based on excerpts from the forthcoming 4th edition of ‘Discovering Statistics Using SPSS: and sex and drugs and rock ‘n’ roll’.

References

Field, A. P., Miles, J. N. V., & Field, Z. C. (2012). Discovering statistics using R: And sex and drugs and rock 'n' roll. London: Sage.
Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation. Behavior Research Methods, 39(4), 709-722.
Wilcox, R. R. (2010). Fundamentals of modern statistical methods: Substantially improving power and accuracy. New York: Springer.

Thursday, September 13, 2012

Assumptions Part 2: Homogeneity of Variance/Homoscedasticity