Continuing on the question of whether it is a good idea to dichotomize continuous variables prior to analysis for substantive reasons, I have settled on the side of it being a bad idea. The major reason is potential heteroskedasticity of the error term in the linear regression model for the original continuous variable.
This is an interesting issue, but not one I want to devote much time to writing about, so I decided to write a brief methods note. The goal is for the document to be simple, easy to read, and to at least prompt readers who routinely dichotomize to rethink the practice. None of it is new, however.
- The R Markdown source document (with references) is here.
- I used papaja to compile the document.
N.B.: I don’t mention it in the methodological note, but the reason methods like generalized additive models and kernel-regularized least squares with logistic loss might work is that heteroskedasticity in the continuous variable manifests, after dichotomization, as a mean function that differs from the inverse logit of a linear predictor. So methods that relax the linearity assumption of regression (linear or logistic) can do a good job of estimating the true relations in the data.
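To make the point in the N.B. concrete, here is a minimal sketch in Python (not code from the methods note). It assumes a latent outcome `y = b0 + b1*x + e` with logistic errors whose scale is `exp(gamma * x)`, dichotomized at a cutpoint `cut`. The logistic error distribution, the log-linear scale function, and all parameter values are illustrative assumptions of mine. Under them, the logit of `P(y > cut | x)` is `(b0 + b1*x - cut) / exp(gamma * x)`, which is linear in `x` only when `gamma = 0`.

```python
import numpy as np


def true_logit(x, b0=0.0, b1=1.0, cut=0.5, gamma=0.0):
    """Logit of P(y > cut | x) for y = b0 + b1*x + e,
    where e ~ Logistic(0, scale=exp(gamma * x)).

    gamma = 0 gives homoskedastic errors; gamma != 0 makes the
    error scale depend on x (heteroskedasticity).
    All parameter names and defaults are hypothetical.
    """
    scale = np.exp(gamma * x)
    return (b0 + b1 * x - cut) / scale


# Evenly spaced grid, so second differences measure curvature.
x = np.linspace(-2.0, 2.0, 9)

logit_homo = true_logit(x, gamma=0.0)    # constant error scale
logit_hetero = true_logit(x, gamma=0.5)  # error scale grows with x

# A function that is linear in x has zero second differences.
curv_homo = np.max(np.abs(np.diff(logit_homo, n=2)))
curv_hetero = np.max(np.abs(np.diff(logit_hetero, n=2)))

print(f"max |second difference|, homoskedastic:   {curv_homo:.3g}")
print(f"max |second difference|, heteroskedastic: {curv_hetero:.3g}")
```

With a constant error scale the true logit is a straight line in `x`, so an ordinary logistic regression is correctly specified. Once the scale varies with `x`, the true logit is curved, and a model with a flexible mean function has a chance of recovering it where a linear-in-`x` logistic regression does not.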