This is my first first-author journal article. We started writing it in summer 2018 and first submitted it by November 2018, so my thinking has changed somewhat since then. There is a non-paywalled version below, but the version of record is available at the Journal of Experimental Education, http://www.tandfonline.com/10.1080/00220973.2019.1693328. The best part of the review process was communicating with the editor, Professor Brian French, who was very kind.
Here are the PDF and the GitHub repository, which contains the simulation and figure reproduction code in R (with very few comments).
Something I dislike in the paper is the hypothetical two-by-two contingency table repeated from Peng, Lee and Ingersoll (PLI, 2002, a well-cited review of logistic regression) [1]. Their hypothetical example involved inner-city school children being recommended for remedial reading. However, given that the example is invented, why invent inner-city school children in need of saving?
Perhaps PLI were analyzing a related real-life dataset when they wrote this. But what excuse do I have for choosing this example [2]? None. I make these comments so that we methodologists can do better with our choice of examples, real or hypothetical. Even when the example datasets we use are ancillary to our work, we can be more thoughtful about them.
The summary of the paper is:
- Odds ratios are non-collapsible, i.e., the odds ratio for a predictor changes depending on which other covariates are in the model, even when there is no confounding.
- You can use linear regression to calculate a risk difference, and Poisson regression to calculate a risk ratio instead. These are collapsible.
- However, OLS and Poisson regression will return downwardly biased estimates in commonplace data analysis scenarios.
- You can modify estimation in simple ways that anyone can implement to obtain much less biased estimates. We applied Horrace and Oaxaca’s sequential approach [3] to both linear and Poisson regression.
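To make the first point concrete, here is a small arithmetic sketch in Python (not the R used in the repository). The risks are invented for illustration and are not from the paper: a covariate Z is split 50/50 and is independent of the exposure X, so it cannot confound the X effect. The odds ratio is 4 within each level of Z but about 3.45 once Z is ignored, while the risk difference is 0.30 either way.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def effect_measures(risk_unexposed, risk_exposed):
    """Return (odds ratio, risk difference) comparing two risks."""
    return (
        odds(risk_exposed) / odds(risk_unexposed),
        risk_exposed - risk_unexposed,
    )


# Hypothetical outcome risks by exposure X within each level of covariate Z.
# Z is 50/50 in both exposure groups, so Z is not a confounder.
risks_by_z = {0: (0.2, 0.5), 1: (0.5, 0.8)}  # z: (risk if X=0, risk if X=1)
weight_z = {0: 0.5, 1: 0.5}

# Conditional (within-stratum) effect measures.
conditional = {z: effect_measures(*risks) for z, risks in risks_by_z.items()}

# Marginal risks average over Z with the same weights in both exposure groups.
marg_unexposed = sum(weight_z[z] * risks_by_z[z][0] for z in risks_by_z)
marg_exposed = sum(weight_z[z] * risks_by_z[z][1] for z in risks_by_z)
marginal = effect_measures(marg_unexposed, marg_exposed)

for z, (odds_ratio, risk_diff) in conditional.items():
    print(f"Z={z}: OR={odds_ratio:.2f}, RD={risk_diff:.2f}")
print(f"Marginal: OR={marginal[0]:.2f}, RD={marginal[1]:.2f}")
```

The example only covers the odds ratio and the risk difference; the same setup could be extended to a risk ratio by choosing stratum risks with a constant ratio.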
If we were to write this paper again, I would:
- select a different example for the two-by-two contingency table;
- evaluate the recommended estimators on efficiency. There is a point at which unbiased estimators are so inefficient that they’re junk; and
- evaluate a Bayesian estimation approach for the sequential OLS/Poisson, where we censor the posterior of the predicted outcome at about 0 and about 1, before passing the predicted outcome to the likelihood.
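For readers unfamiliar with the sequential OLS mentioned in this list, the following is a rough Python sketch of my reading of the general idea: fit the linear probability model, drop the cases whose fitted values fall outside the unit interval, and refit until none do. It is not the code from the paper or the repository. The single predictor, the data-generating model (true slope 0.6, risk capped at 1), the sample size and the seed are all assumptions chosen for illustration.

```python
import random


def ols_fit(x, y):
    """Closed-form simple OLS; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sequential_lpm(x, y):
    """Refit OLS, each time trimming cases whose fitted value leaves [0, 1].

    Returns (intercept, slope, kept_indices) from the final fit, in which
    every retained case has a fitted value inside the unit interval.
    """
    keep = list(range(len(x)))
    while True:
        if len(keep) < 2:
            raise ValueError("too few observations left after trimming")
        a, b = ols_fit([x[i] for i in keep], [y[i] for i in keep])
        inside = [i for i in keep if 0.0 <= a + b * x[i] <= 1.0]
        if len(inside) == len(keep):
            return a, b, keep
        keep = inside  # strictly smaller, so the loop must terminate


# Hypothetical data: risk rises linearly with x (slope 0.6) until it hits 1.
rng = random.Random(2018)
x = [rng.uniform(0, 2) for _ in range(5000)]
y = [1 if rng.random() < min(1.0, 0.25 + 0.6 * xi) else 0 for xi in x]

naive_intercept, naive_slope = ols_fit(x, y)
seq_intercept, seq_slope, kept = sequential_lpm(x, y)
print(f"naive OLS slope:      {naive_slope:.3f}")
print(f"sequential OLS slope: {seq_slope:.3f} (kept {len(kept)} of {len(x)})")
```

In this setup the naive slope is pulled toward zero by the cases whose true risk is already 1, and the trimmed fit should land closer to 0.6. The price is a smaller estimation sample, which is the efficiency concern raised above.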
Minor interesting points:
- One thing in the paper that I’ve not seen in the literature is the simple derivation for a log-probability model in Appendix A.
- The correlated predictors (one binary, one uniform) in both simulations were generated using the copula method. The correlation was adjusted for the attenuation that occurs because the binary predictor was obtained by dichotomizing a continuous variable.
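The second point can be sketched as follows in Python. This is an illustration under assumptions, not the paper's code. The post says only "copula method", so I assume a Gaussian copula here, and I assume the binary predictor is a median split of its latent normal. Under those two assumptions the Pearson correlation between the binary and the uniform is (√12/π)·arcsin(ρ/√2), where ρ is the latent correlation. Inverting that gives the adjustment used below, which I derived for this sketch and which may differ from the adjustment in the paper. The target correlation of 0.3 is also hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_pair(latent_rho, n, rng):
    """Draw n (binary, uniform) pairs from an assumed Gaussian copula."""
    std_normal = NormalDist()
    binary, uniform = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = latent_rho * z1 + math.sqrt(1 - latent_rho ** 2) * rng.gauss(0, 1)
        binary.append(1 if z1 > 0 else 0)    # median split of the latent normal
        uniform.append(std_normal.cdf(z2))   # probability integral transform
    return binary, uniform


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def latent_rho_for(target):
    """Latent correlation that yields `target` between a median-split binary
    and a uniform under a Gaussian copula (inverse of the formula above)."""
    return math.sqrt(2) * math.sin(math.pi * target / math.sqrt(12))


rng = random.Random(2018)
target = 0.3  # hypothetical desired correlation between the two predictors

# Using the target directly as the latent correlation gives an attenuated result.
unadjusted = pearson(*simulate_pair(target, 20000, rng))
# Inflating the latent correlation first should recover the target.
adjusted = pearson(*simulate_pair(latent_rho_for(target), 20000, rng))

print(f"latent rho needed: {latent_rho_for(target):.3f}")
print(f"observed correlation, unadjusted: {unadjusted:.3f}")
print(f"observed correlation, adjusted:   {adjusted:.3f}")
```

For a split that is not at the median, or for a different copula, the closed-form inverse above no longer applies and the latent correlation would have to be found numerically.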
1. Peng, C.-Y. J., Lee, K. L., & Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting. The Journal of Educational Research, 96(1), 3–14. https://doi.org/10.1080/00220670209598786
2. I, not my co-authors, chose this example.
3. Horrace, W. C., & Oaxaca, R. L. (2003, January 1). New wine in old bottles: A sequential estimation technique for the LPM. Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=383102