When students and users of statistical methods first learn about regression
analysis, there is an emphasis on the technical details of models and estimation
methods that invariably runs ahead of the purposes for which these models might
be used. More broadly, statistics is widely understood to provide a body of
techniques for "modelling data", underpinned by what we describe as the "true
model myth", according to which the task of the statistician/data analyst is to
build a model that closely approximates the true data generating process. By
way of our own historical examples and a brief review of mainstream clinical
research journals, we describe how this perspective leads to a range of
problems in the application of regression methods, including misguided
"adjustment" for covariates, misinterpretation of regression coefficients and
the widespread fitting of regression models without a clear purpose. We then
outline an alternative approach to the teaching and application of regression
methods, which begins by focussing on a clear definition of the substantive
research question within one of three distinct types: descriptive, predictive,
or causal. The simple univariable regression model may be introduced as a tool
for description, while the development and application of multivariable
regression models should proceed differently according to the type of question.
Regression methods will no doubt remain central to statistical practice as they
provide a powerful tool for representing variation in a response or outcome
variable as a function of "input" variables, but their conceptualisation and
usage should follow from the purpose at hand.

Comment: 24 pages main document including 3 figures, plus 15 pages
supplementary material. Based on plenary lecture (President's Invited
Speaker) delivered to ISCB43, Newcastle, UK, August 2022. Submitted for
publication 12-Sep-2