Regression analysis
  Steff Lewis
  Senior Research Fellow/Statistician, University of Edinburgh, Division of Clinical Neurosciences, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK; steff.lewis{at}ed.ac.uk

    Regression analysis describes the relation between an outcome of interest and one or more variables, known as explanatory variables. For example, figure 1 shows how height (the outcome) is related to age (the explanatory variable) in young children. Each cross on the plot represents the value for an individual child, and the dotted line is the regression line, which will be explained later.

    Figure 1

    Scatter plot of height and age in 100 children, with regression line. (Data used with permission from the Office of Population Censuses and Surveys. Social Survey Division, National Diet, Nutrition and Dental Survey of Children Aged 1 1/2 to 4 1/2 Years, 1992–1993. SN: 3481. Colchester, UK: December 1995.)

    How a regression analysis is performed depends on the type of outcome data. Three common methods are described in this article, relating to:

    • continuous outcomes (such as height): linear regression

    • binary outcomes (such as stroke/no stroke): logistic regression

    • time-to-event outcomes (such as time to death): Cox proportional hazards.

    Regression analysis is so commonly used that clinicians must be able to at least understand the reporting of multivariable regression in publications, even if not able to do the analysis themselves. It would also be helpful for many to be able to interpret the computer output from a multivariable regression procedure. The methods described are available in standard statistical software packages.

    LINEAR REGRESSION

    Simple linear regression is used to describe the relation between one continuous outcome variable—for example, height—and another (explanatory) variable—for example, age (fig 1). The explanatory variable may be binary (for example, male, female), have several categories (for example, nationality), or be continuous (for example, age). Here it seems sensible to choose height as the outcome variable (y, vertical axis), and age the explanatory variable (x, horizontal axis) as a person’s height depends on their age, not the other way round.

    Before considering a linear regression analysis between two continuous variables, draw a scatter plot (without a regression line), to look for a relation between the two variables. Such plots should be presented in published reports (although they are often not), to give the reader a visual impression of the strength of the association between the variables. If there is no obvious relation, no further analysis is necessary (fig 2A). If the relation is clearly not linear (fig 2C), then a simple linear regression line is not an appropriate summary of the relation between the two variables. The spread of the data (variance) should also be fairly constant as one variable increases. For example, in figure 2D, the data flare out as x and y increase, and simple linear regression would not be appropriate here. In figure 2B, the spread of the data is constant as x and y increase, so a regression line can be calculated.

    Figure 2

    Four scatter plots showing (A) no relation between the variables, (B) a linear relation between the variables with constant variance, (C) a quadratic relation between the variables, and (D) a linear relation with increasing variance (spread) for higher values.

    The scatter plot can be used to look for outliers, which are any values that are markedly different from the other values in the data set. Outlying values should be checked to ensure they have been entered correctly. If they are real, the analysis can be performed with and without them to see what effect this has, and how the result might be best interpreted (fig 3). It should be clearly stated in a paper whether outliers have been deleted, why this was done, and what effect it had—the outliers must not be simply suppressed!

    Figure 3

    Two scatter plots showing a regression line fitted to data with (A) no outliers and (B) three outliers that have a profound effect on the estimated regression line.
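
    The "with and without" check described above can be sketched in a few lines of Python. This is only an illustration: the data are made up, the helper fit_line is a hypothetical name, and it assumes the line is fitted by ordinary least squares, which is what most statistics packages do by default.

```python
def fit_line(xs, ys):
    """Return (A, B) for the least-squares line y = A + B*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b


# Made-up data: six points on a straight line plus one suspicious value.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [2, 4, 6, 8, 10, 12, 0]
outlier_index = 6  # position of the value that was queried

# Fit once with every observation, once with the queried value left out.
a_all, b_all = fit_line(xs, ys)
xs_clean = [x for i, x in enumerate(xs) if i != outlier_index]
ys_clean = [y for i, y in enumerate(ys) if i != outlier_index]
a_clean, b_clean = fit_line(xs_clean, ys_clean)

print(f"with outlier:    y = {a_all:.2f} + {b_all:.2f}x")
print(f"without outlier: y = {a_clean:.2f} + {b_clean:.2f}x")
```

    In this toy data set a single point changes the slope from 2 to 0.5, which is why both fits should be reported rather than quietly dropping the point.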

    A regression analysis attempts to fit a line to the data, described by the equation:

    y  =  A + Bx

    where A is the point where the regression line crosses the vertical axis (the value of y when x is zero), and B is the slope of the line (when x increases by one unit, y will change by B units). The line is chosen to best fit the data. In the height and age example, the equation is:

    Height in metres  =  0.71 + (0.006 × age in months)

    This can be superimposed onto the scatter plot (fig 1). There is little point in graphically presenting the regression line without showing the scatter of data around it, because the line gives no more information than the regression equation. A regression equation should only be used to describe the observed data, and should not be used to predict the outcome variable outside the observed range of values. For instance, in the age and height example it is inappropriate to use the regression equation to estimate the height of a 65-year-old—people rarely grow forever.
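
    As a rough illustration of why extrapolation is dangerous, the sketch below applies the height equation given above. The coefficients 0.71 and 0.006 come from the article; the function name and the age limits of 18 to 54 months are assumptions, taken from the survey title (children aged 1 1/2 to 4 1/2 years) rather than from the plotted data.

```python
INTERCEPT_M = 0.71          # A: height in metres when age is zero
SLOPE_M_PER_MONTH = 0.006   # B: change in height per month of age

# Assumed observed age range, based on the survey title (1.5 to 4.5 years).
MIN_AGE_MONTHS = 18
MAX_AGE_MONTHS = 54


def predicted_height_m(age_months, allow_extrapolation=False):
    """Apply height = A + B * age, refusing ages outside the observed range."""
    in_range = MIN_AGE_MONTHS <= age_months <= MAX_AGE_MONTHS
    if not in_range and not allow_extrapolation:
        raise ValueError("age is outside the range of the observed data")
    return INTERCEPT_M + SLOPE_M_PER_MONTH * age_months


# A 3-year-old (36 months) lies inside the observed range.
print(f"{predicted_height_m(36):.2f} m")

# A 65-year-old (780 months) does not; forcing the calculation gives nonsense.
print(f"{predicted_height_m(65 * 12, allow_extrapolation=True):.2f} m")
```

    Forcing the calculation for a 65-year-old gives a height of more than 5 metres, which shows that the equation only describes the ages that were actually observed.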

    It is possible to model explanatory variables that do not have a linear relation with outcome, although this is a more complex procedure. When reading papers that describe complex non-linear relationships, consider whether they make sense. Statistics packages do not tell you whether the results of an analysis are plausible, and so will let you fit all manner of weird and wonderful lines to your data.

    Papers often present the value R² ("R squared"). This gives the proportion of the variability of the outcome variable that is explained by the explanatory variable. Values close to 1 show that the explanatory variable explains most of the variation in the outcome variable (as in fig 2B); values close to zero show that the explanatory variable explains very little of the variation in the outcome variable (as in fig 2A). In the example of height and age, R² is 0.75, meaning that age explains about three quarters of the variation in height, which is very good. Values of R² in the medical literature are often much lower than this.
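
    For readers who want to see where this value comes from, here is a minimal sketch using the usual definition: one minus the ratio of the unexplained (residual) variation to the total variation. The function name and the numbers are illustrative assumptions, not data from the article.

```python
def r_squared(observed, fitted):
    """Proportion of the variation in `observed` explained by `fitted`."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_resid = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1 - ss_resid / ss_total


observed = [1, 2, 3, 4]  # made-up outcome values

# Fitted values equal to the data: all of the variation is explained.
perfect = r_squared(observed, [1, 2, 3, 4])

# Fitted values stuck at the mean (2.5): none of the variation is explained.
useless = r_squared(observed, [2.5, 2.5, 2.5, 2.5])

# Fitted values that are close but not exact.
partial = r_squared(observed, [1.5, 1.5, 3.5, 3.5])

print(perfect, useless, partial)
```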

    LOGISTIC REGRESSION

    Logistic regression is used when the outcome variable is binary—for example, the presence or absence of a condition, such as being independent in activities of daily living or not, or having a stroke or not. Again, the explanatory variable may be binary (for example, male, female), have several categories (for example, nationality), or be continuous (for example, age). The principle of logistic regression is very similar to linear regression. Each patient has a probability, p, of achieving a particular outcome (for example, having a stroke), and this is modelled as:

    ln(p / (1 − p))  =  A + Bx

    (this is a natural log).

    Statistics packages often give the value of B, from which an odds ratio can be derived by taking the exponential (odds ratio = e^B). This is the figure that is usually found in papers. An odds ratio is a measure of the strength of an association and describes the odds of a member of one group of patients (for example, women) suffering an outcome event relative to a member of a different group of patients (for example, men). A value of 1 means that there is no association. An example of a logistic regression analysis is shown in table 1.

    Table 1

    Predicting outcome after stroke: model to predict probability of survival free of dependency (modified Rankin <3) at 6 months, using logistic regression
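
    The link between B, the odds ratio and the probability p can be sketched as follows, assuming the standard logistic form ln(p / (1 − p)) = A + Bx. The coefficient values are hypothetical and are not taken from table 1.

```python
import math


def probability(a, b, x):
    """Convert the log odds A + B*x back into a probability p."""
    log_odds = a + b * x
    return 1 / (1 + math.exp(-log_odds))


def odds_ratio(b):
    """Odds ratio for a one-unit increase in x (or for group 1 v group 0)."""
    return math.exp(b)


# Hypothetical coefficients: x = 1 for women, x = 0 for men.
A = -2.0
B = math.log(2)  # chosen so that the odds ratio is 2

p_men = probability(A, B, 0)
p_women = probability(A, B, 1)
odds_men = p_men / (1 - p_men)
odds_women = p_women / (1 - p_women)

print(f"odds ratio from B:        {odds_ratio(B):.2f}")
print(f"odds ratio from the odds: {odds_women / odds_men:.2f}")
```

    Both routes give the same odds ratio, and a coefficient of B = 0 gives an odds ratio of 1, that is, no association.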

    COX PROPORTIONAL HAZARDS

    Cox proportional hazards is a type of regression analysis that is frequently used when the outcome is the time to an event (for example, time to death, time to stroke). At the end of a study, some patients may not have had the outcome of interest (for example, stroke), and all that is known is that the patient did not have a stroke up to a particular point in time. These observations are called “censored”, and Cox proportional hazards analysis uses this information.

    The Cox proportional hazards regression model has a complex mathematical formula, but the results are interpreted in the same way as a logistic regression model. However, the results are explained in terms of a hazard ratio, rather than an odds ratio. The hazard is the risk of an event at a given time, if the patient has not had an event until that time. An example of a Cox regression analysis is shown in table 2.

    Table 2

    Predicting outcome after stroke: model to predict probability of 6-month survival, using Cox proportional hazards

    Cox proportional hazards assumes that the hazard ratio is constant at all time points. That is to say, if patients in atrial fibrillation have a risk of stroke that is twice as high as patients in sinus rhythm at one time point, then the risk in patients with atrial fibrillation remains roughly twice as high at all other times. If this assumption does not hold, then Cox proportional hazards analysis should not be used. The assumption can be assessed using Kaplan–Meier curves, as in figure 4. In these plots, the probability of surviving at each time point is calculated conditional on having survived up to that time point, which uses the data at the time point of interest, and all previous time points. As patients die or are lost to follow-up, they cease to add further information to the calculations at later time points. This means that as patients have events or drop out of the study, there is more uncertainty around the exact position of the survival line. The eye is naturally drawn to the right hand end of the plot, but we should resist this temptation: it is the least accurate bit!

    Figure 4

    Two Kaplan–Meier plots showing the proportion of patients alive over time since randomisation in the treated and control groups of a hypothetical randomised trial showing (A) lines for treated and control groups roughly parallel, so Cox regression would be appropriate and (B) treated group with early risk, but later benefit over control leading to crossing lines on the plot—Cox regression would not be appropriate.
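
    The conditional calculation behind a Kaplan–Meier plot can be sketched as below. This is a minimal illustration with made-up follow-up times, not the data behind figure 4; the function name and the convention that patients censored at a given time are counted as at risk for events at that same time are assumptions of this sketch.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability, number at risk)] at each event time.

    times  : follow-up time for each patient
    events : True if the patient had the event, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths > 0:
            # Probability of surviving time t, given survival up to t.
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival, at_risk))
        # Patients who died or were censored no longer contribute.
        at_risk -= leaving
    return curve


# Made-up follow-up in months; False marks a censored observation.
times = [1, 2, 3, 4, 5]
events = [True, False, True, False, True]

for t, s, n in kaplan_meier(times, events):
    print(f"month {t}: survival {s:.3f} (based on {n} at risk)")
```

    The number at risk shrinks at every step, so the later estimates rest on very few patients. This is the reason the right hand end of the plot is the least reliable.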

    MULTIVARIABLE REGRESSION

    In many clinical studies there are several explanatory variables of interest rather than just one. Linear, logistic and Cox regression can all be used to analyse more than one explanatory variable simultaneously. The results look very similar to those for analyses with one explanatory factor, but here there are several regression coefficients (denoted by B in the simple linear regression and logistic regression equations shown earlier), one for each explanatory variable.

    In multivariable regression, the relation between the outcome and each explanatory variable is adjusted for the effects of the other variables. For this reason, multivariable regression is often used to examine the relation between an outcome and a single explanatory factor, adjusted for one or more other variables. For example, you might test whether there is a relation between survival and treatment with a new drug after adjusting for disease severity at baseline. Multivariable regression is also used to develop predictive models for individual patients.

    Tables 1 and 2 show two models developed to predict outcome in stroke patients using six features that can be easily measured at a baseline clinical assessment. The first is a logistic regression model to predict being alive and independent at 6 months, and the second is a Cox proportional hazard model to predict survival to 30 days after stroke onset.

    WHICH VARIABLES SHOULD BE SELECTED FOR MULTIVARIABLE ANALYSIS?

    The number of variables in a multivariable regression analysis needs careful thought and control. If too many variables are put into the analysis, some will be associated with an outcome by chance, and their inclusion may reduce the apparent association between any real predictors and outcome:

    • In a linear regression model, a good rule of thumb is that there should be at least 10 patients for every variable selected.

    • In a logistic regression model, there should be at least 10 patients with the less common of the two possible outcomes for every variable included in the model—that is, in a data set in which 80 patients did not have strokes, and 20 did, you can include two variables in a logistic regression model to predict stroke (for example, age and blood pressure).

    • In Cox proportional hazards, you should have at least 10 outcome events (for example, death) per explanatory variable that you include.
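
    These three rules of thumb amount to simple arithmetic, sketched below. The function names are hypothetical, and the figure of 10 per variable is the rule of thumb given above, not a strict limit.

```python
PER_VARIABLE = 10  # rule-of-thumb number of patients or events per variable


def max_variables_linear(n_patients):
    """Linear regression: at least 10 patients per variable."""
    return n_patients // PER_VARIABLE


def max_variables_logistic(n_with_outcome, n_without_outcome):
    """Logistic regression: count only the less common of the two outcomes."""
    return min(n_with_outcome, n_without_outcome) // PER_VARIABLE


def max_variables_cox(n_events):
    """Cox proportional hazards: at least 10 outcome events per variable."""
    return n_events // PER_VARIABLE


# The example from the text: 20 patients had strokes and 80 did not.
print(max_variables_logistic(20, 80))
```

    For the stroke example this gives two variables, matching the text, even though the data set contains 100 patients.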

    It is very common in clinical research to have many variables from few patients, and we must choose which variables to include. This choice should be made before looking at the association between each explanatory variable and outcome. Choose the variables that:

    • from previous work are most likely to be associated with outcome

    • are the most clinically relevant

    • are reliably and accurately measured

    • distinguish between patients—in other words pick variables that do not have the same value for nearly all patients in the data set

    • have the least missing data

    • do not measure the same underlying effect as another included variable (for example, systolic and diastolic blood pressure). If two variables are too highly associated, then they will both be attempting to explain the same part of the variability in the outcome variable. This makes the estimates in the model less reliable.
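
    One simple way to screen for the last problem in this list is to look at the correlation between pairs of candidate variables before fitting the model. The sketch below assumes the Pearson correlation coefficient as the measure of association, which the article does not prescribe; the blood pressure readings are made up and the cut-off of 0.8 is a hypothetical choice for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two lists of measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up readings (mm Hg) for five patients.
systolic = [120, 130, 140, 150, 160]
diastolic = [78, 86, 89, 96, 99]

HYPOTHETICAL_CUTOFF = 0.8  # illustrative only, not a recommended threshold

r = pearson_r(systolic, diastolic)
too_similar = abs(r) > HYPOTHETICAL_CUTOFF
print(f"r = {r:.2f}; consider keeping only one variable: {too_similar}")
```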

    Practice points

    • Check that the type of regression used was appropriate for the type of outcome variable.

    • Make sure that the number of explanatory variables included is appropriate and that there are not too many missing data.

    • Consider whether any derived model is of practical use.

    • Look for plots of raw data and measures of the magnitude of effect rather than just a string of p values.

    To reduce the number of variables, it often makes sense to combine several variables into one—for example, “history of transient ischaemic attack” and “previous stroke” could be combined into “prior cerebrovascular disease”.

    When developing a clinical model to predict outcome in individual patients, explanatory variables should, for pragmatic reasons, be easy and cheap to collect, and test results that are available immediately should be used in preference to ones that take time. The prediction model also needs to be as simple as possible (so that people actually use it).

    Generally, it is better to include continuous variables as they are, rather than dichotomising, as dichotomising loses statistical power. On the other hand, dichotomising variables can help to develop a prediction model that is easy to use.

    REPORTING RESULTS

    When interpreting data, the emphasis should be on the effect size and its confidence interval, and not just the p value. The reader needs to be able to assess whether the magnitude of any effect is of clinical significance.

    When reporting the results it should be clear how missing data were handled: whether they were excluded, whether missing test results were assumed to be negative, or whether they were included as a separate category of data. Missing data can be an important predictor of outcome. For instance, a result may not be available because the patient was too ill for the assessment to be done.

    For predictive models, calibration and discrimination should be described. Calibration is a description of how closely predicted outcomes match actual outcomes. Discrimination is how well the model divides the patients with and without the outcome of interest. So for instance, if 50% of a population will develop epilepsy, you could tell each patient that they have a 50% chance of developing the condition. This would be perfectly calibrated, as 50% of the patients will develop epilepsy, but it has no discrimination at all, as it does not separate out those who will from those who will not develop epilepsy. The perfect model would result in half the patients being told (correctly) that they would develop epilepsy, and the other half being told (correctly) they would not.
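
    The epilepsy example can be put into numbers with the sketch below. It makes two assumptions that the article does not: calibration is summarised crudely by comparing the mean predicted risk with the observed proportion, and discrimination is summarised by the c statistic (the proportion of pairs, one patient with the outcome and one without, in which the patient with the outcome was given the higher predicted risk, counting ties as a half). The four patients are made up.

```python
def mean_predicted_and_observed(predictions, outcomes):
    """Crude calibration check: mean predicted risk v observed proportion."""
    mean_predicted = sum(predictions) / len(predictions)
    observed = sum(outcomes) / len(outcomes)
    return mean_predicted, observed


def c_statistic(predictions, outcomes):
    """Proportion of (outcome, no outcome) pairs ranked correctly; ties = 0.5."""
    concordant = 0.0
    pairs = 0
    for p_case, o_case in zip(predictions, outcomes):
        if o_case != 1:
            continue
        for p_control, o_control in zip(predictions, outcomes):
            if o_control != 0:
                continue
            pairs += 1
            if p_case > p_control:
                concordant += 1
            elif p_case == p_control:
                concordant += 0.5
    return concordant / pairs


outcomes = [1, 1, 0, 0]              # half the patients develop epilepsy

everyone_50 = [0.5, 0.5, 0.5, 0.5]   # tell every patient "50% chance"
perfect = [1.0, 1.0, 0.0, 0.0]       # the ideal model described in the text

for name, preds in [("everyone 50%", everyone_50), ("perfect", perfect)]:
    mean_pred, observed = mean_predicted_and_observed(preds, outcomes)
    c = c_statistic(preds, outcomes)
    print(f"{name}: mean predicted {mean_pred}, observed {observed}, c = {c}")
```

    Both sets of predictions agree with the observed proportion on average, but only the second separates the patients who develop epilepsy from those who do not.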

    FURTHER READING

    Landau S, Everitt BS. A handbook of statistical analyses using SPSS. Chapman & Hall/CRC.

    Katz MH. Multivariable analysis: a practical guide for clinicians. Cambridge University Press.

    Acknowledgments

    Thanks to Will Whitely, Fergus Doubal and Charles Warlow for helpful comments in the writing of this article.
