# Model fitting in R

An introduction using lm()

R
model fitting
Author

Pedro J. Aphalo

Published

2023-05-30

Modified

2023-09-13

Abstract
In this page I give a brief presentation on the mechanics of fitting statistical models to observed data using R. I use a lineal model (LM), polynomial regression, as example. I describe the most important values contained in the model fit object returned by function lm() and how to extract them. I rely heavily on diagrams and data plots.
Keywords

predicted values, residual values, parameter estimates, model formulas

## Introduction

Fitting a model to data consists in finding the parameter values that best explain the data or observations. These parameter values are estimates of those in a larger population of possible observations from which we have drawn a sample or subset.

Tip

You may want to play with the my linear regression Shiny App to get a better sense of how the size of the response and error variation affect the linear model fit to succesive sample draws from a normally distributed population of artificial observations.

Model selection involves comparing models that differ in their structure, i.e., in the formula that relates parameters and data.

In general, there are many different aspects of a model fit that we may be interested in. The most computationally intensive step is the fitting itself. Thus R’s approach is to save the results from fitting, or fitted model, and separately query it for different derived numerical and graphical output as needed.

Tip

In this page some code chunks are “folded” so as to decrease the clutter. Above the R output, text or plots, you will find a small triangle followed by “Code”. Clicking on the triangle “unfolds” the code chunk making visible the R code used to produce the output shown. The code in the chunks can be copied by clicking on the top right corner, where an icon appears when the mouse cursor hovers over the code listing.

## Linear models (LM)

### Fitting

Attach packages used for plotting.

Code
library(ggpmisc)
library(broom)
theme_set(theme_bw(16))

Generate some artificial data.

Code
set.seed(19065)
# set.seed(4321) # change or comment out to get different pseudorandom values
x <- 1:24
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 2)
y <- y / max(y)
my.data <- data.frame(x,
y,
group = c("A", "B"),
y2 = y * c(1, 2) + c(0, 0.2),
block = c("a", "a", "b", "b"),
wt = sqrt(x))

Plot these data to have a look at them.

Code
ggplot(my.data, aes(x, y)) +
geom_point()

In R we first save the fitted model object into a variable, and in a second stage extract or query different results as needed.

Here we fit a third-degree polynomial regression using function lm() (linear model, LM). .

fm1 <- lm(formula = y ~ poly(x, 3), data = my.data)

Then we query this object saved in the variable, here fm1, with different methods, for example summary().

summary(fm1)

Call:
lm(formula = y ~ poly(x, 3), data = my.data)

Residuals:
Min       1Q   Median       3Q      Max
-0.24817 -0.07635  0.01945  0.08890  0.18264

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.25644    0.02525  10.154 2.44e-09 ***
poly(x, 3)1  1.49842    0.12372  12.111 1.15e-10 ***
poly(x, 3)2  0.55768    0.12372   4.508 0.000215 ***
poly(x, 3)3 -0.06142    0.12372  -0.496 0.625015
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1237 on 20 degrees of freedom
Multiple R-squared:  0.8932,    Adjusted R-squared:  0.8772
F-statistic: 55.75 on 3 and 20 DF,  p-value: 6.793e-10

### The components of the model fit object

Functions to extract and/or compute different estimates and carry out tests are available. These functions take the model fit object as argument. Above I showed the use of summary(). The next diagram shows most of the functions that can be used with linear-model-fit objects.

I expect you to be already (to some extent) familiar with the ANOVA table, coefficient estimates and plots of residuals. But, you could be still wordering about how does all this fit together? I will use plots to explain the most interesting estimates and results from model fitting. R’s functions in the diagram above, return numeric values. Thus, for creating the plots, I use these functions indirectly through functions from package ‘ggpmisc’. This mainly saves typing. The model formula, data and model fit function used for the plots below, are the same as in the example above.

The observations (artificial data in this case). Same plot as above, will be the base of subsequent plots.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point()

The fitted values, as blue points, added. Fitted values are the “model predictions” for the observed values of x.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_fit_fitted(formula = y ~ poly(x, 3),
colour = "blue")

The residuals plotted as deviations from the fitted values. These residuals (in red) are used as the criterion for finding the best fit. The observations (in black) are given, but the position of the predicted values (in blue) are decided based on the minimization algorithm used as the criterion to fit a model.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_fit_deviations(formula = y ~ poly(x, 3),
colour = "red",
arrow = arrow(length = unit(0.33, "lines"), ends = "both")) +
stat_fit_fitted(formula = y ~ poly(x, 3), colour = "blue")

The residuals plotted on their own. Plots like the one below can be used mainly as a tool to check that the assumptions, in our case those required to fit a linear model using least squares as a criterion are fulfilled, at least approximately (homogeneity of variace, in particular). In this plot the values on the y axis describe the length of red segments in the plot above and are shown as red points for each observed value of x.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_hline(yintercept = 0, linetype = "dashed") +
stat_fit_residuals(formula = y ~ poly(x, 3),
colour = "red")

The form of the equation of the fitted curve is given by the model formula passed as argument to lm(), but the estimates of the value of the four parameters in the equation are those giving the curve that makes the sum of squares of of the residuals as small as possible.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_fit_deviations(formula = y ~ poly(x, 3),
colour = "red",
arrow = arrow(length = unit(0.33, "lines"), ends = "both")) +
stat_fit_fitted(formula = y ~ poly(x, 3), colour = "blue") +
stat_poly_eq(formula = y ~ poly(x, 3),
mapping = use_label("eq"))

The equation describes a curve. The prediction line passes through the fitted values but interpolates between them as shown immediately below, and, as I will show farther below, can extrapolate outside the range of the observed x values.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = FALSE) +
stat_poly_eq(formula = y ~ poly(x, 3),
mapping = use_label("eq"))

The prediction line and $$R^2$$, $$n$$, $$F\textrm{-value}$$ and $$P\textrm{-value}$$. We include as annotations other parameter estimates of interest.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = FALSE) +
stat_poly_eq(formula = y ~ poly(x, 3),
mapping = use_label("eq")) +
stat_poly_eq(formula = y ~ poly(x, 3),
label.y = 0.89,
mapping = use_label(c("R2", "n", "F", "P")))

When additional details for each parameter estimate are of interest, a table is usually used. The bi in the table are the coefficients in the equation in the plot above, and the si are the their corresponding standard errors, from which a t-value can be computed.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = FALSE) +
stat_fit_tb(method.args = list(formula = y ~ poly(x, 3)),
tb.vars = c(Term = 1,
"italic(b)[i]" = 2,
"italic(s)[i]" = 3,
"italic(b)[i] / italic(s)[i]~=~italic(t)" = 4,
"italic(P)" = 5),
tb.params = c("x^0" = 1,
"x^1" = 2,
"x^2" = 3,
"x^3" = 4),
parse = TRUE,
label.x = "left")

Using the bi together with the si, we can compute an estimate of the variation affecting the line as a whole. This makes possible the computation of a confidence band around the predicted curve. The prediction line plus its 95% confidence band are shown in the plot below. This band informs us about where the true line (for the sampled population as a whole) is likely to reside. The caveat is, that this estimated band and curve are true only given the form of the model that was used. If the form of model we have passed as argument can take a suitable shape and is flexible enough to describe features of interest the prediction and band are useful, otherwise they are of little use. This we can best judge by looking at a plot like the one below.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = TRUE)

The residuals plotted as deviations from the prediction line. This is an alternative way of subjectively assessing the goodness of a model fit. Be aware, that how many observations fall outside the band depends on the number of replicates: the more observations we have, the more confident we can be about the estimated line, and the higher the proportion of observations that will be plotted outside the band. What we have plotted is a confidence band for the curve itself, not a density estimate for the probability of observing x and y value pairs.

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = TRUE) +
stat_fit_deviations(formula = y ~ poly(x, 3), colour = "red",
arrow = arrow(length = unit(0.33, "lines"), ends = "both"))

The prediction line plus its 95% confidence band with extrapolation. When we extrapolate into “unknown territory” uncertainties rapidly increase and the band widens dramatically, and in a way that depends strongly on the model formula. (The range of the y axis is expanded!)

Code
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
geom_vline(xintercept = range(my.data\$x), linewidth = 0.33) +
stat_poly_line(formula = y ~ poly(x, 3), se = TRUE, fullrange = TRUE) +
expand_limits(x = c(-10, 40)) # an arbitrary range

## Generalised linear models (GLM)

As shown in the diagram below, the overall approach is very similar to that used for linear models.

## Non-linear models

As shown in the diagram below, the overall approach is very similar to that used for linear models.

## Model selection flowchart

I mentioned above that the model chosen was given by the argument we passed to lm(). Obviously, it is also possible to use other models. The choice of a model has to balance increasing the proportion of the total variation that is explained while still extracting the useful information out of the data separately from accounting for the noise. One approach is model selection, instead of simply finding the best set of parameter estimates, comparing the best fits possible with each equation form, and selecting the best performing one.

Model selection can be done manually by comparing models fitted individually or automatically using a stepwise approach. In the case of polynomials, one possibility is to compare polynomials of different degrees. Automatic selection is described by the diagram below.

Warning

The assumption of most usual model fitting procedures is that residuals are normally distributed and independent of the magnitude of the response variable, not that the observations themselves are normally distributed. Only in the simplest cases, such as comparing a mean against a constant, the residuals have the same distribution as the data but centred on zero instead of on the mean. In most other cases, this is not the case, as part of the variation among individual observations is accounted by terms in the fitted model, such as random effects, correlations or even variance covariates. The justification is that we use the residuals to estimate the error variation and the tests of significance are based on this error variance estimate.

It is also of little use to test for the statistical significance of the deviations from these assumptions, as we should not expect in the real world for the assumptions to be ever exactly fulfilled. The power of tests increases with replication, but the bias introduced into estimates does not depend on significance, but on the magnitude of the deviations. Furthermore, the higher the replication, the less the bias introduced in the estimates by deviations from the assumptions. In other words, the more replicates we have smaller deviations from assumptions become detectable as significant, while the importance of deviations decreases.

When there are very few replicates available, it is imposible to assess directly if observations come from a normally distributed population or not. In such cases, it can be wise to rely on previous information about the sampled population and sampling method used instead of on the current observations.

Tip

To create the plots above I used packages ‘ggplot2’ and ‘ggpmisc’. Galleries of plot examples using ‘ggpmisc’ and some other packages are available at the R for Photobiology web site. They contain R code folded in the same way as in this page.