## Code

```
library(ggpmisc)
library(broom)
theme_set(theme_bw(16))
```

Author

Pedro J. Aphalo

Published

2023-05-30

Modified

2023-09-13

Abstract

In this page I give a brief presentation on the mechanics of fitting statistical models to observed data using R. I use a lineal model (LM), polynomial regression, as example. I describe the most important values contained in the model fit object returned by function `lm()`

and how to extract them. I rely heavily on diagrams and data plots.
Keywords

predicted values, residual values, parameter estimates, model formulas

Fitting a model to data consists in finding the parameter values that best explain the data or observations. These parameter values are estimates of those in a larger population of possible observations from which we have drawn a sample or subset.

Tip

You may want to play with the my linear regression Shiny App to get a better sense of how the size of the response and error variation affect the linear model fit to succesive sample draws from a normally distributed population of artificial observations.

Model selection involves comparing models that differ in their structure, i.e., in the formula that relates parameters and data.

In general, there are many different aspects of a model fit that we may be interested in. The most computationally intensive step is the fitting itself. Thus R’s approach is to save the results from fitting, or *fitted model*, and separately query it for different derived numerical and graphical output as needed.

Tip

In this page some code chunks are “folded” so as to decrease the clutter. Above the R output, text or plots, you will find a small triangle followed by “Code”. Clicking on the triangle “unfolds” the code chunk making visible the R code used to produce the output shown. The code in the chunks can be copied by clicking on the top right corner, where an icon appears when the mouse cursor hovers over the code listing.

Attach packages used for plotting.

Generate some artificial data.

```
set.seed(19065)
# set.seed(4321) # change or comment out to get different pseudorandom values
x <- 1:24
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 2)
y <- y / max(y)
my.data <- data.frame(x,
y,
group = c("A", "B"),
y2 = y * c(1, 2) + c(0, 0.2),
block = c("a", "a", "b", "b"),
wt = sqrt(x))
```

Plot these data to have a look at them.

In R we first save the fitted model object into a variable, and in a second stage extract or query different results as needed.

Here we fit a third-degree polynomial regression using function `lm()`

(linear model, LM). .

Then we query this object saved in the variable, here `fm1`

, with different methods, for example `summary()`

.

```
Call:
lm(formula = y ~ poly(x, 3), data = my.data)
Residuals:
Min 1Q Median 3Q Max
-0.24817 -0.07635 0.01945 0.08890 0.18264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.25644 0.02525 10.154 2.44e-09 ***
poly(x, 3)1 1.49842 0.12372 12.111 1.15e-10 ***
poly(x, 3)2 0.55768 0.12372 4.508 0.000215 ***
poly(x, 3)3 -0.06142 0.12372 -0.496 0.625015
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1237 on 20 degrees of freedom
Multiple R-squared: 0.8932, Adjusted R-squared: 0.8772
F-statistic: 55.75 on 3 and 20 DF, p-value: 6.793e-10
```

Functions to extract and/or compute different estimates and carry out tests are available. These functions take the model fit object as argument. Above I showed the use of `summary()`

. The next diagram shows most of the functions that can be used with linear-model-fit objects.

I expect you to be already (to some extent) familiar with the ANOVA table, coefficient estimates and plots of residuals. But, you could be still wordering about how does all this fit together? I will use plots to explain the most interesting estimates and results from model fitting. R’s functions in the diagram above, return numeric values. Thus, for creating the plots, I use these functions indirectly through functions from package ‘ggpmisc’. This mainly saves typing. The model formula, data and model fit function used for the plots below, are the same as in the example above.

The observations (artificial data in this case). Same plot as above, will be the base of subsequent plots.

The *fitted values*, as blue points, added. Fitted values are the “model predictions” for the observed values of *x*.

The residuals plotted as deviations from the fitted values. These residuals (in red) are used as the criterion for finding the best fit. The observations (in black) are given, but the position of the predicted values (in blue) are decided based on the minimization algorithm used as the criterion to fit a model.

The residuals plotted on their own. Plots like the one below can be used mainly as a tool to check that the assumptions, in our case those required to fit a linear model using least squares as a criterion are fulfilled, at least approximately (homogeneity of variace, in particular). In this plot the values on the *y* axis describe the length of red segments in the plot above and are shown as red points for each observed value of *x*.

The form of the *equation of the fitted curve* is given by the model formula passed as argument to `lm()`

, but the estimates of the value of the four parameters in the equation are those giving the curve that makes the sum of squares of of the residuals as small as possible.

```
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_fit_deviations(formula = y ~ poly(x, 3),
colour = "red",
arrow = arrow(length = unit(0.33, "lines"), ends = "both")) +
stat_fit_fitted(formula = y ~ poly(x, 3), colour = "blue") +
stat_poly_eq(formula = y ~ poly(x, 3),
mapping = use_label("eq"))
```

The *equation* describes a curve. The prediction line passes through the fitted values but interpolates between them as shown immediately below, and, as I will show farther below, can extrapolate outside the range of the observed *x* values.

The prediction line and \(R^2\), \(n\), \(F\textrm{-value}\) and \(P\textrm{-value}\). We include as annotations other parameter estimates of interest.

When additional details for each parameter estimate are of interest, a table is usually used. The *b*_{i} in the table are the coefficients in the equation in the plot above, and the *s*_{i} are the their corresponding standard errors, from which a *t*-value can be computed.

```
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = FALSE) +
stat_fit_tb(method.args = list(formula = y ~ poly(x, 3)),
tb.vars = c(Term = 1,
"italic(b)[i]" = 2,
"italic(s)[i]" = 3,
"italic(b)[i] / italic(s)[i]~`=`~italic(t)" = 4,
"italic(P)" = 5),
tb.params = c("x^0" = 1,
"x^1" = 2,
"x^2" = 3,
"x^3" = 4),
parse = TRUE,
label.x = "left")
```

Using the *b*_{i} together with the *s*_{i}, we can compute an estimate of the variation affecting the line as a whole. This makes possible the computation of a confidence band around the predicted curve. The prediction line plus its 95% confidence band are shown in the plot below. This band informs us about where the true line (for the sampled population as a whole) is likely to reside. The caveat is, that this estimated band and curve are true only given the form of the model that was used. If the form of model we have passed as argument can take a suitable shape and is flexible enough to describe features of interest the prediction and band are useful, otherwise they are of little use. This we can best judge by looking at a plot like the one below.

The residuals plotted as deviations from the prediction line. This is an alternative way of subjectively assessing the goodness of a model fit. Be aware, that how many observations fall outside the band depends on the number of replicates: the more observations we have, the more confident we can be about the estimated line, and the higher the proportion of observations that will be plotted outside the band. What we have plotted is a confidence band for the curve itself, not a density estimate for the probability of observing *x* and *y* value pairs.

The prediction line plus its 95% confidence band with extrapolation. When we extrapolate into “unknown territory” uncertainties rapidly increase and the band widens dramatically, and in a way that depends strongly on the model formula. (The range of the *y* axis is expanded!)

As shown in the diagram below, the overall approach is very similar to that used for linear models.

As shown in the diagram below, the overall approach is very similar to that used for linear models.

I mentioned above that the model chosen was given by the argument we passed to `lm()`

. Obviously, it is also possible to use other models. The choice of a model has to balance increasing the proportion of the total variation that is explained while still extracting the useful information out of the data separately from accounting for the noise. One approach is model selection, instead of simply finding the best set of parameter estimates, comparing the best fits possible with each equation form, and selecting the best performing one.

Model selection can be done manually by comparing models fitted individually or automatically using a stepwise approach. In the case of polynomials, one possibility is to compare polynomials of different degrees. Automatic selection is described by the diagram below.

Warning

The assumption of most usual model fitting procedures is that residuals are normally distributed and independent of the magnitude of the response variable, not that the observations themselves are normally distributed. Only in the simplest cases, such as comparing a mean against a constant, the residuals have the same distribution as the data but centred on zero instead of on the mean. In most other cases, this is not the case, as part of the variation among individual observations is accounted by terms in the fitted model, such as random effects, correlations or even variance covariates. The justification is that we use the residuals to estimate the *error* variation and the tests of significance are based on this error variance estimate.

It is also of little use to test for the statistical significance of the deviations from these assumptions, as we should not expect in the real world for the assumptions to be ever exactly fulfilled. The power of tests increases with replication, but the bias introduced into estimates does not depend on significance, but on the magnitude of the deviations. Furthermore, the higher the replication, the less the bias introduced in the estimates by deviations from the assumptions. In other words, the more replicates we have smaller deviations from assumptions become detectable as significant, while the importance of deviations decreases.

When there are very few replicates available, it is imposible to assess directly if observations come from a normally distributed population or not. In such cases, it can be wise to rely on previous information about the sampled population and sampling method used instead of on the current observations.

Tip

To create the plots above I used packages ‘ggplot2’ and ‘ggpmisc’. Galleries of plot examples using ‘ggpmisc’ and some other packages are available at the R for Photobiology web site. They contain R code folded in the same way as in this page.

```
---
title: "Model fitting in R"
subtitle: "An introduction using `lm()`"
author: "Pedro J. Aphalo"
date: 2023-05-30
date-modified: 2023-09-13
categories: [R, model fitting]
keywords: [predicted values, residual values, parameter estimates, model formulas]
format:
html:
code-fold: true
code-tools: true
abstract:
In this page I give a brief presentation on the mechanics of fitting statistical models to observed data using R. I use a lineal model (LM), polynomial regression, as example. I describe the most important values contained in the model fit object returned by function `lm()` and how to extract them. I rely heavily on diagrams and data plots.
---
## Introduction
Fitting a model to data consists in finding the parameter values that best
explain the data or observations. These parameter values are estimates of
those in a larger population of possible observations from which we have
drawn a sample or subset.
::: callout-tip
You may want to play with the my
[linear regression Shiny App](https://aphalo.shinyapps.io/class01-practical/#21)
to get a better sense of how the size of the response and error variation
affect the linear model fit to succesive sample draws from a normally
distributed population of artificial observations.
:::
Model selection involves comparing models that differ in their structure, i.e.,
in the formula that relates parameters and data.
In general, there are many different aspects of a model fit that we may be
interested in. The most computationally intensive step is the fitting itself.
Thus R's approach is to save the results from fitting, or _fitted model_, and
separately query it for different derived numerical and graphical output as
needed.
::: callout-tip
In this page some code chunks are "folded" so as to decrease the clutter. Above the R
output, text or plots, you will find a small triangle followed by "Code".
Clicking on the triangle "unfolds" the code chunk making visible the R code used
to produce the output shown.
The code in the chunks can be copied by clicking on the top right corner, where
an icon appears when the mouse cursor hovers over the code listing.
:::
## Linear models (LM)
### Fitting
Attach packages used for plotting.
```{r, message = FALSE}
library(ggpmisc)
library(broom)
theme_set(theme_bw(16))
```
Generate some artificial data.
```{r}
set.seed(19065)
# set.seed(4321) # change or comment out to get different pseudorandom values
x <- 1:24
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 2)
y <- y / max(y)
my.data <- data.frame(x,
y,
group = c("A", "B"),
y2 = y * c(1, 2) + c(0, 0.2),
block = c("a", "a", "b", "b"),
wt = sqrt(x))
```
Plot these data to have a look at them.
```{r}
ggplot(my.data, aes(x, y)) +
geom_point()
```
In R we first save the fitted model object into a variable, and in a second stage extract or query different results as needed.
```{mermaid}
%%| label: fig-barebones
%%| fig-cap: A minimalist diagram of model fitting in R.
%%{init: {"htmlLabels": true} }%%
flowchart LR
A(<i>model formula</i>) --> B[model fit\nfunction] --> C(model fit\nobject) --> D1['diagnostics' plots]
AA(<i>observations</i>) --> B
C --> D2[query methods]
```
Here we fit a third-degree polynomial regression using function `lm()` (linear model, LM). .
```{r}
#| code-fold: false
fm1 <- lm(formula = y ~ poly(x, 3), data = my.data)
```
Then we query this object saved in the variable, here `fm1`, with different methods,
for example `summary()`.
```{r}
#| code-fold: false
summary(fm1)
```
### The components of the model fit object
Functions to extract and/or compute different estimates and carry out tests are available. These functions take the model fit object as argument. Above I showed the use of `summary()`. The next diagram shows most of the functions that can be used with linear-model-fit objects.
```{mermaid}
%%| label: fig-lm
%%| fig-cap: A diagram of linear-model (LM) fitting in R.
%%{init: {"htmlLabels": true} }%%
flowchart LR
A1(<i>model formula</i>) --> B["<code>lm()</code>"] --> C(<code>lm</code> object) --> C1["<code>plot()</code>"]
A2(<i>observations</i>) --> B
A3(<i>weights</i>) -.-> B
C --> C2["<code>summary()</code>"]
C --> C3["<code>anova()</code>"]
C --> C4["<code>residuals()</code>"]
C --> C5["<code>fitted()</code>"]
C --> C6["<code>AIC()</code>"]
C --> C7["<code>BIC()</code>"]
C --> C8["<code>coefficients()</code>"]
C --> C11["<code>formula()</code>"]
C --> C12["<code>weights()</code>"]
C --> C9["<code>confint()</code>"]
C --> C10["<code>predict()</code>"]
BB("<i>new data</i>") --> C10
```
I expect you to be already (to some extent) familiar with the ANOVA table, coefficient estimates and plots of residuals. But, you could be still wordering about how does all this fit together? I will use plots to explain the most interesting estimates and results from model fitting. R's functions in the diagram above, return numeric values. Thus, for creating the plots, I use these functions indirectly through functions from package 'ggpmisc'. This mainly saves typing. The model formula, data and model fit function used for the plots below, are the same as in the example above.
The observations (artificial data in this case). Same plot as above, will be the base of subsequent plots.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point()
```
The _fitted values_, as blue points, added. Fitted values are the "model predictions" for the observed values of _x_.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_fit_fitted(formula = y ~ poly(x, 3),
colour = "blue")
```
The residuals plotted as deviations from the fitted values. These residuals (in red) are used as the criterion for finding the best fit. The observations (in black) are given, but the position of the predicted values (in blue) are decided based on the minimization algorithm used as the criterion to fit a model.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_fit_deviations(formula = y ~ poly(x, 3),
colour = "red",
arrow = arrow(length = unit(0.33, "lines"), ends = "both")) +
stat_fit_fitted(formula = y ~ poly(x, 3), colour = "blue")
```
The residuals plotted on their own. Plots like the one below can be used mainly as a tool to check that the assumptions, in our case those required to fit a linear model using least squares as a criterion are fulfilled, at least approximately (homogeneity of variace, in particular). In this plot the values on the _y_ axis describe the length of red segments in the plot above and are shown as red points for each observed value of _x_.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_hline(yintercept = 0, linetype = "dashed") +
stat_fit_residuals(formula = y ~ poly(x, 3),
colour = "red")
```
The form of the _equation of the fitted curve_ is given by the model formula passed as argument to `lm()`, but the estimates of the value of the four parameters in the equation are those giving the curve that makes the sum of squares of of the residuals as small as possible.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_fit_deviations(formula = y ~ poly(x, 3),
colour = "red",
arrow = arrow(length = unit(0.33, "lines"), ends = "both")) +
stat_fit_fitted(formula = y ~ poly(x, 3), colour = "blue") +
stat_poly_eq(formula = y ~ poly(x, 3),
mapping = use_label("eq"))
```
The _equation_ describes a curve. The prediction line passes through the fitted values but interpolates between them as shown immediately below, and, as I will show farther below, can extrapolate outside the range of the observed _x_ values.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = FALSE) +
stat_poly_eq(formula = y ~ poly(x, 3),
mapping = use_label("eq"))
```
The prediction line and $R^2$, $n$, $F\textrm{-value}$ and $P\textrm{-value}$. We include as annotations other parameter estimates of interest.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = FALSE) +
stat_poly_eq(formula = y ~ poly(x, 3),
mapping = use_label("eq")) +
stat_poly_eq(formula = y ~ poly(x, 3),
label.y = 0.89,
mapping = use_label(c("R2", "n", "F", "P")))
```
When additional details for each parameter estimate are of interest, a table is usually used. The _b_~i~ in the table are the coefficients in the equation in the plot above, and the _s_~i~ are the their corresponding standard errors, from which a _t_-value can be computed.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = FALSE) +
stat_fit_tb(method.args = list(formula = y ~ poly(x, 3)),
tb.vars = c(Term = 1,
"italic(b)[i]" = 2,
"italic(s)[i]" = 3,
"italic(b)[i] / italic(s)[i]~`=`~italic(t)" = 4,
"italic(P)" = 5),
tb.params = c("x^0" = 1,
"x^1" = 2,
"x^2" = 3,
"x^3" = 4),
parse = TRUE,
label.x = "left")
```
Using the _b_~i~ together with the _s_~i~, we can compute an estimate of the variation affecting the line as a whole. This makes possible the computation of a confidence band around the predicted curve. The prediction line plus its 95% confidence band are shown in the plot below. This band informs us about where the true line (for the sampled population as a whole) is likely to reside. The caveat is, that this estimated band and curve are true only given the form of the model that was used. If the form of model we have passed as argument can take a suitable shape and is flexible enough to describe features of interest the prediction and band are useful, otherwise they are of little use. This we can best judge by looking at a plot like the one below.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = TRUE)
```
The residuals plotted as deviations from the prediction line. This is an alternative way of subjectively assessing the goodness of a model fit. Be aware, that how many observations fall outside the band depends on the number of replicates: the more observations we have, the more confident we can be about the estimated line, and the higher the proportion of observations that will be plotted outside the band.
What we have plotted is a confidence band for the curve itself, not a density estimate for the probability of observing _x_ and _y_ value pairs.
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3), se = TRUE) +
stat_fit_deviations(formula = y ~ poly(x, 3), colour = "red",
arrow = arrow(length = unit(0.33, "lines"), ends = "both"))
```
The prediction line plus its 95% confidence band with extrapolation. When we extrapolate into "unknown territory" uncertainties rapidly increase and the band widens dramatically, and in a way that depends strongly on the model formula. (The range of the _y_ axis is expanded!)
```{r}
ggplot(my.data, aes(x = x, y = y)) +
geom_point() +
geom_vline(xintercept = range(my.data$x), linewidth = 0.33) +
stat_poly_line(formula = y ~ poly(x, 3), se = TRUE, fullrange = TRUE) +
expand_limits(x = c(-10, 40)) # an arbitrary range
```
## Generalised linear models (GLM)
As shown in the diagram below, the overall approach is very similar to that used for linear models.
```{mermaid}
%%| label: fig-glm
%%| fig-cap: A diagram of generalized-linear-model (GLM) fitting in R. Query methods as in @fig-lm.
%%{init: {"htmlLabels": true} }%%
flowchart LR
A1(<i>model formula</i>) --> B["<code>glm()</code>"] --> C(<code>glm</code> object) --> C1[query methods]
A2(<i>observations</i>) --> B
A3(<i>weights</i>) -.-> B
A4(<i>family</i> and <i>link</i>) --> B
```
## Non-linear models
As shown in the diagram below, the overall approach is very similar to that used for linear models.
```{mermaid}
%%| label: fig-nls
%%| fig-cap: A diagram of nonlinear least squares (NLS) model fitting by numerical approximation in R. Query methods similar to those in @fig-lm.
%%{init: {"htmlLabels": true} }%%
flowchart LR
A1(<i>model formula</i>) --> B["<code>nls()</code>"] --> C(<code>nls</code> object) --> C1[query methods]
A2(<i>observations</i>) --> B
A3(<i>weights</i>) -.-> B
A5(<i>starting values</i>) --> B
```
## Model selection flowchart
I mentioned above that the model chosen was given by the argument we passed to `lm()`. Obviously, it is also possible to use other models. The choice of a model has to balance increasing the proportion of the total variation that is explained while still extracting the useful information out of the data separately from accounting for the noise. One approach is model selection, instead of simply finding the best set of parameter estimates, comparing the best fits possible with each equation form, and selecting the best performing one.
Model selection can be done manually by comparing models fitted individually or automatically using a stepwise approach. In the case of polynomials, one possibility is to compare polynomials of different degrees. Automatic selection is described by the diagram below.
```{mermaid}
%%| label: fig-lm-step
%%| fig-cap: A diagram of linear-model (LM) fitting with stepwise model selection in R.
%%{init: {"htmlLabels": true} }%%
flowchart TB
A1(<i>model formula</i>) --> B["<code>lm()</code>"] --> C(<code>lm</code> object)
A2(<i>observations</i>) --> B
A3(<i>weights</i>) -.-> B
C --> CI[query methods]
C --> C1["<code>step()</code>"]
subgraph Z ["<strong>Model selection</strong>"]
C1 --> C3(<code>lm</code> object)
z1(most complex\n<i>model formula</i>) -.-> C1
z2(simplest nested\n<i>model formula</i>) -.-> C1
C3 --> CF[query methods]
end
style Z fill:#fff
```
::: callout-warning
The assumption of most usual model fitting procedures is that residuals are normally distributed and independent of the magnitude of the response variable, not that the observations themselves are normally distributed. Only in the simplest cases, such as comparing a mean against a constant, the residuals have the same distribution as the data but centred on zero instead of on the mean. In most other cases, this is not the case, as part of the variation among individual observations is accounted by terms in the fitted model, such as random effects, correlations or even variance covariates. The justification is that we use the residuals to estimate the _error_ variation and the tests of significance are based on this error variance estimate.
It is also of little use to test for the statistical significance of the deviations from these assumptions, as we should not expect in the real world for the assumptions to be ever exactly fulfilled. The power of tests increases with replication, but the bias introduced into estimates does not depend on significance, but on the magnitude of the deviations. Furthermore, the higher the replication, the less the bias introduced in the estimates by deviations from the assumptions. In other words, the more replicates we have smaller deviations from assumptions become detectable as significant, while the importance of deviations decreases.
When there are very few replicates available, it is imposible to assess directly if observations come from a normally distributed population or not. In such cases, it can be wise to rely on previous information about the sampled population and sampling method used instead of on the current observations.
:::
::: callout-tip
To create the plots above I used packages 'ggplot2' and 'ggpmisc'. [Galleries
of plot examples using 'ggpmisc' and some other
packages](https://www.r4photobiology.info/galleries.html) are available at the
[R for Photobiology web site](https://www.r4photobiology.info). They contain R
code folded in the same way as in this page.
:::
```