A Primer in Regression Techniques

All models are wrong, but some are useful – Box

Introduction

This chapter will provide all the foundations we need for the coming chapters. It is not intended as a general and all-exhaustive introduction to regression techniques, but rather the minimum requirement moving forwards. We will also hone our data processing and plotting skills.

Prerequisites

library(mefa4)                # data manipulation
library(mgcv)                 # GAMs
library(pscl)                 # zero-inflated models
library(lme4)                 # GLMMs
library(MASS)                 # Negative Binomial GLM
library(partykit)             # regression trees
library(intrval)              # interval magic
library(opticut)              # optimal partitioning
library(visreg)               # regression visualization
library(ResourceSelection)    # marginal effects
library(MuMIn)                # multi-model inference
source("src/functions.R")         # some useful stuff
load("data/josm-data.rda") # JOSM data

Let’s pick a species, Ovenbird (OVEN), that is quite common and abundant in the data set. We put together a little data set to work with:

spp <- "OVEN"

ytot <- Xtab(~ SiteID + SpeciesID, josm$counts[josm$counts$DetectType1 != "V",])
ytot <- ytot[,colSums(ytot > 0) > 0]
x <- data.frame(
  josm$surveys, 
  y=as.numeric(ytot[rownames(josm$surveys), spp]))
x$FOR <- x$Decid + x$Conif+ x$ConifWet # forest
x$AHF <- x$Agr + x$UrbInd + x$Roads # 'alienating' human footprint
x$WET <- x$OpenWet + x$ConifWet + x$Water # wet + water
cn <- c("Open", "Water", "Agr", "UrbInd", "SoftLin", "Roads", "Decid", 
  "OpenWet", "Conif", "ConifWet")
x$HAB <- droplevels(find_max(x[,cn])$index) # drop empty levels
x$DEC <- ifelse(x$HAB == "Decid", 1, 0)

table(x$y)

## 
##    0    1    2    3    4    5    6 
## 2493  883  656  363  132   29   13

Poisson null model

The null model states that the expected values of the count at all locations are identical: \(E[Y_i]=\lambda\) (\(i=1,...,n\)), where \(Y_i\) is a random variable that follows a Poisson distribution with mean \(\lambda\): \((Y_i \mid \lambda) \sim Poisson(\lambda)\). The observation (\(y_i\)) is a realization of the random variables \(Y\) at site \(i\), these observations are independent and identically distributed (i.i.d.), and we have \(n\) observations in total.

Saying the the distribution is Poisson is an assumption in itself. For example we assume that the variance equals the mean (\(V(\mu)=\mu\)).

mP0 <- glm(y ~ 1, data=x, family=poisson)
mean(x$y)

## [1] 0.8831254

mean(fitted(mP0))

## [1] 0.8831254

exp(coef(mP0))

## (Intercept) 
##   0.8831254

The family=poisson specification implicitly assumes that we use a logarithmic link functions, that is to say that \(log(\lambda) = \beta_0\), or equivalently: \(\lambda = e^{\beta_0}\). The mean of the observations equal the mean of the fitted values, as expected.

The logarithmic function is called the link function, its inverse, the exponential function is called the inverse link function. The model family has these convenently stored for us:

mP0$family

## 
## Family: poisson 
## Link function: log

mP0$family$linkfun

## function (mu) 
## log(mu)
## <environment: namespace:stats>

mP0$family$linkinv

## function (eta) 
## pmax(exp(eta), .Machine$double.eps)
## <environment: namespace:stats>

Inspect the summary

summary(mP0)

## 
## Call:
## glm(formula = y ~ 1, family = poisson, data = x)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.329  -1.329  -1.329   1.018   3.572  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -0.12429    0.01574  -7.895
##                       Pr(>|z|)    
## (Intercept) 0.0000000000000029 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7424.8  on 4568  degrees of freedom
## Residual deviance: 7424.8  on 4568  degrees of freedom
## AIC: 12573
## 
## Number of Fisher Scoring iterations: 6

Notice:

residuals are not symmetric around 0 and median equals the minimum
residual deviance much higher than residual degrees of freedom

These indicate that our parametric model (Poisson error distribution, constant expected value) is not quite right. See if we can improve that somehow and explain more of the variation.

We can pick an error distribution that would fit the residuals around the constant expected value better (i.e. using random effects). But this would be a really useless exercise because we would not learn about what is driving the variation in the counts. Thus we would have a really hard time predicting abundance of the species for unsampled locations. We would be right on average, but we wouldn’t be able to tell how abundance varies along e.g. a disturbance gradient.

An alternative approach would be to find predictors that could explain the variation.

Exploring covariates

Now, in the absence of info about species biology, we are looking at a blank page. How should we proceed? What kind of covariate (linear predictor) should we use? We can do a quick and dirty exploration to see what are the likely candidates. We use a regression tree (ctree refers to conditional trees). It is a non-parametric method based on binary recursive partitioning in a conditional inference framework.

This means that binary splits are made along the predictor variables, and the explanatory power of the split is assessed based on how it maximized difference between the splits and minimized the difference inside the buckets created by the splits.

It is called conditional, because every new split is conditional on the previous splits (difference can be measured in many different ways, think e.g. sum of squares). The stopping rule in this implementation is based on permutation tests (see ?ctree or details and references).

mCT <- ctree(y ~ Open + Water + Agr + UrbInd + SoftLin + Roads + 
  Decid + OpenWet + Conif + ConifWet, data=x)
plot(mCT)

The model can be seen as a piecewise constant regression, where each bucket (defined by the splits along the tree) yields a constant predictions based on the mean of the observations in the bucket. Any new data classified into the same bucket will get the same value. There is no notion of uncertainty (confidence or prediction intervals) in this nonparameric model.

But we see something very useful: the proportion of deciduous forest in the landscape seems to be vary influential for Ovenbird abundance. The closer the variable appears to tree base of the tree the more influence it has.

Single covariate

With this new found knowledge, let’s fit a parametric (Poisson) linear model using Decid as a predictor:

mP1 <- glm(y ~ Decid, data=x, family=poisson)
mean(x$y)

## [1] 0.8831254

mean(fitted(mP1))

## [1] 0.8831254

coef(mP1)

## (Intercept)       Decid 
##   -1.164268    2.133832

Same as before, the mean of the observations equal the mean of the fitted values. But instead of only the intercept, now we have 2 coefficients estimated. Our linear predictor thus looks like: \(log(\lambda_i) = \beta_0 + \beta_1 x_{1i}\). This means that expected abundance is\(e^{\beta_0}\) where Decid=0, \(e^{\beta_0}e^{\beta_1}\) where Decid=1, and \(e^{\beta_0+\beta_1 x_{1}}\) in between.

The relationship can be visualized by plotting the fitted values against the predictor, or using the coefficients to make predictions using our formula:

## make a sequence between 0 and 1
dec <- seq(from=0, to=1, by=0.01)
## predict lambda
lam <- exp(coef(mP1)[1] + coef(mP1)[2] * dec)

plot(fitted(mP1) ~ Decid, x, pch=19, col="grey") # fitted
lines(lam ~ dec, col=2) # our predicted
rug(x$Decid) # observed x values

The model summary tells us that resudials are not quite right (we would expect 0 median and symmertic tails), in line with residual deviance still being higher than residual degrees of freedom (these should be close if the Poisson assumption holds, but it is much better than what we saw for the null model).

However, we learned something important. The Decid effect is significant (meaning that the effect size is large compared to the standard error):

summary(mP1)

## 
## Call:
## glm(formula = y ~ Decid, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2909  -0.9772  -0.7901   0.4687   4.1970  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -1.16427    0.03515  -33.12
## Decid        2.13383    0.05369   39.74
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## Decid       <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7424.8  on 4568  degrees of freedom
## Residual deviance: 5736.9  on 4567  degrees of freedom
## AIC: 10887
## 
## Number of Fisher Scoring iterations: 6

Note: we see a significant (<0.05) \(P\)-value for the intercept as well. It is totally meaningless. That \(P\)-value relates to the hull hypothesis of the intercept (\(\beta_0\)) being 0. There is nothing special about that, it is like saying the average abundance is different from 1. But if \(\beta_1\) is significantly different from 0, it means that the main effect has non-negligible effect on the mean abundance.

We can compare this model to the null (constant, intercept-only) model:

AIC(mP0, mP1)

##     df      AIC
## mP0  1 12572.97
## mP1  2 10887.10

BIC(mP0, mP1)

##     df      BIC
## mP0  1 12579.40
## mP1  2 10899.95

model.sel(mP0, mP1)

## Model selection table 
##     (Intrc) Decid       family df    logLik    AICc
## mP1 -1.1640 2.134 poisson(log)  2 -5441.550 10887.1
## mP0 -0.1243       poisson(log)  1 -6285.486 12573.0
##       delta weight
## mP1    0.00      1
## mP0 1685.87      0
## Models ranked by AICc(x)

R2dev(mP0, mP1)

##           R2    R2adj Deviance     Dev0     DevR
## mP0    0.000    0.000    0.000 7424.779 7424.779
## mP1    0.227    0.227 1687.873 7424.779 5736.906
##          df0      dfR             p_value    
## mP0 4568.000 4568.000 <0.0000000000000002 ***
## mP1 4568.000 4567.000 <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC uses the negative log likelihood and the number of parameters as penalty. Smaller value indicate a model that is closer to the (unknowable) true model (caveat: this statement is true only asymptotically, i.e. it holds for very large sample sizes). For small samples, we often use BIC (more penalty for complex models when sample size is small), or AICc (as in MuMIn::model.sel()).

The other little table returned by R2dev() shows deviance based (quasi) \(R^2\) and adjusted \(R^2\) for some GLM classes, just for the sake of completeness. The Chi-squared based test indicates good fit when the \(p\)-value is high (probability of being distributed according the Poisson).

None of these two models is a particularly good fit in terms of the parametric distribution. This, however does not mean these models are useless for making inferential statements about ovenbirds. How useful these statements are, that is another question. Let’s dive into confidence and prediction intervals a bit.

B <- 2000
alpha <- 0.05

xnew <- data.frame(Decid=seq(0, 1, 0.01))
CI0 <- predict_sim(mP0, xnew, interval="confidence", level=1-alpha, B=B)
PI0 <- predict_sim(mP0, xnew, interval="prediction", level=1-alpha, B=B)
CI1 <- predict_sim(mP1, xnew, interval="confidence", level=1-alpha, B=B)
PI1 <- predict_sim(mP1, xnew, interval="prediction", level=1-alpha, B=B)

## nominal coverage is 95%
sum(x$y %[]% predict_sim(mP0, interval="prediction", level=1-alpha, B=B)[,c("lwr", "upr")]) / nrow(x)

## [1] 0.9619173

sum(x$y %[]% predict_sim(mP1, interval="prediction", level=1-alpha, B=B)[,c("lwr", "upr")]) / nrow(x)

## [1] 0.9702342

An estimate is said to have good coverage when the prediction intervals encompass the right amount of the observations. When the nominal level is 95% (\(100 \times (1-\alpha)\), where \(\alpha\) is Type I. error rate, rejecting a true null hypothesis), we expect 95% of the observations fall within the 95% prediction interval. The prediction interval includes the uncertainty around the coefficients (confidence intervals, uncertainty in \(\hat{\lambda}\)) and the stochasticity coming from the Poisson distribution (\(Y_i \sim Poisson(\hat{\lambda})\)).

The code above calculate the confidence and prediction intervals for the two models. We also compared the prediction intervals and the nominal levels, and those were quite close (ours being a bit more conservative), hinting that maybe the Poisson distributional assumption is not very bad after all, but we’ll come back to this later.

Let’s see our confidence and prediction intervals for the two models:

yj <- jitter(x$y, 0.5)

plot(yj ~ Decid, x, xlab="Decid", ylab="E[Y]",
  ylim=c(0, max(PI1$upr)+1), pch=19, col="#bbbbbb33", main="P0 vs P1")

polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(PI0$lwr, rev(PI0$upr)), border=NA, col="#0000ff44")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(CI0$lwr, rev(CI0$upr)), border=NA, col="#0000ff44")
lines(CI0$fit ~ xnew$Decid, lty=1, col=4)

polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(PI1$lwr, rev(PI1$upr)), border=NA, col="#ff000044")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(CI1$lwr, rev(CI1$upr)), border=NA, col="#ff000044")
lines(CI1$fit ~ xnew$Decid, lty=1, col=2)

legend("topleft", bty="n", fill=c("#0000ff44", "#ff000044"), lty=1, col=c(4,2),
  border=NA, c("Null", "Decid"))

Exercise

What can we conclude from this plot?

Coverage is comparable, so what is the difference then?
Which model should I use for prediction and why? (Hint: look at the non overlapping regions.)
Confidence intervals are super narrow. Is that of any use? What does CI depend on? (Don’t read the next question yet!)
Is PI expected to change with sample size?

Additive model

Generalized additive models (GAMs) are semi-parametric, meaning that parametric assumptions apply, but responses are modelled more flexibly.

mGAM <- mgcv::gam(y ~ s(Decid), x, family=poisson)
summary(mGAM)

## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## y ~ s(Decid)
## 
## Parametric coefficients:
##             Estimate Std. Error z value
## (Intercept)  -0.5606     0.0283  -19.81
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df Chi.sq             p-value    
## s(Decid) 8.557  8.937   1193 <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.239   Deviance explained =   29%
## UBRE = 0.15808  Scale est. = 1         n = 4569

plot(mGAM)

Compare the 4 predictions we have so far

fitCT <- predict(mCT, x[order(x$Decid),])
fitGAM <- predict(mGAM, xnew, type="response")

plot(yj ~ Decid, x, xlab="Decid", ylab="E[Y]",
  ylim=c(0, max(PI1$upr)+1), pch=19, col="#bbbbbb33", main="P0")
lines(CI0$fit ~ xnew$Decid, lty=1, col=1)
lines(CI1$fit ~ xnew$Decid, lty=1, col=2)
lines(fitCT ~ x$Decid[order(x$Decid)], lty=1, col=3)
lines(fitGAM ~ xnew$Decid, lty=1, col=4)
legend("topleft", bty="n", lty=1, col=1:4,
  legend=c("Null", "Decid", "ctree", "GAM"))

Exercise

Play with GAM and other variables to understand response curves: plot(mgcv::gam(y ~ s(<variable_name>), data=x, family=poisson))

Nonlinear terms

We can use polynomial terms to approximate the GAM fit:

mP12 <- glm(y ~ Decid + I(Decid^2), data=x, family=poisson)
mP13 <- glm(y ~ Decid + I(Decid^2) + I(Decid^3), data=x, family=poisson)
mP14 <- glm(y ~ Decid + I(Decid^2) + I(Decid^3) + I(Decid^4), data=x, family=poisson)
model.sel(mP1, mP12, mP13, mP14, mGAM)

## Model selection table 
##        (Int)    Dcd  Dcd^2  Dcd^3  Dcd^4 s(Dcd)
## mGAM -0.5606                                  +
## mP14 -2.6640 16.640 -38.60 41.470 -16.31       
## mP13 -2.3910 11.400 -16.31  8.066              
## mP12 -1.9240  6.259  -3.97                     
## mP1  -1.1640  2.134                            
##            family class df    logLik    AICc  delta
## mGAM poisson(log)   gam  9 -5209.172 10437.5   0.00
## mP14 poisson(log)   glm  5 -5215.401 10440.8   3.31
## mP13 poisson(log)   glm  4 -5226.456 10460.9  23.42
## mP12 poisson(log)   glm  3 -5268.767 10543.5 106.04
## mP1  poisson(log)   glm  2 -5441.550 10887.1 449.60
##      weight
## mGAM   0.84
## mP14   0.16
## mP13   0.00
## mP12   0.00
## mP1    0.00
## Models ranked by AICc(x)

Not a surprise that the most complex model won. GAM was more complex than that (see df column).

pr <- cbind(
  predict(mP1, xnew, type="response"),
  predict(mP12, xnew, type="response"),
  predict(mP13, xnew, type="response"),
  predict(mP14, xnew, type="response"),
  fitGAM)
matplot(xnew$Decid, pr, lty=1, type="l",
  xlab="Decid", ylab="E[Y]")
legend("topleft", lty=1, col=1:5, bty="n",
  legend=c("Linear", "Quadratic", "Cubic", "Quartic", "GAM"))

Let’s see how these affect our prediction intervals:

CI12 <- predict_sim(mP12, xnew, interval="confidence", level=1-alpha, B=B)
PI12 <- predict_sim(mP12, xnew, interval="prediction", level=1-alpha, B=B)
CI13 <- predict_sim(mP13, xnew, interval="confidence", level=1-alpha, B=B)
PI13 <- predict_sim(mP13, xnew, interval="prediction", level=1-alpha, B=B)
CI14 <- predict_sim(mP14, xnew, interval="confidence", level=1-alpha, B=B)
PI14 <- predict_sim(mP14, xnew, interval="prediction", level=1-alpha, B=B)

op <- par(mfrow=c(2,2))
plot(yj ~ Decid, x, xlab="Decid", ylab="E[Y]",
  ylim=c(0, max(PI1$upr)+1), pch=19, col="#bbbbbb33", main="Linear")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(PI1$lwr, rev(PI1$upr)), border=NA, col="#0000ff44")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(CI1$lwr, rev(CI1$upr)), border=NA, col="#0000ff88")
lines(CI1$fit ~ xnew$Decid, lty=1, col=4)
lines(fitGAM ~ xnew$Decid, lty=2, col=1)

plot(yj ~ Decid, x, xlab="Decid", ylab="E[Y]",
  ylim=c(0, max(PI1$upr)+1), pch=19, col="#bbbbbb33", main="Quadratic")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(PI12$lwr, rev(PI12$upr)), border=NA, col="#0000ff44")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(CI12$lwr, rev(CI12$upr)), border=NA, col="#0000ff88")
lines(CI12$fit ~ xnew$Decid, lty=1, col=4)
lines(fitGAM ~ xnew$Decid, lty=2, col=1)

plot(yj ~ Decid, x, xlab="Decid", ylab="E[Y]",
  ylim=c(0, max(PI1$upr)+1), pch=19, col="#bbbbbb33", main="Cubic")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(PI13$lwr, rev(PI13$upr)), border=NA, col="#0000ff44")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(CI13$lwr, rev(CI13$upr)), border=NA, col="#0000ff88")
lines(CI13$fit ~ xnew$Decid, lty=1, col=4)
lines(fitGAM ~ xnew$Decid, lty=2, col=1)

plot(yj ~ Decid, x, xlab="Decid", ylab="E[Y]",
  ylim=c(0, max(PI1$upr)+1), pch=19, col="#bbbbbb33", main="Quartic")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(PI14$lwr, rev(PI14$upr)), border=NA, col="#0000ff44")
polygon(c(xnew$Decid, rev(xnew$Decid)),
  c(CI14$lwr, rev(CI14$upr)), border=NA, col="#0000ff88")
lines(CI14$fit ~ xnew$Decid, lty=1, col=4)
lines(fitGAM ~ xnew$Decid, lty=2, col=1)

par(op)

Categorical variables

Categorical variables are expanded into a model matrix before parameter estimation. The model matrix usually contains indicator variables for each level (value 1 when factor value equals a particular label, 0 otherwise) except for the reference category (check relevel if you want to change the reference category).

The estimate for the reference category comes from the intercept, the rest of the estimates are relative to the reference category. In the log-linear model example this means a ratio.

head(model.matrix(~DEC, x))

##         (Intercept) DEC
## CL10102           1   1
## CL10106           1   0
## CL10108           1   0
## CL10109           1   1
## CL10111           1   1
## CL10112           1   1

mP2 <- glm(y ~ DEC, data=x, family=poisson)
summary(mP2)

## 
## Call:
## glm(formula = y ~ DEC, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6914  -0.9210  -0.9210   0.4489   4.5433  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -0.85768    0.03080  -27.84
## DEC          1.21565    0.03584   33.92
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## DEC         <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7424.8  on 4568  degrees of freedom
## Residual deviance: 6095.5  on 4567  degrees of freedom
## AIC: 11246
## 
## Number of Fisher Scoring iterations: 6

coef(mP2)

## (Intercept)         DEC 
##  -0.8576802   1.2156499

The estimate for a non-deciduous landscape is \(e^{\beta_0}\), and it is \(e^{\beta_0}e^{\beta_1}\) for deciduous landscapes. (Of course such binary classification at the landscape (1 km\(^2\)) level doesn’t really makes sense for various reasons.)

boxplot(Decid ~ DEC, x)

model.sel(mP1, mP2)

## Model selection table 
##     (Intrc) Decid   DEC       family df    logLik
## mP1 -1.1640 2.134       poisson(log)  2 -5441.550
## mP2 -0.8577       1.216 poisson(log)  2 -5620.871
##        AICc  delta weight
## mP1 10887.1   0.00      1
## mP2 11245.7 358.64      0
## Models ranked by AICc(x)

R2dev(mP1, mP2)

##           R2    R2adj Deviance     Dev0     DevR
## mP1    0.227    0.227 1687.873 7424.779 5736.906
## mP2    0.179    0.179 1329.230 7424.779 6095.549
##          df0      dfR             p_value    
## mP1 4568.000 4567.000 <0.0000000000000002 ***
## mP2 4568.000 4567.000 <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Having estimates for each land cover type improves the model, but the model using continuous variable is still better

mP3 <- glm(y ~ HAB, data=x, family=poisson)
summary(mP3)

## 
## Call:
## glm(formula = y ~ HAB, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6914  -0.8732  -0.8168   0.4489   4.8315  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.3863     0.5774  -2.401  0.01634 * 
## HABWater      1.0296     0.6901   1.492  0.13568   
## HABAgr        0.6931     0.9129   0.759  0.44767   
## HABUrbInd     0.1335     0.7638   0.175  0.86121   
## HABRoads    -10.9163   201.2853  -0.054  0.95675   
## HABDecid      1.7443     0.5776   3.020  0.00253 **
## HABOpenWet    0.4220     0.5914   0.714  0.47549   
## HABConif      0.9128     0.5792   1.576  0.11504   
## HABConifWet   0.2883     0.5790   0.498  0.61852   
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7424.8  on 4568  degrees of freedom
## Residual deviance: 5997.2  on 4560  degrees of freedom
## AIC: 11161
## 
## Number of Fisher Scoring iterations: 10

model.sel(mP1, mP2, mP3)

## Model selection table 
##     (Intrc) Decid   DEC HAB       family df
## mP1 -1.1640 2.134           poisson(log)  2
## mP3 -1.3860               + poisson(log)  9
## mP2 -0.8577       1.216     poisson(log)  2
##        logLik    AICc  delta weight
## mP1 -5441.550 10887.1   0.00      1
## mP3 -5571.710 11161.5 274.36      0
## mP2 -5620.871 11245.7 358.64      0
## Models ranked by AICc(x)

R2dev(mP1, mP2, mP3)

##           R2    R2adj Deviance     Dev0     DevR
## mP1    0.227    0.227 1687.873 7424.779 5736.906
## mP2    0.179    0.179 1329.230 7424.779 6095.549
## mP3    0.192    0.191 1427.552 7424.779 5997.226
##          df0      dfR             p_value    
## mP1 4568.000 4567.000 <0.0000000000000002 ***
## mP2 4568.000 4567.000 <0.0000000000000002 ***
## mP3 4568.000 4560.000 <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The prediction in this case would look like: \(log(\lambda_i)=\beta_0 + \sum_{j=1}^{k-1} \beta_j x_{ji}\), where we have \(k\) factor levels (and \(k-1\) indicator variables besides the intercept).

Here is a general way of calculating fitted values or making predictions based on the design matrix (X) and the coefficients (b) (column ordering in X must match the elements in b) given a parametric log-linear model object and data frame df (the code won’t run as is, object is just a placeholder for your GLM model object):

b <- coef(object)
X <- model.matrix(formula(object), df)
exp(X %*% b)

Multiple main effects

We can keep adding variables to the model in a forwards-selection fashion. add1 adds variables one at a time, selecting from the scope defined by the formula:

scope <- as.formula(paste("~ FOR + WET + AHF +", paste(cn, collapse="+")))
tmp <- add1(mP1, scope)
tmp$AIC_drop <- tmp$AIC-tmp$AIC[1] # current model
tmp[order(tmp$AIC),]

## Single term additions
## 
## Model:
## y ~ Decid
##          Df Deviance   AIC AIC_drop
## ConifWet  1   5638.4 10791  -96.527
## Conif     1   5685.5 10838  -49.444
## WET       1   5686.8 10839  -48.056
## Water     1   5721.2 10873  -13.748
## FOR       1   5723.9 10876  -10.990
## OpenWet   1   5728.0 10880   -6.885
## Open      1   5730.2 10882   -4.733
## Roads     1   5733.0 10885   -1.893
## AHF       1   5734.3 10886   -0.654
## <none>        5736.9 10887    0.000
## Agr       1   5736.1 10888    1.182
## UrbInd    1   5736.4 10889    1.507
## SoftLin   1   5736.5 10889    1.598

It looks like ConifWet is the best covariate to add next because it leads to the biggest drop in AIC, and both effects are significant.

mP4 <- glm(y ~ Decid + ConifWet, data=x, family=poisson)
summary(mP4)

## 
## Call:
## glm(formula = y ~ Decid + ConifWet, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2372  -0.9962  -0.6786   0.4473   4.4390  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -0.70142    0.05563 -12.608
## Decid        1.62240    0.07187  22.574
## ConifWet    -0.97845    0.09926  -9.858
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## Decid       <0.0000000000000002 ***
## ConifWet    <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7424.8  on 4568  degrees of freedom
## Residual deviance: 5638.4  on 4566  degrees of freedom
## AIC: 10791
## 
## Number of Fisher Scoring iterations: 6

The drop1 function is the opposite of add1, it assesses which term should be dropped from a more saturated model:

formula_all <- y ~ Open + Agr + UrbInd + SoftLin + Roads + 
  Decid + OpenWet + Conif + ConifWet + 
  OvernightRain + TSSR + DAY + Longitude + Latitude

tmp <- drop1(glm(formula_all, data=x, family=poisson))
tmp$AIC_drop <- tmp$AIC-tmp$AIC[1] # current model
tmp[order(tmp$AIC),]

## Single term deletions
## 
## Model:
## y ~ Open + Agr + UrbInd + SoftLin + Roads + Decid + OpenWet + 
##     Conif + ConifWet + OvernightRain + TSSR + DAY + Longitude + 
##     Latitude
##               Df Deviance   AIC AIC_drop
## OvernightRain  1   5500.0 10674   -1.999
## Roads          1   5500.1 10674   -1.894
## SoftLin        1   5500.4 10675   -1.560
## Agr            1   5500.6 10675   -1.356
## <none>             5500.0 10676    0.000
## Decid          1   5505.0 10679    3.006
## OpenWet        1   5505.1 10679    3.146
## Conif          1   5508.0 10682    6.042
## UrbInd         1   5510.7 10685    8.740
## Longitude      1   5518.5 10693   16.520
## TSSR           1   5523.8 10698   21.822
## ConifWet       1   5528.4 10703   26.431
## DAY            1   5528.6 10703   26.651
## Open           1   5530.7 10705   28.732
## Latitude       1   5580.2 10754   78.186

The step function can be used to automatically select the best model based on adding/dropping terms:

mPstep <- step(glm(formula_all, data=x, family=poisson), 
  trace=0) # use trace=1 to see all the steps
summary(mPstep)

## 
## Call:
## glm(formula = y ~ Open + UrbInd + Decid + OpenWet + Conif + ConifWet + 
##     TSSR + DAY + Longitude + Latitude, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7634  -0.9859  -0.6739   0.4507   4.6238  
## 
## Coefficients:
##              Estimate Std. Error z value
## (Intercept) -5.882931   1.302227  -4.518
## Open        -3.474283   0.658674  -5.275
## UrbInd      -1.668835   0.542157  -3.078
## Decid        0.833721   0.259569   3.212
## OpenWet     -0.740756   0.302380  -2.450
## Conif       -0.885584   0.265661  -3.334
## ConifWet    -1.894231   0.271697  -6.972
## TSSR        -1.234156   0.249837  -4.940
## DAY         -2.879703   0.526861  -5.466
## Longitude    0.038307   0.008766   4.370
## Latitude     0.209301   0.023092   9.064
##                         Pr(>|z|)    
## (Intercept)     0.00000625468259 ***
## Open            0.00000013299998 ***
## UrbInd                  0.002083 ** 
## Decid                   0.001318 ** 
## OpenWet                 0.014295 *  
## Conif                   0.000858 ***
## ConifWet        0.00000000000313 ***
## TSSR            0.00000078186410 ***
## DAY             0.00000004608976 ***
## Longitude       0.00001242107713 ***
## Latitude    < 0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7424.8  on 4568  degrees of freedom
## Residual deviance: 5501.1  on 4558  degrees of freedom
## AIC: 10669
## 
## Number of Fisher Scoring iterations: 6

Interaction

When we consider interactions between two variables (say \(x_1\) and \(x_2\)), we refer to adding another variable to the model matrix that is a product of the two variables (\(x_{12}=x_1 x_2\)):

head(model.matrix(~x1 * x2, data.frame(x1=1:4, x2=10:7)))

##   (Intercept) x1 x2 x1:x2
## 1           1  1 10    10
## 2           1  2  9    18
## 3           1  3  8    24
## 4           1  4  7    28

Let’s consider interaction between our two predictors from before:

mP5 <- glm(y ~ Decid * ConifWet, data=x, family=poisson)
summary(mP5)

## 
## Call:
## glm(formula = y ~ Decid * ConifWet, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0813  -1.0222  -0.4839   0.3740   4.3213  
## 
## Coefficients:
##                Estimate Std. Error z value
## (Intercept)    -0.56041    0.05659  -9.902
## Decid           1.21250    0.07818  15.509
## ConifWet       -2.31239    0.14901 -15.519
## Decid:ConifWet  5.34611    0.35658  14.993
##                           Pr(>|z|)    
## (Intercept)    <0.0000000000000002 ***
## Decid          <0.0000000000000002 ***
## ConifWet       <0.0000000000000002 ***
## Decid:ConifWet <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7424.8  on 4568  degrees of freedom
## Residual deviance: 5395.2  on 4565  degrees of freedom
## AIC: 10549
## 
## Number of Fisher Scoring iterations: 6

model.sel(mP0, mP1, mP4, mP5)

## Model selection table 
##       (Int)   Dcd     CnW CnW:Dcd       family df
## mP5 -0.5604 1.213 -2.3120   5.346 poisson(log)  4
## mP4 -0.7014 1.622 -0.9785         poisson(log)  3
## mP1 -1.1640 2.134                 poisson(log)  2
## mP0 -0.1243                       poisson(log)  1
##        logLik    AICc   delta weight
## mP5 -5270.689 10549.4    0.00      1
## mP4 -5392.287 10790.6  241.19      0
## mP1 -5441.550 10887.1  337.72      0
## mP0 -6285.486 12573.0 2023.59      0
## Models ranked by AICc(x)

The model with the interaction is best supported, but how do we make sense of this relationship? We can’t easily visualize it in a single plot. We can either

fix all variables (at their mean/meadian) and see how the response is changing along a single variable: this is called a conditional effect (conditional on fixing other variables), this is what visreg::visreg() does
or plot the fitted values against the predictor variable (one at a time), this is called a marginal effect, and this is what ResourceSelection::mep() does

visreg(mP5, scale="response", xvar="ConifWet", by="Decid")

mep(mP5)

Let’s use GAM to fit a bivariate spline:

mGAM2 <- mgcv::gam(y ~ s(Decid, ConifWet), data=x, family=poisson)
plot(mGAM2, scheme=2, rug=FALSE)

Final battle of Poisson models:

model.sel(mP0, mP1, mP12, mP13, mP14, mP2, mP3, mP4, mP5, mGAM, mGAM2)

## Model selection table 
##         (Int)    Dcd  Dcd^2  Dcd^3  Dcd^4   DEC HAB
## mGAM2 -0.6242                                      
## mGAM  -0.5606                                      
## mP14  -2.6640 16.640 -38.60 41.470 -16.31          
## mP13  -2.3910 11.400 -16.31  8.066                 
## mP12  -1.9240  6.259  -3.97                        
## mP5   -0.5604  1.213                               
## mP4   -0.7014  1.622                               
## mP1   -1.1640  2.134                               
## mP3   -1.3860                                     +
## mP2   -0.8577                             1.216    
## mP0   -0.1243                                      
##           CnW CnW:Dcd s(Dcd) s(Dcd,CnW)       family
## mGAM2                                 + poisson(log)
## mGAM                       +            poisson(log)
## mP14                                    poisson(log)
## mP13                                    poisson(log)
## mP12                                    poisson(log)
## mP5   -2.3120   5.346                   poisson(log)
## mP4   -0.9785                           poisson(log)
## mP1                                     poisson(log)
## mP3                                     poisson(log)
## mP2                                     poisson(log)
## mP0                                     poisson(log)
##       class df    logLik    AICc   delta weight
## mGAM2   gam 27 -5160.178 10375.8    0.00      1
## mGAM    gam  9 -5209.172 10437.5   61.73      0
## mP14    glm  5 -5215.401 10440.8   65.05      0
## mP13    glm  4 -5226.456 10460.9   85.15      0
## mP12    glm  3 -5268.767 10543.5  167.77      0
## mP5     glm  4 -5270.689 10549.4  173.62      0
## mP4     glm  3 -5392.287 10790.6  414.81      0
## mP1     glm  2 -5441.550 10887.1  511.33      0
## mP3     glm  9 -5571.710 11161.5  785.69      0
## mP2     glm  2 -5620.871 11245.7  869.98      0
## mP0     glm  1 -6285.486 12573.0 2197.20      0
## Models ranked by AICc(x)

R2dev(mP0, mP1, mP12, mP13, mP14, mP2, mP3, mP4, mP5, mGAM, mGAM2)

##             R2    R2adj Deviance     Dev0     DevR
## mP0      0.000    0.000    0.000 7424.779 7424.779
## mP1      0.227    0.227 1687.873 7424.779 5736.906
## mP12     0.274    0.274 2033.438 7424.779 5391.341
## mP13     0.285    0.285 2118.060 7424.779 5306.719
## mP14     0.288    0.288 2140.170 7424.779 5284.609
## mP2      0.179    0.179 1329.230 7424.779 6095.549
## mP3      0.192    0.191 1427.552 7424.779 5997.226
## mP4      0.241    0.240 1786.399 7424.779 5638.379
## mP5      0.273    0.273 2029.595 7424.779 5395.184
## mGAM     0.290    0.289 2152.629 7424.779 5272.150
## mGAM2    0.303    0.299 2250.617 7424.779 5174.162
##            df0      dfR              p_value    
## mP0   4568.000 4568.000 < 0.0000000000000002 ***
## mP1   4568.000 4567.000 < 0.0000000000000002 ***
## mP12  4568.000 4566.000 < 0.0000000000000002 ***
## mP13  4568.000 4565.000   0.0000000000000763 ***
## mP14  4568.000 4564.000   0.0000000000003352 ***
## mP2   4568.000 4567.000 < 0.0000000000000002 ***
## mP3   4568.000 4560.000 < 0.0000000000000002 ***
## mP4   4568.000 4566.000 < 0.0000000000000002 ***
## mP5   4568.000 4565.000 < 0.0000000000000002 ***
## mGAM  4568.000 4559.000   0.0000000000005476 ***
## mGAM2 4568.000 4539.000   0.0000000000854279 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Of course, the most complex model wins but the Chi-square test is still significant (indicating lack of fit). Let’s try different error distribution.

Different error distributions

We will use the 2-variable model with interaction:

mP <- glm(y ~ Decid * ConifWet, data=x, family=poisson)

Let us try the Negative Binomial distribution first. This distribution is related to Binomial experiments (number of trials required to get a fixed number of successes given a binomial probability). It can also be derived as a mixture of Poisson and Gamma distributions (see Wikipedia), which is a kind of hierarchical model. In this case, the Gamma distribution acts as an i.i.d. random effect for the intercept: \(Y_i\sim Poisson(\lambda_i)\), \(\lambda_i \sim Gamma(e^{\beta_0+\beta_1 x_{1i}}, \gamma)\), where \(\gamma\) is the Gamma variance.

The Negative Binomial variance (using the parametrization common in R functions) is a function of the mean and the scale: \(V(\mu) = \mu + \mu^2/\theta\).

mNB <- glm.nb(y ~ Decid * ConifWet, data=x)
summary(mNB)

## 
## Call:
## glm.nb(formula = y ~ Decid * ConifWet, data = x, init.theta = 3.5900635, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8605  -0.9847  -0.4505   0.3174   3.8030  
## 
## Coefficients:
##                Estimate Std. Error z value
## (Intercept)    -0.59051    0.06297  -9.378
## Decid           1.24590    0.08919  13.969
## ConifWet       -2.35454    0.16045 -14.674
## Decid:ConifWet  5.69452    0.40095  14.203
##                           Pr(>|z|)    
## (Intercept)    <0.0000000000000002 ***
## Decid          <0.0000000000000002 ***
## ConifWet       <0.0000000000000002 ***
## Decid:ConifWet <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.5901) family taken to be 1)
## 
##     Null deviance: 6089.3  on 4568  degrees of freedom
## Residual deviance: 4387.7  on 4565  degrees of freedom
## AIC: 10440
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.590 
##           Std. Err.:  0.425 
## 
##  2 x log-likelihood:  -10430.448

Next, we look at zero-inflated models. In this case, the mixture distribution is a Bernoulli distribution and a count distribution (Poisson or Negative Binomial, for example). The 0’s can come from both the zero and the count distributions, whereas the >0 values can only come from the count distribution: \(A_i \sim Bernoulli(\varphi)\), \(Y_i \sim Poisson(A_i \lambda_i)\).

The zero part of the zero-inflated models are often parametrized as probability of zero (\(1-\varphi\)), as in the pscl::zeroinfl function:

## Zero-inflated Poisson
mZIP <- zeroinfl(y ~ Decid * ConifWet | 1, x, dist="poisson")
summary(mZIP)

## 
## Call:
## zeroinfl(formula = y ~ Decid * ConifWet | 1, data = x, 
##     dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2182 -0.7003 -0.3390  0.3778  8.9518 
## 
## Count model coefficients (poisson with log link):
##                Estimate Std. Error z value
## (Intercept)    -0.32410    0.06513  -4.976
## Decid           1.06993    0.08599  12.443
## ConifWet       -2.44111    0.15644 -15.604
## Decid:ConifWet  5.93789    0.38407  15.461
##                            Pr(>|z|)    
## (Intercept)             0.000000648 ***
## Decid          < 0.0000000000000002 ***
## ConifWet       < 0.0000000000000002 ***
## Decid:ConifWet < 0.0000000000000002 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value
## (Intercept)  -1.6135     0.1034   -15.6
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 14 
## Log-likelihood: -5206 on 5 Df

## Zero-inflated Negative Binomial
mZINB <- zeroinfl(y ~ Decid * ConifWet | 1, x, dist="negbin")

## Warning in sqrt(diag(vc)[np]): NaNs produced

summary(mZINB)

## 
## Call:
## zeroinfl(formula = y ~ Decid * ConifWet | 1, data = x, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2181 -0.7003 -0.3390  0.3778  8.9501 
## 
## Count model coefficients (negbin with log link):
##                Estimate Std. Error z value
## (Intercept)    -0.32419    0.06512  -4.978
## Decid           1.07005    0.08599  12.444
## ConifWet       -2.44046    0.15642 -15.602
## Decid:ConifWet  5.93657    0.38405  15.458
## Log(theta)      9.72270         NA      NA
##                            Pr(>|z|)    
## (Intercept)             0.000000642 ***
## Decid          < 0.0000000000000002 ***
## ConifWet       < 0.0000000000000002 ***
## Decid:ConifWet < 0.0000000000000002 ***
## Log(theta)                       NA    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value
## (Intercept)  -1.6134     0.1034  -15.61
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 16692.3363 
## Number of iterations in BFGS optimization: 14 
## Log-likelihood: -5206 on 6 Df

Now we compare the four different parametric models:

AIC(mP, mNB, mZIP, mZINB)

##       df      AIC
## mP     4 10549.38
## mNB    5 10440.45
## mZIP   5 10421.96
## mZINB  6 10423.95

Our best model is the Zero-inflated Poisson. The probability of observing a zero as part of the zero distribution is back transformed from the zero coefficient using the inverse logit function:

unname(plogis(coef(mZIP, "zero"))) # P of 0

## [1] 0.1661095

Now we use the scale parameter to visualize the variance functions for the Negative Binomial models (the 1:1 line is the Poisson model):

mNB$theta

## [1] 3.590064

mZINB$theta

## [1] 16692.34

mu <- seq(0, 5, 0.01)
plot(mu, mu + mu^2/mNB$theta, type="l", col=2,
  ylab=expression(V(mu)), xlab=expression(mu))
lines(mu, mu + mu^2/mZINB$theta, type="l", col=4)
abline(0,1, lty=2)
legend("topleft", bty="n", lty=1, col=c(2,4),
  legend=paste(c("NB", "ZINB"), round(c(mNB$theta, mZINB$theta), 2)))

Mixed models

It is also common practice to consider generalized linear mixed models (GLMMs) for count data. These mixed models are usually considered as Poisson-Lognormal mixtures. The simplest, so called i.i.d., case is similar to the Negative Binomial, but instead of Gamma, we have Lognormal distribution: \(Y_i\sim Poisson(\lambda_i)\), \(log(\lambda_i) = \beta_0+\beta_1 x_{1i}+\epsilon_i\), \(\epsilon_i \sim Normal(0, \sigma^2)\), where \(\sigma^2\) is the Lognormal variance on the log scale.

We can use the lme4::glmer function: use SiteID as random effect (we have exactly \(n\) random effects).

mPLN1 <- glmer(y ~ Decid * ConifWet + (1 | SiteID), data=x, family=poisson)
summary(mPLN1)

## Generalized linear mixed model fit by maximum
##   likelihood (Laplace Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: y ~ Decid * ConifWet + (1 | SiteID)
##    Data: x
## 
##      AIC      BIC   logLik deviance df.resid 
##  10422.8  10455.0  -5206.4  10412.8     4564 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.1502 -0.6287 -0.2877  0.4178  5.4693 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  SiteID (Intercept) 0.294    0.5422  
## Number of obs: 4569, groups:  SiteID, 4569
## 
## Fixed effects:
##                Estimate Std. Error z value
## (Intercept)    -0.75179    0.06748  -11.14
## Decid           1.28467    0.09200   13.96
## ConifWet       -2.33801    0.16251  -14.39
## Decid:ConifWet  5.63257    0.41179   13.68
##                           Pr(>|z|)    
## (Intercept)    <0.0000000000000002 ***
## Decid          <0.0000000000000002 ***
## ConifWet       <0.0000000000000002 ***
## Decid:ConifWet <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Decid  ConfWt
## Decid       -0.895              
## ConifWet    -0.622  0.644       
## Decid:CnfWt  0.176 -0.379 -0.700

Note: the number of unknowns we have to somehow estimate is now more than the number of observations we have. How is that possible?

Alternatively, we can use SurveyArea as a grouping variable. We have now \(m < n\) random effects, and survey areas can be seen as larger landscapes within which the sites are clustered: \(Y_ij\sim Poisson(\lambda_ij)\), \(log(\lambda_ij) = \beta_0+\beta_1 x_{1ij}+\epsilon_i\), \(\epsilon_i \sim Normal(0, \sigma^2)\). The index \(i\) (\(i=1,...,m\)) defines the cluster (survey area), the \(j\) (\(j=1,...,n_i\)) defines the sites within survey area \(i\) (\(n = \sum_{i=1}^m n_i\)).

mPLN2 <- glmer(y ~ Decid * ConifWet + (1 | SurveyArea), data=x, family=poisson)
summary(mPLN2)

## Generalized linear mixed model fit by maximum
##   likelihood (Laplace Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: y ~ Decid * ConifWet + (1 | SurveyArea)
##    Data: x
## 
##      AIC      BIC   logLik deviance df.resid 
##  10021.0  10053.1  -5005.5  10011.0     4564 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7388 -0.6428 -0.3200  0.3550  6.5348 
## 
## Random effects:
##  Groups     Name        Variance Std.Dev.
##  SurveyArea (Intercept) 0.2953   0.5435  
## Number of obs: 4569, groups:  SurveyArea, 271
## 
## Fixed effects:
##                Estimate Std. Error z value
## (Intercept)    -0.74587    0.07828  -9.528
## Decid           1.19670    0.09841  12.160
## ConifWet       -2.32133    0.16863 -13.766
## Decid:ConifWet  5.53462    0.39771  13.916
##                           Pr(>|z|)    
## (Intercept)    <0.0000000000000002 ***
## Decid          <0.0000000000000002 ***
## ConifWet       <0.0000000000000002 ***
## Decid:ConifWet <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Decid  ConfWt
## Decid       -0.808              
## ConifWet    -0.610  0.628       
## Decid:CnfWt  0.162 -0.325 -0.670

In the battle of distributions (keeping the linear predictor part the same) the clustered GLMM was best supported:

tmp <- AIC(mP, mNB, mZIP, mZINB, mPLN1, mPLN2)
tmp$delta_AIC <- tmp$AIC - min(tmp$AIC)
tmp[order(tmp$AIC),]

##       df      AIC delta_AIC
## mPLN2  5 10020.97    0.0000
## mZIP   5 10421.96  400.9852
## mPLN1  5 10422.82  401.8529
## mZINB  6 10423.95  402.9764
## mNB    5 10440.45  419.4770
## mP     4 10549.38  528.4065

Exercise

How can we interpret these different kinds of overdispersion (zero-inflation and higher than Poisson variance)?
What are some of the biological mechanisms that can contribute to the overdispersion?
What are some of the biological mechanisms that can lead to the clustered GLMM being the best model?

Count duration effects

Let’s change gears a bit now, and steer closer to the main focus of this workshop. We want to account for methodological differences among samples. One aspect of methodologies involve variation in total counting duration. We’ll now inspect what that does to our observations.

First, we create a list of matrices where counts are tabulated by surveys and time intervals for each species:

ydur <- Xtab(~ SiteID + Dur + SpeciesID , 
  josm$counts[josm$counts$DetectType1 != "V",])

The ydur object is a list with identical matrices, one for each species. We use the same species (spp) as before

y <- as.matrix(ydur[[spp]])
head(y)

##         0-3min 3-5min 5-10min
## CL10102      3      0       0
## CL10106      0      0       0
## CL10108      0      0       0
## CL10109      2      0       1
## CL10111      2      0       0
## CL10112      2      0       0

colMeans(y) # mean count of new individuals

##    0-3min    3-5min   5-10min 
## 0.6736704 0.0934559 0.1159991

Create a data frame including the cumulative counts in the 0-3, 0-5, and 0-10 minutes time intervals

cumsum(colMeans(y)) # cumulative counts

##    0-3min    3-5min   5-10min 
## 0.6736704 0.7671263 0.8831254

x <- data.frame(
  josm$surveys, 
  y3=y[,"0-3min"],
  y5=y[,"0-3min"]+y[,"3-5min"],
  y10=rowSums(y))

(tb3 <- table(x$y3))

## 
##    0    1    2    3    4    5    6 
## 2768  922  576  226   61   14    2

(tb5 <- table(x$y5))

## 
##    0    1    2    3    4    5    6 
## 2643  894  632  285   87   24    4

(tb10 <- table(x$y10))

## 
##    0    1    2    3    4    5    6 
## 2493  883  656  363  132   29   13

plot(names(tb3), as.numeric(tb3)/sum(tb3), type="l",
     xlab="Count", ylab="Proportion")
lines(names(tb5), as.numeric(tb5)/sum(tb5), type="l", col=2)
lines(names(tb10), as.numeric(tb10)/sum(tb10), type="l", col=4)
abline(v=mean(x$y3), lty=2)
abline(v=mean(x$y5), lty=2, col=2)
abline(v=mean(x$y10), lty=2, col=4)
legend("topright", bty="n", lty=1, col=c(1,2,4),
       legend=paste0("0-", c(3, 5, 10), " min"))

If we fit single-predictor GLMs to these 3 responses, we get different fitted values, consistent with our mean counts:

m3 <- glm(y3 ~ Decid, data=x, family=poisson)
m5 <- glm(y5 ~ Decid, data=x, family=poisson)
m10 <- glm(y10 ~ Decid, data=x, family=poisson)
mean(fitted(m3))

## [1] 0.6736704

mean(fitted(m5))

## [1] 0.7671263

mean(fitted(m10))

## [1] 0.8831254

Using the multiple time interval data, we can pretend that we have a mix of methodologies with respect to count duration:

set.seed(1)
x$meth <- as.factor(sample(c("A", "B", "C"), nrow(x), replace=TRUE))
x$y <- x$y3
x$y[x$meth == "B"] <- x$y5[x$meth == "B"]
x$y[x$meth == "C"] <- x$y10[x$meth == "C"]

We can estimate the effect of the methodology:

mm <- glm(y ~ meth - 1, data=x, family=poisson)
summary(mm)

## 
## Call:
## glm(formula = y ~ meth - 1, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3119  -1.2733  -1.1472   0.3913   3.9799  
## 
## Coefficients:
##       Estimate Std. Error z value
## methA -0.41855    0.03143 -13.315
## methB -0.20998    0.02824  -7.436
## methC -0.15023    0.02798  -5.368
##                   Pr(>|z|)    
## methA < 0.0000000000000002 ***
## methB    0.000000000000104 ***
## methC    0.000000079437024 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7233.2  on 4569  degrees of freedom
## Residual deviance: 6938.6  on 4566  degrees of freedom
## AIC: 11646
## 
## Number of Fisher Scoring iterations: 6

boxplot(fitted(mm) ~ meth, x)

exp(coef(mm))

##     methA     methB     methC 
## 0.6579974 0.8106012 0.8605121

Or the effect of the continuous predictor and the method (discrete):

mm <- glm(y ~ Decid + meth, data=x, family=poisson)
summary(mm)

## 
## Call:
## glm(formula = y ~ Decid + meth, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2788  -0.9529  -0.7402   0.4472   4.5675  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -1.46163    0.04596 -31.804
## Decid        2.14500    0.05734  37.407
## methB        0.16685    0.04227   3.948
## methC        0.30269    0.04209   7.191
##                         Pr(>|z|)    
## (Intercept) < 0.0000000000000002 ***
## Decid       < 0.0000000000000002 ***
## methB          0.000078964144157 ***
## methC          0.000000000000643 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6983.3  on 4568  degrees of freedom
## Residual deviance: 5443.8  on 4565  degrees of freedom
## AIC: 10153
## 
## Number of Fisher Scoring iterations: 6

boxplot(fitted(mm) ~ meth, x)

exp(coef(mm))

## (Intercept)       Decid       methB       methC 
##   0.2318576   8.5420610   1.1815789   1.3534989

The fixed effects adjust the means quite well:

cumsum(colMeans(y))

##    0-3min    3-5min   5-10min 
## 0.6736704 0.7671263 0.8831254

mean(y[,1]) * c(1, exp(coef(mm))[3:4])

##               methB     methC 
## 0.6736704 0.7959947 0.9118121

But it is all relative, depends on reference methodology/protocol and the data. As the time increases, it is not guaranteed that the estimates will increase as well – there can be all kinds off biases and sampling variation that would work against us.

The other problem is, we can’t easily extrapolate to a methodology with count duration of 12 minutes, or interpolate to a methodology with count duration of 2 or 8 minutes. We somehow need to express time expediture in minutes to make that work. Let’s try something else:

x$tmax <- c(3, 5, 10)[as.integer(x$meth)]
mm <- glm(y ~ Decid + I(log(tmax)), data=x, family=poisson)
summary(mm)

## 
## Call:
## glm(formula = y ~ Decid + I(log(tmax)), family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2912  -0.9529  -0.7308   0.4482   4.5476  
## 
## Coefficients:
##              Estimate Std. Error z value
## (Intercept)  -1.71574    0.07044 -24.359
## Decid         2.14726    0.05730  37.472
## I(log(tmax))  0.24559    0.03431   7.158
##                          Pr(>|z|)    
## (Intercept)  < 0.0000000000000002 ***
## Decid        < 0.0000000000000002 ***
## I(log(tmax))    0.000000000000818 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6983.3  on 4568  degrees of freedom
## Residual deviance: 5444.9  on 4566  degrees of freedom
## AIC: 10152
## 
## Number of Fisher Scoring iterations: 6

tmax <- seq(0, 20, 0.01)
plot(tmax, exp(log(tmax) * coef(mm)[3]), type="l",
  ylab="Method effect", col=2)

Exercise

Now we are getting somewhere. But still, couple of questions come to mind:

The function keeps increasing monotonically. Why is this an issue?
What kind of function would we need instead and why?
What is the underlying biological mechanism that we need to model somehow?

Count radius effects

Before solving the count duration issue, let us look at the effect of survey area. We get a similar count breakdown, but now by distance band:

ydis <- Xtab(~ SiteID + Dis + SpeciesID , 
  josm$counts[josm$counts$DetectType1 != "V",])

y <- as.matrix(ydis[[spp]])
head(y)

##         0-50m 50-100m 100+m
## CL10102     1       2     0
## CL10106     0       0     0
## CL10108     0       0     0
## CL10109     1       2     0
## CL10111     1       0     1
## CL10112     0       2     0

colMeans(y) # mean count of new individuals

##      0-50m    50-100m      100+m 
## 0.29240534 0.49223025 0.09848982

cumsum(colMeans(y)) # cumulative counts

##     0-50m   50-100m     100+m 
## 0.2924053 0.7846356 0.8831254

x <- data.frame(
  josm$surveys, 
  y50=y[,"0-50m"],
  y100=y[,"0-50m"]+y[,"50-100m"])

We don’t consider the unlimited distance case, because the survey area there is unknown (we will address this problem later).

(tb50 <- table(x$y50))

## 
##    0    1    2    3    4    5 
## 3521  792  228   25    2    1

(tb100 <- table(x$y100))

## 
##    0    1    2    3    4    5    6 
## 2654  833  647  316   92   20    7

plot(names(tb50), as.numeric(tb50)/sum(tb50), type="l",
     xlab="Count", ylab="Proportion")
lines(names(tb100), as.numeric(tb100)/sum(tb100), type="l", col=2)
abline(v=mean(x$y50), lty=2)
abline(v=mean(x$y100), lty=2, col=2)
legend("topright", bty="n", lty=1, col=c(1,2),
       legend=paste0("0-", c(50, 100), " m"))

We compare the counts within the 0-50 and 0-100 m circles:

m50 <- glm(y50 ~ Decid, data=x, family=poisson)
m100 <- glm(y100 ~ Decid, data=x, family=poisson)
mean(fitted(m50))

## [1] 0.2924053

mean(fitted(m100))

## [1] 0.7846356

coef(m50)

## (Intercept)       Decid 
##   -2.265081    2.126118

coef(m100)

## (Intercept)       Decid 
##   -1.326642    2.208799

Offsets

Offsets are constant terms in the linear predictor, e.g. \(log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + o_i\), where \(o_i\) is an offset.

In the survey area case, an offset might be the log of area surveyed. Abundance (\(N\)) is population density (\(D\)) multiplied by survey area (\(A\)). The logic for this is based on point processes: intensity is a linear function of area under a homogeneous Poisson point process. So we can say thet \(E[Y_i] = N_i = D_i A_i\), thus \(log(N_i) = log(D_i) + log(A_i)\) where \(o_i = log(A_i)\) is the offset.

Let’s see if using area as offset makes our models comparable. Instead of mixing up different survey types, let’s see if we can make them identical. We use distance in meters divided by 100, so the population density is estimated in ha.

m50 <- glm(y50 ~ Decid, data=x, family=poisson, 
  offset=rep(log(0.5^2*pi), nrow(x)))
m100 <- glm(y100 ~ Decid, data=x, family=poisson,
  offset=rep(log(1^2*pi), nrow(x)))
coef(m50)

## (Intercept)       Decid 
##   -2.023517    2.126118

coef(m100)

## (Intercept)       Decid 
##   -2.471372    2.208799

mean(exp(model.matrix(m50) %*% coef(m50)))

## [1] 0.372302

mean(exp(model.matrix(m100) %*% coef(m100)))

## [1] 0.2497573

These coefficients and mean predictions are much closer to each other, but something else is going on.

We pretend again, that survey area varies in our data set:

set.seed(1)
x$meth <- as.factor(sample(c("A", "B"), nrow(x), replace=TRUE))
x$y <- x$y50
x$y[x$meth == "B"] <- x$y100[x$meth == "B"]

Methodology effect:

mm <- glm(y ~ meth - 1, data=x, family=poisson)
summary(mm)

## 
## Call:
## glm(formula = y ~ meth - 1, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2612  -1.2612  -0.7591   0.2206   3.7202  
## 
## Coefficients:
##       Estimate Std. Error z value
## methA -1.24440    0.03916 -31.775
## methB -0.22902    0.02335  -9.808
##                  Pr(>|z|)    
## methA <0.0000000000000002 ***
## methB <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7387.0  on 4569  degrees of freedom
## Residual deviance: 5683.8  on 4567  degrees of freedom
## AIC: 9170.7
## 
## Number of Fisher Scoring iterations: 6

exp(coef(mm))

##     methA     methB 
## 0.2881131 0.7953166

Predictor and method effects:

mm <- glm(y ~ Decid + meth, data=x, family=poisson)
summary(mm)

## 
## Call:
## glm(formula = y ~ Decid + meth, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2091  -0.8554  -0.5895   0.2436   3.5774  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -2.32753    0.05673  -41.03
## Decid        2.19611    0.06893   31.86
## methB        1.02835    0.04560   22.55
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## Decid       <0.0000000000000002 ***
## methB       <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6247.1  on 4568  degrees of freedom
## Residual deviance: 4595.6  on 4566  degrees of freedom
## AIC: 8084.5
## 
## Number of Fisher Scoring iterations: 6

boxplot(fitted(mm) ~ meth, x)

exp(coef(mm))

## (Intercept)       Decid       methB 
##  0.09753656  8.98995621  2.79645721

cumsum(colMeans(y))[1:2]

##     0-50m   50-100m 
## 0.2924053 0.7846356

mean(y[,1]) * c(1, exp(coef(mm))[3])

##               methB 
## 0.2924053 0.8176990

Use log area as continuous predictor: we would expect a close to 1:1 relationship on the abundance scale.

x$logA <- log(ifelse(x$meth == "A", 0.5, 1)^2*pi)
mm <- glm(y ~ Decid + logA, data=x, family=poisson)
summary(mm)

## 
## Call:
## glm(formula = y ~ Decid + logA, family = poisson, data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2091  -0.8554  -0.5895   0.2436   3.5774  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -2.14834    0.05232  -41.06
## Decid        2.19611    0.06893   31.86
## logA         0.74180    0.03289   22.55
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## Decid       <0.0000000000000002 ***
## logA        <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6247.1  on 4568  degrees of freedom
## Residual deviance: 4595.6  on 4566  degrees of freedom
## AIC: 8084.5
## 
## Number of Fisher Scoring iterations: 6

A <- seq(0, 2, 0.01) # in ha
plot(A, exp(log(A) * coef(mm)[3]), type="l",
  ylab="Method effect", col=2)
abline(0, 1, lty=2)

The offset forces the relationship to be 1:1 (it is like fixing the logA coefficient to be 1):

mm <- glm(y ~ Decid, data=x, family=poisson, offset=x$logA)
summary(mm)

## 
## Call:
## glm(formula = y ~ Decid, family = poisson, data = x, offset = x$logA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3051  -0.8416  -0.5245   0.2338   3.6000  
## 
## Coefficients:
##             Estimate Std. Error z value
## (Intercept) -2.36583    0.04530  -52.23
## Decid        2.20312    0.06901   31.93
##                        Pr(>|z|)    
## (Intercept) <0.0000000000000002 ***
## Decid       <0.0000000000000002 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5746.0  on 4568  degrees of freedom
## Residual deviance: 4653.6  on 4567  degrees of freedom
## AIC: 8140.5
## 
## Number of Fisher Scoring iterations: 6

boxplot(fitted(mm) ~ meth, x)

cumsum(colMeans(y))[1:2]

##     0-50m   50-100m 
## 0.2924053 0.7846356

c(0.5, 1)^2*pi * mean(exp(model.matrix(mm) %*% coef(mm))) # /ha

## [1] 0.2172632 0.8690527

Prediction with offsets

Predictions using offsets in glm can be tricky, because the offset is usually included in the calculation of the expected value.

mm1 <- glm(y ~ Decid, data=x, family=poisson, offset=x$logA)
mm2 <- glm(y ~ Decid + offset(logA), data=x, family=poisson)
mean(fitted(mm1))

## [1] 0.5441016

mean(fitted(mm2))

## [1] 0.5441016

mean(predict(mm1, type="response"))

## [1] 0.5441016

mean(predict(mm2, type="response"))

## [1] 0.5441016

mean(exp(model.matrix(mm1) %*% coef(mm1))*exp(x$logA))

## [1] 0.5441016

mean(exp(model.matrix(mm2) %*% coef(mm2))*exp(x$logA))

## [1] 0.5441016

When estimating population density, we often want results to refer to a unit area (\(A=1\)) that leads to 0 offset (\(log(1)=0\)), thus we need to make sure it is the case

mean(exp(model.matrix(mm1) %*% coef(mm1)))

## [1] 0.2766281

mean(exp(model.matrix(mm2) %*% coef(mm2)))

## [1] 0.2766281

The safest way is to use the matrix product (exp(model.matrix(mm) %*% coef(mm) + <offset>)). We can often omit the offset, e.g. in the log area case we can express the prediction per unit area. If the unit is 1 ha, as in our case, log(1)=0, which means the mean abundance per unit area can be calculated by omitting the offsets all together.

Exercise

When we compared the 0-50 and 0-100 m counts we could not make abundances comparable using log area as an offset. Why?
We got a logA coefficient that was less than 1 when we should have gotten 1. Why?

Outlook

It looks like that our \(E[Y]=N=DA\) does not hold.

There is an issue with survey length,
none of the tweaks using area worked as expected.

The best that we can do today is to say thet \(E[Y]=DAC\) where \(C\) is a correction factor whose value is unknown. Note that \(N\) is still our interest, but instead of \(E[Y]=N\) we observe a count that is \(E[Y] < N\).

The value of \(C\) must relate to the time and distance components of the surveys. If that’s the case, we can just call the time related part of \(C\) as \(p\) and the distance related part of \(C\) as \(q\) (\(C=pq\)). If somehow we can estimate \(p\) and \(q\), we can estimate \(N\) because \(N=DA = E[Y] / pq\).

The basic equation that we will rely on is \(E[Y]=DApq=qpAD\), which reads QPAD. If you did not know, now you know why this approach is often referred to as the QPAD approach.

On the next days you will learn how to estimate \(p\) from removal sampling and \(q\) (and sometimes also \(A\)) from distance sampling. Once we have a handle on \(Apq\), we can estimate \(D\) and \(N\).

Ultimately, you will learn how to estimate population size.

Homework

Think about why knowing the absolute population size matters. Why shouldn’t we just stop at good old GLM and relative abundance?

A Primer in Regression Techniques

Peter Solymos solymos@ualberta.ca

Introduction

Prerequisites

Poisson null model

Exploring covariates

Single covariate

Exercise

Additive model

Exercise

Nonlinear terms

Categorical variables

Multiple main effects

Interaction

Different error distributions

Mixed models

Exercise

Count duration effects

Exercise

Count radius effects

Offsets

Prediction with offsets

Exercise

Outlook

Homework