This tutorial will cover fitting basic GAMM models, significance testing, identifying areas of significant change, and dealing with interactions.
# Load the datafile
dataFile <- "r2star_data.csv"
df <- read.csv(dataFile)
Fitting a nonlinear age effect. We will look at the development of R2* in the Accumbens as an example. The model will also include a covariate for sex. Here is the model formula: Accumbens_Area ~ sex +s(age, k = 4)
s
indicates we will use a spline smooth. The default is a thin plate regression spline (you should have a good reason to change this) k
sets a maximum for the amount of knots that can be fit. The max degrees of freedom for the spline is k-1. For development, we don’t really expect more than 2 or 3 inflection points, so k=4
is pretty reasonable.
# Define the model formula
model_formula <- as.formula("Accumbens_Area ~ sex +s(age, k = 4)")
model <- gamm(model_formula,
random = list(bblid=~1), # We specify a random intercept for subject id (bblid). gamm uses lists for this.
data = df)
gamm
model objects have two parts, a gam
object and an lme
object. Use the gam
object for summary, prediction, plotting, etc.
The thin plate regression spline is a penalized spline that tries to optimize the wiggliness in the fit. The edf
is representing the effective degrees of freedom after penalization.
summary(model$gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Accumbens_Area ~ sex + s(age, k = 4)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.7938 0.1144 155.497 <2e-16 ***
## sexmale -0.3790 0.1669 -2.271 0.0233 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 2.608 2.608 52.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.109
## Scale est. = 3.0664 n = 1236
Though fitting the model as we did is nice for optimizing the smoothness of the fit, it may not be ideal when we care about significance testing. This is because the penalized term p-values can be unreliable and sometimes underestimated. When we care about significance testing, we should fit the smooths using unpenalized splines by setting fx = F
.
model_formula <- as.formula("Accumbens_Area ~ sex +s(age, k = 4, fx = T)")
model <- gamm(model_formula,
random = list(bblid=~1),
data = df)
# Look at the model summary
summary(model$gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Accumbens_Area ~ sex + s(age, k = 4, fx = T)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.7940 0.1144 155.568 <2e-16 ***
## sexmale -0.3781 0.1668 -2.266 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 3 3 45.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.108
## Scale est. = 3.0525 n = 1236
# Let's plot the model
final_plot <- visualize_model(modobj = model,smooth_var = "age",int_var=NULL,group_var = "bblid")
# We can adjust the axes limits and labels
final_plot + ylim(10,30) + labs(x="Age (Years)",y = "Accumbens R2* (1/sec)")
We can answer this question by looking at the derivative of the smooth function, calculating the confidence interval, and identifying where the derivative is significantly non-zero.
Luckily there was a package recently developed by Gavin Simpson that can do this for us! Package info: https://github.com/gavinsimpson/gratia Blog post introducing the package: https://www.fromthebottomoftheheap.net/2018/10/23/introducing-gratia/ Interesting blog posts about the method for derivatives: https://www.fromthebottomoftheheap.net/2017/03/21/simultaneous-intervals-for-derivatives-of-smooths/
Let’s start with the model from our first example
model_formula <- as.formula("Accumbens_Area ~ sex +s(age, k = 4, fx = T)")
model <- gamm(model_formula,
random = list(bblid=~1),
data = df)
Now let’s get the derivative.
# Calculate the derivative
d<-derivatives(model,n=1000)
# Take a look at it using gratia::draw
d_plot <- draw(d)
print(d_plot)
# Identify significant areas
d<- d %>%
mutate(sig = !(0 >lower & 0 < upper)) #Ages where the CI does not include zero
cat(sprintf("\nSignificant change: %1.2f - %1.2f\n",min(d$data[d$sig==T]),max(d$data[d$sig==T]))) #this only works properly if there is one contiguous area of significant change (sorry)
##
## Significant change: 8.17 - 17.27
# Add some ornaments to the plot
d_plot <- ggplot(data = d,aes(x=data,y = derivative,color=sig, ymin=lower,ymax=upper)) + geom_ribbon(fill="black",alpha=.3,color=NA) + geom_line(size=1,show.legend = F) +scale_color_manual(values = c("TRUE" = "firebrick","FALSE" = "black")) + geom_hline(yintercept = 0,linetype=2)
# Let's visualize using our visualizer function
big_plot <- visualize_model(modobj = model,smooth_var = "age",int_var = "sex",group_var = "bblid",plabels = "scatter + derivative", derivative_plot = T)
##
## Sig change: 8.17 - 17.21
# Note: Currently the derivative_plot = T is very finicky about the output figure size. It is best to use ggsave to output
ggsave(plot = big_plot,filename = "derivative_fig.png",device = "png",width = 180,height = 120,units = "mm")
knitr::include_graphics("derivative_fig.png")
We can investigate factor-smooth interactions and continuous interactions.
Factor-smooth interactions allow you to test whether the smoothed term varies across levels of a factor. In this example, we’ll fit nonlinear age effect and check for age*sex interaction, s(age, by = oSex)
. We will test in the Pallidum because we have an interaction here.
NOTE: If you are interested in whether the smooths DIFFER between levels of the factor, which we usually do when testing for an interaction, the variable types and model specification matter! The factor must be specified as an ordered factor, and you must include a “main effect” smooth in the model. This fits a reference smooth at the first level of your factor and models smooths at the other levels as a comparison to the reference. We will create a variable oSex
as an ordered factor of sex, and test the model as Pallidum ~ oSex +s(age, k = 4, fx = T) + s(age, by = oSex, k = 4, fx = T)
df$oSex <- ordered(df$sex, levels = c("female","male")) # Females will be the reference group
model_formula <- as.formula("Pallidum ~ oSex + s(age, k = 4, fx = T) + s(age, by = oSex, k = 4, fx = T)")
# Note we keep fx = T for reliable p-values.
model <- gamm(model_formula,
random = list(bblid=~1),
data = df)
summary(model$gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Pallidum ~ oSex + s(age, k = 4, fx = T) + s(age, by = oSex, k = 4,
## fx = T)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.33105 0.05281 422.830 <2e-16 ***
## oSex.L 0.09464 0.07469 1.267 0.205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 3 3 122.436 < 2e-16 ***
## s(age):oSexmale 3 3 4.815 0.00245 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.367
## Scale est. = 1.2919 n = 1236
It looks like we have a significant interaction here. In order to confirm the p-value for the interaction, we can compare this model to a main effect only model and see if it is a significant improvement. This can be done with a bootstrap likelihood ratio test. Let’s make a function to do this:
## Parametric bootstrap of likelihood ratio test for nested models
## Takes your gamm model object as an input. If you are using `gam` this will have to be tweaked a little.
## Right now, this requires your term of interested to be the LAST term in the model.
pboot <- function(modelobj){
numsims <- 1000 #This is the number of bootstrap simulations. This # should be higher in practice, probably something like 10,000
df <- modelobj$gam$model #Get a data frame of all data used in the model (if using gam, this is modelobj$model)
group_var_name <- names(modelobj$lme$coefficients$random) # the name of the random effects variable
thisResp <- as.character(modelobj$gam$terms[[2]])
f1 <- modelobj$gam$formula #Get the formula (if using gam, this is modelobj$formula)
theseVars <- attr(terms(f1),"term.labels") #Get the terms
f2 <- reformulate(theseVars[0:(length(theseVars)-1)],response = thisResp) #Drop the last term from the formula. This is the simpler model now.
# The bootstrap function takes an lme object, so we fit the models using gam to get the design matrix, and stick that matrix into lmer
#Fit
g1 <- gam(f1,data = df)
g2 <- gam(f2,data = df)
#Get the matrices
mat1 <- model.matrix(g1)
mat2 <- model.matrix(g2)
#Tack on the response variable and grouping variable.
group_var<- df[,group_var_name]
y <- df[,thisResp]
# Fit the models with `lmer`
m1 <- lmer(y ~ -1 + mat1 + (1|group_var))
m2 <- lmer(y ~ -1 + mat2 + (1|group_var))
# Create the bootstrap distribution
refdist <- PBrefdist(m1, m2, nsim=numsims) # note you can parallelize this (look at help for the function)
pb <- PBmodcomp(m1, m2, ref = refdist) # Compare the models
int_pval <- pb$test["PBtest","p.value"] # Pull out the p-value from the bootstrap test.
# Now return the best model
if (int_pval < .05) {
pb$bestmod <- f1
} else {
pb$bestmod <- f2
}
return(pb)
}
pb <- pboot(model)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.00682701
## (tol = 0.002, component 1)
#Print a table that shows the bootstrap outcome
pb %>%
summary() %>%
.$test %>%
as.data.frame() %>%
kable(caption = "Parametric Bootstrap test results") %>%
kable_styling(full_width = F, position = "left",bootstrap_options = c("striped"))
stat | df | ddf | p.value | |
---|---|---|---|---|
PBtest | 14.376636 | NA | NA | 0.0029970 |
Gamma | 14.376636 | NA | NA | 0.0025997 |
Bartlett | 14.295718 | 3 | NA | 0.0025291 |
F | 4.792212 | 3 | 2.991581 | 0.1156337 |
LRT | 14.376636 | 3 | NA | 0.0024348 |
# The bootstrap p-value confirms that the our interaction term is a significant improvement
# This code will handle keeping the best model.
if (isTRUE(all.equal(pb$bestmod,model_formula))) {
cat("The initial (more complicated) model is best")
final_model <- model
} else {
cat("The simpler model is best")
cat(" refitting ")
final_model <-gamm(as.formula(pb$bestmod),
data=df,
random = list(bblid =~ 1),
subset = exclusions == F)
}
## The initial (more complicated) model is best
summary(final_model$gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Pallidum ~ oSex + s(age, k = 4, fx = T) + s(age, by = oSex, k = 4,
## fx = T)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.33105 0.05281 422.830 <2e-16 ***
## oSex.L 0.09464 0.07469 1.267 0.205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 3 3 122.436 < 2e-16 ***
## s(age):oSexmale 3 3 4.815 0.00245 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.367
## Scale est. = 1.2919 n = 1236
We can simplify the summary a little bit. This is helpful if you are looping over many models and want to see the results at a glance.
#Display model results:
smooths_tidytable<- tidy(final_model$gam)
parametric_tidytable <- tidy(final_model$gam,parametric = T)
smoothnames = names(smooths_tidytable)
parametricnames = names(parametric_tidytable)
names(smooths_tidytable)=parametricnames
numObs <- final_model$lme$dims$N
stattable <- rbind(parametric_tidytable,smoothnames,smooths_tidytable) %>%
kable(caption = sprintf("Regression table from gamm. N = %d",numObs)) %>%
kable_styling(full_width = F, position = "left")
print(stattable)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 22.3310461495713 | 0.0528132858197632 | 422.83008532703 | 0 |
oSex.L | 0.0946440202407537 | 0.0746892650797957 | 1.26717032413746 | 0.205334649425365 |
term | edf | ref.df | statistic | p.value |
s(age) | 3.00000000000001 | 3 | 122.436127454182 | 2.24004626057781e-69 |
s(age):oSexmale | 3.00000000000001 | 3 | 4.81502657334852 | 0.00244630583117889 |
# Send to visualizer function
# int_var is the factor, will be plotted as separate lines
# group_var identifies the variable that links points in the spaghetti plot
# plabel will attach a title to the plot
model_plot <- visualize_model(modobj = final_model,smooth_var = "age",int_var = "oSex",group_var = "bblid",plabels = "Factor-smooth interaction",check_diagnostics = F)
# Adjust some labels and plot limits
model_plot<- model_plot + labs(y = "Pallidum R2* (1/sec)", x = "Age (years)") + ylim(12,30)
print(model_plot)
# We can also get the derivatives
model_plot <- visualize_model(modobj = final_model,smooth_var = "age",int_var = "oSex",group_var = "bblid",plabels = "Factor-smooth interaction",derivative_plot = T)
# Save the plot
ggsave(plot = big_plot,filename = "derivative_fig2.png",device = "png",width = 180,height = 120,units = "mm")
knitr::include_graphics("derivative_fig2.png")
There are two possible models
1. Additive (main effects + interaction) bivariate smooth model (fully nonlinear interaction)
2. varying coefficient model (linear-nonlinear interaction)
Model selection and significance testing workflow:
* We select the best model is based on the smallest BIC (or AIC).
* After the best interaction model is selected, the significance of the interaction is tested with a parametric bootstrap likelihood ratio test. This test compares the model with the interaction term against a simpler nested model with main effects only.
* If the interaction model is significantly better, we keep that model. If not, the final model is the simpler model with no interaction.
ti
: ti(age) + ti(Cognition) + ti(age,Cognition)
From documentation: This model specifies a main effects + interaction structure such as: y ~ ti(x) + ti(z) + ti(x,z)
ti
is the proper way of specifying an interaction term in the context of included main effect terms:
“This functional ANOVA decomposition is supported by ti terms, which produce tensor product interactions from which the main effects have been excluded, under the assumption that they will be included separately. For example the ~ ti(x) + ti(z) + ti(x,z) would produce the above main effects + interaction structure. This is much better than attempting the same thing with s or te terms representing the interactions (although mgcv does not forbid it). Technically ti terms are very simple: they simply construct tensor product bases from marginal smooths to which identifiability constraints (usually sum-to-zero) have already been applied: correct nesting is then automatic (as with all interactions in a GLM framework). See Wood (2017, section 5.6.3).”
NOTE: This model model can onl be fit as a GAMM using gamm
. This specification is not available with gamm4
.
by =
)This will make the fit linear (rather than non-linear smooth) in the by
variable. From documentation: “When using by
with a numberic covariate,”the by argument ensures that the smooth function gets multiplied by covariate z"
# Compare the two interaction models
# Bivariate interaction model
bv_formula <- as.formula("Putamen ~ ti(age, k=4, fx = T) + ti(NAR_Overall_Efficiency, k=4, fx = T) + ti(age,NAR_Overall_Efficiency, k=4, fx = T)")
# Linear varying coefficient interaction
vc_formula <- as.formula("Putamen ~ s(age, k=4, fx = T) + s(age, by = NAR_Overall_Efficiency, k=4, fx = T)")
# Fit each model
bv <- gamm(as.formula(bv_formula),
random = list(bblid=~1),
data = df)
vc <- gamm(as.formula(vc_formula),
random = list(bblid=~1),
data = df)
# get BIC
bic<-BIC(bv$lme,vc$lme) #Get the BIC from the lme object for gamm
bestmod <- gsub(row.names(bic)[which.min(bic$BIC)],pattern = "$lme",replacement = "", fixed = T) #best is min BIC
switch (bestmod,
"bv" = {model <- bv},
"vc" = {model <- vc}
)
model_formula <- model$gam$formula
anova(model$gam)
Family: gaussian Link function: identity
Formula: Putamen ~ s(age, k = 4, fx = T) + s(age, by = NAR_Overall_Efficiency, k = 4, fx = T)
Approximate significance of smooth terms: edf Ref.df F p-value s(age) 3 3 44.99 <2e-16 s(age):NAR_Overall_Efficiency 4 4 2.76 0.0266
# Confirm the interaction model is a significant improvement.
pb <- pboot(model)
#Print a table that shows the bootstrap outcome
pb %>%
summary() %>%
.$test %>%
as.data.frame() %>%
kable(caption = "Parametric Bootstrap test results") %>%
kable_styling(full_width = F, position = "left",bootstrap_options = c("striped"))
stat | df | ddf | p.value | |
---|---|---|---|---|
PBtest | 10.907833 | NA | NA | 0.0339660 |
Gamma | 10.907833 | NA | NA | 0.0324136 |
Bartlett | 10.521019 | 4 | NA | 0.0325087 |
F | 2.726958 | 4 | 2.635513 | 0.2384540 |
LRT | 10.907833 | 4 | NA | 0.0276196 |
# Plot the outcome
model_plot <- visualize_model(modobj = model,smooth_var = "age",int_var = "NAR_Overall_Efficiency",group_var = "bblid",plabels = "continuous interaction",derivative_plot = F)
model_plot <- model_plot + ylim(10,22) + labs(x="Age (Years)", y = "Putamen R2* (1/sec)")
print(model_plot)
Since the varying coefficient model was the best model, let’s confirm we don’t have any concurvity issues between the main effect and the interaction.
We can use the concurvity
function to assess this. From the help documentation,?concurvity
> Concurvity occurs when some smooth term in a model could be approximated by one or more of the other smooth terms in the model.
Concurvity can be viewed as a generalization of co-linearity, and causes similar problems of interpretation. It can also make estimates somewhat unstable.
Concurvity is a value betwee zero and one, with zero indicating no problem, and 1 indicating lack of indentifiability.
# Check the concurvity of the vc model.
c<-as.data.frame(concurvity(model$gam))
c %>%
kable(caption = "Convurvity")%>%
kable_styling(full_width = F,bootstrap_options = "striped",position = "left")
para | s(age) | s(age):NAR_Overall_Efficiency | |
---|---|---|---|
worst | 0.4222969 | 0.5109460 | 0.5268890 |
observed | 0.4222969 | 0.4410716 | 0.4947144 |
estimate | 0.4222969 | 0.4449870 | 0.4717645 |
There are a number of cool packages out there for visualizing GAM models. One caveat is that they are generally meant for GAM rather than GAMM, and don’t have options to add a spaghetti plot. I ended up making my own visualizer function to get spaghetti plots and add derivative information, but these are definitely worth checking out.
vis.gam
comes with mgcv
and can make a few different plots. I like the perspective plot (example below).
mgcViz
also makes some pretty plots for all the terms in your model (smooths and parametric terms). I included one example below, but you can check out their vignette for more: https://cran.r-project.org/web/packages/mgcViz/vignettes/mgcviz.html
itsadug
is also a cool package that has some extra nifty tools for exploring your model. See the vignette here: https://cran.r-project.org/web/packages/itsadug/vignettes/inspect.html
gratia
is a newer package that also has some handy plotting and diagnostic tools. We have used it already here for calculating derivatives and confidence intervals for derivatives. Info can be found here: https://gavinsimpson.github.io/gratia/
# Using vis.gam
# This is set up for gam, not gamm, so we have to change our gam object a bit
model$gam$data <- model$gam$model
vis.gam(model$gam,view = c("age","NAR_Overall_Efficiency"),plot.type = "persp",theta = 45,color = "topo",zlab = "Putamen R2*")
b <- getViz(model$gam)
var_plot <- plot(b,allTerms = T) +
l_ciPoly() +
l_fitRaster() + l_fitContour() +
l_points() + l_ciBar() +
l_fitPoints(size = 1, col = 2) + l_fitLine() +
labs(title = "mgcViz plots")
print(var_plot,pages = 1)