glmmTMB has the capability to simulate from a fitted model. These simulations resample random effects from their estimated distribution. In future versions of glmmTMB, it may be possible to condition on estimated random effects.
library(glmmTMB)
library(ggplot2); theme_set(theme_bw())
Fit a typical model:
data(Owls)
owls_nb1 <- glmmTMB(SiblingNegotiation ~ FoodTreatment*SexParent +
                        (1|Nest) + offset(log(BroodSize)),
                    family = nbinom1,
                    ziformula = ~1, data = Owls)
Then we can simulate from the fitted model with the simulate.glmmTMB function. It produces a list of simulated observation vectors, each of which is the same size as the original vector of observations. The default is to simulate only one vector (nsim=1), but we still return a list for consistency.
simo <- simulate(owls_nb1, seed = 1)
Simdat <- Owls
Simdat$SiblingNegotiation <- simo[[1]]
Simdat <- transform(Simdat,
                    NegPerChick = SiblingNegotiation/BroodSize,
                    type = "simulated")
Owls$type <- "observed"
Dat <- rbind(Owls, Simdat)
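The list structure is easy to verify; here is a minimal sketch (not in the original vignette) that draws several replicates at once from the fitted model:
## each element is one simulated response vector, the same length as the data
simo5 <- simulate(owls_nb1, nsim = 5, seed = 1)
length(simo5)                       ## 5
length(simo5[[1]]) == nrow(Owls)    ## TRUE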
Then we can plot the simulated data against the observed data to check if they are similar.
ggplot(Dat, aes(NegPerChick, colour=type))+geom_density()+facet_grid(FoodTreatment~SexParent)
But what if you want to simulate data with specified parameters in the absence of a data set, for example for a power analysis?
glmmTMB has a simulate_new function that can handle this case; the hardest part is understanding the meaning of the parameter values, especially for random-effects covariances.
For the first example we’ll simulate something that looks like the classic “sleep study” data, using the sleepstudy data set for structure and covariates. The conditional fixed-effect parameters (beta) are standard regression parameters (intercept and slope): we use 250 and 10, which are close to the values from the actual data. The only other parameter, betadisp, is the log of the dispersion parameter, which in the specific case of the Gaussian (default) family is the log of the conditional (residual) variance; the standard deviation from a simple regression on these data [1] is approximately 50, so we use 2*log(50).
data("sleepstudy", package = "lme4")
set.seed(101)
<- transform(sleepstudy,
ss_sim Reaction = simulate_new(~ Days,
newdata = sleepstudy,
family = gaussian,
newparams = list(beta = c(250, 10),
betadisp = 2*log(50)))[[1]])
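As a quick check of that footnoted claim (a one-line sketch using the data we just loaded), the residual standard deviation from a simple linear regression is indeed close to 50:
sigma(lm(Reaction ~ Days, sleepstudy))  ## approximately 47.7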
For comparison, we’ll also fit the model and use the built-in simulation method:
ss_fit <- glmmTMB(Reaction ~ Days, sleepstudy)
ss_simlm <- transform(sleepstudy,
                      Reaction = simulate(ss_fit)[[1]])
Comparing against the real data set:
library(ggplot2); theme_set(theme_bw())
ss_comb <- rbind(data.frame(sleepstudy, sample = "real"),
                 data.frame(ss_sim, sample = "simulated"),
                 data.frame(ss_simlm, sample = "simulated (from fit)"))
ggplot(ss_comb, aes(x = Days, y = Reaction, colour = Subject)) +
    geom_line() +
    facet_wrap(~sample)
The simulated data have about the right variability but, in contrast to the real data, have no among-subject variation.
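One way to close that gap (a sketch, not from the original analysis; the parameter values are rough guesses based on typical fits to the sleepstudy data) would be to add a random intercept for Subject. For a Gaussian model with a single random intercept, theta is the log of its standard deviation:
## hypothetical extension: among-subject intercept SD of ~35,
## residual SD reduced to ~30 (betadisp is the log residual *variance*)
ss_sim_re <- transform(sleepstudy,
    Reaction = simulate_new(~ Days + (1 | Subject),
                            newdata = sleepstudy,
                            seed = 101,
                            family = gaussian,
                            newparams = list(beta = c(250, 10),
                                             betadisp = 2*log(30),
                                             theta = log(35)))[[1]])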
The next example will be more complex, getting into the nuts and bolts of how to translate random-effects covariances into the terms that glmmTMB expects.
The hardest piece is probably translating correlation parameters. The “covariance structures” vignette has more details on how correlation matrices are parameterized, and the put_cor() function is a general translator, but for the specific case of 2x2 correlation matrices (i.e. with a single correlation parameter), a correlation \(\rho\) corresponds to a glmmTMB parameter of \(\rho/\sqrt{1-\rho^2}\). Here’s a utility function:
rho_to_theta <- function(rho) rho/sqrt(1-rho^2)
## tests
stopifnot(all.equal(get_cor(rho_to_theta(-0.2)), -0.2))
stopifnot(all.equal(rho_to_theta(-0.2), put_cor(matrix(c(1, -0.2, -0.2, 1), 2))))
Setting up metadata/covariates (tools in the faux package may also be useful for this):
n_id <- 10
dd <- expand.grid(trt = factor(c("A", "B")),
                  id = factor(1:n_id),
                  time = 1:6)
We’ll set up some reasonable fixed effects. When in doubt about the order of fixed-effect parameters, use model.matrix() to check:
form1 <- ~ trt*time + (1 + time | id)
colnames(model.matrix(lme4::nobars(form1), data = dd))
## [1] "(Intercept)" "trtB"        "time"        "trtB:time"
## intercept, trtB effect, slope, trt x slope interaction
beta_vec <- c(1, 2, 0.1, 0.2)
We’ll set SDs such that the average coefficient of variation is 1 (SD = mean for among-subject variation in intercept and slope). As described in the “covstruct” vignette, the parameter vector for a random-effect covariance consists of the log standard deviations followed by the scaled correlations. Finally, we’ll set the dispersion parameter for the negative binomial conditional distribution to 1 (more detail on the betadisp parameterization for different families is given in ?sigma.glmmTMB).
sdvec <- c(1.5, 0.15)
corval <- -0.2
thetavec <- c(log(sdvec), rho_to_theta(corval))
par1 <- list(beta = beta_vec,
             betadisp = log(1), ## log(theta)
             theta = thetavec)
Now simulate:
dd$y <- simulate_new(form1,
                     newdata = dd,
                     seed = 101,
                     family = nbinom2,
                     newparams = par1)[[1]]
## Warning in glmmTMB(form, data = newdata, family = family, ..., doFit = FALSE):
## non-integer counts in a nbinom2 model
range(dd$y)
## [1] 0 598
For comparison, we’ll do this by hand (with some help from lme4 machinery). lme4 parameterizes covariance matrices by the lower triangle of the Cholesky factor rather than using glmmTMB’s method …
library(lme4)
## Loading required package: Matrix
set.seed(101)
X <- model.matrix(nobars(form1), data = dd)
## generate random-effects values
rt <- mkReTrms(findbars(form1),
               model.frame(subbars(form1), data = dd))
Z <- t(rt$Zt)
## construct covariance matrix
Sigma <- diag(sdvec) %*% matrix(c(1, corval, corval, 1), 2) %*% diag(sdvec)
lmer_thetavec <- t(chol(Sigma))[c(1, 2, 4)]
## plug values into Cholesky factor of random-effects covariance matrix
rt$Lambdat@x <- lmer_thetavec[rt$Lind]
u <- rnorm(nrow(rt$Lambdat))
b <- t(rt$Lambdat) %*% u
eta <- drop(X %*% par1$beta + t(rt$Zt) %*% b)
mu <- exp(eta)
y <- rnbinom(nrow(dd), size = 1, mu = mu)
range(y)  ## range varies a lot
## [1] 0 1484
Alternatively we could have generated the random effects with:
b <- MASS::mvrnorm(1, mu = rep(0, 2*n_id),
                   Sigma = Matrix::.bdiag(replicate(n_id, Sigma, simplify = FALSE)))
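As a sanity check (a sketch, assuming the two constructions should agree), the covariance implied by the plugged-in Cholesky factor matches this block-diagonal matrix:
## t(Lambdat) is the lower-triangular factor Lambda, so Lambda %*% t(Lambda)
## should reproduce the block-diagonal covariance of the random effects
Lambda <- t(rt$Lambdat)
all.equal(as.matrix(Lambda %*% t(Lambda)),
          as.matrix(Matrix::.bdiag(replicate(n_id, Sigma, simplify = FALSE))),
          check.attributes = FALSE)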
We will simulate a single time series with AR1 structure, with a nugget (measurement error) variance \(\sigma^2_n = 1.0\), an autoregressive variance of \(\sigma^2_a = 1\), and an autoregressive parameter of \(\phi = 0.7\).
First by brute force, using the code from the “covariance structures” vignette:
set.seed(101)
n <- 1000  ## number of time points
## simulate the AR1 process using the MASS package;
## as.matrix(dist(1:n)) constructs a matrix with m_{ij} = abs(i-j)
x <- MASS::mvrnorm(mu = rep(0, n),
                   Sigma = 0.7^as.matrix(dist(1:n)))
y <- x + rnorm(n)  ## add measurement noise/nugget
dat0 <- data.frame(y,
                   times = factor(1:n, levels = 1:n),
                   group = factor(rep(1, n)))
Now using simulate_new() with matching parameters: beta = 0 (the only fixed effect is the intercept, which we set to zero); betadisp = 0 (the log variance of the measurement error; note that the Gaussian family uses a log-variance rather than a log-SD parameterization, although in this case it doesn’t make any difference …); theta[1] = 0 (the log standard deviation of the autoregressive term); and theta[2] specifying a correlation parameter \(\phi = 0.7\) (see the “Covariance structures” vignette for details).
phi_to_AR1 <- function(phi) phi/sqrt(1-phi^2)
s2 <- simulate_new(~ ar1(times + 0 | group),
                   newdata = dat0,
                   seed = 101,
                   newparams = list(
                       beta = 0,
                       betadisp = 0,
                       theta = c(0, phi_to_AR1(0.7))))[[1]]
With a nugget variance of \(\sigma^2_n = 1.0\), an autoregressive variance of \(\sigma^2_a = 1\), and an autoregressive parameter of \(\phi = 0.7\), we expect the ACF at lag \(d \ge 1\) to be \(\sigma^2_a/(\sigma^2_a + \sigma^2_n) \cdot \phi^d = 0.7^d/2\).
a1 <- drop(acf(dat0$y, plot = FALSE)$acf)
a2 <- drop(acf(s2, plot = FALSE)$acf)
par(las = 1, bty = "l")
matplot(0:(length(a1)-1), cbind(a1, a2), pch = 1,
        ylab = "autocorrelation", xlab = "lag")
curve(0.7^x/2, add = TRUE, col = 4, lwd = 2)
The precise curves are different (because the multivariate normal deviates are generated in a different way), but the ACFs match.
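A rough numerical check (a short sketch; note that a2[1] is the lag-0 autocorrelation, so lags 1 to 3 are elements 2 to 4):
## compare the theoretical ACF 0.7^d/2 with the simulated values
cbind(lag = 1:3,
      theoretical = 0.7^(1:3)/2,
      simulated = a2[2:4])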
To do: more examples beyond what lme4 can handle (spatial, ZI, etc.); if necessary, more details on parameterizations (shape/scale for spatial covariance structures, etc.).

[1] I realize this violates the assumption of de novo simulation, i.e. that we don’t know what the real data look like yet …