Here we describe the exact model
definition as well as the estimation algorithms in detail. After reading
through this vignette, you can follow the implementation of the
algorithm in mmrm.cpp
and the covariance structures in
covariance.h
in the src
directory of this
package.
The mixed model for repeated measures (MMRM) definition we are using in this package is the following. Let $i = 1, \dotsc, n$ denote the subjects from which we observe multiple observations $j = 1, \dotsc, m_i$ at a total of $m_i$ time points $t_{ij} \in \{t_1, \dotsc, t_m\}$. Note that the number of time points for a specific subject, $m_i$, can be smaller than $m$, when only a subset of the possible $m$ time points has been observed.
For each subject $i$ we observe a vector $Y_i = (y_{i1}, \dotsc, y_{im_i})^\top \in \mathbb{R}^{m_i}$, and given a design matrix $X_i \in \mathbb{R}^{m_i \times p}$ and a corresponding coefficient vector $\beta \in \mathbb{R}^{p}$ we assume that the observations are multivariate normally distributed: $Y_i \sim N(X_i\beta, \Sigma_i)$, where the covariance matrix $\Sigma_i \in \mathbb{R}^{m_i \times m_i}$ is derived by subsetting the overall covariance matrix $\Sigma \in \mathbb{R}^{m \times m}$ appropriately via $$ \Sigma_i = G_i^{-1/2} S_i^\top \Sigma S_i G_i^{-1/2}, $$ where the subsetting matrix $S_i \in \{0, 1\}^{m \times m_i}$ contains in each of its $m_i$ columns a single 1 indicating which overall time point matches $t_{ij}$. Each row contains at most a single 1, but can also contain only 0 entries if this time point was not observed. For example, assume a subject was observed at time points 1, 3 and 4 out of a total of 5; then the subsetting matrix is $$ S_i = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}. $$ Here $G_i \in \mathbb{R}_{>0}^{m_i \times m_i}$ is the diagonal weight matrix, which is the identity matrix if no weights are specified. Note that this result follows from the well-known property of the multivariate normal distribution that linear combinations of the random vector again have a multivariate normal distribution, with the correspondingly modified mean vector and covariance matrix.
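To make the subsetting concrete, here is a small R sketch (illustrative only, not package code; the objects `S_i`, `Sigma`, `G_i` and their values are made up) that builds the subsetting matrix from the example above and derives $\Sigma_i$ from a full $5 \times 5$ covariance matrix with diagonal weights:

```r
# Illustrative sketch only: subset a 5x5 covariance matrix Sigma to the
# time points 1, 3 and 4 observed for one subject, with diagonal weights G_i.
m <- 5
observed <- c(1, 3, 4)                           # observed time point indices
S_i <- matrix(0, nrow = m, ncol = length(observed))
S_i[cbind(observed, seq_along(observed))] <- 1   # a single 1 per column

Sigma <- 0.5^abs(outer(1:m, 1:m, "-"))           # some valid overall covariance matrix
G_i <- diag(c(1, 2, 4))                          # made-up subject-specific weights

# Sigma_i = G_i^{-1/2} S_i' Sigma S_i G_i^{-1/2}
G_i_inv_sqrt <- diag(1 / sqrt(diag(G_i)))
Sigma_i <- G_i_inv_sqrt %*% t(S_i) %*% Sigma %*% S_i %*% G_i_inv_sqrt
Sigma_i
```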
Conditional on the design matrices $X_i$, the coefficient vector $\beta$ and the covariance matrix $\Sigma$, we assume that the observations are independent between the subjects.
We can write the linear model for all subjects together as $Y = X\beta + \epsilon$, where $Y \in \mathbb{R}^N$ combines all subject-specific observation vectors $Y_i$ such that we have in total $N = \sum_{i = 1}^{n}{m_i}$ observations, $X \in \mathbb{R}^{N \times p}$ combines all subject-specific design matrices, and $\epsilon \in \mathbb{R}^N$ has a multivariate normal distribution $\epsilon \sim N(0, \Omega)$, where $\Omega \in \mathbb{R}^{N \times N}$ is block-diagonal, containing the subject-specific covariance matrices $\Sigma_i$ on the diagonal and 0 in the remaining entries.
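As a small illustration of this stacked model, the following sketch (with made-up subject-level blocks) assembles the block-diagonal $\Omega$ using the Matrix package:

```r
# Illustrative sketch: Omega is block-diagonal with the subject-specific
# covariance matrices Sigma_i on the diagonal (two made-up subjects here).
library(Matrix)
Sigma_1 <- 0.8^abs(outer(1:3, 1:3, "-"))   # subject 1 has 3 time points
Sigma_2 <- 0.8^abs(outer(1:2, 1:2, "-"))   # subject 2 has 2 time points
Omega <- bdiag(Sigma_1, Sigma_2)           # 5 x 5 block-diagonal matrix
Omega
```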
The symmetric and positive definite covariance matrix $$ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \dots & \dots & \sigma_{1m} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \dots & \sigma_{2m}\\ \vdots & & \ddots & & \vdots \\ \vdots & & & \ddots & \vdots \\ \sigma_{m1} & \dots & \dots & \sigma_{m,m-1} & \sigma_m^2 \end{pmatrix} $$ is parametrized by a vector of variance parameters $\theta = (\theta_1, \dotsc, \theta_k)^\top$. There are many different choices for how to model the covariance matrix, and correspondingly $\theta$ has different interpretations. Since any such covariance matrix has a unique Cholesky factorization $\Sigma = LL^\top$, where $L$ is the lower triangular Cholesky factor, we are going to use this below.
The most general model uses a saturated parametrization, i.e. any covariance matrix could be represented in this form. Here we use $L = D\tilde{L}$, where $D$ is the diagonal matrix of standard deviations and $\tilde{L}$ is a unit diagonal lower triangular matrix. Hence we start $\theta$ with the natural logarithms of the standard deviations, followed by the row-wise filled entries of $\tilde{L} = \{l_{ij}\}_{1 \leq j < i \leq m}$: $$ \theta = (\log(\sigma_1), \dotsc, \log(\sigma_m), l_{21}, l_{31}, l_{32}, \dotsc, l_{m,m-1})^\top $$ Here $\theta$ has $k = m(m+1)/2$ entries. For example, for $m = 4$ time points we need $k = 10$ variance parameters to model the unstructured covariance matrix.
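The following R sketch (illustrative only; the actual mapping is implemented in the C++ covariance code, and the helper name `theta_to_sigma_unstructured()` is made up) shows one way to construct an unstructured $\Sigma$ from such a $\theta$ for $m = 4$, i.e. $k = 10$:

```r
# Illustrative sketch: map theta = (log standard deviations, row-wise entries
# of the unit lower triangular matrix L-tilde) to the unstructured Sigma.
theta_to_sigma_unstructured <- function(theta, m) {
  stopifnot(length(theta) == m * (m + 1) / 2)
  D <- diag(exp(theta[1:m]))                   # diagonal matrix of standard deviations
  L_tilde <- diag(m)                           # unit diagonal lower triangular matrix
  idx <- which(lower.tri(L_tilde), arr.ind = TRUE)
  idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]  # row-wise order
  L_tilde[idx] <- theta[-(1:m)]
  L <- D %*% L_tilde                           # L = D L-tilde
  L %*% t(L)                                   # Sigma = L L'
}

theta <- c(log(c(1, 1.2, 1.5, 2)), 0.5, 0.3, 0.2, 0.1, -0.1, 0.4)  # k = 10
theta_to_sigma_unstructured(theta, m = 4)
```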
Other covariance matrix choices are explained in the covariance structures vignette.
In some cases, we would like to estimate separate covariance matrices across groups, while keeping the covariance structure (unstructured, ante-dependence, Toeplitz, etc.) consistent across groups. Following the notation in the previous section, for subject $i$ in group $g(i)$ we have

$$ \Sigma_i = S_i^\top \Sigma_{g(i)} S_i $$

where $g(i)$ is the group of subject $i$ and $\Sigma_{g(i)}$ is the covariance matrix of group $g(i)$.

The parametrization of $\theta$ is similar to the non-grouped case. Assume that there are $G$ groups in total; then the length of $\theta$ is multiplied by $G$, and each part of $\theta$ is parametrized in the same fashion as before. For example, for an unstructured covariance matrix, $\theta$ has $k = G \cdot m(m+1)/2$ entries.
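As a sketch of the grouped parametrization, the following splits a stacked $\theta$ into per-group pieces and maps each piece to a covariance matrix, reusing the hypothetical `theta_to_sigma_unstructured()` helper from the previous sketch (the number of groups and the values are made up):

```r
# Illustrative sketch: with G groups, theta stacks one per-group parameter
# vector after another, each parametrized exactly as in the non-grouped case.
G <- 2
m <- 4
k_per_group <- m * (m + 1) / 2
theta_grouped <- rnorm(G * k_per_group)                  # made-up values
theta_split <- split(theta_grouped, rep(seq_len(G), each = k_per_group))
Sigma_by_group <- lapply(theta_split, theta_to_sigma_unstructured, m = m)
```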
A spatial covariance structure can model individual-specific visit times. An individual’s covariance matrix is then a function of both the population-level covariance parameters (specific to the chosen structure) and the individual’s visit times. Following the notation in the previous section, for subject $i$ with a total of $m_i$ visits, the $(j, k)$ element of $\Sigma_i$ is

$$ \sigma_{ijk} = \sigma \, f(d(c_{ij}, c_{ik})) $$

i.e. a function of the distance $d(c_{ij}, c_{ik})$ between the $j$th and $k$th visits of subject $i$, where $c_{ij}$ is the coordinate (time) of the $j$th visit and $\sigma$ is the constant variance. Usually the Euclidean distance is used.

Currently only the spatial exponential covariance structure is implemented. For coordinates with multiple dimensions, the Euclidean distance is used without transformations.
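The following sketch illustrates the idea, assuming for illustration an exponential decay $f(d) = \rho^d$ with $\rho \in (0, 1)$; the function name and parameter values are made up, and the exact parametrization used by the package is described in the covariance structures vignette:

```r
# Illustrative sketch of a spatial exponential structure: the covariance of two
# visits decays with the Euclidean distance between their coordinates. The
# parametrization (sigma, rho) is for illustration only.
spatial_exponential_cov <- function(coords, sigma, rho) {
  d <- as.matrix(dist(coords))       # Euclidean distances between visit coordinates
  sigma * rho^d                      # element (j, k) is sigma * f(d(c_ij, c_ik))
}

coords_i <- matrix(c(0, 1.5, 4), ncol = 1)   # visit times of one subject
spatial_exponential_cov(coords_i, sigma = 2, rho = 0.7)
```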
Given the general linear model above, and conditional on $\theta$, we know that the likelihood for
$\beta$ is $$
L(\beta; Y) = (2\pi)^{-N/2} \det(\Omega)^{-1/2}
\exp\left\{
- \frac{1}{2}(Y - X\beta)^\top \Omega^{-1} (Y - X\beta)
\right\}
$$ and we also know that the maximum likelihood (ML) estimate of
$\beta$ is the weighted least squares estimator $\hat{\beta}$ solving
the estimating equation $$ (X^\top \Omega^{-1} X) \hat{\beta} = X^\top \Omega^{-1} Y. $$
Plugging $\hat{\beta}$ into the likelihood above then gives the value of the function we want to
maximize with respect to the variance parameters $\theta$. Practically this will be done on
the negative log scale: $$
f(\theta; \hat{\beta}) = - \log L(\hat{\beta}; Y) = \frac{N}{2}
\log(2\pi) +
\frac{1}{2}\log\det(\Omega) +
\frac{1}{2} (Y - X\hat{\beta})^\top \Omega^{-1} (Y - X\hat{\beta})
$$ The objective function $f(\theta; \hat{\beta})$ is then
minimized with numerical optimizers utilizing quasi-Newton
algorithms based on the gradient (or additionally the Hessian, see the optimizer documentation). Here the use of the
Template Model Builder package TMB
is helpful because
TMB
allows us to perform the calculations in C++, which
maximizes the speed. TMB
performs automatic differentiation of the objective
function with respect to the variance parameters $\theta$, so that the gradient and the Hessian do
not have to be approximated numerically or coded explicitly.

Let’s have a look at the details of calculating the log-likelihood above, including in particular the weighted least squares (WLS) estimator $\hat{\beta}$.
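Before going through those details, the following toy example illustrates the general optimization pattern just described: a negative log-likelihood minimized over a single variance parameter by a quasi-Newton optimizer with an analytic gradient. This is not the MMRM objective itself; for that objective, TMB provides the gradient via automatic differentiation.

```r
# Toy, self-contained illustration of the optimization pattern: minimize a
# negative log-likelihood over a variance parameter theta with a quasi-Newton
# optimizer, supplying the analytic gradient.
set.seed(42)
y <- rnorm(100, mean = 0, sd = 2)
n <- length(y)

neg_log_lik <- function(theta) {     # model: y_i ~ N(0, exp(theta)^2)
  n / 2 * log(2 * pi) + n * theta + sum(y^2) / (2 * exp(2 * theta))
}
neg_log_lik_grad <- function(theta) {
  n - sum(y^2) * exp(-2 * theta)
}

fit <- nlminb(start = 0, objective = neg_log_lik, gradient = neg_log_lik_grad)
exp(fit$par)                         # ML estimate of the standard deviation
```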
The starting point is the estimating equation above and the observation that both its left and right hand sides can be decomposed into subject-specific terms, given the block-diagonal structure of $\Omega$ and therefore of its inverse $W = \Omega^{-1}$: $$ X^\top \Omega^{-1} X = X^\top W X = \sum_{i=1}^{n} X_i^\top W_i X_i $$ and similarly $$ X^\top \Omega^{-1} Y = X^\top W Y = \sum_{i=1}^{n} X_i^\top W_i Y_i $$ where $W_i = \Sigma_i^{-1}$ is the weight matrix for subject $i$, the inverse of its covariance matrix.

Instead of calculating this inverse explicitly, it is numerically preferable to work with the Cholesky factorization and to solve linear equations. Here we calculate the factorization $\Sigma_i = L_i L_i^\top$. Note that in the case where $m_i = m$, i.e. this subject has all time points observed, then $\Sigma_i = \Sigma$ and we don’t need to calculate this factorization again, because we already have $\Sigma = LL^\top$, i.e. $L_i = L$. Unfortunately, if $m_i < m$, then we need to calculate it explicitly, as there is no way to update the Cholesky factorization for a subset operation $\Sigma_i = S_i^\top \Sigma S_i$ as we have above. Given $L_i$, we solve $L_i \tilde{X}_i = X_i$ for $\tilde{X}_i$ with an efficient forward solve, and similarly we solve $L_i \tilde{Y}_i = Y_i$ for $\tilde{Y}_i$. Therefore we have $X_i^\top W_i X_i = \tilde{X}_i^\top \tilde{X}_i$ and $X_i^\top W_i Y_i = \tilde{X}_i^\top \tilde{Y}_i$, and we can thereby calculate the left and right hand sides of the WLS estimating equation. We solve this equation with a robust Cholesky decomposition with pivoting. The advantage is that we can reuse this decomposition for calculating the covariance matrix of $\hat{\beta}$, i.e. $K = (X^\top W X)^{-1}$, by supplying the identity matrix as an alternative right hand side.
For the objective function we also need the log determinant of $\Omega$: $$ \log\det(\Omega) = \sum_{i=1}^{n} \log\det(\Sigma_i) = \sum_{i=1}^{n} \log\det(L_i L_i^\top) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} 2 \log(l_{i,jj}) $$ where $l_{i,jj}$ are the diagonal entries of the factor $L_i$, and we have used that the determinant of the block-diagonal matrix $\Omega$ is the product of the determinants of its diagonal blocks $\Sigma_i$, and that the determinant of a triangular matrix is the product of its diagonal entries.
And finally, for the quadratic form we can reuse the weighted response vector and design matrix: $$ (Y - X\hat{\beta})^\top \Omega^{-1} (Y - X\hat{\beta}) = \sum_{i=1}^{n} (Y_i - X_i\hat{\beta})^\top W_i (Y_i - X_i\hat{\beta}) = \sum_{i=1}^{n} (\tilde{Y}_i - \tilde{X}_i\hat{\beta})^\top (\tilde{Y}_i - \tilde{X}_i\hat{\beta}) $$
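To tie these pieces together, here is a self-contained numerical sketch (simulated data and a made-up $\Sigma$, not the package’s C++/TMB implementation) of the per-subject computations for a fixed value of the variance parameters: Cholesky factors, forward solves, the WLS estimate $\hat{\beta}$, and finally the ML objective $f(\theta; \hat{\beta})$.

```r
# Illustrative sketch (simulated data, made-up Sigma): per-subject Cholesky
# factors and forward solves, the WLS estimate beta-hat, and the ML objective
# f(theta; beta-hat) for a fixed value of the variance parameters.
set.seed(1)
m <- 3
Sigma <- 2 * 0.6^abs(outer(1:m, 1:m, "-"))     # overall covariance for the given theta
beta_true <- c(1, -0.5)

# Four subjects; subject 2 misses time point 3, subject 4 misses time point 1.
visits <- list(1:3, 1:2, 1:3, 2:3)
subjects <- lapply(visits, function(v) {
  X_i <- cbind(1, v)                           # intercept and time as covariates
  L_i <- t(chol(Sigma[v, v, drop = FALSE]))    # lower triangular Cholesky factor of Sigma_i
  Y_i <- drop(X_i %*% beta_true) + drop(L_i %*% rnorm(length(v)))
  list(X = X_i, Y = Y_i, L = L_i)
})

p <- length(beta_true)
lhs <- matrix(0, p, p)                         # accumulates X_i' W_i X_i
rhs <- numeric(p)                              # accumulates X_i' W_i Y_i
log_det_omega <- 0
N <- 0
for (s in subjects) {
  X_tilde <- forwardsolve(s$L, s$X)            # solves L_i X_tilde_i = X_i
  Y_tilde <- forwardsolve(s$L, s$Y)            # solves L_i Y_tilde_i = Y_i
  lhs <- lhs + crossprod(X_tilde)
  rhs <- rhs + drop(crossprod(X_tilde, Y_tilde))
  log_det_omega <- log_det_omega + 2 * sum(log(diag(s$L)))
  N <- N + length(s$Y)
}
# WLS estimating equation (the package uses a robust, pivoted Cholesky here).
beta_hat <- drop(solve(lhs, rhs))

quad_form <- sum(vapply(subjects, function(s) {
  resid_tilde <- forwardsolve(s$L, s$Y - drop(s$X %*% beta_hat))
  sum(resid_tilde^2)
}, numeric(1)))

f_ml <- N / 2 * log(2 * pi) + log_det_omega / 2 + quad_form / 2
f_ml
```

In the package, these quantities are recomputed in C++ for each candidate $\theta$ proposed by the optimizer.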
Under the restricted ML estimation (REML) paradigm we first obtain the marginal likelihood of the variance parameters $\theta$ by integrating out the remaining parameters $\beta$ from the likelihood. Here we have: $$ L(\theta; Y) = \int_{\mathbb{R}^p} L(\beta; Y) d\beta = (2\pi)^{-N/2} \det(\Omega)^{-1/2} \int_{\mathbb{R}^p} \exp\left\{ - \frac{1}{2}(Y - X\beta)^\top \Omega^{-1} (Y - X\beta) \right\} d\beta $$ where we note that $\det(\Omega)$ depends on $\theta$ but not on $\beta$ and can therefore be pulled out of the integral.
Let’s focus now on the quadratic form in the exponential function and complete the square with respect to $\beta$ to obtain the kernel of a multivariate normal distribution: $$ (Y - X\beta)^\top \Omega^{-1} (Y - X\beta) = Y^\top W Y - 2 \beta^\top X^\top W Y + \beta^\top X^\top W X \beta = Y^\top W Y - \hat{\beta}^\top K^{-1} \hat{\beta} + (\beta - \hat{\beta})^\top K^{-1} (\beta - \hat{\beta}) $$ where we used $K = (X^\top W X)^{-1}$ together with $X^\top W Y = K^{-1} \hat{\beta}$, and could identify $K$ as the covariance matrix of the multivariate normal kernel in $\beta$ and $\hat{\beta}$ as its mean vector.
With this, we know that the integral of the multivariate normal kernel is the inverse of the normalizing constant, and thus $$ \int_{\mathbb{R}^p} \exp\left\{ - \frac{1}{2}(Y - X\beta)^\top \Omega^{-1} (Y - X\beta) \right\} d\beta = \exp\left\{ -\frac{1}{2} Y^\top \Omega^{-1} Y + \frac{1}{2} \hat{\beta}^\top K^{-1} \hat{\beta} \right\} (2\pi)^{p/2} \det(K)^{1/2} $$ such that the integrated likelihood is $$ L(\theta; Y) = (2\pi)^{-(N-p)/2} \det(\Omega)^{-1/2} \det(K)^{1/2} \exp\left\{ -\frac{1}{2} Y^\top \Omega^{-1} Y + \frac{1}{2} \hat{\beta}^\top K^{-1} \hat{\beta} \right\}. $$
As the objective function, which we want to minimize with respect to the variance parameters $\theta$, we again take the negative natural logarithm $$ f(\theta) = -\log L(\theta;Y) = \frac{N-p}{2} \log(2\pi) + \frac{1}{2}\log\det(\Omega) - \frac{1}{2}\log\det(K) + \frac{1}{2} \tilde{Y}^\top \tilde{Y} - \frac{1}{2} \hat{\beta}^\top \tilde{X}^\top \tilde{X} \hat{\beta} $$ It is interesting to see that computing the REML objective function requires only a few additional calculations compared to the ML objective function. In particular, since we already have the matrix decomposition of $K^{-1}$, it is very easy to obtain its determinant.
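Continuing the ML sketch from above (and therefore reusing its objects `lhs`, `N`, `p`, `log_det_omega` and `quad_form`), the REML objective needs only the additional log determinant of $K$:

```r
# Continuing the ML sketch above: the REML objective additionally needs
# log det(K), where K^{-1} = X' W X is the accumulated matrix `lhs`. Note that
# Y_tilde' Y_tilde - beta_hat' X_tilde' X_tilde beta_hat equals the residual
# quadratic form `quad_form` already computed, because beta_hat solves the
# estimating equation.
log_det_k <- -as.numeric(determinant(lhs, logarithm = TRUE)$modulus)
f_reml <- (N - p) / 2 * log(2 * pi) + log_det_omega / 2 - log_det_k / 2 + quad_form / 2
f_reml
```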
Here, too, we use numerical optimization of $f(\theta)$, and the TMB library supports this efficiently through automatic differentiation.