Multidimensional CFA with RStan

For starters, here’s the Stan code. And here’s the R script: Stan code. If you are already familiar with RStan, the basic concepts you need to combine are standard multilevel models with correlated random slopes and heteroskedastic errors.

I will embed R code into the demonstration. The required packages are lavaan, lme4 and RStan.

I like to understand most statistical methods as regression models. This way, it’s easy to understand the claims underlying a large number of techniques. It’s an approach that works for multilevel, SEM, and IRT models. Here, I’ll be focusing on confirmatory factor analysis (CFA), so I’ll begin by developing CFA from a model that is easy to fit in any multilevel regression software:

$$ y_{pi} = \theta_p + \beta_i + \epsilon $$

where $y_{pi}$ is the response of person $p$ to item $i$, $\theta_p$ is the random intercept that varies by person and is assumed to be normal with mean zero and standard deviation $\alpha$ $\big(\mathcal{N}(0, \alpha^2)\big)$, $\beta_i$ are the coefficients of the item dummies, and $\epsilon$ is the residual (level 1) error assumed normal with mean zero and standard deviation $\sigma,\ \mathcal{N}(0, \sigma^2)$.

Anyone familiar with multilevel regression can fit this basic random-intercept model. Shape your person by item data to “long form” where each row contains a person ID, an item ID and the item response. Make the item response the outcome, make each item into a dummy variable as predictors (at level 1) and place a random intercept on the person. Your random intercept standard deviation is $\alpha$, residual standard deviation (or level one error) is $\sigma$ and the item dummies are $\beta_i$. If your software permits, remove the fixed intercept and use all item dummies. Finally, $\theta_p$ represents the latent score for each individual $p$.

To demonstrate, I use the HolzingerSwineford dataset. There are 9 items, items 1 to 3 load on factor 1, 4 to 6 on factor 2, and 7 to 9 on factor 3.

library(lavaan)

dat <- HolzingerSwineford1939
dat$sex <- dat$sex - 1 # Who is 1, no idea
dat$grade <- dat$grade - min(dat$grade, na.rm = TRUE)
dat$ID <- 1:nrow(dat)

# Make data long
dat.l <- tidyr::gather(dat, item, score, x1:x9)
dat.l$item.no <- as.integer(gsub("x", "", dat.l$item))

library(lme4)

lmer(score ~ 0 + factor(item.no) + (1 | ID), dat.l, REML = FALSE)

# Random effects:
# Groups   Name        Std.Dev.
# ID       (Intercept) 0.5758  
# Residual             0.9694  
# Number of obs: 2709, groups:  ID, 301

The model above fit with ML not REML is identical to a unidimensional CFA where we constrain all the item loadings to be the same value and the item errors to be the same value. The random intercept standard deviation, $\alpha$, is the factor loading for all items and $\sigma^2$ is the item error variance for all items. To get the standardized loading which we will call $\lambda$, use:

$$\lambda = \frac{\alpha}{\sqrt{\alpha^2+\sigma^2}} = \frac{0.5758}{\sqrt{0.5758^2+0.9694^2}}=0.5107$$

The lavaan model below should confirm this to be true. Note that in the lavaan syntax, the factor is standardized to have variance of 1 using std.lv = TRUE. The item dummies (not shown) from lme4 are also the CFA item intercepts.

parameterEstimates(sem(
  "F1 =~ a * x1 + a * x2 + a * x3 + a * x4 + a * x5 + a * x6 + a * x7 + a * x8 + a * x9\n
  x1 ~~ f * x1\nx2 ~~ f * x2\nx3 ~~ f * x3\nx4 ~~ f * x4\n
  x5 ~~ f * x5\nx6 ~~ f * x6\nx7 ~~ f * x7\nx8 ~~ f * x8\nx9 ~~ f * x9",
  dat, std.lv = TRUE
), standardized = TRUE)[c(1:2, 10:11), c(1:5, 12)]

#    lhs op rhs label   est std.all
# 1   F1 =~  x1     a 0.576   0.511
# 2   F1 =~  x2     a 0.576   0.511
# 10  x1 ~~  x1     f 0.940   0.739
# 11  x2 ~~  x2     f 0.940   0.739

The standardized loading has a similar formula to the (residual) intraclass correlation coefficient (ICC) which is $\frac{\alpha^2}{\alpha^2+\sigma^2}$. So we have that $\lambda=\sqrt{\mathrm{ICC}}$. This is in line with thinking of standardized loadings as correlation coefficients and ICCs as $R^2$.

Now, everyone familiar with CFA knows we almost never constrain all item loadings and item errors to have the same value (for what it’s worth, this is what we do in the Rasch model with binary outcomes). But we will maintain the CFA as regression for as long as possible. Let us extend the model to include multiple factors. To include multiple factors, we create an indicator column in long form that uniquely identifies the factor an item belongs to. And instead of using a single random intercept, we use the factor dummies as random slopes without the random intercept. In this model, we assume:

$$ y_{pi} = \mathbf{\theta}_{pf} + \beta_i + \epsilon $$

where $\mathbf{\theta}_{pf}$ instead of being the same at all times for any individual now also depends on the specific factor $f$. Precisely, we assume the latent scores are multivariate normal, $\mathbf{\theta}_{pf}\sim\mathcal{N}_f(\mathbf{0,\Sigma})$, where $f$ is the number of factors, and $\mathbf{\Sigma}$ represents the covariance matrix of the random slopes. The square root of the diagonal of this covariance matrix (i.e. standard deviations of random slopes) will again be the unstandardized factor loadings, $\alpha_f$. The off-diagonal covariances when standardized (i.e. correlations) will be the interfactor correlations. We can again fit this model in lme4:

# Assign item to factors
dat.l$Fs <- ((dat.l$item.no - 1) %/% 3) + 1

lmer(score ~ 0 + factor(item) + (0 + factor(Fs) | ID), dat.l, REML = FALSE)

# Random effects:
#  Groups   Name        Std.Dev. Corr     
#  ID       factor(Fs)1 0.7465            
#           factor(Fs)2 0.9630   0.41     
#           factor(Fs)3 0.6729   0.38 0.30
#  Residual             0.7909

The corresponding lavaan model is:

parameterEstimates(sem(
  "F1 =~ a * x1 + a * x2 + a * x3\nF2 =~ b * x4 + b * x5 + b * x6\nF3 =~ c * x7 + c * x8 + c * x9\n
  x1 ~~ f * x1\nx2 ~~ f * x2\nx3 ~~ f * x3\nx4 ~~ f * x4\nx5 ~~ f * x5\n
  x6 ~~ f * x6\nx7 ~~ f * x7\nx8 ~~ f * x8\nx9 ~~ f * x9",
  dat, std.lv = TRUE
), standardized = TRUE)[c(1:10, 22:24), c(1:5, 12)]

#    lhs op rhs label   est std.all
# 1   F1 =~  x1     a 0.746   0.686
# 2   F1 =~  x2     a 0.746   0.686
# 3   F1 =~  x3     a 0.746   0.686
# 4   F2 =~  x4     b 0.963   0.773
# 5   F2 =~  x5     b 0.963   0.773
# 6   F2 =~  x6     b 0.963   0.773
# 7   F3 =~  x7     c 0.673   0.648
# 8   F3 =~  x8     c 0.673   0.648
# 9   F3 =~  x9     c 0.673   0.648
# 10  x1 ~~  x1     f 0.626   0.529
# 22  F1 ~~  F2       0.407   0.407
# 23  F1 ~~  F3       0.385   0.385
# 24  F2 ~~  F3       0.301   0.301

We see that the factor loadings in CFA are the random slope standard deviations from multilevel. And the interfactor correlation matrix matches the random slope correlations from multilevel.

The final model we can fit with some, not all, multilevel regression software is:

$$ y_{pi} = \mathbf{\theta}_{pf} + \beta_i + \epsilon_i $$

where we permit the residual error, $\epsilon_i$, to be different depending on the item. This is a heteroskedastic regression model. One would have to switch to from lme4 to nlme or glmmTMB in R to estimate it; it should be possible in HLM. In lavaan, the model syntax would be:

# Drop the error variance constraints
"F1 =~ a * X1 + a * X2 + a * X3\nF2 =~ b * X4 + b * X5 + b * X6\nF3 =~c * X7 + c * X8 + c * X9"

The latest model is very close to a standard CFA model. The final change is that we need to permit the item loadings to vary by item, not by factor. Once we do this, we can no longer fit the model with multilevel regression software. The equation becomes:

$$ y_{pi} = \alpha_i\theta_{pf} + \beta_i + \epsilon_i $$

$\mathbf{\theta}_{pf}$ is still multivariate normal, but with variance $\mathbf{1},\ \mathbf{\theta}_{pf}\sim\mathcal{N}_f(\mathbf{0,1})$. The most interesting inclusion is that we multiply the latent scores by an item coefficient, $\alpha_i$.¹ Note that $\alpha$ consistently represents the unstandardized loadings in this demonstration. Here, instead of being constant or varying by factor, it varies by item.

Bayesian software can fit a complicated model like this. We’d have to specify priors for the different components of this equation. Page 144 of the Stan manual provides some priors for the 2PL item-response theory model that we can apply here as a starting point:

loadings will be log-normal constraining them to be positive, $\alpha_i\sim\mathrm{Lognormal}(0,\sigma_\alpha)$. For $\sigma_\alpha$, we can assume it’s half-Cauchy, $\sigma_\alpha\sim C^{+}(0, 2.5)$.
there is usually enough information to estimate item intercepts/difficulties, we can use a uniform prior
item errors can be assumed to be inverse gamma, $\epsilon_i\sim\mathrm{InvGamma}(1,1)$.
the latent scores are multivariate normal, $\mathbf{\theta}_{pf}\sim\mathcal{N}_f(\mathbf{0,1})$

The one point requiring additional comments is the multivariate normal prior for $\mathbf{\theta}_{pf}$. For a positive-definite correlation matrix, $\mathbf{R}$, we can apply the Cholesky decomposition: $\mathbf{R} = \mathbf{L}\mathbf{L}^*$. In Stan, it is computationally more convenient to apply a prior on $\mathbf{L}$ and generate the multivariate normal latent scores using $\mathbf{L}$. The recommended choice of prior for $\mathbf{L}$ is the LKJ-prior.²

In Stan syntax, the required data are:

data {
  real g_alpha; // for inverse gamma
  real g_beta; // for inverse gamma
  int<lower = 0> N; // scalar, number of person times number of items
  int<lower = 0> Ni; // scalar, number of items
  int<lower = 0> Np; // scalar, number of persons
  int<lower = 0> Nf; // scalar, number of factors
  vector[N] response; // vector, long form of item responses
  // all remaining entries are data in long form
  // with consecutive integers beginning at 1 acting as unique identifiers
  int<lower = 1, upper = Ni> items[N];
  int<lower = 1, upper = Np> persons[N];
  int<lower = 1, upper = Nf> factors[N];
}

The estimated parameters are:

parameters {
  vector<lower = 0>[Ni] item_vars; // item vars heteroskedastic
  real<lower = 0> sigma_alpha; // sd of loadings, hyperparm
  vector<lower = 0>[Ni] alphas; // loadings
  vector[Ni] betas; // item intercepts, default uniform prior
  vector[Nf] thetas[Np]; // person scores for each factor
  cholesky_factor_corr[Nf] L; // Cholesky decomp of corr mat of random slopes
}

We need some transformed parameters to capture the mean and variance of the item responses. This is where we provide the regression equation verbatim:

transformed parameters {
  vector[N] yhat;
  vector[N] item_sds_i;

  for (i in 1:N) {
    yhat[i] = alphas[items[i]] * thetas[persons[i], factors[i]] + betas[items[i]];
    item_sds_i[i] = sqrt(item_vars[items[i]]);
  }
}

And for the priors:

model {
  vector[Nf] A = rep_vector(1, Nf); // Vector of random slope variances
  matrix[Nf, Nf] A0;

  L ~ lkj_corr_cholesky(Nf);
  A0 = diag_pre_multiply(A, L);
  thetas ~ multi_normal_cholesky(rep_vector(0, Nf), A0);

  alphas ~ lognormal(0, sigma_alpha);
  sigma_alpha ~ cauchy(0, 2.5); // hyperparm
  item_vars ~ inv_gamma(g_alpha, g_beta);

  response ~ normal(yhat, item_sds_i);
}

Finally, we can compute the standardized loadings and the interfactor correlation matrix, R:

generated quantities {
  vector<lower = 0>[Ni] loadings_std; // obtain loadings_std
  matrix[Nf, Nf] R;

  R = tcrossprod(L);
  for (i in 1:Ni) {
    loadings_std[i] = alphas[i] / sqrt(square(alphas[i]) + item_vars[i]);
  }
}

We could make a few modifications:

We can standardize the item responses prior to modeling to improve computational stability
Then apply a normal prior to the item intercepts

The syntax to run the model would then be:

# First, let's fit the model in lavaan:
cfa.lav.fit <- sem(
  "F1 =~ x1 + x2 + x3\nF2 =~ x4 + x5 + x6\nF3 =~ x7 + x8 + x9",
  dat, std.lv = TRUE, meanstructure = TRUE)

library(rstan)

options(mc.cores = parallel::detectCores()) # Use multiple cores
rstan_options(auto_write = TRUE) # One time Stan program compilation

cfa.mm <- stan_model(stanc_ret = stanc(file = "bayes_script/cfa.stan")) # Compile Stan code

cfa.stan.fit <- sampling(
  cfa.mm, data = list(
    N = nrow(dat.l), Ni = length(unique(dat.l$item)),
    Np = length(unique(dat.l$ID)), Nf = length(unique(dat.l$Fs)),
    items = dat.l$item.no, factors = dat.l$Fs,
    persons = dat.l$ID, response = dat.l$score,
    g_alpha = 1, g_beta = 1),
  control = list(adapt_delta = .99999, max_treedepth = 15))

What are some loadings?

print(cfa.stan.fit, pars = c("alphas", "loadings_std"),
  probs = c(.025, .5, .975), digits_summary = 3)

#                  mean se_mean    sd  2.5%   50% 97.5% n_eff  Rhat
# alphas[1]       0.889   0.003 0.078 0.733 0.890 1.041   790 1.002
# alphas[4]       0.991   0.002 0.056 0.885 0.988 1.101  1263 1.002
# alphas[5]       1.102   0.002 0.062 0.980 1.102 1.224  1056 1.001
# alphas[9]       0.692   0.003 0.075 0.548 0.692 0.846   799 1.005
# loadings_std[1] 0.751   0.002 0.052 0.643 0.752 0.848   601 1.003
# loadings_std[4] 0.848   0.001 0.023 0.801 0.849 0.890  1275 1.003
# loadings_std[5] 0.851   0.001 0.023 0.803 0.852 0.891  1176 1.001
# loadings_std[9] 0.672   0.003 0.059 0.552 0.673 0.786   556 1.007

# For comparison, the lavaan loadings are:
parameterEstimates(cfa.lav.fit, standardized = TRUE)[1:9, c(1:5, 11)]

#   lhs op rhs   est    se std.all
# 1  F1 =~  x1 0.900 0.081   0.772
# 4  F2 =~  x4 0.990 0.057   0.852
# 5  F2 =~  x5 1.102 0.063   0.855
# 9  F3 =~  x9 0.670 0.065   0.665

For the interfactor correlations:

print(cfa.stan.fit, pars = c("R[1, 2]", "R[1, 3]", "R[2, 3]"),
      probs = c(.025, .5, .975), digits_summary = 3)

#       mean se_mean    sd  2.5%   50% 97.5% n_eff  Rhat
# R[1,2] 0.435   0.001 0.065 0.303 0.437 0.557  2019 0.999
# R[1,3] 0.451   0.003 0.081 0.289 0.450 0.607   733 1.005
# R[2,3] 0.271   0.001 0.071 0.130 0.272 0.406  2599 1.000

# From lavaan:
parameterEstimates(cfa.lav.fit, standardized = TRUE)[22:24, c(1:5, 11)]

#    lhs op rhs   est    se std.all
# 22  F1 ~~  F2 0.459 0.064   0.459
# 23  F1 ~~  F3 0.471 0.073   0.471
# 24  F2 ~~  F3 0.283 0.069   0.283

For the error variances:

print(cfa.stan.fit, pars = c("item_vars"),
      probs = c(.025, .5, .975), digits_summary = 3)

#               mean se_mean    sd  2.5%   50% 97.5% n_eff  Rhat
# item_vars[3] 0.829   0.003 0.095 0.652 0.828 1.026  1292 1.000
# item_vars[4] 0.383   0.001 0.049 0.292 0.381 0.481  1552 1.002
# item_vars[5] 0.459   0.001 0.059 0.351 0.456 0.581  1577 1.001
# item_vars[9] 0.575   0.004 0.085 0.410 0.575 0.739   532 1.008

# From lavaan:
parameterEstimates(cfa.lav.fit, standardized = TRUE)[10:18, 1:5]

#    lhs op rhs   est    se
# 12  x3 ~~  x3 0.844 0.091
# 13  x4 ~~  x4 0.371 0.048
# 14  x5 ~~  x5 0.446 0.058
# 18  x9 ~~  x9 0.566 0.071

For the item intercepts:

print(cfa.stan.fit, pars = c("betas"),
      probs = c(.025, .5, .975), digits_summary = 3)

#         mean se_mean    sd  2.5%   50% 97.5% n_eff  Rhat
# betas[2] 6.087   0.001 0.068 5.954 6.089 6.219  2540 1.001
# betas[3] 2.248   0.001 0.066 2.122 2.248 2.381  1980 1.002
# betas[6] 2.182   0.003 0.063 2.058 2.182 2.302   625 1.008
# betas[7] 4.185   0.002 0.066 4.054 4.186 4.315  1791 1.001

# From lavaan:
parameterEstimates(cfa.lav.fit, standardized = TRUE)[25:33, 1:5]
#    lhs op rhs   est    se
# 26  x2 ~1     6.088 0.068
# 27  x3 ~1     2.250 0.065
# 30  x6 ~1     2.186 0.063
# 31  x7 ~1     4.186 0.063

So we are able to replicate the results from lavaan. From here, you can extend the model in interesting ways to arrive at other results.

For example, if you want to do a regression of the factors, you can use the posterior of the correlation matrix and the solve() function to arrive at the coefficients for the factors in the regression. Here, I regress factor 1 on both factors 2 and 3:

R <- extract(cfa.stan.fit, c("R[1, 2]", "R[1, 3]", "R[2, 3]"))
R <- cbind(R$`R[1,2]`, R$`R[1,3]`, R$`R[2,3]`)
coefs <- matrix(NA, nrow(R), ncol(R) - 1)
for (i in 1:nrow(R)) {
  m <- matrix(c(1, R[i, 3], R[i, 3], 1), 2, 2)
  coefs[i, ] <- solve(m, R[i, 1:2])
}; rm(i, m)
t(apply(coefs, 2, function (x) {
  c(estimate = mean(x), sd = sd(x), quantile(x, c(.025, .25, .5, .75, .975)))
}))
#       estimate         sd      2.5%       25%       50%       75%     97.5%
# [1,] 0.3362981 0.07248634 0.1918812 0.2877936 0.3387682 0.3875141 0.4725508
# [2,] 0.3605951 0.08466494 0.1996710 0.3027047 0.3594806 0.4164141 0.5308578

This makes the model non-linear in its parameters showing why software for generalized linear mixed models cannot fit this model. ↩︎
I’m guessing older software manuals may recommend the inverse-Wishart prior for $\mathbf{\Sigma}$. But see HERE as starting point for comments on LKJ-prior. As a funny side note, LKJ is an acronym for the researchers who recommended this approach, the last author is Harry, last name, Joe. And he developed an earlier form of the method. So in the paper by LKJ, you see references to Joe’s method which seems out of place for an academic paper :). ↩︎

Multidimensional CFA with RStan

James Uanhoro

2018/11/28