Using binary regression software to model ordinal data as a multivariate GLM

James Uanhoro

2018/02/12

I have read that the most common model for analyzing ordinal data is the cumulative link logistic model, coupled with the proportional odds assumption. Essentially, you treat the outcome as if it were the categorical manifestation of a continuous latent variable. Each predictor influences the outcome in only one way, so you get a single regression coefficient per predictor. But the model has several intercepts, representing the points at which the latent variable was cut to create the observed categorical manifestation.

That each predictor variable influences the outcome in only one way - as in commonplace regression models - is the proportional odds assumption or constraint. Alternatively, one can let each predictor influence the outcome differently at each cut-point; this is what it means to relax the proportional odds constraint.
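In the cumulative logit case, the two versions can be written as

$$\text{proportional odds: } \mathrm{logit}\, P(Y_i \le j) = \theta_j - \mathbf{x}_i^\top \boldsymbol\beta, \qquad \text{relaxed: } \mathrm{logit}\, P(Y_i \le j) = \theta_j - \mathbf{x}_i^\top \boldsymbol\beta_j,$$

where the \(\theta_j\) are the cut-points and the relaxed version gives each cut-point its own coefficient vector \(\boldsymbol\beta_j\).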

In Agresti’s Categorical Data Analysis,1 he writes, “McCullagh (1980) and Thompson and Baker (1981) treated cumulative link models as multivariate GLMs” when discussing estimation for cumulative link models. To unpack this claim, I did some extra reading, and it is relatively easy to see how. Consider an ordinal variable that has 4 levels: 1, 2, 3, 4. This ordinal variable can be expressed as three binary variables. There are a variety of ways to do this; I will use the cumulative link parameterization as an example.

Since there are 4 levels, there are three transitions (aka thresholds, cut-points): from 1 to 2, 2 to 3 and 3 to 4. Using the cumulative link parameterization, we generate a new variable per transition point as shown in the table below.

$$\begin{array}{c|ccc} & 1 - 2 & 2 - 3 & 3 - 4\\ \hline 1 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 \\ 3 & 1 & 1 & 0 \\ 4 & 1 & 1 & 1\end{array}$$

A 1 is less than all the cut-points, so becomes three zeros; a 4 exceeds all the cut-points and becomes three ones. Given 4 levels, each respondent now has three responses, one per cut-point, hence the description of this ordinal model as a multivariate GLM. See the bottom of this page for the adjacent categories parameterization.
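A minimal R sketch of this coding for a toy 4-level variable, reproducing the table above:

y <- c(1, 2, 3, 4) # one respondent at each level
# Columns correspond to the 1-2, 2-3 and 3-4 cut-points
sapply(2:4, function(j) as.integer(y >= j))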

How does one model this using univariate GLM software? The UCLA idre page has this article on multivariate random coefficient models. It’s relevant here because they use nlme (univariate linear mixed-model software) to model a multivariate outcome. The essential idea is to stack the data, making it a kind of repeated-measures dataset, while finding a way to signal to the software that the outcomes are different and require different intercepts and slopes for the predictors.

So we transform the data from wide to long and model it as a regular binomial, but we tell the model to estimate a different intercept for each threshold. If we want proportional odds, we estimate a single coefficient for each predictor; if not, we can relax that using interaction terms, as we will see. The remaining challenge is that the outcomes within a person are correlated, so we need a model that can account for this correlation.2 For this, I use generalized estimating equations (GEE) with an unstructured working correlation structure.3

Demonstration

library(ordinal) # For ordinal regression to check our results
library(geepack) # For GEE with binary data

I’ll use the soup dataset from the ordinal package.

soup <- ordinal::soup
soup$ID <- 1:nrow(soup) # Create a person ID variable
str(soup)

'data.frame':	1847 obs. of  13 variables:
 $ RESP    : Factor w/ 185 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ PROD    : Factor w/ 2 levels "Ref","Test": 1 2 1 2 1 2 2 2 2 1 ...
 $ PRODID  : Factor w/ 6 levels "1","2","3","4",..: 1 2 1 3 1 6 2 4 5 1 ...
 $ SURENESS: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 6 5 5 6 5 5 2 5 5 2 ...
 $ DAY     : Factor w/ 2 levels "1","2": 1 1 1 1 2 2 2 2 2 2 ...
 $ SOUPTYPE: Factor w/ 3 levels "Self-made","Canned",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ SOUPFREQ: Factor w/ 3 levels ">1/week","1-4/month",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COLD    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ EASY    : Factor w/ 10 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ GENDER  : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 2 2 2 2 ...
 $ AGEGROUP: Factor w/ 4 levels "18-30","31-40",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ LOCATION: Factor w/ 3 levels "Region 1","Region 2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID      : int  1 2 3 4 5 6 7 8 9 10 ...

I’ll use the SURENESS variable. It has 6 levels. I’ll model it with the DAY and GENDER variables.

# Select variables to work with
soup <- dplyr::select(soup, ID, SURENESS, DAY, GENDER)
# I like dummy variables with recognizable names
soup$girl <- ifelse(soup$GENDER == "Female", 1, 0) # Make male reference group
soup$day2 <- ifelse(soup$DAY == "2", 1, 0) # Make day 1 reference group

The next step is to transform the ordinal outcome into five binary outcomes, one for each of the thresholds:

soup$SURE2 <- ifelse(as.numeric(as.character(soup$SURENESS)) >= 2, 1, 0)
soup$SURE3 <- ifelse(as.numeric(as.character(soup$SURENESS)) >= 3, 1, 0)
soup$SURE4 <- ifelse(as.numeric(as.character(soup$SURENESS)) >= 4, 1, 0)
soup$SURE5 <- ifelse(as.numeric(as.character(soup$SURENESS)) >= 5, 1, 0)
soup$SURE6 <- ifelse(as.numeric(as.character(soup$SURENESS)) >= 6, 1, 0)
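The same five indicators can be built more compactly; a quick sketch of an equivalent loop:

# Equivalent, more compact construction of the threshold indicators
sure.num <- as.numeric(as.character(soup$SURENESS))
for (j in 2:6) soup[[paste0("SURE", j)]] <- ifelse(sure.num >= j, 1, 0)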

With that done, we’re ready for the wide to long transformation of these 5 new outcome variables.

soup.long <- tidyr::gather(soup, SURE, VAL, SURE2:SURE6) # New long data frame
soup.long <- soup.long[order(soup.long$ID), ] # Order the data by person
head(soup.long$SURE) # Currently looks like:

[1] "SURE2" "SURE3" "SURE4" "SURE5" "SURE6" "SURE2"

# I only want 2, 3, 4, ... the final number
soup.long$SURE <- as.numeric(substr(soup.long$SURE, 5, 5))
soup.long$SURE.f <- factor(soup.long$SURE) # Create a factor variable
head(soup.long) # Let's look at the data

     ID SURENESS DAY GENDER girl day2 SURE VAL SURE.f
1     1        6   1 Female    1    0    2   1      2
1848  1        6   1 Female    1    0    3   1      3
3695  1        6   1 Female    1    0    4   1      4
5542  1        6   1 Female    1    0    5   1      5
7389  1        6   1 Female    1    0    6   1      6
2     2        5   1 Female    1    0    2   1      2

Person 1 is a girl whose response came on day 1. Her response was a 6, the highest category, hence all her VAL scores are 1 - like the 4 in the transformation table above. Let’s look at someone who did not select the highest response category:

# First 5 rows with SURENESS == 4; since the data are ordered by person, this returns one individual's record
soup.long[soup.long$SURENESS == 4, ][1:5, ]

     ID SURENESS DAY GENDER girl day2 SURE VAL SURE.f
22   22        4   1 Female    1    0    2   1      2
1869 22        4   1 Female    1    0    3   1      3
3716 22        4   1 Female    1    0    4   1      4
5563 22        4   1 Female    1    0    5   0      5
7410 22        4   1 Female    1    0    6   0      6

This individual selected a SURENESS of 4. Her first three VAL scores are 1 because she cleared the first three thresholds. Her final two scores are 0 because 4 is less than the 4 - 5 threshold and 5 - 6 threshold.

The next step is to create dummy variables for the thresholds. These variables will be used to represent the intercept in our models.

# Taking a short cut to create dummy variables.
soup.long <- cbind(soup.long, model.matrix(
  ~ 0 + SURE.f, data = soup.long,
  contrasts.arg = list(SURE.f = "contr.treatment")) * -1)

Note that I multiplied the dummy variables by -1. In ordinal regression, doing this makes interpretation easier and is the default in the ordinal package. In short, it ensures that positive coefficients correspond to higher odds of responding in a higher category (e.g., a 4 rather than a 3).
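Concretely, each stacked row models the probability of clearing one threshold. In the cumulative logit parameterization used by ordinal,

$$\mathrm{logit}\, P(Y_i > j) = -\theta_j + \mathbf{x}_i^\top \boldsymbol\beta,$$

so coding the dummy for threshold \(j\) as \(-1\) means its coefficient is \(\theta_j\) itself, with the same sign and interpretation as the thresholds clm() reports.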

And now, we are ready to run the models. We use GEE, remove the intercept, include a dummy variable for each threshold along with coefficients for girl and day2, and specify an unstructured working correlation.

pom.bin <- geeglm( # Prop odds model
  VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + day2,
  id = ID, data = soup.long, family = binomial, corstr = "unstructured"
)

Next, I estimate the model using standard ordinal regression software:

pom.ord <- clm(SURENESS ~ girl + day2, data = soup)

Let’s compare coefficients and standard errors:

round(cbind(coef(summary(pom.bin)), coef(summary(pom.ord)))[, c(1, 5, 2, 6, 3, 7, 4, 8)], 5)

        Estimate Estimate.1 Std.err Std. Error     Wald  z value Pr(>|W|) Pr(>|z|)
SURE.f2 -2.13244   -2.13155 0.10454    0.10450 416.0946 -20.3971   0.0000   0.0000
SURE.f3 -1.19345   -1.19259 0.09142    0.09232 170.4284 -12.9179   0.0000   0.0000
SURE.f4 -0.89164   -0.89079 0.08979    0.09011  98.5995  -9.8857   0.0000   0.0000
SURE.f5 -0.65782   -0.65697 0.08945    0.08898  54.0791  -7.3833   0.0000   0.0000
SURE.f6 -0.04558   -0.04477 0.08801    0.08789   0.2682  -0.5093   0.6046   0.6105
girl    -0.04932   -0.04917 0.09036    0.09074   0.2980  -0.5419   0.5851   0.5879
day2    -0.26172   -0.26037 0.08584    0.08579   9.2954  -3.0351   0.0023   0.0024

Placing the results side by side, one can see that they are very close.

However, estimation using glm(), which cannot model the dependence between a person’s outcomes, yields different results. For demonstration:

round(coef(summary(glm(
  VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + day2,
  data = soup.long, family = binomial
))), 5)

        Estimate Std. Error z value Pr(>|z|)
SURE.f2 -2.15144    0.08255 -26.062   0.0000
SURE.f3 -1.21271    0.06736 -18.004   0.0000
SURE.f4 -0.91149    0.06472 -14.084   0.0000
SURE.f5 -0.67782    0.06327 -10.713   0.0000
SURE.f6 -0.06523    0.06178  -1.056   0.2911
girl    -0.07326    0.04961  -1.477   0.1398
day2    -0.26898    0.04653  -5.780   0.0000

Both the estimates and the standard errors are off; in particular, the standard errors for girl and day2 are far too small because glm() ignores the within-person correlation.

We can easily relax the proportional odds constraint in the pom.bin model. Let’s run what some call a partial proportional odds model by relaxing this constraint for the day2 predictor. We do this by estimating an interaction between the threshold dummies and the day2 predictor. Note that there is no main effect for the day2 variable itself; it appears only in the interaction terms with the various threshold dummies.

npom.bin <- geeglm( # Partial prop odds, prop odds for girl only
  VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl +
    # SURE.f2:girl + SURE.f3:girl + SURE.f4:girl + SURE.f5:girl + SURE.f6:girl +
    SURE.f2:day2 + SURE.f3:day2 + SURE.f4:day2 + SURE.f5:day2 + SURE.f6:day2,
  id = ID, data = soup.long, family = binomial, corstr = "unstructured"
)

I also run the same model in ordinal for comparison, using the nominal argument for day2.

npom.ord <- clm( # Using ordinal package for comparison
  SURENESS ~ girl, data = soup, nominal = ~ day2)
# Compare results
round(cbind(coef(summary(npom.bin))[c(1:5, 7:11, 6), ],
            coef(summary(npom.ord)))[, c(1, 5, 2, 6, 3, 7, 4, 8)], 5)

             Estimate Estimate.1 Std.err Std. Error     Wald  z value Pr(>|W|) Pr(>|z|)
SURE.f2      -2.02982   -2.03106 0.11800    0.11834 295.8986 -17.1630  0.00000  0.00000
SURE.f3      -1.22087   -1.22213 0.09829    0.09857 154.2801 -12.3980  0.00000  0.00000
SURE.f4      -0.92773   -0.92899 0.09458    0.09443  96.2112  -9.8375  0.00000  0.00000
SURE.f5      -0.65744   -0.65870 0.09246    0.09188  50.5554  -7.1693  0.00000  0.00000
SURE.f6      -0.04733   -0.04859 0.08955    0.08965   0.2793  -0.5420  0.59714  0.58784
SURE.f2:day2  0.07359    0.07360 0.14148    0.14155   0.2705   0.5199  0.60298  0.60312
SURE.f3:day2  0.31691    0.31697 0.10607    0.10613   8.9270   2.9867  0.00281  0.00282
SURE.f4:day2  0.33301    0.33308 0.09970    0.09973  11.1551   3.3398  0.00084  0.00084
SURE.f5:day2  0.26330    0.26339 0.09618    0.09616   7.4938   2.7391  0.00619  0.00616
SURE.f6:day2  0.26741    0.26748 0.09347    0.09345   8.1842   2.8622  0.00423  0.00421
girl         -0.04809   -0.04994 0.09048    0.09077   0.2825  -0.5502  0.59507  0.58221

The results are again comparable.

We can now compare the partial proportional odds binary model to the proportional odds binary model to test the constraint for the day2 variable. Luckily, geepack permits anova(), which conducts a Wald test comparing the two models:

anova(pom.bin, npom.bin) # Testing partial proportional odds assumption.

Analysis of 'Wald statistic' Table

Model 1 VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + SURE.f2:day2 + SURE.f3:day2 + SURE.f4:day2 + SURE.f5:day2 + SURE.f6:day2
Model 2 VAL ~ 0 + SURE.f2 + SURE.f3 + SURE.f4 + SURE.f5 + SURE.f6 + girl + day2
  Df   X2 P(>|Chi|)
1  4 6.94      0.14

The difference between the two models was not statistically significant, suggesting that the proportional odds constraint for the day2 variable suffices.

We can conduct the same test in ordinal either by comparing the pom.ord and npom.ord models using anova(), or by using the nominal_test() function. Both are likelihood ratio tests, which are more adequate than the Wald test used for the GEEs above.

anova(pom.ord, npom.ord)

Likelihood ratio tests of cumulative link models:

         formula:               nominal: link: threshold:
pom.ord  SURENESS ~ girl + day2 ~1       logit flexible  
npom.ord SURENESS ~ girl        ~day2    logit flexible  

         no.par  AIC logLik LR.stat df Pr(>Chisq)
pom.ord       7 5554  -2770                      
npom.ord     11 5555  -2766    6.91  4       0.14

nominal_test(pom.ord)

Tests of nominal effects

formula: SURENESS ~ girl + day2
       Df logLik  AIC  LRT Pr(>Chi)  
<none>     -2770 5554                
girl    4  -2766 5554 8.02    0.091 .
day2    4  -2766 5555 6.91    0.141  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Both tests converge to the same result, and their p-values match that of the Wald test comparing the GEE models, although the Wald \(\chi^2\) statistic was slightly higher than the likelihood ratio statistic.


Having done all this, using a package designed for ordinal data is of course much easier. There may be some advantage to treating the model as binary, but this exercise was driven by curiosity rather than necessity. For a reason I have yet to figure out, ordinal returns only one set of fitted probabilities when one requests predicted probabilities with the fitted() function; ideally, it would return a fitted probability for each threshold. With geepack, obtaining the predicted probability at each threshold is straightforward using fitted(). However, this advantage is minor: VGAM’s vglm() does not suffer the same problem, and calculating predicted probabilities by hand is a relatively simple task.
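For example, one way to recover per-category predicted probabilities from the GEE fit is to difference the fitted cumulative probabilities. A sketch, assuming the long data remain ordered by person with thresholds 2 through 6 within each person, as constructed above:

# Sketch: per-category predicted probabilities from the GEE fit
# fitted(pom.bin) gives P(SURENESS >= j) for j = 2, ..., 6, row-aligned with soup.long
cum.probs <- matrix(fitted(pom.bin), ncol = 5, byrow = TRUE) # one row per person
cat.probs <- cbind(
  1 - cum.probs[, 1],                # P(Y = 1)
  cum.probs[, -5] - cum.probs[, -1], # P(Y = 2), ..., P(Y = 5)
  cum.probs[, 5]                     # P(Y = 6)
)
colnames(cat.probs) <- paste0("P", 1:6)
summary(rowSums(cat.probs)) # each row should sum to 1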


Also, if one is familiar with maximum likelihood estimation, one could simply program the likelihood function. It is the product, across persons, of the probability of each person’s observed response category; taking the log likelihood turns this product into a sum across persons.
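For the cumulative logit model, each person's probability is a difference of two cumulative probabilities, so the log likelihood is

$$\ell(\boldsymbol\theta, \boldsymbol\beta) = \sum_{i=1}^{n} \log\Big[F\big(\theta_{y_i} - \mathbf{x}_i^\top\boldsymbol\beta\big) - F\big(\theta_{y_i - 1} - \mathbf{x}_i^\top\boldsymbol\beta\big)\Big],$$

where \(F\) is the logistic CDF and, for a \(J\)-category outcome, \(\theta_0 = -\infty\) and \(\theta_J = +\infty\).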

Syntax for the example above in the proportional odds case would be:

library(stats4) # For maximum likelihood estimation

# The coefficients from glm are useful starting values
# Using them can remove the need for programming monotonicity of thresholds
start.coefs <- -coef(glm( # Reverse code the thresholds
  VAL ~ 0 + SURE.f + girl + day2, binomial, soup.long))
X <- cbind(soup$girl, soup$day2) # matrix of predictors
y <- soup$SURENESS # response variable (ordered factor)
inv.logit <- function(x) 1 / (1 + exp(-x)) # useful within likelihood function
negll <- function(a1, a2, a3, a4, a5, bg, bd) {
  betas <- -c(bg, bd) # Reverse coefs
  # log likelihood: sum of log probabilities of the observed category, grouped by category (6, 5, ...)
  ll <- sum(log(1 - inv.logit(a5 + X %*% betas))[y == 6]) +
    sum(log(inv.logit(a5 + X %*% betas) - inv.logit(a4 + X %*% betas))[y == 5]) +
    sum(log(inv.logit(a4 + X %*% betas) - inv.logit(a3 + X %*% betas))[y == 4]) +
    sum(log(inv.logit(a3 + X %*% betas) - inv.logit(a2 + X %*% betas))[y == 3]) +
    sum(log(inv.logit(a2 + X %*% betas) - inv.logit(a1 + X %*% betas))[y == 2]) +
    sum(log(inv.logit(a1 + X %*% betas))[y == 1])
  -ll # return negative log likelihood
}
# Starting values need named list
start.coefs <- sapply(start.coefs, list)
names(start.coefs) <- c(paste0("a", 1:5), paste0("b", c("g", "d")))
res <- mle(negll, start.coefs) # Obtain ML solution
coef(summary(res))
      Estimate Std. Error
a1 -2.13155603 0.10450286
a2 -1.19259266 0.09232077
a3 -0.89079068 0.09010891
a4 -0.65697671 0.08898063
a5 -0.04477565 0.08788869
bg -0.04917604 0.09073602
bd -0.26037369 0.08578617

coef(summary(pom.ord))
        Estimate Std. Error     z value     Pr(>|z|)
1|2  -2.13155281 0.10450291 -20.3970663 1.775532e-92
2|3  -1.19259171 0.09232091 -12.9178937 3.567748e-38
3|4  -0.89078590 0.09010896  -9.8856524 4.804418e-23
4|5  -0.65697465 0.08898068  -7.3833401 1.543671e-13
5|6  -0.04476553 0.08788871  -0.5093434 6.105115e-01
girl -0.04917245 0.09073601  -0.5419287 5.878676e-01
day2 -0.26037360 0.08578617  -3.0351465 2.404188e-03

The results are pretty similar, and for a more certain way of comparing models, we can always compare log-likelihoods:

logLik(res)
'log Lik.' -2769.784 (df=7)

logLik(pom.ord)
'log Lik.' -2769.784 (df=7)

Adjacent categories parameterization:

$$\begin{array}{c|ccc} & 1 - 2 & 2 - 3 & 3 - 4\\ \hline 1 & 0 & & \\ 2 & 1 & 0 & \\ 3 & & 1 & 0 \\ 4 & & & 1\end{array}$$
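A hypothetical sketch of building this expansion in R for a toy 4-level variable - each response contributes rows only for the cut-points adjacent to it:

# Sketch: adjacent-categories expansion of a 4-level ordinal y (toy data)
y <- c(1, 2, 3, 4, 2)
adj <- do.call(rbind, lapply(seq_along(y), function(i) {
  js <- which(y[i] == 1:3 | y[i] == 2:4) # cut-points adjacent to the response
  data.frame(id = i, cutpoint = js, val = as.integer(y[i] == js + 1))
}))
adj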


  1. Agresti, A. (2013). Categorical data analysis. Wiley-Interscience. ↩︎

  2. Under normal circumstances, we would also need to inform the software that the variances differ across outcomes. However, this is binomial regression, and the variance of the latent variable is fixed at \(\frac{\pi^2}{3}\). If this were an attempt at multivariate OLS regression, one would have to model the dispersion separately for each variable. I believe the sformula = argument in the geese() function in geepack should make this possible. ↩︎

  3. I use an unstructured working correlation structure because it allows the estimation of the correlation between each pair of outcomes. In the example here, there are five outcomes, so we estimate 10 \(\big(\frac{5 \times 4}{2}\big)\) correlations. The alternatives do not permit this: independence places 1s on the diagonal and 0s off the diagonal and produces the same parameter estimates as glm(); exchangeable forces all off-diagonal elements to be the same (compound symmetry in ANOVA terminology) and would be overly constraining; and ar1 uses a specification that is better suited to time-ordered data. ↩︎
