7 Models for ordinal data
Ordinal variables arise often in psychology. One common example is responses to Likert scales. Although it is common practice to analyze these with a linear model, it is known that this approach can lead to serious inference errors.17 For this reason, the recommended approach is to use a model appropriate for ordinal data. Here I will describe one such approach, using an ordered logistic regression model (also known as the proportional odds model).
7.1 Ordered logistic regression
One way to think about this model is by assuming the existence of a continuous latent quantity, call it \(y\), specified by a logistic probability density function. The latent distribution is partitioned into a series of \(k\) intervals, where \(k\) is the number of ordered choice options available to respondents, using \(k-1\) latent cut-points, \(c_1, \ldots, c_{k-1}\) (with the conventions \(c_0 = -\infty\) and \(c_k = +\infty\)). By integrating the latent density function within each interval we obtain the ordinal response probabilities \(p_1, \ldots, p_k\). Other choices are possible (e.g. assuming a normally distributed latent variable would yield an ordered probit model). Beyond mathematical convenience, one advantage of the ordered logit is that coefficients can be interpreted as ordered log-odds, implementing the proportional odds assumption.18
Formally, the model can be notated as
\[ \begin{aligned} p_k & = p\left(c_{k-1} < y \le c_k \mid \mu \right)\\ & = \text{logit}^{-1}\left(c_k - \mu \right) - \text{logit}^{-1}\left(c_{k-1} - \mu \right) \end{aligned} \]
where \[ \text{logit}^{-1}(\alpha) = \frac{1}{1+e^{-\alpha}} \] is the cumulative distribution function of the standard logistic distribution (also known as the inverse-logit), and \[ \mu = \beta_1 x_1 + \ldots + \beta_n x_n \] is the linear part of the model (a linear combination of the \(n\) predictor variables).
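As a quick numerical illustration of these formulas, here is a minimal sketch in R (the values of \(\mu\) and of the cut-points are arbitrary, chosen only for this example): the response probabilities are differences of the logistic CDF evaluated at consecutive cut-points.
mu <- 1.5
cutpoints <- c(-1, 0, 1, 2) # c_1, ..., c_{k-1}, here k = 5
# CDF at the interval boundaries; differences give p_1, ..., p_k
p <- diff(plogis(c(-Inf, cutpoints, Inf), location = mu))
round(p, 3)
sum(p) # the k probabilities sum to 1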
This is the general approach and formalism; below I present a few examples that illustrate how it works in practice in R.
7.1.1 Mixed-effects ordinal regression
R libraries used in this example:
library(ordinal)   # clmm() for cumulative link mixed models
library(ggplot2)   # plotting
library(dplyr)     # data manipulation
library(tidyr)     # pivot_longer()
library(DescTools) # MultinomCI() for multinomial confidence intervals
In addition to the above libraries, here I will create a handy R function that gives the probabilities of the categorical responses given a mean value of the latent quantity (indicated with \(\mu\) above) and a set of cut-points \(c_1, \ldots, c_{k-1}\). This will be used both for simulating the data and for plotting the fit of the model.
ordered_logistic <- function(eta, cutpoints){
  # append +Inf as the upper bound of the last interval
  cutpoints <- c(cutpoints, Inf)
  k <- length(cutpoints) # number of response categories
  p <- rep(NA, k)
  # probability of the first category: CDF mass below the first cut-point
  p[1] <- plogis(cutpoints[1], location=eta, scale=1, lower.tail=TRUE)
  # remaining categories: CDF mass between consecutive cut-points
  for(i in 2:k){
    p[i] <- plogis(cutpoints[i], location=eta, scale=1, lower.tail=TRUE) -
      plogis(cutpoints[i-1], location=eta, scale=1, lower.tail=TRUE)
  }
  return(p)
}
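As a quick sanity check of the function (with arbitrary example values of eta and of the cut-points), the returned vector contains one probability per response category and sums to 1:
p <- ordered_logistic(eta = 3, cutpoints = c(0, 2.5, 5, 10))
length(p) # 5 categories
sum(p)    # 1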
For this example we simulate some data. We have two predictors: x1, a continuous predictor that varies with each observation, and d1, a dummy variable that indicates a categorical predictor with 2 levels (e.g. two experimental conditions). The conditions are within-subject, meaning that each participant (identified by the variable id) is tested in both conditions.
set.seed(5)
N <- 200    # total number of observations
N_id <- 10  # number of participants
dat <- data.frame(
  id = factor(sample(1:N_id, N, replace = TRUE)),
  d1 = rbinom(N, 1, 0.5), # dummy variable (0,1) indicating the 2 conditions
  x1 = runif(n = N, min = 1, max = 10)
)
# random intercepts, one per participant
rfx <- rnorm(length(unique(dat$id)), mean=0, sd=5)
# linear predictor: effects of x1 and d1, plus the participant's random intercept
LP <- 0.5*dat$x1 + 2*dat$d1 + rfx[dat$id]
# draw one ordinal response per trial from the ordered-logistic probabilities
for(i in 1:N){
  dat$response[i] <- which(rmultinom(1, 1, ordered_logistic(LP[i], c(0, 2.5, 5, 10)))==1)
}
dat$response <- factor(dat$response)
str(dat)
#> 'data.frame': 200 obs. of 4 variables:
#> $ id : Factor w/ 10 levels "1","2","3","4",..: 2 9 9 9 5 7 7 3 3 6 ...
#> $ d1 : int 0 0 0 0 0 0 0 1 0 0 ...
#> $ x1 : num 6.81 1.49 6.94 3.09 3.97 ...
#> $ response: Factor w/ 5 levels "1","2","3","4",..: 4 1 3 2 2 3 3 4 3 2 ...
The dependent variable is categorical with 5 levels; here is a plot of the number of responses per category in the two conditions. We are interested in testing whether the distribution of responses differs across conditions.
ggplot(dat,aes(x=response))+
geom_bar()+
facet_grid(.~d1)
We use the clmm() function from the package ordinal to estimate the model. The syntax is similar to what we would use for a linear mixed-effects model. Note that in the output the Threshold coefficients are the latent cut-points \(c_1, \ldots, c_{4}\).
model <- clmm(response ~ x1 + d1 + (1|id), data = dat)
summary(model)
#> Cumulative Link Mixed Model fitted with the Laplace approximation
#>
#> formula: response ~ x1 + d1 + (1 | id)
#> data: dat
#>
#> link threshold nobs logLik AIC niter max.grad
#> logit flexible 200 -175.75 365.49 291(919) 6.56e-05
#> cond.H
#> 8.6e+02
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> id (Intercept) 4.224 2.055
#> Number of groups: id 10
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> x1 0.55685 0.07516 7.409 1.27e-13 ***
#> d1 2.29521 0.36659 6.261 3.82e-10 ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Threshold coefficients:
#> Estimate Std. Error z value
#> 1|2 -2.1850 0.8512 -2.567
#> 2|3 0.2853 0.7664 0.372
#> 3|4 2.9453 0.8059 3.655
#> 4|5 7.8126 1.0072 7.757
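Since the coefficients are ordered log-odds, exponentiating them gives the corresponding proportional odds ratios, for example:
exp(model$beta)
# for d1: exp(2.295) is roughly 9.9, i.e. the odds of responding in a
# higher category are about 10 times larger in condition d1 = 1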
There is no function that can calculate the predictions of this model out of the box, so this will need some coding. I also use the library DescTools to calculate simultaneous multinomial confidence intervals. In the resulting plot the black lines are the model fit, and the bars are the observed responses.
# pre-allocate a matrix to store model predictions
# note that these are a vector of 5 probabilities for each trial
pred_mat <- matrix(NA, nrow=N, ncol=length(unique(dat$response)))
for(i in 1:N){
  # first calculate the linear predictor by summing all variables
  # as indicated in the model formula, weighted by the coefficients
  eta <- dat$x1[i]*model$beta['x1'] + dat$d1[i]*model$beta['d1'] + model$ranef[dat$id[i]]
  # note that + model$ranef[dat$id[i]] adds
  # the random intercept for the subject of observation i
  # calculate the vector of predicted probabilities
  pred_mat[i,] <- ordered_logistic(eta, model$Theta)
}
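Before moving on, a quick (optional) sanity check: each row of pred_mat should be a valid probability distribution.
# each row of predicted probabilities should sum to 1
all(abs(rowSums(pred_mat) - 1) < 1e-8)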
# add predictions to dataset
pred_dat <- data.frame(pred_mat)
colnames(pred_dat) <- paste("resp_",1:ncol(pred_mat),sep="")
pred_dat <- cbind(dat, pred_dat)
# in order to visualize the predictions,
# we first average them for each condition
pred_dat %>%
pivot_longer(cols=starts_with("resp_"),
names_prefix="resp_",
values_to = "prob",
names_to ="response_category") %>%
group_by(d1, response_category) %>%
summarise(prob = mean(prob),
n=sum(response==response_category)) %>%
group_by(d1) %>%
mutate(prop_obs = n/sum(n),
response=as.numeric(response_category)) -> pred_d1
#> `summarise()` has grouped output by 'd1'. You can override
#> using the `.groups` argument.
# compute the multinomial confidence intervals
pred_d1$CI_lb <- MultinomCI(pred_d1$n)[,"lwr.ci"] *2
pred_d1$CI_ub <- MultinomCI(pred_d1$n)[,"upr.ci"] *2
# note that I multiply by 2 because in the plot each condition
# will be in a different panel and the probabilities will sum to 1 in each panel
# visualize the (aggregated) ordinal responses & predictions
# the black lines are the predictions of the model
ggplot(pred_d1,aes(x=response, y=prop_obs))+
geom_col()+
geom_errorbar(data=pred_d1, aes(ymin=CI_lb, ymax=CI_ub),width=0.2)+
facet_grid(.~d1)+
geom_line(data=pred_d1, aes(y=prob), linewidth=2)+
labs(y="probability")
We can repeat the same process to calculate predictions for individual participants.
# split by ID
pred_dat %>%
pivot_longer(cols=starts_with("resp_"),
names_prefix="resp_",
values_to = "prob",
names_to ="response_category") %>%
group_by(id, d1, response_category) %>%
summarise(prob = mean(prob),
n=sum(response==response_category)) %>%
group_by(d1, id) %>%
mutate(prop_obs = n/sum(n),
response=as.numeric(response_category)) -> pred_d1
#> `summarise()` has grouped output by 'id', 'd1'. You can
#> override using the `.groups` argument.
# calculate multinomial CIs,
# looping over all participants and conditions
pred_d1$CI_lb <- NA
pred_d1$CI_ub <- NA
for(i in unique(pred_d1$id)){
  for(cond in c(0,1)){
    sel <- pred_d1$id==i & pred_d1$d1==cond
    pred_d1$CI_lb[sel] <- DescTools::MultinomCI(pred_d1$n[sel])[,"lwr.ci"]
    pred_d1$CI_ub[sel] <- DescTools::MultinomCI(pred_d1$n[sel])[,"upr.ci"]
  }
}
pred_d1 %>%
mutate(condition = factor(d1)) %>%
ggplot(aes(x=response, y=prop_obs, fill=condition))+
geom_col()+
geom_errorbar(aes(ymin=CI_lb, ymax=CI_ub, color=condition), width=0.2)+
facet_grid(id~d1)+
geom_line(aes(y=prob), linewidth=2)+
labs(y="probability")