GDR Vision - Paris 4/10/2018

Today's program

  1. Multilevel models from a frequentist perspective
  2. The Bayesian approach
  3. Markov Chain Monte Carlo (MCMC)
  4. Examples

Linear models

  • The dependent variable \(y\) is modeled as a weighted combination of the independent variables, plus an additive error \(\epsilon\) \[ y_i=\beta_0 + \beta_1x_{1i} + \ldots +\beta_nx_{ni} + \epsilon_i \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \]

Linear models

  • The dependent variable \(y\) is modeled as a weighted combination of the independent variables, plus an additive error \(\epsilon\) \[ \textbf{Y} = \textbf{X}\beta + \epsilon \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \]

Matrix notation:

\[ \bf{Y} = \bf{X}\beta + \epsilon \] \[ \left( \begin{array}{c} y_1 \\ \vdots \\ y_m \end{array} \right) = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1n}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_{m1} & \ldots & x_{mn} \end{array} \right) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{array} \right) + \left( \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_m \end{array} \right) \]

Matrix multiplication: \[ \left( \begin{array}{cc} a & b \\ c& d \end{array} \right) \left( \begin{array}{c} 1 \\ 2 \end{array} \right) = 1 \left( \begin{array}{c} a \\ c \end{array} \right) + 2 \left( \begin{array}{c} b \\ d \end{array} \right) = \left( \begin{array}{c} a + 2b \\ c+ 2d \end{array} \right) \]
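
In R, the design matrix \(\textbf{X}\) (including the leading column of 1s for the intercept) can be inspected directly with model.matrix(); a minimal sketch, where the variable names x1, x2 and their values are arbitrary assumptions:

d_example <- data.frame(x1 = c(0.5, 1.2, 2.3), x2 = c(1, 0, 1))
model.matrix(~ 1 + x1 + x2, data = d_example)  # first column is the intercept (all 1s)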

Linear mixed-effects models

  • 'Classical' linear models are fixed-effects only:
    • independent variables are all experimental manipulations (they are not random).
    • the only random source of variation is the residual error \(\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\).
  • However, observations (e.g. trials) are often grouped by observational cluster (e.g. subjects), which are random samples from a larger population about which we'd like to make inferences.
  • In order to properly generalize from the sample to the population, these grouping variables should be treated as random effects. Mixed-effects models allow us to do that by explicitly modelling the population distribution.

  • A model containing both fixed and random effects is usually called a mixed-effects model (but also: hierarchical, multilevel, etc.).
  • Random effects are treated as random variations around a population mean. These random variations are usually assumed to have a Gaussian distribution.
  • A simple example: a random-intercept model. The regression has the same slope in each of the \(J\) groups (\(j=1,\dots,J\)), but the intercept varies randomly across groups \[ y_{ij} = \beta_0 + b_j + \beta_1x_{ij} + \epsilon_{ij}\\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\\ b \sim \mathcal{N}\left( 0, \sigma_b^2 \right) \]
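
As an illustration, here is a minimal sketch simulating data from such a random-intercept model; all names and parameter values are arbitrary assumptions:

set.seed(1)
J <- 10                        # number of groups
n <- 20                        # observations per group
beta0 <- 2; beta1 <- 0.5       # fixed effects
sigma <- 1; sigma_b <- 2       # residual SD and between-group SD
group <- rep(1:J, each = n)
b <- rnorm(J, mean = 0, sd = sigma_b)  # group-specific intercept deviations
x <- runif(J * n, min = 0, max = 10)
y <- beta0 + b[group] + beta1 * x + rnorm(J * n, mean = 0, sd = sigma)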

  • General formulation in matrix notation \[ \textbf{Y} = \textbf{X}\beta + \textbf{Z}b + \epsilon \] where \(\textbf{X}\) and \(\textbf{Z}\) are the known fixed-effects and random-effects regressor matrices.

  • The components of the residual error vector \(\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\) are assumed to be i.i.d. (independent and identically distributed).

  • The random-effect components, \(b \sim \mathcal{N}\left( 0, \Omega \right)\) are assumed to be normally distributed with mean 0, however they are not necessarily independent (the components \(b_j\) can be correlated, and correlations can be estimated). Example:\[ \left[ {\begin{array}{c} {{b_0}}\\ {{b_1}} \end{array}} \right] \sim\cal N \left( {\left[ {\begin{array}{c} 0 \\ 0 \end{array}} \right],\Omega = \left[ {\begin{array}{c} {{\mathop{\rm Var}} \left( b_0 \right)} & {{\mathop{\rm cov}} \left( {{b_0},{b_1}} \right)}\\ {{\mathop{\rm cov}} \left( {{b_0},{b_1}} \right)} & {{\mathop{\rm Var}} \left( b_1 \right)} \end{array}} \right]} \right) \]

Likelihood function

  • Parameters (\(\beta\), \(\sigma^2\) and \(\Omega\)) are estimated by maximizing the likelihood function, which is the probability of the data, given the parameters (but treated as a function of the parameters, keeping the data fixed).

  • The probability of the data conditional on the random effects is integrated with respect to the marginal density of the random effects, to obtain the marginal density of the data \[ L\left(\beta,\sigma^2,\Omega \mid \text{data}\right) = \int p \left(\text{data} \mid \beta,\sigma^2, b\right) \, p\left(b \mid \Omega\right) \, db \]

ML vs REML

  • The \(\textsf{R}\) library lme4 provides two methods for estimating parameters: Maximum Likelihood (ML) and Restricted Maximum Likelihood (REML).

  • ML estimates tend to be biased and to underestimate the variance components (e.g. \(\Omega\)).

  • REML provides less biased variance estimates: conceptually it can be seen as similar to Bessel's correction for the sample variance (using \(n-1\) instead of \(n\) in the denominator).
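
In lme4, REML is the default and ML can be requested with REML = FALSE; a minimal sketch, assuming a data frame dat with columns y, x and group:

library(lme4)
m_reml <- lmer(y ~ x + (1 | group), data = dat)                # REML (default)
m_ml <- lmer(y ~ x + (1 | group), data = dat, REML = FALSE)    # Maximum Likelihood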

Advantages of multilevel models

  • Improved estimates for repeated sampling (e.g. repeated-measures design).
  • Improved estimates for imbalances in sampling (e.g. unequal number of trials across subjects).
  • Avoid averaging (pre-averaging of data removes variation and can manufacture false confidence).
  • Subject-specific standard error is taken into account in group-level estimates.
  • Variation among groups or individuals is modelled explicitly.
  • Outperform classical methods in predictive ability.

Example 1 sleepstudy

sleepstudy is a dataset in the lme4 package, with reaction time data from 18 subjects who were restricted to 3 hours of sleep for 10 days.

Questions:

  • how do reaction times change with each sleep-deprived night?
  • are individual differences in baseline response times related to individual differences in the effect of sleep deprivation?
str(sleepstudy)
'data.frame':   180 obs. of  3 variables:
 $ Reaction: num  250 259 251 321 357 ...
 $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...

  • The model we want to fit includes both random intercepts and random slopes \[ y_{ij} = \beta_0 + b_{0j} + \left( \beta_1 + b_{1j} \right) \times {\rm{Days}}_i + \epsilon_i \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \\ b\sim\cal N \left( 0, \Omega\right) \]
  • In lme4
m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), 
    sleepstudy)

summary(m.1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825   36.84
Days          10.467      1.546    6.77

Correlation of Fixed Effects:
     (Intr)
Days -0.138

The function confint() allows us to easily compute bootstrapped 95% CIs of the parameters

CI_fixef <- confint(m.1, method = "boot", nsim = 500, oldNames = F)
print(CI_fixef, digits = 2)
                              2.5 % 97.5 %
sd_(Intercept)|Subject        12.81  35.63
cor_Days.(Intercept)|Subject  -0.53   0.81
sd_Days|Subject                3.57   8.32
sigma                         22.61  28.44
(Intercept)                  237.02 264.76
Days                           7.54  13.20

  • Tests of fixed and random effects can also be conducted using likelihood ratio tests, by comparing the likelihood of two nested models.
  • If \(L_1\) and \(L_2\) are the maximised likelihoods of two nested models with \(k_1 < k_2\) parameters, the test statistic is \(2\text{log}\left( L_2/ L_1 \right)\), which is approximately \(\chi^2\) with \(k_2 - k_1\) degrees of freedom.
  • Example: test whether there is substantial variation across subjects in the effect of Days
# m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days|Subject),
# sleepstudy)
m.2 <- lmer(Reaction ~ 1 + Days + (1 | Subject), sleepstudy)
anova(m.1, m.2)
refitting model(s) with ML (instead of REML)
Data: sleepstudy
Models:
m.2: Reaction ~ 1 + Days + (1 | Subject)
m.1: Reaction ~ 1 + Days + (1 + Days | Subject)
    Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
m.2  4 1802.1 1814.8 -897.04   1794.1                             
m.1  6 1763.9 1783.1 -875.97   1751.9 42.139      2  7.072e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Shrinkage: the predicted \({\hat b}_j\) (conditional modes of the random effects) can be seen as a "compromise" between the within-subject estimates and the population mean.

Diagnostics

As for any linear model, it is important to check that the residual errors are well-behaved

The frequentist approach

  • So far, we've seen multilevel models from a frequentist point of view.
    • parameters are seen as unknown fixed quantities.
    • the goal is to find best guesses (point-estimates) of the parameters.
    • uncertainty in the parameters' values is not treated probabilistically.
    • uncertainty in the estimates is characterized in relation to the data-generating process (think of the highly unintuitive definition of a confidence interval).
    • probability is interpreted as the expected long-run frequency of an event in an (imaginary) very large sample.

The Bayesian approach

  • What is different in the Bayesian approach?
    • probability is interpreted as the degree of belief in the occurrence of an event, or in a proposition.
    • parameters are seen as unknown random variables, with the uncertainty in their values represented by a prior probability distribution.
    • the goal is to update the prior distribution on the basis of the available data, leading to a posterior probability distribution.
    • this is done with Bayes' theorem; however, the Bayesian approach is not distinguished by Bayes' theorem itself (which is just a trivial implication of probability theory).
    • "The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis" (Gelman et al. 2013).

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data}\right) = \frac{p\left(\text{data} \mid \theta\right) p\left(\theta\right)}{p\left( \text{data} \right)} \]

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data}\right) = \frac{p\left(\text{data} \mid \theta\right)p\left(\theta\right)}{\int p\left( \text{data} \mid \theta \right) p\left(\theta\right) d\theta}\]

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data} \right) = \frac{p\left(\text{data} \mid \theta\right)p\left(\theta\right)}{\int p\left( \text{data} \mid \theta \right) p\left(\theta\right) d\theta}\\ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{average likelihood}} \]

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data}\right) \propto p\left(\text{data} \mid \theta\right)p\left(\theta\right)\\ \text{posterior} \propto \text{likelihood} \times \text{prior} \]

Bayesian multilevel models

  • In the case of a multilevel model, each cluster \(j\) has a parameter \(\theta_j\), which is an independent sample from a population distribution governed by some hyperparameter \(\phi\).
  • \(\phi\) is also unknown, and thus has its own prior distribution \(p\left(\phi\right)\) (the hyperprior).
  • We have a joint prior distribution: \(p\left(\phi,\theta\right)=p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)\)
  • and a joint posterior distribution: \[p\left(\phi,\theta \mid \text{data}\right) \propto \underbrace{p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)}_\text{prior} \times \underbrace{ p\left(\text{data} \mid \theta_j\right)}_\text{likelihood} \]

Bayesian parameter inference

  • The information we want about a parameter \(\phi\) is contained in its marginal posterior distribution \(p\left(\phi \mid \text{data} \right)=\int p\left(\phi,\theta \mid \text{data} \right) d\theta\).
  • The calculation can be difficult when a model has many parameters (to describe the uncertainty in a parameter of interest, one must average or marginalise over the uncertainty in all other parameters).
  • Monte Carlo techniques and computer simulations (Markov-Chain Monte Carlo sampling, MCMC) can be used to draw samples from the posterior distribution.
  • Once you have samples from the posterior distribution, summarizing the marginal posterior distribution becomes simply a matter of counting values.

Metropolis algorithm

Say you want to sample from a density \(P({\bf x})\), but you can only evaluate a function \(f({\bf x})\) that is proportional to the density \(P({\bf x})\):

  • Initialization:
    1. choose arbitrary starting point \({\bf x}_0\)
    2. choose a probability density to generate proposals \(g({\bf x}_{n+1}|{\bf x}_n)\) (proposal density)
  • For each iteration \(i\):
    1. generate a candidate \({\bf x'}\) sampling from \(g({\bf x}_{i+1}|{\bf x}_{i})\)
    2. calculate the acceptance ratio \(a = \frac{f({\bf x}')}{f({\bf x}_{i})}\)
    3. accept/reject: generate a uniform number \(u\) in \([0,1]\)
      • if \(u \le a\), accept and set \({\bf x}_{i+1}={\bf x'}\) ("move to the new value")
      • if \(u > a\), reject and set \({\bf x}_{i+1}={\bf x}_{i}\) ("stay where you are")

MCMC example

Let's use the sampling algorithm to infer parameter values for one participant of the sleepstudy dataset.

  • To sample from the posterior distribution of the parameters it is sufficient to be able to compute the un-normalized posterior: \(\text{prior} \times \text{likelihood}\)

MCMC example

First, write a function that computes the (log-)likelihood, keeping the data fixed.

x <- sleepstudy$Days[sleepstudy$Subject == "308"]
y <- sleepstudy$Reaction[sleepstudy$Subject == "308"]

loglik <- function(par) {
    # par = [intercept, slope, log(residual Std.)]
    pred <- par[1] + par[2] * x
    return(sum(dnorm(y, mean = pred, sd = exp(par[3]), log = T)))
}
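
As a quick check, the function can be evaluated at an arbitrary parameter combination (the values below are just an assumption for illustration):

loglik(c(250, 20, log(30)))  # intercept 250 ms, slope 20 ms/day, residual SD 30 ms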

MCMC example

Next, choose your priors

MCMC example

Write a function that computes the log of the joint prior density for a given combination of parameter values

logprior <- function(par) {
    # prior for the intercept
    intercept_prior <- dnorm(par[1], mean = 250, sd = 180, 
        log = T)
    # prior for the slope
    slope_prior <- dnorm(par[2], mean = 20, sd = 20, log = T)
    # prior for the log of the residual SD
    sd_prior <- dnorm(par[3], mean = 4, sd = 1, log = T)
    return(intercept_prior + slope_prior + sd_prior)
}

MCMC example

Put together the likelihood and the prior to obtain the (log) posterior density (up to a normalizing constant)

logposterior <- function(par) {
    return(loglik(par) + logprior(par))
}

MCMC example

Choose an arbitrary starting point and a proposal density \(g({\bf x}_{n+1}|{\bf x}_n)\)

# initial parameters
startvalue <- c(250, 20, 5)

# proposal density
proposalfunction <- function(par) {
    return(rnorm(3, mean = par, sd = c(15, 5, 0.2)))
}

MCMC example

This function iteratively executes the steps we have seen before

run_metropolis_MCMC <- function(startvalue, iterations) {
    chain <- array(dim = c(iterations + 1, 3))
    chain[1, ] <- startvalue
    
    for (i in 1:iterations) {
        # draw a random proposal
        proposal <- proposalfunction(chain[i, ])
        
        # ratio of posterior density between new and old values
        a <- exp(logposterior(proposal) - logposterior(chain[i, 
            ]))
        
        # accept/reject
        if (runif(1) < a) {
            chain[i + 1, ] <- proposal
        } else {
            chain[i + 1, ] <- chain[i, ]
        }
    }
    return(chain)
}

MCMC example
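
The chain still has to be run before its samples can be summarized; a minimal sketch, where the number of iterations is an assumption (chosen to exceed the burn-in of 5000 used below):

chain <- run_metropolis_MCMC(startvalue, iterations = 20000)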

MCMC example

Having obtained samples from the posterior distribution, summarising uncertainty is simply a matter of counting values. Here I calculate a 95% Bayesian credible interval for the slope.

# remove initial 'burn in' samples
burnIn <- 5000
slope_samples <- chain[-(1:burnIn), 2]

# mean of posterior distribution
mean(slope_samples)
[1] 21.75772

# 95% Bayesian credible interval
alpha <- 0.05
round(c(quantile(slope_samples, probs = alpha/2), quantile(slope_samples, 
    probs = 1 - alpha/2)), digits = 2)
 2.5% 97.5% 
10.82 33.27 

  • The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions
  • Proposals are random, which can make the algorithm very inefficient, especially as the complexity of the model (and thus of the posterior) increases
  • Modern algorithms use cleverer proposals to get a better picture of the posterior distribution with fewer samples
  • A state-of-the-art algorithm is Hamiltonian Monte Carlo (HMC), which is used by the free software Stan (mc-stan.org)

Stan

  • Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that enables users to easily perform Bayesian inference on complex models.
  • Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density
  • Stan is free, open-source, has a comprehensive documentation and interfaces for the most popular computing environments (R, Matlab, Python, Mathematica)
  • We shall see how to build a Bayesian multilevel model in Stan using again the sleepstudy dataset
  • I will use Stan's R interface (RStan, see installation instructions at github.com/stan-dev/rstan/wiki/RStan-Getting-Started)

Bayesian LMM

  • The model can be expressed as \[ y_{j} \sim \text{Normal}\left( \beta_0 + b_{0j} + \left( \beta_1 + b_{1j} \right){\rm{Days}}, \,\, \sigma_{\epsilon}^2 \right)\\ b \sim \text{Normal}\left( 0, \,\, \Omega \right)\\ \]

  • Priors\[ \beta_0 \sim \text{Normal} \left(0.3,1 \right)\\ \beta_1 \sim \text{Normal} \left(0.01,0.5 \right)\\ \sigma_{\epsilon} \sim \text{HalfCauchy} \left(0, 0.5\right)\\ \]

Bayesian LMM

  • For efficiency and numerical stability, the matrix \(\Omega\) is usually parametrized by a vector of variances \(\sigma_0, \sigma_1\) and a correlation matrix \(\textbf{R}\) \[ \Omega = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{R} \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \]

Variance-covariance and correlation matrices

\[ \begin{align*} \Omega & = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{R} \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right)\\ & = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \\ & = \left( \begin{array}{cc} \sigma_0 & \rho\sigma_0 \\ \rho\sigma_1 & \sigma_1 \end{array} \right) \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \\ & = \left( \begin{array}{cc} \sigma_0^2 & \rho\sigma_0\sigma_1 \\ \rho\sigma_0\sigma_1 & \sigma_1^2 \end{array} \right)\\ \end{align*} \]
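
This decomposition can be checked numerically in R; a short sketch with arbitrary values for the standard deviations and the correlation:

sigma_0 <- 2; sigma_1 <- 0.5; rho <- 0.3
R <- matrix(c(1, rho, rho, 1), nrow = 2)
diag(c(sigma_0, sigma_1)) %*% R %*% diag(c(sigma_0, sigma_1))  # reproduces Omega above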

Bayesian LMM

  • Hyperprior\[ \Omega = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{R} \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right)\\ {\bf \sigma}_{\Omega} \sim \text{HalfCauchy} \left(0, 0.5\right)\\ \textbf{R} \sim \text{LKJcorr} \left( 2 \right)\\ \]

Prior for correlation matrix

The LKJ prior (Lewandowski, Kurowicka, & Joe, 2009) is a prior distribution over correlation matrices; it has one parameter that controls how 'skeptical' the prior is with respect to extreme correlation values.

Coding the model in Stan language

  • Stan models are typically specified in separate text files with .stan extension
  • Model specification consists of several blocks, e.g.
    • data: declaration of all dependent/independent variables
    • parameters: declare the free parameters of the model
    • model: define likelihood function and priors
  • Other useful block types:
    • transformed parameters: apply any transformation to the parameters, before computing the likelihood
    • generated quantities: derive any quantities of interest based on data or parameters

data

data {
  int<lower=1> N;                  // number of observations
  real RT[N];                      // reaction time (transformed in seconds)
  int<lower=0,upper=9> Days[N];    // predictor
  int<lower=1> J;                  // number of subjects
  int<lower=1,upper=J> Subject[N]; // subject id
}

RStan requires the dataset to be organized as a list object

d_stan <- list(Subject = as.numeric(factor(sleepstudy$Subject, 
    labels = 1:length(unique(sleepstudy$Subject)))), Days = sleepstudy$Days, 
    RT = sleepstudy$Reaction/1000, N = nrow(sleepstudy), 
    J = length(unique(sleepstudy$Subject)))

parameters

parameters {
  vector[2] beta;               // fixed-effects parameters
  real<lower=0> sigma_e;        // residual std
  vector<lower=0>[2] sigma_u;   // random effects standard deviations
  cholesky_factor_corr[2] L_u;  // L_u is the Choleski factor of the correlation matrix
  matrix[2,J] z_u;              // random effect matrix
}

transformed parameters {
  matrix[2,J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u; // use Cholesky to set correlation
}

For efficiency and numerical stability (see the Stan manual) the correlation matrix \(\textbf{R}\) is parametrized by its Cholesky factor \(\textbf{L}\), which can be seen as the square root of \(\textbf{R}\), i.e. \[\textbf{L}\textbf{L}^T=\textbf{R} = \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) \] If \(\textbf{Z}_{\text{uncorr}}\) is a \(2 \times n\) matrix whose rows are 2 samples from uncorrelated random variables, then \[ \textbf{Z}_{\text{corr}} = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{L} \, \textbf{Z}_{\text{uncorr}} \] has the desired variances \(\sigma_0^2,\sigma_1^2\) and correlation \(\rho\) (here \(\textbf{L}\) is the lower-triangular Cholesky factor).

Why does this work? Say the two uncorrelated samples have zero mean and unit variance; then the covariance matrix is the identity matrix, \(\mathbb{E}\left( {Z{Z^T}} \right) - \mathbb{E}\left( Z \right) \mathbb{E}\left(Z \right)^T = \mathbb{E} \left( Z Z^T \right) = I\), and it coincides with the correlation matrix.

Say the desired correlation matrix is \(\textbf{R}\); since it is symmetric and positive definite, it is possible to obtain the Cholesky factorization \(L{L^T} = \textbf{R}\).

If we transform the samples with \(X=LZ\), the new covariance (correlation) matrix is \[ \begin{align} \mathbb{E} \left(XX^T\right) &= \mathbb{E} \left((LZ)(LZ)^T \right) \\ &= \mathbb{E} \left(LZ Z^T L^T\right) \\ &= L \mathbb{E} \left(ZZ^T \right) L^T \\ &= LIL^T = LL^T = \textbf{R} \\ \end{align} \]

The third step is justified because the expected value is a linear operator, therefore \(\mathbb{E}(cX) = c\mathbb{E}(X)\). Also \((AB)^T = B^T A^T\): the order of the factors reverses.
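
The same construction can be verified numerically in R; a sketch with arbitrary values of \(\rho\) and of the standard deviations (note that chol() returns the upper-triangular factor, so it is transposed to obtain \(\textbf{L}\)):

set.seed(1)
R <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)        # target correlation matrix
L <- t(chol(R))                                 # lower-triangular factor, L %*% t(L) = R
Z_uncorr <- matrix(rnorm(2 * 1e+05), nrow = 2)  # 2 x n uncorrelated standard normal samples
Z_corr <- diag(c(2, 0.5)) %*% L %*% Z_uncorr    # scale and correlate
cor(t(Z_corr))        # off-diagonal element close to 0.6
apply(Z_corr, 1, sd)  # close to 2 and 0.5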

model

model {
  real mu; // conditional mean of the dependent variable

  //priors
  L_u ~ lkj_corr_cholesky(2);   // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0,1); // before Cholesky, random effects are normal variates with SD=1
  sigma_u ~ cauchy(0, 0.5);     // SD of random effects (vectorized)
  sigma_e ~ cauchy(0, 0.5);     // prior for residual standard deviation
  beta[1] ~ normal(0.3, 1);     // prior for fixed-effect intercept
  beta[2] ~ normal(0.01, 0.5);  // prior for fixed-effect slope

  //likelihood
  for (i in 1:N){
    mu = beta[1] + u[1,Subject[i]] + (beta[2] + u[2,Subject[i]])*Days[i];
    RT[i] ~ normal(mu, sigma_e);
  }
}

Running the model

  • Run sampling through RStan
library(rstan)

options(mc.cores = parallel::detectCores())  # tell Stan to use multiple cores if available

sleep_model <- stan(file = "sleep_model_v2.stan", data = d_stan, 
    iter = 2000, chains = 4)

Evaluating convergence

  • Visualize chains to check for convergence
traceplot(sleep_model, pars = c("beta", "sigma_e", "sigma_u"), 
    inc_warmup = F)

Evaluating convergence

  • Check that \(\hat R \approx 1\pm0.1\) (essentially \(\hat R\) is the ratio of between-chain variance to within-chain variance)
print(sleep_model, pars = c("beta", "sigma_e"), probs = c(0.025, 
    0.975), digits = 3)
Inference for Stan model: sleep_model_v2.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean    sd  2.5% 97.5% n_eff Rhat
beta[1] 0.251       0 0.008 0.236 0.267  2545    1
beta[2] 0.010       0 0.002 0.007 0.014  2409    1
sigma_e 0.026       0 0.002 0.023 0.029  4000    1

Samples were drawn using NUTS(diag_e) at Sun Sep 23 14:37:56 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Posterior distribution of fixed-effects parameters.

pairs(sleep_model, pars = c("beta", "sigma_e", "sigma_u"))

Examine random effects correlation.

# Use the L matrices to compute the correlation matrices
L_u <- extract(sleep_model, pars = "L_u")$L_u
cor_u <- apply(L_u, 1, function(x) tcrossprod(x)[1, 2])
print(signif(quantile(cor_u, probs = c(0.025, 0.5, 0.975))))
      2.5%        50%      97.5% 
-0.4676860  0.0597705  0.5986940 

Posterior predictive check

  • 'simulating replicated data under the fitted model and then comparing these to the observed data' (Gelman and Hill, 2007).
generated quantities {
  real y_rep[N];        
  matrix[2,J] u_rep;    // matrix of simulated random effects
  for (j in 1:J) {
    u_rep[1:2,j] = multi_normal_cholesky_rng(rep_vector(0, 2), 
                                    diag_matrix(sigma_u) * (L_u));
  }
  for (i in 1:N){     // simulate the model
    y_rep[i] = normal_rng(beta[1] + u_rep[1,Subject[i]] + 
                          (beta[2] + u_rep[2,Subject[i]])*Days[i], 
                          sigma_e);
  }
}

Posterior predictive check

  • If the model fits, then replicated data generated under the model should look similar to the observed data.
  • Generating random observations from new simulated observers allows us to approximate the posterior predictive distribution, \(p \left( y^{\text{rep}} \mid y \right) = \int p \left( y^{\text{rep}} \mid \theta \right) p\left(\theta \mid y\right) d \theta\)
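
A quick visual comparison can be done with the bayesplot package; a sketch, assuming the generated quantities block above was included when sleep_model was compiled:

y_rep <- extract(sleep_model, pars = "y_rep")$y_rep  # matrix: posterior draws x observations
bayesplot::ppc_dens_overlay(y = d_stan$RT, yrep = y_rep[1:100, ])  # replicated vs. observed densities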

Comparison with lme4 random effects estimates.

# extract samples
samples_rf0 <- extract(sleep_model, pars = "u")$u[, 1, ]
samples_rf1 <- extract(sleep_model, pars = "u")$u[, 2, ]

# compute posterior mean
rf_int <- apply(samples_rf0, 2, mean)
rf_slp <- apply(samples_rf1, 2, mean)

par(mfrow = c(1, 2))  # plot
plot(rf_int, ranef(m.1)$Subject[, 1], pch = 19, xlab = "Stan", 
    ylab = "lme4", main = "Intercept")
plot(rf_slp, ranef(m.1)$Subject[, 2], pch = 19, xlab = "Stan", 
    ylab = "lme4", main = "Slope")

rstanarm

  • It is not strictly necessary to write code from scratch for standard models
  • The package rstanarm automatically prepares the Stan code from a model specified in standard R formula syntax, e.g.
library(rstanarm)
sleepstudy$Reaction_s <- sleepstudy$Reaction/1000

sleep_model_2 <- stan_lmer(Reaction_s ~ Days + (Days | Subject), 
    data = sleepstudy, cores = 2, iter = 2000, chain = 4)

rstanarm

  • Priors can be assigned (see the function prior_summary() to examine the defaults)
  • Another option is the package brms (Bürkner, 2017); a minimal sketch follows the rstanarm output below

summary(sleep_model_2, pars = c("(Intercept)", "Days"), 
    digits = 3)

Model Info:

 function:     stan_lmer
 family:       gaussian [identity]
 formula:      Reaction_s ~ Days + (Days | Subject)
 algorithm:    sampling
 priors:       see help('prior_summary')
 sample:       4000 (posterior sample size)
 observations: 180
 groups:       Subject (18)

Estimates:
              mean   sd    2.5%   25%   50%   75%   97.5%
(Intercept) 0.251  0.007 0.237  0.247 0.251 0.256 0.265  
Days        0.010  0.002 0.007  0.009 0.010 0.011 0.014  

Diagnostics:
            mcse  Rhat  n_eff
(Intercept) 0.000 1.000 1790 
Days        0.000 1.004 1364 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
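
As referenced above, here is a minimal brms sketch of the same model; the number of chains and iterations simply mirrors the rstanarm call and is otherwise an arbitrary choice:

library(brms)
sleep_model_3 <- brm(Reaction_s ~ Days + (Days | Subject), data = sleepstudy, 
    chains = 4, iter = 2000, cores = 2)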

Interim summary

  • Multilevel models are characterised by more than one source of "random" variation
  • The distinguishing feature of Bayesian methods is the explicit use of probability distributions to quantify the uncertainty about unknown quantities
  • Using samples to approximate posterior distributions makes it much easier to summarize the results
  • Bayesian models estimated by MCMC sampling do not produce the convergence warnings familiar from lme4, but it is important to check that sampling was successful (e.g. \(\hat R\), …)

MCMC: taming a wild chain

  • When the posterior density has broad, flat regions, chains can wander erratically
dat <- list(y = c(-1, 1))
stan_code <- "
data{
  real y[2];
}
parameters{
  real mu;
  real<lower=0> sigma;
}
model{
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code = stan_code, data = dat, iter = 2000, 
    chains = 4)

MCMC: taming a wild chain

  • When the posterior density has broad, flat regions, chains can wander erratically
rstan::traceplot(m_, pars = c("mu", "sigma"), inc_warmup = F)

MCMC: taming a wild chain

  • Weakly informative priors can gently guide the Markov chain toward reasonable parameter values
dat <- list(y = c(-1, 1))
stan_code <- "
data{
  real y[2];
}
parameters{
  real mu;
  real<lower=0> sigma;
}
model{
  mu ~ normal(1, 10);
  sigma ~ cauchy(0, 1);
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code = stan_code, data = dat, iter = 2000, 
    chains = 4)

MCMC: taming a wild chain

  • Weakly informative priors can gently guide the Markov chain toward reasonable parameter values
rstan::traceplot(m_, pars = c("mu", "sigma"), inc_warmup = F)

MCMC: non-identifiable parameters

  • Markov chains run into problems if parameters are non-identifiable (e.g. highly correlated predictors)
dat <- list(y = rnorm(100, mean = 0, sd = 1))
stan_code <- "
data{
  real y[100];
}
parameters{
  vector[2] alpha;
  real<lower=0> sigma;
}
model{
  real mu;
  mu = alpha[1] + alpha[2];
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code = stan_code, data = dat, iter = 4000, 
    chains = 2)

MCMC: non-identifiable parameters

  • Markov chains run into problems if parameters are non-identifiable (e.g. highly correlated predictors)
rstan::traceplot(m_, pars = c("alpha", "sigma"), inc_warmup = F)

MCMC: non-identifiable parameters

  • Weakly informative priors can also help in this case
dat <- list(y = rnorm(100, mean = 0, sd = 1))
stan_code <- "
data{
  real y[100];
}
parameters{
  vector[2] alpha;
  real<lower=0> sigma;
}
model{
  real mu;
  alpha ~ normal(0, 10);
  mu = alpha[1] + alpha[2];
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code = stan_code, data = dat, iter = 4000, 
    chains = 2)

MCMC: non-identifiable parameters

  • Weakly informative priors can also help in this case.
rstan::traceplot(m_, pars = c("alpha", "sigma"), inc_warmup = F)

Priors

  • Priors are the initial belief about the plausibility of possible parameter values.
  • There are no conventional priors (unlike for likelihoods).
  • Priors are subjective, but explicitly stated and easily criticized (e.g. less opaque than dropping outliers).
  • Priors are assumptions, and should be criticized (e.g. try different assumptions to check how sensitive inference is to the choice of priors).
  • Priors cannot be exact, but should be defendable.

Priors

  • There are no truly non-informative priors: flat priors such as \(\text{Uniform} \left(- \infty, \infty \right)\) tell the model that unrealistic values (e.g. extremely large ones) are as plausible as realistic ones.
  • Priors should restrict parameter space to values that make sense.
  • Preferably choose weakly informative priors that convey less information than you have available.
  • Priors should provide some regularization, i.e. endow the model with some skepticism, so that it doesn't get too excited when learning from the data (in a frequentist setting, this is similar to penalized likelihood methods)
  • Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
  • If using informative priors, these should be justified in the text when reporting the results (see examples in Gelman & Hennig, 2016, Beyond subjective and objective in statistics)

Generalized linear models (GLM)

  • A generalization of linear models where the linear predictor is related to the dependent variable through a link function \[ \Phi^{-1} \left[ p \left(y=1 \right) \right] = \textbf{X} \beta \]
  • The coefficients of the GLM can be straightforwardly mapped to the parameters of the psychometric function \[ \begin{align} p \left(y_i\right) & = \Phi\left(\frac{x_i -\mu}{\sigma} \right) \\ & = \Phi ( -\frac{\mu}{\sigma} + \frac{1}{\sigma}x_i )\\ & = \Phi\left(\beta_0 + \beta_1x_i \right) \\ \\& \mu=-\frac{\beta_0}{\beta_1}, \,\,\, \sigma=\frac{1}{\beta_1} \\ \end{align} \]
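
As an illustration, a probit GLM fitted to simulated binary responses recovers \(\mu\) and \(\sigma\) through this mapping; a sketch in which the stimulus levels and the 'true' parameter values are arbitrary assumptions:

set.seed(1)
x <- rep(seq(-2, 2, length.out = 9), each = 40)  # stimulus levels
mu <- 0.5; sigma <- 0.8                          # 'true' psychometric parameters
y <- rbinom(length(x), size = 1, prob = pnorm((x - mu)/sigma))
fit <- glm(y ~ x, family = binomial(link = "probit"))
b <- coef(fit)
c(mu = unname(-b[1]/b[2]), sigma = unname(1/b[2]))  # estimates close to 0.5 and 0.8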

Generalized linear multilevel models

  • Often psychometric functions are fitted to individual data and group analysis is done on the estimated parameters. This approach neglects subject-specific standard errors and imbalances in the number of repetitions/trials.

  • Inferences can be improved by adopting a (generalized linear) multilevel approach (GLMM) \[ p \left(y \mid b\right) = \Phi \left( \textbf{X} \beta + \textbf{Z} b \right) \\ b \sim \mathcal{N}\left( 0, \Omega \right) \]

  • This model could be estimated with the glmer function of the lme4 package. However, it is not a fully realistic model: sometimes subjects push buttons at random and make stimulus-independent errors (lapses)
  • We can extend this model with an extra parameter \(\lambda\) indicating the probability of lapsing\[ p \left(y \mid b\right) = \lambda+\left( 1-2\lambda \right) \times \Phi \left( \textbf{X} \beta + \textbf{Z} b \right) \\ b \sim \mathcal{N}\left( 0, \Omega \right)\\ \]

GLMM example

  • Observers were presented with two brief (250 ms) peripheral stimuli, and asked to indicate which of the two was farther from the central fixation point.
  • 2 conditions, with different levels of blur of the targets. How does the precision of position judgments change across conditions?
d <- read.table("bisection2.txt", sep = "\t", header = T)
str(d)
'data.frame':   4593 obs. of  5 variables:
 $ id  : int  4 4 4 4 4 4 4 4 4 4 ...
 $ dx  : num  0.04 -0.04 -0.53 -0.53 -0.36 0.44 0.53 0.69 0.93 -1.09 ...
 $ rr  : int  1 1 0 1 0 1 0 1 0 0 ...
 $ acc : int  1 0 1 0 1 1 0 1 0 1 ...
 $ cond: int  1 2 1 2 2 1 2 2 2 2 ...

Lapse rates

  • Each observer \(j\) has a specific lapse rate \(\lambda_j\). We use a Beta distribution to express the distribution of lapse rates in the population\[ \lambda_j \sim \text{Beta} \left(a,b\right) \]

Lapse rates

  • Each observer \(j\) has a specific lapse rate \(\lambda_j\). We use a Beta distribution to express the distribution of lapse rates in the population\[ \lambda_j \sim \text{Beta} \left(a,b\right) \]

  • \(a\) and \(b\) are the hyperparameters that govern the distribution of lapse rates in the population. We need a hyperprior to express the uncertainty in their values.
  • Since \(a>0\) and \(b>0\), the hyperprior can be modelled with a Gamma distribution. We can set the shape and rate parameters of the Gamma so that they favor a positively skewed Beta distribution with mode at 0\[ a \sim \text{Gamma} \left(4,1 \right)\\ b \sim \text{Gamma} \left(1,0.2 \right)\\ \]
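
One way to see what these hyperpriors imply is to draw values of \(a\) and \(b\) and plot the corresponding Beta densities; a minimal sketch (the number of draws is arbitrary):

set.seed(2)
a <- rgamma(20, shape = 4, rate = 1)
b <- rgamma(20, shape = 1, rate = 0.2)
curve(dbeta(x, a[1], b[1]), from = 0.001, to = 1, ylim = c(0, 15), 
    xlab = "lambda (lapse rate)", ylab = "density")
for (i in 2:20) curve(dbeta(x, a[i], b[i]), add = TRUE)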

Stan code

Declare all the variables in the data block.

data {
  int<lower=1> N;
  int<lower=0,upper=1> rr[N];
  real ds[N];
  int<lower=0,upper=1> cond[N];
  int<lower=1> J;
  int<lower=1,upper=J> id[N];
}

Define the free parameters in the parameters block

parameters {
  vector[4] beta;                    // fixed-effects parameters
  vector<lower=0,upper=1>[J] lambda; // lapse rate
  vector<lower=0>[2] beta_ab;        // population parameters of lapse rate
  vector<lower=0>[4] sigma_u;        // random effects standard deviations
  cholesky_factor_corr[4] L_u;       // L_u is the Choleski factor of the correlation matrix
  matrix[4,J] z_u;                   // random effect matrix
}

In the transformed parameters block apply any transformations needed to compute the likelihood.

transformed parameters {
  matrix[4,J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u; // use Cholesky to set correlation
}

Define priors and likelihood in the model block.

model {
  real mu; // linear predictor

  //priors
  L_u ~ lkj_corr_cholesky(2);   // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0,1); // before Cholesky, random eff. are normal variates with SD=1
  sigma_u ~ cauchy(0, 1);       // SD of random effects (vectorized)
  beta[1] ~ normal(0, 2);       // prior for intercept (cond 1)
  beta[2] ~ normal(0, 4);       // prior for difference in intercept (cond 2 vs 1)
  beta[3] ~ normal(0, 1);       // prior for slope (cond 1)
  beta[4] ~ normal(0, 2);       // prior for difference in slope (cond 2 vs 1)
  beta_ab[1] ~ gamma(4,1);      // hyper priors for lapse rate distribution
  beta_ab[2] ~ gamma(1,0.2);
  lambda ~ beta(beta_ab[1] ,beta_ab[2]);  //lapse rate

  //likelihood
  for (i in 1:N){
    mu = beta[1] + u[1,id[i]] + cond[i]*(beta[2]+u[2,id[i]]) 
        + (beta[3] + u[3,id[i]] + cond[i]*(beta[4] + u[4,id[i]] ))*ds[i];
    rr[i] ~ bernoulli((1-lambda[id[i]])*Phi(mu)+lambda[id[i]]/2);
  }
}

Prepare the data as a list for Stan.

d_stan <- list(id = as.numeric(factor(d$id, labels = 1:length(unique(d$id)))), 
    ds = d$dx, rr = d$rr, N = nrow(d), J = length(unique(d$id)), 
    cond = d$cond - 1)
str(d_stan)
List of 6
 $ id  : num [1:4593] 1 1 1 1 1 1 1 1 1 1 ...
 $ ds  : num [1:4593] 0.04 -0.04 -0.53 -0.53 -0.36 0.44 0.53 0.69 0.93 -1.09 ...
 $ rr  : int [1:4593] 1 1 0 1 0 1 0 1 0 0 ...
 $ N   : int 4593
 $ J   : int 13
 $ cond: num [1:4593] 0 1 0 1 1 0 1 1 1 1 ...

Run sampling

bisection_model <- stan(file = "bisection_stan_v1.stan", 
    data = d_stan, iter = 4000, chains = 4)

Check the convergence by visualising the chains

traceplot(bisection_model, pars = c("beta", "sigma_u", "beta_ab"), 
    inc_warmup = F)

Visualize posterior distribution

pairs(bisection_model, pars = c("beta", "sigma_u"))

print(bisection_model, pars = c("beta", "sigma_u", "beta_ab"), 
    probs = c(0.025, 0.975), digits = 3)
Inference for Stan model: bisection_stan_v1.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

             mean se_mean    sd   2.5%  97.5% n_eff  Rhat
beta[1]    -0.577   0.007 0.329 -1.231  0.068  2512 1.001
beta[2]     0.430   0.004 0.216  0.001  0.865  3396 1.001
beta[3]     1.661   0.005 0.247  1.194  2.156  2596 1.000
beta[4]    -0.946   0.003 0.177 -1.312 -0.625  2926 1.000
sigma_u[1]  1.152   0.004 0.251  0.767  1.725  3708 1.000
sigma_u[2]  0.689   0.003 0.168  0.421  1.081  4004 1.000
sigma_u[3]  0.705   0.003 0.170  0.434  1.099  2604 1.001
sigma_u[4]  0.440   0.003 0.132  0.214  0.740  2404 1.001
beta_ab[1]  1.233   0.011 0.567  0.435  2.594  2458 1.000
beta_ab[2]  9.303   0.082 4.554  3.029 20.679  3110 1.000

Samples were drawn using NUTS(diag_e) at Thu Sep 27 09:44:39 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Inference

Dedicated packages (e.g. the package bayesplot) make it easier to summarize the results.

library(bayesplot)
posterior2 <- extract(bisection_model)
mcmc_intervals(posterior2, regex_pars = c("beta\\[[1-4]", 
    "sigma_u\\[[1-4]"), prob = 0.8, prob_outer = 0.95, point_est = "mean")

Inference

Dedicated packages (e.g. the package bayesplot) make it easier to summarize the results.

mcmc_areas(posterior2, regex_pars = c("beta\\[[1-4]", "sigma_u\\[[1-4]"), 
    prob = 0.8, point_est = "mean")

Inference

Fixed effects are parametrised according to the standard GLM formulation (in terms of \(\beta_0, \beta_1, \dots\)). These can be easily transformed into the psychometric function parameters \(\mu\) and \(\sigma\).

beta_par <- extract(bisection_model, pars = "beta")$beta
jnd_1 <- 1/beta_par[, 3]
jnd_2 <- 1/(beta_par[, 3] + beta_par[, 4])
jnd_diff <- jnd_2 - jnd_1  # posterior JND differences distribution

Calculate Highest Posterior Density (HPD) intervals (the narrowest interval containing the specified probability mass, e.g. 0.95).

library(coda)  # provides as.mcmc() and HPDinterval()
jnd_ci_1 <- HPDinterval(as.mcmc(jnd_1), prob = 0.95)
jnd_ci_2 <- HPDinterval(as.mcmc(jnd_2), prob = 0.95)
jnd_ci_diff <- HPDinterval(as.mcmc(jnd_diff), prob = 0.95)

Inference

Plot

Hmisc::errbar(c("sigma 1", "sigma 2", "difference"), c(mean(jnd_1), 
    mean(jnd_2), mean(jnd_diff)), yplus = c(jnd_ci_1[2], 
    jnd_ci_2[2], jnd_ci_diff[2]), yminus = c(jnd_ci_1[1], 
    jnd_ci_2[1], jnd_ci_diff[1]))

Hierarchical mixtures

  • Participants adjust an arrow to report the direction of a briefly presented moving stimulus, in one of two conditions, pre vs. post cue (data from Haladjian, Lisi & Cavanagh, APP 2018).
  • We are interested in both the variability and the bias of the response error.
  • These measures can be distorted by random guesses, so they are typically modelled as a mixture of a uniform and a bell-shaped distribution (e.g. Gaussian, von Mises).
  • Here the responses are angles, and the response error is bounded between \(-\pi\) and \(\pi\)
  • The mixture model can be expressed as \[ p(x)=\lambda \frac{1}{2\pi} + (1-\lambda) \frac{1}{2\pi I_0(k)} e^{k \cos (x-\mu)} \] where \(\lambda\) indicates the probability of a random response.
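
The mixture density can be written directly in R, using besselI() for the modified Bessel function \(I_0\); a sketch with arbitrary parameter values:

dmix <- function(x, mu, k, lambda) {
    lambda * 1/(2 * pi) + (1 - lambda) * exp(k * cos(x - mu))/(2 * pi * besselI(k, nu = 0))
}
curve(dmix(x, mu = 0, k = 5, lambda = 0.1), from = -pi, to = pi, 
    xlab = "response error (rad)", ylab = "density")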

Hierarchical mixtures

d <- read.table("cueingdd.txt", header = T, sep = "\t")
d$alpha_rad <- d$alpha_deg/180 * pi
d$rerr <- d$dir * angDiff(d$alpha_rad, d$resp)  # transform in signed angular differences
str(d)
'data.frame':   4608 obs. of  8 variables:
 $ id       : Factor w/ 8 levels "sj01","sj02",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cue      : int  2 2 2 2 2 1 2 2 2 2 ...
 $ alpha_deg: int  220 270 130 180 320 50 330 90 300 80 ...
 $ location : int  2 4 2 2 3 4 1 4 1 3 ...
 $ dir      : int  1 -1 1 1 1 1 1 1 -1 1 ...
 $ resp     : num  -1.77 -0.35 1.84 -2.12 -0.38 1.22 0.37 1.57 -1.96 1.87 ...
 $ rerr     : num  0.673 -1.221 -0.429 1.022 0.318 ...
 $ alpha_rad: num  3.84 4.71 2.27 3.14 5.59 ...

Hierarchical mixtures

  • data block.
data {
  int<lower=1> N;
  real<lower=-pi(),upper=pi()> rerr[N];
  int<lower=1,upper=2> cue[N];
  int<lower=1> J;
  int<lower=1,upper=J> id[N];
}

Hierarchical mixtures

  • parameter block.
parameters {
  // group-level
  vector[2] mu;         // bias
  vector[2] logk;       // k (log scale)
  vector[2] logilambda;   // mixing parameter (logit scale)
  
  // subject specific
  vector<lower=0>[6] sigma_u;    // random effects standard deviations
  cholesky_factor_corr[6] L_u;   // L_u is the Choleski factor
  matrix[6,J] z_u;               // random effect matrix
}

Hierarchical mixtures

  • transformed parameters block: here the parameters are transformed as required for the computation of the likelihood (to ensure that \(k>0\) and \(0<\lambda < 1\)).
transformed parameters {
  matrix<lower=0,upper=1>[2,J] lambda_j;
  matrix<lower=0>[2,J] kappa_j;
  matrix[2,J] mu_j;
  matrix[6,J] u;
  
  u = diag_pre_multiply(sigma_u, L_u) * z_u;
  
  for (j in 1:J){
    mu_j[1,j]= mu[1] + u[1,j];
    mu_j[2,j]= mu[2] + u[2,j];
    kappa_j[1,j]= exp(logk[1] + u[3,j]);
    kappa_j[2,j]= exp(logk[2] + u[4,j]);
    lambda_j[1,j]= inv_logit(logilambda[1] + u[5,j]);
    lambda_j[2,j]= inv_logit(logilambda[2] + u[6,j]);
  }
}

Hierarchical mixtures

  • model block.
model {
  // priors
  L_u ~ lkj_corr_cholesky(2); 
  to_vector(z_u) ~ normal(0,1); 
  sigma_u ~ normal(0, 10);
  mu ~ normal(0.5, 10);
  logk ~ normal(2, 20);
  logilambda ~ normal(-4, 30);
 
  // likelihood
  for (n in 1:N){
    if (kappa_j[cue[n],id[n]] < 100)
      target += log_mix(lambda_j[cue[n],id[n]],
                      uniform_lpdf(rerr[n] | -pi(), pi()),
                      von_mises_lpdf(rerr[n] | mu_j[cue[n],id[n]], kappa_j[cue[n],id[n]]));
    else
      target += log_mix(lambda_j[cue[n],id[n]],
                      uniform_lpdf(rerr[n] | -pi(), pi()),
                      normal_lpdf(rerr[n] | mu_j[cue[n],id[n]], sqrt(1/kappa_j[cue[n],id[n]])));
  }
}

Hierarchical mixture

  • Check convergence across chains.

Hierarchical mixture

  • Compute a Bayesian credible interval for the difference in mean bias.
mu1 <- extract(mix_model, pars = "mu[1]", inc_warmup = F, 
    permuted = FALSE)
mu2 <- extract(mix_model, pars = "mu[2]", inc_warmup = F, 
    permuted = FALSE)
mu1 <- c(mu1[, 1, 1], mu1[, 2, 1])/pi * 180
mu2 <- c(mu2[, 1, 1], mu2[, 2, 1])/pi * 180

HPDinterval(as.mcmc(mu2 - mu1), prob = 0.95)
         lower    upper
var1 -2.806921 12.10085
attr(,"Probability")
[1] 0.95

Hierarchical mixture

  • Compute a Bayesian credible interval for the difference in precision.
k1 <- extract(mix_model, pars = "logk[1]", inc_warmup = F, 
    permuted = FALSE)
k2 <- extract(mix_model, pars = "logk[2]", inc_warmup = F, 
    permuted = FALSE)

# transform concentration parameter k into a standard
# deviation
k2SD <- function(k) sqrt(-2 * log(besselI(k, nu = 1)/besselI(k, 
    nu = 0)))

sd1 <- k2SD(exp(c(k1[, 1, 1], k1[, 2, 1])))/pi * 180
sd2 <- k2SD(exp(c(k2[, 1, 1], k2[, 2, 1])))/pi * 180

HPDinterval(as.mcmc(sd2 - sd1), prob = 0.95)
       lower    upper
var1 1.98266 10.74307
attr(,"Probability")
[1] 0.95

Useful resources

  • Stan User manual.
  • Stan forum: https://discourse.mc-stan.org.
  • Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Richard McElreath, CRC Press, 2015.