GDR Vision - Paris 4/10/2018

Today's program

  1. Multilevel models from a frequentist perspective
  2. The Bayesian approach
  3. Markov Chain Monte Carlo (MCMC)
  4. Examples

Linear models

  • The dependent variable \(y\) is modeled as a weighted combination of the independent variables, plus an additive error \(\epsilon\) \[ y_i=\beta_0 + \beta_1x_{1i} + \ldots +\beta_nx_{ni} + \epsilon_i \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \]

Linear models

  • The dependent variable \(y\) is modeled as a weighted combination of the independent variables, plus an additive error \(\epsilon\) \[ \textbf{Y} = \textbf{X}\beta + \epsilon \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \]

Matrix notation:

\[ \bf{Y} = \bf{X}\beta + \epsilon \] \[ \left( \begin{array}{c} y_1 \\ \vdots \\ y_m \end{array} \right) = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1n}\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_{m1} & \ldots & x_{mn} \end{array} \right) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{array} \right) + \left( \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_m \end{array} \right) \]

Matrix multiplication: \[ \left( \begin{array}{cc} a & b \\ c& d \end{array} \right) \left( \begin{array}{c} 1 \\ 2 \end{array} \right) = 1 \left( \begin{array}{c} a \\ c \end{array} \right) + 2 \left( \begin{array}{c} b \\ d \end{array} \right) = \left( \begin{array}{c} a + 2b \\ c+ 2d \end{array} \right) \]
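In \(\textsf{R}\), the design matrix \(\textbf{X}\) can be built with model.matrix(); the following sketch (the data frame dat and the coefficients beta are made up for illustration) shows that the predicted values are just the matrix product \(\textbf{X}\beta\):

dat  <- data.frame(x1 = c(0.5, 1.2, 2.0, 3.1), x2 = c(1, 0, 1, 0))
beta <- c(2, 0.8, -1.5)                       # beta_0, beta_1, beta_2

X <- model.matrix(~ 1 + x1 + x2, dat)         # first column is the intercept (all 1s)
y <- X %*% beta + rnorm(nrow(dat), sd = 0.1)  # X * beta plus Gaussian error epsilon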

Linear mixed-effects models

  • 'Classical' linear models are fixed-effects only:
    • independent variables are all experimental manipulations (they are not random).
    • the only random source of variation is the residual error \(\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\).
  • However, observations (e.g. trials) are often grouped by observational clusters (e.g. subjects), which are random samples from a larger population about which we'd like to make inferences.
  • In order to properly generalize from the sample to the population, these variables should be treated as random effects. Mixed-effects models allow us to do that by explicitly modelling the population distribution.

  • A model containing both fixed and random effects is usually called a mixed-effects model (but also: hierarchical, multilevel, etc.).
  • Random effects are treated as random variations around a population mean. These random variations are usually assumed to have a Gaussian distribution.
  • A simple example: a random-intercept model. Regressions have the same slope in each of the \(J\) groups (\(j=1,\dots,J\)), but random variations in intercept (a simulation sketch follows below) \[ y_{ij} = \beta_0 + b_j + \beta_1x_{ij} + \epsilon_{ij}\\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\\ b \sim \mathcal{N}\left( 0, \sigma_b^2 \right) \]
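A minimal simulation sketch of such a random-intercept model (all names and values below are made up for illustration):

set.seed(1)
J <- 10; n <- 20                        # J groups, n observations per group
beta0 <- 2; beta1 <- 0.5                # fixed effects
sigma_b <- 1; sigma <- 0.3              # SD of random intercepts and of residuals
b <- rnorm(J, mean = 0, sd = sigma_b)   # group-specific intercept deviations
group <- rep(1:J, each = n)
x <- runif(J * n)
y <- beta0 + b[group] + beta1 * x + rnorm(J * n, mean = 0, sd = sigma)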

  • General formulation in matrix notation \[ \textbf{Y} = \textbf{X}\beta + \textbf{Z}b + \epsilon \] where \(\textbf{X}\) and \(\textbf{Z}\) are the known fixed-effects and random-effects regressor matrices.

  • The components of the residual error vector \(\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\) are assumed to be i.i.d. (independent and identically distributed).

  • The random-effect components, \(b \sim \mathcal{N}\left( 0, \Omega \right)\) are assumed to be normally distributed with mean 0, however they are not necessarily independent (the components \(b_j\) can be correlated, and correlations can be estimated). Example:\[ \left[ {\begin{array}{c} {{b_0}}\\ {{b_1}} \end{array}} \right] \sim\cal N \left( {\left[ {\begin{array}{c} 0 \\ 0 \end{array}} \right],\Omega = \left[ {\begin{array}{c} {{\mathop{\rm Var}} \left( b_0 \right)} & {{\mathop{\rm cov}} \left( {{b_0},{b_1}} \right)}\\ {{\mathop{\rm cov}} \left( {{b_0},{b_1}} \right)} & {{\mathop{\rm Var}} \left( b_1 \right)} \end{array}} \right]} \right) \]
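For intuition, correlated random intercepts and slopes can be drawn from such a bivariate normal distribution; a sketch with made-up values (MASS::mvrnorm is one way to sample from a multivariate normal):

library(MASS)
Omega <- matrix(c(1.0, 0.3,       # Var(b0),     cov(b0, b1)
                  0.3, 0.5), 2)   # cov(b0, b1), Var(b1)
b <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Omega)  # 1000 (b0, b1) pairs
cor(b[, 1], b[, 2])   # sample correlation between intercepts and slopes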

Likelihood function

  • Parameters (\(\beta\), \(\sigma^2\) and \(\Omega\)) are estimated by maximizing the likelihood function, which is the probability of the data, given the parameters (but treated as a function of the parameters, keeping the data fixed).

  • The probability of the data conditional on the random effects is integrated with respect to the marginal density of the random effects, to obtain the marginal density of the data \[ L\left(\beta,\sigma^2,\Omega \mid \text{data}\right) = \int p \left(\text{data} \mid \beta,\sigma^2, b\right) \, p\left(b \mid \Omega\right) \, db \]

ML vs REML

  • The \(\textsf{R}\) library lme4 provides two methods for estimating parameters: Maximum Likelihood (ML) and Restricted Maximum Likelihood (REML).

  • ML estimates tend to be biased and underestimate the variance components (e.g. \(\Omega\)).

  • REML provides less biased variance estimates: conceptually it can be seen as similar to Bessel's correction for the sample variance (using \(n-1\) instead of \(n\) in the denominator).
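In lme4 the estimation method is selected with the REML argument of lmer() (REML = TRUE is the default); a minimal sketch using the sleepstudy model fitted later in these slides (the object names here are arbitrary):

library(lme4)
m_reml <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy)                # REML (default)
m_ml   <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy, REML = FALSE)  # ML
VarCorr(m_reml)   # compare variance components: the ML estimates tend to be smaller
VarCorr(m_ml)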

Advantages of multilevel models

  • Improved estimates for repeated sampling (e.g. repeated-measures design).
  • Improved estimates for imbalances in sampling (e.g. unequal number of trials across subjects).
  • Avoid averaging (pre-averaging of data removes variation and can manufacture false confidence).
  • Subject-specific standard error is taken into account in group-level estimates.
  • Variation among groups or individuals is modelled explicitly.
  • They outperform classical methods in predictive ability.

Example 1 sleepstudy

sleepstudy is a dataset in the lme4 package, containing reaction-time data from 18 subjects who were restricted to 3 hours of sleep per night for 10 days.

Questions:

  • how do reaction times change with each sleep-deprived night?
  • are individual differences in baseline response times related to individual differences in the effect of sleep deprivation?
str(sleepstudy)
'data.frame':   180 obs. of  3 variables:
 $ Reaction: num  250 259 251 321 357 ...
 $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...

  • The model we want to fit includes both random intercepts and random slopes \[ y_{ij} = \beta_0 + b_{0j} + \left( \beta_1 + b_{1j} \right) \times {\rm{Days}}_{ij} + \epsilon_{ij} \\ \epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \\ b\sim\cal N \left( 0, \Omega\right) \]
  • In lme4
m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), 
    sleepstudy)

summary(m.1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
   Data: sleepstudy

REML criterion at convergence: 1743.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9536 -0.4634  0.0231  0.4634  5.1793 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740       
          Days         35.07    5.922   0.07
 Residual             654.94   25.592       
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825   36.84
Days          10.467      1.546    6.77

Correlation of Fixed Effects:
     (Intr)
Days -0.138

The function confint() makes it easy to compute bootstrapped 95% CIs of the parameters.

CI_fixef <- confint(m.1, method = "boot", nsim = 500, oldNames = F)
print(CI_fixef, digits = 2)
                              2.5 % 97.5 %
sd_(Intercept)|Subject        12.81  35.63
cor_Days.(Intercept)|Subject  -0.53   0.81
sd_Days|Subject                3.57   8.32
sigma                         22.61  28.44
(Intercept)                  237.02 264.76
Days                           7.54  13.20

  • Tests of fixed and random effects can also be conducted using likelihood ratio tests, by comparing the likelihood of two nested models.
  • If \(L_1\) and \(L_2\) are the maximised likelihoods of two nested models with \(k_1 < k_2\) parameters, the test statistic is \(2\text{log}\left( L_2/ L_1 \right)\), which is approximately \(\chi^2\) with \(k_2 - k_1\) degrees of freedom.
  • Example: test whether there is substantial variation across subjects in the effect of Days
# m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days|Subject),
# sleepstudy)
m.2 <- lmer(Reaction ~ 1 + Days + (1 | Subject), sleepstudy)
anova(m.1, m.2)
refitting model(s) with ML (instead of REML)
Data: sleepstudy
Models:
m.2: Reaction ~ 1 + Days + (1 | Subject)
m.1: Reaction ~ 1 + Days + (1 + Days | Subject)
    Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
m.2  4 1802.1 1814.8 -897.04   1794.1                             
m.1  6 1763.9 1783.1 -875.97   1751.9 42.139      2  7.072e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Shrinkage: the predicted \({\hat b}_j\) (conditional modes of the random effects) can be seen as a "compromise" between the within-subject estimates and the population mean.
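This shrinkage can be seen by comparing separate per-subject regressions with the subject-level estimates from the mixed model; a minimal sketch using lme4's lmList() (the object names are illustrative):

fit_single <- coef(lmList(Reaction ~ Days | Subject, sleepstudy))  # one OLS fit per subject (no pooling)
fit_mixed  <- coef(m.1)$Subject                                    # fixed effects + conditional modes (partial pooling)
head(fit_single); head(fit_mixed)   # mixed-model estimates are pulled ("shrunk") toward the population mean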

Diagnostics

As for any linear model, it is important to check that the residual errors are well-behaved.
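A few basic checks for m.1 (a minimal sketch; lme4 also provides a plot() method for fitted models):

plot(fitted(m.1), resid(m.1))            # residuals vs fitted values: look for structure
abline(h = 0, lty = 2)
qqnorm(resid(m.1)); qqline(resid(m.1))   # approximate normality of the residuals
qqnorm(ranef(m.1)$Subject$Days)          # and of the random effects (here the Days slopes)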

The frequentist approach

  • So far, we've seen multilevel models from a frequentist point of view.
    • parameters are seen as unknown fixed quantities.
    • the goal is to find best guesses (point-estimates) of the parameters.
    • uncertainty in the parameter's value is not treated probabilistically.
    • uncertainty in the estimates is characterized in relation to the data-generating process (think of the highly unintuitive definition of a confidence interval).
    • probability is interpreted as the expected long-run frequency of an event in an (imaginary) very large sample.

The Bayesian approach

  • What is different in the Bayesian approach?
    • probability is interpreted as the degree of belief in the occurrence of an event, or in a proposition.
    • parameters are seen as unknown random variables, with uncertainty in their values represented by a prior probability distribution.
    • the goal is to update the prior distribution on the basis of the available data, leading to a posterior probability distribution.
    • this is done with Bayes' theorem; however, the Bayesian approach is not distinguished by Bayes' theorem itself (which is just a trivial implication of probability theory).
    • "The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis" (Gelman et al. 2013).

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data}\right) = \frac{p\left(\text{data} \mid \theta\right) p\left(\theta\right)}{p\left( \text{data} \right)} \]

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data}\right) = \frac{p\left(\text{data} \mid \theta\right)p\left(\theta\right)}{\int p\left( \text{data} \mid \theta \right) p\left(\theta\right) d\theta}\]

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data} \right) = \frac{p\left(\text{data} \mid \theta\right)p\left(\theta\right)}{\int p\left( \text{data} \mid \theta \right) p\left(\theta\right) d\theta}\\ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{average likelihood}} \]

Bayesian models

  • In a typical Bayesian setting we have a likelihood \(p\left(\text{data} \mid \theta\right)\), conditional on the parameter \(\theta\), and a prior distribution \(p\left(\theta\right)\). The posterior distribution for the parameter is \[ p\left(\theta \mid \text{data}\right) \propto p\left(\text{data} \mid \theta\right)p\left(\theta\right)\\ \text{posterior} \propto \text{likelihood} \times \text{prior} \]
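For a single parameter, this relation can be illustrated with a simple grid approximation (a toy example, not from the slides: estimating a probability of success \(\theta\) after observing 7 successes in 10 trials):

theta <- seq(0, 1, length.out = 200)          # grid of candidate parameter values
prior <- dbeta(theta, 2, 2)                   # prior p(theta)
lik   <- dbinom(7, size = 10, prob = theta)   # likelihood p(data | theta)
post  <- lik * prior                          # un-normalized posterior
post  <- post / sum(post)                     # normalize over the grid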

Bayesian multilevel models

  • In the case of a multilevel model, each cluster \(j\) has a parameter \(\theta_j\), which is an independent sample from a population distribution governed by a hyperparameter \(\phi\).
  • \(\phi\) is also unknown, and thus has its own prior distribution \(p\left(\phi\right)\) (the hyperprior).
  • We have a joint prior distribution: \(p\left(\phi,\theta\right)=p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)\)
  • and a joint posterior distribution: \[p\left(\phi,\theta \mid \text{data}\right) \propto \underbrace{p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)}_\text{prior} \times \underbrace{ p\left(\text{data} \mid \theta_j\right)}_\text{likelihood} \]

Bayesian parameter inference

  • The information we want about a parameter \(\phi\) is contained in its marginal posterior distribution \(p\left(\phi \mid \text{data} \right)=\int p\left(\phi,\theta \mid \text{data} \right) d\theta\).
  • The calculation can be difficult when a model has many parameters (to describe the uncertainty in a parameter of interest, one must average or marginalize over the uncertainty in all other parameters).
  • Monte Carlo techniques and computer simulations (Markov-Chain Monte Carlo sampling, MCMC) can be used to draw samples from the posterior distribution.
  • Once you have samples from the posterior distribution, summarizing the marginal posterior distribution becomes simply a matter of counting values.
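For example, given a vector of posterior draws for one parameter, the usual summaries are one-liners (the draws below are simulated only as a placeholder for real MCMC output):

samples <- rnorm(5000, mean = 10, sd = 2)   # placeholder for MCMC draws of one parameter
mean(samples)                               # posterior mean
quantile(samples, c(0.025, 0.975))          # 95% credible interval
mean(samples > 0)                           # posterior probability that the parameter is positive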

Metropolis algorithm

Say you want to sample from a density \(P({\bf x})\), but you can only evaluate a function \(f({\bf x})\) that is proportional to the density \(P({\bf x})\) (an \(\textsf{R}\) implementation sketch follows the steps below):

  • Initialization:
    1. choose arbitrary starting point \({\bf x}_0\)
    2. choose a probability density to generate proposals \(g({\bf x}_{n+1}|{\bf x}_n)\) (proposal density)
  • For each iteration \(i\):
    1. generate a candidate \({\bf x'}\) by sampling from \(g({\bf x}_{i+1}|{\bf x}_{i})\)
    2. calculate the acceptance ratio \(a = \frac{f({\bf x}')}{f({\bf x}_{i})}\)
    3. accept/reject: generate a uniform number \(u\) in \([0,1]\)
      • if \(u \le a\), accept and set \({\bf x}_{i+1}={\bf x'}\) ("move to the new value")
      • if \(u > a\), reject and set \({\bf x}_{i+1}={\bf x}_{i}\) ("stay where you are")
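A generic implementation sketch in \(\textsf{R}\) (the function and argument names are made up; it uses a symmetric Gaussian proposal and works on the log scale to avoid numerical underflow):

metropolis <- function(f_log, x0, n_iter = 10000, sd_prop = 1) {
    # f_log: log of a function proportional to the target density P(x)
    chain <- matrix(NA, nrow = n_iter, ncol = length(x0))
    chain[1, ] <- x0
    for (i in 1:(n_iter - 1)) {
        proposal <- chain[i, ] + rnorm(length(x0), mean = 0, sd = sd_prop)  # candidate x'
        log_a <- f_log(proposal) - f_log(chain[i, ])                        # log acceptance ratio
        if (log(runif(1)) <= log_a) {
            chain[i + 1, ] <- proposal      # accept: move to the new value
        } else {
            chain[i + 1, ] <- chain[i, ]    # reject: stay where you are
        }
    }
    chain
}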

MCMC example

Let's use the sampling algorithm to estimate the parameters for one participant of the sleepstudy dataset.

  • To sample from the posterior distribution of the parameters, it is sufficient to compute the un-normalized posterior: \(\text{prior} \times \text{likelihood}\)

MCMC example

First, write a function to compute the (log-)likelihood, keeping the data fixed.

x <- sleepstudy$Days[sleepstudy$Subject == "308"]
y <- sleepstudy$Reaction[sleepstudy$Subject == "308"]

loglik <- function(par) {
    # par = [intercept, slope, log(residual Std.)]
    pred <- par[1] + par[2] * x   # predicted reaction times
    return(sum(dnorm(y, mean = pred, sd = exp(par[3]), log = TRUE)))
}

MCMC example

Next, choose your priors

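A possible choice of weakly informative priors, written as a log-prior function; combined with loglik() it gives the un-normalized log-posterior that the Metropolis sketch above can sample from. The specific distributions and values below are assumptions for illustration, not necessarily those used in the original slides.

logprior <- function(par) {
    # par = [intercept, slope, log(residual Std.)] -- illustrative prior choices
    dnorm(par[1], mean = 250, sd = 100, log = TRUE) +   # intercept (ms)
    dnorm(par[2], mean = 0,   sd = 20,  log = TRUE) +   # slope (effect of Days)
    dnorm(par[3], mean = 3,   sd = 1,   log = TRUE)     # log residual SD
}

logpost <- function(par) loglik(par) + logprior(par)    # un-normalized log-posterior

chain <- metropolis(logpost, x0 = c(250, 10, 3), n_iter = 20000, sd_prop = c(4, 1, 0.1))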