Loading [MathJax]/jax/output/CommonHTML/jax.js
+ - 0:00:00
Notes for current slide
Notes for next slide

An introduction to Bayesian data analysis





Matteo Lisi

29 November 2023

1

Disclaimer



2

Disclaimer



There is no 'dark side' in statistics.

2

Image from 'Statistical rethinking', Richard McElreath.

3

Outline

  1. How to fit a model to data: frequentist & Bayesian approaches
  2. Computations in Bayesian inference & MCMC sampling
  3. Examples: linear regression & drift-diffusion model
  4. Bayes factors
4

Fitting a model: frequentist approach

  • Maximum likelihood estimation (MLE)
5

Fitting a model: frequentist approach

  • Maximum likelihood estimation (MLE)
  • Likelihood function: a mathematical function that gives the probability of observing the data, given specific values for the parameters of a statistical model.
5

Fitting a model: frequentist approach

  • Maximum likelihood estimation (MLE)
  • Likelihood function: a mathematical function that gives the probability of observing the data, given specific values for the parameters of a statistical model.

  • The higher the value of the likelihood function for a set of parameters, the more likely it is that these parameters are the correct ones that explain our data.

5

MLE in linear regression

  • The best fitting line in linear regression minimizes the sum of squared residuals (errors).

6

MLE in linear regression

  • The best fitting line in linear regression minimizes the sum of squared residuals (errors).


  • Equivalent to maximizing the probability of the data assuming residuals have a Gaussian distribution centred on predicted values.

6

Likelihood function

  • Defined as the probability of the data, given some parameter values, usually notated as p(dataparameters).
7

Likelihood function

  • Defined as the probability of the data, given some parameter values, usually notated as p(dataparameters).

  • For a simple linear regression yi=β0+β1xi+ϵi ϵiN(0,σ2)

7

Likelihood function

  • Defined as the probability of the data, given some parameter values, usually notated as p(dataparameters).

  • For a simple linear regression yi=β0+β1xi+ϵi ϵiN(0,σ2)the likelihood is p(y,xdataβ0,β1,σ2parameters)=ni=1productN(predicted valueβ0+β1xi,σ2)Gaussian probability density

7

Likelihood function

  • Defined as the probability of the data, given some parameter values, usually notated as p(dataparameters).
  • For a simple linear regression yi=β0+β1xi+ϵi ϵiN(0,σ2) the likelihood is p(y,xdataβ0,β1,σ2parameters)=ni=1product12πσ2e(yipredicted value(β0+β1xi))22σ2Gaussian probability density
7

Likelihood function

In practice, we usually work with the logarithm of the likelihood function (log-likelihood)


L(y,xβ0,β1,σ2)=log[p(y,xβ0,β1,σ2)]=n2log(2π)n2log(σ2)12σ2ni=1(yiβ0β1xi)2sum of squared residual errors

8

Likelihood function

In practice, we usually work with the logarithm of the likelihood function (log-likelihood)


L(y,xβ0,β1,σ2)=log[p(y,xβ0,β1,σ2)]=n2log(2π)n2log(σ2)12σ2ni=1(yiβ0β1xi)2sum of squared residual errors

  • The values of intercept β0 and slope β1 that minimize the sum of squared residuals also maximize the (log) likelihood function.
8

The frequentist approach

  • Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
9

The frequentist approach

  • Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.

  • Key aspects of frequentist inference:

9

The frequentist approach

  • Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.

  • Key aspects of frequentist inference:

    • Parameters are considered unknown fixed quantities.
9

The frequentist approach

  • Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.

  • Key aspects of frequentist inference:

    • Parameters are considered unknown fixed quantities.

    • The objective is to find best point-estimates for these parameters.

9

The frequentist approach

  • Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.

  • Key aspects of frequentist inference:

    • Parameters are considered unknown fixed quantities.

    • The objective is to find best point-estimates for these parameters.

    • Uncertainty regarding the value of a parameter is not quantified using probabilities.

9

The frequentist approach

  • Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.

  • Key aspects of frequentist inference:

    • Parameters are considered unknown fixed quantities.

    • The objective is to find best point-estimates for these parameters.

    • Uncertainty regarding the value of a parameter is not quantified using probabilities.

    • Uncertainty in the estimates is expressed in relation to the data-generating process (think about the un-intuitive definition of confidence interval).

9

The frequentist approach

  • Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.

  • Key aspects of frequentist inference:

    • Parameters are considered unknown fixed quantities.

    • The objective is to find best point-estimates for these parameters.

    • Uncertainty regarding the value of a parameter is not quantified using probabilities.

    • Uncertainty in the estimates is expressed in relation to the data-generating process (think about the un-intuitive definition of confidence interval).

    • Probability is interpreted as the long-term frequency of an event occurring a very large (infinite) series of repeated trials or samples.

9

The Bayesian approach

10

The Bayesian approach

  • What is different in the Bayesian approach?
10

The Bayesian approach

  • What is different in the Bayesian approach?

    • Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
10

The Bayesian approach

  • What is different in the Bayesian approach?

    • Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.

    • Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.

10

The Bayesian approach

  • What is different in the Bayesian approach?

    • Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.

    • Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.

    • The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.

10

The Bayesian approach

  • What is different in the Bayesian approach?

    • Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.

    • Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.

    • The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.

    • This update is achieved through Bayes' theorem, however the Bayesian approach is not defined by the use Bayes' theorem itself (which is just an implication of the axioms of probability).

10

The Bayesian approach

  • What is different in the Bayesian approach?

    • Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.

    • Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.

    • The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.

    • This update is achieved through Bayes' theorem, however the Bayesian approach is not defined by the use Bayes' theorem itself (which is just an implication of the axioms of probability).

    • "The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis" (Gelman et al., 2013).

10

Elements of Bayesian models

  • In a typical Bayesian setting we have:
11

Elements of Bayesian models

  • In a typical Bayesian setting we have:

    • likelihood p(dataθ), giving probability of the data conditional on the parameter(s) θ;
11

Elements of Bayesian models

  • In a typical Bayesian setting we have:

    • likelihood p(dataθ), giving probability of the data conditional on the parameter(s) θ;

    • prior p(θ), which formalizes a-priori belief about the plausibility of parameter values

11

Elements of Bayesian models

  • In a typical Bayesian setting we have:

    • likelihood p(dataθ), giving probability of the data conditional on the parameter(s) θ;

    • prior p(θ), which formalizes a-priori belief about the plausibility of parameter values

    • posterior distribution, obtained by applying Bayes theorem: p(θdata)=p(dataθ)p(θ)p(data)

11

Elements of Bayesian models

  • In a typical Bayesian setting we have:

    • likelihood p(dataθ), giving probability of the data conditional on the parameter(s) θ;

    • prior p(θ), which formalizes a-priori belief about the plausibility of parameter values

    • posterior distribution, obtained by applying Bayes theorem:p(θdata)=p(dataθ)p(θ)p(dataθ)p(θ)dθ

11

Elements of Bayesian models

  • In a typical Bayesian setting we have:

    • likelihood p(dataθ), giving probability of the data conditional on the parameter(s) θ;

    • prior p(θ), which formalizes a-priori belief about the plausibility of parameter values

    • posterior distribution, obtained by applying Bayes theorem: posterior=likelihood×prioraverage likelihood

11

Elements of Bayesian models

  • In a typical Bayesian setting we have:

    • likelihood p(dataθ), giving probability of the data conditional on the parameter(s) θ;

    • prior p(θ), which formalizes a-priori belief about the plausibility of parameter values

    • posterior distribution, obtained by applying Bayes theorem: posteriorlikelihood×prior

11

Bayesian inference

  • When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
12

Bayesian inference

  • When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).

  • The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.

12

Bayesian inference

  • When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).

  • The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.

    formally: p(β1data)=p(β1,β0,σdata)dβ0dσ).

12

Bayesian inference

  • When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).

  • The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.

    formally: p(β1data)=p(β1,β0,σdata)dβ0dσ).

  • As models become complex, this calculation gets challenging because it involves averaging (or marginalizing) over many parameters.
12

Bayesian inference

  • When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).

  • The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.

    formally: p(β1data)=p(β1,β0,σdata)dβ0dσ).

  • As models become complex, this calculation gets challenging because it involves averaging (or marginalizing) over many parameters.

  • We often use computer simulations, like Markov Chain Monte Carlo (MCMC) sampling, to handle these calculations efficiently.

12

Bayesian inference

  • When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).

  • The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.

    formally: p(β1data)=p(β1,β0,σdata)dβ0dσ).

  • As models become complex, this calculation gets challenging because it involves averaging (or marginalizing) over many parameters.

  • We often use computer simulations, like Markov Chain Monte Carlo (MCMC) sampling, to handle these calculations efficiently.

  • After performing MCMC, we end up with 'samples' drawn from the posterior distribution. With these in hand, summarizing the the marginal posterior becomes simply a matter of 'counting' the samples.

12

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:
13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0
13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0

    2. choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)

13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0

    2. choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)

  • For each iteration i:
13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0

    2. choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)

  • For each iteration i:

    1. generate a candidate x sampling from g(xi+1|xi)
13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0

    2. choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)

  • For each iteration i:

    1. generate a candidate x sampling from g(xi+1|xi)

    2. calculate the acceptance ratio a=f(x)f(xi)

13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0

    2. choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)

  • For each iteration i:

    1. generate a candidate x sampling from g(xi+1|xi)

    2. calculate the acceptance ratio a=f(x)f(xi)

    3. accept/reject: generate a uniform number u in [0,1]

13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0

    2. choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)

  • For each iteration i:

    1. generate a candidate x sampling from g(xi+1|xi)

    2. calculate the acceptance ratio a=f(x)f(xi)

    3. accept/reject: generate a uniform number u in [0,1]

    • if ua, accept and set xi+1=x ("move to the new value")
13

Metropolis algorithm

Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).

  • Initialization:

    1. choose arbitrary starting point x0

    2. choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)

  • For each iteration i:

    1. generate a candidate x sampling from g(xi+1|xi)

    2. calculate the acceptance ratio a=f(x)f(xi)

    3. accept/reject: generate a uniform number u in [0,1]

    • if ua, accept and set xi+1=x ("move to the new value")

    • if u>a, reject and set xi+1=xi ("stay where you are")

13

MCMC example

Implementing Metropolis algorithm to fit regression for one participant of the dataset sleepstudy (from the lme4 package).

  • To sample from the posterior we only need to compute the prior and the likelihood.
14

MCMC example: likelihood

Custom function to compute the (log) likelihood, keeping the data fixed.

x <- sleepstudy$Days[sleepstudy$Subject=="308"]
y <- sleepstudy$Reaction[sleepstudy$Subject=="308"]
loglik <- function(par){
# parameter vector = [intercept, slope, log(SD)]
pred <- par[1] + par[2]*x
return(sum(dnorm(y, mean = pred, sd = exp(par[3]), log = TRUE)))
}
15

MCMC example: priors

Prior distribution about parameter values

16

MCMC example: priors

Prior distribution about parameter values

17

MCMC example: priors

Similar to the likelihood, we write a custom function that compute the joint log-prior probability

logprior <- function(par){
intercept_prior <- dnorm(par[1], mean=250, sd=180, log=TRUE)
slope_prior <- dnorm(par[2], mean=20, sd=20, log=TRUE)
sd_prior <- dnorm(par[3],mean=4, sd=1, log=TRUE)
return(intercept_prior+slope_prior+sd_prior)
}
18

MCMC example: priors

Similar to the likelihood, we write a custom function that compute the joint log-prior probability

logprior <- function(par){
intercept_prior <- dnorm(par[1], mean=250, sd=180, log=TRUE)
slope_prior <- dnorm(par[2], mean=20, sd=20, log=TRUE)
sd_prior <- dnorm(par[3],mean=4, sd=1, log=TRUE)
return(intercept_prior+slope_prior+sd_prior)
}
18

MCMC example

Put together likelihood and prior to obtain the (log) posterior probability

logposterior <- function(par){
return (loglik(par) + logprior(par))
}
19

MCMC example

Choose the arbitrary starting point and the proposal density g(xn+1|xn)

# initial parameters
startvalue <- c(250, 20, 5) # [intercept, slope, log(SD)]
# proposal density
proposalfunction <- function(par){
return(rnorm(3, mean = par, sd= c(15,5,0.2)))
}
20

MCMC example

This function execute iteratively the step we have seen before

run_metropolis_MCMC <- function(startvalue, iterations){
# set up an empty array to store smapled values
chain <- array(dim = c(iterations+1,3))
# put starting values at top of arrays
chain[1,] <- startvalue
for (i in 1:iterations){
# draw a random proposal
proposal <- proposalfunction(chain[i,])
# ratio of posterior density between new and old values
a <- exp(logposterior(proposal) - logposterior(chain[i,]))
# sample random number & accept/reject the parameter values
if (runif(1) < a){
chain[i+1,] <- proposal
}else{
chain[i+1,] <- chain[i,]
}
}
return(chain)
}
21

MCMC example

Run sampling for many iterations

chain <- run_metropolis_MCMC(startvalue, 20000)

Output:

head(chain)
## [,1] [,2] [,3]
## [1,] 250.0000 20.00000 5.000000
## [2,] 240.6032 20.91822 4.832874
## [3,] 246.8228 13.21847 4.647161
## [4,] 257.8977 16.09737 4.586083
## [5,] 257.8977 16.09737 4.586083
## [6,] 274.7716 15.87271 4.582845
22

MCMC example

23

MCMC example

Having the samples from the posterior distribution, summarising uncertainty is simply a matter of counting values.

24

MCMC example

Having the samples from the posterior distribution, summarising uncertainty is simply a matter of counting values.

We can calculate a 95% Bayesian credible interval for slope by taking the 2.5th and 97.5th percentiles of posterior samples

# remove initial 'burn in' samples
burnIn <- 5000
slope_samples <- chain[-(1:burnIn),2]
# mean of posterior distribution
mean(slope_samples)
## [1] 21.75772
# 95% Bayesian credible interval
alpha <- 0.05
round(c(quantile(slope_samples, probs = alpha/2),
quantile(slope_samples, probs = 1-alpha/2)),
digits=2)
## 2.5% 97.5%
## 10.82 33.27
24
  • The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.
25
  • The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.

  • Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.

25
  • The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.

  • Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.

  • Modern algorithm are more efficient, they use clever proposal methods to get a better picture of the posterior distribution with fewer samples.

25
  • The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.

  • Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.

  • Modern algorithm are more efficient, they use clever proposal methods to get a better picture of the posterior distribution with fewer samples.

  • A state-of-the-art algorithm is Hamiltonian Monte Carlo (HMC), which is used by the free software Stan (mc-stan.org).

25

Stan

  • Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that makes it easy to do Bayesian inference on complex models.
26

Stan

  • Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that makes it easy to do Bayesian inference on complex models.

  • Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density.

26

Stan

  • Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that makes it easy to do Bayesian inference on complex models.

  • Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density.

  • Stan is free, open-source, has a comprehensive documentation and interfaces for the most popular computing environments (R, Matlab, Python, Mathematica)

26

Stan examples

  1. Linear regression

  2. Drift-diffusion model

27

1. Linear regression in Stan

Walking speed dataset:

d <- read.table ("../data/wagespeed.csv",
header=T,
sep=",")
head(d)
## wage wspeed city
## 1 15.59 73.83 Jakarta
## 2 20.22 70.26 Sofia
## 3 25.62 65.46 Bucharest
## 4 32.78 65.36 Rio
## 5 23.96 76.84 Guanzhou
## 6 23.19 80.66 MexicoCity
28

1. Linear regression in Stan

Walking speed dataset:

d <- read.table ("../data/wagespeed.csv",
header=T,
sep=",")
head(d)
## wage wspeed city
## 1 15.59 73.83 Jakarta
## 2 20.22 70.26 Sofia
## 3 25.62 65.46 Bucharest
## 4 32.78 65.36 Rio
## 5 23.96 76.84 Guanzhou
## 6 23.19 80.66 MexicoCity

28
  • Format data as a 'list'
  • We standardize the variables so that slope can be interpreted as a correlation coefficient
# function to standardize variables
standardize <- function(x){
Z <- (x - mean(x)) / sd(x)
return(Z)
}
# prepare data to fit in Stan
data_stan <- list(
N = nrow(d),
wage = standardize(d$wage),
wspeed = standardize(d$wspeed)
)
#
str(data_stan)
## List of 3
## $ N : int 27
## $ wage : num [1:27] -1.461 -1.308 -1.129 -0.892 -1.184 ...
## $ wspeed: num [1:27] -1.06 -1.48 -2.05 -2.06 -0.71 ...


29

Model code in stan language:

data {
int<lower=0> N;
vector[N] wspeed;
vector[N] wage;
}
parameters {
real beta[2];
real<lower=0> sigma;
}
model {
// priors
beta[1] ~ normal(0, 1);
beta[2] ~ normal(0.5, 1);
sigma ~ normal(0,1) T[0,];
// likelihood
wspeed ~ normal(beta[1] + beta[2]*wage, sigma);
}
30

Model code in stan language:

data {
int<lower=0> N;
vector[N] wspeed;
vector[N] wage;
}
parameters {
real beta[2];
real<lower=0> sigma;
}
model {
// priors
beta[1] ~ normal(0, 1);
beta[2] ~ normal(0.5, 1);
sigma ~ normal(0,1) T[0,];
// likelihood
wspeed ~ normal(beta[1] + beta[2]*wage, sigma);
}

The code contain 3 blocks:

30

Model code in stan language:

data {
int<lower=0> N;
vector[N] wspeed;
vector[N] wage;
}
parameters {
real beta[2];
real<lower=0> sigma;
}
model {
// priors
beta[1] ~ normal(0, 1);
beta[2] ~ normal(0.5, 1);
sigma ~ normal(0,1) T[0,];
// likelihood
wspeed ~ normal(beta[1] + beta[2]*wage, sigma);
}

The code contain 3 blocks:

  • data block, in which variables are declared;
30

Model code in stan language:

data {
int<lower=0> N;
vector[N] wspeed;
vector[N] wage;
}
parameters {
real beta[2];
real<lower=0> sigma;
}
model {
// priors
beta[1] ~ normal(0, 1);
beta[2] ~ normal(0.5, 1);
sigma ~ normal(0,1) T[0,];
// likelihood
wspeed ~ normal(beta[1] + beta[2]*wage, sigma);
}

The code contain 3 blocks:

  • data block, in which variables are declared;

  • parameters block, in which we declare the parameters that we want to sample;

30

Model code in stan language:

data {
int<lower=0> N;
vector[N] wspeed;
vector[N] wage;
}
parameters {
real beta[2];
real<lower=0> sigma;
}
model {
// priors
beta[1] ~ normal(0, 1);
beta[2] ~ normal(0.5, 1);
sigma ~ normal(0,1) T[0,];
// likelihood
wspeed ~ normal(beta[1] + beta[2]*wage, sigma);
}

The code contain 3 blocks:

  • data block, in which variables are declared;

  • parameters block, in which we declare the parameters that we want to sample;

  • model block, containing the model specification (i.e. priors and likelihood)

30



The model block allows defining the model in terms of probability distribution

model {
// priors
beta[1] ~ normal(0, 1);
beta[2] ~ normal(0.5, 1);
sigma ~ normal(0,1) T[0,];
// likelihood
wspeed ~ normal(beta[1] + beta[2]*wage, sigma);
}
31



The model block allows defining the model in terms of probability distribution

model {
// priors
beta[1] ~ normal(0, 1);
beta[2] ~ normal(0.5, 1);
sigma ~ normal(0,1) T[0,];
// likelihood
wspeed ~ normal(beta[1] + beta[2]*wage, sigma);
}




β0N(0,1)β1N(0.5,1)σN[0,+)(0,1)

wspeedN(β0+β1wage,σ2)

31

Sampling from posterior using Stan

library(rstan)
options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available
# run sampling: 4 chains in parallel on separate cores
wspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)
32

Sampling from posterior using Stan

library(rstan)
options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available
# run sampling: 4 chains in parallel on separate cores
wspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)

Results

print(wspeed_fit)
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] 0.00 0.00 0.14 -0.28 -0.09 0.00 0.09 0.26 3788 1
## beta[2] 0.73 0.00 0.14 0.45 0.63 0.73 0.82 1.02 3496 1
## sigma 0.72 0.00 0.10 0.55 0.64 0.71 0.78 0.95 3628 1
## lp__ -4.01 0.03 1.24 -7.14 -4.62 -3.70 -3.09 -2.56 1987 1
##
## Samples were drawn using NUTS(diag_e) at Sat Nov 25 17:22:37 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
32

Sampling from posterior using Stan

library(rstan)
options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available
# run sampling: 4 chains in parallel on separate cores
wspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)

Results

print(wspeed_fit)
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] 0.00 0.00 0.14 -0.28 -0.09 0.00 0.09 0.26 3788 1
## beta[2] 0.73 0.00 0.14 0.45 0.63 0.73 0.82 1.02 3496 1
## sigma 0.72 0.00 0.10 0.55 0.64 0.71 0.78 0.95 3628 1
## lp__ -4.01 0.03 1.24 -7.14 -4.62 -3.70 -3.09 -2.56 1987 1
##
## Samples were drawn using NUTS(diag_e) at Sat Nov 25 17:22:37 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).

The Rhat ( ˆR ) is a diagnostic index; at convergence ˆR1±0.1

(essentially ˆR is the ratio of between-chain to within-chain variance)

32

Visualize chains to check for convergence

traceplot(wspeed_fit, pars=c("beta","sigma"))

33

Visualize posterior distribution

pairs(wspeed_fit, pars=c("beta","sigma"))

34

Visualize model fit to the data

35

Priors

  • Priors represent our initial beliefs about the plausibility of parameter values.
36

Priors

  • Priors represent our initial beliefs about the plausibility of parameter values.

  • Unlike likelihoods, there are no conventional priors; choice depends on context.

36

Priors

  • Priors represent our initial beliefs about the plausibility of parameter values.

  • Unlike likelihoods, there are no conventional priors; choice depends on context.

  • Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).

36

Priors

  • Priors represent our initial beliefs about the plausibility of parameter values.

  • Unlike likelihoods, there are no conventional priors; choice depends on context.

  • Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).

  • Priors are assumptions, and should be criticized (e.g. try different assumptions to check how sensitive inference is to the choice of priors).

36

Priors

  • Priors represent our initial beliefs about the plausibility of parameter values.

  • Unlike likelihoods, there are no conventional priors; choice depends on context.

  • Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).

  • Priors are assumptions, and should be criticized (e.g. try different assumptions to check how sensitive inference is to the choice of priors).

  • Priors cannot be exact, but should be defendable.

36

Priors

  • There are no non-informative priors: flat priors such as Uniform(,) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
37

Priors

  • There are no non-informative priors: flat priors such as Uniform(,) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.

  • Priors should restrict parameter space to values that make sense.

37

Priors

  • There are no non-informative priors: flat priors such as Uniform(,) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.

  • Priors should restrict parameter space to values that make sense.

  • Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).

37

Priors

  • There are no non-informative priors: flat priors such as Uniform(,) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.

  • Priors should restrict parameter space to values that make sense.

  • Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).

  • Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.

37

Priors

  • There are no non-informative priors: flat priors such as Uniform(,) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.

  • Priors should restrict parameter space to values that make sense.

  • Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).

  • Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.

  • Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations

37

Priors

  • There are no non-informative priors: flat priors such as Uniform(,) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.

  • Priors should restrict parameter space to values that make sense.

  • Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).

  • Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.

  • Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations

  • If using informative priors, these should be justified in the text when reporting the results (see examples in Gelman & Hennig, 2016, Beyond subjective and objective in statistics)

37

Priors

  • There are no non-informative priors: flat priors such as Uniform(,) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.

  • Priors should restrict parameter space to values that make sense.

  • Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).

  • Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.

  • Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations

  • If using informative priors, these should be justified in the text when reporting the results (see examples in Gelman & Hennig, 2016, Beyond subjective and objective in statistics)

  • When in doubt (especially for complex models where the relationship between prior & observed data is not straighforward) is always a good idea to do a prior predictive check (i.e. sample parameter values from the priors, use it to simulate data, and see if the overall distribution looks sensible).

37

Example 2: The Drift Diffusion Model (DDM)

The DDM describes the process of decision making in terms of:

  • Drift Rate (v): Speed of information accumulation.
  • Boundary Separation (a): Threshold for making a decision.
  • Starting Point (z): Bias towards one of the responses.
  • Non-Decision Time (t0): Processes outside decision making.

We will estimate these parameters using Stan.

38

Example 2: The Drift Diffusion Model (DDM)

The distribution of 'first-passage times' is given by the Wiener distribution, which is already included in Stan math library: https://mc-stan.org/docs/functions-reference/wiener-first-passage-time-distribution.html

For a correct response (upper boundary):

rtWiener(a,t0,z,v)

For an error (lower boundary):

rtWiener(a,t0,1z,v)

39

DDM Stan code

data {
int<lower=0> N; // Number of observations
int<lower=0,upper=1> resp[N]; // Response: 1 or 0
real<lower=0> rt[N]; // Response time
}
parameters {
real v; // Drift rate
real<lower=0> a; // Boundary separation
real<lower=0> t0; // Non-decision time
real<lower=0, upper=1> z; // Starting point
}
model {
// Priors
v ~ normal(1, 3);
a ~ normal(1, 1)T[0,2];
t0 ~ normal(0, 3)T[0,];
z ~ beta(2, 2);
// Likelihood
for (i in 1:N) {
if (resp[i] == 1) {
rt[i] ~ wiener(a, t0, z, v);
} else {
rt[i] ~ wiener(a, t0, 1 - z, -v);
}
}
}
40
d <- read_csv("../data/ddm_data.csv") # load data
d %>% # make a plot
mutate(resp = factor(resp)) %>%
ggplot(aes(x=rt, color=resp, fill=resp))+
geom_histogram(binwidth=0.05, color="white") +
facet_grid(.~resp)

41

Format data as a list and run sampling

data_stan <- list(
N = nrow(d),
rt = d$rt,
resp = d$resp
)
ddm_fit <- stan(file = "../ddm4p.stan", # ddm4p.stan is the file containing the model code
data = data_stan, iter = 2000, chains = 4)

Results:

print(ddm_fit)
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## v 1.01 0.00 0.12 0.78 0.93 1.01 1.09 1.25 2524 1
## a 0.98 0.00 0.02 0.94 0.97 0.98 1.00 1.02 2770 1
## t0 0.25 0.00 0.00 0.24 0.25 0.25 0.25 0.26 2152 1
## z 0.50 0.00 0.02 0.47 0.49 0.50 0.51 0.53 2600 1
## lp__ 17.73 0.03 1.44 14.16 17.02 18.05 18.78 19.53 1708 1
##
## Samples were drawn using NUTS(diag_e) at Fri Nov 24 13:44:19 2023.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
42

Visualize chains to check for convergence

traceplot(ddm_fit, pars=c("v","a","t0","z"))

43

Visualize posterior distribution

pairs(ddm_fit, pars=c("v","a","t0","z"))

44

Visualize model fit to the data

45

Visualize marginal distribution & credible intervals

# Extract samples and convert to long format for plotting
parameter_samples <- gather_draws(ddm_fit, v, a, t0, z)
# Create violin plot with 95% HDI intervals
ddm_fit %>%
spread_draws(v, a, t0, z) %>%
pivot_longer(cols = c(v, a, t0, z), names_to = "parameter", values_to = "value") %>%
ggplot(aes(y = parameter, x = value)) +
stat_halfeye(.width = .95,normalize="groups") +
labs(x="value",y="parameter")

46

Bayes factors

  • A Bayesian approach to hypothesis testing & model comparison.
47

Bayes factors

  • A Bayesian approach to hypothesis testing & model comparison.

  • Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.

47

Bayes factors

  • A Bayesian approach to hypothesis testing & model comparison.

  • Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.

  • The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(dataM1)p1(data M1)=p1(dataθ1)p1(θ1)dθ1p2(dataθ2)p2(θ2)dθ2

47

Bayes factors

  • A Bayesian approach to hypothesis testing & model comparison.

  • Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.

  • The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(dataM1)p1(data M1)=p1(dataθ1)p1(θ1)dθ1p2(dataθ2)p2(θ2)dθ2

  • The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1data)p(M2data)posterior odds=p(M1)p(M2)prior odds×p1(dataθ1)p1(θ1)dθ1p2(dataθ2)p2(θ2)dθ2Bayes factor

47

Bayes factors

  • A Bayesian approach to hypothesis testing & model comparison.

  • Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.

  • The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(dataM1)p1(data M1)=p1(dataθ1)p1(θ1)dθ1p2(dataθ2)p2(θ2)dθ2

  • The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1data)p(M2data)posterior odds=p(M1)p(M2)prior odds×p1(dataθ1)p1(θ1)dθ1p2(dataθ2)p2(θ2)dθ2Bayes factor

  • The Bayes factors only tells us how much - given the data and the prior - we need to update our relative beliefs between two models.
47

Bayes factors

  • A Bayesian approach to hypothesis testing & model comparison.

  • Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.

  • The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(dataM1)p1(data M1)=p1(dataθ1)p1(θ1)dθ1p2(dataθ2)p2(θ2)dθ2

  • The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1data)p(M2data)posterior odds=p(M1)p(M2)prior odds×p1(dataθ1)p1(θ1)dθ1p2(dataθ2)p2(θ2)dθ2Bayes factor

  • The Bayes factors only tells us how much - given the data and the prior - we need to update our relative beliefs between two models.

  • Say we found that BF1,23, we can say that 'the data are 3 times more likely under model 1 than model 2'.

47

Bayes factors for nested models: the Savage–Dickey density ratio

  • When the two models are nested (e.g. the M1 is a special case of M2 in which which one specific parameter is set to zero - corresponding to the null hypothesis in classical NHST framework), the Bayes factor can be obtained by dividing the height of the posterior by the height of the prior at the point of interest (Dickey & Lientz, 1970)
48

Bayes factors for nested models: the Savage–Dickey density ratio

  • When the two models are nested (e.g. the M1 is a special case of M2 in which which one specific parameter is set to zero - corresponding to the null hypothesis in classical NHST framework), the Bayes factor can be obtained by dividing the height of the posterior by the height of the prior at the point of interest (Dickey & Lientz, 1970)
  • Left plot: BF0,12; and right plot BF0,112 (thus implying BF1,02)

(see savage.dickey.bf() function in mlisi package)

48

Cautionary note about Bayes factors

  • Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
49

Cautionary note about Bayes factors

  • Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.

  • Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)

49

Cautionary note about Bayes factors

  • Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.

  • Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)

  • Bayes factors makes sense only if you have strong reasons for selecting a particular prior!

49

Cautionary note about Bayes factors

  • Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.

  • Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)

  • Bayes factors makes sense only if you have strong reasons for selecting a particular prior!

  • If not, other approaches may be preferable - e.g. Bayesian cross-validation, and information criterial like WAIC.

49

Cautionary note about Bayes factors

  • Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.

  • Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)

  • Bayes factors makes sense only if you have strong reasons for selecting a particular prior!

  • If not, other approaches may be preferable - e.g. Bayesian cross-validation, and information criterial like WAIC.

  • Never trust a paper that just throws Bayes factors there without specifying and motivating the priors!

49

Some important stuff we have not had time to cover:

  • Posterior predictive checks

  • Model comparison, WAIC, & Bayesian leave-one-out cross-validation

50

Useful resources

  • Stan User manual.

  • Stan forum: https://discourse.mc-stan.org.

  • Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Richard McElreath CRC Press, 2015.

51

Disclaimer



2
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow