There is no 'dark side' in statistics.
Image from 'Statistical rethinking', Richard McElreath.
Likelihood function: a mathematical function that gives the probability of observing the data, given specific values for the parameters of a statistical model.
The higher the value of the likelihood function for a set of parameters, the more likely it is that these parameters are the correct ones that explain our data.
Defined as the probability of the data, given some parameter values, usually notated as p(data∣parameters).
For a simple linear regression yi=β0+β1xi+ϵi ϵi∼N(0,σ2)
Defined as the probability of the data, given some parameter values, usually notated as p(data∣parameters).
For a simple linear regression yi=β0+β1xi+ϵi ϵi∼N(0,σ2)the likelihood is p(y,x⏟data∣β0,β1,σ2⏟parameters)=n∏i=1⏟productN(predicted value⏞β0+β1xi,σ2)⏟Gaussian probability density
In practice, we usually work with the logarithm of the likelihood function (log-likelihood)
L(y,x∣β0,β1,σ2)=log[p(y,x∣β0,β1,σ2)]=−n2log(2π)−n2log(σ2)−12σ2n∑i=1(yi−β0−β1xi)2⏟sum of squared residual errors
In practice, we usually work with the logarithm of the likelihood function (log-likelihood)
L(y,x∣β0,β1,σ2)=log[p(y,x∣β0,β1,σ2)]=−n2log(2π)−n2log(σ2)−12σ2n∑i=1(yi−β0−β1xi)2⏟sum of squared residual errors
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Uncertainty regarding the value of a parameter is not quantified using probabilities.
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Uncertainty regarding the value of a parameter is not quantified using probabilities.
Uncertainty in the estimates is expressed in relation to the data-generating process (think about the un-intuitive definition of confidence interval).
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Uncertainty regarding the value of a parameter is not quantified using probabilities.
Uncertainty in the estimates is expressed in relation to the data-generating process (think about the un-intuitive definition of confidence interval).
Probability is interpreted as the long-term frequency of an event occurring a very large (infinite) series of repeated trials or samples.
What is different in the Bayesian approach?
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.
This update is achieved through Bayes' theorem, however the Bayesian approach is not defined by the use Bayes' theorem itself (which is just an implication of the axioms of probability).
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.
This update is achieved through Bayes' theorem, however the Bayesian approach is not defined by the use Bayes' theorem itself (which is just an implication of the axioms of probability).
"The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis" (Gelman et al., 2013).
In a typical Bayesian setting we have:
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem: p(θ∣data)=p(data∣θ)p(θ)p(data)
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem:p(θ∣data)=p(data∣θ)p(θ)∫p(data∣θ)p(θ)dθ
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem: posterior=likelihood×prioraverage likelihood
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem: posterior∝likelihood×prior
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
As models become complex, this calculation gets challenging because it involves averaging (or marginalizing) over many parameters.
We often use computer simulations, like Markov Chain Monte Carlo (MCMC) sampling, to handle these calculations efficiently.
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
As models become complex, this calculation gets challenging because it involves averaging (or marginalizing) over many parameters.
We often use computer simulations, like Markov Chain Monte Carlo (MCMC) sampling, to handle these calculations efficiently.
After performing MCMC, we end up with 'samples' drawn from the posterior distribution. With these in hand, summarizing the the marginal posterior becomes simply a matter of 'counting' the samples.
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
accept/reject: generate a uniform number u in [0,1]
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
accept/reject: generate a uniform number u in [0,1]
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
accept/reject: generate a uniform number u in [0,1]
if u≤a, accept and set xi+1=x′ ("move to the new value")
if u>a, reject and set xi+1=xi ("stay where you are")
Implementing Metropolis algorithm to fit regression for one participant of the dataset sleepstudy
(from the lme4
package).
Custom function to compute the (log) likelihood, keeping the data fixed.
x <- sleepstudy$Days[sleepstudy$Subject=="308"]y <- sleepstudy$Reaction[sleepstudy$Subject=="308"]loglik <- function(par){ # parameter vector = [intercept, slope, log(SD)] pred <- par[1] + par[2]*x return(sum(dnorm(y, mean = pred, sd = exp(par[3]), log = TRUE)))}
Prior distribution about parameter values
Prior distribution about parameter values
Similar to the likelihood, we write a custom function that compute the joint log-prior probability
logprior <- function(par){ intercept_prior <- dnorm(par[1], mean=250, sd=180, log=TRUE) slope_prior <- dnorm(par[2], mean=20, sd=20, log=TRUE) sd_prior <- dnorm(par[3],mean=4, sd=1, log=TRUE) return(intercept_prior+slope_prior+sd_prior)}
Similar to the likelihood, we write a custom function that compute the joint log-prior probability
logprior <- function(par){ intercept_prior <- dnorm(par[1], mean=250, sd=180, log=TRUE) slope_prior <- dnorm(par[2], mean=20, sd=20, log=TRUE) sd_prior <- dnorm(par[3],mean=4, sd=1, log=TRUE) return(intercept_prior+slope_prior+sd_prior)}
Put together likelihood and prior to obtain the (log) posterior probability
logposterior <- function(par){ return (loglik(par) + logprior(par))}
Choose the arbitrary starting point and the proposal density g(xn+1|xn)
# initial parametersstartvalue <- c(250, 20, 5) # [intercept, slope, log(SD)]# proposal densityproposalfunction <- function(par){ return(rnorm(3, mean = par, sd= c(15,5,0.2)))}
This function execute iteratively the step we have seen before
run_metropolis_MCMC <- function(startvalue, iterations){ # set up an empty array to store smapled values chain <- array(dim = c(iterations+1,3)) # put starting values at top of arrays chain[1,] <- startvalue for (i in 1:iterations){ # draw a random proposal proposal <- proposalfunction(chain[i,]) # ratio of posterior density between new and old values a <- exp(logposterior(proposal) - logposterior(chain[i,])) # sample random number & accept/reject the parameter values if (runif(1) < a){ chain[i+1,] <- proposal }else{ chain[i+1,] <- chain[i,] } } return(chain)}
Run sampling for many iterations
chain <- run_metropolis_MCMC(startvalue, 20000)
Output:
head(chain)
## [,1] [,2] [,3]## [1,] 250.0000 20.00000 5.000000## [2,] 240.6032 20.91822 4.832874## [3,] 246.8228 13.21847 4.647161## [4,] 257.8977 16.09737 4.586083## [5,] 257.8977 16.09737 4.586083## [6,] 274.7716 15.87271 4.582845
Having the samples from the posterior distribution, summarising uncertainty is simply a matter of counting values.
Having the samples from the posterior distribution, summarising uncertainty is simply a matter of counting values.
We can calculate a 95% Bayesian credible interval for slope by taking the 2.5th and 97.5th percentiles of posterior samples
# remove initial 'burn in' samplesburnIn <- 5000slope_samples <- chain[-(1:burnIn),2]# mean of posterior distributionmean(slope_samples)
## [1] 21.75772
# 95% Bayesian credible intervalalpha <- 0.05round(c(quantile(slope_samples, probs = alpha/2), quantile(slope_samples, probs = 1-alpha/2)), digits=2)
## 2.5% 97.5% ## 10.82 33.27
The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.
Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.
The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.
Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.
Modern algorithm are more efficient, they use clever proposal methods to get a better picture of the posterior distribution with fewer samples.
The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.
Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.
Modern algorithm are more efficient, they use clever proposal methods to get a better picture of the posterior distribution with fewer samples.
A state-of-the-art algorithm is Hamiltonian Monte Carlo (HMC), which is used by the free software Stan (mc-stan.org).
Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that makes it easy to do Bayesian inference on complex models.
Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density.
Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that makes it easy to do Bayesian inference on complex models.
Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density.
Stan is free, open-source, has a comprehensive documentation and interfaces for the most popular computing environments (R, Matlab, Python, Mathematica)
Linear regression
Drift-diffusion model
Walking speed dataset:
d <- read.table ("../data/wagespeed.csv", header=T, sep=",")head(d)
## wage wspeed city## 1 15.59 73.83 Jakarta## 2 20.22 70.26 Sofia## 3 25.62 65.46 Bucharest## 4 32.78 65.36 Rio## 5 23.96 76.84 Guanzhou## 6 23.19 80.66 MexicoCity
Walking speed dataset:
d <- read.table ("../data/wagespeed.csv", header=T, sep=",")head(d)
## wage wspeed city## 1 15.59 73.83 Jakarta## 2 20.22 70.26 Sofia## 3 25.62 65.46 Bucharest## 4 32.78 65.36 Rio## 5 23.96 76.84 Guanzhou## 6 23.19 80.66 MexicoCity
# function to standardize variablesstandardize <- function(x){ Z <- (x - mean(x)) / sd(x) return(Z)}# prepare data to fit in Standata_stan <- list( N = nrow(d), wage = standardize(d$wage), wspeed = standardize(d$wspeed))# str(data_stan)
## List of 3## $ N : int 27## $ wage : num [1:27] -1.461 -1.308 -1.129 -0.892 -1.184 ...## $ wspeed: num [1:27] -1.06 -1.48 -2.05 -2.06 -0.71 ...
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
data block, in which variables are declared;
parameters block, in which we declare the parameters that we want to sample;
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
data block, in which variables are declared;
parameters block, in which we declare the parameters that we want to sample;
model block, containing the model specification (i.e. priors and likelihood)
The model block allows defining the model in terms of probability distribution
model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The model block allows defining the model in terms of probability distribution
model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
β0∼N(0,1)β1∼N(0.5,1)σ∼N[0,+∞)(0,1)
wspeed∼N(β0+β1wage,σ2)
library(rstan)options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available# run sampling: 4 chains in parallel on separate coreswspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)
library(rstan)options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available# run sampling: 4 chains in parallel on separate coreswspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)
Results
print(wspeed_fit)
## Inference for Stan model: anon_model.## 4 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=4000.## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat## beta[1] 0.00 0.00 0.14 -0.28 -0.09 0.00 0.09 0.26 3788 1## beta[2] 0.73 0.00 0.14 0.45 0.63 0.73 0.82 1.02 3496 1## sigma 0.72 0.00 0.10 0.55 0.64 0.71 0.78 0.95 3628 1## lp__ -4.01 0.03 1.24 -7.14 -4.62 -3.70 -3.09 -2.56 1987 1## ## Samples were drawn using NUTS(diag_e) at Sat Nov 25 17:22:37 2023.## For each parameter, n_eff is a crude measure of effective sample size,## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1).
library(rstan)options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available# run sampling: 4 chains in parallel on separate coreswspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)
Results
print(wspeed_fit)
## Inference for Stan model: anon_model.## 4 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=4000.## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat## beta[1] 0.00 0.00 0.14 -0.28 -0.09 0.00 0.09 0.26 3788 1## beta[2] 0.73 0.00 0.14 0.45 0.63 0.73 0.82 1.02 3496 1## sigma 0.72 0.00 0.10 0.55 0.64 0.71 0.78 0.95 3628 1## lp__ -4.01 0.03 1.24 -7.14 -4.62 -3.70 -3.09 -2.56 1987 1## ## Samples were drawn using NUTS(diag_e) at Sat Nov 25 17:22:37 2023.## For each parameter, n_eff is a crude measure of effective sample size,## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1).
The Rhat
( ˆR ) is a diagnostic index; at convergence ˆR≈1±0.1
(essentially ˆR is the ratio of between-chain to within-chain variance)
Visualize chains to check for convergence
traceplot(wspeed_fit, pars=c("beta","sigma"))
Visualize posterior distribution
pairs(wspeed_fit, pars=c("beta","sigma"))
Visualize model fit to the data
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).
Priors are assumptions, and should be criticized (e.g. try different assumptions to check how sensitive inference is to the choice of priors).
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).
Priors are assumptions, and should be criticized (e.g. try different assumptions to check how sensitive inference is to the choice of priors).
Priors cannot be exact, but should be defendable.
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
If using informative priors, these should be justified in the text when reporting the results (see examples in Gelman & Hennig, 2016, Beyond subjective and objective in statistics)
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
If using informative priors, these should be justified in the text when reporting the results (see examples in Gelman & Hennig, 2016, Beyond subjective and objective in statistics)
When in doubt (especially for complex models where the relationship between prior & observed data is not straighforward) is always a good idea to do a prior predictive check (i.e. sample parameter values from the priors, use it to simulate data, and see if the overall distribution looks sensible).
The DDM describes the process of decision making in terms of:
v
): Speed of information accumulation.a
): Threshold for making a decision.z
): Bias towards one of the responses.t0
): Processes outside decision making.We will estimate these parameters using Stan.
The distribution of 'first-passage times' is given by the Wiener distribution, which is already included in Stan math library: https://mc-stan.org/docs/functions-reference/wiener-first-passage-time-distribution.html
For a correct response (upper boundary):
rt∼Wiener(a,t0,z,v)
For an error (lower boundary):
rt∼Wiener(a,t0,1−z,−v)
data { int<lower=0> N; // Number of observations int<lower=0,upper=1> resp[N]; // Response: 1 or 0 real<lower=0> rt[N]; // Response time}parameters { real v; // Drift rate real<lower=0> a; // Boundary separation real<lower=0> t0; // Non-decision time real<lower=0, upper=1> z; // Starting point}model { // Priors v ~ normal(1, 3); a ~ normal(1, 1)T[0,2]; t0 ~ normal(0, 3)T[0,]; z ~ beta(2, 2); // Likelihood for (i in 1:N) { if (resp[i] == 1) { rt[i] ~ wiener(a, t0, z, v); } else { rt[i] ~ wiener(a, t0, 1 - z, -v); } }}
d <- read_csv("../data/ddm_data.csv") # load datad %>% # make a plot mutate(resp = factor(resp)) %>% ggplot(aes(x=rt, color=resp, fill=resp))+ geom_histogram(binwidth=0.05, color="white") + facet_grid(.~resp)
Format data as a list and run sampling
data_stan <- list( N = nrow(d), rt = d$rt, resp = d$resp)ddm_fit <- stan(file = "../ddm4p.stan", # ddm4p.stan is the file containing the model code data = data_stan, iter = 2000, chains = 4)
Results:
print(ddm_fit)
## Inference for Stan model: anon_model.## 4 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=4000.## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat## v 1.01 0.00 0.12 0.78 0.93 1.01 1.09 1.25 2524 1## a 0.98 0.00 0.02 0.94 0.97 0.98 1.00 1.02 2770 1## t0 0.25 0.00 0.00 0.24 0.25 0.25 0.25 0.26 2152 1## z 0.50 0.00 0.02 0.47 0.49 0.50 0.51 0.53 2600 1## lp__ 17.73 0.03 1.44 14.16 17.02 18.05 18.78 19.53 1708 1## ## Samples were drawn using NUTS(diag_e) at Fri Nov 24 13:44:19 2023.## For each parameter, n_eff is a crude measure of effective sample size,## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1).
Visualize chains to check for convergence
traceplot(ddm_fit, pars=c("v","a","t0","z"))
Visualize posterior distribution
pairs(ddm_fit, pars=c("v","a","t0","z"))
Visualize model fit to the data
Visualize marginal distribution & credible intervals
# Extract samples and convert to long format for plottingparameter_samples <- gather_draws(ddm_fit, v, a, t0, z)# Create violin plot with 95% HDI intervalsddm_fit %>% spread_draws(v, a, t0, z) %>% pivot_longer(cols = c(v, a, t0, z), names_to = "parameter", values_to = "value") %>% ggplot(aes(y = parameter, x = value)) + stat_halfeye(.width = .95,normalize="groups") + labs(x="value",y="parameter")
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1∣data)p(M2∣data)⏟posterior odds=p(M1)p(M2)⏟prior odds×∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2⏟Bayes factor
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1∣data)p(M2∣data)⏟posterior odds=p(M1)p(M2)⏟prior odds×∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2⏟Bayes factor
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1∣data)p(M2∣data)⏟posterior odds=p(M1)p(M2)⏟prior odds×∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2⏟Bayes factor
The Bayes factors only tells us how much - given the data and the prior - we need to update our relative beliefs between two models.
Say we found that BF1,2≈3, we can say that 'the data are ≈ 3 times more likely under model 1 than model 2'.
(see savage.dickey.bf()
function in mlisi
package)
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors makes sense only if you have strong reasons for selecting a particular prior!
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors makes sense only if you have strong reasons for selecting a particular prior!
If not, other approaches may be preferable - e.g. Bayesian cross-validation, and information criterial like WAIC.
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors makes sense only if you have strong reasons for selecting a particular prior!
If not, other approaches may be preferable - e.g. Bayesian cross-validation, and information criterial like WAIC.
Never trust a paper that just throws Bayes factors there without specifying and motivating the priors!
Posterior predictive checks
Model comparison, WAIC, & Bayesian leave-one-out cross-validation
Stan User manual.
Stan forum: https://discourse.mc-stan.org.
Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Richard McElreath CRC Press, 2015.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
There is no 'dark side' in statistics.
Image from 'Statistical rethinking', Richard McElreath.
Likelihood function: a mathematical function that gives the probability of observing the data, given specific values for the parameters of a statistical model.
The higher the value of the likelihood function for a set of parameters, the more likely it is that these parameters are the correct ones that explain our data.
Defined as the probability of the data, given some parameter values, usually notated as p(data∣parameters).
For a simple linear regression yi=β0+β1xi+ϵi ϵi∼N(0,σ2)
Defined as the probability of the data, given some parameter values, usually notated as p(data∣parameters).
For a simple linear regression yi=β0+β1xi+ϵi ϵi∼N(0,σ2)the likelihood is p(y,x⏟data∣β0,β1,σ2⏟parameters)=n∏i=1⏟productN(predicted value⏞β0+β1xi,σ2)⏟Gaussian probability density
In practice, we usually work with the logarithm of the likelihood function (log-likelihood)
L(y,x∣β0,β1,σ2)=log[p(y,x∣β0,β1,σ2)]=−n2log(2π)−n2log(σ2)−12σ2n∑i=1(yi−β0−β1xi)2⏟sum of squared residual errors
In practice, we usually work with the logarithm of the likelihood function (log-likelihood)
L(y,x∣β0,β1,σ2)=log[p(y,x∣β0,β1,σ2)]=−n2log(2π)−n2log(σ2)−12σ2n∑i=1(yi−β0−β1xi)2⏟sum of squared residual errors
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Uncertainty regarding the value of a parameter is not quantified using probabilities.
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Uncertainty regarding the value of a parameter is not quantified using probabilities.
Uncertainty in the estimates is expressed in relation to the data-generating process (think about the un-intuitive definition of confidence interval).
Maximum likelihood estimation is the backbone of statistical estimation in the frequentist inference.
Key aspects of frequentist inference:
Parameters are considered unknown fixed quantities.
The objective is to find best point-estimates for these parameters.
Uncertainty regarding the value of a parameter is not quantified using probabilities.
Uncertainty in the estimates is expressed in relation to the data-generating process (think about the un-intuitive definition of confidence interval).
Probability is interpreted as the long-term frequency of an event occurring a very large (infinite) series of repeated trials or samples.
What is different in the Bayesian approach?
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.
This update is achieved through Bayes' theorem, however the Bayesian approach is not defined by the use Bayes' theorem itself (which is just an implication of the axioms of probability).
What is different in the Bayesian approach?
Probability represents the degree of belief or confidence in the occurrence of an event or the truth of a proposition.
Parameters are treated as random variables with associated uncertainty, described by a prior probability distribution.
The objective is to refine the prior distribution using observed data, yielding an updated posterior probability distribution.
This update is achieved through Bayes' theorem, however the Bayesian approach is not defined by the use Bayes' theorem itself (which is just an implication of the axioms of probability).
"The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis" (Gelman et al., 2013).
In a typical Bayesian setting we have:
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem: p(θ∣data)=p(data∣θ)p(θ)p(data)
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem:p(θ∣data)=p(data∣θ)p(θ)∫p(data∣θ)p(θ)dθ
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem: posterior=likelihood×prioraverage likelihood
In a typical Bayesian setting we have:
likelihood p(data∣θ), giving probability of the data conditional on the parameter(s) θ;
prior p(θ), which formalizes a-priori belief about the plausibility of parameter values
posterior distribution, obtained by applying Bayes theorem: posterior∝likelihood×prior
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
As models become complex, this calculation gets challenging because it involves averaging (or marginalizing) over many parameters.
We often use computer simulations, like Markov Chain Monte Carlo (MCMC) sampling, to handle these calculations efficiently.
When we're interested in a specific parameter (e.g. the slope β1) we focus on its marginal posterior distribution. This allows summarizing information about parameters in useful ways (for example: Bayesian credible intervals).
The marginal posterior distribution is found by integrating over other parameters in the posterior - think of this as averaging out the effects of all other parameters to focus only on β1.
formally: p(β1∣data)=∫∫p(β1,β0,σ∣data)dβ0dσ).
As models become complex, this calculation gets challenging because it involves averaging (or marginalizing) over many parameters.
We often use computer simulations, like Markov Chain Monte Carlo (MCMC) sampling, to handle these calculations efficiently.
After performing MCMC, we end up with 'samples' drawn from the posterior distribution. With these in hand, summarizing the the marginal posterior becomes simply a matter of 'counting' the samples.
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
accept/reject: generate a uniform number u in [0,1]
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
accept/reject: generate a uniform number u in [0,1]
Say you want to sample from a probability distribution p(x) (e.g. the posterior), but you can only evaluate a function f(x) that is proportional to the density p(x) (e.g. the product of prior and likelihood).
Initialization:
choose arbitrary starting point x0
choose a probability distribution to generate proposals g(xn+1|xn) (proposal density)
For each iteration i:
generate a candidate x′ sampling from g(xi+1|xi)
calculate the acceptance ratio a=f(x′)f(xi)
accept/reject: generate a uniform number u in [0,1]
if u≤a, accept and set xi+1=x′ ("move to the new value")
if u>a, reject and set xi+1=xi ("stay where you are")
Implementing Metropolis algorithm to fit regression for one participant of the dataset sleepstudy
(from the lme4
package).
Custom function to compute the (log) likelihood, keeping the data fixed.
x <- sleepstudy$Days[sleepstudy$Subject=="308"]y <- sleepstudy$Reaction[sleepstudy$Subject=="308"]loglik <- function(par){ # parameter vector = [intercept, slope, log(SD)] pred <- par[1] + par[2]*x return(sum(dnorm(y, mean = pred, sd = exp(par[3]), log = TRUE)))}
Prior distribution about parameter values
Prior distribution about parameter values
Similar to the likelihood, we write a custom function that compute the joint log-prior probability
logprior <- function(par){ intercept_prior <- dnorm(par[1], mean=250, sd=180, log=TRUE) slope_prior <- dnorm(par[2], mean=20, sd=20, log=TRUE) sd_prior <- dnorm(par[3],mean=4, sd=1, log=TRUE) return(intercept_prior+slope_prior+sd_prior)}
Similar to the likelihood, we write a custom function that compute the joint log-prior probability
logprior <- function(par){ intercept_prior <- dnorm(par[1], mean=250, sd=180, log=TRUE) slope_prior <- dnorm(par[2], mean=20, sd=20, log=TRUE) sd_prior <- dnorm(par[3],mean=4, sd=1, log=TRUE) return(intercept_prior+slope_prior+sd_prior)}
Put together likelihood and prior to obtain the (log) posterior probability
logposterior <- function(par){ return (loglik(par) + logprior(par))}
Choose the arbitrary starting point and the proposal density g(xn+1|xn)
# initial parametersstartvalue <- c(250, 20, 5) # [intercept, slope, log(SD)]# proposal densityproposalfunction <- function(par){ return(rnorm(3, mean = par, sd= c(15,5,0.2)))}
This function execute iteratively the step we have seen before
run_metropolis_MCMC <- function(startvalue, iterations){ # set up an empty array to store smapled values chain <- array(dim = c(iterations+1,3)) # put starting values at top of arrays chain[1,] <- startvalue for (i in 1:iterations){ # draw a random proposal proposal <- proposalfunction(chain[i,]) # ratio of posterior density between new and old values a <- exp(logposterior(proposal) - logposterior(chain[i,])) # sample random number & accept/reject the parameter values if (runif(1) < a){ chain[i+1,] <- proposal }else{ chain[i+1,] <- chain[i,] } } return(chain)}
Run sampling for many iterations
chain <- run_metropolis_MCMC(startvalue, 20000)
Output:
head(chain)
## [,1] [,2] [,3]## [1,] 250.0000 20.00000 5.000000## [2,] 240.6032 20.91822 4.832874## [3,] 246.8228 13.21847 4.647161## [4,] 257.8977 16.09737 4.586083## [5,] 257.8977 16.09737 4.586083## [6,] 274.7716 15.87271 4.582845
Having the samples from the posterior distribution, summarising uncertainty is simply a matter of counting values.
Having the samples from the posterior distribution, summarising uncertainty is simply a matter of counting values.
We can calculate a 95% Bayesian credible interval for slope by taking the 2.5th and 97.5th percentiles of posterior samples
# remove initial 'burn in' samplesburnIn <- 5000slope_samples <- chain[-(1:burnIn),2]# mean of posterior distributionmean(slope_samples)
## [1] 21.75772
# 95% Bayesian credible intervalalpha <- 0.05round(c(quantile(slope_samples, probs = alpha/2), quantile(slope_samples, probs = 1-alpha/2)), digits=2)
## 2.5% 97.5% ## 10.82 33.27
The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.
Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.
The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.
Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.
Modern algorithm are more efficient, they use clever proposal methods to get a better picture of the posterior distribution with fewer samples.
The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions.
Proposals are random and that can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases.
Modern algorithm are more efficient, they use clever proposal methods to get a better picture of the posterior distribution with fewer samples.
A state-of-the-art algorithm is Hamiltonian Monte Carlo (HMC), which is used by the free software Stan (mc-stan.org).
Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that makes it easy to do Bayesian inference on complex models.
Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density.
Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that makes it easy to do Bayesian inference on complex models.
Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density.
Stan is free, open-source, has a comprehensive documentation and interfaces for the most popular computing environments (R, Matlab, Python, Mathematica)
Linear regression
Drift-diffusion model
Walking speed dataset:
d <- read.table ("../data/wagespeed.csv", header=T, sep=",")head(d)
## wage wspeed city## 1 15.59 73.83 Jakarta## 2 20.22 70.26 Sofia## 3 25.62 65.46 Bucharest## 4 32.78 65.36 Rio## 5 23.96 76.84 Guanzhou## 6 23.19 80.66 MexicoCity
Walking speed dataset:
d <- read.table ("../data/wagespeed.csv", header=T, sep=",")head(d)
## wage wspeed city## 1 15.59 73.83 Jakarta## 2 20.22 70.26 Sofia## 3 25.62 65.46 Bucharest## 4 32.78 65.36 Rio## 5 23.96 76.84 Guanzhou## 6 23.19 80.66 MexicoCity
# function to standardize variablesstandardize <- function(x){ Z <- (x - mean(x)) / sd(x) return(Z)}# prepare data to fit in Standata_stan <- list( N = nrow(d), wage = standardize(d$wage), wspeed = standardize(d$wspeed))# str(data_stan)
## List of 3## $ N : int 27## $ wage : num [1:27] -1.461 -1.308 -1.129 -0.892 -1.184 ...## $ wspeed: num [1:27] -1.06 -1.48 -2.05 -2.06 -0.71 ...
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
data block, in which variables are declared;
parameters block, in which we declare the parameters that we want to sample;
Model code in stan
language:
data { int<lower=0> N; vector[N] wspeed; vector[N] wage;}parameters { real beta[2]; real<lower=0> sigma;}model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The code contain 3 blocks:
data block, in which variables are declared;
parameters block, in which we declare the parameters that we want to sample;
model block, containing the model specification (i.e. priors and likelihood)
The model block allows defining the model in terms of probability distribution
model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
The model block allows defining the model in terms of probability distribution
model { // priors beta[1] ~ normal(0, 1); beta[2] ~ normal(0.5, 1); sigma ~ normal(0,1) T[0,]; // likelihood wspeed ~ normal(beta[1] + beta[2]*wage, sigma);}
β0∼N(0,1)β1∼N(0.5,1)σ∼N[0,+∞)(0,1)
wspeed∼N(β0+β1wage,σ2)
library(rstan)options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available# run sampling: 4 chains in parallel on separate coreswspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)
library(rstan)options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available# run sampling: 4 chains in parallel on separate coreswspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)
Results
print(wspeed_fit)
## Inference for Stan model: anon_model.## 4 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=4000.## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat## beta[1] 0.00 0.00 0.14 -0.28 -0.09 0.00 0.09 0.26 3788 1## beta[2] 0.73 0.00 0.14 0.45 0.63 0.73 0.82 1.02 3496 1## sigma 0.72 0.00 0.10 0.55 0.64 0.71 0.78 0.95 3628 1## lp__ -4.01 0.03 1.24 -7.14 -4.62 -3.70 -3.09 -2.56 1987 1## ## Samples were drawn using NUTS(diag_e) at Sat Nov 25 17:22:37 2023.## For each parameter, n_eff is a crude measure of effective sample size,## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1).
library(rstan)options(mc.cores = parallel::detectCores()) # indicate stan to use multiple cores if available# run sampling: 4 chains in parallel on separate coreswspeed_fit <- stan(file = "wspeed_model.stan", data = data_stan, iter = 2000, chains = 4)
Results
print(wspeed_fit)
## Inference for Stan model: anon_model.## 4 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=4000.## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat## beta[1] 0.00 0.00 0.14 -0.28 -0.09 0.00 0.09 0.26 3788 1## beta[2] 0.73 0.00 0.14 0.45 0.63 0.73 0.82 1.02 3496 1## sigma 0.72 0.00 0.10 0.55 0.64 0.71 0.78 0.95 3628 1## lp__ -4.01 0.03 1.24 -7.14 -4.62 -3.70 -3.09 -2.56 1987 1## ## Samples were drawn using NUTS(diag_e) at Sat Nov 25 17:22:37 2023.## For each parameter, n_eff is a crude measure of effective sample size,## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1).
The Rhat
( ˆR ) is a diagnostic index; at convergence ˆR≈1±0.1
(essentially ˆR is the ratio of between-chain to within-chain variance)
Visualize chains to check for convergence
traceplot(wspeed_fit, pars=c("beta","sigma"))
Visualize posterior distribution
pairs(wspeed_fit, pars=c("beta","sigma"))
Visualize model fit to the data
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).
Priors are assumptions, and should be criticized (e.g. try different assumptions to check how sensitive inference is to the choice of priors).
Priors represent our initial beliefs about the plausibility of parameter values.
Unlike likelihoods, there are no conventional priors; choice depends on context.
Priors are subjective but should be explicitly stated for transparency and critique (more transparent than, say, dropping outliers).
Priors are assumptions, and should be criticized (e.g. try different assumptions to check how sensitive inference is to the choice of priors).
Priors cannot be exact, but should be defendable.
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
If using informative priors, these should be justified in the text when reporting the results (see examples in Gelman & Hennig, 2016, Beyond subjective and objective in statistics)
There are no non-informative priors: flat priors such as Uniform(−∞,∞) tells the model that unrealistic values (e.g. extremely large) are as plausible as realistic ones.
Priors should restrict parameter space to values that make sense.
Preferably choose weakly informative priors containg less information than what might be known (but still guide the model sensibly).
Priors should provide some regularization, i.e. add some skepticism, akin to penalized likelihood methods in frequentist statistics (e.g. LASSO or ridge regression), preventing overfitting to data.
Prior recommendations: https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
If using informative priors, these should be justified in the text when reporting the results (see examples in Gelman & Hennig, 2016, Beyond subjective and objective in statistics)
When in doubt (especially for complex models where the relationship between prior & observed data is not straighforward) is always a good idea to do a prior predictive check (i.e. sample parameter values from the priors, use it to simulate data, and see if the overall distribution looks sensible).
The DDM describes the process of decision making in terms of:
v
): Speed of information accumulation.a
): Threshold for making a decision.z
): Bias towards one of the responses.t0
): Processes outside decision making.We will estimate these parameters using Stan.
The distribution of 'first-passage times' is given by the Wiener distribution, which is already included in Stan math library: https://mc-stan.org/docs/functions-reference/wiener-first-passage-time-distribution.html
For a correct response (upper boundary):
rt∼Wiener(a,t0,z,v)
For an error (lower boundary):
rt∼Wiener(a,t0,1−z,−v)
data { int<lower=0> N; // Number of observations int<lower=0,upper=1> resp[N]; // Response: 1 or 0 real<lower=0> rt[N]; // Response time}parameters { real v; // Drift rate real<lower=0> a; // Boundary separation real<lower=0> t0; // Non-decision time real<lower=0, upper=1> z; // Starting point}model { // Priors v ~ normal(1, 3); a ~ normal(1, 1)T[0,2]; t0 ~ normal(0, 3)T[0,]; z ~ beta(2, 2); // Likelihood for (i in 1:N) { if (resp[i] == 1) { rt[i] ~ wiener(a, t0, z, v); } else { rt[i] ~ wiener(a, t0, 1 - z, -v); } }}
d <- read_csv("../data/ddm_data.csv") # load datad %>% # make a plot mutate(resp = factor(resp)) %>% ggplot(aes(x=rt, color=resp, fill=resp))+ geom_histogram(binwidth=0.05, color="white") + facet_grid(.~resp)
Format data as a list and run sampling
data_stan <- list( N = nrow(d), rt = d$rt, resp = d$resp)ddm_fit <- stan(file = "../ddm4p.stan", # ddm4p.stan is the file containing the model code data = data_stan, iter = 2000, chains = 4)
Results:
print(ddm_fit)
## Inference for Stan model: anon_model.## 4 chains, each with iter=2000; warmup=1000; thin=1; ## post-warmup draws per chain=1000, total post-warmup draws=4000.## ## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat## v 1.01 0.00 0.12 0.78 0.93 1.01 1.09 1.25 2524 1## a 0.98 0.00 0.02 0.94 0.97 0.98 1.00 1.02 2770 1## t0 0.25 0.00 0.00 0.24 0.25 0.25 0.25 0.26 2152 1## z 0.50 0.00 0.02 0.47 0.49 0.50 0.51 0.53 2600 1## lp__ 17.73 0.03 1.44 14.16 17.02 18.05 18.78 19.53 1708 1## ## Samples were drawn using NUTS(diag_e) at Fri Nov 24 13:44:19 2023.## For each parameter, n_eff is a crude measure of effective sample size,## and Rhat is the potential scale reduction factor on split chains (at ## convergence, Rhat=1).
Visualize chains to check for convergence
traceplot(ddm_fit, pars=c("v","a","t0","z"))
Visualize posterior distribution
pairs(ddm_fit, pars=c("v","a","t0","z"))
Visualize model fit to the data
Visualize marginal distribution & credible intervals
# Extract samples and convert to long format for plottingparameter_samples <- gather_draws(ddm_fit, v, a, t0, z)# Create violin plot with 95% HDI intervalsddm_fit %>% spread_draws(v, a, t0, z) %>% pivot_longer(cols = c(v, a, t0, z), names_to = "parameter", values_to = "value") %>% ggplot(aes(y = parameter, x = value)) + stat_halfeye(.width = .95,normalize="groups") + labs(x="value",y="parameter")
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1∣data)p(M2∣data)⏟posterior odds=p(M1)p(M2)⏟prior odds×∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2⏟Bayes factor
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1∣data)p(M2∣data)⏟posterior odds=p(M1)p(M2)⏟prior odds×∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2⏟Bayes factor
A Bayesian approach to hypothesis testing & model comparison.
Suppose we have 2 competing models for data: model M1 with parameters θ1 vs. model M2 with parameters θ2.
The Bayes factor is the ratio of the marginal (average) likelihoods of the two modelsBF1,2=p1(data∣M1)p1(data∣ M1)=∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2
The Bayes factor alone does not tell which model is more likely! This is quantified by the posterior odds, which takes also the prior odds into account: p(M1∣data)p(M2∣data)⏟posterior odds=p(M1)p(M2)⏟prior odds×∫p1(data∣θ1)p1(θ1)dθ1∫p2(data∣θ2)p2(θ2)dθ2⏟Bayes factor
The Bayes factors only tells us how much - given the data and the prior - we need to update our relative beliefs between two models.
Say we found that BF1,2≈3, we can say that 'the data are ≈ 3 times more likely under model 1 than model 2'.
(see savage.dickey.bf()
function in mlisi
package)
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors makes sense only if you have strong reasons for selecting a particular prior!
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors makes sense only if you have strong reasons for selecting a particular prior!
If not, other approaches may be preferable - e.g. Bayesian cross-validation, and information criterial like WAIC.
Bayes factors represents the update in beliefs from prior to posterior, and therefore are strongly dependent on the prior used.
Bayes factors are "(...) very sensitive to features of the prior that have almost no effect on the posterior. With hundreds of data points, the difference between a normal(0, 1) and normal(0, 100) prior is negligible if the true value is in the range (-3, 3), but it can have a huge effect on Bayes factors." (Bob Carpenter, post link)
Bayes factors makes sense only if you have strong reasons for selecting a particular prior!
If not, other approaches may be preferable - e.g. Bayesian cross-validation, and information criterial like WAIC.
Never trust a paper that just throws Bayes factors there without specifying and motivating the priors!
Posterior predictive checks
Model comparison, WAIC, & Bayesian leave-one-out cross-validation
Stan User manual.
Stan forum: https://discourse.mc-stan.org.
Statistical Rethinking: A Bayesian Course with Examples in R and Stan, Richard McElreath CRC Press, 2015.