<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts | Matteo Lisi</title>
    <link>http://mlisi.xyz/post/</link>
      <atom:link href="http://mlisi.xyz/post/index.xml" rel="self" type="application/rss+xml" />
    <description>Posts</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© 2024 Matteo Lisi</copyright><lastBuildDate>Sat, 26 Sep 2020 00:00:00 +0000</lastBuildDate>
    <image>
      <url>http://mlisi.xyz/img/shademe.png</url>
      <title>Posts</title>
      <link>http://mlisi.xyz/post/</link>
    </image>
    
    <item>
      <title>How much is this game worth?</title>
      <link>http://mlisi.xyz/post/question-interview/</link>
      <pubDate>Sat, 26 Sep 2020 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/question-interview/</guid>
      <description>


&lt;p&gt;This question was posed during an interview for an AI / data science position at a global financial firm:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Consider a game of chance in which a player can roll a die up to 3 times. They win an amount of money proportional to the outcome of their last roll (1, 2, 3, 4, 5, or 6 £). They need not use all 3 throws: they can stop earlier and collect their winnings. You are the house in this game; what is the minimum amount of £ that you can charge for playing this game such that you won’t take losses in the long term?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;
 
&lt;/p&gt;
&lt;details&gt;
&lt;p&gt;&lt;summary&gt;&lt;mark&gt; &lt;strong&gt;Answer by Oliver Perkins.&lt;/strong&gt; &lt;/mark&gt;&lt;/summary&gt;&lt;/p&gt;
&lt;p&gt;Oliver Perkins pointed out that this can be calculated working backwards from the last throw:&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-theme=&#34;dark&#34;&gt;
&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;
£4.65? &lt;br&gt;&lt;br&gt;As the punter on roll 3, E(£) = 3.5, so we stick on 4+ on throw 2. Knowing this our EV for the final 2 throws is (0.5*3.5)+(0.5*5)=4.25. Therefore we stick on 5+ on the first throw so E(£) = (0.667*4.25)+(0.333*5.5) = ~4.64
&lt;/p&gt;
— Oli Perkins 🔥🌍🏳️
🌈 (&lt;span class=&#34;citation&#34;&gt;@OliPerkins2&lt;/span&gt;) &lt;a href=&#34;https://twitter.com/OliPerkins2/status/1310254755424464898?ref_src=twsrc%5Etfw&#34;&gt;September 27, 2020&lt;/a&gt;
&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;p&gt;His solution makes perfect sense and differs from my initial one in that the player would accept a 4 at throw 2 (I have now updated my answer to take that into account). This small difference in strategy is rational, since it increases the overall expected value of the game by &lt;span class=&#34;math inline&#34;&gt;\(\approx\)&lt;/span&gt; 0.05 relative to a strategy in which the player does not settle for anything less than 5 at any throw.&lt;/p&gt;
&lt;/details&gt;
&lt;p&gt;
 
&lt;/p&gt;
&lt;details&gt;
&lt;p&gt;&lt;summary&gt;&lt;mark&gt; &lt;strong&gt;Alternative approach.&lt;/strong&gt; &lt;/mark&gt;&lt;/summary&gt;&lt;/p&gt;
&lt;p&gt;
 
&lt;/p&gt;
&lt;p&gt;Let’s denote by &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{C}\)&lt;/span&gt; the price that the player pays to play the game, and by &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{W}\)&lt;/span&gt; the amount they win. A possible strategy could be to keep playing until the winnings are &lt;em&gt;at least&lt;/em&gt; equal to the price, i.e. until &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{W} \ge \mathcal{C}\)&lt;/span&gt; (and perhaps continue after that, if continuing increases the expected win).&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Assume the player continues until they get a 6. Let &lt;span class=&#34;math inline&#34;&gt;\(D_1, D_2, D_3\)&lt;/span&gt; denote the outcomes of throws 1, 2, and 3, respectively. The probability of getting a 6 is&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
p\left(\mathcal{W} =6\right) = \frac{1}{6} + \underbrace{\left( 1 - \frac{1}{6} \right) \times  \frac{1}{6}}_{p \left(D_1&amp;lt;6, \, D_2=6\right)} +  \underbrace{\left( 1 - \frac{1}{6} \right)^2 \times  \frac{1}{6}}_{p \left(D_1&amp;lt;6, \, D_2&amp;lt;6, \, D_3=6\right)} = \frac{91}{216} \approx 0.42
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;If instead the player aims for at least &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt;, the probability of getting it is&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
p\left(\mathcal{W} \ge 5 \right) = \frac{2}{6} + \underbrace{\left( 1 - \frac{2}{6} \right) \times  \frac{2}{6}}_{p \left(D_1&amp;lt;5, \, D_2 \ge 5\right)} +  \underbrace{\left( 1 - \frac{2}{6} \right)^2 \times  \frac{2}{6}}_{p \left(D_1&amp;lt;5, \, D_2&amp;lt;5, \, D_3 \ge 5\right)} = \frac{19}{27} \approx 0.70
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Say the player obtains a 5 on the first throw. Is it worth continuing?&lt;/p&gt;
&lt;p&gt;The probability of getting at least &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt; in the next two throws is &lt;span class=&#34;math inline&#34;&gt;\(\frac{2}{6} + \left(1 - \frac{2}{6} \right)\times \frac{2}{6} = \frac{20}{36} \approx 0.55\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The probability of getting a &lt;span class=&#34;math inline&#34;&gt;\(6\)&lt;/span&gt; in the next two throws is &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{6} + \left(1 - \frac{1}{6} \right)\times \frac{1}{6} = \frac{11}{36} \approx 0.30\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;If the player reaches the last throw without having obtained at least &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt;, the remaining outcomes, &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(4\)&lt;/span&gt;, are equally likely, each with probability &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{4}\)&lt;/span&gt;. Thus continuing after obtaining &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt; at the first throw has an expected value of &lt;span class=&#34;math display&#34;&gt;\[\frac{11}{36}\times6 + \left(\frac{20}{36} - \frac{11}{36}\right)\times 5 + \left(1 - \frac{20}{36}\right)\times \sum_{i=1}^4 \frac{1}{4}i \approx 4.19\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;The value of stopping after having obtained &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt; is &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt;; thus the player should stop rather than continue.&lt;/p&gt;
&lt;hr /&gt;
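As a numerical cross-check of the probabilities and the continuation value above, here is a minimal Python sketch (the variable names are mine, not from the post):

```python
from fractions import Fraction

# P(getting a 6 within 3 throws, rolling again whenever the face is not a 6)
p6 = Fraction(1, 6)
p_six_in_three = p6 + (1 - p6) * p6 + (1 - p6) ** 2 * p6

# P(getting at least a 5 within 3 throws, stopping on a 5 or a 6)
p5 = Fraction(2, 6)
p_five_in_three = p5 + (1 - p5) * p5 + (1 - p5) ** 2 * p5

# expected value of giving up a 5 on throw 1 and rolling up to twice more:
# a 6 somewhere in two throws, else a 5, else the last throw is uniform on 1-4
p_six_in_two = Fraction(11, 36)    # 1/6 + (5/6)(1/6)
p_five_in_two = Fraction(20, 36)   # 2/6 + (4/6)(2/6)
ev_continue = (p_six_in_two * 6
               + (p_five_in_two - p_six_in_two) * 5
               + (1 - p_five_in_two) * Fraction(1 + 2 + 3 + 4, 4))

print(p_six_in_three, p_five_in_three, float(ev_continue))
```

Exact rational arithmetic confirms 91/216, 19/27, and a continuation value of 151/36, which is below 5.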
&lt;p&gt;
 
&lt;/p&gt;
&lt;div id=&#34;expected-value-of-the-game&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Expected value of the game&lt;/h2&gt;
&lt;p&gt;If &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{C}=5\)&lt;/span&gt; we have that &lt;span class=&#34;math inline&#34;&gt;\(p\left(\mathcal{W} &amp;gt; 5 \right) = \frac{91}{216} \approx 0.42\)&lt;/span&gt;, and also that &lt;span class=&#34;math inline&#34;&gt;\(p\left(\mathcal{W} &amp;gt; 4 \right) = \frac{19}{27} \approx 0.70\)&lt;/span&gt;. Thus in the long run the player will win more than they paid about &lt;span class=&#34;math inline&#34;&gt;\(42\)&lt;/span&gt;% of the time, will be even with the house &lt;span class=&#34;math inline&#34;&gt;\(28\)&lt;/span&gt;% of the time, and will lose money (obtaining any number from &lt;span class=&#34;math inline&#34;&gt;\(1\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(4\)&lt;/span&gt;, with equal probability) about &lt;span class=&#34;math inline&#34;&gt;\(30\)&lt;/span&gt;% of the time.&lt;/p&gt;
&lt;p&gt;To see if this is a good deal we calculate the expected value assuming the player continues until they either obtain at least a &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt; or complete the 3 throws. The probability of getting at least 5 in three throws is &lt;span class=&#34;math inline&#34;&gt;\(\frac{19}{27}\)&lt;/span&gt;, and conditional on getting at least 5 the two outcomes 6 and 5 are equally likely, each with probability &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{2}\)&lt;/span&gt;. For a price of &lt;span class=&#34;math inline&#34;&gt;\(5\)&lt;/span&gt; a player who follows this strategy is expected to incur a loss in the long run, since the expected value is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\underbrace{\frac{19}{27}\frac{1}{2}\times6 + \frac{19}{27}\frac{1}{2}\times5 + \left(1 - \frac{19}{27}\right) \frac{1}{4}\times \sum_{i=1}^4 i}_{\text{expected value if player keep playing until } \mathcal{W} \ge 5} = \frac{83}{18} \approx 4.61\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Now, say the player obtains a 4 at the second throw; should they keep it? Yes, since the expected value of the last throw is &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{6}\sum_{i=1}^6 i=3.5\)&lt;/span&gt;. To take this into account we need a slightly different calculation in which the outcomes are considered separately for each throw (the underbraces indicate the acceptable scores at each throw):&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\underbrace{\frac{2}{6}\frac{1}{2}\times \left(5+6\right)}_{\text{5 or 6 in 1st throw}} + 
\underbrace{\left(1-\frac{2}{6}\right)\frac{3}{6}\frac{1}{3}\times \left(4+5+6\right)}_{\text{4, 5 or 6 in 2nd throw}}
+ \underbrace{\left(1-\frac{2}{6}\right)\frac{3}{6} \frac{1}{6}\times \sum_{i=1}^6 i}_{\text{any number in last throw}} = \frac{14}{3} \approx 4.67
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;mark&gt; Thus, the house needs to set a price &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{C}\)&lt;/span&gt; of at least &lt;span class=&#34;math inline&#34;&gt;\(\frac{14}{3}\approx 4.67\)&lt;/span&gt;, otherwise it risks incurring losses in the long run. &lt;/mark&gt;&lt;/p&gt;
&lt;/details&gt;
&lt;/div&gt;
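The two expected values above (83/18 for a player who only ever stops on 5 or more, and 14/3 for the optimal strategy) can also be obtained mechanically by backward induction over the number of throws left. A short Python sketch (my own illustration, not part of the original post):

```python
from fractions import Fraction

faces = [Fraction(f) for f in range(1, 7)]

def optimal_value(throws_left):
    """Expected win with `throws_left` throws remaining, stopping optimally."""
    if throws_left == 1:
        return sum(faces) / 6                       # E = 7/2 on the last throw
    cont = optimal_value(throws_left - 1)
    # keep the current face only if it beats the value of rolling again
    return sum(max(f, cont) for f in faces) / 6

def threshold_value(throws_left, threshold=5):
    """Expected win if the player only ever stops on `threshold` or more."""
    if throws_left == 1:
        return sum(faces) / 6
    cont = threshold_value(throws_left - 1, threshold)
    return sum(f if f >= threshold else cont for f in faces) / 6

print(optimal_value(2))       # 17/4  = 4.25: stick on 4+ at throw 2
print(optimal_value(3))       # 14/3  ~ 4.67: stick on 5+ at throw 1
print(threshold_value(3))     # 83/18 ~ 4.61: always waiting for a 5 or 6
```

The recursion makes the "stick on 4 at throw 2" point explicit: with one throw left the continuation value is 3.5, so any face of 4 or more is worth keeping.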
</description>
    </item>
    
    <item>
      <title>Setting Kazam to correctly record full-screen on HiDPI displays in Ubuntu</title>
      <link>http://mlisi.xyz/post/kazam/</link>
      <pubDate>Thu, 11 Jun 2020 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/kazam/</guid>
      <description>


&lt;p&gt;In this new covid-19 world I find myself needing to record full-screen videos more and more often, for example for lectures. This is something one can do live with Zoom, but Zoom is not the most practical option for non-live recordings.&lt;/p&gt;
&lt;p&gt;In Ubuntu there is a nice piece of software, the Kazam screencaster, that is perfect for the job, except that it does not correctly detect the screen size if you have a high pixel density (HiDPI) display: you end up with a video cropped to only the top-left corner of the screen.&lt;/p&gt;
&lt;p&gt;There is a simple patch to fix that issue, which I describe here in case it’s useful to someone else and for the benefit of my future self.&lt;/p&gt;
&lt;p&gt;First, you need to find the files &lt;code&gt;gstreamer.py&lt;/code&gt; and &lt;code&gt;prefs.py&lt;/code&gt; in the Kazam installation. For me they were in &lt;code&gt;/usr/lib/python3/dist-packages/kazam/backend/&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Next, you have to patch these files so that they take into account the screen scaling factor, which is obtained from the &lt;code&gt;get_monitor_scale_factor&lt;/code&gt; function in the Gdk library.&lt;/p&gt;
&lt;p&gt;This can be done by adding these lines to the file &lt;code&gt;gstreamer.py&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt; scale = self.video_source[&amp;#39;scale&amp;#39;]
 startx = startx * scale 
 starty = starty * scale 
 endx = endx * scale 
 endy = endy * scale &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They should be added around lines 120 or so, right after the properties &lt;code&gt;endx&lt;/code&gt; and &lt;code&gt;endy&lt;/code&gt; are set up (&lt;code&gt;endy = starty + height - 1&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Next, open the file &lt;code&gt;prefs.py&lt;/code&gt; and, around line 324, change this bit&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;for i in range(self.default_screen.get_n_monitors()):
   rect = self.default_screen.get_monitor_geometry(i)
   self.logger.debug(&amp;quot;  Monitor {0} - X: {1}, Y: {2}, W: {3}, H: {4}&amp;quot;.format(i,
                                      rect.x,
                                      rect.y,
                                      rect.width,
                                      rect.height))

   self.screens.append({&amp;quot;x&amp;quot;: rect.x,
                        &amp;quot;y&amp;quot;: rect.y,
                        &amp;quot;width&amp;quot;: rect.width,
                        &amp;quot;height&amp;quot;: rect.height})&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;into this&lt;/p&gt;
&lt;pre class=&#34;python&#34;&gt;&lt;code&gt;for i in range(self.default_screen.get_n_monitors()):
   rect = self.default_screen.get_monitor_geometry(i)
   scale = self.default_screen.get_monitor_scale_factor(i)

   self.logger.debug(&amp;quot;  Monitor {0} - X: {1}, Y: {2}, W: {3}, H: {4}, scale: {5}&amp;quot;.format(i,
                                      rect.x,
                                      rect.y,
                                      rect.width,
                                      rect.height,
                                      scale))

   self.screens.append({&amp;quot;x&amp;quot;: rect.x,
                        &amp;quot;y&amp;quot;: rect.y,
                        &amp;quot;width&amp;quot;: rect.width,
                        &amp;quot;height&amp;quot;: rect.height,
                        &amp;quot;scale&amp;quot;: scale})&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it! Restart Kazam and the next fullscreen recording should work OK.&lt;/p&gt;
&lt;p&gt;Thanks to user sllorente for describing the patch &lt;a href=&#34;https://bugs.launchpad.net/ubuntu/+bug/1283424&#34;&gt;here&lt;/a&gt;!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Defending the null hypothesis</title>
      <link>http://mlisi.xyz/post/defending-the-null/</link>
      <pubDate>Mon, 09 Sep 2019 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/defending-the-null/</guid>
      <description>&lt;p&gt;A friend was working on a paper and found himself in the situation of having to defend the null hypothesis that a particular effect is absent (or not measurable) when tested under more controlled conditions than those used in previous studies. He asked for some practical advice: &lt;em&gt;&amp;ldquo;what would convince you as as a reviewer of a null result?&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No statistical test can &amp;ldquo;prove&amp;rdquo; a null result (intended as the point-null hypothesis that an effect of interest is zero). You can however: (&lt;strong&gt;i&lt;/strong&gt;) present evidence that the data are more likely under the null hypothesis than under the alternative; or (&lt;strong&gt;ii&lt;/strong&gt;) put a cap on the size of the effect, which could enable you to argue that any effect, if present, is so small that it can be considered theoretically or pragmatically irrelevant.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;(&lt;strong&gt;i&lt;/strong&gt;) is the Bayesian approach and requires calculating a Bayes factor, that is, the ratio between the average (or marginal) likelihoods of the data under the null and the alternative hypothesis. Note that Bayes factor calculations are highly influenced by the priors (e.g. the prior expectations about the effect size). Luckily, in the case of a single comparison (e.g. a t-test), there is a popular way of computing Bayes factors which requires minimal assumptions about the effect of interest, as it is developed using uninformative or minimally informative priors: the JZS prior (technically, this corresponds to assuming a Cauchy prior on the standardized effect size and an uninformative Jeffreys prior on the variance of the measurements). It was derived in a paper by Rouder et al. (2009), and there is an easy-to-use R implementation in the package &lt;a href=&#34;https://cran.r-project.org/web/packages/BayesFactor/index.html&#34;&gt;BayesFactor&lt;/a&gt; (see the function &lt;code&gt;ttestBF()&lt;/code&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;(&lt;strong&gt;ii&lt;/strong&gt;) is the frequentist alternative. In a frequentist approach you don&amp;rsquo;t express belief in a hypothesis in terms of probability; uncertainty is characterized in relation to the data-generating process (e.g. how many times would you reject the null if you repeated the experiment a zillion times? Probability is interpreted as the long-run frequency in an imaginary, very large series of replications). Under this approach you can estimate the maximum size that the effect could plausibly have, given that you did not detect it in your experiment. Daniel Lakens has written an easy-to-use package for this, called TOSTER; see &lt;a href=&#34;https://cran.rstudio.com/web/packages/TOSTER/vignettes/IntroductionToTOSTER.html&#34;&gt;this vignette&lt;/a&gt; for an introduction.&lt;/p&gt;
&lt;/blockquote&gt;
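TOSTER is an R package; purely as an illustration of the two one-sided tests (TOST) logic behind equivalence testing, here is a Python sketch on simulated data. The equivalence bounds and sample size are arbitrary choices of mine, not from the post:

```python
# TOST: test H0 "mean <= low" and H0 "mean >= high"; rejecting both
# bounds the true effect within (low, high).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)
x = rng.normal(loc=0.0, scale=1.0, size=1000)  # data simulated under the null

low, high = -0.2, 0.2                          # equivalence bounds (raw units)
n, mean, sd = len(x), x.mean(), x.std(ddof=1)
se = sd / np.sqrt(n)

t_low = (mean - low) / se                      # one-sided test against low
t_high = (mean - high) / se                    # one-sided test against high
p_low = stats.t.sf(t_low, df=n - 1)
p_high = stats.t.cdf(t_high, df=n - 1)

# declare equivalence only if BOTH one-sided tests reject
p_tost = max(p_low, p_high)
print(p_tost)
```

With a true effect of zero and a reasonably large sample, the TOST p-value comes out small, i.e. the effect is bounded within the chosen equivalence region.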
&lt;p&gt;&lt;img src=&#34;https://imgs.xkcd.com/comics/null_hypothesis.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Installing RStan on HPC cluster</title>
      <link>http://mlisi.xyz/post/rstan-cluster/</link>
      <pubDate>Sun, 04 Aug 2019 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/rstan-cluster/</guid>
      <description>


&lt;p&gt;This took me some time to get working, so I’ll write down the details here for the benefit of my future self and anyone else facing similar issues.&lt;/p&gt;
&lt;p&gt;To run R in the &lt;a href=&#34;https://docs.hpc.qmul.ac.uk/&#34;&gt;Apocrita&lt;/a&gt; cluster (which runs CentOS 7) first load the modules&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;module load R
module load gcc&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(gcc is required to compile the packages from source.)&lt;/p&gt;
&lt;p&gt;Before starting you should make sure that you don’t have any previous installation of RStan in your system. From an R terminal, type:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;remove.packages(&amp;quot;rstan&amp;quot;)
remove.packages(&amp;quot;StanHeaders&amp;quot;)
if (file.exists(&amp;quot;.RData&amp;quot;)) file.remove(&amp;quot;.RData&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One problem that I had initially was (I think) due to Rcpp and rstan having been installed with different compilers or compilation flags.
Thanks to the IT support at Queen Mary University, the correct C++ toolchain configuration that did the trick for me is the following:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;CXX14 = g++ -std=c++1y
CXX14FLAGS = -O3 -Wno-unused-variable -Wno-unused-function -fPIC&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To write the correct configuration in the &lt;code&gt;~/.R/Makevars&lt;/code&gt; file from an R terminal:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dotR &amp;lt;- file.path(Sys.getenv(&amp;quot;HOME&amp;quot;), &amp;quot;.R&amp;quot;)
if (!file.exists(dotR)) dir.create(dotR)
M &amp;lt;- file.path(dotR, &amp;quot;Makevars&amp;quot;)
if (!file.exists(M)) file.create(M)
cat(&amp;quot;\nCXX14 = g++ -std=c++1y&amp;quot;, &amp;quot;CXX14FLAGS = -O3 -Wno-unused-variable -Wno-unused-function -fPIC&amp;quot;, file = M, sep = &amp;quot;\n&amp;quot;, append = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, install RStan:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Sys.setenv(MAKEFLAGS = &amp;quot;-j4&amp;quot;) # use four cores for compilation
install.packages(&amp;quot;rstan&amp;quot;, type = &amp;quot;source&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that in my case it worked correctly without running the CentOS 7.0-specific instructions indicated on the &lt;a href=&#34;https://github.com/stan-dev/rstan/wiki/Installing-RStan-on-Linux#special-note-centos-70&#34;&gt;rstan installation page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Another thing that I did, although I am not sure it is strictly necessary, was to install RStan in a fresh R library, that is, in a directory containing only the packages necessary to run RStan.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Attitude shift towards Remain in European Elections obscured in press by rebranded Farage party</title>
      <link>http://mlisi.xyz/post/eu/</link>
      <pubDate>Mon, 27 May 2019 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/eu/</guid>
      <description>


&lt;p&gt;The European Election results reveal a shift towards parties that support a soft Brexit or no Brexit at all, compared to the 2014 vote. Major news outlets in the UK and Europe instead claim that a hard Brexit has gained support. To see why this is a misguided conclusion, just look at the numbers:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://mlisi.xyz/img/EU_election.png&#34; alt=&#34;Almost complete vote share results of UK’s EU elections 2019, put in perspective.&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Almost complete vote share results of UK’s EU elections 2019, put in perspective.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Parties supporting strong UK independence have lost an overwhelming number of voters, with UKIP losing 24.2% of the vote and the Conservatives losing 14.8%. This loss is only partially recovered by Farage’s Brexit Party, which gained 31.6%, suggesting that a substantial proportion of voters switched to parties favouring a soft or no Brexit.&lt;/p&gt;
&lt;p&gt;Amongst parties favouring closer European ties, those vocally against Brexit show major wins (Lib Dems 13.4%, Greens 4.2%), whilst the Labour party, which lacks a clear Brexit stance, has lost 11.3% of its votes.&lt;/p&gt;
&lt;p&gt;The numerical shift towards parties against Brexit is obscured by Farage’s clever rebranding of UKIP. The apparent overnight success of Farage’s new Brexit Party, now the largest single party, is interpreted by major news outlets such as the Guardian and France 24 as a victory for hard Brexiteers. This conclusion overlooks that the vast majority of the gained votes are funneled directly from Farage’s own former party, UKIP. The rebranding allows Farage to claim a victory of 28 new seats over 2014, instead of the correct increase of 5.&lt;/p&gt;
&lt;p&gt;Despite media claims of a triumph for Farage’s party, the numbers actually suggest rising scepticism towards Brexit.&lt;/p&gt;
&lt;p&gt;(After making this plot we realized that the Guardian had written a &lt;a href=&#34;https://www.theguardian.com/politics/2019/may/27/remain-hard-brexit-what-uk-european-election-results-tell-us&#34;&gt;perspective article&lt;/a&gt; along the same lines.)&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.ucl.ac.uk/~ucjttb1/&#34;&gt;Tessa Dekker&lt;/a&gt; &amp;amp; Matteo Lisi&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Ephemeral patterns of complexity</title>
      <link>http://mlisi.xyz/post/ephemeral/</link>
      <pubDate>Fri, 05 Apr 2019 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/ephemeral/</guid>
      <description>


&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://mlisi.xyz/img/complexity.png&#34; alt=&#34;From ‘The Big Picture: On the Origins of Life, Meaning, and the Universe Itself’ by Sean Carroll, www.preposterousuniverse.com/bigpicture&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;From ‘&lt;em&gt;The Big Picture: On the Origins of Life, Meaning, and the Universe Itself&lt;/em&gt;’ by Sean Carroll, &lt;a href=&#34;https://www.preposterousuniverse.com/bigpicture/&#34;&gt;www.preposterousuniverse.com/bigpicture&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Bayesian model selection at the group level</title>
      <link>http://mlisi.xyz/post/bms/</link>
      <pubDate>Fri, 25 Jan 2019 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/bms/</guid>
      <description>




&lt;p&gt;In experimental psychology and neuroscience, the classical approach when comparing different models that make quantitative predictions about the behavior of participants is to aggregate the predictive ability of each model (e.g. as quantified by the Akaike Information Criterion) across participants, and then see which one provides the best performance on average. Although correct, this approach neglects the possibility that different participants might use different strategies, each best described by a different competing model. To account for this, Stephan et al. &lt;span class=&#34;citation&#34;&gt;(Stephan et al. 2009)&lt;/span&gt; proposed a more conservative approach in which models are treated as random effects that can differ between subjects and have a fixed (unknown) distribution in the population. The relevant statistical quantity is the frequency with which each model prevails in the population. Note that this differs from the definition of random effects in classical statistics, where random-effects models have multiple sources of variation, e.g. within- and between-subject variance. A useful and popular way to summarize the results of this analysis is to report the models’ &lt;em&gt;exceedance probabilities&lt;/em&gt;, which measure how likely it is that any given model is more frequent than all the other models in the set. The following exposition is largely based on Stephan et al.’s paper &lt;span class=&#34;citation&#34;&gt;(Stephan et al. 2009)&lt;/span&gt;.&lt;/p&gt;
&lt;div id=&#34;model-evidence&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Model evidence&lt;/h1&gt;
&lt;p&gt;Let’s say we have an experiment with &lt;span class=&#34;math inline&#34;&gt;\(N\)&lt;/span&gt; participants, indexed &lt;span class=&#34;math inline&#34;&gt;\(\left(1,\dots,N\right)\)&lt;/span&gt;. Their performance is quantitatively predicted by a set of &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; competing models, indexed &lt;span class=&#34;math inline&#34;&gt;\(\left(1,\dots,K\right)\)&lt;/span&gt;. The behaviour of subject &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; can be fit by model &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; by finding the value(s) of the parameter(s) &lt;span class=&#34;math inline&#34;&gt;\(\theta_k\)&lt;/span&gt; that maximize the likelihood of the data &lt;span class=&#34;math inline&#34;&gt;\(y_n\)&lt;/span&gt; under the model. In a fully Bayesian setting each unknown parameter has a prior probability distribution, and the quantity of choice for comparing the goodness of fit of models is the marginal likelihood, that is
&lt;span class=&#34;math display&#34;&gt;\[
  p \left(y_n \mid k \right) = \int p\left(y_n \mid k, \theta_k \right) \, p\left(\theta_k \right) d\theta_k.
\]&lt;/span&gt;
By integrating over the prior probability of the parameters, the marginal likelihood provides a measure of the evidence in favour of a specific model while taking into account the complexity of the model. We might also do something simpler and approximate the model evidence using e.g. the Akaike Information Criterion.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;models-as-random-effects&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Models as random effects&lt;/h1&gt;
&lt;p&gt;We are interested in finding which model does better at predicting behavior, while allowing different participants to use different strategies, represented by different models. To achieve this we treat the models as random effects and assume that the frequencies or probabilities of the models in the population, &lt;span class=&#34;math inline&#34;&gt;\((r_1, \dots, r_K)\)&lt;/span&gt;, are described by a Dirichlet distribution with parameters &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol{\alpha } = \alpha_1, \dots, \alpha_K\)&lt;/span&gt;,
&lt;span class=&#34;math display&#34;&gt;\[
\begin{align}
p\left(r \mid  \boldsymbol{\alpha } \right) &amp;amp; = \text{Dir} \left(r, \boldsymbol{ \alpha } \right) \\
&amp;amp; = \frac{1}{\mathbf{B} \left(\boldsymbol{ \alpha }  \right)} \prod_{i=1}^K r_i^{\alpha_i -1} \nonumber
\end{align}.
\]&lt;/span&gt;
Where the normalizing constant &lt;span class=&#34;math inline&#34;&gt;\(\mathbf{B} \left(\boldsymbol{ \alpha } \right)\)&lt;/span&gt; is the multivariate Beta function. The probabilities &lt;span class=&#34;math inline&#34;&gt;\(r\)&lt;/span&gt; generate ‘switches’, or indicator variables, &lt;span class=&#34;math inline&#34;&gt;\(m_1, \dots, m_N\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(m_{nk} \in \left \{ 0, 1\right \}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\sum_{k=1}^K m_{nk}=1\)&lt;/span&gt;. These indicator variables prescribe the model for subject &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;: &lt;span class=&#34;math inline&#34;&gt;\(p(m_{nk}=1)=r_k\)&lt;/span&gt;.
Given the probabilities &lt;span class=&#34;math inline&#34;&gt;\(r\)&lt;/span&gt;, the indicator variables thus have a multinomial distribution, that is
&lt;span class=&#34;math display&#34;&gt;\[
p\left(m_n \mid  \mathbf{r} \right) =  \prod_{k=1}^K r_k^{m_{nk}}.
\]&lt;/span&gt;
The graphical model that summarizes these dependencies is shown in the following graph:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://mlisi.xyz/img/bms.png&#34; alt=&#34;&#34; style=&#34;width:50.0%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
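The generative process just described can be sketched in a few lines of numpy (the notation and the numbers are mine, for illustration only): model frequencies drawn from a Dirichlet distribution, then one multinomial model assignment per subject.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 20                       # K candidate models, N subjects
alpha = np.ones(K)                 # flat Dirichlet prior

r = rng.dirichlet(alpha)           # population model frequencies (sum to 1)
m = rng.multinomial(1, r, size=N)  # one-hot indicators m_n, p(m_nk = 1) = r_k
counts = m.sum(axis=0)             # how many subjects each model "explains"

print(r, counts)
```

Each row of `m` is a one-hot vector selecting the model assumed to generate that subject's data, matching the constraint that the indicators sum to one across models.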
&lt;div id=&#34;variational-bayesian-approach&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Variational Bayesian approach&lt;/h1&gt;
&lt;p&gt;The goal is to estimate the parameters &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol{\alpha}\)&lt;/span&gt; that define the posterior distribution of model frequencies given the data, &lt;span class=&#34;math inline&#34;&gt;\(p \left( r \mid y \right)\)&lt;/span&gt;. To do so we need an estimate of the model evidence &lt;span class=&#34;math inline&#34;&gt;\(p \left(m_{nk}=1 \mid y_n \right)\)&lt;/span&gt;, that is, the belief that model &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; generated the data of subject &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;. There are many possible approaches that can be used to estimate the model evidence, either exactly or approximately. Importantly, these estimates need to be normalized so that they sum to one across models; if one were using the Akaike Information Criterion, for instance, the AIC values should be transformed into Akaike weights &lt;span class=&#34;citation&#34;&gt;(Burnham and Anderson 2002)&lt;/span&gt;.&lt;/p&gt;
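As a concrete illustration of the normalization step, here is a small sketch of turning AIC values into Akaike weights; the AIC values are made-up example numbers, not from any real dataset:

```python
import numpy as np

aic = np.array([210.3, 204.1, 207.8])   # one subject, K = 3 candidate models
delta = aic - aic.min()                 # AIC differences from the best model
w = np.exp(-0.5 * delta)
w /= w.sum()                            # Akaike weights: sum to 1 across models

print(np.round(w, 3))                   # the middle (lowest-AIC) model dominates
```

The resulting weights can then play the role of the normalized per-subject model evidences in the hierarchical scheme.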
&lt;div id=&#34;generative-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Generative model&lt;/h2&gt;
&lt;p&gt;Given the graphical model illustrated above, the joint probability of parameters and data can be expressed as
&lt;span class=&#34;math display&#34;&gt;\[
\begin{align}
p \left( y, r, m \right) &amp;amp; = p \left( y \mid m \right) \, p \left( m \mid r \right) \, p \left( r \mid \boldsymbol{\alpha} \right) \\
&amp;amp; = p \left( r \mid \boldsymbol{\alpha} \right) \left[ \prod_{n=1}^N p \left( y_n \mid m_n \right) \, p\left(m_n \mid r \right) \right] \nonumber \\
&amp;amp; = \frac{1}{\mathbf{B} \left(\boldsymbol{ \alpha }  \right)} \left[ \prod_{k=1}^K r_k^{\alpha_k -1} \right] \left[ \prod_{n=1}^N p \left( y_n \mid m_n\right) \, \prod_{k=1}^K r_k^{m_{nk}} \right] \nonumber \\
&amp;amp; = \frac{1}{\mathbf{B} \left(\boldsymbol{ \alpha }  \right)} \left[ \prod_{k=1}^K r_k^{\alpha_k -1} \right] \prod_{n=1}^N \prod_{k=1}^K \left[ p \left( y_n \mid m_{nk} \right) \, r_k \right]^{m_{nk}}. \nonumber
\end{align}
\]&lt;/span&gt;
And the log probability is
&lt;span class=&#34;math display&#34;&gt;\[
\log p \left( y, r, m \right)  = - \log \mathbf{B} \left(\boldsymbol{ \alpha }  \right)
+ \sum_{k=1}^K \left(\alpha_k -1 \right) \log r_k
+ \sum_{n=1}^N \sum_{k=1}^K m_{nk} \left( \log p \left( y_n \mid m_{nk} \right) + \log r_k\right).
\]&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;variational-approximation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Variational approximation&lt;/h2&gt;
&lt;p&gt;In order to fit this hierarchical model following the variational approach one needs to define an approximate posterior distribution over model frequencies and assignments, &lt;span class=&#34;math inline&#34;&gt;\(q\left(r,m\right)\)&lt;/span&gt;, which is assumed to be adequately described by a mean-field factorisation, that is &lt;span class=&#34;math inline&#34;&gt;\(q\left(r,m\right) = q\left(r\right) \, q\left(m\right)\)&lt;/span&gt;. The two densities are proportional to the exponentiated &lt;em&gt;variational energies&lt;/em&gt; &lt;span class=&#34;math inline&#34;&gt;\(I(m), I(r)\)&lt;/span&gt;, which are essentially the un-normalized approximate log-posterior densities, that is
&lt;span class=&#34;math display&#34;&gt;\[
\begin{align}
q\left(r\right) &amp;amp; \propto e^{I(r)}, \, q\left(m\right)\propto e^{I(m)} \\
I(r) &amp;amp; = \left&amp;lt; \log p \left( y, r, m \right) \right&amp;gt;_{q(m)} \\
I(m) &amp;amp; = \left&amp;lt; \log p \left( y, r, m \right) \right&amp;gt;_{q(r)}
\end{align}
\]&lt;/span&gt;
For the approximate posterior over model assignment &lt;span class=&#34;math inline&#34;&gt;\(q(m)\)&lt;/span&gt; we first compute &lt;span class=&#34;math inline&#34;&gt;\(I(m)\)&lt;/span&gt; and then an appropriate normalization constant. From the expression of the joint log-probability above, removing all the terms that do not depend on &lt;span class=&#34;math inline&#34;&gt;\(m\)&lt;/span&gt;, we have that the un-normalized approximate log-posterior (the variational energy) can be expressed as
&lt;span class=&#34;math display&#34;&gt;\[
\begin{align}
I(m) &amp;amp; = \int \log p \left( y, r, m \right) \, q(r) \, dr \\
&amp;amp; = \sum_{n=1}^N \sum_{k=1}^K m_{nk} \left[ \log p \left( y_n \mid m_{nk} \right) + \int q(r_k) \log r_k \, d r_k \right] \nonumber \\
&amp;amp; = \sum_{n=1}^N \sum_{k=1}^K m_{nk} \left[ \log p \left( y_n \mid m_{nk} \right) + \psi (\alpha_k) -\psi \left(  \alpha_S \right) \right] \nonumber
\end{align}
\]&lt;/span&gt;
where &lt;span class=&#34;math inline&#34;&gt;\(\alpha_S = \sum_{k=1}^K \alpha_k\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\psi\)&lt;/span&gt; is the digamma function. If you wonder (as I did when reading this the first time) where the hell the digamma function comes from: it is here due to a property of the Dirichlet distribution, which says that the expected value of &lt;span class=&#34;math inline&#34;&gt;\(\log r_k\)&lt;/span&gt; can be computed as
&lt;span class=&#34;math display&#34;&gt;\[
\mathbb{E} \left[\log r_k \right] = \int p(r_k) \log r_k \, d r_k = \psi (\alpha_k) -\psi \left( \sum_{k=1}^K \alpha_k \right)
\]&lt;/span&gt;&lt;/p&gt;
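&lt;p&gt;This property is easy to verify numerically, comparing a Monte Carlo estimate of the expectation with the digamma expression (a quick Python sketch, with an arbitrary choice of &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol{\alpha}\)&lt;/span&gt;):&lt;/p&gt;

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.0, 5.0])          # arbitrary Dirichlet parameters

# Monte Carlo estimate of E[log r_k] from Dirichlet samples
samples = rng.dirichlet(alpha, size=200_000)
mc_estimate = np.log(samples).mean(axis=0)

# analytic expression: psi(alpha_k) - psi(sum of the alphas)
analytic = digamma(alpha) - digamma(alpha.sum())

print(mc_estimate)
print(analytic)   # the two should agree up to Monte Carlo error
```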
&lt;p&gt;From this, we have that the un-normalized posterior belief that model &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; generated data from subject &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; is
&lt;span class=&#34;math display&#34;&gt;\[
u_{nk} =  \exp {\left[ \log p \left( y_n \mid m_{nk} \right) + \psi (\alpha_k) -\psi \left(  \alpha_S \right) \right]}
\]&lt;/span&gt;
and the normalized belief is
&lt;span class=&#34;math display&#34;&gt;\[
g_{nk} = \frac{u_{nk}}{\sum_{j=1}^K u_{nj}}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;We also need to compute the approximate posterior density &lt;span class=&#34;math inline&#34;&gt;\(q(r)\)&lt;/span&gt;, and we begin as above by computing the un-normalized approximate log-posterior, or variational energy,
&lt;span class=&#34;math display&#34;&gt;\[
\begin{align}
I(r) &amp;amp; = \int \log p \left( y, r, m \right) \, q(m) \, dm \\
&amp;amp; = \sum_{k=1}^K \left[\log r_k \left(\alpha_{0k} -1 \right) +  \sum_{n=1}^N g_{nk} \log r_k \right]
\end{align}
\]&lt;/span&gt;
The logarithm of a Dirichlet density is &lt;span class=&#34;math inline&#34;&gt;\(\log \text{Dir} (r , \boldsymbol{\alpha}) = \sum_{k=1}^K \log r_k \left(\alpha_{0k} -1 \right) + \dots\)&lt;/span&gt;, therefore the parameters of the approximate posterior are
&lt;span class=&#34;math display&#34;&gt;\[
  \alpha_k = \alpha_{0k} + \sum_{n=1}^N g_{nk}
\]&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;iterative-algorithm&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Iterative algorithm&lt;/h2&gt;
&lt;p&gt;The algorithm &lt;span class=&#34;citation&#34;&gt;(Stephan et al. 2009)&lt;/span&gt; proceeds by iteratively estimating the posterior belief that a given model generated the data from a certain subject, integrating out the prior probabilities of the models (the &lt;span class=&#34;math inline&#34;&gt;\(r_k\)&lt;/span&gt; predicted by the Dirichlet distribution that describes the frequency of models in the population) in log-space as described above. Next, the parameters of the approximate Dirichlet posterior are updated, which gives new priors to integrate out from the model evidence, and so on until convergence. Convergence is assessed by keeping track of how much the vector &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol{\alpha}\)&lt;/span&gt; changes from one iteration to the next; for example, it is common to consider that the procedure has converged when &lt;span class=&#34;math inline&#34;&gt;\(\left\Vert \boldsymbol{\alpha}_{t} - \boldsymbol{\alpha}_{t-1} \right\Vert &amp;lt; 10^{-4}\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(\left\Vert \cdot \right\Vert\)&lt;/span&gt; denotes the Euclidean norm.&lt;/p&gt;
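&lt;p&gt;Putting the pieces together, the whole iterative scheme fits in a few lines. Below is a sketch (in Python, for illustration; not the SPM implementation), assuming the log model evidences are supplied as an &lt;span class=&#34;math inline&#34;&gt;\(N \times K\)&lt;/span&gt; matrix:&lt;/p&gt;

```python
import numpy as np
from scipy.special import digamma

def vb_model_selection(log_evidence, alpha0=1.0, tol=1e-4, max_iter=1000):
    # log_evidence: N subjects x K models array of log model evidences
    N, K = log_evidence.shape
    alpha = np.full(K, alpha0)
    for _ in range(max_iter):
        # un-normalized log posterior beliefs, priors integrated out in log-space
        log_u = log_evidence + digamma(alpha) - digamma(alpha.sum())
        log_u -= log_u.max(axis=1, keepdims=True)   # for numerical stability
        g = np.exp(log_u)
        g /= g.sum(axis=1, keepdims=True)           # normalized beliefs g_nk
        alpha_new = alpha0 + g.sum(axis=0)          # Dirichlet posterior update
        # stop when the change in alpha has a norm below tol
        converged = np.less(np.linalg.norm(alpha_new - alpha), tol)
        alpha = alpha_new
        if converged:
            break
    return alpha, g
```

&lt;p&gt;A quick sanity check: since each row of the beliefs sums to one, the posterior counts must satisfy &lt;span class=&#34;math inline&#34;&gt;\(\sum_k \alpha_k = \sum_k \alpha_{0k} + N\)&lt;/span&gt;.&lt;/p&gt;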
&lt;/div&gt;
&lt;div id=&#34;exceedance-probabilities&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exceedance probabilities&lt;/h2&gt;
&lt;p&gt;After having found the optimised values of &lt;span class=&#34;math inline&#34;&gt;\(\boldsymbol{\alpha}\)&lt;/span&gt;, one popular way to report the results and rank the models is by their exceedance probability, defined as the (second-order) probability that a given model is more frequent in the population than any of the alternative models, that is
&lt;span class=&#34;math display&#34;&gt;\[
\forall j \in \left\{1, \dots, K, j \ne k \right\}, \,\,\, \varphi_k = p \left(r_k &amp;gt; r_j \mid y, \boldsymbol{\alpha} \right).
\]&lt;/span&gt;
In the case of &lt;span class=&#34;math inline&#34;&gt;\(K&amp;gt;2\)&lt;/span&gt; models, the exceedance probabilities &lt;span class=&#34;math inline&#34;&gt;\(\varphi_k\)&lt;/span&gt; are computed by generating random samples from univariate Gamma densities and then normalizing. Specifically, each multivariate Dirichlet sample is composed of &lt;span class=&#34;math inline&#34;&gt;\(K\)&lt;/span&gt; independent random samples &lt;span class=&#34;math inline&#34;&gt;\((x_1, \dots, x_K)\)&lt;/span&gt; distributed according to the density &lt;span class=&#34;math inline&#34;&gt;\(\text{Gamma}\left(\alpha_i, 1\right) = \frac{x_i^{\alpha_i-1} e^{-x_i}}{\Gamma(\alpha_i)}\)&lt;/span&gt;, which are then normalized by taking &lt;span class=&#34;math inline&#34;&gt;\(z_i = \frac{x_i}{ \sum_{i=1}^K x_i}\)&lt;/span&gt;. The exceedance probability &lt;span class=&#34;math inline&#34;&gt;\(\varphi_k\)&lt;/span&gt; for each model &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; is then computed as
&lt;span class=&#34;math display&#34;&gt;\[
\varphi_k = \frac{\sum \mathop{\bf{1}}_{z_k&amp;gt;z_j, \forall j \in \left\{1, \dots, K, j \ne k \right\} }}{ \text{n. of samples}}
\]&lt;/span&gt;
where &lt;span class=&#34;math inline&#34;&gt;\(\mathop{\bf{1}}_{\dots}\)&lt;/span&gt; is the indicator function (&lt;span class=&#34;math inline&#34;&gt;\(\mathop{\bf{1}}_{x&amp;gt;0} = 1\)&lt;/span&gt; if &lt;span class=&#34;math inline&#34;&gt;\(x&amp;gt;0\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt; otherwise), summed over the total number of multivariate samples drawn.&lt;/p&gt;
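&lt;p&gt;The sampling scheme just described can be sketched as follows (a Python sketch, for illustration):&lt;/p&gt;

```python
import numpy as np

def exceedance_prob(alpha, n_samples=100_000, seed=0):
    # Monte Carlo exceedance probabilities for a Dirichlet(alpha) posterior
    rng = np.random.default_rng(seed)
    # each Dirichlet draw: K independent Gamma(alpha_k, 1) draws, normalized
    x = rng.gamma(shape=alpha, size=(n_samples, len(alpha)))
    z = x / x.sum(axis=1, keepdims=True)
    # count how often each model has the largest sampled frequency
    winners = z.argmax(axis=1)
    return np.bincount(winners, minlength=len(alpha)) / n_samples
```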
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Code!&lt;/h1&gt;
&lt;p&gt;All this is already implemented in Matlab code in &lt;a href=&#34;https://www.fil.ion.ucl.ac.uk/spm/software/spm12/&#34;&gt;SPM 12&lt;/a&gt;. However, if you don’t like Matlab, I have translated it into R and put it into a &lt;a href=&#34;https://github.com/mattelisi/bmsR&#34;&gt;package on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Burnham2002&#34;&gt;
&lt;p&gt;Burnham, Kenneth P., and David R. Anderson. 2002. &lt;em&gt;Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach&lt;/em&gt;. 2nd edition. New York, US: Springer New York. &lt;a href=&#34;https://doi.org/10.1007/b97636&#34;&gt;https://doi.org/10.1007/b97636&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Stephan2009&#34;&gt;
&lt;p&gt;Stephan, Klaas Enno, Will D. Penny, Jean Daunizeau, Rosalyn J. Moran, and Karl J. Friston. 2009. “Bayesian model selection for group studies.” &lt;em&gt;NeuroImage&lt;/em&gt; 46 (4). Elsevier Inc.: 1004–17. &lt;a href=&#34;https://doi.org/10.1016/j.neuroimage.2009.03.025&#34;&gt;https://doi.org/10.1016/j.neuroimage.2009.03.025&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Bayesian multilevel models using R and Stan (part 1)</title>
      <link>http://mlisi.xyz/post/bayesian-multilevel-models-r-stan/</link>
      <pubDate>Thu, 01 Mar 2018 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/bayesian-multilevel-models-r-stan/</guid>
      <description>


&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://mlisi.xyz/img/turtlepile.jpg&#34; alt=&#34;Photo ©Roxie and Lee Carroll, www.akidsphoto.com.&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Photo ©Roxie and Lee Carroll, www.akidsphoto.com.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In my previous lab I was known for promoting the use of multilevel, or mixed-effects, models among my colleagues. (The slides on the &lt;a href=&#34;https://mattelisi.github.io/#notes&#34;&gt;/misc&lt;/a&gt; section of this website are part of this effort.) Multilevel models should be the standard approach in fields like experimental psychology and neuroscience, where the data are naturally grouped according to “observational units”, i.e. individual participants. I agree with Richard McElreath when he writes that &lt;em&gt;“multilevel regression deserves to be the default form of regression”&lt;/em&gt; (see &lt;a href=&#34;http://xcelab.net/rmpubs/rethinking/Statistical_Rethinking_sample.pdf&#34;&gt;here&lt;/a&gt;, section 1.3.2) and that, at least in our fields, studies not using a multilevel approach should justify that choice.&lt;/p&gt;
&lt;p&gt;In &lt;span class=&#34;math inline&#34;&gt;\(\textsf{R}\)&lt;/span&gt;, the easiest way to fit multilevel linear and generalized linear models is provided by the &lt;code&gt;lme4&lt;/code&gt; library &lt;span class=&#34;citation&#34;&gt;(Bates et al. 2014)&lt;/span&gt;. &lt;code&gt;lme4&lt;/code&gt; is a great package, which allows users to test different models easily and painlessly. However, it also has some limitations: it can fit only classical forms of linear and generalized linear models, and cannot, for example, be used to fit psychometric functions that take attention lapses into account (see &lt;a href=&#34;https://mattelisi.github.io/post/model-averaging/&#34;&gt;here&lt;/a&gt;). Also, &lt;code&gt;lme4&lt;/code&gt; fits multilevel models within a frequentist approach, and thus does not allow one to incorporate prior knowledge into the model, or to use regularizing priors to reduce the risk of overfitting. For this reason, I have recently started using &lt;a href=&#34;http://mc-stan.org&#34;&gt;Stan&lt;/a&gt;, through its &lt;a href=&#34;http://mc-stan.org/users/interfaces/rstan.html&#34;&gt;&lt;span class=&#34;math inline&#34;&gt;\(\textsf{R}\)&lt;/span&gt;Stan&lt;/a&gt; interface, to fit multilevel models in a Bayesian setting, and I find it great! It certainly requires more effort to define the models, but I think that the flexibility offered by software like Stan is well worth the time spent learning how to use it.&lt;/p&gt;
&lt;p&gt;For people like me, used to working with &lt;code&gt;lme4&lt;/code&gt;, Stan can be a bit discouraging at first. The approach to writing the model is quite different, and it requires specifying all the distributional assumptions explicitly. Also, implementing models with correlated random effects requires some specific notions of linear algebra. So I prepared a first tutorial showing how to analyse in Stan one of the most common introductory examples of mixed-effects models, the &lt;code&gt;sleepstudy&lt;/code&gt; dataset (contained in the &lt;code&gt;lme4&lt;/code&gt; package). This will be followed by another tutorial showing how to use this approach to fit datasets where the dependent variable is a binary outcome, as is the case for most psychophysical data.&lt;/p&gt;
&lt;div id=&#34;the-sleepstudy-example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The &lt;code&gt;sleepstudy&lt;/code&gt; example&lt;/h1&gt;
&lt;p&gt;This dataset contains part of the data from a published study &lt;span class=&#34;citation&#34;&gt;(Belenky et al. 2003)&lt;/span&gt; that examined the effect of sleep deprivation on reaction times. (This is a sensible topic: think, for example, of long-distance truck drivers.) The dataset contains the average reaction times for the 18 subjects of the sleep-deprived group, for the first 10 days of the study, up to the recovery period.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(lme4)
Loading required package: Matrix
str(sleepstudy)
&amp;#39;data.frame&amp;#39;:   180 obs. of  3 variables:
 $ Reaction: num  250 259 251 321 357 ...
 $ Days    : num  0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : Factor w/ 18 levels &amp;quot;308&amp;quot;,&amp;quot;309&amp;quot;,&amp;quot;310&amp;quot;,..: 1 1 1 1 1 1 1 1 1 1 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model I want to fit to the data will contain both random intercepts and slopes; in addition the correlation between the random effects should also be estimated. Using &lt;code&gt;lme4&lt;/code&gt;, this model could be estimated by using&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lmer(Reaction ~ Days + (Days | Subject), sleepstudy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The model could be formally notated as
&lt;span class=&#34;math display&#34;&gt;\[
y_{ij} = \beta_0 + u_{0j} + \left( \beta_1 + u_{1j} \right) \cdot {\rm{Days}} + e_{ij}
\]&lt;/span&gt;
where &lt;span class=&#34;math inline&#34;&gt;\(\beta_0\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\beta_1\)&lt;/span&gt; are the fixed effects parameters (intercept and slope), &lt;span class=&#34;math inline&#34;&gt;\(u_{0j}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(u_{1j}\)&lt;/span&gt; are the subject specific random intercept and slope (the index &lt;span class=&#34;math inline&#34;&gt;\(j\)&lt;/span&gt; denotes the subject), and &lt;span class=&#34;math inline&#34;&gt;\(e \sim\cal N \left( 0,\sigma_e^2 \right)\)&lt;/span&gt; is the (normally distributed) residual error. The random effects &lt;span class=&#34;math inline&#34;&gt;\(u_0\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(u_1\)&lt;/span&gt; have a multivariate normal distribution, with mean 0 and covariance matrix &lt;span class=&#34;math inline&#34;&gt;\(\Omega\)&lt;/span&gt;
&lt;span class=&#34;math display&#34;&gt;\[
\left[ {\begin{array}{*{20}{c}}
{{u_0}}\\
{{u_1}}
\end{array}} \right] \sim\cal N \left( {\left[ {\begin{array}{*{20}{c}}
0\\
0
\end{array}} \right],\Omega  = \left[ {\begin{array}{*{20}{c}}
{\sigma _0^2}&amp;amp;{{\mathop{\rm cov}} \left( {{u_0},{u_1}} \right)}\\
{{\mathop{\rm cov}} \left( {{u_0},{u_1}} \right)}&amp;amp;{\sigma _1^2}
\end{array}} \right]} \right)
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;In Stan, fitting this model requires preparing a separate text file (usually saved with the ‘.stan’ extension), containing several “blocks”. The 3 main types of blocks in Stan are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;data&lt;/code&gt;&lt;/strong&gt; all the dependent and independent variables need to be declared in this block&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;parameters&lt;/code&gt;&lt;/strong&gt; here one should declare the free parameters of the model; what Stan does, essentially, is use an MCMC algorithm to draw samples from the posterior distribution of the parameters given the data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;model&lt;/code&gt;&lt;/strong&gt; here one should define the likelihood function and, if used, the priors&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, we will use two other types of blocks, &lt;strong&gt;&lt;code&gt;transformed parameters&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;generated quantities&lt;/code&gt;&lt;/strong&gt;. The first is necessary because we are also estimating the full correlation matrix of the random effects. We will parametrize the covariance matrix via the Cholesky factor of the correlation matrix (see &lt;a href=&#34;https://mattelisi.github.io/post/simulating-correlated-variables-with-the-cholesky-factorization/&#34;&gt;my post on the Cholesky factorization&lt;/a&gt;), and in the &lt;code&gt;transformed parameters&lt;/code&gt; block we will multiply the random effects by the Cholesky factor, transforming them so that they have the intended correlation matrix. The &lt;code&gt;generated quantities&lt;/code&gt; block can be used to compute any additional quantities we may want, once for each sample; I will use it to transform the Cholesky factor back into the correlation matrix (this step is not essential but makes examining the model easier).&lt;/p&gt;
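&lt;p&gt;To see what this transformation does, here is the same computation outside Stan (a Python sketch, with made-up values for the standard deviations and the correlation):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
J = 10_000                              # many "subjects", to check the moments
sigma_u = np.array([0.1, 0.02])         # made-up random-effect standard deviations
rho = 0.4                               # made-up intercept-slope correlation
corr = np.array([[1.0, rho], [rho, 1.0]])
L = np.linalg.cholesky(corr)            # lower-triangular Cholesky factor

z = rng.standard_normal((2, J))         # uncorrelated standard-normal draws
u = np.diag(sigma_u) @ L @ z            # as diag_pre_multiply(sigma_u, L_u) * z_u

print(np.corrcoef(u)[0, 1])             # close to rho
print(u.std(axis=1))                    # close to sigma_u
```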
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;RStan requires the data to be organized in a list object. It can be done with the following command&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;d_stan &amp;lt;- list(Subject = as.numeric(factor(sleepstudy$Subject, 
    labels = 1:length(unique(sleepstudy$Subject)))), Days = sleepstudy$Days, 
    RT = sleepstudy$Reaction/1000, N = nrow(sleepstudy), J = length(unique(sleepstudy$Subject)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that I also included two scalar variables, &lt;code&gt;N&lt;/code&gt; and &lt;code&gt;J&lt;/code&gt;, indicating respectively the number of observations and the number of subjects. &lt;code&gt;Subject&lt;/code&gt; was a categorical factor, but to input it into Stan I transformed it into an integer index. I also rescaled the reaction times so that they are in seconds instead of milliseconds.&lt;/p&gt;
&lt;p&gt;These variables can be declared in Stan with the following block. We need to declare the variable type (e.g. real or integer, similarly to programming languages such as C++) and for vectors we need to declare their length (hence the need for the two scalar variables &lt;code&gt;N&lt;/code&gt; and &lt;code&gt;J&lt;/code&gt;). Note that variables can be given lower and upper bounds. See the Stan reference manual for more information on the variable types.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;data {
  int&amp;lt;lower=1&amp;gt; N;            //number of observations
  real RT[N];                //reaction times

  int&amp;lt;lower=0,upper=9&amp;gt; Days[N];   //predictor (days of sleep deprivation)

  // grouping factor
  int&amp;lt;lower=1&amp;gt; J;                   //number of subjects
  int&amp;lt;lower=1,upper=J&amp;gt; Subject[N];  //subject id
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;parameters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Parameters&lt;/h2&gt;
&lt;p&gt;Here is the parameter block. Stan will draw samples from the posterior distribution of all the parameters listed here. Note that for parameters representing standard deviations it is necessary to set the lower bound to 0 (variances and standard deviations cannot be negative). This is equivalent to estimating the logarithm of the standard deviation (which can be either positive or negative) and exponentiating it before computing the likelihood (because &lt;span class=&#34;math inline&#34;&gt;\(e^x&amp;gt;0\)&lt;/span&gt; for any &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt;). Note that we also have one parameter for the standard deviation of the residual errors (which was implicit in &lt;code&gt;lme4&lt;/code&gt;).
The random effects are parametrized by a 2 x &lt;code&gt;J&lt;/code&gt; random-effect matrix &lt;code&gt;z_u&lt;/code&gt;, and by the Cholesky factor of the correlation matrix &lt;code&gt;L_u&lt;/code&gt;. I have also added the transformed parameters block, where the Cholesky factor is first pre-multiplied by the diagonal matrix formed by the vector of random-effect standard deviations &lt;code&gt;sigma_u&lt;/code&gt;, and then multiplied by the random-effect matrix, to obtain a random-effects matrix with the intended correlations, which will be used in the model block below to compute the likelihood of the data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;parameters {
  vector[2] beta;                   // fixed-effects parameters
  real&amp;lt;lower=0&amp;gt; sigma_e;            // residual std
  vector&amp;lt;lower=0&amp;gt;[2] sigma_u;       // random effects standard deviations

  // declare L_u to be the Choleski factor of a 2x2 correlation matrix
  cholesky_factor_corr[2] L_u;

  matrix[2,J] z_u;                  // random effect matrix
}

transformed parameters {
  // this transform random effects so that they have the correlation
  // matrix specified by the correlation matrix above
  matrix[2,J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u;

}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model&lt;/h2&gt;
&lt;p&gt;Finally the model block. Here we can define priors for the parameters, and then write the likelihood of the data given the parameters. The likelihood function corresponds to the model equation we saw before.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;model {
  real mu; // conditional mean of the dependent variable

  //priors
  L_u ~ lkj_corr_cholesky(1.5); // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0,2);
  sigma_e ~ normal(0, 5);       // prior for residual standard deviation
  beta[1] ~ normal(0.3, 0.5);   // prior for fixed-effect intercept
  beta[2] ~ normal(0.2, 2);     // prior for fixed-effect slope

  //likelihood
  for (i in 1:N){
    mu = beta[1] + u[1,Subject[i]] + (beta[2] + u[2,Subject[i]])*Days[i];
    RT[i] ~ normal(mu, sigma_e);
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the correlation matrix, the Stan manual suggests using an LKJ prior&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;. This prior has a single shape parameter, &lt;span class=&#34;math inline&#34;&gt;\(\eta\)&lt;/span&gt;: if you set &lt;span class=&#34;math inline&#34;&gt;\(\eta=1\)&lt;/span&gt; then you have effectively a uniform prior distribution over any (Cholesky factor of a) 2x2 correlation matrix. For values &lt;span class=&#34;math inline&#34;&gt;\(\eta&amp;gt;1\)&lt;/span&gt; you get instead a more conservative prior, with a mode at the identity matrix (where the correlations are 0). For more information about the LKJ prior see page 556 of the Stan reference manual, version 2.17.0, and also &lt;a href=&#34;http://www.psychstatistics.com/2014/12/27/d-lkj-priors/&#34;&gt;this page&lt;/a&gt; for an intuitive demonstration.&lt;/p&gt;
&lt;p&gt;Importantly, I have used (weakly) informative priors for the fixed-effect estimates. We know from the literature that simple reaction times are around 300 ms, hence the prior for the intercept, which represents the average reaction time at Day 0, i.e. before sleep deprivation. We expect reaction times to increase with sleep deprivation, so for the slope I have used a Gaussian prior centered on a small positive value (0.2 seconds), which represents the increase in reaction times with each day of sleep deprivation, with a very broad standard deviation (2 seconds), which could accommodate also negative or very different slope values if needed. It may be useful to visualize the priors with a plot.
&lt;img src=&#34;http://mlisi.xyz/post/2018-03-4-Bayesian-multilevel-models-R-Stan_files/figure-html/fig1-1.png&#34; width=&#34;624&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;generated-quantities&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Generated quantities&lt;/h2&gt;
&lt;p&gt;Finally, we can add one last block to the model file, to store for each sampling iteration the correlation matrix of the random effects, which can be computed by multiplying the Cholesky factor with its transpose.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;generated quantities {
  matrix[2, 2] Omega;
  Omega = L_u * L_u&amp;#39;; // so that it returns the correlation matrix
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;estimating-the-model&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Estimating the model&lt;/h1&gt;
&lt;p&gt;Having written all the above blocks in a separate text file (I called it “sleep_model.stan”), we can call Stan from R with the following commands. I run 4 independent chains (each chain is a stochastic process which sequentially generates random values; they are called &lt;em&gt;chains&lt;/em&gt; because each sample depends on the previous one), each for 2000 samples. The first 1000 samples are the &lt;em&gt;warmup&lt;/em&gt; (sometimes called &lt;em&gt;burn-in&lt;/em&gt;), which are intended to allow the sampling process to settle into the posterior distribution; these samples will not be used for inference. Each chain is independent from the others, therefore having multiple chains is also useful to check convergence (e.g. by looking at whether all chains converged to the same region of the parameter space). Additionally, having multiple chains allows one to compute a statistic which is also used to check convergence: this is called &lt;span class=&#34;math inline&#34;&gt;\(\hat R\)&lt;/span&gt; and it is based on the ratio of the between-chain variance to the within-chain variance. If the sampling has converged then &lt;span class=&#34;math inline&#34;&gt;\({\hat R} \approx 1 \pm 0.01\)&lt;/span&gt;.
When we call the function &lt;code&gt;stan&lt;/code&gt;, it will compile a C++ program which produces samples from the joint posterior of the parameters using a powerful variant of MCMC sampling, called &lt;em&gt;Hamiltonian Monte Carlo&lt;/em&gt; (see &lt;a href=&#34;http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/&#34;&gt;here&lt;/a&gt; for an intuitive explanation of the sampling algorithm).&lt;/p&gt;
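&lt;p&gt;To give an idea of how &lt;span class=&#34;math inline&#34;&gt;\(\hat R\)&lt;/span&gt; works, here is a simplified version (a Python sketch; the statistic actually computed by Stan also splits each chain in half):&lt;/p&gt;

```python
import numpy as np

def rhat_basic(chains):
    # chains: (n_chains, n_samples) array of post-warmup draws for one parameter
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled posterior-variance estimate
    return float(np.sqrt(var_plus / W))      # approaches 1 at convergence
```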
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rstan)
options(mc.cores = parallel::detectCores())  # tell Stan to use multiple cores if available
sleep_model &amp;lt;- stan(file = &amp;quot;sleep_model.stan&amp;quot;, data = d_stan, 
    iter = 2000, chains = 4)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One way to check the convergence of the model is to plot the chain of samples. They should look like a &lt;em&gt;“fat, hairy caterpillar which does not bend”&lt;/em&gt; &lt;span class=&#34;citation&#34;&gt;(Sorensen, Hohenstein, and Vasishth 2016)&lt;/span&gt;, suggesting that the sampling was stable at the posterior.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;traceplot(sleep_model, pars = c(&amp;quot;beta&amp;quot;), inc_warmup = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://mlisi.xyz/post/2018-03-4-Bayesian-multilevel-models-R-Stan_files/figure-html/fig2-1.png&#34; width=&#34;480&#34; /&gt;
There is a &lt;code&gt;print()&lt;/code&gt; method for visualising the estimates of the parameters. The values of the &lt;span class=&#34;math inline&#34;&gt;\({\hat R}\)&lt;/span&gt; (&lt;code&gt;Rhat&lt;/code&gt;) statistic also confirm that the chains converged. The method automatically reports credible intervals for the parameters (computed with the percentile method from the samples of the posterior distribution).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(sleep_model, pars = c(&amp;quot;beta&amp;quot;), probs = c(0.025, 0.975), 
    digits = 3)
Inference for Stan model: sleep_model_v1.
5 chains, each with iter=6000; warmup=3000; thin=1; 
post-warmup draws per chain=3000, total post-warmup draws=15000.

         mean se_mean    sd  2.5% 97.5% n_eff Rhat
beta[1] 0.255       0 0.006 0.243 0.268  6826    1
beta[2] 0.011       0 0.001 0.008 0.013  7830    1

Samples were drawn using NUTS(diag_e) at Sat Sep 22 17:15:42 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And we can visualize the posterior distributions as histograms (here for the fixed-effects parameters and the standard deviations of the corresponding random effects).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(sleep_model, plotfun = &amp;quot;hist&amp;quot;, pars = c(&amp;quot;beta&amp;quot;, &amp;quot;sigma_u&amp;quot;))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;http://mlisi.xyz/post/2018-03-4-Bayesian-multilevel-models-R-Stan_files/figure-html/fig3-1.png&#34; width=&#34;384&#34; /&gt;
Finally, we can also examine the correlation matrix of random-effects.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(sleep_model, pars = c(&amp;quot;Omega&amp;quot;), digits = 3)
Inference for Stan model: sleep_model_v1.
5 chains, each with iter=6000; warmup=3000; thin=1; 
post-warmup draws per chain=3000, total post-warmup draws=15000.

            mean se_mean    sd   2.5%   25%   50%   75% 97.5% n_eff  Rhat
Omega[1,1] 1.000     NaN 0.000  1.000 1.000 1.000 1.000 1.000   NaN   NaN
Omega[1,2] 0.221   0.007 0.344 -0.546 0.011 0.251 0.467 0.807  2228 1.001
Omega[2,1] 0.221   0.007 0.344 -0.546 0.011 0.251 0.467 0.807  2228 1.001
Omega[2,2] 1.000   0.000 0.000  1.000 1.000 1.000 1.000 1.000   160 1.000

Samples were drawn using NUTS(diag_e) at Sat Sep 22 17:15:42 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;Rhat&lt;/code&gt; value for the first entry of the correlation matrix is NaN. This is expected for variables that remain constant during sampling. We can check that this variable resulted in a series of identical values during sampling with the following command&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all(unlist(extract(sleep_model, pars = &amp;quot;Omega[1,1]&amp;quot;)) == 1)  # all values are =1 ?
[1] TRUE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s all! You can check for yourself that the values are sufficiently similar to what we would obtain using &lt;code&gt;lmer&lt;/code&gt;, and experiment with how the estimates change when more informative priors are used. For more examples of how to fit linear mixed-effects models using Stan I recommend the article by Sorensen &lt;span class=&#34;citation&#34;&gt;(Sorensen, Hohenstein, and Vasishth 2016)&lt;/span&gt;, which also shows how to implement &lt;em&gt;crossed&lt;/em&gt; random effects of subjects and items (words), as is conventional in linguistics.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Bates2014&#34;&gt;
&lt;p&gt;Bates, D, M Maechler, B Bolker, and S Walker. 2014. “lme4: Linear mixed-effects models using Eigen and S4.” R package version 1.1-7. &lt;a href=&#34;http://cran.r-project.org/package=lme4&#34;&gt;http://cran.r-project.org/package=lme4&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Belenky2003&#34;&gt;
&lt;p&gt;Belenky, Gregory, Nancy J Wesensten, David R Thorne, Maria L Thomas, Helen C Sing, Daniel P Redmond, Michael B Russo, and J Balkin, Thomas. 2003. “Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study.” &lt;em&gt;Journal of Sleep Research&lt;/em&gt; 12 (1): 1–12. &lt;a href=&#34;https://doi.org/10.1046/j.1365-2869.2003.00337.x&#34;&gt;https://doi.org/10.1046/j.1365-2869.2003.00337.x&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Sorensen2016&#34;&gt;
&lt;p&gt;Sorensen, Tanner, Sven Hohenstein, and Shravan Vasishth. 2016. “Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists.” &lt;em&gt;The Quantitative Methods for Psychology&lt;/em&gt; 12 (3): 175–200. &lt;a href=&#34;https://doi.org/10.20982/tqmp.12.3.p175&#34;&gt;https://doi.org/10.20982/tqmp.12.3.p175&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;The LKJ prior is named after the authors, see: Lewandowski, D., Kurowicka, D., and Joe, H. (2009). Generating random correlation matrices based on vines and extended onion method. &lt;em&gt;Journal of Multivariate Analysis&lt;/em&gt;, 100:1989–2001&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Simulating correlated variables with the Cholesky factorization</title>
      <link>http://mlisi.xyz/post/simulating-correlated-variables-with-the-cholesky-factorization/</link>
      <pubDate>Sun, 21 Jan 2018 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/simulating-correlated-variables-with-the-cholesky-factorization/</guid>
      <description>


&lt;p&gt;Generating random variables with a given variance-covariance matrix can be useful for many purposes. For example, it is useful for generating random intercepts and slopes with given correlations when simulating a multilevel, or mixed-effects, model (e.g. see &lt;a href=&#34;https://rpubs.com/adrbart/random_slope_simulation&#34;&gt;here&lt;/a&gt;). This can be achieved efficiently with the &lt;a href=&#34;https://en.wikipedia.org/wiki/Cholesky_decomposition&#34;&gt;Cholesky factorization&lt;/a&gt;. In linear algebra, a factorization or decomposition is the expression of a matrix as a product of matrices. More specifically, the Cholesky factorization decomposes a positive-definite, symmetric&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; matrix into the product of a triangular matrix and its conjugate transpose; in other words, it is a method for finding the &lt;em&gt;square root&lt;/em&gt; of a matrix. The square root of a matrix &lt;span class=&#34;math inline&#34;&gt;\(C\)&lt;/span&gt; is another matrix &lt;span class=&#34;math inline&#34;&gt;\(L\)&lt;/span&gt; such that &lt;span class=&#34;math inline&#34;&gt;\({L^T}L = C\)&lt;/span&gt;.&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Suppose you want to create 2 variables, having a Gaussian distribution, and a positive correlation, say &lt;span class=&#34;math inline&#34;&gt;\(0.7\)&lt;/span&gt;. The first step is to define the correlation matrix
&lt;span class=&#34;math display&#34;&gt;\[C = \left( {\begin{array}{*{20}{c}}
1&amp;amp;{0.7}\\
{0.7}&amp;amp;1
\end{array}} \right)\]&lt;/span&gt;
Elements on the diagonal can be understood as the correlation of each variable with itself, and are therefore 1, while elements off the diagonal specify the desired correlation. In &lt;span class=&#34;math inline&#34;&gt;\(\textsf{R}\)&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;C &amp;lt;- matrix(c(1,0.7,0.7,1),2,2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next one can use the &lt;code&gt;chol()&lt;/code&gt; function to compute the Cholesky factor. (The function provides the upper triangular square root of &lt;span class=&#34;math inline&#34;&gt;\(C\)&lt;/span&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;L &amp;lt;- chol(C)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you multiply the transpose of &lt;span class=&#34;math inline&#34;&gt;\(L\)&lt;/span&gt; by &lt;span class=&#34;math inline&#34;&gt;\(L\)&lt;/span&gt; you get back the original correlation matrix (&lt;span class=&#34;math inline&#34;&gt;\(\textsf{R}\)&lt;/span&gt; output below).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;t(L) %*% L
     [,1] [,2]
[1,]  1.0  0.7
[2,]  0.7  1.0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we need another matrix with the desired standard deviations on the diagonal (in this example I choose 1 and 2)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tau &amp;lt;- diag(c(1,2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Multiply that matrix by the lower triangular square root of the correlation matrix (obtained by taking the transpose of &lt;span class=&#34;math inline&#34;&gt;\(L\)&lt;/span&gt;)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Lambda &amp;lt;- tau %*% t(L)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can generate values for 2 independent random variables &lt;span class=&#34;math inline&#34;&gt;\(z\sim\cal N\left( {0,1} \right)\)&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;Z &amp;lt;- rbind(rnorm(1e4),rnorm(1e4))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, to introduce the correlations, multiply them by the matrix &lt;code&gt;Lambda&lt;/code&gt; obtained above&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;X &amp;lt;- Lambda %*% Z&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now plot the results
&lt;img src=&#34;http://mlisi.xyz/post/2018-01-21-generating-correlated-random-variables-with-the-cholesky-factorization_files/figure-html/fig1-1.png&#34; width=&#34;729.6&#34; /&gt;
We can verify that the correlation estimated from the sample is close to the generative value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# correlation in the generated sample
cor(X[1,],X[2,])
[1] 0.7093591&lt;/code&gt;&lt;/pre&gt;
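The whole recipe can also be sketched outside of R; here is a minimal Python/NumPy version of the same steps, included purely as an illustration (note that `np.linalg.cholesky` returns the *lower* triangular factor, so no transpose is needed):

```python
import numpy as np

rng = np.random.default_rng(1)

C = np.array([[1.0, 0.7],
              [0.7, 1.0]])            # target correlation matrix
tau = np.diag([1.0, 2.0])             # desired standard deviations

L = np.linalg.cholesky(C)             # lower triangular factor: L @ L.T == C
Lam = tau @ L                         # scale by the standard deviations

Z = rng.standard_normal((2, 100_000)) # independent N(0, 1) variables
X = Lam @ Z                           # correlated sample

print(np.corrcoef(X))                 # off-diagonal entries close to 0.7
print(X.std(axis=1))                  # close to (1, 2)
```

The result matches the R pipeline above: the sample correlation is close to 0.7 and the marginal standard deviations are close to 1 and 2.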
&lt;div id=&#34;why-does-it-work&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Why does it work?&lt;/h1&gt;
&lt;p&gt;The covariance matrix of the initial, uncorrelated sample is &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{E} \left( Z Z^T \right) = I\)&lt;/span&gt;, that is the identity matrix, since they have zero mean and unit variance &lt;span class=&#34;math inline&#34;&gt;\(z\sim\cal N\left( {0,1} \right)\)&lt;/span&gt;&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Let’s suppose that the desired covariance matrix is &lt;span class=&#34;math inline&#34;&gt;\(\Sigma\)&lt;/span&gt;; since it is symmetric and positive-definite it is possible to obtain the Cholesky factorization &lt;span class=&#34;math inline&#34;&gt;\(L{L^T} = \Sigma\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;If we then compute a new random vector as &lt;span class=&#34;math inline&#34;&gt;\(X=LZ\)&lt;/span&gt;, we have that its covariance matrix is
&lt;span class=&#34;math display&#34;&gt;\[
\begin{align}
\mathbb{E} \left(XX^T\right) &amp;amp;= \mathbb{E} \left((LZ)(LZ)^T \right) \\
&amp;amp;= \mathbb{E} \left(LZ Z^T L^T\right) \\
&amp;amp;= L \mathbb{E} \left(ZZ^T \right) L^T \\
&amp;amp;= LIL^T = LL^T = \Sigma \\
\end{align}
\]&lt;/span&gt;
Therefore the new random vector &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; has the covariance matrix &lt;span class=&#34;math inline&#34;&gt;\(\Sigma\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The third step is justified because the expected value is a linear operator, so that &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{E}(cX) = c\mathbb{E}(X)\)&lt;/span&gt;. Note also that &lt;span class=&#34;math inline&#34;&gt;\((AB)^T = B^T A^T\)&lt;/span&gt;: the order of the factors reverses.&lt;/p&gt;
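The derivation can be checked numerically; a short Python sketch with an arbitrary covariance matrix Σ (the particular values are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # arbitrary symmetric positive-definite Sigma
L = np.linalg.cholesky(Sigma)           # L @ L.T == Sigma

Z = rng.standard_normal((2, 200_000))   # E(Z Z^T) = I (zero mean, unit variance)
X = L @ Z                               # new random vector

print(np.cov(X))                        # approaches Sigma as the sample grows
```

The sample covariance of `X` converges to `Sigma`, as the derivation predicts.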
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Actually, the Cholesky factorization can be obtained for all &lt;a href=&#34;https://en.wikipedia.org/wiki/Hermitian_matrix&#34;&gt;&lt;em&gt;Hermitian&lt;/em&gt;&lt;/a&gt; matrices. Hermitian matrices are a complex extension of real symmetric matrices. A symmetric matrix is one that is equal to its transpose, which implies that its entries are symmetric with respect to the diagonal. In a Hermitian matrix, symmetric entries with respect to the diagonal are complex conjugates, i.e. they have the same real part, and an imaginary part with equal magnitude but opposite sign. For example, the complex conjugate of &lt;span class=&#34;math inline&#34;&gt;\(x+iy\)&lt;/span&gt; is &lt;span class=&#34;math inline&#34;&gt;\(x-iy\)&lt;/span&gt; (or, equivalently, &lt;span class=&#34;math inline&#34;&gt;\(re^{i\theta}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(re^{-i\theta}\)&lt;/span&gt;). Real symmetric matrices can be considered a special case of Hermitian matrices in which the imaginary component &lt;span class=&#34;math inline&#34;&gt;\(y\)&lt;/span&gt; (or &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt;) is &lt;span class=&#34;math inline&#34;&gt;\(0\)&lt;/span&gt;.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Note that I am using the convention of the &lt;span class=&#34;math inline&#34;&gt;\(\textsf{R}\)&lt;/span&gt; software, where the function &lt;code&gt;chol()&lt;/code&gt;, which computes the factorization, returns the &lt;em&gt;upper triangular&lt;/em&gt; factor of the Cholesky decomposition. It is perhaps more commonly assumed that the Cholesky decomposition returns the &lt;em&gt;lower triangular&lt;/em&gt; factor &lt;span class=&#34;math inline&#34;&gt;\(L\)&lt;/span&gt;, in which case &lt;span class=&#34;math inline&#34;&gt;\(L{L^T} = C\)&lt;/span&gt;.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;More generally the variance-covariance matrix is &lt;span class=&#34;math inline&#34;&gt;\(\Sigma = \mathbb{E}\left( {X{X^T}} \right) - \mathbb{E}\left( X \right) \mathbb{E}\left(X \right)^T\)&lt;/span&gt;. &lt;span class=&#34;math inline&#34;&gt;\(\mathbb{E}\)&lt;/span&gt; indicates the expected value.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Multi-model estimation of psychophysical parameters</title>
      <link>http://mlisi.xyz/post/model-averaging/</link>
      <pubDate>Fri, 08 Dec 2017 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/model-averaging/</guid>
      <description>


&lt;p&gt;In the study of human perception we often need to measure how sensitive an observer is to a stimulus variation, and how their sensitivity changes with the context or with experimental manipulations. In many applications this can be done by estimating the slope of the psychometric function&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;, a parameter that relates to the precision with which the observer can make judgements about the stimulus. A psychometric function is generally characterized by 2-3 parameters: the slope, the threshold (or criterion), and an optional lapse parameter, which indicates the rate at which attention lapses (i.e. &lt;em&gt;stimulus-independent&lt;/em&gt; errors) occur.&lt;/p&gt;
&lt;p&gt;As an example, consider the situation where an observer is asked to judge whether a signal (it can be anything, from the orientation angle of a line on a screen, or the pitch of a tone, to the speed of a car or the approximate number of people in a crowd) is above or below a given reference value, call it zero. The experimenter presents the observer with many signals of different intensities, and the observer responds by making a binary choice (larger/smaller than the reference), under two different contextual conditions (before/after having a pint, with different headphones, etc.). These two conditions are expected to result in different sensitivities, and the experimenter is interested in estimating as precisely as possible the difference in sensitivity&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;. The psychometric function for one observer in the two conditions might look like this (figure below).&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://mlisi.xyz/img/psyfun.png&#34; alt=&#34;Psychometric functions. Each point is a response (0 or 1; some vertical jitter is added for clarity), and the lines represent the fitted psychometric model (here a cumulative Gaussian psychometric function). The two facets of the plot represent the two different conditions. It can be seen that precision seems to differ across conditions: judgements made under condition ‘2’ are more variable, indicating reduced sensitivity.&#34; style=&#34;width:70.0%&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Psychometric functions. Each point is a response (0 or 1; some vertical jitter is added for clarity), and the lines represent the fitted psychometric model (here a cumulative Gaussian psychometric function). The two facets of the plot represent the two different conditions. It can be seen that precision seems to differ across conditions: judgements made under condition ‘2’ are more variable, indicating reduced sensitivity.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Our focus is on the psychometric slope, and we are not really interested in measuring the lapse rate; however it is still important to take lapses into account: it has been shown that not accounting for lapses can have a large influence on the estimates of the slope &lt;span class=&#34;citation&#34;&gt;(Wichmann and Hill 2001)&lt;/span&gt;.&lt;/p&gt;
&lt;div id=&#34;the-problem-with-lapses&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The problem with lapses&lt;/h3&gt;
&lt;p&gt;Different observers may lapse at quite different rates, and for some of them the lapse rate is probably so small that it can be considered negligible. Also, we usually don’t have hypotheses about lapses, or about whether they should or should not vary across conditions.
We can base our analysis on different assumptions about when observers may have attention lapses:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;they may never lapse (or they do so with a small, negligible frequency);&lt;/li&gt;
&lt;li&gt;they may lapse at a fairly large rate, but the rate is assumed constant across conditions (reasonable, especially if conditions are randomly interleaved);&lt;/li&gt;
&lt;li&gt;they may lapse with variable rate across conditions.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These assumptions will lead to three different psychometric models. The number can increase if we also consider different functional forms for the relationship between stimulus and choice; here for simplicity I will consider only psychometric models based on the cumulative Gaussian function (equivalent to a &lt;em&gt;probit&lt;/em&gt; analysis),
&lt;span class=&#34;math inline&#34;&gt;\(\Phi (\frac{x-\mu}{\sigma}) = \frac{1}{2}\left[ {1 + {\rm{erf}}\left( {\frac{{x - \mu }}{{\sigma \sqrt 2 }}} \right)} \right]\)&lt;/span&gt;,
where the mean &lt;span class=&#34;math inline&#34;&gt;\(\mu\)&lt;/span&gt; would correspond to the threshold parameter, &lt;span class=&#34;math inline&#34;&gt;\(\sigma\)&lt;/span&gt; to the slope, and &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; is the stimulus intensity.
In our case the first assumption (&lt;em&gt;zero lapses&lt;/em&gt;) would lead to the simplest psychometric model
&lt;span class=&#34;math display&#34;&gt;\[
\Psi (x, \mu_i, \sigma_i)= \Phi (\frac{x-\mu_i}{\sigma_i})
\]&lt;/span&gt;
where the subscript &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; indicates that the values of both mean &lt;span class=&#34;math inline&#34;&gt;\(\mu_i\)&lt;/span&gt; and slope &lt;span class=&#34;math inline&#34;&gt;\(\sigma_i\)&lt;/span&gt; are specific to the condition &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt;.
The second assumption (&lt;em&gt;fixed lapse rate&lt;/em&gt;) could correspond to the model
&lt;span class=&#34;math display&#34;&gt;\[
\Psi (x, \mu_i, \sigma_i, \lambda)= \lambda + (1-2\lambda) \Phi (\frac{x-\mu_i}{\sigma_i})
\]&lt;/span&gt;
where the parameter &lt;span class=&#34;math inline&#34;&gt;\(\lambda\)&lt;/span&gt; corresponds to the probability of the observer making a random error. Note that this is assumed to be fixed with respect to the condition (no subscript).
Finally, the last assumption (&lt;em&gt;variable lapse rate&lt;/em&gt;) suggests the model
&lt;span class=&#34;math display&#34;&gt;\[
\Psi (x, \mu_i, \sigma_i, \lambda_i)= \lambda_i + (1-2\lambda_i) \Phi (\frac{x-\mu_i}{\sigma_i})
\]&lt;/span&gt;
where all the parameters are allowed to vary between conditions.&lt;/p&gt;
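The three variants differ only in how the lapse parameter enters the function. A minimal sketch in Python (the names `Phi` and `Psi` are mine, not from the simulation code below):

```python
from math import erf, sqrt

def Phi(x, mu, sigma):
    """Cumulative Gaussian: probability of responding 'larger'."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def Psi(x, mu, sigma, lam=0.0):
    """Psychometric function with lapse rate lam.

    lam = 0           -> the zero-lapse model;
    one shared lam    -> the fixed-lapse model;
    per-condition lam -> the variable-lapse model.
    """
    return lam + (1.0 - 2.0 * lam) * Phi(x, mu, sigma)

print(Psi(0.0, 0.0, 1.0))         # 0.5 at threshold, regardless of lam
print(Psi(10.0, 0.0, 1.0, 0.05))  # saturates at 1 - lam = 0.95
```

The lapse rate compresses the function's range from [0, 1] to [λ, 1 − λ], which is what makes ignoring lapses distort the slope estimate.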
&lt;p&gt;We thus have three different models, but no prior information to decide which one is more likely to be correct in our case. Also, we acknowledge that there are individual differences, and each observer in our sample may conform to one of the three assumptions with equal probability. Hence, ideally, we would like to find a way to deal with lapses, and to find the best estimates of the slope values &lt;span class=&#34;math inline&#34;&gt;\(\sigma_i\)&lt;/span&gt;, without committing to one of the three models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;multi-model-inference&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Multi-model inference&lt;/h1&gt;
&lt;p&gt;One possible solution to this problem is provided by a &lt;em&gt;multi-model&lt;/em&gt;, or model averaging, approach &lt;span class=&#34;citation&#34;&gt;(Burnham and Anderson 2002)&lt;/span&gt;. This requires calculating the &lt;a href=&#34;https://en.wikipedia.org/wiki/Akaike_information_criterion&#34;&gt;AIC (Akaike Information Criterion)&lt;/a&gt;&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; for each model and subject, and then combining the estimates according to the Akaike weights of the models. To compute the Akaike weights one typically proceeds by first transforming the AIC values into differences with respect to the AIC of the best candidate model (i.e. the one with the lowest AIC)
&lt;span class=&#34;math display&#34;&gt;\[
{\Delta _m} = {\rm{AI}}{{\rm{C}}_m} - \min {\rm{AIC}}
\]&lt;/span&gt;
From the differences in AIC, we can obtain an estimate of the relative likelihood of the model &lt;span class=&#34;math inline&#34;&gt;\(m\)&lt;/span&gt; given the data
&lt;span class=&#34;math display&#34;&gt;\[
\mathcal{L} \left( {m|{\rm{data}}} \right) \propto \exp \left( { - \frac{1}{2}{\Delta _m}} \right)
\]&lt;/span&gt;
Then, to obtain the Akaike weight &lt;span class=&#34;math inline&#34;&gt;\(w_m\)&lt;/span&gt; of the model &lt;span class=&#34;math inline&#34;&gt;\(m\)&lt;/span&gt;, the relative likelihoods are normalized (divided by their sum)
&lt;span class=&#34;math display&#34;&gt;\[
{w_m} = \frac{{\exp \left( { - \frac{1}{2}{\Delta _m}} \right)}}{{\mathop \sum \limits_{k = 1}^K \exp \left( { - \frac{1}{2}{\Delta _k}} \right)}}
\]&lt;/span&gt;
Finally, one can compute the model-averaged estimate of the parameter&lt;a href=&#34;#fn4&#34; class=&#34;footnote-ref&#34; id=&#34;fnref4&#34;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\hat {\bar \sigma}\)&lt;/span&gt;, by combining the estimate of each model according to their Akaike weight
&lt;span class=&#34;math display&#34;&gt;\[
\hat {\bar \sigma} = \sum\limits_{k = 1}^K {{w_k}\hat \sigma_k } 
\]&lt;/span&gt;&lt;/p&gt;
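The weighting steps above amount to only a few lines of code. A minimal Python sketch, with made-up AIC values and slope estimates used purely for illustration:

```python
import numpy as np

aic = np.array([210.3, 208.1, 211.9])     # hypothetical AICs of the 3 models
sigma_hat = np.array([1.10, 1.25, 1.32])  # hypothetical slope estimates

delta = aic - aic.min()                   # differences from the best model
w = np.exp(-0.5 * delta)                  # relative likelihoods
w /= w.sum()                              # Akaike weights (sum to 1)

sigma_avg = np.dot(w, sigma_hat)          # model-averaged estimate
print(w, sigma_avg)
```

The lowest-AIC model receives the largest weight, and the averaged estimate always lies within the range spanned by the individual models' estimates.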
&lt;/div&gt;
&lt;div id=&#34;simulation-results&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Simulation results&lt;/h1&gt;
&lt;p&gt;Model averaging seems a sensible approach to deal with the uncertainty about which form of the model is best suited to our data. To see whether it is worth the extra work of fitting 3 models instead of just one, I ran a simulation in which I repeatedly fit the three models and compared their estimates with the model-averaged estimate, for different sample sizes. In all the simulations, each observer is generated by randomly drawing parameters from a Gaussian distribution that summarizes the distribution of the parameters in the population. Hence, I know the &lt;em&gt;true&lt;/em&gt; difference in sensitivity in the population, and by simulating and fitting the models I can test which estimation procedure is more &lt;em&gt;efficient&lt;/em&gt;. In statistics, a procedure or estimator is said to be more efficient than another when it provides a better estimate with the same number of, or fewer, observations. The notion of “better” clearly relies on the choice of a cost function, which can be, for example, the mean squared error (as it is here).&lt;/p&gt;
&lt;p&gt;Additionally, in my simulations each simulated observer could, &lt;em&gt;with equal probability&lt;/em&gt; &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{3}\)&lt;/span&gt;, either never lapse, lapse at a constant rate across conditions, or lapse at a higher rate in the more difficult condition (condition ‘2’, where the judgements are less precise). The lapse rates were drawn uniformly from the interval [0.01, 0.1], and could reach 0.15 in condition ‘2’. Each simulated observer ran 250 trials per condition (similar to the figure at the top of this page). I simulated datasets from &lt;span class=&#34;math inline&#34;&gt;\(n=5\)&lt;/span&gt; to &lt;span class=&#34;math inline&#34;&gt;\(n=50\)&lt;/span&gt;, using 100 iterations for each sample size (only 85 in the case of &lt;span class=&#34;math inline&#34;&gt;\(n=50\)&lt;/span&gt;, because the simulation was taking too long and I needed my laptop for other stuff). For simplicity I assumed that the parameters were not correlated across observers&lt;a href=&#34;#fn5&#34; class=&#34;footnote-ref&#34; id=&#34;fnref5&#34;&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;. I also had my simulated observers use the same criterion across the two conditions, although this may not necessarily be true.
The quantity of interest here is the difference in slope between the two conditions, that is &lt;span class=&#34;math inline&#34;&gt;\(\sigma_2 - \sigma_1\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;First, I examined the mean squared error of each of the models’ estimates, and of the model-averaged estimate. This is the average squared difference between the estimate and the true value.
&lt;img src=&#34;http://mlisi.xyz/img/mse.png&#34; alt=&#34;&#34; style=&#34;width:60.0%&#34; /&gt;
The results show (unless my colour blindness fooled me) that the model-averaged estimate always attains the smallest error. Note also that the error tends to decrease exponentially with the sample size. Interestingly, the worst model seems to be the one that allows the lapse rate to vary across conditions. This may be because the change in the lapse rate across conditions was, when present, relatively small, but also because this model has a larger number of parameters, and thus produces more variable estimates (that is, with higher standard errors) than smaller models. Indeed, given that I know the ‘true’ values of the parameters in this simulation setting, I can divide the error into its two subcomponents, variance and bias (see &lt;a href=&#34;http://scott.fortmann-roe.com/docs/BiasVariance.html&#34;&gt;this page&lt;/a&gt; for a nice introduction to the bias-variance tradeoff). The bias is the difference between the expected estimate (averaged over many repetitions/iterations) of the same model and the true quantity that we want to estimate. The variance is simply the variability of the model estimates, i.e. how much they oscillate around the expected estimate.&lt;/p&gt;
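This decomposition, mean squared error = bias² + variance, can be verified numerically on any collection of repeated estimates; a small Python sketch with a simulated biased, noisy estimator (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
true_value = 1.0
# repeated estimates from a hypothetical estimator with bias 0.1 and noise sd 0.2
estimates = true_value + 0.1 + 0.2 * rng.standard_normal(10_000)

mse = np.mean((estimates - true_value) ** 2)
bias = np.mean(estimates) - true_value
variance = np.var(estimates)

print(mse, bias**2 + variance)  # the two quantities agree exactly
```

The identity holds exactly for any sample (up to floating-point rounding), which is what allows the error curves to be split into the separate bias and variance plots below.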
&lt;p&gt;Here is a plot of the variance. Indeed, it can be seen that the variable-lapse model, which has more parameters, is the one that produces the most variable estimates. There is, however, little difference between the other two models’ estimates and the multi-model estimate
&lt;img src=&#34;http://mlisi.xyz/img/variance.png&#34; alt=&#34;&#34; style=&#34;width:60.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And here is the bias. This is very satisfactory, as it shows that while all individual models produced biased estimates, the bias of the model-averaged estimates is zero, or very close to zero.
&lt;img src=&#34;http://mlisi.xyz/img/bias.png&#34; alt=&#34;&#34; style=&#34;width:60.0%&#34; /&gt;
In sum, by averaging models of different levels of complexity according to their relative likelihood, I was able to simultaneously minimize the variance and reduce the bias of my estimates, and achieve greater efficiency. Model averaging seems to be the ideal procedure in this specific setting, where each observer belongs to one of the three categories (i.e., conforms to one of the three assumptions) with equal probability. However, I think (although I haven’t checked) that it would perform well even in cases where a single “type” of observer is largely predominant over the others.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;code&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Code&lt;/h1&gt;
&lt;p&gt;The (clumsily written) code for the simulations is shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# This load some handy functions that are required below
library(RCurl)
script &amp;lt;- getURL(&amp;quot;https://raw.githubusercontent.com/mattelisi/miscR/master/miscFunctions.R&amp;quot;, ssl.verifypeer = FALSE)
eval(parse(text = script))

set.seed(1)

# sim parameters
n_sim &amp;lt;- 100
sample_sizes &amp;lt;- seq(5, 100, 5)

# parameters
R &amp;lt;- 3 # range of signal levels (-R, R)
n_trial &amp;lt;- 500
mu_par &amp;lt;- c(0, 0.25) # population (mean, std.)
sigma_par &amp;lt;- c(1, 0.25)
sigmaDiff_par &amp;lt;- c(1, 0.5)
lapse_range &amp;lt;- c(0.01, 0.1)

# start
res &amp;lt;- {}
for(n_subjects in sample_sizes){
for(iteration in 1:n_sim){  
    
    # make dataset
    d &amp;lt;- {}
    for(i in 1:n_subjects){
        d_ &amp;lt;- data.frame(x=runif(n_trial)*2*R-R, 
                    condition=as.factor(rep(1:2,n_trial/2)), 
                    id=i, r=NA)

        r_i &amp;lt;- runif(1) # draw observer type (wrt lapses)

        if(r_i&amp;lt;1/3){
            # no lapses
            par1 &amp;lt;- c(rnorm(1,mu_par[1],mu_par[2]), 
                abs(rnorm(1,sigma_par[1],sigma_par[2])),
                0)
            par2 &amp;lt;- c(par1[1], 
                par1[2]+abs(rnorm(1,sigmaDiff_par[1],sigmaDiff_par[2])),
                0) 

        }else if(r_i&amp;gt;=1/3 &amp;amp; r_i&amp;lt;2/3){
            # fixed lapses
            l_i &amp;lt;- runif(1)*diff(lapse_range) + lapse_range[1]
            par1 &amp;lt;- c(rnorm(1,mu_par[1],mu_par[2]), 
                abs(rnorm(1,sigma_par[1],sigma_par[2])),
                l_i)
            par2 &amp;lt;- c(par1[1], 
                par1[2]+abs(rnorm(1,sigmaDiff_par[1],sigmaDiff_par[2])), 
                l_i) 

        }else{
            # varying lapses
            l_i_1 &amp;lt;- runif(1)*diff(lapse_range) + lapse_range[1]
            l_i_2 &amp;lt;- l_i_1 + (runif(1)*diff(lapse_range) + lapse_range[1])/2
            par1 &amp;lt;- c(rnorm(1,mu_par[1],mu_par[2]), 
                abs(rnorm(1,sigma_par[1],sigma_par[2])),
                l_i_1)
            par2 &amp;lt;- c(par1[1], 
                par1[2]+abs(rnorm(1,sigmaDiff_par[1],sigmaDiff_par[2])), 
                l_i_2) 
        }

        ## simulate observer
        for(i in 1:sum(d_$condition==&amp;quot;1&amp;quot;)){
            d_$r[d_$condition==&amp;quot;1&amp;quot;][i] &amp;lt;- rbinom(1,1,
                psy_3par(d_$x[d_$condition==&amp;quot;1&amp;quot;][i],par1[1],par1[2],par1[3]))
        }
        for(i in 1:sum(d_$condition==&amp;quot;2&amp;quot;)){
            d_$r[d_$condition==&amp;quot;2&amp;quot;][i] &amp;lt;- rbinom(1,1,
                psy_3par(d_$x[d_$condition==&amp;quot;2&amp;quot;][i],par2[1],par2[2],par2[3]))
        }
        d &amp;lt;- rbind(d,d_)
    }
    
    
    ## model fitting

    # lapse assumed to be 0
    fit0 &amp;lt;- {}
    for(j in unique(d$id)){
        m0 &amp;lt;- glm(r~x*condition, family=binomial(probit),d[d$id==j,])
        sigma_1 &amp;lt;- 1/coef(m0)[2]
        sigma_2 &amp;lt;- 1/(coef(m0)[2] + coef(m0)[4])
        fit0 &amp;lt;- rbind(fit0, data.frame(id=j, sigma_1, sigma_2, 
                loglik=logLik(m0), aic=AIC(m0), model=&amp;quot;zero_lapse&amp;quot;) )
    }
    
    # fix lapse rate 
    start_p &amp;lt;- c(rep(c(0,1),2), 0)
    l_b &amp;lt;- c(rep(c(-5, 0.05),2), 0)
    u_b &amp;lt;- c(rep(c(5, 20), 2), 0.5)
    fit1 &amp;lt;- {}

    for(j in unique(d$id)){
        ftm &amp;lt;- optimx::optimx(par = start_p, lnorm_3par_multi , 
                d=d[d$id==j,],  method=&amp;quot;bobyqa&amp;quot;, 
                lower =l_b, upper =u_b)
        
        negloglik &amp;lt;- ftm$value
        aic &amp;lt;- 2*5 + 2*negloglik
        # fitted parameters are the first n numbers of optimx output
        sigma_1&amp;lt;-unlist(ftm [1,2])
        sigma_2&amp;lt;-unlist(ftm [1,4])
        fit1 &amp;lt;- rbind(fit1, data.frame(id=j, sigma_1, sigma_2, 
                loglik=-negloglik, aic, model=&amp;quot;fix_lapse&amp;quot;)  )
    }
    
    
    # varying lapse rate
    start_p &amp;lt;- c(0,1, 0)
    l_b &amp;lt;- c(-5, 0.05, 0)
    u_b &amp;lt;- c(5, 20, 0.5)
    fit2 &amp;lt;- {}
    for(j in unique(d$id)){
        # fit condition 1
        ftm &amp;lt;- optimx::optimx(par = start_p, lnorm_3par , 
                d=d[d$id==j &amp;amp; d$condition==&amp;quot;1&amp;quot;,],  
                method=&amp;quot;bobyqa&amp;quot;, lower =l_b, upper =u_b) 
        negloglik_1 &amp;lt;- ftm$value; sigma_1 &amp;lt;- unlist(ftm [1,2])
        # fit condition 2
        ftm &amp;lt;- optimx::optimx(par = start_p, lnorm_3par , 
                d=d[d$id==j &amp;amp; d$condition==&amp;quot;2&amp;quot;,],  
                method=&amp;quot;bobyqa&amp;quot;, lower =l_b, upper =u_b) 
        negloglik_2 &amp;lt;- ftm$value; sigma_2 &amp;lt;- unlist(ftm [1,2])

        aic &amp;lt;- 2*6 + 2*(negloglik_1 + negloglik_2)
        fit2 &amp;lt;- rbind(fit2, data.frame(id=j, sigma_1, sigma_2, 
                loglik=-negloglik_1-negloglik_2, aic, model=&amp;quot;var_lapse&amp;quot;))
    }
    
    # compute estimates of the change in slope
    effect_0 &amp;lt;- mean((fit0$sigma_2-fit0$sigma_1))
    effect_1 &amp;lt;- mean((fit1$sigma_2-fit1$sigma_1))
    effect_2 &amp;lt;- mean((fit2$sigma_2-fit2$sigma_1))
    
    effect_av &amp;lt;- {}
    for(j in unique(fit0$id)){
        dj &amp;lt;- rbind(fit0[fit0$id==j,], fit1[fit1$id==j,], fit2[fit2$id==j,])
        min_aic &amp;lt;- min(dj$aic)
        dj$delta &amp;lt;- dj$aic - min_aic
        den &amp;lt;- sum(exp(-0.5*c(dj$delta)))
        dj$w &amp;lt;- exp(-0.5*dj$delta) / den
        effect_av &amp;lt;- c(effect_av, sum((dj$sigma_2-dj$sigma_1) * dj$w))
    }
    effect_av &amp;lt;- mean(effect_av)
    
    # store results
    res &amp;lt;- rbind(res, data.frame(effect_0, effect_1, effect_2, 
            effect_av, effect_true=sigmaDiff_par[1], 
            n_subjects, n_trial, iteration))

}
}

## PLOT RESULTS
library(ggplot2)
library(reshape2)

res$err0 &amp;lt;- (res$effect_0 -1)^2
res$err1 &amp;lt;- (res$effect_1 -1)^2
res$err2 &amp;lt;- (res$effect_2 -1)^2
res$errav &amp;lt;- (res$effect_av -1)^2

# plot MSE
ares &amp;lt;- aggregate(cbind(err0,err1,err2,errav)~n_subjects, res, mean)
ares &amp;lt;- melt(ares, id.vars=c(&amp;quot;n_subjects&amp;quot;))
levels(ares$variable) &amp;lt;- c(&amp;quot;no lapses&amp;quot;, &amp;quot;fixed lapse rate&amp;quot;, &amp;quot;variable lapse rate&amp;quot;, &amp;quot;model averaged&amp;quot;)
ggplot(ares,aes(x=n_subjects, y=value, color=variable))+geom_line(size=1)+nice_theme+scale_color_brewer(palette=&amp;quot;Dark2&amp;quot;,name=&amp;quot;model&amp;quot;)+labs(x=&amp;quot;number of subjects&amp;quot;,y=&amp;quot;mean squared error&amp;quot;)+geom_hline(yintercept=0,lty=2,size=0.2)

# plot variance
ares &amp;lt;- aggregate(cbind(effect_0,effect_1,effect_2,effect_av)~n_subjects, res, var)
ares &amp;lt;- melt(ares, id.vars=c(&amp;quot;n_subjects&amp;quot;))
levels(ares$variable) &amp;lt;- c(&amp;quot;no lapses&amp;quot;, &amp;quot;fixed lapse rate&amp;quot;, &amp;quot;variable lapse rate&amp;quot;, &amp;quot;model averaged&amp;quot;)
ggplot(ares,aes(x=n_subjects, y=value, color=variable))+geom_line(size=1)+nice_theme+scale_color_brewer(palette=&amp;quot;Dark2&amp;quot;,name=&amp;quot;model&amp;quot;)+labs(x=&amp;quot;number of subjects&amp;quot;,y=&amp;quot;variance&amp;quot;)

# plot bias
ares &amp;lt;- aggregate(cbind(effect_0,effect_1,effect_2,effect_av)~n_subjects, res, mean)
ares &amp;lt;- melt(ares, id.vars=c(&amp;quot;n_subjects&amp;quot;))
levels(ares$variable) &amp;lt;- c(&amp;quot;no lapses&amp;quot;, &amp;quot;fixed lapse rate&amp;quot;, &amp;quot;variable lapse rate&amp;quot;, &amp;quot;model averaged&amp;quot;)
ares$value &amp;lt;- ares$value -1
ggplot(ares,aes(x=n_subjects, y=value, color=variable))+geom_hline(yintercept=0,lty=2,size=0.2)+geom_line(size=1)+nice_theme+scale_color_brewer(palette=&amp;quot;Dark2&amp;quot;,name=&amp;quot;model&amp;quot;)+labs(x=&amp;quot;number of subjects&amp;quot;,y=&amp;quot;bias&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
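&lt;p&gt;The model-averaging step inside the loop above boils down to the Akaike-weight formula &lt;span class=&#34;math inline&#34;&gt;\(w_i = e^{-\Delta_i/2} / \sum_j e^{-\Delta_j/2}\)&lt;/span&gt;, with &lt;span class=&#34;math inline&#34;&gt;\(\Delta_i = \rm{AIC}_i - \min_j \rm{AIC}_j\)&lt;/span&gt;. As an illustration, here is the same computation as a stand-alone function (a Python sketch mirroring the R loop, not part of the original script):&lt;/p&gt;

```python
import math

def akaike_weights(aics):
    """Akaike weights: w_i = exp(-delta_i/2) / sum_j exp(-delta_j/2),
    where delta_i = AIC_i - min(AIC)."""
    min_aic = min(aics)
    rel_lik = [math.exp(-0.5 * (a - min_aic)) for a in aics]  # relative likelihoods
    total = sum(rel_lik)
    return [r / total for r in rel_lik]
```

&lt;p&gt;The model-averaged effect is then the weighted sum of the per-model estimates, exactly as in the &lt;code&gt;effect_av&lt;/code&gt; computation above.&lt;/p&gt;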
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Burnham2002&#34;&gt;
&lt;p&gt;Burnham, Kenneth P., and David R. Anderson. 2002. &lt;em&gt;Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach&lt;/em&gt;. 2nd edition. New York, US: Springer New York. &lt;a href=&#34;https://doi.org/10.1007/b97636&#34;&gt;https://doi.org/10.1007/b97636&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Wichmann2001&#34;&gt;
&lt;p&gt;Wichmann, F. A., and N. J. Hill. 2001. “The psychometric function: I. Fitting, sampling, and goodness of fit.” &lt;em&gt;Perception &amp;amp; Psychophysics&lt;/em&gt; 63 (8): 1293–1313. &lt;a href=&#34;http://www.ncbi.nlm.nih.gov/pubmed/11800458&#34;&gt;http://www.ncbi.nlm.nih.gov/pubmed/11800458&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;The psychometric function is a statistical model that predicts the probability of the observer’s response (e.g. “stimulus A has a larger/smaller intensity than stimulus B”), conditional on the stimulus and the experimental condition.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;A good experimenter should do that (estimate the size of the difference). A “bad” experimenter might just be interested in obtaining &lt;span class=&#34;math inline&#34;&gt;\(p&amp;lt;.05\)&lt;/span&gt;. See &lt;a href=&#34;http://cerco.ups-tlse.fr/-Charte-statistique-?lang=fr&#34;&gt;this page&lt;/a&gt;, compiled by Jean-Michel Hupé, for some references and guidelines against &lt;span class=&#34;math inline&#34;&gt;\(p\)&lt;/span&gt;-hacking and the misuse of statistical tools in neuroscience.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;The AIC of a model is computed as &lt;span class=&#34;math inline&#34;&gt;\({\rm{AIC}} = 2k - 2\log \left( \mathcal{L} \right)\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; is the number of free parameters, and &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{L}\)&lt;/span&gt; is the maximum value of the likelihood function of that model.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn4&#34;&gt;&lt;p&gt;Here it is a parameter common to all models. See Burnham &amp;amp; Anderson’s book for methods to deal with other situations &lt;span class=&#34;citation&#34;&gt;(Burnham and Anderson 2002)&lt;/span&gt;.&lt;a href=&#34;#fnref4&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn5&#34;&gt;&lt;p&gt;Such correlation, when present, can be modelled using a mixed-effects approach. See my tutorial on mixed-effects models in the &lt;a href=&#34;http://mattelisi.github.io/#notes&#34;&gt;‘misc’&lt;/a&gt; section of this website.&lt;a href=&#34;#fnref5&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Listing&#39;s law, and the mathematics of the eyes</title>
      <link>http://mlisi.xyz/post/listing-s-law-and-the-mathematics-of-the-eyes/</link>
      <pubDate>Wed, 27 Sep 2017 00:00:00 +0000</pubDate>
      <guid>http://mlisi.xyz/post/listing-s-law-and-the-mathematics-of-the-eyes/</guid>
      <description>


&lt;p&gt;&lt;em&gt;Brief intro to the mathematical formalism used to describe rotations of the eyes in 3D (including the torsional component).&lt;/em&gt;
&lt;img src=&#34;http://mlisi.xyz/img/3Deyecoord_Haslwanter1995.png&#34; alt=&#34;3D eye coordinate systems in the primary reference position, left panel, and after a leftward rotation, right panel (Haslwanter 1995). &#34; /&gt;&lt;/p&gt;
&lt;p&gt;The human eye is approximately a sphere about 23 mm in diameter, and mechanically it behaves like a ball in a ball-and-socket joint. Because there is a functionally distinguished axis - the visual axis, that is the line of gaze, or more precisely the imaginary straight line passing through both the center of the pupil and the center of the fovea - eye movements are usually divided into &lt;em&gt;gaze direction&lt;/em&gt; and &lt;em&gt;cyclotorsion&lt;/em&gt; (or simply &lt;em&gt;torsion&lt;/em&gt;): gaze direction refers to the direction of the visual axis, while torsion indicates the rotation of the eyeball about that axis. Modern video-based eyetrackers can record movements of the visual axis, but they do not provide data about torsion. It turns out that there is an elegant mathematical relationship that constrains the torsion of the eye for every direction of gaze.
This relationship is known as Listing’s law, named after the German mathematician &lt;a href=&#34;https://en.wikipedia.org/wiki/Johann_Benedict_Listing&#34;&gt;Johann Benedict Listing (1808-1882)&lt;/a&gt;. Listing’s law is best understood by looking at how the 3D orientation of the eye can be formally described.&lt;/p&gt;
&lt;div id=&#34;mathematics-of-3d-eye-movements&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Mathematics of 3D eye movements&lt;/h1&gt;
&lt;p&gt;3D eye position can be specified by characterising the 3D rotation that brings the eye to the current position from an arbitrary reference or &lt;em&gt;primary&lt;/em&gt; position&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;, typically defined as the position the eye assumes when looking straight ahead with the head in a normal, upright position. This rotation can be described by the 3-by-3 rotation matrix &lt;span class=&#34;math inline&#34;&gt;\(\bf{R}\)&lt;/span&gt;, which describes the rotation of three-dimensional coordinates by a certain angle about a certain axis. To define this matrix formally, consider the coordinate system &lt;span class=&#34;math inline&#34;&gt;\(\{ \vec{h}_1,\vec{h}_2,\vec{h}_3 \}\)&lt;/span&gt; (a coordinate system is defined by a set of linearly independent vectors; e.g. here &lt;span class=&#34;math inline&#34;&gt;\(\vec{h}_1 = (1,0,0)\)&lt;/span&gt;, corresponding to the &lt;span class=&#34;math inline&#34;&gt;\(x\)&lt;/span&gt; axis) as the &lt;em&gt;head-centered&lt;/em&gt; coordinate system, where the axis &lt;span class=&#34;math inline&#34;&gt;\(\vec{h}_1\)&lt;/span&gt; corresponds to the visual axis when the eye is in the reference position, and let &lt;span class=&#34;math inline&#34;&gt;\(\{\vec{e}_1,\vec{e}_2,\vec{e}_3\}\)&lt;/span&gt; be an &lt;em&gt;eye-centered&lt;/em&gt; coordinate system, where &lt;span class=&#34;math inline&#34;&gt;\(\vec{e}_1\)&lt;/span&gt; always corresponds to the visual axis, regardless of the orientation of the eye (see the figure at the top of this page). Any orientation of the eye can be described by a matrix &lt;span class=&#34;math inline&#34;&gt;\(\bf{R}\)&lt;/span&gt; such that
&lt;span class=&#34;math display&#34;&gt;\[
{{\vec{e}}_i} = {\bf{R}} {{\vec{h}}_i}
\]&lt;/span&gt;
where &lt;span class=&#34;math inline&#34;&gt;\(i=1,2,3\)&lt;/span&gt;. This rotation matrix is straightforward for 1D rotations. For example, a purely horizontal rotation of an angle &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; around the axis &lt;span class=&#34;math inline&#34;&gt;\(\vec{h}_3\)&lt;/span&gt; is formulated as
&lt;span class=&#34;math display&#34;&gt;\[
\bf{R}_3 \left( \theta  \right) = \left( {\begin{array}{*{20}{c}}
{\cos \theta }&amp;amp;{ - \sin \theta }&amp;amp;0\\
{\sin \theta }&amp;amp;{\cos \theta }&amp;amp;0\\
0&amp;amp;0&amp;amp;1
\end{array}} \right)
\]&lt;/span&gt;
The first two columns of the matrix indicate the new coordinates of the first (i.e., &lt;span class=&#34;math inline&#34;&gt;\(\vec{h}_1\)&lt;/span&gt;) and second (&lt;span class=&#34;math inline&#34;&gt;\(\vec{h}_2\)&lt;/span&gt;) basis vectors of the eye-centered coordinate system after the rotation, expressed in the initial head-centered coordinate system. The third basis vector, &lt;span class=&#34;math inline&#34;&gt;\(\vec{h}_3\)&lt;/span&gt;, is the axis of rotation and does not change. Things become more complicated for 3D rotations, i.e. rotations of the fixed eye-centered coordinate system to any new orientation. These can be obtained by calculating a sequence of 3 different rotations about the three fixed axes and multiplying the corresponding matrices: &lt;span class=&#34;math inline&#34;&gt;\(\bf{R} = \bf{R}_3 \left( \theta \right) \bf{R}_2 \left( \phi \right) \bf{R}_1 \left( \psi \right)\)&lt;/span&gt;. Although the first two rotations are sufficient to specify the orientation of the visual axis, the third is necessary to specify the torsional component and thus fully determine the 3D orientation of the eye. Importantly, the order of the three rotations matters - rotations are not commutative, so applying them in a different order yields a different result - and needs to be specified arbitrarily (when specified in this order, the sequence is referred to as a &lt;em&gt;Fick sequence&lt;/em&gt;). This representation of 3D orientations is neither very efficient (9 values, while only 3 are necessary) nor practical for computations; additionally, one needs to arbitrarily define the order of the rotations in the sequence.&lt;/p&gt;
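&lt;p&gt;A minimal numeric sketch of these ideas (plain Python, with the single-axis matrices written under the conventions used here; not code from the original post) makes the non-commutativity easy to verify:&lt;/p&gt;

```python
import math

def R3(theta):
    # rotation about the h3 axis (a purely horizontal gaze rotation)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def R2(phi):
    # rotation about the h2 axis (a purely vertical gaze rotation)
    c, s = math.cos(phi), math.sin(phi)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(A, B):
    # 3x3 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(R, v):
    # rotate the vector v by the matrix R
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]
```

&lt;p&gt;Applying &lt;code&gt;R3(theta)&lt;/code&gt; to &lt;span class=&#34;math inline&#34;&gt;\(\vec{h}_1 = (1,0,0)\)&lt;/span&gt; returns the first column of the matrix, &lt;span class=&#34;math inline&#34;&gt;\((\cos \theta, \sin \theta, 0)\)&lt;/span&gt;, while &lt;code&gt;matmul(R3(t), R2(p))&lt;/code&gt; and &lt;code&gt;matmul(R2(p), R3(t))&lt;/code&gt; give different matrices.&lt;/p&gt;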
&lt;div id=&#34;quaternions-and-rotation-vectors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Quaternions and rotation vectors&lt;/h2&gt;
&lt;p&gt;An alternative way to describe rotations is with &lt;em&gt;quaternions&lt;/em&gt;. Quaternions can be looked upon as four-dimensional vectors, although they are more commonly split into a real scalar part and an imaginary vector part; they are in fact an extension of the complex numbers. They have the form
&lt;span class=&#34;math display&#34;&gt;\[
q_0 + q_1i + q_2j + q_3k = \left( q_0,\vec{q} \cdot \vec{I} \right) = \left( r,\vec{v} \right)
\]&lt;/span&gt;
where
&lt;span class=&#34;math display&#34;&gt;\[
\vec{q} = \left( \begin{array}{*{20}{c}}
{q_1}\\
{q_2}\\
{q_3}
\end{array} \right)
\]&lt;/span&gt;
and
&lt;span class=&#34;math display&#34;&gt;\[
\vec{I} = \left( \begin{array}{*{20}{c}}
{i}\\
{j}\\
{k}
\end{array} \right)
\]&lt;/span&gt;
&lt;span class=&#34;math inline&#34;&gt;\(i,j,k\)&lt;/span&gt; are the quaternion units. These can be multiplied according to the following formula, discovered by &lt;a href=&#34;https://en.wikipedia.org/wiki/William_Rowan_Hamilton&#34;&gt;Hamilton&lt;/a&gt; in 1843
&lt;span class=&#34;math display&#34;&gt;\[
i^2 = j^2 = k^2 = ijk =  - 1
\]&lt;/span&gt;
This formula may seem strange, but it determines all the possible products of &lt;span class=&#34;math inline&#34;&gt;\(i,j,k\)&lt;/span&gt;, such as &lt;span class=&#34;math inline&#34;&gt;\(ij=k\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(ji=-k\)&lt;/span&gt;. Note that the products of the basis elements are not commutative. There is a visual trick to remember the multiplication rules, based on the following diagram:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;http://mlisi.xyz/img/quatrule.png&#34; alt=&#34;Multiplying quaternions. Multiplying two elements in the clockwise direction gives the next element along the same direction (e.g. jk=i). The same is for counter-clockwise directions, except that the result is negative (e.g. kj=-i). &#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Multiplying quaternions. Multiplying two elements in the clockwise direction gives the next element along the same direction (e.g. &lt;span class=&#34;math inline&#34;&gt;\(jk=i\)&lt;/span&gt;). The same holds in the counter-clockwise direction, except that the result is negative (e.g. &lt;span class=&#34;math inline&#34;&gt;\(kj=-i\)&lt;/span&gt;). &lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Quaternions can be used to represent rotations. For example, a rotation of an angle &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; around the axis defined by the unit vector &lt;span class=&#34;math inline&#34;&gt;\(\vec{u} = (u_1, u_2,u_3) = u_1i + u_2j + u_3k\)&lt;/span&gt;&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; can be described by the following quaternion
&lt;span class=&#34;math display&#34;&gt;\[
\cos \frac{\theta}{2} + \sin \frac{\theta}{2}\left( u_1i + u_2j + u_3k \right)
\]&lt;/span&gt;
The direction of the rotation is given by the &lt;a href=&#34;https://en.wikipedia.org/wiki/Right-hand_rule#A_rotating_body&#34;&gt;right-hand rule&lt;/a&gt;.
Successive rotations can be combined using the formula for quaternion multiplication. The product of two quaternions can be computed by multiplying their elements as if they were two polynomials, while keeping track of the ordering of the basis elements, since their multiplication is not commutative. This is a desirable property if we want to represent rotations, which, as seen earlier, are also not commutative. Quaternion multiplication can also be expressed in the modern language of dot and cross products
&lt;span class=&#34;math display&#34;&gt;\[
\left( r_1,\vec{v_1} \right) \left( r_2,\vec{v_2} \right) = 
\left( r_1 r_2 - \vec{v_1} \cdot \vec{v_2},\;\; r_1\vec{v_2} + r_2\vec{v_1} +\vec{v_1} \times \vec{v_2} \right)
\]&lt;/span&gt;
where “&lt;span class=&#34;math inline&#34;&gt;\(\cdot\)&lt;/span&gt;” is the &lt;a href=&#34;https://en.wikipedia.org/wiki/Dot_product&#34;&gt;dot product&lt;/a&gt; and “&lt;span class=&#34;math inline&#34;&gt;\(\times\)&lt;/span&gt;” is the &lt;a href=&#34;https://en.wikipedia.org/wiki/Cross_product&#34;&gt;cross product&lt;/a&gt;.&lt;/p&gt;
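&lt;p&gt;The multiplication rules above are easy to check numerically. As an illustration (a Python sketch; representing a quaternion as a plain tuple &lt;code&gt;(q0, q1, q2, q3)&lt;/code&gt; is an assumption of this example):&lt;/p&gt;

```python
def qmul(a, b):
    """Hamilton product of two quaternions (q0, q1, q2, q3); equivalent to
    (r1, v1)(r2, v2) = (r1*r2 - v1.v2, r1*v2 + r2*v1 + v1 x v2)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,   # scalar part
            w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,   # i component
            w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,   # j component
            w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2)   # k component

# the quaternion units
i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
```

&lt;p&gt;With this, &lt;code&gt;qmul(i, j)&lt;/code&gt; gives &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt;, &lt;code&gt;qmul(j, i)&lt;/code&gt; gives &lt;span class=&#34;math inline&#34;&gt;\(-k\)&lt;/span&gt;, and &lt;code&gt;qmul(i, i)&lt;/code&gt; gives &lt;span class=&#34;math inline&#34;&gt;\(-1\)&lt;/span&gt;, reproducing Hamilton’s rules.&lt;/p&gt;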
&lt;p&gt;In sum, quaternions are pretty useful for computing transformations in 3D. One can use quaternions to combine any sequence of 3D rotations about arbitrary axes (using quaternion multiplication), as well as to rotate any 3D Euclidean vector about any arbitrary axis. A quaternion can also be transformed into a 3D rotation matrix (formula &lt;a href=&#34;https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation#Quaternion-derived_rotation_matrix&#34;&gt;here&lt;/a&gt;), which may then be used in 3D graphics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rotation vectors&lt;/strong&gt; are an even more succinct representation of rotations. Indeed, the scalar component of the quaternion (&lt;span class=&#34;math inline&#34;&gt;\(q_0\)&lt;/span&gt;) does not add any information that is not already in the vector part, so a rotation can be effectively described by just 3 numbers. The rotation vector &lt;span class=&#34;math inline&#34;&gt;\(\vec{r}\)&lt;/span&gt;, which corresponds to a rotation of an angle &lt;span class=&#34;math inline&#34;&gt;\(\theta\)&lt;/span&gt; about an axis &lt;span class=&#34;math inline&#34;&gt;\(\vec{n}\)&lt;/span&gt;, is defined as
&lt;span class=&#34;math display&#34;&gt;\[
\vec{r} = \tan \left( \frac{\theta}{2} \right) \vec{n}
\]&lt;/span&gt;
which can also be written in terms of the equivalent quaternion &lt;span class=&#34;math inline&#34;&gt;\(\textbf{q}\)&lt;/span&gt;
&lt;span class=&#34;math display&#34;&gt;\[
\textbf{q}=\left( q_0, \vec{q} \right) = \left( \cos \left(\frac{\theta}{2}\right), \sin \left(\frac{\theta}{2}\right)\vec{n} \right)
\]&lt;/span&gt;
as
&lt;span class=&#34;math display&#34;&gt;\[ \vec{r} = \frac{\vec{q}}{q_0} \]&lt;/span&gt;.&lt;/p&gt;
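&lt;p&gt;To make the link between quaternions and rotation vectors concrete, here is a short sketch (illustrative Python; the function names are my own):&lt;/p&gt;

```python
import math

def quat_from_axis_angle(n, theta):
    # q = (cos(theta/2), sin(theta/2) * n), with n a unit vector
    s = math.sin(theta / 2)
    return (math.cos(theta / 2), s * n[0], s * n[1], s * n[2])

def rotation_vector(q):
    # r = q_vec / q0, which equals tan(theta/2) * n
    q0, qx, qy, qz = q
    return (qx / q0, qy / q0, qz / q0)
```

&lt;p&gt;For example, for a rotation of 0.8 rad about &lt;span class=&#34;math inline&#34;&gt;\(\vec{n}=(0,1,0)\)&lt;/span&gt;, the second component of the resulting rotation vector matches &lt;span class=&#34;math inline&#34;&gt;\(\tan(0.4)\)&lt;/span&gt; up to floating-point precision, and the other components are zero.&lt;/p&gt;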
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;donders-law-and-listings-law&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Donders’ law and Listing’s law&lt;/h1&gt;
&lt;p&gt;Donders’ law (1848) states that the eye uses only two degrees of freedom while fixating, although mechanically it has three. In other words, the torsional component of an eye movement is not arbitrary: it is uniquely determined by the direction of the visual axis and is independent of previous eye movements. From the material reviewed above it should be clear how any 3D eye orientation can be fully described as a rotation about a given axis from a primary reference position. This also allows formulating Donders’ law more specifically, according to what is known as Listing’s law &lt;span class=&#34;citation&#34;&gt;(Helmholtz 1910; Haustein 1989)&lt;/span&gt;: “&lt;em&gt;There exists a certain eye position from which the eye may reach any other position of fixation by a rotation around an axis perpendicular to the visual axis. This particular position is called primary position&lt;/em&gt;”. This means that &lt;em&gt;all possible eye positions&lt;/em&gt; can be reached from the primary position by a single rotation about an axis perpendicular to the visual axis. Since they are all perpendicular to the visual axis, all rotation axes that satisfy Listing’s law lie in the same plane (&lt;em&gt;Listing’s plane&lt;/em&gt;). The law can be tested with eyetracking equipment that also measures the torsional component (such as scleral coils): results have shown that the standard deviation from Listing’s plane of empirically measured rotation vectors is only about 0.5-1 deg &lt;span class=&#34;citation&#34;&gt;(Haslwanter 1995)&lt;/span&gt;. Formally, for any orientation of the visual axis, defined by the rotation vector &lt;span class=&#34;math inline&#34;&gt;\(\vec{a}\)&lt;/span&gt; and measured from the primary position &lt;span class=&#34;math inline&#34;&gt;\(\vec{h_1}=(1,0,0)\)&lt;/span&gt;,
&lt;span class=&#34;math display&#34;&gt;\[
\vec{h_1} \cdot \vec{a} = 0
\]&lt;/span&gt;
This simply indicates that the rotation about the visual axis is 0 and that, as a consequence, all the rotation axes lie in a frontal plane.&lt;/p&gt;
&lt;p&gt;Going back to the beginning: knowing the coordinates of Listing’s plane, one can compute the rotation vector that corresponds to the current eye position from the recording of the 2D gaze location on a screen. In the simplest case, we assume that the primary position corresponds to the observer fixating the center of the screen, &lt;span class=&#34;math inline&#34;&gt;\((0,0)\)&lt;/span&gt;. What is the rotation vector that describes the 3D eye orientation when the observer fixates the location &lt;span class=&#34;math inline&#34;&gt;\((s_x, s_y)\)&lt;/span&gt;? Say the position on screen is measured in cm, and we know that the distance of the eye from the screen is &lt;span class=&#34;math inline&#34;&gt;\(L\)&lt;/span&gt; cm. The rotation angle can be computed as &lt;span class=&#34;math inline&#34;&gt;\(\theta = \rm{atan} \frac{\sqrt{s_x^2+s_y^2}}{L}\)&lt;/span&gt;, while the angle that defines the orientation of the rotation axis within Listing’s plane is &lt;span class=&#34;math inline&#34;&gt;\(\alpha = \rm{atan2}(s_y,s_x)\)&lt;/span&gt;. The complete rotation vector is then
&lt;span class=&#34;math display&#34;&gt;\[
\vec{r} = \tan \left( \frac{\theta}{2}\right) \cdot \left( {\begin{array}{*{20}{c}}
0\\
{\cos \alpha }\\
{ - \sin \alpha }
\end{array}} \right)
\]&lt;/span&gt;
This vector describes a particular eye position as a rotation from the reference position, and it does not have a torsional component (that is, a component along &lt;span class=&#34;math inline&#34;&gt;\(\vec{h_1}\)&lt;/span&gt;). Indeed, Listing’s law implies that all possible eye positions can be reached from the primary reference position without a torsional component. However, vectors describing rotations from and to positions different from the primary one do &lt;em&gt;not&lt;/em&gt;, in general, lie in Listing’s plane. For Listing’s law to hold, such vectors must lie in a plane whose orientation depends on the current eye position; more specifically, the vector perpendicular to that plane lies exactly halfway between the current and the primary eye position &lt;span class=&#34;citation&#34;&gt;(Tweed and Vilis 1990)&lt;/span&gt;.&lt;/p&gt;
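&lt;p&gt;The computation described above can be sketched as follows (illustrative Python; screen coordinates in cm, with the primary position at the screen center, as assumed in the text):&lt;/p&gt;

```python
import math

def listing_rotation_vector(s_x, s_y, L):
    """Rotation vector of the eye fixating the screen location (s_x, s_y),
    with the eye at distance L from the screen (all in cm)."""
    theta = math.atan(math.hypot(s_x, s_y) / L)   # rotation angle
    alpha = math.atan2(s_y, s_x)                  # axis orientation in Listing's plane
    t = math.tan(theta / 2)
    # no component along h1: the rotation axis lies in Listing's plane
    return (0.0, t * math.cos(alpha), -t * math.sin(alpha))
```

&lt;p&gt;Fixating the screen center gives the zero vector (the primary position), while any other fixation gives a vector with a zero first (torsional) component, as required by Listing’s law.&lt;/p&gt;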
&lt;hr /&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-Haslwanter1995&#34;&gt;
&lt;p&gt;Haslwanter, Thomas. 1995. “Mathematics of three-dimensional eye rotations.” &lt;em&gt;Vision Research&lt;/em&gt; 35 (12): 1727–39. &lt;a href=&#34;https://doi.org/10.1016/0042-6989(94)00257-M&#34;&gt;https://doi.org/10.1016/0042-6989(94)00257-M&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Haustein1989&#34;&gt;
&lt;p&gt;Haustein, Werner. 1989. “Considerations on Listing’s Law and the primary position by means of a matrix description of eye position control.” &lt;em&gt;Biological Cybernetics&lt;/em&gt; 60 (6): 411–20. &lt;a href=&#34;https://doi.org/10.1007/BF00204696&#34;&gt;https://doi.org/10.1007/BF00204696&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Helmholtz1910&#34;&gt;
&lt;p&gt;Helmholtz, Hermann von. 1910. &lt;em&gt;Handbuch der Physiologischen Optik&lt;/em&gt;. Hamburg: Voss.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Tweed1990&#34;&gt;
&lt;p&gt;Tweed, Douglas, and Tutis Vilis. 1990. “Geometric relations of eye position and velocity vectors during saccades.” &lt;em&gt;Vision Research&lt;/em&gt; 30 (1): 111–27. &lt;a href=&#34;https://doi.org/10.1016/0042-6989(90)90131-4&#34;&gt;https://doi.org/10.1016/0042-6989(90)90131-4&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Euler%27s_rotation_theorem&#34;&gt;Euler’s theorem&lt;/a&gt; guarantees that a rigid body can always move from one orientation to any other through a single rotation about a fixed axis.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Saying that &lt;span class=&#34;math inline&#34;&gt;\(\vec{u}\)&lt;/span&gt; is a unit vector indicates that it has length 1, i.e. &lt;span class=&#34;math inline&#34;&gt;\(\left| \vec{u} \right| = \sqrt{u_1^2 + u_2^2 + u_3^2} = 1\)&lt;/span&gt;.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
