Bayesian Data Analysis

Is this thing healthy?

Joshua Wilson Black

joshua.black@canterbury.ac.nz

Te Kāhui Roro Reo | New Zealand Institute of Language, Brain and Behaviour

Te Whare Wānanga o Waitaha | University of Canterbury

Overview

What’s under the hood?
A more complex trap F1 model
- Convergence and visual diagnostics
- Effective sample size and Rhat

What’s under the hood?

Sampling the posterior

Exact mathematical descriptions of the posterior are very very hard.
The trick: take a lot of ‘samples’ from the posterior using a computer.
- This can be done using the prior, the model structure, and the data.
Take enough samples and you have an approximation to the posterior.
Take then in the right way, and you have a good approximation!

‘MCMC’

MCMC stands for “Markov Chain Monte Carlo”.
- Markov Chain: a sequence where the next state is determined entirely by the current state.
- Monte Carlo: The sampling is probabilistic.
  - Monte Carlo has a lot of casinos…
An MCMC ‘chain’ explores the posterior distribution, using a probabilistic method to choose where to go next.

…as opposed to?

‘Grid sampling’: choosing a range of parameter values to sample in advance (too expensive in a complex model).
Mathematical solutions: the calculus is too hard (unless you’re planning to win a Fields Medal)

‘Convergence’

In Bayesian modelling, ‘convergence’ is a feature of MCMC chains.
Every chain should explore the same distribution (convergence)…
- …and it should be the correct distribution.
This is why we need more than one chain.

MCMC intuition

You have a bunch of little critters (at least four)
You send them out to wander around a hilly landscape
They prefer the lowlands, but will very occasionally climb
You keep a record of everywhere they go
More frequently visited places are lower
- Now you have an elevation map!
‘Lower’ areas = high probability regions of the posterior

Back to trap F1

Let’s fit a model

model_formula <- F1_lob2 ~ age * gender + stopword + 
  school_type + art_rate +
  (1|school) + (1|part_id) + 
  (1|word_id) + (1|following)

school_type is a predictor (private/public), individual school is a random intercepts.

A prior

first_prior <- c(
  prior(student_t(3, 0, 2), class = "Intercept"),
  prior(normal(0, 2), class = "sigma"),
  prior(normal(0, 0.5), class = "b"),
  prior(normal(0, 2), coef = "art_rate"),
  prior(normal(0, 2), class = "b", coef = "stopwordTRUE"),
  prior(normal(0, 1), class = "sd"),
  prior(normal(0, 2), class = "sd", group = "following"),
  prior(normal(0, 2), class = "sd", group = "word_id")
)

Fit via MCMC

trap_fit_1 <- brm(
  model_formula,  
  data = qb2,
  prior = first_prior,
  cores = 4,
  threads = 2,
  chains = 4,
  iter = 3000, # 1000 more than default.
  warmup = 1000
)

Output

starting worker pid=43786 on localhost:11527 at 14:42:12.022
starting worker pid=43799 on localhost:11527 at 14:42:12.162
starting worker pid=43812 on localhost:11527 at 14:42:12.275
starting worker pid=43825 on localhost:11527 at 14:42:12.388

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000417 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.17 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 3000 [  0%]  (Warmup)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0.000333 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 3.33 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 3000 [  0%]  (Warmup)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 0.000465 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 4.65 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 3000 [  0%]  (Warmup)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 0.000391 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 3.91 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 3000 [  0%]  (Warmup)
Chain 1: Iteration:  300 / 3000 [ 10%]  (Warmup)
Chain 3: Iteration:  300 / 3000 [ 10%]  (Warmup)
Chain 2: Iteration:  300 / 3000 [ 10%]  (Warmup)
Chain 4: Iteration:  300 / 3000 [ 10%]  (Warmup)
Chain 1: Iteration:  600 / 3000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 3000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 3000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 3000 [ 20%]  (Warmup)
Chain 2: Iteration:  900 / 3000 [ 30%]  (Warmup)
Chain 1: Iteration:  900 / 3000 [ 30%]  (Warmup)
Chain 2: Iteration: 1001 / 3000 [ 33%]  (Sampling)
Chain 4: Iteration:  900 / 3000 [ 30%]  (Warmup)
Chain 1: Iteration: 1001 / 3000 [ 33%]  (Sampling)
Chain 3: Iteration:  900 / 3000 [ 30%]  (Warmup)
Chain 4: Iteration: 1001 / 3000 [ 33%]  (Sampling)
Chain 3: Iteration: 1001 / 3000 [ 33%]  (Sampling)
Chain 2: Iteration: 1300 / 3000 [ 43%]  (Sampling)
Chain 1: Iteration: 1300 / 3000 [ 43%]  (Sampling)
Chain 4: Iteration: 1300 / 3000 [ 43%]  (Sampling)
Chain 3: Iteration: 1300 / 3000 [ 43%]  (Sampling)
Chain 2: Iteration: 1600 / 3000 [ 53%]  (Sampling)
Chain 1: Iteration: 1600 / 3000 [ 53%]  (Sampling)
Chain 4: Iteration: 1600 / 3000 [ 53%]  (Sampling)
Chain 3: Iteration: 1600 / 3000 [ 53%]  (Sampling)
Chain 2: Iteration: 1900 / 3000 [ 63%]  (Sampling)
Chain 1: Iteration: 1900 / 3000 [ 63%]  (Sampling)
Chain 4: Iteration: 1900 / 3000 [ 63%]  (Sampling)
Chain 3: Iteration: 1900 / 3000 [ 63%]  (Sampling)
Chain 2: Iteration: 2200 / 3000 [ 73%]  (Sampling)
Chain 1: Iteration: 2200 / 3000 [ 73%]  (Sampling)
Chain 4: Iteration: 2200 / 3000 [ 73%]  (Sampling)
Chain 3: Iteration: 2200 / 3000 [ 73%]  (Sampling)
Chain 2: Iteration: 2500 / 3000 [ 83%]  (Sampling)
Chain 1: Iteration: 2500 / 3000 [ 83%]  (Sampling)
Chain 4: Iteration: 2500 / 3000 [ 83%]  (Sampling)
Chain 2: Iteration: 2800 / 3000 [ 93%]  (Sampling)
Chain 3: Iteration: 2500 / 3000 [ 83%]  (Sampling)
Chain 1: Iteration: 2800 / 3000 [ 93%]  (Sampling)
Chain 2: Iteration: 3000 / 3000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 54.204 seconds (Warm-up)
Chain 2:                75.637 seconds (Sampling)
Chain 2:                129.841 seconds (Total)
Chain 2: 
Chain 4: Iteration: 2800 / 3000 [ 93%]  (Sampling)
Chain 3: Iteration: 2800 / 3000 [ 93%]  (Sampling)
Chain 1: Iteration: 3000 / 3000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 56.812 seconds (Warm-up)
Chain 1:                77.457 seconds (Sampling)
Chain 1:                134.269 seconds (Total)
Chain 1: 
Chain 4: Iteration: 3000 / 3000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 61.326 seconds (Warm-up)
Chain 4:                74.404 seconds (Sampling)
Chain 4:                135.73 seconds (Total)
Chain 4: 
Chain 3: Iteration: 3000 / 3000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 64.151 seconds (Warm-up)
Chain 3:                74.591 seconds (Sampling)
Chain 3:                138.742 seconds (Total)
Chain 3:

4 ‘workers’, i.e., threads
4 ‘chains’
warmup finetunes the MCMC algorithm
sampling is the actual exploration of the posterior

Default plots

plot(trap_fit_1)

Left: histogram of posterior for coefficient
Right: trace plot of chains
If not a ‘fuzzy worm’, you have bad convergence!
Some more options are in analysis.R

Rhat and ESS

summary(trap_fit_1)

 Family: gaussian 
  Links: mu = identity 
Formula: F1_lob2 ~ age * gender + stopword + school_type + art_rate + (1 | school) + (1 | part_id) + (1 | word_id) + (1 | following) 
   Data: qb2 (Number of observations: 7424) 
  Draws: 4 chains, each with iter = 3000; warmup = 1000; thin = 1;
         total post-warmup draws = 8000

Multilevel Hyperparameters:
~following (Number of levels: 28) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.60      0.28     1.12     2.21 1.00     1954     3510

~part_id (Number of levels: 43) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.22      0.03     0.17     0.30 1.00     2293     3806

~school (Number of levels: 21) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.05      0.04     0.00     0.15 1.00     1727     3092

~word_id (Number of levels: 557) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.23      0.03     0.17     0.29 1.00     2473     3556

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept               1.11      0.36     0.39     1.81 1.01      566     1174
age36M45               -0.34      0.16    -0.66    -0.03 1.00     2285     3694
age46M55               -0.29      0.13    -0.54    -0.04 1.00     1700     3122
age56M65               -0.11      0.14    -0.39     0.18 1.00     2140     3595
age66M75               -0.20      0.13    -0.46     0.06 1.00     1979     3068
age76M85                0.04      0.15    -0.25     0.33 1.00     1845     2898
genderMale              0.22      0.13    -0.04     0.48 1.00     1824     3216
stopwordTRUE           -0.09      0.08    -0.25     0.06 1.00     1833     3608
school_typePublic      -0.19      0.12    -0.42     0.05 1.00     2313     3830
art_rate               -0.02      0.01    -0.05     0.00 1.00     9169     6023
age36M45:genderMale    -0.35      0.27    -0.87     0.18 1.00     3152     4709
age46M55:genderMale    -0.09      0.22    -0.52     0.34 1.00     2268     3946
age56M65:genderMale    -0.26      0.21    -0.67     0.16 1.00     2306     3959
age66M75:genderMale     0.09      0.19    -0.29     0.47 1.00     2078     3745
age76M85:genderMale    -0.00      0.49    -0.95     0.96 1.00    11618     5746

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.85      0.01     0.83     0.86 1.00     9355     6186

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Rhat: a numerical measure of convergence (should be <= 1.01)

ESS: ‘effective sample size’, i.e., how many times have your critters turned up here?

Bulk_ESS: how well sampled is the middle? (i.e. how reliable is the mean?)

Tail_ESS: how well sampled are the tails? (i.e. how reliable are the CIs?)

Let’s make this model unhealthy

Uh-oh collinearity…

qb2 <- qb2 |> 
  mutate(
    shmarticulation_rate = art_rate + 
      rnorm(n=nrow(qb2), 0, 0.1)
  )

bad_model_formula <- F1_lob2 ~ age * gender + stopword + school_type + 
  art_rate + shmarticulation_rate +
  (1|school) + (1|part_id) + (1|word_id) + (1|following)

Lazy and jumpy critters

trap_fit_1 <- brm(
  model_formula,  
  data = qb2,
  prior = first_prior,
  cores = 4,
  threads = 2,
  chains = 4,
  iter = 3000, # 1000 more than default.
  warmup = 1000
)

Lazy and jumpy critters

trap_fit_nasty <- brm(
  bad_model_formula,  
  data = qb2,
  prior = first_prior,
  cores = 4,
  threads = 2,
  chains = 4,
  iter = 200, # lazy
  control = list(adapt_delta = 0.5) # jumpy
)

Warnings

Warning: There were 11 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems

Warning: The largest R-hat is 1.09, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess

Warnings (cont. 2)

The warning messages, and website links are pretty good!
Divergent transitions: often because your critters are moving too far each step and missing bits of the posterior.
- Typical solution: increase adapt_delta.
ESS: as message says, often fixed by increasing iterations.

Bad trace plot

The chains are wandering around quite independently.
We’ll fix this step-by-step in analysis.R.

Summary

What’s under the hood?
- MCMC sampling
A more complex trap F1 model
- Convergence and visual diagnostics
- Effective sample size and Rhat
- What happens when we break it?

Now and next time

Download GitHub repository: https://github.com/nzilbb/ws-bayes-4
Work through analysis.R
There’s a month until our next meeting
Let me know of any topics you really want to cover
I want to look at:
- Bayesian model comparison
- Bayesian variant of the model-to-PCA workflow
- Examples from papers, incl. how to report.

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2025. rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.

Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.

———. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R Journal 10 (1): 395–411. https://doi.org/10.32614/RJ-2018-017.

———. 2021. “Bayesian Item Response Modeling in R with brms and Stan.” Journal of Statistical Software 100 (5): 1–54. https://doi.org/10.18637/jss.v100.i05.

Gabry, Jonah, and Tristan Mahr. 2025. “bayesplot: Plotting for Bayesian Models.” https://mc-stan.org/bayesplot/.

Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.” J. R. Stat. Soc. A 182: 389–402. https://doi.org/10.1111/rssa.12378.

Kallioinen, Noa, Topi Paananen, Paul-Christian Bürkner, and Aki Vehtari. 2023. “Detecting and Diagnosing Prior and Likelihood Sensitivity with Power-Scaling.” Statistics and Computing 34. https://doi.org/10.1007/s11222-023-10366-5.

Kay, Matthew. 2024. tidybayes: Tidy Data and Geoms for Bayesian Models. https://doi.org/10.5281/zenodo.1308151.

Makowski, Dominique, Mattan S. Ben-Shachar, Brenton M. Wiernik, Indrajeet Patil, Rémi Thériault, and Daniel Lüdecke. 2025. “modelbased: An R Package to Make the Most Out of Your Statistical Models Through Marginal Means, Marginal Effects, and Model Predictions.” Journal of Open Source Software 10 (109): 7969. https://doi.org/10.21105/joss.07969.

Müller, Kirill. 2025. here: A Simpler Way to Find Your Files. https://doi.org/10.32614/CRAN.package.here.

Paananen, Topi, Juho Piironen, Paul-Christian Bürkner, and Aki Vehtari. 2021. “Implicitly Adaptive Importance Sampling.” Statistics and Computing 31: 1–19. https://doi.org/10.1007/s11222-020-09982-2.

R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2024. “Pareto Smoothed Importance Sampling.” Journal of Machine Learning Research 25.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.

Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.