library(tidyverse)
library(here)
library(brms)
theme_set(theme_bw(base_size=20))
qb2 <- read_rds(here('data', 'session_1.rds'))
9 Bayesian Data Analysis
This chapter presents a short introduction to Bayesian Data Analysis for linguistics with sections corresponding to the first four intro workshops offered during NZILBB R, Stats and Open Science sessions.
Section 9.1 provides some orientation to Bayesian methods by first fitting a frequentist version of a simple model and then switching to a model fit using the brms package. There are a few details about what distinguishes Bayesian and frequentist methods in a callout block. We then turn to the ‘posterior distribution’ (Section 9.2), i.e., to the question of model interpretation. We look at how to determine what a Bayesian model tells us after it has seen the data. We then turn to the ‘prior distribution’, which allows the researcher to incorporate relevant information into the model before the model sees the data (Section 9.3). This is a significant difference between Bayesian and non-Bayesian approaches. Finally, we consider various metrics of model health (including, for instance, ‘trace plots’) (Section 9.4).
Further, and more detailed, resources are set out in Section 9.5.
9.1 Orientation
The material in this section corresponds to the slides here.
One easy way to get access to the data used in this section is to start an R project in RStudio, select ‘version control’, and use the GitHub repository https://github.com/nzilbb/ws-bayes-1.
9.1.1 A Silly Model
People from Christchurch are (in)famous for asking people what secondary school they attended. This is typically, I suppose, taken to be an obsession with social status and class. But is there identifiable linguistic variation between people who went to different schools?
Let’s look at data from the QuakeBox corpus which bears on this question (Clark et al. 2016; Walsh et al. 2013).
We’ll first load a few packages and some data.
Figure 9.1 shows trap F1 realisations, normalised via Lobanov normalisation, for all schools which have three or more students in our dataset.
qb2 |>
group_by(school) |>
mutate(
student_n = length(unique(part_id))
) |>
filter(student_n >= 3) |>
ungroup() |>
group_by(part_id) |>
summarise(
F1_lob2 = mean(F1_lob2, na.rm = TRUE),
school = first(school)
) |>
ggplot(
aes(
y = F1_lob2,
x = reorder(school, F1_lob2)
)
) +
geom_boxplot() +
scale_y_reverse() +
theme(
axis.text.x = element_text(size=12, angle = 45, hjust = 1)
) +
labs(
y = "TRAP F1 (Lobanov)",
x = "High school"
)
The patterns in Figure 9.1 are something like what would be expected given social stereotypes. Avonside Girls’ is a state school in the typically less well-off east side of the city, while St Margaret’s is a private school for girls in a very well-off part of the city. Lower trap realisations are, in turn, indicative of more conservative NZE (note the flipped y-axis).
Suppose we wanted to fit a model to determine whether the trap F1 realisations of St Margaret’s and Avonside really are different.1 Let’s do this in a standard, non-Bayesian (i.e. ‘frequentist’), way:
# Create subset of data with only Avonside and St Margaret's
trap_sub <- qb2 |>
filter(
school %in% c(
"Avonside Girls' High School",
"St Margaret's College"
)
) |>
select(
age, school, F1_lob2, part_id
)
trap_fit <- lm(
F1_lob2 ~ school,
data = trap_sub
)
Let’s have a look at the model summary:
summary(trap_fit)
Call:
lm(formula = F1_lob2 ~ school, data = trap_sub)
Residuals:
Min 1Q Median 3Q Max
-2.73538 -0.47347 -0.02352 0.46701 2.95033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.24632 0.02946 8.362 <2e-16 ***
schoolSt Margaret's College 0.47782 0.05258 9.087 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8195 on 1126 degrees of freedom
Multiple R-squared: 0.06832, Adjusted R-squared: 0.0675
F-statistic: 82.58 on 1 and 1126 DF, p-value: < 2.2e-16
This is a fairly standard analysis of variance model. We’re attempting to determine if there is a difference in the mean trap F1 between people who attended one or other of these schools. Looking first to the coefficients table, our (Intercept) estimate, 0.25, indicates the estimated mean value for Avonside alumnae, while the estimate for schoolSt Margaret's College indicates the difference in means between Avonside and St Margaret’s alumnae. Since this is positive, the model says that the trap F1 value for St Margaret’s is higher than for Avonsiders, and thus that their trap realisation is lower in the vowel space. This is conservative for New Zealand English.
The coefficients section also gives an indication of ‘error’ (the Std. Error column) and information about statistical hypothesis tests corresponding to each coefficient (the t value and Pr(>|t|) columns). In this case, we would say that the difference between Avonside and St Margaret’s alumnae, in terms of mean trap production is statistically significant. We’ll see shortly that this is not a core idea when we switch to a Bayesian approach.
Finally, we see some important information about the residuals (i.e., the variation not explained by the model) above the coefficients and some global information, including another statistical test, below the coefficients.
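Incidentally, a two-group model like this is equivalent to a pooled-variance t-test, which can be checked directly. The sketch below uses simulated data as a stand-in for trap_sub (the school labels, means, and standard deviation are illustrative only, not the real values):

```r
# Sketch: a two-level predictor in lm() is equivalent to a
# pooled-variance t-test. Simulated data stand in for trap_sub;
# the means and sd below are illustrative, not the real values.
set.seed(2024)
d <- data.frame(
  school = rep(c("A", "B"), each = 50),
  F1 = c(rnorm(50, mean = 0.25, sd = 0.8), rnorm(50, mean = 0.73, sd = 0.8))
)
fit <- lm(F1 ~ school, data = d)
tt <- t.test(F1 ~ school, data = d, var.equal = TRUE)
# The t statistic for the school coefficient matches the t-test
# statistic (up to sign):
c(lm = coef(summary(fit))["schoolB", "t value"], t_test = unname(tt$statistic))
```

This is why the coefficients table reads so much like a t-test report: the model is doing the same comparison, just expressed as an intercept and a difference.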
A side quest: this subsection is called ‘a silly model’. One way in which it is silly is that the data does not allow us to exclude some very plausible alternative explanations for this difference in trap F1. You can convince yourself of this using the techniques from the Exploratory Data Visualisation chapter. Consider, especially, the distribution of ages across schools.
How do we make this model Bayesian? It’s very easy with the brms package.
library(brms)
trap_fit_b <- brm(
F1_lob2 ~ school,
data = trap_sub
)
In a very narrow sense, becoming Bayesian is often as easy as switching whatever modelling function you are currently using (lm(), lme4::lmer(), mgcv::gam() etc.) to brms::brm(). Often this has the nice consequence of removing any convergence issues you might have been having (especially with complex random effects structures using the lme4 package).2 It is important to be careful here. There are no free rides!
If you run the code block above you will see a series of messages about ‘compiling’ a ‘Stan program’, and ‘sampling’ across a variety of ‘chains’. We’ll discuss the meaning of these outputs later. These messages are the first indication that something quite different is going on when we fit a Bayesian model.
Let’s start by looking at the model summary:
summary(trap_fit_b)
 Family: gaussian
Links: mu = identity
Formula: F1_lob2 ~ school
Data: trap_sub (Number of observations: 1128)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept 0.25 0.03 0.19 0.30 1.00 3518
schoolStMargaretsCollege 0.48 0.05 0.37 0.59 1.00 3405
Tail_ESS
Intercept 2744
schoolStMargaretsCollege 2568
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.82 0.02 0.79 0.86 1.00 4687 2727
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
The coefficients table here is almost identical to the table for trap_fit. There are the same two coefficients, with the same estimates and error (just rounded to two decimal places). The 95% “CI”s are the same as the 95% confidence intervals which would be generated by the error terms in the non-Bayesian model.3
But some differences already emerge: there’s some new language about ‘chains’, ‘iter’, ‘warmup’, and ‘draws’. There are also columns with names containing ‘ESS’, or ‘Effective Sample Size’. Finally, note the absence of any statistical tests.
The model summary undersells some of the key differences between Bayesian and non-Bayesian methods. Two are worth focusing on here:
- Bayesian methods produce distributions for model coefficients rather than point estimates.
- Bayesian “CI”s, variously named ‘credible intervals’ or ‘compatibility intervals’, are interpreted differently from frequentist ‘confidence intervals’.
Let’s consider each of these in turn.
Figure 9.2 shows the probability distribution for the intercept coefficient, i.e., the estimated mean value for Avonside alumnae. From this distribution, we can work out the probability, according to the model and data, that the true mean value is in a given range. The probability distribution is the real estimate derived from the model. The decision to summarise the distribution with a single value is entirely optional and the choice of a summary value is up to us. We can use the median, the mean, the 10% quantile, or whatever else might be useful for a given problem. The summary we have just looked at gives the mean value in the Estimate column.
library(tidybayes) # Tidybayes will be discussed in more detail below.
trap_fit_b |>
spread_draws(b_Intercept) |>
rename(
Avonside = b_Intercept
) |>
ggplot(
aes(
x = Avonside
)
) +
geom_density(alpha=0.4, colour="darkgreen", fill="darkgreen") +
labs(
x = "Avonside coefficient (intercept)"
)
Bayesian CIs, or ‘credible intervals’, naturally emerge from this way of thinking about model estimates. Credible intervals are another way of summarising the probability distributions generated by Bayesian models. In this case, we might want a 95% credible interval centred on the median (i.e., the interval between the 2.5% and the 97.5% quantiles of the distribution). We interpret this interval as having a 95% probability of containing the true mean trap production for Avonsiders.
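Because a credible interval is just a pair of quantiles of the draws, it is easy to compute by hand. The sketch below uses simulated draws as a stand-in for the b_Intercept draws returned by the model; the mean and standard deviation mirror the Estimate and Est.Error in the summary above:

```r
# Sketch: a quantile credible interval is just quantiles of the draws.
# Simulated draws stand in for the model's b_Intercept draws; the mean
# and sd mirror the Estimate and Est.Error values in the summary above.
set.seed(123)
draws <- rnorm(4000, mean = 0.25, sd = 0.03)
mean(draws)                               # the 'Estimate' column
quantile(draws, probs = c(0.025, 0.975))  # a 95% credible interval
```

With a real model, you would apply the same quantile() call to the draws returned by tidybayes::spread_draws().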
There is no particular reason to insist on 95% credible intervals, and in practice a series of ‘widths’ are often reported. One way to do this visually is as follows:
trap_fit_b |>
spread_draws(b_Intercept) |>
rename(
Avonside = b_Intercept
) |>
ggplot(
aes(
x = Avonside
)
) +
stat_halfeye(
slab_alpha = 0.4,
fill = "darkgreen",
slab_colour = "darkgreen",
.width = c(0.5, 0.7, 0.9),
point_interval = "mean_qi",
interval_size_range = c(2, 4)
) +
labs(
x = "Avonside coefficient (intercept)"
)
Figure 9.3 displays both the distribution for the coefficient and an ‘interval plot’ identifying the mean value of the distribution and a series of quantile intervals of different widths centred on the median. The thickest line shows the 50% interval, the next thickest shows the 70% and the thinnest line shows the 90% interval.
Suppose, now, that we are interested in whether Avonside and St Margaret’s alumnae have distinct productions of trap. We can get at this by sampling trap values from our model for both schools and looking at the distribution of the difference between the two.4 Figure 9.4 shows this and adds a vertical dashed line at 0. This allows a visual indication of the compatibility of the model and data with the hypothesis that the difference between schools is 0.
predictions <- tibble(
school = unique(trap_sub$school)
) |>
add_epred_draws(trap_fit_b)
predictions |>
pivot_wider(
names_from = school,
values_from = .epred,
id_cols = .draw
) |>
mutate(
Difference = `St Margaret's College` - `Avonside Girls' High School`
) |>
ggplot(
aes(
x = Difference
)
) +
geom_density(alpha = 0.4, fill = "grey") +
geom_vline(xintercept = 0, linetype="dashed")
The difference in Figure 9.4 is positive when St Margaret’s alumnae have higher trap F1 than Avonside alumnae (i.e., more conservative productions). Rather than a statistical hypothesis test with a p-value, as in the initial model, we have a visual indication of compatibility between a hypothesis we are interested in and the combination of the model and data.5
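The same draws also let us put a number on this compatibility: the proportion of draws in which the difference is positive. The sketch below illustrates the calculation with simulated draws standing in for the epred draws in the predictions object above (the means and standard deviations are illustrative, not the real posterior):

```r
# Sketch: the posterior probability of a positive difference is just the
# proportion of draws above zero. Simulated draws stand in for the
# epred draws from the model; means and sds are illustrative only.
set.seed(1)
avonside <- rnorm(4000, mean = 0.25, sd = 0.03)
st_margarets <- rnorm(4000, mean = 0.73, sd = 0.05)
difference <- st_margarets - avonside
mean(difference > 0)  # proportion of draws with a positive difference
```

With the real model, the same mean(Difference > 0) calculation can be applied to the Difference column computed in the code above.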
It might seem strange that there are (at least) two ‘schools’ of statistics. This callout block is a short discussion of why with an emphasis on how the two approaches differ in their use of probability.
Put simply: for frequentism (the main alternative to Bayesianism), probability quantifies the reliability of the method. For Bayesians, probability quantifies uncertainty. Both of these ideas are well motivated and are unlikely to go away any time soon. Which approach to probability one prefers is largely a matter of one’s philosophical orientation.
‘Frequentist’ and ‘Bayesian’ are terms applied directly to interpretations of probability.6 Frequentist understandings of probability emphasise, in one way or another, the frequency with which events of a certain type occur in a sample. So, for instance, a frequentist understands the claim that there is a 5% probability of drawing a red ball from this bag of balls in terms of the frequency with which red balls are actually (or would be) drawn from the bag. Probability is tied to the practice of sampling and is, according to the frequentist, not a subjective notion. It’s about what does or would happen were you to sample from the relevant population.
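This frequency reading can be made concrete by simulation: draw from the ‘bag’ many times and watch the observed frequency approach the probability. A minimal sketch (the 5% red-ball bag is the example from the paragraph above):

```r
# Sketch: frequentist probability as long-run frequency. We draw 10,000
# balls (with replacement) from a bag in which 5% of balls are red.
set.seed(7)
draws <- sample(c("red", "other"), size = 10000, replace = TRUE,
                prob = c(0.05, 0.95))
mean(draws == "red")  # close to 0.05, and closer still with more draws
```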
Probability statements which are naturally interpreted in frequentist terms appear often in sociolinguistics. If, for instance, I fit a model to some New Zealand English data, I might find that, say, “the probability of starting a conversational interaction with ‘kia ora’ is 10% for female speakers born in the 1970s”. A frequentist understanding of this statement is that if I went out and sampled female speakers born in the 1970s, I would find that 10% of the sample of conversational interactions would begin with “kia ora”.
A ‘Bayesian’ about probability understands probability as expressing degrees of belief, or uncertainty, or something of that sort. Some Bayesians treat this as a subjective notion — i.e., a measure of how strong this or that individual’s beliefs are. But there are more and less ‘subjective’ versions of Bayesianism.
Claims which are naturally interpreted in Bayesian terms are also present in linguistics. For instance, if I am working on language phylogenies I might make a claim like “the probability that language X and language Y are in the same family is 65%”. There is no plausible way to understand this on frequentist terms. What would you sample? You’d have to be able to sample ‘universes’ in which the two languages exist and find that in 65% of universes the two languages are in the same family. It is understandable as a claim about how certain we should be that the two families are related given our current knowledge (in practice for data analysis, by ‘current knowledge’ we mean our model and data).7
Frequentist statisticians use probability to measure the reliability of a statistical method. They ask: how often would I be wrong if I applied this method over and over again? This is where p-values come from. The p-value tells us how often a certain kind of result occurs in applications of a given method. We can then, say, pick a p-value cut off of 0.05 as a decision that we accept that we will wrongly identify an effect in cases where there is no effect 5% of the time. The point is that the probability attaches to applications of the method.
This helps to interpret frequentist confidence intervals. These are often wrongly interpreted as providing, say, a region in which there is a 95% probability of finding the true value. This is what Bayesian ‘credible intervals’ provide, but it is not the frequentist approach. A frequentist confidence interval says that in 95% of cases in which we apply the method of generating confidence intervals, the true value will be inside the interval. This is a claim about repeated applications of the method. When we do this, the actual interval we are talking about will change (i.e., with a slightly different sample you’ll get different confidence intervals). The 95% probability does not apply to the particular interval we generate from our data, but to the whole range of intervals we might get from applying the same method to different samples. The Bayesian credible interval, on the other hand, attaches the probability to this specific interval.
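The coverage interpretation can itself be checked by simulation: apply the interval-generating method to many samples from a population with a known mean and count how often the interval contains the true value. A minimal sketch:

```r
# Sketch: frequentist coverage. We repeatedly sample from a population
# with known mean 0 and check whether each t-based 95% confidence
# interval contains that true mean.
set.seed(42)
true_mean <- 0
covers <- replicate(2000, {
  x <- rnorm(30, mean = true_mean, sd = 1)
  ci <- t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covers)  # close to 0.95: the *method* covers 95% of the time
```

Note that each individual interval either contains 0 or it doesn’t; the 95% describes the long-run behaviour of the procedure, not any one interval.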
Bayesian methods are often defended on the basis that they are less counter-intuitive than frequentist methods. This may be so. But it is important to note that this is only done by changing the topic. We are no longer talking about the reliability of a general method but rather about the compatibility of this specific data set and model with a range of parameter values.
One common charge against Bayesian statistics is that there is something essentially ‘subjective’ about Bayesian methods. It is true that there is something in some sense subjective about Bayesian interpretations of probability and in the use of a ‘prior’ distribution (more later). But frequentist approaches are also subjective, in the sense of requiring researcher judgement in the selection of assumptions or the decision that model diagnostics are ‘good enough’ etc. They just don’t use probability as a way of capturing these features of modelling. For more on the idea of ‘subjectivity’ in statistics see Gelman and Hennig (2017).
All this to say, Bayesianism is not ‘statistics 2.0’. It’s an orientation to data analysis which you might find appealing and is likely to be increasingly popular, such that you ought at least to be able to understand it.
Before getting deep into Bayesian methods, it’s worth getting comfortable with the simple models we’ve looked at so far.
- Make sure you’re set up: run the code blocks above on your own machine.
- Exploratory visualisation refresh: load the data and do some exploratory visualisation. Are there any variables apart from school which might be important?
- Data wrangling refresh: fit a Bayesian model which distinguishes
- Modify the code which generated Figure 9.3 with different values for .width (i.e., different width credible intervals).
- Plot the distribution for the schoolStMargaretsCollege coefficient. Is this the same distribution as in Figure 9.4? If so, why?
9.2 Posterior

9.3 Prior

9.4 Model Health

9.5 Resources

In a real data analysis project, it would be a bad idea to jump straight to a model here. Ignore this for the moment!↩︎
This practical benefit is a significant motivation for many to switch to Bayesian methods.↩︎
My scare quotes here will be explained in a moment.↩︎
For this simple model, the resulting distribution is just the distribution of the schoolStMargaretsCollege coefficient. An exercise below asks you to show this.↩︎
If the model or the data are bad, the distributions estimated by our Bayesian model will also be bad.↩︎
According to the arch-frequentist, Charles Peirce, the Bayesian interpretation of probability could only work “if universes were as plenty as blackberries”.↩︎