Foundations

Session 3: Exploratory Data Visualisation

Joshua Wilson Black

Te Kāhui Roro Reo | New Zealand Institute of Language, Brain and Behaviour

Te Whare Wānanga o Waitaha | University of Canterbury

Overview

Overview

  1. New infrastructure: Markdown
  2. Download some data from OSF.io.
    • Topic: Word usage factors and duration.
    • We’ll use this for the rest of the ‘Foundations’ sessions.
  3. Exploratory visualisation with ggplot2.

Markdown

Markdown

  • Markdown is a ‘markup’ language
    • Like LaTeX or HTML.
  • Markdown was designed to be easy to read (for a human).
  • We combine text in markdown with blocks of code.
  • This is a variety of ‘literate programming’.
  • Markdown documents can be turned into pdf (via LaTeX), HTML, OpenOffice or Word files.
## Markdown

- Markdown is a 'markup' language
    - Like LaTeX or HTML.
- Markdown was designed to be easy to read (for a human).
- We combine text in markdown with blocks of code.
- This is a variety of 'literate programming'.
- Markdown documents can be turned into pdf (via LaTeX), HTML, OpenOffice
or Word files.

<section id="markdown-1" class="slide level2">
<h2>Markdown</h2>
<ul>
<li class="fragment">Markdown is a ‘markup’ language
<ul>
<li class="fragment">Like LaTeX or HTML.</li>
</ul></li>
<li class="fragment">Markdown was designed to be easy to read (for a human).</li>
<li class="fragment">We combine text in markdown with blocks of code.</li>
<li class="fragment">This is a variety of ‘literate programming’.</li>
<li class="fragment">Markdown documents can be turned into pdf (via LaTeX), HTML, OpenOffice or Word files.</li>
</ul>
</section>

Quarto

  • Quarto is a “an open-source scientific and technical publishing system”
  • Developed by Posit (formerly RStudio)
  • It is almost entirely compatible with RMarkdown
  • Quarto is ‘multilingual’ — designed from scratch to work with Python, Julia, etc.
  • Quarto documents allow us to run code blocks and publish the output.

Source view of a Quarto Document open in RStudio.

Tips and tricks

  • RStudio provides a ‘Source’ and a ‘Visual’ mode (toggle at top of source window).
  • The ‘Visual’ mode is more like a traditional word processor.
  • I always use the ‘Source’ mode 🤓
  • Keyboard shortcuts:
    • Ctrl/Cmd + Alt + I (Insert new code block)
    • Ctrl/Cmd + Alt + N (Run next code block.)

NZILBB template

  • Work-in-progress: an NZILBB Quarto template
# Run in terminal
quarto use template JoshuaWilsonBlack/nzilbb_doc

Get some data

Word duration data

  • We will be exploring data from Sóskuthy and Hay (2017)
  • Topic: how word ‘usage factors’ affect word duration.
    • Usage factors include how frequent a word is, how predictable it is in context etc.
    • Upshot: Change in usage factors over long time periods affects word duration.
    • Upshot: There’s some kind of feedback mechanism between production and perception of words.
  • Data is shared via OSF.io

Plotting with ggplot2

The ‘grammar’

  • The ‘gg’ in ggplot2 stands for ‘grammar of graphics’.
  • What’s the grammar?

…a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. (H. Wickham 2016, 4)

The implementation

  1. Data
  2. Aesthetic mappings (from variables to visual properties)
  3. Layers (points, lines, bars…)
  4. Scales and co-ordinates (specify how the mappings work)
  5. Facets (subgroups in data -> subplots)
  6. Theme (e.g., font size, background colour…)

Data

big_dia <- read_csv(here('data', 'big_dia.csv'))

big_dia |> 
  head()
...1 TargetOrthography foll_wf prev_wf Speaker Corpus YOB WordDuration TargetPhonemes dur.context dur.context.avg prev_pred_wf_log foll_pred_wf_log prev_info_wf_avg foll_info_wf_avg syll.length.3 seg.no repeated.20 wclass initial final unigram.google.gb unigram.diff.google.gb rep.times.avg final.prop.log foll.diff prev.diff final.diff dur.context.diff syll.no
1 able to only speaker426 CC 1976 0.2600000 1bP 305 0.3101003 -3.962387 -1.94206 1.986253 2.008305 0.1189995 3 FALSE a FALSE FALSE -8.647486 0.0018382 5.36014 -5.283204 -0.154497 -0.503365 -0.5626083 -0.1447357 2
2 able to were speaker66 CC 1926 0.3400000 1bP 310 0.3101003 -2.403104 -1.94206 1.986253 2.008305 0.1606245 3 FALSE a FALSE FALSE -8.647486 0.0018382 5.36014 -5.283204 -0.154497 -0.503365 -0.5626083 -0.1447357 2
3 able to was speaker256 IA 1921 0.3900000 1bP 310 0.3101003 -2.800220 -1.94206 1.986253 2.008305 0.2388915 3 FALSE a FALSE FALSE -8.647486 0.0018382 5.36014 -5.283204 -0.154497 -0.503365 -0.5626083 -0.1447357 2
4 able to were speaker256 IA 1921 0.3000000 1bP 310 0.3101003 -2.403104 -1.94206 1.986253 2.008305 0.1985703 3 FALSE a FALSE FALSE -8.647486 0.0018382 5.36014 -5.283204 -0.154497 -0.503365 -0.5626083 -0.1447357 2
5 able to be speaker105 CC 1965 0.0521634 1bP 310 0.3101003 -1.467725 -1.94206 1.986253 2.008305 0.1449822 3 FALSE a FALSE FALSE -8.647486 0.0018382 5.36014 -5.283204 -0.154497 -0.503365 -0.5626083 -0.1447357 2
6 able to be speaker288 MU 1894 0.3100000 1bP 310 0.3101003 -1.467725 -1.94206 1.986253 2.008305 0.1735298 3 FALSE a FALSE FALSE -8.647486 0.0018382 5.36014 -5.283204 -0.154497 -0.503365 -0.5626083 -0.1447357 2

Aesthetic mappings

big_dia |> 
  ggplot(
    mapping = aes(
      x = WordDuration,
      fill = Corpus
    )
  )
  • We ‘map’:
    • word duration to position on x-axis.
    • corpus (i.e., which collection of recordings) to ‘fill’
  • No ‘layers’ yet.
  • ggplot2 still produces output…

Aesthetic mappings

Figure 1: Output with only aesthetic mappings specified. ggplot2 sets up a default scale for the x-axis.

Layers

big_dia |> 
  ggplot(
    mapping = aes(
      x = WordDuration,
      fill = Corpus
    )
  ) +
  geom_histogram(bins = 30) 
  • Layers tend to start with geom_.
  • geom_histogram() creates a histogram.
  • Layers have options (arguments) which modify their appearance.

Layers

Figure 2: Histogram with 30 bins and fill for corpus.

Layers (II)

big_dia |> 
  ggplot(
    mapping = aes(
      x = WordDuration,
      fill = Corpus
    )
  ) +
  geom_density(alpha = 0.5)
  • An alternative layer with the same aesthetic mapping.
  • geom_density() creates a density plot.

Layers (II)

Figure 3: Density with 30 bins and fill for corpus.

Layers (III)

big_dia |> 
  ggplot(
    mapping = aes(
      x = WordDuration,
      fill = Corpus,
      colour = Corpus
    )
  ) +
  geom_density(alpha = 0.2)
  • We can go back and change the aesthetic mapping.
  • NB: difference between ‘fill’ and ‘colour’!

Layers (III)

Figure 4: Density with 30 bins and colour and fill for corpus.

Scales

big_dia |> 
  ggplot(
    mapping = aes(
      x = WordDuration,
      fill = Corpus,
      colour = Corpus
    )
  ) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(
    values = c("MU" = "#b120cf", "IA" = "#cfb120", "CC" = "#20cfb1")
  ) +
  scale_colour_manual(
    values = c("MU" = "#b120cf", "IA" = "#cfb120", "CC" = "#20cfb1")
  )
  • Functions which start with scale_ help to specify aesthetic mappings.
  • i.e., this data value goes to that colour, etc.

Scales

Figure 5: Density with 30 bins and colour and fill for corpus. Both colour and fill modified from default scales.

Coordinates

big_dia |> 
  ggplot(
    mapping = aes(
      x = WordDuration,
      fill = Corpus,
      colour = Corpus
    )
  ) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(
    values = c("MU" = "#b120cf", "IA" = "#cfb120", "CC" = "#20cfb1")
  ) +
  scale_colour_manual(
    values = c("MU" = "#b120cf", "IA" = "#cfb120", "CC" = "#20cfb1")
  ) +
  coord_cartesian(xlim = c(-2.5, 2.5))
  • Coordinate systems also determine the mapping.
  • A lot of overlap with scales.

Coordinates

Figure 6: Density with 30 bins and colour and fill for corpus. Both colour and fill modified from default scales. Silly coordinate added.

Facets

big_dia |> 
  ggplot(
    mapping = aes(
      x = WordDuration,
      fill = Corpus,
      colour = Corpus
    )
  ) +
  geom_density(alpha = 0.2) +
  scale_fill_manual(
    values = c("MU" = "#b120cf", "IA" = "#cfb120", "CC" = "#20cfb1")
  ) +
  scale_colour_manual(
    values = c("MU" = "#b120cf", "IA" = "#cfb120", "CC" = "#20cfb1")
  ) +
  facet_grid(vars(final))
  • We use facets to create subplots.
  • Note use of var() function.

Facets

Figure 7: Density with 30 bins and colour and fill for corpus. Faceted by word finality.

The Forbidden Plot

  • Pie charts are out of fashion.
  • I used to (wrongly!) say you can’t even make them in ggplot2.
big_dia |> 
  ggplot(
    aes(
      x = factor(1),
      fill = Corpus
    )
  ) +
  geom_bar(width=1, colour="white") +
  scale_fill_manual(
    values = c("MU" = "#b120cf", "IA" = "#cfb120", "CC" = "#20cfb1")
  ) +
  coord_polar(theta = "y", direction = -1) +
  theme_void()
Figure 8: The forbidden plot.

What to do now

  1. Go to https://nzilbb.github.io/statistics_workshops/
  2. Work through the ‘Exploratory Data Visualisation’ chapter

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.
Müller, Kirill. 2020. here: A Simpler Way to Find Your Files. https://doi.org/10.32614/CRAN.package.here.
R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sóskuthy, Márton, and Jennifer Hay. 2017. “Changing Word Usage Predicts Changing Word Durations in New Zealand English.” Cognition 166 (September): 298–313. https://doi.org/10.1016/j.cognition.2017.05.032.
Wickham, H. 2016. Ggplot2: Elegant Graphics for Data Analysis. Use R! Springer International Publishing. https://books.google.co.nz/books?id=RTMFswEACAAJ.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wilkinson, Leland. 1999. The Grammar of Graphics. Statistics and Computing. New York: Springer. http://digitool.hbz-nrw.de:1801/webclient/DeliveryManager?pid=2966363.
Xie, Yihui. 2014. knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.
———. 2025. knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with kable and Pipe Syntax. https://doi.org/10.32614/CRAN.package.kableExtra.