Unisyn Tagger

The Unisyn Tagger tags word tokens with data from Unisyn, a lexicon produced by the Centre for Speech Technology Research at the University of Edinburgh.

Unisyn is a 'master lexicon' of English, which contains:

  • orthography
  • part-of-speech
  • pronunciation, in an 'accent neutral' form
  • 'enriched orthography' showing morphological information
  • frequency, as derived from various sources, including the British National Corpus, Time articles, Gutenberg, etc.

The pronunciations in the lexicon can be converted into an accent-specific form using Perl scripts that are included with the lexicon.

Getting Unisyn

Unisyn is available under a non-commercial license, and must be acquired separately from this layer manager. To acquire Unisyn, you must first register on the Unisyn website and accept the terms of their license. The Unisyn website is here:
http://www.cstr.ed.ac.uk/projects/unisyn/

(This layer manager has been tested with version 1.3 of Unisyn)

Using Unisyn with this annotator

Once you've got Unisyn, you can use it to produce accent-specific lexicons, and provide these lexicons to the annotator, which then uses them to tag word tokens in your transcripts.

For example, if you want to annotate your transcripts with 'General American English' pronunciations:

  1. Generate the General American English (gam) lexicon by running the following Unisyn commands:
    1. get-exceptions.pl -a gam -f unilex > gam.1
    2. post-lex-rules.pl -a gam -f gam.1 > gam.2
    3. map-unique.pl -a gam -f gam.2 > gam.unisyn
    This gives you the file gam.unisyn, which is the lexicon file you need for the next step.
  2. Create the layer for your pronunciation annotations.
  3. Upload the accent-specific lexicon on the layer configuration page.
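
The three generation commands in step 1 can also be scripted. The following is a minimal sketch in Python, assuming the Unisyn Perl scripts are executable and on your PATH, and that it is run from the directory containing the unilex file. The function names lexicon_commands and build_lexicon are illustrative only; they are not part of Unisyn or of this layer manager.

```python
import subprocess


def lexicon_commands(accent, source="unilex"):
    """Return (argv, output_file) pairs mirroring the three Unisyn steps."""
    steps = [
        ("get-exceptions.pl", source, f"{accent}.1"),
        ("post-lex-rules.pl", f"{accent}.1", f"{accent}.2"),
        ("map-unique.pl", f"{accent}.2", f"{accent}.unisyn"),
    ]
    return [
        ([script, "-a", accent, "-f", infile], outfile)
        for script, infile, outfile in steps
    ]


def build_lexicon(accent, source="unilex"):
    """Run each step in order, redirecting its standard output to a file."""
    output_file = None
    for argv, output_file in lexicon_commands(accent, source):
        with open(output_file, "wb") as out:
            # check=True stops the chain if a script exits with an error
            subprocess.run(argv, stdout=out, check=True)
    return output_file  # name of the final lexicon file
```

Under those assumptions, calling build_lexicon("gam") should leave gam.unisyn in the current directory, ready to upload in step 3.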

Mapping Unisyn pronunciations to the DISC phoneme set

Some processing of phonological layers assumes that the annotations use the DISC phoneme set designed for the CELEX phonemic transcriptions. This set is used because each phoneme is represented by exactly one ASCII character, including phonemes usually written with a digraph, e.g. affricates like /tʃ/ (which is /J/ in DISC) and diphthongs like /aɪ/ (which is /2/ in DISC).
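
To illustrate why one character per phoneme is convenient, here is a small Python sketch. The table is deliberately partial (a handful of DISC symbols only), and the function names are hypothetical; this is not the code the layer manager uses.

```python
# Partial DISC-to-IPA table; only a few symbols are listed here.
DISC_TO_IPA = {
    "J": "tʃ",  # affricate
    "_": "dʒ",  # affricate
    "2": "aɪ",  # diphthong
    "1": "eɪ",  # diphthong
    "S": "ʃ",
    "T": "θ",
    "@": "ə",
}


def disc_to_ipa(transcription):
    """Convert a DISC string to IPA.

    Characters missing from the partial table pass through unchanged,
    which is only correct for symbols that are the same in both sets
    (such as "m").
    """
    return "".join(DISC_TO_IPA.get(symbol, symbol) for symbol in transcription)


def phoneme_count(transcription):
    """With one character per phoneme, counting phonemes is just len()."""
    return len(transcription)
```

For example, the DISC string J2m ("chime") has three characters and therefore three phonemes, and disc_to_ipa("J2m") gives tʃaɪm. The same property lets a search pattern match a single phoneme with a single character.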

Unisyn transcriptions use a set of phones that is larger than the set of phones available in DISC, and the transcriptions are designed to be broadly phonetic, not phonemic.

This means that the DISC representation of the transcriptions is imperfect, because some information is lost when Unisyn phones are mapped to DISC phonemes.

If having the original transcriptions precisely as defined in the Unisyn lexicon is very important, you can instead create a layer that uses the original transcription as contained in the file you uploaded. This has the advantage that the transcriptions are not filtered through the above mapping, and the disadvantage that LaBB-CAT won't be able to display the transcriptions using IPA symbols, nor help you when creating search patterns for the layer.

If you decide to do this, Unisyn offers you two possible representations:

  • Unisyn transcriptions - e.g. { p r @ . n ~ uh n s $}.< ii . * ei . sh n!< - these are already present in the file that you generated if you followed the instructions above (i.e. gam.unisyn)
  • SAM-PA transcriptions - e.g. pr\@%nVns$i"e$Sn=$@5 - these can be obtained by running an extra Unisyn command, and uploading the resulting gam.sampa file:
    output-sam.pl -a gam -f gam.unisyn > gam.sampa
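
Unlike DISC, the original Unisyn transcriptions do not use one character per phone. The Python sketch below makes this visible by splitting the example transcription above on whitespace and keeping the tokens made of two or more letters. It assumes that symbols are separated by spaces, which is inferred only from the layout of that example; it is not a full parser for the Unisyn format, and the function name is hypothetical.

```python
# The Unisyn transcription quoted in the list above
UNISYN_EXAMPLE = "{ p r @ . n ~ uh n s $}.< ii . * ei . sh n!<"


def multi_letter_symbols(transcription):
    """Return whitespace-separated tokens made up of two or more letters.

    Boundary and stress markers (for example "$}.<" or "*") are skipped
    because they are not purely alphabetic.
    """
    return [
        token
        for token in transcription.split()
        if len(token) > 1 and token.isalpha()
    ]
```

For the example, this returns uh, ii, ei and sh, so any search pattern written against a layer that stores the original transcriptions has to allow for multi-character symbols.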

(Unisyn has a third script called output-ipa.pl which produces transcriptions for displaying in HTML - e.g. p&#633;&#601;&#716;n&#652;ns.i&#712;e.&#643;n &#809 - which are not suitable for search, analysis, or forced-alignment)
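
To see why the HTML output is awkward to search, note that it contains numeric character references rather than the IPA characters themselves. The Python sketch below decodes the leading part of the example above using the standard library's html.unescape; the function name is hypothetical, and this is an illustration only, not a suggested way of preparing a lexicon for upload.

```python
import html

# Leading portion of the output-ipa.pl example quoted above
HTML_EXAMPLE = "p&#633;&#601;&#716;n&#652;ns.i&#712;e.&#643;n"


def decode_entities(text):
    """Replace HTML character references with the characters they stand for."""
    return html.unescape(text)
```

A search for ʃ in HTML_EXAMPLE finds nothing, because the character is stored as &#643; until the text is decoded.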

To prevent the DISC mapping from being applied to your layer:

  • When creating the layer, set the layer type to Text rather than Phonological.
  • When configuring the layer, set the field to Phonemes (original file) rather than Phonemes (DISC).