Unisyn Tagger
The Unisyn Tagger tags word tokens with data from Unisyn, a lexicon produced by the Centre for Speech Technology Research at the University of Edinburgh.
Unisyn is a 'master lexicon' of English, which contains:
- orthography
- part-of-speech
- pronunciation, in an 'accent neutral' form
- 'enriched orthography' showing morphological information
- frequency, as derived from various sources, including the British National Corpus, Time articles, Gutenberg, etc.
The pronunciations in the lexicon can be converted into an accent-specific form using perl scripts that are included with the lexicon.
Getting Unisyn
Unisyn is available under a non-commercial license, and must be acquired separately
from this layer manager. To acquire Unisyn, you must first register on the Unisyn
website and accept the terms of their license. The Unisyn website is here:
http://www.cstr.ed.ac.uk/projects/unisyn/
(This layer manager has been tested with version 1.3 of Unisyn)
Using Unisyn with this annotator
Once you've got Unisyn, you can use it to produce accent-specific lexicons, and provide these lexicons to the annotator, which then uses them to tag word tokens in your transcripts.
For example, if you want to annotate your transcripts with 'General American English' pronunciations:
- Generate the General American English (gam) lexicon by running the following Unisyn commands:
get-exceptions.pl -a gam -f unilex > gam.1
post-lex-rules.pl -a gam -f gam.1 > gam.2
map-unique.pl -a gam -f gam.2 > gam.unisyn
- Create the layer for your pronunciation annotations
- Upload the accent-specific lexicon on the layer configuration page
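The three lexicon-generation commands above can also be scripted. The following is a minimal Python sketch, not part of Unisyn or of this layer manager: the function names unisyn_steps and build_lexicon are hypothetical, and it assumes the Unisyn perl scripts are executable and on your PATH, and that the master lexicon file is called unilex, as in the commands above.

```python
import subprocess
from pathlib import Path


def unisyn_steps(accent, master="unilex"):
    """Return (argv, output_file) pairs mirroring the three commands above.

    'accent' is the Unisyn accent code (e.g. "gam"); 'master' is the
    master lexicon file name, assumed here to be "unilex".
    """
    return [
        (["get-exceptions.pl", "-a", accent, "-f", master], f"{accent}.1"),
        (["post-lex-rules.pl", "-a", accent, "-f", f"{accent}.1"], f"{accent}.2"),
        (["map-unique.pl", "-a", accent, "-f", f"{accent}.2"], f"{accent}.unisyn"),
    ]


def build_lexicon(accent, workdir="."):
    """Run each step in workdir, sending stdout to that step's output file.

    Assumes the Unisyn scripts can be found on the PATH. Raises
    subprocess.CalledProcessError if any step exits with an error.
    """
    workdir = Path(workdir)
    for argv, out_name in unisyn_steps(accent):
        with open(workdir / out_name, "wb") as out:
            subprocess.run(argv, cwd=workdir, stdout=out, check=True)
    return workdir / f"{accent}.unisyn"
```

Calling build_lexicon("gam") from the directory that contains unilex should leave a gam.unisyn file there, which is the file to upload on the layer configuration page.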
Mapping Unisyn pronunciations to the DISC phoneme set
Some processing of phonological layers assumes that the annotations use the DISC phoneme set designed for the CELEX phonemic transcriptions. This set is used because each phoneme is expressed by precisely one ASCII character, including phonemes usually expressed using a digraph - e.g. affricates like /tʃ/ (which is /J/ in DISC) and diphthongs like /aɪ/ (which is /2/ in DISC).
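To illustrate why one character per phoneme is convenient, here is a small Python sketch. It is only an illustration: the IPA_TO_DISC table contains just the two correspondences mentioned above (a real mapping covers many more phonemes), and the to_disc function and the example phoneme sequence are hypothetical.

```python
import re

# Only the two correspondences given in the text above;
# a complete IPA-to-DISC table would be much larger.
IPA_TO_DISC = {"tʃ": "J", "aɪ": "2"}


def to_disc(ipa_phonemes):
    """Map a list of IPA phonemes to a DISC string, one character each."""
    return "".join(IPA_TO_DISC[p] for p in ipa_phonemes)


phonemes = ["tʃ", "aɪ"]   # a hypothetical two-phoneme sequence
disc = to_disc(phonemes)  # "J2"

# In DISC, the string length equals the number of phonemes ...
same_length = len(disc) == len(phonemes)

# ... so in a regular expression, "." stands for exactly one phoneme:
# this pattern means "the affricate /J/ followed by any single phoneme".
matches = re.fullmatch(r"J.", disc) is not None
```

With the IPA digraphs, the same pattern would need to know which character pairs belong together, which is what the DISC encoding avoids.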
Unisyn transcriptions use a set of phones that is larger than the set available in DISC, and the transcriptions are designed to be broadly phonetic, not phonemic.
This means that the DISC representation of the transcriptions is imperfect, because some information is lost when mapping Unisyn phones to DISC phonemes.
If having the original transcriptions precisely as defined in the Unisyn lexicon is very important, you can instead create a layer that uses the original transcription as contained in the file you uploaded. This has the advantage that the transcriptions are not filtered through the above mapping, and the disadvantage that LaBB-CAT won't be able to display the transcriptions using IPA symbols, nor help you when creating search patterns for the layer.
If you decide to do this, Unisyn offers you two possible representations:
- Unisyn transcriptions - e.g. { p r @ . n ~ uh n s $}.< ii . * ei . sh n!< - these are already present in the file that you generated if you followed the instructions above (i.e. gam.unisyn)
- SAM-PA transcriptions - e.g. pr\@%nVns$i"e$Sn=$@5 - these can be obtained by running an extra Unisyn command, and uploading the resulting gam.sampa file:
output-sam.pl -a gam -f gam.unisyn > gam.sampa
(Unisyn has a third script called output-ipa.pl which produces transcriptions for displaying in HTML - e.g. pɹəˌnʌns.iˈe.ʃn̩ - which are not suitable for search, analysis, or forced-alignment)
To prevent the DISC mapping from being applied to your layer:
- When creating the layer, set the layer type to Text rather than Phonological.
- When configuring the layer, set the field to Phonemes (original file) rather than Phonemes (DISC).