Flat Lexicon Tagger
This annotator tags words with data from a dictionary loaded from a plain text file (e.g. a CSV file). The file must have a 'flat' structure: a simple list of dictionary entries with a fixed number of columns/fields, rather than a more complex structure.
The dictionary file you supply may contain multiple fields, and multiple entries per word. It might include:
- word orthography
- lemma
- part-of-speech
- pronunciation
- frequency
...or any other type of data you like.
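As an illustration of what 'flat' means here, the following Python sketch reads a small CSV lexicon and groups its rows by word, so that a word with several entries keeps all of them. The column names, the sample rows (including the frequency figures), and the function load_flat_lexicon are all made up for this example; it is not how the annotator itself reads the file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat lexicon: one entry per line, the same fields on every line.
# The values are invented for illustration only.
SAMPLE = """word,lemma,pos,pronunciation,frequency
read,read,VB,r i: d,1200
read,read,VBD,r E d,800
cat,cat,NN,k { t,450
"""

def load_flat_lexicon(text, key_field="word"):
    """Group rows by key_field, keeping every entry for each word."""
    entries = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        entries[row[key_field]].append(row)
    return dict(entries)

lexicon = load_flat_lexicon(SAMPLE)
# "read" has two entries, so a lookup yields two pronunciations.
pronunciations = [entry["pronunciation"] for entry in lexicon["read"]]
```

Keeping a list of entries per word, rather than a single value, mirrors the fact that the file may contain multiple entries for the same word.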
Getting a dictionary file
Which dictionary file you want depends on what you want to annotate. For pronunciations, you might download a standard dictionary for your target language, such as Unisyn, the CMU Pronouncing Dictionary, CELEX, etc. (although there are also specialised layer managers for these particular lexicons). Frequency lists include CELEX, SubtlexUS, and Adam Kilgarriff's BNC Frequency Lists.
Alternatively, you might have, or prepare, your own dictionary containing pronunciations, lemmata, etc.
NB the text file must use ASCII or UTF-8 character encoding. If your dictionary file uses another encoding (e.g. Western or ISO-8859), you will need to re-save the file using UTF-8 (in many text editors, the character encoding is an option available when you select Save As... from the File menu).
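If you would rather convert the file with a script than with a text editor, a minimal Python sketch might look like the following. The function name resave_as_utf8 and the default source encoding are assumptions made for this example; you need to know (or find out) which encoding your file actually uses, because decoding with the wrong one produces garbled characters without raising an error.

```python
from pathlib import Path

def resave_as_utf8(source, target, source_encoding="iso-8859-1"):
    """Read source using source_encoding and write the same text to target as UTF-8."""
    text = Path(source).read_text(encoding=source_encoding)
    Path(target).write_text(text, encoding="utf-8")
```

For example, resave_as_utf8("lexicon-latin1.csv", "lexicon-utf8.csv") would write a UTF-8 copy and leave the original file untouched (both file names are hypothetical).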
You can upload as many dictionaries as you like. Once you have at least one dictionary, you can configure a word layer to look up the resulting lexicons.