Flat Lexicon Tagger

This annotator tags words with data from a dictionary loaded from a plain text file (e.g. a CSV file). The file must have a 'flat' structure, in the sense that it is a simple list of dictionary entries with a fixed number of columns/fields, rather than a nested or otherwise complex structure.

The dictionary file you supply may contain multiple fields, and multiple entries per word. It might include:

  • word orthography
  • lemma
  • part-of-speech
  • pronunciation
  • frequency

...or any other type of data you like.
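
As an illustration of what 'flat' means here, the following Python sketch reads a small comma-separated lexicon and groups its rows by word, so that a word with several entries keeps all of them. The sample data, the column names (word, lemma, pos) and the function load_flat_lexicon are hypothetical and chosen only for this example; this is not the annotator's own loading code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: every row has the same three columns.
# The column names and values are assumptions for illustration only.
SAMPLE = """\
word,lemma,pos
record,record,NN
record,record,VB
dogs,dog,NNS
"""


def load_flat_lexicon(text_file, key_field="word"):
    """Group the rows of a flat, fixed-column file by their key field."""
    entries = defaultdict(list)
    for row in csv.DictReader(text_file):
        # A word may appear on several rows; keep every entry.
        entries[row[key_field]].append(row)
    return dict(entries)


lexicon = load_flat_lexicon(io.StringIO(SAMPLE))
print([entry["pos"] for entry in lexicon["record"]])  # prints ['NN', 'VB']
```

Here "record" has two entries, one per part-of-speech, while "dogs" has one. Any of the columns could serve as the key or as the value used for tagging.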

Getting a dictionary file

Which dictionary file you need depends on what you want to annotate. For pronunciations, you might download a standard dictionary for your target language, such as Unisyn, the CMU Pronouncing Dictionary, or CELEX (although there are also specialised layer managers for these particular lexicons). Frequency lists include CELEX, SubtlexUS, and Adam Kilgarriff's BNC Frequency Lists.

Alternatively, you might have, or prepare, your own dictionary containing pronunciations, lemmata, etc.

NB the text file must use ASCII or UTF-8 character encoding. If your dictionary file uses another encoding (e.g. Western or ISO-8859), you will need to re-save the file using UTF-8. In many text editors, the character encoding is an option available when you select Save As... from the File menu.
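
If you prefer to convert the file with a script rather than a text editor, a minimal Python sketch along the following lines may help. It assumes you already know the file's current encoding (ISO-8859-1 is used as the default here purely as an example), and the function name reencode_to_utf8 is hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(source, target, source_encoding="iso-8859-1"):
    """Decode `source` using `source_encoding` and write it to `target` as UTF-8."""
    # Work on bytes so that line endings are passed through unchanged.
    text = Path(source).read_bytes().decode(source_encoding)
    Path(target).write_bytes(text.encode("utf-8"))
```

For example, reencode_to_utf8("lexicon-latin1.csv", "lexicon-utf8.csv") would write a UTF-8 copy of a hypothetical file lexicon-latin1.csv. If the source encoding is guessed wrongly, accented characters will come out garbled, so it is worth checking a few entries by eye before uploading.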

You can upload as many dictionaries as you like. Once you have uploaded at least one dictionary, you can configure a word layer to look up the resulting lexicons.