Orthography Standardizer

The Orthography Standardizer annotator takes tokens from the word layer and cleans up their labels, creating tags on a new layer with standardized labels that are better suited to lexicon lookups, frequency counts, etc. Specifically:

  1. the word label is converted to all lowercase,
  2. characters sometimes produced by word-processing software are standardized, e.g. em-dashes are converted to hyphens, 'smart-quote' apostrophes ’ are converted to plain apostrophes ' , etc.,
  3. all punctuation characters are removed except: ~ - ' :
  4. trailing apostrophes and hyphens are removed (leaving word-internal ones intact), and
  5. leading/trailing whitespace is trimmed off.

For example:

word      orthography
“Why      why
hasn’t    hasn't
Inés      inés
d~        d~
got —     got
her       her
X—ray     x-ray
yet?”     yet
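
The standardization steps can be sketched in Python as follows. This is only an illustration, not the annotator's actual implementation: the function name `standardize`, the replacement table (limited to the two characters named in step 2), and the use of Unicode general categories to decide what counts as punctuation are all assumptions.

```python
import unicodedata

# Assumed replacement table: only the two characters named in step 2.
# The real annotator may standardize more characters than these.
REPLACEMENTS = {
    "\u2014": "-",  # em-dash -> hyphen
    "\u2019": "'",  # 'smart-quote' apostrophe -> plain apostrophe
}

# Punctuation characters that step 3 leaves in place.
KEEP = set("~-':")


def standardize(label: str) -> str:
    """Hypothetical helper applying the five steps in order."""
    # 1. convert to lowercase
    result = label.lower()
    # 2. standardize word-processor characters
    for fancy, plain in REPLACEMENTS.items():
        result = result.replace(fancy, plain)
    # 3. remove punctuation (Unicode categories P*) except the kept characters
    result = "".join(
        ch for ch in result
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )
    # 4. remove trailing apostrophes and hyphens (word-internal ones survive)
    result = result.rstrip("'-")
    # 5. trim leading/trailing whitespace
    return result.strip()


if __name__ == "__main__":
    for word in ["\u201cWhy", "hasn\u2019t", "got \u2014", "X\u2014ray", "yet?\u201d"]:
        print(repr(word), "->", repr(standardize(word)))
```

Under these assumptions the sketch reproduces the example rows above, e.g. `standardize("hasn’t")` returns `hasn't` and `standardize("yet?”")` returns `yet`.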

A second possible use for this annotator is to copy (and clean up) tokens from one layer to another, optionally including or excluding tokens depending on whether they fall within the bounds of annotations on a third (filter) layer. Tokens that fall within annotation bounds can be included or excluded depending on the label of the filter layer annotation. For example, word tokens can be copied to an output layer only if they've been tagged as being in a specific language, or tokens can be copied only when they're not tagged as being part of a reading passage.
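
The filtering behaviour can be pictured with a small Python sketch. Everything here is hypothetical: the `Annotation` class, the `copy_tokens` function, its parameters, and the exact-match comparison of filter labels are assumptions made for illustration, and cleaning up the copied labels is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """Hypothetical annotation: a label spanning start..end (e.g. seconds)."""
    label: str
    start: float
    end: float


def copy_tokens(tokens: List[Annotation],
                filter_annotations: List[Annotation],
                filter_label: str,
                include: bool = True) -> List[Annotation]:
    """Copy tokens to a new list, filtered by a third layer.

    include=True  keeps only tokens inside a filter annotation labelled
                  filter_label; include=False keeps only tokens that are not.
    """
    copied = []
    for token in tokens:
        # A token counts as 'inside' when a matching annotation's bounds enclose it.
        inside = any(
            a.label == filter_label and a.start <= token.start and token.end <= a.end
            for a in filter_annotations
        )
        if inside == include:
            copied.append(Annotation(token.label, token.start, token.end))
    return copied


if __name__ == "__main__":
    words = [Annotation("kia", 0.0, 1.0), Annotation("ora", 1.0, 2.0),
             Annotation("hello", 2.0, 3.0)]
    language = [Annotation("mi", 0.0, 2.0)]  # first two words tagged as 'mi'
    print([t.label for t in copy_tokens(words, language, "mi", include=True)])
    print([t.label for t in copy_tokens(words, language, "mi", include=False)])
```

In this sketch, `include=True` corresponds to the "specific language" case and `include=False` to the "not part of a reading passage" case.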