Orthography Standardizer

The Orthography Standardizer annotator takes tokens from the word layer and 'cleans up' their labels, creating tags on a new layer with standardized labels that are better suited to looking words up in lexicons, computing frequencies, etc. Specifically:

  1. the word label is converted to all lowercase,
  2. characters sometimes produced by word-processing software are standardized, e.g. em-dashes are converted to hyphens, 'smart-quote' apostrophes ’ are converted to plain apostrophes ' , etc.,
  3. all punctuation characters are removed except: ~ - ' :
  4. trailing apostrophes and hyphens are removed (leaving word-internal ones intact), and
  5. leading/trailing whitespace is trimmed off.
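
The steps above can be sketched in Python as follows. This is only an illustration, not the annotator's actual implementation: the function name `standardize`, the contents of the substitution table, and the use of Unicode "P" general categories to decide what counts as punctuation are all assumptions made for the example.

```python
import unicodedata

# Assumed substitutions for step 2; the annotator's real table may differ.
SUBSTITUTIONS = {
    "\u2014": "-",  # em-dash -> hyphen
    "\u2013": "-",  # en-dash -> hyphen
    "\u2019": "'",  # right 'smart-quote' apostrophe -> plain apostrophe
    "\u2018": "'",  # left 'smart-quote' apostrophe -> plain apostrophe
}

# Punctuation characters that step 3 leaves in place.
KEEP = set("~-':")


def is_punctuation(ch: str) -> bool:
    # Assumption: "punctuation" means any Unicode category starting with "P".
    return unicodedata.category(ch).startswith("P")


def standardize(label: str) -> str:
    """Hypothetical helper that applies the five steps in order."""
    # Step 1: convert to lowercase.
    text = label.lower()
    # Step 2: standardize word-processor characters.
    text = "".join(SUBSTITUTIONS.get(ch, ch) for ch in text)
    # Step 3: remove punctuation except the characters in KEEP.
    text = "".join(ch for ch in text if ch in KEEP or not is_punctuation(ch))
    # Step 4: remove trailing apostrophes and hyphens, keeping internal ones.
    text = text.rstrip("'-")
    # Step 5: trim leading/trailing whitespace.
    return text.strip()
```

With these assumptions, `standardize("X\u2014ray")` returns `x-ray`: the em-dash becomes a hyphen in step 2 and, being word-internal, survives step 4.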

For example:

word      orthography
“Why      why
hasn’t    hasn't
Inés      inés
d~        d~
got —     got
her       her
X—ray     x-ray
yet?”     yet