Orthography Standardizer
The Orthography Standardizer annotator takes tokens from the word layer and 'cleans up' their labels, creating tags on a new layer with standardized labels that are better suited to lexicon lookup, frequency computation, etc. Specifically:
- the word label is converted to all lowercase,
- characters sometimes produced by word-processing software are standardized, e.g. em-dashes are converted to hyphens, 'smart-quote' apostrophes ’ are converted to plain apostrophes ' , etc.,
- all punctuation characters are removed except: ~ - ' :
- trailing apostrophes and hyphens are removed (leaving word-internal ones intact), and
- leading/trailing whitespace is trimmed off.
For example:
word | | orthography |
---|---|---|
“Why | → | why |
hasn’t | → | hasn't |
Inés | → | inés |
d~ | → | d~ |
got — | → | got |
her | → | her |
X—ray | → | x-ray |
yet?” | → | yet |
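
The steps above can be approximated in a few lines of Python. This is only an illustrative sketch, not the annotator's actual implementation: the function name `standardize`, the `REPLACEMENTS` map (which covers only the two substitutions named above), and the choice to treat "punctuation" as any character in a Unicode `P*` category are all assumptions.

```python
import unicodedata

# Assumed replacement map: only the two substitutions named in the text.
# The real annotator may standardize further characters.
REPLACEMENTS = {
    "\u2014": "-",  # em-dash -> hyphen
    "\u2019": "'",  # 'smart-quote' apostrophe -> plain apostrophe
}

# Punctuation characters that are kept.
KEEP = set("~-':")


def standardize(label: str) -> str:
    """Return a standardized orthography for a word label (hypothetical helper)."""
    # 1. Convert to all lowercase.
    s = label.lower()
    # 2. Standardize word-processor characters.
    for old, new in REPLACEMENTS.items():
        s = s.replace(old, new)
    # 3. Remove punctuation except the kept characters.
    #    Assumption: "punctuation" means Unicode general category P*.
    s = "".join(
        ch for ch in s
        if ch in KEEP or not unicodedata.category(ch).startswith("P")
    )
    # 4. Remove trailing apostrophes and hyphens (word-internal ones stay).
    s = s.strip().rstrip("'-")
    # 5. Trim any remaining leading/trailing whitespace.
    return s.strip()
```

For instance, `standardize("X\u2014ray")` returns `"x-ray"`, matching the corresponding row of the table. Whitespace is trimmed both before and after step 4 so that a label whose dash is separated from the word by a space still ends up with no trailing hyphen or space.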