POSIX / IEEE Std 1003.1
Matches baseline
POSIX / IEEE Std 1003.1
Matches baseline
Proprietary (reverse-engineered)
Matches baseline
Proprietary (reverse-engineered)
Matches baseline
Open-source (ICU-based)
Matches baseline
Open-source (wp_word_count)
Matches baseline
Unicode Standard Annex #29
Matches baseline
UAX #29 + Unicode CLDR / LDML
Matches baseline
ICU (open-source)
Matches baseline
ECMA-402 (TC39)
Matches baseline
GMX-V 1.0/2.0 (LISA/ETSI/GALA)
Matches baseline
GMX-V 2.0 (ETSI)
Matches baseline
Word processors, CMS editors and standards do not always agree on how to count URLs, email addresses, hyphenated words, punctuation, CJK text and markup.
If a client, teacher, editor or platform checks the final count in one tool, choose that method before making length decisions.
The two products treat hyphenated terms, URLs, email addresses, numbers and punctuation differently. Their counts can drift further apart on technical or multilingual text.
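As a rough illustration of this kind of drift, the sketch below applies two simple counting rules to the same sentence. Both rules and the sample sentence are hypothetical, chosen only for illustration; neither is the actual algorithm of any particular product.

```python
import re


def count_whitespace(text: str) -> int:
    """Count maximal runs of non-whitespace characters."""
    return len(text.split())


def count_alnum_runs(text: str) -> int:
    """Count maximal runs of letters and digits; any punctuation splits a word."""
    return len(re.findall(r"[^\W_]+", text))


# Hypothetical sample containing an email address, a hyphenated term and a version number.
sample = "Email jane.doe@example.com about the state-of-the-art v2.1 release."

print(count_whitespace(sample))  # 7
print(count_alnum_runs(sample))  # 14
```

The same sentence is 7 words under one rule and 14 under the other, because the email address, the hyphenated term and the version number each break into several pieces once punctuation counts as a separator.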
POSIX wc splits on whitespace (ASCII whitespace in the default C locale) and ignores Unicode word boundaries. UAX #29 defines default word-boundary rules for text in scripts including Latin, Cyrillic, Arabic, Hebrew and CJK; implementations can tailor them per language.
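A minimal sketch of why a whitespace-only rule falls short on text written without spaces. The helper name `wc_style_count` is an assumption for illustration; it mimics a split on ASCII whitespace and is not the wc implementation itself.

```python
def wc_style_count(text: str) -> int:
    """Count words by splitting on ASCII whitespace only.

    bytes.split() with no argument splits on ASCII whitespace,
    so non-ASCII spaces such as U+00A0 do not separate words.
    """
    return len(text.encode("utf-8").split())


english = "The quick brown fox"
japanese = "私は東京に住んでいます"  # one sentence, written without spaces
nbsp = "10\u00a0km"  # number and unit joined by a no-break space

print(wc_style_count(english))   # 4
print(wc_style_count(japanese))  # 1
print(wc_style_count(nbsp))      # 1
```

The whole Japanese sentence counts as a single word because it contains no whitespace. A boundary-aware segmenter following UAX #29 would report more words, which is one reason counts diverge on multilingual text.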
GMX-V (Global Information Management Metrics eXchange - Volume) is a word and character count standard that originated at LISA and is used in localization. It defines repeatable word counts for translation pricing and quoting.
Use whichever method matches the place where the count is checked. If a client uses Microsoft Word, match that. If a CMS or translation tool checks the count, match that instead.