Locale, character set, character encoding, font

Wikipedia has a section on character encoding terminology which starts “the various terms related to character encoding are often used inconsistently or incorrectly”, which is true in my experience. Beware when reading material online. Also, I might commit some of those sins in this document.

Broadly, these are the most useful concepts:

  • Locale: Encodes human customs. It often combines several aspects such as language, geographical region, and writing conventions.

  • Character: “The smallest unit of text that has semantic value”. This can be smaller than what would colloquially be considered a “character”, e.g. lone [diacritic][]s.

  • [Character set][]: An abstract collection of characters. A coded character set additionally assigns each character a number (its code point), e.g. Unicode.

  • [Character encoding][]: Maps characters from a character set to byte sequences. E.g. [UTF-8][].

  • Font: Turns encoded characters into a visual representation.

[Character set]:

[Character encoding]: https://en.wikipedia.org/wiki/Character_encoding
[UTF-8]: https://en.wikipedia.org/wiki/UTF-8
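
To make the split between character, character set, and character encoding concrete, here is a minimal sketch. It assumes `od`, `xargs`, and `iconv` are available; `hex_bytes` is a hypothetical helper invented for this example. It prints the bytes that different encodings produce for the same character.

```shell
# hex_bytes (hypothetical helper): print stdin as space-separated hex bytes.
hex_bytes() {
    od -An -tx1 | xargs
}

# U+00E9 (é), written as explicit UTF-8 bytes via octal escapes so the
# example does not depend on the encoding of this file or the terminal.
e_acute="$(printf '\303\251')"

# One character, three encodings, three different byte sequences.
printf '%s' "$e_acute" | hex_bytes                                  # c3 a9
printf '%s' "$e_acute" | iconv -f UTF-8 -t ISO-8859-1 | hex_bytes   # e9
printf '%s' "$e_acute" | iconv -f UTF-8 -t UTF-16BE | hex_bytes     # 00 e9

# "e" followed by a lone combining acute accent (U+0301): two characters
# in the Unicode sense, although it renders as a single "é".
printf 'e\314\201' | hex_bytes                                      # 65 cc 81
```

The code point (U+00E9) stays the same in the first three lines; only the encoding, and therefore the byte sequence, changes.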

These are conceptually independent. Character sets and character encodings are often conflated, because many older standards define both at once.

Locale names generally have the following form, where the bracketed parts are optional:

`[language[_territory][.codeset][@modifier]]`

  • language: ISO 639, either a 2-letter lowercase alphabetic code in (the smaller) Set 1 or a 3-letter alphabetic code in (the larger) Set 2

```sh
# Split every installed locale name into its components and count how often
# each value occurs (least frequent first). Note: `\+` is a GNU sed extension.
locales="$(locale -a)"
languages="$(  echo "$locales" | sed -n 's/^\([^_.@]\+\).*/\1/p'    | sort | uniq -c | sort -n)"
territories="$(echo "$locales" | sed -n 's/.*[_]\([^.@]\+\).*/\1/p' | sort | uniq -c | sort -n)"
codesets="$(   echo "$locales" | sed -n 's/.*[.]\([^@]\+\).*/\1/p'  | sort | uniq -c | sort -n)"
modifiers="$(  echo "$locales" | sed -n 's/.*[@]\(.\+\)$/\1/p'      | sort | uniq -c | sort -n)"

echo "$languages"
echo "$languages" | grep '[[:upper:]]'   # Output: `C`, `POSIX`.
echo "$territories"
echo "$territories" | grep '[[:lower:]]' # Output: nothing.
echo "$codesets"
echo "$modifiers"

# List the available charmaps (codesets).
locale -m
```
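
For a single locale name, the same split can be sketched with POSIX parameter expansion instead of `sed`. `split_locale` is a hypothetical helper written for this example. It assumes the name follows the `language[_territory][.codeset][@modifier]` pattern and does not validate the parts.

```shell
# split_locale (hypothetical helper): split one locale name into its parts.
# The body runs in a subshell ( ... ) so its variables do not leak out.
split_locale() (
    name="$1"
    territory=""; codeset=""; modifier=""
    # Peel the optional parts off from right to left.
    case "$name" in *@*) modifier="${name#*@}";  name="${name%%@*}" ;; esac
    case "$name" in *.*) codeset="${name#*.}";   name="${name%%.*}" ;; esac
    case "$name" in *_*) territory="${name#*_}"; name="${name%%_*}" ;; esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$name" "$territory" "$codeset" "$modifier"
)

split_locale 'sr_RS.utf8@latin'
# language=sr territory=RS codeset=utf8 modifier=latin
split_locale 'C'
# language=C territory= codeset= modifier=
```

Peeling from the right means an underscore or dot inside a later part cannot be mistaken for an earlier separator.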

Global human culture is messy, and formalizing it is not easy. It is easy to introduce assumptions that cause breakage.

In the past, I removed `utf8` from the locale `LC_*` environment variables if `TERM` was some `linux` variant.
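
A rough sketch of what such a workaround could look like follows. It is not the original snippet. `strip_utf8_codeset` is a hypothetical helper, and treating "some linux variant" as "`TERM` starts with `linux`" is an assumption. Whether the resulting locale names exist depends on what is installed (see `locale -a`).

```shell
# strip_utf8_codeset (hypothetical helper): drop a UTF-8 codeset such as
# .utf8 or .UTF-8 from a locale name, keeping any @modifier.
strip_utf8_codeset() {
    printf '%s\n' "$1" | sed 's/\.[Uu][Tt][Ff]-\{0,1\}8//'
}

# Assumption: "some linux variant" means TERM starts with "linux".
case "${TERM:-}" in
    linux*)
        # Rewrite every LC_* variable that is present in the environment.
        for var in $(env | sed -n 's/^\(LC_[A-Z_]*\)=.*/\1/p'); do
            eval "value=\${$var}"
            export "$var=$(strip_utf8_codeset "$value")"
        done
        ;;
esac
```

For example, `strip_utf8_codeset 'de_DE.utf8@euro'` prints `de_DE@euro`, while a name with a different codeset such as `ja_JP.eucJP` is left unchanged.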