I ran into the Unicode normalization issue while playing around with the String.split exercises.
Some non-ASCII characters can be represented by either one or two code points.
For example, ô can be a single code point (U+00F4) or two code points: the letter o followed by a combining circumflex (U+0302).
Either way, it is a single grapheme.
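As a rough sketch of the difference, the snippet below builds both forms of ô from escape sequences and counts code points, graphemes, and bytes. The variable names are my own, and the escapes are used only so the result does not depend on how an editor or shell encodes a typed "ô".

```elixir
# Both forms written with escapes (an assumption made for reproducibility).
precomposed = "\u00F4"  # LATIN SMALL LETTER O WITH CIRCUMFLEX
decomposed = "o\u0302"  # "o" followed by COMBINING CIRCUMFLEX ACCENT

# Code point counts differ between the two forms.
IO.inspect(length(String.codepoints(precomposed)))  # 1
IO.inspect(length(String.codepoints(decomposed)))   # 2

# String.length/1 counts graphemes, so both forms report one.
IO.inspect(String.length(precomposed))  # 1
IO.inspect(String.length(decomposed))   # 1

# The UTF-8 byte sizes differ as well.
IO.inspect(byte_size(precomposed))  # 2
IO.inspect(byte_size(decomposed))   # 3
```

Both strings render the same way, which is why the mismatch is easy to miss.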
Always working with normalized UTF-8 strings avoids confusion.
There are four Unicode normalization forms (NFC, NFD, NFKC, and NFKD), which you can read about on the Unicode.org site.
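To make the four forms a little more concrete, here is a small sketch using String.normalize/2. The "fi" ligature (U+FB01) is my own choice of example, not something from the exercises: the canonical forms leave it alone, while the compatibility forms replace it with the plain letters "fi".

```elixir
composed = "\u00F4"    # ô as one code point
decomposed = "o\u0302" # ô as "o" plus a combining circumflex
ligature = "\uFB01"    # LATIN SMALL LIGATURE FI (illustrative choice)

# Canonical forms: NFD splits the accent off, NFC puts it back.
nfd = String.normalize(composed, :nfd)
nfc = String.normalize(decomposed, :nfc)
IO.inspect(nfd == decomposed)  # true
IO.inspect(nfc == composed)    # true

# Canonical normalization does not touch the ligature.
nfc_lig = String.normalize(ligature, :nfc)
IO.inspect(nfc_lig == ligature)  # true

# Compatibility forms replace the ligature with plain letters.
nfkc_lig = String.normalize(ligature, :nfkc)
nfkd_lig = String.normalize(ligature, :nfkd)
IO.inspect(nfkc_lig)  # "fi"
IO.inspect(nfkd_lig)  # "fi"
```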
Elixir seems to prefer the Canonical Decomposition (NFD) form, as that is what the String.equivalent? function uses.
The following examples demonstrate:
# Add a diacritical to a base ASCII character
iex(1)> "o\u0302"
"ô"
iex(2)> "o\u0302" == "ô"
false
iex(3)> String.normalize("o\u0302", :nfd) == "ô"
false
iex(4)> String.normalize("o\u0302", :nfd) == String.normalize("ô", :nfd)
true
iex(5)> String.normalize("o\u0302", :nfc) == "ô"
true
The non-normalized results may vary with the platform or shell you use, because a typed ô may reach iex in either its composed or its decomposed form.
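One way to sidestep the platform question is to write both forms with escapes and compare them with String.equivalent?/2, or to normalize once when text enters the program. The sketch below shows both. TextBoundary and clean/1 are hypothetical names of mine, and picking NFC as the stored form is an assumption rather than an Elixir recommendation.

```elixir
composed = "\u00F4"    # ô as one code point
decomposed = "o\u0302" # ô as "o" plus a combining circumflex

# Binary comparison sees different bytes.
IO.inspect(composed == decomposed)  # false

# String.equivalent?/2 normalizes both sides before comparing.
IO.inspect(String.equivalent?(composed, decomposed))  # true

# Hypothetical helper: normalize input once at the edge of the program.
defmodule TextBoundary do
  # NFC is an assumed choice; any single form works if applied consistently.
  def clean(input) when is_binary(input), do: String.normalize(input, :nfc)
end

IO.inspect(TextBoundary.clean(decomposed) == composed)  # true
```

After every incoming string has passed through a helper like this, plain == comparisons behave as expected.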
All notes and comments are my own opinion. Follow me at @rgacote@genserver.social