What Every Software Developer Should Know (2): UTF-8
Character encodings have always been a pain. I really like that UTF-8 removes the need for multiple encodings.
Some people may argue: “We don’t need UTF-8, because we serve only US/English-speaking customers. ASCII is fine. Latin 1 is fine.”
No, it isn’t. Why not ASCII? With UTF-8 you can use emoticons, symbols and everything else from the rest of Unicode, like Korean or Chinese characters. And it’s backwards-compatible with ASCII: any valid ASCII text is already valid UTF-8. But yeah, you usually need UTF-8 support in your database and data structures as well.
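The backwards compatibility is easy to check by encoding the same text both ways. This is just a minimal Python sketch; the sample strings are my own picks for illustration.

```python
# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
text = "Hello, world"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)

# Characters outside ASCII simply take more bytes in UTF-8.
samples = ["A", "\u010f", "\ud55c", "\u4e2d", "\U0001F600"]  # A, ď, 한, 中, 😀
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_lengths)
```

So plain English text costs nothing extra, and everything else costs between two and four bytes per character.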
Why not Latin 1? As a person from central Europe, I use Latin 2, and systems that assume Latin 1 tend to mangle our characters, because ď is interpreted as ï (the same byte value, 0xEF, in both encodings).
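The ď/ï mix-up can be reproduced in a few lines. Here is a small Python sketch, assuming the text was saved as Latin 2 (ISO 8859-2) and then read by something that expects Latin 1 (ISO 8859-1):

```python
# The same byte means different letters in Latin 2 and Latin 1.
original = "\u010f"                   # ď
raw = original.encode("iso-8859-2")   # how a Latin 2 system stores it
mangled = raw.decode("iso-8859-1")    # how a Latin 1 system reads it back

print(raw.hex(), original, mangled)
```

Nothing is lost at the byte level; the two systems just disagree about what the byte means.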
“We use UTF-16.” OMFG, I haven’t liked UTF-16 since I contributed to a multiplatform (Windows, Linux, Unix, …) text processing program. I really thought I would die. Java uses UTF-16. Windows uses UTF-16 (originally UCS-2). UCS-2 is fixed at 2 bytes per character and can’t represent anything outside the Basic Multilingual Plane, while UTF-16 uses 2 or 4 bytes per character. (The 6-bytes-per-character limit comes from the original UTF-8 design; today UTF-8 is also restricted to 4 bytes.) Overall it’s a mess. I really hate wstring on Windows.
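UTF-16 is not a fixed-width encoding, which is where a lot of the pain comes from. A rough Python sketch that counts 16-bit code units (the helper name utf16_units is made up for this example):

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u010f"         # ď, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # 😀, outside it, so it needs a surrogate pair

print(utf16_units(bmp_char), utf16_units(astral_char))
```

Code that assumes one 16-bit unit per character works fine until the first emoji shows up.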
C programmers may say: “But working with UTF-8 is a pain.” Yes, I agree. I wonder why there wasn’t a good library for working with UTF-8 text in 2012. But usually you just need to use char * and not touch anything. But then there are the Windows problems… QString from Qt could solve a couple of problems, but it’s C++.
UTF-8 unifies encodings. The Slovak language alone has at least four: Windows CP1250, Latin 2 (ISO 8859-2), and CP852 and the Kamenický encoding from the DOS era. I can’t count how many times I had to fix characters because some software couldn’t work with different encodings properly.
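To see why those legacy encodings clash, encode one Slovak letter in each of them. This is a minimal Python sketch; as far as I know Python’s standard library ships no Kamenický codec, so that one is left out.

```python
letter = "\u0161"  # š

# The same letter becomes a different byte in every legacy encoding.
legacy = {name: letter.encode(name) for name in ("cp1250", "iso-8859-2", "cp852")}
for name, raw in legacy.items():
    print(f"{name:12} {raw.hex()}")

# UTF-8 gives one answer regardless of platform.
utf8 = letter.encode("utf-8")
print(f"{'utf-8':12} {utf8.hex()}")
```

Three systems, three different bytes for the same letter, and none of them says which encoding it used.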
UTF-8 support these days is pretty good (Ruby 2.4 can finally upcase non-ASCII characters correctly), so why use something else? Emoticons and (ﾉ≧∇≦)ﾉ ﾐ ┸━┸ weird emojis included!
Fortunately, the web is all about UTF-8, and slowly but surely it is starting to prevail everywhere else because of the web, analytics, big data, webviews on mobile phones and APIs. And I consider that a good thing.
The hard stuff with UTF-8: encoding and decoding, normalization, and character equivalence, since one (visually identical) character can sometimes be written in two ways. But this is handled by libraries, and in my web development work I’m not really aware of it.
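The “two ways to write one character” problem looks like this in practice. A short Python sketch using the standard unicodedata module, with é as an arbitrary example:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# They render the same, but a naive comparison says they differ.
naive_equal = composed == decomposed

# Normalizing both sides to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)  # composes into one code point
nfd = unicodedata.normalize("NFD", composed)    # splits into base + accent
print(naive_equal, nfc == composed, nfd == decomposed)
```

This is the kind of thing a library usually does for you, for example before comparing or indexing user input.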