Things You Don't Need To Know About Encodings

  _____  ________  ________ ________
 / __  \|\  ___  \|\   ____\\_____  \
|\/_|\  \ \____   \ \  \___\|____|\ /_
\|/ \ \  \|____|\  \ \  \____    \|\  \
     \ \  \  __\_\  \ \  ___  \ __\_\  \
      \ \__\|\_______\ \_______\\_______\
       \|__|\|_______|\|_______\|_______|

ASCII

In 1963, the American Standards Association (ASA) reached an important milestone. By publishing the first version of the American Standard Code for Information Interchange, commonly known as ASCII (but more precisely US-ASCII, if the Internet Assigned Numbers Authority, IANA, gets its word in), it brought order to the wild west of punched cards and magnetic tapes. An encoding, a set of 128 characters and their respective codes, was agreed upon. At last, information could be transmitted and understanding between machines was secured…

At least between some of them, as long as you were fine with limiting yourself to the English alphabet and numbers. And weren’t using IBM’s systems, as they were pushing their own proprietary encoding, EBCDIC, at the time.

Basically these ones:

( ) ; [ ] { }

But if you were fine with those you were golden. Oh wow, is that an @ in there? Damn. 1963, huh?
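To make the "128 characters and respective codes" bit concrete, here is a minimal sketch in Python. It leans on Python's built-in `ascii` codec, which implements the modern revision of the standard rather than the 1963 original, and `is_ascii_encodable` is just a hypothetical helper name made up for the illustration.

```python
# Every ASCII character maps to a code between 0 and 127, so 7 bits are enough.
text = "Hello @ 1963"
codes = [ord(ch) for ch in text]
assert all(code < 128 for code in codes)
print(codes[:5])  # [72, 101, 108, 108, 111]


def is_ascii_encodable(s: str) -> bool:
    """Return True if every character in s has an ASCII code."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(is_ascii_encodable("@"))  # True
print(is_ascii_encodable("å"))  # False
```

Anything outside those 128 codes simply has no representation, which is the limitation the later encodings set out to fix.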

ISO-8859-1 and Windows-1252

iso-8859-1 was also an important milestone thanks to its inclusion of more Latin-based characters, but we can kind of ignore that it was its own thing, separate from windows-1252. You see, Microsoft developed a whole bunch of products during the ’90s. Products that used their very own charset, windows-1252, which had already diverged from the standard years earlier, when Windows 2.0 was released. The fine engineers over at Microsoft decided that the two were basically the same, and as such, products were marketed as being compliant.

This went on and on until the web got sick of seeing question marks and other nonsensical symbols, and collectively decided to treat any occurrence of iso-8859-1 as windows-1252.

Fuck it — let's just treat iso-8859-1 as if it was the windows-thing.

— WHATWG, probably

To make things simpler, another alias you might stumble upon for iso-8859-1, or cp1252 (I mean windows-1252), is latin1. No, wait… ANSI! Anyway, if you’re not doing anything weird, these are likely handled the same way; it’s all windows-1252.

… The wonderful world of Microsoft.
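For the curious, the place where the two actually differ is the byte range 0x80 to 0x9F. The sketch below uses Python's codecs, where (unlike in browsers) `latin-1` still means the real iso-8859-1 and `cp1252` means windows-1252. The sample bytes are picked for illustration only.

```python
# Three bytes from the range where the two encodings disagree.
raw = bytes([0x80, 0x93, 0x94])

as_cp1252 = raw.decode("cp1252")   # printable: euro sign and curly quotes
as_latin1 = raw.decode("latin-1")  # invisible C1 control characters

print(as_cp1252)                   # €“”
print(as_latin1.isprintable())     # False

# Outside that range the two agree, which is why the mix-up mostly "worked".
same = bytes([0xE9])
print(same.decode("cp1252") == same.decode("latin-1"))  # True
```

Text full of "smart quotes" that was labelled iso-8859-1 but written as windows-1252 is exactly the kind of input that ends up as garbage under a strict decoder.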

Names, misnomers and aliases aside, the encoding had a huge weakness. Much like ASCII before, windows-1252 was far from a complete character set, considering the multitude of languages that exist all over the world. It was also constrained to a single byte per character, thus only allowing 256 characters. Things would have to change again.

Unicode

Enter The Unicode Standard. Not short for “You and I”-code, but rather universal character encoding.

While the rest of the world was busy jamming to “Call Me” by Blondie, a handful of nerds were pondering the existence of an encoding for the ever more globalized internet. And for that, we should be ever thankful; Unicode has enabled us to have tons of emojis and characters which we can combine to make funny faces.

¯\_(ツ)_/¯

The “lol, i dunno” face.

Not as important, but still worthy of note, is that it has helped tremendously in making the web more accessible.

UTF-8

Unicode defines three different encodings: UTF-8, UTF-16 and UTF-32. Now, you might be asking yourself “didn’t they want a single universal one?” and you would be correct in doing so. The vast majority of web pages are delivered in UTF-8, which is backwards compatible with US-ASCII: any valid US-ASCII text is also valid UTF-8, byte for byte. The first 256 Unicode code points also match the actual ISO-8859-1 (not Windows-1252), although UTF-8 stores the upper half of them as two bytes each. It’s not surprising that, when the IETF made a choice about which encoding to support, it was in favour of UTF-8.
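A quick way to see how the three encodings relate is to encode the same text with each of them. This is a minimal sketch using Python's built-in codecs; the `-be` (big-endian) variants are used only so that no byte order mark gets prepended, and the sample strings are arbitrary.

```python
text = "Az"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")
print(len(utf8), len(utf16), len(utf32))  # 2 4 8

# For plain ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print(utf8 == text.encode("ascii"))  # True

# Above code point 127 the bytes differ from ISO-8859-1, even though
# the code point itself (U+00E9) is the same in both.
e_acute = "\u00e9"  # é
print(e_acute.encode("latin-1"))  # b'\xe9'
print(e_acute.encode("utf-8"))    # b'\xc3\xa9'
```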

Now, another beautiful thing that UTF-8 does, besides being an actual superset of US-ASCII, is that it is able to compactly encode a lot of characters. This is thanks to the use of variable-length encoding, meaning that a single character can use from one up to four bytes of space. As an example, the first 128 characters, the subset that is compatible with US-ASCII, only take up one byte each, but an Egyptian hieroglyph will use four.

𓀪

A hieroglyph of an ancient Egyptian skipping rope, using four octets, or bytes.
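The one-to-four-bytes claim is easy to check for yourself. Below is a small sketch in Python; `utf8_length` is a hypothetical helper name, and the four sample characters are just one pick per byte length.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# One sample per length: ASCII letter, accented Latin letter, katakana, hieroglyph.
samples = ["A", "\u00e9", "ツ", "𓀪"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {utf8_length(ch)} byte(s), hex {encoded.hex(' ')}")

print([utf8_length(ch) for ch in samples])  # [1, 2, 3, 4]
```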

Building A Morse Code Translator

I always like to build something on the topic I’m writing about, and during some research I found out that Morse code is ~~logically~~ technically… debatably a binary encoding system consisting of dits and dahs!

A(·–)L(·–··)L(·–··) Q(––·–)U(··–)I(··)E(·)T(–) O(–––)N(–·) T(–)H(····)E(·) W(·––)E(·)S(···)T(–)E(·)R(·–·)N(–·) F(··–·)R(·–·)O(–––)N(–·)T(–)
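Output like the line above could be produced by something along these lines. This is only a sketch, not necessarily how the actual translator was built: the names `MORSE`, `PRETTY` and `annotate` are made up for the example, and it only handles the letters A to Z (anything else raises a `KeyError`).

```python
# International Morse code for the 26 letters, written with ASCII dots and dashes.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".",
    "F": "..-.", "G": "--.", "H": "....", "I": "..", "J": ".---",
    "K": "-.-", "L": ".-..", "M": "--", "N": "-.", "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.", "S": "...", "T": "-",
    "U": "..-", "V": "...-", "W": ".--", "X": "-..-", "Y": "-.--",
    "Z": "--..",
}

# Swap the ASCII symbols for the nicer-looking middle dot and en dash.
PRETTY = str.maketrans({".": "·", "-": "–"})


def annotate(text: str) -> str:
    """Render each letter followed by its Morse code in parentheses."""
    words = text.upper().split()
    return " ".join(
        "".join(f"{ch}({MORSE[ch].translate(PRETTY)})" for ch in word)
        for word in words
    )


print(annotate("all quiet"))
# A(·–)L(·–··)L(·–··) Q(––·–)U(··–)I(··)E(·)T(–)
```

Keeping the table in plain ASCII and translating at the end makes the dictionary easier to type and to proofread than one full of middle dots and en dashes.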

Happy coding!