IDN domains: web addresses with special characters
The internet is steadily globalizing. According to the International Telecommunication Union (ITU), more than three billion people use the World Wide Web, and increasingly they do so in their native languages. This change was brought about in part by the introduction of internationalized domain names (IDNs) in 2003.
What is an internationalized domain name (IDN)?
Until 2003, domain names could consist only of letters of the Latin alphabet, the digits 0 to 9, and the hyphen. These limited options can be explained by taking a closer look at the domain name system (DNS). The service, which is responsible for translating domain names into IP addresses, operates on a naming scheme based on the American Standard Code for Information Interchange (ASCII). This character set was designed around English-language text and is a poor fit for an international project like the internet.
To compensate for this major drawback, a mechanism called Internationalizing Domain Names in Applications (IDNA) was introduced. Its goal is to define a standardized translation from Unicode into ASCII, making it possible to use characters from the world's writing systems in internet domains.
IDNA is considered one of the most significant changes in the history of the internet. It is particularly helpful for people who use Asian, African, or Arabic writing systems. In theory, any Unicode string can be used in an internationalized domain name. In practice, however, each domain registry decides for itself which special characters it accepts for registration. This means that the set of permitted characters differs depending on which top-level domain (e.g. .com, .mx, .ca) is used.
How does IDNA work?
Much of the internet’s infrastructure supports only the ASCII character set. To make sure that internationalized names can still be processed, each IDN written in Unicode is translated into an ASCII-compatible encoding (ACE) string. As a result, URLs with accented characters or umlauts can be displayed to the user, while servers continue to process the addresses in ASCII-compatible form. These processes are specified in the standards known as IDNA2003 and IDNA2008. The translation from Unicode to ASCII takes place on the client side and is based on a standardized encoding process called Punycode.
Punycode, standardized in RFC 3492, was developed to represent Unicode character strings unambiguously and without loss of information using only ASCII characters. All non-ASCII characters are removed from the domain name, encoded, and appended to the remaining name after a hyphen. This code sequence contains information about each Unicode character in question as well as its position in the domain name. Additionally, each ACE string created this way is given the prefix xn-- at the beginning; this signals that the character sequence is an IDN encoded according to the IDNA and Punycode standards. Here’s an example comparing a domain in IDN form with its ACE string counterpart:
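As a rough illustration of this client-side step, the following Python sketch converts a hostname to its ACE form before it would be handed to a resolver, and back again for display. It relies on Python's built-in "idna" codec, which follows the IDNA2003 rules only; the helper names to_ace and to_unicode and the domain bücher.example are assumptions chosen for illustration.

```python
def to_ace(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII-compatible (ACE) form."""
    # The built-in "idna" codec processes each dot-separated label separately
    # and leaves labels that are already plain ASCII unchanged.
    return hostname.encode("idna").decode("ascii")


def to_unicode(ace_hostname: str) -> str:
    """Convert an ACE hostname back to its Unicode form for display."""
    return ace_hostname.encode("ascii").decode("idna")


ace = to_ace("bücher.example")
print(ace)              # xn--bcher-kva.example
print(to_unicode(ace))  # bücher.example
```

The application works with the Unicode form for display, while only the ASCII form is sent over the network. Handling IDNA2008 would require a different library, which this sketch does not cover.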
IDN form: müllers-café.com
ACE String: xn--mllers-caf-k7a2t.com
The prefix xn--, which labels the domain as an ACE string, is followed by the part of the domain name that remains once all non-ASCII characters have been removed: mllers-caf. The encoded special characters, k7a2t, are then appended to the end of the name, separated by a hyphen.
Verisign’s IDN conversion tool makes it easy to translate from ACE into IDN or vice versa.
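The parts of this ACE string can be reproduced with a few lines of Python. This is a minimal sketch using Python's built-in "punycode" codec, which encodes a single label according to RFC 3492; the variable names are illustrative, and the xn-- prefix is added by hand because the codec itself does not add it.

```python
label = "müllers-café"

# The characters that are already ASCII are kept, in order, as the basic part.
basic = "".join(ch for ch in label if ord(ch) < 128)

# The "punycode" codec returns the basic part, a hyphen, and the encoded suffix.
encoded = label.encode("punycode").decode("ascii")

# The suffix after the last hyphen describes the removed characters and their positions.
suffix = encoded.rsplit("-", 1)[1]

# The ACE prefix marks the label as an encoded IDN.
ace_label = "xn--" + encoded

print(basic)      # mllers-caf
print(suffix)     # k7a2t
print(ace_label)  # xn--mllers-caf-k7a2t

# Decoding strips the prefix and reverses the encoding without any loss.
restored = ace_label[len("xn--"):].encode("ascii").decode("punycode")
print(restored)   # müllers-café
```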
Differences between IDNA2003 and IDNA2008
In the original 2003 procedure, internationalized domain names were normalized prior to Punycode encoding using the nameprep method. This method converted capital letters to lowercase, removed control characters, and mapped equivalent characters to a unified form. Nameprep was removed from the process with the introduction of IDNA2008. IDNA2008 itself no longer prescribes a normalization step; instead, it recommends an algorithm that converts capital letters into lowercase ones.
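The effect of the IDNA2003 normalization step can be observed in Python, whose standard library ships a nameprep function in the encodings.idna module. The sketch below only demonstrates the lowercase conversion described above; it reflects IDNA2003 behavior and says nothing about how an IDNA2008 implementation would treat the same input.

```python
from encodings.idna import nameprep

# nameprep folds capital letters to lowercase before Punycode encoding.
prepared = nameprep("MÜLLERS-CAFÉ")
print(prepared)  # müllers-café

# Because of this step, both spellings yield the same ACE string under IDNA2003.
mixed_case = "MÜLLERS-CAFÉ.com".encode("idna")
lower_case = "müllers-café.com".encode("idna")
print(mixed_case)                # b'xn--mllers-caf-k7a2t.com'
print(mixed_case == lower_case)  # True
```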