Punycode is a standardized encoding that represents Unicode strings using a limited ASCII character set consisting of only the following elements:
- Lower case letters: a to z
- Digits: 0 to 9
- Special characters: hyphen (-)
The listed elements are referred to as base characters.
The procedure is primarily used for processing internationalized domain names (IDN), which contain non-base characters in addition to the base characters.
Development of the coding method
Punycode was standardized in 2003 by the Internet Engineering Task Force (IETF) as syntax for encoding Internationalized Domain Names in Applications (IDNA).
The IETF defines domain names as IDNs if, in addition to the letters of the Latin alphabet, they contain special characters such as diacritics or characters from other alphabets (e.g. German umlauts) and therefore cannot be processed directly by basic protocols such as the Domain Name System (DNS).
Take German as an example: a domain name like müller-büromöbel (Müller’s office furniture) has been allowed under the top-level domain .de since the introduction of IDNs, but name resolution can only process it once the non-base characters have been encoded. Numerous internet protocols were designed around written English and therefore only support the limited ASCII character set.
To ensure compatibility between IDNs and older internet standards, the IETF specified how internationalized domain names are to be encoded using the previously permitted characters, and standardized Punycode as the procedure for doing so.
For e-mail addresses, Punycode encoding is only used for internationalized e-mail domains. The local part (before the @ character) is encoded via UTF-8, if it contains non-ASCII characters.
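As a rough illustration of this split, the following Python sketch encodes only the domain part with the standard library's `idna` codec (which implements the older IDNA 2003 rules) and keeps the local part as UTF-8 bytes. The function name `encode_address` and the sample address are assumptions made for this example.

```python
def encode_address(address: str) -> tuple[bytes, str]:
    """Split an e-mail address and encode each part as described in the text."""
    local, _, domain = address.rpartition("@")
    # Domain part: every label is converted to its ASCII-compatible (ACE) form.
    ace_domain = domain.encode("idna").decode("ascii")
    # Local part: stays Unicode and is serialized as UTF-8.
    return local.encode("utf-8"), ace_domain


local_bytes, ace_domain = encode_address("jörg@müller-büromöbel.de")
print(local_bytes)  # b'j\xc3\xb6rg'
print(ace_domain)   # xn--mller-brombel-rmb4fg.de
```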
How does the Punycode coding work?
Punycode is defined by the IETF in RFC 3492 as an instance of a more general encoding algorithm called Bootstring. Bootstring allows strings drawn from an arbitrarily large character set to be represented using only a small set of permitted characters. The design of the encoding is based on six principles:
- Completeness: Every input string can be mapped to a string of base characters using Bootstring.
- Uniqueness: The mapping is unambiguous. Each input string has exactly one Bootstring encoding, and each encoding corresponds to exactly one input string.
- Reversibility: Bootstring encoding can be reversed at any time without any loss of information.
- Efficiency: The encoded string is only slightly longer than the input string, if at all.
- Simplicity: Bootstring uses simple coding and decoding algorithms.
- Readability: Only those characters are coded that cannot be represented in the target character set. All other characters remain unchanged.
Punycode is Bootstring with parameters chosen to meet the requirements of internationalized domain names, so that Unicode characters can be represented using only the previously permitted base characters.
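Some of these principles can be observed directly with the `punycode` codec in Python's standard library. The sketch below only spot-checks reversibility, readability, and length for a few sample strings; it is not a proof of the properties, and the sample list is chosen for illustration.

```python
samples = ["müller-büromöbel", "παράδειγμα", "example"]

for original in samples:
    encoded = original.encode("punycode")  # bare Punycode, no ACE prefix
    restored = encoded.decode("punycode")
    # Reversibility: decoding returns exactly the original string.
    assert restored == original
    # Readability: the base characters appear unchanged at the front.
    base_part = "".join(ch for ch in original if ch.isascii())
    assert encoded.decode("ascii").startswith(base_part)
    # Efficiency: compare the lengths before and after encoding.
    print(f"{original} -> {encoded.decode('ascii')} "
          f"({len(original)} -> {len(encoded)} characters)")
```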
We illustrate the coding with the following example.
The IDN müller-büromöbel contains two characters, ü and ö, that are not included in the previously permitted character set for domain names and must therefore be encoded via Punycode to ensure compatibility.
In the first step, the input string is normalized: all uppercase letters are replaced by the corresponding lowercase letters.
In the second step, all non-base characters are removed from the string. They are then appended to the remaining base characters in encoded form, separated by a hyphen.
If the Punycode syntax is used to encode internet addresses, each result string is provided with a so-called ACE prefix (short for ASCII-compatible encoding):
ACE prefix: xn--
The ACE prefix marks a label as Punycode-encoded, so that ordinary ASCII domain names that happen to contain hyphens are not misinterpreted as internationalized domain names.
This results in the following coding for the IDN müller-büromöbel:
ACE string: xn--mller-brombel-rmb4fg
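The steps described above can be sketched in Python. This is a simplified illustration, not a full IDNA implementation: normalization is reduced to lowercasing, the helper name `encode_label` is made up for this example, and the function assumes the label contains at least one non-base character.

```python
ACE_PREFIX = "xn--"


def encode_label(label: str) -> str:
    # Step 1: normalization, reduced here to lowercasing.
    label = label.lower()
    # Step 2: the codec keeps the base characters and appends the
    # encoded non-base characters after a hyphen.
    punycode = label.encode("punycode").decode("ascii")
    # Step 3: mark the result as an ASCII-compatible encoding.
    return ACE_PREFIX + punycode


print(encode_label("Müller-Büromöbel"))  # xn--mller-brombel-rmb4fg
```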
Special cases arise if the domain name contains either no non-base characters or only non-base characters.
A domain name that contains only non-base characters consists solely of the ACE prefix and the encoded string after being encoded.
A domain name such as παράδειγμα (Greek for “example”) therefore corresponds to the ACE string xn--hxajbheg2az3al.
If a domain name contains only base characters, the Punycode algorithm merely appends a hyphen to the end of the string. In the context of domain names, however, Punycode is not applied in this case: labels that consist only of base characters are left unchanged and receive no ACE prefix.
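Both special cases can be reproduced with Python's `punycode` codec. The helper `to_ace` is a hypothetical name for this sketch; it skips Punycode entirely for labels that are already pure ASCII, which mirrors the behavior described above.

```python
def to_ace(label: str) -> str:
    label = label.lower()
    if label.isascii():
        # Only base characters: the label is used as-is, without ACE prefix.
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


# Bare Punycode output for both special cases:
print("παράδειγμα".encode("punycode"))  # b'hxajbheg2az3al'
print("example".encode("punycode"))     # b'example-'

# What actually ends up in the domain name:
print(to_ace("παράδειγμα"))  # xn--hxajbheg2az3al
print(to_ace("example"))     # example
```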
If you consider the “Fully Qualified Domain Name” (FQDN) as a whole, each label (top-level domain, second-level domain, third-level domain, etc.) is coded separately.
A domain like пример.бг (Bulgarian for “example.bg”) is therefore encoded label by label as xn--e1afmkfd.xn--90ae.
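The label-by-label treatment can be sketched as follows. The function name `fqdn_to_ace` is an assumption for this example, and normalization is again reduced to lowercasing.

```python
def fqdn_to_ace(name: str) -> str:
    encoded_labels = []
    # Each label between the dots is handled on its own.
    for label in name.lower().split("."):
        if label.isascii():
            encoded_labels.append(label)
        else:
            punycode = label.encode("punycode").decode("ascii")
            encoded_labels.append("xn--" + punycode)
    return ".".join(encoded_labels)


print(fqdn_to_ace("пример.бг"))            # xn--e1afmkfd.xn--90ae
print(fqdn_to_ace("müller-büromöbel.de"))  # xn--mller-brombel-rmb4fg.de
```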
The following table gives an overview of the different variants of the Punycode syntax.
|Type of label|Domain name|Punycode|ACE string|
|---|---|---|---|
|Base and non-base characters|müller-büromöbel.de|mller-brombel-rmb4fg.de|xn--mller-brombel-rmb4fg.de|
|Only non-base characters|Παράδειγμα.gr|hxajbheg2az3al.gr|xn--hxajbheg2az3al.gr|
|Only base characters|example.org|example-.org|Not used|
Because the DNS limits each label to 63 characters, the compactness of the encoding matters: the complete ACE string, including the prefix, has to fit within this limit.
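A length check on the encoded label might look like the sketch below. The function name `checked_ace_label` and the choice of `ValueError` are assumptions for this example; the limit of 63 characters is the one named in the text.

```python
MAX_LABEL_LENGTH = 63  # maximum length of a single DNS label


def checked_ace_label(label: str) -> str:
    label = label.lower()
    if label.isascii():
        ace = label
    else:
        ace = "xn--" + label.encode("punycode").decode("ascii")
    # The limit applies to the encoded form, including the ACE prefix.
    if len(ace) > MAX_LABEL_LENGTH:
        raise ValueError(f"encoded label has {len(ace)} characters")
    return ace


print(checked_ace_label("müller-büromöbel"))  # xn--mller-brombel-rmb4fg

try:
    checked_ace_label("ü" * 70)  # too long once encoded
except ValueError as error:
    print("rejected:", error)
```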
Unicode characters are not transferred one-to-one into ASCII characters during coding. Instead, the algorithm computes a string of base characters that encodes, for each removed character, its code point (as a difference from the previously handled code point) and its position in the original string.
In the example above, the character string rmb4fg encodes the information that mller-brombel must be supplemented with the character ö at the 13th position and the character ü at the 2nd and 9th positions of the resulting string.
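Decoding the example with Python's `punycode` codec shows where the non-base characters end up. Positions are counted from 1 in the decoded string.

```python
decoded = b"mller-brombel-rmb4fg".decode("punycode")
print(decoded)  # müller-büromöbel

# List every non-base character together with its 1-based position.
positions = [(index, char)
             for index, char in enumerate(decoded, start=1)
             if not char.isascii()]
print(positions)  # [(2, 'ü'), (9, 'ü'), (13, 'ö')]
```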
The underlying Punycode algorithm is described in detail in RFC 3492. In addition, the document provides an implementation of the coding procedure in the programming language C.
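For readers who prefer Python to C, the encoder can be sketched roughly as follows. This is a simplified sketch that loosely follows the encoding procedure in RFC 3492 with the Punycode parameters; it omits overflow checks and case handling, and the function names are chosen for this example. It is not the reference implementation.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta: int, numpoints: int, firsttime: bool) -> int:
    """Bias adaptation: tunes the digit thresholds to the deltas seen so far."""
    delta = delta // DAMP if firsttime else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)


def punycode_encode(text: str) -> str:
    # Copy the base (ASCII) characters first, followed by the delimiter.
    basic = [ch for ch in text if ord(ch) < 128]
    output = basic[:]
    b = h = len(basic)  # h counts the code points handled so far
    if b:
        output.append("-")
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while h < len(text):
        # Next-smallest code point that has not been handled yet.
        m = min(ord(ch) for ch in text if ord(ch) >= n)
        delta += (m - n) * (h + 1)
        n = m
        for ch in text:
            if ord(ch) < n:
                delta += 1
            elif ord(ch) == n:
                # Emit delta as a generalized variable-length integer.
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(output)


print(punycode_encode("müller-büromöbel"))  # mller-brombel-rmb4fg
```

Comparing the result with Python's built-in `punycode` codec for a few inputs is a quick way to sanity-check such a sketch.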
When encoding internationalized domain names, users usually use freely available Punycode converters.
Free Punycode converter
Free Punycode converters for translating IDNs into an ASCII-compatible representation can be found on various websites.
We recommend Punycoder, which converts Punycode to text/Unicode and vice versa. The tool uses the IDNA2008 standard with the compatibility processing defined in Unicode Technical Standard #46.
Punycode with emoji domains
Not only internationalized domain names but also emoji domains can be implemented via Punycode. The prerequisite is that the respective top-level domain permits the use of emojis and that the desired emoji is part of the Unicode standard.
At the moment, the following TLDs allow registration of emoji domains: .ws, .tk, .to, .ml, .ga, .cf, .gq, and .fm.
Emoji domains are technically processed as Punycode, but in theory should be presented to the user as a combination of text and emoticons.
Emoji domain: i❤.ws/
Practically no standard browser implements this at present. If you enter an emoji domain in Firefox, Chrome, Safari, Edge, or Opera, the address bar only shows the ACE string.
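The ACE string for the emoji domain above can be derived with Python's `punycode` codec. The sketch assumes the heart is the single code point U+2764 without a trailing variation selector; a different code point sequence would give a different result.

```python
emoji_label = "i\u2764"  # "i" followed by U+2764 HEAVY BLACK HEART
ace_label = "xn--" + emoji_label.encode("punycode").decode("ascii")
print(ace_label + ".ws")  # xn--i-7iq.ws
```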
Punycode as a security risk?
Punycode becomes a security risk in connection with homograph phishing: cyberattacks in which criminals exploit the similar appearance of different characters to lure unsuspecting victims to fake websites.
This leads internet users to a website with the following IDN:
The URL provided is not the official website of the California technology company Apple Inc. but a phishing website created for demonstration purposes.
Instead of the ASCII character a with Unicode U+0061, the Cyrillic а (U+0430) is used – these two characters can hardly be distinguished with the naked eye, but are interpreted as different characters by web browsers.
What makes matters worse for internet users is that even certificates offer no protection. For modern phishing campaigns, criminals obtain valid SSL/TLS certificates to make their websites look legitimate.
Current versions of Chrome and Opera prevent phishing attacks like these by displaying the ACE string instead of the internationalized domain on IDNs that mix characters from different character sets. Internet Explorer and Microsoft Edge completely prevent domains like these from being accessed. Only Firefox does not offer any protection against Punycode phishing.
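The idea of falling back to the ACE string for mixed-script labels can be illustrated with a deliberately crude heuristic. The sketch below uses the first word of each character's Unicode name as a stand-in for its script, which is an assumption made for brevity; real browsers apply far more elaborate rules, and this check does not catch labels written entirely in one look-alike script. The function names and the sample label are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    # Crude approximation: first word of the Unicode name, letters only.
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in label if ch.isalpha()}


def display_form(label: str) -> str:
    # Show the ACE string if the label mixes several scripts.
    if len(scripts_in(label)) > 1:
        return "xn--" + label.encode("punycode").decode("ascii")
    return label


spoofed = "\u0430pple"  # Cyrillic а (U+0430) followed by Latin "pple"
print(sorted(scripts_in(spoofed)))  # ['CYRILLIC', 'LATIN']
print(display_form(spoofed))        # shown as an ACE string starting with xn--
print(display_form("müller"))       # müller
```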
This is how Firefox users can protect themselves: In order to reduce the risk that phishing websites pose, Firefox users currently only have the option of preventing Punycode from being translated into IDNs in general. Only two steps are necessary for this temporary solution:
- Access the configuration editor: Type about:config in the address bar of your web browser to open the Firefox configuration editor.
- Force Punycode: Find the setting network.IDN_show_punycode and change its value from false to true.
After configuration, Firefox will display internationalized domains in the address bar as ACE strings.