Good internationalisation practices for web developers, part 1 : Characters & fonts

Published by: Tim Bakkum,

  • Localisation
  • Internationalisation
  • Front-end development

As a web developer, front-end or back-end, you might have to deal with making multilingual themes. In order for a website to be translated, or localised to be more specific, it has to be set up in a way that allows all the localisable content to be easily extracted, translated into other languages, and put back in and displayed properly. This process is referred to as internationalisation (i18n) and usually happens before the localisation and translation phase.

Why?

I18n is important because it saves time. Developers are not necessarily translators, and vice versa. Proper i18n allows translators to translate and/or adapt content in any language without having to touch the source code of the website. In this series of articles I will discuss good practices which you can start using immediately. In this article, I focus on character encoding and fonts in relation to website internationalisation.

Character encoding & fonts

Not every language uses the same characters. Your website should have proper language declarations and character encoding so that text is displayed properly, no matter the language of the content. In the past, a character encoding limited you to one script or a small group of languages. Nowadays UTF-8 can represent the characters of virtually every written language, including those that need multiple bytes per character, such as Chinese or Japanese. UTF-8 is an encoding of Unicode, a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

UTF-8 can encode all the different assets of your multilingual web application, and you should use it for all of them. Declare UTF-8 encoding consistently everywhere: in your JavaScript, XML, databases, HTML, CSS, PHP, Java, etc. Also make sure that the server serves the document in UTF-8. Not doing so risks encoding/decoding errors when these different technologies communicate. For example, if you request information from a MySQL database to render in an HTML document without setting the right character set, the result might be “distorted” because the encodings don’t align and the data cannot be interpreted properly.

To give you an example of these scrambled characters, take a look at this page I set up encoded in ISO-8859-1, which only supports characters from Latin script, containing Japanese characters from Katakana and Hiragana. Then compare it to the same page encoded in UTF-8, where Latin script happily mingles with Japanese. Avoiding these weird-looking characters not only improves the readability of your website but also allows search engines to index your content properly.
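
The encoding mismatch described above can be reproduced in a few lines. The following is a minimal Python sketch, not code from the demo pages: the sample string and the variable names are assumptions chosen for illustration. It writes Japanese text as UTF-8 bytes and then reads those bytes back once with the wrong encoding and once with the right one.

```python
# Sketch of an encoding mismatch: bytes written as UTF-8
# but read back as ISO-8859-1 (Latin-1).
text = "カタカナ ひらがな"  # sample Katakana and Hiragana, chosen for illustration

utf8_bytes = text.encode("utf-8")

# Wrong: the reader assumes ISO-8859-1, so every single byte
# is turned into one Latin-1 character.
garbled = utf8_bytes.decode("iso-8859-1")

# Right: the reader uses the same encoding as the writer.
restored = utf8_bytes.decode("utf-8")

# ISO-8859-1 has no way to represent the Japanese characters at all.
try:
    text.encode("iso-8859-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False

print(len(text), len(utf8_bytes), len(garbled))
print(restored == text, latin1_can_encode)
```

The garbled string has as many characters as the UTF-8 data has bytes, which is why a short Japanese sentence turns into a much longer run of accented Latin letters and control characters when the encodings don't match.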

Fonts & glyphs

Great, so you have ensured that your data is saved and displayed using UTF-8 encoding. Now the graphic designer wants to use a fancy font… Most fonts commonly used in web design only support Latin characters, and some of them are limited in the number of special characters they offer, like letters with accents or language-specific letters. When talking about a font, those letters or special characters are referred to as glyphs. When choosing a font, you can check which glyphs are available. The font that is used for this text, Lato, does not support all glyphs, as you can see in the following screenshot of the character set of the Lato font in the Google Fonts service. The X marks indicate when a glyph is not available.

The developer should make sure, in collaboration with the designer, that all characters are rendered properly in order to prevent characters being displayed as “tofu”, ꟞꟞꟞꟞ , the little boxes that indicate that there is no font available to display the character. Google’s Noto font family is free and aims to cover all languages. In fact, Noto stands for "No more Tofu", according to Google. If the designer of your web application absolutely wants to use a certain font for artistic reasons, you can always include a different font for the languages that are not supported by that font, and load it conditionally whenever necessary.
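
One way to picture the fallback idea is a small helper that builds a font stack per language. This is only a sketch in Python: the `FALLBACK_FONTS` mapping, the `font_stack` and `lang_rules` helpers, and the choice of Noto families are assumptions made for illustration, not a recommendation from this article. The preferred font comes first, so it is used for every glyph it contains, and the fallback only takes over for the characters it lacks.

```python
# Hypothetical mapping from language tag to a fallback font family.
# The entries are illustrative assumptions; pick whatever suits your design.
FALLBACK_FONTS = {
    "ja": "Noto Sans JP",
    "zh-Hans": "Noto Sans SC",
    "ar": "Noto Sans Arabic",
}

BRAND_FONT = "Lato"  # the designer's preferred font in this sketch


def font_stack(lang: str) -> str:
    """Return a CSS font-family value for the given language tag."""
    fallback = FALLBACK_FONTS.get(lang)
    if fallback is None:
        # No fallback registered: the preferred font is assumed to be enough.
        return f'"{BRAND_FONT}", sans-serif'
    return f'"{BRAND_FONT}", "{fallback}", sans-serif'


def lang_rules() -> str:
    """Build one :lang() CSS rule per language that needs a fallback font."""
    return "\n".join(
        f":lang({lang}) {{ font-family: {font_stack(lang)}; }}"
        for lang in FALLBACK_FONTS
    )


print(lang_rules())
```

The same mapping could drive the conditional loading itself: a template would only include the stylesheet or font file for a fallback family on pages whose language needs it.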

Things to watch out for

  • Font sizing: as a general rule, Asian scripts need to be rendered bigger than their Latin counterparts to ensure readability.
  • Font styles such as bold and italic are not used in every language you are translating to; Asian languages are an example. As a developer, make sure that styles like these are not hard-coded into the (HTML) markup.
  • Cases: languages such as Hebrew do not have cases (uppercase and lowercase).
  • Punctuation: avoid hard-coding punctuation (spaces, commas, full stops, etc.), because not every language uses punctuation the way European languages do.
  • Quotation marks can be language-specific, such as the French guillemets « for example ». As each form of quotation mark is mapped to a different Unicode character, you should make sure that the correct one is being used for the current language. Check out this tutorial for using French quotation marks on HTML quote tags using different CSS properties.
  • Ellipsis: some languages may require special attention when truncating text. In languages such as Arabic, words may actually become longer when you take away letters, or shorter when you add characters! This may cause problems when you only want to show a certain number of characters of a text. Apple’s iPhone has had a well-known bug in the past due to this problem; check out the video for an in-depth explanation of this phenomenon: https://www.youtube.com/watch?v=hJLMSllzoLA
  • Font declaration order matters when declaring multiple fonts. As some Asian fonts also contain glyphs for Latin script, results may vary depending on the order in which fonts are declared, as Kendra Schaeffer describes in her blog article: Chinese Standard Web Fonts: A Guide to CSS Font Family Declarations for Web Design in Simplified Chinese.
  • Font file size: even a Simplified Chinese font has over 20,000 glyphs, which makes font files too big (3-7 MB) to serve with the @font-face CSS rule. However, there are online Typekit-like services, like Youziku and Justfont, that only serve the characters that your webpage needs, to lighten the load for the user. This is currently the only viable alternative to using the default system font to display Chinese characters.
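
The idea behind serving only the characters a page needs, mentioned in the last point above, can be sketched in a few lines of Python. This is not how Youziku or Justfont work internally, which this article does not cover. The `unicode_range` helper and the sample text are assumptions for illustration. The helper collects the non-ASCII characters a page uses and formats them as a value for the CSS `unicode-range` descriptor of an `@font-face` rule.

```python
def unicode_range(text: str) -> str:
    """Return a CSS unicode-range value for the non-ASCII characters in text."""
    # A set removes duplicates; sorting keeps the output stable.
    code_points = sorted({ord(ch) for ch in text if ord(ch) > 0x7F})
    return ", ".join(f"U+{cp:04X}" for cp in code_points)


page_text = "你好，世界 Hello"  # sample page content, chosen for illustration
needed = unicode_range(page_text)
print(needed)
```

The resulting list tells you which glyphs a subset font has to contain. Producing the smaller font file itself requires a separate font tool, which is outside the scope of this sketch.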

Naturally, you should always have native speakers of the languages you are localising into test and review the web application, to catch mistakes that are too language-specific for you to notice yourself. You might be wondering, what about text direction? That happens to be the topic of the next article in the series. In future articles I will also focus on images, string translation and front-end implementation of web design. Want to learn more about where internationalisation fits into the localisation industry workflow? Check out my article What is Localisation?