
Section 20 Internationalization

Supporting a multitude of characters across many languages and many output formats can be a challenge. One of our goals is to make this much easier for authors. Fortunately, the Unicode standard is a great improvement over the 7-bit ASCII standard of old.

Unicode Characters for HTML Output.

First, we discuss HTML output. If you include Unicode characters in your PreTeXt source, they should survive just fine en route to a web browser or e-reader. Here are the caveats for HTML output:
  • So that you can continue to get the best results with print and PDF output, use available empty elements for obscure characters, even if targeting HTML output, before resorting to a Unicode character. For example, use <copyright/> for the copyright symbol in text before resorting to the Unicode character U+00A9. It is a bit more work, but you will get better results with other conversions, even if you initially are only fascinated by HTML.
  • How you actually enter Unicode characters into your source file is dependent on your editor and operating system, and is therefore outside the scope of our documentation. You can cut-and-paste characters and text from the source of our examples for initial testing and experimentation.
  • Always, always identify your source as having Unicode characters by including the incantation
    <?xml version="1.0" encoding="UTF-8" ?>
    as the first line of your source file. (You may be able to accurately cut-and-paste this version here. But if the copy has non-standard characters in it, go back to the top of this source file for a copy.)
  • Alan Wood’s Unicode Resources (www.alanwood.net/unicode/unicode_samples.html) has a plethora of samples of various groups of Unicode characters. If you, or your readers, are “missing” characters in a web browser, this is a good place to start testing the local setup.
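As a rough check of the advice above, the following Python sketch parses a tiny UTF-8 document with the standard library. The `p` wrapper element and the sample text are made up for illustration and are not a statement about PreTeXt markup; the only point is that a literal Unicode character and its hexadecimal character reference should arrive as the same character once the encoding is declared.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample source; the <p> wrapper is only for illustration.
source = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    '<p>\u00a9 and &#x00A9;</p>'
)

# Encode to bytes so the parser reads the encoding declaration itself.
root = ET.fromstring(source.encode("utf-8"))
literal, _, escaped = root.text.partition(" and ")

# Both spellings should yield the single code point U+00A9.
same = literal == escaped == "\u00a9"
```

If `same` turned out to be false for a real file, a mismatch between the declared encoding and the encoding the editor actually saved would be a reasonable first suspect.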

Characters in LaTeX, PDF, print.

The situation for LaTeX is a bit more complicated, since LaTeX pre-dates Unicode’s widespread adoption.
This sample article is intended to work well, out-of-the-box, for authors just starting with PreTeXt. So we only include here examples that we know are likely to convert to PDF without any errors. For more extensive examples and experiments, we provide the sample document examples/fonts/fonts-and-characters.xml, so be aware of that example as you look to see what is possible.
Similarly, you should be able to process this sample article successfully with various engines. We test regularly with pdflatex and xelatex and provide online sample PDF output of this document processed by pdflatex. In principle, you should be able to use latex (to produce a DVI), and possibly other (unsupported) engines, such as lualatex.
Once you get beyond the Latin alphabet, with accents common in Western Europe and the Western Hemisphere, you will almost assuredly need to restrict your attention to producing PDF output with the xelatex engine. This is discussed and tested in examples/fonts/fonts-and-characters.xml.

Basic Latin, U+0000–U+007F.

Unicode text is commonly stored as sequences of 8-bit bytes (as in UTF-8), and code points are typically written in hexadecimal (base 16) notation. A single byte offers 256 values, and the first 128 code points (hex 00 to 7F) are the “usual” Latin characters, with some values used as control codes. U+0000 to U+001F are control codes and are not used here. U+007F is also a control code and so is excluded. The remaining 95 characters are the most basic, and will all render using pdflatex or xelatex with no special setup (and will render easily in HTML). U+0020 is a space, so it appears invisible in the table. In the source we have authored each character by its escaped version using its Unicode number (in hexadecimal). So, for example, capital-B is authored as &#x0042;.
Table 20.1. Basic Latin, Regular
0 1 2 3 4 5 6 7 8 9 A B C D E F
002_ ! " # $ % & ( ) * + , - . /
003_ 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
004_ @ A B C D E F G H I J K L M N O
005_ P Q R S T U V W X Y Z [ \ ] ^ _
006_ ` a b c d e f g h i j k l m n o
007_ p q r s t u v w x y z { | } ~
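The count of 95 characters and the escaped form described above can be reproduced with a short Python sketch. The helper name `char_ref` is hypothetical and chosen only for this illustration.

```python
# Code points U+0000 to U+007F make up the Basic Latin block.
basic_latin = range(0x00, 0x80)

# Drop the control codes U+0000..U+001F and U+007F.
printable = [cp for cp in basic_latin if 0x20 <= cp <= 0x7E]

def char_ref(cp: int) -> str:
    """Return a hexadecimal XML character reference such as &#x0042;."""
    return f"&#x{cp:04X};"

count = len(printable)        # 95, including the space U+0020
ref_b = char_ref(ord("B"))    # "&#x0042;"
```

The four-digit, zero-padded form matches the style used in this section; XML itself also accepts shorter references such as `&#x42;`.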

Latin-1 Supplement, U+0080–U+00FF.

Now we are interested in the next 128 code points (hex 80 to FF). The first 32 are again control codes. U+00A0 is a non-breaking space, so it is invisible, while U+00AD is a soft hyphen (which we have not implemented and so is excluded). We have taken care to see that the remainder will render using pdflatex or xelatex with no special setup (and HTML). In the source we have authored each character by its escaped version using its Unicode number (in hexadecimal). So, for example, a copyright symbol is authored as &#x00A9;.
Table 20.2. Latin-1 Supplement, Regular
0 1 2 3 4 5 6 7 8 9 A B C D E F
00A_   ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
00B_ ° ± ² ³ ´ µ · ¸ ¹ º » ¼ ½ ¾ ¿
00C_ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
00D_ Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
00E_ à á â ã ä å æ ç è é ê ë ì í î ï
00F_ ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
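The breakdown of this block can be checked with Python's standard `unicodedata` module. This is only a sketch: the variable names are illustrative, and the two-byte observation applies to UTF-8 specifically.

```python
import unicodedata

# Code points in the Latin-1 Supplement block, U+0080 to U+00FF.
block = range(0x80, 0x100)

# Control codes carry the Unicode general category "Cc".
controls = [cp for cp in block if unicodedata.category(chr(cp)) == "Cc"]

# Everything else except the no-break space and the soft hyphen.
visible = [cp for cp in block if cp not in controls and cp not in (0xA0, 0xAD)]

# Every code point in this block needs two bytes in UTF-8.
byte_lengths = {len(chr(cp).encode("utf-8")) for cp in block}

# The copyright symbol as a hexadecimal character reference.
copyright_ref = f"&#x{0xA9:04X};"
```

Here `controls` covers U+0080 to U+009F, and `visible` holds the 94 characters that remain once the invisible no-break space and the excluded soft hyphen are set aside.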

Monospace, Basic Latin and Latin-1 Supplement, U+0000–U+00FF.

A monospace font is critical for samples of keyboard input and to distinguish exact technical input from running commentary. We list here all of the reasonable characters from the first 256 Unicode code points. (We skip the same 65 control characters from above, and the soft hyphen.) These should all render fine in HTML and when processed with xelatex. However, our focus for PDF output in this sample article is on what is possible when processed with pdflatex. First, characters from U+0000–U+007F.
Table 20.3. Basic Latin, Monospace
0 1 2 3 4 5 6 7 8 9 A B C D E F
002_ ! " # $ % & ' ( ) * + , - . /
003_ 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
004_ @ A B C D E F G H I J K L M N O
005_ P Q R S T U V W X Y Z [ \ ] ^ _
006_ ` a b c d e f g h i j k l m n o
007_ p q r s t u v w x y z { | } ~
Note that the single and double quotes are upright and dumb, not curly and smart: ' " ' " ' ". And a backtick is a backtick: ` ` `. The zero is distinguished from the capital “oh”: 0 O 0 O 0 O. And the numeral one is slightly different from the lower-case “ell”: 1 l 1 l 1 l. The hyphen should be short and not expanded into some other kind of dash: - - -. These characters should all cut/paste out of a PDF into a text editor with no conversion to other characters.
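One way to test the cut/paste claim is to scan the pasted text for characters that commonly replace plain ASCII ones. The sketch below is an assumption about what such a check might look like; the table `SUSPECTS` and the function `find_conversions` are hypothetical names, and the list of suspect characters is not exhaustive.

```python
# Characters that often replace plain ASCII ones after copy and paste.
SUSPECTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}

def find_conversions(pasted: str) -> list[tuple[int, str, str]]:
    """Return (index, found, expected) for each suspect character."""
    return [(i, ch, SUSPECTS[ch]) for i, ch in enumerate(pasted) if ch in SUSPECTS]

# Plain ASCII text passes; text with curly quotes is flagged.
clean = find_conversions("' \" ` - 0 O 1 l")
dirty = find_conversions("\u2018x\u2019")
```

An empty result suggests the characters survived the round trip unchanged, at least for the substitutions listed.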
Now the remaining characters from U+0080–U+00FF. The program tag is implemented in LaTeX via the listings package, and these characters require ad-hoc replacements for processing by pdflatex. (You can see the replacements in the preamble of the LaTeX source for this document.) The replacement mechanism provided by the listings package can cause the characters below to produce a compilation error when they are processed by pdflatex inside a table cell in certain situations (which we have avoided in the table below). The only workaround in this case is to switch to xelatex.
Table 20.4. Latin-1 Supplement, Monospace
0 1 2 3 4 5 6 7 8 9 A B C D E F
00A_ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
00B_ ° ± ² ³ ´ µ · ¸ ¹ º » ¼ ½ ¾ ¿
00C_ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
00D_ Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
00E_ à á â ã ä å æ ç è é ê ë ì í î ï
00F_ ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
The pre tag is implemented in LaTeX with the fancyvrb package. You can compare results here with the table above; each line here corresponds to a row above.
  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
The console tag is also implemented with fancyvrb, with adjustments for the input lines. It will not look like it, but these are 8 such inputs, with similar results to above, but now bolded.
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬   ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
We take care to render the U+0080–U+00FF characters in Sage cells. This would allow some flexibility in comments and strings employed. The following is just a test of these characters in the input and output of a sage element. This is not functional code.
The table below has a single column, and each cell of the table has a string of 10 characters inside a c element. It is meant to test if the font is monospace in this situation.
Table 20.5. Alignment Test
0123456789
9876543210
iiiiiiiiii
mmmmmmmmmm
Again, more examples and more thorough explanations can be found in the sample: examples/fonts/fonts-and-characters.xml. Be aware that the nature of the more advanced sample is that it will likely produce many errors when processed with pdflatex. Adding -interaction batchmode or -interaction nonstopmode to the pdflatex command-line will sometimes be less painful than acknowledging each error. The more advanced sample will perform well when processed with xelatex.
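For scripted builds, the interaction mode can be set when the command line is assembled. The sketch below only builds the command and does not run it. The helper `pdflatex_command` is hypothetical, and the file name `fonts-and-characters.tex` is an assumed name for the LaTeX file produced from the sample.

```python
import shlex

def pdflatex_command(tex_file: str, mode: str = "nonstopmode") -> list[str]:
    """Build a pdflatex command line that does not stop at each error."""
    if mode not in {"batchmode", "nonstopmode"}:
        raise ValueError(f"unsupported interaction mode: {mode}")
    return ["pdflatex", f"-interaction={mode}", tex_file]

# Assumed file name, for illustration only.
command = pdflatex_command("fonts-and-characters.tex", mode="batchmode")
printable_command = shlex.join(command)
```

The resulting list could be handed to `subprocess.run(command)`; errors are still written to the log file, so it remains worth reading after the run.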