Section 22 Internationalization
Supporting a multitude of possible characters, across many languages and across many output formats, can be a challenge. One of our goals is to make this much easier for authors. Fortunately, the Unicode standard has brought improvements over the 7-bit ASCII standard of old.
Unicode Characters for HTML Output.
First, we discuss HTML output. If you include Unicode characters in your PreTeXt source, they should survive just fine en route to a web browser or e-reader. Here are the caveats for HTML output:
- So that you can continue to get the best results with print and PDF output, use available empty elements for obscure characters, even if targeting HTML output, before resorting to a Unicode character. For example, use <copyright/> for the copyright symbol in text before resorting to the Unicode character U+00A9. It is a bit more work, but you will get better results with other conversions, even if you initially are only fascinated by HTML.
- How you actually enter Unicode characters into your source file is dependent on your editor and operating system, and is therefore outside the scope of our documentation. You can cut-and-paste characters and text from the source of our examples for initial testing and experimentation.
- Always, always identify your source as having Unicode characters by including the incantation <?xml version="1.0" encoding="UTF-8" ?> as the first line of your source file. (You may be able to accurately cut-and-paste this version here. But if the copy has non-standard characters in it, go back to the top of this source file for a copy.)
- Alan Wood’s Unicode Resources (www.alanwood.net/unicode/unicode_samples.html) has a plethora of samples of various groups of Unicode characters. If you, or your readers, are “missing” characters in a web browser, this is a good place to start testing the local setup.
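The advice about the XML declaration can be checked mechanically. What follows is only a minimal sketch and not part of PreTeXt: the helper name check_declaration and the two sample byte strings are hypothetical, and the check is deliberately strict about the exact form of the declaration quoted above.

```python
import re

# Hypothetical helper, not part of PreTeXt. It accepts only the exact
# declaration quoted in the text, allowing for varying whitespace.
DECLARATION = re.compile(r'^<\?xml\s+version="1\.0"\s+encoding="UTF-8"\s*\?>')


def check_declaration(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 and begins with the declaration."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8 at all.
        return False
    return DECLARATION.match(text) is not None


# Sample inputs (made up for illustration).
good = '<?xml version="1.0" encoding="UTF-8" ?>\n<pretext>\u00a9</pretext>\n'.encode("utf-8")
bad = "<pretext>\u00a9</pretext>\n".encode("latin-1")

print(check_declaration(good))
print(check_declaration(bad))
```

The second sample fails twice over: it lacks the declaration, and its copyright symbol is stored as a single Latin-1 byte that is not valid UTF-8.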
Characters in LaTeX, PDF, print.
The situation for LaTeX is a bit more complicated, since TeX pre-dates Unicode’s widespread adoption.
This sample article is intended to work well, out-of-the-box, for authors just starting with PreTeXt. So we only include here examples that we know are likely to convert to PDF without any errors. For more extensive examples and experiments, we provide the sample document examples/fonts/fonts-and-characters.xml, so be aware of that example as you look to see what is possible.
Similarly, you should be able to process this sample article successfully with various LaTeX engines. We test regularly with pdflatex and xelatex and provide online sample PDF output of this document processed by pdflatex. In principle, you should be able to use latex (to produce a DVI), and possibly other (unsupported) engines, such as lualatex.
Once you get beyond the Latin alphabet, with accents common in Western Europe and the Western Hemisphere, you will almost assuredly need to restrict your attention to producing PDF output with the xelatex engine. This is discussed and tested in examples/fonts/fonts-and-characters.xml.
Basic Latin, U+0000–U+007F.
Unicode may use multiple 8-bit bytes to represent a character, and character numbers are typically expressed in hexadecimal (base 16) notation. Using just a single byte, we can get 256 values, and the first 128 (hex 00 to 7F) are the “usual” Latin characters with some values used as control codes. The 95 characters in the table below are the most basic, and will all render using pdflatex or xelatex with no special setup (and will render easily in HTML). U+0000 to U+001F are control codes and not used here. U+007F is also a control code and so is excluded, while U+0020 is a space, so appears invisible in the table. In the source we have authored each character by its escaped version using its Unicode number (in hexadecimal). So, for example, capital-B is authored as &#x42;.
     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
002_ |   | ! | " | # | $ | % | & | ’ | ( | ) | * | + | , | - | . | / |
003_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
004_ | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
005_ | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
006_ | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
007_ | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ |   |
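To make the numbering concrete, here is a small Python sketch offered purely as an illustration. The helper name char_ref is hypothetical. It forms the hexadecimal character reference for a character, counts the printable Basic Latin characters, and shows how many bytes a character occupies once encoded as UTF-8 (one common encoding of Unicode).

```python
def char_ref(ch: str) -> str:
    """Hexadecimal XML character reference for a single character."""
    return "&#x{:X};".format(ord(ch))


# Printable Basic Latin runs from U+0020 (space) through U+007E (tilde);
# the control codes U+0000 to U+001F and U+007F are left out.
printable = [chr(cp) for cp in range(0x20, 0x7F)]

print(len(printable))                  # how many characters the table holds
print(char_ref("B"))                   # reference for capital-B
print(len("B".encode("utf-8")))        # bytes needed for a Basic Latin character
print(len("\u00a9".encode("utf-8")))   # bytes needed for the copyright symbol
```

A Basic Latin character fits in one UTF-8 byte, while a character from the next block, such as the copyright symbol, needs two.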
Latin-1 Supplement, U+0080–U+00FF.
Now we are interested in the next 128 possible bytes (hex 80 to FF). The first 32 are again control codes and U+00A0 is a non-breaking space, so is invisible, while U+00AD is a soft hyphen (which we have not implemented and so is excluded). We have taken care to see that the remainder will render using pdflatex or xelatex with no special setup (and in HTML). In the source we have authored each character by its escaped version using its Unicode number (in hexadecimal). So, for example, a copyright symbol is authored as &#xA9;.
     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
00A_ |   | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ |   | ® | ¯ |
00B_ | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
00C_ | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
00D_ | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
00E_ | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
00F_ | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
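The exclusions described for this block can be cross-checked with Python's standard unicodedata module. This is only a sketch for verifying the counts, not anything PreTeXt itself does, and the helper name general_category is hypothetical.

```python
import unicodedata


def general_category(cp: int) -> str:
    """Unicode general category of a code point, such as 'Cc' or 'Lu'."""
    return unicodedata.category(chr(cp))


block = range(0x80, 0x100)  # Latin-1 Supplement, U+0080 through U+00FF

# Control codes have category Cc.
controls = [cp for cp in block if general_category(cp) == "Cc"]

# U+00AD SOFT HYPHEN is a format character (category Cf), so it is
# dropped here along with the control codes.
shown = [cp for cp in block if general_category(cp) not in ("Cc", "Cf")]

print(len(controls))
print(len(shown))
print(unicodedata.name("\u00a0"))
```

The non-breaking space is kept by this filter, since it is a genuine (if invisible) character, which matches the empty first cell of the 00A_ row.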
Monospace, Basic Latin and Latin-1 Supplement, U+0000–U+00FF.
A monospace font is critical for samples of keyboard input and to distinguish exact technical input from running commentary. We list here all of the reasonable characters from the first 256 Unicode code points. (We skip the same 65 control characters from above, and the soft hyphen.) These should all render fine in HTML and when processed with xelatex; however, our focus with this sample article for PDF output is the capabilities when processed with pdflatex. First, characters from U+0000–U+007F.
     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
002_ |   | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
003_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
004_ | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
005_ | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
006_ | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
007_ | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ |   |
Note that the single and double quotes are upright and dumb, not curly and smart: ' " ' " ' ". And a backtick is a backtick: ` ` `. The zero is distinguished from the capital “oh”: 0 O 0 O 0 O. And the numeral one is slightly different from the lower-case “ell”: 1 l 1 l 1 l. The hyphen should be short and not expanded into some other kind of dash: - - -. These characters should all cut/paste out of a PDF into a text editor with no conversion to other characters.
Now the remaining characters from U+0080–U+00FF. The program tag is implemented in LaTeX via the listings package and these characters require ad-hoc replacements for processing by pdflatex. (You can see the replacements in the preamble of the LaTeX source for this document.) The replacement mechanism provided by the listings package will cause the characters below to produce a LaTeX compilation error if processed by pdflatex and in a table cell in certain situations (which we have avoided in the table below). The only workaround in this case is to switch to xelatex.
     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
00A_ |   | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ |   | ® | ¯ |
00B_ | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
00C_ | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
00D_ | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
00E_ | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
00F_ | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
The pre tag is implemented in LaTeX with the fancyvrb package. You can compare results here with the table above; lines here are rows above.
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
The console tag is also implemented with fancyvrb, with adjustments for the input lines. It will not look like it, but these are 8 such inputs, with similar results to above, but now bolded.
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
We take care to render the U+0080–U+00FF characters in Sage cells. This would allow some flexibility in comments and strings employed. The following is just a test of these characters in the input and output of a sage element. This is not functional code.
The table below has a single column, and each cell of the table has a string of 10 characters inside a c element. It is meant to test if the font is monospace in this situation.
0123456789 |
9876543210 |
iiiiiiiiii |
mmmmmmmmmm |
Again, more examples and more thorough explanations can be found in the sample examples/fonts/fonts-and-characters.xml. Be aware that the nature of the more advanced sample is that it will likely produce many errors when processed with pdflatex. Adding -interaction batchmode or -interaction nonstopmode to the pdflatex command-line will sometimes be less painful than acknowledging each error. The more advanced sample will perform well when processed with xelatex.
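For authors who script their builds, the interaction modes mentioned above might be wrapped as in the following sketch. It is an illustration only: the helper name build_command and the file name fonts-and-characters.tex are hypothetical, and the option is written in its -interaction=MODE form. The sketch only assembles the command and leaves running it to the reader.

```python
# The four interaction modes understood by TeX engines.
MODES = ("batchmode", "nonstopmode", "scrollmode", "errorstopmode")


def build_command(engine: str, tex_file: str, mode: str = "nonstopmode") -> list:
    """Assemble an argument list for a LaTeX engine with an interaction mode."""
    if mode not in MODES:
        raise ValueError("unknown interaction mode: " + mode)
    return [engine, "-interaction=" + mode, tex_file]


# Hypothetical file name, standing in for LaTeX produced from the sample.
quiet = build_command("pdflatex", "fonts-and-characters.tex", "batchmode")
print(" ".join(quiet))

# The same helper works for the engine the text recommends for that sample.
print(" ".join(build_command("xelatex", "fonts-and-characters.tex")))
```

The resulting list can be handed to subprocess.run from the standard library once the engine is known to be installed.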