Internet Draft                                              Paul Hoffman
draft-ietf-idn-nameprep-00.txt                                IMC & VPNC
July 3, 2000                                               Marc Blanchet
Expires in six months                                           ViaGenie

              Preparation of Internationalized Host Names

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.


Abstract

This document describes how to prepare internationalized host names for
transmission on the wire. The steps include excluding characters that
are prohibited from appearing in internationalized host names, changing
all characters that have case properties to be lowercase, and
normalizing the characters. Further, this document lists the prohibited
characters.


1. Introduction

When expanding today's DNS to include internationalized host names,
those new names will be handled in many parts of the DNS. The IDN
Working Group's requirements document [IDNReq] describes a framework for
domain name handling as well as requirements for the new names. The IDN
Working Group's comparison document [IDNComp] gives a framework for how
various parts of the IDN solution work together.

A user can enter a domain name into an application program in a myriad
of fashions. Depending on the input method, the characters entered in
the domain name may or may not be those that are allowed in
internationalized host names. Thus, there must be a way to canonicalize
the user's input before the name is resolved in the DNS.

It is a design goal of this document to allow users to enter host names
in applications and have the highest chance of getting the name correct.
This means that the user should not be limited to entering exactly the
characters that might have been used, but should instead be able to
enter characters that unambiguously canonicalize to characters in the
desired host name. At the same time, this process must not introduce any
chance that two host names could be represented by two distinct strings
of characters that look identical to typical users. It is also a design
goal to have all preprocessing of IDN done before going on the wire, so
that no transformation is done in the DNS server space.

This document describes the steps needed to convert a name part from one
that is entered by the user to one that can be used in the DNS.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the
letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
A". In the lists of prohibited characters, the "U+" is left off to make
the lists easier to read.

1.2 IDN summary

Using the terminology in [IDNComp], this document specifies all of the
prohibited characters and the canonicalization for an IDN solution.
Specifically, it covers the following sections from [IDNComp]:

prohib-1: Identical and near-identical characters
prohib-2: Separators
prohib-3: Non-displaying and non-spacing characters
prohib-4: Private use characters
prohib-5: Punctuation
prohib-6: Symbols
canon-1.2: Normalization Form KC
canon-2.1: Case folding in ASCII
canon-2.2: Case folding in non-ASCII

Note that this document does not cover:
canon-1.1: Normalization Form C
canon-2.3: Han folding

1.3 Open issues

This is the first draft of this document. Although there has been much
discussion on the WG mailing list about the topics here, there has not
yet been much agreement on some issues. Now that there is a document to
talk about, that discussion can be more focussed.

1.3.1 Where to do name preparation

Section 2.1 says to do name preparation in the resolver. An argument can
be made for doing name preparation in the application, before the
application service interface. An advantage of that proposal is that
resolvers would not need to do any name preparation. A disadvantage is
that applications would have to be updated each time the IDN protocol is
updated, such as if new characters are added to the repertoire of
allowed characters. It seems likely that resolvers are more easily
updated than all the individual applications that use internationalized
host names.

1.3.2 Choosing between normalization form C and KC

Much of the discussion of normalization on the WG mailing list assumed
that normalization form C would be used. Near the time that this
document was written, people started considering form KC instead of C.
This document uses form KC, but the reasons for doing so could be
contentious.

1.3.3 Does the prohibition catch all bad characters?

The mailing list discussed doing prohibition in two steps: a short list
of prohibited characters before case folding in order to prevent
uppercase characters that have no lowercase equivalents from getting
through, and then a full check on the output of normalization. In this
draft, all checking is done before case folding, based on the (possibly
wrong) assumption that none of the prohibited characters will re-appear
after the case folding and normalization. If that assumption turns out
to be wrong, a check for just those problematic characters can be added
after normalization, or a full check against the prohibited characters
can be added.


2. Preparation Overview

This section describes where name preparation happens and the steps that
name preparation software must take.

2.1 Where name preparation happens

Part of the chart in section 1.4 of [IDNReq] looks like this:

+---------------+
| Application   |
+---------------+
        | Application service interface
        | For ex. GethostbyXXXX interface
+---------------+
|   Resolver    |
+---------------+
        |      <----- DNS service interface
+-------------------------------------------+

In this specification, the name preparation is done in the resolver,
before the DNS service interface. That is, it is acceptable for software
in the application service interface (such as a "GetHostByName" API) to
pass the resolver a name that has not been prepared. However, the
resolver MUST prepare the name as described in this specification before
passing it to the DNS service interface.

2.2 Name preparation steps

The steps for preparing names are:

1) Input from the application service interface -- This can be done in
many ways and is not specified in this document.

2) Look for prohibited input -- Check for any characters that are not
allowed in the input. If any are found, return an error to the
application service interface. This step is necessary to prevent errors
in the following two steps. This step fulfills prohib-1, prohib-2,
prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp].

3) Fold case -- Change all uppercase characters into lowercase
characters. Design note: this step could just as easily have been
"change all lowercase characters into uppercase characters". However,
the upper-to-lower folding was chosen because most users of the Internet
today enter host names in lowercase. This step fulfills canon-2.1 and
canon-2.2 from [IDNComp].

4) Canonicalize -- Normalize the characters. This step fulfills
canon-1.2 from [IDNComp].

5) Resolution of the prepared name -- This must be specified in a
different IDN document.

The above steps MUST be performed in the order given in order to comply
with this specification.


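Steps 2 through 4 above can be sketched in Python as follows. This is
an illustration under stated assumptions, not a normative
implementation: the names prepare_name_part, ProhibitedCharacterError,
and PROHIBITED_SAMPLE are hypothetical, the sample set holds only SPACE
and the ASCII punctuation from sections 3.5.1 and 3.5.4 rather than the
full list in section 3.8, and Python's unicodedata module uses whatever
Unicode version ships with the interpreter.

```python
import unicodedata

# Hypothetical, deliberately incomplete sample of prohibited characters:
# SPACE plus the ASCII punctuation listed in sections 3.5.1 and 3.5.4.
PROHIBITED_SAMPLE = frozenset(" \"#$%&+,./:;<=>?@[\\]")


class ProhibitedCharacterError(ValueError):
    """Raised when a name part contains a prohibited character."""


def prepare_name_part(name_part, prohibited=PROHIBITED_SAMPLE):
    # Step 2: reject prohibited input before any transformation.
    for ch in name_part:
        if ch in prohibited:
            raise ProhibitedCharacterError(
                "prohibited character U+%04X" % ord(ch))
    # Step 3: fold case, upper to lower.
    folded = name_part.lower()
    # Step 4: canonicalize with Normalization Form KC.
    return unicodedata.normalize("NFKC", folded)
```

Because the check runs first, a caller sees an error for a name part
such as "a.b" before any folding or normalization takes place, which
matches the ordering requirement stated above.
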
3. Prohibited Input

Before the text can be processed, it must be checked for prohibited
characters. There is a variety of prohibited characters, as described in
this section.

Note that one of the goals of IDN is to allow the widest possible set of
host names as long as those host names do not cause other problems, such
as possible ambiguity. Specifically, experience with current DNS names
has shown that there is a desire for host names that include personal
names, company names, and spoken phrases. A goal of this section is to
prohibit as few characters that might be used in these contexts as
possible while making sure that characters that might easily cause
confusion or ambiguity are prohibited.

Note that every character listed in this section MUST NOT be transmitted
on the DNS service interface. Although the checking is being performed
before case folding and canonicalization, those steps cannot result in
any of these characters if these characters are not in the input stream.
[[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS
server receives a request containing a prohibited character, then the
IDN protocol MUST return an error message.

Note that some characters listed in one section would also appear in
other sections. Each character is only listed once.

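The bracketed note above asks for an algorithmic check of the claim
that case folding and canonicalization cannot produce a prohibited
character. One possible way to run such a check is sketched below. The
function name reintroducing_code_points and its parameters are
hypothetical, the caller supplies the prohibition test, and the scan
uses the Unicode data bundled with the Python interpreter, which is
newer than the data this draft was written against.

```python
import unicodedata


def reintroducing_code_points(is_prohibited, limit=0x10000):
    """Return allowed code points whose folded, NFKC-normalized form
    contains at least one prohibited character."""
    offenders = []
    for cp in range(limit):
        ch = chr(cp)
        # Skip anything the input check would already have rejected,
        # including unassigned, surrogate, and private-use code points.
        if is_prohibited(ch) or unicodedata.category(ch) in ("Cn", "Cs", "Co"):
            continue
        prepared = unicodedata.normalize("NFKC", ch.lower())
        if any(is_prohibited(c) for c in prepared):
            offenders.append(cp)
    return offenders


# Tiny stand-in for the lists in this section: only SPACE is prohibited.
offenders = reintroducing_code_points(lambda c: c == "\u0020")
```

Even with this one-character stand-in, the result includes U+00A8
DIAERESIS, whose compatibility decomposition begins with U+0020 SPACE.
Running the same scan against the full list would show whether a
second check after normalization is needed.
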
3.1 prohib-1: Identical and near-identical characters

Many characters in [ISO10646] are identical or nearly identical to other
characters. These were often included for compatibility with other
character sets.

The characters prohibited because they are identical or nearly identical
to allowed characters are:

00AD SOFT HYPHEN
00D7 MULTIPLICATION SIGN
01C3 LATIN LETTER RETROFLEX CLICK
02B0-02FF [SPACING MODIFIER LETTERS]
066D ARABIC FIVE POINTED STAR
1806 MONGOLIAN TODO SOFT HYPHEN
2010 HYPHEN
2011 NON-BREAKING HYPHEN
2012 FIGURE DASH
2013 EN DASH
2014 EM DASH
2160-217F [ROMAN NUMERALS]
FB1D-FB4F [HEBREW PRESENTATION FORMS]
FB50-FDFF [ARABIC PRESENTATION FORMS A]
FE20-FE2F [COMBINING HALF MARKS]
FE30-FE4F [CJK COMPATIBILITY FORMS]
FE50-FE6F [SMALL FORM VARIANTS]
FE70-FEFC [ARABIC PRESENTATION FORMS B]
FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS]

3.2 prohib-2: Separators

Horizontal and vertical spacing characters would make it unclear where a
host name begins and ends. The prohibited spacing characters are:

0020 SPACE
00A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
2000-200B [SPACES]
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
3000 IDEOGRAPHIC SPACE

Allowing periods and period-like characters as characters within a name
part would also cause similar confusion. The prohibited periods,
characters that look like periods, and characters that canonicalize to a
period or to a period-like character are:

002E FULL STOP
06D4 ARABIC FULL STOP
2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
2488 DIGIT ONE FULL STOP
2489 DIGIT TWO FULL STOP
248A DIGIT THREE FULL STOP
248B DIGIT FOUR FULL STOP
248C DIGIT FIVE FULL STOP
248D DIGIT SIX FULL STOP
248E DIGIT SEVEN FULL STOP
248F DIGIT EIGHT FULL STOP
2490 DIGIT NINE FULL STOP
2491 NUMBER TEN FULL STOP
2492 NUMBER ELEVEN FULL STOP
2493 NUMBER TWELVE FULL STOP
2494 NUMBER THIRTEEN FULL STOP
2495 NUMBER FOURTEEN FULL STOP
2496 NUMBER FIFTEEN FULL STOP
2497 NUMBER SIXTEEN FULL STOP
2498 NUMBER SEVENTEEN FULL STOP
2499 NUMBER EIGHTEEN FULL STOP
249A NUMBER NINETEEN FULL STOP
249B NUMBER TWENTY FULL STOP
33C2 SQUARE AM
33C7 SQUARE CO
33D8 SQUARE PM

3.3 prohib-3: Non-displaying and non-spacing characters

The ISO 10646 character set contains many characters that cannot be
seen. These include control characters, non-breaking spaces, formatting
characters, and tagging characters. These characters would certainly
cause confusion if allowed in host names.

0000-001F [CONTROL CHARACTERS]
007F DELETE
0080-009F [CONTROL CHARACTERS]
070F SYRIAC ABBREVIATION MARK
180B MONGOLIAN FREE VARIATION SELECTOR ONE
180C MONGOLIAN FREE VARIATION SELECTOR TWO
180D MONGOLIAN FREE VARIATION SELECTOR THREE
180E MONGOLIAN VOWEL SEPARATOR
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
202A LEFT-TO-RIGHT EMBEDDING
202B RIGHT-TO-LEFT EMBEDDING
202C POP DIRECTIONAL FORMATTING
202D LEFT-TO-RIGHT OVERRIDE
202E RIGHT-TO-LEFT OVERRIDE
206A INHIBIT SYMMETRIC SWAPPING
206B ACTIVATE SYMMETRIC SWAPPING
206C INHIBIT ARABIC FORM SHAPING
206D ACTIVATE ARABIC FORM SHAPING
206E NATIONAL DIGIT SHAPES
206F NOMINAL DIGIT SHAPES
FEFF ZERO WIDTH NO-BREAK SPACE
FFF9 INTERLINEAR ANNOTATION ANCHOR
FFFA INTERLINEAR ANNOTATION SEPARATOR
FFFB INTERLINEAR ANNOTATION TERMINATOR
FFFC OBJECT REPLACEMENT CHARACTER
FFFD REPLACEMENT CHARACTER

3.4 prohib-4: Private use characters

Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are:

E000-F8FF [PRIVATE USE, PLANE 0]

3.5 prohib-5: Punctuation

The following characters are reserved or delimiters in URLs [RFC2396]
and [RFC2732]:

" # $ % & + , . / : ; < = > ? @ [ ]

3.5.1 Characters from URLs

The following punctuation characters are prohibited because they are
reserved or delimiters in URLs.

0022 QUOTATION MARK
0023 NUMBER SIGN
0024 DOLLAR SIGN
0025 PERCENT SIGN
0026 AMPERSAND
002B PLUS SIGN
002C COMMA
002E FULL STOP
002F SOLIDUS
003A COLON
003B SEMICOLON
003C LESS-THAN SIGN
003D EQUALS SIGN
003E GREATER-THAN SIGN
003F QUESTION MARK
0040 COMMERCIAL AT
005B LEFT SQUARE BRACKET
005D RIGHT SQUARE BRACKET

3.5.2 Characters that canonicalize to characters from URLs

The following punctuation characters are prohibited because their
normalization contains one or more of the characters from section 3.5.1.

037E GREEK QUESTION MARK
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
207A SUPERSCRIPT PLUS SIGN
207C SUPERSCRIPT EQUALS SIGN
208A SUBSCRIPT PLUS SIGN
208C SUBSCRIPT EQUALS SIGN
2100 ACCOUNT OF
2101 ADDRESSED TO THE SUBJECT
2105 CARE OF
2106 CADA UNA

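The rule in section 3.5.2 can be tested mechanically. The following
Python sketch is illustrative only: the names URL_RESERVED and
normalizes_to_reserved are hypothetical, and the result depends on the
Unicode data bundled with the interpreter. It reports whether a
character that is not itself in section 3.5.1 has a Normalization Form
KC result containing one of those characters.

```python
import unicodedata

# The punctuation listed in section 3.5.1.
URL_RESERVED = frozenset("\"#$%&+,./:;<=>?@[]")


def normalizes_to_reserved(ch):
    """True if ch is not itself reserved but its NFKC form contains
    at least one reserved character."""
    if ch in URL_RESERVED:
        return False
    normalized = unicodedata.normalize("NFKC", ch)
    return any(c in URL_RESERVED for c in normalized)
```

For instance, U+2100 ACCOUNT OF normalizes to "a/c", which contains
SOLIDUS, and U+037E GREEK QUESTION MARK normalizes to SEMICOLON, so
the function returns True for both.
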
3.5.3 Characters that look like characters from URLs

The following are prohibited because they look indistinguishable from
the characters listed in section 3.5.1.

037E GREEK QUESTION MARK
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
061B ARABIC SEMICOLON
066A ARABIC PERCENT SIGN
201A SINGLE LOW-9 QUOTATION MARK
2030 PER MILLE SIGN
2031 PER TEN THOUSAND SIGN
2033 DOUBLE PRIME
2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D INTERROBANG
2044 FRACTION SLASH
3001 IDEOGRAPHIC COMMA
3002 IDEOGRAPHIC FULL STOP
3003 DITTO MARK
3008 LEFT ANGLE BRACKET
3009 RIGHT ANGLE BRACKET
3014 LEFT TORTOISE SHELL BRACKET
3015 RIGHT TORTOISE SHELL BRACKET
301A LEFT WHITE SQUARE BRACKET
301B RIGHT WHITE SQUARE BRACKET

3.5.4 Other punctuation

The following punctuation characters are prohibited because they are
unlikely to be used in names and may be confusing to users or to
character-entry processes:

005C REVERSE SOLIDUS

3.6 prohib-6: Symbols

[UniData] has non-normative categories for symbols. The four symbol
categories are:

Symbol, Currency: Currency symbols could appear in company names and
spoken phrases, so they are not prohibited.

Symbol, Modifier: Stand-alone modifiers might appear in personal names,
company names, and spoken phrases, so they are not prohibited.

Symbol, Math: It is very unlikely that there are any significant
personal names, company names, or spoken phrases that contain
mathematical symbols. Further, many of these symbols are the same as or
similar to other punctuation, thereby leading to ambiguity. For this
reason, math-specific symbols are prohibited. These prohibited math
symbols are:

00AC NOT SIGN
00B1 PLUS-MINUS SIGN
2200-22FF [MATHEMATICAL OPERATORS]

Further, the following characters canonicalize to characters in the
above math list, and therefore are also prohibited:

00BC VULGAR FRACTION ONE QUARTER
00BD VULGAR FRACTION ONE HALF
00BE VULGAR FRACTION THREE QUARTERS
207B SUPERSCRIPT MINUS
208B SUBSCRIPT MINUS
2153 VULGAR FRACTION ONE THIRD
2154 VULGAR FRACTION TWO THIRDS
2155 VULGAR FRACTION ONE FIFTH
2156 VULGAR FRACTION TWO FIFTHS
2157 VULGAR FRACTION THREE FIFTHS
2158 VULGAR FRACTION FOUR FIFTHS
2159 VULGAR FRACTION ONE SIXTH
215A VULGAR FRACTION FIVE SIXTHS
215B VULGAR FRACTION ONE EIGHTH
215C VULGAR FRACTION THREE EIGHTHS
215D VULGAR FRACTION FIVE EIGHTHS
215E VULGAR FRACTION SEVEN EIGHTHS
215F FRACTION NUMERATOR ONE
33A7 SQUARE M OVER S
33A8 SQUARE M OVER S SQUARED
33AE SQUARE RAD OVER S
33AF SQUARE RAD OVER S SQUARED
33C6 SQUARE C OVER KG

Symbol, Other: This category covers a multitude of symbols, few of which
would ever appear in personal names, company names, and spoken phrases.
The rest of the prohibited symbols are:

2190-21FF [ARROWS]
2300-23FF [MISCELLANEOUS TECHNICAL]
2400-243F [CONTROL PICTURES]
2440-245F [OPTICAL CHARACTER RECOGNITION]
2500-257F [BOX DRAWING]
2580-259F [BLOCK ELEMENTS]
25A0-25FF [GEOMETRIC SHAPES]
2600-267F [MISCELLANEOUS SYMBOLS]
2700-27BF [DINGBATS]
2800-287F [BRAILLE PATTERNS]

3.7 Additional prohibited characters

3.7.1 Unassigned characters

All characters not yet assigned in [ISO10646] are prohibited. Although
this may at first seem trivial, it is extremely important because
characters that may be assigned in the future might have properties that
would cause them to be prohibited or might have case-folding properties.
As is the case of all prohibited characters, if a DNS server receives a
request containing an unassigned character, then the IDN protocol MUST
return an error message.

3.7.2 Surrogate characters

So far, all proposals for binary encodings of internationalized name
parts have specified UTF-8 as the encoding format. In such an encoding,
surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
the following are prohibited:

D800-DFFF [SURROGATE CHARACTERS]

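Unassigned, private-use, and surrogate code points (sections 3.4,
3.7.1, and 3.7.2) can be recognized by general category instead of by
enumerated ranges. The sketch below is an assumption-laden
illustration: the function names are hypothetical, the category data
comes from the Unicode version bundled with Python rather than from a
fixed IDN table, and the "Cn" category also covers noncharacters such
as U+FFFF.

```python
import unicodedata

# General categories mapped to the reason for prohibition.
CATEGORY_REASONS = {
    "Cn": "unassigned",
    "Co": "private use",
    "Cs": "surrogate",
}


def category_prohibition(ch):
    """Return the reason ch is prohibited by category, or None."""
    return CATEGORY_REASONS.get(unicodedata.category(ch))


def utf8_encodable(text):
    """Return False if text cannot be encoded as UTF-8, which in
    Python is the case when it contains a lone surrogate."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

A check by category follows the Unicode data in use, so characters
assigned after a given table version would stop being reported as
unassigned. That is one reason section 6 proposes a separately
versioned table.
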
3.7.3 Uppercase characters with no lowercase mappings

There are many uppercase characters in [ISO10646] which do not have
lowercase equivalents in [UniData]. Therefore, they are prohibited on
input because they would get through the case mapping step while still
being in uppercase.

The characters that are prohibited on input because they are uppercase
but have no lowercase mappings are:

03D2 GREEK UPSILON WITH HOOK SYMBOL
03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0 CYRILLIC LETTER PALOCHKA
10A0-10C5 [GEORGIAN CAPITAL LETTERS]

Note that many characters in the range U+2100 to U+213A, the letterlike
symbols, also are uppercase but have no lowercase mappings. However,
they are not listed here because the entire range is already prohibited
in section 3.6.

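A list like the one in section 3.7.3 can be regenerated from character
data. The following sketch is hypothetical in its naming and relies on
the Unicode version bundled with Python, in which some characters
listed above have since gained lowercase mappings, so its output will
not match the list above exactly.

```python
import unicodedata


def uppercase_without_lowercase(limit=0x10000):
    """Return code points below limit that are uppercase letters
    (category Lu) and that lowercasing leaves unchanged."""
    result = []
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) == "Lu" and ch.lower() == ch:
            result.append(cp)
    return result


no_lowercase = uppercase_without_lowercase()
```

U+03D2 GREEK UPSILON WITH HOOK SYMBOL is one character that still
appears in the result, while ordinary letters such as U+0041 do not.
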
3.7.4 Radicals and Ideographic Description

Some Han characters can be informally defined in terms of ideographic
descriptions. However, ideographic descriptions can lead to multiple
character streams leading to the same character in a fashion that does
not canonicalize. Thus, the radicals for ideographic description and the
ideographic description characters themselves are prohibited. These
characters are:

2E80-2EFF [CJK RADICALS SUPPLEMENT]
2F00-2FDF [KANGXI RADICALS]
2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS]

3.8 Summary of prohibited characters

The following is a collected list from the previous sections.

0000-001F [CONTROL CHARACTERS]
0020 SPACE
0022 QUOTATION MARK
0023 NUMBER SIGN
0024 DOLLAR SIGN
0025 PERCENT SIGN
0026 AMPERSAND
002B PLUS SIGN
002C COMMA
002E FULL STOP
002F SOLIDUS
003A COLON
003B SEMICOLON
003C LESS-THAN SIGN
003D EQUALS SIGN
003E GREATER-THAN SIGN
003F QUESTION MARK
0040 COMMERCIAL AT
005B LEFT SQUARE BRACKET
005C REVERSE SOLIDUS
005D RIGHT SQUARE BRACKET
007F DELETE
0080-009F [CONTROL CHARACTERS]
00A0 NO-BREAK SPACE
00AC NOT SIGN
00AD SOFT HYPHEN
00B1 PLUS-MINUS SIGN
00BC VULGAR FRACTION ONE QUARTER
00BD VULGAR FRACTION ONE HALF
00BE VULGAR FRACTION THREE QUARTERS
00D7 MULTIPLICATION SIGN
01C3 LATIN LETTER RETROFLEX CLICK
02B0-02FF [SPACING MODIFIER LETTERS]
037E GREEK QUESTION MARK
03D2 GREEK UPSILON WITH HOOK SYMBOL
03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0 CYRILLIC LETTER PALOCHKA
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
061B ARABIC SEMICOLON
066A ARABIC PERCENT SIGN
066D ARABIC FIVE POINTED STAR
06D4 ARABIC FULL STOP
070F SYRIAC ABBREVIATION MARK
10A0-10C5 [GEORGIAN CAPITAL LETTERS]
1680 OGHAM SPACE MARK
1806 MONGOLIAN TODO SOFT HYPHEN
180B MONGOLIAN FREE VARIATION SELECTOR ONE
180C MONGOLIAN FREE VARIATION SELECTOR TWO
180D MONGOLIAN FREE VARIATION SELECTOR THREE
180E MONGOLIAN VOWEL SEPARATOR
2000-200B [SPACES]
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
2010 HYPHEN
2011 NON-BREAKING HYPHEN
2012 FIGURE DASH
2013 EN DASH
2014 EM DASH
201A SINGLE LOW-9 QUOTATION MARK
2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202A LEFT-TO-RIGHT EMBEDDING
202B RIGHT-TO-LEFT EMBEDDING
202C POP DIRECTIONAL FORMATTING
202D LEFT-TO-RIGHT OVERRIDE
202E RIGHT-TO-LEFT OVERRIDE
202F NARROW NO-BREAK SPACE
2030 PER MILLE SIGN
2031 PER TEN THOUSAND SIGN
2033 DOUBLE PRIME
2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D INTERROBANG
2044 FRACTION SLASH
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
206A INHIBIT SYMMETRIC SWAPPING
206B ACTIVATE SYMMETRIC SWAPPING
206C INHIBIT ARABIC FORM SHAPING
206D ACTIVATE ARABIC FORM SHAPING
206E NATIONAL DIGIT SHAPES
206F NOMINAL DIGIT SHAPES
207A SUPERSCRIPT PLUS SIGN
207B SUPERSCRIPT MINUS
207C SUPERSCRIPT EQUALS SIGN
208A SUBSCRIPT PLUS SIGN
208B SUBSCRIPT MINUS
208C SUBSCRIPT EQUALS SIGN
2100 ACCOUNT OF
2101 ADDRESSED TO THE SUBJECT
2105 CARE OF
2106 CADA UNA
2153 VULGAR FRACTION ONE THIRD
2154 VULGAR FRACTION TWO THIRDS
2155 VULGAR FRACTION ONE FIFTH
2156 VULGAR FRACTION TWO FIFTHS
2157 VULGAR FRACTION THREE FIFTHS
2158 VULGAR FRACTION FOUR FIFTHS
2159 VULGAR FRACTION ONE SIXTH
215A VULGAR FRACTION FIVE SIXTHS
215B VULGAR FRACTION ONE EIGHTH
215C VULGAR FRACTION THREE EIGHTHS
215D VULGAR FRACTION FIVE EIGHTHS
215E VULGAR FRACTION SEVEN EIGHTHS
215F FRACTION NUMERATOR ONE
2160-217F [ROMAN NUMERALS]
2190-21FF [ARROWS]
2200-22FF [MATHEMATICAL OPERATORS]
2300-23FF [MISCELLANEOUS TECHNICAL]
2400-243F [CONTROL PICTURES]
2440-245F [OPTICAL CHARACTER RECOGNITION]
2488 DIGIT ONE FULL STOP
2489 DIGIT TWO FULL STOP
248A DIGIT THREE FULL STOP
248B DIGIT FOUR FULL STOP
248C DIGIT FIVE FULL STOP
248D DIGIT SIX FULL STOP
248E DIGIT SEVEN FULL STOP
248F DIGIT EIGHT FULL STOP
2490 DIGIT NINE FULL STOP
2491 NUMBER TEN FULL STOP
2492 NUMBER ELEVEN FULL STOP
2493 NUMBER TWELVE FULL STOP
2494 NUMBER THIRTEEN FULL STOP
2495 NUMBER FOURTEEN FULL STOP
2496 NUMBER FIFTEEN FULL STOP
2497 NUMBER SIXTEEN FULL STOP
2498 NUMBER SEVENTEEN FULL STOP
2499 NUMBER EIGHTEEN FULL STOP
249A NUMBER NINETEEN FULL STOP
249B NUMBER TWENTY FULL STOP
2500-257F [BOX DRAWING]
2580-259F [BLOCK ELEMENTS]
25A0-25FF [GEOMETRIC SHAPES]
2600-267F [MISCELLANEOUS SYMBOLS]
2700-27BF [DINGBATS]
2800-287F [BRAILLE PATTERNS]
2E80-2EFF [CJK RADICALS SUPPLEMENT]
2F00-2FDF [KANGXI RADICALS]
2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS]
3000 IDEOGRAPHIC SPACE
3001 IDEOGRAPHIC COMMA
3002 IDEOGRAPHIC FULL STOP
3003 DITTO MARK
3008 LEFT ANGLE BRACKET
3009 RIGHT ANGLE BRACKET
3014 LEFT TORTOISE SHELL BRACKET
3015 RIGHT TORTOISE SHELL BRACKET
301A LEFT WHITE SQUARE BRACKET
301B RIGHT WHITE SQUARE BRACKET
33A7 SQUARE M OVER S
33A8 SQUARE M OVER S SQUARED
33AE SQUARE RAD OVER S
33AF SQUARE RAD OVER S SQUARED
33C2 SQUARE AM
33C6 SQUARE C OVER KG
33C7 SQUARE CO
33D8 SQUARE PM
D800-DFFF [SURROGATE CHARACTERS]
E000-F8FF [PRIVATE USE, PLANE 0]
FB1D-FB4F [HEBREW PRESENTATION FORMS]
FB50-FDFF [ARABIC PRESENTATION FORMS A]
FE20-FE2F [COMBINING HALF MARKS]
FE30-FE4F [CJK COMPATIBILITY FORMS]
FE50-FE6F [SMALL FORM VARIANTS]
FE70-FEFC [ARABIC PRESENTATION FORMS B]
FEFF ZERO WIDTH NO-BREAK SPACE
FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS]
FFF9 INTERLINEAR ANNOTATION ANCHOR
FFFA INTERLINEAR ANNOTATION SEPARATOR
FFFB INTERLINEAR ANNOTATION TERMINATOR
FFFC OBJECT REPLACEMENT CHARACTER
FFFD REPLACEMENT CHARACTER
Unassigned characters


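Entries in the summary list are either a single code point or an
inclusive range, so an implementation might load them into a sorted
table and test membership by binary search. The sketch below assumes
non-overlapping ranges and uses only a handful of entries copied from
the list. The names SAMPLE_ENTRIES, parse_entry, and is_prohibited are
hypothetical.

```python
from bisect import bisect_right

# A few entries copied from the summary list; a real table would
# contain every entry.
SAMPLE_ENTRIES = [
    "0000-001F [CONTROL CHARACTERS]",
    "0020 SPACE",
    "002E FULL STOP",
    "2000-200B [SPACES]",
    "2200-22FF [MATHEMATICAL OPERATORS]",
    "E000-F8FF [PRIVATE USE, PLANE 0]",
]


def parse_entry(line):
    """Parse 'XXXX NAME' or 'XXXX-YYYY [NAME]' into (first, last)."""
    code = line.split()[0]
    first, _, last = code.partition("-")
    return int(first, 16), int(last or first, 16)


RANGES = sorted(parse_entry(line) for line in SAMPLE_ENTRIES)
STARTS = [first for first, _ in RANGES]


def is_prohibited(code_point):
    """Binary-search the sorted, non-overlapping range table."""
    index = bisect_right(STARTS, code_point) - 1
    return index >= 0 and code_point <= RANGES[index][1]
```

Unassigned characters have no fixed range in the list and would need a
separate test against the character data in use.
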
4. Case Folding

After it has been verified that the input text has none of the
characters prohibited for case folding, the case-folding step itself is
quite straightforward. For each character in the input, if there is a
lowercase mapping for that character in [UniData], the input character
is changed to the mapped lowercase letter.


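The per-character rule above can be sketched as follows. The function
name fold_case is hypothetical. Python exposes only the full case
mapping of its bundled Unicode version, so this sketch keeps a mapping
only when it yields a single character. Where the full mapping expands
to several characters (as it does for U+0130), the character is left
unchanged, which differs from the single-character mapping field in
[UniData].

```python
def fold_case(text):
    """Lowercase each character independently, keeping only
    one-to-one mappings."""
    folded = []
    for ch in text:
        lower = ch.lower()
        # Leave the character alone if lowercasing would expand it.
        folded.append(lower if len(lower) == 1 else ch)
    return "".join(folded)
```

Characters without a lowercase mapping, such as digits and the hyphen,
pass through unchanged, so a name part like "Host-1" becomes "host-1".
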
5. Canonicalization

After case folding, the input string is normalized using form KC, as
described in [UTR15].

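As an illustration of what form KC does, and of how it differs from
form C (see section 1.3.2), the following Python sketch normalizes two
short inputs. The function name canonicalize is hypothetical, and the
unicodedata module applies the Unicode version bundled with the
interpreter.

```python
import unicodedata


def canonicalize(text):
    """Normalize text with Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


# "e" followed by COMBINING ACUTE ACCENT composes to one character.
composed = canonicalize("e\u0301")

# LATIN SMALL LIGATURE FI is a compatibility character: form KC
# replaces it with the two letters, while form C leaves it alone.
ligature_kc = canonicalize("\ufb01")
ligature_c = unicodedata.normalize("NFC", "\ufb01")
```

Both forms compose the accented letter in the same way. They differ
only for compatibility characters such as the ligature, which is the
practical consequence of choosing form KC over form C.
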
6. IDN Table Revisions

A table consisting of all characters allowed and prohibited and the
rules for case folding and canonicalization will be created based on the
content of the [UniData] and on the content of this document. This table
will be the authority for implementations to follow and will be
normatively referenced by this document. Such a table will enable the
IDN protocol to have versions independent of the revisions to Unicode
and/or to ISO 10646 because the revision of IDN and its deployment may
not be in sync with revisions to Unicode and ISO 10646.

In a future draft of this document, IANA will be asked to keep this
table, with an initial version number of 1. Each new version of the
table will have a new, higher version number.


7. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any change
to the characteristics of the DNS can change the security of much of the
Internet.

Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized host name.


8. References

[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.

[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement.

[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
1: Architecture and Basic Multilingual Plane. Five amendments and a
technical corrigendum have been published up to now. UTF-16 is described
in Annex Q, published as Amendment 1. 17 other amendments are currently
at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE
UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

[Normalize] "Character Normalization in IETF Protocols",
draft-duerst-i18n-norm-03.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[RFC2396] Tim Berners-Lee, et al., "Uniform Resource Identifiers (URI):
Generic Syntax", August 1998, RFC 2396.

[RFC2732] Robert Hinden, et al., "Format for Literal IPv6 Addresses in
URL's", December 1999, RFC 2732.

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).

[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

[UniData] The Unicode Consortium. UnicodeData File.
<ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.

[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms.
Unicode Technical Report #15.
<http://www.unicode.org/unicode/reports/tr15/>.


A. Acknowledgements

Many people from the IETF IDN Working Group and the Unicode Technical
Committee contributed ideas that went into the first draft of this
document. Mark Davis was particularly helpful in some of the early
ideas.


B. Changes From Previous Versions of this Draft

This is the -00 version, so there are no changes.


C. IANA Considerations

There are no specific IANA considerations in this draft, but there will
be in a future draft of this document.


D. Author Contact Information

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
paul.hoffman@imc.org and paul.hoffman@vpnc.org

Marc Blanchet
Viagenie inc.
2875 boul. Laurier, bur. 300
Ste-Foy, Quebec, Canada, G1V 2M2
Marc.Blanchet@viagenie.qc.ca