452 lines
16 KiB
Plaintext
452 lines
16 KiB
Plaintext
Internet Engineering Task Force (IETF) Mark Welter
|
|
INTERNET-DRAFT Brian W. Spolarich
|
|
draft-ietf-idn-utf6-00 WALID, Inc.
|
|
November 16, 2000 Expires May 16, 2001
|
|
|
|
|
|
UTF-6 - Yet Another ASCII-Compatible Encoding for IDN
|
|
|
|
Status of this memo
|
|
|
|
This document is an Internet-Draft and is in full conformance with all
|
|
provisions of Section 10 of RFC2026.
|
|
|
|
Internet-Drafts are working documents of the Internet Engineering Task
|
|
Force (IETF), its areas, and its working groups. Note that other
|
|
groups may also distribute working documents as Internet-Drafts.
|
|
|
|
Internet-Drafts are draft documents valid for a maximum of six months
|
|
and may be updated, replaced, or obsoleted by other documents at any
|
|
time. It is inappropriate to use Internet-Drafts as reference
|
|
material or to cite them other than as "work in progress."
|
|
|
|
The list of current Internet-Drafts can be accessed at
|
|
http://www.ietf.org/ietf/1id-abstracts.txt
|
|
|
|
The list of Internet-Draft Shadow Directories can be accessed at
|
|
http://www.ietf.org/shadow.html.
|
|
|
|
The distribution of this document is unlimited.
|
|
|
|
Copyright (c) The Internet Society (2000). All Rights Reserved.
|
|
|
|
Abstract
|
|
|
|
This document describes a tranformation method for representing
|
|
Unicode character codepoints in host name parts in a fashion that is
|
|
completely compatible with the current Domain Name System. It is
|
|
proposed as a potential candidate for an ASCII-Compatible Encoding (ACE)
|
|
for supporting the deployment of an internationalized Domain Name System.
|
|
The tranformation method, an extension of the UTF-5 encoding proposed by
|
|
Duerst, provides both for more efficient representation of typical Unicode
|
|
sequences while preserving simplicity and readability. This transformation
|
|
method is deployed as part of the current WALID multilingual domain name
|
|
system implementation, although that status should not necessarily influence
|
|
the evaluation of its merits as a candidate encoding method.
|
|
|
|
|
|
Table of Contents
|
|
|
|
1. Introduction
|
|
1.1 Terminology
|
|
2. Hostname Part Transformation
|
|
2.1 Post-Converted Name Prefix
|
|
2.2 Hostname Prepartion
|
|
2.3 Definitions
|
|
2.4 UTF-6 Encoding
|
|
2.4.1 Variable Length Hex Encoding
|
|
2.4.2 UTF-6 Compression Algorithm
|
|
2.4.3 Forward Transformation Algorithm
|
|
2.5 UTF-6 Decoding
|
|
2.5.1 Variable Length Hex Decoding
|
|
2.5.2 UTF-6 Decompression Algorithm
|
|
2.5.3 Reverse Transformation Algorithm
|
|
3. Examples
|
|
3.1 'www.walid.com' (in Arabic)
|
|
4. Security Considerations
|
|
5. References
|
|
|
|
|
|
1. Introduction
|
|
|
|
UTF-6 describes an encoding scheme of the ISO/IEC 10646 [ISO10646]
|
|
character set (whose character code assignments are synchronized
|
|
with Unicode [UNICODE3]), and the procedures for using this scheme
|
|
to transform host name parts containing Unicode character sequences
|
|
into sequences that are compatible with the current DNS protocol
|
|
[STD13]. As such, it satisfies the definition of a 'charset' as
|
|
defined in [IDNREQ].
|
|
|
|
1.1 Terminology
|
|
|
|
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
|
|
"MAY" in this document are to be interpreted as described in RFC 2119
|
|
[RFC2119].
|
|
|
|
Hexadecimal values are shown preceded with an "0x". For example,
|
|
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
|
|
shown preceded with an "0b". For example, a nine-bit value might be
|
|
shown as "0b101101111".
|
|
|
|
Examples in this document use the notation from the Unicode Standard
|
|
[UNICODE3] as well as the ISO 10646 names. For example, the letter "a"
|
|
may be represented as either "U+0061" or "LATIN SMALL LETTER A".
|
|
|
|
UTF-6 converts strings with internationalized characters into
|
|
strings of US-ASCII that are acceptable as host name parts in current
|
|
DNS host naming usage. The former are called "pre-converted" and the
|
|
latter are called "post-converted". This specification defines both
|
|
a forward and reverse transformation algorithm.
|
|
|
|
|
|
2. Hostname Part Transformation
|
|
|
|
According to [STD13], hostname parts must be case-insensitive, start and
|
|
end with a letter or digit, and contain only letters, digits, and the
|
|
hyphen character ("-"). This, of course, excludes most characters used
|
|
by non-English speakers, characters, as well as many other characters in
|
|
the ASCII character repertoire. Further, domain name parts must be
|
|
63 octets or shorter in length.
|
|
|
|
|
|
2.1 Post-Converted Name Prefix
|
|
|
|
This document defines the string 'wq--' as a prefix to identify
|
|
UTF-6-encoded sequences. For the purposes of comparison in the IDN
|
|
Working Group activities, the 'wq--' prefix should be used solely to
|
|
identify UTF-6 sequences. However, should this document proceed beyond
|
|
draft status the prefix should be changed to whatever prefix, if any,
|
|
is the final consensus of the IDN working group.
|
|
|
|
Note that the prepending of a fixed identifier sequence is only one
|
|
mechanism for differentiating ASCII character encoded international
|
|
domain names from 'ordinary' domain names. One method, as proposed in
|
|
[IDNRACE], is to include a character prefix or suffix that does not
|
|
appear in any name in any zone file. A second method is to insert a
|
|
domain component which pushes off any international names one or more
|
|
levels deeper into the DNS heirarchy. There are trade-offs between
|
|
these two methods which are independent of the Unicode to ASCII
|
|
transcoding method finally chosen. We do not address the international
|
|
vs. 'ordinary' name differention issue in this paper.
|
|
|
|
2.2 Hostname Prepartion
|
|
|
|
The hostname part is assumed to have at least one character disallowed
|
|
by [STD13], and that is has been processed for logically equivalent
|
|
character mapping, filtering of disallowed characters (if any), and
|
|
compatibility composition/decomposition before presentation to the UTF-6
|
|
conversion algorithm.
|
|
|
|
While it is possible to invent a transcoding mechanism that relies
|
|
on certain Unicode characters being deemed illegal within domain names
|
|
and hence available to the transcoding mechanism for improving encoding
|
|
efficiency, we feel that such a proposal would complicate matters
|
|
excessively. We also believe that Unicode name preprocessing for
|
|
both name resolution and name registration should be considered as s
|
|
separate, independent issues, which we will attempt to address in a
|
|
separate document.
|
|
|
|
2.3 Definitions
|
|
|
|
For clarity:
|
|
|
|
'integer' is an unsigned binary quantity;
|
|
'byte' is an 8-bit integer quantity;
|
|
'nibble' is a 4-bit integer quantity.
|
|
|
|
2.4 UTF-6 Encoding
|
|
|
|
The idea behind this scheme was to improve on the UTF-5 transformation
|
|
algorithm described in [IDNDUERST] by providing a straightforward
|
|
compression mechanism. UTF-6 defines a compression mechanism by
|
|
indentifying identical leading byte or nibble values in the pre-converted
|
|
string, and using the length of this leading value to select a mask which
|
|
can be applied to the pre-converted string. The resulting post-converted
|
|
string is preserves the simplicity and readability of UTF-5 while
|
|
enabling longer sequences to be encoded into a single host name part.
|
|
|
|
2.4.1 Variable Length Hex Encoding
|
|
|
|
The variable length hex encoding algorithm was introduced by Duerst in
|
|
[IDNDUERST]. It encodes an integer value in a slight modification of
|
|
traditional hexadecimal notation, the difference being that the most
|
|
significant digit is represented with an alternate set of "digits"
|
|
- -- 'g through 'v' are used to represent 0 through 15. The result is a
|
|
variable length encoding which can efficiently represent integers of
|
|
arbitrary length.
|
|
|
|
The variable length nibble encoding of an integer, C, is defined
|
|
as follows:
|
|
|
|
1. Skip over leading non-significant zero nibbles to find I,
|
|
the first significant nibble of c;
|
|
|
|
2. Emit the Ith character of the set [ghijklmopqrstuv];
|
|
|
|
3. Continue from most to least significant, encoding each remaining
|
|
nibble J by emitting the Jth character of the set [0123456789abcdef].
|
|
|
|
Examples:
|
|
|
|
0x1f4c is encoded as "hf4c"
|
|
0x0624 is encoded as "m24"
|
|
0x0000 is encoded as "g"
|
|
'n' a single character in single quotes stands for the
|
|
Unicode code point for that character.
|
|
|
|
2.4.2 UTF-6 Compression Algorithm
|
|
|
|
UTF-6 improves on the UTF-5 encoding by providing compression, which
|
|
enables encoding of a larger number of characters in each hostname
|
|
part. The compression algorithm is defined as follows:
|
|
|
|
1. Set the mask to 0xFFFF;
|
|
|
|
2. If the number of non '-' characters is less than 2, proceed to
|
|
step 5;
|
|
|
|
3. If the most significant byte of every non '-' character is the
|
|
same value:
|
|
|
|
3a. Set HB to this value;
|
|
3b. Emit 'Y';
|
|
3c. Emit the variable length hex encoding of HB;
|
|
3d. Set the mask to 0x00FF;
|
|
3e. Proceed to step 5.
|
|
|
|
4. If the most significant nibble of every non '-' character is the
|
|
same value:
|
|
|
|
4a. Set HN to this value;
|
|
4b. Emit 'Z';
|
|
4c. Emit the variable length hex encoding of HN;
|
|
4d. Set the mask to 0x0FFF.
|
|
|
|
5. Foreach input character:
|
|
|
|
5a. Set HN to the result of the bitwise AND of the input
|
|
character and the mask;
|
|
5b. Emit the variable length nibble encoding of HN.
|
|
|
|
2.4.3 Forward Transformation Algorithm
|
|
|
|
The UTF-6 transformation algorithm accepts a string in UTF-16
|
|
[ISO10646] format as input. The encoding algorithm is as follows:
|
|
|
|
1. Break the hostname string into dot-separated hostname parts.
|
|
For each hostname part, perform steps 2 and 3 below;
|
|
|
|
2. Compress the component using the method described in section
|
|
2.4.2 above, and encode using the encoding described in section 2.4.1;
|
|
|
|
3. Prepend the post-converted name prefix 'wq--' (see section 2.1
|
|
above) to the resulting string.
|
|
|
|
|
|
2.5 UTF-6 Decoding
|
|
|
|
2.5.1 Variable Length Hex Decoding
|
|
|
|
1. Let N be the lower case of the first input character;
|
|
|
|
If N is not in set [ghijklmnopqrstuv] return error,
|
|
else consume the input character;
|
|
|
|
2. Let R = N - 'g';
|
|
|
|
3. If another input character exists,
|
|
then let N be the lower case of the next input character,
|
|
else goto Step 9;
|
|
|
|
4. If N is not in the set [0123456789abcdef], go to Step 9;
|
|
|
|
5. Let N = the lower case of the next input character and consume
|
|
the input character;
|
|
|
|
6. Let R = R * 16;
|
|
|
|
7. If N is in set [0123456789],
|
|
then let R = R + (N - '0'),
|
|
else let R = R + (N - 'a') + 10;
|
|
|
|
8. Go to step 3;
|
|
|
|
9. Return decoded result R.
|
|
|
|
2.5.2 UTF-6 Decompression Algorithm
|
|
|
|
1. Let N be the lower case of the first input character;
|
|
|
|
2. If N != 'y' and N != 'z',
|
|
|
|
2a. Let CPART be 0;
|
|
2b. Let VMAX be 0xFFFF;
|
|
|
|
This is the no-compression case;
|
|
|
|
3. If N == 'y',
|
|
|
|
3a. Let M be the variable length hex decoding of the next
|
|
character;
|
|
3b. Let CPART be the result of M * 0x0100;
|
|
3c. Let VMAX be 0x00FF;
|
|
3d. Continue to Step 5;
|
|
|
|
4. If N == 'z',
|
|
|
|
4a. Let M be the variable length hex decoding of the next
|
|
character;
|
|
4b. Let CPART be the result of M * 0x1000;
|
|
4c. Let VMAX be 0x0FFF;
|
|
4d. Continue to Step 5;
|
|
|
|
5. While another input character exists, let N be the lower case of
|
|
the next input character, and do the following:
|
|
|
|
5a. If N == '-' consume the character and
|
|
then append '-' to the result string,
|
|
else let VPART be the next variable hex decoded value;
|
|
5b. If VPART > VMAX, return error,
|
|
else append CPART + VPART to the result string;
|
|
|
|
6. Return the result string.
|
|
|
|
2.5.3 Reverse Transformation Algorithm
|
|
|
|
1. Break the string into dot-separated components and apply Steps
|
|
2 through 4 to each component:
|
|
|
|
2. Check for legality (in terms of RFC1035 permitted characters) and
|
|
return error status if illegal,
|
|
|
|
3. Remove the post converted name prefix 'wq--' (see Section 2.1),
|
|
|
|
4. Decompress the component using the decompression algorithm
|
|
described above.
|
|
|
|
5. Concatenate the decoded segments with dot separators and return.
|
|
|
|
|
|
3. Examples
|
|
|
|
The examples below illustrate the encoding algorithm and provide
|
|
comparisons to alternate encoding schemes. UTF-5 sequences are
|
|
prefixed with '----', as no ACE prefix was defined for that encoding.
|
|
|
|
3.1 'www.walid.com' (in Arabic):
|
|
|
|
UTF-16: U+0645 U+0648 U+0642 U+0639 . U+0648 U+0644 U+064A U+062F .
|
|
U+0634 U+0631 U+0643 U+0629
|
|
|
|
UTF-6: wq--ymk5k8k2j9.wq--ymk8k4kaif.wq--ymj4j1k3i9
|
|
|
|
UTF-5: ----m45m48m42m39.----m48m44m4am2f.----m34m31m43m29
|
|
|
|
RACE: bq--azcuqqrz.bq--azeeisrp.bq--ay2dcqzj
|
|
|
|
LACE: bq--aqdekscche.bq--aqdeqrckf5.bq--aqddimkdfe
|
|
|
|
|
|
3.2 Mixed Katakana and Hiragana (SOREZORENOBASHO)
|
|
|
|
UTF-16: U+305D U+308C U+305E U+308C U+306E U+5834 U+6240
|
|
|
|
UTF-6:
|
|
|
|
UTF-5:
|
|
|
|
RACE: bq--4ayf3memgbpdbdbqnzmdiysa
|
|
|
|
LACE: bq--auyf4dc7rrxacwbuafrea
|
|
|
|
|
|
3.3 Currently Disallowed ASCII Characters ($OneBillionDollars!):
|
|
|
|
UTF-16: U+0024 U+004F U+006E U+0065 U+0042 U+0069 U+006C U+006C
|
|
U+0069 U+006F U+006E U+0044 U+006F U+006C U+006C U+0061
|
|
U+0072 U+0073 U+0021
|
|
|
|
UTF-6:
|
|
|
|
UTF-5:
|
|
|
|
RACE: bq--aase74tfijuwy4djn6xei44mnrqxe5zb
|
|
|
|
LACE: bq--cmacit4omvbgs4dmnfxw5rdpnrwgc5ttee
|
|
|
|
4. Security Considerations
|
|
|
|
Much of the security of the Internet relies on the DNS and any
|
|
change to the characteristics of the DNS may change the security of
|
|
much of the Internet. Therefore UTF-6 makes no changes to the DNS itself.
|
|
|
|
UTF-6 is designed so that distinct Unicode sequences map to distinct
|
|
domain name sequences (modulo the Unicode and DNS equivalence rules).
|
|
Therefore use of UTF-6 with DNS will not negatively affect security.
|
|
|
|
5. References
|
|
|
|
[IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
|
|
Proposals", draft-ietf-idn-compare.
|
|
|
|
[IDNREQ] James Seng, "Requirements of Internationalized Domain Names",
|
|
draft-ietf-idn-requirement.
|
|
|
|
[IDNNAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
|
|
Internationalized Host Names", draft-ietf-idn-nameprep
|
|
|
|
[IDNDUERST] M. Duerst, "Internationalization of Domain Names",
|
|
draft-duerst-dns-i18n.
|
|
|
|
[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
|
|
technology -- Universal Multiple-Octet Coded Character Set (UCS) --
|
|
Part 1: Architecture and Basic Multilingual Plane. Five amendments and
|
|
a technical corrigendum have been published up to now. UTF-16 is
|
|
described in Annex Q, published as Amendment 1. 17 other amendments are
|
|
currently at various stages of standardization.
|
|
|
|
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
|
|
Requirement Levels", March 1997, RFC 2119.
|
|
|
|
[STD13] Paul Mockapetris, "Domain names - implementation and
|
|
specification", November 1987, STD 13 (RFC 1035).
|
|
|
|
[UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
|
|
3.0", ISBN 0-201-61633-5. Described at
|
|
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
|
|
|
|
A. Acknowledgements
|
|
|
|
The structure (and some of the structural text) of this document is
|
|
intentionally borrowed from the LACE IDN draft (draft-ietf-idn-lace)
|
|
by Mark Davis and Paul Hoffman.
|
|
|
|
The 'SOREZORENOBASHO' example was taken from draft-ietf-idn-brace draft
|
|
by Adam Costello.
|
|
|
|
B. IANA Considerations
|
|
|
|
There are no IANA considerations in this document.
|
|
|
|
C. Author Contact Information
|
|
|
|
Mark Welter
|
|
Brian W. Spolarich
|
|
WALID, Inc.
|
|
State Technology Park
|
|
2245 S. State St.
|
|
Ann Arbor, MI 48104
|
|
+1-734-822-2020
|
|
|
|
mwelter@walid.com
|
|
briansp@walid.com
|
|
-----BEGIN PGP SIGNATURE-----
|
|
Version: GnuPG v1.0.1 (GNU/Linux)
|
|
Comment: For info see http://www.gnupg.org
|
|
|
|
iD8DBQE6FaCt/DkPcNgtD/0RAtRmAJwISVeJGY6qmll71mL+Axc51o8iIwCgmNt/
|
|
86RcQh1JQYWTux+8FS+XvMU=
|
|
=bxiv
|
|
-----END PGP SIGNATURE-----
|
|
|