bind9/doc/draft/draft-hall-dm-idns-00.txt
Andreas Gustafsson a831ffc8fe new draft
2001-11-15 23:46:00 +00:00

INTERNET-DRAFT Eric A. Hall, Editor
Document: draft-hall-dm-idns-00.txt Consultant
Expires: May 2002 November 2001
The Internationalized Domain Name System
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as "work in
progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
1. Abstract
The principal intention of this specification is to facilitate the
deployment of a completely internationalized domain name syntax
and service which new protocols, applications and host systems can
use, but without disrupting the existing infrastructure. Towards
that end, this document describes a series of elective
encapsulation services and protocol extensions which cumulatively
allow internationalized domain names to be stored and transmitted
in the existing DNS message and within application data streams,
according to the compliance level of the participating systems.
INTERNET-DRAFT draft-hall-dm-idns-00.txt November 2001
Table of Contents
1. Abstract
2. Definitions and Terminology
3. Introduction
3.1. Background
3.2. Objectives
3.3. Common Usage Scenarios
3.4. User Audiences
3.5. Service Overview
3.6. Process Example
4. The Internationalized Namespace
4.1. Internationalized Domain Names and Labels
4.2. Internationalized Host Identifiers
4.3. STD13 Domain Names
4.4. STD13 Host Identifiers
5. Transfer Encodings and Label Types
5.1. The EDNS/UTF-8 Label Type
5.2. The STD13 Legacy Label Type
6. Application Guidelines
6.1. Input and Output Charsets
6.2. Protocol and Application Data
6.3. DNS Lookups and Resolver Calls
7. Resolver Guidelines
7.1. Resolver APIs
7.2. Query Processing Services
7.3. The Hosts Database
8. Server Guidelines
8.1. Internationalized Zones
8.2. Namespace Visibility Restrictions
8.3. The Master File Format
9. Caching Guidelines
10. Security Considerations
11. IANA Considerations
12. References
13. Acknowledgements
14. Editor's Address
Hall I-D Expires: May 2002 [page 2]
2. Definitions and Terminology
This document unites, enhances and clarifies several pre-existing
technologies. Readers are expected to be familiar with the
following specifications:
[AMC-ACE-Z] <draft-ietf-idn-amc-ace-z>, "AMC-ACE-Z version
0.3.1"
[NAMEPREP] <draft-ietf-idn-nameprep>, "Preparation of
Internationalized Host Names"
[STD13] (RFC 1034) "Domain names - concepts and facilities",
(RFC 1035) "Domain names - implementation and
specification"
[STD3] (RFC 1122) "Requirements for Internet Hosts --
Communication Layers", (RFC 1123) "Requirements for Internet
Hosts -- Application and Support"
[BCP18] (RFC 2277) "IETF Policy on Character Sets and
Languages"
[RFC2279] "UTF-8, a transformation format of ISO 10646"
[RFC2671] "Extension Mechanisms for DNS (EDNS0)"
The following abbreviations are used throughout this document:
UCS (Universal Character Set) - The ISO/IEC 10646 character
set repertoire, as represented by the Unicode 3.1
specification.
ACE (ASCII-Compatible Encoding) - A transfer encoding which
encodes UCS character codes into a seven-bit codespace
which is compatible with US-ASCII.
UTF-8 (UCS Transformation Format, Eight-Bit) - A transfer
encoding which encodes UCS characters into an eight-bit
codespace which is compatible with DNS message formats.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
in this document are to be interpreted as described in RFC 2119.
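As an illustration of the two transfer encodings abbreviated above,
the following Python sketch encodes a single UCS label both ways. It
is a minimal sketch under stated assumptions: Python's built-in
"punycode" codec is used as a stand-in for [AMC-ACE-Z] (Punycode
descends from that draft, but the exact output of AMC-ACE-Z version
0.3.1 may differ), no ACE prefix is added, and the function name
transfer_encodings is hypothetical.

```python
def transfer_encodings(label: str) -> dict:
    """Return the UTF-8 and ACE-style octet forms of one UCS label."""
    return {
        # Eight-bit form, as carried in a UTF-8 capable DNS message.
        "utf8": label.encode("utf-8"),
        # Seven-bit form. ASSUMPTION: Punycode stands in for the draft's
        # ACE; a real ACE label would also carry an identifying prefix.
        "ace": label.encode("punycode"),
    }


forms = transfer_encodings("b\u00fccher")  # the label "bücher"
print(forms["utf8"])  # contains octets above 0x7F
print(forms["ace"])   # contains only US-ASCII octets
```

The UTF-8 form contains octets above 0x7F, while the ACE-style form
stays within the seven-bit US-ASCII range and can therefore pass
through systems that enforce the legacy host name rules.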
3. Introduction
The domain name system (DNS) [STD13] currently defines a message,
namespace and protocol. Although the DNS message is capable of
transferring eight-bit character codes as protocol data,
applications are currently limited to a subset of US-ASCII when
they interact with the DNS namespace, and this restricted syntax
is enforced by almost every TCP/IP application and protocol which
utilizes domain names as embedded data (including, surprisingly,
the DNS protocol).
In order to allow for the use of a larger range of characters in
the namespace, this document extends and clarifies a variety of
Internet specifications so that characters from the Universal
Character Set (UCS) [ISO10646] may be used in domain names. It
extends the DNS message structure so that these domain names can be
transferred as UTF-8 [RFC2279] encoded characters, provides an
ASCII-compatible encoding (ACE) [AMC-ACE-Z] of the same character
codes so that existing protocols and applications can access the
internationalized domain names, and provides identification
mechanisms which allow the end-point systems to negotiate downward
when needed. Finally, this document defines behavior for DNS
systems which implement this architecture, including the end-point
applications which generate and store DNS domain names, and the
resolvers, caches and servers which process them.
The mechanisms presented here are elective. Developers, zone
administrators and network operators who wish to make use of the
internationalized domain names may do so according to their own
schedule. Those developers, administrators and operators who
cannot or prefer not to implement the specified extensions can
continue to use their legacy systems, and will still be able to
access resources from the internationalized domain name system.
3.1. Background
From one perspective, DNS is already an "eight-bit clean" system,
in that the structured DNS message is capable of storing and
transmitting eight-bit data without any additional effort.
However, this perspective only considers one particular facet of
the domain name system, and ignores the more critical aspect of
the DNS namespace, which has rules that are entirely different
from those which govern the message format.
The DNS namespace (or more appropriately, the view of the
namespace which applications use and enforce) is governed by rules
set forth in RFC952 [RFC952], STD3 [STD3], and STD13, which
collectively define the characters that are eligible for use with
host names. These rules are meant to provide a common template
which may be applied to either the DNS namespace or a local hosts
database, such that a query for "host.example.com" can be
processed through either system. The valid characters currently
defined are the letters, digits and hyphen from US-ASCII [ASCII]
(additional rules also govern the valid order and length of a host
name). Character code values outside of
this range are valid in domain name messages, but are undefined
when used in the namespace, and are subject to interpretation by
the applications which generate them.
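The legacy host name rules described above can be sketched as a
simple check. This is an illustrative sketch, not part of this
specification: it assumes the RFC 1123 relaxation which allows a
label to begin with a digit, the 63-octet label limit, and a limit of
253 characters for the textual name. The function name
is_legacy_host_name is hypothetical.

```python
import re

# One label: letters, digits and hyphen only, 1 to 63 characters,
# beginning and ending with a letter or digit.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_legacy_host_name(name: str) -> bool:
    """Return True if name satisfies the legacy host name rules."""
    if name.endswith("."):
        name = name[:-1]  # ignore the trailing root dot
    if not name or len(name) > 253:
        return False
    return all(_LDH_LABEL.fullmatch(label) for label in name.split("."))


print(is_legacy_host_name("host.example.com"))           # legacy-valid
print(is_legacy_host_name("b\u00fccher.example.com"))    # outside US-ASCII
```

A name that fails this check may still be carried in a DNS message,
but it falls outside the range that legacy applications accept.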
The host name rules are enforced by almost every application and
protocol which uses DNS to identify a host or system. This
includes network utilities such as ping and traceroute which
simply identify systems by name, and complex protocols such as
SMTP which use domain names to determine message-routing paths.
Portions of the DNS protocol itself are also affected by these
restrictions, such as the domain names which may be used for NS
resource records with sub-domain delegation operations (since
these servers are connection targets, they are also required to be
compliant with the host name rules).
Because these domain names are so pervasive throughout the
Internet (and even within proprietary applications that run on
private networks), it is not possible to declare a "flag day" at
which eight-bit domain names will be considered valid encodings of
a particular character set. Instead, an extended namespace with a
larger set of charset rules must be defined, an extended DNS
protocol capable of supporting these domain names must be
deployed, and a transitional mechanism which allows the old and
new systems to interact must be established. This document
attempts to meet these objectives.
3.2. Objectives
In broad terms, this document has one overall goal, which is to
facilitate the creation and use of an internationalized domain
name system around a UCS namespace, a collection of UTF-8 and
legacy-compatible encodings which are suitable for transferring
internationalized domain names within DNS and the affected
application data streams, and a negotiation mechanism which allows
end-point systems to identify the encoding that they will use for
a particular operation.
One of the objectives stated above is to internationalize the
existing DNS namespace, by allowing UCS characters to be used in
host names and sub-domain delegations in old and new zones
equally. As such, this document does not define a new namespace,
but instead defines mechanisms by which leaf-nodes and sub-domains
may be created within the existing hierarchy.
UTF-8 was chosen as the primary transfer encoding of these domain
names for several reasons. For one, there is a wide availability
of tools and expertise surrounding UTF-8, and it is already widely
deployed within development environments, operating systems and
applications. Furthermore, BCP18 [BCP18] requires that new
application protocols be able to use UTF-8 as application data,
and for many applications, this specifically means domain names
which are passed as data. All signs indicate that UTF-8 is
currently and will continue to be the preferred eight-bit encoding
on the Internet, and this specification embraces this position in
its design.
However, most of the network services currently in use are bound
by the legacy host naming restrictions, and those applications and
protocols will also need to be able to interact with resources
from the internationalized namespace, even though they will not be
compliant with the UTF-8 encoding mechanisms defined in this
document. In order to allow these systems to participate, this
specification also embraces the use of ACE as a seven-bit
backwards-compatible encoding for legacy systems to use.
Note that even though a single encoding could have been specified
by this document, past and present requirements would not have
been satisfied by a single choice. For example, supporting UTF-8
alone would mean isolating legacy systems from resources in the
UCS namespace, while supporting ACE alone would not have provided
a truly internationalized namespace (the ACE encoded domain names
still appear in user data quite frequently). By allowing the UTF-8
and ACE encodings to coexist, the existing and emerging
communities can both be served.
Because both encodings will be active during the same time period,
this document also defines DNS protocol extensions which allow the
end-point systems to detect the encoding that is in use for a
particular query/response pair. Note that these negotiation
mechanisms not only allow new and legacy systems to interoperate,
but they also provide a transition service for developers, zone
administrators and end-users, in that ACE encoded domain names can
be initially deployed within existing applications and DNS
systems, while individual elements of the infrastructure can be
upgraded without disturbing other components.
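The downward negotiation can be pictured as a retry loop. The process
example in section 3.6 describes a server which does not understand
the extended label type returning an error, after which the query is
restarted using the legacy label type. The Python sketch below models
only that control flow. Everything in it is an assumption made for
illustration: the send_query callable and its (labels, label_type)
signature are hypothetical, the label type strings are placeholders,
and Python's "punycode" codec stands in for the ACE defined by
[AMC-ACE-Z] without any ACE prefix.

```python
NOERROR = 0  # DNS RCODE 0
FORMERR = 1  # DNS RCODE 1, format error


def lookup_with_fallback(name, send_query):
    """Query with UTF-8 labels first; retry with ACE labels on FORMERR.

    send_query is a hypothetical transport callable which takes
    (labels, label_type) and returns (rcode, answer).
    """
    labels = name.rstrip(".").split(".")
    utf8_labels = [label.encode("utf-8") for label in labels]
    rcode, answer = send_query(utf8_labels, "edns-utf8")
    if rcode != FORMERR:
        return "edns-utf8", rcode, answer
    # The server rejected the extended label type, so fall back to the
    # seven-bit form. ASCII-only labels are passed through unchanged.
    ace_labels = [
        label.encode("ascii") if label.isascii() else label.encode("punycode")
        for label in labels
    ]
    rcode, answer = send_query(ace_labels, "legacy")
    return "legacy", rcode, answer


def legacy_only_server(labels, label_type):
    """Toy stand-in for a server which only accepts legacy labels."""
    if label_type != "legacy":
        return FORMERR, None
    return NOERROR, b".".join(labels)


print(lookup_with_fallback("host.b\u00fccher.example", legacy_only_server))
```

In this sketch the caller learns which encoding was finally used, so
that it can present the answer in the matching form.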
3.3. Common Usage Scenarios
Discussion of the mechanism provided by this document depends upon
the usage context of the domain names themselves. Domain names are
extremely pervasive, and are used by almost every TCP/IP protocol
and application in one form or another. However, most usages fall
under one or more of the following scenarios:
* Connection identifiers - Domain names are most commonly
used as host-specific identifiers for outbound connection
requests, whether this be for a command-line application
such as ping, or as a host name which is stored in an
application's configuration file. Another common usage
scenario for connection identifiers is with reverse
lookups, where a server is logging incoming connections by
the corresponding domain name, or where a program such as
netstat is displaying all of the application sessions which
are currently active on a host. In both of these cases,
domain names are passed through applications to a resolver,
resulting in DNS queries and responses which eventually
provide the requested DNS data.
A related use (but one which does not generate DNS
messages) is determining the host name of the local system.
This is commonly found with applications and protocols that
need to display the domain name of the local system as part
of a protocol operation (such as an SMTP greeting banner)
or as application data.
Connection identifiers (and lookups in general) are
probably the largest single use of domain names today, and
this is likely to be the case with internationalized domain
names as well. This document fully supports the use of
internationalized domain names for lookup operations, as
long as the calling application, the stub resolver, the
local caching servers, and the authoritative servers for
the specified domain name are compliant with this
specification. If any of these components are not capable
of supporting internationalized domain names in this
manner, the ACE equivalent domain name will be negotiated
for the operation at hand.
* Protocol data - Some application protocols exchange domain
names as protocol data, with those domain names either
determining or altering a service-specific operation.
Examples of this usage include SMTP envelopes ("RCPT TO
<user@domain.dom>") where the domain name is used to
determine whether or not a particular email message should
be accepted for delivery, the HTTP HOST header field which
identifies a specific document tree on a shared server,
BOOTP/DHCP options, WHOIS input, and more.
Because these protocols treat domain names as protocol
data, most of these protocols also have specific formatting
requirements which must be addressed before UTF-8 domain
names can be used by these protocols directly. This
document is intended to facilitate the use of UTF-8 encoded
domain names in this manner, although it is expected that
most of the protocol development groups will need to
develop negotiation mechanisms before these protocols can
use internationalized domain names directly. Until such
work is completed, ACE equivalent domain names can be used
to provide these protocols with access to the
internationalized namespace.
* Structured application data - Structured application data
is similar to protocol data in that it can trigger or
affect some protocol action, although this will not always
occur. For example, a web browser can process an embedded
IMG link which may be present in a web page, while a user
can manually follow an embedded email link which is also
stored in the same web page; even though both usage models
share the same structured data format (URLs), they are
processed differently by the application. Similarly, email
messages typically contain multiple domain names as
structured data in the message headers, and some of these
domain names will directly affect subsequent protocol
operations, while others will not.
Because of this ambiguity, this document defines no
specific treatment for structured application data. In some
cases, no additional mechanisms will be required, while
other scenarios will require negotiation mechanisms before
an internationalized domain name can be used in the
structured data (with ACE being required as the interim
format). Each protocol development group is encouraged to
analyze each usage independently, to classify the usage as
a connection identifier, protocol data, or unstructured
application data, and to determine the appropriate course
of action for each usage accordingly.
* Unstructured application data - Many application protocols
provide free-text data which can contain domain names, but
with those domain names existing as unstructured data. For
example, an email message which is provided as a text/plain
MIME body part may contain a domain name which identifies a
system or service in the context of a specific application,
but in an unstructured form ("your files were moved from
server1 to server2"). Similarly, an email address may be
provided in WHOIS output, but as unstructured data which
does not affect the protocol.
Given the application-specific nature of this data, it
cannot be managed by any global protocol or process. Where
a protocol has rules or restrictions on the data itself,
then those rules are maintained, but some formatting rules
may need to be extended before internationalized domain
names (or their equivalents) can be encoded in the
application data. For example, internationalized domain
names in email messages may need to be converted to a
preferred display charset, while ACE equivalents may be
necessary for protocols which only support US-ASCII.
Each of the above scenarios represents a distinct handling case
where internationalized domain names may or may not be used
directly. In some cases, the internationalized domain names may be
used as soon as the applications and resolvers are configured to
use them, while in other cases, measured and cautious deployment
is required in order to prevent undue breakage. In the latter
cases, however, the backwards-compatible ACE encoding is available
so that the internationalized domain names can be used.
3.4. User Audiences
Another perspective on the changes which will result from
deploying the mechanisms described in this document can be seen by
analyzing how any such changes will affect the different
"audiences" who work with domain names, and who have their own
unique context-specific usage requirements and objectives. The
three main audiences discussed in this document are:
* Developers. Protocol and application developers need to be
able to incorporate internationalized domain names into
their systems as easily as possible, although there are
many factors which will affect such usage, including the
input and output charsets and encodings which are available
to the applications and protocols. Where feasible, this
specification allows developers to choose any charset or
encoding which may be required and suitable for use,
although in most cases, a recommendation is also made for
the use of UTF-8 in particular.
Developers may adopt internationalized domain names for
connection identifiers and lookup operations fairly
quickly, such that users can use those systems as soon as
they have compliant systems (and they have a target domain
name to communicate with). Implementing support for
internationalized domain names in protocols and application
data will require additional effort by the affected
development groups.
Support for ACE will be harder to implement, since it is a
relatively new and untested encoding syntax, with no
existing developer tools. This will likely be the largest
hurdle to overcome when developing applications for use
with this service.
* Zone administrators. Organizations that wish to deploy
internationalized domain names should be able to do so
easily, at a reasonable cost, and without suffering
excessive pre-conditions. Towards this objective, the
mechanisms described by this document allow organizations
to deploy and use internationalized domain names within any
zone immediately, without requiring any other zone to have
been updated beforehand (although there are specific and
strong suggestions for upgrading the Internet's high-load
servers as soon as possible).
If an organization wishes to publish internationalized
domain names for users to access and utilize, the
authoritative servers for the affected zone must be
compliant with the naming rules and message formats
described by this document, which will almost certainly
require the administrators of that zone to upgrade their
servers. However, organizations may also choose to only
deploy ACE encoded domain names if an immediate migration
is not feasible, with the caveat that internationalized
domain names in their native form will not be available
from those zones.
* Network operators. The systems and human users which
generate DNS lookups are another area of concern, as these
protocols, programs and users will expect these lookups to
succeed, and will also expect that the visible namespace
will be compatible with the capabilities of the requesting
system at a minimum investment. This is a broad range of
requirements.
At a minimum, applications must be capable of generating
and accepting the internationalized domain names if they
are to use those domain names (see the "Developers"
discussion above for the application requirements).
Similarly, the local resolvers, caches and forwarders on
the user's network must also support the message formats if
they are to relay internationalized domain names between
their local applications and the remote zones being
queried. If the applications, resolvers and caches do not
support these requirements, intermediary systems will
perform the down-level negotiation automatically on their
behalf such that additional effort is not required on the
user's part.
In summary, the developers, zone administrators and end-users can
immediately participate in the internationalized namespace at no
additional expense if they are content with using ACE encoded
domain names, and can use internationalized domain names in their
native form if they are willing to make the necessary investments.
Furthermore, since the native and backwards-compatible encodings
are not mutually exclusive, implementers of this specification
have the option of adopting ACE for immediate use and then
transitioning to internationalized domain names on a per-system,
per-zone, or per-application basis, according to their schedule.
3.5. Service Overview
This document specifies a variety of extensions to several
different protocols and services in order to facilitate the use of
internationalized domain names anywhere this support exists or can
be implemented, and to provide a legacy-compatible domain name in
all other situations.
More specifically, this document defines or clarifies behavior for
the following elements:
* Host name character restrictions. Legacy protocols and
applications are currently restricted to the legacy host
naming rules, which only allow for a subset of US-ASCII
characters (letters, digits and the hyphen character). This
document redefines the characters which are valid within a
host name so that system identifiers, domain name parts of
host names, and new network services can use most of the
characters from the UCS.
* DNS message format. This document defines an extended label
format based on the extended label services provided by
RFC2671 (Extension Mechanisms for DNS - EDNS0) [RFC2671],
with this label format being used to encapsulate UTF-8
encoded internationalized domain names in DNS messages. Any
DNS message which carries the UTF-8 encoded domain names is
required to use the EDNS/UTF-8 label type defined in this
document. Any DNS message which carries legacy domain names
(including the ACE encoded equivalent domain names) is
required to use the traditional message format.
* Application handling rules. Applications can use
internationalized domain names immediately for lookup
operations that do not directly affect external services or
protocols, can use ACE encoded sequences to specify
internationalized domain names in legacy protocol
operations, and can use both forms at the same time.
* Stub resolvers. Stub resolvers will most likely need to
provide a series of internationalized APIs in order to
fully support applications that generate internationalized
domain name lookups. For example, these APIs will almost
certainly be required in order for the resolver to
determine that the calling application is compliant with
the host name requirements defined by this document, and
that the domain names should be encoded in the proper label
format. Although this specification does not dictate these
APIs, it encourages their use, and provides some guidance
on the issues surrounding their use.
* Forwarders, resolving servers and caches. The user-side
servers which process internationalized domain names have
several protocol-specific requirements, including the
negotiated fall-back service when UTF-8 queries fail.
* Authoritative servers. A key part of this specification is
the simultaneous support for internationalized and legacy
compatible domain names in the UCS namespace, thereby
allowing a domain name to be entered into an authoritative
zone database once, and for the appropriate response to be
generated by a server according to the label encoding from
the associated query. In order for this to work, this
specification requires authoritative servers which serve
internationalized domain names to comply with specific
conditions. This specification also allows existing servers
to serve ACE equivalent domain names when the authoritative
servers cannot be upgraded, although this typically results
in lower levels of functionality.
The elements listed above collectively define a completely
internationalized domain name system, which is capable of
servicing internationalized domain names in all compliant systems,
and which is also capable of providing ACE encoded equivalent
domain names when any component from the internationalized service
is not available.
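The distinction between the traditional message format and the
extended label format can be seen in the first octet of each label.
Under RFC 2671, the top two bits of that octet select the label type:
00 for a traditional length-prefixed label, 11 for a compression
pointer, and 01 for an extended label whose type code occupies the
lower six bits. The Python sketch below classifies an octet on that
basis. It is illustrative only: the function name and the returned
strings are hypothetical, and no particular type code is assumed for
the EDNS/UTF-8 label.

```python
from typing import Tuple


def classify_label_octet(octet: int) -> Tuple[str, int]:
    """Classify the first octet of a label in a DNS message.

    Returns (kind, low_bits), where low_bits is the lower six bits.
    """
    if not 0 <= octet <= 0xFF:
        raise ValueError("octet must be in the range 0..255")
    top = octet >> 6      # the two label type bits
    low = octet & 0x3F    # length, pointer high bits, or type code
    if top == 0b00:
        return "standard", low   # low is the label length (0..63)
    if top == 0b11:
        return "pointer", low    # low is the high part of the offset
    if top == 0b01:
        return "extended", low   # low is the extended label type code
    return "reserved", low       # the 10 pattern is not allocated


print(classify_label_octet(0x04))  # a traditional four-octet label
print(classify_label_octet(0x41))  # an extended label, type code 1
```

A server which predates RFC 2671 has no rule for the 01 pattern, which
is why such a server rejects the query rather than parsing the label.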
3.6. Process Example
This section illustrates a series of query/response transactions
under which the processes and protocols defined in this document
function. This example uses a reverse lookup for the PTR resource
record associated with the "14.2.0.192.in-addr.arpa." domain name
(forward lookups work similarly, but the issues are more fully
demonstrated by PTR lookups). Each of the various technologies
shown below is described in later sections of this document. The
sole purpose of this example is to provide an illustration of
these mechanisms in order to facilitate better discussion.
Note that this illustration represents a worst-case scenario
(thereby exercising most of the functionality provided by this
specification), and does not represent a typical scenario.
a. First, a PTR resource record for 14.2.0.192.in-addr.arpa.
is added to the internationalized zone database on the
replication master server for the 2.0.192.in-addr.arpa.
zone, with the resource record data value of
"host.<idn>.example.com." (where <idn> is an
internationalized domain name compliant with the host
naming rules provided in this document). Both of these
domain names have a primary representation consisting of
UCS characters in some local encoding, but are also
available as UTF-8 and ACE encoded data so they can be
encapsulated within DNS queries and responses.
Once the zone is reloaded and is replicated by the other
authoritative servers for that zone, the domain names can
be processed.
b. An application on a remote system generates a DNS lookup
for the PTR resource record associated with the
14.2.0.192.in-addr.arpa. domain name.
If this is a legacy application, it issues the lookup using
the only method it knows, which is to pass the domain name
to the legacy resolver API. This would result in the
resolver issuing a legacy DNS query for the PTR resource
record associated with the specified domain name.
If this application is compliant with this specification,
it performs the following steps:
1. Verify that the resolver is capable of processing
queries for UTF-8 domain names by probing for an
internationalized API. If this step failed, then the
domain name would be converted to the legacy STD13
octet encoding in step 3.6.b.3 and passed to the
resolver's legacy API.
2. Convert the domain name from its generated encoding to
the canonical UCS characters, and then normalize and
case-convert the UCS characters.
3. Convert the normalized and lowercased UCS characters
to the charset or encoding used by the resolver's
internationalized API.
4. Issue a lookup for the PTR resource record associated
with the internationalized domain name, via the
resolver's internationalized API.
Note that even though the domain name is compatible
with the legacy host name rules, the domain name is
passed through the internationalized API so that
servers can tell whether or not the original
application is UTF-8 compliant, and can determine the
format of any internationalized domain names which are
to be returned in the response messages. This is
required in case the queried resource record includes
internationalized domain names as resource record data
(as would be the case with PTR resource records), and
is also required for the proper handling of any SOA or
NS resource records which may be returned as
additional data in the response.
For the purpose of this example, we will assume that each
of these steps was successfully performed.
c. The client's stub resolver generates the query, with the
Question Section of the query containing the UTF-8 encoded
domain name encapsulated in an EDNS/UTF-8 extended label.
d. The stub resolver sends the query to one of its configured
resolving servers.
e. The resolving server will either answer the query from its
cache or forward the query to a name server which is
authoritative for the namespace hierarchy, as per the
normal query-resolution procedure. For the purpose of this
example, we will assume that the server has no information
about the specified domain name, so it forwards the query
to one of the root zone's authoritative servers in order to
begin the iterative resolution process.
f. The queried server responds with a referral, providing
delegation data for a zone in the path to the queried
domain name. For the purposes of this example, we will use
192.in-addr.arpa. as the delegation domain specified in the
referral message.
The specific format of the referral will depend on whether
or not the queried server understands the EDNS/UTF-8 label
encoding. If the server is compliant with this
specification (which it is, or else it wouldn't have
answered with a referral), then the referral will also
provide EDNS/UTF-8 encoded domain names in the Authority
and Additional-Data Sections of the referral. If the server
was not compliant with this specification, it would return
an error upon seeing the extended label type, which would
cause the resolving server to restart the query using the
legacy label type.
g. The resolving server decodes the UTF-8 encoded domain names
to their UCS character representation, caches the resource
records in their UCS form, and sends the query to one of
the authoritative servers for the referral zone. Note that
the cache did not normalize or case-convert the UCS
characters; only the end-systems perform this work.
h. In this case, the queried server does not understand the
EDNS/UTF-8 label format, and has returned a FORMERR
response code.
i. When these errors are encountered, the current resolver
(whether this is the client's stub resolver or a caching
server in the query path) must convert the query domain
name from its current form to a legacy-compatible encoding
(either ACE or STD13 octet sequences, depending on the UCS
characters which have been encoded), and then has to
reissue the query in that format.
In this case, the domain name only contains printable
characters from US-ASCII, so the STD13 octet encoding is
used for the fall-back query. Because the UCS domain name
was normalized and lowercased before it was passed to the
client's stub resolver, the legacy domain name will also be
in this format (although it will be compared in a case-
neutral form by the recipient server).
Note that once this conversion takes place, the legacy
label format is used for the remainder of the current query
chain (this prevents excessive delays from multiple fall-
back operations, which could result in timeouts at the
original resolver or application).
j. The queried server returns a delegation referral for the
2.0.192.in-addr.arpa. zone. Since the query arrived in the
STD13 octet encoding, the server has no indicator of the
client's capabilities, so the referral NS resource records
will also be returned in legacy compatible form (either as
STD13 octet sequences or as ACE encoded data, depending on
the character codes provided in each label from each of the
associated domain names).
Note that even though these NS resource records will be
restricted to legacy-compatible host names and label types,
they may contain and reference ACE domain names. In this
regard, a legacy server in the delegation path does not
prevent internationalized domain names from being delegated
or resolved, but only prevents them from being processed as
EDNS/UTF-8 extended labels.
Also note that once the authoritative servers for a zone
have been discovered and cached, any subsequent UTF-8
queries which are generated for the resources in that zone
will be sent directly to one of those servers, bypassing
the delegation hierarchy. As such, subsequent queries which
are provided in EDNS/UTF-8 labels can be processed directly
by the zone's authoritative servers, without the delegation
servers disrupting the process.
k. The resolving server decodes the STD13 octet sequences and
ACE encoded domain names to their UCS character
representations, caches the resource records, and resends
the query to one of the authoritative servers for the
referral zone.
l. The queried server processes the request. Since this query
arrived as an STD13 octet sequence, the server must compare
the seven-bit characters from the domain name (which is all
of them, in this example) in a case-neutral form. Note that
if the query had arrived as ACE or UTF-8 encoded domain
names, the server would have decoded the specified domain
name to its canonical UCS characters and performed a case-
exact match against the resulting characters.
m. The queried server responds with the requested data. Note
that the query was submitted in the legacy label form due
to the fall-back processing which occurred in step 3.6.i,
so the server will only respond to this query with STD13
octet sequences or ACE encoded domain names, using the
STD13 legacy label.
n. The resolving server decodes the STD13 octet sequences and
ACE encoded domain names to their UCS character
representations, and caches the resource records. Since the
query was originally received as an internationalized
domain name (as indicated by the EDNS/UTF-8 extended label
from the original query), the resolving server has to
encode the answer data as UTF-8 before passing it back to
the client's stub resolver. However, since the input was
not provided in an encoded UCS form, the server has to
normalize and case-convert the STD13 octet sequence in
order to provide a valid internationalized domain name.
o. The stub resolver decodes the UTF-8 encoded domain names
which have been provided in the response message to their
UCS character representation, and passes the data to the
original calling application using the charset or encoding
favored by the resolver.
p. The application validates the received domain name by
decoding the internationalized domain name to its canonical
UCS characters, normalizing and down-casing the resulting
domain name, and comparing the results with the answer data
which was provided by the resolver.
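The application-side processing from step 3.6.b can be sketched
as follows. This is an illustration only, not a definitive
implementation: the function name prepare_lookup_name and the
has_idn_api flag are hypothetical, the internationalized API is
assumed to accept UTF-8, and NFKC normalization followed by
lower() is used as a rough stand-in for the normalization and
case-conversion rules required by this specification.

```python
import unicodedata
from typing import Tuple


def prepare_lookup_name(name: str, has_idn_api: bool) -> Tuple[str, bytes]:
    """Return (api, octets) for a lookup, following step 3.6.b.

    has_idn_api stands for the (hypothetical) result of probing the
    resolver for an internationalized API.
    """
    if not has_idn_api:
        # Probe failed: use the legacy STD13 octet encoding. Latin-1
        # maps U+0000..U+00FF to single octets and raises
        # UnicodeEncodeError above that range, where ACE would be needed.
        return ("legacy", name.encode("latin-1"))
    # Normalize and case-convert the UCS characters. NFKC plus lower()
    # only approximates the rules required by this specification.
    canonical = unicodedata.normalize("NFKC", name).lower()
    # Convert to the encoding of the internationalized API, which is
    # assumed to be UTF-8 in this sketch.
    return ("idn", canonical.encode("utf-8"))
```

For the domain name used in this example, both paths produce the
same octets, and only the API that receives them differs.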
As can be seen, the UTF-8 name resolution process is identical to
the current resolution process, with the addition of a single
fall-back query in step 3.6.i which resulted in one extra
query/response pair (roughly equivalent to adding one extra
delegation referral into the query path), and with several
different encoding conversions, as required by the participating
systems and services. This example also illustrates the
requirements which are placed on developers, zone administrators,
and network operators in order for typical connection identifier
services to function with UTF-8 domain names.
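The single fall-back operation from step 3.6.i can be sketched as
follows. The names query_with_fallback, legacy_encoding_for and
send_query are hypothetical, and send_query stands in for whatever
transport a resolver actually uses. The choice of ACE for any name
outside the seven-bit range is a simplifying assumption of this
sketch.

```python
NOERROR = 0  # DNS response code: no error
FORMERR = 1  # DNS response code: format error


def legacy_encoding_for(name: str) -> str:
    """Pick the legacy-compatible fall-back encoding for a UCS name."""
    # Seven-bit names fit directly into STD13 octet sequences.
    if all(ord(ch) < 0x80 for ch in name):
        return "STD13"
    # Assumption: everything else is carried as ACE in a legacy label.
    return "ACE"


def query_with_fallback(name, servers, send_query):
    """Query each server in turn, falling back at most once.

    send_query(server, name, encoding) is a hypothetical callable that
    returns a DNS response code. The list of (server, encoding)
    attempts is returned so that the extra query is visible.
    """
    encoding = "UTF-8"
    attempts = []
    for server in servers:
        rcode = send_query(server, name, encoding)
        attempts.append((server, encoding))
        if rcode == FORMERR and encoding == "UTF-8":
            # The extended label was rejected: convert once, reissue,
            # and keep the legacy form for the rest of the query chain.
            encoding = legacy_encoding_for(name)
            send_query(server, name, encoding)
            attempts.append((server, encoding))
    return attempts
```

With three servers in the path, of which only the second rejects
the extended label, the sketch issues four queries: one more than
the number of servers, and the third server is only ever queried
in the legacy form.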
However, if each system and service had used UTF-8 for encoding
purposes (including everything between the stub resolver's APIs
and the authoritative servers for the target zone), then no
additional queries or conversions would have been required (other
than the direct UCS conversions required for validation and
caching, the latter of which can be performed separately without
affecting the processing path). In this regard, the example above
illustrates how this system can function even when only a portion
of the participating systems utilize UTF-8, and also illustrates
how effective the entire operation would be if all of the
recommendations and requirements provided in this specification
were adopted.
It is also important to reiterate here that any such costs
associated with this compliance are entirely elective by the
affected parties. If they want to streamline the process, the
option is available to them, although the system also works when
very few optimizations are implemented.
4. The Internationalized Namespace
In simple terms, this specification defines an internationalized
namespace which consists of domain names and labels that contain
UCS character codes, and also specifies a series of encoding
formats which may be used whenever the UCS values need to be
encapsulated for transmission within DNS messages or application
data streams.
In this regard, the internationalized namespace is the UCS
representation of the domain names and labels as they are used for
comparison operations once a domain name arrives for processing,
while the transfer encodings ensure that a domain name arrives at
the destination system intact, so that it may be processed in its
canonical form.
There are four conceptual elements to this model:
* Character codes. Labels from internationalized domain names
have a single logical canonical representation as sequences
of UCS code point values. The UCS characters are used when
a particular label from a domain name is created by an
application, stored in a zone, hosts or cache database, and
is used whenever two sets of domain names or labels need to
be compared. However, different kinds of domain names have
different rules which govern the character codes that may
be used.
* Storage encodings. Whenever a domain name is created or
copied from the network, it must be stored in a format that
is reversible to the canonical UCS character representation
of that domain name. This specification does not mandate or
require any particular storage encoding, and allows this
decision to be made on a per-implementation basis, as long
as the storage encoding supports character codes which can
be converted to UCS equivalent values for comparison
purposes. However, the use of UTF-8 for this purpose is
encouraged, since it is the most common.
* Transfer encodings. Whenever a domain name needs to be sent
over the network, it must be packaged in a form which is
compliant with the capabilities of the transfer protocol in
use. This document specifies three transfer encodings which
may be used to encode canonical UCS character codes in DNS
messages or application streams, which are: the octet
encoding from STD13, the ACE encoding from <ACE-Z>, and the
UTF-8 encoding from RFC2279. Each encoding has different
costs and benefits in different usage scenarios.
* Comparison operations. When two domain names need to be
compared, they also follow rules which are appropriate to
the type of domain name being provided, and the transfer
encoding which may have been used to provide the domain
name to the system.
This document defines four distinct types of internationalized
domain names which may exist in the internationalized namespace,
and also describes how each of the above considerations affects
those domain names and their labels. These domain name types are
described throughout the remainder of this section.
4.1. Internationalized Domain Names and Labels
This section describes the master template rules for all domain
names and labels which may be used in the internationalized
namespace, although subordinate rules and restrictions are also
applied as secondary filters, depending on the intended usage of
the domain name.
For example, domain names and labels which are to be used as
internationalized host identifiers (either as host names, or as
domain names which are used to specify a host) are restricted to a
specific subset of UCS characters. Meanwhile, domain names and
labels which are compliant with STD13's global rules are
restricted to eight-bit code values, while the domain names and
labels which are used as STD13 host identifiers are restricted to
a specific subset of US-ASCII.
The following diagram illustrates how the subordinate rules are
applied and interpreted against the master restrictions:
               +-----------------------+
               | Internationalized DNs |
               +-----------------------+
                any UCS character codes
                  /           |
                 /            |
                /             |
               /              |
   +-----------+        +-----------+     +------------+
   | Int. Host |        | STD13 DNs +-----+ STD13 Host |
   +-----------+        +-----------+     +------------+
     normalized           character       ASCII letters,
     subset of           codes 0x00        numbers, and
     UCS chars          through 0xFF       hyphen char
As can be seen, the internationalized domain names and labels
rules allow any UCS character code to be stored, although each
particular usage of the domain names and labels will have its
own secondary rules and restrictions.
In order to allow future documents to define additional rules as
required for their usage, this document defines very few global
rules on the core internationalized domain names and labels.
4.1.1. IDN syntax and structure
In this specification, an internationalized domain name consists
of a variable number of labels, each of which contains a variable
number of UCS character codes, not all of which will have defined
UCS character interpretations.
Furthermore, the encoding system which is used to store and
interpret those values on a system is not relevant to this
specification, and is therefore not defined. The characters in a
label can be stored in memory or on disk as UTF-8, UCS-4, ACE, or
any other storage encoding which is desired by the operators and
implementers of the affected system, as long as that encoding
system is reversible to the canonical UCS character code values,
and is able to represent the necessary range of UCS characters
(the "necessary range" varies by operation).
The only universal restrictions which apply to internationalized
domain names and labels are those which govern length. This
specification requires that labels from internationalized domain
names MUST be restricted to a minimum length of one character and
a maximum length of 63 characters, inclusive. The exception to
this rule is the root domain, which is always represented by a
zero-length label. Note that this rule specifically refers to the
canonical UCS characters, rather than any encoded form (encoding
will often leave room for fewer actual characters in a label or
domain name, due to overhead from the encoding algorithm).
A fully-qualified internationalized domain name is formed by
joining a series of labels together, with the most-contextually
specific label in the left-most position of the label sequence,
and with the root domain occupying the right-most position. The
sum total of all labels in an internationalized domain name MUST
NOT exceed 255 characters, inclusive. Any number of labels MAY be
stored in the domain name, but the sum total of their lengths MUST
NOT exceed this limit.
However, labels which contain UCS character codes greater than
U+007F will result in multi-byte UTF-8 and ACE encodings, so the
maximum length of a label or an internationalized domain name is
governed by their UTF-8 and ACE encoded lengths. Both encodings
MUST result in an encoded length of 63 octets or less in order to
be usable, with a maximum cumulative length of 255 octets.
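The upper length limits can be sketched as a simple check. The
function name check_lengths is hypothetical. The sketch counts
only label contents, as the text above does, and does not account
for the length octets of the DNS wire format. Only the UTF-8
encoded length is tested; the ACE encoded length would have to be
tested in the same way.

```python
MAX_LABEL = 63   # per-label limit, in UCS characters and in octets
MAX_NAME = 255   # cumulative limit for the whole domain name


def check_lengths(labels):
    """Return True if the labels satisfy the upper length limits."""
    total_chars = 0
    total_octets = 0
    for label in labels:
        octets = len(label.encode("utf-8"))
        # A label can fail on its character count or its encoded size.
        if len(label) > MAX_LABEL or octets > MAX_LABEL:
            return False
        total_chars += len(label)
        total_octets += octets
    return total_chars <= MAX_NAME and total_octets <= MAX_NAME
```

For example, a label of 32 copies of U+00E1 is only 32 characters
long but occupies 64 octets as UTF-8, so it is rejected.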
4.1.2. IDN transfer encodings
The UCS currently occupies a 21-bit range of character code
values, containing tens of thousands of assigned characters, and
hundreds of thousands of unassigned characters. Due to the multi-
byte nature of the code point values, UCS characters cannot be
passed as protocol or application data in most of the existing
Internet protocols (including DNS messages), at least not without
the help of some kind of encoding scheme. At the very least, the
UCS character values have to be encoded as eight-bit sequences if
they are to fit within existing eight-bit data structures, and
have to be encoded as a subset of US-ASCII characters if they are
to be usable with legacy protocols and applications which only use
STD13's host identifier rules for their structured domain name
data types.
With this objective in mind, this document defines three different
transfer encoding systems which can be used to convert
internationalized domain names and labels into a form which is
suitable for transfer in different data streams. These are the
legacy STD13 octet encoding, ACE, and UTF-8. Each of these
encoding schemes provides different benefits and capabilities to
the internationalized DNS effort.
* STD13 octets. The STD13 octet encoding scheme provides a
direct one-to-one mapping between eight-bit characters and
their eight-bit values, but it is only capable of storing
character codes in the range of U+0000 through U+00FF,
which severely restricts its usefulness.
* ACE. The ACE encoding scheme is capable of storing UCS
character code values as seven-bit sequences in STD13 legacy
labels. While this makes it practically compatible with the
legacy host identifier rules, the resulting data imposes
additional labor on the Internet community, and the reuse
of the legacy label also results in certain amounts of
ambiguity with some DNS domain names and labels.
* UTF-8. The UTF-8 encoding scheme is capable of encoding all
UCS character code values as sequences of eight-bit data
which are compatible with legacy DNS message restrictions,
but the encoded output requires explicit support from
internationalized applications and protocols. UTF-8 output
uses a new label type in order to prevent additional
ambiguity problems from arising.
The table below illustrates the UCS character code sequences which
are supported by each of the different encoding schemes.
                    STD13
                    Octets    ACE    UTF-8
                  +-------+-------+--------
                  |       |       |
    US-ASCII      |   Y   |       |   Y
                  |       |       |
    Eight-Bit     |   Y   |   Y   |   Y
                  |       |       |
    Any UCS Chars |       |   Y   |   Y
                  |       |       |
More specifically, the character code sequence ranges and their
valid encodings are:
* US-ASCII. If a label only contains character codes from the
range of U+0000 through U+007F, then it MAY be encoded as a
legacy STD13 octet sequence or UTF-8, but MUST NOT be
encoded as ACE.
Note that this specification explicitly prohibits seven-bit
labels from being encoded as ACE data, since such an action
would be redundant, results in greater processing overhead
for those labels, and multiple representations introduce
problems with caches on legacy systems. Furthermore,
certain security risks would be introduced if this were
allowed. For example, a malicious user could register or
purposefully create an ACE encoded representation of the
"example.com" label sequence such that users mistakenly
sent sensitive data to malicious systems.
In order to prevent these problems from occurring, this
specification requires that any ACE-encoded label which
consists entirely of seven-bit characters MUST be
immediately discarded with extreme prejudice. This rule
applies to every implementation of this specification,
including any applications, resolvers, caches or servers
which process labels.
* Eight-bit codes. If a label contains character codes from
the eight-bit range of U+0000 through U+00FF, then it MAY
be encoded as STD13 octet sequences, ACE, or UTF-8. This
rule specifically requires that the label MUST contain at
least one character from the eight-bit range, MAY contain
any number of characters from the seven-bit range, but MUST
NOT contain characters with code values which are greater
than U+00FF.
Since the STD13 octet encoding and ACE both use the legacy
STD13 label type, this specification relies on the input
encoding of a domain name in order to determine the output
encoding. In some cases, however, the input encoding will
not be clear, or will not be specified, and this can result
in some ambiguity with label sequences from this range.
For example, if the domain name provided in a query
consists of seven-bit labels, then the STD13 octet sequence
is the only valid encoding for the legacy STD13 label,
meaning that ACE could not have been used in the query. If
the specified domain name exists as a CNAME resource record
which refers to a domain name that contains eight-bit
character codes, then the proper output encoding for that
domain name will not be clearly discernable. Moreover, the
STD13 and ACE encodings will generate different results,
since the STD13 octet sequence will only contain a single
octet for the eight-bit character, while the ACE encoding
will contain multiple octets of encoded data.
When this situation arises, systems MUST give preference to
the ACE encoding, on the assumption that the referenced
character is more likely to represent a UCS character than
an eight-bit code value (the UCS characters in this range
are Latin-1, which are the most common characters after the
legacy US-ASCII set). Furthermore, the ACE encoded
representation of these characters allows for a broader
range of subsequent operations (since it complies with the
legacy host naming restrictions, it can be used with CNAME
resource records that refer to hosts), while the STD13
octet encoded representation does not.
It is possible to avoid this scenario on authoritative zone
servers (and thus the affected caches) by allowing the
operator to specify whether or not the input is Latin-1 UCS
character data or binary data, with the server generating
the proper output accordingly. Also note that the default
encoding specified by this document is UTF-8, which does
not suffer from the ambiguity problems described above.
* Any UCS character codes. If a label consists of any
character codes greater than U+00FF, then it MAY be encoded
as ACE or UTF-8, but MUST NOT be encoded as STD13 octet
sequences. STD13 is not capable of representing character
codes greater than U+00FF, so it cannot be used with any
UCS characters beyond the eight-bit range.
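The three character code ranges above reduce to a test on the
highest code point in a label. The following is a minimal sketch
of that rule, in which the function name permitted_encodings and
the string tags for the encodings are hypothetical.

```python
def permitted_encodings(label: str) -> set:
    """Return the transfer encodings permitted for a UCS label."""
    # An empty label has no code points, so treat it as seven-bit.
    highest = max(map(ord, label), default=0)
    if highest <= 0x7F:
        # US-ASCII only: ACE is explicitly prohibited.
        return {"STD13", "UTF-8"}
    if highest <= 0xFF:
        # At least one eight-bit code, none above U+00FF.
        return {"STD13", "ACE", "UTF-8"}
    # Beyond the eight-bit range: STD13 octets cannot be used.
    return {"ACE", "UTF-8"}
```

An implementation which receives an ACE-encoded label can apply
the same test to the decoded characters in order to detect the
prohibited seven-bit case.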
Encodings are performed on a per-label basis. Each label MUST NOT
be encoded more than once. Also note that recursive encodings
result in applications discarding the domain name.
When the STD13 octet encoding is used to encode labels for
transmission, the labels are encoded according to the rules
specified in STD13, and are encapsulated in STD13 legacy labels.
When ACE is used to encode labels for transmission, the labels are
encoded according to the rules specified in <ACE-Z>, and are
encapsulated in STD13 legacy labels (this process is described in
section 5.2).
When UTF-8 is used to encode labels for transmission, the labels
are encoded according to the rules specified in RFC2279, and are
encapsulated in EDNS/UTF-8 extended labels (the format of this
label is described in section 5.1).
Note that a domain name MAY contain any combination of STD13 octet
encoded labels and ACE encoded labels. However, if a domain name
contains any UTF-8 encoded labels, then ALL of the labels from
that domain name MUST be encoded as UTF-8 data. This rule
primarily exists so that DNS compression services can be
maintained consistently, but it also prevents mixed referrals
which can trigger unnecessary fall-back processing, and also
provides a single encoding representation to internationalized
systems which benefits efficiency.
The root domain (as specified by the zero-length label at the
right edge of the domain name) MUST NOT be encoded with ACE. More
specifically, zero-length labels MUST NOT contain any character
data of any kind, and since ACE labels have prefix strings, they
are explicitly forbidden from being used for the root domain.
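The mixing rules above can be sketched as a check over the labels
of a single domain name. The function name valid_label_mix and
the representation of a domain name as a list of (text, encoding)
pairs are assumptions made for this illustration only.

```python
def valid_label_mix(labels):
    """Check the encoding mix of one domain name.

    labels is a list of (text, encoding) pairs, where encoding is one
    of "STD13", "ACE" or "UTF-8", and the root label has empty text.
    """
    used = {enc for _, enc in labels}
    if not used <= {"STD13", "ACE", "UTF-8"}:
        return False
    # UTF-8 is all-or-nothing across the whole domain name.
    if "UTF-8" in used and used != {"UTF-8"}:
        return False
    # A zero-length label carries no character data, so it can never
    # be ACE (which always has a prefix string).
    return all(enc != "ACE" for text, enc in labels if text == "")
```

STD13 octet and ACE encoded labels may be mixed freely because
both are carried in the same legacy label type.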
4.1.3. IDN comparison operations
When an internationalized domain name label is received from the
network as ACE or UTF-8 encoded data, the labels MUST be decoded
to their canonical UCS character representation, and the resulting
UCS characters MUST be compared as case-exact sequences to their
stored equivalents. Except where specifically required in this
specification (e.g., validity tests which are performed by
applications), normalization and case-conversion MUST NOT be
performed against the resulting UCS character codes prior to any
comparison operations being performed.
However, internationalized domain name labels which are received
as STD13 octet sequences MUST be given special treatment, as these
domain names could have originated from legacy systems operating
under STD13's rules. In this case, the seven-bit US-ASCII
alphabetic characters (U+0041 through U+005A, and U+0061 through
U+007A) from those labels MUST be compared in a case-neutral form.
All other code values MUST be compared as case-exact code values
(this particularly includes eight-bit characters, which were not
defined by STD13).
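The two comparison modes can be sketched as follows, operating on
labels which have already been decoded to UCS characters. The
function names ascii_fold and labels_match, and the string tags
for the received encoding, are hypothetical.

```python
def ascii_fold(label: str) -> str:
    """Lowercase only U+0041..U+005A; leave all other codes untouched."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch for ch in label
    )


def labels_match(received: str, stored: str, encoding: str) -> bool:
    """Compare a received label with its stored equivalent."""
    if encoding == "STD13":
        # Legacy octets: US-ASCII letters are case-neutral, while
        # every other code value is compared case-exact.
        return ascii_fold(received) == ascii_fold(stored)
    # ACE or UTF-8: case-exact comparison, with no normalization.
    return received == stored
```

Note that str.lower() is deliberately avoided in the legacy case,
since it would also fold the eight-bit letters which STD13 did
not define.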
4.2. Internationalized Host Identifiers
Internationalized host identifiers are a subset of the
internationalized domain names described in section 4.1, which
only use a subset of the allowable UCS characters, but which reuse
the global transfer encodings and comparison routines.
Most of the displayable characters from the UCS can be used in
host identifiers, and there are no additional rules governing the
ordering or length of their labels. However, the characters which
are used in internationalized host identifiers MUST be normalized
and case-converted before they are encoded for storage or
transfer. This requires more effort on the part of applications
and servers when the internationalized domain names are initially
created, but results in less ambiguity and lower processing
requirements for servers, caches and resolvers during subsequent
comparison operations.
The restrictions which govern the creation of internationalized
host identifiers are as follows:
a. Labels MUST be restricted to the subset of characters which
are permitted by <nameprep> [nameprep]. Characters which
are prohibited by <nameprep> MUST NOT appear in any label
of any internationalized host identifier.
b. Labels MUST be normalized through <nameprep> before they
are stored or encoded for transfer. Internationalized host
identifiers will not be normalized as part of any
comparison operation, so systems MUST normalize the labels
before they are stored or transmitted.
c. Labels MUST be converted to lowercase according to the
case-mappings rules specified in <nameprep> before they are
stored or encoded for transfer. Internationalized host
identifiers will not be converted to lowercase as part of
any comparison operation, so systems MUST convert the
labels to lowercase before they are stored or transmitted.
According to the rules above, a label from an internationalized
host identifier which was originally created with the UCS
character sequence of <LATIN CAPITAL LETTER A><COMBINING ACUTE
ACCENT><LATIN CAPITAL LETTER B> (U+0041 U+0301 U+0042) would be
normalized and lowercased to <LATIN SMALL LETTER A WITH
ACUTE><LATIN SMALL LETTER B> (U+00E1 U+0062). The normalized,
lowercase form would be used as the canonical UCS character
representation of that label when it was encoded for storage and
transmission purposes, and would be the form which was used for
comparison operations on any resolvers, caches and servers.
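The example above can be reproduced with a short sketch. Python's
unicodedata module is used here as a stand-in for <nameprep>:
NFKC normalization followed by lower() yields the same result for
this particular sequence, but it does not perform the prohibited
character checks, and the function name normalize_host_label is
hypothetical.

```python
import unicodedata


def normalize_host_label(label: str) -> str:
    """Approximate the normalize-and-lowercase step for one label."""
    # Compose the base letter with its combining mark, then lowercase.
    return unicodedata.normalize("NFKC", label).lower()


original = "\u0041\u0301\u0042"   # A, COMBINING ACUTE ACCENT, B
canonical = normalize_host_label(original)
print([f"U+{ord(ch):04X}" for ch in canonical])  # ['U+00E1', 'U+0062']
```

Applying the function a second time leaves the label unchanged,
which is why servers and caches do not need to repeat the work
during comparison.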
Internationalized host identifiers which are received from the
network can contain labels which have been encoded as STD13 octet
sequences, ACE or UTF-8. In all of these cases, the comparison
rules defined in section 4.1.3 MUST be applied.
4.3. STD13 Domain Names
STD13 allows any eight-bit code values to be used in domain name
labels. However, STD13 host identifiers (as described in section
4.4 of this specification) are the most common form of STD13
domain names, and have much tighter restrictions.
There are common uses of STD13 domain names which do not comply
with the STD13 host identifier subset, however. One common example
of this is SRV identifiers, which use an underscore character
(U+005F) as part of their label syntax. Another common example is
found when email addresses are provided in SOA and RP resource
records, and where the left-hand side of the email address is
stored as an STD13 domain name label which does not represent a
host identifier. Furthermore, email addresses often contain extra
characters which are not legal in STD13 host identifiers, such as
a full-stop character (U+002E). For example, "joe.admin" could be
stored as an STD13 domain name label in the fully-qualified domain
name of "joe.admin.example.com.", which would represent the email
address of "joe.admin@example.com" when that domain name was
extracted from the SOA or RP resource record and processed.
Implementations of this specification MUST allow STD13 domain
names to be created and stored, using the following rules:
a. Labels MUST be restricted to the code values of U+0000
through U+00FF. Restrictions on character content MUST NOT
be applied (note that if this domain name will be used as
part of an STD13 host identifier, the rules specified in
section 4.4 MUST be used instead).
b. Labels MUST NOT be normalized or lowercased before they are
stored or encoded for transfer.
c. Systems MUST allow STD13 domain names to be specified as
exact sequences of eight-bit octet values, and MUST NOT
treat these sequences as canonical UCS characters which are
normalized or lowercased. STD13 defines an escaping
mechanism whereby the decimal value of the octet is
prefaced with a reverse-solidus (such as "\193"), which is
suggested for this usage.
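The escaping mechanism mentioned in rule c can be sketched as a
small parser which turns the textual form of a label into its
exact octet values. The function name parse_escaped_label is
hypothetical, and the sketch assumes that the input text only
contains code values of U+00FF or less.

```python
def parse_escaped_label(text: str) -> bytes:
    """Convert an escaped label into its exact octet sequence."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ord(ch))   # ordinary character, taken as-is
            i += 1
            continue
        digits = text[i + 1:i + 4]
        if len(digits) == 3 and all(d in "0123456789" for d in digits):
            # \DDD form: three decimal digits give the octet value.
            value = int(digits)
            if value > 255:
                raise ValueError("escaped value out of range: " + digits)
            out.append(value)
            i += 4
        elif i + 1 < len(text):
            # \X form: the next character is taken literally.
            out.append(ord(text[i + 1]))
            i += 2
        else:
            raise ValueError("dangling reverse-solidus")
    return bytes(out)
```

The result is a binary value which is never normalized or
lowercased, so "\193" yields the single octet 0xC1 regardless of
how that code would be interpreted as a UCS character.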
STD13 domain names which are received from the network can contain
labels which have been encoded as STD13 octet sequences, ACE or
UTF-8. In all of these cases, the comparison rules defined in
section 4.1.3 MUST be applied. Note that some of these sequences
can contain octet code values which have not been normalized or
lowercased by the originating system, since these values can be
used to specify binary domain names.
4.4. STD13 Host Identifiers
This document does not deprecate, replace or modify the host name
rules defined by RFC952, STD3 or STD13 as they apply to legacy
host identifiers. However, there are several issues which affect
the usage of these domain names and their labels in this system.
The characters which are currently defined as valid in
STD13 host identifiers are the uppercase and lowercase letters,
numbers and hyphen character from US-ASCII. No other characters
are allowed to be used. Furthermore, the current rules also
prohibit the use of the hyphen character in the first or last
character position of a host identifier label.
Implementations of this specification MUST allow STD13 host
identifiers to be created and stored, using the following rules:
a. Labels MUST be restricted to the code values of U+002D,
U+0030 through U+0039, U+0041 through U+005A, and U+0061
through U+007A.
b. Labels MUST NOT contain the code value of U+002D in either
the first or last character position of the label.
c. The alphabetic characters MUST be converted to lowercase
before they are stored or transmitted. STD13 host
identifiers are always compared in a case-neutral form.
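The three rules above can be sketched as a single function which
validates a label and returns the form to be stored or
transmitted. The function name make_std13_host_label is
hypothetical, and length checks are omitted for brevity.

```python
import string

# US-ASCII letters, digits and the hyphen character.
ALLOWED = set(string.ascii_letters + string.digits + "-")


def make_std13_host_label(label: str) -> str:
    """Validate an STD13 host identifier label and lowercase it."""
    if not label or any(ch not in ALLOWED for ch in label):
        raise ValueError("character not permitted in host label")
    if label[0] == "-" or label[-1] == "-":
        raise ValueError("hyphen in first or last position")
    # Only US-ASCII letters can be present here, so lower() is safe.
    return label.lower()
```

A label such as "_ldap" is rejected by this function, although it
remains valid as an STD13 domain name label under section 4.3.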
STD13 host identifiers which are received from the network can
contain labels which have been encoded as STD13 octet sequences
or UTF-8. In both cases, the comparison rules defined in section
4.1.3 MUST be applied.
5. Transfer Encodings and Label Types
As was discussed in section 4.1.2, internationalized domain names
and labels are required to be encoded as either eight-bit or
seven-bit data whenever they are transmitted as protocol or
application data.
The particular output encoding format which will be used for any
given label will be primarily determined by the capabilities of
the participating end-point systems. If the application or
protocol which is relaying the domain name labels supports
internationalized domain names directly then UTF-8 encoded labels
can be used, but if the protocol or application is only capable of
supporting STD13 host identifiers as domain name data, then the
STD13 octet and/or ACE encoded labels will have to be used.
With DNS messages in particular, the "data type" is the label
encapsulation in use. Although STD13 legacy labels allow for the
use of eight-bit codes, multiple encodings for the same basic
character data result in interpretation problems without some form
of ancillary tagging service. For this reason, each encoding is
represented differently by this specification. When the STD13
legacy label contains STD13 octet sequences then no tagging is
provided, but if the STD13 legacy label contains ACE encoded data
then the encoded sequence is tagged with an ACE identifier (a
character prefix which does not normally appear in labels). When
UTF-8 domain names are provided, an EDNS/UTF-8 extended label is
used to encapsulate the internationalized domain name.
Furthermore, the encoding which is used for any label in the
message will also determine the label type which is used to
encapsulate and transfer the entire domain name. If any label
from a domain name is encoded as UTF-8, then all of the labels from
that domain name are required to be encapsulated for transfer in
EDNS/UTF-8 extended labels. Conversely, if a domain name contains
ACE or STD13 octet encoded labels, then all of the labels from
that domain name are required to be encapsulated for transfer
using the STD13 legacy label format.
Note that other legacy applications and protocols will most likely
be required to provide extended encodings or negotiation features
before they can exchange internationalized domain names directly.
However, new applications and protocols which are subsequently
written to comply with BCP18 and this specification should not
require any such effort, as they should be capable of transferring
UTF-8 domain names from the beginning.
5.1. The EDNS/UTF-8 Label Type
Any internationalized domain name label which has been encoded as
UTF-8 for transmission in a DNS message MUST be encapsulated as an
EDNS/UTF-8 label.
The EDNS/UTF-8 extended label is an instance of EDNS extended
label types (as defined by RFC2671). Extended labels are indicated
by the leading bit pattern of 0b01 in the label type field (the
first two bits from the "label length" octet of the STD13 legacy
label type), with the remaining six bits of this octet indicating
the extended label type in use. The EDNS/UTF-8 label type uses the
binary value of 0b000011 for this indication (note that IANA may
change this assignment).
EDNS/UTF-8 labels contain two subordinate units of data. The first
octet contains a length indicator which works exactly the same as
the length octet as used by STD13 legacy labels: if the first two
bits of this octet are 0b00 then the rest of that octet provides
the length of the label data field, but if the first two bits of
this octet are 0b11 then the label is a pointer to some other
label, and the remainder of the length octet provides an off-set
which points to the length octet of the referenced label, as per
the rules provided in section 4.1.4 of RFC 1035 (STD13, part 2).
The structure of the EDNS/UTF-8 extended label is illustrated by
the following figure.
                     1 1 1 1 1 1 1 1 1 1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0 1|0 0 0 0 1 1|    length     |  label data  ///  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
0b01 - The extended label identifier.
0b000011 - The EDNS/UTF-8 extended label type identifier.
Length - The number of octets in the label data, or the
off-set to the length octet of another EDNS/UTF-8 label.
Label data - The label data, encoded as UTF-8 octets.
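As a minimal sketch of the layout shown above, the following Python
fragment builds a single EDNS/UTF-8 label. The function and constant
names are hypothetical, the label type value is the provisional
0b000011 used by this document, and pointer (0b11) length octets are
not produced by this sketch.

```python
EXTENDED_LABEL = 0b01 << 6      # leading bit pattern 0b01
UTF8_LABEL_TYPE = 0b000011      # provisional value; IANA may change it
TYPE_OCTET = EXTENDED_LABEL | UTF8_LABEL_TYPE   # 0x43


def encode_utf8_label(label: str) -> bytes:
    """Return the type octet, the length octet and the UTF-8 label data."""
    data = label.encode("utf-8")
    # The first two bits of the length octet are 0b00 for literal
    # data, which leaves six bits for the length itself.
    if len(data) > 0b00111111:
        raise ValueError("label data does not fit in a six-bit length")
    return bytes([TYPE_OCTET, len(data)]) + data
```

Under these assumptions, a label consisting of "m" followed by
U+00E9 is emitted as the octets 0x43 0x03 0x6D 0xC3 0xA9, and a
zero-length label is emitted as 0x43 0x00.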
The following example shows the domain name of me.com, where the
"e" in "me" is the UCS character <LATIN SMALL LETTER E WITH ACUTE>
(U+00E9), which has the UTF-8 encoded octet sequence of 0xC3A9.
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
20 | 0  1  0  0  0  0  1  1|         0x03          |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
22 |       0x6D (m)        |       0xC3 (e')       |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
24 |       0xA9 (e')       | 0  1  0  0  0  0  1  1|
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
26 |         0x03          |       0x63 (c)        |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
28 |       0x6F (o)        |       0x6D (m)        |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
30 | 0  1  0  0  0  0  1  1|         0x00          |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Octet 20 identifies the EDNS/UTF-8 extended label type, while
octet 21 indicates that the label is three octets long. Octet 22
contains the UTF-8 value for lowercase "m", while octets 23 and 24
contain the UTF-8 value for the UCS character <LATIN SMALL LETTER
E WITH ACUTE> (encoded as 0xC3A9).
Similarly, octet 25 identifies another EDNS/UTF-8 extended label
type and octet 26 indicates that the label is three octets long,
while octets 27 through 29 contain the UTF-8 values for the
lowercase alphabetic sequence of "com".
Finally, octet 30 identifies another EDNS/UTF-8 extended label
type, while octet 31 indicates that the label is zero octets in
length, thereby signifying the root zone (the end of the queried
domain name).
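The walk through the example octets can be reproduced with a short
Python sketch. The function name is hypothetical; the sketch assumes
the provisional type octet of 0x43 and does not follow pointers.

```python
from typing import List

TYPE_OCTET = 0b01000011   # 0b01 flag plus provisional type 0b000011


def decode_utf8_name(message: bytes, offset: int = 0) -> List[str]:
    """Read EDNS/UTF-8 labels from `offset` up to the zero-length label."""
    labels = []
    pos = offset
    while True:
        if message[pos] != TYPE_OCTET:
            raise ValueError("octet %d is not an EDNS/UTF-8 label type" % pos)
        length = message[pos + 1]
        if length & 0b11000000:
            raise NotImplementedError("pointers are not handled in this sketch")
        pos += 2
        if length == 0:
            return labels          # zero-length label: end of the name
        labels.append(message[pos:pos + length].decode("utf-8"))
        pos += length


# Twenty filler octets, followed by octets 20 through 31 of the example.
EXAMPLE = bytes(20) + bytes.fromhex("43 03 6D C3 A9 43 03 63 6F 6D 43 00")
```

Calling decode_utf8_name(EXAMPLE, 20) returns the two labels "m"
plus U+00E9 and "com".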
Note that the use of the EDNS/UTF-8 extended label type serves
multiple purposes. On the one hand, it provides a method of
signaling the resolver's capabilities to the server, so that the
server can determine which format it needs to use when returning
answers, referrals or errors. Moreover, using an encapsulation
format which is not backwards compatible prevents certain
ambiguity problems which can result from overloading the STD13
legacy label with multiple encodings. These problems are seen in
certain situations with STD13 octet encoding and ACE, where a
server cannot adequately determine which encoding a resolver
desires. By using a separate extended label type for UTF-8, these
kinds of ambiguities are avoided.
There are additional benefits which come from using EDNS extended
label types, which are best expressed as "future possibilities".
Once the EDNS extended label mechanisms are widely deployed, it
becomes feasible to specify additional encoding mechanisms as soon
as the Internet community deems it desirable. In this regard,
defining alternative encodings is much easier the second time.
5.2. The STD13 Legacy Label Type
Any internationalized domain name label which has been encoded as
ACE or STD13 octet sequences for transmission in a DNS message
MUST be encapsulated within an STD13 legacy label.
This document does not deprecate, replace or extend the STD13
octet encoding or label encapsulation rules defined by STD13.
However, this document does provide some guidance on the creation
and interpretation of ACE encoded labels when they are stored in
legacy labels, which is necessary in order for recipient systems
to properly detect and decode the label contents.
Note that STD13 octet sequences and ACE data MAY both be provided
in the same domain name. As such, each STD13 legacy label from a
DNS message must be examined and processed independently.
5.2.1. ACE encoded labels
ACE encoded labels always begin with the character sequence of
<TBD> (this document uses "zz--" as a placeholder sequence until a
formal assignment is made). Any label which contains ACE encoded
data MUST begin with this character sequence prefix. Similarly,
any label which begins with this character sequence MUST be
recognized and processed as an ACE encoded label, according to the
rules defined in this specification.
Encoding and encapsulating a label as ACE data is a three-part
process, as follows:
a. Encode the canonical UCS character data from the
internationalized domain name label into ACE using the
procedure defined in <ACE-Z>.
b. Preface the encoded output with the "zz--" prefix sequence,
thereby indicating that this label contains ACE encoded UCS
character data.
c. Determine the length of the encoded data, including the
prefix, and store this value in the STD13 legacy label's
length octet.
Decoding an ACE label is the opposite of that process.
Note that whenever the ACE algorithm encounters a seven-bit
character code in the input, it is passed through unmodified to
the encoded output. If a label only contains seven-bit character
codes, the label MUST NOT be encoded as ACE, and MUST be encoded
as either STD13 octet sequences or UTF-8. Forcing a seven-bit
label to be encoded as ACE serves no benefit, incurs additional
processing on the end-point systems, and can also expose certain
security risks. Any system which is capable of generating and
deciphering ACE encoded labels is required to treat such sequences
as hostile, and MUST dispose of them immediately without any
further processing; systems are forbidden to even
return these labels in DNS error messages.
Similarly, ACE MUST NOT be used to encode any zero-length labels
(including but not specifically limited to the root domain), since
the presence of prefix characters in these labels can invalidate
their protocol-specific interpretations.
When an STD13 legacy label is received which has "zz--" in the
first four character positions, the label MUST be treated as an
ACE-encoded internationalized domain name, and MUST be decoded to
its canonical UCS character values for further processing.
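The encoding and decoding steps described in this section can be
sketched as follows. This is not the <ACE-Z> procedure, which is not
defined here: Python's built-in "punycode" codec is used purely as a
stand-in, the function names are hypothetical, and "zz--" is the
placeholder prefix used by this document.

```python
ACE_PREFIX = b"zz--"   # placeholder prefix until a formal assignment


def ace_encode_label(label: str) -> bytes:
    """Return a STD13 legacy label (length octet plus ACE data)."""
    if label == "":
        raise ValueError("zero-length labels MUST NOT be ACE encoded")
    if label.isascii():
        raise ValueError("seven-bit labels MUST NOT be ACE encoded")
    # Step a (stand-in codec) and step b (prefix).
    data = ACE_PREFIX + label.encode("punycode")
    if len(data) > 63:
        raise ValueError("encoded label exceeds 63 octets")
    # Step c: the length octet counts the prefix and the encoded data.
    return bytes([len(data)]) + data


def ace_decode_label(wire: bytes) -> str:
    """Reverse ace_encode_label, rejecting hostile seven-bit payloads."""
    length, data = wire[0], wire[1:]
    if length != len(data) or not data.startswith(ACE_PREFIX):
        raise ValueError("not a well-formed ACE label")
    label = data[len(ACE_PREFIX):].decode("punycode")
    if label == "" or label.isascii():
        raise ValueError("ACE label carries only seven-bit data")
    return label
```

A label encoded by ace_encode_label is returned unchanged by
ace_decode_label, while a prefixed label whose payload decodes to
seven-bit characters only is rejected rather than processed.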
Note that STD13 legacy labels MUST be verified before the ACE
encoded data is extracted (as per the rules defined in STD13 which
govern the STD13 legacy label type), but systems which are
compliant with this specification MUST perform all subsequent
comparison, caching, or storage operations against the canonical
UCS characters, and MUST NOT use the ACE encoded label sequence
for any of these operations.
Note that legacy systems which are not compliant with this
specification will treat ACE encoded labels as any other STD13
legacy label.
5.2.2. STD13 octet encoded labels
Any STD13 legacy labels which do not begin with the ACE prefix
MUST be treated as STD13 octet encoding sequences. The rules for
this process are defined by STD13's default label encapsulation
services, although this document also provides some clarifications
on the use of this encoding with internationalized domain names
and labels.
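Taken together, the tagging rules allow a receiver to classify each
label from its first octet and, for legacy labels, from its leading
characters. The sketch below is illustrative only: the function name
and the returned strings are hypothetical, and the prefix comparison
is assumed to be an exact match against the "zz--" placeholder.

```python
def classify_label(wire: bytes) -> str:
    """Classify the label that starts at the first octet of `wire`."""
    first = wire[0]
    top_bits = first >> 6
    if top_bits == 0b01:
        # Extended label: the low six bits carry the label type.
        if first & 0b00111111 == 0b000011:
            return "EDNS/UTF-8"
        return "unknown extended label"
    if top_bits == 0b11:
        return "pointer"
    if top_bits == 0b00:
        data = wire[1:1 + first]
        return "ACE" if data.startswith(b"zz--") else "STD13 octet"
    return "reserved"
```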
Whenever the STD13 octet sequence is used to encode the labels
from an internationalized domain name, the octet values of the
canonical UCS characters are stored directly in the label. Because
the DNS message is limited to octets, the range of UCS character
codes which are eligible for use with STD13 octet sequences is
limited to U+0000 through U+00FF. If any UCS character codes
outside this range need to be transferred, the internationalized
domain name label will have to be encoded as ACE or UTF-8.
Note that comparison operations for the seven-bit range of
alphabetic character values MUST be performed in a case-neutral
form, although eight-bit code values MUST NOT be normalized or
case-converted as part of a comparison operation. These rules are
required in order to ensure backwards compatibility with the STD13
compliant systems which may be generating these labels as parts of
an STD13 domain name while also supporting the normalization and
case-conversion which may have been applied to the UCS characters
in the storage or transfer encoding systems.
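A minimal sketch of these two rules follows, using hypothetical
function names. Python's "latin-1" codec is used because it maps
U+0000 through U+00FF directly onto the octets 0x00 through 0xFF
and rejects anything outside that range.

```python
def encode_std13_octets(label: str) -> bytes:
    """Store the UCS code values directly; fails beyond U+00FF."""
    return label.encode("latin-1")


def _fold_seven_bit(octets: bytes) -> bytes:
    # Only the US-ASCII letters A-Z are case-converted; eight-bit
    # code values are left exactly as they were received.
    return bytes(b | 0x20 if 0x41 <= b <= 0x5A else b for b in octets)


def std13_labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two STD13 octet labels in a case-neutral form."""
    return _fold_seven_bit(a) == _fold_seven_bit(b)
```

With this sketch, "Example" and "eXAMPLE" compare as equal, whereas
the single octets 0xC9 and 0xE9 do not, even though they are the
uppercase and lowercase forms of the same letter in ISO/IEC 8859-1.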
6. Application Guidelines
As was discussed in section 3.3, there are multiple scenarios in
which an application can make use of internationalized domain
names, ranging from simple lookups of connection identifiers to
abstract encapsulations of unstructured application data. This is
an extremely broad range of uses, which is complicated by the
extreme pervasiveness of applications and protocols that use
domain names for one or more of these purposes.
Furthermore, network applications face a complex array of input
and output operations which will cumulatively affect the ability
of that application to make use of the internationalized domain
name system for various services and functions. These issues are
illustrated by the figure below:
[IDNs] [IDNs]
| ^
| |
+------V------+ +------+------+
| input | | output |
| charset | | charset |
+-----------+-+ +-+-----------+
\ /
+---+-----+---+
| Application |
+---+-----+---+
/ \
+-----------+-+ +-+-----------+
| lookups | | app data <---> [IDNs]
+------+------+ +-------------+
|
+------+------+
| resolver <---> [IDNs]
+-------------+
As can be seen, the ability of an application to completely adopt
internationalized domain names will be determined by many factors,
any one of which could prevent the application from completely
incorporating the restrictions and recommendations prescribed by
this specification.
In order to allow for a flexible adoption schedule, this
specification defines very few mandates that applications must
adopt, but instead focuses on recommendations which applications
should comply with whenever they need to use internationalized
domain names, and also provides recommendations for situations
where the preferred behavior is not feasible. Applications which
are compliant with all of the recommendations provided in this
specification will be able to generate, store, transfer and
resolve internationalized domain names throughout all of their
operations, using UTF-8 as a common encoding for all of these
operations. Meanwhile, applications which are not in complete
compliance with this specification will still be able to make use
of the internationalized domain names in these operations,
although such access may be limited to using backwards-compatible
encodings which require greater amounts of effort to implement and
which provide fewer benefits.
6.1. Input and Output Charsets
If an application is unable to accept, process, store or display
characters from the complete UCS repertoire, that application's
support for internationalized domain names will be somewhat
limited, by definition.
Although this document does not mandate any particular charset or
encoding which all applications must use for all operations,
applications SHOULD use coded character sets or encodings which
can handle characters from a reasonable number of scripts.
In particular, the following areas have specific requirements:
* Input charsets and encodings. Since UTF-8 is used as the
default encoding for internationalized domain names
throughout this specification (and others, such as BCP18),
UTF-8 is also RECOMMENDED for use with input encodings of
internationalized domain names in particular, although this
is not required. Many platforms and development
environments support UTF-8 as a local encoding of the UCS
and it can be reasonably used with many types of input
(such as configuration files), although many systems will
require a specific encoding (such as UCS-2, or ISO/IEC
8859-1) in situations which require memory access or
keyboard input.
Regardless of the input encodings used, implementations
MUST map domain names and labels to their canonical UCS
characters for any normalization and case-conversion work
which is subsequently required by any DNS lookups (see
section 6.3).
* Output choices will likely be limited to a system-preferred
charset or encoding. In general, this document RECOMMENDS
that output systems choose an output charset or encoding
which reflects the data being provided. However,
applications MUST NOT display unknown characters with
generic replacement characters (such as boxes or circles)
if it is known that the original characters are not
available for display with the specified charset, as such
characters will almost certainly trigger failure conditions
in subsequent protocol operations.
In those situations where adequate input or output charsets or
encodings are unavailable, applications MAY use ACE to encode
internationalized domain names for the purpose of ensuring that
the data is provided intact. Since ACE is capable of representing
UCS characters as sequences of seven-bit characters, it is
functionally usable as a last line of defense in almost any
environment, with the caveat that ACE encoding sequences are
extremely cryptic and will likely result in lower levels of
usability and functionality.
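One possible way to honor the output rule is sketched below: the
name is encoded strictly in the output charset, and an ACE form is
substituted only when that fails, so that replacement characters are
never emitted. The function name is hypothetical, and the "punycode"
codec with the "zz--" placeholder prefix stands in for the ACE
algorithm, which this section does not define.

```python
ACE_PREFIX = "zz--"   # placeholder prefix used by this document


def render_name(name: str, charset: str) -> bytes:
    """Encode `name` for output without using replacement characters."""
    try:
        # errors="strict" refuses to substitute "?" or similar marks.
        return name.encode(charset, errors="strict")
    except UnicodeEncodeError:
        pass
    labels = []
    for label in name.split("."):
        if label.isascii():
            labels.append(label)
        else:
            ace = label.encode("punycode").decode("ascii")
            labels.append(ACE_PREFIX + ace)
    return ".".join(labels).encode(charset)
```

With an ISO/IEC 8859-1 output charset the name is written directly,
while a US-ASCII output charset causes only the affected label to be
replaced by its ACE form.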
6.2. Protocol and Application Data
There are several interrelated issues which will determine an
application's ability to provide or accept internationalized
domain names as protocol or application data, although the
principal determining factors for any such usage will generally be
the capabilities of the underlying protocol itself.
If a protocol allows negotiation or tagging services in order to
distinguish between different encodings, that protocol can likely
be extended to support the use of UTF-8 as protocol or application
data through command/response negotiation options or through data-
type tags. Older protocols which do not provide any negotiation
services or which mandate the use of US-ASCII in all data will
likely require the use of ACE encoded domain names as a short-term
measure until the protocol is made compliant with BCP18.
* Protocol data. If the protocol supports UTF-8 encoded
internationalized domain names in commands or responses,
then that encoding SHOULD be used wherever it is allowed.
If UTF-8 is not supported by the protocol, STD13 octet
sequences and/or ACE encoded equivalents of the
internationalized domain name MUST be used.
In some cases, this negotiation can be performed on a per-session
basis. In other cases, it will need to be performed for each
transaction within the session, and in still other cases the
internationalized domain names will have to be tagged whenever
they are provided as protocol or application data.
The DNS protocol is itself an example of a protocol which
requires tagging in order for internationalized domain
names to be exchanged within the existing DNS message (with
these indicators taking the form of ACE encoding prefixes
and EDNS/UTF-8 extended label type codes). Meanwhile, a
protocol such as WHOIS can theoretically support a session-wide
negotiation option that allows the use of
internationalized domain names as protocol and application
data for the duration of that session. Conversely, a
protocol such as SMTP will likely require the use of
session-specific identifiers for some operations, while
other operations may be able to use label tags (similar to
the existing support for domain literals, which are
identified by a pair of surrounding square brackets).
Regardless of the encodings which are used, implementations
MUST map domain names and labels to their canonical UCS
characters for any normalization and case-conversion work
which is subsequently required as part of a DNS lookup (see
section 6.3).
* Structured application data. Structured application data
such as URLs and email addresses MUST be processed
according to the rules which govern those data formats.
Applications MUST NOT perform any conversion or
transliteration which is not explicitly prescribed by the
governing documents, since non-standard usages are likely
to result in misinterpreted data.
* Unstructured application data. Domain names which appear as
unstructured data in application content are beyond the
control of this specification, and are generally subject to
the encoding and formatting desires of the end-users who
created the data. Generally speaking, it is RECOMMENDED
that applications allow users to enter or view documents in
whatever format they prefer, but that any conversion
between multiple source and destination charsets and
encodings use UCS as the translation intermediary, such
that internationalized domain names are properly converted
along with the rest of the application data.
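The recommendation to use UCS as the translation intermediary can
be illustrated with a very small sketch. The function name is
hypothetical; Python strings serve as the UCS form between the two
charsets.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert between charsets by way of UCS characters."""
    text = data.decode(source)     # source charset -> UCS characters
    return text.encode(target)     # UCS characters -> target charset
```

For example, the ISO/IEC 8859-1 octets 0x6D 0xE9 become the UTF-8
octets 0x6D 0xC3 0xA9, so a domain name embedded in the data is
converted along with everything else.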
In some cases, the application will need to probe the resolver
before it can use internationalized domain names as data. For
example, a participating system may need to determine the
internationalized domain name of the local system so that it can
provide this data in a protocol-specific banner message, and in
these cases, the application will have to communicate with the
resolver before this data can be provided.
Due to the usage-specific nature of internationalized domain names
within protocol and application data streams, each development
group will have to analyze the restrictions and capabilities which
affect their specific services independently.
6.3. DNS Lookups and Resolver Calls
One of the most frequent uses for domain names is for lookup
operations, such as for locating the IP addresses associated with
a specified domain name, determining the domain name associated
with a specified IP address, or performing a protocol-specific
lookup operation for a specific resource record (such as the MX or
SOA resource records associated with a specific domain).
Since these lookup operations do not directly affect external
protocols or data, internationalized domain names can be used for
lookup operations at the application's discretion. For example,
applications such as ping and netstat only use domain names for
display purposes, and can therefore make immediate use of
internationalized domain names within their protocol operations.
Similarly, a protocol can be limited to STD13 host identifiers as
protocol identifiers which will require the application to provide
internationalized domain names as ACE encoded sequences, but any
lookup operations which are necessary for the internationalized
domain names can still be performed in their native form. In these
cases, the protocol operations and lookup operations are separate
tasks with separate rules.
Similarly, applications are not required to use internationalized
domain names and internationalized resolver APIs for every lookup.
In some cases, it may be more efficient for an application to only
use internationalized domain names for lookup operations against
connection identifiers, and to use STD13 octet sequences or ACE
encoded legacy lookups for domain names which were obtained as
protocol or application data (this will be especially true in
those cases where the protocol does not yet provide an
internationalized domain name data-type). In those cases where an
application prefers to use the legacy resolution path, the
application MUST use the resolver's legacy APIs. For lookups
against internationalized domain names, the application MUST use
the resolver's internationalized APIs.
Note that this specification does not define a mandatory encoding
which must be used between the applications and the local
resolver. However, resolvers MUST provide at least one encoding
which is capable of supporting the entire UCS repertoire of
character codes, including character codes which are currently
unassigned. Since UTF-8 is the default encoding which is used
throughout this specification, it is also RECOMMENDED for use with
resolver APIs, although this is not required. Resolvers MAY
dictate a local encoding, with the only requirement being support
for the entire range of UCS character codes.
Regardless of the data being provided or the charset or encoding
which is used to provide that data, applications MUST normalize
and case-convert any internationalized host identifiers which they
generate or receive from a lookup operation. This process MUST
use the canonical UCS characters of the domain name according to
the rules specified in <nameprep> for every host identifier which
is sent to or received from a resolver.
If the application knows that the requested data specifically
refers to a host identifier, then the domain name data which is
returned by the resolver MUST be normalized and case-converted,
and the resulting domain name MUST be compared to the original
domain name which was received prior to the normalization and
case-conversion steps. If the processed domain name does not match
the domain name which was received, the domain name MUST be
discarded as malformed.
This step is necessary in order to ensure the integrity and
veracity of internationalized domain names which are processed by
applications, since there are multiple opportunities for errors to
be introduced (such as mistyped entries in the resolver's hosts
database, or malicious data which has been purposefully provided
in a zone), and these errors can result in sensitive data being
directed to the wrong network. Note that the above rule
specifically applies to host identifiers and not to all
internationalized domain names as a whole; applications MUST NOT
arbitrarily normalize and case-convert any and all domain names,
but MUST apply these steps to any and all domain names which are
known to be used as host identifiers.
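The verification step for host identifiers can be sketched as shown
below. The <nameprep> rules are not reproduced here, so the sketch
substitutes Unicode case folding followed by NFKC normalization as a
stand-in; both function names are hypothetical.

```python
import unicodedata


def prepare(name: str) -> str:
    """Stand-in for <nameprep>: case-fold, then normalize (NFKC)."""
    return unicodedata.normalize("NFKC", name.casefold())


def verify_host_identifier(received: str) -> str:
    """Return `received` only if processing leaves it unchanged."""
    if prepare(received) != received:
        # The name was not in its processed form when it arrived,
        # so it is discarded as malformed.
        raise ValueError("malformed host identifier: %r" % received)
    return received
```

A name which arrives with uppercase letters, or with a decomposed
character sequence such as "e" followed by U+0301, fails the
comparison and is discarded.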
As part of the processing rules for DNS lookups, it is expected
that an application can exchange internationalized domain names
with the resolver using a charset or encoding which is capable of
representing the entire UCS character code range. Towards this
objective, applications SHOULD test the capabilities of the
resolver prior to transferring internationalized domain names. In
those situations where the resolver is unable to support this
usage, the application MUST encode the internationalized domain
name as STD13 octet sequences or ACE, and pass the resulting STD13
host identifier to the resolver.
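The capability test and the fallback path might look like the
following sketch. Everything here is hypothetical: the resolver
class, its method names and the returned strings exist only for
illustration, and the "punycode" codec with the "zz--" placeholder
prefix stands in for the ACE algorithm.

```python
class DemoResolver:
    """Hypothetical resolver object used only for this illustration."""

    def __init__(self, wide: bool):
        self.wide = wide

    def test_wide(self) -> bool:
        return self.wide

    def get_wide(self, name: str) -> str:
        return "wide:" + name

    def get_legacy(self, host_id: bytes) -> str:
        return "legacy:" + host_id.decode("ascii")


def to_legacy(name: str) -> bytes:
    """Encode each label as seven-bit octets or, failing that, as ACE."""
    labels = []
    for label in name.split("."):
        try:
            labels.append(label.encode("ascii"))
        except UnicodeEncodeError:
            labels.append(b"zz--" + label.encode("punycode"))
    return b".".join(labels)


def lookup(resolver: DemoResolver, name: str) -> str:
    # Test the resolver before handing it an internationalized name.
    if resolver.test_wide():
        return resolver.get_wide(name)
    return resolver.get_legacy(to_legacy(name))
```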
7. Resolver Guidelines
Resolvers play a crucial role in the use of internationalized
domain names, in that they provide the internationalized namespace
which applications work with. As part of this service, resolvers
provide encapsulation services for the internationalized domain
names which are exchanged with the applications, resolve queries
in the internationalized namespace on behalf of the applications,
and provide lookup matching for entries which are stored in a
local hosts database. Note that resolvers which cache answer data
for subsequent operations are also governed by the caching
restrictions provided in section 9.
7.1. Resolver APIs
Stub resolvers which communicate directly with applications that
are compliant with this specification are strongly encouraged to
provide a separate set of APIs for those applications to use
whenever internationalized domain names need to be provided in
queries or response messages.
The use of an internationalized API will generally facilitate
smoother operations for the applications, in that it will allow
the application to determine the capabilities of the resolver, to
obtain the internationalized domain name of the local system, and
to process queries for internationalized domain names as special
data types.
Furthermore, the use of internationalized versus legacy APIs
provides a way for resolvers to separate internationalized and
legacy application query paths, such that the legacy APIs only
result in STD13 legacy labels, while the internationalized APIs
generate and trigger EDNS/UTF-8 extended labels. The output
formatting of the DNS messages is controlled by tight
restrictions, and the use of alternative APIs will likely result
in simpler resolver implementations.
For example, it is suggested that applications use the
internationalized APIs for all of the DNS lookups they generate,
even if the domain name only contains seven-bit characters. This
is required in case the queried domain name only exists with a
CNAME or PTR resource record which references an internationalized
domain name, and the server has to know which encoding to use for
that query. If the client had not used the internationalized API
for the original lookup of the domain name, the resolver may have
chosen the wrong label type, and thus the response data would only
be returned as ACE encoded data.
Conversely, older applications which generate malformed eight-bit
queries through the legacy APIs will result in those queries being
properly rejected by the DNS servers, preventing undue problems
with these applications from occurring. For example, an older
application may process an internationalized domain name through
the system-default charset or encoding (such as MacRoman), which
would result in the domain name being malformed when the
application tried to do something important with that domain name
(such as send an email message over SMTP). The use of multiple
APIs causes these malformed applications to break, and the invalid
domain names are kept out of the application protocol space.
Internationalized APIs are optional to the extent that an
application MAY use an embedded resolver which is known to be
capable of generating and processing internationalized domain
names through the existing function calls. However, the use of
separate APIs for internationalized domain names is encouraged.
Although this document does not mandate any specific APIs, the
following functions SHOULD be provided for in some form:
* Test Wide. Applications MUST be able to test the resolver
for compliance with this specification. In those cases
where this function is performed by some other function
(such as one of the following), the capabilities of the
resolver MUST be detectable even if the requested operation
fails. For example, if an application issues a call for the
internationalized domain name of the local system, the
capability of the resolver to handle internationalized
domain names MUST be uniquely represented even if the local
host name cannot be determined.
* Get Wide X-By-Y. Applications SHOULD be able to specify any
resource record associated with any internationalized
domain name as part of a lookup operation. Whether this
service is provided as a series of lookup-specific APIs or
as a general purpose API is up to the resolver.
* Get Wide Local Name. Applications which utilize
internationalized domain names as data will need to be able
to determine the internationalized form of their local
system name for some operations (such as a protocol-
specific welcome banner). When this function is called, the
resulting data MUST be provided as the canonical UCS
character code values, or their equivalent as represented
by a locally mandated charset or encoding.
Note that an ACE equivalent of the system name SHOULD be
returned when the relevant legacy API is queried. In those
cases where the legacy and internationalized domain names
both contain seven-bit character codes (possibly because
the host name is only available in US-ASCII, or because the
host name was assigned as ACE by an external configuration
service), the internationalized host name MUST still be
accessible through the internationalized function.
Note that this specification does not mandate a charset or
encoding for use by the resolver APIs. However, wherever an
internationalized API is presented, the resolver MUST utilize a
charset or encoding which supports the entire UCS repertoire of
character codes, including character codes which are currently
unassigned. Since UTF-8 is the default charset for most of the
operations specified in this document, it is also RECOMMENDED for
this service, but is not required.
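The three functions listed above could surface in a resolver
library roughly as follows. This is only a sketch in Python: the
names WideResult, SketchResolver, test_wide, get_wide and
get_wide_local_name are hypothetical and are not defined by this
document. It assumes names are passed as native Unicode strings,
which covers the full UCS repertoire, and it shows the resolver's
capability remaining detectable when the local name is unknown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class WideResult:
    """Outcome of an internationalized resolver call (hypothetical type)."""
    idn_capable: bool             # reported even when the operation fails
    value: Optional[str] = None   # the name, as a Unicode string
    error: Optional[str] = None   # set when the operation failed


@dataclass
class SketchResolver:
    """Toy in-memory resolver exposing the three suggested functions."""
    local_name: Optional[str] = None
    records: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)

    def test_wide(self) -> bool:
        """Test Wide: report compliance with the specification."""
        return True

    def get_wide_local_name(self) -> WideResult:
        """Get Wide Local Name: capability stays visible on failure."""
        if self.local_name is None:
            return WideResult(idn_capable=True, error="local name unknown")
        return WideResult(idn_capable=True, value=self.local_name)

    def get_wide(self, name: str, rrtype: str) -> List[str]:
        """Get Wide X-By-Y: look up any record type for any name."""
        return self.records.get((name, rrtype), [])
```

A general purpose get_wide is used here instead of a series of
lookup-specific calls; the text above leaves that choice to the
resolver.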
7.2. Query Processing Services
Resolvers which are compliant with the recommendations provided in
this specification will provide two query paths, one of which
supports STD13 domain names and another which supports
internationalized domain names. Technically, there is no
requirement for two processing paths, although these paths will
likely exist as conceptual paths even if they are not represented
or implemented uniquely in all resolvers.
The legacy processing path is defined by STD13. This document does
not update, modify or extend the rules that resolvers operate
under when an STD13 compliant domain name is received by a legacy
application through any legacy APIs which may exist. However, when
an internationalized domain name is received from an
internationalized application through any internationalized APIs,
the processing rules defined in this section MUST be followed.
Note that these rules apply to all resolvers, whether they are
stub resolvers, forwarders or caching servers.
Generally speaking, the internationalized domain name resolution
process has two major components: processing internationalized
domain names as queries, and performing fall-back processing if an
EDNS/UTF-8 query is rejected by an authoritative server.
7.2.1. Internationalized queries
Queries for internationalized domain names which are received
through internationalized APIs can be expected to have originated
at an application which is capable of accepting and processing
internationalized domain names in the response messages.
Resolvers MUST encode the labels from the queried domain name as
UTF-8 and encapsulate the resulting encoded labels into EDNS/UTF-8
extended labels for transfer within DNS messages, per the
instructions provided in section 5.1.
Any and all responses to these queries will also be encoded as
UTF-8 and encapsulated in EDNS/UTF-8 extended labels. Resolvers
MUST decode the provided response data, convert the labels to
their canonical UCS character codes, and return the requested data
to the calling application.
The resolver MUST NOT normalize or case-convert internationalized
domain names which may be received in queries or response
messages. Since the queries have originated from applications
which have indicated that they are compliant with this
specification (via the API) while the responses will have
originated from caches or servers which indicate that they are
also compliant (via the EDNS/UTF-8 extended labels), those systems
are assumed to have normalized and case-converted the domain names
before they were generated or stored. Also note that applications
will validate the host identifiers that they receive in response
messages, so an additional check is expected to be performed on
the answer data by those systems.
7.2.2. Fall-back processing
If a queried server is unable to process EDNS/UTF-8 extended
labels, then it is required by STD13 to generate an error
signifying the problem. Resolvers MUST interpret these errors,
decode the UTF-8 queried domain name, re-encode it as STD13 octets
and/or ACE per the instructions provided in section 5.2, and then
reissue the query as an STD13 legacy label sequence.
The legacy DNS error responses which will trigger this series of
events are FORMERR and NOTIMPL. Any other errors indicate that the
EDNS/UTF-8 extended label was successfully processed but that the
query was not matched, and those errors MUST be returned to the
application. If the fallback processing results in any error
responses whatsoever, then the resolver MUST return those errors
to the calling application.
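The error handling described above can be sketched in Python as
follows. The send and to_legacy callables, the "edns-utf8" and
"legacy" label type strings, and the resolve_idn function are
assumptions made for illustration; only the RCODE numbers come from
STD13. The conversion of section 5.2 is not reproduced here and is
left to the caller-supplied to_legacy function.

```python
from typing import Callable, List, Tuple

# RCODE values defined by STD13 (RFC 1035, section 4.1.1).
NOERROR, FORMERR, NXDOMAIN, NOTIMPL = 0, 1, 3, 4
FALLBACK_RCODES = frozenset({FORMERR, NOTIMPL})

# Hypothetical transport hook: (name, label_type) -> (rcode, answers).
Send = Callable[[str, str], Tuple[int, List[str]]]


def resolve_idn(name: str, send: Send,
                to_legacy: Callable[[str], str]) -> Tuple[int, List[str], bool]:
    """Query with EDNS/UTF-8 first; fall back once on FORMERR or NOTIMPL.

    Returns (rcode, answers, fell_back). Whatever the fall-back query
    yields, including an error, is handed back to the caller unchanged.
    """
    rcode, answers = send(name, "edns-utf8")
    if rcode not in FALLBACK_RCODES:
        # The extended label was understood; success or error is final.
        return rcode, answers, False
    rcode, answers = send(to_legacy(name), "legacy")
    return rcode, answers, True
```

Passing the transport and the conversion in as parameters keeps the
sketch independent of any particular DNS library.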
Any servers which subsequently receive the fall-back queries and
which are compliant with this specification will process the
queries as internationalized domain names, and will return the
answer data as STD13 octet sequences or ACE encoded data, using
the STD13 legacy label.
Generally speaking, fall-back processing serves two purposes:
* Answering the initial query. If a UTF-8 domain name cannot
be resolved because a server in the delegation path does
not understand the EDNS/UTF-8 label type, the resolver can
reissue the query as an ACE encoded legacy label type so
that the query proceeds past the problematic server.
* Seeding the resolver's cache. As a result of the above, the
resolver will learn about the authoritative name servers
for the target zone, and this information can be used for
any subsequent queries for domain names within the
specified zone (for as long as the data is cached, anyway).
As such, any subsequent EDNS/UTF-8 queries which are issued
for the portion of the namespace served by that zone will
be sent directly to one of those authoritative servers
where they can be answered directly. In this regard,
subsequent lookups do not require fall-back processing if
they are received during the cache window.
Regardless of whether or not fall-back processing has been
performed, if the calling application issued the original query as
an internationalized domain name, then the resolver MUST respond
to the query in that form as well. This means that the resolver
MUST convert any STD13 octet sequences or ACE encoded labels into
their canonical UCS characters, convert the answer data into the
resolver's native charset or encoding, and return the data to the
calling process. The resolver MUST NOT perform any normalization
or case-conversion during this process, as such an action can
corrupt domain names which are not used for host identifiers.
If the original query was received through the resolver's legacy
APIs, then the query MUST be generated, and the response returned,
in the legacy format; the resolver MUST NOT convert either one to
an internationalized domain name while passing it through.
Once fall-back processing occurs, the process MUST NOT be repeated
for any additional queries in the current lookup operation. Any
other queries from the current lookup operation MUST NOT be sent
as EDNS/UTF-8 extended labels, since multiple fall-back operations
can result in time-outs on the client systems.
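One way to enforce this rule is to keep a small piece of state for
each lookup operation, as in the Python sketch below. The
LookupState class, its method names and the label type strings are
hypothetical; the RCODE numbers are those of STD13.

```python
FORMERR, NOTIMPL = 1, 4  # RCODE values from STD13 (RFC 1035)


class LookupState:
    """Tracks whether one lookup operation has already fallen back."""

    def __init__(self) -> None:
        self.fell_back = False

    def label_type(self) -> str:
        """Label type to use for the next query in this lookup."""
        return "legacy" if self.fell_back else "edns-utf8"

    def note_response(self, rcode: int) -> bool:
        """Record a response code.

        Returns True only for the first FORMERR or NOTIMPL, meaning
        the query should be reissued once with legacy labels.
        """
        if not self.fell_back and rcode in (FORMERR, NOTIMPL):
            self.fell_back = True
            return True
        return False
```

After the first fall-back, label_type() keeps returning "legacy"
for the rest of the lookup, so a second fall-back cannot occur.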
Because the fall-back process results in two lookups being issued
against the rejecting zone, eliminating the fall-back processing
as soon as possible will be an operational requirement for many
organizations. Any caches or forwarders which are used by stub
resolvers within an end-user network are practically required to
be able to process the EDNS/UTF-8 queries, since those servers
will receive every query which is issued by the stub resolvers.
While this isn't a technical requirement (fall-back processing
will get around the problematic servers), it will likely prove to
be a consideration for network operators looking to support
internationalized domain names on their local networks.
This document also strongly encourages the root and TLD servers to
be upgraded as soon as possible (even if they do not intend to
directly provide UTF-8 domain name delegations), in order to allow
those servers to read and process the EDNS/UTF-8 extended labels,
thereby reducing the number of fall-back queries which are sent to
those servers.
7.3. The Hosts Database
Generally speaking, there are two areas of consideration for stub
resolvers that provide local hosts databases for name resolution
services. These are the input requirements for internationalized
domain names which will be added to the hosts database, and the
requirements which govern how queries will be compared to the
entries in the hosts database.
Note that resolvers are not required to implement a hosts database
or local lookup services (STD3 says "a host MAY also implement a
host name translation mechanism that searches a local Internet
host table"). However, wherever a hosts database is provided with
an internationalized resolver, compliance with the rules specified
in this section is required.
If a stub resolver offers the capability to compare
internationalized domain names against a local hosts database,
that database MUST be compatible with the internationalized domain
name rules specified in section 4 of this document.
In particular, the resolver SHOULD allow internationalized domain
names with any code values to be stored, even if the canonical UCS
characters for those values are undefined or are illegal for use
with internationalized host identifiers (this is required to
support domain names which are not host identifiers). In those
cases where an internationalized domain name specifies an exact
sequence of octets for binary comparison, the hosts database MUST
provide a mechanism for tagging the eight-bit characters so that
they are not interpreted, processed or compared as the canonical
UCS character equivalents of those codes.
However, entries which explicitly provide host identifiers MUST be
normalized and case-converted prior to being stored. In order to
satisfy both of these requirements, it is RECOMMENDED that hosts
databases store internationalized host identifiers as untagged
data, but that they also provide some sort of tagging service for
character code values which are to be returned as-is. STD13
defines an escaping mechanism whereby the decimal value of the
octet is prefaced with a reverse-solidus (such as "\193"), which
is suggested for this usage.
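The reverse-solidus escape suggested above can be sketched in
Python as follows. The function names tag_octets and untag_octets
are hypothetical. As an assumption of this sketch, only eight-bit
octets and the reverse-solidus itself are escaped, and the decoder
accepts only the three-digit decimal form.

```python
import re

_ESCAPE = re.compile(r"\\([0-9]{3})")


def tag_octets(raw: bytes) -> str:
    """Render octets as text, escaping eight-bit values as \\DDD."""
    parts = []
    for octet in raw:
        if octet >= 0x80 or octet == 0x5C:   # 0x5C is the reverse-solidus
            parts.append("\\%03d" % octet)
        else:
            parts.append(chr(octet))
    return "".join(parts)


def untag_octets(text: str) -> bytes:
    """Reverse tag_octets: turn each \\DDD escape back into one octet."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        out += text[pos:match.start()].encode("ascii")
        value = int(match.group(1))
        if value > 255:
            raise ValueError("escape out of range: " + match.group(0))
        out.append(value)
        pos = match.end()
    out += text[pos:].encode("ascii")
    return bytes(out)
```

For instance, the single octet 0xC1 is stored as the four
characters "\193" and is restored to one octet on the way out.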
The storage format of the hosts database MAY use any charset or
encoding the resolver deems most suitable for that platform, as
long as the rules and restrictions provided above are followed.
Since UTF-8 is used as the default encoding throughout this
specification, it is RECOMMENDED as the default encoding for hosts
databases as well, although this is not required.
Not all of the applications which use a resolver are likely to be
compliant with this specification, so resolvers MUST ensure that
they are able to interpret and process any queries from the legacy
APIs which provide the ACE equivalent of an internationalized
domain name that is stored in the hosts database. When such a
query arrives, the domain name MUST be converted to the canonical
UCS character codes represented by the ACE encoded sequence and
compared to entries in the hosts database in that form (tagged
octets excluded). Any internationalized domain names which are
required to be returned through the legacy APIs MUST be converted
to STD13 octet sequences and/or ACE before they are returned.
8. Server Guidelines
When a zone administrator desires to provide internationalized
domain names in a zone, they are presented with two options: they
can add the STD13 octets or ACE encoded internationalized domain
names to an existing zone, or they can use internationalized zone
databases directly. Both of these usage scenarios have their own
benefits and restrictions.
Using STD13 octet sequences and ACE with legacy servers allows for
the immediate deployment of internationalized domain names on
existing servers, and within hierarchies which include
internationalized domain names. However, any such queries which
originate at applications that are compliant with this
specification will always initially fail, guaranteeing that fall-
back processing will always occur for those zones.
Conversely, using internationalized zones directly allows servers
to process legacy, ACE and EDNS/UTF-8 queries equally, thereby
providing greater value to the applications and resolvers which
have been made compliant with this specification. However,
internationalized zones have additional requirements (most
notably, they are required to be upgraded simultaneously), and
these will prove burdensome to some zone operators.
This specification focuses on the processing requirements for
internationalized zones which support the use of internationalized
domain names as explicit data, and which also support the
necessary subordinate mechanisms such as EDNS/UTF-8 queries. When
STD13 octet sequences or ACE encoded domain names are used with
legacy servers, the rules defined in STD13 for those servers MUST
be used.
Note that each zone SHOULD be configurable independently. If a
server hosts multiple zones, each of those zones SHOULD be
operable as an independent entity, with any of them using ACE or
internationalized domain names as necessary. This rule is
necessary since each zone is likely to have different replication
partners and configuration rules which will require different
migration strategies.
8.1. Internationalized Zones
All domain names which are published by an internationalized zone
MUST be compatible with the restrictions specified in section 4 of
this document. In particular, the zone database MUST allow binary
domain names to be stored as any octet value, but MUST also comply
with the normalization and case-mapping rules when a domain name
represents a host identifier. These restrictions MUST be applied
as part of the process in which the domain name is being added to
the zone database. In those cases where an internationalized
domain name specifies an exact sequence of octets for binary
comparison, the zone database MUST provide a mechanism for
tagging the eight-bit characters so that they are not interpreted,
processed or compared as the canonical UCS character equivalents
of those codes. STD13 defines an escaping mechanism whereby the
decimal value of the octet is prefaced with a reverse-solidus
(such as "\193"), which is suggested for this usage.
Servers which are compliant with this specification MUST be
capable of providing UTF-8 and ACE encoded representations of the
UCS domain names which are stored in the zone, and servers MUST
restrict output to only one label type for any protocol operation,
such that queries containing STD13 legacy labels MUST be answered
with STD13 octet sequences and/or ACE encoded domain names, while
EDNS/UTF-8 queries MUST only be answered with UTF-8 encoded domain
names (this not only includes basic operations such as simple
queries, but also includes advanced operations such as zone
transfers; see section 8.2). Similarly, external operations such
as exporting the contents of the zone to a master file (as
discussed in section 8.3) MUST result in a single encoding form
being used for that specific operation.
Note that the underlying zone database technology which may be
employed by any particular server is beyond the scope of this
document. Servers MAY use any database technology, charset or
encoding deemed appropriate for the local environment, although
the contents of the zone MUST be mapped to the canonical UCS
character codes for all comparison operations (octet values
excluded). Since UTF-8 is used as the default encoding throughout
this specification, it is RECOMMENDED for use as the default
encoding with zone databases as well, but is not required.
Servers MUST NOT normalize or case-map any UCS characters which
are decoded from UTF-8 or ACE encoded labels, and MUST restrict
comparison operations of these labels to precise matches of the
UCS domain names which are stored in the zone database. However,
the seven bit character codes from any labels which are received
as STD13 octet sequences MUST be compared in a case-neutral form,
and MUST NOT be normalized as part of the comparison operation.
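The two comparison rules above can be sketched in Python as
follows. The function names and the "std13-octets" label type
string are hypothetical, and labels are assumed to have already
been decoded to Unicode strings. Only the letters A through Z are
folded; no Unicode case mapping or normalization is applied.

```python
def ascii_fold(label: str) -> str:
    """Lowercase only A-Z; leave every other code point untouched."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in label
    )


def labels_match(stored: str, queried: str, query_label_type: str) -> bool:
    """Compare a queried label against a stored UCS label."""
    if query_label_type == "std13-octets":
        # Legacy octets: case-neutral for seven-bit letters only.
        return ascii_fold(stored) == ascii_fold(queried)
    # Decoded from UTF-8 or ACE: precise match, no case mapping.
    return stored == queried
```

Python's str.lower() is deliberately avoided here, because it would
also case-map characters outside the seven-bit range.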
When a zone is converted to support internationalized domain
names, all of the servers which replicate that zone MUST be
upgraded. This is required due to ambiguities that can occur with
labels which may be encoded as either STD13 octet sequences or ACE
data, and where the label only uses character codes from the
eight-bit range of character codes (this problem is described in
detail in section 4.1.2). In order to ensure that all of the
servers for a zone respond to one of those queries correctly, all
of the servers which replicate the zone MUST fully support this
document and its requirements.
8.2. Namespace Visibility Restrictions
In all cases, the encoding format of the domain names which are
returned in response to a query MUST be the same as the encoding
format which was used by the query. If the query was provided as a
sequence of legacy labels, then all of the domain names which are
provided in the response message MUST be provided as legacy labels
(containing either ACE or STD13 octet encoded values).
Similarly, if a query is provided as EDNS/UTF-8 encoded data, all
domain names which are provided in the response message MUST be
provided as UTF-8 encoded data in EDNS/UTF-8 extended labels. In
some situations, this process may require the server to perform an
extra conversion.
For example, assume that the <idn>.example.com. domain name has
two associated MX resource records, one of which points to the UCS
domain name of mail.<idn>.example.com, while the other points to
the ACE encoded domain name of mail.<ace>.example.net. (where the
"<ace>" label is the ACE equivalent of an internationalized sub-
domain in the example.net. zone). If a UTF-8 query arrives for the
MX resource records associated with the <idn>.example.com. domain
name, both resource records MUST be returned as EDNS/UTF-8 data.
In order for this requirement to be satisfied, the server will
have to decode the <ace> label to its UCS canonical form for zone
storage purposes, and encode the domain name as UTF-8 for
transmission whenever an EDNS/UTF-8 answer set is required.
The visibility rules specified in this section are mandatory for
every domain name which is provided in any message. If a system
requests a zone transfer and uses the EDNS/UTF-8 extended label
type in the request, all of the domain names in all of the
messages which are sent as part of the zone transfer MUST be
provided in their UTF-8 encoded form. Similarly, if a zone
transfer is requested and uses the legacy label type, then all of
the domain names from all of the messages which are sent as part
of the zone transfer MUST be provided as either STD13 octet
sequences or ACE encoded data, using the legacy label type.
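The single-encoding rule of this section can be sketched in Python
as follows. The function name, the label type strings and the
to_legacy callable are hypothetical; to_legacy stands in for the
STD13 octet and ACE conversion, which is not reproduced here. The
names are assumed to be held in their UCS form as Unicode strings.

```python
from typing import Callable, List, Tuple


def encode_answer_names(
    names: List[str],
    query_label_type: str,
    to_legacy: Callable[[str], bytes],
) -> List[Tuple[str, bytes]]:
    """Encode every name in a response in the format the query used."""
    if query_label_type == "edns-utf8":
        return [("edns-utf8", name.encode("utf-8")) for name in names]
    if query_label_type == "legacy":
        return [("legacy", to_legacy(name)) for name in names]
    raise ValueError("unknown label type: " + query_label_type)
```

Because one branch handles the whole list, a response (or every
message of a zone transfer) can never mix the two label types.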
8.3. The Master File Format
STD13 specifies a "master file" format which is used as a
platform-neutral storage and transfer format for importing and
exporting the contents of a particular zone. Note that the master
file is not the same as the operating database for a zone; the
master file format is used (or is useful) for copying a zone to
another server, storing a copy of the zone database off-line,
emailing a copy of the zone to another user or system, and
performing other off-line actions against the database's contents.
Once a zone is loaded on a server, however, any database
technology can be used for managing the zones and generating
response messages.
In order to facilitate the continued use of master files, any zone
which is compliant with this specification MUST support the use of
UTF-8 as an import and export encoding format for the master file
associated with that zone.
Furthermore, compliant versions of a master file are required to
have the "$UTF-8" control literal at the beginning of the first
line of text in the master file if it contains UTF-8 encoded data.
Master files from zones which do not contain UTF-8 encoded domain
names MUST NOT contain the "$UTF-8" control literal in the first
print position of any line.
If the master file contains the "$UTF-8" control literal, all of
the data within the master file MUST be encoded in UTF-8 as
specified by RFC2279, and SHOULD be managed with UTF-8 compliant
tools (such as UTF-8 text editors, mailers that support UTF-8 MIME
encodings, and so forth).
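A loader could check the "$UTF-8" rules along the following lines.
This Python sketch is illustrative only: the function name
check_master_file is hypothetical, lines are assumed to be
separated by LF, and Python's strict UTF-8 decoder is used as the
validity test, which is somewhat stricter than RFC2279 (it rejects
five and six octet sequences).

```python
UTF8_LITERAL = b"$UTF-8"


def check_master_file(data: bytes) -> bool:
    """Return True for a UTF-8 master file, False for a legacy one.

    Raises ValueError if the file breaks the "$UTF-8" rules.
    """
    lines = data.split(b"\n")
    if lines[0].startswith(UTF8_LITERAL):
        # Declared as UTF-8: the entire file must decode cleanly.
        # UnicodeDecodeError is a subclass of ValueError.
        data.decode("utf-8")
        return True
    # Not declared: the literal must not start any later line.
    if any(line.startswith(UTF8_LITERAL) for line in lines[1:]):
        raise ValueError("$UTF-8 literal found outside the first line")
    return False
```

A file that passes with a False result is then parsed under the
existing STD13 master file rules.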
9. Caching Guidelines
Whenever an internationalized domain name is stored in a cache, it
MUST be stored in its canonical UCS character code form,
regardless of whether the domain name was received as STD13 octet
encoding sequences, UTF-8, or ACE data. Caches MUST NOT normalize
or case-convert any domain names that they store, as such a
process could invalidate domain names that are not used for host
identifiers.
Any subsequent queries which are processed through the cache MUST
be compared against the stored UCS characters. Internationalized
domain name labels which are decoded from UTF-8 or ACE labels MUST
NOT be normalized or case-converted as part of the comparison
operation, although labels which are provided as STD13 octet
sequences MUST be compared as case-neutral octet values.
Caches MUST be capable of providing UTF-8 and ACE encoded
representations of the UCS domain names which are stored in the
cache, with the appropriate format determined by the format used
in the corresponding query. However, answer data MUST be
restricted to only one encoding form for any protocol operation,
meaning that queries containing legacy labels MUST only be
answered with STD13 octet sequences and/or ACE encoded labels,
while UTF-8 queries MUST only be answered with UTF-8 encoded
domain names.
10. Security Considerations
This document defines an extension to the domain name system, and
as such, it inherits the weaknesses which already exist in DNS.
Where possible, this specification strengthens DNS with multiple
checks. For example, this specification requires that domain names
be validated three times before they are used by applications:
once on specification, once on entry at the authoritative zone or
hosts database, and once again when the answer data is received by
the requesting application. Despite these checks, the root
weaknesses inherent in DNS are still present.
This document uses multiple encoding algorithms, although boundary
conditions from the existing DNS are preserved for both the source
and encoded representations.
11. IANA Considerations
This document requires the use of an EDNS extended label type
identification code. This document uses the b000011 ELT code.
12. References
[AMC-ACE-Z] <draft-ietf-idn-amc-ace-z>, "AMC-ACE-Z version
0.3.1"
[NAMEPREP] <draft-ietf-idn-nameprep>, "Preparation of
Internationalized Host Names"
[RFC2119] "Key words for use in RFCs to Indicate Requirement
Levels"
[RFC952] "DoD Internet host table specification"
[STD13] (RFC 1034) "Domain names - concepts and facilities",
(RFC 1035) "Domain names - implementation and
specification"
[STD3] (RFC 1122) "Requirements for Internet Hosts --
Communication Layers", (RFC 1123) "Requirements for Internet
Hosts -- Application and Support"
[BCP18] (RFC 2277) "IETF Policy on Character Sets and
Languages"
[RFC2279] "UTF-8, a transformation format of ISO 10646"
[RFC2671] "Extension Mechanisms for DNS (EDNS0)"
[ASCII] "ANSI X3.4-1968. USA Standard Code for Information
Interchange"
[ISO10646] "ISO/IEC 10646-1:2000. International Standard --
Information technology -- Universal Multiple-Octet Coded
Character Set (UCS) -- Part 1: Architecture and Basic
Multilingual Plane"
13. Acknowledgements
This document is an assembly of multiple ideas and proposals which
have been made on the IDN working group mailing list. Many of the
ideas presented here have been proposed by multiple parties in one
form or another, although Dan Oscarsson is credited for proposing
a dual-mode operation which is capable of simultaneously
supporting UTF-8 and legacy mode encodings. Other contributors to
key elements from this specification (some of them unknowingly or
unwillingly) include (alphabetically) Marc Blanchet, Adam
Costello, Mark Davis, Martin Duerst, Patrik Faltstrom, Paul
Hoffman, David Hopwood, and many others.
14. Editor's Address
Eric A. Hall
ehall@ehsco.com