diff --git a/doc/design/red-black b/doc/design/red-black new file mode 100644 index 0000000000..c40aaf7b16 --- /dev/null +++ b/doc/design/red-black @@ -0,0 +1,228 @@ + Red-Black Tree Implementation Notes + +OVERVIEW + +BIND9's basic name storage mechanism is to use a modified form of +balanced binary tree known as a red-black tree. Red-black trees +provide for relatively efficient storage, retrieval and removal of +data while maintaining the lexical order of all stored keys, a +necessary function for DNS security. + +DESCRIPTION + +A red-black tree is a balanced binary tree named for the coloring that +is done in the tree, identifying each node as either red or black. +There are two simple rules for maintaining the color of nodes: + (1) A red node has only black children. + (2) The path from the root to any leaf node always includes the + same number of black nodes. + +Whenever a key is added or removed, adjustments are made to adhere to +those two rules. These adjustments are relatively cheap to make but +maintain the balance of the tree, thus making for efficient addition, +lookup and deletion operations, all of which are O(log N). The color +of a node is not relevant to external users of the tree; it is needed +only to maintain the balance of the tree. + +For more information on basic red-black trees, see _Introduction to +Algorithms_, Cormen, Leiserson, and Rivest, MIT Press / McGraw Hill, +1990, ISBN 0-262-03141-8, chapter 14. + +In BIND9, the red-black tree implementation uses DNS names as keys, +and can store arbitrary data with each key value. "name" and "key" +are used interchangably in this document. + +The basic red-black tree algorithm is further adapted for use in BIND9 +to incorporate the notion of hierarchy, creating a tree of red-black +trees. Where there is more than one name with a common suffix, all +names with that suffix are stored in their own red-black tree, with a +down pointer from the suffix locating the subtree. + +For example, consider storing the following names: + a x.d.e.f o.w.y.d.e.f + b z.d.e.f p.w.y.d.e.f + c g.h q.w.y.d.e.f + +No matter which order the keys were added, this would result in a tree +that can be visualized as: + + b + / \ + a d.e.f + /|\ + c | g.h + | + w.y + /|\ + x | z + | + p + / \ + o q + +This tree shows that when there is no key for a particular label, and +when there is only one known label for its immediate subordinate, then +multiple labels can appear in a single node, such as at d.e.f and g.h. +It also demonstrates that there can be more nodes in the tree of trees +than there are actual keys (which degrades the O(log N) performance +marginally); the nodes at d.e.f and w.y do not represent keys. + +As an aside, remember that when ordering DNS names, labels are +examined from the right, therefore w.y sorts after x and before z. + +When bitfields are fully supported (they aren't yet) then a split can +occur not only on a regular label boundary, but also between any two +bits. The common-suffix rules will be applied to keep as many bits +together as possible. + +In the current implementation of the tree of trees, a node is +considered to "formally" exist only if it has data associated with +it. So if the above tree then had the key d.e.f added to it, the +operation would succeed rather than getting an "already exists" +error. + +Along the same lines, if a key is added with a name which is a proper +superdomain of the name stored in an existing node, the operation will +succeed by splitting the existing node into one node that is the key +and another node that is the remaining parts of the name. Adding e.f +to the above tree results in the top level red-black tree having a +node named e.f where the current d.e.f is, and a down pointer from +d.e.f to a "tree" of a single node named d. The down pointer from d +would be kept to the level which has x, w.y, and z. + +A similar split of d.e.f would occur if the name k.e.f were added. +The top level tree would have the node e.f with a down pointer to a +level that had both d and k, and d would continue to have its down +pointer to the x, w.y and z level. + +It is guaranteed when splitting that external references to the node +that is split will remain valid --- in the previous examples, anything +that was pointing to the node that was d.e.f will still point to the +node that is now just d. + +When deleting keys, nodes can be rejoined. If both of p.w.y.d.e.f and +q.w.y.d.e.f were removed from the example tree, the node named w.y +would become o.w.y. Unlike splitting, it is _not_ guaranteed that +external references remain consistent; sometimes they will, sometimes +they won't. Also, note that deletion is not perfectly symmetric with +addition. If you "undo" the last addition with a deletion of the same +key then the tree of trees is not guaranteed to have exactly the same +structure as it had prior to the addition. Sometimes, but not always. + +Rejoining does not happen if it would violate any of the rules that +cause a split. o would not be rejoined with w.y if w.y had data +associated with the key; o would remain as a single node on its own +level. This emphasizes the rule that a node is considered to formally +exist only if data is associated with it, because even if w.y.d.e.f +had been explicitly added as a key but with no data, then o would +still be merged with the w.y node when p and q were deleted. + +Searching for a node generally returns one of three possible results: +either the key is found, a superdomain (partial match) of the key is +found, or no part of the key is found. The first and last are rather +obvious, and the second result basically means that a hierarchically +enclosing name is found; e.g, searching for bb.rc.vix.com turned up +rc.vix.com, but not the full name. + +No locking is done within the RBT library. @@@ + +EXTERNAL PROGRAMMATIC DETAILS + +This section details the functions used to interact with the BIND9 +red-black tree library, or RBT for short. + +A source file that will be using RBT will usually need to include +. This header file automatically includes , , and . + +The rbt.h file has more complete descriptions of each of the functions +named here, including what is required for each argument, what each +function ensures (and might not ensure) will occur, and the full range +of possible results for each call. Note well: if a function returns a +dns_result_t rather than void, it definitely means there is something +that can go possibly wrong in the function and it should be checked by +the caller. + +A new tree of trees must be initialized using: + + dns_result_t dns_rbt_create(isc_mem_t *mctx, void (*deleter)(void *, void *), + void *deleter_arg, dns_rbt_t **rbtp); + +The memory context, mctx, must be a non-null pointer that was +initialized with isc_mem_create(). The deleter argument, if non-null, +should point to a function that is responsible for cleaning up any +memory associated with the data pointer of a node when the node is +deleted. It is passed the deleted node's data pointer as its first +argument and deleter_arg as its second argument. + +After initializing an RBT manager, to add keys to the tree, use: + + dns_result_t dns_rbt_addname(dns_rbt_t *rbt, dns_name_t *name, void *data); + +The name _must_ be an absolute name. It is not required that the data +pointer be non-null, but it is recommended that it point to something, +even just invalid memory, because of the various searching and +deletion issues described in the previous section. The RBT code will +not attempt to dereference the pointer. + +To find a key in the tree, use: + + dns_result_t dns_rbt_findname(dns_rbt_t *rbt, dns_name_t *name, void **data); + +The data parameter must not be NULL, but *data must be NULL. The +result will be either DNS_R_SUCCESS, DNS_R_PARTIALMATCH or +DNS_R_NOTFOUND. In the first case, an exact match was found for the +name and there was an associate data pointer, which is returned via +the data parameter. A partial match results when the name has not +been found but a superdomain name, with data, does exist; then the +data for that name is returned in the data parameter. If no data is +found for the name or a superdomain, *data will remain NULL. + + + +INTERNAL PROGRAMMATIC DETAILS + +This section is mainly relevant to the RBT DB implementation. It is +highly recommended that programmers using the RBT library stick to the +functions named in the previous section. + +The dns_rbt_addname and dns_rbt_findname functions named in the +previous section are wrappers around dns_rbt_addnode and +dns_rbt_findnode. The *node functions for the most part do not +particularly care whether a node has an associated data pointer or +not, whereas the *name functions do. The one exception to this is +that when a PARTIALMATCH is returned for a search, the indicated node +is the deepest match that has data, rather than just the deepest +match. This inconsistency could be addressed by adding another +parameter to dns_rbt_findnode. + +Each node in the tree of trees is represented by the following structure: + + typedef struct dns_rbtnode { + struct dns_rbtnode *left; + struct dns_rbtnode *right; + struct dns_rbtnode *down; + /* + * The following bitfields add up to a total bitwidth of 32. + * The range of values necessary for each item is indicated, + * but in the case of "attributes" the field is wider to accomodate + * possible future expansion. "offsetlen" could be one bit + * narrower by always adjusting its value by 1 to find the real + * offsetlen, but doing so does not gain anything (except perhaps + * another bit for "attributes", which doesn't yet need any more). + */ + unsigned int color:1; /* range is 0..1 */ + unsigned int attributes:6; /* range is 0..2 */ + unsigned int namelen:8; /* range is 1..255 */ + unsigned int offsetlen:8; /* range is 1..128 */ + unsigned int padbytes:9; /* range is 0..380 */ + /* + * These values are used in the RBT DB implementation. The + * appropriate node lock must be held before accessing them. + */ + void *data; + unsigned int dirty:1; + unsigned int locknum:DNS_RBT_LOCKLENGTH; + unsigned int references:DNS_RBT_REFLENGTH; + } dns_rbtnode_t; +