It is more intuitive to have the countdown 'max-stale-ttl' as the
RRset TTL, instead of 0 TTL. This information was already available
in a comment "; stale (will be retained for x more seconds", but
Support suggested to put it in the TTL field instead.
(cherry picked from commit a83c8cb0af)
Before binding an RRset, check the time and see if this record is
stale (or perhaps even ancient). Marking a header stale or ancient
happens only when looking up an RRset in cache, but binding an RRset
can also happen on other occasions (for example when dumping the
database).
Check the time and compare it to the header. If according to the
time the entry is stale, but not ancient, set the STALE attribute.
If according to the time is ancient, set the ANCIENT attribute.
We could mark the header stale or ancient here, but that requires
locking, so that's why we only compare the current time against
the rdh_ttl.
Adjust the test to check the dump-db before querying for data. In the
dumped file the entry should be marked as stale, despite no cache
lookup happened since the initial query.
(cherry picked from commit debee6157b)
When introducing change 5149, "rndc dumpdb" started to print a line
above a stale RRset, indicating how long the data will be retained.
At that time, I thought it should also be possible to load
a cache from file. But if a TTL has a value of 0 (because it is stale),
stale entries wouldn't be loaded from file. So, I added the
'max-stale-ttl' to TTL values, and adjusted the $DATE accordingly.
Since we actually don't have a "load cache from file" feature, this
is premature and is causing confusion at operators. This commit
changes the 'max-stale-ttl' adjustments.
A check in the serve-stale system test is added for a non-stale
RRset (longttl.example) to make sure the TTL in cache is sensible.
Also, the comment above stale RRsets could have nonsensical
values. A possible reason why this may happen is when the RRset was
marked a stale but the 'max-stale-ttl' has passed (and is actually an
RRset awaiting cleanup). This would lead to the "will be retained"
value to be negative (but since it is stored in an uint32_t, you would
get a nonsensical value (e.g. 4294362497).
To mitigate against this, we now also check if the header is not
ancient. In addition we check if the stale_ttl would be negative, and
if so we set it to 0. Most likely this will not happen because the
header would already have been marked ancient, but there is a possible
race condition where the 'rdh_ttl + serve_stale_ttl' has passed,
but the header has not been checked for staleness.
(cherry picked from commit 2a5e0232ed)
The dboption DNS_DBFIND_STALEONLY caused confusion because it implies
we are looking for stale data **only** and ignore any active RRsets in
the cache. Rename it to DNS_DBFIND_STALETIMEOUT as it is more clear
the option is related to a lookup due to "stale-answer-client-timeout".
Rename other usages of "staleonly", instead use "lookup due to...".
Also rename related function and variable names.
(cherry picked from commit 839df94190)
If we did not attempt a fetch due to fetch-limits, we should not start
the stale-refresh-time window.
Introduce a new flag DNS_DBFIND_STALESTART to differentiate between
a resolver failure and unexpected error. If we are resuming, this
indicates a resolver failure, then start the stale-refresh-time window,
otherwise don't start the stale-refresh-time window, but still fall
back to using stale data.
(This commit also wraps some docstrings to 80 characters width)
(cherry picked from commit aabdedeae3)
*** CID 318094: Null pointer dereferences (REVERSE_INULL)
/lib/dns/rbtdb.c: 1389 in newversion()
1383 version->xfrsize = rbtdb->current_version->xfrsize;
1384 RWUNLOCK(&rbtdb->current_version->rwlock, isc_rwlocktype_read);
1385 rbtdb->next_serial++;
1386 rbtdb->future_version = version;
1387 RBTDB_UNLOCK(&rbtdb->lock, isc_rwlocktype_write);
1388
CID 318094: Null pointer dereferences (REVERSE_INULL)
Null-checking "version" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
1389 if (version == NULL) {
1390 return (result);
1391 }
1392
1393 *versionp = version;
1394
(cherry picked from commit 456d53d1fb)
This commit fix a race that could happen when two or more threads have
failed to refresh the same RRset, the threads could simultaneously
attempt to update the header->last_refresh_fail_ts field in
check_stale_header, a field used to implement stale-refresh-time.
By making this field atomic we avoid such race.
(cherry picked from commit c75575e350)
First of all, there was a flaw in the code related to the
'stale-refresh-time' option. If stale answers are enabled, and we
returned stale data, then it was assumed that it was because we were
in the 'stale-refresh-time' window. But now we could also have returned
stale data because of a 'stale-answer-client-timeout'. To fix this,
introduce a rdataset attribute DNS_RDATASETATTR_STALE_WINDOW to
indicate whether the stale cache entry was returned because the
'stale-refresh-time' window is active.
Second, remove the special case handling when the result is
DNS_R_NCACHENXRRSET. This can be done more generic in the code block
when dealing with stale data.
Putting all stale case handling in the code block when dealing with
stale data makes the code more easy to follow.
Update documentation to be more verbose and to match then new code
flow.
(cherry picked from commit fa0c9280d2)
The general logic behind the addition of this new feature works as
folows:
When a client query arrives, the basic path (query.c / ns_query_recurse)
was to create a fetch, waiting for completion in fetch_callback.
With the introduction of stale-answer-client-timeout, a new event of
type DNS_EVENT_TRYSTALE may invoke fetch_callback, whenever stale
answers are enabled and the fetch took longer than
stale-answer-client-timeout to complete.
When an event of type DNS_EVENT_TRYSTALE triggers fetch_callback, we
must ensure that the folowing happens:
1. Setup a new query context with the sole purpose of looking up for
stale RRset only data, for that matters a new flag was added
'DNS_DBFIND_STALEONLY' used in database lookups.
. If a stale RRset is found, mark the original client query as
answered (with a new query attribute named NS_QUERYATTR_ANSWERED),
so when the fetch completion event is received later, we avoid
answering the client twice.
. If a stale RRset is not found, cleanup and wait for the normal
fetch completion event.
2. In ns_query_done, we must change this part:
/*
* If we're recursing then just return; the query will
* resume when recursion ends.
*/
if (RECURSING(qctx->client)) {
return (qctx->result);
}
To this:
if (RECURSING(qctx->client) && !QUERY_STALEONLY(qctx->client)) {
return (qctx->result);
}
Otherwise we would not proceed to answer the client if it happened
that a stale answer was found when looking up for stale only data.
When an event of type DNS_EVENT_FETCHDONE triggers fetch_callback, we
proceed as before, resuming query, updating stats, etc, but a few
exceptions had to be added, most important of which are two:
1. Before answering the client (ns_client_send), check if the query
wasn't already answered before.
2. Before detaching a client, e.g.
isc_nmhandle_detach(&client->reqhandle), ensure that this is the
fetch completion event, and not the one triggered due to
stale-answer-client-timeout, so a correct call would be:
if (!QUERY_STALEONLY(client)) {
isc_nmhandle_detach(&client->reqhandle);
}
Other than these notes, comments were added in code in attempt to make
these updates easier to follow.
(cherry picked from commit 171a5b7542)
- change name of 'bytes' to 'xfrsize' in dns_db_getsize() parameter list
and related variables; this is a more accurate representation of what
the function is doing
- change the size calculations in dns_db_getsize() to more accurately
represent the space needed for a *XFR message or journal file to contain
the data in the database. previously we returned the sizes of all
rdataslabs, including header overhead and offset tables, which
resulted in the database size being reported as much larger than the
equivalent *XFR or journal.
- map files caused a particular problem here: the fullname can't be
determined from the node while a file is being deserialized, because
the uppernode pointers aren't set yet. so we store "full name length"
in the dns_rbtnode structure while serializing, and clear it after
deserialization is complete.
It is possible to have two threads destroying an rbtdb at the same
time when detachnode() executes and removes the last reference to
a node between exiting being set to true for the node and testing
if the references are zero in maybe_free_rbtdb(). Move NODE_UNLOCK()
to after checking if references is zero to prevent detachnode()
changing the reference count too early.
(cherry picked from commit 859d2fdad6)
Before this update, BIND would attempt to do a full recursive resolution
process for each query received if the requested rrset had its ttl
expired. If the resolution fails for any reason, only then BIND would
check for stale rrset in cache (if 'stale-cache-enable' and
'stale-answer-enable' is on).
The problem with this approach is that if an authoritative server is
unreachable or is failing to respond, it is very unlikely that the
problem will be fixed in the next seconds.
A better approach to improve performance in those cases, is to mark the
moment in which a resolution failed, and if new queries arrive for that
same rrset, try to respond directly from the stale cache, and do that
for a window of time configured via 'stale-refresh-time'.
Only when this interval expires we then try to do a normal refresh of
the rrset.
The logic behind this commit is as following:
- In query.c / query_gotanswer(), if the test of 'result' variable falls
to the default case, an error is assumed to have happened, and a call
to 'query_usestale()' is made to check if serving of stale rrset is
enabled in configuration.
- If serving of stale answers is enabled, a flag will be turned on in
the query context to look for stale records:
query.c:6839
qctx->client->query.dboptions |= DNS_DBFIND_STALEOK;
- A call to query_lookup() will be made again, inside it a call to
'dns_db_findext()' is made, which in turn will invoke rbdb.c /
cache_find().
- In rbtdb.c / cache_find() the important bits of this change is the
call to 'check_stale_header()', which is a function that yields true
if we should skip the stale entry, or false if we should consider it.
- In check_stale_header() we now check if the DNS_DBFIND_STALEOK option
is set, if that is the case we know that this new search for stale
records was made due to a failure in a normal resolution, so we keep
track of the time in which the failured occured in rbtdb.c:4559:
header->last_refresh_fail_ts = search->now;
- In check_stale_header(), if DNS_DBFIND_STALEOK is not set, then we
know this is a normal lookup, if the record is stale and the query
time is between last failure time + stale-refresh-time window, then
we return false so cache_find() knows it can consider this stale
rrset entry to return as a response.
The last additions are two new methods to the database interface:
- setservestale_refresh
- getservestale_refresh
Those were added so rbtdb can be aware of the value set in configuration
option, since in that level we have no access to the view object.
After backporting #1870 to 9.11-S I saw that the condition check there
is different than in the main branch. In 9.11-S "stale" can mean
stale and serve-stale, or not active (awaiting cleanup). In 9.16 and
later versions, "stale" is stale and serve-stale, and "ancient" means
not active (awaiting cleanup). An "ancient" RRset is one that is not
active (TTL expired) and is not eligble for serve-stale.
Update the condition for rndc dumpdb -expired to closer match what is
in 9.11-S.
(cherry picked from commit 5614454c3b)
By changing the check in 'rdatasetiter_first' and 'rdatasetiter_next'
from "now > header->rdh_ttl" to "now - RBDTB_VIRTUAL > header->rdh_ttl"
we include expired rdataset entries so that they can be used for
"rndc dumpdb -expired".
(cherry picked from commit 17d5bd4493)
The rbtdb version glue_table has been refactored similarly to rbt.c hash
table, so it does use 32-bit hash function return values and apply
Fibonacci Hashing to lookup the index to the hash table instead of
modulo. For more details, see the lib/dns/rbt.c commit log.
(cherry picked from commit 01684cc219)
When a received RRSet has TTL 0, they would be preserved for
serve-stale (default `max-stale-cache` is 12 hours) rather than expiring
them quickly from the cache database.
This commit makes sure the RRSet didn't have TTL 0 before marking the
entry in the database as "stale".
(cherry picked from commit 6ffa2ddae0)
Created isc_refcount_decrement_expect macro to test conditionally
the return value to ensure it is in expected range. Converted
unchecked isc_refcount_decrement to use isc_refcount_decrement_expect.
Converted INSIST(isc_refcount_decrement()...) to isc_refcount_decrement_expect.
(cherry picked from commit bde5c7632a)
There were several problems with rbt hashtable implementation:
1. Our internal hashing function returns uint64_t value, but it was
silently truncated to unsigned int in dns_name_hash() and
dns_name_fullhash() functions. As the SipHash 2-4 higher bits are
more random, we need to use the upper half of the return value.
2. The hashtable implementation in rbt.c was using modulo to pick the
slot number for the hash table. This has several problems because
modulo is: a) slow, b) oblivious to patterns in the input data. This
could lead to very uneven distribution of the hashed data in the
hashtable. Combined with the single-linked lists we use, it could
really hog-down the lookup and removal of the nodes from the rbt
tree[a]. The Fibonacci Hashing is much better fit for the hashtable
function here. For longer description, read "Fibonacci Hashing: The
Optimization that the World Forgot"[b] or just look at the Linux
kernel. Also this will make Diego very happy :).
3. The hashtable would rehash every time the number of nodes in the rbt
tree would exceed 3 * (hashtable size). The overcommit will make the
uneven distribution in the hashtable even worse, but the main problem
lies in the rehashing - every time the database grows beyond the
limit, each subsequent rehashing will be much slower. The mitigation
here is letting the rbt know how big the cache can grown and
pre-allocate the hashtable to be big enough to actually never need to
rehash. This will consume more memory at the start, but since the
size of the hashtable is capped to `1 << 32` (e.g. 4 mio entries), it
will only consume maximum of 32GB of memory for hashtable in the
worst case (and max-cache-size would need to be set to more than
4TB). Calling the dns_db_adjusthashsize() will also cap the maximum
size of the hashtable to the pre-computed number of bits, so it won't
try to consume more gigabytes of memory than available for the
database.
FIXME: What is the average size of the rbt node that gets hashed? I
chose the pagesize (4k) as initial value to precompute the size of
the hashtable, but the value is based on feeling and not any real
data.
For future work, there are more places where we use result of the hash
value modulo some small number and that would benefit from Fibonacci
Hashing to get better distribution.
Notes:
a. A doubly linked list should be used here to speedup the removal of
the entries from the hashtable.
b. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/
(cherry picked from commit e24bc324b4)
The ThreadSanitizer found a data race when updating the stale header.
Instead of trying to acquire the write lock and failing occasionally
which would skew the statistics, the dns_rdatasetheader_t.attributes
field has been promoted to use stdatomics. Updating the attributes in
the mark_header_ancient() and mark_header_stale() now uses the cmpxchg
to update the attributes forfeiting the need to hold the write lock on
the tree. Please note that mark_header_ancient() still needs to hold
the lock because .dirty is being updated in the same go.
(cherry picked from commit 81d4230e60)
RBTDB node can now appear on the deadnodes lists following the changes
to decrement_reference in 176b23b6cd to
defer checking of node->down when the tree write lock is not held. The
node should be unlinked instead.
(cherry picked from commit 569cc155b8680d8ed12db1fabbe20947db24a0f9)
in addition to being more efficient, this prevents a possible crash by
looking up the node name before the tree sructure can be changed when
cleaning up dead nodes in addrdataset().
(cherry picked from commit db9d10e3c1)
Save 'i' to 'locknum' and use that rather than using
'header->node->locknum' when performing the deferred
unlock as 'header->node->locknum' can theoretically be
different to 'i'.
(cherry picked from commit 8dd8d48c9f)
"max-journal-size" is set by default to twice the size of the zone
database. however, the calculation of zone database size was flawed.
- change the size calculations in dns_db_getsize() to more accurately
represent the space needed for a journal file or *XFR message to
contain the data in the database. previously we returned the sizes
of all rdataslabs, including header overhead and offset tables,
which resulted in the database size being reported as much larger
than the equivalent journal transactions would have been.
- map files caused a particular problem here: the full name can't be
determined from the node while a file is being deserialized, because
the uppernode pointers aren't set yet. so we store "full name length"
in the dns_rbtnode structure while serializing, and clear it after
deserialization is complete.
Updating LRU requires write-locking the node, which causes contention.
Update LRU only if time difference is large enough.
(cherry picked from commit fe584c01cc)
Start enforcing the clang-format rules on changed files
Closes#46
See merge request isc-projects/bind9!3063
(cherry picked from commit a04cdde45d)
d2b5853b Start enforcing the clang-format rules on changed files
618947c6 Switch AlwaysBreakAfterReturnType from TopLevelDefinitions to All
654927c8 Add separate .clang-format files for headers
5777c44a Reformat using the new rules
60d29f69 Don't enforce copyrights on .clang-format
adjust clang-format options to get closer to ISC style
See merge request isc-projects/bind9!3061
(cherry picked from commit d3b49b6675)
0255a974 revise .clang-format and add a C formatting script in util
e851ed0b apply the modified style
Add curly braces using uncrustify and then reformat with clang-format back
Closes#46
See merge request isc-projects/bind9!3057
(cherry picked from commit 67b68e06ad)
36c6105e Use coccinelle to add braces to nested single line statement
d14bb713 Add copy of run-clang-tidy that can fixup the filepaths
056e133c Use clang-tidy to add curly braces around one-line statements
Reformat source code with clang-format
Closes#46
See merge request isc-projects/bind9!2156
(cherry picked from commit 7099e79a9b)
4c3b063e Import Linux kernel .clang-format with small modifications
f50b1e06 Use clang-format to reformat the source files
11341c76 Update the definition files for Windows
df6c1f76 Remove tkey_test (which is no-op anyway)
6412 cleanup:
6413 dns_rdataset_disassociate(&neg);
6414 dns_rdataset_disassociate(&negsig);
CID 1452700 (#1 of 1): Dereference before null check (REVERSE_INULL)
check_after_deref: Null-checking closest suggests that it
may be null, but it has already been dereferenced on all
paths leading to the check.
6415 if (closest != NULL)
6416 free_noqname(mctx, &closest);
6367cleanup:
6368 dns_rdataset_disassociate(&neg);
6369 dns_rdataset_disassociate(&negsig);
CID 1452704 (#1 of 1): Dereference before null check
(REVERSE_INULL) check_after_deref: Null-checking noqname
suggests that it may be null, but it has already been
dereferenced on all paths leading to the check.
6370 if (noqname != NULL)
6371 free_noqname(mctx, &noqname);
This commit simplifies the cachedb rrset statistics in two ways:
- Introduce new rdtypecounter arithmetics, allowing bitwise
operations.
- Remove the special DLV statistic counter.
New rdtypecounter arithmetics
-----------------------------
"The rdtypecounter arithmetics is a brain twister". Replace the
enum counters with some defines. A rdtypecounter is now 8 bits for
RRtypes and 3 bits for flags:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | S |NX| RRType |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
If the 8 bits for RRtype are all zero, this is an Other RRtype.
Bit 7 is the NXRRSET (NX) flag and indicates whether this is a
positive (0) or a negative (1) RRset.
Then bit 5 and 6 mostly tell you if this counter is for an active,
stale, or ancient RRtype:
S = 0x00 means Active
S = 0x01 means Stale
S = 0x10 means Ancient
Since a counter cannot be stale and ancient at the same time, we
treat S = 0x11 as a special case to deal with NXDOMAIN counters.
S = 0x11 indicates an NXDOMAIN counter and in this case the RRtype
field signals the expiry of this cached item:
RRType = 0 means Active
RRType = 1 means Stale
RRType = 2 means Ancient
The isc_buffer_allocate() function now cannot fail with ISC_R_MEMORY.
This commit removes all the checks on the return code using the semantic
patch from previous commit, as isc_buffer_allocate() now returns void.
This commits removes superfluous checks when using the isc_refcount API.
Examples of superfluous checks:
1. The isc_refcount_decrement function ensures there was not underflow,
so this check is superfluous:
INSIST(isc_refcount_decrement(&r) > 0);
2 .The isc_refcount_destroy() includes check whether the counter
is zero, therefore this is superfluous:
INSIST(isc_refcount_decrement(&r) == 1 && isc_refcount_destroy(&r));
In decrement_reference only test node->down if the tree lock
is held. As node->down is not always tested in
decrement_reference we need to test that it is non NULL in
cleanup_dead_nodes prior to removing the node from the rbt
tree. Additionally it is not always possible to aquire the
node lock and reactivate a node when adding parent nodes.
Reactivate such nodes in cleanup_dead_nodes if required.