The named configuration files used in the "geoip2" system test cause a
rather large number of views (6-8) to be set up in each tested named
instance. Each view has its own cache.
Commit e24bc324b4 caused the RBT hash
table to be pre-allocated to a size derived from "max-cache-size", so
that it never needs to be rehashed. The size of that hash table is not
expected to be significant enough to cause memory use issues in typical
conditions even for large "max-cache-size" settings.
However, these two factors combined can cause memory exhaustion issues
in GitLab CI, where we run multiple "instances" of the test suite in
parallel on the same runner, each test suite executes multiple system
tests concurrently, and each system test may potentially start multiple
named instances at the same time. In practice, this problem currently
only seems to be affecting the "geoip2" system test, which is failing
intermittently due to named instances used by that test getting killed
by oom-killer.
Prevent the "geoip2" system test from failing intermittently by setting
"max-cache-size" in named configuration files used in that test to a low
value in order to keep memory usage at bay even with a large number of
views configured.
The current serve-stale implementation in BIND 9 stores all received
records in the cache for a max-stale-ttl interval (default 12 hours).
This allows DNS operators to turn on serve-stale answers in the event of
a large authoritative DNS outage. The caching of stale answers needs to
be enabled before the outage happens; otherwise the feature is useless.
The negative consequence of the default setting is the inevitable
cache bloat that happens for each and every DNS operator running named.
In this MR, a new configuration option `stale-cache-enable` is
introduced that allows the operators to selectively enable or disable
the serve-stale feature of BIND 9 based on their decision.
The newly introduced option is disabled by default, i.e. serve-stale is
disabled in the default configuration and has to be enabled explicitly
if required.
Created the isc_refcount_decrement_expect macro, which checks that the
value returned by the decrement is in the expected range. Converted
unchecked isc_refcount_decrement calls to use
isc_refcount_decrement_expect. Converted
INSIST(isc_refcount_decrement()...) to isc_refcount_decrement_expect.
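The idea of a checked decrement can be sketched as follows. This is
only an illustration using C11 atomics: the sketch_* names are
hypothetical stand-ins and not the definitions in the isc_refcount
header, and the check is assumed to be a simple equality test against
the expected previous value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the reference counter type. */
typedef atomic_uint_fast32_t sketch_refcount_t;

/* Decrement the counter and return the value it held before. */
static inline uint_fast32_t
sketch_refcount_decrement(sketch_refcount_t *ref) {
	return atomic_fetch_sub_explicit(ref, 1, memory_order_acq_rel);
}

/*
 * Decrement the counter and assert that the previous value is the one
 * the caller expects (assumption: equality check only).
 */
#define sketch_refcount_decrement_expect(ref, expect)                  \
	do {                                                           \
		uint_fast32_t prev_ = sketch_refcount_decrement(ref);  \
		assert(prev_ == (expect));                             \
		(void)prev_;                                           \
	} while (0)

/* Drop two references and return what is left in the counter. */
static inline uint_fast32_t
sketch_release_twice(void) {
	sketch_refcount_t refs;

	atomic_init(&refs, 2);
	sketch_refcount_decrement_expect(&refs, 2); /* 2 -> 1 */
	sketch_refcount_decrement_expect(&refs, 1); /* 1 -> 0 */
	return atomic_load(&refs);
}
```

Wrapping the check in the macro keeps the decrement from being compiled
out together with the assertion, which is the risk when the decrement
itself sits inside an assertion expression.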
It seems that config.guess always gets created in the source root, so
for the sake of out-of-tree system tests, we should expect the file
there instead of where configure was run.
The $SYSTEMTESTTOP shell variable is often set to .. in various shell
scripts inside bin/tests/system/, but most of the time it is only
used one line later, while sourcing conf.sh. This hardly improves
code readability.
$SYSTEMTESTTOP is also used for the purpose of referencing
scripts/files living in bin/tests/system/, but given that the
variable is always set to a short, relative path, we can drop it and
replace all of its occurrences with the relative path without adversely
affecting code readability.
There were several problems with the rbt hashtable implementation:
1. Our internal hashing function returns a uint64_t value, but it was
silently truncated to unsigned int in the dns_name_hash() and
dns_name_fullhash() functions. As the higher bits of SipHash 2-4 are
more random, we need to use the upper half of the return value.
2. The hashtable implementation in rbt.c was using modulo to pick the
slot number for the hash table. This has several problems because
modulo is: a) slow, b) oblivious to patterns in the input data. This
could lead to very uneven distribution of the hashed data in the
hashtable. Combined with the single-linked lists we use, it could
really bog down the lookup and removal of the nodes from the rbt
tree[a]. Fibonacci Hashing is a much better fit for the hashtable
function here. For a longer description, read "Fibonacci Hashing: The
Optimization that the World Forgot"[b] or just look at the Linux
kernel. Also this will make Diego very happy :).
3. The hashtable would rehash every time the number of nodes in the rbt
tree would exceed 3 * (hashtable size). The overcommit will make the
uneven distribution in the hashtable even worse, but the main problem
lies in the rehashing - every time the database grows beyond the
limit, each subsequent rehashing will be much slower. The mitigation
here is letting the rbt know how big the cache can grow and
pre-allocating the hashtable to be big enough to never actually need to
rehash. This will consume more memory at the start, but since the
size of the hashtable is capped to `1 << 32` (i.e. about 4 billion
entries), it will only consume a maximum of 32GB of memory for the
hashtable in the worst case (and max-cache-size would need to be set to
more than 4TB). Calling dns_db_adjusthashsize() will also cap the maximum
size of the hashtable to the pre-computed number of bits, so it won't
try to consume more gigabytes of memory than available for the
database.
FIXME: What is the average size of the rbt node that gets hashed? I
chose the pagesize (4k) as initial value to precompute the size of
the hashtable, but the value is based on feeling and not any real
data.
For future work, there are more places where we use result of the hash
value modulo some small number and that would benefit from Fibonacci
Hashing to get better distribution.
Notes:
a. A doubly linked list should be used here to speed up the removal of
the entries from the hashtable.
b. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/
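The three points above can be illustrated with a minimal sketch. The
function names are hypothetical and do not come from rbt.c; the
multiplier is the commonly used 32-bit golden-ratio constant
0x9E3779B9, and the sizing rule (one slot per assumed node size, capped
to a range of bits) is an assumption that follows the FIXME above rather
than the exact formula in the code.

```c
#include <assert.h>
#include <stdint.h>

/* About 2^32 divided by the golden ratio; a common Fibonacci multiplier. */
#define SKETCH_FIB_MULT_32 0x9E3779B9u

/* Point 1: keep the upper half of the 64-bit hash, not the lower one. */
static inline uint32_t
sketch_upper_half(uint64_t hash64) {
	return (uint32_t)(hash64 >> 32);
}

/*
 * Point 2: Fibonacci hashing. Multiply by the constant (wrapping at 32
 * bits) and keep the top 'bits' bits as the slot number, instead of
 * computing hash % table_size.
 */
static inline uint32_t
sketch_fib_slot(uint32_t hash, unsigned int bits) {
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(hash * SKETCH_FIB_MULT_32) >> (32 - bits);
}

/*
 * Point 3: choose the table size up front. Returns the smallest number
 * of bits (within [min_bits, max_bits]) such that 2^bits slots cover
 * max_cache_size / node_size nodes.
 */
static inline unsigned int
sketch_table_bits(uint64_t max_cache_size, uint64_t node_size,
		  unsigned int min_bits, unsigned int max_bits) {
	uint64_t want = max_cache_size / node_size;
	unsigned int bits = min_bits;

	while (bits < max_bits && ((uint64_t)1 << bits) < want) {
		bits++;
	}
	return bits;
}

/* Memory taken by the slot array alone, for a given pointer size. */
static inline uint64_t
sketch_table_bytes(unsigned int bits, uint64_t pointer_size) {
	return ((uint64_t)1 << bits) * pointer_size;
}
```

With 8-byte pointers, sketch_table_bytes(32, 8) gives 34359738368
bytes, which is the 32GB worst case mentioned above. With the assumed
4k node size, a 1GB max-cache-size results in an 18-bit table.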
Make sure bin/tests/system/run.sh returns a non-zero exit code if any of
the following happens:
- the test being run produces a core dump,
- assertion failures are found in the test's logs,
- ThreadSanitizer reports are found after the test completes,
- the servers started by the test fail to shut down cleanly.
This change is necessary to always fail a test in such cases (before the
migration to Automake, test failures were determined based on the
presence of "R:<test-name>:FAIL" lines in the test suite output and thus
it was not necessary for bin/tests/system/run.sh to return a non-zero
exit code).
Since October 2019 I have had complaints from `dnssec-cds` reporting
that the signatures on some of my test zones had expired. These were
zones signed by BIND 9.15 or 9.17, with a DNSKEY TTL of 24h and
`sig-validity-interval 10 8`.
This is the same setup we have used for our production zones since
2015, which is intended to re-sign the zones every 2 days, keeping
at least 8 days signature validity. The SOA expire interval is 7
days, so even in the presence of zone transfer problems, no-one
should ever see expired signatures. (These timers are a bit too
tight to be completely correct, because I should have increased
the expiry timers when I increased the DNSKEY TTLs from 1h to 24h.
But that should only matter when zone transfers are broken, which
was not the case for the error reports that led to this patch.)
For example, this morning my test zone contained:
    dev.dns.cam.ac.uk. 86400 IN RRSIG DNSKEY 13 5 86400 (
            20200701221418 20200621213022 ...)
But one of my resolvers had cached:
    dev.dns.cam.ac.uk. 21424 IN RRSIG DNSKEY 13 5 86400 (
            20200622063022 20200612061136 ...)
This TTL was captured at 20200622105807, so the resolver cached the
RRset 64976 seconds previously (18h02m56s), at 20200621165511,
only about 13.5h before expiry.
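The arithmetic behind these numbers can be reproduced with a few lines
of C. The helper names are hypothetical and only time-of-day values are
handled, which is enough here because the interval is shorter than one
day.

```c
#include <assert.h>

#define SKETCH_DAY 86400L

/* Seconds since midnight for a wall-clock time. */
static inline long
sketch_tod(int hours, int minutes, int seconds) {
	return hours * 3600L + minutes * 60L + seconds;
}

/* Time spent in the cache: original TTL minus the remaining TTL. */
static inline long
sketch_cache_age(long original_ttl, long remaining_ttl) {
	return original_ttl - remaining_ttl;
}

/* Time of day 'elapsed' seconds earlier, wrapping across midnight. */
static inline long
sketch_tod_minus(long tod, long elapsed) {
	long t = (tod - elapsed) % SKETCH_DAY;

	return t < 0 ? t + SKETCH_DAY : t;
}
```

The original TTL is 86400 and the resolver reported 21424, so the RRset
had been cached for 64976 seconds. Going back that far from 10:58:07
crosses midnight and lands on 16:55:11 the previous day.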
The other symptom of this error was incorrect `resign` times in
the output from `rndc zonestatus`.
For example, I have configured a test zone:
    zone fast.dotat.at {
        file "../u/z/fast.dotat.at";
        type primary;
        auto-dnssec maintain;
        sig-validity-interval 500 499;
    };
The zone is reset to a minimal zone containing only SOA and NS
records, and when `named` starts it loads and signs the zone. After
that, `rndc zonestatus` reports:
    next resign node: fast.dotat.at/NS
    next resign time: Fri, 28 May 2021 12:48:47 GMT
The resign time should be within the next 24h, but instead it is
near the signature expiry time, which the RRSIG(NS) says is
20210618074847. (Note 499 hours is a bit more than 20 days.)
May/June 2021 is less than 500 days from now because expiry time
jitter is applied to the NS records.
Using this test I bisected this bug to 09990672d which contained a
mistake leading to the resigning interval always being calculated in
hours, when days are expected.
This bug only occurs for configurations that use the two-argument form
of `sig-validity-interval`.
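The effect of the unit mix-up can be shown with a small sketch. It
assumes the documented rule that the second field is in days when the
base interval is greater than 7 days and in hours otherwise; the
function names are hypothetical and this is not the code touched by the
fix.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HOUR 3600u
#define SKETCH_DAY  86400u

/*
 * How long before signature expiry resigning should start, for the
 * two-argument form "sig-validity-interval <validity_days> <resign>".
 */
static inline uint32_t
sketch_resign_lead(uint32_t validity_days, uint32_t resign) {
	if (validity_days > 7) {
		return resign * SKETCH_DAY; /* field is in days */
	}
	return resign * SKETCH_HOUR; /* field is in hours */
}

/* The behaviour described above: the field is always read as hours. */
static inline uint32_t
sketch_resign_lead_buggy(uint32_t validity_days, uint32_t resign) {
	(void)validity_days;
	return resign * SKETCH_HOUR;
}

/* Seconds after signing at which the next resign is due. */
static inline uint32_t
sketch_next_resign(uint32_t validity_days, uint32_t lead) {
	return validity_days * SKETCH_DAY - lead;
}
```

For `sig-validity-interval 500 499` the intended lead is 499 days, so
the next resign is due one day after signing. Read as hours, the lead
shrinks to 499 hours (a bit more than 20 days), which matches the
`rndc zonestatus` output above. For `sig-validity-interval 10 8` the
lead drops from 8 days to 8 hours.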
When we're shutting the system down via "rndc stop" or "rndc halt",
or reconfiguring the control channel, there are potential shutdown
races between the server task and network manager. These are addressed by:
- purging any pending command tasks when shutting down the control channel
- adding an extra handle reference before the command handler to
ensure the handle can't be deleted out from under us before calling
command_respond()
- using an isc_task to execute all rndc functions makes it relatively
simple for them to acquire task exclusive mode when needed
- control_recvmessage() has been separated into two functions,
control_recvmessage() and control_respond(). the respond function
can be called immediately from control_recvmessage() when processing
a nonce, or it can be called after returning from the task event
that ran the rndc command function.
- updated libisccc to use netmgr events
- updated rndc to use isc_nm_tcpconnect() to establish connections
- updated control channel to use isc_nm_listentcp()
open issues:
- the control channel timeout was previously 60 seconds, but it is now
overridden by the TCP idle timeout setting, which defaults to 30
seconds. we should add a function that sets the timeout value for
a specific listener socket, instead of always using the global value
set in the netmgr. (for the moment, since 30 seconds is a reasonable
timeout for the control channel, I'm not prioritizing this.)
- the netmgr currently has no support for UNIX-domain sockets; until
this is addressed, it will not be possible to configure rndc to use
them. we will need to either fix this or document the change in
behavior.
When "rndc reconfig" is run, named first configures a fresh set of views
and then tears down the old views. Consider what happens for a single
view with LMDB enabled; "envA" is the pointer to the LMDB environment
used by the original/old version of the view, "envB" is the pointer to
the same LMDB environment used by the new version of that view:
1. mdb_env_open(envA) is called when the view is first created.
2. "rndc reconfig" is called.
3. mdb_env_open(envB) is called for the new instance of the view.
4. mdb_env_close(envA) is called for the old instance of the view.
This seems to have worked so far. However, an upstream change [1] in
LMDB which will be part of its 0.9.26 release prevents the above
sequence of calls from working as intended because the locktable mutexes
will now get destroyed by the mdb_env_close() call in step 4 above,
causing any subsequent mdb_txn_begin() calls to fail (because all of the
above steps are happening within a single named process).
Preventing the above scenario from happening would require either
redesigning the way we use LMDB in BIND, which is not something we can
easily backport, or redesigning the way BIND carries out its
reconfiguration process, which would be an even more severe change.
To work around the problem, set MDB_NOLOCK when calling mdb_env_open()
to stop LMDB from controlling concurrent access to the database and do
the necessary locking in named instead. Reuse the view->new_zone_lock
mutex for this purpose to prevent the need for modifying struct dns_view
(which would necessitate library API version bumps). Drop use of
MDB_NOTLS as it is made redundant by MDB_NOLOCK: MDB_NOTLS only affects
where LMDB reader locktable slots are stored while MDB_NOLOCK prevents
the reader locktable from being used altogether.
[1] 2fd44e3251
BUFSIZ (512 bytes on Windows) may not be enough to fit the status of a
DNSSEC policy and three DNSSEC keys.
Set the size of the relevant buffer to a hardcoded value of 4096 bytes,
which should be enough for most scenarios.
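As an illustration of why a BUFSIZ-sized buffer is fragile, the sketch
below appends lines to a fixed-size buffer and reports truncation. The
names are hypothetical and the code is not taken from named; it only
shows the pattern of using an explicit size instead of BUFSIZ and
checking the snprintf() return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Explicit size, independent of the platform's BUFSIZ. */
#define SKETCH_STATUS_BUFSIZE 4096

/*
 * Append 'line' plus a newline at offset *used in 'buf'. Returns false
 * if the text does not fit; *used is only advanced on success.
 * Precondition: *used < len.
 */
static inline bool
sketch_append_line(char *buf, size_t len, size_t *used, const char *line) {
	size_t avail = len - *used;
	int n = snprintf(buf + *used, avail, "%s\n", line);

	if (n < 0 || (size_t)n >= avail) {
		return false; /* output was truncated */
	}
	*used += (size_t)n;
	return true;
}
```

A 600-character status line does not fit in a 512-byte buffer but fits
easily in 4096 bytes.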
While the creation and publication times of the various keys in this
policy are nearly the same, there is a chance that one key is created a
second later than the other.
The `set_keytimes_algorithm_policy` function mistakenly set the keytimes
for KEY3 based on the "published" time from KEY2.
this changes most visible uses of master/slave terminology in tests.sh
and most uses of 'type master' or 'type slave' in named.conf files.
files in the checkconf test were not updated in order to confirm that
the old syntax still works. rpzrecurse was also left mostly unchanged
to avoid interference with DNSRPS.
it is now an error to have two primaries lists with the same
name. this is true regardless of whether the "primaries" or
"masters" keywords were used to define them.