This commit add 4 tests for the new option:
1. Test default configuration of stale-answer-client-timeout, a
value of 1.8 seconds, with stale-refresh-time disabled.
2. Test disabling of stale-answer-client-timeout.
3. Test stale-answer-client-timeout with a value of zero, in this
case we take advantage of a log entry which shows that a stale
answer was promptly used before an attempt to refresh the RRset
is made. We also check, by activating a disabled authoritative
server, that the RRset was successfully refreshed after that.
4. Test stale-answer-client-timeout 0 with stale-refresh-time 4, in
this test we want to ensure a couple things:
- If we have a stale RRSet entry in cache, a request must be
promptly answered with this data, while BIND must also attempt
to refresh the RRSet in background.
- If the attempt to refresh the RRSet times out, the RRSet must
have its stale-refresh-time window activated.
- If a new request for the same RRSet arrives, it must be
promptly answered with stale data due to stale-refresh-time
being active for this RRSet, in this case no attempt to refresh
the RRSet is made.
- Enable authoritative server, ensure that the RRSet was not
refreshed, to honor stale-refresh-time.
- Wait for stale-refresh-window time pass, send another request
for the same RRSet, this time we expect the answer to be the
stale entry in cache being hit due to
stale-answer-client-timeout 0.
- Send another request, this time we expect the answer to be an
active RRSet, since it must have been refreshed during the
previous request.
(cherry picked from commit 35fd039d03)
This commit allows to specify "disabled" or "off" in
stale-answer-client-timeout statement. The logic to support this
behavior will be added in the subsequent commits.
This commit also ensures an upper bound to stale-answer-client-timeout
which equals to one second less than 'resolver-query-timeout'.
(cherry picked from commit 0ad6f594f6)
After the addition of stale-answer-client-timeout a test was broken due
to the following behavior expected by the test.
1. Prime cache data.example txt.
2. Disable authoritative server.
3. Send a query for data.example txt.
4. Recursive server will timeout and answer from cache with stale RRset.
5. Recursive server will activate stale-refresh-time due to the previous
failure in attempting to refresh the RRset.
6. Send a query for data.example txt.
7. Expect stale answer from cache due to stale-refresh-time
window being active, even if authoritative server is up.
Problem is that in step 4, due to the new option
stale-answer-client-timeout, recursive server will answer with stale
data before the actual fetch completes.
Since the original fetch is still running in background, if we re-enable
the authoritative server during that time, the RRset will actually be
successfully refreshed, and stale-refresh-window will not be activated.
The next queries will fail because they expect the TTL of the RRset to
match the one in the stale cache, not the one just refreshed.
To solve this, we explicitly disable stale-answer-client-timeout for
this test, as it's not the feature we are interested in testing here
anyways.
(cherry picked from commit a12bf4b61b)
The general logic behind the addition of this new feature works as
folows:
When a client query arrives, the basic path (query.c / ns_query_recurse)
was to create a fetch, waiting for completion in fetch_callback.
With the introduction of stale-answer-client-timeout, a new event of
type DNS_EVENT_TRYSTALE may invoke fetch_callback, whenever stale
answers are enabled and the fetch took longer than
stale-answer-client-timeout to complete.
When an event of type DNS_EVENT_TRYSTALE triggers fetch_callback, we
must ensure that the folowing happens:
1. Setup a new query context with the sole purpose of looking up for
stale RRset only data, for that matters a new flag was added
'DNS_DBFIND_STALEONLY' used in database lookups.
. If a stale RRset is found, mark the original client query as
answered (with a new query attribute named NS_QUERYATTR_ANSWERED),
so when the fetch completion event is received later, we avoid
answering the client twice.
. If a stale RRset is not found, cleanup and wait for the normal
fetch completion event.
2. In ns_query_done, we must change this part:
/*
* If we're recursing then just return; the query will
* resume when recursion ends.
*/
if (RECURSING(qctx->client)) {
return (qctx->result);
}
To this:
if (RECURSING(qctx->client) && !QUERY_STALEONLY(qctx->client)) {
return (qctx->result);
}
Otherwise we would not proceed to answer the client if it happened
that a stale answer was found when looking up for stale only data.
When an event of type DNS_EVENT_FETCHDONE triggers fetch_callback, we
proceed as before, resuming query, updating stats, etc, but a few
exceptions had to be added, most important of which are two:
1. Before answering the client (ns_client_send), check if the query
wasn't already answered before.
2. Before detaching a client, e.g.
isc_nmhandle_detach(&client->reqhandle), ensure that this is the
fetch completion event, and not the one triggered due to
stale-answer-client-timeout, so a correct call would be:
if (!QUERY_STALEONLY(client)) {
isc_nmhandle_detach(&client->reqhandle);
}
Other than these notes, comments were added in code in attempt to make
these updates easier to follow.
(cherry picked from commit 171a5b7542)
the taskset command used for the cpu system test seems
to be failing under vmware, causing a test failure. we
can try the taskset command and skip the test if it doesn't
work.
(cherry picked from commit a8a49bb783)
The -E option does not default to pkcs11 if --with-pkcs11 is set,
but always needs to be set explicitly.
(cherry picked from commit 0536375d4cf61c9b570a32e808dde78a7ef859bf)
Coverity Scan identified the following issue in bin/named/zoneconf.c:
*** CID 314969: Control flow issues (DEADCODE)
/bin/named/zoneconf.c: 2212 in named_zone_inlinesigning()
if (!inline_signing && !zone_is_dynamic &&
cfg_map_get(zoptions, "dnssec-policy", &signing) == ISC_R_SUCCESS &&
signing != NULL)
{
if (strcmp(cfg_obj_asstring(signing), "none") != 0) {
inline_signing = true;
>>> CID 314969: Control flow issues (DEADCODE)
>>> Execution cannot reach the expression ""no"" inside this statement: "dns_zone_log(zone, 1, "inli...".
dns_zone_log(
zone, ISC_LOG_DEBUG(1), "inline-signing: %s",
inline_signing
? "implicitly through dnssec-policy"
: "no");
} else {
...
}
}
This is because we first set 'inline_signing = true' and then check
its value in 'dns_zone_log'.
(cherry picked from commit 8df629d0b2)
this changes most visble uses of master/slave terminology in tests.sh
and most uses of 'type master' or 'type slave' in named.conf files.
files in the checkconf test were not updated in order to confirm that
the old syntax still works. rpzrecurse was also left mostly unchanged
to avoid interference with DNSRPS.
(cherry picked from commit e43b3c1fa1)
it is now an error to have two primaries lists with the same
name. this is true regardless of whether the "primaries" or
"masters" keywords were used to define them.
(cherry picked from commit f619708bbf)
as "type primary" is preferred over "type master" now, it makes
sense to make "primaries" available as a synonym too.
added a correctness check to ensure "primaries" and "masters"
cannot both be used in the same zone.
(cherry picked from commit 16e14353b1)
The mkeys system test started to fail after introducing support for
zones transitioning to unsigned without going bogus. This is because
there was actually a bug in the code: if you reconfigure a zone and
remove the "auto-dnssec" option, the zone is actually still DNSSEC
maintained. This is because in zoneconf.c there is no call
to 'dns_zone_setkeyopt()' if the configuration option is not used
(cfg_map_get(zoptions, "auto-dnssec", &obj) will return an error).
The mkeys system test implicitly relied on this bug: initially the
root zone is being DNSSEC maintained, then at some point it needs to
reset the root zone in order to prepare for some tests with bad
signatures. Because it needs to inject a bad signature, 'auto-dnssec'
is removed from the configuration.
The test pass but for the wrong reasons:
I:mkeys:reset the root server
I:mkeys:reinitialize trust anchors
I:mkeys:check positive validation (18)
The 'check positive validation' test works because the zone is still
DNSSEC maintained: The DNSSEC records in the signed root zone file on
disk are being ignored.
After fixing the bug/introducing graceful transition to insecure,
the root zone is no longer DNSSEC maintained after the reconfig.
The zone now explicitly needs to be reloaded because otherwise the
'check positive validation' test works against an old version of the
zone (the one with all the revoked keys), and the test will obviously
fail.
(cherry picked from commit 2fc42b598b)
Configure "none" as a builtin policy. Change the 'cfg_kasp_fromconfig'
api so that the 'name' will determine what policy needs to be
configured.
When transitioning a zone from secure to insecure, there will be
cases when a zone with no DNSSEC policy (dnssec-policy none) should
be using KASP. When there are key state files available, this is an
indication that the zone once was DNSSEC signed but is reconfigured
to become insecure.
If we would not run the keymgr, named would abruptly remove the
DNSSEC records from the zone, making the zone bogus. Therefore,
change the code such that a zone will use kasp if there is a valid
dnssec-policy configured, or if there are state files available.
(cherry picked from commit cf420b2af0)
Add two test zones that will be reconfigured to go insecure, by
setting the 'dnssec-policy' option to 'none'.
One zone was using inline-signing (implicitly through dnssec-policy),
the other is a dynamic zone.
Two tweaks to the kasp system test are required: we need to set
when to except the CDS/CDS Delete Records, and we need to know
when we are dealing with a dynamic zone (because the logs to look for
are slightly different, inline-signing prints "(signed)" after the
zone name, dynamic zones do not).
(cherry picked from commit fa2e4e66b0)
Add a test to check BIND 9 honors CPU affinity mask. This requires
some changes to the start script, to construct the named command.
(cherry picked from commit f1a097964c)
When using the `unixtime` or `date` method to update the SOA serial,
`named` and `dnssec-signzone` would silently fallback to `increment`
method to prevent the new serial number to be smaller than the old
serial number (using the serial number arithmetics). Add a warning
message when such fallback happens.
(cherry picked from commit ef685bab5c)
This commit extends the perl Configure script to also check for libssl
in addition to libcrypto and change the vcxproj source files to link
with both libcrypto and libssl.
This is a part of the works that intends to make the netmgr stable,
testable, maintainable and tested. It contains a numerous changes to
the netmgr code and unfortunately, it was not possible to split this
into smaller chunks as the work here needs to be committed as a complete
works.
NOTE: There's a quite a lot of duplicated code between udp.c, tcp.c and
tcpdns.c and it should be a subject to refactoring in the future.
The changes that are included in this commit are listed here
(extensively, but not exclusively):
* The netmgr_test unit test was split into individual tests (udp_test,
tcp_test, tcpdns_test and newly added tcp_quota_test)
* The udp_test and tcp_test has been extended to allow programatic
failures from the libuv API. Unfortunately, we can't use cmocka
mock() and will_return(), so we emulate the behaviour with #define and
including the netmgr/{udp,tcp}.c source file directly.
* The netievents that we put on the nm queue have variable number of
members, out of these the isc_nmsocket_t and isc_nmhandle_t always
needs to be attached before enqueueing the netievent_<foo> and
detached after we have called the isc_nm_async_<foo> to ensure that
the socket (handle) doesn't disappear between scheduling the event and
actually executing the event.
* Cancelling the in-flight TCP connection using libuv requires to call
uv_close() on the original uv_tcp_t handle which just breaks too many
assumptions we have in the netmgr code. Instead of using uv_timer for
TCP connection timeouts, we use platform specific socket option.
* Fix the synchronization between {nm,async}_{listentcp,tcpconnect}
When isc_nm_listentcp() or isc_nm_tcpconnect() is called it was
waiting for socket to either end up with error (that path was fine) or
to be listening or connected using condition variable and mutex.
Several things could happen:
0. everything is ok
1. the waiting thread would miss the SIGNAL() - because the enqueued
event would be processed faster than we could start WAIT()ing.
In case the operation would end up with error, it would be ok, as
the error variable would be unchanged.
2. the waiting thread miss the sock->{connected,listening} = `true`
would be set to `false` in the tcp_{listen,connect}close_cb() as
the connection would be so short lived that the socket would be
closed before we could even start WAIT()ing
* The tcpdns has been converted to using libuv directly. Previously,
the tcpdns protocol used tcp protocol from netmgr, this proved to be
very complicated to understand, fix and make changes to. The new
tcpdns protocol is modeled in a similar way how tcp netmgr protocol.
Closes: #2194, #2283, #2318, #2266, #2034, #1920
* The tcp and tcpdns is now not using isc_uv_import/isc_uv_export to
pass accepted TCP sockets between netthreads, but instead (similar to
UDP) uses per netthread uv_loop listener. This greatly reduces the
complexity as the socket is always run in the associated nm and uv
loops, and we are also not touching the libuv internals.
There's an unfortunate side effect though, the new code requires
support for load-balanced sockets from the operating system for both
UDP and TCP (see #2137). If the operating system doesn't support the
load balanced sockets (either SO_REUSEPORT on Linux or SO_REUSEPORT_LB
on FreeBSD 12+), the number of netthreads is limited to 1.
* The netmgr has now two debugging #ifdefs:
1. Already existing NETMGR_TRACE prints any dangling nmsockets and
nmhandles before triggering assertion failure. This options would
reduce performance when enabled, but in theory, it could be enabled
on low-performance systems.
2. New NETMGR_TRACE_VERBOSE option has been added that enables
extensive netmgr logging that allows the software engineer to
precisely track any attach/detach operations on the nmsockets and
nmhandles. This is not suitable for any kind of production
machine, only for debugging.
* The tlsdns netmgr protocol has been split from the tcpdns and it still
uses the old method of stacking the netmgr boxes on top of each other.
We will have to refactor the tlsdns netmgr protocol to use the same
approach - build the stack using only libuv and openssl.
* Limit but not assert the tcp buffer size in tcp_alloc_cb
Closes: #2061
(cherry picked from commit 634bdfb16d)
The DNS Flag Day 2020 reduced all the EDNS buffer sizes to 1232. In
this commit, we revert the default value for nocookie-udp-size back to
4096 because the option is too obscure and most people don't realize
that they also need to change this configuration option in addition to
max-udp-size.
(cherry picked from commit 79c196fc77)
Since the queries sent towards root and TLD servers are now included in
the count (as a result of the fix for CVE-2020-8616),
"max-recursion-queries" has a higher chance of being exceeded by
non-attack queries. Increase its default value from 75 to 100.
(cherry picked from commit ab0bf49203)
The traceback files could overwrite each other on systems which do not
use different core dump file names for different processes. Prevent
that by writing the traceback file to the same directory as the core
dump file.
These changes still do not prevent the operating system from overwriting
a core dump file if the same binary crashes multiple times in the same
directory and core dump files are named identically for different
processes.
(cherry picked from commit 6428fc26af)