All the places the qp-trie code was using `call_rcu()` needed
`__tsan_release()` and `__tsan_acquire()` annotations, so
add a couple of wrappers to encapsulate this pattern.
With these wrappers, the tests run almost clean under thread
sanitizer. The remaining problems are due to `rcu_barrier()`,
which can be suppressed using `.tsan-suppress`. The suppression
file does not cover the whole of `liburcu`, because we would like
thread sanitizer to detect problems in `call_rcu()` callbacks,
which are called from `liburcu`.
The CI jobs have been updated to use `.tsan-suppress` by
default, except for a special-case job that needs the
additional suppressions in `.tsan-suppress-extra`.
We might be able to get rid of some of this after liburcu gains
support for thread sanitizer.
Note: the `rcu_barrier()` suppression is not entirely effective:
tsan sometimes reports races that originate inside `rcu_barrier()`
but tsan has discarded the stack so it does not have the
information required to suppress the report. These "races" can
be made much easier to reproduce by adding `atexit_sleep_ms=1000`
to `TSAN_OPTIONS`. The problem with tsan's short memory can be
addressed by increasing `history_size`: when it is large enough
(6 or 7) the `rcu_barrier()` stack usually survives long enough
for suppression to work.
Previously, if an exception occurred inside the `with` block, the
error handler would wait indefinitely for the process to end. That would
never happen, since the termination signal was never sent to named, and
the test would get stuck.
Using a try-finally block ensures that the named process is always
killed and that any exceptions or errors are handled gracefully.
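The try-finally pattern described above can be sketched as follows. This is a minimal illustration, not the test's actual code: `run_with_cleanup` is a hypothetical helper, and a sleeping Python child process stands in for named.

```python
import subprocess
import sys


def run_with_cleanup(body):
    """Start a child process, run body(proc), and always stop the child."""
    # A sleeping Python interpreter stands in for the named process here.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        body(proc)
    finally:
        # Runs on success and on exception alike, so the child is always
        # signalled before we wait for it and wait() cannot block forever.
        proc.terminate()
        proc.wait(timeout=10)
```

Any exception raised by `body` still propagates to the caller, but only after the child process has been terminated and reaped.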
Improve code readability by splitting the test into more functions. Some
could be re-used later on for more general-purpose subprocess handling
or named checks.
Add more tests to the dnstap system test to roll with different values.
Touch some files to make sure the number of existing files exceeds the
number that we want to keep.
Add a test to the logfileconfig system test for the increment suffix.
When dns_request_create() failed in notify_send_toaddr(), sending the
notify would silently fail. When notify_done() failed, the error would
be logged on the DEBUG(2) level.
This commit remedies the situation by:
* Promoting several messages related to notifies to the INFO level and
adding a "success" log message at the INFO level
* Adding a TCP fallback - when sending the notify over UDP fails, named
will retry sending the notify over TCP and log the information at the
NOTICE level
* Logging a failure to send the notify over TCP at the WARNING level
Closes: #4001, #4002
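The fallback and log levels described above can be sketched as follows. This is an illustration only, not named's implementation: `send_notify` and its two callables are hypothetical, failures are modelled as `OSError`, and since Python's logging module has no NOTICE level, a custom level of 25 is assumed as a stand-in.

```python
import logging

# Python's logging module has no NOTICE level; 25 (between INFO and
# WARNING) is an assumed stand-in used only for this illustration.
NOTICE = 25
logging.addLevelName(NOTICE, "NOTICE")
log = logging.getLogger("notify-sketch")


def send_notify(send_udp, send_tcp):
    """Try UDP first, fall back to TCP; return the transport used or None."""
    try:
        send_udp()
    except OSError as exc:
        # UDP failure is not final: report it and retry over TCP.
        log.log(NOTICE, "notify over UDP failed (%s), retrying over TCP", exc)
    else:
        log.info("notify sent over UDP")
        return "udp"
    try:
        send_tcp()
    except OSError as exc:
        # Both transports failed; this is the case worth a warning.
        log.warning("notify over TCP failed (%s)", exc)
        return None
    log.info("notify sent over TCP")
    return "tcp"
```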
There is no 'ret' variable in this test, and 'ret=1' should clearly
be 'tmp=1' for the check to work correctly when the string is not
found in the log file.
Add a test case to cover #3679 where a user migrates from a KSK/ZSK
split using auto-dnssec maintain, to the default dnssec-policy (CSK).
The test actually does not use the default dnssec-policy, but it does
use one that has the same keys clause. For testing convenience, we use
the same propagation time values as other test cases that migrate to
dnssec-policy with a mismatching existing key set.
At the time of test number (19), there were 10 "sending packet to
10.53.0.7" lines in the "legacy/ns1/named.run" file; usually, only seven
are present:
    I:legacy:checking recursive lookup to edns 512 + no tcp server does not cause query loops (19)
    I:legacy:ns1 sent 10 queries to ns7, expected less than 10
    I:legacy:failed
The three extra lines can be attributed to tests "8", "10", and "18",
where the dig in "resolution_fails()" retried after a timeout and
subsequently succeeded with "status: SERVFAIL", as seen in each of the
dig.out.test{8,10,18} files:
    ;; communications error to 10.53.0.1#13093: timed out
    ; <<>> DiG 9.19.12-dev <<>> -p 13093 +tcp @10.53.0.1 edns512-notcp. TXT
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5368
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
This retry is unnecessary because "resolution_fails()" considers a
timeout a positive result.
This change makes the zone table lock-free for reads. Previously, the
zone table used a red-black tree, which is not thread safe, so the hot
read path acquired both the per-view mutex and the per-zonetable
rwlock. (The double locking was there to fix cleanup races on shutdown.)
One visible difference is that zones are not necessarily shut down
promptly: it depends on when the qp-trie garbage collector cleans up
the zone table. The `catz` system test checks several times that zones
have been deleted; the test now checks for zones to be removed from
the server configuration, instead of being fully shut down. The catz
test does not churn through enough zones to trigger a gc, so the zones
are not fully detached until the server exits.
After this change, it is still possible to improve the way we handle
changes to the zone table, for instance, batching changes, or better
compaction heuristics.
dns.resolver.Resolver.resolve() requires dnspython >= 2.0.0. This
wasn't enforced in the shutdown system test, leading to an infinite
loop waiting for the server to start because the resolve() call failed.
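A minimum-version guard of this kind can be sketched with the standard library alone. The helper names below are hypothetical, and the parsing assumes plain dotted numeric versions such as "2.0.0". In a pytest-based test, `pytest.importorskip()` with its `minversion` argument is one way to express a similar guard.

```python
def parse_version(text):
    """Turn '2.0.0' into (2, 0, 0); parsing stops at a non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that "2.0" and "2.0.0" compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

A test would check `meets_minimum(installed_version, "2.0.0")` before calling resolve(), and skip instead of looping when it returns False.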
We don't need a separate module/file for every test. Both the rpz tests
could live in the same file.
The setup/teardown of servers is performed separately for each module --
unless there is a need to do that, it's better to avoid it.
This adds a rudimentary test for response-policy zones in multiple
views. Different combinations are tested:
- two views with response-policy inherited from options {};
- two views with explicit response-policy using the same RPZ zone name
- two views with explicit response-policy using a secondary RPZ zone
* nsupdate should take 12 seconds (one try and three retries with
3 second timeout for each), UDP mode
* nsupdate -u 4 -r 1 should take 8 seconds (one try and one retry with
4 second timeout for each), UDP mode
* nsupdate -u 0 -t 8 -r 1 should also take 8 seconds, UDP mode
* nsupdate -u 4 -t 30 -r 1 should also take 8 seconds, as -u takes
precedence over -t, UDP mode
* nsupdate -t 8 -v should also take 8 seconds, TCP mode
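The expected durations listed above follow from simple arithmetic, sketched below. `expected_duration` is a hypothetical helper, not part of nsupdate. The defaults (3-second UDP timeout, 3 retries, 300-second overall timeout) are taken from the nsupdate documentation, and the rule that `-u 0` derives the per-try interval as the overall timeout divided by the number of tries is an assumption made for this illustration.

```python
def expected_duration(udp_timeout=3, udp_retries=3, timeout=300, use_tcp=False):
    """Seconds nsupdate is expected to spend against an unresponsive server."""
    if use_tcp:
        # In TCP mode (-v) only the overall timeout (-t) applies.
        return timeout
    tries = udp_retries + 1  # one initial try plus the retries
    if udp_timeout == 0:
        # Assumed rule: with -u 0 the interval is derived from -t and -r.
        udp_timeout = timeout / tries
    return udp_timeout * tries
```

With these assumptions, the defaults give 3 * 4 = 12 seconds, and `-u 4 -r 1` gives 4 * 2 = 8 seconds regardless of `-t`.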
The checkds system test could fail if some parent secondary servers had
not yet loaded all the zones before ns9 started sending DS queries. This
leads to SERVFAIL responses, while the test case expects good DS
responses. To mitigate this issue, call 'rndc loadkeys' to quickly
restart the checkds procedure.
Also refactor the checkds system test, to get rid of the many zone
name duplications. Update the functions 'zone_check' and
'keystate_check' to make the zone name an FQDN so we can just pass
the 'zone' variable into the function.
If the 'checkds' option is not explicitly set, check whether there are
'parental-agents' configured for the zone. If so, default to "explicit",
otherwise default to "yes".
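The defaulting rule above can be sketched as follows. `effective_checkds` is a hypothetical helper written for illustration, not the function used in named. It assumes an unset option is represented as `None` and the configured parental agents as a list.

```python
def effective_checkds(configured, parental_agents):
    """Return the checkds mode to use for a zone."""
    if configured is not None:
        # An explicitly set option always wins.
        return configured
    # Not set: the default depends on whether parental-agents are configured.
    return "explicit" if parental_agents else "yes"
```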
Add two new checkds test servers, that are hidden secondaries (hidden
as in not published in the NS RRset), that can be used specifically
for testing explicitly configured parental-agents.
Implement the new feature, automatic parental-agents. This is enabled
with 'checkds yes'.
When set to 'yes', instead of querying the explicitly configured
parental agents, look up the parental agents by resolving the parent
NS records. The found parent NS RRset is considered to be the list
of parental agents that should be queried during a KSK rollover,
looking up the DS RRset corresponding to the key signing keys.
For each NS record, look up the addresses in the ADB. These addresses
will be used to send the DS requests. Count the number of servers and
keep track of how many good DS responses were seen.
The previous test cases already test the more complex case where there
are empty non-terminals between the child apex and the parent domain.
Add a test case where this is not the case, to execute the other code
path.
Add test cases for when checkds is disabled. Copy the test cases that
would have resulted in a DSPublish or DSRemoved and make sure that
with 'checkds no' the metadata is not set.
Add the test cases for automatic parental-agents, i.e. when 'checkds'
is set to 'yes'. Split out the special cases that use a reference
or a resolver as parental-agent so that the common use cases can be
tested with the same function.
Make the checkds system test more structured in preparation for the
many more test cases to come. Add a README for clarity.
Update the 'has_signed_apex_nsec' helper function so it can take any
domain name regardless of the number of labels.
Change the DNS tree structure such that we have different TLD names
for the various test scenarios, because we need servers that respond
differently to DS queries. Note that this isn't applicable to the
existing "checkds explicit" test cases, but is preparation work for
testing "checkds yes" (automatic parental agents).
Add a trust-anchor to the server that will be querying for parent
NS records.
Add a new configuration option to set how the checkds method should
work. Acceptable values are 'yes', 'no', and 'explicit'.
When set to 'yes', the checkds method is to look up the parental agents
by querying the NS records of the parent zone.
When set to 'no', no checkds method is enabled. Users should run
the 'rndc checkds' command to signal that DS records are published and
withdrawn.
When set to 'explicit', the parental agents are explicitly configured
with the 'parental-agents' configuration option.
Clean up the remnants of MS Compiler bits from <isc/refcount.h> and
from the information printed in named/main.c, and clean up some comments
about Windows that no longer apply.
The bits in picohttpparser.{h,c} were left out, because it's not our
code.
hypothesis prior to 4.41.2 uses hashlib.md5, which is not FIPS
compliant, causing the wildcard system test to fail. Check whether
we are running in FIPS mode and, if so, require hypothesis 4.41.2
as the minimum version we will accept.
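The version selection described above can be sketched as follows. Both helpers are hypothetical and are not the check used by the test suite. Reading `/proc/sys/crypto/fips_enabled` is a Linux-specific way to detect FIPS mode and is assumed here purely for illustration.

```python
def fips_enabled(path="/proc/sys/crypto/fips_enabled"):
    """Return True if the given flag file reports FIPS mode."""
    try:
        with open(path) as flag:
            return flag.read().strip() == "1"
    except OSError:
        # A missing or unreadable file is treated as "not in FIPS mode".
        return False


def hypothesis_minimum(fips):
    """Return the extra minimum hypothesis version, or None if none applies."""
    # 4.41.2 is the first release that no longer relies on hashlib.md5.
    return "4.41.2" if fips else None
```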