In #3381 (and #3385), we committed a backward-incompatible change to
BIND 9.19.5, 9.18.7, and 9.16.33, explicitly requiring "inline-signing"
for every "dnssec-policy".
We did this backward-incompatible change deliberately, knowing the
consequences for users and their configurations. But if we didn't, say,
we were unaware this is a backward-incompatible change and fixed failing
systems test by "tweaking a knob to make the CI pass", we would not have
a second look before the change hits user configurations.
"cross-version-config-tests" CI job is such a second look. It will run
system tests from the latest release tag specific to the particular
branch (e.g., v9.19.12 for the "main" branch) with BIND 9 binaries from
the current "HEAD" (the future v9.19.13). This Frankenstein build gets
conceived by altering the "TOP_BUILDDIR" variable in
"bin/tests/system/conf.sh".
Caveats:
- Only system test configurations are tested; no actual test code is
run.
- Problems with namedN.conf configurations are not identified.
When backward-incompatible change is introduced, the CI job is expected
to fail. If the change is deliberate, the job will keep failing until
the version with the backward-incompatible change is tagged, and the
minor version in configure.ac is bumped.
(cherry picked from commit cc54211baa)
The setup.pl script has been replaced with static BIND configurations,
and in the course of this change, the unused ns1 server was removed.
This enhancement has greatly improved the overall test's readability.
(cherry picked from commit 08a8906cfc)
The shell version of the test was completed only after all DNS zone
updates were sent, even if the BIND server crashed while processing
them, leading to prolonged execution and potential hang in the CI
environment. The Python rewrite of the test ensures that DNS update
tasks finish within five minutes of starting, irrespective of a BIND
crash possibility or DNS zone updates not finishing in time.
(cherry picked from commit ecd7b30d0a)
The fstrm_capture utility is started in the background during the
"dnstap" system test. Consequently, "rndc dnstap-reopen" and similar
commands may be executed before fstrm_capture starts listening on the
Unix domain socket it is configured to receive dnstap data on. This
results in the dnstap data sent to that socket in the meantime to be
lost; while the fstrm writer thread is able to recover from such a
scenario within a couple of seconds (by reopening the configured dnstap
destination itself), only one write attempt is made for data
successfully queued to the writer thread, so dnstap frames can still be
lost in the process. This may happen during the "dnstap" system test,
leading to the dnstap output file being empty, which in turn causes the
test to fail.
Fix by waiting until fstrm_capture starts listening on the Unix domain
socket it is configured to use before asking named to reopen the
configured dnstap destination. Since various fstrm_capture versions log
different messages when the listening socket is set up, wait for a
common string that works for all fstrm_capture versions released to
date. Add a few extra debug messages indicating test progress and make
the test fail if the expected fstrm_capture log message is not generated
within 10 seconds.
(cherry picked from commit 26d3d97f12)
The fstrm_capture.out file is overwritten when the fstrm_capture utility
is restarted during the "dnstap" system test. Use a separate output
file for each fstrm_capture instance to ensure all output produced by
that tool during the "dnstap" system test is preserved for forensic
purposes.
(cherry picked from commit bd2941fc72)
Errors getting transfer statistics from named.run where not detected
as ret was not set to one if there hadn't been a success after looping
for a while.
(cherry picked from commit 287a1ac09b)
Commit dc6dafdad1 allows larger TTL values
in zones that go insecure, and ignores the maximum zone TTL.
This means that if you use TTL values larger than 1 day in your zone,
your zone runs the risk of going bogus before it moves safely to
insecure.
Most resolvers by default cap the maximum TTL that they cache RRsets,
at one day (Unbound, Knot, PowerDNS) so that is fine. However, BIND 9's
default is one week.
Change the default TTLsig to one week, so that also for BIND 9
resolvers in the default cases responses for zones that are going
insecure will not be evaluated as bogus.
This change does mean that when unsigning your zone, it will take six
days longer to safely go insecure, regardless of what TTL values you
use in the zone.
(cherry picked from commit 32686beabc)
these options concentrate zone maintenance actions into
bursts for the benefit of servers with intermittent connections.
that's no longer something we really need to optimize.
(cherry picked from commit eeeccec67c)
'HOME=value command' should only change HOME for command but on
some platforms this occasionally sets HOME for the rest of the
test. Explicitly isolate the enviroment change using a sub shell.
(cherry picked from commit 96f75bba18)
Allow larger TTL values in zones that go insecure. This is necessary
because otherwise the zone will not be loaded due to the max-zone-ttl
of P1D that is part of the current insecure policy.
In the keymgr.c code, default back to P1D if the max-zone-ttl is set
to zero.
(cherry picked from commit dc6dafdad1)
Check for the expected error message which includes rcode REFUSED
then reload the server to specify the keytab for the rest of the
GSSAPI tests.
(cherry picked from commit 3a2a24903c)
Add a unit test to check if the overmem purging in the RBTDB is
effective when mixed size RR data is inserted into the database.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Jinmei Tatuya <jtatuya@infoblox.com>
(manually picked from 269c03831f)
The resolve binary is affected by GL#4119 which occassionally makes it
hand during system tests when running with TSAN. This is a workaround to
avoid wasting resources caused by a CI timeout for the system test tsan
jobs.
The conditions that trigger the crash:
- a stale record is in cache
- stale-answer-client-timeout is 0
- multiple clients query for the stale record, enough of them to exceed
the recursive-clients quota
- the response from the authoritative is sufficiently delayed so that
recursive-clients quota is exceeded first
The reproducer attempts to simulate this situation. However, it hasn't
proven to be 100 % reproducible, especially in CI. When reproducing
locally, the priming query also seems to sometimes interfere and prevent
the crash. When the reproducer is ran twice, it appears to be more
reliable in reproducing the issue.
(cherry picked from commit f617512d37)
The keys directory should be cleaned up in clean.sh. Doing that in the
test itself isn't reliable which may lead to failing mkdir which causes
the test to fail with set -e.
(cherry picked from commit 062dfac28e)
When a primary server is not responding, mark it as temporarialy
unreachable. This will prevent too many zones queuing up on a
unreachable server and allow the refresh process to move onto
the next primary sooner once it has been so marked.
(cherry picked from commit 621c117101)
The detach (and possibly close) netmgr events can cause additional
callbacks to be called when under exclusive mode. The detach can
trigger next queued TCP query to be processed and close will call
configured close callback.
Move the detach and close netmgr events from the priority queue to the
normal queue as the detaching and closing the sockets can wait for the
exclusive mode to be over.
Because of a typo, the fetch.pl script tries to extract the server
address from the input parameter 'a' instead of 's'. Fix the typo.
(cherry picked from commit aa7538fd38)