Currently, the outgoing UDP sockets have SO_REUSEADDR (SO_REUSEPORT on
BSDs) enabled, which allows multiple UDP sockets to bind to the same
address+port. There's one caveat though - only a single socket (the
last one) is going to receive all the incoming traffic. This in turn
could lead to an incoming DNS message being matched to an invalid
dns_dispatch and getting dropped.
Disable SO_REUSEADDR on the outgoing UDP sockets. This
needs to be done explicitly because `uv_udp_open()` silently enables the
option on the socket.
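A minimal sketch of the explicit reset, assuming the libuv handle wraps
a plain POSIX socket; the helper name and error handling are
illustrative, not the actual BIND code:

    #include <sys/socket.h>
    #include <uv.h>

    /* Explicitly clear the option that uv_udp_open() silently enabled. */
    static int
    udp_disable_reuseaddr(uv_udp_t *udp) {
            uv_os_fd_t fd;
            int off = 0;

            if (uv_fileno((uv_handle_t *)udp, &fd) != 0) {
                    return (-1);
            }
    #if defined(SO_REUSEPORT)
            (void)setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &off,
                             sizeof(off));
    #endif
            return (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &off,
                               sizeof(off)));
    }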
(cherry picked from commit eec30c33c2)
As the relaxed memory ordering doesn't ensure any memory
synchronization, it is possible that the increment will succeed even
in the case when it should not - there is a race between
atomic_fetch_sub(..., acq_rel) and atomic_fetch_add(..., relaxed).
Only the result is consistent, but the previous value seen by both
calls could be the same when they are executed at the same time.
(cherry picked from commit 88227ea665)
We currently set SO_INCOMING_CPU incorrectly, and testing by Ondrej
shows that fixing the issue and setting affinities is worse than letting
the kernel schedule threads without constraints. So we should not set
SO_INCOMING_CPU anymore.
(cherry picked from commit 8b8149cdd2)
If the operating system UDP queue gets full and the outgoing UDP sending
starts to be delayed, BIND 9 could exhibit memory spikes as it tries to
enqueue all the outgoing UDP messages. As those are not going to be
delivered anyway (as we argued when we stopped enlarging the operating
system send and receive buffers), try to send the UDP messages directly
using `uv_udp_try_send()` and if that fails, drop the outgoing UDP
message.
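A sketch of the intended pattern; the helper name and the buffer
ownership are illustrative, not the actual netmgr code:

    #include <stdlib.h>
    #include <uv.h>

    static void
    udp_send_or_drop(uv_udp_t *udp, uv_buf_t *buf,
                     const struct sockaddr *peer) {
            /* Best-effort synchronous send; never enqueue on the loop. */
            int r = uv_udp_try_send(udp, buf, 1, peer);

            if (r < 0) {
                    /* e.g. UV_EAGAIN: the OS queue is full, drop it. */
                    free(buf->base);
            }
    }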
(cherry picked from commit b576c4c977)
Administrators may wish to constrain the set of cores that BIND 9 runs
on via the 'taskset', 'cpuset' or 'numactl' programs (or equivalent on
other O/S), for example to achieve higher (or more stable) performance
by more closely associating threads with individual NIC rx queues. If
the admin has used taskset, it follows that BIND ought to
automatically use the given number of CPUs rather than the system-wide
count.
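A minimal Linux-only sketch of that idea (the helper name is an
illustration, not the actual BIND code):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>

    /* Prefer the size of the process affinity mask (as set by taskset,
     * cpuset or numactl) over the system-wide number of CPUs. */
    static long
    available_cpus(void) {
            cpu_set_t set;

            if (sched_getaffinity(0, sizeof(set), &set) == 0) {
                    return (CPU_COUNT(&set));
            }
            return (sysconf(_SC_NPROCESSORS_ONLN));
    }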
Co-Authored-By: Ray Bellis <ray@isc.org>
(cherry picked from commit 5a2df8caf5)
Although the manual page of malloc_usable_size says:
Although the excess bytes can be over‐written by the application
without ill effects, this is not good programming practice: the
number of excess bytes in an allocation depends on the underlying
implementation.
it looks like the premise is broken with _FORTIFY_SOURCE=3 on newer
systems and it might return a value that causes the program to stop
with "buffer overflow detected" from _FORTIFY_SOURCE. As we have our
own implementation that tracks the allocation size, we can stop relying
on this introspection function.
Also the newer manual page for malloc_usable_size changed the NOTES to:
The value returned by malloc_usable_size() may be greater than the
requested size of the allocation because of various internal
implementation details, none of which the programmer should rely on.
This function is intended to only be used for diagnostics and
statistics; writing to the excess memory without first calling
realloc(3) to resize the allocation is not supported. The returned
value is only valid at the time of the call.
Remove usage of both malloc_usable_size() and malloc_size() to be on
the safe side and only use the internal size tracking mechanism when
jemalloc is not available.
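A rough sketch of such a size-tracking fallback; this is simplified and
the names are illustrative, not the actual isc_mem implementation:

    #include <stddef.h>
    #include <stdlib.h>

    /* Keep the requested size in a header in front of the returned
     * pointer instead of asking malloc_usable_size()/malloc_size(). */
    typedef union {
            size_t size;
            max_align_t alignment; /* keep the payload suitably aligned */
    } size_header_t;

    static void *
    sized_allocate(size_t size) {
            size_header_t *hdr = malloc(sizeof(*hdr) + size);

            if (hdr == NULL) {
                    return (NULL);
            }
            hdr->size = size;
            return (hdr + 1);
    }

    static void
    sized_free(void *ptr) {
            if (ptr != NULL) {
                    free((size_header_t *)ptr - 1);
            }
    }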
(cherry picked from commit d61712d14e)
The ISC_ATTR_UNUSED macro was missing in BIND 9.18, which
complicated things when backporting merge requests from main.
As __attribute__((__unused__)) is ubiquitous, just define the
macro.
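For reference, the definition can be as simple as the following sketch
(the exact compiler guards in BIND may differ):

    #if defined(__GNUC__) || defined(__clang__)
    #define ISC_ATTR_UNUSED __attribute__((__unused__))
    #else
    #define ISC_ATTR_UNUSED
    #endif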
A new version of clang (19) has introduced stricter checks when mixing
integer (and float) types with enums. In this case, we used enum {}
because C17 doesn't have constexpr yet. Change the time conversion
constants to #defined constants because the RHEL 8 compiler doesn't
consider static const unsigned int to be a constant.
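A sketch of the change; the constant names are illustrative:

    /* Before: enum { MS_PER_SEC = 1000, US_PER_MS = 1000, ... };
     * clang 19 warns when such enumerators are mixed with integer or
     * float types, and the RHEL 8 compiler does not accept
     * `static const unsigned int` where a constant expression is
     * required, so use plain #defines instead. */
    #define MS_PER_SEC 1000U
    #define US_PER_MS  1000U
    #define NS_PER_US  1000U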
(cherry picked from commit b03e90e0d4)
Instead of directly using the result of dirfd() in the unlinkat() call,
check whether the returned file descriptor is actually valid. That
doesn't really change the logic as the unlinkat() would fail with
invalid descriptor anyway, but this is cleaner and will report the right
error returned directly by dirfd() instead of EBADF from unlinkat().
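A minimal sketch of the checked call (illustrative, not the exact BIND
code):

    #include <dirent.h>
    #include <unistd.h>

    static int
    remove_entry(DIR *dir, const char *name) {
            int fd = dirfd(dir);

            if (fd == -1) {
                    return (-1); /* errno was set by dirfd() itself */
            }
            return (unlinkat(fd, name, 0));
    }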
(cherry picked from commit 59f4fdebc0)
The getifaddr() code has worked fine for years, so we don't have to
keep the callback that parses /proc/net/if_inet6 anymore.
(cherry picked from commit 2fbf9757b8)
clang-scan 19 has reported that we are ignoring errno after the call
to rewind(). As we don't really care about the result, just silence
the error; the whole code will be removed in the development version
anyway as it is not needed.
(cherry picked from commit dda5ba53df)
When the SSL object was destroyed, it would invalidate all SSL_SESSION
objects including the cached, but not yet used, TLS session objects.
Properly disassociate the SSL object from the SSL_SESSION before we
store it in the TLS session cache, so we can later destroy it without
invalidating the cached TLS sessions.
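A hedged sketch of the idea using the public OpenSSL API; this is an
illustration of the disassociation step, not the exact BIND change:

    #include <openssl/ssl.h>

    /* Take our own reference to the session and drop the SSL object's
     * association with it, so destroying the SSL object later cannot
     * invalidate the cached session. */
    static SSL_SESSION *
    detach_session_for_cache(SSL *ssl) {
            SSL_SESSION *sess = SSL_get1_session(ssl); /* +1 reference */

            if (sess != NULL) {
                    SSL_set_session(ssl, NULL);
            }
            return (sess); /* the cache later calls SSL_SESSION_free() */
    }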
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
Co-authored-by: Aram Sargsyan <aram@isc.org>
(cherry picked from commit c11b736e44)
When a TLS (TLSstream) connection was accepted, the child listening
socket was not attached to sock->server and thus it could have been
freed before all the accepted connections were actually closed.
In turn, this would cause us to call isc_tls_free() too soon, causing
cascading errors in pending SSL_read_ex() calls in the accepted
connections.
Properly attach and detach the child listening socket when accepting
and closing the server connections.
(cherry picked from commit 684f3eb8e6)
Don't run more events than already scheduled. If the quantum is set to
a high value, task_run() would execute the already scheduled events,
plus all new events that result from running event->ev_action().
Setting the quantum to the number of scheduled events postpones events
scheduled after we enter the loop here to the next task_run()
invocation.
Instead of randomly using -1 or 1 as a failure status, properly utilize
the EXIT_FAILURE define that's platform specific (as it should be).
(cherry picked from commit 76997983fde02d9c32aa23bda30b65f1ebd4178c)
On a 32-bit machine casting to size_t can still lead to an overflow.
Cast to uint64_t instead. Also detect all possible negative values of
pages and pagesize to silence the warning about a possible negative
value.
    39 #if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
       1. tainted_data_return: Called function sysconf(_SC_PHYS_PAGES),
          and a possible return value may be less than zero.
       2. assign: Assigning: pages = sysconf(_SC_PHYS_PAGES).
    40         long pages = sysconf(_SC_PHYS_PAGES);
    41         long pagesize = sysconf(_SC_PAGESIZE);
    42
       3. Condition pages == -1, taking false branch.
       4. Condition pagesize == -1, taking false branch.
    43         if (pages == -1 || pagesize == -1) {
    44                 return (0);
    45         }
    46
       5. overflow: The expression (size_t)pages * pagesize might be
          negative, but is used in a context that treats it as unsigned.
    CID 498034: (#1 of 1): Overflowed return value (INTEGER_OVERFLOW)
       6. return_overflow: (size_t)pages * pagesize, which might have
          underflowed, is returned from the function.
    47         return ((size_t)pages * pagesize);
    48 #endif /* if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE) */
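A sketch of the fixed function following that advice (simplified; the
function name and return type in BIND may differ):

    #include <stdint.h>
    #include <unistd.h>

    static uint64_t
    physical_memory(void) {
    #if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
            long pages = sysconf(_SC_PHYS_PAGES);
            long pagesize = sysconf(_SC_PAGESIZE);

            /* Reject every possible negative (error) value, not just -1. */
            if (pages <= 0 || pagesize <= 0) {
                    return (0);
            }
            /* Multiply in 64 bits so a 32-bit size_t cannot overflow. */
            return ((uint64_t)pages * (uint64_t)pagesize);
    #else
            return (0);
    #endif
    }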
(cherry picked from commit e8dbc5db92)
This commit ensures that we restart reading only when all DNS data in
the input buffer is processed, so that we will not get into a
situation where the buffer is overrun.
Be more aggressive when throttling the reading - when we can't send the
outgoing TCP data synchronously with uv_try_write(), we start
throttling the reading immediately instead of waiting for the send
buffers to fill up. This should not affect well-behaved clients that
read the data from the TCP connection on the other end.
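An illustrative sketch of this pattern with plain libuv calls (the real
netmgr code is more involved; the names are not the actual functions):

    #include <uv.h>

    static void
    send_and_maybe_throttle(uv_stream_t *stream, uv_write_t *req,
                            uv_buf_t *buf, uv_write_cb done_cb) {
            int r = uv_try_write(stream, buf, 1);

            if (r == (int)buf->len) {
                    return; /* sent synchronously, keep reading */
            }
            if (r > 0) {
                    buf->base += r; /* partial write, queue the rest */
                    buf->len -= r;
            }
            uv_write(req, stream, buf, 1, done_cb);
            uv_read_stop(stream); /* throttle now, resume from done_cb */
    }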
(cherry picked from commit bc3e713317)
This commit ensures that socket objects use smaller sizes for their
internal request and handle pools. That prevents the memory allocator
from thrashing.
When a peer is not reading the data we are sending, it was possible
for the TLS DNS code to end up in a situation where it would
indefinitely reschedule send requests, effectively turning the
'uv_loop' into a busy loop that would consume CPU cycles in endless
efforts to send outgoing data.
The main reason for that was that only one send buffer was dedicated
to sends: the code would re-queue sends until the buffer was empty,
which would never happen when the remote side is not reading data.
That seems like an omission from the early days of the Network
Manager, as it is quite simple to make the code use multiple buffers
for sends. That ultimately breaks the cycle of futile send request
rescheduling.
As a side effect, this commit also gets rid of one memory copy on a
hot path.
Due to an omission it was possible to un-throttle a TCP connection
previously throttled due to the peer not reading back the data we are
sending.
In particular, that affected the DoH code, but it could also affect
other transports (current or future ones) that pause/resume reading
according to their internal state.
(cherry picked from commit d228aa8bbb944fbd04baf22d151fde5c33561e26)
A single TCP read can produce as many DNS messages as 64k divided by
the minimum size of a DNS message. This can clog the processing thread
and thrash the memory allocator because we need to do as many as ~20k
allocations in a single UV loop tick.
Limit the number of DNS messages processed in a single UV loop tick to
just a single DNS message, and limit the number of outstanding DNS
messages back to 23. This effectively limits the number of pipelined
DNS messages to that number (this is the limit we already had before).
This reverts commit 780a89012d.
When a TCP client does not read the DNS messages sent to it, the TCP
sends inside named would accumulate and cause degradation of the
service. Throttle the reading from the TCP socket when we accumulate
enough DNS data to be sent. Currently, the limit is set so that a
single largest-possible DNS message can still fit into the buffer.
(cherry picked from commit 26006f7b44474819fac2a76dc6cd6f69f0d76828)
This commit ensures that an HTTP endpoints set reference is stored in
a socket object associated with an HTTP/2 stream instead of
referencing the global set stored inside a listener.
This helps to prevent an issue like the following:
1. BIND is configured to serve DoH clients;
2. A client is connected and one or more HTTP/2 streams are
   created. Internal pointers now point to the data in the
   associated HTTP endpoints set;
3. BIND is reconfigured - a new endpoints set object is created and
   promoted to all listeners;
4. The old pointers to the HTTP endpoints set data are now invalid.
Instead of referencing a global object that is updated on
re-configuration, we now store a local reference, which prevents the
endpoints set objects from going out of scope prematurely.
(cherry picked from commit b9b5d0c01a3a546c4a6a8b3bff8ae9dd31fee224)
It was reported that an HTTP/2 session might get closed or even
deleted before all asynchronous processing has been completed.
This commit addresses that: we now avoid using the object when we do
not need it, specifically check that the pointers used are not 'NULL',
and ensure that there is at least one reference to the session object
while we are processing incoming data.
This commit makes the code more resilient to such issues in the
future.
(cherry picked from commit 0cca550dff403c6100b7c0da8f252e7967765ba7)
If uv_tcp_close_reset() returns an error code, this means the
reset_shutdown callback has not been issued, so do it now.
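A minimal sketch of that fallback; the callback name follows the text
above, the rest is illustrative:

    #include <uv.h>

    static void
    reset_shutdown(uv_handle_t *handle) {
            /* the existing shutdown path, stubbed out for the sketch */
            (void)handle;
    }

    static void
    close_with_reset(uv_tcp_t *handle) {
            if (uv_tcp_close_reset(handle, reset_shutdown) != 0) {
                    /* libuv will not issue the callback, so do it now */
                    reset_shutdown((uv_handle_t *)handle);
            }
    }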
(cherry picked from commit c40e5c8653)
Failing to accept TCP/TLS connections in 9.18 detaches the quota in
isc__nm_failed_accept_cb, causing TCP4Clients and TCP6Clients statistics
to not decrease inside cleanup.
Fix this by increasing the counter after the point where accepting can
no longer fail, but before the point where handling statistics through
the client's socket is no longer valid.
When isc_task_purgeevent() is called for an 'event', the event could,
in the meantime, in theory get processed, unlinked, and freed. So when
the function then operates on the 'event', it causes a segmentation
fault.
The only place where isc_task_purgeevent() is called is from
timer_purge().
In order to resolve the data race, call isc_task_purgeevent() inside
the 'timer->lock' locked block, so that timerevent_destroy() won't
be able to destroy the event if it was processed in the meanwhile,
before isc_task_purgeevent() had a chance to purge it.
In order to be able to do that, move the responsibility of calling
isc_event_free() (upon a successful purge) out from the
isc_task_purgeevent() function to its caller instead, so that it can
be called outside of the timer->lock locked block.
Let basic_tick() of 'task1' and 'basic_quick' of 'task4' run in
different threads, and insert an artificial delay in timer_purge()
to cause an existing race condition to appear.
The statistics channel does not expose the current number of connected
TCP clients, only the highwater value. Therefore, users did not have
an easy means to collect statistics about TCP clients served over
time. This information could only be measured through a separate
mechanism, via rndc, by looking at how much of the TCP quota is
filled.
In order to expose the exact current count of connected TCP clients
(tracked by the "tcp-clients" quota) as a statistics counter, an
extra, dedicated Network Manager callback would need to be
implemented for that purpose (a counterpart of ns__client_tcpconn()
that would be run when a TCP connection is torn down), which is
inefficient. Instead, track the number of currently-connected TCP
clients separately for IPv4 and IPv6, as Network Manager statistics.
(cherry picked from commit 2690dc48d3)
The case insensitive matching in isc_ht was basically completely broken
as only the hashvalue computation was case insensitive, but the key
comparison was always case sensitive.
(cherry picked from commit ec11aa2836)
The case insensitive matching in isc_ht was basically completely broken
as only the hashvalue computation was case insensitive, but the key
comparison was always case sensitive.
(cherry picked from commit 34ae6916f115fc291865857509433f95c2bc0871)
Change the taskmgr (and thus netmgr) so that it supports fast and
slow task queues. The fast queue is used for incoming DNS traffic and
it will pass the processing to the slow queue for sending outgoing DNS
messages and processing resolver messages.
In the future, more tasks might get moved to the slow queues, so the
cached and authoritative DNS traffic can be handled without being slowed
down by operations that take longer to process.
cmocka.h and jemalloc.h/malloc_np.h have conflicting macro definitions.
While fixing them with push_macro for malloc only is done below, we
only need the non-standard mallocx interface, which is easy to just
define ourselves (see the sketch below).
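A rough sketch of both options; the jemalloc header path and which
macro conflicts are assumptions here:

    /* Option 1: shield the conflicting macro around the jemalloc header. */
    #pragma push_macro("malloc")
    #undef malloc
    #include <jemalloc/jemalloc.h>
    #pragma pop_macro("malloc")

    /* Option 2: skip the header and declare the only interface we need. */
    #include <stddef.h> /* size_t */
    void *mallocx(size_t size, int flags);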
(cherry picked from commit 197de93bdc)
Because we don't use jemalloc functions directly, but only via the
libisc library, the dynamic linker might pull in the jemalloc library
too late, when memory has already been allocated via the standard libc
allocator.
Add a workaround around isc_mem_create() that makes the dynamic linker
pull in jemalloc earlier than libc.
(cherry picked from commit 41a0ee1071)
When connecting to a remote party the TLS DNS code could process more
than one message at a time despite the fact that it is expected that
we should stop after every DNS message.
Every DNS message is handled and consumed from the input buffer by
isc__nm_process_sock_buffer(). However, as opposed to TCP DNS code, it
can be called more than once when processing incoming data from a
server (see tls_cycle_input()). That, in turn, means that we can
process more than one message at a time. Some higher-level code might
not expect that, as it breaks the contract.
In particular, in the original report this happened during the
isc__nm_async_tlsdnsshutdown() call: when shutting down, multiple
calls to tls_cycle() are possible (each possibly leading to an
isc__nm_process_sock_buffer() call). If there are any unprocessed
messages left, the read callback will be called for each of them even
when it is not expected, as there was no preceding isc_nm_read() call.
To keep the TCP DNS and TLS DNS code in sync, we make a similar change
there as well, although it should not matter.