QSBR: safe memory reclamation for lock-free data structures

This "quiescent state based reclamation" module provides support for
the qp-trie module in dns/qp. It is a replacement for liburcu, written
without reference to the urcu source code, and in fact it works in a
significantly different way.

A few specifics of BIND make this variant of QSBR somewhat simpler:

  * We can require that wait-free access to a qp-trie only happens in
    an isc_loop callback. The loop provides a natural quiescent state,
    after the callbacks are done, when no qp-trie access occurs.

  * We can dispense with any API like rcu_synchronize(). In practice,
    it takes far too long to wait for a grace period to elapse for each
    write to a data structure.

  * We use the idea of "phases" (aka epochs or eras) from EBR to
    reduce the amount of bookkeeping needed to track memory that is no
    longer needed, knowing that the qp-trie does most of that work
    already.

I considered hazard pointers for safe memory reclamation. They have
more read-side overhead (updating the hazard pointers) and it wasn't
clear to me how to nicely schedule the cleanup work. Another
alternative, epoch-based reclamation, is designed for fine-grained
lock-free updates, so it needs some rethinking to work well with the
heavily read-biased design of the qp-trie. QSBR has the fastest read
side of the basic SMR algorithms (with no barriers), and fits well
into a libuv loop. More recent hybrid SMR algorithms do not appear to
have enough benefits to justify the extra complexity.

doc/dev/qsbr.md (new file, 397 lines)

<!--
Copyright (C) Internet Systems Consortium, Inc. ("ISC")

SPDX-License-Identifier: MPL-2.0

This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, you can obtain one at https://mozilla.org/MPL/2.0/.

See the COPYRIGHT file distributed with this work for additional
information regarding copyright ownership.
-->

QSBR: quiescent state based reclamation
=======================================

QSBR is a safe memory reclamation (SMR) algorithm for lock-free data
structures such as a qp-trie. (See `doc/dev/qp.md`.)

When an object is unlinked from a lock-free data structure, it
cannot be `free()`ed immediately, because there can still be readers
accessing the object via an old version of the data structure. SMR
algorithms determine when it is safe to reclaim memory after it has
been unlinked.


Introductions and overviews
---------------------------

There is a terse overview in `include/isc/qsbr.h`.

Jeff Preshing has a nice introduction to QSBR,
_<https://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/>_

At the end of this note is a copy of a blog post about writing BIND's
`isc_qsbr`, _<https://dotat.at/@/2023-01-10-qsbr.html>_

[Paul McKenney's web page][paulmck] has links to his book on
concurrent programming, the [Userspace RCU library][urcu], and more.
McKenney invented RCU and QSBR. RCU is the Linux kernel's machinery
for lock-free data structures and safe memory reclamation, based on
QSBR.

[paulmck]: http://www.rdrop.com/~paulmck/
[urcu]: https://liburcu.org/


Example code
------------

If you are implementing a lock-free data structure that needs safe
memory reclamation, here's a guide to using `isc_qsbr`, based on how
QSBR is used by `dns_qp`.

### registration

When the program starts up you need to register a global callback
function that will reclaim unused memory. You can do so using an
`ISC_CONSTRUCTOR` function that runs automatically at startup.

    static void
    qp_qsbr_register(void) ISC_CONSTRUCTOR;
    static void
    qp_qsbr_register(void) {
            isc_qsbr_register(qp_qsbr_reclaimer);
    }

### work list

Your module will need somewhere that your callback can find the work
it needs to do. The qp-trie has an atomic list of `dns_qpmulti_t`
objects for this purpose.

    /* a global variable */
    static ISC_ASTACK(dns_qpmulti_t) qsbr_work;

The reason for using global variables is so that we don't need to
allocate a thunk every time we have memory reclamation work to do.

### read-only access

You should design your data structure so that it has a single atomic
root pointer referring to its current version. A lock-free reader
_must_ run in an `isc_loop` callback. It gains access to the data
structure by taking a copy of this pointer:

    qp_node_t *reader = atomic_load_acquire(&multi->reader);

During an `isc_loop` callback, a reader should keep using the same
pointer to get a consistent view of the data structure. If it reloads
the pointer it can get a different version changed by concurrent
writers.

A reader _must_ stop using the root pointer and any interior pointers
obtained via the root pointer before it returns to the `isc_loop`.
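
The reader rules above can be sketched with plain C11 atomics. This is
a minimal illustration, not BIND code: `my_node_t`, `my_trie_t`, and
`my_lookup()` are hypothetical names, and `atomic_load_explicit()` with
`memory_order_acquire` stands in for the `atomic_load_acquire()`
wrapper used in the snippet above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical immutable version of a (one-entry) data structure. */
typedef struct my_node {
    int key;
    int value;
} my_node_t;

/* Hypothetical container with a single atomic root pointer. */
typedef struct my_trie {
    _Atomic(my_node_t *) reader;
} my_trie_t;

/*
 * Intended to be called from an event loop callback. Returns 0 and
 * sets *valuep if the key is present, or -1 if it is absent.
 */
static int
my_lookup(my_trie_t *trie, int key, int *valuep) {
    assert(trie != NULL && valuep != NULL);

    /* Load the root once; every later access goes through this copy. */
    my_node_t *root = atomic_load_explicit(&trie->reader,
                                           memory_order_acquire);
    if (root == NULL || root->key != key) {
        return -1;
    }
    *valuep = root->value;

    /* `root` is a local variable, so it is gone when we return. */
    return 0;
}
```

Because the root pointer only ever lives in a local variable, nothing
obtained through it can outlive the callback.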

### modifications and writes

All changes to the data structure must be copy-on-write (aka
read-copy-update) so that concurrent readers are not disturbed.

When a new version of the data structure has been prepared, it is
committed by overwriting the atomic root pointer:

    atomic_store_release(&multi->reader, reader); /* COMMIT */
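
As a rough sketch of the whole write path, assuming writers are already
serialized by the caller (for example with a mutex), a copy-on-write
update might look like this. The names `snapshot_t`, `cow_box_t`, and
`cow_update()` are hypothetical, and C11 `atomic_store_explicit()`
stands in for `atomic_store_release()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical immutable snapshot of some data. */
typedef struct snapshot {
    int value;
} snapshot_t;

typedef struct cow_box {
    _Atomic(snapshot_t *) reader; /* current version, read lock-free */
} cow_box_t;

/*
 * Writer side; callers must ensure there is only one writer at a time.
 * Returns the unlinked old version, which must NOT be freed until a
 * grace period has elapsed.
 */
static snapshot_t *
cow_update(cow_box_t *box, int value) {
    snapshot_t *old = atomic_load_explicit(&box->reader,
                                           memory_order_relaxed);
    snapshot_t *fresh = malloc(sizeof(*fresh));
    assert(fresh != NULL);
    if (old != NULL) {
        *fresh = *old; /* copy the old version ... */
    }
    fresh->value = value; /* ... and change only the copy */

    /* COMMIT: readers that load the root from now on see `fresh` */
    atomic_store_explicit(&box->reader, fresh, memory_order_release);
    return old;
}
```

The old version is returned rather than freed because concurrent
readers may still hold it; deciding when it can be freed is the job of
the safe memory reclamation scheme.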

### scheduling cleanup

After committing a change, your data structure may have memory that
will become free, after concurrent readers have stopped accessing it.
To reclaim the memory when it is safe, use code like:

    isc_qsbr_phase_t phase = isc_qsbr_phase(multi->loopmgr);
    if (defer_chunk_reclamation(qp, phase)) {
            ISC_ASTACK_ADD(qsbr_work, multi, cleanup);
            isc_qsbr_activate(multi->loopmgr, phase);
    }

  * First, get the current QSBR phase.

  * Second, mark free memory with the phase number. The qp-trie scans
    its chunks and marks those that will become free, and returns
    `true` if there is cleanup work to do.

  * If so, the qp-trie is added to the work list. (`ISC_ASTACK_ADD()`
    is idempotent).

  * Finally, QSBR is informed that there is work to do.

In other cases it might not make sense to scan the data structure
after committing, and instead you might make note of which memory to
clean up while making changes before you know what the phase will be.
You can then have per-phase work lists, like:

    static ISC_ASTACK(my_work_t) qsbr_work[ISC_QSBR_PHASES];

    isc_qsbr_phase_t phase = isc_qsbr_phase(loopmgr);
    ISC_ASTACK_ADD(qsbr_work[phase], cleanup_work, link);
    isc_qsbr_activate(loopmgr, phase);

In general, there will be several (maybe many) write operations during
a grace period. Your lock-free data structure should collect its
reclamation work from all these writes into a batch per phase, i.e.
per grace period.
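
The per-phase batching described above might look like the following
single-writer sketch. It is illustrative only: `my_work_t`,
`my_defer()`, `my_reclaim()`, and `MY_PHASES` are hypothetical names,
plain linked lists stand in for `ISC_ASTACK`, phases are assumed to be
small array indexes, and adding up a byte count stands in for actually
freeing memory.

```c
#include <assert.h>
#include <stddef.h>

#define MY_PHASES 3 /* assumed number of phases */

typedef struct my_work {
    struct my_work *next;
    size_t bytes; /* size of the memory awaiting reclamation */
} my_work_t;

/* One batch of deferred work per phase. */
static my_work_t *my_batch[MY_PHASES];

/* Writer (serialized by the caller): add work to the phase's batch. */
static void
my_defer(my_work_t *work, unsigned phase) {
    assert(work != NULL && phase < MY_PHASES);
    work->next = my_batch[phase];
    my_batch[phase] = work;
}

/*
 * Reclaimer: detach and process the whole batch for a phase whose
 * grace period has elapsed. Returns the number of bytes released.
 */
static size_t
my_reclaim(unsigned phase) {
    assert(phase < MY_PHASES);
    my_work_t *work = my_batch[phase];
    my_batch[phase] = NULL;

    size_t reclaimed = 0;
    while (work != NULL) {
        my_work_t *next = work->next;
        reclaimed += work->bytes; /* a real reclaimer would free() here */
        work = next;
    }
    return reclaimed;
}
```

Many writes in the same phase cost one list push each, and the
reclaimer handles them all in a single pass.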

### reclaiming

Inside the reclaimer callback, we iterate over the work list and clean
up each item on it. If there is more cleanup work to do in another
phase, we put the qp-trie back on the work list for another go.

    static void
    qp_qsbr_reclaimer(void *arg, isc_qsbr_phase_t phase) {
            UNUSED(arg);

            ISC_STACK(dns_qpmulti_t) drain = ISC_ASTACK_TO_STACK(qsbr_work);
            while (!ISC_STACK_EMPTY(drain)) {
                    dns_qpmulti_t *multi = ISC_STACK_POP(drain, cleanup);
                    INSIST(QPMULTI_VALID(multi));
                    LOCK(&multi->mutex);
                    if (reclaim_chunks(&multi->writer, phase)) {
                            /* more to do next time */
                            ISC_ASTACK_PUSH(qsbr_work, multi, cleanup);
                    }
                    UNLOCK(&multi->mutex);
            }
    }

### reclaim marks

In the qp-trie data structure, each chunk has some metadata which
includes a bitfield for the reclaim phase:

    isc_qsbr_phase_t phase : ISC_QSBR_PHASE_BITS;
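
A minimal sketch of such per-chunk metadata, with a compile-time check
that it occupies one 32-bit word, could look like this. The field names
and widths are invented for illustration (they are not the qp-trie's
real layout), and the two-bit phase width is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a phase number fits in two bits. */
#define MY_PHASE_BITS 2

/* Hypothetical per-chunk metadata packed into 32 bits. */
typedef struct my_chunk_usage {
    uint32_t used : 14;      /* cells allocated from the chunk */
    uint32_t free : 14;      /* cells no longer referenced */
    uint32_t exists : 1;     /* the chunk is allocated */
    uint32_t immutable : 1;  /* readers may be using the chunk */
    uint32_t phase : MY_PHASE_BITS; /* when it may be reclaimed */
} my_chunk_usage_t;

/* Fail the build if the fields ever spill into a second word. */
static_assert(sizeof(my_chunk_usage_t) == sizeof(uint32_t),
              "chunk metadata must fit in a single word");
```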

We use a bitfield so that all the metadata fits in a single word.


------------------------------------------------------------------------

Safe memory reclamation for BIND
================================

At the end of October 2022, I _finally_ got [my multithreaded
qp-trie][qp-gc] working! It could be built with two different
concurrency control mechanisms:

  * A reader/writer lock

    This has poor read-side scalability, because every thread is
    hammering on the same shared location. But its write performance
    is reasonably good: concurrent readers don't slow it down too much.

  * [`liburcu`, userland read-copy-update][urcu]

    RCU has a fast and scalable read side, nice! But on the write side
    I used `synchronize_rcu()`, which is blocking and rather slow, so
    my write performance was terrible.

OK, but I want the best of both worlds! To fix it, I needed to change
the qp-trie code to use safe memory reclamation more effectively:
instead of blocking inside `synchronize_rcu()` before cleaning up, use
`call_rcu()` to clean up asynchronously. I expect I'll write about the
qp-trie changes another time.

Another issue is that I want the best of both worlds _by default_,
but `liburcu` is [LGPL][] and we don't want BIND to depend on
code whose licence demands more from our users than the [MPL][].

[qp-gc]: https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
[LGPL]: https://opensource.org/licenses/LGPL-2.1
[MPL]: https://opensource.org/licenses/MPL-2.0

So I set out to write my own safe memory reclamation support code.


lock freedom
------------

In a [multithreaded qp-trie][qp-gc], there can be many concurrent
readers, but there can be only one writer at a time and modifications
are strictly serialized. When I have got it working properly, readers
are completely wait-free, unaffected by other readers, and almost
unaffected by writers. Writers need to get a mutex to ensure there is
only one at a time, but once the mutex is acquired, a writer is not
obstructed by readers.

The way this works is that readers use an atomic load to get a pointer
to the root of the current version of the trie. Readers can make
multiple queries using this root pointer and the results will be
consistent with respect to that particular version, regardless of what
changes writers might be making concurrently. Writers do not affect
readers because all changes are made by copy-on-write. When a writer is
ready to commit a new version of the trie, it uses an atomic store to
flip the root pointer.


safe memory reclamation
-----------------------

We can't copy-on-write indefinitely: we need to reclaim the memory
used by old versions of the trie. And we must do so "safely", i.e.
without `free()`ing memory that readers are still using.

So, before `free()`ing memory, a writer must wait for a _"grace
period"_, which is a jargon term meaning "until readers are not using
the old version". There are a bunch of algorithms for determining when
a grace period is over, with varying amounts of over-approximation,
CPU overhead, and memory backlog.

The [RCU][urcu] function `synchronize_rcu()` is slow because it blocks
waiting for a grace period; the `call_rcu()` function runs a callback
asynchronously after a grace period has passed. I wanted to avoid
blocking my writers, so I needed to implement something like
`call_rcu()`.


aversions
---------

When I started trying to work out how to do safe memory reclamation,
it all seemed quite intimidating. But as I learned more, I found that
my circumstances make it easier than it appeared at first.

The [`liburcu`][urcu] homepage has a long list of supported CPU
architectures and operating systems. Do I have to care about those
details too? No! The RCU code dates back to before the age of
standardized concurrent memory models, so the RCU developers had to
invent their own atomic primitives and correctness rules. Twenty-ish
years later the state of the art has advanced, so I can use
`<stdatomic.h>` without having to re-do it like `liburcu`.

You can also choose between several algorithms implemented by
[`liburcu`][urcu], involving questions about kernel support, specially
reserved signals, and intrusiveness in application code. But while I
was working out how to schedule asynchronous memory reclamation work,
I realised that BIND is already well-suited to the fastest flavour of
RCU, called "QSBR".


QSBR
----

QSBR stands for "quiescent state based reclamation". A _"quiescent
state"_ is a fancy name for a point when a thread is not accessing a
lock-free data structure, and does not retain any root pointers or
interior pointers.

When a thread has passed through a quiescent state, it no longer has
access to older versions of the data structures. When _all_ threads
have passed through quiescent states, then nothing in the program has
access to old versions. This is how QSBR detects grace periods: after
a writer commits a new version, it waits for all threads to pass
through quiescent states, and therefore a grace period has definitely
elapsed, and so it is then safe to reclaim the old version's memory.

QSBR is fast because readers do not need to explicitly mark the
critical section surrounding the atomic load that I mentioned earlier.
Threads just need to pass through a quiescent state frequently enough
that there isn't a huge build-up of unreclaimed memory.
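
One common way to detect that every thread has passed through a
quiescent state is a global grace-period counter plus a per-thread copy
of it. The following is a generic sketch of that idea, not a
description of how `liburcu` or BIND implement it: all the names are
hypothetical, the number of threads is fixed, counter wrap-around is
ignored, and every atomic operation uses the default sequentially
consistent ordering for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 3 /* assumed fixed number of reader threads */

static atomic_uint global_grace = 1;       /* current grace period */
static atomic_uint thread_seen[NTHREADS];  /* last one each thread saw */

/* Called by thread `t` at a point where it holds no pointers. */
static void
quiescent_state(unsigned t) {
    assert(t < NTHREADS);
    atomic_store(&thread_seen[t], atomic_load(&global_grace));
}

/*
 * Writer: after unlinking memory, start a new grace period and
 * return its number.
 */
static unsigned
start_grace_period(void) {
    return atomic_fetch_add(&global_grace, 1) + 1;
}

/* Has every thread been quiescent since grace period `grace` began? */
static bool
grace_period_elapsed(unsigned grace) {
    for (unsigned t = 0; t < NTHREADS; t++) {
        if (atomic_load(&thread_seen[t]) < grace) {
            return false;
        }
    }
    return true;
}
```

Readers pay nothing while they use the data structure; the only
read-side cost is the occasional call to `quiescent_state()`.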

Inside an operating system kernel (RCU's native environment), a
context switch provides a natural quiescent state. In a userland
application, you need to find a good place to call
`rcu_quiescent_state()`. You could call it every time you have
finished using a root pointer, but marking a quiescent state is not
completely free, so there are probably more efficient ways.


`libuv`
-------

BIND is multithreaded, and (basically) each thread runs an event loop.
Recent versions of BIND use [`libuv`][uv] for the event loops.

A lot of things started falling into place when I realised that the
`libuv` event loop gives BIND a [natural quiescent state][uv-loop]:
when the event callbacks have finished running, and `libuv` is about
to call `select()` or `poll()` or whatever, we can mark a quiescent
state. We can require that event-handling functions do not stash root
pointers in the heap, but only use them via local variables, so we
know that old versions are inaccessible after the callback returns.

My design marks a quiescent state once per loop, so on a busy server
where each loop has lots to do, the cost of marking a quiescent state
is amortized across several I/O events.

[uv]: http://libuv.org/
[uv-loop]: http://docs.libuv.org/en/v1.x/design.html#the-i-o-loop


fuzzy barrier
-------------

So, how do we mark a quiescent state? Using a _"fuzzy barrier"_.

When a thread reaches a normal barrier, it blocks until all the other
threads have reached the barrier, after which exactly one of the
threads can enter a protected section of code, and the others are
unblocked and can proceed as normal.

When a thread encounters a fuzzy barrier, it never blocks. It either
proceeds immediately as normal, or if it is the last thread to reach
the barrier, it enters the protected code.
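
A fuzzy barrier along these lines can be sketched with two atomic
counters. This is an illustration under simplifying assumptions (a
fixed number of threads, sequentially consistent atomics, hypothetical
names), not the barrier used by `isc_qsbr`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct fuzzy_barrier {
    atomic_uint phase;     /* number of the current round, from 1 */
    atomic_uint remaining; /* threads that have not yet arrived */
    unsigned nthreads;
} fuzzy_barrier_t;

static void
fuzzy_init(fuzzy_barrier_t *b, unsigned nthreads) {
    assert(nthreads > 0);
    atomic_init(&b->phase, 1);
    atomic_init(&b->remaining, nthreads);
    b->nthreads = nthreads;
}

/*
 * Never blocks. `local_phase` is the calling thread's private record
 * of the last round it arrived in (initially 0). Returns true for
 * exactly one thread per round: the last to arrive. That thread runs
 * the protected code after this function returns.
 */
static bool
fuzzy_arrive(fuzzy_barrier_t *b, unsigned *local_phase) {
    unsigned phase = atomic_load(&b->phase);
    if (*local_phase == phase) {
        return false; /* already counted in this round */
    }
    *local_phase = phase;
    if (atomic_fetch_sub(&b->remaining, 1) > 1) {
        return false; /* other threads are still to come */
    }
    /* Last arrival: re-arm the barrier, then open the next round. */
    atomic_store(&b->remaining, b->nthreads);
    atomic_store(&b->phase, phase + 1);
    return true;
}
```

The round cannot advance until every thread has decremented the
counter, so a thread that read the round number before arriving never
has its arrival credited to a later round.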

RCU does not actually use a fuzzy barrier as I have described it. Like
a fuzzy barrier, each thread keeps track of whether it has passed
through a quiescent state in the current grace period, without
blocking; but unlike a fuzzy barrier, no thread is diverted to the
protected code. Instead, code that wants to enter a protected section
uses the blocking `synchronize_rcu()` function.


EBR-ish
-------

As in the paper ["performance of memory reclamation for lockless
synchronization"][HMBW], my implementation of QSBR uses a fuzzy
barrier designed for another safe memory reclamation algorithm, EBR,
epoch based reclamation. (EBR was invented here in Cambridge by [Keir
Fraser][tr579].)

[HMBW]: http://csng.cs.toronto.edu/publication_files/0000/0159/jpdc07.pdf
[tr579]: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.html

Actually, my fuzzy barrier is slightly different to EBR's. In EBR, the
fuzzy barrier is used every time the program enters a critical
section. (In qp-trie terms, that would be every time a reader fetches
a root pointer.) So it is vital that EBR's barrier avoids mutating
shared state, because that would wreck multithreaded performance.

Because BIND will only pass through the fuzzy barrier when it is about
to use a blocking system call, my version mutates shared state more
frequently (typically, once per CPU per grace period, instead of once
per grace period). If this turns out to be a problem, it won't be too
hard to make it work more like EBR.

More trivially, I'm using the term "phase" instead of "epoch", because
it's nothing to do with the unix epoch, because there are three
phases, and because I can talk about phase transitions and threads
being out of phase with each other.


coda
----

While reading various RCU-related papers, I was amused by ["user-level
implementations of read-copy update"][DMSDW], which says:

> BIND, a major domain-name server used for Internet domain-name
> resolution, is facing scalability issues. Since domain names
> are read often but rarely updated, using user-level RCU might be
> beneficial.

Yes, I think it might :-)

[DMSDW]: https://www.efficios.com/publications/