TLS DNS: take into account partial writes by SSL_write_ex()

This commit changes TLS DNS so that partial writes by the SSL_write_ex() function are taken into account properly. Now, before doing encryption, we are flushing the buffers for outgoing encrypted data. The problem is fairly complicated and originates from the fact that it is somewhat hard to understand by reading the documentation if and when partial writes are supported/enabled or not, and one can get a false impression that they are not supported or enabled by default (https://www.openssl.org/docs/man3.1/man3/SSL_write_ex.html). I have added a lengthy comment about that into the code because it will be more useful there. The documentation on this topic is vague and hard to follow. The main point is that when SSL_write_ex() fails with SSL_ERROR_WANT_WRITE, the OpenSSL code tells us that we need to flush the outgoing buffers and then call SSL_write_ex() again with exactly the same arguments in order to continue as partial write could have happened on the previous call to SSL_write_ex() (that is not hard to verify by calling BIO_pending(sock->tls.app_rbio) before and after the call to SSL_write_ex() and comparing the returned values). This aspect was not taken into account in the code. Now, one can wonder how that could have led to the behaviour that we saw in the #4255 bug report. In particular, how could we lose one message and duplicate one twice? That is where things get interesting. One needs to keep two things in mind (that is important): Firstly, the possibility that two (or more) subsequent SSL_write_ex() calls will be done with exactly the same arguments is very high (the code does not guarantee that in any way, but in practice, that happens a lot). Secondly, the dnsperf (the software that helped us to trigger the bug) bombed the test server with messages that contained exactly the same data. The only difference in the responses is message IDs, which can be found closer to the start of a message. So, that is what was going on in the older version of the code: 1. During one of the isc_nm_send() calls, the SSL_write_ex() call fails with SSL_ERROR_WANT_WRITE. Partial writing has happened, though, and we wrote a part of the message with the message ID (e.g. 2014). Nevertheless, we have rescheduled the complete send operation asynchronously by a call to tlsdns_send_enqueue(). 2. While the asynchronous request has not been completed, we try to send the message (e.g. with ID 2015). The next isc_nm_send() or re-queued send happens with a call to SSL_write_ex() with EXACTLY the same arguments as in the case of the previous call. That is, we are acting as if we want to complete the previously failed SSL_write_ex() attempt (according to the OpenSSL documentation: https://www.openssl.org/docs/man3.1/man3/SSL_write_ex.html, the "Warnings" section). This way, we already have a start of the message containing the previous ID (2014 in our case) but complete the write request with the rest of the data given in the current write attempt. However, as responses differ only in message ID, we end up sending a valid (properly structured) DNS message but with the ID of the previous one. This way, we send a message with ID from the previous isc_nm_send() attempt. The message with the ID from the send request from this attempt will never be sent, as the code thinks that it is sending it now (that is how we send the message with ID 2014 instead of 2015, as in our example, thus making the message with ID 2015 never to be sent). 3. At some point later, the asynchronous send request (the rescheduled on the first step) completes without an error, sending a second message with the same ID (2014). It took exhausting SSL write buffers (so that a data encryption attempt cannot be completed in one operation) via long DoT streams in order to exhibit the behaviour described above. The exhaustion happened because we have not been trying to flush the buffers often enough (especially in the case of multiple subsequent writes). In my opinion, the origin of the problem can be described as follows: It happened due to making wrong guesses caused by poorly written documentation.
2023-08-23 23:36:16 +03:00
parent 7df60d3085
commit 84d71c8e2c
1 changed files with 72 additions and 1 deletions
--- a/lib/isc/netmgr/tlsdns.c
+++ b/lib/isc/netmgr/tlsdns.c
@@ -1832,6 +1832,77 @@ tlsdns_send_direct(isc_nmsocket_t *sock, isc__nm_uvreq_t *req) {
 		goto requeue;
 	}

+	/*
+	 * Try to send any pending data before trying to call SSL_write_ex().
+	 * Otherwise, it could fail with SSL_ERROR_WANT_WRITE error.
+	 *
+	 * It is important to stress that we want to avoid this happening
+	 * due to how SSL_write_ex() works - mainly to avoid partial
+	 * writes.
+	 *
+	 * Although the documentation for these functions is vague, it is
+	 * not stated that partial writes are not possible. On the
+	 * contrary, one can deduce that it is possible and recovering
+	 * from this situation is complicated and unreasonably hard to
+	 * implement to the point when it is better to avoid this
+	 * situation altogether.
+	 *
+	 * In particular, the following can be found in the documentation:
+	 *
+	 * "The write functions will only return with success when the
+	 * complete contents of buf of length num has been written. This
+	 * default behaviour can be changed with the
+	 * SSL_MODE_ENABLE_PARTIAL_WRITE option of
+	 * SSL_CTX_set_mode(3). When this flag is set, the write functions
+	 * will also return with success when a partial write has been
+	 * successfully completed. In this case, the write function
+	 * operation is considered completed. The bytes are sent, and a
+	 * new write call with a new buffer (with the already sent bytes
+	 * removed) must be started. A partial write is performed with the
+	 * size of a message block, which is 16kB."
+
+	 * That is, it is said that success is returned only when the
+	 * complete chunk of data is written (encrypted), but it does not
+	 * mention that partial writes are not possible (the behaviour can
+	 * be changed using SSL_MODE_ENABLE_PARTIAL_WRITE). Another
+	 * important aspect of this passage is that a partial write of up
+	 * to 16 kilobytes can happen, and the call still can
+	 * fail. Needless to say, this amount of data may include more
+	 * than one DNS message.
+	 *
+	 * One could expect that SSL_write_ex() should return the number
+	 * of bytes written, but no, that is not guaranteed (emphasis is
+	 * mine): "*On success* SSL_write_ex() will store the number of
+	 * bytes written in *written."
+	 *
+	 * Moreover, we can find the following guidance on how to handle
+	 * the SSL_ERROR_WANT_WRITE error in the "Warnings" section of the
+	 * documentation:
+
+	 * "When a write function call has to be repeated because
+	 * SSL_get_error(3) returned SSL_ERROR_WANT_READ or
+	 * SSL_ERROR_WANT_WRITE, it must be repeated with the same
+	 * arguments. The data that was passed might have been partially
+	 * processed. When SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER was set
+	 * using SSL_CTX_set_mode(3) the pointer can be different, but the
+	 * data and length should still be the same."
+	 *
+	 * That is, when a call to SSL_write_ex() fails with
+	 * SSL_ERROR_WANT_WRITE, we must attempt to make the next call to
+	 * the function exactly with the same arguments. Of course, the
+	 * code is structured in such a way that we cannot guarantee that
+	 * (and keeping track of that would be unreasonably complicated to
+	 * implement). The best we can do to avoid this error is to get
+	 * (and send) the outgoing data from the SSL buffer ASAP before
+	 * processing the subsequent write request. We can achieve that by
+	 * calling tls_cycle() and rescheduling the write request for
+	 * being processed later.
+	 */
+	if (BIO_pending(sock->tls.app_rbio) > 0) {
+		/* Handle any pending data and requeue the write request. */
+		goto cycle;
+	}
+
 	/*
 	 * There's no SSL_writev(), so we need to use a local buffer to
 	 * assemble the whole message
@@ -1844,7 +1915,6 @@ tlsdns_send_direct(isc_nmsocket_t *sock, isc__nm_uvreq_t *req) {

 	rv = SSL_write_ex(sock->tls.tls, worker->sendbuf, sendlen, &bytes);
 	if (rv > 0) {
-		/* SSL_write_ex() doesn't do partial writes */
 		INSIST(sendlen == bytes);

 		ISC_LIST_APPEND(sock->tls.sendreqs, req, link);
@@ -1865,6 +1935,7 @@ tlsdns_send_direct(isc_nmsocket_t *sock, isc__nm_uvreq_t *req) {
 		return (ISC_R_TLSERROR);
 	}

+cycle:
 	result = tls_cycle(sock);

 requeue: