From aed7082419465d1c74a0b96cbdc9ae938deaff06 Mon Sep 17 00:00:00 2001
From: Hugo Landau
Date: Fri, 25 Nov 2022 12:47:48 +0000
Subject: QUIC I/O Architecture Design Document

Reviewed-by: Tomas Mraz
Reviewed-by: Paul Dale
(Merged from https://github.com/openssl/openssl/pull/19770)
---
 doc/designs/quic-design/quic-io-arch.md | 483 ++++++++++++++++++++++++++++++++
 1 file changed, 483 insertions(+)
 create mode 100644 doc/designs/quic-design/quic-io-arch.md

diff --git a/doc/designs/quic-design/quic-io-arch.md b/doc/designs/quic-design/quic-io-arch.md
new file mode 100644
index 0000000000..09115e65a2
--- /dev/null
+++ b/doc/designs/quic-design/quic-io-arch.md
@@ -0,0 +1,483 @@
QUIC I/O Architecture
=====================

This document discusses possible implementation options for the I/O
architecture internal to the libssl QUIC implementation, examines the
underlying design constraints driving the choice between them, and introduces
the resulting I/O architecture. It also identifies potential hazards to
existing applications and how those hazards are mitigated.

Objectives
----------

The OpenSSL QUIC API design is intended to meet the following objectives,
amongst others:

  - We want to support both blocking and non-blocking semantics
    for application use of the libssl APIs.

  - In the case of non-blocking applications, it must be possible
    for an application to do its own polling and make its own event
    loop.

Requirements
------------

These requirements are complicated by the fact that traditional use of the
libssl API allows an application to pass an arbitrary BIO to an SSL object;
not only that, separate BIOs can be passed for the read and write directions.
The nature of this BIO can be arbitrary; it could be a socket, or a memory
buffer.

Implementation of QUIC will require that the underlying network BIO passed to
the QUIC implementation be configured to support datagram semantics instead of
bytestream semantics as has been the case with traditional TLS over TCP.

Implementation of QUIC requires handling of timer events as well as the
circumstances where a network socket becomes readable or writable. In many
cases we need to handle these events simultaneously (e.g. wait until a socket
becomes readable, or a timeout expires, whichever comes first).

Blocking vs. Non-Blocking I/O
-----------------------------

The above constraints make it effectively a requirement that non-blocking I/O
be used for the calls to the underlying network BIOs. To illustrate this
point, we first consider how QUIC might be implemented using blocking I/O
internally.

To function correctly and provide blocking semantics at the application level,
our QUIC implementation must be able to block such that it can respond to any
of the following events for the underlying network read and write BIOs
immediately:

- The underlying network write BIO becomes writeable;
- The underlying network read BIO becomes readable;
- A timeout expires.
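As a minimal illustrative sketch (assuming POSIX poll(3) and a relative
timeout already derived from the next QUIC timer deadline; this is not part of
the design itself), a single wait covering all three conditions could look
like this:

```c
#include <poll.h>

/*
 * Illustrative only: wait until 'fd' is readable, 'fd' is writable, or
 * 'timeout_ms' milliseconds elapse, whichever comes first. 'timeout_ms'
 * is assumed to have been computed from the next QUIC timer deadline.
 * Returns the number of ready events, 0 on timeout, or -1 on error.
 */
static int wait_for_events(int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd     = fd;
    pfd.events = POLLIN | POLLOUT;

    return poll(&pfd, 1, timeout_ms);
}
```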
### Blocking sockets and select(3)

Firstly, consider how this might be accomplished using the Berkeley sockets
API. Blocking on all three wakeup conditions listed above would require use of
an API such as select(3) or poll(3), regardless of whether the network socket
is configured in blocking mode or not.

While in principle APIs such as select(3) can be used with a socket in
blocking mode, this is not an advisable usage mode. If a socket is in blocking
mode, calls to send(3) or recv(3) may block for some arbitrary period of time,
meaning that our QUIC implementation cannot handle incoming data (if we are
blocked on send), send outgoing data (if we are blocked on receive), or handle
timeout events.

Though it can be argued that a select(3) call indicating readability or
writeability should guarantee that a subsequent send(3) or recv(3) call will
not block, there are several reasons why this is an extremely undesirable
solution:

- It is quite likely that there are buggy OSes out there which perform
  spurious wakeups from select(3).

- The fact that a socket is writeable does not necessarily mean that a
  datagram of the size we wish to send can be written, so a send(3) call
  could block anyway.

- This usage pattern precludes multithreaded use, barring some locking
  scheme, due to the possibility of other threads racing between the call to
  select(3) and the subsequent I/O call. This undermines our intention to
  support multi-threaded network I/O on the backend.

Moreover, our QUIC implementation will not drive the Berkeley sockets API
directly but uses the BIO abstraction to access the network, so these issues
are then compounded by the limitations of our existing BIO interfaces. We do
not have a BIO interface which provides for select(3)-like functionality or
which can implement the required semantics above. Therefore, trying to
implement QUIC on top of blocking I/O in this way would require violating the
BIO abstraction layer, and would not work with custom BIOs.

### Blocking sockets and threads

Another conceptual possibility is that blocking calls could be kept ongoing in
parallel threads. Under this model, there would be three threads:

- a thread which exists solely to execute blocking calls to the `BIO_write` of
  an underlying network BIO,
- a thread which exists solely to execute blocking calls to the `BIO_read` of
  an underlying network BIO,
- a thread which exists solely to wait for and dispatch timeout events.

This has a large number of disadvantages:

- There is a hard requirement for threading functionality in order to be
  able to support blocking semantics at the application level. Use of blocking
  semantics at the application level would therefore have a hard requirement
  on thread assisted mode. In environments where threading support is not
  available or desired, our APIs would only be usable in a non-blocking
  fashion.

- Several threads are spawned which the application is not in control of.
  This undermines our general approach of providing the application with
  control over OpenSSL's use of resources, such as allowing the application to
  do its own polling or provide its own allocators.

  At a minimum for a client, there must be two threads per connection. This
  means that if an application opens many outgoing connections, there will
  need to be `2n` extra threads spawned.

- By blocking in `BIO_write` calls, this precludes correct implementation of
  QUIC. Unlike any analogue in TLS, QUIC packets are time sensitive and
  intended to be transmitted as soon as they are generated. QUIC packets
  contain fields such as the ACK Delay value, which is intended to describe
  the time between a packet being received and a return packet being
  generated. Correct calculation of this field is necessary for correct
  calculation of the connection RTT.
  It is therefore important to only generate packets when they are ready to
  be sent; otherwise, suboptimal performance will result. This is a usage
  model which aligns optimally with non-blocking I/O and which cannot be
  accommodated by blocking I/O.

- Since existing custom BIOs will not be expecting concurrent `BIO_read` and
  `BIO_write` calls, they will need to be adapted to support this, which is
  likely to require substantial rework of those custom BIOs (trivial locking
  of calls obviously does not work since both of these calls must be able to
  block on network I/O simultaneously).

Moreover, this does not appear to be a realistically implementable approach:

- The question is posed of how to handle connection teardown, which does not
  seem to be solvable. If parallel threads are blocked in `BIO_read` and
  `BIO_write` calls on some underlying network BIO, there needs to be some
  way to force these calls to return once `SSL_free` is called and we need to
  tear down the connection. However, the BIO interface does not provide
  any way to do this. *At best* we might assume the BIO is a `BIO_s_dgram`
  (but cannot assume this in the general case), but even then we can only
  accomplish teardown by violating the BIO abstraction and closing the
  underlying socket.

  This is the only portable way to ensure that a recv(3) call to the same
  socket returns. This obviously is a highly application-visible change (and
  is likely to be far more disruptive than configuring the socket into
  non-blocking mode).

  Moreover, it is not workable anyway because it only works for a socket-based
  BIO and violates the BIO abstraction. For BIOs in general, there does not
  appear to be any viable solution to the teardown issue.

Even if this approach were successfully implemented, applications would still
need to change to using network BIOs with datagram semantics. For applications
using custom BIOs, this is likely to require substantial rework of those BIOs.
There is no possible way around this. Thus, even if this solution were adopted
(notwithstanding the issues noted above which preclude it) for the purposes of
accommodating applications using custom network BIOs in a blocking mode, these
applications would still have to completely rework their implementation of
those BIOs. In any case, it is expected to be very rare that sophisticated
applications implementing their own custom BIOs will do so in a blocking mode.

### Use of non-blocking I/O

By comparison, use of non-blocking I/O and select(3) or similar APIs on the
network side makes satisfying our requirements for QUIC easy, and also allows
our internal approach to I/O to be flexibly adapted in the future as
requirements may evolve.

This is also the approach used by all other known QUIC implementations; it is
highly unlikely that any QUIC implementations exist which use blocking network
I/O, as (as mentioned above) it would lead to suboptimal performance due to
the ACK delay issue.

Note that this is orthogonal to whether we provide blocking I/O semantics to
the application. We can use non-blocking I/O internally while using this to
provide either blocking or non-blocking semantics to the application, based on
what the application requests.

This approach in general requires that a network socket be configured in
non-blocking mode. Though some OSes support a `MSG_DONTWAIT` flag which allows
a single I/O operation to be made non-blocking, not all OSes support this
(e.g. Windows), thus this cannot be relied on. As such, we need to configure
any socket FD we use into non-blocking mode.
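As a minimal sketch of this configuration step (assuming a POSIX platform; the
helper name is illustrative, and on Windows the equivalent operation would use
`ioctlsocket()` with `FIONBIO`):

```c
#include <fcntl.h>

/* Illustrative helper: place a socket FD into non-blocking mode.
 * Returns 1 on success, 0 on failure. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return 0;

    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) >= 0 ? 1 : 0;
}
```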
Of the approaches outlined in this document, the use of non-blocking I/O has
the fewest disadvantages and is the only approach which appears to actually be
implementable in practice. Moreover, each disadvantage can be readily
mitigated:

  - We rely on having a select(3)- or poll(3)-like function available from the
    OS.

    However:

    - Firstly, we already rely on select(3) in our code, so this does not
      appear to raise any portability issues;

    - Secondly, we have the option of providing a custom poller interface
      which allows an application to provide its own implementation of a
      select(3)-like function (see the illustrative sketch after this list).
      In fact, this has the potential to be quite powerful and would allow the
      application to implement its own pollable BIOs, and therefore perform
      blocking I/O on top of any custom BIO.

      For example, while historically none of our own memory-based BIOs have
      supported blocking semantics, a sophisticated application could if it
      wished choose to implement a custom blocking memory BIO and implement a
      custom poller which synchronises using a custom poll descriptor based
      around condition variables rather than sockets. Thus this scheme is
      highly flexible.

      (It is worth noting that the implementation of blocking semantics at
      the application level does not rely on any privileged access to the
      internals of the QUIC implementation, and an application could if it
      wished build blocking semantics out of a non-blocking QUIC instance;
      this is not particularly difficult, though providing custom pollers
      here would mean there should be no need for an application to do so.)

  - Configuring a socket into non-blocking mode might confuse an application.

    However:

    - Applications will already have to make changes to any network-side
      BIOs, for example switching from a `BIO_s_socket` to a `BIO_s_dgram`,
      or from a BIO pair to a `BIO_s_dgram_pair`. Custom BIOs will need to be
      substantially reworked to switch from bytestream semantics to datagram
      semantics. Such applications will already need substantial changes, and
      this is unavoidable.

      Of course, application impacts and migration guidance can (and will)
      all be documented.

    - In order for an application to be confused by us putting a socket into
      non-blocking mode, it would need to be trying to use the socket in some
      way. But it is not possible for an application to pass a socket to our
      QUIC implementation, also try to use the socket directly, and have QUIC
      still work. Using QUIC necessarily requires that an application not
      also be trying to make use of the same socket.

    - There are some circumstances where an application might want to
      multiplex other protocols onto the same UDP socket, for example with
      protocols like RTP/RTCP or STUN; this can be facilitated using the QUIC
      fixed bit. However, these use cases cannot be supported without
      explicit assistance from a QUIC implementation, and this use case
      cannot be facilitated by simply sharing a network socket, as incoming
      datagrams would not be routed correctly. (We may offer some
      functionality in future to allow this to be coordinated, but this is
      not for MVP.) Thus this also is not a concern. Moreover, it is
      extremely unlikely that any such applications are using sockets in
      blocking mode anyway.
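Purely as an illustration of what such a custom poller interface might look
like (a hypothetical sketch, not an actual OpenSSL API; the
`BIO_POLL_DESCRIPTOR` type it refers to is introduced later in this document):

```c
#include <stddef.h>
#include <sys/time.h>

/*
 * Hypothetical sketch of a custom poller callback type. The poller is
 * given a set of poll descriptors to wait on and an optional deadline,
 * and returns once at least one descriptor is ready or the deadline has
 * passed. 'arg' is an opaque application pointer.
 */
typedef int (*APP_POLLER_FN)(const BIO_POLL_DESCRIPTOR *items,
                             size_t num_items,
                             const struct timeval *deadline,
                             void *arg);
```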
Advantages:

  - An application retains full control of its event loop in non-blocking
    mode.

    When using libssl in application-level blocking mode, via a custom poller
    interface, the application would actually be able to exercise more
    control over I/O than it can at present when using libssl in blocking
    mode.

  - Feasible to implement and already working in tests. Minimises further
    development needed to ship.

  - Does not rely on creating threads and can support blocking I/O at the
    application level without relying on thread assisted mode.

  - Does not require an application-provided network-side custom BIO to be
    reworked to support concurrent calls to it.

  - Allows performance-optimal implementation of QUIC RFC requirements.

  - Ensures our internal I/O architecture remains flexible for future
    evolution without breaking compatibility.

Use of Internal Non-Blocking I/O
--------------------------------

Based on the above evaluation, implementation has been undertaken using
non-blocking I/O internally. Applications can use blocking or non-blocking I/O
at the libssl API level. Network-level BIOs must operate in a non-blocking
mode or be configurable by QUIC to this end.

### Support of arbitrary BIOs

We need to support not just socket FDs but arbitrary BIOs as the basis for the
use of QUIC. The use of QUIC with e.g. `BIO_s_dgram_pair`, a bidirectional
memory buffer with datagram semantics, is to be supported as part of MVP. This
must be reconciled with the desire to support application-managed event loops.

Broadly, the intention so far has been to enable the use of QUIC with an
application event loop in application-level non-blocking mode by exposing an
appropriate OS-level synchronisation primitive to the application. On \*NIX
platforms, this essentially means we provide the application with:

  - An FD which should be polled for readability, writability, or both; and
  - A deadline (if any is currently applicable).

Once either of these conditions is met, the QUIC state machine can be
(potentially) advanced meaningfully, and the application is expected to
reenter the QUIC state machine by calling `SSL_tick()` (or `SSL_read()` or
`SSL_write()`); a sketch of one iteration of such a loop is given below.

This model is readily supported when the read and write BIOs we are provided
with are socket BIOs:

  - The read-pollable FD is the FD of the read BIO.
  - The write-pollable FD is the FD of the write BIO.
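As a purely illustrative sketch of this model (assuming a single pollable FD
and a relative timeout already derived by the application from the QUIC
deadline; the exact API for retrieving the deadline is not specified here):

```c
#include <poll.h>
#include <openssl/ssl.h>

/*
 * Illustrative only: one iteration of an application-managed event
 * loop. 'fd' and 'timeout_ms' are assumed to have been derived from the
 * SSL object by the application.
 */
static void app_event_loop_iteration(SSL *ssl, int fd, int timeout_ms)
{
    struct pollfd pfd;

    pfd.fd     = fd;
    pfd.events = POLLIN | POLLOUT;

    /* Wait until the FD is ready or the QUIC deadline expires. */
    poll(&pfd, 1, timeout_ms);

    /* In either case, the QUIC state machine may now be able to make
     * progress, so reenter it. */
    SSL_tick(ssl);
}
```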
However, things become more complex when we are dealing with memory-based BIOs
such as `BIO_s_dgram_pair` which do not naturally correspond to any OS
primitive which can be used for synchronisation, or when we are dealing with
an application-provided custom BIO.

### Pollable and Non-Pollable BIOs

In order to accommodate these various cases, we draw a distinction between
pollable and non-pollable BIOs.

  - A pollable BIO is a BIO which can provide some kind of OS-level
    synchronisation primitive, which can be used to determine when
    the BIO might be able to do useful work once more.

  - A non-pollable BIO has no naturally associated OS-level synchronisation
    primitive; its state changes only in response to calls made to it (or to
    a related BIO, such as the other end of a pair).

#### Supporting Pollable BIOs

“OS-level synchronisation primitive” is deliberately vague. Most modern OSes
use unified handle spaces (UNIX, Windows), though it is likely there are more
obscure APIs on these platforms which have other handle spaces. However, this
unification is not necessarily significant.

For example, Windows sockets are kernel handles, and thus like any other
object they can be used with the generic Win32 `WaitForSingleObject()` API,
but not in a useful manner; the generic readiness mechanism for Windows
handles is not plumbed in for socket handles, and so sockets are simply never
considered ready for the purposes of this API, meaning such a wait would never
return. Instead, the WinSock-specific `select()` call must be used. On the
other hand, other kinds of synchronisation primitive, like a Win32 Event, must
use `WaitForSingleObject()`.

Thus, while in theory most modern operating systems have unified handle
spaces, in practice there are substantial usage differences between different
handle types. As such, an API to expose a synchronisation primitive should be
of a tagged union design, supporting possible variation.

A BIO object will provide methods to retrieve a pollable OS-level
synchronisation primitive which can be used to determine when the QUIC state
machine can (potentially) do more work. This maintains the integrity of the
BIO abstraction layer. Equivalent SSL object API calls which forward to the
equivalent calls of the underlying network BIO will also be provided.

The core mechanic is as follows:

```c
#define BIO_POLL_DESCRIPTOR_TYPE_NONE       0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD    1
#define BIO_POLL_DESCRIPTOR_CUSTOM_START    8192

#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM      4

typedef struct bio_poll_descriptor_st {
    int type;
    union {
        int fd;
        union {
            void     *ptr;
            uint64_t  u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM];
    } value;
} BIO_POLL_DESCRIPTOR;

int BIO_get_rpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);
int BIO_get_wpoll_descriptor(BIO *b, BIO_POLL_DESCRIPTOR *desc);

int SSL_get_rpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
int SSL_get_wpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
```

Currently only a single descriptor type is defined, which is an FD on \*NIX
and a Winsock socket handle on Windows. These use the same type to minimise
code changes needed on different platforms in the common case of an OS network
socket. (Use of an `int` here is strictly incorrect for Windows; however, this
style of usage is prevalent in the OpenSSL codebase, so for consistency we
continue the pattern here.)

Poll descriptor types at or above `BIO_POLL_DESCRIPTOR_CUSTOM_START` are
reserved for application-defined use. The `value.custom` field of the
`BIO_POLL_DESCRIPTOR` structure is provided for applications to store values
of their choice in. An application is free to define the semantics.

libssl will not know how to poll custom poll descriptors itself, thus these
are only useful when the application provides a custom poller function, which
performs polling on behalf of libssl and which implements support for those
custom poll descriptors.
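Purely as an illustration of how such a custom poller might dispatch on
descriptor types (the application-defined type value and the two helper
functions here are hypothetical):

```c
/* Hypothetical helpers implemented elsewhere by the application. */
int app_poll_socket(int fd, int timeout_ms);
int app_poll_cv(void *cv, int timeout_ms);

/* Hypothetical application-defined descriptor type. */
#define APP_POLL_DESCRIPTOR_TYPE_CV  BIO_POLL_DESCRIPTOR_CUSTOM_START

static int app_poll_one(const BIO_POLL_DESCRIPTOR *desc, int timeout_ms)
{
    switch (desc->type) {
    case BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD:
        /* An OS socket; wait using poll(2), select(2), WSAPoll(), etc. */
        return app_poll_socket(desc->value.fd, timeout_ms);

    case APP_POLL_DESCRIPTOR_TYPE_CV:
        /* Application-defined semantics, e.g. a condition variable
         * stored in desc->value.custom[0].ptr. */
        return app_poll_cv(desc->value.custom[0].ptr, timeout_ms);

    default:
        return 0; /* unknown descriptor type */
    }
}
```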
For `BIO_s_ssl`, the `BIO_get_[rw]poll_descriptor` functions are equivalent to
the `SSL_get_[rw]poll_descriptor` functions. The `SSL_get_[rw]poll_descriptor`
functions are equivalent to calling `BIO_get_[rw]poll_descriptor` on the
underlying BIOs provided to the SSL object. For a socket BIO, this will likely
just yield the socket's FD. For memory-based BIOs, see below.

#### Supporting Non-Pollable BIOs

Where we are provided with a non-pollable BIO, we cannot provide the
application with any primitive used for synchronisation, and it is assumed
that the application will handle its own network I/O, for example via a
`BIO_s_dgram_pair`.

When libssl calls `BIO_get_[rw]poll_descriptor` on the underlying BIO, the
call fails, indicating that a non-pollable BIO is being used. Thus, if an
application calls `SSL_get_[rw]poll_descriptor`, that call also fails.

There are various circumstances which need to be handled:

  - The QUIC implementation wants to write data to the network but
    is currently unable to (e.g. `BIO_s_dgram_pair` is full).

    This is not hard, as our internal TX record layer allows arbitrary
    buffering. The only limit comes when QUIC flow control (which only
    applies to application stream data) applies a limit; then calls to e.g.
    `SSL_write` must fail with `SSL_ERROR_WANT_WRITE`.

  - The QUIC implementation wants to read data from the network
    but is currently unable to (e.g. `BIO_s_dgram_pair` is empty).

    Here calls like `SSL_read` need to fail with `SSL_ERROR_WANT_READ`; we
    thereby support libssl's classic non-blocking I/O interface.

It is worth noting that theoretically a memory-based BIO could be implemented
which is pollable, for example using condition variables. An application could
implement a custom BIO, custom poll descriptor and custom poller to facilitate
this.

### Configuration of Blocking vs. Non-Blocking Mode

Traditionally an SSL object has operated either in blocking mode or
non-blocking mode without requiring explicit configuration; if a socket
returns EWOULDBLOCK or similar, it is handled appropriately, and if a socket
call blocks, there is no issue. Since the QUIC implementation is building on
non-blocking I/O, this implicit configuration of blocking or non-blocking
mode is not feasible.

Note that Windows does not have an API for determining whether a socket is in
blocking mode, so it is not possible to use the initial state of an underlying
socket to determine if the application wants to use non-blocking I/O or not.
Moreover, relying on this would undermine the BIO abstraction.

As such, an explicit call is introduced to configure an SSL (QUIC) object into
non-blocking mode:

```c
int SSL_set_blocking_mode(SSL *s, int blocking);
int SSL_get_blocking_mode(SSL *s);
```

Applications desiring non-blocking operation will need to call this API to
configure a new QUIC connection accordingly. Blocking mode is chosen as the
default for parity with traditional Berkeley sockets APIs and to make things
simpler for blocking applications, which are likely to be seeking a simpler
solution. However, blocking mode cannot be supported with a non-pollable BIO,
and thus blocking mode defaults to off when used with such a BIO.
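For example (a minimal usage sketch; `ctx` is assumed to be an `SSL_CTX`
created with a suitable QUIC method):

```c
#include <openssl/ssl.h>

/* Illustrative: create a QUIC connection SSL object and opt in to
 * non-blocking operation. */
static SSL *new_nonblocking_conn(SSL_CTX *ctx)
{
    SSL *ssl = SSL_new(ctx);

    if (ssl == NULL)
        return NULL;

    if (!SSL_set_blocking_mode(ssl, 0)) { /* 0 = non-blocking */
        SSL_free(ssl);
        return NULL;
    }

    return ssl;
}
```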
If +`BIO_get[rw]poll_descriptor` is not implemented for either of the underlying +read and write BIOs, blocking mode cannot be enabled and blocking mode defaults +to off. -- cgit v1.2.1