author | Lennart Poettering <lennart@poettering.net> | 2023-02-23 16:45:52 +0100 |
---|---|---|
committer | Lennart Poettering <lennart@poettering.net> | 2023-03-01 09:43:24 +0100 |
commit | a4b13ae1be0e8d8b811eb1979224597eca57425f (patch) | |
tree | 7dc8d9b90cbd380f1b5d799a01d0e7975222cb70 /docs | |
parent | 3b7101183cac4b35a8bd6ea2c1de9260c33f977f (diff) | |
doc: add document explaining memory pressure handling
Diffstat (limited to 'docs')
-rw-r--r-- | docs/MEMORY_PRESSURE.md | 240 |
1 files changed, 240 insertions, 0 deletions
diff --git a/docs/MEMORY_PRESSURE.md b/docs/MEMORY_PRESSURE.md
new file mode 100644
index 0000000000..6a14d74f44
--- /dev/null
+++ b/docs/MEMORY_PRESSURE.md
@@ -0,0 +1,240 @@
+---
+title: Memory Pressure Handling
+category: Interfaces
+layout: default
+SPDX-License-Identifier: LGPL-2.1-or-later
+---
+
+# Memory Pressure Handling in systemd
+
+When the system is under memory pressure (i.e. some component of the OS
+requires memory allocation but there is only very little or none available),
+it can attempt various things to make more memory available again ("reclaim"):
+
+* The kernel can flush out memory pages backed by files on disk, under the
+  knowledge that it can reread them from disk when needed again. Candidate
+  pages are the many memory mapped executable files and shared libraries on
+  disk, among others.
+
+* The kernel can flush out memory pages not backed by files on disk
+  ("anonymous" memory, i.e. memory allocated via `malloc()` and similar calls,
+  or `tmpfs` file system contents) if there's swap to write it to.
+
+* Userspace can proactively release memory it allocated but doesn't immediately
+  require back to the kernel. This includes allocation caches, and other forms
+  of caches that are not required for normal operation to continue.
+
+The latter is what we want to focus on in this document: how to ensure
+userspace processes can detect mounting memory pressure early and release memory
+back to the kernel as it happens, relieving the memory pressure before it
+becomes too critical.
+
+The effects of memory pressure during runtime generally are growing latencies
+during operation: when a program requires memory but the system is busy writing
+out memory to (relatively slow) disks in order to make some available, this
+generally surfaces in scheduling latencies, and applications and services will
+slow down until memory pressure is relieved.
+Hence, to ensure stable service latencies it is essential to release unneeded
+memory back to the kernel early on.
+
+On Linux the [Pressure Stall Information
+(PSI)](https://docs.kernel.org/accounting/psi.html) Linux kernel interface is
+the primary way to determine whether the system or a part of it is under memory
+pressure. PSI provides a way for userspace to acquire a `poll()`-able file
+descriptor that gets notifications whenever memory pressure latencies for the
+system or for a control group grow beyond some level.
+
+`systemd` itself makes use of PSI, and helps applications to do so
+too. Specifically:
+
+* Most of systemd's long running components watch for PSI memory pressure
+  events, and release allocation caches and other resources once seen.
+
+* systemd's service manager provides a protocol for asking services to listen
+  to PSI events and configure the appropriate pressure thresholds.
+
+* systemd's `sd-event` event loop API provides a high-level call
+  `sd_event_add_memory_pressure()` which allows programs using it to
+  efficiently hook into the PSI memory pressure protocol provided by the
+  service manager, with very few lines of code.
+
+## Memory Pressure Service Protocol
+
+If memory pressure handling for a specific service is enabled via
+`MemoryPressureWatch=` the memory pressure service protocol is used to tell the
+service code about this. Specifically two environment variables are set by the
+service manager, and typically consumed by the service:
+
+* The `$MEMORY_PRESSURE_WATCH` environment variable will contain an absolute
+  path in the file system to the file to watch for memory pressure events. This
+  will usually point to a PSI file such as the `memory.pressure` file of the
+  service's cgroup. In order to make debugging easier, and allow later
+  extension it is recommended for applications to also allow this path to refer
+  to an `AF_UNIX` stream socket in the file system or a FIFO inode in the file
+  system.
+  Regardless of which of the three types of inodes this absolute path refers to,
+  all three are `poll()`-able for memory pressure events. The variable can also be
+  set to the literal string `/dev/null`. If so the service code should take this as
+  indication that memory pressure monitoring is not desired and should be turned off.
+
+* The `$MEMORY_PRESSURE_WRITE` environment variable is optional. If set by the
+  service manager it contains Base64 encoded data (that may contain arbitrary
+  binary values, including NUL bytes) that should be written into the path
+  provided via `$MEMORY_PRESSURE_WATCH` right after opening it. Typically, if
+  talking directly to a PSI kernel file this will contain information about the
+  threshold settings configurable in the service manager.
+
+When a service initializes it hence should look for
+`$MEMORY_PRESSURE_WATCH`. If set, it should try to open the specified path. If
+it detects the path to refer to a regular file it should assume it refers to a
+PSI kernel file. If so, it should write the data from `$MEMORY_PRESSURE_WRITE`
+into the file descriptor (after Base64-decoding it, and only if the variable is
+set) and then watch for `POLLPRI` events on it. If it detects the path refers
+to a FIFO inode, it should open it, write the `$MEMORY_PRESSURE_WRITE` data
+into it (as above) and then watch for `POLLIN` events on it. Whenever `POLLIN`
+is seen it should read and discard any data queued in the FIFO. If the path
+refers to an `AF_UNIX` socket in the file system, the application should
+`connect()` a stream socket to it, write `$MEMORY_PRESSURE_WRITE` into it (as
+above) and watch for `POLLIN`, discarding any data it might receive.
+
+To summarize:
+
+* If `$MEMORY_PRESSURE_WATCH` points to a regular file: open and watch for
+  `POLLPRI`, never read from the file descriptor.
+
+* If `$MEMORY_PRESSURE_WATCH` points to a FIFO: open and watch for `POLLIN`,
+  read/discard any incoming data.
+
+* If `$MEMORY_PRESSURE_WATCH` points to an `AF_UNIX` socket: connect and watch
+  for `POLLIN`, read/discard any incoming data.
+
+* If `$MEMORY_PRESSURE_WATCH` contains the literal string `/dev/null`, turn off
+  memory pressure handling.
+
+(And in each case except the last, immediately after opening/connecting to the
+path, write the decoded `$MEMORY_PRESSURE_WRITE` data into it.)
+
+Whenever a `POLLPRI`/`POLLIN` event is seen the service is under memory
+pressure. It should use this as a hint to release suitable redundant resources,
+for example:
+
+* glibc's memory allocation cache, via
+  [`malloc_trim()`](https://man7.org/linux/man-pages/man3/malloc_trim.3.html). Similarly,
+  allocation caches implemented in the service itself.
+
+* Any other local caches, such as DNS caches, or web caches (in particular if
+  the service is a web browser).
+
+* Terminate any idle worker threads or processes.
+
+* Run a garbage collection (GC) cycle, if the programming language supports that.
+
+* Terminate the process if idle, and if it can be automatically started when
+  needed next.
+
+Which actions precisely to take depends on the service in question. Note that
+the notifications are delivered when memory allocation latency has already
+degraded beyond some point. Hence when discussing which resources to keep and
+which ones to discard it should be kept in mind that the latencies incurred
+when recovering the discarded resources at a later point are typically less of
+a problem, given that latencies *already* are affected negatively.
+
+In case the path supplied via `$MEMORY_PRESSURE_WATCH` points to a PSI kernel
+API file, or to an `AF_UNIX` socket, opening it multiple times is safe and reliable,
+and should deliver notifications to each of the opened file descriptors. This
+is specifically useful for services that consist of multiple processes, and
+where each of them shall be able to release resources on memory pressure.
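The service side of the protocol described above could be sketched in C roughly as follows. This is only an illustration under stated assumptions, not systemd's implementation: `memory_pressure_open()` and `memory_pressure_handle()` are hypothetical names, the caller is assumed to have Base64-decoded `$MEMORY_PRESSURE_WRITE` already, and opening the path read-only when there is nothing to write is a choice made for this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <malloc.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper, not a systemd API. `path` is the value of
 * $MEMORY_PRESSURE_WATCH; `data`/`size` is the already Base64-decoded
 * $MEMORY_PRESSURE_WRITE payload (size 0 if the variable is unset).
 * Returns 0 on success or a negative errno value. On success *ret_fd is the
 * descriptor to poll (-1 if monitoring is turned off) and *ret_events is the
 * poll() event mask to wait for. */
int memory_pressure_open(const char *path, const void *data, size_t size,
                         int *ret_fd, short *ret_events) {
        struct stat st;
        short events;
        int fd;

        assert(path && ret_fd && ret_events);
        *ret_fd = -1;
        *ret_events = 0;

        if (strcmp(path, "/dev/null") == 0)
                return 0; /* monitoring explicitly turned off */

        if (stat(path, &st) < 0)
                return -errno;

        if (S_ISSOCK(st.st_mode)) {
                struct sockaddr_un sa = { .sun_family = AF_UNIX };

                if (strlen(path) >= sizeof(sa.sun_path))
                        return -ENAMETOOLONG;
                strcpy(sa.sun_path, path);

                fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
                if (fd < 0)
                        return -errno;
                if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
                        int e = errno;
                        close(fd);
                        return -e;
                }
                events = POLLIN;
        } else if (S_ISFIFO(st.st_mode) || S_ISREG(st.st_mode)) {
                /* Assumption of this sketch: open writable only if needed. */
                fd = open(path, (size > 0 ? O_RDWR : O_RDONLY) | O_CLOEXEC | O_NONBLOCK);
                if (fd < 0)
                        return -errno;
                /* Regular file is taken to be a PSI file: wait for POLLPRI. */
                events = S_ISREG(st.st_mode) ? POLLPRI : POLLIN;
        } else
                return -EBADF;

        if (size > 0) {
                ssize_t n = write(fd, data, size);
                if (n != (ssize_t) size) {
                        int e = n < 0 ? errno : EIO;
                        close(fd);
                        return -e;
                }
        }

        *ret_fd = fd;
        *ret_events = events;
        return 0;
}

/* Hypothetical reaction to a wakeup: drain FIFO/socket data, never read from
 * a PSI file, then release glibc's allocation cache. */
void memory_pressure_handle(int fd, short events) {
        if (events & POLLIN) {
                char buf[256];
                ssize_t n = read(fd, buf, sizeof(buf)); /* discard queued data */
                (void) n;
        }
        (void) malloc_trim(0);
}
```

A caller would place the returned descriptor and event mask into a `struct pollfd`, call `poll()`, and invoke `memory_pressure_handle()` whenever the descriptor triggers; a service with its own caches would drop those as well at that point.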
+
+The `POLLPRI`/`POLLIN` conditions will be triggered every time memory pressure
+is detected, but not continuously. It is thus safe to keep `poll()`-ing on the
+same file descriptor continuously, and execute resource release operations
+whenever the file descriptor triggers without having to expect overloading the
+process.
+
+(Currently, the protocol defined here only allows configuration of a single
+"degree" of memory pressure, there's no distinction made on how strong the
+pressure is. In future, if it becomes apparent that there's clear need to
+extend this we might eventually add different degrees, most likely by adding
+additional environment variables such as `$MEMORY_PRESSURE_WRITE_LOW` and
+`$MEMORY_PRESSURE_WRITE_HIGH` or similar, which may contain different settings
+for lower or higher memory pressure thresholds.)
+
+## Service Manager Settings
+
+The service manager provides two per-service settings that control the memory
+pressure handling:
+
+* The
+  [`MemoryPressureWatch=`](https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryPressureWatch=)
+  setting controls whether to enable the memory pressure protocol for the
+  service in question.
+
+* The `MemoryPressureThresholdSec=` setting configures the threshold at which
+  memory pressure is signalled to the service. It takes a time value
+  (usually in the millisecond range) that defines a threshold per 1s time
+  window: if memory allocation latencies grow beyond this threshold
+  notifications are generated towards the service, requesting it to release
+  resources.
+
+The `/etc/systemd/system.conf` file provides two settings that may be used to
+select the default values for the above settings. If the threshold is neither
+configured via the per-service nor via the default system-wide option, it
+defaults to 100ms.
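To make the threshold setting more concrete, the sketch below formats a kernel PSI trigger string of the form `<some|full> <stall amount in us> <time window in us>`, as described in the kernel's PSI documentation, from a threshold and a window. The helper name `psi_format_trigger()` is hypothetical, and whether the service manager places exactly this string into `$MEMORY_PRESSURE_WRITE` is an assumption here, not something this document states.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes a PSI "some" trigger for the given threshold
 * and window (both in microseconds) into buf. Returns the string length, or
 * -1 if the values are implausible or the buffer is too small. The window
 * limits (500ms..10s) follow the kernel PSI documentation. */
int psi_format_trigger(char *buf, size_t size,
                       unsigned long threshold_usec, unsigned long window_usec) {
        int n;

        if (window_usec < 500000UL || window_usec > 10000000UL)
                return -1;
        if (threshold_usec == 0 || threshold_usec > window_usec)
                return -1;

        n = snprintf(buf, size, "some %lu %lu", threshold_usec, window_usec);
        if (n < 0 || (size_t) n >= size)
                return -1;

        return n;
}
```

With the default of 100ms per 1s window mentioned above, the helper would yield `some 100000 1000000`.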
+
+When memory pressure monitoring is enabled for a service via
+`MemoryPressureWatch=` this primarily does three things:
+
+* It enables cgroup memory accounting for the service (this is a requirement
+  for per-cgroup PSI).
+
+* It sets the aforementioned two environment variables for processes invoked
+  for the service, based on the control group of the service and provided
+  settings.
+
+* The `memory.pressure` PSI control group file associated with the service's
+  cgroup is delegated to the service (i.e. permissions are relaxed so that
+  unprivileged service payload code can open the file for writing).
+
+## Memory Pressure Events in `sd-event`
+
+The
+[`sd-event`](https://www.freedesktop.org/software/systemd/man/sd-event.html)
+event loop library provides two API calls that encapsulate the
+functionality described above:
+
+* The
+  [`sd_event_add_memory_pressure()`](https://www.freedesktop.org/software/systemd/man/sd_event_add_memory_pressure.html)
+  call implements the service-side of the memory pressure protocol and
+  integrates it with an `sd-event` event loop. It reads the two environment
+  variables, connects/opens the specified file, writes the specified data
+  to it and then watches for events.
+
+* The `sd_event_trim_memory()` call may be called to trim the calling
+  process's memory. It's a wrapper around glibc's `malloc_trim()`, but first
+  releases allocation caches maintained by libsystemd internally. If the
+  callback function passed to `sd_event_add_memory_pressure()` is passed as
+  `NULL` this function is called as default implementation.
+
+Making use of this, in order to hook up a service using `sd-event` with
+automatic memory pressure handling, it's typically sufficient to add a line
+such as:
+
+```c
+(void) sd_event_add_memory_pressure(event, NULL, NULL, NULL);
+```
+
+– right after allocating the event loop object `event`.
+
+## Other APIs
+
+Other programming environments might have native APIs to watch memory
+pressure/low memory events. Most notable is probably GLib's
+[GMemoryMonitor](https://developer-old.gnome.org/gio/stable/GMemoryMonitor.html). It
+currently uses the per-system Linux PSI interface as backend, but it operates
+differently than the above: memory pressure events are picked up by a system
+service, which then propagates this through D-Bus to the applications. This is
+typically less than ideal, since this means each notification event has to
+travel through three processes before being handled, and this creates
+additional latencies at a time where the system is already experiencing adverse
+latencies. Moreover, it focuses on system-wide PSI events, even though
+service-local ones are generally the better approach.