diff options
606 files changed, 11563 insertions, 8072 deletions
@@ -33,6 +33,8 @@ Al Viro <viro@zenIV.linux.org.uk> Andi Kleen <ak@linux.intel.com> <ak@suse.de> Andi Shyti <andi@etezian.org> <andi.shyti@samsung.com> Andreas Herrmann <aherrman@de.ibm.com> +Andrej Shadura <andrew.shadura@collabora.co.uk> +Andrej Shadura <andrew@shadura.me> <andrew@beldisplaytech.com> Andrew Morton <akpm@linux-foundation.org> Andrew Murray <amurray@thegoodpenguin.co.uk> <amurray@embedded-bits.co.uk> Andrew Murray <amurray@thegoodpenguin.co.uk> <andrew.murray@arm.com> diff --git a/Documentation/block/inline-encryption.rst b/Documentation/block/inline-encryption.rst index 7f9b40d6b416..71d1044617a9 100644 --- a/Documentation/block/inline-encryption.rst +++ b/Documentation/block/inline-encryption.rst @@ -7,230 +7,269 @@ Inline Encryption Background ========== -Inline encryption hardware sits logically between memory and the disk, and can -en/decrypt data as it goes in/out of the disk. Inline encryption hardware has a -fixed number of "keyslots" - slots into which encryption contexts (i.e. the -encryption key, encryption algorithm, data unit size) can be programmed by the -kernel at any time. Each request sent to the disk can be tagged with the index -of a keyslot (and also a data unit number to act as an encryption tweak), and -the inline encryption hardware will en/decrypt the data in the request with the -encryption context programmed into that keyslot. This is very different from -full disk encryption solutions like self encrypting drives/TCG OPAL/ATA -Security standards, since with inline encryption, any block on disk could be -encrypted with any encryption context the kernel chooses. - +Inline encryption hardware sits logically between memory and disk, and can +en/decrypt data as it goes in/out of the disk. For each I/O request, software +can control exactly how the inline encryption hardware will en/decrypt the data +in terms of key, algorithm, data unit size (the granularity of en/decryption), +and data unit number (a value that determines the initialization vector(s)). + +Some inline encryption hardware accepts all encryption parameters including raw +keys directly in low-level I/O requests. However, most inline encryption +hardware instead has a fixed number of "keyslots" and requires that the key, +algorithm, and data unit size first be programmed into a keyslot. Each +low-level I/O request then just contains a keyslot index and data unit number. + +Note that inline encryption hardware is very different from traditional crypto +accelerators, which are supported through the kernel crypto API. Traditional +crypto accelerators operate on memory regions, whereas inline encryption +hardware operates on I/O requests. Thus, inline encryption hardware needs to be +managed by the block layer, not the kernel crypto API. + +Inline encryption hardware is also very different from "self-encrypting drives", +such as those based on the TCG Opal or ATA Security standards. Self-encrypting +drives don't provide fine-grained control of encryption and provide no way to +verify the correctness of the resulting ciphertext. Inline encryption hardware +provides fine-grained control of encryption, including the choice of key and +initialization vector for each sector, and can be tested for correctness. Objective ========= -We want to support inline encryption (IE) in the kernel. -To allow for testing, we also want a crypto API fallback when actual -IE hardware is absent. We also want IE to work with layered devices -like dm and loopback (i.e. we want to be able to use the IE hardware -of the underlying devices if present, or else fall back to crypto API -en/decryption). - +We want to support inline encryption in the kernel. To make testing easier, we +also want support for falling back to the kernel crypto API when actual inline +encryption hardware is absent. We also want inline encryption to work with +layered devices like device-mapper and loopback (i.e. we want to be able to use +the inline encryption hardware of the underlying devices if present, or else +fall back to crypto API en/decryption). Constraints and notes ===================== -- IE hardware has a limited number of "keyslots" that can be programmed - with an encryption context (key, algorithm, data unit size, etc.) at any time. - One can specify a keyslot in a data request made to the device, and the - device will en/decrypt the data using the encryption context programmed into - that specified keyslot. When possible, we want to make multiple requests with - the same encryption context share the same keyslot. - -- We need a way for upper layers like filesystems to specify an encryption - context to use for en/decrypting a struct bio, and a device driver (like UFS) - needs to be able to use that encryption context when it processes the bio. - -- We need a way for device drivers to expose their inline encryption - capabilities in a unified way to the upper layers. - - -Design -====== - -We add a struct bio_crypt_ctx to struct bio that can -represent an encryption context, because we need to be able to pass this -encryption context from the upper layers (like the fs layer) to the -device driver to act upon. - -While IE hardware works on the notion of keyslots, the FS layer has no -knowledge of keyslots - it simply wants to specify an encryption context to -use while en/decrypting a bio. - -We introduce a keyslot manager (KSM) that handles the translation from -encryption contexts specified by the FS to keyslots on the IE hardware. -This KSM also serves as the way IE hardware can expose its capabilities to -upper layers. The generic mode of operation is: each device driver that wants -to support IE will construct a KSM and set it up in its struct request_queue. -Upper layers that want to use IE on this device can then use this KSM in -the device's struct request_queue to translate an encryption context into -a keyslot. The presence of the KSM in the request queue shall be used to mean -that the device supports IE. - -The KSM uses refcounts to track which keyslots are idle (either they have no -encryption context programmed, or there are no in-flight struct bios -referencing that keyslot). When a new encryption context needs a keyslot, it -tries to find a keyslot that has already been programmed with the same -encryption context, and if there is no such keyslot, it evicts the least -recently used idle keyslot and programs the new encryption context into that -one. If no idle keyslots are available, then the caller will sleep until there -is at least one. - - -blk-mq changes, other block layer changes and blk-crypto-fallback -================================================================= - -We add a pointer to a ``bi_crypt_context`` and ``keyslot`` to -struct request. These will be referred to as the ``crypto fields`` -for the request. This ``keyslot`` is the keyslot into which the -``bi_crypt_context`` has been programmed in the KSM of the ``request_queue`` -that this request is being sent to. - -We introduce ``block/blk-crypto-fallback.c``, which allows upper layers to remain -blissfully unaware of whether or not real inline encryption hardware is present -underneath. When a bio is submitted with a target ``request_queue`` that doesn't -support the encryption context specified with the bio, the block layer will -en/decrypt the bio with the blk-crypto-fallback. - -If the bio is a ``WRITE`` bio, a bounce bio is allocated, and the data in the bio -is encrypted stored in the bounce bio - blk-mq will then proceed to process the -bounce bio as if it were not encrypted at all (except when blk-integrity is -concerned). ``blk-crypto-fallback`` sets the bounce bio's ``bi_end_io`` to an -internal function that cleans up the bounce bio and ends the original bio. - -If the bio is a ``READ`` bio, the bio's ``bi_end_io`` (and also ``bi_private``) -is saved and overwritten by ``blk-crypto-fallback`` to -``bio_crypto_fallback_decrypt_bio``. The bio's ``bi_crypt_context`` is also -overwritten with ``NULL``, so that to the rest of the stack, the bio looks -as if it was a regular bio that never had an encryption context specified. -``bio_crypto_fallback_decrypt_bio`` will decrypt the bio, restore the original -``bi_end_io`` (and also ``bi_private``) and end the bio again. - -Regardless of whether real inline encryption hardware is used or the +- We need a way for upper layers (e.g. filesystems) to specify an encryption + context to use for en/decrypting a bio, and device drivers (e.g. UFSHCD) need + to be able to use that encryption context when they process the request. + Encryption contexts also introduce constraints on bio merging; the block layer + needs to be aware of these constraints. + +- Different inline encryption hardware has different supported algorithms, + supported data unit sizes, maximum data unit numbers, etc. We call these + properties the "crypto capabilities". We need a way for device drivers to + advertise crypto capabilities to upper layers in a generic way. + +- Inline encryption hardware usually (but not always) requires that keys be + programmed into keyslots before being used. Since programming keyslots may be + slow and there may not be very many keyslots, we shouldn't just program the + key for every I/O request, but rather keep track of which keys are in the + keyslots and reuse an already-programmed keyslot when possible. + +- Upper layers typically define a specific end-of-life for crypto keys, e.g. + when an encrypted directory is locked or when a crypto mapping is torn down. + At these times, keys are wiped from memory. We must provide a way for upper + layers to also evict keys from any keyslots they are present in. + +- When possible, device-mapper devices must be able to pass through the inline + encryption support of their underlying devices. However, it doesn't make + sense for device-mapper devices to have keyslots themselves. + +Basic design +============ + +We introduce ``struct blk_crypto_key`` to represent an inline encryption key and +how it will be used. This includes the actual bytes of the key; the size of the +key; the algorithm and data unit size the key will be used with; and the number +of bytes needed to represent the maximum data unit number the key will be used +with. + +We introduce ``struct bio_crypt_ctx`` to represent an encryption context. It +contains a data unit number and a pointer to a blk_crypto_key. We add pointers +to a bio_crypt_ctx to ``struct bio`` and ``struct request``; this allows users +of the block layer (e.g. filesystems) to provide an encryption context when +creating a bio and have it be passed down the stack for processing by the block +layer and device drivers. Note that the encryption context doesn't explicitly +say whether to encrypt or decrypt, as that is implicit from the direction of the +bio; WRITE means encrypt, and READ means decrypt. + +We also introduce ``struct blk_crypto_profile`` to contain all generic inline +encryption-related state for a particular inline encryption device. The +blk_crypto_profile serves as the way that drivers for inline encryption hardware +advertise their crypto capabilities and provide certain functions (e.g., +functions to program and evict keys) to upper layers. Each device driver that +wants to support inline encryption will construct a blk_crypto_profile, then +associate it with the disk's request_queue. + +The blk_crypto_profile also manages the hardware's keyslots, when applicable. +This happens in the block layer, so that users of the block layer can just +specify encryption contexts and don't need to know about keyslots at all, nor do +device drivers need to care about most details of keyslot management. + +Specifically, for each keyslot, the block layer (via the blk_crypto_profile) +keeps track of which blk_crypto_key that keyslot contains (if any), and how many +in-flight I/O requests are using it. When the block layer creates a +``struct request`` for a bio that has an encryption context, it grabs a keyslot +that already contains the key if possible. Otherwise it waits for an idle +keyslot (a keyslot that isn't in-use by any I/O), then programs the key into the +least-recently-used idle keyslot using the function the device driver provided. +In both cases, the resulting keyslot is stored in the ``crypt_keyslot`` field of +the request, where it is then accessible to device drivers and is released after +the request completes. + +``struct request`` also contains a pointer to the original bio_crypt_ctx. +Requests can be built from multiple bios, and the block layer must take the +encryption context into account when trying to merge bios and requests. For two +bios/requests to be merged, they must have compatible encryption contexts: both +unencrypted, or both encrypted with the same key and contiguous data unit +numbers. Only the encryption context for the first bio in a request is +retained, since the remaining bios have been verified to be merge-compatible +with the first bio. + +To make it possible for inline encryption to work with request_queue based +layered devices, when a request is cloned, its encryption context is cloned as +well. When the cloned request is submitted, it is then processed as usual; this +includes getting a keyslot from the clone's target device if needed. + +blk-crypto-fallback +=================== + +It is desirable for the inline encryption support of upper layers (e.g. +filesystems) to be testable without real inline encryption hardware, and +likewise for the block layer's keyslot management logic. It is also desirable +to allow upper layers to just always use inline encryption rather than have to +implement encryption in multiple ways. + +Therefore, we also introduce *blk-crypto-fallback*, which is an implementation +of inline encryption using the kernel crypto API. blk-crypto-fallback is built +into the block layer, so it works on any block device without any special setup. +Essentially, when a bio with an encryption context is submitted to a +request_queue that doesn't support that encryption context, the block layer will +handle en/decryption of the bio using blk-crypto-fallback. + +For encryption, the data cannot be encrypted in-place, as callers usually rely +on it being unmodified. Instead, blk-crypto-fallback allocates bounce pages, +fills a new bio with those bounce pages, encrypts the data into those bounce +pages, and submits that "bounce" bio. When the bounce bio completes, +blk-crypto-fallback completes the original bio. If the original bio is too +large, multiple bounce bios may be required; see the code for details. + +For decryption, blk-crypto-fallback "wraps" the bio's completion callback +(``bi_complete``) and private data (``bi_private``) with its own, unsets the +bio's encryption context, then submits the bio. If the read completes +successfully, blk-crypto-fallback restores the bio's original completion +callback and private data, then decrypts the bio's data in-place using the +kernel crypto API. Decryption happens from a workqueue, as it may sleep. +Afterwards, blk-crypto-fallback completes the bio. + +In both cases, the bios that blk-crypto-fallback submits no longer have an +encryption context. Therefore, lower layers only see standard unencrypted I/O. + +blk-crypto-fallback also defines its own blk_crypto_profile and has its own +"keyslots"; its keyslots contain ``struct crypto_skcipher`` objects. The reason +for this is twofold. First, it allows the keyslot management logic to be tested +without actual inline encryption hardware. Second, similar to actual inline +encryption hardware, the crypto API doesn't accept keys directly in requests but +rather requires that keys be set ahead of time, and setting keys can be +expensive; moreover, allocating a crypto_skcipher can't happen on the I/O path +at all due to the locks it takes. Therefore, the concept of keyslots still +makes sense for blk-crypto-fallback. + +Note that regardless of whether real inline encryption hardware or blk-crypto-fallback is used, the ciphertext written to disk (and hence the -on-disk format of data) will be the same (assuming the hardware's implementation -of the algorithm being used adheres to spec and functions correctly). - -If a ``request queue``'s inline encryption hardware claimed to support the -encryption context specified with a bio, then it will not be handled by the -``blk-crypto-fallback``. We will eventually reach a point in blk-mq when a -struct request needs to be allocated for that bio. At that point, -blk-mq tries to program the encryption context into the ``request_queue``'s -keyslot_manager, and obtain a keyslot, which it stores in its newly added -``keyslot`` field. This keyslot is released when the request is completed. - -When the first bio is added to a request, ``blk_crypto_rq_bio_prep`` is called, -which sets the request's ``crypt_ctx`` to a copy of the bio's -``bi_crypt_context``. bio_crypt_do_front_merge is called whenever a subsequent -bio is merged to the front of the request, which updates the ``crypt_ctx`` of -the request so that it matches the newly merged bio's ``bi_crypt_context``. In particular, the request keeps a copy of the ``bi_crypt_context`` of the first -bio in its bio-list (blk-mq needs to be careful to maintain this invariant -during bio and request merges). - -To make it possible for inline encryption to work with request queue based -layered devices, when a request is cloned, its ``crypto fields`` are cloned as -well. When the cloned request is submitted, blk-mq programs the -``bi_crypt_context`` of the request into the clone's request_queue's keyslot -manager, and stores the returned keyslot in the clone's ``keyslot``. +on-disk format of data) will be the same (assuming that both the inline +encryption hardware's implementation and the kernel crypto API's implementation +of the algorithm being used adhere to spec and function correctly). +blk-crypto-fallback is optional and is controlled by the +``CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK`` kernel configuration option. API presented to users of the block layer ========================================= -``struct blk_crypto_key`` represents a crypto key (the raw key, size of the -key, the crypto algorithm to use, the data unit size to use, and the number of -bytes required to represent data unit numbers that will be specified with the -``bi_crypt_context``). - -``blk_crypto_init_key`` allows upper layers to initialize such a -``blk_crypto_key``. - -``bio_crypt_set_ctx`` should be called on any bio that a user of -the block layer wants en/decrypted via inline encryption (or the -blk-crypto-fallback, if hardware support isn't available for the desired -crypto configuration). This function takes the ``blk_crypto_key`` and the -data unit number (DUN) to use when en/decrypting the bio. - -``blk_crypto_config_supported`` allows upper layers to query whether or not the -an encryption context passed to request queue can be handled by blk-crypto -(either by real inline encryption hardware, or by the blk-crypto-fallback). -This is useful e.g. when blk-crypto-fallback is disabled, and the upper layer -wants to use an algorithm that may not supported by hardware - this function -lets the upper layer know ahead of time that the algorithm isn't supported, -and the upper layer can fallback to something else if appropriate. - -``blk_crypto_start_using_key`` - Upper layers must call this function on -``blk_crypto_key`` and a ``request_queue`` before using the key with any bio -headed for that ``request_queue``. This function ensures that either the -hardware supports the key's crypto settings, or the crypto API fallback has -transforms for the needed mode allocated and ready to go. Note that this -function may allocate an ``skcipher``, and must not be called from the data -path, since allocating ``skciphers`` from the data path can deadlock. - -``blk_crypto_evict_key`` *must* be called by upper layers before a -``blk_crypto_key`` is freed. Further, it *must* only be called only once -there are no more in-flight requests that use that ``blk_crypto_key``. -``blk_crypto_evict_key`` will ensure that a key is removed from any keyslots in -inline encryption hardware that the key might have been programmed into (or the blk-crypto-fallback). +``blk_crypto_config_supported()`` allows users to check ahead of time whether +inline encryption with particular crypto settings will work on a particular +request_queue -- either via hardware or via blk-crypto-fallback. This function +takes in a ``struct blk_crypto_config`` which is like blk_crypto_key, but omits +the actual bytes of the key and instead just contains the algorithm, data unit +size, etc. This function can be useful if blk-crypto-fallback is disabled. + +``blk_crypto_init_key()`` allows users to initialize a blk_crypto_key. + +Users must call ``blk_crypto_start_using_key()`` before actually starting to use +a blk_crypto_key on a request_queue (even if ``blk_crypto_config_supported()`` +was called earlier). This is needed to initialize blk-crypto-fallback if it +will be needed. This must not be called from the data path, as this may have to +allocate resources, which may deadlock in that case. + +Next, to attach an encryption context to a bio, users should call +``bio_crypt_set_ctx()``. This function allocates a bio_crypt_ctx and attaches +it to a bio, given the blk_crypto_key and the data unit number that will be used +for en/decryption. Users don't need to worry about freeing the bio_crypt_ctx +later, as that happens automatically when the bio is freed or reset. + +Finally, when done using inline encryption with a blk_crypto_key on a +request_queue, users must call ``blk_crypto_evict_key()``. This ensures that +the key is evicted from all keyslots it may be programmed into and unlinked from +any kernel data structures it may be linked into. + +In summary, for users of the block layer, the lifecycle of a blk_crypto_key is +as follows: + +1. ``blk_crypto_config_supported()`` (optional) +2. ``blk_crypto_init_key()`` +3. ``blk_crypto_start_using_key()`` +4. ``bio_crypt_set_ctx()`` (potentially many times) +5. ``blk_crypto_evict_key()`` (after all I/O has completed) +6. Zeroize the blk_crypto_key (this has no dedicated function) + +If a blk_crypto_key is being used on multiple request_queues, then +``blk_crypto_config_supported()`` (if used), ``blk_crypto_start_using_key()``, +and ``blk_crypto_evict_key()`` must be called on each request_queue. API presented to device drivers =============================== -A :c:type:``struct blk_keyslot_manager`` should be set up by device drivers in -the ``request_queue`` of the device. The device driver needs to call -``blk_ksm_init`` (or its resource-managed variant ``devm_blk_ksm_init``) on the -``blk_keyslot_manager``, while specifying the number of keyslots supported by -the hardware. - -The device driver also needs to tell the KSM how to actually manipulate the -IE hardware in the device to do things like programming the crypto key into -the IE hardware into a particular keyslot. All this is achieved through the -struct blk_ksm_ll_ops field in the KSM that the device driver -must fill up after initing the ``blk_keyslot_manager``. - -The KSM also handles runtime power management for the device when applicable -(e.g. when it wants to program a crypto key into the IE hardware, the device -must be runtime powered on) - so the device driver must also set the ``dev`` -field in the ksm to point to the `struct device` for the KSM to use for runtime -power management. - -``blk_ksm_reprogram_all_keys`` can be called by device drivers if the device -needs each and every of its keyslots to be reprogrammed with the key it -"should have" at the point in time when the function is called. This is useful -e.g. if a device loses all its keys on runtime power down/up. - -If the driver used ``blk_ksm_init`` instead of ``devm_blk_ksm_init``, then -``blk_ksm_destroy`` should be called to free up all resources used by a -``blk_keyslot_manager`` once it is no longer needed. +A device driver that wants to support inline encryption must set up a +blk_crypto_profile in the request_queue of its device. To do this, it first +must call ``blk_crypto_profile_init()`` (or its resource-managed variant +``devm_blk_crypto_profile_init()``), providing the number of keyslots. + +Next, it must advertise its crypto capabilities by setting fields in the +blk_crypto_profile, e.g. ``modes_supported`` and ``max_dun_bytes_supported``. + +It then must set function pointers in the ``ll_ops`` field of the +blk_crypto_profile to tell upper layers how to control the inline encryption +hardware, e.g. how to program and evict keyslots. Most drivers will need to +implement ``keyslot_program`` and ``keyslot_evict``. For details, see the +comments for ``struct blk_crypto_ll_ops``. + +Once the driver registers a blk_crypto_profile with a request_queue, I/O +requests the driver receives via that queue may have an encryption context. All +encryption contexts will be compatible with the crypto capabilities declared in +the blk_crypto_profile, so drivers don't need to worry about handling +unsupported requests. Also, if a nonzero number of keyslots was declared in the +blk_crypto_profile, then all I/O requests that have an encryption context will +also have a keyslot which was already programmed with the appropriate key. + +If the driver implements runtime suspend and its blk_crypto_ll_ops don't work +while the device is runtime-suspended, then the driver must also set the ``dev`` +field of the blk_crypto_profile to point to the ``struct device`` that will be +resumed before any of the low-level operations are called. + +If there are situations where the inline encryption hardware loses the contents +of its keyslots, e.g. device resets, the driver must handle reprogramming the +keyslots. To do this, the driver may call ``blk_crypto_reprogram_all_keys()``. + +Finally, if the driver used ``blk_crypto_profile_init()`` instead of +``devm_blk_crypto_profile_init()``, then it is responsible for calling +``blk_crypto_profile_destroy()`` when the crypto profile is no longer needed. Layered Devices =============== -Request queue based layered devices like dm-rq that wish to support IE need to -create their own keyslot manager for their request queue, and expose whatever -functionality they choose. When a layered device wants to pass a clone of that -request to another ``request_queue``, blk-crypto will initialize and prepare the -clone as necessary - see ``blk_crypto_insert_cloned_request`` in -``blk-crypto.c``. - - -Future Optimizations for layered devices -======================================== - -Creating a keyslot manager for a layered device uses up memory for each -keyslot, and in general, a layered device merely passes the request on to a -"child" device, so the keyslots in the layered device itself are completely -unused, and don't need any refcounting or keyslot programming. We can instead -define a new type of KSM; the "passthrough KSM", that layered devices can use -to advertise an unlimited number of keyslots, and support for any encryption -algorithms they choose, while not actually using any memory for each keyslot. -Another use case for the "passthrough KSM" is for IE devices that do not have a -limited number of keyslots. - +Request queue based layered devices like dm-rq that wish to support inline +encryption need to create their own blk_crypto_profile for their request_queue, +and expose whatever functionality they choose. When a layered device wants to +pass a clone of that request to another request_queue, blk-crypto will +initialize and prepare the clone as necessary; see +``blk_crypto_insert_cloned_request()``. Interaction between inline encryption and blk integrity ======================================================= @@ -257,7 +296,7 @@ Because there isn't any real hardware yet, it seems prudent to assume that hardware implementations might not implement both features together correctly, and disallow the combination for now. Whenever a device supports integrity, the kernel will pretend that the device does not support hardware inline encryption -(by essentially setting the keyslot manager in the request_queue of the device -to NULL). When the crypto API fallback is enabled, this means that all bios with -and encryption context will use the fallback, and IO will complete as usual. -When the fallback is disabled, a bio with an encryption context will be failed. +(by setting the blk_crypto_profile in the request_queue of the device to NULL). +When the crypto API fallback is enabled, this means that all bios with and +encryption context will use the fallback, and IO will complete as usual. When +the fallback is disabled, a bio with an encryption context will be failed. diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst index 8aed9103e48a..5c0552e78c58 100644 --- a/Documentation/core-api/cachetlb.rst +++ b/Documentation/core-api/cachetlb.rst @@ -326,6 +326,12 @@ maps this page at its virtual address. dirty. Again, see sparc64 for examples of how to deal with this. + ``void flush_dcache_folio(struct folio *folio)`` + This function is called under the same circumstances as + flush_dcache_page(). It allows the architecture to + optimise for flushing the entire folio of pages instead + of flushing one page at a time. + ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long user_vaddr, void *dst, void *src, int len)`` ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page, diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index a42f9baddfbf..395835f9289f 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -95,6 +95,11 @@ More Memory Management Functions .. kernel-doc:: mm/mempolicy.c .. kernel-doc:: include/linux/mm_types.h :internal: +.. kernel-doc:: include/linux/mm_inline.h +.. kernel-doc:: include/linux/page-flags.h .. kernel-doc:: include/linux/mm.h :internal: +.. kernel-doc:: include/linux/page_ref.h .. kernel-doc:: include/linux/mmzone.h +.. kernel-doc:: mm/util.c + :functions: folio_mapping diff --git a/Documentation/devicetree/bindings/mfd/brcm,cru.yaml b/Documentation/devicetree/bindings/mfd/brcm,cru.yaml index fc1317ab3226..28ac60acf4ac 100644 --- a/Documentation/devicetree/bindings/mfd/brcm,cru.yaml +++ b/Documentation/devicetree/bindings/mfd/brcm,cru.yaml @@ -32,13 +32,13 @@ properties: "#size-cells": const: 1 - pinctrl: - $ref: ../pinctrl/brcm,ns-pinmux.yaml - patternProperties: '^clock-controller@[a-f0-9]+$': $ref: ../clock/brcm,iproc-clocks.yaml + '^pin-controller@[a-f0-9]+$': + $ref: ../pinctrl/brcm,ns-pinmux.yaml + '^thermal@[a-f0-9]+$': $ref: ../thermal/brcm,ns-thermal.yaml @@ -73,9 +73,10 @@ examples: "iprocfast", "sata1", "sata2"; }; - pinctrl { + pin-controller@1c0 { compatible = "brcm,bcm4708-pinmux"; - offset = <0x1c0>; + reg = <0x1c0 0x24>; + reg-names = "cru_gpio_control"; }; thermal@2c0 { diff --git a/Documentation/devicetree/bindings/pinctrl/brcm,ns-pinmux.yaml b/Documentation/devicetree/bindings/pinctrl/brcm,ns-pinmux.yaml index 470aff599c27..fc39e3e9f71c 100644 --- a/Documentation/devicetree/bindings/pinctrl/brcm,ns-pinmux.yaml +++ b/Documentation/devicetree/bindings/pinctrl/brcm,ns-pinmux.yaml @@ -17,9 +17,6 @@ description: A list of pins varies across chipsets so few bindings are available. - Node of the pinmux must be nested in the CRU (Central Resource Unit) "syscon" - node. - properties: compatible: enum: @@ -27,10 +24,11 @@ properties: - brcm,bcm4709-pinmux - brcm,bcm53012-pinmux - offset: - description: offset of pin registers in the CRU block + reg: maxItems: 1 - $ref: /schemas/types.yaml#/definitions/uint32-array + + reg-names: + const: cru_gpio_control patternProperties: '-pins$': @@ -72,23 +70,20 @@ allOf: uart1_grp ] required: - - offset + - reg + - reg-names additionalProperties: false examples: - | - cru@1800c100 { - compatible = "syscon", "simple-mfd"; - reg = <0x1800c100 0x1a4>; - - pinctrl { - compatible = "brcm,bcm4708-pinmux"; - offset = <0xc0>; - - spi-pins { - function = "spi"; - groups = "spi_grp"; - }; + pin-controller@1800c1c0 { + compatible = "brcm,bcm4708-pinmux"; + reg = <0x1800c1c0 0x24>; + reg-names = "cru_gpio_control"; + + spi-pins { + function = "spi"; + groups = "spi_grp"; }; }; diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index c0ad233963ae..bee63d42e5ec 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -29,7 +29,6 @@ algorithms work. fiemap files locks - mandatory-locking mount_api quota seq_file diff --git a/Documentation/filesystems/locks.rst b/Documentation/filesystems/locks.rst index c5ae858b1aac..26429317dbbc 100644 --- a/Documentation/filesystems/locks.rst +++ b/Documentation/filesystems/locks.rst @@ -57,16 +57,9 @@ fcntl(), with all the problems that implies. 1.3 Mandatory Locking As A Mount Option --------------------------------------- -Mandatory locking, as described in -'Documentation/filesystems/mandatory-locking.rst' was prior to this release a -general configuration option that was valid for all mounted filesystems. This -had a number of inherent dangers, not the least of which was the ability to -freeze an NFS server by asking it to read a file for which a mandatory lock -existed. - -From this release of the kernel, mandatory locking can be turned on and off -on a per-filesystem basis, using the mount options 'mand' and 'nomand'. -The default is to disallow mandatory locking. The intention is that -mandatory locking only be enabled on a local filesystem as the specific need -arises. +Mandatory locking was prior to this release a general configuration option +that was valid for all mounted filesystems. This had a number of inherent +dangers, not the least of which was the ability to freeze an NFS server by +asking it to read a file for which a mandatory lock existed. +Such option was dropped in Kernel v5.14. diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index 57a641847818..bb68d39f03b7 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -524,3 +524,5 @@ Note that these methods are passed a pointer to the cache resource structure, not the read request structure as they could be used in other situations where there isn't a read request structure as well, such as writing dirty data to the cache. + +.. kernel-doc:: include/linux/netfs.h diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst index a432dc419fa4..5d97cee9457b 100644 --- a/Documentation/networking/devlink/ice.rst +++ b/Documentation/networking/devlink/ice.rst @@ -30,10 +30,11 @@ The ``ice`` driver reports the following versions PHY, link, etc. * - ``fw.mgmt.api`` - running - - 1.5 - - 2-digit version number of the API exported over the AdminQ by the - management firmware. Used by the driver to identify what commands - are supported. + - 1.5.1 + - 3-digit version number (major.minor.patch) of the API exported over + the AdminQ by the management firmware. Used by the driver to + identify what commands are supported. Historical versions of the + kernel only displayed a 2-digit version number (major.minor). * - ``fw.mgmt.build`` - running - 0x305d955f diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst index 6100cdc220f6..fa7730dbf7b9 100644 --- a/Documentation/networking/mctp.rst +++ b/Documentation/networking/mctp.rst @@ -59,11 +59,11 @@ specified with a ``sockaddr`` type, with a single-byte endpoint address: }; struct sockaddr_mctp { - unsigned short int smctp_family; - int smctp_network; - struct mctp_addr smctp_addr; - __u8 smctp_type; - __u8 smctp_tag; + __kernel_sa_family_t smctp_family; + unsigned int smctp_network; + struct mctp_addr smctp_addr; + __u8 smctp_type; + __u8 smctp_tag; }; #define MCTP_NET_ANY 0x0 diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst index 2e8134059c87..6655d929a351 100644 --- a/Documentation/userspace-api/ioctl/ioctl-number.rst +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst @@ -104,6 +104,7 @@ Code Seq# Include File Comments '8' all SNP8023 advanced NIC card <mailto:mcr@solidum.com> ';' 64-7F linux/vfio.h +'=' 00-3f uapi/linux/ptp_clock.h <mailto:richardcochran@gmail.com> '@' 00-0F linux/radeonfb.h conflict! '@' 00-0F drivers/video/aty/aty128fb.c conflict! 'A' 00-1F linux/apm_bios.h conflict! diff --git a/MAINTAINERS b/MAINTAINERS index 8d118d7957d2..3b79fd441dde 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5458,6 +5458,19 @@ F: include/net/devlink.h F: include/uapi/linux/devlink.h F: net/core/devlink.c +DH ELECTRONICS IMX6 DHCOM BOARD SUPPORT +M: Christoph Niedermaier <cniedermaier@dh-electronics.com> +L: kernel@dh-electronics.com +S: Maintained +F: arch/arm/boot/dts/imx6*-dhcom-* + +DH ELECTRONICS STM32MP1 DHCOM/DHCOR BOARD SUPPORT +M: Marek Vasut <marex@denx.de> +L: kernel@dh-electronics.com +S: Maintained +F: arch/arm/boot/dts/stm32mp1*-dhcom-* +F: arch/arm/boot/dts/stm32mp1*-dhcor-* + DIALOG SEMICONDUCTOR DRIVERS M: Support Opensource <support.opensource@diasemi.com> S: Supported @@ -6147,8 +6160,7 @@ T: git git://anongit.freedesktop.org/drm/drm F: Documentation/devicetree/bindings/display/ F: Documentation/devicetree/bindings/gpu/ F: Documentation/gpu/ -F: drivers/gpu/drm/ -F: drivers/gpu/vga/ +F: drivers/gpu/ F: include/drm/ F: include/linux/vga* F: include/uapi/drm/ @@ -11278,7 +11290,6 @@ F: Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst F: drivers/net/ethernet/marvell/octeontx2/af/ MARVELL PRESTERA ETHERNET SWITCH DRIVER -M: Vadym Kochan <vkochan@marvell.com> M: Taras Chornyi <tchornyi@marvell.com> S: Supported W: https://github.com/Marvell-switching/switchdev-prestera @@ -20336,6 +20347,7 @@ X86 ARCHITECTURE (32-BIT AND 64-BIT) M: Thomas Gleixner <tglx@linutronix.de> M: Ingo Molnar <mingo@redhat.com> M: Borislav Petkov <bp@alien8.de> +M: Dave Hansen <dave.hansen@linux.intel.com> M: x86@kernel.org R: "H. Peter Anvin" <hpa@zytor.com> L: linux-kernel@vger.kernel.org @@ -2,8 +2,8 @@ VERSION = 5 PATCHLEVEL = 15 SUBLEVEL = 0 -EXTRAVERSION = -rc6 -NAME = Opossums on Parade +EXTRAVERSION = +NAME = Trick or Treat # *DOCUMENTATION* # To see a list of typical targets execute "make help" diff --git a/arch/arc/include/asm/cacheflush.h b/arch/arc/include/asm/cacheflush.h index e201b4b1655a..e8c2c7469e10 100644 --- a/arch/arc/include/asm/cacheflush.h +++ b/arch/arc/include/asm/cacheflush.h @@ -36,6 +36,7 @@ void __flush_dcache_page(phys_addr_t paddr, unsigned long vaddr); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *page); +void flush_dcache_folio(struct folio *folio); void dma_cache_wback_inv(phys_addr_t start, unsigned long sz); void dma_cache_inv(phys_addr_t start, unsigned long sz); diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 59baf6c132a7..dcf2df6da98f 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -92,6 +92,7 @@ config ARM select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL && !CC_IS_CLANG select HAVE_FUNCTION_TRACER if !XIP_KERNEL + select HAVE_FUTEX_CMPXCHG if FUTEX select HAVE_GCC_PLUGINS select HAVE_HW_BREAKPOINT if PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7) select HAVE_IRQ_TIME_ACCOUNTING diff --git a/arch/arm/boot/compressed/decompress.c b/arch/arm/boot/compressed/decompress.c index aa075d8372ea..74255e819831 100644 --- a/arch/arm/boot/compressed/decompress.c +++ b/arch/arm/boot/compressed/decompress.c @@ -47,7 +47,10 @@ extern char * strchrnul(const char *, int); #endif #ifdef CONFIG_KERNEL_XZ +/* Prevent KASAN override of string helpers in decompressor */ +#undef memmove #define memmove memmove +#undef memcpy #define memcpy memcpy #include "../../../../lib/decompress_unxz.c" #endif diff --git a/arch/arm/boot/dts/sun7i-a20-olinuxino-lime2.dts b/arch/arm/boot/dts/sun7i-a20-olinuxino-lime2.dts index 8077f1716fbc..ecb91fb899ff 100644 --- a/arch/arm/boot/dts/sun7i-a20-olinuxino-lime2.dts +++ b/arch/arm/boot/dts/sun7i-a20-olinuxino-lime2.dts @@ -112,7 +112,7 @@ pinctrl-names = "default"; pinctrl-0 = <&gmac_rgmii_pins>; phy-handle = <&phy1>; - phy-mode = "rgmii"; + phy-mode = "rgmii-id"; status = "okay"; }; diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h index 5e56288e343b..e68fb879e4f9 100644 --- a/arch/arm/include/asm/cacheflush.h +++ b/arch/arm/include/asm/cacheflush.h @@ -290,6 +290,7 @@ extern void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr */ #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 extern void flush_dcache_page(struct page *); +void flush_dcache_folio(struct folio *folio); #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1 static inline void flush_kernel_vmap_range(void *addr, int size) diff --git a/arch/arm/include/asm/uaccess.h b/arch/arm/include/asm/uaccess.h index 084d1c07c2d0..36fbc3329252 100644 --- a/arch/arm/include/asm/uaccess.h +++ b/arch/arm/include/asm/uaccess.h @@ -176,6 +176,7 @@ extern int __get_user_64t_4(void *); register unsigned long __l asm("r1") = __limit; \ register int __e asm("r0"); \ unsigned int __ua_flags = uaccess_save_and_enable(); \ + int __tmp_e; \ switch (sizeof(*(__p))) { \ case 1: \ if (sizeof((x)) >= 8) \ @@ -203,9 +204,10 @@ extern int __get_user_64t_4(void *); break; \ default: __e = __get_user_bad(); break; \ } \ + __tmp_e = __e; \ uaccess_restore(__ua_flags); \ x = (typeof(*(p))) __r2; \ - __e; \ + __tmp_e; \ }) #define get_user(x, p) \ diff --git a/arch/arm/kernel/head.S b/arch/arm/kernel/head.S index 29070eb8df7d..3fc7f9750ce4 100644 --- a/arch/arm/kernel/head.S +++ b/arch/arm/kernel/head.S @@ -253,7 +253,7 @@ __create_page_tables: add r0, r4, #KERNEL_OFFSET >> (SECTION_SHIFT - PMD_ORDER) ldr r6, =(_end - 1) adr_l r5, kernel_sec_start @ _pa(kernel_sec_start) -#ifdef CONFIG_CPU_ENDIAN_BE8 +#if defined CONFIG_CPU_ENDIAN_BE8 || defined CONFIG_CPU_ENDIAN_BE32 str r8, [r5, #4] @ Save physical start of kernel (BE) #else str r8, [r5] @ Save physical start of kernel (LE) @@ -266,7 +266,7 @@ __create_page_tables: bls 1b eor r3, r3, r7 @ Remove the MMU flags adr_l r5, kernel_sec_end @ _pa(kernel_sec_end) -#ifdef CONFIG_CPU_ENDIAN_BE8 +#if defined CONFIG_CPU_ENDIAN_BE8 || defined CONFIG_CPU_ENDIAN_BE32 str r3, [r5, #4] @ Save physical end of kernel (BE) #else str r3, [r5] @ Save physical end of kernel (LE) diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c index 4a7edc6e848f..195dff58bafc 100644 --- a/arch/arm/kernel/traps.c +++ b/arch/arm/kernel/traps.c @@ -136,7 +136,7 @@ static void dump_mem(const char *lvl, const char *str, unsigned long bottom, for (p = first, i = 0; i < 8 && p < top; i++, p += 4) { if (p >= bottom && p < top) { unsigned long val; - if (get_kernel_nofault(val, (unsigned long *)p)) + if (!get_kernel_nofault(val, (unsigned long *)p)) sprintf(str + i * 9, " %08lx", val); else sprintf(str + i * 9, " ????????"); diff --git a/arch/arm/kernel/vmlinux-xip.lds.S b/arch/arm/kernel/vmlinux-xip.lds.S index 50136828f5b5..f14c2360ea0b 100644 --- a/arch/arm/kernel/vmlinux-xip.lds.S +++ b/arch/arm/kernel/vmlinux-xip.lds.S @@ -40,6 +40,10 @@ SECTIONS ARM_DISCARD *(.alt.smp.init) *(.pv_table) +#ifndef CONFIG_ARM_UNWIND + *(.ARM.exidx) *(.ARM.exidx.*) + *(.ARM.extab) *(.ARM.extab.*) +#endif } . = XIP_VIRT_ADDR(CONFIG_XIP_PHYS_ADDR); @@ -172,7 +176,7 @@ ASSERT((__arch_info_end - __arch_info_begin), "no machine record defined") ASSERT((_end - __bss_start) >= 12288, ".bss too small for CONFIG_XIP_DEFLATED_DATA") #endif -#ifdef CONFIG_ARM_MPU +#if defined(CONFIG_ARM_MPU) && !defined(CONFIG_COMPILE_TEST) /* * Due to PMSAv7 restriction on base address and size we have to * enforce minimal alignment restrictions. It was seen that weaker diff --git a/arch/arm/mm/proc-macros.S b/arch/arm/mm/proc-macros.S index e2c743aa2eb2..d9f7dfe2a7ed 100644 --- a/arch/arm/mm/proc-macros.S +++ b/arch/arm/mm/proc-macros.S @@ -340,6 +340,7 @@ ENTRY(\name\()_cache_fns) .macro define_tlb_functions name:req, flags_up:req, flags_smp .type \name\()_tlb_fns, #object + .align 2 ENTRY(\name\()_tlb_fns) .long \name\()_flush_user_tlb_range .long \name\()_flush_kern_tlb_range diff --git a/arch/arm/probes/kprobes/core.c b/arch/arm/probes/kprobes/core.c index 27e0af78e88b..9d8634e2f12f 100644 --- a/arch/arm/probes/kprobes/core.c +++ b/arch/arm/probes/kprobes/core.c @@ -439,7 +439,7 @@ static struct undef_hook kprobes_arm_break_hook = { #endif /* !CONFIG_THUMB2_KERNEL */ -int __init arch_init_kprobes() +int __init arch_init_kprobes(void) { arm_probes_decode_init(); #ifdef CONFIG_THUMB2_KERNEL diff --git a/arch/arm64/boot/dts/allwinner/sun50i-h5-nanopi-neo2.dts b/arch/arm64/boot/dts/allwinner/sun50i-h5-nanopi-neo2.dts index 02f8e72f0cad..05486cccee1c 100644 --- a/arch/arm64/boot/dts/allwinner/sun50i-h5-nanopi-neo2.dts +++ b/arch/arm64/boot/dts/allwinner/sun50i-h5-nanopi-neo2.dts @@ -75,7 +75,7 @@ pinctrl-0 = <&emac_rgmii_pins>; phy-supply = <®_gmac_3v3>; phy-handle = <&ext_rgmii_phy>; - phy-mode = "rgmii"; + phy-mode = "rgmii-id"; status = "okay"; }; diff --git a/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-s.dts b/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-s.dts index d17abb515835..e99e7644ff39 100644 --- a/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-s.dts +++ b/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-s.dts @@ -70,7 +70,9 @@ regulator-name = "rst-usb-eth2"; pinctrl-names = "default"; pinctrl-0 = <&pinctrl_usb_eth2>; - gpio = <&gpio3 2 GPIO_ACTIVE_LOW>; + gpio = <&gpio3 2 GPIO_ACTIVE_HIGH>; + enable-active-high; + regulator-always-on; }; reg_vdd_5v: regulator-5v { @@ -95,7 +97,7 @@ clocks = <&osc_can>; interrupt-parent = <&gpio4>; interrupts = <28 IRQ_TYPE_EDGE_FALLING>; - spi-max-frequency = <100000>; + spi-max-frequency = <10000000>; vdd-supply = <®_vdd_3v3>; xceiver-supply = <®_vdd_5v>; }; @@ -111,7 +113,7 @@ &fec1 { pinctrl-names = "default"; pinctrl-0 = <&pinctrl_enet>; - phy-connection-type = "rgmii"; + phy-connection-type = "rgmii-rxid"; phy-handle = <ðphy>; status = "okay"; diff --git a/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-som.dtsi b/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-som.dtsi index 9db9b90bf2bc..42bbbb3f532b 100644 --- a/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-som.dtsi +++ b/arch/arm64/boot/dts/freescale/imx8mm-kontron-n801x-som.dtsi @@ -91,10 +91,12 @@ reg_vdd_soc: BUCK1 { regulator-name = "buck1"; regulator-min-microvolt = <800000>; - regulator-max-microvolt = <900000>; + regulator-max-microvolt = <850000>; regulator-boot-on; regulator-always-on; regulator-ramp-delay = <3125>; + nxp,dvs-run-voltage = <850000>; + nxp,dvs-standby-voltage = <800000>; }; reg_vdd_arm: BUCK2 { @@ -111,7 +113,7 @@ reg_vdd_dram: BUCK3 { regulator-name = "buck3"; regulator-min-microvolt = <850000>; - regulator-max-microvolt = <900000>; + regulator-max-microvolt = <950000>; regulator-boot-on; regulator-always-on; }; @@ -150,7 +152,7 @@ reg_vdd_snvs: LDO2 { regulator-name = "ldo2"; - regulator-min-microvolt = <850000>; + regulator-min-microvolt = <800000>; regulator-max-microvolt = <900000>; regulator-boot-on; regulator-always-on; diff --git a/arch/arm64/boot/dts/qcom/sm8250.dtsi b/arch/arm64/boot/dts/qcom/sm8250.dtsi index 8c15d9fed08f..d12e4cbfc852 100644 --- a/arch/arm64/boot/dts/qcom/sm8250.dtsi +++ b/arch/arm64/boot/dts/qcom/sm8250.dtsi @@ -2590,9 +2590,10 @@ power-domains = <&dispcc MDSS_GDSC>; clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>, + <&gcc GCC_DISP_HF_AXI_CLK>, <&gcc GCC_DISP_SF_AXI_CLK>, <&dispcc DISP_CC_MDSS_MDP_CLK>; - clock-names = "iface", "nrt_bus", "core"; + clock-names = "iface", "bus", "nrt_bus", "core"; assigned-clocks = <&dispcc DISP_CC_MDSS_MDP_CLK>; assigned-clock-rates = <460000000>; diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h index fb0f523d1492..0a048dc06a7d 100644 --- a/arch/arm64/kvm/hyp/include/nvhe/gfp.h +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h @@ -24,6 +24,7 @@ struct hyp_pool { /* Allocation */ void *hyp_alloc_pages(struct hyp_pool *pool, unsigned short order); +void hyp_split_page(struct hyp_page *page); void hyp_get_page(struct hyp_pool *pool, void *addr); void hyp_put_page(struct hyp_pool *pool, void *addr); diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index bacd493a4eac..34eeb524b686 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -35,7 +35,18 @@ const u8 pkvm_hyp_id = 1; static void *host_s2_zalloc_pages_exact(size_t size) { - return hyp_alloc_pages(&host_s2_pool, get_order(size)); + void *addr = hyp_alloc_pages(&host_s2_pool, get_order(size)); + + hyp_split_page(hyp_virt_to_page(addr)); + + /* + * The size of concatenated PGDs is always a power of two of PAGE_SIZE, + * so there should be no need to free any of the tail pages to make the + * allocation exact. + */ + WARN_ON(size != (PAGE_SIZE << get_order(size))); + + return addr; } static void *host_s2_zalloc_page(void *pool) diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c index 41fc25bdfb34..0bd7701ad1df 100644 --- a/arch/arm64/kvm/hyp/nvhe/page_alloc.c +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c @@ -152,6 +152,7 @@ static inline void hyp_page_ref_inc(struct hyp_page *p) static inline int hyp_page_ref_dec_and_test(struct hyp_page *p) { + BUG_ON(!p->refcount); p->refcount--; return (p->refcount == 0); } @@ -193,6 +194,20 @@ void hyp_get_page(struct hyp_pool *pool, void *addr) hyp_spin_unlock(&pool->lock); } +void hyp_split_page(struct hyp_page *p) +{ + unsigned short order = p->order; + unsigned int i; + + p->order = 0; + for (i = 1; i < (1 << order); i++) { + struct hyp_page *tail = p + i; + + tail->order = 0; + hyp_set_page_refcounted(tail); + } +} + void *hyp_alloc_pages(struct hyp_pool *pool, unsigned short order) { unsigned short i = order; diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 1a94a7ca48f2..69bd1732a299 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1529,8 +1529,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, * when updating the PG_mte_tagged page flag, see * sanitise_mte_tags for more details. */ - if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED) - return -EINVAL; + if (kvm_has_mte(kvm) && vma->vm_flags & VM_SHARED) { + ret = -EINVAL; + break; + } if (vma->vm_flags & VM_PFNMAP) { /* IO region dirty page logging not allowed */ diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 41c23f474ea6..803e7773fa86 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -1136,6 +1136,11 @@ out: return prog; } +u64 bpf_jit_alloc_exec_limit(void) +{ + return BPF_JIT_REGION_SIZE; +} + void *bpf_jit_alloc_exec(unsigned long size) { return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START, diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c index 4ef457ba5220..9c57b245dc12 100644 --- a/arch/m68k/emu/nfblock.c +++ b/arch/m68k/emu/nfblock.c @@ -99,6 +99,7 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 bsize) { struct nfhd_device *dev; int dev_id = id - NFHD_DEV_OFFSET; + int err = -ENOMEM; pr_info("nfhd%u: found device with %u blocks (%u bytes)\n", dev_id, blocks, bsize); @@ -129,16 +130,20 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 bsize) sprintf(dev->disk->disk_name, "nfhd%u", dev_id); set_capacity(dev->disk, (sector_t)blocks * (bsize / 512)); blk_queue_logical_block_size(dev->disk->queue, bsize); - add_disk(dev->disk); + err = add_disk(dev->disk); + if (err) + goto out_cleanup_disk; list_add_tail(&dev->list, &nfhd_list); return 0; +out_cleanup_disk: + blk_cleanup_disk(dev->disk); free_dev: kfree(dev); out: - return -ENOMEM; + return err; } static int __init nfhd_init(void) diff --git a/arch/m68k/include/asm/cacheflush_mm.h b/arch/m68k/include/asm/cacheflush_mm.h index 1ac55e7b47f0..8ab46625ddd3 100644 --- a/arch/m68k/include/asm/cacheflush_mm.h +++ b/arch/m68k/include/asm/cacheflush_mm.h @@ -250,6 +250,7 @@ static inline void __flush_page_to_ram(void *vaddr) #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 #define flush_dcache_page(page) __flush_page_to_ram(page_address(page)) +void flush_dcache_folio(struct folio *folio); #define flush_dcache_mmap_lock(mapping) do { } while (0) #define flush_dcache_mmap_unlock(mapping) do { } while (0) #define flush_icache_page(vma, page) __flush_page_to_ram(page_address(page)) diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h index b3dc9c589442..f207388541d5 100644 --- a/arch/mips/include/asm/cacheflush.h +++ b/arch/mips/include/asm/cacheflush.h @@ -61,6 +61,8 @@ static inline void flush_dcache_page(struct page *page) SetPageDcacheDirty(page); } +void flush_dcache_folio(struct folio *folio); + #define flush_dcache_mmap_lock(mapping) do { } while (0) #define flush_dcache_mmap_unlock(mapping) do { } while (0) diff --git a/arch/nds32/include/asm/cacheflush.h b/arch/nds32/include/asm/cacheflush.h index c2a222ebfa2a..3fc0bb7d6487 100644 --- a/arch/nds32/include/asm/cacheflush.h +++ b/arch/nds32/include/asm/cacheflush.h @@ -27,6 +27,7 @@ void flush_cache_vunmap(unsigned long start, unsigned long end); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 void flush_dcache_page(struct page *page); +void flush_dcache_folio(struct folio *folio); void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long vaddr, void *dst, void *src, int len); void copy_from_user_page(struct vm_area_struct *vma, struct page *page, diff --git a/arch/nds32/kernel/ftrace.c b/arch/nds32/kernel/ftrace.c index 0e23e3a8df6b..d55b73b18149 100644 --- a/arch/nds32/kernel/ftrace.c +++ b/arch/nds32/kernel/ftrace.c @@ -6,7 +6,7 @@ #ifndef CONFIG_DYNAMIC_FTRACE extern void (*ftrace_trace_function)(unsigned long, unsigned long, - struct ftrace_ops*, struct pt_regs*); + struct ftrace_ops*, struct ftrace_regs*); extern void ftrace_graph_caller(void); noinline void __naked ftrace_stub(unsigned long ip, unsigned long parent_ip, diff --git a/arch/nios2/include/asm/cacheflush.h b/arch/nios2/include/asm/cacheflush.h index 18eb9f69f806..1999561b22aa 100644 --- a/arch/nios2/include/asm/cacheflush.h +++ b/arch/nios2/include/asm/cacheflush.h @@ -28,7 +28,8 @@ extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, extern void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr, unsigned long pfn); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 -extern void flush_dcache_page(struct page *page); +void flush_dcache_page(struct page *page); +void flush_dcache_folio(struct folio *folio); extern void flush_icache_range(unsigned long start, unsigned long end); extern void flush_icache_page(struct vm_area_struct *vma, struct page *page); diff --git a/arch/nios2/include/asm/irqflags.h b/arch/nios2/include/asm/irqflags.h index b3ec3e510706..25acf27862f9 100644 --- a/arch/nios2/include/asm/irqflags.h +++ b/arch/nios2/include/asm/irqflags.h @@ -9,7 +9,7 @@ static inline unsigned long arch_local_save_flags(void) { - return RDCTL(CTL_STATUS); + return RDCTL(CTL_FSTATUS); } /* @@ -18,7 +18,7 @@ static inline unsigned long arch_local_save_flags(void) */ static inline void arch_local_irq_restore(unsigned long flags) { - WRCTL(CTL_STATUS, flags); + WRCTL(CTL_FSTATUS, flags); } static inline void arch_local_irq_disable(void) diff --git a/arch/nios2/include/asm/registers.h b/arch/nios2/include/asm/registers.h index 183c720e454d..95b67dd16f81 100644 --- a/arch/nios2/include/asm/registers.h +++ b/arch/nios2/include/asm/registers.h @@ -11,7 +11,7 @@ #endif /* control register numbers */ -#define CTL_STATUS 0 +#define CTL_FSTATUS 0 #define CTL_ESTATUS 1 #define CTL_BSTATUS 2 #define CTL_IENABLE 3 diff --git a/arch/nios2/platform/Kconfig.platform b/arch/nios2/platform/Kconfig.platform index 9e32fb7f3d4c..e849daff6fd1 100644 --- a/arch/nios2/platform/Kconfig.platform +++ b/arch/nios2/platform/Kconfig.platform @@ -37,6 +37,7 @@ config NIOS2_DTB_PHYS_ADDR config NIOS2_DTB_SOURCE_BOOL bool "Compile and link device tree into kernel image" + depends on !COMPILE_TEST help This allows you to specify a dts (device tree source) file which will be compiled and linked into the kernel image. diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h index eef0096db5f8..da0cd4b3a28f 100644 --- a/arch/parisc/include/asm/cacheflush.h +++ b/arch/parisc/include/asm/cacheflush.h @@ -49,7 +49,8 @@ void invalidate_kernel_vmap_range(void *vaddr, int size); #define flush_cache_vunmap(start, end) flush_cache_all() #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 -extern void flush_dcache_page(struct page *page); +void flush_dcache_page(struct page *page); +void flush_dcache_folio(struct folio *folio); #define flush_dcache_mmap_lock(mapping) xa_lock_irq(&mapping->i_pages) #define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages) diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index abb719b21cae..3d97fb833834 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -126,14 +126,16 @@ _GLOBAL(idle_return_gpr_loss) /* * This is the sequence required to execute idle instructions, as * specified in ISA v2.07 (and earlier). MSR[IR] and MSR[DR] must be 0. - * - * The 0(r1) slot is used to save r2 in isa206, so use that here. + * We have to store a GPR somewhere, ptesync, then reload it, and create + * a false dependency on the result of the load. It doesn't matter which + * GPR we store, or where we store it. We have already stored r2 to the + * stack at -8(r1) in isa206_idle_insn_mayloss, so use that. */ #define IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST) \ /* Magic NAP/SLEEP/WINKLE mode enter sequence */ \ - std r2,0(r1); \ + std r2,-8(r1); \ ptesync; \ - ld r2,0(r1); \ + ld r2,-8(r1); \ 236: cmpd cr0,r2,r2; \ bne 236b; \ IDLE_INST; \ diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 9cc7d3dbf439..605bab448f84 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1730,8 +1730,6 @@ void __cpu_die(unsigned int cpu) void arch_cpu_idle_dead(void) { - sched_preempt_enable_no_resched(); - /* * Disable on the down path. This will be re-enabled by * start_secondary() via start_secondary_resume() below diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index dab5c56ffd0e..a52af8fbf571 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1302,6 +1302,12 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) struct property *default_win; int reset_win_ext; + /* DDW + IOMMU on single window may fail if there is any allocation */ + if (iommu_table_in_use(tbl)) { + dev_warn(&dev->dev, "current IOMMU table in use, can't be replaced.\n"); + goto out_failed; + } + default_win = of_find_property(pdn, "ibm,dma-window", NULL); if (!default_win) goto out_failed; @@ -1356,12 +1362,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) query.largest_available_block, 1ULL << page_shift); - /* DDW + IOMMU on single window may fail if there is any allocation */ - if (default_win_removed && iommu_table_in_use(tbl)) { - dev_dbg(&dev->dev, "current IOMMU table in use, can't be replaced.\n"); - goto out_failed; - } - len = order_base_2(query.largest_available_block << page_shift); win_name = DMA64_PROPNAME; } else { @@ -1411,18 +1411,19 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) } else { struct iommu_table *newtbl; int i; + unsigned long start = 0, end = 0; for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) { const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM; /* Look for MMIO32 */ - if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM) + if ((pci->phb->mem_resources[i].flags & mask) == IORESOURCE_MEM) { + start = pci->phb->mem_resources[i].start; + end = pci->phb->mem_resources[i].end; break; + } } - if (i == ARRAY_SIZE(pci->phb->mem_resources)) - goto out_del_list; - /* New table for using DDW instead of the default DMA window */ newtbl = iommu_pseries_alloc_table(pci->phb->node); if (!newtbl) { @@ -1432,15 +1433,15 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) iommu_table_setparms_common(newtbl, pci->phb->bus->number, create.liobn, win_addr, 1UL << len, page_shift, NULL, &iommu_table_lpar_multi_ops); - iommu_init_table(newtbl, pci->phb->node, pci->phb->mem_resources[i].start, - pci->phb->mem_resources[i].end); + iommu_init_table(newtbl, pci->phb->node, start, end); pci->table_group->tables[1] = newtbl; /* Keep default DMA window stuct if removed */ if (default_win_removed) { tbl->it_size = 0; - kfree(tbl->it_map); + vfree(tbl->it_map); + tbl->it_map = NULL; } set_iommu_table_base(&dev->dev, newtbl); diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 6a6fa9e976d5..f076cee11af6 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -163,6 +163,12 @@ config PAGE_OFFSET default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB +config KASAN_SHADOW_OFFSET + hex + depends on KASAN_GENERIC + default 0xdfffffc800000000 if 64BIT + default 0xffffffff if 32BIT + config ARCH_FLATMEM_ENABLE def_bool !NUMA diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h index a2b3d9cdbc86..b00f503ec124 100644 --- a/arch/riscv/include/asm/kasan.h +++ b/arch/riscv/include/asm/kasan.h @@ -30,8 +30,7 @@ #define KASAN_SHADOW_SIZE (UL(1) << ((CONFIG_VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT)) #define KASAN_SHADOW_START KERN_VIRT_START #define KASAN_SHADOW_END (KASAN_SHADOW_START + KASAN_SHADOW_SIZE) -#define KASAN_SHADOW_OFFSET (KASAN_SHADOW_END - (1ULL << \ - (64 - KASAN_SHADOW_SCALE_SHIFT))) +#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL) void kasan_init(void); asmlinkage void kasan_early_init(void); diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S index fce5184b22c3..52c5ff9804c5 100644 --- a/arch/riscv/kernel/head.S +++ b/arch/riscv/kernel/head.S @@ -193,6 +193,7 @@ setup_trap_vector: csrw CSR_SCRATCH, zero ret +.align 2 .Lsecondary_park: /* We lack SMP support or have too many harts, so park this hart */ wfi diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c index d7189c8714a9..54294f83513d 100644 --- a/arch/riscv/mm/kasan_init.c +++ b/arch/riscv/mm/kasan_init.c @@ -17,6 +17,9 @@ asmlinkage void __init kasan_early_init(void) uintptr_t i; pgd_t *pgd = early_pg_dir + pgd_index(KASAN_SHADOW_START); + BUILD_BUG_ON(KASAN_SHADOW_OFFSET != + KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT))); + for (i = 0; i < PTRS_PER_PTE; ++i) set_pte(kasan_early_shadow_pte + i, mk_pte(virt_to_page(kasan_early_shadow_page), @@ -172,21 +175,10 @@ void __init kasan_init(void) phys_addr_t p_start, p_end; u64 i; - /* - * Populate all kernel virtual address space with kasan_early_shadow_page - * except for the linear mapping and the modules/kernel/BPF mapping. - */ - kasan_populate_early_shadow((void *)KASAN_SHADOW_START, - (void *)kasan_mem_to_shadow((void *) - VMEMMAP_END)); if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) kasan_shallow_populate( (void *)kasan_mem_to_shadow((void *)VMALLOC_START), (void *)kasan_mem_to_shadow((void *)VMALLOC_END)); - else - kasan_populate_early_shadow( - (void *)kasan_mem_to_shadow((void *)VMALLOC_START), - (void *)kasan_mem_to_shadow((void *)VMALLOC_END)); /* Populate the linear mapping */ for_each_mem_range(i, &p_start, &p_end) { diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c index fed86f42dfbe..753d85bdfad0 100644 --- a/arch/riscv/net/bpf_jit_core.c +++ b/arch/riscv/net/bpf_jit_core.c @@ -125,7 +125,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) if (i == NR_JIT_ITERATIONS) { pr_err("bpf-jit: image did not converge in <%d passes!\n", i); - bpf_jit_binary_free(jit_data->header); + if (jit_data->header) + bpf_jit_binary_free(jit_data->header); prog = orig_prog; goto out_offset; } @@ -166,6 +167,11 @@ out: return prog; } +u64 bpf_jit_alloc_exec_limit(void) +{ + return BPF_JIT_REGION_SIZE; +} + void *bpf_jit_alloc_exec(unsigned long size) { return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START, diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c index b9f85b2dc053..6af59c59cc1b 100644 --- a/arch/s390/kvm/gaccess.c +++ b/arch/s390/kvm/gaccess.c @@ -894,6 +894,11 @@ int access_guest_real(struct kvm_vcpu *vcpu, unsigned long gra, /** * guest_translate_address - translate guest logical into guest absolute address + * @vcpu: virtual cpu + * @gva: Guest virtual address + * @ar: Access register + * @gpa: Guest physical address + * @mode: Translation access mode * * Parameter semantics are the same as the ones from guest_translate. * The memory contents at the guest address are not changed. @@ -934,6 +939,11 @@ int guest_translate_address(struct kvm_vcpu *vcpu, unsigned long gva, u8 ar, /** * check_gva_range - test a range of guest virtual addresses for accessibility + * @vcpu: virtual cpu + * @gva: Guest virtual address + * @ar: Access register + * @length: Length of test range + * @mode: Translation access mode */ int check_gva_range(struct kvm_vcpu *vcpu, unsigned long gva, u8 ar, unsigned long length, enum gacc_mode mode) @@ -956,6 +966,7 @@ int check_gva_range(struct kvm_vcpu *vcpu, unsigned long gva, u8 ar, /** * kvm_s390_check_low_addr_prot_real - check for low-address protection + * @vcpu: virtual cpu * @gra: Guest real address * * Checks whether an address is subject to low-address protection and set @@ -979,6 +990,7 @@ int kvm_s390_check_low_addr_prot_real(struct kvm_vcpu *vcpu, unsigned long gra) * @pgt: pointer to the beginning of the page table for the given address if * successful (return value 0), or to the first invalid DAT entry in * case of exceptions (return value > 0) + * @dat_protection: referenced memory is write protected * @fake: pgt references contiguous guest memory block, not a pgtable */ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr, diff --git a/arch/s390/kvm/intercept.c b/arch/s390/kvm/intercept.c index 72b25b7cc6ae..2bd8f854f1b4 100644 --- a/arch/s390/kvm/intercept.c +++ b/arch/s390/kvm/intercept.c @@ -269,6 +269,7 @@ static int handle_prog(struct kvm_vcpu *vcpu) /** * handle_external_interrupt - used for external interruption interceptions + * @vcpu: virtual cpu * * This interception only occurs if the CPUSTAT_EXT_INT bit was set, or if * the new PSW does not have external interrupts disabled. In the first case, @@ -315,7 +316,8 @@ static int handle_external_interrupt(struct kvm_vcpu *vcpu) } /** - * Handle MOVE PAGE partial execution interception. + * handle_mvpg_pei - Handle MOVE PAGE partial execution interception. + * @vcpu: virtual cpu * * This interception can only happen for guests with DAT disabled and * addresses that are currently not mapped in the host. Thus we try to diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c index 10722455fd02..2245f4b8d362 100644 --- a/arch/s390/kvm/interrupt.c +++ b/arch/s390/kvm/interrupt.c @@ -3053,13 +3053,14 @@ static void __airqs_kick_single_vcpu(struct kvm *kvm, u8 deliverable_mask) int vcpu_idx, online_vcpus = atomic_read(&kvm->online_vcpus); struct kvm_s390_gisa_interrupt *gi = &kvm->arch.gisa_int; struct kvm_vcpu *vcpu; + u8 vcpu_isc_mask; for_each_set_bit(vcpu_idx, kvm->arch.idle_mask, online_vcpus) { vcpu = kvm_get_vcpu(kvm, vcpu_idx); if (psw_ioint_disabled(vcpu)) continue; - deliverable_mask &= (u8)(vcpu->arch.sie_block->gcr[6] >> 24); - if (deliverable_mask) { + vcpu_isc_mask = (u8)(vcpu->arch.sie_block->gcr[6] >> 24); + if (deliverable_mask & vcpu_isc_mask) { /* lately kicked but not yet running */ if (test_and_set_bit(vcpu_idx, gi->kicked_mask)) return; diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 6a6dd5e1daf6..1c97493d21e1 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -3363,6 +3363,7 @@ out_free_sie_block: int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu) { + clear_bit(vcpu->vcpu_idx, vcpu->kvm->arch.gisa_int.kicked_mask); return kvm_s390_vcpu_has_irq(vcpu, 0); } diff --git a/arch/sh/include/asm/cacheflush.h b/arch/sh/include/asm/cacheflush.h index 372afa82fee6..c7a97f32432f 100644 --- a/arch/sh/include/asm/cacheflush.h +++ b/arch/sh/include/asm/cacheflush.h @@ -42,7 +42,8 @@ extern void flush_cache_page(struct vm_area_struct *vma, extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 -extern void flush_dcache_page(struct page *page); +void flush_dcache_page(struct page *page); +void flush_dcache_folio(struct folio *folio); extern void flush_icache_range(unsigned long start, unsigned long end); #define flush_icache_user_range flush_icache_range extern void flush_icache_page(struct vm_area_struct *vma, diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c index fefd343412c7..69d2d0049a61 100644 --- a/arch/um/drivers/ubd_kern.c +++ b/arch/um/drivers/ubd_kern.c @@ -855,8 +855,8 @@ static const struct attribute_group *ubd_attr_groups[] = { NULL, }; -static void ubd_disk_register(int major, u64 size, int unit, - struct gendisk *disk) +static int ubd_disk_register(int major, u64 size, int unit, + struct gendisk *disk) { disk->major = major; disk->first_minor = unit << UBD_SHIFT; @@ -873,7 +873,7 @@ static void ubd_disk_register(int major, u64 size, int unit, disk->private_data = &ubd_devs[unit]; disk->queue = ubd_devs[unit].queue; - device_add_disk(&ubd_devs[unit].pdev.dev, disk, ubd_attr_groups); + return device_add_disk(&ubd_devs[unit].pdev.dev, disk, ubd_attr_groups); } #define ROUND_BLOCK(n) ((n + (SECTOR_SIZE - 1)) & (-SECTOR_SIZE)) @@ -920,10 +920,15 @@ static int ubd_add(int n, char **error_out) blk_queue_write_cache(ubd_dev->queue, true, false); blk_queue_max_segments(ubd_dev->queue, MAX_SG); blk_queue_segment_boundary(ubd_dev->queue, PAGE_SIZE - 1); - ubd_disk_register(UBD_MAJOR, ubd_dev->size, n, disk); + err = ubd_disk_register(UBD_MAJOR, ubd_dev->size, n, disk); + if (err) + goto out_cleanup_disk; + ubd_gendisk[n] = disk; return 0; +out_cleanup_disk: + blk_cleanup_disk(disk); out_cleanup_tags: blk_mq_free_tag_set(&ubd_dev->tag_set); out: diff --git a/arch/x86/crypto/sm4-aesni-avx-asm_64.S b/arch/x86/crypto/sm4-aesni-avx-asm_64.S index 18d2f5199194..1cc72b4804fa 100644 --- a/arch/x86/crypto/sm4-aesni-avx-asm_64.S +++ b/arch/x86/crypto/sm4-aesni-avx-asm_64.S @@ -78,7 +78,7 @@ vpxor tmp0, x, x; -.section .rodata.cst164, "aM", @progbits, 164 +.section .rodata.cst16, "aM", @progbits, 16 .align 16 /* @@ -133,6 +133,10 @@ .L0f0f0f0f: .long 0x0f0f0f0f +/* 12 bytes, only for padding */ +.Lpadding_deadbeef: + .long 0xdeadbeef, 0xdeadbeef, 0xdeadbeef + .text .align 16 diff --git a/arch/x86/crypto/sm4-aesni-avx2-asm_64.S b/arch/x86/crypto/sm4-aesni-avx2-asm_64.S index d2ffd7f76ee2..9c5d3f3ad45a 100644 --- a/arch/x86/crypto/sm4-aesni-avx2-asm_64.S +++ b/arch/x86/crypto/sm4-aesni-avx2-asm_64.S @@ -93,7 +93,7 @@ vpxor tmp0, x, x; -.section .rodata.cst164, "aM", @progbits, 164 +.section .rodata.cst16, "aM", @progbits, 16 .align 16 /* @@ -148,6 +148,10 @@ .L0f0f0f0f: .long 0x0f0f0f0f +/* 12 bytes, only for padding */ +.Lpadding_deadbeef: + .long 0xdeadbeef, 0xdeadbeef, 0xdeadbeef + .text .align 16 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f8f48a7ec577..13f64654dfff 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -702,7 +702,8 @@ struct kvm_vcpu_arch { struct kvm_pio_request pio; void *pio_data; - void *guest_ins_data; + void *sev_pio_data; + unsigned sev_pio_count; u8 event_exit_inst_len; @@ -1097,7 +1098,7 @@ struct kvm_arch { u64 cur_tsc_generation; int nr_vcpus_matched_tsc; - spinlock_t pvclock_gtod_sync_lock; + raw_spinlock_t pvclock_gtod_sync_lock; bool use_master_clock; u64 master_kernel_ns; u64 master_cycle_now; diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 76fb00921203..d6ac32f3f650 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2321,13 +2321,14 @@ EXPORT_SYMBOL_GPL(kvm_apic_update_apicv); void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) { struct kvm_lapic *apic = vcpu->arch.apic; + u64 msr_val; int i; if (!init_event) { - vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE | - MSR_IA32_APICBASE_ENABLE; + msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE; if (kvm_vcpu_is_reset_bsp(vcpu)) - vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP; + msr_val |= MSR_IA32_APICBASE_BSP; + kvm_lapic_set_base(vcpu, msr_val); } if (!apic) @@ -2336,11 +2337,9 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) /* Stop the timer in case it's a reset to an active apic */ hrtimer_cancel(&apic->lapic_timer.timer); - if (!init_event) { - apic->base_address = APIC_DEFAULT_PHYS_BASE; - + /* The xAPIC ID is set at RESET even if the APIC was already enabled. */ + if (!init_event) kvm_apic_set_xapic_id(apic, vcpu->vcpu_id); - } kvm_apic_set_version(apic->vcpu); for (i = 0; i < KVM_APIC_LVT_NUM; i++) @@ -2481,6 +2480,11 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns) lapic_timer_advance_dynamic = false; } + /* + * Stuff the APIC ENABLE bit in lieu of temporarily incrementing + * apic_hw_disabled; the full RESET value is set by kvm_lapic_reset(). + */ + vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE; static_branch_inc(&apic_sw_disabled.key); /* sw disabled at reset */ kvm_iodevice_init(&apic->dev, &apic_mmio_ops); @@ -2942,5 +2946,7 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu) void kvm_lapic_exit(void) { static_key_deferred_flush(&apic_hw_disabled); + WARN_ON(static_branch_unlikely(&apic_hw_disabled.key)); static_key_deferred_flush(&apic_sw_disabled); + WARN_ON(static_branch_unlikely(&apic_sw_disabled.key)); } diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 1a64ba5b9437..0cc58901bf7a 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4596,10 +4596,10 @@ static void update_pkru_bitmask(struct kvm_mmu *mmu) unsigned bit; bool wp; - if (!is_cr4_pke(mmu)) { - mmu->pkru_mask = 0; + mmu->pkru_mask = 0; + + if (!is_cr4_pke(mmu)) return; - } wp = is_cr0_wp(mmu); diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index c36b5fe4c27c..7e34d7163ada 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -618,7 +618,12 @@ static int __sev_launch_update_vmsa(struct kvm *kvm, struct kvm_vcpu *vcpu, vmsa.handle = to_kvm_svm(kvm)->sev_info.handle; vmsa.address = __sme_pa(svm->vmsa); vmsa.len = PAGE_SIZE; - return sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_VMSA, &vmsa, error); + ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_VMSA, &vmsa, error); + if (ret) + return ret; + + vcpu->arch.guest_state_protected = true; + return 0; } static int sev_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp) @@ -1479,6 +1484,13 @@ static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp) goto e_free_trans; } + /* + * Flush (on non-coherent CPUs) before RECEIVE_UPDATE_DATA, the PSP + * encrypts the written data with the guest's key, and the cache may + * contain dirty, unencrypted data. + */ + sev_clflush_pages(guest_page, n); + /* The RECEIVE_UPDATE_DATA command requires C-bit to be always set. */ data.guest_address = (page_to_pfn(guest_page[0]) << PAGE_SHIFT) + offset; data.guest_address |= sev_me_mask; @@ -2579,11 +2591,20 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in) { - if (!setup_vmgexit_scratch(svm, in, svm->vmcb->control.exit_info_2)) + int count; + int bytes; + + if (svm->vmcb->control.exit_info_2 > INT_MAX) + return -EINVAL; + + count = svm->vmcb->control.exit_info_2; + if (unlikely(check_mul_overflow(count, size, &bytes))) + return -EINVAL; + + if (!setup_vmgexit_scratch(svm, in, bytes)) return -EINVAL; - return kvm_sev_es_string_io(&svm->vcpu, size, port, - svm->ghcb_sa, svm->ghcb_sa_len, in); + return kvm_sev_es_string_io(&svm->vcpu, size, port, svm->ghcb_sa, count, in); } void sev_es_init_vmcb(struct vcpu_svm *svm) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 128a54b1fbf1..5d30db599e10 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -191,7 +191,7 @@ struct vcpu_svm { /* SEV-ES scratch area support */ void *ghcb_sa; - u64 ghcb_sa_len; + u32 ghcb_sa_len; bool ghcb_sa_sync; bool ghcb_sa_free; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 116b08904ac3..7d595effb66f 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5562,9 +5562,13 @@ static int handle_encls(struct kvm_vcpu *vcpu) static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu) { - vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK; - vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK; - return 0; + /* + * Hardware may or may not set the BUS_LOCK_DETECTED flag on BUS_LOCK + * VM-Exits. Unconditionally set the flag here and leave the handling to + * vmx_handle_exit(). + */ + to_vmx(vcpu)->exit_reason.bus_lock_detected = true; + return 1; } /* @@ -6051,9 +6055,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath) int ret = __vmx_handle_exit(vcpu, exit_fastpath); /* - * Even when current exit reason is handled by KVM internally, we - * still need to exit to user space when bus lock detected to inform - * that there is a bus lock in guest. + * Exit to user space when bus lock detected to inform that there is + * a bus lock in guest. */ if (to_vmx(vcpu)->exit_reason.bus_lock_detected) { if (ret > 0) @@ -6302,18 +6305,13 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu) /* * If we are running L2 and L1 has a new pending interrupt - * which can be injected, we should re-evaluate - * what should be done with this new L1 interrupt. - * If L1 intercepts external-interrupts, we should - * exit from L2 to L1. Otherwise, interrupt should be - * delivered directly to L2. + * which can be injected, this may cause a vmexit or it may + * be injected into L2. Either way, this interrupt will be + * processed via KVM_REQ_EVENT, not RVI, because we do not use + * virtual interrupt delivery to inject L1 interrupts into L2. */ - if (is_guest_mode(vcpu) && max_irr_updated) { - if (nested_exit_on_intr(vcpu)) - kvm_vcpu_exiting_guest_mode(vcpu); - else - kvm_make_request(KVM_REQ_EVENT, vcpu); - } + if (is_guest_mode(vcpu) && max_irr_updated) + kvm_make_request(KVM_REQ_EVENT, vcpu); } else { max_irr = kvm_lapic_find_highest_irr(vcpu); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index aabd3a2ec1bc..bfe0de3008a6 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2542,7 +2542,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) kvm_vcpu_write_tsc_offset(vcpu, offset); raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags); - spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags); + raw_spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags); if (!matched) { kvm->arch.nr_vcpus_matched_tsc = 0; } else if (!already_matched) { @@ -2550,7 +2550,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data) } kvm_track_tsc_matching(vcpu); - spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags); + raw_spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags); } static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, @@ -2780,9 +2780,9 @@ static void kvm_gen_update_masterclock(struct kvm *kvm) kvm_make_mclock_inprogress_request(kvm); /* no guest entries from this point */ - spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); pvclock_update_vm_gtod_copy(kvm); - spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); kvm_for_each_vcpu(i, vcpu, kvm) kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); @@ -2800,15 +2800,15 @@ u64 get_kvmclock_ns(struct kvm *kvm) unsigned long flags; u64 ret; - spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); if (!ka->use_master_clock) { - spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); return get_kvmclock_base_ns() + ka->kvmclock_offset; } hv_clock.tsc_timestamp = ka->master_cycle_now; hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset; - spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); /* both __this_cpu_read() and rdtsc() should be on the same cpu */ get_cpu(); @@ -2902,13 +2902,13 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) * If the host uses TSC clock, then passthrough TSC as stable * to the guest. */ - spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); use_master_clock = ka->use_master_clock; if (use_master_clock) { host_tsc = ka->master_cycle_now; kernel_ns = ka->master_kernel_ns; } - spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); /* Keep irq disabled to prevent changes to the clock */ local_irq_save(flags); @@ -6100,13 +6100,13 @@ set_pit2_out: * is slightly ahead) here we risk going negative on unsigned * 'system_time' when 'user_ns.clock' is very small. */ - spin_lock_irq(&ka->pvclock_gtod_sync_lock); + raw_spin_lock_irq(&ka->pvclock_gtod_sync_lock); if (kvm->arch.use_master_clock) now_ns = ka->master_kernel_ns; else now_ns = get_kvmclock_base_ns(); ka->kvmclock_offset = user_ns.clock - now_ns; - spin_unlock_irq(&ka->pvclock_gtod_sync_lock); + raw_spin_unlock_irq(&ka->pvclock_gtod_sync_lock); kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE); break; @@ -6906,7 +6906,7 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) } static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size, - unsigned short port, void *val, + unsigned short port, unsigned int count, bool in) { vcpu->arch.pio.port = port; @@ -6914,10 +6914,8 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size, vcpu->arch.pio.count = count; vcpu->arch.pio.size = size; - if (!kernel_pio(vcpu, vcpu->arch.pio_data)) { - vcpu->arch.pio.count = 0; + if (!kernel_pio(vcpu, vcpu->arch.pio_data)) return 1; - } vcpu->run->exit_reason = KVM_EXIT_IO; vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT; @@ -6929,26 +6927,39 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size, return 0; } -static int emulator_pio_in(struct kvm_vcpu *vcpu, int size, - unsigned short port, void *val, unsigned int count) +static int __emulator_pio_in(struct kvm_vcpu *vcpu, int size, + unsigned short port, unsigned int count) { - int ret; + WARN_ON(vcpu->arch.pio.count); + memset(vcpu->arch.pio_data, 0, size * count); + return emulator_pio_in_out(vcpu, size, port, count, true); +} - if (vcpu->arch.pio.count) - goto data_avail; +static void complete_emulator_pio_in(struct kvm_vcpu *vcpu, void *val) +{ + int size = vcpu->arch.pio.size; + unsigned count = vcpu->arch.pio.count; + memcpy(val, vcpu->arch.pio_data, size * count); + trace_kvm_pio(KVM_PIO_IN, vcpu->arch.pio.port, size, count, vcpu->arch.pio_data); + vcpu->arch.pio.count = 0; +} - memset(vcpu->arch.pio_data, 0, size * count); +static int emulator_pio_in(struct kvm_vcpu *vcpu, int size, + unsigned short port, void *val, unsigned int count) +{ + if (vcpu->arch.pio.count) { + /* Complete previous iteration. */ + } else { + int r = __emulator_pio_in(vcpu, size, port, count); + if (!r) + return r; - ret = emulator_pio_in_out(vcpu, size, port, val, count, true); - if (ret) { -data_avail: - memcpy(val, vcpu->arch.pio_data, size * count); - trace_kvm_pio(KVM_PIO_IN, port, size, count, vcpu->arch.pio_data); - vcpu->arch.pio.count = 0; - return 1; + /* Results already available, fall through. */ } - return 0; + WARN_ON(count != vcpu->arch.pio.count); + complete_emulator_pio_in(vcpu, val); + return 1; } static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt, @@ -6963,9 +6974,15 @@ static int emulator_pio_out(struct kvm_vcpu *vcpu, int size, unsigned short port, const void *val, unsigned int count) { + int ret; + memcpy(vcpu->arch.pio_data, val, size * count); trace_kvm_pio(KVM_PIO_OUT, port, size, count, vcpu->arch.pio_data); - return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false); + ret = emulator_pio_in_out(vcpu, size, port, count, false); + if (ret) + vcpu->arch.pio.count = 0; + + return ret; } static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, @@ -8139,9 +8156,9 @@ static void kvm_hyperv_tsc_notifier(void) list_for_each_entry(kvm, &vm_list, vm_list) { struct kvm_arch *ka = &kvm->arch; - spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags); pvclock_update_vm_gtod_copy(kvm); - spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); + raw_spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags); kvm_for_each_vcpu(cpu, vcpu, kvm) kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); @@ -8783,9 +8800,17 @@ static void post_kvm_run_save(struct kvm_vcpu *vcpu) kvm_run->cr8 = kvm_get_cr8(vcpu); kvm_run->apic_base = kvm_get_apic_base(vcpu); + + /* + * The call to kvm_ready_for_interrupt_injection() may end up in + * kvm_xen_has_interrupt() which may require the srcu lock to be + * held, to protect against changes in the vcpu_info address. + */ + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu); kvm_run->ready_for_interrupt_injection = pic_in_kernel(vcpu->kvm) || kvm_vcpu_ready_for_interrupt_injection(vcpu); + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx); if (is_smm(vcpu)) kvm_run->flags |= KVM_RUN_X86_SMM; @@ -9643,14 +9668,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (likely(exit_fastpath != EXIT_FASTPATH_REENTER_GUEST)) break; - if (unlikely(kvm_vcpu_exit_request(vcpu))) { + if (vcpu->arch.apicv_active) + static_call(kvm_x86_sync_pir_to_irr)(vcpu); + + if (unlikely(kvm_vcpu_exit_request(vcpu))) { exit_fastpath = EXIT_FASTPATH_EXIT_HANDLED; break; } - - if (vcpu->arch.apicv_active) - static_call(kvm_x86_sync_pir_to_irr)(vcpu); - } + } /* * Do this here before restoring debug registers on the host. And @@ -11182,7 +11207,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) raw_spin_lock_init(&kvm->arch.tsc_write_lock); mutex_init(&kvm->arch.apic_map_lock); - spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock); + raw_spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock); kvm->arch.kvmclock_offset = -get_kvmclock_base_ns(); pvclock_update_vm_gtod_copy(kvm); @@ -11392,7 +11417,8 @@ static int memslot_rmap_alloc(struct kvm_memory_slot *slot, int level = i + 1; int lpages = __kvm_mmu_slot_lpages(slot, npages, level); - WARN_ON(slot->arch.rmap[i]); + if (slot->arch.rmap[i]) + continue; slot->arch.rmap[i] = kvcalloc(lpages, sz, GFP_KERNEL_ACCOUNT); if (!slot->arch.rmap[i]) { @@ -12367,44 +12393,81 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes, } EXPORT_SYMBOL_GPL(kvm_sev_es_mmio_read); -static int complete_sev_es_emulated_ins(struct kvm_vcpu *vcpu) +static int kvm_sev_es_outs(struct kvm_vcpu *vcpu, unsigned int size, + unsigned int port); + +static int complete_sev_es_emulated_outs(struct kvm_vcpu *vcpu) { - memcpy(vcpu->arch.guest_ins_data, vcpu->arch.pio_data, - vcpu->arch.pio.count * vcpu->arch.pio.size); - vcpu->arch.pio.count = 0; + int size = vcpu->arch.pio.size; + int port = vcpu->arch.pio.port; + vcpu->arch.pio.count = 0; + if (vcpu->arch.sev_pio_count) + return kvm_sev_es_outs(vcpu, size, port); return 1; } static int kvm_sev_es_outs(struct kvm_vcpu *vcpu, unsigned int size, - unsigned int port, void *data, unsigned int count) + unsigned int port) { - int ret; - - ret = emulator_pio_out_emulated(vcpu->arch.emulate_ctxt, size, port, - data, count); - if (ret) - return ret; + for (;;) { + unsigned int count = + min_t(unsigned int, PAGE_SIZE / size, vcpu->arch.sev_pio_count); + int ret = emulator_pio_out(vcpu, size, port, vcpu->arch.sev_pio_data, count); + + /* memcpy done already by emulator_pio_out. */ + vcpu->arch.sev_pio_count -= count; + vcpu->arch.sev_pio_data += count * vcpu->arch.pio.size; + if (!ret) + break; - vcpu->arch.pio.count = 0; + /* Emulation done by the kernel. */ + if (!vcpu->arch.sev_pio_count) + return 1; + } + vcpu->arch.complete_userspace_io = complete_sev_es_emulated_outs; return 0; } static int kvm_sev_es_ins(struct kvm_vcpu *vcpu, unsigned int size, - unsigned int port, void *data, unsigned int count) + unsigned int port); + +static void advance_sev_es_emulated_ins(struct kvm_vcpu *vcpu) { - int ret; + unsigned count = vcpu->arch.pio.count; + complete_emulator_pio_in(vcpu, vcpu->arch.sev_pio_data); + vcpu->arch.sev_pio_count -= count; + vcpu->arch.sev_pio_data += count * vcpu->arch.pio.size; +} - ret = emulator_pio_in_emulated(vcpu->arch.emulate_ctxt, size, port, - data, count); - if (ret) { - vcpu->arch.pio.count = 0; - } else { - vcpu->arch.guest_ins_data = data; - vcpu->arch.complete_userspace_io = complete_sev_es_emulated_ins; +static int complete_sev_es_emulated_ins(struct kvm_vcpu *vcpu) +{ + int size = vcpu->arch.pio.size; + int port = vcpu->arch.pio.port; + + advance_sev_es_emulated_ins(vcpu); + if (vcpu->arch.sev_pio_count) + return kvm_sev_es_ins(vcpu, size, port); + return 1; +} + +static int kvm_sev_es_ins(struct kvm_vcpu *vcpu, unsigned int size, + unsigned int port) +{ + for (;;) { + unsigned int count = + min_t(unsigned int, PAGE_SIZE / size, vcpu->arch.sev_pio_count); + if (!__emulator_pio_in(vcpu, size, port, count)) + break; + + /* Emulation done by the kernel. */ + advance_sev_es_emulated_ins(vcpu); + if (!vcpu->arch.sev_pio_count) + return 1; } + vcpu->arch.complete_userspace_io = complete_sev_es_emulated_ins; return 0; } @@ -12412,8 +12475,10 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size, unsigned int port, void *data, unsigned int count, int in) { - return in ? kvm_sev_es_ins(vcpu, size, port, data, count) - : kvm_sev_es_outs(vcpu, size, port, data, count); + vcpu->arch.sev_pio_data = data; + vcpu->arch.sev_pio_count = count; + return in ? kvm_sev_es_ins(vcpu, size, port) + : kvm_sev_es_outs(vcpu, size, port); } EXPORT_SYMBOL_GPL(kvm_sev_es_string_io); diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index 9ea9c3dabe37..8f62baebd028 100644 --- a/arch/x86/kvm/xen.c +++ b/arch/x86/kvm/xen.c @@ -190,6 +190,7 @@ void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, int state) int __kvm_xen_has_interrupt(struct kvm_vcpu *v) { + int err; u8 rc = 0; /* @@ -216,13 +217,29 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v) if (likely(slots->generation == ghc->generation && !kvm_is_error_hva(ghc->hva) && ghc->memslot)) { /* Fast path */ - __get_user(rc, (u8 __user *)ghc->hva + offset); - } else { - /* Slow path */ - kvm_read_guest_offset_cached(v->kvm, ghc, &rc, offset, - sizeof(rc)); + pagefault_disable(); + err = __get_user(rc, (u8 __user *)ghc->hva + offset); + pagefault_enable(); + if (!err) + return rc; } + /* Slow path */ + + /* + * This function gets called from kvm_vcpu_block() after setting the + * task to TASK_INTERRUPTIBLE, to see if it needs to wake immediately + * from a HLT. So we really mustn't sleep. If the page ended up absent + * at that point, just return 1 in order to trigger an immediate wake, + * and we'll end up getting called again from a context where we *can* + * fault in the page and wait for it. + */ + if (in_atomic() || !task_is_running(current)) + return 1; + + kvm_read_guest_offset_cached(v->kvm, ghc, &rc, offset, + sizeof(rc)); + return rc; } diff --git a/arch/xtensa/include/asm/cacheflush.h b/arch/xtensa/include/asm/cacheflush.h index cf907e5bf2f2..a8a041609c5d 100644 --- a/arch/xtensa/include/asm/cacheflush.h +++ b/arch/xtensa/include/asm/cacheflush.h @@ -120,7 +120,8 @@ void flush_cache_page(struct vm_area_struct*, #define flush_cache_vunmap(start,end) flush_cache_all() #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 -extern void flush_dcache_page(struct page*); +void flush_dcache_page(struct page *); +void flush_dcache_folio(struct folio *); void local_flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); @@ -137,7 +138,9 @@ void local_flush_cache_page(struct vm_area_struct *vma, #define flush_cache_vunmap(start,end) do { } while (0) #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0 +#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO #define flush_dcache_page(page) do { } while (0) +static inline void flush_dcache_folio(struct folio *folio) { } #define flush_icache_range local_flush_icache_range #define flush_cache_page(vma, addr, pfn) do { } while (0) diff --git a/arch/xtensa/platforms/iss/simdisk.c b/arch/xtensa/platforms/iss/simdisk.c index ddd1fe3db474..07b642c1916a 100644 --- a/arch/xtensa/platforms/iss/simdisk.c +++ b/arch/xtensa/platforms/iss/simdisk.c @@ -258,6 +258,7 @@ static int __init simdisk_setup(struct simdisk *dev, int which, struct proc_dir_entry *procdir) { char tmp[2] = { '0' + which, 0 }; + int err = -ENOMEM; dev->fd = -1; dev->filename = NULL; @@ -266,7 +267,7 @@ static int __init simdisk_setup(struct simdisk *dev, int which, dev->gd = blk_alloc_disk(NUMA_NO_NODE); if (!dev->gd) - return -ENOMEM; + goto out; dev->gd->major = simdisk_major; dev->gd->first_minor = which; dev->gd->minors = SIMDISK_MINORS; @@ -274,10 +275,18 @@ static int __init simdisk_setup(struct simdisk *dev, int which, dev->gd->private_data = dev; snprintf(dev->gd->disk_name, 32, "simdisk%d", which); set_capacity(dev->gd, 0); - add_disk(dev->gd); + err = add_disk(dev->gd); + if (err) + goto out_cleanup_disk; dev->procfile = proc_create_data(tmp, 0644, procdir, &simdisk_proc_ops, dev); + return 0; + +out_cleanup_disk: + blk_cleanup_disk(dev->gd); +out: + return err; } static int __init simdisk_init(void) diff --git a/block/Makefile b/block/Makefile index 74df168729ec..44df57e562bf 100644 --- a/block/Makefile +++ b/block/Makefile @@ -9,7 +9,7 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \ blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \ genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \ - disk-events.o + disk-events.o blk-ia-ranges.o obj-$(CONFIG_BOUNCE) += bounce.o obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o @@ -36,6 +36,6 @@ obj-$(CONFIG_BLK_DEBUG_FS) += blk-mq-debugfs.o obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o obj-$(CONFIG_BLK_SED_OPAL) += sed-opal.o obj-$(CONFIG_BLK_PM) += blk-pm.o -obj-$(CONFIG_BLK_INLINE_ENCRYPTION) += keyslot-manager.o blk-crypto.o +obj-$(CONFIG_BLK_INLINE_ENCRYPTION) += blk-crypto.o blk-crypto-profile.o obj-$(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) += blk-crypto-fallback.o obj-$(CONFIG_BLOCK_HOLDER_DEPRECATED) += holder.o diff --git a/block/bdev.c b/block/bdev.c index cff0bb3a4578..7e6156203a71 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -964,9 +964,11 @@ EXPORT_SYMBOL(blkdev_put); * @pathname: special file representing the block device * @dev: return value of the block device's dev_t * - * Get a reference to the blockdevice at @pathname in the current - * namespace if possible and return it. Return ERR_PTR(error) - * otherwise. + * Lookup the block device's dev_t at @pathname in the current + * namespace if possible and return it by @dev. + * + * RETURNS: + * 0 if succeeded, errno otherwise. */ int lookup_bdev(const char *pathname, dev_t *dev) { diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 882ec4bc51ad..24a5c5329bcd 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -463,7 +463,7 @@ static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp) { if (blkg_rwstat_init(&stats->bytes, gfp) || blkg_rwstat_init(&stats->ios, gfp)) - return -ENOMEM; + goto error; #ifdef CONFIG_BFQ_CGROUP_DEBUG if (blkg_rwstat_init(&stats->merged, gfp) || @@ -476,13 +476,15 @@ static int bfqg_stats_init(struct bfqg_stats *stats, gfp_t gfp) bfq_stat_init(&stats->dequeue, gfp) || bfq_stat_init(&stats->group_wait_time, gfp) || bfq_stat_init(&stats->idle_time, gfp) || - bfq_stat_init(&stats->empty_time, gfp)) { - bfqg_stats_exit(stats); - return -ENOMEM; - } + bfq_stat_init(&stats->empty_time, gfp)) + goto error; #endif return 0; + +error: + bfqg_stats_exit(stats); + return -ENOMEM; } static struct bfq_group_data *cpd_to_bfqgd(struct blkcg_policy_data *cpd) diff --git a/block/bio.c b/block/bio.c index 4f397ba47db5..15ab0d6d1c06 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1033,52 +1033,40 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); -void bio_release_pages(struct bio *bio, bool mark_dirty) +void __bio_release_pages(struct bio *bio, bool mark_dirty) { struct bvec_iter_all iter_all; struct bio_vec *bvec; - if (bio_flagged(bio, BIO_NO_PAGE_REF)) - return; - bio_for_each_segment_all(bvec, bio, iter_all) { if (mark_dirty && !PageCompound(bvec->bv_page)) set_page_dirty_lock(bvec->bv_page); put_page(bvec->bv_page); } } -EXPORT_SYMBOL_GPL(bio_release_pages); +EXPORT_SYMBOL_GPL(__bio_release_pages); -static void __bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter) +void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter) { + size_t size = iov_iter_count(iter); + WARN_ON_ONCE(bio->bi_max_vecs); + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + struct request_queue *q = bdev_get_queue(bio->bi_bdev); + size_t max_sectors = queue_max_zone_append_sectors(q); + + size = min(size, max_sectors << SECTOR_SHIFT); + } + bio->bi_vcnt = iter->nr_segs; bio->bi_io_vec = (struct bio_vec *)iter->bvec; bio->bi_iter.bi_bvec_done = iter->iov_offset; - bio->bi_iter.bi_size = iter->count; + bio->bi_iter.bi_size = size; bio_set_flag(bio, BIO_NO_PAGE_REF); bio_set_flag(bio, BIO_CLONED); } -static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter) -{ - __bio_iov_bvec_set(bio, iter); - iov_iter_advance(iter, iter->count); - return 0; -} - -static int bio_iov_bvec_set_append(struct bio *bio, struct iov_iter *iter) -{ - struct request_queue *q = bdev_get_queue(bio->bi_bdev); - struct iov_iter i = *iter; - - iov_iter_truncate(&i, queue_max_zone_append_sectors(q) << 9); - __bio_iov_bvec_set(bio, &i); - iov_iter_advance(iter, i.count); - return 0; -} - static void bio_put_pages(struct page **pages, size_t size, size_t off) { size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE); @@ -1220,9 +1208,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) int ret = 0; if (iov_iter_is_bvec(iter)) { - if (bio_op(bio) == REQ_OP_ZONE_APPEND) - return bio_iov_bvec_set_append(bio, iter); - return bio_iov_bvec_set(bio, iter); + bio_iov_bvec_set(bio, iter); + iov_iter_advance(iter, bio->bi_iter.bi_size); + return 0; } do { diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 8908298d6ad3..88b1fce90520 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -634,6 +634,14 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, q = bdev_get_queue(bdev); + /* + * blkcg_deactivate_policy() requires queue to be frozen, we can grab + * q_usage_counter to prevent concurrent with blkcg_deactivate_policy(). + */ + ret = blk_queue_enter(q, 0); + if (ret) + return ret; + rcu_read_lock(); spin_lock_irq(&q->queue_lock); @@ -703,6 +711,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, goto success; } success: + blk_queue_exit(q); ctx->bdev = bdev; ctx->blkg = blkg; ctx->body = input; @@ -715,6 +724,7 @@ fail_unlock: rcu_read_unlock(); fail: blkdev_put_no_open(bdev); + blk_queue_exit(q); /* * If queue was bypassing, we should retry. Do so after a * short msleep(). It isn't strictly necessary but queue diff --git a/block/blk-core.c b/block/blk-core.c index d0c2e11411d0..fd389a16013c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -389,7 +389,7 @@ EXPORT_SYMBOL(blk_cleanup_queue); static bool blk_try_enter_queue(struct request_queue *q, bool pm) { rcu_read_lock(); - if (!percpu_ref_tryget_live(&q->q_usage_counter)) + if (!percpu_ref_tryget_live_rcu(&q->q_usage_counter)) goto fail; /* @@ -404,7 +404,7 @@ static bool blk_try_enter_queue(struct request_queue *q, bool pm) return true; fail_put: - percpu_ref_put(&q->q_usage_counter); + blk_queue_exit(q); fail: rcu_read_unlock(); return false; @@ -1080,7 +1080,7 @@ EXPORT_SYMBOL(submit_bio); */ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags) { - struct request_queue *q = bio->bi_bdev->bd_disk->queue; + struct request_queue *q = bdev_get_queue(bio->bi_bdev); blk_qc_t cookie = READ_ONCE(bio->bi_cookie); int ret; @@ -1089,7 +1089,7 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags) return 0; if (current->plug) - blk_flush_plug_list(current->plug, false); + blk_flush_plug(current->plug, false); if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT)) return 0; @@ -1550,11 +1550,12 @@ void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios) if (tsk->plug) return; - INIT_LIST_HEAD(&plug->mq_list); + plug->mq_list = NULL; plug->cached_rq = NULL; plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT); plug->rq_count = 0; plug->multiple_queues = false; + plug->has_elevator = false; plug->nowait = false; INIT_LIST_HEAD(&plug->cb_list); @@ -1636,11 +1637,11 @@ struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data, } EXPORT_SYMBOL(blk_check_plugged); -void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) +void blk_flush_plug(struct blk_plug *plug, bool from_schedule) { - flush_plug_callbacks(plug, from_schedule); - - if (!list_empty(&plug->mq_list)) + if (!list_empty(&plug->cb_list)) + flush_plug_callbacks(plug, from_schedule); + if (!rq_list_empty(plug->mq_list)) blk_mq_flush_plug_list(plug, from_schedule); if (unlikely(!from_schedule && plug->cached_rq)) blk_mq_free_plug_rqs(plug); @@ -1658,11 +1659,10 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) */ void blk_finish_plug(struct blk_plug *plug) { - if (plug != current->plug) - return; - blk_flush_plug_list(plug, false); - - current->plug = NULL; + if (plug == current->plug) { + blk_flush_plug(plug, false); + current->plug = NULL; + } } EXPORT_SYMBOL(blk_finish_plug); diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c index ec4c7823541c..c87aba8584c6 100644 --- a/block/blk-crypto-fallback.c +++ b/block/blk-crypto-fallback.c @@ -12,9 +12,9 @@ #include <crypto/skcipher.h> #include <linux/blk-cgroup.h> #include <linux/blk-crypto.h> +#include <linux/blk-crypto-profile.h> #include <linux/blkdev.h> #include <linux/crypto.h> -#include <linux/keyslot-manager.h> #include <linux/mempool.h> #include <linux/module.h> #include <linux/random.h> @@ -73,12 +73,12 @@ static mempool_t *bio_fallback_crypt_ctx_pool; static DEFINE_MUTEX(tfms_init_lock); static bool tfms_inited[BLK_ENCRYPTION_MODE_MAX]; -static struct blk_crypto_keyslot { +static struct blk_crypto_fallback_keyslot { enum blk_crypto_mode_num crypto_mode; struct crypto_skcipher *tfms[BLK_ENCRYPTION_MODE_MAX]; } *blk_crypto_keyslots; -static struct blk_keyslot_manager blk_crypto_ksm; +static struct blk_crypto_profile blk_crypto_fallback_profile; static struct workqueue_struct *blk_crypto_wq; static mempool_t *blk_crypto_bounce_page_pool; static struct bio_set crypto_bio_split; @@ -89,9 +89,9 @@ static struct bio_set crypto_bio_split; */ static u8 blank_key[BLK_CRYPTO_MAX_KEY_SIZE]; -static void blk_crypto_evict_keyslot(unsigned int slot) +static void blk_crypto_fallback_evict_keyslot(unsigned int slot) { - struct blk_crypto_keyslot *slotp = &blk_crypto_keyslots[slot]; + struct blk_crypto_fallback_keyslot *slotp = &blk_crypto_keyslots[slot]; enum blk_crypto_mode_num crypto_mode = slotp->crypto_mode; int err; @@ -104,45 +104,41 @@ static void blk_crypto_evict_keyslot(unsigned int slot) slotp->crypto_mode = BLK_ENCRYPTION_MODE_INVALID; } -static int blk_crypto_keyslot_program(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key, - unsigned int slot) +static int +blk_crypto_fallback_keyslot_program(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key, + unsigned int slot) { - struct blk_crypto_keyslot *slotp = &blk_crypto_keyslots[slot]; + struct blk_crypto_fallback_keyslot *slotp = &blk_crypto_keyslots[slot]; const enum blk_crypto_mode_num crypto_mode = key->crypto_cfg.crypto_mode; int err; if (crypto_mode != slotp->crypto_mode && slotp->crypto_mode != BLK_ENCRYPTION_MODE_INVALID) - blk_crypto_evict_keyslot(slot); + blk_crypto_fallback_evict_keyslot(slot); slotp->crypto_mode = crypto_mode; err = crypto_skcipher_setkey(slotp->tfms[crypto_mode], key->raw, key->size); if (err) { - blk_crypto_evict_keyslot(slot); + blk_crypto_fallback_evict_keyslot(slot); return err; } return 0; } -static int blk_crypto_keyslot_evict(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key, - unsigned int slot) +static int blk_crypto_fallback_keyslot_evict(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key, + unsigned int slot) { - blk_crypto_evict_keyslot(slot); + blk_crypto_fallback_evict_keyslot(slot); return 0; } -/* - * The crypto API fallback KSM ops - only used for a bio when it specifies a - * blk_crypto_key that was not supported by the device's inline encryption - * hardware. - */ -static const struct blk_ksm_ll_ops blk_crypto_ksm_ll_ops = { - .keyslot_program = blk_crypto_keyslot_program, - .keyslot_evict = blk_crypto_keyslot_evict, +static const struct blk_crypto_ll_ops blk_crypto_fallback_ll_ops = { + .keyslot_program = blk_crypto_fallback_keyslot_program, + .keyslot_evict = blk_crypto_fallback_keyslot_evict, }; static void blk_crypto_fallback_encrypt_endio(struct bio *enc_bio) @@ -160,7 +156,7 @@ static void blk_crypto_fallback_encrypt_endio(struct bio *enc_bio) bio_endio(src_bio); } -static struct bio *blk_crypto_clone_bio(struct bio *bio_src) +static struct bio *blk_crypto_fallback_clone_bio(struct bio *bio_src) { struct bvec_iter iter; struct bio_vec bv; @@ -187,13 +183,14 @@ static struct bio *blk_crypto_clone_bio(struct bio *bio_src) return bio; } -static bool blk_crypto_alloc_cipher_req(struct blk_ksm_keyslot *slot, - struct skcipher_request **ciph_req_ret, - struct crypto_wait *wait) +static bool +blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot, + struct skcipher_request **ciph_req_ret, + struct crypto_wait *wait) { struct skcipher_request *ciph_req; - const struct blk_crypto_keyslot *slotp; - int keyslot_idx = blk_ksm_get_slot_idx(slot); + const struct blk_crypto_fallback_keyslot *slotp; + int keyslot_idx = blk_crypto_keyslot_index(slot); slotp = &blk_crypto_keyslots[keyslot_idx]; ciph_req = skcipher_request_alloc(slotp->tfms[slotp->crypto_mode], @@ -210,7 +207,7 @@ static bool blk_crypto_alloc_cipher_req(struct blk_ksm_keyslot *slot, return true; } -static bool blk_crypto_split_bio_if_needed(struct bio **bio_ptr) +static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr) { struct bio *bio = *bio_ptr; unsigned int i = 0; @@ -265,7 +262,7 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) { struct bio *src_bio, *enc_bio; struct bio_crypt_ctx *bc; - struct blk_ksm_keyslot *slot; + struct blk_crypto_keyslot *slot; int data_unit_size; struct skcipher_request *ciph_req = NULL; DECLARE_CRYPTO_WAIT(wait); @@ -277,7 +274,7 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) blk_status_t blk_st; /* Split the bio if it's too big for single page bvec */ - if (!blk_crypto_split_bio_if_needed(bio_ptr)) + if (!blk_crypto_fallback_split_bio_if_needed(bio_ptr)) return false; src_bio = *bio_ptr; @@ -285,24 +282,25 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) data_unit_size = bc->bc_key->crypto_cfg.data_unit_size; /* Allocate bounce bio for encryption */ - enc_bio = blk_crypto_clone_bio(src_bio); + enc_bio = blk_crypto_fallback_clone_bio(src_bio); if (!enc_bio) { src_bio->bi_status = BLK_STS_RESOURCE; return false; } /* - * Use the crypto API fallback keyslot manager to get a crypto_skcipher - * for the algorithm and key specified for this bio. + * Get a blk-crypto-fallback keyslot that contains a crypto_skcipher for + * this bio's algorithm and key. */ - blk_st = blk_ksm_get_slot_for_key(&blk_crypto_ksm, bc->bc_key, &slot); + blk_st = blk_crypto_get_keyslot(&blk_crypto_fallback_profile, + bc->bc_key, &slot); if (blk_st != BLK_STS_OK) { src_bio->bi_status = blk_st; goto out_put_enc_bio; } /* and then allocate an skcipher_request for it */ - if (!blk_crypto_alloc_cipher_req(slot, &ciph_req, &wait)) { + if (!blk_crypto_fallback_alloc_cipher_req(slot, &ciph_req, &wait)) { src_bio->bi_status = BLK_STS_RESOURCE; goto out_release_keyslot; } @@ -363,7 +361,7 @@ out_free_bounce_pages: out_free_ciph_req: skcipher_request_free(ciph_req); out_release_keyslot: - blk_ksm_put_slot(slot); + blk_crypto_put_keyslot(slot); out_put_enc_bio: if (enc_bio) bio_put(enc_bio); @@ -381,7 +379,7 @@ static void blk_crypto_fallback_decrypt_bio(struct work_struct *work) container_of(work, struct bio_fallback_crypt_ctx, work); struct bio *bio = f_ctx->bio; struct bio_crypt_ctx *bc = &f_ctx->crypt_ctx; - struct blk_ksm_keyslot *slot; + struct blk_crypto_keyslot *slot; struct skcipher_request *ciph_req = NULL; DECLARE_CRYPTO_WAIT(wait); u64 curr_dun[BLK_CRYPTO_DUN_ARRAY_SIZE]; @@ -394,17 +392,18 @@ static void blk_crypto_fallback_decrypt_bio(struct work_struct *work) blk_status_t blk_st; /* - * Use the crypto API fallback keyslot manager to get a crypto_skcipher - * for the algorithm and key specified for this bio. + * Get a blk-crypto-fallback keyslot that contains a crypto_skcipher for + * this bio's algorithm and key. */ - blk_st = blk_ksm_get_slot_for_key(&blk_crypto_ksm, bc->bc_key, &slot); + blk_st = blk_crypto_get_keyslot(&blk_crypto_fallback_profile, + bc->bc_key, &slot); if (blk_st != BLK_STS_OK) { bio->bi_status = blk_st; goto out_no_keyslot; } /* and then allocate an skcipher_request for it */ - if (!blk_crypto_alloc_cipher_req(slot, &ciph_req, &wait)) { + if (!blk_crypto_fallback_alloc_cipher_req(slot, &ciph_req, &wait)) { bio->bi_status = BLK_STS_RESOURCE; goto out; } @@ -435,7 +434,7 @@ static void blk_crypto_fallback_decrypt_bio(struct work_struct *work) out: skcipher_request_free(ciph_req); - blk_ksm_put_slot(slot); + blk_crypto_put_keyslot(slot); out_no_keyslot: mempool_free(f_ctx, bio_fallback_crypt_ctx_pool); bio_endio(bio); @@ -474,9 +473,9 @@ static void blk_crypto_fallback_decrypt_endio(struct bio *bio) * @bio_ptr: pointer to the bio to prepare * * If bio is doing a WRITE operation, this splits the bio into two parts if it's - * too big (see blk_crypto_split_bio_if_needed). It then allocates a bounce bio - * for the first part, encrypts it, and update bio_ptr to point to the bounce - * bio. + * too big (see blk_crypto_fallback_split_bio_if_needed()). It then allocates a + * bounce bio for the first part, encrypts it, and updates bio_ptr to point to + * the bounce bio. * * For a READ operation, we mark the bio for decryption by using bi_private and * bi_end_io. @@ -500,8 +499,8 @@ bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr) return false; } - if (!blk_ksm_crypto_cfg_supported(&blk_crypto_ksm, - &bc->bc_key->crypto_cfg)) { + if (!__blk_crypto_cfg_supported(&blk_crypto_fallback_profile, + &bc->bc_key->crypto_cfg)) { bio->bi_status = BLK_STS_NOTSUPP; return false; } @@ -527,7 +526,7 @@ bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr) int blk_crypto_fallback_evict_key(const struct blk_crypto_key *key) { - return blk_ksm_evict_key(&blk_crypto_ksm, key); + return __blk_crypto_evict_key(&blk_crypto_fallback_profile, key); } static bool blk_crypto_fallback_inited; @@ -535,6 +534,7 @@ static int blk_crypto_fallback_init(void) { int i; int err; + struct blk_crypto_profile *profile = &blk_crypto_fallback_profile; if (blk_crypto_fallback_inited) return 0; @@ -545,24 +545,24 @@ static int blk_crypto_fallback_init(void) if (err) goto out; - err = blk_ksm_init(&blk_crypto_ksm, blk_crypto_num_keyslots); + err = blk_crypto_profile_init(profile, blk_crypto_num_keyslots); if (err) goto fail_free_bioset; err = -ENOMEM; - blk_crypto_ksm.ksm_ll_ops = blk_crypto_ksm_ll_ops; - blk_crypto_ksm.max_dun_bytes_supported = BLK_CRYPTO_MAX_IV_SIZE; + profile->ll_ops = blk_crypto_fallback_ll_ops; + profile->max_dun_bytes_supported = BLK_CRYPTO_MAX_IV_SIZE; /* All blk-crypto modes have a crypto API fallback. */ for (i = 0; i < BLK_ENCRYPTION_MODE_MAX; i++) - blk_crypto_ksm.crypto_modes_supported[i] = 0xFFFFFFFF; - blk_crypto_ksm.crypto_modes_supported[BLK_ENCRYPTION_MODE_INVALID] = 0; + profile->modes_supported[i] = 0xFFFFFFFF; + profile->modes_supported[BLK_ENCRYPTION_MODE_INVALID] = 0; blk_crypto_wq = alloc_workqueue("blk_crypto_wq", WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM, num_online_cpus()); if (!blk_crypto_wq) - goto fail_free_ksm; + goto fail_destroy_profile; blk_crypto_keyslots = kcalloc(blk_crypto_num_keyslots, sizeof(blk_crypto_keyslots[0]), @@ -596,8 +596,8 @@ fail_free_keyslots: kfree(blk_crypto_keyslots); fail_free_wq: destroy_workqueue(blk_crypto_wq); -fail_free_ksm: - blk_ksm_destroy(&blk_crypto_ksm); +fail_destroy_profile: + blk_crypto_profile_destroy(profile); fail_free_bioset: bioset_exit(&crypto_bio_split); out: @@ -611,7 +611,7 @@ out: int blk_crypto_fallback_start_using_mode(enum blk_crypto_mode_num mode_num) { const char *cipher_str = blk_crypto_modes[mode_num].cipher_str; - struct blk_crypto_keyslot *slotp; + struct blk_crypto_fallback_keyslot *slotp; unsigned int i; int err = 0; diff --git a/block/blk-crypto-profile.c b/block/blk-crypto-profile.c new file mode 100644 index 000000000000..605ba0626a5c --- /dev/null +++ b/block/blk-crypto-profile.c @@ -0,0 +1,565 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright 2019 Google LLC + */ + +/** + * DOC: blk-crypto profiles + * + * 'struct blk_crypto_profile' contains all generic inline encryption-related + * state for a particular inline encryption device. blk_crypto_profile serves + * as the way that drivers for inline encryption hardware expose their crypto + * capabilities and certain functions (e.g., functions to program and evict + * keys) to upper layers. Device drivers that want to support inline encryption + * construct a crypto profile, then associate it with the disk's request_queue. + * + * If the device has keyslots, then its blk_crypto_profile also handles managing + * these keyslots in a device-independent way, using the driver-provided + * functions to program and evict keys as needed. This includes keeping track + * of which key and how many I/O requests are using each keyslot, getting + * keyslots for I/O requests, and handling key eviction requests. + * + * For more information, see Documentation/block/inline-encryption.rst. + */ + +#define pr_fmt(fmt) "blk-crypto: " fmt + +#include <linux/blk-crypto-profile.h> +#include <linux/device.h> +#include <linux/atomic.h> +#include <linux/mutex.h> +#include <linux/pm_runtime.h> +#include <linux/wait.h> +#include <linux/blkdev.h> +#include <linux/blk-integrity.h> + +struct blk_crypto_keyslot { + atomic_t slot_refs; + struct list_head idle_slot_node; + struct hlist_node hash_node; + const struct blk_crypto_key *key; + struct blk_crypto_profile *profile; +}; + +static inline void blk_crypto_hw_enter(struct blk_crypto_profile *profile) +{ + /* + * Calling into the driver requires profile->lock held and the device + * resumed. But we must resume the device first, since that can acquire + * and release profile->lock via blk_crypto_reprogram_all_keys(). + */ + if (profile->dev) + pm_runtime_get_sync(profile->dev); + down_write(&profile->lock); +} + +static inline void blk_crypto_hw_exit(struct blk_crypto_profile *profile) +{ + up_write(&profile->lock); + if (profile->dev) + pm_runtime_put_sync(profile->dev); +} + +/** + * blk_crypto_profile_init() - Initialize a blk_crypto_profile + * @profile: the blk_crypto_profile to initialize + * @num_slots: the number of keyslots + * + * Storage drivers must call this when starting to set up a blk_crypto_profile, + * before filling in additional fields. + * + * Return: 0 on success, or else a negative error code. + */ +int blk_crypto_profile_init(struct blk_crypto_profile *profile, + unsigned int num_slots) +{ + unsigned int slot; + unsigned int i; + unsigned int slot_hashtable_size; + + memset(profile, 0, sizeof(*profile)); + init_rwsem(&profile->lock); + + if (num_slots == 0) + return 0; + + /* Initialize keyslot management data. */ + + profile->slots = kvcalloc(num_slots, sizeof(profile->slots[0]), + GFP_KERNEL); + if (!profile->slots) + return -ENOMEM; + + profile->num_slots = num_slots; + + init_waitqueue_head(&profile->idle_slots_wait_queue); + INIT_LIST_HEAD(&profile->idle_slots); + + for (slot = 0; slot < num_slots; slot++) { + profile->slots[slot].profile = profile; + list_add_tail(&profile->slots[slot].idle_slot_node, + &profile->idle_slots); + } + + spin_lock_init(&profile->idle_slots_lock); + + slot_hashtable_size = roundup_pow_of_two(num_slots); + /* + * hash_ptr() assumes bits != 0, so ensure the hash table has at least 2 + * buckets. This only makes a difference when there is only 1 keyslot. + */ + if (slot_hashtable_size < 2) + slot_hashtable_size = 2; + + profile->log_slot_ht_size = ilog2(slot_hashtable_size); + profile->slot_hashtable = + kvmalloc_array(slot_hashtable_size, + sizeof(profile->slot_hashtable[0]), GFP_KERNEL); + if (!profile->slot_hashtable) + goto err_destroy; + for (i = 0; i < slot_hashtable_size; i++) + INIT_HLIST_HEAD(&profile->slot_hashtable[i]); + + return 0; + +err_destroy: + blk_crypto_profile_destroy(profile); + return -ENOMEM; +} +EXPORT_SYMBOL_GPL(blk_crypto_profile_init); + +static void blk_crypto_profile_destroy_callback(void *profile) +{ + blk_crypto_profile_destroy(profile); +} + +/** + * devm_blk_crypto_profile_init() - Resource-managed blk_crypto_profile_init() + * @dev: the device which owns the blk_crypto_profile + * @profile: the blk_crypto_profile to initialize + * @num_slots: the number of keyslots + * + * Like blk_crypto_profile_init(), but causes blk_crypto_profile_destroy() to be + * called automatically on driver detach. + * + * Return: 0 on success, or else a negative error code. + */ +int devm_blk_crypto_profile_init(struct device *dev, + struct blk_crypto_profile *profile, + unsigned int num_slots) +{ + int err = blk_crypto_profile_init(profile, num_slots); + + if (err) + return err; + + return devm_add_action_or_reset(dev, + blk_crypto_profile_destroy_callback, + profile); +} +EXPORT_SYMBOL_GPL(devm_blk_crypto_profile_init); + +static inline struct hlist_head * +blk_crypto_hash_bucket_for_key(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key) +{ + return &profile->slot_hashtable[ + hash_ptr(key, profile->log_slot_ht_size)]; +} + +static void +blk_crypto_remove_slot_from_lru_list(struct blk_crypto_keyslot *slot) +{ + struct blk_crypto_profile *profile = slot->profile; + unsigned long flags; + + spin_lock_irqsave(&profile->idle_slots_lock, flags); + list_del(&slot->idle_slot_node); + spin_unlock_irqrestore(&profile->idle_slots_lock, flags); +} + +static struct blk_crypto_keyslot * +blk_crypto_find_keyslot(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key) +{ + const struct hlist_head *head = + blk_crypto_hash_bucket_for_key(profile, key); + struct blk_crypto_keyslot *slotp; + + hlist_for_each_entry(slotp, head, hash_node) { + if (slotp->key == key) + return slotp; + } + return NULL; +} + +static struct blk_crypto_keyslot * +blk_crypto_find_and_grab_keyslot(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key) +{ + struct blk_crypto_keyslot *slot; + + slot = blk_crypto_find_keyslot(profile, key); + if (!slot) + return NULL; + if (atomic_inc_return(&slot->slot_refs) == 1) { + /* Took first reference to this slot; remove it from LRU list */ + blk_crypto_remove_slot_from_lru_list(slot); + } + return slot; +} + +/** + * blk_crypto_keyslot_index() - Get the index of a keyslot + * @slot: a keyslot that blk_crypto_get_keyslot() returned + * + * Return: the 0-based index of the keyslot within the device's keyslots. + */ +unsigned int blk_crypto_keyslot_index(struct blk_crypto_keyslot *slot) +{ + return slot - slot->profile->slots; +} +EXPORT_SYMBOL_GPL(blk_crypto_keyslot_index); + +/** + * blk_crypto_get_keyslot() - Get a keyslot for a key, if needed. + * @profile: the crypto profile of the device the key will be used on + * @key: the key that will be used + * @slot_ptr: If a keyslot is allocated, an opaque pointer to the keyslot struct + * will be stored here; otherwise NULL will be stored here. + * + * If the device has keyslots, this gets a keyslot that's been programmed with + * the specified key. If the key is already in a slot, this reuses it; + * otherwise this waits for a slot to become idle and programs the key into it. + * + * This must be paired with a call to blk_crypto_put_keyslot(). + * + * Context: Process context. Takes and releases profile->lock. + * Return: BLK_STS_OK on success, meaning that either a keyslot was allocated or + * one wasn't needed; or a blk_status_t error on failure. + */ +blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key, + struct blk_crypto_keyslot **slot_ptr) +{ + struct blk_crypto_keyslot *slot; + int slot_idx; + int err; + + *slot_ptr = NULL; + + /* + * If the device has no concept of "keyslots", then there is no need to + * get one. + */ + if (profile->num_slots == 0) + return BLK_STS_OK; + + down_read(&profile->lock); + slot = blk_crypto_find_and_grab_keyslot(profile, key); + up_read(&profile->lock); + if (slot) + goto success; + + for (;;) { + blk_crypto_hw_enter(profile); + slot = blk_crypto_find_and_grab_keyslot(profile, key); + if (slot) { + blk_crypto_hw_exit(profile); + goto success; + } + + /* + * If we're here, that means there wasn't a slot that was + * already programmed with the key. So try to program it. + */ + if (!list_empty(&profile->idle_slots)) + break; + + blk_crypto_hw_exit(profile); + wait_event(profile->idle_slots_wait_queue, + !list_empty(&profile->idle_slots)); + } + + slot = list_first_entry(&profile->idle_slots, struct blk_crypto_keyslot, + idle_slot_node); + slot_idx = blk_crypto_keyslot_index(slot); + + err = profile->ll_ops.keyslot_program(profile, key, slot_idx); + if (err) { + wake_up(&profile->idle_slots_wait_queue); + blk_crypto_hw_exit(profile); + return errno_to_blk_status(err); + } + + /* Move this slot to the hash list for the new key. */ + if (slot->key) + hlist_del(&slot->hash_node); + slot->key = key; + hlist_add_head(&slot->hash_node, + blk_crypto_hash_bucket_for_key(profile, key)); + + atomic_set(&slot->slot_refs, 1); + + blk_crypto_remove_slot_from_lru_list(slot); + + blk_crypto_hw_exit(profile); +success: + *slot_ptr = slot; + return BLK_STS_OK; +} + +/** + * blk_crypto_put_keyslot() - Release a reference to a keyslot + * @slot: The keyslot to release the reference of (may be NULL). + * + * Context: Any context. + */ +void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot) +{ + struct blk_crypto_profile *profile; + unsigned long flags; + + if (!slot) + return; + + profile = slot->profile; + + if (atomic_dec_and_lock_irqsave(&slot->slot_refs, + &profile->idle_slots_lock, flags)) { + list_add_tail(&slot->idle_slot_node, &profile->idle_slots); + spin_unlock_irqrestore(&profile->idle_slots_lock, flags); + wake_up(&profile->idle_slots_wait_queue); + } +} + +/** + * __blk_crypto_cfg_supported() - Check whether the given crypto profile + * supports the given crypto configuration. + * @profile: the crypto profile to check + * @cfg: the crypto configuration to check for + * + * Return: %true if @profile supports the given @cfg. + */ +bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile, + const struct blk_crypto_config *cfg) +{ + if (!profile) + return false; + if (!(profile->modes_supported[cfg->crypto_mode] & cfg->data_unit_size)) + return false; + if (profile->max_dun_bytes_supported < cfg->dun_bytes) + return false; + return true; +} + +/** + * __blk_crypto_evict_key() - Evict a key from a device. + * @profile: the crypto profile of the device + * @key: the key to evict. It must not still be used in any I/O. + * + * If the device has keyslots, this finds the keyslot (if any) that contains the + * specified key and calls the driver's keyslot_evict function to evict it. + * + * Otherwise, this just calls the driver's keyslot_evict function if it is + * implemented, passing just the key (without any particular keyslot). This + * allows layered devices to evict the key from their underlying devices. + * + * Context: Process context. Takes and releases profile->lock. + * Return: 0 on success or if there's no keyslot with the specified key, -EBUSY + * if the keyslot is still in use, or another -errno value on other + * error. + */ +int __blk_crypto_evict_key(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key) +{ + struct blk_crypto_keyslot *slot; + int err = 0; + + if (profile->num_slots == 0) { + if (profile->ll_ops.keyslot_evict) { + blk_crypto_hw_enter(profile); + err = profile->ll_ops.keyslot_evict(profile, key, -1); + blk_crypto_hw_exit(profile); + return err; + } + return 0; + } + + blk_crypto_hw_enter(profile); + slot = blk_crypto_find_keyslot(profile, key); + if (!slot) + goto out_unlock; + + if (WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)) { + err = -EBUSY; + goto out_unlock; + } + err = profile->ll_ops.keyslot_evict(profile, key, + blk_crypto_keyslot_index(slot)); + if (err) + goto out_unlock; + + hlist_del(&slot->hash_node); + slot->key = NULL; + err = 0; +out_unlock: + blk_crypto_hw_exit(profile); + return err; +} + +/** + * blk_crypto_reprogram_all_keys() - Re-program all keyslots. + * @profile: The crypto profile + * + * Re-program all keyslots that are supposed to have a key programmed. This is + * intended only for use by drivers for hardware that loses its keys on reset. + * + * Context: Process context. Takes and releases profile->lock. + */ +void blk_crypto_reprogram_all_keys(struct blk_crypto_profile *profile) +{ + unsigned int slot; + + if (profile->num_slots == 0) + return; + + /* This is for device initialization, so don't resume the device */ + down_write(&profile->lock); + for (slot = 0; slot < profile->num_slots; slot++) { + const struct blk_crypto_key *key = profile->slots[slot].key; + int err; + + if (!key) + continue; + + err = profile->ll_ops.keyslot_program(profile, key, slot); + WARN_ON(err); + } + up_write(&profile->lock); +} +EXPORT_SYMBOL_GPL(blk_crypto_reprogram_all_keys); + +void blk_crypto_profile_destroy(struct blk_crypto_profile *profile) +{ + if (!profile) + return; + kvfree(profile->slot_hashtable); + kvfree_sensitive(profile->slots, + sizeof(profile->slots[0]) * profile->num_slots); + memzero_explicit(profile, sizeof(*profile)); +} +EXPORT_SYMBOL_GPL(blk_crypto_profile_destroy); + +bool blk_crypto_register(struct blk_crypto_profile *profile, + struct request_queue *q) +{ + if (blk_integrity_queue_supports_integrity(q)) { + pr_warn("Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n"); + return false; + } + q->crypto_profile = profile; + return true; +} +EXPORT_SYMBOL_GPL(blk_crypto_register); + +void blk_crypto_unregister(struct request_queue *q) +{ + q->crypto_profile = NULL; +} + +/** + * blk_crypto_intersect_capabilities() - restrict supported crypto capabilities + * by child device + * @parent: the crypto profile for the parent device + * @child: the crypto profile for the child device, or NULL + * + * This clears all crypto capabilities in @parent that aren't set in @child. If + * @child is NULL, then this clears all parent capabilities. + * + * Only use this when setting up the crypto profile for a layered device, before + * it's been exposed yet. + */ +void blk_crypto_intersect_capabilities(struct blk_crypto_profile *parent, + const struct blk_crypto_profile *child) +{ + if (child) { + unsigned int i; + + parent->max_dun_bytes_supported = + min(parent->max_dun_bytes_supported, + child->max_dun_bytes_supported); + for (i = 0; i < ARRAY_SIZE(child->modes_supported); i++) + parent->modes_supported[i] &= child->modes_supported[i]; + } else { + parent->max_dun_bytes_supported = 0; + memset(parent->modes_supported, 0, + sizeof(parent->modes_supported)); + } +} +EXPORT_SYMBOL_GPL(blk_crypto_intersect_capabilities); + +/** + * blk_crypto_has_capabilities() - Check whether @target supports at least all + * the crypto capabilities that @reference does. + * @target: the target profile + * @reference: the reference profile + * + * Return: %true if @target supports all the crypto capabilities of @reference. + */ +bool blk_crypto_has_capabilities(const struct blk_crypto_profile *target, + const struct blk_crypto_profile *reference) +{ + int i; + + if (!reference) + return true; + + if (!target) + return false; + + for (i = 0; i < ARRAY_SIZE(target->modes_supported); i++) { + if (reference->modes_supported[i] & ~target->modes_supported[i]) + return false; + } + + if (reference->max_dun_bytes_supported > + target->max_dun_bytes_supported) + return false; + + return true; +} +EXPORT_SYMBOL_GPL(blk_crypto_has_capabilities); + +/** + * blk_crypto_update_capabilities() - Update the capabilities of a crypto + * profile to match those of another crypto + * profile. + * @dst: The crypto profile whose capabilities to update. + * @src: The crypto profile whose capabilities this function will update @dst's + * capabilities to. + * + * Blk-crypto requires that crypto capabilities that were + * advertised when a bio was created continue to be supported by the + * device until that bio is ended. This is turn means that a device cannot + * shrink its advertised crypto capabilities without any explicit + * synchronization with upper layers. So if there's no such explicit + * synchronization, @src must support all the crypto capabilities that + * @dst does (i.e. we need blk_crypto_has_capabilities(@src, @dst)). + * + * Note also that as long as the crypto capabilities are being expanded, the + * order of updates becoming visible is not important because it's alright + * for blk-crypto to see stale values - they only cause blk-crypto to + * believe that a crypto capability isn't supported when it actually is (which + * might result in blk-crypto-fallback being used if available, or the bio being + * failed). + */ +void blk_crypto_update_capabilities(struct blk_crypto_profile *dst, + const struct blk_crypto_profile *src) +{ + memcpy(dst->modes_supported, src->modes_supported, + sizeof(dst->modes_supported)); + + dst->max_dun_bytes_supported = src->max_dun_bytes_supported; +} +EXPORT_SYMBOL_GPL(blk_crypto_update_capabilities); diff --git a/block/blk-crypto.c b/block/blk-crypto.c index 8f53f4a1f9e2..ec9efeeeca91 100644 --- a/block/blk-crypto.c +++ b/block/blk-crypto.c @@ -11,7 +11,7 @@ #include <linux/bio.h> #include <linux/blkdev.h> -#include <linux/keyslot-manager.h> +#include <linux/blk-crypto-profile.h> #include <linux/module.h> #include <linux/slab.h> @@ -218,8 +218,9 @@ static bool bio_crypt_check_alignment(struct bio *bio) blk_status_t __blk_crypto_init_request(struct request *rq) { - return blk_ksm_get_slot_for_key(rq->q->ksm, rq->crypt_ctx->bc_key, - &rq->crypt_keyslot); + return blk_crypto_get_keyslot(rq->q->crypto_profile, + rq->crypt_ctx->bc_key, + &rq->crypt_keyslot); } /** @@ -233,7 +234,7 @@ blk_status_t __blk_crypto_init_request(struct request *rq) */ void __blk_crypto_free_request(struct request *rq) { - blk_ksm_put_slot(rq->crypt_keyslot); + blk_crypto_put_keyslot(rq->crypt_keyslot); mempool_free(rq->crypt_ctx, bio_crypt_ctx_pool); blk_crypto_rq_set_defaults(rq); } @@ -264,6 +265,7 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr) { struct bio *bio = *bio_ptr; const struct blk_crypto_key *bc_key = bio->bi_crypt_context->bc_key; + struct blk_crypto_profile *profile; /* Error if bio has no data. */ if (WARN_ON_ONCE(!bio_has_data(bio))) { @@ -280,8 +282,8 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr) * Success if device supports the encryption context, or if we succeeded * in falling back to the crypto API. */ - if (blk_ksm_crypto_cfg_supported(bdev_get_queue(bio->bi_bdev)->ksm, - &bc_key->crypto_cfg)) + profile = bdev_get_queue(bio->bi_bdev)->crypto_profile; + if (__blk_crypto_cfg_supported(profile, &bc_key->crypto_cfg)) return true; if (blk_crypto_fallback_bio_prep(bio_ptr)) @@ -357,7 +359,7 @@ bool blk_crypto_config_supported(struct request_queue *q, const struct blk_crypto_config *cfg) { return IS_ENABLED(CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK) || - blk_ksm_crypto_cfg_supported(q->ksm, cfg); + __blk_crypto_cfg_supported(q->crypto_profile, cfg); } /** @@ -378,7 +380,7 @@ bool blk_crypto_config_supported(struct request_queue *q, int blk_crypto_start_using_key(const struct blk_crypto_key *key, struct request_queue *q) { - if (blk_ksm_crypto_cfg_supported(q->ksm, &key->crypto_cfg)) + if (__blk_crypto_cfg_supported(q->crypto_profile, &key->crypto_cfg)) return 0; return blk_crypto_fallback_start_using_mode(key->crypto_cfg.crypto_mode); } @@ -394,18 +396,17 @@ int blk_crypto_start_using_key(const struct blk_crypto_key *key, * evicted from any hardware that it might have been programmed into. The key * must not be in use by any in-flight IO when this function is called. * - * Return: 0 on success or if key is not present in the q's ksm, -err on error. + * Return: 0 on success or if the key wasn't in any keyslot; -errno on error. */ int blk_crypto_evict_key(struct request_queue *q, const struct blk_crypto_key *key) { - if (blk_ksm_crypto_cfg_supported(q->ksm, &key->crypto_cfg)) - return blk_ksm_evict_key(q->ksm, key); + if (__blk_crypto_cfg_supported(q->crypto_profile, &key->crypto_cfg)) + return __blk_crypto_evict_key(q->crypto_profile, key); /* - * If the request queue's associated inline encryption hardware didn't - * have support for the key, then the key might have been programmed - * into the fallback keyslot manager, so try to evict from there. + * If the request_queue didn't support the key, then blk-crypto-fallback + * may have been used, so try to evict the key from blk-crypto-fallback. */ return blk_crypto_fallback_evict_key(key); } diff --git a/block/blk-flush.c b/block/blk-flush.c index 4201728bf3a5..8e364bda5166 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -379,7 +379,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error) * @rq is being submitted. Analyze what needs to be done and put it on the * right queue. */ -void blk_insert_flush(struct request *rq) +bool blk_insert_flush(struct request *rq) { struct request_queue *q = rq->q; unsigned long fflags = q->queue_flags; /* may change, cache */ @@ -409,7 +409,7 @@ void blk_insert_flush(struct request *rq) */ if (!policy) { blk_mq_end_request(rq, 0); - return; + return true; } BUG_ON(rq->bio != rq->biotail); /*assumes zero or single bio rq */ @@ -420,10 +420,8 @@ void blk_insert_flush(struct request *rq) * for normal execution. */ if ((policy & REQ_FSEQ_DATA) && - !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { - blk_mq_request_bypass_insert(rq, false, false); - return; - } + !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) + return false; /* * @rq should go through flush machinery. Mark it part of flush @@ -439,6 +437,8 @@ void blk_insert_flush(struct request *rq) spin_lock_irq(&fq->mq_flush_lock); blk_flush_complete_seq(rq, fq, REQ_FSEQ_ACTIONS & ~policy, 0); spin_unlock_irq(&fq->mq_flush_lock); + + return true; } /** diff --git a/block/blk-ia-ranges.c b/block/blk-ia-ranges.c new file mode 100644 index 000000000000..c246c425d0d7 --- /dev/null +++ b/block/blk-ia-ranges.c @@ -0,0 +1,348 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Block device concurrent positioning ranges. + * + * Copyright (C) 2021 Western Digital Corporation or its Affiliates. + */ +#include <linux/kernel.h> +#include <linux/blkdev.h> +#include <linux/slab.h> +#include <linux/init.h> + +#include "blk.h" + +static ssize_t +blk_ia_range_sector_show(struct blk_independent_access_range *iar, + char *buf) +{ + return sprintf(buf, "%llu\n", iar->sector); +} + +static ssize_t +blk_ia_range_nr_sectors_show(struct blk_independent_access_range *iar, + char *buf) +{ + return sprintf(buf, "%llu\n", iar->nr_sectors); +} + +struct blk_ia_range_sysfs_entry { + struct attribute attr; + ssize_t (*show)(struct blk_independent_access_range *iar, char *buf); +}; + +static struct blk_ia_range_sysfs_entry blk_ia_range_sector_entry = { + .attr = { .name = "sector", .mode = 0444 }, + .show = blk_ia_range_sector_show, +}; + +static struct blk_ia_range_sysfs_entry blk_ia_range_nr_sectors_entry = { + .attr = { .name = "nr_sectors", .mode = 0444 }, + .show = blk_ia_range_nr_sectors_show, +}; + +static struct attribute *blk_ia_range_attrs[] = { + &blk_ia_range_sector_entry.attr, + &blk_ia_range_nr_sectors_entry.attr, + NULL, +}; +ATTRIBUTE_GROUPS(blk_ia_range); + +static ssize_t blk_ia_range_sysfs_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct blk_ia_range_sysfs_entry *entry = + container_of(attr, struct blk_ia_range_sysfs_entry, attr); + struct blk_independent_access_range *iar = + container_of(kobj, struct blk_independent_access_range, kobj); + ssize_t ret; + + mutex_lock(&iar->queue->sysfs_lock); + ret = entry->show(iar, buf); + mutex_unlock(&iar->queue->sysfs_lock); + + return ret; +} + +static const struct sysfs_ops blk_ia_range_sysfs_ops = { + .show = blk_ia_range_sysfs_show, +}; + +/* + * Independent access range entries are not freed individually, but alltogether + * with struct blk_independent_access_ranges and its array of ranges. Since + * kobject_add() takes a reference on the parent kobject contained in + * struct blk_independent_access_ranges, the array of independent access range + * entries cannot be freed until kobject_del() is called for all entries. + * So we do not need to do anything here, but still need this no-op release + * operation to avoid complaints from the kobject code. + */ +static void blk_ia_range_sysfs_nop_release(struct kobject *kobj) +{ +} + +static struct kobj_type blk_ia_range_ktype = { + .sysfs_ops = &blk_ia_range_sysfs_ops, + .default_groups = blk_ia_range_groups, + .release = blk_ia_range_sysfs_nop_release, +}; + +/* + * This will be executed only after all independent access range entries are + * removed with kobject_del(), at which point, it is safe to free everything, + * including the array of ranges. + */ +static void blk_ia_ranges_sysfs_release(struct kobject *kobj) +{ + struct blk_independent_access_ranges *iars = + container_of(kobj, struct blk_independent_access_ranges, kobj); + + kfree(iars); +} + +static struct kobj_type blk_ia_ranges_ktype = { + .release = blk_ia_ranges_sysfs_release, +}; + +/** + * disk_register_ia_ranges - register with sysfs a set of independent + * access ranges + * @disk: Target disk + * @new_iars: New set of independent access ranges + * + * Register with sysfs a set of independent access ranges for @disk. + * If @new_iars is not NULL, this set of ranges is registered and the old set + * specified by q->ia_ranges is unregistered. Otherwise, q->ia_ranges is + * registered if it is not already. + */ +int disk_register_independent_access_ranges(struct gendisk *disk, + struct blk_independent_access_ranges *new_iars) +{ + struct request_queue *q = disk->queue; + struct blk_independent_access_ranges *iars; + int i, ret; + + lockdep_assert_held(&q->sysfs_dir_lock); + lockdep_assert_held(&q->sysfs_lock); + + /* If a new range set is specified, unregister the old one */ + if (new_iars) { + if (q->ia_ranges) + disk_unregister_independent_access_ranges(disk); + q->ia_ranges = new_iars; + } + + iars = q->ia_ranges; + if (!iars) + return 0; + + /* + * At this point, iars is the new set of sector access ranges that needs + * to be registered with sysfs. + */ + WARN_ON(iars->sysfs_registered); + ret = kobject_init_and_add(&iars->kobj, &blk_ia_ranges_ktype, + &q->kobj, "%s", "independent_access_ranges"); + if (ret) { + q->ia_ranges = NULL; + kfree(iars); + return ret; + } + + for (i = 0; i < iars->nr_ia_ranges; i++) { + iars->ia_range[i].queue = q; + ret = kobject_init_and_add(&iars->ia_range[i].kobj, + &blk_ia_range_ktype, &iars->kobj, + "%d", i); + if (ret) { + while (--i >= 0) + kobject_del(&iars->ia_range[i].kobj); + kobject_del(&iars->kobj); + kobject_put(&iars->kobj); + return ret; + } + } + + iars->sysfs_registered = true; + + return 0; +} + +void disk_unregister_independent_access_ranges(struct gendisk *disk) +{ + struct request_queue *q = disk->queue; + struct blk_independent_access_ranges *iars = q->ia_ranges; + int i; + + lockdep_assert_held(&q->sysfs_dir_lock); + lockdep_assert_held(&q->sysfs_lock); + + if (!iars) + return; + + if (iars->sysfs_registered) { + for (i = 0; i < iars->nr_ia_ranges; i++) + kobject_del(&iars->ia_range[i].kobj); + kobject_del(&iars->kobj); + kobject_put(&iars->kobj); + } else { + kfree(iars); + } + + q->ia_ranges = NULL; +} + +static struct blk_independent_access_range * +disk_find_ia_range(struct blk_independent_access_ranges *iars, + sector_t sector) +{ + struct blk_independent_access_range *iar; + int i; + + for (i = 0; i < iars->nr_ia_ranges; i++) { + iar = &iars->ia_range[i]; + if (sector >= iar->sector && + sector < iar->sector + iar->nr_sectors) + return iar; + } + + return NULL; +} + +static bool disk_check_ia_ranges(struct gendisk *disk, + struct blk_independent_access_ranges *iars) +{ + struct blk_independent_access_range *iar, *tmp; + sector_t capacity = get_capacity(disk); + sector_t sector = 0; + int i; + + /* + * While sorting the ranges in increasing LBA order, check that the + * ranges do not overlap, that there are no sector holes and that all + * sectors belong to one range. + */ + for (i = 0; i < iars->nr_ia_ranges; i++) { + tmp = disk_find_ia_range(iars, sector); + if (!tmp || tmp->sector != sector) { + pr_warn("Invalid non-contiguous independent access ranges\n"); + return false; + } + + iar = &iars->ia_range[i]; + if (tmp != iar) { + swap(iar->sector, tmp->sector); + swap(iar->nr_sectors, tmp->nr_sectors); + } + + sector += iar->nr_sectors; + } + + if (sector != capacity) { + pr_warn("Independent access ranges do not match disk capacity\n"); + return false; + } + + return true; +} + +static bool disk_ia_ranges_changed(struct gendisk *disk, + struct blk_independent_access_ranges *new) +{ + struct blk_independent_access_ranges *old = disk->queue->ia_ranges; + int i; + + if (!old) + return true; + + if (old->nr_ia_ranges != new->nr_ia_ranges) + return true; + + for (i = 0; i < old->nr_ia_ranges; i++) { + if (new->ia_range[i].sector != old->ia_range[i].sector || + new->ia_range[i].nr_sectors != old->ia_range[i].nr_sectors) + return true; + } + + return false; +} + +/** + * disk_alloc_independent_access_ranges - Allocate an independent access ranges + * data structure + * @disk: target disk + * @nr_ia_ranges: Number of independent access ranges + * + * Allocate a struct blk_independent_access_ranges structure with @nr_ia_ranges + * access range descriptors. + */ +struct blk_independent_access_ranges * +disk_alloc_independent_access_ranges(struct gendisk *disk, int nr_ia_ranges) +{ + struct blk_independent_access_ranges *iars; + + iars = kzalloc_node(struct_size(iars, ia_range, nr_ia_ranges), + GFP_KERNEL, disk->queue->node); + if (iars) + iars->nr_ia_ranges = nr_ia_ranges; + return iars; +} +EXPORT_SYMBOL_GPL(disk_alloc_independent_access_ranges); + +/** + * disk_set_independent_access_ranges - Set a disk independent access ranges + * @disk: target disk + * @iars: independent access ranges structure + * + * Set the independent access ranges information of the request queue + * of @disk to @iars. If @iars is NULL and the independent access ranges + * structure already set is cleared. If there are no differences between + * @iars and the independent access ranges structure already set, @iars + * is freed. + */ +void disk_set_independent_access_ranges(struct gendisk *disk, + struct blk_independent_access_ranges *iars) +{ + struct request_queue *q = disk->queue; + + if (WARN_ON_ONCE(iars && !iars->nr_ia_ranges)) { + kfree(iars); + iars = NULL; + } + + mutex_lock(&q->sysfs_dir_lock); + mutex_lock(&q->sysfs_lock); + + if (iars) { + if (!disk_check_ia_ranges(disk, iars)) { + kfree(iars); + iars = NULL; + goto reg; + } + + if (!disk_ia_ranges_changed(disk, iars)) { + kfree(iars); + goto unlock; + } + } + + /* + * This may be called for a registered queue. E.g. during a device + * revalidation. If that is the case, we need to unregister the old + * set of independent access ranges and register the new set. If the + * queue is not registered, registration of the device request queue + * will register the independent access ranges, so only swap in the + * new set and free the old one. + */ +reg: + if (blk_queue_registered(q)) { + disk_register_independent_access_ranges(disk, iars); + } else { + swap(q->ia_ranges, iars); + kfree(iars); + } + +unlock: + mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); +} +EXPORT_SYMBOL_GPL(disk_set_independent_access_ranges); diff --git a/block/blk-integrity.c b/block/blk-integrity.c index cef534a7cbc9..d670d54e5f7a 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -409,9 +409,9 @@ void blk_integrity_register(struct gendisk *disk, struct blk_integrity *template blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, disk->queue); #ifdef CONFIG_BLK_INLINE_ENCRYPTION - if (disk->queue->ksm) { + if (disk->queue->crypto_profile) { pr_warn("blk-integrity: Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n"); - blk_ksm_unregister(disk->queue); + blk_crypto_unregister(disk->queue); } #endif } diff --git a/block/blk-merge.c b/block/blk-merge.c index ec727234ac48..df69f4bb7717 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -383,7 +383,7 @@ void __blk_queue_split(struct request_queue *q, struct bio **bio, */ void blk_queue_split(struct bio **bio) { - struct request_queue *q = (*bio)->bi_bdev->bd_disk->queue; + struct request_queue *q = bdev_get_queue((*bio)->bi_bdev); unsigned int nr_segs; if (blk_may_split(q, *bio)) @@ -1067,9 +1067,8 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q, * @q: request_queue new bio is being queued at * @bio: new bio being queued * @nr_segs: number of segments in @bio - * @same_queue_rq: pointer to &struct request that gets filled in when - * another request associated with @q is found on the plug list - * (optional, may be %NULL) + * @same_queue_rq: output value, will be true if there's an existing request + * from the passed in @q already in the plug list * * Determine whether @bio being queued on @q can be merged with the previous * request on %current's plugged list. Returns %true if merge was successful, @@ -1085,23 +1084,23 @@ static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q, * Caller must ensure !blk_queue_nomerges(q) beforehand. */ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, - unsigned int nr_segs, struct request **same_queue_rq) + unsigned int nr_segs, bool *same_queue_rq) { struct blk_plug *plug; struct request *rq; plug = blk_mq_plug(q, bio); - if (!plug || list_empty(&plug->mq_list)) + if (!plug || rq_list_empty(plug->mq_list)) return false; /* check the previously added entry for a quick merge attempt */ - rq = list_last_entry(&plug->mq_list, struct request, queuelist); - if (rq->q == q && same_queue_rq) { + rq = rq_list_peek(&plug->mq_list); + if (rq->q == q) { /* * Only blk-mq multiple hardware queues case checks the rq in * the same queue, there should be only one such rq in a queue */ - *same_queue_rq = rq; + *same_queue_rq = true; } if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == BIO_MERGE_OK) return true; diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 68ca5d21cda7..0f8c60e9c719 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -550,7 +550,7 @@ static int hctx_active_show(void *data, struct seq_file *m) { struct blk_mq_hw_ctx *hctx = data; - seq_printf(m, "%d\n", atomic_read(&hctx->nr_active)); + seq_printf(m, "%d\n", __blk_mq_active_requests(hctx)); return 0; } diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index e85b7556b096..c62b966dfaba 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -361,7 +361,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) } } -bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio, +bool blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio, unsigned int nr_segs) { struct elevator_queue *e = q->elevator; @@ -541,7 +541,7 @@ static void blk_mq_sched_tags_teardown(struct request_queue *q, unsigned int fla queue_for_each_hw_ctx(q, hctx, i) { if (hctx->sched_tags) { - if (!blk_mq_is_shared_tags(q->tag_set->flags)) + if (!blk_mq_is_shared_tags(flags)) blk_mq_free_rq_map(hctx->sched_tags); hctx->sched_tags = NULL; } diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index 98836106b25f..25d1034952b6 100644 --- a/block/blk-mq-sched.h +++ b/block/blk-mq-sched.h @@ -12,7 +12,7 @@ void blk_mq_sched_assign_ioc(struct request *rq); bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, unsigned int nr_segs, struct request **merged_request); -bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio, +bool blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio, unsigned int nr_segs); bool blk_mq_sched_try_insert_merge(struct request_queue *q, struct request *rq, struct list_head *free); @@ -43,16 +43,6 @@ static inline bool bio_mergeable(struct bio *bio) } static inline bool -blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio, - unsigned int nr_segs) -{ - if (blk_queue_nomerges(q) || !bio_mergeable(bio)) - return false; - - return __blk_mq_sched_bio_merge(q, bio, nr_segs); -} - -static inline bool blk_mq_sched_allow_merge(struct request_queue *q, struct request *rq, struct bio *bio) { diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index b94c3e8ef392..995336abee33 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -399,9 +399,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn, void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, busy_tag_iter_fn *fn, void *priv) { - int i; + unsigned int flags = tagset->flags; + int i, nr_tags; + + nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues; - for (i = 0; i < tagset->nr_hw_queues; i++) { + for (i = 0; i < nr_tags; i++) { if (tagset->tags && tagset->tags[i]) __blk_mq_all_tag_iter(tagset->tags[i], fn, priv, BT_TAG_ITER_STARTED); diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index 78ae2fb8e2a4..df787b5a23bd 100644 --- a/block/blk-mq-tag.h +++ b/block/blk-mq-tag.h @@ -4,29 +4,6 @@ struct blk_mq_alloc_data; -/* - * Tag address space map. - */ -struct blk_mq_tags { - unsigned int nr_tags; - unsigned int nr_reserved_tags; - - atomic_t active_queues; - - struct sbitmap_queue bitmap_tags; - struct sbitmap_queue breserved_tags; - - struct request **rqs; - struct request **static_rqs; - struct list_head page_list; - - /* - * used to clear request reference in rqs[] before freeing one - * request pool - */ - spinlock_t lock; -}; - extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy); diff --git a/block/blk-mq.c b/block/blk-mq.c index 9248edd8a7d3..07eb1412760b 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -19,7 +19,6 @@ #include <linux/smp.h> #include <linux/interrupt.h> #include <linux/llist.h> -#include <linux/list_sort.h> #include <linux/cpu.h> #include <linux/cache.h> #include <linux/sched/sysctl.h> @@ -242,7 +241,12 @@ EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue); */ void blk_mq_quiesce_queue_nowait(struct request_queue *q) { - blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q); + unsigned long flags; + + spin_lock_irqsave(&q->queue_lock, flags); + if (!q->quiesce_depth++) + blk_queue_flag_set(QUEUE_FLAG_QUIESCED, q); + spin_unlock_irqrestore(&q->queue_lock, flags); } EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait); @@ -283,10 +287,21 @@ EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue); */ void blk_mq_unquiesce_queue(struct request_queue *q) { - blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q); + unsigned long flags; + bool run_queue = false; + + spin_lock_irqsave(&q->queue_lock, flags); + if (WARN_ON_ONCE(q->quiesce_depth <= 0)) { + ; + } else if (!--q->quiesce_depth) { + blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q); + run_queue = true; + } + spin_unlock_irqrestore(&q->queue_lock, flags); /* dispatch requests which are inserted during quiescing */ - blk_mq_run_hw_queues(q, true); + if (run_queue) + blk_mq_run_hw_queues(q, true); } EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue); @@ -301,40 +316,37 @@ void blk_mq_wake_waiters(struct request_queue *q) } static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, - unsigned int tag, u64 alloc_time_ns) + struct blk_mq_tags *tags, unsigned int tag, u64 alloc_time_ns) { struct blk_mq_ctx *ctx = data->ctx; struct blk_mq_hw_ctx *hctx = data->hctx; struct request_queue *q = data->q; - struct elevator_queue *e = q->elevator; - struct blk_mq_tags *tags = blk_mq_tags_from_data(data); struct request *rq = tags->static_rqs[tag]; - unsigned int rq_flags = 0; - if (e) { - rq_flags = RQF_ELV; - rq->tag = BLK_MQ_NO_TAG; - rq->internal_tag = tag; - } else { - rq->tag = tag; - rq->internal_tag = BLK_MQ_NO_TAG; - } + rq->q = q; + rq->mq_ctx = ctx; + rq->mq_hctx = hctx; + rq->cmd_flags = data->cmd_flags; if (data->flags & BLK_MQ_REQ_PM) - rq_flags |= RQF_PM; + data->rq_flags |= RQF_PM; if (blk_queue_io_stat(q)) - rq_flags |= RQF_IO_STAT; - rq->rq_flags = rq_flags; + data->rq_flags |= RQF_IO_STAT; + rq->rq_flags = data->rq_flags; + + if (!(data->rq_flags & RQF_ELV)) { + rq->tag = tag; + rq->internal_tag = BLK_MQ_NO_TAG; + } else { + rq->tag = BLK_MQ_NO_TAG; + rq->internal_tag = tag; + } + rq->timeout = 0; if (blk_mq_need_time_stamp(rq)) rq->start_time_ns = ktime_get_ns(); else rq->start_time_ns = 0; - /* csd/requeue_work/fifo_time is initialized before use */ - rq->q = q; - rq->mq_ctx = ctx; - rq->mq_hctx = hctx; - rq->cmd_flags = data->cmd_flags; rq->rq_disk = NULL; rq->part = NULL; #ifdef CONFIG_BLK_RQ_ALLOC_TIME @@ -346,7 +358,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, #if defined(CONFIG_BLK_DEV_INTEGRITY) rq->nr_integrity_segments = 0; #endif - rq->timeout = 0; rq->end_io = NULL; rq->end_io_data = NULL; @@ -381,20 +392,23 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data, u64 alloc_time_ns) { unsigned int tag, tag_offset; + struct blk_mq_tags *tags; struct request *rq; - unsigned long tags; + unsigned long tag_mask; int i, nr = 0; - tags = blk_mq_get_tags(data, data->nr_tags, &tag_offset); - if (unlikely(!tags)) + tag_mask = blk_mq_get_tags(data, data->nr_tags, &tag_offset); + if (unlikely(!tag_mask)) return NULL; - for (i = 0; tags; i++) { - if (!(tags & (1UL << i))) + tags = blk_mq_tags_from_data(data); + for (i = 0; tag_mask; i++) { + if (!(tag_mask & (1UL << i))) continue; + prefetch(tags->static_rqs[tag]); tag = tag_offset + i; - tags &= ~(1UL << i); - rq = blk_mq_rq_ctx_init(data, tag, alloc_time_ns); + tag_mask &= ~(1UL << i); + rq = blk_mq_rq_ctx_init(data, tags, tag, alloc_time_ns); rq_list_add(data->cached_rq, rq); } data->nr_tags -= nr; @@ -465,7 +479,8 @@ retry: goto retry; } - return blk_mq_rq_ctx_init(data, tag, alloc_time_ns); + return blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag, + alloc_time_ns); } struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, @@ -475,6 +490,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, .q = q, .flags = flags, .cmd_flags = op, + .rq_flags = q->elevator ? RQF_ELV : 0, .nr_tags = 1, }; struct request *rq; @@ -504,6 +520,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, .q = q, .flags = flags, .cmd_flags = op, + .rq_flags = q->elevator ? RQF_ELV : 0, .nr_tags = 1, }; u64 alloc_time_ns = 0; @@ -549,7 +566,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, tag = blk_mq_get_tag(&data); if (tag == BLK_MQ_NO_TAG) goto out_queue_exit; - return blk_mq_rq_ctx_init(&data, tag, alloc_time_ns); + return blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag, + alloc_time_ns); out_queue_exit: blk_queue_exit(q); @@ -580,7 +598,7 @@ void blk_mq_free_request(struct request *rq) struct request_queue *q = rq->q; struct blk_mq_hw_ctx *hctx = rq->mq_hctx; - if (rq->rq_flags & (RQF_ELVPRIV | RQF_ELV)) { + if (rq->rq_flags & RQF_ELVPRIV) { struct elevator_queue *e = q->elevator; if (e->type->ops.finish_request) @@ -618,25 +636,23 @@ void blk_mq_free_plug_rqs(struct blk_plug *plug) static void req_bio_endio(struct request *rq, struct bio *bio, unsigned int nbytes, blk_status_t error) { - if (error) + if (unlikely(error)) { bio->bi_status = error; - - if (unlikely(rq->rq_flags & RQF_QUIET)) - bio_set_flag(bio, BIO_QUIET); - - bio_advance(bio, nbytes); - - if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) { + } else if (req_op(rq) == REQ_OP_ZONE_APPEND) { /* * Partial zone append completions cannot be supported as the * BIO fragments may end up not being written sequentially. */ - if (bio->bi_iter.bi_size) + if (bio->bi_iter.bi_size != nbytes) bio->bi_status = BLK_STS_IOERR; else bio->bi_iter.bi_sector = rq->__sector; } + bio_advance(bio, nbytes); + + if (unlikely(rq->rq_flags & RQF_QUIET)) + bio_set_flag(bio, BIO_QUIET); /* don't actually finish bio if it's part of flush sequence */ if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ)) bio_endio(bio); @@ -680,7 +696,7 @@ bool blk_update_request(struct request *req, blk_status_t error, { int total_bytes; - trace_block_rq_complete(req, blk_status_to_errno(error), nr_bytes); + trace_block_rq_complete(req, error, nr_bytes); if (!req->bio) return false; @@ -806,7 +822,7 @@ static inline void blk_mq_flush_tag_batch(struct blk_mq_hw_ctx *hctx, void blk_mq_end_request_batch(struct io_comp_batch *iob) { int tags[TAG_COMP_BATCH], nr_tags = 0; - struct blk_mq_hw_ctx *last_hctx = NULL; + struct blk_mq_hw_ctx *cur_hctx = NULL; struct request *rq; u64 now = 0; @@ -829,17 +845,17 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob) blk_pm_mark_last_busy(rq); rq_qos_done(rq->q, rq); - if (nr_tags == TAG_COMP_BATCH || - (last_hctx && last_hctx != rq->mq_hctx)) { - blk_mq_flush_tag_batch(last_hctx, tags, nr_tags); + if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) { + if (cur_hctx) + blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags); nr_tags = 0; + cur_hctx = rq->mq_hctx; } tags[nr_tags++] = rq->tag; - last_hctx = rq->mq_hctx; } if (nr_tags) - blk_mq_flush_tag_batch(last_hctx, tags, nr_tags); + blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags); } EXPORT_SYMBOL_GPL(blk_mq_end_request_batch); @@ -1040,7 +1056,6 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list) /* this request will be re-inserted to io scheduler queue */ blk_mq_sched_requeue_request(rq); - BUG_ON(!list_empty(&rq->queuelist)); blk_mq_add_to_requeue_list(rq, true, kick_requeue_list); } EXPORT_SYMBOL(blk_mq_requeue_request); @@ -1121,17 +1136,6 @@ void blk_mq_delay_kick_requeue_list(struct request_queue *q, } EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list); -struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag) -{ - if (tag < tags->nr_tags) { - prefetch(tags->rqs[tag]); - return tags->rqs[tag]; - } - - return NULL; -} -EXPORT_SYMBOL(blk_mq_tag_to_rq); - static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { @@ -1336,7 +1340,7 @@ struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx, return data.rq; } -static bool __blk_mq_get_driver_tag(struct request *rq) +static bool __blk_mq_alloc_driver_tag(struct request *rq) { struct sbitmap_queue *bt = &rq->mq_hctx->tags->bitmap_tags; unsigned int tag_offset = rq->mq_hctx->tags->nr_reserved_tags; @@ -1360,11 +1364,9 @@ static bool __blk_mq_get_driver_tag(struct request *rq) return true; } -bool blk_mq_get_driver_tag(struct request *rq) +bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq) { - struct blk_mq_hw_ctx *hctx = rq->mq_hctx; - - if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_get_driver_tag(rq)) + if (rq->tag == BLK_MQ_NO_TAG && !__blk_mq_alloc_driver_tag(rq)) return false; if ((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) && @@ -1594,6 +1596,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list, int errors, queued; blk_status_t ret = BLK_STS_OK; LIST_HEAD(zone_list); + bool needs_resource = false; if (list_empty(list)) return false; @@ -1639,6 +1642,8 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list, queued++; break; case BLK_STS_RESOURCE: + needs_resource = true; + fallthrough; case BLK_STS_DEV_RESOURCE: blk_mq_handle_dev_resource(rq, list); goto out; @@ -1649,6 +1654,7 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list, * accept. */ blk_mq_handle_zone_resource(rq, &zone_list); + needs_resource = true; break; default: errors++; @@ -1673,7 +1679,6 @@ out: /* For non-shared tags, the RESTART check will suffice */ bool no_tag = prep == PREP_DISPATCH_NO_TAG && (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED); - bool no_budget_avail = prep == PREP_DISPATCH_NO_BUDGET; if (nr_budgets) blk_mq_release_budgets(q, list); @@ -1714,14 +1719,16 @@ out: * If driver returns BLK_STS_RESOURCE and SCHED_RESTART * bit is set, run queue after a delay to avoid IO stalls * that could otherwise occur if the queue is idle. We'll do - * similar if we couldn't get budget and SCHED_RESTART is set. + * similar if we couldn't get budget or couldn't lock a zone + * and SCHED_RESTART is set. */ needs_restart = blk_mq_sched_needs_restart(hctx); + if (prep == PREP_DISPATCH_NO_BUDGET) + needs_resource = true; if (!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) blk_mq_run_hw_queue(hctx, true); - else if (needs_restart && (ret == BLK_STS_RESOURCE || - no_budget_avail)) + else if (needs_restart && needs_resource) blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY); blk_mq_update_dispatch_busy(hctx, true); @@ -2161,54 +2168,106 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, spin_unlock(&ctx->lock); } -static int plug_rq_cmp(void *priv, const struct list_head *a, - const struct list_head *b) +static void blk_mq_commit_rqs(struct blk_mq_hw_ctx *hctx, int *queued, + bool from_schedule) { - struct request *rqa = container_of(a, struct request, queuelist); - struct request *rqb = container_of(b, struct request, queuelist); + if (hctx->queue->mq_ops->commit_rqs) { + trace_block_unplug(hctx->queue, *queued, !from_schedule); + hctx->queue->mq_ops->commit_rqs(hctx); + } + *queued = 0; +} - if (rqa->mq_ctx != rqb->mq_ctx) - return rqa->mq_ctx > rqb->mq_ctx; - if (rqa->mq_hctx != rqb->mq_hctx) - return rqa->mq_hctx > rqb->mq_hctx; +static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule) +{ + struct blk_mq_hw_ctx *hctx = NULL; + struct request *rq; + int queued = 0; + int errors = 0; + + while ((rq = rq_list_pop(&plug->mq_list))) { + bool last = rq_list_empty(plug->mq_list); + blk_status_t ret; - return blk_rq_pos(rqa) > blk_rq_pos(rqb); + if (hctx != rq->mq_hctx) { + if (hctx) + blk_mq_commit_rqs(hctx, &queued, from_schedule); + hctx = rq->mq_hctx; + } + + ret = blk_mq_request_issue_directly(rq, last); + switch (ret) { + case BLK_STS_OK: + queued++; + break; + case BLK_STS_RESOURCE: + case BLK_STS_DEV_RESOURCE: + blk_mq_request_bypass_insert(rq, false, last); + blk_mq_commit_rqs(hctx, &queued, from_schedule); + return; + default: + blk_mq_end_request(rq, ret); + errors++; + break; + } + } + + /* + * If we didn't flush the entire list, we could have told the driver + * there was more coming, but that turned out to be a lie. + */ + if (errors) + blk_mq_commit_rqs(hctx, &queued, from_schedule); } void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) { + struct blk_mq_hw_ctx *this_hctx; + struct blk_mq_ctx *this_ctx; + unsigned int depth; LIST_HEAD(list); - if (list_empty(&plug->mq_list)) + if (rq_list_empty(plug->mq_list)) return; - list_splice_init(&plug->mq_list, &list); - - if (plug->rq_count > 2 && plug->multiple_queues) - list_sort(NULL, &list, plug_rq_cmp); - plug->rq_count = 0; + if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) { + blk_mq_plug_issue_direct(plug, from_schedule); + if (rq_list_empty(plug->mq_list)) + return; + } + + this_hctx = NULL; + this_ctx = NULL; + depth = 0; do { - struct list_head rq_list; - struct request *rq, *head_rq = list_entry_rq(list.next); - struct list_head *pos = &head_rq->queuelist; /* skip first */ - struct blk_mq_hw_ctx *this_hctx = head_rq->mq_hctx; - struct blk_mq_ctx *this_ctx = head_rq->mq_ctx; - unsigned int depth = 1; - - list_for_each_continue(pos, &list) { - rq = list_entry_rq(pos); - BUG_ON(!rq->q); - if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) - break; - depth++; + struct request *rq; + + rq = rq_list_pop(&plug->mq_list); + + if (!this_hctx) { + this_hctx = rq->mq_hctx; + this_ctx = rq->mq_ctx; + } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) { + trace_block_unplug(this_hctx->queue, depth, + !from_schedule); + blk_mq_sched_insert_requests(this_hctx, this_ctx, + &list, from_schedule); + depth = 0; + this_hctx = rq->mq_hctx; + this_ctx = rq->mq_ctx; + } - list_cut_before(&rq_list, &list, pos); - trace_block_unplug(head_rq->q, depth, !from_schedule); - blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list, + list_add(&rq->queuelist, &list); + depth++; + } while (!rq_list_empty(plug->mq_list)); + + if (!list_empty(&list)) { + trace_block_unplug(this_hctx->queue, depth, !from_schedule); + blk_mq_sched_insert_requests(this_hctx, this_ctx, &list, from_schedule); - } while(!list_empty(&list)); + } } static void blk_mq_bio_to_request(struct request *rq, struct bio *bio, @@ -2388,16 +2447,17 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx, static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq) { - list_add_tail(&rq->queuelist, &plug->mq_list); - plug->rq_count++; - if (!plug->multiple_queues && !list_is_singular(&plug->mq_list)) { - struct request *tmp; + if (!plug->multiple_queues) { + struct request *nxt = rq_list_peek(&plug->mq_list); - tmp = list_first_entry(&plug->mq_list, struct request, - queuelist); - if (tmp->q != rq->q) + if (nxt && nxt->q != rq->q) plug->multiple_queues = true; } + if (!plug->has_elevator && (rq->rq_flags & RQF_ELV)) + plug->has_elevator = true; + rq->rq_next = NULL; + rq_list_add(&plug->mq_list, rq); + plug->rq_count++; } /* @@ -2429,10 +2489,9 @@ void blk_mq_submit_bio(struct bio *bio) { struct request_queue *q = bdev_get_queue(bio->bi_bdev); const int is_sync = op_is_sync(bio->bi_opf); - const int is_flush_fua = op_is_flush(bio->bi_opf); struct request *rq; struct blk_plug *plug; - struct request *same_queue_rq = NULL; + bool same_queue_rq = false; unsigned int nr_segs = 1; blk_status_t ret; @@ -2443,12 +2502,12 @@ void blk_mq_submit_bio(struct bio *bio) if (!bio_integrity_prep(bio)) goto queue_exit; - if (!is_flush_fua && !blk_queue_nomerges(q) && - blk_attempt_plug_merge(q, bio, nr_segs, &same_queue_rq)) - goto queue_exit; - - if (blk_mq_sched_bio_merge(q, bio, nr_segs)) - goto queue_exit; + if (!blk_queue_nomerges(q) && bio_mergeable(bio)) { + if (blk_attempt_plug_merge(q, bio, nr_segs, &same_queue_rq)) + goto queue_exit; + if (blk_mq_sched_bio_merge(q, bio, nr_segs)) + goto queue_exit; + } rq_qos_throttle(q, bio); @@ -2461,6 +2520,7 @@ void blk_mq_submit_bio(struct bio *bio) .q = q, .nr_tags = 1, .cmd_flags = bio->bi_opf, + .rq_flags = q->elevator ? RQF_ELV : 0, }; if (plug) { @@ -2491,14 +2551,12 @@ void blk_mq_submit_bio(struct bio *bio) return; } - if (unlikely(is_flush_fua)) { - struct blk_mq_hw_ctx *hctx = rq->mq_hctx; - /* Bypass scheduler for flush requests */ - blk_insert_flush(rq); - blk_mq_run_hw_queue(hctx, true); - } else if (plug && (q->nr_hw_queues == 1 || - blk_mq_is_shared_tags(rq->mq_hctx->flags) || - q->mq_ops->commit_rqs || !blk_queue_nonrot(q))) { + if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq)) + return; + + if (plug && (q->nr_hw_queues == 1 || + blk_mq_is_shared_tags(rq->mq_hctx->flags) || + q->mq_ops->commit_rqs || !blk_queue_nonrot(q))) { /* * Use plugging if we have a ->commit_rqs() hook as well, as * we know the driver uses bd->last in a smart fashion. @@ -2509,14 +2567,16 @@ void blk_mq_submit_bio(struct bio *bio) unsigned int request_count = plug->rq_count; struct request *last = NULL; - if (!request_count) + if (!request_count) { trace_block_plug(q); - else - last = list_entry_rq(plug->mq_list.prev); + } else if (!blk_queue_nomerges(q)) { + last = rq_list_peek(&plug->mq_list); + if (blk_rq_bytes(last) < BLK_PLUG_FLUSH_SIZE) + last = NULL; + } - if (request_count >= blk_plug_max_rq_count(plug) || (last && - blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) { - blk_flush_plug_list(plug, false); + if (request_count >= blk_plug_max_rq_count(plug) || last) { + blk_mq_flush_plug_list(plug, false); trace_block_plug(q); } @@ -2525,6 +2585,8 @@ void blk_mq_submit_bio(struct bio *bio) /* Insert the request at the IO scheduler queue */ blk_mq_sched_insert_request(rq, false, true, true); } else if (plug && !blk_queue_nomerges(q)) { + struct request *next_rq = NULL; + /* * We do limited plugging. If the bio can be merged, do that. * Otherwise the existing request in the plug list will be @@ -2532,19 +2594,16 @@ void blk_mq_submit_bio(struct bio *bio) * The plug list might get flushed before this. If that happens, * the plug list is empty, and same_queue_rq is invalid. */ - if (list_empty(&plug->mq_list)) - same_queue_rq = NULL; if (same_queue_rq) { - list_del_init(&same_queue_rq->queuelist); + next_rq = rq_list_pop(&plug->mq_list); plug->rq_count--; } blk_add_rq_to_plug(plug, rq); trace_block_plug(q); - if (same_queue_rq) { + if (next_rq) { trace_block_unplug(q, 1, true); - blk_mq_try_issue_directly(same_queue_rq->mq_hctx, - same_queue_rq); + blk_mq_try_issue_directly(next_rq->mq_hctx, next_rq); } } else if ((q->nr_hw_queues > 1 && is_sync) || !rq->mq_hctx->dispatch_busy) { diff --git a/block/blk-mq.h b/block/blk-mq.h index ebf67f4d4f2e..28859fc5faee 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -122,6 +122,7 @@ extern int blk_mq_sysfs_register(struct request_queue *q); extern void blk_mq_sysfs_unregister(struct request_queue *q); extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx); void blk_mq_free_plug_rqs(struct blk_plug *plug); +void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule); void blk_mq_release(struct request_queue *q); @@ -148,6 +149,7 @@ struct blk_mq_alloc_data { blk_mq_req_flags_t flags; unsigned int shallow_depth; unsigned int cmd_flags; + unsigned int rq_flags; /* allocate multiple requests/tags in one go */ unsigned int nr_tags; @@ -165,10 +167,9 @@ static inline bool blk_mq_is_shared_tags(unsigned int flags) static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data) { - if (data->q->elevator) - return data->hctx->sched_tags; - - return data->hctx->tags; + if (!(data->rq_flags & RQF_ELV)) + return data->hctx->tags; + return data->hctx->sched_tags; } static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx) @@ -258,7 +259,20 @@ static inline void blk_mq_put_driver_tag(struct request *rq) __blk_mq_put_driver_tag(rq->mq_hctx, rq); } -bool blk_mq_get_driver_tag(struct request *rq); +bool __blk_mq_get_driver_tag(struct blk_mq_hw_ctx *hctx, struct request *rq); + +static inline bool blk_mq_get_driver_tag(struct request *rq) +{ + struct blk_mq_hw_ctx *hctx = rq->mq_hctx; + + if (rq->tag != BLK_MQ_NO_TAG && + !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) { + hctx->tags->rqs[rq->tag] = rq; + return true; + } + + return __blk_mq_get_driver_tag(hctx, rq); +} static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap) { diff --git a/block/blk-settings.c b/block/blk-settings.c index a7c857ad7d10..b880c70e22e4 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -842,6 +842,24 @@ bool blk_queue_can_use_dma_map_merging(struct request_queue *q, } EXPORT_SYMBOL_GPL(blk_queue_can_use_dma_map_merging); +static bool disk_has_partitions(struct gendisk *disk) +{ + unsigned long idx; + struct block_device *part; + bool ret = false; + + rcu_read_lock(); + xa_for_each(&disk->part_tbl, idx, part) { + if (bdev_is_partition(part)) { + ret = true; + break; + } + } + rcu_read_unlock(); + + return ret; +} + /** * blk_queue_set_zoned - configure a disk queue zoned model. * @disk: the gendisk of the queue to configure @@ -876,7 +894,7 @@ void blk_queue_set_zoned(struct gendisk *disk, enum blk_zoned_model model) * we do nothing special as far as the block layer is concerned. */ if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) || - !xa_empty(&disk->part_tbl)) + disk_has_partitions(disk)) model = BLK_ZONED_NONE; break; case BLK_ZONED_NONE: diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 36f14d658e81..cef1f713370b 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -873,16 +873,15 @@ int blk_register_queue(struct gendisk *disk) } mutex_lock(&q->sysfs_lock); + + ret = disk_register_independent_access_ranges(disk, NULL); + if (ret) + goto put_dev; + if (q->elevator) { ret = elv_register_queue(q, false); - if (ret) { - mutex_unlock(&q->sysfs_lock); - mutex_unlock(&q->sysfs_dir_lock); - kobject_del(&q->kobj); - blk_trace_remove_sysfs(dev); - kobject_put(&dev->kobj); - return ret; - } + if (ret) + goto put_dev; } blk_queue_flag_set(QUEUE_FLAG_REGISTERED, q); @@ -914,6 +913,16 @@ unlock: } return ret; + +put_dev: + disk_unregister_independent_access_ranges(disk); + mutex_unlock(&q->sysfs_lock); + mutex_unlock(&q->sysfs_dir_lock); + kobject_del(&q->kobj); + blk_trace_remove_sysfs(dev); + kobject_put(&dev->kobj); + + return ret; } /** @@ -958,6 +967,7 @@ void blk_unregister_queue(struct gendisk *disk) mutex_lock(&q->sysfs_lock); if (q->elevator) elv_unregister_queue(q); + disk_unregister_independent_access_ranges(disk); mutex_unlock(&q->sysfs_lock); mutex_unlock(&q->sysfs_dir_lock); diff --git a/block/blk-wbt.c b/block/blk-wbt.c index 874c1c37bf0c..0c119be0e813 100644 --- a/block/blk-wbt.c +++ b/block/blk-wbt.c @@ -357,6 +357,9 @@ static void wb_timer_fn(struct blk_stat_callback *cb) unsigned int inflight = wbt_inflight(rwb); int status; + if (!rwb->rqos.q->disk) + return; + status = latency_exceeded(rwb, cb->stat); trace_wbt_timer(rwb->rqos.q->disk->bdi, status, rqd->scale_step, diff --git a/block/blk.h b/block/blk.h index e80350327e6d..7afffd548daf 100644 --- a/block/blk.h +++ b/block/blk.h @@ -218,7 +218,7 @@ void blk_add_timer(struct request *req); void blk_print_req_error(struct request *req, blk_status_t status); bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio, - unsigned int nr_segs, struct request **same_queue_rq); + unsigned int nr_segs, bool *same_queue_rq); bool blk_bio_list_merge(struct request_queue *q, struct list_head *list, struct bio *bio, unsigned int nr_segs); @@ -236,7 +236,7 @@ void __blk_account_io_done(struct request *req, u64 now); */ #define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED) -void blk_insert_flush(struct request *rq); +bool blk_insert_flush(struct request *rq); int elevator_switch_mq(struct request_queue *q, struct elevator_type *new_e); @@ -454,4 +454,8 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg); extern const struct address_space_operations def_blk_aops; +int disk_register_independent_access_ranges(struct gendisk *disk, + struct blk_independent_access_ranges *new_iars); +void disk_unregister_independent_access_ranges(struct gendisk *disk); + #endif /* BLK_INTERNAL_H */ diff --git a/block/fops.c b/block/fops.c index 2ae8a7bd2b84..3777c7b76eae 100644 --- a/block/fops.c +++ b/block/fops.c @@ -76,7 +76,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, bio_init(&bio, vecs, nr_pages); bio_set_dev(&bio, bdev); - bio.bi_iter.bi_sector = pos >> 9; + bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT; bio.bi_write_hint = iocb->ki_hint; bio.bi_private = current; bio.bi_end_io = blkdev_bio_end_io_simple; @@ -124,9 +124,8 @@ out: } enum { - DIO_MULTI_BIO = 1, - DIO_SHOULD_DIRTY = 2, - DIO_IS_SYNC = 4, + DIO_SHOULD_DIRTY = 1, + DIO_IS_SYNC = 2, }; struct blkdev_dio { @@ -137,7 +136,7 @@ struct blkdev_dio { size_t size; atomic_t ref; unsigned int flags; - struct bio bio; + struct bio bio ____cacheline_aligned_in_smp; }; static struct bio_set blkdev_dio_pool; @@ -150,7 +149,7 @@ static void blkdev_bio_end_io(struct bio *bio) if (bio->bi_status && !dio->bio.bi_status) dio->bio.bi_status = bio->bi_status; - if (!(dio->flags & DIO_MULTI_BIO) || atomic_dec_and_test(&dio->ref)) { + if (atomic_dec_and_test(&dio->ref)) { if (!(dio->flags & DIO_IS_SYNC)) { struct kiocb *iocb = dio->iocb; ssize_t ret; @@ -165,8 +164,7 @@ static void blkdev_bio_end_io(struct bio *bio) } dio->iocb->ki_complete(iocb, ret, 0); - if (dio->flags & DIO_MULTI_BIO) - bio_put(&dio->bio); + bio_put(&dio->bio); } else { struct task_struct *waiter = dio->waiter; @@ -190,7 +188,6 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, struct blk_plug plug; struct blkdev_dio *dio; struct bio *bio; - bool do_poll = (iocb->ki_flags & IOCB_HIPRI); bool is_read = (iov_iter_rw(iter) == READ), is_sync; loff_t pos = iocb->ki_pos; int ret = 0; @@ -202,11 +199,17 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, bio = bio_alloc_kiocb(iocb, nr_pages, &blkdev_dio_pool); dio = container_of(bio, struct blkdev_dio, bio); + atomic_set(&dio->ref, 1); + /* + * Grab an extra reference to ensure the dio structure which is embedded + * into the first bio stays around. + */ + bio_get(bio); + is_sync = is_sync_kiocb(iocb); if (is_sync) { dio->flags = DIO_IS_SYNC; dio->waiter = current; - bio_get(bio); } else { dio->flags = 0; dio->iocb = iocb; @@ -216,16 +219,11 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, if (is_read && iter_is_iovec(iter)) dio->flags |= DIO_SHOULD_DIRTY; - /* - * Don't plug for HIPRI/polled IO, as those should go straight - * to issue - */ - if (!(iocb->ki_flags & IOCB_HIPRI)) - blk_start_plug(&plug); + blk_start_plug(&plug); for (;;) { bio_set_dev(bio, bdev); - bio->bi_iter.bi_sector = pos >> 9; + bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT; bio->bi_write_hint = iocb->ki_hint; bio->bi_private = dio; bio->bi_end_io = blkdev_bio_end_io; @@ -254,34 +252,15 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS); if (!nr_pages) { - if (do_poll) - bio_set_polled(bio, iocb); submit_bio(bio); - if (do_poll) - WRITE_ONCE(iocb->private, bio); break; } - if (!(dio->flags & DIO_MULTI_BIO)) { - /* - * AIO needs an extra reference to ensure the dio - * structure which is embedded into the first bio - * stays around. - */ - if (!is_sync) - bio_get(bio); - dio->flags |= DIO_MULTI_BIO; - atomic_set(&dio->ref, 2); - do_poll = false; - } else { - atomic_inc(&dio->ref); - } - + atomic_inc(&dio->ref); submit_bio(bio); bio = bio_alloc(GFP_KERNEL, nr_pages); } - if (!(iocb->ki_flags & IOCB_HIPRI)) - blk_finish_plug(&plug); + blk_finish_plug(&plug); if (!is_sync) return -EIOCBQUEUED; @@ -290,9 +269,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, set_current_state(TASK_UNINTERRUPTIBLE); if (!READ_ONCE(dio->waiter)) break; - - if (!do_poll || !bio_poll(bio, NULL, 0)) - blk_io_schedule(); + blk_io_schedule(); } __set_current_state(TASK_RUNNING); @@ -305,6 +282,94 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, return ret; } +static void blkdev_bio_end_io_async(struct bio *bio) +{ + struct blkdev_dio *dio = container_of(bio, struct blkdev_dio, bio); + struct kiocb *iocb = dio->iocb; + ssize_t ret; + + if (likely(!bio->bi_status)) { + ret = dio->size; + iocb->ki_pos += ret; + } else { + ret = blk_status_to_errno(bio->bi_status); + } + + iocb->ki_complete(iocb, ret, 0); + + if (dio->flags & DIO_SHOULD_DIRTY) { + bio_check_pages_dirty(bio); + } else { + bio_release_pages(bio, false); + bio_put(bio); + } +} + +static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, + struct iov_iter *iter, + unsigned int nr_pages) +{ + struct block_device *bdev = iocb->ki_filp->private_data; + struct blkdev_dio *dio; + struct bio *bio; + loff_t pos = iocb->ki_pos; + int ret = 0; + + if ((pos | iov_iter_alignment(iter)) & + (bdev_logical_block_size(bdev) - 1)) + return -EINVAL; + + bio = bio_alloc_kiocb(iocb, nr_pages, &blkdev_dio_pool); + dio = container_of(bio, struct blkdev_dio, bio); + dio->flags = 0; + dio->iocb = iocb; + bio_set_dev(bio, bdev); + bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT; + bio->bi_write_hint = iocb->ki_hint; + bio->bi_end_io = blkdev_bio_end_io_async; + bio->bi_ioprio = iocb->ki_ioprio; + + if (iov_iter_is_bvec(iter)) { + /* + * Users don't rely on the iterator being in any particular + * state for async I/O returning -EIOCBQUEUED, hence we can + * avoid expensive iov_iter_advance(). Bypass + * bio_iov_iter_get_pages() and set the bvec directly. + */ + bio_iov_bvec_set(bio, iter); + } else { + ret = bio_iov_iter_get_pages(bio, iter); + if (unlikely(ret)) { + bio->bi_status = BLK_STS_IOERR; + bio_endio(bio); + return ret; + } + } + dio->size = bio->bi_iter.bi_size; + + if (iov_iter_rw(iter) == READ) { + bio->bi_opf = REQ_OP_READ; + if (iter_is_iovec(iter)) { + dio->flags |= DIO_SHOULD_DIRTY; + bio_set_pages_dirty(bio); + } + } else { + bio->bi_opf = dio_bio_write_op(iocb); + task_io_account_write(bio->bi_iter.bi_size); + } + + if (iocb->ki_flags & IOCB_HIPRI) { + bio->bi_opf |= REQ_POLLED | REQ_NOWAIT; + submit_bio(bio); + WRITE_ONCE(iocb->private, bio); + } else { + if (iocb->ki_flags & IOCB_NOWAIT) + bio->bi_opf |= REQ_NOWAIT; + submit_bio(bio); + } + return -EIOCBQUEUED; +} + static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) { unsigned int nr_pages; @@ -313,9 +378,11 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter) return 0; nr_pages = bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1); - if (is_sync_kiocb(iocb) && nr_pages <= BIO_MAX_VECS) - return __blkdev_direct_IO_simple(iocb, iter, nr_pages); - + if (likely(nr_pages <= BIO_MAX_VECS)) { + if (is_sync_kiocb(iocb)) + return __blkdev_direct_IO_simple(iocb, iter, nr_pages); + return __blkdev_direct_IO_async(iocb, iter, nr_pages); + } return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages)); } @@ -503,17 +570,20 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to) size_t shorted = 0; ssize_t ret; - if (pos >= size) - return 0; - - size -= pos; - if (iov_iter_count(to) > size) { - shorted = iov_iter_count(to) - size; - iov_iter_truncate(to, size); + if (unlikely(pos + iov_iter_count(to) > size)) { + if (pos >= size) + return 0; + size -= pos; + if (iov_iter_count(to) > size) { + shorted = iov_iter_count(to) - size; + iov_iter_truncate(to, size); + } } ret = generic_file_read_iter(iocb, to); - iov_iter_reexpand(to, iov_iter_count(to) + shorted); + + if (unlikely(shorted)) + iov_iter_reexpand(to, iov_iter_count(to) + shorted); return ret; } @@ -562,16 +632,18 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start, switch (mode) { case FALLOC_FL_ZERO_RANGE: case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE: - error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, - GFP_KERNEL, BLKDEV_ZERO_NOUNMAP); + error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT, + len >> SECTOR_SHIFT, GFP_KERNEL, + BLKDEV_ZERO_NOUNMAP); break; case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: - error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9, - GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK); + error = blkdev_issue_zeroout(bdev, start >> SECTOR_SHIFT, + len >> SECTOR_SHIFT, GFP_KERNEL, + BLKDEV_ZERO_NOFALLBACK); break; case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE: - error = blkdev_issue_discard(bdev, start >> 9, len >> 9, - GFP_KERNEL, 0); + error = blkdev_issue_discard(bdev, start >> SECTOR_SHIFT, + len >> SECTOR_SHIFT, GFP_KERNEL, 0); break; default: error = -EOPNOTSUPP; diff --git a/block/genhd.c b/block/genhd.c index 53495e3391e3..febaaa55125a 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -590,16 +590,6 @@ void del_gendisk(struct gendisk *disk) * Prevent new I/O from crossing bio_queue_enter(). */ blk_queue_start_drain(q); - blk_mq_freeze_queue_wait(q); - - rq_qos_exit(q); - blk_sync_queue(q); - blk_flush_integrity(); - /* - * Allow using passthrough request again after the queue is torn down. - */ - blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q); - __blk_mq_unfreeze_queue(q, true); if (!(disk->flags & GENHD_FL_HIDDEN)) { sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi"); @@ -622,9 +612,41 @@ void del_gendisk(struct gendisk *disk) sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk))); pm_runtime_set_memalloc_noio(disk_to_dev(disk), false); device_del(disk_to_dev(disk)); + + blk_mq_freeze_queue_wait(q); + + rq_qos_exit(q); + blk_sync_queue(q); + blk_flush_integrity(); + /* + * Allow using passthrough request again after the queue is torn down. + */ + blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q); + __blk_mq_unfreeze_queue(q, true); + } EXPORT_SYMBOL(del_gendisk); +/** + * invalidate_disk - invalidate the disk + * @disk: the struct gendisk to invalidate + * + * A helper to invalidates the disk. It will clean the disk's associated + * buffer/page caches and reset its internal states so that the disk + * can be reused by the drivers. + * + * Context: can sleep + */ +void invalidate_disk(struct gendisk *disk) +{ + struct block_device *bdev = disk->part0; + + invalidate_bdev(bdev); + bdev->bd_inode->i_mapping->wb_err = 0; + set_capacity(disk, 0); +} +EXPORT_SYMBOL(invalidate_disk); + /* sysfs access to bad-blocks list. */ static ssize_t disk_badblocks_show(struct device *dev, struct device_attribute *attr, @@ -1390,12 +1412,6 @@ void set_disk_ro(struct gendisk *disk, bool read_only) } EXPORT_SYMBOL(set_disk_ro); -int bdev_read_only(struct block_device *bdev) -{ - return bdev->bd_read_only || get_disk_ro(bdev->bd_disk); -} -EXPORT_SYMBOL(bdev_read_only); - void inc_diskseq(struct gendisk *disk) { disk->diskseq = atomic64_inc_return(&diskseq); diff --git a/block/keyslot-manager.c b/block/keyslot-manager.c deleted file mode 100644 index 1792159d12d1..000000000000 --- a/block/keyslot-manager.c +++ /dev/null @@ -1,579 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * Copyright 2019 Google LLC - */ - -/** - * DOC: The Keyslot Manager - * - * Many devices with inline encryption support have a limited number of "slots" - * into which encryption contexts may be programmed, and requests can be tagged - * with a slot number to specify the key to use for en/decryption. - * - * As the number of slots is limited, and programming keys is expensive on - * many inline encryption hardware, we don't want to program the same key into - * multiple slots - if multiple requests are using the same key, we want to - * program just one slot with that key and use that slot for all requests. - * - * The keyslot manager manages these keyslots appropriately, and also acts as - * an abstraction between the inline encryption hardware and the upper layers. - * - * Lower layer devices will set up a keyslot manager in their request queue - * and tell it how to perform device specific operations like programming/ - * evicting keys from keyslots. - * - * Upper layers will call blk_ksm_get_slot_for_key() to program a - * key into some slot in the inline encryption hardware. - */ - -#define pr_fmt(fmt) "blk-crypto: " fmt - -#include <linux/keyslot-manager.h> -#include <linux/device.h> -#include <linux/atomic.h> -#include <linux/mutex.h> -#include <linux/pm_runtime.h> -#include <linux/wait.h> -#include <linux/blkdev.h> -#include <linux/blk-integrity.h> - -struct blk_ksm_keyslot { - atomic_t slot_refs; - struct list_head idle_slot_node; - struct hlist_node hash_node; - const struct blk_crypto_key *key; - struct blk_keyslot_manager *ksm; -}; - -static inline void blk_ksm_hw_enter(struct blk_keyslot_manager *ksm) -{ - /* - * Calling into the driver requires ksm->lock held and the device - * resumed. But we must resume the device first, since that can acquire - * and release ksm->lock via blk_ksm_reprogram_all_keys(). - */ - if (ksm->dev) - pm_runtime_get_sync(ksm->dev); - down_write(&ksm->lock); -} - -static inline void blk_ksm_hw_exit(struct blk_keyslot_manager *ksm) -{ - up_write(&ksm->lock); - if (ksm->dev) - pm_runtime_put_sync(ksm->dev); -} - -static inline bool blk_ksm_is_passthrough(struct blk_keyslot_manager *ksm) -{ - return ksm->num_slots == 0; -} - -/** - * blk_ksm_init() - Initialize a keyslot manager - * @ksm: The keyslot_manager to initialize. - * @num_slots: The number of key slots to manage. - * - * Allocate memory for keyslots and initialize a keyslot manager. Called by - * e.g. storage drivers to set up a keyslot manager in their request_queue. - * - * Return: 0 on success, or else a negative error code. - */ -int blk_ksm_init(struct blk_keyslot_manager *ksm, unsigned int num_slots) -{ - unsigned int slot; - unsigned int i; - unsigned int slot_hashtable_size; - - memset(ksm, 0, sizeof(*ksm)); - - if (num_slots == 0) - return -EINVAL; - - ksm->slots = kvcalloc(num_slots, sizeof(ksm->slots[0]), GFP_KERNEL); - if (!ksm->slots) - return -ENOMEM; - - ksm->num_slots = num_slots; - - init_rwsem(&ksm->lock); - - init_waitqueue_head(&ksm->idle_slots_wait_queue); - INIT_LIST_HEAD(&ksm->idle_slots); - - for (slot = 0; slot < num_slots; slot++) { - ksm->slots[slot].ksm = ksm; - list_add_tail(&ksm->slots[slot].idle_slot_node, - &ksm->idle_slots); - } - - spin_lock_init(&ksm->idle_slots_lock); - - slot_hashtable_size = roundup_pow_of_two(num_slots); - /* - * hash_ptr() assumes bits != 0, so ensure the hash table has at least 2 - * buckets. This only makes a difference when there is only 1 keyslot. - */ - if (slot_hashtable_size < 2) - slot_hashtable_size = 2; - - ksm->log_slot_ht_size = ilog2(slot_hashtable_size); - ksm->slot_hashtable = kvmalloc_array(slot_hashtable_size, - sizeof(ksm->slot_hashtable[0]), - GFP_KERNEL); - if (!ksm->slot_hashtable) - goto err_destroy_ksm; - for (i = 0; i < slot_hashtable_size; i++) - INIT_HLIST_HEAD(&ksm->slot_hashtable[i]); - - return 0; - -err_destroy_ksm: - blk_ksm_destroy(ksm); - return -ENOMEM; -} -EXPORT_SYMBOL_GPL(blk_ksm_init); - -static void blk_ksm_destroy_callback(void *ksm) -{ - blk_ksm_destroy(ksm); -} - -/** - * devm_blk_ksm_init() - Resource-managed blk_ksm_init() - * @dev: The device which owns the blk_keyslot_manager. - * @ksm: The blk_keyslot_manager to initialize. - * @num_slots: The number of key slots to manage. - * - * Like blk_ksm_init(), but causes blk_ksm_destroy() to be called automatically - * on driver detach. - * - * Return: 0 on success, or else a negative error code. - */ -int devm_blk_ksm_init(struct device *dev, struct blk_keyslot_manager *ksm, - unsigned int num_slots) -{ - int err = blk_ksm_init(ksm, num_slots); - - if (err) - return err; - - return devm_add_action_or_reset(dev, blk_ksm_destroy_callback, ksm); -} -EXPORT_SYMBOL_GPL(devm_blk_ksm_init); - -static inline struct hlist_head * -blk_ksm_hash_bucket_for_key(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key) -{ - return &ksm->slot_hashtable[hash_ptr(key, ksm->log_slot_ht_size)]; -} - -static void blk_ksm_remove_slot_from_lru_list(struct blk_ksm_keyslot *slot) -{ - struct blk_keyslot_manager *ksm = slot->ksm; - unsigned long flags; - - spin_lock_irqsave(&ksm->idle_slots_lock, flags); - list_del(&slot->idle_slot_node); - spin_unlock_irqrestore(&ksm->idle_slots_lock, flags); -} - -static struct blk_ksm_keyslot *blk_ksm_find_keyslot( - struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key) -{ - const struct hlist_head *head = blk_ksm_hash_bucket_for_key(ksm, key); - struct blk_ksm_keyslot *slotp; - - hlist_for_each_entry(slotp, head, hash_node) { - if (slotp->key == key) - return slotp; - } - return NULL; -} - -static struct blk_ksm_keyslot *blk_ksm_find_and_grab_keyslot( - struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key) -{ - struct blk_ksm_keyslot *slot; - - slot = blk_ksm_find_keyslot(ksm, key); - if (!slot) - return NULL; - if (atomic_inc_return(&slot->slot_refs) == 1) { - /* Took first reference to this slot; remove it from LRU list */ - blk_ksm_remove_slot_from_lru_list(slot); - } - return slot; -} - -unsigned int blk_ksm_get_slot_idx(struct blk_ksm_keyslot *slot) -{ - return slot - slot->ksm->slots; -} -EXPORT_SYMBOL_GPL(blk_ksm_get_slot_idx); - -/** - * blk_ksm_get_slot_for_key() - Program a key into a keyslot. - * @ksm: The keyslot manager to program the key into. - * @key: Pointer to the key object to program, including the raw key, crypto - * mode, and data unit size. - * @slot_ptr: A pointer to return the pointer of the allocated keyslot. - * - * Get a keyslot that's been programmed with the specified key. If one already - * exists, return it with incremented refcount. Otherwise, wait for a keyslot - * to become idle and program it. - * - * Context: Process context. Takes and releases ksm->lock. - * Return: BLK_STS_OK on success (and keyslot is set to the pointer of the - * allocated keyslot), or some other blk_status_t otherwise (and - * keyslot is set to NULL). - */ -blk_status_t blk_ksm_get_slot_for_key(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key, - struct blk_ksm_keyslot **slot_ptr) -{ - struct blk_ksm_keyslot *slot; - int slot_idx; - int err; - - *slot_ptr = NULL; - - if (blk_ksm_is_passthrough(ksm)) - return BLK_STS_OK; - - down_read(&ksm->lock); - slot = blk_ksm_find_and_grab_keyslot(ksm, key); - up_read(&ksm->lock); - if (slot) - goto success; - - for (;;) { - blk_ksm_hw_enter(ksm); - slot = blk_ksm_find_and_grab_keyslot(ksm, key); - if (slot) { - blk_ksm_hw_exit(ksm); - goto success; - } - - /* - * If we're here, that means there wasn't a slot that was - * already programmed with the key. So try to program it. - */ - if (!list_empty(&ksm->idle_slots)) - break; - - blk_ksm_hw_exit(ksm); - wait_event(ksm->idle_slots_wait_queue, - !list_empty(&ksm->idle_slots)); - } - - slot = list_first_entry(&ksm->idle_slots, struct blk_ksm_keyslot, - idle_slot_node); - slot_idx = blk_ksm_get_slot_idx(slot); - - err = ksm->ksm_ll_ops.keyslot_program(ksm, key, slot_idx); - if (err) { - wake_up(&ksm->idle_slots_wait_queue); - blk_ksm_hw_exit(ksm); - return errno_to_blk_status(err); - } - - /* Move this slot to the hash list for the new key. */ - if (slot->key) - hlist_del(&slot->hash_node); - slot->key = key; - hlist_add_head(&slot->hash_node, blk_ksm_hash_bucket_for_key(ksm, key)); - - atomic_set(&slot->slot_refs, 1); - - blk_ksm_remove_slot_from_lru_list(slot); - - blk_ksm_hw_exit(ksm); -success: - *slot_ptr = slot; - return BLK_STS_OK; -} - -/** - * blk_ksm_put_slot() - Release a reference to a slot - * @slot: The keyslot to release the reference of. - * - * Context: Any context. - */ -void blk_ksm_put_slot(struct blk_ksm_keyslot *slot) -{ - struct blk_keyslot_manager *ksm; - unsigned long flags; - - if (!slot) - return; - - ksm = slot->ksm; - - if (atomic_dec_and_lock_irqsave(&slot->slot_refs, - &ksm->idle_slots_lock, flags)) { - list_add_tail(&slot->idle_slot_node, &ksm->idle_slots); - spin_unlock_irqrestore(&ksm->idle_slots_lock, flags); - wake_up(&ksm->idle_slots_wait_queue); - } -} - -/** - * blk_ksm_crypto_cfg_supported() - Find out if a crypto configuration is - * supported by a ksm. - * @ksm: The keyslot manager to check - * @cfg: The crypto configuration to check for. - * - * Checks for crypto_mode/data unit size/dun bytes support. - * - * Return: Whether or not this ksm supports the specified crypto config. - */ -bool blk_ksm_crypto_cfg_supported(struct blk_keyslot_manager *ksm, - const struct blk_crypto_config *cfg) -{ - if (!ksm) - return false; - if (!(ksm->crypto_modes_supported[cfg->crypto_mode] & - cfg->data_unit_size)) - return false; - if (ksm->max_dun_bytes_supported < cfg->dun_bytes) - return false; - return true; -} - -/** - * blk_ksm_evict_key() - Evict a key from the lower layer device. - * @ksm: The keyslot manager to evict from - * @key: The key to evict - * - * Find the keyslot that the specified key was programmed into, and evict that - * slot from the lower layer device. The slot must not be in use by any - * in-flight IO when this function is called. - * - * Context: Process context. Takes and releases ksm->lock. - * Return: 0 on success or if there's no keyslot with the specified key, -EBUSY - * if the keyslot is still in use, or another -errno value on other - * error. - */ -int blk_ksm_evict_key(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key) -{ - struct blk_ksm_keyslot *slot; - int err = 0; - - if (blk_ksm_is_passthrough(ksm)) { - if (ksm->ksm_ll_ops.keyslot_evict) { - blk_ksm_hw_enter(ksm); - err = ksm->ksm_ll_ops.keyslot_evict(ksm, key, -1); - blk_ksm_hw_exit(ksm); - return err; - } - return 0; - } - - blk_ksm_hw_enter(ksm); - slot = blk_ksm_find_keyslot(ksm, key); - if (!slot) - goto out_unlock; - - if (WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)) { - err = -EBUSY; - goto out_unlock; - } - err = ksm->ksm_ll_ops.keyslot_evict(ksm, key, - blk_ksm_get_slot_idx(slot)); - if (err) - goto out_unlock; - - hlist_del(&slot->hash_node); - slot->key = NULL; - err = 0; -out_unlock: - blk_ksm_hw_exit(ksm); - return err; -} - -/** - * blk_ksm_reprogram_all_keys() - Re-program all keyslots. - * @ksm: The keyslot manager - * - * Re-program all keyslots that are supposed to have a key programmed. This is - * intended only for use by drivers for hardware that loses its keys on reset. - * - * Context: Process context. Takes and releases ksm->lock. - */ -void blk_ksm_reprogram_all_keys(struct blk_keyslot_manager *ksm) -{ - unsigned int slot; - - if (blk_ksm_is_passthrough(ksm)) - return; - - /* This is for device initialization, so don't resume the device */ - down_write(&ksm->lock); - for (slot = 0; slot < ksm->num_slots; slot++) { - const struct blk_crypto_key *key = ksm->slots[slot].key; - int err; - - if (!key) - continue; - - err = ksm->ksm_ll_ops.keyslot_program(ksm, key, slot); - WARN_ON(err); - } - up_write(&ksm->lock); -} -EXPORT_SYMBOL_GPL(blk_ksm_reprogram_all_keys); - -void blk_ksm_destroy(struct blk_keyslot_manager *ksm) -{ - if (!ksm) - return; - kvfree(ksm->slot_hashtable); - kvfree_sensitive(ksm->slots, sizeof(ksm->slots[0]) * ksm->num_slots); - memzero_explicit(ksm, sizeof(*ksm)); -} -EXPORT_SYMBOL_GPL(blk_ksm_destroy); - -bool blk_ksm_register(struct blk_keyslot_manager *ksm, struct request_queue *q) -{ - if (blk_integrity_queue_supports_integrity(q)) { - pr_warn("Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n"); - return false; - } - q->ksm = ksm; - return true; -} -EXPORT_SYMBOL_GPL(blk_ksm_register); - -void blk_ksm_unregister(struct request_queue *q) -{ - q->ksm = NULL; -} - -/** - * blk_ksm_intersect_modes() - restrict supported modes by child device - * @parent: The keyslot manager for parent device - * @child: The keyslot manager for child device, or NULL - * - * Clear any crypto mode support bits in @parent that aren't set in @child. - * If @child is NULL, then all parent bits are cleared. - * - * Only use this when setting up the keyslot manager for a layered device, - * before it's been exposed yet. - */ -void blk_ksm_intersect_modes(struct blk_keyslot_manager *parent, - const struct blk_keyslot_manager *child) -{ - if (child) { - unsigned int i; - - parent->max_dun_bytes_supported = - min(parent->max_dun_bytes_supported, - child->max_dun_bytes_supported); - for (i = 0; i < ARRAY_SIZE(child->crypto_modes_supported); - i++) { - parent->crypto_modes_supported[i] &= - child->crypto_modes_supported[i]; - } - } else { - parent->max_dun_bytes_supported = 0; - memset(parent->crypto_modes_supported, 0, - sizeof(parent->crypto_modes_supported)); - } -} -EXPORT_SYMBOL_GPL(blk_ksm_intersect_modes); - -/** - * blk_ksm_is_superset() - Check if a KSM supports a superset of crypto modes - * and DUN bytes that another KSM supports. Here, - * "superset" refers to the mathematical meaning of the - * word - i.e. if two KSMs have the *same* capabilities, - * they *are* considered supersets of each other. - * @ksm_superset: The KSM that we want to verify is a superset - * @ksm_subset: The KSM that we want to verify is a subset - * - * Return: True if @ksm_superset supports a superset of the crypto modes and DUN - * bytes that @ksm_subset supports. - */ -bool blk_ksm_is_superset(struct blk_keyslot_manager *ksm_superset, - struct blk_keyslot_manager *ksm_subset) -{ - int i; - - if (!ksm_subset) - return true; - - if (!ksm_superset) - return false; - - for (i = 0; i < ARRAY_SIZE(ksm_superset->crypto_modes_supported); i++) { - if (ksm_subset->crypto_modes_supported[i] & - (~ksm_superset->crypto_modes_supported[i])) { - return false; - } - } - - if (ksm_subset->max_dun_bytes_supported > - ksm_superset->max_dun_bytes_supported) { - return false; - } - - return true; -} -EXPORT_SYMBOL_GPL(blk_ksm_is_superset); - -/** - * blk_ksm_update_capabilities() - Update the restrictions of a KSM to those of - * another KSM - * @target_ksm: The KSM whose restrictions to update. - * @reference_ksm: The KSM to whose restrictions this function will update - * @target_ksm's restrictions to. - * - * Blk-crypto requires that crypto capabilities that were - * advertised when a bio was created continue to be supported by the - * device until that bio is ended. This is turn means that a device cannot - * shrink its advertised crypto capabilities without any explicit - * synchronization with upper layers. So if there's no such explicit - * synchronization, @reference_ksm must support all the crypto capabilities that - * @target_ksm does - * (i.e. we need blk_ksm_is_superset(@reference_ksm, @target_ksm) == true). - * - * Note also that as long as the crypto capabilities are being expanded, the - * order of updates becoming visible is not important because it's alright - * for blk-crypto to see stale values - they only cause blk-crypto to - * believe that a crypto capability isn't supported when it actually is (which - * might result in blk-crypto-fallback being used if available, or the bio being - * failed). - */ -void blk_ksm_update_capabilities(struct blk_keyslot_manager *target_ksm, - struct blk_keyslot_manager *reference_ksm) -{ - memcpy(target_ksm->crypto_modes_supported, - reference_ksm->crypto_modes_supported, - sizeof(target_ksm->crypto_modes_supported)); - - target_ksm->max_dun_bytes_supported = - reference_ksm->max_dun_bytes_supported; -} -EXPORT_SYMBOL_GPL(blk_ksm_update_capabilities); - -/** - * blk_ksm_init_passthrough() - Init a passthrough keyslot manager - * @ksm: The keyslot manager to init - * - * Initialize a passthrough keyslot manager. - * Called by e.g. storage drivers to set up a keyslot manager in their - * request_queue, when the storage driver wants to manage its keys by itself. - * This is useful for inline encryption hardware that doesn't have the concept - * of keyslots, and for layered devices. - */ -void blk_ksm_init_passthrough(struct blk_keyslot_manager *ksm) -{ - memset(ksm, 0, sizeof(*ksm)); - init_rwsem(&ksm->lock); -} -EXPORT_SYMBOL_GPL(blk_ksm_init_passthrough); diff --git a/block/partitions/core.c b/block/partitions/core.c index 66ef9bc6d6a1..334b72ef1d73 100644 --- a/block/partitions/core.c +++ b/block/partitions/core.c @@ -425,6 +425,7 @@ out_del: device_del(pdev); out_put: put_device(pdev); + return ERR_PTR(err); out_put_disk: put_disk(disk); return ERR_PTR(err); diff --git a/drivers/acpi/power.c b/drivers/acpi/power.c index b9863e22b952..f0ed4414edb1 100644 --- a/drivers/acpi/power.c +++ b/drivers/acpi/power.c @@ -1035,13 +1035,8 @@ void acpi_turn_off_unused_power_resources(void) list_for_each_entry_reverse(resource, &acpi_power_resource_list, list_node) { mutex_lock(&resource->resource_lock); - /* - * Turn off power resources in an unknown state too, because the - * platform firmware on some system expects the OS to turn off - * power resources without any users unconditionally. - */ if (!resource->ref_count && - resource->state != ACPI_POWER_RESOURCE_STATE_OFF) { + resource->state == ACPI_POWER_RESOURCE_STATE_ON) { acpi_handle_debug(resource->device.handle, "Turning OFF\n"); __acpi_power_off(resource); } diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index f9383736fa0f..71419eb16e09 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -21,6 +21,7 @@ #include <linux/earlycpio.h> #include <linux/initrd.h> #include <linux/security.h> +#include <linux/kmemleak.h> #include "internal.h" #ifdef CONFIG_ACPI_CUSTOM_DSDT @@ -601,6 +602,8 @@ void __init acpi_table_upgrade(void) */ arch_reserve_mem_area(acpi_tables_addr, all_tables_size); + kmemleak_ignore_phys(acpi_tables_addr); + /* * early_ioremap only can remap 256k one time. If we map all * tables one time, we will hit the limit. Need to map chunks diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c index 9d86203e1e7a..c53633d47bfb 100644 --- a/drivers/ata/sata_mv.c +++ b/drivers/ata/sata_mv.c @@ -3896,8 +3896,8 @@ static int mv_chip_id(struct ata_host *host, unsigned int board_idx) break; default: - dev_err(host->dev, "BUG: invalid board index %u\n", board_idx); - return 1; + dev_alert(host->dev, "BUG: invalid board index %u\n", board_idx); + return -EINVAL; } hpriv->hp_flags = hp_flags; diff --git a/drivers/base/regmap/regcache-rbtree.c b/drivers/base/regmap/regcache-rbtree.c index cfa29dc89bbf..fabf87058d80 100644 --- a/drivers/base/regmap/regcache-rbtree.c +++ b/drivers/base/regmap/regcache-rbtree.c @@ -281,14 +281,14 @@ static int regcache_rbtree_insert_to_block(struct regmap *map, if (!blk) return -ENOMEM; + rbnode->block = blk; + if (BITS_TO_LONGS(blklen) > BITS_TO_LONGS(rbnode->blklen)) { present = krealloc(rbnode->cache_present, BITS_TO_LONGS(blklen) * sizeof(*present), GFP_KERNEL); - if (!present) { - kfree(blk); + if (!present) return -ENOMEM; - } memset(present + BITS_TO_LONGS(rbnode->blklen), 0, (BITS_TO_LONGS(blklen) - BITS_TO_LONGS(rbnode->blklen)) @@ -305,7 +305,6 @@ static int regcache_rbtree_insert_to_block(struct regmap *map, } /* update the rbnode block, its size and the base register */ - rbnode->block = blk; rbnode->blklen = blklen; rbnode->base_reg = base_reg; rbnode->cache_present = present; diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index ab3e37aa1830..f7f32eeaee63 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -180,14 +180,6 @@ config BLK_DEV_LOOP bits of, say, a sound file). This is also safe if the file resides on a remote file server. - There are several ways of encrypting disks. Some of these require - kernel patches. The vanilla kernel offers the cryptoloop option - and a Device Mapper target (which is superior, as it supports all - file systems). If you want to use the cryptoloop, say Y to both - LOOP and CRYPTOLOOP, and make sure you have a recent (version 2.12 - or later) version of util-linux. Additionally, be aware that - the cryptoloop is not safe for storing journaled filesystems. - Note that this loop device has nothing to do with the loopback device used for network connections from the machine to itself. @@ -211,21 +203,6 @@ config BLK_DEV_LOOP_MIN_COUNT is used, it can be set to 0, since needed loop devices can be dynamically allocated with the /dev/loop-control interface. -config BLK_DEV_CRYPTOLOOP - tristate "Cryptoloop Support (DEPRECATED)" - select CRYPTO - select CRYPTO_CBC - depends on BLK_DEV_LOOP - help - Say Y here if you want to be able to use the ciphers that are - provided by the CryptoAPI as loop transformation. This might be - used as hard disk encryption. - - WARNING: This device is not safe for journaled file systems like - ext3 or Reiserfs. Please use the Device Mapper crypto module - instead, which can be configured to be on-disk compatible with the - cryptoloop device. cryptoloop support will be removed in Linux 5.16. - source "drivers/block/drbd/Kconfig" config BLK_DEV_NBD diff --git a/drivers/block/Makefile b/drivers/block/Makefile index bc68817ef496..11a74f17c9ad 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -24,7 +24,6 @@ obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o obj-$(CONFIG_SUNVDC) += sunvdc.o obj-$(CONFIG_BLK_DEV_NBD) += nbd.o -obj-$(CONFIG_BLK_DEV_CRYPTOLOOP) += cryptoloop.o obj-$(CONFIG_VIRTIO_BLK) += virtio_blk.o obj-$(CONFIG_BLK_DEV_SX8) += sx8.o diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c index 2909fd9e72fb..bf5c124c5452 100644 --- a/drivers/block/amiflop.c +++ b/drivers/block/amiflop.c @@ -1780,6 +1780,7 @@ static const struct blk_mq_ops amiflop_mq_ops = { static int fd_alloc_disk(int drive, int system) { struct gendisk *disk; + int err; disk = blk_mq_alloc_disk(&unit[drive].tag_set, NULL); if (IS_ERR(disk)) @@ -1798,8 +1799,10 @@ static int fd_alloc_disk(int drive, int system) set_capacity(disk, 880 * 2); unit[drive].gendisk[system] = disk; - add_disk(disk); - return 0; + err = add_disk(disk); + if (err) + blk_cleanup_disk(disk); + return err; } static int fd_alloc_drive(int drive) diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c index 06b360f7123a..52484bcdedb9 100644 --- a/drivers/block/aoe/aoeblk.c +++ b/drivers/block/aoe/aoeblk.c @@ -37,8 +37,7 @@ static ssize_t aoedisk_show_state(struct device *dev, struct gendisk *disk = dev_to_disk(dev); struct aoedev *d = disk->private_data; - return snprintf(page, PAGE_SIZE, - "%s%s\n", + return sysfs_emit(page, "%s%s\n", (d->flags & DEVFL_UP) ? "up" : "down", (d->flags & DEVFL_KICKME) ? ",kickme" : (d->nopen && !(d->flags & DEVFL_UP)) ? ",closewait" : ""); @@ -52,8 +51,8 @@ static ssize_t aoedisk_show_mac(struct device *dev, struct aoetgt *t = d->targets[0]; if (t == NULL) - return snprintf(page, PAGE_SIZE, "none\n"); - return snprintf(page, PAGE_SIZE, "%pm\n", t->addr); + return sysfs_emit(page, "none\n"); + return sysfs_emit(page, "%pm\n", t->addr); } static ssize_t aoedisk_show_netif(struct device *dev, struct device_attribute *attr, char *page) @@ -85,7 +84,7 @@ static ssize_t aoedisk_show_netif(struct device *dev, ne = nd; nd = nds; if (*nd == NULL) - return snprintf(page, PAGE_SIZE, "none\n"); + return sysfs_emit(page, "none\n"); for (p = page; nd < ne; nd++) p += scnprintf(p, PAGE_SIZE - (p-page), "%s%s", p == page ? "" : ",", (*nd)->name); @@ -99,7 +98,7 @@ static ssize_t aoedisk_show_fwver(struct device *dev, struct gendisk *disk = dev_to_disk(dev); struct aoedev *d = disk->private_data; - return snprintf(page, PAGE_SIZE, "0x%04x\n", (unsigned int) d->fw_ver); + return sysfs_emit(page, "0x%04x\n", (unsigned int) d->fw_ver); } static ssize_t aoedisk_show_payload(struct device *dev, struct device_attribute *attr, char *page) @@ -107,7 +106,7 @@ static ssize_t aoedisk_show_payload(struct device *dev, struct gendisk *disk = dev_to_disk(dev); struct aoedev *d = disk->private_data; - return snprintf(page, PAGE_SIZE, "%lu\n", d->maxbcnt); + return sysfs_emit(page, "%lu\n", d->maxbcnt); } static int aoedisk_debugfs_show(struct seq_file *s, void *ignored) @@ -417,7 +416,9 @@ aoeblk_gdalloc(void *vp) spin_unlock_irqrestore(&d->lock, flags); - device_add_disk(NULL, gd, aoe_attr_groups); + err = device_add_disk(NULL, gd, aoe_attr_groups); + if (err) + goto out_disk_cleanup; aoedisk_add_debugfs(d); spin_lock_irqsave(&d->lock, flags); @@ -426,6 +427,8 @@ aoeblk_gdalloc(void *vp) spin_unlock_irqrestore(&d->lock, flags); return; +out_disk_cleanup: + blk_cleanup_disk(gd); err_tagset: blk_mq_free_tag_set(set); err_mempool: diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c index 58e921ab5729..d14bdc3589b2 100644 --- a/drivers/block/ataflop.c +++ b/drivers/block/ataflop.c @@ -299,6 +299,7 @@ static struct atari_floppy_struct { disk change detection) */ int flags; /* flags */ struct gendisk *disk[NUM_DISK_MINORS]; + bool registered[NUM_DISK_MINORS]; int ref; int type; struct blk_mq_tag_set tag_set; @@ -457,10 +458,20 @@ static DEFINE_TIMER(fd_timer, check_change); static void fd_end_request_cur(blk_status_t err) { + DPRINT(("fd_end_request_cur(), bytes %d of %d\n", + blk_rq_cur_bytes(fd_request), + blk_rq_bytes(fd_request))); + if (!blk_update_request(fd_request, err, blk_rq_cur_bytes(fd_request))) { + DPRINT(("calling __blk_mq_end_request()\n")); __blk_mq_end_request(fd_request, err); fd_request = NULL; + } else { + /* requeue rest of request */ + DPRINT(("calling blk_mq_requeue_request()\n")); + blk_mq_requeue_request(fd_request, true); + fd_request = NULL; } } @@ -654,9 +665,6 @@ static inline void copy_buffer(void *from, void *to) *p2++ = *p1++; } - - - /* General Interrupt Handling */ static void (*FloppyIRQHandler)( int status ) = NULL; @@ -701,12 +709,21 @@ static void fd_error( void ) if (fd_request->error_count >= MAX_ERRORS) { printk(KERN_ERR "fd%d: too many errors.\n", SelectedDrive ); fd_end_request_cur(BLK_STS_IOERR); + finish_fdc(); + return; } else if (fd_request->error_count == RECALIBRATE_ERRORS) { printk(KERN_WARNING "fd%d: recalibrating\n", SelectedDrive ); if (SelectedDrive != -1) SUD.track = -1; } + /* need to re-run request to recalibrate */ + atari_disable_irq( IRQ_MFP_FDC ); + + setup_req_params( SelectedDrive ); + do_fd_action( SelectedDrive ); + + atari_enable_irq( IRQ_MFP_FDC ); } @@ -733,8 +750,10 @@ static int do_format(int drive, int type, struct atari_format_descr *desc) if (type) { type--; if (type >= NUM_DISK_MINORS || - minor2disktype[type].drive_types > DriveType) + minor2disktype[type].drive_types > DriveType) { + finish_fdc(); return -EINVAL; + } } q = unit[drive].disk[type]->queue; @@ -752,6 +771,7 @@ static int do_format(int drive, int type, struct atari_format_descr *desc) } if (!UDT || desc->track >= UDT->blocks/UDT->spt/2 || desc->head >= 2) { + finish_fdc(); ret = -EINVAL; goto out; } @@ -792,6 +812,7 @@ static int do_format(int drive, int type, struct atari_format_descr *desc) wait_for_completion(&format_wait); + finish_fdc(); ret = FormatError ? -EIO : 0; out: blk_mq_unquiesce_queue(q); @@ -826,6 +847,7 @@ static void do_fd_action( int drive ) else { /* all sectors finished */ fd_end_request_cur(BLK_STS_OK); + finish_fdc(); return; } } @@ -1230,6 +1252,7 @@ static void fd_rwsec_done1(int status) else { /* all sectors finished */ fd_end_request_cur(BLK_STS_OK); + finish_fdc(); } return; @@ -1351,7 +1374,7 @@ static void fd_times_out(struct timer_list *unused) static void finish_fdc( void ) { - if (!NeedSeek) { + if (!NeedSeek || !stdma_is_locked_by(floppy_irq)) { finish_fdc_done( 0 ); } else { @@ -1386,7 +1409,8 @@ static void finish_fdc_done( int dummy ) start_motor_off_timer(); local_irq_save(flags); - stdma_release(); + if (stdma_is_locked_by(floppy_irq)) + stdma_release(); local_irq_restore(flags); DPRINT(("finish_fdc() finished\n")); @@ -1436,8 +1460,7 @@ static int floppy_revalidate(struct gendisk *disk) unsigned int drive = p - unit; if (test_bit(drive, &changed_floppies) || - test_bit(drive, &fake_change) || - p->disktype == 0) { + test_bit(drive, &fake_change) || !p->disktype) { if (UD.flags & FTD_MSG) printk(KERN_ERR "floppy: clear format %p!\n", UDT); BufferDrive = -1; @@ -1476,15 +1499,6 @@ static void setup_req_params( int drive ) ReqTrack, ReqSector, (unsigned long)ReqData )); } -static void ataflop_commit_rqs(struct blk_mq_hw_ctx *hctx) -{ - spin_lock_irq(&ataflop_lock); - atari_disable_irq(IRQ_MFP_FDC); - finish_fdc(); - atari_enable_irq(IRQ_MFP_FDC); - spin_unlock_irq(&ataflop_lock); -} - static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd) { @@ -1492,6 +1506,10 @@ static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, int drive = floppy - unit; int type = floppy->type; + DPRINT(("Queue request: drive %d type %d sectors %d of %d last %d\n", + drive, type, blk_rq_cur_sectors(bd->rq), + blk_rq_sectors(bd->rq), bd->last)); + spin_lock_irq(&ataflop_lock); if (fd_request) { spin_unlock_irq(&ataflop_lock); @@ -1512,6 +1530,7 @@ static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, /* drive not connected */ printk(KERN_ERR "Unknown Device: fd%d\n", drive ); fd_end_request_cur(BLK_STS_IOERR); + stdma_release(); goto out; } @@ -1528,11 +1547,13 @@ static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, if (--type >= NUM_DISK_MINORS) { printk(KERN_WARNING "fd%d: invalid disk format", drive ); fd_end_request_cur(BLK_STS_IOERR); + stdma_release(); goto out; } if (minor2disktype[type].drive_types > DriveType) { printk(KERN_WARNING "fd%d: unsupported disk format", drive ); fd_end_request_cur(BLK_STS_IOERR); + stdma_release(); goto out; } type = minor2disktype[type].index; @@ -1551,8 +1572,6 @@ static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, setup_req_params( drive ); do_fd_action( drive ); - if (bd->last) - finish_fdc(); atari_enable_irq( IRQ_MFP_FDC ); out: @@ -1635,6 +1654,7 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, /* what if type > 0 here? Overwrite specified entry ? */ if (type) { /* refuse to re-set a predefined type for now */ + finish_fdc(); return -EINVAL; } @@ -1702,8 +1722,10 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, /* sanity check */ if (setprm.track != dtp->blocks/dtp->spt/2 || - setprm.head != 2) + setprm.head != 2) { + finish_fdc(); return -EINVAL; + } UDT = dtp; set_capacity(disk, UDT->blocks); @@ -1963,7 +1985,6 @@ static const struct block_device_operations floppy_fops = { static const struct blk_mq_ops ataflop_mq_ops = { .queue_rq = ataflop_queue_rq, - .commit_rqs = ataflop_commit_rqs, }; static int ataflop_alloc_disk(unsigned int drive, unsigned int type) @@ -2001,12 +2022,28 @@ static void ataflop_probe(dev_t dev) return; mutex_lock(&ataflop_probe_lock); if (!unit[drive].disk[type]) { - if (ataflop_alloc_disk(drive, type) == 0) + if (ataflop_alloc_disk(drive, type) == 0) { add_disk(unit[drive].disk[type]); + unit[drive].registered[type] = true; + } } mutex_unlock(&ataflop_probe_lock); } +static void atari_cleanup_floppy_disk(struct atari_floppy_struct *fs) +{ + int type; + + for (type = 0; type < NUM_DISK_MINORS; type++) { + if (!fs->disk[type]) + continue; + if (fs->registered[type]) + del_gendisk(fs->disk[type]); + blk_cleanup_disk(fs->disk[type]); + } + blk_mq_free_tag_set(&fs->tag_set); +} + static int __init atari_floppy_init (void) { int i; @@ -2065,7 +2102,10 @@ static int __init atari_floppy_init (void) for (i = 0; i < FD_MAX_UNITS; i++) { unit[i].track = -1; unit[i].flags = 0; - add_disk(unit[i].disk[0]); + ret = add_disk(unit[i].disk[0]); + if (ret) + goto err_out_dma; + unit[i].registered[0] = true; } printk(KERN_INFO "Atari floppy driver: max. %cD, %strack buffering\n", @@ -2075,12 +2115,11 @@ static int __init atari_floppy_init (void) return 0; +err_out_dma: + atari_stram_free(DMABuffer); err: - while (--i >= 0) { - blk_cleanup_queue(unit[i].disk[0]->queue); - put_disk(unit[i].disk[0]); - blk_mq_free_tag_set(&unit[i].tag_set); - } + while (--i >= 0) + atari_cleanup_floppy_disk(&unit[i]); unregister_blkdev(FLOPPY_MAJOR, "fd"); out_unlock: @@ -2129,18 +2168,10 @@ __setup("floppy=", atari_floppy_setup); static void __exit atari_floppy_exit(void) { - int i, type; + int i; - for (i = 0; i < FD_MAX_UNITS; i++) { - for (type = 0; type < NUM_DISK_MINORS; type++) { - if (!unit[i].disk[type]) - continue; - del_gendisk(unit[i].disk[type]); - blk_cleanup_queue(unit[i].disk[type]->queue); - put_disk(unit[i].disk[type]); - } - blk_mq_free_tag_set(&unit[i].tag_set); - } + for (i = 0; i < FD_MAX_UNITS; i++) + atari_cleanup_floppy_disk(&unit[i]); unregister_blkdev(FLOPPY_MAJOR, "fd"); del_timer_sync(&fd_timer); diff --git a/drivers/block/cryptoloop.c b/drivers/block/cryptoloop.c deleted file mode 100644 index f0a91faa43a8..000000000000 --- a/drivers/block/cryptoloop.c +++ /dev/null @@ -1,206 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-or-later -/* - Linux loop encryption enabling module - - Copyright (C) 2002 Herbert Valerio Riedel <hvr@gnu.org> - Copyright (C) 2003 Fruhwirth Clemens <clemens@endorphin.org> - - */ - -#include <linux/module.h> - -#include <crypto/skcipher.h> -#include <linux/init.h> -#include <linux/string.h> -#include <linux/blkdev.h> -#include <linux/scatterlist.h> -#include <linux/uaccess.h> -#include "loop.h" - -MODULE_LICENSE("GPL"); -MODULE_DESCRIPTION("loop blockdevice transferfunction adaptor / CryptoAPI"); -MODULE_AUTHOR("Herbert Valerio Riedel <hvr@gnu.org>"); - -#define LOOP_IV_SECTOR_BITS 9 -#define LOOP_IV_SECTOR_SIZE (1 << LOOP_IV_SECTOR_BITS) - -static int -cryptoloop_init(struct loop_device *lo, const struct loop_info64 *info) -{ - int err = -EINVAL; - int cipher_len; - int mode_len; - char cms[LO_NAME_SIZE]; /* cipher-mode string */ - char *mode; - char *cmsp = cms; /* c-m string pointer */ - struct crypto_sync_skcipher *tfm; - - /* encryption breaks for non sector aligned offsets */ - - if (info->lo_offset % LOOP_IV_SECTOR_SIZE) - goto out; - - strncpy(cms, info->lo_crypt_name, LO_NAME_SIZE); - cms[LO_NAME_SIZE - 1] = 0; - - cipher_len = strcspn(cmsp, "-"); - - mode = cmsp + cipher_len; - mode_len = 0; - if (*mode) { - mode++; - mode_len = strcspn(mode, "-"); - } - - if (!mode_len) { - mode = "cbc"; - mode_len = 3; - } - - if (cipher_len + mode_len + 3 > LO_NAME_SIZE) - return -EINVAL; - - memmove(cms, mode, mode_len); - cmsp = cms + mode_len; - *cmsp++ = '('; - memcpy(cmsp, info->lo_crypt_name, cipher_len); - cmsp += cipher_len; - *cmsp++ = ')'; - *cmsp = 0; - - tfm = crypto_alloc_sync_skcipher(cms, 0, 0); - if (IS_ERR(tfm)) - return PTR_ERR(tfm); - - err = crypto_sync_skcipher_setkey(tfm, info->lo_encrypt_key, - info->lo_encrypt_key_size); - - if (err != 0) - goto out_free_tfm; - - lo->key_data = tfm; - return 0; - - out_free_tfm: - crypto_free_sync_skcipher(tfm); - - out: - return err; -} - - -typedef int (*encdec_cbc_t)(struct skcipher_request *req); - -static int -cryptoloop_transfer(struct loop_device *lo, int cmd, - struct page *raw_page, unsigned raw_off, - struct page *loop_page, unsigned loop_off, - int size, sector_t IV) -{ - struct crypto_sync_skcipher *tfm = lo->key_data; - SYNC_SKCIPHER_REQUEST_ON_STACK(req, tfm); - struct scatterlist sg_out; - struct scatterlist sg_in; - - encdec_cbc_t encdecfunc; - struct page *in_page, *out_page; - unsigned in_offs, out_offs; - int err; - - skcipher_request_set_sync_tfm(req, tfm); - skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_SLEEP, - NULL, NULL); - - sg_init_table(&sg_out, 1); - sg_init_table(&sg_in, 1); - - if (cmd == READ) { - in_page = raw_page; - in_offs = raw_off; - out_page = loop_page; - out_offs = loop_off; - encdecfunc = crypto_skcipher_decrypt; - } else { - in_page = loop_page; - in_offs = loop_off; - out_page = raw_page; - out_offs = raw_off; - encdecfunc = crypto_skcipher_encrypt; - } - - while (size > 0) { - const int sz = min(size, LOOP_IV_SECTOR_SIZE); - u32 iv[4] = { 0, }; - iv[0] = cpu_to_le32(IV & 0xffffffff); - - sg_set_page(&sg_in, in_page, sz, in_offs); - sg_set_page(&sg_out, out_page, sz, out_offs); - - skcipher_request_set_crypt(req, &sg_in, &sg_out, sz, iv); - err = encdecfunc(req); - if (err) - goto out; - - IV++; - size -= sz; - in_offs += sz; - out_offs += sz; - } - - err = 0; - -out: - skcipher_request_zero(req); - return err; -} - -static int -cryptoloop_ioctl(struct loop_device *lo, int cmd, unsigned long arg) -{ - return -EINVAL; -} - -static int -cryptoloop_release(struct loop_device *lo) -{ - struct crypto_sync_skcipher *tfm = lo->key_data; - if (tfm != NULL) { - crypto_free_sync_skcipher(tfm); - lo->key_data = NULL; - return 0; - } - printk(KERN_ERR "cryptoloop_release(): tfm == NULL?\n"); - return -EINVAL; -} - -static struct loop_func_table cryptoloop_funcs = { - .number = LO_CRYPT_CRYPTOAPI, - .init = cryptoloop_init, - .ioctl = cryptoloop_ioctl, - .transfer = cryptoloop_transfer, - .release = cryptoloop_release, - .owner = THIS_MODULE -}; - -static int __init -init_cryptoloop(void) -{ - int rc = loop_register_transfer(&cryptoloop_funcs); - - if (rc) - printk(KERN_ERR "cryptoloop: loop_register_transfer failed\n"); - else - pr_warn("the cryptoloop driver has been deprecated and will be removed in in Linux 5.16\n"); - return rc; -} - -static void __exit -cleanup_cryptoloop(void) -{ - if (loop_unregister_transfer(LO_CRYPT_CRYPTOAPI)) - printk(KERN_ERR - "cryptoloop: loop_unregister_transfer failed\n"); -} - -module_init(init_cryptoloop); -module_exit(cleanup_cryptoloop); diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 55234a558e98..19db80a1e409 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -2794,7 +2794,9 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig goto out_idr_remove_vol; } - add_disk(disk); + err = add_disk(disk); + if (err) + goto out_cleanup_disk; /* inherit the connection state */ device->state.conn = first_connection(resource)->cstate; @@ -2808,6 +2810,8 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig drbd_debugfs_device_add(device); return NO_ERROR; +out_cleanup_disk: + blk_cleanup_disk(disk); out_idr_remove_vol: idr_remove(&connection->peer_devices, vnr); out_idr_remove_from_resource: diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 6288ce888414..3873e789478e 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -4479,6 +4479,7 @@ static const struct blk_mq_ops floppy_mq_ops = { }; static struct platform_device floppy_device[N_DRIVE]; +static bool registered[N_DRIVE]; static bool floppy_available(int drive) { @@ -4694,8 +4695,12 @@ static int __init do_floppy_init(void) if (err) goto out_remove_drives; - device_add_disk(&floppy_device[drive].dev, disks[drive][0], - NULL); + registered[drive] = true; + + err = device_add_disk(&floppy_device[drive].dev, + disks[drive][0], NULL); + if (err) + goto out_remove_drives; } return 0; @@ -4704,7 +4709,8 @@ out_remove_drives: while (drive--) { if (floppy_available(drive)) { del_gendisk(disks[drive][0]); - platform_device_unregister(&floppy_device[drive]); + if (registered[drive]) + platform_device_unregister(&floppy_device[drive]); } } out_release_dma: @@ -4947,30 +4953,14 @@ static void __exit floppy_module_exit(void) if (disks[drive][i]) del_gendisk(disks[drive][i]); } - platform_device_unregister(&floppy_device[drive]); + if (registered[drive]) + platform_device_unregister(&floppy_device[drive]); } for (i = 0; i < ARRAY_SIZE(floppy_type); i++) { if (disks[drive][i]) - blk_cleanup_queue(disks[drive][i]->queue); + blk_cleanup_disk(disks[drive][i]); } blk_mq_free_tag_set(&tag_sets[drive]); - - /* - * These disks have not called add_disk(). Don't put down - * queue reference in put_disk(). - */ - if (!(allowed_drive_mask & (1 << drive)) || - fdc_state[FDC(drive)].version == FDC_NONE) { - for (i = 0; i < ARRAY_SIZE(floppy_type); i++) { - if (disks[drive][i]) - disks[drive][i]->queue = NULL; - } - } - - for (i = 0; i < ARRAY_SIZE(floppy_type); i++) { - if (disks[drive][i]) - put_disk(disks[drive][i]); - } } cancel_delayed_work_sync(&fd_timeout); diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 7bf4686af774..f094de5f0056 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -133,58 +133,6 @@ static void loop_global_unlock(struct loop_device *lo, bool global) static int max_part; static int part_shift; -static int transfer_xor(struct loop_device *lo, int cmd, - struct page *raw_page, unsigned raw_off, - struct page *loop_page, unsigned loop_off, - int size, sector_t real_block) -{ - char *raw_buf = kmap_atomic(raw_page) + raw_off; - char *loop_buf = kmap_atomic(loop_page) + loop_off; - char *in, *out, *key; - int i, keysize; - - if (cmd == READ) { - in = raw_buf; - out = loop_buf; - } else { - in = loop_buf; - out = raw_buf; - } - - key = lo->lo_encrypt_key; - keysize = lo->lo_encrypt_key_size; - for (i = 0; i < size; i++) - *out++ = *in++ ^ key[(i & 511) % keysize]; - - kunmap_atomic(loop_buf); - kunmap_atomic(raw_buf); - cond_resched(); - return 0; -} - -static int xor_init(struct loop_device *lo, const struct loop_info64 *info) -{ - if (unlikely(info->lo_encrypt_key_size <= 0)) - return -EINVAL; - return 0; -} - -static struct loop_func_table none_funcs = { - .number = LO_CRYPT_NONE, -}; - -static struct loop_func_table xor_funcs = { - .number = LO_CRYPT_XOR, - .transfer = transfer_xor, - .init = xor_init -}; - -/* xfer_funcs[0] is special - its release function is never called */ -static struct loop_func_table *xfer_funcs[MAX_LO_CRYPT] = { - &none_funcs, - &xor_funcs -}; - static loff_t get_size(loff_t offset, loff_t sizelimit, struct file *file) { loff_t loopsize; @@ -228,8 +176,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio) /* * We support direct I/O only if lo_offset is aligned with the * logical I/O size of backing device, and the logical block - * size of loop is bigger than the backing device's and the loop - * needn't transform transfer. + * size of loop is bigger than the backing device's. * * TODO: the above condition may be loosed in the future, and * direct I/O may be switched runtime at that time because most @@ -238,8 +185,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio) if (dio) { if (queue_logical_block_size(lo->lo_queue) >= sb_bsize && !(lo->lo_offset & dio_align) && - mapping->a_ops->direct_IO && - !lo->transfer) + mapping->a_ops->direct_IO) use_dio = true; else use_dio = false; @@ -273,19 +219,6 @@ static void __loop_update_dio(struct loop_device *lo, bool dio) } /** - * loop_validate_block_size() - validates the passed in block size - * @bsize: size to validate - */ -static int -loop_validate_block_size(unsigned short bsize) -{ - if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize)) - return -EINVAL; - - return 0; -} - -/** * loop_set_size() - sets device size and notifies userspace * @lo: struct loop_device to set the size for * @size: new size of the loop device @@ -299,24 +232,6 @@ static void loop_set_size(struct loop_device *lo, loff_t size) kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE); } -static inline int -lo_do_transfer(struct loop_device *lo, int cmd, - struct page *rpage, unsigned roffs, - struct page *lpage, unsigned loffs, - int size, sector_t rblock) -{ - int ret; - - ret = lo->transfer(lo, cmd, rpage, roffs, lpage, loffs, size, rblock); - if (likely(!ret)) - return 0; - - printk_ratelimited(KERN_ERR - "loop: Transfer error at byte offset %llu, length %i.\n", - (unsigned long long)rblock << 9, size); - return ret; -} - static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos) { struct iov_iter i; @@ -356,41 +271,6 @@ static int lo_write_simple(struct loop_device *lo, struct request *rq, return ret; } -/* - * This is the slow, transforming version that needs to double buffer the - * data as it cannot do the transformations in place without having direct - * access to the destination pages of the backing file. - */ -static int lo_write_transfer(struct loop_device *lo, struct request *rq, - loff_t pos) -{ - struct bio_vec bvec, b; - struct req_iterator iter; - struct page *page; - int ret = 0; - - page = alloc_page(GFP_NOIO); - if (unlikely(!page)) - return -ENOMEM; - - rq_for_each_segment(bvec, rq, iter) { - ret = lo_do_transfer(lo, WRITE, page, 0, bvec.bv_page, - bvec.bv_offset, bvec.bv_len, pos >> 9); - if (unlikely(ret)) - break; - - b.bv_page = page; - b.bv_offset = 0; - b.bv_len = bvec.bv_len; - ret = lo_write_bvec(lo->lo_backing_file, &b, &pos); - if (ret < 0) - break; - } - - __free_page(page); - return ret; -} - static int lo_read_simple(struct loop_device *lo, struct request *rq, loff_t pos) { @@ -420,64 +300,12 @@ static int lo_read_simple(struct loop_device *lo, struct request *rq, return 0; } -static int lo_read_transfer(struct loop_device *lo, struct request *rq, - loff_t pos) -{ - struct bio_vec bvec, b; - struct req_iterator iter; - struct iov_iter i; - struct page *page; - ssize_t len; - int ret = 0; - - page = alloc_page(GFP_NOIO); - if (unlikely(!page)) - return -ENOMEM; - - rq_for_each_segment(bvec, rq, iter) { - loff_t offset = pos; - - b.bv_page = page; - b.bv_offset = 0; - b.bv_len = bvec.bv_len; - - iov_iter_bvec(&i, READ, &b, 1, b.bv_len); - len = vfs_iter_read(lo->lo_backing_file, &i, &pos, 0); - if (len < 0) { - ret = len; - goto out_free_page; - } - - ret = lo_do_transfer(lo, READ, page, 0, bvec.bv_page, - bvec.bv_offset, len, offset >> 9); - if (ret) - goto out_free_page; - - flush_dcache_page(bvec.bv_page); - - if (len != bvec.bv_len) { - struct bio *bio; - - __rq_for_each_bio(bio, rq) - zero_fill_bio(bio); - break; - } - } - - ret = 0; -out_free_page: - __free_page(page); - return ret; -} - static int lo_fallocate(struct loop_device *lo, struct request *rq, loff_t pos, int mode) { /* * We use fallocate to manipulate the space mappings used by the image - * a.k.a. discard/zerorange. However we do not support this if - * encryption is enabled, because it may give an attacker useful - * information. + * a.k.a. discard/zerorange. */ struct file *file = lo->lo_backing_file; struct request_queue *q = lo->lo_queue; @@ -660,16 +488,12 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq) case REQ_OP_DISCARD: return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE); case REQ_OP_WRITE: - if (lo->transfer) - return lo_write_transfer(lo, rq, pos); - else if (cmd->use_aio) + if (cmd->use_aio) return lo_rw_aio(lo, cmd, pos, WRITE); else return lo_write_simple(lo, rq, pos); case REQ_OP_READ: - if (lo->transfer) - return lo_read_transfer(lo, rq, pos); - else if (cmd->use_aio) + if (cmd->use_aio) return lo_rw_aio(lo, cmd, pos, READ); else return lo_read_simple(lo, rq, pos); @@ -934,7 +758,7 @@ static void loop_config_discard(struct loop_device *lo) * not blkdev_issue_discard(). This maintains consistent behavior with * file-backed loop devices: discarded regions read back as zero. */ - if (S_ISBLK(inode->i_mode) && !lo->lo_encrypt_key_size) { + if (S_ISBLK(inode->i_mode)) { struct request_queue *backingq = bdev_get_queue(I_BDEV(inode)); max_discard_sectors = backingq->limits.max_write_zeroes_sectors; @@ -943,11 +767,9 @@ static void loop_config_discard(struct loop_device *lo) /* * We use punch hole to reclaim the free space used by the - * image a.k.a. discard. However we do not support discard if - * encryption is enabled, because it may give an attacker - * useful information. + * image a.k.a. discard. */ - } else if (!file->f_op->fallocate || lo->lo_encrypt_key_size) { + } else if (!file->f_op->fallocate) { max_discard_sectors = 0; granularity = 0; @@ -1084,43 +906,6 @@ static void loop_update_rotational(struct loop_device *lo) blk_queue_flag_clear(QUEUE_FLAG_NONROT, q); } -static int -loop_release_xfer(struct loop_device *lo) -{ - int err = 0; - struct loop_func_table *xfer = lo->lo_encryption; - - if (xfer) { - if (xfer->release) - err = xfer->release(lo); - lo->transfer = NULL; - lo->lo_encryption = NULL; - module_put(xfer->owner); - } - return err; -} - -static int -loop_init_xfer(struct loop_device *lo, struct loop_func_table *xfer, - const struct loop_info64 *i) -{ - int err = 0; - - if (xfer) { - struct module *owner = xfer->owner; - - if (!try_module_get(owner)) - return -EINVAL; - if (xfer->init) - err = xfer->init(lo, i); - if (err) - module_put(owner); - else - lo->lo_encryption = xfer; - } - return err; -} - /** * loop_set_status_from_info - configure device from loop_info * @lo: struct loop_device to configure @@ -1133,55 +918,27 @@ static int loop_set_status_from_info(struct loop_device *lo, const struct loop_info64 *info) { - int err; - struct loop_func_table *xfer; - kuid_t uid = current_uid(); - if ((unsigned int) info->lo_encrypt_key_size > LO_KEY_SIZE) return -EINVAL; - err = loop_release_xfer(lo); - if (err) - return err; - - if (info->lo_encrypt_type) { - unsigned int type = info->lo_encrypt_type; - - if (type >= MAX_LO_CRYPT) - return -EINVAL; - xfer = xfer_funcs[type]; - if (xfer == NULL) - return -EINVAL; - } else - xfer = NULL; - - err = loop_init_xfer(lo, xfer, info); - if (err) - return err; + switch (info->lo_encrypt_type) { + case LO_CRYPT_NONE: + break; + case LO_CRYPT_XOR: + pr_warn("support for the xor transformation has been removed.\n"); + return -EINVAL; + case LO_CRYPT_CRYPTOAPI: + pr_warn("support for cryptoloop has been removed. Use dm-crypt instead.\n"); + return -EINVAL; + default: + return -EINVAL; + } lo->lo_offset = info->lo_offset; lo->lo_sizelimit = info->lo_sizelimit; memcpy(lo->lo_file_name, info->lo_file_name, LO_NAME_SIZE); - memcpy(lo->lo_crypt_name, info->lo_crypt_name, LO_NAME_SIZE); lo->lo_file_name[LO_NAME_SIZE-1] = 0; - lo->lo_crypt_name[LO_NAME_SIZE-1] = 0; - - if (!xfer) - xfer = &none_funcs; - lo->transfer = xfer->transfer; - lo->ioctl = xfer->ioctl; - lo->lo_flags = info->lo_flags; - - lo->lo_encrypt_key_size = info->lo_encrypt_key_size; - lo->lo_init[0] = info->lo_init[0]; - lo->lo_init[1] = info->lo_init[1]; - if (info->lo_encrypt_key_size) { - memcpy(lo->lo_encrypt_key, info->lo_encrypt_key, - info->lo_encrypt_key_size); - lo->lo_key_owner = uid; - } - return 0; } @@ -1236,7 +993,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode, } if (config->block_size) { - error = loop_validate_block_size(config->block_size); + error = blk_validate_block_size(config->block_size); if (error) goto out_unlock; } @@ -1329,7 +1086,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) { struct file *filp = NULL; gfp_t gfp = lo->old_gfp_mask; - struct block_device *bdev = lo->lo_device; int err = 0; bool partscan = false; int lo_number; @@ -1381,36 +1137,23 @@ static int __loop_clr_fd(struct loop_device *lo, bool release) lo->lo_backing_file = NULL; spin_unlock_irq(&lo->lo_lock); - loop_release_xfer(lo); - lo->transfer = NULL; - lo->ioctl = NULL; lo->lo_device = NULL; - lo->lo_encryption = NULL; lo->lo_offset = 0; lo->lo_sizelimit = 0; - lo->lo_encrypt_key_size = 0; - memset(lo->lo_encrypt_key, 0, LO_KEY_SIZE); - memset(lo->lo_crypt_name, 0, LO_NAME_SIZE); memset(lo->lo_file_name, 0, LO_NAME_SIZE); blk_queue_logical_block_size(lo->lo_queue, 512); blk_queue_physical_block_size(lo->lo_queue, 512); blk_queue_io_min(lo->lo_queue, 512); - if (bdev) { - invalidate_bdev(bdev); - bdev->bd_inode->i_mapping->wb_err = 0; - } - set_capacity(lo->lo_disk, 0); + invalidate_disk(lo->lo_disk); loop_sysfs_exit(lo); - if (bdev) { - /* let user-space know about this change */ - kobject_uevent(&disk_to_dev(bdev->bd_disk)->kobj, KOBJ_CHANGE); - } + /* let user-space know about this change */ + kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE); mapping_set_gfp_mask(filp->f_mapping, gfp); /* This is safe: open() is still holding a reference. */ module_put(THIS_MODULE); blk_mq_unfreeze_queue(lo->lo_queue); - partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev; + partscan = lo->lo_flags & LO_FLAGS_PARTSCAN; lo_number = lo->lo_number; disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE); out_unlock: @@ -1498,7 +1241,6 @@ static int loop_set_status(struct loop_device *lo, const struct loop_info64 *info) { int err; - kuid_t uid = current_uid(); int prev_lo_flags; bool partscan = false; bool size_changed = false; @@ -1506,12 +1248,6 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info) err = mutex_lock_killable(&lo->lo_mutex); if (err) return err; - if (lo->lo_encrypt_key_size && - !uid_eq(lo->lo_key_owner, uid) && - !capable(CAP_SYS_ADMIN)) { - err = -EPERM; - goto out_unlock; - } if (lo->lo_state != Lo_bound) { err = -ENXIO; goto out_unlock; @@ -1597,14 +1333,6 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info) info->lo_sizelimit = lo->lo_sizelimit; info->lo_flags = lo->lo_flags; memcpy(info->lo_file_name, lo->lo_file_name, LO_NAME_SIZE); - memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE); - info->lo_encrypt_type = - lo->lo_encryption ? lo->lo_encryption->number : 0; - if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) { - info->lo_encrypt_key_size = lo->lo_encrypt_key_size; - memcpy(info->lo_encrypt_key, lo->lo_encrypt_key, - lo->lo_encrypt_key_size); - } /* Drop lo_mutex while we call into the filesystem. */ path = lo->lo_backing_file->f_path; @@ -1630,16 +1358,8 @@ loop_info64_from_old(const struct loop_info *info, struct loop_info64 *info64) info64->lo_rdevice = info->lo_rdevice; info64->lo_offset = info->lo_offset; info64->lo_sizelimit = 0; - info64->lo_encrypt_type = info->lo_encrypt_type; - info64->lo_encrypt_key_size = info->lo_encrypt_key_size; info64->lo_flags = info->lo_flags; - info64->lo_init[0] = info->lo_init[0]; - info64->lo_init[1] = info->lo_init[1]; - if (info->lo_encrypt_type == LO_CRYPT_CRYPTOAPI) - memcpy(info64->lo_crypt_name, info->lo_name, LO_NAME_SIZE); - else - memcpy(info64->lo_file_name, info->lo_name, LO_NAME_SIZE); - memcpy(info64->lo_encrypt_key, info->lo_encrypt_key, LO_KEY_SIZE); + memcpy(info64->lo_file_name, info->lo_name, LO_NAME_SIZE); } static int @@ -1651,16 +1371,8 @@ loop_info64_to_old(const struct loop_info64 *info64, struct loop_info *info) info->lo_inode = info64->lo_inode; info->lo_rdevice = info64->lo_rdevice; info->lo_offset = info64->lo_offset; - info->lo_encrypt_type = info64->lo_encrypt_type; - info->lo_encrypt_key_size = info64->lo_encrypt_key_size; info->lo_flags = info64->lo_flags; - info->lo_init[0] = info64->lo_init[0]; - info->lo_init[1] = info64->lo_init[1]; - if (info->lo_encrypt_type == LO_CRYPT_CRYPTOAPI) - memcpy(info->lo_name, info64->lo_crypt_name, LO_NAME_SIZE); - else - memcpy(info->lo_name, info64->lo_file_name, LO_NAME_SIZE); - memcpy(info->lo_encrypt_key, info64->lo_encrypt_key, LO_KEY_SIZE); + memcpy(info->lo_name, info64->lo_file_name, LO_NAME_SIZE); /* error in case values were truncated */ if (info->lo_device != info64->lo_device || @@ -1759,7 +1471,7 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg) if (lo->lo_state != Lo_bound) return -ENXIO; - err = loop_validate_block_size(arg); + err = blk_validate_block_size(arg); if (err) return err; @@ -1809,7 +1521,7 @@ static int lo_simple_ioctl(struct loop_device *lo, unsigned int cmd, err = loop_set_block_size(lo, arg); break; default: - err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL; + err = -EINVAL; } mutex_unlock(&lo->lo_mutex); return err; @@ -1885,7 +1597,6 @@ struct compat_loop_info { compat_ulong_t lo_inode; /* ioctl r/o */ compat_dev_t lo_rdevice; /* ioctl r/o */ compat_int_t lo_offset; - compat_int_t lo_encrypt_type; compat_int_t lo_encrypt_key_size; /* ioctl w/o */ compat_int_t lo_flags; /* ioctl r/o */ char lo_name[LO_NAME_SIZE]; @@ -1914,16 +1625,8 @@ loop_info64_from_compat(const struct compat_loop_info __user *arg, info64->lo_rdevice = info.lo_rdevice; info64->lo_offset = info.lo_offset; info64->lo_sizelimit = 0; - info64->lo_encrypt_type = info.lo_encrypt_type; - info64->lo_encrypt_key_size = info.lo_encrypt_key_size; info64->lo_flags = info.lo_flags; - info64->lo_init[0] = info.lo_init[0]; - info64->lo_init[1] = info.lo_init[1]; - if (info.lo_encrypt_type == LO_CRYPT_CRYPTOAPI) - memcpy(info64->lo_crypt_name, info.lo_name, LO_NAME_SIZE); - else - memcpy(info64->lo_file_name, info.lo_name, LO_NAME_SIZE); - memcpy(info64->lo_encrypt_key, info.lo_encrypt_key, LO_KEY_SIZE); + memcpy(info64->lo_file_name, info.lo_name, LO_NAME_SIZE); return 0; } @@ -1943,24 +1646,14 @@ loop_info64_to_compat(const struct loop_info64 *info64, info.lo_inode = info64->lo_inode; info.lo_rdevice = info64->lo_rdevice; info.lo_offset = info64->lo_offset; - info.lo_encrypt_type = info64->lo_encrypt_type; - info.lo_encrypt_key_size = info64->lo_encrypt_key_size; info.lo_flags = info64->lo_flags; - info.lo_init[0] = info64->lo_init[0]; - info.lo_init[1] = info64->lo_init[1]; - if (info.lo_encrypt_type == LO_CRYPT_CRYPTOAPI) - memcpy(info.lo_name, info64->lo_crypt_name, LO_NAME_SIZE); - else - memcpy(info.lo_name, info64->lo_file_name, LO_NAME_SIZE); - memcpy(info.lo_encrypt_key, info64->lo_encrypt_key, LO_KEY_SIZE); + memcpy(info.lo_name, info64->lo_file_name, LO_NAME_SIZE); /* error in case values were truncated */ if (info.lo_device != info64->lo_device || info.lo_rdevice != info64->lo_rdevice || info.lo_inode != info64->lo_inode || - info.lo_offset != info64->lo_offset || - info.lo_init[0] != info64->lo_init[0] || - info.lo_init[1] != info64->lo_init[1]) + info.lo_offset != info64->lo_offset) return -EOVERFLOW; if (copy_to_user(arg, &info, sizeof(info))) @@ -2101,43 +1794,6 @@ MODULE_PARM_DESC(max_part, "Maximum number of partitions per loop device"); MODULE_LICENSE("GPL"); MODULE_ALIAS_BLOCKDEV_MAJOR(LOOP_MAJOR); -int loop_register_transfer(struct loop_func_table *funcs) -{ - unsigned int n = funcs->number; - - if (n >= MAX_LO_CRYPT || xfer_funcs[n]) - return -EINVAL; - xfer_funcs[n] = funcs; - return 0; -} - -int loop_unregister_transfer(int number) -{ - unsigned int n = number; - struct loop_func_table *xfer; - - if (n == 0 || n >= MAX_LO_CRYPT || (xfer = xfer_funcs[n]) == NULL) - return -EINVAL; - /* - * This function is called from only cleanup_cryptoloop(). - * Given that each loop device that has a transfer enabled holds a - * reference to the module implementing it we should never get here - * with a transfer that is set (unless forced module unloading is - * requested). Thus, check module's refcount and warn if this is - * not a clean unloading. - */ -#ifdef CONFIG_MODULE_UNLOAD - if (xfer->owner && module_refcount(xfer->owner) != -1) - pr_err("Danger! Unregistering an in use transfer function.\n"); -#endif - - xfer_funcs[n] = NULL; - return 0; -} - -EXPORT_SYMBOL(loop_register_transfer); -EXPORT_SYMBOL(loop_unregister_transfer); - static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd) { @@ -2394,13 +2050,19 @@ static int loop_add(int i) disk->event_flags = DISK_EVENT_FLAG_UEVENT; sprintf(disk->disk_name, "loop%d", i); /* Make this loop device reachable from pathname. */ - add_disk(disk); + err = add_disk(disk); + if (err) + goto out_cleanup_disk; + /* Show this loop device. */ mutex_lock(&loop_ctl_mutex); lo->idr_visible = true; mutex_unlock(&loop_ctl_mutex); + return i; +out_cleanup_disk: + blk_cleanup_disk(disk); out_cleanup_tags: blk_mq_free_tag_set(&lo->tag_set); out_free_idr: diff --git a/drivers/block/loop.h b/drivers/block/loop.h index 04c88dd6eabd..082d4b6bfc6a 100644 --- a/drivers/block/loop.h +++ b/drivers/block/loop.h @@ -32,23 +32,10 @@ struct loop_device { loff_t lo_offset; loff_t lo_sizelimit; int lo_flags; - int (*transfer)(struct loop_device *, int cmd, - struct page *raw_page, unsigned raw_off, - struct page *loop_page, unsigned loop_off, - int size, sector_t real_block); char lo_file_name[LO_NAME_SIZE]; - char lo_crypt_name[LO_NAME_SIZE]; - char lo_encrypt_key[LO_KEY_SIZE]; - int lo_encrypt_key_size; - struct loop_func_table *lo_encryption; - __u32 lo_init[2]; - kuid_t lo_key_owner; /* Who set the key */ - int (*ioctl)(struct loop_device *, int cmd, - unsigned long arg); struct file * lo_backing_file; struct block_device *lo_device; - void *key_data; gfp_t old_gfp_mask; @@ -82,21 +69,4 @@ struct loop_cmd { struct cgroup_subsys_state *memcg_css; }; -/* Support for loadable transfer modules */ -struct loop_func_table { - int number; /* filter type */ - int (*transfer)(struct loop_device *lo, int cmd, - struct page *raw_page, unsigned raw_off, - struct page *loop_page, unsigned loop_off, - int size, sector_t real_block); - int (*init)(struct loop_device *, const struct loop_info64 *); - /* release is called from loop_unregister_transfer or clr_fd */ - int (*release)(struct loop_device *); - int (*ioctl)(struct loop_device *, int cmd, unsigned long arg); - struct module *owner; -}; - -int loop_register_transfer(struct loop_func_table *funcs); -int loop_unregister_transfer(int number); - #endif diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c index 901855717cb5..c91b9010c1a6 100644 --- a/drivers/block/mtip32xx/mtip32xx.c +++ b/drivers/block/mtip32xx/mtip32xx.c @@ -3633,7 +3633,9 @@ skip_create_disk: set_capacity(dd->disk, capacity); /* Enable the block device and add it to /dev */ - device_add_disk(&dd->pdev->dev, dd->disk, mtip_disk_attr_groups); + rv = device_add_disk(&dd->pdev->dev, dd->disk, mtip_disk_attr_groups); + if (rv) + goto read_capacity_error; if (dd->mtip_svc_handler) { set_bit(MTIP_DDF_INIT_DONE_BIT, &dd->dd_flag); @@ -4061,7 +4063,6 @@ block_initialize_err: msi_initialize_err: if (dd->isr_workq) { - flush_workqueue(dd->isr_workq); destroy_workqueue(dd->isr_workq); drop_cpu(dd->work[0].cpu_binding); drop_cpu(dd->work[1].cpu_binding); @@ -4119,7 +4120,6 @@ static void mtip_pci_remove(struct pci_dev *pdev) mtip_block_remove(dd); if (dd->isr_workq) { - flush_workqueue(dd->isr_workq); destroy_workqueue(dd->isr_workq); drop_cpu(dd->work[0].cpu_binding); drop_cpu(dd->work[1].cpu_binding); diff --git a/drivers/block/n64cart.c b/drivers/block/n64cart.c index b168ca25b6c9..78282f01f581 100644 --- a/drivers/block/n64cart.c +++ b/drivers/block/n64cart.c @@ -115,6 +115,7 @@ static const struct block_device_operations n64cart_fops = { static int __init n64cart_probe(struct platform_device *pdev) { struct gendisk *disk; + int err = -ENOMEM; if (!start || !size) { pr_err("start or size not specified\n"); @@ -132,7 +133,7 @@ static int __init n64cart_probe(struct platform_device *pdev) disk = blk_alloc_disk(NUMA_NO_NODE); if (!disk) - return -ENOMEM; + goto out; disk->first_minor = 0; disk->flags = GENHD_FL_NO_PART_SCAN; @@ -147,11 +148,18 @@ static int __init n64cart_probe(struct platform_device *pdev) blk_queue_physical_block_size(disk->queue, 4096); blk_queue_logical_block_size(disk->queue, 4096); - add_disk(disk); + err = add_disk(disk); + if (err) + goto out_cleanup_disk; pr_info("n64cart: %u kb disk\n", size / 1024); return 0; + +out_cleanup_disk: + blk_cleanup_disk(disk); +out: + return err; } static struct platform_driver n64cart_driver = { diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 1183f7872b71..b47b2a87ae8f 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -122,15 +122,21 @@ struct nbd_device { struct work_struct remove_work; struct list_head list; - struct task_struct *task_recv; struct task_struct *task_setup; unsigned long flags; + pid_t pid; /* pid of nbd-client, if attached */ char *backend; }; #define NBD_CMD_REQUEUED 1 +/* + * This flag will be set if nbd_queue_rq() succeed, and will be checked and + * cleared in completion. Both setting and clearing of the flag are protected + * by cmd->lock. + */ +#define NBD_CMD_INFLIGHT 2 struct nbd_cmd { struct nbd_device *nbd; @@ -217,7 +223,7 @@ static ssize_t pid_show(struct device *dev, struct gendisk *disk = dev_to_disk(dev); struct nbd_device *nbd = (struct nbd_device *)disk->private_data; - return sprintf(buf, "%d\n", task_pid_nr(nbd->task_recv)); + return sprintf(buf, "%d\n", nbd->pid); } static const struct device_attribute pid_attr = { @@ -310,26 +316,19 @@ static void nbd_mark_nsock_dead(struct nbd_device *nbd, struct nbd_sock *nsock, nsock->sent = 0; } -static void nbd_size_clear(struct nbd_device *nbd) -{ - if (nbd->config->bytesize) { - set_capacity(nbd->disk, 0); - kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE); - } -} - static int nbd_set_size(struct nbd_device *nbd, loff_t bytesize, loff_t blksize) { if (!blksize) blksize = 1u << NBD_DEF_BLKSIZE_BITS; - if (blksize < 512 || blksize > PAGE_SIZE || !is_power_of_2(blksize)) + + if (blk_validate_block_size(blksize)) return -EINVAL; nbd->config->bytesize = bytesize; nbd->config->blksize_bits = __ffs(blksize); - if (!nbd->task_recv) + if (!nbd->pid) return 0; if (nbd->config->flags & NBD_FLAG_SEND_TRIM) { @@ -405,6 +404,11 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, if (!mutex_trylock(&cmd->lock)) return BLK_EH_RESET_TIMER; + if (!__test_and_clear_bit(NBD_CMD_INFLIGHT, &cmd->flags)) { + mutex_unlock(&cmd->lock); + return BLK_EH_DONE; + } + if (!refcount_inc_not_zero(&nbd->config_refs)) { cmd->status = BLK_STS_TIMEOUT; mutex_unlock(&cmd->lock); @@ -484,7 +488,8 @@ done: } /* - * Send or receive packet. + * Send or receive packet. Return a positive value on success and + * negtive value on failue, and never return 0. */ static int sock_xmit(struct nbd_device *nbd, int index, int send, struct iov_iter *iter, int msg_flags, int *sent) @@ -610,7 +615,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index) result = sock_xmit(nbd, index, 1, &from, (type == NBD_CMD_WRITE) ? MSG_MORE : 0, &sent); trace_nbd_header_sent(req, handle); - if (result <= 0) { + if (result < 0) { if (was_interrupted(result)) { /* If we havne't sent anything we can just return BUSY, * however if we have sent something we need to make @@ -654,7 +659,7 @@ send_pages: skip = 0; } result = sock_xmit(nbd, index, 1, &from, flags, &sent); - if (result <= 0) { + if (result < 0) { if (was_interrupted(result)) { /* We've already sent the header, we * have no choice but to set pending and @@ -688,38 +693,45 @@ out: return 0; } -/* NULL returned = something went wrong, inform userspace */ -static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index) +static int nbd_read_reply(struct nbd_device *nbd, int index, + struct nbd_reply *reply) { - struct nbd_config *config = nbd->config; - int result; - struct nbd_reply reply; - struct nbd_cmd *cmd; - struct request *req = NULL; - u64 handle; - u16 hwq; - u32 tag; - struct kvec iov = {.iov_base = &reply, .iov_len = sizeof(reply)}; + struct kvec iov = {.iov_base = reply, .iov_len = sizeof(*reply)}; struct iov_iter to; - int ret = 0; + int result; - reply.magic = 0; - iov_iter_kvec(&to, READ, &iov, 1, sizeof(reply)); + reply->magic = 0; + iov_iter_kvec(&to, READ, &iov, 1, sizeof(*reply)); result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL, NULL); - if (result <= 0) { - if (!nbd_disconnected(config)) + if (result < 0) { + if (!nbd_disconnected(nbd->config)) dev_err(disk_to_dev(nbd->disk), "Receive control failed (result %d)\n", result); - return ERR_PTR(result); + return result; } - if (ntohl(reply.magic) != NBD_REPLY_MAGIC) { + if (ntohl(reply->magic) != NBD_REPLY_MAGIC) { dev_err(disk_to_dev(nbd->disk), "Wrong magic (0x%lx)\n", - (unsigned long)ntohl(reply.magic)); - return ERR_PTR(-EPROTO); + (unsigned long)ntohl(reply->magic)); + return -EPROTO; } - memcpy(&handle, reply.handle, sizeof(handle)); + return 0; +} + +/* NULL returned = something went wrong, inform userspace */ +static struct nbd_cmd *nbd_handle_reply(struct nbd_device *nbd, int index, + struct nbd_reply *reply) +{ + int result; + struct nbd_cmd *cmd; + struct request *req = NULL; + u64 handle; + u16 hwq; + u32 tag; + int ret = 0; + + memcpy(&handle, reply->handle, sizeof(handle)); tag = nbd_handle_to_tag(handle); hwq = blk_mq_unique_tag_to_hwq(tag); if (hwq < nbd->tag_set.nr_hw_queues) @@ -734,6 +746,16 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index) cmd = blk_mq_rq_to_pdu(req); mutex_lock(&cmd->lock); + if (!__test_and_clear_bit(NBD_CMD_INFLIGHT, &cmd->flags)) { + dev_err(disk_to_dev(nbd->disk), "Suspicious reply %d (status %u flags %lu)", + tag, cmd->status, cmd->flags); + ret = -ENOENT; + goto out; + } + if (cmd->index != index) { + dev_err(disk_to_dev(nbd->disk), "Unexpected reply %d from different sock %d (expected %d)", + tag, index, cmd->index); + } if (cmd->cmd_cookie != nbd_handle_to_cookie(handle)) { dev_err(disk_to_dev(nbd->disk), "Double reply on req %p, cmd_cookie %u, handle cookie %u\n", req, cmd->cmd_cookie, nbd_handle_to_cookie(handle)); @@ -752,9 +774,9 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index) ret = -ENOENT; goto out; } - if (ntohl(reply.error)) { + if (ntohl(reply->error)) { dev_err(disk_to_dev(nbd->disk), "Other side returned error (%d)\n", - ntohl(reply.error)); + ntohl(reply->error)); cmd->status = BLK_STS_IOERR; goto out; } @@ -763,11 +785,12 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index) if (rq_data_dir(req) != WRITE) { struct req_iterator iter; struct bio_vec bvec; + struct iov_iter to; rq_for_each_segment(bvec, req, iter) { iov_iter_bvec(&to, READ, &bvec, 1, bvec.bv_len); result = sock_xmit(nbd, index, 0, &to, MSG_WAITALL, NULL); - if (result <= 0) { + if (result < 0) { dev_err(disk_to_dev(nbd->disk), "Receive data failed (result %d)\n", result); /* @@ -776,7 +799,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index) * and let the timeout stuff handle resubmitting * this request onto another connection. */ - if (nbd_disconnected(config)) { + if (nbd_disconnected(nbd->config)) { cmd->status = BLK_STS_IOERR; goto out; } @@ -800,24 +823,46 @@ static void recv_work(struct work_struct *work) work); struct nbd_device *nbd = args->nbd; struct nbd_config *config = nbd->config; + struct request_queue *q = nbd->disk->queue; + struct nbd_sock *nsock; struct nbd_cmd *cmd; struct request *rq; while (1) { - cmd = nbd_read_stat(nbd, args->index); - if (IS_ERR(cmd)) { - struct nbd_sock *nsock = config->socks[args->index]; + struct nbd_reply reply; - mutex_lock(&nsock->tx_lock); - nbd_mark_nsock_dead(nbd, nsock, 1); - mutex_unlock(&nsock->tx_lock); + if (nbd_read_reply(nbd, args->index, &reply)) + break; + + /* + * Grab .q_usage_counter so request pool won't go away, then no + * request use-after-free is possible during nbd_handle_reply(). + * If queue is frozen, there won't be any inflight requests, we + * needn't to handle the incoming garbage message. + */ + if (!percpu_ref_tryget(&q->q_usage_counter)) { + dev_err(disk_to_dev(nbd->disk), "%s: no io inflight\n", + __func__); + break; + } + + cmd = nbd_handle_reply(nbd, args->index, &reply); + if (IS_ERR(cmd)) { + percpu_ref_put(&q->q_usage_counter); break; } rq = blk_mq_rq_from_pdu(cmd); if (likely(!blk_should_fake_timeout(rq->q))) blk_mq_complete_request(rq); + percpu_ref_put(&q->q_usage_counter); } + + nsock = config->socks[args->index]; + mutex_lock(&nsock->tx_lock); + nbd_mark_nsock_dead(nbd, nsock, 1); + mutex_unlock(&nsock->tx_lock); + nbd_config_put(nbd); atomic_dec(&config->recv_threads); wake_up(&config->recv_wq); @@ -833,6 +878,10 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved) return true; mutex_lock(&cmd->lock); + if (!__test_and_clear_bit(NBD_CMD_INFLIGHT, &cmd->flags)) { + mutex_unlock(&cmd->lock); + return true; + } cmd->status = BLK_STS_IOERR; mutex_unlock(&cmd->lock); @@ -914,7 +963,6 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index) if (!refcount_inc_not_zero(&nbd->config_refs)) { dev_err_ratelimited(disk_to_dev(nbd->disk), "Socks array is empty\n"); - blk_mq_start_request(req); return -EINVAL; } config = nbd->config; @@ -923,7 +971,6 @@ static int nbd_handle_cmd(struct nbd_cmd *cmd, int index) dev_err_ratelimited(disk_to_dev(nbd->disk), "Attempted send on invalid socket\n"); nbd_config_put(nbd); - blk_mq_start_request(req); return -EINVAL; } cmd->status = BLK_STS_OK; @@ -947,7 +994,6 @@ again: */ sock_shutdown(nbd); nbd_config_put(nbd); - blk_mq_start_request(req); return -EIO; } goto again; @@ -969,7 +1015,13 @@ again: * returns EAGAIN can be retried on a different socket. */ ret = nbd_send_cmd(nbd, cmd, index); - if (ret == -EAGAIN) { + /* + * Access to this flag is protected by cmd->lock, thus it's safe to set + * the flag after nbd_send_cmd() succeed to send request to server. + */ + if (!ret) + __set_bit(NBD_CMD_INFLIGHT, &cmd->flags); + else if (ret == -EAGAIN) { dev_err_ratelimited(disk_to_dev(nbd->disk), "Request send failed, requeueing\n"); nbd_mark_nsock_dead(nbd, nsock, 1); @@ -1206,7 +1258,7 @@ static void send_disconnects(struct nbd_device *nbd) iov_iter_kvec(&from, WRITE, &iov, 1, sizeof(request)); mutex_lock(&nsock->tx_lock); ret = sock_xmit(nbd, i, 1, &from, 0, NULL); - if (ret <= 0) + if (ret < 0) dev_err(disk_to_dev(nbd->disk), "Send disconnect failed %d\n", ret); mutex_unlock(&nsock->tx_lock); @@ -1237,11 +1289,13 @@ static void nbd_config_put(struct nbd_device *nbd) &nbd->config_lock)) { struct nbd_config *config = nbd->config; nbd_dev_dbg_close(nbd); - nbd_size_clear(nbd); + invalidate_disk(nbd->disk); + if (nbd->config->bytesize) + kobject_uevent(&nbd_to_dev(nbd)->kobj, KOBJ_CHANGE); if (test_and_clear_bit(NBD_RT_HAS_PID_FILE, &config->runtime_flags)) device_remove_file(disk_to_dev(nbd->disk), &pid_attr); - nbd->task_recv = NULL; + nbd->pid = 0; if (test_and_clear_bit(NBD_RT_HAS_BACKEND_FILE, &config->runtime_flags)) { device_remove_file(disk_to_dev(nbd->disk), &backend_attr); @@ -1282,7 +1336,7 @@ static int nbd_start_device(struct nbd_device *nbd) int num_connections = config->num_connections; int error = 0, i; - if (nbd->task_recv) + if (nbd->pid) return -EBUSY; if (!config->socks) return -EINVAL; @@ -1301,7 +1355,7 @@ static int nbd_start_device(struct nbd_device *nbd) } blk_mq_update_nr_hw_queues(&nbd->tag_set, config->num_connections); - nbd->task_recv = current; + nbd->pid = task_pid_nr(current); nbd_parse_flags(nbd); @@ -1557,8 +1611,8 @@ static int nbd_dbg_tasks_show(struct seq_file *s, void *unused) { struct nbd_device *nbd = s->private; - if (nbd->task_recv) - seq_printf(s, "recv: %d\n", task_pid_nr(nbd->task_recv)); + if (nbd->pid) + seq_printf(s, "recv: %d\n", nbd->pid); return 0; } @@ -1762,7 +1816,9 @@ static struct nbd_device *nbd_dev_add(int index, unsigned int refs) disk->fops = &nbd_fops; disk->private_data = nbd; sprintf(disk->disk_name, "nbd%d", index); - add_disk(disk); + err = add_disk(disk); + if (err) + goto out_err_disk; /* * Now publish the device. @@ -1771,6 +1827,8 @@ static struct nbd_device *nbd_dev_add(int index, unsigned int refs) nbd_total_devices++; return nbd; +out_err_disk: + blk_cleanup_disk(disk); out_free_idr: mutex_lock(&nbd_index_mutex); idr_remove(&nbd_index_idr, index); @@ -2135,7 +2193,7 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, struct genl_info *info) mutex_lock(&nbd->config_lock); config = nbd->config; if (!test_bit(NBD_RT_BOUND, &config->runtime_flags) || - !nbd->task_recv) { + !nbd->pid) { dev_err(nbd_to_dev(nbd), "not configured, cannot reconfigure\n"); ret = -EINVAL; diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c index e5cbcf582233..323af5c9c802 100644 --- a/drivers/block/null_blk/main.c +++ b/drivers/block/null_blk/main.c @@ -92,6 +92,10 @@ static int g_submit_queues = 1; module_param_named(submit_queues, g_submit_queues, int, 0444); MODULE_PARM_DESC(submit_queues, "Number of submission queues"); +static int g_poll_queues = 1; +module_param_named(poll_queues, g_poll_queues, int, 0444); +MODULE_PARM_DESC(poll_queues, "Number of IOPOLL submission queues"); + static int g_home_node = NUMA_NO_NODE; module_param_named(home_node, g_home_node, int, 0444); MODULE_PARM_DESC(home_node, "Home node for the device"); @@ -324,29 +328,69 @@ nullb_device_##NAME##_store(struct config_item *item, const char *page, \ } \ CONFIGFS_ATTR(nullb_device_, NAME); -static int nullb_apply_submit_queues(struct nullb_device *dev, - unsigned int submit_queues) +static int nullb_update_nr_hw_queues(struct nullb_device *dev, + unsigned int submit_queues, + unsigned int poll_queues) + { - struct nullb *nullb = dev->nullb; struct blk_mq_tag_set *set; + int ret, nr_hw_queues; - if (!nullb) + if (!dev->nullb) return 0; /* + * Make sure at least one queue exists for each of submit and poll. + */ + if (!submit_queues || !poll_queues) + return -EINVAL; + + /* * Make sure that null_init_hctx() does not access nullb->queues[] past * the end of that array. */ - if (submit_queues > nr_cpu_ids) + if (submit_queues > nr_cpu_ids || poll_queues > g_poll_queues) return -EINVAL; - set = nullb->tag_set; - blk_mq_update_nr_hw_queues(set, submit_queues); - return set->nr_hw_queues == submit_queues ? 0 : -ENOMEM; + + /* + * Keep previous and new queue numbers in nullb_device for reference in + * the call back function null_map_queues(). + */ + dev->prev_submit_queues = dev->submit_queues; + dev->prev_poll_queues = dev->poll_queues; + dev->submit_queues = submit_queues; + dev->poll_queues = poll_queues; + + set = dev->nullb->tag_set; + nr_hw_queues = submit_queues + poll_queues; + blk_mq_update_nr_hw_queues(set, nr_hw_queues); + ret = set->nr_hw_queues == nr_hw_queues ? 0 : -ENOMEM; + + if (ret) { + /* on error, revert the queue numbers */ + dev->submit_queues = dev->prev_submit_queues; + dev->poll_queues = dev->prev_poll_queues; + } + + return ret; +} + +static int nullb_apply_submit_queues(struct nullb_device *dev, + unsigned int submit_queues) +{ + return nullb_update_nr_hw_queues(dev, submit_queues, dev->poll_queues); +} + +static int nullb_apply_poll_queues(struct nullb_device *dev, + unsigned int poll_queues) +{ + return nullb_update_nr_hw_queues(dev, dev->submit_queues, poll_queues); } NULLB_DEVICE_ATTR(size, ulong, NULL); NULLB_DEVICE_ATTR(completion_nsec, ulong, NULL); NULLB_DEVICE_ATTR(submit_queues, uint, nullb_apply_submit_queues); +NULLB_DEVICE_ATTR(poll_queues, uint, nullb_apply_poll_queues); NULLB_DEVICE_ATTR(home_node, uint, NULL); NULLB_DEVICE_ATTR(queue_mode, uint, NULL); NULLB_DEVICE_ATTR(blocksize, uint, NULL); @@ -466,6 +510,7 @@ static struct configfs_attribute *nullb_device_attrs[] = { &nullb_device_attr_size, &nullb_device_attr_completion_nsec, &nullb_device_attr_submit_queues, + &nullb_device_attr_poll_queues, &nullb_device_attr_home_node, &nullb_device_attr_queue_mode, &nullb_device_attr_blocksize, @@ -593,6 +638,9 @@ static struct nullb_device *null_alloc_dev(void) dev->size = g_gb * 1024; dev->completion_nsec = g_completion_nsec; dev->submit_queues = g_submit_queues; + dev->prev_submit_queues = g_submit_queues; + dev->poll_queues = g_poll_queues; + dev->prev_poll_queues = g_poll_queues; dev->home_node = g_home_node; dev->queue_mode = g_queue_mode; dev->blocksize = g_bs; @@ -1454,12 +1502,100 @@ static bool should_requeue_request(struct request *rq) return false; } +static int null_map_queues(struct blk_mq_tag_set *set) +{ + struct nullb *nullb = set->driver_data; + int i, qoff; + unsigned int submit_queues = g_submit_queues; + unsigned int poll_queues = g_poll_queues; + + if (nullb) { + struct nullb_device *dev = nullb->dev; + + /* + * Refer nr_hw_queues of the tag set to check if the expected + * number of hardware queues are prepared. If block layer failed + * to prepare them, use previous numbers of submit queues and + * poll queues to map queues. + */ + if (set->nr_hw_queues == + dev->submit_queues + dev->poll_queues) { + submit_queues = dev->submit_queues; + poll_queues = dev->poll_queues; + } else if (set->nr_hw_queues == + dev->prev_submit_queues + dev->prev_poll_queues) { + submit_queues = dev->prev_submit_queues; + poll_queues = dev->prev_poll_queues; + } else { + pr_warn("tag set has unexpected nr_hw_queues: %d\n", + set->nr_hw_queues); + return -EINVAL; + } + } + + for (i = 0, qoff = 0; i < set->nr_maps; i++) { + struct blk_mq_queue_map *map = &set->map[i]; + + switch (i) { + case HCTX_TYPE_DEFAULT: + map->nr_queues = submit_queues; + break; + case HCTX_TYPE_READ: + map->nr_queues = 0; + continue; + case HCTX_TYPE_POLL: + map->nr_queues = poll_queues; + break; + } + map->queue_offset = qoff; + qoff += map->nr_queues; + blk_mq_map_queues(map); + } + + return 0; +} + +static int null_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob) +{ + struct nullb_queue *nq = hctx->driver_data; + LIST_HEAD(list); + int nr = 0; + + spin_lock(&nq->poll_lock); + list_splice_init(&nq->poll_list, &list); + spin_unlock(&nq->poll_lock); + + while (!list_empty(&list)) { + struct nullb_cmd *cmd; + struct request *req; + + req = list_first_entry(&list, struct request, queuelist); + list_del_init(&req->queuelist); + cmd = blk_mq_rq_to_pdu(req); + cmd->error = null_process_cmd(cmd, req_op(req), blk_rq_pos(req), + blk_rq_sectors(req)); + end_cmd(cmd); + nr++; + } + + return nr; +} + static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res) { + struct blk_mq_hw_ctx *hctx = rq->mq_hctx; struct nullb_cmd *cmd = blk_mq_rq_to_pdu(rq); pr_info("rq %p timed out\n", rq); + if (hctx->type == HCTX_TYPE_POLL) { + struct nullb_queue *nq = hctx->driver_data; + + spin_lock(&nq->poll_lock); + list_del_init(&rq->queuelist); + spin_unlock(&nq->poll_lock); + } + /* * If the device is marked as blocking (i.e. memory backed or zoned * device), the submission path may be blocked waiting for resources @@ -1480,10 +1616,11 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx, struct nullb_queue *nq = hctx->driver_data; sector_t nr_sectors = blk_rq_sectors(bd->rq); sector_t sector = blk_rq_pos(bd->rq); + const bool is_poll = hctx->type == HCTX_TYPE_POLL; might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); - if (nq->dev->irqmode == NULL_IRQ_TIMER) { + if (!is_poll && nq->dev->irqmode == NULL_IRQ_TIMER) { hrtimer_init(&cmd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); cmd->timer.function = null_cmd_timer_expired; } @@ -1507,6 +1644,13 @@ static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx, return BLK_STS_OK; } } + + if (is_poll) { + spin_lock(&nq->poll_lock); + list_add_tail(&bd->rq->queuelist, &nq->poll_list); + spin_unlock(&nq->poll_lock); + return BLK_STS_OK; + } if (cmd->fake_timeout) return BLK_STS_OK; @@ -1542,6 +1686,8 @@ static void null_init_queue(struct nullb *nullb, struct nullb_queue *nq) init_waitqueue_head(&nq->wait); nq->queue_depth = nullb->queue_depth; nq->dev = nullb->dev; + INIT_LIST_HEAD(&nq->poll_list); + spin_lock_init(&nq->poll_lock); } static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data, @@ -1567,6 +1713,8 @@ static const struct blk_mq_ops null_mq_ops = { .queue_rq = null_queue_rq, .complete = null_complete_rq, .timeout = null_timeout_rq, + .poll = null_poll, + .map_queues = null_map_queues, .init_hctx = null_init_hctx, .exit_hctx = null_exit_hctx, }; @@ -1663,13 +1811,17 @@ static int setup_commands(struct nullb_queue *nq) static int setup_queues(struct nullb *nullb) { - nullb->queues = kcalloc(nr_cpu_ids, sizeof(struct nullb_queue), + int nqueues = nr_cpu_ids; + + if (g_poll_queues) + nqueues += g_poll_queues; + + nullb->queues = kcalloc(nqueues, sizeof(struct nullb_queue), GFP_KERNEL); if (!nullb->queues) return -ENOMEM; nullb->queue_depth = nullb->dev->hw_queue_depth; - return 0; } @@ -1721,9 +1873,14 @@ static int null_gendisk_register(struct nullb *nullb) static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set) { + int poll_queues; + set->ops = &null_mq_ops; set->nr_hw_queues = nullb ? nullb->dev->submit_queues : g_submit_queues; + poll_queues = nullb ? nullb->dev->poll_queues : g_poll_queues; + if (poll_queues) + set->nr_hw_queues += poll_queues; set->queue_depth = nullb ? nullb->dev->hw_queue_depth : g_hw_queue_depth; set->numa_node = nullb ? nullb->dev->home_node : g_home_node; @@ -1733,7 +1890,11 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set) set->flags |= BLK_MQ_F_NO_SCHED; if (g_shared_tag_bitmap) set->flags |= BLK_MQ_F_TAG_HCTX_SHARED; - set->driver_data = NULL; + set->driver_data = nullb; + if (g_poll_queues) + set->nr_maps = 3; + else + set->nr_maps = 1; if ((nullb && nullb->dev->blocking) || g_blocking) set->flags |= BLK_MQ_F_BLOCKING; @@ -1753,6 +1914,13 @@ static int null_validate_conf(struct nullb_device *dev) dev->submit_queues = nr_cpu_ids; else if (dev->submit_queues == 0) dev->submit_queues = 1; + dev->prev_submit_queues = dev->submit_queues; + + if (dev->poll_queues > g_poll_queues) + dev->poll_queues = g_poll_queues; + else if (dev->poll_queues == 0) + dev->poll_queues = 1; + dev->prev_poll_queues = dev->poll_queues; dev->queue_mode = min_t(unsigned int, dev->queue_mode, NULL_Q_MQ); dev->irqmode = min_t(unsigned int, dev->irqmode, NULL_IRQ_TIMER); diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h index 64bef125d1df..78eb56b0ca55 100644 --- a/drivers/block/null_blk/null_blk.h +++ b/drivers/block/null_blk/null_blk.h @@ -32,6 +32,9 @@ struct nullb_queue { struct nullb_device *dev; unsigned int requeue_selection; + struct list_head poll_list; + spinlock_t poll_lock; + struct nullb_cmd *cmds; }; @@ -83,6 +86,9 @@ struct nullb_device { unsigned int zone_max_open; /* max number of open zones */ unsigned int zone_max_active; /* max number of active zones */ unsigned int submit_queues; /* number of submission queues */ + unsigned int prev_submit_queues; /* number of submission queues before change */ + unsigned int poll_queues; /* number of IOPOLL submission queues */ + unsigned int prev_poll_queues; /* number of IOPOLL submission queues before change */ unsigned int home_node; /* home node for the device */ unsigned int queue_mode; /* block interface */ unsigned int blocksize; /* block size */ diff --git a/drivers/block/paride/pcd.c b/drivers/block/paride/pcd.c index f9cdd11f02f5..f6b1d63e96e1 100644 --- a/drivers/block/paride/pcd.c +++ b/drivers/block/paride/pcd.c @@ -183,8 +183,6 @@ static int pcd_audio_ioctl(struct cdrom_device_info *cdi, static int pcd_packet(struct cdrom_device_info *cdi, struct packet_command *cgc); -static int pcd_detect(void); -static void pcd_probe_capabilities(void); static void do_pcd_read_drq(void); static blk_status_t pcd_queue_rq(struct blk_mq_hw_ctx *hctx, const struct blk_mq_queue_data *bd); @@ -302,53 +300,6 @@ static const struct blk_mq_ops pcd_mq_ops = { .queue_rq = pcd_queue_rq, }; -static void pcd_init_units(void) -{ - struct pcd_unit *cd; - int unit; - - pcd_drive_count = 0; - for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) { - struct gendisk *disk; - - if (blk_mq_alloc_sq_tag_set(&cd->tag_set, &pcd_mq_ops, 1, - BLK_MQ_F_SHOULD_MERGE)) - continue; - - disk = blk_mq_alloc_disk(&cd->tag_set, cd); - if (IS_ERR(disk)) { - blk_mq_free_tag_set(&cd->tag_set); - continue; - } - - INIT_LIST_HEAD(&cd->rq_list); - blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH); - cd->disk = disk; - cd->pi = &cd->pia; - cd->present = 0; - cd->last_sense = 0; - cd->changed = 1; - cd->drive = (*drives[unit])[D_SLV]; - if ((*drives[unit])[D_PRT]) - pcd_drive_count++; - - cd->name = &cd->info.name[0]; - snprintf(cd->name, sizeof(cd->info.name), "%s%d", name, unit); - cd->info.ops = &pcd_dops; - cd->info.handle = cd; - cd->info.speed = 0; - cd->info.capacity = 1; - cd->info.mask = 0; - disk->major = major; - disk->first_minor = unit; - disk->minors = 1; - strcpy(disk->disk_name, cd->name); /* umm... */ - disk->fops = &pcd_bdops; - disk->flags = GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; - disk->events = DISK_EVENT_MEDIA_CHANGE; - } -} - static int pcd_open(struct cdrom_device_info *cdi, int purpose) { struct pcd_unit *cd = cdi->handle; @@ -630,10 +581,11 @@ static int pcd_drive_status(struct cdrom_device_info *cdi, int slot_nr) return CDS_DISC_OK; } -static int pcd_identify(struct pcd_unit *cd, char *id) +static int pcd_identify(struct pcd_unit *cd) { - int k, s; char id_cmd[12] = { 0x12, 0, 0, 0, 36, 0, 0, 0, 0, 0, 0, 0 }; + char id[18]; + int k, s; pcd_bufblk = -1; @@ -661,108 +613,47 @@ static int pcd_identify(struct pcd_unit *cd, char *id) } /* - * returns 0, with id set if drive is detected - * -1, if drive detection failed + * returns 0, with id set if drive is detected, otherwise an error code. */ -static int pcd_probe(struct pcd_unit *cd, int ms, char *id) +static int pcd_probe(struct pcd_unit *cd, int ms) { if (ms == -1) { for (cd->drive = 0; cd->drive <= 1; cd->drive++) - if (!pcd_reset(cd) && !pcd_identify(cd, id)) + if (!pcd_reset(cd) && !pcd_identify(cd)) return 0; } else { cd->drive = ms; - if (!pcd_reset(cd) && !pcd_identify(cd, id)) + if (!pcd_reset(cd) && !pcd_identify(cd)) return 0; } - return -1; + return -ENODEV; } -static void pcd_probe_capabilities(void) +static int pcd_probe_capabilities(struct pcd_unit *cd) { - int unit, r; - char buffer[32]; char cmd[12] = { 0x5a, 1 << 3, 0x2a, 0, 0, 0, 0, 18, 0, 0, 0, 0 }; - struct pcd_unit *cd; - - for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) { - if (!cd->present) - continue; - r = pcd_atapi(cd, cmd, 18, buffer, "mode sense capabilities"); - if (r) - continue; - /* we should now have the cap page */ - if ((buffer[11] & 1) == 0) - cd->info.mask |= CDC_CD_R; - if ((buffer[11] & 2) == 0) - cd->info.mask |= CDC_CD_RW; - if ((buffer[12] & 1) == 0) - cd->info.mask |= CDC_PLAY_AUDIO; - if ((buffer[14] & 1) == 0) - cd->info.mask |= CDC_LOCK; - if ((buffer[14] & 8) == 0) - cd->info.mask |= CDC_OPEN_TRAY; - if ((buffer[14] >> 6) == 0) - cd->info.mask |= CDC_CLOSE_TRAY; - } -} - -static int pcd_detect(void) -{ - char id[18]; - int k, unit; - struct pcd_unit *cd; + char buffer[32]; + int ret; - printk("%s: %s version %s, major %d, nice %d\n", - name, name, PCD_VERSION, major, nice); + ret = pcd_atapi(cd, cmd, 18, buffer, "mode sense capabilities"); + if (ret) + return ret; + + /* we should now have the cap page */ + if ((buffer[11] & 1) == 0) + cd->info.mask |= CDC_CD_R; + if ((buffer[11] & 2) == 0) + cd->info.mask |= CDC_CD_RW; + if ((buffer[12] & 1) == 0) + cd->info.mask |= CDC_PLAY_AUDIO; + if ((buffer[14] & 1) == 0) + cd->info.mask |= CDC_LOCK; + if ((buffer[14] & 8) == 0) + cd->info.mask |= CDC_OPEN_TRAY; + if ((buffer[14] >> 6) == 0) + cd->info.mask |= CDC_CLOSE_TRAY; - par_drv = pi_register_driver(name); - if (!par_drv) { - pr_err("failed to register %s driver\n", name); - return -1; - } - - k = 0; - if (pcd_drive_count == 0) { /* nothing spec'd - so autoprobe for 1 */ - cd = pcd; - if (cd->disk && pi_init(cd->pi, 1, -1, -1, -1, -1, -1, - pcd_buffer, PI_PCD, verbose, cd->name)) { - if (!pcd_probe(cd, -1, id)) { - cd->present = 1; - k++; - } else - pi_release(cd->pi); - } - } else { - for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) { - int *conf = *drives[unit]; - if (!conf[D_PRT]) - continue; - if (!cd->disk) - continue; - if (!pi_init(cd->pi, 0, conf[D_PRT], conf[D_MOD], - conf[D_UNI], conf[D_PRO], conf[D_DLY], - pcd_buffer, PI_PCD, verbose, cd->name)) - continue; - if (!pcd_probe(cd, conf[D_SLV], id)) { - cd->present = 1; - k++; - } else - pi_release(cd->pi); - } - } - if (k) - return 0; - - printk("%s: No CD-ROM drive found\n", name); - for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) { - if (!cd->disk) - continue; - blk_cleanup_disk(cd->disk); - blk_mq_free_tag_set(&cd->tag_set); - } - pi_unregister_driver(par_drv); - return -1; + return 0; } /* I/O request processing */ @@ -999,43 +890,130 @@ static int pcd_get_mcn(struct cdrom_device_info *cdi, struct cdrom_mcn *mcn) return 0; } +static int pcd_init_unit(struct pcd_unit *cd, bool autoprobe, int port, + int mode, int unit, int protocol, int delay, int ms) +{ + struct gendisk *disk; + int ret; + + ret = blk_mq_alloc_sq_tag_set(&cd->tag_set, &pcd_mq_ops, 1, + BLK_MQ_F_SHOULD_MERGE); + if (ret) + return ret; + + disk = blk_mq_alloc_disk(&cd->tag_set, cd); + if (IS_ERR(disk)) { + ret = PTR_ERR(disk); + goto out_free_tag_set; + } + + INIT_LIST_HEAD(&cd->rq_list); + blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH); + cd->disk = disk; + cd->pi = &cd->pia; + cd->present = 0; + cd->last_sense = 0; + cd->changed = 1; + cd->drive = (*drives[cd - pcd])[D_SLV]; + + cd->name = &cd->info.name[0]; + snprintf(cd->name, sizeof(cd->info.name), "%s%d", name, unit); + cd->info.ops = &pcd_dops; + cd->info.handle = cd; + cd->info.speed = 0; + cd->info.capacity = 1; + cd->info.mask = 0; + disk->major = major; + disk->first_minor = unit; + disk->minors = 1; + strcpy(disk->disk_name, cd->name); /* umm... */ + disk->fops = &pcd_bdops; + disk->flags = GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE; + disk->events = DISK_EVENT_MEDIA_CHANGE; + + if (!pi_init(cd->pi, autoprobe, port, mode, unit, protocol, delay, + pcd_buffer, PI_PCD, verbose, cd->name)) { + ret = -ENODEV; + goto out_free_disk; + } + ret = pcd_probe(cd, ms); + if (ret) + goto out_pi_release; + + cd->present = 1; + pcd_probe_capabilities(cd); + ret = register_cdrom(cd->disk, &cd->info); + if (ret) + goto out_pi_release; + ret = add_disk(cd->disk); + if (ret) + goto out_unreg_cdrom; + return 0; + +out_unreg_cdrom: + unregister_cdrom(&cd->info); +out_pi_release: + pi_release(cd->pi); +out_free_disk: + blk_cleanup_disk(cd->disk); +out_free_tag_set: + blk_mq_free_tag_set(&cd->tag_set); + return ret; +} + static int __init pcd_init(void) { - struct pcd_unit *cd; - int unit; + int found = 0, unit; if (disable) return -EINVAL; - pcd_init_units(); + if (register_blkdev(major, name)) + return -EBUSY; - if (pcd_detect()) - return -ENODEV; + pr_info("%s: %s version %s, major %d, nice %d\n", + name, name, PCD_VERSION, major, nice); - /* get the atapi capabilities page */ - pcd_probe_capabilities(); + par_drv = pi_register_driver(name); + if (!par_drv) { + pr_err("failed to register %s driver\n", name); + goto out_unregister_blkdev; + } - if (register_blkdev(major, name)) { - for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) { - if (!cd->disk) - continue; + for (unit = 0; unit < PCD_UNITS; unit++) { + if ((*drives[unit])[D_PRT]) + pcd_drive_count++; + } + + if (pcd_drive_count == 0) { /* nothing spec'd - so autoprobe for 1 */ + if (!pcd_init_unit(pcd, 1, -1, -1, -1, -1, -1, -1)) + found++; + } else { + for (unit = 0; unit < PCD_UNITS; unit++) { + struct pcd_unit *cd = &pcd[unit]; + int *conf = *drives[unit]; - blk_cleanup_queue(cd->disk->queue); - blk_mq_free_tag_set(&cd->tag_set); - put_disk(cd->disk); + if (!conf[D_PRT]) + continue; + if (!pcd_init_unit(cd, 0, conf[D_PRT], conf[D_MOD], + conf[D_UNI], conf[D_PRO], conf[D_DLY], + conf[D_SLV])) + found++; } - return -EBUSY; } - for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) { - if (cd->present) { - register_cdrom(cd->disk, &cd->info); - cd->disk->private_data = cd; - add_disk(cd->disk); - } + if (!found) { + pr_info("%s: No CD-ROM drive found\n", name); + goto out_unregister_pi_driver; } return 0; + +out_unregister_pi_driver: + pi_unregister_driver(par_drv); +out_unregister_blkdev: + unregister_blkdev(major, name); + return -ENODEV; } static void __exit pcd_exit(void) @@ -1044,20 +1022,18 @@ static void __exit pcd_exit(void) int unit; for (unit = 0, cd = pcd; unit < PCD_UNITS; unit++, cd++) { - if (!cd->disk) + if (!cd->present) continue; - if (cd->present) { - del_gendisk(cd->disk); - pi_release(cd->pi); - unregister_cdrom(&cd->info); - } - blk_cleanup_queue(cd->disk->queue); + unregister_cdrom(&cd->info); + del_gendisk(cd->disk); + pi_release(cd->pi); + blk_cleanup_disk(cd->disk); + blk_mq_free_tag_set(&cd->tag_set); - put_disk(cd->disk); } - unregister_blkdev(major, name); pi_unregister_driver(par_drv); + unregister_blkdev(major, name); } MODULE_LICENSE("GPL"); diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c index 675327df6aff..e59759bcf416 100644 --- a/drivers/block/paride/pd.c +++ b/drivers/block/paride/pd.c @@ -875,9 +875,27 @@ static const struct blk_mq_ops pd_mq_ops = { .queue_rq = pd_queue_rq, }; -static void pd_probe_drive(struct pd_unit *disk) +static int pd_probe_drive(struct pd_unit *disk, int autoprobe, int port, + int mode, int unit, int protocol, int delay) { + int index = disk - pd; + int *parm = *drives[index]; struct gendisk *p; + int ret; + + disk->pi = &disk->pia; + disk->access = 0; + disk->changed = 1; + disk->capacity = 0; + disk->drive = parm[D_SLV]; + snprintf(disk->name, PD_NAMELEN, "%s%c", name, 'a' + index); + disk->alt_geom = parm[D_GEO]; + disk->standby = parm[D_SBY]; + INIT_LIST_HEAD(&disk->rq_list); + + if (!pi_init(disk->pi, autoprobe, port, mode, unit, protocol, delay, + pd_scratch, PI_PD, verbose, disk->name)) + return -ENXIO; memset(&disk->tag_set, 0, sizeof(disk->tag_set)); disk->tag_set.ops = &pd_mq_ops; @@ -887,14 +905,14 @@ static void pd_probe_drive(struct pd_unit *disk) disk->tag_set.queue_depth = 2; disk->tag_set.numa_node = NUMA_NO_NODE; disk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING; - - if (blk_mq_alloc_tag_set(&disk->tag_set)) - return; + ret = blk_mq_alloc_tag_set(&disk->tag_set); + if (ret) + goto pi_release; p = blk_mq_alloc_disk(&disk->tag_set, disk); if (IS_ERR(p)) { - blk_mq_free_tag_set(&disk->tag_set); - return; + ret = PTR_ERR(p); + goto free_tag_set; } disk->gd = p; @@ -905,102 +923,88 @@ static void pd_probe_drive(struct pd_unit *disk) p->minors = 1 << PD_BITS; p->events = DISK_EVENT_MEDIA_CHANGE; p->private_data = disk; - blk_queue_max_hw_sectors(p->queue, cluster); blk_queue_bounce_limit(p->queue, BLK_BOUNCE_HIGH); if (disk->drive == -1) { - for (disk->drive = 0; disk->drive <= 1; disk->drive++) - if (pd_special_command(disk, pd_identify) == 0) - return; - } else if (pd_special_command(disk, pd_identify) == 0) - return; - disk->gd = NULL; + for (disk->drive = 0; disk->drive <= 1; disk->drive++) { + ret = pd_special_command(disk, pd_identify); + if (ret == 0) + break; + } + } else { + ret = pd_special_command(disk, pd_identify); + } + if (ret) + goto put_disk; + set_capacity(disk->gd, disk->capacity); + ret = add_disk(disk->gd); + if (ret) + goto cleanup_disk; + return 0; +cleanup_disk: + blk_cleanup_disk(disk->gd); +put_disk: put_disk(p); + disk->gd = NULL; +free_tag_set: + blk_mq_free_tag_set(&disk->tag_set); +pi_release: + pi_release(disk->pi); + return ret; } -static int pd_detect(void) +static int __init pd_init(void) { int found = 0, unit, pd_drive_count = 0; struct pd_unit *disk; - for (unit = 0; unit < PD_UNITS; unit++) { - int *parm = *drives[unit]; - struct pd_unit *disk = pd + unit; - disk->pi = &disk->pia; - disk->access = 0; - disk->changed = 1; - disk->capacity = 0; - disk->drive = parm[D_SLV]; - snprintf(disk->name, PD_NAMELEN, "%s%c", name, 'a'+unit); - disk->alt_geom = parm[D_GEO]; - disk->standby = parm[D_SBY]; - if (parm[D_PRT]) - pd_drive_count++; - INIT_LIST_HEAD(&disk->rq_list); - } + if (disable) + return -ENODEV; + + if (register_blkdev(major, name)) + return -ENODEV; + + printk("%s: %s version %s, major %d, cluster %d, nice %d\n", + name, name, PD_VERSION, major, cluster, nice); par_drv = pi_register_driver(name); if (!par_drv) { pr_err("failed to register %s driver\n", name); - return -1; + goto out_unregister_blkdev; } - if (pd_drive_count == 0) { /* nothing spec'd - so autoprobe for 1 */ - disk = pd; - if (pi_init(disk->pi, 1, -1, -1, -1, -1, -1, pd_scratch, - PI_PD, verbose, disk->name)) { - pd_probe_drive(disk); - if (!disk->gd) - pi_release(disk->pi); - } + for (unit = 0; unit < PD_UNITS; unit++) { + int *parm = *drives[unit]; + if (parm[D_PRT]) + pd_drive_count++; + } + + if (pd_drive_count == 0) { /* nothing spec'd - so autoprobe for 1 */ + if (!pd_probe_drive(pd, 1, -1, -1, -1, -1, -1)) + found++; } else { for (unit = 0, disk = pd; unit < PD_UNITS; unit++, disk++) { int *parm = *drives[unit]; if (!parm[D_PRT]) continue; - if (pi_init(disk->pi, 0, parm[D_PRT], parm[D_MOD], - parm[D_UNI], parm[D_PRO], parm[D_DLY], - pd_scratch, PI_PD, verbose, disk->name)) { - pd_probe_drive(disk); - if (!disk->gd) - pi_release(disk->pi); - } - } - } - for (unit = 0, disk = pd; unit < PD_UNITS; unit++, disk++) { - if (disk->gd) { - set_capacity(disk->gd, disk->capacity); - add_disk(disk->gd); - found = 1; + if (!pd_probe_drive(disk, 0, parm[D_PRT], parm[D_MOD], + parm[D_UNI], parm[D_PRO], parm[D_DLY])) + found++; } } if (!found) { printk("%s: no valid drive found\n", name); - pi_unregister_driver(par_drv); + goto out_pi_unregister_driver; } - return found; -} - -static int __init pd_init(void) -{ - if (disable) - goto out1; - - if (register_blkdev(major, name)) - goto out1; - - printk("%s: %s version %s, major %d, cluster %d, nice %d\n", - name, name, PD_VERSION, major, cluster, nice); - if (!pd_detect()) - goto out2; return 0; -out2: +out_pi_unregister_driver: + pi_unregister_driver(par_drv); +out_unregister_blkdev: unregister_blkdev(major, name); -out1: return -ENODEV; } diff --git a/drivers/block/paride/pf.c b/drivers/block/paride/pf.c index d5b9c88ba76f..bf8d0ef41a0a 100644 --- a/drivers/block/paride/pf.c +++ b/drivers/block/paride/pf.c @@ -214,7 +214,6 @@ static int pf_getgeo(struct block_device *bdev, struct hd_geometry *geo); static void pf_release(struct gendisk *disk, fmode_t mode); -static int pf_detect(void); static void do_pf_read(void); static void do_pf_read_start(void); static void do_pf_write(void); @@ -285,45 +284,6 @@ static const struct blk_mq_ops pf_mq_ops = { .queue_rq = pf_queue_rq, }; -static void __init pf_init_units(void) -{ - struct pf_unit *pf; - int unit; - - pf_drive_count = 0; - for (unit = 0, pf = units; unit < PF_UNITS; unit++, pf++) { - struct gendisk *disk; - - if (blk_mq_alloc_sq_tag_set(&pf->tag_set, &pf_mq_ops, 1, - BLK_MQ_F_SHOULD_MERGE)) - continue; - - disk = blk_mq_alloc_disk(&pf->tag_set, pf); - if (IS_ERR(disk)) { - blk_mq_free_tag_set(&pf->tag_set); - continue; - } - - INIT_LIST_HEAD(&pf->rq_list); - blk_queue_max_segments(disk->queue, cluster); - blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH); - pf->disk = disk; - pf->pi = &pf->pia; - pf->media_status = PF_NM; - pf->drive = (*drives[unit])[D_SLV]; - pf->lun = (*drives[unit])[D_LUN]; - snprintf(pf->name, PF_NAMELEN, "%s%d", name, unit); - disk->major = major; - disk->first_minor = unit; - disk->minors = 1; - strcpy(disk->disk_name, pf->name); - disk->fops = &pf_fops; - disk->events = DISK_EVENT_MEDIA_CHANGE; - if (!(*drives[unit])[D_PRT]) - pf_drive_count++; - } -} - static int pf_open(struct block_device *bdev, fmode_t mode) { struct pf_unit *pf = bdev->bd_disk->private_data; @@ -691,9 +651,9 @@ static int pf_identify(struct pf_unit *pf) return 0; } -/* returns 0, with id set if drive is detected - -1, if drive detection failed -*/ +/* + * returns 0, with id set if drive is detected, otherwise an error code. + */ static int pf_probe(struct pf_unit *pf) { if (pf->drive == -1) { @@ -715,60 +675,7 @@ static int pf_probe(struct pf_unit *pf) if (!pf_identify(pf)) return 0; } - return -1; -} - -static int pf_detect(void) -{ - struct pf_unit *pf = units; - int k, unit; - - printk("%s: %s version %s, major %d, cluster %d, nice %d\n", - name, name, PF_VERSION, major, cluster, nice); - - par_drv = pi_register_driver(name); - if (!par_drv) { - pr_err("failed to register %s driver\n", name); - return -1; - } - k = 0; - if (pf_drive_count == 0) { - if (pi_init(pf->pi, 1, -1, -1, -1, -1, -1, pf_scratch, PI_PF, - verbose, pf->name)) { - if (!pf_probe(pf) && pf->disk) { - pf->present = 1; - k++; - } else - pi_release(pf->pi); - } - - } else - for (unit = 0; unit < PF_UNITS; unit++, pf++) { - int *conf = *drives[unit]; - if (!conf[D_PRT]) - continue; - if (pi_init(pf->pi, 0, conf[D_PRT], conf[D_MOD], - conf[D_UNI], conf[D_PRO], conf[D_DLY], - pf_scratch, PI_PF, verbose, pf->name)) { - if (pf->disk && !pf_probe(pf)) { - pf->present = 1; - k++; - } else - pi_release(pf->pi); - } - } - if (k) - return 0; - - printk("%s: No ATAPI disk detected\n", name); - for (pf = units, unit = 0; unit < PF_UNITS; pf++, unit++) { - if (!pf->disk) - continue; - blk_cleanup_disk(pf->disk); - blk_mq_free_tag_set(&pf->tag_set); - } - pi_unregister_driver(par_drv); - return -1; + return -ENODEV; } /* The i/o request engine */ @@ -1014,61 +921,134 @@ static void do_pf_write_done(void) next_request(0); } +static int __init pf_init_unit(struct pf_unit *pf, bool autoprobe, int port, + int mode, int unit, int protocol, int delay, int ms) +{ + struct gendisk *disk; + int ret; + + ret = blk_mq_alloc_sq_tag_set(&pf->tag_set, &pf_mq_ops, 1, + BLK_MQ_F_SHOULD_MERGE); + if (ret) + return ret; + + disk = blk_mq_alloc_disk(&pf->tag_set, pf); + if (IS_ERR(disk)) { + ret = PTR_ERR(disk); + goto out_free_tag_set; + } + disk->major = major; + disk->first_minor = pf - units; + disk->minors = 1; + strcpy(disk->disk_name, pf->name); + disk->fops = &pf_fops; + disk->events = DISK_EVENT_MEDIA_CHANGE; + disk->private_data = pf; + + blk_queue_max_segments(disk->queue, cluster); + blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH); + + INIT_LIST_HEAD(&pf->rq_list); + pf->disk = disk; + pf->pi = &pf->pia; + pf->media_status = PF_NM; + pf->drive = (*drives[disk->first_minor])[D_SLV]; + pf->lun = (*drives[disk->first_minor])[D_LUN]; + snprintf(pf->name, PF_NAMELEN, "%s%d", name, disk->first_minor); + + if (!pi_init(pf->pi, autoprobe, port, mode, unit, protocol, delay, + pf_scratch, PI_PF, verbose, pf->name)) { + ret = -ENODEV; + goto out_free_disk; + } + ret = pf_probe(pf); + if (ret) + goto out_pi_release; + + ret = add_disk(disk); + if (ret) + goto out_pi_release; + pf->present = 1; + return 0; + +out_pi_release: + pi_release(pf->pi); +out_free_disk: + blk_cleanup_disk(pf->disk); +out_free_tag_set: + blk_mq_free_tag_set(&pf->tag_set); + return ret; +} + static int __init pf_init(void) { /* preliminary initialisation */ struct pf_unit *pf; - int unit; + int found = 0, unit; if (disable) return -EINVAL; - pf_init_units(); + if (register_blkdev(major, name)) + return -EBUSY; - if (pf_detect()) - return -ENODEV; - pf_busy = 0; + printk("%s: %s version %s, major %d, cluster %d, nice %d\n", + name, name, PF_VERSION, major, cluster, nice); - if (register_blkdev(major, name)) { - for (pf = units, unit = 0; unit < PF_UNITS; pf++, unit++) { - if (!pf->disk) - continue; - blk_cleanup_queue(pf->disk->queue); - blk_mq_free_tag_set(&pf->tag_set); - put_disk(pf->disk); - } - return -EBUSY; + par_drv = pi_register_driver(name); + if (!par_drv) { + pr_err("failed to register %s driver\n", name); + goto out_unregister_blkdev; } - for (pf = units, unit = 0; unit < PF_UNITS; pf++, unit++) { - struct gendisk *disk = pf->disk; + for (unit = 0; unit < PF_UNITS; unit++) { + if (!(*drives[unit])[D_PRT]) + pf_drive_count++; + } - if (!pf->present) - continue; - disk->private_data = pf; - add_disk(disk); + pf = units; + if (pf_drive_count == 0) { + if (pf_init_unit(pf, 1, -1, -1, -1, -1, -1, verbose)) + found++; + } else { + for (unit = 0; unit < PF_UNITS; unit++, pf++) { + int *conf = *drives[unit]; + if (!conf[D_PRT]) + continue; + if (pf_init_unit(pf, 0, conf[D_PRT], conf[D_MOD], + conf[D_UNI], conf[D_PRO], conf[D_DLY], + verbose)) + found++; + } + } + if (!found) { + printk("%s: No ATAPI disk detected\n", name); + goto out_unregister_pi_driver; } + pf_busy = 0; return 0; + +out_unregister_pi_driver: + pi_unregister_driver(par_drv); +out_unregister_blkdev: + unregister_blkdev(major, name); + return -ENODEV; } static void __exit pf_exit(void) { struct pf_unit *pf; int unit; - unregister_blkdev(major, name); + for (pf = units, unit = 0; unit < PF_UNITS; pf++, unit++) { - if (!pf->disk) + if (!pf->present) continue; - - if (pf->present) - del_gendisk(pf->disk); - - blk_cleanup_queue(pf->disk->queue); + del_gendisk(pf->disk); + blk_cleanup_disk(pf->disk); blk_mq_free_tag_set(&pf->tag_set); - put_disk(pf->disk); - - if (pf->present) - pi_release(pf->pi); + pi_release(pf->pi); } + + unregister_blkdev(major, name); } MODULE_LICENSE("GPL"); diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c index cb52cce6fb03..e48d4771d4c1 100644 --- a/drivers/block/pktcdvd.c +++ b/drivers/block/pktcdvd.c @@ -2728,7 +2728,9 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev) /* inherit events of the host device */ disk->events = pd->bdev->bd_disk->events; - add_disk(disk); + ret = add_disk(disk); + if (ret) + goto out_mem2; pkt_sysfs_dev_new(pd); pkt_debugfs_dev_new(pd); diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index bf60aebd0cfb..953fa134cd3d 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -7054,7 +7054,9 @@ static ssize_t do_rbd_add(struct bus_type *bus, if (rc) goto err_out_image_lock; - device_add_disk(&rbd_dev->dev, rbd_dev->disk, NULL); + rc = device_add_disk(&rbd_dev->dev, rbd_dev->disk, NULL); + if (rc) + goto err_out_cleanup_disk; spin_lock(&rbd_dev_list_lock); list_add_tail(&rbd_dev->node, &rbd_dev_list); @@ -7068,6 +7070,8 @@ out: module_put(THIS_MODULE); return rc; +err_out_cleanup_disk: + rbd_free_disk(rbd_dev); err_out_image_lock: rbd_dev_image_unlock(rbd_dev); rbd_dev_device_release(rbd_dev); diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c index 0ec0191d4196..2df0657cdf00 100644 --- a/drivers/block/rnbd/rnbd-clt.c +++ b/drivers/block/rnbd/rnbd-clt.c @@ -1384,8 +1384,10 @@ static void setup_request_queue(struct rnbd_clt_dev *dev) blk_queue_write_cache(dev->queue, dev->wc, dev->fua); } -static void rnbd_clt_setup_gen_disk(struct rnbd_clt_dev *dev, int idx) +static int rnbd_clt_setup_gen_disk(struct rnbd_clt_dev *dev, int idx) { + int err; + dev->gd->major = rnbd_client_major; dev->gd->first_minor = idx << RNBD_PART_BITS; dev->gd->minors = 1 << RNBD_PART_BITS; @@ -1410,7 +1412,11 @@ static void rnbd_clt_setup_gen_disk(struct rnbd_clt_dev *dev, int idx) if (!dev->rotational) blk_queue_flag_set(QUEUE_FLAG_NONROT, dev->queue); - add_disk(dev->gd); + err = add_disk(dev->gd); + if (err) + blk_cleanup_disk(dev->gd); + + return err; } static int rnbd_client_setup_device(struct rnbd_clt_dev *dev) @@ -1426,8 +1432,7 @@ static int rnbd_client_setup_device(struct rnbd_clt_dev *dev) rnbd_init_mq_hw_queues(dev); setup_request_queue(dev); - rnbd_clt_setup_gen_disk(dev, idx); - return 0; + return rnbd_clt_setup_gen_disk(dev, idx); } static struct rnbd_clt_dev *init_dev(struct rnbd_clt_session *sess, diff --git a/drivers/block/rsxx/core.c b/drivers/block/rsxx/core.c index 83636714b8d7..8d9d69f5dfbc 100644 --- a/drivers/block/rsxx/core.c +++ b/drivers/block/rsxx/core.c @@ -935,7 +935,9 @@ static int rsxx_pci_probe(struct pci_dev *dev, card->size8 = 0; } - rsxx_attach_dev(card); + st = rsxx_attach_dev(card); + if (st) + goto failed_create_dev; /************* Setup Debugfs *************/ rsxx_debugfs_dev_new(card); diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c index 268252380e88..dd33f1bdf3b8 100644 --- a/drivers/block/rsxx/dev.c +++ b/drivers/block/rsxx/dev.c @@ -191,6 +191,8 @@ static bool rsxx_discard_supported(struct rsxx_cardinfo *card) int rsxx_attach_dev(struct rsxx_cardinfo *card) { + int err = 0; + mutex_lock(&card->dev_lock); /* The block device requires the stripe size from the config. */ @@ -199,13 +201,17 @@ int rsxx_attach_dev(struct rsxx_cardinfo *card) set_capacity(card->gendisk, card->size8 >> 9); else set_capacity(card->gendisk, 0); - device_add_disk(CARD_TO_DEV(card), card->gendisk, NULL); - card->bdev_attached = 1; + err = device_add_disk(CARD_TO_DEV(card), card->gendisk, NULL); + if (err == 0) + card->bdev_attached = 1; } mutex_unlock(&card->dev_lock); - return 0; + if (err) + blk_cleanup_disk(card->gendisk); + + return err; } void rsxx_detach_dev(struct rsxx_cardinfo *card) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 3911d0833e1b..821594cd1315 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -185,6 +185,7 @@ struct floppy_state { int track; int ref_count; + bool registered; struct gendisk *disk; struct blk_mq_tag_set tag_set; @@ -772,6 +773,20 @@ static const struct blk_mq_ops swim_mq_ops = { .queue_rq = swim_queue_rq, }; +static void swim_cleanup_floppy_disk(struct floppy_state *fs) +{ + struct gendisk *disk = fs->disk; + + if (!disk) + return; + + if (fs->registered) + del_gendisk(fs->disk); + + blk_cleanup_disk(disk); + blk_mq_free_tag_set(&fs->tag_set); +} + static int swim_floppy_init(struct swim_priv *swd) { int err; @@ -828,7 +843,10 @@ static int swim_floppy_init(struct swim_priv *swd) swd->unit[drive].disk->events = DISK_EVENT_MEDIA_CHANGE; swd->unit[drive].disk->private_data = &swd->unit[drive]; set_capacity(swd->unit[drive].disk, 2880); - add_disk(swd->unit[drive].disk); + err = add_disk(swd->unit[drive].disk); + if (err) + goto exit_put_disks; + swd->unit[drive].registered = true; } return 0; @@ -836,12 +854,7 @@ static int swim_floppy_init(struct swim_priv *swd) exit_put_disks: unregister_blkdev(FLOPPY_MAJOR, "fd"); do { - struct gendisk *disk = swd->unit[drive].disk; - - if (!disk) - continue; - blk_cleanup_disk(disk); - blk_mq_free_tag_set(&swd->unit[drive].tag_set); + swim_cleanup_floppy_disk(&swd->unit[drive]); } while (drive--); return err; } @@ -910,12 +923,8 @@ static int swim_remove(struct platform_device *dev) int drive; struct resource *res; - for (drive = 0; drive < swd->floppy_count; drive++) { - del_gendisk(swd->unit[drive].disk); - blk_cleanup_queue(swd->unit[drive].disk->queue); - blk_mq_free_tag_set(&swd->unit[drive].tag_set); - put_disk(swd->unit[drive].disk); - } + for (drive = 0; drive < swd->floppy_count; drive++) + swim_cleanup_floppy_disk(&swd->unit[drive]); unregister_blkdev(FLOPPY_MAJOR, "fd"); diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c index 965af0a3e95b..4b91c9aa5892 100644 --- a/drivers/block/swim3.c +++ b/drivers/block/swim3.c @@ -27,6 +27,7 @@ #include <linux/module.h> #include <linux/spinlock.h> #include <linux/wait.h> +#include <linux/major.h> #include <asm/io.h> #include <asm/dbdma.h> #include <asm/prom.h> @@ -1229,7 +1230,9 @@ static int swim3_attach(struct macio_dev *mdev, disk->flags |= GENHD_FL_REMOVABLE; sprintf(disk->disk_name, "fd%d", floppy_count); set_capacity(disk, 2880); - add_disk(disk); + rc = add_disk(disk); + if (rc) + goto out_cleanup_disk; disks[floppy_count++] = disk; return 0; diff --git a/drivers/block/sx8.c b/drivers/block/sx8.c index 420cd952ddc4..d1676fe0da1a 100644 --- a/drivers/block/sx8.c +++ b/drivers/block/sx8.c @@ -297,6 +297,7 @@ struct carm_host { struct work_struct fsm_task; + int probe_err; struct completion probe_comp; }; @@ -1181,8 +1182,11 @@ static void carm_fsm_task (struct work_struct *work) struct gendisk *disk = port->disk; set_capacity(disk, port->capacity); - add_disk(disk); - activated++; + host->probe_err = add_disk(disk); + if (!host->probe_err) + activated++; + else + break; } printk(KERN_INFO DRV_NAME "(%s): %d ports activated\n", @@ -1192,11 +1196,9 @@ static void carm_fsm_task (struct work_struct *work) reschedule = 1; break; } - case HST_PROBE_FINISHED: complete(&host->probe_comp); break; - case HST_ERROR: /* FIXME: TODO */ break; @@ -1507,7 +1509,12 @@ static int carm_init_one (struct pci_dev *pdev, const struct pci_device_id *ent) goto err_out_free_irq; DPRINTK("waiting for probe_comp\n"); + host->probe_err = -ENODEV; wait_for_completion(&host->probe_comp); + if (host->probe_err) { + rc = host->probe_err; + goto err_out_free_irq; + } printk(KERN_INFO "%s: pci %s, ports %d, io %llx, irq %u, major %d\n", host->name, pci_name(pdev), (int) CARM_MAX_PORTS, diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 303caf2d17d0..fd086179f980 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -815,9 +815,17 @@ static int virtblk_probe(struct virtio_device *vdev) err = virtio_cread_feature(vdev, VIRTIO_BLK_F_BLK_SIZE, struct virtio_blk_config, blk_size, &blk_size); - if (!err) + if (!err) { + err = blk_validate_block_size(blk_size); + if (err) { + dev_err(&vdev->dev, + "virtio_blk: invalid block size: 0x%x\n", + blk_size); + goto out_cleanup_disk; + } + blk_queue_logical_block_size(q, blk_size); - else + } else blk_size = queue_logical_block_size(q); /* Use topology information if available */ diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c index df0deb927760..8e3983e456f3 100644 --- a/drivers/block/xen-blkfront.c +++ b/drivers/block/xen-blkfront.c @@ -2386,7 +2386,13 @@ static void blkfront_connect(struct blkfront_info *info) for_each_rinfo(info, rinfo, i) kick_pending_request_queues(rinfo); - device_add_disk(&info->xbdev->dev, info->gd, NULL); + err = device_add_disk(&info->xbdev->dev, info->gd, NULL); + if (err) { + blk_cleanup_disk(info->gd); + blk_mq_free_tag_set(&info->tag_set); + info->rq = NULL; + goto fail; + } info->is_ready = 1; return; diff --git a/drivers/cdrom/gdrom.c b/drivers/cdrom/gdrom.c index 8e1fe75af93f..d50cc1fd34d5 100644 --- a/drivers/cdrom/gdrom.c +++ b/drivers/cdrom/gdrom.c @@ -805,9 +805,14 @@ static int probe_gdrom(struct platform_device *devptr) err = -ENOMEM; goto probe_fail_free_irqs; } - add_disk(gd.disk); + err = add_disk(gd.disk); + if (err) + goto probe_fail_add_disk; + return 0; +probe_fail_add_disk: + kfree(gd.toc); probe_fail_free_irqs: free_irq(HW_EVENT_GDROM_DMA, &gd); free_irq(HW_EVENT_GDROM_CMD, &gd); diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig index d6ba644f6b00..4a5516406c22 100644 --- a/drivers/char/tpm/Kconfig +++ b/drivers/char/tpm/Kconfig @@ -76,7 +76,7 @@ config TCG_TIS_SPI_CR50 config TCG_TIS_SYNQUACER tristate "TPM Interface Specification 1.2 Interface / TPM 2.0 FIFO Interface (MMIO - SynQuacer)" - depends on ARCH_SYNQUACER + depends on ARCH_SYNQUACER || COMPILE_TEST select TCG_TIS_CORE help If you have a TPM security chip that is compliant with the diff --git a/drivers/char/tpm/tpm2-space.c b/drivers/char/tpm/tpm2-space.c index 784b8b3cb903..97e916856cf3 100644 --- a/drivers/char/tpm/tpm2-space.c +++ b/drivers/char/tpm/tpm2-space.c @@ -455,6 +455,9 @@ static int tpm2_map_response_body(struct tpm_chip *chip, u32 cc, u8 *rsp, if (be32_to_cpu(data->capability) != TPM2_CAP_HANDLES) return 0; + if (be32_to_cpu(data->count) > (UINT_MAX - TPM_HEADER_SIZE - 9) / 4) + return -EFAULT; + if (len != TPM_HEADER_SIZE + 9 + 4 * be32_to_cpu(data->count)) return -EFAULT; diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c index 69579efb247b..b2659a4c4016 100644 --- a/drivers/char/tpm/tpm_tis_core.c +++ b/drivers/char/tpm/tpm_tis_core.c @@ -48,6 +48,7 @@ static int wait_for_tpm_stat(struct tpm_chip *chip, u8 mask, unsigned long timeout, wait_queue_head_t *queue, bool check_cancel) { + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev); unsigned long stop; long rc; u8 status; @@ -80,8 +81,8 @@ again: } } else { do { - usleep_range(TPM_TIMEOUT_USECS_MIN, - TPM_TIMEOUT_USECS_MAX); + usleep_range(priv->timeout_min, + priv->timeout_max); status = chip->ops->status(chip); if ((status & mask) == mask) return 0; @@ -945,7 +946,22 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, chip->timeout_b = msecs_to_jiffies(TIS_TIMEOUT_B_MAX); chip->timeout_c = msecs_to_jiffies(TIS_TIMEOUT_C_MAX); chip->timeout_d = msecs_to_jiffies(TIS_TIMEOUT_D_MAX); + priv->timeout_min = TPM_TIMEOUT_USECS_MIN; + priv->timeout_max = TPM_TIMEOUT_USECS_MAX; priv->phy_ops = phy_ops; + + rc = tpm_tis_read32(priv, TPM_DID_VID(0), &vendor); + if (rc < 0) + goto out_err; + + priv->manufacturer_id = vendor; + + if (priv->manufacturer_id == TPM_VID_ATML && + !(chip->flags & TPM_CHIP_FLAG_TPM2)) { + priv->timeout_min = TIS_TIMEOUT_MIN_ATML; + priv->timeout_max = TIS_TIMEOUT_MAX_ATML; + } + dev_set_drvdata(&chip->dev, priv); if (is_bsw()) { @@ -988,12 +1004,6 @@ int tpm_tis_core_init(struct device *dev, struct tpm_tis_data *priv, int irq, if (rc) goto out_err; - rc = tpm_tis_read32(priv, TPM_DID_VID(0), &vendor); - if (rc < 0) - goto out_err; - - priv->manufacturer_id = vendor; - rc = tpm_tis_read8(priv, TPM_RID(0), &rid); if (rc < 0) goto out_err; diff --git a/drivers/char/tpm/tpm_tis_core.h b/drivers/char/tpm/tpm_tis_core.h index b2a3c6c72882..3be24f221e32 100644 --- a/drivers/char/tpm/tpm_tis_core.h +++ b/drivers/char/tpm/tpm_tis_core.h @@ -54,6 +54,8 @@ enum tis_defaults { TIS_MEM_LEN = 0x5000, TIS_SHORT_TIMEOUT = 750, /* ms */ TIS_LONG_TIMEOUT = 2000, /* 2 sec */ + TIS_TIMEOUT_MIN_ATML = 14700, /* usecs */ + TIS_TIMEOUT_MAX_ATML = 15000, /* usecs */ }; /* Some timeout values are needed before it is known whether the chip is @@ -98,6 +100,8 @@ struct tpm_tis_data { wait_queue_head_t read_queue; const struct tpm_tis_phy_ops *phy_ops; unsigned short rng_quality; + unsigned int timeout_min; /* usecs */ + unsigned int timeout_max; /* usecs */ }; struct tpm_tis_phy_ops { diff --git a/drivers/char/tpm/tpm_tis_spi_main.c b/drivers/char/tpm/tpm_tis_spi_main.c index 54584b4b00d1..aaa59a00eeae 100644 --- a/drivers/char/tpm/tpm_tis_spi_main.c +++ b/drivers/char/tpm/tpm_tis_spi_main.c @@ -267,6 +267,7 @@ static const struct spi_device_id tpm_tis_spi_id[] = { { "st33htpm-spi", (unsigned long)tpm_tis_spi_probe }, { "slb9670", (unsigned long)tpm_tis_spi_probe }, { "tpm_tis_spi", (unsigned long)tpm_tis_spi_probe }, + { "tpm_tis-spi", (unsigned long)tpm_tis_spi_probe }, { "cr50", (unsigned long)cr50_spi_probe }, {} }; diff --git a/drivers/clk/clk-composite.c b/drivers/clk/clk-composite.c index 0506046a5f4b..510a9965633b 100644 --- a/drivers/clk/clk-composite.c +++ b/drivers/clk/clk-composite.c @@ -58,11 +58,8 @@ static int clk_composite_determine_rate(struct clk_hw *hw, long rate; int i; - if (rate_hw && rate_ops && rate_ops->determine_rate) { - __clk_hw_set_clk(rate_hw, hw); - return rate_ops->determine_rate(rate_hw, req); - } else if (rate_hw && rate_ops && rate_ops->round_rate && - mux_hw && mux_ops && mux_ops->set_parent) { + if (rate_hw && rate_ops && rate_ops->round_rate && + mux_hw && mux_ops && mux_ops->set_parent) { req->best_parent_hw = NULL; if (clk_hw_get_flags(hw) & CLK_SET_RATE_NO_REPARENT) { @@ -107,6 +104,9 @@ static int clk_composite_determine_rate(struct clk_hw *hw, req->rate = best_rate; return 0; + } else if (rate_hw && rate_ops && rate_ops->determine_rate) { + __clk_hw_set_clk(rate_hw, hw); + return rate_ops->determine_rate(rate_hw, req); } else if (mux_hw && mux_ops && mux_ops->determine_rate) { __clk_hw_set_clk(mux_hw, hw); return mux_ops->determine_rate(mux_hw, req); diff --git a/drivers/gpio/gpio-mlxbf2.c b/drivers/gpio/gpio-mlxbf2.c index 177d03ef4529..40a052bc6784 100644 --- a/drivers/gpio/gpio-mlxbf2.c +++ b/drivers/gpio/gpio-mlxbf2.c @@ -256,6 +256,11 @@ mlxbf2_gpio_probe(struct platform_device *pdev) NULL, 0); + if (ret) { + dev_err(dev, "bgpio_init failed\n"); + return ret; + } + gc->direction_input = mlxbf2_gpio_direction_input; gc->direction_output = mlxbf2_gpio_direction_output; gc->ngpio = npins; diff --git a/drivers/gpio/gpio-xgs-iproc.c b/drivers/gpio/gpio-xgs-iproc.c index fa9b4d8c3ff5..43ca52fa6f9a 100644 --- a/drivers/gpio/gpio-xgs-iproc.c +++ b/drivers/gpio/gpio-xgs-iproc.c @@ -224,7 +224,7 @@ static int iproc_gpio_probe(struct platform_device *pdev) } chip->gc.label = dev_name(dev); - if (of_property_read_u32(dn, "ngpios", &num_gpios)) + if (!of_property_read_u32(dn, "ngpios", &num_gpios)) chip->gc.ngpio = num_gpios; irq = platform_get_irq(pdev, 0); diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c index ff80786e3918..01efda4398e5 100644 --- a/drivers/gpu/drm/amd/amdgpu/nv.c +++ b/drivers/gpu/drm/amd/amdgpu/nv.c @@ -1257,7 +1257,7 @@ static int nv_common_early_init(void *handle) AMD_PG_SUPPORT_VCN_DPG | AMD_PG_SUPPORT_JPEG; if (adev->pdev->device == 0x1681) - adev->external_rev_id = adev->rev_id + 0x19; + adev->external_rev_id = 0x20; else adev->external_rev_id = adev->rev_id + 0x01; break; diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c index 87daa78a32b8..8080bba5b7a7 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c @@ -263,7 +263,7 @@ static ssize_t dp_link_settings_write(struct file *f, const char __user *buf, if (!wr_buf) return -ENOSPC; - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -487,7 +487,7 @@ static ssize_t dp_phy_settings_write(struct file *f, const char __user *buf, if (!wr_buf) return -ENOSPC; - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -639,7 +639,7 @@ static ssize_t dp_phy_test_pattern_debugfs_write(struct file *f, const char __us if (!wr_buf) return -ENOSPC; - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -914,7 +914,7 @@ static ssize_t dp_dsc_passthrough_set(struct file *f, const char __user *buf, return -ENOSPC; } - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, ¶m, buf, max_param_num, ¶m_nums)) { @@ -1211,7 +1211,7 @@ static ssize_t trigger_hotplug(struct file *f, const char __user *buf, return -ENOSPC; } - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -1396,7 +1396,7 @@ static ssize_t dp_dsc_clock_en_write(struct file *f, const char __user *buf, return -ENOSPC; } - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -1581,7 +1581,7 @@ static ssize_t dp_dsc_slice_width_write(struct file *f, const char __user *buf, return -ENOSPC; } - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -1766,7 +1766,7 @@ static ssize_t dp_dsc_slice_height_write(struct file *f, const char __user *buf, return -ENOSPC; } - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -1944,7 +1944,7 @@ static ssize_t dp_dsc_bits_per_pixel_write(struct file *f, const char __user *bu return -ENOSPC; } - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { @@ -2382,7 +2382,7 @@ static ssize_t dp_max_bpc_write(struct file *f, const char __user *buf, return -ENOSPC; } - if (parse_write_buffer_into_params(wr_buf, size, + if (parse_write_buffer_into_params(wr_buf, wr_buf_size, (long *)param, buf, max_param_num, ¶m_nums)) { diff --git a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c index 4a4894e9d9c9..377c4e53a2b3 100644 --- a/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c +++ b/drivers/gpu/drm/amd/display/dc/clk_mgr/dcn31/dcn31_clk_mgr.c @@ -366,32 +366,32 @@ static struct wm_table lpddr5_wm_table = { .wm_inst = WM_A, .wm_type = WM_TYPE_PSTATE_CHG, .pstate_latency_us = 11.65333, - .sr_exit_time_us = 5.32, - .sr_enter_plus_exit_time_us = 6.38, + .sr_exit_time_us = 11.5, + .sr_enter_plus_exit_time_us = 14.5, .valid = true, }, { .wm_inst = WM_B, .wm_type = WM_TYPE_PSTATE_CHG, .pstate_latency_us = 11.65333, - .sr_exit_time_us = 9.82, - .sr_enter_plus_exit_time_us = 11.196, + .sr_exit_time_us = 11.5, + .sr_enter_plus_exit_time_us = 14.5, .valid = true, }, { .wm_inst = WM_C, .wm_type = WM_TYPE_PSTATE_CHG, .pstate_latency_us = 11.65333, - .sr_exit_time_us = 9.89, - .sr_enter_plus_exit_time_us = 11.24, + .sr_exit_time_us = 11.5, + .sr_enter_plus_exit_time_us = 14.5, .valid = true, }, { .wm_inst = WM_D, .wm_type = WM_TYPE_PSTATE_CHG, .pstate_latency_us = 11.65333, - .sr_exit_time_us = 9.748, - .sr_enter_plus_exit_time_us = 11.102, + .sr_exit_time_us = 11.5, + .sr_enter_plus_exit_time_us = 14.5, .valid = true, }, } @@ -518,14 +518,21 @@ static unsigned int find_clk_for_voltage( unsigned int voltage) { int i; + int max_voltage = 0; + int clock = 0; for (i = 0; i < NUM_SOC_VOLTAGE_LEVELS; i++) { - if (clock_table->SocVoltage[i] == voltage) + if (clock_table->SocVoltage[i] == voltage) { return clocks[i]; + } else if (clock_table->SocVoltage[i] >= max_voltage && + clock_table->SocVoltage[i] < voltage) { + max_voltage = clock_table->SocVoltage[i]; + clock = clocks[i]; + } } - ASSERT(0); - return 0; + ASSERT(clock); + return clock; } void dcn31_clk_mgr_helper_populate_bw_params( diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_hwseq.c b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_hwseq.c index 3f2333ec67e2..3afa1159a5f7 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_hwseq.c +++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_hwseq.c @@ -76,10 +76,6 @@ void dcn31_init_hw(struct dc *dc) if (dc->clk_mgr && dc->clk_mgr->funcs->init_clocks) dc->clk_mgr->funcs->init_clocks(dc->clk_mgr); - // Initialize the dccg - if (res_pool->dccg->funcs->dccg_init) - res_pool->dccg->funcs->dccg_init(res_pool->dccg); - if (IS_FPGA_MAXIMUS_DC(dc->ctx->dce_environment)) { REG_WRITE(REFCLK_CNTL, 0); @@ -106,6 +102,9 @@ void dcn31_init_hw(struct dc *dc) hws->funcs.bios_golden_init(dc); hws->funcs.disable_vga(dc->hwseq); } + // Initialize the dccg + if (res_pool->dccg->funcs->dccg_init) + res_pool->dccg->funcs->dccg_init(res_pool->dccg); if (dc->debug.enable_mem_low_power.bits.dmcu) { // Force ERAM to shutdown if DMCU is not enabled diff --git a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c index 0006bbac466c..79e92ecca96c 100644 --- a/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c +++ b/drivers/gpu/drm/amd/display/dc/dcn31/dcn31_resource.c @@ -217,8 +217,8 @@ struct _vcs_dpi_soc_bounding_box_st dcn3_1_soc = { .num_states = 5, .sr_exit_time_us = 9.0, .sr_enter_plus_exit_time_us = 11.0, - .sr_exit_z8_time_us = 402.0, - .sr_enter_plus_exit_z8_time_us = 520.0, + .sr_exit_z8_time_us = 442.0, + .sr_enter_plus_exit_z8_time_us = 560.0, .writeback_latency_us = 12.0, .dram_channel_width_bytes = 4, .round_trip_ping_latency_dcfclk_cycles = 106, @@ -928,7 +928,7 @@ static const struct dc_debug_options debug_defaults_drv = { .disable_dcc = DCC_ENABLE, .vsr_support = true, .performance_trace = false, - .max_downscale_src_width = 3840,/*upto 4K*/ + .max_downscale_src_width = 4096,/*upto true 4K*/ .disable_pplib_wm_range = false, .scl_reset_length10 = true, .sanity_checks = false, @@ -1590,6 +1590,13 @@ static int dcn31_populate_dml_pipes_from_context( pipe = &res_ctx->pipe_ctx[i]; timing = &pipe->stream->timing; + /* + * Immediate flip can be set dynamically after enabling the plane. + * We need to require support for immediate flip or underflow can be + * intermittently experienced depending on peak b/w requirements. + */ + pipes[pipe_cnt].pipe.src.immediate_flip = true; + pipes[pipe_cnt].pipe.src.unbounded_req_mode = false; pipes[pipe_cnt].pipe.src.gpuvm = true; pipes[pipe_cnt].pipe.src.dcc_fraction_of_zs_req_luma = 0; diff --git a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c index ce55c9caf9a2..d58925cff420 100644 --- a/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c +++ b/drivers/gpu/drm/amd/display/dc/dml/dcn31/display_mode_vba_31.c @@ -5398,9 +5398,9 @@ void dml31_ModeSupportAndSystemConfigurationFull(struct display_mode_lib *mode_l v->MaximumReadBandwidthWithPrefetch = v->MaximumReadBandwidthWithPrefetch - + dml_max4( - v->VActivePixelBandwidth[i][j][k], - v->VActiveCursorBandwidth[i][j][k] + + dml_max3( + v->VActivePixelBandwidth[i][j][k] + + v->VActiveCursorBandwidth[i][j][k] + v->NoOfDPP[i][j][k] * (v->meta_row_bandwidth[i][j][k] + v->dpte_row_bandwidth[i][j][k]), diff --git a/drivers/gpu/drm/amd/display/include/dal_asic_id.h b/drivers/gpu/drm/amd/display/include/dal_asic_id.h index 5adc471bef57..3d2f0817e40a 100644 --- a/drivers/gpu/drm/amd/display/include/dal_asic_id.h +++ b/drivers/gpu/drm/amd/display/include/dal_asic_id.h @@ -227,7 +227,7 @@ enum { #define FAMILY_YELLOW_CARP 146 #define YELLOW_CARP_A0 0x01 -#define YELLOW_CARP_B0 0x1A +#define YELLOW_CARP_B0 0x20 #define YELLOW_CARP_UNKNOWN 0xFF #ifndef ASICREV_IS_YELLOW_CARP diff --git a/drivers/gpu/drm/amd/display/modules/hdcp/hdcp_psp.c b/drivers/gpu/drm/amd/display/modules/hdcp/hdcp_psp.c index e9bd84ec027d..be61975f1470 100644 --- a/drivers/gpu/drm/amd/display/modules/hdcp/hdcp_psp.c +++ b/drivers/gpu/drm/amd/display/modules/hdcp/hdcp_psp.c @@ -105,6 +105,7 @@ static enum mod_hdcp_status remove_display_from_topology_v3( dtm_cmd->dtm_status = TA_DTM_STATUS__GENERIC_FAILURE; psp_dtm_invoke(psp, dtm_cmd->cmd_id); + mutex_unlock(&psp->dtm_context.mutex); if (dtm_cmd->dtm_status != TA_DTM_STATUS__SUCCESS) { status = remove_display_from_topology_v2(hdcp, index); @@ -115,8 +116,6 @@ static enum mod_hdcp_status remove_display_from_topology_v3( HDCP_TOP_REMOVE_DISPLAY_TRACE(hdcp, display->index); } - mutex_unlock(&psp->dtm_context.mutex); - return status; } @@ -205,6 +204,7 @@ static enum mod_hdcp_status add_display_to_topology_v3( dtm_cmd->dtm_in_message.topology_update_v3.link_hdcp_cap = link->hdcp_supported_informational; psp_dtm_invoke(psp, dtm_cmd->cmd_id); + mutex_unlock(&psp->dtm_context.mutex); if (dtm_cmd->dtm_status != TA_DTM_STATUS__SUCCESS) { status = add_display_to_topology_v2(hdcp, display); @@ -214,8 +214,6 @@ static enum mod_hdcp_status add_display_to_topology_v3( HDCP_TOP_ADD_DISPLAY_TRACE(hdcp, display->index); } - mutex_unlock(&psp->dtm_context.mutex); - return status; } diff --git a/drivers/gpu/drm/ast/ast_mode.c b/drivers/gpu/drm/ast/ast_mode.c index 6bfaefa01818..1e30eaeb0e1b 100644 --- a/drivers/gpu/drm/ast/ast_mode.c +++ b/drivers/gpu/drm/ast/ast_mode.c @@ -1300,18 +1300,6 @@ static enum drm_mode_status ast_mode_valid(struct drm_connector *connector, return flags; } -static enum drm_connector_status ast_connector_detect(struct drm_connector - *connector, bool force) -{ - int r; - - r = ast_get_modes(connector); - if (r <= 0) - return connector_status_disconnected; - - return connector_status_connected; -} - static void ast_connector_destroy(struct drm_connector *connector) { struct ast_connector *ast_connector = to_ast_connector(connector); @@ -1327,7 +1315,6 @@ static const struct drm_connector_helper_funcs ast_connector_helper_funcs = { static const struct drm_connector_funcs ast_connector_funcs = { .reset = drm_atomic_helper_connector_reset, - .detect = ast_connector_detect, .fill_modes = drm_helper_probe_single_connector_modes, .destroy = ast_connector_destroy, .atomic_duplicate_state = drm_atomic_helper_connector_duplicate_state, @@ -1355,8 +1342,7 @@ static int ast_connector_init(struct drm_device *dev) connector->interlace_allowed = 0; connector->doublescan_allowed = 0; - connector->polled = DRM_CONNECTOR_POLL_CONNECT | - DRM_CONNECTOR_POLL_DISCONNECT; + connector->polled = DRM_CONNECTOR_POLL_CONNECT; drm_connector_attach_encoder(connector, encoder); @@ -1425,8 +1411,6 @@ int ast_mode_config_init(struct ast_private *ast) drm_mode_config_reset(dev); - drm_kms_helper_poll_init(dev); - return 0; } diff --git a/drivers/gpu/drm/drm_panel_orientation_quirks.c b/drivers/gpu/drm/drm_panel_orientation_quirks.c index f6bdec7fa925..e1b2ce4921ae 100644 --- a/drivers/gpu/drm/drm_panel_orientation_quirks.c +++ b/drivers/gpu/drm/drm_panel_orientation_quirks.c @@ -134,6 +134,12 @@ static const struct dmi_system_id orientation_data[] = { DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "T103HAF"), }, .driver_data = (void *)&lcd800x1280_rightside_up, + }, { /* AYA NEO 2021 */ + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "AYADEVICE"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "AYA NEO 2021"), + }, + .driver_data = (void *)&lcd800x1280_rightside_up, }, { /* GPD MicroPC (generic strings, also match on bios date) */ .matches = { DMI_EXACT_MATCH(DMI_SYS_VENDOR, "Default string"), @@ -185,6 +191,12 @@ static const struct dmi_system_id orientation_data[] = { DMI_EXACT_MATCH(DMI_BOARD_NAME, "Default string"), }, .driver_data = (void *)&gpd_win2, + }, { /* GPD Win 3 */ + .matches = { + DMI_EXACT_MATCH(DMI_SYS_VENDOR, "GPD"), + DMI_EXACT_MATCH(DMI_PRODUCT_NAME, "G1618-03") + }, + .driver_data = (void *)&lcd720x1280_rightside_up, }, { /* I.T.Works TW891 */ .matches = { DMI_EXACT_MATCH(DMI_SYS_VENDOR, "To be filled by O.E.M."), diff --git a/drivers/gpu/drm/i915/display/intel_dp.c b/drivers/gpu/drm/i915/display/intel_dp.c index abe3d61b6243..5cf152be4487 100644 --- a/drivers/gpu/drm/i915/display/intel_dp.c +++ b/drivers/gpu/drm/i915/display/intel_dp.c @@ -1916,6 +1916,9 @@ void intel_dp_sync_state(struct intel_encoder *encoder, { struct intel_dp *intel_dp = enc_to_intel_dp(encoder); + if (!crtc_state) + return; + /* * Don't clobber DPCD if it's been already read out during output * setup (eDP) or detect. diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c index 1257f4f11e66..438bbc7b8147 100644 --- a/drivers/gpu/drm/i915/gt/intel_timeline.c +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c @@ -64,7 +64,7 @@ intel_timeline_pin_map(struct intel_timeline *timeline) timeline->hwsp_map = vaddr; timeline->hwsp_seqno = memset(vaddr + ofs, 0, TIMELINE_SEQNO_BYTES); - clflush(vaddr + ofs); + drm_clflush_virt_range(vaddr + ofs, TIMELINE_SEQNO_BYTES); return 0; } @@ -225,7 +225,7 @@ void intel_timeline_reset_seqno(const struct intel_timeline *tl) memset(hwsp_seqno + 1, 0, TIMELINE_SEQNO_BYTES - sizeof(*hwsp_seqno)); WRITE_ONCE(*hwsp_seqno, tl->seqno); - clflush(hwsp_seqno); + drm_clflush_virt_range(hwsp_seqno, TIMELINE_SEQNO_BYTES); } void intel_timeline_enter(struct intel_timeline *tl) diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h index 4037030f0984..9023d4ecf3b3 100644 --- a/drivers/gpu/drm/i915/i915_reg.h +++ b/drivers/gpu/drm/i915/i915_reg.h @@ -11048,12 +11048,6 @@ enum skl_power_gate { #define DC_STATE_DEBUG_MASK_CORES (1 << 0) #define DC_STATE_DEBUG_MASK_MEMORY_UP (1 << 1) -#define BXT_P_CR_MC_BIOS_REQ_0_0_0 _MMIO(MCHBAR_MIRROR_BASE_SNB + 0x7114) -#define BXT_REQ_DATA_MASK 0x3F -#define BXT_DRAM_CHANNEL_ACTIVE_SHIFT 12 -#define BXT_DRAM_CHANNEL_ACTIVE_MASK (0xF << 12) -#define BXT_MEMORY_FREQ_MULTIPLIER_HZ 133333333 - #define BXT_D_CR_DRP0_DUNIT8 0x1000 #define BXT_D_CR_DRP0_DUNIT9 0x1200 #define BXT_D_CR_DRP0_DUNIT_START 8 @@ -11084,9 +11078,7 @@ enum skl_power_gate { #define BXT_DRAM_TYPE_LPDDR4 (0x2 << 22) #define BXT_DRAM_TYPE_DDR4 (0x4 << 22) -#define SKL_MEMORY_FREQ_MULTIPLIER_HZ 266666666 #define SKL_MC_BIOS_DATA_0_0_0_MCHBAR_PCU _MMIO(MCHBAR_MIRROR_BASE_SNB + 0x5E04) -#define SKL_REQ_DATA_MASK (0xF << 0) #define DG1_GEAR_TYPE REG_BIT(16) #define SKL_MAD_INTER_CHANNEL_0_0_0_MCHBAR_MCMAIN _MMIO(MCHBAR_MIRROR_BASE_SNB + 0x5000) diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h index 806ad688274b..63fec1c3c132 100644 --- a/drivers/gpu/drm/i915/i915_trace.h +++ b/drivers/gpu/drm/i915/i915_trace.h @@ -794,7 +794,6 @@ DECLARE_EVENT_CLASS(i915_request, TP_STRUCT__entry( __field(u32, dev) __field(u64, ctx) - __field(u32, guc_id) __field(u16, class) __field(u16, instance) __field(u32, seqno) @@ -805,16 +804,14 @@ DECLARE_EVENT_CLASS(i915_request, __entry->dev = rq->engine->i915->drm.primary->index; __entry->class = rq->engine->uabi_class; __entry->instance = rq->engine->uabi_instance; - __entry->guc_id = rq->context->guc_id; __entry->ctx = rq->fence.context; __entry->seqno = rq->fence.seqno; __entry->tail = rq->tail; ), - TP_printk("dev=%u, engine=%u:%u, guc_id=%u, ctx=%llu, seqno=%u, tail=%u", + TP_printk("dev=%u, engine=%u:%u, ctx=%llu, seqno=%u, tail=%u", __entry->dev, __entry->class, __entry->instance, - __entry->guc_id, __entry->ctx, __entry->seqno, - __entry->tail) + __entry->ctx, __entry->seqno, __entry->tail) ); DEFINE_EVENT(i915_request, i915_request_add, diff --git a/drivers/gpu/drm/i915/intel_dram.c b/drivers/gpu/drm/i915/intel_dram.c index 91866520c173..7acce64b0941 100644 --- a/drivers/gpu/drm/i915/intel_dram.c +++ b/drivers/gpu/drm/i915/intel_dram.c @@ -244,7 +244,6 @@ static int skl_get_dram_info(struct drm_i915_private *i915) { struct dram_info *dram_info = &i915->dram_info; - u32 mem_freq_khz, val; int ret; dram_info->type = skl_get_dram_type(i915); @@ -255,17 +254,6 @@ skl_get_dram_info(struct drm_i915_private *i915) if (ret) return ret; - val = intel_uncore_read(&i915->uncore, - SKL_MC_BIOS_DATA_0_0_0_MCHBAR_PCU); - mem_freq_khz = DIV_ROUND_UP((val & SKL_REQ_DATA_MASK) * - SKL_MEMORY_FREQ_MULTIPLIER_HZ, 1000); - - if (dram_info->num_channels * mem_freq_khz == 0) { - drm_info(&i915->drm, - "Couldn't get system memory bandwidth\n"); - return -EINVAL; - } - return 0; } @@ -350,24 +338,10 @@ static void bxt_get_dimm_info(struct dram_dimm_info *dimm, u32 val) static int bxt_get_dram_info(struct drm_i915_private *i915) { struct dram_info *dram_info = &i915->dram_info; - u32 dram_channels; - u32 mem_freq_khz, val; - u8 num_active_channels, valid_ranks = 0; + u32 val; + u8 valid_ranks = 0; int i; - val = intel_uncore_read(&i915->uncore, BXT_P_CR_MC_BIOS_REQ_0_0_0); - mem_freq_khz = DIV_ROUND_UP((val & BXT_REQ_DATA_MASK) * - BXT_MEMORY_FREQ_MULTIPLIER_HZ, 1000); - - dram_channels = val & BXT_DRAM_CHANNEL_ACTIVE_MASK; - num_active_channels = hweight32(dram_channels); - - if (mem_freq_khz * num_active_channels == 0) { - drm_info(&i915->drm, - "Couldn't get system memory bandwidth\n"); - return -EINVAL; - } - /* * Now read each DUNIT8/9/10/11 to check the rank of each dimms. */ diff --git a/drivers/gpu/drm/kmb/kmb_crtc.c b/drivers/gpu/drm/kmb/kmb_crtc.c index 44327bc629ca..06613ffeaaf8 100644 --- a/drivers/gpu/drm/kmb/kmb_crtc.c +++ b/drivers/gpu/drm/kmb/kmb_crtc.c @@ -66,7 +66,8 @@ static const struct drm_crtc_funcs kmb_crtc_funcs = { .disable_vblank = kmb_crtc_disable_vblank, }; -static void kmb_crtc_set_mode(struct drm_crtc *crtc) +static void kmb_crtc_set_mode(struct drm_crtc *crtc, + struct drm_atomic_state *old_state) { struct drm_device *dev = crtc->dev; struct drm_display_mode *m = &crtc->state->adjusted_mode; @@ -75,7 +76,7 @@ static void kmb_crtc_set_mode(struct drm_crtc *crtc) unsigned int val = 0; /* Initialize mipi */ - kmb_dsi_mode_set(kmb->kmb_dsi, m, kmb->sys_clk_mhz); + kmb_dsi_mode_set(kmb->kmb_dsi, m, kmb->sys_clk_mhz, old_state); drm_info(dev, "vfp= %d vbp= %d vsync_len=%d hfp=%d hbp=%d hsync_len=%d\n", m->crtc_vsync_start - m->crtc_vdisplay, @@ -138,7 +139,7 @@ static void kmb_crtc_atomic_enable(struct drm_crtc *crtc, struct kmb_drm_private *kmb = crtc_to_kmb_priv(crtc); clk_prepare_enable(kmb->kmb_clk.clk_lcd); - kmb_crtc_set_mode(crtc); + kmb_crtc_set_mode(crtc, state); drm_crtc_vblank_on(crtc); } @@ -185,11 +186,45 @@ static void kmb_crtc_atomic_flush(struct drm_crtc *crtc, spin_unlock_irq(&crtc->dev->event_lock); } +static enum drm_mode_status + kmb_crtc_mode_valid(struct drm_crtc *crtc, + const struct drm_display_mode *mode) +{ + int refresh; + struct drm_device *dev = crtc->dev; + int vfp = mode->vsync_start - mode->vdisplay; + + if (mode->vdisplay < KMB_CRTC_MAX_HEIGHT) { + drm_dbg(dev, "height = %d less than %d", + mode->vdisplay, KMB_CRTC_MAX_HEIGHT); + return MODE_BAD_VVALUE; + } + if (mode->hdisplay < KMB_CRTC_MAX_WIDTH) { + drm_dbg(dev, "width = %d less than %d", + mode->hdisplay, KMB_CRTC_MAX_WIDTH); + return MODE_BAD_HVALUE; + } + refresh = drm_mode_vrefresh(mode); + if (refresh < KMB_MIN_VREFRESH || refresh > KMB_MAX_VREFRESH) { + drm_dbg(dev, "refresh = %d less than %d or greater than %d", + refresh, KMB_MIN_VREFRESH, KMB_MAX_VREFRESH); + return MODE_BAD; + } + + if (vfp < KMB_CRTC_MIN_VFP) { + drm_dbg(dev, "vfp = %d less than %d", vfp, KMB_CRTC_MIN_VFP); + return MODE_BAD; + } + + return MODE_OK; +} + static const struct drm_crtc_helper_funcs kmb_crtc_helper_funcs = { .atomic_begin = kmb_crtc_atomic_begin, .atomic_enable = kmb_crtc_atomic_enable, .atomic_disable = kmb_crtc_atomic_disable, .atomic_flush = kmb_crtc_atomic_flush, + .mode_valid = kmb_crtc_mode_valid, }; int kmb_setup_crtc(struct drm_device *drm) diff --git a/drivers/gpu/drm/kmb/kmb_drv.c b/drivers/gpu/drm/kmb/kmb_drv.c index 12ce669650cc..961ac6fb5fcf 100644 --- a/drivers/gpu/drm/kmb/kmb_drv.c +++ b/drivers/gpu/drm/kmb/kmb_drv.c @@ -380,7 +380,7 @@ static irqreturn_t handle_lcd_irq(struct drm_device *dev) if (val & LAYER3_DMA_FIFO_UNDERFLOW) drm_dbg(&kmb->drm, "LAYER3:GL1 DMA UNDERFLOW val = 0x%lx", val); - if (val & LAYER3_DMA_FIFO_UNDERFLOW) + if (val & LAYER3_DMA_FIFO_OVERFLOW) drm_dbg(&kmb->drm, "LAYER3:GL1 DMA OVERFLOW val = 0x%lx", val); } diff --git a/drivers/gpu/drm/kmb/kmb_drv.h b/drivers/gpu/drm/kmb/kmb_drv.h index 69a62e2d03ff..bf085e95b28f 100644 --- a/drivers/gpu/drm/kmb/kmb_drv.h +++ b/drivers/gpu/drm/kmb/kmb_drv.h @@ -20,11 +20,18 @@ #define DRIVER_MAJOR 1 #define DRIVER_MINOR 1 +/* Platform definitions */ +#define KMB_CRTC_MIN_VFP 4 +#define KMB_CRTC_MAX_WIDTH 1920 /* max width in pixels */ +#define KMB_CRTC_MAX_HEIGHT 1080 /* max height in pixels */ +#define KMB_CRTC_MIN_WIDTH 1920 +#define KMB_CRTC_MIN_HEIGHT 1080 #define KMB_FB_MAX_WIDTH 1920 #define KMB_FB_MAX_HEIGHT 1080 #define KMB_FB_MIN_WIDTH 1 #define KMB_FB_MIN_HEIGHT 1 - +#define KMB_MIN_VREFRESH 59 /*vertical refresh in Hz */ +#define KMB_MAX_VREFRESH 60 /*vertical refresh in Hz */ #define KMB_LCD_DEFAULT_CLK 200000000 #define KMB_SYS_CLK_MHZ 500 @@ -50,6 +57,7 @@ struct kmb_drm_private { spinlock_t irq_lock; int irq_lcd; int sys_clk_mhz; + struct disp_cfg init_disp_cfg[KMB_MAX_PLANES]; struct layer_status plane_status[KMB_MAX_PLANES]; int kmb_under_flow; int kmb_flush_done; diff --git a/drivers/gpu/drm/kmb/kmb_dsi.c b/drivers/gpu/drm/kmb/kmb_dsi.c index 1793cd31b117..f6071882054c 100644 --- a/drivers/gpu/drm/kmb/kmb_dsi.c +++ b/drivers/gpu/drm/kmb/kmb_dsi.c @@ -482,6 +482,10 @@ static u32 mipi_tx_fg_section_cfg(struct kmb_dsi *kmb_dsi, return 0; } +#define CLK_DIFF_LOW 50 +#define CLK_DIFF_HI 60 +#define SYSCLK_500 500 + static void mipi_tx_fg_cfg_regs(struct kmb_dsi *kmb_dsi, u8 frame_gen, struct mipi_tx_frame_timing_cfg *fg_cfg) { @@ -492,7 +496,12 @@ static void mipi_tx_fg_cfg_regs(struct kmb_dsi *kmb_dsi, u8 frame_gen, /* 500 Mhz system clock minus 50 to account for the difference in * MIPI clock speed in RTL tests */ - sysclk = kmb_dsi->sys_clk_mhz - 50; + if (kmb_dsi->sys_clk_mhz == SYSCLK_500) { + sysclk = kmb_dsi->sys_clk_mhz - CLK_DIFF_LOW; + } else { + /* 700 Mhz clk*/ + sysclk = kmb_dsi->sys_clk_mhz - CLK_DIFF_HI; + } /* PPL-Pixel Packing Layer, LLP-Low Level Protocol * Frame genartor timing parameters are clocked on the system clock, @@ -1322,7 +1331,8 @@ static u32 mipi_tx_init_dphy(struct kmb_dsi *kmb_dsi, return 0; } -static void connect_lcd_to_mipi(struct kmb_dsi *kmb_dsi) +static void connect_lcd_to_mipi(struct kmb_dsi *kmb_dsi, + struct drm_atomic_state *old_state) { struct regmap *msscam; @@ -1331,7 +1341,7 @@ static void connect_lcd_to_mipi(struct kmb_dsi *kmb_dsi) dev_dbg(kmb_dsi->dev, "failed to get msscam syscon"); return; } - + drm_atomic_bridge_chain_enable(adv_bridge, old_state); /* DISABLE MIPI->CIF CONNECTION */ regmap_write(msscam, MSS_MIPI_CIF_CFG, 0); @@ -1342,7 +1352,7 @@ static void connect_lcd_to_mipi(struct kmb_dsi *kmb_dsi) } int kmb_dsi_mode_set(struct kmb_dsi *kmb_dsi, struct drm_display_mode *mode, - int sys_clk_mhz) + int sys_clk_mhz, struct drm_atomic_state *old_state) { u64 data_rate; @@ -1384,18 +1394,13 @@ int kmb_dsi_mode_set(struct kmb_dsi *kmb_dsi, struct drm_display_mode *mode, mipi_tx_init_cfg.lane_rate_mbps = data_rate; } - kmb_write_mipi(kmb_dsi, DPHY_ENABLE, 0); - kmb_write_mipi(kmb_dsi, DPHY_INIT_CTRL0, 0); - kmb_write_mipi(kmb_dsi, DPHY_INIT_CTRL1, 0); - kmb_write_mipi(kmb_dsi, DPHY_INIT_CTRL2, 0); - /* Initialize mipi controller */ mipi_tx_init_cntrl(kmb_dsi, &mipi_tx_init_cfg); /* Dphy initialization */ mipi_tx_init_dphy(kmb_dsi, &mipi_tx_init_cfg); - connect_lcd_to_mipi(kmb_dsi); + connect_lcd_to_mipi(kmb_dsi, old_state); dev_info(kmb_dsi->dev, "mipi hw initialized"); return 0; diff --git a/drivers/gpu/drm/kmb/kmb_dsi.h b/drivers/gpu/drm/kmb/kmb_dsi.h index 66b7c500d9bc..09dc88743d77 100644 --- a/drivers/gpu/drm/kmb/kmb_dsi.h +++ b/drivers/gpu/drm/kmb/kmb_dsi.h @@ -380,7 +380,7 @@ int kmb_dsi_host_bridge_init(struct device *dev); struct kmb_dsi *kmb_dsi_init(struct platform_device *pdev); void kmb_dsi_host_unregister(struct kmb_dsi *kmb_dsi); int kmb_dsi_mode_set(struct kmb_dsi *kmb_dsi, struct drm_display_mode *mode, - int sys_clk_mhz); + int sys_clk_mhz, struct drm_atomic_state *old_state); int kmb_dsi_map_mmio(struct kmb_dsi *kmb_dsi); int kmb_dsi_clk_init(struct kmb_dsi *kmb_dsi); int kmb_dsi_encoder_init(struct drm_device *dev, struct kmb_dsi *kmb_dsi); diff --git a/drivers/gpu/drm/kmb/kmb_plane.c b/drivers/gpu/drm/kmb/kmb_plane.c index 06b0c42c9e91..00404ba4126d 100644 --- a/drivers/gpu/drm/kmb/kmb_plane.c +++ b/drivers/gpu/drm/kmb/kmb_plane.c @@ -67,8 +67,21 @@ static const u32 kmb_formats_v[] = { static unsigned int check_pixel_format(struct drm_plane *plane, u32 format) { + struct kmb_drm_private *kmb; + struct kmb_plane *kmb_plane = to_kmb_plane(plane); int i; + int plane_id = kmb_plane->id; + struct disp_cfg init_disp_cfg; + kmb = to_kmb(plane->dev); + init_disp_cfg = kmb->init_disp_cfg[plane_id]; + /* Due to HW limitations, changing pixel format after initial + * plane configuration is not supported. + */ + if (init_disp_cfg.format && init_disp_cfg.format != format) { + drm_dbg(&kmb->drm, "Cannot change format after initial plane configuration"); + return -EINVAL; + } for (i = 0; i < plane->format_count; i++) { if (plane->format_types[i] == format) return 0; @@ -81,11 +94,17 @@ static int kmb_plane_atomic_check(struct drm_plane *plane, { struct drm_plane_state *new_plane_state = drm_atomic_get_new_plane_state(state, plane); + struct kmb_drm_private *kmb; + struct kmb_plane *kmb_plane = to_kmb_plane(plane); + int plane_id = kmb_plane->id; + struct disp_cfg init_disp_cfg; struct drm_framebuffer *fb; int ret; struct drm_crtc_state *crtc_state; bool can_position; + kmb = to_kmb(plane->dev); + init_disp_cfg = kmb->init_disp_cfg[plane_id]; fb = new_plane_state->fb; if (!fb || !new_plane_state->crtc) return 0; @@ -99,6 +118,16 @@ static int kmb_plane_atomic_check(struct drm_plane *plane, new_plane_state->crtc_w < KMB_FB_MIN_WIDTH || new_plane_state->crtc_h < KMB_FB_MIN_HEIGHT) return -EINVAL; + + /* Due to HW limitations, changing plane height or width after + * initial plane configuration is not supported. + */ + if ((init_disp_cfg.width && init_disp_cfg.height) && + (init_disp_cfg.width != fb->width || + init_disp_cfg.height != fb->height)) { + drm_dbg(&kmb->drm, "Cannot change plane height or width after initial configuration"); + return -EINVAL; + } can_position = (plane->type == DRM_PLANE_TYPE_OVERLAY); crtc_state = drm_atomic_get_existing_crtc_state(state, @@ -335,6 +364,7 @@ static void kmb_plane_atomic_update(struct drm_plane *plane, unsigned char plane_id; int num_planes; static dma_addr_t addr[MAX_SUB_PLANES]; + struct disp_cfg *init_disp_cfg; if (!plane || !new_plane_state || !old_plane_state) return; @@ -357,7 +387,8 @@ static void kmb_plane_atomic_update(struct drm_plane *plane, } spin_unlock_irq(&kmb->irq_lock); - src_w = (new_plane_state->src_w >> 16); + init_disp_cfg = &kmb->init_disp_cfg[plane_id]; + src_w = new_plane_state->src_w >> 16; src_h = new_plane_state->src_h >> 16; crtc_x = new_plane_state->crtc_x; crtc_y = new_plane_state->crtc_y; @@ -500,6 +531,16 @@ static void kmb_plane_atomic_update(struct drm_plane *plane, /* Enable DMA */ kmb_write_lcd(kmb, LCD_LAYERn_DMA_CFG(plane_id), dma_cfg); + + /* Save initial display config */ + if (!init_disp_cfg->width || + !init_disp_cfg->height || + !init_disp_cfg->format) { + init_disp_cfg->width = width; + init_disp_cfg->height = height; + init_disp_cfg->format = fb->format->format; + } + drm_dbg(&kmb->drm, "dma_cfg=0x%x LCD_DMA_CFG=0x%x\n", dma_cfg, kmb_read_lcd(kmb, LCD_LAYERn_DMA_CFG(plane_id))); diff --git a/drivers/gpu/drm/kmb/kmb_plane.h b/drivers/gpu/drm/kmb/kmb_plane.h index 6e8d22cf8819..b51144044fe8 100644 --- a/drivers/gpu/drm/kmb/kmb_plane.h +++ b/drivers/gpu/drm/kmb/kmb_plane.h @@ -63,6 +63,12 @@ struct layer_status { u32 ctrl; }; +struct disp_cfg { + unsigned int width; + unsigned int height; + unsigned int format; +}; + struct kmb_plane *kmb_plane_init(struct drm_device *drm); void kmb_plane_destroy(struct drm_plane *plane); #endif /* __KMB_PLANE_H__ */ diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c index 33da25b81615..267a880811d6 100644 --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c @@ -1838,6 +1838,13 @@ struct msm_gpu *a6xx_gpu_init(struct drm_device *dev) adreno_cmp_rev(ADRENO_REV(6, 3, 5, ANY_ID), info->rev))) adreno_gpu->base.hw_apriv = true; + /* + * For now only clamp to idle freq for devices where this is known not + * to cause power supply issues: + */ + if (info && (info->revn == 618)) + gpu->clamp_to_idle = true; + a6xx_llc_slices_init(pdev, a6xx_gpu); ret = a6xx_set_supported_hw(&pdev->dev, config->rev); diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h index 030f82f149c2..ee25d556c8a1 100644 --- a/drivers/gpu/drm/msm/msm_gpu.h +++ b/drivers/gpu/drm/msm/msm_gpu.h @@ -203,6 +203,10 @@ struct msm_gpu { uint32_t suspend_count; struct msm_gpu_state *crashstate; + + /* Enable clamping to idle freq when inactive: */ + bool clamp_to_idle; + /* True if the hardware supports expanded apriv (a650 and newer) */ bool hw_apriv; diff --git a/drivers/gpu/drm/msm/msm_gpu_devfreq.c b/drivers/gpu/drm/msm/msm_gpu_devfreq.c index 84e98c07c900..20006d060b5b 100644 --- a/drivers/gpu/drm/msm/msm_gpu_devfreq.c +++ b/drivers/gpu/drm/msm/msm_gpu_devfreq.c @@ -200,7 +200,8 @@ void msm_devfreq_idle(struct msm_gpu *gpu) idle_freq = get_freq(gpu); - msm_devfreq_target(&gpu->pdev->dev, &target_freq, 0); + if (gpu->clamp_to_idle) + msm_devfreq_target(&gpu->pdev->dev, &target_freq, 0); df->idle_time = ktime_get(); df->idle_freq = idle_freq; diff --git a/drivers/gpu/drm/mxsfb/mxsfb_drv.c b/drivers/gpu/drm/mxsfb/mxsfb_drv.c index ec0432fe1bdf..86d78634a979 100644 --- a/drivers/gpu/drm/mxsfb/mxsfb_drv.c +++ b/drivers/gpu/drm/mxsfb/mxsfb_drv.c @@ -173,7 +173,11 @@ static void mxsfb_irq_disable(struct drm_device *drm) struct mxsfb_drm_private *mxsfb = drm->dev_private; mxsfb_enable_axi_clk(mxsfb); - mxsfb->crtc.funcs->disable_vblank(&mxsfb->crtc); + + /* Disable and clear VBLANK IRQ */ + writel(CTRL1_CUR_FRAME_DONE_IRQ_EN, mxsfb->base + LCDC_CTRL1 + REG_CLR); + writel(CTRL1_CUR_FRAME_DONE_IRQ, mxsfb->base + LCDC_CTRL1 + REG_CLR); + mxsfb_disable_axi_clk(mxsfb); } diff --git a/drivers/gpu/drm/panel/panel-ilitek-ili9881c.c b/drivers/gpu/drm/panel/panel-ilitek-ili9881c.c index 0145129d7c66..534dd7414d42 100644 --- a/drivers/gpu/drm/panel/panel-ilitek-ili9881c.c +++ b/drivers/gpu/drm/panel/panel-ilitek-ili9881c.c @@ -590,14 +590,14 @@ static const struct drm_display_mode k101_im2byl02_default_mode = { .clock = 69700, .hdisplay = 800, - .hsync_start = 800 + 6, - .hsync_end = 800 + 6 + 15, - .htotal = 800 + 6 + 15 + 16, + .hsync_start = 800 + 52, + .hsync_end = 800 + 52 + 8, + .htotal = 800 + 52 + 8 + 48, .vdisplay = 1280, - .vsync_start = 1280 + 8, - .vsync_end = 1280 + 8 + 48, - .vtotal = 1280 + 8 + 48 + 52, + .vsync_start = 1280 + 16, + .vsync_end = 1280 + 16 + 6, + .vtotal = 1280 + 16 + 6 + 15, .width_mm = 135, .height_mm = 217, diff --git a/drivers/gpu/drm/selftests/test-drm_damage_helper.c b/drivers/gpu/drm/selftests/test-drm_damage_helper.c index 1c19a5d3eefb..8d8d8e214c28 100644 --- a/drivers/gpu/drm/selftests/test-drm_damage_helper.c +++ b/drivers/gpu/drm/selftests/test-drm_damage_helper.c @@ -30,6 +30,7 @@ static void mock_setup(struct drm_plane_state *state) mock_device.driver = &mock_driver; mock_device.mode_config.prop_fb_damage_clips = &mock_prop; mock_plane.dev = &mock_device; + mock_obj_props.count = 0; mock_plane.base.properties = &mock_obj_props; mock_prop.base.id = 1; /* 0 is an invalid id */ mock_prop.dev = &mock_device; diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 1c5ffe2935af..abf2d7a4fdf1 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -190,6 +190,7 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo) struct ttm_transfer_obj *fbo; fbo = container_of(bo, struct ttm_transfer_obj, base); + dma_resv_fini(&fbo->base.base._resv); ttm_bo_put(fbo->bo); kfree(fbo); } diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index 42f3d9d123a1..d030577ad6a2 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -13,6 +13,7 @@ #define _HYPERV_VMBUS_H #include <linux/list.h> +#include <linux/bitops.h> #include <asm/sync_bitops.h> #include <asm/hyperv-tlfs.h> #include <linux/atomic.h> diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index a20b8108e160..c00f8e28aab7 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -706,8 +706,9 @@ static void ib_nl_set_path_rec_attrs(struct sk_buff *skb, /* Construct the family header first */ header = skb_put(skb, NLMSG_ALIGN(sizeof(*header))); - memcpy(header->device_name, dev_name(&query->port->agent->device->dev), - LS_DEVICE_NAME_MAX); + strscpy_pad(header->device_name, + dev_name(&query->port->agent->device->dev), + LS_DEVICE_NAME_MAX); header->port_num = query->port->port_num; if ((comp_mask & IB_SA_PATH_REC_REVERSIBLE) && diff --git a/drivers/infiniband/hw/hfi1/pio.c b/drivers/infiniband/hw/hfi1/pio.c index 489b436f19bb..3d42bd2b36bd 100644 --- a/drivers/infiniband/hw/hfi1/pio.c +++ b/drivers/infiniband/hw/hfi1/pio.c @@ -878,6 +878,7 @@ void sc_disable(struct send_context *sc) { u64 reg; struct pio_buf *pbuf; + LIST_HEAD(wake_list); if (!sc) return; @@ -912,19 +913,21 @@ void sc_disable(struct send_context *sc) spin_unlock(&sc->release_lock); write_seqlock(&sc->waitlock); - while (!list_empty(&sc->piowait)) { + if (!list_empty(&sc->piowait)) + list_move(&sc->piowait, &wake_list); + write_sequnlock(&sc->waitlock); + while (!list_empty(&wake_list)) { struct iowait *wait; struct rvt_qp *qp; struct hfi1_qp_priv *priv; - wait = list_first_entry(&sc->piowait, struct iowait, list); + wait = list_first_entry(&wake_list, struct iowait, list); qp = iowait_to_qp(wait); priv = qp->priv; list_del_init(&priv->s_iowait.list); priv->s_iowait.lock = NULL; hfi1_qp_wakeup(qp, RVT_S_WAIT_PIO | HFI1_S_WAIT_PIO_DRAIN); } - write_sequnlock(&sc->waitlock); spin_unlock_irq(&sc->alloc_lock); } diff --git a/drivers/infiniband/hw/irdma/uk.c b/drivers/infiniband/hw/irdma/uk.c index 5fb92de1f015..9b544a3b1288 100644 --- a/drivers/infiniband/hw/irdma/uk.c +++ b/drivers/infiniband/hw/irdma/uk.c @@ -1092,12 +1092,12 @@ irdma_uk_cq_poll_cmpl(struct irdma_cq_uk *cq, struct irdma_cq_poll_info *info) if (cq->avoid_mem_cflct) { ext_cqe = (__le64 *)((u8 *)cqe + 32); get_64bit_val(ext_cqe, 24, &qword7); - polarity = (u8)FIELD_GET(IRDMA_CQ_VALID, qword3); + polarity = (u8)FIELD_GET(IRDMA_CQ_VALID, qword7); } else { peek_head = (cq->cq_ring.head + 1) % cq->cq_ring.size; ext_cqe = cq->cq_base[peek_head].buf; get_64bit_val(ext_cqe, 24, &qword7); - polarity = (u8)FIELD_GET(IRDMA_CQ_VALID, qword3); + polarity = (u8)FIELD_GET(IRDMA_CQ_VALID, qword7); if (!peek_head) polarity ^= 1; } diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c index 7110ebf834f9..102dc9342f2a 100644 --- a/drivers/infiniband/hw/irdma/verbs.c +++ b/drivers/infiniband/hw/irdma/verbs.c @@ -3399,9 +3399,13 @@ static void irdma_process_cqe(struct ib_wc *entry, } if (cq_poll_info->ud_vlan_valid) { - entry->vlan_id = cq_poll_info->ud_vlan & VLAN_VID_MASK; - entry->wc_flags |= IB_WC_WITH_VLAN; + u16 vlan = cq_poll_info->ud_vlan & VLAN_VID_MASK; + entry->sl = cq_poll_info->ud_vlan >> VLAN_PRIO_SHIFT; + if (vlan) { + entry->vlan_id = vlan; + entry->wc_flags |= IB_WC_WITH_VLAN; + } } else { entry->sl = 0; } diff --git a/drivers/infiniband/hw/irdma/ws.c b/drivers/infiniband/hw/irdma/ws.c index b68c575eb78e..b0d6ee0739f5 100644 --- a/drivers/infiniband/hw/irdma/ws.c +++ b/drivers/infiniband/hw/irdma/ws.c @@ -330,8 +330,10 @@ enum irdma_status_code irdma_ws_add(struct irdma_sc_vsi *vsi, u8 user_pri) tc_node->enable = true; ret = irdma_ws_cqp_cmd(vsi, tc_node, IRDMA_OP_WS_MODIFY_NODE); - if (ret) + if (ret) { + vsi->unregister_qset(vsi, tc_node); goto reg_err; + } } ibdev_dbg(to_ibdev(vsi->dev), "WS: Using node %d which represents VSI %d TC %d\n", @@ -350,6 +352,10 @@ enum irdma_status_code irdma_ws_add(struct irdma_sc_vsi *vsi, u8 user_pri) } goto exit; +reg_err: + irdma_ws_cqp_cmd(vsi, tc_node, IRDMA_OP_WS_DELETE_NODE); + list_del(&tc_node->siblings); + irdma_free_node(vsi, tc_node); leaf_add_err: if (list_empty(&vsi_node->child_list_head)) { if (irdma_ws_cqp_cmd(vsi, vsi_node, IRDMA_OP_WS_DELETE_NODE)) @@ -369,11 +375,6 @@ vsi_add_err: exit: mutex_unlock(&vsi->dev->ws_mutex); return ret; - -reg_err: - mutex_unlock(&vsi->dev->ws_mutex); - irdma_ws_remove(vsi, user_pri); - return ret; } /** diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index 3be36ebbf67a..22e2f4d79743 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -1339,7 +1339,6 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem, goto err_2; } mr->mmkey.type = MLX5_MKEY_MR; - mr->desc_size = sizeof(struct mlx5_mtt); mr->umem = umem; set_mr_fields(dev, mr, umem->length, access_flags); kvfree(in); @@ -1533,6 +1532,7 @@ static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, ib_umem_release(&odp->umem); return ERR_CAST(mr); } + xa_init(&mr->implicit_children); odp->private = mr; err = mlx5r_store_odp_mkey(dev, &mr->mmkey); diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c index b2fca110346c..e5abbcfc1d57 100644 --- a/drivers/infiniband/hw/mlx5/qp.c +++ b/drivers/infiniband/hw/mlx5/qp.c @@ -4458,6 +4458,8 @@ static int mlx5_ib_modify_dct(struct ib_qp *ibqp, struct ib_qp_attr *attr, MLX5_SET(dctc, dctc, mtu, attr->path_mtu); MLX5_SET(dctc, dctc, my_addr_index, attr->ah_attr.grh.sgid_index); MLX5_SET(dctc, dctc, hop_limit, attr->ah_attr.grh.hop_limit); + if (attr->ah_attr.type == RDMA_AH_ATTR_TYPE_ROCE) + MLX5_SET(dctc, dctc, eth_prio, attr->ah_attr.sl & 0x7); err = mlx5_core_create_dct(dev, &qp->dct.mdct, qp->dct.in, MLX5_ST_SZ_BYTES(create_dct_in), out, diff --git a/drivers/infiniband/hw/qedr/qedr.h b/drivers/infiniband/hw/qedr/qedr.h index 3cb4febaad0f..8def88cfa300 100644 --- a/drivers/infiniband/hw/qedr/qedr.h +++ b/drivers/infiniband/hw/qedr/qedr.h @@ -455,6 +455,7 @@ struct qedr_qp { /* synchronization objects used with iwarp ep */ struct kref refcnt; struct completion iwarp_cm_comp; + struct completion qp_rel_comp; unsigned long iwarp_cm_flags; /* enum iwarp_cm_flags */ }; diff --git a/drivers/infiniband/hw/qedr/qedr_iw_cm.c b/drivers/infiniband/hw/qedr/qedr_iw_cm.c index 1715fbe0719d..a51fc6854984 100644 --- a/drivers/infiniband/hw/qedr/qedr_iw_cm.c +++ b/drivers/infiniband/hw/qedr/qedr_iw_cm.c @@ -83,7 +83,7 @@ static void qedr_iw_free_qp(struct kref *ref) { struct qedr_qp *qp = container_of(ref, struct qedr_qp, refcnt); - kfree(qp); + complete(&qp->qp_rel_comp); } static void diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c index 3fbf172dbbef..dcb3653db72d 100644 --- a/drivers/infiniband/hw/qedr/verbs.c +++ b/drivers/infiniband/hw/qedr/verbs.c @@ -1357,6 +1357,7 @@ static void qedr_set_common_qp_params(struct qedr_dev *dev, if (rdma_protocol_iwarp(&dev->ibdev, 1)) { kref_init(&qp->refcnt); init_completion(&qp->iwarp_cm_comp); + init_completion(&qp->qp_rel_comp); } qp->pd = pd; @@ -2857,8 +2858,10 @@ int qedr_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata) qedr_free_qp_resources(dev, qp, udata); - if (rdma_protocol_iwarp(&dev->ibdev, 1)) + if (rdma_protocol_iwarp(&dev->ibdev, 1)) { qedr_iw_qp_rem_ref(&qp->ibqp); + wait_for_completion(&qp->qp_rel_comp); + } return 0; } diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c index a67599b5a550..ac11943a5ddb 100644 --- a/drivers/infiniband/hw/qib/qib_user_sdma.c +++ b/drivers/infiniband/hw/qib/qib_user_sdma.c @@ -602,7 +602,7 @@ done: /* * How many pages in this iovec element? */ -static int qib_user_sdma_num_pages(const struct iovec *iov) +static size_t qib_user_sdma_num_pages(const struct iovec *iov) { const unsigned long addr = (unsigned long) iov->iov_base; const unsigned long len = iov->iov_len; @@ -658,7 +658,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev, static int qib_user_sdma_pin_pages(const struct qib_devdata *dd, struct qib_user_sdma_queue *pq, struct qib_user_sdma_pkt *pkt, - unsigned long addr, int tlen, int npages) + unsigned long addr, int tlen, size_t npages) { struct page *pages[8]; int i, j; @@ -722,7 +722,7 @@ static int qib_user_sdma_pin_pkt(const struct qib_devdata *dd, unsigned long idx; for (idx = 0; idx < niov; idx++) { - const int npages = qib_user_sdma_num_pages(iov + idx); + const size_t npages = qib_user_sdma_num_pages(iov + idx); const unsigned long addr = (unsigned long) iov[idx].iov_base; ret = qib_user_sdma_pin_pages(dd, pq, pkt, addr, @@ -824,8 +824,8 @@ static int qib_user_sdma_queue_pkts(const struct qib_devdata *dd, unsigned pktnw; unsigned pktnwc; int nfrags = 0; - int npages = 0; - int bytes_togo = 0; + size_t npages = 0; + size_t bytes_togo = 0; int tiddma = 0; int cfur; @@ -885,7 +885,11 @@ static int qib_user_sdma_queue_pkts(const struct qib_devdata *dd, npages += qib_user_sdma_num_pages(&iov[idx]); - bytes_togo += slen; + if (check_add_overflow(bytes_togo, slen, &bytes_togo) || + bytes_togo > type_max(typeof(pkt->bytes_togo))) { + ret = -EINVAL; + goto free_pbc; + } pktnwc += slen >> 2; idx++; nfrags++; @@ -904,8 +908,7 @@ static int qib_user_sdma_queue_pkts(const struct qib_devdata *dd, } if (frag_size) { - int tidsmsize, n; - size_t pktsize; + size_t tidsmsize, n, pktsize, sz, addrlimit; n = npages*((2*PAGE_SIZE/frag_size)+1); pktsize = struct_size(pkt, addr, n); @@ -923,14 +926,24 @@ static int qib_user_sdma_queue_pkts(const struct qib_devdata *dd, else tidsmsize = 0; - pkt = kmalloc(pktsize+tidsmsize, GFP_KERNEL); + if (check_add_overflow(pktsize, tidsmsize, &sz)) { + ret = -EINVAL; + goto free_pbc; + } + pkt = kmalloc(sz, GFP_KERNEL); if (!pkt) { ret = -ENOMEM; goto free_pbc; } pkt->largepkt = 1; pkt->frag_size = frag_size; - pkt->addrlimit = n + ARRAY_SIZE(pkt->addr); + if (check_add_overflow(n, ARRAY_SIZE(pkt->addr), + &addrlimit) || + addrlimit > type_max(typeof(pkt->addrlimit))) { + ret = -EINVAL; + goto free_pbc; + } + pkt->addrlimit = addrlimit; if (tiddma) { char *tidsm = (char *)pkt + pktsize; diff --git a/drivers/infiniband/sw/rdmavt/qp.c b/drivers/infiniband/sw/rdmavt/qp.c index 49bdd78ac664..3305f2744bfa 100644 --- a/drivers/infiniband/sw/rdmavt/qp.c +++ b/drivers/infiniband/sw/rdmavt/qp.c @@ -1223,7 +1223,7 @@ int rvt_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init_attr, spin_lock(&rdi->n_qps_lock); if (rdi->n_qps_allocated == rdi->dparms.props.max_qp) { spin_unlock(&rdi->n_qps_lock); - ret = ENOMEM; + ret = -ENOMEM; goto bail_ip; } diff --git a/drivers/isdn/hardware/mISDN/hfcpci.c b/drivers/isdn/hardware/mISDN/hfcpci.c index e501cb03f211..bd087cca1c1d 100644 --- a/drivers/isdn/hardware/mISDN/hfcpci.c +++ b/drivers/isdn/hardware/mISDN/hfcpci.c @@ -1994,14 +1994,14 @@ setup_hw(struct hfc_pci *hc) pci_set_master(hc->pdev); if (!hc->irq) { printk(KERN_WARNING "HFC-PCI: No IRQ for PCI card found\n"); - return 1; + return -EINVAL; } hc->hw.pci_io = (char __iomem *)(unsigned long)hc->pdev->resource[1].start; if (!hc->hw.pci_io) { printk(KERN_WARNING "HFC-PCI: No IO-Mem for PCI card found\n"); - return 1; + return -ENOMEM; } /* Allocate memory for FIFOS */ /* the memory needs to be on a 32k boundary within the first 4G */ @@ -2012,7 +2012,7 @@ setup_hw(struct hfc_pci *hc) if (!buffer) { printk(KERN_WARNING "HFC-PCI: Error allocating memory for FIFO!\n"); - return 1; + return -ENOMEM; } hc->hw.fifos = buffer; pci_write_config_dword(hc->pdev, 0x80, hc->hw.dmahandle); @@ -2022,7 +2022,7 @@ setup_hw(struct hfc_pci *hc) "HFC-PCI: Error in ioremap for PCI!\n"); dma_free_coherent(&hc->pdev->dev, 0x8000, hc->hw.fifos, hc->hw.dmahandle); - return 1; + return -ENOMEM; } printk(KERN_INFO diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 5fc989a6d452..9ed9c955add7 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -178,7 +178,6 @@ #define pr_fmt(fmt) "bcache: %s() " fmt, __func__ -#include <linux/bcache.h> #include <linux/bio.h> #include <linux/kobject.h> #include <linux/list.h> @@ -190,6 +189,7 @@ #include <linux/workqueue.h> #include <linux/kthread.h> +#include "bcache_ondisk.h" #include "bset.h" #include "util.h" #include "closure.h" @@ -395,8 +395,6 @@ struct cached_dev { atomic_t io_errors; unsigned int error_limit; unsigned int offline_seconds; - - char backing_dev_name[BDEVNAME_SIZE]; }; enum alloc_reserve { @@ -470,8 +468,6 @@ struct cache { atomic_long_t meta_sectors_written; atomic_long_t btree_sectors_written; atomic_long_t sectors_written; - - char cache_dev_name[BDEVNAME_SIZE]; }; struct gc_stat { diff --git a/include/uapi/linux/bcache.h b/drivers/md/bcache/bcache_ondisk.h index cf7399f03b71..97413586195b 100644 --- a/include/uapi/linux/bcache.h +++ b/drivers/md/bcache/bcache_ondisk.h @@ -43,9 +43,9 @@ static inline void SET_##name(struct bkey *k, unsigned int i, __u64 v) \ #define KEY_MAX_U64S 8 KEY_FIELD(KEY_PTRS, high, 60, 3) -KEY_FIELD(HEADER_SIZE, high, 58, 2) +KEY_FIELD(__PAD0, high, 58, 2) KEY_FIELD(KEY_CSUM, high, 56, 2) -KEY_FIELD(KEY_PINNED, high, 55, 1) +KEY_FIELD(__PAD1, high, 55, 1) KEY_FIELD(KEY_DIRTY, high, 36, 1) KEY_FIELD(KEY_SIZE, high, 20, KEY_SIZE_BITS) diff --git a/drivers/md/bcache/bset.h b/drivers/md/bcache/bset.h index a50dcfda656f..d795c84246b0 100644 --- a/drivers/md/bcache/bset.h +++ b/drivers/md/bcache/bset.h @@ -2,10 +2,10 @@ #ifndef _BCACHE_BSET_H #define _BCACHE_BSET_H -#include <linux/bcache.h> #include <linux/kernel.h> #include <linux/types.h> +#include "bcache_ondisk.h" #include "util.h" /* for time_stats */ /* diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 0595559de174..93b67b8d31c3 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -141,7 +141,7 @@ static uint64_t btree_csum_set(struct btree *b, struct bset *i) uint64_t crc = b->key.ptr[0]; void *data = (void *) i + 8, *end = bset_bkey_last(i); - crc = bch_crc64_update(crc, data, end - data); + crc = crc64_be(crc, data, end - data); return crc ^ 0xffffffffffffffffULL; } diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c index 116edda845c3..6230dfdd9286 100644 --- a/drivers/md/bcache/debug.c +++ b/drivers/md/bcache/debug.c @@ -127,21 +127,20 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio) citer.bi_size = UINT_MAX; bio_for_each_segment(bv, bio, iter) { - void *p1 = kmap_atomic(bv.bv_page); + void *p1 = bvec_kmap_local(&bv); void *p2; cbv = bio_iter_iovec(check, citer); - p2 = page_address(cbv.bv_page); + p2 = bvec_kmap_local(&cbv); - cache_set_err_on(memcmp(p1 + bv.bv_offset, - p2 + bv.bv_offset, - bv.bv_len), + cache_set_err_on(memcmp(p1, p2, bv.bv_len), dc->disk.c, - "verify failed at dev %s sector %llu", - dc->backing_dev_name, + "verify failed at dev %pg sector %llu", + dc->bdev, (uint64_t) bio->bi_iter.bi_sector); - kunmap_atomic(p1); + kunmap_local(p2); + kunmap_local(p1); bio_advance_iter(check, &citer, bv.bv_len); } diff --git a/drivers/md/bcache/features.c b/drivers/md/bcache/features.c index 6d2b7b84a7b7..634922c5601d 100644 --- a/drivers/md/bcache/features.c +++ b/drivers/md/bcache/features.c @@ -6,7 +6,7 @@ * Copyright 2020 Coly Li <colyli@suse.de> * */ -#include <linux/bcache.h> +#include "bcache_ondisk.h" #include "bcache.h" #include "features.h" diff --git a/drivers/md/bcache/features.h b/drivers/md/bcache/features.h index d1c8fd3977fc..09161b89c63e 100644 --- a/drivers/md/bcache/features.h +++ b/drivers/md/bcache/features.h @@ -2,10 +2,11 @@ #ifndef _BCACHE_FEATURES_H #define _BCACHE_FEATURES_H -#include <linux/bcache.h> #include <linux/kernel.h> #include <linux/types.h> +#include "bcache_ondisk.h" + #define BCH_FEATURE_COMPAT 0 #define BCH_FEATURE_RO_COMPAT 1 #define BCH_FEATURE_INCOMPAT 2 diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c index e4388fe3ab7e..9c6f9ec55b72 100644 --- a/drivers/md/bcache/io.c +++ b/drivers/md/bcache/io.c @@ -65,15 +65,15 @@ void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio) * we shouldn't count failed REQ_RAHEAD bio to dc->io_errors. */ if (bio->bi_opf & REQ_RAHEAD) { - pr_warn_ratelimited("%s: Read-ahead I/O failed on backing device, ignore\n", - dc->backing_dev_name); + pr_warn_ratelimited("%pg: Read-ahead I/O failed on backing device, ignore\n", + dc->bdev); return; } errors = atomic_add_return(1, &dc->io_errors); if (errors < dc->error_limit) - pr_err("%s: IO error on backing device, unrecoverable\n", - dc->backing_dev_name); + pr_err("%pg: IO error on backing device, unrecoverable\n", + dc->bdev); else bch_cached_dev_error(dc); } @@ -123,13 +123,13 @@ void bch_count_io_errors(struct cache *ca, errors >>= IO_ERROR_SHIFT; if (errors < ca->set->error_limit) - pr_err("%s: IO error on %s%s\n", - ca->cache_dev_name, m, + pr_err("%pg: IO error on %s%s\n", + ca->bdev, m, is_read ? ", recovering." : "."); else bch_cache_set_error(ca->set, - "%s: too many IO errors %s\n", - ca->cache_dev_name, m); + "%pg: too many IO errors %s\n", + ca->bdev, m); } } diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index 23b28edae90f..d15aae6c51c1 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -46,7 +46,7 @@ static void bio_csum(struct bio *bio, struct bkey *k) bio_for_each_segment(bv, bio, iter) { void *d = kmap(bv.bv_page) + bv.bv_offset; - csum = bch_crc64_update(csum, d, bv.bv_len); + csum = crc64_be(csum, d, bv.bv_len); kunmap(bv.bv_page); } @@ -651,8 +651,8 @@ static void backing_request_endio(struct bio *bio) */ if (unlikely(s->iop.writeback && bio->bi_opf & REQ_PREFLUSH)) { - pr_err("Can't flush %s: returned bi_status %i\n", - dc->backing_dev_name, bio->bi_status); + pr_err("Can't flush %pg: returned bi_status %i\n", + dc->bdev, bio->bi_status); } else { /* set to orig_bio->bi_status in bio_complete() */ s->iop.status = bio->bi_status; diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 4f89985abe4b..4a9a65dff95e 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1026,8 +1026,8 @@ static int cached_dev_status_update(void *arg) dc->offline_seconds = 0; if (dc->offline_seconds >= BACKING_DEV_OFFLINE_TIMEOUT) { - pr_err("%s: device offline for %d seconds\n", - dc->backing_dev_name, + pr_err("%pg: device offline for %d seconds\n", + dc->bdev, BACKING_DEV_OFFLINE_TIMEOUT); pr_err("%s: disable I/O request due to backing device offline\n", dc->disk.name); @@ -1058,15 +1058,13 @@ int bch_cached_dev_run(struct cached_dev *dc) }; if (dc->io_disable) { - pr_err("I/O disabled on cached dev %s\n", - dc->backing_dev_name); + pr_err("I/O disabled on cached dev %pg\n", dc->bdev); ret = -EIO; goto out; } if (atomic_xchg(&dc->running, 1)) { - pr_info("cached dev %s is running already\n", - dc->backing_dev_name); + pr_info("cached dev %pg is running already\n", dc->bdev); ret = -EBUSY; goto out; } @@ -1082,7 +1080,9 @@ int bch_cached_dev_run(struct cached_dev *dc) closure_sync(&cl); } - add_disk(d->disk); + ret = add_disk(d->disk); + if (ret) + goto out; bd_link_disk_holder(dc->bdev, dc->disk.disk); /* * won't show up in the uevent file, use udevadm monitor -e instead @@ -1154,16 +1154,16 @@ static void cached_dev_detach_finish(struct work_struct *w) mutex_lock(&bch_register_lock); - calc_cached_dev_sectors(dc->disk.c); bcache_device_detach(&dc->disk); list_move(&dc->list, &uncached_devices); + calc_cached_dev_sectors(dc->disk.c); clear_bit(BCACHE_DEV_DETACHING, &dc->disk.flags); clear_bit(BCACHE_DEV_UNLINK_DONE, &dc->disk.flags); mutex_unlock(&bch_register_lock); - pr_info("Caching disabled for %s\n", dc->backing_dev_name); + pr_info("Caching disabled for %pg\n", dc->bdev); /* Drop ref we took in cached_dev_detach() */ closure_put(&dc->disk.cl); @@ -1203,29 +1203,27 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, return -ENOENT; if (dc->disk.c) { - pr_err("Can't attach %s: already attached\n", - dc->backing_dev_name); + pr_err("Can't attach %pg: already attached\n", dc->bdev); return -EINVAL; } if (test_bit(CACHE_SET_STOPPING, &c->flags)) { - pr_err("Can't attach %s: shutting down\n", - dc->backing_dev_name); + pr_err("Can't attach %pg: shutting down\n", dc->bdev); return -EINVAL; } if (dc->sb.block_size < c->cache->sb.block_size) { /* Will die */ - pr_err("Couldn't attach %s: block size less than set's block size\n", - dc->backing_dev_name); + pr_err("Couldn't attach %pg: block size less than set's block size\n", + dc->bdev); return -EINVAL; } /* Check whether already attached */ list_for_each_entry_safe(exist_dc, t, &c->cached_devs, list) { if (!memcmp(dc->sb.uuid, exist_dc->sb.uuid, 16)) { - pr_err("Tried to attach %s but duplicate UUID already attached\n", - dc->backing_dev_name); + pr_err("Tried to attach %pg but duplicate UUID already attached\n", + dc->bdev); return -EINVAL; } @@ -1243,15 +1241,13 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, if (!u) { if (BDEV_STATE(&dc->sb) == BDEV_STATE_DIRTY) { - pr_err("Couldn't find uuid for %s in set\n", - dc->backing_dev_name); + pr_err("Couldn't find uuid for %pg in set\n", dc->bdev); return -ENOENT; } u = uuid_find_empty(c); if (!u) { - pr_err("Not caching %s, no room for UUID\n", - dc->backing_dev_name); + pr_err("Not caching %pg, no room for UUID\n", dc->bdev); return -EINVAL; } } @@ -1319,8 +1315,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, */ kthread_stop(dc->writeback_thread); cancel_writeback_rate_update_dwork(dc); - pr_err("Couldn't run cached device %s\n", - dc->backing_dev_name); + pr_err("Couldn't run cached device %pg\n", dc->bdev); return ret; } @@ -1336,8 +1331,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c, /* Allow the writeback thread to proceed */ up_write(&dc->writeback_lock); - pr_info("Caching %s as %s on set %pU\n", - dc->backing_dev_name, + pr_info("Caching %pg as %s on set %pU\n", + dc->bdev, dc->disk.disk->disk_name, dc->disk.c->set_uuid); return 0; @@ -1461,7 +1456,6 @@ static int register_bdev(struct cache_sb *sb, struct cache_sb_disk *sb_disk, struct cache_set *c; int ret = -ENOMEM; - bdevname(bdev, dc->backing_dev_name); memcpy(&dc->sb, sb, sizeof(struct cache_sb)); dc->bdev = bdev; dc->bdev->bd_holder = dc; @@ -1476,7 +1470,7 @@ static int register_bdev(struct cache_sb *sb, struct cache_sb_disk *sb_disk, if (bch_cache_accounting_add_kobjs(&dc->accounting, &dc->disk.kobj)) goto err; - pr_info("registered backing device %s\n", dc->backing_dev_name); + pr_info("registered backing device %pg\n", dc->bdev); list_add(&dc->list, &uncached_devices); /* attach to a matched cache set if it exists */ @@ -1493,7 +1487,7 @@ static int register_bdev(struct cache_sb *sb, struct cache_sb_disk *sb_disk, return 0; err: - pr_notice("error %s: %s\n", dc->backing_dev_name, err); + pr_notice("error %pg: %s\n", dc->bdev, err); bcache_device_stop(&dc->disk); return ret; } @@ -1534,10 +1528,11 @@ static void flash_dev_flush(struct closure *cl) static int flash_dev_run(struct cache_set *c, struct uuid_entry *u) { + int err = -ENOMEM; struct bcache_device *d = kzalloc(sizeof(struct bcache_device), GFP_KERNEL); if (!d) - return -ENOMEM; + goto err_ret; closure_init(&d->cl, NULL); set_closure_fn(&d->cl, flash_dev_flush, system_wq); @@ -1551,9 +1546,12 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u) bcache_device_attach(d, c, u - c->uuids); bch_sectors_dirty_init(d); bch_flash_dev_request_init(d); - add_disk(d->disk); + err = add_disk(d->disk); + if (err) + goto err; - if (kobject_add(&d->kobj, &disk_to_dev(d->disk)->kobj, "bcache")) + err = kobject_add(&d->kobj, &disk_to_dev(d->disk)->kobj, "bcache"); + if (err) goto err; bcache_device_link(d, c, "volume"); @@ -1567,7 +1565,8 @@ static int flash_dev_run(struct cache_set *c, struct uuid_entry *u) return 0; err: kobject_put(&d->kobj); - return -ENOMEM; +err_ret: + return err; } static int flash_devs_run(struct cache_set *c) @@ -1621,8 +1620,8 @@ bool bch_cached_dev_error(struct cached_dev *dc) /* make others know io_disable is true earlier */ smp_mb(); - pr_err("stop %s: too many IO errors on backing device %s\n", - dc->disk.disk->disk_name, dc->backing_dev_name); + pr_err("stop %s: too many IO errors on backing device %pg\n", + dc->disk.disk->disk_name, dc->bdev); bcache_device_stop(&dc->disk); return true; @@ -2338,7 +2337,7 @@ err_btree_alloc: err_free: module_put(THIS_MODULE); if (err) - pr_notice("error %s: %s\n", ca->cache_dev_name, err); + pr_notice("error %pg: %s\n", ca->bdev, err); return ret; } @@ -2348,7 +2347,6 @@ static int register_cache(struct cache_sb *sb, struct cache_sb_disk *sb_disk, const char *err = NULL; /* must be set for any error case */ int ret = 0; - bdevname(bdev, ca->cache_dev_name); memcpy(&ca->sb, sb, sizeof(struct cache_sb)); ca->bdev = bdev; ca->bdev->bd_holder = ca; @@ -2390,14 +2388,14 @@ static int register_cache(struct cache_sb *sb, struct cache_sb_disk *sb_disk, goto out; } - pr_info("registered cache device %s\n", ca->cache_dev_name); + pr_info("registered cache device %pg\n", ca->bdev); out: kobject_put(&ca->kobj); err: if (err) - pr_notice("error %s: %s\n", ca->cache_dev_name, err); + pr_notice("error %pg: %s\n", ca->bdev, err); return ret; } @@ -2617,8 +2615,11 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, if (SB_IS_BDEV(sb)) { struct cached_dev *dc = kzalloc(sizeof(*dc), GFP_KERNEL); - if (!dc) + if (!dc) { + ret = -ENOMEM; + err = "cannot allocate memory"; goto out_put_sb_page; + } mutex_lock(&bch_register_lock); ret = register_bdev(sb, sb_disk, bdev, dc); @@ -2629,11 +2630,15 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr, } else { struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL); - if (!ca) + if (!ca) { + ret = -ENOMEM; + err = "cannot allocate memory"; goto out_put_sb_page; + } /* blkdev_put() will be called in bch_cache_release() */ - if (register_cache(sb, sb_disk, bdev, ca) != 0) + ret = register_cache(sb, sb_disk, bdev, ca); + if (ret) goto out_free_sb; } @@ -2750,7 +2755,7 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x) * The reason bch_register_lock is not held to call * bch_cache_set_stop() and bcache_device_stop() is to * avoid potential deadlock during reboot, because cache - * set or bcache device stopping process will acqurie + * set or bcache device stopping process will acquire * bch_register_lock too. * * We are safe here because bcache_is_reboot sets to diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c index 05ac1d6fbbf3..1f0dce30fa75 100644 --- a/drivers/md/bcache/sysfs.c +++ b/drivers/md/bcache/sysfs.c @@ -271,7 +271,7 @@ SHOW(__bch_cached_dev) } if (attr == &sysfs_backing_dev_name) { - snprintf(buf, BDEVNAME_SIZE + 1, "%s", dc->backing_dev_name); + snprintf(buf, BDEVNAME_SIZE + 1, "%pg", dc->bdev); strcat(buf, "\n"); return strlen(buf); } diff --git a/drivers/md/bcache/sysfs.h b/drivers/md/bcache/sysfs.h index 215df32f567b..c1752ba2e05b 100644 --- a/drivers/md/bcache/sysfs.h +++ b/drivers/md/bcache/sysfs.h @@ -51,13 +51,27 @@ STORE(fn) \ #define sysfs_printf(file, fmt, ...) \ do { \ if (attr == &sysfs_ ## file) \ - return snprintf(buf, PAGE_SIZE, fmt "\n", __VA_ARGS__); \ + return sysfs_emit(buf, fmt "\n", __VA_ARGS__); \ } while (0) #define sysfs_print(file, var) \ do { \ if (attr == &sysfs_ ## file) \ - return snprint(buf, PAGE_SIZE, var); \ + return sysfs_emit(buf, \ + __builtin_types_compatible_p(typeof(var), int) \ + ? "%i\n" : \ + __builtin_types_compatible_p(typeof(var), unsigned int) \ + ? "%u\n" : \ + __builtin_types_compatible_p(typeof(var), long) \ + ? "%li\n" : \ + __builtin_types_compatible_p(typeof(var), unsigned long)\ + ? "%lu\n" : \ + __builtin_types_compatible_p(typeof(var), int64_t) \ + ? "%lli\n" : \ + __builtin_types_compatible_p(typeof(var), uint64_t) \ + ? "%llu\n" : \ + __builtin_types_compatible_p(typeof(var), const char *) \ + ? "%s\n" : "%i\n", var); \ } while (0) #define sysfs_hprint(file, val) \ diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h index a7da7930a7fd..6f3cb7c92130 100644 --- a/drivers/md/bcache/util.h +++ b/drivers/md/bcache/util.h @@ -340,23 +340,6 @@ static inline int bch_strtoul_h(const char *cp, long *res) _r; \ }) -#define snprint(buf, size, var) \ - snprintf(buf, size, \ - __builtin_types_compatible_p(typeof(var), int) \ - ? "%i\n" : \ - __builtin_types_compatible_p(typeof(var), unsigned int) \ - ? "%u\n" : \ - __builtin_types_compatible_p(typeof(var), long) \ - ? "%li\n" : \ - __builtin_types_compatible_p(typeof(var), unsigned long)\ - ? "%lu\n" : \ - __builtin_types_compatible_p(typeof(var), int64_t) \ - ? "%lli\n" : \ - __builtin_types_compatible_p(typeof(var), uint64_t) \ - ? "%llu\n" : \ - __builtin_types_compatible_p(typeof(var), const char *) \ - ? "%s\n" : "%i\n", var) - ssize_t bch_hprint(char *buf, int64_t v); bool bch_is_zero(const char *p, size_t n); @@ -548,14 +531,6 @@ static inline uint64_t bch_crc64(const void *p, size_t len) return crc ^ 0xffffffffffffffffULL; } -static inline uint64_t bch_crc64_update(uint64_t crc, - const void *p, - size_t len) -{ - crc = crc64_be(crc, p, len); - return crc; -} - /* * A stepwise-linear pseudo-exponential. This returns 1 << (x >> * frac_bits), with the less-significant bits filled in by linear diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h index 55dccdfbcb22..b855fef4f38a 100644 --- a/drivers/md/dm-core.h +++ b/drivers/md/dm-core.h @@ -13,7 +13,7 @@ #include <linux/ktime.h> #include <linux/genhd.h> #include <linux/blk-mq.h> -#include <linux/keyslot-manager.h> +#include <linux/blk-crypto-profile.h> #include <trace/events/block.h> @@ -200,7 +200,7 @@ struct dm_table { struct dm_md_mempools *mempools; #ifdef CONFIG_BLK_INLINE_ENCRYPTION - struct blk_keyslot_manager *ksm; + struct blk_crypto_profile *crypto_profile; #endif }; diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index d95142102bd2..bcddc5effd15 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -170,7 +170,7 @@ static void free_devices(struct list_head *devices, struct mapped_device *md) } } -static void dm_table_destroy_keyslot_manager(struct dm_table *t); +static void dm_table_destroy_crypto_profile(struct dm_table *t); void dm_table_destroy(struct dm_table *t) { @@ -200,7 +200,7 @@ void dm_table_destroy(struct dm_table *t) dm_free_md_mempools(t->mempools); - dm_table_destroy_keyslot_manager(t); + dm_table_destroy_crypto_profile(t); kfree(t); } @@ -1186,8 +1186,8 @@ static int dm_table_register_integrity(struct dm_table *t) #ifdef CONFIG_BLK_INLINE_ENCRYPTION -struct dm_keyslot_manager { - struct blk_keyslot_manager ksm; +struct dm_crypto_profile { + struct blk_crypto_profile profile; struct mapped_device *md; }; @@ -1213,13 +1213,11 @@ static int dm_keyslot_evict_callback(struct dm_target *ti, struct dm_dev *dev, * When an inline encryption key is evicted from a device-mapper device, evict * it from all the underlying devices. */ -static int dm_keyslot_evict(struct blk_keyslot_manager *ksm, +static int dm_keyslot_evict(struct blk_crypto_profile *profile, const struct blk_crypto_key *key, unsigned int slot) { - struct dm_keyslot_manager *dksm = container_of(ksm, - struct dm_keyslot_manager, - ksm); - struct mapped_device *md = dksm->md; + struct mapped_device *md = + container_of(profile, struct dm_crypto_profile, profile)->md; struct dm_keyslot_evict_args args = { key }; struct dm_table *t; int srcu_idx; @@ -1239,150 +1237,148 @@ static int dm_keyslot_evict(struct blk_keyslot_manager *ksm, return args.err; } -static const struct blk_ksm_ll_ops dm_ksm_ll_ops = { - .keyslot_evict = dm_keyslot_evict, -}; - -static int device_intersect_crypto_modes(struct dm_target *ti, - struct dm_dev *dev, sector_t start, - sector_t len, void *data) +static int +device_intersect_crypto_capabilities(struct dm_target *ti, struct dm_dev *dev, + sector_t start, sector_t len, void *data) { - struct blk_keyslot_manager *parent = data; - struct blk_keyslot_manager *child = bdev_get_queue(dev->bdev)->ksm; + struct blk_crypto_profile *parent = data; + struct blk_crypto_profile *child = + bdev_get_queue(dev->bdev)->crypto_profile; - blk_ksm_intersect_modes(parent, child); + blk_crypto_intersect_capabilities(parent, child); return 0; } -void dm_destroy_keyslot_manager(struct blk_keyslot_manager *ksm) +void dm_destroy_crypto_profile(struct blk_crypto_profile *profile) { - struct dm_keyslot_manager *dksm = container_of(ksm, - struct dm_keyslot_manager, - ksm); + struct dm_crypto_profile *dmcp = container_of(profile, + struct dm_crypto_profile, + profile); - if (!ksm) + if (!profile) return; - blk_ksm_destroy(ksm); - kfree(dksm); + blk_crypto_profile_destroy(profile); + kfree(dmcp); } -static void dm_table_destroy_keyslot_manager(struct dm_table *t) +static void dm_table_destroy_crypto_profile(struct dm_table *t) { - dm_destroy_keyslot_manager(t->ksm); - t->ksm = NULL; + dm_destroy_crypto_profile(t->crypto_profile); + t->crypto_profile = NULL; } /* - * Constructs and initializes t->ksm with a keyslot manager that - * represents the common set of crypto capabilities of the devices - * described by the dm_table. However, if the constructed keyslot - * manager does not support a superset of the crypto capabilities - * supported by the current keyslot manager of the mapped_device, - * it returns an error instead, since we don't support restricting - * crypto capabilities on table changes. Finally, if the constructed - * keyslot manager doesn't actually support any crypto modes at all, - * it just returns NULL. + * Constructs and initializes t->crypto_profile with a crypto profile that + * represents the common set of crypto capabilities of the devices described by + * the dm_table. However, if the constructed crypto profile doesn't support all + * crypto capabilities that are supported by the current mapped_device, it + * returns an error instead, since we don't support removing crypto capabilities + * on table changes. Finally, if the constructed crypto profile is "empty" (has + * no crypto capabilities at all), it just sets t->crypto_profile to NULL. */ -static int dm_table_construct_keyslot_manager(struct dm_table *t) +static int dm_table_construct_crypto_profile(struct dm_table *t) { - struct dm_keyslot_manager *dksm; - struct blk_keyslot_manager *ksm; + struct dm_crypto_profile *dmcp; + struct blk_crypto_profile *profile; struct dm_target *ti; unsigned int i; - bool ksm_is_empty = true; + bool empty_profile = true; - dksm = kmalloc(sizeof(*dksm), GFP_KERNEL); - if (!dksm) + dmcp = kmalloc(sizeof(*dmcp), GFP_KERNEL); + if (!dmcp) return -ENOMEM; - dksm->md = t->md; + dmcp->md = t->md; - ksm = &dksm->ksm; - blk_ksm_init_passthrough(ksm); - ksm->ksm_ll_ops = dm_ksm_ll_ops; - ksm->max_dun_bytes_supported = UINT_MAX; - memset(ksm->crypto_modes_supported, 0xFF, - sizeof(ksm->crypto_modes_supported)); + profile = &dmcp->profile; + blk_crypto_profile_init(profile, 0); + profile->ll_ops.keyslot_evict = dm_keyslot_evict; + profile->max_dun_bytes_supported = UINT_MAX; + memset(profile->modes_supported, 0xFF, + sizeof(profile->modes_supported)); for (i = 0; i < dm_table_get_num_targets(t); i++) { ti = dm_table_get_target(t, i); if (!dm_target_passes_crypto(ti->type)) { - blk_ksm_intersect_modes(ksm, NULL); + blk_crypto_intersect_capabilities(profile, NULL); break; } if (!ti->type->iterate_devices) continue; - ti->type->iterate_devices(ti, device_intersect_crypto_modes, - ksm); + ti->type->iterate_devices(ti, + device_intersect_crypto_capabilities, + profile); } - if (t->md->queue && !blk_ksm_is_superset(ksm, t->md->queue->ksm)) { + if (t->md->queue && + !blk_crypto_has_capabilities(profile, + t->md->queue->crypto_profile)) { DMWARN("Inline encryption capabilities of new DM table were more restrictive than the old table's. This is not supported!"); - dm_destroy_keyslot_manager(ksm); + dm_destroy_crypto_profile(profile); return -EINVAL; } /* - * If the new KSM doesn't actually support any crypto modes, we may as - * well represent it with a NULL ksm. + * If the new profile doesn't actually support any crypto capabilities, + * we may as well represent it with a NULL profile. */ - ksm_is_empty = true; - for (i = 0; i < ARRAY_SIZE(ksm->crypto_modes_supported); i++) { - if (ksm->crypto_modes_supported[i]) { - ksm_is_empty = false; + for (i = 0; i < ARRAY_SIZE(profile->modes_supported); i++) { + if (profile->modes_supported[i]) { + empty_profile = false; break; } } - if (ksm_is_empty) { - dm_destroy_keyslot_manager(ksm); - ksm = NULL; + if (empty_profile) { + dm_destroy_crypto_profile(profile); + profile = NULL; } /* - * t->ksm is only set temporarily while the table is being set - * up, and it gets set to NULL after the capabilities have - * been transferred to the request_queue. + * t->crypto_profile is only set temporarily while the table is being + * set up, and it gets set to NULL after the profile has been + * transferred to the request_queue. */ - t->ksm = ksm; + t->crypto_profile = profile; return 0; } -static void dm_update_keyslot_manager(struct request_queue *q, - struct dm_table *t) +static void dm_update_crypto_profile(struct request_queue *q, + struct dm_table *t) { - if (!t->ksm) + if (!t->crypto_profile) return; - /* Make the ksm less restrictive */ - if (!q->ksm) { - blk_ksm_register(t->ksm, q); + /* Make the crypto profile less restrictive. */ + if (!q->crypto_profile) { + blk_crypto_register(t->crypto_profile, q); } else { - blk_ksm_update_capabilities(q->ksm, t->ksm); - dm_destroy_keyslot_manager(t->ksm); + blk_crypto_update_capabilities(q->crypto_profile, + t->crypto_profile); + dm_destroy_crypto_profile(t->crypto_profile); } - t->ksm = NULL; + t->crypto_profile = NULL; } #else /* CONFIG_BLK_INLINE_ENCRYPTION */ -static int dm_table_construct_keyslot_manager(struct dm_table *t) +static int dm_table_construct_crypto_profile(struct dm_table *t) { return 0; } -void dm_destroy_keyslot_manager(struct blk_keyslot_manager *ksm) +void dm_destroy_crypto_profile(struct blk_crypto_profile *profile) { } -static void dm_table_destroy_keyslot_manager(struct dm_table *t) +static void dm_table_destroy_crypto_profile(struct dm_table *t) { } -static void dm_update_keyslot_manager(struct request_queue *q, - struct dm_table *t) +static void dm_update_crypto_profile(struct request_queue *q, + struct dm_table *t) { } @@ -1414,9 +1410,9 @@ int dm_table_complete(struct dm_table *t) return r; } - r = dm_table_construct_keyslot_manager(t); + r = dm_table_construct_crypto_profile(t); if (r) { - DMERR("could not construct keyslot manager."); + DMERR("could not construct crypto profile."); return r; } @@ -2070,7 +2066,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, return r; } - dm_update_keyslot_manager(q, t); + dm_update_crypto_profile(q, t); disk_update_readahead(t->md->disk); return 0; diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 7870e6460633..63aa52263658 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -29,7 +29,7 @@ #include <linux/refcount.h> #include <linux/part_stat.h> #include <linux/blk-crypto.h> -#include <linux/keyslot-manager.h> +#include <linux/blk-crypto-profile.h> #define DM_MSG_PREFIX "core" @@ -1663,14 +1663,14 @@ static const struct dax_operations dm_dax_ops; static void dm_wq_work(struct work_struct *work); #ifdef CONFIG_BLK_INLINE_ENCRYPTION -static void dm_queue_destroy_keyslot_manager(struct request_queue *q) +static void dm_queue_destroy_crypto_profile(struct request_queue *q) { - dm_destroy_keyslot_manager(q->ksm); + dm_destroy_crypto_profile(q->crypto_profile); } #else /* CONFIG_BLK_INLINE_ENCRYPTION */ -static inline void dm_queue_destroy_keyslot_manager(struct request_queue *q) +static inline void dm_queue_destroy_crypto_profile(struct request_queue *q) { } #endif /* !CONFIG_BLK_INLINE_ENCRYPTION */ @@ -1696,7 +1696,7 @@ static void cleanup_mapped_device(struct mapped_device *md) dm_sysfs_exit(md); del_gendisk(md->disk); } - dm_queue_destroy_keyslot_manager(md->queue); + dm_queue_destroy_crypto_profile(md->queue); blk_cleanup_disk(md->disk); } @@ -2078,7 +2078,9 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t) if (r) return r; - add_disk(md->disk); + r = add_disk(md->disk); + if (r) + return r; r = dm_sysfs_init(md); if (r) { diff --git a/drivers/md/md.c b/drivers/md/md.c index d964da436383..5111ed966947 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -354,7 +354,7 @@ static bool create_on_open = true; */ static DECLARE_WAIT_QUEUE_HEAD(md_event_waiters); static atomic_t md_event_count; -void md_new_event(struct mddev *mddev) +void md_new_event(void) { atomic_inc(&md_event_count); wake_up(&md_event_waiters); @@ -2882,7 +2882,7 @@ static int add_bound_rdev(struct md_rdev *rdev) if (mddev->degraded) set_bit(MD_RECOVERY_RECOVER, &mddev->recovery); set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); - md_new_event(mddev); + md_new_event(); md_wakeup_thread(mddev->thread); return 0; } @@ -2972,7 +2972,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) * -write_error - clears WriteErrorSeen * {,-}failfast - set/clear FailFast */ + + struct mddev *mddev = rdev->mddev; int err = -EINVAL; + bool need_update_sb = false; + if (cmd_match(buf, "faulty") && rdev->mddev->pers) { md_error(rdev->mddev, rdev); if (test_bit(Faulty, &rdev->flags)) @@ -2987,7 +2991,6 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) if (rdev->raid_disk >= 0) err = -EBUSY; else { - struct mddev *mddev = rdev->mddev; err = 0; if (mddev_is_clustered(mddev)) err = md_cluster_ops->remove_disk(mddev, rdev); @@ -2998,16 +3001,18 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags); md_wakeup_thread(mddev->thread); } - md_new_event(mddev); + md_new_event(); } } } else if (cmd_match(buf, "writemostly")) { set_bit(WriteMostly, &rdev->flags); mddev_create_serial_pool(rdev->mddev, rdev, false); + need_update_sb = true; err = 0; } else if (cmd_match(buf, "-writemostly")) { mddev_destroy_serial_pool(rdev->mddev, rdev, false); clear_bit(WriteMostly, &rdev->flags); + need_update_sb = true; err = 0; } else if (cmd_match(buf, "blocked")) { set_bit(Blocked, &rdev->flags); @@ -3033,9 +3038,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) err = 0; } else if (cmd_match(buf, "failfast")) { set_bit(FailFast, &rdev->flags); + need_update_sb = true; err = 0; } else if (cmd_match(buf, "-failfast")) { clear_bit(FailFast, &rdev->flags); + need_update_sb = true; err = 0; } else if (cmd_match(buf, "-insync") && rdev->raid_disk >= 0 && !test_bit(Journal, &rdev->flags)) { @@ -3114,6 +3121,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len) clear_bit(ExternalBbl, &rdev->flags); err = 0; } + if (need_update_sb) + md_update_sb(mddev, 1); if (!err) sysfs_notify_dirent_safe(rdev->sysfs_state); return err ? err : len; @@ -4095,7 +4104,7 @@ level_store(struct mddev *mddev, const char *buf, size_t len) if (!mddev->thread) md_update_sb(mddev, 1); sysfs_notify_dirent_safe(mddev->sysfs_level); - md_new_event(mddev); + md_new_event(); rv = len; out_unlock: mddev_unlock(mddev); @@ -4616,7 +4625,7 @@ new_dev_store(struct mddev *mddev, const char *buf, size_t len) export_rdev(rdev); mddev_unlock(mddev); if (!err) - md_new_event(mddev); + md_new_event(); return err ? err : len; } @@ -5486,6 +5495,10 @@ static struct attribute *md_default_attrs[] = { NULL, }; +static const struct attribute_group md_default_group = { + .attrs = md_default_attrs, +}; + static struct attribute *md_redundancy_attrs[] = { &md_scan_mode.attr, &md_last_scan_mode.attr, @@ -5508,6 +5521,12 @@ static const struct attribute_group md_redundancy_group = { .attrs = md_redundancy_attrs, }; +static const struct attribute_group *md_attr_groups[] = { + &md_default_group, + &md_bitmap_group, + NULL, +}; + static ssize_t md_attr_show(struct kobject *kobj, struct attribute *attr, char *page) { @@ -5583,7 +5602,7 @@ static const struct sysfs_ops md_sysfs_ops = { static struct kobj_type md_ktype = { .release = md_free, .sysfs_ops = &md_sysfs_ops, - .default_attrs = md_default_attrs, + .default_groups = md_attr_groups, }; int mdp_major = 0; @@ -5592,7 +5611,6 @@ static void mddev_delayed_delete(struct work_struct *ws) { struct mddev *mddev = container_of(ws, struct mddev, del_work); - sysfs_remove_group(&mddev->kobj, &md_bitmap_group); kobject_del(&mddev->kobj); kobject_put(&mddev->kobj); } @@ -5659,7 +5677,7 @@ static int md_alloc(dev_t dev, char *name) strcmp(mddev2->gendisk->disk_name, name) == 0) { spin_unlock(&all_mddevs_lock); error = -EEXIST; - goto abort; + goto out_unlock_disks_mutex; } spin_unlock(&all_mddevs_lock); } @@ -5672,7 +5690,7 @@ static int md_alloc(dev_t dev, char *name) error = -ENOMEM; disk = blk_alloc_disk(NUMA_NO_NODE); if (!disk) - goto abort; + goto out_unlock_disks_mutex; disk->major = MAJOR(mddev->unit); disk->first_minor = unit << shift; @@ -5696,27 +5714,25 @@ static int md_alloc(dev_t dev, char *name) disk->flags |= GENHD_FL_EXT_DEVT; disk->events |= DISK_EVENT_MEDIA_CHANGE; mddev->gendisk = disk; - add_disk(disk); + error = add_disk(disk); + if (error) + goto out_cleanup_disk; error = kobject_add(&mddev->kobj, &disk_to_dev(disk)->kobj, "%s", "md"); - if (error) { - /* This isn't possible, but as kobject_init_and_add is marked - * __must_check, we must do something with the result - */ - pr_debug("md: cannot register %s/md - name in use\n", - disk->disk_name); - error = 0; - } - if (mddev->kobj.sd && - sysfs_create_group(&mddev->kobj, &md_bitmap_group)) - pr_debug("pointless warning\n"); - abort: + if (error) + goto out_del_gendisk; + + kobject_uevent(&mddev->kobj, KOBJ_ADD); + mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state"); + mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level"); + goto out_unlock_disks_mutex; + +out_del_gendisk: + del_gendisk(disk); +out_cleanup_disk: + blk_cleanup_disk(disk); +out_unlock_disks_mutex: mutex_unlock(&disks_mutex); - if (!error && mddev->kobj.sd) { - kobject_uevent(&mddev->kobj, KOBJ_ADD); - mddev->sysfs_state = sysfs_get_dirent_safe(mddev->kobj.sd, "array_state"); - mddev->sysfs_level = sysfs_get_dirent_safe(mddev->kobj.sd, "level"); - } mddev_put(mddev); return error; } @@ -6030,7 +6046,7 @@ int md_run(struct mddev *mddev) if (mddev->sb_flags) md_update_sb(mddev, 0); - md_new_event(mddev); + md_new_event(); return 0; bitmap_abort: @@ -6420,7 +6436,7 @@ static int do_md_stop(struct mddev *mddev, int mode, if (mddev->hold_active == UNTIL_STOP) mddev->hold_active = 0; } - md_new_event(mddev); + md_new_event(); sysfs_notify_dirent_safe(mddev->sysfs_state); return 0; } @@ -6924,7 +6940,7 @@ kick_rdev: md_wakeup_thread(mddev->thread); else md_update_sb(mddev, 1); - md_new_event(mddev); + md_new_event(); return 0; busy: @@ -6997,7 +7013,7 @@ static int hot_add_disk(struct mddev *mddev, dev_t dev) */ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); md_wakeup_thread(mddev->thread); - md_new_event(mddev); + md_new_event(); return 0; abort_export: @@ -7971,7 +7987,7 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev) md_wakeup_thread(mddev->thread); if (mddev->event_work.func) queue_work(md_misc_wq, &mddev->event_work); - md_new_event(mddev); + md_new_event(); } EXPORT_SYMBOL(md_error); @@ -8855,7 +8871,7 @@ void md_do_sync(struct md_thread *thread) mddev->curr_resync = 3; /* no longer delayed */ mddev->curr_resync_completed = j; sysfs_notify_dirent_safe(mddev->sysfs_completed); - md_new_event(mddev); + md_new_event(); update_time = jiffies; blk_start_plug(&plug); @@ -8926,7 +8942,7 @@ void md_do_sync(struct md_thread *thread) /* this is the earliest that rebuild will be * visible in /proc/mdstat */ - md_new_event(mddev); + md_new_event(); if (last_check + window > io_sectors || j == max_sectors) continue; @@ -9150,7 +9166,7 @@ static int remove_and_add_spares(struct mddev *mddev, sysfs_link_rdev(mddev, rdev); if (!test_bit(Journal, &rdev->flags)) spares++; - md_new_event(mddev); + md_new_event(); set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags); } } @@ -9184,7 +9200,7 @@ static void md_start_sync(struct work_struct *ws) } else md_wakeup_thread(mddev->sync_thread); sysfs_notify_dirent_safe(mddev->sysfs_action); - md_new_event(mddev); + md_new_event(); } /* @@ -9443,7 +9459,7 @@ void md_reap_sync_thread(struct mddev *mddev) /* flag recovery needed just to double check */ set_bit(MD_RECOVERY_NEEDED, &mddev->recovery); sysfs_notify_dirent_safe(mddev->sysfs_action); - md_new_event(mddev); + md_new_event(); if (mddev->event_work.func) queue_work(md_misc_wq, &mddev->event_work); } diff --git a/drivers/md/md.h b/drivers/md/md.h index 4c96c36bd01a..53ea7a6961de 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -731,7 +731,7 @@ extern int sync_page_io(struct md_rdev *rdev, sector_t sector, int size, struct page *page, int op, int op_flags, bool metadata_op); extern void md_do_sync(struct md_thread *thread); -extern void md_new_event(struct mddev *mddev); +extern void md_new_event(void); extern void md_allow_write(struct mddev *mddev); extern void md_wait_for_blocked_rdev(struct md_rdev *rdev, struct mddev *mddev); extern void md_set_array_sectors(struct mddev *mddev, sector_t array_sectors); diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 19598bd38939..7dc8026cf6ee 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1496,7 +1496,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, if (!r1_bio->bios[i]) continue; - if (first_clone) { + if (first_clone && test_bit(WriteMostly, &rdev->flags)) { /* do behind I/O ? * Not if there are too many, or cannot * allocate memory, or a reader on WriteMostly @@ -1529,13 +1529,12 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, r1_bio->bios[i] = mbio; - mbio->bi_iter.bi_sector = (r1_bio->sector + - conf->mirrors[i].rdev->data_offset); - bio_set_dev(mbio, conf->mirrors[i].rdev->bdev); + mbio->bi_iter.bi_sector = (r1_bio->sector + rdev->data_offset); + bio_set_dev(mbio, rdev->bdev); mbio->bi_end_io = raid1_end_write_request; mbio->bi_opf = bio_op(bio) | (bio->bi_opf & (REQ_SYNC | REQ_FUA)); - if (test_bit(FailFast, &conf->mirrors[i].rdev->flags) && - !test_bit(WriteMostly, &conf->mirrors[i].rdev->flags) && + if (test_bit(FailFast, &rdev->flags) && + !test_bit(WriteMostly, &rdev->flags) && conf->raid_disks - mddev->degraded > 1) mbio->bi_opf |= MD_FAILFAST; mbio->bi_private = r1_bio; @@ -1546,7 +1545,7 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, trace_block_bio_remap(mbio, disk_devt(mddev->gendisk), r1_bio->sector); /* flush_pending_writes() needs access to the rdev so...*/ - mbio->bi_bdev = (void *)conf->mirrors[i].rdev; + mbio->bi_bdev = (void *)rdev; cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug)); if (cb) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index aa2636582841..dde98f65bd04 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -4647,7 +4647,7 @@ out: } conf->reshape_checkpoint = jiffies; md_wakeup_thread(mddev->sync_thread); - md_new_event(mddev); + md_new_event(); return 0; abort: diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 02ed53b20654..9c1a5877cf9f 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -7732,10 +7732,7 @@ static int raid5_run(struct mddev *mddev) * discard data disk but write parity disk */ stripe = stripe * PAGE_SIZE; - /* Round up to power of 2, as discard handling - * currently assumes that */ - while ((stripe-1) & stripe) - stripe = (stripe | (stripe-1)) + 1; + stripe = roundup_pow_of_two(stripe); mddev->queue->limits.discard_alignment = stripe; mddev->queue->limits.discard_granularity = stripe; @@ -8282,7 +8279,7 @@ static int raid5_start_reshape(struct mddev *mddev) } conf->reshape_checkpoint = jiffies; md_wakeup_thread(mddev->sync_thread); - md_new_event(mddev); + md_new_event(); return 0; } diff --git a/drivers/mmc/core/crypto.c b/drivers/mmc/core/crypto.c index 67557808cada..fec4fbf16a5b 100644 --- a/drivers/mmc/core/crypto.c +++ b/drivers/mmc/core/crypto.c @@ -16,13 +16,13 @@ void mmc_crypto_set_initial_state(struct mmc_host *host) { /* Reset might clear all keys, so reprogram all the keys. */ if (host->caps2 & MMC_CAP2_CRYPTO) - blk_ksm_reprogram_all_keys(&host->ksm); + blk_crypto_reprogram_all_keys(&host->crypto_profile); } void mmc_crypto_setup_queue(struct request_queue *q, struct mmc_host *host) { if (host->caps2 & MMC_CAP2_CRYPTO) - blk_ksm_register(&host->ksm, q); + blk_crypto_register(&host->crypto_profile, q); } EXPORT_SYMBOL_GPL(mmc_crypto_setup_queue); @@ -30,12 +30,15 @@ void mmc_crypto_prepare_req(struct mmc_queue_req *mqrq) { struct request *req = mmc_queue_req_to_req(mqrq); struct mmc_request *mrq = &mqrq->brq.mrq; + struct blk_crypto_keyslot *keyslot; if (!req->crypt_ctx) return; mrq->crypto_ctx = req->crypt_ctx; - if (req->crypt_keyslot) - mrq->crypto_key_slot = blk_ksm_get_slot_idx(req->crypt_keyslot); + + keyslot = req->crypt_keyslot; + if (keyslot) + mrq->crypto_key_slot = blk_crypto_keyslot_index(keyslot); } EXPORT_SYMBOL_GPL(mmc_crypto_prepare_req); diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig index 95b3511b0560..ccc148cdb5ee 100644 --- a/drivers/mmc/host/Kconfig +++ b/drivers/mmc/host/Kconfig @@ -506,7 +506,7 @@ config MMC_OMAP_HS config MMC_WBSD tristate "Winbond W83L51xD SD/MMC Card Interface support" - depends on ISA_DMA_API + depends on ISA_DMA_API && !M68K help This selects the Winbond(R) W83L51xD Secure digital and Multimedia card Interface. diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c index 38559a956330..31f841231609 100644 --- a/drivers/mmc/host/cqhci-core.c +++ b/drivers/mmc/host/cqhci-core.c @@ -282,6 +282,9 @@ static void __cqhci_enable(struct cqhci_host *cq_host) cqhci_writel(cq_host, cqcfg, CQHCI_CFG); + if (cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT) + cqhci_writel(cq_host, 0, CQHCI_CTL); + mmc->cqe_on = true; if (cq_host->ops->enable) diff --git a/drivers/mmc/host/cqhci-crypto.c b/drivers/mmc/host/cqhci-crypto.c index 6419cfbb4ab7..d5f4b6972f63 100644 --- a/drivers/mmc/host/cqhci-crypto.c +++ b/drivers/mmc/host/cqhci-crypto.c @@ -6,7 +6,7 @@ */ #include <linux/blk-crypto.h> -#include <linux/keyslot-manager.h> +#include <linux/blk-crypto-profile.h> #include <linux/mmc/host.h> #include "cqhci-crypto.h" @@ -23,9 +23,10 @@ static const struct cqhci_crypto_alg_entry { }; static inline struct cqhci_host * -cqhci_host_from_ksm(struct blk_keyslot_manager *ksm) +cqhci_host_from_crypto_profile(struct blk_crypto_profile *profile) { - struct mmc_host *mmc = container_of(ksm, struct mmc_host, ksm); + struct mmc_host *mmc = + container_of(profile, struct mmc_host, crypto_profile); return mmc->cqe_private; } @@ -57,12 +58,12 @@ static int cqhci_crypto_program_key(struct cqhci_host *cq_host, return 0; } -static int cqhci_crypto_keyslot_program(struct blk_keyslot_manager *ksm, +static int cqhci_crypto_keyslot_program(struct blk_crypto_profile *profile, const struct blk_crypto_key *key, unsigned int slot) { - struct cqhci_host *cq_host = cqhci_host_from_ksm(ksm); + struct cqhci_host *cq_host = cqhci_host_from_crypto_profile(profile); const union cqhci_crypto_cap_entry *ccap_array = cq_host->crypto_cap_array; const struct cqhci_crypto_alg_entry *alg = @@ -115,11 +116,11 @@ static int cqhci_crypto_clear_keyslot(struct cqhci_host *cq_host, int slot) return cqhci_crypto_program_key(cq_host, &cfg, slot); } -static int cqhci_crypto_keyslot_evict(struct blk_keyslot_manager *ksm, +static int cqhci_crypto_keyslot_evict(struct blk_crypto_profile *profile, const struct blk_crypto_key *key, unsigned int slot) { - struct cqhci_host *cq_host = cqhci_host_from_ksm(ksm); + struct cqhci_host *cq_host = cqhci_host_from_crypto_profile(profile); return cqhci_crypto_clear_keyslot(cq_host, slot); } @@ -132,7 +133,7 @@ static int cqhci_crypto_keyslot_evict(struct blk_keyslot_manager *ksm, * "enabled" when these are called, i.e. CQHCI_ENABLE might not be set in the * CQHCI_CFG register. But the hardware allows that. */ -static const struct blk_ksm_ll_ops cqhci_ksm_ops = { +static const struct blk_crypto_ll_ops cqhci_crypto_ops = { .keyslot_program = cqhci_crypto_keyslot_program, .keyslot_evict = cqhci_crypto_keyslot_evict, }; @@ -157,8 +158,8 @@ cqhci_find_blk_crypto_mode(union cqhci_crypto_cap_entry cap) * * If the driver previously set MMC_CAP2_CRYPTO and the CQE declares * CQHCI_CAP_CS, initialize the crypto support. This involves reading the - * crypto capability registers, initializing the keyslot manager, clearing all - * keyslots, and enabling 128-bit task descriptors. + * crypto capability registers, initializing the blk_crypto_profile, clearing + * all keyslots, and enabling 128-bit task descriptors. * * Return: 0 if crypto was initialized or isn't supported; whether * MMC_CAP2_CRYPTO remains set indicates which one of those cases it is. @@ -168,7 +169,7 @@ int cqhci_crypto_init(struct cqhci_host *cq_host) { struct mmc_host *mmc = cq_host->mmc; struct device *dev = mmc_dev(mmc); - struct blk_keyslot_manager *ksm = &mmc->ksm; + struct blk_crypto_profile *profile = &mmc->crypto_profile; unsigned int num_keyslots; unsigned int cap_idx; enum blk_crypto_mode_num blk_mode_num; @@ -199,15 +200,15 @@ int cqhci_crypto_init(struct cqhci_host *cq_host) */ num_keyslots = cq_host->crypto_capabilities.config_count + 1; - err = devm_blk_ksm_init(dev, ksm, num_keyslots); + err = devm_blk_crypto_profile_init(dev, profile, num_keyslots); if (err) goto out; - ksm->ksm_ll_ops = cqhci_ksm_ops; - ksm->dev = dev; + profile->ll_ops = cqhci_crypto_ops; + profile->dev = dev; /* Unfortunately, CQHCI crypto only supports 32 DUN bits. */ - ksm->max_dun_bytes_supported = 4; + profile->max_dun_bytes_supported = 4; /* * Cache all the crypto capabilities and advertise the supported crypto @@ -223,7 +224,7 @@ int cqhci_crypto_init(struct cqhci_host *cq_host) cq_host->crypto_cap_array[cap_idx]); if (blk_mode_num == BLK_ENCRYPTION_MODE_INVALID) continue; - ksm->crypto_modes_supported[blk_mode_num] |= + profile->modes_supported[blk_mode_num] |= cq_host->crypto_cap_array[cap_idx].sdus_mask * 512; } diff --git a/drivers/mmc/host/dw_mmc-exynos.c b/drivers/mmc/host/dw_mmc-exynos.c index 0c75810812a0..1f8a3c0ddfe1 100644 --- a/drivers/mmc/host/dw_mmc-exynos.c +++ b/drivers/mmc/host/dw_mmc-exynos.c @@ -464,6 +464,18 @@ static s8 dw_mci_exynos_get_best_clksmpl(u8 candiates) } } + /* + * If there is no cadiates value, then it needs to return -EIO. + * If there are candiates values and don't find bset clk sample value, + * then use a first candiates clock sample value. + */ + for (i = 0; i < iter; i++) { + __c = ror8(candiates, i); + if ((__c & 0x1) == 0x1) { + loc = i; + goto out; + } + } out: return loc; } @@ -494,6 +506,8 @@ static int dw_mci_exynos_execute_tuning(struct dw_mci_slot *slot, u32 opcode) priv->tuned_sample = found; } else { ret = -EIO; + dev_warn(&mmc->class_dev, + "There is no candiates value about clksmpl!\n"); } return ret; diff --git a/drivers/mmc/host/mtk-sd.c b/drivers/mmc/host/mtk-sd.c index 4dfc246c5f95..b06b4dcb7c78 100644 --- a/drivers/mmc/host/mtk-sd.c +++ b/drivers/mmc/host/mtk-sd.c @@ -2577,6 +2577,25 @@ static int msdc_drv_probe(struct platform_device *pdev) host->dma_mask = DMA_BIT_MASK(32); mmc_dev(mmc)->dma_mask = &host->dma_mask; + host->timeout_clks = 3 * 1048576; + host->dma.gpd = dma_alloc_coherent(&pdev->dev, + 2 * sizeof(struct mt_gpdma_desc), + &host->dma.gpd_addr, GFP_KERNEL); + host->dma.bd = dma_alloc_coherent(&pdev->dev, + MAX_BD_NUM * sizeof(struct mt_bdma_desc), + &host->dma.bd_addr, GFP_KERNEL); + if (!host->dma.gpd || !host->dma.bd) { + ret = -ENOMEM; + goto release_mem; + } + msdc_init_gpd_bd(host, &host->dma); + INIT_DELAYED_WORK(&host->req_timeout, msdc_request_timeout); + spin_lock_init(&host->lock); + + platform_set_drvdata(pdev, mmc); + msdc_ungate_clock(host); + msdc_init_hw(host); + if (mmc->caps2 & MMC_CAP2_CQE) { host->cq_host = devm_kzalloc(mmc->parent, sizeof(*host->cq_host), @@ -2597,25 +2616,6 @@ static int msdc_drv_probe(struct platform_device *pdev) mmc->max_seg_size = 64 * 1024; } - host->timeout_clks = 3 * 1048576; - host->dma.gpd = dma_alloc_coherent(&pdev->dev, - 2 * sizeof(struct mt_gpdma_desc), - &host->dma.gpd_addr, GFP_KERNEL); - host->dma.bd = dma_alloc_coherent(&pdev->dev, - MAX_BD_NUM * sizeof(struct mt_bdma_desc), - &host->dma.bd_addr, GFP_KERNEL); - if (!host->dma.gpd || !host->dma.bd) { - ret = -ENOMEM; - goto release_mem; - } - msdc_init_gpd_bd(host, &host->dma); - INIT_DELAYED_WORK(&host->req_timeout, msdc_request_timeout); - spin_lock_init(&host->lock); - - platform_set_drvdata(pdev, mmc); - msdc_ungate_clock(host); - msdc_init_hw(host); - ret = devm_request_irq(&pdev->dev, host->irq, msdc_irq, IRQF_TRIGGER_NONE, pdev->name, host); if (ret) diff --git a/drivers/mmc/host/sdhci-esdhc-imx.c b/drivers/mmc/host/sdhci-esdhc-imx.c index f18d169bc8ff..e658f0174242 100644 --- a/drivers/mmc/host/sdhci-esdhc-imx.c +++ b/drivers/mmc/host/sdhci-esdhc-imx.c @@ -1187,6 +1187,7 @@ static void esdhc_reset_tuning(struct sdhci_host *host) struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host); struct pltfm_imx_data *imx_data = sdhci_pltfm_priv(pltfm_host); u32 ctrl; + int ret; /* Reset the tuning circuit */ if (esdhc_is_usdhc(imx_data)) { @@ -1199,7 +1200,22 @@ static void esdhc_reset_tuning(struct sdhci_host *host) } else if (imx_data->socdata->flags & ESDHC_FLAG_STD_TUNING) { ctrl = readl(host->ioaddr + SDHCI_AUTO_CMD_STATUS); ctrl &= ~ESDHC_MIX_CTRL_SMPCLK_SEL; + ctrl &= ~ESDHC_MIX_CTRL_EXE_TUNE; writel(ctrl, host->ioaddr + SDHCI_AUTO_CMD_STATUS); + /* Make sure ESDHC_MIX_CTRL_EXE_TUNE cleared */ + ret = readl_poll_timeout(host->ioaddr + SDHCI_AUTO_CMD_STATUS, + ctrl, !(ctrl & ESDHC_MIX_CTRL_EXE_TUNE), 1, 50); + if (ret == -ETIMEDOUT) + dev_warn(mmc_dev(host->mmc), + "Warning! clear execute tuning bit failed\n"); + /* + * SDHCI_INT_DATA_AVAIL is W1C bit, set this bit will clear the + * usdhc IP internal logic flag execute_tuning_with_clr_buf, which + * will finally make sure the normal data transfer logic correct. + */ + ctrl = readl(host->ioaddr + SDHCI_INT_STATUS); + ctrl |= SDHCI_INT_DATA_AVAIL; + writel(ctrl, host->ioaddr + SDHCI_INT_STATUS); } } } diff --git a/drivers/mmc/host/sdhci-pci-core.c b/drivers/mmc/host/sdhci-pci-core.c index be19785227fe..d0f2edfe296c 100644 --- a/drivers/mmc/host/sdhci-pci-core.c +++ b/drivers/mmc/host/sdhci-pci-core.c @@ -616,16 +616,12 @@ static int intel_select_drive_strength(struct mmc_card *card, return intel_host->drv_strength; } -static int bxt_get_cd(struct mmc_host *mmc) +static int sdhci_get_cd_nogpio(struct mmc_host *mmc) { - int gpio_cd = mmc_gpio_get_cd(mmc); struct sdhci_host *host = mmc_priv(mmc); unsigned long flags; int ret = 0; - if (!gpio_cd) - return 0; - spin_lock_irqsave(&host->lock, flags); if (host->flags & SDHCI_DEVICE_DEAD) @@ -638,6 +634,21 @@ out: return ret; } +static int bxt_get_cd(struct mmc_host *mmc) +{ + int gpio_cd = mmc_gpio_get_cd(mmc); + + if (!gpio_cd) + return 0; + + return sdhci_get_cd_nogpio(mmc); +} + +static int mrfld_get_cd(struct mmc_host *mmc) +{ + return sdhci_get_cd_nogpio(mmc); +} + #define SDHCI_INTEL_PWR_TIMEOUT_CNT 20 #define SDHCI_INTEL_PWR_TIMEOUT_UDELAY 100 @@ -1341,6 +1352,14 @@ static int intel_mrfld_mmc_probe_slot(struct sdhci_pci_slot *slot) MMC_CAP_1_8V_DDR; break; case INTEL_MRFLD_SD: + slot->cd_idx = 0; + slot->cd_override_level = true; + /* + * There are two PCB designs of SD card slot with the opposite + * card detection sense. Quirk this out by ignoring GPIO state + * completely in the custom ->get_cd() callback. + */ + slot->host->mmc_host_ops.get_cd = mrfld_get_cd; slot->host->quirks2 |= SDHCI_QUIRK2_NO_1_8_V; break; case INTEL_MRFLD_SDIO: diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index 8eefa7d5fe85..2d80a04e11d8 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -2042,6 +2042,12 @@ void sdhci_set_power_noreg(struct sdhci_host *host, unsigned char mode, break; case MMC_VDD_32_33: case MMC_VDD_33_34: + /* + * 3.4 ~ 3.6V are valid only for those platforms where it's + * known that the voltage range is supported by hardware. + */ + case MMC_VDD_34_35: + case MMC_VDD_35_36: pwr = SDHCI_POWER_330; break; default: diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c index 7dfc26f48c18..e2affa52ef46 100644 --- a/drivers/mmc/host/tmio_mmc_core.c +++ b/drivers/mmc/host/tmio_mmc_core.c @@ -195,6 +195,10 @@ static void tmio_mmc_reset(struct tmio_mmc_host *host) sd_ctrl_write32_as_16_and_16(host, CTL_IRQ_MASK, host->sdcard_irq_mask_all); host->sdcard_irq_mask = host->sdcard_irq_mask_all; + if (host->native_hotplug) + tmio_mmc_enable_mmc_irqs(host, + TMIO_STAT_CARD_REMOVE | TMIO_STAT_CARD_INSERT); + tmio_mmc_set_bus_width(host, host->mmc->ios.bus_width); if (host->pdata->flags & TMIO_MMC_SDIO_IRQ) { @@ -956,8 +960,15 @@ static void tmio_mmc_set_ios(struct mmc_host *mmc, struct mmc_ios *ios) case MMC_POWER_OFF: tmio_mmc_power_off(host); /* For R-Car Gen2+, we need to reset SDHI specific SCC */ - if (host->pdata->flags & TMIO_MMC_MIN_RCAR2) + if (host->pdata->flags & TMIO_MMC_MIN_RCAR2) { host->reset(host); + + if (host->native_hotplug) + tmio_mmc_enable_mmc_irqs(host, + TMIO_STAT_CARD_REMOVE | + TMIO_STAT_CARD_INSERT); + } + host->set_clock(host, 0); break; case MMC_POWER_UP: @@ -1185,10 +1196,6 @@ int tmio_mmc_host_probe(struct tmio_mmc_host *_host) _host->set_clock(_host, 0); tmio_mmc_reset(_host); - if (_host->native_hotplug) - tmio_mmc_enable_mmc_irqs(_host, - TMIO_STAT_CARD_REMOVE | TMIO_STAT_CARD_INSERT); - spin_lock_init(&_host->lock); mutex_init(&_host->ios_lock); diff --git a/drivers/mmc/host/vub300.c b/drivers/mmc/host/vub300.c index 4950d10d3a19..97beece62fec 100644 --- a/drivers/mmc/host/vub300.c +++ b/drivers/mmc/host/vub300.c @@ -576,7 +576,7 @@ static void check_vub300_port_status(struct vub300_mmc_host *vub300) GET_SYSTEM_PORT_STATUS, USB_DIR_IN | USB_TYPE_VENDOR | USB_RECIP_DEVICE, 0x0000, 0x0000, &vub300->system_port_status, - sizeof(vub300->system_port_status), HZ); + sizeof(vub300->system_port_status), 1000); if (sizeof(vub300->system_port_status) == retval) new_system_port_status(vub300); } @@ -1241,7 +1241,7 @@ static void __download_offload_pseudocode(struct vub300_mmc_host *vub300, SET_INTERRUPT_PSEUDOCODE, USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE, 0x0000, 0x0000, - xfer_buffer, xfer_length, HZ); + xfer_buffer, xfer_length, 1000); kfree(xfer_buffer); if (retval < 0) goto copy_error_message; @@ -1284,7 +1284,7 @@ static void __download_offload_pseudocode(struct vub300_mmc_host *vub300, SET_TRANSFER_PSEUDOCODE, USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE, 0x0000, 0x0000, - xfer_buffer, xfer_length, HZ); + xfer_buffer, xfer_length, 1000); kfree(xfer_buffer); if (retval < 0) goto copy_error_message; @@ -1991,7 +1991,7 @@ static void __set_clock_speed(struct vub300_mmc_host *vub300, u8 buf[8], usb_control_msg(vub300->udev, usb_sndctrlpipe(vub300->udev, 0), SET_CLOCK_SPEED, USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE, - 0x00, 0x00, buf, buf_array_size, HZ); + 0x00, 0x00, buf, buf_array_size, 1000); if (retval != 8) { dev_err(&vub300->udev->dev, "SET_CLOCK_SPEED" " %dkHz failed with retval=%d\n", kHzClock, retval); @@ -2013,14 +2013,14 @@ static void vub300_mmc_set_ios(struct mmc_host *mmc, struct mmc_ios *ios) usb_control_msg(vub300->udev, usb_sndctrlpipe(vub300->udev, 0), SET_SD_POWER, USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE, - 0x0000, 0x0000, NULL, 0, HZ); + 0x0000, 0x0000, NULL, 0, 1000); /* must wait for the VUB300 u-proc to boot up */ msleep(600); } else if ((ios->power_mode == MMC_POWER_UP) && !vub300->card_powered) { usb_control_msg(vub300->udev, usb_sndctrlpipe(vub300->udev, 0), SET_SD_POWER, USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE, - 0x0001, 0x0000, NULL, 0, HZ); + 0x0001, 0x0000, NULL, 0, 1000); msleep(600); vub300->card_powered = 1; } else if (ios->power_mode == MMC_POWER_ON) { @@ -2275,14 +2275,14 @@ static int vub300_probe(struct usb_interface *interface, GET_HC_INF0, USB_DIR_IN | USB_TYPE_VENDOR | USB_RECIP_DEVICE, 0x0000, 0x0000, &vub300->hc_info, - sizeof(vub300->hc_info), HZ); + sizeof(vub300->hc_info), 1000); if (retval < 0) goto error5; retval = usb_control_msg(vub300->udev, usb_sndctrlpipe(vub300->udev, 0), SET_ROM_WAIT_STATES, USB_DIR_OUT | USB_TYPE_VENDOR | USB_RECIP_DEVICE, - firmware_rom_wait_states, 0x0000, NULL, 0, HZ); + firmware_rom_wait_states, 0x0000, NULL, 0, 1000); if (retval < 0) goto error5; dev_info(&vub300->udev->dev, @@ -2297,7 +2297,7 @@ static int vub300_probe(struct usb_interface *interface, GET_SYSTEM_PORT_STATUS, USB_DIR_IN | USB_TYPE_VENDOR | USB_RECIP_DEVICE, 0x0000, 0x0000, &vub300->system_port_status, - sizeof(vub300->system_port_status), HZ); + sizeof(vub300->system_port_status), 1000); if (retval < 0) { goto error4; } else if (sizeof(vub300->system_port_status) == retval) { diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c index b8ae1ec14e17..4eaba6f4ec68 100644 --- a/drivers/mtd/mtd_blkdevs.c +++ b/drivers/mtd/mtd_blkdevs.c @@ -384,7 +384,9 @@ int add_mtd_blktrans_dev(struct mtd_blktrans_dev *new) if (new->readonly) set_disk_ro(gd, 1); - device_add_disk(&new->mtd->dev, gd, NULL); + ret = device_add_disk(&new->mtd->dev, gd, NULL); + if (ret) + goto out_cleanup_disk; if (new->disk_attributes) { ret = sysfs_create_group(&disk_to_dev(gd)->kobj, @@ -393,6 +395,8 @@ int add_mtd_blktrans_dev(struct mtd_blktrans_dev *new) } return 0; +out_cleanup_disk: + blk_cleanup_disk(new->disk); out_free_tag_set: blk_mq_free_tag_set(new->tag_set); out_kfree_tag_set: diff --git a/drivers/net/can/m_can/m_can_platform.c b/drivers/net/can/m_can/m_can_platform.c index 308d4f2fff00..eee47bad0592 100644 --- a/drivers/net/can/m_can/m_can_platform.c +++ b/drivers/net/can/m_can/m_can_platform.c @@ -32,8 +32,13 @@ static u32 iomap_read_reg(struct m_can_classdev *cdev, int reg) static int iomap_read_fifo(struct m_can_classdev *cdev, int offset, void *val, size_t val_count) { struct m_can_plat_priv *priv = cdev_to_priv(cdev); + void __iomem *src = priv->mram_base + offset; - ioread32_rep(priv->mram_base + offset, val, val_count); + while (val_count--) { + *(unsigned int *)val = ioread32(src); + val += 4; + src += 4; + } return 0; } @@ -51,8 +56,13 @@ static int iomap_write_fifo(struct m_can_classdev *cdev, int offset, const void *val, size_t val_count) { struct m_can_plat_priv *priv = cdev_to_priv(cdev); + void __iomem *dst = priv->mram_base + offset; - iowrite32_rep(priv->base + offset, val, val_count); + while (val_count--) { + iowrite32(*(unsigned int *)val, dst); + val += 4; + dst += 4; + } return 0; } diff --git a/drivers/net/can/rcar/rcar_can.c b/drivers/net/can/rcar/rcar_can.c index 00e4533c8bdd..8999ec9455ec 100644 --- a/drivers/net/can/rcar/rcar_can.c +++ b/drivers/net/can/rcar/rcar_can.c @@ -846,10 +846,12 @@ static int __maybe_unused rcar_can_suspend(struct device *dev) struct rcar_can_priv *priv = netdev_priv(ndev); u16 ctlr; - if (netif_running(ndev)) { - netif_stop_queue(ndev); - netif_device_detach(ndev); - } + if (!netif_running(ndev)) + return 0; + + netif_stop_queue(ndev); + netif_device_detach(ndev); + ctlr = readw(&priv->regs->ctlr); ctlr |= RCAR_CAN_CTLR_CANM_HALT; writew(ctlr, &priv->regs->ctlr); @@ -868,6 +870,9 @@ static int __maybe_unused rcar_can_resume(struct device *dev) u16 ctlr; int err; + if (!netif_running(ndev)) + return 0; + err = clk_enable(priv->clk); if (err) { netdev_err(ndev, "clk_enable() failed, error %d\n", err); @@ -881,10 +886,9 @@ static int __maybe_unused rcar_can_resume(struct device *dev) writew(ctlr, &priv->regs->ctlr); priv->can.state = CAN_STATE_ERROR_ACTIVE; - if (netif_running(ndev)) { - netif_device_attach(ndev); - netif_start_queue(ndev); - } + netif_device_attach(ndev); + netif_start_queue(ndev); + return 0; } diff --git a/drivers/net/can/sja1000/peak_pci.c b/drivers/net/can/sja1000/peak_pci.c index 6db90dc4bc9d..84f34020aafb 100644 --- a/drivers/net/can/sja1000/peak_pci.c +++ b/drivers/net/can/sja1000/peak_pci.c @@ -752,16 +752,15 @@ static void peak_pci_remove(struct pci_dev *pdev) struct net_device *prev_dev = chan->prev_dev; dev_info(&pdev->dev, "removing device %s\n", dev->name); + /* do that only for first channel */ + if (!prev_dev && chan->pciec_card) + peak_pciec_remove(chan->pciec_card); unregister_sja1000dev(dev); free_sja1000dev(dev); dev = prev_dev; - if (!dev) { - /* do that only for first channel */ - if (chan->pciec_card) - peak_pciec_remove(chan->pciec_card); + if (!dev) break; - } priv = netdev_priv(dev); chan = priv->priv; } diff --git a/drivers/net/can/usb/peak_usb/pcan_usb_fd.c b/drivers/net/can/usb/peak_usb/pcan_usb_fd.c index b11eabad575b..09029a3bad1a 100644 --- a/drivers/net/can/usb/peak_usb/pcan_usb_fd.c +++ b/drivers/net/can/usb/peak_usb/pcan_usb_fd.c @@ -551,11 +551,10 @@ static int pcan_usb_fd_decode_status(struct pcan_usb_fd_if *usb_if, } else if (sm->channel_p_w_b & PUCAN_BUS_WARNING) { new_state = CAN_STATE_ERROR_WARNING; } else { - /* no error bit (so, no error skb, back to active state) */ - dev->can.state = CAN_STATE_ERROR_ACTIVE; + /* back to (or still in) ERROR_ACTIVE state */ + new_state = CAN_STATE_ERROR_ACTIVE; pdev->bec.txerr = 0; pdev->bec.rxerr = 0; - return 0; } /* state hasn't changed */ @@ -568,8 +567,7 @@ static int pcan_usb_fd_decode_status(struct pcan_usb_fd_if *usb_if, /* allocate an skb to store the error frame */ skb = alloc_can_err_skb(netdev, &cf); - if (skb) - can_change_state(netdev, cf, tx_state, rx_state); + can_change_state(netdev, cf, tx_state, rx_state); /* things must be done even in case of OOM */ if (new_state == CAN_STATE_BUS_OFF) diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c index 3ff4b7e177f3..dbd4486a173f 100644 --- a/drivers/net/dsa/lantiq_gswip.c +++ b/drivers/net/dsa/lantiq_gswip.c @@ -230,7 +230,7 @@ #define GSWIP_SDMA_PCTRLp(p) (0xBC0 + ((p) * 0x6)) #define GSWIP_SDMA_PCTRL_EN BIT(0) /* SDMA Port Enable */ #define GSWIP_SDMA_PCTRL_FCEN BIT(1) /* Flow Control Enable */ -#define GSWIP_SDMA_PCTRL_PAUFWD BIT(1) /* Pause Frame Forwarding */ +#define GSWIP_SDMA_PCTRL_PAUFWD BIT(3) /* Pause Frame Forwarding */ #define GSWIP_TABLE_ACTIVE_VLAN 0x01 #define GSWIP_TABLE_VLAN_MAPPING 0x02 diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c index 094737e5084a..9890672a206d 100644 --- a/drivers/net/dsa/mt7530.c +++ b/drivers/net/dsa/mt7530.c @@ -1035,9 +1035,6 @@ mt7530_port_enable(struct dsa_switch *ds, int port, { struct mt7530_priv *priv = ds->priv; - if (!dsa_is_user_port(ds, port)) - return 0; - mutex_lock(&priv->reg_mutex); /* Allow the user port gets connected to the cpu port and also @@ -1060,9 +1057,6 @@ mt7530_port_disable(struct dsa_switch *ds, int port) { struct mt7530_priv *priv = ds->priv; - if (!dsa_is_user_port(ds, port)) - return; - mutex_lock(&priv->reg_mutex); /* Clear up all port matrix which could be restored in the next @@ -3211,7 +3205,7 @@ mt7530_probe(struct mdio_device *mdiodev) return -ENOMEM; priv->ds->dev = &mdiodev->dev; - priv->ds->num_ports = DSA_MAX_PORTS; + priv->ds->num_ports = MT7530_NUM_PORTS; /* Use medatek,mcm property to distinguish hardware type that would * casues a little bit differences on power-on sequence. diff --git a/drivers/net/ethernet/cavium/thunder/nic_main.c b/drivers/net/ethernet/cavium/thunder/nic_main.c index 691e1475d55e..0fbecd093fa1 100644 --- a/drivers/net/ethernet/cavium/thunder/nic_main.c +++ b/drivers/net/ethernet/cavium/thunder/nic_main.c @@ -1193,7 +1193,7 @@ static int nic_register_interrupts(struct nicpf *nic) dev_err(&nic->pdev->dev, "Request for #%d msix vectors failed, returned %d\n", nic->num_vec, ret); - return 1; + return ret; } /* Register mailbox interrupt handler */ diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index d1667b759522..a27227aeae88 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -1224,7 +1224,7 @@ static int nicvf_register_misc_interrupt(struct nicvf *nic) if (ret < 0) { netdev_err(nic->netdev, "Req for #%d msix vectors failed\n", nic->num_vec); - return 1; + return ret; } sprintf(nic->irq_name[irq], "%s Mbox", "NICVF"); @@ -1243,7 +1243,7 @@ static int nicvf_register_misc_interrupt(struct nicvf *nic) if (!nicvf_check_pf_ready(nic)) { nicvf_disable_intr(nic, NICVF_INTR_MBOX, 0); nicvf_unregister_interrupts(nic); - return 1; + return -EIO; } return 0; diff --git a/drivers/net/ethernet/freescale/enetc/enetc_ethtool.c b/drivers/net/ethernet/freescale/enetc/enetc_ethtool.c index 9690e36e9e85..910b9f722504 100644 --- a/drivers/net/ethernet/freescale/enetc/enetc_ethtool.c +++ b/drivers/net/ethernet/freescale/enetc/enetc_ethtool.c @@ -157,7 +157,7 @@ static const struct { { ENETC_PM0_TFRM, "MAC tx frames" }, { ENETC_PM0_TFCS, "MAC tx fcs errors" }, { ENETC_PM0_TVLAN, "MAC tx VLAN frames" }, - { ENETC_PM0_TERR, "MAC tx frames" }, + { ENETC_PM0_TERR, "MAC tx frame errors" }, { ENETC_PM0_TUCA, "MAC tx unicast frames" }, { ENETC_PM0_TMCA, "MAC tx multicast frames" }, { ENETC_PM0_TBCA, "MAC tx broadcast frames" }, diff --git a/drivers/net/ethernet/freescale/enetc/enetc_pf.c b/drivers/net/ethernet/freescale/enetc/enetc_pf.c index 4c977dfc44f0..d522bd5c90b4 100644 --- a/drivers/net/ethernet/freescale/enetc/enetc_pf.c +++ b/drivers/net/ethernet/freescale/enetc/enetc_pf.c @@ -517,10 +517,13 @@ static void enetc_port_si_configure(struct enetc_si *si) static void enetc_configure_port_mac(struct enetc_hw *hw) { + int tc; + enetc_port_wr(hw, ENETC_PM0_MAXFRM, ENETC_SET_MAXFRM(ENETC_RX_MAXFRM_SIZE)); - enetc_port_wr(hw, ENETC_PTCMSDUR(0), ENETC_MAC_MAXFRM_SIZE); + for (tc = 0; tc < 8; tc++) + enetc_port_wr(hw, ENETC_PTCMSDUR(tc), ENETC_MAC_MAXFRM_SIZE); enetc_port_wr(hw, ENETC_PM0_CMD_CFG, ENETC_PM0_CMD_PHY_TX_EN | ENETC_PM0_CMD_TXP | ENETC_PM0_PROMISC); diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.c b/drivers/net/ethernet/hisilicon/hns3/hnae3.c index eef1b2764d34..67b0bf310daa 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.c +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.c @@ -10,6 +10,27 @@ static LIST_HEAD(hnae3_ae_algo_list); static LIST_HEAD(hnae3_client_list); static LIST_HEAD(hnae3_ae_dev_list); +void hnae3_unregister_ae_algo_prepare(struct hnae3_ae_algo *ae_algo) +{ + const struct pci_device_id *pci_id; + struct hnae3_ae_dev *ae_dev; + + if (!ae_algo) + return; + + list_for_each_entry(ae_dev, &hnae3_ae_dev_list, node) { + if (!hnae3_get_bit(ae_dev->flag, HNAE3_DEV_INITED_B)) + continue; + + pci_id = pci_match_id(ae_algo->pdev_id_table, ae_dev->pdev); + if (!pci_id) + continue; + if (IS_ENABLED(CONFIG_PCI_IOV)) + pci_disable_sriov(ae_dev->pdev); + } +} +EXPORT_SYMBOL(hnae3_unregister_ae_algo_prepare); + /* we are keeping things simple and using single lock for all the * list. This is a non-critical code so other updations, if happen * in parallel, can wait. diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h index 8ba21d6dc220..d701451596c8 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h +++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h @@ -853,6 +853,7 @@ struct hnae3_handle { int hnae3_register_ae_dev(struct hnae3_ae_dev *ae_dev); void hnae3_unregister_ae_dev(struct hnae3_ae_dev *ae_dev); +void hnae3_unregister_ae_algo_prepare(struct hnae3_ae_algo *ae_algo); void hnae3_unregister_ae_algo(struct hnae3_ae_algo *ae_algo); void hnae3_register_ae_algo(struct hnae3_ae_algo *ae_algo); diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c index 2b66c59f5eaf..e54f96251fea 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c @@ -137,7 +137,7 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .name = "uc", .cmd = HNAE3_DBG_CMD_MAC_UC, .dentry = HNS3_DBG_DENTRY_MAC, - .buf_len = HNS3_DBG_READ_LEN, + .buf_len = HNS3_DBG_READ_LEN_128KB, .init = hns3_dbg_common_file_init, }, { @@ -256,7 +256,7 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .name = "tqp", .cmd = HNAE3_DBG_CMD_REG_TQP, .dentry = HNS3_DBG_DENTRY_REG, - .buf_len = HNS3_DBG_READ_LEN, + .buf_len = HNS3_DBG_READ_LEN_128KB, .init = hns3_dbg_common_file_init, }, { @@ -298,7 +298,7 @@ static struct hns3_dbg_cmd_info hns3_dbg_cmd[] = { .name = "fd_tcam", .cmd = HNAE3_DBG_CMD_FD_TCAM, .dentry = HNS3_DBG_DENTRY_FD, - .buf_len = HNS3_DBG_READ_LEN, + .buf_len = HNS3_DBG_READ_LEN_1MB, .init = hns3_dbg_common_file_init, }, { @@ -462,7 +462,7 @@ static const struct hns3_dbg_item rx_queue_info_items[] = { { "TAIL", 2 }, { "HEAD", 2 }, { "FBDNUM", 2 }, - { "PKTNUM", 2 }, + { "PKTNUM", 5 }, { "COPYBREAK", 2 }, { "RING_EN", 2 }, { "RX_RING_EN", 2 }, @@ -565,7 +565,7 @@ static const struct hns3_dbg_item tx_queue_info_items[] = { { "HEAD", 2 }, { "FBDNUM", 2 }, { "OFFSET", 2 }, - { "PKTNUM", 2 }, + { "PKTNUM", 5 }, { "RING_EN", 2 }, { "TX_RING_EN", 2 }, { "BASE_ADDR", 10 }, @@ -790,13 +790,13 @@ static int hns3_dbg_rx_bd_info(struct hns3_dbg_data *d, char *buf, int len) } static const struct hns3_dbg_item tx_bd_info_items[] = { - { "BD_IDX", 5 }, - { "ADDRESS", 2 }, + { "BD_IDX", 2 }, + { "ADDRESS", 13 }, { "VLAN_TAG", 2 }, { "SIZE", 2 }, { "T_CS_VLAN_TSO", 2 }, { "OT_VLAN_TAG", 3 }, - { "TV", 2 }, + { "TV", 5 }, { "OLT_VLAN_LEN", 2 }, { "PAYLEN_OL4CS", 2 }, { "BD_FE_SC_VLD", 2 }, diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c index 468b8f07bf47..4b886a13e079 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c @@ -1847,7 +1847,6 @@ void hns3_shinfo_pack(struct skb_shared_info *shinfo, __u32 *size) static int hns3_skb_linearize(struct hns3_enet_ring *ring, struct sk_buff *skb, - u8 max_non_tso_bd_num, unsigned int bd_num) { /* 'bd_num == UINT_MAX' means the skb' fraglist has a @@ -1864,8 +1863,7 @@ static int hns3_skb_linearize(struct hns3_enet_ring *ring, * will not help. */ if (skb->len > HNS3_MAX_TSO_SIZE || - (!skb_is_gso(skb) && skb->len > - HNS3_MAX_NON_TSO_SIZE(max_non_tso_bd_num))) { + (!skb_is_gso(skb) && skb->len > HNS3_MAX_NON_TSO_SIZE)) { u64_stats_update_begin(&ring->syncp); ring->stats.hw_limitation++; u64_stats_update_end(&ring->syncp); @@ -1900,8 +1898,7 @@ static int hns3_nic_maybe_stop_tx(struct hns3_enet_ring *ring, goto out; } - if (hns3_skb_linearize(ring, skb, max_non_tso_bd_num, - bd_num)) + if (hns3_skb_linearize(ring, skb, bd_num)) return -ENOMEM; bd_num = hns3_tx_bd_count(skb->len); @@ -3258,6 +3255,7 @@ static void hns3_buffer_detach(struct hns3_enet_ring *ring, int i) { hns3_unmap_buffer(ring, &ring->desc_cb[i]); ring->desc[i].addr = 0; + ring->desc_cb[i].refill = 0; } static void hns3_free_buffer_detach(struct hns3_enet_ring *ring, int i, @@ -3336,6 +3334,7 @@ static int hns3_alloc_and_attach_buffer(struct hns3_enet_ring *ring, int i) ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + ring->desc_cb[i].page_offset); + ring->desc_cb[i].refill = 1; return 0; } @@ -3365,6 +3364,7 @@ static void hns3_replace_buffer(struct hns3_enet_ring *ring, int i, { hns3_unmap_buffer(ring, &ring->desc_cb[i]); ring->desc_cb[i] = *res_cb; + ring->desc_cb[i].refill = 1; ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + ring->desc_cb[i].page_offset); ring->desc[i].rx.bd_base_info = 0; @@ -3373,6 +3373,7 @@ static void hns3_replace_buffer(struct hns3_enet_ring *ring, int i, static void hns3_reuse_buffer(struct hns3_enet_ring *ring, int i) { ring->desc_cb[i].reuse_flag = 0; + ring->desc_cb[i].refill = 1; ring->desc[i].addr = cpu_to_le64(ring->desc_cb[i].dma + ring->desc_cb[i].page_offset); ring->desc[i].rx.bd_base_info = 0; @@ -3479,10 +3480,14 @@ static int hns3_desc_unused(struct hns3_enet_ring *ring) int ntc = ring->next_to_clean; int ntu = ring->next_to_use; + if (unlikely(ntc == ntu && !ring->desc_cb[ntc].refill)) + return ring->desc_num; + return ((ntc >= ntu) ? 0 : ring->desc_num) + ntc - ntu; } -static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, +/* Return true if there is any allocation failure */ +static bool hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, int cleand_count) { struct hns3_desc_cb *desc_cb; @@ -3507,7 +3512,10 @@ static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, hns3_rl_err(ring_to_netdev(ring), "alloc rx buffer failed: %d\n", ret); - break; + + writel(i, ring->tqp->io_base + + HNS3_RING_RX_RING_HEAD_REG); + return true; } hns3_replace_buffer(ring, ring->next_to_use, &res_cbs); @@ -3520,6 +3528,7 @@ static void hns3_nic_alloc_rx_buffers(struct hns3_enet_ring *ring, } writel(i, ring->tqp->io_base + HNS3_RING_RX_RING_HEAD_REG); + return false; } static bool hns3_can_reuse_page(struct hns3_desc_cb *cb) @@ -3824,6 +3833,7 @@ static void hns3_rx_ring_move_fw(struct hns3_enet_ring *ring) { ring->desc[ring->next_to_clean].rx.bd_base_info &= cpu_to_le32(~BIT(HNS3_RXD_VLD_B)); + ring->desc_cb[ring->next_to_clean].refill = 0; ring->next_to_clean += 1; if (unlikely(ring->next_to_clean == ring->desc_num)) @@ -4170,6 +4180,7 @@ int hns3_clean_rx_ring(struct hns3_enet_ring *ring, int budget, { #define RCB_NOF_ALLOC_RX_BUFF_ONCE 16 int unused_count = hns3_desc_unused(ring); + bool failure = false; int recv_pkts = 0; int err; @@ -4178,9 +4189,9 @@ int hns3_clean_rx_ring(struct hns3_enet_ring *ring, int budget, while (recv_pkts < budget) { /* Reuse or realloc buffers */ if (unused_count >= RCB_NOF_ALLOC_RX_BUFF_ONCE) { - hns3_nic_alloc_rx_buffers(ring, unused_count); - unused_count = hns3_desc_unused(ring) - - ring->pending_buf; + failure = failure || + hns3_nic_alloc_rx_buffers(ring, unused_count); + unused_count = 0; } /* Poll one pkt */ @@ -4199,11 +4210,7 @@ int hns3_clean_rx_ring(struct hns3_enet_ring *ring, int budget, } out: - /* Make all data has been write before submit */ - if (unused_count > 0) - hns3_nic_alloc_rx_buffers(ring, unused_count); - - return recv_pkts; + return failure ? budget : recv_pkts; } static void hns3_update_rx_int_coalesce(struct hns3_enet_tqp_vector *tqp_vector) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h index 6162d9f88e37..f09a61d9c626 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h @@ -186,11 +186,9 @@ enum hns3_nic_state { #define HNS3_MAX_BD_SIZE 65535 #define HNS3_MAX_TSO_BD_NUM 63U -#define HNS3_MAX_TSO_SIZE \ - (HNS3_MAX_BD_SIZE * HNS3_MAX_TSO_BD_NUM) +#define HNS3_MAX_TSO_SIZE 1048576U +#define HNS3_MAX_NON_TSO_SIZE 9728U -#define HNS3_MAX_NON_TSO_SIZE(max_non_tso_bd_num) \ - (HNS3_MAX_BD_SIZE * (max_non_tso_bd_num)) #define HNS3_VECTOR_GL0_OFFSET 0x100 #define HNS3_VECTOR_GL1_OFFSET 0x200 @@ -332,6 +330,7 @@ struct hns3_desc_cb { u32 length; /* length of the buffer */ u16 reuse_flag; + u16 refill; /* desc type, used by the ring user to mark the type of the priv data */ u16 type; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c index 307c9e830510..91cb578f56b8 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_dcb.c @@ -137,6 +137,15 @@ static int hclge_ets_sch_mode_validate(struct hclge_dev *hdev, *changed = true; break; case IEEE_8021QAZ_TSA_ETS: + /* The hardware will switch to sp mode if bandwidth is + * 0, so limit ets bandwidth must be greater than 0. + */ + if (!ets->tc_tx_bw[i]) { + dev_err(&hdev->pdev->dev, + "tc%u ets bw cannot be 0\n", i); + return -EINVAL; + } + if (hdev->tm_info.tc_info[i].tc_sch_mode != HCLGE_SCH_MODE_DWRR) *changed = true; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c index 32f62cd2dd99..9cda8b3562b8 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c @@ -391,7 +391,7 @@ static int hclge_dbg_dump_mac(struct hclge_dev *hdev, char *buf, int len) static int hclge_dbg_dump_dcb_qset(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u16 qset_id, qset_num; int ret; @@ -408,12 +408,12 @@ static int hclge_dbg_dump_dcb_qset(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret; - bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]); *pos += scnprintf(buf + *pos, len - *pos, "%04u %#x %#x %#x %#x\n", - qset_id, bitmap->bit0, bitmap->bit1, - bitmap->bit2, bitmap->bit3); + qset_id, req.bit0, req.bit1, req.bit2, + req.bit3); } return 0; @@ -422,7 +422,7 @@ static int hclge_dbg_dump_dcb_qset(struct hclge_dev *hdev, char *buf, int len, static int hclge_dbg_dump_dcb_pri(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u8 pri_id, pri_num; int ret; @@ -439,12 +439,11 @@ static int hclge_dbg_dump_dcb_pri(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret; - bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]); *pos += scnprintf(buf + *pos, len - *pos, "%03u %#x %#x %#x\n", - pri_id, bitmap->bit0, bitmap->bit1, - bitmap->bit2); + pri_id, req.bit0, req.bit1, req.bit2); } return 0; @@ -453,7 +452,7 @@ static int hclge_dbg_dump_dcb_pri(struct hclge_dev *hdev, char *buf, int len, static int hclge_dbg_dump_dcb_pg(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u8 pg_id; int ret; @@ -466,12 +465,11 @@ static int hclge_dbg_dump_dcb_pg(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret; - bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]); *pos += scnprintf(buf + *pos, len - *pos, "%03u %#x %#x %#x\n", - pg_id, bitmap->bit0, bitmap->bit1, - bitmap->bit2); + pg_id, req.bit0, req.bit1, req.bit2); } return 0; @@ -511,7 +509,7 @@ static int hclge_dbg_dump_dcb_queue(struct hclge_dev *hdev, char *buf, int len, static int hclge_dbg_dump_dcb_port(struct hclge_dev *hdev, char *buf, int len, int *pos) { - struct hclge_dbg_bitmap_cmd *bitmap; + struct hclge_dbg_bitmap_cmd req; struct hclge_desc desc; u8 port_id = 0; int ret; @@ -521,12 +519,12 @@ static int hclge_dbg_dump_dcb_port(struct hclge_dev *hdev, char *buf, int len, if (ret) return ret; - bitmap = (struct hclge_dbg_bitmap_cmd *)&desc.data[1]; + req.bitmap = (u8)le32_to_cpu(desc.data[1]); *pos += scnprintf(buf + *pos, len - *pos, "port_mask: %#x\n", - bitmap->bit0); + req.bit0); *pos += scnprintf(buf + *pos, len - *pos, "port_shaping_pass: %#x\n", - bitmap->bit1); + req.bit1); return 0; } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c index bb9b026ae88e..93aa7f2bdc13 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c @@ -1560,8 +1560,11 @@ static int hclge_config_tm_hw_err_int(struct hclge_dev *hdev, bool en) /* configure TM QCN hw errors */ hclge_cmd_setup_basic_desc(&desc, HCLGE_TM_QCN_MEM_INT_CFG, false); - if (en) + desc.data[0] = cpu_to_le32(HCLGE_TM_QCN_ERR_INT_TYPE); + if (en) { + desc.data[0] |= cpu_to_le32(HCLGE_TM_QCN_FIFO_INT_EN); desc.data[1] = cpu_to_le32(HCLGE_TM_QCN_MEM_ERR_INT_EN); + } ret = hclge_cmd_send(&hdev->hw, &desc, 1); if (ret) diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h index 07987fb8332e..d811eeefe2c0 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h @@ -50,6 +50,8 @@ #define HCLGE_PPP_MPF_ECC_ERR_INT3_EN 0x003F #define HCLGE_PPP_MPF_ECC_ERR_INT3_EN_MASK 0x003F #define HCLGE_TM_SCH_ECC_ERR_INT_EN 0x3 +#define HCLGE_TM_QCN_ERR_INT_TYPE 0x29 +#define HCLGE_TM_QCN_FIFO_INT_EN 0xFFFF00 #define HCLGE_TM_QCN_MEM_ERR_INT_EN 0xFFFFFF #define HCLGE_NCSI_ERR_INT_EN 0x3 #define HCLGE_NCSI_ERR_INT_TYPE 0x9 diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c index f5b8d1fee0f1..d891390d492f 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c @@ -2847,33 +2847,29 @@ static void hclge_mbx_task_schedule(struct hclge_dev *hdev) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && !test_and_set_bit(HCLGE_STATE_MBX_SERVICE_SCHED, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, 0); + mod_delayed_work(hclge_wq, &hdev->service_task, 0); } static void hclge_reset_task_schedule(struct hclge_dev *hdev) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && + test_bit(HCLGE_STATE_SERVICE_INITED, &hdev->state) && !test_and_set_bit(HCLGE_STATE_RST_SERVICE_SCHED, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, 0); + mod_delayed_work(hclge_wq, &hdev->service_task, 0); } static void hclge_errhand_task_schedule(struct hclge_dev *hdev) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && !test_and_set_bit(HCLGE_STATE_ERR_SERVICE_SCHED, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, 0); + mod_delayed_work(hclge_wq, &hdev->service_task, 0); } void hclge_task_schedule(struct hclge_dev *hdev, unsigned long delay_time) { if (!test_bit(HCLGE_STATE_REMOVING, &hdev->state) && !test_bit(HCLGE_STATE_RST_FAIL, &hdev->state)) - mod_delayed_work_on(cpumask_first(&hdev->affinity_mask), - hclge_wq, &hdev->service_task, - delay_time); + mod_delayed_work(hclge_wq, &hdev->service_task, delay_time); } static int hclge_get_mac_link_status(struct hclge_dev *hdev, int *link_status) @@ -3491,33 +3487,14 @@ static void hclge_get_misc_vector(struct hclge_dev *hdev) hdev->num_msi_used += 1; } -static void hclge_irq_affinity_notify(struct irq_affinity_notify *notify, - const cpumask_t *mask) -{ - struct hclge_dev *hdev = container_of(notify, struct hclge_dev, - affinity_notify); - - cpumask_copy(&hdev->affinity_mask, mask); -} - -static void hclge_irq_affinity_release(struct kref *ref) -{ -} - static void hclge_misc_affinity_setup(struct hclge_dev *hdev) { irq_set_affinity_hint(hdev->misc_vector.vector_irq, &hdev->affinity_mask); - - hdev->affinity_notify.notify = hclge_irq_affinity_notify; - hdev->affinity_notify.release = hclge_irq_affinity_release; - irq_set_affinity_notifier(hdev->misc_vector.vector_irq, - &hdev->affinity_notify); } static void hclge_misc_affinity_teardown(struct hclge_dev *hdev) { - irq_set_affinity_notifier(hdev->misc_vector.vector_irq, NULL); irq_set_affinity_hint(hdev->misc_vector.vector_irq, NULL); } @@ -13052,7 +13029,7 @@ static int hclge_init(void) { pr_info("%s is initializing\n", HCLGE_NAME); - hclge_wq = alloc_workqueue("%s", 0, 0, HCLGE_NAME); + hclge_wq = alloc_workqueue("%s", WQ_UNBOUND, 0, HCLGE_NAME); if (!hclge_wq) { pr_err("%s: failed to create workqueue\n", HCLGE_NAME); return -ENOMEM; @@ -13065,6 +13042,7 @@ static int hclge_init(void) static void hclge_exit(void) { + hnae3_unregister_ae_algo_prepare(&ae_algo); hnae3_unregister_ae_algo(&ae_algo); destroy_workqueue(hclge_wq); } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h index de6afbcbfbac..69cd8f87b4c8 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h @@ -944,7 +944,6 @@ struct hclge_dev { /* affinity mask and notify for misc interrupt */ cpumask_t affinity_mask; - struct irq_affinity_notify affinity_notify; struct hclge_ptp *ptp; struct devlink *devlink; }; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c index f314dbd3ce11..95074e91a846 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c @@ -752,6 +752,8 @@ static void hclge_tm_pg_info_init(struct hclge_dev *hdev) hdev->tm_info.pg_info[i].tc_bit_map = hdev->hw_tc_map; for (k = 0; k < hdev->tm_info.num_tc; k++) hdev->tm_info.pg_info[i].tc_dwrr[k] = BW_PERCENT; + for (; k < HNAE3_MAX_TC; k++) + hdev->tm_info.pg_info[i].tc_dwrr[k] = 0; } } diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c index 5fdac8685f95..cf00ad7bb881 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c @@ -2232,6 +2232,7 @@ static void hclgevf_get_misc_vector(struct hclgevf_dev *hdev) void hclgevf_reset_task_schedule(struct hclgevf_dev *hdev) { if (!test_bit(HCLGEVF_STATE_REMOVING, &hdev->state) && + test_bit(HCLGEVF_STATE_SERVICE_INITED, &hdev->state) && !test_and_set_bit(HCLGEVF_STATE_RST_SERVICE_SCHED, &hdev->state)) mod_delayed_work(hclgevf_wq, &hdev->service_task, 0); @@ -2273,9 +2274,9 @@ static void hclgevf_reset_service_task(struct hclgevf_dev *hdev) hdev->reset_attempts = 0; hdev->last_reset_time = jiffies; - while ((hdev->reset_type = - hclgevf_get_reset_level(hdev, &hdev->reset_pending)) - != HNAE3_NONE_RESET) + hdev->reset_type = + hclgevf_get_reset_level(hdev, &hdev->reset_pending); + if (hdev->reset_type != HNAE3_NONE_RESET) hclgevf_reset(hdev); } else if (test_and_clear_bit(HCLGEVF_RESET_REQUESTED, &hdev->reset_state)) { @@ -3449,6 +3450,8 @@ static int hclgevf_init_hdev(struct hclgevf_dev *hdev) hclgevf_init_rxd_adv_layout(hdev); + set_bit(HCLGEVF_STATE_SERVICE_INITED, &hdev->state); + hdev->last_reset_time = jiffies; dev_info(&hdev->pdev->dev, "finished initializing %s driver\n", HCLGEVF_DRIVER_NAME); @@ -3899,7 +3902,7 @@ static int hclgevf_init(void) { pr_info("%s is initializing\n", HCLGEVF_NAME); - hclgevf_wq = alloc_workqueue("%s", 0, 0, HCLGEVF_NAME); + hclgevf_wq = alloc_workqueue("%s", WQ_UNBOUND, 0, HCLGEVF_NAME); if (!hclgevf_wq) { pr_err("%s: failed to create workqueue\n", HCLGEVF_NAME); return -ENOMEM; diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h index 883130a9b48f..28288d7e3303 100644 --- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h +++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h @@ -146,6 +146,7 @@ enum hclgevf_states { HCLGEVF_STATE_REMOVING, HCLGEVF_STATE_NIC_REGISTERED, HCLGEVF_STATE_ROCE_REGISTERED, + HCLGEVF_STATE_SERVICE_INITED, /* task states */ HCLGEVF_STATE_RST_SERVICE_SCHED, HCLGEVF_STATE_RST_HANDLING, diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h index 5b2143f4b1f8..3178efd98006 100644 --- a/drivers/net/ethernet/intel/e1000e/e1000.h +++ b/drivers/net/ethernet/intel/e1000e/e1000.h @@ -113,7 +113,8 @@ enum e1000_boards { board_pch2lan, board_pch_lpt, board_pch_spt, - board_pch_cnp + board_pch_cnp, + board_pch_tgp }; struct e1000_ps_page { @@ -499,6 +500,7 @@ extern const struct e1000_info e1000_pch2_info; extern const struct e1000_info e1000_pch_lpt_info; extern const struct e1000_info e1000_pch_spt_info; extern const struct e1000_info e1000_pch_cnp_info; +extern const struct e1000_info e1000_pch_tgp_info; extern const struct e1000_info e1000_es2_info; void e1000e_ptp_init(struct e1000_adapter *adapter); diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c index 60c582a16821..5e4fc9b4e2ad 100644 --- a/drivers/net/ethernet/intel/e1000e/ich8lan.c +++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c @@ -4813,7 +4813,7 @@ static s32 e1000_reset_hw_ich8lan(struct e1000_hw *hw) static s32 e1000_init_hw_ich8lan(struct e1000_hw *hw) { struct e1000_mac_info *mac = &hw->mac; - u32 ctrl_ext, txdctl, snoop; + u32 ctrl_ext, txdctl, snoop, fflt_dbg; s32 ret_val; u16 i; @@ -4872,6 +4872,15 @@ static s32 e1000_init_hw_ich8lan(struct e1000_hw *hw) snoop = (u32)~(PCIE_NO_SNOOP_ALL); e1000e_set_pcie_no_snoop(hw, snoop); + /* Enable workaround for packet loss issue on TGP PCH + * Do not gate DMA clock from the modPHY block + */ + if (mac->type >= e1000_pch_tgp) { + fflt_dbg = er32(FFLT_DBG); + fflt_dbg |= E1000_FFLT_DBG_DONT_GATE_WAKE_DMA_CLK; + ew32(FFLT_DBG, fflt_dbg); + } + ctrl_ext = er32(CTRL_EXT); ctrl_ext |= E1000_CTRL_EXT_RO_DIS; ew32(CTRL_EXT, ctrl_ext); @@ -5992,3 +6001,23 @@ const struct e1000_info e1000_pch_cnp_info = { .phy_ops = &ich8_phy_ops, .nvm_ops = &spt_nvm_ops, }; + +const struct e1000_info e1000_pch_tgp_info = { + .mac = e1000_pch_tgp, + .flags = FLAG_IS_ICH + | FLAG_HAS_WOL + | FLAG_HAS_HW_TIMESTAMP + | FLAG_HAS_CTRLEXT_ON_LOAD + | FLAG_HAS_AMT + | FLAG_HAS_FLASH + | FLAG_HAS_JUMBO_FRAMES + | FLAG_APME_IN_WUC, + .flags2 = FLAG2_HAS_PHY_STATS + | FLAG2_HAS_EEE, + .pba = 26, + .max_hw_frame_size = 9022, + .get_variants = e1000_get_variants_ich8lan, + .mac_ops = &ich8_mac_ops, + .phy_ops = &ich8_phy_ops, + .nvm_ops = &spt_nvm_ops, +}; diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.h b/drivers/net/ethernet/intel/e1000e/ich8lan.h index d6a092e5ee74..2504b11c3169 100644 --- a/drivers/net/ethernet/intel/e1000e/ich8lan.h +++ b/drivers/net/ethernet/intel/e1000e/ich8lan.h @@ -289,6 +289,9 @@ /* Proprietary Latency Tolerance Reporting PCI Capability */ #define E1000_PCI_LTR_CAP_LPT 0xA8 +/* Don't gate wake DMA clock */ +#define E1000_FFLT_DBG_DONT_GATE_WAKE_DMA_CLK 0x1000 + void e1000e_write_protect_nvm_ich8lan(struct e1000_hw *hw); void e1000e_set_kmrn_lock_loss_workaround_ich8lan(struct e1000_hw *hw, bool state); diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 900b3ab998bd..ebcb2a30add0 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -51,6 +51,7 @@ static const struct e1000_info *e1000_info_tbl[] = { [board_pch_lpt] = &e1000_pch_lpt_info, [board_pch_spt] = &e1000_pch_spt_info, [board_pch_cnp] = &e1000_pch_cnp_info, + [board_pch_tgp] = &e1000_pch_tgp_info, }; struct e1000_reg_info { @@ -7896,28 +7897,28 @@ static const struct pci_device_id e1000_pci_tbl[] = { { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_CMP_I219_V11), board_pch_cnp }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_CMP_I219_LM12), board_pch_spt }, { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_CMP_I219_V12), board_pch_spt }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_LM13), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_V13), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_LM14), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_V14), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_LM15), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_V15), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_LM23), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_V23), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_LM16), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_V16), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_LM17), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_V17), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_LM22), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_V22), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_LM18), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_V18), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_LM19), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_V19), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_LM20), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_V20), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_LM21), board_pch_cnp }, - { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_V21), board_pch_cnp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_LM13), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_V13), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_LM14), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_V14), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_LM15), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_TGP_I219_V15), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_LM23), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_V23), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_LM16), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_V16), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_LM17), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_ADP_I219_V17), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_LM22), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_RPL_I219_V22), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_LM18), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_V18), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_LM19), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_MTP_I219_V19), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_LM20), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_V20), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_LM21), board_pch_tgp }, + { PCI_VDEVICE(INTEL, E1000_DEV_ID_PCH_LNP_I219_V21), board_pch_tgp }, { 0, 0, 0, 0, 0, 0, 0 } /* terminate list */ }; diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c index 2fb81e359cdf..df5ad4de1f00 100644 --- a/drivers/net/ethernet/intel/ice/ice_common.c +++ b/drivers/net/ethernet/intel/ice/ice_common.c @@ -25,6 +25,8 @@ static enum ice_status ice_set_mac_type(struct ice_hw *hw) case ICE_DEV_ID_E810C_BACKPLANE: case ICE_DEV_ID_E810C_QSFP: case ICE_DEV_ID_E810C_SFP: + case ICE_DEV_ID_E810_XXV_BACKPLANE: + case ICE_DEV_ID_E810_XXV_QSFP: case ICE_DEV_ID_E810_XXV_SFP: hw->mac_type = ICE_MAC_E810; break; diff --git a/drivers/net/ethernet/intel/ice/ice_devids.h b/drivers/net/ethernet/intel/ice/ice_devids.h index 9d8194671f6a..ef4392e6e244 100644 --- a/drivers/net/ethernet/intel/ice/ice_devids.h +++ b/drivers/net/ethernet/intel/ice/ice_devids.h @@ -21,6 +21,10 @@ #define ICE_DEV_ID_E810C_QSFP 0x1592 /* Intel(R) Ethernet Controller E810-C for SFP */ #define ICE_DEV_ID_E810C_SFP 0x1593 +/* Intel(R) Ethernet Controller E810-XXV for backplane */ +#define ICE_DEV_ID_E810_XXV_BACKPLANE 0x1599 +/* Intel(R) Ethernet Controller E810-XXV for QSFP */ +#define ICE_DEV_ID_E810_XXV_QSFP 0x159A /* Intel(R) Ethernet Controller E810-XXV for SFP */ #define ICE_DEV_ID_E810_XXV_SFP 0x159B /* Intel(R) Ethernet Connection E823-C for backplane */ diff --git a/drivers/net/ethernet/intel/ice/ice_devlink.c b/drivers/net/ethernet/intel/ice/ice_devlink.c index 14afce82ef63..da7288bdc9a3 100644 --- a/drivers/net/ethernet/intel/ice/ice_devlink.c +++ b/drivers/net/ethernet/intel/ice/ice_devlink.c @@ -63,7 +63,8 @@ static int ice_info_fw_api(struct ice_pf *pf, struct ice_info_ctx *ctx) { struct ice_hw *hw = &pf->hw; - snprintf(ctx->buf, sizeof(ctx->buf), "%u.%u", hw->api_maj_ver, hw->api_min_ver); + snprintf(ctx->buf, sizeof(ctx->buf), "%u.%u.%u", hw->api_maj_ver, + hw->api_min_ver, hw->api_patch); return 0; } diff --git a/drivers/net/ethernet/intel/ice/ice_flex_pipe.c b/drivers/net/ethernet/intel/ice/ice_flex_pipe.c index 06ac9badee77..1ac96dc66d0d 100644 --- a/drivers/net/ethernet/intel/ice/ice_flex_pipe.c +++ b/drivers/net/ethernet/intel/ice/ice_flex_pipe.c @@ -1668,7 +1668,7 @@ static u16 ice_tunnel_idx_to_entry(struct ice_hw *hw, enum ice_tunnel_type type, for (i = 0; i < hw->tnl.count && i < ICE_TUNNEL_MAX_ENTRIES; i++) if (hw->tnl.tbl[i].valid && hw->tnl.tbl[i].type == type && - idx--) + idx-- == 0) return i; WARN_ON_ONCE(1); @@ -1828,7 +1828,7 @@ int ice_udp_tunnel_set_port(struct net_device *netdev, unsigned int table, u16 index; tnl_type = ti->type == UDP_TUNNEL_TYPE_VXLAN ? TNL_VXLAN : TNL_GENEVE; - index = ice_tunnel_idx_to_entry(&pf->hw, idx, tnl_type); + index = ice_tunnel_idx_to_entry(&pf->hw, tnl_type, idx); status = ice_create_tunnel(&pf->hw, index, tnl_type, ntohs(ti->port)); if (status) { diff --git a/drivers/net/ethernet/intel/ice/ice_lag.c b/drivers/net/ethernet/intel/ice/ice_lag.c index 37c18c66b5c7..e375ac849aec 100644 --- a/drivers/net/ethernet/intel/ice/ice_lag.c +++ b/drivers/net/ethernet/intel/ice/ice_lag.c @@ -100,9 +100,9 @@ static void ice_display_lag_info(struct ice_lag *lag) */ static void ice_lag_info_event(struct ice_lag *lag, void *ptr) { - struct net_device *event_netdev, *netdev_tmp; struct netdev_notifier_bonding_info *info; struct netdev_bonding_info *bonding_info; + struct net_device *event_netdev; const char *lag_netdev_name; event_netdev = netdev_notifier_info_to_dev(ptr); @@ -123,19 +123,6 @@ static void ice_lag_info_event(struct ice_lag *lag, void *ptr) goto lag_out; } - rcu_read_lock(); - for_each_netdev_in_bond_rcu(lag->upper_netdev, netdev_tmp) { - if (!netif_is_ice(netdev_tmp)) - continue; - - if (netdev_tmp && netdev_tmp != lag->netdev && - lag->peer_netdev != netdev_tmp) { - dev_hold(netdev_tmp); - lag->peer_netdev = netdev_tmp; - } - } - rcu_read_unlock(); - if (bonding_info->slave.state) ice_lag_set_backup(lag); else @@ -319,6 +306,9 @@ ice_lag_event_handler(struct notifier_block *notif_blk, unsigned long event, case NETDEV_BONDING_INFO: ice_lag_info_event(lag, ptr); break; + case NETDEV_UNREGISTER: + ice_lag_unlink(lag, ptr); + break; default: break; } diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c index dde9802c6c72..b718e196af2a 100644 --- a/drivers/net/ethernet/intel/ice/ice_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_lib.c @@ -2841,6 +2841,7 @@ void ice_napi_del(struct ice_vsi *vsi) */ int ice_vsi_release(struct ice_vsi *vsi) { + enum ice_status err; struct ice_pf *pf; if (!vsi->back) @@ -2912,6 +2913,10 @@ int ice_vsi_release(struct ice_vsi *vsi) ice_fltr_remove_all(vsi); ice_rm_vsi_lan_cfg(vsi->port_info, vsi->idx); + err = ice_rm_vsi_rdma_cfg(vsi->port_info, vsi->idx); + if (err) + dev_err(ice_pf_to_dev(vsi->back), "Failed to remove RDMA scheduler config for VSI %u, err %d\n", + vsi->vsi_num, err); ice_vsi_delete(vsi); ice_vsi_free_q_vectors(vsi); @@ -3092,6 +3097,10 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi) prev_num_q_vectors = ice_vsi_rebuild_get_coalesce(vsi, coalesce); ice_rm_vsi_lan_cfg(vsi->port_info, vsi->idx); + ret = ice_rm_vsi_rdma_cfg(vsi->port_info, vsi->idx); + if (ret) + dev_err(ice_pf_to_dev(vsi->back), "Failed to remove RDMA scheduler config for VSI %u, err %d\n", + vsi->vsi_num, ret); ice_vsi_free_q_vectors(vsi); /* SR-IOV determines needed MSIX resources all at once instead of per diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 0d6c143f6653..06fa93e597fb 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -4224,6 +4224,9 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent) if (!pf) return -ENOMEM; + /* initialize Auxiliary index to invalid value */ + pf->aux_idx = -1; + /* set up for high or low DMA */ err = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)); if (err) @@ -4615,7 +4618,8 @@ static void ice_remove(struct pci_dev *pdev) ice_aq_cancel_waiting_tasks(pf); ice_unplug_aux_dev(pf); - ida_free(&ice_aux_ida, pf->aux_idx); + if (pf->aux_idx >= 0) + ida_free(&ice_aux_ida, pf->aux_idx); set_bit(ICE_DOWN, pf->state); mutex_destroy(&(&pf->hw)->fdir_fltr_lock); @@ -5016,6 +5020,8 @@ static const struct pci_device_id ice_pci_tbl[] = { { PCI_VDEVICE(INTEL, ICE_DEV_ID_E810C_BACKPLANE), 0 }, { PCI_VDEVICE(INTEL, ICE_DEV_ID_E810C_QSFP), 0 }, { PCI_VDEVICE(INTEL, ICE_DEV_ID_E810C_SFP), 0 }, + { PCI_VDEVICE(INTEL, ICE_DEV_ID_E810_XXV_BACKPLANE), 0 }, + { PCI_VDEVICE(INTEL, ICE_DEV_ID_E810_XXV_QSFP), 0 }, { PCI_VDEVICE(INTEL, ICE_DEV_ID_E810_XXV_SFP), 0 }, { PCI_VDEVICE(INTEL, ICE_DEV_ID_E823C_BACKPLANE), 0 }, { PCI_VDEVICE(INTEL, ICE_DEV_ID_E823C_QSFP), 0 }, diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c index 80380aed8882..d1ef3d48a4b0 100644 --- a/drivers/net/ethernet/intel/ice/ice_ptp.c +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c @@ -1571,6 +1571,9 @@ err_kworker: */ void ice_ptp_release(struct ice_pf *pf) { + if (!test_bit(ICE_FLAG_PTP, pf->flags)) + return; + /* Disable timestamping for both Tx and Rx */ ice_ptp_cfg_timestamp(pf, false); diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c index 9f07b6641705..2d9b10277186 100644 --- a/drivers/net/ethernet/intel/ice/ice_sched.c +++ b/drivers/net/ethernet/intel/ice/ice_sched.c @@ -2071,6 +2071,19 @@ enum ice_status ice_rm_vsi_lan_cfg(struct ice_port_info *pi, u16 vsi_handle) } /** + * ice_rm_vsi_rdma_cfg - remove VSI and its RDMA children nodes + * @pi: port information structure + * @vsi_handle: software VSI handle + * + * This function clears the VSI and its RDMA children nodes from scheduler tree + * for all TCs. + */ +enum ice_status ice_rm_vsi_rdma_cfg(struct ice_port_info *pi, u16 vsi_handle) +{ + return ice_sched_rm_vsi_cfg(pi, vsi_handle, ICE_SCHED_NODE_OWNER_RDMA); +} + +/** * ice_get_agg_info - get the aggregator ID * @hw: pointer to the hardware structure * @agg_id: aggregator ID diff --git a/drivers/net/ethernet/intel/ice/ice_sched.h b/drivers/net/ethernet/intel/ice/ice_sched.h index 9beef8f0ec76..fdf7a5882f07 100644 --- a/drivers/net/ethernet/intel/ice/ice_sched.h +++ b/drivers/net/ethernet/intel/ice/ice_sched.h @@ -89,6 +89,7 @@ enum ice_status ice_sched_cfg_vsi(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 maxqs, u8 owner, bool enable); enum ice_status ice_rm_vsi_lan_cfg(struct ice_port_info *pi, u16 vsi_handle); +enum ice_status ice_rm_vsi_rdma_cfg(struct ice_port_info *pi, u16 vsi_handle); /* Tx scheduler rate limiter functions */ enum ice_status diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h b/drivers/net/ethernet/intel/igc/igc_hw.h index 4461f8b9a864..4e0203336c6b 100644 --- a/drivers/net/ethernet/intel/igc/igc_hw.h +++ b/drivers/net/ethernet/intel/igc/igc_hw.h @@ -22,8 +22,8 @@ #define IGC_DEV_ID_I220_V 0x15F7 #define IGC_DEV_ID_I225_K 0x3100 #define IGC_DEV_ID_I225_K2 0x3101 +#define IGC_DEV_ID_I226_K 0x3102 #define IGC_DEV_ID_I225_LMVP 0x5502 -#define IGC_DEV_ID_I226_K 0x5504 #define IGC_DEV_ID_I225_IT 0x0D9F #define IGC_DEV_ID_I226_LM 0x125B #define IGC_DEV_ID_I226_V 0x125C diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c index 9338765da048..49d822a98ada 100644 --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c @@ -226,18 +226,85 @@ static const struct file_operations rvu_dbg_##name##_fops = { \ static void print_nix_qsize(struct seq_file *filp, struct rvu_pfvf *pfvf); +static void get_lf_str_list(struct rvu_block block, int pcifunc, + char *lfs) +{ + int lf = 0, seq = 0, len = 0, prev_lf = block.lf.max; + + for_each_set_bit(lf, block.lf.bmap, block.lf.max) { + if (lf >= block.lf.max) + break; + + if (block.fn_map[lf] != pcifunc) + continue; + + if (lf == prev_lf + 1) { + prev_lf = lf; + seq = 1; + continue; + } + + if (seq) + len += sprintf(lfs + len, "-%d,%d", prev_lf, lf); + else + len += (len ? sprintf(lfs + len, ",%d", lf) : + sprintf(lfs + len, "%d", lf)); + + prev_lf = lf; + seq = 0; + } + + if (seq) + len += sprintf(lfs + len, "-%d", prev_lf); + + lfs[len] = '\0'; +} + +static int get_max_column_width(struct rvu *rvu) +{ + int index, pf, vf, lf_str_size = 12, buf_size = 256; + struct rvu_block block; + u16 pcifunc; + char *buf; + + buf = kzalloc(buf_size, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + for (pf = 0; pf < rvu->hw->total_pfs; pf++) { + for (vf = 0; vf <= rvu->hw->total_vfs; vf++) { + pcifunc = pf << 10 | vf; + if (!pcifunc) + continue; + + for (index = 0; index < BLK_COUNT; index++) { + block = rvu->hw->block[index]; + if (!strlen(block.name)) + continue; + + get_lf_str_list(block, pcifunc, buf); + if (lf_str_size <= strlen(buf)) + lf_str_size = strlen(buf) + 1; + } + } + } + + kfree(buf); + return lf_str_size; +} + /* Dumps current provisioning status of all RVU block LFs */ static ssize_t rvu_dbg_rsrc_attach_status(struct file *filp, char __user *buffer, size_t count, loff_t *ppos) { - int index, off = 0, flag = 0, go_back = 0, len = 0; + int index, off = 0, flag = 0, len = 0, i = 0; struct rvu *rvu = filp->private_data; - int lf, pf, vf, pcifunc; + int bytes_not_copied = 0; struct rvu_block block; - int bytes_not_copied; - int lf_str_size = 12; + int pf, vf, pcifunc; int buf_size = 2048; + int lf_str_size; char *lfs; char *buf; @@ -249,6 +316,9 @@ static ssize_t rvu_dbg_rsrc_attach_status(struct file *filp, if (!buf) return -ENOSPC; + /* Get the maximum width of a column */ + lf_str_size = get_max_column_width(rvu); + lfs = kzalloc(lf_str_size, GFP_KERNEL); if (!lfs) { kfree(buf); @@ -262,65 +332,69 @@ static ssize_t rvu_dbg_rsrc_attach_status(struct file *filp, "%-*s", lf_str_size, rvu->hw->block[index].name); } + off += scnprintf(&buf[off], buf_size - 1 - off, "\n"); + bytes_not_copied = copy_to_user(buffer + (i * off), buf, off); + if (bytes_not_copied) + goto out; + + i++; + *ppos += off; for (pf = 0; pf < rvu->hw->total_pfs; pf++) { for (vf = 0; vf <= rvu->hw->total_vfs; vf++) { + off = 0; + flag = 0; pcifunc = pf << 10 | vf; if (!pcifunc) continue; if (vf) { sprintf(lfs, "PF%d:VF%d", pf, vf - 1); - go_back = scnprintf(&buf[off], - buf_size - 1 - off, - "%-*s", lf_str_size, lfs); + off = scnprintf(&buf[off], + buf_size - 1 - off, + "%-*s", lf_str_size, lfs); } else { sprintf(lfs, "PF%d", pf); - go_back = scnprintf(&buf[off], - buf_size - 1 - off, - "%-*s", lf_str_size, lfs); + off = scnprintf(&buf[off], + buf_size - 1 - off, + "%-*s", lf_str_size, lfs); } - off += go_back; - for (index = 0; index < BLKTYPE_MAX; index++) { + for (index = 0; index < BLK_COUNT; index++) { block = rvu->hw->block[index]; if (!strlen(block.name)) continue; len = 0; lfs[len] = '\0'; - for (lf = 0; lf < block.lf.max; lf++) { - if (block.fn_map[lf] != pcifunc) - continue; + get_lf_str_list(block, pcifunc, lfs); + if (strlen(lfs)) flag = 1; - len += sprintf(&lfs[len], "%d,", lf); - } - if (flag) - len--; - lfs[len] = '\0'; off += scnprintf(&buf[off], buf_size - 1 - off, "%-*s", lf_str_size, lfs); - if (!strlen(lfs)) - go_back += lf_str_size; } - if (!flag) - off -= go_back; - else - flag = 0; - off--; - off += scnprintf(&buf[off], buf_size - 1 - off, "\n"); + if (flag) { + off += scnprintf(&buf[off], + buf_size - 1 - off, "\n"); + bytes_not_copied = copy_to_user(buffer + + (i * off), + buf, off); + if (bytes_not_copied) + goto out; + + i++; + *ppos += off; + } } } - bytes_not_copied = copy_to_user(buffer, buf, off); +out: kfree(lfs); kfree(buf); - if (bytes_not_copied) return -EFAULT; - *ppos = off; - return off; + return *ppos; } RVU_DEBUG_FOPS(rsrc_status, rsrc_attach_status, NULL); @@ -504,7 +578,7 @@ static ssize_t rvu_dbg_qsize_write(struct file *filp, if (cmd_buf) ret = -EINVAL; - if (!strncmp(subtoken, "help", 4) || ret < 0) { + if (ret < 0 || !strncmp(subtoken, "help", 4)) { dev_info(rvu->dev, "Use echo <%s-lf > qsize\n", blk_string); goto qsize_write_done; } @@ -1719,6 +1793,10 @@ static int rvu_dbg_nix_band_prof_ctx_display(struct seq_file *m, void *unused) u16 pcifunc; char *str; + /* Ingress policers do not exist on all platforms */ + if (!nix_hw->ipolicer) + return 0; + for (layer = 0; layer < BAND_PROF_NUM_LAYERS; layer++) { if (layer == BAND_PROF_INVAL_LAYER) continue; @@ -1768,6 +1846,10 @@ static int rvu_dbg_nix_band_prof_rsrc_display(struct seq_file *m, void *unused) int layer; char *str; + /* Ingress policers do not exist on all platforms */ + if (!nix_hw->ipolicer) + return 0; + seq_puts(m, "\nBandwidth profile resource free count\n"); seq_puts(m, "=====================================\n"); for (layer = 0; layer < BAND_PROF_NUM_LAYERS; layer++) { diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c index 9ef4e942e31e..6970540dc470 100644 --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c @@ -2507,6 +2507,9 @@ static void nix_free_tx_vtag_entries(struct rvu *rvu, u16 pcifunc) return; nix_hw = get_nix_hw(rvu->hw, blkaddr); + if (!nix_hw) + return; + vlan = &nix_hw->txvlan; mutex_lock(&vlan->rsrc_lock); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h index 41684a6c44e9..a88a1a48229f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h @@ -199,6 +199,9 @@ void mlx5e_disable_cvlan_filter(struct mlx5e_priv *priv); int mlx5e_create_flow_steering(struct mlx5e_priv *priv); void mlx5e_destroy_flow_steering(struct mlx5e_priv *priv); +int mlx5e_fs_init(struct mlx5e_priv *priv); +void mlx5e_fs_cleanup(struct mlx5e_priv *priv); + int mlx5e_add_vlan_trap(struct mlx5e_priv *priv, int trap_id, int tir_num); void mlx5e_remove_vlan_trap(struct mlx5e_priv *priv); int mlx5e_add_mac_trap(struct mlx5e_priv *priv, int trap_id, int tir_num); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c index b4e986818794..4a13ef561587 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.c @@ -10,6 +10,8 @@ #include "en_tc.h" #include "rep/tc.h" #include "rep/neigh.h" +#include "lag.h" +#include "lag_mp.h" struct mlx5e_tc_tun_route_attr { struct net_device *out_dev; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c index 33de8f0092a6..fb5397324aa4 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c @@ -141,8 +141,7 @@ static void mlx5e_ipsec_set_swp(struct sk_buff *skb, * Pkt: MAC IP ESP IP L4 * * Transport Mode: - * SWP: OutL3 InL4 - * InL3 + * SWP: OutL3 OutL4 * Pkt: MAC IP ESP L4 * * Tunnel(VXLAN TCP/UDP) over Transport Mode @@ -171,31 +170,35 @@ static void mlx5e_ipsec_set_swp(struct sk_buff *skb, return; if (!xo->inner_ipproto) { - eseg->swp_inner_l3_offset = skb_network_offset(skb) / 2; - eseg->swp_inner_l4_offset = skb_inner_transport_offset(skb) / 2; - if (skb->protocol == htons(ETH_P_IPV6)) - eseg->swp_flags |= MLX5_ETH_WQE_SWP_INNER_L3_IPV6; - if (xo->proto == IPPROTO_UDP) + switch (xo->proto) { + case IPPROTO_UDP: + eseg->swp_flags |= MLX5_ETH_WQE_SWP_OUTER_L4_UDP; + fallthrough; + case IPPROTO_TCP: + /* IP | ESP | TCP */ + eseg->swp_outer_l4_offset = skb_inner_transport_offset(skb) / 2; + break; + default: + break; + } + } else { + /* Tunnel(VXLAN TCP/UDP) over Transport Mode */ + switch (xo->inner_ipproto) { + case IPPROTO_UDP: eseg->swp_flags |= MLX5_ETH_WQE_SWP_INNER_L4_UDP; - return; - } - - /* Tunnel(VXLAN TCP/UDP) over Transport Mode */ - switch (xo->inner_ipproto) { - case IPPROTO_UDP: - eseg->swp_flags |= MLX5_ETH_WQE_SWP_INNER_L4_UDP; - fallthrough; - case IPPROTO_TCP: - eseg->swp_inner_l3_offset = skb_inner_network_offset(skb) / 2; - eseg->swp_inner_l4_offset = (skb->csum_start + skb->head - skb->data) / 2; - if (skb->protocol == htons(ETH_P_IPV6)) - eseg->swp_flags |= MLX5_ETH_WQE_SWP_INNER_L3_IPV6; - break; - default: - break; + fallthrough; + case IPPROTO_TCP: + eseg->swp_inner_l3_offset = skb_inner_network_offset(skb) / 2; + eseg->swp_inner_l4_offset = + (skb->csum_start + skb->head - skb->data) / 2; + if (skb->protocol == htons(ETH_P_IPV6)) + eseg->swp_flags |= MLX5_ETH_WQE_SWP_INNER_L3_IPV6; + break; + default: + break; + } } - return; } void mlx5e_ipsec_set_iv_esn(struct sk_buff *skb, struct xfrm_state *x, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c index c06b4b938ae7..d226cc5ab1d1 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c @@ -1186,10 +1186,6 @@ static int mlx5e_create_vlan_table(struct mlx5e_priv *priv) struct mlx5e_flow_table *ft; int err; - priv->fs.vlan = kvzalloc(sizeof(*priv->fs.vlan), GFP_KERNEL); - if (!priv->fs.vlan) - return -ENOMEM; - ft = &priv->fs.vlan->ft; ft->num_groups = 0; @@ -1198,10 +1194,8 @@ static int mlx5e_create_vlan_table(struct mlx5e_priv *priv) ft_attr.prio = MLX5E_NIC_PRIO; ft->t = mlx5_create_flow_table(priv->fs.ns, &ft_attr); - if (IS_ERR(ft->t)) { - err = PTR_ERR(ft->t); - goto err_free_t; - } + if (IS_ERR(ft->t)) + return PTR_ERR(ft->t); ft->g = kcalloc(MLX5E_NUM_VLAN_GROUPS, sizeof(*ft->g), GFP_KERNEL); if (!ft->g) { @@ -1221,9 +1215,6 @@ err_free_g: kfree(ft->g); err_destroy_vlan_table: mlx5_destroy_flow_table(ft->t); -err_free_t: - kvfree(priv->fs.vlan); - priv->fs.vlan = NULL; return err; } @@ -1232,7 +1223,6 @@ static void mlx5e_destroy_vlan_table(struct mlx5e_priv *priv) { mlx5e_del_vlan_rules(priv); mlx5e_destroy_flow_table(&priv->fs.vlan->ft); - kvfree(priv->fs.vlan); } static void mlx5e_destroy_inner_ttc_table(struct mlx5e_priv *priv) @@ -1351,3 +1341,17 @@ void mlx5e_destroy_flow_steering(struct mlx5e_priv *priv) mlx5e_arfs_destroy_tables(priv); mlx5e_ethtool_cleanup_steering(priv); } + +int mlx5e_fs_init(struct mlx5e_priv *priv) +{ + priv->fs.vlan = kvzalloc(sizeof(*priv->fs.vlan), GFP_KERNEL); + if (!priv->fs.vlan) + return -ENOMEM; + return 0; +} + +void mlx5e_fs_cleanup(struct mlx5e_priv *priv) +{ + kvfree(priv->fs.vlan); + priv->fs.vlan = NULL; +} diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 09c8b71b186c..41ef6eb70a58 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -4578,6 +4578,12 @@ static int mlx5e_nic_init(struct mlx5_core_dev *mdev, mlx5e_timestamp_init(priv); + err = mlx5e_fs_init(priv); + if (err) { + mlx5_core_err(mdev, "FS initialization failed, %d\n", err); + return err; + } + err = mlx5e_ipsec_init(priv); if (err) mlx5_core_err(mdev, "IPSec initialization failed, %d\n", err); @@ -4595,6 +4601,7 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv) mlx5e_health_destroy_reporters(priv); mlx5e_tls_cleanup(priv); mlx5e_ipsec_cleanup(priv); + mlx5e_fs_cleanup(priv); } static int mlx5e_init_nic_rx(struct mlx5e_priv *priv) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c index ba8164792016..129ff7e0d65c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c @@ -67,6 +67,8 @@ #include "lib/fs_chains.h" #include "diag/en_tc_tracepoint.h" #include <asm/div64.h> +#include "lag.h" +#include "lag_mp.h" #define nic_chains(priv) ((priv)->fs.tc.chains) #define MLX5_MH_ACT_SZ MLX5_UN_SZ_BYTES(set_add_copy_action_in_auto) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c index c63d78eda606..188994d091c5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c @@ -213,19 +213,18 @@ static inline void mlx5e_insert_vlan(void *start, struct sk_buff *skb, u16 ihs) memcpy(&vhdr->h_vlan_encapsulated_proto, skb->data + cpy1_sz, cpy2_sz); } -/* If packet is not IP's CHECKSUM_PARTIAL (e.g. icmd packet), - * need to set L3 checksum flag for IPsec - */ static void ipsec_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5_wqe_eth_seg *eseg) { + struct xfrm_offload *xo = xfrm_offload(skb); + eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM; - if (skb->encapsulation) { - eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM; + if (xo->inner_ipproto) { + eseg->cs_flags |= MLX5_ETH_WQE_L4_INNER_CSUM | MLX5_ETH_WQE_L3_INNER_CSUM; + } else if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) { + eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM; sq->stats->csum_partial_inner++; - } else { - sq->stats->csum_partial++; } } @@ -234,6 +233,11 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb, struct mlx5e_accel_tx_state *accel, struct mlx5_wqe_eth_seg *eseg) { + if (unlikely(mlx5e_ipsec_eseg_meta(eseg))) { + ipsec_txwqe_build_eseg_csum(sq, skb, eseg); + return; + } + if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) { eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM; if (skb->encapsulation) { @@ -249,8 +253,6 @@ mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb, eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM; sq->stats->csum_partial++; #endif - } else if (unlikely(mlx5e_ipsec_eseg_meta(eseg))) { - ipsec_txwqe_build_eseg_csum(sq, skb, eseg); } else sq->stats->csum_none++; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c index 985e305179d1..c6cc67cb4f6a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c @@ -473,10 +473,9 @@ esw_qos_create_rate_group(struct mlx5_eswitch *esw, struct netlink_ext_ack *exta err_min_rate: list_del(&group->list); - err = mlx5_destroy_scheduling_element_cmd(esw->dev, - SCHEDULING_HIERARCHY_E_SWITCH, - group->tsar_ix); - if (err) + if (mlx5_destroy_scheduling_element_cmd(esw->dev, + SCHEDULING_HIERARCHY_E_SWITCH, + group->tsar_ix)) NL_SET_ERR_MSG_MOD(extack, "E-Switch destroy TSAR for group failed"); err_sched_elem: kfree(group); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag.c index ca5690b0a7ab..d2105c1635c3 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lag.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag.c @@ -442,6 +442,10 @@ static void mlx5_do_bond(struct mlx5_lag *ldev) if (!mlx5_lag_is_ready(ldev)) { do_bond = false; } else { + /* VF LAG is in multipath mode, ignore bond change requests */ + if (mlx5_lag_is_multipath(dev0)) + return; + tracker = ldev->tracker; do_bond = tracker.is_bonded && mlx5_lag_check_prereq(ldev); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c index f239b352a58a..21fdaf708f1f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.c @@ -9,20 +9,23 @@ #include "eswitch.h" #include "lib/mlx5.h" +static bool __mlx5_lag_is_multipath(struct mlx5_lag *ldev) +{ + return !!(ldev->flags & MLX5_LAG_FLAG_MULTIPATH); +} + static bool mlx5_lag_multipath_check_prereq(struct mlx5_lag *ldev) { if (!mlx5_lag_is_ready(ldev)) return false; + if (__mlx5_lag_is_active(ldev) && !__mlx5_lag_is_multipath(ldev)) + return false; + return mlx5_esw_multipath_prereq(ldev->pf[MLX5_LAG_P1].dev, ldev->pf[MLX5_LAG_P2].dev); } -static bool __mlx5_lag_is_multipath(struct mlx5_lag *ldev) -{ - return !!(ldev->flags & MLX5_LAG_FLAG_MULTIPATH); -} - bool mlx5_lag_is_multipath(struct mlx5_core_dev *dev) { struct mlx5_lag *ldev; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.h b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.h index 729c839397a8..dea199e79bed 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag_mp.h @@ -24,12 +24,14 @@ struct lag_mp { void mlx5_lag_mp_reset(struct mlx5_lag *ldev); int mlx5_lag_mp_init(struct mlx5_lag *ldev); void mlx5_lag_mp_cleanup(struct mlx5_lag *ldev); +bool mlx5_lag_is_multipath(struct mlx5_core_dev *dev); #else /* CONFIG_MLX5_ESWITCH */ static inline void mlx5_lag_mp_reset(struct mlx5_lag *ldev) {}; static inline int mlx5_lag_mp_init(struct mlx5_lag *ldev) { return 0; } static inline void mlx5_lag_mp_cleanup(struct mlx5_lag *ldev) {} +bool mlx5_lag_is_multipath(struct mlx5_core_dev *dev) { return false; } #endif /* CONFIG_MLX5_ESWITCH */ #endif /* __MLX5_LAG_MP_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c index 13b0259f7ea6..fcace73eae40 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/pci.c +++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c @@ -353,13 +353,10 @@ static int mlxsw_pci_rdq_skb_alloc(struct mlxsw_pci *mlxsw_pci, struct sk_buff *skb; int err; - elem_info->u.rdq.skb = NULL; skb = netdev_alloc_skb_ip_align(NULL, buf_len); if (!skb) return -ENOMEM; - /* Assume that wqe was previously zeroed. */ - err = mlxsw_pci_wqe_frag_map(mlxsw_pci, wqe, 0, skb->data, buf_len, DMA_FROM_DEVICE); if (err) @@ -597,21 +594,26 @@ static void mlxsw_pci_cqe_rdq_handle(struct mlxsw_pci *mlxsw_pci, struct pci_dev *pdev = mlxsw_pci->pdev; struct mlxsw_pci_queue_elem_info *elem_info; struct mlxsw_rx_info rx_info = {}; - char *wqe; + char wqe[MLXSW_PCI_WQE_SIZE]; struct sk_buff *skb; u16 byte_count; int err; elem_info = mlxsw_pci_queue_elem_info_consumer_get(q); - skb = elem_info->u.sdq.skb; - if (!skb) - return; - wqe = elem_info->elem; - mlxsw_pci_wqe_frag_unmap(mlxsw_pci, wqe, 0, DMA_FROM_DEVICE); + skb = elem_info->u.rdq.skb; + memcpy(wqe, elem_info->elem, MLXSW_PCI_WQE_SIZE); if (q->consumer_counter++ != consumer_counter_limit) dev_dbg_ratelimited(&pdev->dev, "Consumer counter does not match limit in RDQ\n"); + err = mlxsw_pci_rdq_skb_alloc(mlxsw_pci, elem_info); + if (err) { + dev_err_ratelimited(&pdev->dev, "Failed to alloc skb for RDQ\n"); + goto out; + } + + mlxsw_pci_wqe_frag_unmap(mlxsw_pci, wqe, 0, DMA_FROM_DEVICE); + if (mlxsw_pci_cqe_lag_get(cqe_v, cqe)) { rx_info.is_lag = true; rx_info.u.lag_id = mlxsw_pci_cqe_lag_id_get(cqe_v, cqe); @@ -647,10 +649,7 @@ static void mlxsw_pci_cqe_rdq_handle(struct mlxsw_pci *mlxsw_pci, skb_put(skb, byte_count); mlxsw_core_skb_receive(mlxsw_pci->core, skb, &rx_info); - memset(wqe, 0, q->elem_size); - err = mlxsw_pci_rdq_skb_alloc(mlxsw_pci, elem_info); - if (err) - dev_dbg_ratelimited(&pdev->dev, "Failed to alloc skb for RDQ\n"); +out: /* Everything is set up, ring doorbell to pass elem to HW */ q->producer_counter++; mlxsw_pci_queue_doorbell_producer_ring(mlxsw_pci, q); diff --git a/drivers/net/ethernet/microchip/lan743x_main.c b/drivers/net/ethernet/microchip/lan743x_main.c index 9e8561cdc32a..4d5a5d6595b3 100644 --- a/drivers/net/ethernet/microchip/lan743x_main.c +++ b/drivers/net/ethernet/microchip/lan743x_main.c @@ -1743,6 +1743,16 @@ static int lan743x_tx_ring_init(struct lan743x_tx *tx) ret = -EINVAL; goto cleanup; } + if (dma_set_mask_and_coherent(&tx->adapter->pdev->dev, + DMA_BIT_MASK(64))) { + if (dma_set_mask_and_coherent(&tx->adapter->pdev->dev, + DMA_BIT_MASK(32))) { + dev_warn(&tx->adapter->pdev->dev, + "lan743x_: No suitable DMA available\n"); + ret = -ENOMEM; + goto cleanup; + } + } ring_allocation_size = ALIGN(tx->ring_size * sizeof(struct lan743x_tx_descriptor), PAGE_SIZE); @@ -1934,7 +1944,8 @@ static void lan743x_rx_update_tail(struct lan743x_rx *rx, int index) index); } -static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index) +static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index, + gfp_t gfp) { struct net_device *netdev = rx->adapter->netdev; struct device *dev = &rx->adapter->pdev->dev; @@ -1948,7 +1959,7 @@ static int lan743x_rx_init_ring_element(struct lan743x_rx *rx, int index) descriptor = &rx->ring_cpu_ptr[index]; buffer_info = &rx->buffer_info[index]; - skb = __netdev_alloc_skb(netdev, buffer_length, GFP_ATOMIC | GFP_DMA); + skb = __netdev_alloc_skb(netdev, buffer_length, gfp); if (!skb) return -ENOMEM; dma_ptr = dma_map_single(dev, skb->data, buffer_length, DMA_FROM_DEVICE); @@ -2110,7 +2121,8 @@ static int lan743x_rx_process_buffer(struct lan743x_rx *rx) /* save existing skb, allocate new skb and map to dma */ skb = buffer_info->skb; - if (lan743x_rx_init_ring_element(rx, rx->last_head)) { + if (lan743x_rx_init_ring_element(rx, rx->last_head, + GFP_ATOMIC | GFP_DMA)) { /* failed to allocate next skb. * Memory is very low. * Drop this packet and reuse buffer. @@ -2276,6 +2288,16 @@ static int lan743x_rx_ring_init(struct lan743x_rx *rx) ret = -EINVAL; goto cleanup; } + if (dma_set_mask_and_coherent(&rx->adapter->pdev->dev, + DMA_BIT_MASK(64))) { + if (dma_set_mask_and_coherent(&rx->adapter->pdev->dev, + DMA_BIT_MASK(32))) { + dev_warn(&rx->adapter->pdev->dev, + "lan743x_: No suitable DMA available\n"); + ret = -ENOMEM; + goto cleanup; + } + } ring_allocation_size = ALIGN(rx->ring_size * sizeof(struct lan743x_rx_descriptor), PAGE_SIZE); @@ -2315,13 +2337,16 @@ static int lan743x_rx_ring_init(struct lan743x_rx *rx) rx->last_head = 0; for (index = 0; index < rx->ring_size; index++) { - ret = lan743x_rx_init_ring_element(rx, index); + ret = lan743x_rx_init_ring_element(rx, index, GFP_KERNEL); if (ret) goto cleanup; } return 0; cleanup: + netif_warn(rx->adapter, ifup, rx->adapter->netdev, + "Error allocating memory for LAN743x\n"); + lan743x_rx_ring_cleanup(rx); return ret; } @@ -3019,6 +3044,8 @@ static int lan743x_pm_resume(struct device *dev) if (ret) { netif_err(adapter, probe, adapter->netdev, "lan743x_hardware_init returned %d\n", ret); + lan743x_pci_cleanup(adapter); + return ret; } /* open netdev when netdev is at running state while resume. diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_main.c b/drivers/net/ethernet/microchip/sparx5/sparx5_main.c index cbece6e9bff2..5030dfca3879 100644 --- a/drivers/net/ethernet/microchip/sparx5/sparx5_main.c +++ b/drivers/net/ethernet/microchip/sparx5/sparx5_main.c @@ -758,6 +758,7 @@ static int mchp_sparx5_probe(struct platform_device *pdev) err = dev_err_probe(sparx5->dev, PTR_ERR(serdes), "port %u: missing serdes\n", portno); + of_node_put(portnp); goto cleanup_config; } config->portno = portno; diff --git a/drivers/net/ethernet/mscc/ocelot_vsc7514.c b/drivers/net/ethernet/mscc/ocelot_vsc7514.c index 291ae6817c26..d51f799e4e86 100644 --- a/drivers/net/ethernet/mscc/ocelot_vsc7514.c +++ b/drivers/net/ethernet/mscc/ocelot_vsc7514.c @@ -969,6 +969,7 @@ static int mscc_ocelot_init_ports(struct platform_device *pdev, target = ocelot_regmap_init(ocelot, res); if (IS_ERR(target)) { err = PTR_ERR(target); + of_node_put(portnp); goto out_teardown; } diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 11c83a99b014..f469950c7265 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -182,15 +182,21 @@ static int nfp_bpf_check_mtu(struct nfp_app *app, struct net_device *netdev, int new_mtu) { struct nfp_net *nn = netdev_priv(netdev); - unsigned int max_mtu; + struct nfp_bpf_vnic *bv; + struct bpf_prog *prog; if (~nn->dp.ctrl & NFP_NET_CFG_CTRL_BPF) return 0; - max_mtu = nn_readb(nn, NFP_NET_CFG_BPF_INL_MTU) * 64 - 32; - if (new_mtu > max_mtu) { - nn_info(nn, "BPF offload active, MTU over %u not supported\n", - max_mtu); + if (nn->xdp_hw.prog) { + prog = nn->xdp_hw.prog; + } else { + bv = nn->app_priv; + prog = bv->tc_prog; + } + + if (nfp_bpf_offload_check_mtu(nn, prog, new_mtu)) { + nn_info(nn, "BPF offload active, potential packet access beyond hardware packet boundary"); return -EBUSY; } return 0; diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h index d0e17eebddd9..16841bb750b7 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.h +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h @@ -560,6 +560,8 @@ bool nfp_is_subprog_start(struct nfp_insn_meta *meta); void nfp_bpf_jit_prepare(struct nfp_prog *nfp_prog); int nfp_bpf_jit(struct nfp_prog *prog); bool nfp_bpf_supported_opcode(u8 code); +bool nfp_bpf_offload_check_mtu(struct nfp_net *nn, struct bpf_prog *prog, + unsigned int mtu); int nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c index 53851853562c..9d97cd281f18 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c @@ -481,19 +481,28 @@ int nfp_bpf_event_output(struct nfp_app_bpf *bpf, const void *data, return 0; } +bool nfp_bpf_offload_check_mtu(struct nfp_net *nn, struct bpf_prog *prog, + unsigned int mtu) +{ + unsigned int fw_mtu, pkt_off; + + fw_mtu = nn_readb(nn, NFP_NET_CFG_BPF_INL_MTU) * 64 - 32; + pkt_off = min(prog->aux->max_pkt_offset, mtu); + + return fw_mtu < pkt_off; +} + static int nfp_net_bpf_load(struct nfp_net *nn, struct bpf_prog *prog, struct netlink_ext_ack *extack) { struct nfp_prog *nfp_prog = prog->aux->offload->dev_priv; - unsigned int fw_mtu, pkt_off, max_stack, max_prog_len; + unsigned int max_stack, max_prog_len; dma_addr_t dma_addr; void *img; int err; - fw_mtu = nn_readb(nn, NFP_NET_CFG_BPF_INL_MTU) * 64 - 32; - pkt_off = min(prog->aux->max_pkt_offset, nn->dp.netdev->mtu); - if (fw_mtu < pkt_off) { + if (nfp_bpf_offload_check_mtu(nn, prog, nn->dp.netdev->mtu)) { NL_SET_ERR_MSG_MOD(extack, "BPF offload not supported with potential packet access beyond HW packet split boundary"); return -EOPNOTSUPP; } diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.c b/drivers/net/ethernet/netronome/nfp/nfp_asm.c index 2643ea5948f4..154399c5453f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_asm.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.c @@ -196,7 +196,7 @@ int swreg_to_unrestricted(swreg dst, swreg lreg, swreg rreg, } reg->dst_lmextn = swreg_lmextn(dst); - reg->src_lmextn = swreg_lmextn(lreg) | swreg_lmextn(rreg); + reg->src_lmextn = swreg_lmextn(lreg) || swreg_lmextn(rreg); return 0; } @@ -277,7 +277,7 @@ int swreg_to_restricted(swreg dst, swreg lreg, swreg rreg, } reg->dst_lmextn = swreg_lmextn(dst); - reg->src_lmextn = swreg_lmextn(lreg) | swreg_lmextn(rreg); + reg->src_lmextn = swreg_lmextn(lreg) || swreg_lmextn(rreg); return 0; } diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c index d29fe562b3de..c910fa2f40a4 100644 --- a/drivers/net/ethernet/nxp/lpc_eth.c +++ b/drivers/net/ethernet/nxp/lpc_eth.c @@ -1015,9 +1015,6 @@ static int lpc_eth_close(struct net_device *ndev) napi_disable(&pldat->napi); netif_stop_queue(ndev); - if (ndev->phydev) - phy_stop(ndev->phydev); - spin_lock_irqsave(&pldat->lock, flags); __lpc_eth_reset(pldat); netif_carrier_off(ndev); @@ -1025,6 +1022,8 @@ static int lpc_eth_close(struct net_device *ndev) writel(0, LPC_ENET_MAC2(pldat->net_base)); spin_unlock_irqrestore(&pldat->lock, flags); + if (ndev->phydev) + phy_stop(ndev->phydev); clk_disable_unprepare(pldat->clk); return 0; diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c index 46a6ff9a782d..2918947dd57c 100644 --- a/drivers/net/ethernet/realtek/r8169_main.c +++ b/drivers/net/ethernet/realtek/r8169_main.c @@ -157,6 +157,7 @@ static const struct pci_device_id rtl8169_pci_tbl[] = { { PCI_VDEVICE(REALTEK, 0x8129) }, { PCI_VDEVICE(REALTEK, 0x8136), RTL_CFG_NO_GBIT }, { PCI_VDEVICE(REALTEK, 0x8161) }, + { PCI_VDEVICE(REALTEK, 0x8162) }, { PCI_VDEVICE(REALTEK, 0x8167) }, { PCI_VDEVICE(REALTEK, 0x8168) }, { PCI_VDEVICE(NCUBE, 0x8168) }, diff --git a/drivers/net/ethernet/sfc/mcdi_port_common.c b/drivers/net/ethernet/sfc/mcdi_port_common.c index 4bd3ef8f3384..c4fe3c48ac46 100644 --- a/drivers/net/ethernet/sfc/mcdi_port_common.c +++ b/drivers/net/ethernet/sfc/mcdi_port_common.c @@ -132,16 +132,27 @@ void mcdi_to_ethtool_linkset(u32 media, u32 cap, unsigned long *linkset) case MC_CMD_MEDIA_SFP_PLUS: case MC_CMD_MEDIA_QSFP_PLUS: SET_BIT(FIBRE); - if (cap & (1 << MC_CMD_PHY_CAP_1000FDX_LBN)) + if (cap & (1 << MC_CMD_PHY_CAP_1000FDX_LBN)) { SET_BIT(1000baseT_Full); - if (cap & (1 << MC_CMD_PHY_CAP_10000FDX_LBN)) - SET_BIT(10000baseT_Full); - if (cap & (1 << MC_CMD_PHY_CAP_40000FDX_LBN)) + SET_BIT(1000baseX_Full); + } + if (cap & (1 << MC_CMD_PHY_CAP_10000FDX_LBN)) { + SET_BIT(10000baseCR_Full); + SET_BIT(10000baseLR_Full); + SET_BIT(10000baseSR_Full); + } + if (cap & (1 << MC_CMD_PHY_CAP_40000FDX_LBN)) { SET_BIT(40000baseCR4_Full); - if (cap & (1 << MC_CMD_PHY_CAP_100000FDX_LBN)) + SET_BIT(40000baseSR4_Full); + } + if (cap & (1 << MC_CMD_PHY_CAP_100000FDX_LBN)) { SET_BIT(100000baseCR4_Full); - if (cap & (1 << MC_CMD_PHY_CAP_25000FDX_LBN)) + SET_BIT(100000baseSR4_Full); + } + if (cap & (1 << MC_CMD_PHY_CAP_25000FDX_LBN)) { SET_BIT(25000baseCR_Full); + SET_BIT(25000baseSR_Full); + } if (cap & (1 << MC_CMD_PHY_CAP_50000FDX_LBN)) SET_BIT(50000baseCR2_Full); break; @@ -192,15 +203,19 @@ u32 ethtool_linkset_to_mcdi_cap(const unsigned long *linkset) result |= (1 << MC_CMD_PHY_CAP_100FDX_LBN); if (TEST_BIT(1000baseT_Half)) result |= (1 << MC_CMD_PHY_CAP_1000HDX_LBN); - if (TEST_BIT(1000baseT_Full) || TEST_BIT(1000baseKX_Full)) + if (TEST_BIT(1000baseT_Full) || TEST_BIT(1000baseKX_Full) || + TEST_BIT(1000baseX_Full)) result |= (1 << MC_CMD_PHY_CAP_1000FDX_LBN); - if (TEST_BIT(10000baseT_Full) || TEST_BIT(10000baseKX4_Full)) + if (TEST_BIT(10000baseT_Full) || TEST_BIT(10000baseKX4_Full) || + TEST_BIT(10000baseCR_Full) || TEST_BIT(10000baseLR_Full) || + TEST_BIT(10000baseSR_Full)) result |= (1 << MC_CMD_PHY_CAP_10000FDX_LBN); - if (TEST_BIT(40000baseCR4_Full) || TEST_BIT(40000baseKR4_Full)) + if (TEST_BIT(40000baseCR4_Full) || TEST_BIT(40000baseKR4_Full) || + TEST_BIT(40000baseSR4_Full)) result |= (1 << MC_CMD_PHY_CAP_40000FDX_LBN); - if (TEST_BIT(100000baseCR4_Full)) + if (TEST_BIT(100000baseCR4_Full) || TEST_BIT(100000baseSR4_Full)) result |= (1 << MC_CMD_PHY_CAP_100000FDX_LBN); - if (TEST_BIT(25000baseCR_Full)) + if (TEST_BIT(25000baseCR_Full) || TEST_BIT(25000baseSR_Full)) result |= (1 << MC_CMD_PHY_CAP_25000FDX_LBN); if (TEST_BIT(50000baseCR2_Full)) result |= (1 << MC_CMD_PHY_CAP_50000FDX_LBN); diff --git a/drivers/net/ethernet/sfc/ptp.c b/drivers/net/ethernet/sfc/ptp.c index a39c5143b386..797e51802ccb 100644 --- a/drivers/net/ethernet/sfc/ptp.c +++ b/drivers/net/ethernet/sfc/ptp.c @@ -648,7 +648,7 @@ static int efx_ptp_get_attributes(struct efx_nic *efx) } else if (rc == -EINVAL) { fmt = MC_CMD_PTP_OUT_GET_ATTRIBUTES_SECONDS_NANOSECONDS; } else if (rc == -EPERM) { - netif_info(efx, probe, efx->net_dev, "no PTP support\n"); + pci_info(efx->pci_dev, "no PTP support\n"); return rc; } else { efx_mcdi_display_error(efx, MC_CMD_PTP, sizeof(inbuf), @@ -824,7 +824,7 @@ static int efx_ptp_disable(struct efx_nic *efx) * should only have been called during probe. */ if (rc == -ENOSYS || rc == -EPERM) - netif_info(efx, probe, efx->net_dev, "no PTP support\n"); + pci_info(efx->pci_dev, "no PTP support\n"); else if (rc) efx_mcdi_display_error(efx, MC_CMD_PTP, MC_CMD_PTP_IN_DISABLE_LEN, diff --git a/drivers/net/ethernet/sfc/siena_sriov.c b/drivers/net/ethernet/sfc/siena_sriov.c index 83dcfcae3d4b..441e7f3e5375 100644 --- a/drivers/net/ethernet/sfc/siena_sriov.c +++ b/drivers/net/ethernet/sfc/siena_sriov.c @@ -1057,7 +1057,7 @@ void efx_siena_sriov_probe(struct efx_nic *efx) return; if (efx_siena_sriov_cmd(efx, false, &efx->vi_scale, &count)) { - netif_info(efx, probe, efx->net_dev, "no SR-IOV VFs probed\n"); + pci_info(efx->pci_dev, "no SR-IOV VFs probed\n"); return; } if (count > 0 && count > max_vfs) diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index eb3b7bf771d7..3d67d1fa3690 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -736,7 +736,7 @@ static int stmmac_hwtstamp_set(struct net_device *dev, struct ifreq *ifr) config.rx_filter = HWTSTAMP_FILTER_PTP_V2_EVENT; ptp_v2 = PTP_TCR_TSVER2ENA; snap_type_sel = PTP_TCR_SNAPTYPSEL_1; - if (priv->synopsys_id != DWMAC_CORE_5_10) + if (priv->synopsys_id < DWMAC_CORE_4_10) ts_event_en = PTP_TCR_TSEVNTENA; ptp_over_ipv4_udp = PTP_TCR_TSIPV4ENA; ptp_over_ipv6_udp = PTP_TCR_TSIPV6ENA; diff --git a/drivers/net/hamradio/baycom_epp.c b/drivers/net/hamradio/baycom_epp.c index 775dcf4ebde5..6b6f28d5b8d5 100644 --- a/drivers/net/hamradio/baycom_epp.c +++ b/drivers/net/hamradio/baycom_epp.c @@ -623,16 +623,16 @@ static int receive(struct net_device *dev, int cnt) /* --------------------------------------------------------------------- */ -#ifdef __i386__ +#if defined(__i386__) && !defined(CONFIG_UML) #include <asm/msr.h> #define GETTICK(x) \ ({ \ if (boot_cpu_has(X86_FEATURE_TSC)) \ x = (unsigned int)rdtsc(); \ }) -#else /* __i386__ */ +#else /* __i386__ && !CONFIG_UML */ #define GETTICK(x) -#endif /* __i386__ */ +#endif /* __i386__ && !CONFIG_UML */ static void epp_bh(struct work_struct *work) { diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index f124a8a58bd4..a3bfb156c83d 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -243,62 +243,10 @@ static void phy_sanitize_settings(struct phy_device *phydev) } } -int phy_ethtool_ksettings_set(struct phy_device *phydev, - const struct ethtool_link_ksettings *cmd) -{ - __ETHTOOL_DECLARE_LINK_MODE_MASK(advertising); - u8 autoneg = cmd->base.autoneg; - u8 duplex = cmd->base.duplex; - u32 speed = cmd->base.speed; - - if (cmd->base.phy_address != phydev->mdio.addr) - return -EINVAL; - - linkmode_copy(advertising, cmd->link_modes.advertising); - - /* We make sure that we don't pass unsupported values in to the PHY */ - linkmode_and(advertising, advertising, phydev->supported); - - /* Verify the settings we care about. */ - if (autoneg != AUTONEG_ENABLE && autoneg != AUTONEG_DISABLE) - return -EINVAL; - - if (autoneg == AUTONEG_ENABLE && linkmode_empty(advertising)) - return -EINVAL; - - if (autoneg == AUTONEG_DISABLE && - ((speed != SPEED_1000 && - speed != SPEED_100 && - speed != SPEED_10) || - (duplex != DUPLEX_HALF && - duplex != DUPLEX_FULL))) - return -EINVAL; - - phydev->autoneg = autoneg; - - if (autoneg == AUTONEG_DISABLE) { - phydev->speed = speed; - phydev->duplex = duplex; - } - - linkmode_copy(phydev->advertising, advertising); - - linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, - phydev->advertising, autoneg == AUTONEG_ENABLE); - - phydev->master_slave_set = cmd->base.master_slave_cfg; - phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl; - - /* Restart the PHY */ - phy_start_aneg(phydev); - - return 0; -} -EXPORT_SYMBOL(phy_ethtool_ksettings_set); - void phy_ethtool_ksettings_get(struct phy_device *phydev, struct ethtool_link_ksettings *cmd) { + mutex_lock(&phydev->lock); linkmode_copy(cmd->link_modes.supported, phydev->supported); linkmode_copy(cmd->link_modes.advertising, phydev->advertising); linkmode_copy(cmd->link_modes.lp_advertising, phydev->lp_advertising); @@ -317,6 +265,7 @@ void phy_ethtool_ksettings_get(struct phy_device *phydev, cmd->base.autoneg = phydev->autoneg; cmd->base.eth_tp_mdix_ctrl = phydev->mdix_ctrl; cmd->base.eth_tp_mdix = phydev->mdix; + mutex_unlock(&phydev->lock); } EXPORT_SYMBOL(phy_ethtool_ksettings_get); @@ -751,7 +700,7 @@ static int phy_check_link_status(struct phy_device *phydev) } /** - * phy_start_aneg - start auto-negotiation for this PHY device + * _phy_start_aneg - start auto-negotiation for this PHY device * @phydev: the phy_device struct * * Description: Sanitizes the settings (if we're not autonegotiating @@ -759,25 +708,43 @@ static int phy_check_link_status(struct phy_device *phydev) * If the PHYCONTROL Layer is operating, we change the state to * reflect the beginning of Auto-negotiation or forcing. */ -int phy_start_aneg(struct phy_device *phydev) +static int _phy_start_aneg(struct phy_device *phydev) { int err; + lockdep_assert_held(&phydev->lock); + if (!phydev->drv) return -EIO; - mutex_lock(&phydev->lock); - if (AUTONEG_DISABLE == phydev->autoneg) phy_sanitize_settings(phydev); err = phy_config_aneg(phydev); if (err < 0) - goto out_unlock; + return err; if (phy_is_started(phydev)) err = phy_check_link_status(phydev); -out_unlock: + + return err; +} + +/** + * phy_start_aneg - start auto-negotiation for this PHY device + * @phydev: the phy_device struct + * + * Description: Sanitizes the settings (if we're not autonegotiating + * them), and then calls the driver's config_aneg function. + * If the PHYCONTROL Layer is operating, we change the state to + * reflect the beginning of Auto-negotiation or forcing. + */ +int phy_start_aneg(struct phy_device *phydev) +{ + int err; + + mutex_lock(&phydev->lock); + err = _phy_start_aneg(phydev); mutex_unlock(&phydev->lock); return err; @@ -800,6 +767,61 @@ static int phy_poll_aneg_done(struct phy_device *phydev) return ret < 0 ? ret : 0; } +int phy_ethtool_ksettings_set(struct phy_device *phydev, + const struct ethtool_link_ksettings *cmd) +{ + __ETHTOOL_DECLARE_LINK_MODE_MASK(advertising); + u8 autoneg = cmd->base.autoneg; + u8 duplex = cmd->base.duplex; + u32 speed = cmd->base.speed; + + if (cmd->base.phy_address != phydev->mdio.addr) + return -EINVAL; + + linkmode_copy(advertising, cmd->link_modes.advertising); + + /* We make sure that we don't pass unsupported values in to the PHY */ + linkmode_and(advertising, advertising, phydev->supported); + + /* Verify the settings we care about. */ + if (autoneg != AUTONEG_ENABLE && autoneg != AUTONEG_DISABLE) + return -EINVAL; + + if (autoneg == AUTONEG_ENABLE && linkmode_empty(advertising)) + return -EINVAL; + + if (autoneg == AUTONEG_DISABLE && + ((speed != SPEED_1000 && + speed != SPEED_100 && + speed != SPEED_10) || + (duplex != DUPLEX_HALF && + duplex != DUPLEX_FULL))) + return -EINVAL; + + mutex_lock(&phydev->lock); + phydev->autoneg = autoneg; + + if (autoneg == AUTONEG_DISABLE) { + phydev->speed = speed; + phydev->duplex = duplex; + } + + linkmode_copy(phydev->advertising, advertising); + + linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT, + phydev->advertising, autoneg == AUTONEG_ENABLE); + + phydev->master_slave_set = cmd->base.master_slave_cfg; + phydev->mdix_ctrl = cmd->base.eth_tp_mdix_ctrl; + + /* Restart the PHY */ + _phy_start_aneg(phydev); + + mutex_unlock(&phydev->lock); + return 0; +} +EXPORT_SYMBOL(phy_ethtool_ksettings_set); + /** * phy_speed_down - set speed to lowest speed supported by both link partners * @phydev: the phy_device struct diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig index f87f17503373..b554054a7560 100644 --- a/drivers/net/usb/Kconfig +++ b/drivers/net/usb/Kconfig @@ -117,6 +117,7 @@ config USB_LAN78XX select PHYLIB select MICROCHIP_PHY select FIXED_PHY + select CRC32 help This option adds support for Microchip LAN78XX based USB 2 & USB 3 10/100/1000 Ethernet adapters. diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c index 793f8fbe0069..63cd72c5f580 100644 --- a/drivers/net/usb/lan78xx.c +++ b/drivers/net/usb/lan78xx.c @@ -4122,6 +4122,12 @@ static int lan78xx_probe(struct usb_interface *intf, dev->maxpacket = usb_maxpacket(dev->udev, dev->pipe_out, 1); + /* Reject broken descriptors. */ + if (dev->maxpacket == 0) { + ret = -ENODEV; + goto out4; + } + /* driver requires remote-wakeup capability during autosuspend. */ intf->needs_remote_wakeup = 1; diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c index 840c1c2ab16a..a33d7fb82a00 100644 --- a/drivers/net/usb/usbnet.c +++ b/drivers/net/usb/usbnet.c @@ -1788,6 +1788,11 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod) if (!dev->rx_urb_size) dev->rx_urb_size = dev->hard_mtu; dev->maxpacket = usb_maxpacket (dev->udev, dev->out, 1); + if (dev->maxpacket == 0) { + /* that is a broken device */ + status = -ENODEV; + goto out4; + } /* let userspace know we have a random address */ if (ether_addr_equal(net->dev_addr, node_id)) diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c index 142f70670f5c..8799854bacb2 100644 --- a/drivers/net/vmxnet3/vmxnet3_drv.c +++ b/drivers/net/vmxnet3/vmxnet3_drv.c @@ -3833,7 +3833,6 @@ vmxnet3_suspend(struct device *device) vmxnet3_free_intr_resources(adapter); netif_device_detach(netdev); - netif_tx_stop_all_queues(netdev); /* Create wake-up filters. */ pmConf = adapter->pm_conf; diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index bf2fac913942..662e26117353 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -1360,8 +1360,6 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev, bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr); bool is_ndisc = ipv6_ndisc_frame(skb); - nf_reset_ct(skb); - /* loopback, multicast & non-ND link-local traffic; do not push through * packet taps again. Reset pkt_type for upper layers to process skb. * For strict packets with a source LLA, determine the dst using the @@ -1424,8 +1422,6 @@ static struct sk_buff *vrf_ip_rcv(struct net_device *vrf_dev, skb->skb_iif = vrf_dev->ifindex; IPCB(skb)->flags |= IPSKB_L3SLAVE; - nf_reset_ct(skb); - if (ipv4_is_multicast(ip_hdr(skb)->daddr)) goto out; diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c index e31b98403f31..fc41ba95f81d 100644 --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -1730,6 +1730,10 @@ static int netfront_resume(struct xenbus_device *dev) dev_dbg(&dev->dev, "%s\n", dev->nodename); + netif_tx_lock_bh(info->netdev); + netif_device_detach(info->netdev); + netif_tx_unlock_bh(info->netdev); + xennet_disconnect_backend(info); return 0; } @@ -2349,6 +2353,10 @@ static int xennet_connect(struct net_device *dev) * domain a kick because we've probably just requeued some * packets. */ + netif_tx_lock_bh(np->netdev); + netif_device_attach(np->netdev); + netif_tx_unlock_bh(np->netdev); + netif_carrier_on(np->netdev); for (j = 0; j < num_queues; ++j) { queue = &np->queues[j]; diff --git a/drivers/nfc/port100.c b/drivers/nfc/port100.c index 517376c43b86..16ceb763594f 100644 --- a/drivers/nfc/port100.c +++ b/drivers/nfc/port100.c @@ -1006,11 +1006,11 @@ static u64 port100_get_command_type_mask(struct port100 *dev) skb = port100_alloc_skb(dev, 0); if (!skb) - return -ENOMEM; + return 0; resp = port100_send_cmd_sync(dev, PORT100_CMD_GET_COMMAND_TYPE, skb); if (IS_ERR(resp)) - return PTR_ERR(resp); + return 0; if (resp->len < 8) mask = 0; diff --git a/drivers/nfc/st95hf/core.c b/drivers/nfc/st95hf/core.c index d16cf3ff644e..b23f47936473 100644 --- a/drivers/nfc/st95hf/core.c +++ b/drivers/nfc/st95hf/core.c @@ -1226,11 +1226,9 @@ static int st95hf_remove(struct spi_device *nfc_spi_dev) &reset_cmd, ST95HF_RESET_CMD_LEN, ASYNC); - if (result) { + if (result) dev_err(&spictx->spidev->dev, "ST95HF reset failed in remove() err = %d\n", result); - return result; - } /* wait for 3 ms to complete the controller reset process */ usleep_range(3000, 4000); @@ -1239,7 +1237,7 @@ static int st95hf_remove(struct spi_device *nfc_spi_dev) if (stcontext->st95hf_supply) regulator_disable(stcontext->st95hf_supply); - return result; + return 0; } /* Register as SPI protocol driver */ diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index a67a3ad1d413..c74d7bceb222 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -332,26 +332,6 @@ static const struct attribute_group *pmem_attribute_groups[] = { NULL, }; -static void pmem_pagemap_cleanup(struct dev_pagemap *pgmap) -{ - struct pmem_device *pmem = pgmap->owner; - - blk_cleanup_disk(pmem->disk); -} - -static void pmem_release_queue(void *pgmap) -{ - pmem_pagemap_cleanup(pgmap); -} - -static void pmem_pagemap_kill(struct dev_pagemap *pgmap) -{ - struct request_queue *q = - container_of(pgmap->ref, struct request_queue, q_usage_counter); - - blk_freeze_queue_start(q); -} - static void pmem_release_disk(void *__pmem) { struct pmem_device *pmem = __pmem; @@ -359,12 +339,9 @@ static void pmem_release_disk(void *__pmem) kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); del_gendisk(pmem->disk); -} -static const struct dev_pagemap_ops fsdax_pagemap_ops = { - .kill = pmem_pagemap_kill, - .cleanup = pmem_pagemap_cleanup, -}; + blk_cleanup_disk(pmem->disk); +} static int pmem_attach_disk(struct device *dev, struct nd_namespace_common *ndns) @@ -426,10 +403,8 @@ static int pmem_attach_disk(struct device *dev, pmem->disk = disk; pmem->pgmap.owner = pmem; pmem->pfn_flags = PFN_DEV; - pmem->pgmap.ref = &q->q_usage_counter; if (is_nd_pfn(dev)) { pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; - pmem->pgmap.ops = &fsdax_pagemap_ops; addr = devm_memremap_pages(dev, &pmem->pgmap); pfn_sb = nd_pfn->pfn_sb; pmem->data_offset = le64_to_cpu(pfn_sb->dataoff); @@ -443,16 +418,12 @@ static int pmem_attach_disk(struct device *dev, pmem->pgmap.range.end = res->end; pmem->pgmap.nr_range = 1; pmem->pgmap.type = MEMORY_DEVICE_FS_DAX; - pmem->pgmap.ops = &fsdax_pagemap_ops; addr = devm_memremap_pages(dev, &pmem->pgmap); pmem->pfn_flags |= PFN_MAP; bb_range = pmem->pgmap.range; } else { addr = devm_memremap(dev, pmem->phys_addr, pmem->size, ARCH_MEMREMAP_PMEM); - if (devm_add_action_or_reset(dev, pmem_release_queue, - &pmem->pgmap)) - return -ENOMEM; bb_range.start = res->start; bb_range.end = res->end; } diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 3109bdf137e4..838b5e2058be 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -119,25 +119,6 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl, static void nvme_update_keep_alive(struct nvme_ctrl *ctrl, struct nvme_command *cmd); -/* - * Prepare a queue for teardown. - * - * This must forcibly unquiesce queues to avoid blocking dispatch, and only set - * the capacity to 0 after that to avoid blocking dispatchers that may be - * holding bd_butex. This will end buffered writers dirtying pages that can't - * be synced. - */ -static void nvme_set_queue_dying(struct nvme_ns *ns) -{ - if (test_and_set_bit(NVME_NS_DEAD, &ns->flags)) - return; - - blk_set_queue_dying(ns->queue); - blk_mq_unquiesce_queue(ns->queue); - - set_capacity_and_notify(ns->disk, 0); -} - void nvme_queue_scan(struct nvme_ctrl *ctrl) { /* @@ -222,7 +203,7 @@ int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl) static void nvme_do_delete_ctrl(struct nvme_ctrl *ctrl) { dev_info(ctrl->device, - "Removing ctrl: NQN \"%s\"\n", ctrl->opts->subsysnqn); + "Removing ctrl: NQN \"%s\"\n", nvmf_ctrl_subsysnqn(ctrl)); flush_work(&ctrl->reset_work); nvme_stop_ctrl(ctrl); @@ -834,6 +815,7 @@ static void nvme_assign_write_stream(struct nvme_ctrl *ctrl, static inline void nvme_setup_flush(struct nvme_ns *ns, struct nvme_command *cmnd) { + memset(cmnd, 0, sizeof(*cmnd)); cmnd->common.opcode = nvme_cmd_flush; cmnd->common.nsid = cpu_to_le32(ns->head->ns_id); } @@ -885,6 +867,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, return BLK_STS_IOERR; } + memset(cmnd, 0, sizeof(*cmnd)); cmnd->dsm.opcode = nvme_cmd_dsm; cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id); cmnd->dsm.nr = cpu_to_le32(segments - 1); @@ -901,6 +884,8 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req, static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns, struct request *req, struct nvme_command *cmnd) { + memset(cmnd, 0, sizeof(*cmnd)); + if (ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES) return nvme_setup_discard(ns, req, cmnd); @@ -934,9 +919,15 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH; cmnd->rw.opcode = op; + cmnd->rw.flags = 0; cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id); + cmnd->rw.rsvd2 = 0; + cmnd->rw.metadata = 0; cmnd->rw.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req))); cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1); + cmnd->rw.reftag = 0; + cmnd->rw.apptag = 0; + cmnd->rw.appmask = 0; if (req_op(req) == REQ_OP_WRITE && ctrl->nr_streams) nvme_assign_write_stream(ctrl, req, &control, &dsmgmt); @@ -993,10 +984,8 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req) struct nvme_ctrl *ctrl = nvme_req(req)->ctrl; blk_status_t ret = BLK_STS_OK; - if (!(req->rq_flags & RQF_DONTPREP)) { + if (!(req->rq_flags & RQF_DONTPREP)) nvme_clear_nvme_request(req); - memset(cmd, 0, sizeof(*cmd)); - } switch (req_op(req)) { case REQ_OP_DRV_IN: @@ -2612,6 +2601,24 @@ static ssize_t nvme_subsys_show_nqn(struct device *dev, } static SUBSYS_ATTR_RO(subsysnqn, S_IRUGO, nvme_subsys_show_nqn); +static ssize_t nvme_subsys_show_type(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct nvme_subsystem *subsys = + container_of(dev, struct nvme_subsystem, dev); + + switch (subsys->subtype) { + case NVME_NQN_DISC: + return sysfs_emit(buf, "discovery\n"); + case NVME_NQN_NVME: + return sysfs_emit(buf, "nvm\n"); + default: + return sysfs_emit(buf, "reserved\n"); + } +} +static SUBSYS_ATTR_RO(subsystype, S_IRUGO, nvme_subsys_show_type); + #define nvme_subsys_show_str_function(field) \ static ssize_t subsys_##field##_show(struct device *dev, \ struct device_attribute *attr, char *buf) \ @@ -2632,6 +2639,7 @@ static struct attribute *nvme_subsys_attrs[] = { &subsys_attr_serial.attr, &subsys_attr_firmware_rev.attr, &subsys_attr_subsysnqn.attr, + &subsys_attr_subsystype.attr, #ifdef CONFIG_NVME_MULTIPATH &subsys_attr_iopolicy.attr, #endif @@ -2702,6 +2710,21 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev)); subsys->vendor_id = le16_to_cpu(id->vid); subsys->cmic = id->cmic; + + /* Versions prior to 1.4 don't necessarily report a valid type */ + if (id->cntrltype == NVME_CTRL_DISC || + !strcmp(subsys->subnqn, NVME_DISC_SUBSYS_NAME)) + subsys->subtype = NVME_NQN_DISC; + else + subsys->subtype = NVME_NQN_NVME; + + if (nvme_discovery_ctrl(ctrl) && subsys->subtype != NVME_NQN_DISC) { + dev_err(ctrl->device, + "Subsystem %s is not a discovery controller", + subsys->subnqn); + kfree(subsys); + return -EINVAL; + } subsys->awupf = le16_to_cpu(id->awupf); #ifdef CONFIG_NVME_MULTIPATH subsys->iopolicy = NVME_IOPOLICY_NUMA; @@ -4485,6 +4508,37 @@ out: } EXPORT_SYMBOL_GPL(nvme_init_ctrl); +static void nvme_start_ns_queue(struct nvme_ns *ns) +{ + if (test_and_clear_bit(NVME_NS_STOPPED, &ns->flags)) + blk_mq_unquiesce_queue(ns->queue); +} + +static void nvme_stop_ns_queue(struct nvme_ns *ns) +{ + if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags)) + blk_mq_quiesce_queue(ns->queue); +} + +/* + * Prepare a queue for teardown. + * + * This must forcibly unquiesce queues to avoid blocking dispatch, and only set + * the capacity to 0 after that to avoid blocking dispatchers that may be + * holding bd_butex. This will end buffered writers dirtying pages that can't + * be synced. + */ +static void nvme_set_queue_dying(struct nvme_ns *ns) +{ + if (test_and_set_bit(NVME_NS_DEAD, &ns->flags)) + return; + + blk_set_queue_dying(ns->queue); + nvme_start_ns_queue(ns); + + set_capacity_and_notify(ns->disk, 0); +} + /** * nvme_kill_queues(): Ends all namespace queues * @ctrl: the dead controller that needs to end @@ -4500,7 +4554,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl) /* Forcibly unquiesce queues to avoid blocking dispatch */ if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q)) - blk_mq_unquiesce_queue(ctrl->admin_q); + nvme_start_admin_queue(ctrl); list_for_each_entry(ns, &ctrl->namespaces, list) nvme_set_queue_dying(ns); @@ -4563,7 +4617,7 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl) down_read(&ctrl->namespaces_rwsem); list_for_each_entry(ns, &ctrl->namespaces, list) - blk_mq_quiesce_queue(ns->queue); + nvme_stop_ns_queue(ns); up_read(&ctrl->namespaces_rwsem); } EXPORT_SYMBOL_GPL(nvme_stop_queues); @@ -4574,11 +4628,25 @@ void nvme_start_queues(struct nvme_ctrl *ctrl) down_read(&ctrl->namespaces_rwsem); list_for_each_entry(ns, &ctrl->namespaces, list) - blk_mq_unquiesce_queue(ns->queue); + nvme_start_ns_queue(ns); up_read(&ctrl->namespaces_rwsem); } EXPORT_SYMBOL_GPL(nvme_start_queues); +void nvme_stop_admin_queue(struct nvme_ctrl *ctrl) +{ + if (!test_and_set_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->flags)) + blk_mq_quiesce_queue(ctrl->admin_q); +} +EXPORT_SYMBOL_GPL(nvme_stop_admin_queue); + +void nvme_start_admin_queue(struct nvme_ctrl *ctrl) +{ + if (test_and_clear_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->flags)) + blk_mq_unquiesce_queue(ctrl->admin_q); +} +EXPORT_SYMBOL_GPL(nvme_start_admin_queue); + void nvme_sync_io_queues(struct nvme_ctrl *ctrl) { struct nvme_ns *ns; diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 668c6bb7a567..c5a2b71c5268 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -548,6 +548,7 @@ static const match_table_t opt_tokens = { { NVMF_OPT_NR_POLL_QUEUES, "nr_poll_queues=%d" }, { NVMF_OPT_TOS, "tos=%d" }, { NVMF_OPT_FAIL_FAST_TMO, "fast_io_fail_tmo=%d" }, + { NVMF_OPT_DISCOVERY, "discovery" }, { NVMF_OPT_ERR, NULL } }; @@ -823,6 +824,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, } opts->tos = token; break; + case NVMF_OPT_DISCOVERY: + opts->discovery_nqn = true; + break; default: pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n", p); @@ -949,7 +953,7 @@ EXPORT_SYMBOL_GPL(nvmf_free_options); #define NVMF_ALLOWED_OPTS (NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \ NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \ NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\ - NVMF_OPT_DISABLE_SQFLOW |\ + NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_DISCOVERY |\ NVMF_OPT_FAIL_FAST_TMO) static struct nvme_ctrl * diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index a146cb903869..c3203ff1c654 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -67,6 +67,7 @@ enum { NVMF_OPT_TOS = 1 << 19, NVMF_OPT_FAIL_FAST_TMO = 1 << 20, NVMF_OPT_HOST_IFACE = 1 << 21, + NVMF_OPT_DISCOVERY = 1 << 22, }; /** @@ -178,6 +179,13 @@ nvmf_ctlr_matches_baseopts(struct nvme_ctrl *ctrl, return true; } +static inline char *nvmf_ctrl_subsysnqn(struct nvme_ctrl *ctrl) +{ + if (!ctrl->subsys) + return ctrl->opts->subsysnqn; + return ctrl->subsys->subnqn; +} + int nvmf_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val); int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val); int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val); diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index aa14ad963d91..71b3108c22f0 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -16,6 +16,7 @@ #include <linux/nvme-fc.h> #include "fc.h" #include <scsi/scsi_transport_fc.h> +#include <linux/blk-mq-pci.h> /* *************************** Data Structures/Defines ****************** */ @@ -2382,7 +2383,7 @@ nvme_fc_ctrl_free(struct kref *ref) list_del(&ctrl->ctrl_list); spin_unlock_irqrestore(&ctrl->rport->lock, flags); - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + nvme_start_admin_queue(&ctrl->ctrl); blk_cleanup_queue(ctrl->ctrl.admin_q); blk_cleanup_queue(ctrl->ctrl.fabrics_q); blk_mq_free_tag_set(&ctrl->admin_tag_set); @@ -2510,7 +2511,7 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues) /* * clean up the admin queue. Same thing as above. */ - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); + nvme_stop_admin_queue(&ctrl->ctrl); blk_sync_queue(ctrl->ctrl.admin_q); blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_fc_terminate_exchange, &ctrl->ctrl); @@ -2841,6 +2842,28 @@ nvme_fc_complete_rq(struct request *rq) nvme_fc_ctrl_put(ctrl); } +static int nvme_fc_map_queues(struct blk_mq_tag_set *set) +{ + struct nvme_fc_ctrl *ctrl = set->driver_data; + int i; + + for (i = 0; i < set->nr_maps; i++) { + struct blk_mq_queue_map *map = &set->map[i]; + + if (!map->nr_queues) { + WARN_ON(i == HCTX_TYPE_DEFAULT); + continue; + } + + /* Call LLDD map queue functionality if defined */ + if (ctrl->lport->ops->map_queues) + ctrl->lport->ops->map_queues(&ctrl->lport->localport, + map); + else + blk_mq_map_queues(map); + } + return 0; +} static const struct blk_mq_ops nvme_fc_mq_ops = { .queue_rq = nvme_fc_queue_rq, @@ -2849,6 +2872,7 @@ static const struct blk_mq_ops nvme_fc_mq_ops = { .exit_request = nvme_fc_exit_request, .init_hctx = nvme_fc_init_hctx, .timeout = nvme_fc_timeout, + .map_queues = nvme_fc_map_queues, }; static int @@ -3095,7 +3119,7 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl) ctrl->ctrl.max_hw_sectors = ctrl->ctrl.max_segments << (ilog2(SZ_4K) - 9); - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + nvme_start_admin_queue(&ctrl->ctrl); ret = nvme_init_ctrl_finish(&ctrl->ctrl); if (ret || test_bit(ASSOC_FAILED, &ctrl->flags)) @@ -3249,7 +3273,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl) nvme_fc_free_queue(&ctrl->queues[0]); /* re-enable the admin_q so anything new can fast fail */ - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + nvme_start_admin_queue(&ctrl->ctrl); /* resume the io queues so that things will fast fail */ nvme_start_queues(&ctrl->ctrl); @@ -3572,7 +3596,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts, dev_info(ctrl->ctrl.device, "NVME-FC{%d}: new ctrl: NQN \"%s\"\n", - ctrl->cnum, ctrl->ctrl.opts->subsysnqn); + ctrl->cnum, nvmf_ctrl_subsysnqn(&ctrl->ctrl)); return &ctrl->ctrl; diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 11440c86881e..7f2071f2460c 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -105,8 +105,11 @@ void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl) down_read(&ctrl->namespaces_rwsem); list_for_each_entry(ns, &ctrl->namespaces, list) { - if (ns->head->disk) - kblockd_schedule_work(&ns->head->requeue_work); + if (!ns->head->disk) + continue; + kblockd_schedule_work(&ns->head->requeue_work); + if (ctrl->state == NVME_CTRL_LIVE) + disk_uevent(ns->head->disk, KOBJ_CHANGE); } up_read(&ctrl->namespaces_rwsem); } @@ -143,13 +146,12 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl) { struct nvme_ns *ns; - mutex_lock(&ctrl->scan_lock); down_read(&ctrl->namespaces_rwsem); - list_for_each_entry(ns, &ctrl->namespaces, list) - if (nvme_mpath_clear_current_path(ns)) - kblockd_schedule_work(&ns->head->requeue_work); + list_for_each_entry(ns, &ctrl->namespaces, list) { + nvme_mpath_clear_current_path(ns); + kblockd_schedule_work(&ns->head->requeue_work); + } up_read(&ctrl->namespaces_rwsem); - mutex_unlock(&ctrl->scan_lock); } void nvme_mpath_revalidate_paths(struct nvme_ns *ns) @@ -506,13 +508,23 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) static void nvme_mpath_set_live(struct nvme_ns *ns) { struct nvme_ns_head *head = ns->head; + int rc; if (!head->disk) return; + /* + * test_and_set_bit() is used because it is protecting against two nvme + * paths simultaneously calling device_add_disk() on the same namespace + * head. + */ if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) { - device_add_disk(&head->subsys->dev, head->disk, - nvme_ns_id_attr_groups); + rc = device_add_disk(&head->subsys->dev, head->disk, + nvme_ns_id_attr_groups); + if (rc) { + clear_bit(NVME_NSHEAD_DISK_LIVE, &ns->flags); + return; + } nvme_add_ns_head_cdev(head); } @@ -550,7 +562,7 @@ static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data, return -EINVAL; nr_nsids = le32_to_cpu(desc->nnsids); - nsid_buf_size = nr_nsids * sizeof(__le32); + nsid_buf_size = flex_array_size(desc, nsids, nr_nsids); if (WARN_ON_ONCE(desc->grpid == 0)) return -EINVAL; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index ef2467b93adb..b334af8aa264 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -342,6 +342,7 @@ struct nvme_ctrl { int nr_reconnects; unsigned long flags; #define NVME_CTRL_FAILFAST_EXPIRED 0 +#define NVME_CTRL_ADMIN_Q_STOPPED 1 struct nvmf_ctrl_options *opts; struct page *discard_page; @@ -372,6 +373,7 @@ struct nvme_subsystem { char model[40]; char firmware_rev[8]; u8 cmic; + enum nvme_subsys_type subtype; u16 vendor_id; u16 awupf; /* 0's based awupf value. */ struct ida ns_ida; @@ -463,6 +465,7 @@ struct nvme_ns { #define NVME_NS_ANA_PENDING 2 #define NVME_NS_FORCE_RO 3 #define NVME_NS_READY 4 +#define NVME_NS_STOPPED 5 struct cdev cdev; struct device cdev_device; @@ -679,6 +682,8 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, void nvme_stop_queues(struct nvme_ctrl *ctrl); void nvme_start_queues(struct nvme_ctrl *ctrl); +void nvme_stop_admin_queue(struct nvme_ctrl *ctrl); +void nvme_start_admin_queue(struct nvme_ctrl *ctrl); void nvme_kill_queues(struct nvme_ctrl *ctrl); void nvme_sync_queues(struct nvme_ctrl *ctrl); void nvme_sync_io_queues(struct nvme_ctrl *ctrl); diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index ed684874842f..ca2ee806d74b 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -245,8 +245,15 @@ static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev) { unsigned int mem_size = nvme_dbbuf_size(dev); - if (dev->dbbuf_dbs) + if (dev->dbbuf_dbs) { + /* + * Clear the dbbuf memory so the driver doesn't observe stale + * values from the previous instantiation. + */ + memset(dev->dbbuf_dbs, 0, mem_size); + memset(dev->dbbuf_eis, 0, mem_size); return 0; + } dev->dbbuf_dbs = dma_alloc_coherent(dev->dev, mem_size, &dev->dbbuf_dbs_dma_addr, @@ -1414,7 +1421,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq) nvmeq->dev->online_queues--; if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q) - blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q); + nvme_stop_admin_queue(&nvmeq->dev->ctrl); if (!test_and_clear_bit(NVMEQ_POLLED, &nvmeq->flags)) pci_free_irq(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector, nvmeq); return 0; @@ -1673,7 +1680,7 @@ static void nvme_dev_remove_admin(struct nvme_dev *dev) * user requests may be waiting on a stopped queue. Start the * queue to flush these to completion. */ - blk_mq_unquiesce_queue(dev->ctrl.admin_q); + nvme_start_admin_queue(&dev->ctrl); blk_cleanup_queue(dev->ctrl.admin_q); blk_mq_free_tag_set(&dev->admin_tagset); } @@ -1707,7 +1714,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev) return -ENODEV; } } else - blk_mq_unquiesce_queue(dev->ctrl.admin_q); + nvme_start_admin_queue(&dev->ctrl); return 0; } @@ -2642,7 +2649,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown) if (shutdown) { nvme_start_queues(&dev->ctrl); if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) - blk_mq_unquiesce_queue(dev->ctrl.admin_q); + nvme_start_admin_queue(&dev->ctrl); } mutex_unlock(&dev->shutdown_lock); } diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 1624da3702d4..850f84d204d0 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -919,7 +919,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, else ctrl->ctrl.max_integrity_segments = 0; - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + nvme_start_admin_queue(&ctrl->ctrl); error = nvme_init_ctrl_finish(&ctrl->ctrl); if (error) @@ -928,7 +928,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, return 0; out_quiesce_queue: - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); + nvme_stop_admin_queue(&ctrl->ctrl); blk_sync_queue(ctrl->ctrl.admin_q); out_stop_queue: nvme_rdma_stop_queue(&ctrl->queues[0]); @@ -1026,12 +1026,12 @@ out_free_io_queues: static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl, bool remove) { - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); + nvme_stop_admin_queue(&ctrl->ctrl); blk_sync_queue(ctrl->ctrl.admin_q); nvme_rdma_stop_queue(&ctrl->queues[0]); nvme_cancel_admin_tagset(&ctrl->ctrl); if (remove) - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + nvme_start_admin_queue(&ctrl->ctrl); nvme_rdma_destroy_admin_queue(ctrl, remove); } @@ -1096,11 +1096,13 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new) return ret; if (ctrl->ctrl.icdoff) { + ret = -EOPNOTSUPP; dev_err(ctrl->ctrl.device, "icdoff is not supported!\n"); goto destroy_admin; } if (!(ctrl->ctrl.sgls & (1 << 2))) { + ret = -EOPNOTSUPP; dev_err(ctrl->ctrl.device, "Mandatory keyed sgls are not supported!\n"); goto destroy_admin; @@ -1112,6 +1114,13 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new) ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1); } + if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) { + dev_warn(ctrl->ctrl.device, + "ctrl sqsize %u > max queue size %u, clamping down\n", + ctrl->ctrl.sqsize + 1, NVME_RDMA_MAX_QUEUE_SIZE); + ctrl->ctrl.sqsize = NVME_RDMA_MAX_QUEUE_SIZE - 1; + } + if (ctrl->ctrl.sqsize + 1 > ctrl->ctrl.maxcmd) { dev_warn(ctrl->ctrl.device, "sqsize %u > ctrl maxcmd %u, clamping down\n", @@ -1154,7 +1163,7 @@ destroy_io: nvme_rdma_destroy_io_queues(ctrl, new); } destroy_admin: - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); + nvme_stop_admin_queue(&ctrl->ctrl); blk_sync_queue(ctrl->ctrl.admin_q); nvme_rdma_stop_queue(&ctrl->queues[0]); nvme_cancel_admin_tagset(&ctrl->ctrl); @@ -1194,7 +1203,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work) nvme_rdma_teardown_io_queues(ctrl, false); nvme_start_queues(&ctrl->ctrl); nvme_rdma_teardown_admin_queue(ctrl, false); - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + nvme_start_admin_queue(&ctrl->ctrl); if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) { /* state change failure is ok if we started ctrl delete */ @@ -2232,7 +2241,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown) cancel_delayed_work_sync(&ctrl->reconnect_work); nvme_rdma_teardown_io_queues(ctrl, shutdown); - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); + nvme_stop_admin_queue(&ctrl->ctrl); if (shutdown) nvme_shutdown_ctrl(&ctrl->ctrl); else @@ -2386,7 +2395,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev, goto out_uninit_ctrl; dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n", - ctrl->ctrl.opts->subsysnqn, &ctrl->addr); + nvmf_ctrl_subsysnqn(&ctrl->ctrl), &ctrl->addr); mutex_lock(&nvme_rdma_ctrl_mutex); list_add_tail(&ctrl->list, &nvme_rdma_ctrl_list); diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 9ce3458ee1dd..33bc83d8d992 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -926,12 +926,14 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req) static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) { struct nvme_tcp_queue *queue = req->queue; + int req_data_len = req->data_len; while (true) { struct page *page = nvme_tcp_req_cur_page(req); size_t offset = nvme_tcp_req_cur_offset(req); size_t len = nvme_tcp_req_cur_length(req); bool last = nvme_tcp_pdu_last_send(req, len); + int req_data_sent = req->data_sent; int ret, flags = MSG_DONTWAIT; if (last && !queue->data_digest && !nvme_tcp_queue_more(queue)) @@ -958,7 +960,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req) * in the request where we don't want to modify it as we may * compete with the RX path completing the request. */ - if (req->data_sent + ret < req->data_len) + if (req_data_sent + ret < req_data_len) nvme_tcp_advance_req(req, ret); /* fully successful last send in current PDU */ @@ -1048,10 +1050,11 @@ static int nvme_tcp_try_send_data_pdu(struct nvme_tcp_request *req) static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req) { struct nvme_tcp_queue *queue = req->queue; + size_t offset = req->offset; int ret; struct msghdr msg = { .msg_flags = MSG_DONTWAIT }; struct kvec iov = { - .iov_base = &req->ddgst + req->offset, + .iov_base = (u8 *)&req->ddgst + req->offset, .iov_len = NVME_TCP_DIGEST_LENGTH - req->offset }; @@ -1064,7 +1067,7 @@ static int nvme_tcp_try_send_ddgst(struct nvme_tcp_request *req) if (unlikely(ret <= 0)) return ret; - if (req->offset + ret == NVME_TCP_DIGEST_LENGTH) { + if (offset + ret == NVME_TCP_DIGEST_LENGTH) { nvme_tcp_done_send_req(queue); return 1; } @@ -1915,7 +1918,7 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new) if (error) goto out_stop_queue; - blk_mq_unquiesce_queue(ctrl->admin_q); + nvme_start_admin_queue(ctrl); error = nvme_init_ctrl_finish(ctrl); if (error) @@ -1924,7 +1927,7 @@ static int nvme_tcp_configure_admin_queue(struct nvme_ctrl *ctrl, bool new) return 0; out_quiesce_queue: - blk_mq_quiesce_queue(ctrl->admin_q); + nvme_stop_admin_queue(ctrl); blk_sync_queue(ctrl->admin_q); out_stop_queue: nvme_tcp_stop_queue(ctrl, 0); @@ -1946,12 +1949,12 @@ out_free_queue: static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl, bool remove) { - blk_mq_quiesce_queue(ctrl->admin_q); + nvme_stop_admin_queue(ctrl); blk_sync_queue(ctrl->admin_q); nvme_tcp_stop_queue(ctrl, 0); nvme_cancel_admin_tagset(ctrl); if (remove) - blk_mq_unquiesce_queue(ctrl->admin_q); + nvme_start_admin_queue(ctrl); nvme_tcp_destroy_admin_queue(ctrl, remove); } @@ -1960,7 +1963,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl, { if (ctrl->queue_count <= 1) return; - blk_mq_quiesce_queue(ctrl->admin_q); + nvme_stop_admin_queue(ctrl); nvme_start_freeze(ctrl); nvme_stop_queues(ctrl); nvme_sync_io_queues(ctrl); @@ -2055,7 +2058,7 @@ destroy_io: nvme_tcp_destroy_io_queues(ctrl, new); } destroy_admin: - blk_mq_quiesce_queue(ctrl->admin_q); + nvme_stop_admin_queue(ctrl); blk_sync_queue(ctrl->admin_q); nvme_tcp_stop_queue(ctrl, 0); nvme_cancel_admin_tagset(ctrl); @@ -2098,7 +2101,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work) /* unquiesce to fail fast pending requests */ nvme_start_queues(ctrl); nvme_tcp_teardown_admin_queue(ctrl, false); - blk_mq_unquiesce_queue(ctrl->admin_q); + nvme_start_admin_queue(ctrl); if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) { /* state change failure is ok if we started ctrl delete */ @@ -2116,7 +2119,7 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown) cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work); nvme_tcp_teardown_io_queues(ctrl, shutdown); - blk_mq_quiesce_queue(ctrl->admin_q); + nvme_stop_admin_queue(ctrl); if (shutdown) nvme_shutdown_ctrl(ctrl); else @@ -2582,7 +2585,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev, goto out_uninit_ctrl; dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n", - ctrl->ctrl.opts->subsysnqn, &ctrl->addr); + nvmf_ctrl_subsysnqn(&ctrl->ctrl), &ctrl->addr); mutex_lock(&nvme_tcp_ctrl_mutex); list_add_tail(&ctrl->list, &nvme_tcp_ctrl_list); diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c index d95010481fce..bfc259e0d7b8 100644 --- a/drivers/nvme/host/zns.c +++ b/drivers/nvme/host/zns.c @@ -233,6 +233,8 @@ out_free: blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req, struct nvme_command *c, enum nvme_zone_mgmt_action action) { + memset(c, 0, sizeof(*c)); + c->zms.opcode = nvme_cmd_zone_mgmt_send; c->zms.nsid = cpu_to_le32(ns->head->ns_id); c->zms.slba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req))); diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index aa6d84d8848e..6fb24746de06 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -264,7 +264,7 @@ static u32 nvmet_format_ana_group(struct nvmet_req *req, u32 grpid, desc->chgcnt = cpu_to_le64(nvmet_ana_chgcnt); desc->state = req->port->ana_state[grpid]; memset(desc->rsvd17, 0, sizeof(desc->rsvd17)); - return sizeof(struct nvme_ana_group_desc) + count * sizeof(__le32); + return struct_size(desc, nsids, count); } static void nvmet_execute_get_log_page_ana(struct nvmet_req *req) @@ -278,8 +278,8 @@ static void nvmet_execute_get_log_page_ana(struct nvmet_req *req) u16 status; status = NVME_SC_INTERNAL; - desc = kmalloc(sizeof(struct nvme_ana_group_desc) + - NVMET_MAX_NAMESPACES * sizeof(__le32), GFP_KERNEL); + desc = kmalloc(struct_size(desc, nsids, NVMET_MAX_NAMESPACES), + GFP_KERNEL); if (!desc) goto out; @@ -374,13 +374,19 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req) id->rab = 6; + if (nvmet_is_disc_subsys(ctrl->subsys)) + id->cntrltype = NVME_CTRL_DISC; + else + id->cntrltype = NVME_CTRL_IO; + /* * XXX: figure out how we can assign a IEEE OUI, but until then * the safest is to leave it as zeroes. */ /* we support multiple ports, multiples hosts and ANA: */ - id->cmic = (1 << 0) | (1 << 1) | (1 << 3); + id->cmic = NVME_CTRL_CMIC_MULTI_PORT | NVME_CTRL_CMIC_MULTI_CTRL | + NVME_CTRL_CMIC_ANA; /* Limit MDTS according to transport capability */ if (ctrl->ops->get_mdts) @@ -536,7 +542,7 @@ static void nvmet_execute_identify_ns(struct nvmet_req *req) * Our namespace might always be shared. Not just with other * controllers, but also with any other user of the block device. */ - id->nmic = (1 << 0); + id->nmic = NVME_NS_NMIC_SHARED; id->anagrpid = cpu_to_le32(req->ns->anagrpid); memcpy(&id->nguid, &req->ns->nguid, sizeof(id->nguid)); @@ -1008,7 +1014,7 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req) if (nvme_is_fabrics(cmd)) return nvmet_parse_fabrics_cmd(req); - if (nvmet_req_subsys(req)->type == NVME_NQN_DISC) + if (nvmet_is_disc_subsys(nvmet_req_subsys(req))) return nvmet_parse_discovery_cmd(req); ret = nvmet_check_ctrl_status(req); diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index be5d82421e3a..091a0ca16361 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -1233,6 +1233,44 @@ static ssize_t nvmet_subsys_attr_model_store(struct config_item *item, } CONFIGFS_ATTR(nvmet_subsys_, attr_model); +static ssize_t nvmet_subsys_attr_discovery_nqn_show(struct config_item *item, + char *page) +{ + return snprintf(page, PAGE_SIZE, "%s\n", + nvmet_disc_subsys->subsysnqn); +} + +static ssize_t nvmet_subsys_attr_discovery_nqn_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_subsys *subsys = to_subsys(item); + char *subsysnqn; + int len; + + len = strcspn(page, "\n"); + if (!len) + return -EINVAL; + + subsysnqn = kmemdup_nul(page, len, GFP_KERNEL); + if (!subsysnqn) + return -ENOMEM; + + /* + * The discovery NQN must be different from subsystem NQN. + */ + if (!strcmp(subsysnqn, subsys->subsysnqn)) { + kfree(subsysnqn); + return -EBUSY; + } + down_write(&nvmet_config_sem); + kfree(nvmet_disc_subsys->subsysnqn); + nvmet_disc_subsys->subsysnqn = subsysnqn; + up_write(&nvmet_config_sem); + + return count; +} +CONFIGFS_ATTR(nvmet_subsys_, attr_discovery_nqn); + #ifdef CONFIG_BLK_DEV_INTEGRITY static ssize_t nvmet_subsys_attr_pi_enable_show(struct config_item *item, char *page) @@ -1262,6 +1300,7 @@ static struct configfs_attribute *nvmet_subsys_attrs[] = { &nvmet_subsys_attr_attr_cntlid_min, &nvmet_subsys_attr_attr_cntlid_max, &nvmet_subsys_attr_attr_model, + &nvmet_subsys_attr_attr_discovery_nqn, #ifdef CONFIG_BLK_DEV_INTEGRITY &nvmet_subsys_attr_attr_pi_enable, #endif @@ -1553,6 +1592,8 @@ static void nvmet_port_release(struct config_item *item) { struct nvmet_port *port = to_nvmet_port(item); + /* Let inflight controllers teardown complete */ + flush_scheduled_work(); list_del(&port->global_entry); kfree(port->ana_state); diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index b8425fa34300..5119c687de68 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -1140,7 +1140,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl) * should verify iosqes,iocqes are zeroed, however that * would break backwards compatibility, so don't enforce it. */ - if (ctrl->subsys->type != NVME_NQN_DISC && + if (!nvmet_is_disc_subsys(ctrl->subsys) && (nvmet_cc_iosqes(ctrl->cc) != NVME_NVM_IOSQES || nvmet_cc_iocqes(ctrl->cc) != NVME_NVM_IOCQES)) { ctrl->csts = NVME_CSTS_CFS; @@ -1205,7 +1205,10 @@ static void nvmet_init_cap(struct nvmet_ctrl *ctrl) /* CC.EN timeout in 500msec units: */ ctrl->cap |= (15ULL << 24); /* maximum queue entries supported: */ - ctrl->cap |= NVMET_QUEUE_SIZE - 1; + if (ctrl->ops->get_max_queue_size) + ctrl->cap |= ctrl->ops->get_max_queue_size(ctrl) - 1; + else + ctrl->cap |= NVMET_QUEUE_SIZE - 1; if (nvmet_is_passthru_subsys(ctrl->subsys)) nvmet_passthrough_override_cap(ctrl); @@ -1278,7 +1281,7 @@ bool nvmet_host_allowed(struct nvmet_subsys *subsys, const char *hostnqn) if (subsys->allow_any_host) return true; - if (subsys->type == NVME_NQN_DISC) /* allow all access to disc subsys */ + if (nvmet_is_disc_subsys(subsys)) /* allow all access to disc subsys */ return true; list_for_each_entry(p, &subsys->hosts, entry) { @@ -1367,6 +1370,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, mutex_init(&ctrl->lock); ctrl->port = req->port; + ctrl->ops = req->ops; INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work); INIT_LIST_HEAD(&ctrl->async_events); @@ -1405,13 +1409,11 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, } ctrl->cntlid = ret; - ctrl->ops = req->ops; - /* * Discovery controllers may use some arbitrary high value * in order to cleanup stale discovery sessions */ - if ((ctrl->subsys->type == NVME_NQN_DISC) && !kato) + if (nvmet_is_disc_subsys(ctrl->subsys) && !kato) kato = NVMET_DISC_KATO_MS; /* keep-alive timeout in seconds */ @@ -1491,7 +1493,8 @@ static struct nvmet_subsys *nvmet_find_get_subsys(struct nvmet_port *port, if (!port) return NULL; - if (!strcmp(NVME_DISC_SUBSYS_NAME, subsysnqn)) { + if (!strcmp(NVME_DISC_SUBSYS_NAME, subsysnqn) || + !strcmp(nvmet_disc_subsys->subsysnqn, subsysnqn)) { if (!kref_get_unless_zero(&nvmet_disc_subsys->ref)) return NULL; return nvmet_disc_subsys; @@ -1538,6 +1541,7 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, subsys->max_qid = NVMET_NR_QUEUES; break; case NVME_NQN_DISC: + case NVME_NQN_CURR: subsys->max_qid = 0; break; default: diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index 7aa62bc6ae84..c2162eef8ce1 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -146,7 +146,7 @@ static size_t discovery_log_entries(struct nvmet_req *req) struct nvmet_ctrl *ctrl = req->sq->ctrl; struct nvmet_subsys_link *p; struct nvmet_port *r; - size_t entries = 0; + size_t entries = 1; list_for_each_entry(p, &req->port->subsystems, entry) { if (!nvmet_host_allowed(p->subsys, ctrl->hostnqn)) @@ -171,6 +171,7 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req) u32 numrec = 0; u16 status = 0; void *buffer; + char traddr[NVMF_TRADDR_SIZE]; if (!nvmet_check_transfer_len(req, data_len)) return; @@ -203,15 +204,19 @@ static void nvmet_execute_disc_get_log_page(struct nvmet_req *req) status = NVME_SC_INTERNAL; goto out; } - hdr = buffer; - list_for_each_entry(p, &req->port->subsystems, entry) { - char traddr[NVMF_TRADDR_SIZE]; + nvmet_set_disc_traddr(req, req->port, traddr); + + nvmet_format_discovery_entry(hdr, req->port, + nvmet_disc_subsys->subsysnqn, + traddr, NVME_NQN_CURR, numrec); + numrec++; + + list_for_each_entry(p, &req->port->subsystems, entry) { if (!nvmet_host_allowed(p->subsys, ctrl->hostnqn)) continue; - nvmet_set_disc_traddr(req, req->port, traddr); nvmet_format_discovery_entry(hdr, req->port, p->subsys->subsysnqn, traddr, NVME_NQN_NVME, numrec); @@ -268,6 +273,8 @@ static void nvmet_execute_disc_identify(struct nvmet_req *req) memcpy_and_pad(id->fr, sizeof(id->fr), UTS_RELEASE, strlen(UTS_RELEASE), ' '); + id->cntrltype = NVME_CTRL_DISC; + /* no limit on data transfer sizes for now */ id->mdts = 0; id->cntlid = cpu_to_le16(ctrl->cntlid); @@ -387,7 +394,7 @@ u16 nvmet_parse_discovery_cmd(struct nvmet_req *req) int __init nvmet_init_discovery(void) { nvmet_disc_subsys = - nvmet_subsys_alloc(NVME_DISC_SUBSYS_NAME, NVME_NQN_DISC); + nvmet_subsys_alloc(NVME_DISC_SUBSYS_NAME, NVME_NQN_CURR); return PTR_ERR_OR_ZERO(nvmet_disc_subsys); } diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c index 7d0454cee920..70fb587e9413 100644 --- a/drivers/nvme/target/fabrics-cmd.c +++ b/drivers/nvme/target/fabrics-cmd.c @@ -221,7 +221,8 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req) goto out; } - pr_info("creating controller %d for subsystem %s for NQN %s%s.\n", + pr_info("creating %s controller %d for subsystem %s for NQN %s%s.\n", + nvmet_is_disc_subsys(ctrl->subsys) ? "discovery" : "nvm", ctrl->cntlid, ctrl->subsys->subsysnqn, ctrl->hostnqn, ctrl->pi_support ? " T10-PI is enabled" : ""); req->cqe->result.u16 = cpu_to_le16(ctrl->cntlid); diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c index 0285ccc7541f..eb1094254c82 100644 --- a/drivers/nvme/target/loop.c +++ b/drivers/nvme/target/loop.c @@ -384,6 +384,8 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl) error = PTR_ERR(ctrl->ctrl.admin_q); goto out_cleanup_fabrics_q; } + /* reset stopped state for the fresh admin queue */ + clear_bit(NVME_CTRL_ADMIN_Q_STOPPED, &ctrl->ctrl.flags); error = nvmf_connect_admin_queue(&ctrl->ctrl); if (error) @@ -398,7 +400,7 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl) ctrl->ctrl.max_hw_sectors = (NVME_LOOP_MAX_SEGMENTS - 1) << (PAGE_SHIFT - 9); - blk_mq_unquiesce_queue(ctrl->ctrl.admin_q); + nvme_start_admin_queue(&ctrl->ctrl); error = nvme_init_ctrl_finish(&ctrl->ctrl); if (error) @@ -428,7 +430,7 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl) nvme_loop_destroy_io_queues(ctrl); } - blk_mq_quiesce_queue(ctrl->ctrl.admin_q); + nvme_stop_admin_queue(&ctrl->ctrl); if (ctrl->ctrl.state == NVME_CTRL_LIVE) nvme_shutdown_ctrl(&ctrl->ctrl); diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 7143c7fa7464..af193423c10b 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -309,6 +309,7 @@ struct nvmet_fabrics_ops { u16 (*install_queue)(struct nvmet_sq *nvme_sq); void (*discovery_chg)(struct nvmet_port *port); u8 (*get_mdts)(const struct nvmet_ctrl *ctrl); + u16 (*get_max_queue_size)(const struct nvmet_ctrl *ctrl); }; #define NVMET_MAX_INLINE_BIOVEC 8 @@ -576,6 +577,11 @@ static inline struct nvmet_subsys *nvmet_req_subsys(struct nvmet_req *req) return req->sq->ctrl->subsys; } +static inline bool nvmet_is_disc_subsys(struct nvmet_subsys *subsys) +{ + return subsys->type != NVME_NQN_NVME; +} + #ifdef CONFIG_NVME_TARGET_PASSTHRU void nvmet_passthru_subsys_free(struct nvmet_subsys *subsys); int nvmet_passthru_ctrl_enable(struct nvmet_subsys *subsys); diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index 38d1f292ecc2..1deb4043e242 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -1819,12 +1819,36 @@ restart: mutex_unlock(&nvmet_rdma_queue_mutex); } +static void nvmet_rdma_destroy_port_queues(struct nvmet_rdma_port *port) +{ + struct nvmet_rdma_queue *queue, *tmp; + struct nvmet_port *nport = port->nport; + + mutex_lock(&nvmet_rdma_queue_mutex); + list_for_each_entry_safe(queue, tmp, &nvmet_rdma_queue_list, + queue_list) { + if (queue->port != nport) + continue; + + list_del_init(&queue->queue_list); + __nvmet_rdma_queue_disconnect(queue); + } + mutex_unlock(&nvmet_rdma_queue_mutex); +} + static void nvmet_rdma_disable_port(struct nvmet_rdma_port *port) { struct rdma_cm_id *cm_id = xchg(&port->cm_id, NULL); if (cm_id) rdma_destroy_id(cm_id); + + /* + * Destroy the remaining queues, which are not belong to any + * controller yet. Do it here after the RDMA-CM was destroyed + * guarantees that no new queue will be created. + */ + nvmet_rdma_destroy_port_queues(port); } static int nvmet_rdma_enable_port(struct nvmet_rdma_port *port) @@ -1976,6 +2000,11 @@ static u8 nvmet_rdma_get_mdts(const struct nvmet_ctrl *ctrl) return NVMET_RDMA_MAX_MDTS; } +static u16 nvmet_rdma_get_max_queue_size(const struct nvmet_ctrl *ctrl) +{ + return NVME_RDMA_MAX_QUEUE_SIZE; +} + static const struct nvmet_fabrics_ops nvmet_rdma_ops = { .owner = THIS_MODULE, .type = NVMF_TRTYPE_RDMA, @@ -1987,6 +2016,7 @@ static const struct nvmet_fabrics_ops nvmet_rdma_ops = { .delete_ctrl = nvmet_rdma_delete_ctrl, .disc_traddr = nvmet_rdma_disc_port_addr, .get_mdts = nvmet_rdma_get_mdts, + .get_max_queue_size = nvmet_rdma_get_max_queue_size, }; static void nvmet_rdma_remove_one(struct ib_device *ib_device, void *client_data) diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c index 07ee347ea3f3..84c387e4bf43 100644 --- a/drivers/nvme/target/tcp.c +++ b/drivers/nvme/target/tcp.c @@ -702,7 +702,7 @@ static int nvmet_try_send_ddgst(struct nvmet_tcp_cmd *cmd, bool last_in_batch) struct nvmet_tcp_queue *queue = cmd->queue; struct msghdr msg = { .msg_flags = MSG_DONTWAIT }; struct kvec iov = { - .iov_base = &cmd->exp_ddgst + cmd->offset, + .iov_base = (u8 *)&cmd->exp_ddgst + cmd->offset, .iov_len = NVME_TCP_DIGEST_LENGTH - cmd->offset }; int ret; @@ -1096,7 +1096,7 @@ recv: } if (queue->hdr_digest && - nvmet_tcp_verify_hdgst(queue, &queue->pdu, queue->offset)) { + nvmet_tcp_verify_hdgst(queue, &queue->pdu, hdr->hlen)) { nvmet_tcp_fatal_error(queue); /* fatal */ return -EPROTO; } @@ -1428,6 +1428,7 @@ static void nvmet_tcp_uninit_data_in_cmds(struct nvmet_tcp_queue *queue) static void nvmet_tcp_release_queue_work(struct work_struct *w) { + struct page *page; struct nvmet_tcp_queue *queue = container_of(w, struct nvmet_tcp_queue, release_work); @@ -1447,6 +1448,8 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w) nvmet_tcp_free_crypto(queue); ida_simple_remove(&nvmet_tcp_queue_ida, queue->idx); + page = virt_to_head_page(queue->pf_cache.va); + __page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias); kfree(queue); } @@ -1737,6 +1740,17 @@ err_port: return ret; } +static void nvmet_tcp_destroy_port_queues(struct nvmet_tcp_port *port) +{ + struct nvmet_tcp_queue *queue; + + mutex_lock(&nvmet_tcp_queue_mutex); + list_for_each_entry(queue, &nvmet_tcp_queue_list, queue_list) + if (queue->port == port) + kernel_sock_shutdown(queue->sock, SHUT_RDWR); + mutex_unlock(&nvmet_tcp_queue_mutex); +} + static void nvmet_tcp_remove_port(struct nvmet_port *nport) { struct nvmet_tcp_port *port = nport->priv; @@ -1746,6 +1760,11 @@ static void nvmet_tcp_remove_port(struct nvmet_port *nport) port->sock->sk->sk_user_data = NULL; write_unlock_bh(&port->sock->sk->sk_callback_lock); cancel_work_sync(&port->accept_work); + /* + * Destroy the remaining queues, which are not belong to any + * controller yet. + */ + nvmet_tcp_destroy_port_queues(port); sock_release(port->sock); kfree(port); diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c index 59c1390cdf42..9da8835ba5a5 100644 --- a/drivers/of/of_reserved_mem.c +++ b/drivers/of/of_reserved_mem.c @@ -21,6 +21,7 @@ #include <linux/sort.h> #include <linux/slab.h> #include <linux/memblock.h> +#include <linux/kmemleak.h> #include "of_private.h" @@ -46,6 +47,7 @@ static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size, err = memblock_mark_nomap(base, size); if (err) memblock_free(base, size); + kmemleak_ignore_phys(base); } return err; diff --git a/drivers/pinctrl/bcm/pinctrl-ns.c b/drivers/pinctrl/bcm/pinctrl-ns.c index e79690bd8b85..d7f8175d2c1c 100644 --- a/drivers/pinctrl/bcm/pinctrl-ns.c +++ b/drivers/pinctrl/bcm/pinctrl-ns.c @@ -5,7 +5,6 @@ #include <linux/err.h> #include <linux/io.h> -#include <linux/mfd/syscon.h> #include <linux/module.h> #include <linux/of.h> #include <linux/of_device.h> @@ -13,7 +12,6 @@ #include <linux/pinctrl/pinctrl.h> #include <linux/pinctrl/pinmux.h> #include <linux/platform_device.h> -#include <linux/regmap.h> #include <linux/slab.h> #define FLAG_BCM4708 BIT(1) @@ -24,8 +22,7 @@ struct ns_pinctrl { struct device *dev; unsigned int chipset_flag; struct pinctrl_dev *pctldev; - struct regmap *regmap; - u32 offset; + void __iomem *base; struct pinctrl_desc pctldesc; struct ns_pinctrl_group *groups; @@ -232,9 +229,9 @@ static int ns_pinctrl_set_mux(struct pinctrl_dev *pctrl_dev, unset |= BIT(pin_number); } - regmap_read(ns_pinctrl->regmap, ns_pinctrl->offset, &tmp); + tmp = readl(ns_pinctrl->base); tmp &= ~unset; - regmap_write(ns_pinctrl->regmap, ns_pinctrl->offset, tmp); + writel(tmp, ns_pinctrl->base); return 0; } @@ -266,13 +263,13 @@ static const struct of_device_id ns_pinctrl_of_match_table[] = { static int ns_pinctrl_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; - struct device_node *np = dev->of_node; const struct of_device_id *of_id; struct ns_pinctrl *ns_pinctrl; struct pinctrl_desc *pctldesc; struct pinctrl_pin_desc *pin; struct ns_pinctrl_group *group; struct ns_pinctrl_function *function; + struct resource *res; int i; ns_pinctrl = devm_kzalloc(dev, sizeof(*ns_pinctrl), GFP_KERNEL); @@ -290,18 +287,12 @@ static int ns_pinctrl_probe(struct platform_device *pdev) return -EINVAL; ns_pinctrl->chipset_flag = (uintptr_t)of_id->data; - ns_pinctrl->regmap = syscon_node_to_regmap(of_get_parent(np)); - if (IS_ERR(ns_pinctrl->regmap)) { - int err = PTR_ERR(ns_pinctrl->regmap); - - dev_err(dev, "Failed to map pinctrl regs: %d\n", err); - - return err; - } - - if (of_property_read_u32(np, "offset", &ns_pinctrl->offset)) { - dev_err(dev, "Failed to get register offset\n"); - return -ENOENT; + res = platform_get_resource_byname(pdev, IORESOURCE_MEM, + "cru_gpio_control"); + ns_pinctrl->base = devm_ioremap_resource(dev, res); + if (IS_ERR(ns_pinctrl->base)) { + dev_err(dev, "Failed to map pinctrl regs\n"); + return PTR_ERR(ns_pinctrl->base); } memcpy(pctldesc, &ns_pinctrl_desc, sizeof(*pctldesc)); diff --git a/drivers/pinctrl/pinctrl-amd.c b/drivers/pinctrl/pinctrl-amd.c index 8d0f88e9ca88..bae9d429b813 100644 --- a/drivers/pinctrl/pinctrl-amd.c +++ b/drivers/pinctrl/pinctrl-amd.c @@ -840,6 +840,34 @@ static const struct pinconf_ops amd_pinconf_ops = { .pin_config_group_set = amd_pinconf_group_set, }; +static void amd_gpio_irq_init(struct amd_gpio *gpio_dev) +{ + struct pinctrl_desc *desc = gpio_dev->pctrl->desc; + unsigned long flags; + u32 pin_reg, mask; + int i; + + mask = BIT(WAKE_CNTRL_OFF_S0I3) | BIT(WAKE_CNTRL_OFF_S3) | + BIT(INTERRUPT_MASK_OFF) | BIT(INTERRUPT_ENABLE_OFF) | + BIT(WAKE_CNTRL_OFF_S4); + + for (i = 0; i < desc->npins; i++) { + int pin = desc->pins[i].number; + const struct pin_desc *pd = pin_desc_get(gpio_dev->pctrl, pin); + + if (!pd) + continue; + + raw_spin_lock_irqsave(&gpio_dev->lock, flags); + + pin_reg = readl(gpio_dev->base + i * 4); + pin_reg &= ~mask; + writel(pin_reg, gpio_dev->base + i * 4); + + raw_spin_unlock_irqrestore(&gpio_dev->lock, flags); + } +} + #ifdef CONFIG_PM_SLEEP static bool amd_gpio_should_save(struct amd_gpio *gpio_dev, unsigned int pin) { @@ -976,6 +1004,9 @@ static int amd_gpio_probe(struct platform_device *pdev) return PTR_ERR(gpio_dev->pctrl); } + /* Disable and mask interrupts */ + amd_gpio_irq_init(gpio_dev); + girq = &gpio_dev->gc.irq; girq->chip = &amd_gpio_irqchip; /* This will let us handle the parent IRQ in the driver */ diff --git a/drivers/pinctrl/stm32/pinctrl-stm32.c b/drivers/pinctrl/stm32/pinctrl-stm32.c index 68b3886f9f0f..dfd8888a222a 100644 --- a/drivers/pinctrl/stm32/pinctrl-stm32.c +++ b/drivers/pinctrl/stm32/pinctrl-stm32.c @@ -1644,8 +1644,8 @@ int __maybe_unused stm32_pinctrl_resume(struct device *dev) struct stm32_pinctrl_group *g = pctl->groups; int i; - for (i = g->pin; i < g->pin + pctl->ngroups; i++) - stm32_pinctrl_restore_gpio_regs(pctl, i); + for (i = 0; i < pctl->ngroups; i++, g++) + stm32_pinctrl_restore_gpio_regs(pctl, g->pin); return 0; } diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c index 4dfc52e06704..f9b2d66b0443 100644 --- a/drivers/ptp/ptp_clock.c +++ b/drivers/ptp/ptp_clock.c @@ -170,6 +170,7 @@ static void ptp_clock_release(struct device *dev) struct ptp_clock *ptp = container_of(dev, struct ptp_clock, dev); ptp_cleanup_pin_groups(ptp); + kfree(ptp->vclock_index); mutex_destroy(&ptp->tsevq_mux); mutex_destroy(&ptp->pincfg_mux); mutex_destroy(&ptp->n_vclocks_mux); @@ -283,15 +284,20 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info, /* Create a posix clock and link it to the device. */ err = posix_clock_register(&ptp->clock, &ptp->dev); if (err) { + if (ptp->pps_source) + pps_unregister_source(ptp->pps_source); + + if (ptp->kworker) + kthread_destroy_worker(ptp->kworker); + + put_device(&ptp->dev); + pr_err("failed to create posix clock\n"); - goto no_clock; + return ERR_PTR(err); } return ptp; -no_clock: - if (ptp->pps_source) - pps_unregister_source(ptp->pps_source); no_pps: ptp_cleanup_pin_groups(ptp); no_pin_groups: @@ -321,8 +327,6 @@ int ptp_clock_unregister(struct ptp_clock *ptp) ptp->defunct = 1; wake_up_interruptible(&ptp->tsev_wq); - kfree(ptp->vclock_index); - if (ptp->kworker) { kthread_cancel_delayed_work_sync(&ptp->aux_work); kthread_destroy_worker(ptp->kworker); diff --git a/drivers/ptp/ptp_kvm_x86.c b/drivers/ptp/ptp_kvm_x86.c index d0096cd7096a..4991054a2135 100644 --- a/drivers/ptp/ptp_kvm_x86.c +++ b/drivers/ptp/ptp_kvm_x86.c @@ -31,10 +31,10 @@ int kvm_arch_ptp_init(void) ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING, clock_pair_gpa, KVM_CLOCK_PAIRING_WALLCLOCK); - if (ret == -KVM_ENOSYS || ret == -KVM_EOPNOTSUPP) + if (ret == -KVM_ENOSYS) return -ENODEV; - return 0; + return ret; } int kvm_arch_ptp_get_clock(struct timespec64 *ts) diff --git a/drivers/reset/Kconfig b/drivers/reset/Kconfig index be799a5abf8a..b0056ae5d463 100644 --- a/drivers/reset/Kconfig +++ b/drivers/reset/Kconfig @@ -147,8 +147,8 @@ config RESET_OXNAS bool config RESET_PISTACHIO - bool "Pistachio Reset Driver" if COMPILE_TEST - default MACH_PISTACHIO + bool "Pistachio Reset Driver" + depends on MIPS || COMPILE_TEST help This enables the reset driver for ImgTec Pistachio SoCs. diff --git a/drivers/reset/reset-brcmstb-rescal.c b/drivers/reset/reset-brcmstb-rescal.c index b6f074d6a65f..433fa0c40e47 100644 --- a/drivers/reset/reset-brcmstb-rescal.c +++ b/drivers/reset/reset-brcmstb-rescal.c @@ -38,7 +38,7 @@ static int brcm_rescal_reset_set(struct reset_controller_dev *rcdev, } ret = readl_poll_timeout(base + BRCM_RESCAL_STATUS, reg, - !(reg & BRCM_RESCAL_STATUS_BIT), 100, 1000); + (reg & BRCM_RESCAL_STATUS_BIT), 100, 1000); if (ret) { dev_err(data->dev, "time out on SATA/PCIe rescal\n"); return ret; diff --git a/drivers/reset/reset-socfpga.c b/drivers/reset/reset-socfpga.c index 2a72f861f798..8c6492e5693c 100644 --- a/drivers/reset/reset-socfpga.c +++ b/drivers/reset/reset-socfpga.c @@ -92,3 +92,29 @@ void __init socfpga_reset_init(void) for_each_matching_node(np, socfpga_early_reset_dt_ids) a10_reset_init(np); } + +/* + * The early driver is problematic, because it doesn't register + * itself as a driver. This causes certain device links to prevent + * consumer devices from probing. The hacky solution is to register + * an empty driver, whose only job is to attach itself to the reset + * manager and call probe. + */ +static const struct of_device_id socfpga_reset_dt_ids[] = { + { .compatible = "altr,rst-mgr", }, + { /* sentinel */ }, +}; + +static int reset_simple_probe(struct platform_device *pdev) +{ + return 0; +} + +static struct platform_driver reset_socfpga_driver = { + .probe = reset_simple_probe, + .driver = { + .name = "socfpga-reset", + .of_match_table = socfpga_reset_dt_ids, + }, +}; +builtin_platform_driver(reset_socfpga_driver); diff --git a/drivers/reset/tegra/reset-bpmp.c b/drivers/reset/tegra/reset-bpmp.c index 24d3395964cc..4c5bba52b105 100644 --- a/drivers/reset/tegra/reset-bpmp.c +++ b/drivers/reset/tegra/reset-bpmp.c @@ -20,6 +20,7 @@ static int tegra_bpmp_reset_common(struct reset_controller_dev *rstc, struct tegra_bpmp *bpmp = to_tegra_bpmp(rstc); struct mrq_reset_request request; struct tegra_bpmp_message msg; + int err; memset(&request, 0, sizeof(request)); request.cmd = command; @@ -30,7 +31,13 @@ static int tegra_bpmp_reset_common(struct reset_controller_dev *rstc, msg.tx.data = &request; msg.tx.size = sizeof(request); - return tegra_bpmp_transfer(bpmp, &msg); + err = tegra_bpmp_transfer(bpmp, &msg); + if (err) + return err; + if (msg.rx.ret) + return -EINVAL; + + return 0; } static int tegra_bpmp_reset_module(struct reset_controller_dev *rstc, diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c index e34c6cc61983..8e87a31e329d 100644 --- a/drivers/s390/block/dasd.c +++ b/drivers/s390/block/dasd.c @@ -2077,12 +2077,15 @@ static void __dasd_device_check_path_events(struct dasd_device *device) if (device->stopped & ~(DASD_STOPPED_DC_WAIT)) return; + + dasd_path_clear_all_verify(device); + dasd_path_clear_all_fcsec(device); + rc = device->discipline->pe_handler(device, tbvpm, fcsecpm); if (rc) { + dasd_path_add_tbvpm(device, tbvpm); + dasd_path_add_fcsecpm(device, fcsecpm); dasd_device_set_timer(device, 50); - } else { - dasd_path_clear_all_verify(device); - dasd_path_clear_all_fcsec(device); } }; diff --git a/drivers/s390/block/dasd_3990_erp.c b/drivers/s390/block/dasd_3990_erp.c index 4691a3c35d72..299001ad9a32 100644 --- a/drivers/s390/block/dasd_3990_erp.c +++ b/drivers/s390/block/dasd_3990_erp.c @@ -201,7 +201,7 @@ dasd_3990_erp_DCTL(struct dasd_ccw_req * erp, char modifier) struct ccw1 *ccw; struct dasd_ccw_req *dctl_cqr; - dctl_cqr = dasd_alloc_erp_request((char *) &erp->magic, 1, + dctl_cqr = dasd_alloc_erp_request(erp->magic, 1, sizeof(struct DCTL_data), device); if (IS_ERR(dctl_cqr)) { @@ -1652,7 +1652,7 @@ dasd_3990_erp_action_1B_32(struct dasd_ccw_req * default_erp, char *sense) } /* Build new ERP request including DE/LO */ - erp = dasd_alloc_erp_request((char *) &cqr->magic, + erp = dasd_alloc_erp_request(cqr->magic, 2 + 1,/* DE/LO + TIC */ sizeof(struct DE_eckd_data) + sizeof(struct LO_eckd_data), device); @@ -2388,7 +2388,7 @@ static struct dasd_ccw_req *dasd_3990_erp_add_erp(struct dasd_ccw_req *cqr) } /* allocate additional request block */ - erp = dasd_alloc_erp_request((char *) &cqr->magic, + erp = dasd_alloc_erp_request(cqr->magic, cplength, datasize, device); if (IS_ERR(erp)) { if (cqr->retries <= 0) { diff --git a/drivers/s390/block/dasd_eckd.c b/drivers/s390/block/dasd_eckd.c index 460e0f1cca53..8410a25a65c1 100644 --- a/drivers/s390/block/dasd_eckd.c +++ b/drivers/s390/block/dasd_eckd.c @@ -560,8 +560,8 @@ static int prefix_LRE(struct ccw1 *ccw, struct PFX_eckd_data *pfxdata, return -EINVAL; } pfxdata->format = format; - pfxdata->base_address = basepriv->ned->unit_addr; - pfxdata->base_lss = basepriv->ned->ID; + pfxdata->base_address = basepriv->conf.ned->unit_addr; + pfxdata->base_lss = basepriv->conf.ned->ID; pfxdata->validity.define_extent = 1; /* private uid is kept up to date, conf_data may be outdated */ @@ -736,32 +736,30 @@ dasd_eckd_cdl_reclen(int recid) return LABEL_SIZE; } /* create unique id from private structure. */ -static void create_uid(struct dasd_eckd_private *private) +static void create_uid(struct dasd_conf *conf, struct dasd_uid *uid) { int count; - struct dasd_uid *uid; - uid = &private->uid; memset(uid, 0, sizeof(struct dasd_uid)); - memcpy(uid->vendor, private->ned->HDA_manufacturer, + memcpy(uid->vendor, conf->ned->HDA_manufacturer, sizeof(uid->vendor) - 1); EBCASC(uid->vendor, sizeof(uid->vendor) - 1); - memcpy(uid->serial, &private->ned->serial, + memcpy(uid->serial, &conf->ned->serial, sizeof(uid->serial) - 1); EBCASC(uid->serial, sizeof(uid->serial) - 1); - uid->ssid = private->gneq->subsystemID; - uid->real_unit_addr = private->ned->unit_addr; - if (private->sneq) { - uid->type = private->sneq->sua_flags; + uid->ssid = conf->gneq->subsystemID; + uid->real_unit_addr = conf->ned->unit_addr; + if (conf->sneq) { + uid->type = conf->sneq->sua_flags; if (uid->type == UA_BASE_PAV_ALIAS) - uid->base_unit_addr = private->sneq->base_unit_addr; + uid->base_unit_addr = conf->sneq->base_unit_addr; } else { uid->type = UA_BASE_DEVICE; } - if (private->vdsneq) { + if (conf->vdsneq) { for (count = 0; count < 16; count++) { sprintf(uid->vduit+2*count, "%02x", - private->vdsneq->uit[count]); + conf->vdsneq->uit[count]); } } } @@ -776,10 +774,10 @@ static int dasd_eckd_generate_uid(struct dasd_device *device) if (!private) return -ENODEV; - if (!private->ned || !private->gneq) + if (!private->conf.ned || !private->conf.gneq) return -ENODEV; spin_lock_irqsave(get_ccwdev_lock(device->cdev), flags); - create_uid(private); + create_uid(&private->conf, &private->uid); spin_unlock_irqrestore(get_ccwdev_lock(device->cdev), flags); return 0; } @@ -803,14 +801,15 @@ static int dasd_eckd_get_uid(struct dasd_device *device, struct dasd_uid *uid) * return 0 for match */ static int dasd_eckd_compare_path_uid(struct dasd_device *device, - struct dasd_eckd_private *private) + struct dasd_conf *path_conf) { struct dasd_uid device_uid; + struct dasd_uid path_uid; - create_uid(private); + create_uid(path_conf, &path_uid); dasd_eckd_get_uid(device, &device_uid); - return memcmp(&device_uid, &private->uid, sizeof(struct dasd_uid)); + return memcmp(&device_uid, &path_uid, sizeof(struct dasd_uid)); } static void dasd_eckd_fill_rcd_cqr(struct dasd_device *device, @@ -946,34 +945,34 @@ out_error: return ret; } -static int dasd_eckd_identify_conf_parts(struct dasd_eckd_private *private) +static int dasd_eckd_identify_conf_parts(struct dasd_conf *conf) { struct dasd_sneq *sneq; int i, count; - private->ned = NULL; - private->sneq = NULL; - private->vdsneq = NULL; - private->gneq = NULL; - count = private->conf_len / sizeof(struct dasd_sneq); - sneq = (struct dasd_sneq *)private->conf_data; + conf->ned = NULL; + conf->sneq = NULL; + conf->vdsneq = NULL; + conf->gneq = NULL; + count = conf->len / sizeof(struct dasd_sneq); + sneq = (struct dasd_sneq *)conf->data; for (i = 0; i < count; ++i) { if (sneq->flags.identifier == 1 && sneq->format == 1) - private->sneq = sneq; + conf->sneq = sneq; else if (sneq->flags.identifier == 1 && sneq->format == 4) - private->vdsneq = (struct vd_sneq *)sneq; + conf->vdsneq = (struct vd_sneq *)sneq; else if (sneq->flags.identifier == 2) - private->gneq = (struct dasd_gneq *)sneq; + conf->gneq = (struct dasd_gneq *)sneq; else if (sneq->flags.identifier == 3 && sneq->res1 == 1) - private->ned = (struct dasd_ned *)sneq; + conf->ned = (struct dasd_ned *)sneq; sneq++; } - if (!private->ned || !private->gneq) { - private->ned = NULL; - private->sneq = NULL; - private->vdsneq = NULL; - private->gneq = NULL; + if (!conf->ned || !conf->gneq) { + conf->ned = NULL; + conf->sneq = NULL; + conf->vdsneq = NULL; + conf->gneq = NULL; return -EINVAL; } return 0; @@ -1016,9 +1015,9 @@ static void dasd_eckd_store_conf_data(struct dasd_device *device, * with the new one if this points to the same data */ cdp = device->path[chp].conf_data; - if (private->conf_data == cdp) { - private->conf_data = (void *)conf_data; - dasd_eckd_identify_conf_parts(private); + if (private->conf.data == cdp) { + private->conf.data = (void *)conf_data; + dasd_eckd_identify_conf_parts(&private->conf); } ccw_device_get_schid(device->cdev, &sch_id); device->path[chp].conf_data = conf_data; @@ -1036,8 +1035,8 @@ static void dasd_eckd_clear_conf_data(struct dasd_device *device) struct dasd_eckd_private *private = device->private; int i; - private->conf_data = NULL; - private->conf_len = 0; + private->conf.data = NULL; + private->conf.len = 0; for (i = 0; i < 8; i++) { kfree(device->path[i].conf_data); device->path[i].conf_data = NULL; @@ -1071,15 +1070,55 @@ static void dasd_eckd_read_fc_security(struct dasd_device *device) } } +static void dasd_eckd_get_uid_string(struct dasd_conf *conf, + char *print_uid) +{ + struct dasd_uid uid; + + create_uid(conf, &uid); + if (strlen(uid.vduit) > 0) + snprintf(print_uid, sizeof(*print_uid), + "%s.%s.%04x.%02x.%s", + uid.vendor, uid.serial, uid.ssid, + uid.real_unit_addr, uid.vduit); + else + snprintf(print_uid, sizeof(*print_uid), + "%s.%s.%04x.%02x", + uid.vendor, uid.serial, uid.ssid, + uid.real_unit_addr); +} + +static int dasd_eckd_check_cabling(struct dasd_device *device, + void *conf_data, __u8 lpm) +{ + struct dasd_eckd_private *private = device->private; + char print_path_uid[60], print_device_uid[60]; + struct dasd_conf path_conf; + + path_conf.data = conf_data; + path_conf.len = DASD_ECKD_RCD_DATA_SIZE; + if (dasd_eckd_identify_conf_parts(&path_conf)) + return 1; + + if (dasd_eckd_compare_path_uid(device, &path_conf)) { + dasd_eckd_get_uid_string(&path_conf, print_path_uid); + dasd_eckd_get_uid_string(&private->conf, print_device_uid); + dev_err(&device->cdev->dev, + "Not all channel paths lead to the same device, path %02X leads to device %s instead of %s\n", + lpm, print_path_uid, print_device_uid); + return 1; + } + + return 0; +} + static int dasd_eckd_read_conf(struct dasd_device *device) { void *conf_data; int conf_len, conf_data_saved; int rc, path_err, pos; __u8 lpm, opm; - struct dasd_eckd_private *private, path_private; - struct dasd_uid *uid; - char print_path_uid[60], print_device_uid[60]; + struct dasd_eckd_private *private; private = device->private; opm = ccw_device_get_path_mask(device->cdev); @@ -1109,11 +1148,11 @@ static int dasd_eckd_read_conf(struct dasd_device *device) if (!conf_data_saved) { /* initially clear previously stored conf_data */ dasd_eckd_clear_conf_data(device); - private->conf_data = conf_data; - private->conf_len = conf_len; - if (dasd_eckd_identify_conf_parts(private)) { - private->conf_data = NULL; - private->conf_len = 0; + private->conf.data = conf_data; + private->conf.len = conf_len; + if (dasd_eckd_identify_conf_parts(&private->conf)) { + private->conf.data = NULL; + private->conf.len = 0; kfree(conf_data); continue; } @@ -1123,59 +1162,11 @@ static int dasd_eckd_read_conf(struct dasd_device *device) */ dasd_eckd_generate_uid(device); conf_data_saved++; - } else { - path_private.conf_data = conf_data; - path_private.conf_len = DASD_ECKD_RCD_DATA_SIZE; - if (dasd_eckd_identify_conf_parts( - &path_private)) { - path_private.conf_data = NULL; - path_private.conf_len = 0; - kfree(conf_data); - continue; - } - if (dasd_eckd_compare_path_uid( - device, &path_private)) { - uid = &path_private.uid; - if (strlen(uid->vduit) > 0) - snprintf(print_path_uid, - sizeof(print_path_uid), - "%s.%s.%04x.%02x.%s", - uid->vendor, uid->serial, - uid->ssid, uid->real_unit_addr, - uid->vduit); - else - snprintf(print_path_uid, - sizeof(print_path_uid), - "%s.%s.%04x.%02x", - uid->vendor, uid->serial, - uid->ssid, - uid->real_unit_addr); - uid = &private->uid; - if (strlen(uid->vduit) > 0) - snprintf(print_device_uid, - sizeof(print_device_uid), - "%s.%s.%04x.%02x.%s", - uid->vendor, uid->serial, - uid->ssid, uid->real_unit_addr, - uid->vduit); - else - snprintf(print_device_uid, - sizeof(print_device_uid), - "%s.%s.%04x.%02x", - uid->vendor, uid->serial, - uid->ssid, - uid->real_unit_addr); - dev_err(&device->cdev->dev, - "Not all channel paths lead to " - "the same device, path %02X leads to " - "device %s instead of %s\n", lpm, - print_path_uid, print_device_uid); - path_err = -EINVAL; - dasd_path_add_cablepm(device, lpm); - continue; - } - path_private.conf_data = NULL; - path_private.conf_len = 0; + } else if (dasd_eckd_check_cabling(device, conf_data, lpm)) { + dasd_path_add_cablepm(device, lpm); + path_err = -EINVAL; + kfree(conf_data); + continue; } pos = pathmask_to_pos(lpm); @@ -1197,8 +1188,6 @@ static int dasd_eckd_read_conf(struct dasd_device *device) } } - dasd_eckd_read_fc_security(device); - return path_err; } @@ -1213,7 +1202,7 @@ static u32 get_fcx_max_data(struct dasd_device *device) return 0; /* is transport mode supported? */ fcx_in_css = css_general_characteristics.fcx; - fcx_in_gneq = private->gneq->reserved2[7] & 0x04; + fcx_in_gneq = private->conf.gneq->reserved2[7] & 0x04; fcx_in_features = private->features.feature[40] & 0x80; tpm = fcx_in_css && fcx_in_gneq && fcx_in_features; @@ -1282,9 +1271,9 @@ static int rebuild_device_uid(struct dasd_device *device, "returned error %d", rc); break; } - memcpy(private->conf_data, data->rcd_buffer, + memcpy(private->conf.data, data->rcd_buffer, DASD_ECKD_RCD_DATA_SIZE); - if (dasd_eckd_identify_conf_parts(private)) { + if (dasd_eckd_identify_conf_parts(&private->conf)) { rc = -ENODEV; } else /* first valid path is enough */ break; @@ -1299,11 +1288,10 @@ static int rebuild_device_uid(struct dasd_device *device, static void dasd_eckd_path_available_action(struct dasd_device *device, struct pe_handler_work_data *data) { - struct dasd_eckd_private path_private; - struct dasd_uid *uid; __u8 path_rcd_buf[DASD_ECKD_RCD_DATA_SIZE]; __u8 lpm, opm, npm, ppm, epm, hpfpm, cablepm; struct dasd_conf_data *conf_data; + struct dasd_conf path_conf; unsigned long flags; char print_uid[60]; int rc, pos; @@ -1367,11 +1355,11 @@ static void dasd_eckd_path_available_action(struct dasd_device *device, */ memcpy(&path_rcd_buf, data->rcd_buffer, DASD_ECKD_RCD_DATA_SIZE); - path_private.conf_data = (void *) &path_rcd_buf; - path_private.conf_len = DASD_ECKD_RCD_DATA_SIZE; - if (dasd_eckd_identify_conf_parts(&path_private)) { - path_private.conf_data = NULL; - path_private.conf_len = 0; + path_conf.data = (void *)&path_rcd_buf; + path_conf.len = DASD_ECKD_RCD_DATA_SIZE; + if (dasd_eckd_identify_conf_parts(&path_conf)) { + path_conf.data = NULL; + path_conf.len = 0; continue; } @@ -1382,7 +1370,7 @@ static void dasd_eckd_path_available_action(struct dasd_device *device, * the first working path UID will be used as device UID */ if (dasd_path_get_opm(device) && - dasd_eckd_compare_path_uid(device, &path_private)) { + dasd_eckd_compare_path_uid(device, &path_conf)) { /* * the comparison was not successful * rebuild the device UID with at least one @@ -1396,20 +1384,8 @@ static void dasd_eckd_path_available_action(struct dasd_device *device, */ if (rebuild_device_uid(device, data) || dasd_eckd_compare_path_uid( - device, &path_private)) { - uid = &path_private.uid; - if (strlen(uid->vduit) > 0) - snprintf(print_uid, sizeof(print_uid), - "%s.%s.%04x.%02x.%s", - uid->vendor, uid->serial, - uid->ssid, uid->real_unit_addr, - uid->vduit); - else - snprintf(print_uid, sizeof(print_uid), - "%s.%s.%04x.%02x", - uid->vendor, uid->serial, - uid->ssid, - uid->real_unit_addr); + device, &path_conf)) { + dasd_eckd_get_uid_string(&path_conf, print_uid); dev_err(&device->cdev->dev, "The newly added channel path %02X " "will not be used because it leads " @@ -1427,6 +1403,14 @@ static void dasd_eckd_path_available_action(struct dasd_device *device, if (conf_data) { memcpy(conf_data, data->rcd_buffer, DASD_ECKD_RCD_DATA_SIZE); + } else { + /* + * path is operational but path config data could not + * be stored due to low mem condition + * add it to the error path mask and schedule a path + * verification later that this could be added again + */ + epm |= lpm; } pos = pathmask_to_pos(lpm); dasd_eckd_store_conf_data(device, conf_data, pos); @@ -1447,7 +1431,10 @@ static void dasd_eckd_path_available_action(struct dasd_device *device, } dasd_path_add_nppm(device, npm); dasd_path_add_ppm(device, ppm); - dasd_path_add_tbvpm(device, epm); + if (epm) { + dasd_path_add_tbvpm(device, epm); + dasd_device_set_timer(device, 50); + } dasd_path_add_cablepm(device, cablepm); dasd_path_add_nohpfpm(device, hpfpm); spin_unlock_irqrestore(get_ccwdev_lock(device->cdev), flags); @@ -1625,8 +1612,8 @@ static int dasd_eckd_read_vol_info(struct dasd_device *device) prssdp = cqr->data; prssdp->order = PSF_ORDER_PRSSD; prssdp->suborder = PSF_SUBORDER_VSQ; /* Volume Storage Query */ - prssdp->lss = private->ned->ID; - prssdp->volume = private->ned->unit_addr; + prssdp->lss = private->conf.ned->ID; + prssdp->volume = private->conf.ned->unit_addr; ccw = cqr->cpaddr; ccw->cmd_code = DASD_ECKD_CCW_PSF; @@ -2085,11 +2072,11 @@ dasd_eckd_check_characteristics(struct dasd_device *device) device->path_thrhld = DASD_ECKD_PATH_THRHLD; device->path_interval = DASD_ECKD_PATH_INTERVAL; - if (private->gneq) { + if (private->conf.gneq) { value = 1; - for (i = 0; i < private->gneq->timeout.value; i++) + for (i = 0; i < private->conf.gneq->timeout.value; i++) value = 10 * value; - value = value * private->gneq->timeout.number; + value = value * private->conf.gneq->timeout.number; /* do not accept useless values */ if (value != 0 && value <= DASD_EXPIRES_MAX) device->default_expires = value; @@ -2121,6 +2108,7 @@ dasd_eckd_check_characteristics(struct dasd_device *device) if (rc) goto out_err3; + dasd_eckd_read_fc_security(device); dasd_path_create_kobjects(device); /* Read Feature Codes */ @@ -2195,10 +2183,10 @@ static void dasd_eckd_uncheck_device(struct dasd_device *device) return; dasd_alias_disconnect_device_from_lcu(device); - private->ned = NULL; - private->sneq = NULL; - private->vdsneq = NULL; - private->gneq = NULL; + private->conf.ned = NULL; + private->conf.sneq = NULL; + private->conf.vdsneq = NULL; + private->conf.gneq = NULL; dasd_eckd_clear_conf_data(device); dasd_path_remove_kobjects(device); } @@ -3750,8 +3738,8 @@ dasd_eckd_dso_ras(struct dasd_device *device, struct dasd_block *block, * subset. */ ras_data->op_flags.guarantee_init = !!(features->feature[56] & 0x01); - ras_data->lss = private->ned->ID; - ras_data->dev_addr = private->ned->unit_addr; + ras_data->lss = private->conf.ned->ID; + ras_data->dev_addr = private->conf.ned->unit_addr; ras_data->nr_exts = nr_exts; if (by_extent) { @@ -4293,8 +4281,8 @@ static int prepare_itcw(struct itcw *itcw, memset(&pfxdata, 0, sizeof(pfxdata)); pfxdata.format = 1; /* PFX with LRE */ - pfxdata.base_address = basepriv->ned->unit_addr; - pfxdata.base_lss = basepriv->ned->ID; + pfxdata.base_address = basepriv->conf.ned->unit_addr; + pfxdata.base_lss = basepriv->conf.ned->ID; pfxdata.validity.define_extent = 1; /* private uid is kept up to date, conf_data may be outdated */ @@ -4963,9 +4951,9 @@ dasd_eckd_fill_info(struct dasd_device * device, info->characteristics_size = sizeof(private->rdc_data); memcpy(info->characteristics, &private->rdc_data, sizeof(private->rdc_data)); - info->confdata_size = min((unsigned long)private->conf_len, - sizeof(info->configuration_data)); - memcpy(info->configuration_data, private->conf_data, + info->confdata_size = min_t(unsigned long, private->conf.len, + sizeof(info->configuration_data)); + memcpy(info->configuration_data, private->conf.data, info->confdata_size); return 0; } @@ -5808,6 +5796,8 @@ static int dasd_eckd_reload_device(struct dasd_device *device) if (rc) goto out_err; + dasd_eckd_read_fc_security(device); + rc = dasd_eckd_generate_uid(device); if (rc) goto out_err; @@ -5820,15 +5810,7 @@ static int dasd_eckd_reload_device(struct dasd_device *device) dasd_eckd_get_uid(device, &uid); if (old_base != uid.base_unit_addr) { - if (strlen(uid.vduit) > 0) - snprintf(print_uid, sizeof(print_uid), - "%s.%s.%04x.%02x.%s", uid.vendor, uid.serial, - uid.ssid, uid.base_unit_addr, uid.vduit); - else - snprintf(print_uid, sizeof(print_uid), - "%s.%s.%04x.%02x", uid.vendor, uid.serial, - uid.ssid, uid.base_unit_addr); - + dasd_eckd_get_uid_string(&private->conf, print_uid); dev_info(&device->cdev->dev, "An Alias device was reassigned to a new base device " "with UID: %s\n", print_uid); @@ -5966,8 +5948,8 @@ static int dasd_eckd_query_host_access(struct dasd_device *device, prssdp->order = PSF_ORDER_PRSSD; prssdp->suborder = PSF_SUBORDER_QHA; /* query host access */ /* LSS and Volume that will be queried */ - prssdp->lss = private->ned->ID; - prssdp->volume = private->ned->unit_addr; + prssdp->lss = private->conf.ned->ID; + prssdp->volume = private->conf.ned->unit_addr; /* all other bytes of prssdp must be zero */ ccw = cqr->cpaddr; diff --git a/drivers/s390/block/dasd_eckd.h b/drivers/s390/block/dasd_eckd.h index 65e4630ad2ae..a91b265441cc 100644 --- a/drivers/s390/block/dasd_eckd.h +++ b/drivers/s390/block/dasd_eckd.h @@ -658,16 +658,19 @@ struct dasd_conf_data { struct dasd_gneq gneq; } __packed; -struct dasd_eckd_private { - struct dasd_eckd_characteristics rdc_data; - u8 *conf_data; - int conf_len; - +struct dasd_conf { + u8 *data; + int len; /* pointers to specific parts in the conf_data */ struct dasd_ned *ned; struct dasd_sneq *sneq; struct vd_sneq *vdsneq; struct dasd_gneq *gneq; +}; + +struct dasd_eckd_private { + struct dasd_eckd_characteristics rdc_data; + struct dasd_conf conf; struct eckd_count count_area[5]; int init_cqr_status; diff --git a/drivers/s390/block/dasd_erp.c b/drivers/s390/block/dasd_erp.c index ba4fa372d02d..c07e6e713518 100644 --- a/drivers/s390/block/dasd_erp.c +++ b/drivers/s390/block/dasd_erp.c @@ -24,7 +24,7 @@ #include "dasd_int.h" struct dasd_ccw_req * -dasd_alloc_erp_request(char *magic, int cplength, int datasize, +dasd_alloc_erp_request(unsigned int magic, int cplength, int datasize, struct dasd_device * device) { unsigned long flags; @@ -33,8 +33,8 @@ dasd_alloc_erp_request(char *magic, int cplength, int datasize, int size; /* Sanity checks */ - BUG_ON( magic == NULL || datasize > PAGE_SIZE || - (cplength*sizeof(struct ccw1)) > PAGE_SIZE); + BUG_ON(datasize > PAGE_SIZE || + (cplength*sizeof(struct ccw1)) > PAGE_SIZE); size = (sizeof(struct dasd_ccw_req) + 7L) & -8L; if (cplength > 0) @@ -62,7 +62,7 @@ dasd_alloc_erp_request(char *magic, int cplength, int datasize, cqr->data = data; memset(cqr->data, 0, datasize); } - strncpy((char *) &cqr->magic, magic, 4); + cqr->magic = magic; ASCEBC((char *) &cqr->magic, 4); set_bit(DASD_CQR_FLAGS_USE_ERP, &cqr->flags); dasd_get_device(device); diff --git a/drivers/s390/block/dasd_int.h b/drivers/s390/block/dasd_int.h index 155428bfed8a..8b458010f88a 100644 --- a/drivers/s390/block/dasd_int.h +++ b/drivers/s390/block/dasd_int.h @@ -887,7 +887,7 @@ void dasd_proc_exit(void); /* externals in dasd_erp.c */ struct dasd_ccw_req *dasd_default_erp_action(struct dasd_ccw_req *); struct dasd_ccw_req *dasd_default_erp_postaction(struct dasd_ccw_req *); -struct dasd_ccw_req *dasd_alloc_erp_request(char *, int, int, +struct dasd_ccw_req *dasd_alloc_erp_request(unsigned int, int, int, struct dasd_device *); void dasd_free_erp_request(struct dasd_ccw_req *, struct dasd_device *); void dasd_log_sense(struct dasd_ccw_req *, struct irb *); @@ -1305,6 +1305,15 @@ static inline void dasd_path_add_ppm(struct dasd_device *device, __u8 pm) dasd_path_preferred(device, chp); } +static inline void dasd_path_add_fcsecpm(struct dasd_device *device, __u8 pm) +{ + int chp; + + for (chp = 0; chp < 8; chp++) + if (pm & (0x80 >> chp)) + dasd_path_fcsec(device, chp); +} + /* * set functions for path masks * the existing path mask will be replaced by the given path mask diff --git a/drivers/s390/block/dasd_ioctl.c b/drivers/s390/block/dasd_ioctl.c index 468cbeb539ff..95349f95758c 100644 --- a/drivers/s390/block/dasd_ioctl.c +++ b/drivers/s390/block/dasd_ioctl.c @@ -650,8 +650,8 @@ int dasd_ioctl(struct block_device *bdev, fmode_t mode, /** * dasd_biodasdinfo() - fill out the dasd information structure - * @disk [in]: pointer to gendisk structure that references a DASD - * @info [out]: pointer to the dasd_information2_t structure + * @disk: [in] pointer to gendisk structure that references a DASD + * @info: [out] pointer to the dasd_information2_t structure * * Provide access to DASD specific information. * The gendisk structure is checked if it belongs to the DASD driver by diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index 3f6f14f0cafb..24b72ee4246f 100644 --- a/drivers/scsi/hosts.c +++ b/drivers/scsi/hosts.c @@ -220,7 +220,8 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev, goto fail; } - shost->cmd_per_lun = min_t(short, shost->cmd_per_lun, + /* Use min_t(int, ...) in case shost->can_queue exceeds SHRT_MAX */ + shost->cmd_per_lun = min_t(int, shost->cmd_per_lun, shost->can_queue); error = scsi_init_sense_cache(shost); diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c index 1f1586ad48fe..01f79991bf4a 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.c +++ b/drivers/scsi/ibmvscsi/ibmvfc.c @@ -1696,6 +1696,7 @@ static int ibmvfc_send_event(struct ibmvfc_event *evt, spin_lock_irqsave(&evt->queue->l_lock, flags); list_add_tail(&evt->queue_list, &evt->queue->sent); + atomic_set(&evt->active, 1); mb(); @@ -1710,6 +1711,7 @@ static int ibmvfc_send_event(struct ibmvfc_event *evt, be64_to_cpu(crq_as_u64[1])); if (rc) { + atomic_set(&evt->active, 0); list_del(&evt->queue_list); spin_unlock_irqrestore(&evt->queue->l_lock, flags); del_timer(&evt->timer); @@ -1737,7 +1739,6 @@ static int ibmvfc_send_event(struct ibmvfc_event *evt, evt->done(evt); } else { - atomic_set(&evt->active, 1); spin_unlock_irqrestore(&evt->queue->l_lock, flags); ibmvfc_trc_start(evt); } diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c index 2197988333fe..3cae8803383b 100644 --- a/drivers/scsi/mpi3mr/mpi3mr_os.c +++ b/drivers/scsi/mpi3mr/mpi3mr_os.c @@ -3736,7 +3736,7 @@ mpi3mr_probe(struct pci_dev *pdev, const struct pci_device_id *id) shost->max_lun = -1; shost->unique_id = mrioc->id; - shost->max_channel = 1; + shost->max_channel = 0; shost->max_id = 0xFFFFFFFF; if (prot_mask >= 0) diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index d383d4a03436..ad1b6c2b37a7 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -5065,9 +5065,12 @@ _scsih_setup_eedp(struct MPT3SAS_ADAPTER *ioc, struct scsi_cmnd *scmd, if (scmd->prot_flags & SCSI_PROT_GUARD_CHECK) eedp_flags |= MPI2_SCSIIO_EEDPFLAGS_CHECK_GUARD; - if (scmd->prot_flags & SCSI_PROT_REF_CHECK) { - eedp_flags |= MPI2_SCSIIO_EEDPFLAGS_INC_PRI_REFTAG | - MPI2_SCSIIO_EEDPFLAGS_CHECK_REFTAG; + if (scmd->prot_flags & SCSI_PROT_REF_CHECK) + eedp_flags |= MPI2_SCSIIO_EEDPFLAGS_CHECK_REFTAG; + + if (scmd->prot_flags & SCSI_PROT_REF_INCREMENT) { + eedp_flags |= MPI2_SCSIIO_EEDPFLAGS_INC_PRI_REFTAG; + mpi_request->CDB.EEDP32.PrimaryReferenceTag = cpu_to_be32(scsi_prot_ref_tag(scmd)); } diff --git a/drivers/scsi/qla2xxx/qla_bsg.c b/drivers/scsi/qla2xxx/qla_bsg.c index 4b5d28d89d69..655cf5de604b 100644 --- a/drivers/scsi/qla2xxx/qla_bsg.c +++ b/drivers/scsi/qla2xxx/qla_bsg.c @@ -431,7 +431,7 @@ done_unmap_sg: goto done_free_fcport; done_free_fcport: - if (bsg_request->msgcode == FC_BSG_RPT_ELS) + if (bsg_request->msgcode != FC_BSG_RPT_ELS) qla2x00_free_fcport(fcport); done: return rval; diff --git a/drivers/scsi/qla2xxx/qla_nvme.c b/drivers/scsi/qla2xxx/qla_nvme.c index 1c5da2dbd6f9..253055cf9daf 100644 --- a/drivers/scsi/qla2xxx/qla_nvme.c +++ b/drivers/scsi/qla2xxx/qla_nvme.c @@ -8,6 +8,8 @@ #include <linux/delay.h> #include <linux/nvme.h> #include <linux/nvme-fc.h> +#include <linux/blk-mq-pci.h> +#include <linux/blk-mq.h> static struct nvme_fc_port_template qla_nvme_fc_transport; @@ -642,6 +644,18 @@ static int qla_nvme_post_cmd(struct nvme_fc_local_port *lport, return rval; } +static void qla_nvme_map_queues(struct nvme_fc_local_port *lport, + struct blk_mq_queue_map *map) +{ + struct scsi_qla_host *vha = lport->private; + int rc; + + rc = blk_mq_pci_map_queues(map, vha->hw->pdev, vha->irq_offset); + if (rc) + ql_log(ql_log_warn, vha, 0x21de, + "pci map queue failed 0x%x", rc); +} + static void qla_nvme_localport_delete(struct nvme_fc_local_port *lport) { struct scsi_qla_host *vha = lport->private; @@ -676,6 +690,7 @@ static struct nvme_fc_port_template qla_nvme_fc_transport = { .ls_abort = qla_nvme_ls_abort, .fcp_io = qla_nvme_post_cmd, .fcp_abort = qla_nvme_fcp_abort, + .map_queues = qla_nvme_map_queues, .max_hw_queues = 8, .max_sgl_segments = 1024, .max_dif_sgl_segments = 64, diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index d2e40aaba734..836fedcea241 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -4157,7 +4157,7 @@ qla2x00_mem_alloc(struct qla_hw_data *ha, uint16_t req_len, uint16_t rsp_len, ql_dbg_pci(ql_dbg_init, ha->pdev, 0xe0ee, "%s: failed alloc dsd\n", __func__); - return 1; + return -ENOMEM; } ha->dif_bundle_kallocs++; diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c index b3478ed9b12e..7d8242c120fc 100644 --- a/drivers/scsi/qla2xxx/qla_target.c +++ b/drivers/scsi/qla2xxx/qla_target.c @@ -3319,8 +3319,7 @@ int qlt_xmit_response(struct qla_tgt_cmd *cmd, int xmit_type, "RESET-RSP online/active/old-count/new-count = %d/%d/%d/%d.\n", vha->flags.online, qla2x00_reset_active(vha), cmd->reset_count, qpair->chip_reset); - spin_unlock_irqrestore(qpair->qp_lock_ptr, flags); - return 0; + goto out_unmap_unlock; } /* Does F/W have an IOCBs for this request */ @@ -3445,10 +3444,6 @@ int qlt_rdy_to_xfer(struct qla_tgt_cmd *cmd) prm.sg = NULL; prm.req_cnt = 1; - /* Calculate number of entries and segments required */ - if (qlt_pci_map_calc_cnt(&prm) != 0) - return -EAGAIN; - if (!qpair->fw_started || (cmd->reset_count != qpair->chip_reset) || (cmd->sess && cmd->sess->deleted)) { /* @@ -3466,6 +3461,10 @@ int qlt_rdy_to_xfer(struct qla_tgt_cmd *cmd) return 0; } + /* Calculate number of entries and segments required */ + if (qlt_pci_map_calc_cnt(&prm) != 0) + return -EAGAIN; + spin_lock_irqsave(qpair->qp_lock_ptr, flags); /* Does F/W have an IOCBs for this request */ res = qlt_check_reserve_free_req(qpair, prm.req_cnt); @@ -3870,9 +3869,6 @@ void qlt_free_cmd(struct qla_tgt_cmd *cmd) BUG_ON(cmd->cmd_in_wq); - if (cmd->sg_mapped) - qlt_unmap_sg(cmd->vha, cmd); - if (!cmd->q_full) qlt_decr_num_pend_cmds(cmd->vha); diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c index b241f9e3885c..291ecc33b1fe 100644 --- a/drivers/scsi/scsi.c +++ b/drivers/scsi/scsi.c @@ -553,8 +553,10 @@ EXPORT_SYMBOL(scsi_device_get); */ void scsi_device_put(struct scsi_device *sdev) { - module_put(sdev->host->hostt->module); + struct module *mod = sdev->host->hostt->module; + put_device(&sdev->sdev_gendev); + module_put(mod); } EXPORT_SYMBOL(scsi_device_put); diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index 86793259e541..a35841b34bfd 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -449,9 +449,12 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work) struct scsi_vpd *vpd_pg80 = NULL, *vpd_pg83 = NULL; struct scsi_vpd *vpd_pg0 = NULL, *vpd_pg89 = NULL; unsigned long flags; + struct module *mod; sdev = container_of(work, struct scsi_device, ew.work); + mod = sdev->host->hostt->module; + scsi_dh_release_device(sdev); parent = sdev->sdev_gendev.parent; @@ -502,11 +505,17 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work) if (parent) put_device(parent); + module_put(mod); } static void scsi_device_dev_release(struct device *dev) { struct scsi_device *sdp = to_scsi_device(dev); + + /* Set module pointer as NULL in case of module unloading */ + if (!try_module_get(sdp->host->hostt->module)) + sdp->host->hostt->module = NULL; + execute_in_process_context(scsi_device_dev_release_usercontext, &sdp->ew); } diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index 922e4c7bd88e..78343d3f9385 100644 --- a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -2930,8 +2930,6 @@ iscsi_set_param(struct iscsi_transport *transport, struct iscsi_uevent *ev) session->recovery_tmo = value; break; default: - err = transport->set_param(conn, ev->u.set_param.param, - data, ev->u.set_param.len); if ((conn->state == ISCSI_CONN_BOUND) || (conn->state == ISCSI_CONN_UP)) { err = transport->set_param(conn, ev->u.set_param.param, diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index d8f6add416c0..9bdee968d3b5 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -3684,7 +3684,12 @@ static int sd_resume(struct device *dev) static int sd_resume_runtime(struct device *dev) { struct scsi_disk *sdkp = dev_get_drvdata(dev); - struct scsi_device *sdp = sdkp->device; + struct scsi_device *sdp; + + if (!sdkp) /* E.g.: runtime resume at the start of sd_probe() */ + return 0; + + sdp = sdkp->device; if (sdp->ignore_media_change) { /* clear the device's sense data */ diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index ebbbc1299c62..9eb1b88a29dd 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1285,11 +1285,15 @@ static void storvsc_on_channel_callback(void *context) foreach_vmbus_pkt(desc, channel) { struct vstor_packet *packet = hv_pkt_data(desc); struct storvsc_cmd_request *request = NULL; + u32 pktlen = hv_pkt_datalen(desc); u64 rqst_id = desc->trans_id; + u32 minlen = rqst_id ? sizeof(struct vstor_packet) - + stor_device->vmscsi_size_delta : sizeof(enum vstor_packet_operation); - if (hv_pkt_datalen(desc) < sizeof(struct vstor_packet) - - stor_device->vmscsi_size_delta) { - dev_err(&device->device, "Invalid packet len\n"); + if (pktlen < minlen) { + dev_err(&device->device, + "Invalid pkt: id=%llu, len=%u, minlen=%u\n", + rqst_id, pktlen, minlen); continue; } @@ -1302,13 +1306,23 @@ static void storvsc_on_channel_callback(void *context) if (rqst_id == 0) { /* * storvsc_on_receive() looks at the vstor_packet in the message - * from the ring buffer. If the operation in the vstor_packet is - * COMPLETE_IO, then we call storvsc_on_io_completion(), and - * dereference the guest memory address. Make sure we don't call - * storvsc_on_io_completion() with a guest memory address that is - * zero if Hyper-V were to construct and send such a bogus packet. + * from the ring buffer. + * + * - If the operation in the vstor_packet is COMPLETE_IO, then + * we call storvsc_on_io_completion(), and dereference the + * guest memory address. Make sure we don't call + * storvsc_on_io_completion() with a guest memory address + * that is zero if Hyper-V were to construct and send such + * a bogus packet. + * + * - If the operation in the vstor_packet is FCHBA_DATA, then + * we call cache_wwn(), and access the data payload area of + * the packet (wwn_packet); however, there is no guarantee + * that the packet is big enough to contain such area. + * Future-proof the code by rejecting such a bogus packet. */ - if (packet->operation == VSTOR_OPERATION_COMPLETE_IO) { + if (packet->operation == VSTOR_OPERATION_COMPLETE_IO || + packet->operation == VSTOR_OPERATION_FCHBA_DATA) { dev_err(&device->device, "Invalid packet with ID of 0\n"); continue; } diff --git a/drivers/scsi/ufs/ufs-exynos.c b/drivers/scsi/ufs/ufs-exynos.c index a14dd8ce56d4..bb2dd79a1bcd 100644 --- a/drivers/scsi/ufs/ufs-exynos.c +++ b/drivers/scsi/ufs/ufs-exynos.c @@ -642,9 +642,9 @@ static int exynos_ufs_pre_pwr_mode(struct ufs_hba *hba, } /* setting for three timeout values for traffic class #0 */ - ufshcd_dme_set(hba, UIC_ARG_MIB(PA_PWRMODEUSERDATA0), 8064); - ufshcd_dme_set(hba, UIC_ARG_MIB(PA_PWRMODEUSERDATA1), 28224); - ufshcd_dme_set(hba, UIC_ARG_MIB(PA_PWRMODEUSERDATA2), 20160); + ufshcd_dme_set(hba, UIC_ARG_MIB(DL_FC0PROTTIMEOUTVAL), 8064); + ufshcd_dme_set(hba, UIC_ARG_MIB(DL_TC0REPLAYTIMEOUTVAL), 28224); + ufshcd_dme_set(hba, UIC_ARG_MIB(DL_AFC0REQTIMEOUTVAL), 20160); return 0; out: diff --git a/drivers/scsi/ufs/ufshcd-crypto.c b/drivers/scsi/ufs/ufshcd-crypto.c index d70cdcd35e43..67402baf6fae 100644 --- a/drivers/scsi/ufs/ufshcd-crypto.c +++ b/drivers/scsi/ufs/ufshcd-crypto.c @@ -48,11 +48,12 @@ out: return err; } -static int ufshcd_crypto_keyslot_program(struct blk_keyslot_manager *ksm, +static int ufshcd_crypto_keyslot_program(struct blk_crypto_profile *profile, const struct blk_crypto_key *key, unsigned int slot) { - struct ufs_hba *hba = container_of(ksm, struct ufs_hba, ksm); + struct ufs_hba *hba = + container_of(profile, struct ufs_hba, crypto_profile); const union ufs_crypto_cap_entry *ccap_array = hba->crypto_cap_array; const struct ufs_crypto_alg_entry *alg = &ufs_crypto_algs[key->crypto_cfg.crypto_mode]; @@ -105,11 +106,12 @@ static int ufshcd_clear_keyslot(struct ufs_hba *hba, int slot) return ufshcd_program_key(hba, &cfg, slot); } -static int ufshcd_crypto_keyslot_evict(struct blk_keyslot_manager *ksm, +static int ufshcd_crypto_keyslot_evict(struct blk_crypto_profile *profile, const struct blk_crypto_key *key, unsigned int slot) { - struct ufs_hba *hba = container_of(ksm, struct ufs_hba, ksm); + struct ufs_hba *hba = + container_of(profile, struct ufs_hba, crypto_profile); return ufshcd_clear_keyslot(hba, slot); } @@ -120,11 +122,11 @@ bool ufshcd_crypto_enable(struct ufs_hba *hba) return false; /* Reset might clear all keys, so reprogram all the keys. */ - blk_ksm_reprogram_all_keys(&hba->ksm); + blk_crypto_reprogram_all_keys(&hba->crypto_profile); return true; } -static const struct blk_ksm_ll_ops ufshcd_ksm_ops = { +static const struct blk_crypto_ll_ops ufshcd_crypto_ops = { .keyslot_program = ufshcd_crypto_keyslot_program, .keyslot_evict = ufshcd_crypto_keyslot_evict, }; @@ -179,15 +181,16 @@ int ufshcd_hba_init_crypto_capabilities(struct ufs_hba *hba) } /* The actual number of configurations supported is (CFGC+1) */ - err = devm_blk_ksm_init(hba->dev, &hba->ksm, - hba->crypto_capabilities.config_count + 1); + err = devm_blk_crypto_profile_init( + hba->dev, &hba->crypto_profile, + hba->crypto_capabilities.config_count + 1); if (err) goto out; - hba->ksm.ksm_ll_ops = ufshcd_ksm_ops; + hba->crypto_profile.ll_ops = ufshcd_crypto_ops; /* UFS only supports 8 bytes for any DUN */ - hba->ksm.max_dun_bytes_supported = 8; - hba->ksm.dev = hba->dev; + hba->crypto_profile.max_dun_bytes_supported = 8; + hba->crypto_profile.dev = hba->dev; /* * Cache all the UFS crypto capabilities and advertise the supported @@ -202,7 +205,7 @@ int ufshcd_hba_init_crypto_capabilities(struct ufs_hba *hba) blk_mode_num = ufshcd_find_blk_crypto_mode( hba->crypto_cap_array[cap_idx]); if (blk_mode_num != BLK_ENCRYPTION_MODE_INVALID) - hba->ksm.crypto_modes_supported[blk_mode_num] |= + hba->crypto_profile.modes_supported[blk_mode_num] |= hba->crypto_cap_array[cap_idx].sdus_mask * 512; } @@ -230,9 +233,8 @@ void ufshcd_init_crypto(struct ufs_hba *hba) ufshcd_clear_keyslot(hba, slot); } -void ufshcd_crypto_setup_rq_keyslot_manager(struct ufs_hba *hba, - struct request_queue *q) +void ufshcd_crypto_register(struct ufs_hba *hba, struct request_queue *q) { if (hba->caps & UFSHCD_CAP_CRYPTO) - blk_ksm_register(&hba->ksm, q); + blk_crypto_register(&hba->crypto_profile, q); } diff --git a/drivers/scsi/ufs/ufshcd-crypto.h b/drivers/scsi/ufs/ufshcd-crypto.h index 78a58e788dff..e18c01276873 100644 --- a/drivers/scsi/ufs/ufshcd-crypto.h +++ b/drivers/scsi/ufs/ufshcd-crypto.h @@ -18,7 +18,7 @@ static inline void ufshcd_prepare_lrbp_crypto(struct request *rq, return; } - lrbp->crypto_key_slot = blk_ksm_get_slot_idx(rq->crypt_keyslot); + lrbp->crypto_key_slot = blk_crypto_keyslot_index(rq->crypt_keyslot); lrbp->data_unit_num = rq->crypt_ctx->bc_dun[0]; } @@ -40,8 +40,7 @@ int ufshcd_hba_init_crypto_capabilities(struct ufs_hba *hba); void ufshcd_init_crypto(struct ufs_hba *hba); -void ufshcd_crypto_setup_rq_keyslot_manager(struct ufs_hba *hba, - struct request_queue *q); +void ufshcd_crypto_register(struct ufs_hba *hba, struct request_queue *q); #else /* CONFIG_SCSI_UFS_CRYPTO */ @@ -64,8 +63,8 @@ static inline int ufshcd_hba_init_crypto_capabilities(struct ufs_hba *hba) static inline void ufshcd_init_crypto(struct ufs_hba *hba) { } -static inline void ufshcd_crypto_setup_rq_keyslot_manager(struct ufs_hba *hba, - struct request_queue *q) { } +static inline void ufshcd_crypto_register(struct ufs_hba *hba, + struct request_queue *q) { } #endif /* CONFIG_SCSI_UFS_CRYPTO */ diff --git a/drivers/scsi/ufs/ufshcd-pci.c b/drivers/scsi/ufs/ufshcd-pci.c index 149c1aa09103..51424557810d 100644 --- a/drivers/scsi/ufs/ufshcd-pci.c +++ b/drivers/scsi/ufs/ufshcd-pci.c @@ -370,20 +370,6 @@ static void ufs_intel_common_exit(struct ufs_hba *hba) static int ufs_intel_resume(struct ufs_hba *hba, enum ufs_pm_op op) { - /* - * To support S4 (suspend-to-disk) with spm_lvl other than 5, the base - * address registers must be restored because the restore kernel can - * have used different addresses. - */ - ufshcd_writel(hba, lower_32_bits(hba->utrdl_dma_addr), - REG_UTP_TRANSFER_REQ_LIST_BASE_L); - ufshcd_writel(hba, upper_32_bits(hba->utrdl_dma_addr), - REG_UTP_TRANSFER_REQ_LIST_BASE_H); - ufshcd_writel(hba, lower_32_bits(hba->utmrdl_dma_addr), - REG_UTP_TASK_REQ_LIST_BASE_L); - ufshcd_writel(hba, upper_32_bits(hba->utmrdl_dma_addr), - REG_UTP_TASK_REQ_LIST_BASE_H); - if (ufshcd_is_link_hibern8(hba)) { int ret = ufshcd_uic_hibern8_exit(hba); @@ -463,6 +449,18 @@ static struct ufs_hba_variant_ops ufs_intel_lkf_hba_vops = { .device_reset = ufs_intel_device_reset, }; +#ifdef CONFIG_PM_SLEEP +static int ufshcd_pci_restore(struct device *dev) +{ + struct ufs_hba *hba = dev_get_drvdata(dev); + + /* Force a full reset and restore */ + ufshcd_set_link_off(hba); + + return ufshcd_system_resume(dev); +} +#endif + /** * ufshcd_pci_shutdown - main function to put the controller in reset state * @pdev: pointer to PCI device handle @@ -546,9 +544,14 @@ ufshcd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) } static const struct dev_pm_ops ufshcd_pci_pm_ops = { - SET_SYSTEM_SLEEP_PM_OPS(ufshcd_system_suspend, ufshcd_system_resume) SET_RUNTIME_PM_OPS(ufshcd_runtime_suspend, ufshcd_runtime_resume, NULL) #ifdef CONFIG_PM_SLEEP + .suspend = ufshcd_system_suspend, + .resume = ufshcd_system_resume, + .freeze = ufshcd_system_suspend, + .thaw = ufshcd_system_resume, + .poweroff = ufshcd_system_suspend, + .restore = ufshcd_pci_restore, .prepare = ufshcd_suspend_prepare, .complete = ufshcd_resume_complete, #endif diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c index 95be7ecdfe10..1157b24963ef 100644 --- a/drivers/scsi/ufs/ufshcd.c +++ b/drivers/scsi/ufs/ufshcd.c @@ -2737,12 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd) lrbp->req_abort_skip = false; - err = ufshpb_prep(hba, lrbp); - if (err == -EAGAIN) { - lrbp->cmd = NULL; - ufshcd_release(hba); - goto out; - } + ufshpb_prep(hba, lrbp); ufshcd_comp_scsi_upiu(hba, lrbp); @@ -4986,7 +4981,7 @@ static int ufshcd_slave_configure(struct scsi_device *sdev) else if (ufshcd_is_rpm_autosuspend_allowed(hba)) sdev->rpm_autosuspend = 1; - ufshcd_crypto_setup_rq_keyslot_manager(hba, q); + ufshcd_crypto_register(hba, q); return 0; } diff --git a/drivers/scsi/ufs/ufshcd.h b/drivers/scsi/ufs/ufshcd.h index 41f6e06f9185..62bdc412d38a 100644 --- a/drivers/scsi/ufs/ufshcd.h +++ b/drivers/scsi/ufs/ufshcd.h @@ -32,7 +32,7 @@ #include <linux/regulator/consumer.h> #include <linux/bitfield.h> #include <linux/devfreq.h> -#include <linux/keyslot-manager.h> +#include <linux/blk-crypto-profile.h> #include "unipro.h" #include <asm/irq.h> @@ -766,7 +766,7 @@ struct ufs_hba_monitor { * @crypto_capabilities: Content of crypto capabilities register (0x100) * @crypto_cap_array: Array of crypto capabilities * @crypto_cfg_register: Start of the crypto cfg array - * @ksm: the keyslot manager tied to this hba + * @crypto_profile: the crypto profile of this hba (if applicable) */ struct ufs_hba { void __iomem *mmio_base; @@ -911,7 +911,7 @@ struct ufs_hba { union ufs_crypto_capabilities crypto_capabilities; union ufs_crypto_cap_entry *crypto_cap_array; u32 crypto_cfg_register; - struct blk_keyslot_manager ksm; + struct blk_crypto_profile crypto_profile; #endif #ifdef CONFIG_DEBUG_FS struct dentry *debugfs_root; diff --git a/drivers/scsi/ufs/ufshpb.c b/drivers/scsi/ufs/ufshpb.c index 589af5f6b940..026a133149dc 100644 --- a/drivers/scsi/ufs/ufshpb.c +++ b/drivers/scsi/ufs/ufshpb.c @@ -84,16 +84,6 @@ static bool ufshpb_is_supported_chunk(struct ufshpb_lu *hpb, int transfer_len) return transfer_len <= hpb->pre_req_max_tr_len; } -/* - * In this driver, WRITE_BUFFER CMD support 36KB (len=9) ~ 1MB (len=256) as - * default. It is possible to change range of transfer_len through sysfs. - */ -static inline bool ufshpb_is_required_wb(struct ufshpb_lu *hpb, int len) -{ - return len > hpb->pre_req_min_tr_len && - len <= hpb->pre_req_max_tr_len; -} - static bool ufshpb_is_general_lun(int lun) { return lun < UFS_UPIU_MAX_UNIT_NUM_ID; @@ -334,7 +324,7 @@ ufshpb_get_pos_from_lpn(struct ufshpb_lu *hpb, unsigned long lpn, int *rgn_idx, static void ufshpb_set_hpb_read_to_upiu(struct ufs_hba *hba, struct ufshcd_lrb *lrbp, - __be64 ppn, u8 transfer_len, int read_id) + __be64 ppn, u8 transfer_len) { unsigned char *cdb = lrbp->cmd->cmnd; __be64 ppn_tmp = ppn; @@ -346,256 +336,11 @@ ufshpb_set_hpb_read_to_upiu(struct ufs_hba *hba, struct ufshcd_lrb *lrbp, /* ppn value is stored as big-endian in the host memory */ memcpy(&cdb[6], &ppn_tmp, sizeof(__be64)); cdb[14] = transfer_len; - cdb[15] = read_id; + cdb[15] = 0; lrbp->cmd->cmd_len = UFS_CDB_SIZE; } -static inline void ufshpb_set_write_buf_cmd(unsigned char *cdb, - unsigned long lpn, unsigned int len, - int read_id) -{ - cdb[0] = UFSHPB_WRITE_BUFFER; - cdb[1] = UFSHPB_WRITE_BUFFER_PREFETCH_ID; - - put_unaligned_be32(lpn, &cdb[2]); - cdb[6] = read_id; - put_unaligned_be16(len * HPB_ENTRY_SIZE, &cdb[7]); - - cdb[9] = 0x00; /* Control = 0x00 */ -} - -static struct ufshpb_req *ufshpb_get_pre_req(struct ufshpb_lu *hpb) -{ - struct ufshpb_req *pre_req; - - if (hpb->num_inflight_pre_req >= hpb->throttle_pre_req) { - dev_info(&hpb->sdev_ufs_lu->sdev_dev, - "pre_req throttle. inflight %d throttle %d", - hpb->num_inflight_pre_req, hpb->throttle_pre_req); - return NULL; - } - - pre_req = list_first_entry_or_null(&hpb->lh_pre_req_free, - struct ufshpb_req, list_req); - if (!pre_req) { - dev_info(&hpb->sdev_ufs_lu->sdev_dev, "There is no pre_req"); - return NULL; - } - - list_del_init(&pre_req->list_req); - hpb->num_inflight_pre_req++; - - return pre_req; -} - -static inline void ufshpb_put_pre_req(struct ufshpb_lu *hpb, - struct ufshpb_req *pre_req) -{ - pre_req->req = NULL; - bio_reset(pre_req->bio); - list_add_tail(&pre_req->list_req, &hpb->lh_pre_req_free); - hpb->num_inflight_pre_req--; -} - -static void ufshpb_pre_req_compl_fn(struct request *req, blk_status_t error) -{ - struct ufshpb_req *pre_req = (struct ufshpb_req *)req->end_io_data; - struct ufshpb_lu *hpb = pre_req->hpb; - unsigned long flags; - - if (error) { - struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req); - struct scsi_sense_hdr sshdr; - - dev_err(&hpb->sdev_ufs_lu->sdev_dev, "block status %d", error); - scsi_command_normalize_sense(cmd, &sshdr); - dev_err(&hpb->sdev_ufs_lu->sdev_dev, - "code %x sense_key %x asc %x ascq %x", - sshdr.response_code, - sshdr.sense_key, sshdr.asc, sshdr.ascq); - dev_err(&hpb->sdev_ufs_lu->sdev_dev, - "byte4 %x byte5 %x byte6 %x additional_len %x", - sshdr.byte4, sshdr.byte5, - sshdr.byte6, sshdr.additional_length); - } - - blk_mq_free_request(req); - spin_lock_irqsave(&hpb->rgn_state_lock, flags); - ufshpb_put_pre_req(pre_req->hpb, pre_req); - spin_unlock_irqrestore(&hpb->rgn_state_lock, flags); -} - -static int ufshpb_prep_entry(struct ufshpb_req *pre_req, struct page *page) -{ - struct ufshpb_lu *hpb = pre_req->hpb; - struct ufshpb_region *rgn; - struct ufshpb_subregion *srgn; - __be64 *addr; - int offset = 0; - int copied; - unsigned long lpn = pre_req->wb.lpn; - int rgn_idx, srgn_idx, srgn_offset; - unsigned long flags; - - addr = page_address(page); - ufshpb_get_pos_from_lpn(hpb, lpn, &rgn_idx, &srgn_idx, &srgn_offset); - - spin_lock_irqsave(&hpb->rgn_state_lock, flags); - -next_offset: - rgn = hpb->rgn_tbl + rgn_idx; - srgn = rgn->srgn_tbl + srgn_idx; - - if (!ufshpb_is_valid_srgn(rgn, srgn)) - goto mctx_error; - - if (!srgn->mctx) - goto mctx_error; - - copied = ufshpb_fill_ppn_from_page(hpb, srgn->mctx, srgn_offset, - pre_req->wb.len - offset, - &addr[offset]); - - if (copied < 0) - goto mctx_error; - - offset += copied; - srgn_offset += copied; - - if (srgn_offset == hpb->entries_per_srgn) { - srgn_offset = 0; - - if (++srgn_idx == hpb->srgns_per_rgn) { - srgn_idx = 0; - rgn_idx++; - } - } - - if (offset < pre_req->wb.len) - goto next_offset; - - spin_unlock_irqrestore(&hpb->rgn_state_lock, flags); - return 0; -mctx_error: - spin_unlock_irqrestore(&hpb->rgn_state_lock, flags); - return -ENOMEM; -} - -static int ufshpb_pre_req_add_bio_page(struct ufshpb_lu *hpb, - struct request_queue *q, - struct ufshpb_req *pre_req) -{ - struct page *page = pre_req->wb.m_page; - struct bio *bio = pre_req->bio; - int entries_bytes, ret; - - if (!page) - return -ENOMEM; - - if (ufshpb_prep_entry(pre_req, page)) - return -ENOMEM; - - entries_bytes = pre_req->wb.len * sizeof(__be64); - - ret = bio_add_pc_page(q, bio, page, entries_bytes, 0); - if (ret != entries_bytes) { - dev_err(&hpb->sdev_ufs_lu->sdev_dev, - "bio_add_pc_page fail: %d", ret); - return -ENOMEM; - } - return 0; -} - -static inline int ufshpb_get_read_id(struct ufshpb_lu *hpb) -{ - if (++hpb->cur_read_id >= MAX_HPB_READ_ID) - hpb->cur_read_id = 1; - return hpb->cur_read_id; -} - -static int ufshpb_execute_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd, - struct ufshpb_req *pre_req, int read_id) -{ - struct scsi_device *sdev = cmd->device; - struct request_queue *q = sdev->request_queue; - struct request *req; - struct scsi_request *rq; - struct bio *bio = pre_req->bio; - - pre_req->hpb = hpb; - pre_req->wb.lpn = sectors_to_logical(cmd->device, - blk_rq_pos(scsi_cmd_to_rq(cmd))); - pre_req->wb.len = sectors_to_logical(cmd->device, - blk_rq_sectors(scsi_cmd_to_rq(cmd))); - if (ufshpb_pre_req_add_bio_page(hpb, q, pre_req)) - return -ENOMEM; - - req = pre_req->req; - - /* 1. request setup */ - blk_rq_append_bio(req, bio); - req->rq_disk = NULL; - req->end_io_data = (void *)pre_req; - req->end_io = ufshpb_pre_req_compl_fn; - - /* 2. scsi_request setup */ - rq = scsi_req(req); - rq->retries = 1; - - ufshpb_set_write_buf_cmd(rq->cmd, pre_req->wb.lpn, pre_req->wb.len, - read_id); - rq->cmd_len = scsi_command_size(rq->cmd); - - if (blk_insert_cloned_request(q, req) != BLK_STS_OK) - return -EAGAIN; - - hpb->stats.pre_req_cnt++; - - return 0; -} - -static int ufshpb_issue_pre_req(struct ufshpb_lu *hpb, struct scsi_cmnd *cmd, - int *read_id) -{ - struct ufshpb_req *pre_req; - struct request *req = NULL; - unsigned long flags; - int _read_id; - int ret = 0; - - req = blk_get_request(cmd->device->request_queue, - REQ_OP_DRV_OUT | REQ_SYNC, BLK_MQ_REQ_NOWAIT); - if (IS_ERR(req)) - return -EAGAIN; - - spin_lock_irqsave(&hpb->rgn_state_lock, flags); - pre_req = ufshpb_get_pre_req(hpb); - if (!pre_req) { - ret = -EAGAIN; - goto unlock_out; - } - _read_id = ufshpb_get_read_id(hpb); - spin_unlock_irqrestore(&hpb->rgn_state_lock, flags); - - pre_req->req = req; - - ret = ufshpb_execute_pre_req(hpb, cmd, pre_req, _read_id); - if (ret) - goto free_pre_req; - - *read_id = _read_id; - - return ret; -free_pre_req: - spin_lock_irqsave(&hpb->rgn_state_lock, flags); - ufshpb_put_pre_req(hpb, pre_req); -unlock_out: - spin_unlock_irqrestore(&hpb->rgn_state_lock, flags); - blk_put_request(req); - return ret; -} - /* * This function will set up HPB read command using host-side L2P map data. */ @@ -609,7 +354,6 @@ int ufshpb_prep(struct ufs_hba *hba, struct ufshcd_lrb *lrbp) __be64 ppn; unsigned long flags; int transfer_len, rgn_idx, srgn_idx, srgn_offset; - int read_id = 0; int err = 0; hpb = ufshpb_get_hpb_data(cmd->device); @@ -685,24 +429,8 @@ int ufshpb_prep(struct ufs_hba *hba, struct ufshcd_lrb *lrbp) dev_err(hba->dev, "get ppn failed. err %d\n", err); return err; } - if (!ufshpb_is_legacy(hba) && - ufshpb_is_required_wb(hpb, transfer_len)) { - err = ufshpb_issue_pre_req(hpb, cmd, &read_id); - if (err) { - unsigned long timeout; - - timeout = cmd->jiffies_at_alloc + msecs_to_jiffies( - hpb->params.requeue_timeout_ms); - - if (time_before(jiffies, timeout)) - return -EAGAIN; - - hpb->stats.miss_cnt++; - return 0; - } - } - ufshpb_set_hpb_read_to_upiu(hba, lrbp, ppn, transfer_len, read_id); + ufshpb_set_hpb_read_to_upiu(hba, lrbp, ppn, transfer_len); hpb->stats.hit_cnt++; return 0; @@ -1841,16 +1569,11 @@ static void ufshpb_lu_parameter_init(struct ufs_hba *hba, u32 entries_per_rgn; u64 rgn_mem_size, tmp; - /* for pre_req */ - hpb->pre_req_min_tr_len = hpb_dev_info->max_hpb_single_cmd + 1; - if (ufshpb_is_legacy(hba)) hpb->pre_req_max_tr_len = HPB_LEGACY_CHUNK_HIGH; else hpb->pre_req_max_tr_len = HPB_MULTI_CHUNK_HIGH; - hpb->cur_read_id = 0; - hpb->lu_pinned_start = hpb_lu_info->pinned_start; hpb->lu_pinned_end = hpb_lu_info->num_pinned ? (hpb_lu_info->pinned_start + hpb_lu_info->num_pinned - 1) diff --git a/drivers/scsi/ufs/ufshpb.h b/drivers/scsi/ufs/ufshpb.h index a79e07398970..f15d8fdbce2e 100644 --- a/drivers/scsi/ufs/ufshpb.h +++ b/drivers/scsi/ufs/ufshpb.h @@ -241,8 +241,6 @@ struct ufshpb_lu { spinlock_t param_lock; struct list_head lh_pre_req_free; - int cur_read_id; - int pre_req_min_tr_len; int pre_req_max_tr_len; /* cached L2P map management worker */ diff --git a/drivers/spi/spi-altera-dfl.c b/drivers/spi/spi-altera-dfl.c index 44fc9ee13fc7..ca40923258af 100644 --- a/drivers/spi/spi-altera-dfl.c +++ b/drivers/spi/spi-altera-dfl.c @@ -134,7 +134,7 @@ static int dfl_spi_altera_probe(struct dfl_device *dfl_dev) if (!master) return -ENOMEM; - master->bus_num = dfl_dev->id; + master->bus_num = -1; hw = spi_master_get_devdata(master); diff --git a/drivers/spi/spi-altera-platform.c b/drivers/spi/spi-altera-platform.c index f7a7c14e3679..65147aae82a1 100644 --- a/drivers/spi/spi-altera-platform.c +++ b/drivers/spi/spi-altera-platform.c @@ -48,7 +48,7 @@ static int altera_spi_probe(struct platform_device *pdev) return err; /* setup the master state. */ - master->bus_num = pdev->id; + master->bus_num = -1; if (pdata) { if (pdata->num_chipselect > ALTERA_SPI_MAX_CS) { diff --git a/drivers/spi/spi-pl022.c b/drivers/spi/spi-pl022.c index feebda66f56e..e4484ace584e 100644 --- a/drivers/spi/spi-pl022.c +++ b/drivers/spi/spi-pl022.c @@ -1716,12 +1716,13 @@ static int verify_controller_parameters(struct pl022 *pl022, return -EINVAL; } } else { - if (chip_info->duplex != SSP_MICROWIRE_CHANNEL_FULL_DUPLEX) + if (chip_info->duplex != SSP_MICROWIRE_CHANNEL_FULL_DUPLEX) { dev_err(&pl022->adev->dev, "Microwire half duplex mode requested," " but this is only available in the" " ST version of PL022\n"); - return -EINVAL; + return -EINVAL; + } } } return 0; diff --git a/drivers/spi/spi-tegra20-slink.c b/drivers/spi/spi-tegra20-slink.c index 713292b0c71e..3226c4e1c7c0 100644 --- a/drivers/spi/spi-tegra20-slink.c +++ b/drivers/spi/spi-tegra20-slink.c @@ -1194,7 +1194,7 @@ static int __maybe_unused tegra_slink_runtime_suspend(struct device *dev) return 0; } -static int tegra_slink_runtime_resume(struct device *dev) +static int __maybe_unused tegra_slink_runtime_resume(struct device *dev) { struct spi_master *master = dev_get_drvdata(dev); struct tegra_slink_data *tspi = spi_master_get_devdata(master); diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c index 26e3d90d1e7c..841667a896dd 100644 --- a/drivers/vdpa/vdpa_user/vduse_dev.c +++ b/drivers/vdpa/vdpa_user/vduse_dev.c @@ -80,6 +80,7 @@ struct vduse_dev { struct vdpa_callback config_cb; struct work_struct inject; spinlock_t irq_lock; + struct rw_semaphore rwsem; int minor; bool broken; bool connected; @@ -410,6 +411,8 @@ static void vduse_dev_reset(struct vduse_dev *dev) if (domain->bounce_map) vduse_domain_reset_bounce_map(domain); + down_write(&dev->rwsem); + dev->status = 0; dev->driver_features = 0; dev->generation++; @@ -443,6 +446,8 @@ static void vduse_dev_reset(struct vduse_dev *dev) flush_work(&vq->inject); flush_work(&vq->kick); } + + up_write(&dev->rwsem); } static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx, @@ -885,6 +890,23 @@ static void vduse_vq_irq_inject(struct work_struct *work) spin_unlock_irq(&vq->irq_lock); } +static int vduse_dev_queue_irq_work(struct vduse_dev *dev, + struct work_struct *irq_work) +{ + int ret = -EINVAL; + + down_read(&dev->rwsem); + if (!(dev->status & VIRTIO_CONFIG_S_DRIVER_OK)) + goto unlock; + + ret = 0; + queue_work(vduse_irq_wq, irq_work); +unlock: + up_read(&dev->rwsem); + + return ret; +} + static long vduse_dev_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -966,8 +988,7 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd, break; } case VDUSE_DEV_INJECT_CONFIG_IRQ: - ret = 0; - queue_work(vduse_irq_wq, &dev->inject); + ret = vduse_dev_queue_irq_work(dev, &dev->inject); break; case VDUSE_VQ_SETUP: { struct vduse_vq_config config; @@ -1053,9 +1074,8 @@ static long vduse_dev_ioctl(struct file *file, unsigned int cmd, if (index >= dev->vq_num) break; - ret = 0; index = array_index_nospec(index, dev->vq_num); - queue_work(vduse_irq_wq, &dev->vqs[index].inject); + ret = vduse_dev_queue_irq_work(dev, &dev->vqs[index].inject); break; } default: @@ -1136,6 +1156,7 @@ static struct vduse_dev *vduse_dev_create(void) INIT_LIST_HEAD(&dev->send_list); INIT_LIST_HEAD(&dev->recv_list); spin_lock_init(&dev->irq_lock); + init_rwsem(&dev->rwsem); INIT_WORK(&dev->inject, vduse_dev_irq_inject); init_waitqueue_head(&dev->waitq); diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index dd95dfd85e98..3035bb6f5458 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -576,7 +576,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, /* Last one doesn't continue. */ desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT); if (!indirect && vq->use_dma_api) - vq->split.desc_extra[prev & (vq->split.vring.num - 1)].flags = + vq->split.desc_extra[prev & (vq->split.vring.num - 1)].flags &= ~VRING_DESC_F_NEXT; if (indirect) { diff --git a/drivers/watchdog/iTCO_wdt.c b/drivers/watchdog/iTCO_wdt.c index 643c6c2d0b72..ced2fc0deb8c 100644 --- a/drivers/watchdog/iTCO_wdt.c +++ b/drivers/watchdog/iTCO_wdt.c @@ -71,8 +71,6 @@ #define TCOBASE(p) ((p)->tco_res->start) /* SMI Control and Enable Register */ #define SMI_EN(p) ((p)->smi_res->start) -#define TCO_EN (1 << 13) -#define GBL_SMI_EN (1 << 0) #define TCO_RLD(p) (TCOBASE(p) + 0x00) /* TCO Timer Reload/Curr. Value */ #define TCOv1_TMR(p) (TCOBASE(p) + 0x01) /* TCOv1 Timer Initial Value*/ @@ -357,12 +355,8 @@ static int iTCO_wdt_set_timeout(struct watchdog_device *wd_dev, unsigned int t) tmrval = seconds_to_ticks(p, t); - /* - * If TCO SMIs are off, the timer counts down twice before rebooting. - * Otherwise, the BIOS generally reboots when the SMI triggers. - */ - if (p->smi_res && - (inl(SMI_EN(p)) & (TCO_EN | GBL_SMI_EN)) != (TCO_EN | GBL_SMI_EN)) + /* For TCO v1 the timer counts down twice before rebooting */ + if (p->iTCO_version == 1) tmrval /= 2; /* from the specs: */ @@ -527,7 +521,7 @@ static int iTCO_wdt_probe(struct platform_device *pdev) * Disables TCO logic generating an SMI# */ val32 = inl(SMI_EN(p)); - val32 &= ~TCO_EN; /* Turn off SMI clearing watchdog */ + val32 &= 0xffffdfff; /* Turn off SMI clearing watchdog */ outl(val32, SMI_EN(p)); } diff --git a/drivers/watchdog/ixp4xx_wdt.c b/drivers/watchdog/ixp4xx_wdt.c index 2693ffb24ac7..31b03fa71341 100644 --- a/drivers/watchdog/ixp4xx_wdt.c +++ b/drivers/watchdog/ixp4xx_wdt.c @@ -119,7 +119,7 @@ static int ixp4xx_wdt_probe(struct platform_device *pdev) iwdt = devm_kzalloc(dev, sizeof(*iwdt), GFP_KERNEL); if (!iwdt) return -ENOMEM; - iwdt->base = dev->platform_data; + iwdt->base = (void __iomem *)dev->platform_data; /* * Retrieve rate from a fixed clock from the device tree if diff --git a/drivers/watchdog/omap_wdt.c b/drivers/watchdog/omap_wdt.c index 1616f93dfad7..74d785b2b478 100644 --- a/drivers/watchdog/omap_wdt.c +++ b/drivers/watchdog/omap_wdt.c @@ -268,8 +268,12 @@ static int omap_wdt_probe(struct platform_device *pdev) wdev->wdog.bootstatus = WDIOF_CARDRESET; } - if (!early_enable) + if (early_enable) { + omap_wdt_start(&wdev->wdog); + set_bit(WDOG_HW_RUNNING, &wdev->wdog.status); + } else { omap_wdt_disable(wdev); + } ret = watchdog_register_device(&wdev->wdog); if (ret) { diff --git a/drivers/watchdog/sbsa_gwdt.c b/drivers/watchdog/sbsa_gwdt.c index ee9ff38929eb..9791c74aebd4 100644 --- a/drivers/watchdog/sbsa_gwdt.c +++ b/drivers/watchdog/sbsa_gwdt.c @@ -130,7 +130,7 @@ static u64 sbsa_gwdt_reg_read(struct sbsa_gwdt *gwdt) if (gwdt->version == 0) return readl(gwdt->control_base + SBSA_GWDT_WOR); else - return readq(gwdt->control_base + SBSA_GWDT_WOR); + return lo_hi_readq(gwdt->control_base + SBSA_GWDT_WOR); } static void sbsa_gwdt_reg_write(u64 val, struct sbsa_gwdt *gwdt) @@ -138,7 +138,7 @@ static void sbsa_gwdt_reg_write(u64 val, struct sbsa_gwdt *gwdt) if (gwdt->version == 0) writel((u32)val, gwdt->control_base + SBSA_GWDT_WOR); else - writeq(val, gwdt->control_base + SBSA_GWDT_WOR); + lo_hi_writeq(val, gwdt->control_base + SBSA_GWDT_WOR); } /* @@ -411,4 +411,3 @@ MODULE_AUTHOR("Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>"); MODULE_AUTHOR("Al Stone <al.stone@linaro.org>"); MODULE_AUTHOR("Timur Tabi <timur@codeaurora.org>"); MODULE_LICENSE("GPL v2"); -MODULE_ALIAS("platform:" DRV_NAME); diff --git a/fs/afs/write.c b/fs/afs/write.c index f24370f5c774..8b1d9c2f6bec 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -861,7 +861,8 @@ int afs_fsync(struct file *file, loff_t start, loff_t end, int datasync) */ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf) { - struct page *page = thp_head(vmf->page); + struct folio *folio = page_folio(vmf->page); + struct page *page = &folio->page; struct file *file = vmf->vma->vm_file; struct inode *inode = file_inode(file); struct afs_vnode *vnode = AFS_FS_I(inode); @@ -884,7 +885,7 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf) goto out; #endif - if (wait_on_page_writeback_killable(page)) + if (folio_wait_writeback_killable(folio)) goto out; if (lock_page_killable(page) < 0) @@ -894,8 +895,8 @@ vm_fault_t afs_page_mkwrite(struct vm_fault *vmf) * details the portion of the page we need to write back and we might * need to redirty the page if there's a problem. */ - if (wait_on_page_writeback_killable(page) < 0) { - unlock_page(page); + if (folio_wait_writeback_killable(folio) < 0) { + folio_unlock(folio); goto out; } diff --git a/fs/autofs/waitq.c b/fs/autofs/waitq.c index 16b5fca0626e..54c1f8b8b075 100644 --- a/fs/autofs/waitq.c +++ b/fs/autofs/waitq.c @@ -358,7 +358,7 @@ int autofs_wait(struct autofs_sb_info *sbi, qstr.len = strlen(p); offset = p - name; } - qstr.hash = full_name_hash(dentry, name, qstr.len); + qstr.hash = full_name_hash(dentry, qstr.name, qstr.len); if (mutex_lock_interruptible(&sbi->wq_mutex)) { kfree(name); diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index ddc4f5436cc9..6c7eb80220ca 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -173,9 +173,10 @@ static int check_compressed_csum(struct btrfs_inode *inode, struct bio *bio, /* Hash through the page sector by sector */ for (pg_offset = 0; pg_offset < bytes_left; pg_offset += sectorsize) { - kaddr = page_address(page); + kaddr = kmap_atomic(page); crypto_shash_digest(shash, kaddr + pg_offset, sectorsize, csum); + kunmap_atomic(kaddr); if (memcmp(&csum, cb_sum, csum_size) != 0) { btrfs_print_data_csum_error(inode, disk_start, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 04090ba0ef73..954b53a90f04 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -288,8 +288,9 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, cur_size = min_t(unsigned long, compressed_size, PAGE_SIZE); - kaddr = page_address(cpage); + kaddr = kmap_atomic(cpage); write_extent_buffer(leaf, kaddr, ptr, cur_size); + kunmap_atomic(kaddr); i++; ptr += cur_size; diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c index c25dfd1a8a54..3dbe6eb5fda7 100644 --- a/fs/btrfs/lzo.c +++ b/fs/btrfs/lzo.c @@ -141,7 +141,7 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping, *total_in = 0; in_page = find_get_page(mapping, start >> PAGE_SHIFT); - data_in = page_address(in_page); + data_in = kmap(in_page); /* * store the size of all chunks of compressed data in @@ -152,7 +152,7 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping, ret = -ENOMEM; goto out; } - cpage_out = page_address(out_page); + cpage_out = kmap(out_page); out_offset = LZO_LEN; tot_out = LZO_LEN; pages[0] = out_page; @@ -210,6 +210,7 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping, if (out_len == 0 && tot_in >= len) break; + kunmap(out_page); if (nr_pages == nr_dest_pages) { out_page = NULL; ret = -E2BIG; @@ -221,7 +222,7 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping, ret = -ENOMEM; goto out; } - cpage_out = page_address(out_page); + cpage_out = kmap(out_page); pages[nr_pages++] = out_page; pg_bytes_left = PAGE_SIZE; @@ -243,11 +244,12 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping, break; bytes_left = len - tot_in; + kunmap(in_page); put_page(in_page); start += PAGE_SIZE; in_page = find_get_page(mapping, start >> PAGE_SHIFT); - data_in = page_address(in_page); + data_in = kmap(in_page); in_len = min(bytes_left, PAGE_SIZE); } @@ -257,17 +259,22 @@ int lzo_compress_pages(struct list_head *ws, struct address_space *mapping, } /* store the size of all chunks of compressed data */ - sizes_ptr = page_address(pages[0]); + sizes_ptr = kmap_local_page(pages[0]); write_compress_length(sizes_ptr, tot_out); + kunmap_local(sizes_ptr); ret = 0; *total_out = tot_out; *total_in = tot_in; out: *out_pages = nr_pages; + if (out_page) + kunmap(out_page); - if (in_page) + if (in_page) { + kunmap(in_page); put_page(in_page); + } return ret; } @@ -283,6 +290,7 @@ static void copy_compressed_segment(struct compressed_bio *cb, u32 orig_in = *cur_in; while (*cur_in < orig_in + len) { + char *kaddr; struct page *cur_page; u32 copy_len = min_t(u32, PAGE_SIZE - offset_in_page(*cur_in), orig_in + len - *cur_in); @@ -290,9 +298,11 @@ static void copy_compressed_segment(struct compressed_bio *cb, ASSERT(copy_len); cur_page = cb->compressed_pages[*cur_in / PAGE_SIZE]; + kaddr = kmap(cur_page); memcpy(dest + *cur_in - orig_in, - page_address(cur_page) + offset_in_page(*cur_in), + kaddr + offset_in_page(*cur_in), copy_len); + kunmap(cur_page); *cur_in += copy_len; } @@ -303,6 +313,7 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb) struct workspace *workspace = list_entry(ws, struct workspace, list); const struct btrfs_fs_info *fs_info = btrfs_sb(cb->inode->i_sb); const u32 sectorsize = fs_info->sectorsize; + char *kaddr; int ret; /* Compressed data length, can be unaligned */ u32 len_in; @@ -311,7 +322,9 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb) /* Bytes decompressed so far */ u32 cur_out = 0; - len_in = read_compress_length(page_address(cb->compressed_pages[0])); + kaddr = kmap(cb->compressed_pages[0]); + len_in = read_compress_length(kaddr); + kunmap(cb->compressed_pages[0]); cur_in += LZO_LEN; /* @@ -344,9 +357,9 @@ int lzo_decompress_bio(struct list_head *ws, struct compressed_bio *cb) ASSERT(cur_in / sectorsize == (cur_in + LZO_LEN - 1) / sectorsize); cur_page = cb->compressed_pages[cur_in / PAGE_SIZE]; + kaddr = kmap(cur_page); ASSERT(cur_page); - seg_len = read_compress_length(page_address(cur_page) + - offset_in_page(cur_in)); + seg_len = read_compress_length(kaddr + offset_in_page(cur_in)); cur_in += LZO_LEN; /* Copy the compressed segment payload into workspace */ @@ -431,7 +444,7 @@ int lzo_decompress(struct list_head *ws, unsigned char *data_in, destlen = min_t(unsigned long, destlen, PAGE_SIZE); bytes = min_t(unsigned long, destlen, out_len - start_byte); - kaddr = page_address(dest_page); + kaddr = kmap_local_page(dest_page); memcpy(kaddr, workspace->buf + start_byte, bytes); /* @@ -441,6 +454,7 @@ int lzo_decompress(struct list_head *ws, unsigned char *data_in, */ if (bytes < destlen) memset(kaddr+bytes, 0, destlen-bytes); + kunmap_local(kaddr); out: return ret; } diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c index 8afa90074891..767a0c6c9694 100644 --- a/fs/btrfs/zlib.c +++ b/fs/btrfs/zlib.c @@ -126,7 +126,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, ret = -ENOMEM; goto out; } - cpage_out = page_address(out_page); + cpage_out = kmap(out_page); pages[0] = out_page; nr_pages = 1; @@ -148,22 +148,26 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, int i; for (i = 0; i < in_buf_pages; i++) { - if (in_page) + if (in_page) { + kunmap(in_page); put_page(in_page); + } in_page = find_get_page(mapping, start >> PAGE_SHIFT); - data_in = page_address(in_page); + data_in = kmap(in_page); memcpy(workspace->buf + i * PAGE_SIZE, data_in, PAGE_SIZE); start += PAGE_SIZE; } workspace->strm.next_in = workspace->buf; } else { - if (in_page) + if (in_page) { + kunmap(in_page); put_page(in_page); + } in_page = find_get_page(mapping, start >> PAGE_SHIFT); - data_in = page_address(in_page); + data_in = kmap(in_page); start += PAGE_SIZE; workspace->strm.next_in = data_in; } @@ -192,6 +196,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, * the stream end if required */ if (workspace->strm.avail_out == 0) { + kunmap(out_page); if (nr_pages == nr_dest_pages) { out_page = NULL; ret = -E2BIG; @@ -202,7 +207,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, ret = -ENOMEM; goto out; } - cpage_out = page_address(out_page); + cpage_out = kmap(out_page); pages[nr_pages] = out_page; nr_pages++; workspace->strm.avail_out = PAGE_SIZE; @@ -229,6 +234,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, goto out; } else if (workspace->strm.avail_out == 0) { /* get another page for the stream end */ + kunmap(out_page); if (nr_pages == nr_dest_pages) { out_page = NULL; ret = -E2BIG; @@ -239,7 +245,7 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, ret = -ENOMEM; goto out; } - cpage_out = page_address(out_page); + cpage_out = kmap(out_page); pages[nr_pages] = out_page; nr_pages++; workspace->strm.avail_out = PAGE_SIZE; @@ -258,8 +264,13 @@ int zlib_compress_pages(struct list_head *ws, struct address_space *mapping, *total_in = workspace->strm.total_in; out: *out_pages = nr_pages; - if (in_page) + if (out_page) + kunmap(out_page); + + if (in_page) { + kunmap(in_page); put_page(in_page); + } return ret; } @@ -276,7 +287,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb) unsigned long buf_start; struct page **pages_in = cb->compressed_pages; - data_in = page_address(pages_in[page_in_index]); + data_in = kmap(pages_in[page_in_index]); workspace->strm.next_in = data_in; workspace->strm.avail_in = min_t(size_t, srclen, PAGE_SIZE); workspace->strm.total_in = 0; @@ -298,6 +309,7 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb) if (Z_OK != zlib_inflateInit2(&workspace->strm, wbits)) { pr_warn("BTRFS: inflateInit failed\n"); + kunmap(pages_in[page_in_index]); return -EIO; } while (workspace->strm.total_in < srclen) { @@ -324,13 +336,13 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb) if (workspace->strm.avail_in == 0) { unsigned long tmp; - + kunmap(pages_in[page_in_index]); page_in_index++; if (page_in_index >= total_pages_in) { data_in = NULL; break; } - data_in = page_address(pages_in[page_in_index]); + data_in = kmap(pages_in[page_in_index]); workspace->strm.next_in = data_in; tmp = srclen - workspace->strm.total_in; workspace->strm.avail_in = min(tmp, PAGE_SIZE); @@ -342,6 +354,8 @@ int zlib_decompress_bio(struct list_head *ws, struct compressed_bio *cb) ret = 0; done: zlib_inflateEnd(&workspace->strm); + if (data_in) + kunmap(pages_in[page_in_index]); if (!ret) zero_fill_bio(cb->orig_bio); return ret; diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c index 56dce9f00988..f06b68040352 100644 --- a/fs/btrfs/zstd.c +++ b/fs/btrfs/zstd.c @@ -399,7 +399,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, /* map in the first page of input data */ in_page = find_get_page(mapping, start >> PAGE_SHIFT); - workspace->in_buf.src = page_address(in_page); + workspace->in_buf.src = kmap(in_page); workspace->in_buf.pos = 0; workspace->in_buf.size = min_t(size_t, len, PAGE_SIZE); @@ -411,7 +411,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, goto out; } pages[nr_pages++] = out_page; - workspace->out_buf.dst = page_address(out_page); + workspace->out_buf.dst = kmap(out_page); workspace->out_buf.pos = 0; workspace->out_buf.size = min_t(size_t, max_out, PAGE_SIZE); @@ -446,6 +446,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, if (workspace->out_buf.pos == workspace->out_buf.size) { tot_out += PAGE_SIZE; max_out -= PAGE_SIZE; + kunmap(out_page); if (nr_pages == nr_dest_pages) { out_page = NULL; ret = -E2BIG; @@ -457,7 +458,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, goto out; } pages[nr_pages++] = out_page; - workspace->out_buf.dst = page_address(out_page); + workspace->out_buf.dst = kmap(out_page); workspace->out_buf.pos = 0; workspace->out_buf.size = min_t(size_t, max_out, PAGE_SIZE); @@ -472,12 +473,13 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, /* Check if we need more input */ if (workspace->in_buf.pos == workspace->in_buf.size) { tot_in += PAGE_SIZE; + kunmap(in_page); put_page(in_page); start += PAGE_SIZE; len -= PAGE_SIZE; in_page = find_get_page(mapping, start >> PAGE_SHIFT); - workspace->in_buf.src = page_address(in_page); + workspace->in_buf.src = kmap(in_page); workspace->in_buf.pos = 0; workspace->in_buf.size = min_t(size_t, len, PAGE_SIZE); } @@ -504,6 +506,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, tot_out += PAGE_SIZE; max_out -= PAGE_SIZE; + kunmap(out_page); if (nr_pages == nr_dest_pages) { out_page = NULL; ret = -E2BIG; @@ -515,7 +518,7 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, goto out; } pages[nr_pages++] = out_page; - workspace->out_buf.dst = page_address(out_page); + workspace->out_buf.dst = kmap(out_page); workspace->out_buf.pos = 0; workspace->out_buf.size = min_t(size_t, max_out, PAGE_SIZE); } @@ -531,8 +534,12 @@ int zstd_compress_pages(struct list_head *ws, struct address_space *mapping, out: *out_pages = nr_pages; /* Cleanup */ - if (in_page) + if (in_page) { + kunmap(in_page); put_page(in_page); + } + if (out_page) + kunmap(out_page); return ret; } @@ -556,7 +563,7 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb) goto done; } - workspace->in_buf.src = page_address(pages_in[page_in_index]); + workspace->in_buf.src = kmap(pages_in[page_in_index]); workspace->in_buf.pos = 0; workspace->in_buf.size = min_t(size_t, srclen, PAGE_SIZE); @@ -592,14 +599,14 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb) break; if (workspace->in_buf.pos == workspace->in_buf.size) { - page_in_index++; + kunmap(pages_in[page_in_index++]); if (page_in_index >= total_pages_in) { workspace->in_buf.src = NULL; ret = -EIO; goto done; } srclen -= PAGE_SIZE; - workspace->in_buf.src = page_address(pages_in[page_in_index]); + workspace->in_buf.src = kmap(pages_in[page_in_index]); workspace->in_buf.pos = 0; workspace->in_buf.size = min_t(size_t, srclen, PAGE_SIZE); } @@ -607,6 +614,8 @@ int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb) ret = 0; zero_fill_bio(cb->orig_bio); done: + if (workspace->in_buf.src) + kunmap(pages_in[page_in_index]); return ret; } diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c index 8ffc40e84a59..fcf4f3b72923 100644 --- a/fs/cachefiles/rdwr.c +++ b/fs/cachefiles/rdwr.c @@ -25,20 +25,20 @@ static int cachefiles_read_waiter(wait_queue_entry_t *wait, unsigned mode, struct cachefiles_object *object; struct fscache_retrieval *op = monitor->op; struct wait_page_key *key = _key; - struct page *page = wait->private; + struct folio *folio = wait->private; ASSERT(key); _enter("{%lu},%u,%d,{%p,%u}", monitor->netfs_page->index, mode, sync, - key->page, key->bit_nr); + key->folio, key->bit_nr); - if (key->page != page || key->bit_nr != PG_locked) + if (key->folio != folio || key->bit_nr != PG_locked) return 0; - _debug("--- monitor %p %lx ---", page, page->flags); + _debug("--- monitor %p %lx ---", folio, folio->flags); - if (!PageUptodate(page) && !PageError(page)) { + if (!folio_test_uptodate(folio) && !folio_test_error(folio)) { /* unlocked, not uptodate and not erronous? */ _debug("page probably truncated"); } @@ -107,7 +107,7 @@ static int cachefiles_read_reissue(struct cachefiles_object *object, put_page(backpage2); INIT_LIST_HEAD(&monitor->op_link); - add_page_wait_queue(backpage, &monitor->monitor); + folio_add_wait_queue(page_folio(backpage), &monitor->monitor); if (trylock_page(backpage)) { ret = -EIO; @@ -294,7 +294,7 @@ monitor_backing_page: get_page(backpage); monitor->back_page = backpage; monitor->monitor.private = backpage; - add_page_wait_queue(backpage, &monitor->monitor); + folio_add_wait_queue(page_folio(backpage), &monitor->monitor); monitor = NULL; /* but the page may have been read before the monitor was installed, so @@ -548,7 +548,7 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object, get_page(backpage); monitor->back_page = backpage; monitor->monitor.private = backpage; - add_page_wait_queue(backpage, &monitor->monitor); + folio_add_wait_queue(page_folio(backpage), &monitor->monitor); monitor = NULL; /* but the page may have been read before the monitor was diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 3e42d0466521..8f537f1d9d1d 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2330,7 +2330,6 @@ retry: int ceph_fsync(struct file *file, loff_t start, loff_t end, int datasync) { - struct ceph_file_info *fi = file->private_data; struct inode *inode = file->f_mapping->host; struct ceph_inode_info *ci = ceph_inode(inode); u64 flush_tid; @@ -2365,14 +2364,9 @@ int ceph_fsync(struct file *file, loff_t start, loff_t end, int datasync) if (err < 0) ret = err; - if (errseq_check(&ci->i_meta_err, READ_ONCE(fi->meta_err))) { - spin_lock(&file->f_lock); - err = errseq_check_and_advance(&ci->i_meta_err, - &fi->meta_err); - spin_unlock(&file->f_lock); - if (err < 0) - ret = err; - } + err = file_check_and_advance_wb_err(file); + if (err < 0) + ret = err; out: dout("fsync %p%s result=%d\n", inode, datasync ? " datasync" : "", ret); return ret; diff --git a/fs/ceph/file.c b/fs/ceph/file.c index d16fd2d5fd42..e61018d9764e 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -233,7 +233,6 @@ static int ceph_init_file_info(struct inode *inode, struct file *file, spin_lock_init(&fi->rw_contexts_lock); INIT_LIST_HEAD(&fi->rw_contexts); - fi->meta_err = errseq_sample(&ci->i_meta_err); fi->filp_gen = READ_ONCE(ceph_inode_to_client(inode)->filp_gen); return 0; diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 2df1e1284451..1c7574105478 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -541,8 +541,6 @@ struct inode *ceph_alloc_inode(struct super_block *sb) ceph_fscache_inode_init(ci); - ci->i_meta_err = 0; - return &ci->vfs_inode; } diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c index bdeb271f47d9..d8c31069fbf2 100644 --- a/fs/ceph/locks.c +++ b/fs/ceph/locks.c @@ -302,9 +302,6 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl) if (!(fl->fl_flags & FL_FLOCK)) return -ENOLCK; - /* No mandatory locks */ - if (fl->fl_type & LOCK_MAND) - return -EOPNOTSUPP; dout("ceph_flock, fl_file: %p\n", fl->fl_file); diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 7cad180d6deb..d64413adc0fd 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -1493,7 +1493,6 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc, { struct ceph_mds_request *req; struct rb_node *p; - struct ceph_inode_info *ci; dout("cleanup_session_requests mds%d\n", session->s_mds); mutex_lock(&mdsc->mutex); @@ -1502,16 +1501,10 @@ static void cleanup_session_requests(struct ceph_mds_client *mdsc, struct ceph_mds_request, r_unsafe_item); pr_warn_ratelimited(" dropping unsafe request %llu\n", req->r_tid); - if (req->r_target_inode) { - /* dropping unsafe change of inode's attributes */ - ci = ceph_inode(req->r_target_inode); - errseq_set(&ci->i_meta_err, -EIO); - } - if (req->r_unsafe_dir) { - /* dropping unsafe directory operation */ - ci = ceph_inode(req->r_unsafe_dir); - errseq_set(&ci->i_meta_err, -EIO); - } + if (req->r_target_inode) + mapping_set_error(req->r_target_inode->i_mapping, -EIO); + if (req->r_unsafe_dir) + mapping_set_error(req->r_unsafe_dir->i_mapping, -EIO); __unregister_request(mdsc, req); } /* zero r_attempts, so kick_requests() will re-send requests */ @@ -1678,7 +1671,7 @@ static int remove_session_caps_cb(struct inode *inode, struct ceph_cap *cap, spin_unlock(&mdsc->cap_dirty_lock); if (dirty_dropped) { - errseq_set(&ci->i_meta_err, -EIO); + mapping_set_error(inode->i_mapping, -EIO); if (ci->i_wrbuffer_ref_head == 0 && ci->i_wr_ref == 0 && diff --git a/fs/ceph/super.c b/fs/ceph/super.c index 9b1b7f4cfdd4..fd8742bae847 100644 --- a/fs/ceph/super.c +++ b/fs/ceph/super.c @@ -1002,16 +1002,16 @@ static int ceph_compare_super(struct super_block *sb, struct fs_context *fc) struct ceph_fs_client *new = fc->s_fs_info; struct ceph_mount_options *fsopt = new->mount_options; struct ceph_options *opt = new->client->options; - struct ceph_fs_client *other = ceph_sb_to_client(sb); + struct ceph_fs_client *fsc = ceph_sb_to_client(sb); dout("ceph_compare_super %p\n", sb); - if (compare_mount_options(fsopt, opt, other)) { + if (compare_mount_options(fsopt, opt, fsc)) { dout("monitor(s)/mount options don't match\n"); return 0; } if ((opt->flags & CEPH_OPT_FSID) && - ceph_fsid_compare(&opt->fsid, &other->client->fsid)) { + ceph_fsid_compare(&opt->fsid, &fsc->client->fsid)) { dout("fsid doesn't match\n"); return 0; } @@ -1019,6 +1019,17 @@ static int ceph_compare_super(struct super_block *sb, struct fs_context *fc) dout("flags differ\n"); return 0; } + + if (fsc->blocklisted && !ceph_test_mount_opt(fsc, CLEANRECOVER)) { + dout("client is blocklisted (and CLEANRECOVER is not set)\n"); + return 0; + } + + if (fsc->mount_state == CEPH_MOUNT_SHUTDOWN) { + dout("client has been forcibly unmounted\n"); + return 0; + } + return 1; } diff --git a/fs/ceph/super.h b/fs/ceph/super.h index a40eb14c282a..14f951cd5b61 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -429,8 +429,6 @@ struct ceph_inode_info { #ifdef CONFIG_CEPH_FSCACHE struct fscache_cookie *fscache; #endif - errseq_t i_meta_err; - struct inode vfs_inode; /* at end */ }; @@ -774,7 +772,6 @@ struct ceph_file_info { spinlock_t rw_contexts_lock; struct list_head rw_contexts; - errseq_t meta_err; u32 filp_gen; atomic_t num_locks; }; diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 81ec192ce067..4124a89a1a5d 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1893,7 +1893,8 @@ static long writeback_sb_inodes(struct super_block *sb, * unplug, so get our IOs out the door before we * give up the CPU. */ - blk_flush_plug(current); + if (current->plug) + blk_flush_plug(current->plug, false); cond_resched(); } @@ -2291,7 +2292,7 @@ void wakeup_flusher_threads(enum wb_reason reason) * If we are expecting writeback progress we must submit plugged IO. */ if (blk_needs_flush_plug(current)) - blk_schedule_flush_plug(current); + blk_flush_plug(current->plug, true); rcu_read_lock(); list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 319596df5dc6..f55f9f94b1a4 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -1121,6 +1121,9 @@ int fuse_init_fs_context_submount(struct fs_context *fsc); */ void fuse_conn_destroy(struct fuse_mount *fm); +/* Drop the connection and free the fuse mount */ +void fuse_mount_destroy(struct fuse_mount *fm); + /** * Add connection to control filesystem */ diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 36cd03114b6d..12d49a1914e8 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -457,14 +457,6 @@ static void fuse_send_destroy(struct fuse_mount *fm) } } -static void fuse_put_super(struct super_block *sb) -{ - struct fuse_mount *fm = get_fuse_mount_super(sb); - - fuse_conn_put(fm->fc); - kfree(fm); -} - static void convert_fuse_statfs(struct kstatfs *stbuf, struct fuse_kstatfs *attr) { stbuf->f_type = FUSE_SUPER_MAGIC; @@ -1003,7 +995,6 @@ static const struct super_operations fuse_super_operations = { .evict_inode = fuse_evict_inode, .write_inode = fuse_write_inode, .drop_inode = generic_delete_inode, - .put_super = fuse_put_super, .umount_begin = fuse_umount_begin, .statfs = fuse_statfs, .sync_fs = fuse_sync_fs, @@ -1424,20 +1415,17 @@ static int fuse_get_tree_submount(struct fs_context *fsc) if (!fm) return -ENOMEM; + fm->fc = fuse_conn_get(fc); fsc->s_fs_info = fm; sb = sget_fc(fsc, NULL, set_anon_super_fc); - if (IS_ERR(sb)) { - kfree(fm); + if (fsc->s_fs_info) + fuse_mount_destroy(fm); + if (IS_ERR(sb)) return PTR_ERR(sb); - } - fm->fc = fuse_conn_get(fc); /* Initialize superblock, making @mp_fi its root */ err = fuse_fill_super_submount(sb, mp_fi); if (err) { - fuse_conn_put(fc); - kfree(fm); - sb->s_fs_info = NULL; deactivate_locked_super(sb); return err; } @@ -1569,8 +1557,6 @@ static int fuse_fill_super(struct super_block *sb, struct fs_context *fsc) { struct fuse_fs_context *ctx = fsc->fs_private; int err; - struct fuse_conn *fc; - struct fuse_mount *fm; if (!ctx->file || !ctx->rootmode_present || !ctx->user_id_present || !ctx->group_id_present) @@ -1580,42 +1566,18 @@ static int fuse_fill_super(struct super_block *sb, struct fs_context *fsc) * Require mount to happen from the same user namespace which * opened /dev/fuse to prevent potential attacks. */ - err = -EINVAL; if ((ctx->file->f_op != &fuse_dev_operations) || (ctx->file->f_cred->user_ns != sb->s_user_ns)) - goto err; + return -EINVAL; ctx->fudptr = &ctx->file->private_data; - fc = kmalloc(sizeof(*fc), GFP_KERNEL); - err = -ENOMEM; - if (!fc) - goto err; - - fm = kzalloc(sizeof(*fm), GFP_KERNEL); - if (!fm) { - kfree(fc); - goto err; - } - - fuse_conn_init(fc, fm, sb->s_user_ns, &fuse_dev_fiq_ops, NULL); - fc->release = fuse_free_conn; - - sb->s_fs_info = fm; - err = fuse_fill_super_common(sb, ctx); if (err) - goto err_put_conn; + return err; /* file->private_data shall be visible on all CPUs after this */ smp_mb(); fuse_send_init(get_fuse_mount_super(sb)); return 0; - - err_put_conn: - fuse_conn_put(fc); - kfree(fm); - sb->s_fs_info = NULL; - err: - return err; } /* @@ -1637,22 +1599,40 @@ static int fuse_get_tree(struct fs_context *fsc) { struct fuse_fs_context *ctx = fsc->fs_private; struct fuse_dev *fud; + struct fuse_conn *fc; + struct fuse_mount *fm; struct super_block *sb; int err; + fc = kmalloc(sizeof(*fc), GFP_KERNEL); + if (!fc) + return -ENOMEM; + + fm = kzalloc(sizeof(*fm), GFP_KERNEL); + if (!fm) { + kfree(fc); + return -ENOMEM; + } + + fuse_conn_init(fc, fm, fsc->user_ns, &fuse_dev_fiq_ops, NULL); + fc->release = fuse_free_conn; + + fsc->s_fs_info = fm; + if (ctx->fd_present) ctx->file = fget(ctx->fd); if (IS_ENABLED(CONFIG_BLOCK) && ctx->is_bdev) { err = get_tree_bdev(fsc, fuse_fill_super); - goto out_fput; + goto out; } /* * While block dev mount can be initialized with a dummy device fd * (found by device name), normal fuse mounts can't */ + err = -EINVAL; if (!ctx->file) - return -EINVAL; + goto out; /* * Allow creating a fuse mount with an already initialized fuse @@ -1668,7 +1648,9 @@ static int fuse_get_tree(struct fs_context *fsc) } else { err = get_tree_nodev(fsc, fuse_fill_super); } -out_fput: +out: + if (fsc->s_fs_info) + fuse_mount_destroy(fm); if (ctx->file) fput(ctx->file); return err; @@ -1747,17 +1729,25 @@ static void fuse_sb_destroy(struct super_block *sb) struct fuse_mount *fm = get_fuse_mount_super(sb); bool last; - if (fm) { + if (sb->s_root) { last = fuse_mount_remove(fm); if (last) fuse_conn_destroy(fm); } } +void fuse_mount_destroy(struct fuse_mount *fm) +{ + fuse_conn_put(fm->fc); + kfree(fm); +} +EXPORT_SYMBOL(fuse_mount_destroy); + static void fuse_kill_sb_anon(struct super_block *sb) { fuse_sb_destroy(sb); kill_anon_super(sb); + fuse_mount_destroy(get_fuse_mount_super(sb)); } static struct file_system_type fuse_fs_type = { @@ -1775,6 +1765,7 @@ static void fuse_kill_sb_blk(struct super_block *sb) { fuse_sb_destroy(sb); kill_block_super(sb); + fuse_mount_destroy(get_fuse_mount_super(sb)); } static struct file_system_type fuseblk_fs_type = { diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 0ad89c6629d7..94fc874f5de7 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -1394,12 +1394,13 @@ static void virtio_kill_sb(struct super_block *sb) bool last; /* If mount failed, we can still be called without any fc */ - if (fm) { + if (sb->s_root) { last = fuse_mount_remove(fm); if (last) virtio_fs_conn_destroy(fm); } kill_anon_super(sb); + fuse_mount_destroy(fm); } static int virtio_fs_test_super(struct super_block *sb, @@ -1455,19 +1456,14 @@ static int virtio_fs_get_tree(struct fs_context *fsc) fsc->s_fs_info = fm; sb = sget_fc(fsc, virtio_fs_test_super, set_anon_super_fc); - if (fsc->s_fs_info) { - fuse_conn_put(fc); - kfree(fm); - } + if (fsc->s_fs_info) + fuse_mount_destroy(fm); if (IS_ERR(sb)) return PTR_ERR(sb); if (!sb->s_root) { err = virtio_fs_fill_super(sb, fsc); if (err) { - fuse_conn_put(fc); - kfree(fm); - sb->s_fs_info = NULL; deactivate_locked_super(sb); return err; } diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 635f0e3f10ec..5436a688157a 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1338,8 +1338,6 @@ static int gfs2_flock(struct file *file, int cmd, struct file_lock *fl) { if (!(fl->fl_flags & FL_FLOCK)) return -ENOLCK; - if (fl->fl_type & LOCK_MAND) - return -EOPNOTSUPP; if (fl->fl_type == F_UNLCK) { do_unflock(file, fl); diff --git a/fs/io-wq.c b/fs/io-wq.c index 5bf8aa81715e..38b33ad9e8cf 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -140,6 +140,7 @@ static void io_wqe_dec_running(struct io_worker *worker); static bool io_acct_cancel_pending_work(struct io_wqe *wqe, struct io_wqe_acct *acct, struct io_cb_cancel_data *match); +static void create_worker_cb(struct callback_head *cb); static bool io_worker_get(struct io_worker *worker) { @@ -174,12 +175,46 @@ static void io_worker_ref_put(struct io_wq *wq) complete(&wq->worker_done); } +static void io_worker_cancel_cb(struct io_worker *worker) +{ + struct io_wqe_acct *acct = io_wqe_get_acct(worker); + struct io_wqe *wqe = worker->wqe; + struct io_wq *wq = wqe->wq; + + atomic_dec(&acct->nr_running); + raw_spin_lock(&worker->wqe->lock); + acct->nr_workers--; + raw_spin_unlock(&worker->wqe->lock); + io_worker_ref_put(wq); + clear_bit_unlock(0, &worker->create_state); + io_worker_release(worker); +} + +static bool io_task_worker_match(struct callback_head *cb, void *data) +{ + struct io_worker *worker; + + if (cb->func != create_worker_cb) + return false; + worker = container_of(cb, struct io_worker, create_work); + return worker == data; +} + static void io_worker_exit(struct io_worker *worker) { struct io_wqe *wqe = worker->wqe; + struct io_wq *wq = wqe->wq; - if (refcount_dec_and_test(&worker->ref)) - complete(&worker->ref_done); + while (1) { + struct callback_head *cb = task_work_cancel_match(wq->task, + io_task_worker_match, worker); + + if (!cb) + break; + io_worker_cancel_cb(worker); + } + + io_worker_release(worker); wait_for_completion(&worker->ref_done); raw_spin_lock(&wqe->lock); @@ -253,7 +288,7 @@ static bool io_wqe_create_worker(struct io_wqe *wqe, struct io_wqe_acct *acct) pr_warn_once("io-wq is not configured for unbound workers"); raw_spin_lock(&wqe->lock); - if (acct->nr_workers == acct->max_workers) { + if (acct->nr_workers >= acct->max_workers) { raw_spin_unlock(&wqe->lock); return true; } @@ -323,8 +358,10 @@ static bool io_queue_worker_create(struct io_worker *worker, init_task_work(&worker->create_work, func); worker->create_index = acct->index; - if (!task_work_add(wq->task, &worker->create_work, TWA_SIGNAL)) + if (!task_work_add(wq->task, &worker->create_work, TWA_SIGNAL)) { + clear_bit_unlock(0, &worker->create_state); return true; + } clear_bit_unlock(0, &worker->create_state); fail_release: io_worker_release(worker); @@ -716,11 +753,8 @@ static void io_workqueue_create(struct work_struct *work) struct io_worker *worker = container_of(work, struct io_worker, work); struct io_wqe_acct *acct = io_wqe_get_acct(worker); - if (!io_queue_worker_create(worker, acct, create_worker_cont)) { - clear_bit_unlock(0, &worker->create_state); - io_worker_release(worker); + if (!io_queue_worker_create(worker, acct, create_worker_cont)) kfree(worker); - } } static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index) @@ -1150,17 +1184,9 @@ static void io_wq_exit_workers(struct io_wq *wq) while ((cb = task_work_cancel_match(wq->task, io_task_work_match, wq)) != NULL) { struct io_worker *worker; - struct io_wqe_acct *acct; worker = container_of(cb, struct io_worker, create_work); - acct = io_wqe_get_acct(worker); - atomic_dec(&acct->nr_running); - raw_spin_lock(&worker->wqe->lock); - acct->nr_workers--; - raw_spin_unlock(&worker->wqe->lock); - io_worker_ref_put(wq); - clear_bit_unlock(0, &worker->create_state); - io_worker_release(worker); + io_worker_cancel_cb(worker); } rcu_read_lock(); @@ -1291,15 +1317,18 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count) rcu_read_lock(); for_each_node(node) { + struct io_wqe *wqe = wq->wqes[node]; struct io_wqe_acct *acct; + raw_spin_lock(&wqe->lock); for (i = 0; i < IO_WQ_ACCT_NR; i++) { - acct = &wq->wqes[node]->acct[i]; + acct = &wqe->acct[i]; prev = max_t(int, acct->max_workers, prev); if (new_count[i]) acct->max_workers = new_count[i]; new_count[i] = prev; } + raw_spin_unlock(&wqe->lock); } rcu_read_unlock(); return 0; diff --git a/fs/io-wq.h b/fs/io-wq.h index bf5c4c533760..41bf37674a49 100644 --- a/fs/io-wq.h +++ b/fs/io-wq.h @@ -29,6 +29,17 @@ struct io_wq_work_list { struct io_wq_work_node *last; }; +#define wq_list_for_each(pos, prv, head) \ + for (pos = (head)->first, prv = NULL; pos; prv = pos, pos = (pos)->next) + +#define wq_list_for_each_resume(pos, prv) \ + for (; pos; prv = pos, pos = (pos)->next) + +#define wq_list_empty(list) (READ_ONCE((list)->first) == NULL) +#define INIT_WQ_LIST(list) do { \ + (list)->first = NULL; \ +} while (0) + static inline void wq_list_add_after(struct io_wq_work_node *node, struct io_wq_work_node *pos, struct io_wq_work_list *list) @@ -54,6 +65,15 @@ static inline void wq_list_add_tail(struct io_wq_work_node *node, } } +static inline void wq_list_add_head(struct io_wq_work_node *node, + struct io_wq_work_list *list) +{ + node->next = list->first; + if (!node->next) + list->last = node; + WRITE_ONCE(list->first, node); +} + static inline void wq_list_cut(struct io_wq_work_list *list, struct io_wq_work_node *last, struct io_wq_work_node *prev) @@ -69,6 +89,31 @@ static inline void wq_list_cut(struct io_wq_work_list *list, last->next = NULL; } +static inline void __wq_list_splice(struct io_wq_work_list *list, + struct io_wq_work_node *to) +{ + list->last->next = to->next; + to->next = list->first; + INIT_WQ_LIST(list); +} + +static inline bool wq_list_splice(struct io_wq_work_list *list, + struct io_wq_work_node *to) +{ + if (!wq_list_empty(list)) { + __wq_list_splice(list, to); + return true; + } + return false; +} + +static inline void wq_stack_add_head(struct io_wq_work_node *node, + struct io_wq_work_node *stack) +{ + node->next = stack->next; + stack->next = node; +} + static inline void wq_list_del(struct io_wq_work_list *list, struct io_wq_work_node *node, struct io_wq_work_node *prev) @@ -76,14 +121,14 @@ static inline void wq_list_del(struct io_wq_work_list *list, wq_list_cut(list, node, prev); } -#define wq_list_for_each(pos, prv, head) \ - for (pos = (head)->first, prv = NULL; pos; prv = pos, pos = (pos)->next) +static inline +struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack) +{ + struct io_wq_work_node *node = stack->next; -#define wq_list_empty(list) (READ_ONCE((list)->first) == NULL) -#define INIT_WQ_LIST(list) do { \ - (list)->first = NULL; \ - (list)->last = NULL; \ -} while (0) + stack->next = node->next; + return node; +} struct io_wq_work { struct io_wq_work_node list; diff --git a/fs/io_uring.c b/fs/io_uring.c index d4631a55a692..ca10dbb01201 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -103,11 +103,14 @@ #define IORING_MAX_REG_BUFFERS (1U << 14) -#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \ - IOSQE_IO_HARDLINK | IOSQE_ASYNC | \ - IOSQE_BUFFER_SELECT) +#define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \ + IOSQE_IO_HARDLINK | IOSQE_ASYNC) + +#define SQE_VALID_FLAGS (SQE_COMMON_FLAGS|IOSQE_BUFFER_SELECT|IOSQE_IO_DRAIN) + #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \ - REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS) + REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \ + REQ_F_ASYNC_DATA) #define IO_TCTX_REFS_CACHE_NR (1U << 10) @@ -195,8 +198,10 @@ struct io_rings { }; enum io_uring_cmd_flags { - IO_URING_F_NONBLOCK = 1, - IO_URING_F_COMPLETE_DEFER = 2, + IO_URING_F_COMPLETE_DEFER = 1, + IO_URING_F_UNLOCKED = 2, + /* int's last bit, sign checks are usually faster than a bit test */ + IO_URING_F_NONBLOCK = INT_MIN, }; struct io_mapped_ubuf { @@ -305,26 +310,16 @@ struct io_submit_link { }; struct io_submit_state { - struct blk_plug plug; + /* inline/task_work completion list, under ->uring_lock */ + struct io_wq_work_node free_list; + /* batch completion logic */ + struct io_wq_work_list compl_reqs; struct io_submit_link link; - /* - * io_kiocb alloc cache - */ - void *reqs[IO_REQ_CACHE_SIZE]; - unsigned int free_reqs; - bool plug_started; - - /* - * Batch completion logic - */ - struct io_kiocb *compl_reqs[IO_COMPL_BATCH]; - unsigned int compl_nr; - /* inline/task_work completion list, under ->uring_lock */ - struct list_head free_list; - - unsigned int ios_left; + bool need_plug; + unsigned short submit_nr; + struct blk_plug plug; }; struct io_ring_ctx { @@ -368,6 +363,7 @@ struct io_ring_ctx { * uring_lock, and updated through io_uring_register(2) */ struct io_rsrc_node *rsrc_node; + int rsrc_cached_refs; struct io_file_table file_table; unsigned nr_user_files; unsigned nr_user_bufs; @@ -384,7 +380,7 @@ struct io_ring_ctx { } ____cacheline_aligned_in_smp; /* IRQ completion list, under ->completion_lock */ - struct list_head locked_free_list; + struct io_wq_work_list locked_free_list; unsigned int locked_free_nr; const struct cred *sq_creds; /* cred used for __io_sq_thread() */ @@ -399,7 +395,6 @@ struct io_ring_ctx { unsigned cached_cq_tail; unsigned cq_entries; struct eventfd_ctx *cq_ev_fd; - struct wait_queue_head poll_wait; struct wait_queue_head cq_wait; unsigned cq_extra; atomic_t cq_timeouts; @@ -417,7 +412,7 @@ struct io_ring_ctx { * For SQPOLL, only the single threaded io_sq_thread() will * manipulate the list, hence no extra locking is needed there. */ - struct list_head iopoll_list; + struct io_wq_work_list iopoll_list; struct hlist_head *cancel_hash; unsigned cancel_hash_bits; bool poll_multi_queue; @@ -456,6 +451,8 @@ struct io_ring_ctx { struct work_struct exit_work; struct list_head tctx_list; struct completion ref_comp; + u32 iowq_limits[2]; + bool iowq_limits_set; }; }; @@ -578,7 +575,6 @@ struct io_sr_msg { int msg_flags; int bgid; size_t len; - struct io_buffer *kbuf; }; struct io_open { @@ -690,11 +686,6 @@ struct io_hardlink { int flags; }; -struct io_completion { - struct file *file; - u32 cflags; -}; - struct io_async_connect { struct sockaddr_storage address; }; @@ -708,11 +699,15 @@ struct io_async_msghdr { struct sockaddr_storage addr; }; -struct io_async_rw { - struct iovec fast_iov[UIO_FASTIOV]; - const struct iovec *free_iovec; +struct io_rw_state { struct iov_iter iter; struct iov_iter_state iter_state; + struct iovec fast_iov[UIO_FASTIOV]; +}; + +struct io_async_rw { + struct io_rw_state s; + const struct iovec *free_iovec; size_t bytes_done; struct wait_page_queue wpq; }; @@ -739,9 +734,9 @@ enum { REQ_F_CREDS_BIT, REQ_F_REFCOUNT_BIT, REQ_F_ARM_LTIMEOUT_BIT, + REQ_F_ASYNC_DATA_BIT, /* keep async read/write and isreg together and in order */ - REQ_F_NOWAIT_READ_BIT, - REQ_F_NOWAIT_WRITE_BIT, + REQ_F_SUPPORT_NOWAIT_BIT, REQ_F_ISREG_BIT, /* not a real bit, just to check we're not overflowing the space */ @@ -782,10 +777,8 @@ enum { REQ_F_COMPLETE_INLINE = BIT(REQ_F_COMPLETE_INLINE_BIT), /* caller should reissue async */ REQ_F_REISSUE = BIT(REQ_F_REISSUE_BIT), - /* supports async reads */ - REQ_F_NOWAIT_READ = BIT(REQ_F_NOWAIT_READ_BIT), - /* supports async writes */ - REQ_F_NOWAIT_WRITE = BIT(REQ_F_NOWAIT_WRITE_BIT), + /* supports async reads/writes */ + REQ_F_SUPPORT_NOWAIT = BIT(REQ_F_SUPPORT_NOWAIT_BIT), /* regular file */ REQ_F_ISREG = BIT(REQ_F_ISREG_BIT), /* has creds assigned */ @@ -794,6 +787,8 @@ enum { REQ_F_REFCOUNT = BIT(REQ_F_REFCOUNT_BIT), /* there is a linked timeout that has to be armed */ REQ_F_ARM_LTIMEOUT = BIT(REQ_F_ARM_LTIMEOUT_BIT), + /* ->async_data allocated */ + REQ_F_ASYNC_DATA = BIT(REQ_F_ASYNC_DATA_BIT), }; struct async_poll { @@ -850,39 +845,41 @@ struct io_kiocb { struct io_mkdir mkdir; struct io_symlink symlink; struct io_hardlink hardlink; - /* use only after cleaning per-op data, see io_clean_op() */ - struct io_completion compl; }; - /* opcode allocated if it needs to store data for async defer */ - void *async_data; u8 opcode; /* polled IO has completed */ u8 iopoll_completed; - u16 buf_index; + unsigned int flags; + + u64 user_data; u32 result; + u32 cflags; struct io_ring_ctx *ctx; - unsigned int flags; - atomic_t refs; struct task_struct *task; - u64 user_data; - struct io_kiocb *link; struct percpu_ref *fixed_rsrc_refs; + /* store used ubuf, so we can prevent reloading */ + struct io_mapped_ubuf *imu; - /* used with ctx->iopoll_list with reads/writes */ - struct list_head inflight_entry; + /* used by request caches, completion batching and iopoll */ + struct io_wq_work_node comp_list; + atomic_t refs; + struct io_kiocb *link; struct io_task_work io_task_work; /* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */ struct hlist_node hash_node; + /* internal polling, see IORING_FEAT_FAST_POLL */ struct async_poll *apoll; + /* opcode allocated if it needs to store data for async defer */ + void *async_data; struct io_wq_work work; + /* custom credentials, valid IFF REQ_F_CREDS is set */ const struct cred *creds; - - /* store used ubuf, so we can prevent reloading */ - struct io_mapped_ubuf *imu; + /* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */ + struct io_buffer *kbuf; }; struct io_tctx_node { @@ -900,12 +897,12 @@ struct io_defer_entry { struct io_op_def { /* needs req->file assigned */ unsigned needs_file : 1; + /* should block plug */ + unsigned plug : 1; /* hash wq insertion if file is a regular file */ unsigned hash_reg_file : 1; /* unbound wq insertion if file is a non-regular file */ unsigned unbound_nonreg_file : 1; - /* opcode is not supported by this kernel */ - unsigned not_supported : 1; /* set if opcode supports polled "wait" */ unsigned pollin : 1; unsigned pollout : 1; @@ -913,8 +910,8 @@ struct io_op_def { unsigned buffer_select : 1; /* do prep async if is going to be punted */ unsigned needs_async_setup : 1; - /* should block plug */ - unsigned plug : 1; + /* opcode is not supported by this kernel */ + unsigned not_supported : 1; /* size of async data needed, if any */ unsigned short async_size; }; @@ -1078,7 +1075,7 @@ static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx, static void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd); static bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags); + s32 res, u32 cflags); static void io_put_req(struct io_kiocb *req); static void io_put_req_deferred(struct io_kiocb *req); static void io_dismantle_req(struct io_kiocb *req); @@ -1093,7 +1090,7 @@ static void __io_queue_sqe(struct io_kiocb *req); static void io_rsrc_put_work(struct work_struct *work); static void io_req_task_queue(struct io_kiocb *req); -static void io_submit_flush_completions(struct io_ring_ctx *ctx); +static void __io_submit_flush_completions(struct io_ring_ctx *ctx); static int io_req_prep_async(struct io_kiocb *req); static int io_install_fixed_file(struct io_kiocb *req, struct file *file, @@ -1165,6 +1162,12 @@ static inline void req_ref_get(struct io_kiocb *req) atomic_inc(&req->refs); } +static inline void io_submit_flush_completions(struct io_ring_ctx *ctx) +{ + if (!wq_list_empty(&ctx->submit_state.compl_reqs)) + __io_submit_flush_completions(ctx); +} + static inline void __io_req_set_refcount(struct io_kiocb *req, int nr) { if (!(req->flags & REQ_F_REFCOUNT)) { @@ -1178,13 +1181,52 @@ static inline void io_req_set_refcount(struct io_kiocb *req) __io_req_set_refcount(req, 1); } -static inline void io_req_set_rsrc_node(struct io_kiocb *req) +#define IO_RSRC_REF_BATCH 100 + +static inline void io_req_put_rsrc_locked(struct io_kiocb *req, + struct io_ring_ctx *ctx) + __must_hold(&ctx->uring_lock) { - struct io_ring_ctx *ctx = req->ctx; + struct percpu_ref *ref = req->fixed_rsrc_refs; + + if (ref) { + if (ref == &ctx->rsrc_node->refs) + ctx->rsrc_cached_refs++; + else + percpu_ref_put(ref); + } +} + +static inline void io_req_put_rsrc(struct io_kiocb *req, struct io_ring_ctx *ctx) +{ + if (req->fixed_rsrc_refs) + percpu_ref_put(req->fixed_rsrc_refs); +} + +static __cold void io_rsrc_refs_drop(struct io_ring_ctx *ctx) + __must_hold(&ctx->uring_lock) +{ + if (ctx->rsrc_cached_refs) { + percpu_ref_put_many(&ctx->rsrc_node->refs, ctx->rsrc_cached_refs); + ctx->rsrc_cached_refs = 0; + } +} + +static void io_rsrc_refs_refill(struct io_ring_ctx *ctx) + __must_hold(&ctx->uring_lock) +{ + ctx->rsrc_cached_refs += IO_RSRC_REF_BATCH; + percpu_ref_get_many(&ctx->rsrc_node->refs, IO_RSRC_REF_BATCH); +} +static inline void io_req_set_rsrc_node(struct io_kiocb *req, + struct io_ring_ctx *ctx) +{ if (!req->fixed_rsrc_refs) { req->fixed_rsrc_refs = &ctx->rsrc_node->refs; - percpu_ref_get(req->fixed_rsrc_refs); + ctx->rsrc_cached_refs--; + if (unlikely(ctx->rsrc_cached_refs < 0)) + io_rsrc_refs_refill(ctx); } } @@ -1217,6 +1259,11 @@ static bool io_match_task(struct io_kiocb *head, struct task_struct *task, return false; } +static inline bool req_has_async_data(struct io_kiocb *req) +{ + return req->flags & REQ_F_ASYNC_DATA; +} + static inline void req_set_fail(struct io_kiocb *req) { req->flags |= REQ_F_FAIL; @@ -1228,7 +1275,7 @@ static inline void req_fail_link_node(struct io_kiocb *req, int res) req->result = res; } -static void io_ring_ctx_ref_free(struct percpu_ref *ref) +static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref) { struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); @@ -1240,7 +1287,7 @@ static inline bool io_is_timeout_noseq(struct io_kiocb *req) return !req->timeout.off; } -static void io_fallback_req_func(struct work_struct *work) +static __cold void io_fallback_req_func(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, fallback_work.work); @@ -1253,15 +1300,13 @@ static void io_fallback_req_func(struct work_struct *work) req->io_task_work.func(req, &locked); if (locked) { - if (ctx->submit_state.compl_nr) - io_submit_flush_completions(ctx); + io_submit_flush_completions(ctx); mutex_unlock(&ctx->uring_lock); } percpu_ref_put(&ctx->refs); - } -static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { struct io_ring_ctx *ctx; int hash_bits; @@ -1298,7 +1343,6 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) ctx->flags = p->flags; init_waitqueue_head(&ctx->sqo_sq_wait); INIT_LIST_HEAD(&ctx->sqd_list); - init_waitqueue_head(&ctx->poll_wait); INIT_LIST_HEAD(&ctx->cq_overflow_list); init_completion(&ctx->ref_comp); xa_init_flags(&ctx->io_buffers, XA_FLAGS_ALLOC1); @@ -1307,7 +1351,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->cq_wait); spin_lock_init(&ctx->completion_lock); spin_lock_init(&ctx->timeout_lock); - INIT_LIST_HEAD(&ctx->iopoll_list); + INIT_WQ_LIST(&ctx->iopoll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); INIT_LIST_HEAD(&ctx->ltimeout_list); @@ -1316,9 +1360,10 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work); init_llist_head(&ctx->rsrc_put_llist); INIT_LIST_HEAD(&ctx->tctx_list); - INIT_LIST_HEAD(&ctx->submit_state.free_list); - INIT_LIST_HEAD(&ctx->locked_free_list); + ctx->submit_state.free_list.next = NULL; + INIT_WQ_LIST(&ctx->locked_free_list); INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func); + INIT_WQ_LIST(&ctx->submit_state.compl_reqs); return ctx; err: kfree(ctx->dummy_ubuf); @@ -1346,21 +1391,16 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq) return false; } -#define FFS_ASYNC_READ 0x1UL -#define FFS_ASYNC_WRITE 0x2UL -#ifdef CONFIG_64BIT -#define FFS_ISREG 0x4UL -#else -#define FFS_ISREG 0x0UL -#endif -#define FFS_MASK ~(FFS_ASYNC_READ|FFS_ASYNC_WRITE|FFS_ISREG) +#define FFS_NOWAIT 0x1UL +#define FFS_ISREG 0x2UL +#define FFS_MASK ~(FFS_NOWAIT|FFS_ISREG) static inline bool io_req_ffs_set(struct io_kiocb *req) { - return IS_ENABLED(CONFIG_64BIT) && (req->flags & REQ_F_FIXED_FILE); + return req->flags & REQ_F_FIXED_FILE; } -static void io_req_track_inflight(struct io_kiocb *req) +static inline void io_req_track_inflight(struct io_kiocb *req) { if (!(req->flags & REQ_F_INFLIGHT)) { req->flags |= REQ_F_INFLIGHT; @@ -1368,11 +1408,6 @@ static void io_req_track_inflight(struct io_kiocb *req) } } -static inline void io_unprep_linked_timeout(struct io_kiocb *req) -{ - req->flags &= ~REQ_F_LINK_TIMEOUT; -} - static struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req) { if (WARN_ON_ONCE(!req->link)) @@ -1443,15 +1478,19 @@ static void io_prep_async_link(struct io_kiocb *req) } } -static void io_queue_async_work(struct io_kiocb *req, bool *locked) +static inline void io_req_add_compl_list(struct io_kiocb *req) +{ + struct io_submit_state *state = &req->ctx->submit_state; + + wq_list_add_tail(&req->comp_list, &state->compl_reqs); +} + +static void io_queue_async_work(struct io_kiocb *req, bool *dont_use) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link = io_prep_linked_timeout(req); struct io_uring_task *tctx = req->task->io_uring; - /* must not take the lock, NULL it as a precaution */ - locked = NULL; - BUG_ON(!tctx); BUG_ON(!tctx->io_wq); @@ -1492,7 +1531,7 @@ static void io_kill_timeout(struct io_kiocb *req, int status) } } -static void io_queue_deferred(struct io_ring_ctx *ctx) +static __cold void io_queue_deferred(struct io_ring_ctx *ctx) { while (!list_empty(&ctx->defer_list)) { struct io_defer_entry *de = list_first_entry(&ctx->defer_list, @@ -1506,7 +1545,7 @@ static void io_queue_deferred(struct io_ring_ctx *ctx) } } -static void io_flush_timeouts(struct io_ring_ctx *ctx) +static __cold void io_flush_timeouts(struct io_ring_ctx *ctx) __must_hold(&ctx->completion_lock) { u32 seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts); @@ -1539,7 +1578,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx) spin_unlock_irq(&ctx->timeout_lock); } -static void __io_commit_cqring_flush(struct io_ring_ctx *ctx) +static __cold void __io_commit_cqring_flush(struct io_ring_ctx *ctx) { if (ctx->off_timeout_used) io_flush_timeouts(ctx); @@ -1609,12 +1648,8 @@ static void io_cqring_ev_posted(struct io_ring_ctx *ctx) */ if (wq_has_sleeper(&ctx->cq_wait)) wake_up_all(&ctx->cq_wait); - if (ctx->sq_data && waitqueue_active(&ctx->sq_data->wait)) - wake_up(&ctx->sq_data->wait); if (io_should_trigger_evfd(ctx)) eventfd_signal(ctx->cq_ev_fd, 1); - if (waitqueue_active(&ctx->poll_wait)) - wake_up_interruptible(&ctx->poll_wait); } static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx) @@ -1628,8 +1663,6 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx) } if (io_should_trigger_evfd(ctx)) eventfd_signal(ctx->cq_ev_fd, 1); - if (waitqueue_active(&ctx->poll_wait)) - wake_up_interruptible(&ctx->poll_wait); } /* Returns true if there are no backlogged entries after the flush */ @@ -1725,7 +1758,7 @@ static inline void io_get_task_refs(int nr) } static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags) + s32 res, u32 cflags) { struct io_overflow_cqe *ocqe; @@ -1753,7 +1786,7 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, } static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags) + s32 res, u32 cflags) { struct io_uring_cqe *cqe; @@ -1776,13 +1809,13 @@ static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data /* not as hot to bloat with inlining */ static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data, - long res, unsigned int cflags) + s32 res, u32 cflags) { return __io_cqring_fill_event(ctx, user_data, res, cflags); } -static void io_req_complete_post(struct io_kiocb *req, long res, - unsigned int cflags) +static void io_req_complete_post(struct io_kiocb *req, s32 res, + u32 cflags) { struct io_ring_ctx *ctx = req->ctx; @@ -1801,40 +1834,27 @@ static void io_req_complete_post(struct io_kiocb *req, long res, req->link = NULL; } } + io_req_put_rsrc(req, ctx); io_dismantle_req(req); io_put_task(req->task, 1); - list_add(&req->inflight_entry, &ctx->locked_free_list); + wq_list_add_head(&req->comp_list, &ctx->locked_free_list); ctx->locked_free_nr++; - } else { - if (!percpu_ref_tryget(&ctx->refs)) - req = NULL; } io_commit_cqring(ctx); spin_unlock(&ctx->completion_lock); - - if (req) { - io_cqring_ev_posted(ctx); - percpu_ref_put(&ctx->refs); - } -} - -static inline bool io_req_needs_clean(struct io_kiocb *req) -{ - return req->flags & IO_REQ_CLEAN_FLAGS; + io_cqring_ev_posted(ctx); } -static void io_req_complete_state(struct io_kiocb *req, long res, - unsigned int cflags) +static inline void io_req_complete_state(struct io_kiocb *req, s32 res, + u32 cflags) { - if (io_req_needs_clean(req)) - io_clean_op(req); req->result = res; - req->compl.cflags = cflags; + req->cflags = cflags; req->flags |= REQ_F_COMPLETE_INLINE; } static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags, - long res, unsigned cflags) + s32 res, u32 cflags) { if (issue_flags & IO_URING_F_COMPLETE_DEFER) io_req_complete_state(req, res, cflags); @@ -1842,12 +1862,12 @@ static inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags, io_req_complete_post(req, res, cflags); } -static inline void io_req_complete(struct io_kiocb *req, long res) +static inline void io_req_complete(struct io_kiocb *req, s32 res) { __io_req_complete(req, 0, res, 0); } -static void io_req_complete_failed(struct io_kiocb *req, long res) +static void io_req_complete_failed(struct io_kiocb *req, s32 res) { req_set_fail(req); io_req_complete_post(req, res, 0); @@ -1881,7 +1901,7 @@ static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx, struct io_submit_state *state) { spin_lock(&ctx->completion_lock); - list_splice_init(&ctx->locked_free_list, &state->free_list); + wq_list_splice(&ctx->locked_free_list, &state->free_list); ctx->locked_free_nr = 0; spin_unlock(&ctx->completion_lock); } @@ -1890,7 +1910,6 @@ static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx, static bool io_flush_cached_reqs(struct io_ring_ctx *ctx) { struct io_submit_state *state = &ctx->submit_state; - int nr; /* * If we have more than a batch's worth of requests in our IRQ side @@ -1899,20 +1918,7 @@ static bool io_flush_cached_reqs(struct io_ring_ctx *ctx) */ if (READ_ONCE(ctx->locked_free_nr) > IO_COMPL_BATCH) io_flush_cached_locked_reqs(ctx, state); - - nr = state->free_reqs; - while (!list_empty(&state->free_list)) { - struct io_kiocb *req = list_first_entry(&state->free_list, - struct io_kiocb, inflight_entry); - - list_del(&req->inflight_entry); - state->reqs[nr++] = req; - if (nr == ARRAY_SIZE(state->reqs)) - break; - } - - state->free_reqs = nr; - return nr != 0; + return !!state->free_list.next; } /* @@ -1921,38 +1927,54 @@ static bool io_flush_cached_reqs(struct io_ring_ctx *ctx) * Because of that, io_alloc_req() should be called only under ->uring_lock * and with extra caution to not get a request that is still worked on. */ -static struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx) +static __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx) __must_hold(&ctx->uring_lock) { struct io_submit_state *state = &ctx->submit_state; gfp_t gfp = GFP_KERNEL | __GFP_NOWARN; + void *reqs[IO_REQ_ALLOC_BATCH]; + struct io_kiocb *req; int ret, i; - BUILD_BUG_ON(ARRAY_SIZE(state->reqs) < IO_REQ_ALLOC_BATCH); - - if (likely(state->free_reqs || io_flush_cached_reqs(ctx))) - goto got_req; + if (likely(state->free_list.next || io_flush_cached_reqs(ctx))) + return true; - ret = kmem_cache_alloc_bulk(req_cachep, gfp, IO_REQ_ALLOC_BATCH, - state->reqs); + ret = kmem_cache_alloc_bulk(req_cachep, gfp, ARRAY_SIZE(reqs), reqs); /* * Bulk alloc is all-or-nothing. If we fail to get a batch, * retry single alloc to be on the safe side. */ if (unlikely(ret <= 0)) { - state->reqs[0] = kmem_cache_alloc(req_cachep, gfp); - if (!state->reqs[0]) - return NULL; + reqs[0] = kmem_cache_alloc(req_cachep, gfp); + if (!reqs[0]) + return false; ret = 1; } - for (i = 0; i < ret; i++) - io_preinit_req(state->reqs[i], ctx); - state->free_reqs = ret; -got_req: - state->free_reqs--; - return state->reqs[state->free_reqs]; + percpu_ref_get_many(&ctx->refs, ret); + for (i = 0; i < ret; i++) { + req = reqs[i]; + + io_preinit_req(req, ctx); + wq_stack_add_head(&req->comp_list, &state->free_list); + } + return true; +} + +static inline bool io_alloc_req_refill(struct io_ring_ctx *ctx) +{ + if (unlikely(!ctx->submit_state.free_list.next)) + return __io_alloc_req_refill(ctx); + return true; +} + +static inline struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx) +{ + struct io_wq_work_node *node; + + node = wq_stack_extract(&ctx->submit_state.free_list); + return container_of(node, struct io_kiocb, comp_list); } static inline void io_put_file(struct file *file) @@ -1961,35 +1983,28 @@ static inline void io_put_file(struct file *file) fput(file); } -static void io_dismantle_req(struct io_kiocb *req) +static inline void io_dismantle_req(struct io_kiocb *req) { unsigned int flags = req->flags; - if (io_req_needs_clean(req)) + if (unlikely(flags & IO_REQ_CLEAN_FLAGS)) io_clean_op(req); if (!(flags & REQ_F_FIXED_FILE)) io_put_file(req->file); - if (req->fixed_rsrc_refs) - percpu_ref_put(req->fixed_rsrc_refs); - if (req->async_data) { - kfree(req->async_data); - req->async_data = NULL; - } } -static void __io_free_req(struct io_kiocb *req) +static __cold void __io_free_req(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; + io_req_put_rsrc(req, ctx); io_dismantle_req(req); io_put_task(req->task, 1); spin_lock(&ctx->completion_lock); - list_add(&req->inflight_entry, &ctx->locked_free_list); + wq_list_add_head(&req->comp_list, &ctx->locked_free_list); ctx->locked_free_nr++; spin_unlock(&ctx->completion_lock); - - percpu_ref_put(&ctx->refs); } static inline void io_remove_next_linked(struct io_kiocb *req) @@ -2075,47 +2090,45 @@ static bool io_disarm_next(struct io_kiocb *req) return posted; } -static struct io_kiocb *__io_req_find_next(struct io_kiocb *req) +static void __io_req_find_next_prep(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + bool posted; + + spin_lock(&ctx->completion_lock); + posted = io_disarm_next(req); + if (posted) + io_commit_cqring(req->ctx); + spin_unlock(&ctx->completion_lock); + if (posted) + io_cqring_ev_posted(ctx); +} + +static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req) { struct io_kiocb *nxt; + if (likely(!(req->flags & (REQ_F_LINK|REQ_F_HARDLINK)))) + return NULL; /* * If LINK is set, we have dependent requests in this chain. If we * didn't fail this request, queue the first one up, moving any other * dependencies to the next request. In case of failure, fail the rest * of the chain. */ - if (req->flags & IO_DISARM_MASK) { - struct io_ring_ctx *ctx = req->ctx; - bool posted; - - spin_lock(&ctx->completion_lock); - posted = io_disarm_next(req); - if (posted) - io_commit_cqring(req->ctx); - spin_unlock(&ctx->completion_lock); - if (posted) - io_cqring_ev_posted(ctx); - } + if (unlikely(req->flags & IO_DISARM_MASK)) + __io_req_find_next_prep(req); nxt = req->link; req->link = NULL; return nxt; } -static inline struct io_kiocb *io_req_find_next(struct io_kiocb *req) -{ - if (likely(!(req->flags & (REQ_F_LINK|REQ_F_HARDLINK)))) - return NULL; - return __io_req_find_next(req); -} - static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked) { if (!ctx) return; if (*locked) { - if (ctx->submit_state.compl_nr) - io_submit_flush_completions(ctx); + io_submit_flush_completions(ctx); mutex_unlock(&ctx->uring_lock); *locked = false; } @@ -2132,7 +2145,7 @@ static void tctx_task_work(struct callback_head *cb) while (1) { struct io_wq_work_node *node; - if (!tctx->task_list.first && locked && ctx->submit_state.compl_nr) + if (!tctx->task_list.first && locked) io_submit_flush_completions(ctx); spin_lock_irq(&tctx->task_lock); @@ -2195,8 +2208,9 @@ static void io_req_task_work_add(struct io_kiocb *req) * will do the job. */ notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL; - if (!task_work_add(tsk, &tctx->task_work, notify)) { - wake_up_process(tsk); + if (likely(!task_work_add(tsk, &tctx->task_work, notify))) { + if (notify == TWA_NONE) + wake_up_process(tsk); return; } @@ -2274,77 +2288,62 @@ static void io_free_req_work(struct io_kiocb *req, bool *locked) io_free_req(req); } -struct req_batch { - struct task_struct *task; - int task_refs; - int ctx_refs; -}; - -static inline void io_init_req_batch(struct req_batch *rb) +static void io_free_batch_list(struct io_ring_ctx *ctx, + struct io_wq_work_node *node) + __must_hold(&ctx->uring_lock) { - rb->task_refs = 0; - rb->ctx_refs = 0; - rb->task = NULL; -} + struct task_struct *task = NULL; + int task_refs = 0; -static void io_req_free_batch_finish(struct io_ring_ctx *ctx, - struct req_batch *rb) -{ - if (rb->ctx_refs) - percpu_ref_put_many(&ctx->refs, rb->ctx_refs); - if (rb->task) - io_put_task(rb->task, rb->task_refs); -} + do { + struct io_kiocb *req = container_of(node, struct io_kiocb, + comp_list); -static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req, - struct io_submit_state *state) -{ - io_queue_next(req); - io_dismantle_req(req); + if (unlikely(req->flags & REQ_F_REFCOUNT)) { + node = req->comp_list.next; + if (!req_ref_put_and_test(req)) + continue; + } - if (req->task != rb->task) { - if (rb->task) - io_put_task(rb->task, rb->task_refs); - rb->task = req->task; - rb->task_refs = 0; - } - rb->task_refs++; - rb->ctx_refs++; + io_req_put_rsrc_locked(req, ctx); + io_queue_next(req); + io_dismantle_req(req); - if (state->free_reqs != ARRAY_SIZE(state->reqs)) - state->reqs[state->free_reqs++] = req; - else - list_add(&req->inflight_entry, &state->free_list); + if (req->task != task) { + if (task) + io_put_task(task, task_refs); + task = req->task; + task_refs = 0; + } + task_refs++; + node = req->comp_list.next; + wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list); + } while (node); + + if (task) + io_put_task(task, task_refs); } -static void io_submit_flush_completions(struct io_ring_ctx *ctx) +static void __io_submit_flush_completions(struct io_ring_ctx *ctx) __must_hold(&ctx->uring_lock) { + struct io_wq_work_node *node, *prev; struct io_submit_state *state = &ctx->submit_state; - int i, nr = state->compl_nr; - struct req_batch rb; spin_lock(&ctx->completion_lock); - for (i = 0; i < nr; i++) { - struct io_kiocb *req = state->compl_reqs[i]; + wq_list_for_each(node, prev, &state->compl_reqs) { + struct io_kiocb *req = container_of(node, struct io_kiocb, + comp_list); __io_cqring_fill_event(ctx, req->user_data, req->result, - req->compl.cflags); + req->cflags); } io_commit_cqring(ctx); spin_unlock(&ctx->completion_lock); io_cqring_ev_posted(ctx); - io_init_req_batch(&rb); - for (i = 0; i < nr; i++) { - struct io_kiocb *req = state->compl_reqs[i]; - - if (req_ref_put_and_test(req)) - io_req_free_batch(&rb, req, &ctx->submit_state); - } - - io_req_free_batch_finish(ctx, &rb); - state->compl_nr = 0; + io_free_batch_list(ctx, state->compl_reqs.first); + INIT_WQ_LIST(&state->compl_reqs); } /* @@ -2404,12 +2403,9 @@ static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf) static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req) { - struct io_buffer *kbuf; - if (likely(!(req->flags & REQ_F_BUFFER_SELECTED))) return 0; - kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; - return io_put_kbuf(req, kbuf); + return io_put_kbuf(req, req->kbuf); } static inline bool io_run_task_work(void) @@ -2423,52 +2419,22 @@ static inline bool io_run_task_work(void) return false; } -/* - * Find and free completed poll iocbs - */ -static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, - struct list_head *done) +static int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) { - struct req_batch rb; - struct io_kiocb *req; - - /* order with ->result store in io_complete_rw_iopoll() */ - smp_rmb(); - - io_init_req_batch(&rb); - while (!list_empty(done)) { - req = list_first_entry(done, struct io_kiocb, inflight_entry); - list_del(&req->inflight_entry); - - __io_cqring_fill_event(ctx, req->user_data, req->result, - io_put_rw_kbuf(req)); - (*nr_events)++; - - if (req_ref_put_and_test(req)) - io_req_free_batch(&rb, req, &ctx->submit_state); - } - - io_commit_cqring(ctx); - io_cqring_ev_posted_iopoll(ctx); - io_req_free_batch_finish(ctx, &rb); -} - -static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, - long min) -{ - struct io_kiocb *req, *tmp; + struct io_wq_work_node *pos, *start, *prev; unsigned int poll_flags = BLK_POLL_NOSLEEP; DEFINE_IO_COMP_BATCH(iob); - LIST_HEAD(done); + int nr_events = 0; /* * Only spin for completions if we don't have multiple devices hanging - * off our complete list, and we're under the requested amount. + * off our complete list. */ - if (ctx->poll_multi_queue || *nr_events >= min) + if (ctx->poll_multi_queue || force_nonspin) poll_flags |= BLK_POLL_ONESHOT; - list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, inflight_entry) { + wq_list_for_each(pos, start, &ctx->iopoll_list) { + struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); struct kiocb *kiocb = &req->rw.kiocb; int ret; @@ -2477,11 +2443,7 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, * If we find a request that requires polling, break out * and complete those lists first, if we have entries there. */ - if (READ_ONCE(req->iopoll_completed)) { - list_move_tail(&req->inflight_entry, &done); - continue; - } - if (!list_empty(&done)) + if (READ_ONCE(req->iopoll_completed)) break; ret = kiocb->ki_filp->f_op->iopoll(kiocb, &iob, poll_flags); @@ -2493,34 +2455,50 @@ static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, /* iopoll may have completed current req */ if (!rq_list_empty(iob.req_list) || READ_ONCE(req->iopoll_completed)) - list_move_tail(&req->inflight_entry, &done); + break; } if (!rq_list_empty(iob.req_list)) iob.complete(&iob); - if (!list_empty(&done)) - io_iopoll_complete(ctx, nr_events, &done); + else if (!pos) + return 0; - return 0; + prev = start; + wq_list_for_each_resume(pos, prev) { + struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); + + /* order with io_complete_rw_iopoll(), e.g. ->result updates */ + if (!smp_load_acquire(&req->iopoll_completed)) + break; + __io_cqring_fill_event(ctx, req->user_data, req->result, + io_put_rw_kbuf(req)); + nr_events++; + } + + if (unlikely(!nr_events)) + return 0; + + io_commit_cqring(ctx); + io_cqring_ev_posted_iopoll(ctx); + pos = start ? start->next : ctx->iopoll_list.first; + wq_list_cut(&ctx->iopoll_list, prev, start); + io_free_batch_list(ctx, pos); + return nr_events; } /* * We can't just wait for polled events to come to us, we have to actively * find and complete them. */ -static void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) +static __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) { if (!(ctx->flags & IORING_SETUP_IOPOLL)) return; mutex_lock(&ctx->uring_lock); - while (!list_empty(&ctx->iopoll_list)) { - unsigned int nr_events = 0; - - io_do_iopoll(ctx, &nr_events, 0); - + while (!wq_list_empty(&ctx->iopoll_list)) { /* let it sleep and repeat later if can't complete a request */ - if (nr_events == 0) + if (io_do_iopoll(ctx, true) == 0) break; /* * Ensure we allow local-to-the-cpu processing to take place, @@ -2567,7 +2545,7 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min) * forever, while the workqueue is stuck trying to acquire the * very same mutex. */ - if (list_empty(&ctx->iopoll_list)) { + if (wq_list_empty(&ctx->iopoll_list)) { u32 tail = ctx->cached_cq_tail; mutex_unlock(&ctx->uring_lock); @@ -2576,11 +2554,15 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, long min) /* some requests don't go through iopoll_list */ if (tail != ctx->cached_cq_tail || - list_empty(&ctx->iopoll_list)) + wq_list_empty(&ctx->iopoll_list)) break; } - ret = io_do_iopoll(ctx, &nr_events, min); - } while (!ret && nr_events < min && !need_resched()); + ret = io_do_iopoll(ctx, !min); + if (ret < 0) + break; + nr_events += ret; + ret = 0; + } while (nr_events < min && !need_resched()); out: mutex_unlock(&ctx->uring_lock); return ret; @@ -2605,9 +2587,9 @@ static bool io_resubmit_prep(struct io_kiocb *req) { struct io_async_rw *rw = req->async_data; - if (!rw) + if (!req_has_async_data(req)) return !io_req_prep_async(req); - iov_iter_restore(&rw->iter, &rw->iter_state); + iov_iter_restore(&rw->s.iter, &rw->s.iter_state); return true; } @@ -2651,7 +2633,7 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res) { if (req->rw.kiocb.ki_flags & IOCB_WRITE) kiocb_end_write(req); - if (res != req->result) { + if (unlikely(res != req->result)) { if ((res == -EAGAIN || res == -EOPNOTSUPP) && io_rw_should_reissue(req)) { req->flags |= REQ_F_REISSUE; @@ -2666,16 +2648,11 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res) static void io_req_task_complete(struct io_kiocb *req, bool *locked) { unsigned int cflags = io_put_rw_kbuf(req); - long res = req->result; + int res = req->result; if (*locked) { - struct io_ring_ctx *ctx = req->ctx; - struct io_submit_state *state = &ctx->submit_state; - io_req_complete_state(req, res, cflags); - state->compl_reqs[state->compl_nr++] = req; - if (state->compl_nr == ARRAY_SIZE(state->compl_reqs)) - io_submit_flush_completions(ctx); + io_req_add_compl_list(req); } else { io_req_complete_post(req, res, cflags); } @@ -2711,12 +2688,11 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) req->flags |= REQ_F_REISSUE; return; } + req->result = res; } - WRITE_ONCE(req->result, res); - /* order with io_iopoll_complete() checking ->result */ - smp_wmb(); - WRITE_ONCE(req->iopoll_completed, 1); + /* order with io_iopoll_complete() checking ->iopoll_completed */ + smp_store_release(&req->iopoll_completed, 1); } /* @@ -2725,13 +2701,13 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) * find it from a io_do_iopoll() thread before the issuer is done * accessing the kiocb cookie. */ -static void io_iopoll_req_issued(struct io_kiocb *req) +static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags) { struct io_ring_ctx *ctx = req->ctx; - const bool in_async = io_wq_current_is_worker(); + const bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; /* workqueue context doesn't hold uring_lock, grab it now */ - if (unlikely(in_async)) + if (unlikely(needs_lock)) mutex_lock(&ctx->uring_lock); /* @@ -2739,14 +2715,13 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * how we do polling eventually, not spinning if we're on potentially * different devices. */ - if (list_empty(&ctx->iopoll_list)) { + if (wq_list_empty(&ctx->iopoll_list)) { ctx->poll_multi_queue = false; } else if (!ctx->poll_multi_queue) { struct io_kiocb *list_req; - list_req = list_first_entry(&ctx->iopoll_list, struct io_kiocb, - inflight_entry); - + list_req = container_of(ctx->iopoll_list.first, struct io_kiocb, + comp_list); if (list_req->file != req->file) ctx->poll_multi_queue = true; } @@ -2756,11 +2731,11 @@ static void io_iopoll_req_issued(struct io_kiocb *req) * it to the front so we find it first. */ if (READ_ONCE(req->iopoll_completed)) - list_add(&req->inflight_entry, &ctx->iopoll_list); + wq_list_add_head(&req->comp_list, &ctx->iopoll_list); else - list_add_tail(&req->inflight_entry, &ctx->iopoll_list); + wq_list_add_tail(&req->comp_list, &ctx->iopoll_list); - if (unlikely(in_async)) { + if (unlikely(needs_lock)) { /* * If IORING_SETUP_SQPOLL is enabled, sqes are either handle * in sq thread task context or in io worker task context. If @@ -2785,10 +2760,8 @@ static bool io_bdev_nowait(struct block_device *bdev) * any file. For now, just ensure that anything potentially problematic is done * inline. */ -static bool __io_file_supports_nowait(struct file *file, int rw) +static bool __io_file_supports_nowait(struct file *file, umode_t mode) { - umode_t mode = file_inode(file)->i_mode; - if (S_ISBLK(mode)) { if (IS_ENABLED(CONFIG_BLOCK) && io_bdev_nowait(I_BDEV(file->f_mapping->host))) @@ -2808,28 +2781,32 @@ static bool __io_file_supports_nowait(struct file *file, int rw) /* any ->read/write should understand O_NONBLOCK */ if (file->f_flags & O_NONBLOCK) return true; + return file->f_mode & FMODE_NOWAIT; +} - if (!(file->f_mode & FMODE_NOWAIT)) - return false; - - if (rw == READ) - return file->f_op->read_iter != NULL; +/* + * If we tracked the file through the SCM inflight mechanism, we could support + * any file. For now, just ensure that anything potentially problematic is done + * inline. + */ +static unsigned int io_file_get_flags(struct file *file) +{ + umode_t mode = file_inode(file)->i_mode; + unsigned int res = 0; - return file->f_op->write_iter != NULL; + if (S_ISREG(mode)) + res |= FFS_ISREG; + if (__io_file_supports_nowait(file, mode)) + res |= FFS_NOWAIT; + return res; } -static bool io_file_supports_nowait(struct io_kiocb *req, int rw) +static inline bool io_file_supports_nowait(struct io_kiocb *req) { - if (rw == READ && (req->flags & REQ_F_NOWAIT_READ)) - return true; - else if (rw == WRITE && (req->flags & REQ_F_NOWAIT_WRITE)) - return true; - - return __io_file_supports_nowait(req->file, rw); + return req->flags & REQ_F_SUPPORT_NOWAIT; } -static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, - int rw) +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw.kiocb; @@ -2837,16 +2814,15 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, unsigned ioprio; int ret; - if (!io_req_ffs_set(req) && S_ISREG(file_inode(file)->i_mode)) - req->flags |= REQ_F_ISREG; + if (!io_req_ffs_set(req)) + req->flags |= io_file_get_flags(file) << REQ_F_SUPPORT_NOWAIT_BIT; kiocb->ki_pos = READ_ONCE(sqe->off); if (kiocb->ki_pos == -1 && !(file->f_mode & FMODE_STREAM)) { req->flags |= REQ_F_CUR_POS; kiocb->ki_pos = file->f_pos; } - kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); - kiocb->ki_flags = iocb_flags(kiocb->ki_filp); + kiocb->ki_flags = iocb_flags(file); ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags)); if (unlikely(ret)) return ret; @@ -2857,22 +2833,11 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, * reliably. If not, or it IOCB_NOWAIT is set, don't retry. */ if ((kiocb->ki_flags & IOCB_NOWAIT) || - ((file->f_flags & O_NONBLOCK) && !io_file_supports_nowait(req, rw))) + ((file->f_flags & O_NONBLOCK) && !io_file_supports_nowait(req))) req->flags |= REQ_F_NOWAIT; - ioprio = READ_ONCE(sqe->ioprio); - if (ioprio) { - ret = ioprio_check_cap(ioprio); - if (ret) - return ret; - - kiocb->ki_ioprio = ioprio; - } else - kiocb->ki_ioprio = get_current_ioprio(); - if (ctx->flags & IORING_SETUP_IOPOLL) { - if (!(kiocb->ki_flags & IOCB_DIRECT) || - !kiocb->ki_filp->f_op->iopoll) + if (!(kiocb->ki_flags & IOCB_DIRECT) || !file->f_op->iopoll) return -EOPNOTSUPP; kiocb->ki_flags |= IOCB_HIPRI | IOCB_ALLOC_CACHE; @@ -2884,12 +2849,18 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_complete = io_complete_rw; } - if (req->opcode == IORING_OP_READ_FIXED || - req->opcode == IORING_OP_WRITE_FIXED) { - req->imu = NULL; - io_req_set_rsrc_node(req); + ioprio = READ_ONCE(sqe->ioprio); + if (ioprio) { + ret = ioprio_check_cap(ioprio); + if (ret) + return ret; + + kiocb->ki_ioprio = ioprio; + } else { + kiocb->ki_ioprio = get_current_ioprio(); } + req->imu = NULL; req->rw.addr = READ_ONCE(sqe->addr); req->rw.len = READ_ONCE(sqe->len); req->buf_index = READ_ONCE(sqe->buf_index); @@ -2924,7 +2895,7 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_async_rw *io = req->async_data; /* add previously done IO, if any */ - if (io && io->bytes_done > 0) { + if (req_has_async_data(req) && io->bytes_done > 0) { if (ret < 0) ret = io->bytes_done; else @@ -2947,7 +2918,7 @@ static void kiocb_done(struct kiocb *kiocb, ssize_t ret, struct io_ring_ctx *ctx = req->ctx; req_set_fail(req); - if (!(issue_flags & IO_URING_F_NONBLOCK)) { + if (issue_flags & IO_URING_F_UNLOCKED) { mutex_lock(&ctx->uring_lock); __io_req_complete(req, issue_flags, ret, cflags); mutex_unlock(&ctx->uring_lock); @@ -3018,13 +2989,15 @@ static int __io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter static int io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter) { - struct io_ring_ctx *ctx = req->ctx; struct io_mapped_ubuf *imu = req->imu; u16 index, buf_index = req->buf_index; if (likely(!imu)) { + struct io_ring_ctx *ctx = req->ctx; + if (unlikely(buf_index >= ctx->nr_user_bufs)) return -EFAULT; + io_req_set_rsrc_node(req, ctx); index = array_index_nospec(buf_index, ctx->nr_user_bufs); imu = READ_ONCE(ctx->user_bufs[index]); req->imu = imu; @@ -3051,10 +3024,11 @@ static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock) } static struct io_buffer *io_buffer_select(struct io_kiocb *req, size_t *len, - int bgid, struct io_buffer *kbuf, - bool needs_lock) + int bgid, unsigned int issue_flags) { + struct io_buffer *kbuf = req->kbuf; struct io_buffer *head; + bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; if (req->flags & REQ_F_BUFFER_SELECTED) return kbuf; @@ -3075,34 +3049,32 @@ static struct io_buffer *io_buffer_select(struct io_kiocb *req, size_t *len, } if (*len > kbuf->len) *len = kbuf->len; + req->flags |= REQ_F_BUFFER_SELECTED; + req->kbuf = kbuf; } else { kbuf = ERR_PTR(-ENOBUFS); } io_ring_submit_unlock(req->ctx, needs_lock); - return kbuf; } static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len, - bool needs_lock) + unsigned int issue_flags) { struct io_buffer *kbuf; u16 bgid; - kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; bgid = req->buf_index; - kbuf = io_buffer_select(req, len, bgid, kbuf, needs_lock); + kbuf = io_buffer_select(req, len, bgid, issue_flags); if (IS_ERR(kbuf)) return kbuf; - req->rw.addr = (u64) (unsigned long) kbuf; - req->flags |= REQ_F_BUFFER_SELECTED; return u64_to_user_ptr(kbuf->addr); } #ifdef CONFIG_COMPAT static ssize_t io_compat_import(struct io_kiocb *req, struct iovec *iov, - bool needs_lock) + unsigned int issue_flags) { struct compat_iovec __user *uiov; compat_ssize_t clen; @@ -3118,7 +3090,7 @@ static ssize_t io_compat_import(struct io_kiocb *req, struct iovec *iov, return -EINVAL; len = clen; - buf = io_rw_buffer_select(req, &len, needs_lock); + buf = io_rw_buffer_select(req, &len, issue_flags); if (IS_ERR(buf)) return PTR_ERR(buf); iov[0].iov_base = buf; @@ -3128,7 +3100,7 @@ static ssize_t io_compat_import(struct io_kiocb *req, struct iovec *iov, #endif static ssize_t __io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, - bool needs_lock) + unsigned int issue_flags) { struct iovec __user *uiov = u64_to_user_ptr(req->rw.addr); void __user *buf; @@ -3140,7 +3112,7 @@ static ssize_t __io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, len = iov[0].iov_len; if (len < 0) return -EINVAL; - buf = io_rw_buffer_select(req, &len, needs_lock); + buf = io_rw_buffer_select(req, &len, issue_flags); if (IS_ERR(buf)) return PTR_ERR(buf); iov[0].iov_base = buf; @@ -3149,12 +3121,11 @@ static ssize_t __io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, } static ssize_t io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, - bool needs_lock) + unsigned int issue_flags) { if (req->flags & REQ_F_BUFFER_SELECTED) { - struct io_buffer *kbuf; + struct io_buffer *kbuf = req->kbuf; - kbuf = (struct io_buffer *) (unsigned long) req->rw.addr; iov[0].iov_base = u64_to_user_ptr(kbuf->addr); iov[0].iov_len = kbuf->len; return 0; @@ -3164,52 +3135,72 @@ static ssize_t io_iov_buffer_select(struct io_kiocb *req, struct iovec *iov, #ifdef CONFIG_COMPAT if (req->ctx->compat) - return io_compat_import(req, iov, needs_lock); + return io_compat_import(req, iov, issue_flags); #endif - return __io_iov_buffer_select(req, iov, needs_lock); + return __io_iov_buffer_select(req, iov, issue_flags); } -static int io_import_iovec(int rw, struct io_kiocb *req, struct iovec **iovec, - struct iov_iter *iter, bool needs_lock) +static struct iovec *__io_import_iovec(int rw, struct io_kiocb *req, + struct io_rw_state *s, + unsigned int issue_flags) { - void __user *buf = u64_to_user_ptr(req->rw.addr); - size_t sqe_len = req->rw.len; + struct iov_iter *iter = &s->iter; u8 opcode = req->opcode; + struct iovec *iovec; + void __user *buf; + size_t sqe_len; ssize_t ret; - if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) { - *iovec = NULL; - return io_import_fixed(req, rw, iter); - } + BUILD_BUG_ON(ERR_PTR(0) != NULL); + + if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) + return ERR_PTR(io_import_fixed(req, rw, iter)); /* buffer index only valid with fixed read/write, or buffer select */ - if (req->buf_index && !(req->flags & REQ_F_BUFFER_SELECT)) - return -EINVAL; + if (unlikely(req->buf_index && !(req->flags & REQ_F_BUFFER_SELECT))) + return ERR_PTR(-EINVAL); + + buf = u64_to_user_ptr(req->rw.addr); + sqe_len = req->rw.len; if (opcode == IORING_OP_READ || opcode == IORING_OP_WRITE) { if (req->flags & REQ_F_BUFFER_SELECT) { - buf = io_rw_buffer_select(req, &sqe_len, needs_lock); + buf = io_rw_buffer_select(req, &sqe_len, issue_flags); if (IS_ERR(buf)) - return PTR_ERR(buf); + return ERR_CAST(buf); req->rw.len = sqe_len; } - ret = import_single_range(rw, buf, sqe_len, *iovec, iter); - *iovec = NULL; - return ret; + ret = import_single_range(rw, buf, sqe_len, s->fast_iov, iter); + return ERR_PTR(ret); } + iovec = s->fast_iov; if (req->flags & REQ_F_BUFFER_SELECT) { - ret = io_iov_buffer_select(req, *iovec, needs_lock); + ret = io_iov_buffer_select(req, iovec, issue_flags); if (!ret) - iov_iter_init(iter, rw, *iovec, 1, (*iovec)->iov_len); - *iovec = NULL; - return ret; + iov_iter_init(iter, rw, iovec, 1, iovec->iov_len); + return ERR_PTR(ret); } - return __import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter, + ret = __import_iovec(rw, buf, sqe_len, UIO_FASTIOV, &iovec, iter, req->ctx->compat); + if (unlikely(ret < 0)) + return ERR_PTR(ret); + return iovec; +} + +static inline int io_import_iovec(int rw, struct io_kiocb *req, + struct iovec **iovec, struct io_rw_state *s, + unsigned int issue_flags) +{ + *iovec = __io_import_iovec(rw, req, s, issue_flags); + if (unlikely(IS_ERR(*iovec))) + return PTR_ERR(*iovec); + + iov_iter_save_state(&s->iter, &s->iter_state); + return 0; } static inline loff_t *io_kiocb_ppos(struct kiocb *kiocb) @@ -3234,7 +3225,8 @@ static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter) */ if (kiocb->ki_flags & IOCB_HIPRI) return -EOPNOTSUPP; - if (kiocb->ki_flags & IOCB_NOWAIT) + if ((kiocb->ki_flags & IOCB_NOWAIT) && + !(kiocb->ki_filp->f_flags & O_NONBLOCK)) return -EAGAIN; while (iov_iter_count(iter)) { @@ -3280,7 +3272,7 @@ static void io_req_map_rw(struct io_kiocb *req, const struct iovec *iovec, { struct io_async_rw *rw = req->async_data; - memcpy(&rw->iter, iter, sizeof(*iter)); + memcpy(&rw->s.iter, iter, sizeof(*iter)); rw->free_iovec = iovec; rw->bytes_done = 0; /* can only be fixed buffers, no need to do anything */ @@ -3289,33 +3281,36 @@ static void io_req_map_rw(struct io_kiocb *req, const struct iovec *iovec, if (!iovec) { unsigned iov_off = 0; - rw->iter.iov = rw->fast_iov; + rw->s.iter.iov = rw->s.fast_iov; if (iter->iov != fast_iov) { iov_off = iter->iov - fast_iov; - rw->iter.iov += iov_off; + rw->s.iter.iov += iov_off; } - if (rw->fast_iov != fast_iov) - memcpy(rw->fast_iov + iov_off, fast_iov + iov_off, + if (rw->s.fast_iov != fast_iov) + memcpy(rw->s.fast_iov + iov_off, fast_iov + iov_off, sizeof(struct iovec) * iter->nr_segs); } else { req->flags |= REQ_F_NEED_CLEANUP; } } -static inline int io_alloc_async_data(struct io_kiocb *req) +static inline bool io_alloc_async_data(struct io_kiocb *req) { WARN_ON_ONCE(!io_op_defs[req->opcode].async_size); req->async_data = kmalloc(io_op_defs[req->opcode].async_size, GFP_KERNEL); - return req->async_data == NULL; + if (req->async_data) { + req->flags |= REQ_F_ASYNC_DATA; + return false; + } + return true; } static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec, - const struct iovec *fast_iov, - struct iov_iter *iter, bool force) + struct io_rw_state *s, bool force) { if (!force && !io_op_defs[req->opcode].needs_async_setup) return 0; - if (!req->async_data) { + if (!req_has_async_data(req)) { struct io_async_rw *iorw; if (io_alloc_async_data(req)) { @@ -3323,10 +3318,10 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec, return -ENOMEM; } - io_req_map_rw(req, iovec, fast_iov, iter); + io_req_map_rw(req, iovec, s->fast_iov, &s->iter); iorw = req->async_data; /* we've copied and mapped the iter, ensure state is saved */ - iov_iter_save_state(&iorw->iter, &iorw->iter_state); + iov_iter_save_state(&iorw->s.iter, &iorw->s.iter_state); } return 0; } @@ -3334,10 +3329,11 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec, static inline int io_rw_prep_async(struct io_kiocb *req, int rw) { struct io_async_rw *iorw = req->async_data; - struct iovec *iov = iorw->fast_iov; + struct iovec *iov; int ret; - ret = io_import_iovec(rw, req, &iov, &iorw->iter, false); + /* submission path, ->uring_lock should already be taken */ + ret = io_import_iovec(rw, req, &iov, &iorw->s, 0); if (unlikely(ret < 0)) return ret; @@ -3345,7 +3341,6 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw) iorw->free_iovec = iov; if (iov) req->flags |= REQ_F_NEED_CLEANUP; - iov_iter_save_state(&iorw->iter, &iorw->iter_state); return 0; } @@ -3353,11 +3348,11 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { if (unlikely(!(req->file->f_mode & FMODE_READ))) return -EBADF; - return io_prep_rw(req, sqe, READ); + return io_prep_rw(req, sqe); } /* - * This is our waitqueue callback handler, registered through lock_page_async() + * This is our waitqueue callback handler, registered through __folio_lock_async() * when we initially tried to do the IO with the iocb armed our waitqueue. * This gets called when the page is unlocked, and we generally expect that to * happen when the page IO is completed and the page is now uptodate. This will @@ -3429,7 +3424,7 @@ static bool io_rw_should_retry(struct io_kiocb *req) static inline int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter) { - if (req->file->f_op->read_iter) + if (likely(req->file->f_op->read_iter)) return call_read_iter(req->file, &req->rw.kiocb, iter); else if (req->file->f_op->read) return loop_rw_iter(READ, req, iter); @@ -3445,43 +3440,40 @@ static bool need_read_all(struct io_kiocb *req) static int io_read(struct io_kiocb *req, unsigned int issue_flags) { - struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct io_rw_state __s, *s = &__s; + struct iovec *iovec; struct kiocb *kiocb = &req->rw.kiocb; - struct iov_iter __iter, *iter = &__iter; - struct io_async_rw *rw = req->async_data; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; - struct iov_iter_state __state, *state; + struct io_async_rw *rw; ssize_t ret, ret2; - if (rw) { - iter = &rw->iter; - state = &rw->iter_state; + if (!req_has_async_data(req)) { + ret = io_import_iovec(READ, req, &iovec, s, issue_flags); + if (unlikely(ret < 0)) + return ret; + } else { + rw = req->async_data; + s = &rw->s; /* * We come here from an earlier attempt, restore our state to * match in case it doesn't. It's cheap enough that we don't * need to make this conditional. */ - iov_iter_restore(iter, state); + iov_iter_restore(&s->iter, &s->iter_state); iovec = NULL; - } else { - ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock); - if (ret < 0) - return ret; - state = &__state; - iov_iter_save_state(iter, state); } - req->result = iov_iter_count(iter); + req->result = iov_iter_count(&s->iter); - /* Ensure we clear previously set non-block flag */ - if (!force_nonblock) - kiocb->ki_flags &= ~IOCB_NOWAIT; - else + if (force_nonblock) { + /* If the file doesn't support async, just async punt */ + if (unlikely(!io_file_supports_nowait(req))) { + ret = io_setup_async_rw(req, iovec, s, true); + return ret ?: -EAGAIN; + } kiocb->ki_flags |= IOCB_NOWAIT; - - /* If the file doesn't support async, just async punt */ - if (force_nonblock && !io_file_supports_nowait(req, READ)) { - ret = io_setup_async_rw(req, iovec, inline_vecs, iter, true); - return ret ?: -EAGAIN; + } else { + /* Ensure we clear previously set non-block flag */ + kiocb->ki_flags &= ~IOCB_NOWAIT; } ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result); @@ -3490,7 +3482,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) return ret; } - ret = io_iter_do_read(req, iter); + ret = io_iter_do_read(req, &s->iter); if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) { req->flags &= ~REQ_F_REISSUE; @@ -3503,7 +3495,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) ret = 0; } else if (ret == -EIOCBQUEUED) { goto out_free; - } else if (ret <= 0 || ret == req->result || !force_nonblock || + } else if (ret == req->result || ret <= 0 || !force_nonblock || (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) { /* read all, failed, already did sync or don't want to retry */ goto done; @@ -3514,22 +3506,19 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) * untouched in case of error. Restore it and we'll advance it * manually if we need to. */ - iov_iter_restore(iter, state); + iov_iter_restore(&s->iter, &s->iter_state); - ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true); + ret2 = io_setup_async_rw(req, iovec, s, true); if (ret2) return ret2; iovec = NULL; rw = req->async_data; + s = &rw->s; /* * Now use our persistent iterator and state, if we aren't already. * We've restored and mapped the iter to match. */ - if (iter != &rw->iter) { - iter = &rw->iter; - state = &rw->iter_state; - } do { /* @@ -3537,11 +3526,11 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) * above or inside this loop. Advance the iter by the bytes * that were consumed. */ - iov_iter_advance(iter, ret); - if (!iov_iter_count(iter)) + iov_iter_advance(&s->iter, ret); + if (!iov_iter_count(&s->iter)) break; rw->bytes_done += ret; - iov_iter_save_state(iter, state); + iov_iter_save_state(&s->iter, &s->iter_state); /* if we can retry, do so with the callbacks armed */ if (!io_rw_should_retry(req)) { @@ -3555,12 +3544,12 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags) * desired page gets unlocked. We can also get a partial read * here, and if we do, then just retry at the new offset. */ - ret = io_iter_do_read(req, iter); + ret = io_iter_do_read(req, &s->iter); if (ret == -EIOCBQUEUED) return 0; /* we got some bytes, but not all. retry. */ kiocb->ki_flags &= ~IOCB_WAITQ; - iov_iter_restore(iter, state); + iov_iter_restore(&s->iter, &s->iter_state); } while (ret > 0); done: kiocb_done(kiocb, ret, issue_flags); @@ -3575,47 +3564,46 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { if (unlikely(!(req->file->f_mode & FMODE_WRITE))) return -EBADF; - return io_prep_rw(req, sqe, WRITE); + req->rw.kiocb.ki_hint = ki_hint_validate(file_write_hint(req->file)); + return io_prep_rw(req, sqe); } static int io_write(struct io_kiocb *req, unsigned int issue_flags) { - struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct io_rw_state __s, *s = &__s; + struct iovec *iovec; struct kiocb *kiocb = &req->rw.kiocb; - struct iov_iter __iter, *iter = &__iter; - struct io_async_rw *rw = req->async_data; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; - struct iov_iter_state __state, *state; ssize_t ret, ret2; - if (rw) { - iter = &rw->iter; - state = &rw->iter_state; - iov_iter_restore(iter, state); - iovec = NULL; - } else { - ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock); - if (ret < 0) + if (!req_has_async_data(req)) { + ret = io_import_iovec(WRITE, req, &iovec, s, issue_flags); + if (unlikely(ret < 0)) return ret; - state = &__state; - iov_iter_save_state(iter, state); + } else { + struct io_async_rw *rw = req->async_data; + + s = &rw->s; + iov_iter_restore(&s->iter, &s->iter_state); + iovec = NULL; } - req->result = iov_iter_count(iter); + req->result = iov_iter_count(&s->iter); - /* Ensure we clear previously set non-block flag */ - if (!force_nonblock) - kiocb->ki_flags &= ~IOCB_NOWAIT; - else - kiocb->ki_flags |= IOCB_NOWAIT; + if (force_nonblock) { + /* If the file doesn't support async, just async punt */ + if (unlikely(!io_file_supports_nowait(req))) + goto copy_iov; - /* If the file doesn't support async, just async punt */ - if (force_nonblock && !io_file_supports_nowait(req, WRITE)) - goto copy_iov; + /* file path doesn't support NOWAIT for non-direct_IO */ + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) && + (req->flags & REQ_F_ISREG)) + goto copy_iov; - /* file path doesn't support NOWAIT for non-direct_IO */ - if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) && - (req->flags & REQ_F_ISREG)) - goto copy_iov; + kiocb->ki_flags |= IOCB_NOWAIT; + } else { + /* Ensure we clear previously set non-block flag */ + kiocb->ki_flags &= ~IOCB_NOWAIT; + } ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result); if (unlikely(ret)) @@ -3635,10 +3623,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) } kiocb->ki_flags |= IOCB_WRITE; - if (req->file->f_op->write_iter) - ret2 = call_write_iter(req->file, kiocb, iter); + if (likely(req->file->f_op->write_iter)) + ret2 = call_write_iter(req->file, kiocb, &s->iter); else if (req->file->f_op->write) - ret2 = loop_rw_iter(WRITE, req, iter); + ret2 = loop_rw_iter(WRITE, req, &s->iter); else ret2 = -EINVAL; @@ -3658,14 +3646,14 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags) goto done; if (!force_nonblock || ret2 != -EAGAIN) { /* IOPOLL retry should happen for io-wq threads */ - if ((req->ctx->flags & IORING_SETUP_IOPOLL) && ret2 == -EAGAIN) + if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL)) goto copy_iov; done: kiocb_done(kiocb, ret2, issue_flags); } else { copy_iov: - iov_iter_restore(iter, state); - ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false); + iov_iter_restore(&s->iter, &s->iter_state); + ret = io_setup_async_rw(req, iovec, s, false); return ret ?: -EAGAIN; } out_free: @@ -3801,7 +3789,7 @@ static int io_mkdirat_prep(struct io_kiocb *req, return 0; } -static int io_mkdirat(struct io_kiocb *req, int issue_flags) +static int io_mkdirat(struct io_kiocb *req, unsigned int issue_flags) { struct io_mkdir *mkd = &req->mkdir; int ret; @@ -3850,7 +3838,7 @@ static int io_symlinkat_prep(struct io_kiocb *req, return 0; } -static int io_symlinkat(struct io_kiocb *req, int issue_flags) +static int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags) { struct io_symlink *sl = &req->symlink; int ret; @@ -3900,7 +3888,7 @@ static int io_linkat_prep(struct io_kiocb *req, return 0; } -static int io_linkat(struct io_kiocb *req, int issue_flags) +static int io_linkat(struct io_kiocb *req, unsigned int issue_flags) { struct io_hardlink *lnk = &req->hardlink; int ret; @@ -4319,9 +4307,9 @@ static int io_remove_buffers(struct io_kiocb *req, unsigned int issue_flags) struct io_ring_ctx *ctx = req->ctx; struct io_buffer *head; int ret = 0; - bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; - io_ring_submit_lock(ctx, !force_nonblock); + io_ring_submit_lock(ctx, needs_lock); lockdep_assert_held(&ctx->uring_lock); @@ -4334,7 +4322,7 @@ static int io_remove_buffers(struct io_kiocb *req, unsigned int issue_flags) /* complete before unlock, IOPOLL may need the lock */ __io_req_complete(req, issue_flags, ret, 0); - io_ring_submit_unlock(ctx, !force_nonblock); + io_ring_submit_unlock(ctx, needs_lock); return 0; } @@ -4406,9 +4394,9 @@ static int io_provide_buffers(struct io_kiocb *req, unsigned int issue_flags) struct io_ring_ctx *ctx = req->ctx; struct io_buffer *head, *list; int ret = 0; - bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; - io_ring_submit_lock(ctx, !force_nonblock); + io_ring_submit_lock(ctx, needs_lock); lockdep_assert_held(&ctx->uring_lock); @@ -4424,7 +4412,7 @@ static int io_provide_buffers(struct io_kiocb *req, unsigned int issue_flags) req_set_fail(req); /* complete before unlock, IOPOLL may need the lock */ __io_req_complete(req, issue_flags, ret, 0); - io_ring_submit_unlock(ctx, !force_nonblock); + io_ring_submit_unlock(ctx, needs_lock); return 0; } @@ -4757,8 +4745,9 @@ static int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(!sock)) return -ENOTSOCK; - kmsg = req->async_data; - if (!kmsg) { + if (req_has_async_data(req)) { + kmsg = req->async_data; + } else { ret = io_sendmsg_copy_hdr(req, &iomsg); if (ret) return ret; @@ -4917,23 +4906,16 @@ static int io_recvmsg_copy_hdr(struct io_kiocb *req, } static struct io_buffer *io_recv_buffer_select(struct io_kiocb *req, - bool needs_lock) + unsigned int issue_flags) { struct io_sr_msg *sr = &req->sr_msg; - struct io_buffer *kbuf; - kbuf = io_buffer_select(req, &sr->len, sr->bgid, sr->kbuf, needs_lock); - if (IS_ERR(kbuf)) - return kbuf; - - sr->kbuf = kbuf; - req->flags |= REQ_F_BUFFER_SELECTED; - return kbuf; + return io_buffer_select(req, &sr->len, sr->bgid, issue_flags); } static inline unsigned int io_put_recv_kbuf(struct io_kiocb *req) { - return io_put_kbuf(req, req->sr_msg.kbuf); + return io_put_kbuf(req, req->kbuf); } static int io_recvmsg_prep_async(struct io_kiocb *req) @@ -4981,8 +4963,9 @@ static int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(!sock)) return -ENOTSOCK; - kmsg = req->async_data; - if (!kmsg) { + if (req_has_async_data(req)) { + kmsg = req->async_data; + } else { ret = io_recvmsg_copy_hdr(req, &iomsg); if (ret) return ret; @@ -4990,7 +4973,7 @@ static int io_recvmsg(struct io_kiocb *req, unsigned int issue_flags) } if (req->flags & REQ_F_BUFFER_SELECT) { - kbuf = io_recv_buffer_select(req, !force_nonblock); + kbuf = io_recv_buffer_select(req, issue_flags); if (IS_ERR(kbuf)) return PTR_ERR(kbuf); kmsg->fast_iov[0].iov_base = u64_to_user_ptr(kbuf->addr); @@ -5042,7 +5025,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags) return -ENOTSOCK; if (req->flags & REQ_F_BUFFER_SELECT) { - kbuf = io_recv_buffer_select(req, !force_nonblock); + kbuf = io_recv_buffer_select(req, issue_flags); if (IS_ERR(kbuf)) return PTR_ERR(kbuf); buf = u64_to_user_ptr(kbuf->addr); @@ -5173,7 +5156,7 @@ static int io_connect(struct io_kiocb *req, unsigned int issue_flags) int ret; bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; - if (req->async_data) { + if (req_has_async_data(req)) { io = req->async_data; } else { ret = move_addr_to_kernel(req->connect.addr, @@ -5189,7 +5172,7 @@ static int io_connect(struct io_kiocb *req, unsigned int issue_flags) ret = __sys_connect_file(req->file, &io->address, req->connect.addr_len, file_flags); if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) { - if (req->async_data) + if (req_has_async_data(req)) return -EAGAIN; if (io_alloc_async_data(req)) { ret = -ENOMEM; @@ -5349,16 +5332,6 @@ static bool __io_poll_complete(struct io_kiocb *req, __poll_t mask) return !(flags & IORING_CQE_F_MORE); } -static inline bool io_poll_complete(struct io_kiocb *req, __poll_t mask) - __must_hold(&req->ctx->completion_lock) -{ - bool done; - - done = __io_poll_complete(req, mask); - io_commit_cqring(req->ctx); - return done; -} - static void io_poll_task_func(struct io_kiocb *req, bool *locked) { struct io_ring_ctx *ctx = req->ctx; @@ -5480,7 +5453,10 @@ static void __io_queue_proc(struct io_poll_iocb *poll, struct io_poll_table *pt, io_init_poll_iocb(poll, poll_one->events, io_poll_double_wake); req_ref_get(req); poll->wait.private = req; + *poll_ptr = poll; + if (req->opcode == IORING_OP_POLL_ADD) + req->flags |= REQ_F_ASYNC_DATA; } pt->nr_entries++; @@ -5604,17 +5580,13 @@ static int io_arm_poll_handler(struct io_kiocb *req) struct async_poll *apoll; struct io_poll_table ipt; __poll_t ret, mask = EPOLLONESHOT | POLLERR | POLLPRI; - int rw; - if (!req->file || !file_can_poll(req->file)) - return IO_APOLL_ABORTED; - if (req->flags & REQ_F_POLLED) - return IO_APOLL_ABORTED; if (!def->pollin && !def->pollout) return IO_APOLL_ABORTED; + if (!file_can_poll(req->file) || (req->flags & REQ_F_POLLED)) + return IO_APOLL_ABORTED; if (def->pollin) { - rw = READ; mask |= POLLIN | POLLRDNORM; /* If reading from MSG_ERRQUEUE using recvmsg, ignore POLLIN */ @@ -5622,14 +5594,9 @@ static int io_arm_poll_handler(struct io_kiocb *req) (req->sr_msg.msg_flags & MSG_ERRQUEUE)) mask &= ~POLLIN; } else { - rw = WRITE; mask |= POLLOUT | POLLWRNORM; } - /* if we can't nonblock try, then no point in arming a poll handler */ - if (!io_file_supports_nowait(req, rw)) - return IO_APOLL_ABORTED; - apoll = kmalloc(sizeof(*apoll), GFP_ATOMIC); if (unlikely(!apoll)) return IO_APOLL_ABORTED; @@ -5690,8 +5657,8 @@ static bool io_poll_remove_one(struct io_kiocb *req) /* * Returns true if we found and killed one or more poll requests */ -static bool io_poll_remove_all(struct io_ring_ctx *ctx, struct task_struct *tsk, - bool cancel_all) +static __cold bool io_poll_remove_all(struct io_ring_ctx *ctx, + struct task_struct *tsk, bool cancel_all) { struct hlist_node *tmp; struct io_kiocb *req; @@ -5845,7 +5812,8 @@ static int io_poll_add(struct io_kiocb *req, unsigned int issue_flags) if (mask) { /* no async, we'd stolen it */ ipt.error = 0; - done = io_poll_complete(req, mask); + done = __io_poll_complete(req, mask); + io_commit_cqring(req->ctx); } spin_unlock(&ctx->completion_lock); @@ -5921,7 +5889,10 @@ err: static void io_req_task_timeout(struct io_kiocb *req, bool *locked) { - req_set_fail(req); + struct io_timeout_data *data = req->async_data; + + if (!(data->flags & IORING_TIMEOUT_ETIME_SUCCESS)) + req_set_fail(req); io_req_complete_post(req, -ETIME, 0); } @@ -6127,7 +6098,8 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (off && is_timeout_link) return -EINVAL; flags = READ_ONCE(sqe->timeout_flags); - if (flags & ~(IORING_TIMEOUT_ABS | IORING_TIMEOUT_CLOCK_MASK)) + if (flags & ~(IORING_TIMEOUT_ABS | IORING_TIMEOUT_CLOCK_MASK | + IORING_TIMEOUT_ETIME_SUCCESS)) return -EINVAL; /* more than one clock specified is invalid, obviously */ if (hweight32(flags & IORING_TIMEOUT_CLOCK_MASK) > 1) @@ -6138,7 +6110,9 @@ static int io_timeout_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(off && !req->ctx->off_timeout_used)) req->ctx->off_timeout_used = true; - if (!req->async_data && io_alloc_async_data(req)) + if (WARN_ON_ONCE(req_has_async_data(req))) + return -EFAULT; + if (io_alloc_async_data(req)) return -ENOMEM; data = req->async_data; @@ -6295,6 +6269,7 @@ static int io_async_cancel(struct io_kiocb *req, unsigned int issue_flags) { struct io_ring_ctx *ctx = req->ctx; u64 sqe_addr = req->cancel.addr; + bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; struct io_tctx_node *node; int ret; @@ -6303,7 +6278,7 @@ static int io_async_cancel(struct io_kiocb *req, unsigned int issue_flags) goto done; /* slow path, try all io-wq's */ - io_ring_submit_lock(ctx, !(issue_flags & IO_URING_F_NONBLOCK)); + io_ring_submit_lock(ctx, needs_lock); ret = -ENOENT; list_for_each_entry(node, &ctx->tctx_list, ctx_node) { struct io_uring_task *tctx = node->task->io_uring; @@ -6312,7 +6287,7 @@ static int io_async_cancel(struct io_kiocb *req, unsigned int issue_flags) if (ret != -ENOENT) break; } - io_ring_submit_unlock(ctx, !(issue_flags & IO_URING_F_NONBLOCK)); + io_ring_submit_unlock(ctx, needs_lock); done: if (ret < 0) req_set_fail(req); @@ -6339,6 +6314,7 @@ static int io_rsrc_update_prep(struct io_kiocb *req, static int io_files_update(struct io_kiocb *req, unsigned int issue_flags) { struct io_ring_ctx *ctx = req->ctx; + bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; struct io_uring_rsrc_update2 up; int ret; @@ -6348,10 +6324,10 @@ static int io_files_update(struct io_kiocb *req, unsigned int issue_flags) up.tags = 0; up.resv = 0; - io_ring_submit_lock(ctx, !(issue_flags & IO_URING_F_NONBLOCK)); + io_ring_submit_lock(ctx, needs_lock); ret = __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, req->rsrc_update.nr_args); - io_ring_submit_unlock(ctx, !(issue_flags & IO_URING_F_NONBLOCK)); + io_ring_submit_unlock(ctx, needs_lock); if (ret < 0) req_set_fail(req); @@ -6447,7 +6423,7 @@ static int io_req_prep_async(struct io_kiocb *req) { if (!io_op_defs[req->opcode].needs_async_setup) return 0; - if (WARN_ON_ONCE(req->async_data)) + if (WARN_ON_ONCE(req_has_async_data(req))) return -EFAULT; if (io_alloc_async_data(req)) return -EAGAIN; @@ -6479,68 +6455,39 @@ static u32 io_get_sequence(struct io_kiocb *req) return seq; } -static bool io_drain_req(struct io_kiocb *req) +static __cold void io_drain_req(struct io_kiocb *req) { - struct io_kiocb *pos; struct io_ring_ctx *ctx = req->ctx; struct io_defer_entry *de; int ret; - u32 seq; - - if (req->flags & REQ_F_FAIL) { - io_req_complete_fail_submit(req); - return true; - } - - /* - * If we need to drain a request in the middle of a link, drain the - * head request and the next request/link after the current link. - * Considering sequential execution of links, IOSQE_IO_DRAIN will be - * maintained for every request of our link. - */ - if (ctx->drain_next) { - req->flags |= REQ_F_IO_DRAIN; - ctx->drain_next = false; - } - /* not interested in head, start from the first linked */ - io_for_each_link(pos, req->link) { - if (pos->flags & REQ_F_IO_DRAIN) { - ctx->drain_next = true; - req->flags |= REQ_F_IO_DRAIN; - break; - } - } + u32 seq = io_get_sequence(req); /* Still need defer if there is pending req in defer list. */ - if (likely(list_empty_careful(&ctx->defer_list) && - !(req->flags & REQ_F_IO_DRAIN))) { + if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list)) { +queue: ctx->drain_active = false; - return false; + io_req_task_queue(req); + return; } - seq = io_get_sequence(req); - /* Still a chance to pass the sequence check */ - if (!req_need_defer(req, seq) && list_empty_careful(&ctx->defer_list)) - return false; - ret = io_req_prep_async(req); - if (ret) - goto fail; + if (ret) { +fail: + io_req_complete_failed(req, ret); + return; + } io_prep_async_link(req); de = kmalloc(sizeof(*de), GFP_KERNEL); if (!de) { ret = -ENOMEM; -fail: - io_req_complete_failed(req, ret); - return true; + goto fail; } spin_lock(&ctx->completion_lock); if (!req_need_defer(req, seq) && list_empty(&ctx->defer_list)) { spin_unlock(&ctx->completion_lock); kfree(de); - io_queue_async_work(req, NULL); - return true; + goto queue; } trace_io_uring_defer(ctx, req, req->user_data); @@ -6548,23 +6495,13 @@ fail: de->seq = seq; list_add_tail(&de->list, &ctx->defer_list); spin_unlock(&ctx->completion_lock); - return true; } static void io_clean_op(struct io_kiocb *req) { if (req->flags & REQ_F_BUFFER_SELECTED) { - switch (req->opcode) { - case IORING_OP_READV: - case IORING_OP_READ_FIXED: - case IORING_OP_READ: - kfree((void *)(unsigned long)req->rw.addr); - break; - case IORING_OP_RECVMSG: - case IORING_OP_RECV: - kfree(req->sr_msg.kbuf); - break; - } + kfree(req->kbuf); + req->kbuf = NULL; } if (req->flags & REQ_F_NEED_CLEANUP) { @@ -6629,17 +6566,19 @@ static void io_clean_op(struct io_kiocb *req) } if (req->flags & REQ_F_CREDS) put_cred(req->creds); - + if (req->flags & REQ_F_ASYNC_DATA) { + kfree(req->async_data); + req->async_data = NULL; + } req->flags &= ~IO_REQ_CLEAN_FLAGS; } static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) { - struct io_ring_ctx *ctx = req->ctx; const struct cred *creds = NULL; int ret; - if ((req->flags & REQ_F_CREDS) && req->creds != current_cred()) + if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred())) creds = override_creds(req->creds); switch (req->opcode) { @@ -6762,8 +6701,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) if (ret) return ret; /* If the op doesn't have a file, we're not polling for it */ - if ((ctx->flags & IORING_SETUP_IOPOLL) && req->file) - io_iopoll_req_issued(req); + if ((req->ctx->flags & IORING_SETUP_IOPOLL) && req->file) + io_iopoll_req_issued(req, issue_flags); return 0; } @@ -6779,6 +6718,8 @@ static struct io_wq_work *io_wq_free_work(struct io_wq_work *work) static void io_wq_submit_work(struct io_wq_work *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); + unsigned int issue_flags = IO_URING_F_UNLOCKED; + bool needs_poll = false; struct io_kiocb *timeout; int ret = 0; @@ -6793,23 +6734,42 @@ static void io_wq_submit_work(struct io_wq_work *work) io_queue_linked_timeout(timeout); /* either cancelled or io-wq is dying, so don't touch tctx->iowq */ - if (work->flags & IO_WQ_WORK_CANCEL) - ret = -ECANCELED; + if (work->flags & IO_WQ_WORK_CANCEL) { + io_req_task_queue_fail(req, -ECANCELED); + return; + } - if (!ret) { - do { - ret = io_issue_sqe(req, 0); - /* - * We can get EAGAIN for polled IO even though we're - * forcing a sync submission from here, since we can't - * wait for request slots on the block side. - */ - if (ret != -EAGAIN) - break; - cond_resched(); - } while (1); + if (req->flags & REQ_F_FORCE_ASYNC) { + const struct io_op_def *def = &io_op_defs[req->opcode]; + bool opcode_poll = def->pollin || def->pollout; + + if (opcode_poll && file_can_poll(req->file)) { + needs_poll = true; + issue_flags |= IO_URING_F_NONBLOCK; + } } + do { + ret = io_issue_sqe(req, issue_flags); + if (ret != -EAGAIN) + break; + /* + * We can get EAGAIN for iopolled IO even though we're + * forcing a sync submission from here, since we can't + * wait for request slots on the block side. + */ + if (!needs_poll) { + cond_resched(); + continue; + } + + if (io_arm_poll_handler(req) == IO_APOLL_OK) + return; + /* aborted or ready, in either case retry blocking */ + needs_poll = false; + issue_flags &= ~IO_URING_F_NONBLOCK; + } while (1); + /* avoid locking problems by failing it from a clean context */ if (ret) io_req_task_queue_fail(req, ret); @@ -6833,12 +6793,7 @@ static void io_fixed_file_set(struct io_fixed_file *file_slot, struct file *file { unsigned long file_ptr = (unsigned long) file; - if (__io_file_supports_nowait(file, READ)) - file_ptr |= FFS_ASYNC_READ; - if (__io_file_supports_nowait(file, WRITE)) - file_ptr |= FFS_ASYNC_WRITE; - if (S_ISREG(file_inode(file)->i_mode)) - file_ptr |= FFS_ISREG; + file_ptr |= io_file_get_flags(file); file_slot->file_ptr = file_ptr; } @@ -6855,8 +6810,8 @@ static inline struct file *io_file_get_fixed(struct io_ring_ctx *ctx, file = (struct file *) (file_ptr & FFS_MASK); file_ptr &= ~FFS_MASK; /* mask in overlapping REQ_F and FFS bits */ - req->flags |= (file_ptr << REQ_F_NOWAIT_READ_BIT); - io_req_set_rsrc_node(req); + req->flags |= (file_ptr << REQ_F_SUPPORT_NOWAIT_BIT); + io_req_set_rsrc_node(req, ctx); return file; } @@ -6948,67 +6903,66 @@ static void io_queue_linked_timeout(struct io_kiocb *req) io_put_req(req); } -static void __io_queue_sqe(struct io_kiocb *req) +static void io_queue_sqe_arm_apoll(struct io_kiocb *req) + __must_hold(&req->ctx->uring_lock) +{ + struct io_kiocb *linked_timeout = io_prep_linked_timeout(req); + + switch (io_arm_poll_handler(req)) { + case IO_APOLL_READY: + if (linked_timeout) { + io_queue_linked_timeout(linked_timeout); + linked_timeout = NULL; + } + io_req_task_queue(req); + break; + case IO_APOLL_ABORTED: + /* + * Queued up for async execution, worker will release + * submit reference when the iocb is actually submitted. + */ + io_queue_async_work(req, NULL); + break; + } + + if (linked_timeout) + io_queue_linked_timeout(linked_timeout); +} + +static inline void __io_queue_sqe(struct io_kiocb *req) __must_hold(&req->ctx->uring_lock) { struct io_kiocb *linked_timeout; int ret; -issue_sqe: ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER); + if (req->flags & REQ_F_COMPLETE_INLINE) { + io_req_add_compl_list(req); + return; + } /* * We async punt it if the file wasn't marked NOWAIT, or if the file * doesn't support non-blocking read/write attempts */ if (likely(!ret)) { - if (req->flags & REQ_F_COMPLETE_INLINE) { - struct io_ring_ctx *ctx = req->ctx; - struct io_submit_state *state = &ctx->submit_state; - - state->compl_reqs[state->compl_nr++] = req; - if (state->compl_nr == ARRAY_SIZE(state->compl_reqs)) - io_submit_flush_completions(ctx); - return; - } - linked_timeout = io_prep_linked_timeout(req); if (linked_timeout) io_queue_linked_timeout(linked_timeout); } else if (ret == -EAGAIN && !(req->flags & REQ_F_NOWAIT)) { - linked_timeout = io_prep_linked_timeout(req); - - switch (io_arm_poll_handler(req)) { - case IO_APOLL_READY: - if (linked_timeout) - io_unprep_linked_timeout(req); - goto issue_sqe; - case IO_APOLL_ABORTED: - /* - * Queued up for async execution, worker will release - * submit reference when the iocb is actually submitted. - */ - io_queue_async_work(req, NULL); - break; - } - - if (linked_timeout) - io_queue_linked_timeout(linked_timeout); + io_queue_sqe_arm_apoll(req); } else { io_req_complete_failed(req, ret); } } -static inline void io_queue_sqe(struct io_kiocb *req) +static void io_queue_sqe_fallback(struct io_kiocb *req) __must_hold(&req->ctx->uring_lock) { - if (unlikely(req->ctx->drain_active) && io_drain_req(req)) - return; - - if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) { - __io_queue_sqe(req); - } else if (req->flags & REQ_F_FAIL) { + if (req->flags & REQ_F_FAIL) { io_req_complete_fail_submit(req); + } else if (unlikely(req->ctx->drain_active)) { + io_drain_req(req); } else { int ret = io_req_prep_async(req); @@ -7019,6 +6973,15 @@ static inline void io_queue_sqe(struct io_kiocb *req) } } +static inline void io_queue_sqe(struct io_kiocb *req) + __must_hold(&req->ctx->uring_lock) +{ + if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) + __io_queue_sqe(req); + else + io_queue_sqe_fallback(req); +} + /* * Check SQE restrictions (opcode and flags). * @@ -7028,9 +6991,6 @@ static inline bool io_check_restriction(struct io_ring_ctx *ctx, struct io_kiocb *req, unsigned int sqe_flags) { - if (likely(!ctx->restricted)) - return true; - if (!test_bit(req->opcode, ctx->restrictions.sqe_op)) return false; @@ -7045,16 +7005,35 @@ static inline bool io_check_restriction(struct io_ring_ctx *ctx, return true; } +static void io_init_req_drain(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + struct io_kiocb *head = ctx->submit_state.link.head; + + ctx->drain_active = true; + if (head) { + /* + * If we need to drain a request in the middle of a link, drain + * the head request and the next request/link after the current + * link. Considering sequential execution of links, + * IOSQE_IO_DRAIN will be maintained for every request of our + * link. + */ + head->flags |= IOSQE_IO_DRAIN | REQ_F_FORCE_ASYNC; + ctx->drain_next = true; + } +} + static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct io_uring_sqe *sqe) __must_hold(&ctx->uring_lock) { - struct io_submit_state *state; unsigned int sqe_flags; - int personality, ret = 0; + int personality; + u8 opcode; /* req is partially pre-initialised, see io_preinit_req() */ - req->opcode = READ_ONCE(sqe->opcode); + req->opcode = opcode = READ_ONCE(sqe->opcode); /* same numerical values with corresponding REQ_F_*, safe to copy */ req->flags = sqe_flags = READ_ONCE(sqe->flags); req->user_data = READ_ONCE(sqe->user_data); @@ -7062,19 +7041,52 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, req->fixed_rsrc_refs = NULL; req->task = current; - /* enforce forwards compatibility on users */ - if (unlikely(sqe_flags & ~SQE_VALID_FLAGS)) - return -EINVAL; - if (unlikely(req->opcode >= IORING_OP_LAST)) + if (unlikely(opcode >= IORING_OP_LAST)) { + req->opcode = 0; return -EINVAL; - if (!io_check_restriction(ctx, req, sqe_flags)) - return -EACCES; + } + if (unlikely(sqe_flags & ~SQE_COMMON_FLAGS)) { + /* enforce forwards compatibility on users */ + if (sqe_flags & ~SQE_VALID_FLAGS) + return -EINVAL; + if ((sqe_flags & IOSQE_BUFFER_SELECT) && + !io_op_defs[opcode].buffer_select) + return -EOPNOTSUPP; + if (sqe_flags & IOSQE_IO_DRAIN) + io_init_req_drain(req); + } + if (unlikely(ctx->restricted || ctx->drain_active || ctx->drain_next)) { + if (ctx->restricted && !io_check_restriction(ctx, req, sqe_flags)) + return -EACCES; + /* knock it to the slow queue path, will be drained there */ + if (ctx->drain_active) + req->flags |= REQ_F_FORCE_ASYNC; + /* if there is no link, we're at "next" request and need to drain */ + if (unlikely(ctx->drain_next) && !ctx->submit_state.link.head) { + ctx->drain_next = false; + ctx->drain_active = true; + req->flags |= IOSQE_IO_DRAIN | REQ_F_FORCE_ASYNC; + } + } - if ((sqe_flags & IOSQE_BUFFER_SELECT) && - !io_op_defs[req->opcode].buffer_select) - return -EOPNOTSUPP; - if (unlikely(sqe_flags & IOSQE_IO_DRAIN)) - ctx->drain_active = true; + if (io_op_defs[opcode].needs_file) { + struct io_submit_state *state = &ctx->submit_state; + + /* + * Plug now if we have more than 2 IO left after this, and the + * target is potentially a read/write to block based storage. + */ + if (state->need_plug && io_op_defs[opcode].plug) { + state->plug_started = true; + state->need_plug = false; + blk_start_plug_nr_ios(&state->plug, state->submit_nr); + } + + req->file = io_file_get(ctx, req, READ_ONCE(sqe->fd), + (sqe_flags & IOSQE_FIXED_FILE)); + if (unlikely(!req->file)) + return -EBADF; + } personality = READ_ONCE(sqe->personality); if (personality) { @@ -7084,27 +7096,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, get_cred(req->creds); req->flags |= REQ_F_CREDS; } - state = &ctx->submit_state; - /* - * Plug now if we have more than 1 IO left after this, and the target - * is potentially a read/write to block based storage. - */ - if (!state->plug_started && state->ios_left > 1 && - io_op_defs[req->opcode].plug) { - blk_start_plug(&state->plug); - state->plug_started = true; - } - - if (io_op_defs[req->opcode].needs_file) { - req->file = io_file_get(ctx, req, READ_ONCE(sqe->fd), - (sqe_flags & IOSQE_FIXED_FILE)); - if (unlikely(!req->file)) - ret = -EBADF; - } - - state->ios_left--; - return ret; + return io_req_prep(req, sqe); } static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, @@ -7116,7 +7109,8 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = io_init_req(ctx, req, sqe); if (unlikely(ret)) { -fail_req: + trace_io_uring_req_failed(sqe, ret); + /* fail even hard links since we don't submit */ if (link->head) { /* @@ -7139,10 +7133,6 @@ fail_req: return ret; } req_fail_link_node(req, ret); - } else { - ret = io_req_prep(req, sqe); - if (unlikely(ret)) - goto fail_req; } /* don't need @sqe from now on */ @@ -7172,33 +7162,32 @@ fail_req: link->last->link = req; link->last = req; + if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) + return 0; /* last request of a link, enqueue the link */ - if (!(req->flags & (REQ_F_LINK | REQ_F_HARDLINK))) { - link->head = NULL; - io_queue_sqe(head); - } - } else { - if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) { - link->head = req; - link->last = req; - } else { - io_queue_sqe(req); - } + link->head = NULL; + req = head; + } else if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK)) { + link->head = req; + link->last = req; + return 0; } + io_queue_sqe(req); return 0; } /* * Batched submission is done, ensure local IO is flushed out. */ -static void io_submit_state_end(struct io_submit_state *state, - struct io_ring_ctx *ctx) +static void io_submit_state_end(struct io_ring_ctx *ctx) { + struct io_submit_state *state = &ctx->submit_state; + if (state->link.head) io_queue_sqe(state->link.head); - if (state->compl_nr) - io_submit_flush_completions(ctx); + /* flush only after queuing links as they can generate completions */ + io_submit_flush_completions(ctx); if (state->plug_started) blk_finish_plug(&state->plug); } @@ -7210,7 +7199,8 @@ static void io_submit_state_start(struct io_submit_state *state, unsigned int max_ios) { state->plug_started = false; - state->ios_left = max_ios; + state->need_plug = max_ios > 2; + state->submit_nr = max_ios; /* set only head, no need to init link_last in advance */ state->link.head = NULL; } @@ -7262,45 +7252,45 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx) static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr) __must_hold(&ctx->uring_lock) { + unsigned int entries = io_sqring_entries(ctx); int submitted = 0; + if (unlikely(!entries)) + return 0; /* make sure SQ entry isn't read before tail */ - nr = min3(nr, ctx->sq_entries, io_sqring_entries(ctx)); - if (!percpu_ref_tryget_many(&ctx->refs, nr)) - return -EAGAIN; + nr = min3(nr, ctx->sq_entries, entries); io_get_task_refs(nr); io_submit_state_start(&ctx->submit_state, nr); - while (submitted < nr) { + do { const struct io_uring_sqe *sqe; struct io_kiocb *req; - req = io_alloc_req(ctx); - if (unlikely(!req)) { + if (unlikely(!io_alloc_req_refill(ctx))) { if (!submitted) submitted = -EAGAIN; break; } + req = io_alloc_req(ctx); sqe = io_get_sqe(ctx); if (unlikely(!sqe)) { - list_add(&req->inflight_entry, &ctx->submit_state.free_list); + wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list); break; } /* will complete beyond this point, count as submitted */ submitted++; if (io_submit_sqe(ctx, req, sqe)) break; - } + } while (submitted < nr); if (unlikely(submitted != nr)) { int ref_used = (submitted == -EAGAIN) ? 0 : submitted; int unused = nr - ref_used; current->io_uring->cached_refs += unused; - percpu_ref_put_many(&ctx->refs, unused); } - io_submit_state_end(&ctx->submit_state, ctx); + io_submit_state_end(ctx); /* Commit SQ ring head once we've consumed and submitted all SQEs */ io_commit_sqring(ctx); @@ -7339,16 +7329,15 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries) if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE) to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE; - if (!list_empty(&ctx->iopoll_list) || to_submit) { - unsigned nr_events = 0; + if (!wq_list_empty(&ctx->iopoll_list) || to_submit) { const struct cred *creds = NULL; if (ctx->sq_creds != current_cred()) creds = override_creds(ctx->sq_creds); mutex_lock(&ctx->uring_lock); - if (!list_empty(&ctx->iopoll_list)) - io_do_iopoll(ctx, &nr_events, 0); + if (!wq_list_empty(&ctx->iopoll_list)) + io_do_iopoll(ctx, true); /* * Don't submit if refs are dying, good for io_uring_register(), @@ -7368,7 +7357,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries) return ret; } -static void io_sqd_update_thread_idle(struct io_sq_data *sqd) +static __cold void io_sqd_update_thread_idle(struct io_sq_data *sqd) { struct io_ring_ctx *ctx; unsigned sq_thread_idle = 0; @@ -7425,7 +7414,7 @@ static int io_sq_thread(void *data) list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { int ret = __io_sq_thread(ctx, cap_entries); - if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list))) + if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) sqt_spin = true; } if (io_run_task_work()) @@ -7446,7 +7435,7 @@ static int io_sq_thread(void *data) io_ring_set_wakeup_flag(ctx); if ((ctx->flags & IORING_SETUP_IOPOLL) && - !list_empty_careful(&ctx->iopoll_list)) { + !wq_list_empty(&ctx->iopoll_list)) { needs_sched = false; break; } @@ -7622,7 +7611,7 @@ static void io_free_page_table(void **table, size_t size) kfree(table); } -static void **io_alloc_page_table(size_t size) +static __cold void **io_alloc_page_table(size_t size) { unsigned i, nr_tables = DIV_ROUND_UP(size, PAGE_SIZE); size_t init_size = size; @@ -7651,7 +7640,7 @@ static void io_rsrc_node_destroy(struct io_rsrc_node *ref_node) kfree(ref_node); } -static void io_rsrc_node_ref_zero(struct percpu_ref *ref) +static __cold void io_rsrc_node_ref_zero(struct percpu_ref *ref) { struct io_rsrc_node *node = container_of(ref, struct io_rsrc_node, refs); struct io_ring_ctx *ctx = node->rsrc_data->ctx; @@ -7697,10 +7686,13 @@ static struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx) static void io_rsrc_node_switch(struct io_ring_ctx *ctx, struct io_rsrc_data *data_to_kill) + __must_hold(&ctx->uring_lock) { WARN_ON_ONCE(!ctx->rsrc_backup_node); WARN_ON_ONCE(data_to_kill && !ctx->rsrc_node); + io_rsrc_refs_drop(ctx); + if (data_to_kill) { struct io_rsrc_node *rsrc_node = ctx->rsrc_node; @@ -7728,7 +7720,8 @@ static int io_rsrc_node_switch_start(struct io_ring_ctx *ctx) return ctx->rsrc_backup_node ? 0 : -ENOMEM; } -static int io_rsrc_ref_quiesce(struct io_rsrc_data *data, struct io_ring_ctx *ctx) +static __cold int io_rsrc_ref_quiesce(struct io_rsrc_data *data, + struct io_ring_ctx *ctx) { int ret; @@ -7784,9 +7777,9 @@ static void io_rsrc_data_free(struct io_rsrc_data *data) kfree(data); } -static int io_rsrc_data_alloc(struct io_ring_ctx *ctx, rsrc_put_fn *do_put, - u64 __user *utags, unsigned nr, - struct io_rsrc_data **pdata) +static __cold int io_rsrc_data_alloc(struct io_ring_ctx *ctx, rsrc_put_fn *do_put, + u64 __user *utags, unsigned nr, + struct io_rsrc_data **pdata) { struct io_rsrc_data *data; int ret = -ENOMEM; @@ -8354,12 +8347,12 @@ static int io_install_fixed_file(struct io_kiocb *req, struct file *file, unsigned int issue_flags, u32 slot_index) { struct io_ring_ctx *ctx = req->ctx; - bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; bool needs_switch = false; struct io_fixed_file *file_slot; int ret = -EBADF; - io_ring_submit_lock(ctx, !force_nonblock); + io_ring_submit_lock(ctx, needs_lock); if (file->f_op == &io_uring_fops) goto err; ret = -ENXIO; @@ -8400,7 +8393,7 @@ static int io_install_fixed_file(struct io_kiocb *req, struct file *file, err: if (needs_switch) io_rsrc_node_switch(ctx, ctx->file_data); - io_ring_submit_unlock(ctx, !force_nonblock); + io_ring_submit_unlock(ctx, needs_lock); if (ret) fput(file); return ret; @@ -8410,11 +8403,12 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags) { unsigned int offset = req->close.file_slot - 1; struct io_ring_ctx *ctx = req->ctx; + bool needs_lock = issue_flags & IO_URING_F_UNLOCKED; struct io_fixed_file *file_slot; struct file *file; int ret, i; - io_ring_submit_lock(ctx, !(issue_flags & IO_URING_F_NONBLOCK)); + io_ring_submit_lock(ctx, needs_lock); ret = -ENXIO; if (unlikely(!ctx->file_data)) goto out; @@ -8440,7 +8434,7 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags) io_rsrc_node_switch(ctx, ctx->file_data); ret = 0; out: - io_ring_submit_unlock(ctx, !(issue_flags & IO_URING_F_NONBLOCK)); + io_ring_submit_unlock(ctx, needs_lock); return ret; } @@ -8556,8 +8550,8 @@ static struct io_wq *io_init_wq_offload(struct io_ring_ctx *ctx, return io_wq_create(concurrency, &data); } -static int io_uring_alloc_task_context(struct task_struct *task, - struct io_ring_ctx *ctx) +static __cold int io_uring_alloc_task_context(struct task_struct *task, + struct io_ring_ctx *ctx) { struct io_uring_task *tctx; int ret; @@ -8604,8 +8598,8 @@ void __io_uring_free(struct task_struct *tsk) tsk->io_uring = NULL; } -static int io_sq_offload_create(struct io_ring_ctx *ctx, - struct io_uring_params *p) +static __cold int io_sq_offload_create(struct io_ring_ctx *ctx, + struct io_uring_params *p) { int ret; @@ -9216,29 +9210,25 @@ static void io_destroy_buffers(struct io_ring_ctx *ctx) } } -static void io_req_cache_free(struct list_head *list) -{ - struct io_kiocb *req, *nxt; - - list_for_each_entry_safe(req, nxt, list, inflight_entry) { - list_del(&req->inflight_entry); - kmem_cache_free(req_cachep, req); - } -} - static void io_req_caches_free(struct io_ring_ctx *ctx) { struct io_submit_state *state = &ctx->submit_state; + int nr = 0; mutex_lock(&ctx->uring_lock); + io_flush_cached_locked_reqs(ctx, state); - if (state->free_reqs) { - kmem_cache_free_bulk(req_cachep, state->free_reqs, state->reqs); - state->free_reqs = 0; - } + while (state->free_list.next) { + struct io_wq_work_node *node; + struct io_kiocb *req; - io_flush_cached_locked_reqs(ctx, state); - io_req_cache_free(&state->free_list); + node = wq_stack_extract(&state->free_list); + req = container_of(node, struct io_kiocb, comp_list); + kmem_cache_free(req_cachep, req); + nr++; + } + if (nr) + percpu_ref_put_many(&ctx->refs, nr); mutex_unlock(&ctx->uring_lock); } @@ -9248,7 +9238,7 @@ static void io_wait_rsrc_data(struct io_rsrc_data *data) wait_for_completion(&data->done); } -static void io_ring_ctx_free(struct io_ring_ctx *ctx) +static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_sq_thread_finish(ctx); @@ -9257,6 +9247,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) ctx->mm_account = NULL; } + io_rsrc_refs_drop(ctx); /* __io_rsrc_put_work() may need uring_lock to progress, wait w/o it */ io_wait_rsrc_data(ctx->buf_data); io_wait_rsrc_data(ctx->file_data); @@ -9280,6 +9271,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) if (ctx->rsrc_backup_node) io_rsrc_node_destroy(ctx->rsrc_backup_node); flush_delayed_work(&ctx->rsrc_put_work); + flush_delayed_work(&ctx->fallback_work); WARN_ON_ONCE(!list_empty(&ctx->rsrc_ref_list)); WARN_ON_ONCE(!llist_empty(&ctx->rsrc_put_llist)); @@ -9310,7 +9302,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait) struct io_ring_ctx *ctx = file->private_data; __poll_t mask = 0; - poll_wait(file, &ctx->poll_wait, wait); + poll_wait(file, &ctx->cq_wait, wait); /* * synchronizes with barrier from wq_has_sleeper call in * io_commit_cqring @@ -9357,7 +9349,7 @@ struct io_tctx_exit { struct io_ring_ctx *ctx; }; -static void io_tctx_exit_cb(struct callback_head *cb) +static __cold void io_tctx_exit_cb(struct callback_head *cb) { struct io_uring_task *tctx = current->io_uring; struct io_tctx_exit *work; @@ -9372,14 +9364,14 @@ static void io_tctx_exit_cb(struct callback_head *cb) complete(&work->completion); } -static bool io_cancel_ctx_cb(struct io_wq_work *work, void *data) +static __cold bool io_cancel_ctx_cb(struct io_wq_work *work, void *data) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); return req->ctx == data; } -static void io_ring_exit_work(struct work_struct *work) +static __cold void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, exit_work); unsigned long timeout = jiffies + HZ * 60 * 5; @@ -9408,6 +9400,8 @@ static void io_ring_exit_work(struct work_struct *work) io_sq_thread_unpark(sqd); } + io_req_caches_free(ctx); + if (WARN_ON_ONCE(time_after(jiffies, timeout))) { /* there is little hope left, don't run it too often */ interval = HZ * 60; @@ -9434,7 +9428,6 @@ static void io_ring_exit_work(struct work_struct *work) ret = task_work_add(node->task, &exit.task_work, TWA_SIGNAL); if (WARN_ON_ONCE(ret)) continue; - wake_up_process(node->task); mutex_unlock(&ctx->uring_lock); wait_for_completion(&exit.completion); @@ -9448,8 +9441,8 @@ static void io_ring_exit_work(struct work_struct *work) } /* Returns true if we found and killed one or more timeouts */ -static bool io_kill_timeouts(struct io_ring_ctx *ctx, struct task_struct *tsk, - bool cancel_all) +static __cold bool io_kill_timeouts(struct io_ring_ctx *ctx, + struct task_struct *tsk, bool cancel_all) { struct io_kiocb *req, *tmp; int canceled = 0; @@ -9471,7 +9464,7 @@ static bool io_kill_timeouts(struct io_ring_ctx *ctx, struct task_struct *tsk, return canceled != 0; } -static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { unsigned long index; struct creds *creds; @@ -9533,8 +9526,9 @@ static bool io_cancel_task_cb(struct io_wq_work *work, void *data) return ret; } -static bool io_cancel_defer_files(struct io_ring_ctx *ctx, - struct task_struct *task, bool cancel_all) +static __cold bool io_cancel_defer_files(struct io_ring_ctx *ctx, + struct task_struct *task, + bool cancel_all) { struct io_defer_entry *de; LIST_HEAD(list); @@ -9559,7 +9553,7 @@ static bool io_cancel_defer_files(struct io_ring_ctx *ctx, return true; } -static bool io_uring_try_cancel_iowq(struct io_ring_ctx *ctx) +static __cold bool io_uring_try_cancel_iowq(struct io_ring_ctx *ctx) { struct io_tctx_node *node; enum io_wq_cancel cret; @@ -9583,9 +9577,9 @@ static bool io_uring_try_cancel_iowq(struct io_ring_ctx *ctx) return ret; } -static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx, - struct task_struct *task, - bool cancel_all) +static __cold void io_uring_try_cancel_requests(struct io_ring_ctx *ctx, + struct task_struct *task, + bool cancel_all) { struct io_task_cancel cancel = { .task = task, .all = cancel_all, }; struct io_uring_task *tctx = task ? task->io_uring : NULL; @@ -9609,7 +9603,7 @@ static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx, /* SQPOLL thread does its own polling */ if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) || (ctx->sq_data && ctx->sq_data->thread == current)) { - while (!list_empty_careful(&ctx->iopoll_list)) { + while (!wq_list_empty(&ctx->iopoll_list)) { io_iopoll_try_reap_events(ctx); ret = true; } @@ -9636,7 +9630,16 @@ static int __io_uring_add_tctx_node(struct io_ring_ctx *ctx) ret = io_uring_alloc_task_context(current, ctx); if (unlikely(ret)) return ret; + tctx = current->io_uring; + if (ctx->iowq_limits_set) { + unsigned int limits[2] = { ctx->iowq_limits[0], + ctx->iowq_limits[1], }; + + ret = io_wq_max_workers(tctx->io_wq, limits); + if (ret) + return ret; + } } if (!xa_load(&tctx->xa, (unsigned long)ctx)) { node = kmalloc(sizeof(*node), GFP_KERNEL); @@ -9675,7 +9678,7 @@ static inline int io_uring_add_tctx_node(struct io_ring_ctx *ctx) /* * Remove this io_uring_file -> task mapping. */ -static void io_uring_del_tctx_node(unsigned long index) +static __cold void io_uring_del_tctx_node(unsigned long index) { struct io_uring_task *tctx = current->io_uring; struct io_tctx_node *node; @@ -9698,7 +9701,7 @@ static void io_uring_del_tctx_node(unsigned long index) kfree(node); } -static void io_uring_clean_tctx(struct io_uring_task *tctx) +static __cold void io_uring_clean_tctx(struct io_uring_task *tctx) { struct io_wq *wq = tctx->io_wq; struct io_tctx_node *node; @@ -9725,7 +9728,7 @@ static s64 tctx_inflight(struct io_uring_task *tctx, bool tracked) return percpu_counter_sum(&tctx->inflight); } -static void io_uring_drop_tctx_refs(struct task_struct *task) +static __cold void io_uring_drop_tctx_refs(struct task_struct *task) { struct io_uring_task *tctx = task->io_uring; unsigned int refs = tctx->cached_refs; @@ -9741,7 +9744,8 @@ static void io_uring_drop_tctx_refs(struct task_struct *task) * Find any io_uring ctx that this task has registered or done IO on, and cancel * requests. @sqd should be not-null IIF it's an SQPOLL thread cancellation. */ -static void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd) +static __cold void io_uring_cancel_generic(bool cancel_all, + struct io_sq_data *sqd) { struct io_uring_task *tctx = current->io_uring; struct io_ring_ctx *ctx; @@ -9834,7 +9838,7 @@ static void *io_uring_validate_mmap_request(struct file *file, #ifdef CONFIG_MMU -static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +static __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma) { size_t sz = vma->vm_end - vma->vm_start; unsigned long pfn; @@ -10019,7 +10023,7 @@ out_fput: } #ifdef CONFIG_PROC_FS -static int io_uring_show_cred(struct seq_file *m, unsigned int id, +static __cold int io_uring_show_cred(struct seq_file *m, unsigned int id, const struct cred *cred) { struct user_namespace *uns = seq_user_ns(m); @@ -10051,11 +10055,59 @@ static int io_uring_show_cred(struct seq_file *m, unsigned int id, return 0; } -static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m) +static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, + struct seq_file *m) { struct io_sq_data *sq = NULL; + struct io_overflow_cqe *ocqe; + struct io_rings *r = ctx->rings; + unsigned int sq_mask = ctx->sq_entries - 1, cq_mask = ctx->cq_entries - 1; + unsigned int sq_head = READ_ONCE(r->sq.head); + unsigned int sq_tail = READ_ONCE(r->sq.tail); + unsigned int cq_head = READ_ONCE(r->cq.head); + unsigned int cq_tail = READ_ONCE(r->cq.tail); + unsigned int sq_entries, cq_entries; bool has_lock; - int i; + unsigned int i; + + /* + * we may get imprecise sqe and cqe info if uring is actively running + * since we get cached_sq_head and cached_cq_tail without uring_lock + * and sq_tail and cq_head are changed by userspace. But it's ok since + * we usually use these info when it is stuck. + */ + seq_printf(m, "SqMask:\t\t0x%x\n", sq_mask); + seq_printf(m, "SqHead:\t%u\n", sq_head); + seq_printf(m, "SqTail:\t%u\n", sq_tail); + seq_printf(m, "CachedSqHead:\t%u\n", ctx->cached_sq_head); + seq_printf(m, "CqMask:\t0x%x\n", cq_mask); + seq_printf(m, "CqHead:\t%u\n", cq_head); + seq_printf(m, "CqTail:\t%u\n", cq_tail); + seq_printf(m, "CachedCqTail:\t%u\n", ctx->cached_cq_tail); + seq_printf(m, "SQEs:\t%u\n", sq_tail - ctx->cached_sq_head); + sq_entries = min(sq_tail - sq_head, ctx->sq_entries); + for (i = 0; i < sq_entries; i++) { + unsigned int entry = i + sq_head; + unsigned int sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]); + struct io_uring_sqe *sqe = &ctx->sq_sqes[sq_idx]; + + if (sq_idx > sq_mask) + continue; + sqe = &ctx->sq_sqes[sq_idx]; + seq_printf(m, "%5u: opcode:%d, fd:%d, flags:%x, user_data:%llu\n", + sq_idx, sqe->opcode, sqe->fd, sqe->flags, + sqe->user_data); + } + seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head); + cq_entries = min(cq_tail - cq_head, ctx->cq_entries); + for (i = 0; i < cq_entries; i++) { + unsigned int entry = i + cq_head; + struct io_uring_cqe *cqe = &r->cqes[entry & cq_mask]; + + seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x\n", + entry & cq_mask, cqe->user_data, cqe->res, + cqe->flags); + } /* * Avoid ABBA deadlock between the seq lock and the io_uring mutex, @@ -10097,7 +10149,10 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m) xa_for_each(&ctx->personalities, index, cred) io_uring_show_cred(m, index, cred); } - seq_printf(m, "PollList:\n"); + if (has_lock) + mutex_unlock(&ctx->uring_lock); + + seq_puts(m, "PollList:\n"); spin_lock(&ctx->completion_lock); for (i = 0; i < (1U << ctx->cancel_hash_bits); i++) { struct hlist_head *list = &ctx->cancel_hash[i]; @@ -10107,12 +10162,20 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m) seq_printf(m, " op=%d, task_works=%d\n", req->opcode, req->task->task_works != NULL); } + + seq_puts(m, "CqOverflowList:\n"); + list_for_each_entry(ocqe, &ctx->cq_overflow_list, list) { + struct io_uring_cqe *cqe = &ocqe->cqe; + + seq_printf(m, " user_data=%llu, res=%d, flags=%x\n", + cqe->user_data, cqe->res, cqe->flags); + + } + spin_unlock(&ctx->completion_lock); - if (has_lock) - mutex_unlock(&ctx->uring_lock); } -static void io_uring_show_fdinfo(struct seq_file *m, struct file *f) +static __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f) { struct io_ring_ctx *ctx = f->private_data; @@ -10136,8 +10199,8 @@ static const struct file_operations io_uring_fops = { #endif }; -static int io_allocate_scq_urings(struct io_ring_ctx *ctx, - struct io_uring_params *p) +static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) { struct io_rings *rings; size_t size, sq_array_offset; @@ -10226,8 +10289,8 @@ static struct file *io_uring_get_file(struct io_ring_ctx *ctx) return file; } -static int io_uring_create(unsigned entries, struct io_uring_params *p, - struct io_uring_params __user *params) +static __cold int io_uring_create(unsigned entries, struct io_uring_params *p, + struct io_uring_params __user *params) { struct io_ring_ctx *ctx; struct file *file; @@ -10385,7 +10448,8 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return io_uring_setup(entries, params); } -static int io_probe(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args) +static __cold int io_probe(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) { struct io_uring_probe *p; size_t size; @@ -10441,8 +10505,8 @@ static int io_register_personality(struct io_ring_ctx *ctx) return id; } -static int io_register_restrictions(struct io_ring_ctx *ctx, void __user *arg, - unsigned int nr_args) +static __cold int io_register_restrictions(struct io_ring_ctx *ctx, + void __user *arg, unsigned int nr_args) { struct io_uring_restriction *res; size_t size; @@ -10576,7 +10640,7 @@ static int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg, return __io_register_rsrc_update(ctx, type, &up, up.nr); } -static int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg, +static __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg, unsigned int size, unsigned int type) { struct io_uring_rsrc_register rr; @@ -10602,8 +10666,8 @@ static int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg, return -EINVAL; } -static int io_register_iowq_aff(struct io_ring_ctx *ctx, void __user *arg, - unsigned len) +static __cold int io_register_iowq_aff(struct io_ring_ctx *ctx, + void __user *arg, unsigned len) { struct io_uring_task *tctx = current->io_uring; cpumask_var_t new_mask; @@ -10629,7 +10693,7 @@ static int io_register_iowq_aff(struct io_ring_ctx *ctx, void __user *arg, return ret; } -static int io_unregister_iowq_aff(struct io_ring_ctx *ctx) +static __cold int io_unregister_iowq_aff(struct io_ring_ctx *ctx) { struct io_uring_task *tctx = current->io_uring; @@ -10639,9 +10703,11 @@ static int io_unregister_iowq_aff(struct io_ring_ctx *ctx) return io_wq_cpu_affinity(tctx->io_wq, NULL); } -static int io_register_iowq_max_workers(struct io_ring_ctx *ctx, - void __user *arg) +static __cold int io_register_iowq_max_workers(struct io_ring_ctx *ctx, + void __user *arg) + __must_hold(&ctx->uring_lock) { + struct io_tctx_node *node; struct io_uring_task *tctx = NULL; struct io_sq_data *sqd = NULL; __u32 new_count[2]; @@ -10672,13 +10738,19 @@ static int io_register_iowq_max_workers(struct io_ring_ctx *ctx, tctx = current->io_uring; } - ret = -EINVAL; - if (!tctx || !tctx->io_wq) - goto err; + BUILD_BUG_ON(sizeof(new_count) != sizeof(ctx->iowq_limits)); - ret = io_wq_max_workers(tctx->io_wq, new_count); - if (ret) - goto err; + memcpy(ctx->iowq_limits, new_count, sizeof(new_count)); + ctx->iowq_limits_set = true; + + ret = -EINVAL; + if (tctx && tctx->io_wq) { + ret = io_wq_max_workers(tctx->io_wq, new_count); + if (ret) + goto err; + } else { + memset(new_count, 0, sizeof(new_count)); + } if (sqd) { mutex_unlock(&sqd->lock); @@ -10688,6 +10760,22 @@ static int io_register_iowq_max_workers(struct io_ring_ctx *ctx, if (copy_to_user(arg, new_count, sizeof(new_count))) return -EFAULT; + /* that's it for SQPOLL, only the SQPOLL task creates requests */ + if (sqd) + return 0; + + /* now propagate the restriction to all registered users */ + list_for_each_entry(node, &ctx->tctx_list, ctx_node) { + struct io_uring_task *tctx = node->task->io_uring; + + if (WARN_ON_ONCE(!tctx->io_wq)) + continue; + + for (i = 0; i < ARRAY_SIZE(new_count); i++) + new_count[i] = ctx->iowq_limits[i]; + /* ignore errors, it always returns zero anyway */ + (void)io_wq_max_workers(tctx->io_wq, new_count); + } return 0; err: if (sqd) { @@ -10721,7 +10809,7 @@ static bool io_register_op_must_quiesce(int op) } } -static int io_ctx_quiesce(struct io_ring_ctx *ctx) +static __cold int io_ctx_quiesce(struct io_ring_ctx *ctx) { long ret; @@ -10736,10 +10824,14 @@ static int io_ctx_quiesce(struct io_ring_ctx *ctx) */ mutex_unlock(&ctx->uring_lock); do { - ret = wait_for_completion_interruptible(&ctx->ref_comp); - if (!ret) + ret = wait_for_completion_interruptible_timeout(&ctx->ref_comp, HZ); + if (ret) { + ret = min(0L, ret); break; + } + ret = io_run_task_work_sig(); + io_req_caches_free(ctx); } while (ret >= 0); mutex_lock(&ctx->uring_lock); @@ -10970,6 +11062,8 @@ static int __init io_uring_init(void) /* should fit into one byte */ BUILD_BUG_ON(SQE_VALID_FLAGS >= (1 << 8)); + BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8)); + BUILD_BUG_ON((SQE_VALID_FLAGS | SQE_COMMON_FLAGS) != SQE_VALID_FLAGS); BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int)); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index 176580f54af9..104ae698443e 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -13,6 +13,7 @@ #include <linux/buffer_head.h> #include <linux/mempool.h> #include <linux/seq_file.h> +#include <linux/writeback.h> #include "jfs_incore.h" #include "jfs_superblock.h" #include "jfs_filsys.h" diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c index 87aac4c72c37..1b07550485b9 100644 --- a/fs/kernel_read_file.c +++ b/fs/kernel_read_file.c @@ -178,7 +178,7 @@ int kernel_read_file_from_fd(int fd, loff_t offset, void **buf, struct fd f = fdget(fd); int ret = -EBADF; - if (!f.file) + if (!f.file || !(f.file->f_mode & FMODE_READ)) goto out; ret = kernel_read_file(f.file, offset, buf, buf_size, file_size, id); diff --git a/fs/ksmbd/auth.c b/fs/ksmbd/auth.c index 71c989f1568d..30a92ddc1817 100644 --- a/fs/ksmbd/auth.c +++ b/fs/ksmbd/auth.c @@ -298,8 +298,8 @@ int ksmbd_decode_ntlmssp_auth_blob(struct authenticate_message *authblob, int blob_len, struct ksmbd_session *sess) { char *domain_name; - unsigned int lm_off, nt_off; - unsigned short nt_len; + unsigned int nt_off, dn_off; + unsigned short nt_len, dn_len; int ret; if (blob_len < sizeof(struct authenticate_message)) { @@ -314,15 +314,17 @@ int ksmbd_decode_ntlmssp_auth_blob(struct authenticate_message *authblob, return -EINVAL; } - lm_off = le32_to_cpu(authblob->LmChallengeResponse.BufferOffset); nt_off = le32_to_cpu(authblob->NtChallengeResponse.BufferOffset); nt_len = le16_to_cpu(authblob->NtChallengeResponse.Length); + dn_off = le32_to_cpu(authblob->DomainName.BufferOffset); + dn_len = le16_to_cpu(authblob->DomainName.Length); + + if (blob_len < (u64)dn_off + dn_len || blob_len < (u64)nt_off + nt_len) + return -EINVAL; /* TODO : use domain name that imported from configuration file */ - domain_name = smb_strndup_from_utf16((const char *)authblob + - le32_to_cpu(authblob->DomainName.BufferOffset), - le16_to_cpu(authblob->DomainName.Length), true, - sess->conn->local_nls); + domain_name = smb_strndup_from_utf16((const char *)authblob + dn_off, + dn_len, true, sess->conn->local_nls); if (IS_ERR(domain_name)) return PTR_ERR(domain_name); diff --git a/fs/ksmbd/connection.c b/fs/ksmbd/connection.c index 48b18b4ec117..b57a0d8a392f 100644 --- a/fs/ksmbd/connection.c +++ b/fs/ksmbd/connection.c @@ -61,6 +61,8 @@ struct ksmbd_conn *ksmbd_conn_alloc(void) conn->local_nls = load_nls_default(); atomic_set(&conn->req_running, 0); atomic_set(&conn->r_count, 0); + conn->total_credits = 1; + init_waitqueue_head(&conn->req_running_q); INIT_LIST_HEAD(&conn->conns_list); INIT_LIST_HEAD(&conn->sessions); diff --git a/fs/ksmbd/ksmbd_netlink.h b/fs/ksmbd/ksmbd_netlink.h index 2fbe2bc1e093..c6718a05d347 100644 --- a/fs/ksmbd/ksmbd_netlink.h +++ b/fs/ksmbd/ksmbd_netlink.h @@ -211,6 +211,7 @@ struct ksmbd_tree_disconnect_request { */ struct ksmbd_logout_request { __s8 account[KSMBD_REQ_MAX_ACCOUNT_NAME_SZ]; /* user account name */ + __u32 account_flags; }; /* @@ -317,6 +318,7 @@ enum KSMBD_TREE_CONN_STATUS { #define KSMBD_USER_FLAG_BAD_UID BIT(2) #define KSMBD_USER_FLAG_BAD_USER BIT(3) #define KSMBD_USER_FLAG_GUEST_ACCOUNT BIT(4) +#define KSMBD_USER_FLAG_DELAY_SESSION BIT(5) /* * Share config flags. diff --git a/fs/ksmbd/mgmt/user_config.c b/fs/ksmbd/mgmt/user_config.c index d21629ae5c89..1019d3677d55 100644 --- a/fs/ksmbd/mgmt/user_config.c +++ b/fs/ksmbd/mgmt/user_config.c @@ -55,7 +55,7 @@ struct ksmbd_user *ksmbd_alloc_user(struct ksmbd_login_response *resp) void ksmbd_free_user(struct ksmbd_user *user) { - ksmbd_ipc_logout_request(user->name); + ksmbd_ipc_logout_request(user->name, user->flags); kfree(user->name); kfree(user->passkey); kfree(user); diff --git a/fs/ksmbd/mgmt/user_config.h b/fs/ksmbd/mgmt/user_config.h index b2bb074a0150..aff80b029579 100644 --- a/fs/ksmbd/mgmt/user_config.h +++ b/fs/ksmbd/mgmt/user_config.h @@ -18,6 +18,7 @@ struct ksmbd_user { size_t passkey_sz; char *passkey; + unsigned int failed_login_count; }; static inline bool user_guest(struct ksmbd_user *user) diff --git a/fs/ksmbd/smb2misc.c b/fs/ksmbd/smb2misc.c index 9edd9c161b27..030ca57c3784 100644 --- a/fs/ksmbd/smb2misc.c +++ b/fs/ksmbd/smb2misc.c @@ -284,11 +284,13 @@ static inline int smb2_ioctl_resp_len(struct smb2_ioctl_req *h) le32_to_cpu(h->MaxOutputResponse); } -static int smb2_validate_credit_charge(struct smb2_hdr *hdr) +static int smb2_validate_credit_charge(struct ksmbd_conn *conn, + struct smb2_hdr *hdr) { - int req_len = 0, expect_resp_len = 0, calc_credit_num, max_len; - int credit_charge = le16_to_cpu(hdr->CreditCharge); + unsigned int req_len = 0, expect_resp_len = 0, calc_credit_num, max_len; + unsigned short credit_charge = le16_to_cpu(hdr->CreditCharge); void *__hdr = hdr; + int ret; switch (hdr->Command) { case SMB2_QUERY_INFO: @@ -310,21 +312,37 @@ static int smb2_validate_credit_charge(struct smb2_hdr *hdr) req_len = smb2_ioctl_req_len(__hdr); expect_resp_len = smb2_ioctl_resp_len(__hdr); break; - default: + case SMB2_CANCEL: return 0; + default: + req_len = 1; + break; } - credit_charge = max(1, credit_charge); - max_len = max(req_len, expect_resp_len); + credit_charge = max_t(unsigned short, credit_charge, 1); + max_len = max_t(unsigned int, req_len, expect_resp_len); calc_credit_num = DIV_ROUND_UP(max_len, SMB2_MAX_BUFFER_SIZE); if (credit_charge < calc_credit_num) { - pr_err("Insufficient credit charge, given: %d, needed: %d\n", - credit_charge, calc_credit_num); + ksmbd_debug(SMB, "Insufficient credit charge, given: %d, needed: %d\n", + credit_charge, calc_credit_num); + return 1; + } else if (credit_charge > conn->max_credits) { + ksmbd_debug(SMB, "Too large credit charge: %d\n", credit_charge); return 1; } - return 0; + spin_lock(&conn->credits_lock); + if (credit_charge <= conn->total_credits) { + conn->total_credits -= credit_charge; + ret = 0; + } else { + ksmbd_debug(SMB, "Insufficient credits granted, given: %u, granted: %u\n", + credit_charge, conn->total_credits); + ret = 1; + } + spin_unlock(&conn->credits_lock); + return ret; } int ksmbd_smb2_check_message(struct ksmbd_work *work) @@ -382,26 +400,20 @@ int ksmbd_smb2_check_message(struct ksmbd_work *work) } } - if ((work->conn->vals->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU) && - smb2_validate_credit_charge(hdr)) { - work->conn->ops->set_rsp_status(work, STATUS_INVALID_PARAMETER); - return 1; - } - if (smb2_calc_size(hdr, &clc_len)) return 1; if (len != clc_len) { /* client can return one byte more due to implied bcc[0] */ if (clc_len == len + 1) - return 0; + goto validate_credit; /* * Some windows servers (win2016) will pad also the final * PDU in a compound to 8 bytes. */ if (ALIGN(clc_len, 8) == len) - return 0; + goto validate_credit; /* * windows client also pad up to 8 bytes when compounding. @@ -414,7 +426,7 @@ int ksmbd_smb2_check_message(struct ksmbd_work *work) "cli req padded more than expected. Length %d not %d for cmd:%d mid:%llu\n", len, clc_len, command, le64_to_cpu(hdr->MessageId)); - return 0; + goto validate_credit; } ksmbd_debug(SMB, @@ -425,6 +437,13 @@ int ksmbd_smb2_check_message(struct ksmbd_work *work) return 1; } +validate_credit: + if ((work->conn->vals->capabilities & SMB2_GLOBAL_CAP_LARGE_MTU) && + smb2_validate_credit_charge(work->conn, hdr)) { + work->conn->ops->set_rsp_status(work, STATUS_INVALID_PARAMETER); + return 1; + } + return 0; } diff --git a/fs/ksmbd/smb2ops.c b/fs/ksmbd/smb2ops.c index b06456eb587b..fb6a65d23139 100644 --- a/fs/ksmbd/smb2ops.c +++ b/fs/ksmbd/smb2ops.c @@ -284,6 +284,7 @@ int init_smb3_11_server(struct ksmbd_conn *conn) void init_smb2_max_read_size(unsigned int sz) { + sz = clamp_val(sz, SMB3_MIN_IOSIZE, SMB3_MAX_IOSIZE); smb21_server_values.max_read_size = sz; smb30_server_values.max_read_size = sz; smb302_server_values.max_read_size = sz; @@ -292,6 +293,7 @@ void init_smb2_max_read_size(unsigned int sz) void init_smb2_max_write_size(unsigned int sz) { + sz = clamp_val(sz, SMB3_MIN_IOSIZE, SMB3_MAX_IOSIZE); smb21_server_values.max_write_size = sz; smb30_server_values.max_write_size = sz; smb302_server_values.max_write_size = sz; @@ -300,6 +302,7 @@ void init_smb2_max_write_size(unsigned int sz) void init_smb2_max_trans_size(unsigned int sz) { + sz = clamp_val(sz, SMB3_MIN_IOSIZE, SMB3_MAX_IOSIZE); smb21_server_values.max_trans_size = sz; smb30_server_values.max_trans_size = sz; smb302_server_values.max_trans_size = sz; diff --git a/fs/ksmbd/smb2pdu.c b/fs/ksmbd/smb2pdu.c index 005aa93a49d6..7e448df3f847 100644 --- a/fs/ksmbd/smb2pdu.c +++ b/fs/ksmbd/smb2pdu.c @@ -292,22 +292,6 @@ int init_smb2_neg_rsp(struct ksmbd_work *work) return 0; } -static int smb2_consume_credit_charge(struct ksmbd_work *work, - unsigned short credit_charge) -{ - struct ksmbd_conn *conn = work->conn; - unsigned int rsp_credits = 1; - - if (!conn->total_credits) - return 0; - - if (credit_charge > 0) - rsp_credits = credit_charge; - - conn->total_credits -= rsp_credits; - return rsp_credits; -} - /** * smb2_set_rsp_credits() - set number of credits in response buffer * @work: smb work containing smb response buffer @@ -317,49 +301,43 @@ int smb2_set_rsp_credits(struct ksmbd_work *work) struct smb2_hdr *req_hdr = ksmbd_req_buf_next(work); struct smb2_hdr *hdr = ksmbd_resp_buf_next(work); struct ksmbd_conn *conn = work->conn; - unsigned short credits_requested = le16_to_cpu(req_hdr->CreditRequest); - unsigned short credit_charge = 1, credits_granted = 0; - unsigned short aux_max, aux_credits, min_credits; - int rsp_credit_charge; + unsigned short credits_requested; + unsigned short credit_charge, credits_granted = 0; + unsigned short aux_max, aux_credits; - if (hdr->Command == SMB2_CANCEL) - goto out; + if (work->send_no_response) + return 0; - /* get default minimum credits by shifting maximum credits by 4 */ - min_credits = conn->max_credits >> 4; + hdr->CreditCharge = req_hdr->CreditCharge; - if (conn->total_credits >= conn->max_credits) { + if (conn->total_credits > conn->max_credits) { + hdr->CreditRequest = 0; pr_err("Total credits overflow: %d\n", conn->total_credits); - conn->total_credits = min_credits; - } - - rsp_credit_charge = - smb2_consume_credit_charge(work, le16_to_cpu(req_hdr->CreditCharge)); - if (rsp_credit_charge < 0) return -EINVAL; + } - hdr->CreditCharge = cpu_to_le16(rsp_credit_charge); + credit_charge = max_t(unsigned short, + le16_to_cpu(req_hdr->CreditCharge), 1); + credits_requested = max_t(unsigned short, + le16_to_cpu(req_hdr->CreditRequest), 1); - if (credits_requested > 0) { - aux_credits = credits_requested - 1; - aux_max = 32; - if (hdr->Command == SMB2_NEGOTIATE) - aux_max = 0; - aux_credits = (aux_credits < aux_max) ? aux_credits : aux_max; - credits_granted = aux_credits + credit_charge; + /* according to smb2.credits smbtorture, Windows server + * 2016 or later grant up to 8192 credits at once. + * + * TODO: Need to adjuct CreditRequest value according to + * current cpu load + */ + aux_credits = credits_requested - 1; + if (hdr->Command == SMB2_NEGOTIATE) + aux_max = 0; + else + aux_max = conn->max_credits - credit_charge; + aux_credits = min_t(unsigned short, aux_credits, aux_max); + credits_granted = credit_charge + aux_credits; - /* if credits granted per client is getting bigger than default - * minimum credits then we should wrap it up within the limits. - */ - if ((conn->total_credits + credits_granted) > min_credits) - credits_granted = min_credits - conn->total_credits; - /* - * TODO: Need to adjuct CreditRequest value according to - * current cpu load - */ - } else if (conn->total_credits == 0) { - credits_granted = 1; - } + if (conn->max_credits - conn->total_credits < credits_granted) + credits_granted = conn->max_credits - + conn->total_credits; conn->total_credits += credits_granted; work->credits_granted += credits_granted; @@ -368,7 +346,6 @@ int smb2_set_rsp_credits(struct ksmbd_work *work) /* Update CreditRequest in last request */ hdr->CreditRequest = cpu_to_le16(work->credits_granted); } -out: ksmbd_debug(SMB, "credits: requested[%d] granted[%d] total_granted[%d]\n", credits_requested, credits_granted, @@ -472,6 +449,12 @@ bool is_chained_smb2_message(struct ksmbd_work *work) return false; } + if ((u64)get_rfc1002_len(work->response_buf) + MAX_CIFS_SMALL_BUFFER_SIZE > + work->response_sz) { + pr_err("next response offset exceeds response buffer size\n"); + return false; + } + ksmbd_debug(SMB, "got SMB2 chained command\n"); init_chained_smb2_rsp(work); return true; @@ -541,7 +524,7 @@ int smb2_allocate_rsp_buf(struct ksmbd_work *work) { struct smb2_hdr *hdr = work->request_buf; size_t small_sz = MAX_CIFS_SMALL_BUFFER_SIZE; - size_t large_sz = work->conn->vals->max_trans_size + MAX_SMB2_HDR_SIZE; + size_t large_sz = small_sz + work->conn->vals->max_trans_size; size_t sz = small_sz; int cmd = le16_to_cpu(hdr->Command); @@ -1274,19 +1257,13 @@ static int generate_preauth_hash(struct ksmbd_work *work) return 0; } -static int decode_negotiation_token(struct ksmbd_work *work, - struct negotiate_message *negblob) +static int decode_negotiation_token(struct ksmbd_conn *conn, + struct negotiate_message *negblob, + size_t sz) { - struct ksmbd_conn *conn = work->conn; - struct smb2_sess_setup_req *req; - int sz; - if (!conn->use_spnego) return -EINVAL; - req = work->request_buf; - sz = le16_to_cpu(req->SecurityBufferLength); - if (ksmbd_decode_negTokenInit((char *)negblob, sz, conn)) { if (ksmbd_decode_negTokenTarg((char *)negblob, sz, conn)) { conn->auth_mechs |= KSMBD_AUTH_NTLMSSP; @@ -1298,9 +1275,9 @@ static int decode_negotiation_token(struct ksmbd_work *work, } static int ntlm_negotiate(struct ksmbd_work *work, - struct negotiate_message *negblob) + struct negotiate_message *negblob, + size_t negblob_len) { - struct smb2_sess_setup_req *req = work->request_buf; struct smb2_sess_setup_rsp *rsp = work->response_buf; struct challenge_message *chgblob; unsigned char *spnego_blob = NULL; @@ -1309,8 +1286,7 @@ static int ntlm_negotiate(struct ksmbd_work *work, int sz, rc; ksmbd_debug(SMB, "negotiate phase\n"); - sz = le16_to_cpu(req->SecurityBufferLength); - rc = ksmbd_decode_ntlmssp_neg_blob(negblob, sz, work->sess); + rc = ksmbd_decode_ntlmssp_neg_blob(negblob, negblob_len, work->sess); if (rc) return rc; @@ -1378,12 +1354,23 @@ static struct ksmbd_user *session_user(struct ksmbd_conn *conn, struct authenticate_message *authblob; struct ksmbd_user *user; char *name; - int sz; + unsigned int auth_msg_len, name_off, name_len, secbuf_len; + secbuf_len = le16_to_cpu(req->SecurityBufferLength); + if (secbuf_len < sizeof(struct authenticate_message)) { + ksmbd_debug(SMB, "blob len %d too small\n", secbuf_len); + return NULL; + } authblob = user_authblob(conn, req); - sz = le32_to_cpu(authblob->UserName.BufferOffset); - name = smb_strndup_from_utf16((const char *)authblob + sz, - le16_to_cpu(authblob->UserName.Length), + name_off = le32_to_cpu(authblob->UserName.BufferOffset); + name_len = le16_to_cpu(authblob->UserName.Length); + auth_msg_len = le16_to_cpu(req->SecurityBufferOffset) + secbuf_len; + + if (auth_msg_len < (u64)name_off + name_len) + return NULL; + + name = smb_strndup_from_utf16((const char *)authblob + name_off, + name_len, true, conn->local_nls); if (IS_ERR(name)) { @@ -1629,6 +1616,7 @@ int smb2_sess_setup(struct ksmbd_work *work) struct smb2_sess_setup_rsp *rsp = work->response_buf; struct ksmbd_session *sess; struct negotiate_message *negblob; + unsigned int negblob_len, negblob_off; int rc = 0; ksmbd_debug(SMB, "Received request for session setup\n"); @@ -1709,10 +1697,16 @@ int smb2_sess_setup(struct ksmbd_work *work) if (sess->state == SMB2_SESSION_EXPIRED) sess->state = SMB2_SESSION_IN_PROGRESS; + negblob_off = le16_to_cpu(req->SecurityBufferOffset); + negblob_len = le16_to_cpu(req->SecurityBufferLength); + if (negblob_off < (offsetof(struct smb2_sess_setup_req, Buffer) - 4) || + negblob_len < offsetof(struct negotiate_message, NegotiateFlags)) + return -EINVAL; + negblob = (struct negotiate_message *)((char *)&req->hdr.ProtocolId + - le16_to_cpu(req->SecurityBufferOffset)); + negblob_off); - if (decode_negotiation_token(work, negblob) == 0) { + if (decode_negotiation_token(conn, negblob, negblob_len) == 0) { if (conn->mechToken) negblob = (struct negotiate_message *)conn->mechToken; } @@ -1736,7 +1730,7 @@ int smb2_sess_setup(struct ksmbd_work *work) sess->Preauth_HashValue = NULL; } else if (conn->preferred_auth_mech == KSMBD_AUTH_NTLMSSP) { if (negblob->MessageType == NtLmNegotiate) { - rc = ntlm_negotiate(work, negblob); + rc = ntlm_negotiate(work, negblob, negblob_len); if (rc) goto out_err; rsp->hdr.Status = @@ -1796,9 +1790,30 @@ out_err: conn->mechToken = NULL; } - if (rc < 0 && sess) { - ksmbd_session_destroy(sess); - work->sess = NULL; + if (rc < 0) { + /* + * SecurityBufferOffset should be set to zero + * in session setup error response. + */ + rsp->SecurityBufferOffset = 0; + + if (sess) { + bool try_delay = false; + + /* + * To avoid dictionary attacks (repeated session setups rapidly sent) to + * connect to server, ksmbd make a delay of a 5 seconds on session setup + * failure to make it harder to send enough random connection requests + * to break into a server. + */ + if (sess->user && sess->user->flags & KSMBD_USER_FLAG_DELAY_SESSION) + try_delay = true; + + ksmbd_session_destroy(sess); + work->sess = NULL; + if (try_delay) + ssleep(5); + } } return rc; @@ -3779,6 +3794,24 @@ static int verify_info_level(int info_level) return 0; } +static int smb2_calc_max_out_buf_len(struct ksmbd_work *work, + unsigned short hdr2_len, + unsigned int out_buf_len) +{ + int free_len; + + if (out_buf_len > work->conn->vals->max_trans_size) + return -EINVAL; + + free_len = (int)(work->response_sz - + (get_rfc1002_len(work->response_buf) + 4)) - + hdr2_len; + if (free_len < 0) + return -EINVAL; + + return min_t(int, out_buf_len, free_len); +} + int smb2_query_dir(struct ksmbd_work *work) { struct ksmbd_conn *conn = work->conn; @@ -3855,9 +3888,13 @@ int smb2_query_dir(struct ksmbd_work *work) memset(&d_info, 0, sizeof(struct ksmbd_dir_info)); d_info.wptr = (char *)rsp->Buffer; d_info.rptr = (char *)rsp->Buffer; - d_info.out_buf_len = (work->response_sz - (get_rfc1002_len(rsp_org) + 4)); - d_info.out_buf_len = min_t(int, d_info.out_buf_len, le32_to_cpu(req->OutputBufferLength)) - - sizeof(struct smb2_query_directory_rsp); + d_info.out_buf_len = + smb2_calc_max_out_buf_len(work, 8, + le32_to_cpu(req->OutputBufferLength)); + if (d_info.out_buf_len < 0) { + rc = -EINVAL; + goto err_out; + } d_info.flags = srch_flag; /* @@ -4091,12 +4128,11 @@ static int smb2_get_ea(struct ksmbd_work *work, struct ksmbd_file *fp, le32_to_cpu(req->Flags)); } - buf_free_len = work->response_sz - - (get_rfc1002_len(rsp_org) + 4) - - sizeof(struct smb2_query_info_rsp); - - if (le32_to_cpu(req->OutputBufferLength) < buf_free_len) - buf_free_len = le32_to_cpu(req->OutputBufferLength); + buf_free_len = + smb2_calc_max_out_buf_len(work, 8, + le32_to_cpu(req->OutputBufferLength)); + if (buf_free_len < 0) + return -EINVAL; rc = ksmbd_vfs_listxattr(path->dentry, &xattr_list); if (rc < 0) { @@ -4407,6 +4443,8 @@ static void get_file_stream_info(struct ksmbd_work *work, struct path *path = &fp->filp->f_path; ssize_t xattr_list_len; int nbytes = 0, streamlen, stream_name_len, next, idx = 0; + int buf_free_len; + struct smb2_query_info_req *req = ksmbd_req_buf_next(work); generic_fillattr(file_mnt_user_ns(fp->filp), file_inode(fp->filp), &stat); @@ -4420,6 +4458,12 @@ static void get_file_stream_info(struct ksmbd_work *work, goto out; } + buf_free_len = + smb2_calc_max_out_buf_len(work, 8, + le32_to_cpu(req->OutputBufferLength)); + if (buf_free_len < 0) + goto out; + while (idx < xattr_list_len) { stream_name = xattr_list + idx; streamlen = strlen(stream_name); @@ -4444,6 +4488,10 @@ static void get_file_stream_info(struct ksmbd_work *work, streamlen = snprintf(stream_buf, streamlen + 1, ":%s", &stream_name[XATTR_NAME_STREAM_LEN]); + next = sizeof(struct smb2_file_stream_info) + streamlen * 2; + if (next > buf_free_len) + break; + file_info = (struct smb2_file_stream_info *)&rsp->Buffer[nbytes]; streamlen = smbConvertToUTF16((__le16 *)file_info->StreamName, stream_buf, streamlen, @@ -4454,12 +4502,13 @@ static void get_file_stream_info(struct ksmbd_work *work, file_info->StreamSize = cpu_to_le64(stream_name_len); file_info->StreamAllocationSize = cpu_to_le64(stream_name_len); - next = sizeof(struct smb2_file_stream_info) + streamlen; nbytes += next; + buf_free_len -= next; file_info->NextEntryOffset = cpu_to_le32(next); } - if (!S_ISDIR(stat.mode)) { + if (!S_ISDIR(stat.mode) && + buf_free_len >= sizeof(struct smb2_file_stream_info) + 7 * 2) { file_info = (struct smb2_file_stream_info *) &rsp->Buffer[nbytes]; streamlen = smbConvertToUTF16((__le16 *)file_info->StreamName, @@ -6220,8 +6269,7 @@ static noinline int smb2_write_pipe(struct ksmbd_work *work) (offsetof(struct smb2_write_req, Buffer) - 4)) { data_buf = (char *)&req->Buffer[0]; } else { - if ((le16_to_cpu(req->DataOffset) > get_rfc1002_len(req)) || - (le16_to_cpu(req->DataOffset) + length > get_rfc1002_len(req))) { + if ((u64)le16_to_cpu(req->DataOffset) + length > get_rfc1002_len(req)) { pr_err("invalid write data offset %u, smb_len %u\n", le16_to_cpu(req->DataOffset), get_rfc1002_len(req)); @@ -6379,8 +6427,7 @@ int smb2_write(struct ksmbd_work *work) (offsetof(struct smb2_write_req, Buffer) - 4)) { data_buf = (char *)&req->Buffer[0]; } else { - if ((le16_to_cpu(req->DataOffset) > get_rfc1002_len(req)) || - (le16_to_cpu(req->DataOffset) + length > get_rfc1002_len(req))) { + if ((u64)le16_to_cpu(req->DataOffset) + length > get_rfc1002_len(req)) { pr_err("invalid write data offset %u, smb_len %u\n", le16_to_cpu(req->DataOffset), get_rfc1002_len(req)); @@ -7023,24 +7070,26 @@ out2: return err; } -static int fsctl_copychunk(struct ksmbd_work *work, struct smb2_ioctl_req *req, +static int fsctl_copychunk(struct ksmbd_work *work, + struct copychunk_ioctl_req *ci_req, + unsigned int cnt_code, + unsigned int input_count, + unsigned long long volatile_id, + unsigned long long persistent_id, struct smb2_ioctl_rsp *rsp) { - struct copychunk_ioctl_req *ci_req; struct copychunk_ioctl_rsp *ci_rsp; struct ksmbd_file *src_fp = NULL, *dst_fp = NULL; struct srv_copychunk *chunks; unsigned int i, chunk_count, chunk_count_written = 0; unsigned int chunk_size_written = 0; loff_t total_size_written = 0; - int ret, cnt_code; + int ret = 0; - cnt_code = le32_to_cpu(req->CntCode); - ci_req = (struct copychunk_ioctl_req *)&req->Buffer[0]; ci_rsp = (struct copychunk_ioctl_rsp *)&rsp->Buffer[0]; - rsp->VolatileFileId = req->VolatileFileId; - rsp->PersistentFileId = req->PersistentFileId; + rsp->VolatileFileId = cpu_to_le64(volatile_id); + rsp->PersistentFileId = cpu_to_le64(persistent_id); ci_rsp->ChunksWritten = cpu_to_le32(ksmbd_server_side_copy_max_chunk_count()); ci_rsp->ChunkBytesWritten = @@ -7050,12 +7099,13 @@ static int fsctl_copychunk(struct ksmbd_work *work, struct smb2_ioctl_req *req, chunks = (struct srv_copychunk *)&ci_req->Chunks[0]; chunk_count = le32_to_cpu(ci_req->ChunkCount); + if (chunk_count == 0) + goto out; total_size_written = 0; /* verify the SRV_COPYCHUNK_COPY packet */ if (chunk_count > ksmbd_server_side_copy_max_chunk_count() || - le32_to_cpu(req->InputCount) < - offsetof(struct copychunk_ioctl_req, Chunks) + + input_count < offsetof(struct copychunk_ioctl_req, Chunks) + chunk_count * sizeof(struct srv_copychunk)) { rsp->hdr.Status = STATUS_INVALID_PARAMETER; return -EINVAL; @@ -7076,9 +7126,7 @@ static int fsctl_copychunk(struct ksmbd_work *work, struct smb2_ioctl_req *req, src_fp = ksmbd_lookup_foreign_fd(work, le64_to_cpu(ci_req->ResumeKey[0])); - dst_fp = ksmbd_lookup_fd_slow(work, - le64_to_cpu(req->VolatileFileId), - le64_to_cpu(req->PersistentFileId)); + dst_fp = ksmbd_lookup_fd_slow(work, volatile_id, persistent_id); ret = -EINVAL; if (!src_fp || src_fp->persistent_id != le64_to_cpu(ci_req->ResumeKey[1])) { @@ -7153,8 +7201,8 @@ static __be32 idev_ipv4_address(struct in_device *idev) } static int fsctl_query_iface_info_ioctl(struct ksmbd_conn *conn, - struct smb2_ioctl_req *req, - struct smb2_ioctl_rsp *rsp) + struct smb2_ioctl_rsp *rsp, + unsigned int out_buf_len) { struct network_interface_info_ioctl_rsp *nii_rsp = NULL; int nbytes = 0; @@ -7166,6 +7214,12 @@ static int fsctl_query_iface_info_ioctl(struct ksmbd_conn *conn, rtnl_lock(); for_each_netdev(&init_net, netdev) { + if (out_buf_len < + nbytes + sizeof(struct network_interface_info_ioctl_rsp)) { + rtnl_unlock(); + return -ENOSPC; + } + if (netdev->type == ARPHRD_LOOPBACK) continue; @@ -7245,11 +7299,6 @@ static int fsctl_query_iface_info_ioctl(struct ksmbd_conn *conn, if (nii_rsp) nii_rsp->Next = 0; - if (!nbytes) { - rsp->hdr.Status = STATUS_BUFFER_TOO_SMALL; - return -EINVAL; - } - rsp->PersistentFileId = cpu_to_le64(SMB2_NO_FID); rsp->VolatileFileId = cpu_to_le64(SMB2_NO_FID); return nbytes; @@ -7257,11 +7306,16 @@ static int fsctl_query_iface_info_ioctl(struct ksmbd_conn *conn, static int fsctl_validate_negotiate_info(struct ksmbd_conn *conn, struct validate_negotiate_info_req *neg_req, - struct validate_negotiate_info_rsp *neg_rsp) + struct validate_negotiate_info_rsp *neg_rsp, + unsigned int in_buf_len) { int ret = 0; int dialect; + if (in_buf_len < sizeof(struct validate_negotiate_info_req) + + le16_to_cpu(neg_req->DialectCount) * sizeof(__le16)) + return -EINVAL; + dialect = ksmbd_lookup_dialect_by_id(neg_req->Dialects, neg_req->DialectCount); if (dialect == BAD_PROT_ID || dialect != conn->dialect) { @@ -7295,7 +7349,7 @@ err_out: static int fsctl_query_allocated_ranges(struct ksmbd_work *work, u64 id, struct file_allocated_range_buffer *qar_req, struct file_allocated_range_buffer *qar_rsp, - int in_count, int *out_count) + unsigned int in_count, unsigned int *out_count) { struct ksmbd_file *fp; loff_t start, length; @@ -7322,7 +7376,8 @@ static int fsctl_query_allocated_ranges(struct ksmbd_work *work, u64 id, } static int fsctl_pipe_transceive(struct ksmbd_work *work, u64 id, - int out_buf_len, struct smb2_ioctl_req *req, + unsigned int out_buf_len, + struct smb2_ioctl_req *req, struct smb2_ioctl_rsp *rsp) { struct ksmbd_rpc_command *rpc_resp; @@ -7436,8 +7491,7 @@ int smb2_ioctl(struct ksmbd_work *work) { struct smb2_ioctl_req *req; struct smb2_ioctl_rsp *rsp, *rsp_org; - int cnt_code, nbytes = 0; - int out_buf_len; + unsigned int cnt_code, nbytes = 0, out_buf_len, in_buf_len; u64 id = KSMBD_NO_FID; struct ksmbd_conn *conn = work->conn; int ret = 0; @@ -7465,8 +7519,14 @@ int smb2_ioctl(struct ksmbd_work *work) } cnt_code = le32_to_cpu(req->CntCode); - out_buf_len = le32_to_cpu(req->MaxOutputResponse); - out_buf_len = min(KSMBD_IPC_MAX_PAYLOAD, out_buf_len); + ret = smb2_calc_max_out_buf_len(work, 48, + le32_to_cpu(req->MaxOutputResponse)); + if (ret < 0) { + rsp->hdr.Status = STATUS_INVALID_PARAMETER; + goto out; + } + out_buf_len = (unsigned int)ret; + in_buf_len = le32_to_cpu(req->InputCount); switch (cnt_code) { case FSCTL_DFS_GET_REFERRALS: @@ -7494,6 +7554,7 @@ int smb2_ioctl(struct ksmbd_work *work) break; } case FSCTL_PIPE_TRANSCEIVE: + out_buf_len = min_t(u32, KSMBD_IPC_MAX_PAYLOAD, out_buf_len); nbytes = fsctl_pipe_transceive(work, id, out_buf_len, req, rsp); break; case FSCTL_VALIDATE_NEGOTIATE_INFO: @@ -7502,9 +7563,16 @@ int smb2_ioctl(struct ksmbd_work *work) goto out; } + if (in_buf_len < sizeof(struct validate_negotiate_info_req)) + return -EINVAL; + + if (out_buf_len < sizeof(struct validate_negotiate_info_rsp)) + return -EINVAL; + ret = fsctl_validate_negotiate_info(conn, (struct validate_negotiate_info_req *)&req->Buffer[0], - (struct validate_negotiate_info_rsp *)&rsp->Buffer[0]); + (struct validate_negotiate_info_rsp *)&rsp->Buffer[0], + in_buf_len); if (ret < 0) goto out; @@ -7513,9 +7581,10 @@ int smb2_ioctl(struct ksmbd_work *work) rsp->VolatileFileId = cpu_to_le64(SMB2_NO_FID); break; case FSCTL_QUERY_NETWORK_INTERFACE_INFO: - nbytes = fsctl_query_iface_info_ioctl(conn, req, rsp); - if (nbytes < 0) + ret = fsctl_query_iface_info_ioctl(conn, rsp, out_buf_len); + if (ret < 0) goto out; + nbytes = ret; break; case FSCTL_REQUEST_RESUME_KEY: if (out_buf_len < sizeof(struct resume_key_ioctl_rsp)) { @@ -7540,15 +7609,33 @@ int smb2_ioctl(struct ksmbd_work *work) goto out; } + if (in_buf_len < sizeof(struct copychunk_ioctl_req)) { + ret = -EINVAL; + goto out; + } + if (out_buf_len < sizeof(struct copychunk_ioctl_rsp)) { ret = -EINVAL; goto out; } nbytes = sizeof(struct copychunk_ioctl_rsp); - fsctl_copychunk(work, req, rsp); + rsp->VolatileFileId = req->VolatileFileId; + rsp->PersistentFileId = req->PersistentFileId; + fsctl_copychunk(work, + (struct copychunk_ioctl_req *)&req->Buffer[0], + le32_to_cpu(req->CntCode), + le32_to_cpu(req->InputCount), + le64_to_cpu(req->VolatileFileId), + le64_to_cpu(req->PersistentFileId), + rsp); break; case FSCTL_SET_SPARSE: + if (in_buf_len < sizeof(struct file_sparse)) { + ret = -EINVAL; + goto out; + } + ret = fsctl_set_sparse(work, id, (struct file_sparse *)&req->Buffer[0]); if (ret < 0) @@ -7567,6 +7654,11 @@ int smb2_ioctl(struct ksmbd_work *work) goto out; } + if (in_buf_len < sizeof(struct file_zero_data_information)) { + ret = -EINVAL; + goto out; + } + zero_data = (struct file_zero_data_information *)&req->Buffer[0]; @@ -7586,6 +7678,11 @@ int smb2_ioctl(struct ksmbd_work *work) break; } case FSCTL_QUERY_ALLOCATED_RANGES: + if (in_buf_len < sizeof(struct file_allocated_range_buffer)) { + ret = -EINVAL; + goto out; + } + ret = fsctl_query_allocated_ranges(work, id, (struct file_allocated_range_buffer *)&req->Buffer[0], (struct file_allocated_range_buffer *)&rsp->Buffer[0], @@ -7626,6 +7723,11 @@ int smb2_ioctl(struct ksmbd_work *work) struct duplicate_extents_to_file *dup_ext; loff_t src_off, dst_off, length, cloned; + if (in_buf_len < sizeof(struct duplicate_extents_to_file)) { + ret = -EINVAL; + goto out; + } + dup_ext = (struct duplicate_extents_to_file *)&req->Buffer[0]; fp_in = ksmbd_lookup_fd_slow(work, dup_ext->VolatileFileHandle, @@ -7696,6 +7798,8 @@ out: rsp->hdr.Status = STATUS_OBJECT_NAME_NOT_FOUND; else if (ret == -EOPNOTSUPP) rsp->hdr.Status = STATUS_NOT_SUPPORTED; + else if (ret == -ENOSPC) + rsp->hdr.Status = STATUS_BUFFER_TOO_SMALL; else if (ret < 0 || rsp->hdr.Status == 0) rsp->hdr.Status = STATUS_INVALID_PARAMETER; smb2_set_err_rsp(work); diff --git a/fs/ksmbd/smb2pdu.h b/fs/ksmbd/smb2pdu.h index a6dec5ec6a54..ff5a2f01d34a 100644 --- a/fs/ksmbd/smb2pdu.h +++ b/fs/ksmbd/smb2pdu.h @@ -113,6 +113,8 @@ #define SMB21_DEFAULT_IOSIZE (1024 * 1024) #define SMB3_DEFAULT_IOSIZE (4 * 1024 * 1024) #define SMB3_DEFAULT_TRANS_SIZE (1024 * 1024) +#define SMB3_MIN_IOSIZE (64 * 1024) +#define SMB3_MAX_IOSIZE (8 * 1024 * 1024) /* * SMB2 Header Definition diff --git a/fs/ksmbd/transport_ipc.c b/fs/ksmbd/transport_ipc.c index 44aea33a67fa..1acf1892a466 100644 --- a/fs/ksmbd/transport_ipc.c +++ b/fs/ksmbd/transport_ipc.c @@ -601,7 +601,7 @@ int ksmbd_ipc_tree_disconnect_request(unsigned long long session_id, return ret; } -int ksmbd_ipc_logout_request(const char *account) +int ksmbd_ipc_logout_request(const char *account, int flags) { struct ksmbd_ipc_msg *msg; struct ksmbd_logout_request *req; @@ -616,6 +616,7 @@ int ksmbd_ipc_logout_request(const char *account) msg->type = KSMBD_EVENT_LOGOUT_REQUEST; req = (struct ksmbd_logout_request *)msg->payload; + req->account_flags = flags; strscpy(req->account, account, KSMBD_REQ_MAX_ACCOUNT_NAME_SZ); ret = ipc_msg_send(msg); diff --git a/fs/ksmbd/transport_ipc.h b/fs/ksmbd/transport_ipc.h index 9eacc895ffdb..5e5b90a0c187 100644 --- a/fs/ksmbd/transport_ipc.h +++ b/fs/ksmbd/transport_ipc.h @@ -25,7 +25,7 @@ ksmbd_ipc_tree_connect_request(struct ksmbd_session *sess, struct sockaddr *peer_addr); int ksmbd_ipc_tree_disconnect_request(unsigned long long session_id, unsigned long long connect_id); -int ksmbd_ipc_logout_request(const char *account); +int ksmbd_ipc_logout_request(const char *account, int flags); struct ksmbd_share_config_response * ksmbd_ipc_share_config_request(const char *name); struct ksmbd_spnego_authen_response * diff --git a/fs/ksmbd/transport_rdma.c b/fs/ksmbd/transport_rdma.c index 3a7fa23ba850..a2fd5a4d4cd5 100644 --- a/fs/ksmbd/transport_rdma.c +++ b/fs/ksmbd/transport_rdma.c @@ -549,6 +549,10 @@ static void recv_done(struct ib_cq *cq, struct ib_wc *wc) switch (recvmsg->type) { case SMB_DIRECT_MSG_NEGOTIATE_REQ: + if (wc->byte_len < sizeof(struct smb_direct_negotiate_req)) { + put_empty_recvmsg(t, recvmsg); + return; + } t->negotiation_requested = true; t->full_packet_received = true; wake_up_interruptible(&t->wait_status); @@ -556,10 +560,23 @@ static void recv_done(struct ib_cq *cq, struct ib_wc *wc) case SMB_DIRECT_MSG_DATA_TRANSFER: { struct smb_direct_data_transfer *data_transfer = (struct smb_direct_data_transfer *)recvmsg->packet; - int data_length = le32_to_cpu(data_transfer->data_length); + unsigned int data_length; int avail_recvmsg_count, receive_credits; + if (wc->byte_len < + offsetof(struct smb_direct_data_transfer, padding)) { + put_empty_recvmsg(t, recvmsg); + return; + } + + data_length = le32_to_cpu(data_transfer->data_length); if (data_length) { + if (wc->byte_len < sizeof(struct smb_direct_data_transfer) + + (u64)data_length) { + put_empty_recvmsg(t, recvmsg); + return; + } + if (t->full_packet_received) recvmsg->first_segment = true; @@ -568,7 +585,7 @@ static void recv_done(struct ib_cq *cq, struct ib_wc *wc) else t->full_packet_received = true; - enqueue_reassembly(t, recvmsg, data_length); + enqueue_reassembly(t, recvmsg, (int)data_length); wake_up_interruptible(&t->wait_reassembly_queue); spin_lock(&t->receive_credit_lock); diff --git a/fs/ksmbd/vfs.c b/fs/ksmbd/vfs.c index b41954294d38..835b384b0895 100644 --- a/fs/ksmbd/vfs.c +++ b/fs/ksmbd/vfs.c @@ -1023,7 +1023,7 @@ int ksmbd_vfs_zero_data(struct ksmbd_work *work, struct ksmbd_file *fp, int ksmbd_vfs_fqar_lseek(struct ksmbd_file *fp, loff_t start, loff_t length, struct file_allocated_range_buffer *ranges, - int in_count, int *out_count) + unsigned int in_count, unsigned int *out_count) { struct file *f = fp->filp; struct inode *inode = file_inode(fp->filp); diff --git a/fs/ksmbd/vfs.h b/fs/ksmbd/vfs.h index 7b1dcaa3fbdc..b0d5b8feb4a3 100644 --- a/fs/ksmbd/vfs.h +++ b/fs/ksmbd/vfs.h @@ -166,7 +166,7 @@ int ksmbd_vfs_zero_data(struct ksmbd_work *work, struct ksmbd_file *fp, struct file_allocated_range_buffer; int ksmbd_vfs_fqar_lseek(struct ksmbd_file *fp, loff_t start, loff_t length, struct file_allocated_range_buffer *ranges, - int in_count, int *out_count); + unsigned int in_count, unsigned int *out_count); int ksmbd_vfs_unlink(struct user_namespace *user_ns, struct dentry *dir, struct dentry *dentry); void *ksmbd_vfs_init_kstat(char **p, struct ksmbd_kstat *ksmbd_kstat); diff --git a/fs/locks.c b/fs/locks.c index 3d6fb4ae847b..0fca9d680978 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -2,117 +2,11 @@ /* * linux/fs/locks.c * - * Provide support for fcntl()'s F_GETLK, F_SETLK, and F_SETLKW calls. - * Doug Evans (dje@spiff.uucp), August 07, 1992 + * We implement four types of file locks: BSD locks, posix locks, open + * file description locks, and leases. For details about BSD locks, + * see the flock(2) man page; for details about the other three, see + * fcntl(2). * - * Deadlock detection added. - * FIXME: one thing isn't handled yet: - * - mandatory locks (requires lots of changes elsewhere) - * Kelly Carmichael (kelly@[142.24.8.65]), September 17, 1994. - * - * Miscellaneous edits, and a total rewrite of posix_lock_file() code. - * Kai Petzke (wpp@marie.physik.tu-berlin.de), 1994 - * - * Converted file_lock_table to a linked list from an array, which eliminates - * the limits on how many active file locks are open. - * Chad Page (pageone@netcom.com), November 27, 1994 - * - * Removed dependency on file descriptors. dup()'ed file descriptors now - * get the same locks as the original file descriptors, and a close() on - * any file descriptor removes ALL the locks on the file for the current - * process. Since locks still depend on the process id, locks are inherited - * after an exec() but not after a fork(). This agrees with POSIX, and both - * BSD and SVR4 practice. - * Andy Walker (andy@lysaker.kvaerner.no), February 14, 1995 - * - * Scrapped free list which is redundant now that we allocate locks - * dynamically with kmalloc()/kfree(). - * Andy Walker (andy@lysaker.kvaerner.no), February 21, 1995 - * - * Implemented two lock personalities - FL_FLOCK and FL_POSIX. - * - * FL_POSIX locks are created with calls to fcntl() and lockf() through the - * fcntl() system call. They have the semantics described above. - * - * FL_FLOCK locks are created with calls to flock(), through the flock() - * system call, which is new. Old C libraries implement flock() via fcntl() - * and will continue to use the old, broken implementation. - * - * FL_FLOCK locks follow the 4.4 BSD flock() semantics. They are associated - * with a file pointer (filp). As a result they can be shared by a parent - * process and its children after a fork(). They are removed when the last - * file descriptor referring to the file pointer is closed (unless explicitly - * unlocked). - * - * FL_FLOCK locks never deadlock, an existing lock is always removed before - * upgrading from shared to exclusive (or vice versa). When this happens - * any processes blocked by the current lock are woken up and allowed to - * run before the new lock is applied. - * Andy Walker (andy@lysaker.kvaerner.no), June 09, 1995 - * - * Removed some race conditions in flock_lock_file(), marked other possible - * races. Just grep for FIXME to see them. - * Dmitry Gorodchanin (pgmdsg@ibi.com), February 09, 1996. - * - * Addressed Dmitry's concerns. Deadlock checking no longer recursive. - * Lock allocation changed to GFP_ATOMIC as we can't afford to sleep - * once we've checked for blocking and deadlocking. - * Andy Walker (andy@lysaker.kvaerner.no), April 03, 1996. - * - * Initial implementation of mandatory locks. SunOS turned out to be - * a rotten model, so I implemented the "obvious" semantics. - * See 'Documentation/filesystems/mandatory-locking.rst' for details. - * Andy Walker (andy@lysaker.kvaerner.no), April 06, 1996. - * - * Don't allow mandatory locks on mmap()'ed files. Added simple functions to - * check if a file has mandatory locks, used by mmap(), open() and creat() to - * see if system call should be rejected. Ref. HP-UX/SunOS/Solaris Reference - * Manual, Section 2. - * Andy Walker (andy@lysaker.kvaerner.no), April 09, 1996. - * - * Tidied up block list handling. Added '/proc/locks' interface. - * Andy Walker (andy@lysaker.kvaerner.no), April 24, 1996. - * - * Fixed deadlock condition for pathological code that mixes calls to - * flock() and fcntl(). - * Andy Walker (andy@lysaker.kvaerner.no), April 29, 1996. - * - * Allow only one type of locking scheme (FL_POSIX or FL_FLOCK) to be in use - * for a given file at a time. Changed the CONFIG_LOCK_MANDATORY scheme to - * guarantee sensible behaviour in the case where file system modules might - * be compiled with different options than the kernel itself. - * Andy Walker (andy@lysaker.kvaerner.no), May 15, 1996. - * - * Added a couple of missing wake_up() calls. Thanks to Thomas Meckel - * (Thomas.Meckel@mni.fh-giessen.de) for spotting this. - * Andy Walker (andy@lysaker.kvaerner.no), May 15, 1996. - * - * Changed FL_POSIX locks to use the block list in the same way as FL_FLOCK - * locks. Changed process synchronisation to avoid dereferencing locks that - * have already been freed. - * Andy Walker (andy@lysaker.kvaerner.no), Sep 21, 1996. - * - * Made the block list a circular list to minimise searching in the list. - * Andy Walker (andy@lysaker.kvaerner.no), Sep 25, 1996. - * - * Made mandatory locking a mount option. Default is not to allow mandatory - * locking. - * Andy Walker (andy@lysaker.kvaerner.no), Oct 04, 1996. - * - * Some adaptations for NFS support. - * Olaf Kirch (okir@monad.swb.de), Dec 1996, - * - * Fixed /proc/locks interface so that we can't overrun the buffer we are handed. - * Andy Walker (andy@lysaker.kvaerner.no), May 12, 1997. - * - * Use slab allocator instead of kmalloc/kfree. - * Use generic list implementation from <linux/list.h>. - * Sped up posix_locks_deadlock by only considering blocked locks. - * Matthew Wilcox <willy@debian.org>, March, 2000. - * - * Leases and LOCK_MAND - * Matthew Wilcox <willy@debian.org>, June, 2000. - * Stephen Rothwell <sfr@canb.auug.org.au>, June, 2000. * * Locking conflicts and dependencies: * If multiple threads attempt to lock the same byte (or flock the same file) @@ -461,8 +355,6 @@ static void locks_move_blocks(struct file_lock *new, struct file_lock *fl) } static inline int flock_translate_cmd(int cmd) { - if (cmd & LOCK_MAND) - return cmd & (LOCK_MAND | LOCK_RW); switch (cmd) { case LOCK_SH: return F_RDLCK; @@ -942,8 +834,6 @@ static bool flock_locks_conflict(struct file_lock *caller_fl, */ if (caller_fl->fl_file == sys_fl->fl_file) return false; - if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND)) - return false; return locks_conflict(caller_fl, sys_fl); } @@ -2116,11 +2006,9 @@ EXPORT_SYMBOL(locks_lock_inode_wait); * - %LOCK_SH -- a shared lock. * - %LOCK_EX -- an exclusive lock. * - %LOCK_UN -- remove an existing lock. - * - %LOCK_MAND -- a 'mandatory' flock. - * This exists to emulate Windows Share Modes. + * - %LOCK_MAND -- a 'mandatory' flock. (DEPRECATED) * - * %LOCK_MAND can be combined with %LOCK_READ or %LOCK_WRITE to allow other - * processes read and write access respectively. + * %LOCK_MAND support has been removed from the kernel. */ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) { @@ -2137,9 +2025,22 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) cmd &= ~LOCK_NB; unlock = (cmd == LOCK_UN); - if (!unlock && !(cmd & LOCK_MAND) && - !(f.file->f_mode & (FMODE_READ|FMODE_WRITE))) + if (!unlock && !(f.file->f_mode & (FMODE_READ|FMODE_WRITE))) + goto out_putf; + + /* + * LOCK_MAND locks were broken for a long time in that they never + * conflicted with one another and didn't prevent any sort of open, + * read or write activity. + * + * Just ignore these requests now, to preserve legacy behavior, but + * throw a warning to let people know that they don't actually work. + */ + if (cmd & LOCK_MAND) { + pr_warn_once("Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.\n"); + error = 0; goto out_putf; + } lock = flock_make_lock(f.file, cmd, NULL); if (IS_ERR(lock)) { @@ -2718,6 +2619,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl, struct inode *inode = NULL; unsigned int fl_pid; struct pid_namespace *proc_pidns = proc_pid_ns(file_inode(f->file)->i_sb); + int type; fl_pid = locks_translate_pid(fl, proc_pidns); /* @@ -2745,11 +2647,7 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl, seq_printf(f, " %s ", (inode == NULL) ? "*NOINODE*" : "ADVISORY "); } else if (IS_FLOCK(fl)) { - if (fl->fl_type & LOCK_MAND) { - seq_puts(f, "FLOCK MSNFS "); - } else { - seq_puts(f, "FLOCK ADVISORY "); - } + seq_puts(f, "FLOCK ADVISORY "); } else if (IS_LEASE(fl)) { if (fl->fl_flags & FL_DELEG) seq_puts(f, "DELEG "); @@ -2765,17 +2663,10 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl, } else { seq_puts(f, "UNKNOWN UNKNOWN "); } - if (fl->fl_type & LOCK_MAND) { - seq_printf(f, "%s ", - (fl->fl_type & LOCK_READ) - ? (fl->fl_type & LOCK_WRITE) ? "RW " : "READ " - : (fl->fl_type & LOCK_WRITE) ? "WRITE" : "NONE "); - } else { - int type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type; + type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type; - seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" : - (type == F_RDLCK) ? "READ" : "UNLCK"); - } + seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" : + (type == F_RDLCK) ? "READ" : "UNLCK"); if (inode) { /* userspace relies on this representation of dev_t */ seq_printf(f, "%d %02x:%02x:%lu ", fl_pid, diff --git a/fs/namei.c b/fs/namei.c index 1946d9667790..1f9d2187c765 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3076,9 +3076,7 @@ static int handle_truncate(struct user_namespace *mnt_userns, struct file *filp) int error = get_write_access(inode); if (error) return error; - /* - * Refuse to truncate files with mandatory locks held on them. - */ + error = security_path_truncate(path); if (!error) { error = do_truncate(mnt_userns, path->dentry, 0, diff --git a/fs/nfs/file.c b/fs/nfs/file.c index aa353fd58240..24e7dccce355 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -843,15 +843,6 @@ int nfs_flock(struct file *filp, int cmd, struct file_lock *fl) if (!(fl->fl_flags & FL_FLOCK)) return -ENOLCK; - /* - * The NFSv4 protocol doesn't support LOCK_MAND, which is not part of - * any standard. In principle we might be able to support LOCK_MAND - * on NFSv2/3 since NLMv3/4 support DOS share modes, but for now the - * NFS code is not set up for it. - */ - if (fl->fl_type & LOCK_MAND) - return -EINVAL; - if (NFS_SERVER(inode)->flags & NFS_MOUNT_LOCAL_FLOCK) is_local = 1; diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index f1cc8258d34a..5d9ae17bd443 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -7045,7 +7045,7 @@ void ocfs2_set_inode_data_inline(struct inode *inode, struct ocfs2_dinode *di) int ocfs2_convert_inline_data_to_extents(struct inode *inode, struct buffer_head *di_bh) { - int ret, i, has_data, num_pages = 0; + int ret, has_data, num_pages = 0; int need_free = 0; u32 bit_off, num; handle_t *handle; @@ -7054,26 +7054,17 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data; struct ocfs2_alloc_context *data_ac = NULL; - struct page **pages = NULL; - loff_t end = osb->s_clustersize; + struct page *page = NULL; struct ocfs2_extent_tree et; int did_quota = 0; has_data = i_size_read(inode) ? 1 : 0; if (has_data) { - pages = kcalloc(ocfs2_pages_per_cluster(osb->sb), - sizeof(struct page *), GFP_NOFS); - if (pages == NULL) { - ret = -ENOMEM; - mlog_errno(ret); - return ret; - } - ret = ocfs2_reserve_clusters(osb, 1, &data_ac); if (ret) { mlog_errno(ret); - goto free_pages; + goto out; } } @@ -7093,7 +7084,8 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, } if (has_data) { - unsigned int page_end; + unsigned int page_end = min_t(unsigned, PAGE_SIZE, + osb->s_clustersize); u64 phys; ret = dquot_alloc_space_nodirty(inode, @@ -7117,15 +7109,8 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, */ block = phys = ocfs2_clusters_to_blocks(inode->i_sb, bit_off); - /* - * Non sparse file systems zero on extend, so no need - * to do that now. - */ - if (!ocfs2_sparse_alloc(osb) && - PAGE_SIZE < osb->s_clustersize) - end = PAGE_SIZE; - - ret = ocfs2_grab_eof_pages(inode, 0, end, pages, &num_pages); + ret = ocfs2_grab_eof_pages(inode, 0, page_end, &page, + &num_pages); if (ret) { mlog_errno(ret); need_free = 1; @@ -7136,20 +7121,15 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, * This should populate the 1st page for us and mark * it up to date. */ - ret = ocfs2_read_inline_data(inode, pages[0], di_bh); + ret = ocfs2_read_inline_data(inode, page, di_bh); if (ret) { mlog_errno(ret); need_free = 1; goto out_unlock; } - page_end = PAGE_SIZE; - if (PAGE_SIZE > osb->s_clustersize) - page_end = osb->s_clustersize; - - for (i = 0; i < num_pages; i++) - ocfs2_map_and_dirty_page(inode, handle, 0, page_end, - pages[i], i > 0, &phys); + ocfs2_map_and_dirty_page(inode, handle, 0, page_end, page, 0, + &phys); } spin_lock(&oi->ip_lock); @@ -7180,8 +7160,8 @@ int ocfs2_convert_inline_data_to_extents(struct inode *inode, } out_unlock: - if (pages) - ocfs2_unlock_and_free_pages(pages, num_pages); + if (page) + ocfs2_unlock_and_free_pages(&page, num_pages); out_commit: if (ret < 0 && did_quota) @@ -7205,8 +7185,6 @@ out_commit: out: if (data_ac) ocfs2_free_alloc_context(data_ac); -free_pages: - kfree(pages); return ret; } diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c index 8521942f5af2..481017e1dac5 100644 --- a/fs/ocfs2/suballoc.c +++ b/fs/ocfs2/suballoc.c @@ -1251,7 +1251,7 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh, { struct ocfs2_group_desc *bg = (struct ocfs2_group_desc *) bg_bh->b_data; struct journal_head *jh; - int ret; + int ret = 1; if (ocfs2_test_bit(nr, (unsigned long *)bg->bg_bitmap)) return 0; @@ -1259,14 +1259,18 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh, if (!buffer_jbd(bg_bh)) return 1; - jh = bh2jh(bg_bh); - spin_lock(&jh->b_state_lock); - bg = (struct ocfs2_group_desc *) jh->b_committed_data; - if (bg) - ret = !ocfs2_test_bit(nr, (unsigned long *)bg->bg_bitmap); - else - ret = 1; - spin_unlock(&jh->b_state_lock); + jbd_lock_bh_journal_head(bg_bh); + if (buffer_jbd(bg_bh)) { + jh = bh2jh(bg_bh); + spin_lock(&jh->b_state_lock); + bg = (struct ocfs2_group_desc *) jh->b_committed_data; + if (bg) + ret = !ocfs2_test_bit(nr, (unsigned long *)bg->bg_bitmap); + else + ret = 1; + spin_unlock(&jh->b_state_lock); + } + jbd_unlock_bh_journal_head(bg_bh); return ret; } diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index c86bd4e60e20..5c914ce9b3ac 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -2167,11 +2167,17 @@ static int ocfs2_initialize_super(struct super_block *sb, } if (ocfs2_clusterinfo_valid(osb)) { + /* + * ci_stack and ci_cluster in ocfs2_cluster_info may not be null + * terminated, so make sure no overflow happens here by using + * memcpy. Destination strings will always be null terminated + * because osb is allocated using kzalloc. + */ osb->osb_stackflags = OCFS2_RAW_SB(di)->s_cluster_info.ci_stackflags; - strlcpy(osb->osb_cluster_stack, + memcpy(osb->osb_cluster_stack, OCFS2_RAW_SB(di)->s_cluster_info.ci_stack, - OCFS2_STACK_LABEL_LEN + 1); + OCFS2_STACK_LABEL_LEN); if (strlen(osb->osb_cluster_stack) != OCFS2_STACK_LABEL_LEN) { mlog(ML_ERROR, "couldn't mount because of an invalid " @@ -2180,9 +2186,9 @@ static int ocfs2_initialize_super(struct super_block *sb, status = -EINVAL; goto bail; } - strlcpy(osb->osb_cluster_name, + memcpy(osb->osb_cluster_name, OCFS2_RAW_SB(di)->s_cluster_info.ci_cluster, - OCFS2_CLUSTER_NAME_LEN + 1); + OCFS2_CLUSTER_NAME_LEN); } else { /* The empty string is identical with classic tools that * don't know about s_cluster_info. */ diff --git a/fs/read_write.c b/fs/read_write.c index af057c57bdc6..0074afa7ecb3 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -368,10 +368,6 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t if (unlikely((ssize_t) count < 0)) return -EINVAL; - /* - * ranged mandatory locking does not apply to streams - it makes sense - * only for files where position has a meaning. - */ if (ppos) { loff_t pos = *ppos; diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 003f0d31743e..22bf14ab2d16 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1827,9 +1827,15 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx, if (mode_wp && mode_dontwake) return -EINVAL; - ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start, - uffdio_wp.range.len, mode_wp, - &ctx->mmap_changing); + if (mmget_not_zero(ctx->mm)) { + ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start, + uffdio_wp.range.len, mode_wp, + &ctx->mmap_changing); + mmput(ctx->mm); + } else { + return -ESRCH; + } + if (ret) return ret; diff --git a/include/acpi/platform/acgcc.h b/include/acpi/platform/acgcc.h index fb172a03a753..20ecb004f5a4 100644 --- a/include/acpi/platform/acgcc.h +++ b/include/acpi/platform/acgcc.h @@ -22,9 +22,14 @@ typedef __builtin_va_list va_list; #define va_arg(v, l) __builtin_va_arg(v, l) #define va_copy(d, s) __builtin_va_copy(d, s) #else +#ifdef __KERNEL__ #include <linux/stdarg.h> -#endif -#endif +#else +/* Used to build acpi tools */ +#include <stdarg.h> +#endif /* __KERNEL__ */ +#endif /* ACPI_USE_BUILTIN_STDARG */ +#endif /* ! va_arg */ #define ACPI_INLINE __inline__ diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h index 4a674db4e1fa..fedc0dfa4877 100644 --- a/include/asm-generic/cacheflush.h +++ b/include/asm-generic/cacheflush.h @@ -49,9 +49,15 @@ static inline void flush_cache_page(struct vm_area_struct *vma, static inline void flush_dcache_page(struct page *page) { } + +static inline void flush_dcache_folio(struct folio *folio) { } #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0 +#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO #endif +#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO +void flush_dcache_folio(struct folio *folio); +#endif #ifndef flush_dcache_mmap_lock static inline void flush_dcache_mmap_lock(struct address_space *mapping) diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 4ac7ce096013..9c14f0a8dbe5 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -64,7 +64,7 @@ static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi) return atomic_long_read(&bdi->tot_write_bandwidth); } -static inline void __add_wb_stat(struct bdi_writeback *wb, +static inline void wb_stat_mod(struct bdi_writeback *wb, enum wb_stat_item item, s64 amount) { percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH); @@ -72,12 +72,12 @@ static inline void __add_wb_stat(struct bdi_writeback *wb, static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) { - __add_wb_stat(wb, item, 1); + wb_stat_mod(wb, item, 1); } static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) { - __add_wb_stat(wb, item, -1); + wb_stat_mod(wb, item, -1); } static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) diff --git a/include/linux/bio.h b/include/linux/bio.h index 9538f20ffaa5..fe6bdfbbef66 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -417,7 +417,8 @@ int bio_add_zone_append_page(struct bio *bio, struct page *page, void __bio_add_page(struct bio *bio, struct page *page, unsigned int len, unsigned int off); int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter); -void bio_release_pages(struct bio *bio, bool mark_dirty); +void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter); +void __bio_release_pages(struct bio *bio, bool mark_dirty); extern void bio_set_pages_dirty(struct bio *bio); extern void bio_check_pages_dirty(struct bio *bio); @@ -428,23 +429,13 @@ extern void bio_free_pages(struct bio *bio); void guard_bio_eod(struct bio *bio); void zero_fill_bio(struct bio *bio); -extern const char *bio_devname(struct bio *bio, char *buffer); +static inline void bio_release_pages(struct bio *bio, bool mark_dirty) +{ + if (!bio_flagged(bio, BIO_NO_PAGE_REF)) + __bio_release_pages(bio, mark_dirty); +} -#define bio_set_dev(bio, bdev) \ -do { \ - bio_clear_flag(bio, BIO_REMAPPED); \ - if ((bio)->bi_bdev != (bdev)) \ - bio_clear_flag(bio, BIO_THROTTLED); \ - (bio)->bi_bdev = (bdev); \ - bio_associate_blkg(bio); \ -} while (0) - -#define bio_copy_dev(dst, src) \ -do { \ - bio_clear_flag(dst, BIO_REMAPPED); \ - (dst)->bi_bdev = (src)->bi_bdev; \ - bio_clone_blkg_association(dst, src); \ -} while (0) +extern const char *bio_devname(struct bio *bio, char *buffer); #define bio_dev(bio) \ disk_devt((bio)->bi_bdev->bd_disk) @@ -463,6 +454,22 @@ static inline void bio_clone_blkg_association(struct bio *dst, struct bio *src) { } #endif /* CONFIG_BLK_CGROUP */ +static inline void bio_set_dev(struct bio *bio, struct block_device *bdev) +{ + bio_clear_flag(bio, BIO_REMAPPED); + if (bio->bi_bdev != bdev) + bio_clear_flag(bio, BIO_THROTTLED); + bio->bi_bdev = bdev; + bio_associate_blkg(bio); +} + +static inline void bio_copy_dev(struct bio *dst, struct bio *src) +{ + bio_clear_flag(dst, BIO_REMAPPED); + dst->bi_bdev = src->bi_bdev; + bio_clone_blkg_association(dst, src); +} + /* * BIO list management for use by remapping drivers (e.g. DM or MD) and loop. * diff --git a/include/linux/blk-crypto-profile.h b/include/linux/blk-crypto-profile.h new file mode 100644 index 000000000000..bbab65bd5428 --- /dev/null +++ b/include/linux/blk-crypto-profile.h @@ -0,0 +1,166 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright 2019 Google LLC + */ + +#ifndef __LINUX_BLK_CRYPTO_PROFILE_H +#define __LINUX_BLK_CRYPTO_PROFILE_H + +#include <linux/bio.h> +#include <linux/blk-crypto.h> + +struct blk_crypto_profile; + +/** + * struct blk_crypto_ll_ops - functions to control inline encryption hardware + * + * Low-level operations for controlling inline encryption hardware. This + * interface must be implemented by storage drivers that support inline + * encryption. All functions may sleep, are serialized by profile->lock, and + * are never called while profile->dev (if set) is runtime-suspended. + */ +struct blk_crypto_ll_ops { + + /** + * @keyslot_program: Program a key into the inline encryption hardware. + * + * Program @key into the specified @slot in the inline encryption + * hardware, overwriting any key that the keyslot may already contain. + * The keyslot is guaranteed to not be in-use by any I/O. + * + * This is required if the device has keyslots. Otherwise (i.e. if the + * device is a layered device, or if the device is real hardware that + * simply doesn't have the concept of keyslots) it is never called. + * + * Must return 0 on success, or -errno on failure. + */ + int (*keyslot_program)(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key, + unsigned int slot); + + /** + * @keyslot_evict: Evict a key from the inline encryption hardware. + * + * If the device has keyslots, this function must evict the key from the + * specified @slot. The slot will contain @key, but there should be no + * need for the @key argument to be used as @slot should be sufficient. + * The keyslot is guaranteed to not be in-use by any I/O. + * + * If the device doesn't have keyslots itself, this function must evict + * @key from any underlying devices. @slot won't be valid in this case. + * + * If there are no keyslots and no underlying devices, this function + * isn't required. + * + * Must return 0 on success, or -errno on failure. + */ + int (*keyslot_evict)(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key, + unsigned int slot); +}; + +/** + * struct blk_crypto_profile - inline encryption profile for a device + * + * This struct contains a storage device's inline encryption capabilities (e.g. + * the supported crypto algorithms), driver-provided functions to control the + * inline encryption hardware (e.g. programming and evicting keys), and optional + * device-independent keyslot management data. + */ +struct blk_crypto_profile { + + /* public: Drivers must initialize the following fields. */ + + /** + * @ll_ops: Driver-provided functions to control the inline encryption + * hardware, e.g. program and evict keys. + */ + struct blk_crypto_ll_ops ll_ops; + + /** + * @max_dun_bytes_supported: The maximum number of bytes supported for + * specifying the data unit number (DUN). Specifically, the range of + * supported DUNs is 0 through (1 << (8 * max_dun_bytes_supported)) - 1. + */ + unsigned int max_dun_bytes_supported; + + /** + * @modes_supported: Array of bitmasks that specifies whether each + * combination of crypto mode and data unit size is supported. + * Specifically, the i'th bit of modes_supported[crypto_mode] is set if + * crypto_mode can be used with a data unit size of (1 << i). Note that + * only data unit sizes that are powers of 2 can be supported. + */ + unsigned int modes_supported[BLK_ENCRYPTION_MODE_MAX]; + + /** + * @dev: An optional device for runtime power management. If the driver + * provides this device, it will be runtime-resumed before any function + * in @ll_ops is called and will remain resumed during the call. + */ + struct device *dev; + + /* private: The following fields shouldn't be accessed by drivers. */ + + /* Number of keyslots, or 0 if not applicable */ + unsigned int num_slots; + + /* + * Serializes all calls to functions in @ll_ops as well as all changes + * to @slot_hashtable. This can also be taken in read mode to look up + * keyslots while ensuring that they can't be changed concurrently. + */ + struct rw_semaphore lock; + + /* List of idle slots, with least recently used slot at front */ + wait_queue_head_t idle_slots_wait_queue; + struct list_head idle_slots; + spinlock_t idle_slots_lock; + + /* + * Hash table which maps struct *blk_crypto_key to keyslots, so that we + * can find a key's keyslot in O(1) time rather than O(num_slots). + * Protected by 'lock'. + */ + struct hlist_head *slot_hashtable; + unsigned int log_slot_ht_size; + + /* Per-keyslot data */ + struct blk_crypto_keyslot *slots; +}; + +int blk_crypto_profile_init(struct blk_crypto_profile *profile, + unsigned int num_slots); + +int devm_blk_crypto_profile_init(struct device *dev, + struct blk_crypto_profile *profile, + unsigned int num_slots); + +unsigned int blk_crypto_keyslot_index(struct blk_crypto_keyslot *slot); + +blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key, + struct blk_crypto_keyslot **slot_ptr); + +void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot); + +bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile, + const struct blk_crypto_config *cfg); + +int __blk_crypto_evict_key(struct blk_crypto_profile *profile, + const struct blk_crypto_key *key); + +void blk_crypto_reprogram_all_keys(struct blk_crypto_profile *profile); + +void blk_crypto_profile_destroy(struct blk_crypto_profile *profile); + +void blk_crypto_intersect_capabilities(struct blk_crypto_profile *parent, + const struct blk_crypto_profile *child); + +bool blk_crypto_has_capabilities(const struct blk_crypto_profile *target, + const struct blk_crypto_profile *reference); + +void blk_crypto_update_capabilities(struct blk_crypto_profile *dst, + const struct blk_crypto_profile *src); + +#endif /* __LINUX_BLK_CRYPTO_PROFILE_H */ diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 656fe34bdb6c..b4039fdf1b04 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -7,6 +7,7 @@ #include <linux/srcu.h> #include <linux/lockdep.h> #include <linux/scatterlist.h> +#include <linux/prefetch.h> struct blk_mq_tags; struct blk_flush_queue; @@ -132,7 +133,7 @@ struct request { #ifdef CONFIG_BLK_INLINE_ENCRYPTION struct bio_crypt_ctx *crypt_ctx; - struct blk_ksm_keyslot *crypt_keyslot; + struct blk_crypto_keyslot *crypt_keyslot; #endif unsigned short write_hint; @@ -655,8 +656,6 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set, unsigned int set_flags); void blk_mq_free_tag_set(struct blk_mq_tag_set *set); -void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule); - void blk_mq_free_request(struct request *rq); bool blk_mq_queue_inflight(struct request_queue *q); @@ -675,7 +674,40 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, struct request *blk_mq_alloc_request_hctx(struct request_queue *q, unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx); -struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag); + +/* + * Tag address space map. + */ +struct blk_mq_tags { + unsigned int nr_tags; + unsigned int nr_reserved_tags; + + atomic_t active_queues; + + struct sbitmap_queue bitmap_tags; + struct sbitmap_queue breserved_tags; + + struct request **rqs; + struct request **static_rqs; + struct list_head page_list; + + /* + * used to clear request reference in rqs[] before freeing one + * request pool + */ + spinlock_t lock; +}; + +static inline struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, + unsigned int tag) +{ + if (tag < tags->nr_tags) { + prefetch(tags->rqs[tag]); + return tags->rqs[tag]; + } + + return NULL; +} enum { BLK_MQ_UNIQUE_TAG_BITS = 16, diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index abe721591e80..ed5be661dd0c 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -30,7 +30,7 @@ struct pr_ops; struct rq_qos; struct blk_queue_stats; struct blk_stat_callback; -struct blk_keyslot_manager; +struct blk_crypto_profile; /* Must be consistent with blk_mq_poll_stats_bkt() */ #define BLK_MQ_POLL_STATS_BKTS 16 @@ -44,6 +44,14 @@ struct blk_keyslot_manager; */ #define BLKCG_MAX_POLS 6 +static inline int blk_validate_block_size(unsigned int bsize) +{ + if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize)) + return -EINVAL; + + return 0; +} + static inline bool blk_op_is_passthrough(unsigned int op) { op &= REQ_OP_MASK; @@ -150,6 +158,34 @@ static inline int blkdev_zone_mgmt_ioctl(struct block_device *bdev, #endif /* CONFIG_BLK_DEV_ZONED */ +/* + * Independent access ranges: struct blk_independent_access_range describes + * a range of contiguous sectors that can be accessed using device command + * execution resources that are independent from the resources used for + * other access ranges. This is typically found with single-LUN multi-actuator + * HDDs where each access range is served by a different set of heads. + * The set of independent ranges supported by the device is defined using + * struct blk_independent_access_ranges. The independent ranges must not overlap + * and must include all sectors within the disk capacity (no sector holes + * allowed). + * For a device with multiple ranges, requests targeting sectors in different + * ranges can be executed in parallel. A request can straddle an access range + * boundary. + */ +struct blk_independent_access_range { + struct kobject kobj; + struct request_queue *queue; + sector_t sector; + sector_t nr_sectors; +}; + +struct blk_independent_access_ranges { + struct kobject kobj; + bool sysfs_registered; + unsigned int nr_ia_ranges; + struct blk_independent_access_range ia_range[]; +}; + struct request_queue { struct request *last_merge; struct elevator_queue *elevator; @@ -224,8 +260,7 @@ struct request_queue { unsigned int dma_alignment; #ifdef CONFIG_BLK_INLINE_ENCRYPTION - /* Inline crypto capabilities */ - struct blk_keyslot_manager *ksm; + struct blk_crypto_profile *crypto_profile; #endif unsigned int rq_timeout; @@ -315,6 +350,8 @@ struct request_queue { */ struct mutex mq_freeze_lock; + int quiesce_depth; + struct blk_mq_tag_set *tag_set; struct list_head tag_set_list; struct bio_set bio_split; @@ -330,6 +367,12 @@ struct request_queue { #define BLK_MAX_WRITE_HINTS 5 u64 write_hints[BLK_MAX_WRITE_HINTS]; + + /* + * Independent sector access ranges. This is always NULL for + * devices that do not have multiple independent access ranges. + */ + struct blk_independent_access_ranges *ia_ranges; }; /* Keep blk_queue_flag_name[] in sync with the definitions below */ @@ -680,6 +723,11 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int); extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua); +struct blk_independent_access_ranges * +disk_alloc_independent_access_ranges(struct gendisk *disk, int nr_ia_ranges); +void disk_set_independent_access_ranges(struct gendisk *disk, + struct blk_independent_access_ranges *iars); + /* * Elevator features for blk_queue_required_elevator_features: */ @@ -706,12 +754,11 @@ extern void blk_set_queue_dying(struct request_queue *); * as the lock contention for request_queue lock is reduced. * * It is ok not to disable preemption when adding the request to the plug list - * or when attempting a merge, because blk_schedule_flush_list() will only flush - * the plug list when the task sleeps by itself. For details, please see - * schedule() where blk_schedule_flush_plug() is called. + * or when attempting a merge. For details, please see schedule() where + * blk_flush_plug() is called. */ struct blk_plug { - struct list_head mq_list; /* blk-mq requests */ + struct request *mq_list; /* blk-mq requests */ /* if ios_left is > 1, we can batch tag/rq allocations */ struct request *cached_rq; @@ -720,6 +767,7 @@ struct blk_plug { unsigned short rq_count; bool multiple_queues; + bool has_elevator; bool nowait; struct list_head cb_list; /* md requires an unplug callback */ @@ -737,31 +785,15 @@ extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, extern void blk_start_plug(struct blk_plug *); extern void blk_start_plug_nr_ios(struct blk_plug *, unsigned short); extern void blk_finish_plug(struct blk_plug *); -extern void blk_flush_plug_list(struct blk_plug *, bool); - -static inline void blk_flush_plug(struct task_struct *tsk) -{ - struct blk_plug *plug = tsk->plug; - - if (plug) - blk_flush_plug_list(plug, false); -} - -static inline void blk_schedule_flush_plug(struct task_struct *tsk) -{ - struct blk_plug *plug = tsk->plug; - if (plug) - blk_flush_plug_list(plug, true); -} +void blk_flush_plug(struct blk_plug *plug, bool from_schedule); static inline bool blk_needs_flush_plug(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; return plug && - (!list_empty(&plug->mq_list) || - !list_empty(&plug->cb_list)); + (plug->mq_list || !list_empty(&plug->cb_list)); } int blkdev_issue_flush(struct block_device *bdev); @@ -783,15 +815,10 @@ static inline void blk_finish_plug(struct blk_plug *plug) { } -static inline void blk_flush_plug(struct task_struct *task) +static inline void blk_flush_plug(struct blk_plug *plug, bool async) { } -static inline void blk_schedule_flush_plug(struct task_struct *task) -{ -} - - static inline bool blk_needs_flush_plug(struct task_struct *tsk) { return false; @@ -1144,19 +1171,20 @@ int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned lo #ifdef CONFIG_BLK_INLINE_ENCRYPTION -bool blk_ksm_register(struct blk_keyslot_manager *ksm, struct request_queue *q); +bool blk_crypto_register(struct blk_crypto_profile *profile, + struct request_queue *q); -void blk_ksm_unregister(struct request_queue *q); +void blk_crypto_unregister(struct request_queue *q); #else /* CONFIG_BLK_INLINE_ENCRYPTION */ -static inline bool blk_ksm_register(struct blk_keyslot_manager *ksm, - struct request_queue *q) +static inline bool blk_crypto_register(struct blk_crypto_profile *profile, + struct request_queue *q) { return true; } -static inline void blk_ksm_unregister(struct request_queue *q) { } +static inline void blk_crypto_unregister(struct request_queue *q) { } #endif /* CONFIG_BLK_INLINE_ENCRYPTION */ diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 020a7d5bf470..3db6f6c95489 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -929,8 +929,11 @@ struct bpf_array_aux { * stored in the map to make sure that all callers and callees have * the same prog type and JITed flag. */ - enum bpf_prog_type type; - bool jited; + struct { + spinlock_t lock; + enum bpf_prog_type type; + bool jited; + } owner; /* Programs with direct jumps into programs part of this array. */ struct list_head poke_progs; struct bpf_map *map; diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 9c81724e4b98..bbe1eefa4c8a 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -101,14 +101,14 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_trace_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops) -#ifdef CONFIG_NET -BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops) -BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, dev_map_hash_ops) -BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops) #ifdef CONFIG_BPF_LSM BPF_MAP_TYPE(BPF_MAP_TYPE_INODE_STORAGE, inode_storage_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_TASK_STORAGE, task_storage_map_ops) +#ifdef CONFIG_NET +BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, dev_map_hash_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops) #if defined(CONFIG_XDP_SOCKETS) BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index 832d8a74fa59..991911048857 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -72,6 +72,8 @@ enum cpuhp_state { CPUHP_SLUB_DEAD, CPUHP_DEBUG_OBJ_DEAD, CPUHP_MM_WRITEBACK_DEAD, + /* Must be after CPUHP_MM_VMSTAT_DEAD */ + CPUHP_MM_DEMOTION_DEAD, CPUHP_MM_VMSTAT_DEAD, CPUHP_SOFTIRQ_DEAD, CPUHP_NET_MVNETA_DEAD, @@ -240,6 +242,8 @@ enum cpuhp_state { CPUHP_AP_BASE_CACHEINFO_ONLINE, CPUHP_AP_ONLINE_DYN, CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30, + /* Must be after CPUHP_AP_ONLINE_DYN for node_states[N_CPU] update */ + CPUHP_AP_MM_DEMOTION_ONLINE, CPUHP_AP_X86_HPET_ONLINE, CPUHP_AP_X86_KVM_CLK_ONLINE, CPUHP_AP_DTPM_CPU_ONLINE, diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h index 114553b487ef..a7df155ea49b 100644 --- a/include/linux/device-mapper.h +++ b/include/linux/device-mapper.h @@ -576,9 +576,9 @@ struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *t); /* - * Table keyslot manager functions + * Table blk_crypto_profile functions */ -void dm_destroy_keyslot_manager(struct blk_keyslot_manager *ksm); +void dm_destroy_crypto_profile(struct blk_crypto_profile *profile); /*----------------------------------------------------------------- * Macros. diff --git a/include/linux/elfcore.h b/include/linux/elfcore.h index 2aaa15779d50..957ebec35aad 100644 --- a/include/linux/elfcore.h +++ b/include/linux/elfcore.h @@ -109,7 +109,7 @@ static inline int elf_core_copy_task_fpregs(struct task_struct *t, struct pt_reg #endif } -#if defined(CONFIG_UM) || defined(CONFIG_IA64) +#if (defined(CONFIG_UML) && defined(CONFIG_X86_32)) || defined(CONFIG_IA64) /* * These functions parameterize elf_core_dump in fs/binfmt_elf.c to write out * extra segments containing the gate DSO contents. Dumping its diff --git a/include/linux/filter.h b/include/linux/filter.h index 4a93c12543ee..ef03ff34234d 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1051,6 +1051,7 @@ extern int bpf_jit_enable; extern int bpf_jit_harden; extern int bpf_jit_kallsyms; extern long bpf_jit_limit; +extern long bpf_jit_limit_max; typedef void (*bpf_jit_fill_hole_t)(void *area, unsigned int size); diff --git a/include/linux/flex_proportions.h b/include/linux/flex_proportions.h index c12df59d3f5f..3e378b1fb0bc 100644 --- a/include/linux/flex_proportions.h +++ b/include/linux/flex_proportions.h @@ -83,9 +83,10 @@ struct fprop_local_percpu { int fprop_local_init_percpu(struct fprop_local_percpu *pl, gfp_t gfp); void fprop_local_destroy_percpu(struct fprop_local_percpu *pl); -void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl); -void __fprop_inc_percpu_max(struct fprop_global *p, struct fprop_local_percpu *pl, - int max_frac); +void __fprop_add_percpu(struct fprop_global *p, struct fprop_local_percpu *pl, + long nr); +void __fprop_add_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, int max_frac, long nr); void fprop_fraction_percpu(struct fprop_global *p, struct fprop_local_percpu *pl, unsigned long *numerator, unsigned long *denominator); @@ -96,7 +97,7 @@ void fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) unsigned long flags; local_irq_save(flags); - __fprop_inc_percpu(p, pl); + __fprop_add_percpu(p, pl, 1); local_irq_restore(flags); } diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 900325aa5d31..59eabbc3a36b 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -213,6 +213,8 @@ static inline int add_disk(struct gendisk *disk) } extern void del_gendisk(struct gendisk *gp); +void invalidate_disk(struct gendisk *disk); + void set_disk_ro(struct gendisk *disk, bool read_only); static inline int get_disk_ro(struct gendisk *disk) @@ -221,6 +223,11 @@ static inline int get_disk_ro(struct gendisk *disk) test_bit(GD_READ_ONLY, &disk->state); } +static inline int bdev_read_only(struct block_device *bdev) +{ + return bdev->bd_read_only || get_disk_ro(bdev->bd_disk); +} + extern void disk_block_events(struct gendisk *disk); extern void disk_unblock_events(struct gendisk *disk); extern void disk_flush_events(struct gendisk *disk, unsigned int mask); diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 55b2ec1f965a..3745efd21cf6 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -520,15 +520,11 @@ static inline void arch_free_page(struct page *page, int order) { } #ifndef HAVE_ARCH_ALLOC_PAGE static inline void arch_alloc_page(struct page *page, int order) { } #endif -#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE -static inline int arch_make_page_accessible(struct page *page) -{ - return 0; -} -#endif struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask); +struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid, + nodemask_t *nodemask); unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, nodemask_t *nodemask, int nr_pages, @@ -570,6 +566,15 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) return __alloc_pages(gfp_mask, order, nid, NULL); } +static inline +struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid) +{ + VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); + VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid)); + + return __folio_alloc(gfp, order, nid, NULL); +} + /* * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE, * prefer the current CPU's closest node. Otherwise node must be valid and @@ -586,6 +591,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, #ifdef CONFIG_NUMA struct page *alloc_pages(gfp_t gfp, unsigned int order); +struct folio *folio_alloc(gfp_t gfp, unsigned order); extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order, struct vm_area_struct *vma, unsigned long addr, int node, bool hugepage); @@ -596,6 +602,10 @@ static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order) { return alloc_pages_node(numa_node_id(), gfp_mask, order); } +static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order) +{ + return __folio_alloc_node(gfp, order, numa_node_id()); +} #define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\ alloc_pages(gfp_mask, order) #define alloc_hugepage_vma(gfp_mask, vma, addr, order) \ diff --git a/include/linux/highmem-internal.h b/include/linux/highmem-internal.h index 4aa1031d3e4c..0a0b2b09b1b8 100644 --- a/include/linux/highmem-internal.h +++ b/include/linux/highmem-internal.h @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page) return __kmap_local_page_prot(page, kmap_prot); } +static inline void *kmap_local_folio(struct folio *folio, size_t offset) +{ + struct page *page = folio_page(folio, offset / PAGE_SIZE); + return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE; +} + static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot) { return __kmap_local_page_prot(page, prot); @@ -171,6 +177,11 @@ static inline void *kmap_local_page(struct page *page) return page_address(page); } +static inline void *kmap_local_folio(struct folio *folio, size_t offset) +{ + return page_address(&folio->page) + offset; +} + static inline void *kmap_local_page_prot(struct page *page, pgprot_t prot) { return kmap_local_page(page); diff --git a/include/linux/highmem.h b/include/linux/highmem.h index b4c49f9cc379..27cdd715c5f9 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -97,6 +97,43 @@ static inline void kmap_flush_unused(void); static inline void *kmap_local_page(struct page *page); /** + * kmap_local_folio - Map a page in this folio for temporary usage + * @folio: The folio containing the page. + * @offset: The byte offset within the folio which identifies the page. + * + * Requires careful handling when nesting multiple mappings because the map + * management is stack based. The unmap has to be in the reverse order of + * the map operation:: + * + * addr1 = kmap_local_folio(folio1, offset1); + * addr2 = kmap_local_folio(folio2, offset2); + * ... + * kunmap_local(addr2); + * kunmap_local(addr1); + * + * Unmapping addr1 before addr2 is invalid and causes malfunction. + * + * Contrary to kmap() mappings the mapping is only valid in the context of + * the caller and cannot be handed to other contexts. + * + * On CONFIG_HIGHMEM=n kernels and for low memory pages this returns the + * virtual address of the direct mapping. Only real highmem pages are + * temporarily mapped. + * + * While it is significantly faster than kmap() for the higmem case it + * comes with restrictions about the pointer validity. Only use when really + * necessary. + * + * On HIGHMEM enabled systems mapping a highmem page has the side effect of + * disabling migration in order to keep the virtual address stable across + * preemption. No caller of kmap_local_folio() can rely on this side effect. + * + * Context: Can be invoked from any context. + * Return: The virtual address of @offset. + */ +static inline void *kmap_local_folio(struct folio *folio, size_t offset); + +/** * kmap_atomic - Atomically map a page for temporary usage - Deprecated! * @page: Pointer to the page to be mapped * diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f123e15d966e..f280f33ff223 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -251,15 +251,6 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, } /** - * thp_head - Head page of a transparent huge page. - * @page: Any page (tail, head or regular) found in the page cache. - */ -static inline struct page *thp_head(struct page *page) -{ - return compound_head(page); -} - -/** * thp_order - Order of a transparent huge page. * @page: Head page of a transparent huge page. */ @@ -336,12 +327,6 @@ static inline struct list_head *page_deferred_list(struct page *page) #define HPAGE_PUD_MASK ({ BUILD_BUG(); 0; }) #define HPAGE_PUD_SIZE ({ BUILD_BUG(); 0; }) -static inline struct page *thp_head(struct page *page) -{ - VM_BUG_ON_PGFLAGS(PageTail(page), page); - return page; -} - static inline unsigned int thp_order(struct page *page) { VM_BUG_ON_PGFLAGS(PageTail(page), page); diff --git a/include/linux/keyslot-manager.h b/include/linux/keyslot-manager.h deleted file mode 100644 index a27605e2f826..000000000000 --- a/include/linux/keyslot-manager.h +++ /dev/null @@ -1,120 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * Copyright 2019 Google LLC - */ - -#ifndef __LINUX_KEYSLOT_MANAGER_H -#define __LINUX_KEYSLOT_MANAGER_H - -#include <linux/bio.h> -#include <linux/blk-crypto.h> - -struct blk_keyslot_manager; - -/** - * struct blk_ksm_ll_ops - functions to manage keyslots in hardware - * @keyslot_program: Program the specified key into the specified slot in the - * inline encryption hardware. - * @keyslot_evict: Evict key from the specified keyslot in the hardware. - * The key is provided so that e.g. dm layers can evict - * keys from the devices that they map over. - * Returns 0 on success, -errno otherwise. - * - * This structure should be provided by storage device drivers when they set up - * a keyslot manager - this structure holds the function ptrs that the keyslot - * manager will use to manipulate keyslots in the hardware. - */ -struct blk_ksm_ll_ops { - int (*keyslot_program)(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key, - unsigned int slot); - int (*keyslot_evict)(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key, - unsigned int slot); -}; - -struct blk_keyslot_manager { - /* - * The struct blk_ksm_ll_ops that this keyslot manager will use - * to perform operations like programming and evicting keys on the - * device - */ - struct blk_ksm_ll_ops ksm_ll_ops; - - /* - * The maximum number of bytes supported for specifying the data unit - * number. - */ - unsigned int max_dun_bytes_supported; - - /* - * Array of size BLK_ENCRYPTION_MODE_MAX of bitmasks that represents - * whether a crypto mode and data unit size are supported. The i'th - * bit of crypto_mode_supported[crypto_mode] is set iff a data unit - * size of (1 << i) is supported. We only support data unit sizes - * that are powers of 2. - */ - unsigned int crypto_modes_supported[BLK_ENCRYPTION_MODE_MAX]; - - /* Device for runtime power management (NULL if none) */ - struct device *dev; - - /* Here onwards are *private* fields for internal keyslot manager use */ - - unsigned int num_slots; - - /* Protects programming and evicting keys from the device */ - struct rw_semaphore lock; - - /* List of idle slots, with least recently used slot at front */ - wait_queue_head_t idle_slots_wait_queue; - struct list_head idle_slots; - spinlock_t idle_slots_lock; - - /* - * Hash table which maps struct *blk_crypto_key to keyslots, so that we - * can find a key's keyslot in O(1) time rather than O(num_slots). - * Protected by 'lock'. - */ - struct hlist_head *slot_hashtable; - unsigned int log_slot_ht_size; - - /* Per-keyslot data */ - struct blk_ksm_keyslot *slots; -}; - -int blk_ksm_init(struct blk_keyslot_manager *ksm, unsigned int num_slots); - -int devm_blk_ksm_init(struct device *dev, struct blk_keyslot_manager *ksm, - unsigned int num_slots); - -blk_status_t blk_ksm_get_slot_for_key(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key, - struct blk_ksm_keyslot **slot_ptr); - -unsigned int blk_ksm_get_slot_idx(struct blk_ksm_keyslot *slot); - -void blk_ksm_put_slot(struct blk_ksm_keyslot *slot); - -bool blk_ksm_crypto_cfg_supported(struct blk_keyslot_manager *ksm, - const struct blk_crypto_config *cfg); - -int blk_ksm_evict_key(struct blk_keyslot_manager *ksm, - const struct blk_crypto_key *key); - -void blk_ksm_reprogram_all_keys(struct blk_keyslot_manager *ksm); - -void blk_ksm_destroy(struct blk_keyslot_manager *ksm); - -void blk_ksm_intersect_modes(struct blk_keyslot_manager *parent, - const struct blk_keyslot_manager *child); - -void blk_ksm_init_passthrough(struct blk_keyslot_manager *ksm); - -bool blk_ksm_is_superset(struct blk_keyslot_manager *ksm_superset, - struct blk_keyslot_manager *ksm_subset); - -void blk_ksm_update_capabilities(struct blk_keyslot_manager *target_ksm, - struct blk_keyslot_manager *reference_ksm); - -#endif /* __LINUX_KEYSLOT_MANAGER_H */ diff --git a/include/linux/ksm.h b/include/linux/ksm.h index 161e8164abcf..a38a5bca1ba5 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -52,7 +52,7 @@ struct page *ksm_might_need_to_copy(struct page *page, struct vm_area_struct *vma, unsigned long address); void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc); -void ksm_migrate_page(struct page *newpage, struct page *oldpage); +void folio_migrate_ksm(struct folio *newfolio, struct folio *folio); #else /* !CONFIG_KSM */ @@ -83,7 +83,7 @@ static inline void rmap_walk_ksm(struct page *page, { } -static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage) +static inline void folio_migrate_ksm(struct folio *newfolio, struct folio *old) { } #endif /* CONFIG_MMU */ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 3096c9a0ee01..e34bf0cbdf55 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -369,7 +369,7 @@ enum page_memcg_data_flags { #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1) -static inline bool PageMemcgKmem(struct page *page); +static inline bool folio_memcg_kmem(struct folio *folio); /* * After the initialization objcg->memcg is always pointing at @@ -384,89 +384,95 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg) } /* - * __page_memcg - get the memory cgroup associated with a non-kmem page - * @page: a pointer to the page struct + * __folio_memcg - Get the memory cgroup associated with a non-kmem folio + * @folio: Pointer to the folio. * - * Returns a pointer to the memory cgroup associated with the page, - * or NULL. This function assumes that the page is known to have a + * Returns a pointer to the memory cgroup associated with the folio, + * or NULL. This function assumes that the folio is known to have a * proper memory cgroup pointer. It's not safe to call this function - * against some type of pages, e.g. slab pages or ex-slab pages or - * kmem pages. + * against some type of folios, e.g. slab folios or ex-slab folios or + * kmem folios. */ -static inline struct mem_cgroup *__page_memcg(struct page *page) +static inline struct mem_cgroup *__folio_memcg(struct folio *folio) { - unsigned long memcg_data = page->memcg_data; + unsigned long memcg_data = folio->memcg_data; - VM_BUG_ON_PAGE(PageSlab(page), page); - VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page); - VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page); + VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio); + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio); return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); } /* - * __page_objcg - get the object cgroup associated with a kmem page - * @page: a pointer to the page struct + * __folio_objcg - get the object cgroup associated with a kmem folio. + * @folio: Pointer to the folio. * - * Returns a pointer to the object cgroup associated with the page, - * or NULL. This function assumes that the page is known to have a + * Returns a pointer to the object cgroup associated with the folio, + * or NULL. This function assumes that the folio is known to have a * proper object cgroup pointer. It's not safe to call this function - * against some type of pages, e.g. slab pages or ex-slab pages or - * LRU pages. + * against some type of folios, e.g. slab folios or ex-slab folios or + * LRU folios. */ -static inline struct obj_cgroup *__page_objcg(struct page *page) +static inline struct obj_cgroup *__folio_objcg(struct folio *folio) { - unsigned long memcg_data = page->memcg_data; + unsigned long memcg_data = folio->memcg_data; - VM_BUG_ON_PAGE(PageSlab(page), page); - VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page); - VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page); + VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); + VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJCGS, folio); + VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio); return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK); } /* - * page_memcg - get the memory cgroup associated with a page - * @page: a pointer to the page struct + * folio_memcg - Get the memory cgroup associated with a folio. + * @folio: Pointer to the folio. * - * Returns a pointer to the memory cgroup associated with the page, - * or NULL. This function assumes that the page is known to have a + * Returns a pointer to the memory cgroup associated with the folio, + * or NULL. This function assumes that the folio is known to have a * proper memory cgroup pointer. It's not safe to call this function - * against some type of pages, e.g. slab pages or ex-slab pages. + * against some type of folios, e.g. slab folios or ex-slab folios. * - * For a non-kmem page any of the following ensures page and memcg binding + * For a non-kmem folio any of the following ensures folio and memcg binding * stability: * - * - the page lock + * - the folio lock * - LRU isolation * - lock_page_memcg() * - exclusive reference * - * For a kmem page a caller should hold an rcu read lock to protect memcg - * associated with a kmem page from being released. + * For a kmem folio a caller should hold an rcu read lock to protect memcg + * associated with a kmem folio from being released. */ +static inline struct mem_cgroup *folio_memcg(struct folio *folio) +{ + if (folio_memcg_kmem(folio)) + return obj_cgroup_memcg(__folio_objcg(folio)); + return __folio_memcg(folio); +} + static inline struct mem_cgroup *page_memcg(struct page *page) { - if (PageMemcgKmem(page)) - return obj_cgroup_memcg(__page_objcg(page)); - else - return __page_memcg(page); + return folio_memcg(page_folio(page)); } -/* - * page_memcg_rcu - locklessly get the memory cgroup associated with a page - * @page: a pointer to the page struct +/** + * folio_memcg_rcu - Locklessly get the memory cgroup associated with a folio. + * @folio: Pointer to the folio. * - * Returns a pointer to the memory cgroup associated with the page, - * or NULL. This function assumes that the page is known to have a + * This function assumes that the folio is known to have a * proper memory cgroup pointer. It's not safe to call this function - * against some type of pages, e.g. slab pages or ex-slab pages. + * against some type of folios, e.g. slab folios or ex-slab folios. + * + * Return: A pointer to the memory cgroup associated with the folio, + * or NULL. */ -static inline struct mem_cgroup *page_memcg_rcu(struct page *page) +static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) { - unsigned long memcg_data = READ_ONCE(page->memcg_data); + unsigned long memcg_data = READ_ONCE(folio->memcg_data); - VM_BUG_ON_PAGE(PageSlab(page), page); + VM_BUG_ON_FOLIO(folio_test_slab(folio), folio); WARN_ON_ONCE(!rcu_read_lock_held()); if (memcg_data & MEMCG_DATA_KMEM) { @@ -523,17 +529,18 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page) #ifdef CONFIG_MEMCG_KMEM /* - * PageMemcgKmem - check if the page has MemcgKmem flag set - * @page: a pointer to the page struct + * folio_memcg_kmem - Check if the folio has the memcg_kmem flag set. + * @folio: Pointer to the folio. * - * Checks if the page has MemcgKmem flag set. The caller must ensure that - * the page has an associated memory cgroup. It's not safe to call this function - * against some types of pages, e.g. slab pages. + * Checks if the folio has MemcgKmem flag set. The caller must ensure + * that the folio has an associated memory cgroup. It's not safe to call + * this function against some types of folios, e.g. slab folios. */ -static inline bool PageMemcgKmem(struct page *page) +static inline bool folio_memcg_kmem(struct folio *folio) { - VM_BUG_ON_PAGE(page->memcg_data & MEMCG_DATA_OBJCGS, page); - return page->memcg_data & MEMCG_DATA_KMEM; + VM_BUG_ON_PGFLAGS(PageTail(&folio->page), &folio->page); + VM_BUG_ON_FOLIO(folio->memcg_data & MEMCG_DATA_OBJCGS, folio); + return folio->memcg_data & MEMCG_DATA_KMEM; } /* @@ -577,7 +584,7 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page) } #else -static inline bool PageMemcgKmem(struct page *page) +static inline bool folio_memcg_kmem(struct folio *folio) { return false; } @@ -593,6 +600,11 @@ static inline struct obj_cgroup **page_objcgs_check(struct page *page) } #endif +static inline bool PageMemcgKmem(struct page *page) +{ + return folio_memcg_kmem(page_folio(page)); +} + static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) { return (memcg == root_mem_cgroup); @@ -684,26 +696,47 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg) page_counter_read(&memcg->memory); } -int __mem_cgroup_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask); -static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) +int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp); + +/** + * mem_cgroup_charge - Charge a newly allocated folio to a cgroup. + * @folio: Folio to charge. + * @mm: mm context of the allocating task. + * @gfp: Reclaim mode. + * + * Try to charge @folio to the memcg that @mm belongs to, reclaiming + * pages according to @gfp if necessary. If @mm is NULL, try to + * charge to the active memcg. + * + * Do not use this for folios allocated for swapin. + * + * Return: 0 on success. Otherwise, an error code is returned. + */ +static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, + gfp_t gfp) { if (mem_cgroup_disabled()) return 0; - return __mem_cgroup_charge(page, mm, gfp_mask); + return __mem_cgroup_charge(folio, mm, gfp); } int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry); void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry); -void __mem_cgroup_uncharge(struct page *page); -static inline void mem_cgroup_uncharge(struct page *page) +void __mem_cgroup_uncharge(struct folio *folio); + +/** + * mem_cgroup_uncharge - Uncharge a folio. + * @folio: Folio to uncharge. + * + * Uncharge a folio previously charged with mem_cgroup_charge(). + */ +static inline void mem_cgroup_uncharge(struct folio *folio) { if (mem_cgroup_disabled()) return; - __mem_cgroup_uncharge(page); + __mem_cgroup_uncharge(folio); } void __mem_cgroup_uncharge_list(struct list_head *page_list); @@ -714,7 +747,7 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list) __mem_cgroup_uncharge_list(page_list); } -void mem_cgroup_migrate(struct page *oldpage, struct page *newpage); +void mem_cgroup_migrate(struct folio *old, struct folio *new); /** * mem_cgroup_lruvec - get the lru list vector for a memcg & node @@ -753,33 +786,33 @@ out: } /** - * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page - * @page: the page + * folio_lruvec - return lruvec for isolating/putting an LRU folio + * @folio: Pointer to the folio. * - * This function relies on page->mem_cgroup being stable. + * This function relies on folio->mem_cgroup being stable. */ -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) +static inline struct lruvec *folio_lruvec(struct folio *folio) { - pg_data_t *pgdat = page_pgdat(page); - struct mem_cgroup *memcg = page_memcg(page); + struct mem_cgroup *memcg = folio_memcg(folio); - VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page); - return mem_cgroup_lruvec(memcg, pgdat); + VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled(), folio); + return mem_cgroup_lruvec(memcg, folio_pgdat(folio)); } struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p); struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm); -struct lruvec *lock_page_lruvec(struct page *page); -struct lruvec *lock_page_lruvec_irq(struct page *page); -struct lruvec *lock_page_lruvec_irqsave(struct page *page, +struct lruvec *folio_lruvec_lock(struct folio *folio); +struct lruvec *folio_lruvec_lock_irq(struct folio *folio); +struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags); #ifdef CONFIG_DEBUG_VM -void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page); +void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio); #else -static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +static inline +void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) { } #endif @@ -947,6 +980,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg); extern bool cgroup_memory_noswap; #endif +void folio_memcg_lock(struct folio *folio); +void folio_memcg_unlock(struct folio *folio); void lock_page_memcg(struct page *page); void unlock_page_memcg(struct page *page); @@ -1115,12 +1150,17 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, #define MEM_CGROUP_ID_SHIFT 0 #define MEM_CGROUP_ID_MAX 0 +static inline struct mem_cgroup *folio_memcg(struct folio *folio) +{ + return NULL; +} + static inline struct mem_cgroup *page_memcg(struct page *page) { return NULL; } -static inline struct mem_cgroup *page_memcg_rcu(struct page *page) +static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio) { WARN_ON_ONCE(!rcu_read_lock_held()); return NULL; @@ -1131,6 +1171,11 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page) return NULL; } +static inline bool folio_memcg_kmem(struct folio *folio) +{ + return false; +} + static inline bool PageMemcgKmem(struct page *page) { return false; @@ -1179,8 +1224,8 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg) return false; } -static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) +static inline int mem_cgroup_charge(struct folio *folio, + struct mm_struct *mm, gfp_t gfp) { return 0; } @@ -1195,7 +1240,7 @@ static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry) { } -static inline void mem_cgroup_uncharge(struct page *page) +static inline void mem_cgroup_uncharge(struct folio *folio) { } @@ -1203,7 +1248,7 @@ static inline void mem_cgroup_uncharge_list(struct list_head *page_list) { } -static inline void mem_cgroup_migrate(struct page *old, struct page *new) +static inline void mem_cgroup_migrate(struct folio *old, struct folio *new) { } @@ -1213,14 +1258,14 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg, return &pgdat->__lruvec; } -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page) +static inline struct lruvec *folio_lruvec(struct folio *folio) { - pg_data_t *pgdat = page_pgdat(page); - + struct pglist_data *pgdat = folio_pgdat(folio); return &pgdat->__lruvec; } -static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +static inline +void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) { } @@ -1250,26 +1295,26 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg) { } -static inline struct lruvec *lock_page_lruvec(struct page *page) +static inline struct lruvec *folio_lruvec_lock(struct folio *folio) { - struct pglist_data *pgdat = page_pgdat(page); + struct pglist_data *pgdat = folio_pgdat(folio); spin_lock(&pgdat->__lruvec.lru_lock); return &pgdat->__lruvec; } -static inline struct lruvec *lock_page_lruvec_irq(struct page *page) +static inline struct lruvec *folio_lruvec_lock_irq(struct folio *folio) { - struct pglist_data *pgdat = page_pgdat(page); + struct pglist_data *pgdat = folio_pgdat(folio); spin_lock_irq(&pgdat->__lruvec.lru_lock); return &pgdat->__lruvec; } -static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page, +static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flagsp) { - struct pglist_data *pgdat = page_pgdat(page); + struct pglist_data *pgdat = folio_pgdat(folio); spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp); return &pgdat->__lruvec; @@ -1356,6 +1401,14 @@ static inline void unlock_page_memcg(struct page *page) { } +static inline void folio_memcg_lock(struct folio *folio) +{ +} + +static inline void folio_memcg_unlock(struct folio *folio) +{ +} + static inline void mem_cgroup_handle_over_high(void) { } @@ -1517,38 +1570,39 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, } /* Test requires a stable page->memcg binding, see page_memcg() */ -static inline bool page_matches_lruvec(struct page *page, struct lruvec *lruvec) +static inline bool folio_matches_lruvec(struct folio *folio, + struct lruvec *lruvec) { - return lruvec_pgdat(lruvec) == page_pgdat(page) && - lruvec_memcg(lruvec) == page_memcg(page); + return lruvec_pgdat(lruvec) == folio_pgdat(folio) && + lruvec_memcg(lruvec) == folio_memcg(folio); } /* Don't lock again iff page's lruvec locked */ -static inline struct lruvec *relock_page_lruvec_irq(struct page *page, +static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio, struct lruvec *locked_lruvec) { if (locked_lruvec) { - if (page_matches_lruvec(page, locked_lruvec)) + if (folio_matches_lruvec(folio, locked_lruvec)) return locked_lruvec; unlock_page_lruvec_irq(locked_lruvec); } - return lock_page_lruvec_irq(page); + return folio_lruvec_lock_irq(folio); } /* Don't lock again iff page's lruvec locked */ -static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page, +static inline struct lruvec *folio_lruvec_relock_irqsave(struct folio *folio, struct lruvec *locked_lruvec, unsigned long *flags) { if (locked_lruvec) { - if (page_matches_lruvec(page, locked_lruvec)) + if (folio_matches_lruvec(folio, locked_lruvec)) return locked_lruvec; unlock_page_lruvec_irqrestore(locked_lruvec, *flags); } - return lock_page_lruvec_irqsave(page, flags); + return folio_lruvec_lock_irqsave(folio, flags); } #ifdef CONFIG_CGROUP_WRITEBACK @@ -1558,17 +1612,17 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, unsigned long *pheadroom, unsigned long *pdirty, unsigned long *pwriteback); -void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, +void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio, struct bdi_writeback *wb); -static inline void mem_cgroup_track_foreign_dirty(struct page *page, +static inline void mem_cgroup_track_foreign_dirty(struct folio *folio, struct bdi_writeback *wb) { if (mem_cgroup_disabled()) return; - if (unlikely(&page_memcg(page)->css != wb->memcg_css)) - mem_cgroup_track_foreign_dirty_slowpath(page, wb); + if (unlikely(&folio_memcg(folio)->css != wb->memcg_css)) + mem_cgroup_track_foreign_dirty_slowpath(folio, wb); } void mem_cgroup_flush_foreign(struct bdi_writeback *wb); @@ -1588,7 +1642,7 @@ static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, { } -static inline void mem_cgroup_track_foreign_dirty(struct page *page, +static inline void mem_cgroup_track_foreign_dirty(struct folio *folio, struct bdi_writeback *wb) { } diff --git a/include/linux/memory.h b/include/linux/memory.h index 7efc0a7c14c9..182c606adb06 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -160,7 +160,10 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func, #define register_hotmemory_notifier(nb) register_memory_notifier(nb) #define unregister_hotmemory_notifier(nb) unregister_memory_notifier(nb) #else -#define hotplug_memory_notifier(fn, pri) ({ 0; }) +static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri) +{ + return 0; +} /* These aren't inline functions due to a GCC bug. */ #define register_hotmemory_notifier(nb) ({ (void)(nb); 0; }) #define unregister_hotmemory_notifier(nb) ({ (void)(nb); }) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index c8077e936691..0d2aeb9b0f66 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -57,6 +57,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, int extra_count); +void folio_migrate_flags(struct folio *newfolio, struct folio *folio); +void folio_migrate_copy(struct folio *newfolio, struct folio *folio); +int folio_migrate_mapping(struct address_space *mapping, + struct folio *newfolio, struct folio *folio, int extra_count); #else static inline void putback_movable_pages(struct list_head *l) {} diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index e23417424373..f17d2101af7a 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -1138,7 +1138,6 @@ int mlx5_cmd_create_vport_lag(struct mlx5_core_dev *dev); int mlx5_cmd_destroy_vport_lag(struct mlx5_core_dev *dev); bool mlx5_lag_is_roce(struct mlx5_core_dev *dev); bool mlx5_lag_is_sriov(struct mlx5_core_dev *dev); -bool mlx5_lag_is_multipath(struct mlx5_core_dev *dev); bool mlx5_lag_is_active(struct mlx5_core_dev *dev); bool mlx5_lag_is_master(struct mlx5_core_dev *dev); bool mlx5_lag_is_shared_fdb(struct mlx5_core_dev *dev); diff --git a/include/linux/mm.h b/include/linux/mm.h index 73a52aba448f..40ff114aaf9e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -36,10 +36,7 @@ struct mempolicy; struct anon_vma; struct anon_vma_chain; -struct file_ra_state; struct user_struct; -struct writeback_control; -struct bdi_writeback; struct pt_regs; extern int sysctl_page_lock_unfairness; @@ -216,13 +213,6 @@ int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *, loff_t *); int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *, loff_t *); -/* - * Any attempt to mark this function as static leads to build failure - * when CONFIG_DEBUG_INFO_BTF is enabled because __add_to_page_cache_locked() - * is referred to by BPF code. This must be visible for error injection. - */ -int __add_to_page_cache_locked(struct page *page, struct address_space *mapping, - pgoff_t index, gfp_t gfp, void **shadowp); #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) @@ -748,13 +738,18 @@ static inline int put_page_testzero(struct page *page) return page_ref_dec_and_test(page); } +static inline int folio_put_testzero(struct folio *folio) +{ + return put_page_testzero(&folio->page); +} + /* * Try to grab a ref unless the page has a refcount of zero, return false if * that is the case. * This can be called when MMU is off so it must not access * any of the virtual mappings. */ -static inline int get_page_unless_zero(struct page *page) +static inline bool get_page_unless_zero(struct page *page) { return page_ref_add_unless(page, 1, 0); } @@ -907,7 +902,7 @@ void __put_page(struct page *page); void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); -void copy_huge_page(struct page *dst, struct page *src); +void folio_copy(struct folio *dst, struct folio *src); /* * Compound pages have a destructor function. Provide a @@ -950,6 +945,20 @@ static inline unsigned int compound_order(struct page *page) return page[1].compound_order; } +/** + * folio_order - The allocation order of a folio. + * @folio: The folio. + * + * A folio is composed of 2^order pages. See get_order() for the definition + * of order. + * + * Return: The order of the folio. + */ +static inline unsigned int folio_order(struct folio *folio) +{ + return compound_order(&folio->page); +} + static inline bool hpage_pincount_available(struct page *page) { /* @@ -1131,6 +1140,11 @@ static inline enum zone_type page_zonenum(const struct page *page) return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK; } +static inline enum zone_type folio_zonenum(const struct folio *folio) +{ + return page_zonenum(&folio->page); +} + #ifdef CONFIG_ZONE_DEVICE static inline bool is_zone_device_page(const struct page *page) { @@ -1200,18 +1214,26 @@ static inline bool is_pci_p2pdma_page(const struct page *page) } /* 127: arbitrary random number, small enough to assemble well */ -#define page_ref_zero_or_close_to_overflow(page) \ - ((unsigned int) page_ref_count(page) + 127u <= 127u) +#define folio_ref_zero_or_close_to_overflow(folio) \ + ((unsigned int) folio_ref_count(folio) + 127u <= 127u) + +/** + * folio_get - Increment the reference count on a folio. + * @folio: The folio. + * + * Context: May be called in any context, as long as you know that + * you have a refcount on the folio. If you do not already have one, + * folio_try_get() may be the right interface for you to use. + */ +static inline void folio_get(struct folio *folio) +{ + VM_BUG_ON_FOLIO(folio_ref_zero_or_close_to_overflow(folio), folio); + folio_ref_inc(folio); +} static inline void get_page(struct page *page) { - page = compound_head(page); - /* - * Getting a normal page or the head of a compound page - * requires to already have an elevated page->_refcount. - */ - VM_BUG_ON_PAGE(page_ref_zero_or_close_to_overflow(page), page); - page_ref_inc(page); + folio_get(page_folio(page)); } bool __must_check try_grab_page(struct page *page, unsigned int flags); @@ -1228,9 +1250,28 @@ static inline __must_check bool try_get_page(struct page *page) return true; } +/** + * folio_put - Decrement the reference count on a folio. + * @folio: The folio. + * + * If the folio's reference count reaches zero, the memory will be + * released back to the page allocator and may be used by another + * allocation immediately. Do not access the memory or the struct folio + * after calling folio_put() unless you can be sure that it wasn't the + * last reference. + * + * Context: May be called in process or interrupt context, but not in NMI + * context. May be called while holding a spinlock. + */ +static inline void folio_put(struct folio *folio) +{ + if (folio_put_testzero(folio)) + __put_page(&folio->page); +} + static inline void put_page(struct page *page) { - page = compound_head(page); + struct folio *folio = page_folio(page); /* * For devmap managed pages we need to catch refcount transition from @@ -1238,13 +1279,12 @@ static inline void put_page(struct page *page) * need to inform the device driver through callback. See * include/linux/memremap.h and HMM for details. */ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); + if (page_is_devmap_managed(&folio->page)) { + put_devmap_managed_page(&folio->page); return; } - if (put_page_testzero(page)) - __put_page(page); + folio_put(folio); } /* @@ -1379,6 +1419,11 @@ static inline int page_to_nid(const struct page *page) } #endif +static inline int folio_nid(const struct folio *folio) +{ + return page_to_nid(&folio->page); +} + #ifdef CONFIG_NUMA_BALANCING static inline int cpu_pid_to_cpupid(int cpu, int pid) { @@ -1546,6 +1591,16 @@ static inline pg_data_t *page_pgdat(const struct page *page) return NODE_DATA(page_to_nid(page)); } +static inline struct zone *folio_zone(const struct folio *folio) +{ + return page_zone(&folio->page); +} + +static inline pg_data_t *folio_pgdat(const struct folio *folio) +{ + return page_pgdat(&folio->page); +} + #ifdef SECTION_IN_PAGE_FLAGS static inline void set_page_section(struct page *page, unsigned long section) { @@ -1559,6 +1614,20 @@ static inline unsigned long page_to_section(const struct page *page) } #endif +/** + * folio_pfn - Return the Page Frame Number of a folio. + * @folio: The folio. + * + * A folio may contain multiple pages. The pages have consecutive + * Page Frame Numbers. + * + * Return: The Page Frame Number of the first page in the folio. + */ +static inline unsigned long folio_pfn(struct folio *folio) +{ + return page_to_pfn(&folio->page); +} + /* MIGRATE_CMA and ZONE_MOVABLE do not allow pin pages */ #ifdef CONFIG_MIGRATION static inline bool is_pinnable_page(struct page *page) @@ -1595,6 +1664,89 @@ static inline void set_page_links(struct page *page, enum zone_type zone, #endif } +/** + * folio_nr_pages - The number of pages in the folio. + * @folio: The folio. + * + * Return: A positive power of two. + */ +static inline long folio_nr_pages(struct folio *folio) +{ + return compound_nr(&folio->page); +} + +/** + * folio_next - Move to the next physical folio. + * @folio: The folio we're currently operating on. + * + * If you have physically contiguous memory which may span more than + * one folio (eg a &struct bio_vec), use this function to move from one + * folio to the next. Do not use it if the memory is only virtually + * contiguous as the folios are almost certainly not adjacent to each + * other. This is the folio equivalent to writing ``page++``. + * + * Context: We assume that the folios are refcounted and/or locked at a + * higher level and do not adjust the reference counts. + * Return: The next struct folio. + */ +static inline struct folio *folio_next(struct folio *folio) +{ + return (struct folio *)folio_page(folio, folio_nr_pages(folio)); +} + +/** + * folio_shift - The size of the memory described by this folio. + * @folio: The folio. + * + * A folio represents a number of bytes which is a power-of-two in size. + * This function tells you which power-of-two the folio is. See also + * folio_size() and folio_order(). + * + * Context: The caller should have a reference on the folio to prevent + * it from being split. It is not necessary for the folio to be locked. + * Return: The base-2 logarithm of the size of this folio. + */ +static inline unsigned int folio_shift(struct folio *folio) +{ + return PAGE_SHIFT + folio_order(folio); +} + +/** + * folio_size - The number of bytes in a folio. + * @folio: The folio. + * + * Context: The caller should have a reference on the folio to prevent + * it from being split. It is not necessary for the folio to be locked. + * Return: The number of bytes in this folio. + */ +static inline size_t folio_size(struct folio *folio) +{ + return PAGE_SIZE << folio_order(folio); +} + +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE +static inline int arch_make_page_accessible(struct page *page) +{ + return 0; +} +#endif + +#ifndef HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE +static inline int arch_make_folio_accessible(struct folio *folio) +{ + int ret; + long i, nr = folio_nr_pages(folio); + + for (i = 0; i < nr; i++) { + ret = arch_make_page_accessible(folio_page(folio, i)); + if (ret) + break; + } + + return ret; +} +#endif + /* * Some inline functions in vmstat.h depend on page_zone() */ @@ -1635,19 +1787,6 @@ void page_address_init(void); extern void *page_rmapping(struct page *page); extern struct anon_vma *page_anon_vma(struct page *page); -extern struct address_space *page_mapping(struct page *page); - -extern struct address_space *__page_file_mapping(struct page *); - -static inline -struct address_space *page_file_mapping(struct page *page) -{ - if (unlikely(PageSwapCache(page))) - return __page_file_mapping(page); - - return page->mapping; -} - extern pgoff_t __page_file_index(struct page *page); /* @@ -1662,7 +1801,7 @@ static inline pgoff_t page_index(struct page *page) } bool page_mapped(struct page *page); -struct address_space *page_mapping(struct page *page); +bool folio_mapped(struct folio *folio); /* * Return true only if the page has been allocated with @@ -1700,6 +1839,7 @@ extern void pagefault_out_of_memory(void); #define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK) #define offset_in_thp(page, p) ((unsigned long)(p) & (thp_size(page) - 1)) +#define offset_in_folio(folio, p) ((unsigned long)(p) & (folio_size(folio) - 1)) /* * Flags passed to show_mem() and show_free_areas() to suppress output in @@ -1854,20 +1994,9 @@ extern int try_to_release_page(struct page * page, gfp_t gfp_mask); extern void do_invalidatepage(struct page *page, unsigned int offset, unsigned int length); -int redirty_page_for_writepage(struct writeback_control *wbc, - struct page *page); -void account_page_cleaned(struct page *page, struct address_space *mapping, - struct bdi_writeback *wb); -int set_page_dirty(struct page *page); +bool folio_mark_dirty(struct folio *folio); +bool set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); -void __cancel_dirty_page(struct page *page); -static inline void cancel_dirty_page(struct page *page) -{ - /* Avoid atomic ops, locking, etc. when not actually needed. */ - if (PageDirty(page)) - __cancel_dirty_page(page); -} -int clear_page_dirty_for_io(struct page *page); int get_cmdline(struct task_struct *task, char *buffer, int buflen); @@ -2659,10 +2788,6 @@ extern vm_fault_t filemap_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff); extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf); -/* mm/page-writeback.c */ -int __must_check write_one_page(struct page *page); -void task_dirty_inc(struct task_struct *tsk); - extern unsigned long stack_guard_gap; /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 355ea1ee32bd..e2ec68b0515c 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -6,27 +6,33 @@ #include <linux/swap.h> /** - * page_is_file_lru - should the page be on a file LRU or anon LRU? - * @page: the page to test - * - * Returns 1 if @page is a regular filesystem backed page cache page or a lazily - * freed anonymous page (e.g. via MADV_FREE). Returns 0 if @page is a normal - * anonymous page, a tmpfs page or otherwise ram or swap backed page. Used by - * functions that manipulate the LRU lists, to sort a page onto the right LRU - * list. + * folio_is_file_lru - Should the folio be on a file LRU or anon LRU? + * @folio: The folio to test. * * We would like to get this info without a page flag, but the state - * needs to survive until the page is last deleted from the LRU, which + * needs to survive until the folio is last deleted from the LRU, which * could be as far down as __page_cache_release. + * + * Return: An integer (not a boolean!) used to sort a folio onto the + * right LRU list and to account folios correctly. + * 1 if @folio is a regular filesystem backed page cache folio + * or a lazily freed anonymous folio (e.g. via MADV_FREE). + * 0 if @folio is a normal anonymous folio, a tmpfs folio or otherwise + * ram or swap backed folio. */ +static inline int folio_is_file_lru(struct folio *folio) +{ + return !folio_test_swapbacked(folio); +} + static inline int page_is_file_lru(struct page *page) { - return !PageSwapBacked(page); + return folio_is_file_lru(page_folio(page)); } static __always_inline void update_lru_size(struct lruvec *lruvec, enum lru_list lru, enum zone_type zid, - int nr_pages) + long nr_pages) { struct pglist_data *pgdat = lruvec_pgdat(lruvec); @@ -39,69 +45,94 @@ static __always_inline void update_lru_size(struct lruvec *lruvec, } /** - * __clear_page_lru_flags - clear page lru flags before releasing a page - * @page: the page that was on lru and now has a zero reference + * __folio_clear_lru_flags - Clear page lru flags before releasing a page. + * @folio: The folio that was on lru and now has a zero reference. */ -static __always_inline void __clear_page_lru_flags(struct page *page) +static __always_inline void __folio_clear_lru_flags(struct folio *folio) { - VM_BUG_ON_PAGE(!PageLRU(page), page); + VM_BUG_ON_FOLIO(!folio_test_lru(folio), folio); - __ClearPageLRU(page); + __folio_clear_lru(folio); /* this shouldn't happen, so leave the flags to bad_page() */ - if (PageActive(page) && PageUnevictable(page)) + if (folio_test_active(folio) && folio_test_unevictable(folio)) return; - __ClearPageActive(page); - __ClearPageUnevictable(page); + __folio_clear_active(folio); + __folio_clear_unevictable(folio); +} + +static __always_inline void __clear_page_lru_flags(struct page *page) +{ + __folio_clear_lru_flags(page_folio(page)); } /** - * page_lru - which LRU list should a page be on? - * @page: the page to test + * folio_lru_list - Which LRU list should a folio be on? + * @folio: The folio to test. * - * Returns the LRU list a page should be on, as an index + * Return: The LRU list a folio should be on, as an index * into the array of LRU lists. */ -static __always_inline enum lru_list page_lru(struct page *page) +static __always_inline enum lru_list folio_lru_list(struct folio *folio) { enum lru_list lru; - VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page); + VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio); - if (PageUnevictable(page)) + if (folio_test_unevictable(folio)) return LRU_UNEVICTABLE; - lru = page_is_file_lru(page) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON; - if (PageActive(page)) + lru = folio_is_file_lru(folio) ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON; + if (folio_test_active(folio)) lru += LRU_ACTIVE; return lru; } +static __always_inline +void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio) +{ + enum lru_list lru = folio_lru_list(folio); + + update_lru_size(lruvec, lru, folio_zonenum(folio), + folio_nr_pages(folio)); + list_add(&folio->lru, &lruvec->lists[lru]); +} + static __always_inline void add_page_to_lru_list(struct page *page, struct lruvec *lruvec) { - enum lru_list lru = page_lru(page); + lruvec_add_folio(lruvec, page_folio(page)); +} - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); - list_add(&page->lru, &lruvec->lists[lru]); +static __always_inline +void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio) +{ + enum lru_list lru = folio_lru_list(folio); + + update_lru_size(lruvec, lru, folio_zonenum(folio), + folio_nr_pages(folio)); + list_add_tail(&folio->lru, &lruvec->lists[lru]); } static __always_inline void add_page_to_lru_list_tail(struct page *page, struct lruvec *lruvec) { - enum lru_list lru = page_lru(page); + lruvec_add_folio_tail(lruvec, page_folio(page)); +} - update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page)); - list_add_tail(&page->lru, &lruvec->lists[lru]); +static __always_inline +void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio) +{ + list_del(&folio->lru); + update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio), + -folio_nr_pages(folio)); } static __always_inline void del_page_from_lru_list(struct page *page, struct lruvec *lruvec) { - list_del(&page->lru); - update_lru_size(lruvec, page_lru(page), page_zonenum(page), - -thp_nr_pages(page)); + lruvec_del_folio(lruvec, page_folio(page)); } #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 7f8ee09c711f..82dab23205c3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -239,6 +239,72 @@ struct page { #endif } _struct_page_alignment; +/** + * struct folio - Represents a contiguous set of bytes. + * @flags: Identical to the page flags. + * @lru: Least Recently Used list; tracks how recently this folio was used. + * @mapping: The file this page belongs to, or refers to the anon_vma for + * anonymous memory. + * @index: Offset within the file, in units of pages. For anonymous memory, + * this is the index from the beginning of the mmap. + * @private: Filesystem per-folio data (see folio_attach_private()). + * Used for swp_entry_t if folio_test_swapcache(). + * @_mapcount: Do not access this member directly. Use folio_mapcount() to + * find out how many times this folio is mapped by userspace. + * @_refcount: Do not access this member directly. Use folio_ref_count() + * to find how many references there are to this folio. + * @memcg_data: Memory Control Group data. + * + * A folio is a physically, virtually and logically contiguous set + * of bytes. It is a power-of-two in size, and it is aligned to that + * same power-of-two. It is at least as large as %PAGE_SIZE. If it is + * in the page cache, it is at a file offset which is a multiple of that + * power-of-two. It may be mapped into userspace at an address which is + * at an arbitrary page offset, but its kernel virtual address is aligned + * to its size. + */ +struct folio { + /* private: don't document the anon union */ + union { + struct { + /* public: */ + unsigned long flags; + struct list_head lru; + struct address_space *mapping; + pgoff_t index; + void *private; + atomic_t _mapcount; + atomic_t _refcount; +#ifdef CONFIG_MEMCG + unsigned long memcg_data; +#endif + /* private: the union with struct page is transitional */ + }; + struct page page; + }; +}; + +static_assert(sizeof(struct page) == sizeof(struct folio)); +#define FOLIO_MATCH(pg, fl) \ + static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl)) +FOLIO_MATCH(flags, flags); +FOLIO_MATCH(lru, lru); +FOLIO_MATCH(compound_head, lru); +FOLIO_MATCH(index, index); +FOLIO_MATCH(private, private); +FOLIO_MATCH(_mapcount, _mapcount); +FOLIO_MATCH(_refcount, _refcount); +#ifdef CONFIG_MEMCG +FOLIO_MATCH(memcg_data, memcg_data); +#endif +#undef FOLIO_MATCH + +static inline atomic_t *folio_mapcount_ptr(struct folio *folio) +{ + struct page *tail = &folio->page + 1; + return &tail->compound_mapcount; +} + static inline atomic_t *compound_mapcount_ptr(struct page *page) { return &page[1].compound_mapcount; @@ -257,6 +323,12 @@ static inline atomic_t *compound_pincount_ptr(struct page *page) #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) +/* + * page_private can be used on tail pages. However, PagePrivate is only + * checked by the VM on the head page. So page_private on the tail pages + * should be used for data that's ancillary to the head page (eg attaching + * buffer heads to tail pages after attaching buffer heads to the head page) + */ #define page_private(page) ((page)->private) static inline void set_page_private(struct page *page, unsigned long private) @@ -264,6 +336,11 @@ static inline void set_page_private(struct page *page, unsigned long private) page->private = private; } +static inline void *folio_get_private(struct folio *folio) +{ + return folio->private; +} + struct page_frag_cache { void * va; #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h index 0c0c9a0fdf57..52eae8c45b8d 100644 --- a/include/linux/mmc/host.h +++ b/include/linux/mmc/host.h @@ -15,7 +15,7 @@ #include <linux/mmc/card.h> #include <linux/mmc/pm.h> #include <linux/dma-direction.h> -#include <linux/keyslot-manager.h> +#include <linux/blk-crypto-profile.h> struct mmc_ios { unsigned int clock; /* clock rate */ @@ -492,7 +492,7 @@ struct mmc_host { /* Inline encryption support */ #ifdef CONFIG_MMC_CRYPTO - struct blk_keyslot_manager ksm; + struct blk_crypto_profile crypto_profile; #endif /* Host Software Queue support */ diff --git a/include/linux/mmdebug.h b/include/linux/mmdebug.h index 1935d4c72d10..d7285f8148a3 100644 --- a/include/linux/mmdebug.h +++ b/include/linux/mmdebug.h @@ -22,6 +22,13 @@ void dump_mm(const struct mm_struct *mm); BUG(); \ } \ } while (0) +#define VM_BUG_ON_FOLIO(cond, folio) \ + do { \ + if (unlikely(cond)) { \ + dump_page(&folio->page, "VM_BUG_ON_FOLIO(" __stringify(cond)")");\ + BUG(); \ + } \ + } while (0) #define VM_BUG_ON_VMA(cond, vma) \ do { \ if (unlikely(cond)) { \ @@ -47,6 +54,17 @@ void dump_mm(const struct mm_struct *mm); } \ unlikely(__ret_warn_once); \ }) +#define VM_WARN_ON_ONCE_FOLIO(cond, folio) ({ \ + static bool __section(".data.once") __warned; \ + int __ret_warn_once = !!(cond); \ + \ + if (unlikely(__ret_warn_once && !__warned)) { \ + dump_page(&folio->page, "VM_WARN_ON_ONCE_FOLIO(" __stringify(cond)")");\ + __warned = true; \ + WARN_ON(1); \ + } \ + unlikely(__ret_warn_once); \ +}) #define VM_WARN_ON(cond) (void)WARN_ON(cond) #define VM_WARN_ON_ONCE(cond) (void)WARN_ON_ONCE(cond) @@ -55,11 +73,13 @@ void dump_mm(const struct mm_struct *mm); #else #define VM_BUG_ON(cond) BUILD_BUG_ON_INVALID(cond) #define VM_BUG_ON_PAGE(cond, page) VM_BUG_ON(cond) +#define VM_BUG_ON_FOLIO(cond, folio) VM_BUG_ON(cond) #define VM_BUG_ON_VMA(cond, vma) VM_BUG_ON(cond) #define VM_BUG_ON_MM(cond, mm) VM_BUG_ON(cond) #define VM_WARN_ON(cond) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE(cond) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ON_ONCE_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond) +#define VM_WARN_ON_ONCE_FOLIO(cond, folio) BUILD_BUG_ON_INVALID(cond) #define VM_WARN_ONCE(cond, format...) BUILD_BUG_ON_INVALID(cond) #define VM_WARN(cond, format...) BUILD_BUG_ON_INVALID(cond) #endif diff --git a/include/linux/netfs.h b/include/linux/netfs.h index 5d6a4158a9a6..12c4177f7703 100644 --- a/include/linux/netfs.h +++ b/include/linux/netfs.h @@ -22,6 +22,7 @@ * Overload PG_private_2 to give us PG_fscache - this is used to indicate that * a page is currently backed by a local disk cache */ +#define folio_test_fscache(folio) folio_test_private_2(folio) #define PageFsCache(page) PagePrivate2((page)) #define SetPageFsCache(page) SetPagePrivate2((page)) #define ClearPageFsCache(page) ClearPagePrivate2((page)) @@ -29,60 +30,80 @@ #define TestClearPageFsCache(page) TestClearPagePrivate2((page)) /** - * set_page_fscache - Set PG_fscache on a page and take a ref - * @page: The page. + * folio_start_fscache - Start an fscache write on a folio. + * @folio: The folio. * - * Set the PG_fscache (PG_private_2) flag on a page and take the reference - * needed for the VM to handle its lifetime correctly. This sets the flag and - * takes the reference unconditionally, so care must be taken not to set the - * flag again if it's already set. + * Call this function before writing a folio to a local cache. Starting a + * second write before the first one finishes is not allowed. */ -static inline void set_page_fscache(struct page *page) +static inline void folio_start_fscache(struct folio *folio) { - set_page_private_2(page); + VM_BUG_ON_FOLIO(folio_test_private_2(folio), folio); + folio_get(folio); + folio_set_private_2(folio); } /** - * end_page_fscache - Clear PG_fscache and release any waiters - * @page: The page - * - * Clear the PG_fscache (PG_private_2) bit on a page and wake up any sleepers - * waiting for this. The page ref held for PG_private_2 being set is released. + * folio_end_fscache - End an fscache write on a folio. + * @folio: The folio. * - * This is, for example, used when a netfs page is being written to a local - * disk cache, thereby allowing writes to the cache for the same page to be - * serialised. + * Call this function after the folio has been written to the local cache. + * This will wake any sleepers waiting on this folio. */ -static inline void end_page_fscache(struct page *page) +static inline void folio_end_fscache(struct folio *folio) { - end_page_private_2(page); + folio_end_private_2(folio); } /** - * wait_on_page_fscache - Wait for PG_fscache to be cleared on a page - * @page: The page to wait on + * folio_wait_fscache - Wait for an fscache write on this folio to end. + * @folio: The folio. * - * Wait for PG_fscache (aka PG_private_2) to be cleared on a page. + * If this folio is currently being written to a local cache, wait for + * the write to finish. Another write may start after this one finishes, + * unless the caller holds the folio lock. */ -static inline void wait_on_page_fscache(struct page *page) +static inline void folio_wait_fscache(struct folio *folio) { - wait_on_page_private_2(page); + folio_wait_private_2(folio); } /** - * wait_on_page_fscache_killable - Wait for PG_fscache to be cleared on a page - * @page: The page to wait on + * folio_wait_fscache_killable - Wait for an fscache write on this folio to end. + * @folio: The folio. * - * Wait for PG_fscache (aka PG_private_2) to be cleared on a page or until a - * fatal signal is received by the calling task. + * If this folio is currently being written to a local cache, wait + * for the write to finish or for a fatal signal to be received. + * Another write may start after this one finishes, unless the caller + * holds the folio lock. * * Return: * - 0 if successful. * - -EINTR if a fatal signal was encountered. */ +static inline int folio_wait_fscache_killable(struct folio *folio) +{ + return folio_wait_private_2_killable(folio); +} + +static inline void set_page_fscache(struct page *page) +{ + folio_start_fscache(page_folio(page)); +} + +static inline void end_page_fscache(struct page *page) +{ + folio_end_private_2(page_folio(page)); +} + +static inline void wait_on_page_fscache(struct page *page) +{ + folio_wait_private_2(page_folio(page)); +} + static inline int wait_on_page_fscache_killable(struct page *page) { - return wait_on_page_private_2_killable(page); + return folio_wait_private_2_killable(page_folio(page)); } enum netfs_read_source { diff --git a/include/linux/nvme-fc-driver.h b/include/linux/nvme-fc-driver.h index 2a38f2b477a5..cb909edb76c4 100644 --- a/include/linux/nvme-fc-driver.h +++ b/include/linux/nvme-fc-driver.h @@ -7,6 +7,7 @@ #define _NVME_FC_DRIVER_H 1 #include <linux/scatterlist.h> +#include <linux/blk-mq.h> /* @@ -497,6 +498,8 @@ struct nvme_fc_port_template { int (*xmt_ls_rsp)(struct nvme_fc_local_port *localport, struct nvme_fc_remote_port *rport, struct nvmefc_ls_rsp *ls_rsp); + void (*map_queues)(struct nvme_fc_local_port *localport, + struct blk_mq_queue_map *map); u32 max_hw_queues; u16 max_sgl_segments; @@ -779,6 +782,10 @@ struct nvmet_fc_target_port { * LS received. * Entrypoint is Mandatory. * + * @map_queues: This functions lets the driver expose the queue mapping + * to the block layer. + * Entrypoint is Optional. + * * @fcp_op: Called to perform a data transfer or transmit a response. * The nvmefc_tgt_fcp_req structure is the same LLDD-supplied * exchange structure specified in the nvmet_fc_rcv_fcp_req() call diff --git a/include/linux/nvme-rdma.h b/include/linux/nvme-rdma.h index 3ec8e50efa16..4dd7e6fe92fb 100644 --- a/include/linux/nvme-rdma.h +++ b/include/linux/nvme-rdma.h @@ -6,6 +6,8 @@ #ifndef _LINUX_NVME_RDMA_H #define _LINUX_NVME_RDMA_H +#define NVME_RDMA_MAX_QUEUE_SIZE 128 + enum nvme_rdma_cm_fmt { NVME_RDMA_CM_FMT_1_0 = 0x0, }; diff --git a/include/linux/nvme.h b/include/linux/nvme.h index b7c4c4130b65..855dd9b3e84b 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -27,8 +27,20 @@ #define NVME_NSID_ALL 0xffffffff enum nvme_subsys_type { - NVME_NQN_DISC = 1, /* Discovery type target subsystem */ - NVME_NQN_NVME = 2, /* NVME type target subsystem */ + /* Referral to another discovery type target subsystem */ + NVME_NQN_DISC = 1, + + /* NVME type target subsystem */ + NVME_NQN_NVME = 2, + + /* Current discovery type target subsystem */ + NVME_NQN_CURR = 3, +}; + +enum nvme_ctrl_type { + NVME_CTRL_IO = 1, /* I/O controller */ + NVME_CTRL_DISC = 2, /* Discovery controller */ + NVME_CTRL_ADMIN = 3, /* Administrative controller */ }; /* Address Family codes for Discovery Log Page entry ADRFAM field */ @@ -244,7 +256,9 @@ struct nvme_id_ctrl { __le32 rtd3e; __le32 oaes; __le32 ctratt; - __u8 rsvd100[28]; + __u8 rsvd100[11]; + __u8 cntrltype; + __u8 fguid[16]; __le16 crdt1; __le16 crdt2; __le16 crdt3; @@ -312,6 +326,7 @@ struct nvme_id_ctrl { }; enum { + NVME_CTRL_CMIC_MULTI_PORT = 1 << 0, NVME_CTRL_CMIC_MULTI_CTRL = 1 << 1, NVME_CTRL_CMIC_ANA = 1 << 3, NVME_CTRL_ONCS_COMPARE = 1 << 0, @@ -1303,6 +1318,12 @@ struct nvmf_common_command { #define MAX_DISC_LOGS 255 +/* Discovery log page entry flags (EFLAGS): */ +enum { + NVME_DISC_EFLAGS_EPCSD = (1 << 1), + NVME_DISC_EFLAGS_DUPRETINFO = (1 << 0), +}; + /* Discovery log page entry */ struct nvmf_disc_rsp_page_entry { __u8 trtype; @@ -1312,7 +1333,8 @@ struct nvmf_disc_rsp_page_entry { __le16 portid; __le16 cntlid; __le16 asqsz; - __u8 resv8[22]; + __le16 eflags; + __u8 resv10[20]; char trsvcid[NVMF_TRSVCID_SIZE]; __u8 resv64[192]; char subnqn[NVMF_NQN_FIELD_LEN]; diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a558d67ee86f..d8623d6e1141 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -143,6 +143,8 @@ enum pageflags { #endif __NR_PAGEFLAGS, + PG_readahead = PG_reclaim, + /* Filesystems */ PG_checked = PG_owner_priv_1, @@ -171,6 +173,15 @@ enum pageflags { /* Compound pages. Stored in first tail page's flags */ PG_double_map = PG_workingset, +#ifdef CONFIG_MEMORY_FAILURE + /* + * Compound pages. Stored in first tail page's flags. + * Indicates that at least one subpage is hwpoisoned in the + * THP. + */ + PG_has_hwpoisoned = PG_mappedtodisk, +#endif + /* non-lru isolated movable page */ PG_isolated = PG_reclaim, @@ -193,6 +204,34 @@ static inline unsigned long _compound_head(const struct page *page) #define compound_head(page) ((typeof(page))_compound_head(page)) +/** + * page_folio - Converts from page to folio. + * @p: The page. + * + * Every page is part of a folio. This function cannot be called on a + * NULL pointer. + * + * Context: No reference, nor lock is required on @page. If the caller + * does not hold a reference, this call may race with a folio split, so + * it should re-check the folio still contains this page after gaining + * a reference on the folio. + * Return: The folio which contains this page. + */ +#define page_folio(p) (_Generic((p), \ + const struct page *: (const struct folio *)_compound_head(p), \ + struct page *: (struct folio *)_compound_head(p))) + +/** + * folio_page - Return a page from a folio. + * @folio: The folio. + * @n: The page number to return. + * + * @n is relative to the start of the folio. This function does not + * check that the page number lies within @folio; the caller is presumed + * to have a reference to the page. + */ +#define folio_page(folio, n) nth_page(&(folio)->page, n) + static __always_inline int PageTail(struct page *page) { return READ_ONCE(page->compound_head) & 1; @@ -217,6 +256,15 @@ static inline void page_init_poison(struct page *page, size_t size) } #endif +static unsigned long *folio_flags(struct folio *folio, unsigned n) +{ + struct page *page = &folio->page; + + VM_BUG_ON_PGFLAGS(PageTail(page), page); + VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags), page); + return &page[n].flags; +} + /* * Page flags policies wrt compound pages * @@ -261,36 +309,64 @@ static inline void page_init_poison(struct page *page, size_t size) VM_BUG_ON_PGFLAGS(!PageHead(page), page); \ PF_POISONED_CHECK(&page[1]); }) +/* Which page is the flag stored in */ +#define FOLIO_PF_ANY 0 +#define FOLIO_PF_HEAD 0 +#define FOLIO_PF_ONLY_HEAD 0 +#define FOLIO_PF_NO_TAIL 0 +#define FOLIO_PF_NO_COMPOUND 0 +#define FOLIO_PF_SECOND 1 + /* * Macros to create function definitions for page flags */ #define TESTPAGEFLAG(uname, lname, policy) \ +static __always_inline bool folio_test_##lname(struct folio *folio) \ +{ return test_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ static __always_inline int Page##uname(struct page *page) \ - { return test_bit(PG_##lname, &policy(page, 0)->flags); } +{ return test_bit(PG_##lname, &policy(page, 0)->flags); } #define SETPAGEFLAG(uname, lname, policy) \ +static __always_inline \ +void folio_set_##lname(struct folio *folio) \ +{ set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ static __always_inline void SetPage##uname(struct page *page) \ - { set_bit(PG_##lname, &policy(page, 1)->flags); } +{ set_bit(PG_##lname, &policy(page, 1)->flags); } #define CLEARPAGEFLAG(uname, lname, policy) \ +static __always_inline \ +void folio_clear_##lname(struct folio *folio) \ +{ clear_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ static __always_inline void ClearPage##uname(struct page *page) \ - { clear_bit(PG_##lname, &policy(page, 1)->flags); } +{ clear_bit(PG_##lname, &policy(page, 1)->flags); } #define __SETPAGEFLAG(uname, lname, policy) \ +static __always_inline \ +void __folio_set_##lname(struct folio *folio) \ +{ __set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ static __always_inline void __SetPage##uname(struct page *page) \ - { __set_bit(PG_##lname, &policy(page, 1)->flags); } +{ __set_bit(PG_##lname, &policy(page, 1)->flags); } #define __CLEARPAGEFLAG(uname, lname, policy) \ +static __always_inline \ +void __folio_clear_##lname(struct folio *folio) \ +{ __clear_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ static __always_inline void __ClearPage##uname(struct page *page) \ - { __clear_bit(PG_##lname, &policy(page, 1)->flags); } +{ __clear_bit(PG_##lname, &policy(page, 1)->flags); } #define TESTSETFLAG(uname, lname, policy) \ +static __always_inline \ +bool folio_test_set_##lname(struct folio *folio) \ +{ return test_and_set_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ static __always_inline int TestSetPage##uname(struct page *page) \ - { return test_and_set_bit(PG_##lname, &policy(page, 1)->flags); } +{ return test_and_set_bit(PG_##lname, &policy(page, 1)->flags); } #define TESTCLEARFLAG(uname, lname, policy) \ +static __always_inline \ +bool folio_test_clear_##lname(struct folio *folio) \ +{ return test_and_clear_bit(PG_##lname, folio_flags(folio, FOLIO_##policy)); } \ static __always_inline int TestClearPage##uname(struct page *page) \ - { return test_and_clear_bit(PG_##lname, &policy(page, 1)->flags); } +{ return test_and_clear_bit(PG_##lname, &policy(page, 1)->flags); } #define PAGEFLAG(uname, lname, policy) \ TESTPAGEFLAG(uname, lname, policy) \ @@ -306,29 +382,37 @@ static __always_inline int TestClearPage##uname(struct page *page) \ TESTSETFLAG(uname, lname, policy) \ TESTCLEARFLAG(uname, lname, policy) -#define TESTPAGEFLAG_FALSE(uname) \ +#define TESTPAGEFLAG_FALSE(uname, lname) \ +static inline bool folio_test_##lname(const struct folio *folio) { return 0; } \ static inline int Page##uname(const struct page *page) { return 0; } -#define SETPAGEFLAG_NOOP(uname) \ +#define SETPAGEFLAG_NOOP(uname, lname) \ +static inline void folio_set_##lname(struct folio *folio) { } \ static inline void SetPage##uname(struct page *page) { } -#define CLEARPAGEFLAG_NOOP(uname) \ +#define CLEARPAGEFLAG_NOOP(uname, lname) \ +static inline void folio_clear_##lname(struct folio *folio) { } \ static inline void ClearPage##uname(struct page *page) { } -#define __CLEARPAGEFLAG_NOOP(uname) \ +#define __CLEARPAGEFLAG_NOOP(uname, lname) \ +static inline void __folio_clear_##lname(struct folio *folio) { } \ static inline void __ClearPage##uname(struct page *page) { } -#define TESTSETFLAG_FALSE(uname) \ +#define TESTSETFLAG_FALSE(uname, lname) \ +static inline bool folio_test_set_##lname(struct folio *folio) \ +{ return 0; } \ static inline int TestSetPage##uname(struct page *page) { return 0; } -#define TESTCLEARFLAG_FALSE(uname) \ +#define TESTCLEARFLAG_FALSE(uname, lname) \ +static inline bool folio_test_clear_##lname(struct folio *folio) \ +{ return 0; } \ static inline int TestClearPage##uname(struct page *page) { return 0; } -#define PAGEFLAG_FALSE(uname) TESTPAGEFLAG_FALSE(uname) \ - SETPAGEFLAG_NOOP(uname) CLEARPAGEFLAG_NOOP(uname) +#define PAGEFLAG_FALSE(uname, lname) TESTPAGEFLAG_FALSE(uname, lname) \ + SETPAGEFLAG_NOOP(uname, lname) CLEARPAGEFLAG_NOOP(uname, lname) -#define TESTSCFLAG_FALSE(uname) \ - TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname) +#define TESTSCFLAG_FALSE(uname, lname) \ + TESTSETFLAG_FALSE(uname, lname) TESTCLEARFLAG_FALSE(uname, lname) __PAGEFLAG(Locked, locked, PF_NO_TAIL) PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) @@ -384,8 +468,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL) /* PG_readahead is only used for reads; PG_reclaim is only for writes */ PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL) TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL) -PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND) - TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND) +PAGEFLAG(Readahead, readahead, PF_NO_COMPOUND) + TESTCLEARFLAG(Readahead, readahead, PF_NO_COMPOUND) #ifdef CONFIG_HIGHMEM /* @@ -394,22 +478,25 @@ PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND) */ #define PageHighMem(__p) is_highmem_idx(page_zonenum(__p)) #else -PAGEFLAG_FALSE(HighMem) +PAGEFLAG_FALSE(HighMem, highmem) #endif #ifdef CONFIG_SWAP -static __always_inline int PageSwapCache(struct page *page) +static __always_inline bool folio_test_swapcache(struct folio *folio) { -#ifdef CONFIG_THP_SWAP - page = compound_head(page); -#endif - return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags); + return folio_test_swapbacked(folio) && + test_bit(PG_swapcache, folio_flags(folio, 0)); +} +static __always_inline bool PageSwapCache(struct page *page) +{ + return folio_test_swapcache(page_folio(page)); } + SETPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) #else -PAGEFLAG_FALSE(SwapCache) +PAGEFLAG_FALSE(SwapCache, swapcache) #endif PAGEFLAG(Unevictable, unevictable, PF_HEAD) @@ -421,14 +508,14 @@ PAGEFLAG(Mlocked, mlocked, PF_NO_TAIL) __CLEARPAGEFLAG(Mlocked, mlocked, PF_NO_TAIL) TESTSCFLAG(Mlocked, mlocked, PF_NO_TAIL) #else -PAGEFLAG_FALSE(Mlocked) __CLEARPAGEFLAG_NOOP(Mlocked) - TESTSCFLAG_FALSE(Mlocked) +PAGEFLAG_FALSE(Mlocked, mlocked) __CLEARPAGEFLAG_NOOP(Mlocked, mlocked) + TESTSCFLAG_FALSE(Mlocked, mlocked) #endif #ifdef CONFIG_ARCH_USES_PG_UNCACHED PAGEFLAG(Uncached, uncached, PF_NO_COMPOUND) #else -PAGEFLAG_FALSE(Uncached) +PAGEFLAG_FALSE(Uncached, uncached) #endif #ifdef CONFIG_MEMORY_FAILURE @@ -437,7 +524,7 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) #define __PG_HWPOISON (1UL << PG_hwpoison) extern bool take_page_off_buddy(struct page *page); #else -PAGEFLAG_FALSE(HWPoison) +PAGEFLAG_FALSE(HWPoison, hwpoison) #define __PG_HWPOISON 0 #endif @@ -451,7 +538,7 @@ PAGEFLAG(Idle, idle, PF_ANY) #ifdef CONFIG_KASAN_HW_TAGS PAGEFLAG(SkipKASanPoison, skip_kasan_poison, PF_HEAD) #else -PAGEFLAG_FALSE(SkipKASanPoison) +PAGEFLAG_FALSE(SkipKASanPoison, skip_kasan_poison) #endif /* @@ -489,10 +576,14 @@ static __always_inline int PageMappingFlags(struct page *page) return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != 0; } -static __always_inline int PageAnon(struct page *page) +static __always_inline bool folio_test_anon(struct folio *folio) +{ + return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0; +} + +static __always_inline bool PageAnon(struct page *page) { - page = compound_head(page); - return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0; + return folio_test_anon(page_folio(page)); } static __always_inline int __PageMovable(struct page *page) @@ -508,30 +599,32 @@ static __always_inline int __PageMovable(struct page *page) * is found in VM_MERGEABLE vmas. It's a PageAnon page, pointing not to any * anon_vma, but to that page's node of the stable tree. */ -static __always_inline int PageKsm(struct page *page) +static __always_inline bool folio_test_ksm(struct folio *folio) { - page = compound_head(page); - return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == + return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_KSM; } + +static __always_inline bool PageKsm(struct page *page) +{ + return folio_test_ksm(page_folio(page)); +} #else -TESTPAGEFLAG_FALSE(Ksm) +TESTPAGEFLAG_FALSE(Ksm, ksm) #endif u64 stable_page_flags(struct page *page); -static inline int PageUptodate(struct page *page) +static inline bool folio_test_uptodate(struct folio *folio) { - int ret; - page = compound_head(page); - ret = test_bit(PG_uptodate, &(page)->flags); + bool ret = test_bit(PG_uptodate, folio_flags(folio, 0)); /* - * Must ensure that the data we read out of the page is loaded - * _after_ we've loaded page->flags to check for PageUptodate. - * We can skip the barrier if the page is not uptodate, because + * Must ensure that the data we read out of the folio is loaded + * _after_ we've loaded folio->flags to check the uptodate bit. + * We can skip the barrier if the folio is not uptodate, because * we wouldn't be reading anything from it. * - * See SetPageUptodate() for the other side of the story. + * See folio_mark_uptodate() for the other side of the story. */ if (ret) smp_rmb(); @@ -539,47 +632,71 @@ static inline int PageUptodate(struct page *page) return ret; } -static __always_inline void __SetPageUptodate(struct page *page) +static inline int PageUptodate(struct page *page) +{ + return folio_test_uptodate(page_folio(page)); +} + +static __always_inline void __folio_mark_uptodate(struct folio *folio) { - VM_BUG_ON_PAGE(PageTail(page), page); smp_wmb(); - __set_bit(PG_uptodate, &page->flags); + __set_bit(PG_uptodate, folio_flags(folio, 0)); } -static __always_inline void SetPageUptodate(struct page *page) +static __always_inline void folio_mark_uptodate(struct folio *folio) { - VM_BUG_ON_PAGE(PageTail(page), page); /* * Memory barrier must be issued before setting the PG_uptodate bit, - * so that all previous stores issued in order to bring the page - * uptodate are actually visible before PageUptodate becomes true. + * so that all previous stores issued in order to bring the folio + * uptodate are actually visible before folio_test_uptodate becomes true. */ smp_wmb(); - set_bit(PG_uptodate, &page->flags); + set_bit(PG_uptodate, folio_flags(folio, 0)); +} + +static __always_inline void __SetPageUptodate(struct page *page) +{ + __folio_mark_uptodate((struct folio *)page); +} + +static __always_inline void SetPageUptodate(struct page *page) +{ + folio_mark_uptodate((struct folio *)page); } CLEARPAGEFLAG(Uptodate, uptodate, PF_NO_TAIL) -int test_clear_page_writeback(struct page *page); -int __test_set_page_writeback(struct page *page, bool keep_write); +bool __folio_start_writeback(struct folio *folio, bool keep_write); +bool set_page_writeback(struct page *page); -#define test_set_page_writeback(page) \ - __test_set_page_writeback(page, false) -#define test_set_page_writeback_keepwrite(page) \ - __test_set_page_writeback(page, true) +#define folio_start_writeback(folio) \ + __folio_start_writeback(folio, false) +#define folio_start_writeback_keepwrite(folio) \ + __folio_start_writeback(folio, true) -static inline void set_page_writeback(struct page *page) +static inline void set_page_writeback_keepwrite(struct page *page) { - test_set_page_writeback(page); + folio_start_writeback_keepwrite(page_folio(page)); } -static inline void set_page_writeback_keepwrite(struct page *page) +static inline bool test_set_page_writeback(struct page *page) { - test_set_page_writeback_keepwrite(page); + return set_page_writeback(page); } __PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY) +/* Whether there are one or multiple pages in a folio */ +static inline bool folio_test_single(struct folio *folio) +{ + return !folio_test_head(folio); +} + +static inline bool folio_test_multi(struct folio *folio) +{ + return folio_test_head(folio); +} + static __always_inline void set_compound_head(struct page *page, struct page *head) { WRITE_ONCE(page->compound_head, (unsigned long)head + 1); @@ -603,12 +720,15 @@ static inline void ClearPageCompound(struct page *page) #ifdef CONFIG_HUGETLB_PAGE int PageHuge(struct page *page); int PageHeadHuge(struct page *page); +static inline bool folio_test_hugetlb(struct folio *folio) +{ + return PageHeadHuge(&folio->page); +} #else -TESTPAGEFLAG_FALSE(Huge) -TESTPAGEFLAG_FALSE(HeadHuge) +TESTPAGEFLAG_FALSE(Huge, hugetlb) +TESTPAGEFLAG_FALSE(HeadHuge, headhuge) #endif - #ifdef CONFIG_TRANSPARENT_HUGEPAGE /* * PageHuge() only returns true for hugetlbfs pages, but not for @@ -624,6 +744,11 @@ static inline int PageTransHuge(struct page *page) return PageHead(page); } +static inline bool folio_test_transhuge(struct folio *folio) +{ + return folio_test_head(folio); +} + /* * PageTransCompound returns true for both transparent huge pages * and hugetlbfs pages, so it should only be called when it's known @@ -660,12 +785,26 @@ static inline int PageTransTail(struct page *page) PAGEFLAG(DoubleMap, double_map, PF_SECOND) TESTSCFLAG(DoubleMap, double_map, PF_SECOND) #else -TESTPAGEFLAG_FALSE(TransHuge) -TESTPAGEFLAG_FALSE(TransCompound) -TESTPAGEFLAG_FALSE(TransCompoundMap) -TESTPAGEFLAG_FALSE(TransTail) -PAGEFLAG_FALSE(DoubleMap) - TESTSCFLAG_FALSE(DoubleMap) +TESTPAGEFLAG_FALSE(TransHuge, transhuge) +TESTPAGEFLAG_FALSE(TransCompound, transcompound) +TESTPAGEFLAG_FALSE(TransCompoundMap, transcompoundmap) +TESTPAGEFLAG_FALSE(TransTail, transtail) +PAGEFLAG_FALSE(DoubleMap, double_map) + TESTSCFLAG_FALSE(DoubleMap, double_map) +#endif + +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE) +/* + * PageHasHWPoisoned indicates that at least one subpage is hwpoisoned in the + * compound page. + * + * This flag is set by hwpoison handler. Cleared by THP split or free page. + */ +PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND) + TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND) +#else +PAGEFLAG_FALSE(HasHWPoisoned) + TESTSCFLAG_FALSE(HasHWPoisoned) #endif /* @@ -849,6 +988,11 @@ static inline int page_has_private(struct page *page) return !!(page->flags & PAGE_FLAGS_PRIVATE); } +static inline bool folio_has_private(struct folio *folio) +{ + return page_has_private(&folio->page); +} + #undef PF_ANY #undef PF_HEAD #undef PF_ONLY_HEAD diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h index d8a6aecf99cb..83abf95e9fa7 100644 --- a/include/linux/page_idle.h +++ b/include/linux/page_idle.h @@ -8,46 +8,16 @@ #ifdef CONFIG_PAGE_IDLE_FLAG -#ifdef CONFIG_64BIT -static inline bool page_is_young(struct page *page) -{ - return PageYoung(page); -} - -static inline void set_page_young(struct page *page) -{ - SetPageYoung(page); -} - -static inline bool test_and_clear_page_young(struct page *page) -{ - return TestClearPageYoung(page); -} - -static inline bool page_is_idle(struct page *page) -{ - return PageIdle(page); -} - -static inline void set_page_idle(struct page *page) -{ - SetPageIdle(page); -} - -static inline void clear_page_idle(struct page *page) -{ - ClearPageIdle(page); -} -#else /* !CONFIG_64BIT */ +#ifndef CONFIG_64BIT /* * If there is not enough space to store Idle and Young bits in page flags, use * page ext flags instead. */ extern struct page_ext_operations page_idle_ops; -static inline bool page_is_young(struct page *page) +static inline bool folio_test_young(struct folio *folio) { - struct page_ext *page_ext = lookup_page_ext(page); + struct page_ext *page_ext = lookup_page_ext(&folio->page); if (unlikely(!page_ext)) return false; @@ -55,9 +25,9 @@ static inline bool page_is_young(struct page *page) return test_bit(PAGE_EXT_YOUNG, &page_ext->flags); } -static inline void set_page_young(struct page *page) +static inline void folio_set_young(struct folio *folio) { - struct page_ext *page_ext = lookup_page_ext(page); + struct page_ext *page_ext = lookup_page_ext(&folio->page); if (unlikely(!page_ext)) return; @@ -65,9 +35,9 @@ static inline void set_page_young(struct page *page) set_bit(PAGE_EXT_YOUNG, &page_ext->flags); } -static inline bool test_and_clear_page_young(struct page *page) +static inline bool folio_test_clear_young(struct folio *folio) { - struct page_ext *page_ext = lookup_page_ext(page); + struct page_ext *page_ext = lookup_page_ext(&folio->page); if (unlikely(!page_ext)) return false; @@ -75,9 +45,9 @@ static inline bool test_and_clear_page_young(struct page *page) return test_and_clear_bit(PAGE_EXT_YOUNG, &page_ext->flags); } -static inline bool page_is_idle(struct page *page) +static inline bool folio_test_idle(struct folio *folio) { - struct page_ext *page_ext = lookup_page_ext(page); + struct page_ext *page_ext = lookup_page_ext(&folio->page); if (unlikely(!page_ext)) return false; @@ -85,9 +55,9 @@ static inline bool page_is_idle(struct page *page) return test_bit(PAGE_EXT_IDLE, &page_ext->flags); } -static inline void set_page_idle(struct page *page) +static inline void folio_set_idle(struct folio *folio) { - struct page_ext *page_ext = lookup_page_ext(page); + struct page_ext *page_ext = lookup_page_ext(&folio->page); if (unlikely(!page_ext)) return; @@ -95,46 +65,75 @@ static inline void set_page_idle(struct page *page) set_bit(PAGE_EXT_IDLE, &page_ext->flags); } -static inline void clear_page_idle(struct page *page) +static inline void folio_clear_idle(struct folio *folio) { - struct page_ext *page_ext = lookup_page_ext(page); + struct page_ext *page_ext = lookup_page_ext(&folio->page); if (unlikely(!page_ext)) return; clear_bit(PAGE_EXT_IDLE, &page_ext->flags); } -#endif /* CONFIG_64BIT */ +#endif /* !CONFIG_64BIT */ #else /* !CONFIG_PAGE_IDLE_FLAG */ -static inline bool page_is_young(struct page *page) +static inline bool folio_test_young(struct folio *folio) { return false; } -static inline void set_page_young(struct page *page) +static inline void folio_set_young(struct folio *folio) { } -static inline bool test_and_clear_page_young(struct page *page) +static inline bool folio_test_clear_young(struct folio *folio) { return false; } -static inline bool page_is_idle(struct page *page) +static inline bool folio_test_idle(struct folio *folio) { return false; } -static inline void set_page_idle(struct page *page) +static inline void folio_set_idle(struct folio *folio) { } -static inline void clear_page_idle(struct page *page) +static inline void folio_clear_idle(struct folio *folio) { } #endif /* CONFIG_PAGE_IDLE_FLAG */ +static inline bool page_is_young(struct page *page) +{ + return folio_test_young(page_folio(page)); +} + +static inline void set_page_young(struct page *page) +{ + folio_set_young(page_folio(page)); +} + +static inline bool test_and_clear_page_young(struct page *page) +{ + return folio_test_clear_young(page_folio(page)); +} + +static inline bool page_is_idle(struct page *page) +{ + return folio_test_idle(page_folio(page)); +} + +static inline void set_page_idle(struct page *page) +{ + folio_set_idle(page_folio(page)); +} + +static inline void clear_page_idle(struct page *page) +{ + folio_clear_idle(page_folio(page)); +} #endif /* _LINUX_MM_PAGE_IDLE_H */ diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h index 719bfe5108c5..43c638c51c1f 100644 --- a/include/linux/page_owner.h +++ b/include/linux/page_owner.h @@ -12,7 +12,7 @@ extern void __reset_page_owner(struct page *page, unsigned int order); extern void __set_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask); extern void __split_page_owner(struct page *page, unsigned int nr); -extern void __copy_page_owner(struct page *oldpage, struct page *newpage); +extern void __folio_copy_owner(struct folio *newfolio, struct folio *old); extern void __set_page_owner_migrate_reason(struct page *page, int reason); extern void __dump_page_owner(const struct page *page); extern void pagetypeinfo_showmixedcount_print(struct seq_file *m, @@ -36,10 +36,10 @@ static inline void split_page_owner(struct page *page, unsigned int nr) if (static_branch_unlikely(&page_owner_inited)) __split_page_owner(page, nr); } -static inline void copy_page_owner(struct page *oldpage, struct page *newpage) +static inline void folio_copy_owner(struct folio *newfolio, struct folio *old) { if (static_branch_unlikely(&page_owner_inited)) - __copy_page_owner(oldpage, newpage); + __folio_copy_owner(newfolio, old); } static inline void set_page_owner_migrate_reason(struct page *page, int reason) { @@ -63,7 +63,7 @@ static inline void split_page_owner(struct page *page, unsigned int order) { } -static inline void copy_page_owner(struct page *oldpage, struct page *newpage) +static inline void folio_copy_owner(struct folio *newfolio, struct folio *folio) { } static inline void set_page_owner_migrate_reason(struct page *page, int reason) diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h index 7ad46f45df39..2e677e6ad09f 100644 --- a/include/linux/page_ref.h +++ b/include/linux/page_ref.h @@ -67,9 +67,31 @@ static inline int page_ref_count(const struct page *page) return atomic_read(&page->_refcount); } +/** + * folio_ref_count - The reference count on this folio. + * @folio: The folio. + * + * The refcount is usually incremented by calls to folio_get() and + * decremented by calls to folio_put(). Some typical users of the + * folio refcount: + * + * - Each reference from a page table + * - The page cache + * - Filesystem private data + * - The LRU list + * - Pipes + * - Direct IO which references this page in the process address space + * + * Return: The number of references to this folio. + */ +static inline int folio_ref_count(const struct folio *folio) +{ + return page_ref_count(&folio->page); +} + static inline int page_count(const struct page *page) { - return atomic_read(&compound_head(page)->_refcount); + return folio_ref_count(page_folio(page)); } static inline void set_page_count(struct page *page, int v) @@ -79,6 +101,11 @@ static inline void set_page_count(struct page *page, int v) __page_ref_set(page, v); } +static inline void folio_set_count(struct folio *folio, int v) +{ + set_page_count(&folio->page, v); +} + /* * Setup the page count before being freed into the page allocator for * the first time (boot or memory hotplug) @@ -95,6 +122,11 @@ static inline void page_ref_add(struct page *page, int nr) __page_ref_mod(page, nr); } +static inline void folio_ref_add(struct folio *folio, int nr) +{ + page_ref_add(&folio->page, nr); +} + static inline void page_ref_sub(struct page *page, int nr) { atomic_sub(nr, &page->_refcount); @@ -102,6 +134,11 @@ static inline void page_ref_sub(struct page *page, int nr) __page_ref_mod(page, -nr); } +static inline void folio_ref_sub(struct folio *folio, int nr) +{ + page_ref_sub(&folio->page, nr); +} + static inline int page_ref_sub_return(struct page *page, int nr) { int ret = atomic_sub_return(nr, &page->_refcount); @@ -111,6 +148,11 @@ static inline int page_ref_sub_return(struct page *page, int nr) return ret; } +static inline int folio_ref_sub_return(struct folio *folio, int nr) +{ + return page_ref_sub_return(&folio->page, nr); +} + static inline void page_ref_inc(struct page *page) { atomic_inc(&page->_refcount); @@ -118,6 +160,11 @@ static inline void page_ref_inc(struct page *page) __page_ref_mod(page, 1); } +static inline void folio_ref_inc(struct folio *folio) +{ + page_ref_inc(&folio->page); +} + static inline void page_ref_dec(struct page *page) { atomic_dec(&page->_refcount); @@ -125,6 +172,11 @@ static inline void page_ref_dec(struct page *page) __page_ref_mod(page, -1); } +static inline void folio_ref_dec(struct folio *folio) +{ + page_ref_dec(&folio->page); +} + static inline int page_ref_sub_and_test(struct page *page, int nr) { int ret = atomic_sub_and_test(nr, &page->_refcount); @@ -134,6 +186,11 @@ static inline int page_ref_sub_and_test(struct page *page, int nr) return ret; } +static inline int folio_ref_sub_and_test(struct folio *folio, int nr) +{ + return page_ref_sub_and_test(&folio->page, nr); +} + static inline int page_ref_inc_return(struct page *page) { int ret = atomic_inc_return(&page->_refcount); @@ -143,6 +200,11 @@ static inline int page_ref_inc_return(struct page *page) return ret; } +static inline int folio_ref_inc_return(struct folio *folio) +{ + return page_ref_inc_return(&folio->page); +} + static inline int page_ref_dec_and_test(struct page *page) { int ret = atomic_dec_and_test(&page->_refcount); @@ -152,6 +214,11 @@ static inline int page_ref_dec_and_test(struct page *page) return ret; } +static inline int folio_ref_dec_and_test(struct folio *folio) +{ + return page_ref_dec_and_test(&folio->page); +} + static inline int page_ref_dec_return(struct page *page) { int ret = atomic_dec_return(&page->_refcount); @@ -161,15 +228,91 @@ static inline int page_ref_dec_return(struct page *page) return ret; } -static inline int page_ref_add_unless(struct page *page, int nr, int u) +static inline int folio_ref_dec_return(struct folio *folio) +{ + return page_ref_dec_return(&folio->page); +} + +static inline bool page_ref_add_unless(struct page *page, int nr, int u) { - int ret = atomic_add_unless(&page->_refcount, nr, u); + bool ret = atomic_add_unless(&page->_refcount, nr, u); if (page_ref_tracepoint_active(page_ref_mod_unless)) __page_ref_mod_unless(page, nr, ret); return ret; } +static inline bool folio_ref_add_unless(struct folio *folio, int nr, int u) +{ + return page_ref_add_unless(&folio->page, nr, u); +} + +/** + * folio_try_get - Attempt to increase the refcount on a folio. + * @folio: The folio. + * + * If you do not already have a reference to a folio, you can attempt to + * get one using this function. It may fail if, for example, the folio + * has been freed since you found a pointer to it, or it is frozen for + * the purposes of splitting or migration. + * + * Return: True if the reference count was successfully incremented. + */ +static inline bool folio_try_get(struct folio *folio) +{ + return folio_ref_add_unless(folio, 1, 0); +} + +static inline bool folio_ref_try_add_rcu(struct folio *folio, int count) +{ +#ifdef CONFIG_TINY_RCU + /* + * The caller guarantees the folio will not be freed from interrupt + * context, so (on !SMP) we only need preemption to be disabled + * and TINY_RCU does that for us. + */ +# ifdef CONFIG_PREEMPT_COUNT + VM_BUG_ON(!in_atomic() && !irqs_disabled()); +# endif + VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio); + folio_ref_add(folio, count); +#else + if (unlikely(!folio_ref_add_unless(folio, count, 0))) { + /* Either the folio has been freed, or will be freed. */ + return false; + } +#endif + return true; +} + +/** + * folio_try_get_rcu - Attempt to increase the refcount on a folio. + * @folio: The folio. + * + * This is a version of folio_try_get() optimised for non-SMP kernels. + * If you are still holding the rcu_read_lock() after looking up the + * page and know that the page cannot have its refcount decreased to + * zero in interrupt context, you can use this instead of folio_try_get(). + * + * Example users include get_user_pages_fast() (as pages are not unmapped + * from interrupt context) and the page cache lookups (as pages are not + * truncated from interrupt context). We also know that pages are not + * frozen in interrupt context for the purposes of splitting or migration. + * + * You can also use this function if you're holding a lock that prevents + * pages being frozen & removed; eg the i_pages lock for the page cache + * or the mmap_sem or page table lock for page tables. In this case, + * it will always succeed, and you could have used a plain folio_get(), + * but it's sometimes more convenient to have a common function called + * from both locked and RCU-protected contexts. + * + * Return: True if the reference count was successfully incremented. + */ +static inline bool folio_try_get_rcu(struct folio *folio) +{ + return folio_ref_try_add_rcu(folio, 1); +} + static inline int page_ref_freeze(struct page *page, int count) { int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count); @@ -179,6 +322,11 @@ static inline int page_ref_freeze(struct page *page, int count) return ret; } +static inline int folio_ref_freeze(struct folio *folio, int count) +{ + return page_ref_freeze(&folio->page, count); +} + static inline void page_ref_unfreeze(struct page *page, int count) { VM_BUG_ON_PAGE(page_count(page) != 0, page); @@ -189,4 +337,8 @@ static inline void page_ref_unfreeze(struct page *page, int count) __page_ref_unfreeze(page, count); } +static inline void folio_ref_unfreeze(struct folio *folio, int count) +{ + page_ref_unfreeze(&folio->page, count); +} #endif diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 62db6b0176b9..013cdc90f5fd 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -162,149 +162,119 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping) void release_pages(struct page **pages, int nr); -/* - * For file cache pages, return the address_space, otherwise return NULL +struct address_space *page_mapping(struct page *); +struct address_space *folio_mapping(struct folio *); +struct address_space *swapcache_mapping(struct folio *); + +/** + * folio_file_mapping - Find the mapping this folio belongs to. + * @folio: The folio. + * + * For folios which are in the page cache, return the mapping that this + * page belongs to. Folios in the swap cache return the mapping of the + * swap file or swap device where the data is stored. This is different + * from the mapping returned by folio_mapping(). The only reason to + * use it is if, like NFS, you return 0 from ->activate_swapfile. + * + * Do not call this for folios which aren't in the page cache or swap cache. */ -static inline struct address_space *page_mapping_file(struct page *page) +static inline struct address_space *folio_file_mapping(struct folio *folio) { - if (unlikely(PageSwapCache(page))) - return NULL; - return page_mapping(page); + if (unlikely(folio_test_swapcache(folio))) + return swapcache_mapping(folio); + + return folio->mapping; +} + +static inline struct address_space *page_file_mapping(struct page *page) +{ + return folio_file_mapping(page_folio(page)); } /* - * speculatively take a reference to a page. - * If the page is free (_refcount == 0), then _refcount is untouched, and 0 - * is returned. Otherwise, _refcount is incremented by 1 and 1 is returned. - * - * This function must be called inside the same rcu_read_lock() section as has - * been used to lookup the page in the pagecache radix-tree (or page table): - * this allows allocators to use a synchronize_rcu() to stabilize _refcount. - * - * Unless an RCU grace period has passed, the count of all pages coming out - * of the allocator must be considered unstable. page_count may return higher - * than expected, and put_page must be able to do the right thing when the - * page has been finished with, no matter what it is subsequently allocated - * for (because put_page is what is used here to drop an invalid speculative - * reference). - * - * This is the interesting part of the lockless pagecache (and lockless - * get_user_pages) locking protocol, where the lookup-side (eg. find_get_page) - * has the following pattern: - * 1. find page in radix tree - * 2. conditionally increment refcount - * 3. check the page is still in pagecache (if no, goto 1) - * - * Remove-side that cares about stability of _refcount (eg. reclaim) has the - * following (with the i_pages lock held): - * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg) - * B. remove page from pagecache - * C. free the page - * - * There are 2 critical interleavings that matter: - * - 2 runs before A: in this case, A sees elevated refcount and bails out - * - A runs before 2: in this case, 2 sees zero refcount and retries; - * subsequently, B will complete and 1 will find no page, causing the - * lookup to return NULL. - * - * It is possible that between 1 and 2, the page is removed then the exact same - * page is inserted into the same position in pagecache. That's OK: the - * old find_get_page using a lock could equally have run before or after - * such a re-insertion, depending on order that locks are granted. - * - * Lookups racing against pagecache insertion isn't a big problem: either 1 - * will find the page or it will not. Likewise, the old find_get_page could run - * either before the insertion or afterwards, depending on timing. + * For file cache pages, return the address_space, otherwise return NULL */ -static inline int __page_cache_add_speculative(struct page *page, int count) +static inline struct address_space *page_mapping_file(struct page *page) { -#ifdef CONFIG_TINY_RCU -# ifdef CONFIG_PREEMPT_COUNT - VM_BUG_ON(!in_atomic() && !irqs_disabled()); -# endif - /* - * Preempt must be disabled here - we rely on rcu_read_lock doing - * this for us. - * - * Pagecache won't be truncated from interrupt context, so if we have - * found a page in the radix tree here, we have pinned its refcount by - * disabling preempt, and hence no need for the "speculative get" that - * SMP requires. - */ - VM_BUG_ON_PAGE(page_count(page) == 0, page); - page_ref_add(page, count); + struct folio *folio = page_folio(page); -#else - if (unlikely(!page_ref_add_unless(page, count, 0))) { - /* - * Either the page has been freed, or will be freed. - * In either case, retry here and the caller should - * do the right thing (see comments above). - */ - return 0; - } -#endif - VM_BUG_ON_PAGE(PageTail(page), page); - - return 1; + if (unlikely(folio_test_swapcache(folio))) + return NULL; + return folio_mapping(folio); } -static inline int page_cache_get_speculative(struct page *page) +static inline bool page_cache_add_speculative(struct page *page, int count) { - return __page_cache_add_speculative(page, 1); + VM_BUG_ON_PAGE(PageTail(page), page); + return folio_ref_try_add_rcu((struct folio *)page, count); } -static inline int page_cache_add_speculative(struct page *page, int count) +static inline bool page_cache_get_speculative(struct page *page) { - return __page_cache_add_speculative(page, count); + return page_cache_add_speculative(page, 1); } /** - * attach_page_private - Attach private data to a page. - * @page: Page to attach data to. - * @data: Data to attach to page. + * folio_attach_private - Attach private data to a folio. + * @folio: Folio to attach data to. + * @data: Data to attach to folio. * - * Attaching private data to a page increments the page's reference count. - * The data must be detached before the page will be freed. + * Attaching private data to a folio increments the page's reference count. + * The data must be detached before the folio will be freed. */ -static inline void attach_page_private(struct page *page, void *data) +static inline void folio_attach_private(struct folio *folio, void *data) { - get_page(page); - set_page_private(page, (unsigned long)data); - SetPagePrivate(page); + folio_get(folio); + folio->private = data; + folio_set_private(folio); } /** - * detach_page_private - Detach private data from a page. - * @page: Page to detach data from. + * folio_detach_private - Detach private data from a folio. + * @folio: Folio to detach data from. * - * Removes the data that was previously attached to the page and decrements + * Removes the data that was previously attached to the folio and decrements * the refcount on the page. * - * Return: Data that was attached to the page. + * Return: Data that was attached to the folio. */ -static inline void *detach_page_private(struct page *page) +static inline void *folio_detach_private(struct folio *folio) { - void *data = (void *)page_private(page); + void *data = folio_get_private(folio); - if (!PagePrivate(page)) + if (!folio_test_private(folio)) return NULL; - ClearPagePrivate(page); - set_page_private(page, 0); - put_page(page); + folio_clear_private(folio); + folio->private = NULL; + folio_put(folio); return data; } +static inline void attach_page_private(struct page *page, void *data) +{ + folio_attach_private(page_folio(page), data); +} + +static inline void *detach_page_private(struct page *page) +{ + return folio_detach_private(page_folio(page)); +} + #ifdef CONFIG_NUMA -extern struct page *__page_cache_alloc(gfp_t gfp); +struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order); #else -static inline struct page *__page_cache_alloc(gfp_t gfp) +static inline struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order) { - return alloc_pages(gfp, 0); + return folio_alloc(gfp, order); } #endif +static inline struct page *__page_cache_alloc(gfp_t gfp) +{ + return &filemap_alloc_folio(gfp, 0)->page; +} + static inline struct page *page_cache_alloc(struct address_space *x) { return __page_cache_alloc(mapping_gfp_mask(x)); @@ -331,9 +301,28 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping, #define FGP_FOR_MMAP 0x00000040 #define FGP_HEAD 0x00000080 #define FGP_ENTRY 0x00000100 +#define FGP_STABLE 0x00000200 -struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset, - int fgp_flags, gfp_t cache_gfp_mask); +struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, + int fgp_flags, gfp_t gfp); +struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index, + int fgp_flags, gfp_t gfp); + +/** + * filemap_get_folio - Find and get a folio. + * @mapping: The address_space to search. + * @index: The page index. + * + * Looks up the page cache entry at @mapping & @index. If a folio is + * present, it is returned with an increased refcount. + * + * Otherwise, %NULL is returned. + */ +static inline struct folio *filemap_get_folio(struct address_space *mapping, + pgoff_t index) +{ + return __filemap_get_folio(mapping, index, 0, 0); +} /** * find_get_page - find and get a page reference @@ -377,25 +366,6 @@ static inline struct page *find_lock_page(struct address_space *mapping, } /** - * find_lock_head - Locate, pin and lock a pagecache page. - * @mapping: The address_space to search. - * @index: The page index. - * - * Looks up the page cache entry at @mapping & @index. If there is a - * page cache page, its head page is returned locked and with an increased - * refcount. - * - * Context: May sleep. - * Return: A struct page which is !PageTail, or %NULL if there is no page - * in the cache for this index. - */ -static inline struct page *find_lock_head(struct address_space *mapping, - pgoff_t index) -{ - return pagecache_get_page(mapping, index, FGP_LOCK | FGP_HEAD, 0); -} - -/** * find_or_create_page - locate or add a pagecache page * @mapping: the page's address_space * @index: the page's index into the mapping @@ -452,6 +422,73 @@ static inline bool thp_contains(struct page *head, pgoff_t index) return page_index(head) == (index & ~(thp_nr_pages(head) - 1UL)); } +#define swapcache_index(folio) __page_file_index(&(folio)->page) + +/** + * folio_index - File index of a folio. + * @folio: The folio. + * + * For a folio which is either in the page cache or the swap cache, + * return its index within the address_space it belongs to. If you know + * the page is definitely in the page cache, you can look at the folio's + * index directly. + * + * Return: The index (offset in units of pages) of a folio in its file. + */ +static inline pgoff_t folio_index(struct folio *folio) +{ + if (unlikely(folio_test_swapcache(folio))) + return swapcache_index(folio); + return folio->index; +} + +/** + * folio_next_index - Get the index of the next folio. + * @folio: The current folio. + * + * Return: The index of the folio which follows this folio in the file. + */ +static inline pgoff_t folio_next_index(struct folio *folio) +{ + return folio->index + folio_nr_pages(folio); +} + +/** + * folio_file_page - The page for a particular index. + * @folio: The folio which contains this index. + * @index: The index we want to look up. + * + * Sometimes after looking up a folio in the page cache, we need to + * obtain the specific page for an index (eg a page fault). + * + * Return: The page containing the file data for this index. + */ +static inline struct page *folio_file_page(struct folio *folio, pgoff_t index) +{ + /* HugeTLBfs indexes the page cache in units of hpage_size */ + if (folio_test_hugetlb(folio)) + return &folio->page; + return folio_page(folio, index & (folio_nr_pages(folio) - 1)); +} + +/** + * folio_contains - Does this folio contain this index? + * @folio: The folio. + * @index: The page index within the file. + * + * Context: The caller should have the page locked in order to prevent + * (eg) shmem from moving the page between the page cache and swap cache + * and changing its index in the middle of the operation. + * Return: true or false. + */ +static inline bool folio_contains(struct folio *folio, pgoff_t index) +{ + /* HugeTLBfs indexes the page cache in units of hpage_size */ + if (folio_test_hugetlb(folio)) + return folio->index == index; + return index - folio_index(folio) < folio_nr_pages(folio); +} + /* * Given the page we found in the page cache, return the page corresponding * to this index in the file @@ -560,6 +597,27 @@ static inline loff_t page_file_offset(struct page *page) return ((loff_t)page_index(page)) << PAGE_SHIFT; } +/** + * folio_pos - Returns the byte position of this folio in its file. + * @folio: The folio. + */ +static inline loff_t folio_pos(struct folio *folio) +{ + return page_offset(&folio->page); +} + +/** + * folio_file_pos - Returns the byte position of this folio in its file. + * @folio: The folio. + * + * This differs from folio_pos() for folios which belong to a swap file. + * NFS is the only filesystem today which needs to use folio_file_pos(). + */ +static inline loff_t folio_file_pos(struct folio *folio) +{ + return page_file_offset(&folio->page); +} + extern pgoff_t linear_hugepage_index(struct vm_area_struct *vma, unsigned long address); @@ -575,13 +633,13 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma, } struct wait_page_key { - struct page *page; + struct folio *folio; int bit_nr; int page_match; }; struct wait_page_queue { - struct page *page; + struct folio *folio; int bit_nr; wait_queue_entry_t wait; }; @@ -589,7 +647,7 @@ struct wait_page_queue { static inline bool wake_page_match(struct wait_page_queue *wait_page, struct wait_page_key *key) { - if (wait_page->page != key->page) + if (wait_page->folio != key->folio) return false; key->page_match = 1; @@ -599,20 +657,31 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page, return true; } -extern void __lock_page(struct page *page); -extern int __lock_page_killable(struct page *page); -extern int __lock_page_async(struct page *page, struct wait_page_queue *wait); -extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm, +void __folio_lock(struct folio *folio); +int __folio_lock_killable(struct folio *folio); +bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, unsigned int flags); -extern void unlock_page(struct page *page); +void unlock_page(struct page *page); +void folio_unlock(struct folio *folio); + +static inline bool folio_trylock(struct folio *folio) +{ + return likely(!test_and_set_bit_lock(PG_locked, folio_flags(folio, 0))); +} /* * Return true if the page was successfully locked */ static inline int trylock_page(struct page *page) { - page = compound_head(page); - return (likely(!test_and_set_bit_lock(PG_locked, &page->flags))); + return folio_trylock(page_folio(page)); +} + +static inline void folio_lock(struct folio *folio) +{ + might_sleep(); + if (!folio_trylock(folio)) + __folio_lock(folio); } /* @@ -620,38 +689,30 @@ static inline int trylock_page(struct page *page) */ static inline void lock_page(struct page *page) { + struct folio *folio; might_sleep(); - if (!trylock_page(page)) - __lock_page(page); + + folio = page_folio(page); + if (!folio_trylock(folio)) + __folio_lock(folio); } -/* - * lock_page_killable is like lock_page but can be interrupted by fatal - * signals. It returns 0 if it locked the page and -EINTR if it was - * killed while waiting. - */ -static inline int lock_page_killable(struct page *page) +static inline int folio_lock_killable(struct folio *folio) { might_sleep(); - if (!trylock_page(page)) - return __lock_page_killable(page); + if (!folio_trylock(folio)) + return __folio_lock_killable(folio); return 0; } /* - * lock_page_async - Lock the page, unless this would block. If the page - * is already locked, then queue a callback when the page becomes unlocked. - * This callback can then retry the operation. - * - * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page - * was already locked and the callback defined in 'wait' was queued. + * lock_page_killable is like lock_page but can be interrupted by fatal + * signals. It returns 0 if it locked the page and -EINTR if it was + * killed while waiting. */ -static inline int lock_page_async(struct page *page, - struct wait_page_queue *wait) +static inline int lock_page_killable(struct page *page) { - if (!trylock_page(page)) - return __lock_page_async(page, wait); - return 0; + return folio_lock_killable(page_folio(page)); } /* @@ -659,78 +720,108 @@ static inline int lock_page_async(struct page *page, * caller indicated that it can handle a retry. * * Return value and mmap_lock implications depend on flags; see - * __lock_page_or_retry(). + * __folio_lock_or_retry(). */ -static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm, +static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm, unsigned int flags) { + struct folio *folio; might_sleep(); - return trylock_page(page) || __lock_page_or_retry(page, mm, flags); + + folio = page_folio(page); + return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags); } /* - * This is exported only for wait_on_page_locked/wait_on_page_writeback, etc., + * This is exported only for folio_wait_locked/folio_wait_writeback, etc., * and should not be used directly. */ -extern void wait_on_page_bit(struct page *page, int bit_nr); -extern int wait_on_page_bit_killable(struct page *page, int bit_nr); +void folio_wait_bit(struct folio *folio, int bit_nr); +int folio_wait_bit_killable(struct folio *folio, int bit_nr); /* - * Wait for a page to be unlocked. + * Wait for a folio to be unlocked. * - * This must be called with the caller "holding" the page, - * ie with increased "page->count" so that the page won't + * This must be called with the caller "holding" the folio, + * ie with increased "page->count" so that the folio won't * go away during the wait.. */ +static inline void folio_wait_locked(struct folio *folio) +{ + if (folio_test_locked(folio)) + folio_wait_bit(folio, PG_locked); +} + +static inline int folio_wait_locked_killable(struct folio *folio) +{ + if (!folio_test_locked(folio)) + return 0; + return folio_wait_bit_killable(folio, PG_locked); +} + static inline void wait_on_page_locked(struct page *page) { - if (PageLocked(page)) - wait_on_page_bit(compound_head(page), PG_locked); + folio_wait_locked(page_folio(page)); } static inline int wait_on_page_locked_killable(struct page *page) { - if (!PageLocked(page)) - return 0; - return wait_on_page_bit_killable(compound_head(page), PG_locked); + return folio_wait_locked_killable(page_folio(page)); } int put_and_wait_on_page_locked(struct page *page, int state); void wait_on_page_writeback(struct page *page); -int wait_on_page_writeback_killable(struct page *page); -extern void end_page_writeback(struct page *page); +void folio_wait_writeback(struct folio *folio); +int folio_wait_writeback_killable(struct folio *folio); +void end_page_writeback(struct page *page); +void folio_end_writeback(struct folio *folio); void wait_for_stable_page(struct page *page); +void folio_wait_stable(struct folio *folio); +void __folio_mark_dirty(struct folio *folio, struct address_space *, int warn); +static inline void __set_page_dirty(struct page *page, + struct address_space *mapping, int warn) +{ + __folio_mark_dirty(page_folio(page), mapping, warn); +} +void folio_account_cleaned(struct folio *folio, struct address_space *mapping, + struct bdi_writeback *wb); +static inline void account_page_cleaned(struct page *page, + struct address_space *mapping, struct bdi_writeback *wb) +{ + return folio_account_cleaned(page_folio(page), mapping, wb); +} +void __folio_cancel_dirty(struct folio *folio); +static inline void folio_cancel_dirty(struct folio *folio) +{ + /* Avoid atomic ops, locking, etc. when not actually needed. */ + if (folio_test_dirty(folio)) + __folio_cancel_dirty(folio); +} +static inline void cancel_dirty_page(struct page *page) +{ + folio_cancel_dirty(page_folio(page)); +} +bool folio_clear_dirty_for_io(struct folio *folio); +bool clear_page_dirty_for_io(struct page *page); +int __must_check folio_write_one(struct folio *folio); +static inline int __must_check write_one_page(struct page *page) +{ + return folio_write_one(page_folio(page)); +} -void __set_page_dirty(struct page *, struct address_space *, int warn); int __set_page_dirty_nobuffers(struct page *page); int __set_page_dirty_no_writeback(struct page *page); void page_endio(struct page *page, bool is_write, int err); -/** - * set_page_private_2 - Set PG_private_2 on a page and take a ref - * @page: The page. - * - * Set the PG_private_2 flag on a page and take the reference needed for the VM - * to handle its lifetime correctly. This sets the flag and takes the - * reference unconditionally, so care must be taken not to set the flag again - * if it's already set. - */ -static inline void set_page_private_2(struct page *page) -{ - page = compound_head(page); - get_page(page); - SetPagePrivate2(page); -} - -void end_page_private_2(struct page *page); -void wait_on_page_private_2(struct page *page); -int wait_on_page_private_2_killable(struct page *page); +void folio_end_private_2(struct folio *folio); +void folio_wait_private_2(struct folio *folio); +int folio_wait_private_2_killable(struct folio *folio); /* * Add an arbitrary waiter to a page's wait queue */ -extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter); +void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter); /* * Fault everything in given userspace address range in. @@ -790,9 +881,11 @@ static inline int fault_in_pages_readable(const char __user *uaddr, size_t size) } int add_to_page_cache_locked(struct page *page, struct address_space *mapping, - pgoff_t index, gfp_t gfp_mask); + pgoff_t index, gfp_t gfp); int add_to_page_cache_lru(struct page *page, struct address_space *mapping, - pgoff_t index, gfp_t gfp_mask); + pgoff_t index, gfp_t gfp); +int filemap_add_folio(struct address_space *mapping, struct folio *folio, + pgoff_t index, gfp_t gfp); extern void delete_from_page_cache(struct page *page); extern void __delete_from_page_cache(struct page *page, void *shadow); void replace_page_cache_page(struct page *old, struct page *new); @@ -817,6 +910,10 @@ static inline int add_to_page_cache(struct page *page, return error; } +/* Must be non-static for BPF error injection */ +int __filemap_add_folio(struct address_space *mapping, struct folio *folio, + pgoff_t index, gfp_t gfp, void **shadowp); + /** * struct readahead_control - Describes a readahead request. * @@ -906,33 +1003,57 @@ void page_cache_async_readahead(struct address_space *mapping, page_cache_async_ra(&ractl, page, req_count); } +static inline struct folio *__readahead_folio(struct readahead_control *ractl) +{ + struct folio *folio; + + BUG_ON(ractl->_batch_count > ractl->_nr_pages); + ractl->_nr_pages -= ractl->_batch_count; + ractl->_index += ractl->_batch_count; + + if (!ractl->_nr_pages) { + ractl->_batch_count = 0; + return NULL; + } + + folio = xa_load(&ractl->mapping->i_pages, ractl->_index); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + ractl->_batch_count = folio_nr_pages(folio); + + return folio; +} + /** * readahead_page - Get the next page to read. - * @rac: The current readahead request. + * @ractl: The current readahead request. * * Context: The page is locked and has an elevated refcount. The caller * should decreases the refcount once the page has been submitted for I/O * and unlock the page once all I/O to that page has completed. * Return: A pointer to the next page, or %NULL if we are done. */ -static inline struct page *readahead_page(struct readahead_control *rac) +static inline struct page *readahead_page(struct readahead_control *ractl) { - struct page *page; - - BUG_ON(rac->_batch_count > rac->_nr_pages); - rac->_nr_pages -= rac->_batch_count; - rac->_index += rac->_batch_count; + struct folio *folio = __readahead_folio(ractl); - if (!rac->_nr_pages) { - rac->_batch_count = 0; - return NULL; - } + return &folio->page; +} - page = xa_load(&rac->mapping->i_pages, rac->_index); - VM_BUG_ON_PAGE(!PageLocked(page), page); - rac->_batch_count = thp_nr_pages(page); +/** + * readahead_folio - Get the next folio to read. + * @ractl: The current readahead request. + * + * Context: The folio is locked. The caller should unlock the folio once + * all I/O to that folio has completed. + * Return: A pointer to the next folio, or %NULL if we are done. + */ +static inline struct folio *readahead_folio(struct readahead_control *ractl) +{ + struct folio *folio = __readahead_folio(ractl); - return page; + if (folio) + folio_put(folio); + return folio; } static inline unsigned int __readahead_batch(struct readahead_control *rac, @@ -1040,6 +1161,34 @@ static inline unsigned long dir_pages(struct inode *inode) } /** + * folio_mkwrite_check_truncate - check if folio was truncated + * @folio: the folio to check + * @inode: the inode to check the folio against + * + * Return: the number of bytes in the folio up to EOF, + * or -EFAULT if the folio was truncated. + */ +static inline ssize_t folio_mkwrite_check_truncate(struct folio *folio, + struct inode *inode) +{ + loff_t size = i_size_read(inode); + pgoff_t index = size >> PAGE_SHIFT; + size_t offset = offset_in_folio(folio, size); + + if (!folio->mapping) + return -EFAULT; + + /* folio is wholly inside EOF */ + if (folio_next_index(folio) - 1 < index) + return folio_size(folio); + /* folio is wholly past EOF */ + if (folio->index > index || !offset) + return -EFAULT; + /* folio is partially inside EOF */ + return offset; +} + +/** * page_mkwrite_check_truncate - check if page was truncated * @page: the page to check * @inode: the inode to check the page against @@ -1068,19 +1217,25 @@ static inline int page_mkwrite_check_truncate(struct page *page, } /** - * i_blocks_per_page - How many blocks fit in this page. + * i_blocks_per_folio - How many blocks fit in this folio. * @inode: The inode which contains the blocks. - * @page: The page (head page if the page is a THP). + * @folio: The folio. * - * If the block size is larger than the size of this page, return zero. + * If the block size is larger than the size of this folio, return zero. * - * Context: The caller should hold a refcount on the page to prevent it + * Context: The caller should hold a refcount on the folio to prevent it * from being split. - * Return: The number of filesystem blocks covered by this page. + * Return: The number of filesystem blocks covered by this folio. */ static inline +unsigned int i_blocks_per_folio(struct inode *inode, struct folio *folio) +{ + return folio_size(folio) >> inode->i_blkbits; +} + +static inline unsigned int i_blocks_per_page(struct inode *inode, struct page *page) { - return thp_size(page) >> inode->i_blkbits; + return i_blocks_per_folio(inode, page_folio(page)); } #endif /* _LINUX_PAGEMAP_H */ diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index ae16a9856305..b31d3f3312ce 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -267,6 +267,28 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref) } /** + * percpu_ref_tryget_live_rcu - same as percpu_ref_tryget_live() but the + * caller is responsible for taking RCU. + * + * This function is safe to call as long as @ref is between init and exit. + */ +static inline bool percpu_ref_tryget_live_rcu(struct percpu_ref *ref) +{ + unsigned long __percpu *percpu_count; + bool ret = false; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + if (likely(__ref_is_percpu(ref, &percpu_count))) { + this_cpu_inc(*percpu_count); + ret = true; + } else if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)) { + ret = atomic_long_inc_not_zero(&ref->data->count); + } + return ret; +} + +/** * percpu_ref_tryget_live - try to increment a live percpu refcount * @ref: percpu_ref to try-get * @@ -283,20 +305,11 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref) */ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref) { - unsigned long __percpu *percpu_count; bool ret = false; rcu_read_lock(); - - if (__ref_is_percpu(ref, &percpu_count)) { - this_cpu_inc(*percpu_count); - ret = true; - } else if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)) { - ret = atomic_long_inc_not_zero(&ref->data->count); - } - + ret = percpu_ref_tryget_live_rcu(ref); rcu_read_unlock(); - return ret; } diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c976cc6de257..e704b1a4c06c 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -235,7 +235,7 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); * * returns the number of cleaned PTEs. */ -int page_mkclean(struct page *); +int folio_mkclean(struct folio *); /* * called in munlock()/munmap() path to check for other vmas holding @@ -295,12 +295,14 @@ static inline void try_to_unmap(struct page *page, enum ttu_flags flags) { } -static inline int page_mkclean(struct page *page) +static inline int folio_mkclean(struct folio *folio) { return 0; } - - #endif /* CONFIG_MMU */ +static inline int page_mkclean(struct page *page) +{ + return folio_mkclean(page_folio(page)); +} #endif /* _LINUX_RMAP_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index c1a927ddec64..e0454e60fe8f 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1160,10 +1160,8 @@ struct task_struct { /* Stacked block device info: */ struct bio_list *bio_list; -#ifdef CONFIG_BLOCK /* Stack plugging: */ struct blk_plug *plug; -#endif /* VM state: */ struct reclaim_state *reclaim_state; diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 21c3771e6a56..988528b5da43 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -23,7 +23,7 @@ static inline bool page_is_secretmem(struct page *page) mapping = (struct address_space *) ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); - if (mapping != page->mapping) + if (!mapping || mapping != page->mapping) return false; return mapping->a_ops == &secretmem_aops; diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index 14ab0c0bc924..1ce9a9eb223b 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -128,6 +128,7 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, struct sk_msg *msg, u32 bytes); int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, int len, int flags); +bool sk_msg_is_readable(struct sock *sk); static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes) { diff --git a/include/linux/swap.h b/include/linux/swap.h index ba52f3a3478e..cdf0957a88a4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -320,11 +320,17 @@ struct vma_swap_readahead { #endif }; +static inline swp_entry_t folio_swap_entry(struct folio *folio) +{ + swp_entry_t entry = { .val = page_private(&folio->page) }; + return entry; +} + /* linux/mm/workingset.c */ void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages); void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg); -void workingset_refault(struct page *page, void *shadow); -void workingset_activation(struct page *page); +void workingset_refault(struct folio *folio, void *shadow); +void workingset_activation(struct folio *folio); /* Only track the nodes of mappings with shadow entries */ void workingset_update_node(struct xa_node *node); @@ -344,9 +350,11 @@ extern unsigned long nr_free_buffer_pages(void); /* linux/mm/swap.c */ extern void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages); -extern void lru_note_cost_page(struct page *); +extern void lru_note_cost_folio(struct folio *); +extern void folio_add_lru(struct folio *); extern void lru_cache_add(struct page *); -extern void mark_page_accessed(struct page *); +void mark_page_accessed(struct page *); +void folio_mark_accessed(struct folio *); extern atomic_t lru_disable_count; @@ -365,7 +373,6 @@ extern void lru_add_drain(void); extern void lru_add_drain_cpu(int cpu); extern void lru_add_drain_cpu_zone(struct zone *zone); extern void lru_add_drain_all(void); -extern void rotate_reclaimable_page(struct page *page); extern void deactivate_file_page(struct page *page); extern void deactivate_page(struct page *page); extern void mark_page_lazyfree(struct page *page); diff --git a/include/linux/tpm.h b/include/linux/tpm.h index aa11fe323c56..12d827734686 100644 --- a/include/linux/tpm.h +++ b/include/linux/tpm.h @@ -269,6 +269,7 @@ enum tpm2_cc_attrs { #define TPM_VID_INTEL 0x8086 #define TPM_VID_WINBOND 0x1050 #define TPM_VID_STM 0x104A +#define TPM_VID_ATML 0x1114 enum tpm_chip_flags { TPM_CHIP_FLAG_TPM2 = BIT(1), diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index a9f9c5714e65..fe95f0922526 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -16,23 +16,8 @@ * When function tracing occurs, the following steps are made: * If arch does not support a ftrace feature: * call internal function (uses INTERNAL bits) which calls... - * If callback is registered to the "global" list, the list - * function is called and recursion checks the GLOBAL bits. - * then this function calls... * The function callback, which can use the FTRACE bits to * check for recursion. - * - * Now if the arch does not support a feature, and it calls - * the global list function which calls the ftrace callback - * all three of these steps will do a recursion protection. - * There's no reason to do one if the previous caller already - * did. The recursion that we are protecting against will - * go through the same steps again. - * - * To prevent the multiple recursion checks, if a recursion - * bit is set that is higher than the MAX bit of the current - * check, then we know that the check was made by the previous - * caller, and we can skip the current check. */ enum { /* Function recursion bits */ @@ -40,12 +25,14 @@ enum { TRACE_FTRACE_NMI_BIT, TRACE_FTRACE_IRQ_BIT, TRACE_FTRACE_SIRQ_BIT, + TRACE_FTRACE_TRANSITION_BIT, - /* INTERNAL_BITs must be greater than FTRACE_BITs */ + /* Internal use recursion bits */ TRACE_INTERNAL_BIT, TRACE_INTERNAL_NMI_BIT, TRACE_INTERNAL_IRQ_BIT, TRACE_INTERNAL_SIRQ_BIT, + TRACE_INTERNAL_TRANSITION_BIT, TRACE_BRANCH_BIT, /* @@ -86,12 +73,6 @@ enum { */ TRACE_GRAPH_NOTRACE_BIT, - /* - * When transitioning between context, the preempt_count() may - * not be correct. Allow for a single recursion to cover this case. - */ - TRACE_TRANSITION_BIT, - /* Used to prevent recursion recording from recursing. */ TRACE_RECORD_RECURSION_BIT, }; @@ -113,12 +94,10 @@ enum { #define TRACE_CONTEXT_BITS 4 #define TRACE_FTRACE_START TRACE_FTRACE_BIT -#define TRACE_FTRACE_MAX ((1 << (TRACE_FTRACE_START + TRACE_CONTEXT_BITS)) - 1) #define TRACE_LIST_START TRACE_INTERNAL_BIT -#define TRACE_LIST_MAX ((1 << (TRACE_LIST_START + TRACE_CONTEXT_BITS)) - 1) -#define TRACE_CONTEXT_MASK TRACE_LIST_MAX +#define TRACE_CONTEXT_MASK ((1 << (TRACE_LIST_START + TRACE_CONTEXT_BITS)) - 1) /* * Used for setting context @@ -132,6 +111,7 @@ enum { TRACE_CTX_IRQ, TRACE_CTX_SOFTIRQ, TRACE_CTX_NORMAL, + TRACE_CTX_TRANSITION, }; static __always_inline int trace_get_context_bit(void) @@ -160,45 +140,34 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); #endif static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsigned long pip, - int start, int max) + int start) { unsigned int val = READ_ONCE(current->trace_recursion); int bit; - /* A previous recursion check was made */ - if ((val & TRACE_CONTEXT_MASK) > max) - return 0; - bit = trace_get_context_bit() + start; if (unlikely(val & (1 << bit))) { /* * It could be that preempt_count has not been updated during * a switch between contexts. Allow for a single recursion. */ - bit = TRACE_TRANSITION_BIT; + bit = TRACE_CTX_TRANSITION + start; if (val & (1 << bit)) { do_ftrace_record_recursion(ip, pip); return -1; } - } else { - /* Normal check passed, clear the transition to allow it again */ - val &= ~(1 << TRACE_TRANSITION_BIT); } val |= 1 << bit; current->trace_recursion = val; barrier(); - return bit + 1; + return bit; } static __always_inline void trace_clear_recursion(int bit) { - if (!bit) - return; - barrier(); - bit--; trace_recursion_clear(bit); } @@ -214,7 +183,7 @@ static __always_inline void trace_clear_recursion(int bit) static __always_inline int ftrace_test_recursion_trylock(unsigned long ip, unsigned long parent_ip) { - return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START, TRACE_FTRACE_MAX); + return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START); } /** diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index eb70cabe6e7f..33a4240e6a6f 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -127,6 +127,8 @@ static inline long get_ucounts_value(struct ucounts *ucounts, enum ucount_type t long inc_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v); bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v); +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type); +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type); bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max); static inline void set_rlimit_ucount_max(struct user_namespace *ns, diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index d6a6cf53b127..bfe38869498d 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -415,6 +415,78 @@ static inline void drain_zonestat(struct zone *zone, struct per_cpu_zonestat *pzstats) { } #endif /* CONFIG_SMP */ +static inline void __zone_stat_mod_folio(struct folio *folio, + enum zone_stat_item item, long nr) +{ + __mod_zone_page_state(folio_zone(folio), item, nr); +} + +static inline void __zone_stat_add_folio(struct folio *folio, + enum zone_stat_item item) +{ + __mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio)); +} + +static inline void __zone_stat_sub_folio(struct folio *folio, + enum zone_stat_item item) +{ + __mod_zone_page_state(folio_zone(folio), item, -folio_nr_pages(folio)); +} + +static inline void zone_stat_mod_folio(struct folio *folio, + enum zone_stat_item item, long nr) +{ + mod_zone_page_state(folio_zone(folio), item, nr); +} + +static inline void zone_stat_add_folio(struct folio *folio, + enum zone_stat_item item) +{ + mod_zone_page_state(folio_zone(folio), item, folio_nr_pages(folio)); +} + +static inline void zone_stat_sub_folio(struct folio *folio, + enum zone_stat_item item) +{ + mod_zone_page_state(folio_zone(folio), item, -folio_nr_pages(folio)); +} + +static inline void __node_stat_mod_folio(struct folio *folio, + enum node_stat_item item, long nr) +{ + __mod_node_page_state(folio_pgdat(folio), item, nr); +} + +static inline void __node_stat_add_folio(struct folio *folio, + enum node_stat_item item) +{ + __mod_node_page_state(folio_pgdat(folio), item, folio_nr_pages(folio)); +} + +static inline void __node_stat_sub_folio(struct folio *folio, + enum node_stat_item item) +{ + __mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio)); +} + +static inline void node_stat_mod_folio(struct folio *folio, + enum node_stat_item item, long nr) +{ + mod_node_page_state(folio_pgdat(folio), item, nr); +} + +static inline void node_stat_add_folio(struct folio *folio, + enum node_stat_item item) +{ + mod_node_page_state(folio_pgdat(folio), item, folio_nr_pages(folio)); +} + +static inline void node_stat_sub_folio(struct folio *folio, + enum node_stat_item item) +{ + mod_node_page_state(folio_pgdat(folio), item, -folio_nr_pages(folio)); +} + static inline void __mod_zone_freepage_state(struct zone *zone, int nr_pages, int migratetype) { @@ -525,12 +597,6 @@ static inline void mod_lruvec_page_state(struct page *page, #endif /* CONFIG_MEMCG */ -static inline void inc_lruvec_state(struct lruvec *lruvec, - enum node_stat_item idx) -{ - mod_lruvec_state(lruvec, idx, 1); -} - static inline void __inc_lruvec_page_state(struct page *page, enum node_stat_item idx) { @@ -543,6 +609,24 @@ static inline void __dec_lruvec_page_state(struct page *page, __mod_lruvec_page_state(page, idx, -1); } +static inline void __lruvec_stat_mod_folio(struct folio *folio, + enum node_stat_item idx, int val) +{ + __mod_lruvec_page_state(&folio->page, idx, val); +} + +static inline void __lruvec_stat_add_folio(struct folio *folio, + enum node_stat_item idx) +{ + __lruvec_stat_mod_folio(folio, idx, folio_nr_pages(folio)); +} + +static inline void __lruvec_stat_sub_folio(struct folio *folio, + enum node_stat_item idx) +{ + __lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio)); +} + static inline void inc_lruvec_page_state(struct page *page, enum node_stat_item idx) { @@ -555,4 +639,21 @@ static inline void dec_lruvec_page_state(struct page *page, mod_lruvec_page_state(page, idx, -1); } +static inline void lruvec_stat_mod_folio(struct folio *folio, + enum node_stat_item idx, int val) +{ + mod_lruvec_page_state(&folio->page, idx, val); +} + +static inline void lruvec_stat_add_folio(struct folio *folio, + enum node_stat_item idx) +{ + lruvec_stat_mod_folio(folio, idx, folio_nr_pages(folio)); +} + +static inline void lruvec_stat_sub_folio(struct folio *folio, + enum node_stat_item idx) +{ + lruvec_stat_mod_folio(folio, idx, -folio_nr_pages(folio)); +} #endif /* _LINUX_VMSTAT_H */ diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 8eb165760752..3bfd487d1dd2 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -389,7 +389,14 @@ void writeback_set_ratelimit(void); void tag_pages_for_writeback(struct address_space *mapping, pgoff_t start, pgoff_t end); -void account_page_redirty(struct page *page); +bool filemap_dirty_folio(struct address_space *mapping, struct folio *folio); +void folio_account_redirty(struct folio *folio); +static inline void account_page_redirty(struct page *page) +{ + folio_account_redirty(page_folio(page)); +} +bool folio_redirty_for_writepage(struct writeback_control *, struct folio *); +bool redirty_page_for_writepage(struct writeback_control *, struct page *); void sb_mark_inode_writeback(struct inode *inode); void sb_clear_inode_writeback(struct inode *inode); diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h index 62dd8422e0dc..27336fc70467 100644 --- a/include/net/cfg80211.h +++ b/include/net/cfg80211.h @@ -5376,7 +5376,6 @@ static inline void wiphy_unlock(struct wiphy *wiphy) * netdev and may otherwise be used by driver read-only, will be update * by cfg80211 on change_interface * @mgmt_registrations: list of registrations for management frames - * @mgmt_registrations_lock: lock for the list * @mgmt_registrations_need_update: mgmt registrations were updated, * need to propagate the update to the driver * @mtx: mutex used to lock data in this struct, may be used by drivers @@ -5423,7 +5422,6 @@ struct wireless_dev { u32 identifier; struct list_head mgmt_registrations; - spinlock_t mgmt_registrations_lock; u8 mgmt_registrations_need_update:1; struct mutex mtx; diff --git a/include/net/mctp.h b/include/net/mctp.h index a824d47c3c6d..ffd2c23bd76d 100644 --- a/include/net/mctp.h +++ b/include/net/mctp.h @@ -54,7 +54,7 @@ struct mctp_sock { struct sock sk; /* bind() params */ - int bind_net; + unsigned int bind_net; mctp_eid_t bind_addr; __u8 bind_type; diff --git a/include/net/mptcp.h b/include/net/mptcp.h index 6026bbefbffd..3214848402ec 100644 --- a/include/net/mptcp.h +++ b/include/net/mptcp.h @@ -69,6 +69,10 @@ struct mptcp_out_options { struct { u64 sndr_key; u64 rcvr_key; + u64 data_seq; + u32 subflow_seq; + u16 data_len; + __sum16 csum; }; struct { struct mptcp_addr_info addr; diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h index 2eb6d7c2c931..f37c7a558d6d 100644 --- a/include/net/sctp/sm.h +++ b/include/net/sctp/sm.h @@ -384,11 +384,11 @@ sctp_vtag_verify(const struct sctp_chunk *chunk, * Verification Tag value does not match the receiver's own * tag value, the receiver shall silently discard the packet... */ - if (ntohl(chunk->sctp_hdr->vtag) == asoc->c.my_vtag) - return 1; + if (ntohl(chunk->sctp_hdr->vtag) != asoc->c.my_vtag) + return 0; chunk->transport->encap_port = SCTP_INPUT_CB(chunk->skb)->encap_port; - return 0; + return 1; } /* Check VTAG of the packet matches the sender's own tag and the T bit is diff --git a/include/net/sock.h b/include/net/sock.h index ea6fbc88c8f9..463f390d90b3 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1208,7 +1208,7 @@ struct proto { #endif bool (*stream_memory_free)(const struct sock *sk, int wake); - bool (*stream_memory_read)(const struct sock *sk); + bool (*sock_is_readable)(struct sock *sk); /* Memory pressure */ void (*enter_memory_pressure)(struct sock *sk); void (*leave_memory_pressure)(struct sock *sk); @@ -2820,4 +2820,10 @@ void sock_set_sndtimeo(struct sock *sk, s64 secs); int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len); +static inline bool sk_is_readable(struct sock *sk) +{ + if (sk->sk_prot->sock_is_readable) + return sk->sk_prot->sock_is_readable(sk); + return false; +} #endif /* _SOCK_H */ diff --git a/include/net/tcp.h b/include/net/tcp.h index 3166dc15d7d6..60c384569e9c 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1576,6 +1576,7 @@ struct tcp_md5sig_key { u8 keylen; u8 family; /* AF_INET or AF_INET6 */ u8 prefixlen; + u8 flags; union tcp_md5_addr addr; int l3index; /* set if key added with L3 scope */ u8 key[TCP_MD5SIG_MAXKEYLEN]; @@ -1621,10 +1622,10 @@ struct tcp_md5sig_pool { int tcp_v4_md5_hash_skb(char *md5_hash, const struct tcp_md5sig_key *key, const struct sock *sk, const struct sk_buff *skb); int tcp_md5_do_add(struct sock *sk, const union tcp_md5_addr *addr, - int family, u8 prefixlen, int l3index, + int family, u8 prefixlen, int l3index, u8 flags, const u8 *newkey, u8 newkeylen, gfp_t gfp); int tcp_md5_do_del(struct sock *sk, const union tcp_md5_addr *addr, - int family, u8 prefixlen, int l3index); + int family, u8 prefixlen, int l3index, u8 flags); struct tcp_md5sig_key *tcp_v4_md5_lookup(const struct sock *sk, const struct sock *addr_sk); diff --git a/include/net/tls.h b/include/net/tls.h index be4b3e1cac46..1fffb206f09f 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -358,6 +358,7 @@ int tls_sk_query(struct sock *sk, int optname, char __user *optval, int __user *optlen); int tls_sk_attach(struct sock *sk, int optname, char __user *optval, unsigned int optlen); +void tls_err_abort(struct sock *sk, int err); int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx); void tls_sw_strparser_arm(struct sock *sk, struct tls_context *ctx); @@ -375,7 +376,7 @@ void tls_sw_release_resources_rx(struct sock *sk); void tls_sw_free_ctx_rx(struct tls_context *tls_ctx); int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len); -bool tls_sw_stream_read(const struct sock *sk); +bool tls_sw_sock_is_readable(struct sock *sk); ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags); @@ -466,12 +467,6 @@ static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk) #endif } -static inline void tls_err_abort(struct sock *sk, int err) -{ - sk->sk_err = err; - sk_error_report(sk); -} - static inline bool tls_bigint_increment(unsigned char *seq, int len) { int i; @@ -512,7 +507,7 @@ static inline void tls_advance_record_sn(struct sock *sk, struct cipher_context *ctx) { if (tls_bigint_increment(ctx->rec_seq, prot->rec_seq_size)) - tls_err_abort(sk, EBADMSG); + tls_err_abort(sk, -EBADMSG); if (prot->version != TLS_1_3_VERSION && prot->cipher_type != TLS_CIPHER_CHACHA20_POLY1305) diff --git a/include/net/udp.h b/include/net/udp.h index 360df454356c..909ecf447e0f 100644 --- a/include/net/udp.h +++ b/include/net/udp.h @@ -494,8 +494,9 @@ static inline struct sk_buff *udp_rcv_segment(struct sock *sk, * CHECKSUM_NONE in __udp_gso_segment. UDP GRO indeed builds partial * packets in udp_gro_complete_segment. As does UDP GSO, verified by * udp_send_skb. But when those packets are looped in dev_loopback_xmit - * their ip_summed is set to CHECKSUM_UNNECESSARY. Reset in this - * specific case, where PARTIAL is both correct and required. + * their ip_summed CHECKSUM_NONE is changed to CHECKSUM_UNNECESSARY. + * Reset in this specific case, where PARTIAL is both correct and + * required. */ if (skb->pkt_type == PACKET_LOOPBACK) skb->ip_summed = CHECKSUM_PARTIAL; diff --git a/include/trace/events/block.h b/include/trace/events/block.h index cc5ab96a7471..a95daa4d4caa 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -114,7 +114,7 @@ TRACE_EVENT(block_rq_requeue, */ TRACE_EVENT(block_rq_complete, - TP_PROTO(struct request *rq, int error, unsigned int nr_bytes), + TP_PROTO(struct request *rq, blk_status_t error, unsigned int nr_bytes), TP_ARGS(rq, error, nr_bytes), @@ -122,7 +122,7 @@ TRACE_EVENT(block_rq_complete, __field( dev_t, dev ) __field( sector_t, sector ) __field( unsigned int, nr_sector ) - __field( int, error ) + __field( int , error ) __array( char, rwbs, RWBS_LEN ) __dynamic_array( char, cmd, 1 ) ), @@ -131,7 +131,7 @@ TRACE_EVENT(block_rq_complete, __entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : 0; __entry->sector = blk_rq_pos(rq); __entry->nr_sector = nr_bytes >> 9; - __entry->error = error; + __entry->error = blk_status_to_errno(error); blk_fill_rwbs(__entry->rwbs, rq->cmd_flags); __get_str(cmd)[0] = '\0'; diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h index 0dd30de00e5b..7346f0164cf4 100644 --- a/include/trace/events/io_uring.h +++ b/include/trace/events/io_uring.h @@ -6,6 +6,7 @@ #define _TRACE_IO_URING_H #include <linux/tracepoint.h> +#include <uapi/linux/io_uring.h> struct io_wq_work; @@ -497,6 +498,66 @@ TRACE_EVENT(io_uring_task_run, (unsigned long long) __entry->user_data) ); +/* + * io_uring_req_failed - called when an sqe is errored dring submission + * + * @sqe: pointer to the io_uring_sqe that failed + * @error: error it failed with + * + * Allows easier diagnosing of malformed requests in production systems. + */ +TRACE_EVENT(io_uring_req_failed, + + TP_PROTO(const struct io_uring_sqe *sqe, int error), + + TP_ARGS(sqe, error), + + TP_STRUCT__entry ( + __field( u8, opcode ) + __field( u8, flags ) + __field( u8, ioprio ) + __field( u64, off ) + __field( u64, addr ) + __field( u32, len ) + __field( u32, op_flags ) + __field( u64, user_data ) + __field( u16, buf_index ) + __field( u16, personality ) + __field( u32, file_index ) + __field( u64, pad1 ) + __field( u64, pad2 ) + __field( int, error ) + ), + + TP_fast_assign( + __entry->opcode = sqe->opcode; + __entry->flags = sqe->flags; + __entry->ioprio = sqe->ioprio; + __entry->off = sqe->off; + __entry->addr = sqe->addr; + __entry->len = sqe->len; + __entry->op_flags = sqe->rw_flags; + __entry->user_data = sqe->user_data; + __entry->buf_index = sqe->buf_index; + __entry->personality = sqe->personality; + __entry->file_index = sqe->file_index; + __entry->pad1 = sqe->__pad2[0]; + __entry->pad2 = sqe->__pad2[1]; + __entry->error = error; + ), + + TP_printk("op %d, flags=0x%x, prio=%d, off=%llu, addr=%llu, " + "len=%u, rw_flags=0x%x, user_data=0x%llx, buf_index=%d, " + "personality=%d, file_index=%d, pad=0x%llx/%llx, error=%d", + __entry->opcode, __entry->flags, __entry->ioprio, + (unsigned long long)__entry->off, + (unsigned long long) __entry->addr, __entry->len, + __entry->op_flags, (unsigned long long) __entry->user_data, + __entry->buf_index, __entry->personality, __entry->file_index, + (unsigned long long) __entry->pad1, + (unsigned long long) __entry->pad2, __entry->error) +); + #endif /* _TRACE_IO_URING_H */ /* This part must be outside protection */ diff --git a/include/trace/events/pagemap.h b/include/trace/events/pagemap.h index 1d28431e85bd..171524d3526d 100644 --- a/include/trace/events/pagemap.h +++ b/include/trace/events/pagemap.h @@ -16,38 +16,38 @@ #define PAGEMAP_MAPPEDDISK 0x0020u #define PAGEMAP_BUFFERS 0x0040u -#define trace_pagemap_flags(page) ( \ - (PageAnon(page) ? PAGEMAP_ANONYMOUS : PAGEMAP_FILE) | \ - (page_mapped(page) ? PAGEMAP_MAPPED : 0) | \ - (PageSwapCache(page) ? PAGEMAP_SWAPCACHE : 0) | \ - (PageSwapBacked(page) ? PAGEMAP_SWAPBACKED : 0) | \ - (PageMappedToDisk(page) ? PAGEMAP_MAPPEDDISK : 0) | \ - (page_has_private(page) ? PAGEMAP_BUFFERS : 0) \ +#define trace_pagemap_flags(folio) ( \ + (folio_test_anon(folio) ? PAGEMAP_ANONYMOUS : PAGEMAP_FILE) | \ + (folio_mapped(folio) ? PAGEMAP_MAPPED : 0) | \ + (folio_test_swapcache(folio) ? PAGEMAP_SWAPCACHE : 0) | \ + (folio_test_swapbacked(folio) ? PAGEMAP_SWAPBACKED : 0) | \ + (folio_test_mappedtodisk(folio) ? PAGEMAP_MAPPEDDISK : 0) | \ + (folio_test_private(folio) ? PAGEMAP_BUFFERS : 0) \ ) TRACE_EVENT(mm_lru_insertion, - TP_PROTO(struct page *page), + TP_PROTO(struct folio *folio), - TP_ARGS(page), + TP_ARGS(folio), TP_STRUCT__entry( - __field(struct page *, page ) + __field(struct folio *, folio ) __field(unsigned long, pfn ) __field(enum lru_list, lru ) __field(unsigned long, flags ) ), TP_fast_assign( - __entry->page = page; - __entry->pfn = page_to_pfn(page); - __entry->lru = page_lru(page); - __entry->flags = trace_pagemap_flags(page); + __entry->folio = folio; + __entry->pfn = folio_pfn(folio); + __entry->lru = folio_lru_list(folio); + __entry->flags = trace_pagemap_flags(folio); ), /* Flag format is based on page-types.c formatting for pagemap */ - TP_printk("page=%p pfn=0x%lx lru=%d flags=%s%s%s%s%s%s", - __entry->page, + TP_printk("folio=%p pfn=0x%lx lru=%d flags=%s%s%s%s%s%s", + __entry->folio, __entry->pfn, __entry->lru, __entry->flags & PAGEMAP_MAPPED ? "M" : " ", @@ -60,23 +60,21 @@ TRACE_EVENT(mm_lru_insertion, TRACE_EVENT(mm_lru_activate, - TP_PROTO(struct page *page), + TP_PROTO(struct folio *folio), - TP_ARGS(page), + TP_ARGS(folio), TP_STRUCT__entry( - __field(struct page *, page ) + __field(struct folio *, folio ) __field(unsigned long, pfn ) ), TP_fast_assign( - __entry->page = page; - __entry->pfn = page_to_pfn(page); + __entry->folio = folio; + __entry->pfn = folio_pfn(folio); ), - /* Flag format is based on page-types.c formatting for pagemap */ - TP_printk("page=%p pfn=0x%lx", __entry->page, __entry->pfn) - + TP_printk("folio=%p pfn=0x%lx", __entry->folio, __entry->pfn) ); #endif /* _TRACE_PAGEMAP_H */ diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 840d1ba84cf5..7dccb66474f7 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -52,11 +52,11 @@ WB_WORK_REASON struct wb_writeback_work; -DECLARE_EVENT_CLASS(writeback_page_template, +DECLARE_EVENT_CLASS(writeback_folio_template, - TP_PROTO(struct page *page, struct address_space *mapping), + TP_PROTO(struct folio *folio, struct address_space *mapping), - TP_ARGS(page, mapping), + TP_ARGS(folio, mapping), TP_STRUCT__entry ( __array(char, name, 32) @@ -69,7 +69,7 @@ DECLARE_EVENT_CLASS(writeback_page_template, bdi_dev_name(mapping ? inode_to_bdi(mapping->host) : NULL), 32); __entry->ino = mapping ? mapping->host->i_ino : 0; - __entry->index = page->index; + __entry->index = folio->index; ), TP_printk("bdi %s: ino=%lu index=%lu", @@ -79,18 +79,18 @@ DECLARE_EVENT_CLASS(writeback_page_template, ) ); -DEFINE_EVENT(writeback_page_template, writeback_dirty_page, +DEFINE_EVENT(writeback_folio_template, writeback_dirty_folio, - TP_PROTO(struct page *page, struct address_space *mapping), + TP_PROTO(struct folio *folio, struct address_space *mapping), - TP_ARGS(page, mapping) + TP_ARGS(folio, mapping) ); -DEFINE_EVENT(writeback_page_template, wait_on_page_writeback, +DEFINE_EVENT(writeback_folio_template, folio_wait_writeback, - TP_PROTO(struct page *page, struct address_space *mapping), + TP_PROTO(struct folio *folio, struct address_space *mapping), - TP_ARGS(page, mapping) + TP_ARGS(folio, mapping) ); DECLARE_EVENT_CLASS(writeback_dirty_inode_template, @@ -236,9 +236,9 @@ TRACE_EVENT(inode_switch_wbs, TRACE_EVENT(track_foreign_dirty, - TP_PROTO(struct page *page, struct bdi_writeback *wb), + TP_PROTO(struct folio *folio, struct bdi_writeback *wb), - TP_ARGS(page, wb), + TP_ARGS(folio, wb), TP_STRUCT__entry( __array(char, name, 32) @@ -250,7 +250,7 @@ TRACE_EVENT(track_foreign_dirty, ), TP_fast_assign( - struct address_space *mapping = page_mapping(page); + struct address_space *mapping = folio_mapping(folio); struct inode *inode = mapping ? mapping->host : NULL; strscpy_pad(__entry->name, bdi_dev_name(wb->bdi), 32); @@ -258,7 +258,7 @@ TRACE_EVENT(track_foreign_dirty, __entry->ino = inode ? inode->i_ino : 0; __entry->memcg_id = wb->memcg_css->id; __entry->cgroup_ino = __trace_wb_assign_cgroup(wb); - __entry->page_cgroup_ino = cgroup_ino(page_memcg(page)->css.cgroup); + __entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup); ), TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu", diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 9dc0bf0c5a6e..ecd0f5bdfc1d 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -181,6 +181,10 @@ struct f_owner_ex { blocking */ #define LOCK_UN 8 /* remove lock */ +/* + * LOCK_MAND support has been removed from the kernel. We leave the symbols + * here to not break legacy builds, but these should not be used in new code. + */ #define LOCK_MAND 32 /* This is a mandatory flock ... */ #define LOCK_READ 64 /* which allows concurrent read operations */ #define LOCK_WRITE 128 /* which allows concurrent write operations */ diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index b270a07b285e..c45b5e9a9387 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -158,6 +158,7 @@ enum { #define IORING_TIMEOUT_BOOTTIME (1U << 2) #define IORING_TIMEOUT_REALTIME (1U << 3) #define IORING_LINK_TIMEOUT_UPDATE (1U << 4) +#define IORING_TIMEOUT_ETIME_SUCCESS (1U << 5) #define IORING_TIMEOUT_CLOCK_MASK (IORING_TIMEOUT_BOOTTIME | IORING_TIMEOUT_REALTIME) #define IORING_TIMEOUT_UPDATE_MASK (IORING_TIMEOUT_UPDATE | IORING_LINK_TIMEOUT_UPDATE) /* diff --git a/include/uapi/linux/mctp.h b/include/uapi/linux/mctp.h index 52b54d13f385..6acd4ccafbf7 100644 --- a/include/uapi/linux/mctp.h +++ b/include/uapi/linux/mctp.h @@ -10,6 +10,7 @@ #define __UAPI_MCTP_H #include <linux/types.h> +#include <linux/socket.h> typedef __u8 mctp_eid_t; @@ -18,11 +19,13 @@ struct mctp_addr { }; struct sockaddr_mctp { - unsigned short int smctp_family; - int smctp_network; + __kernel_sa_family_t smctp_family; + __u16 __smctp_pad0; + unsigned int smctp_network; struct mctp_addr smctp_addr; __u8 smctp_type; __u8 smctp_tag; + __u8 __smctp_pad1; }; #define MCTP_NET_ANY 0x0 diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 8dd73a64f921..b1cb1dbf7417 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -657,7 +657,7 @@ static int audit_filter_rules(struct task_struct *tsk, result = audit_comparator(audit_loginuid_set(tsk), f->op, f->val); break; case AUDIT_SADDR_FAM: - if (ctx->sockaddr) + if (ctx && ctx->sockaddr) result = audit_comparator(ctx->sockaddr->ss_family, f->op, f->val); break; diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index cebd4fb06d19..447def540544 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -1072,6 +1072,7 @@ static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr) INIT_WORK(&aux->work, prog_array_map_clear_deferred); INIT_LIST_HEAD(&aux->poke_progs); mutex_init(&aux->poke_mutex); + spin_lock_init(&aux->owner.lock); map = array_map_alloc(attr); if (IS_ERR(map)) { diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index d6b7dfdd8066..6e3ae90ad107 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -524,6 +524,7 @@ int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON); int bpf_jit_kallsyms __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON); int bpf_jit_harden __read_mostly; long bpf_jit_limit __read_mostly; +long bpf_jit_limit_max __read_mostly; static void bpf_prog_ksym_set_addr(struct bpf_prog *prog) @@ -817,7 +818,8 @@ u64 __weak bpf_jit_alloc_exec_limit(void) static int __init bpf_jit_charge_init(void) { /* Only used as heuristic here to derive limit. */ - bpf_jit_limit = min_t(u64, round_up(bpf_jit_alloc_exec_limit() >> 2, + bpf_jit_limit_max = bpf_jit_alloc_exec_limit(); + bpf_jit_limit = min_t(u64, round_up(bpf_jit_limit_max >> 2, PAGE_SIZE), LONG_MAX); return 0; } @@ -1821,20 +1823,26 @@ static unsigned int __bpf_prog_ret0_warn(const void *ctx, bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp) { + bool ret; + if (fp->kprobe_override) return false; - if (!array->aux->type) { + spin_lock(&array->aux->owner.lock); + + if (!array->aux->owner.type) { /* There's no owner yet where we could check for * compatibility. */ - array->aux->type = fp->type; - array->aux->jited = fp->jited; - return true; + array->aux->owner.type = fp->type; + array->aux->owner.jited = fp->jited; + ret = true; + } else { + ret = array->aux->owner.type == fp->type && + array->aux->owner.jited == fp->jited; } - - return array->aux->type == fp->type && - array->aux->jited == fp->jited; + spin_unlock(&array->aux->owner.lock); + return ret; } static int bpf_check_tail_call(const struct bpf_prog *fp) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 4e50c0bfdb7d..1cad6979a0d0 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -543,8 +543,10 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY) { array = container_of(map, struct bpf_array, map); - type = array->aux->type; - jited = array->aux->jited; + spin_lock(&array->aux->owner.lock); + type = array->aux->owner.type; + jited = array->aux->owner.jited; + spin_unlock(&array->aux->owner.lock); } seq_printf(m, @@ -1337,12 +1339,11 @@ int generic_map_update_batch(struct bpf_map *map, void __user *values = u64_to_user_ptr(attr->batch.values); void __user *keys = u64_to_user_ptr(attr->batch.keys); u32 value_size, cp, max_count; - int ufd = attr->map_fd; + int ufd = attr->batch.map_fd; void *key, *value; struct fd f; int err = 0; - f = fdget(ufd); if (attr->batch.elem_flags & ~BPF_F_LOCK) return -EINVAL; @@ -1367,6 +1368,7 @@ int generic_map_update_batch(struct bpf_map *map, return -ENOMEM; } + f = fdget(ufd); /* bpf_map_do_batch() guarantees ufd is valid */ for (cp = 0; cp < max_count; cp++) { err = -EFAULT; if (copy_from_user(key, keys + cp * map->key_size, @@ -1386,6 +1388,7 @@ int generic_map_update_batch(struct bpf_map *map, kvfree(value); kvfree(key); + fdput(f); return err; } diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index e76b55917905..de006552be8a 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -13319,7 +13319,7 @@ BTF_SET_START(btf_non_sleepable_error_inject) /* Three functions below can be called from sleepable and non-sleepable context. * Assume non-sleepable from bpf safety point of view. */ -BTF_ID(func, __add_to_page_cache_locked) +BTF_ID(func, __filemap_add_folio) BTF_ID(func, should_fail_alloc_page) BTF_ID(func, should_failslab) BTF_SET_END(btf_non_sleepable_error_inject) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 570b0c97392a..ea08f01d0111 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2187,8 +2187,10 @@ static void cgroup_kill_sb(struct super_block *sb) * And don't kill the default root. */ if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root && - !percpu_ref_is_dying(&root->cgrp.self.refcnt)) + !percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + cgroup_bpf_offline(&root->cgrp); percpu_ref_kill(&root->cgrp.self.refcnt); + } cgroup_put(&root->cgrp); kernfs_kill_sb(sb); } diff --git a/kernel/cred.c b/kernel/cred.c index f784e08c2fbd..1ae0b4948a5a 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -225,8 +225,6 @@ struct cred *cred_alloc_blank(void) #ifdef CONFIG_DEBUG_CREDENTIALS new->magic = CRED_MAGIC; #endif - new->ucounts = get_ucounts(&init_ucounts); - if (security_cred_alloc_blank(new, GFP_KERNEL_ACCOUNT) < 0) goto error; @@ -501,7 +499,7 @@ int commit_creds(struct cred *new) inc_rlimit_ucounts(new->ucounts, UCOUNT_RLIMIT_NPROC, 1); rcu_assign_pointer(task->real_cred, new); rcu_assign_pointer(task->cred, new); - if (new->user != old->user) + if (new->user != old->user || new->user_ns != old->user_ns) dec_rlimit_ucounts(old->ucounts, UCOUNT_RLIMIT_NPROC, 1); alter_cred_subscribers(old, -2); @@ -669,7 +667,7 @@ int set_cred_ucounts(struct cred *new) { struct task_struct *task = current; const struct cred *old = task->real_cred; - struct ucounts *old_ucounts = new->ucounts; + struct ucounts *new_ucounts, *old_ucounts = new->ucounts; if (new->user == old->user && new->user_ns == old->user_ns) return 0; @@ -681,9 +679,10 @@ int set_cred_ucounts(struct cred *new) if (old_ucounts && old_ucounts->ns == new->user_ns && uid_eq(old_ucounts->uid, new->euid)) return 0; - if (!(new->ucounts = alloc_ucounts(new->user_ns, new->euid))) + if (!(new_ucounts = alloc_ucounts(new->user_ns, new->euid))) return -EAGAIN; + new->ucounts = new_ucounts; if (old_ucounts) put_ucounts(old_ucounts); diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 95445bd6eb72..7a14ca29c377 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -552,7 +552,7 @@ static void active_cacheline_remove(struct dma_debug_entry *entry) * Wrapper function for adding an entry to the hash. * This function takes care of locking itself. */ -static void add_dma_entry(struct dma_debug_entry *entry) +static void add_dma_entry(struct dma_debug_entry *entry, unsigned long attrs) { struct hash_bucket *bucket; unsigned long flags; @@ -566,7 +566,7 @@ static void add_dma_entry(struct dma_debug_entry *entry) if (rc == -ENOMEM) { pr_err("cacheline tracking ENOMEM, dma-debug disabled\n"); global_disable = true; - } else if (rc == -EEXIST) { + } else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) { err_printk(entry->dev, entry, "cacheline tracking EEXIST, overlapping mappings aren't supported\n"); } @@ -1191,7 +1191,8 @@ void debug_dma_map_single(struct device *dev, const void *addr, EXPORT_SYMBOL(debug_dma_map_single); void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, - size_t size, int direction, dma_addr_t dma_addr) + size_t size, int direction, dma_addr_t dma_addr, + unsigned long attrs) { struct dma_debug_entry *entry; @@ -1222,7 +1223,7 @@ void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, check_for_illegal_area(dev, addr, size); } - add_dma_entry(entry); + add_dma_entry(entry, attrs); } void debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) @@ -1280,7 +1281,8 @@ void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, } void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, - int nents, int mapped_ents, int direction) + int nents, int mapped_ents, int direction, + unsigned long attrs) { struct dma_debug_entry *entry; struct scatterlist *s; @@ -1289,6 +1291,12 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, if (unlikely(dma_debug_disabled())) return; + for_each_sg(sg, s, nents, i) { + check_for_stack(dev, sg_page(s), s->offset); + if (!PageHighMem(sg_page(s))) + check_for_illegal_area(dev, sg_virt(s), s->length); + } + for_each_sg(sg, s, mapped_ents, i) { entry = dma_entry_alloc(); if (!entry) @@ -1304,15 +1312,9 @@ void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, entry->sg_call_ents = nents; entry->sg_mapped_ents = mapped_ents; - check_for_stack(dev, sg_page(s), s->offset); - - if (!PageHighMem(sg_page(s))) { - check_for_illegal_area(dev, sg_virt(s), sg_dma_len(s)); - } - check_sg_segment(dev, s); - add_dma_entry(entry); + add_dma_entry(entry, attrs); } } @@ -1368,7 +1370,8 @@ void debug_dma_unmap_sg(struct device *dev, struct scatterlist *sglist, } void debug_dma_alloc_coherent(struct device *dev, size_t size, - dma_addr_t dma_addr, void *virt) + dma_addr_t dma_addr, void *virt, + unsigned long attrs) { struct dma_debug_entry *entry; @@ -1398,7 +1401,7 @@ void debug_dma_alloc_coherent(struct device *dev, size_t size, else entry->pfn = page_to_pfn(virt_to_page(virt)); - add_dma_entry(entry); + add_dma_entry(entry, attrs); } void debug_dma_free_coherent(struct device *dev, size_t size, @@ -1429,7 +1432,8 @@ void debug_dma_free_coherent(struct device *dev, size_t size, } void debug_dma_map_resource(struct device *dev, phys_addr_t addr, size_t size, - int direction, dma_addr_t dma_addr) + int direction, dma_addr_t dma_addr, + unsigned long attrs) { struct dma_debug_entry *entry; @@ -1449,7 +1453,7 @@ void debug_dma_map_resource(struct device *dev, phys_addr_t addr, size_t size, entry->direction = direction; entry->map_err_type = MAP_ERR_NOT_CHECKED; - add_dma_entry(entry); + add_dma_entry(entry, attrs); } void debug_dma_unmap_resource(struct device *dev, dma_addr_t dma_addr, diff --git a/kernel/dma/debug.h b/kernel/dma/debug.h index 83643b3010b2..f525197d3cae 100644 --- a/kernel/dma/debug.h +++ b/kernel/dma/debug.h @@ -11,26 +11,30 @@ #ifdef CONFIG_DMA_API_DEBUG extern void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, - int direction, dma_addr_t dma_addr); + int direction, dma_addr_t dma_addr, + unsigned long attrs); extern void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, size_t size, int direction); extern void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, - int nents, int mapped_ents, int direction); + int nents, int mapped_ents, int direction, + unsigned long attrs); extern void debug_dma_unmap_sg(struct device *dev, struct scatterlist *sglist, int nelems, int dir); extern void debug_dma_alloc_coherent(struct device *dev, size_t size, - dma_addr_t dma_addr, void *virt); + dma_addr_t dma_addr, void *virt, + unsigned long attrs); extern void debug_dma_free_coherent(struct device *dev, size_t size, void *virt, dma_addr_t addr); extern void debug_dma_map_resource(struct device *dev, phys_addr_t addr, size_t size, int direction, - dma_addr_t dma_addr); + dma_addr_t dma_addr, + unsigned long attrs); extern void debug_dma_unmap_resource(struct device *dev, dma_addr_t dma_addr, size_t size, int direction); @@ -53,7 +57,8 @@ extern void debug_dma_sync_sg_for_device(struct device *dev, #else /* CONFIG_DMA_API_DEBUG */ static inline void debug_dma_map_page(struct device *dev, struct page *page, size_t offset, size_t size, - int direction, dma_addr_t dma_addr) + int direction, dma_addr_t dma_addr, + unsigned long attrs) { } @@ -63,7 +68,8 @@ static inline void debug_dma_unmap_page(struct device *dev, dma_addr_t addr, } static inline void debug_dma_map_sg(struct device *dev, struct scatterlist *sg, - int nents, int mapped_ents, int direction) + int nents, int mapped_ents, int direction, + unsigned long attrs) { } @@ -74,7 +80,8 @@ static inline void debug_dma_unmap_sg(struct device *dev, } static inline void debug_dma_alloc_coherent(struct device *dev, size_t size, - dma_addr_t dma_addr, void *virt) + dma_addr_t dma_addr, void *virt, + unsigned long attrs) { } @@ -85,7 +92,8 @@ static inline void debug_dma_free_coherent(struct device *dev, size_t size, static inline void debug_dma_map_resource(struct device *dev, phys_addr_t addr, size_t size, int direction, - dma_addr_t dma_addr) + dma_addr_t dma_addr, + unsigned long attrs) { } diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 06fec5547e7c..8349a9f2c345 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -156,7 +156,7 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page, addr = dma_direct_map_page(dev, page, offset, size, dir, attrs); else addr = ops->map_page(dev, page, offset, size, dir, attrs); - debug_dma_map_page(dev, page, offset, size, dir, addr); + debug_dma_map_page(dev, page, offset, size, dir, addr, attrs); return addr; } @@ -195,7 +195,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, ents = ops->map_sg(dev, sg, nents, dir, attrs); if (ents > 0) - debug_dma_map_sg(dev, sg, nents, ents, dir); + debug_dma_map_sg(dev, sg, nents, ents, dir, attrs); else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM && ents != -EIO)) return -EIO; @@ -249,12 +249,12 @@ EXPORT_SYMBOL(dma_map_sg_attrs); * Returns 0 on success or a negative error code on error. The following * error codes are supported with the given meaning: * - * -EINVAL - An invalid argument, unaligned access or other error - * in usage. Will not succeed if retried. - * -ENOMEM - Insufficient resources (like memory or IOVA space) to - * complete the mapping. Should succeed if retried later. - * -EIO - Legacy error code with an unknown meaning. eg. this is - * returned if a lower level call returned DMA_MAPPING_ERROR. + * -EINVAL An invalid argument, unaligned access or other error + * in usage. Will not succeed if retried. + * -ENOMEM Insufficient resources (like memory or IOVA space) to + * complete the mapping. Should succeed if retried later. + * -EIO Legacy error code with an unknown meaning. eg. this is + * returned if a lower level call returned DMA_MAPPING_ERROR. */ int dma_map_sgtable(struct device *dev, struct sg_table *sgt, enum dma_data_direction dir, unsigned long attrs) @@ -305,7 +305,7 @@ dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr, else if (ops->map_resource) addr = ops->map_resource(dev, phys_addr, size, dir, attrs); - debug_dma_map_resource(dev, phys_addr, size, dir, addr); + debug_dma_map_resource(dev, phys_addr, size, dir, addr, attrs); return addr; } EXPORT_SYMBOL(dma_map_resource); @@ -510,7 +510,7 @@ void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle, else return NULL; - debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr); + debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr, attrs); return cpu_addr; } EXPORT_SYMBOL(dma_alloc_attrs); @@ -566,7 +566,7 @@ struct page *dma_alloc_pages(struct device *dev, size_t size, struct page *page = __dma_alloc_pages(dev, size, dma_handle, dir, gfp); if (page) - debug_dma_map_page(dev, page, 0, size, dir, *dma_handle); + debug_dma_map_page(dev, page, 0, size, dir, *dma_handle, 0); return page; } EXPORT_SYMBOL_GPL(dma_alloc_pages); @@ -644,7 +644,7 @@ struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size, if (sgt) { sgt->nents = 1; - debug_dma_map_sg(dev, sgt->sgl, sgt->orig_nents, 1, dir); + debug_dma_map_sg(dev, sgt->sgl, sgt->orig_nents, 1, dir, attrs); } return sgt; } diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index af24dc3febbe..6357c3580d07 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -167,7 +167,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, addr + PAGE_SIZE); if (new_page) { - err = mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL); + err = mem_cgroup_charge(page_folio(new_page), vma->vm_mm, + GFP_KERNEL); if (err) return err; } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 92ef7b68198c..59bea523c84b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6343,7 +6343,7 @@ static inline void sched_submit_work(struct task_struct *tsk) * make sure to submit it to avoid deadlocks. */ if (blk_needs_flush_plug(tsk)) - blk_schedule_flush_plug(tsk); + blk_flush_plug(tsk->plug, true); } static void sched_update_worker(struct task_struct *tsk) @@ -8354,7 +8354,8 @@ int io_schedule_prepare(void) int old_iowait = current->in_iowait; current->in_iowait = 1; - blk_schedule_flush_plug(current); + if (current->plug) + blk_flush_plug(current->plug, true); return old_iowait; } @@ -8795,6 +8796,7 @@ void idle_task_exit(void) finish_arch_post_lock_switch(); } + scs_task_reset(current); /* finish_cpu(), as ran on the BP, will clean up the active_mm state */ } diff --git a/kernel/signal.c b/kernel/signal.c index 952741f6d0f9..487bf4f5dadf 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -426,22 +426,10 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, */ rcu_read_lock(); ucounts = task_ucounts(t); - sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1); - switch (sigpending) { - case 1: - if (likely(get_ucounts(ucounts))) - break; - fallthrough; - case LONG_MAX: - /* - * we need to decrease the ucount in the userns tree on any - * failure to avoid counts leaking. - */ - dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1); - rcu_read_unlock(); - return NULL; - } + sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING); rcu_read_unlock(); + if (!sigpending) + return NULL; if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) { q = kmem_cache_alloc(sigqueue_cachep, gfp_flags); @@ -450,8 +438,7 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t gfp_flags, } if (unlikely(q == NULL)) { - if (dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) - put_ucounts(ucounts); + dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING); } else { INIT_LIST_HEAD(&q->list); q->flags = sigqueue_flags; @@ -464,8 +451,8 @@ static void __sigqueue_free(struct sigqueue *q) { if (q->flags & SIGQUEUE_PREALLOC) return; - if (q->ucounts && dec_rlimit_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING, 1)) { - put_ucounts(q->ucounts); + if (q->ucounts) { + dec_rlimit_put_ucounts(q->ucounts, UCOUNT_RLIMIT_SIGPENDING); q->ucounts = NULL; } kmem_cache_free(sigqueue_cachep, q); diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c index fa91f398f28b..1183c88634aa 100644 --- a/kernel/trace/blktrace.c +++ b/kernel/trace/blktrace.c @@ -816,7 +816,7 @@ blk_trace_request_get_cgid(struct request *rq) * Records an action against a request. Will log the bio offset + size. * **/ -static void blk_add_trace_rq(struct request *rq, int error, +static void blk_add_trace_rq(struct request *rq, blk_status_t error, unsigned int nr_bytes, u32 what, u64 cgid) { struct blk_trace *bt; @@ -834,7 +834,8 @@ static void blk_add_trace_rq(struct request *rq, int error, what |= BLK_TC_ACT(BLK_TC_FS); __blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, req_op(rq), - rq->cmd_flags, what, error, 0, NULL, cgid); + rq->cmd_flags, what, blk_status_to_errno(error), 0, + NULL, cgid); rcu_read_unlock(); } @@ -863,7 +864,7 @@ static void blk_add_trace_rq_requeue(void *ignore, struct request *rq) } static void blk_add_trace_rq_complete(void *ignore, struct request *rq, - int error, unsigned int nr_bytes) + blk_status_t error, unsigned int nr_bytes) { blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE, blk_trace_request_get_cgid(rq)); diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 7efbc8aaf7f6..feebf57c6458 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -2208,7 +2208,7 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update) } /** - * ftrace_update_record, set a record that now is tracing or not + * ftrace_update_record - set a record that now is tracing or not * @rec: the record to update * @enable: set to true if the record is tracing, false to force disable * @@ -2221,7 +2221,7 @@ int ftrace_update_record(struct dyn_ftrace *rec, bool enable) } /** - * ftrace_test_record, check if the record has been enabled or not + * ftrace_test_record - check if the record has been enabled or not * @rec: the record to test * @enable: set to true to check if enabled, false if it is disabled * @@ -2574,7 +2574,7 @@ struct ftrace_rec_iter { }; /** - * ftrace_rec_iter_start, start up iterating over traced functions + * ftrace_rec_iter_start - start up iterating over traced functions * * Returns an iterator handle that is used to iterate over all * the records that represent address locations where functions @@ -2605,7 +2605,7 @@ struct ftrace_rec_iter *ftrace_rec_iter_start(void) } /** - * ftrace_rec_iter_next, get the next record to process. + * ftrace_rec_iter_next - get the next record to process. * @iter: The handle to the iterator. * * Returns the next iterator after the given iterator @iter. @@ -2630,7 +2630,7 @@ struct ftrace_rec_iter *ftrace_rec_iter_next(struct ftrace_rec_iter *iter) } /** - * ftrace_rec_iter_record, get the record at the iterator location + * ftrace_rec_iter_record - get the record at the iterator location * @iter: The current iterator location * * Returns the record that the current @iter is at. @@ -2733,7 +2733,7 @@ static int __ftrace_modify_code(void *data) } /** - * ftrace_run_stop_machine, go back to the stop machine method + * ftrace_run_stop_machine - go back to the stop machine method * @command: The command to tell ftrace what to do * * If an arch needs to fall back to the stop machine method, the @@ -2745,7 +2745,7 @@ void ftrace_run_stop_machine(int command) } /** - * arch_ftrace_update_code, modify the code to trace or not trace + * arch_ftrace_update_code - modify the code to trace or not trace * @command: The command that needs to be done * * Archs can override this function if it does not need to @@ -6977,7 +6977,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *op; int bit; - bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START, TRACE_LIST_MAX); + bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START); if (bit < 0) return; @@ -7052,7 +7052,7 @@ static void ftrace_ops_assist_func(unsigned long ip, unsigned long parent_ip, { int bit; - bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START, TRACE_LIST_MAX); + bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_LIST_START); if (bit < 0) return; @@ -7525,7 +7525,9 @@ void ftrace_kill(void) } /** - * Test if ftrace is dead or not. + * ftrace_is_dead - Test if ftrace is dead or not. + * + * Returns 1 if ftrace is "dead", zero otherwise. */ int ftrace_is_dead(void) { diff --git a/kernel/trace/trace_eprobe.c b/kernel/trace/trace_eprobe.c index c4a15aef36af..928867f527e7 100644 --- a/kernel/trace/trace_eprobe.c +++ b/kernel/trace/trace_eprobe.c @@ -904,8 +904,8 @@ static int __trace_eprobe_create(int argc, const char *argv[]) if (IS_ERR(ep)) { ret = PTR_ERR(ep); - /* This must return -ENOMEM, else there is a bug */ - WARN_ON_ONCE(ret != -ENOMEM); + /* This must return -ENOMEM or missing event, else there is a bug */ + WARN_ON_ONCE(ret != -ENOMEM && ret != -ENODEV); ep = NULL; goto error; } diff --git a/kernel/ucount.c b/kernel/ucount.c index bb51849e6375..eb03f3c68375 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -284,6 +284,55 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum ucount_type type, long v) return (new == 0); } +static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts, + struct ucounts *last, enum ucount_type type) +{ + struct ucounts *iter, *next; + for (iter = ucounts; iter != last; iter = next) { + long dec = atomic_long_add_return(-1, &iter->ucount[type]); + WARN_ON_ONCE(dec < 0); + next = iter->ns->ucounts; + if (dec == 0) + put_ucounts(iter); + } +} + +void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum ucount_type type) +{ + do_dec_rlimit_put_ucounts(ucounts, NULL, type); +} + +long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum ucount_type type) +{ + /* Caller must hold a reference to ucounts */ + struct ucounts *iter; + long dec, ret = 0; + + for (iter = ucounts; iter; iter = iter->ns->ucounts) { + long max = READ_ONCE(iter->ns->ucount_max[type]); + long new = atomic_long_add_return(1, &iter->ucount[type]); + if (new < 0 || new > max) + goto unwind; + if (iter == ucounts) + ret = new; + /* + * Grab an extra ucount reference for the caller when + * the rlimit count was previously 0. + */ + if (new != 1) + continue; + if (!get_ucounts(iter)) + goto dec_unwind; + } + return ret; +dec_unwind: + dec = atomic_long_add_return(-1, &iter->ucount[type]); + WARN_ON_ONCE(dec < 0); +unwind: + do_dec_rlimit_put_ucounts(ucounts, iter, type); + return 0; +} + bool is_ucounts_overlimit(struct ucounts *ucounts, enum ucount_type type, unsigned long max) { struct ucounts *iter; diff --git a/lib/flex_proportions.c b/lib/flex_proportions.c index 451543937524..53e7eb1dd76c 100644 --- a/lib/flex_proportions.c +++ b/lib/flex_proportions.c @@ -217,11 +217,12 @@ static void fprop_reflect_period_percpu(struct fprop_global *p, } /* Event of type pl happened */ -void __fprop_inc_percpu(struct fprop_global *p, struct fprop_local_percpu *pl) +void __fprop_add_percpu(struct fprop_global *p, struct fprop_local_percpu *pl, + long nr) { fprop_reflect_period_percpu(p, pl); - percpu_counter_add_batch(&pl->events, 1, PROP_BATCH); - percpu_counter_add(&p->events, 1); + percpu_counter_add_batch(&pl->events, nr, PROP_BATCH); + percpu_counter_add(&p->events, nr); } void fprop_fraction_percpu(struct fprop_global *p, @@ -253,20 +254,29 @@ void fprop_fraction_percpu(struct fprop_global *p, } /* - * Like __fprop_inc_percpu() except that event is counted only if the given + * Like __fprop_add_percpu() except that event is counted only if the given * type has fraction smaller than @max_frac/FPROP_FRAC_BASE */ -void __fprop_inc_percpu_max(struct fprop_global *p, - struct fprop_local_percpu *pl, int max_frac) +void __fprop_add_percpu_max(struct fprop_global *p, + struct fprop_local_percpu *pl, int max_frac, long nr) { if (unlikely(max_frac < FPROP_FRAC_BASE)) { unsigned long numerator, denominator; + s64 tmp; fprop_fraction_percpu(p, pl, &numerator, &denominator); - if (numerator > - (((u64)denominator) * max_frac) >> FPROP_FRAC_SHIFT) + /* Adding 'nr' to fraction exceeds max_frac/FPROP_FRAC_BASE? */ + tmp = (u64)denominator * max_frac - + ((u64)numerator << FPROP_FRAC_SHIFT); + if (tmp < 0) { + /* Maximum fraction already exceeded? */ return; + } else if (tmp < nr * (FPROP_FRAC_BASE - max_frac)) { + /* Add just enough for the fraction to saturate */ + nr = div_u64(tmp + FPROP_FRAC_BASE - max_frac - 1, + FPROP_FRAC_BASE - max_frac); + } } - __fprop_inc_percpu(p, pl); + __fprop_add_percpu(p, pl, nr); } diff --git a/lib/sbitmap.c b/lib/sbitmap.c index c6e2f1f2c4d2..2709ab825499 100644 --- a/lib/sbitmap.c +++ b/lib/sbitmap.c @@ -631,7 +631,7 @@ EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up); static inline void sbitmap_update_cpu_hint(struct sbitmap *sb, int cpu, int tag) { if (likely(!sb->round_robin && tag < sb->depth)) - *per_cpu_ptr(sb->alloc_hint, cpu) = tag; + data_race(*per_cpu_ptr(sb->alloc_hint, cpu) = tag); } void sbitmap_queue_clear_batch(struct sbitmap_queue *sbq, int offset, diff --git a/mm/Makefile b/mm/Makefile index fc60a40ce954..d6c0042e3aa0 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -46,7 +46,7 @@ mmu-$(CONFIG_MMU) += process_vm_access.o endif obj-y := filemap.o mempool.o oom_kill.o fadvise.o \ - maccess.o page-writeback.o \ + maccess.o page-writeback.o folio-compat.o \ readahead.o swap.o truncate.o vmscan.o shmem.o \ util.o mmzone.o vmstat.o backing-dev.o \ mm_init.o percpu.o slab_common.o \ diff --git a/mm/compaction.c b/mm/compaction.c index bfc93da1c2c7..fbc60f964c38 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1022,7 +1022,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (!TestClearPageLRU(page)) goto isolate_fail_put; - lruvec = mem_cgroup_page_lruvec(page); + lruvec = folio_lruvec(page_folio(page)); /* If we already hold the lock, we can skip some rechecking */ if (lruvec != locked) { @@ -1032,7 +1032,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, compact_lock_irqsave(&lruvec->lru_lock, &flags, cc); locked = lruvec; - lruvec_memcg_debug(lruvec, page); + lruvec_memcg_debug(lruvec, page_folio(page)); /* Try get exclusive access under lock */ if (!skip_updated) { diff --git a/mm/damon/core-test.h b/mm/damon/core-test.h index c938a9c34e6c..7008c3735e99 100644 --- a/mm/damon/core-test.h +++ b/mm/damon/core-test.h @@ -219,14 +219,14 @@ static void damon_test_split_regions_of(struct kunit *test) r = damon_new_region(0, 22); damon_add_region(r, t); damon_split_regions_of(c, t, 2); - KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 2u); + KUNIT_EXPECT_LE(test, damon_nr_regions(t), 2u); damon_free_target(t); t = damon_new_target(42); r = damon_new_region(0, 220); damon_add_region(r, t); damon_split_regions_of(c, t, 4); - KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 4u); + KUNIT_EXPECT_LE(test, damon_nr_regions(t), 4u); damon_free_target(t); damon_destroy_ctx(c); } diff --git a/mm/filemap.c b/mm/filemap.c index 44b4b551e430..5e206a429b57 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -834,6 +834,8 @@ EXPORT_SYMBOL(file_write_and_wait_range); */ void replace_page_cache_page(struct page *old, struct page *new) { + struct folio *fold = page_folio(old); + struct folio *fnew = page_folio(new); struct address_space *mapping = old->mapping; void (*freepage)(struct page *) = mapping->a_ops->freepage; pgoff_t offset = old->index; @@ -847,7 +849,7 @@ void replace_page_cache_page(struct page *old, struct page *new) new->mapping = mapping; new->index = offset; - mem_cgroup_migrate(old, new); + mem_cgroup_migrate(fold, fnew); xas_lock_irq(&xas); xas_store(&xas, new); @@ -869,26 +871,25 @@ void replace_page_cache_page(struct page *old, struct page *new) } EXPORT_SYMBOL_GPL(replace_page_cache_page); -noinline int __add_to_page_cache_locked(struct page *page, - struct address_space *mapping, - pgoff_t offset, gfp_t gfp, - void **shadowp) +noinline int __filemap_add_folio(struct address_space *mapping, + struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp) { - XA_STATE(xas, &mapping->i_pages, offset); - int huge = PageHuge(page); + XA_STATE(xas, &mapping->i_pages, index); + int huge = folio_test_hugetlb(folio); int error; bool charged = false; - VM_BUG_ON_PAGE(!PageLocked(page), page); - VM_BUG_ON_PAGE(PageSwapBacked(page), page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio); mapping_set_update(&xas, mapping); - get_page(page); - page->mapping = mapping; - page->index = offset; + folio_get(folio); + folio->mapping = mapping; + folio->index = index; if (!huge) { - error = mem_cgroup_charge(page, NULL, gfp); + error = mem_cgroup_charge(folio, NULL, gfp); + VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); if (error) goto error; charged = true; @@ -900,7 +901,7 @@ noinline int __add_to_page_cache_locked(struct page *page, unsigned int order = xa_get_order(xas.xa, xas.xa_index); void *entry, *old = NULL; - if (order > thp_order(page)) + if (order > folio_order(folio)) xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index), order, gfp); xas_lock_irq(&xas); @@ -917,13 +918,13 @@ noinline int __add_to_page_cache_locked(struct page *page, *shadowp = old; /* entry may have been split before we acquired lock */ order = xa_get_order(xas.xa, xas.xa_index); - if (order > thp_order(page)) { + if (order > folio_order(folio)) { xas_split(&xas, old, order); xas_reset(&xas); } } - xas_store(&xas, page); + xas_store(&xas, folio); if (xas_error(&xas)) goto unlock; @@ -931,7 +932,7 @@ noinline int __add_to_page_cache_locked(struct page *page, /* hugetlb pages do not participate in page cache accounting */ if (!huge) - __inc_lruvec_page_state(page, NR_FILE_PAGES); + __lruvec_stat_add_folio(folio, NR_FILE_PAGES); unlock: xas_unlock_irq(&xas); } while (xas_nomem(&xas, gfp)); @@ -939,19 +940,19 @@ unlock: if (xas_error(&xas)) { error = xas_error(&xas); if (charged) - mem_cgroup_uncharge(page); + mem_cgroup_uncharge(folio); goto error; } - trace_mm_filemap_add_to_page_cache(page); + trace_mm_filemap_add_to_page_cache(&folio->page); return 0; error: - page->mapping = NULL; + folio->mapping = NULL; /* Leave page->index set: truncation relies upon it */ - put_page(page); + folio_put(folio); return error; } -ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO); +ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO); /** * add_to_page_cache_locked - add a locked page to the pagecache @@ -968,59 +969,58 @@ ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO); int add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask) { - return __add_to_page_cache_locked(page, mapping, offset, + return __filemap_add_folio(mapping, page_folio(page), offset, gfp_mask, NULL); } EXPORT_SYMBOL(add_to_page_cache_locked); -int add_to_page_cache_lru(struct page *page, struct address_space *mapping, - pgoff_t offset, gfp_t gfp_mask) +int filemap_add_folio(struct address_space *mapping, struct folio *folio, + pgoff_t index, gfp_t gfp) { void *shadow = NULL; int ret; - __SetPageLocked(page); - ret = __add_to_page_cache_locked(page, mapping, offset, - gfp_mask, &shadow); + __folio_set_locked(folio); + ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow); if (unlikely(ret)) - __ClearPageLocked(page); + __folio_clear_locked(folio); else { /* - * The page might have been evicted from cache only + * The folio might have been evicted from cache only * recently, in which case it should be activated like - * any other repeatedly accessed page. - * The exception is pages getting rewritten; evicting other + * any other repeatedly accessed folio. + * The exception is folios getting rewritten; evicting other * data from the working set, only to cache data that will * get overwritten with something else, is a waste of memory. */ - WARN_ON_ONCE(PageActive(page)); - if (!(gfp_mask & __GFP_WRITE) && shadow) - workingset_refault(page, shadow); - lru_cache_add(page); + WARN_ON_ONCE(folio_test_active(folio)); + if (!(gfp & __GFP_WRITE) && shadow) + workingset_refault(folio, shadow); + folio_add_lru(folio); } return ret; } -EXPORT_SYMBOL_GPL(add_to_page_cache_lru); +EXPORT_SYMBOL_GPL(filemap_add_folio); #ifdef CONFIG_NUMA -struct page *__page_cache_alloc(gfp_t gfp) +struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order) { int n; - struct page *page; + struct folio *folio; if (cpuset_do_page_mem_spread()) { unsigned int cpuset_mems_cookie; do { cpuset_mems_cookie = read_mems_allowed_begin(); n = cpuset_mem_spread_node(); - page = __alloc_pages_node(n, gfp, 0); - } while (!page && read_mems_allowed_retry(cpuset_mems_cookie)); + folio = __folio_alloc_node(gfp, order, n); + } while (!folio && read_mems_allowed_retry(cpuset_mems_cookie)); - return page; + return folio; } - return alloc_pages(gfp, 0); + return folio_alloc(gfp, order); } -EXPORT_SYMBOL(__page_cache_alloc); +EXPORT_SYMBOL(filemap_alloc_folio); #endif /* @@ -1073,11 +1073,11 @@ EXPORT_SYMBOL(filemap_invalidate_unlock_two); */ #define PAGE_WAIT_TABLE_BITS 8 #define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS) -static wait_queue_head_t page_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned; +static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned; -static wait_queue_head_t *page_waitqueue(struct page *page) +static wait_queue_head_t *folio_waitqueue(struct folio *folio) { - return &page_wait_table[hash_ptr(page, PAGE_WAIT_TABLE_BITS)]; + return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)]; } void __init pagecache_init(void) @@ -1085,7 +1085,7 @@ void __init pagecache_init(void) int i; for (i = 0; i < PAGE_WAIT_TABLE_SIZE; i++) - init_waitqueue_head(&page_wait_table[i]); + init_waitqueue_head(&folio_wait_table[i]); page_writeback_init(); } @@ -1140,10 +1140,10 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, */ flags = wait->flags; if (flags & WQ_FLAG_EXCLUSIVE) { - if (test_bit(key->bit_nr, &key->page->flags)) + if (test_bit(key->bit_nr, &key->folio->flags)) return -1; if (flags & WQ_FLAG_CUSTOM) { - if (test_and_set_bit(key->bit_nr, &key->page->flags)) + if (test_and_set_bit(key->bit_nr, &key->folio->flags)) return -1; flags |= WQ_FLAG_DONE; } @@ -1156,7 +1156,7 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, * * So update the flags atomically, and wake up the waiter * afterwards to avoid any races. This store-release pairs - * with the load-acquire in wait_on_page_bit_common(). + * with the load-acquire in folio_wait_bit_common(). */ smp_store_release(&wait->flags, flags | WQ_FLAG_WOKEN); wake_up_state(wait->private, mode); @@ -1175,14 +1175,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, return (flags & WQ_FLAG_EXCLUSIVE) != 0; } -static void wake_up_page_bit(struct page *page, int bit_nr) +static void folio_wake_bit(struct folio *folio, int bit_nr) { - wait_queue_head_t *q = page_waitqueue(page); + wait_queue_head_t *q = folio_waitqueue(folio); struct wait_page_key key; unsigned long flags; wait_queue_entry_t bookmark; - key.page = page; + key.folio = folio; key.bit_nr = bit_nr; key.page_match = 0; @@ -1217,7 +1217,7 @@ static void wake_up_page_bit(struct page *page, int bit_nr) * page waiters. */ if (!waitqueue_active(q) || !key.page_match) { - ClearPageWaiters(page); + folio_clear_waiters(folio); /* * It's possible to miss clearing Waiters here, when we woke * our page waiters, but the hashed waitqueue has waiters for @@ -1229,19 +1229,19 @@ static void wake_up_page_bit(struct page *page, int bit_nr) spin_unlock_irqrestore(&q->lock, flags); } -static void wake_up_page(struct page *page, int bit) +static void folio_wake(struct folio *folio, int bit) { - if (!PageWaiters(page)) + if (!folio_test_waiters(folio)) return; - wake_up_page_bit(page, bit); + folio_wake_bit(folio, bit); } /* - * A choice of three behaviors for wait_on_page_bit_common(): + * A choice of three behaviors for folio_wait_bit_common(): */ enum behavior { EXCLUSIVE, /* Hold ref to page and take the bit when woken, like - * __lock_page() waiting on then setting PG_locked. + * __folio_lock() waiting on then setting PG_locked. */ SHARED, /* Hold ref to page and check the bit when woken, like * wait_on_page_writeback() waiting on PG_writeback. @@ -1252,16 +1252,16 @@ enum behavior { }; /* - * Attempt to check (or get) the page bit, and mark us done + * Attempt to check (or get) the folio flag, and mark us done * if successful. */ -static inline bool trylock_page_bit_common(struct page *page, int bit_nr, +static inline bool folio_trylock_flag(struct folio *folio, int bit_nr, struct wait_queue_entry *wait) { if (wait->flags & WQ_FLAG_EXCLUSIVE) { - if (test_and_set_bit(bit_nr, &page->flags)) + if (test_and_set_bit(bit_nr, &folio->flags)) return false; - } else if (test_bit(bit_nr, &page->flags)) + } else if (test_bit(bit_nr, &folio->flags)) return false; wait->flags |= WQ_FLAG_WOKEN | WQ_FLAG_DONE; @@ -1271,9 +1271,10 @@ static inline bool trylock_page_bit_common(struct page *page, int bit_nr, /* How many times do we accept lock stealing from under a waiter? */ int sysctl_page_lock_unfairness = 5; -static inline int wait_on_page_bit_common(wait_queue_head_t *q, - struct page *page, int bit_nr, int state, enum behavior behavior) +static inline int folio_wait_bit_common(struct folio *folio, int bit_nr, + int state, enum behavior behavior) { + wait_queue_head_t *q = folio_waitqueue(folio); int unfairness = sysctl_page_lock_unfairness; struct wait_page_queue wait_page; wait_queue_entry_t *wait = &wait_page.wait; @@ -1282,8 +1283,8 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, unsigned long pflags; if (bit_nr == PG_locked && - !PageUptodate(page) && PageWorkingset(page)) { - if (!PageSwapBacked(page)) { + !folio_test_uptodate(folio) && folio_test_workingset(folio)) { + if (!folio_test_swapbacked(folio)) { delayacct_thrashing_start(); delayacct = true; } @@ -1293,7 +1294,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q, init_wait(wait); wait->func = wake_page_function; - wait_page.page = page; + wait_page.folio = folio; wait_page.bit_nr = bit_nr; repeat: @@ -1308,7 +1309,7 @@ repeat: * Do one last check whether we can get the * page bit synchronously. * - * Do the SetPageWaiters() marking before that + * Do the folio_set_waiters() marking before that * to let any waker we _just_ missed know they * need to wake us up (otherwise they'll never * even go to the slow case that looks at the @@ -1319,8 +1320,8 @@ repeat: * lock to avoid races. */ spin_lock_irq(&q->lock); - SetPageWaiters(page); - if (!trylock_page_bit_common(page, bit_nr, wait)) + folio_set_waiters(folio); + if (!folio_trylock_flag(folio, bit_nr, wait)) __add_wait_queue_entry_tail(q, wait); spin_unlock_irq(&q->lock); @@ -1330,10 +1331,10 @@ repeat: * see whether the page bit testing has already * been done by the wake function. * - * We can drop our reference to the page. + * We can drop our reference to the folio. */ if (behavior == DROP) - put_page(page); + folio_put(folio); /* * Note that until the "finish_wait()", or until @@ -1370,7 +1371,7 @@ repeat: * * And if that fails, we'll have to retry this all. */ - if (unlikely(test_and_set_bit(bit_nr, &page->flags))) + if (unlikely(test_and_set_bit(bit_nr, folio_flags(folio, 0)))) goto repeat; wait->flags |= WQ_FLAG_DONE; @@ -1379,7 +1380,7 @@ repeat: /* * If a signal happened, this 'finish_wait()' may remove the last - * waiter from the wait-queues, but the PageWaiters bit will remain + * waiter from the wait-queues, but the folio waiters bit will remain * set. That's ok. The next wakeup will take care of it, and trying * to do it here would be difficult and prone to races. */ @@ -1410,19 +1411,17 @@ repeat: return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR; } -void wait_on_page_bit(struct page *page, int bit_nr) +void folio_wait_bit(struct folio *folio, int bit_nr) { - wait_queue_head_t *q = page_waitqueue(page); - wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, SHARED); + folio_wait_bit_common(folio, bit_nr, TASK_UNINTERRUPTIBLE, SHARED); } -EXPORT_SYMBOL(wait_on_page_bit); +EXPORT_SYMBOL(folio_wait_bit); -int wait_on_page_bit_killable(struct page *page, int bit_nr) +int folio_wait_bit_killable(struct folio *folio, int bit_nr) { - wait_queue_head_t *q = page_waitqueue(page); - return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, SHARED); + return folio_wait_bit_common(folio, bit_nr, TASK_KILLABLE, SHARED); } -EXPORT_SYMBOL(wait_on_page_bit_killable); +EXPORT_SYMBOL(folio_wait_bit_killable); /** * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked @@ -1439,31 +1438,28 @@ EXPORT_SYMBOL(wait_on_page_bit_killable); */ int put_and_wait_on_page_locked(struct page *page, int state) { - wait_queue_head_t *q; - - page = compound_head(page); - q = page_waitqueue(page); - return wait_on_page_bit_common(q, page, PG_locked, state, DROP); + return folio_wait_bit_common(page_folio(page), PG_locked, state, + DROP); } /** - * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue - * @page: Page defining the wait queue of interest + * folio_add_wait_queue - Add an arbitrary waiter to a folio's wait queue + * @folio: Folio defining the wait queue of interest * @waiter: Waiter to add to the queue * - * Add an arbitrary @waiter to the wait queue for the nominated @page. + * Add an arbitrary @waiter to the wait queue for the nominated @folio. */ -void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter) +void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter) { - wait_queue_head_t *q = page_waitqueue(page); + wait_queue_head_t *q = folio_waitqueue(folio); unsigned long flags; spin_lock_irqsave(&q->lock, flags); __add_wait_queue_entry_tail(q, waiter); - SetPageWaiters(page); + folio_set_waiters(folio); spin_unlock_irqrestore(&q->lock, flags); } -EXPORT_SYMBOL_GPL(add_page_wait_queue); +EXPORT_SYMBOL_GPL(folio_add_wait_queue); #ifndef clear_bit_unlock_is_negative_byte @@ -1489,124 +1485,116 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem #endif /** - * unlock_page - unlock a locked page - * @page: the page + * folio_unlock - Unlock a locked folio. + * @folio: The folio. * - * Unlocks the page and wakes up sleepers in wait_on_page_locked(). - * Also wakes sleepers in wait_on_page_writeback() because the wakeup - * mechanism between PageLocked pages and PageWriteback pages is shared. - * But that's OK - sleepers in wait_on_page_writeback() just go back to sleep. + * Unlocks the folio and wakes up any thread sleeping on the page lock. * - * Note that this depends on PG_waiters being the sign bit in the byte - * that contains PG_locked - thus the BUILD_BUG_ON(). That allows us to - * clear the PG_locked bit and test PG_waiters at the same time fairly - * portably (architectures that do LL/SC can test any bit, while x86 can - * test the sign bit). + * Context: May be called from interrupt or process context. May not be + * called from NMI context. */ -void unlock_page(struct page *page) +void folio_unlock(struct folio *folio) { + /* Bit 7 allows x86 to check the byte's sign bit */ BUILD_BUG_ON(PG_waiters != 7); - page = compound_head(page); - VM_BUG_ON_PAGE(!PageLocked(page), page); - if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags)) - wake_up_page_bit(page, PG_locked); + BUILD_BUG_ON(PG_locked > 7); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + if (clear_bit_unlock_is_negative_byte(PG_locked, folio_flags(folio, 0))) + folio_wake_bit(folio, PG_locked); } -EXPORT_SYMBOL(unlock_page); +EXPORT_SYMBOL(folio_unlock); /** - * end_page_private_2 - Clear PG_private_2 and release any waiters - * @page: The page + * folio_end_private_2 - Clear PG_private_2 and wake any waiters. + * @folio: The folio. * - * Clear the PG_private_2 bit on a page and wake up any sleepers waiting for - * this. The page ref held for PG_private_2 being set is released. + * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for + * it. The folio reference held for PG_private_2 being set is released. * - * This is, for example, used when a netfs page is being written to a local - * disk cache, thereby allowing writes to the cache for the same page to be + * This is, for example, used when a netfs folio is being written to a local + * disk cache, thereby allowing writes to the cache for the same folio to be * serialised. */ -void end_page_private_2(struct page *page) +void folio_end_private_2(struct folio *folio) { - page = compound_head(page); - VM_BUG_ON_PAGE(!PagePrivate2(page), page); - clear_bit_unlock(PG_private_2, &page->flags); - wake_up_page_bit(page, PG_private_2); - put_page(page); + VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio); + clear_bit_unlock(PG_private_2, folio_flags(folio, 0)); + folio_wake_bit(folio, PG_private_2); + folio_put(folio); } -EXPORT_SYMBOL(end_page_private_2); +EXPORT_SYMBOL(folio_end_private_2); /** - * wait_on_page_private_2 - Wait for PG_private_2 to be cleared on a page - * @page: The page to wait on + * folio_wait_private_2 - Wait for PG_private_2 to be cleared on a folio. + * @folio: The folio to wait on. * - * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page. + * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio. */ -void wait_on_page_private_2(struct page *page) +void folio_wait_private_2(struct folio *folio) { - page = compound_head(page); - while (PagePrivate2(page)) - wait_on_page_bit(page, PG_private_2); + while (folio_test_private_2(folio)) + folio_wait_bit(folio, PG_private_2); } -EXPORT_SYMBOL(wait_on_page_private_2); +EXPORT_SYMBOL(folio_wait_private_2); /** - * wait_on_page_private_2_killable - Wait for PG_private_2 to be cleared on a page - * @page: The page to wait on + * folio_wait_private_2_killable - Wait for PG_private_2 to be cleared on a folio. + * @folio: The folio to wait on. * - * Wait for PG_private_2 (aka PG_fscache) to be cleared on a page or until a + * Wait for PG_private_2 (aka PG_fscache) to be cleared on a folio or until a * fatal signal is received by the calling task. * * Return: * - 0 if successful. * - -EINTR if a fatal signal was encountered. */ -int wait_on_page_private_2_killable(struct page *page) +int folio_wait_private_2_killable(struct folio *folio) { int ret = 0; - page = compound_head(page); - while (PagePrivate2(page)) { - ret = wait_on_page_bit_killable(page, PG_private_2); + while (folio_test_private_2(folio)) { + ret = folio_wait_bit_killable(folio, PG_private_2); if (ret < 0) break; } return ret; } -EXPORT_SYMBOL(wait_on_page_private_2_killable); +EXPORT_SYMBOL(folio_wait_private_2_killable); /** - * end_page_writeback - end writeback against a page - * @page: the page + * folio_end_writeback - End writeback against a folio. + * @folio: The folio. */ -void end_page_writeback(struct page *page) +void folio_end_writeback(struct folio *folio) { /* - * TestClearPageReclaim could be used here but it is an atomic - * operation and overkill in this particular case. Failing to - * shuffle a page marked for immediate reclaim is too mild to - * justify taking an atomic operation penalty at the end of - * ever page writeback. + * folio_test_clear_reclaim() could be used here but it is an + * atomic operation and overkill in this particular case. Failing + * to shuffle a folio marked for immediate reclaim is too mild + * a gain to justify taking an atomic operation penalty at the + * end of every folio writeback. */ - if (PageReclaim(page)) { - ClearPageReclaim(page); - rotate_reclaimable_page(page); + if (folio_test_reclaim(folio)) { + folio_clear_reclaim(folio); + folio_rotate_reclaimable(folio); } /* - * Writeback does not hold a page reference of its own, relying + * Writeback does not hold a folio reference of its own, relying * on truncation to wait for the clearing of PG_writeback. - * But here we must make sure that the page is not freed and - * reused before the wake_up_page(). + * But here we must make sure that the folio is not freed and + * reused before the folio_wake(). */ - get_page(page); - if (!test_clear_page_writeback(page)) + folio_get(folio); + if (!__folio_end_writeback(folio)) BUG(); smp_mb__after_atomic(); - wake_up_page(page, PG_writeback); - put_page(page); + folio_wake(folio, PG_writeback); + folio_put(folio); } -EXPORT_SYMBOL(end_page_writeback); +EXPORT_SYMBOL(folio_end_writeback); /* * After completing I/O on a page, call this routine to update the page @@ -1637,39 +1625,35 @@ void page_endio(struct page *page, bool is_write, int err) EXPORT_SYMBOL_GPL(page_endio); /** - * __lock_page - get a lock on the page, assuming we need to sleep to get it - * @__page: the page to lock + * __folio_lock - Get a lock on the folio, assuming we need to sleep to get it. + * @folio: The folio to lock */ -void __lock_page(struct page *__page) +void __folio_lock(struct folio *folio) { - struct page *page = compound_head(__page); - wait_queue_head_t *q = page_waitqueue(page); - wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, + folio_wait_bit_common(folio, PG_locked, TASK_UNINTERRUPTIBLE, EXCLUSIVE); } -EXPORT_SYMBOL(__lock_page); +EXPORT_SYMBOL(__folio_lock); -int __lock_page_killable(struct page *__page) +int __folio_lock_killable(struct folio *folio) { - struct page *page = compound_head(__page); - wait_queue_head_t *q = page_waitqueue(page); - return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, + return folio_wait_bit_common(folio, PG_locked, TASK_KILLABLE, EXCLUSIVE); } -EXPORT_SYMBOL_GPL(__lock_page_killable); +EXPORT_SYMBOL_GPL(__folio_lock_killable); -int __lock_page_async(struct page *page, struct wait_page_queue *wait) +static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait) { - struct wait_queue_head *q = page_waitqueue(page); + struct wait_queue_head *q = folio_waitqueue(folio); int ret = 0; - wait->page = page; + wait->folio = folio; wait->bit_nr = PG_locked; spin_lock_irq(&q->lock); __add_wait_queue_entry_tail(q, &wait->wait); - SetPageWaiters(page); - ret = !trylock_page(page); + folio_set_waiters(folio); + ret = !folio_trylock(folio); /* * If we were successful now, we know we're still on the * waitqueue as we're still under the lock. This means it's @@ -1686,16 +1670,16 @@ int __lock_page_async(struct page *page, struct wait_page_queue *wait) /* * Return values: - * 1 - page is locked; mmap_lock is still held. - * 0 - page is not locked. + * true - folio is locked; mmap_lock is still held. + * false - folio is not locked. * mmap_lock has been released (mmap_read_unlock(), unless flags had both * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in * which case mmap_lock is still held. * - * If neither ALLOW_RETRY nor KILLABLE are set, will always return 1 - * with the page locked and the mmap_lock unperturbed. + * If neither ALLOW_RETRY nor KILLABLE are set, will always return true + * with the folio locked and the mmap_lock unperturbed. */ -int __lock_page_or_retry(struct page *page, struct mm_struct *mm, +bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, unsigned int flags) { if (fault_flag_allow_retry_first(flags)) { @@ -1704,28 +1688,28 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm, * even though return 0. */ if (flags & FAULT_FLAG_RETRY_NOWAIT) - return 0; + return false; mmap_read_unlock(mm); if (flags & FAULT_FLAG_KILLABLE) - wait_on_page_locked_killable(page); + folio_wait_locked_killable(folio); else - wait_on_page_locked(page); - return 0; + folio_wait_locked(folio); + return false; } if (flags & FAULT_FLAG_KILLABLE) { - int ret; + bool ret; - ret = __lock_page_killable(page); + ret = __folio_lock_killable(folio); if (ret) { mmap_read_unlock(mm); - return 0; + return false; } } else { - __lock_page(page); + __folio_lock(folio); } - return 1; + return true; } /** @@ -1801,143 +1785,155 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping, EXPORT_SYMBOL(page_cache_prev_miss); /* + * Lockless page cache protocol: + * On the lookup side: + * 1. Load the folio from i_pages + * 2. Increment the refcount if it's not zero + * 3. If the folio is not found by xas_reload(), put the refcount and retry + * + * On the removal side: + * A. Freeze the page (by zeroing the refcount if nobody else has a reference) + * B. Remove the page from i_pages + * C. Return the page to the page allocator + * + * This means that any page may have its reference count temporarily + * increased by a speculative page cache (or fast GUP) lookup as it can + * be allocated by another user before the RCU grace period expires. + * Because the refcount temporarily acquired here may end up being the + * last refcount on the page, any page allocation must be freeable by + * folio_put(). + */ + +/* * mapping_get_entry - Get a page cache entry. * @mapping: the address_space to search * @index: The page cache index. * - * Looks up the page cache slot at @mapping & @index. If there is a - * page cache page, the head page is returned with an increased refcount. + * Looks up the page cache entry at @mapping & @index. If it is a folio, + * it is returned with an increased refcount. If it is a shadow entry + * of a previously evicted folio, or a swap entry from shmem/tmpfs, + * it is returned without further action. * - * If the slot holds a shadow entry of a previously evicted page, or a - * swap entry from shmem/tmpfs, it is returned. - * - * Return: The head page or shadow entry, %NULL if nothing is found. + * Return: The folio, swap or shadow entry, %NULL if nothing is found. */ -static struct page *mapping_get_entry(struct address_space *mapping, - pgoff_t index) +static void *mapping_get_entry(struct address_space *mapping, pgoff_t index) { XA_STATE(xas, &mapping->i_pages, index); - struct page *page; + struct folio *folio; rcu_read_lock(); repeat: xas_reset(&xas); - page = xas_load(&xas); - if (xas_retry(&xas, page)) + folio = xas_load(&xas); + if (xas_retry(&xas, folio)) goto repeat; /* * A shadow entry of a recently evicted page, or a swap entry from * shmem/tmpfs. Return it without attempting to raise page count. */ - if (!page || xa_is_value(page)) + if (!folio || xa_is_value(folio)) goto out; - if (!page_cache_get_speculative(page)) + if (!folio_try_get_rcu(folio)) goto repeat; - /* - * Has the page moved or been split? - * This is part of the lockless pagecache protocol. See - * include/linux/pagemap.h for details. - */ - if (unlikely(page != xas_reload(&xas))) { - put_page(page); + if (unlikely(folio != xas_reload(&xas))) { + folio_put(folio); goto repeat; } out: rcu_read_unlock(); - return page; + return folio; } /** - * pagecache_get_page - Find and get a reference to a page. + * __filemap_get_folio - Find and get a reference to a folio. * @mapping: The address_space to search. * @index: The page index. - * @fgp_flags: %FGP flags modify how the page is returned. - * @gfp_mask: Memory allocation flags to use if %FGP_CREAT is specified. + * @fgp_flags: %FGP flags modify how the folio is returned. + * @gfp: Memory allocation flags to use if %FGP_CREAT is specified. * * Looks up the page cache entry at @mapping & @index. * * @fgp_flags can be zero or more of these flags: * - * * %FGP_ACCESSED - The page will be marked accessed. - * * %FGP_LOCK - The page is returned locked. - * * %FGP_HEAD - If the page is present and a THP, return the head page - * rather than the exact page specified by the index. + * * %FGP_ACCESSED - The folio will be marked accessed. + * * %FGP_LOCK - The folio is returned locked. * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it - * instead of allocating a new page to replace it. + * instead of allocating a new folio to replace it. * * %FGP_CREAT - If no page is present then a new page is allocated using - * @gfp_mask and added to the page cache and the VM's LRU list. + * @gfp and added to the page cache and the VM's LRU list. * The page is returned locked and with an increased refcount. * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the * page is already in cache. If the page was allocated, unlock it before * returning so the caller can do the same dance. - * * %FGP_WRITE - The page will be written - * * %FGP_NOFS - __GFP_FS will get cleared in gfp mask - * * %FGP_NOWAIT - Don't get blocked by page lock + * * %FGP_WRITE - The page will be written to by the caller. + * * %FGP_NOFS - __GFP_FS will get cleared in gfp. + * * %FGP_NOWAIT - Don't get blocked by page lock. + * * %FGP_STABLE - Wait for the folio to be stable (finished writeback) * * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even * if the %GFP flags specified for %FGP_CREAT are atomic. * * If there is a page cache page, it is returned with an increased refcount. * - * Return: The found page or %NULL otherwise. + * Return: The found folio or %NULL otherwise. */ -struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index, - int fgp_flags, gfp_t gfp_mask) +struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index, + int fgp_flags, gfp_t gfp) { - struct page *page; + struct folio *folio; repeat: - page = mapping_get_entry(mapping, index); - if (xa_is_value(page)) { + folio = mapping_get_entry(mapping, index); + if (xa_is_value(folio)) { if (fgp_flags & FGP_ENTRY) - return page; - page = NULL; + return folio; + folio = NULL; } - if (!page) + if (!folio) goto no_page; if (fgp_flags & FGP_LOCK) { if (fgp_flags & FGP_NOWAIT) { - if (!trylock_page(page)) { - put_page(page); + if (!folio_trylock(folio)) { + folio_put(folio); return NULL; } } else { - lock_page(page); + folio_lock(folio); } /* Has the page been truncated? */ - if (unlikely(page->mapping != mapping)) { - unlock_page(page); - put_page(page); + if (unlikely(folio->mapping != mapping)) { + folio_unlock(folio); + folio_put(folio); goto repeat; } - VM_BUG_ON_PAGE(!thp_contains(page, index), page); + VM_BUG_ON_FOLIO(!folio_contains(folio, index), folio); } if (fgp_flags & FGP_ACCESSED) - mark_page_accessed(page); + folio_mark_accessed(folio); else if (fgp_flags & FGP_WRITE) { /* Clear idle flag for buffer write */ - if (page_is_idle(page)) - clear_page_idle(page); + if (folio_test_idle(folio)) + folio_clear_idle(folio); } - if (!(fgp_flags & FGP_HEAD)) - page = find_subpage(page, index); + if (fgp_flags & FGP_STABLE) + folio_wait_stable(folio); no_page: - if (!page && (fgp_flags & FGP_CREAT)) { + if (!folio && (fgp_flags & FGP_CREAT)) { int err; if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping)) - gfp_mask |= __GFP_WRITE; + gfp |= __GFP_WRITE; if (fgp_flags & FGP_NOFS) - gfp_mask &= ~__GFP_FS; + gfp &= ~__GFP_FS; - page = __page_cache_alloc(gfp_mask); - if (!page) + folio = filemap_alloc_folio(gfp, 0); + if (!folio) return NULL; if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP)))) @@ -1945,27 +1941,27 @@ no_page: /* Init accessed so avoid atomic mark_page_accessed later */ if (fgp_flags & FGP_ACCESSED) - __SetPageReferenced(page); + __folio_set_referenced(folio); - err = add_to_page_cache_lru(page, mapping, index, gfp_mask); + err = filemap_add_folio(mapping, folio, index, gfp); if (unlikely(err)) { - put_page(page); - page = NULL; + folio_put(folio); + folio = NULL; if (err == -EEXIST) goto repeat; } /* - * add_to_page_cache_lru locks the page, and for mmap we expect - * an unlocked page. + * filemap_add_folio locks the page, and for mmap + * we expect an unlocked page. */ - if (page && (fgp_flags & FGP_FOR_MMAP)) - unlock_page(page); + if (folio && (fgp_flags & FGP_FOR_MMAP)) + folio_unlock(folio); } - return page; + return folio; } -EXPORT_SYMBOL(pagecache_get_page); +EXPORT_SYMBOL(__filemap_get_folio); static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max, xa_mark_t mark) @@ -2420,6 +2416,7 @@ static int filemap_update_page(struct kiocb *iocb, struct address_space *mapping, struct iov_iter *iter, struct page *page) { + struct folio *folio = page_folio(page); int error; if (iocb->ki_flags & IOCB_NOWAIT) { @@ -2429,40 +2426,40 @@ static int filemap_update_page(struct kiocb *iocb, filemap_invalidate_lock_shared(mapping); } - if (!trylock_page(page)) { + if (!folio_trylock(folio)) { error = -EAGAIN; if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) goto unlock_mapping; if (!(iocb->ki_flags & IOCB_WAITQ)) { filemap_invalidate_unlock_shared(mapping); - put_and_wait_on_page_locked(page, TASK_KILLABLE); + put_and_wait_on_page_locked(&folio->page, TASK_KILLABLE); return AOP_TRUNCATED_PAGE; } - error = __lock_page_async(page, iocb->ki_waitq); + error = __folio_lock_async(folio, iocb->ki_waitq); if (error) goto unlock_mapping; } error = AOP_TRUNCATED_PAGE; - if (!page->mapping) + if (!folio->mapping) goto unlock; error = 0; - if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page)) + if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, &folio->page)) goto unlock; error = -EAGAIN; if (iocb->ki_flags & (IOCB_NOIO | IOCB_NOWAIT | IOCB_WAITQ)) goto unlock; - error = filemap_read_page(iocb->ki_filp, mapping, page); + error = filemap_read_page(iocb->ki_filp, mapping, &folio->page); goto unlock_mapping; unlock: - unlock_page(page); + folio_unlock(folio); unlock_mapping: filemap_invalidate_unlock_shared(mapping); if (error == AOP_TRUNCATED_PAGE) - put_page(page); + folio_put(folio); return error; } @@ -2899,7 +2896,9 @@ unlock: static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page, struct file **fpin) { - if (trylock_page(page)) + struct folio *folio = page_folio(page); + + if (folio_trylock(folio)) return 1; /* @@ -2912,7 +2911,7 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page, *fpin = maybe_unlock_mmap_for_io(vmf, *fpin); if (vmf->flags & FAULT_FLAG_KILLABLE) { - if (__lock_page_killable(page)) { + if (__folio_lock_killable(folio)) { /* * We didn't have the right flags to drop the mmap_lock, * but all fault_handlers only check for fatal signals @@ -2924,11 +2923,11 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page, return 0; } } else - __lock_page(page); + __folio_lock(folio); + return 1; } - /* * Synchronous readahead happens when we don't even find a page in the page * cache at all. We don't want to perform IO under the mmap sem, so if we have @@ -3707,28 +3706,6 @@ out: } EXPORT_SYMBOL(generic_file_direct_write); -/* - * Find or create a page at the given pagecache position. Return the locked - * page. This function is specifically for buffered writes. - */ -struct page *grab_cache_page_write_begin(struct address_space *mapping, - pgoff_t index, unsigned flags) -{ - struct page *page; - int fgp_flags = FGP_LOCK|FGP_WRITE|FGP_CREAT; - - if (flags & AOP_FLAG_NOFS) - fgp_flags |= FGP_NOFS; - - page = pagecache_get_page(mapping, index, fgp_flags, - mapping_gfp_mask(mapping)); - if (page) - wait_for_stable_page(page); - - return page; -} -EXPORT_SYMBOL(grab_cache_page_write_begin); - ssize_t generic_perform_write(struct file *file, struct iov_iter *i, loff_t pos) { diff --git a/mm/folio-compat.c b/mm/folio-compat.c new file mode 100644 index 000000000000..5b6ae1da314e --- /dev/null +++ b/mm/folio-compat.c @@ -0,0 +1,142 @@ +/* + * Compatibility functions which bloat the callers too much to make inline. + * All of the callers of these functions should be converted to use folios + * eventually. + */ + +#include <linux/migrate.h> +#include <linux/pagemap.h> +#include <linux/swap.h> + +struct address_space *page_mapping(struct page *page) +{ + return folio_mapping(page_folio(page)); +} +EXPORT_SYMBOL(page_mapping); + +void unlock_page(struct page *page) +{ + return folio_unlock(page_folio(page)); +} +EXPORT_SYMBOL(unlock_page); + +void end_page_writeback(struct page *page) +{ + return folio_end_writeback(page_folio(page)); +} +EXPORT_SYMBOL(end_page_writeback); + +void wait_on_page_writeback(struct page *page) +{ + return folio_wait_writeback(page_folio(page)); +} +EXPORT_SYMBOL_GPL(wait_on_page_writeback); + +void wait_for_stable_page(struct page *page) +{ + return folio_wait_stable(page_folio(page)); +} +EXPORT_SYMBOL_GPL(wait_for_stable_page); + +bool page_mapped(struct page *page) +{ + return folio_mapped(page_folio(page)); +} +EXPORT_SYMBOL(page_mapped); + +void mark_page_accessed(struct page *page) +{ + folio_mark_accessed(page_folio(page)); +} +EXPORT_SYMBOL(mark_page_accessed); + +#ifdef CONFIG_MIGRATION +int migrate_page_move_mapping(struct address_space *mapping, + struct page *newpage, struct page *page, int extra_count) +{ + return folio_migrate_mapping(mapping, page_folio(newpage), + page_folio(page), extra_count); +} +EXPORT_SYMBOL(migrate_page_move_mapping); + +void migrate_page_states(struct page *newpage, struct page *page) +{ + folio_migrate_flags(page_folio(newpage), page_folio(page)); +} +EXPORT_SYMBOL(migrate_page_states); + +void migrate_page_copy(struct page *newpage, struct page *page) +{ + folio_migrate_copy(page_folio(newpage), page_folio(page)); +} +EXPORT_SYMBOL(migrate_page_copy); +#endif + +bool set_page_writeback(struct page *page) +{ + return folio_start_writeback(page_folio(page)); +} +EXPORT_SYMBOL(set_page_writeback); + +bool set_page_dirty(struct page *page) +{ + return folio_mark_dirty(page_folio(page)); +} +EXPORT_SYMBOL(set_page_dirty); + +int __set_page_dirty_nobuffers(struct page *page) +{ + return filemap_dirty_folio(page_mapping(page), page_folio(page)); +} +EXPORT_SYMBOL(__set_page_dirty_nobuffers); + +bool clear_page_dirty_for_io(struct page *page) +{ + return folio_clear_dirty_for_io(page_folio(page)); +} +EXPORT_SYMBOL(clear_page_dirty_for_io); + +bool redirty_page_for_writepage(struct writeback_control *wbc, + struct page *page) +{ + return folio_redirty_for_writepage(wbc, page_folio(page)); +} +EXPORT_SYMBOL(redirty_page_for_writepage); + +void lru_cache_add(struct page *page) +{ + folio_add_lru(page_folio(page)); +} +EXPORT_SYMBOL(lru_cache_add); + +int add_to_page_cache_lru(struct page *page, struct address_space *mapping, + pgoff_t index, gfp_t gfp) +{ + return filemap_add_folio(mapping, page_folio(page), index, gfp); +} +EXPORT_SYMBOL(add_to_page_cache_lru); + +noinline +struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index, + int fgp_flags, gfp_t gfp) +{ + struct folio *folio; + + folio = __filemap_get_folio(mapping, index, fgp_flags, gfp); + if ((fgp_flags & FGP_HEAD) || !folio || xa_is_value(folio)) + return &folio->page; + return folio_file_page(folio, index); +} +EXPORT_SYMBOL(pagecache_get_page); + +struct page *grab_cache_page_write_begin(struct address_space *mapping, + pgoff_t index, unsigned flags) +{ + unsigned fgp_flags = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE; + + if (flags & AOP_FLAG_NOFS) + fgp_flags |= FGP_NOFS; + return pagecache_get_page(mapping, index, fgp_flags, + mapping_gfp_mask(mapping)); +} +EXPORT_SYMBOL(grab_cache_page_write_begin); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5e9ef0fc261e..e5483347291c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -603,7 +603,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, VM_BUG_ON_PAGE(!PageCompound(page), page); - if (mem_cgroup_charge(page, vma->vm_mm, gfp)) { + if (mem_cgroup_charge(page_folio(page), vma->vm_mm, gfp)) { put_page(page); count_vm_event(THP_FAULT_FALLBACK); count_vm_event(THP_FAULT_FALLBACK_CHARGE); @@ -2405,7 +2405,8 @@ static void __split_huge_page_tail(struct page *head, int tail, static void __split_huge_page(struct page *page, struct list_head *list, pgoff_t end) { - struct page *head = compound_head(page); + struct folio *folio = page_folio(page); + struct page *head = &folio->page; struct lruvec *lruvec; struct address_space *swap_cache = NULL; unsigned long offset = 0; @@ -2424,7 +2425,9 @@ static void __split_huge_page(struct page *page, struct list_head *list, } /* lock lru list/PageCompound, ref frozen by page_ref_freeze */ - lruvec = lock_page_lruvec(head); + lruvec = folio_lruvec_lock(folio); + + ClearPageHasHWPoisoned(head); for (i = nr - 1; i >= 1; i--) { __split_huge_page_tail(head, i, lruvec, list); @@ -2700,12 +2703,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) if (mapping) { int nr = thp_nr_pages(head); - if (PageSwapBacked(head)) + if (PageSwapBacked(head)) { __mod_lruvec_page_state(head, NR_SHMEM_THPS, -nr); - else + } else { __mod_lruvec_page_state(head, NR_FILE_THPS, -nr); + filemap_nr_thps_dec(mapping); + } } __split_huge_page(page, list, end); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 95dc7b83381f..6378c1066459 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5302,7 +5302,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, *pagep = NULL; goto out; } - copy_huge_page(page, *pagep); + folio_copy(page_folio(page), page_folio(*pagep)); put_page(*pagep); *pagep = NULL; } diff --git a/mm/internal.h b/mm/internal.h index cf3cb933eba3..b1001ebeb286 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -34,7 +34,16 @@ void page_writeback_init(void); +static inline void *folio_raw_mapping(struct folio *folio) +{ + unsigned long mapping = (unsigned long)folio->mapping; + + return (void *)(mapping & ~PAGE_MAPPING_FLAGS); +} + vm_fault_t do_swap_page(struct vm_fault *vmf); +void folio_rotate_reclaimable(struct folio *folio); +bool __folio_end_writeback(struct folio *folio); void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, unsigned long floor, unsigned long ceiling); @@ -63,17 +72,28 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, pgoff_t end, struct pagevec *pvec, pgoff_t *indices); /** - * page_evictable - test whether a page is evictable - * @page: the page to test + * folio_evictable - Test whether a folio is evictable. + * @folio: The folio to test. * - * Test whether page is evictable--i.e., should be placed on active/inactive - * lists vs unevictable list. - * - * Reasons page might not be evictable: - * (1) page's mapping marked unevictable - * (2) page is part of an mlocked VMA + * Test whether @folio is evictable -- i.e., should be placed on + * active/inactive lists vs unevictable list. * + * Reasons folio might not be evictable: + * 1. folio's mapping marked unevictable + * 2. One of the pages in the folio is part of an mlocked VMA */ +static inline bool folio_evictable(struct folio *folio) +{ + bool ret; + + /* Prevent address_space of inode and swap cache from being freed */ + rcu_read_lock(); + ret = !mapping_unevictable(folio_mapping(folio)) && + !folio_test_mlocked(folio); + rcu_read_unlock(); + return ret; +} + static inline bool page_evictable(struct page *page) { bool ret; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 045cc579f724..5f02fda6f265 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -445,22 +445,25 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, if (!transhuge_vma_enabled(vma, vm_flags)) return false; + if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - + vma->vm_pgoff, HPAGE_PMD_NR)) + return false; + /* Enabled via shmem mount options or sysfs settings. */ - if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) { - return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR); - } + if (shmem_file(vma->vm_file)) + return shmem_huge_enabled(vma); /* THP settings require madvise. */ if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always()) return false; - /* Read-only file mappings need to be aligned for THP to work. */ + /* Only regular file is valid */ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && - !inode_is_open_for_write(vma->vm_file->f_inode) && (vm_flags & VM_EXEC)) { - return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR); + struct inode *inode = vma->vm_file->f_inode; + + return !inode_is_open_for_write(inode) && + S_ISREG(inode->i_mode); } if (!vma->anon_vma || vma->vm_ops) @@ -1087,7 +1090,7 @@ static void collapse_huge_page(struct mm_struct *mm, goto out_nolock; } - if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) { + if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) { result = SCAN_CGROUP_CHARGE_FAIL; goto out_nolock; } @@ -1211,7 +1214,7 @@ out_up_write: mmap_write_unlock(mm); out_nolock: if (!IS_ERR_OR_NULL(*hpage)) - mem_cgroup_uncharge(*hpage); + mem_cgroup_uncharge(page_folio(*hpage)); trace_mm_collapse_huge_page(mm, isolated, result); return; } @@ -1658,7 +1661,7 @@ static void collapse_file(struct mm_struct *mm, goto out; } - if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) { + if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) { result = SCAN_CGROUP_CHARGE_FAIL; goto out; } @@ -1763,6 +1766,10 @@ static void collapse_file(struct mm_struct *mm, filemap_flush(mapping); result = SCAN_FAIL; goto xa_unlocked; + } else if (PageWriteback(page)) { + xas_unlock_irq(&xas); + result = SCAN_FAIL; + goto xa_unlocked; } else if (trylock_page(page)) { get_page(page); xas_unlock_irq(&xas); @@ -1798,7 +1805,8 @@ static void collapse_file(struct mm_struct *mm, goto out_unlock; } - if (!is_shmem && PageDirty(page)) { + if (!is_shmem && (PageDirty(page) || + PageWriteback(page))) { /* * khugepaged only works on read-only fd, so this * page is dirty because it hasn't been flushed @@ -1975,7 +1983,7 @@ xa_unlocked: out: VM_BUG_ON(!list_empty(&pagelist)); if (!IS_ERR_OR_NULL(*hpage)) - mem_cgroup_uncharge(*hpage); + mem_cgroup_uncharge(page_folio(*hpage)); /* TODO: tracepoints */ } @@ -751,7 +751,7 @@ stale: /* * We come here from above when page->mapping or !PageSwapCache * suggests that the node is stale; but it might be under migration. - * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(), + * We need smp_rmb(), matching the smp_wmb() in folio_migrate_ksm(), * before checking whether node->kpfn has been changed. */ smp_rmb(); @@ -852,9 +852,14 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma, return err; } +static inline struct stable_node *folio_stable_node(struct folio *folio) +{ + return folio_test_ksm(folio) ? folio_raw_mapping(folio) : NULL; +} + static inline struct stable_node *page_stable_node(struct page *page) { - return PageKsm(page) ? page_rmapping(page) : NULL; + return folio_stable_node(page_folio(page)); } static inline void set_page_stable_node(struct page *page, @@ -2578,7 +2583,8 @@ struct page *ksm_might_need_to_copy(struct page *page, return page; /* let do_swap_page report the error */ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); - if (new_page && mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL)) { + if (new_page && + mem_cgroup_charge(page_folio(new_page), vma->vm_mm, GFP_KERNEL)) { put_page(new_page); new_page = NULL; } @@ -2658,26 +2664,26 @@ again: } #ifdef CONFIG_MIGRATION -void ksm_migrate_page(struct page *newpage, struct page *oldpage) +void folio_migrate_ksm(struct folio *newfolio, struct folio *folio) { struct stable_node *stable_node; - VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); - VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); - VM_BUG_ON_PAGE(newpage->mapping != oldpage->mapping, newpage); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + VM_BUG_ON_FOLIO(!folio_test_locked(newfolio), newfolio); + VM_BUG_ON_FOLIO(newfolio->mapping != folio->mapping, newfolio); - stable_node = page_stable_node(newpage); + stable_node = folio_stable_node(folio); if (stable_node) { - VM_BUG_ON_PAGE(stable_node->kpfn != page_to_pfn(oldpage), oldpage); - stable_node->kpfn = page_to_pfn(newpage); + VM_BUG_ON_FOLIO(stable_node->kpfn != folio_pfn(folio), folio); + stable_node->kpfn = folio_pfn(newfolio); /* - * newpage->mapping was set in advance; now we need smp_wmb() + * newfolio->mapping was set in advance; now we need smp_wmb() * to make sure that the new stable_node->kpfn is visible - * to get_ksm_page() before it can see that oldpage->mapping - * has gone stale (or that PageSwapCache has been cleared). + * to get_ksm_page() before it can see that folio->mapping + * has gone stale (or that folio_test_swapcache has been cleared). */ smp_wmb(); - set_page_stable_node(oldpage, NULL); + set_page_stable_node(&folio->page, NULL); } } #endif /* CONFIG_MIGRATION */ diff --git a/mm/memblock.c b/mm/memblock.c index 5c3503c98b2f..5096500b2647 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -932,16 +932,14 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) * covered by the memory map. The struct page representing NOMAP memory * frames in the memory map will be PageReserved() * + * Note: if the memory being marked %MEMBLOCK_NOMAP was allocated from + * memblock, the caller must inform kmemleak to ignore that memory + * * Return: 0 on success, -errno on failure. */ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size) { - int ret = memblock_setclr_flag(base, size, 1, MEMBLOCK_NOMAP); - - if (!ret) - kmemleak_free_part_phys(base, size); - - return ret; + return memblock_setclr_flag(base, size, 1, MEMBLOCK_NOMAP); } /** @@ -1692,7 +1690,7 @@ void __init memblock_cap_memory_range(phys_addr_t base, phys_addr_t size) if (!size) return; - if (memblock.memory.cnt <= 1) { + if (!memblock_memory->total_size) { pr_warn("%s: No memory registered yet\n", __func__); return; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6da5020a8656..8dab23a71fc4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -456,28 +456,6 @@ ino_t page_cgroup_ino(struct page *page) return ino; } -static struct mem_cgroup_per_node * -mem_cgroup_page_nodeinfo(struct mem_cgroup *memcg, struct page *page) -{ - int nid = page_to_nid(page); - - return memcg->nodeinfo[nid]; -} - -static struct mem_cgroup_tree_per_node * -soft_limit_tree_node(int nid) -{ - return soft_limit_tree.rb_tree_per_node[nid]; -} - -static struct mem_cgroup_tree_per_node * -soft_limit_tree_from_page(struct page *page) -{ - int nid = page_to_nid(page); - - return soft_limit_tree.rb_tree_per_node[nid]; -} - static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz, struct mem_cgroup_tree_per_node *mctz, unsigned long new_usage_in_excess) @@ -548,13 +526,13 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg) return excess; } -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) +static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid) { unsigned long excess; struct mem_cgroup_per_node *mz; struct mem_cgroup_tree_per_node *mctz; - mctz = soft_limit_tree_from_page(page); + mctz = soft_limit_tree.rb_tree_per_node[nid]; if (!mctz) return; /* @@ -562,7 +540,7 @@ static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page) * because their event counter is not touched. */ for (; memcg; memcg = parent_mem_cgroup(memcg)) { - mz = mem_cgroup_page_nodeinfo(memcg, page); + mz = memcg->nodeinfo[nid]; excess = soft_limit_excess(memcg); /* * We have to update the tree if mz is on RB-tree or @@ -593,7 +571,7 @@ static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) for_each_node(nid) { mz = memcg->nodeinfo[nid]; - mctz = soft_limit_tree_node(nid); + mctz = soft_limit_tree.rb_tree_per_node[nid]; if (mctz) mem_cgroup_remove_exceeded(mz, mctz); } @@ -799,7 +777,6 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event) } static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, - struct page *page, int nr_pages) { /* pagein of a big page is an event. So, ignore page size */ @@ -842,7 +819,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, * Check events in order. * */ -static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) +static void memcg_check_events(struct mem_cgroup *memcg, int nid) { /* threshold event is triggered in finer grain than soft limit */ if (unlikely(mem_cgroup_event_ratelimit(memcg, @@ -853,7 +830,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page) MEM_CGROUP_TARGET_SOFTLIMIT); mem_cgroup_threshold(memcg); if (unlikely(do_softlimit)) - mem_cgroup_update_tree(memcg, page); + mem_cgroup_update_tree(memcg, nid); } } @@ -1149,64 +1126,88 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg, } #ifdef CONFIG_DEBUG_VM -void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page) +void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio) { struct mem_cgroup *memcg; if (mem_cgroup_disabled()) return; - memcg = page_memcg(page); + memcg = folio_memcg(folio); if (!memcg) - VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page); + VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != root_mem_cgroup, folio); else - VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != memcg, page); + VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio); } #endif /** - * lock_page_lruvec - lock and return lruvec for a given page. - * @page: the page + * folio_lruvec_lock - Lock the lruvec for a folio. + * @folio: Pointer to the folio. * * These functions are safe to use under any of the following conditions: - * - page locked - * - PageLRU cleared - * - lock_page_memcg() - * - page->_refcount is zero + * - folio locked + * - folio_test_lru false + * - folio_memcg_lock() + * - folio frozen (refcount of 0) + * + * Return: The lruvec this folio is on with its lock held. */ -struct lruvec *lock_page_lruvec(struct page *page) +struct lruvec *folio_lruvec_lock(struct folio *folio) { - struct lruvec *lruvec; + struct lruvec *lruvec = folio_lruvec(folio); - lruvec = mem_cgroup_page_lruvec(page); spin_lock(&lruvec->lru_lock); - - lruvec_memcg_debug(lruvec, page); + lruvec_memcg_debug(lruvec, folio); return lruvec; } -struct lruvec *lock_page_lruvec_irq(struct page *page) +/** + * folio_lruvec_lock_irq - Lock the lruvec for a folio. + * @folio: Pointer to the folio. + * + * These functions are safe to use under any of the following conditions: + * - folio locked + * - folio_test_lru false + * - folio_memcg_lock() + * - folio frozen (refcount of 0) + * + * Return: The lruvec this folio is on with its lock held and interrupts + * disabled. + */ +struct lruvec *folio_lruvec_lock_irq(struct folio *folio) { - struct lruvec *lruvec; + struct lruvec *lruvec = folio_lruvec(folio); - lruvec = mem_cgroup_page_lruvec(page); spin_lock_irq(&lruvec->lru_lock); - - lruvec_memcg_debug(lruvec, page); + lruvec_memcg_debug(lruvec, folio); return lruvec; } -struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags) +/** + * folio_lruvec_lock_irqsave - Lock the lruvec for a folio. + * @folio: Pointer to the folio. + * @flags: Pointer to irqsave flags. + * + * These functions are safe to use under any of the following conditions: + * - folio locked + * - folio_test_lru false + * - folio_memcg_lock() + * - folio frozen (refcount of 0) + * + * Return: The lruvec this folio is on with its lock held and interrupts + * disabled. + */ +struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio, + unsigned long *flags) { - struct lruvec *lruvec; + struct lruvec *lruvec = folio_lruvec(folio); - lruvec = mem_cgroup_page_lruvec(page); spin_lock_irqsave(&lruvec->lru_lock, *flags); - - lruvec_memcg_debug(lruvec, page); + lruvec_memcg_debug(lruvec, folio); return lruvec; } @@ -1956,18 +1957,17 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) } /** - * lock_page_memcg - lock a page and memcg binding - * @page: the page + * folio_memcg_lock - Bind a folio to its memcg. + * @folio: The folio. * - * This function protects unlocked LRU pages from being moved to + * This function prevents unlocked LRU folios from being moved to * another cgroup. * - * It ensures lifetime of the locked memcg. Caller is responsible - * for the lifetime of the page. + * It ensures lifetime of the bound memcg. The caller is responsible + * for the lifetime of the folio. */ -void lock_page_memcg(struct page *page) +void folio_memcg_lock(struct folio *folio) { - struct page *head = compound_head(page); /* rmap on tail pages */ struct mem_cgroup *memcg; unsigned long flags; @@ -1981,7 +1981,7 @@ void lock_page_memcg(struct page *page) if (mem_cgroup_disabled()) return; again: - memcg = page_memcg(head); + memcg = folio_memcg(folio); if (unlikely(!memcg)) return; @@ -1995,7 +1995,7 @@ again: return; spin_lock_irqsave(&memcg->move_lock, flags); - if (memcg != page_memcg(head)) { + if (memcg != folio_memcg(folio)) { spin_unlock_irqrestore(&memcg->move_lock, flags); goto again; } @@ -2009,9 +2009,15 @@ again: memcg->move_lock_task = current; memcg->move_lock_flags = flags; } +EXPORT_SYMBOL(folio_memcg_lock); + +void lock_page_memcg(struct page *page) +{ + folio_memcg_lock(page_folio(page)); +} EXPORT_SYMBOL(lock_page_memcg); -static void __unlock_page_memcg(struct mem_cgroup *memcg) +static void __folio_memcg_unlock(struct mem_cgroup *memcg) { if (memcg && memcg->move_lock_task == current) { unsigned long flags = memcg->move_lock_flags; @@ -2026,14 +2032,22 @@ static void __unlock_page_memcg(struct mem_cgroup *memcg) } /** - * unlock_page_memcg - unlock a page and memcg binding - * @page: the page + * folio_memcg_unlock - Release the binding between a folio and its memcg. + * @folio: The folio. + * + * This releases the binding created by folio_memcg_lock(). This does + * not change the accounting of this folio to its memcg, but it does + * permit others to change it. */ -void unlock_page_memcg(struct page *page) +void folio_memcg_unlock(struct folio *folio) { - struct page *head = compound_head(page); + __folio_memcg_unlock(folio_memcg(folio)); +} +EXPORT_SYMBOL(folio_memcg_unlock); - __unlock_page_memcg(page_memcg(head)); +void unlock_page_memcg(struct page *page) +{ + folio_memcg_unlock(page_folio(page)); } EXPORT_SYMBOL(unlock_page_memcg); @@ -2734,9 +2748,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages) } #endif -static void commit_charge(struct page *page, struct mem_cgroup *memcg) +static void commit_charge(struct folio *folio, struct mem_cgroup *memcg) { - VM_BUG_ON_PAGE(page_memcg(page), page); + VM_BUG_ON_FOLIO(folio_memcg(folio), folio); /* * Any of the following ensures page's memcg stability: * @@ -2745,7 +2759,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg) * - lock_page_memcg() * - exclusive reference */ - page->memcg_data = (unsigned long)memcg; + folio->memcg_data = (unsigned long)memcg; } static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg) @@ -3015,15 +3029,16 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order) */ void __memcg_kmem_uncharge_page(struct page *page, int order) { + struct folio *folio = page_folio(page); struct obj_cgroup *objcg; unsigned int nr_pages = 1 << order; - if (!PageMemcgKmem(page)) + if (!folio_memcg_kmem(folio)) return; - objcg = __page_objcg(page); + objcg = __folio_objcg(folio); obj_cgroup_uncharge_pages(objcg, nr_pages); - page->memcg_data = 0; + folio->memcg_data = 0; obj_cgroup_put(objcg); } @@ -3257,17 +3272,18 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size) */ void split_page_memcg(struct page *head, unsigned int nr) { - struct mem_cgroup *memcg = page_memcg(head); + struct folio *folio = page_folio(head); + struct mem_cgroup *memcg = folio_memcg(folio); int i; if (mem_cgroup_disabled() || !memcg) return; for (i = 1; i < nr; i++) - head[i].memcg_data = head->memcg_data; + folio_page(folio, i)->memcg_data = folio->memcg_data; - if (PageMemcgKmem(head)) - obj_cgroup_get_many(__page_objcg(head), nr - 1); + if (folio_memcg_kmem(folio)) + obj_cgroup_get_many(__folio_objcg(folio), nr - 1); else css_get_many(&memcg->css, nr - 1); } @@ -3381,7 +3397,7 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, if (order > 0) return 0; - mctz = soft_limit_tree_node(pgdat->node_id); + mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id]; /* * Do not even bother to check the largest node if the root @@ -4537,17 +4553,17 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, * As being wrong occasionally doesn't matter, updates and accesses to the * records are lockless and racy. */ -void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, +void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio, struct bdi_writeback *wb) { - struct mem_cgroup *memcg = page_memcg(page); + struct mem_cgroup *memcg = folio_memcg(folio); struct memcg_cgwb_frn *frn; u64 now = get_jiffies_64(); u64 oldest_at = now; int oldest = -1; int i; - trace_track_foreign_dirty(page, wb); + trace_track_foreign_dirty(folio, wb); /* * Pick the slot to use. If there is already a slot for @wb, keep @@ -5575,38 +5591,39 @@ static int mem_cgroup_move_account(struct page *page, struct mem_cgroup *from, struct mem_cgroup *to) { + struct folio *folio = page_folio(page); struct lruvec *from_vec, *to_vec; struct pglist_data *pgdat; - unsigned int nr_pages = compound ? thp_nr_pages(page) : 1; - int ret; + unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1; + int nid, ret; VM_BUG_ON(from == to); - VM_BUG_ON_PAGE(PageLRU(page), page); - VM_BUG_ON(compound && !PageTransHuge(page)); + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + VM_BUG_ON(compound && !folio_test_multi(folio)); /* * Prevent mem_cgroup_migrate() from looking at * page's memory cgroup of its source page while we change it. */ ret = -EBUSY; - if (!trylock_page(page)) + if (!folio_trylock(folio)) goto out; ret = -EINVAL; - if (page_memcg(page) != from) + if (folio_memcg(folio) != from) goto out_unlock; - pgdat = page_pgdat(page); + pgdat = folio_pgdat(folio); from_vec = mem_cgroup_lruvec(from, pgdat); to_vec = mem_cgroup_lruvec(to, pgdat); - lock_page_memcg(page); + folio_memcg_lock(folio); - if (PageAnon(page)) { - if (page_mapped(page)) { + if (folio_test_anon(folio)) { + if (folio_mapped(folio)) { __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages); __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages); - if (PageTransHuge(page)) { + if (folio_test_transhuge(folio)) { __mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages); __mod_lruvec_state(to_vec, NR_ANON_THPS, @@ -5617,18 +5634,18 @@ static int mem_cgroup_move_account(struct page *page, __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages); __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages); - if (PageSwapBacked(page)) { + if (folio_test_swapbacked(folio)) { __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages); __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages); } - if (page_mapped(page)) { + if (folio_mapped(folio)) { __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages); __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages); } - if (PageDirty(page)) { - struct address_space *mapping = page_mapping(page); + if (folio_test_dirty(folio)) { + struct address_space *mapping = folio_mapping(folio); if (mapping_can_writeback(mapping)) { __mod_lruvec_state(from_vec, NR_FILE_DIRTY, @@ -5639,7 +5656,7 @@ static int mem_cgroup_move_account(struct page *page, } } - if (PageWriteback(page)) { + if (folio_test_writeback(folio)) { __mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages); __mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages); } @@ -5662,20 +5679,21 @@ static int mem_cgroup_move_account(struct page *page, css_get(&to->css); css_put(&from->css); - page->memcg_data = (unsigned long)to; + folio->memcg_data = (unsigned long)to; - __unlock_page_memcg(from); + __folio_memcg_unlock(from); ret = 0; + nid = folio_nid(folio); local_irq_disable(); - mem_cgroup_charge_statistics(to, page, nr_pages); - memcg_check_events(to, page); - mem_cgroup_charge_statistics(from, page, -nr_pages); - memcg_check_events(from, page); + mem_cgroup_charge_statistics(to, nr_pages); + memcg_check_events(to, nid); + mem_cgroup_charge_statistics(from, -nr_pages); + memcg_check_events(from, nid); local_irq_enable(); out_unlock: - unlock_page(page); + folio_unlock(folio); out: return ret; } @@ -6680,9 +6698,10 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root, atomic_long_read(&parent->memory.children_low_usage))); } -static int charge_memcg(struct page *page, struct mem_cgroup *memcg, gfp_t gfp) +static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, + gfp_t gfp) { - unsigned int nr_pages = thp_nr_pages(page); + long nr_pages = folio_nr_pages(folio); int ret; ret = try_charge(memcg, gfp, nr_pages); @@ -6690,38 +6709,23 @@ static int charge_memcg(struct page *page, struct mem_cgroup *memcg, gfp_t gfp) goto out; css_get(&memcg->css); - commit_charge(page, memcg); + commit_charge(folio, memcg); local_irq_disable(); - mem_cgroup_charge_statistics(memcg, page, nr_pages); - memcg_check_events(memcg, page); + mem_cgroup_charge_statistics(memcg, nr_pages); + memcg_check_events(memcg, folio_nid(folio)); local_irq_enable(); out: return ret; } -/** - * __mem_cgroup_charge - charge a newly allocated page to a cgroup - * @page: page to charge - * @mm: mm context of the victim - * @gfp_mask: reclaim mode - * - * Try to charge @page to the memcg that @mm belongs to, reclaiming - * pages according to @gfp_mask if necessary. if @mm is NULL, try to - * charge to the active memcg. - * - * Do not use this for pages allocated for swapin. - * - * Returns 0 on success. Otherwise, an error code is returned. - */ -int __mem_cgroup_charge(struct page *page, struct mm_struct *mm, - gfp_t gfp_mask) +int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) { struct mem_cgroup *memcg; int ret; memcg = get_mem_cgroup_from_mm(mm); - ret = charge_memcg(page, memcg, gfp_mask); + ret = charge_memcg(folio, memcg, gfp); css_put(&memcg->css); return ret; @@ -6742,6 +6746,7 @@ int __mem_cgroup_charge(struct page *page, struct mm_struct *mm, int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) { + struct folio *folio = page_folio(page); struct mem_cgroup *memcg; unsigned short id; int ret; @@ -6756,7 +6761,7 @@ int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, memcg = get_mem_cgroup_from_mm(mm); rcu_read_unlock(); - ret = charge_memcg(page, memcg, gfp); + ret = charge_memcg(folio, memcg, gfp); css_put(&memcg->css); return ret; @@ -6800,7 +6805,7 @@ struct uncharge_gather { unsigned long nr_memory; unsigned long pgpgout; unsigned long nr_kmem; - struct page *dummy_page; + int nid; }; static inline void uncharge_gather_clear(struct uncharge_gather *ug) @@ -6824,36 +6829,36 @@ static void uncharge_batch(const struct uncharge_gather *ug) local_irq_save(flags); __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory); - memcg_check_events(ug->memcg, ug->dummy_page); + memcg_check_events(ug->memcg, ug->nid); local_irq_restore(flags); - /* drop reference from uncharge_page */ + /* drop reference from uncharge_folio */ css_put(&ug->memcg->css); } -static void uncharge_page(struct page *page, struct uncharge_gather *ug) +static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug) { - unsigned long nr_pages; + long nr_pages; struct mem_cgroup *memcg; struct obj_cgroup *objcg; - bool use_objcg = PageMemcgKmem(page); + bool use_objcg = folio_memcg_kmem(folio); - VM_BUG_ON_PAGE(PageLRU(page), page); + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); /* * Nobody should be changing or seriously looking at - * page memcg or objcg at this point, we have fully - * exclusive access to the page. + * folio memcg or objcg at this point, we have fully + * exclusive access to the folio. */ if (use_objcg) { - objcg = __page_objcg(page); + objcg = __folio_objcg(folio); /* * This get matches the put at the end of the function and * kmem pages do not hold memcg references anymore. */ memcg = get_mem_cgroup_from_objcg(objcg); } else { - memcg = __page_memcg(page); + memcg = __folio_memcg(folio); } if (!memcg) @@ -6865,19 +6870,19 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug) uncharge_gather_clear(ug); } ug->memcg = memcg; - ug->dummy_page = page; + ug->nid = folio_nid(folio); /* pairs with css_put in uncharge_batch */ css_get(&memcg->css); } - nr_pages = compound_nr(page); + nr_pages = folio_nr_pages(folio); if (use_objcg) { ug->nr_memory += nr_pages; ug->nr_kmem += nr_pages; - page->memcg_data = 0; + folio->memcg_data = 0; obj_cgroup_put(objcg); } else { /* LRU pages aren't accounted at the root level */ @@ -6885,28 +6890,22 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug) ug->nr_memory += nr_pages; ug->pgpgout++; - page->memcg_data = 0; + folio->memcg_data = 0; } css_put(&memcg->css); } -/** - * __mem_cgroup_uncharge - uncharge a page - * @page: page to uncharge - * - * Uncharge a page previously charged with __mem_cgroup_charge(). - */ -void __mem_cgroup_uncharge(struct page *page) +void __mem_cgroup_uncharge(struct folio *folio) { struct uncharge_gather ug; - /* Don't touch page->lru of any random page, pre-check: */ - if (!page_memcg(page)) + /* Don't touch folio->lru of any random page, pre-check: */ + if (!folio_memcg(folio)) return; uncharge_gather_clear(&ug); - uncharge_page(page, &ug); + uncharge_folio(folio, &ug); uncharge_batch(&ug); } @@ -6920,52 +6919,49 @@ void __mem_cgroup_uncharge(struct page *page) void __mem_cgroup_uncharge_list(struct list_head *page_list) { struct uncharge_gather ug; - struct page *page; + struct folio *folio; uncharge_gather_clear(&ug); - list_for_each_entry(page, page_list, lru) - uncharge_page(page, &ug); + list_for_each_entry(folio, page_list, lru) + uncharge_folio(folio, &ug); if (ug.memcg) uncharge_batch(&ug); } /** - * mem_cgroup_migrate - charge a page's replacement - * @oldpage: currently circulating page - * @newpage: replacement page + * mem_cgroup_migrate - Charge a folio's replacement. + * @old: Currently circulating folio. + * @new: Replacement folio. * - * Charge @newpage as a replacement page for @oldpage. @oldpage will + * Charge @new as a replacement folio for @old. @old will * be uncharged upon free. * - * Both pages must be locked, @newpage->mapping must be set up. + * Both folios must be locked, @new->mapping must be set up. */ -void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) +void mem_cgroup_migrate(struct folio *old, struct folio *new) { struct mem_cgroup *memcg; - unsigned int nr_pages; + long nr_pages = folio_nr_pages(new); unsigned long flags; - VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage); - VM_BUG_ON_PAGE(!PageLocked(newpage), newpage); - VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage); - VM_BUG_ON_PAGE(PageTransHuge(oldpage) != PageTransHuge(newpage), - newpage); + VM_BUG_ON_FOLIO(!folio_test_locked(old), old); + VM_BUG_ON_FOLIO(!folio_test_locked(new), new); + VM_BUG_ON_FOLIO(folio_test_anon(old) != folio_test_anon(new), new); + VM_BUG_ON_FOLIO(folio_nr_pages(old) != nr_pages, new); if (mem_cgroup_disabled()) return; - /* Page cache replacement: new page already charged? */ - if (page_memcg(newpage)) + /* Page cache replacement: new folio already charged? */ + if (folio_memcg(new)) return; - memcg = page_memcg(oldpage); - VM_WARN_ON_ONCE_PAGE(!memcg, oldpage); + memcg = folio_memcg(old); + VM_WARN_ON_ONCE_FOLIO(!memcg, old); if (!memcg) return; /* Force-charge the new page. The old one will be freed soon */ - nr_pages = thp_nr_pages(newpage); - if (!mem_cgroup_is_root(memcg)) { page_counter_charge(&memcg->memory, nr_pages); if (do_memsw_account()) @@ -6973,11 +6969,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage) } css_get(&memcg->css); - commit_charge(newpage, memcg); + commit_charge(new, memcg); local_irq_save(flags); - mem_cgroup_charge_statistics(memcg, newpage, nr_pages); - memcg_check_events(memcg, newpage); + mem_cgroup_charge_statistics(memcg, nr_pages); + memcg_check_events(memcg, folio_nid(new)); local_irq_restore(flags); } @@ -7204,8 +7200,8 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry) * only synchronisation we have for updating the per-CPU variables. */ VM_BUG_ON(!irqs_disabled()); - mem_cgroup_charge_statistics(memcg, page, -nr_entries); - memcg_check_events(memcg, page); + mem_cgroup_charge_statistics(memcg, -nr_entries); + memcg_check_events(memcg, page_to_nid(page)); css_put(&memcg->css); } diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 3e6449f2102a..93078a2859a7 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -762,7 +762,7 @@ static int delete_from_lru_cache(struct page *p) * Poisoned page might never drop its ref count to 0 so we have * to uncharge it manually from its memcg. */ - mem_cgroup_uncharge(p); + mem_cgroup_uncharge(page_folio(p)); /* * drop the page count elevated by isolate_lru_page() @@ -1147,20 +1147,6 @@ static int __get_hwpoison_page(struct page *page) if (!HWPoisonHandlable(head)) return -EBUSY; - if (PageTransHuge(head)) { - /* - * Non anonymous thp exists only in allocation/free time. We - * can't handle such a case correctly, so let's give it up. - * This should be better than triggering BUG_ON when kernel - * tries to touch the "partially handled" page. - */ - if (!PageAnon(head)) { - pr_err("Memory failure: %#lx: non anonymous thp\n", - page_to_pfn(page)); - return 0; - } - } - if (get_page_unless_zero(head)) { if (head == compound_head(page)) return 1; @@ -1708,6 +1694,20 @@ try_again: } if (PageTransHuge(hpage)) { + /* + * The flag must be set after the refcount is bumped + * otherwise it may race with THP split. + * And the flag can't be set in get_hwpoison_page() since + * it is called by soft offline too and it is just called + * for !MF_COUNT_INCREASE. So here seems to be the best + * place. + * + * Don't need care about the above error handling paths for + * get_hwpoison_page() since they handle either free page + * or unhandlable page. The refcount is bumped iff the + * page is a valid handlable page. + */ + SetPageHasHWPoisoned(hpage); if (try_to_split_thp_page(p, "Memory Failure") < 0) { action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED); res = -EBUSY; diff --git a/mm/memory.c b/mm/memory.c index adf9b9ef8277..4b1de80c2a9c 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -990,7 +990,7 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma, if (!new_page) return NULL; - if (mem_cgroup_charge(new_page, src_mm, GFP_KERNEL)) { + if (mem_cgroup_charge(page_folio(new_page), src_mm, GFP_KERNEL)) { put_page(new_page); return NULL; } @@ -3019,7 +3019,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) } } - if (mem_cgroup_charge(new_page, mm, GFP_KERNEL)) + if (mem_cgroup_charge(page_folio(new_page), mm, GFP_KERNEL)) goto oom_free_new; cgroup_throttle_swaprate(new_page, GFP_KERNEL); @@ -3539,7 +3539,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) shadow = get_shadow_from_swap_cache(entry); if (shadow) - workingset_refault(page, shadow); + workingset_refault(page_folio(page), + shadow); lru_cache_add(page); @@ -3769,7 +3770,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (!page) goto oom; - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL)) + if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) goto oom_free_page; cgroup_throttle_swaprate(page, GFP_KERNEL); @@ -3907,6 +3908,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) return ret; /* + * Just backoff if any subpage of a THP is corrupted otherwise + * the corrupted page may mapped by PMD silently to escape the + * check. This kind of THP just can be PTE mapped. Access to + * the corrupted subpage should trigger SIGBUS as expected. + */ + if (unlikely(PageHasHWPoisoned(page))) + return ret; + + /* * Archs like ppc64 need additional space to store information * related to pte entry. Use the preallocated table for that. */ @@ -4193,7 +4203,8 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf) if (!vmf->cow_page) return VM_FAULT_OOM; - if (mem_cgroup_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL)) { + if (mem_cgroup_charge(page_folio(vmf->cow_page), vma->vm_mm, + GFP_KERNEL)) { put_page(vmf->cow_page); return VM_FAULT_OOM; } @@ -4258,7 +4269,7 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf) * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults). * The mmap_lock may have been released depending on flags and our - * return value. See filemap_fault() and __lock_page_or_retry(). + * return value. See filemap_fault() and __folio_lock_or_retry(). * If mmap_lock is released, vma may become invalid (for example * by other thread calling munmap()). */ @@ -4499,7 +4510,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud) * concurrent faults). * * The mmap_lock may have been released depending on flags and our return value. - * See filemap_fault() and __lock_page_or_retry(). + * See filemap_fault() and __folio_lock_or_retry(). */ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) { @@ -4603,7 +4614,7 @@ unlock: * By the time we get here, we already hold the mm semaphore * * The mmap_lock may have been released depending on flags and our - * return value. See filemap_fault() and __lock_page_or_retry(). + * return value. See filemap_fault() and __folio_lock_or_retry(). */ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags) @@ -4759,7 +4770,7 @@ static inline void mm_account_fault(struct pt_regs *regs, * By the time we get here, we already hold the mm semaphore * * The mmap_lock may have been released depending on flags and our - * return value. See filemap_fault() and __lock_page_or_retry(). + * return value. See filemap_fault() and __folio_lock_or_retry(). */ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, struct pt_regs *regs) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 1592b081c58e..f4b4be7af4d3 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -856,16 +856,6 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, goto out; } - if (flags & MPOL_F_NUMA_BALANCING) { - if (new && new->mode == MPOL_BIND) { - new->flags |= (MPOL_F_MOF | MPOL_F_MORON); - } else { - ret = -EINVAL; - mpol_put(new); - goto out; - } - } - ret = mpol_set_nodemask(new, nodes, scratch); if (ret) { mpol_put(new); @@ -1458,7 +1448,11 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) return -EINVAL; if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) return -EINVAL; - + if (*flags & MPOL_F_NUMA_BALANCING) { + if (*mode != MPOL_BIND) + return -EINVAL; + *flags |= (MPOL_F_MOF | MPOL_F_MORON); + } return 0; } @@ -2202,6 +2196,16 @@ struct page *alloc_pages(gfp_t gfp, unsigned order) } EXPORT_SYMBOL(alloc_pages); +struct folio *folio_alloc(gfp_t gfp, unsigned order) +{ + struct page *page = alloc_pages(gfp | __GFP_COMP, order); + + if (page && order > 1) + prep_transhuge_page(page); + return (struct folio *)page; +} +EXPORT_SYMBOL(folio_alloc); + int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst) { struct mempolicy *pol = mpol_dup(vma_policy(src)); diff --git a/mm/memremap.c b/mm/memremap.c index ed593bf87109..5a66a71ab591 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -505,7 +505,7 @@ void free_devmap_managed_page(struct page *page) __ClearPageWaiters(page); - mem_cgroup_uncharge(page); + mem_cgroup_uncharge(page_folio(page)); /* * When a device_private page is freed, the page->mapping field diff --git a/mm/migrate.c b/mm/migrate.c index a6a7743ee98f..efa9941ebe03 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -364,7 +364,7 @@ static int expected_page_refs(struct address_space *mapping, struct page *page) */ expected_count += is_device_private_page(page); if (mapping) - expected_count += thp_nr_pages(page) + page_has_private(page); + expected_count += compound_nr(page) + page_has_private(page); return expected_count; } @@ -377,74 +377,75 @@ static int expected_page_refs(struct address_space *mapping, struct page *page) * 2 for pages with a mapping * 3 for pages with a mapping and PagePrivate/PagePrivate2 set. */ -int migrate_page_move_mapping(struct address_space *mapping, - struct page *newpage, struct page *page, int extra_count) +int folio_migrate_mapping(struct address_space *mapping, + struct folio *newfolio, struct folio *folio, int extra_count) { - XA_STATE(xas, &mapping->i_pages, page_index(page)); + XA_STATE(xas, &mapping->i_pages, folio_index(folio)); struct zone *oldzone, *newzone; int dirty; - int expected_count = expected_page_refs(mapping, page) + extra_count; - int nr = thp_nr_pages(page); + int expected_count = expected_page_refs(mapping, &folio->page) + extra_count; + long nr = folio_nr_pages(folio); if (!mapping) { /* Anonymous page without mapping */ - if (page_count(page) != expected_count) + if (folio_ref_count(folio) != expected_count) return -EAGAIN; /* No turning back from here */ - newpage->index = page->index; - newpage->mapping = page->mapping; - if (PageSwapBacked(page)) - __SetPageSwapBacked(newpage); + newfolio->index = folio->index; + newfolio->mapping = folio->mapping; + if (folio_test_swapbacked(folio)) + __folio_set_swapbacked(newfolio); return MIGRATEPAGE_SUCCESS; } - oldzone = page_zone(page); - newzone = page_zone(newpage); + oldzone = folio_zone(folio); + newzone = folio_zone(newfolio); xas_lock_irq(&xas); - if (page_count(page) != expected_count || xas_load(&xas) != page) { + if (folio_ref_count(folio) != expected_count || + xas_load(&xas) != folio) { xas_unlock_irq(&xas); return -EAGAIN; } - if (!page_ref_freeze(page, expected_count)) { + if (!folio_ref_freeze(folio, expected_count)) { xas_unlock_irq(&xas); return -EAGAIN; } /* - * Now we know that no one else is looking at the page: + * Now we know that no one else is looking at the folio: * no turning back from here. */ - newpage->index = page->index; - newpage->mapping = page->mapping; - page_ref_add(newpage, nr); /* add cache reference */ - if (PageSwapBacked(page)) { - __SetPageSwapBacked(newpage); - if (PageSwapCache(page)) { - SetPageSwapCache(newpage); - set_page_private(newpage, page_private(page)); + newfolio->index = folio->index; + newfolio->mapping = folio->mapping; + folio_ref_add(newfolio, nr); /* add cache reference */ + if (folio_test_swapbacked(folio)) { + __folio_set_swapbacked(newfolio); + if (folio_test_swapcache(folio)) { + folio_set_swapcache(newfolio); + newfolio->private = folio_get_private(folio); } } else { - VM_BUG_ON_PAGE(PageSwapCache(page), page); + VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio); } /* Move dirty while page refs frozen and newpage not yet exposed */ - dirty = PageDirty(page); + dirty = folio_test_dirty(folio); if (dirty) { - ClearPageDirty(page); - SetPageDirty(newpage); + folio_clear_dirty(folio); + folio_set_dirty(newfolio); } - xas_store(&xas, newpage); - if (PageTransHuge(page)) { + xas_store(&xas, newfolio); + if (nr > 1) { int i; for (i = 1; i < nr; i++) { xas_next(&xas); - xas_store(&xas, newpage); + xas_store(&xas, newfolio); } } @@ -453,7 +454,7 @@ int migrate_page_move_mapping(struct address_space *mapping, * to one less reference. * We know this isn't the last reference. */ - page_ref_unfreeze(page, expected_count - nr); + folio_ref_unfreeze(folio, expected_count - nr); xas_unlock(&xas); /* Leave irq disabled to prevent preemption while updating stats */ @@ -472,18 +473,18 @@ int migrate_page_move_mapping(struct address_space *mapping, struct lruvec *old_lruvec, *new_lruvec; struct mem_cgroup *memcg; - memcg = page_memcg(page); + memcg = folio_memcg(folio); old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat); new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat); __mod_lruvec_state(old_lruvec, NR_FILE_PAGES, -nr); __mod_lruvec_state(new_lruvec, NR_FILE_PAGES, nr); - if (PageSwapBacked(page) && !PageSwapCache(page)) { + if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) { __mod_lruvec_state(old_lruvec, NR_SHMEM, -nr); __mod_lruvec_state(new_lruvec, NR_SHMEM, nr); } #ifdef CONFIG_SWAP - if (PageSwapCache(page)) { + if (folio_test_swapcache(folio)) { __mod_lruvec_state(old_lruvec, NR_SWAPCACHE, -nr); __mod_lruvec_state(new_lruvec, NR_SWAPCACHE, nr); } @@ -499,11 +500,11 @@ int migrate_page_move_mapping(struct address_space *mapping, return MIGRATEPAGE_SUCCESS; } -EXPORT_SYMBOL(migrate_page_move_mapping); +EXPORT_SYMBOL(folio_migrate_mapping); /* * The expected number of remaining references is the same as that - * of migrate_page_move_mapping(). + * of folio_migrate_mapping(). */ int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page) @@ -538,91 +539,87 @@ int migrate_huge_page_move_mapping(struct address_space *mapping, } /* - * Copy the page to its new location + * Copy the flags and some other ancillary information */ -void migrate_page_states(struct page *newpage, struct page *page) +void folio_migrate_flags(struct folio *newfolio, struct folio *folio) { int cpupid; - if (PageError(page)) - SetPageError(newpage); - if (PageReferenced(page)) - SetPageReferenced(newpage); - if (PageUptodate(page)) - SetPageUptodate(newpage); - if (TestClearPageActive(page)) { - VM_BUG_ON_PAGE(PageUnevictable(page), page); - SetPageActive(newpage); - } else if (TestClearPageUnevictable(page)) - SetPageUnevictable(newpage); - if (PageWorkingset(page)) - SetPageWorkingset(newpage); - if (PageChecked(page)) - SetPageChecked(newpage); - if (PageMappedToDisk(page)) - SetPageMappedToDisk(newpage); - - /* Move dirty on pages not done by migrate_page_move_mapping() */ - if (PageDirty(page)) - SetPageDirty(newpage); - - if (page_is_young(page)) - set_page_young(newpage); - if (page_is_idle(page)) - set_page_idle(newpage); + if (folio_test_error(folio)) + folio_set_error(newfolio); + if (folio_test_referenced(folio)) + folio_set_referenced(newfolio); + if (folio_test_uptodate(folio)) + folio_mark_uptodate(newfolio); + if (folio_test_clear_active(folio)) { + VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio); + folio_set_active(newfolio); + } else if (folio_test_clear_unevictable(folio)) + folio_set_unevictable(newfolio); + if (folio_test_workingset(folio)) + folio_set_workingset(newfolio); + if (folio_test_checked(folio)) + folio_set_checked(newfolio); + if (folio_test_mappedtodisk(folio)) + folio_set_mappedtodisk(newfolio); + + /* Move dirty on pages not done by folio_migrate_mapping() */ + if (folio_test_dirty(folio)) + folio_set_dirty(newfolio); + + if (folio_test_young(folio)) + folio_set_young(newfolio); + if (folio_test_idle(folio)) + folio_set_idle(newfolio); /* * Copy NUMA information to the new page, to prevent over-eager * future migrations of this same page. */ - cpupid = page_cpupid_xchg_last(page, -1); - page_cpupid_xchg_last(newpage, cpupid); + cpupid = page_cpupid_xchg_last(&folio->page, -1); + page_cpupid_xchg_last(&newfolio->page, cpupid); - ksm_migrate_page(newpage, page); + folio_migrate_ksm(newfolio, folio); /* * Please do not reorder this without considering how mm/ksm.c's * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache(). */ - if (PageSwapCache(page)) - ClearPageSwapCache(page); - ClearPagePrivate(page); + if (folio_test_swapcache(folio)) + folio_clear_swapcache(folio); + folio_clear_private(folio); /* page->private contains hugetlb specific flags */ - if (!PageHuge(page)) - set_page_private(page, 0); + if (!folio_test_hugetlb(folio)) + folio->private = NULL; /* * If any waiters have accumulated on the new page then * wake them up. */ - if (PageWriteback(newpage)) - end_page_writeback(newpage); + if (folio_test_writeback(newfolio)) + folio_end_writeback(newfolio); /* * PG_readahead shares the same bit with PG_reclaim. The above * end_page_writeback() may clear PG_readahead mistakenly, so set the * bit after that. */ - if (PageReadahead(page)) - SetPageReadahead(newpage); + if (folio_test_readahead(folio)) + folio_set_readahead(newfolio); - copy_page_owner(page, newpage); + folio_copy_owner(newfolio, folio); - if (!PageHuge(page)) - mem_cgroup_migrate(page, newpage); + if (!folio_test_hugetlb(folio)) + mem_cgroup_migrate(folio, newfolio); } -EXPORT_SYMBOL(migrate_page_states); +EXPORT_SYMBOL(folio_migrate_flags); -void migrate_page_copy(struct page *newpage, struct page *page) +void folio_migrate_copy(struct folio *newfolio, struct folio *folio) { - if (PageHuge(page) || PageTransHuge(page)) - copy_huge_page(newpage, page); - else - copy_highpage(newpage, page); - - migrate_page_states(newpage, page); + folio_copy(newfolio, folio); + folio_migrate_flags(newfolio, folio); } -EXPORT_SYMBOL(migrate_page_copy); +EXPORT_SYMBOL(folio_migrate_copy); /************************************************************ * Migration functions @@ -638,19 +635,21 @@ int migrate_page(struct address_space *mapping, struct page *newpage, struct page *page, enum migrate_mode mode) { + struct folio *newfolio = page_folio(newpage); + struct folio *folio = page_folio(page); int rc; - BUG_ON(PageWriteback(page)); /* Writeback must be complete */ + BUG_ON(folio_test_writeback(folio)); /* Writeback must be complete */ - rc = migrate_page_move_mapping(mapping, newpage, page, 0); + rc = folio_migrate_mapping(mapping, newfolio, folio, 0); if (rc != MIGRATEPAGE_SUCCESS) return rc; if (mode != MIGRATE_SYNC_NO_COPY) - migrate_page_copy(newpage, page); + folio_migrate_copy(newfolio, folio); else - migrate_page_states(newpage, page); + folio_migrate_flags(newfolio, folio); return MIGRATEPAGE_SUCCESS; } EXPORT_SYMBOL(migrate_page); @@ -2468,7 +2467,7 @@ static void migrate_vma_collect(struct migrate_vma *migrate) * @page: struct page to check * * Pinned pages cannot be migrated. This is the same test as in - * migrate_page_move_mapping(), except that here we allow migration of a + * folio_migrate_mapping(), except that here we allow migration of a * ZONE_DEVICE page. */ static bool migrate_vma_check_page(struct page *page) @@ -2846,7 +2845,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, if (unlikely(anon_vma_prepare(vma))) goto abort; - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL)) + if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL)) goto abort; /* @@ -3066,7 +3065,7 @@ void migrate_vma_finalize(struct migrate_vma *migrate) EXPORT_SYMBOL(migrate_vma_finalize); #endif /* CONFIG_DEVICE_PRIVATE */ -#if defined(CONFIG_MEMORY_HOTPLUG) +#if defined(CONFIG_HOTPLUG_CPU) /* Disable reclaim-based migration. */ static void __disable_all_migrate_targets(void) { @@ -3209,25 +3208,6 @@ static void set_migration_target_nodes(void) } /* - * React to hotplug events that might affect the migration targets - * like events that online or offline NUMA nodes. - * - * The ordering is also currently dependent on which nodes have - * CPUs. That means we need CPU on/offline notification too. - */ -static int migration_online_cpu(unsigned int cpu) -{ - set_migration_target_nodes(); - return 0; -} - -static int migration_offline_cpu(unsigned int cpu) -{ - set_migration_target_nodes(); - return 0; -} - -/* * This leaves migrate-on-reclaim transiently disabled between * the MEM_GOING_OFFLINE and MEM_OFFLINE events. This runs * whether reclaim-based migration is enabled or not, which @@ -3239,8 +3219,18 @@ static int migration_offline_cpu(unsigned int cpu) * set_migration_target_nodes(). */ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, - unsigned long action, void *arg) + unsigned long action, void *_arg) { + struct memory_notify *arg = _arg; + + /* + * Only update the node migration order when a node is + * changing status, like online->offline. This avoids + * the overhead of synchronize_rcu() in most cases. + */ + if (arg->status_change_nid < 0) + return notifier_from_errno(0); + switch (action) { case MEM_GOING_OFFLINE: /* @@ -3274,13 +3264,31 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self, return notifier_from_errno(0); } +/* + * React to hotplug events that might affect the migration targets + * like events that online or offline NUMA nodes. + * + * The ordering is also currently dependent on which nodes have + * CPUs. That means we need CPU on/offline notification too. + */ +static int migration_online_cpu(unsigned int cpu) +{ + set_migration_target_nodes(); + return 0; +} + +static int migration_offline_cpu(unsigned int cpu) +{ + set_migration_target_nodes(); + return 0; +} + static int __init migrate_on_reclaim_init(void) { int ret; - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "migrate on reclaim", - migration_online_cpu, - migration_offline_cpu); + ret = cpuhp_setup_state_nocalls(CPUHP_MM_DEMOTION_DEAD, "mm/demotion:offline", + NULL, migration_offline_cpu); /* * In the unlikely case that this fails, the automatic * migration targets may become suboptimal for nodes @@ -3288,9 +3296,12 @@ static int __init migrate_on_reclaim_init(void) * rare case, do not bother trying to do anything special. */ WARN_ON(ret < 0); + ret = cpuhp_setup_state(CPUHP_AP_MM_DEMOTION_ONLINE, "mm/demotion:online", + migration_online_cpu, NULL); + WARN_ON(ret < 0); hotplug_memory_notifier(migrate_on_reclaim_callback, 100); return 0; } late_initcall(migrate_on_reclaim_init); -#endif /* CONFIG_MEMORY_HOTPLUG */ +#endif /* CONFIG_HOTPLUG_CPU */ diff --git a/mm/mlock.c b/mm/mlock.c index 16d2ee160d43..e263d62ae2d0 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -271,6 +271,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) /* Phase 1: page isolation */ for (i = 0; i < nr; i++) { struct page *page = pvec->pages[i]; + struct folio *folio = page_folio(page); if (TestClearPageMlocked(page)) { /* @@ -278,7 +279,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone) * so we can spare the get_page() here. */ if (TestClearPageLRU(page)) { - lruvec = relock_page_lruvec_irq(page, lruvec); + lruvec = folio_lruvec_relock_irq(folio, lruvec); del_page_from_lru_list(page, lruvec); continue; } else diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 831340e7ad8b..989f35a2bbb1 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1150,7 +1150,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) struct task_struct *task; struct task_struct *p; unsigned int f_flags; - bool reap = true; + bool reap = false; struct pid *pid; long ret = 0; @@ -1177,15 +1177,15 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) goto put_task; } - mm = p->mm; - mmgrab(mm); - - /* If the work has been done already, just exit with success */ - if (test_bit(MMF_OOM_SKIP, &mm->flags)) - reap = false; - else if (!task_will_free_mem(p)) { - reap = false; - ret = -EINVAL; + if (mmget_not_zero(p->mm)) { + mm = p->mm; + if (task_will_free_mem(p)) + reap = true; + else { + /* Error only if the work has not been done already */ + if (!test_bit(MMF_OOM_SKIP, &mm->flags)) + ret = -EINVAL; + } } task_unlock(p); @@ -1201,7 +1201,8 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) mmap_read_unlock(mm); drop_mm: - mmdrop(mm); + if (mm) + mmput(mm); put_task: put_task_struct(task); put_pid: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 4812a17b288c..9c64490171e0 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -562,12 +562,12 @@ static unsigned long wp_next_time(unsigned long cur_time) return cur_time; } -static void wb_domain_writeout_inc(struct wb_domain *dom, +static void wb_domain_writeout_add(struct wb_domain *dom, struct fprop_local_percpu *completions, - unsigned int max_prop_frac) + unsigned int max_prop_frac, long nr) { - __fprop_inc_percpu_max(&dom->completions, completions, - max_prop_frac); + __fprop_add_percpu_max(&dom->completions, completions, + max_prop_frac, nr); /* First event after period switching was turned off? */ if (unlikely(!dom->period_time)) { /* @@ -583,20 +583,20 @@ static void wb_domain_writeout_inc(struct wb_domain *dom, /* * Increment @wb's writeout completion count and the global writeout - * completion count. Called from test_clear_page_writeback(). + * completion count. Called from __folio_end_writeback(). */ -static inline void __wb_writeout_inc(struct bdi_writeback *wb) +static inline void __wb_writeout_add(struct bdi_writeback *wb, long nr) { struct wb_domain *cgdom; - inc_wb_stat(wb, WB_WRITTEN); - wb_domain_writeout_inc(&global_wb_domain, &wb->completions, - wb->bdi->max_prop_frac); + wb_stat_mod(wb, WB_WRITTEN, nr); + wb_domain_writeout_add(&global_wb_domain, &wb->completions, + wb->bdi->max_prop_frac, nr); cgdom = mem_cgroup_wb_domain(wb); if (cgdom) - wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb), - wb->bdi->max_prop_frac); + wb_domain_writeout_add(cgdom, wb_memcg_completions(wb), + wb->bdi->max_prop_frac, nr); } void wb_writeout_inc(struct bdi_writeback *wb) @@ -604,7 +604,7 @@ void wb_writeout_inc(struct bdi_writeback *wb) unsigned long flags; local_irq_save(flags); - __wb_writeout_inc(wb); + __wb_writeout_add(wb, 1); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(wb_writeout_inc); @@ -1084,7 +1084,7 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb, * write_bandwidth = --------------------------------------------------- * period * - * @written may have decreased due to account_page_redirty(). + * @written may have decreased due to folio_account_redirty(). * Avoid underflowing @bw calculation. */ bw = written - min(written, wb->written_stamp); @@ -2381,44 +2381,44 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) } /** - * write_one_page - write out a single page and wait on I/O - * @page: the page to write + * folio_write_one - write out a single folio and wait on I/O. + * @folio: The folio to write. * - * The page must be locked by the caller and will be unlocked upon return. + * The folio must be locked by the caller and will be unlocked upon return. * * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this * function returns. * * Return: %0 on success, negative error code otherwise */ -int write_one_page(struct page *page) +int folio_write_one(struct folio *folio) { - struct address_space *mapping = page->mapping; + struct address_space *mapping = folio->mapping; int ret = 0; struct writeback_control wbc = { .sync_mode = WB_SYNC_ALL, - .nr_to_write = 1, + .nr_to_write = folio_nr_pages(folio), }; - BUG_ON(!PageLocked(page)); + BUG_ON(!folio_test_locked(folio)); - wait_on_page_writeback(page); + folio_wait_writeback(folio); - if (clear_page_dirty_for_io(page)) { - get_page(page); - ret = mapping->a_ops->writepage(page, &wbc); + if (folio_clear_dirty_for_io(folio)) { + folio_get(folio); + ret = mapping->a_ops->writepage(&folio->page, &wbc); if (ret == 0) - wait_on_page_writeback(page); - put_page(page); + folio_wait_writeback(folio); + folio_put(folio); } else { - unlock_page(page); + folio_unlock(folio); } if (!ret) ret = filemap_check_errors(mapping); return ret; } -EXPORT_SYMBOL(write_one_page); +EXPORT_SYMBOL(folio_write_one); /* * For address_spaces which do not use buffers nor write back. @@ -2438,29 +2438,30 @@ EXPORT_SYMBOL(__set_page_dirty_no_writeback); * * NOTE: This relies on being atomic wrt interrupts. */ -static void account_page_dirtied(struct page *page, +static void folio_account_dirtied(struct folio *folio, struct address_space *mapping) { struct inode *inode = mapping->host; - trace_writeback_dirty_page(page, mapping); + trace_writeback_dirty_folio(folio, mapping); if (mapping_can_writeback(mapping)) { struct bdi_writeback *wb; + long nr = folio_nr_pages(folio); - inode_attach_wb(inode, page); + inode_attach_wb(inode, &folio->page); wb = inode_to_wb(inode); - __inc_lruvec_page_state(page, NR_FILE_DIRTY); - __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); - __inc_node_page_state(page, NR_DIRTIED); - inc_wb_stat(wb, WB_RECLAIMABLE); - inc_wb_stat(wb, WB_DIRTIED); - task_io_account_write(PAGE_SIZE); - current->nr_dirtied++; - __this_cpu_inc(bdp_ratelimits); + __lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, nr); + __zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr); + __node_stat_mod_folio(folio, NR_DIRTIED, nr); + wb_stat_mod(wb, WB_RECLAIMABLE, nr); + wb_stat_mod(wb, WB_DIRTIED, nr); + task_io_account_write(nr * PAGE_SIZE); + current->nr_dirtied += nr; + __this_cpu_add(bdp_ratelimits, nr); - mem_cgroup_track_foreign_dirty(page, wb); + mem_cgroup_track_foreign_dirty(folio, wb); } } @@ -2469,130 +2470,152 @@ static void account_page_dirtied(struct page *page, * * Caller must hold lock_page_memcg(). */ -void account_page_cleaned(struct page *page, struct address_space *mapping, +void folio_account_cleaned(struct folio *folio, struct address_space *mapping, struct bdi_writeback *wb) { if (mapping_can_writeback(mapping)) { - dec_lruvec_page_state(page, NR_FILE_DIRTY); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); - dec_wb_stat(wb, WB_RECLAIMABLE); - task_io_account_cancelled_write(PAGE_SIZE); + long nr = folio_nr_pages(folio); + lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr); + zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr); + wb_stat_mod(wb, WB_RECLAIMABLE, -nr); + task_io_account_cancelled_write(nr * PAGE_SIZE); } } /* - * Mark the page dirty, and set it dirty in the page cache, and mark the inode - * dirty. + * Mark the folio dirty, and set it dirty in the page cache, and mark + * the inode dirty. * - * If warn is true, then emit a warning if the page is not uptodate and has + * If warn is true, then emit a warning if the folio is not uptodate and has * not been truncated. * * The caller must hold lock_page_memcg(). */ -void __set_page_dirty(struct page *page, struct address_space *mapping, +void __folio_mark_dirty(struct folio *folio, struct address_space *mapping, int warn) { unsigned long flags; xa_lock_irqsave(&mapping->i_pages, flags); - if (page->mapping) { /* Race with truncate? */ - WARN_ON_ONCE(warn && !PageUptodate(page)); - account_page_dirtied(page, mapping); - __xa_set_mark(&mapping->i_pages, page_index(page), + if (folio->mapping) { /* Race with truncate? */ + WARN_ON_ONCE(warn && !folio_test_uptodate(folio)); + folio_account_dirtied(folio, mapping); + __xa_set_mark(&mapping->i_pages, folio_index(folio), PAGECACHE_TAG_DIRTY); } xa_unlock_irqrestore(&mapping->i_pages, flags); } -/* - * For address_spaces which do not use buffers. Just tag the page as dirty in - * the xarray. +/** + * filemap_dirty_folio - Mark a folio dirty for filesystems which do not use buffer_heads. + * @mapping: Address space this folio belongs to. + * @folio: Folio to be marked as dirty. + * + * Filesystems which do not use buffer heads should call this function + * from their set_page_dirty address space operation. It ignores the + * contents of folio_get_private(), so if the filesystem marks individual + * blocks as dirty, the filesystem should handle that itself. * - * This is also used when a single buffer is being dirtied: we want to set the - * page dirty in that case, but not all the buffers. This is a "bottom-up" - * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying. + * This is also sometimes used by filesystems which use buffer_heads when + * a single buffer is being dirtied: we want to set the folio dirty in + * that case, but not all the buffers. This is a "bottom-up" dirtying, + * whereas __set_page_dirty_buffers() is a "top-down" dirtying. * - * The caller must ensure this doesn't race with truncation. Most will simply - * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and - * the pte lock held, which also locks out truncation. + * The caller must ensure this doesn't race with truncation. Most will + * simply hold the folio lock, but e.g. zap_pte_range() calls with the + * folio mapped and the pte lock held, which also locks out truncation. */ -int __set_page_dirty_nobuffers(struct page *page) +bool filemap_dirty_folio(struct address_space *mapping, struct folio *folio) { - lock_page_memcg(page); - if (!TestSetPageDirty(page)) { - struct address_space *mapping = page_mapping(page); + folio_memcg_lock(folio); + if (folio_test_set_dirty(folio)) { + folio_memcg_unlock(folio); + return false; + } - if (!mapping) { - unlock_page_memcg(page); - return 1; - } - __set_page_dirty(page, mapping, !PagePrivate(page)); - unlock_page_memcg(page); + __folio_mark_dirty(folio, mapping, !folio_test_private(folio)); + folio_memcg_unlock(folio); - if (mapping->host) { - /* !PageAnon && !swapper_space */ - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - } - return 1; + if (mapping->host) { + /* !PageAnon && !swapper_space */ + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } - unlock_page_memcg(page); - return 0; + return true; } -EXPORT_SYMBOL(__set_page_dirty_nobuffers); +EXPORT_SYMBOL(filemap_dirty_folio); -/* - * Call this whenever redirtying a page, to de-account the dirty counters - * (NR_DIRTIED, WB_DIRTIED, tsk->nr_dirtied), so that they match the written - * counters (NR_WRITTEN, WB_WRITTEN) in long term. The mismatches will lead to - * systematic errors in balanced_dirty_ratelimit and the dirty pages position - * control. +/** + * folio_account_redirty - Manually account for redirtying a page. + * @folio: The folio which is being redirtied. + * + * Most filesystems should call folio_redirty_for_writepage() instead + * of this fuction. If your filesystem is doing writeback outside the + * context of a writeback_control(), it can call this when redirtying + * a folio, to de-account the dirty counters (NR_DIRTIED, WB_DIRTIED, + * tsk->nr_dirtied), so that they match the written counters (NR_WRITTEN, + * WB_WRITTEN) in long term. The mismatches will lead to systematic errors + * in balanced_dirty_ratelimit and the dirty pages position control. */ -void account_page_redirty(struct page *page) +void folio_account_redirty(struct folio *folio) { - struct address_space *mapping = page->mapping; + struct address_space *mapping = folio->mapping; if (mapping && mapping_can_writeback(mapping)) { struct inode *inode = mapping->host; struct bdi_writeback *wb; struct wb_lock_cookie cookie = {}; + long nr = folio_nr_pages(folio); wb = unlocked_inode_to_wb_begin(inode, &cookie); - current->nr_dirtied--; - dec_node_page_state(page, NR_DIRTIED); - dec_wb_stat(wb, WB_DIRTIED); + current->nr_dirtied -= nr; + node_stat_mod_folio(folio, NR_DIRTIED, -nr); + wb_stat_mod(wb, WB_DIRTIED, -nr); unlocked_inode_to_wb_end(inode, &cookie); } } -EXPORT_SYMBOL(account_page_redirty); +EXPORT_SYMBOL(folio_account_redirty); -/* - * When a writepage implementation decides that it doesn't want to write this - * page for some reason, it should redirty the locked page via - * redirty_page_for_writepage() and it should then unlock the page and return 0 +/** + * folio_redirty_for_writepage - Decline to write a dirty folio. + * @wbc: The writeback control. + * @folio: The folio. + * + * When a writepage implementation decides that it doesn't want to write + * @folio for some reason, it should call this function, unlock @folio and + * return 0. + * + * Return: True if we redirtied the folio. False if someone else dirtied + * it first. */ -int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page) +bool folio_redirty_for_writepage(struct writeback_control *wbc, + struct folio *folio) { - int ret; + bool ret; + long nr = folio_nr_pages(folio); + + wbc->pages_skipped += nr; + ret = filemap_dirty_folio(folio->mapping, folio); + folio_account_redirty(folio); - wbc->pages_skipped++; - ret = __set_page_dirty_nobuffers(page); - account_page_redirty(page); return ret; } -EXPORT_SYMBOL(redirty_page_for_writepage); +EXPORT_SYMBOL(folio_redirty_for_writepage); -/* - * Dirty a page. +/** + * folio_mark_dirty - Mark a folio as being modified. + * @folio: The folio. * - * For pages with a mapping this should be done under the page lock for the - * benefit of asynchronous memory errors who prefer a consistent dirty state. - * This rule can be broken in some special cases, but should be better not to. + * For folios with a mapping this should be done under the page lock + * for the benefit of asynchronous memory errors who prefer a consistent + * dirty state. This rule can be broken in some special cases, + * but should be better not to. + * + * Return: True if the folio was newly dirtied, false if it was already dirty. */ -int set_page_dirty(struct page *page) +bool folio_mark_dirty(struct folio *folio) { - struct address_space *mapping = page_mapping(page); + struct address_space *mapping = folio_mapping(folio); - page = compound_head(page); if (likely(mapping)) { /* * readahead/lru_deactivate_page could remain @@ -2604,17 +2627,17 @@ int set_page_dirty(struct page *page) * it will confuse readahead and make it restart the size rampup * process. But it's a trivial problem. */ - if (PageReclaim(page)) - ClearPageReclaim(page); - return mapping->a_ops->set_page_dirty(page); + if (folio_test_reclaim(folio)) + folio_clear_reclaim(folio); + return mapping->a_ops->set_page_dirty(&folio->page); } - if (!PageDirty(page)) { - if (!TestSetPageDirty(page)) - return 1; + if (!folio_test_dirty(folio)) { + if (!folio_test_set_dirty(folio)) + return true; } - return 0; + return false; } -EXPORT_SYMBOL(set_page_dirty); +EXPORT_SYMBOL(folio_mark_dirty); /* * set_page_dirty() is racy if the caller has no reference against @@ -2650,49 +2673,49 @@ EXPORT_SYMBOL(set_page_dirty_lock); * page without actually doing it through the VM. Can you say "ext3 is * horribly ugly"? Thought you could. */ -void __cancel_dirty_page(struct page *page) +void __folio_cancel_dirty(struct folio *folio) { - struct address_space *mapping = page_mapping(page); + struct address_space *mapping = folio_mapping(folio); if (mapping_can_writeback(mapping)) { struct inode *inode = mapping->host; struct bdi_writeback *wb; struct wb_lock_cookie cookie = {}; - lock_page_memcg(page); + folio_memcg_lock(folio); wb = unlocked_inode_to_wb_begin(inode, &cookie); - if (TestClearPageDirty(page)) - account_page_cleaned(page, mapping, wb); + if (folio_test_clear_dirty(folio)) + folio_account_cleaned(folio, mapping, wb); unlocked_inode_to_wb_end(inode, &cookie); - unlock_page_memcg(page); + folio_memcg_unlock(folio); } else { - ClearPageDirty(page); + folio_clear_dirty(folio); } } -EXPORT_SYMBOL(__cancel_dirty_page); +EXPORT_SYMBOL(__folio_cancel_dirty); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. - * - * This is for preparing to put the page under writeout. We leave the page - * tagged as dirty in the xarray so that a concurrent write-for-sync - * can discover it via a PAGECACHE_TAG_DIRTY walk. The ->writepage - * implementation will run either set_page_writeback() or set_page_dirty(), - * at which stage we bring the page's dirty flag and xarray dirty tag - * back into sync. - * - * This incoherency between the page's dirty flag and xarray tag is - * unfortunate, but it only exists while the page is locked. + * Clear a folio's dirty flag, while caring for dirty memory accounting. + * Returns true if the folio was previously dirty. + * + * This is for preparing to put the folio under writeout. We leave + * the folio tagged as dirty in the xarray so that a concurrent + * write-for-sync can discover it via a PAGECACHE_TAG_DIRTY walk. + * The ->writepage implementation will run either folio_start_writeback() + * or folio_mark_dirty(), at which stage we bring the folio's dirty flag + * and xarray dirty tag back into sync. + * + * This incoherency between the folio's dirty flag and xarray tag is + * unfortunate, but it only exists while the folio is locked. */ -int clear_page_dirty_for_io(struct page *page) +bool folio_clear_dirty_for_io(struct folio *folio) { - struct address_space *mapping = page_mapping(page); - int ret = 0; + struct address_space *mapping = folio_mapping(folio); + bool ret = false; - VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); if (mapping && mapping_can_writeback(mapping)) { struct inode *inode = mapping->host; @@ -2705,48 +2728,49 @@ int clear_page_dirty_for_io(struct page *page) * We use this sequence to make sure that * (a) we account for dirty stats properly * (b) we tell the low-level filesystem to - * mark the whole page dirty if it was + * mark the whole folio dirty if it was * dirty in a pagetable. Only to then - * (c) clean the page again and return 1 to + * (c) clean the folio again and return 1 to * cause the writeback. * * This way we avoid all nasty races with the * dirty bit in multiple places and clearing * them concurrently from different threads. * - * Note! Normally the "set_page_dirty(page)" + * Note! Normally the "folio_mark_dirty(folio)" * has no effect on the actual dirty bit - since * that will already usually be set. But we * need the side effects, and it can help us * avoid races. * - * We basically use the page "master dirty bit" + * We basically use the folio "master dirty bit" * as a serialization point for all the different * threads doing their things. */ - if (page_mkclean(page)) - set_page_dirty(page); + if (folio_mkclean(folio)) + folio_mark_dirty(folio); /* * We carefully synchronise fault handlers against - * installing a dirty pte and marking the page dirty + * installing a dirty pte and marking the folio dirty * at this point. We do this by having them hold the - * page lock while dirtying the page, and pages are + * page lock while dirtying the folio, and folios are * always locked coming in here, so we get the desired * exclusion. */ wb = unlocked_inode_to_wb_begin(inode, &cookie); - if (TestClearPageDirty(page)) { - dec_lruvec_page_state(page, NR_FILE_DIRTY); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); - dec_wb_stat(wb, WB_RECLAIMABLE); - ret = 1; + if (folio_test_clear_dirty(folio)) { + long nr = folio_nr_pages(folio); + lruvec_stat_mod_folio(folio, NR_FILE_DIRTY, -nr); + zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr); + wb_stat_mod(wb, WB_RECLAIMABLE, -nr); + ret = true; } unlocked_inode_to_wb_end(inode, &cookie); return ret; } - return TestClearPageDirty(page); + return folio_test_clear_dirty(folio); } -EXPORT_SYMBOL(clear_page_dirty_for_io); +EXPORT_SYMBOL(folio_clear_dirty_for_io); static void wb_inode_writeback_start(struct bdi_writeback *wb) { @@ -2766,27 +2790,28 @@ static void wb_inode_writeback_end(struct bdi_writeback *wb) queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL); } -int test_clear_page_writeback(struct page *page) +bool __folio_end_writeback(struct folio *folio) { - struct address_space *mapping = page_mapping(page); - int ret; + long nr = folio_nr_pages(folio); + struct address_space *mapping = folio_mapping(folio); + bool ret; - lock_page_memcg(page); + folio_memcg_lock(folio); if (mapping && mapping_use_writeback_tags(mapping)) { struct inode *inode = mapping->host; struct backing_dev_info *bdi = inode_to_bdi(inode); unsigned long flags; xa_lock_irqsave(&mapping->i_pages, flags); - ret = TestClearPageWriteback(page); + ret = folio_test_clear_writeback(folio); if (ret) { - __xa_clear_mark(&mapping->i_pages, page_index(page), + __xa_clear_mark(&mapping->i_pages, folio_index(folio), PAGECACHE_TAG_WRITEBACK); if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) { struct bdi_writeback *wb = inode_to_wb(inode); - dec_wb_stat(wb, WB_WRITEBACK); - __wb_writeout_inc(wb); + wb_stat_mod(wb, WB_WRITEBACK, -nr); + __wb_writeout_add(wb, nr); if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) wb_inode_writeback_end(wb); @@ -2799,32 +2824,34 @@ int test_clear_page_writeback(struct page *page) xa_unlock_irqrestore(&mapping->i_pages, flags); } else { - ret = TestClearPageWriteback(page); + ret = folio_test_clear_writeback(folio); } if (ret) { - dec_lruvec_page_state(page, NR_WRITEBACK); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); - inc_node_page_state(page, NR_WRITTEN); + lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr); + zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, -nr); + node_stat_mod_folio(folio, NR_WRITTEN, nr); } - unlock_page_memcg(page); + folio_memcg_unlock(folio); return ret; } -int __test_set_page_writeback(struct page *page, bool keep_write) +bool __folio_start_writeback(struct folio *folio, bool keep_write) { - struct address_space *mapping = page_mapping(page); - int ret, access_ret; + long nr = folio_nr_pages(folio); + struct address_space *mapping = folio_mapping(folio); + bool ret; + int access_ret; - lock_page_memcg(page); + folio_memcg_lock(folio); if (mapping && mapping_use_writeback_tags(mapping)) { - XA_STATE(xas, &mapping->i_pages, page_index(page)); + XA_STATE(xas, &mapping->i_pages, folio_index(folio)); struct inode *inode = mapping->host; struct backing_dev_info *bdi = inode_to_bdi(inode); unsigned long flags; xas_lock_irqsave(&xas, flags); xas_load(&xas); - ret = TestSetPageWriteback(page); + ret = folio_test_set_writeback(folio); if (!ret) { bool on_wblist; @@ -2835,84 +2862,105 @@ int __test_set_page_writeback(struct page *page, bool keep_write) if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) { struct bdi_writeback *wb = inode_to_wb(inode); - inc_wb_stat(wb, WB_WRITEBACK); + wb_stat_mod(wb, WB_WRITEBACK, nr); if (!on_wblist) wb_inode_writeback_start(wb); } /* - * We can come through here when swapping anonymous - * pages, so we don't necessarily have an inode to track - * for sync. + * We can come through here when swapping + * anonymous folios, so we don't necessarily + * have an inode to track for sync. */ if (mapping->host && !on_wblist) sb_mark_inode_writeback(mapping->host); } - if (!PageDirty(page)) + if (!folio_test_dirty(folio)) xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY); if (!keep_write) xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE); xas_unlock_irqrestore(&xas, flags); } else { - ret = TestSetPageWriteback(page); + ret = folio_test_set_writeback(folio); } if (!ret) { - inc_lruvec_page_state(page, NR_WRITEBACK); - inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); + lruvec_stat_mod_folio(folio, NR_WRITEBACK, nr); + zone_stat_mod_folio(folio, NR_ZONE_WRITE_PENDING, nr); } - unlock_page_memcg(page); - access_ret = arch_make_page_accessible(page); + folio_memcg_unlock(folio); + access_ret = arch_make_folio_accessible(folio); /* * If writeback has been triggered on a page that cannot be made * accessible, it is too late to recover here. */ - VM_BUG_ON_PAGE(access_ret != 0, page); + VM_BUG_ON_FOLIO(access_ret != 0, folio); return ret; - } -EXPORT_SYMBOL(__test_set_page_writeback); +EXPORT_SYMBOL(__folio_start_writeback); -/* - * Wait for a page to complete writeback +/** + * folio_wait_writeback - Wait for a folio to finish writeback. + * @folio: The folio to wait for. + * + * If the folio is currently being written back to storage, wait for the + * I/O to complete. + * + * Context: Sleeps. Must be called in process context and with + * no spinlocks held. Caller should hold a reference on the folio. + * If the folio is not locked, writeback may start again after writeback + * has finished. */ -void wait_on_page_writeback(struct page *page) +void folio_wait_writeback(struct folio *folio) { - while (PageWriteback(page)) { - trace_wait_on_page_writeback(page, page_mapping(page)); - wait_on_page_bit(page, PG_writeback); + while (folio_test_writeback(folio)) { + trace_folio_wait_writeback(folio, folio_mapping(folio)); + folio_wait_bit(folio, PG_writeback); } } -EXPORT_SYMBOL_GPL(wait_on_page_writeback); +EXPORT_SYMBOL_GPL(folio_wait_writeback); -/* - * Wait for a page to complete writeback. Returns -EINTR if we get a - * fatal signal while waiting. +/** + * folio_wait_writeback_killable - Wait for a folio to finish writeback. + * @folio: The folio to wait for. + * + * If the folio is currently being written back to storage, wait for the + * I/O to complete or a fatal signal to arrive. + * + * Context: Sleeps. Must be called in process context and with + * no spinlocks held. Caller should hold a reference on the folio. + * If the folio is not locked, writeback may start again after writeback + * has finished. + * Return: 0 on success, -EINTR if we get a fatal signal while waiting. */ -int wait_on_page_writeback_killable(struct page *page) +int folio_wait_writeback_killable(struct folio *folio) { - while (PageWriteback(page)) { - trace_wait_on_page_writeback(page, page_mapping(page)); - if (wait_on_page_bit_killable(page, PG_writeback)) + while (folio_test_writeback(folio)) { + trace_folio_wait_writeback(folio, folio_mapping(folio)); + if (folio_wait_bit_killable(folio, PG_writeback)) return -EINTR; } return 0; } -EXPORT_SYMBOL_GPL(wait_on_page_writeback_killable); +EXPORT_SYMBOL_GPL(folio_wait_writeback_killable); /** - * wait_for_stable_page() - wait for writeback to finish, if necessary. - * @page: The page to wait on. + * folio_wait_stable() - wait for writeback to finish, if necessary. + * @folio: The folio to wait on. + * + * This function determines if the given folio is related to a backing + * device that requires folio contents to be held stable during writeback. + * If so, then it will wait for any pending writeback to complete. * - * This function determines if the given page is related to a backing device - * that requires page contents to be held stable during writeback. If so, then - * it will wait for any pending writeback to complete. + * Context: Sleeps. Must be called in process context and with + * no spinlocks held. Caller should hold a reference on the folio. + * If the folio is not locked, writeback may start again after writeback + * has finished. */ -void wait_for_stable_page(struct page *page) +void folio_wait_stable(struct folio *folio) { - page = thp_head(page); - if (page->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES) - wait_on_page_writeback(page); + if (folio->mapping->host->i_sb->s_iflags & SB_I_STABLE_WRITES) + folio_wait_writeback(folio); } -EXPORT_SYMBOL_GPL(wait_for_stable_page); +EXPORT_SYMBOL_GPL(folio_wait_stable); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b37435c274cf..fee18ada46a2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -724,7 +724,7 @@ static inline void free_the_page(struct page *page, unsigned int order) void free_compound_page(struct page *page) { - mem_cgroup_uncharge(page); + mem_cgroup_uncharge(page_folio(page)); free_the_page(page, compound_order(page)); } @@ -1312,8 +1312,10 @@ static __always_inline bool free_pages_prepare(struct page *page, VM_BUG_ON_PAGE(compound && compound_order(page) != order, page); - if (compound) + if (compound) { ClearPageDoubleMap(page); + ClearPageHasHWPoisoned(page); + } for (i = 1; i < (1 << order); i++) { if (compound) bad += free_tail_pages_check(page, page + i); @@ -5223,6 +5225,10 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid, if (unlikely(page_array && nr_pages - nr_populated == 0)) goto out; + /* Bulk allocator does not support memcg accounting. */ + if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT)) + goto failed; + /* Use the single page allocator for one page. */ if (nr_pages - nr_populated == 1) goto failed; @@ -5400,6 +5406,18 @@ out: } EXPORT_SYMBOL(__alloc_pages); +struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid, + nodemask_t *nodemask) +{ + struct page *page = __alloc_pages(gfp | __GFP_COMP, order, + preferred_nid, nodemask); + + if (page && order > 1) + prep_transhuge_page(page); + return (struct folio *)page; +} +EXPORT_SYMBOL(__folio_alloc); + /* * Common helper functions. Never use with __GFP_HIGHMEM because the returned * address cannot represent highmem pages. Use alloc_pages and then kmap if diff --git a/mm/page_ext.c b/mm/page_ext.c index dfb91653d359..2a52fd9ed464 100644 --- a/mm/page_ext.c +++ b/mm/page_ext.c @@ -269,7 +269,7 @@ static int __meminit init_section_page_ext(unsigned long pfn, int nid) total_usage += table_size; return 0; } -#ifdef CONFIG_MEMORY_HOTPLUG + static void free_page_ext(void *addr) { if (is_vmalloc_addr(addr)) { @@ -374,8 +374,6 @@ static int __meminit page_ext_callback(struct notifier_block *self, return notifier_from_errno(ret); } -#endif - void __init page_ext_init(void) { unsigned long pfn; diff --git a/mm/page_io.c b/mm/page_io.c index 6010fb07f231..9725c7e1eeea 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -38,7 +38,7 @@ void end_swap_bio_write(struct bio *bio) * Also print a dire warning that things will go BAD (tm) * very quickly. * - * Also clear PG_reclaim to avoid rotate_reclaimable_page() + * Also clear PG_reclaim to avoid folio_rotate_reclaimable() */ set_page_dirty(page); pr_alert_ratelimited("Write-error on swap-device (%u:%u:%llu)\n", @@ -317,7 +317,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, * temporary failure if the system has limited * memory for allocating transmit buffers. * Mark the page dirty and avoid - * rotate_reclaimable_page but rate-limit the + * folio_rotate_reclaimable but rate-limit the * messages but do not flag PageError like * the normal direct-to-bio case as it could * be temporary. diff --git a/mm/page_owner.c b/mm/page_owner.c index 62402d22539b..d24ed221357c 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -210,10 +210,10 @@ void __split_page_owner(struct page *page, unsigned int nr) } } -void __copy_page_owner(struct page *oldpage, struct page *newpage) +void __folio_copy_owner(struct folio *newfolio, struct folio *old) { - struct page_ext *old_ext = lookup_page_ext(oldpage); - struct page_ext *new_ext = lookup_page_ext(newpage); + struct page_ext *old_ext = lookup_page_ext(&old->page); + struct page_ext *new_ext = lookup_page_ext(&newfolio->page); struct page_owner *old_page_owner, *new_page_owner; if (unlikely(!old_ext || !new_ext)) @@ -231,11 +231,11 @@ void __copy_page_owner(struct page *oldpage, struct page *newpage) new_page_owner->free_ts_nsec = old_page_owner->ts_nsec; /* - * We don't clear the bit on the oldpage as it's going to be freed + * We don't clear the bit on the old folio as it's going to be freed * after migration. Until then, the info can be useful in case of * a bug, and the overall stats will be off a bit only temporarily. * Also, migrate_misplaced_transhuge_page() can still fail the - * migration and then we want the oldpage to retain the info. But + * migration and then we want the old folio to retain the info. But * in that case we also don't need to explicitly clear the info from * the new page, which will be freed. */ diff --git a/mm/rmap.c b/mm/rmap.c index 6aebd1747251..3a1059c284c3 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -34,7 +34,7 @@ * mapping->private_lock (in __set_page_dirty_buffers) * lock_page_memcg move_lock (in __set_page_dirty_buffers) * i_pages lock (widely used) - * lruvec->lru_lock (in lock_page_lruvec_irq) + * lruvec->lru_lock (in folio_lruvec_lock_irq) * inode->i_lock (in set_page_dirty's __mark_inode_dirty) * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) * sb_lock (within inode_lock in fs/fs-writeback.c) @@ -981,7 +981,7 @@ static bool invalid_mkclean_vma(struct vm_area_struct *vma, void *arg) return true; } -int page_mkclean(struct page *page) +int folio_mkclean(struct folio *folio) { int cleaned = 0; struct address_space *mapping; @@ -991,20 +991,20 @@ int page_mkclean(struct page *page) .invalid_vma = invalid_mkclean_vma, }; - BUG_ON(!PageLocked(page)); + BUG_ON(!folio_test_locked(folio)); - if (!page_mapped(page)) + if (!folio_mapped(folio)) return 0; - mapping = page_mapping(page); + mapping = folio_mapping(folio); if (!mapping) return 0; - rmap_walk(page, &rwc); + rmap_walk(&folio->page, &rwc); return cleaned; } -EXPORT_SYMBOL_GPL(page_mkclean); +EXPORT_SYMBOL_GPL(folio_mkclean); /** * page_move_anon_rmap - move a page to our anon_vma diff --git a/mm/secretmem.c b/mm/secretmem.c index 1fea68b8d5a6..22b310adb53d 100644 --- a/mm/secretmem.c +++ b/mm/secretmem.c @@ -18,7 +18,6 @@ #include <linux/secretmem.h> #include <linux/set_memory.h> #include <linux/sched/signal.h> -#include <linux/refcount.h> #include <uapi/linux/magic.h> @@ -41,11 +40,11 @@ module_param_named(enable, secretmem_enable, bool, 0400); MODULE_PARM_DESC(secretmem_enable, "Enable secretmem and memfd_secret(2) system call"); -static refcount_t secretmem_users; +static atomic_t secretmem_users; bool secretmem_active(void) { - return !!refcount_read(&secretmem_users); + return !!atomic_read(&secretmem_users); } static vm_fault_t secretmem_fault(struct vm_fault *vmf) @@ -104,7 +103,7 @@ static const struct vm_operations_struct secretmem_vm_ops = { static int secretmem_release(struct inode *inode, struct file *file) { - refcount_dec(&secretmem_users); + atomic_dec(&secretmem_users); return 0; } @@ -204,6 +203,8 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags) if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC)) return -EINVAL; + if (atomic_read(&secretmem_users) < 0) + return -ENFILE; fd = get_unused_fd_flags(flags & O_CLOEXEC); if (fd < 0) @@ -217,8 +218,8 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags) file->f_flags |= O_LARGEFILE; + atomic_inc(&secretmem_users); fd_install(fd, file); - refcount_inc(&secretmem_users); return fd; err_put_fd: diff --git a/mm/shmem.c b/mm/shmem.c index 2ca4a341652c..17e344e26e73 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -709,7 +709,7 @@ static int shmem_add_to_page_cache(struct page *page, page->index = index; if (!PageSwapCache(page)) { - error = mem_cgroup_charge(page, charge_mm, gfp); + error = mem_cgroup_charge(page_folio(page), charge_mm, gfp); if (error) { if (PageTransHuge(page)) { count_vm_event(THP_FILE_FALLBACK); @@ -1636,6 +1636,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp, struct shmem_inode_info *info, pgoff_t index) { struct page *oldpage, *newpage; + struct folio *old, *new; struct address_space *swap_mapping; swp_entry_t entry; pgoff_t swap_index; @@ -1672,7 +1673,9 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp, xa_lock_irq(&swap_mapping->i_pages); error = shmem_replace_entry(swap_mapping, swap_index, oldpage, newpage); if (!error) { - mem_cgroup_migrate(oldpage, newpage); + old = page_folio(oldpage); + new = page_folio(newpage); + mem_cgroup_migrate(old, new); __inc_lruvec_page_state(newpage, NR_FILE_PAGES); __dec_lruvec_page_state(oldpage, NR_FILE_PAGES); } diff --git a/mm/slab.c b/mm/slab.c index d0f725637663..874b3f8fe80d 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1095,7 +1095,7 @@ static int slab_offline_cpu(unsigned int cpu) return 0; } -#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG) +#if defined(CONFIG_NUMA) /* * Drains freelist for a node on each slab cache, used for memory hot-remove. * Returns -EBUSY if all objects cannot be drained so that the node is not @@ -1157,7 +1157,7 @@ static int __meminit slab_memory_callback(struct notifier_block *self, out: return notifier_from_errno(ret); } -#endif /* CONFIG_NUMA && CONFIG_MEMORY_HOTPLUG */ +#endif /* CONFIG_NUMA */ /* * swap the static kmem_cache_node with kmalloced memory diff --git a/mm/slub.c b/mm/slub.c index 3d2025f7163b..d8f77346376d 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1701,7 +1701,8 @@ static __always_inline bool slab_free_hook(struct kmem_cache *s, } static inline bool slab_free_freelist_hook(struct kmem_cache *s, - void **head, void **tail) + void **head, void **tail, + int *cnt) { void *object; @@ -1728,6 +1729,12 @@ static inline bool slab_free_freelist_hook(struct kmem_cache *s, *head = object; if (!*tail) *tail = object; + } else { + /* + * Adjust the reconstructed freelist depth + * accordingly if object's reuse is delayed. + */ + --(*cnt); } } while (object != old_tail); @@ -3413,7 +3420,9 @@ static __always_inline void do_slab_free(struct kmem_cache *s, struct kmem_cache_cpu *c; unsigned long tid; - memcg_slab_free_hook(s, &head, 1); + /* memcg_slab_free_hook() is already called for bulk free. */ + if (!tail) + memcg_slab_free_hook(s, &head, 1); redo: /* * Determine the currently cpus per cpu slab. @@ -3480,7 +3489,7 @@ static __always_inline void slab_free(struct kmem_cache *s, struct page *page, * With KASAN enabled slab_free_freelist_hook modifies the freelist * to remove objects, whose reuse must be delayed. */ - if (slab_free_freelist_hook(s, &head, &tail)) + if (slab_free_freelist_hook(s, &head, &tail, &cnt)) do_slab_free(s, page, head, tail, cnt, addr); } @@ -4203,8 +4212,8 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags) if (alloc_kmem_cache_cpus(s)) return 0; - free_kmem_cache_nodes(s); error: + __kmem_cache_release(s); return -EINVAL; } @@ -4880,13 +4889,15 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags) return 0; err = sysfs_slab_add(s); - if (err) + if (err) { __kmem_cache_release(s); + return err; + } if (s->flags & SLAB_STORE_USER) debugfs_slab_add(s); - return err; + return 0; } void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller) @@ -6108,9 +6119,14 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep) struct kmem_cache *s = file_inode(filep)->i_private; unsigned long *obj_map; + if (!t) + return -ENOMEM; + obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL); - if (!obj_map) + if (!obj_map) { + seq_release_private(inode, filep); return -ENOMEM; + } if (strcmp(filep->f_path.dentry->d_name.name, "alloc_traces") == 0) alloc = TRACK_ALLOC; @@ -6119,6 +6135,7 @@ static int slab_debug_trace_open(struct inode *inode, struct file *filep) if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) { bitmap_free(obj_map); + seq_release_private(inode, filep); return -ENOMEM; } diff --git a/mm/swap.c b/mm/swap.c index af3cad4e5378..8ff9ba7cf2de 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -80,10 +80,11 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { static void __page_cache_release(struct page *page) { if (PageLRU(page)) { + struct folio *folio = page_folio(page); struct lruvec *lruvec; unsigned long flags; - lruvec = lock_page_lruvec_irqsave(page, &flags); + lruvec = folio_lruvec_lock_irqsave(folio, &flags); del_page_from_lru_list(page, lruvec); __clear_page_lru_flags(page); unlock_page_lruvec_irqrestore(lruvec, flags); @@ -94,7 +95,7 @@ static void __page_cache_release(struct page *page) static void __put_single_page(struct page *page) { __page_cache_release(page); - mem_cgroup_uncharge(page); + mem_cgroup_uncharge(page_folio(page)); free_unref_page(page, 0); } @@ -188,12 +189,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; + struct folio *folio = page_folio(page); /* block memcg migration during page moving between lru */ if (!TestClearPageLRU(page)) continue; - lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags); + lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags); (*move_fn)(page, lruvec); SetPageLRU(page); @@ -206,11 +208,13 @@ static void pagevec_lru_move_fn(struct pagevec *pvec, static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) { - if (!PageUnevictable(page)) { - del_page_from_lru_list(page, lruvec); - ClearPageActive(page); - add_page_to_lru_list_tail(page, lruvec); - __count_vm_events(PGROTATED, thp_nr_pages(page)); + struct folio *folio = page_folio(page); + + if (!folio_test_unevictable(folio)) { + lruvec_del_folio(lruvec, folio); + folio_clear_active(folio); + lruvec_add_folio_tail(lruvec, folio); + __count_vm_events(PGROTATED, folio_nr_pages(folio)); } } @@ -227,23 +231,23 @@ static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page) } /* - * Writeback is about to end against a page which has been marked for immediate - * reclaim. If it still appears to be reclaimable, move it to the tail of the - * inactive list. + * Writeback is about to end against a folio which has been marked for + * immediate reclaim. If it still appears to be reclaimable, move it + * to the tail of the inactive list. * - * rotate_reclaimable_page() must disable IRQs, to prevent nasty races. + * folio_rotate_reclaimable() must disable IRQs, to prevent nasty races. */ -void rotate_reclaimable_page(struct page *page) +void folio_rotate_reclaimable(struct folio *folio) { - if (!PageLocked(page) && !PageDirty(page) && - !PageUnevictable(page) && PageLRU(page)) { + if (!folio_test_locked(folio) && !folio_test_dirty(folio) && + !folio_test_unevictable(folio) && folio_test_lru(folio)) { struct pagevec *pvec; unsigned long flags; - get_page(page); + folio_get(folio); local_lock_irqsave(&lru_rotate.lock, flags); pvec = this_cpu_ptr(&lru_rotate.pvec); - if (pagevec_add_and_need_flush(pvec, page)) + if (pagevec_add_and_need_flush(pvec, &folio->page)) pagevec_lru_move_fn(pvec, pagevec_move_tail_fn); local_unlock_irqrestore(&lru_rotate.lock, flags); } @@ -289,21 +293,21 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages) } while ((lruvec = parent_lruvec(lruvec))); } -void lru_note_cost_page(struct page *page) +void lru_note_cost_folio(struct folio *folio) { - lru_note_cost(mem_cgroup_page_lruvec(page), - page_is_file_lru(page), thp_nr_pages(page)); + lru_note_cost(folio_lruvec(folio), folio_is_file_lru(folio), + folio_nr_pages(folio)); } -static void __activate_page(struct page *page, struct lruvec *lruvec) +static void __folio_activate(struct folio *folio, struct lruvec *lruvec) { - if (!PageActive(page) && !PageUnevictable(page)) { - int nr_pages = thp_nr_pages(page); + if (!folio_test_active(folio) && !folio_test_unevictable(folio)) { + long nr_pages = folio_nr_pages(folio); - del_page_from_lru_list(page, lruvec); - SetPageActive(page); - add_page_to_lru_list(page, lruvec); - trace_mm_lru_activate(page); + lruvec_del_folio(lruvec, folio); + folio_set_active(folio); + lruvec_add_folio(lruvec, folio); + trace_mm_lru_activate(folio); __count_vm_events(PGACTIVATE, nr_pages); __count_memcg_events(lruvec_memcg(lruvec), PGACTIVATE, @@ -312,6 +316,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec) } #ifdef CONFIG_SMP +static void __activate_page(struct page *page, struct lruvec *lruvec) +{ + return __folio_activate(page_folio(page), lruvec); +} + static void activate_page_drain(int cpu) { struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu); @@ -325,16 +334,16 @@ static bool need_activate_page_drain(int cpu) return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0; } -static void activate_page(struct page *page) +static void folio_activate(struct folio *folio) { - page = compound_head(page); - if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { + if (folio_test_lru(folio) && !folio_test_active(folio) && + !folio_test_unevictable(folio)) { struct pagevec *pvec; + folio_get(folio); local_lock(&lru_pvecs.lock); pvec = this_cpu_ptr(&lru_pvecs.activate_page); - get_page(page); - if (pagevec_add_and_need_flush(pvec, page)) + if (pagevec_add_and_need_flush(pvec, &folio->page)) pagevec_lru_move_fn(pvec, __activate_page); local_unlock(&lru_pvecs.lock); } @@ -345,21 +354,20 @@ static inline void activate_page_drain(int cpu) { } -static void activate_page(struct page *page) +static void folio_activate(struct folio *folio) { struct lruvec *lruvec; - page = compound_head(page); - if (TestClearPageLRU(page)) { - lruvec = lock_page_lruvec_irq(page); - __activate_page(page, lruvec); + if (folio_test_clear_lru(folio)) { + lruvec = folio_lruvec_lock_irq(folio); + __folio_activate(folio, lruvec); unlock_page_lruvec_irq(lruvec); - SetPageLRU(page); + folio_set_lru(folio); } } #endif -static void __lru_cache_activate_page(struct page *page) +static void __lru_cache_activate_folio(struct folio *folio) { struct pagevec *pvec; int i; @@ -380,8 +388,8 @@ static void __lru_cache_activate_page(struct page *page) for (i = pagevec_count(pvec) - 1; i >= 0; i--) { struct page *pagevec_page = pvec->pages[i]; - if (pagevec_page == page) { - SetPageActive(page); + if (pagevec_page == &folio->page) { + folio_set_active(folio); break; } } @@ -399,61 +407,59 @@ static void __lru_cache_activate_page(struct page *page) * When a newly allocated page is not yet visible, so safe for non-atomic ops, * __SetPageReferenced(page) may be substituted for mark_page_accessed(page). */ -void mark_page_accessed(struct page *page) +void folio_mark_accessed(struct folio *folio) { - page = compound_head(page); - - if (!PageReferenced(page)) { - SetPageReferenced(page); - } else if (PageUnevictable(page)) { + if (!folio_test_referenced(folio)) { + folio_set_referenced(folio); + } else if (folio_test_unevictable(folio)) { /* * Unevictable pages are on the "LRU_UNEVICTABLE" list. But, * this list is never rotated or maintained, so marking an * evictable page accessed has no effect. */ - } else if (!PageActive(page)) { + } else if (!folio_test_active(folio)) { /* * If the page is on the LRU, queue it for activation via * lru_pvecs.activate_page. Otherwise, assume the page is on a * pagevec, mark it active and it'll be moved to the active * LRU on the next drain. */ - if (PageLRU(page)) - activate_page(page); + if (folio_test_lru(folio)) + folio_activate(folio); else - __lru_cache_activate_page(page); - ClearPageReferenced(page); - workingset_activation(page); + __lru_cache_activate_folio(folio); + folio_clear_referenced(folio); + workingset_activation(folio); } - if (page_is_idle(page)) - clear_page_idle(page); + if (folio_test_idle(folio)) + folio_clear_idle(folio); } -EXPORT_SYMBOL(mark_page_accessed); +EXPORT_SYMBOL(folio_mark_accessed); /** - * lru_cache_add - add a page to a page list - * @page: the page to be added to the LRU. + * folio_add_lru - Add a folio to an LRU list. + * @folio: The folio to be added to the LRU. * - * Queue the page for addition to the LRU via pagevec. The decision on whether + * Queue the folio for addition to the LRU. The decision on whether * to add the page to the [in]active [file|anon] list is deferred until the - * pagevec is drained. This gives a chance for the caller of lru_cache_add() - * have the page added to the active list using mark_page_accessed(). + * pagevec is drained. This gives a chance for the caller of folio_add_lru() + * have the folio added to the active list using folio_mark_accessed(). */ -void lru_cache_add(struct page *page) +void folio_add_lru(struct folio *folio) { struct pagevec *pvec; - VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page); - VM_BUG_ON_PAGE(PageLRU(page), page); + VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio); + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); - get_page(page); + folio_get(folio); local_lock(&lru_pvecs.lock); pvec = this_cpu_ptr(&lru_pvecs.lru_add); - if (pagevec_add_and_need_flush(pvec, page)) + if (pagevec_add_and_need_flush(pvec, &folio->page)) __pagevec_lru_add(pvec); local_unlock(&lru_pvecs.lock); } -EXPORT_SYMBOL(lru_cache_add); +EXPORT_SYMBOL(folio_add_lru); /** * lru_cache_add_inactive_or_unevictable @@ -888,11 +894,12 @@ void release_pages(struct page **pages, int nr) int i; LIST_HEAD(pages_to_free); struct lruvec *lruvec = NULL; - unsigned long flags; + unsigned long flags = 0; unsigned int lock_batch; for (i = 0; i < nr; i++) { struct page *page = pages[i]; + struct folio *folio = page_folio(page); /* * Make sure the IRQ-safe lock-holding time does not get @@ -904,7 +911,7 @@ void release_pages(struct page **pages, int nr) lruvec = NULL; } - page = compound_head(page); + page = &folio->page; if (is_huge_zero_page(page)) continue; @@ -943,7 +950,7 @@ void release_pages(struct page **pages, int nr) if (PageLRU(page)) { struct lruvec *prev_lruvec = lruvec; - lruvec = relock_page_lruvec_irqsave(page, lruvec, + lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags); if (prev_lruvec != lruvec) lock_batch = 0; @@ -985,17 +992,18 @@ void __pagevec_release(struct pagevec *pvec) } EXPORT_SYMBOL(__pagevec_release); -static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) +static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec) { - int was_unevictable = TestClearPageUnevictable(page); - int nr_pages = thp_nr_pages(page); + int was_unevictable = folio_test_clear_unevictable(folio); + long nr_pages = folio_nr_pages(folio); - VM_BUG_ON_PAGE(PageLRU(page), page); + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); /* - * Page becomes evictable in two ways: + * A folio becomes evictable in two ways: * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()]. - * 2) Before acquiring LRU lock to put the page to correct LRU and then + * 2) Before acquiring LRU lock to put the folio on the correct LRU + * and then * a) do PageLRU check with lock [check_move_unevictable_pages] * b) do PageLRU check before lock [clear_page_mlock] * @@ -1004,35 +1012,36 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec) * * #0: __pagevec_lru_add_fn #1: clear_page_mlock * - * SetPageLRU() TestClearPageMlocked() + * folio_set_lru() folio_test_clear_mlocked() * smp_mb() // explicit ordering // above provides strict * // ordering - * PageMlocked() PageLRU() + * folio_test_mlocked() folio_test_lru() * * - * if '#1' does not observe setting of PG_lru by '#0' and fails - * isolation, the explicit barrier will make sure that page_evictable - * check will put the page in correct LRU. Without smp_mb(), SetPageLRU - * can be reordered after PageMlocked check and can make '#1' to fail - * the isolation of the page whose Mlocked bit is cleared (#0 is also - * looking at the same page) and the evictable page will be stranded - * in an unevictable LRU. + * if '#1' does not observe setting of PG_lru by '#0' and + * fails isolation, the explicit barrier will make sure that + * folio_evictable check will put the folio on the correct + * LRU. Without smp_mb(), folio_set_lru() can be reordered + * after folio_test_mlocked() check and can make '#1' fail the + * isolation of the folio whose mlocked bit is cleared (#0 is + * also looking at the same folio) and the evictable folio will + * be stranded on an unevictable LRU. */ - SetPageLRU(page); + folio_set_lru(folio); smp_mb__after_atomic(); - if (page_evictable(page)) { + if (folio_evictable(folio)) { if (was_unevictable) __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); } else { - ClearPageActive(page); - SetPageUnevictable(page); + folio_clear_active(folio); + folio_set_unevictable(folio); if (!was_unevictable) __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); } - add_page_to_lru_list(page, lruvec); - trace_mm_lru_insertion(page); + lruvec_add_folio(lruvec, folio); + trace_mm_lru_insertion(folio); } /* @@ -1046,10 +1055,10 @@ void __pagevec_lru_add(struct pagevec *pvec) unsigned long flags = 0; for (i = 0; i < pagevec_count(pvec); i++) { - struct page *page = pvec->pages[i]; + struct folio *folio = page_folio(pvec->pages[i]); - lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags); - __pagevec_lru_add_fn(page, lruvec); + lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags); + __pagevec_lru_add_fn(folio, lruvec); } if (lruvec) unlock_page_lruvec_irqrestore(lruvec, flags); diff --git a/mm/swap_state.c b/mm/swap_state.c index bc7cee6b2ec5..8d4104242100 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -498,7 +498,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, mem_cgroup_swapin_uncharge_swap(entry); if (shadow) - workingset_refault(page, shadow); + workingset_refault(page_folio(page), shadow); /* Caller will initiate read into locked page */ lru_cache_add(page); diff --git a/mm/swapfile.c b/mm/swapfile.c index 30db362749c0..41c9e92f1f00 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -3534,13 +3534,13 @@ struct swap_info_struct *page_swap_info(struct page *page) } /* - * out-of-line __page_file_ methods to avoid include hell. + * out-of-line methods to avoid include hell. */ -struct address_space *__page_file_mapping(struct page *page) +struct address_space *swapcache_mapping(struct folio *folio) { - return page_swap_info(page)->swap_file->f_mapping; + return page_swap_info(&folio->page)->swap_file->f_mapping; } -EXPORT_SYMBOL_GPL(__page_file_mapping); +EXPORT_SYMBOL_GPL(swapcache_mapping); pgoff_t __page_file_index(struct page *page) { diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 7a9008415534..36e5f6ab976f 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -164,7 +164,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, __SetPageUptodate(page); ret = -ENOMEM; - if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL)) + if (mem_cgroup_charge(page_folio(page), dst_mm, GFP_KERNEL)) goto out_release; ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, diff --git a/mm/util.c b/mm/util.c index bacabe446906..e58151a61255 100644 --- a/mm/util.c +++ b/mm/util.c @@ -654,81 +654,78 @@ void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags) } EXPORT_SYMBOL(kvrealloc); -static inline void *__page_rmapping(struct page *page) -{ - unsigned long mapping; - - mapping = (unsigned long)page->mapping; - mapping &= ~PAGE_MAPPING_FLAGS; - - return (void *)mapping; -} - /* Neutral page->mapping pointer to address_space or anon_vma or other */ void *page_rmapping(struct page *page) { - page = compound_head(page); - return __page_rmapping(page); + return folio_raw_mapping(page_folio(page)); } -/* - * Return true if this page is mapped into pagetables. - * For compound page it returns true if any subpage of compound page is mapped. +/** + * folio_mapped - Is this folio mapped into userspace? + * @folio: The folio. + * + * Return: True if any page in this folio is referenced by user page tables. */ -bool page_mapped(struct page *page) +bool folio_mapped(struct folio *folio) { - int i; + long i, nr; - if (likely(!PageCompound(page))) - return atomic_read(&page->_mapcount) >= 0; - page = compound_head(page); - if (atomic_read(compound_mapcount_ptr(page)) >= 0) + if (folio_test_single(folio)) + return atomic_read(&folio->_mapcount) >= 0; + if (atomic_read(folio_mapcount_ptr(folio)) >= 0) return true; - if (PageHuge(page)) + if (folio_test_hugetlb(folio)) return false; - for (i = 0; i < compound_nr(page); i++) { - if (atomic_read(&page[i]._mapcount) >= 0) + + nr = folio_nr_pages(folio); + for (i = 0; i < nr; i++) { + if (atomic_read(&folio_page(folio, i)->_mapcount) >= 0) return true; } return false; } -EXPORT_SYMBOL(page_mapped); +EXPORT_SYMBOL(folio_mapped); struct anon_vma *page_anon_vma(struct page *page) { - unsigned long mapping; + struct folio *folio = page_folio(page); + unsigned long mapping = (unsigned long)folio->mapping; - page = compound_head(page); - mapping = (unsigned long)page->mapping; if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) return NULL; - return __page_rmapping(page); + return (void *)(mapping - PAGE_MAPPING_ANON); } -struct address_space *page_mapping(struct page *page) +/** + * folio_mapping - Find the mapping where this folio is stored. + * @folio: The folio. + * + * For folios which are in the page cache, return the mapping that this + * page belongs to. Folios in the swap cache return the swap mapping + * this page is stored in (which is different from the mapping for the + * swap file or swap device where the data is stored). + * + * You can call this for folios which aren't in the swap cache or page + * cache and it will return NULL. + */ +struct address_space *folio_mapping(struct folio *folio) { struct address_space *mapping; - page = compound_head(page); - /* This happens if someone calls flush_dcache_page on slab page */ - if (unlikely(PageSlab(page))) + if (unlikely(folio_test_slab(folio))) return NULL; - if (unlikely(PageSwapCache(page))) { - swp_entry_t entry; + if (unlikely(folio_test_swapcache(folio))) + return swap_address_space(folio_swap_entry(folio)); - entry.val = page_private(page); - return swap_address_space(entry); - } - - mapping = page->mapping; + mapping = folio->mapping; if ((unsigned long)mapping & PAGE_MAPPING_ANON) return NULL; return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS); } -EXPORT_SYMBOL(page_mapping); +EXPORT_SYMBOL(folio_mapping); /* Slow path of page_mapcount() for compound pages */ int __page_mapcount(struct page *page) @@ -750,13 +747,26 @@ int __page_mapcount(struct page *page) } EXPORT_SYMBOL_GPL(__page_mapcount); -void copy_huge_page(struct page *dst, struct page *src) +/** + * folio_copy - Copy the contents of one folio to another. + * @dst: Folio to copy to. + * @src: Folio to copy from. + * + * The bytes in the folio represented by @src are copied to @dst. + * Assumes the caller has validated that @dst is at least as large as @src. + * Can be called in atomic context for order-0 folios, but if the folio is + * larger, it may sleep. + */ +void folio_copy(struct folio *dst, struct folio *src) { - unsigned i, nr = compound_nr(src); + long i = 0; + long nr = folio_nr_pages(src); - for (i = 0; i < nr; i++) { + for (;;) { + copy_highpage(folio_page(dst, i), folio_page(src, i)); + if (++i == nr) + break; cond_resched(); - copy_highpage(nth_page(dst, i), nth_page(src, i)); } } @@ -1079,3 +1089,14 @@ void page_offline_end(void) up_write(&page_offline_rwsem); } EXPORT_SYMBOL(page_offline_end); + +#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO +void flush_dcache_folio(struct folio *folio) +{ + long i, nr = folio_nr_pages(folio); + + for (i = 0; i < nr; i++) + flush_dcache_page(folio_page(folio, i)); +} +EXPORT_SYMBOL(flush_dcache_folio); +#endif diff --git a/mm/vmalloc.c b/mm/vmalloc.c index d77830ff604c..e8a807c78110 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2816,6 +2816,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid, unsigned int order, unsigned int nr_pages, struct page **pages) { unsigned int nr_allocated = 0; + struct page *page; + int i; /* * For order-0 pages we make use of bulk allocator, if @@ -2823,7 +2825,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, * to fails, fallback to a single page allocator that is * more permissive. */ - if (!order) { + if (!order && nid != NUMA_NO_NODE) { while (nr_allocated < nr_pages) { unsigned int nr, nr_pages_request; @@ -2848,7 +2850,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid, if (nr != nr_pages_request) break; } - } else + } else if (order) /* * Compound pages required for remap_vmalloc_page if * high-order pages. @@ -2856,11 +2858,12 @@ vm_area_alloc_pages(gfp_t gfp, int nid, gfp |= __GFP_COMP; /* High-order pages or fallback path if "bulk" fails. */ - while (nr_allocated < nr_pages) { - struct page *page; - int i; - page = alloc_pages_node(nid, gfp, order); + while (nr_allocated < nr_pages) { + if (nid == NUMA_NO_NODE) + page = alloc_pages(gfp, order); + else + page = alloc_pages_node(nid, gfp, order); if (unlikely(!page)) break; diff --git a/mm/vmscan.c b/mm/vmscan.c index 74296c2d1fed..306229c4313f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2090,6 +2090,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, */ int isolate_lru_page(struct page *page) { + struct folio *folio = page_folio(page); int ret = -EBUSY; VM_BUG_ON_PAGE(!page_count(page), page); @@ -2099,7 +2100,7 @@ int isolate_lru_page(struct page *page) struct lruvec *lruvec; get_page(page); - lruvec = lock_page_lruvec_irq(page); + lruvec = folio_lruvec_lock_irq(folio); del_page_from_lru_list(page, lruvec); unlock_page_lruvec_irq(lruvec); ret = 0; @@ -2199,7 +2200,7 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec, * All pages were isolated from the same lruvec (and isolation * inhibits memcg migration). */ - VM_BUG_ON_PAGE(!page_matches_lruvec(page, lruvec), page); + VM_BUG_ON_PAGE(!folio_matches_lruvec(page_folio(page), lruvec), page); add_page_to_lru_list(page, lruvec); nr_pages = thp_nr_pages(page); nr_moved += nr_pages; @@ -4665,6 +4666,7 @@ void check_move_unevictable_pages(struct pagevec *pvec) for (i = 0; i < pvec->nr; i++) { struct page *page = pvec->pages[i]; + struct folio *folio = page_folio(page); int nr_pages; if (PageTransTail(page)) @@ -4677,7 +4679,7 @@ void check_move_unevictable_pages(struct pagevec *pvec) if (!TestClearPageLRU(page)) continue; - lruvec = relock_page_lruvec_irq(page, lruvec); + lruvec = folio_lruvec_relock_irq(folio, lruvec); if (page_evictable(page) && PageUnevictable(page)) { del_page_from_lru_list(page, lruvec); ClearPageUnevictable(page); diff --git a/mm/workingset.c b/mm/workingset.c index d5b81e4f4cbe..109ab978251a 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -273,17 +273,17 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) } /** - * workingset_refault - evaluate the refault of a previously evicted page - * @page: the freshly allocated replacement page - * @shadow: shadow entry of the evicted page + * workingset_refault - Evaluate the refault of a previously evicted folio. + * @folio: The freshly allocated replacement folio. + * @shadow: Shadow entry of the evicted folio. * * Calculates and evaluates the refault distance of the previously - * evicted page in the context of the node and the memcg whose memory + * evicted folio in the context of the node and the memcg whose memory * pressure caused the eviction. */ -void workingset_refault(struct page *page, void *shadow) +void workingset_refault(struct folio *folio, void *shadow) { - bool file = page_is_file_lru(page); + bool file = folio_is_file_lru(folio); struct mem_cgroup *eviction_memcg; struct lruvec *eviction_lruvec; unsigned long refault_distance; @@ -295,16 +295,17 @@ void workingset_refault(struct page *page, void *shadow) unsigned long refault; bool workingset; int memcgid; + long nr; unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset); rcu_read_lock(); /* * Look up the memcg associated with the stored ID. It might - * have been deleted since the page's eviction. + * have been deleted since the folio's eviction. * * Note that in rare events the ID could have been recycled - * for a new cgroup that refaults a shared page. This is + * for a new cgroup that refaults a shared folio. This is * impossible to tell from the available data. However, this * should be a rare and limited disturbance, and activations * are always speculative anyway. Ultimately, it's the aging @@ -340,17 +341,18 @@ void workingset_refault(struct page *page, void *shadow) refault_distance = (refault - eviction) & EVICTION_MASK; /* - * The activation decision for this page is made at the level + * The activation decision for this folio is made at the level * where the eviction occurred, as that is where the LRU order - * during page reclaim is being determined. + * during folio reclaim is being determined. * - * However, the cgroup that will own the page is the one that + * However, the cgroup that will own the folio is the one that * is actually experiencing the refault event. */ - memcg = page_memcg(page); + nr = folio_nr_pages(folio); + memcg = folio_memcg(folio); lruvec = mem_cgroup_lruvec(memcg, pgdat); - inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file); + mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr); mem_cgroup_flush_stats(); /* @@ -376,16 +378,16 @@ void workingset_refault(struct page *page, void *shadow) if (refault_distance > workingset_size) goto out; - SetPageActive(page); - workingset_age_nonresident(lruvec, thp_nr_pages(page)); - inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file); + folio_set_active(folio); + workingset_age_nonresident(lruvec, nr); + mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr); - /* Page was active prior to eviction */ + /* Folio was active prior to eviction */ if (workingset) { - SetPageWorkingset(page); + folio_set_workingset(folio); /* XXX: Move to lru_cache_add() when it supports new vs putback */ - lru_note_cost_page(page); - inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file); + lru_note_cost_folio(folio); + mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr); } out: rcu_read_unlock(); @@ -393,12 +395,11 @@ out: /** * workingset_activation - note a page activation - * @page: page that is being activated + * @folio: Folio that is being activated. */ -void workingset_activation(struct page *page) +void workingset_activation(struct folio *folio) { struct mem_cgroup *memcg; - struct lruvec *lruvec; rcu_read_lock(); /* @@ -408,11 +409,10 @@ void workingset_activation(struct page *page) * XXX: See workingset_refault() - this should return * root_mem_cgroup even for !CONFIG_MEMCG. */ - memcg = page_memcg_rcu(page); + memcg = folio_memcg_rcu(folio); if (!mem_cgroup_disabled() && !memcg) goto out; - lruvec = mem_cgroup_page_lruvec(page); - workingset_age_nonresident(lruvec, thp_nr_pages(page)); + workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio)); out: rcu_read_unlock(); } diff --git a/net/batman-adv/bridge_loop_avoidance.c b/net/batman-adv/bridge_loop_avoidance.c index 1669744304c5..17687848daec 100644 --- a/net/batman-adv/bridge_loop_avoidance.c +++ b/net/batman-adv/bridge_loop_avoidance.c @@ -1560,10 +1560,14 @@ int batadv_bla_init(struct batadv_priv *bat_priv) return 0; bat_priv->bla.claim_hash = batadv_hash_new(128); - bat_priv->bla.backbone_hash = batadv_hash_new(32); + if (!bat_priv->bla.claim_hash) + return -ENOMEM; - if (!bat_priv->bla.claim_hash || !bat_priv->bla.backbone_hash) + bat_priv->bla.backbone_hash = batadv_hash_new(32); + if (!bat_priv->bla.backbone_hash) { + batadv_hash_destroy(bat_priv->bla.claim_hash); return -ENOMEM; + } batadv_hash_set_lock_class(bat_priv->bla.claim_hash, &batadv_claim_hash_lock_class_key); diff --git a/net/batman-adv/main.c b/net/batman-adv/main.c index 3ddd66e4c29e..5207cd8d6ad8 100644 --- a/net/batman-adv/main.c +++ b/net/batman-adv/main.c @@ -190,29 +190,41 @@ int batadv_mesh_init(struct net_device *soft_iface) bat_priv->gw.generation = 0; - ret = batadv_v_mesh_init(bat_priv); - if (ret < 0) - goto err; - ret = batadv_originator_init(bat_priv); - if (ret < 0) - goto err; + if (ret < 0) { + atomic_set(&bat_priv->mesh_state, BATADV_MESH_DEACTIVATING); + goto err_orig; + } ret = batadv_tt_init(bat_priv); - if (ret < 0) - goto err; + if (ret < 0) { + atomic_set(&bat_priv->mesh_state, BATADV_MESH_DEACTIVATING); + goto err_tt; + } + + ret = batadv_v_mesh_init(bat_priv); + if (ret < 0) { + atomic_set(&bat_priv->mesh_state, BATADV_MESH_DEACTIVATING); + goto err_v; + } ret = batadv_bla_init(bat_priv); - if (ret < 0) - goto err; + if (ret < 0) { + atomic_set(&bat_priv->mesh_state, BATADV_MESH_DEACTIVATING); + goto err_bla; + } ret = batadv_dat_init(bat_priv); - if (ret < 0) - goto err; + if (ret < 0) { + atomic_set(&bat_priv->mesh_state, BATADV_MESH_DEACTIVATING); + goto err_dat; + } ret = batadv_nc_mesh_init(bat_priv); - if (ret < 0) - goto err; + if (ret < 0) { + atomic_set(&bat_priv->mesh_state, BATADV_MESH_DEACTIVATING); + goto err_nc; + } batadv_gw_init(bat_priv); batadv_mcast_init(bat_priv); @@ -222,8 +234,20 @@ int batadv_mesh_init(struct net_device *soft_iface) return 0; -err: - batadv_mesh_free(soft_iface); +err_nc: + batadv_dat_free(bat_priv); +err_dat: + batadv_bla_free(bat_priv); +err_bla: + batadv_v_mesh_free(bat_priv); +err_v: + batadv_tt_free(bat_priv); +err_tt: + batadv_originator_free(bat_priv); +err_orig: + batadv_purge_outstanding_packets(bat_priv, NULL); + atomic_set(&bat_priv->mesh_state, BATADV_MESH_INACTIVE); + return ret; } diff --git a/net/batman-adv/network-coding.c b/net/batman-adv/network-coding.c index 9f06132e007d..0a7f1d36a6a8 100644 --- a/net/batman-adv/network-coding.c +++ b/net/batman-adv/network-coding.c @@ -152,8 +152,10 @@ int batadv_nc_mesh_init(struct batadv_priv *bat_priv) &batadv_nc_coding_hash_lock_class_key); bat_priv->nc.decoding_hash = batadv_hash_new(128); - if (!bat_priv->nc.decoding_hash) + if (!bat_priv->nc.decoding_hash) { + batadv_hash_destroy(bat_priv->nc.coding_hash); goto err; + } batadv_hash_set_lock_class(bat_priv->nc.decoding_hash, &batadv_nc_decoding_hash_lock_class_key); diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c index e0b3dace2020..4b7ad6684bc4 100644 --- a/net/batman-adv/translation-table.c +++ b/net/batman-adv/translation-table.c @@ -4162,8 +4162,10 @@ int batadv_tt_init(struct batadv_priv *bat_priv) return ret; ret = batadv_tt_global_init(bat_priv); - if (ret < 0) + if (ret < 0) { + batadv_tt_local_table_free(bat_priv); return ret; + } batadv_tvlv_handler_register(bat_priv, batadv_tt_tvlv_ogm_handler_v1, batadv_tt_tvlv_unicast_handler_v1, diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index e8136db44462..37ca76406f1e 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -1125,9 +1125,7 @@ static inline unsigned long br_multicast_lmqt(const struct net_bridge_mcast *brm static inline unsigned long br_multicast_gmi(const struct net_bridge_mcast *brmctx) { - /* use the RFC default of 2 for QRV */ - return 2 * brmctx->multicast_query_interval + - brmctx->multicast_query_response_interval; + return brmctx->multicast_membership_interval; } static inline bool diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c index 83d1798dfbb4..ba045f35114d 100644 --- a/net/bridge/netfilter/ebtables.c +++ b/net/bridge/netfilter/ebtables.c @@ -926,7 +926,9 @@ static int translate_table(struct net *net, const char *name, return -ENOMEM; for_each_possible_cpu(i) { newinfo->chainstack[i] = - vmalloc(array_size(udc_cnt, sizeof(*(newinfo->chainstack[0])))); + vmalloc_node(array_size(udc_cnt, + sizeof(*(newinfo->chainstack[0]))), + cpu_to_node(i)); if (!newinfo->chainstack[i]) { while (i) vfree(newinfo->chainstack[--i]); diff --git a/net/can/isotp.c b/net/can/isotp.c index caaa532ece94..df6968b28bf4 100644 --- a/net/can/isotp.c +++ b/net/can/isotp.c @@ -121,7 +121,7 @@ enum { struct tpcon { int idx; int len; - u8 state; + u32 state; u8 bs; u8 sn; u8 ll_dl; @@ -848,6 +848,7 @@ static int isotp_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) { struct sock *sk = sock->sk; struct isotp_sock *so = isotp_sk(sk); + u32 old_state = so->tx.state; struct sk_buff *skb; struct net_device *dev; struct canfd_frame *cf; @@ -860,45 +861,55 @@ static int isotp_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) return -EADDRNOTAVAIL; /* we do not support multiple buffers - for now */ - if (so->tx.state != ISOTP_IDLE || wq_has_sleeper(&so->wait)) { - if (msg->msg_flags & MSG_DONTWAIT) - return -EAGAIN; + if (cmpxchg(&so->tx.state, ISOTP_IDLE, ISOTP_SENDING) != ISOTP_IDLE || + wq_has_sleeper(&so->wait)) { + if (msg->msg_flags & MSG_DONTWAIT) { + err = -EAGAIN; + goto err_out; + } /* wait for complete transmission of current pdu */ - wait_event_interruptible(so->wait, so->tx.state == ISOTP_IDLE); + err = wait_event_interruptible(so->wait, so->tx.state == ISOTP_IDLE); + if (err) + goto err_out; } - if (!size || size > MAX_MSG_LENGTH) - return -EINVAL; + if (!size || size > MAX_MSG_LENGTH) { + err = -EINVAL; + goto err_out; + } /* take care of a potential SF_DL ESC offset for TX_DL > 8 */ off = (so->tx.ll_dl > CAN_MAX_DLEN) ? 1 : 0; /* does the given data fit into a single frame for SF_BROADCAST? */ if ((so->opt.flags & CAN_ISOTP_SF_BROADCAST) && - (size > so->tx.ll_dl - SF_PCI_SZ4 - ae - off)) - return -EINVAL; + (size > so->tx.ll_dl - SF_PCI_SZ4 - ae - off)) { + err = -EINVAL; + goto err_out; + } err = memcpy_from_msg(so->tx.buf, msg, size); if (err < 0) - return err; + goto err_out; dev = dev_get_by_index(sock_net(sk), so->ifindex); - if (!dev) - return -ENXIO; + if (!dev) { + err = -ENXIO; + goto err_out; + } skb = sock_alloc_send_skb(sk, so->ll.mtu + sizeof(struct can_skb_priv), msg->msg_flags & MSG_DONTWAIT, &err); if (!skb) { dev_put(dev); - return err; + goto err_out; } can_skb_reserve(skb); can_skb_prv(skb)->ifindex = dev->ifindex; can_skb_prv(skb)->skbcnt = 0; - so->tx.state = ISOTP_SENDING; so->tx.len = size; so->tx.idx = 0; @@ -954,15 +965,25 @@ static int isotp_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) if (err) { pr_notice_once("can-isotp: %s: can_send_ret %pe\n", __func__, ERR_PTR(err)); - return err; + goto err_out; } if (wait_tx_done) { /* wait for complete transmission of current pdu */ wait_event_interruptible(so->wait, so->tx.state == ISOTP_IDLE); + + if (sk->sk_err) + return -sk->sk_err; } return size; + +err_out: + so->tx.state = old_state; + if (so->tx.state == ISOTP_IDLE) + wake_up_interruptible(&so->wait); + + return err; } static int isotp_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, diff --git a/net/can/j1939/j1939-priv.h b/net/can/j1939/j1939-priv.h index f6df20808f5e..16af1a7f80f6 100644 --- a/net/can/j1939/j1939-priv.h +++ b/net/can/j1939/j1939-priv.h @@ -330,6 +330,7 @@ int j1939_session_activate(struct j1939_session *session); void j1939_tp_schedule_txtimer(struct j1939_session *session, int msec); void j1939_session_timers_cancel(struct j1939_session *session); +#define J1939_MIN_TP_PACKET_SIZE 9 #define J1939_MAX_TP_PACKET_SIZE (7 * 0xff) #define J1939_MAX_ETP_PACKET_SIZE (7 * 0x00ffffff) diff --git a/net/can/j1939/main.c b/net/can/j1939/main.c index 08c8606cfd9c..9bc55ecb37f9 100644 --- a/net/can/j1939/main.c +++ b/net/can/j1939/main.c @@ -249,11 +249,14 @@ struct j1939_priv *j1939_netdev_start(struct net_device *ndev) struct j1939_priv *priv, *priv_new; int ret; - priv = j1939_priv_get_by_ndev(ndev); + spin_lock(&j1939_netdev_lock); + priv = j1939_priv_get_by_ndev_locked(ndev); if (priv) { kref_get(&priv->rx_kref); + spin_unlock(&j1939_netdev_lock); return priv; } + spin_unlock(&j1939_netdev_lock); priv = j1939_priv_create(ndev); if (!priv) @@ -269,10 +272,10 @@ struct j1939_priv *j1939_netdev_start(struct net_device *ndev) /* Someone was faster than us, use their priv and roll * back our's. */ + kref_get(&priv_new->rx_kref); spin_unlock(&j1939_netdev_lock); dev_put(ndev); kfree(priv); - kref_get(&priv_new->rx_kref); return priv_new; } j1939_priv_set(ndev, priv); diff --git a/net/can/j1939/transport.c b/net/can/j1939/transport.c index bb5c4b8979be..6c0a0ebdd024 100644 --- a/net/can/j1939/transport.c +++ b/net/can/j1939/transport.c @@ -1237,12 +1237,11 @@ static enum hrtimer_restart j1939_tp_rxtimer(struct hrtimer *hrtimer) session->err = -ETIME; j1939_session_deactivate(session); } else { - netdev_alert(priv->ndev, "%s: 0x%p: rx timeout, send abort\n", - __func__, session); - j1939_session_list_lock(session->priv); if (session->state >= J1939_SESSION_ACTIVE && session->state < J1939_SESSION_ACTIVE_MAX) { + netdev_alert(priv->ndev, "%s: 0x%p: rx timeout, send abort\n", + __func__, session); j1939_session_get(session); hrtimer_start(&session->rxtimer, ms_to_ktime(J1939_XTP_ABORT_TIMEOUT_MS), @@ -1609,6 +1608,8 @@ j1939_session *j1939_xtp_rx_rts_session_new(struct j1939_priv *priv, abort = J1939_XTP_ABORT_FAULT; else if (len > priv->tp_max_packet_size) abort = J1939_XTP_ABORT_RESOURCE; + else if (len < J1939_MIN_TP_PACKET_SIZE) + abort = J1939_XTP_ABORT_FAULT; } if (abort != J1939_XTP_NO_ABORT) { @@ -1789,6 +1790,7 @@ static void j1939_xtp_rx_dpo(struct j1939_priv *priv, struct sk_buff *skb, static void j1939_xtp_rx_dat_one(struct j1939_session *session, struct sk_buff *skb) { + enum j1939_xtp_abort abort = J1939_XTP_ABORT_FAULT; struct j1939_priv *priv = session->priv; struct j1939_sk_buff_cb *skcb, *se_skcb; struct sk_buff *se_skb = NULL; @@ -1803,9 +1805,11 @@ static void j1939_xtp_rx_dat_one(struct j1939_session *session, skcb = j1939_skb_to_cb(skb); dat = skb->data; - if (skb->len <= 1) + if (skb->len != 8) { /* makes no sense */ + abort = J1939_XTP_ABORT_UNEXPECTED_DATA; goto out_session_cancel; + } switch (session->last_cmd) { case 0xff: @@ -1904,7 +1908,7 @@ static void j1939_xtp_rx_dat_one(struct j1939_session *session, out_session_cancel: kfree_skb(se_skb); j1939_session_timers_cancel(session); - j1939_session_cancel(session, J1939_XTP_ABORT_FAULT); + j1939_session_cancel(session, abort); j1939_session_put(session); } diff --git a/net/core/dev.c b/net/core/dev.c index 7ee9fecd3aff..eb3a366bf212 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3163,6 +3163,12 @@ static u16 skb_tx_hash(const struct net_device *dev, qoffset = sb_dev->tc_to_txq[tc].offset; qcount = sb_dev->tc_to_txq[tc].count; + if (unlikely(!qcount)) { + net_warn_ratelimited("%s: invalid qcount, qoffset %u for tc %u\n", + sb_dev->name, qoffset, tc); + qoffset = 0; + qcount = dev->real_num_tx_queues; + } } if (skb_rx_queue_recorded(skb)) { @@ -3906,7 +3912,8 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb) skb_reset_mac_header(skb); __skb_pull(skb, skb_network_offset(skb)); skb->pkt_type = PACKET_LOOPBACK; - skb->ip_summed = CHECKSUM_UNNECESSARY; + if (skb->ip_summed == CHECKSUM_NONE) + skb->ip_summed = CHECKSUM_UNNECESSARY; WARN_ON(!skb_dst(skb)); skb_dst_force(skb); netif_rx_ni(skb); diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index f6197774048b..b2e49eb7001d 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1973,9 +1973,9 @@ int netdev_register_kobject(struct net_device *ndev) int netdev_change_owner(struct net_device *ndev, const struct net *net_old, const struct net *net_new) { + kuid_t old_uid = GLOBAL_ROOT_UID, new_uid = GLOBAL_ROOT_UID; + kgid_t old_gid = GLOBAL_ROOT_GID, new_gid = GLOBAL_ROOT_GID; struct device *dev = &ndev->dev; - kuid_t old_uid, new_uid; - kgid_t old_gid, new_gid; int error; net_ns_get_ownership(net_old, &old_uid, &old_gid); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 2170bea2c7de..fe9358437380 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -80,6 +80,7 @@ #include <linux/indirect_call_wrapper.h> #include "datagram.h" +#include "sock_destructor.h" struct kmem_cache *skbuff_head_cache __ro_after_init; static struct kmem_cache *skbuff_fclone_cache __ro_after_init; @@ -1804,30 +1805,39 @@ EXPORT_SYMBOL(skb_realloc_headroom); struct sk_buff *skb_expand_head(struct sk_buff *skb, unsigned int headroom) { int delta = headroom - skb_headroom(skb); + int osize = skb_end_offset(skb); + struct sock *sk = skb->sk; if (WARN_ONCE(delta <= 0, "%s is expecting an increase in the headroom", __func__)) return skb; - /* pskb_expand_head() might crash, if skb is shared */ - if (skb_shared(skb)) { + delta = SKB_DATA_ALIGN(delta); + /* pskb_expand_head() might crash, if skb is shared. */ + if (skb_shared(skb) || !is_skb_wmem(skb)) { struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC); - if (likely(nskb)) { - if (skb->sk) - skb_set_owner_w(nskb, skb->sk); - consume_skb(skb); - } else { - kfree_skb(skb); - } + if (unlikely(!nskb)) + goto fail; + + if (sk) + skb_set_owner_w(nskb, sk); + consume_skb(skb); skb = nskb; } - if (skb && - pskb_expand_head(skb, SKB_DATA_ALIGN(delta), 0, GFP_ATOMIC)) { - kfree_skb(skb); - skb = NULL; + if (pskb_expand_head(skb, delta, 0, GFP_ATOMIC)) + goto fail; + + if (sk && is_skb_wmem(skb)) { + delta = skb_end_offset(skb) - osize; + refcount_add(delta, &sk->sk_wmem_alloc); + skb->truesize += delta; } return skb; + +fail: + kfree_skb(skb); + return NULL; } EXPORT_SYMBOL(skb_expand_head); diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 2d6249b28928..a86ef7e844f8 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -474,6 +474,20 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, } EXPORT_SYMBOL_GPL(sk_msg_recvmsg); +bool sk_msg_is_readable(struct sock *sk) +{ + struct sk_psock *psock; + bool empty = true; + + rcu_read_lock(); + psock = sk_psock(sk); + if (likely(psock)) + empty = list_empty(&psock->ingress_msg); + rcu_read_unlock(); + return !empty; +} +EXPORT_SYMBOL_GPL(sk_msg_is_readable); + static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, struct sk_buff *skb) { diff --git a/net/core/sock_destructor.h b/net/core/sock_destructor.h new file mode 100644 index 000000000000..2f396e6bfba5 --- /dev/null +++ b/net/core/sock_destructor.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef _NET_CORE_SOCK_DESTRUCTOR_H +#define _NET_CORE_SOCK_DESTRUCTOR_H +#include <net/tcp.h> + +static inline bool is_skb_wmem(const struct sk_buff *skb) +{ + return skb->destructor == sock_wfree || + skb->destructor == __sock_wfree || + (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree); +} +#endif diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index c8496c1142c9..5f88526ad61c 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -419,7 +419,7 @@ static struct ctl_table net_core_table[] = { .mode = 0600, .proc_handler = proc_dolongvec_minmax_bpf_restricted, .extra1 = &long_one, - .extra2 = &long_max, + .extra2 = &bpf_jit_limit_max, }, #endif { diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c index da18094b5a04..e9911b18bdbf 100644 --- a/net/dsa/dsa2.c +++ b/net/dsa/dsa2.c @@ -1374,12 +1374,15 @@ static int dsa_switch_parse_ports_of(struct dsa_switch *ds, for_each_available_child_of_node(ports, port) { err = of_property_read_u32(port, "reg", ®); - if (err) + if (err) { + of_node_put(port); goto out_put_node; + } if (reg >= ds->num_ports) { dev_err(ds->dev, "port %pOF index %u exceeds num_ports (%zu)\n", port, reg, ds->num_ports); + of_node_put(port); err = -EINVAL; goto out_put_node; } @@ -1387,8 +1390,10 @@ static int dsa_switch_parse_ports_of(struct dsa_switch *ds, dp = dsa_to_port(ds, reg); err = dsa_port_parse_of(dp, port); - if (err) + if (err) { + of_node_put(port); goto out_put_node; + } } out_put_node: diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e8b48df73c85..f5c336f8b0c8 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -486,10 +486,7 @@ static bool tcp_stream_is_readable(struct sock *sk, int target) { if (tcp_epollin_ready(sk, target)) return true; - - if (sk->sk_prot->stream_memory_read) - return sk->sk_prot->stream_memory_read(sk); - return false; + return sk_is_readable(sk); } /* diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index d3e9386b493e..5f4d6f45d87f 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -150,19 +150,6 @@ int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, EXPORT_SYMBOL_GPL(tcp_bpf_sendmsg_redir); #ifdef CONFIG_BPF_SYSCALL -static bool tcp_bpf_stream_read(const struct sock *sk) -{ - struct sk_psock *psock; - bool empty = true; - - rcu_read_lock(); - psock = sk_psock(sk); - if (likely(psock)) - empty = list_empty(&psock->ingress_msg); - rcu_read_unlock(); - return !empty; -} - static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock, long timeo) { @@ -232,6 +219,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, bool cork = false, enospc = sk_msg_full(msg); struct sock *sk_redir; u32 tosend, delta = 0; + u32 eval = __SK_NONE; int ret; more_data: @@ -275,13 +263,24 @@ more_data: case __SK_REDIRECT: sk_redir = psock->sk_redir; sk_msg_apply_bytes(psock, tosend); + if (!psock->apply_bytes) { + /* Clean up before releasing the sock lock. */ + eval = psock->eval; + psock->eval = __SK_NONE; + psock->sk_redir = NULL; + } if (psock->cork) { cork = true; psock->cork = NULL; } sk_msg_return(sk, msg, tosend); release_sock(sk); + ret = tcp_bpf_sendmsg_redir(sk_redir, msg, tosend, flags); + + if (eval == __SK_REDIRECT) + sock_put(sk_redir); + lock_sock(sk); if (unlikely(ret < 0)) { int free = sk_msg_free_nocharge(sk, msg); @@ -479,7 +478,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS], prot[TCP_BPF_BASE].unhash = sock_map_unhash; prot[TCP_BPF_BASE].close = sock_map_close; prot[TCP_BPF_BASE].recvmsg = tcp_bpf_recvmsg; - prot[TCP_BPF_BASE].stream_memory_read = tcp_bpf_stream_read; + prot[TCP_BPF_BASE].sock_is_readable = sk_msg_is_readable; prot[TCP_BPF_TX] = prot[TCP_BPF_BASE]; prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 2e62e0d6373a..5b8ce65dfc06 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1037,6 +1037,20 @@ static void tcp_v4_reqsk_destructor(struct request_sock *req) DEFINE_STATIC_KEY_FALSE(tcp_md5_needed); EXPORT_SYMBOL(tcp_md5_needed); +static bool better_md5_match(struct tcp_md5sig_key *old, struct tcp_md5sig_key *new) +{ + if (!old) + return true; + + /* l3index always overrides non-l3index */ + if (old->l3index && new->l3index == 0) + return false; + if (old->l3index == 0 && new->l3index) + return true; + + return old->prefixlen < new->prefixlen; +} + /* Find the Key structure for an address. */ struct tcp_md5sig_key *__tcp_md5_do_lookup(const struct sock *sk, int l3index, const union tcp_md5_addr *addr, @@ -1059,7 +1073,7 @@ struct tcp_md5sig_key *__tcp_md5_do_lookup(const struct sock *sk, int l3index, lockdep_sock_is_held(sk)) { if (key->family != family) continue; - if (key->l3index && key->l3index != l3index) + if (key->flags & TCP_MD5SIG_FLAG_IFINDEX && key->l3index != l3index) continue; if (family == AF_INET) { mask = inet_make_mask(key->prefixlen); @@ -1074,8 +1088,7 @@ struct tcp_md5sig_key *__tcp_md5_do_lookup(const struct sock *sk, int l3index, match = false; } - if (match && (!best_match || - key->prefixlen > best_match->prefixlen)) + if (match && better_md5_match(best_match, key)) best_match = key; } return best_match; @@ -1085,7 +1098,7 @@ EXPORT_SYMBOL(__tcp_md5_do_lookup); static struct tcp_md5sig_key *tcp_md5_do_lookup_exact(const struct sock *sk, const union tcp_md5_addr *addr, int family, u8 prefixlen, - int l3index) + int l3index, u8 flags) { const struct tcp_sock *tp = tcp_sk(sk); struct tcp_md5sig_key *key; @@ -1105,7 +1118,9 @@ static struct tcp_md5sig_key *tcp_md5_do_lookup_exact(const struct sock *sk, lockdep_sock_is_held(sk)) { if (key->family != family) continue; - if (key->l3index && key->l3index != l3index) + if ((key->flags & TCP_MD5SIG_FLAG_IFINDEX) != (flags & TCP_MD5SIG_FLAG_IFINDEX)) + continue; + if (key->l3index != l3index) continue; if (!memcmp(&key->addr, addr, size) && key->prefixlen == prefixlen) @@ -1129,7 +1144,7 @@ EXPORT_SYMBOL(tcp_v4_md5_lookup); /* This can be called on a newly created socket, from other files */ int tcp_md5_do_add(struct sock *sk, const union tcp_md5_addr *addr, - int family, u8 prefixlen, int l3index, + int family, u8 prefixlen, int l3index, u8 flags, const u8 *newkey, u8 newkeylen, gfp_t gfp) { /* Add Key to the list */ @@ -1137,7 +1152,7 @@ int tcp_md5_do_add(struct sock *sk, const union tcp_md5_addr *addr, struct tcp_sock *tp = tcp_sk(sk); struct tcp_md5sig_info *md5sig; - key = tcp_md5_do_lookup_exact(sk, addr, family, prefixlen, l3index); + key = tcp_md5_do_lookup_exact(sk, addr, family, prefixlen, l3index, flags); if (key) { /* Pre-existing entry - just update that one. * Note that the key might be used concurrently. @@ -1182,6 +1197,7 @@ int tcp_md5_do_add(struct sock *sk, const union tcp_md5_addr *addr, key->family = family; key->prefixlen = prefixlen; key->l3index = l3index; + key->flags = flags; memcpy(&key->addr, addr, (family == AF_INET6) ? sizeof(struct in6_addr) : sizeof(struct in_addr)); @@ -1191,11 +1207,11 @@ int tcp_md5_do_add(struct sock *sk, const union tcp_md5_addr *addr, EXPORT_SYMBOL(tcp_md5_do_add); int tcp_md5_do_del(struct sock *sk, const union tcp_md5_addr *addr, int family, - u8 prefixlen, int l3index) + u8 prefixlen, int l3index, u8 flags) { struct tcp_md5sig_key *key; - key = tcp_md5_do_lookup_exact(sk, addr, family, prefixlen, l3index); + key = tcp_md5_do_lookup_exact(sk, addr, family, prefixlen, l3index, flags); if (!key) return -ENOENT; hlist_del_rcu(&key->node); @@ -1229,6 +1245,7 @@ static int tcp_v4_parse_md5_keys(struct sock *sk, int optname, const union tcp_md5_addr *addr; u8 prefixlen = 32; int l3index = 0; + u8 flags; if (optlen < sizeof(cmd)) return -EINVAL; @@ -1239,6 +1256,8 @@ static int tcp_v4_parse_md5_keys(struct sock *sk, int optname, if (sin->sin_family != AF_INET) return -EINVAL; + flags = cmd.tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX; + if (optname == TCP_MD5SIG_EXT && cmd.tcpm_flags & TCP_MD5SIG_FLAG_PREFIX) { prefixlen = cmd.tcpm_prefixlen; @@ -1246,7 +1265,7 @@ static int tcp_v4_parse_md5_keys(struct sock *sk, int optname, return -EINVAL; } - if (optname == TCP_MD5SIG_EXT && + if (optname == TCP_MD5SIG_EXT && cmd.tcpm_ifindex && cmd.tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX) { struct net_device *dev; @@ -1267,12 +1286,12 @@ static int tcp_v4_parse_md5_keys(struct sock *sk, int optname, addr = (union tcp_md5_addr *)&sin->sin_addr.s_addr; if (!cmd.tcpm_keylen) - return tcp_md5_do_del(sk, addr, AF_INET, prefixlen, l3index); + return tcp_md5_do_del(sk, addr, AF_INET, prefixlen, l3index, flags); if (cmd.tcpm_keylen > TCP_MD5SIG_MAXKEYLEN) return -EINVAL; - return tcp_md5_do_add(sk, addr, AF_INET, prefixlen, l3index, + return tcp_md5_do_add(sk, addr, AF_INET, prefixlen, l3index, flags, cmd.tcpm_key, cmd.tcpm_keylen, GFP_KERNEL); } @@ -1596,7 +1615,7 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb, * memory, then we end up not copying the key * across. Shucks. */ - tcp_md5_do_add(newsk, addr, AF_INET, 32, l3index, + tcp_md5_do_add(newsk, addr, AF_INET, 32, l3index, key->flags, key->key, key->keylen, GFP_ATOMIC); sk_nocaps_add(newsk, NETIF_F_GSO_MASK); } diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 8536b2a7210b..2fffcf2b54f3 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2867,6 +2867,9 @@ __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait) !(sk->sk_shutdown & RCV_SHUTDOWN) && first_packet_length(sk) == -1) mask &= ~(EPOLLIN | EPOLLRDNORM); + /* psock ingress_msg queue should not contain any bad checksum frames */ + if (sk_is_readable(sk)) + mask |= EPOLLIN | EPOLLRDNORM; return mask; } diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c index 7a1d5f473878..bbe6569c9ad3 100644 --- a/net/ipv4/udp_bpf.c +++ b/net/ipv4/udp_bpf.c @@ -114,6 +114,7 @@ static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base) *prot = *base; prot->close = sock_map_close; prot->recvmsg = udp_bpf_recvmsg; + prot->sock_is_readable = sk_msg_is_readable; } static void udp_bpf_check_v6_needs_rebuild(struct proto *ops) diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 12f985f43bcc..2f044a49afa8 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -464,13 +464,14 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) int ip6_forward(struct sk_buff *skb) { - struct inet6_dev *idev = __in6_dev_get_safely(skb->dev); struct dst_entry *dst = skb_dst(skb); struct ipv6hdr *hdr = ipv6_hdr(skb); struct inet6_skb_parm *opt = IP6CB(skb); struct net *net = dev_net(dst->dev); + struct inet6_dev *idev; u32 mtu; + idev = __in6_dev_get_safely(dev_get_by_index_rcu(net, IP6CB(skb)->iif)); if (net->ipv6.devconf_all->forwarding == 0) goto error; diff --git a/net/ipv6/netfilter/ip6t_rt.c b/net/ipv6/netfilter/ip6t_rt.c index 733c83d38b30..4ad8b2032f1f 100644 --- a/net/ipv6/netfilter/ip6t_rt.c +++ b/net/ipv6/netfilter/ip6t_rt.c @@ -25,12 +25,7 @@ MODULE_AUTHOR("Andras Kis-Szabo <kisza@sch.bme.hu>"); static inline bool segsleft_match(u_int32_t min, u_int32_t max, u_int32_t id, bool invert) { - bool r; - pr_debug("segsleft_match:%c 0x%x <= 0x%x <= 0x%x\n", - invert ? '!' : ' ', min, id, max); - r = (id >= min && id <= max) ^ invert; - pr_debug(" result %s\n", r ? "PASS" : "FAILED"); - return r; + return (id >= min && id <= max) ^ invert; } static bool rt_mt6(const struct sk_buff *skb, struct xt_action_param *par) @@ -65,30 +60,6 @@ static bool rt_mt6(const struct sk_buff *skb, struct xt_action_param *par) return false; } - pr_debug("IPv6 RT LEN %u %u ", hdrlen, rh->hdrlen); - pr_debug("TYPE %04X ", rh->type); - pr_debug("SGS_LEFT %u %02X\n", rh->segments_left, rh->segments_left); - - pr_debug("IPv6 RT segsleft %02X ", - segsleft_match(rtinfo->segsleft[0], rtinfo->segsleft[1], - rh->segments_left, - !!(rtinfo->invflags & IP6T_RT_INV_SGS))); - pr_debug("type %02X %02X %02X ", - rtinfo->rt_type, rh->type, - (!(rtinfo->flags & IP6T_RT_TYP) || - ((rtinfo->rt_type == rh->type) ^ - !!(rtinfo->invflags & IP6T_RT_INV_TYP)))); - pr_debug("len %02X %04X %02X ", - rtinfo->hdrlen, hdrlen, - !(rtinfo->flags & IP6T_RT_LEN) || - ((rtinfo->hdrlen == hdrlen) ^ - !!(rtinfo->invflags & IP6T_RT_INV_LEN))); - pr_debug("res %02X %02X %02X ", - rtinfo->flags & IP6T_RT_RES, - ((const struct rt0_hdr *)rh)->reserved, - !((rtinfo->flags & IP6T_RT_RES) && - (((const struct rt0_hdr *)rh)->reserved))); - ret = (segsleft_match(rtinfo->segsleft[0], rtinfo->segsleft[1], rh->segments_left, !!(rtinfo->invflags & IP6T_RT_INV_SGS))) && @@ -107,22 +78,22 @@ static bool rt_mt6(const struct sk_buff *skb, struct xt_action_param *par) reserved), sizeof(_reserved), &_reserved); + if (!rp) { + par->hotdrop = true; + return false; + } ret = (*rp == 0); } - pr_debug("#%d ", rtinfo->addrnr); if (!(rtinfo->flags & IP6T_RT_FST)) { return ret; } else if (rtinfo->flags & IP6T_RT_FST_NSTRICT) { - pr_debug("Not strict "); if (rtinfo->addrnr > (unsigned int)((hdrlen - 8) / 16)) { - pr_debug("There isn't enough space\n"); return false; } else { unsigned int i = 0; - pr_debug("#%d ", rtinfo->addrnr); for (temp = 0; temp < (unsigned int)((hdrlen - 8) / 16); temp++) { @@ -138,26 +109,20 @@ static bool rt_mt6(const struct sk_buff *skb, struct xt_action_param *par) return false; } - if (ipv6_addr_equal(ap, &rtinfo->addrs[i])) { - pr_debug("i=%d temp=%d;\n", i, temp); + if (ipv6_addr_equal(ap, &rtinfo->addrs[i])) i++; - } if (i == rtinfo->addrnr) break; } - pr_debug("i=%d #%d\n", i, rtinfo->addrnr); if (i == rtinfo->addrnr) return ret; else return false; } } else { - pr_debug("Strict "); if (rtinfo->addrnr > (unsigned int)((hdrlen - 8) / 16)) { - pr_debug("There isn't enough space\n"); return false; } else { - pr_debug("#%d ", rtinfo->addrnr); for (temp = 0; temp < rtinfo->addrnr; temp++) { ap = skb_header_pointer(skb, ptr @@ -173,7 +138,6 @@ static bool rt_mt6(const struct sk_buff *skb, struct xt_action_param *par) if (!ipv6_addr_equal(ap, &rtinfo->addrs[temp])) break; } - pr_debug("temp=%d #%d\n", temp, rtinfo->addrnr); if (temp == rtinfo->addrnr && temp == (unsigned int)((hdrlen - 8) / 16)) return ret; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 0ce52d46e4f8..b03dd02c9f13 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -599,6 +599,7 @@ static int tcp_v6_parse_md5_keys(struct sock *sk, int optname, struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)&cmd.tcpm_addr; int l3index = 0; u8 prefixlen; + u8 flags; if (optlen < sizeof(cmd)) return -EINVAL; @@ -609,6 +610,8 @@ static int tcp_v6_parse_md5_keys(struct sock *sk, int optname, if (sin6->sin6_family != AF_INET6) return -EINVAL; + flags = cmd.tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX; + if (optname == TCP_MD5SIG_EXT && cmd.tcpm_flags & TCP_MD5SIG_FLAG_PREFIX) { prefixlen = cmd.tcpm_prefixlen; @@ -619,7 +622,7 @@ static int tcp_v6_parse_md5_keys(struct sock *sk, int optname, prefixlen = ipv6_addr_v4mapped(&sin6->sin6_addr) ? 32 : 128; } - if (optname == TCP_MD5SIG_EXT && + if (optname == TCP_MD5SIG_EXT && cmd.tcpm_ifindex && cmd.tcpm_flags & TCP_MD5SIG_FLAG_IFINDEX) { struct net_device *dev; @@ -640,9 +643,9 @@ static int tcp_v6_parse_md5_keys(struct sock *sk, int optname, if (ipv6_addr_v4mapped(&sin6->sin6_addr)) return tcp_md5_do_del(sk, (union tcp_md5_addr *)&sin6->sin6_addr.s6_addr32[3], AF_INET, prefixlen, - l3index); + l3index, flags); return tcp_md5_do_del(sk, (union tcp_md5_addr *)&sin6->sin6_addr, - AF_INET6, prefixlen, l3index); + AF_INET6, prefixlen, l3index, flags); } if (cmd.tcpm_keylen > TCP_MD5SIG_MAXKEYLEN) @@ -650,12 +653,12 @@ static int tcp_v6_parse_md5_keys(struct sock *sk, int optname, if (ipv6_addr_v4mapped(&sin6->sin6_addr)) return tcp_md5_do_add(sk, (union tcp_md5_addr *)&sin6->sin6_addr.s6_addr32[3], - AF_INET, prefixlen, l3index, + AF_INET, prefixlen, l3index, flags, cmd.tcpm_key, cmd.tcpm_keylen, GFP_KERNEL); return tcp_md5_do_add(sk, (union tcp_md5_addr *)&sin6->sin6_addr, - AF_INET6, prefixlen, l3index, + AF_INET6, prefixlen, l3index, flags, cmd.tcpm_key, cmd.tcpm_keylen, GFP_KERNEL); } @@ -1404,7 +1407,7 @@ static struct sock *tcp_v6_syn_recv_sock(const struct sock *sk, struct sk_buff * * across. Shucks. */ tcp_md5_do_add(newsk, (union tcp_md5_addr *)&newsk->sk_v6_daddr, - AF_INET6, 128, l3index, key->key, key->keylen, + AF_INET6, 128, l3index, key->flags, key->key, key->keylen, sk_gfp_mask(sk, GFP_ATOMIC)); } #endif diff --git a/net/mac80211/mesh.c b/net/mac80211/mesh.c index 97095b7c9c64..5dcfd53a4ab6 100644 --- a/net/mac80211/mesh.c +++ b/net/mac80211/mesh.c @@ -672,7 +672,7 @@ ieee80211_mesh_update_bss_params(struct ieee80211_sub_if_data *sdata, u8 *ie, u8 ie_len) { struct ieee80211_supported_band *sband; - const u8 *cap; + const struct element *cap; const struct ieee80211_he_operation *he_oper = NULL; sband = ieee80211_get_sband(sdata); @@ -687,9 +687,10 @@ ieee80211_mesh_update_bss_params(struct ieee80211_sub_if_data *sdata, sdata->vif.bss_conf.he_support = true; - cap = cfg80211_find_ext_ie(WLAN_EID_EXT_HE_OPERATION, ie, ie_len); - if (cap && cap[1] >= ieee80211_he_oper_size(&cap[3])) - he_oper = (void *)(cap + 3); + cap = cfg80211_find_ext_elem(WLAN_EID_EXT_HE_OPERATION, ie, ie_len); + if (cap && cap->datalen >= 1 + sizeof(*he_oper) && + cap->datalen >= 1 + ieee80211_he_oper_size(cap->data + 1)) + he_oper = (void *)(cap->data + 1); if (he_oper) sdata->vif.bss_conf.he_oper.params = diff --git a/net/mptcp/options.c b/net/mptcp/options.c index c41273cefc51..f0f22eb4fd5f 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -485,11 +485,11 @@ static bool mptcp_established_options_mp(struct sock *sk, struct sk_buff *skb, mpext = mptcp_get_ext(skb); data_len = mpext ? mpext->data_len : 0; - /* we will check ext_copy.data_len in mptcp_write_options() to + /* we will check ops->data_len in mptcp_write_options() to * discriminate between TCPOLEN_MPTCP_MPC_ACK_DATA and * TCPOLEN_MPTCP_MPC_ACK */ - opts->ext_copy.data_len = data_len; + opts->data_len = data_len; opts->suboptions = OPTION_MPTCP_MPC_ACK; opts->sndr_key = subflow->local_key; opts->rcvr_key = subflow->remote_key; @@ -505,9 +505,9 @@ static bool mptcp_established_options_mp(struct sock *sk, struct sk_buff *skb, len = TCPOLEN_MPTCP_MPC_ACK_DATA; if (opts->csum_reqd) { /* we need to propagate more info to csum the pseudo hdr */ - opts->ext_copy.data_seq = mpext->data_seq; - opts->ext_copy.subflow_seq = mpext->subflow_seq; - opts->ext_copy.csum = mpext->csum; + opts->data_seq = mpext->data_seq; + opts->subflow_seq = mpext->subflow_seq; + opts->csum = mpext->csum; len += TCPOLEN_MPTCP_DSS_CHECKSUM; } *size = ALIGN(len, 4); @@ -1227,7 +1227,7 @@ static void mptcp_set_rwin(const struct tcp_sock *tp) WRITE_ONCE(msk->rcv_wnd_sent, ack_seq); } -static u16 mptcp_make_csum(const struct mptcp_ext *mpext) +static u16 __mptcp_make_csum(u64 data_seq, u32 subflow_seq, u16 data_len, __sum16 sum) { struct csum_pseudo_header header; __wsum csum; @@ -1237,15 +1237,21 @@ static u16 mptcp_make_csum(const struct mptcp_ext *mpext) * always the 64-bit value, irrespective of what length is used in the * DSS option itself. */ - header.data_seq = cpu_to_be64(mpext->data_seq); - header.subflow_seq = htonl(mpext->subflow_seq); - header.data_len = htons(mpext->data_len); + header.data_seq = cpu_to_be64(data_seq); + header.subflow_seq = htonl(subflow_seq); + header.data_len = htons(data_len); header.csum = 0; - csum = csum_partial(&header, sizeof(header), ~csum_unfold(mpext->csum)); + csum = csum_partial(&header, sizeof(header), ~csum_unfold(sum)); return (__force u16)csum_fold(csum); } +static u16 mptcp_make_csum(const struct mptcp_ext *mpext) +{ + return __mptcp_make_csum(mpext->data_seq, mpext->subflow_seq, mpext->data_len, + mpext->csum); +} + void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp, struct mptcp_out_options *opts) { @@ -1337,7 +1343,7 @@ void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp, len = TCPOLEN_MPTCP_MPC_SYN; } else if (OPTION_MPTCP_MPC_SYNACK & opts->suboptions) { len = TCPOLEN_MPTCP_MPC_SYNACK; - } else if (opts->ext_copy.data_len) { + } else if (opts->data_len) { len = TCPOLEN_MPTCP_MPC_ACK_DATA; if (opts->csum_reqd) len += TCPOLEN_MPTCP_DSS_CHECKSUM; @@ -1366,14 +1372,17 @@ void mptcp_write_options(__be32 *ptr, const struct tcp_sock *tp, put_unaligned_be64(opts->rcvr_key, ptr); ptr += 2; - if (!opts->ext_copy.data_len) + if (!opts->data_len) goto mp_capable_done; if (opts->csum_reqd) { - put_unaligned_be32(opts->ext_copy.data_len << 16 | - mptcp_make_csum(&opts->ext_copy), ptr); + put_unaligned_be32(opts->data_len << 16 | + __mptcp_make_csum(opts->data_seq, + opts->subflow_seq, + opts->data_len, + opts->csum), ptr); } else { - put_unaligned_be32(opts->ext_copy.data_len << 16 | + put_unaligned_be32(opts->data_len << 16 | TCPOPT_NOP << 8 | TCPOPT_NOP, ptr); } ptr += 1; diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 54395266339d..92a747896f80 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -109,7 +109,7 @@ config NF_CONNTRACK_MARK config NF_CONNTRACK_SECMARK bool 'Connection tracking security mark support' depends on NETWORK_SECMARK - default m if NETFILTER_ADVANCED=n + default y if NETFILTER_ADVANCED=n help This option enables security markings to be applied to connections. Typically they are copied to connections from diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c index c25097092a06..29ec3ef63edc 100644 --- a/net/netfilter/ipvs/ip_vs_ctl.c +++ b/net/netfilter/ipvs/ip_vs_ctl.c @@ -4090,6 +4090,11 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs) tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode; tbl[idx++].data = &ipvs->sysctl_schedule_icmp; tbl[idx++].data = &ipvs->sysctl_ignore_tunneled; +#ifdef CONFIG_IP_VS_DEBUG + /* Global sysctls must be ro in non-init netns */ + if (!net_eq(net, &init_net)) + tbl[idx++].mode = 0444; +#endif ipvs->sysctl_hdr = register_net_sysctl(net, "net/ipv4/vs", tbl); if (ipvs->sysctl_hdr == NULL) { diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c index 5b02408a920b..3ced0eb6b7c3 100644 --- a/net/netfilter/nft_chain_filter.c +++ b/net/netfilter/nft_chain_filter.c @@ -342,12 +342,6 @@ static void nft_netdev_event(unsigned long event, struct net_device *dev, return; } - /* UNREGISTER events are also happening on netns exit. - * - * Although nf_tables core releases all tables/chains, only this event - * handler provides guarantee that hook->ops.dev is still accessible, - * so we cannot skip exiting net namespaces. - */ __nft_release_basechain(ctx); } @@ -366,6 +360,9 @@ static int nf_tables_netdev_event(struct notifier_block *this, event != NETDEV_CHANGENAME) return NOTIFY_DONE; + if (!check_net(ctx.net)) + return NOTIFY_DONE; + nft_net = nft_pernet(ctx.net); mutex_lock(&nft_net->commit_mutex); list_for_each_entry(table, &nft_net->tables, list) { diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c index 7b2f359bfce4..2f7cf5ecebf4 100644 --- a/net/netfilter/xt_IDLETIMER.c +++ b/net/netfilter/xt_IDLETIMER.c @@ -137,7 +137,7 @@ static int idletimer_tg_create(struct idletimer_tg_info *info) { int ret; - info->timer = kmalloc(sizeof(*info->timer), GFP_KERNEL); + info->timer = kzalloc(sizeof(*info->timer), GFP_KERNEL); if (!info->timer) { ret = -ENOMEM; goto out; diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c index ad9df0cb4b98..90866ae45573 100644 --- a/net/sched/act_ct.c +++ b/net/sched/act_ct.c @@ -960,6 +960,7 @@ static int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, tmpl = p->tmpl; tcf_lastuse_update(&c->tcf_tm); + tcf_action_update_bstats(&c->common, skb); if (clear) { qdisc_skb_cb(skb)->post_ct = false; @@ -1049,7 +1050,6 @@ out_push: qdisc_skb_cb(skb)->post_ct = true; out_clear: - tcf_action_update_bstats(&c->common, skb); if (defrag) qdisc_skb_cb(skb)->pkt_len = skb->len; return retval; diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c index 32df65f68c12..fb3da4d8f4a3 100644 --- a/net/sctp/sm_statefuns.c +++ b/net/sctp/sm_statefuns.c @@ -156,6 +156,12 @@ static enum sctp_disposition __sctp_sf_do_9_1_abort( void *arg, struct sctp_cmd_seq *commands); +static enum sctp_disposition +__sctp_sf_do_9_2_reshutack(struct net *net, const struct sctp_endpoint *ep, + const struct sctp_association *asoc, + const union sctp_subtype type, void *arg, + struct sctp_cmd_seq *commands); + /* Small helper function that checks if the chunk length * is of the appropriate length. The 'required_length' argument * is set to be the size of a specific chunk we are testing. @@ -337,6 +343,14 @@ enum sctp_disposition sctp_sf_do_5_1B_init(struct net *net, if (!chunk->singleton) return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + /* Make sure that the INIT chunk has a valid length. + * Normally, this would cause an ABORT with a Protocol Violation + * error, but since we don't have an association, we'll + * just discard the packet. + */ + if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_init_chunk))) + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + /* If the packet is an OOTB packet which is temporarily on the * control endpoint, respond with an ABORT. */ @@ -351,14 +365,6 @@ enum sctp_disposition sctp_sf_do_5_1B_init(struct net *net, if (chunk->sctp_hdr->vtag != 0) return sctp_sf_tabort_8_4_8(net, ep, asoc, type, arg, commands); - /* Make sure that the INIT chunk has a valid length. - * Normally, this would cause an ABORT with a Protocol Violation - * error, but since we don't have an association, we'll - * just discard the packet. - */ - if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_init_chunk))) - return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); - /* If the INIT is coming toward a closing socket, we'll send back * and ABORT. Essentially, this catches the race of INIT being * backloged to the socket at the same time as the user issues close(). @@ -704,6 +710,9 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net, struct sock *sk; int error = 0; + if (asoc && !sctp_vtag_verify(chunk, asoc)) + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + /* If the packet is an OOTB packet which is temporarily on the * control endpoint, respond with an ABORT. */ @@ -718,7 +727,8 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net, * in sctp_unpack_cookie(). */ if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_chunkhdr))) - return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, + commands); /* If the endpoint is not listening or if the number of associations * on the TCP-style socket exceed the max backlog, respond with an @@ -1524,20 +1534,16 @@ static enum sctp_disposition sctp_sf_do_unexpected_init( if (!chunk->singleton) return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + /* Make sure that the INIT chunk has a valid length. */ + if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_init_chunk))) + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + /* 3.1 A packet containing an INIT chunk MUST have a zero Verification * Tag. */ if (chunk->sctp_hdr->vtag != 0) return sctp_sf_tabort_8_4_8(net, ep, asoc, type, arg, commands); - /* Make sure that the INIT chunk has a valid length. - * In this case, we generate a protocol violation since we have - * an association established. - */ - if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_init_chunk))) - return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, - commands); - if (SCTP_INPUT_CB(chunk->skb)->encap_port != chunk->transport->encap_port) return sctp_sf_new_encap_port(net, ep, asoc, type, arg, commands); @@ -1882,9 +1888,9 @@ static enum sctp_disposition sctp_sf_do_dupcook_a( * its peer. */ if (sctp_state(asoc, SHUTDOWN_ACK_SENT)) { - disposition = sctp_sf_do_9_2_reshutack(net, ep, asoc, - SCTP_ST_CHUNK(chunk->chunk_hdr->type), - chunk, commands); + disposition = __sctp_sf_do_9_2_reshutack(net, ep, asoc, + SCTP_ST_CHUNK(chunk->chunk_hdr->type), + chunk, commands); if (SCTP_DISPOSITION_NOMEM == disposition) goto nomem; @@ -2202,9 +2208,11 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook( * enough for the chunk header. Cookie length verification is * done later. */ - if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_chunkhdr))) - return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, - commands); + if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_chunkhdr))) { + if (!sctp_vtag_verify(chunk, asoc)) + asoc = NULL; + return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, commands); + } /* "Decode" the chunk. We have no optional parameters so we * are in good shape. @@ -2341,7 +2349,7 @@ enum sctp_disposition sctp_sf_shutdown_pending_abort( */ if (SCTP_ADDR_DEL == sctp_bind_addr_state(&asoc->base.bind_addr, &chunk->dest)) - return sctp_sf_discard_chunk(net, ep, asoc, type, arg, commands); + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); if (!sctp_err_chunk_valid(chunk)) return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); @@ -2387,7 +2395,7 @@ enum sctp_disposition sctp_sf_shutdown_sent_abort( */ if (SCTP_ADDR_DEL == sctp_bind_addr_state(&asoc->base.bind_addr, &chunk->dest)) - return sctp_sf_discard_chunk(net, ep, asoc, type, arg, commands); + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); if (!sctp_err_chunk_valid(chunk)) return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); @@ -2657,7 +2665,7 @@ enum sctp_disposition sctp_sf_do_9_1_abort( */ if (SCTP_ADDR_DEL == sctp_bind_addr_state(&asoc->base.bind_addr, &chunk->dest)) - return sctp_sf_discard_chunk(net, ep, asoc, type, arg, commands); + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); if (!sctp_err_chunk_valid(chunk)) return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); @@ -2970,13 +2978,11 @@ enum sctp_disposition sctp_sf_do_9_2_shut_ctsn( * that belong to this association, it should discard the INIT chunk and * retransmit the SHUTDOWN ACK chunk. */ -enum sctp_disposition sctp_sf_do_9_2_reshutack( - struct net *net, - const struct sctp_endpoint *ep, - const struct sctp_association *asoc, - const union sctp_subtype type, - void *arg, - struct sctp_cmd_seq *commands) +static enum sctp_disposition +__sctp_sf_do_9_2_reshutack(struct net *net, const struct sctp_endpoint *ep, + const struct sctp_association *asoc, + const union sctp_subtype type, void *arg, + struct sctp_cmd_seq *commands) { struct sctp_chunk *chunk = arg; struct sctp_chunk *reply; @@ -3010,6 +3016,26 @@ nomem: return SCTP_DISPOSITION_NOMEM; } +enum sctp_disposition +sctp_sf_do_9_2_reshutack(struct net *net, const struct sctp_endpoint *ep, + const struct sctp_association *asoc, + const union sctp_subtype type, void *arg, + struct sctp_cmd_seq *commands) +{ + struct sctp_chunk *chunk = arg; + + if (!chunk->singleton) + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + + if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_init_chunk))) + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + + if (chunk->sctp_hdr->vtag != 0) + return sctp_sf_tabort_8_4_8(net, ep, asoc, type, arg, commands); + + return __sctp_sf_do_9_2_reshutack(net, ep, asoc, type, arg, commands); +} + /* * sctp_sf_do_ecn_cwr * @@ -3662,6 +3688,9 @@ enum sctp_disposition sctp_sf_ootb(struct net *net, SCTP_INC_STATS(net, SCTP_MIB_OUTOFBLUES); + if (asoc && !sctp_vtag_verify(chunk, asoc)) + asoc = NULL; + ch = (struct sctp_chunkhdr *)chunk->chunk_hdr; do { /* Report violation if the chunk is less then minimal */ @@ -3777,12 +3806,6 @@ static enum sctp_disposition sctp_sf_shut_8_4_5( SCTP_INC_STATS(net, SCTP_MIB_OUTCTRLCHUNKS); - /* If the chunk length is invalid, we don't want to process - * the reset of the packet. - */ - if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_chunkhdr))) - return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); - /* We need to discard the rest of the packet to prevent * potential boomming attacks from additional bundled chunks. * This is documented in SCTP Threats ID. @@ -3810,6 +3833,9 @@ enum sctp_disposition sctp_sf_do_8_5_1_E_sa(struct net *net, { struct sctp_chunk *chunk = arg; + if (!sctp_vtag_verify(chunk, asoc)) + asoc = NULL; + /* Make sure that the SHUTDOWN_ACK chunk has a valid length. */ if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_chunkhdr))) return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, @@ -3845,6 +3871,11 @@ enum sctp_disposition sctp_sf_do_asconf(struct net *net, return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); } + /* Make sure that the ASCONF ADDIP chunk has a valid length. */ + if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_addip_chunk))) + return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, + commands); + /* ADD-IP: Section 4.1.1 * This chunk MUST be sent in an authenticated way by using * the mechanism defined in [I-D.ietf-tsvwg-sctp-auth]. If this chunk @@ -3853,13 +3884,7 @@ enum sctp_disposition sctp_sf_do_asconf(struct net *net, */ if (!asoc->peer.asconf_capable || (!net->sctp.addip_noauth && !chunk->auth)) - return sctp_sf_discard_chunk(net, ep, asoc, type, arg, - commands); - - /* Make sure that the ASCONF ADDIP chunk has a valid length. */ - if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_addip_chunk))) - return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, - commands); + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); hdr = (struct sctp_addiphdr *)chunk->skb->data; serial = ntohl(hdr->serial); @@ -3988,6 +4013,12 @@ enum sctp_disposition sctp_sf_do_asconf_ack(struct net *net, return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); } + /* Make sure that the ADDIP chunk has a valid length. */ + if (!sctp_chunk_length_valid(asconf_ack, + sizeof(struct sctp_addip_chunk))) + return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, + commands); + /* ADD-IP, Section 4.1.2: * This chunk MUST be sent in an authenticated way by using * the mechanism defined in [I-D.ietf-tsvwg-sctp-auth]. If this chunk @@ -3996,14 +4027,7 @@ enum sctp_disposition sctp_sf_do_asconf_ack(struct net *net, */ if (!asoc->peer.asconf_capable || (!net->sctp.addip_noauth && !asconf_ack->auth)) - return sctp_sf_discard_chunk(net, ep, asoc, type, arg, - commands); - - /* Make sure that the ADDIP chunk has a valid length. */ - if (!sctp_chunk_length_valid(asconf_ack, - sizeof(struct sctp_addip_chunk))) - return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, - commands); + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); addip_hdr = (struct sctp_addiphdr *)asconf_ack->skb->data; rcvd_serial = ntohl(addip_hdr->serial); @@ -4575,6 +4599,9 @@ enum sctp_disposition sctp_sf_discard_chunk(struct net *net, { struct sctp_chunk *chunk = arg; + if (asoc && !sctp_vtag_verify(chunk, asoc)) + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + /* Make sure that the chunk has a valid length. * Since we don't know the chunk type, we use a general * chunkhdr structure to make a comparison. @@ -4642,6 +4669,9 @@ enum sctp_disposition sctp_sf_violation(struct net *net, { struct sctp_chunk *chunk = arg; + if (!sctp_vtag_verify(chunk, asoc)) + return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands); + /* Make sure that the chunk has a valid length. */ if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_chunkhdr))) return sctp_sf_violation_chunklen(net, ep, asoc, type, arg, @@ -6348,6 +6378,7 @@ static struct sctp_packet *sctp_ootb_pkt_new( * yet. */ switch (chunk->chunk_hdr->type) { + case SCTP_CID_INIT: case SCTP_CID_INIT_ACK: { struct sctp_initack_chunk *initack; diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c index c038efc23ce3..78b663dbfa1f 100644 --- a/net/smc/af_smc.c +++ b/net/smc/af_smc.c @@ -1057,7 +1057,7 @@ static void smc_connect_work(struct work_struct *work) if (smc->clcsock->sk->sk_err) { smc->sk.sk_err = smc->clcsock->sk->sk_err; } else if ((1 << smc->clcsock->sk->sk_state) & - (TCPF_SYN_SENT | TCP_SYN_RECV)) { + (TCPF_SYN_SENT | TCPF_SYN_RECV)) { rc = sk_stream_wait_connect(smc->clcsock->sk, &timeo); if ((rc == -EPIPE) && ((1 << smc->clcsock->sk->sk_state) & diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c index 72f4b72eb175..f1d323439a2a 100644 --- a/net/smc/smc_llc.c +++ b/net/smc/smc_llc.c @@ -1822,7 +1822,7 @@ void smc_llc_link_active(struct smc_link *link) link->smcibdev->ibdev->name, link->ibport); link->state = SMC_LNK_ACTIVE; if (link->lgr->llc_testlink_time) { - link->llc_testlink_time = link->lgr->llc_testlink_time * HZ; + link->llc_testlink_time = link->lgr->llc_testlink_time; schedule_delayed_work(&link->llc_testlink_wrk, link->llc_testlink_time); } diff --git a/net/tipc/crypto.c b/net/tipc/crypto.c index c9391d38de85..dc60c32bb70d 100644 --- a/net/tipc/crypto.c +++ b/net/tipc/crypto.c @@ -2285,43 +2285,53 @@ static bool tipc_crypto_key_rcv(struct tipc_crypto *rx, struct tipc_msg *hdr) u16 key_gen = msg_key_gen(hdr); u16 size = msg_data_sz(hdr); u8 *data = msg_data(hdr); + unsigned int keylen; + + /* Verify whether the size can exist in the packet */ + if (unlikely(size < sizeof(struct tipc_aead_key) + TIPC_AEAD_KEYLEN_MIN)) { + pr_debug("%s: message data size is too small\n", rx->name); + goto exit; + } + + keylen = ntohl(*((__be32 *)(data + TIPC_AEAD_ALG_NAME))); + + /* Verify the supplied size values */ + if (unlikely(size != keylen + sizeof(struct tipc_aead_key) || + keylen > TIPC_AEAD_KEY_SIZE_MAX)) { + pr_debug("%s: invalid MSG_CRYPTO key size\n", rx->name); + goto exit; + } spin_lock(&rx->lock); if (unlikely(rx->skey || (key_gen == rx->key_gen && rx->key.keys))) { pr_err("%s: key existed <%p>, gen %d vs %d\n", rx->name, rx->skey, key_gen, rx->key_gen); - goto exit; + goto exit_unlock; } /* Allocate memory for the key */ skey = kmalloc(size, GFP_ATOMIC); if (unlikely(!skey)) { pr_err("%s: unable to allocate memory for skey\n", rx->name); - goto exit; + goto exit_unlock; } /* Copy key from msg data */ - skey->keylen = ntohl(*((__be32 *)(data + TIPC_AEAD_ALG_NAME))); + skey->keylen = keylen; memcpy(skey->alg_name, data, TIPC_AEAD_ALG_NAME); memcpy(skey->key, data + TIPC_AEAD_ALG_NAME + sizeof(__be32), skey->keylen); - /* Sanity check */ - if (unlikely(size != tipc_aead_key_size(skey))) { - kfree(skey); - skey = NULL; - goto exit; - } - rx->key_gen = key_gen; rx->skey_mode = msg_key_mode(hdr); rx->skey = skey; rx->nokey = 0; mb(); /* for nokey flag */ -exit: +exit_unlock: spin_unlock(&rx->lock); +exit: /* Schedule the key attaching on this crypto */ if (likely(skey && queue_delayed_work(tx->wq, &rx->work, 0))) return true; diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index fde56ff49163..9ab81db8a654 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -681,12 +681,12 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG], prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE]; prot[TLS_BASE][TLS_SW].recvmsg = tls_sw_recvmsg; - prot[TLS_BASE][TLS_SW].stream_memory_read = tls_sw_stream_read; + prot[TLS_BASE][TLS_SW].sock_is_readable = tls_sw_sock_is_readable; prot[TLS_BASE][TLS_SW].close = tls_sk_proto_close; prot[TLS_SW][TLS_SW] = prot[TLS_SW][TLS_BASE]; prot[TLS_SW][TLS_SW].recvmsg = tls_sw_recvmsg; - prot[TLS_SW][TLS_SW].stream_memory_read = tls_sw_stream_read; + prot[TLS_SW][TLS_SW].sock_is_readable = tls_sw_sock_is_readable; prot[TLS_SW][TLS_SW].close = tls_sk_proto_close; #ifdef CONFIG_TLS_DEVICE diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 4feb95e34b64..1b08b877a890 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -35,6 +35,7 @@ * SOFTWARE. */ +#include <linux/bug.h> #include <linux/sched/signal.h> #include <linux/module.h> #include <linux/splice.h> @@ -43,6 +44,14 @@ #include <net/strparser.h> #include <net/tls.h> +noinline void tls_err_abort(struct sock *sk, int err) +{ + WARN_ON_ONCE(err >= 0); + /* sk->sk_err should contain a positive error code. */ + sk->sk_err = -err; + sk_error_report(sk); +} + static int __skb_nsg(struct sk_buff *skb, int offset, int len, unsigned int recursion_level) { @@ -419,7 +428,7 @@ int tls_tx_records(struct sock *sk, int flags) tx_err: if (rc < 0 && rc != -EAGAIN) - tls_err_abort(sk, EBADMSG); + tls_err_abort(sk, -EBADMSG); return rc; } @@ -450,7 +459,7 @@ static void tls_encrypt_done(struct crypto_async_request *req, int err) /* If err is already set on socket, return the same code */ if (sk->sk_err) { - ctx->async_wait.err = sk->sk_err; + ctx->async_wait.err = -sk->sk_err; } else { ctx->async_wait.err = err; tls_err_abort(sk, err); @@ -763,7 +772,7 @@ static int tls_push_record(struct sock *sk, int flags, msg_pl->sg.size + prot->tail_size, i); if (rc < 0) { if (rc != -EINPROGRESS) { - tls_err_abort(sk, EBADMSG); + tls_err_abort(sk, -EBADMSG); if (split) { tls_ctx->pending_open_record_frags = true; tls_merge_open_record(sk, rec, tmp, orig_end); @@ -1827,7 +1836,7 @@ int tls_sw_recvmsg(struct sock *sk, err = decrypt_skb_update(sk, skb, &msg->msg_iter, &chunk, &zc, async_capable); if (err < 0 && err != -EINPROGRESS) { - tls_err_abort(sk, EBADMSG); + tls_err_abort(sk, -EBADMSG); goto recv_end; } @@ -2007,7 +2016,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, } if (err < 0) { - tls_err_abort(sk, EBADMSG); + tls_err_abort(sk, -EBADMSG); goto splice_read_end; } ctx->decrypted = 1; @@ -2026,7 +2035,7 @@ splice_read_end: return copied ? : err; } -bool tls_sw_stream_read(const struct sock *sk) +bool tls_sw_sock_is_readable(struct sock *sk) { struct tls_context *tls_ctx = tls_get_ctx(sk); struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 89f9e85ae970..78e08e82c08c 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -3052,6 +3052,8 @@ static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table *wa /* readable? */ if (!skb_queue_empty_lockless(&sk->sk_receive_queue)) mask |= EPOLLIN | EPOLLRDNORM; + if (sk_is_readable(sk)) + mask |= EPOLLIN | EPOLLRDNORM; /* Connection-based need to check for termination and startup */ if ((sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) && @@ -3091,6 +3093,8 @@ static __poll_t unix_dgram_poll(struct file *file, struct socket *sock, /* readable? */ if (!skb_queue_empty_lockless(&sk->sk_receive_queue)) mask |= EPOLLIN | EPOLLRDNORM; + if (sk_is_readable(sk)) + mask |= EPOLLIN | EPOLLRDNORM; /* Connection-based need to check for termination and startup */ if (sk->sk_type == SOCK_SEQPACKET) { diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c index b927e2baae50..452376c6f419 100644 --- a/net/unix/unix_bpf.c +++ b/net/unix/unix_bpf.c @@ -102,6 +102,7 @@ static void unix_dgram_bpf_rebuild_protos(struct proto *prot, const struct proto *prot = *base; prot->close = sock_map_close; prot->recvmsg = unix_bpf_recvmsg; + prot->sock_is_readable = sk_msg_is_readable; } static void unix_stream_bpf_rebuild_protos(struct proto *prot, @@ -110,6 +111,7 @@ static void unix_stream_bpf_rebuild_protos(struct proto *prot, *prot = *base; prot->close = sock_map_close; prot->recvmsg = unix_bpf_recvmsg; + prot->sock_is_readable = sk_msg_is_readable; prot->unhash = sock_map_unhash; } diff --git a/net/wireless/core.c b/net/wireless/core.c index 03323121ca50..aaba847d79eb 100644 --- a/net/wireless/core.c +++ b/net/wireless/core.c @@ -524,6 +524,7 @@ use_default_name: INIT_WORK(&rdev->propagate_cac_done_wk, cfg80211_propagate_cac_done_wk); INIT_WORK(&rdev->mgmt_registrations_update_wk, cfg80211_mgmt_registrations_update_wk); + spin_lock_init(&rdev->mgmt_registrations_lock); #ifdef CONFIG_CFG80211_DEFAULT_PS rdev->wiphy.flags |= WIPHY_FLAG_PS_ON_BY_DEFAULT; @@ -1279,7 +1280,6 @@ void cfg80211_init_wdev(struct wireless_dev *wdev) INIT_LIST_HEAD(&wdev->event_list); spin_lock_init(&wdev->event_lock); INIT_LIST_HEAD(&wdev->mgmt_registrations); - spin_lock_init(&wdev->mgmt_registrations_lock); INIT_LIST_HEAD(&wdev->pmsr_list); spin_lock_init(&wdev->pmsr_lock); INIT_WORK(&wdev->pmsr_free_wk, cfg80211_pmsr_free_wk); diff --git a/net/wireless/core.h b/net/wireless/core.h index b35d0db12f1d..1720abf36f92 100644 --- a/net/wireless/core.h +++ b/net/wireless/core.h @@ -100,6 +100,8 @@ struct cfg80211_registered_device { struct work_struct propagate_cac_done_wk; struct work_struct mgmt_registrations_update_wk; + /* lock for all wdev lists */ + spinlock_t mgmt_registrations_lock; /* must be last because of the way we do wiphy_priv(), * and it should at least be aligned to NETDEV_ALIGN */ diff --git a/net/wireless/mlme.c b/net/wireless/mlme.c index 3aa69b375a10..783acd2c4211 100644 --- a/net/wireless/mlme.c +++ b/net/wireless/mlme.c @@ -452,9 +452,9 @@ static void cfg80211_mgmt_registrations_update(struct wireless_dev *wdev) lockdep_assert_held(&rdev->wiphy.mtx); - spin_lock_bh(&wdev->mgmt_registrations_lock); + spin_lock_bh(&rdev->mgmt_registrations_lock); if (!wdev->mgmt_registrations_need_update) { - spin_unlock_bh(&wdev->mgmt_registrations_lock); + spin_unlock_bh(&rdev->mgmt_registrations_lock); return; } @@ -479,7 +479,7 @@ static void cfg80211_mgmt_registrations_update(struct wireless_dev *wdev) rcu_read_unlock(); wdev->mgmt_registrations_need_update = 0; - spin_unlock_bh(&wdev->mgmt_registrations_lock); + spin_unlock_bh(&rdev->mgmt_registrations_lock); rdev_update_mgmt_frame_registrations(rdev, wdev, &upd); } @@ -503,6 +503,7 @@ int cfg80211_mlme_register_mgmt(struct wireless_dev *wdev, u32 snd_portid, int match_len, bool multicast_rx, struct netlink_ext_ack *extack) { + struct cfg80211_registered_device *rdev = wiphy_to_rdev(wdev->wiphy); struct cfg80211_mgmt_registration *reg, *nreg; int err = 0; u16 mgmt_type; @@ -548,7 +549,7 @@ int cfg80211_mlme_register_mgmt(struct wireless_dev *wdev, u32 snd_portid, if (!nreg) return -ENOMEM; - spin_lock_bh(&wdev->mgmt_registrations_lock); + spin_lock_bh(&rdev->mgmt_registrations_lock); list_for_each_entry(reg, &wdev->mgmt_registrations, list) { int mlen = min(match_len, reg->match_len); @@ -583,7 +584,7 @@ int cfg80211_mlme_register_mgmt(struct wireless_dev *wdev, u32 snd_portid, list_add(&nreg->list, &wdev->mgmt_registrations); } wdev->mgmt_registrations_need_update = 1; - spin_unlock_bh(&wdev->mgmt_registrations_lock); + spin_unlock_bh(&rdev->mgmt_registrations_lock); cfg80211_mgmt_registrations_update(wdev); @@ -591,7 +592,7 @@ int cfg80211_mlme_register_mgmt(struct wireless_dev *wdev, u32 snd_portid, out: kfree(nreg); - spin_unlock_bh(&wdev->mgmt_registrations_lock); + spin_unlock_bh(&rdev->mgmt_registrations_lock); return err; } @@ -602,7 +603,7 @@ void cfg80211_mlme_unregister_socket(struct wireless_dev *wdev, u32 nlportid) struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy); struct cfg80211_mgmt_registration *reg, *tmp; - spin_lock_bh(&wdev->mgmt_registrations_lock); + spin_lock_bh(&rdev->mgmt_registrations_lock); list_for_each_entry_safe(reg, tmp, &wdev->mgmt_registrations, list) { if (reg->nlportid != nlportid) @@ -615,7 +616,7 @@ void cfg80211_mlme_unregister_socket(struct wireless_dev *wdev, u32 nlportid) schedule_work(&rdev->mgmt_registrations_update_wk); } - spin_unlock_bh(&wdev->mgmt_registrations_lock); + spin_unlock_bh(&rdev->mgmt_registrations_lock); if (nlportid && rdev->crit_proto_nlportid == nlportid) { rdev->crit_proto_nlportid = 0; @@ -628,15 +629,16 @@ void cfg80211_mlme_unregister_socket(struct wireless_dev *wdev, u32 nlportid) void cfg80211_mlme_purge_registrations(struct wireless_dev *wdev) { + struct cfg80211_registered_device *rdev = wiphy_to_rdev(wdev->wiphy); struct cfg80211_mgmt_registration *reg, *tmp; - spin_lock_bh(&wdev->mgmt_registrations_lock); + spin_lock_bh(&rdev->mgmt_registrations_lock); list_for_each_entry_safe(reg, tmp, &wdev->mgmt_registrations, list) { list_del(®->list); kfree(reg); } wdev->mgmt_registrations_need_update = 1; - spin_unlock_bh(&wdev->mgmt_registrations_lock); + spin_unlock_bh(&rdev->mgmt_registrations_lock); cfg80211_mgmt_registrations_update(wdev); } @@ -784,7 +786,7 @@ bool cfg80211_rx_mgmt_khz(struct wireless_dev *wdev, int freq, int sig_dbm, data = buf + ieee80211_hdrlen(mgmt->frame_control); data_len = len - ieee80211_hdrlen(mgmt->frame_control); - spin_lock_bh(&wdev->mgmt_registrations_lock); + spin_lock_bh(&rdev->mgmt_registrations_lock); list_for_each_entry(reg, &wdev->mgmt_registrations, list) { if (reg->frame_type != ftype) @@ -808,7 +810,7 @@ bool cfg80211_rx_mgmt_khz(struct wireless_dev *wdev, int freq, int sig_dbm, break; } - spin_unlock_bh(&wdev->mgmt_registrations_lock); + spin_unlock_bh(&rdev->mgmt_registrations_lock); trace_cfg80211_return_bool(result); return result; diff --git a/net/wireless/scan.c b/net/wireless/scan.c index 11c68b159324..adc0d14cfd86 100644 --- a/net/wireless/scan.c +++ b/net/wireless/scan.c @@ -418,14 +418,17 @@ cfg80211_add_nontrans_list(struct cfg80211_bss *trans_bss, } ssid_len = ssid[1]; ssid = ssid + 2; - rcu_read_unlock(); /* check if nontrans_bss is in the list */ list_for_each_entry(bss, &trans_bss->nontrans_list, nontrans_list) { - if (is_bss(bss, nontrans_bss->bssid, ssid, ssid_len)) + if (is_bss(bss, nontrans_bss->bssid, ssid, ssid_len)) { + rcu_read_unlock(); return 0; + } } + rcu_read_unlock(); + /* add to the list */ list_add_tail(&nontrans_bss->nontrans_list, &trans_bss->nontrans_list); return 0; diff --git a/net/wireless/util.c b/net/wireless/util.c index 18dba3d7c638..a1a99a574984 100644 --- a/net/wireless/util.c +++ b/net/wireless/util.c @@ -1028,14 +1028,14 @@ int cfg80211_change_iface(struct cfg80211_registered_device *rdev, !(rdev->wiphy.interface_modes & (1 << ntype))) return -EOPNOTSUPP; - /* if it's part of a bridge, reject changing type to station/ibss */ - if (netif_is_bridge_port(dev) && - (ntype == NL80211_IFTYPE_ADHOC || - ntype == NL80211_IFTYPE_STATION || - ntype == NL80211_IFTYPE_P2P_CLIENT)) - return -EBUSY; - if (ntype != otype) { + /* if it's part of a bridge, reject changing type to station/ibss */ + if (netif_is_bridge_port(dev) && + (ntype == NL80211_IFTYPE_ADHOC || + ntype == NL80211_IFTYPE_STATION || + ntype == NL80211_IFTYPE_P2P_CLIENT)) + return -EBUSY; + dev->ieee80211_ptr->use_4addr = false; dev->ieee80211_ptr->mesh_id_up_len = 0; wdev_lock(dev->ieee80211_ptr); diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c index e3d79a7b6db6..b5d5333ab330 100644 --- a/security/keys/process_keys.c +++ b/security/keys/process_keys.c @@ -918,6 +918,13 @@ void key_change_session_keyring(struct callback_head *twork) return; } + /* If get_ucounts fails more bits are needed in the refcount */ + if (unlikely(!get_ucounts(old->ucounts))) { + WARN_ONCE(1, "In %s get_ucounts failed\n", __func__); + put_cred(new); + return; + } + new-> uid = old-> uid; new-> euid = old-> euid; new-> suid = old-> suid; @@ -927,6 +934,7 @@ void key_change_session_keyring(struct callback_head *twork) new-> sgid = old-> sgid; new->fsgid = old->fsgid; new->user = get_uid(old->user); + new->ucounts = old->ucounts; new->user_ns = get_user_ns(old->user_ns); new->group_info = get_group_info(old->group_info); diff --git a/sound/pci/hda/patch_realtek.c b/sound/pci/hda/patch_realtek.c index 22d27b12c4e7..965b096f416f 100644 --- a/sound/pci/hda/patch_realtek.c +++ b/sound/pci/hda/patch_realtek.c @@ -2535,6 +2535,7 @@ static const struct snd_pci_quirk alc882_fixup_tbl[] = { SND_PCI_QUIRK(0x1558, 0x65d2, "Clevo PB51R[CDF]", ALC1220_FIXUP_CLEVO_PB51ED_PINS), SND_PCI_QUIRK(0x1558, 0x65e1, "Clevo PB51[ED][DF]", ALC1220_FIXUP_CLEVO_PB51ED_PINS), SND_PCI_QUIRK(0x1558, 0x65e5, "Clevo PC50D[PRS](?:-D|-G)?", ALC1220_FIXUP_CLEVO_PB51ED_PINS), + SND_PCI_QUIRK(0x1558, 0x65f1, "Clevo PC50HS", ALC1220_FIXUP_CLEVO_PB51ED_PINS), SND_PCI_QUIRK(0x1558, 0x67d1, "Clevo PB71[ER][CDF]", ALC1220_FIXUP_CLEVO_PB51ED_PINS), SND_PCI_QUIRK(0x1558, 0x67e1, "Clevo PB71[DE][CDF]", ALC1220_FIXUP_CLEVO_PB51ED_PINS), SND_PCI_QUIRK(0x1558, 0x67e5, "Clevo PC70D[PRS](?:-D|-G)?", ALC1220_FIXUP_CLEVO_PB51ED_PINS), @@ -6405,6 +6406,44 @@ static void alc_fixup_no_int_mic(struct hda_codec *codec, } } +/* GPIO1 = amplifier on/off + * GPIO3 = mic mute LED + */ +static void alc285_fixup_hp_spectre_x360_eb1(struct hda_codec *codec, + const struct hda_fixup *fix, int action) +{ + static const hda_nid_t conn[] = { 0x02 }; + + struct alc_spec *spec = codec->spec; + static const struct hda_pintbl pincfgs[] = { + { 0x14, 0x90170110 }, /* front/high speakers */ + { 0x17, 0x90170130 }, /* back/bass speakers */ + { } + }; + + //enable micmute led + alc_fixup_hp_gpio_led(codec, action, 0x00, 0x04); + + switch (action) { + case HDA_FIXUP_ACT_PRE_PROBE: + spec->micmute_led_polarity = 1; + /* needed for amp of back speakers */ + spec->gpio_mask |= 0x01; + spec->gpio_dir |= 0x01; + snd_hda_apply_pincfgs(codec, pincfgs); + /* share DAC to have unified volume control */ + snd_hda_override_conn_list(codec, 0x14, ARRAY_SIZE(conn), conn); + snd_hda_override_conn_list(codec, 0x17, ARRAY_SIZE(conn), conn); + break; + case HDA_FIXUP_ACT_INIT: + /* need to toggle GPIO to enable the amp of back speakers */ + alc_update_gpio_data(codec, 0x01, true); + msleep(100); + alc_update_gpio_data(codec, 0x01, false); + break; + } +} + static void alc285_fixup_hp_spectre_x360(struct hda_codec *codec, const struct hda_fixup *fix, int action) { @@ -6557,6 +6596,7 @@ enum { ALC269_FIXUP_HP_DOCK_GPIO_MIC1_LED, ALC280_FIXUP_HP_9480M, ALC245_FIXUP_HP_X360_AMP, + ALC285_FIXUP_HP_SPECTRE_X360_EB1, ALC288_FIXUP_DELL_HEADSET_MODE, ALC288_FIXUP_DELL1_MIC_NO_PRESENCE, ALC288_FIXUP_DELL_XPS_13, @@ -8250,6 +8290,10 @@ static const struct hda_fixup alc269_fixups[] = { .type = HDA_FIXUP_FUNC, .v.func = alc285_fixup_hp_spectre_x360, }, + [ALC285_FIXUP_HP_SPECTRE_X360_EB1] = { + .type = HDA_FIXUP_FUNC, + .v.func = alc285_fixup_hp_spectre_x360_eb1 + }, [ALC287_FIXUP_IDEAPAD_BASS_SPK_AMP] = { .type = HDA_FIXUP_FUNC, .v.func = alc285_fixup_ideapad_s740_coef, @@ -8584,6 +8628,8 @@ static const struct snd_pci_quirk alc269_fixup_tbl[] = { SND_PCI_QUIRK(0x103c, 0x87f7, "HP Spectre x360 14", ALC245_FIXUP_HP_X360_AMP), SND_PCI_QUIRK(0x103c, 0x8805, "HP ProBook 650 G8 Notebook PC", ALC236_FIXUP_HP_GPIO_LED), SND_PCI_QUIRK(0x103c, 0x880d, "HP EliteBook 830 G8 Notebook PC", ALC285_FIXUP_HP_GPIO_LED), + SND_PCI_QUIRK(0x103c, 0x8811, "HP Spectre x360 15-eb1xxx", ALC285_FIXUP_HP_SPECTRE_X360_EB1), + SND_PCI_QUIRK(0x103c, 0x8812, "HP Spectre x360 15-eb1xxx", ALC285_FIXUP_HP_SPECTRE_X360_EB1), SND_PCI_QUIRK(0x103c, 0x8846, "HP EliteBook 850 G8 Notebook PC", ALC285_FIXUP_HP_GPIO_LED), SND_PCI_QUIRK(0x103c, 0x8847, "HP EliteBook x360 830 G8 Notebook PC", ALC285_FIXUP_HP_GPIO_LED), SND_PCI_QUIRK(0x103c, 0x884b, "HP EliteBook 840 Aero G8 Notebook PC", ALC285_FIXUP_HP_GPIO_LED), @@ -9005,6 +9051,7 @@ static const struct hda_model_fixup alc269_fixup_models[] = { {.id = ALC245_FIXUP_HP_X360_AMP, .name = "alc245-hp-x360-amp"}, {.id = ALC295_FIXUP_HP_OMEN, .name = "alc295-hp-omen"}, {.id = ALC285_FIXUP_HP_SPECTRE_X360, .name = "alc285-hp-spectre-x360"}, + {.id = ALC285_FIXUP_HP_SPECTRE_X360_EB1, .name = "alc285-hp-spectre-x360-eb1"}, {.id = ALC287_FIXUP_IDEAPAD_BASS_SPK_AMP, .name = "alc287-ideapad-bass-spk-amp"}, {.id = ALC623_FIXUP_LENOVO_THINKSTATION_P340, .name = "alc623-lenovo-thinkstation-p340"}, {.id = ALC255_FIXUP_ACER_HEADPHONE_AND_MIC, .name = "alc255-acer-headphone-and-mic"}, diff --git a/sound/soc/codecs/Kconfig b/sound/soc/codecs/Kconfig index 82ee233a269d..216cea04ad70 100644 --- a/sound/soc/codecs/Kconfig +++ b/sound/soc/codecs/Kconfig @@ -1583,6 +1583,7 @@ config SND_SOC_WCD938X_SDW tristate "WCD9380/WCD9385 Codec - SDW" select SND_SOC_WCD938X select SND_SOC_WCD_MBHC + select REGMAP_IRQ depends on SOUNDWIRE select REGMAP_SOUNDWIRE help diff --git a/sound/soc/codecs/cs42l42.c b/sound/soc/codecs/cs42l42.c index fb1e4c33e27d..9a463ab54bdd 100644 --- a/sound/soc/codecs/cs42l42.c +++ b/sound/soc/codecs/cs42l42.c @@ -922,7 +922,6 @@ static int cs42l42_mute_stream(struct snd_soc_dai *dai, int mute, int stream) struct snd_soc_component *component = dai->component; struct cs42l42_private *cs42l42 = snd_soc_component_get_drvdata(component); unsigned int regval; - u8 fullScaleVol; int ret; if (mute) { @@ -993,20 +992,11 @@ static int cs42l42_mute_stream(struct snd_soc_dai *dai, int mute, int stream) cs42l42->stream_use |= 1 << stream; if (stream == SNDRV_PCM_STREAM_PLAYBACK) { - /* Read the headphone load */ - regval = snd_soc_component_read(component, CS42L42_LOAD_DET_RCSTAT); - if (((regval & CS42L42_RLA_STAT_MASK) >> CS42L42_RLA_STAT_SHIFT) == - CS42L42_RLA_STAT_15_OHM) { - fullScaleVol = CS42L42_HP_FULL_SCALE_VOL_MASK; - } else { - fullScaleVol = 0; - } - - /* Un-mute the headphone, set the full scale volume flag */ + /* Un-mute the headphone */ snd_soc_component_update_bits(component, CS42L42_HP_CTL, CS42L42_HP_ANA_AMUTE_MASK | - CS42L42_HP_ANA_BMUTE_MASK | - CS42L42_HP_FULL_SCALE_VOL_MASK, fullScaleVol); + CS42L42_HP_ANA_BMUTE_MASK, + 0); } } diff --git a/sound/soc/codecs/cs4341.c b/sound/soc/codecs/cs4341.c index 7d3e54d8eef3..29d05e32d341 100644 --- a/sound/soc/codecs/cs4341.c +++ b/sound/soc/codecs/cs4341.c @@ -305,12 +305,19 @@ static int cs4341_spi_probe(struct spi_device *spi) return cs4341_probe(&spi->dev); } +static const struct spi_device_id cs4341_spi_ids[] = { + { "cs4341a" }, + { } +}; +MODULE_DEVICE_TABLE(spi, cs4341_spi_ids); + static struct spi_driver cs4341_spi_driver = { .driver = { .name = "cs4341-spi", .of_match_table = of_match_ptr(cs4341_dt_ids), }, .probe = cs4341_spi_probe, + .id_table = cs4341_spi_ids, }; #endif diff --git a/sound/soc/codecs/nau8824.c b/sound/soc/codecs/nau8824.c index db88be48c998..f946ef65a4c1 100644 --- a/sound/soc/codecs/nau8824.c +++ b/sound/soc/codecs/nau8824.c @@ -867,8 +867,8 @@ static void nau8824_jdet_work(struct work_struct *work) struct regmap *regmap = nau8824->regmap; int adc_value, event = 0, event_mask = 0; - snd_soc_dapm_enable_pin(dapm, "MICBIAS"); - snd_soc_dapm_enable_pin(dapm, "SAR"); + snd_soc_dapm_force_enable_pin(dapm, "MICBIAS"); + snd_soc_dapm_force_enable_pin(dapm, "SAR"); snd_soc_dapm_sync(dapm); msleep(100); diff --git a/sound/soc/codecs/pcm179x-spi.c b/sound/soc/codecs/pcm179x-spi.c index 0a542924ec5f..ebf63ea90a1c 100644 --- a/sound/soc/codecs/pcm179x-spi.c +++ b/sound/soc/codecs/pcm179x-spi.c @@ -36,6 +36,7 @@ static const struct of_device_id pcm179x_of_match[] = { MODULE_DEVICE_TABLE(of, pcm179x_of_match); static const struct spi_device_id pcm179x_spi_ids[] = { + { "pcm1792a", 0 }, { "pcm179x", 0 }, { }, }; diff --git a/sound/soc/codecs/pcm512x.c b/sound/soc/codecs/pcm512x.c index 4dc844f3c1fc..60dee41816dc 100644 --- a/sound/soc/codecs/pcm512x.c +++ b/sound/soc/codecs/pcm512x.c @@ -116,6 +116,8 @@ static const struct reg_default pcm512x_reg_defaults[] = { { PCM512x_FS_SPEED_MODE, 0x00 }, { PCM512x_IDAC_1, 0x01 }, { PCM512x_IDAC_2, 0x00 }, + { PCM512x_I2S_1, 0x02 }, + { PCM512x_I2S_2, 0x00 }, }; static bool pcm512x_readable(struct device *dev, unsigned int reg) diff --git a/sound/soc/codecs/wcd938x.c b/sound/soc/codecs/wcd938x.c index f0daf8defcf1..52de7d14b139 100644 --- a/sound/soc/codecs/wcd938x.c +++ b/sound/soc/codecs/wcd938x.c @@ -4144,10 +4144,10 @@ static int wcd938x_codec_set_jack(struct snd_soc_component *comp, { struct wcd938x_priv *wcd = dev_get_drvdata(comp->dev); - if (!jack) + if (jack) return wcd_mbhc_start(wcd->wcd_mbhc, &wcd->mbhc_cfg, jack); - - wcd_mbhc_stop(wcd->wcd_mbhc); + else + wcd_mbhc_stop(wcd->wcd_mbhc); return 0; } diff --git a/sound/soc/codecs/wm8960.c b/sound/soc/codecs/wm8960.c index 9e621a254392..499604f1e178 100644 --- a/sound/soc/codecs/wm8960.c +++ b/sound/soc/codecs/wm8960.c @@ -742,9 +742,16 @@ static int wm8960_configure_clocking(struct snd_soc_component *component) int i, j, k; int ret; - if (!(iface1 & (1<<6))) { - dev_dbg(component->dev, - "Codec is slave mode, no need to configure clock\n"); + /* + * For Slave mode clocking should still be configured, + * so this if statement should be removed, but some platform + * may not work if the sysclk is not configured, to avoid such + * compatible issue, just add '!wm8960->sysclk' condition in + * this if statement. + */ + if (!(iface1 & (1 << 6)) && !wm8960->sysclk) { + dev_warn(component->dev, + "slave mode, but proceeding with no clock configuration\n"); return 0; } diff --git a/sound/soc/fsl/fsl_xcvr.c b/sound/soc/fsl/fsl_xcvr.c index 7ba2fd15132d..d0556c79fdb1 100644 --- a/sound/soc/fsl/fsl_xcvr.c +++ b/sound/soc/fsl/fsl_xcvr.c @@ -487,8 +487,9 @@ static int fsl_xcvr_prepare(struct snd_pcm_substream *substream, return ret; } - /* clear DPATH RESET */ + /* set DPATH RESET */ m_ctl |= FSL_XCVR_EXT_CTRL_DPTH_RESET(tx); + v_ctl |= FSL_XCVR_EXT_CTRL_DPTH_RESET(tx); ret = regmap_update_bits(xcvr->regmap, FSL_XCVR_EXT_CTRL, m_ctl, v_ctl); if (ret < 0) { dev_err(dai->dev, "Error while setting EXT_CTRL: %d\n", ret); @@ -590,10 +591,6 @@ static void fsl_xcvr_shutdown(struct snd_pcm_substream *substream, val |= FSL_XCVR_EXT_CTRL_CMDC_RESET(tx); } - /* set DPATH RESET */ - mask |= FSL_XCVR_EXT_CTRL_DPTH_RESET(tx); - val |= FSL_XCVR_EXT_CTRL_DPTH_RESET(tx); - ret = regmap_update_bits(xcvr->regmap, FSL_XCVR_EXT_CTRL, mask, val); if (ret < 0) { dev_err(dai->dev, "Err setting DPATH RESET: %d\n", ret); @@ -643,6 +640,16 @@ static int fsl_xcvr_trigger(struct snd_pcm_substream *substream, int cmd, dev_err(dai->dev, "Failed to enable DMA: %d\n", ret); return ret; } + + /* clear DPATH RESET */ + ret = regmap_update_bits(xcvr->regmap, FSL_XCVR_EXT_CTRL, + FSL_XCVR_EXT_CTRL_DPTH_RESET(tx), + 0); + if (ret < 0) { + dev_err(dai->dev, "Failed to clear DPATH RESET: %d\n", ret); + return ret; + } + break; case SNDRV_PCM_TRIGGER_STOP: case SNDRV_PCM_TRIGGER_SUSPEND: diff --git a/sound/soc/intel/boards/bytcht_es8316.c b/sound/soc/intel/boards/bytcht_es8316.c index 055248f104b2..4d313d0d0f23 100644 --- a/sound/soc/intel/boards/bytcht_es8316.c +++ b/sound/soc/intel/boards/bytcht_es8316.c @@ -456,12 +456,12 @@ static const struct dmi_system_id byt_cht_es8316_quirk_table[] = { static int snd_byt_cht_es8316_mc_probe(struct platform_device *pdev) { + struct device *dev = &pdev->dev; static const char * const mic_name[] = { "in1", "in2" }; + struct snd_soc_acpi_mach *mach = dev_get_platdata(dev); struct property_entry props[MAX_NO_PROPS] = {}; struct byt_cht_es8316_private *priv; const struct dmi_system_id *dmi_id; - struct device *dev = &pdev->dev; - struct snd_soc_acpi_mach *mach; struct fwnode_handle *fwnode; const char *platform_name; struct acpi_device *adev; @@ -476,7 +476,6 @@ static int snd_byt_cht_es8316_mc_probe(struct platform_device *pdev) if (!priv) return -ENOMEM; - mach = dev->platform_data; /* fix index of codec dai */ for (i = 0; i < ARRAY_SIZE(byt_cht_es8316_dais); i++) { if (!strcmp(byt_cht_es8316_dais[i].codecs->name, @@ -494,7 +493,7 @@ static int snd_byt_cht_es8316_mc_probe(struct platform_device *pdev) put_device(&adev->dev); byt_cht_es8316_dais[dai_index].codecs->name = codec_name; } else { - dev_err(&pdev->dev, "Error cannot find '%s' dev\n", mach->id); + dev_err(dev, "Error cannot find '%s' dev\n", mach->id); return -ENXIO; } @@ -533,11 +532,8 @@ static int snd_byt_cht_es8316_mc_probe(struct platform_device *pdev) /* get the clock */ priv->mclk = devm_clk_get(dev, "pmc_plt_clk_3"); - if (IS_ERR(priv->mclk)) { - ret = PTR_ERR(priv->mclk); - dev_err(dev, "clk_get pmc_plt_clk_3 failed: %d\n", ret); - return ret; - } + if (IS_ERR(priv->mclk)) + return dev_err_probe(dev, PTR_ERR(priv->mclk), "clk_get pmc_plt_clk_3 failed\n"); /* get speaker enable GPIO */ codec_dev = acpi_get_first_physical_node(adev); @@ -567,22 +563,13 @@ static int snd_byt_cht_es8316_mc_probe(struct platform_device *pdev) devm_acpi_dev_add_driver_gpios(codec_dev, byt_cht_es8316_gpios); priv->speaker_en_gpio = - gpiod_get_index(codec_dev, "speaker-enable", 0, - /* see comment in byt_cht_es8316_resume */ - GPIOD_OUT_LOW | GPIOD_FLAGS_BIT_NONEXCLUSIVE); - + gpiod_get_optional(codec_dev, "speaker-enable", + /* see comment in byt_cht_es8316_resume() */ + GPIOD_OUT_LOW | GPIOD_FLAGS_BIT_NONEXCLUSIVE); if (IS_ERR(priv->speaker_en_gpio)) { - ret = PTR_ERR(priv->speaker_en_gpio); - switch (ret) { - case -ENOENT: - priv->speaker_en_gpio = NULL; - break; - default: - dev_err(dev, "get speaker GPIO failed: %d\n", ret); - fallthrough; - case -EPROBE_DEFER: - goto err_put_codec; - } + ret = dev_err_probe(dev, PTR_ERR(priv->speaker_en_gpio), + "get speaker GPIO failed\n"); + goto err_put_codec; } snprintf(components_string, sizeof(components_string), @@ -597,7 +584,7 @@ static int snd_byt_cht_es8316_mc_probe(struct platform_device *pdev) byt_cht_es8316_card.long_name = long_name; #endif - sof_parent = snd_soc_acpi_sof_parent(&pdev->dev); + sof_parent = snd_soc_acpi_sof_parent(dev); /* set card and driver name */ if (sof_parent) { diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c index c830e96afba2..80ca260595fd 100644 --- a/sound/soc/soc-core.c +++ b/sound/soc/soc-core.c @@ -2599,6 +2599,7 @@ int snd_soc_component_initialize(struct snd_soc_component *component, INIT_LIST_HEAD(&component->dai_list); INIT_LIST_HEAD(&component->dobj_list); INIT_LIST_HEAD(&component->card_list); + INIT_LIST_HEAD(&component->list); mutex_init(&component->io_mutex); component->name = fmt_single_name(dev, &component->id); diff --git a/sound/soc/soc-dapm.c b/sound/soc/soc-dapm.c index 7b67f1e19ae9..59d07648a7e7 100644 --- a/sound/soc/soc-dapm.c +++ b/sound/soc/soc-dapm.c @@ -2561,6 +2561,7 @@ static int snd_soc_dapm_set_pin(struct snd_soc_dapm_context *dapm, const char *pin, int status) { struct snd_soc_dapm_widget *w = dapm_find_widget(dapm, pin, true); + int ret = 0; dapm_assert_locked(dapm); @@ -2573,13 +2574,14 @@ static int snd_soc_dapm_set_pin(struct snd_soc_dapm_context *dapm, dapm_mark_dirty(w, "pin configuration"); dapm_widget_invalidate_input_paths(w); dapm_widget_invalidate_output_paths(w); + ret = 1; } w->connected = status; if (status == 0) w->force = 0; - return 0; + return ret; } /** @@ -3583,14 +3585,15 @@ int snd_soc_dapm_put_pin_switch(struct snd_kcontrol *kcontrol, { struct snd_soc_card *card = snd_kcontrol_chip(kcontrol); const char *pin = (const char *)kcontrol->private_value; + int ret; if (ucontrol->value.integer.value[0]) - snd_soc_dapm_enable_pin(&card->dapm, pin); + ret = snd_soc_dapm_enable_pin(&card->dapm, pin); else - snd_soc_dapm_disable_pin(&card->dapm, pin); + ret = snd_soc_dapm_disable_pin(&card->dapm, pin); snd_soc_dapm_sync(&card->dapm); - return 0; + return ret; } EXPORT_SYMBOL_GPL(snd_soc_dapm_put_pin_switch); @@ -4023,7 +4026,7 @@ static int snd_soc_dapm_dai_link_put(struct snd_kcontrol *kcontrol, rtd->params_select = ucontrol->value.enumerated.item[0]; - return 0; + return 1; } static void diff --git a/sound/usb/mixer.c b/sound/usb/mixer.c index a2ce535df14b..8e030b1c061a 100644 --- a/sound/usb/mixer.c +++ b/sound/usb/mixer.c @@ -1198,6 +1198,13 @@ static void volume_control_quirks(struct usb_mixer_elem_info *cval, cval->res = 1; } break; + case USB_ID(0x1224, 0x2a25): /* Jieli Technology USB PHY 2.0 */ + if (!strcmp(kctl->id.name, "Mic Capture Volume")) { + usb_audio_info(chip, + "set resolution quirk: cval->res = 16\n"); + cval->res = 16; + } + break; } } diff --git a/sound/usb/quirks-table.h b/sound/usb/quirks-table.h index de18fff69280..2af8c68fac27 100644 --- a/sound/usb/quirks-table.h +++ b/sound/usb/quirks-table.h @@ -4012,6 +4012,38 @@ YAMAHA_DEVICE(0x7010, "UB99"), } } }, +{ + /* + * Sennheiser GSP670 + * Change order of interfaces loaded + */ + USB_DEVICE(0x1395, 0x0300), + .bInterfaceClass = USB_CLASS_PER_INTERFACE, + .driver_info = (unsigned long) &(const struct snd_usb_audio_quirk) { + .ifnum = QUIRK_ANY_INTERFACE, + .type = QUIRK_COMPOSITE, + .data = &(const struct snd_usb_audio_quirk[]) { + // Communication + { + .ifnum = 3, + .type = QUIRK_AUDIO_STANDARD_INTERFACE + }, + // Recording + { + .ifnum = 4, + .type = QUIRK_AUDIO_STANDARD_INTERFACE + }, + // Main + { + .ifnum = 1, + .type = QUIRK_AUDIO_STANDARD_INTERFACE + }, + { + .ifnum = -1 + } + } + } +}, #undef USB_DEVICE_VENDOR_SPEC #undef USB_AUDIO_DEVICE diff --git a/sound/usb/quirks.c b/sound/usb/quirks.c index 889c855addfc..8929d9abe8aa 100644 --- a/sound/usb/quirks.c +++ b/sound/usb/quirks.c @@ -1719,6 +1719,11 @@ void snd_usb_audioformat_attributes_quirk(struct snd_usb_audio *chip, */ fp->attributes &= ~UAC_EP_CS_ATTR_FILL_MAX; break; + case USB_ID(0x1224, 0x2a25): /* Jieli Technology USB PHY 2.0 */ + /* mic works only when ep packet size is set to wMaxPacketSize */ + fp->attributes |= UAC_EP_CS_ATTR_FILL_MAX; + break; + } } @@ -1884,10 +1889,14 @@ static const struct usb_audio_quirk_flags_table quirk_flags_table[] = { QUIRK_FLAG_GET_SAMPLE_RATE), DEVICE_FLG(0x2912, 0x30c8, /* Audioengine D1 */ QUIRK_FLAG_GET_SAMPLE_RATE), + DEVICE_FLG(0x30be, 0x0101, /* Schiit Hel */ + QUIRK_FLAG_IGNORE_CTL_ERROR), DEVICE_FLG(0x413c, 0xa506, /* Dell AE515 sound bar */ QUIRK_FLAG_GET_SAMPLE_RATE), DEVICE_FLG(0x534d, 0x2109, /* MacroSilicon MS2109 */ QUIRK_FLAG_ALIGN_TRANSFER), + DEVICE_FLG(0x1224, 0x2a25, /* Jieli Technology USB PHY 2.0 */ + QUIRK_FLAG_GET_SAMPLE_RATE), /* Vendor matches */ VENDOR_FLG(0x045e, /* MS Lifecam */ diff --git a/tools/kvm/kvm_stat/kvm_stat b/tools/kvm/kvm_stat/kvm_stat index b0bf56c5f120..5a5bd74f55bd 100755 --- a/tools/kvm/kvm_stat/kvm_stat +++ b/tools/kvm/kvm_stat/kvm_stat @@ -742,7 +742,7 @@ class DebugfsProvider(Provider): The fields are all available KVM debugfs files """ - exempt_list = ['halt_poll_fail_ns', 'halt_poll_success_ns'] + exempt_list = ['halt_poll_fail_ns', 'halt_poll_success_ns', 'halt_wait_ns'] fields = [field for field in self.walkdir(PATH_DEBUGFS_KVM)[2] if field not in exempt_list] diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf index 5cd702062a04..b856afa6eb52 100644 --- a/tools/perf/Makefile.perf +++ b/tools/perf/Makefile.perf @@ -787,6 +787,8 @@ $(OUTPUT)dlfilters/%.o: dlfilters/%.c include/perf/perf_dlfilter.h $(Q)$(MKDIR) -p $(OUTPUT)dlfilters $(QUIET_CC)$(CC) -c -Iinclude $(EXTRA_CFLAGS) -o $@ -fpic $< +.SECONDARY: $(DLFILTERS:.so=.o) + $(OUTPUT)dlfilters/%.so: $(OUTPUT)dlfilters/%.o $(QUIET_LINK)$(CC) $(EXTRA_CFLAGS) -shared -o $@ $< diff --git a/tools/perf/arch/powerpc/util/skip-callchain-idx.c b/tools/perf/arch/powerpc/util/skip-callchain-idx.c index 3018a054526a..20cd6244863b 100644 --- a/tools/perf/arch/powerpc/util/skip-callchain-idx.c +++ b/tools/perf/arch/powerpc/util/skip-callchain-idx.c @@ -45,7 +45,7 @@ static const Dwfl_Callbacks offline_callbacks = { */ static int check_return_reg(int ra_regno, Dwarf_Frame *frame) { - Dwarf_Op ops_mem[2]; + Dwarf_Op ops_mem[3]; Dwarf_Op dummy; Dwarf_Op *ops = &dummy; size_t nops; diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c index 6211d0b84b7a..c32c2eb16d7d 100644 --- a/tools/perf/builtin-script.c +++ b/tools/perf/builtin-script.c @@ -459,7 +459,7 @@ static int evsel__check_attr(struct evsel *evsel, struct perf_session *session) return -EINVAL; if (PRINT_FIELD(WEIGHT) && - evsel__check_stype(evsel, PERF_SAMPLE_WEIGHT, "WEIGHT", PERF_OUTPUT_WEIGHT)) + evsel__check_stype(evsel, PERF_SAMPLE_WEIGHT_TYPE, "WEIGHT", PERF_OUTPUT_WEIGHT)) return -EINVAL; if (PRINT_FIELD(SYM) && @@ -4039,11 +4039,15 @@ script_found: goto out_delete; uname(&uts); - if (data.is_pipe || /* assume pipe_mode indicates native_arch */ - !strcmp(uts.machine, session->header.env.arch) || - (!strcmp(uts.machine, "x86_64") && - !strcmp(session->header.env.arch, "i386"))) + if (data.is_pipe) { /* Assume pipe_mode indicates native_arch */ native_arch = true; + } else if (session->header.env.arch) { + if (!strcmp(uts.machine, session->header.env.arch)) + native_arch = true; + else if (!strcmp(uts.machine, "x86_64") && + !strcmp(session->header.env.arch, "i386")) + native_arch = true; + } script.session = session; script__setup_sample_type(&script); diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c index 5c5979046523..d88bb65b74cc 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c @@ -949,7 +949,6 @@ static void redir_to_connected(int family, int sotype, int sock_mapfd, int err, n; u32 key; char b; - int retries = 100; zero_verdict_count(verd_mapfd); @@ -1002,17 +1001,11 @@ static void redir_to_connected(int family, int sotype, int sock_mapfd, goto close_peer1; if (pass != 1) FAIL("%s: want pass count 1, have %d", log_prefix, pass); -again: - n = read(c0, &b, 1); - if (n < 0) { - if (errno == EAGAIN && retries--) { - usleep(1000); - goto again; - } - FAIL_ERRNO("%s: read", log_prefix); - } + n = recv_timeout(c0, &b, 1, 0, IO_TIMEOUT_SEC); + if (n < 0) + FAIL_ERRNO("%s: recv_timeout", log_prefix); if (n == 0) - FAIL("%s: incomplete read", log_prefix); + FAIL("%s: incomplete recv", log_prefix); close_peer1: xclose(p1); @@ -1571,7 +1564,6 @@ static void unix_redir_to_connected(int sotype, int sock_mapfd, const char *log_prefix = redir_mode_str(mode); int c0, c1, p0, p1; unsigned int pass; - int retries = 100; int err, n; int sfd[2]; u32 key; @@ -1606,17 +1598,11 @@ static void unix_redir_to_connected(int sotype, int sock_mapfd, if (pass != 1) FAIL("%s: want pass count 1, have %d", log_prefix, pass); -again: - n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1); - if (n < 0) { - if (errno == EAGAIN && retries--) { - usleep(1000); - goto again; - } - FAIL_ERRNO("%s: read", log_prefix); - } + n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC); + if (n < 0) + FAIL_ERRNO("%s: recv_timeout", log_prefix); if (n == 0) - FAIL("%s: incomplete read", log_prefix); + FAIL("%s: incomplete recv", log_prefix); close: xclose(c1); @@ -1748,7 +1734,6 @@ static void udp_redir_to_connected(int family, int sock_mapfd, int verd_mapfd, const char *log_prefix = redir_mode_str(mode); int c0, c1, p0, p1; unsigned int pass; - int retries = 100; int err, n; u32 key; char b; @@ -1781,17 +1766,11 @@ static void udp_redir_to_connected(int family, int sock_mapfd, int verd_mapfd, if (pass != 1) FAIL("%s: want pass count 1, have %d", log_prefix, pass); -again: - n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1); - if (n < 0) { - if (errno == EAGAIN && retries--) { - usleep(1000); - goto again; - } - FAIL_ERRNO("%s: read", log_prefix); - } + n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC); + if (n < 0) + FAIL_ERRNO("%s: recv_timeout", log_prefix); if (n == 0) - FAIL("%s: incomplete read", log_prefix); + FAIL("%s: incomplete recv", log_prefix); close_cli1: xclose(c1); @@ -1841,7 +1820,6 @@ static void inet_unix_redir_to_connected(int family, int type, int sock_mapfd, const char *log_prefix = redir_mode_str(mode); int c0, c1, p0, p1; unsigned int pass; - int retries = 100; int err, n; int sfd[2]; u32 key; @@ -1876,17 +1854,11 @@ static void inet_unix_redir_to_connected(int family, int type, int sock_mapfd, if (pass != 1) FAIL("%s: want pass count 1, have %d", log_prefix, pass); -again: - n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1); - if (n < 0) { - if (errno == EAGAIN && retries--) { - usleep(1000); - goto again; - } - FAIL_ERRNO("%s: read", log_prefix); - } + n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC); + if (n < 0) + FAIL_ERRNO("%s: recv_timeout", log_prefix); if (n == 0) - FAIL("%s: incomplete read", log_prefix); + FAIL("%s: incomplete recv", log_prefix); close_cli1: xclose(c1); @@ -1932,7 +1904,6 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd, int sfd[2]; u32 key; char b; - int retries = 100; zero_verdict_count(verd_mapfd); @@ -1963,17 +1934,11 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd, if (pass != 1) FAIL("%s: want pass count 1, have %d", log_prefix, pass); -again: - n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1); - if (n < 0) { - if (errno == EAGAIN && retries--) { - usleep(1000); - goto again; - } - FAIL_ERRNO("%s: read", log_prefix); - } + n = recv_timeout(mode == REDIR_INGRESS ? p0 : c0, &b, 1, 0, IO_TIMEOUT_SEC); + if (n < 0) + FAIL_ERRNO("%s: recv_timeout", log_prefix); if (n == 0) - FAIL("%s: incomplete read", log_prefix); + FAIL("%s: incomplete recv", log_prefix); close: xclose(c1); diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config index 21b646d10b88..86ab429fe7f3 100644 --- a/tools/testing/selftests/net/config +++ b/tools/testing/selftests/net/config @@ -43,3 +43,4 @@ CONFIG_NET_ACT_TUNNEL_KEY=m CONFIG_NET_ACT_MIRRED=m CONFIG_BAREUDP=m CONFIG_IPV6_IOAM6_LWTUNNEL=y +CONFIG_CRYPTO_SM4=y diff --git a/tools/testing/selftests/net/fcnal-test.sh b/tools/testing/selftests/net/fcnal-test.sh index 13350cd5c8ac..3313566ce906 100755 --- a/tools/testing/selftests/net/fcnal-test.sh +++ b/tools/testing/selftests/net/fcnal-test.sh @@ -289,6 +289,12 @@ set_sysctl() run_cmd sysctl -q -w $* } +# get sysctl values in NS-A +get_sysctl() +{ + ${NSA_CMD} sysctl -n $* +} + ################################################################################ # Setup for tests @@ -439,10 +445,13 @@ cleanup() ip -netns ${NSA} link set dev ${NSA_DEV} down ip -netns ${NSA} link del dev ${NSA_DEV} + ip netns pids ${NSA} | xargs kill 2>/dev/null ip netns del ${NSA} fi + ip netns pids ${NSB} | xargs kill 2>/dev/null ip netns del ${NSB} + ip netns pids ${NSC} | xargs kill 2>/dev/null ip netns del ${NSC} >/dev/null 2>&1 } @@ -1003,6 +1012,60 @@ ipv4_tcp_md5() run_cmd nettest -s -I ${NSA_DEV} -M ${MD5_PW} -m ${NS_NET} log_test $? 1 "MD5: VRF: Device must be a VRF - prefix" + test_ipv4_md5_vrf__vrf_server__no_bind_ifindex + test_ipv4_md5_vrf__global_server__bind_ifindex0 +} + +test_ipv4_md5_vrf__vrf_server__no_bind_ifindex() +{ + log_start + show_hint "Simulates applications using VRF without TCP_MD5SIG_FLAG_IFINDEX" + run_cmd nettest -s -I ${VRF} -M ${MD5_PW} -m ${NS_NET} --no-bind-key-ifindex & + sleep 1 + run_cmd_nsb nettest -r ${NSA_IP} -X ${MD5_PW} + log_test $? 0 "MD5: VRF: VRF-bound server, unbound key accepts connection" + + log_start + show_hint "Binding both the socket and the key is not required but it works" + run_cmd nettest -s -I ${VRF} -M ${MD5_PW} -m ${NS_NET} --force-bind-key-ifindex & + sleep 1 + run_cmd_nsb nettest -r ${NSA_IP} -X ${MD5_PW} + log_test $? 0 "MD5: VRF: VRF-bound server, bound key accepts connection" +} + +test_ipv4_md5_vrf__global_server__bind_ifindex0() +{ + # This particular test needs tcp_l3mdev_accept=1 for Global server to accept VRF connections + local old_tcp_l3mdev_accept + old_tcp_l3mdev_accept=$(get_sysctl net.ipv4.tcp_l3mdev_accept) + set_sysctl net.ipv4.tcp_l3mdev_accept=1 + + log_start + run_cmd nettest -s -M ${MD5_PW} -m ${NS_NET} --force-bind-key-ifindex & + sleep 1 + run_cmd_nsb nettest -r ${NSA_IP} -X ${MD5_PW} + log_test $? 2 "MD5: VRF: Global server, Key bound to ifindex=0 rejects VRF connection" + + log_start + run_cmd nettest -s -M ${MD5_PW} -m ${NS_NET} --force-bind-key-ifindex & + sleep 1 + run_cmd_nsc nettest -r ${NSA_IP} -X ${MD5_PW} + log_test $? 0 "MD5: VRF: Global server, key bound to ifindex=0 accepts non-VRF connection" + log_start + + run_cmd nettest -s -M ${MD5_PW} -m ${NS_NET} --no-bind-key-ifindex & + sleep 1 + run_cmd_nsb nettest -r ${NSA_IP} -X ${MD5_PW} + log_test $? 0 "MD5: VRF: Global server, key not bound to ifindex accepts VRF connection" + + log_start + run_cmd nettest -s -M ${MD5_PW} -m ${NS_NET} --no-bind-key-ifindex & + sleep 1 + run_cmd_nsc nettest -r ${NSA_IP} -X ${MD5_PW} + log_test $? 0 "MD5: VRF: Global server, key not bound to ifindex accepts non-VRF connection" + + # restore value + set_sysctl net.ipv4.tcp_l3mdev_accept="$old_tcp_l3mdev_accept" } ipv4_tcp_novrf() diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile index d97bd6889446..72ee644d47bf 100644 --- a/tools/testing/selftests/net/forwarding/Makefile +++ b/tools/testing/selftests/net/forwarding/Makefile @@ -9,6 +9,7 @@ TEST_PROGS = bridge_igmp.sh \ gre_inner_v4_multipath.sh \ gre_inner_v6_multipath.sh \ gre_multipath.sh \ + ip6_forward_instats_vrf.sh \ ip6gre_inner_v4_multipath.sh \ ip6gre_inner_v6_multipath.sh \ ipip_flat_gre_key.sh \ diff --git a/tools/testing/selftests/net/forwarding/forwarding.config.sample b/tools/testing/selftests/net/forwarding/forwarding.config.sample index b802c14d2950..e5e2fbeca22e 100644 --- a/tools/testing/selftests/net/forwarding/forwarding.config.sample +++ b/tools/testing/selftests/net/forwarding/forwarding.config.sample @@ -39,3 +39,5 @@ NETIF_CREATE=yes # Timeout (in seconds) before ping exits regardless of how many packets have # been sent or received PING_TIMEOUT=5 +# IPv6 traceroute utility name. +TROUTE6=traceroute6 diff --git a/tools/testing/selftests/net/forwarding/ip6_forward_instats_vrf.sh b/tools/testing/selftests/net/forwarding/ip6_forward_instats_vrf.sh new file mode 100755 index 000000000000..9f5b3e2e5e95 --- /dev/null +++ b/tools/testing/selftests/net/forwarding/ip6_forward_instats_vrf.sh @@ -0,0 +1,172 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +# Test ipv6 stats on the incoming if when forwarding with VRF + +ALL_TESTS=" + ipv6_ping + ipv6_in_too_big_err + ipv6_in_hdr_err + ipv6_in_addr_err + ipv6_in_discard +" + +NUM_NETIFS=4 +source lib.sh + +h1_create() +{ + simple_if_init $h1 2001:1:1::2/64 + ip -6 route add vrf v$h1 2001:1:2::/64 via 2001:1:1::1 +} + +h1_destroy() +{ + ip -6 route del vrf v$h1 2001:1:2::/64 via 2001:1:1::1 + simple_if_fini $h1 2001:1:1::2/64 +} + +router_create() +{ + vrf_create router + __simple_if_init $rtr1 router 2001:1:1::1/64 + __simple_if_init $rtr2 router 2001:1:2::1/64 + mtu_set $rtr2 1280 +} + +router_destroy() +{ + mtu_restore $rtr2 + __simple_if_fini $rtr2 2001:1:2::1/64 + __simple_if_fini $rtr1 2001:1:1::1/64 + vrf_destroy router +} + +h2_create() +{ + simple_if_init $h2 2001:1:2::2/64 + ip -6 route add vrf v$h2 2001:1:1::/64 via 2001:1:2::1 + mtu_set $h2 1280 +} + +h2_destroy() +{ + mtu_restore $h2 + ip -6 route del vrf v$h2 2001:1:1::/64 via 2001:1:2::1 + simple_if_fini $h2 2001:1:2::2/64 +} + +setup_prepare() +{ + h1=${NETIFS[p1]} + rtr1=${NETIFS[p2]} + + rtr2=${NETIFS[p3]} + h2=${NETIFS[p4]} + + vrf_prepare + h1_create + router_create + h2_create + + forwarding_enable +} + +cleanup() +{ + pre_cleanup + + forwarding_restore + + h2_destroy + router_destroy + h1_destroy + vrf_cleanup +} + +ipv6_in_too_big_err() +{ + RET=0 + + local t0=$(ipv6_stats_get $rtr1 Ip6InTooBigErrors) + local vrf_name=$(master_name_get $h1) + + # Send too big packets + ip vrf exec $vrf_name \ + $PING6 -s 1300 2001:1:2::2 -c 1 -w $PING_TIMEOUT &> /dev/null + + local t1=$(ipv6_stats_get $rtr1 Ip6InTooBigErrors) + test "$((t1 - t0))" -ne 0 + check_err $? + log_test "Ip6InTooBigErrors" +} + +ipv6_in_hdr_err() +{ + RET=0 + + local t0=$(ipv6_stats_get $rtr1 Ip6InHdrErrors) + local vrf_name=$(master_name_get $h1) + + # Send packets with hop limit 1, easiest with traceroute6 as some ping6 + # doesn't allow hop limit to be specified + ip vrf exec $vrf_name \ + $TROUTE6 2001:1:2::2 &> /dev/null + + local t1=$(ipv6_stats_get $rtr1 Ip6InHdrErrors) + test "$((t1 - t0))" -ne 0 + check_err $? + log_test "Ip6InHdrErrors" +} + +ipv6_in_addr_err() +{ + RET=0 + + local t0=$(ipv6_stats_get $rtr1 Ip6InAddrErrors) + local vrf_name=$(master_name_get $h1) + + # Disable forwarding temporary while sending the packet + sysctl -qw net.ipv6.conf.all.forwarding=0 + ip vrf exec $vrf_name \ + $PING6 2001:1:2::2 -c 1 -w $PING_TIMEOUT &> /dev/null + sysctl -qw net.ipv6.conf.all.forwarding=1 + + local t1=$(ipv6_stats_get $rtr1 Ip6InAddrErrors) + test "$((t1 - t0))" -ne 0 + check_err $? + log_test "Ip6InAddrErrors" +} + +ipv6_in_discard() +{ + RET=0 + + local t0=$(ipv6_stats_get $rtr1 Ip6InDiscards) + local vrf_name=$(master_name_get $h1) + + # Add a policy to discard + ip xfrm policy add dst 2001:1:2::2/128 dir fwd action block + ip vrf exec $vrf_name \ + $PING6 2001:1:2::2 -c 1 -w $PING_TIMEOUT &> /dev/null + ip xfrm policy del dst 2001:1:2::2/128 dir fwd + + local t1=$(ipv6_stats_get $rtr1 Ip6InDiscards) + test "$((t1 - t0))" -ne 0 + check_err $? + log_test "Ip6InDiscards" +} +ipv6_ping() +{ + RET=0 + + ping6_test $h1 2001:1:2::2 +} + +trap cleanup EXIT + +setup_prepare +setup_wait +tests_run + +exit $EXIT_STATUS diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh index e7fc5c35b569..92087d423bcf 100644 --- a/tools/testing/selftests/net/forwarding/lib.sh +++ b/tools/testing/selftests/net/forwarding/lib.sh @@ -751,6 +751,14 @@ qdisc_parent_stats_get() | jq '.[] | select(.parent == "'"$parent"'") | '"$selector" } +ipv6_stats_get() +{ + local dev=$1; shift + local stat=$1; shift + + cat /proc/net/dev_snmp6/$dev | grep "^$stat" | cut -f2 +} + humanize() { local speed=$1; shift diff --git a/tools/testing/selftests/net/nettest.c b/tools/testing/selftests/net/nettest.c index bd6288302094..b599003eb5ba 100644 --- a/tools/testing/selftests/net/nettest.c +++ b/tools/testing/selftests/net/nettest.c @@ -28,6 +28,7 @@ #include <unistd.h> #include <time.h> #include <errno.h> +#include <getopt.h> #include <linux/xfrm.h> #include <linux/ipsec.h> @@ -101,6 +102,8 @@ struct sock_args { struct sockaddr_in6 v6; } md5_prefix; unsigned int prefix_len; + /* 0: default, -1: force off, +1: force on */ + int bind_key_ifindex; /* expected addresses and device index for connection */ const char *expected_dev; @@ -271,11 +274,14 @@ static int tcp_md5sig(int sd, void *addr, socklen_t alen, struct sock_args *args } memcpy(&md5sig.tcpm_addr, addr, alen); - if (args->ifindex) { + if ((args->ifindex && args->bind_key_ifindex >= 0) || args->bind_key_ifindex >= 1) { opt = TCP_MD5SIG_EXT; md5sig.tcpm_flags |= TCP_MD5SIG_FLAG_IFINDEX; md5sig.tcpm_ifindex = args->ifindex; + log_msg("TCP_MD5SIG_FLAG_IFINDEX set tcpm_ifindex=%d\n", md5sig.tcpm_ifindex); + } else { + log_msg("TCP_MD5SIG_FLAG_IFINDEX off\n", md5sig.tcpm_ifindex); } rc = setsockopt(sd, IPPROTO_TCP, opt, &md5sig, sizeof(md5sig)); @@ -1822,6 +1828,14 @@ static int ipc_parent(int cpid, int fd, struct sock_args *args) } #define GETOPT_STR "sr:l:c:p:t:g:P:DRn:M:X:m:d:I:BN:O:SCi6xL:0:1:2:3:Fbq" +#define OPT_FORCE_BIND_KEY_IFINDEX 1001 +#define OPT_NO_BIND_KEY_IFINDEX 1002 + +static struct option long_opts[] = { + {"force-bind-key-ifindex", 0, 0, OPT_FORCE_BIND_KEY_IFINDEX}, + {"no-bind-key-ifindex", 0, 0, OPT_NO_BIND_KEY_IFINDEX}, + {0, 0, 0, 0} +}; static void print_usage(char *prog) { @@ -1858,6 +1872,10 @@ static void print_usage(char *prog) " -M password use MD5 sum protection\n" " -X password MD5 password for client mode\n" " -m prefix/len prefix and length to use for MD5 key\n" + " --no-bind-key-ifindex: Force TCP_MD5SIG_FLAG_IFINDEX off\n" + " --force-bind-key-ifindex: Force TCP_MD5SIG_FLAG_IFINDEX on\n" + " (default: only if -I is passed)\n" + "\n" " -g grp multicast group (e.g., 239.1.1.1)\n" " -i interactive mode (default is echo and terminate)\n" "\n" @@ -1893,7 +1911,7 @@ int main(int argc, char *argv[]) * process input args */ - while ((rc = getopt(argc, argv, GETOPT_STR)) != -1) { + while ((rc = getopt_long(argc, argv, GETOPT_STR, long_opts, NULL)) != -1) { switch (rc) { case 'B': both_mode = 1; @@ -1966,6 +1984,12 @@ int main(int argc, char *argv[]) case 'M': args.password = optarg; break; + case OPT_FORCE_BIND_KEY_IFINDEX: + args.bind_key_ifindex = 1; + break; + case OPT_NO_BIND_KEY_IFINDEX: + args.bind_key_ifindex = -1; + break; case 'X': args.client_pw = optarg; break; diff --git a/tools/testing/selftests/netfilter/nft_flowtable.sh b/tools/testing/selftests/netfilter/nft_flowtable.sh index 427d94816f2d..d4ffebb989f8 100755 --- a/tools/testing/selftests/netfilter/nft_flowtable.sh +++ b/tools/testing/selftests/netfilter/nft_flowtable.sh @@ -199,7 +199,6 @@ fi # test basic connectivity if ! ip netns exec ns1 ping -c 1 -q 10.0.2.99 > /dev/null; then echo "ERROR: ns1 cannot reach ns2" 1>&2 - bash exit 1 fi diff --git a/tools/testing/selftests/netfilter/nft_nat.sh b/tools/testing/selftests/netfilter/nft_nat.sh index d7e07f4c3d7f..da1c1e4b6c86 100755 --- a/tools/testing/selftests/netfilter/nft_nat.sh +++ b/tools/testing/selftests/netfilter/nft_nat.sh @@ -741,6 +741,149 @@ EOF return $lret } +# test port shadowing. +# create two listening services, one on router (ns0), one +# on client (ns2), which is masqueraded from ns1 point of view. +# ns2 sends udp packet coming from service port to ns1, on a highport. +# Later, if n1 uses same highport to connect to ns0:service, packet +# might be port-forwarded to ns2 instead. + +# second argument tells if we expect the 'fake-entry' to take effect +# (CLIENT) or not (ROUTER). +test_port_shadow() +{ + local test=$1 + local expect=$2 + local daddrc="10.0.1.99" + local daddrs="10.0.1.1" + local result="" + local logmsg="" + + echo ROUTER | ip netns exec "$ns0" nc -w 5 -u -l -p 1405 >/dev/null 2>&1 & + nc_r=$! + + echo CLIENT | ip netns exec "$ns2" nc -w 5 -u -l -p 1405 >/dev/null 2>&1 & + nc_c=$! + + # make shadow entry, from client (ns2), going to (ns1), port 41404, sport 1405. + echo "fake-entry" | ip netns exec "$ns2" nc -w 1 -p 1405 -u "$daddrc" 41404 > /dev/null + + # ns1 tries to connect to ns0:1405. With default settings this should connect + # to client, it matches the conntrack entry created above. + + result=$(echo "" | ip netns exec "$ns1" nc -w 1 -p 41404 -u "$daddrs" 1405) + + if [ "$result" = "$expect" ] ;then + echo "PASS: portshadow test $test: got reply from ${expect}${logmsg}" + else + echo "ERROR: portshadow test $test: got reply from \"$result\", not $expect as intended" + ret=1 + fi + + kill $nc_r $nc_c 2>/dev/null + + # flush udp entries for next test round, if any + ip netns exec "$ns0" conntrack -F >/dev/null 2>&1 +} + +# This prevents port shadow of router service via packet filter, +# packets claiming to originate from service port from internal +# network are dropped. +test_port_shadow_filter() +{ + local family=$1 + +ip netns exec "$ns0" nft -f /dev/stdin <<EOF +table $family filter { + chain forward { + type filter hook forward priority 0; policy accept; + meta iif veth1 udp sport 1405 drop + } +} +EOF + test_port_shadow "port-filter" "ROUTER" + + ip netns exec "$ns0" nft delete table $family filter +} + +# This prevents port shadow of router service via notrack. +test_port_shadow_notrack() +{ + local family=$1 + +ip netns exec "$ns0" nft -f /dev/stdin <<EOF +table $family raw { + chain prerouting { + type filter hook prerouting priority -300; policy accept; + meta iif veth0 udp dport 1405 notrack + udp dport 1405 notrack + } + chain output { + type filter hook output priority -300; policy accept; + udp sport 1405 notrack + } +} +EOF + test_port_shadow "port-notrack" "ROUTER" + + ip netns exec "$ns0" nft delete table $family raw +} + +# This prevents port shadow of router service via sport remap. +test_port_shadow_pat() +{ + local family=$1 + +ip netns exec "$ns0" nft -f /dev/stdin <<EOF +table $family pat { + chain postrouting { + type nat hook postrouting priority -1; policy accept; + meta iif veth1 udp sport <= 1405 masquerade to : 1406-65535 random + } +} +EOF + test_port_shadow "pat" "ROUTER" + + ip netns exec "$ns0" nft delete table $family pat +} + +test_port_shadowing() +{ + local family="ip" + + ip netns exec "$ns0" sysctl net.ipv4.conf.veth0.forwarding=1 > /dev/null + ip netns exec "$ns0" sysctl net.ipv4.conf.veth1.forwarding=1 > /dev/null + + ip netns exec "$ns0" nft -f /dev/stdin <<EOF +table $family nat { + chain postrouting { + type nat hook postrouting priority 0; policy accept; + meta oif veth0 masquerade + } +} +EOF + if [ $? -ne 0 ]; then + echo "SKIP: Could not add add $family masquerade hook" + return $ksft_skip + fi + + # test default behaviour. Packet from ns1 to ns0 is redirected to ns2. + test_port_shadow "default" "CLIENT" + + # test packet filter based mitigation: prevent forwarding of + # packets claiming to come from the service port. + test_port_shadow_filter "$family" + + # test conntrack based mitigation: connections going or coming + # from router:service bypass connection tracking. + test_port_shadow_notrack "$family" + + # test nat based mitigation: fowarded packets coming from service port + # are masqueraded with random highport. + test_port_shadow_pat "$family" + + ip netns exec "$ns0" nft delete table $family nat +} # ip netns exec "$ns0" ping -c 1 -q 10.0.$i.99 for i in 0 1 2; do @@ -861,6 +1004,8 @@ reset_counters $test_inet_nat && test_redirect inet $test_inet_nat && test_redirect6 inet +test_port_shadowing + if [ $ret -ne 0 ];then echo -n "FAIL: " nft --version diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c index 1af16d2c2a0a..52497b7b9f1d 100644 --- a/tools/testing/selftests/vm/split_huge_page_test.c +++ b/tools/testing/selftests/vm/split_huge_page_test.c @@ -341,7 +341,7 @@ void split_file_backed_thp(void) } /* write something to the file, so a file-backed THP can be allocated */ - num_written = write(fd, tmpfs_loc, sizeof(tmpfs_loc)); + num_written = write(fd, tmpfs_loc, strlen(tmpfs_loc) + 1); close(fd); if (num_written < 1) { diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 10ab56c2484a..60aa1a4fc69b 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -414,9 +414,6 @@ static void uffd_test_ctx_init_ext(uint64_t *features) uffd_test_ops->allocate_area((void **)&area_src); uffd_test_ops->allocate_area((void **)&area_dst); - uffd_test_ops->release_pages(area_src); - uffd_test_ops->release_pages(area_dst); - userfaultfd_open(features); count_verify = malloc(nr_pages * sizeof(unsigned long long)); @@ -437,6 +434,26 @@ static void uffd_test_ctx_init_ext(uint64_t *features) *(area_count(area_src, nr) + 1) = 1; } + /* + * After initialization of area_src, we must explicitly release pages + * for area_dst to make sure it's fully empty. Otherwise we could have + * some area_dst pages be errornously initialized with zero pages, + * hence we could hit memory corruption later in the test. + * + * One example is when THP is globally enabled, above allocate_area() + * calls could have the two areas merged into a single VMA (as they + * will have the same VMA flags so they're mergeable). When we + * initialize the area_src above, it's possible that some part of + * area_dst could have been faulted in via one huge THP that will be + * shared between area_src and area_dst. It could cause some of the + * area_dst won't be trapped by missing userfaults. + * + * This release_pages() will guarantee even if that happened, we'll + * proactively split the thp and drop any accidentally initialized + * pages within area_dst. + */ + uffd_test_ops->release_pages(area_dst); + pipefd = malloc(sizeof(int) * nr_cpus * 2); if (!pipefd) err("pipefd"); diff --git a/tools/testing/vsock/vsock_diag_test.c b/tools/testing/vsock/vsock_diag_test.c index cec6f5a738e1..fa927ad16f8a 100644 --- a/tools/testing/vsock/vsock_diag_test.c +++ b/tools/testing/vsock/vsock_diag_test.c @@ -332,8 +332,6 @@ static void test_no_sockets(const struct test_opts *opts) read_vsock_stat(&sockets); check_no_sockets(&sockets); - - free_sock_stat(&sockets); } static void test_listen_socket_server(const struct test_opts *opts) |