From df216de6d9b195782be3cfc2d51296f3c4442b54 Mon Sep 17 00:00:00 2001
From: melanie witt
Date: Wed, 25 Mar 2020 23:02:42 +0000
Subject: Add info about affinity requests to the troubleshooting doc

We had a recent bug report about a possible regression related to
affinity policy enforcement with parallel server create requests. It
turned out not to be a regression but because of the complexity around
affinity enforcement, it might help to add a section to the compute
troubleshooting doc about it which we could refer to in the future.

Related-Bug: #1863190

Change-Id: I508c48183a7205d46e13154d4e92d31dfa7f7d78
---
 doc/source/admin/support-compute.rst               |  1 +
 .../troubleshooting/affinity-policy-violated.rst   | 78 ++++++++++++++++++++++
 2 files changed, 79 insertions(+)
 create mode 100644 doc/source/admin/troubleshooting/affinity-policy-violated.rst

diff --git a/doc/source/admin/support-compute.rst b/doc/source/admin/support-compute.rst
index f5d571bf56..8522e51d79 100644
--- a/doc/source/admin/support-compute.rst
+++ b/doc/source/admin/support-compute.rst
@@ -16,6 +16,7 @@ you how to troubleshoot Compute.
 
    troubleshooting/orphaned-allocations.rst
    troubleshooting/rebuild-placement-db.rst
+   troubleshooting/affinity-policy-violated.rst
 
 
 Compute service logging
diff --git a/doc/source/admin/troubleshooting/affinity-policy-violated.rst b/doc/source/admin/troubleshooting/affinity-policy-violated.rst
new file mode 100644
index 0000000000..a7a563491e
--- /dev/null
+++ b/doc/source/admin/troubleshooting/affinity-policy-violated.rst
@@ -0,0 +1,78 @@
+Affinity policy violated with parallel requests
+===============================================
+
+Problem
+-------
+
+Parallel server create requests for affinity or anti-affinity land on the same
+host and servers go to the ``ACTIVE`` state even though the affinity or
+anti-affinity policy was violated.
+
+Solution
+--------
+
+There are two ways to avoid anti-/affinity policy violations among multiple
+server create requests.
+
+Create multiple servers as a single request
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use the `multi-create API`_ with the ``min_count`` parameter set or the
+`multi-create CLI`_ with the ``--min`` option set to the desired number of
+servers.
+
+This works because when the batch of requests is visible to ``nova-scheduler``
+at the same time as a group, it will be able to choose compute hosts that
+satisfy the anti-/affinity constraint and will send them to the same hosts or
+different hosts accordingly.
+
+.. _multi-create API: https://docs.openstack.org/api-ref/compute/#create-multiple-servers
+.. _multi-create CLI: https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/server.html#server-create
+
+Adjust Nova configuration settings
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When requests are made separately and the scheduler cannot consider the batch
+of requests at the same time as a group, anti-/affinity races are handled by
+what is called the "late affinity check" in ``nova-compute``. Once a server
+lands on a compute host, if the request involves a server group,
+``nova-compute`` contacts the API database (via ``nova-conductor``) to retrieve
+the server group and then it checks whether the affinity policy has been
+violated. If the policy has been violated, ``nova-compute`` initiates a
+reschedule of the server create request. Note that this means the deployment
+must have :oslo.config:option:`scheduler.max_attempts` set greater than ``1``
+(default is ``3``) to handle races.
+
+An ideal configuration for multiple cells will minimize `upcalls`_ from the
+cells to the API database. This is how devstack, for example, is configured in
+the CI gate. The cell conductors do not set
+:oslo.config:option:`api_database.connection` and ``nova-compute`` sets
+:oslo.config:option:`workarounds.disable_group_policy_check_upcall` to
+``True``.
+
+However, if a deployment needs to handle racing affinity requests, it needs to
+configure cell conductors to have access to the API database, for example:
+
+.. code-block:: ini
+
+  [api_database]
+  connection = mysql+pymysql://root:a@127.0.0.1/nova_api?charset=utf8
+
+The deployment also needs to configure ``nova-compute`` services not to disable
+the group policy check upcall by either not setting (use the default)
+:oslo.config:option:`workarounds.disable_group_policy_check_upcall` or setting
+it to ``False``, for example:
+
+.. code-block:: ini
+
+  [workarounds]
+  disable_group_policy_check_upcall = False
+
+With these settings, anti-/affinity policy should not be violated even when
+parallel server create requests are racing.
+
+Future work is needed to add anti-/affinity support to the placement service in
+order to eliminate the need for the late affinity check in ``nova-compute``.
+
+.. _upcalls: https://docs.openstack.org/nova/latest/user/cellsv2-layout.html#operations-requiring-upcalls
+
--
cgit v1.2.1