| Commit message (Collapse) | Author | Age | Files | Lines |
|
When cleaning fails, we power off the node, unless it has been running
a clean step already. This happens when aborting cleaning or on a boot
failure. This change makes sure that the power action does not wipe
the last_error field, which previously left the node with
provision_state=CLEANFAIL and last_error=None for several seconds.
I've hit this in Metal3.
Also when aborting cleaning, make sure last_error is set during
the transition to CLEANFAIL, not when the clean up thread starts
running.
While here, make sure to log the current step in all cases, not only
when aborting a non-abortable step.
Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f
Story: #2010603
Task: #47476
(cherry picked from commit 9a0fa631ca53b40f4dc1877a73e65ded8ac37616)
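The ordering problem described above can be illustrated with a minimal sketch. The `Node` class, `power_off` and `fail_cleaning` below are hypothetical stand-ins invented for illustration; they are not Ironic's real objects or helpers. The sketch only assumes that a power action clears last_error by default and that the failure path needs to keep it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a node record (not Ironic's Node object)."""
    provision_state: str = "CLEANING"
    power_state: str = "power on"
    last_error: Optional[str] = None


def power_off(node: Node, preserve_error: bool = False) -> None:
    """Toy power action: clears last_error unless asked to preserve it."""
    saved = node.last_error
    node.power_state = "power off"
    node.last_error = None
    if preserve_error:
        # Restore the error so observers never see CLEANFAIL with no error.
        node.last_error = saved


def fail_cleaning(node: Node, error: str) -> None:
    """Set last_error as part of the transition, then power off safely."""
    node.last_error = error
    node.provision_state = "CLEANFAIL"
    power_off(node, preserve_error=True)


node = Node()
fail_cleaning(node, "Clean step aborted by user")
```

After `fail_cleaning` returns, the node in this sketch is powered off and still carries the error message that explains the failure.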
|
Ironic validates the network interface before the cleaning process.
Currently invalid-parameter errors are caught, but other errors are not.
As a result, a node can get stuck in the cleaning state on networking
issues or a temporary outage of the neutron service.
This patch adds NetworkError to the exception handling to cover
such cases.
Change-Id: If20de2ad4ae4177dea10b7ebfc9a91ca6fbabdb9
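A rough sketch of the widened exception handling follows. The exception classes are defined locally as placeholders (the name `InvalidParameterValue` is an assumption about the error that was already caught), and `validate_before_cleaning` is a hypothetical helper, not the function touched by the patch.

```python
class InvalidParameterValue(Exception):
    """Placeholder for the error that was already handled (assumed name)."""


class NetworkError(Exception):
    """Placeholder for the error this patch starts handling."""


def validate_before_cleaning(node: dict, validate) -> bool:
    """Run network validation; fail cleaning instead of leaving it stuck."""
    try:
        validate(node)
    except (InvalidParameterValue, NetworkError) as exc:
        # Both error types now move the node out of the cleaning state.
        node["provision_state"] = "CLEANFAIL"
        node["last_error"] = f"Failed to validate networking: {exc}"
        return False
    return True


def neutron_down(node: dict) -> None:
    """Simulates a temporary outage of the networking service."""
    raise NetworkError("neutron is unavailable")


node = {"provision_state": "CLEANING", "last_error": None}
ok = validate_before_cleaning(node, neutron_down)
```

Without `NetworkError` in the `except` clause, the exception in this sketch would propagate and the node would stay in its cleaning state.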
|
Currently we only remove the URL, which may leave a stale token.
Change-Id: I9ff2d726cb75317fe09bd43342541db0e721f2b8
|
Adds a new argument disable_ramdisk to the manual cleaning API.
Only steps that are marked with requires_ramdisk=False can be
run in this mode. Cleaning prepare/tear down is not done.
Some steps (like redfish BIOS) currently require IPA to detect
a successful reboot. They are not marked with requires_ramdisk
just yet.
Change-Id: Icacac871603bd48536188813647bc669c574de2a
Story: #2008491
Task: #41540
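The rule that only steps marked requires_ramdisk=False may run with disable_ramdisk can be sketched as a simple filter. The function name, the dictionary layout of a step and the step names are illustrative assumptions, as is the choice to treat an unmarked step as requiring the ramdisk.

```python
def check_steps_sketch(steps: list, disable_ramdisk: bool) -> list:
    """Reject ramdisk-dependent steps when the ramdisk is disabled.

    Hypothetical helper: steps are plain dicts here, and a missing
    ``requires_ramdisk`` key is assumed to mean the ramdisk is needed.
    """
    if not disable_ramdisk:
        return steps
    offending = [s["step"] for s in steps if s.get("requires_ramdisk", True)]
    if offending:
        raise ValueError(
            "Steps cannot run with disable_ramdisk: " + ", ".join(offending)
        )
    return steps


# Hypothetical step names, for illustration only.
allowed = [{"step": "step_a", "requires_ramdisk": False}]
mixed = allowed + [{"step": "step_b"}]

check_steps_sketch(allowed, disable_ramdisk=True)   # passes
check_steps_sketch(mixed, disable_ramdisk=False)    # passes, ramdisk is used
```

Calling `check_steps_sketch(mixed, disable_ramdisk=True)` raises `ValueError`, because `step_b` is not marked as ramdisk-free.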
|
We use maintenance mode to signal that hardware needs additional
intervention, because of potential damage or stuck long-running
processes. This is not the case for PXE booting or invalid requested
manual clean steps, so don't set maintenance if no clean step is
running when the failure occurs.
Change-Id: I8a7ce072359660fc6640e5f20ec2d3c452033557
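The condition described above can be sketched as follows. The dictionary-based node and the `on_clean_failure` helper are hypothetical; only the decision (set maintenance when a clean step was running, skip it otherwise) comes from the commit message.

```python
def on_clean_failure(node: dict, error: str) -> None:
    """Record a cleaning failure; flag maintenance only mid-step."""
    node["provision_state"] = "CLEANFAIL"
    node["last_error"] = error
    if node.get("clean_step"):
        # A step was in progress, so the hardware may need intervention.
        node["maintenance"] = True
        node["maintenance_reason"] = error


# Failure before any step started (for example, a PXE boot failure).
pxe_node = {"clean_step": None, "maintenance": False}
on_clean_failure(pxe_node, "Timeout waiting for ramdisk to boot")

# Failure while a step was running (hypothetical step name).
busy_node = {"clean_step": {"step": "step_a"}, "maintenance": False}
on_clean_failure(busy_node, "Step step_a failed")
```

In this sketch `pxe_node` stays out of maintenance while `busy_node` enters it.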
|
This allows users to disable automated cleaning at the
node level.
Story: #2008113
Task: #40829
Change-Id: If583bae4108b9bfa99cc460509af84696c7003c5
|
Update `cleaning_error_handler` to match
`deploying_error_handler`, which logs all errors and optionally
separates the logged message from `last_error`.
The logged message usually contains the node's UUID, as there is no
context for the node in a stream of log entries. `last_error`
usually does not contain the node's UUID, as it is already
displayed in the context of the node.
Impact:
* There were messages that were only added to node's last_error.
Now they are going to be logged too.
* No need to log explicitly before `cleaning_error_handler`. Such
occurrences have been removed.
* Where the log message and last_error differed, both are
kept. Where there was only one message, it is left as is and is
both logged and stored in `last_error`.
* Exception logging is replaced with error logging with traceback.
Story: 2008307
Task: 41198
Change-Id: I813228fb47a51ee6c45b420322acabdf565ff752
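The split between the logged message and `last_error` could look roughly like the sketch below. The signature is an assumption made for illustration and is not copied from Ironic's `cleaning_error_handler`; the logging calls are from the Python standard library.

```python
import logging

LOG = logging.getLogger("cleaning_sketch")


def cleaning_error_handler_sketch(node, logmsg, errmsg=None, traceback=False):
    """Always log ``logmsg``; store ``errmsg`` (or ``logmsg``) on the node.

    Hypothetical signature. ``traceback=True`` asks the logger to append
    the traceback of the exception currently being handled, if any.
    """
    errmsg = errmsg or logmsg
    LOG.error(logmsg, exc_info=traceback)
    node["last_error"] = errmsg
    node["provision_state"] = "CLEANFAIL"


node = {"uuid": "1234"}
cleaning_error_handler_sketch(
    node,
    logmsg="Node 1234 failed clean step: timeout",  # includes the UUID
    errmsg="Failed clean step: timeout",            # shown on the node itself
)
```

Callers that have a single message pass only `logmsg`, and the same text is used for both the log entry and `last_error`.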
|
When moving the node to ``manageable``, in addition to
``cleaning``, retrieve the BIOS configuration settings. In the
case of ``manageable``, this may allow the settings to be used
when choosing which node to deploy.
Change-Id: Ic2b162f31d4a1465fcb61671e7f48b3d31de788c
Story: 2008326
Task: 41224
|
The agent command exec model is based upon an incoming
heartbeat; however, heartbeats are independent and
commands can take a long time. For example, software RAID
setup in CI can encounter this.
From an IPA log:
[-] Picked root device /dev/md0 for node c6ca0af2-baec-40d6-879d-cbb5c751aafb
based on root device hints {'name': '/dev/md0'}
[-] Attempting to download image from http://199.204.45.248:3928/agent_images/
c6ca0af2-baec-40d6-879d-cbb5c751aafb
[-] Executing command: standby.get_partition_uuids with args: {} execute_command
/usr/local/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py:255
[-] Tried to execute standby.get_partition_uuids, agent is still executing Command name:
execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image',
'args': {'image_info': {'id': 'cb9e199a-af1b-4a6f-b00e-f284008b8046',
'urls': ['http://199.204.45.248:3928/agent_images/c6ca0af2-baec-40d6-879d-cbb5c751aafb'],
'disk_format': 'raw', 'container_format': 'bare', 'stream_raw_images': True, 'os_hash_algo':
'sha512', 'os_hash_value':<trimmed>
This was with code built on master, using master images.
Inside the conductor log, it notes that it is likely an out
of date agent because only AgentAPIError is evaluated,
however any API error is evaluated this way. In reality, we need
to explicitly flag *when* we have an error that is because
we've tried too soon, as something is already being worked upon.
The result is to evaluate and return an exception indicating work
is already in flight.
Update - It looks like the original fix to prevent busy agent
recognition did not fully detect all cases, as getting steps is a
command which can get skipped by accident with a busy agent under
certain circumstances.
Change I5d86878b5ed6142ed2630adee78c0867c49b663f in ironic-python-agent
also changed the string that was being checked for the previous
handling, where we really should have just made the string we were
checking lower case in ironic. Oh well! This should fix things
right up.
Story: 2008167
Task: 41175
Change-Id: Ia169640b7084d17d26f22e457c7af512db6d21d6
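The case-insensitive check mentioned above can be sketched like this. The exception classes are local placeholders (`AgentInProgress` is an assumed name for the "work already in flight" error), and the marker string is taken from the IPA log excerpt quoted in this message rather than from the actual code.

```python
class AgentAPIError(Exception):
    """Placeholder for a generic agent API failure."""


class AgentInProgress(Exception):
    """Placeholder (assumed name): the agent is busy with another command."""


# Lower-cased marker, based on the log line quoted in the commit message.
BUSY_MARKER = "agent is still executing"


def raise_for_agent_error(message: str) -> None:
    """Map an agent error message to a specific exception type."""
    if BUSY_MARKER in message.lower():
        # Comparing lower-cased text survives capitalization changes
        # in the agent's wording.
        raise AgentInProgress(message)
    raise AgentAPIError(message)


try:
    raise_for_agent_error(
        "Tried to execute standby.get_partition_uuids, "
        "Agent Is Still Executing Command name: execute_deploy_step"
    )
except AgentInProgress:
    busy = True
```

A caller can then retry on the next heartbeat when it sees the busy exception, instead of treating the response as a sign of an out-of-date agent.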
|
To be able to get rid of using RPC for continuing async steps
we need this code to be callable.
Change-Id: I87ec9c39fa00226b196605af97d528b268f304c7
|
We wipe these fields under some conditions, most notably when starting
the deployment. Make the removal of these fields always go through
the helpers in conductor/utils (and remove an unused one).
Change-Id: Idb952588bb8a6d5131764f29c6225762ba5d55cc
|
When exiting cleaning, previously the agent token was purged
from ironic's database and agents continuing to run would not
be able to heartbeat to the conductor. With agent token, this
would orphan the agent such that it thought it had an agent
token, yet the conductor did not.
Change-Id: Id6f8609bcda369649d0f677aceed26ed5e72a313
|
Change-Id: Ia60996b4198e6fcfba6094af26498869589e175e
|
Python 3 has mock in the standard library, in the unittest module,
so let's drop the mock requirement and switch tests to unittest.mock.
Change-Id: I4f1b3e25c8adbc24cdda51c73da3b66967f7ef23
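For a test module the switch amounts to changing the import from the third-party `mock` package to `from unittest import mock`. A small self-contained example follows; `fetch_power_state` and the client object are made up for illustration.

```python
from unittest import mock  # previously: import mock


def fetch_power_state(client) -> str:
    """Hypothetical function under test: normalizes the reported state."""
    return client.get_power_state().lower()


# The standard-library Mock behaves like the third-party one did.
client = mock.Mock()
client.get_power_state.return_value = "POWER ON"

state = fetch_power_state(client)
client.get_power_state.assert_called_once_with()
```

Here `state` is the lower-cased value returned by the mocked client, and the final assertion checks that the client was called exactly once with no arguments.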
|
With more than 4000 lines of code (more than 8000 of unit tests) the
current manager.py is barely manageable. In preparation for the new
deployment API, this change moves the internal cleaning-related
functions to the new cleaning.py.
Change-Id: I13997af2246327bd11b6aaf7029afb20923e64bc
Story: #2006910
|