path: root/ironic/tests/unit/conductor/test_cleaning.py
* Do not move nodes to CLEAN FAILED with empty last_error (Dmitry Tantsur, 2023-03-07; 1 file changed, -10/+11)

  When cleaning fails, we power off the node, unless it has been running a clean step already. This happens when aborting cleaning or on a boot failure. This change makes sure that the power action does not wipe the last_error field, which previously resulted in a node with provision_state=CLEANFAIL and last_error=None for several seconds. I've hit this in Metal3.

  Also, when aborting cleaning, make sure last_error is set during the transition to CLEANFAIL, not when the clean-up thread starts running.

  While here, make sure to log the current step in all cases, not only when aborting a non-abortable step.

  Change-Id: Id21dd7eb44dad149661ebe2d75a9b030aa70526f
  Story: #2010603
  Task: #47476
  (cherry picked from commit 9a0fa631ca53b40f4dc1877a73e65ded8ac37616)
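The interplay described in that commit can be sketched with hypothetical helpers (these names are illustrative stand-ins, not Ironic's real code): the power action takes a flag telling it not to reset the error field, so the node never sits in CLEANFAIL with an empty last_error.

```python
# Illustrative sketch (not Ironic's real API): a power action that can
# preserve last_error, so the CLEANFAIL state and its error message
# stay consistent even while the power-off side effect runs.

def power_off(node, preserve_last_error=False):
    node["power_state"] = "power off"
    if not preserve_last_error:
        # The old behaviour: any power action reset last_error.
        node["last_error"] = None
    return node


def fail_cleaning(node, message):
    # Set last_error during the transition to CLEANFAIL, and ask the
    # power action not to wipe it afterwards.
    node["provision_state"] = "clean failed"
    node["last_error"] = message
    return power_off(node, preserve_last_error=True)
```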
* Fix nodes stuck at cleaning on Network Service issues (Kaifeng Wang, 2022-09-20; 1 file changed, -6/+20)

  Ironic validates the network interface before the cleaning process. Currently an invalid parameter is captured, but other errors are not, so there is a chance that a node could get stuck in the cleaning state on networking issues or a temporary outage of the Neutron service. This patch adds NetworkError to the exception handling to cover such cases.

  Change-Id: If20de2ad4ae4177dea10b7ebfc9a91ca6fbabdb9
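A minimal sketch of the broadened exception handling, with hypothetical names (`NetworkError`, `InvalidParameterValue`, `validate_network` stand in for the real interfaces): the point is that catching only the parameter error let network failures escape and leave the node stuck.

```python
# Hypothetical sketch: pre-clean validation that fails the node cleanly
# on network errors instead of letting them escape.

class InvalidParameterValue(Exception):
    pass


class NetworkError(Exception):
    pass


def validate_network(node):
    # Placeholder: a real implementation would call the node's network
    # interface, which may raise either exception.
    raise NetworkError("neutron unavailable")


def prepare_cleaning(node, errors):
    try:
        validate_network(node)
    except (InvalidParameterValue, NetworkError) as exc:
        # Previously only InvalidParameterValue was caught here; a
        # NetworkError would leave the node stuck in the cleaning state.
        errors.append("Failed to validate network for node %s: %s" % (node, exc))
        return False
    return True
```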
* Remove temporary cleaning information on starting cleaning (Dmitry Tantsur, 2021-04-22; 1 file changed, -3/+10)

  Currently we only remove the URL, which may leave a stale token.

  Change-Id: I9ff2d726cb75317fe09bd43342541db0e721f2b8
* API to force manual cleaning without booting IPA (Dmitry Tantsur, 2021-03-16; 1 file changed, -14/+39)

  Adds a new argument disable_ramdisk to the manual cleaning API. Only steps that are marked with requires_ramdisk=False can be run in this mode. Cleaning prepare/tear down is not done.

  Some steps (like redfish BIOS) currently require IPA to detect a successful reboot. They are not marked with requires_ramdisk just yet.

  Change-Id: Icacac871603bd48536188813647bc669c574de2a
  Story: #2008491
  Task: #41540
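The requires_ramdisk gate can be sketched as follows; the step dicts are simplified stand-ins for Ironic's real clean step structures, and the function name is hypothetical. Note the default: a step that does not declare requires_ramdisk is treated as needing the ramdisk.

```python
# Hypothetical sketch: with disable_ramdisk=True, only steps explicitly
# marked requires_ramdisk=False are allowed to run.

def filter_steps(steps, disable_ramdisk=False):
    if not disable_ramdisk:
        return steps
    rejected = [s["step"] for s in steps if s.get("requires_ramdisk", True)]
    if rejected:
        raise ValueError(
            "Steps %s require booting the agent ramdisk" % ", ".join(rejected))
    return steps
```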
* Do not enter maintenance if cleaning fails before running the 1st step (Dmitry Tantsur, 2021-01-08; 1 file changed, -2/+12)

  We use maintenance mode to signal that hardware needs additional intervention, because of potential damage or stuck long-running processes. This is not the case for PXE booting or invalid requested manual clean steps, so don't set maintenance if no clean step is running when the failure occurs.

  Change-Id: I8a7ce072359660fc6640e5f20ec2d3c452033557
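The decision logic behind that commit can be illustrated with a hypothetical handler (names are illustrative): maintenance is only entered when a clean step was actually running, i.e. when the hardware may be damaged or mid-operation; failures before the first step leave maintenance untouched.

```python
# Illustrative sketch, not Ironic's real handler: enter maintenance only
# if the failure interrupted a running clean step.

def handle_cleaning_failure(node):
    step_was_running = node.get("clean_step") is not None
    node["provision_state"] = "clean failed"
    if step_was_running:
        # Hardware may be damaged or stuck; flag it for an operator.
        node["maintenance"] = True
    return node
```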
* Merge "Allow disabling automated_clean per node" (Zuul, 2020-11-25; 1 file changed, -0/+24)
  * Allow disabling automated_clean per node (Feruzjon Muyassarov, 2020-11-24; 1 file changed, -0/+24)

    This allows users to disable automated cleaning at the node level.

    Story: #2008113
    Task: #40829
    Change-Id: If583bae4108b9bfa99cc460509af84696c7003c5
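One plausible shape for such a per-node override, sketched here as an assumption rather than Ironic's actual implementation: the node setting is tri-state, where None means "follow the conductor-wide default" and True/False override it.

```python
# Hypothetical sketch of a node-level override of a global setting.

def automated_clean_enabled(node_setting, conductor_default):
    # None on the node means: defer to the conductor configuration.
    return conductor_default if node_setting is None else node_setting
```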
* Update `cleaning_error_handler` (Aija Jauntēva, 2020-11-13; 1 file changed, -3/+3)

  Update `cleaning_error_handler` to match `deploying_error_handler`, which logs all errors and optionally separates the logged message from `last_error`. The logged message usually contains the node's UUID, as there is no context for the node in a stream of log entries; `last_error` usually does not, as it is already displayed in the context of the node.

  Impact:
  * There were messages that were only added to the node's last_error. Now they are logged too.
  * There is no need to log explicitly before `cleaning_error_handler`; such occurrences have been removed.
  * Where there were different messages for the log and last_error, both are kept. Where there was only one message, it is left as is, both logged and set in `last_error`.
  * Exception logging is replaced with error logging with traceback.

  Story: 2008307
  Task: 41198
  Change-Id: I813228fb47a51ee6c45b420322acabdf565ff752
* Retrieve BIOS configuration when moving node to ``manageable`` (Bob Fournier, 2020-11-10; 1 file changed, -3/+3)

  When moving the node to ``manageable``, in addition to ``cleaning``, retrieve the BIOS configuration settings. In the case of ``manageable``, this may allow the settings to be used when choosing which node to deploy.

  Change-Id: Ic2b162f31d4a1465fcb61671e7f48b3d31de788c
  Story: 2008326
  Task: 41224
* Handle agent still doing the prior command (Julia Kreger, 2020-10-29; 1 file changed, -0/+31)

  The agent command exec model is based upon an incoming heartbeat, however heartbeats are independent and commands can take a long time. For example, software RAID setup in CI can encounter this. From an IPA log:

    [-] Picked root device /dev/md0 for node c6ca0af2-baec-40d6-879d-cbb5c751aafb based on root device hints {'name': '/dev/md0'}
    [-] Attempting to download image from http://199.204.45.248:3928/agent_images/c6ca0af2-baec-40d6-879d-cbb5c751aafb
    [-] Executing command: standby.get_partition_uuids with args: {} execute_command /usr/local/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py:255
    [-] Tried to execute standby.get_partition_uuids, agent is still executing Command name: execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image', 'args': {'image_info': {'id': 'cb9e199a-af1b-4a6f-b00e-f284008b8046', 'urls': ['http://199.204.45.248:3928/agent_images/c6ca0af2-baec-40d6-879d-cbb5c751aafb'], 'disk_format': 'raw', 'container_format': 'bare', 'stream_raw_images': True, 'os_hash_algo': 'sha512', 'os_hash_value': <trimmed>

  This was with code built on master, using master images. The conductor log notes that it is likely an out-of-date agent, because only AgentAPIError is evaluated; however, any API error is evaluated this way. In reality, we need to explicitly flag *when* we have an error because we've tried too soon, while something is already being worked on. The result is to evaluate and return an exception indicating work is already in flight.

  Update: it looks like the original fix to prevent busy-agent recognition did not fully detect all cases, as getting steps is a command which can be skipped by accident with a busy agent under certain circumstances. Change I5d86878b5ed6142ed2630adee78c0867c49b663f in ironic-python-agent also changed the string that was being checked for the previous handling, where we really should have just made the string we were checking lower case in ironic. Oh well! This should fix things right up.

  Story: 2008167
  Task: 41175
  Change-Id: Ia169640b7084d17d26f22e457c7af512db6d21d6
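The case-sensitivity lesson above can be sketched like this. The marker string and exception name are assumptions for illustration, not the exact strings Ironic matches; the point is lower-casing the error text before matching, so a capitalisation change on the agent side cannot break busy-agent detection.

```python
# Illustrative sketch of case-insensitive busy-agent detection.

class AgentInProgress(Exception):
    """The agent is still executing a previous command."""


def classify_agent_error(error_text):
    # Compare lower case so a changed capitalisation in the agent's
    # message (as happened across IPA versions) is still detected.
    if "agent is busy" in error_text.lower():
        raise AgentInProgress(error_text)
    # Anything else is treated as a regular agent API error.
    return "agent_api_error"
```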
* Refactoring: split away continue_node_deploy/clean (Dmitry Tantsur, 2020-10-06; 1 file changed, -0/+29)

  To be able to get rid of using RPC for continuing async steps, we need this code to be callable.

  Change-Id: I87ec9c39fa00226b196605af97d528b268f304c7
* Fix agent token and URL handling during fast-track deployment (Dmitry Tantsur, 2020-06-16; 1 file changed, -2/+24)

  We wipe these fields under some conditions, most notably on starting the deployment. Make the removal of these fields always go through the helpers in conductor/utils (and remove an unused one).

  Change-Id: Idb952588bb8a6d5131764f29c6225762ba5d55cc
* Fix fast track when exiting cleaning (Julia Kreger, 2020-06-03; 1 file changed, -2/+7)

  Previously, when exiting cleaning, the agent token was purged from ironic's database, and agents continuing to run would not be able to heartbeat to the conductor. With agent token, this would orphan the agent such that it thought it had an agent token, yet the conductor did not.

  Change-Id: Id6f8609bcda369649d0f677aceed26ed5e72a313
* Collect ramdisk logs also during cleaning (Dmitry Tantsur, 2020-05-14; 1 file changed, -2/+57)

  Change-Id: Ia60996b4198e6fcfba6094af26498869589e175e
* Switch to unittest mock (Iury Gregory Melo Ferreira, 2020-04-30; 1 file changed, -1/+2)

  Python 3 has a standard library for mock in the unittest module; let's drop the mock requirement and switch tests to unittest mock.

  Change-Id: I4f1b3e25c8adbc24cdda51c73da3b66967f7ef23
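After a switch like this, tests import mock from the standard library rather than the third-party package. A minimal example (the function under test and its names are made up for illustration):

```python
from unittest import mock


def fetch_power_state(client, node_id):
    # Trivial function under test; in a real test this would be
    # conductor code calling a driver or client.
    return client.get_power_state(node_id)


def test_fetch_power_state():
    client = mock.Mock()
    client.get_power_state.return_value = "power on"
    assert fetch_power_state(client, "node-1") == "power on"
    client.get_power_state.assert_called_once_with("node-1")
```

The API is identical to the old external `mock` package, so the migration is mostly a matter of changing the import and dropping the requirement.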
* Split cleaning-related functions from manager.py into a new module (Dmitry Tantsur, 2020-02-06; 1 file changed, -0/+975)

  With more than 4000 lines of code (and more than 8000 of unit tests), the current manager.py is barely manageable. In preparation for the new deployment API, this change moves the internal cleaning-related functions to the new cleaning.py.

  Change-Id: I13997af2246327bd11b6aaf7029afb20923e64bc
  Story: #2006910