author Zuul <zuul@review.opendev.org> 2023-01-26 13:25:02 +0000
committer Gerrit Code Review <review@openstack.org> 2023-01-26 13:25:02 +0000
commit 324bb00cb172ae2ac2211209569e9be4578c3755
tree 1e28176ea119506d0c6b81ab160960e533a99c26
parent b63d15ccdb7202af1700ce4f35b892c989356d7a
parent 8604a799aa2768b93e3826b1e2c8b543c355282c
Merge "Docs: Troubleshooting: how to exit clean failed"
-rw-r--r-- doc/source/admin/troubleshooting.rst 44
1 files changed, 44 insertions, 0 deletions
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index 7a9ddb0ab..72e969b6e 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -1100,3 +1100,47 @@ of other variables, you may be able to leverage the `RAID <raid>`_
configuration interface to delete volumes/disks, and recreate them. This may
have the same effect as a clean disk, however that too is RAID controller
dependent behavior.
+
+I'm in "clean failed" state, what do I do?
+==========================================
+
+There is only one way to exit the ``clean failed`` state. But before we visit
+the answer as to **how**, we need to stress the importance of attempting to
+understand **why** cleaning failed. In the simplest case, the cause may be
+nothing more than a DHCP failure. In a more complex case, a cleaning action
+may have failed against the underlying hardware, possibly due to
+a hardware failure.
+
+As such, we encourage everyone to attempt to understand **why** before exiting
+the ``clean failed`` state, because you could potentially make things worse
+for yourself. For example, if firmware updates were being performed, you may
+need to perform a rollback operation against the physical server, depending on
+what firmware was being updated and how. Unfortunately, this also borders on
+the territory of "no simple answer".
+
+On the other hand, sometimes the cause is only a transient networking
+failure in which a DHCP address was not obtained. This would be suggested
+by the ``last_error`` field indicating something about "Timeout
+reached while cleaning the node". In either case, we recommend following
+several basic troubleshooting steps:
+
+* Consult the ``last_error`` field on the node, utilizing the
+  ``baremetal node show <uuid>`` command.
+* If the version of ironic supports the feature, consult the node history
+  log with ``baremetal node history list`` and
+  ``baremetal node history get <uuid>``.
+* Consult the actual console screen of the physical machine. *If* the ramdisk
+  booted, you will generally want to investigate the controller logs and see
+  if an uploaded agent log is being stored on the conductor responsible for
+  the baremetal node. Consult `Retrieving logs from the deploy ramdisk`_.
+  If the node did not boot for some reason, you can typically just retry
+  at this point and move on.
+
+Once you've understood **why** you reached the state in the first place,
+the way to get out of it is to utilize the ``baremetal node manage <node_id>``
+command. This returns the node to the ``manageable`` state, from where you
+can retry "cleaning" through automated cleaning with the ``provide`` command
+or manual cleaning with the ``clean`` command, or take the next appropriate
+action in the workflow process you are attempting to follow. That may
+ultimately be decommissioning the node because it could have failed and is
+being removed or replaced.
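
Editor's note, not part of the patch: the workflow described in the added section can be sketched as a small Python helper that only assembles the ``baremetal`` command lines. This is a minimal sketch under stated assumptions. The function names (`diagnostic_commands`, `recovery_commands`, `run_commands`) are hypothetical, the `-f value -c last_error` output options and the node argument to `history list` are assumed to be available in the installed client, and the clean step in the usage note is only an illustration. By default nothing is executed, so the commands can be reviewed first.

```python
import json
import shlex
import subprocess


def diagnostic_commands(node):
    """Commands to inspect why cleaning failed (hypothetical helper)."""
    return [
        # Assumption: the client accepts the generic -f/-c output options.
        ["baremetal", "node", "show", node, "-f", "value", "-c", "last_error"],
        # Assumption: the installed ironic and client support node history.
        ["baremetal", "node", "history", "list", node],
    ]


def recovery_commands(node, retry="automated", clean_steps=None):
    """Commands to leave ``clean failed`` and optionally retry cleaning."""
    # "manage" moves the node from "clean failed" back to "manageable".
    commands = [["baremetal", "node", "manage", node]]
    if retry == "automated":
        # "provide" triggers automated cleaning on the way to "available".
        commands.append(["baremetal", "node", "provide", node])
    elif retry == "manual":
        if not clean_steps:
            raise ValueError("manual cleaning requires at least one clean step")
        commands.append(
            ["baremetal", "node", "clean", node,
             "--clean-steps", json.dumps(clean_steps)]
        )
    elif retry != "none":
        raise ValueError(f"unknown retry mode: {retry!r}")
    return commands


def run_commands(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(shlex.join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run only: show what would be executed for a placeholder node.
    run_commands(diagnostic_commands("<node_id>"))
    run_commands(recovery_commands("<node_id>"))
```

For a manual retry, something like `recovery_commands(node, retry="manual", clean_steps=[{"interface": "deploy", "step": "erase_devices_metadata"}])` would apply, with the steps chosen for the hardware in question. Passing `retry="none"` stops at ``manageable``, which suits the decommissioning case. The diagnostic commands are kept separate on purpose, so their output can be read before any state change is requested.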