Merge "Docs: Troubleshooting: how to exit clean failed"

author: Zuul <zuul@review.opendev.org> 2023-01-26 13:25:02 +0000
committer: Gerrit Code Review <review@openstack.org> 2023-01-26 13:25:02 +0000
commit: 324bb00cb172ae2ac2211209569e9be4578c3755 (patch)
tree: 1e28176ea119506d0c6b81ab160960e533a99c26
parent: b63d15ccdb7202af1700ce4f35b892c989356d7a (diff)
parent: 8604a799aa2768b93e3826b1e2c8b543c355282c (diff)
download: ironic-324bb00cb172ae2ac2211209569e9be4578c3755.tar.gz
1 files changed, 44 insertions, 0 deletions
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index 7a9ddb0ab..72e969b6e 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -1100,3 +1100,47 @@ of other variables, you may be able to leverage the `RAID <raid>`_
 configuration interface to delete volumes/disks, and recreate them. This may
 have the same effect as a clean disk, however that too is RAID controller
 dependent behavior.
+
+I'm in "clean failed" state, what do I do?
+==========================================
+
+There is only one way to exit the ``clean failed`` state. But before we visit
+the answer as to **how**, we need to stress the importance of attempting to
+understand **why** cleaning failed. At the simple end, the cause may be
+nothing more than a DHCP failure. At the complex end, a cleaning action
+may have failed against the underlying hardware, possibly due to
+a hardware failure.
+
+As such, we encourage everyone to attempt to understand **why** before exiting
+the ``clean failed`` state, because you could potentially make things worse
+for yourself. For example, if firmware updates were being performed, you may
+need to perform a rollback operation against the physical server, depending on
+which firmware was being updated, and how. Unfortunately, this also borders
+on the territory of "no simple answer".
+
+On the other hand, the cause is sometimes just a transient networking
+failure in which a DHCP address was not obtained. A ``last_error`` field
+mentioning something like "Timeout reached while cleaning the node" would
+suggest this. Either way, we recommend following several basic
+troubleshooting steps:
+
+* Consult the ``last_error`` field on the node, utilizing the
+  ``baremetal node show <uuid>`` command.
+* If the version of ironic supports the feature, consult the node history
+  log, ``baremetal node history list <node>`` and
+  ``baremetal node history get <node> <event>``.
+* Consult the actual console screen of the physical machine. *If* the ramdisk
+  booted, you will generally want to investigate the controller logs and see
+  if an uploaded agent log is being stored on the conductor responsible for
+  the baremetal node. Consult `Retrieving logs from the deploy ramdisk`_.
+  If the node did not boot for some reason, you can typically just retry
+  at this point and move on.
+
+Once you've understood **why** you reached the ``clean failed`` state, the
+way out of it is the ``baremetal node manage <node_id>`` command. This
+returns the node to the ``manageable`` state, from where you can retry
+cleaning through automated cleaning with the ``provide`` command or manual
+cleaning with the ``clean`` command, or take the next appropriate action
+in the workflow process you are attempting to follow. That action may
+ultimately be decommissioning the node because it could have failed and is
+being removed or replaced.