author: Julia Kreger <juliaashleykreger@gmail.com>, 2023-01-19 11:31:28 -0800
committer: Julia Kreger <juliaashleykreger@gmail.com>, 2023-01-20 06:22:58 -0800
commit: 8604a799aa2768b93e3826b1e2c8b543c355282c
tree: 3551b3ccc53479f3a060d1281ea90b8bb78eb38c
parent: 9a85e4787b9f7256ccd09ea293b89509666b88c7
Docs: Troubleshooting: how to exit clean failed
I got pinged with some questions by an operator who had issues attempting to exit cleaning. In the discussion, it was realized we lack basic troubleshooting guidance, which led them to try everything but the command they needed. As such, adding some guidance in an attempt to help operators navigate these sorts of issues moving forward.

Change-Id: Ia563f5e50bbcc789ccc768bef5800a64b38ff3d7
-rw-r--r-- doc/source/admin/troubleshooting.rst 44
1 files changed, 44 insertions, 0 deletions
diff --git a/doc/source/admin/troubleshooting.rst b/doc/source/admin/troubleshooting.rst
index 7a9ddb0ab..72e969b6e 100644
--- a/doc/source/admin/troubleshooting.rst
+++ b/doc/source/admin/troubleshooting.rst
@@ -1100,3 +1100,47 @@ of other variables, you may be able to leverage the `RAID <raid>`_
configuration interface to delete volumes/disks, and recreate them. This may
have the same effect as a clean disk, however that too is RAID controller
dependent behavior.
+
+I'm in "clean failed" state, what do I do?
+==========================================
+
+There is only one way to exit the ``clean failed`` state. But before we visit
+the answer as to **how**, we need to stress the importance of attempting to
+understand **why** cleaning failed. At the simple end, the cause may be
+nothing more than a DHCP failure. At the complex end, a cleaning action may
+have failed against the underlying hardware, possibly due to a hardware
+failure.
+
+As such, we encourage everyone to attempt to understand **why** before exiting
+the ``clean failed`` state, because you could potentially make things worse
+for yourself. For example, if firmware updates were being performed, you may
+need to perform a rollback operation against the physical server, depending on
+what firmware was being updated and how. Unfortunately, this also borders on
+the territory of "no simple answer".
+
+On the other hand, sometimes the cause is only a transient networking
+failure in which a DHCP address was not obtained. This is suggested by a
+``last_error`` field reporting something like "Timeout reached while
+cleaning the node". In either case, we recommend following several
+basic troubleshooting steps:
+
+* Consult the ``last_error`` field on the node, utilizing the
+  ``baremetal node show <uuid>`` command.
+* If the version of ironic supports the feature, consult the node history
+  log, using ``baremetal node history list <uuid>`` and
+  ``baremetal node history get <uuid> <event>``.
+* Consult the actual console screen of the physical machine. *If* the ramdisk
+  booted, you will generally want to investigate the controller logs and see
+  if an uploaded agent log is being stored on the conductor responsible for
+  the baremetal node. Consult `Retrieving logs from the deploy ramdisk`_.
+  If the node did not boot for some reason, you can typically just retry
+  at this point and move on.
+
+Once you have understood **why** you reached the ``clean failed`` state in
+the first place, the way out is the ``baremetal node manage <node_id>``
+command. This returns the node to the ``manageable`` state, from where you
+can retry cleaning through automated cleaning with the ``provide`` command
+or manual cleaning with the ``clean`` command. Alternatively, take the next
+appropriate action in the workflow you are following, which may ultimately
+be decommissioning the node because it has failed and is being removed or
+replaced.
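
As an illustration only, and not part of the change above, the diagnose-then-recover sequence could be sketched as a small shell script. It assumes the ``baremetal`` client is installed and authenticated. The function names ``show_clean_failure`` and ``exit_clean_failed`` are hypothetical, and ``-f value -c <column>`` are the generic OpenStack client output-formatting options rather than anything this patch documents.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ironic or its client.
# Assumes the `baremetal` CLI is on PATH and credentials are already configured.

# Print the provision state and the recorded failure reason for a node.
show_clean_failure() {
    node="$1"
    echo "provision_state: $(baremetal node show "$node" -f value -c provision_state)"
    echo "last_error: $(baremetal node show "$node" -f value -c last_error)"
}

# Return a node to "manageable", but only if it is really in "clean failed".
exit_clean_failed() {
    node="$1"
    state=$(baremetal node show "$node" -f value -c provision_state)
    if [ "$state" != "clean failed" ]; then
        echo "node $node is in state '$state', not 'clean failed'" >&2
        return 1
    fi
    baremetal node manage "$node"
}

# Example session (not run automatically):
#   show_clean_failure <node_id>    # read last_error and understand why first
#   exit_clean_failed <node_id>     # then move the node back to manageable
#   baremetal node provide <node_id>   # optionally retry automated cleaning
```

Keeping the inspection step separate from the ``manage`` step mirrors the advice above: read ``last_error`` and decide whether a retry is safe before changing the node's state.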