Add documentation about debbuging NIFs/drivers

in Interoperability Tutorial.
author: Sverker Eriksson <sverker@erlang.org> 2022-06-29 21:21:27 +0200
committer: Sverker Eriksson <sverker@erlang.org> 2022-08-23 19:03:17 +0200
commit: abb60125ae543ef75346423f6ca24dcae4ccdd69 (patch)
tree: 161fc66213b38c3c7116d66a4953a29ecfcc441e /system
parent: b81458a1375be7709d428324ef97550c39e21d2f (diff)
download: erlang-abb60125ae543ef75346423f6ca24dcae4ccdd69.tar.gz
3 files changed, 350 insertions, 1 deletions
diff --git a/system/doc/tutorial/debugging.xml b/system/doc/tutorial/debugging.xml
new file mode 100644
index 0000000000..bd7814c441
--- /dev/null
+++ b/system/doc/tutorial/debugging.xml
@@ -0,0 +1,346 @@
+<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE chapter SYSTEM "chapter.dtd">
+<chapter>
+  <header>
+    <copyright>
+      <year>2022</year>
+      <holder>Ericsson AB. All Rights Reserved.</holder>
+    </copyright>
+    <legalnotice>
+      Licensed under the Apache License, Version 2.0 (the "License");
+      you may not use this file except in compliance with the License.
+      You may obtain a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+      See the License for the specific language governing permissions and
+      limitations under the License.
+
+    </legalnotice>
+
+    <title>Debugging NIFs and Port Drivers</title>
+    <prepared/>
+    <docno/>
+    <date/>
+    <rev/>
+    <file>debugging.xml</file>
+  </header>
+
+  <section>
+    <title>With great power comes great responsibilty</title>
+    <p>
+      NIFs and port driver code run inside the Erlang VM OS process (the
+      "Beam"). To maximize performance the code is called directly by the same
+      threads executing Erlang beam code and has full access to all the memory
+      of the OS process. A buggy NIF/driver can thus make severe damage by
+      corrupting memory.
+    </p>
+    <p>
+      In a best case scenario such memory corruption is detected immediately
+      causing the Beam to crash generating a core dump file which can be
+      analyzed to find the bug. However, it is very common for memory corruption
+      bugs to not be immediately detected when the faulty write happens, but
+      instead much later, for example when the calling Erlang process is garbage
+      collected. When that happens it can be very hard to find the root cause of
+      the memory corruption by analysing the core dump. All traces that could
+      have indicated which specific buggy NIF/driver that caused the corruption
+      may be long gone.
+    </p>
+  </section>
+  <section>
+    <title>The debug emulator</title>
+    <p>
+      One way to make debugging easier is to run an emulator built with target
+      <c>debug</c>. It will
+    </p>
+    <list type="bulleted">
+      <item>
+	<p>
+	  <em>Increase probability of detecting bugs earlier</em>. It contains a
+	  lot more runtime checks to ensure correct use of internal interfaces
+	  and data structures.
+	</p>
+      </item>
+      <item>
+	<p>
+	  <em>Generate a core dump that is easier to analyze</em>. Compiler
+	  optimizations are turned off, which stops the compiler from
+	  "optimizing away" variables, thus making it easier/possible to inspect
+	  their state.
+	</p>
+      </item>
+      <item>
+	<p>
+	  <em>Detect lock order violations</em>. A runtime lock checker will
+	  verify that the locks in the
+	  <seecref marker="erts:erl_nif"><c>erl_nif</c></seecref> and
+	  <seecref marker="erts:erl_driver"><c>erl_driver</c></seecref>
+	  APIs are seized in a consistent order that cannot result in deadlock
+	  bugs.
+	</p>
+      </item>
+    </list>
+    <p>
+      In fact, we recommend to use the debug emulator as default during
+      development of NIFs and drivers, regardless if you are troubleshooting
+      bugs or not. Some subtle bugs may not be detected by the normal emulator
+      and just happen to work anyway by chance. However, another version of the
+      emulator, or even different circumstances within the same emulator, may
+      cause the bug to later provoke all kinds of problems.
+    </p>
+    <p>
+      The main disadvantage of the <c>debug</c> emulator is its reduced
+      performance. The extra runtime checks and lack of compiler optimizations
+      may result in a slowdown with a factor of two or more depending on
+      load. The memory footprint should be about the same.
+    </p>
+    <p>
+      If the <c>debug</c> emulator is part of the Erlang/OTP installation, it can be
+      started with</p>
+      <pre>
+> <input>erl <seecom marker="erts:erl#emu_type">-emu_type</seecom> debug</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[type-assertions] [debug-compiled] [lock-checking]</em>
+
+Eshell V13.0.2  (abort with ^G)
+1>
+</pre>
+    <p>
+      If the <c>debug</c> emulator is not part of the installation, you need to
+      <seeguide marker="system/installation_guide:INSTALL#Advanced-configuration-and-build-of-ErlangOTP_Building_How-to-Build-a-Debug-Enabled-Erlang-RunTime-System">
+      build it from the Erlang/OTP source code</seeguide>. After building from source
+      either make an Erlang/OTP installation or you can run the debug emulator
+      directly in the source tree with the <c>cerl</c> script:
+    </p>
+    <pre>
+> <input>$ERL_TOP/bin/cerl -debug</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[type-assertions] [debug-compiled] [lock-checking]</em>
+
+Eshell V13.0.2  (abort with ^G)
+1>
+</pre>
+    <p>
+      The <c>cerl</c> script can also be used as a convenient way to start
+      the debugger <c>gdb</c> for core dump analysis:
+    </p>
+    <pre>
+> <input>$ERL_TOP/bin/cerl -debug -core core.12345</input>
+or
+> <input>$ERL_TOP/bin/cerl -debug -rcore core.12345</input>
+</pre>
+    <p>
+      The first variant starts Emacs and runs <c>gdb</c> within, while
+      the other <c>-rcore</c> runs <c>gdb</c> directly in the terminal. Apart
+      from starting <c>gdb</c> with the correct <c>beam.debug.smp</c> executable
+      file it will also read the file <c>$ERL_TOP/erts/etc/unix/etp-commands</c>
+      which contains a lot of <c>gdb</c> command for inspecting a beam core
+      dump. For example, the command <c>etp</c> that will print the content of
+      an Erlang term (<c>Eterm</c>) in plain Erlang syntax.
+    </p>
+  </section>
+  <section>
+    <title>Address Sanitizer</title>
+    <p>
+      <url href="https://clang.llvm.org/docs/AddressSanitizer.html">
+      AddressSanitizer</url> (asan) is an open source programming tool that
+      detects memory corruption bugs such as buffer overflows, use-after-free
+      and memory leaks. AddressSanitizer is based on compiler instrumentation
+      and is supported by both gcc and clang.
+    </p>
+    <p>
+      Similar to the <c>debug</c> emulator, the <c>asan</c> emulator runs slower
+      than normal, about 2-3 times slower. However, it also has a larger memory
+      footprint, about 3 times more memory than normal.
+    </p>
+    <p>
+      To get full effect you should compile both your own NIF/driver code as
+      well as the Erlang emulator with AddressSanitizer instrumentation. Compile
+      your own code by passing option <c>-fsanitize=address</c> to gcc or
+      clang. Other recommended options that will improve the fault
+      identification are <c>-fno-common</c> and <c>-fno-omit-frame-pointer</c>.
+    </p>
+    <p>
+      Build and run the emulator with AddressSanitizer support by using the same
+      procedure as for the debug emulator, except use the <c>asan</c> build
+      target instead of <c>debug</c>.
+    </p>
+    <taglist>
+      <tag>Run in source tree</tag>
+      <item>
+	<p>
+	  If you run the <c>asan</c> emulator directly in the source tree with the
+	  <c>cerl</c> script you only need to set environment variable
+	  <c>ASAN_LOG_DIR</c> to the directory where the error log files will be
+	  generated.
+	</p>
+	<pre>
+> <input>export ASAN_LOG_DIR=/my/asan/log/dir</input>
+> <input>$ERL_TOP/bin/cerl -asan</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[address-sanitizer]</em>
+
+Eshell V13.0.2  (abort with ^G)
+1>
+</pre>
+        <p>
+	  You may however also want to set <c>ASAN_OPTIONS="halt_on_error=true"</c>
+	  if you want the emulator to crash when an error is detected.
+	</p>
+      </item>
+      <tag>Run installed Erlang/OTP</tag>
+      <item>
+	<p>
+	  If you run the <c>asan</c> emulator in an installed Erlang/OTP with <c>erl
+	  -emu_type asan</c> you need to set the path to the error log
+	  <em>file</em> with
+	</p>
+	<pre>
+> <input>export ASAN_OPTIONS="log_path=/my/asan/log/file"</input></pre>
+        <p>
+	  To avoid false positive memory leak reports from the emulator
+	  itself set <c>LSAN_OPTIONS</c> (LSAN=LeakSanitizer):
+	</p>
+	<pre>
+> <input>export LSAN_OPTIONS="suppressions=$ERL_TOP/erts/emulator/asan/suppress"</input></pre>
+        <p>
+	  The <c>suppress</c> file is currently not installed but can be copied
+	  manually from the source tree to wherever you want it.
+	</p>
+      </item>
+    </taglist>
+    <p>
+      Memory corruption errors are reported by AddressSanitizer when they
+      happen, but memory leaks are only checked and reported by default then the
+      emulator terminates.
+    </p>
+  </section>
+  <section>
+    <title>Valgrind</title>
+    <p>
+      An even more heavy weight debugging tool is <url
+      href="https://valgrind.org">Valgrind</url>. It can also find memory
+      corruption bugs and memory leaks similar to <c>asan</c>. Valgrind is not
+      as good at buffer overflow bugs, but it will find use of undefined data,
+      which is a type of error that <c>asan</c> cannot detect.
+    </p>
+    <p>
+      Valgrind is much slower than <c>asan</c> and it is incapable at
+      exploiting CPU multicore processing. We therefore recommend <c>asan</c> as
+      the first choice before trying valgrind.
+    </p>
+    <p>
+      Valgrind runs as a virtual machine itself, emulating execution of hardware
+      machine instructions. This means you can run almost any program unchanged
+      on valgrind. However, we have found that the beam executable benefits from
+      being compiled with special adaptions for running on valgrind.
+    </p>
+    <p>
+      Build the emulator with <c>valgrind</c> target the same as is done for
+      <c>debug</c> and <c>asan</c>. Note that <c>valgrind</c> needs to be
+      installed on the machine before the build starts.
+    </p>
+    <p>
+      Run the <c>valgrind</c> emulator directly in the source tree with the
+      <c>cerl</c> script. Set environment variable <c>VALGRIND_LOG_DIR</c> to
+      the directory where the error log files will be generated.
+    </p>
+    <pre>
+> <input>export VALGRIND_LOG_DIR=/my/valgrind/log/dir</input>
+> <input>$ERL_TOP/bin/cerl -valgrind</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[valgrind-compiled]</em>
+
+Eshell V13.0.2  (abort with ^G)
+1>
+</pre>
+  </section>
+  <section>
+    <title>rr - Record and Replay</title>
+    <p>
+      Last but not least, the fantastic interactive debugging tool <url
+      href="https://rr-project.org/"><c>rr</c></url>, developed by Mozilla as
+      open source. <c>rr</c> stands for Record and Replay. While a core dump
+      represents only a static snapshot of the OS process when it crashed, with
+      <c>rr</c> you instead record the entire session, from start of the OS
+      process to the end (the crash). You can then replay that session from
+      within <c>gdb</c>. Single step, set breakpoints and watchpoints, and even
+      <em>execute backwards</em>.
+    </p>
+    <p>
+      Considering its powerful utility, <c>rr</c> is remarkably light weight.
+      It runs on Linux with any reasonably modern x86 CPU. You may get a two
+      times slowdown when executing in recording mode. The big weakness is its
+      inability to exploite CPU multicore processing. If the bug is a race
+      condition between concurrently running threads, it may be hard to
+      reproduce with <c>rr</c>.
+    </p>
+    <p>
+      <c>rr</c> does not require any special instrumented compilation. However,
+      if possible, run it together with the <c>debug</c> emulator, as that will
+      result in a much nicer debugging experience. You run <c>rr</c> in the
+      source tree using the <c>cerl</c> script.
+    </p>
+    <p>
+      Here is an example of a typical session. First we catch the crash in an rr
+      recording session:
+    </p>
+    <pre>
+> <input>$ERL_TOP/bin/cerl -debug -rr</input>
+rr: Saving execution to trace directory /home/foobar/.local/share/rr/beam.debug.smp-1.
+Erlang/OTP 25 [erts-13.0.2]
+
+Eshell V13.0.2  (abort with ^G)
+1> <input>mymod:buggy_nif().</input>
+Segmentation fault</pre>
+    <p>
+      Now we can replay that session with <c>rr replay</c>:
+    </p>
+    <pre>
+> <input>rr replay</input>
+GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
+:
+(rr) <input>continue</input>
+:
+Thread 2 received signal SIGSEGV, Segmentation fault.
+(rr) <input>backtrace</input></pre>
+    <p>
+      You get the call stack at the moment of the crash. Bad luck, it is
+      somewhere deep down in the garbage collection of the beam. But you manage
+      to figure out that variable <c>hp</c> points to a broken Erlang term.
+    </p>
+    <p>
+      Set a watch point on that memory position and resume execution
+      <em>backwards</em>. The debugger will then stop at the exact position when
+      that memory position <c>*hp</c> was written.
+    </p>
+    <pre>
+(rr) <input>watch -l *hp</input>
+Hardware watchpoint 1: -location *hp
+(rr) <input>reverse-continue</input>
+Continuing.
+
+Thread 2 received signal SIGSEGV, Segmentation fault.</pre>
+    <p>
+      This is a quirk to be aware about. We started by executing forward until
+      it crashed with SIGSEGV. We are now executing backwards from that point,
+      so we are hitting the same SIGSEGV again but from the other
+      direction. Just continue backwards once more to move past it.
+    </p>
+    <pre>
+(rr) <input>reverse-continue</input>
+Continuing.
+
+Thread 2 hit Hardware watchpoint 1: -location *hp
+
+Old value = 42
+New value = 0</pre>
+    <p>
+      And here we are at the position when someone wrote a broken term on the
+      process heap. Note that "Old value" and "New value" are reversed when we
+      execute backwards. In this case the value 42 was written on the heap.
+      Let's see who the guilty one is:
+    </p>
+    <pre>
+(rr) <input>backtrace</input></pre>
+  </section>
+</chapter>
diff --git a/system/doc/tutorial/part.xml b/system/doc/tutorial/part.xml
index 4a66f0cb22..44bca4c4cd 100644
--- a/system/doc/tutorial/part.xml
+++ b/system/doc/tutorial/part.xml
@@ -36,5 +36,6 @@
   <xi:include href="c_portdriver.xml"/>
   <xi:include href="cnode.xml"/>
   <xi:include href="nif.xml"/>
+  <xi:include href="debugging.xml"/>
 </part>
 
diff --git a/system/doc/tutorial/xmlfiles.mk b/system/doc/tutorial/xmlfiles.mk
index 74e174f6d4..ccb572123b 100644
--- a/system/doc/tutorial/xmlfiles.mk
+++ b/system/doc/tutorial/xmlfiles.mk
@@ -1,3 +1,4 @@
+
 #
 # %CopyrightBegin%
 #
@@ -19,7 +20,8 @@
 #
 TUTORIAL_CHAPTER_FILES = \
 	introduction.xml\
-	overview.xml
+	overview.xml\
+	debugging.xml
 
 TUTORIAL_CHAPTER_GEN_FILES = \
 	cnode.xml\
author	Sverker Eriksson <sverker@erlang.org>	2022-06-29 21:21:27 +0200
committer	Sverker Eriksson <sverker@erlang.org>	2022-08-23 19:03:17 +0200
commit	abb60125ae543ef75346423f6ca24dcae4ccdd69 (patch)
tree	161fc66213b38c3c7116d66a4953a29ecfcc441e /system
parent	b81458a1375be7709d428324ef97550c39e21d2f (diff)
download	erlang-abb60125ae543ef75346423f6ca24dcae4ccdd69.tar.gz