author: Sverker Eriksson <sverker@erlang.org>, 2022-06-29 21:21:27 +0200
committer: Sverker Eriksson <sverker@erlang.org>, 2022-08-23 19:03:17 +0200
commit: abb60125ae543ef75346423f6ca24dcae4ccdd69
tree: 161fc66213b38c3c7116d66a4953a29ecfcc441e /system
parent: b81458a1375be7709d428324ef97550c39e21d2f
Add documentation about debugging NIFs/drivers
in Interoperability Tutorial.
Diffstat (limited to 'system')
-rw-r--r-- system/doc/tutorial/debugging.xml 346
-rw-r--r-- system/doc/tutorial/part.xml 1
-rw-r--r-- system/doc/tutorial/xmlfiles.mk 4
3 files changed, 350 insertions, 1 deletions
diff --git a/system/doc/tutorial/debugging.xml b/system/doc/tutorial/debugging.xml
new file mode 100644
index 0000000000..bd7814c441
--- /dev/null
+++ b/system/doc/tutorial/debugging.xml
@@ -0,0 +1,346 @@
+<?xml version="1.0" encoding="utf-8"?>
+<!DOCTYPE chapter SYSTEM "chapter.dtd">
+<chapter>
+ <header>
+ <copyright>
+ <year>2022</year>
+ <holder>Ericsson AB. All Rights Reserved.</holder>
+ </copyright>
+ <legalnotice>
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ </legalnotice>
+
+ <title>Debugging NIFs and Port Drivers</title>
+ <prepared/>
+ <docno/>
+ <date/>
+ <rev/>
+ <file>debugging.xml</file>
+ </header>
+
+ <section>
+ <title>With great power comes great responsibility</title>
+ <p>
+ NIF and port driver code runs inside the Erlang VM OS process (the
+ "Beam"). To maximize performance, the code is called directly by the same
+ threads that execute Erlang beam code and has full access to all the memory
+ of the OS process. A buggy NIF/driver can thus do severe damage by
+ corrupting memory.
+ </p>
+ <p>
+ In the best case, such memory corruption is detected immediately,
+ causing the Beam to crash and generate a core dump file that can be
+ analyzed to find the bug. However, it is very common for memory corruption
+ bugs not to be detected when the faulty write happens, but
+ instead much later, for example when the calling Erlang process is garbage
+ collected. When that happens, it can be very hard to find the root cause of
+ the memory corruption by analyzing the core dump. All traces that could
+ have indicated which specific NIF/driver caused the corruption
+ may be long gone.
+ </p>
+ </section>
+ <section>
+ <title>The debug emulator</title>
+ <p>
+ One way to make debugging easier is to run an emulator built with target
+ <c>debug</c>. It will
+ </p>
+ <list type="bulleted">
+ <item>
+ <p>
+ <em>Increase probability of detecting bugs earlier</em>. It contains a
+ lot more runtime checks to ensure correct use of internal interfaces
+ and data structures.
+ </p>
+ </item>
+ <item>
+ <p>
+ <em>Generate a core dump that is easier to analyze</em>. Compiler
+ optimizations are turned off, which stops the compiler from
+ "optimizing away" variables, thus making it easier/possible to inspect
+ their state.
+ </p>
+ </item>
+ <item>
+ <p>
+ <em>Detect lock order violations</em>. A runtime lock checker will
+ verify that the locks in the
+ <seecref marker="erts:erl_nif"><c>erl_nif</c></seecref> and
+ <seecref marker="erts:erl_driver"><c>erl_driver</c></seecref>
+ APIs are seized in a consistent order that cannot result in deadlock
+ bugs.
+ </p>
+ </item>
+ </list>
+ <p>
+ In fact, we recommend using the debug emulator by default during
+ development of NIFs and drivers, regardless of whether you are
+ troubleshooting bugs or not. Some subtle bugs may not be detected by the
+ normal emulator and just happen to work by chance. However, another version
+ of the emulator, or even different circumstances within the same emulator,
+ may cause the bug to later provoke all kinds of problems.
+ </p>
+ <p>
+ The main disadvantage of the <c>debug</c> emulator is its reduced
+ performance. The extra runtime checks and lack of compiler optimizations
+ may result in a slowdown by a factor of two or more, depending on
+ load. The memory footprint should be about the same.
+ </p>
+ <p>
+ If the <c>debug</c> emulator is part of the Erlang/OTP installation, it can be
+ started with</p>
+ <pre>
+> <input>erl <seecom marker="erts:erl#emu_type">-emu_type</seecom> debug</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[type-assertions] [debug-compiled] [lock-checking]</em>
+
+Eshell V13.0.2 (abort with ^G)
+1>
+</pre>
+ <p>
+ If the <c>debug</c> emulator is not part of the installation, you need to
+ <seeguide marker="system/installation_guide:INSTALL#Advanced-configuration-and-build-of-ErlangOTP_Building_How-to-Build-a-Debug-Enabled-Erlang-RunTime-System">
+ build it from the Erlang/OTP source code</seeguide>. After building from source,
+ either make an Erlang/OTP installation or run the debug emulator
+ directly in the source tree with the <c>cerl</c> script:
+ </p>
+ <pre>
+> <input>$ERL_TOP/bin/cerl -debug</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[type-assertions] [debug-compiled] [lock-checking]</em>
+
+Eshell V13.0.2 (abort with ^G)
+1>
+</pre>
+ <p>
+ The <c>cerl</c> script can also be used as a convenient way to start
+ the debugger <c>gdb</c> for core dump analysis:
+ </p>
+ <pre>
+> <input>$ERL_TOP/bin/cerl -debug -core core.12345</input>
+or
+> <input>$ERL_TOP/bin/cerl -debug -rcore core.12345</input>
+</pre>
+ <p>
+ The first variant starts Emacs and runs <c>gdb</c> within it, while
+ the second, <c>-rcore</c>, runs <c>gdb</c> directly in the terminal. Apart
+ from starting <c>gdb</c> with the correct <c>beam.debug.smp</c> executable
+ file, <c>cerl</c> also reads the file <c>$ERL_TOP/erts/etc/unix/etp-commands</c>,
+ which contains many <c>gdb</c> commands for inspecting a beam core
+ dump. For example, the command <c>etp</c> prints the content of
+ an Erlang term (<c>Eterm</c>) in plain Erlang syntax.
+ </p>
+ </section>
+ <section>
+ <title>Address Sanitizer</title>
+ <p>
+ <url href="https://clang.llvm.org/docs/AddressSanitizer.html">
+ AddressSanitizer</url> (asan) is an open source programming tool that
+ detects memory bugs such as buffer overflows, use-after-free,
+ and memory leaks. AddressSanitizer is based on compiler instrumentation
+ and is supported by both gcc and clang.
+ </p>
+ <p>
+ Similar to the <c>debug</c> emulator, the <c>asan</c> emulator runs slower
+ than normal, about 2-3 times slower. Unlike the <c>debug</c> emulator, it
+ also has a larger memory footprint, about 3 times more memory than normal.
+ </p>
+ <p>
+ To get the full effect, compile both your own NIF/driver code and
+ the Erlang emulator with AddressSanitizer instrumentation. Compile
+ your own code by passing the option <c>-fsanitize=address</c> to gcc or
+ clang. Other recommended options that improve fault
+ identification are <c>-fno-common</c> and <c>-fno-omit-frame-pointer</c>.
+ </p>
+ <p>
+ Build and run the emulator with AddressSanitizer support by using the same
+ procedure as for the debug emulator, except use the <c>asan</c> build
+ target instead of <c>debug</c>.
+ </p>
+ <taglist>
+ <tag>Run in source tree</tag>
+ <item>
+ <p>
+ If you run the <c>asan</c> emulator directly in the source tree with the
+ <c>cerl</c> script you only need to set environment variable
+ <c>ASAN_LOG_DIR</c> to the directory where the error log files will be
+ generated.
+ </p>
+ <pre>
+> <input>export ASAN_LOG_DIR=/my/asan/log/dir</input>
+> <input>$ERL_TOP/bin/cerl -asan</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[address-sanitizer]</em>
+
+Eshell V13.0.2 (abort with ^G)
+1>
+</pre>
+ <p>
+ You may however also want to set <c>ASAN_OPTIONS="halt_on_error=true"</c>
+ if you want the emulator to crash when an error is detected.
+ </p>
+ </item>
+ <tag>Run installed Erlang/OTP</tag>
+ <item>
+ <p>
+ If you run the <c>asan</c> emulator in an installed Erlang/OTP with <c>erl
+ -emu_type asan</c> you need to set the path to the error log
+ <em>file</em> with
+ </p>
+ <pre>
+> <input>export ASAN_OPTIONS="log_path=/my/asan/log/file"</input></pre>
+ <p>
+ To avoid false positive memory leak reports from the emulator
+ itself set <c>LSAN_OPTIONS</c> (LSAN=LeakSanitizer):
+ </p>
+ <pre>
+> <input>export LSAN_OPTIONS="suppressions=$ERL_TOP/erts/emulator/asan/suppress"</input></pre>
+ <p>
+ The <c>suppress</c> file is currently not installed but can be copied
+ manually from the source tree to wherever you want it.
+ </p>
+ </item>
+ </taglist>
+ <p>
+ Memory corruption errors are reported by AddressSanitizer when they
+ happen, but memory leaks are by default only checked and reported when the
+ emulator terminates.
+ </p>
+ </section>
+ <section>
+ <title>Valgrind</title>
+ <p>
+ An even more heavyweight debugging tool is <url
+ href="https://valgrind.org">Valgrind</url>. Like <c>asan</c>, it can find
+ memory corruption bugs and memory leaks. Valgrind is not
+ as good at detecting buffer overflow bugs, but it will find use of undefined
+ data, which is a type of error that <c>asan</c> cannot detect.
+ </p>
+ <p>
+ Valgrind is much slower than <c>asan</c> and it is incapable of
+ exploiting CPU multicore processing. We therefore recommend <c>asan</c> as
+ the first choice before trying Valgrind.
+ </p>
+ <p>
+ Valgrind runs as a virtual machine itself, emulating execution of hardware
+ machine instructions. This means you can run almost any program unchanged
+ on Valgrind. However, we have found that the beam executable benefits from
+ being compiled with special adaptations for running on Valgrind.
+ </p>
+ <p>
+ Build the emulator with the <c>valgrind</c> target in the same way as for
+ <c>debug</c> and <c>asan</c>. Note that <c>valgrind</c> needs to be
+ installed on the machine before the build starts.
+ </p>
+ <p>
+ Run the <c>valgrind</c> emulator directly in the source tree with the
+ <c>cerl</c> script. Set environment variable <c>VALGRIND_LOG_DIR</c> to
+ the directory where the error log files will be generated.
+ </p>
+ <pre>
+> <input>export VALGRIND_LOG_DIR=/my/valgrind/log/dir</input>
+> <input>$ERL_TOP/bin/cerl -valgrind</input>
+Erlang/OTP 25 [erts-13.0.2] ... <em>[valgrind-compiled]</em>
+
+Eshell V13.0.2 (abort with ^G)
+1>
+</pre>
+ </section>
+ <section>
+ <title>rr - Record and Replay</title>
+ <p>
+ Last but not least is the fantastic interactive debugging tool <url
+ href="https://rr-project.org/"><c>rr</c></url>, developed by Mozilla as
+ open source. <c>rr</c> stands for Record and Replay. While a core dump
+ represents only a static snapshot of the OS process when it crashed, with
+ <c>rr</c> you instead record the entire session, from start of the OS
+ process to the end (the crash). You can then replay that session from
+ within <c>gdb</c>: single step, set breakpoints and watchpoints, and even
+ <em>execute backwards</em>.
+ </p>
+ <p>
+ Considering its powerful utility, <c>rr</c> is remarkably lightweight.
+ It runs on Linux with any reasonably modern x86 CPU. You may get a
+ twofold slowdown when executing in recording mode. The big weakness is its
+ inability to exploit CPU multicore processing. If the bug is a race
+ condition between concurrently running threads, it may be hard to
+ reproduce with <c>rr</c>.
+ </p>
+ <p>
+ <c>rr</c> does not require any special instrumented compilation. However,
+ if possible, run it together with the <c>debug</c> emulator, as that will
+ result in a much nicer debugging experience. You run <c>rr</c> in the
+ source tree using the <c>cerl</c> script.
+ </p>
+ <p>
+ Here is an example of a typical session. First we catch the crash in an rr
+ recording session:
+ </p>
+ <pre>
+> <input>$ERL_TOP/bin/cerl -debug -rr</input>
+rr: Saving execution to trace directory /home/foobar/.local/share/rr/beam.debug.smp-1.
+Erlang/OTP 25 [erts-13.0.2]
+
+Eshell V13.0.2 (abort with ^G)
+1> <input>mymod:buggy_nif().</input>
+Segmentation fault</pre>
+ <p>
+ Now we can replay that session with <c>rr replay</c>:
+ </p>
+ <pre>
+> <input>rr replay</input>
+GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
+:
+(rr) <input>continue</input>
+:
+Thread 2 received signal SIGSEGV, Segmentation fault.
+(rr) <input>backtrace</input></pre>
+ <p>
+ You get the call stack at the moment of the crash. Bad luck, it is
+ somewhere deep down in the garbage collection of the beam. But you manage
+ to figure out that variable <c>hp</c> points to a broken Erlang term.
+ </p>
+ <p>
+ Set a watchpoint on that memory position and resume execution
+ <em>backwards</em>. The debugger will then stop at the exact point where
+ that memory position <c>*hp</c> was written.
+ </p>
+ <pre>
+(rr) <input>watch -l *hp</input>
+Hardware watchpoint 1: -location *hp
+(rr) <input>reverse-continue</input>
+Continuing.
+
+Thread 2 received signal SIGSEGV, Segmentation fault.</pre>
+ <p>
+ This is a quirk to be aware of. We started by executing forward until
+ it crashed with SIGSEGV. We are now executing backwards from that point,
+ so we are hitting the same SIGSEGV again but from the other
+ direction. Just continue backwards once more to move past it.
+ </p>
+ <pre>
+(rr) <input>reverse-continue</input>
+Continuing.
+
+Thread 2 hit Hardware watchpoint 1: -location *hp
+
+Old value = 42
+New value = 0</pre>
+ <p>
+ And here we are at the point where someone wrote a broken term on the
+ process heap. Note that "Old value" and "New value" are reversed when we
+ execute backwards. In this case the value 42 was written on the heap.
+ Let's see who the guilty one is:
+ </p>
+ <pre>
+(rr) <input>backtrace</input></pre>
+ </section>
+</chapter>
diff --git a/system/doc/tutorial/part.xml b/system/doc/tutorial/part.xml
index 4a66f0cb22..44bca4c4cd 100644
--- a/system/doc/tutorial/part.xml
+++ b/system/doc/tutorial/part.xml
@@ -36,5 +36,6 @@
<xi:include href="c_portdriver.xml"/>
<xi:include href="cnode.xml"/>
<xi:include href="nif.xml"/>
+ <xi:include href="debugging.xml"/>
</part>
diff --git a/system/doc/tutorial/xmlfiles.mk b/system/doc/tutorial/xmlfiles.mk
index 74e174f6d4..ccb572123b 100644
--- a/system/doc/tutorial/xmlfiles.mk
+++ b/system/doc/tutorial/xmlfiles.mk
@@ -1,3 +1,4 @@
+
#
# %CopyrightBegin%
#
@@ -19,7 +20,8 @@
#
TUTORIAL_CHAPTER_FILES = \
introduction.xml\
- overview.xml
+ overview.xml\
+ debugging.xml
TUTORIAL_CHAPTER_GEN_FILES = \
cnode.xml\