author     Linus Torvalds <torvalds@linux-foundation.org>   2012-03-20 10:29:15 -0700
committer  Linus Torvalds <torvalds@linux-foundation.org>   2012-03-20 10:29:15 -0700
commit     9c2b957db1772ebf942ae7a9346b14eba6c8ca66 (patch)
tree       0dbb83e57260ea7fc0dc421f214d5f1b26262005 /Documentation/static-keys.txt
parent     0bbfcaff9b2a69c71a95e6902253487ab30cb498 (diff)
parent     bea95c152dee1791dd02cbc708afbb115bb00f9a (diff)
download   linux-stable-9c2b957db1772ebf942ae7a9346b14eba6c8ca66.tar.gz
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events changes for v3.4 from Ingo Molnar:
- New "hardware based branch profiling" feature both on the kernel and
the tooling side, on CPUs that support it. (modern x86 Intel CPUs
with the 'LBR' hardware feature currently.)
This new feature is basically a sophisticated 'magnifying glass' for
branch execution - something that is pretty difficult to extract from
regular, function histogram centric profiles.
The simplest mode is activated via 'perf record -b', and the result
looks like this in perf report:
      $ perf record -b any_call,u -e cycles:u branchy
      $ perf report -b --sort=symbol

       52.34%  [.] main                    [.] f1
       24.04%  [.] f1                      [.] f3
       23.60%  [.] f1                      [.] f2
        0.01%  [k] _IO_new_file_xsputn     [k] _IO_file_overflow
        0.01%  [k] _IO_vfprintf_internal   [k] _IO_new_file_xsputn
        0.01%  [k] _IO_vfprintf_internal   [k] strchrnul
        0.01%  [k] __printf                [k] _IO_vfprintf_internal
        0.01%  [k] main                    [k] __printf
This output has from/to branch columns and lists the highest-percentage
(from, to) jump combinations - i.e. the most likely taken
branches in the system. "branches" can also include function calls
and any other synchronous and asynchronous transitions of the
instruction pointer that are not 'next instruction' - such as system
calls, traps, interrupts, etc.
This feature comes with (hopefully intuitive) flat ascii and TUI
support in perf report.
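At the perf_event_open(2) ABI level, this is requested through the new
PERF_SAMPLE_BRANCH_STACK sample type and the branch_sample_type filter
field of perf_event_attr. A minimal, hedged C sketch of what a command
like 'perf record -b any_call,u -e cycles:u' asks the kernel for might
look as follows (illustrative only, not the perf tool's actual code; the
sample period is an arbitrary assumption and the mmap ring-buffer reader
that consumes the samples is omitted):

  /*
   * Hypothetical sketch of requesting taken-branch (LBR) samples via the
   * new branch-stack sampling ABI. Error handling is minimal on purpose.
   */
  #include <linux/perf_event.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <sys/types.h>

  static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                             int cpu, int group_fd, unsigned long flags)
  {
          return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  int main(void)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CPU_CYCLES;      /* -e cycles */
          attr.exclude_kernel = 1;                     /* :u modifier */
          attr.sample_period = 100000;                 /* assumed period */
          /* Ask for the instruction pointer plus a taken-branch stack... */
          attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
          /* ...filtered to user-space call branches ('any_call,u'). */
          attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                                    PERF_SAMPLE_BRANCH_ANY_CALL;

          int fd = perf_event_open(&attr, 0 /* this task */, -1, -1, 0);
          if (fd < 0) {
                  perror("perf_event_open (no branch-stack support?)");
                  return 1;
          }
          /* Each sample in the mmap'ed ring buffer now carries a list of
           * (from, to) branch addresses that 'perf report -b' aggregates. */
          close(fd);
          return 0;
  }

If the hardware or kernel cannot honour the requested branch filters, the
perf_event_open() call is expected to fail rather than silently drop them
(cf. the 'Disable PERF_SAMPLE_BRANCH_* when not supported' commit below).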
- Various 'perf annotate' visual improvements for us assembly junkies.
It will now recognize function calls in the TUI and by hitting enter
you can follow the call (recursively) and back, amongst other
improvements.
- Multiple threads/processes recording support in perf record, perf
stat, perf top - which is activated via a comma-list of PIDs:
      perf top    -p 21483,21485
      perf stat   -p 21483,21485 -ddd
      perf record -p 21483,21485
- Support for per-UID views, via the --uid parameter to perf top, perf
report, etc. For example, 'perf top --uid mingo' will only show the
tasks that I am running, excluding other users, root, etc.
- Jump label restructurings and improvements - this includes the
factoring out of the (hopefully much clearer) include/linux/static_key.h
generic facility:
      struct static_key key = STATIC_KEY_INIT_FALSE;

      ...

      if (static_key_false(&key))
              do unlikely code
      else
              do likely code

      ...
      static_key_slow_inc(&key);
      ...
      static_key_slow_dec(&key);
      ...
The static_key_false() branch will be generated into the code with as
little impact on the likely code path as possible. The
static_key_slow_*() APIs flip the branch via live kernel code patching.
This facility can now be used more widely within the kernel to
micro-optimize hot branches whose likelihood matches the static-key
usage and fast/slow cost patterns.
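As a concrete (hypothetical) illustration of that usage pattern, a
rarely-enabled debug path in kernel code could be guarded like this - the
names debug_key, dyn_debug_enable/disable and hot_path are made up for the
example, only the static_key API calls come from the facility described
above:

  /* Hypothetical kernel-side sketch of a static-key-guarded slow path. */
  #include <linux/static_key.h>
  #include <linux/printk.h>

  static struct static_key debug_key = STATIC_KEY_INIT_FALSE;

  void dyn_debug_enable(void)                    /* e.g. from a debugfs write */
  {
          static_key_slow_inc(&debug_key);       /* patch no-op -> jmp */
  }

  void dyn_debug_disable(void)
  {
          static_key_slow_dec(&debug_key);       /* patch jmp -> no-op */
  }

  void hot_path(void)
  {
          if (static_key_false(&debug_key))      /* off by default */
                  pr_info("debug path taken\n"); /* out-of-line slow path */
          /* likely code continues with only a no-op in the hot path */
  }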
- SW function tracer improvements: perf support and filtering support.
- Various hardenings of the perf.data ABI, to make older perf.data files
work more smoothly with newer tool versions, to make new features
integrate more smoothly, to support cross-endian recording/analyzing
workflows better, etc.
- Restructuring of the kprobes code, the splitting out of 'optprobes',
and a corner case bugfix.
- Allow the tracing of kernel console output (printk).
- Improvements/fixes to user-space RDPMC support, allowing user-space
self-profiling code to extract PMU counts without performing any
system calls, while playing nice with the kernel side.
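A rough x86 sketch of such self-profiling might look like the following.
This is an illustration of the mmap-page plus RDPMC pattern, not code from
the tree; error handling, the rdpmc capability check, counter-width masking
and non-x86 portability are all omitted, and it only works where user-space
RDPMC is permitted:

  #include <linux/perf_event.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>

  static uint64_t rdpmc(unsigned int counter)
  {
          unsigned int lo, hi;

          asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
          return ((uint64_t)hi << 32) | lo;
  }

  /* Lock-free read of our own count, following the seqlock-style protocol
   * the kernel exports in the first page of the event mmap. */
  static uint64_t self_read(volatile struct perf_event_mmap_page *pc)
  {
          uint32_t seq, idx;
          uint64_t count;

          do {
                  seq = pc->lock;
                  __sync_synchronize();
                  idx = pc->index;        /* HW index + 1; 0 = not scheduled */
                  count = pc->offset;
                  if (idx)
                          count += rdpmc(idx - 1);
                  __sync_synchronize();
          } while (pc->lock != seq);

          return count;
  }

  int main(void)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_INSTRUCTIONS;
          attr.exclude_kernel = 1;

          int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
          struct perf_event_mmap_page *pc =
                  mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);

          uint64_t before = self_read(pc);
          /* ... the code being measured would run here ... */
          uint64_t after = self_read(pc);

          printf("instructions: %llu\n", (unsigned long long)(after - before));
          return 0;
  }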
- 'perf bench' improvements
- ... and lots of internal restructurings, cleanups and fixes that made
these features possible. And, as usual, this list is incomplete as
there were also lots of other improvements.
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
perf report: Fix annotate double quit issue in branch view mode
perf report: Remove duplicate annotate choice in branch view mode
perf/x86: Prettify pmu config literals
perf report: Enable TUI in branch view mode
perf report: Auto-detect branch stack sampling mode
perf record: Add HEADER_BRANCH_STACK tag
perf record: Provide default branch stack sampling mode option
perf tools: Make perf able to read files from older ABIs
perf tools: Fix ABI compatibility bug in print_event_desc()
perf tools: Enable reading of perf.data files from different ABI rev
perf: Add ABI reference sizes
perf report: Add support for taken branch sampling
perf record: Add support for sampling taken branch
perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
x86/kprobes: Split out optprobe related code to kprobes-opt.c
x86/kprobes: Fix a bug which can modify kernel code permanently
x86/kprobes: Fix instruction recovery on optimized path
perf: Add callback to flush branch_stack on context switch
perf: Disable PERF_SAMPLE_BRANCH_* when not supported
perf/x86: Add LBR software filter support for Intel CPUs
...
Diffstat (limited to 'Documentation/static-keys.txt')
-rw-r--r--   Documentation/static-keys.txt   286
1 file changed, 286 insertions, 0 deletions
diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt
new file mode 100644
index 000000000000..d93f3c00f245
--- /dev/null
+++ b/Documentation/static-keys.txt
@@ -0,0 +1,286 @@
+			Static Keys
+			-----------
+
+By: Jason Baron <jbaron@redhat.com>
+
+0) Abstract
+
+Static keys allow the inclusion of seldom used features in
+performance-sensitive fast-path kernel code, via a GCC feature and a code
+patching technique. A quick example:
+
+	struct static_key key = STATIC_KEY_INIT_FALSE;
+
+	...
+
+	if (static_key_false(&key))
+		do unlikely code
+	else
+		do likely code
+
+	...
+	static_key_slow_inc(&key);
+	...
+	static_key_slow_dec(&key);
+	...
+
+The static_key_false() branch will be generated into the code with as little
+impact to the likely code path as possible.
+
+
+1) Motivation
+
+
+Currently, tracepoints are implemented using a conditional branch. The
+conditional check requires checking a global variable for each tracepoint.
+Although the overhead of this check is small, it increases when the memory
+cache comes under pressure (memory cache lines for these global variables
+may be shared with other memory accesses). As we increase the number of
+tracepoints in the kernel, this overhead may become more of an issue. In
+addition, tracepoints are often dormant (disabled) and provide no direct
+kernel functionality. Thus, it is highly desirable to reduce their impact
+as much as possible. Although tracepoints are the original motivation for
+this work, other kernel code paths should be able to make use of the static
+keys facility.
+
+
+2) Solution
+
+
+gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
+
+http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
+
+Using the 'asm goto', we can create branches that are either taken or not
+taken by default, without the need to check memory. Then, at run-time, we
+can patch the branch site to change the branch direction.
+
+For example, if we have a simple branch that is disabled by default:
+
+	if (static_key_false(&key))
+		printk("I am the true branch\n");
+
+Thus, by default the 'printk' will not be emitted. And the code generated
+will consist of a single atomic 'no-op' instruction (5 bytes on x86), in
+the straight-line code path. When the branch is 'flipped', we will patch
+the 'no-op' in the straight-line codepath with a 'jump' instruction to the
+out-of-line true branch. Thus, changing branch direction is expensive but
+branch selection is basically 'free'. That is the basic tradeoff of this
+optimization.
+
+This low-level patching mechanism is called 'jump label patching', and it
+gives the basis for the static keys facility.
+
+
+3) Static key label API, usage and examples:
+
+
+In order to make use of this optimization you must first define a key:
+
+	struct static_key key;
+
+Which is initialized as:
+
+	struct static_key key = STATIC_KEY_INIT_TRUE;
+
+or:
+
+	struct static_key key = STATIC_KEY_INIT_FALSE;
+
+If the key is not initialized, it defaults to false. The 'struct static_key'
+must be a 'global'. That is, it can't be allocated on the stack or
+dynamically allocated at run-time.
+
+The key is then used in code as:
+
+	if (static_key_false(&key))
+		do unlikely code
+	else
+		do likely code
+
+Or:
+
+	if (static_key_true(&key))
+		do likely code
+	else
+		do unlikely code
+
+A key that is initialized via 'STATIC_KEY_INIT_FALSE' must be used in a
+'static_key_false()' construct. Likewise, a key initialized via
+'STATIC_KEY_INIT_TRUE' must be used in a 'static_key_true()' construct. A
+single key can be used in many branches, but all the branches must match
+the way that the key has been initialized.
+
+The branch(es) can then be switched via:
+
+	static_key_slow_inc(&key);
+	...
+	static_key_slow_dec(&key);
+
+Thus, 'static_key_slow_inc()' means 'make the branch true', and
+'static_key_slow_dec()' means 'make the branch false', with appropriate
+reference counting. For example, if the key is initialized true, a
+static_key_slow_dec() will switch the branch to false. And a subsequent
+static_key_slow_inc() will change the branch back to true. Likewise, if the
+key is initialized false, a 'static_key_slow_inc()' will change the branch
+to true. And then a 'static_key_slow_dec()' will again make the branch
+false.
+
+An example usage in the kernel is the implementation of tracepoints:
+
+	static inline void trace_##name(proto)				\
+	{								\
+		if (static_key_false(&__tracepoint_##name.key))		\
+			__DO_TRACE(&__tracepoint_##name,		\
+				TP_PROTO(data_proto),			\
+				TP_ARGS(data_args),			\
+				TP_CONDITION(cond));			\
+	}
+
+Tracepoints are disabled by default, and can be placed in performance
+critical pieces of the kernel. Thus, by using a static key, the tracepoints
+can have absolutely minimal impact when not in use.
+
+
+4) Architecture level code patching interface, 'jump labels'
+
+
+There are a few functions and macros that architectures must implement in
+order to take advantage of this optimization. If there is no architecture
+support, we simply fall back to a traditional load, test, and jump sequence.
+
+* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig
+
+* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h
+
+* __always_inline bool arch_static_branch(struct static_key *key), see:
+  arch/x86/include/asm/jump_label.h
+
+* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
+  see: arch/x86/kernel/jump_label.c
+
+* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
+  see: arch/x86/kernel/jump_label.c
+
+* struct jump_entry, see: arch/x86/include/asm/jump_label.h
+
+
+5) Static keys / jump label analysis, results (x86_64):
+
+
+As an example, let's add the following branch to 'getppid()', such that the
+system call now looks like:
+
+SYSCALL_DEFINE0(getppid)
+{
+        int pid;
+
++       if (static_key_false(&key))
++               printk("I am the true branch\n");
+
+        rcu_read_lock();
+        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
+        rcu_read_unlock();
+
+        return pid;
+}
+
+The resulting instructions with jump labels generated by GCC are:
+
+ffffffff81044290 <sys_getppid>:
+ffffffff81044290:   55                      push   %rbp
+ffffffff81044291:   48 89 e5                mov    %rsp,%rbp
+ffffffff81044294:   e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
+ffffffff81044299:   65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
+ffffffff810442a0:   00 00
+ffffffff810442a2:   48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
+ffffffff810442a9:   48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
+ffffffff810442b0:   48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
+ffffffff810442b7:   e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
+ffffffff810442bc:   5d                      pop    %rbp
+ffffffff810442bd:   48 98                   cltq
+ffffffff810442bf:   c3                      retq
+ffffffff810442c0:   48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
+ffffffff810442c7:   31 c0                   xor    %eax,%eax
+ffffffff810442c9:   e8 71 13 6d 00          callq  ffffffff8171563f <printk>
+ffffffff810442ce:   eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>
+
+Without the jump label optimization it looks like:
+
+ffffffff810441f0 <sys_getppid>:
+ffffffff810441f0:   8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax   # ffffffff81dc9480 <key>
+ffffffff810441f6:   55                      push   %rbp
+ffffffff810441f7:   48 89 e5                mov    %rsp,%rbp
+ffffffff810441fa:   85 c0                   test   %eax,%eax
+ffffffff810441fc:   75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
+ffffffff810441fe:   65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
+ffffffff81044205:   00 00
+ffffffff81044207:   48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
+ffffffff8104420e:   48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
+ffffffff81044215:   48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
+ffffffff8104421c:   e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
+ffffffff81044221:   5d                      pop    %rbp
+ffffffff81044222:   48 98                   cltq
+ffffffff81044224:   c3                      retq
+ffffffff81044225:   48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
+ffffffff8104422c:   31 c0                   xor    %eax,%eax
+ffffffff8104422e:   e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
+ffffffff81044233:   eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
+ffffffff81044235:   66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
+ffffffff8104423c:   00 00 00 00
+
+Thus, the disabled jump label case adds a 'mov', 'test' and 'jne'
+instruction, whereas the jump label case just has a 'no-op' or 'jmp 0'.
+(The 'jmp 0' is patched to a 5-byte atomic no-op instruction at boot-time.)
+Thus, the disabled jump label case adds:
+
+6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 additional bytes.
+
+If we then include the padding bytes, the jump label code saves 16 total
+bytes of instruction memory for this small function. In this case the
+non-jump label function is 80 bytes long. Thus, we have saved 20% of the
+instruction footprint. We can in fact improve this even further, since the
+5-byte no-op really can be a 2-byte no-op, because we can reach the branch
+with a 2-byte jmp. However, we have not yet implemented optimal no-op sizes
+(they are currently hard-coded).
+
+Since there are a number of static key API uses in the scheduler paths,
+'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
+performance improvement. Testing done on 3.3.0-rc2:
+
+jump label disabled:
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
+           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
+                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
+               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
+     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
+   <not supported> stalled-cycles-frontend
+   <not supported> stalled-cycles-backend
+     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
+       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
+         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )
+
+       1.601607384 seconds time elapsed                                          ( +-  0.07% )
+
+jump label enabled:
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
+           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
+                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
+               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
+     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
+   <not supported> stalled-cycles-frontend
+   <not supported> stalled-cycles-backend
+     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
+       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
+         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )
+
+       1.579384366 seconds time elapsed
+
+The percentage of saved branches is .7%, and we've saved 12% on
+'branch-misses'. This is where we would expect to get the most savings,
+since this optimization is about reducing the number of branches. In
+addition, we've saved .2% on instructions, and 2.8% on cycles and 1.4% on
+elapsed time.
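The gcc 'asm goto' construct that section 2 of the document above builds on
can also be seen in isolation with a small user-space sketch (illustrative
only, not kernel code; requires gcc 4.5 or later):

  #include <stdio.h>

  static int demo(void)
  {
          /* An empty asm that is *allowed* to jump to 'unlikely_path' but
           * never does: the compiler must keep that code reachable, yet the
           * straight-line path below contains no load and no test - the same
           * shape the kernel gets from a static key before it is patched. */
          asm goto("" : : : : unlikely_path);
          return 0;               /* likely, straight-line path */

  unlikely_path:
          return 1;               /* out-of-line path, reachable only via the asm */
  }

  int main(void)
  {
          printf("demo() = %d\n", demo());
          return 0;
  }

Roughly speaking, the kernel's arch_static_branch() uses the same construct,
but with a patchable no-op in the asm body and an entry recorded in the
__jump_table section so the site can later be rewritten to a jump.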