summaryrefslogtreecommitdiff
path: root/kernel/bpf/stackmap.c
Commit message (Collapse)AuthorAgeFilesLines
* bpf: Fix integer overflow in prealloc_elems_and_freelist()Tatsuhiko Yasumatsu2021-09-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In prealloc_elems_and_freelist(), the multiplication to calculate the size passed to bpf_map_area_alloc() could lead to an integer overflow. As a result, out-of-bounds write could occur in pcpu_freelist_populate() as reported by KASAN: [...] [ 16.968613] BUG: KASAN: slab-out-of-bounds in pcpu_freelist_populate+0xd9/0x100 [ 16.969408] Write of size 8 at addr ffff888104fc6ea0 by task crash/78 [ 16.970038] [ 16.970195] CPU: 0 PID: 78 Comm: crash Not tainted 5.15.0-rc2+ #1 [ 16.970878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 16.972026] Call Trace: [ 16.972306] dump_stack_lvl+0x34/0x44 [ 16.972687] print_address_description.constprop.0+0x21/0x140 [ 16.973297] ? pcpu_freelist_populate+0xd9/0x100 [ 16.973777] ? pcpu_freelist_populate+0xd9/0x100 [ 16.974257] kasan_report.cold+0x7f/0x11b [ 16.974681] ? pcpu_freelist_populate+0xd9/0x100 [ 16.975190] pcpu_freelist_populate+0xd9/0x100 [ 16.975669] stack_map_alloc+0x209/0x2a0 [ 16.976106] __sys_bpf+0xd83/0x2ce0 [...] The possibility of this overflow was originally discussed in [0], but was overlooked. Fix the integer overflow by changing elem_size to u64 from u32. [0] https://lore.kernel.org/bpf/728b238e-a481-eb50-98e9-b0f430ab01e7@gmail.com/ Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210930135545.173698-1-th.yasumatsu@gmail.com
* bpf, mm: Fix lockdep warning triggered by stack_map_get_build_id_offset()Yonghong Song2021-09-101-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the bpf selftest "get_stack_raw_tp" triggered the warning: [ 1411.304463] WARNING: CPU: 3 PID: 140 at include/linux/mmap_lock.h:164 find_vma+0x47/0xa0 [ 1411.304469] Modules linked in: bpf_testmod(O) [last unloaded: bpf_testmod] [ 1411.304476] CPU: 3 PID: 140 Comm: systemd-journal Tainted: G W O 5.14.0+ #53 [ 1411.304479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1411.304481] RIP: 0010:find_vma+0x47/0xa0 [ 1411.304484] Code: de 48 89 ef e8 ba f5 fe ff 48 85 c0 74 2e 48 83 c4 08 5b 5d c3 48 8d bf 28 01 00 00 be ff ff ff ff e8 2d 9f d8 00 85 c0 75 d4 <0f> 0b 48 89 de 48 8 [ 1411.304487] RSP: 0018:ffffabd440403db8 EFLAGS: 00010246 [ 1411.304490] RAX: 0000000000000000 RBX: 00007f00ad80a0e0 RCX: 0000000000000000 [ 1411.304492] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e [ 1411.304494] RBP: ffff9cf5c2f50000 R08: ffff9cf5c3eb25d8 R09: 00000000fffffffe [ 1411.304496] R10: 0000000000000001 R11: 00000000ef974e19 R12: ffff9cf5c39ae0e0 [ 1411.304498] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9cf5c39ae0e0 [ 1411.304501] FS: 00007f00ae754780(0000) GS:ffff9cf5fba00000(0000) knlGS:0000000000000000 [ 1411.304504] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1411.304506] CR2: 000000003e34343c CR3: 0000000103a98005 CR4: 0000000000370ee0 [ 1411.304508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1411.304510] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1411.304512] Call Trace: [ 1411.304517] stack_map_get_build_id_offset+0x17c/0x260 [ 1411.304528] __bpf_get_stack+0x18f/0x230 [ 1411.304541] bpf_get_stack_raw_tp+0x5a/0x70 [ 1411.305752] RAX: 0000000000000000 RBX: 5541f689495641d7 RCX: 0000000000000000 [ 1411.305756] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e [ 1411.305758] RBP: ffff9cf5c02b2f40 R08: ffff9cf5ca7606c0 R09: ffffcbd43ee02c04 [ 1411.306978] bpf_prog_32007c34f7726d29_bpf_prog1+0xaf/0xd9c [ 1411.307861] R10: 0000000000000001 R11: 0000000000000044 R12: ffff9cf5c2ef60e0 [ 1411.307865] R13: 0000000000000005 R14: 0000000000000000 R15: ffff9cf5c2ef6108 [ 1411.309074] bpf_trace_run2+0x8f/0x1a0 [ 1411.309891] FS: 00007ff485141700(0000) GS:ffff9cf5fae00000(0000) knlGS:0000000000000000 [ 1411.309896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1411.311221] syscall_trace_enter.isra.20+0x161/0x1f0 [ 1411.311600] CR2: 00007ff48514d90e CR3: 0000000107114001 CR4: 0000000000370ef0 [ 1411.312291] do_syscall_64+0x15/0x80 [ 1411.312941] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1411.313803] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 1411.314223] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1411.315082] RIP: 0033:0x7f00ad80a0e0 [ 1411.315626] Call Trace: [ 1411.315632] stack_map_get_build_id_offset+0x17c/0x260 To reproduce, first build `test_progs` binary: make -C tools/testing/selftests/bpf -j60 and then run the binary at tools/testing/selftests/bpf directory: ./test_progs -t get_stack_raw_tp The warning is due to commit 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()") which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function asserts that mm->mmap_lock needs to be held. But this is not the case for bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held in bpf_get_stack[id]() use case, the above warning is emitted during test run. This patch fixed the issue by (1). using mmap_read_trylock() instead of mmap_read_trylock_non_owner() to satisfy lockdep checking in find_vma(), and (2). droping lockdep for mmap_lock right before the irq_work_queue(). The function mmap_read_trylock_non_owner() is also removed since after this patch nobody calls it any more. Fixes: 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()") Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Luigi Rizzo <lrizzo@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/bpf/20210909155000.1610299-1-yhs@fb.com
* bpf: Consolidate task_struct BTF_ID declarationsDaniel Xu2021-08-251-3/+1
| | | | | | | | No need to have it defined 5 times. Once is enough. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/6dcefa5bed26fe1226f26683f36819bb53ec19a2.1629772842.git.dxu@dxuuu.xyz
* bpf: Refcount task stack in bpf_get_task_stackDave Marchevsky2021-04-011-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On x86 the struct pt_regs * grabbed by task_pt_regs() points to an offset of task->stack. The pt_regs are later dereferenced in __bpf_get_stack (e.g. by user_mode() check). This can cause a fault if the task in question exits while bpf_get_task_stack is executing, as warned by task_stack_page's comment: * When accessing the stack of a non-current task that might exit, use * try_get_task_stack() instead. task_stack_page will return a pointer * that could get freed out from under you. Taking the comment's advice and using try_get_task_stack() and put_task_stack() to hold task->stack refcount, or bail early if it's already 0. Incrementing stack_refcount will ensure the task's stack sticks around while we're using its data. I noticed this bug while testing a bpf task iter similar to bpf_iter_task_stack in selftests, except mine grabbed user stack, and getting intermittent crashes, which resulted in dumps like: BUG: unable to handle page fault for address: 0000000000003fe0 \#PF: supervisor read access in kernel mode \#PF: error_code(0x0000) - not-present page RIP: 0010:__bpf_get_stack+0xd0/0x230 <snip...> Call Trace: bpf_prog_0a2be35c092cb190_get_task_stacks+0x5d/0x3ec bpf_iter_run_prog+0x24/0x81 __task_seq_show+0x58/0x80 bpf_seq_read+0xf7/0x3d0 vfs_read+0x91/0x140 ksys_read+0x59/0xd0 do_syscall_64+0x48/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()") Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210401000747.3648767-1-davemarchevsky@fb.com
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2021-02-101-0/+2
|\
| * bpf: Check for integer overflow when using roundup_pow_of_two()Bui Quang Minh2021-02-031-0/+2
| | | | | | | | | | | | | | | | | | | | On 32-bit architecture, roundup_pow_of_two() can return 0 when the argument has upper most bit set due to resulting 1UL << 32. Add a check for this case. Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE") Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210127063653.3576-1-minhquangbui99@gmail.com
* | bpf: Add size arg to build_id_parse functionJiri Olsa2021-01-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | It's possible to have other build id types (other than default SHA1). Currently there's also ld support for MD5 build id. Adding size argument to build_id_parse function, that returns (if defined) size of the parsed build id, so we can recognize the build id type. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210114134044.1418404-3-jolsa@kernel.org
* | bpf: Move stack_map_get_build_id into libJiri Olsa2021-01-141-139/+4
|/ | | | | | | | | | | | | | | Moving stack_map_get_build_id into lib with declaration in linux/buildid.h header: int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id); This function returns build id for given struct vm_area_struct. There is no functional change to stack_map_get_build_id function. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210114134044.1418404-2-jolsa@kernel.org
* Merge tag 'net-next-5.11' of ↵Linus Torvalds2020-12-151-13/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - support "prefer busy polling" NAPI operation mode, where we defer softirq for some time expecting applications to periodically busy poll - AF_XDP: improve efficiency by more batching and hindering the adjacency cache prefetcher - af_packet: make packet_fanout.arr size configurable up to 64K - tcp: optimize TCP zero copy receive in presence of partial or unaligned reads making zero copy a performance win for much smaller messages - XDP: add bulk APIs for returning / freeing frames - sched: support fragmenting IP packets as they come out of conntrack - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs BPF: - BPF switch from crude rlimit-based to memcg-based memory accounting - BPF type format information for kernel modules and related tracing enhancements - BPF implement task local storage for BPF LSM - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage Protocols: - mptcp: improve multiple xmit streams support, memory accounting and many smaller improvements - TLS: support CHACHA20-POLY1305 cipher - seg6: add support for SRv6 End.DT4/DT6 behavior - sctp: Implement RFC 6951: UDP Encapsulation of SCTP - ppp_generic: add ability to bridge channels directly - bridge: Connectivity Fault Management (CFM) support as is defined in IEEE 802.1Q section 12.14. Drivers: - mlx5: make use of the new auxiliary bus to organize the driver internals - mlx5: more accurate port TX timestamping support - mlxsw: - improve the efficiency of offloaded next hop updates by using the new nexthop object API - support blackhole nexthops - support IEEE 802.1ad (Q-in-Q) bridging - rtw88: major bluetooth co-existance improvements - iwlwifi: support new 6 GHz frequency band - ath11k: Fast Initial Link Setup (FILS) - mt7915: dual band concurrent (DBDC) support - net: ipa: add basic support for IPA v4.5 Refactor: - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior - phy: add support for shared interrupts; get rid of multiple driver APIs and have the drivers write a full IRQ handler, slight growth of driver code should be compensated by the simpler API which also allows shared IRQs - add common code for handling netdev per-cpu counters - move TX packet re-allocation from Ethernet switch tag drivers to a central place - improve efficiency and rename nla_strlcpy - number of W=1 warning cleanups as we now catch those in a patchwork build bot Old code removal: - wan: delete the DLCI / SDLA drivers - wimax: move to staging - wifi: remove old WDS wifi bridging support" * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits) net: hns3: fix expression that is currently always true net: fix proc_fs init handling in af_packet and tls nfc: pn533: convert comma to semicolon af_vsock: Assign the vsock transport considering the vsock address flags af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path vsock_addr: Check for supported flag values vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag vm_sockets: Add flags field in the vsock address data structure net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context nfc: s3fwrn5: Release the nfc firmware net: vxget: clean up sparse warnings mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3 mlxsw: spectrum_router_xm: Introduce basic XM cache flushing mlxsw: reg: Add Router LPM Cache Enable Register mlxsw: reg: Add Router LPM Cache ML Delete Register mlxsw: spectrum_router_xm: Implement L-value tracking for M-index mlxsw: reg: Add XM Router M Table Register ...
| * bpf: Eliminate rlimit-based memory accounting for stackmap mapsRoman Gushchin2020-12-021-13/+3
| | | | | | | | | | | | | | | | | | | | Do not use rlimit-based memory accounting for stackmap maps. It has been replaced with the memcg-based memory accounting. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201201215900.3569844-30-guro@fb.com
* | irq_work: CleanupPeter Zijlstra2020-11-241-1/+1
|/ | | | | | | | Get rid of the __call_single_node union and clean up the API a little to avoid external code relying on the structure layout as much. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
* bpf: Allow specifying a BTF ID per argument in function protosLorenz Bauer2020-09-211-3/+2
| | | | | | | | | | | | | | | | | | | | | Function prototypes using ARG_PTR_TO_BTF_ID currently use two ways to signal which BTF IDs are acceptable. First, bpf_func_proto.btf_id is an array of IDs, one for each argument. This array is only accessed up to the highest numbered argument that uses ARG_PTR_TO_BTF_ID and may therefore be less than five arguments long. It usually points at a BTF_ID_LIST. Second, check_btf_id is a function pointer that is called by the verifier if present. It gets the actual BTF ID of the register, and the argument number we're currently checking. It turns out that the only user check_arg_btf_id ignores the argument, and is simply used to check whether the BTF ID has a struct sock_common at it's start. Replace both of these mechanisms with an explicit BTF ID for each argument in a function proto. Thanks to btf_struct_ids_match this is very flexible: check_arg_btf_id can be replaced by requiring struct sock_common. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200921121227.255763-5-lmb@cloudflare.com
* bpf: Add map_meta_equal map opsMartin KaFai Lau2020-08-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some properties of the inner map is used in the verification time. When an inner map is inserted to an outer map at runtime, bpf_map_meta_equal() is currently used to ensure those properties of the inserting inner map stays the same as the verification time. In particular, the current bpf_map_meta_equal() checks max_entries which turns out to be too restrictive for most of the maps which do not use max_entries during the verification time. It limits the use case that wants to replace a smaller inner map with a larger inner map. There are some maps do use max_entries during verification though. For example, the map_gen_lookup in array_map_ops uses the max_entries to generate the inline lookup code. To accommodate differences between maps, the map_meta_equal is added to bpf_map_ops. Each map-type can decide what to check when its map is used as an inner map during runtime. Also, some map types cannot be used as an inner map and they are currently black listed in bpf_map_meta_alloc() in map_in_map.c. It is not unusual that the new map types may not aware that such blacklist exists. This patch enforces an explicit opt-in and only allows a map to be used as an inner map if it has implemented the map_meta_equal ops. It is based on the discussion in [1]. All maps that support inner map has its map_meta_equal points to bpf_map_meta_equal in this patch. A later patch will relax the max_entries check for most maps. bpf_types.h counts 28 map types. This patch adds 23 ".map_meta_equal" by using coccinelle. -5 for BPF_MAP_TYPE_PROG_ARRAY BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE BPF_MAP_TYPE_STRUCT_OPS BPF_MAP_TYPE_ARRAY_OF_MAPS BPF_MAP_TYPE_HASH_OF_MAPS The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc() is moved such that the same error is returned. [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
* bpf: Iterate through all PT_NOTE sections when looking for build idJiri Olsa2020-08-121-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when we look for build id within bpf_get_stackid helper call, we check the first NOTE section and we fail if build id is not there. However on some system (Fedora) there can be multiple NOTE sections in binaries and build id data is not always the first one, like: $ readelf -a /usr/bin/ls ... [ 2] .note.gnu.propert NOTE 0000000000000338 00000338 0000000000000020 0000000000000000 A 0 0 8358 [ 3] .note.gnu.build-i NOTE 0000000000000358 00000358 0000000000000024 0000000000000000 A 0 0 437c [ 4] .note.ABI-tag NOTE 000000000000037c 0000037c ... So the stack_map_get_build_id function will fail on build id retrieval and fallback to BPF_STACK_BUILD_ID_IP. This patch is changing the stack_map_get_build_id code to iterate through all the NOTE sections and try to get build id data from each of them. When tracing on sched_switch tracepoint that does bpf_get_stackid helper call kernel build, I can see about 60% increase of successful build id retrieval. The rest seems fails on -EFAULT. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200812123102.20032-1-jolsa@kernel.org
* bpf: Fix build on architectures with special bpf_user_pt_regs_tSong Liu2020-07-251-5/+4
| | | | | | | | | | | | Architectures like s390, powerpc, arm64, riscv have speical definition of bpf_user_pt_regs_t. So we need to cast the pointer before passing it to bpf_get_stack(). This is similar to bpf_get_stack_tp(). Fixes: 03d42fd2d83f ("bpf: Separate bpf_get_[stack|stackid] for perf events BPF") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200724200503.3629591-1-songliubraving@fb.com
* bpf: Separate bpf_get_[stack|stackid] for perf events BPFSong Liu2020-07-251-18/+166
| | | | | | | | | | | | | | | Calling get_perf_callchain() on perf_events from PEBS entries may cause unwinder errors. To fix this issue, the callchain is fetched early. Such perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY. Similarly, calling bpf_get_[stack|stackid] on perf_events from PEBS may also cause unwinder errors. To fix this, add separate version of these two helpers, bpf_get_[stack|stackid]_pe. These two hepers use callchain in bpf_perf_event_data_kern->data->callchain. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com
* bpf: Resolve BTF IDs in vmlinux imageJiri Olsa2020-07-131-1/+4
| | | | | | | | | | | | | | Using BTF_ID_LIST macro to define lists for several helpers using BTF arguments. And running resolve_btfids on vmlinux elf object during linking, so the .BTF_ids section gets the IDs resolved. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200711215329.41165-5-jolsa@kernel.org
* bpf: Fix build without CONFIG_STACKTRACESong Liu2020-07-021-0/+4
| | | | | | | | | | | Without CONFIG_STACKTRACE stack_trace_save_tsk() is not defined. Let get_callchain_entry_for_task() to always return NULL in such cases. Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200703024537.79971-1-songliubraving@fb.com
* bpf: Introduce helper bpf_get_task_stack()Song Liu2020-07-011-4/+73
| | | | | | | | | | | | | | | | | | | Introduce helper bpf_get_task_stack(), which dumps stack trace of given task. This is different to bpf_get_stack(), which gets stack track of current task. One potential use case of bpf_get_task_stack() is to call it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file. bpf_get_task_stack() uses stack_trace_save_tsk() instead of get_perf_callchain() for kernel stack. The benefit of this choice is that stack_trace_save_tsk() doesn't require changes in arch/. The downside of using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the stack trace to unsigned long array. For 32-bit systems, we need to translate it to u64 array. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630062846.664389-3-songliubraving@fb.com
* bpf: Remove redundant synchronize_rcu.Alexei Starovoitov2020-07-011-3/+0
| | | | | | | | | | | | | | | bpf_free_used_maps() or close(map_fd) will trigger map_free callback. bpf_free_used_maps() is called after bpf prog is no longer executing: bpf_prog_put->call_rcu->bpf_prog_free->bpf_free_used_maps. Hence there is no need to call synchronize_rcu() to protect map elements. Note that hash_of_maps and array_of_maps update/delete inner maps via sys_bpf() that calls maybe_wait_bpf_programs() and synchronize_rcu(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/bpf/20200630043343.53195-2-alexei.starovoitov@gmail.com
* bpf: Set map_btf_{name, id} for all map typesAndrey Ignatov2020-06-221-0/+3
| | | | | | | | | | | Set map_btf_name and map_btf_id for all map types so that map fields can be accessed by bpf programs. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
* mmap locking API: add mmap_read_trylock_non_owner()Michel Lespinasse2020-06-091-12/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a couple APIs used by kernel/bpf/stackmap.c only: - mmap_read_trylock_non_owner() - mmap_read_unlock_non_owner() (may be called from a work queue). It's still not ideal that bpf/stackmap subverts the lock ownership in this way. Thanks to Peter Zijlstra for suggesting this API as the least-ugly way of addressing this in the short term. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-8-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mmap locking API: use coccinelle to convert mmap_sem rwsem call sitesMichel Lespinasse2020-06-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bpf: Implement CAP_BPFAlexei Starovoitov2020-05-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement permissions as stated in uapi/linux/capability.h In order to do that the verifier allow_ptr_leaks flag is split into four flags and they are set as: env->allow_ptr_leaks = bpf_allow_ptr_leaks(); env->bypass_spec_v1 = bpf_bypass_spec_v1(); env->bypass_spec_v4 = bpf_bypass_spec_v4(); env->bpf_capable = bpf_capable(); The first three currently equivalent to perfmon_capable(), since leaking kernel pointers and reading kernel memory via side channel attacks is roughly equivalent to reading kernel memory with cap_perfmon. 'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions, subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the verifier, run time mitigations in bpf array, and enables indirect variable access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code by the verifier. That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN will have speculative checks done by the verifier and other spectre mitigation applied. Such networking BPF program will not be able to leak kernel pointers and will not be able to access arbitrary kernel memory. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
* bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabledDavid Miller2020-02-241-3/+15
| | | | | | | | | | | | | | | | | | | | In a RT kernel down_read_trylock() cannot be used from NMI context and up_read_non_owner() is another problematic issue. So in such a configuration, simply elide the annotated stackmap and just report the raw IPs. In the longer term, it might be possible to provide a atomic friendly versions of the page cache traversal which will at least provide the info if the pages are resident and don't need to be paged in. [ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work callback and add a comment ] Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145644.708960317@linutronix.de
* Merge branch 'irq-core-for-linus' of ↵Linus Torvalds2019-12-031-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Ingo Molnar: "Most of the IRQ subsystem changes in this cycle were irq-chip driver updates: - Qualcomm PDC wakeup interrupt support - Layerscape external IRQ support - Broadcom bcm7038 PM and wakeup support - Ingenic driver cleanup and modernization - GICv3 ITS preparation for GICv4.1 updates - GICv4 fixes There's also the series from Frederic Weisbecker that fixes memory ordering bugs for the irq-work logic, whose primary fix is to turn work->irq_work.flags into an atomic variable and then convert the complex (and buggy) atomic_cmpxchg() loop in irq_work_claim() into a much simpler atomic_fetch_or() call. There are also various smaller cleanups" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) pinctrl/sdm845: Add PDC wakeup interrupt map for GPIOs pinctrl/msm: Setup GPIO chip in hierarchy irqchip/qcom-pdc: Add irqchip set/get state calls irqchip/qcom-pdc: Add irqdomain for wakeup capable GPIOs irqchip/qcom-pdc: Do not toggle IRQ_ENABLE during mask/unmask irqchip/qcom-pdc: Update max PDC interrupts of/irq: Document properties for wakeup interrupt parent genirq: Introduce irq_chip_get/set_parent_state calls irqdomain: Add bus token DOMAIN_BUS_WAKEUP genirq: Fix function documentation of __irq_alloc_descs() irq_work: Fix IRQ_WORK_BUSY bit clearing irqchip/ti-sci-inta: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...)) irq_work: Slightly simplify IRQ_WORK_PENDING clearing irq_work: Fix irq_work_claim() memory ordering irq_work: Convert flags to atomic_t irqchip: Ingenic: Add process for more than one irq at the same time. irqchip: ingenic: Alloc generic chips from IRQ domain irqchip: ingenic: Get virq number from IRQ domain irqchip: ingenic: Error out if IRQ domain creation failed irqchip: ingenic: Drop redundant irq_suspend / irq_resume functions ...
| * irq_work: Convert flags to atomic_tFrederic Weisbecker2019-11-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to convert flags to atomic_t in order to later fix an ordering issue on atomic_cmpxchg() failure. This will allow us to use atomic_fetch_or(). Also clarify the nature of those flags. [ mingo: Converted two more usage site the original patch missed. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191108160858.31665-2-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'locking-core-for-linus' of ↵Linus Torvalds2019-11-261-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - A comprehensive rewrite of the robust/PI futex code's exit handling to fix various exit races. (Thomas Gleixner et al) - Rework the generic REFCOUNT_FULL implementation using atomic_fetch_* operations so that the performance impact of the cmpxchg() loops is mitigated for common refcount operations. With these performance improvements the generic implementation of refcount_t should be good enough for everybody - and this got confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely, leaving the generic implementation enabled unconditionally. (Will Deacon) - Other misc changes, fixes, cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) lkdtm: Remove references to CONFIG_REFCOUNT_FULL locking/refcount: Remove unused 'refcount_error_report()' function locking/refcount: Consolidate implementations of refcount_t locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions locking/refcount: Move saturation warnings out of line locking/refcount: Improve performance of generic REFCOUNT_FULL code locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header locking/refcount: Remove unused refcount_*_checked() variants locking/refcount: Ensure integer operands are treated as signed locking/refcount: Define constants for saturation and max refcount values futex: Prevent exit livelock futex: Provide distinct return value when owner is exiting futex: Add mutex around futex exit futex: Provide state handling for exec() as well futex: Sanitize exit state handling futex: Mark the begin of futex exit explicitly futex: Set task::futex_state to DEAD right after handling futex exit futex: Split futex_mm_release() for exit/exec exit/exec: Seperate mm_release() futex: Replace PF_EXITPIDONE with a state ...
| * | locking/lockdep: Remove unused @nested argument from lock_release()Qian Cai2019-10-091-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the following commit: b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release") @nested is no longer used in lock_release(), so remove it from all lock_release() calls and friends. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: airlied@linux.ie Cc: akpm@linux-foundation.org Cc: alexander.levin@microsoft.com Cc: daniel@iogearbox.net Cc: davem@davemloft.net Cc: dri-devel@lists.freedesktop.org Cc: duyuyang@gmail.com Cc: gregkh@linuxfoundation.org Cc: hannes@cmpxchg.org Cc: intel-gfx@lists.freedesktop.org Cc: jack@suse.com Cc: jlbec@evilplan.or Cc: joonas.lahtinen@linux.intel.com Cc: joseph.qi@linux.alibaba.com Cc: jslaby@suse.com Cc: juri.lelli@redhat.com Cc: maarten.lankhorst@linux.intel.com Cc: mark@fasheh.com Cc: mhocko@kernel.org Cc: mripard@kernel.org Cc: ocfs2-devel@oss.oracle.com Cc: rodrigo.vivi@intel.com Cc: sean@poorly.run Cc: st@kernel.org Cc: tj@kernel.org Cc: tytso@mit.edu Cc: vdavydov.dev@gmail.com Cc: vincent.guittot@linaro.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()Song Liu2019-10-161-3/+4
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A deadlock on rq_lock(): rcu: INFO: rcu_sched detected stalls on CPUs/tasks: [...] Call Trace: try_to_wake_up+0x1ad/0x590 wake_up_q+0x54/0x80 rwsem_wake+0x8a/0xb0 bpf_get_stack+0x13c/0x150 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000 bpf_overflow_handler+0x60/0x100 __perf_event_overflow+0x4f/0xf0 perf_swevent_overflow+0x99/0xc0 ___perf_sw_event+0xe7/0x120 __schedule+0x47d/0x620 schedule+0x29/0x90 futex_wait_queue_me+0xb9/0x110 futex_wait+0x139/0x230 do_futex+0x2ac/0xa50 __x64_sys_futex+0x13c/0x180 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This can be reproduced by: 1. Start a multi-thread program that does parallel mmap() and malloc(); 2. taskset the program to 2 CPUs; 3. Attach bpf program to trace_sched_switch and gather stackmap with build-id, e.g. with trace.py from bcc tools: trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch A sample reproducer is attached at the end. This could also trigger deadlock with other locks that are nested with rq_lock. Fix this by checking whether irqs are disabled. Since rq_lock and all other nested locks are irq safe, it is safe to do up_read() when irqs are not disable. If the irqs are disabled, postpone up_read() in irq_work. Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address") Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191014171223.357174-1-songliubraving@fb.com Reproducer: ============================ 8< ============================ char *filename; void *worker(void *p) { void *ptr; int fd; char *pptr; fd = open(filename, O_RDONLY); if (fd < 0) return NULL; while (1) { struct timespec ts = {0, 1000 + rand() % 2000}; ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0); usleep(1); if (ptr == MAP_FAILED) { printf("failed to mmap\n"); break; } munmap(ptr, 4096 * 64); usleep(1); pptr = malloc(1); usleep(1); pptr[0] = 1; usleep(1); free(pptr); usleep(1); nanosleep(&ts, NULL); } close(fd); return NULL; } int main(int argc, char *argv[]) { void *ptr; int i; pthread_t threads[THREAD_COUNT]; if (argc < 2) return 0; filename = argv[1]; for (i = 0; i < THREAD_COUNT; i++) { if (pthread_create(threads + i, NULL, worker, NULL)) { fprintf(stderr, "Error creating thread\n"); return 0; } } for (i = 0; i < THREAD_COUNT; i++) pthread_join(threads[i], NULL); return 0; } ============================ 8< ============================
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2019-06-071-4/+1
|\ | | | | | | | | | | | | Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>
| * treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206Thomas Gleixner2019-05-301-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of version 2 of the gnu general public license as published by the free software foundation extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 107 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Steve Winslow <swinslow@gmail.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190528171438.615055994@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | bpf: move memory size checks to bpf_map_charge_init()Roman Gushchin2019-05-311-7/+1
| | | | | | | | | | | | | | | | | | | | | | Most bpf map types doing similar checks and bytes to pages conversion during memory allocation and charging. Let's unify these checks by moving them into bpf_map_charge_init(). Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | bpf: rework memlock-based memory accounting for mapsRoman Gushchin2019-05-311-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to unify the existing memlock charging code with the memcg-based memory accounting, which will be added later, let's rework the current scheme. Currently the following design is used: 1) .alloc() callback optionally checks if the allocation will likely succeed using bpf_map_precharge_memlock() 2) .alloc() performs actual allocations 3) .alloc() callback calculates map cost and sets map.memory.pages 4) map_create() calls bpf_map_init_memlock() which sets map.memory.user and performs actual charging; in case of failure the map is destroyed <map is in use> 1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which performs uncharge and releases the user 2) .map_free() callback releases the memory The scheme can be simplified and made more robust: 1) .alloc() calculates map cost and calls bpf_map_charge_init() 2) bpf_map_charge_init() sets map.memory.user and performs actual charge 3) .alloc() performs actual allocations <map is in use> 1) .map_free() callback releases the memory 2) bpf_map_charge_finish() performs uncharge and releases the user The new scheme also allows to reuse bpf_map_charge_init()/finish() functions for memcg-based accounting. Because charges are performed before actual allocations and uncharges after freeing the memory, no bogus memory pressure can be created. In cases when the map structure is not available (e.g. it's not created yet, or is already destroyed), on-stack bpf_map_memory structure is used. The charge can be transferred with the bpf_map_charge_move() function. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* | bpf: group memory related fields in struct bpf_map_memoryRoman Gushchin2019-05-311-2/+2
|/ | | | | | | | | | | | | | Group "user" and "pages" fields of bpf_map into the bpf_map_memory structure. Later it can be extended with "memcg" and other related information. The main reason for a such change (beside cosmetics) is to pass bpf_map_memory structure to charging functions before the actual allocation of bpf_map. Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: fix lockdep false positive in stackmapAlexei Starovoitov2019-02-111-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Lockdep warns about false positive: [ 11.211460] ------------[ cut here ]------------ [ 11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0) [ 11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280 [ 11.213134] Modules linked in: [ 11.214954] RIP: 0010:lock_release+0x1ad/0x280 [ 11.223508] Call Trace: [ 11.223705] <IRQ> [ 11.223874] ? __local_bh_enable+0x7a/0x80 [ 11.224199] up_read+0x1c/0xa0 [ 11.224446] do_up_read+0x12/0x20 [ 11.224713] irq_work_run_list+0x43/0x70 [ 11.225030] irq_work_run+0x26/0x50 [ 11.225310] smp_irq_work_interrupt+0x57/0x1f0 [ 11.225662] irq_work_interrupt+0xf/0x20 since rw_semaphore is released in a different task vs task that locked the sema. It is expected behavior. Fix the warning with up_read_non_owner() and rwsem_release() annotation. Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: zero out build_id for BPF_STACK_BUILD_ID_IPStanislav Fomichev2019-01-171-0/+2
| | | | | | | | | | | | When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset, make sure that build_id field is empty. Since we are using percpu free list, there is a possibility that we might reuse some previous bpf_stack_build_id with non-zero build_id. Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address") Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: don't assume build-id length is always 20 bytesStanislav Fomichev2019-01-171-2/+5
| | | | | | | | | | | | | | | | | | | | | Build-id length is not fixed to 20, it can be (`man ld` /--build-id): * 128-bit (uuid) * 160-bit (sha1) * any length specified in ld --build-id=0xhexstring To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids, assume that build-id is somewhere in the range of 1 .. 20. Set the remaining bytes to zero. v2: * don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)", we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter this 'if' condition Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address") Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: fix panic in stack_map_get_build_id() on i386 and arm32Song Liu2019-01-101-1/+2
| | | | | | | | | | | | | | As Naresh reported, test_stacktrace_build_id() causes panic on i386 and arm32 systems. This is caused by page_address() returns NULL in certain cases. This patch fixes this error by using kmap_atomic/kunmap_atomic instead of page_address. Fixes: 615755a77b24 (" bpf: extend stackmap to save binary_build_id+offset instead of address") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: rename stack trace map operationsMauricio Vasquez B2018-10-191-1/+1
| | | | | | | | | | In the following patches queue and stack maps (FIFO and LIFO datastructures) will be implemented. In order to avoid confusion and a possible name clash rename stack_map_ops to stack_trace_map_ops Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: return EOPNOTSUPP when map lookup isn't supportedPrashant Bhole2018-10-091-1/+1
| | | | | | | | | | | | | | Return ERR_PTR(-EOPNOTSUPP) from map_lookup_elem() methods of below map types: - BPF_MAP_TYPE_PROG_ARRAY - BPF_MAP_TYPE_STACK_TRACE - BPF_MAP_TYPE_XSKMAP - BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: decouple btf from seq bpf fs dump and enable more mapsDaniel Borkmann2018-08-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | Commit a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty print for hash/lru_hash maps") enabled support for BTF and dumping via BPF fs for array and hash/lru map. However, both can be decoupled from each other such that regular BPF maps can be supported for attaching BTF key/value information, while not all maps necessarily need to dump via map_seq_show_elem() callback. The basic sanity check which is a prerequisite for all maps is that key/value size has to match in any case, and some maps can have extra checks via map_check_btf() callback, e.g. probing certain types or indicating no support in general. With that we can also enable retrieving BTF info for per-cpu map types and lpm. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com>
* bpf: avoid -Wmaybe-uninitialized warningArnd Bergmann2018-05-281-4/+3
| | | | | | | | | | | | | | | | | The stack_map_get_build_id_offset() function is too long for gcc to track whether 'work' may or may not be initialized at the end of it, leading to a false-positive warning: kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset': kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized] This removes the 'in_nmi_ctx' flag and uses the state of that variable itself to see if it got initialized. Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: enable stackmap with build_id in nmi contextSong Liu2018-05-141-6/+53
| | | | | | | | | | | | | | | | Currently, we cannot parse build_id in nmi context because of up_read(&current->mm->mmap_sem), this makes stackmap with build_id less useful. This patch enables parsing build_id in nmi by putting the up_read() call in irq_work. To avoid memory allocation in nmi context, we use per cpu variable for the irq_work. As a result, only one irq_work per cpu is allowed. If the irq_work is in-use, we fallback to only report ips. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: add bpf_get_stack helperYonghong Song2018-04-291-0/+67
| | | | | | | | | | | | | | | | | | Currently, stackmap and bpf_get_stackid helper are provided for bpf program to get the stack trace. This approach has a limitation though. If two stack traces have the same hash, only one will get stored in the stackmap table, so some stack traces are missing from user perspective. This patch implements a new helper, bpf_get_stack, will send stack traces directly to bpf program. The bpf program is able to see all stack traces, and then can do in-kernel processing or send stack traces to user space through shared map or bpf_perf_event_output. Acked-by: Alexei Starovoitov <ast@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: change prototype for stack_map_get_build_id_offsetYonghong Song2018-04-291-8/+5
| | | | | | | | This patch didn't incur functionality change. The function prototype got changed so that the same function can be reused later. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: extend stackmap to save binary_build_id+offset instead of addressSong Liu2018-03-151-22/+235
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, bpf stackmap store address for each entry in the call trace. To map these addresses to user space files, it is necessary to maintain the mapping from these virtual address to symbols in the binary. Usually, the user space profiler (such as perf) has to scan /proc/pid/maps at the beginning of profiling, and monitor mmap2() calls afterwards. Given the cost of maintaining the address map, this solution is not practical for system wide profiling that is always on. This patch tries to solve this problem with a variation of stackmap. This variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing addresses, the variation stores ELF file build_id + offset. Build ID is a 20-byte unique identifier for ELF files. The following command shows the Build ID of /bin/bash: [user@]$ readelf -n /bin/bash ... Build ID: XXXXXXXXXX ... With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID for each entry in the call trace, and translate it into the following struct: struct bpf_stack_build_id_offset { __s32 status; unsigned char build_id[BPF_BUILD_ID_SIZE]; union { __u64 offset; __u64 ip; }; }; The search of build_id is limited to the first page of the file, and this page should be in page cache. Otherwise, we fallback to store ip for this entry (ip field in struct bpf_stack_build_id_offset). This requires the build_id to be stored in the first page. A quick survey of binary and dynamic library files in a few different systems shows that almost all binary and dynamic library files have build_id in the first page. Build_id is only meaningful for user stack. If a kernel stack is added to a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id lookup failed for some reason, it will also fallback to store ip. User space can access struct bpf_stack_build_id_offset with bpf syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to maintain mapping from build id to binary files. This mostly static mapping is much easier to maintain than per process address maps. Note: Stackmap with build_id only works in non-nmi context at this time. This is because we need to take mm->mmap_sem for find_vma(). If this changes, we would like to allow build_id lookup in nmi context. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: add helper for copying attrs to struct bpf_mapJakub Kicinski2018-01-141-5/+1
| | | | | | | | | | All map types reimplement the field-by-field copy of union bpf_attr members into struct bpf_map. Add a helper to perform this operation. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace mapYonghong Song2018-01-061-2/+26
| | | | | | | | | | | | | | | | | | | | Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not supported for stacktrace map. However, there are use cases where user space wants to enumerate all stacktrace map entries where BPF_MAP_GET_NEXT_KEY command will be really helpful. In addition, if user space wants to delete all map entries in order to save memory and does not want to close the map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve performance if map entries are sparsely populated. The implementation has similar behavior for BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides a NULL key pointer or an invalid key, the first key is returned. Otherwise, the first valid key after the input parameter "key" is returned, or -ENOENT if no valid key can be found. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpf: Add file mode configuration into bpf mapsChenbo Feng2017-10-201-1/+4
| | | | | | | | | | | | | | | Introduce the map read/write flags to the eBPF syscalls that returns the map fd. The flags is used to set up the file mode when construct a new file descriptor for bpf maps. To not break the backward capability, the f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise it should be O_RDONLY or O_WRONLY. When the userspace want to modify or read the map content, it will check the file mode to see if it is allowed to make the change. Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>