Demonstrations of stackcount, the Linux eBPF/bcc version.
This program traces functions and frequency counts them with their entire
stack trace, summarized in-kernel for efficiency. For example, counting
stack traces that led to the submit_bio() kernel function, which creates
block device I/O:
# ./stackcount submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
^C
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_release_file
__fput
____fput
task_work_run
exit_to_usermode_loop
syscall_return_slowpath
entry_SYSCALL_64_fastpath
[unknown]
[unknown]
tar [15069]
5
submit_bio
ext4_bio_write_page
mpage_submit_page
mpage_map_and_submit_buffers
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_release_file
__fput
____fput
task_work_run
exit_to_usermode_loop
syscall_return_slowpath
entry_SYSCALL_64_fastpath
[unknown]
[unknown]
tar [15069]
15
submit_bio
ext4_readpages
__do_page_cache_readahead
ondemand_readahead
page_cache_async_readahead
generic_file_read_iter
__vfs_read
vfs_read
sys_read
entry_SYSCALL_64_fastpath
[unknown]
tar [15069]
113
Detaching...
The output shows unique stack traces, in order from leaf (on-CPU) to root,
followed by their occurrence count. The last stack trace in the above output
shows syscall handling, sys_read(), vfs_read(), and then "readahead" functions:
looks like an application issued file read has triggered read ahead. The
application can be seen after the stack trace, in this case, "tar [15069]"
for the "tar" command, PID 15069.
The order of printed stack traces is from least to most frequent. The most
frequent in this case, the ext4_rename() stack, was taken 113 times during
tracing.
The "[unknown]" frames are from user-level, since this simple workload is
the tar command, which apparently has been compiled without frame pointers.
It's a common compiler optimization, but it breaks frame pointer-based stack
walkers. Similar broken stacks will be seen by other profilers and debuggers
that use frame pointers. Hopefully your application preserves them so that
the user-level stack trace is visible. So how does one get frame pointers, if
your application doesn't have them to start with? For the current bcc (until
it supports other stack walkers), you need to be running a application binaries
that preserves frame pointers, eg, using gcc's -fno-omit-frame-pointer. That's
about all I'll say here: this is a big topic that is not bcc/BPF specific.
It can be useful to trace the path to submit_bio to explain unusual rates of
disk IOPS. These could have in-kernel origins (eg, background scrub).
Now adding the -d option to delimit kernel and user stacks:
# ./stackcount -d submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
^C
submit_bio
submit_bh
journal_submit_commit_record
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
--
jbd2/xvda1-8 [405]
1
submit_bio
submit_bh
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
--
jbd2/xvda1-8 [405]
2
submit_bio
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_release_file
__fput
____fput
task_work_run
exit_to_usermode_loop
syscall_return_slowpath
entry_SYSCALL_64_fastpath
--
[unknown]
[unknown]
tar [15187]
5
submit_bio
ext4_bio_write_page
mpage_submit_page
mpage_map_and_submit_buffers
ext4_writepages
do_writepages
__filemap_fdatawrite_range
filemap_flush
ext4_alloc_da_blocks
ext4_release_file
__fput
____fput
task_work_run
exit_to_usermode_loop
syscall_return_slowpath
entry_SYSCALL_64_fastpath
--
[unknown]
[unknown]
tar [15187]
15
submit_bio
ext4_readpages
__do_page_cache_readahead
ondemand_readahead
page_cache_async_readahead
generic_file_read_iter
__vfs_read
vfs_read
sys_read
entry_SYSCALL_64_fastpath
--
[unknown]
[unknown]
[unknown]
tar [15187]
171
Detaching...
A "--" is printed between the kernel and user stacks.
As a different example, here is the kernel function hrtimer_init_sleeper():
# ./stackcount.py -d hrtimer_init_sleeper
Tracing 1 functions for "hrtimer_init_sleeper"... Hit Ctrl-C to end.
^C
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
[unknown]
containerd [16020]
1
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
__pthread_cond_timedwait
Monitor::IWait(Thread*, long)
Monitor::wait(bool, long, bool)
CompileQueue::get()
CompileBroker::compiler_thread_loop()
JavaThread::thread_main_inner()
JavaThread::run()
java_start(Thread*)
start_thread
java [4996]
1
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
[unknown]
[unknown]
containerd [16020]
1
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
__pthread_cond_timedwait
VMThread::loop()
VMThread::run()
java_start(Thread*)
start_thread
java [4996]
3
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
[unknown]
dockerd [16008]
4
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
[unknown]
[unknown]
dockerd [16008]
4
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
__pthread_cond_timedwait
Lio/netty/util/ThreadDeathWatcher$Watcher;::run
Interpreter
Interpreter
call_stub
JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)
JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)
JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)
thread_entry(JavaThread*, Thread*)
JavaThread::thread_main_inner()
JavaThread::run()
java_start(Thread*)
start_thread
java [4996]
4
hrtimer_init_sleeper
do_futex
SyS_futex
entry_SYSCALL_64_fastpath
--
__pthread_cond_timedwait
clock_gettime
[unknown]
java [4996]
79
Detaching...
I was just trying to find a more interesting example. This output includes
some Java stacks where user-level has been walked correctly (even includes a
JIT symbol translation). dockerd and containerd don't have frame pointers
(grumble), but Java does (which is running with -XX:+PreserveFramePointer).
Here's another kernel function, ip_output():
# ./stackcount.py -d ip_output
Tracing 1 functions for "ip_output"... Hit Ctrl-C to end.
^C
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_push
tcp_sendmsg
inet_sendmsg
sock_sendmsg
sock_write_iter
__vfs_write
vfs_write
SyS_write
entry_SYSCALL_64_fastpath
--
__write_nocancel
[unknown]
sshd [15015]
5
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_push
tcp_sendmsg
inet_sendmsg
sock_sendmsg
sock_write_iter
__vfs_write
vfs_write
SyS_write
entry_SYSCALL_64_fastpath
--
__write_nocancel
[unknown]
[unknown]
sshd [8234]
5
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_push
tcp_sendmsg
inet_sendmsg
sock_sendmsg
sock_write_iter
__vfs_write
vfs_write
SyS_write
entry_SYSCALL_64_fastpath
--
__write_nocancel
sshd [15015]
7
Detaching...
This time just sshd is triggering ip_output() calls.
Watch what happens if I filter on kernel stacks only (-K) for ip_output():
# ./stackcount.py -K ip_output
Tracing 1 functions for "ip_output"... Hit Ctrl-C to end.
^C
ip_output
ip_queue_xmit
tcp_transmit_skb
tcp_write_xmit
__tcp_push_pending_frames
tcp_push
tcp_sendmsg
inet_sendmsg
sock_sendmsg
sock_write_iter
__vfs_write
vfs_write
SyS_write
entry_SYSCALL_64_fastpath
13
Detaching...
They have grouped together as a single unique stack, since the kernel part
was the same.
Here is just the user stacks, fetched during the kernel function ip_output():
# ./stackcount.py -U ip_output
Tracing 1 functions for "ip_output"... Hit Ctrl-C to end.
^C
[unknown]
snmpd [1645]
1
__write_nocancel
[unknown]
[unknown]
sshd [8234]
3
__write_nocancel
sshd [15015]
4
I should really run a custom sshd with frame pointers so we can see its
stack trace...
User-space functions can also be traced if a library name is provided. For
example, to quickly identify code locations that allocate heap memory for
PID 4902 (using -p), by tracing malloc from libc ("c:malloc"):
# ./stackcount -p 4902 c:malloc
Tracing 1 functions for "malloc"... Hit Ctrl-C to end.
^C
malloc
rbtree_new
main
[unknown]
12
malloc
_rbtree_node_new_internal
_rbtree_node_insert
rbtree_insert
main
[unknown]
1189
Detaching...
Kernel stacks are absent as this didn't enter kernel code.
Note that user-space uses of stackcount can be somewhat more limited because
a lot of user-space libraries and binaries are compiled without frame-pointers
as discussed earlier (the -fomit-frame-pointer compiler default) or are used
without debuginfo.
In addition to kernel and user-space functions, kernel tracepoints and USDT
tracepoints are also supported.
For example, to determine where threads are being created in a particular
process, use the pthread_create USDT tracepoint:
# ./stackcount -p $(pidof parprimes) u:pthread:pthread_create
Tracing 1 functions for "u:pthread:pthread_create"... Hit Ctrl-C to end.
^C
parprimes [11923]
pthread_create@@GLIBC_2.2.5
main
__libc_start_main
[unknown]
7
You can use "readelf -n file" to see if it has USDT tracepoints.
Similarly, to determine where context switching is happening in the kernel,
use the sched:sched_switch kernel tracepoint:
# ./stackcount t:sched:sched_switch
__schedule
schedule
worker_thread
kthread
ret_from_fork
kworker/0:2 [25482]
1
__schedule
schedule
schedule_hrtimeout_range_clock
schedule_hrtimeout_range
ep_poll
SyS_epoll_wait
entry_SYSCALL_64_fastpath
epoll_wait
Lsun/nio/ch/SelectorImpl;::lockAndDoSelect
Lsun/nio/ch/SelectorImpl;::select
Lio/netty/channel/nio/NioEventLoop;::select
Lio/netty/channel/nio/NioEventLoop;::run
Interpreter
Interpreter
call_stub
JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*)
JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*)
JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*)
thread_entry(JavaThread*, Thread*)
JavaThread::thread_main_inner()
JavaThread::run()
java_start(Thread*)
start_thread
java [4996]
1
... (omitted for brevity)
__schedule
schedule
schedule_preempt_disabled
cpu_startup_entry
xen_play_dead
arch_cpu_idle_dead
cpu_startup_entry
cpu_bringup_and_idle
swapper/1 [0]
289
A -i option can be used to set an output interval, and -T to include a
timestamp. For example:
# ./stackcount.py -Tdi 1 submit_bio
Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end.
06:05:13
06:05:14
submit_bio
xfs_do_writepage
write_cache_pages
xfs_vm_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
--
kworker/u16:1 [15686]
1
submit_bio
process_one_work
worker_thread
kthread
ret_from_fork
--
kworker/u16:0 [16007]
1
submit_bio
xfs_buf_submit
xlog_bdstrat
xlog_sync
xlog_state_release_iclog
_xfs_log_force
xfs_log_force
xfs_fs_sync_fs
sync_fs_one_sb
iterate_supers
sys_sync
entry_SYSCALL_64_fastpath
--
[unknown]
sync [16039]
1
submit_bio
submit_bh
journal_submit_commit_record
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
--
jbd2/xvda1-8 [405]
1
submit_bio
process_one_work
worker_thread
kthread
ret_from_fork
--
kworker/0:2 [25482]
2
submit_bio
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
--
kworker/u16:0 [16007]
4
submit_bio
xfs_vm_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
--
kworker/u16:1 [15686]
5
submit_bio
__block_write_full_page
block_write_full_page
blkdev_writepage
__writepage
write_cache_pages
generic_writepages
blkdev_writepages
do_writepages
__filemap_fdatawrite_range
filemap_fdatawrite
fdatawrite_one_bdev
iterate_bdevs
sys_sync
entry_SYSCALL_64_fastpath
--
[unknown]
sync [16039]
7
submit_bio
submit_bh
jbd2_journal_commit_transaction
kjournald2
kthread
ret_from_fork
--
jbd2/xvda1-8 [405]
8
submit_bio
ext4_bio_write_page
mpage_submit_page
mpage_map_and_submit_buffers
ext4_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
--
kworker/u16:0 [16007]
8
submit_bio
__block_write_full_page
block_write_full_page
blkdev_writepage
__writepage
write_cache_pages
generic_writepages
blkdev_writepages
do_writepages
__writeback_single_inode
writeback_sb_inodes
__writeback_inodes_wb
wb_writeback
wb_workfn
process_one_work
worker_thread
kthread
ret_from_fork
--
kworker/u16:0 [16007]
60
06:05:15
06:05:16
Detaching...
This only included output for the 06:05:14 interval. The other internals
did not span block device I/O.
The -s output prints the return instruction offset for each function (aka
symbol offset). Eg:
# ./stackcount.py -s tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
^C
tcp_sendmsg+0x1
sock_sendmsg+0x38
sock_write_iter+0x85
__vfs_write+0xe3
vfs_write+0xb8
SyS_write+0x55
entry_SYSCALL_64_fastpath+0x1e
__write_nocancel+0x7
sshd [15015]
3
tcp_sendmsg+0x1
sock_sendmsg+0x38
sock_write_iter+0x85
__vfs_write+0xe3
vfs_write+0xb8
SyS_write+0x55
entry_SYSCALL_64_fastpath+0x1e
__write_nocancel+0x7
sshd [8234]
3
Detaching...
If it wasn't clear how one function called another, knowing the instruction
offset can help you locate the lines of code from a disassembly dump.
The -v output is verbose, and shows raw addresses:
./stackcount.py -v tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
^C
ffffffff817b05c1 tcp_sendmsg
ffffffff8173ea48 sock_sendmsg
ffffffff8173eae5 sock_write_iter
ffffffff81232b33 __vfs_write
ffffffff812331b8 vfs_write
ffffffff81234625 SyS_write
ffffffff818739bb entry_SYSCALL_64_fastpath
7f120511e6e0 __write_nocancel
sshd [8234]
3
ffffffff817b05c1 tcp_sendmsg
ffffffff8173ea48 sock_sendmsg
ffffffff8173eae5 sock_write_iter
ffffffff81232b33 __vfs_write
ffffffff812331b8 vfs_write
ffffffff81234625 SyS_write
ffffffff818739bb entry_SYSCALL_64_fastpath
7f919c5a26e0 __write_nocancel
sshd [15015]
11
Detaching...
A wildcard can also be used. Eg, all functions beginning with "tcp_send",
kernel stacks only (-K) with offsets (-s):
# ./stackcount -Ks 'tcp_send*'
Tracing 14 functions for "tcp_send*"... Hit Ctrl-C to end.
^C
tcp_send_delayed_ack0x1
tcp_rcv_established0x3b1
tcp_v4_do_rcv0x130
tcp_v4_rcv0x8e0
ip_local_deliver_finish0x9f
ip_local_deliver0x51
ip_rcv_finish0x8a
ip_rcv0x29d
__netif_receive_skb_core0x637
__netif_receive_skb0x18
netif_receive_skb_internal0x23
1
tcp_send_delayed_ack0x1
tcp_rcv_established0x222
tcp_v4_do_rcv0x130
tcp_v4_rcv0x8e0
ip_local_deliver_finish0x9f
ip_local_deliver0x51
ip_rcv_finish0x8a
ip_rcv0x29d
__netif_receive_skb_core0x637
__netif_receive_skb0x18
netif_receive_skb_internal0x23
4
tcp_send_mss0x1
inet_sendmsg0x67
sock_sendmsg0x38
sock_write_iter0x78
__vfs_write0xaa
vfs_write0xa9
sys_write0x46
entry_SYSCALL_64_fastpath0x16
7
tcp_sendmsg0x1
sock_sendmsg0x38
sock_write_iter0x78
__vfs_write0xaa
vfs_write0xa9
sys_write0x46
entry_SYSCALL_64_fastpath0x16
7
Detaching...
Use -r to allow regular expressions.
The -f option will emit folded output, which can be used as input to other
tools including flame graphs. For example, with delimiters as well:
# ./stackcount.py -df t:sched:sched_switch
^Csnmp-pass;[unknown];[unknown];[unknown];[unknown];[unknown];-;entry_SYSCALL_64_fastpath;SyS_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule 1
kworker/7:0;-;ret_from_fork;kthread;worker_thread;schedule;__schedule 1
watchdog/0;-;ret_from_fork;kthread;smpboot_thread_fn;schedule;__schedule 1
snmp-pass;[unknown];[unknown];[unknown];[unknown];[unknown];-;entry_SYSCALL_64_fastpath;SyS_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule 1
svscan;[unknown];-;entry_SYSCALL_64_fastpath;SyS_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;__schedule 1
python;[unknown];__select_nocancel;-;entry_SYSCALL_64_fastpath;SyS_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule 1
kworker/2:0;-;ret_from_fork;kthread;worker_thread;schedule;__schedule 1
[...]
Flame graphs visualize stack traces. For information about them and links
to open source software, see http://www.brendangregg.com/flamegraphs.html .
This folded output can be piped directly into flamegraph.pl (the Perl version).
USAGE message:
# ./stackcount -h
usage: stackcount [-h] [-p PID] [-i INTERVAL] [-D DURATION] [-T] [-r] [-s]
[-P] [-K] [-U] [-v] [-d] [-f] [--debug]
pattern
Count events and their stack traces
positional arguments:
pattern search expression for events
optional arguments:
-h, --help show this help message and exit
-p PID, --pid PID trace this PID only
-i INTERVAL, --interval INTERVAL
summary interval, seconds
-D DURATION, --duration DURATION
total duration of trace, seconds
-T, --timestamp include timestamp on output
-r, --regexp use regular expressions. Default is "*" wildcards
only.
-s, --offset show address offsets
-P, --perpid display stacks separately for each process
-K, --kernel-stacks-only
kernel stack only
-U, --user-stacks-only
user stack only
-v, --verbose show raw addresses
-d, --delimited insert delimiter between kernel/user stacks
-f, --folded output folded format
--debug print BPF program before starting (for debugging
purposes)
examples:
./stackcount submit_bio # count kernel stack traces for submit_bio
./stackcount -d ip_output # include a user/kernel stack delimiter
./stackcount -s ip_output # show symbol offsets
./stackcount -sv ip_output # show offsets and raw addresses (verbose)
./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send*
./stackcount -r '^tcp_send.*' # same as above, using regular expressions
./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps
./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only
./stackcount -p 185 c:malloc # count stacks for malloc in PID 185
./stackcount t:sched:sched_fork # count stacks for sched_fork tracepoint
./stackcount -p 185 u:node:* # count stacks for all USDT probes in node
./stackcount -K t:sched:sched_switch # kernel stacks only
./stackcount -U t:sched:sched_switch # user stacks only