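SystemTap: A Powerful Kernel Debugging Tool
A Linux trace/probe tool.
Introduction
SystemTap is the most powerful kernel debugging tool I know of; some people even claim it can do anything :)
(1) History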
Debuted in 2005 in Red Hat Enterprise Linux 4 Update 2 as a technology preview.
After four years in development, SystemTap 1.0 was released in 2009.
As of 2011 SystemTap runs fully supported in all Linux distributions.
(2) Official description
SystemTap provides free software (GPL) infrastructure to simplify the gathering of information about the
running Linux system. This assists diagnosis of a performance or functional problem. SystemTap eliminates
the need for the developer to go through the tedious and disruptive instrument, recompile, install, and reboot
sequence that may be otherwise required to collect data.
SystemTap provides a simple command line interface and scripting language for writing instrumentation for
a live running kernel plus user-space application. We are publishing samples, as well as enlarging the internal
"tapset" script library to aid reuse and abstraction.
Among other tracing/probing tools, SystemTap is the tool of choice for complex tasks that may require live analysis,
programmable on-line response, and whole-system symbolic access. SystemTap can also handle simple tracing
jobs.
Current project members include Red Hat, IBM, Hitachi, and Oracle.
(3) Getting the source code
git clone git://sourceware.org/git/systemtap.git
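Installation
(1) Ubuntu
1. Install the systemtap package
apt-get install systemtap
2. Install the dependencies
gcc: the C compiler
elfutils: library functions for analyzing debugging information
linux-headers-generic: kernel headers and module configuration needed to build kernel modules
3. Install the kernel debugging information (kernel-debuginfo)
kernel-debuginfo provides the symbol tables needed to debug the kernel. Without it SystemTap loses much of its
power and can only offer the kprobe family of features.
Download the package that matches your kernel version; mine is linux-image-3.11.0-12-generic-dbgsym_3.11.0-12.19_amd64.ddeb
Then install it: dpkg -i linux-image-3.11.0-12-generic-dbgsym_3.11.0-12.19_amd64.ddeb
4. Verify
stap -ve 'probe kernel.function("do_fork") { print("hello world\n") exit() }'
If no error is reported, the installation succeeded.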
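(2) CentOS/Red Hat
Install the following RPM packages with yum:
systemtap: the SystemTap package
gcc: the C compiler
elfutils: library functions for analyzing debugging information
kernel-devel: kernel headers and module configuration needed to build kernel modules
kernel-debuginfo: kernel debugging information used to locate kernel functions and variables
Usage
Some simple SystemTap examples.
(1) stap
Normally you just use stap to run a script written in the SystemTap language.
stap - systemtap script translator/driver
stap test.stp // files with the .stp suffix are scripts written in the SystemTap language
Main elements of a script: probe point + probe handler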
stap [options] FILE // Run script in file
stap [options] -e SCRIPT // Run given script.
stap [options] -l PROBE // List matching probes.
stap [options] -L PROBE // List matching probes and local variables.
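Common options
-h: help
-g: guru mode, required for embedded C code
-m: name of the generated module
-v: add verbosity to all passes
-k: keep the temporary directory
-p NUM: stop after pass NUM 1-5, instead of 5 (parse, elaborate, translate, compile, run)
-b: bulk (percpu file) mode; uses relayfs to transfer data from kernel space to user space
-o FILE: write output to the given file instead of stdout
-c CMD: start the probes, run CMD, and exit when it finishes
stap is the SystemTap front-end. It exits when one of the following happens: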
1. The user interrupts the script with a CTRL-C.
2. The script executes the exit() function.
3. The script encounters a sufficient number of soft errors.
4. The monitored command started with the stap program's -c option exits.
(2) staprun
If the input is not a .stp script but a module already built by stap, run it with staprun.
staprun - systemtap runtime
staprun [OPTIONS] MODULE [MODULE-OPTIONS]
What staprun does:
The staprun program is the back-end of the Systemtap tool. It expects a kernel module produced by
the front-end stap tool.
Splitting the systemtap tool into a front-end and a back-end allows a user to compile a systemtap script
on a development machine that has the kernel debugging information (needed to compile the script) and
then transfer the resulting kernel module to a production machine that doesn't have any development
tools or kernel debugging information installed.
staprun is a part of the SystemTap package, dedicated to module loading and unloading and kernel-to-user
data transfer.
Common options
-o FILE:Send output to FILE.
-D:Run in background. This requires '-o' option.
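(3) Monitoring a kernel function
A simple script that prints the calling process name, process ID, and function argument every time the kernel function do_fork() is called.
(do_fork() exists in the 3.x kernels used here; later kernels renamed it, so the probe point may need adjusting.)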
global proc_counter
probe begin {
    print("Started monitoring creation of new processes...Press ^C to terminate\n")
    printf("%-25s %-10s %-s\n", "Process Name", "Process ID", "Clone Flags")
}
probe kernel.function("do_fork") {
    proc_counter++
    printf("%-25s %-10d 0x%-x\n", execname(), pid(), $clone_flags)
}
probe end {
    printf("\n%d processes forked during the observed period\n", proc_counter)
}
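(4) Monitoring a system call
A simple script that shows information about open system calls for 4 seconds: the calling process name, process ID, and arguments.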
probe syscall.open
{
    printf("%s(%d) open(%s)\n", execname(), pid(), argstr)
}
probe timer.ms(4000) # after 4 seconds
{
    exit()
}
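(5) Monitoring the entry and exit of every function in a source file
The probe point description inside the parentheses has three parts:
function name part: the function name
@file name part: the file name
function line part: the line number
For example: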
probe kernel.function("*@net/socket.c") {}
probe kernel.function("*@net/socket.c").return {}
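Here the function name is a wildcard (*) and the file name is net/socket.c, so the entry and return of every function in that file are probed.
A line number can also be given with ":line".
(6) Listing matching kernel functions and variables
List kernel functions whose names contain "nit":
stap -l 'kernel.function("*nit*")'
List kernel functions whose names contain "nit", together with their variables:
stap -L 'kernel.function("*nit*")'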
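(7) Bundled example scripts
/root/systemtap/testsuite/systemtap.examples/ contains many example scripts.
They cover areas such as:
network, io, interrupt, locks, memory, process, virtualization, and so on
(8) Monitoring packets sent and received by every process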
global recv, xmit
probe begin {
    printf("Starting network capture...Press ^C to terminate\n")
}
probe netdev.receive {
    recv[dev_name, pid(), execname()] <<< length
}
probe netdev.transmit {
    xmit[dev_name, pid(), execname()] <<< length
}
probe end {
    printf("\nCapture terminated\n\n")
    printf("%-5s %-15s %-10s %-10s %-10s\n",
           "If", "Process", "Pid", "RcvPktCnt", "XmtPktCnt")
    foreach([dev, pid, name] in recv) {
        recvcnt = @count(recv[dev, pid, name])
        xmtcnt = @count(xmit[dev, pid, name])
        printf("%-5s %-15s %-10d %-10d %-10d\n", dev, name, pid, recvcnt, xmtcnt)
    }
}
(9) Systemtap usage stories and interesting demos
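The official website provides many more examples.
Probe points
A SystemTap script consists mainly of probe points and probe handlers. Let's look at which probe points are available.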
The essential idea behind a systemtap script is to name events, and to give them handlers.
Systemtap works by translating the script to C, running the system C compiler to create a kernel module from that.
When the module is loaded, it activates all the probed events by hooking into the kernel.
(1) where to probe
Built-in events (probe point syntax and semantics)
begin:The startup of the systemtap session.
end:The end of the systemtap session.
kernel.function("sys_open"):The entry to the function named sys_open in the kernel.
syscall.close.return:The return from the close system call.
module("ext3").statement(0xdeadbeef):The addressed instruction in the ext3 filesystem driver.
timer.ms(200):A timer that fires every 200 milliseconds.
timer.jiffies(200):A timer that fires every 200 jiffies.
timer.profile:A timer that fires periodically on every CPU.
perf.hw.cache_misses:A particular number of CPU cache misses have occurred.
procfs("status").read:A process trying to read a synthetic file.
process("a.out").statement("*@main.c:200"):Line 200 of the a.out program.
For more information, see the stapprobes manual page.
(2) what to print
Systemtap provides a variety of such contextual data, ready for formatting.
They usually appear as function calls within the handler.
tid():The id of the current thread.
pid():The process (task group) id of the current thread.
uid():The id of the current user.
execname():The name of the current process.
cpu():The current cpu number.
gettimeofday_s():Number of seconds since epoch.
get_cycles():Snapshot of hardware cycle counter.
pp():A string describing the probe point being currently handled.
probefunc():If known, the name of the function in which this probe was placed.
$$vars:If available, a pretty-printed listing of all local variables in scope.
print_backtrace():If possible, print a kernel backtrace.
print_ubacktrace():If possible, print a user-space backtrace.
$$parms: the function parameters
$$return: the function return value
thread_indent(): a very useful function from the tapset library. Its output consists of:
A timestamp (number of microseconds since the initial indentation for the thread)
A process name and the thread id itself.
For more information, see the stapfuncs manual page.
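As a rough illustration of the context functions listed above, the following sketch prints a few of them once per second for whatever task is running when the timer fires. The 1-second and 5-second intervals are arbitrary choices made for this example.

```systemtap
# Print who is running on the CPU where the timer fires.
probe timer.s(1) {
    printf("%d %s pid=%d tid=%d uid=%d cpu=%d\n",
           gettimeofday_s(), execname(), pid(), tid(), uid(), cpu())
}

# Stop the session after five seconds.
probe timer.s(5) {
    exit()
}
```

Because no kernel function is named, this script should not depend on a particular kernel version.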
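(3) Built-in probe point types (DWARF probes)
Built-in probe points, available once debuginfo is installed.
This family of probe points uses symbolic debugging information for the target kernel or module,
as may be found in executables that have not been stripped, or in the separate debuginfo packages.
Currently supported built-in probe point types:
kernel.function(PATTERN) // probe at function entry; parameters are available as $PARM
kernel.function(PATTERN).return // probe at function return; the return value is available as $return, along with the (possibly modified) parameters $PARM
kernel.function(PATTERN).call // the complement of .inline: selects only non-inlined functions
kernel.function(PATTERN).inline // selects only matching inlined functions; inlined functions cannot use .return
kernel.function(PATTERN).exported // selects only exported functions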
module(MPATTERN).function(PATTERN)
module(MPATTERN).function(PATTERN).return
module(MPATTERN).function(PATTERN).call
module(MPATTERN).function(PATTERN).inline
kernel.statement(PATTERN)
kernel.statement(ADDRESS).absolute
module(MPATTERN).statement(PATTERN)
Examples:
# Refers to all kernel functions with "init" or "exit" in the name
kernel.function("*init*"), kernel.function("*exit*")
# Refers to any functions within the "kernel/time.c" file that span line 240
kernel.function("*@kernel/time.c:240")
# Refers to all functions in the ext3 module
module("ext3").function("*")
# Refers to the statement at line 296 within the kernel/time.c file
kernel.statement("*@kernel/time.c:296")
# Refers to the statement at line bio_init+3 within the fs/bio.c file
kernel.statement("bio_init@fs/bio.c+3")
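Some source-level variables visible in the compilation unit, such as function parameters, local variables, or global variables, are also visible in the probe handler.
In a script, refer to them with $ followed by the variable name.
There are two styles of variable reference, $name and @var("name"):
$varname // refers to the variable varname
$var->field // refers to a member of a structure
$var[N] // refers to an array element
&$var // the address of the variable
@var("varname") // refers to the variable varname
@var("varname@src/file.c") // refers to the global variable varname as defined when src/file.c was compiled
@var("varname@src/file.c")->field // refers to a member of a structure
@var("varname@src/file.c")[N] // refers to an array element
&@var("varname@src/file.c") // the address of the variable
$var$ // provide a string that includes the values of basic type values
$var$$ // provide a string that includes all values of nested data types
$$vars // a string containing all function parameters and local variables
$$locals // a string containing all local variables
$$parms // a string containing all function parameters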
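To make the target-variable syntax more concrete, here is a minimal sketch. It assumes kernel debuginfo is installed and that vfs_read() in the running kernel still has parameters named file and count; both are assumptions about the target kernel, not something the text above guarantees.

```systemtap
# Assumption: vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
probe kernel.function("vfs_read") {
    # $count is a scalar parameter, $file->f_flags follows a pointer to a struct member
    printf("%s: count=%d f_flags=0x%x\n", execname(), $count, $file->f_flags)
    # $$parms pretty-prints every parameter in one string
    printf("  all parameters: %s\n", $$parms)
    exit()   # one sample is enough for a demonstration
}
```

If a variable name does not exist in your kernel, stap -L 'kernel.function("vfs_read")' lists the names that are actually available.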
(4) DWARF-less probing
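Without debuginfo installed, the DWARF-based built-in probe points cannot be used.
In the absence of debugging information, you can still use the kprobe family of probes to examine the
entry and exit points of kernel and module functions. You cannot lookup the arguments or local variables
of a function using these probes.
In other words, the "$" plus variable name syntax is not available for these probes.
SystemTap still offers a way to access the arguments:
while the probed function is stopped at its entry point, its arguments can be referred to by number.
For example, suppose the probed function is declared as follows:
ssize_t sys_read(unsigned int fd, char __user *buf, size_t count)
The values of fd, buf, and count can be obtained with uint_arg(1), pointer_arg(2), and ulong_arg(3) respectively.
Although these probes do not support $return, returnval() reads the register in which a function's return value is
normally stored, and returnstr() gives the return value as a string.
Inside the handler, register("regname") returns the value of the given CPU register at the time the probe was hit.
Syntax (wildcards are not allowed):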
kprobe.function(FUNCTION)
kprobe.function(FUNCTION).return
kprobe.module(NAME).function(FUNCTION)
kprobe.module(NAME).function(FUNCTION).return
kprobe.statement(ADDRESS).absolute
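A small sketch of DWARF-less probing follows. It assumes the probed kernel exports a function vfs_read(file, buf, count, pos), so that count is the third argument; adjust the function name and argument position for your kernel.

```systemtap
# No debuginfo needed: arguments are read by position, not by name.
probe kprobe.function("vfs_read") {
    printf("%s wants to read %d bytes\n", execname(), ulong_arg(3))
}

# returnval() reads the return-value register at function exit.
probe kprobe.function("vfs_read").return {
    printf("%s: vfs_read returned %d\n", execname(), returnval())
    exit()   # stop after the first return, this is only a demonstration
}
```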
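Syntax
(1) Basic format
probe probe-point probe-handler, that is: probe PROBE-POINT { statement }
probe names a probe point and the handler (probe handler) to run when that probe point is hit.
Statements need no terminator; a lone semicolon ";" is the null statement. Handler bodies are enclosed in {}.
Several comment styles are allowed:
Shell-style: #
C-style: /* */
C++-style: //
The next statement returns from the probe handler immediately.
The string concatenation operator is "." and the comparison operator is "==".
For example: "hello" . "world" yields "helloworld"
Variables need no prior declaration and no explicit type; the type is inferred from how the variable is used.
Converting between strings and numbers:
s = sprint(123) # s becomes the string "123"
Variables defined in a probe handler are local and cannot be used in other probe handlers.
The global keyword defines global variables.
Because of possible concurrency (multiple probe handlers running on different CPUs), each global variable
used by a probe is automatically read-locked or write-locked while the handler is running.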
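The basic elements above can be tried in a script that only uses the begin probe, so it does not depend on any kernel function. The variable names are made up for this sketch.

```systemtap
global hits   # global: visible to every probe handler

probe begin {
    s = "hello" . "world"       # string concatenation
    n = sprint(123)             # number to string
    if (s == "helloworld")      # string comparison
        hits++
    printf("%s %s %d\n", s, n, hits)
    exit()
}
```

Note that s and n are local to this handler, while hits could be read from any other probe.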
(2) Functions
function name(param1, param2)
{
statements
return ret
}
Recursion is possible, up to a nesting depth limit.
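As a minimal sketch of a user-defined function with recursion, the script below computes a factorial. The function name fact is chosen for this example; the recursion depth of 5 stays under the default nesting limit (MAXNESTING).

```systemtap
# Types are inferred: n and the return value are both numbers.
function fact(n) {
    if (n <= 1)
        return 1
    return n * fact(n - 1)
}

probe begin {
    printf("5! = %d\n", fact(5))
    exit()
}
```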
(3) Conditional statements
if (EXPR) STATEMENT [else STATEMENT]
(4) Loops
while (EXPR) STATEMENT
for (A; B; C) STATEMENT
break leaves the loop early; continue skips to the next iteration.
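The conditional and loop statements combine as in C. The following sketch, with arbitrary bounds chosen for illustration, prints the even numbers from 2 to 8.

```systemtap
probe begin {
    for (i = 1; i <= 10; i++) {
        if (i % 2)       # odd: skip to the next iteration
            continue
        if (i > 8)       # stop once we pass 8
            break
        printf("%d\n", i)
    }
    exit()
}
```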
(5) Context variables
Allow access to the probe point context. To know which variables are likely to be available, you will need to
be familiar with the kernel source you are probing.
You can use stap -L PROBEPOINT to enumerate the variables available there.
Two functions, user_string and kernel_string, can copy char *target variables into systemtap strings.
(6) Associative arrays
These arrays are implemented as hash tables with a maximum size that is fixed at startup.
Because they are too large to be created dynamically for individual probes handler runs, they must be
declared as global.
An index may have up to 9 components, separated by commas; each component is a number or a string.
For example: global array[400]
6.1 Arrays
An element can be located with multiple indexes.
Element values have one of three types: number, string, or statistic.
If the array size is not specified, it defaults to the maximum, MAXMAPENTRIES (2048).
For example:
foo[4, "hello"]++
processusage[uid(), execname()]++
6.2 Testing whether an element exists
For example: if ([4, "hello"] in foo) { }
6.3 Deleting elements
For example:
delete times[tid()] # deletion of a single element
delete times # deletion of all elements
6.4 Deleting variables
For example: delete var
If var is a number it is reset to 0; if var is a string it is reset to "";
if var is a statistic, its collection of samples is emptied.
6.5 Iteration
Use the foreach keyword. break/continue are allowed; the array must not be modified during iteration.
foreach (x = [a, b] in foo) { fuss_with(x) } # simple loop in arbitrary sequence
foreach ([a, b] in foo+ limit 5) {} # loop in increasing sequence of value, stop after 5
foreach ([a-, b] in foo) {} # loop in decreasing sequence of first key
# Print the first 10 tuples and values in the array in decreasing sequence
foreach(v = [i, j] in foo- limit 10)
printf("foo [%d, %s] = %d\n", i, j, v)
Three forms of iteration:
foreach (VAR in ARRAY) STMT // single-index array; VAR receives each index (key)
foreach ([VAR1, VAR2, ...] in ARRAY) STMT // iterate over index tuples
foreach (VAR = [VAR1, VAR2, ...] in ARRAY) STMT // get both the element value and its index
6.6 Wrapping
% means that when the array is full, new elements may overwrite old ones.
global ARRAY%[<size>], ARRAY2%
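The array operations above can be exercised together in a begin probe. This is only a sketch: the array names, keys, and values are invented for the example.

```systemtap
global hits          # default capacity (MAXMAPENTRIES)
global recent%[2]    # wrapping array: at most 2 entries, old ones may be overwritten

probe begin {
    hits["bash", 1] = 5
    hits["sshd", 2] = 9
    hits["cron", 3] = 1

    if (["cron", 3] in hits)      # membership test
        delete hits["cron", 3]    # delete a single element

    # value and index tuple together, sorted by decreasing value
    foreach (v = [name, id] in hits-)
        printf("hits[%s, %d] = %d\n", name, id, v)

    recent[1] = 10
    recent[2] = 20
    recent[3] = 30                # the array is already full here
    foreach (k in recent)         # k is the key of a single-index array
        printf("recent[%d] = %d\n", k, recent[k])

    exit()
}
```

Without the % marker, the third assignment to a 2-entry array would raise an overflow error instead of replacing an entry.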
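(7) Statistics
Statistics aggregates are a data type specific to SystemTap, used to accumulate statistics in global variables.
The operator is "<<<".
For example: g_value <<< b # adds the sample b to g_value (loosely comparable to g_value += b in C)
Such variables can only be read through dedicated extractor functions, mainly:
@count(g_value): the number of samples accumulated
@sum(g_value): the sum of all samples
@min(g_value): the smallest sample
@max(g_value): the largest sample
@avg(g_value): the average of all samples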
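A minimal sketch of a statistics aggregate: three made-up samples are added with <<< and then read back with the extractor functions.

```systemtap
global lat   # statistics aggregates must be global

probe begin {
    lat <<< 10
    lat <<< 20
    lat <<< 60
    printf("count=%d sum=%d min=%d max=%d avg=%d\n",
           @count(lat), @sum(lat), @min(lat), @max(lat), @avg(lat))
    exit()
}
```

With these samples the script should report count=3, sum=90, min=10, max=60, and avg=30.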
(8) Language safety
8.1 Time limits
A probe handler has a bounded execution time and must not run for long; a handler that exceeds the limit is aborted with an error at run time.
By default each probe handler may execute at most 1000 statements; this limit is configurable.
8.2 Dynamic memory allocation
Dynamic memory allocation is not allowed in probe handlers.
No dynamic memory allocation whatsoever takes place during the execution of probe handlers.
Arrays, function contexts, and buffers are allocated during initialization.
8.3 Locks
Global variables are locked while they are accessed, to prevent concurrent modification.
If multiple probes seek conflicting locks on the same global variables, one or more of them will time out and be
aborted. Such events are tallied as skipped probes, and a count is displayed at session end.
8.4 Bugs
A few extremely timing-sensitive places in the kernel (context switching, interrupt handling) should not be probed.
Putting probes indiscriminately into unusually sensitive parts of the kernel (low level context switching, interrupt
dispatching) has reportedly caused crashes in the past. We are fixing these bugs as they are found, and
constructing a probe "blacklist", but it is not complete.
8.5 Changing the limits
The -D option overrides some of the default limits.
-D NM=VAL emit macro definition into generated C code.
MAXNESTING - The maximum number of recursive function call levels. The default is 10.
MAXSTRINGLEN - The maximum length of strings. The default is 256 bytes for 32 bit machines and
512 bytes for all other machines.
MAXTRYLOCK - The maximum number of iterations to wait for locks on global variables before declaring
possible deadlock and skipping the probe. The default is 1000.
MAXACTION - The maximum number of statements to execute during any single probe hit. The default is 1000.
MAXMAPENTRIES - The maximum number of rows in an array if the array size is not specified explicitly when
declared. The default is 2048.
MAXERRORS - The maximum number of soft errors before an exit is triggered. The default is 0.
MAXSKIPPED - The maximum number of skipped reentrant probes before an exit is triggered. The default is 100.
MINSTACKSPACE - The minimum number of free kernel stack bytes required in order to run a probe handler.
This number should be large enough for the probe handler's own needs, plus a safety margin. The default is 1024.
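(9) Command-line arguments
Two kinds of arguments can be passed on the command line: strings and numbers.
9.1 Numbers
$1 ... $<N> refer to the arguments as numbers in the script.
9.2 Strings
@1 ... @<N> refer to the arguments as strings in the script.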
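For instance, a script could take a number as its first argument and a string as its second. The file name args.stp and the argument values below are hypothetical.

```systemtap
# Run as: stap args.stp 42 hello
probe begin {
    # $1 is substituted as a number, @2 as a string literal
    printf("number=%d string=%s\n", $1, @2)
    exit()
}
```

Running the script with fewer arguments than it references makes stap stop with an error before the module is built.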
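(10) Conditional compilation
%( CONDITION %? TRUE-TOKENS %)
%( CONDITION %? TRUE-TOKENS %: FALSE-TOKENS %)
The condition can be:
@defined($var) // whether the target variable is available
kernel_v > "2.6.37" // compare the kernel version
kernel_vr // compare the kernel version (including the release suffix)
arch == "x86_64" // CPU architecture
A kernel CONFIG option: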
%( CONFIG_UTRACE == "y" %?
do something
%)
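The following sketch shows conditional compilation inside a probe body. The version threshold "3.5" and the messages are arbitrary choices for this example; only one branch of each conditional ends up in the generated module.

```systemtap
probe begin {
    %( kernel_v >= "3.5" %?
        printf("kernel is 3.5 or newer\n")
    %:
        printf("kernel is older than 3.5\n")
    %)

    %( arch == "x86_64" %?
        printf("built for x86_64\n")
    %:
        printf("built for another architecture\n")
    %)
    exit()
}
```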
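User space
Probing user-space programs requires utrace support in the kernel. Kernels from 3.5 on ship a built-in uprobes facility that SystemTap can use instead.
For kernels older than 3.5 you have to apply the relevant patches yourself.
Requirements: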
debugging information for the named program
utrace support in the kernel
(1) Begin/end
Probe points:
when a process/thread is created
when a process/thread ends
process.begin
process("PATH").begin
process(PID).begin
process.thread.begin
process("PATH").thread.begin
process(PID).thread.begin
process.end
process("PATH").end
process(PID).end
process.thread.end
process("PATH").thread.end
process(PID).thread.end
(2) Syscall
Probe points:
system call entry
system call return
process.syscall
process("PATH").syscall
process(PID).syscall
process.syscall.return
process("PATH").syscall.return
process(PID).syscall.return
Available process context variables:
$syscall // system call number
$argN ($arg1~$arg6) // system call arguments
$return // system call return value
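As a sketch of the process syscall probes, the script below counts the system calls made by /bin/ls by system call number. The choice of /bin/ls and the limit of five rows are assumptions for this example, and the probe requires the kernel support for user-space probing mentioned above.

```systemtap
global calls

# Count system calls made by /bin/ls, keyed by system call number.
probe process("/bin/ls").syscall {
    calls[$syscall]++
}

# Report the five most frequent system call numbers when the session ends.
probe end {
    foreach (nr in calls- limit 5)
        printf("syscall %d: %d calls\n", nr, calls[nr])
}
```

It would typically be run together with the target command, for example: stap count.stp -c /bin/ls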
(3) Function/statement
Probe points:
function entry
function return
a given line in a file
a given label in a function
process("PATH").function("NAME")
process("PATH").statement("*@FILE.c:123")
process("PATH").function("*").return
process("PATH").function("myfunc").label("foo")
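A minimal sketch of function probes in a user-space program follows. It assumes a binary named ./test built with debugging information (gcc -g); the binary name is a placeholder.

```systemtap
# Entry of every non-inlined function in ./test, with its parameters.
probe process("./test").function("*").call {
    printf("enter %s %s\n", pp(), $$parms)
}

# Return from every function in ./test.
probe process("./test").function("*").return {
    printf("leave %s\n", pp())
}
```

Running it as stap trace.stp -c ./test limits the probes to that one invocation of the program.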
(4) Absolute variant
Probe points:
a virtual address in the process
process(PID).statement(ADDRESS).absolute
A non-symbolic probe point uses raw, unverified virtual addresses and provides no $variables.
The target PID parameter must identify a running process and ADDRESS must identify a valid instruction address.
This is a guru mode probe.
(5) Target process
Probe points:
functions in shared libraries (for example glibc)
Target process mode (invoked with stap -c CMD or -x PID) implicitly restricts all process.* probes to the given child
process.
If PATH names a shared library, all processes that map that shared library can be probed.
If dwarf debugging information is installed, try using a command with this syntax:
probe process("/lib64/libc-2.8.so").function("...") { ... }
(6) Instruction probes
Probe points:
a single instruction
a block of instructions
process("PATH").insn
process(PID).insn
process("PATH").insn.block
process(PID).insn.block
The .insn probe is called for every single-stepped instruction of the process described by PID or PATH.
The .insn.block probe is called for every block-stepped instruction of the process described by PID or PATH.
Using this feature will significantly slow process execution.
Count how many instructions a process executes:
stap -e 'global steps; probe process("/bin/ls").insn {steps++}; probe end {printf("Total instruction: %d\n", steps)}' \
-c /bin/ls
(7) Usage
gcc -g3 -o test test.c
stap -L 'process("./test").function("*")' // list the functions and variables in the program
Debug levels (the level argument of gcc's -g option):
Request debugging information and also use level to specify how much information. The default level is 2.
Level 0 produces no debug information at all. Thus, -g0 negates -g.
Level 1 produces minimal information, enough for making backtraces in parts of the program that you don't
plan to debug. This includes descriptions of functions and external variables, but no information about local
variables and no line numbers.
Level 3: includes extra information, such as all the macro definitions present in the program.
Advanced features
(1) Building your own script library
A tapset is just a script that is designed for reuse by installation into a special directory.
Systemtap attempts to resolve references to global symbols (probes, functions, variables) that are not defined
within the script by a systematic search through the tapset library for scripts that define those symbols.
A user may give additional directories with the -I DIR option.
Building your own library:
1. Create the library directory mylib and add two library files
time-default.stp
function __time_value() {
    return gettimeofday_us()
}
time-common.stp
global __time_vars
function timer_begin(name) {
    __time_vars[name] = __time_value()
}
function timer_end(name) {
    return __time_value() - __time_vars[name]
}
2. Write the application script
tapset-time-user.stp
probe begin {
    timer_begin("bench")
    for (i = 0; i < 1000; i++) ;
    # __time_value() returns gettimeofday_us(), so the result is in microseconds
    printf("%d us\n", timer_end("bench"))
    exit()
}
3. Run
stap -I mylib/ tapset-time-user.stp
(2) Probe point aliases
Mainly used to provide a layer of abstraction on top of existing probe points.
Probe point aliases allow creation of new probe points from existing ones.
This is useful if the new probe points are named to provide a higher level of abstraction.
Format:
probe new_name = existing_name1, existing_name2[, ..., existing_nameN]
{
prepending behavior
}
Example:
probe syscallgroup.io = syscall.open, syscall.close,
                        syscall.read, syscall.write
{
    groupname = "io"
}
probe syscallgroup.process = syscall.fork, syscall.execve
{
    groupname = "process"
}
probe syscallgroup.*
{
    groups[execname() . "/" . groupname]++
}
global groups
probe end
{
    foreach (eg in groups+)
        printf("%s: %d\n", eg, groups[eg])
}
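(3) Embedded C code
SystemTap provides an "escape hatch" to go beyond what the language can safely offer.
Embedded C code is enclosed in %{ and %}, and the script must be run with the -g option.
A THIS macro is provided for reading the function arguments and storing the return value (newer SystemTap releases use STAP_ARG_name and STAP_RETVALUE instead).
Example: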
%{
#include <linux/sched.h>
#include <linux/list.h>
%}
function process_list()
%{
    struct task_struct *p;
    struct list_head *_p, *_n;
    printk("%-20s%-10s\n", "program", "pid");
    list_for_each_safe(_p, _n, &current->tasks) {
        p = list_entry(_p, struct task_struct, tasks);
        printk("%-20s%-10d\n", p->comm, p->pid);
    }
%}
probe begin {
    process_list()
    exit()
}
stap -g embeded-c.stp
The printed process list can be seen with dmesg.
C code is enclosed in %{ ... %}; it can be a standalone block, the body of a function, or just an expression.
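For comparison, here is a small sketch of an embedded-C function that takes an argument and returns a value. It uses the STAP_ARG_ and STAP_RETVALUE macros of newer SystemTap releases rather than THIS; the function name add_one is made up for this example.

```systemtap
# Guru-mode function written in C; run with: stap -g add.stp
function add_one:long (x:long) %{ /* pure */
    /* STAP_ARG_x is the script-level argument x, STAP_RETVALUE the result */
    STAP_RETVALUE = STAP_ARG_x + 1;
%}

probe begin {
    printf("%d\n", add_one(41))
    exit()
}
```

On older releases that only know THIS, the body would be written with THIS->x and THIS->__retvalue instead.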
(4) Existing script libraries
SystemTap ships with a very powerful script library. The main categories are:
Context Functions
Timestamp Functions
Time utility functions
Shell command functions
Memory Tapset
Task Time Tapset
Scheduler Tapset
IO Scheduler and block IO Tapset
SCSI Tapset
TTY Tapset
Interrupt Request (IRQ) Tapset
Networking Tapset
Socket Tapset
SNMP Information Tapset
Kernel Process Tapset
Signal Tapset
Errno Tapset
Device Tapset
Directory-entry (dentry) Tapset
Logging Tapset
Queue Statistics Tapset
Random functions Tapset
String and data retrieving functions Tapset
String and data writing functions Tapset
Guru tapsets
A collection of standard string functions
Utility functions for using ansi control chars in logs
SystemTap Translator Tapset
Network File Storage Tapsets
Speculation
How it works
(1) Execution flow of a SystemTap script
pass1
During the parsing of the code, it is represented internally in a parse tree.
Preprocessing is performed during this step, and the code is checked for semantic and syntax errors.
pass2
During the elaboration step, the symbols and references in the SystemTap script are resolved.
Also, any tapsets that are referenced in the SystemTap script are imported.
Debug data that is read from the DWARF(a widely used, standardized debugging data format) information,
which is produced during kernel compilation, is used to find the addresses for functions and variables
referenced in the script, and allows probes to be placed inside functions.
pass3
Takes the output from the elaboration phase and converts it into C source code.
Variables used by multiple probes are protected by locks. Safety checks, and any necessary locking, are
handled during the translation. The code is also converted to use the Kprobes API for inserting probe points
into the kernel.
pass4
Once the SystemTap script has been translated into a C source file, the code is compiled into a module that
can be dynamically loaded and executed in the kernel.
pass5
Once the module is built, SystemTap loads the module into the kernel.
When the module loads, an init routine in the module starts running and begins inserting probes into their
proper locations. Hitting a probe causes execution to stop while the handler for that probe is called.
When the handler exits, normal execution continues. The module continues waiting for probes and executing
handler code until the script exits, or until the user presses Ctrl-c, at which time SystemTap removes the
probes, unloads the module, and exits.
Output from SystemTap is transferred from the kernel through a mechanism called relayfs, and sent to STDOUT.
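(2) SystemTap script execution seen from user space and kernel space
(3) kprobes
Breakpoint instruction: __asm INT 3, machine code CC.
The breakpoint interrupt (INT 3) is a software interrupt. When the CPU executes an INT 3 instruction it pushes the current program pointer (CS and EIP) onto the stack
and then calls the interrupt routine registered for INT 3 through the interrupt vector table.
INT is the software interrupt instruction; the interrupt vector table maps interrupt numbers to the addresses of their handlers.
INT 3 raises software interrupt 3. In real mode, where each vector entry is 4 bytes, the handler address is stored at: interrupt vector table address + 4 * 3.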
A Kprobe is a general purpose hook that can be inserted almost anywhere in the kernel code.
To allow it to probe an instruction, the first byte of the instruction is replaced with the breakpoint
instruction for the architecture being used. When this breakpoint is hit, Kprobe takes over execution,
executes its handler code for the probe, and then continues execution at the next instruction.
(4) Kernel features it relies on
kprobes/jprobes
return probes
reentrancy
colocated (multiple)
relayfs
scalability (unlocked handlers)
user-space probes