用gdb調試python多線程代碼-記一次死鎖的發現
| 版權:本文版權歸作者和博客園共有,歡迎轉載,但未經作者同意必須保留此段聲明,且在文章頁面明顯位置給出原文連接。如有問題,可以郵件:[email protected]
前言
相信很多人都有使用過sqlite3的經驗,一年前因為項目上的需要,寫了一個基於sqlite3的持久化隊列庫(persist-queue),已經發布在pypi上有段時間了。 前段時間,一下子來了兩個issues,一個是關於in-memory database的support,一個是sqlite3 隊列性能的問題。在數據量不大的情況下,sqlite的隊列會在某些情況下出現:
sqlite3.OperationalError: database is locked
在修復上面的問題的時候,註意到,在多線程的情況下更容易觸發上面的exception,並且比較怪的是,有時測試還有兩個問題: 1. 出現死鎖,並且CPU占用會一直保持在100%。
在程序裏面下斷點(import pdb;pdb.set_trace())和log都不好使,關鍵是無法精確定位到所有線程在當時的運行狀態。
找了不少方法,最後還是發現了管用的方法,使用大名鼎鼎的: gdb
相比pdb,gdb有以下幾個優點:
- 不需要顯示的下斷點,如"import pdb;pdb.set_trace()"
- 可以方便的調試多線程程序,允許你調試過程中切換調試線程。很多python debug是不支持的如 winpdb, pydevd
- 如果python解釋器core dump了,生成的core dump文件可以直接用gdb 來分析,而gdb只能望“dump”興嘆了。
Python給gdb準備了以extension,方便用戶不僅可以查看Python解釋器的運行情況,還可以查看用戶程序的運行情況。
python gdb extension在gdb的環境下提供了如下幾個py-*命令
- py-list 查看當前python應用程序上下文
- py-bt 查看當前python應用程序調用堆棧
- py-bt-full 查看當前python應用程序調用堆棧,並且顯示每個frame的詳細情況
- py-print
- py-locals 查看當前的scope的變量
- py-up 查看上一個frame
- py-down 查看下一個frame
環境準備
首先按照需要的各種調試用到的包
1. 安裝gdb, 調試的主要工具
sudo apt-get install gdb
2. 安裝python-dbg, 用來在調試的時候看到python源代碼的call stack
sudo apt-get install python-dbg
這裏面我都是用的python2.7,如果用的是python3.5,需要對應的安裝python3.5-dbg
重現問題(死鎖,Hung process)
$ tox -e py27
py27 develop-inst-nodeps: /home/peter/Documents/persist-queue
py27 installed: configparser==3.5.0,cov-core==1.15.0,coverage==4.4.1,enum-compat==0.0.2,enum34==1.1.6,eventlet==0.21.0,flake8==3.5.0,funcsigs==1.0.2,greenlet==0.4.12,mccabe==0.6.1,mock==2.0.0,nose2==0.6.5,pbr==3.1.1,-e git+https://github.com/peter-wangxu/persist-queue@d26561e3c7c9a35fd75ddedf0a023a7dcbd563a9#egg=persist_queue,pkg-resources==0.0.0,pycodestyle==2.3.1,pyflakes==1.6.0,six==1.11.0,virtualenv==15.1.0
py27 runtests: PYTHONHASHSEED=‘564480154‘
py27 runtests: commands[0] | nose2 --with-coverage --coverage-report xml --coverage-report term
WARNING:persistqueue.sqlbase:auto_commit=False is still experimental,only use it with care.
....
上面的測試永遠不會返回,悲傷啊。。
使用gdb加載symbols
1. 首先找到進程號
$ ps -ef | grep tox peter 9376 1777 0 15:10 pts/17 00:00:00 /usr/bin/python3 /usr/bin/tox peter 9409 9376 67 15:10 pts/17 01:04:14 /home/peter/Documents/persist-queue/.tox/py27/bin/python2.7 .tox/py27/bin/nose2 --with-coverage --coverage-report xml --coverage-report term peter 9913 9639 0 16:45 pts/18 00:00:00 grep --color=auto tox
上面的 9409 就是我卡住的進程了。
接下來就是使用gdb調試: sudo gdb python -p 9409
記得用用sudo,否則symbols有可能無法加載成功
sudo gdb python -p 9409 GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1 Copyright (C) 2016 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done. done. Attaching to program: /usr/bin/python, process 9409 [New LWP 9427] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
註意到這句 “Reading symbols from python...Reading symbols from /usr/lib/debug/.build-id/cd/e2c487269892a2815c715667ae336984b82b0c.debug...done.”
表示加載成功,否則會看到“no debugging symbols found”的log,可能是使用的python2.7跟dbg包的版本不一致。
2. 安裝用到的包的dbg信息
註意到我上面有一句 “Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0...(no debugging symbols found)...done.”,表示我的sqlite3的symbols沒有加載,直接安裝就好
sudo apt-get install libsqlite3-0-dbg
重現進入gdb
看到這句
Reading symbols from /usr/lib/x86_64-linux-gnu/libsqlite3.so.0...Reading symbols from /usr/lib/debug/.build-id/78/5ce90f42d13a60a36853327823ef5f90f2e8f1.debug...done.
所有我關心的symbols就加載完了。
使用gdb debug 程序
debug 主線程
首先,可以是用 info threads,查看當前的線程
(gdb) info threads Id Target Id Frame * 1 Thread 0x7fb92365b700 (LWP 9409) "nose2" 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 2 Thread 0x7fb91d4f1700 (LWP 9427) "nose2" _PyObject_GenericGetAttrWithDict () at ../Objects/object.c:1392
帶 * 的這個就是當前正在debug的thread。
查看主線程為何一直卡住
(gdb) thread 1 [Switching to thread 1 (Thread 0x7fb92365b700 (LWP 9409))] #0 0x00007fb923247827 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x152af60) at ../sysdeps/unix/sysv/linux/futex-internal.h:205 205 ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory. (gdb) py-list 335 waiter.acquire() 336 self.__waiters.append(waiter) 337 saved_state = self._release_save() 338 try: # restore state no matter what (e.g., KeyboardInterrupt) 339 if timeout is None: >340 waiter.acquire() 341 if __debug__: 342 self._note("%s.wait(): got it", self) 343 else: 344 # Balancing act: We can‘t afford a pure busy loop, so we 345 # have to sleep; but if we sleep the whole timeout time, (gdb) py-up #7 Frame 0x7fb9180068e0, for file /usr/lib/python2.7/threading.py, line 340, in wait (self=<_Condition(_Condition__lock=<thread.lock at remote 0x7fb9235efd50>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb9235efd50>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7fb9235effd0>], release=<built-in method release of thread.lock object at remote 0x7fb9235efd50>) at remote 0x7fb91fdb1c10>, timeout=None, waiter=<thread.lock at remote 0x7fb9235effd0>, saved_state=None) waiter.acquire() (gdb) py-up #11 Frame 0x7fb91fdb69b0, for file /usr/lib/python2.7/threading.py, line 940, in join (self=<Thread(_Thread__block=<_Condition(_Condition__lock=<thread.lock at remote 0x7fb9235efd50>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb9235efd50>, _Verbose__verbose=False, _Condition__waiters=[<thread.lock at remote 0x7fb9235effd0>], release=<built-in method release of thread.lock object at remote 0x7fb9235efd50>) at remote 0x7fb91fdb1c10>, _Thread__args=(3,), _Thread__ident=140433037399808, _Thread__target=<function at remote 0x7fb91fdae1b8>, _Thread__daemonic=False, _Thread__initialized=True, _Thread__started=<_Event(_Event__flag=True, _Verbose__verbose=False, _Event__cond=<_Condition(_Condition__lock=<thread.lock at remote 0x7fb9235eff90>, acquire=<built-in method acquire of thread.lock object at remote 0x7fb9235eff90>, _Verbose__verbose=False, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fb9235eff90>) at remote 0x7fb91fdb1bd0>) at remote 0x7fb91f...(truncated) self.__block.wait() (gdb) py-up #15 Frame 0x13d7ba0, for file /home/peter/Documents/persist-queue/tests/test_sqlqueue.py, line 179, in test_multiple_consumers (self=<SQLite3QueueInMemory(path=‘:memory:‘, auto_commit=True, _testMethodName=‘test_multiple_consumers‘, _cleanups=[], _testMethodDoc=‘Test sqlqueue can be used by multiple consumers.‘, _type_equality_funcs={<type at remote 0x905840>: ‘assertSetEqual‘, <type at remote 0x9056a0>: ‘assertMultiLineEqual‘, <type at remote 0x906f00>: ‘assertDictEqual‘, <type at remote 0x906d60>: ‘assertTupleEqual‘, <type at remote 0x9049a0>: ‘assertSetEqual‘, <type at remote 0x9073e0>: ‘assertListEqual‘}, _resultForDoCleanups=<PluggableTestResult(session=<Session(testResult=<...>, logLevel=30, hooks=<PluginInterface(hooks={‘wasSuccessful‘: <Hook(method=‘wasSuccessful‘, plugins=[<ResultReporter(descriptions=True, reportCategories={‘failures‘: [], ‘expectedFailures‘: [], ‘skipped‘: [], ‘unexpectedSuccesses‘: [], ‘errors‘: []}, registered=True, dontReport=set([‘failures‘, ‘skipped‘, ‘errors‘, ‘passed‘, ‘expected...(truncated) t.join() (gdb) py-list 174 t.start() 175 consumers.append(t) 176 177 p.join() 178 for t in consumers: >179 t.join() 180 181 self.assertEqual(0, queue.qsize()) 182 for x in range(1000): 183 self.assertNotEqual(0, counter[x], 184 "not 0 for counter‘s index %s" % x)
可以看到我使用了py-list和幾次的py-up來到了我寫的測試test_sqlqueue.py裏面。
發現主線程是在t.join在等待子線程完成,但是奇怪的是,子線程一直沒有完成任務。
debug工作線程
接下來,使用 thread 2 切換到上面的 2 線程。
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f02f51f9700 (LWP 18749))]
#0 PyType_IsSubtype () at ../Objects/typeobject.c:1184
1184 ../Objects/typeobject.c: No such file or directory.
使用py-list查看當前運行位置
(gdb) py-list
64 raise Empty
65 elif timeout is None:
66 # block until a put event.
67 pickled = self._pop()
68 while not pickled:
>69 self.put_event.wait()
70 pickled = self._pop()
71 elif timeout < 0:
72 raise ValueError("‘timeout‘ must be a non-negative number")
73 else:
74 # block until the timeout reached
(gdb)
可以看到,工作線程是在 "self.put_event.wait()“,一種可能是self.put_event一直是false,沒有被set位True,所以線程就這麽一直等著。繼續分析
使用py-locals 查看put-event當前值
(gdb) py-up #26 Frame 0x7f02f82b3608, for file /home/peter/Documents/persist-queue/persistqueue/sqlqueue.py, line 49, in _pop (self=<SQLiteQueue(name=‘default‘, _putter=<sqlite3.Connection at remote 0x7f02f82c2030>, path=‘:memory:‘, action_lock=<thread.lock at remote 0x7f02fbaf8e30>, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=<built-in method acquire of thread.lock object at remote 0x7f02fbaf8eb0>, release=<built-in method release of thread.lock object at remote 0x7f02fbaf8eb0>, _Condition__lock=<thread.lock at remote 0x7f02fbaf8eb0>, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=<thread.lock at remote 0x7f02fbaf8f30>, timeout=<float at remote 0x1bb8660>, auto_commit=True, _getter=<sqlite3.Connection at remote 0x7f02f82c2030>, _conn=<sqlite3.Connection at remote 0x7f02f82c2030>) at remote 0x7f02f8491a10>) row = self._select() (gdb) py-list 44 # Action lock to assure multiple action to be *atomic* 45 self.action_lock = threading.Lock() 46 47 def _pop(self): 48 with self.action_lock: >49 row = self._select() 50 # Perhaps a sqlite3 bug, sometimes (None, None) is returned 51 # by select, below can avoid these invalid records. 52 if row and row[0] is not None: 53 self._delete(row[0]) 54 if not self.auto_commit: (gdb) py-print self local ‘self‘ = <SQLiteQueue(name=‘default‘, _putter=<sqlite3.Connection at remote 0x7f02f82c2030>, path=‘:memory:‘, action_lock=<thread.lock at remote 0x7f02fbaf8e30>, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=<built-in method acquire of thread.lock object at remote 0x7f02fbaf8eb0>, release=<built-in method release of thread.lock object at remote 0x7f02fbaf8eb0>, _Condition__lock=<thread.lock at remote 0x7f02fbaf8eb0>, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=<thread.lock at remote 0x7f02fbaf8f30>, timeout=<float at remote 0x1bb8660>, auto_commit=True, _getter=<sqlite3.Connection at remote 0x7f02f82c2030>, _conn=<sqlite3.Connection at remote 0x7f02f82c2030>) at remote 0x7f02f8491a10>(gdb) py-locals self = <SQLiteQueue(name=‘default‘, _putter=<sqlite3.Connection at remote 0x7f02f82c2030>, path=‘:memory:‘, action_lock=<thread.lock at remote 0x7f02fbaf8e30>, memory_sql=True, put_event=<_Event(_Event__cond=<_Condition(acquire=<built-in method acquire of thread.lock object at remote 0x7f02fbaf8eb0>, release=<built-in method release of thread.lock object at remote 0x7f02fbaf8eb0>, _Condition__lock=<thread.lock at remote 0x7f02fbaf8eb0>, _Condition__waiters=[], _Verbose__verbose=False) at remote 0x7f02f835cf90>, _Event__flag=True, _Verbose__verbose=False) at remote 0x7f02f835ce90>, multithreading=True, tran_lock=<thread.lock at remote 0x7f02fbaf8f30>, timeout=<float at remote 0x1bb8660>, auto_commit=True, _getter=<sqlite3.Connection at remote 0x7f02f82c2030>, _conn=<sqlite3.Connection at remote 0x7f02f82c2030>) at remote 0x7f02f8491a10>
可以看到這個put_event是一直為True(_Event__flag=True),仔細想想,我好像沒設置為False過。。。
結合代碼:
1 def get(self, block=True, timeout=None): 2 if not block: 3 pickled = self._pop() 4 if not pickled: 5 raise Empty 6 elif timeout is None: 7 # block until a put event. 8 pickled = self._pop() 9 while not pickled: 10 self.put_event.wait(TICK_FOR_WAIT) 11 pickled = self._pop() 12 elif timeout < 0: 13 raise ValueError("‘timeout‘ must be a non-negative number") 14 else: 15 # block until the timeout reached 16 endtime = _time.time() + timeout 17 pickled = self._pop() 18 while not pickled: 19 remaining = endtime - _time.time() 20 if remaining <= 0.0: 21 raise Empty 22 self.put_event.wait( 23 TICK_FOR_WAIT if TICK_FOR_WAIT < remaining else remaining) 24 pickled = self._pop() 25 item = pickle.loads(pickled) 26 return item
如果因為某種情況,queue為空,那麽上面的9-11行就是一個無限循環,並且cpu會被占用到100%,
好了,到這裏就大概搞清楚為什麽有時候多線程測試是,程序一直為hung住,並且CPU的占用一直是100%,還會報出 ”sqlite3.OperationalError: database is locked“ 的error(因為程序的線程一直在嘗試讀數據庫,如果其他線程有update,或者inster拿到了sqlite的文件鎖,讀線程會讀超時)
參考文獻
http://podoliaka.org/2016/04/10/debugging-cpython-gdb
https://wiki.python.org/moin/DebuggingWithGdb
用gdb調試python多線程代碼-記一次死鎖的發現