surfaceflinger 導致的SWT 問題分析流程
首先檢查SWT 發生具體時間,檢查SYS_ANDROID_EVENT_LOG 搜尋Watchdog:
01-05 04:54:40.811 785 1160 I watchdog: surfaceflinger hang.
檢查SYS_ANDROID_LOG 檔案確認sf hang 時間
01-05 04:54:40.778 785 1160 V Watchdog: **SF hang Time **42201
01-05 04:54:40.779 785 1160 E Watchdog: **SWT happen **
且從上log得知 SF hang的總時間為42201。且倒推可知SF hang的開始發生時間約為
然後檢查SF_RTT_DUMP 和 SF_RTT_DUMP1 檔案 搜尋確認有效時間內的SF 堆疊資訊
----- pid 384 at 2018-01-05 04:53:58 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:02 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:06 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:10 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:14 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:18 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:22 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:27 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:31 ----- Cmd line: /system/bin/surfaceflinger ----- pid 384 at 2018-01-05 04:54:35 ----- Cmd line: /system/bin/surfaceflinger
檢查384主執行緒
找到code
status_t BufferQueueConsumer::setTransformHint(uint32_t hint) {
777 ATRACE_CALL();
778 BQ_LOGV("setTransformHint: %#x", hint);
779 Mutex::Autolock lock(mCore->mMutex);
780 mCore->mTransformHint = hint;
781 return NO_ERROR;
782 }
發現主執行緒一直等鎖 mCore->mMutex
檢查後面執行緒發現 423執行緒
"Binder:384_2" sysTid=423
#00 pc 00048c90 /system/lib/libc.so (__ioctl+8)
#01 pc 0001dd75 /system/lib/libc.so (ioctl+32)
#02 pc 000429bd /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+168)
#03 pc 000433fd /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+236)
#04 pc 0003d841 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+36)
#05 pc 000518d1 /system/lib/libgui.so
#06 pc 000430b3 /system/lib/libgui.so (android::BufferQueueProducer::connect(android::sp<android::IProducerListener> const&, int, bool, android::IGraphicBufferProducer::QueueBufferOutput*)+974)
也一直卡住。
檢查code 發現BufferQueueProducer connect函式中
status_t BufferQueueProducer::connect(const sp<IProducerListener>& listener,
1235 int api, bool producerControlledByApp, QueueBufferOutput *output) {
1236 ATRACE_CALL();
1237 Mutex::Autolock lock(mCore->mMutex);
mCore->mMutex 鎖拿住了。所以確認此處卡住了SF 主執行緒導致的SWT。
由於後面的堆疊列印不清楚。看不到具體的呼叫。
所以詳細分析 SYS_ANDROID_LOG
看到一行 log
01-05 04:54:46.724 384 423 E : IProducerListener: binder call 'needsReleaseNotify' failed
在code 中搜索發現是
呼叫到 IProducerListener.cppvirtual bool needsReleaseNotify() 函式裡面:
status_t err = remote()->transact(NEEDS_RELEASE_NOTIFY, data, &reply);---這裡一直卡住了
47 if (err != NO_ERROR) {
48 ALOGE("IProducerListener: binder call \'needsReleaseNotify\' failed");
49 return true;
50 }
檢查確認needsReleaseNotify 此函式就是
BufferQueueProducer.cpp connect裡面的
1307 if (listener->needsReleaseNotify()) {
1308 mCore->mConnectedProducerListener = listener;
1309 }
所以從上面可以知道 SF hang的原因是
SF主執行緒需要拿的鎖mCore->mMutex 被 執行緒423 拿住了。執行緒423 在呼叫到needsReleaseNotify 的時候 在binder 通訊是卡住。
由於不清楚IProducerListener 誰註冊下來的。所以繼續看了下 SYS_ANDROID_LOG
發現一個問題;
com.manboker.headportrait 應用發生NE 後 SF 開始列印
01-05 04:53:58.185 384 429 W : [SF-WD] ============================================
01-05 04:53:58.185 384 429 W : [SF-WD] SF performance down, TID=384 SpendTime=797ms Threshold=500ms AnchorTime=261951612172457 Now=261952410170919.
01-05 04:53:58.185 384 429 W : [SF-WD] reset anchor by [handleTransaction]
01-05 04:53:58.186 384 429 W : [SF-WD] Start to dump RTT
01-05 04:53:58.186 384 429 W : [SF-WD] ============================================
01-05 04:53:58.568 384 429 W : [SF-WD] ============================================
01-05 04:53:58.568 384 429 W : [SF-WD] SF maybe hang, TID=384 SpendTime=1180ms Threshold=500ms AnchorTime=261951612172457 Now=261952792703919.
01-05 04:53:58.568 384 429 W : [SF-WD] reset anchor by [handleTransaction]
01-05 04:53:58.568 384 429 W : [SF-WD] RTT is still dumping
01-05 04:53:58.568 384 429 W : [SF-WD] ============================================
在NE發生之後必現 SF hang的log
而且在
01-05 04:54:46.724 384 423 E : IProducerListener: binder call 'needsReleaseNotify' failed
01-05 04:54:46.724 384 538 I BufferQueueProducer: [com.manboker.headportrait/com.manboker.headportrait.activities.CameraActivity#0](this:0xae088000,id:103885,api:1,p:7968,c:384) disconnect(P): api 1
01-05 04:54:46.726 384 1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)
01-05 04:54:46.726 384 1299 I chatty : uid=1000(system) Binder:384_4 identical 1 line
01-05 04:54:46.726 384 1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)
01-05 04:54:46.727 384 422 E IPCThreadState: binder thread pool (4 threads) starved for 33728 ms
01-05 04:54:46.730 481 481 I Zygote : Process 7968 exited due to signal (6)
01-05 04:54:46.730 481 481 I Zygote : Process 7968 dumped core.
7968 died SF 就好了。
檢查kernal log
[262000.955432] (3)[26663:kworker/3:2]binder: release 7968:7968 transaction 298427269 out, still active
[262000.955512] (3)[26663:kworker/3:2]binder: release 7968:8025 transaction 298427286 out, still active
[262000.955522] (3)[26663:kworker/3:2]binder: send failed reply for transaction 298427287 to 384:423
[262000.955579] (3)[26663:kworker/3:2]binder: undelivered transaction 298432612, process died.
[262000.955859] (3)[26663:kworker/3:2]binder: undelivered transaction 298431965, process died.
[262000.955867] (3)[26663:kworker/3:2]binder: undelivered transaction 298432005, process died.
明確打印出 binder 對端是7968 程序,即com.manboker.headportrait
所以此問題的根本原因是 SF 和 com.manboker.headportrait 通訊的時候。com.manboker.headportrait發生異常開啟了coredump功能後 抓coredump資訊 掛起了。
從而導致 SF一直卡死。