1. 程式人生 > >surfaceflinger 導致的SWT 問題分析流程

surfaceflinger 導致的SWT 問題分析流程

首先檢查SWT 發生具體時間,檢查SYS_ANDROID_EVENT_LOG 搜尋Watchdog:

01-05 04:54:40.811   785  1160 I watchdog: surfaceflinger  hang.

檢查SYS_ANDROID_LOG 檔案確認sf hang 時間

01-05 04:54:40.778   785  1160 V Watchdog: **SF hang Time **42201

01-05 04:54:40.779   785  1160 E Watchdog: **SWT happen **

且從上log得知 SF hang的總時間為42201。且倒推可知SF hang的開始發生時間約為

04:53:58

然後檢查SF_RTT_DUMP 和 SF_RTT_DUMP1 檔案 搜尋確認有效時間內的SF 堆疊資訊

----- pid 384 at 2018-01-05 04:53:58 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:02 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:06 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:10 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:14 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:18 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:22 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:27 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:31 -----

Cmd line: /system/bin/surfaceflinger

----- pid 384 at 2018-01-05 04:54:35 -----

Cmd line: /system/bin/surfaceflinger

檢查384主執行緒

找到code

status_t BufferQueueConsumer::setTransformHint(uint32_t hint) {

777     ATRACE_CALL();

778     BQ_LOGV("setTransformHint: %#x", hint);

779     Mutex::Autolock lock(mCore->mMutex);

780     mCore->mTransformHint = hint;

781     return NO_ERROR;

782 }

發現主執行緒一直等鎖 mCore->mMutex

檢查後面執行緒發現 423執行緒

"Binder:384_2" sysTid=423

  #00 pc 00048c90  /system/lib/libc.so (__ioctl+8)

  #01 pc 0001dd75  /system/lib/libc.so (ioctl+32)

  #02 pc 000429bd  /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+168)

  #03 pc 000433fd  /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+236)

  #04 pc 0003d841  /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+36)

  #05 pc 000518d1  /system/lib/libgui.so

  #06 pc 000430b3  /system/lib/libgui.so (android::BufferQueueProducer::connect(android::sp<android::IProducerListener> const&, int, bool, android::IGraphicBufferProducer::QueueBufferOutput*)+974)

也一直卡住。

檢查code 發現BufferQueueProducer connect函式中

status_t BufferQueueProducer::connect(const sp<IProducerListener>& listener,

1235         int api, bool producerControlledByApp, QueueBufferOutput *output) {

1236     ATRACE_CALL();

1237     Mutex::Autolock lock(mCore->mMutex);

 mCore->mMutex 鎖拿住了。所以確認此處卡住了SF 主執行緒導致的SWT

由於後面的堆疊列印不清楚。看不到具體的呼叫。

所以詳細分析 SYS_ANDROID_LOG

看到一行 log

01-05 04:54:46.724   384   423 E         : IProducerListener: binder call 'needsReleaseNotify' failed

在code 中搜索發現是

呼叫到 IProducerListener.cppvirtual bool needsReleaseNotify() 函式裡面:

status_t err = remote()->transact(NEEDS_RELEASE_NOTIFY, data, &reply);---這裡一直卡住了
47 if (err != NO_ERROR) {
48 ALOGE("IProducerListener: binder call \'needsReleaseNotify\' failed");
49 return true;
50 }

檢查確認needsReleaseNotify 此函式就是

BufferQueueProducer.cpp connect裡面的

1307 if (listener->needsReleaseNotify()) { 
1308   mCore->mConnectedProducerListener = listener;
1309 }

所以從上面可以知道 SF hang的原因是

SF主執行緒需要拿的鎖mCore->mMutex 被 執行緒423 拿住了。執行緒423 在呼叫到needsReleaseNotify 的時候 在binder 通訊是卡住。

由於不清楚IProducerListener 誰註冊下來的。所以繼續看了下 SYS_ANDROID_LOG

發現一個問題;

com.manboker.headportrait 應用發生NE 後 SF 開始列印

01-05 04:53:58.185   384   429 W         : [SF-WD] ============================================

01-05 04:53:58.185   384   429 W         : [SF-WD] SF performance down, TID=384 SpendTime=797ms Threshold=500ms AnchorTime=261951612172457 Now=261952410170919.

01-05 04:53:58.185   384   429 W         : [SF-WD] reset anchor by [handleTransaction]

01-05 04:53:58.186   384   429 W         : [SF-WD] Start to dump RTT

01-05 04:53:58.186   384   429 W         : [SF-WD] ============================================

01-05 04:53:58.568   384   429 W         : [SF-WD] ============================================

01-05 04:53:58.568   384   429 W         : [SF-WD] SF maybe hang, TID=384 SpendTime=1180ms Threshold=500ms AnchorTime=261951612172457 Now=261952792703919.

01-05 04:53:58.568   384   429 W         : [SF-WD] reset anchor by [handleTransaction]

01-05 04:53:58.568   384   429 W         : [SF-WD] RTT is still dumping

01-05 04:53:58.568   384   429 W         : [SF-WD] ============================================

在NE發生之後必現 SF hang的log

而且在

01-05 04:54:46.724   384   423 E         : IProducerListener: binder call 'needsReleaseNotify' failed

01-05 04:54:46.724   384   538 I BufferQueueProducer: [com.manboker.headportrait/com.manboker.headportrait.activities.CameraActivity#0](this:0xae088000,id:103885,api:1,p:7968,c:384) disconnect(P): api 1

01-05 04:54:46.726   384  1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)

01-05 04:54:46.726   384  1299 I chatty  : uid=1000(system) Binder:384_4 identical 1 line

01-05 04:54:46.726   384  1299 I SurfaceFlinger: EventThread Client Pid (7968) disconnected by (384)

01-05 04:54:46.727   384   422 E IPCThreadState: binder thread pool (4 threads) starved for 33728 ms

01-05 04:54:46.730   481   481 I Zygote  : Process 7968 exited due to signal (6)

01-05 04:54:46.730   481   481 I Zygote  : Process 7968 dumped core.

7968 died SF 就好了。

檢查kernal log

[262000.955432] (3)[26663:kworker/3:2]binder: release 7968:7968 transaction 298427269 out, still active

[262000.955512] (3)[26663:kworker/3:2]binder: release 7968:8025 transaction 298427286 out, still active

[262000.955522] (3)[26663:kworker/3:2]binder: send failed reply for transaction 298427287 to 384:423

[262000.955579] (3)[26663:kworker/3:2]binder: undelivered transaction 298432612, process died.

[262000.955859] (3)[26663:kworker/3:2]binder: undelivered transaction 298431965, process died.

[262000.955867] (3)[26663:kworker/3:2]binder: undelivered transaction 298432005, process died.

明確打印出 binder 對端是7968 程序,即com.manboker.headportrait

所以此問題的根本原因是 SF 和 com.manboker.headportrait 通訊的時候。com.manboker.headportrait發生異常開啟了coredump功能後 抓coredump資訊 掛起了。

從而導致 SF一直卡死。