
Black screen caused by a SystemUI start timeout

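I. Problem description

1.1 Symptom

      The phone shows a black screen, but a long press on the power key still brings up the power-off dialog.

1.2 JIRA

      xxx

1.3 Conclusion

      The systemui services were not restarted, which left the screen black. This is a bug in stock Android. systemui is started in an unusual way: its interface is drawn by a service, so if that service does not come up, nothing is drawn and the screen stays black.

1.4 Fix link:

      xxx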

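II. Initial analysis

2.1 Checking the system_server traces

      For a black screen we normally start with the system_server traces, to see whether a stuck system_server is the cause. The usual routine is to search the system_server traces for the keywords "held by" or "state=D", and then check the bugreport for a watchdog. Unfortunately none of these markers were present here, and every system_server thread was in a normal state. If system_server is not the problem, what should we look at next? In this situation, check the display-related components: surfaceflinger, window, systemui and so on.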
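The keyword check described above can be scripted. The following is a minimal sketch that filters trace lines for "held by" and "state=D"; the class name, the method name and the sample lines are made up for illustration and are not taken from the real traces of this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TraceScanner {
    /** Returns the lines that contain one of the two keywords named in the text. */
    static List<String> suspiciousLines(List<String> traceLines) {
        List<String> hits = new ArrayList<>();
        for (String line : traceLines) {
            // "held by" marks a thread waiting on a lock owned by another thread,
            // "state=D" marks a thread in uninterruptible sleep.
            if (line.contains("held by") || line.contains("state=D")) {
                hits.add(line.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Illustrative input only; real traces would be read from the bugreport.
        List<String> sample = Arrays.asList(
                "  | state=S schedstat=( 0 0 0 ) utm=0 stm=0 core=0 HZ=100",
                "  - waiting to lock <0x0123> (a java.lang.Object) held by thread 12",
                "  | state=D schedstat=( 0 0 0 ) utm=0 stm=0 core=1 HZ=100");
        for (String hit : suspiciousLines(sample)) {
            System.out.println(hit);
        }
    }
}
```

An empty result, as in this issue, only rules out the usual lock and I/O stalls in the scanned process. It does not show that the display path is healthy.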

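2.2 Checking the display-related information

        After a long discussion with colleagues from the display team and a long look at the surfaceflinger information, we confirmed one thing: the screen was black because nothing was being drawn. Why was nothing being drawn? We moved on to the systemui trace. Having never studied systemui in depth, on the first pass we only checked whether any systemui thread was stuck. None was, so we wrongly concluded that systemui was healthy too. We then went through the systemui entries in the bugreport together with colleagues from the systemui team, and finally found that systemui had not restarted its services. The key log lines are: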
07-06 11:48:42.890 1000 7903 7922 I am_kill : [0,23775,com.android.systemui,-800,bg anr]
07-06 11:48:49.472 1000 7903 17478 I am_proc_died: [0,23775,com.android.systemui,-800,0]
07-06 11:48:49.477 1000 7903 17478 I am_schedule_service_restart: [0,com.android.systemui/.fsgesture.FsGestureService,0]
07-06 11:48:49.484 1000 7903 17478 I am_schedule_service_restart: [0,com.android.systemui/.SystemUIService,0]
07-06 11:48:49.485 1000 7903 17478 I am_schedule_service_restart: [0,com.android.keyguard/.KeyguardService,0]
07-06 11:48:49.517 1000 7903 17478 I am_proc_start: [0,23058,1000,com.android.systemui,restart,com.android.systemui]
07-06 11:48:49.584 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,bg anr]
07-06 11:48:59.563 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,start timeout]
07-06 11:48:59.650 1000 7903 7922 I am_proc_start: [0,23278,1000,com.android.systemui,added application,com.android.systemui]
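We were not very familiar with the service-restart code, so the log alone did not reveal the cause. Our only suspicion was that these services had been removed somewhere, so we added logging at every service remove call in ActiveServices.java and built an image for the test team. They ran it for several days without reproducing the problem, so we had to find a way to reproduce it ourselves. The first step was to work out the sequence of events in the log and then recreate that sequence. The log above shows the following: systemui is killed for a bg anr, am_proc_died is printed, the services are scheduled for restart, and the systemui process is restarted. That process start then times out, this time without an am_proc_died, and the process is started once more. This final start does not bring up the services.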
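The pattern that matters in this log, an am_kill with no matching am_proc_died for the same pid, can be picked out mechanically. Below is a minimal sketch that assumes the event-log line layout shown above; the class name, the method name and the regular expression are my own and not part of any Android tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KillWithoutDeathFinder {
    // Matches "am_kill : [user,pid,process,..." and "am_proc_died: [user,pid,process,...".
    private static final Pattern EVENT = Pattern.compile(
            "(am_kill|am_proc_died)\\s*:\\s*\\[\\d+,(\\d+),([^,\\]]+)");

    /** Returns the pids of the given process that were killed but never reported dead. */
    static List<Integer> killedWithoutDeath(List<String> lines, String process) {
        List<Integer> killed = new ArrayList<>();
        Set<Integer> died = new HashSet<>();
        for (String line : lines) {
            Matcher m = EVENT.matcher(line);
            if (!m.find() || !m.group(3).equals(process)) {
                continue;
            }
            int pid = Integer.parseInt(m.group(2));
            if (m.group(1).equals("am_kill")) {
                if (!killed.contains(pid)) {
                    killed.add(pid); // the same pid can be killed more than once
                }
            } else {
                died.add(pid);
            }
        }
        killed.removeAll(died);
        return killed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "07-06 11:48:42.890 1000 7903 7922 I am_kill : [0,23775,com.android.systemui,-800,bg anr]",
                "07-06 11:48:49.472 1000 7903 17478 I am_proc_died: [0,23775,com.android.systemui,-800,0]",
                "07-06 11:48:49.584 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,bg anr]",
                "07-06 11:48:59.563 1000 7903 7922 I am_kill : [0,23058,com.android.systemui,-800,start timeout]");
        System.out.println(killedWithoutDeath(log, "com.android.systemui")); // prints [23058]
    }
}
```

For the log above this singles out pid 23058, the process whose start timed out.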

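III. Reproduction

      The service timeout ANR and the process start timeout ANR work the same way: a delayed message is posted when the start begins and removed when the start completes. If the message has not been removed within a fixed time, an ANR is raised. So we went straight to the code and looked for the two messages behind the service and process ANRs:
SERVICE_TIMEOUT_MSG/PROC_START_TIMEOUT_MSG. We put conditional breakpoints (stop only when the process is systemui) at every place these messages are posted, put breakpoints where the messages are handled, and tried to recreate the sequence from the log. Several days of attempts failed, because I had cut a corner: I did not trigger the ANR the normal way. Instead, when execution reached the code that removes the ANR message, I changed it by hand so that the message was not removed. That does produce an ANR, but every systemui timeout then still printed am_proc_died. Going back to the code, I suspected that the restart that timed out never reached linkToDeath, and therefore never reached binderDied (am_proc_died is printed in appDiedLocked). After a lot of breakpoint debugging I became familiar with this code and found a way to reproduce the problem every time:
The important breakpoints:
1. Breakpoint in the AMS method processStartTimedOutLocked (a process start timeout goes through here):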
case PROC_START_TIMEOUT_MSG: {
                ProcessRecord app = (ProcessRecord)msg.obj;
                synchronized (ActivityManagerService.this) {
                    processStartTimedOutLocked(app);
                }
            } break;
2. Breakpoint at checkTime(startTime, "startProcess: done updating pids map"); in the AMS method startProcessLocked (PROC_START_TIMEOUT_MSG is posted here, after the process has been started):
synchronized (mPidsSelfLocked) {
                this.mPidsSelfLocked.put(startResult.pid, app);
                if (isActivityProcess) {
                    Message msg = mHandler.obtainMessage(PROC_START_TIMEOUT_MSG);
                    msg.obj = app;
                    mHandler.sendMessageDelayed(msg, startResult.usingWrapper
                            ? PROC_START_TIMEOUT_WITH_WRAPPER : PROC_START_TIMEOUT);
                }
            }
            checkTime(startTime, "startProcess: done updating pids map");
3. Breakpoint in the AMS method appDiedLocked, which is called from binderDied:
@Override
public void binderDied() {
    if (DEBUG_ALL) Slog.v(
        TAG, "Death received in " + this
        + " for thread " + mAppThread.asBinder());
    synchronized(ActivityManagerService.this) {
        appDiedLocked(mApp, mPid, mAppThread, true);
    }
}
4. Breakpoint in performServiceRestartLocked in ActiveServices (this is where the service is restarted):
        public void run() {
            synchronized(mAm) {
                performServiceRestartLocked(mService);
            }
        }
5. Breakpoint in the AMS method attachApplicationLocked:
@Override
public final void attachApplication(IApplicationThread thread) {
    synchronized (this) {
        int callingPid = Binder.getCallingPid();
        final long origId = Binder.clearCallingIdentity();
        attachApplicationLocked(thread, callingPid);
        Binder.restoreCallingIdentity(origId);
    }
}
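6. Connect the phone, kill systemui, wait 10 s at the second breakpoint (so that the process start times out), then resume. If execution follows the order below, the problem reproduces every time:
appDiedLocked->performServiceRestartLocked->processStartTimedOutLocked->attachApplicationLocked (In stock Android attachApplicationLocked is not reached, because stock Android does not restart the process in processStartTimedOutLocked. Also, waiting ten seconds at the checkTime breakpoint from step 2 does not always lead to processStartTimedOutLocked. I am not sure why; it may be a side effect of debugging, and I did not dig into it. What I did instead was add a 10 s sleep after checkTime, which reproduces the problem almost every time.)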
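The timeout mechanism that this reproduction relies on (post a delayed message when the start begins, remove it when the start completes) can be sketched outside Android with a virtual clock. FakeHandler below is a hypothetical stand-in for android.os.Handler that keeps only the two calls the mechanism needs. The 10 s value is taken from the wait described in step 6, and everything else is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a Handler, driven by a virtual clock instead of a Looper. */
class FakeHandler {
    private static final class Msg {
        final String what;
        final long when;
        Msg(String what, long when) { this.what = what; this.when = when; }
    }

    private final List<Msg> queue = new ArrayList<>();
    private final List<String> handled = new ArrayList<>();
    private long now = 0;

    void sendMessageDelayed(String what, long delayMs) {
        queue.add(new Msg(what, now + delayMs));
    }

    void removeMessages(String what) {
        queue.removeIf(m -> m.what.equals(what));
    }

    /** Advances the virtual clock and delivers every message that has become due. */
    void advance(long ms) {
        now += ms;
        Iterator<Msg> it = queue.iterator();
        while (it.hasNext()) {
            Msg m = it.next();
            if (m.when <= now) {
                handled.add(m.what);
                it.remove();
            }
        }
    }

    List<String> handled() { return handled; }
}

class StartTimeoutDemo {
    static final long PROC_START_TIMEOUT = 10_000; // ms, the 10 s wait from step 6

    /** Returns true if the timeout fires when the process attaches after attachAfterMs. */
    static boolean timesOut(long attachAfterMs) {
        FakeHandler handler = new FakeHandler();
        // Process start: arm the timeout.
        handler.sendMessageDelayed("PROC_START_TIMEOUT_MSG", PROC_START_TIMEOUT);
        handler.advance(attachAfterMs);
        // Process attach: disarm the timeout (a no-op if it has already fired).
        handler.removeMessages("PROC_START_TIMEOUT_MSG");
        handler.advance(PROC_START_TIMEOUT);
        return handler.handled().contains("PROC_START_TIMEOUT_MSG");
    }

    public static void main(String[] args) {
        System.out.println(timesOut(500));    // prints false
        System.out.println(timesOut(12_000)); // prints true
    }
}
```

Holding the process at a breakpoint, or adding a sleep, for longer than the timeout corresponds to the second call: the attach arrives only after the message has been delivered.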

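IV. In-depth analysis

      With a reproduction in hand, the next question is why the services are not restarted. Answering it requires a rough idea of how a service restart works.

4.1 How a service is restarted, in brief

    Every service start goes through retrieveServiceLocked. In that method the current ServiceRecord is stored in a ServiceRestarter object. The relevant code is: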
final ServiceRestarter res = new ServiceRestarter();
...
r = new ServiceRecord(mAm, ss, name, filter, sInfo, callingFromFg, res);
res.setService(r);

ServiceRecord(ActivityManagerService ams,
BatteryStatsImpl.Uid.Pkg.Serv servStats, ComponentName name,
Intent.FilterComparison intent, ServiceInfo sInfo, boolean callerIsFg,
Runnable restarter) {
    ...
    this.restarter = restarter;
    ...
}

private class ServiceRestarter implements Runnable {
    private ServiceRecord mService;

    void setService(ServiceRecord service) {
        mService = service;
    }

    public void run() {
        synchronized(mAm) {
            performServiceRestartLocked(mService);
        }
    }
}
performServiceRestartLocked is one of the key methods in a service restart. Here is the call stack of a service restart:
at com.android.server.am.ActiveServices.scheduleServiceRestartLocked(ActiveServices.java:2078)
at com.android.server.am.ActiveServices.killServicesLocked(ActiveServices.java:3303)
at com.android.server.am.ActivityManagerService.cleanUpApplicationRecordLocked(ActivityManagerService.java:18579)
at com.android.server.am.ActivityManagerService.handleAppDiedLocked(ActivityManagerService.java:5527)
at com.android.server.am.ActivityManagerService.appDiedLocked(ActivityManagerService.java:5731)
at com.android.server.am.ActivityManagerService$AppDeathRecipient.binderDied(ActivityManagerService.java:1669)
- locked <0x2cc2> (a com.android.server.am.ActivityManagerService)
at android.os.BinderProxy.sendDeathNotice(Binder.java:849)
killServicesLocked decides whether each service needs to be restarted:
final void killServicesLocked(ProcessRecord app, boolean allowRestart) {
    ...
    // Now do remaining service cleanup.
    for (int i=app.services.size()-1; i>=0; i--) { // walk the services recorded in the ProcessRecord, if any; every service the process has started is recorded here
        ...
    } else {
        boolean canceled = scheduleServiceRestartLocked(sr, true); // schedule a restart of the service

        ...
    }
    ...
    if (!allowRestart) {
        app.services.clear(); // if restarting is not allowed, clear the services
        ...
    }
    ...
}
Finally, scheduleServiceRestartLocked posts the Runnable that restarts the service:
private final boolean scheduleServiceRestartLocked(ServiceRecord r, boolean allowCancel) {
    boolean canceled = false;
    ...
    final long now = SystemClock.uptimeMillis();
    ...
    } else {
        // Persistent processes are immediately restarted, so there is no
        // reason to hold of on restarting their services.
        r.totalRestartCount++;
        r.restartCount = 0;
        r.restartDelay = 0;
        r.nextRestartTime = now; // the current time
    }
    // If mRestartingServices does not already contain this service, add it.
    // performServiceRestartLocked later restarts services based on mRestartingServices.
    if (!mRestartingServices.contains(r)) {
        r.createdFromFg = false;
        mRestartingServices.add(r);
        r.makeRestarting(mAm.mProcessStats.getMemFactorLocked(), now);
    }

    cancelForegroundNotificationLocked(r);
    mAm.mHandler.removeCallbacks(r.restarter);
    mAm.mHandler.postAtTime(r.restarter, r.nextRestartTime);
    r.nextRestartTime = SystemClock.uptimeMillis() + r.restartDelay;
    Slog.w(TAG, "Scheduling restart of crashed service "
            + r.shortName + " in " + r.restartDelay + "ms");
    ....
    EventLog.writeEvent(EventLogTags.AM_SCHEDULE_SERVICE_RESTART,
            r.userId, r.shortName, r.restartDelay);

    return canceled;
}
When the time passed to postAtTime arrives, performServiceRestartLocked runs and calls bringUpServiceLocked:
private String bringUpServiceLocked(ServiceRecord r, int intentFlags, boolean execInFg,
boolean whileRestarting, boolean permissionsReviewRequired)
throws TransactionTooLargeException {
    ...

    // We are now bringing the service up, so no longer in the
    // restarting state.
    if (mRestartingServices.remove(r)) { // remove this service from mRestartingServices
        clearRestartingIfNeededLocked(r);
    }
    ...
    if (!mPendingServices.contains(r)) {
        mPendingServices.add(r); // add this service to mPendingServices
    }
    ...
    return null;
}
When the process starts, attachApplicationLocked also restarts services based on what is recorded in mRestartingServices. attachApplicationLocked contains the following code:
// Find any services that should be running in this process...
if (!badApp) {
    try {
        didSomething |= mServices.attachApplicationLocked(app, processName);
        checkTime(startTime, "attachApplicationLocked: after mServices.attachApplicationLocked");
    } catch (Exception e) {
        Slog.wtf(TAG, "Exception thrown starting services in " + app, e);
        badApp = true;
    }
}
mServices.attachApplicationLocked in turn reposts the restarter of each matching service recorded in mRestartingServices:
boolean attachApplicationLocked(ProcessRecord proc, String processName)
throws RemoteException {
    ...
    // Also, if there are any services that are waiting to restart and
    // would run in this process, now is a good time to start them. It would
    // be weird to bring up the process but arbitrarily not let the services
    // run at this point just because their restart time hasn't come up.
    if (mRestartingServices.size() > 0) {
        ServiceRecord sr;
        for (int i=0; i<mRestartingServices.size(); i++) {
            sr = mRestartingServices.get(i);
            if (proc != sr.isolatedProc && (proc.uid != sr.appInfo.uid
                    || !processName.equals(sr.processName))) {
                continue;
            }
            mAm.mHandler.removeCallbacks(sr.restarter);
            mAm.mHandler.post(sr.restarter);
        }
    }
    return didSomething;
}
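To summarize, a service restart can be triggered from two places: the appDiedLocked flow and the attachApplicationLocked flow. If performServiceRestartLocked runs first after the Runnable has been posted in the appDiedLocked flow, the attachApplicationLocked flow does not post the Runnable again, because mRestartingServices.size() is 0 by then. If attachApplicationLocked runs first, it removes the previously posted Runnable and posts it again. That is enough about service restarts for now. It may be a little dizzying, and mRestartingServices comes up again below, so just remember this: a service is added to it in the appDiedLocked flow and removed from it in performServiceRestartLocked (inside bringUpServiceLocked).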

4.2 Why the services were not restarted

    Let us recall the call order:
Abnormal flow:
appDiedLocked->performServiceRestartLocked->processStartTimedOutLocked->attachApplicationLocked
Normal flows:
appDiedLocked->performServiceRestartLocked->attachApplicationLocked, or appDiedLocked->attachApplicationLocked->performServiceRestartLocked
Analysis of the normal flows:
Case 1: appDiedLocked (add to mRestartingServices, post the Runnable that restarts the service, restart the process) -> performServiceRestartLocked (remove from mRestartingServices, restart the service) -> attachApplicationLocked (process restarted)
Case 2: appDiedLocked (add to mRestartingServices, post the Runnable that restarts the service, restart the process) -> attachApplicationLocked (process restarted) -> performServiceRestartLocked (remove from mRestartingServices, restart the service)

Analysis of the abnormal flow:
appDiedLocked (add to mRestartingServices, post the Runnable that restarts the service, restart the process) -> performServiceRestartLocked (remove from mRestartingServices, restart the service) -> processStartTimedOutLocked (process killed, process restarted) -> attachApplicationLocked (process restarted; mRestartingServices was already emptied in performServiceRestartLocked, so the service is not restarted again)
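The three orderings can be replayed on a toy model of the bookkeeping. This is a sketch under stated assumptions, not the framework code: the class and method names are hypothetical, the lists only mimic mRestartingServices and mPendingServices, and the model assumes that a service start still waiting for its process is dropped when the process start times out. That last point is my reading of why the service is lost; the excerpts above do not show it.

```java
import java.util.ArrayList;
import java.util.List;

class ServiceRestartModel {
    final List<String> restartingServices = new ArrayList<>(); // mimics mRestartingServices
    final List<String> pendingServices = new ArrayList<>();    // mimics mPendingServices
    final List<String> postedRestarters = new ArrayList<>();   // restarters queued on the handler
    final List<String> runningServices = new ArrayList<>();
    boolean processAttached = false;

    void appDied(String service) {
        processAttached = false;
        runningServices.remove(service);
        if (!restartingServices.contains(service)) restartingServices.add(service);
        postedRestarters.remove(service); // removeCallbacks, then post again
        postedRestarters.add(service);
    }

    void performServiceRestart(String service) {
        postedRestarters.remove(service);
        if (!restartingServices.remove(service)) return; // nothing scheduled for it
        if (processAttached) {
            runningServices.add(service);
        } else if (!pendingServices.contains(service)) {
            pendingServices.add(service); // wait for the process to attach
        }
    }

    void processStartTimedOut() {
        processAttached = false;
        pendingServices.clear(); // assumption: pending starts are dropped on timeout
    }

    void attachApplication() {
        processAttached = true;
        runningServices.addAll(pendingServices);
        pendingServices.clear();
        for (String s : restartingServices) { // repost restarters of waiting services
            postedRestarters.remove(s);
            postedRestarters.add(s);
        }
    }

    /** Replays the events and reports whether the service ends up running. */
    static boolean serviceRunsAfter(String... events) {
        ServiceRestartModel m = new ServiceRestartModel();
        final String svc = "SystemUIService";
        for (String e : events) {
            switch (e) {
                case "appDied": m.appDied(svc); break;
                case "restart": m.performServiceRestart(svc); break;
                case "timeout": m.processStartTimedOut(); break;
                case "attach": m.attachApplication(); break;
                default: throw new IllegalArgumentException(e);
            }
        }
        // The handler eventually runs any restarter that is still queued.
        for (String s : new ArrayList<>(m.postedRestarters)) m.performServiceRestart(s);
        return m.runningServices.contains(svc);
    }

    public static void main(String[] args) {
        System.out.println(serviceRunsAfter("appDied", "restart", "attach"));            // prints true
        System.out.println(serviceRunsAfter("appDied", "attach", "restart"));            // prints true
        System.out.println(serviceRunsAfter("appDied", "restart", "timeout", "attach")); // prints false
    }
}
```

In the model, the service is lost only when the restarter runs before the timeout. If the timeout comes first, the service is still in the restarting list when the process attaches.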

V. Solutions

1. Restart the services in attachApplicationLocked: xxx
The idea is:
appDiedLocked (add to mRestartingServices, no longer post the Runnable that restarts the service, restart the process) -> processStartTimedOutLocked (process killed, process restarted) -> attachApplicationLocked (process restarted) -> performServiceRestartLocked (remove from mRestartingServices, restart the service)

2. In processStartTimedOutLocked, put the services back into mRestartingServices so that they are restarted: xxx

3. In processStartTimedOutLocked, special-case systemui and restart its services: xxx
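To make option 2 concrete, here is a minimal sketch on a toy model. It is not the actual patch, which sits behind the xxx link. All names are hypothetical, the sets only mimic mRestartingServices and mPendingServices, and the model assumes that without the fix a start still waiting for its process is dropped on timeout.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class TimeoutFixSketch {
    final Set<String> restarting = new LinkedHashSet<>(); // mimics mRestartingServices
    final Set<String> pending = new LinkedHashSet<>();    // mimics mPendingServices
    final Set<String> running = new LinkedHashSet<>();
    final boolean requeueOnTimeout;                       // true = option 2 applied

    TimeoutFixSketch(boolean requeueOnTimeout) {
        this.requeueOnTimeout = requeueOnTimeout;
    }

    void appDied(String service) {
        running.remove(service);
        restarting.add(service);
    }

    /** The restarter runs while the new process has not attached yet. */
    void performServiceRestartBeforeAttach(String service) {
        if (restarting.remove(service)) {
            pending.add(service);
        }
    }

    void processStartTimedOut() {
        if (requeueOnTimeout) {
            restarting.addAll(pending); // option 2: put the services back
        }
        pending.clear();
    }

    void attachApplication() {
        running.addAll(pending);
        pending.clear();
        // Simplification: reposted restarters are treated as running immediately.
        running.addAll(restarting);
        restarting.clear();
    }

    /** Replays the abnormal ordering and reports whether the service comes back. */
    static boolean serviceSurvives(boolean requeueOnTimeout) {
        TimeoutFixSketch m = new TimeoutFixSketch(requeueOnTimeout);
        final String svc = "SystemUIService";
        m.appDied(svc);
        m.performServiceRestartBeforeAttach(svc);
        m.processStartTimedOut();
        m.attachApplication();
        return m.running.contains(svc);
    }

    public static void main(String[] args) {
        System.out.println(serviceSurvives(false)); // prints false
        System.out.println(serviceSurvives(true));  // prints true
    }
}
```

Option 1 reaches the same end state by never letting the restarter run before the process attaches, and option 3 limits the change to systemui.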