How WatchDog Works

1. Overview

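In Android, a hardware WatchDog periodically checks whether critical hardware is working properly. Similarly, the framework layer has a software WatchDog that periodically checks whether critical system services have deadlocked. Its main job is to determine whether core system services and important threads are in the Blocked state. It does two things:

  • monitors the reboot broadcast;
  • monitors whether the key system services in mMonitors are deadlocked.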

2. WatchDog Initialization

2.1 startOtherServices

[-> SystemServer.java]

private void startOtherServices() {
    ...
    // Create the watchdog [see Section 2.2]
    final Watchdog watchdog = Watchdog.getInstance();
    // Register the reboot broadcast receiver [see the init section]
    watchdog.init(context, mActivityManagerService);
    ...
    mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY); //480
    ...
    mActivityManagerService.systemReady(new Runnable() {

       public void run() {
           mSystemServiceManager.startBootPhase(
                   SystemService.PHASE_ACTIVITY_MANAGER_READY);
           ...
           // Start the watchdog [see Section 3.1]
           Watchdog.getInstance().start();
           mSystemServiceManager.startBootPhase(
                   SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
       }
    });
}

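WatchDog is initialized while the system_server process starts up, in three steps:

  • create the Watchdog object, which itself extends Thread;
  • register the reboot broadcast receiver;
  • call start() to begin working.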

2.2 getInstance

[-> Watchdog.java]

public static Watchdog getInstance() {
    if (sWatchdog == null) {
        // Singleton: create the instance [see Section 2.3]
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}

2.3 Creating the Watchdog

[-> Watchdog.java]

public class Watchdog extends Thread {
    // List of all HandlerChecker objects; for the HandlerChecker type [see Section 2.3.1]
    final ArrayList<HandlerChecker> mHandlerCheckers = new ArrayList<>();
    ...

    private Watchdog() {
        super("watchdog");
        // Add the foreground thread to the list
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        // Add the main thread to the list
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        // Add the ui thread to the list
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        // Add the i/o thread to the list
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        // Add the display thread to the list
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        // [see Section 2.3.2]
        addMonitor(new BinderThreadMonitor());
    }

}

Watchdog extends Thread, and the thread it creates is named "watchdog". The mHandlerCheckers list holds the HandlerChecker objects of the main, fg, ui, io, and display threads.

2.3.1 HandlerChecker

[-> Watchdog.java]

public final class HandlerChecker implements Runnable {
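    private final Handler mHandler; // Handler of the monitored thread
    private final String mName; // descriptive thread name
    private final long mWaitMax; // maximum wait time
    // Monitors (services) checked through this handler
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted; // set to false when a check starts
    private Monitor mCurrentMonitor; // monitor currently being checked
    private long mStartTime; // time at which the check was scheduled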

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis; 
        mCompleted = true;
    }
}

2.3.2 addMonitor

public class Watchdog extends Thread {
    public void addMonitor(Monitor monitor) {
        synchronized (this) {
            ...
            // mMonitorChecker is a HandlerChecker
            mMonitorChecker.addMonitor(monitor);
        }
    }

    public final class HandlerChecker implements Runnable {
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();

        public void addMonitor(Monitor monitor) {
            // Add the BinderThreadMonitor from above to the mMonitors list
            mMonitors.add(monitor);
        }
        ...
    }
}

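This is how the Binder threads are monitored: the monitor is added to mMonitors, a list member of HandlerChecker. In this case the object added to the foreground thread's checker is a BinderThreadMonitor.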

private static final class BinderThreadMonitor implements Watchdog.Monitor {
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}

blockUntilThreadAvailable ultimately calls into IPCThreadState and waits until a binder thread is free:

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        // Wait until the number of executing binder threads drops below the process's binder thread limit (16)
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

So addMonitor(new BinderThreadMonitor()) attaches a Binder thread check to the handler of the android.fg thread (mMonitorChecker), which verifies that the Binder thread pool is still working.
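The waiting logic of blockUntilThreadAvailable can be modeled in plain Java. The following is a minimal sketch, not the Binder implementation: ThreadPoolGate, begin(), end(), and demo() are hypothetical names introduced for illustration, and a ReentrantLock with a Condition stands in for the pthread mutex and condition variable.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a thread pool whose callers can wait for a free thread.
final class ThreadPoolGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition countDecremented = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadPoolGate(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    // A pool thread starts handling a transaction.
    void begin() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    // A pool thread finishes; wake anyone waiting for a free thread.
    void end() {
        lock.lock();
        try {
            executing--;
            countDecremented.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Blocks while every thread is busy, like the monitor() call described above.
    void blockUntilThreadAvailable() {
        lock.lock();
        try {
            while (executing >= maxThreads) {
                countDecremented.awaitUninterruptibly();
            }
        } finally {
            lock.unlock();
        }
    }

    // Returns {returnedWhilePoolFull, returnedAfterOneThreadFreed}.
    static boolean[] demo() {
        ThreadPoolGate gate = new ThreadPoolGate(2);
        gate.begin();
        gate.begin(); // both threads are now busy
        AtomicBoolean returned = new AtomicBoolean(false);
        Thread monitor = new Thread(() -> {
            gate.blockUntilThreadAvailable();
            returned.set(true);
        });
        monitor.setDaemon(true);
        monitor.start();
        boolean whileFull = returned.get(); // the monitor cannot pass yet
        gate.end();                          // one thread becomes free
        try {
            monitor.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] {whileFull, returned.get()};
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("returned while pool full: " + result[0]);
        System.out.println("returned after end(): " + result[1]);
    }
}
```

If no pool thread ever calls end(), the monitor never returns, and that is exactly the condition the Watchdog reports as a blocked monitor.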

2.4 init

[-> Watchdog.java]

public void init(Context context, ActivityManagerService activity) {
    mResolver = context.getContentResolver();
    mActivity = activity;
    // Register the reboot broadcast receiver [see RebootRequestReceiver below]
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}

2.4.1 RebootRequestReceiver

[-> Watchdog.java]

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            // [see rebootSystem below]
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

2.4.2 rebootSystem

[-> Watchdog.java]

void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        // Perform the reboot through PowerManager
        pms.reboot(false, reason, false);
    } catch (RemoteException ex) {
    }
}

The reboot is ultimately carried out by PowerManagerService; the detailed reboot flow will be covered separately.

3. Watchdog Detection Mechanism

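Calling Watchdog.getInstance().start() enters the run() method of the "watchdog" thread. The method has two parts:

  • the first half [Section 3.1] detects whether a timeout has been triggered;
  • the second half [Section 4.1] dumps diagnostic information once a timeout has been triggered.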

3.1 run

[-> Watchdog.java]

public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL; // CHECK_INTERVAL = 30s
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                // Schedule the check on every checker; each one records its current mStartTime [see Section 3.2]
                hc.scheduleCheckLocked();
            }

            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }

            long start = SystemClock.uptimeMillis();
            // Loop so that a full 30s elapses before continuing
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout); // if interrupted, catch the exception and keep waiting
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            
            // Evaluate the checker states [see Section 3.3]
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    // First time the wait has passed the halfway point
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    // Dump the traces of system_server and 3 native processes [see Section 4.2]
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }
            ... // Reaching this point means the Watchdog has timed out [see Section 4.1]
        }
        ...
    }
}

public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
    "/system/bin/mediaserver",
    "/system/bin/sdcard",
    "/system/bin/surfaceflinger"
};

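What this method does:

  1. Call scheduleCheckLocked() on every checker:
    • if a checker has no monitors (true for every thread except android.fg) and its looper is polling, mCompleted is set to true;
    • if the previous check has not finished yet, the call returns immediately.
  2. Wait 30s, then call evaluateCheckerCompletionLocked to evaluate the checker states;
  3. Act on the resulting waitState:
    • COMPLETED or WAITING: nothing to do;
    • WAITED_HALF (more than 30s) for the first time: dump the traces of system_server and the 3 native processes;
    • OVERDUE: dump further information.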

It follows that whenever the Watchdog fires, AMS.dumpStackTraces has been called twice, so the traces of system_server and the 3 native processes are written twice, more than 30s apart.
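The 30s wait in step 2 works because the remaining time is recomputed after every wakeup, so an interrupt or an early notify cannot shorten the interval. The following is a minimal sketch of that loop outside Android: FullIntervalWait and wakeEarly() are hypothetical names, and System.nanoTime() stands in for SystemClock.uptimeMillis().

```java
// Hypothetical helper showing the "wait the full interval" loop from run().
final class FullIntervalWait {
    private final Object lock = new Object();

    // Waits until at least intervalMillis have elapsed and returns the elapsed milliseconds.
    long waitFullInterval(long intervalMillis) {
        final long start = System.nanoTime();
        long remaining = intervalMillis;
        synchronized (lock) {
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Same policy as the loop above: ignore the interrupt and keep waiting.
                }
                // Recompute what is left, so an early wakeup does not shorten the interval.
                remaining = intervalMillis - elapsedMillis(start);
            }
        }
        return elapsedMillis(start);
    }

    // Wakes the waiter early, as a notify or a spurious wakeup would.
    void wakeEarly() {
        synchronized (lock) {
            lock.notifyAll();
        }
    }

    private static long elapsedMillis(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        FullIntervalWait waiter = new FullIntervalWait();
        Thread disturber = new Thread(() -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException ignored) {
                // Not relevant for the demo.
            }
            waiter.wakeEarly();
        });
        disturber.start();
        System.out.println("elapsed ms: " + waiter.waitFullInterval(100));
    }
}
```

Even though the helper thread wakes the waiter after about 20 ms, the reported elapsed time is never below the requested interval.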

3.2 scheduleCheckLocked

public final class HandlerChecker implements Runnable {
    ...
    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            mCompleted = true; // the target looper is idle (polling), so nothing is stuck
            return;
        }

        if (!mCompleted) {
            return; // a check is already in flight, no need to post another
        }
        mCompleted = false;

        mCurrentMonitor = null;
        // Record the current time
        mStartTime = SystemClock.uptimeMillis();
        // Post this checker at the front of the message queue; see run() below
        mHandler.postAtFrontOfQueue(this);
    }
    
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            // Call back into the monitor() method of the concrete service
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}

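What this method does: it posts the HandlerChecker to the very front of the message queue of the monitored thread's Looper, so that HandlerChecker.run() executes on that thread. run() calls monitor() on each registered monitor and then sets mCompleted = true. If the message the handler is currently processing takes so long that run() never gets a chance to execute, or a monitor() call never returns, the watchdog fires.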

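postAtFrontOfQueue(this) takes a Runnable. Through the message mechanism, the run method of HandlerChecker is eventually called back; it iterates over all Monitor objects, and each service provides its own implementation of the interface's monitor() method.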

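Possible pitfalls: if other code keeps calling postAtFrontOfQueue(), the check may never get a chance to run; or each monitor may take a little time and the total may add up to more than 1 minute. Both cases produce atypical Watchdog triggers.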
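The interplay between scheduleCheckLocked() and run() can be reproduced without Android. The following is a rough sketch under stated assumptions: CheckerSketch and demo() are hypothetical names, a single-thread executor plays the role of the monitored Handler thread, and the check is appended to the queue rather than posted at the front, since a plain executor has no equivalent of postAtFrontOfQueue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of HandlerChecker; a single-thread executor plays the monitored thread.
final class CheckerSketch implements Runnable {
    private final ExecutorService target;
    private boolean completed = true;

    CheckerSketch(ExecutorService target) {
        this.target = target;
    }

    // Posts the check unless one is still pending.
    synchronized void scheduleCheck() {
        if (!completed) {
            return;
        }
        completed = false;
        target.execute(this); // the real code posts at the front of the queue instead
    }

    // Runs on the monitored thread; only reachable if that thread is making progress.
    @Override
    public synchronized void run() {
        completed = true;
    }

    synchronized boolean isCompleted() {
        return completed;
    }

    // Returns {completedWhileThreadStuck, completedAfterThreadRecovered}.
    static boolean[] demo() {
        ExecutorService thread = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // A long-running "message" that occupies the monitored thread.
        thread.execute(() -> awaitQuietly(release));
        CheckerSketch checker = new CheckerSketch(thread);
        checker.scheduleCheck();
        boolean whileStuck = checker.isCompleted();
        release.countDown(); // the stuck message finishes
        thread.shutdown();   // already queued tasks still run before termination
        try {
            thread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] {whileStuck, checker.isCompleted()};
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("completed while stuck: " + result[0]);
        System.out.println("completed after recovery: " + result[1]);
    }
}
```

While the first task occupies the thread, the completion flag stays false; this is the state that getCompletionStateLocked later turns into WAITING, WAITED_HALF, or OVERDUE depending on how much time has passed.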

3.3 evaluateCheckerCompletionLocked

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        // [see Section 3.4]
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

This returns the largest wait state among all checkers in the mHandlerCheckers list.

3.4 getCompletionStateLocked

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        // mWaitMax defaults to 60s
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}

  • COMPLETED = 0: the check has finished;
  • WAITING = 1: the wait is shorter than half of DEFAULT_TIMEOUT, i.e. under 30s;
  • WAITED_HALF = 2: the wait is between 30s and 60s;
  • OVERDUE = 3: the wait is 60s or longer.
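The state thresholds and the "take the worst state" aggregation from Section 3.3 can be checked with a small stand-alone sketch. CompletionStateSketch, stateOf(), and worstOf() are hypothetical names; the constant values mirror the list above.

```java
// Hypothetical stand-alone model of the four completion states.
final class CompletionStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Classifies one checker from its completion flag and the time since its check was scheduled.
    static int stateOf(boolean completed, long latencyMillis, long waitMaxMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMillis < waitMaxMillis / 2) {
            return WAITING;
        }
        if (latencyMillis < waitMaxMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // Aggregates per-checker states by keeping the largest value.
    static int worstOf(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }

    public static void main(String[] args) {
        long waitMax = 60_000; // 60s, the default timeout described above
        int main = stateOf(true, 0, waitMax);
        int fg = stateOf(false, 45_000, waitMax);
        int ui = stateOf(false, 10_000, waitMax);
        System.out.println("overall state: " + worstOf(main, fg, ui));
    }
}
```

Because the states are ordered by severity, a single slow checker is enough to move the overall result to WAITED_HALF or OVERDUE.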

4. Watchdog Handling Flow

4.1 run

[-> Watchdog.java]

public void run() {
    while (true) {
        synchronized (this) {
            ...
            // Collect the blocked checkers [see Section 4.1.1]
            blockedCheckers = getBlockedCheckersLocked();
            // Build the description [see Section 4.1.2]
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }

        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // Second dump, in append mode: stack traces of system_server and 3 native processes [see Section 4.2]
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
                
        // The system has already been blocked for 1 minute, so waiting 2s more to make sure the stack traces are written does no harm
        SystemClock.sleep(2000);

        if (RECORD_KERNEL_THREADS) {
            // Dump the kernel stacks [see Section 4.3]
            dumpKernelStackTraces();
        }

        // Trigger the kernel to dump all blocked threads [see Section 4.4]
        doSysRq('l');

        // Write the dropbox entry [see Section 4.5]
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
        dropboxThread.start();
        
        try {
            dropboxThread.join(2000); // give the dropbox thread up to 2s to finish
        } catch (InterruptedException ignored) {
        }

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            // Report the blocked state to the activity controller
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // A return value of 1 means keep waiting, -1 means kill the system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    waitedHalf = false;
                    continue; // an ActivityController can make the Watchdog keep waiting instead of killing
                }
            } catch (RemoteException e) {
            }
        }

        // Kill the process only when no debugger is attached
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            // Print the stack trace of every blocked thread
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElement element: stackTrace) {
                    Slog.w(TAG, "    at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            // Kill the system_server process [see Section 4.6]
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
        waitedHalf = false;
    }
}

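When the Watchdog detects a problem, it collects the following information:

  • AMS.dumpStackTraces: stack traces of Java and native processes;
  • WD.dumpKernelStackTraces: kernel stacks;
  • doSysRq
  • dropBox

Once the information has been collected, system_server is killed. allowRestart defaults to true; running am hang sets it to false (restart not allowed), in which case system_server is not killed.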
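The order of the checks at the end of run() decides whether system_server is actually killed. The following is a simplified sketch of that decision as a pure function. KillDecisionSketch, Outcome, and decide() are hypothetical names, a null controllerResult stands for "no IActivityController is set", and the two debugger branches of the original are merged because they differ only in their log message.

```java
// Hypothetical pure-function model of the kill decision at the end of run().
final class KillDecisionSketch {
    enum Outcome {
        KEEP_WAITING,               // the activity controller asked to continue
        SKIP_KILL_DEBUGGER,         // a debugger is or was recently connected
        SKIP_KILL_RESTART_DISABLED, // allowRestart is false, e.g. after "am hang"
        KILL_SYSTEM_SERVER
    }

    static Outcome decide(Integer controllerResult, int debuggerWasConnected, boolean allowRestart) {
        // The controller is consulted first; a result >= 0 means "keep waiting".
        if (controllerResult != null && controllerResult >= 0) {
            return Outcome.KEEP_WAITING;
        }
        // Any positive value means a debugger is attached or was attached recently.
        if (debuggerWasConnected > 0) {
            return Outcome.SKIP_KILL_DEBUGGER;
        }
        if (!allowRestart) {
            return Outcome.SKIP_KILL_RESTART_DISABLED;
        }
        return Outcome.KILL_SYSTEM_SERVER;
    }

    public static void main(String[] args) {
        System.out.println(decide(null, 0, true));
        System.out.println(decide(1, 0, true));
        System.out.println(decide(null, 0, false));
    }
}
```

Only when all three guards pass does the flow reach the kill step.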

4.1.1 getBlockedCheckersLocked

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
    ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
    // Iterate over all checkers
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        // Collect every checker that has not completed and is overdue
        if (hc.isOverdueLocked()) {
            checkers.add(hc);
        }
    }
    return checkers;
}

4.1.2 describeCheckersLocked

private String describeCheckersLocked(ArrayList<HandlerChecker> checkers) {
     StringBuilder builder = new StringBuilder(128);
     for (int i=0; i<checkers.size(); i++) {
         if (builder.length() > 0) {
             builder.append(", ");
         }
         // Append the description of each blocked checker
         builder.append(checkers.get(i).describeBlockedStateLocked());
     }
     return builder.toString();
 }
 
 
 public String describeBlockedStateLocked() {
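     // No monitor was running: the thread is stuck in a message (always the case for non-foreground threads)
     if (mCurrentMonitor == null) {
         return "Blocked in handler on " + mName + " (" + getThread().getName() + ")";
     // A monitor was running; only the foreground thread's checker has monitors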
     } else {
         return "Blocked in monitor " + mCurrentMonitor.getClass().getName()
                 + " on " + mName + " (" + getThread().getName() + ")";
     }
 }

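Every handler thread or monitor that has been blocked for more than 1 minute is recorded.

  • "Blocked in handler" means the thread has spent more than 1 minute processing its current message;
  • "Blocked in monitor" means the thread has spent more than 1 minute processing its current message, or the monitor cannot acquire its lock.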

4.2 AMS.dumpStackTraces

public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
        ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
    // Defaults to /data/anr/traces.txt
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    if (tracesPath == null || tracesPath.length() == 0) {
        return null;
    }

    File tracesFile = new File(tracesPath);
    try {
        // If clearTraces is set, delete any existing traces file
        if (clearTraces && tracesFile.exists()) tracesFile.delete();
        // Create the traces file
        tracesFile.createNewFile();
        // -rw-rw-rw-
        FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);
    } catch (IOException e) {
        return null;
    }
    // Write the trace contents
    dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
    return tracesFile;
}

This writes the traces of system_server and of the 3 native processes mediaserver, sdcard, and surfaceflinger.

4.3 WD.dumpKernelStackTraces

private File dumpKernelStackTraces() {
    // The path is /data/anr/traces.txt
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    if (tracesPath == null || tracesPath.length() == 0) {
        return null;
    }
    // [see Section 4.3.1]
    native_dumpKernelStacks(tracesPath);
    return new File(tracesPath);
}

native_dumpKernelStacks goes through JNI to the dumpKernelStacks() function in android_server_Watchdog.cpp.

4.3.1 dumpKernelStacks

[-> android_server_Watchdog.cpp]

static void dumpKernelStacks(JNIEnv* env, jobject clazz, jstring pathStr) {
    char buf[128];
    DIR* taskdir;
    
    const char *path = env->GetStringUTFChars(pathStr, NULL);
    // Open the traces file
    int outFd = open(path, O_WRONLY | O_APPEND | O_CREAT,
        S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH);
    ...

    snprintf(buf, sizeof(buf), "\n----- begin pid %d kernel stacks -----\n", getpid());
    write(outFd, buf, strlen(buf));

    // Enumerate all threads of this process
    snprintf(buf, sizeof(buf), "/proc/%d/task", getpid());
    taskdir = opendir(buf);
    if (taskdir != NULL) {
        struct dirent * ent;
        while ((ent = readdir(taskdir)) != NULL) {
            int tid = atoi(ent->d_name);
            if (tid > 0 && tid <= 65535) {
                // Dump the kernel stack of each thread [see Section 4.3.2]
                dumpOneStack(tid, outFd);
            }
        }
        closedir(taskdir);
    }

    snprintf(buf, sizeof(buf), "----- end pid %d kernel stacks -----\n", getpid());
    write(outFd, buf, strlen(buf));

    close(outFd);
done:
    env->ReleaseStringUTFChars(pathStr, path);
}

The list of all threads in the current process is obtained by reading the /proc/%d/task node.

4.3.2 dumpOneStack

[-> android_server_Watchdog.cpp]

static void dumpOneStack(int tid, int outFd) {
    char buf[64];
    // Read the /proc/%d/stack node
    snprintf(buf, sizeof(buf), "/proc/%d/stack", tid);
    int stackFd = open(buf, O_RDONLY);
    if (stackFd >= 0) {
        // Header
        strncat(buf, ":\n", sizeof(buf) - strlen(buf) - 1);
        write(outFd, buf, strlen(buf));

        // Copy the stack contents
        int nBytes;
        while ((nBytes = read(stackFd, buf, sizeof(buf))) > 0) {
            write(outFd, buf, nBytes);
        }

        // Footer
        write(outFd, "\n", 1);
        close(stackFd);
    } else {
        ALOGE("Unable to open stack of tid %d : %d (%s)", tid, errno, strerror(errno));
    }
}

4.4 WD.doSysRq

private void doSysRq(char c) {
    try {
        FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
        sysrq_trigger.write(c);
        sysrq_trigger.close();
    } catch (IOException e) {
        Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
    }
}

Writing a character to the /proc/sysrq-trigger node triggers the kernel to dump all blocked threads and to write the backtraces of all CPUs to the kernel log.

4.5 dropBox

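dropbox is covered in detail in the separate dropBox source code article; it writes files to /data/system/dropbox. When the watchdog fires, the generated dropbox file has the tag system_server_watchdog, and its content is the traces plus the corresponding blocked information.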

4.6 killProcess

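Process.killProcess is explained in detail in the separate article on how process killing is implemented; it kills the target process by sending it signal 9.

Killing system_server makes the zygote process kill itself, which in turn makes init restart Zygote. This is what shows up as the phone's framework restarting.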

5. Summary

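Watchdog is a thread named "watchdog" that runs in the system_server process:

  • When a block lasts longer than 1 minute, the watchdog fires, kills system_server, and triggers a restart of the upper layers;
  • mHandlerCheckers is the list of all HandlerChecker objects, covering the handlers of the foreground, main, ui, i/o, and display threads;
  • mMonitorChecker.mMonitors records every Monitor that Watchdog currently watches; all of these monitors run on the foreground thread.
  • There are two ways to be monitored by Watchdog:
    • addThread(): monitors a Handler object, with a default timeout of 60s. A timeout here usually means the corresponding handler thread is slow at processing messages;
    • addMonitor(): monitors a service that implements the Watchdog.Monitor interface. A timeout here may mean that the "android.fg" thread is slow at processing messages, or that the monitor cannot acquire its lock.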
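A monitor() implementation usually does nothing except acquire and release the service's main lock, so it returns promptly unless some other thread holds that lock for too long. The following is a minimal sketch of the pattern outside Android: MonitorSketch, DemoService, holdLock(), and returnsWithin() are hypothetical names, and the nested Monitor interface is only a stand-in for Watchdog.Monitor.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demonstration of a lock-probing monitor() implementation.
final class MonitorSketch {
    // Stand-in for Watchdog.Monitor.
    interface Monitor {
        void monitor();
    }

    // Hypothetical service that guards its state with a single lock.
    static final class DemoService implements Monitor {
        private final Object lock = new Object();

        // Simulates a stuck operation: holds the lock until release is counted down.
        void holdLock(CountDownLatch held, CountDownLatch release) {
            synchronized (lock) {
                held.countDown();
                awaitQuietly(release);
            }
        }

        // Returns only once the lock can be acquired.
        @Override
        public void monitor() {
            synchronized (lock) {
                // Intentionally empty: acquiring the lock is the whole check.
            }
        }
    }

    // Runs monitor() on a helper thread and reports whether it returned in time.
    static boolean returnsWithin(Monitor m, long timeoutMillis) {
        Thread t = new Thread(m::monitor);
        t.setDaemon(true);
        t.start();
        try {
            t.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }

    // Returns {returnedWhileLockHeld, returnedAfterRelease}.
    static boolean[] demo() {
        DemoService service = new DemoService();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> service.holdLock(held, release));
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(held); // make sure the lock is really held
        boolean whileHeld = returnsWithin(service, 100);
        release.countDown();
        boolean afterRelease = returnsWithin(service, 5000);
        return new boolean[] {whileHeld, afterRelease};
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("returned while lock held: " + result[0]);
        System.out.println("returned after release: " + result[1]);
    }
}
```

This is why a "Blocked in monitor" report often points to a lock held by another thread rather than to a problem in the foreground thread itself.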

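In the following cases, system_server is not killed even when the Watchdog fires:

  • monkey: an IActivityController is set and intercepts the systemNotResponding event, as monkey does;
  • hang: the am hang command was run, so no restart happens;
  • debugger: a debugger is attached, so no restart happens.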

5.1 Output

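If a check stays blocked for 1 minute, the watchdog outputs the following:

  1. AMS.dumpStackTraces: the traces of system_server and 3 native processes
    • this is called twice, first when the wait reaches 30s and again at 1 minute;
  2. WD.dumpKernelStackTraces: the kernel stack of every thread in the system_server process;
    • the /proc/%d/task node lists all threads of the process
    • the /proc/%d/stack node provides the kernel stack
  3. doSysRq: triggers the kernel to dump all blocked threads and to write the backtraces of all CPUs to the kernel log;
    • node /proc/sysrq-trigger
  4. dropBox: writes a file to /data/system/dropbox containing the traces plus the blocked information
  5. Kill system_server, which makes the zygote process kill itself and thereby restarts the upper framework.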

5.2 Handler-based Monitoring

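The threads monitored by Watchdog are listed below. DEFAULT_TIMEOUT is 60s by default; it is 10s only when debugging, which makes it easier to find potential ANR problems.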

Thread name          Handler                               Description
system_server        new Handler(Looper.getMainLooper())   Main thread of the process
android.fg           FgThread.getHandler                   Foreground thread
android.ui           UiThread.getHandler                   UI thread
android.io           IoThread.getHandler                   I/O thread
android.display      DisplayThread.getHandler              Display thread
ActivityManager      AMS.MainHandler                       Used in the AMS constructor
PowerManagerService  PMS.PowerManagerHandler               Used in PMS.onStart()

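The watchdog currently monitors the 7 threads above in the system_server process; the Looper of each of these threads must never take more than 1 minute to process a message.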

5.3 Monitor-based Monitoring

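Every system service that Watchdog can monitor implements the Watchdog.Monitor interface and its monitor() method. The monitors run on the android.fg thread. The main classes that implement the interface are: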

  • ActivityManagerService
  • WindowManagerService
  • InputManagerService
  • PowerManagerService
  • NetworkManagementService
  • MountService
  • NativeDaemonConnector
  • BinderThreadMonitor
  • MediaProjectionManagerService
  • MediaRouterService
  • MediaSessionService