An Introduction to RescueParty in Android O

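I. Overview

Android can end up in states it cannot recover from on its own, for example failing to boot, or a persistent system process crashing in an endless loop. In these situations the phone is usually unusable, and a non-technical user typically has no idea how to repair it and can only send it in for service. Android O adds a rescue mechanism for exactly these cases, called RescueParty.

Roughly, RescueParty works like this: when apps under the same uid fail repeatedly, RescueParty counts the occurrences per uid, and once the count reaches a preset threshold it escalates the rescue strategy. The rescue levels are:

1. NONE
2. RESET_SETTINGS_UNTRUSTED_DEFAULTS
3. RESET_SETTINGS_UNTRUSTED_CHANGES
4. RESET_SETTINGS_TRUSTED_DEFAULTS
5. FACTORY_RESET

The final rescue strategy is to reboot into recovery mode.

So which scenarios trigger this mechanism?

1. a persistent app is stuck in a crash loop
2. we're stuck in a runtime restart loop.

II. How RescueParty works

Let's start from the first scenario, "a persistent app is stuck in a crash loop". The app crash flow itself is not covered in detail here; a sequence diagram summarizes it:

[Figure: sequence diagram of the app crash flow]

In O, RescueParty monitoring was added to the crashApplicationInner method in AppErrors.java. The code is: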
void crashApplicationInner(ProcessRecord r, ApplicationErrorReport.CrashInfo crashInfo,
	int callingPid, int callingUid) {
	// ...
	// If a persistent app is stuck in a crash loop, the device isn't very
	// usable, so we want to consider sending out a rescue party.
	if (r != null && r.persistent) {
		RescueParty.notePersistentAppCrash(mContext, r.uid);
	}
	
	AppErrorResult result = new AppErrorResult();
	TaskRecord task;
	// ...
}


This calls RescueParty.notePersistentAppCrash, passing in the Context and the uid of the process. Let's step into that method:
/**
* Take note of a persistent app crash. If we notice too many of these
* events happening in rapid succession, we'll send out a rescue party.
*/
public static void notePersistentAppCrash(Context context, int uid) {
	if (isDisabled()) return;
	Threshold t = sApps.get(uid);
	if (t == null) {
		t = new AppThreshold(uid);
		sApps.put(uid, t);
	}
	if (t.incrementAndTest()) {
		t.reset();
		incrementRescueLevel(t.uid);
		executeRescueLevel(context);
	}
}
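The first step checks whether the RescueParty mechanism is disabled. It is disabled in the following cases:

1. The build is an eng build.
2. The build is a userdebug build and USB is currently connected.
3. The system property persist.sys.disable_rescue is true (check it with getprop persist.sys.disable_rescue).

In all other cases it is enabled. Back in notePersistentAppCrash: if RescueParty is not disabled, execution continues with: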
	Threshold t = sApps.get(uid);
	if (t == null) {
		t = new AppThreshold(uid);
		sApps.put(uid, t);
	}
	if (t.incrementAndTest()) {
		t.reset();
		incrementRescueLevel(t.uid);
		executeRescueLevel(context);
	}
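The isDisabled check itself is not shown in the snippets here. As a rough, stand-alone sketch, assuming the three conditions listed earlier are the only ones, it could be modelled as below. The class name RescueDisableSketch and its boolean parameters are hypothetical stand-ins for the build type, the USB state and the system property; this is not the AOSP implementation.

```java
// Hypothetical stand-alone model of the disable check; not the AOSP source.
final class RescueDisableSketch {

    /**
     * @param isEngBuild        true on an eng build
     * @param isUserdebugBuild  true on a userdebug build
     * @param usbConnected      true while USB is connected
     * @param disableRescueProp value of persist.sys.disable_rescue
     * @return true if the rescue mechanism should do nothing
     */
    static boolean isDisabled(boolean isEngBuild, boolean isUserdebugBuild,
            boolean usbConnected, boolean disableRescueProp) {
        if (isEngBuild) return true;                        // case 1: eng build
        if (isUserdebugBuild && usbConnected) return true;  // case 2: userdebug + USB
        if (disableRescueProp) return true;                 // case 3: explicit opt-out
        return false;                                       // everything else: enabled
    }

    public static void main(String[] args) {
        // userdebug build with USB attached
        System.out.println(isDisabled(false, true, true, false));
        // userdebug build without USB
        System.out.println(isDisabled(false, true, false, false));
        // user build with the property set
        System.out.println(isDisabled(false, false, false, true));
    }
}
```

Passing the inputs in as parameters keeps the sketch runnable off-device; on a real device they would come from the build type, the USB state and the system property.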
Let's first look at the definition of sApps:
/** Threshold for app crash loops */
private static SparseArray<Threshold> sApps = new SparseArray<>();
Each uid maps to one Threshold object. The code looks up the Threshold for the given uid; if there is none (null), it creates a new one and puts it into sApps. It then calls incrementAndTest. Let's see what incrementAndTest does:
/**
* @return if this threshold has been triggered
*/
public boolean incrementAndTest() {
	final long now = SystemClock.elapsedRealtime();
	final long window = now - getStart();
	if (window > triggerWindow) {
		setCount(1);
		setStart(now);
		return false;
	} else {
		int count = getCount() + 1;
		setCount(count);
		EventLogTags.writeRescueNote(uid, count, window);
		Slog.w(TAG, "Noticed " + count + " events for UID " + uid + " in last "
			+ (window / 1000) + " sec");
		return (count >= triggerCount);
	}
}
Next, let's look at the getStart / setStart / setCount / getCount methods, using BootThreshold (the subclass used for counting boots) as the example:
private static class BootThreshold extends Threshold {
	public BootThreshold() {
		// We're interested in 5 events in any 300 second period; this
		// window is super relaxed because booting can take a long time if
		// forced to dexopt things.
		super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);
	}

	@Override
	public int getCount() {
		return SystemProperties.getInt(PROP_RESCUE_BOOT_COUNT, 0);
	}

	@Override
	public void setCount(int count) {
		SystemProperties.set(PROP_RESCUE_BOOT_COUNT, Integer.toString(count));
	}

	@Override
	public long getStart() {
		return SystemProperties.getLong(PROP_RESCUE_BOOT_START, 0);
	}

	@Override
	public void setStart(long start) {
		SystemProperties.set(PROP_RESCUE_BOOT_START, Long.toString(start));
	}
}
These methods simply store the start time and the count in system properties. From the code above we can also see that BootThreshold extends Threshold and calls its constructor:
super(android.os.Process.ROOT_UID, 5, 300 * DateUtils.SECOND_IN_MILLIS);

private abstract static class Threshold {
	// ...
	public Threshold(int uid, int triggerCount, long triggerWindow) {
		this.uid = uid;
		this.triggerCount = triggerCount;
		this.triggerWindow = triggerWindow;
	}
	// ...
}
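From this we know that, for BootThreshold, triggerWindow is 300000 (milliseconds, i.e. 300 seconds) and triggerCount is 5. The AppThreshold created in notePersistentAppCrash is a separate subclass whose definition is not shown here.

Now the meaning of incrementAndTest is clear. If more than triggerWindow milliseconds have passed since the start of the current window, the count is set to 1 and the start time is set to now (the window is restarted), and the method returns false. Otherwise the count is incremented and stored, and the method checks whether the count has reached triggerCount (5): it returns true when the count is greater than or equal to triggerCount. When it returns true, the caller executes: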
t.reset();
incrementRescueLevel(t.uid);
executeRescueLevel(context);
Let's look at the implementation of each of these three methods:
public void reset() {
	setCount(0);
	setStart(0);
}
reset sets both the count and the start time back to 0.
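To see incrementAndTest and reset working together, here is a minimal off-device model of the windowed counter. It is only a sketch: the class name WindowedThresholdSketch is hypothetical, the count and start time are kept in plain fields rather than system properties, and an injectable LongSupplier stands in for SystemClock.elapsedRealtime() so the example runs on a desktop JVM.

```java
import java.util.function.LongSupplier;

// Hypothetical in-memory model of the Threshold logic; not the AOSP class.
final class WindowedThresholdSketch {
    private final int triggerCount;
    private final long triggerWindowMillis;
    private final LongSupplier clock; // stand-in for SystemClock.elapsedRealtime()
    private int count;
    private long start;

    WindowedThresholdSketch(int triggerCount, long triggerWindowMillis, LongSupplier clock) {
        this.triggerCount = triggerCount;
        this.triggerWindowMillis = triggerWindowMillis;
        this.clock = clock;
    }

    /** @return true once triggerCount events fall inside one window */
    boolean incrementAndTest() {
        final long now = clock.getAsLong();
        final long window = now - start;
        if (window > triggerWindowMillis) {
            // Window expired: this event starts a new window.
            count = 1;
            start = now;
            return false;
        }
        count++;
        return count >= triggerCount;
    }

    void reset() {
        count = 0;
        start = 0;
    }

    int getCount() {
        return count;
    }

    public static void main(String[] args) {
        final long[] now = {1_000_000L}; // fake clock, in milliseconds
        WindowedThresholdSketch t =
                new WindowedThresholdSketch(5, 300_000L, () -> now[0]);
        for (int i = 1; i <= 5; i++) {
            System.out.println("event " + i + " triggered=" + t.incrementAndTest());
            now[0] += 10_000L; // 10 seconds between events
        }
    }
}
```

With events 10 seconds apart, only the fifth call returns true. If the gap between the window start and an event exceeds the window length, the count starts again from 1.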
/**
* Escalate to the next rescue level. After incrementing the level you'll
* probably want to call {@link #executeRescueLevel(Context)}.
*/
private static void incrementRescueLevel(int triggerUid) {
	final int level = MathUtils.constrain(
		SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE) + 1,
		LEVEL_NONE, LEVEL_FACTORY_RESET);
	SystemProperties.set(PROP_RESCUE_LEVEL, Integer.toString(level));

	EventLogTags.writeRescueLevel(level, triggerUid);
	PackageManagerService.logCriticalInfo(Log.WARN, "Incremented rescue level to "
		+ levelToString(level) + " triggered by UID " + triggerUid);
}
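This code reads the current level, adds 1, clamps the result to the range LEVEL_NONE to LEVEL_FACTORY_RESET, and stores it back into the system properties.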
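The escalation step can be tried in isolation. The sketch below assumes the five levels are numbered 0 to 4 in the order listed in the overview (the numeric values are an assumption, not taken from the snippets here) and re-implements the clamp locally instead of calling MathUtils.constrain. The class name RescueLevelSketch is hypothetical.

```java
// Hypothetical model of level escalation; numeric level values are assumed.
final class RescueLevelSketch {
    static final int LEVEL_NONE = 0;
    static final int LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS = 1;
    static final int LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES = 2;
    static final int LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS = 3;
    static final int LEVEL_FACTORY_RESET = 4;

    /** Local clamp with the same contract as a (value, low, high) constrain. */
    static int constrain(int amount, int low, int high) {
        return amount < low ? low : (amount > high ? high : amount);
    }

    /** Next level after one more trigger; never goes past FACTORY_RESET. */
    static int nextLevel(int current) {
        return constrain(current + 1, LEVEL_NONE, LEVEL_FACTORY_RESET);
    }

    static String levelToString(int level) {
        switch (level) {
            case LEVEL_NONE: return "NONE";
            case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS: return "RESET_SETTINGS_UNTRUSTED_DEFAULTS";
            case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES: return "RESET_SETTINGS_UNTRUSTED_CHANGES";
            case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS: return "RESET_SETTINGS_TRUSTED_DEFAULTS";
            case LEVEL_FACTORY_RESET: return "FACTORY_RESET";
            default: return Integer.toString(level);
        }
    }

    public static void main(String[] args) {
        int level = LEVEL_NONE;
        // Six triggers in a row: the level climbs, then stays at the top.
        for (int i = 0; i < 6; i++) {
            level = nextLevel(level);
            System.out.println("trigger " + (i + 1) + " -> " + levelToString(level));
        }
    }
}
```

Because of the clamp, further triggers after FACTORY_RESET keep the level at FACTORY_RESET rather than overflowing.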
private static void executeRescueLevel(Context context) {
	final int level = SystemProperties.getInt(PROP_RESCUE_LEVEL, LEVEL_NONE);
	if (level == LEVEL_NONE) return;

	Slog.w(TAG, "Attempting rescue level " + levelToString(level));
	try {
		executeRescueLevelInternal(context, level);
		EventLogTags.writeRescueSuccess(level);
		PackageManagerService.logCriticalInfo(Log.DEBUG,
			"Finished rescue level " + levelToString(level));
	} catch (Throwable t) {
		final String msg = ExceptionUtils.getCompleteMessage(t);
		EventLogTags.writeRescueFailure(level, msg);
		PackageManagerService.logCriticalInfo(Log.ERROR,
			"Failed rescue level " + levelToString(level) + ": " + msg);
	}
}


This first reads the current level and returns early if it is NONE; otherwise it calls executeRescueLevelInternal. Let's see what executeRescueLevelInternal does:
private static void executeRescueLevelInternal(Context context, int level) throws Exception {
	switch (level) {
		case LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS:
			resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_DEFAULTS);
			break;
		case LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES:
			resetAllSettings(context, Settings.RESET_MODE_UNTRUSTED_CHANGES);
			break;
		case LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS:
			resetAllSettings(context, Settings.RESET_MODE_TRUSTED_DEFAULTS);
			break;
		case LEVEL_FACTORY_RESET:
			RecoverySystem.rebootPromptAndWipeUserData(context, TAG);
			break;
	}
}
This is where the system is rescued according to the current level. There are four active levels:

1. LEVEL_RESET_SETTINGS_UNTRUSTED_DEFAULTS
2. LEVEL_RESET_SETTINGS_UNTRUSTED_CHANGES
3. LEVEL_RESET_SETTINGS_TRUSTED_DEFAULTS
4. LEVEL_FACTORY_RESET

Let's look at what each level does. The first three all call resetAllSettings, so we start with that method:
private static void resetAllSettings(Context context, int mode) throws Exception {
	// Try our best to reset all settings possible, and once finished
	// rethrow any exception that we encountered
	Exception res = null;
	final ContentResolver resolver = context.getContentResolver();
	try {
		Settings.Global.resetToDefaultsAsUser(resolver, null, mode, UserHandle.USER_SYSTEM);
	} catch (Throwable t) {
		res = new RuntimeException("Failed to reset global settings", t);
	}
	for (int userId : getAllUserIds()) {
		try {
			Settings.Secure.resetToDefaultsAsUser(resolver, null, mode, userId);
		} catch (Throwable t) {
			res = new RuntimeException("Failed to reset secure settings for " + userId, t);
		}
	}
	if (res != null) {
		throw res;
	}
}
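The control flow of resetAllSettings is a "try every step, rethrow at the end" pattern. The sketch below isolates that pattern with hypothetical names (BestEffortSketch, Step, runAll); the steps are plain lambdas standing in for the individual settings reset calls, so nothing here touches the Android Settings API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the best-effort pattern; not the AOSP code.
final class BestEffortSketch {

    /** One unit of work that may fail; stand-in for a single reset call. */
    interface Step {
        void run() throws Exception;
    }

    /**
     * Runs every step even if earlier ones fail, then rethrows.
     * Only the most recent failure is kept, as in the original method.
     */
    static void runAll(List<Step> steps) throws Exception {
        Exception res = null;
        for (int i = 0; i < steps.size(); i++) {
            try {
                steps.get(i).run();
            } catch (Throwable t) {
                res = new RuntimeException("Failed step " + i, t);
            }
        }
        if (res != null) {
            throw res;
        }
    }

    public static void main(String[] args) {
        List<String> done = new ArrayList<>();
        List<Step> steps = new ArrayList<>();
        steps.add(() -> done.add("global"));
        steps.add(() -> { throw new IllegalStateException("boom"); });
        steps.add(() -> done.add("secure"));
        try {
            runAll(steps);
        } catch (Exception e) {
            // The failing middle step did not stop the last one from running.
            System.out.println("completed=" + done + ", error=" + e.getMessage());
        }
    }
}
```

One consequence of this design is that if several steps fail, the caller only sees the last failure; the earlier ones are overwritten.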


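resetAllSettings makes a best effort to reset every setting it can, using the reset mode that corresponds to the current level; readers who are interested can dig into the details. Now for the last level, which calls rebootPromptAndWipeUserData in the RecoverySystem class. This puts the system into recovery mode. We won't walk through the full flow; here is a call stack instead: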
"Binder:[email protected]" prio=5 tid=0xbe nid=NA waiting
java.lang.Thread.State: WAITING
blocks Binder:[email protected]
waiting for [email protected] to release lock on <0x2562> (a com.android.server.power.PowerManagerService$4)
at java.lang.Object.wait(Object.java:-1)
at com.android.server.power.PowerManagerService.shutdownOrRebootInternal(PowerManagerService.java:2802)
locked <0x2562> (a com.android.server.power.PowerManagerService$4)
at com.android.server.power.PowerManagerService.-wrap35(PowerManagerService.java:-1)
at com.android.server.power.PowerManagerService$BinderService.reboot(PowerManagerService.java:4483)
at android.os.PowerManager.reboot(PowerManager.java:969)
at com.android.server.RecoverySystemService$BinderService.rebootRecoveryWithCommand(RecoverySystemService.java:193)
locked <0x25e1> (a java.lang.Object)
at android.os.RecoverySystem.rebootRecoveryWithCommand(RecoverySystem.java:1146)
at android.os.RecoverySystem.bootCommand(RecoverySystem.java:925)
at android.os.RecoverySystem.rebootPromptAndWipeUserData(RecoverySystem.java:855)
at com.android.server.RescueParty.executeRescueLevelInternal(RescueParty.java:190)
at com.android.server.RescueParty.executeRescueLevel(RescueParty.java:166)
at com.android.server.RescueParty.notePersistentAppCrash(RescueParty.java:126)
at com.android.server.am.AppErrors.crashApplicationInner(AppErrors.java:343)
at com.android.server.am.AppErrors.crashApplication(AppErrors.java:322)
at com.android.server.am.ActivityManagerService.handleApplicationCrashInner(ActivityManagerService.java:14621)
at com.android.server.am.ActivityManagerService.handleApplicationCrash(ActivityManagerService.java:14603)
at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:79)
at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3011)
at android.os.Binder.execTransact(Binder.java:677)
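Eventually the call reaches the PowerManagerService.lowLevelReboot method.

III. What RescueParty monitors

As mentioned at the start of this article, these are the scenarios that trigger the mechanism: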
  • a persistent app is stuck in a crash loop
  • we're stuck in a runtime restart loop.
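The first case was covered in the section on how RescueParty works: it triggers when a persistent app crashes repeatedly. Now for the other case, "we're stuck in a runtime restart loop". This monitors whether the phone keeps restarting endlessly. Let's see how boots are monitored: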
private void startBootstrapServices() {
	// ...
	// Now that we have the bare essentials of the OS up and running, take
	// note that we just booted, which might send out a rescue party if
	// we're stuck in a runtime restart loop.
	RescueParty.noteBoot(mSystemContext);

	// Manages LEDs and display backlight so we need it to bring up the display.
	traceBeginAndSlog("StartLightsService");
	// ...
}

When system_server starts, startBootstrapServices calls noteBoot. Let's look at noteBoot:
/**
* Take note of a boot event. If we notice too many of these events
* happening in rapid succession, we'll send out a rescue party.
*/
public static void noteBoot(Context context) {
	if (isDisabled()) return;
	if (sBoot.incrementAndTest()) {
		sBoot.reset();
		incrementRescueLevel(sBoot.uid);
		executeRescueLevel(context);
	}
}

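This should look familiar by now: boots are also counted within a time window, and once the preset count is reached the rescue level is escalated. The last level, again, is rebooting into recovery.

IV. Summary

In short, RescueParty counts how often a persistent process crashes, or how often the system restarts, within a time window. If the events keep recurring, it escalates through the rescue levels each time the threshold is reached. The last level reboots into recovery mode and lets the user choose to wipe data, which rescues a phone that could not otherwise recover.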