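Dynamic executor allocation in Spark

Initializing the dynamic executor allocation instance

Dynamic allocation requires that spark.executor.instances is either 0 or not set. The option is unset by default; in YARN mode it is populated from --num-executors.

In addition, spark.dynamicAllocation.enabled must be set to true to turn dynamic executor allocation on (the default is false).

When the SparkContext is created on the driver, it checks both options. If they satisfy the requirements for dynamic executor allocation, it creates an ExecutorAllocationManager instance.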

    _executorAllocationManager =
      if (dynamicAllocationEnabled) {
        Some(new ExecutorAllocationManager(this, listenerBus, _conf))
      } else {
        None
      }
    _executorAllocationManager.foreach(_.start())
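The two-option check described above can be sketched as follows. This is a minimal illustration, not Spark's own code: the object name and the use of a plain Map in place of SparkConf are assumptions made to keep the example self-contained.

```scala
object DynamicAllocationCheck {
  // Hypothetical helper: `conf` is a plain Map standing in for SparkConf.
  // Dynamic allocation is on only when it is explicitly enabled AND
  // spark.executor.instances is unset or 0.
  def isDynamicAllocationEnabled(conf: Map[String, String]): Boolean = {
    val enabled = conf.getOrElse("spark.dynamicAllocation.enabled", "false").toBoolean
    val fixedInstances = conf.getOrElse("spark.executor.instances", "0").toInt
    enabled && fixedInstances == 0
  }
}
```

With this sketch, enabling the flag while also passing a fixed executor count (for example through --num-executors) yields false.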

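Required configuration:

1. spark.dynamicAllocation.minExecutors, default 0: the minimum number of executors to allocate.

2. spark.dynamicAllocation.maxExecutors, default Int.MaxValue: the maximum number of executors that can be allocated.

3. spark.dynamicAllocation.initialExecutors, defaults to the value of item 1: the number of executors started at initialization.

4.1. spark.dynamicAllocation.schedulerBacklogTimeout, default 1s: if tasks have been waiting to be scheduled for longer than this, new executors need to be started.

4.2. spark.dynamicAllocation.sustainedSchedulerBacklogTimeout, defaults to the value of item 4.1: the wait between subsequent executor requests once the initial backlog timeout has fired.

5. spark.dynamicAllocation.executorIdleTimeout, default 60s: how long an executor may stay idle before it is reclaimed.

6. spark.executor.cores (--executor-cores) must be greater than or equal to spark.task.cpus (the number of CPUs each task needs, default 1).

7. spark.shuffle.service.enabled must be set to true (the default is false). With it enabled, the BlockManager reads the shuffle service port from spark.shuffle.service.port when it is created, and its shuffleClient is an ExternalShuffleClient instance rather than the default BlockTransferService instance.

8. At initialization, the initializing field of ExecutorAllocationManager is true, which means the periodic schedule run leaves the executor target unchanged.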
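The constraints in this list can be collected into a small validation sketch. The Settings case class, its field names, the executorCores default of 1 and the min/max ordering check are assumptions made for illustration; only the defaults and the rules in items 1 to 3, 6 and 7 come from the text above.

```scala
object AllocationSettings {
  // Hypothetical container for the options listed above.
  final case class Settings(
      minExecutors: Int = 0,
      maxExecutors: Int = Int.MaxValue,
      initialExecutors: Option[Int] = None, // unset: falls back to minExecutors
      executorCores: Int = 1,               // assumed default for this sketch
      taskCpus: Int = 1,
      shuffleServiceEnabled: Boolean = false) {
    def initial: Int = initialExecutors.getOrElse(minExecutors)
  }

  // Returns the violated constraints; an empty list means the settings are usable.
  def violations(s: Settings): List[String] = {
    val checks = List(
      (s.minExecutors <= s.maxExecutors) -> "minExecutors must not exceed maxExecutors",
      (s.executorCores >= s.taskCpus) -> "spark.executor.cores must be >= spark.task.cpus",
      s.shuffleServiceEnabled -> "spark.shuffle.service.enabled must be true"
    )
    checks.collect { case (ok, message) if !ok => message }
  }
}
```

Calling violations on the defaults reports only the shuffle service rule, which matches item 7: every other default already satisfies its constraint.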

When ExecutorAllocationManager's start function runs:

    def start(): Unit = {
      // Register the ExecutorAllocationListener (an inner class) on the
      // SparkContext's listenerBus so that it is notified when stages and
      // tasks start and finish and can react accordingly.
      listenerBus.addListener(listener)

      val scheduleTask = new Runnable() {
        override def run(): Unit = {
          try {
            schedule()
          } catch {
            case ct: ControlThrowable =>
              throw ct
            case t: Throwable =>
              logWarning(s"Uncaught exception in thread ${Thread.currentThread().getName}", t)
          }
        }
      }
      // Run schedule() every 100 ms (intervalMillis) to evaluate the pending tasks.
      executor.scheduleAtFixedRate(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS)
    }
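The try/catch around schedule() matters because scheduleAtFixedRate suppresses all later runs once one run throws. The sketch below shows the same guarded-timer pattern in isolation; PeriodicSchedule, runPeriodically and its parameters are hypothetical names, and the latch exists only so the example can stop after a fixed number of runs.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import scala.util.control.NonFatal

object PeriodicSchedule {
  // Runs `task` every `intervalMillis` until it has run `times` times.
  // Returns true if all runs happened within `timeoutMillis`.
  def runPeriodically(times: Int, intervalMillis: Long, timeoutMillis: Long)(task: () => Unit): Boolean = {
    val latch = new CountDownLatch(times)
    val timer = Executors.newSingleThreadScheduledExecutor()
    val guarded = new Runnable {
      override def run(): Unit = {
        // Swallow non-fatal errors so that one failing run does not cancel the timer.
        try task()
        catch { case NonFatal(e) => println(s"Uncaught exception in timer thread: $e") }
        finally latch.countDown()
      }
    }
    timer.scheduleAtFixedRate(guarded, 0L, intervalMillis, TimeUnit.MILLISECONDS)
    try latch.await(timeoutMillis, TimeUnit.MILLISECONDS)
    finally timer.shutdownNow()
  }
}
```

Without the guard, a task that throws on its first run would never be invoked again and the latch would time out.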

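Computing the number of executors to allocate

Task-driven scheduling is handled by a timer that calls the schedule function every 100 ms.

    private def schedule(): Unit = synchronized {
      // Take the current time first.
      val now = clock.getTimeMillis

      // While initializing is still true (the initial state), this call leaves
      // the target unchanged. The function itself is analyzed later.
      updateAndSyncNumExecutorsTarget(now)

      // removeTimes maps each idle executor to the time at which its idle
      // timeout expires. Once that time is reached the executor is removed and
      // initializing is set to false so that scheduling can proceed.
      // retain keeps only the executors that have not yet expired.
      removeTimes.retain { case (executorId, expireTime) =>
        val expired = now >= expireTime
        if (expired) {
          initializing = false
          removeExecutor(executorId)
        }
        !expired
      }
    }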
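The idle-expiry step can be tried out on its own with a plain mutable map. This is a sketch under assumptions: IdleExpiry and expire are hypothetical names, and the removal is written with collect and --= rather than retain so that it compiles on current Scala versions as well.

```scala
import scala.collection.mutable

object IdleExpiry {
  // `removeTimes` maps executor id -> time at which its idle timeout expires.
  // Removes every entry whose expiry time has been reached and returns the
  // removed ids in sorted order; unexpired entries stay in the map.
  def expire(removeTimes: mutable.Map[String, Long], now: Long): List[String] = {
    val expired = removeTimes.collect {
      case (executorId, expireTime) if now >= expireTime => executorId
    }.toList.sorted
    removeTimes --= expired
    expired
  }
}
```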

How does the manager learn that a stage has been submitted?

When SparkContext runs runJob and a stage is submitted, the onStageSubmitted function of every listener on the listenerBus is called.

When ExecutorAllocationManager starts, it creates a listener, an ExecutorAllocationListener instance, and adds it to the listenerBus.

Here is how ExecutorAllocationListener handles a stage submission:

    override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
      // Clear initializing so that the next scheduled run performs executor allocation.
      initializing = false

      // The id of the submitted stage and the number of tasks in it.
      val stageId = stageSubmitted.stageInfo.stageId
      val numTasks = stageSubmitted.stageInfo.numTasks
      allocationManager.synchronized {
        // Record the task count of this stage in stageIdToNumTasks.
        stageIdToNumTasks(stageId) = numTasks

        // Set allocationManager's addTime to the current time plus the
        // spark.dynamicAllocation.schedulerBacklogTimeout timeout.
        allocationManager.onSchedulerBacklogged()

        // Count the tasks per preferred host. numTasksPending counts only the
        // tasks that have a locality preference, so it equals the stage's
        // numTasks only when every task has one.
        // Compute the number of tasks requested by the stage on each host
        var numTasksPending = 0
        val hostToLocalTaskCountPerStage = new mutable.HashMap[String, Int]()
        stageSubmitted.stageInfo.taskLocalityPreferences.foreach { locality =>
          if (!locality.isEmpty) {
            numTasksPending += 1
            locality.foreach { location =>
              val count = hostToLocalTaskCountPerStage.getOrElse(location.host, 0) + 1
              hostToLocalTaskCountPerStage(location.host) = count
            }
          }
        }

        // Store, keyed by stage id, the number of pending tasks together with
        // the per-host task counts.
        stageIdToExecutorPlacementHints.put(stageId,
          (numTasksPending, hostToLocalTaskCountPerStage.toMap))

        // The call below iterates over the values of stageIdToExecutorPlacementHints
        // and updates two fields of allocationManager: localityAwareTasks (the number
        // of tasks waiting to start) and hostToLocalTaskCount (tasks per host). These
        // hints are used when executors are requested, so that they can be placed
        // where the tasks want to run.
        // Update the executor placement hints
        updateExecutorPlacementHints()
      }
    }
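The per-host counting in this listener can be reproduced without Spark types. In the sketch below each task's locality preference is reduced to a list of host names; PlacementHints and compute are hypothetical names, and groupBy replaces the mutable map of the original.

```scala
object PlacementHints {
  // `localityPrefs(i)` lists the preferred hosts of task i; an empty list
  // means the task has no locality preference and is not counted.
  // Returns (tasks with a preference, number of such tasks per host).
  def compute(localityPrefs: Seq[Seq[String]]): (Int, Map[String, Int]) = {
    val withPrefs = localityPrefs.filter(_.nonEmpty)
    val perHost = withPrefs.flatten
      .groupBy(identity)
      .map { case (host, hits) => host -> hits.size }
    (withPrefs.size, perHost)
  }
}
```

A task that prefers two hosts is counted once in the pending total but once per host in the map, which is why the per-host counts can sum to more than the number of pending tasks.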

Next, look at the updateAndSyncNumExecutorsTarget function that the timer calls on allocationManager.

What updateAndSyncNumExecutorsTarget and addExecutors do:

Worked example:

Assume the stage needs 5 executors and numExecutorsTarget starts at its default of 0.

On the first scheduled run, updateAndSyncNumExecutorsTarget does the following:

1. It first computes the number of executors the stage needs.

    val maxNeeded = maxNumExecutorsNeeded

    if (initializing) {
      // Reaching this branch means no stage has been submitted yet, so no job
      // is running and nothing is scheduled.
      // Do not change our target while we are still initializing,
      // Otherwise the first job may have to ramp up unnecessarily
      0
    }

2. The flow enters the else if (addTime != NOT_SET && now >= addTime) branch and calls addExecutors (assuming the addTime timeout has already been reached).

With the default initial executor count of 0, this branch is entered once the current time passes the backlog timeout. The first wait is one second, and every run pushes the wait time forward.

The branch takes the number of executors that the stage's tasks need and passes it to addExecutors.

    else if (addTime != NOT_SET && now >= addTime) {
      val delta = addExecutors(maxNeeded)
      logDebug(s"Starting timer to add more executors (to " +
        s"expire in $sustainedSchedulerBacklogTimeoutS seconds)")
      addTime += sustainedSchedulerBacklogTimeoutS * 1000
      delta
    }

In addExecutors, the target number of executors (the numExecutorsTarget field) is computed first.

    // Do not request more executors if it would put our target over the upper bound
    if (numExecutorsTarget >= maxNumExecutors) {
      logDebug(s"Not adding executors because our current target total " +
        s"is already $numExecutorsTarget (limit $maxNumExecutors)")
      numExecutorsToAdd = 1
      return 0
    }

    val oldNumExecutorsTarget = numExecutorsTarget
    // There's no point in wasting time ramping up to the number of executors we already have, so
    // make sure our target is at least as much as our current allocation:
    numExecutorsTarget = math.max(numExecutorsTarget, executorIds.size)
    // Boost our target with the number to add for this round:
    numExecutorsTarget += numExecutorsToAdd
    // Ensure that our target doesn't exceed what we need at the present moment:
    numExecutorsTarget = math.min(numExecutorsTarget, maxNumExecutorsNeeded)
    // Ensure that our target fits within configured bounds:
    numExecutorsTarget = math.max(math.min(numExecutorsTarget, maxNumExecutors), minNumExecutors)

At this point executorIds is empty (size 0) and numExecutorsToAdd has its default value of 1. With maxNumExecutorsNeeded at 5 (the number of executors the tasks need), the code above leaves numExecutorsTarget at 1. Next a delta is computed; it is positive when the new target is higher than the previous one. In this example delta is 1.

    val delta = numExecutorsTarget - oldNumExecutorsTarget

    // If our target has not changed, do not send a message
    // to the cluster manager and reset our exponential growth
    if (delta == 0) {
      // The target is the same as last time, so reset the increment to 1.
      numExecutorsToAdd = 1
      return 0
    }

Then the following code asks the cluster manager, through SparkContext, for a total of numExecutorsTarget executors, passing the locality-aware task count and the per-host task counts along as placement hints.

    val addRequestAcknowledged = testing ||
      client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)

The job still needs 4 more executors (5 in total), and delta is 1, which equals numExecutorsToAdd, so numExecutorsToAdd is doubled.

    numExecutorsToAdd = if (delta == numExecutorsToAdd) {
      numExecutorsToAdd * 2
    } else {
      1
    }

3. The timer fires a second time. When updateAndSyncNumExecutorsTarget runs, numExecutorsTarget is 1 while 5 executors are still needed, so the timeout branch is taken again.

addExecutors is entered again with numExecutorsToAdd at 2, numExecutorsTarget at 1 and executorIds.size at 1 (one executor has already started). The computation raises numExecutorsTarget to 3, still below the 5 needed. The delta against the previous target is 2, and a request for a total of 3 executors is sent.

Because delta is 2 and numExecutorsToAdd is also 2, numExecutorsToAdd is doubled again and ends up at 4.

4. Some tasks are still waiting, so a third round starts. numExecutorsTarget is still below the number of executors needed, so addExecutors runs again. Assuming both earlier requests have been fulfilled, executorIds.size is 3: one executor from the first round and two more from the second, because requestTotalExecutors takes a total rather than an increment. numExecutorsTarget then changes as follows:

It is first set to 3 (the larger of numExecutorsTarget and executorIds.size).

Then numExecutorsTarget += numExecutorsToAdd raises it to 7.

Finally it is capped at the 5 executors needed in total, which gives 5. The delta against the previous target is 2 while numExecutorsToAdd is 4, so numExecutorsToAdd is reset to 1.

The scheduler then sets addTime to NOT_SET, which stops further executor requests because there are now enough executors.
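The three rounds of this walkthrough can be replayed with a small simulation of the target computation. It is a sketch, not the Spark class: State, step and simulate are hypothetical names, the bounds default to 0 and Int.MaxValue as in the configuration list, and simulate assumes that every requested executor has registered before the next round.

```scala
object RampUpSimulation {
  // target = numExecutorsTarget, toAdd = numExecutorsToAdd in the walkthrough.
  final case class State(target: Int, toAdd: Int)

  // One round of the target computation. `registered` plays the role of
  // executorIds.size.
  def step(s: State, registered: Int, maxNeeded: Int,
           minExecutors: Int = 0, maxExecutors: Int = Int.MaxValue): State =
    if (s.target >= maxExecutors) {
      s.copy(toAdd = 1) // already at the upper bound: reset the increment
    } else {
      val boosted = math.max(s.target, registered) + s.toAdd
      val capped = math.min(boosted, maxNeeded)
      val newTarget = math.max(math.min(capped, maxExecutors), minExecutors)
      val delta = newTarget - s.target
      // Double the increment only when the full increment was applied.
      State(newTarget, if (delta == s.toAdd) s.toAdd * 2 else 1)
    }

  // Replays `rounds` rounds starting from target 0 and increment 1.
  def simulate(maxNeeded: Int, rounds: Int): List[State] =
    (1 to rounds).scanLeft(State(0, 1)) { (s, _) =>
      step(s, registered = s.target, maxNeeded = maxNeeded)
    }.toList.tail
}
```

For maxNeeded = 5 and three rounds the targets are 1, 3 and 5, and the increment goes 2, 4 and back to 1, matching the walkthrough.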

5. Now assume spark.dynamicAllocation.initialExecutors is set to 6 and the number of executors needed is still 5. On entering updateAndSyncNumExecutorsTarget, the branch below is taken because the initial executor count exceeds the number needed. This branch runs whenever the configured initial count is above the need, or when a later stage of the job needs fewer executors than the previous stage did.

    else if (maxNeeded < numExecutorsTarget) {
      // The target number exceeds the number we actually need, so stop adding new
      // executors and inform the cluster manager to cancel the extra pending requests
      val oldNumExecutorsTarget = numExecutorsTarget
      numExecutorsTarget = math.max(maxNeeded, minNumExecutors)
      numExecutorsToAdd = 1

      // If the new target has not changed, avoid sending a message to the cluster manager
      if (numExecutorsTarget < oldNumExecutorsTarget) {
        client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)
        logDebug(s"Lowering target number of executors to $numExecutorsTarget (previously " +
          s"$oldNumExecutorsTarget) because not all requested executors are actually needed")
      }
      numExecutorsTarget - oldNumExecutorsTarget
    }

The code above recomputes numExecutorsTarget from the number of executors needed. The new numExecutorsTarget is 5 and the old one is 6, so the new target is passed straight to SparkContext's requestTotalExecutors, which tells the cluster manager to lower the total and cancel the extra pending request.
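The lowering branch reduces to a few lines of arithmetic. The sketch below isolates it; LowerTarget and lower are hypothetical names, and the third element of the result stands in for the decision to call requestTotalExecutors.

```scala
object LowerTarget {
  // Returns (new target, delta, whether the cluster manager must be notified).
  // Only meaningful when the current target exceeds what is needed.
  def lower(currentTarget: Int, maxNeeded: Int, minExecutors: Int): (Int, Int, Boolean) = {
    require(maxNeeded < currentTarget, "this branch applies only when the target exceeds the need")
    val newTarget = math.max(maxNeeded, minExecutors)
    (newTarget, newTarget - currentTarget, newTarget < currentTarget)
  }
}
```

With the numbers from step 5, lower(6, 5, 0) gives a new target of 5 and a delta of -1. If minExecutors were 6, the target would stay at 6 and no message would be sent.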

Scheduling executors through SparkContext

After allocationManager has recomputed the executor target, it runs the following code.

    client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)

In the code above, client is the SparkContext instance.

The function is handled as follows.

Its parameters are:

numExecutors, the target total number of executors;

localityAwareTasks, the number of pending tasks that have locality preferences;

hostToLocalTaskCount, a map from host to task count.

    private[spark] override def requestTotalExecutors(
        numExecutors: Int,
        localityAwareTasks: Int,
        hostToLocalTaskCount: scala.collection.immutable.Map[String, Int]
      ): Boolean = {
      schedulerBackend match {
        case b: CoarseGrainedSchedulerBackend =>
          // Delegates directly to the same function on CoarseGrainedSchedulerBackend.
          b.requestTotalExecutors(numExecutors, localityAwareTasks, hostToLocalTaskCount)
        case _ =>
          logWarning("Requesting executors is only supported in coarse-grained mode")
          false
      }
    }

Here is the implementation of requestTotalExecutors in CoarseGrainedSchedulerBackend:

    final override def requestTotalExecutors(
        numExecutors: Int,
        localityAwareTasks: Int,
        hostToLocalTaskCount: Map[String, Int]
      ): Boolean = synchronized {
      if (numExecutors < 0) {
        throw new IllegalArgumentException(
          "Attempted to request a negative number of executor(s) " +
            s"$numExecutors from the cluster manager. Please specify a positive number!")
      }

      this.localityAwareTasks = localityAwareTasks
      this.hostToLocalTaskCount = hostToLocalTaskCount

      // On every call, take the total number of executors needed minus the
      // executors that already exist, not counting those waiting to be removed.
      // The difference is the number of executors that still have to be started.
      numPendingExecutors =
        math.max(numExecutors - numExistingExecutors + executorsPendingToRemove.size, 0)

      // Call the implementation for the cluster deployment mode in use (YARN,
      // standalone, Mesos and so on) to start the executors. This article follows
      // the standalone path, implemented by SparkDeploySchedulerBackend. It sends
      // a RequestExecutors message to the master and waits for the reply. The
      // Master handles the message in the RequestExecutors case of receiveAndReply.
      doRequestTotalExecutors(numExecutors)
    }
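The pending-executor bookkeeping is plain arithmetic and can be checked in isolation. In this sketch PendingExecutors and pending are hypothetical names, and the lower bound of 0 is an assumption about the second argument of math.max, which is cut off in the listing this article was taken from.

```scala
object PendingExecutors {
  // requestedTotal:  the total asked for (numExecutors)
  // existing:        executors currently registered (numExistingExecutors)
  // pendingToRemove: executors already marked for removal
  def pending(requestedTotal: Int, existing: Int, pendingToRemove: Int): Int = {
    require(requestedTotal >= 0, "cannot request a negative number of executors")
    // Executors that are about to go away do not count towards the total,
    // and the result never drops below zero (assumed lower bound).
    math.max(requestedTotal - existing + pendingToRemove, 0)
  }
}
```

For example, asking for 5 in total with 3 registered executors, one of which is about to be removed, leaves 3 still to start.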

Handling the executor request in the Master:

    case RequestExecutors(appId, requestedTotal) =>
      context.reply(handleRequestExecutors(appId, requestedTotal))

The handleRequestExecutors function:

    private def handleRequestExecutors(appId: String, requestedTotal: Int): Boolean = {
      idToApp.get(appId) match {
        // Update executorLimit on the appInfo with the total number of executors
        // that the application's job depends on, then schedule executor startup.
        case Some(appInfo) =>
          logInfo(s"Application $appId requested to set total executors to $requestedTotal.")
          appInfo.executorLimit = requestedTotal
          // schedule() uses startExecutorsOnWorkers to place and start executors
          // on the workers. scheduleExecutorsOnWorkers decides how many to start
          // by checking whether the executor count has reached appInfo.executorLimit;
          // once it has, no more executors are started. A worker can host an
          // executor only if it has enough free CPU cores and memory for one.
          // The scheduler iterates over all workers and stops allocating when the
          // number of started executors reaches the executorLimit of the appInfo.
          schedule()
          true
        case None =>
          logWarning(s"Unknown application $appId requested $requestedTotal total executors.")
          false
      }
    }
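The limit and resource checks described in the comments can be illustrated with a heavily simplified placement loop. This is not the Master's scheduleExecutorsOnWorkers: Worker, place and the round-robin order are assumptions, and the sketch only shows how an executor limit and per-worker free cores and memory bound the number of executors started.

```scala
object ExecutorPlacement {
  // Hypothetical view of a worker's free resources.
  final case class Worker(id: String, freeCores: Int, freeMemoryMb: Int)

  // Hands out executors one per worker per pass until the executor limit is
  // reached or no worker has room left. Returns executors started per worker id.
  def place(workers: List[Worker], coresPerExecutor: Int, memoryPerExecutorMb: Int,
            executorLimit: Int, alreadyRunning: Int): Map[String, Int] = {
    var remaining = math.max(executorLimit - alreadyRunning, 0)
    var pool = workers
    var assigned = Map.empty[String, Int]
    var progress = true
    while (remaining > 0 && progress) {
      progress = false
      pool = pool.map { w =>
        val fits = w.freeCores >= coresPerExecutor && w.freeMemoryMb >= memoryPerExecutorMb
        if (remaining > 0 && fits) {
          remaining -= 1
          progress = true
          assigned = assigned.updated(w.id, assigned.getOrElse(w.id, 0) + 1)
          // Reserve the resources this executor will use on the worker.
          w.copy(freeCores = w.freeCores - coresPerExecutor,
                 freeMemoryMb = w.freeMemoryMb - memoryPerExecutorMb)
        } else w
      }
    }
    assigned
  }
}
```

The loop stops for one of two reasons: the limit has been met, or a full pass over the workers placed nothing because none of them has enough cores and memory left.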

Driver-side handling
