Non-amateur Research in Systems Technology » The incarnation problem caused by Erlang node restarts
This evening, mingchaoyan asked me the following question online:
152489 =ERROR REPORT==== 2013-06-28 19:57:53 ===
152490 Discarding message {send,<<19 bytes>>} from <0.86.1> to <0.6743.0> in an old incarnation (1) of this node (2)
152491
152492
152493 =ERROR REPORT==== 2013-06-28 19:57:55 ===
152494 Discarding message {send,<<22 bytes>>} from <0.1623.1> to <0.6743.0> in an old incarnation (1) of this node (2)

We updated our servers at noon, and now the log is filled screen after screen with these errors. Have you run into anything like this before? Could you suggest some ideas for locating and solving the problem? Thanks.
This problem is quite interesting. Combining the hint in the log with the source code, we can immediately find the place that prints this message:
/* bif.c */
Sint
do_send(Process *p, Eterm to, Eterm msg, int suspend) {
    Eterm portid;
    ...
    } else if (is_external_pid(to)) {
        dep = external_pid_dist_entry(to);
        if(dep == erts_this_dist_entry) {
            erts_dsprintf_buf_t *dsbufp = erts_create_logger_dsbuf();
            erts_dsprintf(dsbufp,
                          "Discarding message %T from %T to %T in an old "
                          "incarnation (%d) of this node (%d)\n",
                          msg, p->id, to,
                          external_pid_creation(to),
                          erts_this_node->creation);
            erts_send_error_to_logger(p->group_leader, dsbufp);
            return 0;
        }
    ..
}
This warning is triggered only when both of the following conditions hold:
1. The destination pid is an external_pid.
2. The dist_entry of the external node that the pid belongs to is the same as the current node's dist_entry.
Searching on Google, I found a report that matches this description closely: see here. The author describes and reproduces the phenomenon nicely, but does not explain the actual cause.
Good. Let's follow his approach and reproduce the problem afresh.
Before the demo, though, let's solidify some basics. First we need to understand the format of a pid; see this article. The key points about pids are excerpted below:
Printed process ids < A.B.C > are composed of [6]:
A, the node number (0 is the local node, an arbitrary number for a remote node)
B, the first 15 bits of the process number (an index into the process table) [7]
C, bits 16-18 of the process number (the same process number as B) [7]
Next, see section 9.10 of the Erlang External Term Format documentation, which describes the layout of PID_EXT:
1   | N    | 4  | 4      | 1
103 | Node | ID | Serial | Creation

Table 9.16: PID_EXT
Encode a process identifier object (obtained from spawn/3 or friends). The ID and Creation fields works just like in REFERENCE_EXT, while the Serial field is used to improve safety. In ID, only 15 bits are significant; the rest should be 0.
Notice the field Creation. Why have we never seen this thing before? Consulting the Erlang documentation, we learn:
creation
Returns the creation of the local node as an integer. The creation is changed when a node is restarted. The creation of a node is stored in process identifiers, port identifiers, and references. This makes it (to some extent) possible to distinguish between identifiers from different incarnations of a node. Currently valid creations are integers in the range 1..3, but this may (probably will) change in the future. If the node is not alive, 0 is returned.
Tracing where this creation comes from, we find that it originates in epmd. More concretely: every time a node registers its name with epmd, epmd hands the creation back to the node, and net_kernel then records it into that node's erts_this_dist_entry->creation via the set_node BIF:
/* erl_node_tables.c */
void erts_set_this_node(Eterm sysname, Uint creation)
{
    ...
    erts_this_dist_entry->sysname = sysname;
    erts_this_dist_entry->creation = creation;
    ...
}

/* epmd_srv.c */
...
/* When reusing we change the "creation" number 1..3 */
node->creation = node->creation % 3 + 1;
...
From the code above we can see that creation takes values 1..3 and is bumped on every registration; a node that is not alive has creation 0.
Now that we know the whole story of creation, let's look at the DistEntry data structure, which essentially represents how a connected node interacts with the outside world.
typedef struct dist_entry_ {
    ...
    Eterm sysname;      /* [email protected] atom for efficiency */
    Uint32 creation;    /* creation of connected node */
    Eterm cid;          /* connection handler (pid or port), NIL == free */
    ...
} DistEntry;
The three fields above are the most important ones; cid denotes the port (the TCP channel between nodes).
We know that external pids are constructed by binary_to_term; the code lives in the dec_pid function in external.c:
static byte*
dec_pid(ErtsDistExternal *edep, Eterm** hpp, byte* ep, ErlOffHeap* off_heap, Eterm* objp)
{
    ...
    /*
     * We are careful to create the node entry only after all
     * validity tests are done.
     */
    node = dec_get_node(sysname, cre);
    if(node == erts_this_node) {
        *objp = make_internal_pid(data);
    } else {
        ExternalThing *etp = (ExternalThing *) *hpp;
        *hpp += EXTERNAL_THING_HEAD_SIZE + 1;

        etp->header = make_external_pid_header(1);
        etp->next = off_heap->first;
        etp->node = node;
        etp->data.ui[0] = data;

        off_heap->first = (struct erl_off_heap_header*) etp;
        *objp = make_external_pid(etp);
    }
    ...
}

static ERTS_INLINE ErlNode*
dec_get_node(Eterm sysname, Uint creation)
{
    switch (creation) {
    case INTERNAL_CREATION:
        return erts_this_node;
    case ORIG_CREATION:
        if (sysname == erts_this_node->sysname) {
            creation = erts_this_node->creation;
        }
    }
    return erts_find_or_insert_node(sysname, creation);
}
If the creation is 0, the pid must refer to the local node; otherwise a matching node is looked up by sysname and creation. On to more code:
typedef struct erl_node_ {
    HashBucket hash_bucket;     /* Hash bucket */
    erts_refc_t refc;           /* Reference count */
    Eterm sysname;              /* [email protected] atom for efficiency */
    Uint32 creation;            /* Creation */
    DistEntry *dist_entry;      /* Corresponding dist entry */
} ErlNode;

/* erl_node_tables.c */
ErlNode *erts_find_or_insert_node(Eterm sysname, Uint creation)
{
    ErlNode *res;
    ErlNode ne;
    ne.sysname = sysname;
    ne.creation = creation;

    erts_smp_rwmtx_rlock(&erts_node_table_rwmtx);
    res = hash_get(&erts_node_table, (void *) &ne);
    if (res && res != erts_this_node) {
        erts_aint_t refc = erts_refc_inctest(&res->refc, 0);
        if (refc < 2) /* New or pending delete */
            erts_refc_inc(&res->refc, 1);
    }
    erts_smp_rwmtx_runlock(&erts_node_table_rwmtx);
    if (res)
        return res;

    erts_smp_rwmtx_rwlock(&erts_node_table_rwmtx);
    res = hash_put(&erts_node_table, (void *) &ne);
    ASSERT(res);
    if (res != erts_this_node) {
        erts_aint_t refc = erts_refc_inctest(&res->refc, 0);
        if (refc < 2) /* New or pending delete */
            erts_refc_inc(&res->refc, 1);
    }
    erts_smp_rwmtx_rwunlock(&erts_node_table_rwmtx);
    return res;
}

static int node_table_cmp(void *venp1, void *venp2)
{
    return ((((ErlNode *) venp1)->sysname == ((ErlNode *) venp2)->sysname &&
             ((ErlNode *) venp1)->creation == ((ErlNode *) venp2)->creation)
            ? 0
            : 1);
}

static void* node_table_alloc(void *venp_tmpl)
{
    ErlNode *enp;

    if(((ErlNode *) venp_tmpl) == erts_this_node)
        return venp_tmpl;

    enp = (ErlNode *) erts_alloc(ERTS_ALC_T_NODE_ENTRY, sizeof(ErlNode));

    node_entries++;

    erts_refc_init(&enp->refc, -1);
    enp->creation = ((ErlNode *) venp_tmpl)->creation;
    enp->sysname = ((ErlNode *) venp_tmpl)->sysname;
    enp->dist_entry = erts_find_or_insert_dist_entry(((ErlNode *) venp_tmpl)->sysname);

    return (void *) enp;
}
erts_find_or_insert_node looks a node up by the combination of sysname and creation; if nothing matches, it creates a new node and inserts it into erts_node_table, the hash table of ErlNode entries. An ErlNode holds three key pieces of information: 1. sysname, 2. creation, 3. dist_entry. So when a new node entry is created, what goes into dist_entry?
The crucial line is this one:
enp->dist_entry = erts_find_or_insert_dist_entry(((ErlNode *) venp_tmpl)->sysname);
The dist_entry is looked up by sysname alone, not by the combination of sysname and creation.
And this is where the problem appears. Look carefully at the dec_pid code again:
node = dec_get_node(sysname, cre);
if(node == erts_this_node) {
    *objp = make_internal_pid(data);
} else {
    ...
    etp->node = node;
    ...
    *objp = make_external_pid(etp);
}
Because the creation differs, the same sysname can no longer be matched to the current node, so a fresh ErlNode is created; yet that new node's dist_entry is the very dist_entry belonging to the current node.
The external pid object built here therefore points at the newly created node.
So we reach the three lines in do_send that print the warning:
} else if (is_external_pid(to)) {
dep = external_pid_dist_entry(to);
if(dep == erts_this_dist_entry) {
The external_pid_dist_entry macro extracts the node from the external pid and then the dist_entry from that node. That dist_entry is, unfortunately, identical to erts_this_dist_entry, and so the tragedy above unfolds.
After all that analysis we finally have a lead. Time for a sip of water!
Armed with this background knowledge, we can now run a demonstration:
$ erl -sname a
Erlang R15B03 (erts-5.9.3.1) [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
([email protected])1> term_to_binary(self()).
<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,
  0,0,37,0,0,0,0,2>>
([email protected])2> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,2>>).
<0.37.0>
([email protected])3> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>).
<0.37.0>
([email protected])4> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,1>>).
<0.37.0>
([email protected])5> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,1>>)==self().
false
([email protected])6> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,2>>)==self().
true
([email protected])7> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>)==self().
false
([email protected])8> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>)!ok.
ok
([email protected])9> erlang:system_info(creation).
2

=ERROR REPORT==== 28-Jun-2013::23:10:58 ===
Discarding message ok from <0.37.0> to <0.37.0> in an old incarnation (3) of this node (2)
The demo shows that creation really does cycle, incrementing by 1 each registration, and that although the pids print identically, the creation field makes identical-looking pids different in reality.
At this point we roughly understand cause and effect, but we still haven't answered the original question: in that reader's cluster, only a single node was restarted, and yet a whole screen of warnings appeared.
Note: a whole screen!
I designed a new test case to dissect the problem more deeply. Before that, we need the following module:
$ cat test.erl
-module(test).
-export([start/0]).

start()->
    register(test, self()),
    loop(undefined).

loop(State)->
    loop(
      receive
          {set, Msg} -> Msg;
          {get, From} -> From!State
      end
     ).
The purpose of this code: once the test:start process is running, it registers itself on the target node under the name test and accepts two kinds of messages, get and set. set stores whatever the user sends; get retrieves it.
Our test case goes like this:
Start nodes a and b, then from node b spawn the test:start process on node a to keep a piece of information for us, namely the pid of node b's shell process.
Then simulate node b crashing and restarting, and fetch the previously stored pid back from the test process on node a. That pid prints the same as the freshly started shell's pid, but the two should not be exactly equal, because their creations differ.
With that laid out, let's show it:
$ erl -name [email protected]
Erlang R15B03 (erts-5.9.3.1) [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
([email protected])1>
Good, node a is ready. Next, start node b and store its shell pid over on node a.
$ erl -name [email protected]
Erlang R15B03 (erts-5.9.3.1) [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
([email protected])1> R=spawn('[email protected]', test, start,[]).
<6002.42.0>
([email protected])2> self().
<0.37.0>
([email protected])3> R!{set, self()}.
{set,<0.37.0>}
([email protected])4> R!{get, self()}.
{get,<0.37.0>}
([email protected])5> flush().
Shell got <0.37.0>
ok
([email protected])6>
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
^C
Now quit node b to simulate a crash, restart it, fetch the saved pid, and compare it with the current shell pid: they turn out not to be exactly the same.
$ erl -name [email protected]
Erlang R15B03 (erts-5.9.3.1) [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.9.3.1 (abort with ^G)
([email protected])1> {test, '[email protected]'}!{get, self()}.
{get,<0.37.0>}
([email protected])2> flush().
Shell got <0.37.0>
ok
([email protected])3> {test, '[email protected]'}!{get, self()}, receive X->X end.
<0.37.0>
([email protected])4> T=v(-1).
<0.37.0>
([email protected])5> T==self().
false
([email protected])6> T!ok.
ok
([email protected])7>
=ERROR REPORT==== 28-Jun-2013::23:24:00 ===
Discarding message ok from <0.37.0> to <0.37.0> in an old incarnation (2) of this node (3)
Sending a message to the pid we fetched back is enough to trigger the warning.
This scenario is extremely common in distributed systems: pids of cooperating processes get stored inside other nodes' systems, and when some of those processes die and their node restarts, anyone trying to fetch the saved pids discovers that they have already become invalid.
By now we should be able to answer the reader's question properly.
So what is the solution to this problem?
Our system should monitor_node the relevant peer nodes and handle nodedown messages, removing the processes associated with a node as soon as that node goes down, because those pids have essentially lost their function.
Takeaway: even the most innocent-looking warning can conceal a significant problem.
Have fun!