A Guide to Automatically Adding Nodes to an Erlang Cluster
阿新 • Published: 2018-12-23
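An Erlang cluster is made up of nodes. Each node is identified by a name, regardless of where it physically sits on the network, which makes deploying an Erlang cluster convenient. To grow the cluster, start a new node, give it a relatively fixed bootstrap node to contact, and let the bootstrap node introduce the new node to the rest of the cluster.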
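As a minimal sketch of the bootstrap idea, the new node can simply ping the bootstrap node. The module name cluster_join and the function join/1 are hypothetical names chosen for illustration; net_adm:ping/1 and nodes/0 are standard OTP calls.

```erlang
-module(cluster_join).
-export([join/1]).

%% Hypothetical helper: contact the bootstrap node from the new node.
%% A successful ping sets up the connection; the other cluster members
%% are then connected transitively, which may take a moment, so the
%% list returned here can still be incomplete.
join(BootNode) when is_atom(BootNode) ->
    case net_adm:ping(BootNode) of
        pong -> {ok, nodes()};
        pang -> {error, {unreachable, BootNode}}
    end.
```

Calling cluster_join:join/1 with the name of the bootstrap node on a freshly started node is enough for the join; both nodes must share the same cookie.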
In practice, a newly purchased machine usually arrives with an IP address and ssh access already configured. We then have to install Erlang on it by hand, deploy the application, start the application node, and join it to the cluster. These steps are tedious, and we would like to automate them. The ct_ modules of common_test come to the rescue. common_test describes itself as "a framework for automated testing of arbitrary target nodes". Its ct_ssh module can run arbitrary shell commands on a remote machine over ssh and transfer files over sftp, and ct_slave makes it easy to connect to a remote machine and start an Erlang node there.

    $ cd ~
    $ cat sshdemo.config
    {sshdemo,
     [
      {ssh, "127.0.0.1"},
      {port, 22},
      {user, "yourname"},
      {password, "yourpassword"}
     ]
    }.
    $ mkdir /tmp/sshdemo
    $ run_test -shell -config sshdemo.config
    Erlang R14B01 (erts-5.8.5) 1 [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
    Common Test v1.5.1 starting (cwd is /Users/yufeng)
    Eshell V5.8.5 (abort with ^G)
    ([email protected])1>
    Installing: [{ct_config_plain,["/Users/yufeng/sshdemo.config"]}]
    Updated /Users/yufeng/last_interactive.html
    Any CT activities will be logged here
    ([email protected])1> {ok, CH}=ct_ssh:connect(sshdemo, sftp).
    {ok,}
    ([email protected])2> ct_ssh:list_dir(CH, "/tmp/sshdemo").
    {ok,["..","."]}
    ([email protected])3> ct_ssh:write_file(CH, "/tmp/sshdemo/test.dat", "hello").
    ok
    ([email protected])4> ct_ssh:read_file(CH, "/tmp/sshdemo/test.dat").
    {ok,<<"hello">>}
    ([email protected])5> {ok, CH1}=ct_ssh:connect(sshdemo, ssh).
    {ok,}
    ([email protected])6> ct_ssh:exec(CH1, "cp /tmp/sshdemo/test.dat /tmp/sshdemo/test1.dat").
    {ok,[]}
    ([email protected])7>
    ...
    $ ls /tmp/sshdemo
    test.dat test1.dat

As the experiment above shows, ssh makes it easy to install Erlang and deploy our application on the target machine. Next, let's start a node on the target machine.
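The sftp and exec calls from the ssh experiment could be bundled into one deployment function, roughly as sketched here. The module name ssh_deploy, the functions deploy/4 and untar_cmd/2, and the tarball layout are all assumptions made for illustration. The code also assumes it runs inside a Common Test session (for example the run_test -shell session shown earlier) whose config defines the target, and that the paths contain no characters needing shell quoting.

```erlang
-module(ssh_deploy).
-export([deploy/4, untar_cmd/2]).

%% Hypothetical helper: upload a release tarball over sftp, then unpack
%% it on the target over ssh. Target is a config key such as sshdemo.
deploy(Target, LocalTar, RemoteTar, RemoteDir) ->
    {ok, Bin} = file:read_file(LocalTar),
    {ok, Sftp} = ct_ssh:connect(Target, sftp),
    ok = ct_ssh:write_file(Sftp, RemoteTar, Bin),
    ok = ct_ssh:disconnect(Sftp),
    {ok, Ssh} = ct_ssh:connect(Target, ssh),
    Result = ct_ssh:exec(Ssh, untar_cmd(RemoteTar, RemoteDir)),
    ok = ct_ssh:disconnect(Ssh),
    Result.

%% Build the remote shell command as a flat string (no shell quoting).
untar_cmd(Tar, Dir) ->
    lists:flatten(["mkdir -p ", Dir, " && tar -xzf ", Tar, " -C ", Dir]).
```

Keeping the command construction in its own function makes it possible to inspect the exact command line before anything is run on the remote machine.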
    $ run_test -shell -name [email protected]
    Erlang R14B01 (erts-5.8.5) 1 [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]
    Common Test v1.5.1 starting (cwd is /Users/yufeng)
    Eshell V5.8.5 (abort with ^G)
    ([email protected])1>
    Updated /Users/yufeng/last_interactive.html
    Any CT activities will be logged here
    ([email protected])1> ct_slave:start('127.0.0.1', y, [{username, "yourname"},{password, "yourpassword"}]).
    {error,boot_timeout,'y@127.0.0.1'}

That failed embarrassingly: the node did not come up within the default 3 seconds. What went wrong? Let's trace it systematically. First, the source at ct_slave.erl:L403 shows clearly:
    % spawn node remotely
    spawn_remote_node(Host, Node, Options) ->
        Username = Options#options.username,
        Password = Options#options.password,
        ErlFlags = Options#options.erl_flags,
        SSHOptions = case {Username, Password} of
            {[], []}->
                [];
            {_, []}->
                [{user, Username}];
            {_, _}->
                [{user, Username}, {password, Password}]
        end ++ [{silently_accept_hosts, true}],
        check_for_ssh_running(),
        {ok, SSHConnRef} = ssh:connect(atom_to_list(Host), 22, SSHOptions),
        {ok, SSHChannelId} = ssh_connection:session_channel(SSHConnRef, infinity),
        ssh_connection:exec(SSHConnRef, SSHChannelId, get_cmd(Node, ErlFlags), infinity).
    ...
    % wait N seconds until node is pingable
    wait_for_node_alive(_Node, 0) ->
        pang;
    wait_for_node_alive(Node, N) ->
        timer:sleep(1000),
        case net_adm:ping(Node) of
            pong->
                pong;
            pang->
                wait_for_node_alive(Node, N-1)
        end.
    %% the waiting logic
    ...
        MasterHost = gethostname(),
        if
            MasterHost == Host ->
                spawn_local_node(Node, Options);
            true->
                spawn_remote_node(Host, Node, Options)
        end,
        BootTimeout = Options#options.boot_timeout,
        InitTimeout = Options#options.init_timeout,
        StartupTimeout = Options#options.startup_timeout,
        Result = case wait_for_node_alive(ENode, BootTimeout) of
            pong->
                call_functions(ENode, Functions2),
                receive
                    {node_started, ENode}->
                        receive
                            {node_ready, ENode}->
                                {ok, ENode}
                        after StartupTimeout*1000->
                            {error, startup_timeout, ENode}
                        end
                after InitTimeout*1000 ->
                    {error, init_timeout, ENode}
                end;
            pang->
                {error, boot_timeout, ENode}
        end,
    ...

Let's use the powerful dbg to trace what ct_slave actually does:
    ([email protected])2> dbg:tracer().
    {ok,}
    ([email protected])3> dbg:p(all,c).
    {ok,[{matched,'[email protected]',32}]}
    ([email protected])4> dbg:tpl(ssh_connection,exec, [{'_', [], [{return_trace}]}]).
    {ok,[{matched,'[email protected]',1},{saved,1}]}
    ([email protected])5> dbg:tpl(ct_slave,wait_for_node_alive, [{'_', [], [{return_trace}]}]).
    {ok,[{matched,'[email protected]',1},{saved,2}]}
    ([email protected])6> ct_slave:start('127.0.0.1', y, [{username, "yourname"},{password, "yourpassword"} ]).
    () call ssh_connection:exec(,0,"erl -detached -noinput -setcookie VSVDYDCFVTPSDPBFHQMY -name y ",infinity)
    () returned from ssh_connection:exec/4 -> success
    () call ct_slave:wait_for_node_alive('y@127.0.0.1',3)
    () call ct_slave:wait_for_node_alive('y@127.0.0.1',2)
    () call ct_slave:wait_for_node_alive('y@127.0.0.1',1)
    {error,boot_timeout,'y@127.0.0.1'}
    ([email protected])7> code:which(ct_slave).
    "/usr/local/lib/erlang/lib/common_test-1.5.4/ebin/ct_slave.beam"

The trace shows that the ssh command ran successfully, but the wait for the node failed. Checking the running processes and epmd confirms this:
    $ ps -ef|grep beam
    ...
    500 36161 36120 0 0:00.24 ttys000 0:00.81 /usr/local/lib/erlang/erts-5.8.5/bin/beam.smp -- -root /usr/local/lib/erlang -progname erl -- -home /Users/yufeng -- -name [email protected] -s ct_run script_start -shell
    501 26067 1 0 0:00.03 ?? 0:00.21 /opt/local/lib/erlang/erts-5.8.5/bin/beam -- -root /opt/local/lib/erlang -progname erl -- -home /Users/yufeng -noshell -noinput -noshell -noinput -setcookie VSVDYDCFVTPSDPBFHQMY -name y
    ...
    $ epmd -names
    epmd: up and running on port 4369 with data:
    name x at port 49869
    name y at port 49838

The analysis above makes the problem clear: the name we want is -name y@127.0.0.1, and that is also the name ct_slave waits for, yet ct_slave started the node with -name y. This is clearly a bug! Let's look at the code to see how to fix it:
    % get cmd for starting Erlang
    get_cmd(Node, Flags) ->
        Cookie = erlang:get_cookie(),
        "erl -detached -noinput -setcookie "++ atom_to_list(Cookie) ++
        long_or_short() ++ atom_to_list(Node) ++ " " ++ Flags.

    % make a Erlang node name from name and hostname
    enodename(Host, Node) ->
        list_to_atom(atom_to_list(Node)++"@"++atom_to_list(Host)).

We can see that get_cmd simply appends the node name to the command line, but the name it receives has no host part. Knowing the cause, the fix is easy. Only one line needs to change.
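The difference between the two command lines can be checked in isolation with a simplified stand-in for these two functions. This is a sketch, not the ct_slave source: the module name node_cmd is hypothetical, the cookie is passed in as an argument instead of being read with erlang:get_cookie(), and " -name " is hard-coded on the assumption that long names are in use.

```erlang
-module(node_cmd).
-export([get_cmd/3, enodename/2]).

%% Simplified stand-in for ct_slave's get_cmd (assumptions: the cookie
%% is an argument, and long names are used so " -name " is fixed).
get_cmd(Cookie, Node, Flags) ->
    "erl -detached -noinput -setcookie " ++ atom_to_list(Cookie) ++
        " -name " ++ atom_to_list(Node) ++ " " ++ Flags.

%% Same construction as ct_slave's enodename/2: name@host as an atom.
enodename(Host, Node) ->
    list_to_atom(atom_to_list(Node) ++ "@" ++ atom_to_list(Host)).
```

With a bare name, node_cmd:get_cmd(secret, y, "") ends in "-name y ", which matches the traced command. Passing node_cmd:enodename('127.0.0.1', y) instead yields a command ending in "-name y@127.0.0.1 ", the name that wait_for_node_alive pings.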
In spawn_remote_node, pass the full node name built by enodename/2 to get_cmd:

    ssh_connection:exec(SSHConnRef, SSHChannelId, get_cmd(enodename(Host, Node), ErlFlags), infinity).

Now rebuild and install:
    $ pwd
    /Users/yufeng/otp
    $ export ERL_TOP=`pwd`
    $ cd lib/common_test/
    $ make
    $ sudo cp ebin/ct_slave.beam /usr/local/lib/erlang/lib/common_test-1.5.4/ebin/ct_slave.beam

With the code upgraded, let's test our hypothesis again:
    ([email protected])8> l(ct_slave).
    {module,ct_slave}
    ([email protected])9> ct_slave:start('127.0.0.1', y, [{username, "yourname"},{password, "yourpassword"}]).
    (<0.39.0>) call ssh_connection:exec(<0.105.0>,0,"erl -detached -noinput -setcookie VSVDYDCFVTPSDPBFHQMY -name y@127.0.0.1 ",infinity)
    (<0.39.0>) returned from ssh_connection:exec/4 -> success
    (<0.39.0>) call ct_slave:wait_for_node_alive('y@127.0.0.1',3)
    (<0.39.0>) call ct_slave:wait_for_node_alive('y@127.0.0.1',2)
    (<0.39.0>) returned from ct_slave:wait_for_node_alive/2 -> pong
    (<0.39.0>) returned from ct_slave:wait_for_node_alive/2 -> pong
    {ok,'y@127.0.0.1'}
    ([email protected])10> nodes().
    ['y@127.0.0.1']

This time the experiment succeeded: by starting the node remotely, we joined y@127.0.0.1 to the cluster. With the two steps above, we can conveniently deploy our Erlang system on newly added bare machines and control when nodes start and stop.

Happy clustering, everyone!
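For repeated use, the ct_slave call can be wrapped with a more generous boot timeout than the default 3 seconds. This is only a sketch: the module name cluster_node and the functions add/4 and slave_opts/3 are hypothetical, the 10-second timeout is an arbitrary choice, and the code assumes an OTP release that still ships ct_slave (newer releases favor the peer module).

```erlang
-module(cluster_node).
-export([add/4, slave_opts/3]).

%% Build the ct_slave option list; boot_timeout is given in seconds.
slave_opts(User, Password, BootTimeout) ->
    [{username, User}, {password, Password}, {boot_timeout, BootTimeout}].

%% Hypothetical wrapper: start Name on Host over ssh and report the
%% nodes visible afterwards. The 10-second timeout is an assumption.
add(Host, Name, User, Password) ->
    case ct_slave:start(Host, Name, slave_opts(User, Password, 10)) of
        {ok, Node} -> {ok, Node, nodes()};
        {error, Reason, Node} -> {error, {Reason, Node}}
    end.
```

A call such as cluster_node:add('127.0.0.1', y, "yourname", "yourpassword") mirrors the shell session above, and ct_slave:stop/1 can later be used to shut the node down.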