1. 程式人生 > >流式資料處理

流式資料處理

1、直接登陸伺服器:ssh 2014210***@thumedia.org -p 6349

建立streaming.py:   touch streaming.py,並且如下編輯:

<span style="font-size:14px;">#! /usr/bin/python
import logging
import math
import time
pg2count={}
t=1
while 1:
    fp=open('/tmp/hw3.log','r')
    for line in fp:
        line = line.strip()
        times, page, count = line.split()[0],line.split()[1],line.split()[2]
        if count.isdigit() & page.startswith('Page-'):
            try:
                pg2count[page] = [pg2count[page][0] + int(count),t]             
            except:
                pg2count[page] = [int(count),t]
    fp.close()
    a=sorted(pg2count.items(), key=lambda page:page[1][0], reverse = True)
    print '%s%s%s' % ('the page rank at current time ',times,' is:')
    for i in range(0,10):
        print '%s\t%d' % (a[i][0],a[i][1][0])
    logger = logging.getLogger()
    #set loghandler 
    file = logging.FileHandler("output.log")
    logger.addHandler(file)
    #set formater   
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    file.setFormatter(formatter)
    #set log level 
    logger.setLevel(logging.NOTSET)
    logger.info('%s%s%s' % ('the page rank at current time ',times,' is:'))
    for i in range(0,10):
        logger.info('%s\t%d' % (a[i][0],a[i][1][0]))
        time.sleep(60)</span>

2、寫好程式碼之後測試執行:python streaming.py輸出如下:

nohup: ignoring input and appending output to `nohup.out',則表示後臺執行成功,輸出顯示會儲存到nohup.out中,

clip_image002

也可以檢視output.log檔案裡的輸出:

clip_image004

最後我們讓它在後臺一直執行:nohup python streaming.py &輸出:

[1] 8994

一天之後,我們再次檢視結果:

clip_image006

可以看到,累計的結果已經和第一次不太一樣

3、殺掉程序:ps -ef|grep 1020得到如下輸出:

2014210***@cluster-3-1:~$ ps -ef|grep 1020

1020      7512  7471  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/30

1020      7513  7512  0 Jan10 pts/30   00:00:00 -bash

1020      7574  7508  0 20:55 ?        00:00:00 sshd: 2014210***@pts/52

1020      7575  7574  0 20:55 pts/52   00:00:00 -bash

1020      8282  7575  0 21:04 pts/52   00:00:00 ps -ef

1020      8283  7575  0 21:04 pts/52   00:00:00 grep --color=auto 1020

1020      8994     1  0 13:20 ?        00:01:46 python streaming.py

1020     12260 12232  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/35

1020     12261 12260  0 Jan10 pts/35   00:00:01 –bash

輸入kill 8994

2014210***@cluster-3-1:~$ kill 8994

2014210***@cluster-3-1:~$ ps -ef|grep 1020

1020      7512  7471  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/30

1020      7513  7512  0 Jan10 pts/30   00:00:00 -bash

1020      7574  7508  0 20:55 ?        00:00:00 sshd: 2014210***@pts/52

1020      7575  7574  0 20:55 pts/52   00:00:00 -bash

1020      8335  7575  0 21:05 pts/52   00:00:00 ps -ef

1020      8336  7575  0 21:05 pts/52   00:00:00 grep --color=auto 1020

1020     12260 12232  0 Jan10 ?        00:00:00 sshd: 2014210***@pts/35

1020     12261 12260  0 Jan10 pts/35   00:00:01 –bash

至此,streaming.py執行結束。

Question

How can your design scale when the streaming is large and the calculation is complicated?

答:首先確定每個程式週期需要的時間,然後確定這段時間內的流資料能夠儲存在一塊足夠大的快取區域,等到下個程式週期處理前一個快取的流資料即可。