09 Word Frequency Counting with Python
阿新 • Published: 2021-02-12
Tags: python, linux, big data, hadoop, ubuntu
1 System, Software, and Prerequisites
- CentOS-7 64-bit
- hadoop-2.5.2 already installed (see https://www.jianshu.com/p/5707c5ccd85b)
- Python 3.7.3 (already installed by default on this CentOS 7 system)

To spare beginners from Linux permission issues, all commands below are run as root.
2 Steps
- Create the mapper.py file:
#!/usr/bin/python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
To verify, run the following command:
echo aa bb cc dd aa cc|python mapper.py
which produces the following output:

aa	1
bb	1
cc	1
dd	1
aa	1
cc	1

- Create the reducer.py file:
#!/usr/bin/python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
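The reducer's IF-switch only works on key-sorted input, which Hadoop's shuffle phase guarantees. The whole map-sort-reduce flow can be simulated in pure Python without a cluster; this is a minimal local sketch (the function names `map_words` and `reduce_sorted` are illustrative, not part of the original scripts):

```python
def map_words(lines):
    # emit (word, 1) pairs, like mapper.py does on stdin
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_sorted(pairs):
    # same streaming logic as reducer.py: assumes key-sorted input
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

# sorted() plays the role of Hadoop's shuffle/sort between map and reduce
pairs = sorted(map_words(["aa bb cc dd aa cc"]))
print(list(reduce_sorted(pairs)))
# → [('aa', 2), ('bb', 1), ('cc', 2), ('dd', 1)]
```

Feeding the reducer unsorted pairs would emit duplicate keys, which is why the `sort` step in the verification pipeline below is mandatory.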
To verify, run the following command:
echo aa bb cc dd aa cc|python mapper.py|sort|python reducer.py
which produces the following output:

aa	2
bb	1
cc	2
dd	1

- Create a file info.txt with the following content:
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
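As a quick sanity check before running the Hadoop job, the totals for info.txt above can be computed locally with `collections.Counter` (a local sketch only, not part of the job itself):

```python
from collections import Counter

# the five lines of info.txt
lines = [
    "aa bb cc dd aa cc",
    "aa bb cc dd aa cc",
    "aa bb cc dd aa cc",
    "aa bb cc dd aa cc",
    "aa bb cc dd aa cc cc dd",
]
counts = Counter(word for line in lines for word in line.split())
print(dict(counts))
# → {'aa': 10, 'bb': 5, 'cc': 11, 'dd': 6}
```

The Hadoop job below should report the same totals, one word per line, tab-delimited.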
- Upload the file to HDFS as /data/info:
hdfs dfs -mkdir /data
hdfs dfs -put info.txt /data/info
- Run the following command (first make sure /out99 does not already exist in HDFS):
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
-input "/data/*" \
-output "/out99" \
-mapper "python mapper.py" \
-reducer "python reducer.py" \
-file "/root/mapper.py" \
-file "/root/reducer.py"
Note: $HADOOP_HOME is the Hadoop installation (home) directory.
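Once the job finishes, the counts can be read back from the output directory. The `part-*` pattern below is the conventional naming for reducer output files in Hadoop and may need adjusting for your cluster:

```shell
# print the tab-delimited word counts written by the reducer
hdfs dfs -cat /out99/part-*
```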
That completes word frequency counting with Python.