09 Word Frequency Counting with Python

阿新 • Published: 2021-02-12

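1 System, Software, and Prerequisites

CentOS 7, 64-bit
To keep Linux permission issues from getting in the way of beginners, all commands are run as root.
hadoop-2.5.2 is installed: https://www.jianshu.com/p/5707c5ccd85b
Python 3.7.3 is already installed on the CentOS 7 machine.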

2 Steps

Create the file mapper.py:

#!/usr/bin/python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit one record per word
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print ('%s\t%s' % (word, 1))

To verify, run the following command:

echo aa bb cc dd aa cc|python mapper.py

This produces the following result:

[Screenshot: viewing the count results]
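
As a rough sketch of what the pipe above does, the same mapper logic can be run in-process on the sample line. The function name `map_line` is a hypothetical helper introduced here for illustration; it is not part of mapper.py.

```python
def map_line(line):
    """Return the tab-delimited records mapper.py would print for one line."""
    return ['%s\t%s' % (word, 1) for word in line.strip().split()]


# Same input as the echo command above.
records = map_line('aa bb cc dd aa cc')
for record in records:
    print(record)
```

Each word is emitted once per occurrence with a count of 1, in input order. Nothing is aggregated at this stage; that is left to the reducer.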

Create the file reducer.py:

#!/usr/bin/python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py and convert the
    # count (currently a string) to int
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # malformed line (no tab, or count was not a number),
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print ('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word:
    print ('%s\t%s' % (current_word, current_count))

To verify, run the following command:

echo aa bb cc dd aa cc|python mapper.py|sort|python reducer.py

This produces the following result:

[Screenshot: viewing the count results]
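
The whole `mapper.py | sort | reducer.py` pipeline can also be mimicked in a single process, which makes it clearer why the sort matters. This is a sketch only: `map_line`, `reduce_sorted`, and `local_wordcount` are hypothetical names, and `itertools.groupby` stands in for the running-total logic of reducer.py rather than reproducing it line for line.

```python
from itertools import groupby


def map_line(line):
    """Emit one (word, 1) pair per word, like mapper.py."""
    return [(word, 1) for word in line.strip().split()]


def reduce_sorted(pairs):
    """Sum the counts of adjacent pairs sharing a key; input must be sorted by key."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=lambda pair: pair[0])]


def local_wordcount(lines):
    """Run map, sort, and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    pairs.sort()  # stands in for the shell `sort` (and Hadoop's shuffle)
    return reduce_sorted(pairs)


result = local_wordcount(['aa bb cc dd aa cc'])
for word, count in result:
    print('%s\t%s' % (word, count))
```

Without the sort step, the two occurrences of `aa` would not be adjacent, so the reducer would emit them as separate records instead of one total.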

Create a file named info.txt with the following contents:

aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
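
For reference when checking the Hadoop job later, the expected counts for this file can be worked out locally with `collections.Counter`. This is an independent cross-check, not how the streaming job computes them; the text is embedded as a string (the hypothetical constant `INFO_TXT`) so the snippet does not depend on info.txt being present.

```python
from collections import Counter

# Contents of info.txt, copied from the listing above.
INFO_TXT = """\
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
"""

# Split on any whitespace and count each distinct word.
expected = Counter(INFO_TXT.split())
for word in sorted(expected):
    print('%s\t%s' % (word, expected[word]))
```

Assuming /data holds only this file, the job output should list aa 10, bb 5, cc 11, and dd 6.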

Upload the file to HDFS as the file info under /data:

hdfs dfs -mkdir /data
hdfs dfs -put info.txt /data/info

Run the following command, first making sure that /out99 does not already exist in HDFS:

$HADOOP_HOME/bin/hadoop jar \
 $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
 -input "/data/*" \
 -output "/out99" \
 -mapper "python mapper.py" \
 -reducer "python reducer.py" \
 -file "/root/mapper.py" \
 -file "/root/reducer.py"
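
If the job is launched from a script, the command and the cleanup of a previous output directory could be assembled as argument lists. The sketch below only builds and prints the commands; `streaming_command` and `cleanup_command` are hypothetical helper names, `/opt/hadoop` is a placeholder for the real $HADOOP_HOME, and only the options already shown above plus `hdfs dfs -rm -r` are used.

```python
import shlex


def streaming_command(hadoop_home, output='/out99'):
    """Build the argv list for the streaming job shown above."""
    return [
        hadoop_home + '/bin/hadoop', 'jar',
        hadoop_home + '/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar',
        '-input', '/data/*',
        '-output', output,
        '-mapper', 'python mapper.py',
        '-reducer', 'python reducer.py',
        '-file', '/root/mapper.py',
        '-file', '/root/reducer.py',
    ]


def cleanup_command(output='/out99'):
    """Build the command that removes a previous output directory from HDFS."""
    return ['hdfs', 'dfs', '-rm', '-r', output]


# /opt/hadoop is a placeholder; substitute the real value of $HADOOP_HOME.
for argv in (cleanup_command(), streaming_command('/opt/hadoop')):
    print(' '.join(shlex.quote(arg) for arg in argv))
```

Passing either list to `subprocess.run` would execute it. The cleanup command reports an error when the directory is absent, which can be ignored in this case.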

Note: $HADOOP_HOME is the Hadoop installation directory.
That completes the word frequency count with Python.
