Writing WordCount in Python

阿新 • Published: 2019-01-28

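Hadoop is built on the MapReduce model, and WordCount is its most typical example. Hadoop itself is written in Java, and most Hadoop programs are written in Java as well. This post shows how to implement WordCount in Python instead, and how to run the finished program on a Hadoop cluster through the Hadoop Streaming jar.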

Creating mapper.py

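# -*- coding: utf-8 -*-

import re
import sys

# Read lines from standard input
for line in sys.stdin:
    line = line.strip()
    # Split on ASCII commas, the separator used in input.txt
    words = re.split(',', line)
    for word in words:
        if word:  # skip empty tokens, e.g. from blank lines
            print("{0}\t{1}".format(word, 1))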

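The mapper reads its input from sys.stdin and writes its output to sys.stdout. The print call is what produces the standard output that carries each word and its count from the map stage to the reduce stage, one tab-separated pair per line.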
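To make the mapping step easier to experiment with outside a pipe, here is a minimal sketch that wraps the same logic in a function and feeds it an in-memory stream. The name map_lines and its sep parameter are my own, not part of mapper.py, and the two sample lines are taken from input.txt.

```python
import io


def map_lines(lines, sep=","):
    """Yield (word, 1) for every non-empty token, mirroring mapper.py.

    map_lines and sep are illustrative names, not part of the original script.
    """
    for line in lines:
        for word in line.strip().split(sep):
            if word:
                yield word, 1


# io.StringIO stands in for sys.stdin here
sample = io.StringIO("hello,liming\nhi,zhangsan\n")
pairs = list(map_lines(sample))

for word, count in pairs:
    # Same "word<TAB>count" format that mapper.py prints
    print("{0}\t{1}".format(word, count))
```

Passing sys.stdin instead of the StringIO object should give the same behaviour as the script above.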

Input file input.txt

hello,liming
hi,zhangsan
haha,hehe,liming
wangmazi,map
hadoop,hdfs,hbase
map,reduce,reduce

Testing the mapper

Run the following command to test the mapper:

cat input.txt | python mapper.py

The output is:

hello   1
liming  1
hi  1
zhangsan    1
haha    1
hehe    1
liming  1
wangmazi    1
map 1
hadoop  1
hdfs    1
hbase   1
map 1
reduce  1
reduce  1

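The output shows that the mapper has split every line into words and emitted a count of 1 for each occurrence. Next, Hadoop sorts these pairs by word and passes the sorted stream to the reducer.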
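As a rough picture of what that sorting step achieves, the sketch below sorts a handful of mapper pairs and groups the counts by word. It is only a local imitation using sorted and itertools.groupby, not how Hadoop implements its shuffle; the sample pairs are a subset of the mapper output above.

```python
from itertools import groupby
from operator import itemgetter

# A few (word, count) pairs in the order the mapper emitted them
mapped = [("map", 1), ("reduce", 1), ("hadoop", 1), ("map", 1), ("reduce", 1)]

# Sorting by word puts identical words next to each other
shuffled = sorted(mapped, key=itemgetter(0))

# Collect the counts that belong to each word
grouped = {
    word: [count for _, count in group]
    for word, group in groupby(shuffled, key=itemgetter(0))
}

for word, counts in grouped.items():
    print(word, counts)
```

After sorting, every word occupies one contiguous run, which is the property the reducer relies on.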

Writing reducer.py

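# -*- coding: utf-8 -*-

import sys

cur_word = None
cur_count = 0

for line in sys.stdin:
    line = line.rstrip('\n')
    if not line:
        continue
    # Each input line is "word<TAB>count", as emitted by mapper.py
    word, count = line.split('\t', 1)
    count = int(count)

    if cur_word == word:
        cur_count += count
    else:
        # A new word starts: emit the total for the previous one
        if cur_word is not None:
            print("{0}\t{1}".format(cur_word, cur_count))
        cur_word = word
        cur_count = count

# Emit the last group
if cur_word is not None:
    print("{0}\t{1}".format(cur_word, cur_count))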

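reducer.py assumes that its input is already sorted by word. Hadoop guarantees this: it sorts the map output by key during the shuffle phase, before the data reaches the reducer, so all counts for the same word arrive on consecutive lines and a single running total is enough.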
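To see why the sorted order matters, here is a small sketch that applies the same running-total logic to an unsorted and a sorted list of pairs. The function name reduce_pairs is mine, and it works on in-memory tuples rather than on stdin lines.

```python
def reduce_pairs(pairs):
    """Running-total reducer; assumes equal words are adjacent.

    reduce_pairs is an illustrative name, not part of reducer.py.
    """
    result = []
    cur_word, cur_count = None, 0
    for word, count in pairs:
        if word == cur_word:
            cur_count += count
        else:
            if cur_word is not None:
                result.append((cur_word, cur_count))
            cur_word, cur_count = word, count
    if cur_word is not None:
        result.append((cur_word, cur_count))
    return result


unsorted_pairs = [("map", 1), ("reduce", 1), ("map", 1)]

# Without sorting, "map" is emitted once per run instead of once in total
print(reduce_pairs(unsorted_pairs))
# With sorting, each word appears exactly once with its full count
print(reduce_pairs(sorted(unsorted_pairs)))
```

This is also the reason the local test pipes the mapper output through sort before the reducer.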

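Testing reducer.py

Run the following command to test it; sort stands in for the sorting that Hadoop performs between map and reduce:

cat input.txt | python mapper.py | sort | python reducer.py

The test output is: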

hadoop  1
haha    1
hbase   1
hdfs    1
hehe    1
hello   1
hi  1
liming  2
map 2
reduce  2
wangmazi    1
zhangsan    1
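As an independent cross-check of these numbers, the counts can also be computed in a few lines with collections.Counter. This sketch bypasses the mapper and reducer entirely; the INPUT string simply repeats the contents of input.txt.

```python
from collections import Counter

# Contents of input.txt, repeated here so the sketch is self-contained
INPUT = """hello,liming
hi,zhangsan
haha,hehe,liming
wangmazi,map
hadoop,hdfs,hbase
map,reduce,reduce
"""

counts = Counter(
    word
    for line in INPUT.splitlines()
    for word in line.split(",")
    if word
)

for word in sorted(counts):
    print("{0}\t{1}".format(word, counts[word]))
```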

At this point the map and reduce programs for WordCount are complete. The next step is to upload them and run them on the Hadoop cluster.

Uploading input.txt

hadoop fs -put input.txt *** (a directory on HDFS)

Shell script

The job is normally submitted from the command line, but the command becomes very long, so it is easier to put it in a simple shell script:

#!/bin/bash

hadoop fs -rm -r -f ***/wordcount
hadoop jar ***/hadoop-mapreduce/hadoop-streaming.jar \
-libjars  *** \
-jobconf mapreduce.reduce.shuffle.memory.limit.percent=0.1 \
-jobconf mapreduce.reduce.shuffle.input.buffer.percent=0.1 \
-jobconf mapred.map.capacity=100 \
-jobconf mapreduce.reduce.memory.mb=8182 \
-jobconf mapreduce.reduce.java.opts=-Xms1600m \
-jobconf mapred.reduce.capacity=100 \
-jobconf mapred.reduce.tasks=600 \
-jobconf mapreduce.job.queuename=root.default \
-jobconf mapreduce.map.cpu.vcores=2 \
-jobconf mapreduce.reduce.cpu.vcores=4 \
-jobconf mapred.job.name=zds_sub_model_score \
-file mapper.py \
-file reducer.py \
-mapper "python mapper.py" \
-reducer "python reducer.py" \
-input  ***/input.txt \
-output ***/wordcount

In the script:

*** must be replaced with the directories on your own machine
Line 3 deletes any existing wordcount output
Lines 4 and 5 specify the jar files, which depend on your installation
Lines 6-16 set the job parameters
Lines 17 and 18 ship the program files with the job
Lines 19 and 20 specify the commands to run as mapper and reducer
Line 21 is the input file
Line 22 is the output path, i.e. where the wordcount result is written
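If you would rather launch the job from Python than from a shell script, the command can be assembled as an argument list. The sketch below uses only the streaming options that appear in the script above and leaves out the -jobconf tuning; the function name build_streaming_cmd and all three paths in the example call are hypothetical placeholders, not real locations.

```python
import shlex


def build_streaming_cmd(streaming_jar, input_path, output_path):
    """Return the hadoop streaming command as a list of arguments.

    The function name and its parameters are illustrative; only the
    option names are taken from the shell script.
    """
    return [
        "hadoop", "jar", streaming_jar,
        "-file", "mapper.py",
        "-file", "reducer.py",
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
        "-input", input_path,
        "-output", output_path,
    ]


# Placeholder paths: substitute the ones from your own cluster
cmd = build_streaming_cmd(
    "/path/to/hadoop-streaming.jar",
    "/user/demo/input.txt",
    "/user/demo/wordcount",
)

# Print the command in a form that can be pasted into a shell
print(" ".join(shlex.quote(part) for part in cmd))
```

On a machine with Hadoop installed, the list could be handed to subprocess.run(cmd, check=True); keeping it as a list avoids quoting problems with the "python mapper.py" arguments.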

Fetching the wordcount output

hadoop fs -getmerge ***/wordcount wordcount

WordCount result

hadoop  1
haha    1
hbase   1
hdfs    1
hehe    1
hello   1
hi  1
liming  2
map 2
reduce  2
wangmazi    1
zhangsan    1

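The result of running WordCount on Hadoop is identical to the result of the local run, which confirms that the approach described in this post works.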
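Comparing the two results by eye is fine for twelve words, but a small script scales better. The sketch below parses both outputs into dictionaries and compares them, so the line order does not matter (with several reducers the merged file is not necessarily sorted globally). The function name parse_counts and the two sample strings are illustrative; in practice the text would be read from the local output and from the file produced by getmerge.

```python
def parse_counts(text):
    """Turn "word<TAB>count" lines into a dict; blank lines are ignored.

    parse_counts is an illustrative helper, not part of the original post.
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts


# Shortened stand-ins for the local output and the merged cluster output
local_text = "hadoop\t1\nliming\t2\nmap\t2\n"
cluster_text = "map\t2\nhadoop\t1\nliming\t2\n"

same = parse_counts(local_text) == parse_counts(cluster_text)
print(same)
```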
