Writing a Basic MapReduce Program in Python

阿新 • Published: 2019-03-15


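A MapReduce program consists of roughly three parts: the mapper file, the reducer file, and the hadoop command that runs the job.

What puzzled me longest along the way was the path to hadoop-streaming, that is, the streaming jar that the hadoop command needs.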

The path looks roughly like this:

cd ~
cd /usr/local/hadoop-2.7.3/share/hadoop/tools/lib
# this directory contains hadoop-streaming-2.7.3.jar

I took this path from a post I used as a reference.
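If your installation lives somewhere else, a small script can search for the jar instead of guessing the path. This is only a sketch: the function name `find_streaming_jar` is made up for illustration, and it assumes the jar sits under `share/hadoop/tools/lib` as it does in the 2.7.3 layout above, which may not hold for other distributions.

```python
import glob
import os


def find_streaming_jar(hadoop_home):
    """Return the sorted paths of hadoop-streaming jars under hadoop_home."""
    # Assumption: same directory layout as the 2.7.3 install shown above.
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib", "hadoop-streaming-*.jar"
    )
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # /usr/local/hadoop-2.7.3 is the install directory used in this post;
    # HADOOP_HOME, if set in your environment, is tried first (an assumption).
    home = os.environ.get("HADOOP_HOME", "/usr/local/hadoop-2.7.3")
    for jar in find_streaming_jar(home):
        print(jar)
```

An empty result means the jar is not in that directory, in which case searching the whole install directory for `hadoop-streaming*.jar` is the next thing to try.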

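For this basic MapReduce program I mainly drew on three blog posts:

The first: mainly for how the mapper and reducer are written. Its exercises also include an example that runs with only a mapper.

The second: mainly for how run.sh is written.

The third: mainly for how to upload a local file to HDFS.

First, the mapper file:
mapper.py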

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit every word with a count of 1
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        # (the parenthesized form works in both Python 2 and Python 3)
        print('%s\t%s' % (word, 1))

# The output of this script is roughly one line per word, each paired with the number 1.
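To see what the mapper emits without piping anything through it, the same logic can be wrapped in a function. This is a minimal sketch, and `map_line` is a name introduced here for illustration; it is not part of mapper.py.

```python
def map_line(line):
    """Return the (word, 1) pairs that mapper.py would emit for one line."""
    return [(word, 1) for word in line.strip().split()]


if __name__ == "__main__":
    # Print the pairs in the same tab-delimited form mapper.py uses.
    for word, count in map_line("foo foo quux labs foo bar quux"):
        print("%s\t%s" % (word, count))
```

Repeated words are not merged at this stage: `foo` comes out three times, each with a count of 1. Adding them up is the reducer's job.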

Next, the reducer file: reducer.py

#!/usr/bin/env python

import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py and
    # convert count (currently a string) to int
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word, if there was any input at all
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

Test the two scripts locally first:

vim test.txt
foo foo quux labs foo bar quux

cat test.txt|python mapper.py

cat test.txt|python mapper.py|sort|python reducer.py
## Note that after running the mapper we sort its output, so identical words end up on adjacent lines; this lets the reducer code stay fairly simple.
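Why the sort matters can be shown with `itertools.groupby`, which, like reducer.py, only merges neighbouring items that share a key. The sketch below uses a hypothetical helper `reduce_pairs` in place of reducer.py and Python's `sorted` in place of the shell `sort`; it illustrates the idea and is not the code that runs on the cluster.

```python
from itertools import groupby
from operator import itemgetter


def reduce_pairs(pairs):
    """Sum the counts over each run of consecutive pairs with the same word."""
    return [
        (word, sum(count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    ]


if __name__ == "__main__":
    words = "foo foo quux labs foo bar quux".split()
    pairs = [(word, 1) for word in words]
    # Unsorted input: "foo" and "quux" each show up in two separate runs.
    print(reduce_pairs(pairs))
    # Sorted input: every word forms exactly one run, so one total per word.
    print(reduce_pairs(sorted(pairs)))
```

On the unsorted input the third `foo` is counted separately from the first two, whereas on the sorted input the totals are bar 1, foo 3, labs 1, quux 2. Hadoop performs this sort between the map and reduce phases, which is why reducer.py can get away with comparing each word only to the previous one.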

Then run the code on the Hadoop cluster.

First upload test.txt to the corresponding HDFS location, using a command of this form:

hadoop fs -put ./test.txt /dw_ext/weibo_bigdata_ugrowth/mds/

Then write a run.sh:


HADOOP_CMD="/usr/local/hadoop-2.7.3/bin/hadoop"  # path to the hadoop binary
STREAM_JAR_PATH="/usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar"  # path to the streaming jar

INPUT_FILE_PATH="/dw_ext/weibo_bigdata_ugrowth/mds/test.txt"  # input path on the Hadoop cluster (the file uploaded above)
# Note: the input file must live in HDFS on the Hadoop cluster, so the local file has to be uploaded first
OUTPUT_PATH="/dw_ext/weibo_bigdata_ugrowth/mds/output"
# Note: the output directory must not exist yet. I had already run the job once, so the command below deletes it first

# -rm -r replaces the deprecated -rmr
"$HADOOP_CMD" fs -rm -r "$OUTPUT_PATH"

"$HADOOP_CMD" jar "$STREAM_JAR_PATH" \
    -input "$INPUT_FILE_PATH" \
    -output "$OUTPUT_PATH" \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file ./mapper.py \
    -file ./reducer.py

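# -mapper: the user's own mapper program; it can be an executable or a script
# -reducer: the user's own reducer program; it can be an executable or a script
# -file: packages a file into the submitted job; it can be any input file the mapper or reducer needs, such as a config file or a dictionary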
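The same command can also be assembled from Python, which makes it easier to see how each option pairs with its value. This is a sketch only: `build_streaming_command` is a hypothetical helper, and it uses just the options that appear in run.sh.

```python
def build_streaming_command(hadoop_cmd, jar, input_path, output_path,
                            mapper="mapper.py", reducer="reducer.py"):
    """Assemble the hadoop-streaming argument list used in run.sh."""
    return [
        hadoop_cmd, "jar", jar,
        "-input", input_path,              # must already exist in HDFS
        "-output", output_path,            # must not exist yet
        "-mapper", "python %s" % mapper,
        "-reducer", "python %s" % reducer,
        "-file", "./%s" % mapper,          # ship both scripts with the job
        "-file", "./%s" % reducer,
    ]


if __name__ == "__main__":
    cmd = build_streaming_command(
        "/usr/local/hadoop-2.7.3/bin/hadoop",
        "/usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar",
        "/dw_ext/weibo_bigdata_ugrowth/mds/test.txt",
        "/dw_ext/weibo_bigdata_ugrowth/mds/output",
    )
    # Only print the command here; passing cmd to subprocess.call(cmd)
    # would actually submit the job.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string means the `-mapper` and `-reducer` values, which contain a space, reach Hadoop as one argument each without extra shell quoting.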

To read tomorrow:
https://www.cnblogs.com/shay-zhangjin/p/7714868.html
https://www.cnblogs.com/kaituorensheng/p/3826114.html
