
Hadoop Benchmarking with HiBench: Study Notes

I. Preparation

OS: CentOS, 64-bit

JDK 1.7

Maven

Hadoop 2.6

HiBench 4.0, downloaded from GitHub

Run bin/build-all.sh to build (following the HiBench documentation; since Spark is not installed, running run-all.sh directly reports errors, which can be ignored). A sketch of the setup commands follows.
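A minimal sketch of the download-and-build steps (the repository path and tag name are assumptions, not taken from the original notes):

git clone https://github.com/intel-hadoop/HiBench.git   # assumed repository path
cd HiBench
git checkout HiBench-4.0                                # assumed tag name for the 4.0 release
bin/build-all.sh                                        # compile, per the HiBench docs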

II. Building and Running

1. The benchmarks.lst file in the conf directory lists the workloads to test, one per line; delete everything except wordcount.

2. The languages.lst file in the conf directory lists the framework types to test, one per line; comment out spark/java and the rest with #, keeping only mapreduce (both files are sketched after this list).

3. Run run-all.sh to obtain the results.
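What the two conf files look like after editing (a sketch; the exact entries shipped with HiBench 4.0 may differ):

# conf/benchmarks.lst -- one workload per line, everything but wordcount deleted
wordcount

# conf/languages.lst -- one framework per line, the others commented out with #
mapreduce
#spark/java
#spark/scala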

III. Exploration

1. Copy run-all.sh to run-wordcount.sh, delete the section that reads benchmarks.lst, and set benchmark directly to wordcount, so that only the wordcount test runs (a sketch of this change follows the list).

2. Reading the shell scripts shows that the driver goes on to call prepare.sh under workloads/wordcount to prepare the data, and then the corresponding run.sh under workloads/wordcount to execute the test.

3. workload-functions.sh under the function directory provides shared functions such as gen_report().

4. The structure is complex, and the functionality is extensive.

5. I peeled out the required common functions layer by layer, timestamp(), dir_size(), gen_report() and so on, and reassembled them into a single script, bench-wordcount.sh.

6. bench-wordcount.sh runs and passes, which makes the basic flow of the Hadoop wordcount benchmark clear. A sketch of the stripped-down driver follows, and the full bench-wordcount.sh is listed after it.
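A minimal sketch of the run-wordcount.sh change and the call chain it sets off (the loop is paraphrased rather than verbatim, and the workload subpaths follow the HiBench 4.0 layout, so treat the exact names as assumptions):

# run-all.sh originally iterates over the workload list, roughly:
#     for benchmark in `cat conf/benchmarks.lst`; do ... done
# run-wordcount.sh drops the loop and pins the workload:
benchmark=wordcount
workloads/$benchmark/prepare/prepare.sh     # generate the input data in HDFS
workloads/$benchmark/mapreduce/bin/run.sh   # run the actual MapReduce test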

#!/bin/bash
# bash rather than plain sh: the script uses the "function" keyword and [[ =~ ]]
PS4="#: "                       # trace prefix shown when debugging with bash -x

function timestamp(){           # current time in milliseconds since the epoch
    sec=`date +%s`              # whole seconds
    nanosec=`date +%N`          # nanosecond field of the current second
    tmp=`expr $sec \* 1000 `
    msec=`expr $nanosec / 1000000 `
    echo `expr $tmp + $msec`
}

#hadoop fs -ls  /HiBench/Wordcount/Input/part-m-00000
#hadoop fs -du -s /HiBench/Wordcount/Input/part-m-00000
#32849623  /HiBench/Wordcount/Input/part-m-00000

function dir_size() {           # pick the numeric size field out of `hadoop fs -du -s` output
    for item in $1; do          # rely on word splitting of the quoted argument
        if [[ $item =~ ^[0-9]+$ ]]; then
            echo $item
        fi
    done
}

function gen_report() {         # dump the result to the report file
    local workload_name=$1
    local start=$2              # start/end are millisecond timestamps from timestamp()
    local end=$3
    local size=$4               # input size in bytes
    local report_file=$5

    which bc > /dev/null 2>&1
    if [ $? -ne 0 ]; then       # `which` may fail with any non-zero status
        echo "\"bc\" utility missing. Please install it to generate proper report." >&2
        return 1
    fi
    local duration=$(echo "scale=3;($end-$start)/1000"|bc)   # seconds
    local tput=`echo "$size/$duration"|bc`                   # bytes per second
    nodes=2                     # hardcoded worker-node count for this 2-node cluster

    if [ $nodes -eq 0 ]; then nodes=1; fi
    local tput_node=`echo "$tput/$nodes"|bc`
    echo "${REPORT_TITLE}" > $report_file
    local report_line=$(printf "${REPORT_COLUMN_FORMATS}" ${workload_name} $(date +%F) $(date +%T) $size $duration $tput $tput_node)
    echo "${report_line}" >> ${report_file}
}

#--------------------------------------------------------------
# environment settings
#--------------------------------------------------------------
REPORT_COLUMN_FORMATS="%-12s %-10s %-8s %-20s %-20s %-20s %-20s\n"
REPORT_TITLE="Type Date Time Input_data_size Duration(s) Throughput(bytes/s) Throughput/node"
REPORT_FILE="./bench-report.txt"

HADOOP_HOME="/home/hadoop/cloud/hadoop"   # adjust to the local installation
HDFS_HOST="hdfs://cld-srv-01:9000"        # NameNode address of the test cluster
JARS="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar"
INPUT_DIR=$HDFS_HOST/HiBench/Wordcount/Input
OUTPUT_DIR=$HDFS_HOST/HiBench/Wordcount/Output

#--------------------------------------------------------------
# prepare testing data
#--------------------------------------------------------------

# clear any previous input, then generate ~3.2 GB of random text with the
# randomtextwriter example job (the -rm error on a first run is harmless)
$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/etc/hadoop fs -rm -r -skipTrash $INPUT_DIR
$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/etc/hadoop jar $JARS randomtextwriter \
    -D mapreduce.randomtextwriter.totalbytes=3200000000 \
    -D mapreduce.job.maps=12 \
    -D mapreduce.job.reduces=6 \
    -D mapreduce.output.fileoutputformat.compress=false \
    $INPUT_DIR

SIZE_INFO=`$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/etc/hadoop fs -du -s $INPUT_DIR`
SIZE=`dir_size "$SIZE_INFO"`    # quote so the whole "size path" line reaches dir_size as $1
echo $SIZE

#------------------------------------------------------------
# execute benchmark testing
#------------------------------------------------------------

# remove stale output (the job fails if the output directory already exists), then time the run
$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/etc/hadoop fs -rm -r -skipTrash $OUTPUT_DIR
START_TIME=`timestamp`
$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/etc/hadoop jar $JARS wordcount \
    -D mapreduce.output.fileoutputformat.compress=false \
    -D mapreduce.job.maps=12 \
    -D mapreduce.job.reduces=6 \
    $INPUT_DIR \
    $OUTPUT_DIR
END_TIME=`timestamp`

gen_report "HadoopWordCount" $START_TIME $END_TIME $SIZE $REPORT_FILE