hadoop2.x入門：編寫mapreduce對氣象資料集求每日最高氣溫和最低氣溫

阿新 • • 發佈：2019-01-23

1.下載氣象資料集

我們下載國內的氣象資料，使用下面命令進行下載

wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2017/5*

2.解壓資料集，並儲存在文字檔案中

7月23號下載的，資料量為79w行,大小為182MB。所以即使年底也不過200w行。

[grid@tiny01 ~]$ zcat data/ftp.ncdc.noaa.gov/pub/data/noaa/2017/5*.gz > data.txt

在這裡>表示輸出重定向符

我們檢視氣象資料集：

0169501360999992017010100004 
+52130+122520FM-12+043399999V0202201N0010102600199003700199-03271-03631102641ADDAA124160092AJ199999999999999GA1081+026001101GA2999+999999101GA3999+999999101GE19MSL   +99999+99999GF108991081999026001999999MA1999999096571MD1210041+0301REMSYN004BUFR

對資料格式進行解釋

1-4     0169    
5-10    501360      # USAF weather station identifier 
11-15   99999 
       # WBAN weather station identifier
16-23   20170101    # 記錄日期
24-27   0000        # 記錄時間
28      4
29-34   +52130      # 緯度（1000倍）
35-41   +122520     # 經度（1000倍）
42-46   FM-12
47-51   +0433      # 海拔（米）
52-56   99999
57-60   V020
61-63   220         # 風向
64      1           # 質量程式碼
65      N
66-69   0010
70      1
71 
-75   02600       # 雲高（米）
76      1
77      9
78      9
79-84   003700      # 能見距離（米）
85      1
86      9
87      9
88-92   -0327       # 空氣溫度(攝氏度*10)
93      1
94-98   -0363       # 露點溫度(攝氏度*10)
99      1
100-104 10264       # 大氣壓力
105     1

其中第5-10位表示氣象站編號：501360（取前五位），查表可得對應的是黑龍江漠河。我們主要分析的是月份:16-21位和空氣溫度:88-92位的極值關係。

3. 編寫MapReduce程式

Mapper程式

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends
                Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                String line = value.toString();
                String data = line.substring(15, 21);
                int airTemperature;
                if (line.charAt(87) == '+') { // parseInt doesn't like leading plus
                                                                               // signs
                        airTemperature = Integer.parseInt(line.substring(88, 92));
                } else {
                        airTemperature = Integer.parseInt(line.substring(87, 92));
                }
                String quality = line.substring(92, 93);
                if (airTemperature != MISSING && quality.matches("[01459]")) {
                        context.write(new Text(data), new IntWritable(airTemperature));
                }
        }
}

Reducer程式

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends
                Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                int maxValue = Integer.MIN_VALUE;
                for (IntWritable value : values) {
                        maxValue = Math.max(maxValue, value.get());
                }
                context.write(key, new IntWritable(maxValue));
        }
}

M-R job

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

public class MaxTemperature extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err
                    .println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        conf.set("mapred.jar", "MaxTemperature.jar");
        Job job = Job.getInstance(conf);
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        // TODO Auto-generated method stub
        return 0;
    }
}

注意設定conf.set("mapred.jar", "MaxTemperature.jar");第二個引數為即將打成的jar包的名稱

4.編譯java檔案,打成jar包

此編譯命令為：

[grid@tiny01 myclass]$ javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.2.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.2.jar:$HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar *.java
[grid@tiny01 myclass]$ jar cvf MaxTemperature.jar *.class
[grid@tiny01 myclass]$ ll
total 28
-rw-rw-r--. 1 grid grid 1413 Jul  3 16:45 MaxTemperature.class
-rw-rw-r--. 1 grid grid 3085 Jul  9 19:04 MaxTemperature.jar
-rw-rw-r--. 1 grid grid  949 Jun 30 15:49 MaxTemperature.java
-rw-rw-r--. 1 grid grid 1876 Jul  3 16:45 MaxTemperatureMapper.class
-rw-rw-r--. 1 grid grid  953 Jun 30 15:37 MaxTemperatureMapper.java
-rw-rw-r--. 1 grid grid 1687 Jul  3 16:45 MaxTemperatureReducer.class
-rw-rw-r--. 1 grid grid  553 Jun 30 15:47 MaxTemperatureReducer.java

這裡的classpath和之前的hadoop版本有所區別，需要按照新的設定方法，這一點網上很少提及！(注意Hadoop不同版本，包不一樣)

5.將資料上傳至hdfs上

[grid@tiny01 ~]$ hadoop fs -put data.txt /data.txt

6. 執行該程式

[[email protected] ~]$ hadoop jar MaxTemperature.jar MaxTemperature /data.txt /out

17/07/24 00:13:20 INFO client.RMProxy: Connecting to ResourceManager at tiny01/192.168.1.101:8032
17/07/24 00:13:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
17/07/24 00:13:22 INFO input.FileInputFormat: Total input paths to process : 1
17/07/24 00:13:23 INFO mapreduce.JobSubmitter: number of splits:2
17/07/24 00:13:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1500807860144_0002
17/07/24 00:13:24 INFO impl.YarnClientImpl: Submitted application application_1500807860144_0002
17/07/24 00:13:24 INFO mapreduce.Job: The url to track the job: http://tiny01:8088/proxy/application_1500807860144_0002/
17/07/24 00:13:24 INFO mapreduce.Job: Running job: job_1500807860144_0002
17/07/24 00:13:44 INFO mapreduce.Job: Job job_1500807860144_0002 running in uber mode : false
17/07/24 00:13:44 INFO mapreduce.Job:  map 0% reduce 0%
17/07/24 00:14:49 INFO mapreduce.Job:  map 20% reduce 0%
17/07/24 00:14:52 INFO mapreduce.Job:  map 33% reduce 0%
17/07/24 00:14:55 INFO mapreduce.Job:  map 50% reduce 0%
17/07/24 00:16:02 INFO mapreduce.Job:  map 51% reduce 0%
17/07/24 00:16:05 INFO mapreduce.Job:  map 54% reduce 0%
17/07/24 00:16:08 INFO mapreduce.Job:  map 57% reduce 0%
17/07/24 00:16:11 INFO mapreduce.Job:  map 60% reduce 0%
17/07/24 00:16:14 INFO mapreduce.Job:  map 62% reduce 0%
17/07/24 00:16:20 INFO mapreduce.Job:  map 65% reduce 0%
17/07/24 00:16:40 INFO mapreduce.Job:  map 69% reduce 0%
17/07/24 00:16:42 INFO mapreduce.Job:  map 73% reduce 0%
17/07/24 00:16:44 INFO mapreduce.Job:  map 83% reduce 0%
17/07/24 00:16:46 INFO mapreduce.Job:  map 100% reduce 0%
17/07/24 00:17:22 INFO mapreduce.Job:  map 100% reduce 100%
17/07/24 00:17:30 INFO mapreduce.Job: Job job_1500807860144_0002 completed successfully
17/07/24 00:17:32 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=10226664
                FILE: Number of bytes written=20805407
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=190690631
                HDFS: Number of bytes written=77
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Killed map tasks=1
                Launched map tasks=3
                Launched reduce tasks=1
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=383699
                Total time spent by all reduces in occupied slots (ms)=143422
                Total time spent by all map tasks (ms)=383699
                Total time spent by all reduce tasks (ms)=143422
                Total vcore-milliseconds taken by all map tasks=383699
                Total vcore-milliseconds taken by all reduce tasks=143422
                Total megabyte-milliseconds taken by all map tasks=392907776
                Total megabyte-milliseconds taken by all reduce tasks=146864128
        Map-Reduce Framework
                Map input records=789998
                Map output records=786666
                Map output bytes=8653326
                Map output materialized bytes=10226670
                Input split bytes=184
                Combine input records=0
                Combine output records=0
                Reduce input groups=7
                Reduce shuffle bytes=10226670
                Reduce input records=786666
                Reduce output records=7
                Spilled Records=1573332
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=2436
                CPU time spent (ms)=8470
                Physical memory (bytes) snapshot=415924224
                Virtual memory (bytes) snapshot=6170849280
                Total committed heap usage (bytes)=267198464
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=190690447
        File Output Format Counters
                Bytes Written=77

檢視結果

[grid@tiny01 ~]$ hadoop fs -cat /out/part-r-00000
201701  307
201702  350
201703  375
201704  399
201705  426
201706  444
201707  485

由於這裡的氣溫是攝氏度的10倍，所以看起來很大。
我們來檢查一下：

[grid@tiny01 ~]$ hadoop fs -copyToLocal /out/part-r-00000 result.txt
[grid@tiny01 ~]$ awk '{print $1".{66}+0"$2"1"}' result.txt |xargs -i grep --color=auto {} sample.txt | awk -v FS="" '{print substr($0,5,5),substr($0,16,6),substr($0,88,6)}'
59158 201704 +03071
59997 201701 +03071
56966 201702 +03501
56966 201703 +03751
56966 201704 +03991
51573 201705 +04261
51573 201706 +04441
51573 201706 +04441
51573 201707 +04851

正則表示式不會寫，就將就著看吧，第一條是因為正則表示式匹配的問題，因此這條資料不算。但是其他條都吻合，我們可以看看這幾個氣象站：

51573：新疆吐魯番
56966：雲南元江
59997：沒找到

測試成功!

hadoop2.x入門：編寫mapreduce對氣象資料集求每日最高氣溫和最低氣溫

1.下載氣象資料集我們下載國內的氣象資料，使用下面命令進行下載 wget -D --accept-regex=REGEX -P data -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/2017/5*

Hadoop2.x實戰：WordCount、Sort、去重複、average例項MapRedure編寫

Hadoop版本：2.6.0Eclipse版本：luna一、 Hadoop做的一個計算單詞的例項1、引入jar <dependencies> <dependency> <groupId>org.apache.hadoop&

深度學習入門專案：用keras構建CNN或LSTM對minist資料集做簡單分類任務

深度學習入門專案：用keras構建CNN或LSTM或RNN對Minist資料集做簡單分類任務參考keras中文文件 ——keras：是一個高階神經網路庫，用 Python 語言寫成，可以執行在 TensorFlow 或者 Theano 之上（即以此為後端）。

Hadoop2.x入門講解

Hadoop專案主要包括一下四個模組 1、Hadoop Common 為其他Hadoop模組提供基礎設施。 2、Hadoop HDFS 一個高可靠、高吞吐量的分散式檔案系統 3、Hadoop MapReduce 一個分散式的離線平行計算框架。 4、H

深度學習入門教程UFLDL學習實驗筆記二：使用向量化對MNIST資料集做稀疏自編碼

今天來做UFLDL的第二個實驗，向量化。我們都知道，在matlab裡面基本上如果使用for迴圈，程式是會慢的一逼的（可以說基本就執行不下去）所以在這呢，我們需要對程式進行向量化的處理，所謂向量化就是將matlab裡面所有的for迴圈用矩陣運算的方法實現，在這裡呢，因為之前的實驗我已經是按照向量化的形式編寫的程

Spark2.x 入門：Spark Streaming 簡介

Spark Streaming是構建在Spark上的實時計算框架，它擴充套件了Spark處理大規模流式資料的能力。Spark Streaming可結合批處理和互動查詢，適合一些需要對歷史資料和實時資料進行結合分析的應用場景。 Spark Streaming設計

Hadoop2.x實戰：WordCount例項執行

摘要：本文主要講了hadoop2.7.2自帶的WordCount例項執行作業系統:Ubuntu14.04 hadoop版本：2.7.2例項測試wordCount1、啟動hadoop/usr/hadoop/hadoop-2.7.2/sbin .

SpringBoot2.x入門：快速建立一個SpringBoot應用

## 前提這篇文章是《SpringBoot2.x入門》專輯的**第2篇**文章，使用的`SpringBoot`版本為`2.3.1.RELEASE`，`JDK`版本為`1.8`。常規的套路會建議使用`Spring`官方提供的工具[Spring Initializr](https://start.spri

SpringBoot2.x入門：引入web模組

## 前提這篇文章是《SpringBoot2.x入門》專輯的**第3篇**文章，使用的`SpringBoot`版本為`2.3.1.RELEASE`，`JDK`版本為`1.8`。主要介紹`SpringBoot`的`web`模組引入，會相對詳細地分析不同的`Servlet`容器（如`Tomcat`、`Je

SpringBoot2.x入門：應用打包與啟動

## 前提這篇文章是《SpringBoot2.x入門》專輯的**第5篇**文章，使用的`SpringBoot`版本為`2.3.1.RELEASE`，`JDK`版本為`1.8`。這篇文章分析一個偏向於運維方面的內容：`SpringBoot`應用的打包與啟動，分別會分析嵌入式`Servlet`容器和非嵌入

SpringBoot2.x入門：使用CommandLineRunner鉤子介面

## 前提這篇文章是《SpringBoot2.x入門》專輯的**第6篇**文章，使用的`SpringBoot`版本為`2.3.1.RELEASE`，`JDK`版本為`1.8`。這篇文章主要簡單聊聊鉤子介面`CommandLineRunner`和`ApplicationRunner`，下文有時候統稱兩者

SpringBoot2.x入門：使用MyBatis

> 這是公眾號《Throwable文摘》釋出的第**25**篇原創文章，收錄於專輯《SpringBoot2.x入門》。 ## 前提這篇文章是《SpringBoot2.x入門》專輯的**第8篇**文章，使用的`SpringBoot`版本為`2.3.1.RELEASE`，`JDK`版本為`1.8`。 `S

練習：編寫循環，讓用戶輸入用戶名和密碼，如果用戶為空或者數字，則重新輸入

color while pla 密碼 b2b 如果 lap use 循環編寫循環，讓用戶輸入用戶名和密碼，如果用戶為空或者數字，則重新輸入 while True: user = input(‘請輸入用戶:‘) .strip() # 用戶輸入，去掉兩邊

面試：談談你對大資料的理解

大資料概念：　　大資料（big data）：指無法在一定時間範圍內用常規軟體工具進行捕捉、管理和處理的資料集合，　　　　　　　　　　　是需要新處理模式才能具有更強的決策力、洞察發現力和流程優化能力的海量、高增長率和多樣化的資訊資產。　　主要解決：海量資料的儲存和海量資料的分析計算問題。大資料的

MapReduce對大資料進行排序的實踐

從一個小的例子開始: Map和reduce之間的shuffle(洗牌)是將key值相同的放在一塊,sort(排序)是按照key值進行排序.例如like出現了兩次,就會把like放在一起.you也是.然後根據key值進行按照字典的順序進行排序.下面我想將下面的資料按照時間進行排序,並且ID相

Unity：通過LitJson對JSON資料建立與使用

本文章首發於Unity：通過litjson對json資料建立與使用科普時間： LitJSON是一個使用C#編寫的.NET庫，我們可以通過它來處理JSON資料。 JSON(JavaScript Object Notation, JS 物件簡譜) 是一種輕量級的資料交換格式。它基於 E

簡單程式的編寫2：統計最高成績和最低成績

以下程式的功能是從鍵盤輸入若干個學生的成績，統計出最高成績和最低成績，當輸入負數時，結束輸入。 #include<iostream.h> void main() { int str[100],i,x,max,min; for(i=0;;i++) { c

Tensorflow例項1：對人工資料集的K均值聚類

2.6.7 例1–對人工資料集的K均值聚類 import tensorflow as tf import numpy as np import time import matplotlib.pyplot as plt import matplotlib from sklearn

TensorFlow深度學習實戰（一）：AlexNet對MNIST資料集進行分類

概要進來一段時間在看深度學習中經典的CNN模型相關論文。同時，為了督促自己學習TensorFlow，通讀論文之後開始，利用TensorFlow實現各個模型，復現相關實驗。這是第一篇論文講解的是AlexNet，論文下載網址為：ImageNet Classific

C/C++面試題：編寫類String的建構函式、解構函式和賦值函式。

考點：建構函式、解構函式和賦值函式的編寫方法出現頻率：☆☆☆☆☆已知類String的原型為： class String { public: String(const char *str = NULL);

hadoop2.x入門：編寫mapreduce對氣象資料集求每日最高氣溫和最低氣溫

1.下載氣象資料集

2.解壓資料集，並儲存在文字檔案中

3. 編寫MapReduce程式

4.編譯java檔案,打成jar包

5.將資料上傳至hdfs上

6. 執行該程式

相關推薦