MapReduce Data Analysis Example

阿新 • Published: 2018-11-28

Data package

Baidu Netdisk

Link: https://pan.baidu.com/s/1v9M3jNdT4vwsqup9N0mGOA
Extraction code: hs9c

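1. Data cleaning notes:

(1) Column 1 is the time;

(2) Column 2 is the seller;

(3) Column 3 is the buyer;

(4) Column 4 is the number of tickets;

(5) Column 5 is the amount.

Sellers and buyers fall into three roles: airports (IDs starting with C), agents (IDs starting with O), and ordinary customers (PAX).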
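The column layout and role prefixes above can be captured in a small parser. The following is a minimal sketch, not code from the original experiment: the names `TicketRecord`, `Role`, `roleOf`, and `parse` are hypothetical, the fields are assumed to be comma-separated (the mapper later in this post splits on ","), the amount is assumed to be a decimal number, and the time is kept as a plain string because the post does not state its format.

```java
import java.math.BigDecimal;

// Hypothetical helper illustrating the five-column record layout.
final class TicketRecord {
    enum Role { AIRPORT, AGENT, CUSTOMER, UNKNOWN }

    final String time;       // column 1, format not specified in the post
    final String seller;     // column 2
    final String buyer;      // column 3
    final int tickets;       // column 4
    final BigDecimal amount; // column 5 (assumed to be a decimal number)

    private TicketRecord(String time, String seller, String buyer,
                         int tickets, BigDecimal amount) {
        this.time = time;
        this.seller = seller;
        this.buyer = buyer;
        this.tickets = tickets;
        this.amount = amount;
    }

    // Classify a party by its prefix: C = airport, O = agent, PAX = customer.
    static Role roleOf(String party) {
        if (party.startsWith("PAX")) return Role.CUSTOMER;
        if (party.startsWith("C")) return Role.AIRPORT;
        if (party.startsWith("O")) return Role.AGENT;
        return Role.UNKNOWN;
    }

    // Returns null for lines with fewer than five fields or non-numeric values.
    static TicketRecord parse(String line) {
        String[] f = line.split(",");
        if (f.length < 5) return null;
        try {
            return new TicketRecord(f[0].trim(), f[1].trim(), f[2].trim(),
                    Integer.parseInt(f[3].trim()), new BigDecimal(f[4].trim()));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        TicketRecord r = parse("2018-01-01,C001,O002,3,1500.50");
        System.out.println(roleOf(r.seller) + " sold " + r.tickets
                + " tickets to " + roleOf(r.buyer));
    }
}
```

Returning `null` for malformed lines keeps the sketch short; inside a mapper the same check would simply skip the record.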

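2. Data cleaning requirements:

(1) Find the top 10 busiest airports (counting both purchases and sales);

(2) Find the most popular routes (a route and its reverse, i.e. the same two endpoints in either direction, count as one route);

(3) Find the top 10 largest agents;

(4) Find the top 10 airports by sales on a given day.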
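Requirement (2) treats a route and its reverse as the same route. One way to get that behavior is to build a canonical key by ordering the two airport codes before counting. The sketch below is plain Java rather than a MapReduce job, and it rests on assumptions not stated in the post: a row counts as a route only when both seller and buyer are airports (prefix C), and fields are comma-separated. The names `RouteCounter`, `routeKey`, and `countRoutes` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: count tickets per undirected airport pair.
final class RouteCounter {
    // Order the endpoints so that A->B and B->A map to the same key.
    static String routeKey(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "-" + b : b + "-" + a;
    }

    // Sum ticket counts per route; rows where either party is not an
    // airport, or where the count is not a number, are skipped.
    static Map<String, Integer> countRoutes(List<String> lines) {
        Map<String, Integer> totals = new HashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            if (f.length < 4) continue;
            String seller = f[1].trim();
            String buyer = f[2].trim();
            if (!seller.startsWith("C") || !buyer.startsWith("C")) continue;
            int tickets;
            try {
                tickets = Integer.parseInt(f[3].trim());
            } catch (NumberFormatException e) {
                continue;
            }
            totals.merge(routeKey(seller, buyer), tickets, Integer::sum);
        }
        return totals;
    }
}
```

In a MapReduce version, the string returned by `routeKey` would serve as the mapper's output key, and a summing reducer like the one used for the airport totals could do the aggregation.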

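3. Data visualization requirements:

(1) The four statistics above can be shown as pie charts, bar charts, and so on;

(2) A relationship graph can show how closely the airports are connected (using ticket counts as the basis for analysis).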

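Key experiment code (the code for the busiest-airport statistic is listed; the code for the other statistics is much the same):

Initial data cleaning: filter the records and total the ticket count for each airport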

package mapreduce;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class ChainMapReduce {
    private static final String INPUTPATH = "hdfs://localhost:9000/mapreducetest/region.txt";
    private static final String OUTPUTPATH = "hdfs://localhost:9000/mapreducetest/out1";

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // Remove a previous output directory so the job can be re-run.
            FileSystem fileSystem = FileSystem.get(new URI(OUTPUTPATH), conf);
            if (fileSystem.exists(new Path(OUTPUTPATH))) {
                fileSystem.delete(new Path(OUTPUTPATH), true);
            }
            Job job = Job.getInstance(conf, ChainMapReduce.class.getSimpleName());
            job.setJarByClass(ChainMapReduce.class);
            FileInputFormat.addInputPath(job, new Path(INPUTPATH));
            job.setInputFormatClass(TextInputFormat.class);
            ChainMapper.addMapper(job, FilterMapper1.class, LongWritable.class, Text.class, Text.class, IntWritable.class, conf);
            ChainReducer.setReducer(job, SumReducer.class, Text.class, IntWritable.class, Text.class, IntWritable.class, conf);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setPartitionerClass(HashPartitioner.class);
            job.setNumReduceTasks(1);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(OUTPUTPATH));
            job.setOutputFormatClass(TextOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Emits (airport, ticket count) for every airport that appears as seller or buyer.
    public static class FilterMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text outKey = new Text();
        private final IntWritable outValue = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] arr = value.toString().split(",");
            if (arr.length < 4) {
                return; // skip empty or malformed lines
            }
            int tickets;
            try {
                tickets = Integer.parseInt(arr[3].trim());
            } catch (NumberFormatException e) {
                return; // skip lines whose ticket count is not a number
            }
            outValue.set(tickets);
            // Columns 2 and 3 are seller and buyer; airport IDs start with "C".
            for (int i = 1; i <= 2; i++) {
                if (arr[i].startsWith("C")) {
                    outKey.set(arr[i]);
                    context.write(outKey, outValue);
                }
            }
        }
    }

    // Sums the ticket counts for each airport.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable outValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            outValue.set(sum);
            context.write(key, outValue);
        }
    }
}

Second cleaning pass: sorting

package mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OneSort {
    // Swaps each "airport<TAB>count" line into (count, airport) so that the
    // shuffle phase sorts the records by count in ascending order.
    public static class Map extends Mapper<Object, Text, IntWritable, Text> {
        private final Text goods = new Text();
        private final IntWritable num = new IntWritable();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] arr = value.toString().split("\t");
            if (arr.length < 2) {
                return; // skip malformed lines
            }
            num.set(Integer.parseInt(arr[1].trim()));
            goods.set(arr[0]);
            context.write(num, goods);
        }
    }

    // Writes the records back out unchanged; keys arrive already sorted.
    public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) {
                context.write(key, val);
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "OneSort");
        job.setJarByClass(OneSort.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path in = new Path("hdfs://localhost:9000/mapreducetest/out1/part-r-00000");
        Path out = new Path("hdfs://localhost:9000/mapreducetest/out2");
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
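OneSort sorts every record in ascending order of count, so the ten largest entries end up at the end of the output file. When only the top N matter, a bounded min-heap is a common alternative to a full sort. The sketch below is plain Java, not part of the original jobs; it assumes input lines in the "name<TAB>count" form written by the first job, and the names `TopN` and `topN` are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: keep only the n largest counts using a min-heap.
final class TopN {
    static List<Map.Entry<String, Integer>> topN(List<String> lines, int n) {
        // The heap root is always the smallest of the entries kept so far.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (String line : lines) {
            String[] f = line.split("\t");
            if (f.length < 2) continue; // skip malformed lines
            int count;
            try {
                count = Integer.parseInt(f[1].trim());
            } catch (NumberFormatException e) {
                continue;
            }
            heap.add(new AbstractMap.SimpleEntry<>(f[0], count));
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest entry
            }
        }
        // Return the survivors in descending order of count.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }
}
```

The heap never holds more than n + 1 entries, so memory use stays small regardless of how many airports appear in the input.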

Reading the result file from Hadoop (HDFS)

package mapreduce;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFile {
    // Reads a tab-separated HDFS file and returns its first two columns as a
    // flat list: [col0, col1, col0, col1, ...], in file order.
    public static List<String> ReadFromHDFS(String file) throws IOException {
        //System.setProperty("hadoop.home.dir", "H:\\檔案\\hadoop\\hadoop-2.6.4");
        List<String> list = new ArrayList<>();
        Configuration conf = new Configuration();
        BufferedReader bufferedReader = null;

        try {
            FileSystem fs = FileSystem.get(URI.create(file), conf);
            FSDataInputStream fsr = fs.open(new Path(file));
            bufferedReader = new BufferedReader(new InputStreamReader(fsr, StandardCharsets.UTF_8));
            String lineTxt;
            while ((lineTxt = bufferedReader.readLine()) != null) {
                String[] arg = lineTxt.split("\t");
                if (arg.length < 2) {
                    continue; // skip malformed lines
                }
                list.add(arg[0]);
                list.add(arg[1]);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bufferedReader != null) {
                try {
                    bufferedReader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return list;
    }

    public static void main(String[] args) throws IOException {
        List<String> ll = ReadFromHDFS("hdfs://localhost:9000/mapreducetest/out2/part-r-00000");
        for (String item : ll) {
            System.out.println(item);
        }
    }
}

Front-end web page code

<%@page import="mapreduce.ReadFile"%>
<%@page import="java.util.List"%>
<%@ page language="java" contentType="text/html; charset=UTF-8"
    pageEncoding="UTF-8"%>
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Top 10 Busiest Airports</title>
<%-- ll is a flat list [count, name, count, name, ...] in ascending order of count --%>
<% List<String> ll = ReadFile.ReadFromHDFS("hdfs://localhost:9000/mapreducetest/out2/part-r-00000"); %>
 <script src="../js/echarts.js"></script>
</head>
<body>
<div id="main" style="width: 900px;height:400px;"></div>
 <script type="text/javascript">
        // Initialize an ECharts instance on the prepared DOM element
        var myChart = echarts.init(document.getElementById('main'));

        // Chart options and data
        var option = {
            title: {
                text: 'Top 10 Busiest Airports'
            },
            tooltip: {},
            legend: {
                data:['Tickets']
            },
            xAxis: {
                // Airport names sit at odd indices; walk back from the largest count
                data:["<%=ll.get(ll.size()-1)%>"<%for(int i=ll.size()-3;i>=0&&i>=ll.size()-19;i-=2){
                        %>,"<%=ll.get(i)%>"
                    <%
                    }
                    %>]
            },
            yAxis: {},
            series: [{
                name: 'Tickets',
                type: 'bar',
                // Counts sit at even indices, one position before each name
                data: [<%=ll.get(ll.size()-2)%>
                <%for(int i=ll.size()-4;i>=0&&i>=ll.size()-20;i-=2){
                    %>,<%=ll.get(i)%>
                <%
                }
                %>]
            }]
        };

        // Render the chart with the options and data defined above
        myChart.setOption(option);
    </script>
    <h2 style="color: red;"><a href="NewFile.jsp">Back</a></h2>
</body>
</html>
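The index arithmetic in the scriptlets above depends on the flat list layout returned by `ReadFile.ReadFromHDFS`: counts at even positions, names at odd positions, ascending by count. The same selection can be written as a small Java helper that is easier to test than inline scriptlets. This is only a sketch under that layout assumption; the names `ChartData`, `topFromFlatList`, `labelsAsJs`, and `valuesAsJs` are hypothetical and do not appear in the original project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turn the flat [count, name, ...] list into chart arrays.
final class ChartData {
    final List<String> labels = new ArrayList<>();
    final List<Integer> values = new ArrayList<>();

    // flat = [count0, name0, count1, name1, ...], ascending by count.
    // Walks backwards so the largest count comes first; stops after n pairs.
    static ChartData topFromFlatList(List<String> flat, int n) {
        ChartData d = new ChartData();
        int pairs = flat.size() / 2;
        for (int p = pairs - 1; p >= 0 && d.labels.size() < n; p--) {
            d.values.add(Integer.parseInt(flat.get(2 * p)));
            d.labels.add(flat.get(2 * p + 1));
        }
        return d;
    }

    // Render the labels as a JavaScript array literal, escaping
    // backslashes and double quotes.
    String labelsAsJs() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < labels.size(); i++) {
            if (i > 0) sb.append(',');
            String escaped = labels.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append(']').toString();
    }

    // Render the counts as a JavaScript array literal, e.g. [9, 5].
    String valuesAsJs() {
        return values.toString();
    }
}
```

With such a helper, the page would need only two expressions, one for `xAxis.data` and one for `series[0].data`, and it would also handle result files with fewer than ten airports.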

Result screenshot:
