基於HBase的MapReduce實現大量郵件信息統計分析

阿新 • • 發佈：2018-02-05

inittab 寫入 img implement system return dea 比較 tco

一：概述

在大多數情況下，如果使用MapReduce進行batch處理，文件一般是存儲在HDFS上的，但這裏有個很重要的場景不能忽視，那就是對於大量的小文件的處理（此處小文件沒有確切的定義，一般指文件大小比較小，比如5M以內的文件），而HDFS的文件塊一般是64M，這將會影響到HDFS的性能，因為小文件過多，那麽NameNode需要保存的文件元信息將占用更多的空間，加大NameNode的負載進而影響性能，假如對於每個文件，在NameNode中保存的元數據大小是100字節，那麽1千萬這樣的小文件，將占用10億byte，約1G的內存空間，目前有以下幾種對於眾多小文件的處理方法：

HAR File方式，將小文件合並成大文件

SequenceFile方式，以文件名為key，文件內容為value，生成一個序列文件

以HBase作為小文件的存儲，rowkey使用文件名，列族單元保存文件內容，文件後綴名等信息

在本文中的案例，就是采用第三種方法

二：實現

1：郵件格式如下，為了簡單起見及安全性，這裏作了簡化，每一封這樣的郵件，大小將近15k左右

技術分享圖片

2：HBase表

需要創建2張HBase表格，一張保存郵件文件，另外一張保存MapReduce的輸出結果，在hbase shell中分別創建：

create ‘email‘, {NAME=>‘f1‘, VERSIONS=>2}
create ‘summary‘, {NAME=>‘f1‘, VERSIONS=>2}

3：郵件文件導入到HBase中

請參考上篇文章將文件以API方式導入到HBase（小文件處理），此處，假設是把所有的每天已經生產的郵件文件，從本地導入到HBase中，另外一種方案是創建一個獨立的RESTful API，供第三方程序調用，將郵件信息寫入到HBase中

4：MapReduce

在IDEA中創建Maven工程

技術分享圖片

修改pom.xml文件，添加依賴hbase-client及hbase-server

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" 

         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>mronhbase</groupId>
    <artifactId>mronhbase</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.6</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.6</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <executions>
                    <!-- Run shade goal on package phase -->
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <!-- Do not copy the signatures in the META-INF folder.
                                    Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>

                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>jiecxy.App</mainClass>
                                </transformer>
                            </transformers>

                            <createDependencyReducedPom>false</createDependencyReducedPom>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

        </plugins>
    </build>
</project>

創建java類，先引入必要的包：

package examples;

import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.mapreduce.Job;

類文件hbasemr，其中包含主程序入口

public class hbasemr {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException
    {
        String hbaseTableName1 = "email";
        String hbaseTableName2 = "summary";

        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setJarByClass(hbasemr.class);
        job.setJobName("mronhbase");

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);

        TableMapReduceUtil.initTableMapperJob(hbaseTableName1, scan, MyMapper.class, Text.class, IntWritable.class, job);
        TableMapReduceUtil.initTableReducerJob(hbaseTableName2, MyReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 1 : 0);
    }
    
    public static String getSubString(String value,String rgex){
        Pattern pattern = Pattern.compile(rgex);
        Matcher m = pattern.matcher(value);
        while(m.find()){
            return m.group(1);
        }
        return "";
    }
}

添加Mapper類及Reducer類：

Hbases實現了TableMapper類及TableReducer類，

創建MyMapper

   public static class MyMapper extends TableMapper<Text, IntWritable>{
        private final static IntWritable one = new IntWritable(1);

        public void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException,InterruptedException
        {
            String rowValue = Bytes.toString(value.getValue("f1".getBytes(),"message".getBytes()));
            if(rowValue !=null) {

                String rgex = "Project Name:(.*?)\\r\\n";
                String temp = hbasemr.getSubString(rowValue,rgex);
                String username = temp.substring(0, temp.indexOf(‘_‘));

                rgex = "Accounts:(.*?)\\r\\n";
                String count = hbasemr.getSubString(rowValue,rgex).trim();

                IntWritable intCount = new IntWritable(Integer.parseInt(count));
                context.write(new Text(username), intCount);
            }
        }
    }

創建MyReducer：

    public static class MyReducer extends TableReducer<Text,IntWritable, NullWritable>
    {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            Iterator<IntWritable> item = values.iterator();
            while (item.hasNext()) {
                sum += item.next().get();
            }
            this.result.set(sum);

            Put put = new Put(key.toString().getBytes());
            put.addColumn("f1".getBytes(), "count".getBytes(), String.valueOf(sum).getBytes());
            context.write(NullWritable.get(), put);
        }
    }

5：運行與調試

啟動Hadoop及HBase，在IDEA 中對MyMapper類及MyReducer類設置好斷點，以調試方式運行程序，運行完後，進入到hbase shell查看運行結果

技術分享圖片

基於HBase的MapReduce實現大量郵件信息統計分析

inittab 寫入 img implement system return dea 比較 tco 一：概述在大多數情況下，如果使用MapReduce進行batch處理，文件一般是存儲在HDFS上的，但這裏有個很重要的場景不能忽視，那就是對於大量的小文件的處理（此處小文件

基於HBase的MapReduce實現大量郵件信息統計分析

基於HBase的MapReduce實現大量郵件信息統計分析

2018-08-05 期 MapReduce實現每個單詞在每個文件中坐標信息統計

Spring（五）：Spring&Struts2&Hibernate整合後，實現查詢Employee信息

RecyclerView實現分組展示信息

oracle仿全文檢索切詞機制實現文本信息類似度查找

匯編實現獲取CPU信息

C#基於LibUsbDotNet實現USB通信（一）

信息統計DashBoard踩坑總結

鏈表實現學生健康信息管理系統（簡略版）

制造業ERP系統如何實現生產車間信息化管理？

基於JSP的農村公共信息化發布系統設計_愛學術——免費下載

nginx open files limits 導致大量錯誤信息

一步一步實現web程序信息管理系統之二----後臺框架實現跳轉登陸頁面

通過泛型類實現輸出學生信息

C# 163郵箱發送附件以及郵件信息

信息統計和教師端考試部分API——張星

iOS 關於圖片地理位置隱私信息的分析和讀取

HTTP頭部信息解釋分析(詳細整理)

JVM源碼系列：ThreadMXBean 打出堆棧信息原理分析

基於.NET實現資料探勘--聚類分析演算法

基於HBase的MapReduce實現大量郵件信息統計分析

相關推薦