Custom OutputFormat: writing output files to different directories

阿新 • Published: 2018-11-24

Code repository:
https://gitee.com/tanghongping/hadoopMapReduce/tree/master/src/com/thp/bigdata/myInputFormat

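Requirement:

Some raw logs need to be enhanced and parsed. The process:
1. Read data from the raw log files
2. Use a URL field in each log line to look up information in an external knowledge base and append it to the raw log line
3. If enhancement succeeds, write the line to the enhanced-results directory; if it fails, extract the URL field from the raw data and write it to the to-crawl list directory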

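Analysis:

The key point is that a single MapReduce program must write different kinds of results to different directories depending on the data. This kind of flexible output requirement can be met with a custom OutputFormat.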
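
The per-line decision can be sketched in plain Java, independent of Hadoop. This is only a minimal illustration under stated assumptions: the class `EnhanceSketch` and the method `enhance` are hypothetical names, while the URL index 28 and the `tocrawl` marker are taken from the mapper code later in this post.

```java
import java.util.Map;

// Hypothetical sketch (not the post's code): decide per line whether it is
// enhanced or sent to the to-crawl list.
class EnhanceSketch {
    static final int URL_INDEX = 28; // same index the mapper in this post uses

    // Returns the line to emit, or null if the input line is malformed.
    static String enhance(String line, Map<String, String> knowledgeBase) {
        String[] fields = line.split("\t");
        if (fields.length <= URL_INDEX) {
            return null; // too few fields: treat as malformed
        }
        String url = fields[URL_INDEX];
        String content = knowledgeBase.get(url);
        if (content == null) {
            // Not in the knowledge base: emit the URL tagged for crawling
            return url + "\t" + "tocrawl" + "\n";
        }
        // Found: emit the original line with the content appended
        return line + "\t" + content + "\n";
    }
}
```

Checking the array length up front is one alternative to catching the exception raised by an out-of-range index; either way the malformed line is skipped.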

DBLoader: connects to the database and caches the dictionary data from it

package com.thp.bigdata.logEnhance;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

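/**
 * Database connection.
 * The interfaces in the java.sql package are a specification that Sun defined to simplify
 * and unify database access from Java. Each database vendor (MySQL, Oracle, and so on)
 * provides its own implementation; the classes in com.mysql.jdbc are MySQL's implementation
 * of those interfaces. To keep a single code base that works across databases,
 * traditional JDBC code imports only from java.sql and does not depend on the concrete implementation classes.
 * @author 湯小萌
 *
 */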
public class DBLoader {
	
	/**
	 * Loads the content for every url from the database into a HashMap that serves as a cache
	 */
	public static void dbLoader(Map<String, String> urlContentMap) {
		Connection con = null;
		Statement st = null;
		ResultSet rs = null;
		try {
			Class.forName("com.mysql.jdbc.Driver");
			con = DriverManager.getConnection("jdbc:mysql://localhost:3306/urldb", "root", "root");
			st = con.createStatement();
			rs = st.executeQuery("select url, content from url_rule");
			while(rs.next()) {
				urlContentMap.put(rs.getString(1), rs.getString(2));
			}
		} catch (ClassNotFoundException e) {
			e.printStackTrace();
		} catch (SQLException e) {
			e.printStackTrace();
		} finally {
			if(rs != null) {
				try {
					rs.close();
				} catch (SQLException e) {
					e.printStackTrace();
				} finally {
					rs = null;
				}
			}
			if(st != null) {
				try {
					st.close();
				} catch (SQLException e) {
					e.printStackTrace();
				} finally {
					st = null;
				}
			}
			if(con != null) {
				try {
					con.close();
				} catch (SQLException e) {
					e.printStackTrace();
				} finally {
					con = null;
				}
			}
		}
	}
	
	// Checks that the database connection works
	public static void main(String[] args) {
		HashMap<String, String> urlContentMap = new HashMap<String, String>();
		dbLoader(urlContentMap);
		// Set<Entry<String, String>> entrySet = urlContentMap.entrySet();
		for(Entry<String, String> entrySet : urlContentMap.entrySet()) {
			System.out.println(entrySet.getKey() + " : " + entrySet.getValue());
		}
	}
}
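
The nested `finally` blocks in `DBLoader` can usually be replaced with try-with-resources (Java 7 and later), which closes resources automatically in reverse order of declaration. The sketch below demonstrates that ordering with a hypothetical `Tracked` resource standing in for `Connection`, `Statement`, and `ResultSet`, so it runs without a database; `Tracked` and `CloseOrderSketch` are illustrative names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    // Hypothetical stand-in for a JDBC resource; records its name when closed.
    static class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add(name);
        }
    }

    // Opens three resources and returns the order in which they were closed.
    static List<String> run() {
        List<String> closed = new ArrayList<>();
        try (Tracked con = new Tracked("con", closed);
             Tracked st = new Tracked("st", closed);
             Tracked rs = new Tracked("rs", closed)) {
            // The result set would be read here.
        }
        return closed;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With real JDBC types the same shape applies: declare the `Connection`, the `Statement`, and the `ResultSet` in the resource list of a single `try`, and keep one `catch (SQLException e)` for error handling.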

Custom OutputFormat: writes to different directories depending on the type of data:

package com.thp.bigdata.logEnhance;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Custom OutputFormat for log enhancement.
 * Writes records to different output directories depending on the type of data.
 * @author 湯小萌
 *
 */
public class LogEnhanceOutputFormat extends FileOutputFormat<Text, NullWritable> {

	@Override
	public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
		FileSystem fs = FileSystem.get(job.getConfiguration());
		Path enhancePath = new Path("f:/enhancelog/output_en/log.txt");
		Path tocrawlPath = new Path("f:/enhancelog/output_crw/url.txt");
		FSDataOutputStream  enhanceOS = fs.create(enhancePath);
		FSDataOutputStream  tocrawlOS = fs.create(tocrawlPath);
		
		return new EnhanceRecordWriter(enhanceOS, tocrawlOS);
	}
	
	
	/**
	 * This RecordWriter is the class that actually writes the files.
	 * Its constructor takes the output streams, so getRecordWriter() has to create both streams first.
	 * @author 湯小萌
	 *
	 */
	static class EnhanceRecordWriter extends RecordWriter<Text, NullWritable> {
		FSDataOutputStream enhanceOS = null;
		FSDataOutputStream tocrawlOS = null;
		public EnhanceRecordWriter(FSDataOutputStream enhanceOS, FSDataOutputStream tocrawlOS) {
			super();
			this.enhanceOS = enhanceOS;
			this.tocrawlOS = tocrawlOS;
		}
		
		// Logic for writing records out to the files
		@Override
		public void write(Text key, NullWritable value) throws IOException, InterruptedException {
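			String dataLine = key.toString();
			// Text already holds UTF-8 bytes, so write them directly rather than
			// re-encoding with the platform default charset via String.getBytes()
			if(dataLine.contains("tocrawl")) {
				// Data containing "tocrawl" could not be enhanced, so it goes to the to-crawl list file
				tocrawlOS.write(key.getBytes(), 0, key.getLength());
			} else {
				// Data without "tocrawl" is an enhanced log line, so it goes to the enhanced log file
				enhanceOS.write(key.getBytes(), 0, key.getLength());
			}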
		}

		@Override
		public void close(TaskAttemptContext context) throws IOException, InterruptedException {
			if(enhanceOS != null) {
				enhanceOS.close();
			}
			if(tocrawlOS != null) {
				tocrawlOS.close();
			}
		}
		
	}

}
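
The routing inside `write()` can be tried out without a Hadoop cluster. The sketch below mirrors it with plain `java.io` streams; `TwoWayWriterSketch` and `demo` are hypothetical names, and the in-memory streams stand in for the two `FSDataOutputStream` objects. It assumes UTF-8 output and closes the second stream even if closing the first one fails.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for EnhanceRecordWriter, using plain streams.
class TwoWayWriterSketch {
    private final OutputStream enhanceOut;
    private final OutputStream tocrawlOut;

    TwoWayWriterSketch(OutputStream enhanceOut, OutputStream tocrawlOut) {
        this.enhanceOut = enhanceOut;
        this.tocrawlOut = tocrawlOut;
    }

    // Routes one line by the same "tocrawl" substring check the post uses.
    void write(String dataLine) throws IOException {
        byte[] bytes = dataLine.getBytes(StandardCharsets.UTF_8);
        if (dataLine.contains("tocrawl")) {
            tocrawlOut.write(bytes);
        } else {
            enhanceOut.write(bytes);
        }
    }

    // Closes both streams; the finally block runs even if the first close throws.
    void close() throws IOException {
        try {
            enhanceOut.close();
        } finally {
            tocrawlOut.close();
        }
    }

    // Writes two sample lines and returns {enhanced output, to-crawl output}.
    static String[] demo() {
        ByteArrayOutputStream en = new ByteArrayOutputStream();
        ByteArrayOutputStream cr = new ByteArrayOutputStream();
        TwoWayWriterSketch writer = new TwoWayWriterSketch(en, cr);
        try {
            writer.write("http://a.example/x\ttocrawl\n");
            writer.write("raw log line\tnews\n");
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] {
            new String(en.toByteArray(), StandardCharsets.UTF_8),
            new String(cr.toByteArray(), StandardCharsets.UTF_8)
        };
    }
}
```

Because the check is a substring match, a raw log line that happens to contain the text `tocrawl` would also land in the to-crawl file; a stricter marker, such as a check on the final column, would avoid that.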

MapReduce job:

package com.thp.bigdata.logEnhance;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.junit.Test;

/**
 * Log enhancement:
 * writes to different files
 * @author 湯小萌
 *
 */
public class LogEnhance {
	static class LogEnhanceMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
		Map<String, String> urlContentMap = new HashMap<String, String>();
		Text k = new Text();
		NullWritable v = NullWritable.get();
		
		/**
		 * Loads the data from the database into a HashMap that serves as a cache
		 */
		@Override
		protected void setup(Context context)
				throws IOException, InterruptedException {
			DBLoader.dbLoader(urlContentMap);
		}
		
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// Get a counter (counters are global across the job) that records malformed log lines; the arguments are the group name and the counter name
			Counter counter = context.getCounter("malFormed", "malFormedCounter");
			String line = value.toString();
			String[] fields = line.split("\t");
			try {
				String url = fields[28];
				System.out.println(url);
				String content_tag = urlContentMap.get(url);
				// System.out.println(content_tag);
				if(content_tag == null) { // no content found in the knowledge base for this url
					k.set(url + "\t" + "tocrawl" + "\n");  // the custom output stream writes raw bytes and adds no line break, so append "\n" here
					context.write(k, v);
				} else {
					k.set(line + "\t" + content_tag + "\n");
					System.out.println(k.toString());
					context.write(k, v);
				}
			} catch (Exception e) {
				// Some records may be malformed: too few fields or incomplete data
				e.printStackTrace();
				System.err.println("Malformed record");
				counter.increment(1);
			}
		}
	}
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		
		
		job.setJarByClass(LogEnhance.class);
		job.setMapperClass(LogEnhanceMapper.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(NullWritable.class);
		
		// To route different content to different target paths, use the custom OutputFormat
		job.setOutputFormatClass(LogEnhanceOutputFormat.class);
		
		FileInputFormat.setInputPaths(job, new Path("f:/enhancelog/input"));
		
		// Even though the custom OutputFormat already sets its own output paths,
		// FileOutputFormat still has to write a _SUCCESS file, so an output path must be set as well
		FileOutputFormat.setOutputPath(job, new Path("f:/enhancelog/output"));
		
		// This job only cleans the logs, so no reduce task is needed
		job.setNumReduceTasks(0);
		
		System.exit(job.waitForCompletion(true) ? 0 : 1);
		
	}
	
	
	
	@Test
	public void test() {
		// String str = "1374609560.11	1374609560.16	1374609560.16	1374609560.16	110	5	8615038208365	460023383869133	8696420056841778	2	460	0	14615			54941	10.188.77.252	61.145.116.27	35020	80	6	cmnet	1	221.177.218.34	221.177.217.161	221.177.218.34	221.177.217.167	ad.veegao.com	http://ad.veegao.com/veegao/iris.action		Apache-HttpClient/UNAVAILABLE (java 1.4)	POST	200	593	310	4	3	0	0	4	3	0	0	0	0	http://ad.veegao.com/veegao/iris.action	5903903079251243019	5903903103500771339	5980728";
		String str = "1374609557.12	1374609557.15	1374609557.15	1374609557.74	110	5	8615093268715	460023934411519	3588660433773101	2	460	0	14822			29343	10.188.77.164	223.203.194.156	42384	80	6	cmnet	1	221.177.218.41	221.177.217.161	221.177.218.41	221.177.217.167	ugc.moji001.com	http://ugc.moji001.com/sns/GetNewestShare/100/489?UserID=42958568&Platform=Android&Version=10023802&BaseOSVer=10&PartnerKey=5007&Model=GT-S7500&Device=phone&VersionType=1&TS=		Apache-HttpClient/UNAVAILABLE (java 1.4)	GET	200	421	363	3	2	0	0	3	2	0	0	0	0	http://ugc.moji001.com/sns/GetNewestShare/100/489?UserID=42958568&Platform=Android&Version=10023802&BaseOSVer=10&PartnerKey=5007&Model=GT-S7500&Device=phone&VersionType=1&TS=	5903903047315243019	5903903087191863307	5980488";
		
		String[] fields = str.split("\t");
		int count = 1;
		for(String field : fields) {
			System.out.println(count + " >   " + field);
			count++;
		}
	}
	
	
}
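
Two details are worth checking when indexing into `fields`. First, the `test()` method prints 1-based positions, so the URL that the mapper reads as `fields[28]` appears at position 29 in its output. Second, `String.split("\t")` discards trailing empty strings, so a line whose last columns are empty yields a shorter array than its column count suggests; passing a negative limit keeps them. A small sketch of the second point follows (the class name `SplitSketch` and the sample string are illustrative):

```java
class SplitSketch {
    // Number of fields when trailing empty columns are dropped (default behavior).
    static int defaultSplitLength(String line) {
        return line.split("\t").length;
    }

    // Number of fields when trailing empty columns are kept (negative limit).
    static int keepTrailingLength(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) {
        String line = "a\t\tb\t\t"; // five columns, the last two empty
        System.out.println(defaultSplitLength(line)); // prints 3
        System.out.println(keepTrailingLength(line)); // prints 5
    }
}
```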

Log data:

https://pan.baidu.com/s/1xzlfGQ8R67bsDsTqTsOcrQ

SQL file with the dictionary data for the database:
https://pan.baidu.com/s/1SrExNEebLBVZqtr-MasjLA

https://pan.baidu.com/s/19YYg39yteIa5VMcwaGSMfg
