Fixing garbled text when reading HDFS file data in Java

阿新 • Published: 2020-11-17

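A pitfall: garbled text when reading HDFS files with the Java API

I wanted to write an endpoint that reads part of a file on HDFS for preview. After implementing it by following blog posts found online, I noticed that the data sometimes came back garbled. For example, when reading a CSV file whose strings are separated by commas:

An English string such as aaa displays correctly
A Chinese string such as “你好” displays correctly
A mixed Chinese and English string such as “aaa你好” comes out garbled

I went through many blog posts, and their answer was almost always the same: decode with charset xxx. Skeptical, I tried them one after another, and sure enough none of them worked.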

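Approach

HDFS stores files as raw bytes and does not record which charset produced them, and local files may well each use a different encoding. Uploading a local file simply stores its bytes, in whatever encoding the file already had, in the file system. So when reading file data back, faced with byte streams from different files in different charsets, no single fixed charset can be expected to decode all of them correctly.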
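The mismatch described above can be reproduced with the JDK alone. The following is a minimal sketch, not code from the original project: the class name `CharsetMismatchDemo` and the method `roundTrip` are made up for illustration, and ISO-8859-1 simply stands in for "a wrong charset". It also shows why ASCII-only content tends to survive a wrong guess while Chinese characters do not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: names in this class are assumptions, not part of the article's code.
final class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the stored bytes with another,
    // which is what happens when a reader guesses a file's encoding.
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        byte[] stored = text.getBytes(writtenAs); // the bytes the file system keeps
        return new String(stored, readAs);        // what the reader ends up seeing
    }

    public static void main(String[] args) {
        // "aaa" followed by the two Chinese characters for "hello",
        // written with Unicode escapes so the source file encoding does not matter.
        String original = "aaa\u4f60\u597d";

        // Matching charsets: the text comes back unchanged.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8));

        // Mismatched charsets: the ASCII prefix survives, the Chinese part is garbled.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Because the ASCII range is encoded identically in most common charsets, a wrong guess can go unnoticed until a file contains non-ASCII characters.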

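That leaves two possible solutions:

Fix the charset used for encoding and decoding on HDFS. For example, pick UTF-8 and transcode every file at upload time, that is, convert each file's bytes to UTF-8 before storing them. Decoding with UTF-8 when reading the data back is then safe. However, the transcoding step brings many problems of its own and is hard to implement.
Decode dynamically. Choose the charset for decoding according to the charset the file was encoded with. This leaves the file's original bytes untouched and largely avoids garbled output.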
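For comparison, the first option (normalizing to UTF-8 at upload time) could be sketched as below, using only the JDK. This is an illustration under a strong assumption: the source charset of each file is already known when it is uploaded, which is exactly the part that is hard in practice. The class name `Utf8Transcoder` and its method are hypothetical; in a real upload path the `OutputStream` would presumably be the one returned by Hadoop's `FileSystem.create(Path)`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: this helper is an assumption, not part of the article's code.
final class Utf8Transcoder {

    // Decodes `in` with the (assumed known) source charset and re-encodes
    // the characters as UTF-8 into `out`. The caller owns and closes both streams.
    static void transcodeToUtf8(InputStream in, Charset sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // push any buffered characters through to `out`
    }
}
```

Note that this rewrites the stored bytes, so the file on HDFS is no longer byte-identical to the local original. Dynamic decoding avoids that trade-off.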

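Having chosen dynamic decoding, the hard part is deciding which charset to decode with. The material below gave me the solution.

Detecting the encoding of text (a byte stream) in Java

Requirement:

Detect the encoding of a given file or byte stream.

Implementation:

Based on jchardet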

<dependency>
	<groupId>net.sourceforge.jchardet</groupId>
	<artifactId>jchardet</artifactId>
	<version>1.0</version>
</dependency>

The code is as follows:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public class DetectorUtils {
	private DetectorUtils() {
	}

	// Receives the callback that jchardet fires when it settles on a charset.
	static class ChineseCharsetDetectionObserver implements
			nsICharsetDetectionObserver {
		private boolean found;
		private String result;

		public ChineseCharsetDetectionObserver(boolean found, String result) {
			this.found = found;
			this.result = result;
		}

		@Override
		public void Notify(String charset) {
			found = true;
			result = charset;
		}

		public boolean isFound() {
			return found;
		}

		public String getResult() {
			return result;
		}
	}

	// Returns the detected charset name, or the list of probable charsets
	// when detection is inconclusive. The stream is consumed and closed.
	public static String[] detectChineseCharset(InputStream in)
			throws IOException {
		nsDetector det = new nsDetector(nsPSMDetector.CHINESE);
		ChineseCharsetDetectionObserver detectionObserver = new ChineseCharsetDetectionObserver(
				false, StandardCharsets.UTF_8.name());
		det.Init(detectionObserver);

		// Closing the buffered stream also closes the underlying stream.
		try (BufferedInputStream imp = new BufferedInputStream(in)) {
			byte[] buf = new byte[1024];
			int len;
			boolean isAscii = true;
			while ((len = imp.read(buf, 0, buf.length)) != -1) {
				// Keep checking for pure ASCII until a non-ASCII byte shows up.
				if (isAscii)
					isAscii = det.isAscii(buf, len);
				if (!isAscii) {
					// Stop early once DoIt reports that detection is done.
					if (det.DoIt(buf, len, false))
						break;
				}
			}

			det.DataEnd();
			if (isAscii) {
				return new String[] { "ASCII" };
			} else if (detectionObserver.isFound()) {
				return new String[] { detectionObserver.getResult() };
			} else {
				return det.getProbableCharsets();
			}
		}
	}
}

Test:

		String file = "C:/3737001.xml";
		String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
		for (String charset : probableSet) {
			System.out.println(charset);
		}
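The detector returns charset names as plain strings, and nothing guarantees that every returned name is one the running JVM recognizes (an inconclusive result may not be a real charset name at all). A small JDK-only helper can turn such a candidate list into a usable `Charset` with a fallback. This is a sketch; `CharsetResolver` and `firstSupported` are hypothetical names, not part of jchardet or of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Illustrative only: this helper is an assumption, not part of the article's code.
final class CharsetResolver {

    // Returns the first candidate name that the JVM supports, or `fallback`
    // when none of the candidates can be used.
    static Charset firstSupported(String[] candidates, Charset fallback) {
        if (candidates == null) {
            return fallback;
        }
        for (String name : candidates) {
            if (name == null || name.isEmpty()) {
                continue;
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException e) {
                // The string is not even a syntactically legal charset name; try the next one.
            }
        }
        return fallback;
    }
}
```

With this in place, the result of `DetectorUtils.detectChineseCharset(...)` could be passed straight to `firstSupported(probableSet, StandardCharsets.UTF_8)` before decoding.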

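There are ready-made packages for detecting the encoding of a byte stream; the code below uses `UniversalDetector` from juniversalchardet, a library that was hosted on Google Code. With that, the plan is clear: read some of the file's bytes, detect the encoding with the tool, and then decode with the matching charset.

Concrete solution code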
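Statistical detection can guess wrong, especially on short inputs, so one optional cross-check is to decode strictly and see whether the bytes are even valid in the guessed charset. This is a swapped-in JDK technique, not something the original solution does: `new String(bytes, charset)` silently replaces invalid sequences, whereas a `CharsetDecoder` configured with `CodingErrorAction.REPORT` fails on them. The class name `StrictDecoder` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Optional;

// Illustrative only: this helper is an assumption, not part of the article's code.
final class StrictDecoder {

    // Decodes the bytes with the given charset, returning empty instead of
    // garbled text when the bytes are not valid in that charset.
    static Optional<String> tryDecode(byte[] bytes, Charset charset) {
        try {
            String text = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            return Optional.of(text);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }
}
```

One caveat: if only a prefix of the file was read, the last multi-byte character may be cut in half, and strict decoding would reject the prefix even though the charset is right. Permissive single-byte charsets such as ISO-8859-1 accept any input, so this check is only informative for stricter encodings such as UTF-8.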

pom

<dependency>
	<groupId>com.googlecode.juniversalchardet</groupId>
	<artifactId>juniversalchardet</artifactId>
	<version>1.0.3</version>
</dependency>

Logic for reading part of a file from HDFS for preview

 // Fetch part of a file's data for preview
 public List<String> getFileDataWithLimitLines(String filePath, Integer limit) {
  FSDataInputStream fileStream = openFile(filePath);
  return readFileWithLimit(fileStream, limit);
 }

 // Open the file's data stream; returns null (after logging) if opening fails
 private FSDataInputStream openFile(String filePath) {
  FSDataInputStream fileStream = null;
  try {
   fileStream = fs.open(new Path(getHdfsPath(filePath)));
  } catch (IOException e) {
   logger.error("fail to open file:{}", filePath, e);
  }
  return fileStream;
 }

 // Read at most `limit` non-empty lines of file data
 private List<String> readFileWithLimit(FSDataInputStream fileStream, Integer limit) {
  byte[] bytes = readByteStream(fileStream);
  String data = decodeByteStream(bytes);
  if (data == null) {
   return null;
  }

  // Split on both CRLF and LF line endings
  List<String> rows = Arrays.asList(data.split("\\r?\\n"));
  return rows.stream().filter(StringUtils::isNotEmpty)
    .limit(limit)
    .collect(Collectors.toList());
 }

 // Read the bytes from the file stream and close it afterwards.
 // Note: this reads the whole file into memory.
 private byte[] readByteStream(FSDataInputStream fileStream) {
  if (fileStream == null) {
   return null;
  }
  byte[] bytes = new byte[1024*30];
  int len;
  ByteArrayOutputStream stream = new ByteArrayOutputStream();
  try (FSDataInputStream in = fileStream) {
   while ((len = in.read(bytes)) != -1) {
    stream.write(bytes, 0, len);
   }
  } catch (IOException e) {
   logger.error("read file bytes stream failed.", e);
   return null;
  }
  return stream.toByteArray();
 }

 // Decode the bytes with the detected charset
 private String decodeByteStream(byte[] bytes) {
  if (bytes == null) {
   return null;
  }

  String encoding = guessEncoding(bytes);
  String data = null;
  try {
   data = new String(bytes, encoding);
  } catch (Exception e) {
   logger.error("decode byte stream failed.", e);
  }
  return data;
 }

 // Guess the encoding with juniversalchardet
 // (org.mozilla.universalchardet.UniversalDetector)
 private String guessEncoding(byte[] bytes) {
  UniversalDetector detector = new UniversalDetector(null);
  detector.handleData(bytes, 0, bytes.length);
  detector.dataEnd();
  String encoding = detector.getDetectedCharset();
  detector.reset();

  // Fall back to UTF-8 when nothing was detected
  if (StringUtils.isEmpty(encoding)) {
   encoding = "UTF-8";
  }
  return encoding;
 }
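Since the goal is only a preview, reading the entire file into memory can be avoided by capping the number of bytes read. Below is a JDK-only sketch of that idea; `PreviewReader`, `readPrefix` and `firstLines` are hypothetical names, and the byte cap is a parameter the caller has to choose. `FSDataInputStream` is an `InputStream`, so it could be passed to `readPrefix` directly.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: this helper is an assumption, not part of the article's code.
final class PreviewReader {

    // Reads at most maxBytes from the stream; the caller closes the stream.
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of file reached before the cap
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    // Decodes the bytes and returns up to `limit` non-empty lines,
    // accepting CRLF, LF and CR line endings.
    static List<String> firstLines(byte[] bytes, Charset charset, int limit) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            String line;
            while (rows.size() < limit && (line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    rows.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        return rows;
    }
}
```

The trade-off is that a byte cap can cut the final line, or even the final multi-byte character, in half, so the last row of a capped preview may be incomplete.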
