Three Ways to Read Large Files in Batches in Java

阿新 • Published: 2021-11-01

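1. The difficulty of reading large files in Java
The usual way to read a file in Java is to load all of its data into memory and then operate on that data. For example: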

Path path = Paths.get("file path");
byte[] data = Files.readAllBytes(path);

This is fine for small files, but for files beyond roughly 2 GB it throws an error:

Exception in thread "main" java.lang.OutOfMemoryError: Required array size too large
at java.nio.file.Files.readAllBytes(Files.java:3156)

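The error location shows that Files.readAllBytes supports files of at most Integer.MAX_VALUE - 8 bytes, that is, about 2 GB. Once a file exceeds this limit, this built-in Java method can no longer be used directly.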
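The limit can be checked up front instead of waiting for the error. The following is only a minimal sketch and is not part of the original article: the class name `ReadAllBytesGuard`, the constant `MAX_ARRAY_SIZE` and both method names are hypothetical, and the constant simply mirrors the `Integer.MAX_VALUE - 8` bound mentioned above. A file under the limit can still fail to load if the JVM heap is too small.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReadAllBytesGuard {
    // Assumption: mirrors the Integer.MAX_VALUE - 8 bound discussed in the text.
    static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    /** Returns true when a file of this size could fit into a single byte array. */
    static boolean fitsInOneArray(long fileSize) {
        return fileSize >= 0 && fileSize <= MAX_ARRAY_SIZE;
    }

    /** Reads the whole file only when it is small enough; otherwise fails with a clear message. */
    static byte[] readAllOrFail(Path path) throws IOException {
        long size = Files.size(path);
        if (!fitsInOneArray(size)) {
            throw new IOException("File too large to read at once (" + size
                    + " bytes); read it in batches instead");
        }
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) throws IOException {
        // Demo on a small temporary file.
        Path tmp = Files.createTempFile("guard", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        System.out.println("bytes read: " + readAllOrFail(tmp).length);
        Files.delete(tmp);
    }
}
```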

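2. Reading a large file in batches
Since a large file cannot be read into memory all at once, it should be divided into several sub-regions and read over multiple passes. There are several ways to do this.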
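The idea of dividing a file into sub-regions can be sketched as a small planning step. This is an illustrative sketch only: `ChunkPlanner`, `Chunk` and `plan` are hypothetical names that do not appear in the article, and the readers that follow do not build such a list; they simply keep reading until the end of the file.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkPlanner {
    /** One sub-region of a file: start offset and length in bytes. */
    static final class Chunk {
        final long offset;
        final int length;

        Chunk(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits the range [0, fileLength) into consecutive chunks of at most chunkSize bytes. */
    static List<Chunk> plan(long fileLength, int chunkSize) {
        if (fileLength < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and chunkSize > 0");
        }
        List<Chunk> chunks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            int length = (int) Math.min(chunkSize, fileLength - offset);
            chunks.add(new Chunk(offset, length));
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (Chunk c : plan(10, 4)) {
            System.out.println("offset=" + c.offset + " length=" + c.length);
        }
    }
}
```

Offsets are kept as `long` while each chunk length stays an `int`, which matches the constraint that a single Java array or buffer cannot exceed `Integer.MAX_VALUE` elements.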

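(1) File byte stream
Create a java.io.BufferedInputStream over the file. Each call to read() pulls the next arraySize bytes (fewer at the end of the file) from the file into array. This approach works, but it is not very efficient.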

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Created by zfh on 16-4-19.
 */
public class StreamFileReader {
    private BufferedInputStream fileIn;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public StreamFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new BufferedInputStream(new FileInputStream(fileName), arraySize);
        // available() returns an int and cannot report sizes above 2 GB, so ask the file system instead
        this.fileLength = new File(fileName).length();
        this.arraySize = arraySize;
    }

    public int read() throws IOException {
        byte[] tmpArray = new byte[arraySize];
        int bytes = fileIn.read(tmpArray); // read into a temporary byte array
        if (bytes != -1) {
            array = new byte[bytes]; // size the array to the number of bytes actually read
            System.arraycopy(tmpArray, 0, array, 0, bytes); // copy the bytes that were read
            return bytes;
        }
        return -1;
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        StreamFileReader reader = new StreamFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ; // read batch after batch until the end of the file
        long end = System.nanoTime();
        reader.close();
        System.out.println("StreamFileReader: " + (end - start));
    }
}
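The benchmark above discards every batch it reads. As a hedged illustration of doing some work per batch, the sketch below counts newline bytes while reading through a `BufferedInputStream`. The class `BatchLineCounter` and its method are hypothetical and not from the article; the sketch also reuses one buffer and relies on try-with-resources to close the stream, which differs from the listing above.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class BatchLineCounter {
    /** Counts '\n' bytes by reading the file in batches of at most batchSize bytes. */
    static long countNewlines(Path path, int batchSize) throws IOException {
        byte[] batch = new byte[batchSize]; // reused for every batch
        long count = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(path), batchSize)) {
            int n;
            while ((n = in.read(batch)) != -1) {
                // Only the first n bytes of the buffer are valid for this batch.
                for (int i = 0; i < n; i++) {
                    if (batch[i] == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines", ".txt");
        Files.write(tmp, "a\nb\nc\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println("newlines: " + countNewlines(tmp, 2));
        Files.delete(tmp);
    }
}
```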

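(2) File channel
Create a java.nio.channels.FileChannel for the file. Each call to read() first reads file data into a java.nio.ByteBuffer allocated with capacity arraySize, then copies the bytes that were read from the buffer into array. This approach uses NIO channels and is somewhat faster than reading the file through a traditional byte stream.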

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Created by zfh on 16-4-18.
 */
public class ChannelFileReader {
    private FileInputStream fileIn;
    private ByteBuffer byteBuf;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public ChannelFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new FileInputStream(fileName);
        this.fileLength = fileIn.getChannel().size();
        this.arraySize = arraySize;
        this.byteBuf = ByteBuffer.allocate(arraySize);
    }

    public int read() throws IOException {
        FileChannel fileChannel = fileIn.getChannel();
        int bytes = fileChannel.read(byteBuf); // read into the ByteBuffer
        if (bytes != -1) {
            array = new byte[bytes]; // size the array to the number of bytes actually read
            byteBuf.flip(); // switch the buffer from writing to reading
            byteBuf.get(array); // copy the bytes out of the ByteBuffer
            byteBuf.clear(); // make the buffer ready for the next read
            return bytes;
        }
        return -1;
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        ChannelFileReader reader = new ChannelFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ; // read batch after batch until the end of the file
        long end = System.nanoTime();
        reader.close();
        System.out.println("ChannelFileReader: " + (end - start));
    }
}
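A channel can also read a batch from an arbitrary offset, because `FileChannel.read(ByteBuffer, long)` takes an absolute position and leaves the channel's own position unchanged. The sketch below is an assumption-laden illustration rather than part of the article: `ChannelRegionReader` and `readRegion` are hypothetical names, and the channel is opened with `FileChannel.open` instead of through a `FileInputStream`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelRegionReader {
    /** Reads up to length bytes starting at the given absolute file offset. */
    static byte[] readRegion(Path path, long offset, int length) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            long pos = offset;
            // One read may deliver fewer bytes than requested, so loop until full or end of file.
            while (buf.hasRemaining()) {
                int n = channel.read(buf, pos);
                if (n == -1) {
                    break; // end of file reached
                }
                pos += n;
            }
            buf.flip();
            byte[] result = new byte[buf.remaining()];
            buf.get(result);
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("region", ".txt");
        Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(readRegion(tmp, 3, 4), StandardCharsets.US_ASCII));
        Files.delete(tmp);
    }
}
```

Because the offset is a `long`, this style of read works at any position in a file larger than 2 GB, as long as each individual batch fits in an `int`-sized buffer.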

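(3) Memory-mapped file
This approach maps the contents of the file into a region of the computer's virtual memory, so the program can operate on the data directly in memory instead of issuing an I/O call to the physical disk for every read. With ordinary I/O, each read switches from Java code into operating system kernel mode, the operating system reads the file, and the data is handed back to the Java code; with a mapping, the operating system loads the file's pages on demand. This can considerably speed up operations on large files.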

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Created by zfh on 16-4-19.
 */
public class MappedFileReader {
    private FileInputStream fileIn;
    private MappedByteBuffer mappedBuf;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public MappedFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new FileInputStream(fileName);
        FileChannel fileChannel = fileIn.getChannel();
        this.fileLength = fileChannel.size();
        // map the whole file as a single read-only region
        this.mappedBuf = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileLength);
        this.arraySize = arraySize;
    }

    public int read() throws IOException {
        int limit = mappedBuf.limit();
        int position = mappedBuf.position();
        if (position == limit) {
            return -1; // nothing left to read
        }
        if (limit - position > arraySize) {
            array = new byte[arraySize];
            mappedBuf.get(array);
            return arraySize;
        } else { // last read: take whatever remains
            array = new byte[limit - position];
            mappedBuf.get(array);
            return limit - position;
        }
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        MappedFileReader reader = new MappedFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ; // read batch after batch until the end of the file
        long end = System.nanoTime();
        reader.close();
        System.out.println("MappedFileReader: " + (end - start));
    }
}

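The problem seems to be solved perfectly, and we would certainly choose memory-mapped files for handling large files. But running the code shows that this method still cannot read files larger than 2 GB. The size argument passed to FileChannel.map() is of type long, so why does Integer.MAX_VALUE come into it?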

Exception in thread "main" java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:868)

Looking at the method at the error location, its documentation states:

size - The size of the region to be mapped; must be non-negative and no greater than Integer.MAX_VALUE

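This is partly due to historical reasons and to how deeply the int type is embedded in Java, but the fundamental cause is that java.nio.MappedByteBuffer directly extends java.nio.ByteBuffer, whose index variables are of type int, so the former can also address positions only up to Integer.MAX_VALUE. Does that leave us with no way out? Of course not: if one memory-mapped region is not enough, use several.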

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

/**
 * Created by zfh on 16-4-19.
 */
public class MappedBiggerFileReader {
    private MappedByteBuffer[] mappedBufArray;
    private int count = 0; // index of the mapped region currently being read
    private int number; // total number of mapped regions
    private FileInputStream fileIn;
    private long fileLength;
    private int arraySize;
    private byte[] array;

    public MappedBiggerFileReader(String fileName, int arraySize) throws IOException {
        this.fileIn = new FileInputStream(fileName);
        FileChannel fileChannel = fileIn.getChannel();
        this.fileLength = fileChannel.size();
        this.number = (int) Math.ceil((double) fileLength / (double) Integer.MAX_VALUE);
        this.mappedBufArray = new MappedByteBuffer[number]; // array of memory-mapped regions
        long preLength = 0;
        long regionSize = (long) Integer.MAX_VALUE; // size of one mapped region
        for (int i = 0; i < number; i++) { // map consecutive regions of the file into the array
            if (fileLength - preLength < (long) Integer.MAX_VALUE) {
                regionSize = fileLength - preLength; // size of the last region
            }
            mappedBufArray[i] = fileChannel.map(FileChannel.MapMode.READ_ONLY, preLength, regionSize);
            preLength += regionSize; // start of the next region
        }
        this.arraySize = arraySize;
    }

    public int read() throws IOException {
        if (count >= number) {
            return -1; // every region has been consumed
        }
        int limit = mappedBufArray[count].limit();
        int position = mappedBufArray[count].position();
        if (limit - position > arraySize) {
            array = new byte[arraySize];
            mappedBufArray[count].get(array);
            return arraySize;
        } else { // last read from the current mapped region
            array = new byte[limit - position];
            mappedBufArray[count].get(array);
            count++; // move on to the next mapped region (count < number is guaranteed here)
            return limit - position;
        }
    }

    public void close() throws IOException {
        fileIn.close();
        array = null;
    }

    public byte[] getArray() {
        return array;
    }

    public long getFileLength() {
        return fileLength;
    }

    public static void main(String[] args) throws IOException {
        MappedBiggerFileReader reader = new MappedBiggerFileReader("/home/zfh/movie.mkv", 65536);
        long start = System.nanoTime();
        while (reader.read() != -1) ; // read batch after batch until the end of the file
        long end = System.nanoTime();
        reader.close();
        System.out.println("MappedBiggerFileReader: " + (end - start));
    }
}
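Splitting the file into several mappings also allows random access by a `long` position: dividing the position by the region size selects the mapping, and the remainder is the `int` index inside it. The sketch below illustrates that arithmetic under stated assumptions. `RegionedMappedFile`, `open` and `byteAt` are hypothetical names that do not appear in the article, and the region size is a parameter so that the demo can use a tiny value; the reader above fixes it at `Integer.MAX_VALUE`.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class RegionedMappedFile {
    private final MappedByteBuffer[] regions;
    private final long regionSize;
    private final long fileLength;

    private RegionedMappedFile(MappedByteBuffer[] regions, long regionSize, long fileLength) {
        this.regions = regions;
        this.regionSize = regionSize;
        this.fileLength = fileLength;
    }

    /** Maps the file as consecutive read-only regions of at most regionSize bytes each. */
    static RegionedMappedFile open(Path path, int regionSize) throws IOException {
        if (regionSize <= 0) {
            throw new IllegalArgumentException("regionSize must be positive");
        }
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long length = channel.size();
            int count = (int) ((length + regionSize - 1) / regionSize); // ceiling division
            MappedByteBuffer[] mapped = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = (long) i * regionSize;
                long size = Math.min(regionSize, length - start); // last region may be shorter
                mapped[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
            // The mappings stay valid after the channel is closed.
            return new RegionedMappedFile(mapped, regionSize, length);
        }
    }

    int regionCount() {
        return regions.length;
    }

    /** Returns the byte at an absolute file position, which may exceed Integer.MAX_VALUE. */
    byte byteAt(long position) {
        if (position < 0 || position >= fileLength) {
            throw new IndexOutOfBoundsException("position: " + position);
        }
        int regionIndex = (int) (position / regionSize); // which mapping holds the byte
        int offset = (int) (position % regionSize);      // index inside that mapping
        return regions[regionIndex].get(offset);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapped", ".txt");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
        RegionedMappedFile file = open(tmp, 4); // tiny regions, for demonstration only
        System.out.println(file.regionCount() + " regions, byte 9 = " + (char) file.byteAt(9));
    }
}
```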

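3. Comparing the results
Reading a 1 GB file with the three methods above gives the following results (elapsed time in nanoseconds, as measured with System.nanoTime()):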

StreamFileReader: 11494900386
ChannelFileReader: 11329346316
MappedFileReader: 11169097480

Reading a 10 GB file gives the following results (again in nanoseconds):

StreamFileReader: 194579779394
ChannelFileReader: 190430242497
MappedBiggerFileReader: 186923035795
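The raw nanosecond figures are easier to compare once converted. The sketch below, which uses hypothetical helper names (`TimingSummary`, `toSeconds`, `percentFaster`) and only the numbers listed above, converts them to seconds and computes how much time the mapped readers save relative to the stream reader. By this arithmetic the 1 GB runs take roughly 11.5 s versus 11.2 s (about 2.8% less time), and the 10 GB runs roughly 194.6 s versus 186.9 s (about 3.9% less time).

```java
final class TimingSummary {
    /** Converts a duration in nanoseconds to seconds. */
    static double toSeconds(long nanos) {
        return nanos / 1_000_000_000.0;
    }

    /** Time saved by the candidate relative to the baseline, as a percentage of the baseline. */
    static double percentFaster(long baselineNanos, long candidateNanos) {
        return 100.0 * (baselineNanos - candidateNanos) / baselineNanos;
    }

    public static void main(String[] args) {
        // Figures copied from the results listed above.
        long stream1g = 11494900386L;
        long mapped1g = 11169097480L;
        long stream10g = 194579779394L;
        long mapped10g = 186923035795L;

        System.out.printf("1 GB:  stream %.2f s, mapped %.2f s, saved %.1f%%%n",
                toSeconds(stream1g), toSeconds(mapped1g), percentFaster(stream1g, mapped1g));
        System.out.printf("10 GB: stream %.2f s, mapped %.2f s, saved %.1f%%%n",
                toSeconds(stream10g), toSeconds(mapped10g), percentFaster(stream10g, mapped10g));
    }
}
```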

————————————————
Original link: https://blog.csdn.net/zhufenghao/article/details/51192043
