Getting the row count and file size of a Parquet file
阿新 • Published: 2018-11-12
Here is the code:
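import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.Footer;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;

// Logs the byte sizes and row count of every row group under the input path.
// Assumes a "logger" field is defined in the enclosing class.
public static void getParquetFileSizeAndRowCount() throws Exception {
    Path inputPath = new Path("/user/hive/warehouse/user_parquet");
    Configuration conf = new Configuration();
    FileStatus[] inputFileStatuses = inputPath.getFileSystem(conf).globStatus(inputPath);
    if (inputFileStatuses == null) {
        return; // globStatus returns null when a non-glob path does not exist
    }
    for (FileStatus status : inputFileStatuses) {
        // false = do not skip row groups, so the block metadata is read
        for (Footer f : ParquetFileReader.readFooters(conf, status, false)) {
            // one BlockMetaData per row group
            for (BlockMetaData b : f.getParquetMetadata().getBlocks()) {
                logger.info("TotalByteSize:" + b.getTotalByteSize()
                        + " CompressedSize:" + b.getCompressedSize()
                        + " rowCount:" + b.getRowCount());
            }
        }
    }
}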
Output:
18/10/26 10:38:20 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
18/10/26 10:38:20 INFO hadoop.ParquetFileReader: reading another 1 footers
18/10/26 10:38:20 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
18/10/26 10:38:20 INFO test.HDFSTest: TotalByteSize:106324460 CompressedSize:106324460 rowCount:53285496
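The method above logs one line per row group, so a file or table with several row groups produces several lines. If a single total is wanted, the per-row-group values can be summed. The following is only a minimal sketch of that idea: `ParquetTotals` and `RowGroupStats` are hypothetical names that do not come from Parquet, and `RowGroupStats` merely stands in for the three values read from `BlockMetaData` so the example runs without Hadoop or Parquet on the classpath. The second row group in `main` is made-up sample data.

```java
import java.util.Arrays;
import java.util.List;

public class ParquetTotals {

    // Hypothetical stand-in for the three values the original code reads from
    // BlockMetaData: getTotalByteSize(), getCompressedSize(), getRowCount().
    static final class RowGroupStats {
        final long totalByteSize;
        final long compressedSize;
        final long rowCount;

        RowGroupStats(long totalByteSize, long compressedSize, long rowCount) {
            this.totalByteSize = totalByteSize;
            this.compressedSize = compressedSize;
            this.rowCount = rowCount;
        }
    }

    // Adds up the per-row-group values and returns them as one total.
    static RowGroupStats sum(List<RowGroupStats> rowGroups) {
        long totalBytes = 0L;
        long compressedBytes = 0L;
        long rows = 0L;
        for (RowGroupStats rg : rowGroups) {
            totalBytes += rg.totalByteSize;
            compressedBytes += rg.compressedSize;
            rows += rg.rowCount;
        }
        return new RowGroupStats(totalBytes, compressedBytes, rows);
    }

    public static void main(String[] args) {
        List<RowGroupStats> rowGroups = Arrays.asList(
                // values taken from the log output above
                new RowGroupStats(106324460L, 106324460L, 53285496L),
                // made-up second row group, for illustration only
                new RowGroupStats(1000L, 400L, 10L));

        RowGroupStats total = sum(rowGroups);
        System.out.println("TotalByteSize:" + total.totalByteSize
                + " CompressedSize:" + total.compressedSize
                + " rowCount:" + total.rowCount);
    }
}
```

In the real method, one `RowGroupStats` could be built inside the innermost loop from `b.getTotalByteSize()`, `b.getCompressedSize()` and `b.getRowCount()`, and the list summed once all footers have been read.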
Relevant part of pom.xml:
<properties>
    <hadoop.version>2.8.4</hadoop.version>
    <parquet.version>1.10.0</parquet.version>
</properties>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-common</artifactId>
    <version>${parquet.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-encoding</artifactId>
    <version>${parquet.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-column</artifactId>
    <version>${parquet.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>${parquet.version}</version>
</dependency>