JTidy: Fixing Garbled Chinese Characters (Tested and Working)
阿新 • Published: 2019-01-10
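My project needed jTidy to convert HTML into XML files for later processing. However, Chinese text in the HTML documents came out garbled after jTidy converted it.
After some digging, I found that the main fix is to set jTidy's inCharEncoding and outCharEncoding to UTF-8, so that Chinese characters are read and written correctly.
The key code is just two lines: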
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
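To illustrate why the encoding setting matters, here is a minimal sketch that uses only the Java standard library, not jTidy itself. It assumes, purely for illustration, that a parser without an explicit input encoding decodes bytes with a single-byte charset such as ISO-8859-1; what jTidy actually defaults to may differ by version. The class name EncodingDemo and the helper decode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: decodes raw bytes the way a parser would
    // if it were configured with the given input encoding.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character string for "Chinese",
        // written with Unicode escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Assumed wrong setting: each UTF-8 byte becomes its own character.
        String garbled = decode(utf8Bytes, StandardCharsets.ISO_8859_1);
        // Matching setting: the original text is recovered.
        String restored = decode(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("garbled matches original:  " + garbled.equals(original));
        System.out.println("restored matches original: " + restored.equals(original));
    }
}
```

The same reasoning applies on the output side: if the characters are written with an encoding that cannot represent them, the resulting file is unreadable even when the input was parsed correctly, which is why both settings are changed together.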
The complete code is as follows:
import org.w3c.tidy.Tidy;

import java.io.*;

public class Main {
    public static final String WORK_DIR = "D:\\data\\temp\\jTidy\\";
    public static final String INPUT_FILE = "input.html";
    public static final String OUTPUT_FILE = "output.xml";
    public static final String ERROR_LOG = "error.log";

    public static void convert(InputStream inputStream, OutputStream outputStream) throws IOException {
        Tidy tidy = new Tidy();
        // Read and write as UTF-8 so that Chinese characters are preserved.
        tidy.setInputEncoding("UTF-8");
        tidy.setOutputEncoding("UTF-8");
        // Disable line wrapping in the output.
        tidy.setWraplen(0);
        // Send jTidy's warnings and errors to a log file.
        // Closing the writer flushes the log to disk.
        try (PrintWriter errout = new PrintWriter(new FileWriter(WORK_DIR + ERROR_LOG))) {
            tidy.setErrout(errout);
            tidy.parse(inputStream, outputStream);
        }
    }

    public static void main(String[] args) {
        // try-with-resources closes both streams even if the conversion fails.
        try (InputStream inputStream = new FileInputStream(WORK_DIR + INPUT_FILE);
             OutputStream outputStream = new FileOutputStream(WORK_DIR + OUTPUT_FILE)) {
            convert(inputStream, outputStream);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}