Mac Tesseract 4.1.1: A Very Detailed Sample Training Tutorial

喬布斯的橘子 2021-03-17 01:40:17
Tags: opencv, python, image recognition, ocr
Installation
Installing tesseract directly with Homebrew on a Mac does not include the training tools.

If you have already installed tesseract without the training tools, uninstall it first:

brew uninstall tesseract

Install the required dependencies first:

# Packages which are always needed.
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
# Packages required for training tools.
brew install pango
# Optional packages for extra features.
brew install libarchive
# Optional package for builds using g++.
brew install gcc

Download tesseract-4.1.1.tar.gz from the link below and extract it:

https://github.com/tesseract-ocr/tesseract/releases

Build and install:

cd tesseract-4.1.1
./autogen.sh
mkdir build
cd build
# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j
# Optionally install Tesseract.
sudo make install
# Optionally build and install training tools.
make training
sudo make training-install
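
The PKG_CONFIG_PATH above hard-codes /usr/local, which is the usual Homebrew prefix on Intel Macs; on Apple Silicon, Homebrew normally lives under /opt/homebrew instead. As a minimal sketch, the same path can be built from whatever prefix `brew --prefix` reports. The helper name pkg_config_path_for is made up for illustration and is not part of Homebrew or Tesseract.

```shell
# Hypothetical helper: build the PKG_CONFIG_PATH passed to ../configure
# from a Homebrew prefix instead of hard-coding /usr/local.
pkg_config_path_for() {
  local prefix="$1" path="" pkg
  for pkg in icu4c libarchive libffi; do
    # Append each keg's pkgconfig directory, separated by ':'.
    path="${path:+$path:}$prefix/opt/$pkg/lib/pkgconfig"
  done
  printf '%s\n' "$path"
}

# Intended use from the build directory (requires Homebrew):
#   ../configure PKG_CONFIG_PATH="$(pkg_config_path_for "$(brew --prefix)")"
```

With /usr/local as the argument this reproduces the exact value used in the configure command above.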

The build does not download any language data. Download the languages you need from the following link:

https://github.com/tesseract-ocr/tessdata
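
A single language file can also be fetched from the command line. The sketch below only builds the download URL. The helper name traineddata_url is hypothetical, and the default branch name "main" is an assumption, so check the repository page if the download returns a 404.

```shell
# Hypothetical helper: URL of one language file in the tessdata repository.
# ASSUMPTION: the repository's default branch is "main"; pass another
# branch or tag as the second argument if that is wrong.
traineddata_url() {
  local lang="$1" ref="${2:-main}"
  printf 'https://github.com/tesseract-ocr/tessdata/raw/%s/%s.traineddata\n' "$ref" "$lang"
}

# Intended use (needs network access):
#   curl -L -o eng.traineddata "$(traineddata_url eng)"
```

The downloaded file then goes into the tessdata directory, the same place the trained num.traineddata ends up at the end of this tutorial.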

Training
First, collect the sample data (a number of images of the text you want to train on).

The images must be converted to TIFF format.
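
One way to do the conversion on a Mac is the built-in sips tool. The following is a rough sketch that assumes the samples are PNG files in a single directory; tif_name_for and convert_dir_to_tif are illustrative names only.

```shell
# Hypothetical helper: map foo.png -> foo.tif (keeps the directory part).
tif_name_for() {
  printf '%s.tif\n' "${1%.*}"
}

# Convert every PNG in a directory to TIFF with macOS's sips.
convert_dir_to_tif() {
  local dir="$1" img
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue   # glob matched nothing: skip
    sips -s format tiff "$img" --out "$(tif_name_for "$img")"
  done
}

# Intended use (macOS only):
#   convert_dir_to_tif ./samples
```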

Download and open jTessBoxEditor (note: it requires a Java 8 environment, which you need to set up yourself):

https://pilotfiber.dl.sourceforge.net/project/vietocr/jTessBoxEditor/jTessBoxEditor-2.3.1.zip

In jTessBoxEditor, use Tools -> Merge TIFF to merge all the TIFF files into one.

Rename the merged TIFF file to eng.num.exp0.tif
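
The name follows the [lang].[fontname].exp[num].tif pattern that Tesseract's legacy training tools expect: here the language is eng, the font name is num, and the experiment number is 0. As a sketch, a name can be checked against that pattern before training starts; parse_training_name is a hypothetical helper, not a Tesseract tool.

```shell
# Split a training file name of the form [lang].[fontname].exp[num].tif.
parse_training_name() {
  local base stem lang rest font exp
  base="${1##*/}"        # drop any directory part
  stem="${base%.tif}"    # eng.num.exp0
  lang="${stem%%.*}"     # eng
  rest="${stem#*.}"      # num.exp0
  font="${rest%%.*}"     # num
  exp="${rest#*.exp}"    # 0
  case "$exp" in
    ''|*[!0-9]*) exp="" ;;   # experiment number must be digits only
  esac
  # Rebuilding the name from its parts rejects extra or missing fields.
  if [ -n "$lang" ] && [ -n "$font" ] && [ -n "$exp" ] &&
     [ "$base" = "$lang.$font.exp$exp.tif" ]; then
    printf 'lang=%s font=%s exp=%s\n' "$lang" "$font" "$exp"
  else
    printf 'unexpected training file name: %s\n' "$base" >&2
    return 1
  fi
}
```

Calling it on eng.num.exp0.tif reports the three fields; a name such as merged.tif is rejected with a non-zero status.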

Generate the box file, which is used to correct recognition errors:

tesseract eng.num.exp0.tif eng.num.exp0 -l eng batch.nochop makebox

At this point you should have two files: eng.num.exp0.tif and eng.num.exp0.box

Open eng.num.exp0.tif in jTessBoxEditor

(Box Editor->Open->eng.num.exp0.tif)

Correct any misrecognized characters and misplaced boxes, then save.
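
Each line of the box file describes one character as `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. After editing, a quick sanity check can catch damaged lines. This is only a sketch with an invented helper name (check_box_file), and it assumes no symbol contains a space.

```shell
# Report box-file lines that do not look like
#   <symbol> <left> <bottom> <right> <top> <page>
check_box_file() {
  awk '
    NF != 6 {
      printf "line %d: expected 6 fields, got %d\n", NR, NF
      bad = 1
      next
    }
    {
      for (i = 2; i <= 6; i++) {
        if ($i !~ /^[0-9]+$/) {
          printf "line %d: field %d is not an integer\n", NR, i
          bad = 1
          next
        }
      }
      # left must be below right, bottom must be below top
      if ($2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0) {
        printf "line %d: box has no width or height\n", NR
        bad = 1
      }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Intended use:
#   check_box_file eng.num.exp0.box && echo "box file looks sane"
```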

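Create a new file named font_properties with the following content. The first field must match the font name in the training file name (num in eng.num.exp0); the five flags that follow are italic, bold, fixed, serif, and fraktur:

num 0 0 0 0 0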
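
The file can also be written from the shell. The sketch below assumes, as the Tesseract legacy training documentation describes, that each line has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>` and that the font name must match the one in the training file name (num for eng.num.exp0). write_font_properties is an illustrative helper name.

```shell
# Hypothetical helper: write a one-line font_properties file with all
# style flags (italic bold fixed serif fraktur) set to 0.
write_font_properties() {
  local font="$1" out="${2:-font_properties}"
  printf '%s 0 0 0 0 0\n' "$font" > "$out"
}

# Intended use in the training directory:
#   write_font_properties num
```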

Run the following commands to train on the data:

tesseract eng.num.exp0.tif eng.num.exp0 nobatch box.train
unicharset_extractor eng.num.exp0.box
shapeclustering -F font_properties -U unicharset eng.num.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset eng.num.exp0.tr
cntraining eng.num.exp0.tr
mv inttemp num.inttemp
mv normproto num.normproto
mv pffmtable num.pffmtable
mv shapetable num.shapetable
mv unicharset num.unicharset
combine_tessdata num.
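
The same sequence can be wrapped in one function so that it stops at the first failing step. This is a sketch, not an official script: run and train_legacy are made-up names, the three parameters (source language, font name, output name) are illustrative, and setting DRY_RUN=1 only prints the commands instead of running them.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# train_legacy <lang> <font> <output-name>
train_legacy() {
  local lang="$1" font="$2" out="$3" base f
  base="$lang.$font.exp0"
  run tesseract "$base.tif" "$base" nobatch box.train || return 1
  run unicharset_extractor "$base.box" || return 1
  run shapeclustering -F font_properties -U unicharset "$base.tr" || return 1
  run mftraining -F font_properties -U unicharset -O unicharset "$base.tr" || return 1
  run cntraining "$base.tr" || return 1
  # Prefix the generated files so combine_tessdata can pick them up.
  for f in inttemp normproto pffmtable shapetable unicharset; do
    run mv "$f" "$out.$f" || return 1
  done
  run combine_tessdata "$out."
}

# Intended use in the training directory:
#   DRY_RUN=1 train_legacy eng num num   # preview the commands
#   train_legacy eng num num             # run them
```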

After these commands finish, the working directory contains the generated files, including num.traineddata.

Move num.traineddata into your tessdata directory and it is ready to use.

On my machine that directory is /usr/local/share/tessdata/
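
As a final sketch, the copy step can be scripted and the result checked from the command line. install_traineddata is a hypothetical helper, and its default destination is simply the path mentioned above; writing there may require sudo.

```shell
# Hypothetical helper: copy a .traineddata file into a tessdata directory.
install_traineddata() {
  local file="$1" dest="${2:-/usr/local/share/tessdata}"
  if [ ! -f "$file" ]; then
    printf 'missing file: %s\n' "$file" >&2
    return 1
  fi
  mkdir -p "$dest" && cp "$file" "$dest/"
}

# Intended use:
#   install_traineddata num.traineddata
#   tesseract --list-langs                 # "num" should now be listed
#   tesseract sample.tif stdout -l num     # recognize with the new data
```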