A Detailed Guide to Training Tesseract 4.1.1 on Sample Images on a Mac
喬布斯的橘子 2021-03-17 01:40:17
Tags: opencv, python, image recognition, ocr
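Installation
Installing Tesseract directly on a Mac through Homebrew does not include the training tools.
If you have already installed a Tesseract build without the training tools, uninstall it first: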
brew uninstall tesseract
Next, install the packages that the build depends on:
# Packages which are always needed.
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
# Packages required for training tools.
brew install pango
# Optional packages for extra features.
brew install libarchive
# Optional package for builds using g++.
brew install gcc
Download tesseract-4.1.1.tar.gz from the link below and extract it:
https://github.com/tesseract-ocr/tesseract/releases
Build and install:
cd tesseract-4.1.1
./autogen.sh
mkdir build
cd build
# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j
# Optionally install Tesseract.
sudo make install
# Optionally build and install training tools.
make training
sudo make training-install
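Before moving on, it can help to confirm that the training executables actually ended up on the PATH. The following Python sketch is not part of the original tutorial; it only checks for the tools that the training section invokes, using the standard library's shutil.which. The helper name missing_tools is hypothetical.

```python
import shutil

# Executables invoked by the training steps in this tutorial.
TRAINING_TOOLS = [
    "tesseract",
    "unicharset_extractor",
    "shapeclustering",
    "mftraining",
    "cntraining",
    "combine_tessdata",
]


def missing_tools(tools=TRAINING_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All training tools were found.")
```

If any tool is reported missing, the most likely cause is that make training and sudo make training-install were skipped or failed.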
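The source archive does not include any language data. Download the traineddata files for the languages you need from the link below: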
https://github.com/tesseract-ocr/tessdata
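Training
First, collect your sample data (a number of images of the text you want to train on).
The images must be converted to tif format.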
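The original does not say how to do the conversion. One option on macOS, shown here as a sketch under stated assumptions, is to call the built-in sips tool from Python. The function names, the samples/ and tif/ directories, and the *.png pattern are all hypothetical; any image tool that writes TIFF would do.

```python
import subprocess
from pathlib import Path


def sips_to_tiff_command(src, out_dir):
    """Build the sips command that converts one image to TIFF."""
    src = Path(src)
    dst = Path(out_dir) / (src.stem + ".tif")
    return ["sips", "-s", "format", "tiff", str(src), "--out", str(dst)]


def convert_all(src_dir, out_dir, pattern="*.png"):
    """Convert every image in src_dir that matches pattern (macOS only)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob(pattern)):
        # check=True stops at the first image that fails to convert.
        subprocess.run(sips_to_tiff_command(src, out_dir), check=True)


if __name__ == "__main__":
    # Hypothetical directories: samples/ holds the sources, tif/ the output.
    convert_all("samples", "tif")
```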
Download and open jTessBoxEditor (note that it requires a Java 8 environment, which you need to set up yourself):
https://pilotfiber.dl.sourceforge.net/project/vietocr/jTessBoxEditor/jTessBoxEditor-2.3.1.zip
In jTessBoxEditor, choose Tools -> Merge TIFF to merge all the tif files into a single file.
Rename the merged file to eng.num.exp0.tif
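The name eng.num.exp0.tif follows the [lang].[fontname].exp[num].tif convention used by the Tesseract training tools: language eng, font name num, and index 0. A small Python sketch of that convention, with hypothetical helper names, might look like this:

```python
import re

# Pattern for <lang>.<fontname>.exp<num>.tif
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$"
)


def training_name(lang, font, num=0):
    """Build a training image name such as eng.num.exp0.tif."""
    return f"{lang}.{font}.exp{num}.tif"


def parse_training_name(filename):
    """Split a training image name into (lang, font, num)."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a <lang>.<font>.exp<num>.tif name: {filename!r}")
    return match.group("lang"), match.group("font"), int(match.group("num"))
```

The font name chosen here (num) is reused later as the prefix of the generated traineddata file.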
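Generate the box file. It holds Tesseract's initial guess for each character together with its bounding box, and is what you edit to correct recognition errors: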
tesseract eng.num.exp0.tif eng.num.exp0 -l eng batch.nochop makebox
You should now have two files: eng.num.exp0.tif and eng.num.exp0.box
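Each line of the box file describes one character as "symbol left bottom right top page", with pixel coordinates measured from the bottom-left corner of the image. The sketch below parses such lines in Python so that the file can be inspected outside jTessBoxEditor; the Box type and function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one 'symbol left bottom right top page' line."""
    # Split from the right so the symbol keeps whatever precedes the numbers.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def read_boxes(path):
    """Read every non-empty line of a .box file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]
```

For instance, read_boxes("eng.num.exp0.box") returns one Box per character, which makes it easy to count how often each symbol occurs in the samples.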
Open eng.num.exp0.tif in jTessBoxEditor
(Box Editor -> Open -> eng.num.exp0.tif)
Correct any recognition errors and save the box file.
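Create a new file named font_properties with the content below. The first field is the font name, followed by five 0/1 flags for italic, bold, fixed, serif, and fraktur. The Tesseract training documentation expects the font name to match the font field of the file name (num in eng.num.exp0.tif), so if mftraining complains about the font, change the first field to num.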
font 0 0 0 0 0
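As an illustration of the font_properties format (font name followed by the italic, bold, fixed, serif, and fraktur flags), the following Python sketch generates such a line. The helper names are hypothetical, and the tutorial itself simply writes the file by hand.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Format one font_properties entry: name plus five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


def write_font_properties(font, path="font_properties", **flags):
    """Write a single-font font_properties file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(font_properties_line(font, **flags) + "\n")
```

Calling write_font_properties("font") reproduces the line used in this tutorial.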
Run the following commands to train on the data. Note that this is the legacy (non-LSTM) training workflow:
tesseract eng.num.exp0.tif eng.num.exp0 nobatch box.train
unicharset_extractor eng.num.exp0.box
shapeclustering -F font_properties -U unicharset eng.num.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset eng.num.exp0.tr
cntraining eng.num.exp0.tr
mv inttemp num.inttemp
mv normproto num.normproto
mv pffmtable num.pffmtable
mv shapetable num.shapetable
mv unicharset num.unicharset
combine_tessdata num.
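The command sequence above can also be driven from a script, which makes it easier to rerun after correcting the box file. This is a minimal Python sketch that issues the same commands with the same arguments as the tutorial; the function names are hypothetical, and it assumes the .tif, .box, and font_properties files are in the current directory.

```python
import shutil
import subprocess

# Intermediate files that must be prefixed with the font name before combining.
OUTPUTS = ["inttemp", "normproto", "pffmtable", "shapetable", "unicharset"]


def training_commands(lang="eng", font="num", exp=0):
    """Return the training commands from this tutorial as argument lists."""
    base = f"{lang}.{font}.exp{exp}"
    tr = f"{base}.tr"
    return [
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", "unicharset", tr],
        ["cntraining", tr],
    ]


def run_training(lang="eng", font="num", exp=0):
    """Run every step, rename the outputs, and build <font>.traineddata."""
    for command in training_commands(lang, font, exp):
        subprocess.run(command, check=True)  # stop at the first failing step
    for name in OUTPUTS:
        shutil.move(name, f"{font}.{name}")
    subprocess.run(["combine_tessdata", f"{font}."], check=True)


if __name__ == "__main__":
    run_training()
```

Using check=True means a failure in, say, mftraining aborts the run instead of letting combine_tessdata package incomplete files.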
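When the commands finish, the working directory contains the generated files, including num.traineddata.
Move num.traineddata into your tessdata directory and it is ready to use.
On my machine that directory is /usr/local/share/tessdata/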
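To round this off, here is a hedged Python sketch of the final step: copying num.traineddata into the tessdata directory and building a recognition command that selects the new model with -l num. The function names and the test.tif image are hypothetical, the default directory is the author's path and may differ on your machine, and writing there may require elevated permissions.

```python
import shutil
from pathlib import Path


def install_traineddata(src="num.traineddata",
                        tessdata_dir="/usr/local/share/tessdata"):
    """Copy a traineddata file into the tessdata directory."""
    dst = Path(tessdata_dir) / Path(src).name
    shutil.copy(src, dst)
    return dst


def recognize_command(image, lang="num"):
    """Build a tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image), "stdout", "-l", lang]


if __name__ == "__main__":
    import subprocess

    install_traineddata()
    # test.tif is a placeholder for one of your own images.
    subprocess.run(recognize_command("test.tif"), check=True)
```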