Google Tesseract-OCR: Configuration and Training

阿新 • Published: 2019-01-18

I. Configuration

1. Download tesseract-ocr-3.02-vs2008, tesseract-ocr-3.02.eng.tar, tesseract-ocr-3.02.02.tar, and leptonica-1.68-win32-lib-include-dirs from http://code.google.com/p/tesseract-ocr/downloads/list.

2. Put all of the files in the D:\BulidFolder folder and extract them.

3. Download tesseract-ocr-setup-3.02.02.exe and install it on drive D:. This creates a Tesseract-OCR directory.

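4. In equationdetect.cpp under the ccmain folder, change

static const STRING kCharsToEx[] = {"'", "`","\"", "\\", ",", ".", "〈", "〉", "《", "》", "」", "「", ""};

to

static const STRING kCharsToEx[] = {"'", "`","\"", "\\", ",",".","<", ">", "<<",">>", ""};

(Note: without this change the build always fails, and I have not found another fix. Version 3.01 does not contain this file, so building 3.01 requires no changes to the source.)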

5. Open tesseract.sln in D:\OCR\code\tesseract-ocr\vs2008 and rebuild the entire solution.

II. Training a New Language Pack

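Tesseract is an open-source OCR (Optical Character Recognition) engine. It reads image files in a variety of formats and converts them to text, and it currently supports more than 60 languages, including Chinese. Tesseract was originally developed by HP and was later maintained by Google; it is currently hosted on Google Code at http://code.google.com/p/tesseract-ocr/.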

Recognition with the default language data

1. Install Tesseract

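Download Tesseract from http://code.google.com/p/tesseract-ocr/downloads/list. The current version is Tesseract 3.02. Since this is only a test, download the Windows installer tesseract-ocr-setup-3.02.02.exe directly. A successful install creates a Tesseract-OCR directory on the chosen drive, and the tesseract.exe program in that directory can then recognize characters in images.

2. Prepare an image to recognize. Here I wrote a string of digits in Paint and saved it as number.jpg, as shown in the figure below: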

3. Open a command prompt, change to the Tesseract-OCR directory, and run:

tesseract.exe number.jpg result -l eng

Here result is the base name of the output text file (result.txt), and eng selects the English language data for recognition.

4. Open result.txt in the Tesseract-OCR directory. The recognized text is 7542315857, with 3 characters wrong, so the recognition rate is not very high. Is there a way to improve it?
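To put a number on "recognition rate", one option is to compare the OCR output with the text you know the image contains. The following is only a minimal sketch: countCharErrors and recognitionRate are hypothetical helpers, not part of Tesseract, and they assume a simple position-by-position comparison (an inserted or dropped character shifts everything after it, so a real evaluation would more likely use an edit distance).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Count positions at which the OCR output differs from the expected text.
// Each extra or missing trailing character also counts as one error.
std::size_t countCharErrors(const std::string& expected,
                            const std::string& recognized)
{
    const std::size_t common = std::min(expected.size(), recognized.size());
    std::size_t errors = std::max(expected.size(), recognized.size()) - common;
    for (std::size_t i = 0; i < common; ++i)
    {
        if (expected[i] != recognized[i])
            ++errors;
    }
    return errors;
}

// Fraction of the expected characters that were recognized correctly, in [0, 1].
double recognitionRate(const std::string& expected,
                       const std::string& recognized)
{
    assert(!expected.empty());  // a rate is undefined for empty ground truth
    const std::size_t errors = countCharErrors(expected, recognized);
    if (errors >= expected.size())
        return 0.0;
    return static_cast<double>(expected.size() - errors) / expected.size();
}
```

Pass the digits you actually drew as the first argument and the contents of result.txt as the second.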

Tesseract does provide a way to train on samples and generate the language data you need. The procedure is described below.

Training on samples

2. Obtain sample images. I drew 5 sample images of the digits 0-9 in Paint (the more samples the better), as shown in the figure below:

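3. Merge the sample images. Run jTessBoxEditor and choose Tools ---> Merge TIFF from the menu bar. In the dialog, select the sample images (hold Shift to select several) and merge them into num.font.exp0.tif.

4. Generate the box file. Open a command prompt and run: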

tesseract.exe num.font.exp0.tif num.font.exp0 batch.nochop makebox

The resulting box file is num.font.exp0.box. It contains the characters Tesseract recognized and their coordinates.
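To make the box file contents concrete, here is a minimal sketch of reading one line of it. It assumes the Tesseract 3.x box layout of one character per line, "<symbol> <left> <bottom> <right> <top> <page>", with coordinates measured from the bottom-left corner of the image. BoxEntry, parseBoxLine, and boxWidth are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One entry of a box file: the recognized symbol and its bounding box.
struct BoxEntry
{
    std::string symbol;  // recognized character (UTF-8)
    int left;
    int bottom;
    int right;
    int top;
    int page;            // zero-based page index within the multi-page TIFF
};

// Parse "<symbol> <left> <bottom> <right> <top> <page>".
// Returns false if a field is missing or the box is inverted.
bool parseBoxLine(const std::string& line, BoxEntry& out)
{
    std::istringstream in(line);
    if (!(in >> out.symbol >> out.left >> out.bottom
             >> out.right >> out.top >> out.page))
        return false;
    return out.left <= out.right && out.bottom <= out.top;
}

// Width of the bounding box in pixels.
int boxWidth(const BoxEntry& box)
{
    assert(box.right >= box.left);
    return box.right - box.left;
}
```

A line such as "7 10 20 30 60 0" would describe the character 7 on the first page, in a box 20 pixels wide.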

Note: the general form of the makebox command is:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

Here lang is the language name, fontname is the font name, and num is a sequence number; all three can be chosen freely.
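The naming convention above can be sketched as a pair of small helpers that build the base name and the matching makebox command line. This is illustrative only: trainingBaseName and makeboxCommand are hypothetical functions, not part of Tesseract, and they do nothing beyond string assembly.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Build "[lang].[fontname].exp[num]", the base name shared by the
// .tif, .box, and .tr files of one training sample.
std::string trainingBaseName(const std::string& lang,
                             const std::string& fontname,
                             int num)
{
    assert(num >= 0);  // the sequence number is a non-negative index
    std::ostringstream out;
    out << lang << '.' << fontname << ".exp" << num;
    return out.str();
}

// Build the makebox command line for one training image.
std::string makeboxCommand(const std::string& lang,
                           const std::string& fontname,
                           int num)
{
    const std::string base = trainingBaseName(lang, fontname, num);
    return "tesseract " + base + ".tif " + base + " batch.nochop makebox";
}
```

With lang "num", fontname "font", and num 0 this yields the base name num.font.exp0 used throughout this walkthrough.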

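5. Correct the text. Run jTessBoxEditor and open num.font.exp0.tif (the .box file from the previous step must be in the same directory as the .tif sample file), as shown in the figure below. Some characters are recognized incorrectly; use the tool to correct each wrong character in every image by hand, then save.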

6. Define the font properties file. Tesseract-OCR 3.01 and later need a font properties file, here named font_properties, to exist before training.

The font_properties file must not start with a BOM. Each line has the following format:

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Here fontname is the font name and must match the name used in [lang].[fontname].exp[num].box. <italic>, <bold>, <fixed>, <serif>, and <fraktur> each take the value 1 or 0, indicating whether the font has that property.

Create a file named font_properties in the directory containing the sample images, open it in Notepad, and enter the following:

font 0 0 0 0 0

All values are 0 here, meaning the font is not bold, not italic, and so on.
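Since a stray BOM or a malformed line in font_properties is easy to miss, a small check before training can help. The sketch below validates one line against the format described above; hasUtf8Bom and isValidFontPropertiesLine are hypothetical helpers, and the check only covers what this post states (one font name followed by five 0/1 flags, no BOM).

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True if the text starts with the UTF-8 byte order mark (EF BB BF).
bool hasUtf8Bom(const std::string& text)
{
    return text.size() >= 3 &&
           static_cast<unsigned char>(text[0]) == 0xEF &&
           static_cast<unsigned char>(text[1]) == 0xBB &&
           static_cast<unsigned char>(text[2]) == 0xBF;
}

// Check "<fontname> <italic> <bold> <fixed> <serif> <fraktur>":
// the name must equal expectedFont and each flag must be 0 or 1.
bool isValidFontPropertiesLine(const std::string& line,
                               const std::string& expectedFont)
{
    assert(!expectedFont.empty());
    if (hasUtf8Bom(line))
        return false;

    std::istringstream in(line);
    std::string name;
    if (!(in >> name) || name != expectedFont)
        return false;

    for (int i = 0; i < 5; ++i)
    {
        int flag;
        if (!(in >> flag) || (flag != 0 && flag != 1))
            return false;
    }

    std::string extra;
    return !(in >> extra);  // reject trailing fields
}
```

For the file created above, the line "font 0 0 0 0 0" with expected font name "font" passes the check.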

7. Generate the language file. Create a batch file in the directory containing the sample images with the following contents.

rem Create the font_properties file in this directory before running this batch file

echo Run Tesseract for Training..
tesseract.exe num.font.exp0.tif num.font.exp0 nobatch box.train

echo Compute the Character Set..
unicharset_extractor.exe num.font.exp0.box
mftraining -F font_properties -U unicharset -O num.unicharset num.font.exp0.tr

echo Clustering..
cntraining.exe num.font.exp0.tr

echo Rename Files..
rename normproto num.normproto
rename inttemp num.inttemp
rename pffmtable num.pffmtable
rename shapetable num.shapetable

echo Create Tessdata..
combine_tessdata.exe num.

Run the batch file from the command prompt. The output is as follows:

Confirm that the Offset entries 1, 3, 4, 5, and 13 in the printed output are not -1. If they are not, the new language file has been generated.

num.traineddata is the final language file. Copy it to the tessdata directory under Tesseract-OCR, and it can then be used for character recognition.

Recognition with the trained language data

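To recognize number.jpg with the trained language data, open a command prompt, change to the Tesseract-OCR directory, and run the following (note -l num, which selects the num.traineddata file generated above):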

tesseract.exe number.jpg result -l num

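The result is shown in the figure; the recognition rate is much improved. With custom training samples you can recognize things such as image CAPTCHAs and license plate numbers. Interested readers may want to explore further.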

III. Using Tesseract from C++

1. Recognizing an image file directly

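#include "strngs.h"
#include "baseapi.h"
#include <iostream>
#pragma comment(lib, "libtesseract302d.lib")

int main()
{
    tesseract::TessBaseAPI api;
    // Initialize and choose the language pack: chi_sim for Simplified Chinese,
    // eng for English, or a language pack you trained yourself.
    if (api.Init(NULL, "eng", tesseract::OEM_DEFAULT) != 0)
    {
        std::cout << "Could not initialize tesseract." << std::endl;
        return 1;
    }
    //api.SetVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");

    STRING text_out;
    if (!api.ProcessPages("kaze.tif", NULL, 0, &text_out))
    {
        //AfxMessageBox("tesseract processing failed");
        return 1;
    }
    std::cout << text_out.string();
    // UTF8ToGBK is a user-supplied UTF-8 to GBK conversion helper that is not
    // shown in this post; enable this line only if you have one.
    //std::cout << UTF8ToGBK(text_out.string()).c_str();
    return 0;
}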

2. Recognizing text from a command-line program

// tess.cpp:   
// Recognize text on an image using Tesseract API and print it to the screen  
// Usage: ./tess image.png  
  
#include <tesseract/baseapi.h>  
#include <tesseract/strngs.h>  
#include <iostream>  
  
int main(int argc, char** argv)  
{  
   if (argc != 2)  
   {   
        std::cout << "Please specify the input image!" << std::endl;  
        return -1;  
    }  
  
    const char* lang = "eng";  
    const char* filename = argv[1];  
    // Create the TessBaseAPI object
    tesseract::TessBaseAPI tess;
    // Initialize
    tess.Init(NULL, lang, tesseract::OEM_DEFAULT);
    // Set the page segmentation mode
    tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);    
    FILE* fin = fopen(filename, "rb");  
    if (fin == NULL)  
    {  
       std::cout << "Cannot open " << filename << std::endl;  
       return -1;  
    }  
    fclose(fin);  
  
    STRING text;  
    // Run recognition
    if (!tess.ProcessPages(filename, NULL, 0, &text))  
    {  
       std::cout << "Error during processing." << std::endl;  
       return -1;  
    }  
   else  
       std::cout << text.string() << std::endl;  
     
   return 0;  
}

3. Using Tesseract with OpenCV

// tesscv.cpp:
// Using Tesseract API with OpenCV

#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <tesseract/baseapi.h>
#include <iostream>

int main(int argc, char** argv)
{
    // Usage: tesscv image.png
    if (argc != 2)
    {
        std::cout << "Please specify the input image!" << std::endl;
        return -1;
    }

    // Load image
    cv::Mat im = cv::imread(argv[1]);
    if (im.empty())
    {
        std::cout << "Cannot open source image!" << std::endl;
        return -1;
    }

    cv::Mat gray;
    cv::cvtColor(im, gray, CV_BGR2GRAY);
    // ...other image pre-processing here...

    // Pass it to Tesseract API
    tesseract::TessBaseAPI tess;
    tess.Init(NULL, "eng", tesseract::OEM_DEFAULT);
    tess.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);
    tess.SetImage((uchar*)gray.data, gray.cols, gray.rows, 1, gray.cols);

    // Get the text
    char* out = tess.GetUTF8Text();
    std::cout << out << std::endl;
    delete[] out;  // the caller owns the buffer returned by GetUTF8Text

    return 0;
}
