Python -- Text Recognition -- Tesseract

阿新 • Published: 2019-02-08

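1. Introduction

Tesseract is an OCR library currently sponsored by Google (a company well known for its OCR and machine learning work). It is widely regarded as one of the best and most accurate open-source OCR systems. Besides its accuracy, Tesseract is also very flexible: it can be trained to recognize any font and any Unicode character.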

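2. Installation

Everything below was installed and run on macOS with Homebrew. The --with-training-tools and --all-languages options depend on the Homebrew version: they worked when this was written, but newer Homebrew releases have dropped formula options, so if they are rejected, fall back to the plain install.

# Install tesseract together with the training tools
brew install --with-training-tools tesseract

# Install tesseract together with all languages. The language data is large and slow to install, so it is better to skip this and add languages as needed
brew install --all-languages tesseract

# Install tesseract with both the training tools and all languages
brew install --all-languages --with-training-tools tesseract

# Install tesseract only, without the training tools
brew install tesseract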
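
After installing, it can be useful to confirm from Python that the tesseract executable is reachable. The following is a minimal sketch using only the standard library. The helper names find_tesseract and tesseract_version are made up for this example, and it assumes that `tesseract --version` prints a version banner on its first line.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


def tesseract_version(path):
    """Return the first line printed by `<path> --version` (hypothetical helper)."""
    # Some releases print the banner to stderr, so merge both streams.
    out = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    ).stdout
    return out.splitlines()[0] if out else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract not found on PATH; check the brew install step")
    else:
        print(exe, "->", tesseract_version(exe))
```

If the script reports that tesseract was not found, the brew install step most likely did not finish, or the Homebrew bin directory is not on PATH.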

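3. Downloading language data

Download: https://github.com/tesseract-ocr/tessdata
Only English is included by default.
Pick the language data you need. Here we want Simplified Chinese, so the file is chi_sim.traineddata.
Copy the file into the /usr/local/Cellar/tesseract/(your installed version)/share/tessdata directory.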
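
The copy step can also be scripted. This is only a sketch: install_traineddata is a hypothetical helper, and the tessdata directory is passed in as a parameter because it contains the installed version number, which differs from machine to machine.

```python
import shutil
from pathlib import Path


def install_traineddata(src, tessdata_dir):
    """Copy a .traineddata file into tessdata_dir and return the new path."""
    src = Path(src)
    tessdata_dir = Path(tessdata_dir)
    if src.suffix != ".traineddata":
        raise ValueError("expected a .traineddata file, got %s" % src.name)
    if not tessdata_dir.is_dir():
        # Fail early instead of silently creating a directory tesseract never reads.
        raise FileNotFoundError("tessdata directory not found: %s" % tessdata_dir)
    dest = tessdata_dir / src.name
    shutil.copy2(str(src), str(dest))
    return dest
```

Call it with the downloaded chi_sim.traineddata and the share/tessdata directory of your own installation. Afterwards, `tesseract --list-langs` should list chi_sim.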

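4. Using Tesseract from the command line

Basic usage:

# Uses the eng language data by default. imgName is the path to the image; result is the output base name (the recognized text is written to result.txt)
tesseract imgName result

Specifying a language:

# Use Simplified Chinese
tesseract -l chi_sim imgName result

# List the language data installed locally
tesseract --list-langs

Multiple languages:

# Specify several languages, joined with +
tesseract -l chi_sim+eng imgName result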
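
The same commands can be driven from a script. Below is a rough sketch that builds the argument list shown above and reads back the text file tesseract writes. build_command and run_ocr are names invented for this example, and it assumes the tesseract executable is on PATH and that the output is UTF-8.

```python
import subprocess


def build_command(image_path, output_base, langs=None):
    """Build the argument list for the tesseract CLI."""
    cmd = ["tesseract"]
    if langs:
        # Multiple languages are joined with "+", e.g. chi_sim+eng
        cmd += ["-l", "+".join(langs)]
    cmd += [image_path, output_base]
    return cmd


def run_ocr(image_path, output_base, langs=None):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base, langs), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

For example, run_ocr("test.jpg", "result", ["chi_sim", "eng"]) corresponds to the multi-language command above.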

5. Using Tesseract in Python

Install pytesseract, a Python wrapper for Tesseract, with pip:

pip install pytesseract

A simple example in Python:

import pytesseract
from PIL import Image

# Open the image and run OCR on it (uses the eng language data by default)
image = Image.open('/Users/admin/Desktop/test.jpg')
text = pytesseract.image_to_string(image)
print(text)
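
To use the Simplified Chinese data from Python, pytesseract.image_to_string accepts a lang argument that takes the same value as -l on the command line. This is a small sketch, assuming chi_sim.traineddata has been installed. lang_spec and ocr_image are helper names made up for this example.

```python
def lang_spec(langs):
    """Join language codes the way tesseract expects, e.g. 'chi_sim+eng'."""
    return "+".join(langs)


def ocr_image(path, langs=("eng",)):
    """Return the text recognized in the image at `path`."""
    # Imported here so the helpers above can be used without the OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang_spec(langs))
```

For example, ocr_image('/Users/admin/Desktop/test.jpg', langs=("chi_sim", "eng")) would run recognition with both Simplified Chinese and English.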

A common use is recognizing CAPTCHAs in web crawlers. You can swap in language data that someone else has trained, or train your own.
