
18. Python Web Scraping: Pytesseract

Tags: Python web scraping

18.1 Tesseract

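  • Definition:
    • Tesseract is an OCR (Optical Character Recognition) library that converts the text in an image into machine-readable text.
  • Installation (Debian/Ubuntu):
    • sudo apt-get install tesseract-ocr
  • Calling Tesseract from Python:
    • pip install pytesseract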

18.1.1 Processing clean, well-formatted text with Tesseract

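  • In the terminal (the recognized text is written to text.txt):
    • tesseract test.jpg text
  • In Python code:
    • import pytesseract
    • from PIL import Image
    • image = Image.open("test.jpg")
    • pytesseract.image_to_string(image)
      Prerequisite: on Windows, tesseract-ocr must already be installed.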
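The terminal command above can also be driven from Python without pytesseract. The following is only a minimal sketch, assuming the tesseract executable is on PATH; the helper names build_tesseract_command and run_tesseract are hypothetical and exist only for illustration. It passes "stdout" as the output base so the text is printed instead of being written to a .txt file.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang=None):
    """Build the argument list for the tesseract command-line tool.

    Hypothetical helper: output_base="text" writes text.txt,
    output_base="stdout" prints the recognized text instead.
    """
    command = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language model(s), e.g. "eng"
        command += ["-l", lang]
    return command


def run_tesseract(image_path, lang=None):
    """Run tesseract on an image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, "stdout", lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # Same command as in the terminal example: tesseract test.jpg text
    print(build_tesseract_command("test.jpg", "text"))
```

Calling run_tesseract("test.jpg") should give roughly the same text as pytesseract.image_to_string, since pytesseract itself wraps the same executable.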

18.2 Experiment

In:

pip install pytesseract

Out:

Requirement already satisfied: pytesseract in d:\programdata\anaconda3\lib\site-packages (0.3.4)
Requirement already satisfied: Pillow in d:\programdata\anaconda3\lib\site-packages (from pytesseract) (6.2.0)
Note: you may need to restart the kernel to use updated packages.

In:

from PIL import Image
import pytesseract

In:

pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\Tesseract-OCR\tesseract.exe'
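The hard-coded path above only works on a machine with that exact install location. A possible way to make the cell more portable is sketched below; find_tesseract is a hypothetical helper, not part of pytesseract, and the fallback path is simply the one used in this experiment.

```python
import os
import shutil


def find_tesseract(extra_candidates=()):
    """Return the path of a tesseract executable, or None if none is found.

    Hypothetical helper: checks PATH first, then any explicit fallback paths.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    for candidate in extra_candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Fallback taken from the experiment above; adjust it for your machine.
    path = find_tesseract([r"D:\Program Files\Tesseract-OCR\tesseract.exe"])
    print(path)
```

If the result is not None, it can be assigned to pytesseract.pytesseract.tesseract_cmd exactly as in the cell above.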

In:

img = Image.open("temp01.png")

In:

pytesseract.image_to_string(img)

Out:

'Bute: windows22#¢Ftesseract-ocr\n\nIn [1]: pip install pytesseract'
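The English part of the output is recognized correctly, while the beginning is garbled. A likely cause (an assumption, since the image itself is not shown) is that the screenshot contains Chinese text and pytesseract used its default English model. The sketch below passes a language list through the lang parameter of image_to_string; it assumes the chi_sim language data is installed (use chi_tra for Traditional Chinese), and build_lang_spec and ocr_image are hypothetical helper names.

```python
def build_lang_spec(languages):
    """Join Tesseract language codes with '+', e.g. ['chi_sim', 'eng'] -> 'chi_sim+eng'."""
    return "+".join(languages)


def ocr_image(image_path, languages=("chi_sim", "eng")):
    """Run OCR on an image with the given language models (hypothetical helper)."""
    # Imported here so build_lang_spec stays usable without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # lang accepts several models joined by '+'; each must be installed.
        return pytesseract.image_to_string(img, lang=build_lang_spec(languages))


if __name__ == "__main__":
    print(build_lang_spec(["chi_sim", "eng"]))
```

Running ocr_image("temp01.png") with the Chinese model installed may recover the first line; if the language data is missing, Tesseract reports an error instead of returning text.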