18. Python Web Scraping: Pytesseract
阿新 • Published: 2021-01-27
18.1 Tesseract
- Definition:
  - Tesseract is an OCR (Optical Character Recognition) library that converts images into text
- Installation:
  - sudo apt-get install tesseract-ocr
- Calling Tesseract from Python:
  - pip install pytesseract
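Since pytesseract is only a wrapper around the tesseract executable, it can be useful to confirm that the executable is reachable before going further. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are made up for illustration and are not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
    )
    # Some builds print the version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install tesseract-ocr first")
    else:
        print(path, "->", tesseract_version(path))
```

If `find_tesseract()` returns None, the Python package may still import without error, but OCR calls will fail until the executable is installed or its location is configured explicitly.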
18.1.1 Using Tesseract on well-formatted text
- In the terminal:
  - tesseract test.jpg text
  - The recognized text is written to text.txt
- In Python code:
  - import pytesseract
  - from PIL import Image
  - image = Image.open("test.jpg")
  - print(pytesseract.image_to_string(image))
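The two approaches above can be sketched side by side as follows. This is only an illustration under stated assumptions: the function names are hypothetical, `test.jpg` is the sample file name from the list above, and both OCR functions need a working tesseract installation (the second also needs pytesseract and Pillow).

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Build the argument list for `tesseract <image> <output_base>`."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_via_cli(image_path, output_base="text"):
    """Run the tesseract CLI and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # tesseract appends ".txt" to the output base name.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


def ocr_via_pytesseract(image_path):
    """Return the recognized text using pytesseract and Pillow."""
    # Imported lazily so this module still loads where pytesseract is absent.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)
```

Calling `ocr_via_cli("test.jpg")` mirrors the terminal command, while `ocr_via_pytesseract("test.jpg")` mirrors the Python snippet and returns the text directly instead of going through a file.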
Prerequisite: tesseract-ocr must already be installed on Windows.
18.2 Experiment
In:
pip install pytesseract
out:
Requirement already satisfied: pytesseract in d:\programdata\anaconda3\lib\site-packages (0.3.4)
Requirement already satisfied: Pillow in d:\programdata\anaconda3\lib\site-packages (from pytesseract) (6.2.0)
Note: you may need to restart the kernel to use updated packages.
In:
from PIL import Image
import pytesseract
In:
pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\Tesseract-OCR\tesseract.exe'
In:
img = Image.open("temp01.png")
In:
pytesseract.image_to_string(img)
out:
'Bute: windows22#¢Ftesseract-ocr\n\nIn [1]: pip install pytesseract'
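The first part of the output above is not readable text. One plausible explanation, which the source does not confirm, is that the screenshot contains Chinese characters that the default English model cannot recognize. The sketch below wraps the experiment in a few helpers and exposes the `lang` parameter of `image_to_string`; the helper names are hypothetical, and any language other than `eng` (for example `chi_sim` or `chi_tra`) only works if the matching Tesseract language data is installed.

```python
from pathlib import Path


def configure_tesseract(tesseract_cmd):
    """Point pytesseract at an explicit executable, if that file exists."""
    if not Path(tesseract_cmd).is_file():
        return False
    import pytesseract  # imported lazily; needs `pip install pytesseract`

    pytesseract.pytesseract.tesseract_cmd = str(tesseract_cmd)
    return True


def clean_ocr_text(raw):
    """Split raw OCR output into stripped, non-empty lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


def ocr_lines(image_path, lang="eng"):
    """OCR an image file and return its text as a list of lines."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        raw = pytesseract.image_to_string(image, lang=lang)
    return clean_ocr_text(raw)
```

With the setup from this experiment, a call might look like `configure_tesseract(r'D:\Program Files\Tesseract-OCR\tesseract.exe')` followed by `ocr_lines("temp01.png", lang="chi_sim")`. Whether that improves the result depends on the image and on the installed language data.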