CAPTCHA recognition with Python + pillow + pytesseract + Tesseract-OCR [Repost]
Install pillow and pytesseract. Once those modules are installed, you also need to install tesseract-ocr.
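(PS: if pip is installed, you can open a command prompt (cmd) in Python's Scripts folder and run pip install pillow to install the latest version of pillow. A specific version can be requested with pip install pillow==<version>, or downloaded and installed manually. Other third-party libraries can be installed the same way.)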
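A quick way to confirm that the installs above worked is to ask Python whether the modules can be found. The following is only a minimal sketch: the helper name missing_modules is made up for illustration, and it checks import names (pillow is imported as PIL), not package versions.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "PIL" is the import name of the pillow package.
    missing = missing_modules(["PIL", "pytesseract", "requests"])
    print("missing:", missing if missing else "none")
```

Any name reported as missing can then be installed with pip as described above.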
tesseract-ocr download page: https://digi.bib.uni-mannheim.de/tesseract/
For this test I downloaded tesseract-ocr-setup-4.00.00dev.exe, and I ran into several problems along the way:
FileNotFoundError: [WinError 2] The system cannot find the file specified.
pytesseract.pytesseract.TesseractError: (2, 'Usage: python pytesseract.py [-l lang] input_file')
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \Program Files (x86)\Tesseract-OCR\eng.traineddata')
These problems mainly come down to installing and configuring Tesseract-OCR:
- Download and install tesseract-ocr.
- Add an environment variable: TESSDATA_PREFIX = C:\Program Files (x86)\Tesseract-OCR (that is, create a new environment variable named TESSDATA_PREFIX whose value is the Tesseract-OCR installation path).
- Edit the file D:\Python35\Lib\site-packages\pytesseract\pytesseract.py and change
  tesseract_cmd = 'tesseract'
  to:
  tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
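The same configuration can also be applied from the calling script instead of editing the installed pytesseract.py, since pytesseract reads the module attribute pytesseract.pytesseract.tesseract_cmd when it runs. The sketch below is one possible way to do that, not the method used in this post: the helper names find_tesseract and configure are hypothetical, INSTALL_DIR is assumed to be the install path mentioned above, and whether TESSDATA_PREFIX should point at the install directory or at its tessdata subfolder may depend on the Tesseract version.

```python
import os
import shutil

# Assumed install location, copied from the path used in this post.
INSTALL_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def find_tesseract(install_dir=INSTALL_DIR):
    """Return a path to the tesseract executable, or None if it is not found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Fall back to the executable inside the assumed install directory.
    candidate = os.path.join(install_dir, "tesseract.exe")
    return candidate if os.path.isfile(candidate) else None


def configure(install_dir=INSTALL_DIR):
    """Set TESSDATA_PREFIX (if unset) and point pytesseract at the executable."""
    os.environ.setdefault("TESSDATA_PREFIX", install_dir)
    cmd = find_tesseract(install_dir)
    if cmd is None:
        return None
    try:
        # Imported lazily so the path check works before pytesseract is installed.
        import pytesseract
    except ImportError:
        return cmd
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling configure() once at the top of a script, before any pytesseract.image_to_string call, would then take the place of the manual edit. A return value of None means the executable was not found, which is the situation that typically produces the FileNotFoundError shown earlier.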
# coding=utf-8
import requests
import pytesseract
from PIL import Image
from io import BytesIO
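# Optional: download the CAPTCHA image instead of reading a local file
# captcha_url = 'https://www.'
# captcha_content = requests.get(url=captcha_url)
# captcha_content = captcha_content.content
# # Read the image from the downloaded bytes
# image = Image.open(BytesIO(captcha_content))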
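# Load a local CAPTCHA image
img_path = r'1351_5243.png'
image = Image.open(img_path)
# Convert to grayscale
imgry = image.convert('L')
# Lookup table for binarization: pixel values below 140 become 0, the rest 1
table = [0 if i < 140 else 1 for i in range(256)]
# Binarize so the characters stand out more clearly
out = imgry.point(table, '1')
# out.show()
# Run OCR, then trim whitespace and upper-case the result
captcha = pytesseract.image_to_string(out)
captcha = captcha.strip()
captcha = captcha.upper()
print(captcha)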
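The preprocessing and cleanup steps in the script above can be pulled into small reusable helpers, which makes it easier to try different thresholds. This is a rough sketch only: the function names are made up for illustration, binarize expects an already opened PIL image, and clean_captcha assumes the CAPTCHA contains only ASCII letters and digits, which goes a little further than the strip() and upper() calls in the original.

```python
import re


def build_threshold_table(threshold=140):
    """Build a 256-entry lookup table: values below `threshold` map to 0, others to 1."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(image, threshold=140):
    """Convert a PIL image to grayscale, then to a 1-bit image using the table."""
    return image.convert("L").point(build_threshold_table(threshold), "1")


def clean_captcha(text):
    """Drop everything except ASCII letters and digits, then upper-case the rest.

    Assumption: the CAPTCHA uses only letters and digits.
    """
    return re.sub(r"[^0-9A-Za-z]", "", text).upper()
```

With these in place, the recognition step could be written as clean_captcha(pytesseract.image_to_string(binarize(image))). If the result is poor, adjusting the threshold for the particular CAPTCHA style is usually the first thing to try.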