用PDFMiner從PDF中提取文本文字
阿新 • • 發佈:2017-05-08
dfp port 下載 span setup 技術分享 code with converter
1、下載並安裝PDFMiner
從https://pypi.python.org/pypi/pdfminer/下載PDFMineer
wget https://pypi.python.org/packages/57/4f/e1df0437858188d2d36466a7bb89aa024d252bd0b7e3ba90cbc567c6c0b8/pdfminer-20140328.tar.gz#md5=dfe3eb1b7b7017ab514aad6751a7c2ea
加壓並安裝
tar -zxvf pdfminer-20140328.tar.gz cd pdfminer-20140328/ make cmap #防止中文亂碼,否則處理中文會出現一大堆(CID:xxx)sudo python setup.py install
2、提取文本文字
from cStringIO import StringIO from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter from pdfminer.converter import TextConverter from pdfminer.layout import LAParams from pdfminer.pdfpage import PDFPage import sys import string def convert_pdf_2_text(path): rsrcmgr= PDFResourceManager() retstr = StringIO() device = TextConverter(rsrcmgr, retstr, codec=‘utf-8‘, laparams=LAParams()) interpreter = PDFPageInterpreter(rsrcmgr, device) with open(path, ‘rb‘) as fp: for page in PDFPage.get_pages(fp, set()): interpreter.process_page(page) text= retstr.getvalue() device.close() retstr.close() return text text = convert_pdf_2_text(sys.argv[1]) open(‘real?.txt‘,‘wb‘).write(text)
3、測試結果
【1】http://www.unixuser.org/~euske/python/pdfminer/#source
【2】https://www.zhihu.com/question/31586273
用PDFMiner從PDF中提取文本文字