For work I needed to parse all kinds of documents in Python, and my esteemed manager, AKA Byrd, recommended textract to me.

"Textract is the most ridiculous library that I've ever used before." In fairness, it is quite powerful; it is just not very friendly to PDFs.



After installing the library with pip install textract, I ran:

import textract
textract.process('a.pdf', method='pdfminer')

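It took some Googling to learn that installing textract does not automatically install pdfminer; you have to install pdfminer yourself. The pdfminer documentation says: "Install Python 2.6 or newer. (For Python 3 support have a look at pdfminer.six)."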

So: pip install pdfminer.six




Running the code again produced a different error:

UnboundLocalError: local variable 'pipe' referenced before assignment


def run(self, args):
    """Run ``command`` and return the subsequent ``stdout`` and ``stderr``
    as a tuple. If the command is not successful, this raises a
    :exc:`textract.exceptions.ShellError`.
    """

    # run a subprocess and put the stdout and stderr on the pipe object
    try:
        pipe = subprocess.Popen(
            args,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
    except OSError as e:
        if e.errno == errno.ENOENT:
            # File not found.
            # This is equivalent to getting exitcode 127 from sh
            raise exceptions.ShellError(
                ' '.join(args), 127, '', '',
            )

    # pipe.wait() ends up hanging on large files. using
    # pipe.communicate appears to avoid this issue
    stdout, stderr = pipe.communicate()

    # if pipe is busted, raise an error (unlike Fabric)
    if pipe.returncode != 0:
        raise exceptions.ShellError(
            ' '.join(args), pipe.returncode, stdout, stderr,
        )

    return stdout, stderr

The error came from the pipe.communicate() line. My reaction: "WaduHek ?!" I had written exactly one line of code, so what was this error about?
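A minimal sketch of how a one-line call can end in this error, assuming the launch step raises an OSError whose errno is something other than ENOENT. The names run_sketch, launch and failing_launch are hypothetical stand-ins made up for illustration, not part of textract; only the try/except shape is borrowed from the run method quoted above.

```python
import errno


def run_sketch(launch):
    """Mimic the try/except shape of run(); `launch` stands in for Popen."""
    try:
        pipe = launch()
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Only "file not found" is turned into a clear error.
            raise RuntimeError("command not found")
        # Any other OSError is silently swallowed here ...
    # ... so `pipe` was never assigned when execution reaches this line.
    return pipe


def failing_launch():
    # Hypothetical failure: the file exists but cannot be executed.
    raise OSError(errno.ENOEXEC, "Exec format error")


try:
    run_sketch(failing_launch)
except UnboundLocalError as exc:
    print("reproduced:", type(exc).__name__)
```

In this sketch the real cause (the OSError) is lost, and what surfaces is the confusing complaint about the unassigned local variable.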


The call that ends up in run is in extract_pdfminer:

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['pdf2txt.py', filename])
    return stdout

The pdf2txt.py named in this command cannot be found.
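A quick way to check this from Python, sketched with the standard library's shutil.which. The helper name find_command is made up for illustration, and on Windows the result also depends on PATHEXT, so treat a None here as a hint rather than proof.

```python
import shutil


def find_command(name):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = find_command("pdf2txt.py")
if path is None:
    print("pdf2txt.py is not on PATH")
else:
    print("pdf2txt.py found at", path)
```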




Method 1:

Edit the textract source so that extract_pdfminer reads:

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python', 'path/to/pdf2txt.py', filename])
    return stdout

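The second element of the list passed to run, 'path/to/pdf2txt.py', has to be changed to the absolute path of pdf2txt.py on your system (I have not tried a relative path, so I do not know whether that works).

For example, I develop inside a virtualenv, so my path is c:\Users\....\venv\Scripts\pdf2txt.py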
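One possible way to avoid typing that path by hand is to ask the interpreter where its scripts directory is. This is only a sketch: pdf2txt_command is a hypothetical helper, it assumes pdf2txt.py was installed into the active environment's scripts directory, it does not check that the file exists, and I have not verified it on every platform.

```python
import os
import sys
import sysconfig


def pdf2txt_command(filename, script_name="pdf2txt.py"):
    """Build an argument list that runs the script with the current interpreter."""
    # Scripts directory of the running environment (Scripts on Windows, bin elsewhere).
    scripts_dir = sysconfig.get_path("scripts")
    script_path = os.path.join(scripts_dir, script_name)
    # sys.executable is the interpreter running this code, e.g. the virtualenv's python.
    return [sys.executable, script_path, filename]


print(pdf2txt_command("a.pdf"))
```

Using sys.executable instead of the bare string 'python' keeps the subprocess inside the same virtualenv as the caller.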



Calling it without a method argument:

import textract
textract.process('a.pdf')


According to the source, pdf_parser.py:

def extract(self, filename, method='', **kwargs):
    if method == '' or method == 'pdftotext':
        try:
            return self.extract_pdftotext(filename, **kwargs)
        except ShellError as ex:
            # If pdftotext isn't installed and the pdftotext method
            # wasn't specified, then gracefully fallback to using
            # pdfminer instead.
            if method == '' and ex.is_not_installed():
                return self.extract_pdfminer(filename, **kwargs)
            else:
                raise ex

    elif method == 'pdfminer':
        return self.extract_pdfminer(filename, **kwargs)
    elif method == 'tesseract':
        return self.extract_tesseract(filename, **kwargs)
    else:
        raise UnknownMethod(method)

def extract_pdftotext(self, filename, **kwargs):
    """Extract text from pdfs using the pdftotext command line utility."""
    if 'layout' in kwargs:
        args = ['pdftotext', '-layout', filename, '-']
    else:
        args = ['pdftotext', filename, '-']
    stdout, _ = self.run(args)
    return stdout

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python', 'path/to/pdf2txt.py', filename])
    return stdout

When method == '', execution should first enter

    return self.extract_pdftotext(filename, **kwargs)

then, because pdftotext is not installed, land in

except ShellError as ex:
    # If pdftotext isn't installed and the pdftotext method
    # wasn't specified, then gracefully fallback to using
    # pdfminer instead.
    if method == '' and ex.is_not_installed():
        return self.extract_pdfminer(filename, **kwargs)

and finally call

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python', 'path/to/pdf2txt.py', filename])
    return stdout

What I actually got was:




textract.exceptions.ShellError: The command `pdftotext a.pdf -` failed with exit code 127

------------- stdout -------------

------------- stderr -------------
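Judging by the pdf_parser.py source quoted earlier, the fallback to pdfminer only runs when ex.is_not_installed() returns true. Since the ShellError reached me anyway, that check presumably returned false on my machine; I have not dug into why. A minimal sketch of that branch, using made-up stand-ins (FakeShellError, extract_sketch, broken) rather than textract's own classes:

```python
class FakeShellError(Exception):
    """Stand-in for a shell error; `not_installed` is a made-up flag."""

    def __init__(self, not_installed):
        super().__init__("command failed")
        self.not_installed = not_installed

    def is_not_installed(self):
        return self.not_installed


def extract_sketch(primary, fallback, method=""):
    """Try `primary`; fall back only when the error says 'not installed'."""
    try:
        return primary()
    except FakeShellError as ex:
        if method == "" and ex.is_not_installed():
            return fallback()
        raise  # otherwise the original error propagates to the caller


def broken(not_installed):
    """Return a callable that always fails with the given flag."""
    def _call():
        raise FakeShellError(not_installed)
    return _call


print(extract_sketch(broken(True), lambda: "fallback text"))
try:
    extract_sketch(broken(False), lambda: "fallback text")
except FakeShellError:
    print("error propagated, no fallback")
```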


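In the end, because this method means editing the library's source, any textract I later install via pip freeze > requirements.txt followed by pip install -r requirements.txt would be broken again, so I did not go with it ...... I wonder whether some readers who made it this far, only to learn the method is no good after all, will just close the page, haha.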

Method 2:


".... so unfortunately I will have to package a copy of pdf2txt.py within my own package, as there is no reliable way to know where the "Scripts" is and add it dynamically to the path that is both crossplatform ...."

Copy the pdf2txt.py file into your project folder, then, through inheritance, add your own (well, not really 'your own', it is Irq3000's) pdfminer class to your program:

import os
from tempfile import mkstemp

import pdf2txt  # the copy of pdf2txt.py placed in the project folder
from textract.parsers.utils import ShellParser

class MyPdfMinerParser(ShellParser):
    """Extract text from pdf files using the native python PdfMiner library"""

    def extract(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer and pdf2txt.py wrapper."""
        # Create a temporary output file
        tempfilefh, tempfilepath = mkstemp(suffix='.txt')
        os.close(tempfilefh)  # close the handle so pdf2txt can write to the path
        # Extract text from pdf using the entry script pdf2txt (part of PdfMiner)
        pdf2txt.main(['', '-o', tempfilepath, filename])
        # Read the results of extraction
        with open(tempfilepath, 'rb') as f:
            res = f.read()
        # Remove temporary output file
        os.remove(tempfilepath)
        return res

pdfminerparser = MyPdfMinerParser()
result = pdfminerparser.process('a.pdf', 'utf8')
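The temporary-file handling in that class can be looked at on its own. Below is a sketch of the same create, write, read, remove pattern with a stand-in writer in place of pdf2txt; read_via_tempfile and fake_writer are hypothetical names, and the try/finally is my addition so the file is removed even when the writer fails.

```python
import os
from tempfile import mkstemp


def read_via_tempfile(write_output):
    """Let `write_output(path)` fill a temp file, then return its bytes."""
    fd, path = mkstemp(suffix=".txt")
    os.close(fd)  # close our handle so the writer can open the path itself
    try:
        write_output(path)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)  # clean up even if the writer raises


def fake_writer(path):
    # Hypothetical stand-in for the pdf2txt call that writes to `path`.
    with open(path, "w", encoding="utf8") as f:
        f.write("extracted text")


print(read_via_tempfile(fake_writer))
```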


Just when I thought I was finally done, I ran the code and got this error:

usage: dpMain.py [-h] [-d] [-p PAGENOS]
                 [--page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]]
                 [-m MAXPAGES] [-P PASSWORD] [-o OUTFILE] [-t OUTPUT_TYPE]
                 [-c CODEC] [-s SCALE] [-A] [-V] [-W WORD_MARGIN]
                 [-M CHAR_MARGIN] [-L LINE_MARGIN] [-F BOXES_FLOW]
                 [-Y LAYOUTMODE] [-n] [-R ROTATION] [-O OUTPUT_DIR] [-C] [-S]
                 files [files ...]

dpMain.py: error: unrecognized arguments: a.pdf

........  Fine then.


The fix is to drop the empty first element, changing

 pdf2txt.main(['', '-o', tempfilepath, filename])

to

 pdf2txt.main(['-o', tempfilepath, filename])
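My reading of the usage message is that main hands the list straight to an argparse parser, so a leading '' is not skipped as a program name but parsed as the first file. That is an assumption about pdf2txt's internals; the sketch below only demonstrates the argparse behaviour with a stand-in parser that has the same -o option and files positional as the usage text.

```python
import argparse

# Stand-in parser with just the two pieces that matter here; this is not
# pdf2txt's real parser.
parser = argparse.ArgumentParser(prog="sketch")
parser.add_argument("-o", "--outfile")
parser.add_argument("files", nargs="+")

# With the leading '' the empty string is consumed as the file list,
# and the real filename is left over as an unrecognized argument.
ns, extra = parser.parse_known_args(["", "-o", "out.txt", "a.pdf"])
print(ns.files, extra)

# Without it, the filename lands in `files` and nothing is left over.
ns2, extra2 = parser.parse_known_args(["-o", "out.txt", "a.pdf"])
print(ns2.files, extra2)
```

parse_known_args is used here only so the leftover can be printed; parse_args would instead exit with the "unrecognized arguments" error.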

Done!


