UnboundLocalError: local variable 'pipe' referenced before assignment when parsing PDFs with textract in Python

阿新 • Published: 2019-01-15

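For work I need to parse all kinds of files with Python, and my esteemed manager, AKA Byrd, recommended textract to me.

"Textract is the most ridiculous library that I've ever used before." In fairness it is quite powerful; it just isn't very friendly to PDFs.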

-----------------------------------------------------------------------------------------------------------------

Pitfall 1:

After installing the library with pip install textract, I ran:

import textract
textract.process('a.pdf', method='pdfminer')

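It took some Googling to learn that installing textract does not automatically install pdfminer; you have to install pdfminer yourself. Its installation notes say: "Install Python 2.6 or newer. (For Python 3 support have a look at pdfminer.six)."
Since I am on Python 3.x, I have to use pdfminer.six.

So: pip install pdfminer.six

That takes care of the first pitfall.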
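
A quick way to confirm the dependency is really present before calling textract is to ask the import system. This is only a minimal sketch: it assumes the package installed by pdfminer.six is importable under the name pdfminer, and the helper name is_importable is made up for illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the import system can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # 'pdfminer' is the import name assumed for the pdfminer.six package.
    if not is_importable("pdfminer"):
        print("pdfminer is missing; try: pip install pdfminer.six")
```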

Pitfall 2:

Running the code again, this time I got the following error:

UnboundLocalError: local variable 'pipe' referenced before assignment

Looking at the source in utils.py:

def run(self, args):
    """Run ``command`` and return the subsequent ``stdout`` and ``stderr``
    as a tuple. If the command is not successful, this raises a
    :exc:`textract.exceptions.ShellError`.
    """

    # run a subprocess and put the stdout and stderr on the pipe object
    try:
        pipe = subprocess.Popen(
            args,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
    except OSError as e:
        if e.errno == errno.ENOENT:
            # File not found.
            # This is equivalent to getting exitcode 127 from sh
            raise exceptions.ShellError(
                ' '.join(args), 127, '', '',
            )

    # pipe.wait() ends up hanging on large files. using
    # pipe.communicate appears to avoid this issue
    stdout, stderr = pipe.communicate()
 

    # if pipe is busted, raise an error (unlike Fabric)
    if pipe.returncode != 0:
        raise exceptions.ShellError(
            ' '.join(args), pipe.returncode, stdout, stderr,
        )

    return stdout, stderr

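The error is raised at the pipe.communicate() line. pipe is only assigned when subprocess.Popen succeeds; if Popen raises an OSError whose errno is not ENOENT, the except block swallows it and execution continues with pipe never assigned. My reaction was "WaduHek ?!" I only wrote one line of code, so what is this error about?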
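
The same failure can be reproduced without textract at all. The sketch below is not textract's code: run_like_textract and fail_with are hypothetical names, the launch callable stands in for subprocess.Popen, and RuntimeError stands in for ShellError. It only imitates the shape of the try/except in run.

```python
import errno


def fail_with(code):
    """Build a stand-in for Popen that always fails with the given errno."""
    def launch():
        raise OSError(code, "simulated launch failure")
    return launch


def run_like_textract(launch):
    """Imitate run(): only ENOENT is re-raised, other OSErrors are swallowed."""
    try:
        pipe = launch()
    except OSError as e:
        if e.errno == errno.ENOENT:
            raise RuntimeError("command not found (exit code 127)")
        # Any other errno falls through silently, leaving `pipe` unbound.
    return pipe  # raises UnboundLocalError when launch() failed above


if __name__ == "__main__":
    try:
        run_like_textract(fail_with(errno.ENOEXEC))
    except UnboundLocalError as exc:
        print("reproduced:", type(exc).__name__)
```

Adding an else branch that re-raises the original OSError inside the except block would surface the real cause instead of the confusing UnboundLocalError.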

In the source file pdf_parser.py:

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['pdf2txt.py', filename])
    return stdout

The pdf2txt.py script here cannot be found.
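
To check whether a command name such as pdf2txt.py resolves on the current PATH, the standard library's shutil.which can serve as a rough diagnostic. A small sketch; describe_command is a hypothetical helper, and on Windows the result may not match exactly what subprocess.Popen does, so treat it as a hint only.

```python
import shutil


def describe_command(name):
    """Return a short message saying where `name` resolves on PATH, if at all."""
    location = shutil.which(name)
    if location is None:
        return "{} was not found on PATH".format(name)
    return "{} resolves to {}".format(name, location)


if __name__ == "__main__":
    print(describe_command("pdf2txt.py"))
```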

Here are two ways to fix it:

Method 1:

Modify the source so that it reads:

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python','path/to/pdf2txt.py', filename])
    return stdout

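The second element of the list passed to run, 'path/to/pdf2txt.py', must be changed to the absolute path of pdf2txt.py on your system (I have not tried a relative path, so I don't know whether that works).

For example, I develop inside a virtualenv, so my path is c:\Users\....\venv\Scripts\pdf2txt.py

With that change the code runs.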
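
Instead of typing the absolute path by hand, the scripts directory of the running interpreter can usually be looked up with sysconfig. This is a sketch under the assumption that pip placed pdf2txt.py in that directory, which matches the virtualenv case just described; build_pdf2txt_command is a hypothetical helper and not part of textract.

```python
import os
import sys
import sysconfig


def build_pdf2txt_command(filename, script_name="pdf2txt.py"):
    """Build an argument list that runs the script with the current interpreter."""
    # The directory where pip installs scripts, e.g. venv\Scripts or venv/bin.
    scripts_dir = sysconfig.get_path("scripts")
    script_path = os.path.join(scripts_dir, script_name)
    return [sys.executable, script_path, filename]


if __name__ == "__main__":
    print(build_pdf2txt_command("a.pdf"))
```

Using sys.executable rather than the bare string 'python' keeps the subprocess on the same interpreter, and therefore the same virtualenv, as the calling program.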

(Afterword: while using this method, I tried changing my code to this:

import textract
textract.process('a.pdf')

that is, with method='pdfminer' removed.

According to the source in pdf_parser.py:

def extract(self, filename, method='', **kwargs):
    if method == '' or method == 'pdftotext':
        try:
            return self.extract_pdftotext(filename, **kwargs)
        except ShellError as ex:
            # If pdftotext isn't installed and the pdftotext method
            # wasn't specified, then gracefully fallback to using
            # pdfminer instead.
            if method == '' and ex.is_not_installed():
                return self.extract_pdfminer(filename, **kwargs)
            else:
                raise ex

    elif method == 'pdfminer':
        return self.extract_pdfminer(filename, **kwargs)
    elif method == 'tesseract':
        return self.extract_tesseract(filename, **kwargs)
    else:
        raise UnknownMethod(method)

def extract_pdftotext(self, filename, **kwargs):
    """Extract text from pdfs using the pdftotext command line utility."""
    if 'layout' in kwargs:
        args = ['pdftotext', '-layout', filename, '-']
    else:
        args = ['pdftotext', filename, '-']
    stdout, _ = self.run(args)
    return stdout

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python','path/to/pdf2txt.py', filename])
    return stdout

When method == '', it should first enter

try:
     return self.extract_pdftotext(filename, **kwargs)

then notice that I have not installed pdftotext, and go on to

except ShellError as ex:
            # If pdftotext isn't installed and the pdftotext method
            # wasn't specified, then gracefully fallback to using
            # pdfminer instead.
            if method == '' and ex.is_not_installed():
                return self.extract_pdfminer(filename, **kwargs)

and in the end it should still run

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python','path/to/pdf2txt.py', filename])
    return stdout

By that logic, my modified code should still work. Yet it raised another error:

.............................

textract.exceptions.ShellError: The command `pdftotext a.pdf -` failed with exit code 127

------------- stdout -------------

------------- stderr -------------

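I did not dig into this for lack of time; if anyone knows why, please do tell me. Many thanks.

Finally, because this method requires modifying the library source, any textract installed later via pip freeze > requirements.txt and pip install -r requirements.txt would not include the change, so I did not go with this method ...... I wonder whether some readers got this far, found out the method isn't even practical, and closed the page, haha.)

Method 2:

According to Irq3000 in the thread: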

".... so unfortunately I will have to package a copy of pdf2txt.py within my own package, as there is no reliable way to know where the "Scripts" is and add it dynamically to the path that is both crossplatform ...."

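Copy the pdf2txt.py file into your project folder, then use inheritance to add a pdfminer parser class to your program, written by yourself (well, not really by yourself; Irq3000 wrote it):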

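import os
from tempfile import mkstemp

import pdf2txt  # the copy of pdf2txt.py placed in the project folder
from textract.parsers.utils import ShellParser

class MyPdfMinerParser(ShellParser):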
    """Extract text from pdf files using the native python PdfMiner library"""

    def extract(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer and pdf2txt.py wrapper."""
        # Create a temporary output file
        tempfilefh, tempfilepath = mkstemp(suffix='.txt')
        os.close(tempfilefh)  # close to allow writing to tesseract
        # Extract text from pdf using the entry script pdf2txt (part of PdfMiner)
        pdf2txt.main(['', '-o', tempfilepath, filename])
        # Read the results of extraction
        with open(tempfilepath, 'rb') as f:
            res = f.read()
        # Remove temporary output file
        os.remove(tempfilepath)
        return res

pdfminerparser = MyPdfMinerParser()
result = pdfminerparser.process('a.pdf', 'utf8')
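
The temporary-file pattern in this class can be made a little more robust by removing the file even when extraction fails. The following is a sketch of that pattern only; read_via_tempfile and the write_output callable are hypothetical stand-ins (in the class, the writer role is played by pdf2txt.main).

```python
import os
from tempfile import mkstemp


def read_via_tempfile(write_output, suffix=".txt"):
    """Let `write_output(path)` write to a temp file, then return its bytes."""
    handle, path = mkstemp(suffix=suffix)
    os.close(handle)  # close our handle so the writer can open the path itself
    try:
        write_output(path)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)  # runs on success and on failure
```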

Pitfall 3:

Just when I thought I was finally done, I ran the code and got this error:

usage: dpMain.py [-h] [-d] [-p PAGENOS]
[--page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]]
[-m MAXPAGES] [-P PASSWORD] [-o OUTFILE] [-t OUTPUT_TYPE]
[-c CODEC] [-s SCALE] [-A] [-V] [-W WORD_MARGIN]
[-M CHAR_MARGIN] [-L LINE_MARGIN] [-F BOXES_FLOW]
[-Y LAYOUTMODE] [-n] [-R ROTATION] [-O OUTPUT_DIR] [-C] [-S]
files [files ...]

dpMain.py: error: unrecognized arguments: a.pdf

........ fine.

I looked at the source, analyzed it a bit, and found that

 pdf2txt.main(['', '-o', tempfilepath, filename])

the first element of the list in this line is redundant (or perhaps I am just not skilled enough to see what it is for). Removing it gives:

 pdf2txt.main(['-o', tempfilepath, filename])
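
Why the extra '' matters can be seen with a small stand-in parser. This is not pdf2txt.py's real parser, only a two-argument imitation built with argparse (the usage message shown earlier suggests pdf2txt.py uses argparse too). It assumes, as the error indicates, that the list is parsed as given, with no program name stripped from the front.

```python
import argparse


def make_parser():
    """A tiny imitation with one option and one positional, like -o and files."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("-o", "--outfile")
    parser.add_argument("files", nargs="+")
    return parser


def parse(argv):
    """Return (namespace, leftover arguments) instead of exiting on extras."""
    return make_parser().parse_known_args(argv)


# The empty string is taken as the first (and only) entry of `files`,
# so the real filename arrives too late and is left over as unrecognized.
with_blank, extra_with_blank = parse(["", "-o", "out.txt", "a.pdf"])

# Without the empty string, a.pdf is the positional and nothing is left over.
without_blank, extra_without_blank = parse(["-o", "out.txt", "a.pdf"])
```

A leading placeholder like this usually plays the role of sys.argv[0], the program name, which only makes sense when the called main function skips the first element itself.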

All done!
