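Replacing PDF Content with Python

阿新 • Published: 2019-01-21

Abstract: This article shows how to use Python 3 to convert a PDF file into a Word document and then replace text in that Word document. The replacement rules here map English to Chinese, but any custom rule set works.

Author: yooongchun
WeChat official account: yooongchun小屋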

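Converting PDF to Word

PDF files are hard to edit, so modifying their content programmatically is difficult (not impossible, but costly, especially if the original PDF layout has to be preserved). This article therefore describes an indirect way to replace text in a PDF: convert the PDF to Word, then replace the text in the Word document programmatically. Once the replacement is done, converting the Word document back to PDF is much simpler.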

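This article covers four ways to convert a PDF file:

Use a third-party tool
Use pdfminer3k to parse the PDF and write the content into a Word document
Use a third-party SDK
Use LibreOffice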

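A brief comparison of these methods:

With third-party tools, there are two cases:

Online tools: most are free, but conversion is tedious, cannot be scripted, and has to be done by hand.
Desktop software: most support batch conversion, but usually with limits that only a paid license removes, and they cannot be scripted either.

If one of these cases fits your needs, great. If not, read on.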

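Parsing the PDF with pdfminer3k

If the content is plain text and the formatting requirements are loose, this tool can do the conversion. For details, see the article "使用Python抽取PDF檔案內容，包括文字、影象、線條等物件" (Extracting PDF content with Python, including text, images, lines, and other objects).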

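Using a third-party SDK

This is a fairly good option, but most SDKs are paid. They usually come with a complete developer API that is easy to call. One example is easyConvertPDF, which provides a pdf2word.py API. It is very simple to use and the conversion quality is satisfactory, but the license was too expensive for me. A trial version is available for download.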

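Using `LibreOffice`

LibreOffice is a free, open-source office suite (not a Microsoft product) that can convert documents, including PDF to Word, from the command line. I never got this conversion to work in my own tests, though. The snippet below shows the usage I tried; LibreOffice must be installed first.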

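import os
import subprocess

# Walk the folder and try to convert every PDF to Word with LibreOffice Writer.
# Note: the author reports that this command never worked in their tests.
for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)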
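
For comparison, here is a minimal sketch of how the same conversion might be invoked without building a shell string. It assumes LibreOffice's `soffice` binary is on the PATH and uses its `--headless`, `--infilter`, `--convert-to`, and `--outdir` options. The `writer_pdf_import` filter name and the helper names `build_convert_command` and `convert_folder` are assumptions of this sketch, not something the original article tested, and whether the conversion succeeds depends on the installed LibreOffice version.

```python
import os
import subprocess


def build_convert_command(pdf_path, out_dir, binary='soffice'):
    """Return the argument list for converting one PDF to .docx (hypothetical helper)."""
    return [
        binary,
        '--headless',                    # run without a GUI
        '--infilter=writer_pdf_import',  # assumption: open the PDF with the Writer import filter
        '--convert-to', 'docx',
        '--outdir', out_dir,
        pdf_path,
    ]


def convert_folder(folder, out_dir):
    """Walk `folder` and run the conversion for every .pdf file found."""
    for top, _dirs, files in os.walk(folder):
        for filename in files:
            if filename.lower().endswith('.pdf'):
                cmd = build_convert_command(os.path.join(top, filename), out_dir)
                # An argument list avoids shell quoting problems with spaces in paths.
                subprocess.run(cmd, check=False)
```

Passing the arguments as a list rather than a formatted string means file names containing spaces or quotes need no manual escaping.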

Pick whichever of the four methods above suits you.

Replacing text in the Word document

Text replacement uses the win32com package, which requires pywin32 to be installed first:
```
pip install pywin32
```
After installation, you may run into the following error:
```
ImportError: DLL load failed: win32api, sys, os
```
If so, it can be resolved as follows:

Copy pythoncom36.dll and pywintypes36.dll (the 36 suffix corresponds to Python 3.6) from C:\Users\yourname\Anaconda3\Lib\site-packages\pypiwin32_system32 into C:\Users\yourname\Anaconda3\Lib\site-packages\win32.
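
The copy step can also be scripted. The following is only a sketch: the helper name `copy_pywin32_dlls` is made up for illustration, and it matches the DLLs by pattern so that it is not tied to the `36` suffix.

```python
import glob
import os
import shutil


def copy_pywin32_dlls(src_dir, dst_dir):
    """Copy pythoncom*.dll and pywintypes*.dll from src_dir into dst_dir.

    Returns the sorted list of file names that were copied.
    """
    copied = []
    for pattern in ('pythoncom*.dll', 'pywintypes*.dll'):
        for path in glob.glob(os.path.join(src_dir, pattern)):
            shutil.copy2(path, dst_dir)
            copied.append(os.path.basename(path))
    return sorted(copied)


# Example call (paths as in the text above; adjust the user name):
# copy_pywin32_dlls(r'C:\Users\yourname\Anaconda3\Lib\site-packages\pypiwin32_system32',
#                   r'C:\Users\yourname\Anaconda3\Lib\site-packages\win32')
```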

Once the import works, write a test program:

import os
import win32com.client


# Helper class for working with Word documents
class RemoteWord:
    def __init__(self, filename=None):
        self.xlApp = win32com.client.DispatchEx('Word.Application')
        self.xlApp.Visible = 0
        self.xlApp.DisplayAlerts = 0    # run in the background: no window, no alerts
        if filename:
            self.filename = filename
            if os.path.exists(self.filename):
                self.doc = self.xlApp.Documents.Open(filename)
            else:
                self.doc = self.xlApp.Documents.Add()    # create a new document
                self.doc.SaveAs(filename)
        else:
            self.doc = self.xlApp.Documents.Add()
            self.filename = ''

    def add_doc_end(self, string):
        '''Append content at the end of the document'''
        rangee = self.doc.Range()
        rangee.InsertAfter('\n' + string)

    def add_doc_start(self, string):
        '''Insert content at the start of the document'''
        rangee = self.doc.Range(0, 0)
        rangee.InsertBefore(string + '\n')

    def insert_doc(self, insertPos, string):
        '''Insert content at position insertPos'''
        rangee = self.doc.Range(0, insertPos)
        if insertPos == 0:
            rangee.InsertAfter(string)
        else:
            rangee.InsertAfter('\n' + string)

    def replace_doc(self, string, new_string):
        '''Replace every occurrence of string with new_string'''
        self.xlApp.Selection.Find.ClearFormatting()
        self.xlApp.Selection.Find.Replacement.ClearFormatting()
        # Positional arguments: FindText, MatchCase, MatchWholeWord, MatchWildcards,
        # MatchSoundsLike, MatchAllWordForms, Forward, Wrap (1 = continue),
        # Format, ReplaceWith, Replace (2 = replace all)
        self.xlApp.Selection.Find.Execute(string, False, False, False, False, False, True, 1, True, new_string, 2)

    def save(self):
        '''Save the document'''
        self.doc.Save()

    def save_as(self, filename):
        '''Save the document under a new name'''
        self.doc.SaveAs(filename)

    def close(self):
        '''Save and close the document, then quit Word'''
        self.save()
        self.xlApp.Documents.Close()
        self.xlApp.Quit()

if __name__ == '__main__':
    docx = r'C:\path\to\your.docx'  # placeholder: path to your own Word file
    doc = RemoteWord(docx)  # create a document wrapper
    # Only replacement is shown here; use the other methods of the class as needed
    doc.replace_doc('your string', 'new string')  # replace text
    doc.close()

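That covers the whole path from PDF to Word. Here is a worked example:

I needed to replace the English content of PDF reports with Chinese, which is effectively translation, except that the translations are prepared in advance and stored in a file named match_rule.xlsx. The file has two columns, source text and translated text; the program reads them and performs the replacements automatically.

The full program is given below: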
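
To make the rule format concrete, here is a small sketch of how such two-column rows could be turned into replacement rules without touching Excel. The sample rows and the helper name `rows_to_rules` are invented for illustration. The conventions it follows are the ones found in the program's `init_excel` function: rows whose source text starts with `#` are skipped, and a row labeled `pdf檔案(或資料夾)地址` carries the file or folder path.

```python
import re

PATH_KEY = 'pdf檔案(或資料夾)地址'  # label of the row that stores the file/folder path


def rows_to_rules(rows):
    """Split (source, translation) rows into a path and a rule dict."""
    pdf_path = ''
    rules = {}
    for source, translation in rows:
        if re.match(r'^#.+', source):   # comment row, ignored
            continue
        if source == PATH_KEY:          # configuration row, not a rule
            pdf_path = translation
        else:
            rules[source] = translation
    return pdf_path, rules


# Invented sample rows, standing in for the two Excel columns.
sample_rows = [
    ('# comment rows start with a hash', ''),
    (PATH_KEY, './reports'),
    ('Total revenue', '總收入'),
    ('Revenue', '收入'),
]
pdf_path, rules = rows_to_rules(sample_rows)
print(pdf_path)
print(rules)
```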


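# -*- encoding: utf8 -*-
# This program converts a PDF to Word, then reads the text in the Word file
# and replaces it according to the given rules

__author__ = 'yooongchun'

# Import the required packages
import os
import re
import xlrd  # note: xlrd 2.0+ reads only .xls files; .xlsx needs xlrd < 2.0
import win32com.client
import logging
logging.basicConfig(level=logging.INFO)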



# Helper class for working with Word documents
class RemoteWord:
    def __init__(self, filename=None):
        self.xlApp = win32com.client.DispatchEx('Word.Application')
        self.xlApp.Visible = 0
        self.xlApp.DisplayAlerts = 0    # run in the background: no window, no alerts
        if filename:
            self.filename = filename
            if os.path.exists(self.filename):
                self.doc = self.xlApp.Documents.Open(filename)
            else:
                self.doc = self.xlApp.Documents.Add()    # create a new document
                self.doc.SaveAs(filename)
        else:
            self.doc = self.xlApp.Documents.Add()
            self.filename = ''

    def add_doc_end(self, string):
        '''Append content at the end of the document'''
        rangee = self.doc.Range()
        rangee.InsertAfter('\n' + string)

    def add_doc_start(self, string):
        '''Insert content at the start of the document'''
        rangee = self.doc.Range(0, 0)
        rangee.InsertBefore(string + '\n')

    def insert_doc(self, insertPos, string):
        '''Insert content at position insertPos'''
        rangee = self.doc.Range(0, insertPos)
        if insertPos == 0:
            rangee.InsertAfter(string)
        else:
            rangee.InsertAfter('\n' + string)

    def replace_doc(self, string, new_string):
        '''Replace every occurrence of string with new_string'''
        self.xlApp.Selection.Find.ClearFormatting()
        self.xlApp.Selection.Find.Replacement.ClearFormatting()
        # Positional arguments: FindText, MatchCase, MatchWholeWord, MatchWildcards,
        # MatchSoundsLike, MatchAllWordForms, Forward, Wrap (1 = continue),
        # Format, ReplaceWith, Replace (2 = replace all)
        self.xlApp.Selection.Find.Execute(string, False, False, False, False, False, True, 1, True, new_string, 2)

    def save(self):
        '''Save the document'''
        self.doc.Save()

    def save_as(self, filename):
        '''Save the document under a new name'''
        self.doc.SaveAs(filename)

    def close(self):
        '''Save and close the document, then quit Word'''
        self.save()
        self.xlApp.Documents.Close()
        self.xlApp.Quit()



# Collect the paths of the Word files
def find_docx(pdf_path):
    file_list = []
    if os.path.isfile(pdf_path):
        file_list.append(pdf_path)
    else:
        for top, dirs, files in os.walk(pdf_path):
            for filename in files:
                if filename.endswith(('.docx', '.doc')):
                    abspath = os.path.join(top, filename)
                    file_list.append(abspath)
    return file_list



# Replace text in each Word file
def replace_docx(rule, docx_list):
    len_doc = len(docx_list)
    for i, docx in enumerate(docx_list, start=1):
        logging.info('Replacing text in Word file %s/%s: %s...' % (i, len_doc, os.path.basename(docx)))
        doc = RemoteWord(docx)  # open the document
        for item in rule:  # apply each (source, translation) rule
            doc.replace_doc(item[0], item[1])
        doc.close()

    logging.info('Done!')



# Sort the rules.
# A short source string may also occur inside a longer one, so replacing the
# short one first would break the longer match. The rules are therefore
# applied from longest to shortest.
def sort_rule(rule):
    result = []
    for item, val in rule.items():
        flag = True
        if len(result) > 0:
            for index, res in enumerate(result):
                if len(item) >= len(res[0]):
                    flag = False
                    result.insert(index, (item, val))
                    break
            if flag:
                result.append((item, val))
        else:
            result.append((item, val))

    return result



# Load the Excel file and return its rules as a dict {'source text': 'translation'}
def init_excel(excel_path):
    logging.info('Loading the Excel file with the matching rules: %s' % os.path.basename(excel_path))
    rule = {}  # maps source text to translation
    pdf_path = ''
    try:
        book = xlrd.open_workbook(excel_path)  # open the workbook
        sheet = book.sheet_by_name('Translation')  # select the sheet
        rows = sheet.nrows  # number of rows
        for row in range(rows):  # every row, including the last one
            text_ori = sheet.cell(row, 0).value  # source text
            text_trans = sheet.cell(row, 1).value  # translation
            if not re.match(r'^#.+', text_ori):  # skip rows whose source text starts with '#'
                if text_ori == 'pdf檔案(或資料夾)地址':   # row holding the PDF file (or folder) path
                    pdf_path = text_trans
                else:
                    rule[text_ori] = text_trans  # add the rule
    except IOError:
        raise  # re-raise, keeping the original error message
    logging.info('Excel file loaded!')

    return pdf_path, rule

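if __name__ == '__main__':

    excel_path = './match_rule.xlsx'    # path of the Excel file with the replacement rules
    logging.info('Opening the PDF conversion tool; please convert your PDF files manually!')
    os.popen(r'"./PDF2Word/pdf2word.exe"')
    flag = input('Have you finished converting the PDF files? (y/n): ')
    while flag != 'y':
        logging.info('Please convert the PDF files first!')
        flag = input('Have you finished converting the PDF files? (y/n): ')
    pdf_path, rule = init_excel(excel_path)  # load the Excel file
    sorted_rule = sort_rule(rule)  # sort rules from longest to shortest
    docx_list = find_docx(pdf_path)  # collect the Word file paths
    replace_docx(sorted_rule, docx_list)  # replace the content

    logging.info('Program finished!')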
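
The reason for the longest-first ordering can be seen in a small sketch that does not need Word. It uses plain `str.replace` on an invented sentence with invented rules, and `sorted` with a length key as a compact stand-in for `sort_rule` (rules of equal length may come out in a different order than in the original function).

```python
def sort_rules_longest_first(rules):
    """Return (source, translation) pairs ordered from longest to shortest source."""
    return sorted(rules.items(), key=lambda pair: len(pair[0]), reverse=True)


def apply_rules(text, ordered_rules):
    """Apply each replacement in order, like repeated calls to replace_doc."""
    for source, translation in ordered_rules:
        text = text.replace(source, translation)
    return text


rules = {'Revenue': '收入', 'Total Revenue': '總收入'}
text = 'Total Revenue: 100'

# Short rule first: 'Revenue' is replaced inside 'Total Revenue',
# so the longer rule can no longer match.
short_first = apply_rules(text, sorted(rules.items(), key=lambda pair: len(pair[0])))

# Longest rule first: the whole phrase is translated as intended.
long_first = apply_rules(text, sort_rules_longest_first(rules))

print(short_first)  # Total 收入: 100
print(long_first)   # 總收入: 100
```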
