Working with Word Documents

阿新 • Published: 2017-12-24

python

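With the python-docx module, Python can create and modify Word documents, which have the .docx file extension. Run pip install python-docx to install the module.

When importing the python-docx module, use import docx, not import python-docx.

Compared with plain text, .docx files have a lot of structure. python-docx represents this structure with three different types. At the top level, a Document object represents the entire document. The Document object contains a list of Paragraph objects, one for each paragraph in the document. Each Paragraph object in turn contains a list of Run objects.

The text in a Word document is more than just a string. It has font, size, color, and other styling information associated with it. In Word, a style is a collection of these attributes. A Run object is a contiguous stretch of text with the same style; whenever the text style changes, a new Run object is needed.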

c:\python\Scripts>pip3.6 install python-docx
Collecting python-docx
  Downloading python-docx-0.8.6.tar.gz (5.3MB)
    100% |████████████████████████████████| 5.3MB 73kB/s
Collecting lxml>=2.3.2 (from python-docx)
  Downloading lxml-4.1.1-cp36-cp36m-win_amd64.whl (3.5MB)
    100% |████████████████████████████████| 3.6MB 39kB/s
Installing collected packages: lxml, python-docx
  Running setup.py install for python-docx ... done
Successfully installed lxml-4.1.1 python-docx-0.8.6

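Reading Word Documents

Call len() on the paragraphs attribute to get the number of Paragraph objects.

Each Paragraph object has a text attribute that contains the text of that paragraph as a string (without style information).

Each Paragraph object also has a runs attribute, which is a list of Run objects.

A Run object also has a text attribute, containing just the text in that run.

Runs represent differently styled stretches of text.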

c:\python>python
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import docx
>>> doc=docx.Document('Gettysburg Address.docx')
>>> len(doc.paragraphs)
4
>>> doc.paragraphs[0].text
'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'
>>> doc.paragraphs[-1].text
''
>>> len(doc.paragraphs[0].runs)
1
>>> doc.paragraphs[0].runs[0].text
'Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.'
>>>

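Getting the Full Text from a File

If you only care about the text of a Word document and not its style information, you can write a getText() function. getText() is a helper you define yourself, not something python-docx provides.

It accepts a .docx filename and returns the text in the file as a single string.

To indent each paragraph, prepend spaces to the string passed to append().

The session below shows the loop that such a function is built around.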

>>> import docx
>>> doc=docx.Document('Gettysburg Address.docx')
>>> fullText=[]
>>> for para in doc.paragraphs:
...     fullText.append(para.text)
...
>>> print(fullText)
['Four score and seven years ago our fathers brought forth on this continent, ......hall not perish from the earth.', '']
>>>

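Styling Paragraph and Run Objects

Word documents have three types of styles: paragraph styles can be applied to Paragraph objects, character styles can be applied to Run objects, and linked styles can be applied to both.

When setting the style attribute, do not use spaces in the style name.

To apply a linked style to a Run object, append 'Char' to the end of the style name.

In the version of python-docx described here (0.7.4), you can only use the default Word styles and the styles already present in the opened file; new styles cannot be created. These naming rules are those of 0.7.4; from python-docx 0.8 onward, styles are looked up by the name shown in Word, spaces included (for example 'Intense Quote').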

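Writing Word Documents

Call add_paragraph() on a Document object to add a new paragraph of text; it returns the new Paragraph object.

After adding text, pass a filename string to the Document object's save() method to save the Document object to a file.

To add another paragraph, call add_paragraph() again with the new paragraph's text.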

>>> import docx
>>> doc=docx.Document('Gettysburg Address.docx')
>>> doc.add_paragraph('Hello world!')
<docx.text.paragraph.Paragraph object at 0x0000019B5DD45B70>
>>> doc.save('newdoc.docx')

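To append text to the end of an existing paragraph, call the Paragraph object's add_run() method and pass it a string.

Both add_paragraph() and add_run() accept an optional second argument: a string naming the style of the Paragraph or Run object.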

>>> import docx
>>> doc=docx.Document('Gettysburg Address.docx')
>>> doc.add_paragraph('Hello world!')
<docx.text.paragraph.Paragraph object at 0x0000019B5D57ABE0>
>>> paraObj1=doc.add_paragraph('This is a second paragraph.')
>>> paraObj2=doc.add_paragraph('This is a yet another paragraph.')
>>> paraObj1.add_run('This text is being added to the second paragraph.')
<docx.text.run.Run object at 0x0000019B5DD5B940>
>>> doc.save('newdoc2.docx')

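Adding Headings

Calling add_heading() adds a paragraph with one of the heading styles.

add_heading() takes a string of heading text and an integer from 0 to 4.

The integer 0 gives the heading the Title style, which is used at the top of the document.

The integers 1 to 4 are heading levels: 1 is the main heading and 4 is the lowest subheading.

add_heading() returns a Paragraph object.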

>>> import docx
>>> doc=docx.Document()
>>> doc.add_heading('Header 0',0)
<docx.text.paragraph.Paragraph object at 0x0000019B5E1740F0>
>>> doc.add_heading('Header 1',1)
<docx.text.paragraph.Paragraph object at 0x0000019B5E1740B8>
>>> doc.add_heading('Header 2',2)
<docx.text.paragraph.Paragraph object at 0x0000019B5E174978>
>>> doc.add_heading('Header 3',3)
<docx.text.paragraph.Paragraph object at 0x0000019B5E174198>
>>> doc.add_heading('Header 4',4)
<docx.text.paragraph.Paragraph object at 0x0000019B5E1740F0>
>>> doc.save('headings.docx')

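Adding Line Breaks and Page Breaks

To add a line break, call the add_break() method on a Run object.

To add a page break, pass docx.enum.text.WD_BREAK.PAGE as the sole argument to add_break(). (In python-docx 0.7.4 this constant was exposed as docx.text.WD_BREAK.PAGE.)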

>>> import docx
>>> doc=docx.Document()
>>> doc.add_paragraph('This is on the first page!')
<docx.text.paragraph.Paragraph object at 0x0000019B5DD5B400>
>>> doc.paragraphs[0].runs[0].add_break()
>>> doc.add_paragraph('This is the new text!')
<docx.text.paragraph.Paragraph object at 0x0000019B5DD5B400>
>>> doc.save('break.docx')

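Adding Pictures

Document objects have an add_picture() method that adds an image at the end of the document. The optional width and height keyword arguments set the size of the image in the document, using length helpers such as docx.shared.Inches() and docx.shared.Cm().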

>>> import docx
>>> doc=docx.Document()
>>> doc.add_paragraph('This is on the first page!')
<docx.text.paragraph.Paragraph object at 0x0000019B5E189080>
>>> doc.add_picture('123.jpg',width=docx.shared.Inches(1),height=docx.shared.Cm(4))
<docx.shape.InlineShape object at 0x0000019B5E1899E8>
>>> doc.save('picture.docx')
