Converting docx to HTML with Python
阿新 • Published: 2019-02-18
#!/usr/bin/env python
#coding=utf-8
from win32com import client as wc

# Start Word through COM and open the source document
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open(r'C:/Users/Administrator/Desktop/1.docx')

# The second argument is the numeric save format (see the table below)
doc.SaveAs('F:/1.html', 8)   # wdFormatHTML
doc.SaveAs('F:/2.pdf', 17)   # wdFormatPDF
doc.SaveAs('F:/3.html', 10)  # wdFormatFilteredHTML

# Close the document and shut Word down
doc.Close()
word.Quit()

'''
win32com download: http://sourceforge.net/projects/pywin32/files/pywin32/Build%20218

The environment tested here was Windows XP, Office 2007, Python 2.5.2 and pywin32 build 213. The approach is to call the Office API directly through the win32com interface. The advantages are simplicity and good compatibility: anything Office can handle, Python can handle too, and the result is identical to what "Save As" in Word produces.

Original article: http://www.fuchaoqun.com/2009/03/use-python-convert-word-to-html-with-win32com/

    #!/usr/bin/env python
    #coding=utf-8
    from win32com import client as wc
    word = wc.Dispatch('Word.Application')
    doc = word.Documents.Open('d:/labs/math.doc')
    doc.SaveAs('d:/labs/math.html', 8)
    doc.Close()
    word.Quit()

The key line is doc.SaveAs('d:/labs/math.html', 8). Many articles online write it as doc.SaveAs('d:/labs/math.html', win32com.client.constants.wdFormatHTML), which fails immediately with:

    AttributeError: class Constants has no attribute 'wdFormatHTML'

Of course, you can also use the code above to convert a Word file to any other format that Office 2007 supports (for example, to convert a Word file to PDF, just change 8 to 17). Below is the full table of file formats supported by Office 2007 and their numeric values:

    wdFormatDocument = 0
    wdFormatDocument97 = 0
    wdFormatDocumentDefault = 16
    wdFormatDOSText = 4
    wdFormatDOSTextLineBreaks = 5
    wdFormatEncodedText = 7
    wdFormatFilteredHTML = 10
    wdFormatFlatXML = 19
    wdFormatFlatXMLMacroEnabled = 20
    wdFormatFlatXMLTemplate = 21
    wdFormatFlatXMLTemplateMacroEnabled = 22
    wdFormatHTML = 8
    wdFormatPDF = 17
    wdFormatRTF = 6
    wdFormatTemplate = 1
    wdFormatTemplate97 = 1
    wdFormatText = 2
    wdFormatTextLineBreaks = 3
    wdFormatUnicodeText = 7
    wdFormatWebArchive = 9
    wdFormatXML = 11
    wdFormatXMLDocument = 12
    wdFormatXMLDocumentMacroEnabled = 13
    wdFormatXMLTemplate = 14
    wdFormatXMLTemplateMacroEnabled = 15
    wdFormatXPS = 18

Each name should map to the file format its wording suggests. If you are on Office 2003, it may not support this many formats.

For Word-to-HTML conversion there are two formats to choose from, wdFormatHTML and wdFormatFilteredHTML (numbers 8 and 10). The difference is that with wdFormatHTML, OLE objects in the Word file such as formulas are saved in WMF format, whereas with wdFormatFilteredHTML the formula images are saved as GIF. Judging by eye, the HTML generated with wdFormatFilteredHTML is also noticeably cleaner than that generated with wdFormatHTML.

You can of course also call the Office API through COM from any other language, such as PHP.
'''
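The script above can be wrapped in a small helper so that Word is always closed, even when a conversion fails. The following is a minimal sketch under a few assumptions: the names `FORMAT_BY_EXTENSION`, `save_format_for` and `convert` are hypothetical and not part of the original script, and the choice of filtered HTML for `.html` output is just one reasonable default. It uses the numeric values from the table, so it does not depend on `win32com.client.constants` being populated (those named constants are typically only available after makepy-generated support has been loaded, for example via `win32com.client.gencache.EnsureDispatch`).

```python
import os

# Numeric WdSaveFormat values, copied from the table in the post.
WD_FORMAT_HTML = 8
WD_FORMAT_FILTERED_HTML = 10
WD_FORMAT_PDF = 17

# Hypothetical mapping from output extension to save format.
# Picking filtered HTML for .html is an assumption of this sketch.
FORMAT_BY_EXTENSION = {
    ".html": WD_FORMAT_FILTERED_HTML,
    ".htm": WD_FORMAT_FILTERED_HTML,
    ".pdf": WD_FORMAT_PDF,
}


def save_format_for(path):
    """Return the numeric save format for the extension of `path`."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in FORMAT_BY_EXTENSION:
        raise ValueError("no save format known for extension %r" % ext)
    return FORMAT_BY_EXTENSION[ext]


def convert(src, dst):
    """Open `src` in Word and save it as `dst`.

    Requires Windows, an installed copy of Word, and pywin32.
    """
    # Resolve the format first so a bad extension fails before Word starts.
    fmt = save_format_for(dst)

    # Imported lazily so the helpers above work without pywin32 installed.
    from win32com import client as wc

    word = wc.Dispatch("Word.Application")
    try:
        doc = word.Documents.Open(os.path.abspath(src))
        try:
            doc.SaveAs(os.path.abspath(dst), fmt)
        finally:
            doc.Close()
    finally:
        # Quit Word even if opening or saving raised an exception.
        word.Quit()
```

With Word installed, a call such as `convert('C:/Users/Administrator/Desktop/1.docx', 'F:/2.pdf')` would mirror the second `SaveAs` call in the script. The `try`/`finally` blocks are there so that a failed conversion does not leave a hidden Word process running.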