Reading doc, docx, xls, pdf, and txt content with PHP

阿新 • Published: 2018-11-15

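One of my clients had the following requirement: users upload files in doc, docx, xls, pdf, or txt format, and PHP has to read the content of those files and then count the words in them.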

1. Reading DOC files with PHP

PHP has no built-in class or library for reading Word files, so here we use the antiword package (http://www.winfield.demon.nl/) to read doc files.

First, how to use it on Windows:

1. Open http://www.winfield.demon.nl/ (the antiword download page), find the Windows version (http://www.winfield.demon.nl/#Windows), and download it (antiword-0_37-windows.zip);

2. Extract the downloaded file to the root of the C: drive;

One more thing to note: http://www.informatik.uni-frankfurt.de/~markus/antiword/00README.WIN contains the installation notes for Windows.

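You need to set an environment variable: My Computer (right-click) -> Advanced -> Environment Variables, then create a new entry under the user variables at the top:

Variable name: HOME

Variable value: c:\home. This directory should exist; if it does not, create a home folder on the C: drive.

Then, under the system variables, edit Path and prepend %HOME%\antiword to its value.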

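3. Start -> Run -> CMD, then change to the antiword directory;

Run antiword -h to check that it works.

4. Next, use the antiword -t command to read the content of a doc file. First copy a doc file into the c:\antiword directory, then run

>antiword -t filename.doc

and the content of the Word file is printed to the screen.

You may be wondering what this has to do with reading Word files from PHP. Let's look at how to use this command from PHP.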

<?php

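// Single quotes keep the backslashes in the Windows path literal.
$file = 'D:\xampp\htdocs\word_count\uploads\doc-english.doc';

// escapeshellarg() quotes the path so spaces cannot split the argument.
$content = shell_exec('c:\antiword\antiword -f ' . escapeshellarg($file));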

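This reads the content of the Word file into $content.

To read doc files on Linux, download the Linux archive; it contains a readme.txt file, and you can install it by following those instructions.

$content = shell_exec ( "/usr/local/bin/antiword -f " . escapeshellarg ( $file ) );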
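
File names that come from uploads can contain spaces or shell metacharacters, so the path should reach the shell as a single quoted argument. The following is a minimal sketch of assembling such a command with PHP's `escapeshellarg()`. The helper name `buildExtractCommand` and the sample paths are assumptions made for this example and are not part of antiword.

```php
<?php

/**
 * Build a command line in which every option and the file path are quoted.
 * Hypothetical helper: the name and signature are illustrative only.
 *
 * @param string   $binary  path to the executable (a trusted, configured value)
 * @param string[] $options options placed before the file, e.g. ['-f']
 * @param string   $file    path of the document to read
 * @return string command line that can be passed to shell_exec()
 */
function buildExtractCommand(string $binary, array $options, string $file): string
{
    $parts = [$binary];
    foreach ($options as $option) {
        $parts[] = escapeshellarg($option);
    }
    // The file path is the untrusted part, so it is always quoted.
    $parts[] = escapeshellarg($file);
    return implode(' ', $parts);
}

// Assumed sample paths: a file name containing a space stays one argument.
$command = buildExtractCommand('/usr/local/bin/antiword', ['-f'], '/tmp/my report.doc');
// $content = shell_exec($command);
```

The same helper would work for any other command-line converter, since only the binary path and the options change.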

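2. Reading PDF file content with PHP

PHP has no dedicated library for reading PDF content either, so we use a third-party package (xpdf). Again we start with Windows: download it and extract it to the root of the C: drive.

Start -> Run -> cmd -> cd /d c:\xpdf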
<?php

$file = 'D:\xampp\htdocs\word_count\uploads\pdf-english.pdf';

// The trailing "-" tells pdftotext to write the text to standard output.
$content = shell_exec('c:\xpdf\pdftotext ' . escapeshellarg($file) . ' -');

This reads the content of the PDF file into a PHP variable.

Installation on Linux is also simple, so the steps are not listed here.

<?php

$content = shell_exec ( "/usr/bin/pdftotext " . escapeshellarg ( $file ) . " -" );

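3. Reading ZIP file content with PHP

First extract the zip file with PHP's ZipArchive, then read the files inside the archive: Word files are read with antiword, and PDF files are read with xpdf.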

<?php

/**
* Read ZIP valid file
*
* @param string $file file path
* @return string total valid content
*/
function ReadZIPFile($file = '') {
    $content = "";
    $inValidFileName = array ();
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === TRUE) {
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            if (preg_match ( '#\.(txt|docx?|pdf)$#i', $entry )) { // anchor every extension to the end of the name
                $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array (
                        $entry
                ) );
                $content .= CheckSystemOS ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry );
            } else {
                $inValidFileName [$i] = $entry;
            }
        }
        $zip->close ();
        rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) );
        /*if (file_exists ( $file )) {
            unlink ( $file );
        }*/
        return $content;
    } else {
        return "";
    }
}
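
The extension check decides which archive entries are extracted at all. As a sketch of that filtering step on its own, the helper below compares each entry's extension against a whitelist using `pathinfo()`. The name `isSupportedEntry` and the sample entry names are assumptions made for this example.

```php
<?php

/**
 * Decide whether a ZIP entry name has one of the supported extensions.
 * Hypothetical helper, shown only to illustrate the filtering step.
 */
function isSupportedEntry(string $entry): bool
{
    $supported = ['txt', 'doc', 'docx', 'pdf'];
    // Directory entries end with "/" and contain nothing to read.
    if ($entry === '' || substr($entry, -1) === '/') {
        return false;
    }
    // pathinfo() only looks at the part after the last dot of the file name.
    $extension = strtolower(pathinfo($entry, PATHINFO_EXTENSION));
    return in_array($extension, $supported, true);
}

// Made-up entry names, as getNameIndex() might return them.
$entries = ['report.DOC', 'notes/readme.txt', 'my.docs/image.png', 'folder/'];
$readable = array_values(array_filter($entries, 'isSupportedEntry'));
```

Comparing the extension directly also rejects names such as `my.docs/image.png`, which an unanchored pattern for `.doc` would accept because the text appears in the middle of the path.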

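4. Reading DOCX file content with PHP

A docx file is really a ZIP archive made up of many XML files; the text itself is stored in word/document.xml.

Take any docx file and open it as a zip archive (or rename the .docx extension to .zip and extract it).

The word directory contains document.xml.

Since the content of the docx file lives in document.xml, reading that file is all we need to do.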

<?php

/**
* Read Docx File
*
* @param string $file filepath
* @return string file content
*/
function parseWord($file) {
    $content = "";
    $zip = new ZipArchive ( );
    if ($zip->open ( $file ) === true) {
        for($i = 0; $i < $zip->numFiles; $i ++) {
            $entry = $zip->getNameIndex ( $i );
            if (pathinfo ( $entry, PATHINFO_BASENAME ) == "document.xml") {
                $zip->extractTo ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ), array (
                        $entry
                ) );
                $filepath = pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) . "/" . $entry;
                $content = strip_tags ( file_get_contents ( $filepath ) );
                break;
            }
        }
        $zip->close ();
        rrmdir ( pathinfo ( $file, PATHINFO_DIRNAME ) . "/" . pathinfo ( $file, PATHINFO_FILENAME ) );
        return $content;
    } else {
        return "";
    }
}
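
Calling `strip_tags()` on the raw XML joins adjacent paragraphs without any separator, which can merge the last word of one paragraph with the first word of the next. The sketch below is one possible variation, not the author's method: it inserts a newline at each closing `</w:p>` tag before stripping, and it reads the XML straight from the archive with `ZipArchive::getFromName()` instead of extracting to disk. The function names `docxXmlToText` and `readDocxText` and the sample XML are assumptions for illustration.

```php
<?php

/**
 * Convert the XML of word/document.xml to plain text, one line per paragraph.
 * Hypothetical helper for illustration.
 */
function docxXmlToText(string $xml): string
{
    // Mark the end of every paragraph before the tags disappear.
    $xml = str_replace('</w:p>', "\n", $xml);
    $text = strip_tags($xml);
    // Decode XML entities such as &amp; that remain in the text nodes.
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}

/**
 * Read the text of a .docx file without extracting it to disk.
 * Returns an empty string when the archive or the XML part is missing.
 */
function readDocxText(string $file): string
{
    $zip = new ZipArchive();
    if ($zip->open($file) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : docxXmlToText($xml);
}

// Made-up, heavily simplified document.xml content.
$sample = '<w:document><w:body>'
    . '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>'
    . '</w:body></w:document>';
$text = docxXmlToText($sample);
```

Because nothing is written to disk, this variation needs no temporary directory and no cleanup step.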

If you want to create docx files with PHP, or convert docx files to XHTML or PDF, you can use phpdocx (http://www.phpdocx.com/).

5. Reading TXT with PHP

Just use PHP's file_get_contents function.

<?php

$file = 'D:\xampp\htdocs\word_count\uploads\eng.txt';

$content = file_get_contents($file);
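
`file_get_contents()` returns `false` and raises a warning when the file cannot be read, so a small guard keeps the later word count from receiving a non-string value. This is a minimal sketch under that assumption; the helper name `readTextFile` is hypothetical, and a temporary file stands in for an uploaded one.

```php
<?php

/**
 * Read a text file, returning an empty string when it cannot be read.
 * Hypothetical helper for illustration.
 */
function readTextFile(string $file): string
{
    if (!is_file($file) || !is_readable($file)) {
        return '';
    }
    $content = file_get_contents($file);
    return $content === false ? '' : $content;
}

// Usage with a temporary file standing in for an upload.
$tmp = tempnam(sys_get_temp_dir(), 'txt');
file_put_contents($tmp, "hello world\n");
$content = readTextFile($tmp);
unlink($tmp);
```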

6. Reading Excel with PHP

http://phpexcel.codeplex.com/

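So far we have only read the file content. How do we count the words?

PHP has a built-in function, str_word_count, that counts words, but applied to the text antiword extracts from a doc file it produces a large error.

Here we use the following function, written specifically for counting words.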
<?php

/**
* statistic word count
*
* @param string $text word content of the file
* @return int word count of the content
*/
function StatisticWordsCount($text = '') {
    //    $text = trim ( preg_replace ( '/\d+/', ' ', $text ) ); // remove extra spaces
    $text = str_replace ( str_split ( '|' ), '', $text ); // remove these chars (you can specify more)
    //    $text = str_replace ( str_split ( '-' ), '', $text ); // remove these chars (you can specify more)
    $text = trim ( preg_replace ( '/\s+/', ' ', $text ) ); // remove extra spaces
    $text = preg_replace ( '/-{2,}/', '', $text ); // remove 2 or more dashes in a row
    $len = strlen ( $text );
    if (0 === $len) {
        return 0;
    }
    $words = 1;
    while ( $len -- ) {
        if (' ' === $text [$len]) {
            ++ $words;
        }
    }
    return $words;
}
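
The same cleanup can also be expressed by splitting on whitespace instead of counting spaces character by character. The sketch below is an alternative formulation, not the function above: `countWordsBySplit` is a hypothetical name, and the table-like sample input is made up.

```php
<?php

/**
 * Count words by splitting on runs of whitespace.
 * Hypothetical alternative, shown for comparison only.
 */
function countWordsBySplit(string $text): int
{
    $text = str_replace('|', '', $text);          // drop pipe characters
    $text = preg_replace('/-{2,}/', '', $text);   // drop runs of two or more dashes
    // PREG_SPLIT_NO_EMPTY discards the empty pieces left by leading,
    // trailing, or doubled whitespace.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? 0 : count($words);
}

// Made-up input resembling a text table.
$count = countWordsBySplit("|Name  |Age |\n|------|----|\n|Tom   |30  |");
```

One difference in behavior: when a dash run sits between two spaces, as in `a -- b`, removing it leaves two adjacent spaces. The space-counting loop counts that gap as an extra word and returns 3, while the split version returns 2.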

The full code is as follows:

<?php
/**
* check system operation win or linux
*
* @param string $file contain file path and file name
* @return file content
*/
function CheckSystemOS($file = '') {
    $content = "";
    //    $type = substr ( $file, strrpos ( $file, '.' ) + 1 );
    $type = pathinfo ( $file, PATHINFO_EXTENSION );
    //    global $UNIX_ANTIWORD_PATH, $UNIX_XPDF_PATH;
    if (strtoupper ( substr ( PHP_OS, 0, 3 ) ) === 'WIN') { //this is a server using windows
        switch (strtolower ( $type )) {
            case 'doc' :
                $content = shell_exec ( "c:\\antiword\\antiword -f $file" );
                break;
            case 'docx' :
                $content = parseWord ( $file );
                break;
            case 'pdf' :
                $content = shell_exec ( "c:\\xpdf\\pdftotext $file -" );
                break;
            case 'zip' :
                $content = ReadZIPFile ( $file );
                break;
            case 'txt' :
                $content = file_get_contents ( $file );
                break;
        }
    } else { //this is a server not using windows
        switch (strtolower ( $type )) {
            case 'doc' :
                $content = shell_exec ( "/usr/local/bin/antiword -f $file" );
                break;
            case 'docx' :
                $content = parseWord ( $file );
                break;
            case 'pdf' :
                $content = shell_exec ( "/usr/bin/pdftotext $file -" );
                break;
            case 'zip' :
                $content = ReadZIPFile ( $file );
                break;
            case 'txt' :
                $content = file_get_contents ( $file );
                break;
        }
    }
    /*if (file_exists ( $file )) {
        @unlink ( $file );
    }*/
    return $content;
}

/**
* remove directory
*
* @param string $dir path dir
*/
function rrmdir($dir) {
    if (is_dir ( $dir )) {
        $objects = scandir ( $dir );
        foreach ( $objects as $object ) {
            if ($object != "." && $object != "..") {
                if (filetype ( $dir . "/" . $object ) == "dir") {
                    rrmdir ( $dir . "/" . $object );
                } else {
                    unlink ( $dir . "/" . $object );
                }
            }
        }
        reset ( $objects );
        rmdir ( $dir );
    }
}

// Usage

$file = 'D:\xampp\htdocs\word_count\uploads\pdf-german.zip';

$word_number = StatisticWordsCount ( CheckSystemOS ( $file) );
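
The Windows and non-Windows branches of `CheckSystemOS` differ only in the paths of the two external tools. One way to avoid repeating the `switch` is to look the path up in a table keyed by OS family. The sketch below assumes PHP 7.2 or later for the `PHP_OS_FAMILY` constant; the `TOOL_PATHS` layout and the `resolveToolPath` helper are hypothetical, and the paths simply mirror the ones used in this article.

```php
<?php

// Tool locations per OS family; any family not listed falls back to "default".
const TOOL_PATHS = [
    'Windows' => [
        'antiword'  => 'c:\\antiword\\antiword',
        'pdftotext' => 'c:\\xpdf\\pdftotext',
    ],
    'default' => [
        'antiword'  => '/usr/local/bin/antiword',
        'pdftotext' => '/usr/bin/pdftotext',
    ],
];

/**
 * Return the configured path of an external tool for the given OS family.
 * Hypothetical helper for illustration.
 */
function resolveToolPath(string $tool, string $osFamily = PHP_OS_FAMILY): string
{
    $paths = TOOL_PATHS[$osFamily] ?? TOOL_PATHS['default'];
    if (!isset($paths[$tool])) {
        throw new InvalidArgumentException("Unknown tool: $tool");
    }
    return $paths[$tool];
}

// The second argument is only passed here to make the example deterministic.
$antiword = resolveToolPath('antiword', 'Linux');
```

With such a lookup, a single `switch` on the file extension would be enough, because the OS-specific part is reduced to choosing the binary.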

http://www.it300.com/article-15290.html
