A First Look at Web Crawlers -- PHP

阿新 • Published: 2017-05-23


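I have a CMS site bookmarked and occasionally download some resources from it; veterans will know what I mean :-D. Once I didn't log in for several days and a backlog piled up. I thought: damn, this is a hassle, could I write a script to do it for me? Then it hit me: isn't this exactly what people call a crawler? I got quite excited.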

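Since I'm still a beginner and only know PHP, that will have to do. The site's home page lists a thumbnail and an entry link for each resource. I send a GET request with my own curl wrapper function and collect all the detail-page links, then check whether the anchor link recorded by the previous run is among them. If it isn't, I go on to request the next page; if it is, I stop. Then I loop over the array of collected links and request each one in turn. Each detail page contains paired large and small images; I only want the large ones, so the small ones are filtered out. After that it's just PHP's powerful file_get_contents and file_put_contents functions. Well, talk is cheap, show my code now.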
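The "collect links until the saved anchor shows up" step can be tried without touching the network. Below is a minimal sketch, not the class itself: it uses a loop in place of recursion, and `collectNewLinks`, the `$fetchLinks` callback, the `$maxPages` cap and the fake two-page site are all hypothetical names and data invented for illustration.

```php
<?php
/**
 * Collect detail-page links that are newer than the saved anchor.
 *
 * @param callable $fetchLinks hypothetical callback: page number => list of links, newest first
 * @param string   $lastSeen   newest link recorded by the previous run ("" on the first run)
 * @param int      $maxPages   safety cap so a missing anchor cannot loop forever
 * @return string[]
 */
function collectNewLinks(callable $fetchLinks, string $lastSeen, int $maxPages = 50): array
{
    $collected = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        // Deduplicate and reindex so that keys equal positions
        $links = array_values(array_unique($fetchLinks($page)));
        if ($links === []) {
            break; // ran past the last page
        }
        $offset = array_search($lastSeen, $links, true);
        if ($offset !== false) {
            // Keep only the links that precede the anchor, then stop
            return array_merge($collected, array_slice($links, 0, $offset));
        }
        $collected = array_merge($collected, $links);
    }
    return $collected;
}

// Fake two-page site (invented data) standing in for real requests
$pages = [
    1 => ['/content-1005-1-1.html', '/content-1004-1-1.html'],
    2 => ['/content-1003-1-1.html', '/content-1002-1-1.html'],
];
$fetch = function (int $page) use ($pages): array {
    return $pages[$page] ?? [];
};

// The previous run stopped at content-1003, so only the two newer links come back
$new = collectNewLinks($fetch, '/content-1003-1-1.html');
print_r($new);
```

Passing the page fetcher in as a callback keeps the stopping rule separate from the HTTP code, which makes it easy to check against made-up pages before pointing it at a real site.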

<?php

// Load the curl request wrapper (provides getRequest())
require "../curl_request.php";

class grab{

    // Site home page
    public $url = "/portal.php";
    // Pattern matching detail-page links
    private $content_preg = "/\/content-\d{4}-1-1\.html/i";
    // Base URL of the next page
    private $page = "https://www.xibixibi.com/portal.php?page=";
    // Pattern matching full-size images
    private $bigPic_preg = "/\/data\/attachment\/forum\/20\d{2}[01]\d{1}\/[0123]\d{1}\/[a-zA-Z0-9]{22}\.(jpg|png)/";
    // Detail URL saved by the previous run
    public $lastSave = "";
    // Root directory for saved images
    public $root = "E:/root/";
    // Number of times grabDetailSites has been called
    private $count = 0;
    // Collected detail-page URLs
    public $gallery = array();

    /**
     * Constructor
     *
     */
    public function __construct(){
        // The crawl can take a while, so lift the execution time limit
        set_time_limit(0);
    }

    /**
     * Collect the links of all new detail pages
     * @param string $url page URL to request
     */
    public function grabDetailSites($url = ""){
        // Send the request
        $result = getRequest($url);
        // Match detail-page URLs
        preg_match_all($this->content_preg, $result, $matches, PREG_PATTERN_ORDER);
        // Deduplicate; array_unique keeps the original keys, so reindex
        // to make array_search() keys usable as array_slice() offsets
        $matches = array_values(array_unique($matches[0]));
        // Drop the contact link at the end of the page
        if (count($matches) > 12) {
            $matches = array_slice($matches, 0, 12);
        }
        // No links at all: we ran past the last page, so stop recursing
        if (empty($matches)) {
            return TRUE;
        }
        // Check whether the newest detail URL of the previous run is on this page
        $offset = array_search($this->lastSave, $matches);
        // On the first call, record the newest detail URL of this run
        if ($this->count == 0) {
            file_put_contents("./lastsave.txt", $matches[0]);
        }
        ++$this->count;
        // If the previous anchor was found, keep the links before it and stop
        if ($offset !== FALSE) {
            $matches = array_slice($matches, 0, $offset);
            $this->gallery = array_merge($this->gallery, $matches);
            return TRUE;
        }else{
            // Otherwise keep everything and recurse into the next page
            $this->gallery = array_merge($this->gallery, $matches);
            $this->grabDetailSites($this->page . ($this->count + 1));
            return TRUE;
        }
    }

    /**
     * Download the full-size images of every detail URL in gallery
     *
     */
    public function grabBigPic(){
        // Loop over the collected detail URLs
        foreach ($this->gallery as $key => $value) {
            // Get the full-size image URLs
            $result = getRequest($value);
            preg_match_all($this->bigPic_preg, $result, $matches);
            $matches = array_unique($matches[0]);
            // Fetch each full-size image
            foreach ($matches as $key1 => $value1) {
                $pic = getRequest($value1);
                $month = date("Y/m/");
                if (!is_dir($this->root . $month)) {
                    // The mode must be octal: 0777, not 777
                    mkdir($this->root . $month, 0777, TRUE);
                }
                // Save the image data
                file_put_contents($this->root . $month . basename($value1), $pic);
            }
        }
    }

    /**
     * Sort old image files into year/month directories
     *
     */
    public function sortPic(){
        // Drop . and .. without relying on their position in the listing
        $allPics = array_diff(scandir($this->root), array(".", ".."));
        foreach ($allPics as $key => $value) {
            // Only move files; leave existing directories where they are
            if (!is_file($this->root . $value)) {
                continue;
            }
            $time = date("Y/m/", filemtime($this->root . $value));
            if (!is_dir($this->root . $time)) {
                mkdir($this->root . $time, 0777, TRUE);
            }
            // Move the file
            rename($this->root . $value, $this->root . $time . $value);
        }
    }

    public function __set($key, $value){
        $this->$key = $value;
    }
}

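The site isn't very complicated, so the class turned out fairly simple. I had planned to run it as a scheduled task, but that can wait until my old machine is switched over to Ubuntu.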
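The full-size image rule from the class can also be checked on its own before running a whole crawl. Here is a small sketch: the pattern is copied from `$bigPic_preg`, while `extractBigPics`, the sample HTML and the thumbnail file name are invented for illustration, since the real page markup isn't shown here.

```php
<?php
// Pattern copied from the class; single quotes keep the backslashes as written
$bigPicPattern = '/\/data\/attachment\/forum\/20\d{2}[01]\d{1}\/[0123]\d{1}\/[a-zA-Z0-9]{22}\.(jpg|png)/';

/**
 * Return the unique full-size image paths found in a page (hypothetical helper).
 */
function extractBigPics(string $html, string $pattern): array
{
    preg_match_all($pattern, $html, $matches);
    // $matches[0] holds the full matches; dedupe and reindex them
    return array_values(array_unique($matches[0]));
}

// Invented markup: one large image referenced twice, plus one thumbnail
// whose name does not fit the 22-character rule
$html = <<<'HTML'
<a href="/data/attachment/forum/201705/23/a1B2c3D4e5F6g7H8i9J0kL.jpg">
  <img src="/data/attachment/forum/201705/23/small_0001.jpg">
</a>
<img src="/data/attachment/forum/201705/23/a1B2c3D4e5F6g7H8i9J0kL.jpg">
HTML;

$pics = extractBigPics($html, $bigPicPattern);
print_r($pics);

// File name that would be used when saving under the year/month directory
$fileName = basename($pics[0]);
echo $fileName, PHP_EOL;
```

If the site's thumbnails turned out to share the same 22-character base name, the pattern would match them too, so it is worth testing against a saved copy of a real detail page first.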
