Python Crawlers: Scraping 內涵吧 (neihan8) with Regular Expressions

阿新 • Published: 2017-09-03


First, let's go over the basics you need to know before writing the crawler.


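1. The match() method:

This method tries to match at the beginning of the string (you can also specify a start position). If nothing matches at that position, it returns None immediately. Once it finds a match, it stops and does not look for further ones.

For example, we can set the start index to 3 and the range to 3-10. Python then starts matching at the 4th character, '1', looks no further than index 9 (the end position 10 is exclusive), and still returns only one result.

group() returns the string captured by one or more groups. When several group indexes are passed, the result is a tuple. group(0) is the entire matched string, and calling group() with no arguments returns the same thing as group(0).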

import re

pattern = re.compile(r'\d+')     # match one or more digits

m = pattern.match('one123two456')
print(m)
# None
# Calling m.group() at this point would raise:
# AttributeError: 'NoneType' object has no attribute 'group'

# Start matching at index 3 and stop before index 10
m = pattern.match('one123two456', 3, 10)
print(m)
print(m.group())
# <re.Match object; span=(3, 6), match='123'>
# 123
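To make the description of group() more concrete, here is a minimal sketch that uses a pattern with two capture groups. The pattern and the variable names are my own illustration and do not come from the original listing.

```python
import re

# Two capture groups: letters first, then digits (illustrative pattern)
pattern = re.compile(r'([a-z]+)(\d+)')
m = pattern.match('one123two456')

whole = m.group()        # same as m.group(0): the entire match
first = m.group(1)       # text captured by the first group
both = m.group(1, 2)     # several indexes return a tuple

print(whole)   # one123
print(first)   # one
print(both)    # ('one', '123')
```

Because the string starts with letters, this pattern matches right at index 0, so no start position is needed here.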

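2. The search() method:

search() is similar to match(). The difference is that match() only checks whether the pattern matches at the start of the string, while search() scans the whole string for a match. Like match(), search() returns only the first match.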

import re

pattern = re.compile(r'\d+')
m = pattern.search('one123two456')
print(m.group())

# 123
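The difference between the two methods is easiest to see side by side. The following is a small sketch on the same sample string; the variable names are only illustrative.

```python
import re

pattern = re.compile(r'\d+')
text = 'one123two456'

at_start = pattern.match(text)     # None: the string starts with letters
anywhere = pattern.search(text)    # finds the first run of digits

print(at_start)            # None
print(anywhere.group())    # 123
print(anywhere.span())     # (3, 6)
```

Since both methods can return None, it is safer to check the result before calling group() on it.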

3. The findall() method:

Searches the string and returns all matching substrings as a list.

import re

pattern = re.compile(r'\d+')
m = pattern.findall('one123two456')
print(m)

# ['123', '456']
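When the pattern contains capture groups, findall() returns the group contents instead of the whole match, and with more than one group each item is a tuple. A short sketch follows; the HTML fragment is made up for illustration and is not taken from any real page.

```python
import re

# Hypothetical HTML fragment, made up for illustration
html = '<a href="1.html">first</a><a href="2.html">second</a>'

# No group: each item is the whole match
links = re.findall(r'href="[^"]*"', html)
# Two groups: each item is a (url, text) tuple
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html)

print(links)  # ['href="1.html"', 'href="2.html"']
print(pairs)  # [('1.html', 'first'), ('2.html', 'second')]
```

This tuple form is what makes it possible to pull a title and a body out of a page with a single pattern.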

4. The sub() method:

Replaces every match in the string and returns the resulting string.

import re

pattern = re.compile(r'\d+')
m = pattern.sub('abc', 'one123two456')
print(m)

# oneabctwoabc
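By default sub() replaces every match. If only the first few should be replaced, the optional count argument limits the number of replacements, as this small sketch shows (the variable names are illustrative).

```python
import re

pattern = re.compile(r'\d+')
text = 'one123two456'

replace_all = pattern.sub('abc', text)             # every match is replaced
replace_first = pattern.sub('abc', text, count=1)  # only the first match

print(replace_all)    # oneabctwoabc
print(replace_first)  # oneabctwo456
```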

5. Practice: scraping jokes from 內涵吧

# -*- coding: utf-8 -*-

import re

import requests


class Spider:

    def __init__(self):
        self.page = 1

    def getPage(self, page):
        url = "http://www.neihan8.com/article/list_5_{}.html".format(page)
        response = requests.get(url)
        # The page source shows that 內涵吧 declares charset=gb2312
        contents = response.content.decode('gbk')
        return contents

    def getContent(self):
        contents = self.getPage(self.page)
        # Group 1 captures the title, group 2 the body of the joke
        pattern = re.compile(r'<h4>.*?<a href.*?html">(.*?)</a>.*?class="f18 mb20">(.*?)</div>', re.S)
        results = pattern.findall(contents)
        contents = []
        for item in results:
            # Strip leftover tags and HTML entities from the title and body
            title = re.sub(r'<b>|</b>', "", item[0])
            content = re.sub(r'<p>|</p>|<br />|&\w+;|<img alt.*|<div style=.*>|<div>|<p style="text-align: center; ">', "", item[1])
            content = re.sub(r'<div class="upload-txt.*baseline;">|<h1 class="title".*vertical-align: baseline;">|</h1>', "", content)
            # The parentheses must be escaped to match them literally
            content = re.sub(r'<div class=.*onclick="showAnswer\(this\)">|</a><div class="answer">', "", content)
            content = re.sub(r'<span style="color: rgb.*;">', "", content)
            contents.append([title, content])
        return contents

    def save_Data(self):
        x = 1   # running number of the joke
        y = 1   # page counter used in the end-of-page marker
        # The with block closes the file even if a request fails
        with open("duanzi.txt", "w+", encoding="utf-8") as file:
            for page in range(0, 507):
                self.page = page
                contents = self.getContent()
                print("Writing data for page %d..." % (self.page + 1))
                for item in contents:
                    file.write(str(x) + "." + item[0])
                    file.write("\n")
                    file.write(item[1])
                    file.write("=====================================================================================\n\n")
                    if item == contents[-1]:
                        file.write("********End of page " + str(y) + "********\n\n")
                        y += 1
                    x += 1
        print("All pages have been loaded")

    def start(self):
        self.save_Data()


spider = Spider()
spider.start()

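This basically retrieves the title and body of each joke. However, the markup on 內涵吧 gets more complicated on later pages, which makes stripping the tags with hand-written replacement patterns much harder.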
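One possible way around the growing list of tag patterns is to remove every tag with a single generic pattern. This is not what the spider above does. It is only a sketch under a few assumptions: strip_tags and the sample fragment are hypothetical, and the pattern treats anything between '<' and the next '>' as a tag, so it will misbehave on attribute values that contain '>'. Unlike the original code, which deletes entities such as &amp;nbsp;, this version decodes them with html.unescape from the standard library.

```python
import html
import re

TAG = re.compile(r'<[^>]+>')   # any single HTML tag


def strip_tags(fragment):
    """Remove every tag, decode entities and trim whitespace."""
    text = TAG.sub('', fragment)
    text = html.unescape(text)
    return text.strip()


# Hypothetical fragment, made up for illustration
sample = ('<p style="text-align: center; ">'
          '<span style="color: rgb(0, 0, 0);">Hello&amp;bye</span><br /></p>')
print(strip_tags(sample))   # Hello&bye
```

For heavily nested or malformed markup, a real HTML parser is usually more reliable than any regular expression.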
