Web Crawler (6): A Tutorial on Regular Expressions in Python

阿新 • Published: 2019-02-15

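Next, I plan to build a small example crawler using Qiushibaike (糗百).

Before that, though, let's go through regular expressions in Python in detail.

In a Python crawler, regular expressions play the same role as the roll book a teacher uses to take attendance: an indispensable tool.

I was not careful enough when compiling these notes, and I apologize for that.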

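I. Regular Expression Basics

1.1. Concepts

Regular expressions are a powerful tool for processing strings. They are not part of Python itself.

Other programming languages have regular expressions too; implementations differ mainly in how much of the syntax they support.

Regular expressions have their own syntax and a separate processing engine, and the core syntax is largely the same across the languages that provide them.

[Figure: the process of matching with a regular expression]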

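Roughly, matching works as follows:

1. Take characters from the expression and from the text in turn and compare them.

2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.

3. If the expression contains quantifiers or boundaries, the process differs slightly.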
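Steps 1 and 2 can be sketched for the simplest case, a pattern made only of literal characters. The helper `literal_match` below is a hypothetical illustration and not how the re engine is actually implemented; it only compares characters one by one, and its result is checked against `re.match`.

```python
import re

def literal_match(pattern, text):
    """Hypothetical helper: match a pattern of plain literal characters at the start of text."""
    if len(pattern) > len(text):
        return False
    # Compare the pattern and the text one character at a time.
    for p_char, t_char in zip(pattern, text):
        if p_char != t_char:
            return False  # a single mismatch makes the whole match fail
    return True  # every pattern character matched

ok = literal_match("hello", "hello world!")
bad = literal_match("hello", "helllo world!")

# For literal patterns re.match gives the same answers
# (re.escape guards against accidental metacharacters).
same_ok = (re.match(re.escape("hello"), "hello world!") is not None) == ok
same_bad = (re.match(re.escape("hello"), "helllo world!") is not None) == bad
print(ok, bad, same_ok, same_bad)
```

Quantifiers and boundaries (step 3) are exactly what this sketch leaves out, which is why a real engine is needed.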

[Figure: the regular expression metacharacters and syntax supported by Python]

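1.2. Greedy and Non-Greedy Quantifiers

Regular expressions are typically used to find matching strings in text.

In greedy mode, a quantifier always tries to match as many characters as possible;

in non-greedy mode it does the opposite and tries to match as few characters as possible.

Quantifiers in Python are greedy by default.

For example, searching "abbbc" with the regular expression "ab*" finds "abbb".

With the non-greedy quantifier "ab*?", it finds "a".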
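The two cases above can be reproduced directly with `re.search`. The HTML snippet in the second half is an assumed sample string, added only because greedy matching is a common surprise when scraping tags.

```python
import re

text = "abbbc"
greedy = re.search(r"ab*", text).group()   # b* takes as many 'b' as it can
lazy = re.search(r"ab*?", text).group()    # b*? takes as few 'b' as it can

# Assumed sample markup, not from the original article.
html = "<b>bold</b> and <i>italic</i>"
greedy_tag = re.search(r"<.*>", html).group()   # runs to the last '>'
lazy_tag = re.search(r"<.*?>", html).group()    # stops at the first '>'

print(greedy, lazy)
print(greedy_tag)
print(lazy_tag)
```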

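1.3. The Backslash Problem

As in most programming languages, regular expressions use "\" as the escape character, which can lead to backslash trouble.

Suppose you need to match a literal "\" in the text. Written as a string in the programming language, the regular expression then needs four backslashes, "\\\\":

the first and third escape the second and fourth at the language level,

leaving two backslashes \\, which the regular expression engine in turn reads as one escaped backslash that matches a literal \.

This is clearly tedious.

Python's raw strings solve this nicely: the regular expression in this example can be written r"\\".

Likewise, "\\d", which matches a digit, can be written r"\d".

With raw strings, the backslash problem is no longer anything to worry about.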
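A short check of the claims above. The path-like text is an assumed sample; the point is only that the four-backslash literal and the raw-string literal are the same two-character pattern.

```python
import re

text = "C:\\temp"   # the actual text is C:\temp, with ONE backslash
plain = "\\\\"      # normal string literal: the regex engine receives two characters, \\
raw = r"\\"         # raw string literal: the same two characters

same_pattern = plain == raw
m = re.search(raw, text)       # matches the single literal backslash
same_digit = "\\d" == r"\d"    # both spell the two characters \d

print(same_pattern, len(raw), m.span(), same_digit)
```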

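II. The re Module

2.1. Compile

Python supports regular expressions through the re module.

The usual steps when using re are:

Step 1: Compile the string form of the regular expression into a Pattern instance.

Step 2: Use the Pattern instance to process text and obtain a match result (a Match instance).

Step 3: Use the Match instance to get information and carry out further operations.

Let's create a file re01.py to try re out: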

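# -*- coding: utf-8 -*-
# A simple re example: match the string "hello" in a string

# Import the re module
import re

# Compile the regular expression into a Pattern object; the r before 'hello' marks a raw string
pattern = re.compile(r'hello')

# Match text with the Pattern; match() returns None when there is no match
match1 = pattern.match('hello world!')
match2 = pattern.match('helloo world!')
match3 = pattern.match('helllo world!')

# If match1 succeeded
if match1:
    # Use the Match object to get the matched text
    print(match1.group())
else:
    print('match1 failed to match!')


# If match2 succeeded
if match2:
    # Use the Match object to get the matched text
    print(match2.group())
else:
    print('match2 failed to match!')


# If match3 succeeded
if match3:
    # Use the Match object to get the matched text
    print(match3.group())
else:
    print('match3 failed to match!')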

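The console shows the three results: match1 and match2 both print hello (match() only requires the pattern to match at the start of the text), while match3 reports a failed match.

Now let's look at the key methods in the code in detail.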

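★ re.compile(strPattern[, flag]):

This method is the factory method for the Pattern class; it compiles a regular expression given as a string into a Pattern object.

The second argument, flag, is the matching mode. Values can be combined with the bitwise OR operator '|' so that they apply together, for example re.I | re.M.

You can also specify the mode inside the regex string itself;

for example, re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').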
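A minimal sketch of that equivalence. The pattern `^hello` and the sample text are assumptions chosen so that both flags actually matter: the match is on the second line and in upper case.

```python
import re

text = "first line\nHELLO world"

with_flags = re.compile(r"^hello", re.I | re.M)   # flags passed as an argument
inline = re.compile(r"(?im)^hello")               # same flags written inside the pattern

m1 = with_flags.search(text)
m2 = inline.search(text)

print(m1.group(), m2.group())
print(m1.span() == m2.span())
```

Inline flags are placed at the very start of the pattern here; recent Python versions reject global inline flags anywhere else.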

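The available values are:

re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, here and below)
re.M (full name: MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line
re.S (full name: DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline
re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 this flag can only be used with bytes patterns)
re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (in Python 3 this is already the default for str patterns)
re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.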
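The effect of the three most commonly used flags can be seen on a small multi-line string. The sample text is an assumption made up for this sketch.

```python
import re

text = "one\ntwo\nthree"

# Without re.M, '^' only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, '.' does not match a newline.
dot_default = re.match(r".*", text).group()
dot_all = re.match(r".*", text, re.S).group()

# re.I makes the comparison case-insensitive.
ignore_case = re.match(r"one", "ONE", re.I) is not None

print(starts_default, starts_multiline)
print(repr(dot_default), repr(dot_all))
print(ignore_case)
```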

The following two regular expressions are equivalent:

# -*- coding: utf-8 -*-
# Two equivalent patterns, each matching a decimal number
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")

match11 = a.match('3.1415')
match12 = a.match('33')
match21 = b.match('3.1415')
match22 = b.match('33')

if match11:
    # Use the Match object to get the matched text
    print(match11.group())
else:
    print('match11 is not a decimal')

if match12:
    # Use the Match object to get the matched text
    print(match12.group())
else:
    print('match12 is not a decimal')

if match21:
    # Use the Match object to get the matched text
    print(match21.group())
else:
    print('match21 is not a decimal')

if match22:
    # Use the Match object to get the matched text
    print(match22.group())
else:
    print('match22 is not a decimal')

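re provides many module-level functions that carry out regular expression operations.

Each of them can be replaced by the corresponding method of a Pattern instance; their only advantage is saving one line of re.compile() code,

but in exchange you cannot reuse the compiled Pattern object.

These functions are covered together with the instance methods of the Pattern class.

For instance, the hello example from the beginning can be shortened to: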

# -*- coding: utf-8 -*-
# A simple re example: match the string "hello" in a string
import re

m = re.match(r'hello', 'hello world!')
print(m.group())

The re module also provides escape(string), which returns string with an escape character added in front of regular expression metacharacters such as *, + and ?.
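A small sketch of why `re.escape` matters when the text to find contains metacharacters. The strings are made-up samples.

```python
import re

literal = "a*b+c?"
escaped = re.escape(literal)   # each metacharacter gets a backslash in front

text = "price: a*b+c? (literal)"

# Unescaped, the string is read as a pattern: zero or more 'a', one or more 'b', optional 'c'.
unescaped_hit = re.search(literal, text).group()
# Escaped, it matches the characters exactly as written.
escaped_hit = re.search(escaped, text).group()

print(escaped)
print(repr(unescaped_hit), repr(escaped_hit))
```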

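2.2. Match

A Match object is the result of one match and contains a lot of information about it, which can be retrieved through the read-only attributes and methods that Match provides.

Attributes:

string: the text used for matching.
re: the Pattern object used for matching.
pos: the index in the text at which the regular expression starts searching. Same value as the parameter of the same name in Pattern.match() and Pattern.search().
endpos: the index in the text at which the regular expression stops searching. Same value as the parameter of the same name in Pattern.match() and Pattern.search().
lastindex: the number of the last captured group (a group number, not a position in the text). None if no group was captured.
lastgroup: the name of the last captured group. None if that group has no name or no group was captured.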

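Methods:

group([group1, …]):
Returns the string captured by one or more groups; with multiple arguments the result is a tuple. group1 can be a group number or a name; number 0 stands for the whole matched substring; with no arguments, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns its last capture.
groups([default]):
Returns a tuple of the strings captured by all groups, equivalent to calling group(1, 2, …, last). default is the value used for groups that captured nothing; it defaults to None.
groupdict([default]):
Returns a dictionary keyed by the names of the named groups, with each group's captured substring as the value; unnamed groups are not included. default has the same meaning as above.
start([group]):
Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
end([group]):
Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
span([group]):
Returns (start(group), end(group)).
expand(template):
Substitutes the matched groups into template and returns the result. template can reference groups with \id, \g<id> or \g<name>. \0 cannot be used to reference the whole match, although \g<0> can. \id and \g<id> are equivalent, except that \10 is taken to mean group 10; to express \1 followed by the character '0', you have to write \g<1>0.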

Here is a Python example that prints all of these to make things concrete:

# -*- coding: utf-8 -*-
# A simple match example

import re
# Match the following: word + space + word + any characters
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)

print("m.group():", m.group())
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

### output ###
# m.string: hello world!
# m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
# m.pos: 0
# m.endpos: 12
# m.lastindex: 3
# m.lastgroup: sign
# m.group(): hello world!
# m.group(1,2): ('hello', 'world')
# m.groups(): ('hello', 'world', '!')
# m.groupdict(): {'sign': '!'}
# m.start(2): 6
# m.end(2): 11
# m.span(2): (6, 11)
# m.expand(r'\2 \1\3'): world hello!
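A few of the edge cases described for the Match methods are easier to see in isolation: a group that captures nothing, the default argument, a group that captures several times, and \g<1>0 in expand(). The patterns and inputs below are made-up samples for this sketch.

```python
import re

# The optional named group 'frac' captures nothing for the input "42".
m = re.match(r"(\d+)(?:\.(?P<frac>\d+))?", "42")
missing = m.group(2)              # a group that captured nothing gives None
filled = m.groups("0")            # default replaces the missing capture
filled_dict = m.groupdict("0")

# A group that captures several times keeps only its last capture.
r = re.match(r"(?:(\w)-)+", "a-b-c-")
last_capture = r.group(1)

# \g<1>0 means "group 1, then the character 0"; \10 would mean group 10.
expanded = re.match(r"(\d)", "7").expand(r"\g<1>0")
# \g<0> refers to the whole match.
whole = re.match(r"(\w+) (\w+)", "hello world").expand(r"[\g<0>]")

print(missing, filled, filled_dict)
print(last_capture, expanded, whole)
```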

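2.3. Pattern

A Pattern object is a compiled regular expression; the methods it provides let you match and search text.

Pattern cannot be instantiated directly; it must be constructed with re.compile(), that is, it is the object that re.compile() returns.

Pattern provides several read-only attributes for getting information about the expression:

pattern: the expression string used at compile time.
flags: the matching mode used at compile time, as a number.
groups: the number of groups in the expression.
groupindex: a dictionary keyed by the names of the named groups in the expression, with each group's number as the value; unnamed groups are not included.

The following example shows a pattern's attributes: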

# -*- coding: utf-8 -*-
# A simple pattern example

import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)

print("p.pattern:", p.pattern)
print("p.flags:", p.flags)
print("p.groups:", p.groups)
# groupindex is a read-only mapping in Python 3; dict() gives a plain dictionary
print("p.groupindex:", dict(p.groupindex))

### output (Python 3) ###
# p.pattern: (\w+) (\w+)(?P<sign>.*)
# p.flags: 48  (re.DOTALL | re.UNICODE; Python 2 printed 16)
# p.groups: 3
# p.groupindex: {'sign': 3}

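Next, let's focus on Pattern's instance methods and how to use them.

1.match

match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):

This method tries to match pattern starting at index pos of string.

If the end of pattern is reached and everything matched, a Match object is returned;

if pattern fails to match along the way, or endpos is reached before the match is complete, None is returned.

pos and endpos default to 0 and len(string) respectively;

re.match() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.

Note: this method does not require a full match.

If string still has characters left when pattern ends, the match still counts as successful.

For a full match, add the boundary anchor '$' at the end of the expression.

Here is a simple match example: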

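# encoding: UTF-8
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello')

# Match text with the Pattern; match() returns None when there is no match
match = pattern.match('hello world!')

if match:
    # Use the Match object to get the matched text
    print(match.group())

### output ###
# hello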
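The notes on partial matching, the '$' anchor, and pos/endpos can be checked with a few calls. The input strings are assumptions chosen for this sketch.

```python
import re

pattern = re.compile(r"hello")

# Only the start of the text has to match, so trailing characters are fine.
prefix_only = pattern.match("hello world!")
# With '$' the text must end where the pattern ends.
anchored = re.match(r"hello$", "hello world!")
exact = re.match(r"hello$", "hello")

# pos and endpos are only available on the compiled Pattern's methods.
from_pos = pattern.match("say hello", 4)   # start matching at index 4
cut_short = pattern.match("hello", 0, 3)   # endpos is reached before the pattern ends

print(prefix_only is not None, anchored is None, exact is not None)
print(from_pos.span(), cut_short)
```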

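2.search

search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]): this method looks for a substring of string that matches.

It tries to match pattern starting at index pos of string;

if the end of pattern is reached and everything matched, a Match object is returned;

if the match fails, pos is increased by 1 and matching is tried again;

if there is still no match when pos = endpos, None is returned.

pos and endpos default to 0 and len(string) respectively;

re.search() cannot take these two parameters; its flags parameter specifies the matching mode used when compiling pattern.

So how does it differ from match?

match() only checks whether the pattern matches at the start of string,

while search() scans the whole string looking for a match.

match() returns a result only if the match succeeds at position 0; if the match is not at the start, match() returns None.
For example:
print(re.match('super', 'superstition').span())

returns (0, 5), while

print(re.match('super', 'insuperable'))

returns None.

search() scans the whole string and returns the first successful match.
For example:

print(re.search('super', 'superstition').span())

returns (0, 5), and
print(re.search('super', 'insuperable').span())

returns (2, 7).

Here is a search example: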

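# -*- coding: utf-8 -*-
# A simple search example

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')

# Use search() to find a matching substring; it returns None when there is none
# match() would not succeed in this example
match = pattern.search('hello world!')

if match:
    # Use the Match object to get the matched text
    print(match.group())

### output ###
# world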

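3.split

split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]): splits string at every substring that matches and returns the pieces as a list.

maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.

import re

p = re.compile(r'\d+')
print(p.split('one1two2three3four4'))

### output ###
# ['one', 'two', 'three', 'four', '']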
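The effect of maxsplit, and one way to drop the trailing empty string, can be sketched as follows. Filtering with a list comprehension is an assumption about what a caller might want, not something the original prescribes.

```python
import re

p = re.compile(r"\d+")
text = "one1two2three3four4"

all_parts = p.split(text)                # split at every match
two_splits = p.split(text, maxsplit=2)   # stop after two splits; the rest stays intact
non_empty = [part for part in all_parts if part]  # drop the empty piece after the final '4'

print(all_parts)
print(two_splits)
print(non_empty)
```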

4.findall

findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]): searches string and returns all matching substrings as a list.

import re

p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']
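One detail worth knowing when extracting data: if the pattern contains capturing groups, findall returns the captured groups rather than the whole match. A minimal sketch with assumed patterns:

```python
import re

text = "one1two2three3four4"

no_group = re.findall(r"\d+", text)              # no groups: whole matches
one_group = re.findall(r"([a-z]+)\d", text)      # one group: that group's text
two_groups = re.findall(r"([a-z]+)(\d)", text)   # several groups: tuples of groups

print(no_group)
print(one_group)
print(two_groups)
```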

5.finditer

finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]): searches string and returns an iterator that yields each match result (a Match object) in order.

import re

p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four4'):
    print(m.group(), end=' ')

### output ###
# 1 2 3 4
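Because finditer yields Match objects, positions are available as well as text, which findall does not provide. A short sketch with an assumed input string:

```python
import re

p = re.compile(r"\d+")
text = "one1two22three333"

# Collect each match together with its (start, end) position.
found = [(m.group(), m.span()) for m in p.finditer(text)]
print(found)
```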

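6.sub

sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]): replaces every matching substring in string with repl and returns the resulting string.
When repl is a string, it can reference groups with \id, \g<id> or \g<name>. \0 cannot be used to reference the whole match, although \g<0> can.
When repl is a function, it should accept a single argument (a Match object) and return the string to use as the replacement (group references in the returned string are not expanded).
count specifies the maximum number of replacements; if it is not given, all matches are replaced.

import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(p.sub(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.sub(func, s))

### output ###
# say i, world hello!
# I Say, Hello World!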
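Named group references, count, and \g<0> can be sketched on the same sentence. The group names first and second are hypothetical names chosen for this example.

```python
import re

# Same pattern as above, with hypothetical group names added.
p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello world!"

swapped = p.sub(r"\g<second> \g<first>", s)   # reference groups by name
only_first = p.sub(r"\2 \1", s, count=1)      # replace just the first match
wrapped = p.sub(r"[\g<0>]", s)                # \g<0> is the whole match

print(swapped)
print(only_first)
print(wrapped)
```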

7.subn

subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]): returns (sub(repl, string[, count]), number of replacements).

import re

p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(p.subn(r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(p.subn(func, s))

### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

That wraps up this basic introduction to regular expressions in Python ^_^

Original article: http://blog.csdn.net/pleasecallmewhy/article/details/8929576
