Web Scraping Lesson 1: Regular Expression Symbols and Methods

阿新 • Published: 2018-11-02

Lesson 1: Regular Expression Symbols and Methods

1.

. : matches any single character except a newline:

>>> import re

>>> a='xy123'

>>> b=re.findall('x',a)

>>> b

['x']

>>> b=re.findall('x...',a)
>>> b

['xy12']

So "." works as a placeholder that stands for exactly one character.
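As a small sketch of this placeholder behaviour (the extra patterns and the string containing a newline are made up for illustration), each "." consumes exactly one character, and by default it refuses to match a newline:

```python
import re

text = 'xy123'

# 'x' followed by any three characters.
three_dots = re.findall('x...', text)

# 'x' followed by five characters: only four are available, so nothing matches.
too_many = re.findall('x.....', text)

# By default "." does not match the newline character.
with_newline = re.findall('x.', 'x\ny')

print(three_dots)    # ['xy12']
print(too_many)      # []
print(with_newline)  # []
```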

2.

* : matches the preceding character zero or more times:

>>> import re
>>> a='xy123'
>>> b=re.findall('x*',a)
>>> b

['x', '', '', '', '', '']

>>> a='xyx123'
>>> b=re.findall('x*',a)
>>> b
['x', '', 'x', '', '', '', '']
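The empty strings in these results come from positions where "x*" matches zero characters. The sketch below uses re.finditer to show where each match starts and ends, and, as one possible way to avoid the empty matches, the "+" quantifier (one or more). The variable names are illustrative only.

```python
import re

text = 'xyx123'

# finditer yields a Match object per match, so the positions become visible.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer('x*', text)]
print(spans)
# [(0, 1, 'x'), (1, 1, ''), (2, 3, 'x'), (3, 3, ''), (4, 4, ''), (5, 5, ''), (6, 6, '')]

# "+" requires at least one occurrence, so no empty matches appear.
non_empty = re.findall('x+', text)
print(non_empty)  # ['x', 'x']
```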

3.

? : matches the preceding character zero times or once:

>>> b=re.findall('x?',a)
>>> b
['x', '', 'x', '', '', '', '']
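On a string where "x" never repeats, "x?" and "x*" give the same output, which hides the difference between them. A minimal sketch with a made-up input of three consecutive x characters makes it visible:

```python
import re

text = 'xxx'

# "*" takes as many consecutive 'x' characters as it can in one match.
star = re.findall('x*', text)

# "?" takes at most one 'x' per match.
question = re.findall('x?', text)

print(star)      # ['xxx', '']
print(question)  # ['x', 'x', 'x', '']
```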

4.

.* : greedy matching, which consumes as many characters as possible:

>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx.*xx',a)
>>> b

['xxIxxhyhxxlovexxhhhxxyouxx']

5.

.*? : non-greedy (lazy) matching, which consumes as few characters as possible:

>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx.*?xx',a)
>>> b
['xxIxx', 'xxlovexx', 'xxyouxx']
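The same contrast shows up when scraping markup. The sketch below runs both forms against a hypothetical HTML fragment (the tags and text are invented for illustration): the greedy pattern swallows everything between the first opening tag and the last closing tag, while the lazy pattern stops at the nearest closing tag.

```python
import re

# Hypothetical fragment, not taken from any real page.
html = '<b>one</b><b>two</b>'

greedy = re.findall('<b>.*</b>', html)
lazy = re.findall('<b>.*?</b>', html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```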

6.

() : capture group; findall returns only the text matched inside the parentheses:

>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx(.*?)xx',a)
>>> b
['I', 'love', 'you']

This extracts the targets: I, love, you.
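In a scraper, the fixed text around the parentheses is typically the markup and the group is the value you want. A minimal sketch, assuming a made-up HTML snippet with two links:

```python
import re

# Hypothetical snippet; the URLs and link text are placeholders.
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'

# Only the part inside the parentheses is returned.
links = re.findall('<a href="(.*?)">', html)

print(links)  # ['/page1', '/page2']
```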

Here is another example:

import re
s='ffsdxxhello\nxxfgfgxxworldxxhffh'

d=re.findall('xx(.*?)xx',s)

Result:

>>> d

['fgfg']

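Note that the string contains a newline, so only the match on the second line was found: by default "." does not match a newline. (Our goal was to extract hello and world.)

So how do we avoid this?

Answer: pass the re.S flag (an alias of re.DOTALL), which lets "." match newline characters too.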

import re
s='ffsdxxhello\nxxfgfgxxworldxxhffh'

d=re.findall('xx(.*?)xx',s,re.S)

Result:

>>> d

['hello\n', 'world']
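The first captured piece still carries the trailing newline. One possible way to tidy it, sketched below with the same string, is to strip whitespace from each result; the snippet also spells the flag out as re.DOTALL, which is the longer name for re.S.

```python
import re

s = 'ffsdxxhello\nxxfgfgxxworldxxhffh'

# re.S and re.DOTALL are two names for the same flag.
same_flag = (re.S == re.DOTALL)

raw = re.findall('xx(.*?)xx', s, re.DOTALL)

# Remove leading and trailing whitespace, including the captured newline.
cleaned = [piece.strip() for piece in raw]

print(same_flag)  # True
print(raw)        # ['hello\n', 'world']
print(cleaned)    # ['hello', 'world']
```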

Next, compare findall with search:

>>> s2='asssxxIxx123xxlovexxdh'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(2)
>>> f
'love'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(1)
>>> f
'I'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(0)
>>> f
'xxIxx123xxlovexx'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(3)
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(3)
IndexError: no such group

By contrast, findall returns a list with one tuple of captured groups per match:

>>> f=re.findall('xx(.*?)xx123xx(.*?)xx',s2)

>>> f
[('I', 'love')]
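One practical difference worth sketching: search returns a single Match object for the first match, or None when nothing matches, so calling .group() directly on the result can fail. The second pattern below is deliberately chosen so that it does not match; it is an illustrative assumption, not part of the lesson.

```python
import re

s2 = 'asssxxIxx123xxlovexxdh'
pattern = 'xx(.*?)xx123xx(.*?)xx'

match = re.search(pattern, s2)      # a Match object, or None
if match is not None:
    captured = match.groups()       # every captured group as one tuple
    print(captured)                 # ('I', 'love')

# A pattern that cannot match this string: search returns None.
missing = re.search('xx(.*?)xx999xx(.*?)xx', s2)
print(missing)                      # None

# findall never returns None; it returns a (possibly empty) list.
all_matches = re.findall(pattern, s2)
no_matches = re.findall('xx(.*?)xx999xx(.*?)xx', s2)
print(all_matches)                  # [('I', 'love')]
print(no_matches)                   # []
```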

Next, how to use sub:

sub performs substitution: it replaces each match of the pattern with the replacement string.

>>> s='123hfhdfhdxhdhd123'

>>> output=re.sub('123(.*?)123','123789123',s)
>>> output
'123789123'
>>> output=re.sub('123(.*?)123','123%d123'%789,s)
>>> output

'123789123'
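Two further features of re.sub are often useful, sketched here with the same string: a backreference such as \1 in the replacement reuses what a group captured, and the count argument limits how many matches are replaced. The bracket wrapping is just an illustrative choice.

```python
import re

s = '123hfhdfhdxhdhd123'

# \1 in the replacement stands for the text captured by group 1.
wrapped = re.sub(r'123(.*?)123', r'[\1]', s)
print(wrapped)  # [hfhdfhdxhdhd]

# count=1 replaces only the first match.
once = re.sub('h', 'H', s, count=1)
print(once)     # 123Hfhdfhdxhdhd123
```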

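It is usually best not to bother with compile here: module-level functions such as re.findall already compile the pattern and cache it internally.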
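For reference, a minimal sketch showing that the module-level call and an explicitly compiled pattern return the same result, so in short scripts the extra step adds little. The sample string is made up for illustration.

```python
import re

a = 'abc123def4567'

# Module-level function: the pattern is compiled behind the scenes.
direct = re.findall(r'\d+', a)

# Explicit pattern object, then the same method called on it.
compiled = re.compile(r'\d+')
via_object = compiled.findall(a)

print(direct)                # ['123', '4567']
print(direct == via_object)  # True
```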

A special pattern for matching runs of digits:

\d+

>>> a='dsgdgd1112255555555555hdhdgdgd'
>>> c=re.findall(r'(\d+)',a)

>>> c

['1112255555555555']

>>> b='dghgd11111111ysdysdys2222223ddh'

>>> dc=re.findall(r'(\d+)',b)
>>> dc
['11111111', '2222223']
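findall always returns strings, so scraped numbers usually need a conversion step. A minimal sketch using the same input; it writes the pattern as a raw string (r'...') so that the backslash reaches the regex engine untouched, and drops the parentheses, which are optional when the whole match is wanted.

```python
import re

b = 'dghgd11111111ysdysdys2222223ddh'

# Each match is a string of digits; int() turns it into a number.
numbers = [int(digits) for digits in re.findall(r'\d+', b)]

print(numbers)  # [11111111, 2222223]
```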
