Regex that excludes full-width characters: Python regular expressions

阿新 • Published: 2021-02-08

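Regular expressions (regular expression, re)

A concept from computer science.
A single string is used to describe and match strings that follow a given rule.
Programs often need to find strings that satisfy complex rules; a regular expression is the tool for describing those rules.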

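Regular expression syntax

- Reference
    - https://deerchao.cn/tutorials/regex/regex.htm
- Selected syntax
    - . : matches any character except \n
    - [] : matches any single character from the character set
    - \d : matches a digit
    - \D : matches a non-digit
    - \s : matches a whitespace character (including \r, \n, \t and so on)
    - \S : matches a non-whitespace character
    - \w : matches a letter, digit or underscore
    - \W : matches anything other than a letter, digit or underscore
    - * : repeats the preceding item 0 or more times
    - + : repeats the preceding item 1 or more times
    - ? : repeats the preceding item 0 or 1 times
    - {M,N} : repeats the preceding item at least M and at most N times
    - ^ : matches the start of the string
    - $ : matches the end of the string
    - \b : matches a word boundary
    - (exp) : matches exp and captures it in an automatically numbered group
    - (?P<name>exp) : matches exp and captures it in a group named name (Python's re requires the P; the (?<name>exp) form from the referenced tutorial is rejected)
    - | : alternation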
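A minimal sketch of a few of the elements above working together. The sample strings and variable names are made up for illustration only.

```python
import re

# \b, \d, {M,N} and a named group: Python spells named groups (?P<name>...).
date_pattern = re.compile(r'\b(?P<year>\d{4})-(?P<month>\d{1,2})\b')
m = date_pattern.search("released 2021-02 in beta")
year = m.group('year')
month = m.group('month')

# Alternation with | and a character class with [].
pets = re.findall(r'cat|dog', "cat, dog, bird")
vowels = re.findall(r'[aeiou]', "regex")

print(year, month)   # 2021 02
print(pets)          # ['cat', 'dog']
print(vowels)        # ['e', 'e']
```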

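Regular expression modifiers - optional flags

- re.I  makes matching case-insensitive
- re.L  performs locale-aware matching
- re.M  multi-line matching; affects ^ and $
- re.S  makes . match every character, including newline
- re.U  interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B.
- re.X  allows a more flexible layout (whitespace and comments inside the pattern) so that the regular expression is easier to read.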
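The effect of the most common flags can be sketched as follows. The text and the phone-number pattern are illustrative assumptions, not taken from the article.

```python
import re

text = "Hello\nworld"

# re.I: ignore case.
case_insensitive = re.findall(r'hello', text, re.I)

# re.M: ^ matches at the start of every line, not only the start of the string.
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: without it, . does not match the newline between the two words.
without_dotall = re.search(r'Hello.world', text)
with_dotall = re.search(r'Hello.world', text, re.S)

# re.X: whitespace and # comments inside the pattern are ignored.
phone = re.compile(r"""
    (\d{3})   # first block of digits
    -         # separator
    (\d{4})   # second block of digits
""", re.X)
parts = phone.search("call 555-1234").groups()

print(case_insensitive)   # ['Hello']
print(line_starts)        # ['Hello', 'world']
print(without_dotall)     # None
print(parts)              # ('555', '1234')
```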

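Using the re module

1. The compile function builds a regular expression object from a pattern string and optional flag arguments. That object has a set of methods for matching and replacing.
2. The re module also provides functions with the same functionality as these methods; each takes a pattern string as its first argument.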

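compile

1. Use compile to turn the string that holds the regular expression into a Pattern object
2. Call the methods of the Pattern object to search the text; the result of a successful match is a Match object
3. Finally, use the attributes and methods of the Match object to read the information you need and act on it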

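Common methods of the Pattern object

match(string[, pos[, endpos]])
- Matches at the beginning of the string (or at a given start position). It is a single match: it returns as soon as one result is found and does not look for all results
- On success it returns a Match object; if nothing matches it returns None
search(string[, pos[, endpos]])
- Looks for a match anywhere in the string. It is also a single match: it returns as soon as one result is found and does not look for all results
- On success it returns a Match object; if nothing matches it returns None
findall(string[, pos[, endpos]])
- Scans the whole string and collects every match
- findall returns all matching substrings as a list, or an empty list if nothing matches
finditer(string[, pos[, endpos]])
- Like findall, but returns an iterator that yields a Match object for each match
split(string[, maxsplit])
- Splits the string at every matching substring and returns the pieces as a list
sub(repl, string[, count])
- Used for replacement
- If repl is a string, every matching substring is replaced with repl and the resulting string is returned
- If repl is a function, it must accept a single argument (a Match object) and return the replacement string (group references inside the returned string are not expanded).
subn(repl, string[, count])
- Also used for replacement; returns a tuple (new string, number of substitutions)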
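The methods above that the article does not demonstrate later (split, sub with a function, subn) might be exercised like this. The input string is reused from the article's search example; the variable names are illustrative.

```python
import re

p = re.compile(r'\d+')
text = "one12two34three567four"

# split: cut the string at every run of digits.
pieces = p.split(text)
first_cut_only = p.split(text, maxsplit=1)

# sub with a function: the function receives a Match object
# and returns the replacement text, here each number doubled.
doubled = p.sub(lambda m: str(int(m.group()) * 2), text)

# subn: same as sub, but also reports how many replacements were made.
masked, replaced = p.subn('#', text)

print(pieces)          # ['one', 'two', 'three', 'four']
print(first_cut_only)  # ['one', 'two34three567four']
print(doubled)         # one24two68three1134four
print(masked, replaced)  # one#two#three#four 3
```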

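Match object (returned by the match and search methods)

group([group1, …]) returns the string matched by one or more groups; to get the whole matched substring, call group() or group(0)
start([group]) returns the start position of the group's substring within the whole string (the index of its first character); the argument defaults to 0
end([group]) returns the end position of the group's substring within the whole string (the index of its last character + 1); the argument defaults to 0
span([group]) returns (start(group), end(group))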

# Find digits
import re

p = re.compile(r'\d+')
m = p.match("one12twothree33456four78", 3, 26)
print(m)
print(m[0])
print(m.start(0))
print(m.end(0))
<_sre.SRE_Match object; span=(3, 5), match='12'>
12
3
5
import re

p = re.compile(r'([a-z]+) ([a-z]+)', re.I)

m = p.match("I am really love wangxiaojing")
print(m)
print(m.group(0))
print(m.start(0))
print(m.end(0))
<_sre.SRE_Match object; span=(0, 4), match='I am'>
I am
0
4
print(m.group(2))
print(m.start(1))
print(m.end(1))
am
0
1

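Searching

search(string[, pos[, endpos]]): looks for a match in the string; pos and endpos are the start and end positions of the search
findall: finds all matches
finditer: finds all matches and returns an iterator over them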

import re
p = re.compile(r'\d+')
m = p.search("one12two34three567four")
print(m.group())
12
rst = p.findall("one12two34three567four")
print(type(rst))
print(rst)
<class 'list'>
['12', '34', '567']
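finditer and the pos/endpos arguments are described above without an example. A short sketch on the same input string (variable names are illustrative):

```python
import re

p = re.compile(r'\d+')
text = "one12two34three567four"

# pos: start searching at index 5, which skips the first number.
second_number = p.search(text, 5).group()

# pos and endpos: only text[0:10] ("one12two34") is scanned.
in_first_ten = p.findall(text, 0, 10)

# finditer yields Match objects, so positions are available as well.
found = [(m.group(), m.span()) for m in p.finditer(text)]

print(second_number)  # 34
print(in_first_ten)   # ['12', '34']
print(found)          # [('12', (3, 5)), ('34', (8, 10)), ('567', (15, 18))]
```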

Replacing with sub

sub(repl, string[, count])

# Example of replacing with sub
import re
p = re.compile(r'(\w+) (\w+)')
s = "hello 123 wang 456 xiaojing, i love you"
rst = p.sub(r"Hello world", s)
print(rst)
Hello world Hello world xiaojing, Hello world you
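When repl is a string it may also refer back to the captured groups with \1, \2 and so on. A small sketch using the same two-word pattern; the shorter input string is an assumption made to keep the result easy to read.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = "hello 123 wang 456"

# \2 \1 swaps the two captured words of every match.
swapped = p.sub(r'\2 \1', s)

# count limits how many matches are replaced.
swapped_once = p.sub(r'\2 \1', s, count=1)

print(swapped)       # 123 hello 456 wang
print(swapped_once)  # 123 hello wang 456
```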

Matching Chinese

Most Chinese text falls in the range [\u4e00-\u9fa5], which does not include full-width punctuation

import re

title = u'世界 你好, hello moto'

p = re.compile(r'[\u4e00-\u9fa5]+')
r = p.findall(title)
print(r)
['世界', '你好']
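The title mentions input that must not contain full-width characters. One way this could be checked is sketched below, under the assumption that "full-width" means the full-width ASCII variants U+FF01 to U+FF5E plus the ideographic space U+3000. The names FULLWIDTH and contains_fullwidth are hypothetical, and CJK punctuation outside that range (for example U+3002) is not covered.

```python
import re

# Assumption: full-width ASCII variants (U+FF01-U+FF5E) and the
# ideographic space (U+3000) are the characters to reject.
FULLWIDTH = re.compile(r'[\uff01-\uff5e\u3000]')

def contains_fullwidth(text):
    """Return True if text has at least one character in the assumed range."""
    return FULLWIDTH.search(text) is not None

print(contains_fullwidth("hello\uff0cworld"))  # True, U+FF0C is a full-width comma
print(contains_fullwidth("hello, world"))      # False, plain ASCII comma
```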

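Greedy and non-greedy

Greedy: match as much as possible; quantifiers such as * are greedy
Non-greedy: stop at the shortest text that satisfies the pattern; a ? placed after a quantifier (for example *?) makes it non-greedy
Regular expressions are greedy by default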

import re
title = u'<div>name</div><div>age</div>'

p1 = re.compile(r'<div>.*</div>')
p2 = re.compile(r'<div>.*?</div>')

m1 = p1.search(title)
print(m1.group())

m2 = p2.search(title)
#print(m2)
print(m2.group())
<div>name</div><div>age</div>
<div>name</div>
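The same difference shows up with findall and a capture group, which is a common way to pull out the content of each tag. This is only a sketch for simple, non-nested input; the variable names are illustrative.

```python
import re

html = '<div>name</div><div>age</div>'

# Greedy: .* runs to the last </div>, so there is a single, too-long result.
greedy = re.findall(r'<div>(.*)</div>', html)

# Non-greedy: .*? stops at the first </div> each time.
lazy = re.findall(r'<div>(.*?)</div>', html)

print(greedy)  # ['name</div><div>age']
print(lazy)    # ['name', 'age']
```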

re module functions

The methods of a Pattern object produced by compile correspond to most of the functions in the re module
re.match(pattern, string, flags=0)
re.search(pattern, string, flags=0)
re.findall(pattern, string, flags=0)
re.finditer(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)
re.subn(pattern, repl, string, count=0, flags=0)
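A quick sketch that calls each of the module-level functions listed above on one made-up input string:

```python
import re

text = "a1,b22;c333"

at_start = re.match(r'[a-z]\d', text).group()    # must match at index 0
two_digits = re.search(r'\d{2}', text).group()   # first match anywhere
numbers = re.findall(r'\d+', text)
lengths = [len(m.group()) for m in re.finditer(r'\d+', text)]
fields = re.split(r'[,;]', text)
masked = re.sub(r'\d+', '#', text)
masked_twice = re.subn(r'\d+', '#', text, count=2)

print(at_start, two_digits)  # a1 22
print(numbers)               # ['1', '22', '333']
print(lengths)               # [1, 2, 3]
print(fields)                # ['a1', 'b22', 'c333']
print(masked)                # a#,b#;c#
print(masked_twice)          # ('a#,b#;c333', 2)
```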

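Which to use

If a regular expression is used many times (such as \d+ above) and in many places, it is better to compile it once into a Pattern object and then call that object's methods on the text to be matched. Calling re.match, re.search and similar functions directly passes the pattern string on every call. The re module caches recently compiled patterns, so the pattern is not literally recompiled each time, but every call still pays for a cache lookup, and a pattern can be evicted from the cache when many different ones are in use.
Therefore the first approach (compile) is recommended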
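The compile-once style might look like this in practice. The name DIGITS and the sample lines are hypothetical.

```python
import re

# Compiled once at import time and reused for every line.
DIGITS = re.compile(r'\d+')

lines = ["order 12", "no numbers", "items 3 and 45"]

# The same Pattern object handles every call; no pattern string is passed again.
numbers_per_line = [DIGITS.findall(line) for line in lines]

print(numbers_per_line)  # [['12'], [], ['3', '45']]
```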

Exercises

# 1. Match the first letter of every word in a line of text

import re
s = " i love you but you don't love me"

content = re.findall(r'\b\w', s)
print(content)
['i', 'l', 'y', 'b', 'y', 'd', 't', 'l', 'm']
# 2. Match the leading digit of every word that starts with a digit

import re
s = "i 22love 33you 44but 5you don't66 7love 88me"

content = re.findall(r'\b\d', s)
print(content)
['2', '3', '4', '5', '7', '8']
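# 3. Match lines that contain only digits and letters
# Note: \w+ returns every word in the text, and re.M has no effect
# here because the pattern contains neither ^ nor $
s = 'i love you \n2222kkkk but \ndfe23 you dont love \n23243dd'
content = re.findall(r'\w+', s, re.M)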
print(content)
['i', 'love', 'you', '2222kkkk', 'but', 'dfe23', 'you', 'dont', 'love', '23243dd']
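The pattern in exercise 3 returns every word rather than whole lines. One possible way to keep only the lines made up entirely of letters and digits is to anchor the pattern with ^ and $ under re.M. This is a sketch, not the article's own solution; note that \w would also accept an underscore, so an explicit character class is used here.

```python
import re

s = 'i love you \n2222kkkk but \ndfe23 you dont love \n23243dd'

# With re.M, ^ and $ match at the start and end of each line,
# so only lines consisting solely of letters and digits are returned.
whole_lines = re.findall(r'^[A-Za-z0-9]+$', s, re.M)

print(whole_lines)  # ['23243dd']
```

Every other line in the sample contains spaces, which is why only the last one is returned.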