Regular expressions and basic text matching

阿新 • Published: 2021-02-09

Meta character Description

Character	Meaning
.	Period matches any single character except a line break.
[ ]	Character class. Matches any character contained between the square brackets.
[^]	Negated character class. Matches any character that is not contained between the square brackets.
*	Matches 0 or more repetitions of the preceding symbol.
+	Matches 1 or more repetitions of the preceding symbol.
?	Makes the preceding symbol optional.
{n,m}	Braces. Matches at least “n” but not more than “m” repetitions of the preceding symbol.
(xyz)	Capture group. Matches the characters xyz in that exact order. If you do not want to capture this group, use (?:xyz).
|	Alternation. Matches either the characters before or the characters after the symbol.
\	Escapes the next character. This allows you to match reserved characters [ ] ( ) { } . * + ? ^ $ \
^	Matches the beginning of the input.
$	Matches the end of the input.
\d \D	\d matches a single digit 0-9, \D anything but a digit
\w \W	\w matches alphanumeric characters and the underscore (a-z, A-Z, 0-9 and _); \W matches anything else.
\s \S	\s matches a whitespace character (such as a space, \t, \n); \S matches anything but whitespace.
\n	Matches a linebreak
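
A few of the entries in the table can be tried out directly. The following is only a minimal sketch; the sample string is made up for illustration and does not come from the text used later.

```python
import re

sample = "Order 66 shipped on 2021-02-09 to cat or dog owners."  # made-up sample string

# \d{4} : exactly four digits in a row
years = re.findall(r'\d{4}', sample)

# \d{1,2} : one or two digits, delimited by word boundaries
short_numbers = re.findall(r'\b\d{1,2}\b', sample)

# | : alternation between two literal words
pets = re.findall(r'cat|dog', sample)

# \w+ : runs of letters, digits, and underscores
words = re.findall(r'\w+', sample)

print(years, short_numbers, pets, words[:3])
```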

Import the regex package

import re
text = '''
In the short term, the impact of the COVID-19 disease on China’s economic growth will be very obvious. 
Since the outbreak, many domestic and foreign institutions have made their estimations (see Figure 1). 
Most of them believe that the GDP growth rate in the first quarter may be about 4%, a decline by about 2 percentage points. 
The growth rates in the next three quarters will gradually pick up depending on when the outbreak ends, 
and the annual GDP growth will show a "V-shaped" pattern.
In light of the prevailing analyses and estimations of domestic and foreign institutions, 
we believe that if the outbreak could be largely over in late March or early April, 
the growth rates in the four quarters of this year may reach 4.5%...%, 5.0%, 5.8%, and 5.7% respectively. 
The annual growth rate may be 5.2-5.3%.
'''

''' ''': a string literal placed between triple quotes can span several lines. Let's see what the text actually looks like in Python:

text


The simple match

term = 'COVID-19'
a = re.findall(term, text) # findall returns a list of all non-overlapping matches
a[0]

The findall method returns a list of all non-overlapping matches.
The search is not restricted to any particular language:

term = '中國'
b= re.search(term, "中國GDP增幅受到COVID-19影響，增量下降。中國的") # search scans the whole string and reports the first occurrence
b, b.group()

The search method scans the whole string and reports the first occurrence.

term = '美國'
b= re.match(term, "中國GDP增幅受到COVID-19影響，增量下降。中國的...") # match only checks whether the pattern matches at the beginning of the string
b # b.group() would raise an AttributeError since b is None

The match method only tries to match at the beginning of the string; it does not scan further.
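
The difference between match, search, and fullmatch can be sketched on one short string. The sentence below is a made-up example, and the None check is one possible way to avoid calling group() on a failed match.

```python
import re

sentence = "GDP growth fell after COVID-19"  # made-up sample string

at_start = re.match(r'COVID', sentence)      # None: the string does not start with COVID
anywhere = re.search(r'COVID', sentence)     # finds the first occurrence anywhere
whole = re.fullmatch(r'GDP.*19', sentence)   # the pattern must cover the entire string

# Guard against None before calling .group()
found = anywhere.group() if anywhere else None
print(at_start, found, anywhere.span(), whole is not None)
```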

c = re.sub('中國', '美國',"中國GDP增幅受到COVID-19影響，增量下降。中國的...") # sub can be used to substitute terms
c

The sub method substitutes every match of the pattern with the replacement.
About the use of the backslash:

d = re.findall(r'\.\nI', text) # a period, then a newline, then the letter I
d


a = re.findall(r'\\', r'd\9reter') # a literal backslash must itself be escaped in the pattern
print(a[0])


The full stop

The full stop . is the simplest example of a metacharacter. It matches any single character except, by default, a newline. For example, the regular expression .ar means: any character, followed by the letter a, followed by the letter r.

term = 'th.'
b = re.findall(term, 'In the short term, the impact of the COVID-19 disease on China’s economic growth will be large')
b
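
Whether the full stop crosses a line break depends on the re.DOTALL flag. A minimal sketch, using a made-up two-line string:

```python
import re

two_lines = "pattern.\nIn light"  # made-up two-line sample

# By default . stops at the newline, so each line is matched separately
default_dot = re.findall(r'n.+', two_lines)

# With re.DOTALL the . also matches the newline, giving one long match
dotall_dot = re.findall(r'n.+', two_lines, re.DOTALL)

print(default_dot)
print(dotall_dot)
```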


The Repetitions, Character Sets, and Captured Groups

re.findall(r'.*', 'In the short term, the impact of the COVID-19 disease on China’s economic growth will be very obvious.')


re.findall(r'o.?', 'In the short term, the impact of the COVID-19 disease on China’s economic growth will be very obvious.')


re.findall('on', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest out of all countries.')


# match every word that starts with o
re.findall('\so.+\s', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest out of all countries.')

The above approach fails because + is greedy: it tries to find the longest possible match. Here the match needs to be lazy instead, so that it stops as soon as a match is possible.
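
The two modes are easier to compare on a shorter string. This is just an illustrative sketch with a made-up input:

```python
import re

tagged = "<b>bold</b> and <i>italic</i>"  # made-up sample string

greedy = re.findall(r'<.+>', tagged)   # + takes as much as possible
lazy = re.findall(r'<.+?>', tagged)    # +? stops at the first closing >

print(greedy)
print(lazy)
```

Applied to the sentence above, the lazy version looks like this: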

re.findall('\so.+?\s', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest out of all countries.')

It missed the first ‘on’ in ‘on the contrary’. This is because \s matches whitespace only, not the beginning or the end of the string. So instead of \s, we use \b to match word boundaries.

re.findall('\\bo.+?\\b', 'on the contrary, the impact \a of the COVID-19 disease on US’s economic growth will be very the largest out of all countries.')

Why '\\b' instead of '\b'? Because in a Python string literal '\b' is a special character, the backspace ('\x08'). So when you write '\bo.+?\b', what the findall function gets as its argument is '\x08o.+?\x08'. You need an extra '\' to escape the first '\', so that '\\b' reaches the regex engine as \b (a word boundary) instead of a backspace character. This is what escape characters are for.
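
The three spellings can be compared directly in the interpreter. A small sketch; the variable names and the test string are chosen here for illustration only:

```python
import re

plain = '\b'      # one character: backspace (\x08)
escaped = '\\b'   # two characters: a backslash and the letter b
raw = r'\b'       # raw string: also a backslash and the letter b

print(len(plain), len(escaped), escaped == raw)

# Only the two-character form reaches the regex engine as a word boundary
with_backspace = re.findall(plain + 'on', 'on and on')  # looks for backspace + "on"
with_boundary = re.findall(raw + 'on', 'on and on')     # looks for "on" at a word boundary
print(with_backspace, with_boundary)
```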

An easier way: write the pattern as a raw string

# put an r before the quotes to turn a string literal into a raw string
re.findall(r'\bo.+?\b', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest out of all countries.')

More about '?'

# ? can also mean match zero or one times
re.findall(r'\bo.?\b', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest out of all countries.')


re.findall(r'\bone?\b', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')

Other anchors include the beginning (^) and the end ($) of a string

re.findall(r'^on', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')


re.findall(r'\w*\.$', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')
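
By default ^ and $ refer to the whole string. With the re.MULTILINE flag they also apply to each line, which matters for multi-line text such as the triple-quoted string used earlier. A minimal sketch on a made-up two-line string:

```python
import re

lines = "on the first line\non the second line"  # made-up two-line sample

single = re.findall(r'^on', lines)                # ^ matches only at the start of the string
multi = re.findall(r'^on', lines, re.MULTILINE)   # ^ also matches right after each newline

print(single, multi)
```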

[], {} and ()

re.findall(r'[0-9]+', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')

The ^ sign inside [] is not an anchor. Use [^...] for a negated set.

re.findall(r'[^0-9]+', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')

What if we only want to match ‘on’ or ‘of’:

# use () to capture groups
re.findall(r'\bo(n|f)\b', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')

In the above example the pattern matches the right words, but () also captures the part inside the parentheses, so findall returns only the captured groups. If we want to match without capturing, we use (?: ).

re.findall(r'\bo(?:n|f)\b', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')
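
The effect of capturing on what findall returns can be sketched on a shorter, made-up phrase. When both the whole match and the group are needed, re.finditer gives access to each match object:

```python
import re

phrase = "on top of one"  # made-up sample string

captured = re.findall(r'\bo(n|f)\b', phrase)       # returns only the group contents
uncaptured = re.findall(r'\bo(?:n|f)\b', phrase)   # returns the whole matches

# group(0) is the whole match, group(1) the first captured group
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'\bo(n|f)\b', phrase)]

print(captured, uncaptured, pairs)
```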


# | has the lowest precedence, so this pattern reads as: \bo(?:n|f) OR out\b
re.findall(r'\bo(?:n|f)|out\b', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')

Read the following two calls carefully to see how they differ and what they have in common.

re.findall(r'(on).*?(on)', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')


re.findall(r'(?:on).*?(?:on)', 'on the contrary, the impact of the COVID-19 disease on US’s economic growth will be very the largest one out of all countries.')


Put it all together

text


term1 = r'\d%'
re.findall(term1, text) # not right: only the last digit before each % is matched


term2 = '[0-9.]%' # note that . inside [] is a literal period, not the any-character metacharacter
re.findall(term2, text) # getting better, still not right


term3 = '[0-9.]+?%' # +? is lazy: it matches as few characters as possible, while a plain + is greedy and matches as many as possible
re.findall(term3, text)


# get rid of the % sign, and do not match ...%
term4 = r'([0-9][0-9.]*?)%'
re.findall(term4, text)


# or use lookarounds, which match a pattern only if it is followed or preceded by another given pattern (the second pattern is not part of the match).
# here we use a positive lookahead, which asserts that the first part of the expression must be followed by the lookahead expression.
term5 = r'[0-9][0-9.]*?(?=%)'
re.findall(term5, text)


# if you want to find numbers that are not followed by the percent sign, use a negative lookahead
term6 = r'[0-9][0-9.]*(?!%|[0-9.])'
re.findall(term6, text)


# you can also use positive/negative lookbehind to match a pattern that has to be preceded by another
term7 = r'(?<=\$)([0-9.]*)\.'
re.findall(term7, "this watermelon is $3.4. 3 times cheaper than that one.")
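
The negative form of the lookbehind works the other way round. The following sketch contrasts the two; the sample sentence and the number pattern \d+(?:\.\d+)? are assumptions made for this illustration, not the patterns used above.

```python
import re

price_note = "this watermelon is $3.4, about 3 times cheaper"  # made-up sample string

# (?<=\$) positive lookbehind: the number must come right after a dollar sign
with_dollar = re.findall(r'(?<=\$)\d+(?:\.\d+)?', price_note)

# (?<![\d$.]) negative lookbehind: the number must not follow $, a digit, or a period
without_dollar = re.findall(r'(?<![\d$.])\d+(?:\.\d+)?', price_note)

print(with_dollar, without_dollar)
```

The digit and the period are included in the negative lookbehind so that the match cannot simply start one character later, inside the number that follows the dollar sign.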
