Eloquent JavaScript #09# Regular Expressions
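Index
- Notes
  - Two ways to create a regular expression in JS
  - Matching in JS (1)
  - Character sets
  - Repetition
  - Grouping (subexpressions)
  - Matching in JS (2)
  - The Date class
  - Matching the whole string
  - Choice patterns
  - How matching works
  - Backtracking
  - Replace
  - Greed
  - Dynamically building regular expressions
  - Search
  - The lastIndex property
  - Looping over matches
  - Parsing an INI file
  - International characters
- Exercises
  - Regexp golf
  - Quoting style
  - Numbers again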
Notes
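1. Regular expressions help us find specific patterns in strings.
Two equivalent ways to create a regular expression in JS:
let re1 = new RegExp("abc");
let re2 = /abc/;
2. Applying a regular expression
console.log(/abc/.test("abcde"));
// → true
console.log(/abc/.test("abxde"));
// → false
3. Character sets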
\d | Any digit character
\w | An alphanumeric character (“word character”)
\s | Any whitespace character (space, tab, newline, and similar)
\D | A character that is not a digit
\W | A nonalphanumeric character
\S | A nonwhitespace character
. | Any character except for newline

/abc/ | A sequence of characters
/[abc]/ | Any character from a set of characters
/[^abc]/ | Any character not in a set of characters
/[0-9]/ | Any character in a range of characters
/x+/ | One or more occurrences of the pattern x
/x+?/ | One or more occurrences, nongreedy
/x*/ | Zero or more occurrences
/x?/ | Zero or one occurrence
/x{2,4}/ | Two to four occurrences
/(abc)/ | A group
/a|b|c/ | Any one of several patterns
/\d/ | Any digit character
/\w/ | An alphanumeric character (“word character”)
/\s/ | Any whitespace character
/./ | Any character except newlines
/\b/ | A word boundary
/^/ | Start of input
/$/ | End of input
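Escapes such as \d keep their meaning inside [ ], but special characters such as . and + do not: between brackets they are treated as ordinary characters.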
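A quick check of this rule, as a minimal sketch (the name digitDotPlus and the sample strings are made up for illustration):

```javascript
// Inside [ ], \d still means "any digit", while . and + are plain characters.
const digitDotPlus = /^[\d.+]+$/;
console.log(digitDotPlus.test("+1.5"));
// → true
console.log(digitDotPlus.test("+1a5"));
// → false ("a" is not covered by the . inside the set)

// Outside the brackets, . matches any character except a newline.
console.log(/^.$/.test("a"));
// → true
console.log(/^[.]$/.test("a"));
// → false
console.log(/^[.]$/.test("."));
// → true
```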
To invert the whole set, for example to match any character that is neither 0 nor 1:
let notBinary = /[^01]/;
console.log(notBinary.test("1100100010100110"));
// → false
console.log(notBinary.test("1100100010200110"));
// → true
4. Repetition
+ means one or more, * means zero or more:
console.log(/'\d+'/.test("'123'"));
// → true
console.log(/'\d+'/.test("''"));
// → false
console.log(/'\d*'/.test("'123'"));
// → true
console.log(/'\d*'/.test("''"));
// → true
? zero or one
let neighbor = /neighbou?r/;
console.log(neighbor.test("neighbour"));
// → true
console.log(neighbor.test("neighbor"));
// → true
{2} means a pattern should occur a precise number of times. It is also possible to specify a range this way: {2,4} means the element must occur at least twice and at most four times.
let dateTime = /\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/;
console.log(dateTime.test("1-30-2003 8:45"));
// → true
You can also specify open-ended ranges when using braces by omitting the number after the comma. So, {5,}
means five or more times.
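A small illustration of the open-ended range, as a sketch (the name fiveOrMoreDigits and the inputs are made up; ^ and $ are added so that the whole string has to match):

```javascript
// {5,} has no upper bound: five or more digits.
const fiveOrMoreDigits = /^\d{5,}$/;
console.log(fiveOrMoreDigits.test("1234"));
// → false
console.log(fiveOrMoreDigits.test("12345"));
// → true
console.log(fiveOrMoreDigits.test("1234567890"));
// → true
```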
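5. Grouping (subexpressions)
The elements inside parentheses are treated as a single element (a group, or subexpression):
let cartoonCrying = /boo+(hoo+)+/i;
console.log(cartoonCrying.test("Boohoooohoohooo"));
// → true
The i makes the expression case insensitive.
6. Another way to match
exec gives us extra information about the match:
let match = /\d+/.exec("one two 100");
console.log(match);
// → ["100"]
console.log(match.index);
// → 8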
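Return value of exec: null when the match fails, and an array like the one above when it succeeds.
An equivalent way to write it:
console.log("one two 100".match(/\d+/)); // → ["100"]
When the expression contains parentheses:
let quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'hello'"));
// → ["'hello'", "hello"]
console.log(/bad(ly)?/.exec("bad"));
// → ["bad", undefined]
console.log(/(\d)+/.exec("123"));
// → ["123", "3"]
The first element of the returned array is the string matched by the whole regular expression. The elements after it are the strings matched by each parenthesized group (subexpression): undefined if the group did not take part in the match, and only the last match if the group matched several times. As you would expect, a group's text is almost always a substring of the first element.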
7. The Date class
console.log(new Date());
// → Sat Sep 01 2018 13:54:43 GMT+0800 (China Standard Time)
console.log(new Date(2009, 11, 9));
// → Wed Dec 09 2009 00:00:00 GMT+0800 (China Standard Time)
console.log(new Date(2009, 11, 9, 12, 59, 59, 999));
// → Wed Dec 09 2009 12:59:59 GMT+0800 (China Standard Time)
console.log(new Date(1997, 10, 19).getTime());
// → 879868800000
console.log(new Date(1387407600000));
// → Thu Dec 19 2013 07:00:00 GMT+0800 (China Standard Time)
console.log(new Date().getTime());
// → 1535781283593
console.log(Date.now());
// → 1535781283593
Creating a date from a string with a regular expression:
"use strict";
function getDate(string) {
  let [_, month, day, year] =
    /(\d{1,2})-(\d{1,2})-(\d{4})/.exec(string);
  return new Date(year, month - 1, day);
}
console.log(getDate("1-30-2003"));
// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)
PS. The underscore has no meaning other than as a placeholder for the full-match element, which is skipped.
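Note that getDate throws when the string contains no date, because exec returns null and null cannot be destructured. A defensive variant could look like the sketch below; the name getDateOrNull is hypothetical and not from the book:

```javascript
// Hypothetical null-safe variant of getDate.
function getDateOrNull(string) {
  const match = /(\d{1,2})-(\d{1,2})-(\d{4})/.exec(string);
  // exec found nothing, so there is nothing to destructure
  if (match === null) return null;
  const [_, month, day, year] = match;
  // Months are zero-based in the Date constructor
  return new Date(year, month - 1, day);
}

console.log(getDateOrNull("no date here"));
// → null
console.log(getDateOrNull("born on 1-30-2003").getFullYear());
// → 2003
```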
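8. Forcing a match on the whole string
Use ^ and $. For example, /^\d+$/ matches a string that consists entirely of digits, /^!/ matches any string that starts with !, and /x^/ cannot match anything.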
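The three patterns just mentioned, tried on a few sample inputs (a sketch; the name allDigits and the inputs are made up for illustration):

```javascript
const allDigits = /^\d+$/;
console.log(allDigits.test("2024"));
// → true
console.log(allDigits.test("2024 AD"));
// → false (the whole string must be digits)
console.log(/\d+/.test("2024 AD"));
// → true (without anchors, a match anywhere is enough)
console.log(/^!/.test("!important"));
// → true
console.log(/x^/.test("x"));
// → false (the start of input can never come after an x)
```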
Use \b to mark a word boundary:
console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false
console.log(/\bcat\b/.test("xx cat xx"));
// → true
9. Choice patterns
let animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs"));
// → true
console.log(animalCount.test("15 pigchickens"));
// → false
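10. How matching works
When you run a match (with test or exec), the regexp engine first tries to match starting at the beginning of the given string, then starting at the second character, then the third, and so on, until it either finds a match or reaches the end of the string. It returns the first match it finds, or nothing at all.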
/**
 * Simulates matching the regexp \b\d+ (pig|cow|chicken)s?\b
 * against "the 3 pigs".
 */
const str = "the 3 pigs";

function simulateRegex(str, start) {
  const isDigit = ch => ch >= "0" && ch <= "9";
  // Try each start position in turn until a match is found or the string ends
  for (let currentPosition = start; currentPosition < str.length; ++currentPosition) {
    let tempPosition = currentPosition;
    // Word boundary. Simplified: only the start of the string or a preceding
    // space counts, although the real \b also accepts punctuation here.
    if (tempPosition != 0 && str[tempPosition - 1] != " ") continue;
    // At least one digit
    if (!isDigit(str[tempPosition++])) continue;
    // Keep consuming digits
    while (isDigit(str[tempPosition])) tempPosition++;
    // Exactly one space
    if (str[tempPosition++] != " ") continue;
    // One of the alternatives
    let word = ["pig", "cow", "chicken"].find(w => str.startsWith(w, tempPosition));
    if (!word) continue;
    tempPosition += word.length;
    // The s is optional
    if (str[tempPosition] == "s") tempPosition++;
    // Final word boundary (simplified in the same way)
    if (tempPosition == str.length || str[tempPosition] == " ") {
      return [str.slice(currentPosition, tempPosition)];
    }
  }
  return null;
}

let match = simulateRegex(str, 4);
console.log(match);
// → ["3 pigs"]
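11. Backtracking
When matching a choice (|) or a repetition (+, *), the regexp engine backtracks whenever it finds that it cannot continue.
For a choice, if the first branch matches, the other branches are not tried. If it fails, the engine goes back to the point where it entered the choice and tries the next branch.
For a repetition, take /^.*x/ matched against "abcxe". The .* first consumes every character. When the engine then finds that it still needs an x, the * operator tries matching one character fewer. There is still no x after that point, so the engine keeps backtracking until an x follows: .* ends up matching abc, and the whole pattern matches abcx.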
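Both kinds of backtracking can be observed from the outside. In the sketch below a group is added around .* purely to show what the repetition ended up consuming; the variable names are made up, and the choice pattern is a shortened form of the book's binary/hex/decimal example:

```javascript
// Repetition: .* first takes everything, then gives characters back
// one at a time until an x can follow.
const repetition = /^(.*)x/.exec("abcxe");
console.log(repetition[0]);
// → abcx
console.log(repetition[1]);
// → abc

// With two x's the greedy .* gives back as little as possible,
// so the match ends at the last x.
const lastX = /^(.*)x/.exec("axbxc");
console.log(lastX[1]);
// → axb

// Choice: the first two branches fail on "103" (no trailing b or h),
// so the engine goes back and tries the third branch.
const choice = /^([01]+b|[\da-f]+h|\d+)$/.exec("103");
console.log(choice[0]);
// → 103
```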
12. The replace method
replace combined with regular expressions:
console.log("papa".replace("p", "m"));
// → mapa
console.log("Borobudur".replace(/[ou]/, "a"));
// → Barobudur
console.log("Borobudur".replace(/[ou]/g, "a")); // g stands for global: replace all matches
// → Barabadar
The real power of replace is that the replacement string can refer to parts of the match with "$" followed by a number:
console.log(
  "Liskov, Barbara\nMcCarthy, John\nWadler, Philip"
    .replace(/(\w+), (\w+)/g, "$2 $1"));
// → Barbara Liskov
//   John McCarthy
//   Philip Wadler
"hello, word, every, one".replace(/(\w+),/g, "$1 "); // "$" plus a number refers to a group in the match
// → "hello  word  every  one"
"hello, word, every, one".replace(/one/g, "$& $&"); // "$&" refers to the whole match
// → "hello, word, every, one one"
A function can also be passed as the replacement:
"hello, word, every, one".replace(/(\w+),/g, str => str.toUpperCase());
// → "HELLO, WORD, EVERY, one"
13. Greed
function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");
}
console.log(stripComments("1 + /* 2 */3"));
// → 1 + 3
console.log(stripComments("x = 10;// ten!"));
// → x = 10;
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1  1
replace can be used to remove all comments from a piece of code.
[^] matches any character. It is used here because a /* */ comment may span several lines, and the period . does not match a newline.
However, the last call above gives the wrong result. Why?
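Because the repetition operators (+, *, ?, and {}) are all greedy. As described under backtracking, they first consume as many characters as they can and only give some back when they get stuck, so they naturally end up with the longer match. The fix is to put a question mark after them (+?, *?, ??, {}?), which makes them match as few characters as possible.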
function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");
}
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1 + 1
When you need a repetition operator, consider the nongreedy version first.
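To see exactly what each variant grabs, the block-comment part of the pattern can be run on its own with match (a sketch; the names commentSample, greedyMatch and lazyMatch are made up):

```javascript
const commentSample = "1 /* a */+/* b */ 1";

// Greedy: [^]* runs to the end, then backs up to the LAST "*/".
const greedyMatch = commentSample.match(/\/\*[^]*\*\//)[0];
console.log(greedyMatch);
// → /* a */+/* b */

// Nongreedy: [^]*? stops at the FIRST "*/" it can reach.
const lazyMatch = commentSample.match(/\/\*[^]*?\*\//)[0];
console.log(lazyMatch);
// → /* a */
```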
14. Dynamically building regular expressions
Build the expression with new RegExp(concatenated string, "gi"), where gi combines global (replace all matches) with case insensitivity.
// Each example sits in its own block so that the names can be reused.
{
  let name = "harry";
  let text = "Harry is a suspicious character.";
  let regexp = new RegExp("\\b(" + name + ")\\b", "gi");
  console.log(text.replace(regexp, "_$1_"));
  // → _Harry_ is a suspicious character.
}
{
  let name = "dea+hl[]rd";
  let text = "This dea+hl[]rd guy is super annoying.";
  let escaped = name.replace(/[\\[.+*?(){|^$]/g, "\\$&");
  // escaped → "dea\+hl\[]rd"
  let regexp = new RegExp("\\b" + escaped + "\\b", "gi");
  console.log(text.replace(regexp, "_$&_"));
  // → This _dea+hl[]rd_ guy is super annoying.
}
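The escaping step is easy to forget, so it can be wrapped in a small helper. The sketch below is one way to do it. The names escapeRegExp and highlight are hypothetical, the character set is slightly wider than the one in the example (it also escapes ] and }), and the \b boundaries are left out so that the "1+1" example works:

```javascript
// Hypothetical helper: backslash-escape every character that is special in a regexp.
function escapeRegExp(text) {
  return text.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical helper: wrap every case-insensitive occurrence of word in underscores.
function highlight(text, word) {
  const regexp = new RegExp(escapeRegExp(word), "gi");
  return text.replace(regexp, "_$&_");
}

console.log(escapeRegExp("dea+hl[]rd"));
// → dea\+hl\[\]rd
console.log(highlight("This dea+hl[]rd guy is super annoying.", "dea+hl[]rd"));
// → This _dea+hl[]rd_ guy is super annoying.
console.log(highlight("1+1=2, not 11", "1+1"));
// → _1+1_=2, not 11
```

Without the escaping, "1+1" would be read as "one or more 1s followed by a 1" and would match the 11 at the end instead.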
15. Search
A regular-expression version of indexOf:
console.log("  word".search(/\S/));
// → 2
console.log("    ".search(/\S/));
// → -1
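16. The lastIndex property
Requirement: start matching from a given position in the string.
Problem: there is no convenient way to do that.
Reason: being inconvenient is simply a JS trait...
Workaround: under [strict conditions], set the start position through lastIndex.
Strict conditions: the expression must have the g (global) or y (sticky) option enabled, and the match must be performed through exec.
lastIndex: a numeric property of the regexp object that determines at which character the next match starts. Setting it only has an effect under the strict conditions above; otherwise changing it does nothing.
let pattern = /y/g;
pattern.lastIndex = 3;
let match = pattern.exec("xyzzy");
console.log(match.index);
// → 4
console.log(pattern.lastIndex);
// → 5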
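With global enabled: a successful match automatically moves lastIndex to the position just after the match (as above); a failed match resets lastIndex to 0.
global: searches forward from str[lastIndex] for a match.
sticky: succeeds only if the match starts exactly at str[lastIndex]; it does not search forward.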
let global = /abc/g;
console.log(global.exec("xyz abc"));
// → ["abc"]
let sticky = /abc/y;
console.log(sticky.exec("xyz abc"));
// → null
So a small change to lastIndex is enough to make the successful match above fail and the failed one succeed:
let global = /abc/g;
global.lastIndex = 6; // search forward starting at the c
console.log(global.exec("xyz abc"));
// → null
let sticky = /abc/y;
sticky.lastIndex = 4; // match starting at the a
console.log(sticky.exec("xyz abc"));
// → ["abc"]
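Because lastIndex is updated automatically after every match when global is enabled, reusing one regexp object for several matches can produce a nasty surprise:
let digit = /\d/g;
console.log(digit.exec("here it is: 1"));
// → ["1"]
console.log(digit.exec("and now: 1"));
// → null
With neither g nor y enabled there is no such concern, because lastIndex is then ignored. A sticky (y) regexp updates lastIndex in the same way as a global one, so it needs the same care.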
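One way around the surprise is to reset lastIndex by hand before each reuse. A minimal sketch (the name digitPattern is made up; the strings are the ones from the example above):

```javascript
const digitPattern = /\d/g;

const first = digitPattern.exec("here it is: 1");
console.log(first.index, digitPattern.lastIndex);
// → 12 13

// The next search starts at index 13, past the end of the shorter
// string, so it fails and lastIndex is reset to 0.
const second = digitPattern.exec("and now: 1");
console.log(second, digitPattern.lastIndex);
// → null 0

// Resetting lastIndex before each call makes the result predictable.
digitPattern.lastIndex = 0;
const third = digitPattern.exec("here it is: 1");
digitPattern.lastIndex = 0;
const fourth = digitPattern.exec("and now: 1");
console.log(third.index, fourth.index);
// → 12 9
```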
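Another effect of global is that it changes the behavior of match:
console.log("Banana".match(/an/g));
// → ["an", "an"]
console.log(/an/g.exec("Banana"));
// → ["an", index: 1, input: "Banana", groups: undefined]
// global changes how match behaves. Without g the two calls are
// equivalent and print the same thing. In an array returned by exec,
// a second element would be the text matched by a group (a substring
// of the first), whereas here ["an", "an"] lists all full matches.
In short: use global with care.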
17. Looping over matches
Relying on the lastIndex mechanism in global mode is probably the simplest way to do this.
let input = "A string with 3 numbers in it... 42 and 88.";
let number = /\b\d+\b/g;
let match;
while (match = number.exec(input)) {
  console.log("Found", match[0], "at", match.index);
}
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40
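In newer JavaScript environments, String.prototype.matchAll offers an alternative that hides the lastIndex bookkeeping. This is not part of the chapter; the sketch below assumes a runtime that supports matchAll, and the names numbersInput and found are made up:

```javascript
const numbersInput = "A string with 3 numbers in it... 42 and 88.";
const found = [];

// matchAll requires the g flag and yields one exec-style match per iteration.
for (const m of numbersInput.matchAll(/\b\d+\b/g)) {
  found.push([m[0], m.index]);
  console.log("Found", m[0], "at", m.index);
}
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40
```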
18. Parsing an INI file
function parseINI(string) {
  // Start with an object to hold the top-level fields
  let result = {};
  let section = result;
  string.split(/\r?\n/).forEach(line => {
    let match;
    if (match = line.match(/^(\w+)=(.*)$/)) {
      section[match[1]] = match[2];
    } else if (match = line.match(/^\[(.*)\]$/)) {
      section = result[match[1]] = {};
    } else if (!/^\s*(;.*)?$/.test(line)) {
      throw new Error("Line '" + line + "' is not valid.");
    }
  });
  return result;
}

console.log(parseINI(`
searchengine=https://duckduckgo.com/?q=$1
spitefulness=9.7

; comments are preceded by a semicolon...
; each section concerns an individual enemy
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451

[davaeorn]
fullname=Davaeorn
type=evil wizard
outputdir=/home/marijn/enemies/davaeorn`));
// → davaeorn: { fullname: "Davaeorn", type: "evil wizard", outputdir: "/home/marijn/enemies/davaeorn" }
//   larry: { fullname: "Larry Doe", type: "kindergarten bully", website: "http://www.geocities.com/CapeCanaveral/11451" }
//   searchengine: "https://duckduckgo.com/?q=$1"
//   spitefulness: "9.7"
19. International characters
console.log(/🍎{3}/.test("🍎🍎🍎"));
// → false
console.log(/<.>/.test("<🌹>"));
// → false
console.log(/<.>/u.test("<🌹>"));
// → true
🍎 can be regarded as two characters (two UTF-16 code units), so the quantifier in 🍎{3} actually applies only to the second of them. The fix is to add u (for Unicode) after the regexp, although this can change how existing patterns match.
With u enabled you can also use \p{Property=Value} to match characters by their Unicode properties:
console.log(/\p{Script=Greek}/u.test("α"));
// → true
console.log(/\p{Script=Arabic}/u.test("α"));
// → false
console.log(/\p{Alphabetic}/u.test("α"));
// → true
console.log(/\p{Alphabetic}/u.test("!"));
// → false
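The effect of the u flag on a character outside the basic plane can be checked directly. In the sketch below the apple emoji is written as the escape \u{1F34E} so that the code does not depend on how the file is encoded; the variable name apple is made up:

```javascript
const apple = "\u{1F34E}"; // the red apple emoji

console.log(apple.length);
// → 2 (two UTF-16 code units)
console.log([...apple].length);
// → 1 (one code point)

// Without u, . matches a single code unit, so one apple is "two characters".
console.log(/^.$/.test(apple));
// → false
console.log(/^.$/u.test(apple));
// → true

// With u, the quantifier applies to the whole code point.
console.log(/^\u{1F34E}{3}$/u.test(apple.repeat(3)));
// → true
```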
Exercises
① Regexp golf
// Fill in the regular expressions
verify(/ca[rt]/,
       ["my car", "bad cats"],
       ["camper", "high art"]);
verify(/pr?op/,
       ["pop culture", "mad props"],
       ["plop", "prrrop"]);
verify(/ferr(et|y|ari)/,
       ["ferret", "ferry", "ferrari"],
       ["ferrum", "transfer A"]);
verify(/ious\b/,
       ["how delicious", "spacious room"],
       ["ruinous", "consciousness"]);
verify(/\s[.,:;]/,
       ["bad punctuation ."],
       ["escape the period"]);
verify(/\w{7}/,
       ["hottentottententen"],
       ["no", "hotten totten tenten"]);
verify(/\b[^\We]+\b/i,
       ["red platypus", "wobbling nest"],
       ["earth bed", "learning ape", "BEET"]);

function verify(regexp, yes, no) {
  // Ignore unfinished exercises
  if (regexp.source == "...") return;
  for (let str of yes) if (!regexp.test(str)) {
    console.log(`Failure to match '${str}'`);
  }
  for (let str of no) if (regexp.test(str)) {
    console.log(`Unexpected match for '${str}'`);
  }
}
② Quoting style
let text = "'I'm the cook,' he said, 'it's my job.'";
// Change this call.
console.log(text.replace(/'|([\w]'[\w])/g, str => str == "'" ? '"' : str));
// → "I'm the cook," he said, "it's my job."
The book's solution:
let text = "'I'm the cook,' he said, 'it's my job.'";
console.log(text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2'));
// → "I'm the cook," he said, "it's my job."
③ Numbers again
// Fill in this regular expression.
let number = /^[+-]?(\d+\.?\d*|\d*\.?\d+)([eE][+-]?\d+)?$/;

// Tests:
for (let str of ["1", "-1", "+15", "1.55", ".5", "5.",
                 "1.3e2", "1E-4", "1e+12"]) {
  if (!number.test(str)) {
    console.log(`Failed to match '${str}'`);
  }
}
for (let str of ["1a", "+-1", "1.2.3", "1+1", "1e4.5",
                 ".5.", "1f5", "."]) {
  if (number.test(str)) {
    console.log(`Incorrectly accepted '${str}'`);
  }
}
The book's solution (it escapes the - sign; this is optional here, because a - at the start or end of a character set is already literal):
let number = /^[+\-]?(\d+(\.\d*)?|\.\d+)([eE][+\-]?\d+)?$/;