Eloquent JavaScript #09# Regular Expressions

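Index

  • Notes

    1. Two ways to create a regular expression in JS
    2. Matching with a regular expression (1)
    3. Character sets
    4. Repetition
    5. Grouping (subexpressions)
    6. Matching with a regular expression (2)
    7. The Date class
    8. Matching the whole string
    9. Choice patterns
    10. The mechanics of matching
    11. Backtracking
    12. Replace
    13. Greed
    14. Dynamically creating regular expressions
    15. Search
    16. The lastIndex property
    17. Looping over matches
    18. Parsing an INI file
    19. International characters
  • Exercises

    1. Regexp golf
    2. Quoting style
    3. Numbers again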

Notes

1. Regular expressions help us find specific patterns in strings.

Two equivalent ways to create a regular expression in JS:

let re1 = new RegExp("abc");
let re2 = /abc/;

2. Applying a regular expression

console.log(/abc/.test("abcde"));
// → true
console.log(/abc/.test("abxde"));
// → false

3. Character sets

\d Any digit character
\w An alphanumeric character (“word character”)
\s Any whitespace character (space, tab, newline, and similar)
\D A character that is not a digit
\W A nonalphanumeric character
\S A nonwhitespace character
. Any character except for newline
/abc/ A sequence of characters
/[abc]/ Any character from a set of characters
/[^abc]/ Any character not in a set of characters
/[0-9]/ Any character in a range of characters
/x+/ One or more occurrences of the pattern x
/x+?/ One or more occurrences, nongreedy
/x*/ Zero or more occurrences
/x?/ Zero or one occurrence
/x{2,4}/ Two to four occurrences
/(abc)/ A group
/a|b|c/ Any one of several patterns
/\d/ Any digit character
/\w/ An alphanumeric character (“word character”)
/\s/ Any whitespace character
/./ Any character except newlines
/\b/ A word boundary
/^/ Start of input
/$/ End of input

Escapes such as \d keep their meaning inside [ ], but special characters such as . and + lose theirs there and become ordinary characters.
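A small sketch of the rule above, using only standard JavaScript. The variable names are illustrative and not from the book: inside brackets \d still means "any digit", while . and + match only themselves.

```javascript
// Inside a character set, \d keeps its meaning but . and + are literal.
const digitOrDot = /^[\d.]+$/;
const plusOrDot = /^[+.]$/;

console.log(digitOrDot.test("3.14"));
// → true
console.log(digitOrDot.test("3a14"));
// → false (the . in the set does not mean "any character")
console.log(plusOrDot.test("+"));
// → true
console.log(plusOrDot.test("a"));
// → false
```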

Negating the whole set, so that it matches any character other than 0 or 1:

let notBinary = /[^01]/;
console.log(notBinary.test("1100100010100110"));
// → false
console.log(notBinary.test("1100100010200110"));
// → true

4. Repetition

+ one or more, * zero or more

console.log(/'\d+'/.test("'123'"));
// → true
console.log(/'\d+'/.test("''"));
// → false
console.log(/'\d*'/.test("'123'"));
// → true
console.log(/'\d*'/.test("''"));
// → true

? zero or one

let neighbor = /neighbou?r/;
console.log(neighbor.test("neighbour"));
// → true
console.log(neighbor.test("neighbor"));
// → true

{2} means the pattern must occur exactly that many times. It is also possible to specify a range this way: {2,4} means the element must occur at least twice and at most four times.

let dateTime = /\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}/;
console.log(dateTime.test("1-30-2003 8:45"));
// → true

You can also specify open-ended ranges when using braces by omitting the number after the comma. So, {5,} means five or more times.
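The three brace forms side by side, as a minimal sketch. The patterns are anchored with ^ and $ (covered in section 8) so that the counts apply to the whole string; the variable names are made up for illustration.

```javascript
const exactlyTwo = /^\d{2}$/;   // exactly two digits
const twoToFour = /^\d{2,4}$/;  // between two and four digits
const fiveOrMore = /^\d{5,}$/;  // open-ended: five digits or more

console.log(exactlyTwo.test("42"), exactlyTwo.test("421"));
// → true false
console.log(twoToFour.test("4211"), twoToFour.test("42111"));
// → true false
console.log(fiveOrMore.test("1234"), fiveOrMore.test("1234567"));
// → false true
```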

5. Grouping (subexpressions)

The elements inside parentheses are treated as a single element (a group, or subexpression):

let cartoonCrying = /boo+(hoo+)+/i;
console.log(cartoonCrying.test("Boohoooohoohooo"));
// → true

The i makes the expression case insensitive.

6. Another way to match

This one gives us additional information:

let match = /\d+/.exec("one two 100");
console.log(match);
// → ["100"]
console.log(match.index);
// → 8

Return value of exec: null when the match fails; on success, an array like the one shown above.

An equivalent form:

console.log("one two 100".match(/\d+/));
// → ["100"]

With parenthesized groups:

let quotedText = /'([^']*)'/;
console.log(quotedText.exec("she said 'hello'"));
// → ["'hello'", "hello"]

console.log(/bad(ly)?/.exec("bad"));
// → ["bad", undefined]
console.log(/(\d)+/.exec("123"));
// → ["123", "3"]

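The first element of the returned array is the string matched by the whole regular expression, and the second is the string matched by the parenthesized part (the subexpression): undefined if the group did not take part in the match, and the last match if the group matched several times. It follows that the second element is almost always a substring of the first.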
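With several groups, the array returned by exec simply gets one extra element per group, in order. A minimal sketch (the sample strings and variable names are assumptions chosen for illustration):

```javascript
// Two groups: hours and minutes.
const timeMatch = /(\d{1,2}):(\d{2})/.exec("at 8:45");
const [whole, hours, minutes] = timeMatch;
console.log(whole, hours, minutes);
// → 8:45 8 45
console.log(timeMatch.index);
// → 3

// An optional group that takes no part in the match yields undefined.
const sizeMatch = /(\d+)(px)?/.exec("12");
console.log(sizeMatch[1], sizeMatch[2]);
// → 12 undefined
```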

7. The Date class

console.log(new Date());
// → Sat Sep 01 2018 13:54:43 GMT+0800 (China Standard Time)

console.log(new Date(2009, 11, 9));
// → Wed Dec 09 2009 00:00:00 GMT+0800 (China Standard Time)
console.log(new Date(2009, 11, 9, 12, 59, 59, 999));
// → Wed Dec 09 2009 12:59:59 GMT+0800 (China Standard Time)

console.log(new Date(1997, 10, 19).getTime());
// → 879868800000
console.log(new Date(1387407600000));
// → Thu Dec 19 2013 07:00:00 GMT+0800 (China Standard Time)

console.log(new Date().getTime());
// → 1535781283593
console.log(Date.now());
// → 1535781283593

Creating a date from a string with a regular expression:

"use strict";

function getDate(string) {
  let [_, month, day, year] =
    /(\d{1,2})-(\d{1,2})-(\d{4})/.exec(string);
  return new Date(year, month - 1, day);
}
console.log(getDate("1-30-2003"));
// → Thu Jan 30 2003 00:00:00 GMT+0100 (CET)

P.S. The underscore is only a placeholder and has no other meaning.

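8. Forcing a match on the whole string

Use ^ and $. For example, /^\d+$/ matches a string that consists entirely of one or more digits, /^!/ matches any string that starts with an exclamation mark, and /x^/ does not match any string.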
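The difference the anchors make can be checked directly. This is a minimal sketch with an illustrative variable name; the unanchored /\d+/ is included for contrast.

```javascript
const allDigits = /^\d+$/;

console.log(allDigits.test("2024"));
// → true
console.log(allDigits.test("20x24"));
// → false
console.log(/\d+/.test("20x24"));
// → true (without anchors, a partial match is enough)
console.log(/^!/.test("!important"));
// → true
console.log(/x^/.test("x"));
// → false (nothing can come before the start of the input)
```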

Use \b to mark word boundaries:

console.log(/cat/.test("concatenate"));
// → true
console.log(/\bcat\b/.test("concatenate"));
// → false
console.log(/\bcat\b/.test("xx cat xx"));
// → true

9. Choice patterns

let animalCount = /\b\d+ (pig|cow|chicken)s?\b/;
console.log(animalCount.test("15 pigs"));
// → true
console.log(animalCount.test("15 pigchickens"));
// → false

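10. The mechanics of matching

When you match with a regular expression (test or exec), the engine first tries to match starting at the beginning of the string, then from the second character, then the third, and so on, until it finds a match or reaches the end of the string. It either returns the first match it finds or finds nothing at all.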

/**
 * Simulate matching "the 3 pigs" against
 * the regular expression \b\d+ (pig|cow|chicken)s?\b
 */

const str = "the 3 pigs";

function simulateRegex(str, start) {
    // Out-of-range indexes give undefined, which counts as neither digit nor word character
    const isDigit = ch => ch !== undefined && ch >= "0" && ch <= "9";
    const isWordChar = ch => ch !== undefined && (isDigit(ch) || ch == "_" ||
        (ch >= "a" && ch <= "z") || (ch >= "A" && ch <= "Z"));
    // Try each start position in turn until a match is found or the string ends
    for(let currentPosition = start; currentPosition != str.length; ++currentPosition) {
        let tempPosition = currentPosition;
        // Leading word boundary: the previous character must not be a word character
        if(isWordChar(str[tempPosition - 1])) continue;
        // At least one digit
        if(!isDigit(str[tempPosition++])) continue;
        // Keep consuming digits
        while(isDigit(str[tempPosition])) {
            tempPosition++;
        }
        // Exactly one space
        if(str[tempPosition++] != " ") continue;
        // One of the three alternatives, tried in order
        const tempWord = ["pig", "cow", "chicken"]
            .find(word => str.startsWith(word, tempPosition));
        if(tempWord === undefined) continue;
        tempPosition += tempWord.length;
        // The plural "s" is optional
        if(str[tempPosition] == "s") tempPosition++;
        // Trailing word boundary: end of string or a non-word character
        if(!isWordChar(str[tempPosition])) {
            // The end index of slice is exclusive, so no + 1 is needed
            return [str.slice(currentPosition, tempPosition)];
        }
    }
    return null;
}

let match = simulateRegex(str, 4);
console.log(match);
// → ["3 pigs"]

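11. Backtracking

When the engine is matching a choice (|) or a repetition (+ or *) and finds that it cannot continue, it backtracks.

For a choice, if a branch matches and the rest of the pattern after it matches too, the remaining branches are not tried; if it fails, the engine goes back to where the choice started and tries the next branch.

For a repetition, take /^.*x/ matched against "abcxe": .* first consumes the whole string. The engine then finds that it still needs an x, so the * operator tries matching one character fewer. There is no x after abcx either, so it backtracks again until .* matches just abc. The x that follows completes the match, so the overall match is abcx (positions 0 to 4).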
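Adding a capture group around .* makes the result of the backtracking visible. This is a minimal sketch; the group and the second sample string are additions for illustration.

```javascript
// The group shows what .* ends up holding after backtracking.
const single = /^(.*)x/.exec("abcxe");
console.log(single[0], single[1]);
// → abcx abc

// With two x characters, the greedy .* gives back as little as possible,
// so the match ends at the last x.
const double = /^(.*)x/.exec("axbxc");
console.log(double[0], double[1]);
// → axbx axb
```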

12. The replace method

replace combined with regular expressions:

console.log("papa".replace("p", "m"));
// → mapa

console.log("Borobudur".replace(/[ou]/, "a"));
// → Barobudur
console.log("Borobudur".replace(/[ou]/g, "a")); // g means global: replace every match
// → Barabadar

The real power of replace is that the replacement string can refer to matched groups with "$" followed by a number:

console.log(
  "Liskov, Barbara\nMcCarthy, John\nWadler, Philip"
    .replace(/(\w+), (\w+)/g, "$2 $1"));
// → Barbara Liskov
//   John McCarthy
//   Philip Wadler

"hello, word, every, one".replace(/(\w+),/g, "$1 "); // "$" plus a number refers to a group in the match
// → "hello  word  every  one"
"hello, word, every, one".replace(/one/g, "$& $&"); // "$&" refers to the whole match
// → "hello, word, every, one one"

A function can also be passed:

"hello, word, every, one".replace(/(\w+),/g, str => str.toUpperCase()); 
// → "HELLO, WORD, EVERY, one"
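When the pattern contains groups, the callback receives them as extra arguments after the whole match. A minimal sketch, assuming a made-up helper named swapNames:

```javascript
// The callback gets (wholeMatch, group1, group2, ...).
function swapNames(list) {
  return list.replace(/(\w+), (\w+)/g,
    (whole, last, first) => `${first} ${last.toUpperCase()}`);
}

console.log(swapNames("Liskov, Barbara\nMcCarthy, John"));
// → Barbara LISKOV
//   John MCCARTHY
```

This is useful when the replacement needs computation that "$1"-style references cannot express, such as changing case.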

13. Greed

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*\*\//g, "");
}
console.log(stripComments("1 + /* 2 */3"));
// → 1 + 3
console.log(stripComments("x = 10;// ten!"));
// → x = 10;
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1  1

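replace can be used to strip all comments from a piece of code.

[^] matches any character. It is used here because a /* */ comment may span several lines, and the period (.) does not match newline characters.

Yet the last line of output above is wrong. Why?

Because the repetition operators (+, *, ?, and {}) are greedy: as described under backtracking, they first consume as many characters as they can and only give some back when there is no other way forward, so they naturally end up with the longer match. The fix is to put a question mark after them (+?, *?, ??, {}?), which makes them match as little as possible.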

function stripComments(code) {
  return code.replace(/\/\/.*|\/\*[^]*?\*\//g, "");
}
console.log(stripComments("1 /* a */+/* b */ 1"));
// → 1 + 1

When you need a repetition operator, consider the nongreedy variant first.
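The same contrast on a shorter input, as a minimal sketch (the sample string and variable names are illustrative):

```javascript
const quoted = 'say "a" and "b"';

// Greedy: .* runs to the last quote in the string.
const greedy = quoted.match(/".*"/)[0];
// Nongreedy: .*? stops at the first quote that completes a match.
const lazy = quoted.match(/".*?"/)[0];

console.log(greedy);
// → "a" and "b"
console.log(lazy);
// → "a"
```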

14. Dynamically creating regular expressions

Build it with new RegExp(concatenated string, "gi"); gi means global (replace every match) and case insensitive.

let name = "harry";
let text = "Harry is a suspicious character.";
let regexp = new RegExp("\\b(" + name + ")\\b", "gi");
console.log(text.replace(regexp, "_$1_"));
// → _Harry_ is a suspicious character.

let name = "dea+hl[]rd";
let text = "This dea+hl[]rd guy is super annoying.";
let escaped = name.replace(/[\\[.+*?(){|^$]/g, "\\$&");
// escaped → "dea\+hl\[]rd"
let regexp = new RegExp("\\b" + escaped + "\\b", "gi");
console.log(text.replace(regexp, "_$&_"));
// → This _dea+hl[]rd_ guy is super annoying.
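The escaping step can be wrapped in a helper so it is not forgotten. This is a sketch only: escapeRegExp and highlight are hypothetical names, and the character set is the one used in the snippet above (it does not cover every character that can be special in every position).

```javascript
// Put a backslash in front of characters that are special in a pattern.
function escapeRegExp(text) {
  return text.replace(/[\\[.+*?(){|^$]/g, "\\$&");
}

// Wrap every occurrence of `word` in underscores, treating it as plain text.
function highlight(text, word) {
  const pattern = new RegExp(escapeRegExp(word), "gi");
  return text.replace(pattern, "_$&_");
}

console.log(escapeRegExp("a+b"));
// → a\+b
console.log(highlight("Is 1+1 really 2? 1+1!", "1+1"));
// → Is _1+1_ really 2? _1+1_!
```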

15. Search

The regular expression version of indexOf:

console.log("  word".search(/\S/));
// → 2
console.log("    ".search(/\S/));
// → -1

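16. The lastIndex property

Requirement: start matching from a given position in the string.

Problem: there is no convenient way to do this.

Reason: inconvenience just seems to be a JS trait...

Solution: set the starting position with lastIndex, under [strict conditions].

Strict conditions: the expression must have the g (global) or y (sticky) option enabled, and the match must happen through the exec method (test honors lastIndex in the same way).

lastIndex: a numeric property of the regular expression object that determines at which character the next match starts. Setting it only has an effect under the strict conditions above; otherwise changing it does nothing.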

let pattern = /y/g;
pattern.lastIndex = 3;
let match = pattern.exec("xyzzy");
console.log(match.index);
// → 4
console.log(pattern.lastIndex);
// → 5

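With g or y enabled: after a successful match, lastIndex is automatically updated to the position just after the match (as above); after a failed match it is reset to 0.

global: searches forward from str[lastIndex] for a match.

sticky: the match must start exactly at str[lastIndex]; the engine does not search ahead.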

let global = /abc/g;
console.log(global.exec("xyz abc"));
// → ["abc"]
let sticky = /abc/y;
console.log(sticky.exec("xyz abc"));
// → null

So a simple adjustment to lastIndex makes the successful match above fail and the failed one succeed:

let global = /abc/g;
global.lastIndex = 6; // search forward starting at the c
console.log(global.exec("xyz abc"));
// → null
let sticky = /abc/y;
sticky.lastIndex = 4; // match starting exactly at the a
console.log(sticky.exec("xyz abc"));
// → ["abc"]

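Because lastIndex is updated automatically after a match when global is enabled, matching several times with the same regular expression object can produce nasty surprises: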

let digit = /\d/g;
console.log(digit.exec("here it is: 1"));
// → ["1"]
console.log(digit.exec("and now: 1"));
// → null

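With neither g nor y enabled there is no such concern, because lastIndex is then ignored. (A sticky expression updates lastIndex in the same way as a global one.)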
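The same pitfall shows up with test, and resetting lastIndex before each use avoids it. A minimal sketch; digitPattern and hasDigit are illustrative names.

```javascript
const digitPattern = /\d/g;

const first = digitPattern.test("a1");
const indexAfterFirst = digitPattern.lastIndex;
const second = digitPattern.test("a1");          // starts searching at index 2
const indexAfterSecond = digitPattern.lastIndex; // reset by the failed match

console.log(first, indexAfterFirst);
// → true 2
console.log(second, indexAfterSecond);
// → false 0

// Resetting lastIndex makes a shared global pattern safe to reuse.
function hasDigit(text) {
  digitPattern.lastIndex = 0;
  return digitPattern.test(text);
}
console.log(hasDigit("a1"), hasDigit("a1"));
// → true true
```

Dropping the g flag altogether is simpler when only a yes/no answer is needed.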

Another effect of global is that it changes the behavior of match:

console.log("Banana".match(/an/g));
// → ["an", "an"]
console.log(/an/g.exec("Banana"));
// → ["an", index: 1, input: "Banana", groups: undefined] 
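// With g, match returns every match of the whole pattern instead of
// an exec-style array. Without g the two calls would be equivalent, and
// any elements after the first would be the strings matched by groups.

In short: use global with care.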

17. Looping over matches

Using the lastIndex mechanism of global mode is probably the simplest way.

let input = "A string with 3 numbers in it... 42 and 88.";
let number = /\b\d+\b/g;
let match;
while (match = number.exec(input)) {
  console.log("Found", match[0], "at", match.index);
}
// → Found 3 at 14
//   Found 42 at 33
//   Found 88 at 40
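An alternative that avoids managing lastIndex by hand is String.prototype.matchAll (ES2020), which requires the g flag and yields exec-style match arrays. This is not from the book's text; the sketch below collects the same matches into an array, with illustrative variable names.

```javascript
const sample = "A string with 3 numbers in it... 42 and 88.";
const found = [];

// Each item has the same shape as an exec result: [0] is the match, .index its position.
for (const m of sample.matchAll(/\b\d+\b/g)) {
  found.push(`${m[0]}@${m.index}`);
}

console.log(found);
// → ["3@14", "42@33", "88@40"]
```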

18. Parsing an INI file

function parseINI(string) {
    // Start with an object to hold the top-level fields
    let result = {};
    let section = result;
    string.split(/\r?\n/).forEach(line => {
        let match;
        if(match = line.match(/^(\w+)=(.*)$/)) {
            section[match[1]] = match[2];
        } else if(match = line.match(/^\[(.*)\]$/)) {
            section = result[match[1]] = {};
        } else if(!/^\s*(;.*)?$/.test(line)) {
            throw new Error("Line '" + line + "' is not valid.");
        }
    });
    return result;
}

console.log(parseINI(`
searchengine=https://duckduckgo.com/?q=$1
spitefulness=9.7

; comments are preceded by a semicolon...
; each section concerns an individual enemy
[larry]
fullname=Larry Doe
type=kindergarten bully
website=http://www.geocities.com/CapeCanaveral/11451

[davaeorn]
fullname=Davaeorn
type=evil wizard
outputdir=/home/marijn/enemies/davaeorn`));
// → davaeorn:  { fullname: "Davaeorn", type: "evil wizard", outputdir: "/home/marijn/enemies/davaeorn" }
// larry:  { fullname: "Larry Doe", type: "kindergarten bully", website: "http://www.geocities.com/CapeCanaveral/11451" }
// searchengine: "https://duckduckgo.com/?q=$1"
// spitefulness: "9.7"
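The three regular expressions inside parseINI can be tried on individual lines to see which rule each line falls under. A minimal sketch; classifyLine is a hypothetical helper that reuses the same patterns in the same order.

```javascript
function classifyLine(line) {
  if (/^(\w+)=(.*)$/.test(line)) return "setting";          // name=value
  if (/^\[(.*)\]$/.test(line)) return "section";            // [name]
  if (/^\s*(;.*)?$/.test(line)) return "blank or comment";  // empty, or ; comment
  return "invalid";                                          // parseINI throws here
}

console.log(classifyLine("fullname=Larry Doe"));
// → setting
console.log(classifyLine("[larry]"));
// → section
console.log(classifyLine("; a comment"));
// → blank or comment
console.log(classifyLine(""));
// → blank or comment
console.log(classifyLine("what is this"));
// → invalid
```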

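19. International characters

console.log(/🍎{3}/.test("🍎🍎🍎"));
// → false
console.log(/<.>/.test("<🌹>"));
// → false
console.log(/<.>/u.test("<🌹>"));
// → true

🍎 is treated as two code units, so the {3} quantifier in /🍎{3}/ applies only to the second one. Likewise, the dot matches a single code unit, not the two that make up 🌹. The fix is to add the u (for Unicode) option to the regular expression. The non-Unicode behavior remains the default because changing it could break existing code that depends on it.

With the u option enabled, \p{Property=Value} can also be used to match characters that have a given Unicode property: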

console.log(/\p{Script=Greek}/u.test("α"));
// → true
console.log(/\p{Script=Arabic}/u.test("α"));
// → false
console.log(/\p{Alphabetic}/u.test("α"));
// → true
console.log(/\p{Alphabetic}/u.test("!"));
// → false
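The "two code units" point can be checked directly with length. A minimal sketch; the emoji is written as an escape sequence so the snippet survives encoding problems, and the variable name is illustrative.

```javascript
const rose = "\u{1F339}"; // the rose emoji, a single code point

console.log(rose.length);
// → 2 (two UTF-16 code units)
console.log([...rose].length);
// → 1 (string iteration goes by code point)
console.log(/^.$/.test(rose), /^.$/u.test(rose));
// → false true
console.log(/^\p{Script=Greek}+$/u.test("αβγ"), /^\p{Script=Greek}+$/u.test("abc"));
// → true false
```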

Exercises

① Regexp golf

// Fill in the regular expressions

verify(/ca[rt]/,
       ["my car", "bad cats"],
       ["camper", "high art"]);

verify(/pr?op/,
       ["pop culture", "mad props"],
       ["plop", "prrrop"]);

verify(/ferr(et|y|ari)/,
       ["ferret", "ferry", "ferrari"],
       ["ferrum", "transfer A"]);

verify(/ious\b/,
       ["how delicious", "spacious room"],
       ["ruinous", "consciousness"]);

verify(/\s[.,:;]/,
       ["bad punctuation ."],
       ["escape the period"]);

verify(/\w{7}/,
       ["hottentottententen"],
       ["no", "hotten totten tenten"]);

verify(/\b[^\We]+\b/i,
       ["red platypus", "wobbling nest"],
       ["earth bed", "learning ape", "BEET"]);


function verify(regexp, yes, no) {
  // Ignore unfinished exercises
  if (regexp.source == "...") return;
  for (let str of yes) if (!regexp.test(str)) {
    console.log(`Failure to match '${str}'`);
  }
  for (let str of no) if (regexp.test(str)) {
    console.log(`Unexpected match for '${str}'`);
  }
}

-—————— -- -——-—— -- - -----————------------ -- -- -- - -- —

② Quoting style

let text = "'I'm the cook,' he said, 'it's my job.'";
// Change this call.
console.log(text.replace(/'|([\w]'[\w])/g, str => str == "'" ? '"' : str));
// → "I'm the cook," he said, "it's my job."

The book's solution:

let text = "'I'm the cook,' he said, 'it's my job.'";

console.log(text.replace(/(^|\W)'|'(\W|$)/g, '$1"$2'));
// → "I'm the cook," he said, "it's my job."

-—————— -- -——-—— -- - -----————------------ -- -- -- - -- —

③ Numbers again

// Fill in this regular expression.
let number = /^[+-]?(\d+\.?\d*|\d*\.?\d+)([eE][+-]?\d+)?$/;

// Tests:
for (let str of ["1", "-1", "+15", "1.55", ".5", "5.",
                 "1.3e2", "1E-4", "1e+12"]) {
  if (!number.test(str)) {
    console.log(`Failed to match '${str}'`);
  }
}
for (let str of ["1a", "+-1", "1.2.3", "1+1", "1e4.5",
                 ".5.", "1f5", "."]) {
  if (number.test(str)) {
    console.log(`Incorrectly accepted '${str}'`);
  }
}

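The book's answer (is it better to escape the - sign? It is not required when - is the first or last character of a set, but it is harmless):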

let number = /^[+\-]?(\d+(\.\d*)?|\.\d+)([eE][+\-]?\d+)?$/;
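Why escaping the hyphen can still be a sensible habit: between two other characters inside a set, an unescaped - creates a range. A minimal sketch with made-up variable names:

```javascript
const signSet = /^[+-]$/;          // - is last in the set, so it is literal
const escapedSet = /^[+\-]$/;      // the same set with the hyphen escaped
const accidentalRange = /^[+-9]$/; // the range from + to 9, which also covers , . /

for (const ch of ["+", "-", ","]) {
  console.log(ch, signSet.test(ch), escapedSet.test(ch), accidentalRange.test(ch));
}
// → + true true true
//   - true true true
//   , false false true
```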
