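TS Internals in Depth (5) Syntax 2: Parsing

The previous section introduced the structure of the syntax tree; this section explains how tokens are parsed into a syntax tree.

The corresponding source code is in src/compiler/parser.ts.

Entry function

To parse a piece of source code, the input is of course the source text (a string), together with a path (used in error reports) and a language version (ES3 and ES5, for example, differ in some details).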

createSourceFile is the entry function that parses source code into a syntax tree. Users can call it directly, for example ts.createSourceFile('<stdio>', 'var xld;', ts.ScriptTarget.ES5).

export function createSourceFile(fileName: string, sourceText: string, languageVersion: ScriptTarget, setParentNodes = false, scriptKind?: ScriptKind): SourceFile {
    performance.mark("beforeParse");
    let result: SourceFile;

    perfLogger.logStartParseSourceFile(fileName);
    if (languageVersion === ScriptTarget.JSON) {
        result = Parser.parseSourceFile(fileName, sourceText, languageVersion, /*syntaxCursor*/ undefined, setParentNodes, ScriptKind.JSON);
    }
    else {
        result = Parser.parseSourceFile(fileName, sourceText, languageVersion, /*syntaxCursor*/ undefined, setParentNodes, scriptKind);
    }
    perfLogger.logStopParseSourceFile();

    performance.mark("afterParse");
    performance.measure("Parse", "beforeParse", "afterParse");
    return result;
}

Apart from some performance-measurement code, the entry function mainly calls Parser.parseSourceFile to do the parsing.

Parsing the source file object

export function parseSourceFile(fileName: string, sourceText: string, languageVersion: ScriptTarget, syntaxCursor: IncrementalParser.SyntaxCursor | undefined, setParentNodes = false, scriptKind?: ScriptKind): SourceFile {
    scriptKind = ensureScriptKind(fileName, scriptKind);
    /* ... (omitted) ... */

    initializeState(sourceText, languageVersion, syntaxCursor, scriptKind);

    const result = parseSourceFileWorker(fileName, languageVersion, setParentNodes, scriptKind);

    clearState();

    return result;
}

function initializeState(_sourceText: string, languageVersion: ScriptTarget, _syntaxCursor: IncrementalParser.SyntaxCursor | undefined, scriptKind: ScriptKind) {
    NodeConstructor = objectAllocator.getNodeConstructor();
    TokenConstructor = objectAllocator.getTokenConstructor();
    IdentifierConstructor = objectAllocator.getIdentifierConstructor();
    SourceFileConstructor = objectAllocator.getSourceFileConstructor();

    sourceText = _sourceText;
    syntaxCursor = _syntaxCursor;

    parseDiagnostics = [];
    parsingContext = 0;
    identifiers = createMap<string>();
    identifierCount = 0;
    nodeCount = 0;

    switch (scriptKind) {
        case ScriptKind.JS:
        case ScriptKind.JSX:
            contextFlags = NodeFlags.JavaScriptFile;
            break;
        case ScriptKind.JSON:
            contextFlags = NodeFlags.JavaScriptFile | NodeFlags.JsonFile;
            break;
        default:
            contextFlags = NodeFlags.None;
            break;
    }
    parseErrorBeforeNextFinishedNode = false;

    // Initialize and prime the scanner before parsing the source elements.
    scanner.setText(sourceText);
    scanner.setOnError(scanError);
    scanner.setScriptTarget(languageVersion);
    scanner.setLanguageVariant(getLanguageVariant(scriptKind));
}

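If you read this code carefully, you may have the following questions:

1. What are NodeConstructor and friends?

You can treat NodeConstructor as the constructor of the Node class: new NodeConstructor and new Node are the same thing. So why not use new Node directly? It is a performance optimization. TS is designed to run on any JS engine, including browsers, Node.js, and Microsoft's own JS engine. Node represents a syntax tree node, and there will be a great many of them, so TS allows different environments to use different Node types in order to save as much memory as possible.

2. What is syntaxCursor?

This object is used for incremental parsing; when incremental parsing is not in use, it is undefined. Incremental parsing means that if the source has already been parsed once, the second parse can reuse the results of the first. It is mainly used in editor scenarios: after the source is edited it has to be reparsed into a syntax tree, and incremental parsing can greatly reduce how much has to be reparsed. Incremental parsing is covered in detail in the next section.

3. What is identifiers?

In general, we assume that every word in the source is used at least twice (a variable name is always both defined and used, which already makes two). If strings with the same content share a single reference, memory is saved. identifiers stores the unique reference for each such string.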
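The interning idea can be sketched in a few lines. This is a minimal illustration rather than the parser's actual code: the helper name internIdentifier and the way the counter is updated are assumptions, and whether a given JS engine really shares storage for equal strings is up to the engine.

```typescript
// Hypothetical sketch: one shared entry per distinct identifier text.
const identifiers = new Map<string, string>();
let identifierCount = 0;

function internIdentifier(text: string): string {
    identifierCount++; // every occurrence is counted
    let shared = identifiers.get(text);
    if (shared === undefined) {
        // First time this text is seen: it becomes the canonical copy.
        shared = text;
        identifiers.set(text, shared);
    }
    return shared;
}

// "count" occurs three times but is stored once.
const internedNames = ["count", "total", "count", "count"].map(internIdentifier);
console.log(identifiers.size, identifierCount, internedNames.length);
```

The map ends up with one entry per distinct name, while identifierCount records how many identifier occurrences were seen in total.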

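4. What is parsingContext?

It is a set of flag bits recording which kinds of list the parser is currently inside (see parseList below); error recovery uses it to decide whether a token belongs to an enclosing list. Whether the current function is async, and therefore whether await is legal, is tracked separately in contextFlags.

5. What is parseErrorBeforeNextFinishedNode?

Every syntax tree node is created through createNode and completed by a call to finishNode. If an error occurs while a node is being parsed (a lexical scanning error or a syntax error), parseErrorBeforeNextFinishedNode is set to true; finishNode checks this variable and marks the node as containing a syntax error. Where TypeScript is stronger than many other parsers is that it does not stop at a syntax error but tries to repair the source and carry on (in an editor, auto-completion cannot simply stop because the code contains an error). The node is marked so that the next incremental parse will not reuse it.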
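The create/finish handshake can be shown in miniature. The sketch below is an assumption-laden simplification: MiniNode, createMiniNode, finishMiniNode and reportParseError are hypothetical names, and the real parser stores the error as a bit in the node's flags rather than in a boolean field.

```typescript
// Hypothetical miniature of the "error before next finished node" pattern.
interface MiniNode { kind: string; pos: number; end: number; hasError: boolean; }

let parseErrorBeforeNextFinishedNode = false;
const diagnosticMessages: string[] = [];

function createMiniNode(kind: string, pos: number): MiniNode {
    return { kind, pos, end: pos, hasError: false };
}

function reportParseError(message: string): void {
    diagnosticMessages.push(message);
    parseErrorBeforeNextFinishedNode = true; // remembered until some node is finished
}

function finishMiniNode(node: MiniNode, end: number): MiniNode {
    node.end = end;
    if (parseErrorBeforeNextFinishedNode) {
        // The pending error is attached to this node, then the flag is cleared.
        parseErrorBeforeNextFinishedNode = false;
        node.hasError = true;
    }
    return node;
}

// An if statement with a missing "(": the error is recorded and parsing goes on.
const brokenIf = createMiniNode("IfStatement", 0);
reportParseError("'(' expected.");
finishMiniNode(brokenIf, 5);

// The next node is finished without a pending error, so it stays clean.
const followingStatement = finishMiniNode(createMiniNode("EmptyStatement", 6), 7);
```

Only the node that was under construction when the error was reported is marked, which is what lets a later incremental parse reuse the clean nodes and discard the marked one.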

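The parsing process

Although this series is called a TypeScript source walkthrough, it is mainly about how a compiler is implemented. Once you know these principles you can make sense of a compiler for any language; conversely, reading the TypeScript source on your own without that background is hard. The source is like fruit that you have to peel yourself; this article is like juice, which is absorbed faster. Articles of this kind are scarce in Chinese and in English alike, since few people understand compiler principles well enough to actually build a language.

The parser reads one token at a time and uses it to decide what syntax comes next: on seeing if it knows an if statement follows, and on seeing var it knows a variable declaration follows.

After finding if, the definition of the if statement requires a "(" token next; if it is not there, an error is reported: syntax error, "(" expected. After "(" the parser parses an expression, then a ")", then a statement. If an else follows at that point, it parses one more statement; otherwise the if statement ends there, and the parser goes back to deciding the syntax from the next token.

The grammar of the if statement is defined like this: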

IfStatement:
    if ( Expression ) Statement
    if ( Expression ) Statement else Statement

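This definition means that an if statement (IfStatement) has two forms; either way, it begins with if ( Expression ) Statement.

Why is it defined this way? Because JS follows the ECMA-262 specification, which acts like a protocol and prescribes how the if statement is defined.

ECMA-262 has many editions; the familiar ES3, ES5 and ES6 are editions of this specification. The ES10 edition can be viewed here: http://www.ecma-international.org/ecma-262/10.0/index.html#sec-grammar-summary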
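The if-statement procedure described above can be sketched as a toy recursive-descent parser. Everything in it is a hypothetical simplification: expressions are single identifiers, the only other statement is "identifier ;", and the names tokenize, parseProgram and MiniStatement do not come from the TypeScript source.

```typescript
// Toy grammar: Statement := if ( Identifier ) Statement [else Statement] | Identifier ;
type MiniStatement =
    | { kind: "If"; condition: string; thenStatement: MiniStatement; elseStatement?: MiniStatement }
    | { kind: "ExpressionStatement"; expression: string };

function tokenize(source: string): string[] {
    // Words (identifiers and keywords) or single punctuation characters.
    return source.match(/[A-Za-z_]\w*|[();]/g) ?? [];
}

function parseProgram(source: string): { statements: MiniStatement[]; errors: string[] } {
    const tokens = tokenize(source);
    const errors: string[] = [];
    let index = 0;

    const current = (): string | undefined => tokens[index];

    // Consume the expected token, or record an error and continue without consuming.
    function parseExpected(expected: string): void {
        if (current() === expected) index++;
        else errors.push(`'${expected}' expected.`);
    }

    function parseIdentifierExpression(): string {
        const token = current();
        if (token !== undefined && /^[A-Za-z_]/.test(token) && token !== "if" && token !== "else") {
            index++;
            return token;
        }
        errors.push("Expression expected.");
        return "<missing>";
    }

    function parseStatement(): MiniStatement {
        if (current() === "if") {
            index++;                                        // if
            parseExpected("(");                             // (
            const condition = parseIdentifierExpression();  // Expression
            parseExpected(")");                             // )
            const thenStatement = parseStatement();         // Statement
            if (current() === "else") {
                index++;                                    // else
                return { kind: "If", condition, thenStatement, elseStatement: parseStatement() };
            }
            return { kind: "If", condition, thenStatement };
        }
        const expression = parseIdentifierExpression();
        parseExpected(";");
        return { kind: "ExpressionStatement", expression };
    }

    const statements: MiniStatement[] = [];
    while (index < tokens.length) {
        const before = index;
        statements.push(parseStatement());
        if (index === before) index++; // skip an unusable token so the loop always advances
    }
    return { statements, errors };
}

const wellFormed = parseProgram("if (ready) run; else wait;");
const missingParen = parseProgram("if ready) run;");
```

Here wellFormed contains one If node with an else branch and no errors, while missingParen still yields an If node but also records that "(" was expected, which is the "report and keep going" behaviour the article describes.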

Implementation

A source file is made up of statements. First the next token is read (nextToken); then the statement list is parsed (parseList, parseStatement).

function parseSourceFileWorker(fileName: string, languageVersion: ScriptTarget, setParentNodes: boolean, scriptKind: ScriptKind): SourceFile {
    const isDeclarationFile = isDeclarationFileName(fileName);

    sourceFile = createSourceFile(fileName, languageVersion, scriptKind, isDeclarationFile);
    sourceFile.flags = contextFlags;

    // Prime the scanner.
    nextToken();
    // A member of ReadonlyArray<T> isn't assignable to a member of T[] (and prevents a direct cast) - but this is where we set up those members so they can be readonly in the future
    processCommentPragmas(sourceFile as {} as PragmaContext, sourceText);
    processPragmasIntoFields(sourceFile as {} as PragmaContext, reportPragmaDiagnostic);

    sourceFile.statements = parseList(ParsingContext.SourceElements, parseStatement);
    Debug.assert(token() === SyntaxKind.EndOfFileToken);
    sourceFile.endOfFileToken = addJSDocComment(parseTokenNode());

    setExternalModuleIndicator(sourceFile);

    sourceFile.nodeCount = nodeCount;
    sourceFile.identifierCount = identifierCount;
    sourceFile.identifiers = identifiers;
    sourceFile.parseDiagnostics = parseDiagnostics;

    if (setParentNodes) {
        fixupParentReferences(sourceFile);
    }

    return sourceFile;

    function reportPragmaDiagnostic(pos: number, end: number, diagnostic: DiagnosticMessage) {
        parseDiagnostics.push(createFileDiagnostic(sourceFile, pos, end, diagnostic));
    }
}

Parsing a statement:

function parseStatement(): Statement {
    switch (token()) {
        case SyntaxKind.SemicolonToken:
            return parseEmptyStatement();
        case SyntaxKind.OpenBraceToken:
            return parseBlock(/*ignoreMissingOpenBrace*/ false);
        case SyntaxKind.VarKeyword:
            return parseVariableStatement(<VariableStatement>createNodeWithJSDoc(SyntaxKind.VariableDeclaration));
        // ... (omitted)
    }
}

The rule is simple:

Look at the current token first. If it is var, for example, this is a var statement, so go on to parse a var statement:

function parseVariableStatement(node: VariableStatement): VariableStatement {
    node.kind = SyntaxKind.VariableStatement;
    node.declarationList = parseVariableDeclarationList(/*inForStatementInitializer*/ false);
    parseSemicolon();
    return finishNode(node);
}

A var statement is parsed by first parsing a declaration list and then a semicolon (parseSemicolon).

Now look at how a while statement is parsed:

function parseWhileStatement(): WhileStatement {
    const node = <WhileStatement>createNode(SyntaxKind.WhileStatement);
    parseExpected(SyntaxKind.WhileKeyword);  // while
    parseExpected(SyntaxKind.OpenParenToken); // (
    node.expression = allowInAnd(parseExpression); // *Expression*
    parseExpected(SyntaxKind.CloseParenToken); // )
    node.statement = parseStatement(); // *Statement*
    return finishNode(node);
}

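Parsing, in short, means handling each kind of syntax in this way; the result is the syntax tree.

Of these, the most complex is probably list parsing (parseList):

// Parses a list of elements
function parseList<T extends Node>(kind: ParsingContext, parseElement: () => T): NodeArray<T> {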
    const saveParsingContext = parsingContext;
    parsingContext |= 1 << kind;
    const list = [];
    const listPos = getNodePos();

    while (!isListTerminator(kind)) {
        if (isListElement(kind, /*inErrorRecovery*/ false)) {
            const element = parseListElement(kind, parseElement);
            list.push(element);

            continue;
        }

        if (abortParsingListOrMoveToNextToken(kind)) {
            break;
        }
    }

    parsingContext = saveParsingContext;
    return createNodeArray(list, listPos);
}

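The core of parseList is a loop: as long as the list has not ended, it keeps parsing the same kind of syntax.

For example, when parsing a parameter list, ")" marks the end of the list; otherwise the parser keeps parsing parameters. When parsing an array literal, "]" ends the list.

If a parameter should come next but the next token cannot start one, that is a syntax error. Whether to keep parsing parameters or to abandon the parameter list is then decided by abortParsingListOrMoveToNextToken. The kind: ParsingContext argument distinguishes the different lists (parameters, array elements, or something else).

End of list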

// True if positioned at a list terminator
function isListTerminator(kind: ParsingContext): boolean {
    if (token() === SyntaxKind.EndOfFileToken) {
        // Being at the end of the file ends all lists.
        return true;
    }

    switch (kind) {
        case ParsingContext.BlockStatements:
        case ParsingContext.SwitchClauses:
        case ParsingContext.TypeMembers:
            return token() === SyntaxKind.CloseBraceToken;
        // ... (omitted)
    }
}

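Summary: the end of the file ends every list. Beyond that, a bracketed list stops only at its closing bracket, while other lists, such as an inheritance list (extends A, B, C), end at "{".

Parsing an element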

function parseListElement<T extends Node>(parsingContext: ParsingContext, parseElement: () => T): T {
    const node = currentNode(parsingContext);
    if (node) {
        return <T>consumeNode(node);
    }

    return parseElement();
}

This essentially just calls parseElement; the rest of the code exists for incremental parsing (explained later).

Continue the list?

function abortParsingListOrMoveToNextToken(kind: ParsingContext) {
    parseErrorAtCurrentToken(parsingContextErrors(kind));
    if (isInSomeParsingContext()) {
        return true;
    }

    nextToken();
    return false;
}

// True if positioned at element or terminator of the current list or any enclosing list
function isInSomeParsingContext(): boolean {
    for (let kind = 0; kind < ParsingContext.Count; kind++) {
        if (parsingContext & (1 << kind)) { // for each list context that is currently active
            if (isListElement(kind, /*inErrorRecovery*/ true) || isListTerminator(kind)) {
                return true;
            }
        }
    }

    return false;
}

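Summary: an error is reported first. If the current token could start an element of, or terminate, the current list or any enclosing list, the current list is abandoned and the token is left for the enclosing parser; the list is genuinely broken, so the parser does not compound the mistake. Otherwise the token is skipped and the list carries on, on the assumption that the user only mistyped or left something out.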
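The skip-or-abort decision is easier to see in a reduced model. The sketch below is hypothetical throughout: it knows only two list kinds (top-level calls and numeric arguments), takes whitespace-separated tokens, and leaves out the inErrorRecovery distinction and incremental reuse that the real parseList has.

```typescript
// Hypothetical list kinds, used as bit positions in parsingContext.
const SOURCE_ELEMENTS = 0;
const ARGUMENTS = 1;
const LIST_KIND_COUNT = 2;

interface CallStatement { callee: string; args: number[]; }

function parseCalls(source: string): { statements: CallStatement[]; errors: string[] } {
    const tokens = source.trim().split(/\s+/).filter(t => t.length > 0);
    const errors: string[] = [];
    let index = 0;
    let parsingContext = 0; // bit set of the list kinds currently being parsed

    const token = (): string | undefined => tokens[index]; // undefined means end of file

    function isListElement(kind: number): boolean {
        const t = token();
        if (t === undefined) return false;
        return kind === SOURCE_ELEMENTS ? /^[a-z]+$/.test(t) : /^\d+$/.test(t);
    }

    function isListTerminator(kind: number): boolean {
        if (token() === undefined) return true; // end of file ends every list
        return kind === ARGUMENTS && token() === ")";
    }

    function isInSomeParsingContext(): boolean {
        for (let kind = 0; kind < LIST_KIND_COUNT; kind++) {
            if ((parsingContext & (1 << kind)) && (isListElement(kind) || isListTerminator(kind))) {
                return true;
            }
        }
        return false;
    }

    function parseList<T>(kind: number, parseElement: () => T): T[] {
        const saved = parsingContext;
        parsingContext |= 1 << kind;
        const list: T[] = [];
        while (!isListTerminator(kind)) {
            if (isListElement(kind)) {
                list.push(parseElement());
                continue;
            }
            errors.push(`Unexpected '${token()}' at token ${index}.`);
            if (isInSomeParsingContext()) break; // an active list can use this token: abort
            index++;                             // no list wants it: skip it and continue
        }
        parsingContext = saved;
        return list;
    }

    function parseExpected(expected: string): void {
        if (token() === expected) index++;
        else errors.push(`'${expected}' expected at token ${index}.`);
    }

    function parseCall(): CallStatement {
        const callee = tokens[index++];
        parseExpected("(");
        const args = parseList(ARGUMENTS, () => Number(tokens[index++]));
        parseExpected(")");
        return { callee, args };
    }

    return { statements: parseList(SOURCE_ELEMENTS, parseCall), errors };
}

const skippedToken = parseCalls("f ( 1 # 2 )");       // "#" is skipped, the list continues
const abortedList = parseCalls("f ( 1 2 g ( 3 )");    // "g" starts a new call, so f's list is aborted
```

In the first input the stray "#" belongs to no active list, so it is dropped and f keeps both arguments. In the second, "g" can start an element of the enclosing statement list, so the argument list of f is abandoned and g is parsed as its own statement.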

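The discussion above focused on the if statement; other syntax is handled in much the same way and is not covered further.

By now you should know roughly how the syntax tree is produced. If anything is still unclear, pause here, go back over the material, and read it alongside the source.

Syntactic context

Some syntax is only valid under certain conditions; await, for example, is a keyword only inside an async function.

The source keeps this information in variables shared within the parser's closure. The idea is: set the allow-await flag, parse the expression (with the flag now set), and clear the flag once parsing is done.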

function doInAwaitContext<T>(func: () => T): T {
    return doInsideOfContext(NodeFlags.AwaitContext, func);
}

function doInsideOfContext<T>(context: NodeFlags, func: () => T): T {
    // contextFlagsToSet will contain only the context flags that
    // are not currently set that we need to temporarily enable.
    // We don't just blindly reset to the previous flags to ensure
    // that we do not mutate cached flags for the incremental
    // parser (ThisNodeHasError, ThisNodeOrAnySubNodesHasError, and
    // HasAggregatedChildData).
    const contextFlagsToSet = context & ~contextFlags;
    if (contextFlagsToSet) {
        // set the requested context flags
        setContextFlag(/*val*/ true, contextFlagsToSet);
        const result = func();
        // reset the context flags we just set
        setContextFlag(/*val*/ false, contextFlagsToSet);
        return result;
    }

    // no need to do anything special as we are already in all of the requested contexts
    return func();
}
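A self-contained miniature of the same set-then-clear pattern is given here as a sketch. The flag values and the names runInsideContext, activeContextFlags and classifyAwait are assumptions made for illustration; only the bit arithmetic follows the listing above.

```typescript
// Hypothetical flag values; only the pattern mirrors doInsideOfContext.
const AWAIT_CONTEXT = 1 << 0;
const YIELD_CONTEXT = 1 << 1;

let activeContextFlags = 0;

function runInsideContext<T>(context: number, func: () => T): T {
    // Only flags that are not already set are turned on here and turned off afterwards.
    const flagsToSet = context & ~activeContextFlags;
    if (flagsToSet === 0) {
        return func(); // already inside every requested context
    }
    activeContextFlags |= flagsToSet;
    const result = func();
    activeContextFlags &= ~flagsToSet;
    return result;
}

// "await" is a keyword only while the await flag is set; otherwise it is an identifier.
function classifyAwait(): "keyword" | "identifier" {
    return (activeContextFlags & AWAIT_CONTEXT) !== 0 ? "keyword" : "identifier";
}

const outsideAsync = classifyAwait();
const insideAsync = runInsideContext(AWAIT_CONTEXT, classifyAwait);

// Nesting: the inner call only adds (and later removes) the yield flag.
const flagsWhenNested = runInsideContext(AWAIT_CONTEXT, () =>
    runInsideContext(AWAIT_CONTEXT | YIELD_CONTEXT, () => activeContextFlags));
const flagsAfterwards = activeContextFlags;
```

Because each call clears only the bits it set itself, nested calls unwind cleanly and the flags are back at zero once the outermost call returns.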

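This is also why every grammar production name in the specification carries bracketed parameters: they indicate, for example, whether an expression at that point treats await as a keyword.

IfStatement[Yield, Await, Return]:
   if ( Expression[+In, ?Yield, ?Await] ) Statement[?Yield, ?Await, ?Return] else Statement[?Yield, ?Await, ?Return]
   if ( Expression[+In, ?Yield, ?Await] ) Statement[?Yield, ?Await, ?Return]

Here +In means the In flag is switched on, ?Yield means the Yield flag is passed through unchanged, and ~In means the In flag is switched off.

Thanks to these grammar contexts, code like the following is not allowed: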

for(var x = key in item; x; ) {
      
}

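(Although in is an ordinary operator in its own right, it cannot be used in the initializer of a for statement; otherwise the statement would be parsed as for..in.)

Lookahead

All the examples so far can be resolved from the first token alone (on seeing if, parse an if statement). Is it possible that the first token alone cannot determine the syntax that follows?

Early versions of JS had no such cases (which keeps the compiler simple), but as JS kept gaining features, they appeared.

For example, x on its own is a variable, but followed by an arrow, x =>, it becomes a parameter.

This is where lookahead comes in: peek at the tokens that follow, then decide.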

/** Invokes the provided callback then unconditionally restores the parser to the state it
 * was in immediately prior to invoking the callback.  The result of invoking the callback
 * is returned from this function.
 */
function lookAhead<T>(callback: () => T): T {
    return speculationHelper(callback, /*isLookAhead*/ true);
}

function speculationHelper<T>(callback: () => T, isLookAhead: boolean): T {
    // Keep track of the state we'll need to rollback to if lookahead fails (or if the
    // caller asked us to always reset our state).
    const saveToken = currentToken;
    const saveParseDiagnosticsLength = parseDiagnostics.length;
    const saveParseErrorBeforeNextFinishedNode = parseErrorBeforeNextFinishedNode;

    // Note: it is not actually necessary to save/restore the context flags here.  That's
    // because the saving/restoring of these flags happens naturally through the recursive
    // descent nature of our parser.  However, we still store this here just so we can
    // assert that invariant holds.
    const saveContextFlags = contextFlags;

    // If we're only looking ahead, then tell the scanner to only lookahead as well.
    // Otherwise, if we're actually speculatively parsing, then tell the scanner to do the
    // same.
    const result = isLookAhead
        ? scanner.lookAhead(callback)
        : scanner.tryScan(callback);

    Debug.assert(saveContextFlags === contextFlags);

    // If our callback returned something 'falsy' or we're just looking ahead,
    // then unconditionally restore us to where we were.
    if (!result || isLookAhead) {
        currentToken = saveToken;
        parseDiagnostics.length = saveParseDiagnosticsLength;
        parseErrorBeforeNextFinishedNode = saveParseErrorBeforeNextFinishedNode;
    }

    return result;
}

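That description is simplified; the implementation is a little more involved. What if a token error turns up while peeking? The parser therefore first remembers the current diagnostics state, uses the scanner's lookahead facility to read the following tokens, and afterwards restores everything to its previous state.

Which syntax in TypeScript needs lookahead?

<T> may be a type assertion (<any>x), an arrow function (<T> x => {}) or JSX (<T>X</T>)
public may be a modifier (public class A {}) or a variable (public++)
type is a type alias only when followed by an identifier; otherwise it is a variable
let is a variable declaration only when followed by an identifier; otherwise it is a variable, although let let = 1 is not allowed.

As you can see, lookahead adds complexity to the compiler and costs some performance.

Language designers try to avoid the need for lookahead, but in some places compatibility leaves no other choice.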
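The save-and-restore idea can be sketched on a plain token array. This is only an approximation under stated assumptions: createMiniParser and its members are hypothetical, tokens are whitespace-separated, and the real parser delegates position handling to scanner.lookAhead and scanner.tryScan.

```typescript
// Hypothetical token cursor over whitespace-separated tokens (not the real scanner).
function createMiniParser(source: string) {
    const tokens = source.trim().split(/\s+/);
    const errors: string[] = [];
    let index = 0;

    function nextToken(): string | undefined {
        return tokens[++index];
    }

    function speculate<T>(callback: () => T, isLookAhead: boolean): T {
        // Remember everything the callback might change.
        const savedIndex = index;
        const savedErrorCount = errors.length;
        const result = callback();
        // Pure lookahead always rewinds; a trial parse rewinds only on a falsy result.
        if (!result || isLookAhead) {
            index = savedIndex;
            errors.length = savedErrorCount;
        }
        return result;
    }

    function lookAhead<T>(callback: () => T): T {
        return speculate(callback, true);
    }

    function tryParse<T>(callback: () => T): T {
        return speculate(callback, false);
    }

    // "x" alone is a variable; "x =>" starts an arrow function whose parameter is x.
    function classifyIdentifier(): "arrow-function-parameter" | "variable" {
        return lookAhead(() => nextToken() === "=>") ? "arrow-function-parameter" : "variable";
    }

    return { errors, nextToken, lookAhead, tryParse, classifyIdentifier, position: () => index };
}

const arrowCase = createMiniParser("x => x + 1");
const plainCase = createMiniParser("x + 1");
const arrowKind = arrowCase.classifyIdentifier();
const plainKind = plainCase.classifyIdentifier();
```

After classifyIdentifier returns, the position is still at the first token, so the real parsing routine chosen afterwards starts from the same place the lookahead did, and any diagnostics produced while peeking have been discarded.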

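Syntactic ambiguity

The / ambiguity

As mentioned earlier, / may be a division sign or the start of a regular expression. The lexer cannot tell which, but the parser knows what syntax it currently expects and can handle the symbol correctly.

For example, when an expression is expected and / appears, a division sign cannot begin an expression, so / is rescanned as a regular expression token. If / appears after an expression, it is treated as division.

But some ambiguities are hard to resolve even at the parsing stage.

The < ambiguity

Take call<number, any>(x). You might read it as a call to the generic function call (with argument x), but it can also be read as (call < number), (any > (x)).

Every C-style language with generics has a similar problem. Most compilers do what you would expect and treat it as generics; after all, people rarely write a parenthesis right after >.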
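One way to implement the "treat it as generics" choice is a speculative read of the type argument list that commits only when a call's "(" follows. The sketch below is an assumption about how such a rule could look, not the rule TypeScript itself uses: tryParseTypeArguments is a hypothetical name, and type arguments are limited to bare identifiers.

```typescript
// Speculatively read "< Name (, Name)* >" starting at `start`;
// succeed only if "(" follows, otherwise report failure so "<" can be a comparison.
function tryParseTypeArguments(
    tokens: string[],
    start: number
): { typeArguments: string[]; next: number } | undefined {
    let index = start;
    if (tokens[index] !== "<") return undefined;
    index++;

    const typeArguments: string[] = [];
    for (;;) {
        const candidate: string | undefined = tokens[index];
        if (candidate === undefined || !/^[A-Za-z_]\w*$/.test(candidate)) return undefined;
        typeArguments.push(candidate);
        index++;
        if (tokens[index] === ",") {
            index++;
            continue;
        }
        break;
    }

    if (tokens[index] !== ">") return undefined;
    index++;

    // Commit to the generic reading only when the call's "(" comes next.
    return tokens[index] === "(" ? { typeArguments, next: index } : undefined;
}

const genericCall = tryParseTypeArguments(["call", "<", "number", ",", "any", ">", "(", "x", ")"], 1);
const comparison = tryParseTypeArguments(["a", "<", "b", ">", "c"], 1);
```

The function never mutates shared state, so a failed attempt costs nothing: the caller simply falls back to parsing "<" as a relational operator from the original position.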

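When TS was first designed, <T>x denoted a type assertion, a design modelled on C-style casts. Later, when JSX support was added, this syntax conflicted directly with JSX.

TS's solution was to introduce the as syntax: <T>x and x as T are exactly equivalent. It also introduced the .tsx extension: in .tsx files <T> is treated as JSX, while in ordinary .ts files <T> is still a type assertion (for compatibility).

Semicolon insertion

JS has always allowed semicolons to be omitted. Wherever a semicolon is expected, the parser checks whether the current token is "}" or the end of the file, or whether it is preceded by a line break.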

function canParseSemicolon() {
    // If there's a real semicolon, then we can always parse it out.
    if (token() === SyntaxKind.SemicolonToken) {
        return true;
    }

    // We can parse out an optional semicolon in ASI cases in the following cases.
    return token() === SyntaxKind.CloseBraceToken || token() === SyntaxKind.EndOfFileToken || scanner.hasPrecedingLineBreak();
}
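The same three cases can be tried out on plain strings. In this sketch scanTokens is a deliberately crude, hypothetical tokenizer, and canParseSemicolon takes the token list and an index as parameters instead of reading parser state; only the decision logic follows the listing above.

```typescript
interface ScannedToken { text: string; hasPrecedingLineBreak: boolean; }

// Crude tokenizer: words, numbers, or single punctuation characters,
// each remembering whether a line break came before it.
function scanTokens(source: string): ScannedToken[] {
    const result: ScannedToken[] = [];
    const pattern = /(\s*)([A-Za-z_]\w*|\d+|[^\s\w])/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(source)) !== null) {
        result.push({ text: match[2], hasPrecedingLineBreak: match[1].indexOf("\n") !== -1 });
    }
    return result;
}

function canParseSemicolon(tokens: ScannedToken[], index: number): boolean {
    const token: ScannedToken | undefined = tokens[index];
    if (token === undefined) return true;   // end of file
    if (token.text === ";") return true;    // a real semicolon
    // Automatic semicolon insertion: before "}" or after a line break.
    return token.text === "}" || token.hasPrecedingLineBreak;
}

// Index 4 is the token right after "var a = 1" in each input.
const acrossLines = canParseSemicolon(scanTokens("var a = 1\nvar b = 2"), 4);
const sameLine = canParseSemicolon(scanTokens("var a = 1 var b = 2"), 4);
const beforeBrace = canParseSemicolon(scanTokens("{ var a = 1 }"), 5);
const atEndOfFile = canParseSemicolon(scanTokens("var a = 1"), 4);
```

Only the same-line case is rejected: two declarations on one line with nothing between them give the parser no place to insert a semicolon.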

JSDoc

To stay as compatible with JS as possible, TS lets users annotate types with plain JS plus JSDoc, so JSDoc comments in JS files are parsed as part of the source.

/** @type {number} */
var x = 120

export function parseIsolatedJSDocComment(content: string, start: number | undefined, length: number | undefined): { jsDoc: JSDoc, diagnostics: Diagnostic[] } | undefined {
    initializeState(content, ScriptTarget.Latest, /*_syntaxCursor:*/ undefined, ScriptKind.JS);
    sourceFile = <SourceFile>{ languageVariant: LanguageVariant.Standard, text: content };
    const jsDoc = doInsideOfContext(NodeFlags.None, () => parseJSDocCommentWorker(start, length));
    const diagnostics = parseDiagnostics;
    clearState();

    return jsDoc ? { jsDoc, diagnostics } : undefined;
}

export function parseJSDocComment(parent: HasJSDoc, start: number, length: number): JSDoc | undefined {
    const saveToken = currentToken;
    const saveParseDiagnosticsLength = parseDiagnostics.length;
    const saveParseErrorBeforeNextFinishedNode = parseErrorBeforeNextFinishedNode;

    const comment = doInsideOfContext(NodeFlags.None, () => parseJSDocCommentWorker(start, length));
    if (comment) {
        comment.parent = parent;
    }

    if (contextFlags & NodeFlags.JavaScriptFile) {
        if (!sourceFile.jsDocDiagnostics) {
            sourceFile.jsDocDiagnostics = [];
        }
        sourceFile.jsDocDiagnostics.push(...parseDiagnostics);
    }
    currentToken = saveToken;
    parseDiagnostics.length = saveParseDiagnosticsLength;
    parseErrorBeforeNextFinishedNode = saveParseErrorBeforeNextFinishedNode;

    return comment;
}

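Summary

This section covered parsing and touched on how TS keeps parsing after it hits an error.

The next section focuses on incremental parsing.

Time was limited and this article has not been proofread; if you find a mistake, please point it out.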