
Lexer (Lexical Analyzer)

Lexical Analysis

In computer science, lexical analysis, lexing, or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning).

Coding Goal

Given a source file, the goal is to convert it into a stream of lexical tokens. For example, if the token code for int is defined as 30, the output is <30, int>; if the token code for numbers is 11, then the input 123 produces the output <11, 123>.
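These examples presuppose a table that maps lexemes to token codes. The key code later in the article relies on a helper called LexicalToken (an instance named lextok) with the methods getToken, isKeyWord, and getLexicalTokenMap; a minimal sketch of such a class is shown below. Only those names and the codes 30 and 11 come from the article; the rest of the table and the keyword list are assumptions.

import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Set;

public class LexicalToken {
    // Keyword list is an assumption; only "int" is confirmed by the example above.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList("int"));

    // lexeme -> token code; insertion order is preserved, matching the LinkedHashMap
    // used by the lexer code below.
    private final LinkedHashMap<String, Integer> lexicalTokenMap = new LinkedHashMap<>();

    public LexicalToken() {
        lexicalTokenMap.put("int", 30);  // from the example: <30, int>
        lexicalTokenMap.put("NUM", 11);  // from the example: <11, 123>
        lexicalTokenMap.put("ID", 10);   // code for ordinary identifiers (assumed)
        lexicalTokenMap.put(">=", 40);   // two-character operator (assumed)
        lexicalTokenMap.put(">", 41);    // one-character operator (assumed)
    }

    public static boolean isKeyWord(String word) {
        return KEYWORDS.contains(word);
    }

    public LinkedHashMap<String, Integer> getLexicalTokenMap() {
        return lexicalTokenMap;
    }

    public int getToken(String lexeme) {
        return lexicalTokenMap.getOrDefault(lexeme, -1);  // -1 marks an unknown lexeme (assumed)
    }
}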

Conventions

The lexical units in a program fall into four categories: identifiers (split into keywords and ordinary identifiers), numbers, special characters, and whitespace (spaces, tabs, carriage returns/line feeds, and so on).

Program Flowchart

(Program flowchart image.) For operators and similar symbols, only two-character combinations are considered here; operators made up of three characters are not handled. The reason for reading one more character after a special character has been read is that the token table may contain operators such as >=, so the longest possible match must be guaranteed.
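To make the longest-match point concrete, the following stand-alone sketch shows the lookahead decision in isolation; the class name LongestMatchDemo, the operator table, and the token codes in it are assumptions used only for illustration.

import java.util.LinkedHashMap;
import java.util.Map;

public class LongestMatchDemo {
    public static void main(String[] args) {
        // Hypothetical operator table: it contains both ">" and ">=".
        Map<String, Integer> tokenMap = new LinkedHashMap<>();
        tokenMap.put(">", 41);
        tokenMap.put(">=", 40);

        String src = ">=1";
        String one = src.substring(0, 1);   // the special character just read
        String two = src.substring(0, 2);   // read one more character ahead
        // Prefer the two-character lexeme when the table contains it (longest match).
        String lexeme = tokenMap.containsKey(two) ? two : one;
        System.out.println(lexeme);         // prints ">=" rather than ">"
    }
}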

Key Code

Determining the type of the first character

// Requires java.util.regex.Pattern and java.util.regex.Matcher.
public static String getCharType(String str) {
    String regex_Letter = "[a-zA-Z]";
    String regex_Number = "[0-9]";
    String regex_Blank = "\\s";
    Pattern pattern;
    pattern = Pattern.compile(regex_Letter);
    Matcher matcher = pattern.matcher(str);
    if (matcher.find())
        return "LETTER";
    pattern = Pattern.compile(regex_Number);
    matcher = pattern.matcher(str);
    if (matcher.find())
        return "NUMBER";
    pattern = Pattern.compile(regex_Blank);
    matcher = pattern.matcher(str);
    if (matcher.find())
        return "BLANK";
    // Anything that is not a letter, digit, or whitespace counts as a special character.
    return "SPECIAL";
}

If the first character is a letter

case "LETTER":
	pattern = Pattern.compile(regex_ID);
	matcher = pattern.matcher(srcCode);
	if (matcher.lookingAt()) {
		String result = matcher.group();
		if (LexicalToken.isKeyWord(result)) {
			int token = lextok.getToken(result);
			System.out.printf("<%d,%s>  ", token, result);
		} else {
			int token = lextok.getToken("ID");
			System.out.printf("<%d,%s>  ", token, result);
		}
	}
	srcCode = srcCode.substring(matcher.end());
	break;
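regex_ID itself does not appear in the excerpt; a definition consistent with the conventions above (identifiers start with a letter) would be the following, which is an assumption rather than the original pattern.

String regex_ID = "[a-zA-Z][a-zA-Z0-9]*";  // assumed: a letter followed by letters or digits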

If the first character is a digit

case "NUMBER":
	pattern = Pattern.compile(regex_NUM);
    matcher = pattern.matcher(srcCode);
    if (matcher.lookingAt()) {
	    String result = matcher.group();
        int token = lextok.getToken("NUM");
        System.out.printf("<%d,%s>  ", token, result);
     }
     srcCode = srcCode.substring(matcher.end());
     break;
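regex_NUM is likewise not shown; a pattern matching the integer example <11, 123> would be the following, again as an assumption.

String regex_NUM = "[0-9]+";  // assumed: one or more digits (no floating-point forms)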

If the first character is whitespace

case "BLANK":
	srcCode = srcCode.substring(1);
    break;

If the first character is a special symbol

case "SPECIAL":
	if (srcCode.length() > 1) {
	    String secondChar = srcCode.substring(1, 2);
        String result;
        LinkedHashMap tokenMap = lextok.getLexicalTokenMap();
        Set set = tokenMap.keySet();
        result = firstChar + secondChar;
        if (getCharType(secondChar).equals("SPECIAL") && set.contains(result)) {
            int token = lextok.getToken(result);
            System.out.printf("<%d,%s>  ", token, result);
            srcCode = srcCode.substring(2);
        }else {
            result = firstChar;
            int token = lextok.getToken(result);
            System.out.printf("<%d,%s>  ", token, result);
            srcCode = srcCode.substring(1);
              }
	} else {  // 字串中只有一個字元時
           int token = lextok.getToken(srcCode);
           System.out.printf("<%d,%s>  ", token, srcCode);
           srcCode = srcCode.substring(1);
    }
    break;
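Putting the four cases together in a loop that repeatedly classifies the first character of srcCode yields the complete token stream. With the token table sketched earlier, the input int count >= 123 would print <30,int>  <10,count>  <40,>=>  <11,123>; only the codes 30 and 11 are fixed by the goal stated above, the others depend entirely on what the table assigns.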