
To be a Tough Man——liushuaikobe

   <while, _>
   <(, _>
   <id, pointer to the symbol-table entry for i>
   <>=, _>
   <id, pointer to the symbol-table entry for j>
   <), _>
   <id, pointer to the symbol-table entry for i>
   <--, _>
   <;, _>
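The lexical analyzer as a standalone subroutine:
Lexical analysis is one phase of compilation and runs before syntax analysis. Making it a separate pass simplifies the design, improves compiler efficiency, and makes the compiler more portable. It can also be merged with syntax analysis into a single pass, in which case the parser calls the lexical analyzer whenever it needs the current token.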
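
To make the token stream listed above concrete, here is a small Python 3 sketch that tokenizes `while (i >= j) i--;`, which appears to be the fragment behind that listing. The function name `tokenize`, the regular expression, the one-word keyword set, and the use of a list index as the "pointer" into the symbol table are all assumptions made for this example; it is not the Scanner.py implementation shown later.

```python
import re

# Hypothetical patterns: two-character operators are listed before single
# characters so that ">=" and "--" are matched as one token each.
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<op>>=|--|[();]))")
KEYWORDS = {"while"}  # assumed subset, enough for this fragment


def tokenize(source):
    """Return (tokens, symbols) for a tiny C-like fragment.

    Each token is a (name, attribute) pair. For identifiers the attribute
    is the index of the name in the symbol table; otherwise it is None.
    """
    tokens, symbols = [], []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError("unexpected character at offset %d" % pos)
        pos = match.end()
        word = match.group("id")
        if word is None:
            tokens.append((match.group("op"), None))
        elif word in KEYWORDS:
            tokens.append((word, None))
        else:
            if word not in symbols:
                symbols.append(word)
            tokens.append(("id", symbols.index(word)))
    return tokens, symbols


tokens, symbols = tokenize("while (i >= j) i--;")
for token in tokens:
    print(token)
```

Both occurrences of `i` carry the same attribute because they refer to the same symbol-table entry, which is the point of the `<id, pointer>` notation.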

-----------------------------------------------------------------------------------

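The lexical analyzer I wrote is far from robust, especially in its error handling. For example, 'ab' is not a valid single-character char literal in C, yet my analyzer cannot flag it as an error and falls into an infinite loop. It also recognizes only a limited set of keywords and limited forms of strings (once you have read my state machines you will see where the limits are). I ran out of time and do not want to revise it further, so I am posting the code below for reference.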

Before the code, a few words on how the analyzer's state machines are designed.

Numbers, both floating-point and integer, are recognized by a single state machine:
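
The transitions of this machine can be read off the number-handling branch of Scanner.py. The Python 3 sketch below condenses them into a function that classifies one complete lexeme at a time. The state numbers follow Scanner.py (0 start, 4 leading zero, 5 after the decimal point, 6 integer digits, 7 fraction digits, 8 error); the function name `classify_number` and the constant names are invented for this example.

```python
import string

# State numbers mirror the ones used in Scanner.py; the names are assumptions.
START, ZERO, DOT, INT_DIGITS, FRAC_DIGITS, ERROR = 0, 4, 5, 6, 7, 8


def classify_number(text):
    """Return "INT", "FLOAT" or "ERROR" for a candidate numeric lexeme."""
    state = START
    for ch in text:
        if state == START:
            if ch in "123456789":
                state = INT_DIGITS
            elif ch == "0":
                state = ZERO
            else:
                state = ERROR
        elif state == ZERO:
            # After a leading zero only a decimal point may follow.
            state = DOT if ch == "." else ERROR
        elif state == INT_DIGITS:
            if ch in string.digits:
                state = INT_DIGITS
            elif ch == ".":
                state = DOT
            else:
                state = ERROR
        elif state in (DOT, FRAC_DIGITS):
            # At least one digit is required after the decimal point.
            state = FRAC_DIGITS if ch in string.digits else ERROR
        else:
            break  # ERROR is absorbing, so the remaining input is irrelevant
    if state in (ZERO, INT_DIGITS):
        return "INT"
    if state == FRAC_DIGITS:
        return "FLOAT"
    return "ERROR"


for sample in ["0", "42", "3.14", "007", "1."]:
    print(sample, classify_number(sample))
```

Only states 4, 6 and 7 are accepting, so inputs such as `1.` or `.5` end in a non-accepting state and are reported as errors.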


Characters and strings are recognized by another state machine, which covers keywords, char literals, and char * strings:
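
As a rough sketch of this kind of machine, the Python 3 function below classifies one complete lexeme as a keyword, an identifier, a char literal, or a string. It deliberately differs from Scanner.py in two ways: it accepts any character between quotes, and it rejects `'ab'` explicitly instead of getting stuck. The state names, the function name `classify_word`, and the keyword set are assumptions for illustration; the real keyword list lives in Category.py.

```python
import string

IDENT_START = string.ascii_letters + "_$"
IDENT_BODY = IDENT_START + string.digits
CHAR_ESCAPES = "bnt\\'\""  # characters allowed after a backslash in a char literal
KEYWORDS = {"char", "else", "if", "int", "return", "while"}  # hypothetical subset


def classify_word(lexeme):
    """Return the token class of a lexeme, or "ERROR" if it is malformed."""
    state = "start"
    for ch in lexeme:
        if state == "start":
            if ch == "'":
                state = "char_open"
            elif ch == '"':
                state = "str_body"
            elif ch in IDENT_START:
                state = "ident"
            else:
                state = "error"
        elif state == "char_open":
            if ch == "\\":
                state = "char_escape"
            elif ch == "'":
                state = "error"  # empty literal ''
            else:
                state = "char_body"
        elif state == "char_escape":
            state = "char_body" if ch in CHAR_ESCAPES else "error"
        elif state == "char_body":
            # Exactly one character is allowed, so 'ab' fails here.
            state = "char_done" if ch == "'" else "error"
        elif state == "str_body":
            if ch == "\\":
                state = "str_escape"
            elif ch == '"':
                state = "str_done"
        elif state == "str_escape":
            state = "str_body"
        elif state == "ident":
            if ch not in IDENT_BODY:
                state = "error"
        else:
            # Trailing input after a closed literal, or already in error.
            state = "error"
    if state == "ident":
        return lexeme.upper() if lexeme in KEYWORDS else "IDN"
    if state == "char_done":
        return "CHAR"
    if state == "str_done":
        return "CHAR *"
    return "ERROR"


for sample in ["while", "count_1", "'a'", "'ab'", '"hello"']:
    print(sample, classify_word(sample))
```

Because every state has an explicit transition for unexpected input, the loop always consumes one character per step and cannot hang.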


Comments in C are also recognized by a state machine. They must be cut out of the source before the analysis starts:
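
Scanner.py does this in two steps: `readComments` records where each comment starts and ends, and `cutComments` removes those spans. As an alternative sketch, the Python 3 function below uses a comparable four-state machine to drop `/* ... */` comments in a single pass. The name `strip_block_comments` and the state names are assumptions, and the sketch ignores `//` line comments and comment markers that appear inside string literals.

```python
def strip_block_comments(source):
    """Return source with every /* ... */ block comment removed."""
    out = []
    state = "code"
    for ch in source:
        if state == "code":
            if ch == "/":
                state = "slash"  # might be the start of a comment
            else:
                out.append(ch)
        elif state == "slash":
            if ch == "*":
                state = "comment"
            elif ch == "/":
                out.append("/")  # previous slash was plain code; this one may open a comment
            else:
                out.append("/")
                out.append(ch)
                state = "code"
        elif state == "comment":
            if ch == "*":
                state = "star"  # might be the end of the comment
        elif state == "star":
            if ch == "/":
                state = "code"
            elif ch != "*":
                state = "comment"
    if state == "slash":
        out.append("/")  # a trailing slash was ordinary code
    return "".join(out)


print(strip_block_comments("int a; /* note */ int b;"))
```

Writing the kept characters directly to the output avoids searching for the comment text a second time, at the cost of not recording where the comments were.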


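Operators (both unary and binary) are recognized without an explicit state machine; the code tests for them directly. In a sense this still follows the state-machine principle, but the machine is simple enough that I did not represent it with an explicit state variable. In effect it looks like this: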
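
The implicit machine amounts to one character of lookahead: after reading `>`, peek at the next character to decide between `>` and `>=`. A minimal Python 3 sketch of that idea follows. The operator sets are taken from the operators Scanner.py handles; the function name `scan_operators` and the table-driven layout are assumptions for this example.

```python
TWO_CHAR_OPS = {">=", "<=", "++", "--", "==", "!=", "&&", "||"}
ONE_CHAR_OPS = set("><+-=!&|*/;,{}[]()")


def scan_operators(text):
    """Split a string of operators and delimiters into tokens.

    The two-character form is tried first, so ">=" is never split
    into ">" followed by "=".
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
        elif text[i:i + 2] in TWO_CHAR_OPS:
            tokens.append(text[i:i + 2])
            i += 2
        elif text[i] in ONE_CHAR_OPS:
            tokens.append(text[i])
            i += 1
        else:
            raise ValueError("not an operator: %r" % text[i])
    return tokens


print(scan_operators(">= > -- - == = !"))
```

Slicing with `text[i:i + 2]` returns a shorter string at the end of the input instead of raising, so an operator in the last position needs no special case.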


Here is the code:

Scanner.py, which runs as the main module:

'''
Created on 2012-10-18

@author: liushuai
'''
import string
import Category
import FileAccess

_currentIndex = 0
_Tokens = []
_prog = ""
_categoryNo = -1

_stateNumber = 0
_stateString = 0
_potentialNumber = ""
_potentialString = ""

def readComments(prog):
    '''Read the comments of a program'''
    state = 0
    currentIndex, beginIndex, endIndex = (0, 0, 0)
    commentsIndexs = []
    for c in prog:
        if state == 0:
            if c == '/':
                beginIndex = currentIndex
                state = 1
            else:
                pass
        elif state == 1:
            if c == '*':
                state = 2
            else :
                state = 0
        elif state == 2:
            if c == '*':
                state = 3
            else:
                pass
        elif state == 3:
            if c == '*':
                pass
            elif c == '/':
                endIndex = currentIndex
                commentsIndexs.append([beginIndex, endIndex])
                state = 0 #set 0 state
            else:
                state = 2
        currentIndex += 1
    return commentsIndexs
        
def cutComments(prog, commentsIndexs):
    '''cut the comments of the program prog'''
    num = len(commentsIndexs)
    if num == 0:
        return prog
    else :
        comments = []
        for i in xrange(num):
            comments.append(prog[commentsIndexs[i][0]:commentsIndexs[i][1] + 1])
        for item in comments:
            prog = prog.replace(item, "")
        return prog
    
def scan(helper):
    '''scan the program, and analysis it'''
    global _stateNumber, _stateString, _currentIndex, _Tokens, _prog, _categoryNo, _potentialNumber, _potentialString
    currentChar = _prog[_currentIndex]
    ######################################CHAR STRING######################################
    if currentChar == '\'' or currentChar == '\"' or currentChar in string.letters + "_$\\%\@"  or (currentChar in string.digits and _stateString != 0):
        if _stateString == 0:
            if currentChar == '\'':
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 1
                _currentIndex += 1
            elif currentChar == "\"":
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 2
                _currentIndex += 1
            elif currentChar in string.letters + "$_":
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 7
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        elif _stateString == 1:
            # NOTE: this character set was mangled in extraction; only '#', '@' and '%' are recoverable.
            if currentChar in string.letters + "#@%":
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 3
                _currentIndex += 1
            elif currentChar == '\\':
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 9
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        elif _stateString == 2:
            if currentChar in string.letters + "\\% ":
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 4
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        elif _stateString == 3:
            if currentChar == '\'':
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 5
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        elif _stateString == 4:
            if currentChar == '\"':
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 6
                _currentIndex += 1
            elif currentChar in string.letters + "\\% ":
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 4
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        elif _stateString == 7:
            if currentChar in string.digits + string.letters + "$_":
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 8
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        elif _stateString == 8:
            if currentChar in string.digits + string.letters + "$_":
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 8
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        elif _stateString == 9:
            if currentChar in ['b', 'n', 't', '\\', '\'', '\"']:
                _potentialString = "%s%s" % (_potentialString, currentChar)
                _stateString = 3
                _currentIndex += 1
            else:
                _currentIndex += 1
                _stateNumber = 10
        else:
            _currentIndex += 1
    ###################################### NUMBERS ######################################
    elif currentChar in string.digits + ".":
        if _stateNumber == 0:
            if currentChar in "123456789":
                _potentialNumber = "%s%s" % (_potentialNumber, currentChar)
                _stateNumber = 6
                _currentIndex += 1
            elif currentChar == '0':
                _potentialNumber = "%s%s" % (_potentialNumber, currentChar)
                _stateNumber = 4
                _currentIndex += 1
            else:
                _stateNumber = 8
                _currentIndex += 1
        elif _stateNumber == 4:
            if currentChar == '.':
                _potentialNumber = "%s%s" % (_potentialNumber, currentChar)
                _stateNumber = 5
                _currentIndex += 1
            else:
                _stateNumber = 8
                _currentIndex += 1
        elif _stateNumber == 5:
            if currentChar in string.digits:
                _potentialNumber = "%s%s" % (_potentialNumber, currentChar)
                _stateNumber = 7
                _currentIndex += 1
            else:
                _stateNumber = 8
                _currentIndex += 1
        elif _stateNumber == 6:
            if currentChar in string.digits:
                _potentialNumber = "%s%s" % (_potentialNumber, currentChar)
                _stateNumber = 6
                _currentIndex += 1
            elif currentChar == '.':
                _potentialNumber = "%s%s" % (_potentialNumber, currentChar)
                _stateNumber = 5
                _currentIndex += 1
            else:
                _stateNumber = 8
                _currentIndex += 1
        elif _stateNumber == 7:
            if currentChar in string.digits:
                _potentialNumber = "%s%s" % (_potentialNumber, currentChar)
                _stateNumber = 7
                _currentIndex += 1
            else:
                _stateNumber = 8
                _currentIndex += 1
        else:
            _currentIndex += 1
    ###################################### OTHER OPERATORS ######################################
    else:
        # Any other character ends a pending number: emit it, then reset.
        if _stateNumber == 6 or _stateNumber == 4:
            helper.outPutToken(_potentialNumber, "INT", Category.IdentifierTable["INT"])
        elif _stateNumber == 7:
            helper.outPutToken(_potentialNumber, "FLOAT", Category.IdentifierTable["FLOAT"])
        elif _stateNumber != 0:
            helper.outPutToken("ERROR NUMBER", "None", "None")
        _stateNumber = 0
        _potentialNumber = ""
        # It also ends a pending keyword, identifier, char or string.
        if _stateString == 7 or _stateString == 8:
            if _potentialString in Category.KeyWordsTable:
                helper.outPutToken(_potentialString, _potentialString.upper(), Category.IdentifierTable[_potentialString.upper()])
            else:
                helper.outPutToken(_potentialString, "IDN", Category.IdentifierTable["IDN"])
                helper.setSymbolTable(_potentialString, "IDN", Category.IdentifierTable["IDN"])
        elif _stateString == 5:
            helper.outPutToken(_potentialString, "CHAR", Category.IdentifierTable["CHAR"])
        elif _stateString == 6:
            helper.outPutToken(_potentialString, "CHAR *", Category.IdentifierTable["CHAR *"])
        elif _stateString != 0:
            helper.outPutToken("ERROR STRING", "None", "None")
        _stateString = 0
        _potentialString = ""
        # Operators and delimiters, using one character of lookahead where needed.
        if currentChar == " ":
            _currentIndex += 1
        elif currentChar == '>':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == "=":
                helper.outPutToken(">=", ">=", Category.IdentifierTable[">="])
                _currentIndex += 1
            else:
                helper.outPutToken(">", ">", Category.IdentifierTable[">"])
        elif currentChar == '<':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == "=":
                helper.outPutToken("<=", "<=", Category.IdentifierTable["<="])
                _currentIndex += 1
            else:
                helper.outPutToken("<", "<", Category.IdentifierTable["<"])
        elif currentChar == '+':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == '+':
                helper.outPutToken("++", "++", Category.IdentifierTable["++"])
                _currentIndex += 1
            else:
                helper.outPutToken("+", "+", Category.IdentifierTable["+"])
        elif currentChar == '-':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == '-':
                helper.outPutToken("--", "--", Category.IdentifierTable["--"])
                _currentIndex += 1  # consume the second '-' as the other two-character operators do
            else:
                helper.outPutToken("-", "-", Category.IdentifierTable["-"])
        elif currentChar == '=':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == '=':
                helper.outPutToken("==", "==", Category.IdentifierTable["=="])
                _currentIndex += 1
            else:
                helper.outPutToken("=", "=", Category.IdentifierTable["="])
        elif currentChar == '!':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == '=':
                helper.outPutToken("!=", "!=", Category.IdentifierTable["!="])
                _currentIndex += 1
            else:
                helper.outPutToken("!", "!", Category.IdentifierTable["!"])
        elif currentChar == '&':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == '&':
                helper.outPutToken("&&", "&&", Category.IdentifierTable["&&"])
                _currentIndex += 1
            else:
                helper.outPutToken("&", "&", Category.IdentifierTable["&"])
        elif currentChar == '|':
            _currentIndex += 1
            currentChar = _prog[_currentIndex]
            if currentChar == '|':
                helper.outPutToken("||", "||", Category.IdentifierTable["||"])
                _currentIndex += 1
            else:
                helper.outPutToken("|", "|", Category.IdentifierTable["||"])
        elif currentChar == '*':
            helper.outPutToken("*", "*", Category.IdentifierTable["*"])
            _currentIndex += 1
        elif currentChar == '/':
            helper.outPutToken("/", "/", Category.IdentifierTable["/"])
            _currentIndex += 1
        elif currentChar == ';':
            helper.outPutToken(";", ";", Category.IdentifierTable[";"])
            _currentIndex += 1
        elif currentChar == ",":
            helper.outPutToken(",", ",", Category.IdentifierTable[","])
            _currentIndex += 1
        elif currentChar == '{':
            helper.outPutToken("{", "{", Category.IdentifierTable["{"])
            _currentIndex += 1
        elif currentChar == '}':
            helper.outPutToken("}", "}", Category.IdentifierTable["}"])
            _currentIndex += 1
        elif currentChar == '[':
            helper.outPutToken("[", "[", Category.IdentifierTable["["])
            _currentIndex += 1
        elif currentChar == ']':
            helper.outPutToken("]", "]", Category.IdentifierTable["]"])
            _currentIndex += 1
        elif currentChar == '(':
            helper.outPutToken("(", "(", Category.IdentifierTable["("])
            _currentIndex += 1
        elif currentChar == ')':
            helper.outPutToken(")", ")", Category.IdentifierTable[")"])
            _currentIndex += 1

if __name__ == '__main__':
    helper = FileAccess.FileHelper("H://test.c", "H://token.txt", "H://symbol_table.txt")
    prog = helper.readProg()
    print prog
    comments = readComments(prog)
    _prog = cutComments(prog, comments)
    print _prog
    while _currentIndex < len(_prog):
        scan(helper)
    helper.closeFiles()



Category.py: this module defines C keywords, operators, and so on. It is the table of category codes: