Building a Scripting Language (2): Designing an LR(1) Parser Generator

阿新 • Published: 2019-01-27

Abstract: design a parser generator that, given an augmented grammar G, automatically produces an LR(1) parser for it.

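Why build this generator? I once wrote a recursive-descent LL(1) parser by hand and found it mentally taxing: the grammar logic and the code that implements it have to be worked out together, which slows development. Admittedly, a hand-written parser is easier to optimize and runs faster. But once a generator exists that automatically produces a bottom-up LR parser, development becomes more convenient, and the grammar is easier to extend or modify later. This is presumably what people mean by adding a layer of abstraction to break down a complex problem.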

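A few words on how LALR(1) relates to LR(0), SLR(1), and LR(1). For a bottom-up parser, once the augmented grammar is fixed, the LR(0) item sets can be generated. On top of these item sets, use the Follow set of each nonterminal to decide where the action is reduce, use the transitions between item sets on terminals to generate the shift actions, and use the transitions on nonterminals to fill the Goto table. That is SLR(1). If each item additionally carries a lookahead symbol (a terminal that may follow the production in that particular context, which is a refinement of the Follow set), and items that differ only in their lookahead are treated as distinct items while the item sets are built, the result is LR(1). Merging the LR(1) item sets that share the same core gives LALR(1). Generally speaking, LALR(1) is the most commonly used; it is slightly less expressive than LR(1) but stronger than SLR(1). Expressive power here mainly means the ability to resolve shift-reduce and reduce-reduce conflicts correctly.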

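The generator is best kept in its own package, separate from the parser. Its input is a txt file that lists one production per line and ends with a summary of all symbols, nonterminals and terminals. Its output is also a txt file, which records the action(I, a) and goto(I, X) tables.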
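To make the input side concrete, here is a minimal sketch of a reader for such a file. The article does not fix the file syntax, so everything about the layout is my assumption: productions are written as `Head -> sym sym ...`, an empty right-hand side stands for the empty string, and the symbol summary uses lines starting with `nonterminals:` and `terminals:`. The class name `GrammarFileSketch` is hypothetical as well.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical reader for an ASSUMED grammar-file layout; the article does not specify the syntax.
class GrammarFileSketch {
    final List<String> heads = new ArrayList<>();         // left-hand side of each production
    final List<List<String>> bodies = new ArrayList<>();  // right-hand side of each production
    final Set<String> nonterminals = new LinkedHashSet<>();
    final Set<String> terminals = new LinkedHashSet<>();

    static GrammarFileSketch parse(List<String> lines) {
        GrammarFileSketch g = new GrammarFileSketch();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;                 // skip blank lines
            if (line.startsWith("nonterminals:")) {
                g.nonterminals.addAll(split(line.substring("nonterminals:".length())));
            } else if (line.startsWith("terminals:")) {
                g.terminals.addAll(split(line.substring("terminals:".length())));
            } else {
                int arrow = line.indexOf("->");
                if (arrow < 0) throw new IllegalArgumentException("not a production: " + line);
                g.heads.add(line.substring(0, arrow).trim());
                g.bodies.add(split(line.substring(arrow + 2)));
            }
        }
        return g;
    }

    // Splits on whitespace; an empty string yields an empty list (an empty production body).
    private static List<String> split(String s) {
        List<String> out = new ArrayList<>();
        for (String tok : s.trim().split("\\s+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }
}
```

In real use the lines would come from something like `Files.readAllLines(Paths.get("grammar.txt"))`, where the file name is just a placeholder. Writing the action and goto tables back out would follow the same line-oriented style.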

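Now to the generator's architecture. A brief digression on data structures and algorithms first. Data structures + algorithms = programs, but data structures are the essence, the foundation, because the data structure, that is, the abstraction of how data is stored, determines the shape of the algorithm. Take Lisp: its algorithms are full of car and cdr precisely because Lisp's core data structure is the linked list. Of course, Lisp can be compiled before it runs, and the compiler can optimize some library functions to use linearly addressed memory, which improves efficiency considerably.

Back to the architecture. First there is a Grammar class that holds the productions; the Grammar objects are obtained by reading the txt file. A Symbol class stores the nonterminals and terminals, and each Symbol object derives its First set and Follow set from the Grammar productions. Next we need the canonical collection of sets of LR(1) items. Define an Item class. Each item set consists of a kernel item and its closure (in the pseudocode below this set is the CC object). Item sets are produced by the Goto(Item, Symbol) function and the Closure(Item) function.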
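As a rough illustration of these data structures, the following sketch models a symbol, a production, and an LR(1) item as Java records (Java 16 or later). The names `Sym`, `Production`, `Item`, and the helper methods are mine, chosen for the example; the classes in the actual generator may be laid out differently.

```java
import java.util.List;

// Illustrative names only; the article's real Grammar, Symbol and Item classes may differ.
class ItemSketch {
    record Sym(String name, boolean terminal) {}
    record Production(Sym head, List<Sym> body) {}

    // An LR(1) item: a production, a dot position, and one lookahead terminal.
    record Item(Production prod, int dot, Sym lookahead) {
        boolean atEnd() { return dot == prod.body().size(); }

        // The symbol right after the dot, or null when the dot is at the end.
        Sym next() { return atEnd() ? null : prod.body().get(dot); }

        // The same item with the dot moved one symbol to the right.
        Item advance() { return new Item(prod, dot + 1, lookahead); }

        @Override public String toString() {
            StringBuilder sb = new StringBuilder("[" + prod.head().name() + " ->");
            for (int i = 0; i <= prod.body().size(); i++) {
                if (i == dot) sb.append(" .");
                if (i < prod.body().size()) sb.append(" ").append(prod.body().get(i).name());
            }
            return sb.append(", ").append(lookahead.name()).append("]").toString();
        }
    }
}
```

Records provide value-based `equals` and `hashCode`, which is what lets items be stored in hash sets and compared when checking whether an item set has been seen before.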

Pseudocode for the key functions getFirst( ), getClosure( ), getGoto( ), and getCCs( ) follows:

boolean getFirst()
		// e denotes the empty string (epsilon)
		add each token (terminal) to its own First set
		while any First set is still changing
			for each grammar
				Symbol head=grammar.head;
				Set<Symbol> first_set=new HashSet<Symbol>();
				for each production of the grammar
					for each symbol in production.symbols
						first_set.addAll(symbol.First minus e);
						if(!symbol.First.contains(e))
							break;
					// reached only when the loop above did not break
					if every symbol.First in the production has e
						first_set.add(e);
				add first_set to head.First
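The fixed-point iteration above can be written as runnable Java. This is a minimal sketch, not the generator's actual code: symbols are plain strings, each production is a list `[head, body...]`, a symbol counts as a nonterminal exactly when it appears as a head, and `EPS` is a marker I chose for the empty string.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FirstSetSketch {
    static final String EPS = "<eps>"; // assumed marker for the empty string

    // productions: each entry is [head, body...]; an entry with only a head is an empty production.
    static Map<String, Set<String>> firstSets(List<List<String>> productions) {
        Set<String> heads = new HashSet<>();
        for (List<String> p : productions) heads.add(p.get(0));

        Map<String, Set<String>> first = new HashMap<>();
        for (List<String> p : productions) {
            for (String s : p) {
                first.computeIfAbsent(s, k -> new HashSet<>());
                if (!heads.contains(s)) first.get(s).add(s); // a terminal's First set is itself
            }
        }

        boolean changed = true;
        while (changed) {                                    // repeat until no First set grows
            changed = false;
            for (List<String> p : productions) {
                Set<String> target = first.get(p.get(0));
                boolean allNullable = true;
                for (String s : p.subList(1, p.size())) {
                    // copy first, because target and first.get(s) may be the same set
                    for (String t : List.copyOf(first.get(s))) {
                        if (!t.equals(EPS) && target.add(t)) changed = true;
                    }
                    if (!first.get(s).contains(EPS)) { allNullable = false; break; }
                }
                // every body symbol can vanish (or the body is empty): the head can too
                if (allNullable && target.add(EPS)) changed = true;
            }
        }
        return first;
    }
}
```

For the usual expression grammar `E -> T E'`, `E' -> + T E' | (empty)`, `T -> F T'`, `T' -> * F T' | (empty)`, `F -> ( E ) | id`, this yields `(` and `id` for E, T, and F, and the operator plus `EPS` for E' and T'.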

boolean getClosure(CC cc)
		Item k_item=cc.kernel_item;
		if k_item.position is at end
			cc.is_reduce=true;
			return false;
		ArrayList<Item> cl_items=new ArrayList<Item>();
		add k_item to cl_items;
		while cl_items still changing
			get the next unprocessed item from cl_items;
			get symbol_head, the symbol right after the dot in that item;
			if the dot is at the end or symbol_head is a terminal
				continue;
			Set<Symbol> follow_set=new HashSet<Symbol>();
			match the item's form A->B.CD,f    // symbol_head is C
				get follow_set as the First set of the sequence {D,f}
			for each production of symbol_head, and each follow_sym in follow_set
				build new item [symbol_head->.production, follow_sym] and add it to cl_items if absent
		cc.items.addAll(cl_items);
		return true;
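Here is the closure step as a small runnable sketch. To keep it focused, it takes the First sets as an input map rather than computing them, uses strings for symbols and `[head, body...]` lists for productions, and identifies an item by a production index, a dot position, and a lookahead. These representation choices, and the names `ClosureSketch`, `firstOfSequence`, and `EPS`, are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ClosureSketch {
    static final String EPS = "<eps>"; // assumed marker for the empty string

    // prod indexes into the production list; each production is [head, body...].
    record Item(int prod, int dot, String lookahead) {}

    // First set of a symbol sequence followed by one lookahead terminal.
    static Set<String> firstOfSequence(List<String> seq, String lookahead,
                                       Map<String, Set<String>> first) {
        Set<String> out = new LinkedHashSet<>();
        for (String s : seq) {
            for (String t : first.get(s)) if (!t.equals(EPS)) out.add(t);
            if (!first.get(s).contains(EPS)) return out;  // s cannot vanish: stop here
        }
        out.add(lookahead);  // the whole sequence can vanish, so the lookahead shows through
        return out;
    }

    static Set<Item> closure(Set<Item> kernel, List<List<String>> productions,
                             Map<String, Set<String>> first) {
        Set<Item> items = new LinkedHashSet<>(kernel);
        Deque<Item> work = new ArrayDeque<>(kernel);
        while (!work.isEmpty()) {
            Item it = work.pop();
            List<String> p = productions.get(it.prod());
            int next = it.dot() + 1;             // index 0 holds the head, so the body starts at 1
            if (next >= p.size()) continue;      // dot at the end: nothing to expand
            String b = p.get(next);              // the symbol right after the dot
            Set<String> lookaheads =
                firstOfSequence(p.subList(next + 1, p.size()), it.lookahead(), first);
            for (int i = 0; i < productions.size(); i++) {
                if (!productions.get(i).get(0).equals(b)) continue; // terminals match nothing
                for (String la : lookaheads) {
                    Item fresh = new Item(i, 0, la);
                    if (items.add(fresh)) work.push(fresh);         // only queue unseen items
                }
            }
        }
        return items;
    }
}
```

For the grammar `S' -> S`, `S -> C C`, `C -> c C | d`, the closure of `[S' -> . S, $]` has six items: the kernel, `[S -> . C C, $]`, and the two C productions with lookaheads `c` and `d` each.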

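CC getGoto(Item item)
		Item kr_item=new Item(item);    // work on a copy so the source item stays unchanged
		if(kr_item.position<kr_item.symbols.size())
			kr_item.position++;    // move the dot past the next symbol
			ArrayList<Kernel> kernels=gen_kernels.get(kr_item.head);
			if kr_item in kernels
				return the kernel's cc_in;    // reuse the CC already built for this kernel
			CC cc=new CC();
			Kernel kernel=new Kernel();
			kernel.item_in=kr_item;
			add kernel to gen_kernels
			cc.kernel_item=kr_item;
			cc.index_gr=kr_item.index_gr_tb;
			kernel.cc_in=cc;
			return cc;
		return null;    // dot already at the end: no transition, callers must skip this item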

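boolean getCCs()
		build cc0;    // kernel item: the augmented start production with the dot at position 0
		getClosure(cc0);
		gen_CCs.add(cc0);
		cc0.in_table=true;
		cc0.index_cc=0;
		while gen_CCs has an unprocessed cc
			get the next unprocessed cc from gen_CCs
			if cc.kernel_item.position==cc.kernel_item.symbols.size()
				cc.is_reduce=true;
				continue;
			for each item in cc.items
				CC new_cc=getGoto(item);
				if new_cc is null    // dot at the end of item: nothing to shift over
					continue;
				put (symbol after the dot in item, new_cc) in cc.goto_tb;
				if(!new_cc.in_table)
					getClosure(new_cc);
					gen_CCs.add(new_cc);
					new_cc.in_table=true;
					new_cc.index_cc=gen_CCs.size()-1;
		return true;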
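For comparison, here is a compact runnable sketch of the whole worklist construction. Note that it follows the textbook formulation, in which a state is identified by its entire item set and a kernel may hold several items; that is not necessarily how the pseudocode above keys its CC objects. To stay short it assumes the grammar has no empty productions, that production 0 is the augmented start production, and that `$` is the end marker. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// ASSUMPTION: no empty productions, so the First set of a sequence is that of its first symbol.
class CanonicalCollectionSketch {
    record Item(int prod, int dot, String lookahead) {}

    final List<List<String>> prods;                       // each production is [head, body...]
    final Map<String, Set<String>> first = new HashMap<>();
    final List<Set<Item>> states = new ArrayList<>();     // the canonical collection
    final List<Map<String, Integer>> gotoTable = new ArrayList<>();

    CanonicalCollectionSketch(List<List<String>> prods) {
        this.prods = prods;
        computeFirst();
        build();
    }

    private void computeFirst() {
        Set<String> heads = new HashSet<>();
        for (List<String> p : prods) heads.add(p.get(0));
        for (List<String> p : prods)
            for (String s : p) {
                first.computeIfAbsent(s, k -> new HashSet<>());
                if (!heads.contains(s)) first.get(s).add(s);   // terminal: First is itself
            }
        for (boolean changed = true; changed; ) {
            changed = false;
            for (List<String> p : prods)                       // only the first body symbol matters
                if (first.get(p.get(0)).addAll(List.copyOf(first.get(p.get(1))))) changed = true;
        }
    }

    Set<Item> closure(Set<Item> kernel) {
        Set<Item> items = new LinkedHashSet<>(kernel);
        List<Item> work = new ArrayList<>(kernel);
        for (int w = 0; w < work.size(); w++) {                // work grows while we scan it
            Item it = work.get(w);
            List<String> p = prods.get(it.prod());
            if (it.dot() + 1 >= p.size()) continue;            // dot at the end
            String b = p.get(it.dot() + 1);                    // symbol right after the dot
            Set<String> lookaheads = it.dot() + 2 < p.size()
                    ? first.get(p.get(it.dot() + 2))           // First of the symbol after b
                    : Set.of(it.lookahead());                  // b is last: inherit the lookahead
            for (int i = 0; i < prods.size(); i++)
                if (prods.get(i).get(0).equals(b))
                    for (String la : lookaheads) {
                        Item fresh = new Item(i, 0, la);
                        if (items.add(fresh)) work.add(fresh);
                    }
        }
        return items;
    }

    // Moves the dot over x in every item that allows it, then closes the resulting kernel.
    Set<Item> goTo(Set<Item> state, String x) {
        Set<Item> kernel = new LinkedHashSet<>();
        for (Item it : state) {
            List<String> p = prods.get(it.prod());
            if (it.dot() + 1 < p.size() && p.get(it.dot() + 1).equals(x))
                kernel.add(new Item(it.prod(), it.dot() + 1, it.lookahead()));
        }
        return closure(kernel);
    }

    private void build() {
        Map<Set<Item>, Integer> index = new HashMap<>();       // item set -> state number
        Set<Item> start = closure(Set.of(new Item(0, 0, "$")));
        index.put(start, 0);
        states.add(start);
        gotoTable.add(new LinkedHashMap<>());
        for (int s = 0; s < states.size(); s++) {              // states grows while we scan it
            Set<String> symbols = new LinkedHashSet<>();
            for (Item it : states.get(s)) {
                List<String> p = prods.get(it.prod());
                if (it.dot() + 1 < p.size()) symbols.add(p.get(it.dot() + 1));
            }
            for (String x : symbols) {
                Set<Item> target = goTo(states.get(s), x);
                Integer t = index.get(target);
                if (t == null) {                               // a state we have not seen yet
                    t = states.size();
                    index.put(target, t);
                    states.add(target);
                    gotoTable.add(new LinkedHashMap<>());
                }
                gotoTable.get(s).put(x, t);
            }
        }
    }
}
```

For the grammar `S' -> S`, `S -> C C`, `C -> c C | d`, this produces ten states. Filling the action table from there is a matter of turning the transitions on terminals into shifts and the items with the dot at the end into reduces on their lookahead.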
