Extracting Code Comments with ANTLR

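1. Background

Why write about extracting comments? The need came up at work. I am covering this fairly minor feature in the hope that it prompts better ideas from others.

Generally, when people parse source code with ANTLR, they do not care about comments, whitespace, and the like. These are filtered out by default and never reach the parse tree. Besides, actually keeping whitespace in the parse tree would cause many problems. If you do want to keep comments, they still do not go into the parse tree; instead they are routed to a different channel. A channel can be thought of as a separate pipe: the tokens the parser consumes travel through the default pipe, while comments and other such elements can be routed to custom pipes. This adds no extra burden to parsing and discards nothing.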
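To make the channel idea concrete, here is a minimal sketch in plain Java. It is not ANTLR's implementation: the class `ChannelSketch`, its `Tok` type, and the methods `lex` and `onChannel` are hypothetical names invented for this illustration, and the toy lexer only knows whitespace, `//` line comments, and "everything else". The channel numbers mirror the ones used later in the grammar.

```java
import java.util.ArrayList;
import java.util.List;

public class ChannelSketch {
    // Channel numbers chosen to match the grammar below (assumption for this sketch).
    public static final int DEFAULT = 0, WHITESPACE = 1, COMMENTS = 2;

    /** A token is just its text plus the channel it was routed to. */
    public static final class Tok {
        public final String text;
        public final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    /** Splits src into whitespace runs, // line comments, and other text; nothing is dropped. */
    public static List<Tok> lex(String src) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            int start = i;
            int channel;
            if (Character.isWhitespace(src.charAt(i))) {
                while (i < src.length() && Character.isWhitespace(src.charAt(i))) i++;
                channel = WHITESPACE;
            } else if (src.startsWith("//", i)) {
                while (i < src.length() && src.charAt(i) != '\n') i++;
                channel = COMMENTS;
            } else {
                while (i < src.length() && !Character.isWhitespace(src.charAt(i))
                        && !src.startsWith("//", i)) i++;
                channel = DEFAULT;
            }
            out.add(new Tok(src.substring(start, i), channel));
        }
        return out;
    }

    /** Returns the text of every token on the given channel, in source order. */
    public static List<String> onChannel(List<Tok> toks, int channel) {
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            if (t.channel == channel) out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = lex("int x = 1; // counter\n");
        System.out.println(onChannel(toks, DEFAULT));   // what a parser would see
        System.out.println(onChannel(toks, COMMENTS));  // what stays available on the side
    }
}
```

A parser reading only the default channel never sees the comment, yet the comment is still in the token list and can be retrieved later by its channel number.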

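2. Extracting comments

Enough preamble. How do we extract the comments from code? Section 12.1, "Broadcasting Tokens on Different Channels," of The Definitive ANTLR 4 Reference is devoted to exactly this.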

2.1 Grammar definition: routing to channels

First, define in the grammar file which channel each kind of token is routed to.

Here is the usual default, which simply throws these tokens away:

WS  : [ \t\n\r]+ -> skip ;

SL_COMMENT
    : '//' .*? '\n' -> skip
    ;

Redefined to route them to channels instead:

@lexer::members{
    public static final int WHITESPACE = 1;
    public static final int COMMENTS = 2;
}

WS  : [ \t\n\r]+ -> channel(WHITESPACE); //channel(1)

SL_COMMENT
    : '//' .*? '\n' -> channel(COMMENTS) //channel(2)
    ;

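The effect is as follows: ordinary tokens stay on the default channel 0, and every user-defined channel is a hidden channel that the parser does not read.
[Figure: tokens on the default channel 0 and on the hidden channels]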

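2.2 Extracting by rule (position)

Below is the example from Section 12.1. Why call it extraction by position? Because it pulls out comments according to one specific rule definition. The sample code takes the comment to the right of a variable declaration and moves it to the line above the declaration.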

[Figure: example input and output of the comment shift]

The implementation:

/***
 * Excerpted from "The Definitive ANTLR 4 Reference",
 * published by The Pragmatic Bookshelf.
 * Copyrights apply to this code. It may not be used to create training material, 
 * courses, books, articles, and the like. Contact us if you are in doubt.
 * We make no guarantees that this code is fit for any purpose. 
 * Visit http://www.pragmaticprogrammer.com/titles/tpantlr2 for more book information.
***/
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

public class ShiftVarComments {
    public static class CommentShifter extends CymbolBaseListener {
        BufferedTokenStream tokens;
        TokenStreamRewriter rewriter;
        /** Create TokenStreamRewriter attached to token stream
         *  sitting between the Cymbol lexer and parser.
         */
        public CommentShifter(BufferedTokenStream tokens) {
            this.tokens = tokens;
            rewriter = new TokenStreamRewriter(tokens);
        }

        @Override
        public void exitVarDecl(CymbolParser.VarDeclContext ctx) {
            Token semi = ctx.getStop(); 
            int i = semi.getTokenIndex();
            List<Token> cmtChannel =
                tokens.getHiddenTokensToRight(i, CymbolLexer.COMMENTS); 
            if ( cmtChannel!=null ) {
                Token cmt = cmtChannel.get(0); 
                if ( cmt!=null ) {
                    String txt = cmt.getText().substring(2);
                    String newCmt = "/* " + txt.trim() + " */\n";
                    rewriter.insertBefore(ctx.start, newCmt); 
                    rewriter.replace(cmt, "\n");              
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String inputFile = null;
        if ( args.length>0 ) inputFile = args[0];
        InputStream is = System.in;
        if ( inputFile!=null ) {
            is = new FileInputStream(inputFile);
        }
        ANTLRInputStream input = new ANTLRInputStream(is);
        CymbolLexer lexer = new CymbolLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        CymbolParser parser = new CymbolParser(tokens);
        RuleContext tree = parser.file();

        ParseTreeWalker walker = new ParseTreeWalker();
        CommentShifter shifter = new CommentShifter(tokens);
        walker.walk(shifter, tree);
        System.out.print(shifter.rewriter.getText());
    }
}

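As the code shows, CommentShifter follows the listener pattern (it extends CymbolBaseListener) and overrides exitVarDecl. While the parse tree is being walked, exitVarDecl is called automatically and rewrites the position of the comment. exitVarDecl corresponds to the variable declaration rule in the grammar file, so it is invoked once for every variable declaration.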
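The text manipulation inside exitVarDecl can be imitated on a plain string to show what goes in and what comes out. This is only a sketch: `CommentShiftSketch` and `shift` are hypothetical names, the code works on a single line rather than on tokens, and unlike the rewriter it also trims the whitespace left behind after the semicolon. It would also be fooled by `//` inside a string literal, which is one reason the token-based version is preferable.

```java
public class CommentShiftSketch {
    /**
     * Moves a trailing // comment on one line to a block comment on the line above.
     * Lines without "//" are returned unchanged.
     */
    public static String shift(String line) {
        int at = line.indexOf("//");
        if (at < 0) return line;

        // Same steps as the listener: drop the "//", trim, wrap in /* ... */
        String txt = line.substring(at + 2);
        String newCmt = "/* " + txt.trim() + " */\n";

        // Keep the code part, minus trailing whitespace (a simplification of this sketch)
        String code = line.substring(0, at).replaceAll("\\s+$", "");
        return newCmt + code + "\n";
    }

    public static void main(String[] args) {
        System.out.print(shift("int n = 0; // define a counter\n"));
    }
}
```

Running `main` prints the block comment `/* define a counter */` on the first line and `int n = 0;` on the second.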

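2.3 Extracting all comments by channel

The approach above has one problem: it only extracts the comments attached to the chosen rule. Functions have comments, classes have comments, parameters may have comments, and so on in many other places. Extracting all of them this way takes real effort and a pile of listener methods.

If you do not care which rule a comment belongs to and only want the comments themselves, you can iterate over the tokens and check the channel of each one. The grammar file routes comments to channel(2), so any token whose channel is 2 is a comment, which makes the job easy.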

    private static void printComments(String code) {
        CPP14Lexer lexer = new CPP14Lexer(new ANTLRInputStream(code));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        // CommonTokenStream is lazy: load every token up to EOF before calling getTokens()
        tokens.fill();

        for (Token t : tokens.getTokens()) {
            // Visit only default-channel tokens (EOF included). Visiting hidden tokens too
            // would report a comment more than once, e.g. for two adjacent comment lines
            // the first line would appear twice.
            if (t.getChannel() != Token.DEFAULT_CHANNEL) continue;

            // Hidden tokens between t and the previous default-channel token, restricted
            // to channel 2 (the comments channel configured in the grammar file).
            // getHiddenTokensToLeft suffices to reach all comments;
            // no need to call getHiddenTokensToRight.
            List<Token> comments = tokens.getHiddenTokensToLeft(t.getTokenIndex(), 2);
            if (comments != null) {
                for (Token c : comments) {
                    System.out.println("    " + c.getText());
                }
            }
        }
    }
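Why looking only to the left of each default-channel token reaches every comment exactly once can be checked with a small simulation in plain Java. This is a sketch under stated assumptions, not ANTLR code: `HiddenLeftSketch`, `hiddenToLeft`, and `allComments` are hypothetical names, and the token stream is reduced to an array of channel numbers whose last entry stands for the EOF token on the default channel.

```java
import java.util.ArrayList;
import java.util.List;

public class HiddenLeftSketch {
    public static final int DEFAULT = 0, COMMENTS = 2;

    /** Indices of tokens on `channel` between index i and the previous default-channel token. */
    public static List<Integer> hiddenToLeft(int[] channels, int i, int channel) {
        List<Integer> found = new ArrayList<>();
        int j = i - 1;
        while (j >= 0 && channels[j] != DEFAULT) j--;   // j = previous default token, or -1
        for (int k = j + 1; k < i; k++) {
            if (channels[k] == channel) found.add(k);
        }
        return found;
    }

    /** Visits default-channel tokens only and gathers the comments to the left of each. */
    public static List<Integer> allComments(int[] channels) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < channels.length; i++) {
            if (channels[i] != DEFAULT) continue;       // hidden tokens are never the anchor
            out.addAll(hiddenToLeft(channels, i, COMMENTS));
        }
        return out;
    }

    public static void main(String[] args) {
        // comment, whitespace, comment, whitespace, code, whitespace, code, comment, EOF
        int[] channels = {2, 1, 2, 1, 0, 1, 0, 2, 0};
        System.out.println(allComments(channels));      // indices of the comment tokens
    }
}
```

Every hidden token sits in exactly one gap between two consecutive default-channel tokens, and because the stream ends with EOF on the default channel, a trailing comment still has an anchor to its right. That is why no call in the other direction is needed.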