Saturday, September 1, 2012

Tackling Comments in ANTLR Compiler

Most compilers treat comments as just another meaningless whitespaces. They identify them in the source code and then throw them away. On the other hand, things are a bit harder if you need to know comments content and their position.

Such requirement is not too common and official ANTLR documentation does not says much about how to do it. The compiler we have been working on needed to preserve comments, so we had to come up with our own solution. This post documents it.

First chapter introduces our compiler. It describes what it does, where it is located and how it is written. Second chapter explains our comment preserving requirements. The rest of the post describes datastructures and algorithms used in our solution.

Table of Contents

Less4j Project

Less4j is a port of Less.js into java. Less.js compiles CSS-like language named less into standard CSS. It is still in prototype phase and compiles subset of CSS back into CSS. Less4j source code is located on Github.

First sub-chapter introduces less language and is written to be as short as possible. Second one contains overall architecture of our project.

Less
Less is basically CSS with variables, mixins, functions and few other features. Less.js compiles it into regular CSS. The original compiler was written in JavaScript and can be integrated into java project using Rhino.

Official documentation can be found on lesscss.org and introduction tutorial in previous post on this blog.

Architecture
Less4j compiler is based on antlr parser generator. This post assumes that you know antlr basics: what is lexer, parser, token and what is a rule. In any case, antlr details are available in our two previous tutorial posts.

Current version of our project runs in three phases:
  • build abstract syntax tree corresponding to input .less file,
  • translate .less abstract syntax tree into corresponding .css abstract syntax tree,
  • create final .css output.

The rest of this post focus on the abstract syntax tree building e.g, step 1. Second step is not done yet and the third one is more or less trivial.

Tree Building (Step 1)
Abstract syntax tree (AST) is built in two sub-phases:
  • antlr parses input into and creates "first AST",
  • the "first tree" is transformed into final abstract syntax tree.

First syntax tree is composed of default antlr tree nodes. They are all instances of CommonTree class and each knows its own type, position in the source code and can return a list of its children.

Nodes in the final tree belong to ASTCssNode hierarchy. The hierarchy has a separate class for each node type (ruleset, declaration, font-face, ...). Each class has only properties relevant for represented css element. For example, ruleset can return its own selectors and declarations.

It would be possible to merge these two phases into one. We decided not to do so, because debugging is much easier if things are done in separate phases.

The Compiler (Step 2)
The compiler was not yet implemented.

Printing (Step 3)
Printing is essentially breath first traversal through the final abstract syntax tree. The algorithm knows how to print each type of the node and recursively calls itself to print all its childs.

Requirements

Original less.js preserve comments and their positions. Whitespaces around comments are not preserved, the translation may remove them or add new ones into the output.

We decided not to mimic the exact less.js behavior. Instead, we tried to identify most important properties of their algorithm and use them as requirements.

Keep All Comments
The first requirement is the most important:
  • No comment may be lost. Every input comment must appear in the output.

Keep Comments Positions
The second most important thing is to keep comments where they have been found. Comments usually describe the element located before them, after them or on the same line. Therefore, comments must not change their location too much during the translation.

To make things less vague, we defined "keeping the location" as:
  • Comment located between two elements must stay between those two elements.
  • If the input contains new line between comment and its following element, then its corresponding output comment must be followed by a new line.
  • If the input line ends with a comment, the translation must keep that comment on the same line as its previous input element.

All three situations are shown in the next sections input-output examples.

Examples
Following two inputs must be unchanged by the translation. In both cases, input and output string must be exactly the same.

Example 1, ruleset with multiple declarations:
/*
 Preceding comment with new line.
*/
/* before */ .comments /* after */ {
  /* Leave me here. */
  /* 2 */ color: #808080 /* grey */ , /* blue */ #ffa500;
  margin: 2px;/* something about declaration*/
  padding: 2px /*before comma*/;/* after comma */
  /* Leave me here. */
} /* Almost last comment. */
/* Last comment. */

Example 2, an empty ruleset:
/* 1*/ .comments /* 2*/ /* 3 */ .something {
  /* Orphan comment. */
}

Datastructures

We decided to attach comments and new lines to their surrounding elements. This requires only small changes in our abstract syntax trees and in the printing algorithm.

Three building and the algorithm for comments attaching are described in following chapters.

First Abstract Syntax Tree
Each comment or new line is attached to exactly one tree node. Each node owns comments and new lines attached to it and knows their positions. It knows whether they should go before the node, after the node or inside it.

Each node in the first abstract syntax tree keeps three lists:
  • list of owned preceding comments and new lines,
  • list of owned following comments and new lines,
  • list of owned comments and new lines that are located inside it.

Each comment and new line is assigned to exactly one of these lists:
private List<Token> preceding = new LinkedList<Token>();
private List<Token> orphans = new LinkedList<Token>();
private List<Token> following = new LinkedList<Token>();

Final Abstract Syntax Tree
The final abstract syntax tree keeps comments in three similar lists. The only difference it that each final comment knows whether it should be followed by a new line or not. It does not know how many new lines have originally been there and new lines themselves are thrown away.

We have two types of comments, those that are followed by new line and those that are not followed by new line:
public class Comment extends ASTCssNode {

  private String comment;
  private boolean hasNewLine;
  
  //...

  public String getComment() {
    return comment;
  }

  public boolean hasNewLine() {
    return hasNewLine;
  }
}

Each node in the final abstract syntax tree keeps three lists:
  • list of owned preceding comments,
  • list of owned following comments,
  • list of owned comments that are located inside it.

The class:
class ASTCssNode {
  private List<Comment> openingComments = new ArrayList<Comment>();
  private List<Comment> orphanComments = new ArrayList<Comment>();
  private List<Comment> trailingComments = new ArrayList<Comment>();

  // ...
}

Printing
The printing algorithm requires three changes:
  • It must print all preceding comments before each encountered node.
  • It must print all following comments after each encountered node.
  • If it is possible for a node to have orphan comments, then the algorithm must print them too.

Matching Comments - Lexer

Both this and the next chapters assume that you already know what is lexer, parser, token, rule and how antlr works. You do not need to know details, everything needed in covered in the antlr hello world tutorial.

First thing we need to do is to modify the lexer to generate comment tokens. This is covered in the first sub-chapter. Second sub-chapter explains why we can not treat comment tokens the same way as other tokens. Finally, the third sub-chapter shows how to hide them.

Regular Expression
CSS comments look a lot like C or Java comments. They start with /* symbol, end with */ symbol and can not be nested. Therefore, we have to match anything in between /* and its closest */:
COMMENT: '/*' ( options { greedy=false; } : .*) '*/';

The options { greedy=false; } part is necessary, because antlr lexer is greedy. If we would use a simple regular expression '/*' .* '*/', the lexer would create one big token from everything between the first /* and the last */ in the input.

Consequences for Parser
Comments and whitespaces have one thing in common: they can appear in any number between any other two tokens. If we would use the above rule as is, then we would have to add comments tokens into every possible place in each parser rule.

For example, following declaration rule:
declaration: property COLON expr prio?;
would have to be changed into something like this:
declaration: property COMMENT* COLON COMMENT* expr COMMENT* prio?;

This way of comments handling is very error prone, labor extensive and we doubt anybody ever handled them this way. We do not want to add COMMENT* into every possible place in our grammar, but we need them in final abstract syntax tree.

Hiding Tokens
Each generated token has property named "channel". Only tokens with channel value 0 reach the parser, all other channels are hidden. It is important to remember that lexer generates token stream with all tokens including the hidden ones. Hidden tokens are filtered out between the parser and the lexer.

If you want to hide a token, you can use either a non-zero channel number or the HIDDEN keyword. The keyword corresponds to channel number 99.

Use the action part of the lexer rule to send all comments on hidden channel:
COMMENT: '/*' ( options { greedy=false; } : .*) '*/'
       { 
         $channel = HIDDEN;
       };

Collecting Hidden Comments

We have successfully added comments into the grammar without having to rewrite every single rule in it. Of course, as comments disappeared from the parser, we have to collect them elsewhere and then attach them to the abstract syntax tree.

Common Token Stream
Antlr generated lexer and parser are not compatible with each other. While the lexer implements the TokenSource interface, the parser takes an instance of TokenStream as its input. This class level incompatibly makes sense because:
  • lexer generates hidden tokens that should be filtered out,
  • parser occasionally needs to look ahead into upcoming tokens, but lexer does not have the ability to show them. Lexer returns one token at a time and is not able to show following or previous tokens.

Typical way to connect lexer with parser is to use CommonTokenStream. Common token stream adds look ahead abilities to the token source and filters out unwanted tokens:
LessLexer lexer = new LessLexer(inputStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LessParser parser = new LessParser(tokens);

Collecting Hidden Tokens
The easiest way to collect hidden tokens is to place collector between lexer and common token stream. The collector implements the same interface as the lexer (e.g. TokenSource), wraps over the lexer and acts as its replacement.

Common token source is given a collector instead of the lexer as its input:
LessLexer lexer = new LessLexer(inputStream);
CollectorTokenSource tokenSource = new CollectorTokenSource(lexer, ...);
CommonTokenStream tokens = new CommonTokenStream(tokenSource);
LessParser parser = new LessParser(tokens);

Common token stream uses Token nextToken() method of the TokenSource interface to read tokens. Our collector delegates the call to the lexer and returns the same token as the lexer would. However, if it is either a comment or a new line, the collector will store the token in a list of collected tokens:
class CollectorTokenSource implements TokenSource {
  
  ... 

  @Override
  public Token nextToken() {
    Token nextToken = source.nextToken();
    if (shouldCollect(nextToken)) {
      collectedTokens.add(nextToken);
    }
    
    return nextToken;
  }

  protected boolean shouldCollect(Token nextToken) {
    // filter the token by its type
    return collectTokenTypes.contains(nextToken.getType());
  }

}

Full implementation of the CollectorTokenSource class:
Click to expand the table of contents:

class CollectorTokenSource implements TokenSource {

  private final TokenSource source;
  private final Set<Integer> collectTokenTypes = 
      new HashSet<Integer>();
  private final LinkedList<Token> collectedTokens = 
      new LinkedList<Token>();

  public CollectorTokenSource(TokenSource source, 
      Collection<Integer> collectTokenTypes) {
    super();
    this.source = source;
    this.collectTokenTypes.addAll(collectTokenTypes);
  }

  /**
   * Returns next token from the wrapped token source. Stores it 
   * in a list if necessary.
   */
  @Override
  public Token nextToken() {
    Token nextToken = source.nextToken();
    if (shouldCollect(nextToken)) {
      collectedTokens.add(nextToken);
    }

    return nextToken;
  }

  /**
   * Decide whether collect the token or not.
   */
  protected boolean shouldCollect(Token nextToken) {
    // filter the token by its type
    return collectTokenTypes.contains(nextToken.getType());
  }

  public LinkedList<Token> getCollectedTokens() {
    return collectedTokens;
  }

  @Override
  public String getSourceName() {
    return "Collect hidden channel " + source.getSourceName();
  }

}

Merging Tokens With AST

Collected tokens need to be merged with abstract syntax tree generated by antlr. To do so, we have to customize the tree generated by antlr parser and then attach hidden tokens to it.

Merge is possible because:
  • Each antlr token knows his own index in the token stream.
  • Each antlr node knows indexes of both the first and last tokens that belongs to it.

First sub-chapter shows how to customize the tree and the rest of the chapter explains how to merge it with collected tokens.

Customizing Parser
Tree nodes generated by antlr parser are unable to hold any additional information. As we want to attach hidden tokens directly to the tree, we have to create new type of tree node and configure generated parser to use it.

Custom Tree Node
Abstract syntax tree generated by default is composed of instances of CommonTree class.

We extended CommonTree and added three properties into new class:
public class HiddenTokenAwareTree extends CommonTree {
  
  private List<Token> preceding = new LinkedList<Token>();
  private List<Token> orphans = new LinkedList<Token>();
  private List<Token> following = new LinkedList<Token>();

  // ... constructors, getters and setters follow

}

Configure Parser
Generated parser delegates all tree modifying operations to TreeAdaptor interface. Tree adaptor creates new nodes, add childs to them or deletes them when needed. Default tree adapter is called CommonTreeAdaptor.

As our tree nodes extend default CommonTree class, our custom adaptor needs to customize only the create method. Everything else may work the same way as in the original CommonTreeAdaptor:
class LessTreeAdaptor extends CommonTreeAdaptor {

  @Override
  public Object create(Token payload) {
    return new HiddenTokenAwareTree(payload);
  }

}

Tree adaptor used by parser is configured via setTreeAdaptor method:
LessParser parser = new LessParser(tokens);
parser.setTreeAdaptor(new LessTreeAdaptor());

Assignment
Our algorithm is supposed to assign each collected token to either preceding, following or orphan list of some tree node.

A tree node may own only those hidden tokens that are directly before it, directly after it or directly inside it. "Directly" means that no other tree node starts or ends between the hidden token and its owning node. There must be no other closer candidate.

Preceding List
The list of preceding tokens contains hidden tokens located directly before the owning node.

For example, this ruleset:
/* 0 */.comments /* 1 */ /* 2 */ .something /* 3 */ {
  declaration: value
}

corresponds to following simplified abstract syntax tree (the parenthesis contains start and stop token indexes of the subtree):
RULESET (2-22)
- SELECTOR (2-10)
--- CSS_CLASS comments (2-3)
--- EMPTY_COMBINATOR (9-9)
--- CSS_CLASS something (9-11)
- BODY_OF_DECLARATIONS (14-22)
--- DECLARATION (17-20)
----- PROPERTY declaration (17-17)
----- TERM value (20-20)

there are four hidden comments and one hidden new line (the parenthesis contains token index):
(0) COMMENT /* 0 */
(5) COMMENT /* 1 */
(7) COMMENT /* 2 */
(12) COMMENT /* 3 */
(15) NEW_LINE 

The comment 0 goes to the RULESET, because there is nothing that would start closer to it. Although SELECTOR node starts with the same index as the ruleset, it is rulesets child and thus considered to be behind it.

Comments 1 and 2 go to the EMPTY_COMBINATOR and the comment 3 and new line belong to the BODY_OF_DECLARATIONS.

Following List
The "following" list is going to be used only if there is no possible preceding owner. It may contain only hidden tokens located directly after the owning node.

For example, both comments in the following example have no element directly behind them:
.comments .something {
  declaration: something;
  /* Trailing comment. */
} 

.other .selector {
  declaration: value;
}
/* Last comment. */

The input corresponds to following simplified tree (the parenthesis contains start and stop token indexes of the subtree):
RULESET (0-18)
- SELECTOR .comments .something (0-4)
- BODY_OF_DECLARATIONS (6-18)
--- DECLARATION declaration: something; (9-13) 
RULESET (22-37)
- SELECTOR .other .selector (22-26)
- BODY_OF_DECLARATIONS (28-37)
--- DECLARATION declaration: value; (31-35)

hidden channel contains two comments (we omitted new lines because there are too many of them):
(16) COMMENT /* Trailing comment. */
(39) COMMENT /* Last comment. */

The Trailing comment. has index 16. There is nothing between this comment and the end of the first ruleset (position 18). No further element can own the comment, because there must be no node start nor node end between the hidden token and its owning node.

Therefore we have to look for an element that ends directly before it. There is nothing between the end of declaration: something; and the comment on position 16. The comment should be assigned to that declaration as the following hidden token.

The second comment Last comment. is the last comment in the input file. It will be assigned to the last element in the file e.g., second RULESET.

Orphan List
The list of orphan comments should contain only those comments that have neither preceding nor following elements:
.comments .something {
  /* Orphan comment. */
} 

The only comment in the above example belongs to the ruleset body.

Algorithm
The solution is implemented in class named ListToTreeCombiner. The list of hidden tokens is stored in its private property, so we do not have to constantly pass it around.

Root of the abstract syntax tree is special case and we will ignore it for now. The assignment of hidden tokens to any other tree node is done inside the associateAsChild method. It takes tree node as a parameter and assigns all hidden tokens up to its end. The method:
  • reads all hidden tokens up to the end of the input sub-tree,
  • assigns them to the input node or one of its childs,
  • removes all used tokens from the list of hidden tokens.

The method works correctly only if the list of hidden tokens starts with tokens that should be assigned to the sub-tree in parameter.

This part explains how the associateAsChild method works. First section shows both commented code of this method and full listing of ListToTreeCombiner class. Second section contains the input we will use as the example. Remaining sections apply the method to the example input and step through it to show how it works.

The Code
Next code example contains the associateAsChild method. The full class including all helper methods is shown below:
private LinkedList<Token> hiddenTokens;

private void associateAsChild(HiddenTokenAwareTree ast) {
  //fill preceding list of ast
  addAllPrecedingTokens(ast);

  LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
  //if the node has no children, fill orphans list
  if (children.isEmpty()) {
    addAllContainedTokens(ast);
    return;
  }

  HiddenTokenAwareTree previousChild = null;
  for (HiddenTokenAwareTree child : children) {
    //everything up to first new line or next child belongs to 
    //previous element
    assignFirstCommentsSegment(previousChild, child);
    //recursion assigns tokens to the child
    associateAsChild(child);
    previousChild = child;
  }
  //remaining tokens located inside this node belong to the last child
  addFollowingTokens(previousChild, ast.getTokenStopIndex());
}

The whole class:
Click to expand the table of contents:

public class ListToTreeCombiner {

  private LinkedList<Token> hiddenTokens;

  public void associate(HiddenTokenAwareTree ast, 
      LinkedList<Token> hiddenTokens) {
    initialize(hiddenTokens);

    LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
    for (HiddenTokenAwareTree child : children) {
      associateAsChild(child);
    }

    if (children.isEmpty()) {
      addAllContainedTokens(ast);
    } else {
      HiddenTokenAwareTree lastChild = children.getLast();
      lastChild.addFollowing(hiddenTokens);
    }
  }

  private void initialize(LinkedList<Token> hiddenTokens) {
    this.hiddenTokens = hiddenTokens;
  }

  private void associateAsChild(HiddenTokenAwareTree ast) {
    addAllPrecedingTokens(ast);

    LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
    if (children.isEmpty()) {
      addAllContainedTokens(ast);
      return;
    }

    HiddenTokenAwareTree previousChild = null;
    for (HiddenTokenAwareTree child : children) {
      assignFirstCommentsSegment(previousChild, child);
      associateAsChild(child);
      previousChild = child;
    }

    addFollowingTokens(previousChild, ast.getTokenStopIndex());
  }

  private void assignFirstCommentsSegment(
      HiddenTokenAwareTree firstChild, 
      HiddenTokenAwareTree secondChild) {

    LinkedList<Token> tail = 
      readTillNewLine(secondChild.getTokenStartIndex());

    if (!tail.isEmpty() && firstChild != null) {
      if (tail.peekLast().getType() == LessLexer.NEW_LINE)
        firstChild.addFollowing(tail);
      else
        secondChild.addPreceding(tail);
    }
  }

  private void addAllContainedTokens(HiddenTokenAwareTree ast) {
    int stop = ast.getTokenStopIndex();
    List<Token> result = readPrefix(stop);
    ast.addOrphans(result);
  }

  private void addFollowingTokens(HiddenTokenAwareTree target, 
      int stop) {
    List<Token> result = readPrefix(stop);
    target.addFollowing(result);
  }

  private LinkedList<HiddenTokenAwareTree> getChildren(
      HiddenTokenAwareTree ast) {

    List<HiddenTokenAwareTree> children = ast.getChildren();
    if (children == null)
      return new LinkedList<HiddenTokenAwareTree>();

    LinkedList<HiddenTokenAwareTree> copy = 
      new LinkedList<HiddenTokenAwareTree>(children);
    Collections.sort(copy, new PositionComparator());
    return copy;
  }

  private void addAllPrecedingTokens(HiddenTokenAwareTree target) {
    int start = target.getTokenStartIndex();
    List<Token> tokens = readPrefix(start);
    target.addPreceding(tokens);
  }

  private LinkedList<Token> readTillNewLine(int end) {
    LinkedList<Token> result = new LinkedList<Token>();
    if (hiddenTokens.isEmpty())
      return result;

    Token first = hiddenTokens.peekFirst();
    while (first != null && first.getTokenIndex() < end && 
        first.getType() == LessLexer.COMMENT) {
      result.add(first);
      hiddenTokens.removeFirst();
      first = hiddenTokens.peekFirst();
    }

    if (first == null || first.getTokenIndex() >= end)
      return result;

    result.add(first);
    return result;
  }

  private List<Token> readPrefix(int end) {
    List<Token> result = new ArrayList<Token>();
    if (hiddenTokens.isEmpty())
      return result;

    Token first = hiddenTokens.peekFirst();
    while (first != null && first.getTokenIndex() < end) {
      result.add(first);
      hiddenTokens.removeFirst();
      first = hiddenTokens.peekFirst();
    }
    return result;
  }

  class PositionComparator implements Comparator<CommonTree> {

    @Override
    public int compare(CommonTree arg0, CommonTree arg1) {
      return arg0.getTokenStartIndex() - arg1.getTokenStartIndex();
    }

  }

}

Example
We will use the first .less sample from requirements chapter to show how the method works.

The example input:
/*
 Preceding comment with new line.
*/
/* before */ .comments /* after */ {
  /* Leave me here. */
  /* 2 */ color: #808080 /* grey */ , /* blue */ #ffa500;
  margin: 2px;/* something about declaration*/
  padding: 2px /*before comma*/;/* after comma */
  /* Leave me here. */
} /* Almost last comment. */
/* Last comment. */

Expand to see all collected hidden tokens:
Click to expand the table of contents:
The number in parenthesis is token index. It is followed by token type and a text corresponding to the token.

(0) COMMENT /* Preceding comment with new line.*/
(1) NEW_LINE 
(2) COMMENT /* before */
(7) COMMENT /* after */
(10) NEW_LINE 
(12) COMMENT /* 1 */
(13) NEW_LINE 
(15) COMMENT /* 2 */
(22) COMMENT /* grey */
(26) COMMENT /* blue */
(30) NEW_LINE 
(36) COMMENT /*before comma*/
(38) COMMENT /* something about declaration*/
(39) NEW_LINE 
(45) COMMENT /*before comma*/
(47) COMMENT /* after comma */
(48) NEW_LINE 
(50) COMMENT /* Leave me here. */
(51) NEW_LINE 
(54) COMMENT /* Almost last comment. */
(55) NEW_LINE 
(56) COMMENT /* Last comment. */

Simplified abstract syntax tree:
STYLE_SHEET 4 58
-- RULESET 4 52
---- SELECTOR 4 5
-------- ...
---- BODY_OF_DECLARATIONS 9 52
------ DECLARATION 17 29
-------- ...
------ DECLARATION 32 37
-------- ...
------ DECLARATION 41 46
-------- ...

How It Works
Assume that we have a sub-tree corresponding to the "body of declarations" in the above example and a list of hidden tokens. All hidden tokes that belong to selector or other previous element have already been assigned and removed from the list.

As the body of declarations starts with token number 9, any token with index lower than 9 can be assigned as preceding token to the body of declarations. We created method addAllPrecedingTokens(ast) to do that. It takes tree node as a parameter and puts all hidden tokens with low indexes into its preceding list.

Associate preceding tokens first:
private void associateAsChild(HiddenTokenAwareTree ast) {
  addAllPrecedingTokens(ast);
  // ...
}

Expand to see the body of addAllPrecedingTokens helper method:
Click to expand the table of contents:

private void addAllPrecedingTokens(HiddenTokenAwareTree target) {
  int start = target.getTokenStartIndex();
  List<Token> tokens = readPrefix(start);
  target.addPreceding(tokens);
}

Tokens between 9 and 52 belong to individual declarations inside the body of declarations. As the first child ends up at the position 29, everything up to that position belongs to it. As we already chopped off everything up token 9, we can recursively call our method to do that work.
associateAsChild(child);

The recursive call assigns everything up to the position 29 to the first child. However, that child may own also tokens with positions higher than 29, but only up to closest new line and only up to the next element start or end.

Otherwise said, if there is no new line between two childs, all hidden tokens up to the beginning of the second child belong to that child. If there is new line between them, everything up to that new line belongs to the first child.

This logic is implemented inside the assignFirstCommentsSegment method. It takes two childs as a parameter and decides whether first child should have any following tokens assigned.

Assigning following tokens is the last thing we had to do with the first child. Once it is done, we can move to the next childs:
LinkedList<HiddenTokenAwareTree> children = getChildren(ast);

HiddenTokenAwareTree previousChild = null;
for (HiddenTokenAwareTree child : children) {
  assignFirstCommentsSegment(previousChild, child);
  associateAsChild(child);
  previousChild = child;
}

Expand to see the assignFirstCommentsSegment methods body:
Click to expand the table of contents:

private void assignFirstCommentsSegment(
    HiddenTokenAwareTree firstChild, 
    HiddenTokenAwareTree secondChild) {

  LinkedList<Token> tail =
    readTillNewLine(secondChild.getTokenStartIndex());

  if (!tail.isEmpty() && firstChild != null) {
    if (tail.peekLast().getType() == LessLexer.NEW_LINE)
      firstChild.addFollowing(tail);
    else
      secondChild.addPreceding(tail);
  }
}

The only special case is the last child which will get also all remaining hidden tokens up body of declaration end at the index 52.

We use the addFollowingTokens method to assign all remaining tokens to the last child. Click to see its body:
Click to expand the table of contents:

private void addFollowingTokens(HiddenTokenAwareTree target, 
    int stop) {
  List<Token> result = readPrefix(stop);
  target.addFollowing(result);
}

The method may exit now, everything up to the end of its input node has already been assigned.
  addFollowingTokens(previousChild, ast.getTokenStopIndex());
}

Special Case
The above example omitted one special case: sub-tree with no childs must associate all hidden comments as its orphan comments.

LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
if (children.isEmpty()) {
  addAllContainedTokens(ast);
  return;
}

The body of the addAllContainedTokens method:
Click to expand the table of contents:

private void addAllContainedTokens(HiddenTokenAwareTree ast) {
  int stop = ast.getTokenStopIndex();
  List<Token> result = readPrefix(stop);
  ast.addOrphans(result);
}

Consequences

The approach we just described has few important consequences for our grammar:
  1. All parentheses, curly braces and other "structure only" tokens must be present in the abstract syntax tree. We can not suppress them in the grammar.
  2. Major modifications of abstract syntax tree can not be done in the grammar. They can be done only after hidden tokens have been merged with the tree.
  3. We must be careful not to loose comments during the translation of the first abstract syntax tree into the final one. The mapping between the first and final abstract syntax trees is not one-to-one. All comments must find their way into the final tree, even if the owning node is thrown away.
  4. Abstract syntax tree tend to be more complicated then it would be otherwise. Every simple small thing must be a separate node/class. This the most annoying point in this list.

Final Notes

Our project uses multiple phases to create ast:
  • Create antlr abstract syntax tree and collect hidden comments and new lines.
  • Merge antlr abstract syntax tree with hidden tokens.
  • Convert antlr abstract syntax tree into final abstract syntax tree.

It would be possible to merge these three phases into one. Antlr is able to spit out parser that does all these thing automatically and returns abstract syntax tree as we want it.

We decided not to do it, because separate phases are easier to understand and debug. If it turns out that such approach has significant drawbacks, we will refactor the code and merge phases. Writing second compiler version or refactoring the code is usually easier than writing the first version, especially if the unit tests are already written.

7 comments:

Anonymous said...

Thanks for such nice post, you made my day ... :)

Philippe Symons said...

Thank you for this post. It was exactly what I needed for my syntax compiler ^^.

Albert Andrada said...

Really Nice Information,Thank You Very Much For Sharing.
Wordpress Development Company

Pinaki Poddar said...

Excellent post .. i was looking for exactly a similar solution t extract comments from Java source without having to modify the grammar rules everywhere.

Alexey Petuschak said...

Thank you for contribution!
The solution remains perfectly actual and helped me a lot to keep comments in my parser.

Ijjwyzfi Pnrblazg said...

It seems great! Here are some discount ray ban sunglasses for you.

SURESH BABU Ganta said...

thanks for sharing this article the article having a good content and valuable information.

Post a Comment