Saturday, September 1, 2012

Tackling Comments in ANTLR Compiler

Most compilers treat comments as just another meaningless whitespaces. They identify them in the source code and then throw them away. On the other hand, things are a bit harder if you need to know comments content and their position.

Such requirement is not too common and official ANTLR documentation does not says much about how to do it. The compiler we have been working on needed to preserve comments, so we had to come up with our own solution. This post documents it.

First chapter introduces our compiler. It describes what it does, where it is located and how it is written. Second chapter explains our comment preserving requirements. The rest of the post describes datastructures and algorithms used in our solution.

Table of Contents

Less4j Project

Less4j is a port of Less.js into java. Less.js compiles CSS-like language named less into standard CSS. It is still in prototype phase and compiles subset of CSS back into CSS. Less4j source code is located on Github.

First sub-chapter introduces less language and is written to be as short as possible. Second one contains overall architecture of our project.

Less
Less is basically CSS with variables, mixins, functions and few other features. Less.js compiles it into regular CSS. The original compiler was written in JavaScript and can be integrated into java project using Rhino.

Official documentation can be found on lesscss.org and introduction tutorial in previous post on this blog.

Architecture
Less4j compiler is based on antlr parser generator. This post assumes that you know antlr basics: what is lexer, parser, token and what is a rule. In any case, antlr details are available in our two previous tutorial posts.

Current version of our project runs in three phases:
  • build abstract syntax tree corresponding to input .less file,
  • translate .less abstract syntax tree into corresponding .css abstract syntax tree,
  • create final .css output.

The rest of this post focus on the abstract syntax tree building e.g, step 1. Second step is not done yet and the third one is more or less trivial.

Tree Building (Step 1)
Abstract syntax tree (AST) is built in two sub-phases:
  • antlr parses input into and creates "first AST",
  • the "first tree" is transformed into final abstract syntax tree.

First syntax tree is composed of default antlr tree nodes. They are all instances of CommonTree class and each knows its own type, position in the source code and can return a list of its children.

Nodes in the final tree belong to ASTCssNode hierarchy. The hierarchy has a separate class for each node type (ruleset, declaration, font-face, ...). Each class has only properties relevant for represented css element. For example, ruleset can return its own selectors and declarations.

It would be possible to merge these two phases into one. We decided not to do so, because debugging is much easier if things are done in separate phases.

The Compiler (Step 2)
The compiler was not yet implemented.

Printing (Step 3)
Printing is essentially breath first traversal through the final abstract syntax tree. The algorithm knows how to print each type of the node and recursively calls itself to print all its childs.

Requirements

Original less.js preserve comments and their positions. Whitespaces around comments are not preserved, the translation may remove them or add new ones into the output.

We decided not to mimic the exact less.js behavior. Instead, we tried to identify most important properties of their algorithm and use them as requirements.

Keep All Comments
The first requirement is the most important:
  • No comment may be lost. Every input comment must appear in the output.

Keep Comments Positions
The second most important thing is to keep comments where they have been found. Comments usually describe the element located before them, after them or on the same line. Therefore, comments must not change their location too much during the translation.

To make things less vague, we defined "keeping the location" as:
  • Comment located between two elements must stay between those two elements.
  • If the input contains new line between comment and its following element, then its corresponding output comment must be followed by a new line.
  • If the input line ends with a comment, the translation must keep that comment on the same line as its previous input element.

All three situations are shown in the next sections input-output examples.

Examples
Following two inputs must be unchanged by the translation. In both cases, input and output string must be exactly the same.

Example 1, ruleset with multiple declarations:
/*
 Preceding comment with new line.
*/
/* before */ .comments /* after */ {
  /* Leave me here. */
  /* 2 */ color: #808080 /* grey */ , /* blue */ #ffa500;
  margin: 2px;/* something about declaration*/
  padding: 2px /*before comma*/;/* after comma */
  /* Leave me here. */
} /* Almost last comment. */
/* Last comment. */

Example 2, an empty ruleset:
/* 1*/ .comments /* 2*/ /* 3 */ .something {
  /* Orphan comment. */
}

Datastructures

We decided to attach comments and new lines to their surrounding elements. This requires only small changes in our abstract syntax trees and in the printing algorithm.

Three building and the algorithm for comments attaching are described in following chapters.

First Abstract Syntax Tree
Each comment or new line is attached to exactly one tree node. Each node owns comments and new lines attached to it and knows their positions. It knows whether they should go before the node, after the node or inside it.

Each node in the first abstract syntax tree keeps three lists:
  • list of owned preceding comments and new lines,
  • list of owned following comments and new lines,
  • list of owned comments and new lines that are located inside it.

Each comment and new line is assigned to exactly one of these lists:
private List<Token> preceding = new LinkedList<Token>();
private List<Token> orphans = new LinkedList<Token>();
private List<Token> following = new LinkedList<Token>();

Final Abstract Syntax Tree
The final abstract syntax tree keeps comments in three similar lists. The only difference it that each final comment knows whether it should be followed by a new line or not. It does not know how many new lines have originally been there and new lines themselves are thrown away.

We have two types of comments, those that are followed by new line and those that are not followed by new line:
public class Comment extends ASTCssNode {

  private String comment;
  private boolean hasNewLine;
  
  //...

  public String getComment() {
    return comment;
  }

  public boolean hasNewLine() {
    return hasNewLine;
  }
}

Each node in the final abstract syntax tree keeps three lists:
  • list of owned preceding comments,
  • list of owned following comments,
  • list of owned comments that are located inside it.

The class:
class ASTCssNode {
  private List<Comment> openingComments = new ArrayList<Comment>();
  private List<Comment> orphanComments = new ArrayList<Comment>();
  private List<Comment> trailingComments = new ArrayList<Comment>();

  // ...
}

Printing
The printing algorithm requires three changes:
  • It must print all preceding comments before each encountered node.
  • It must print all following comments after each encountered node.
  • If it is possible for a node to have orphan comments, then the algorithm must print them too.

Matching Comments - Lexer

Both this and the next chapters assume that you already know what is lexer, parser, token, rule and how antlr works. You do not need to know details, everything needed in covered in the antlr hello world tutorial.

First thing we need to do is to modify the lexer to generate comment tokens. This is covered in the first sub-chapter. Second sub-chapter explains why we can not treat comment tokens the same way as other tokens. Finally, the third sub-chapter shows how to hide them.

Regular Expression
CSS comments look a lot like C or Java comments. They start with /* symbol, end with */ symbol and can not be nested. Therefore, we have to match anything in between /* and its closest */:
COMMENT: '/*' ( options { greedy=false; } : .*) '*/';

The options { greedy=false; } part is necessary, because antlr lexer is greedy. If we would use a simple regular expression '/*' .* '*/', the lexer would create one big token from everything between the first /* and the last */ in the input.

Consequences for Parser
Comments and whitespaces have one thing in common: they can appear in any number between any other two tokens. If we would use the above rule as is, then we would have to add comments tokens into every possible place in each parser rule.

For example, following declaration rule:
declaration: property COLON expr prio?;
would have to be changed into something like this:
declaration: property COMMENT* COLON COMMENT* expr COMMENT* prio?;

This way of comments handling is very error prone, labor extensive and we doubt anybody ever handled them this way. We do not want to add COMMENT* into every possible place in our grammar, but we need them in final abstract syntax tree.

Hiding Tokens
Each generated token has property named "channel". Only tokens with channel value 0 reach the parser, all other channels are hidden. It is important to remember that lexer generates token stream with all tokens including the hidden ones. Hidden tokens are filtered out between the parser and the lexer.

If you want to hide a token, you can use either a non-zero channel number or the HIDDEN keyword. The keyword corresponds to channel number 99.

Use the action part of the lexer rule to send all comments on hidden channel:
COMMENT: '/*' ( options { greedy=false; } : .*) '*/'
       { 
         $channel = HIDDEN;
       };

Collecting Hidden Comments

We have successfully added comments into the grammar without having to rewrite every single rule in it. Of course, as comments disappeared from the parser, we have to collect them elsewhere and then attach them to the abstract syntax tree.

Common Token Stream
Antlr generated lexer and parser are not compatible with each other. While the lexer implements the TokenSource interface, the parser takes an instance of TokenStream as its input. This class level incompatibly makes sense because:
  • lexer generates hidden tokens that should be filtered out,
  • parser occasionally needs to look ahead into upcoming tokens, but lexer does not have the ability to show them. Lexer returns one token at a time and is not able to show following or previous tokens.

Typical way to connect lexer with parser is to use CommonTokenStream. Common token stream adds look ahead abilities to the token source and filters out unwanted tokens:
LessLexer lexer = new LessLexer(inputStream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
LessParser parser = new LessParser(tokens);

Collecting Hidden Tokens
The easiest way to collect hidden tokens is to place collector between lexer and common token stream. The collector implements the same interface as the lexer (e.g. TokenSource), wraps over the lexer and acts as its replacement.

Common token source is given a collector instead of the lexer as its input:
LessLexer lexer = new LessLexer(inputStream);
CollectorTokenSource tokenSource = new CollectorTokenSource(lexer, ...);
CommonTokenStream tokens = new CommonTokenStream(tokenSource);
LessParser parser = new LessParser(tokens);

Common token stream uses Token nextToken() method of the TokenSource interface to read tokens. Our collector delegates the call to the lexer and returns the same token as the lexer would. However, if it is either a comment or a new line, the collector will store the token in a list of collected tokens:
class CollectorTokenSource implements TokenSource {
  
  ... 

  @Override
  public Token nextToken() {
    Token nextToken = source.nextToken();
    if (shouldCollect(nextToken)) {
      collectedTokens.add(nextToken);
    }
    
    return nextToken;
  }

  protected boolean shouldCollect(Token nextToken) {
    // filter the token by its type
    return collectTokenTypes.contains(nextToken.getType());
  }

}

Full implementation of the CollectorTokenSource class:
Click to expand the table of contents:

class CollectorTokenSource implements TokenSource {

  private final TokenSource source;
  private final Set<Integer> collectTokenTypes = 
      new HashSet<Integer>();
  private final LinkedList<Token> collectedTokens = 
      new LinkedList<Token>();

  public CollectorTokenSource(TokenSource source, 
      Collection<Integer> collectTokenTypes) {
    super();
    this.source = source;
    this.collectTokenTypes.addAll(collectTokenTypes);
  }

  /**
   * Returns next token from the wrapped token source. Stores it 
   * in a list if necessary.
   */
  @Override
  public Token nextToken() {
    Token nextToken = source.nextToken();
    if (shouldCollect(nextToken)) {
      collectedTokens.add(nextToken);
    }

    return nextToken;
  }

  /**
   * Decide whether collect the token or not.
   */
  protected boolean shouldCollect(Token nextToken) {
    // filter the token by its type
    return collectTokenTypes.contains(nextToken.getType());
  }

  public LinkedList<Token> getCollectedTokens() {
    return collectedTokens;
  }

  @Override
  public String getSourceName() {
    return "Collect hidden channel " + source.getSourceName();
  }

}

Merging Tokens With AST

Collected tokens need to be merged with abstract syntax tree generated by antlr. To do so, we have to customize the tree generated by antlr parser and then attach hidden tokens to it.

Merge is possible because:
  • Each antlr token knows his own index in the token stream.
  • Each antlr node knows indexes of both the first and last tokens that belongs to it.

First sub-chapter shows how to customize the tree and the rest of the chapter explains how to merge it with collected tokens.

Customizing Parser
Tree nodes generated by antlr parser are unable to hold any additional information. As we want to attach hidden tokens directly to the tree, we have to create new type of tree node and configure generated parser to use it.

Custom Tree Node
Abstract syntax tree generated by default is composed of instances of CommonTree class.

We extended CommonTree and added three properties into new class:
public class HiddenTokenAwareTree extends CommonTree {
  
  private List<Token> preceding = new LinkedList<Token>();
  private List<Token> orphans = new LinkedList<Token>();
  private List<Token> following = new LinkedList<Token>();

  // ... constructors, getters and setters follow

}

Configure Parser
Generated parser delegates all tree modifying operations to TreeAdaptor interface. Tree adaptor creates new nodes, add childs to them or deletes them when needed. Default tree adapter is called CommonTreeAdaptor.

As our tree nodes extend default CommonTree class, our custom adaptor needs to customize only the create method. Everything else may work the same way as in the original CommonTreeAdaptor:
class LessTreeAdaptor extends CommonTreeAdaptor {

  @Override
  public Object create(Token payload) {
    return new HiddenTokenAwareTree(payload);
  }

}

Tree adaptor used by parser is configured via setTreeAdaptor method:
LessParser parser = new LessParser(tokens);
parser.setTreeAdaptor(new LessTreeAdaptor());

Assignment
Our algorithm is supposed to assign each collected token to either preceding, following or orphan list of some tree node.

A tree node may own only those hidden tokens that are directly before it, directly after it or directly inside it. "Directly" means that no other tree node starts or ends between the hidden token and its owning node. There must be no other closer candidate.

Preceding List
The list of preceding tokens contains hidden tokens located directly before the owning node.

For example, this ruleset:
/* 0 */.comments /* 1 */ /* 2 */ .something /* 3 */ {
  declaration: value
}

corresponds to following simplified abstract syntax tree (the parenthesis contains start and stop token indexes of the subtree):
RULESET (2-22)
- SELECTOR (2-10)
--- CSS_CLASS comments (2-3)
--- EMPTY_COMBINATOR (9-9)
--- CSS_CLASS something (9-11)
- BODY_OF_DECLARATIONS (14-22)
--- DECLARATION (17-20)
----- PROPERTY declaration (17-17)
----- TERM value (20-20)

there are four hidden comments and one hidden new line (the parenthesis contains token index):
(0) COMMENT /* 0 */
(5) COMMENT /* 1 */
(7) COMMENT /* 2 */
(12) COMMENT /* 3 */
(15) NEW_LINE 

The comment 0 goes to the RULESET, because there is nothing that would start closer to it. Although SELECTOR node starts with the same index as the ruleset, it is rulesets child and thus considered to be behind it.

Comments 1 and 2 go to the EMPTY_COMBINATOR and the comment 3 and new line belong to the BODY_OF_DECLARATIONS.

Following List
The "following" list is going to be used only if there is no possible preceding owner. It may contain only hidden tokens located directly after the owning node.

For example, both comments in the following example have no element directly behind them:
.comments .something {
  declaration: something;
  /* Trailing comment. */
} 

.other .selector {
  declaration: value;
}
/* Last comment. */

The input corresponds to following simplified tree (the parenthesis contains start and stop token indexes of the subtree):
RULESET (0-18)
- SELECTOR .comments .something (0-4)
- BODY_OF_DECLARATIONS (6-18)
--- DECLARATION declaration: something; (9-13) 
RULESET (22-37)
- SELECTOR .other .selector (22-26)
- BODY_OF_DECLARATIONS (28-37)
--- DECLARATION declaration: value; (31-35)

hidden channel contains two comments (we omitted new lines because there are too many of them):
(16) COMMENT /* Trailing comment. */
(39) COMMENT /* Last comment. */

The Trailing comment. has index 16. There is nothing between this comment and the end of the first ruleset (position 18). No further element can own the comment, because there must be no node start nor node end between the hidden token and its owning node.

Therefore we have to look for an element that ends directly before it. There is nothing between the end of declaration: something; and the comment on position 16. The comment should be assigned to that declaration as the following hidden token.

The second comment Last comment. is the last comment in the input file. It will be assigned to the last element in the file e.g., second RULESET.

Orphan List
The list of orphan comments should contain only those comments that have neither preceding nor following elements:
.comments .something {
  /* Orphan comment. */
} 

The only comment in the above example belongs to the ruleset body.

Algorithm
The solution is implemented in class named ListToTreeCombiner. The list of hidden tokens is stored in its private property, so we do not have to constantly pass it around.

Root of the abstract syntax tree is special case and we will ignore it for now. The assignment of hidden tokens to any other tree node is done inside the associateAsChild method. It takes tree node as a parameter and assigns all hidden tokens up to its end. The method:
  • reads all hidden tokens up to the end of the input sub-tree,
  • assigns them to the input node or one of its childs,
  • removes all used tokens from the list of hidden tokens.

The method works correctly only if the list of hidden tokens starts with tokens that should be assigned to the sub-tree in parameter.

This part explains how the associateAsChild method works. First section shows both commented code of this method and full listing of ListToTreeCombiner class. Second section contains the input we will use as the example. Remaining sections apply the method to the example input and step through it to show how it works.

The Code
Next code example contains the associateAsChild method. The full class including all helper methods is shown below:
private LinkedList<Token> hiddenTokens;

private void associateAsChild(HiddenTokenAwareTree ast) {
  //fill preceding list of ast
  addAllPrecedingTokens(ast);

  LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
  //if the node has no children, fill orphans list
  if (children.isEmpty()) {
    addAllContainedTokens(ast);
    return;
  }

  HiddenTokenAwareTree previousChild = null;
  for (HiddenTokenAwareTree child : children) {
    //everything up to first new line or next child belongs to 
    //previous element
    assignFirstCommentsSegment(previousChild, child);
    //recursion assigns tokens to the child
    associateAsChild(child);
    previousChild = child;
  }
  //remaining tokens located inside this node belong to the last child
  addFollowingTokens(previousChild, ast.getTokenStopIndex());
}

The whole class:
Click to expand the table of contents:

public class ListToTreeCombiner {

  private LinkedList<Token> hiddenTokens;

  public void associate(HiddenTokenAwareTree ast, 
      LinkedList<Token> hiddenTokens) {
    initialize(hiddenTokens);

    LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
    for (HiddenTokenAwareTree child : children) {
      associateAsChild(child);
    }

    if (children.isEmpty()) {
      addAllContainedTokens(ast);
    } else {
      HiddenTokenAwareTree lastChild = children.getLast();
      lastChild.addFollowing(hiddenTokens);
    }
  }

  private void initialize(LinkedList<Token> hiddenTokens) {
    this.hiddenTokens = hiddenTokens;
  }

  private void associateAsChild(HiddenTokenAwareTree ast) {
    addAllPrecedingTokens(ast);

    LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
    if (children.isEmpty()) {
      addAllContainedTokens(ast);
      return;
    }

    HiddenTokenAwareTree previousChild = null;
    for (HiddenTokenAwareTree child : children) {
      assignFirstCommentsSegment(previousChild, child);
      associateAsChild(child);
      previousChild = child;
    }

    addFollowingTokens(previousChild, ast.getTokenStopIndex());
  }

  private void assignFirstCommentsSegment(
      HiddenTokenAwareTree firstChild, 
      HiddenTokenAwareTree secondChild) {

    LinkedList<Token> tail = 
      readTillNewLine(secondChild.getTokenStartIndex());

    if (!tail.isEmpty() && firstChild != null) {
      if (tail.peekLast().getType() == LessLexer.NEW_LINE)
        firstChild.addFollowing(tail);
      else
        secondChild.addPreceding(tail);
    }
  }

  private void addAllContainedTokens(HiddenTokenAwareTree ast) {
    int stop = ast.getTokenStopIndex();
    List<Token> result = readPrefix(stop);
    ast.addOrphans(result);
  }

  private void addFollowingTokens(HiddenTokenAwareTree target, 
      int stop) {
    List<Token> result = readPrefix(stop);
    target.addFollowing(result);
  }

  private LinkedList<HiddenTokenAwareTree> getChildren(
      HiddenTokenAwareTree ast) {

    List<HiddenTokenAwareTree> children = ast.getChildren();
    if (children == null)
      return new LinkedList<HiddenTokenAwareTree>();

    LinkedList<HiddenTokenAwareTree> copy = 
      new LinkedList<HiddenTokenAwareTree>(children);
    Collections.sort(copy, new PositionComparator());
    return copy;
  }

  private void addAllPrecedingTokens(HiddenTokenAwareTree target) {
    int start = target.getTokenStartIndex();
    List<Token> tokens = readPrefix(start);
    target.addPreceding(tokens);
  }

  private LinkedList<Token> readTillNewLine(int end) {
    LinkedList<Token> result = new LinkedList<Token>();
    if (hiddenTokens.isEmpty())
      return result;

    Token first = hiddenTokens.peekFirst();
    while (first != null && first.getTokenIndex() < end && 
        first.getType() == LessLexer.COMMENT) {
      result.add(first);
      hiddenTokens.removeFirst();
      first = hiddenTokens.peekFirst();
    }

    if (first == null || first.getTokenIndex() >= end)
      return result;

    result.add(first);
    return result;
  }

  private List<Token> readPrefix(int end) {
    List<Token> result = new ArrayList<Token>();
    if (hiddenTokens.isEmpty())
      return result;

    Token first = hiddenTokens.peekFirst();
    while (first != null && first.getTokenIndex() < end) {
      result.add(first);
      hiddenTokens.removeFirst();
      first = hiddenTokens.peekFirst();
    }
    return result;
  }

  class PositionComparator implements Comparator<CommonTree> {

    @Override
    public int compare(CommonTree arg0, CommonTree arg1) {
      return arg0.getTokenStartIndex() - arg1.getTokenStartIndex();
    }

  }

}

Example
We will use the first .less sample from requirements chapter to show how the method works.

The example input:
/*
 Preceding comment with new line.
*/
/* before */ .comments /* after */ {
  /* Leave me here. */
  /* 2 */ color: #808080 /* grey */ , /* blue */ #ffa500;
  margin: 2px;/* something about declaration*/
  padding: 2px /*before comma*/;/* after comma */
  /* Leave me here. */
} /* Almost last comment. */
/* Last comment. */

Expand to see all collected hidden tokens:
Click to expand the table of contents:
The number in parenthesis is token index. It is followed by token type and a text corresponding to the token.

(0) COMMENT /* Preceding comment with new line.*/
(1) NEW_LINE 
(2) COMMENT /* before */
(7) COMMENT /* after */
(10) NEW_LINE 
(12) COMMENT /* 1 */
(13) NEW_LINE 
(15) COMMENT /* 2 */
(22) COMMENT /* grey */
(26) COMMENT /* blue */
(30) NEW_LINE 
(36) COMMENT /*before comma*/
(38) COMMENT /* something about declaration*/
(39) NEW_LINE 
(45) COMMENT /*before comma*/
(47) COMMENT /* after comma */
(48) NEW_LINE 
(50) COMMENT /* Leave me here. */
(51) NEW_LINE 
(54) COMMENT /* Almost last comment. */
(55) NEW_LINE 
(56) COMMENT /* Last comment. */

Simplified abstract syntax tree:
STYLE_SHEET 4 58
-- RULESET 4 52
---- SELECTOR 4 5
-------- ...
---- BODY_OF_DECLARATIONS 9 52
------ DECLARATION 17 29
-------- ...
------ DECLARATION 32 37
-------- ...
------ DECLARATION 41 46
-------- ...

How It Works
Assume that we have a sub-tree corresponding to the "body of declarations" in the above example and a list of hidden tokens. All hidden tokes that belong to selector or other previous element have already been assigned and removed from the list.

As the body of declarations starts with token number 9, any token with index lower than 9 can be assigned as preceding token to the body of declarations. We created method addAllPrecedingTokens(ast) to do that. It takes tree node as a parameter and puts all hidden tokens with low indexes into its preceding list.

Associate preceding tokens first:
private void associateAsChild(HiddenTokenAwareTree ast) {
  addAllPrecedingTokens(ast);
  // ...
}

Expand to see the body of addAllPrecedingTokens helper method:
Click to expand the table of contents:

private void addAllPrecedingTokens(HiddenTokenAwareTree target) {
  int start = target.getTokenStartIndex();
  List<Token> tokens = readPrefix(start);
  target.addPreceding(tokens);
}

Tokens between 9 and 52 belong to individual declarations inside the body of declarations. As the first child ends up at the position 29, everything up to that position belongs to it. As we already chopped off everything up token 9, we can recursively call our method to do that work.
associateAsChild(child);

The recursive call assigns everything up to the position 29 to the first child. However, that child may own also tokens with positions higher than 29, but only up to closest new line and only up to the next element start or end.

Otherwise said, if there is no new line between two childs, all hidden tokens up to the beginning of the second child belong to that child. If there is new line between them, everything up to that new line belongs to the first child.

This logic is implemented inside the assignFirstCommentsSegment method. It takes two childs as a parameter and decides whether first child should have any following tokens assigned.

Assigning following tokens is the last thing we had to do with the first child. Once it is done, we can move to the next childs:
LinkedList<HiddenTokenAwareTree> children = getChildren(ast);

HiddenTokenAwareTree previousChild = null;
for (HiddenTokenAwareTree child : children) {
  assignFirstCommentsSegment(previousChild, child);
  associateAsChild(child);
  previousChild = child;
}

Expand to see the assignFirstCommentsSegment methods body:
Click to expand the table of contents:

private void assignFirstCommentsSegment(
    HiddenTokenAwareTree firstChild, 
    HiddenTokenAwareTree secondChild) {

  LinkedList<Token> tail =
    readTillNewLine(secondChild.getTokenStartIndex());

  if (!tail.isEmpty() && firstChild != null) {
    if (tail.peekLast().getType() == LessLexer.NEW_LINE)
      firstChild.addFollowing(tail);
    else
      secondChild.addPreceding(tail);
  }
}

The only special case is the last child which will get also all remaining hidden tokens up body of declaration end at the index 52.

We use the addFollowingTokens method to assign all remaining tokens to the last child. Click to see its body:
Click to expand the table of contents:

private void addFollowingTokens(HiddenTokenAwareTree target, 
    int stop) {
  List<Token> result = readPrefix(stop);
  target.addFollowing(result);
}

The method may exit now, everything up to the end of its input node has already been assigned.
  addFollowingTokens(previousChild, ast.getTokenStopIndex());
}

Special Case
The above example omitted one special case: sub-tree with no childs must associate all hidden comments as its orphan comments.

LinkedList<HiddenTokenAwareTree> children = getChildren(ast);
if (children.isEmpty()) {
  addAllContainedTokens(ast);
  return;
}

The body of the addAllContainedTokens method:
Click to expand the table of contents:

private void addAllContainedTokens(HiddenTokenAwareTree ast) {
  int stop = ast.getTokenStopIndex();
  List<Token> result = readPrefix(stop);
  ast.addOrphans(result);
}

Consequences

The approach we just described has few important consequences for our grammar:
  1. All parentheses, curly braces and other "structure only" tokens must be present in the abstract syntax tree. We can not suppress them in the grammar.
  2. Major modifications of abstract syntax tree can not be done in the grammar. They can be done only after hidden tokens have been merged with the tree.
  3. We must be careful not to loose comments during the translation of the first abstract syntax tree into the final one. The mapping between the first and final abstract syntax trees is not one-to-one. All comments must find their way into the final tree, even if the owning node is thrown away.
  4. Abstract syntax tree tend to be more complicated then it would be otherwise. Every simple small thing must be a separate node/class. This the most annoying point in this list.

Final Notes

Our project uses multiple phases to create ast:
  • Create antlr abstract syntax tree and collect hidden comments and new lines.
  • Merge antlr abstract syntax tree with hidden tokens.
  • Convert antlr abstract syntax tree into final abstract syntax tree.

It would be possible to merge these three phases into one. Antlr is able to spit out parser that does all these thing automatically and returns abstract syntax tree as we want it.

We decided not to do it, because separate phases are easier to understand and debug. If it turns out that such approach has significant drawbacks, we will refactor the code and merge phases. Writing second compiler version or refactoring the code is usually easier than writing the first version, especially if the unit tests are already written.

38 comments:

Anonymous said...

Thanks for such nice post, you made my day ... :)

Unknown said...

Thank you for this post. It was exactly what I needed for my syntax compiler ^^.

Pinaki Poddar said...

Excellent post .. i was looking for exactly a similar solution t extract comments from Java source without having to modify the grammar rules everywhere.

Alexey Petuschak said...

Thank you for contribution!
The solution remains perfectly actual and helped me a lot to keep comments in my parser.

Unknown said...

thanks for sharing this article the article having a good content and valuable information.

Unknown said...

Your blog is very useful for me, as your tutorial sessions are indeed of great benefit..java training in chennai | chennai's no.1 java training in chennai | best java institute in chennai

vignesjoseph said...

It's interesting that many of the bloggers your tips helped to clarify a few things for me as well as giving... very specific nice content. Hadoop Training in Chennai | Java Training in Chennai | Selenium Training in Chennai

Anonymous said...

From where the methods of ListToTreeCombiner class is called?

Suresh said...
This comment has been removed by the author.
Manoj said...

Excellent information with unique content and it is very useful to know about the information based on blogs.
embedded testing training in chennai

Muthu Raj said...

Very nice information. Thank you for sharing it.
Thanks,
Final Year Embedded Projects Chennai | Final Year Matlab Projects Chennai.

sunshineprofe said...

The young boys ended up stimulated to read through them and now have unquestionably been having fun with these things.
fire and safety courses in chennai

Praylin S said...

Great blog with unique content. I was looking for posts like yours. Keep posting. Regards.
Ionic Training in Chennai | Best Ionic Training in Chennai | Ionic Corporate Training | Ionic Training Course

Anbarasan14 said...

One of the best blogs that I have read till now. Thanks for your contribution in sharing such a useful information. Waiting for your further updates.

English Speaking Classes in Mumbai
Best English Speaking Institute in Mumbai
Spoken English Classes in Mumbai
Best English Speaking Classes in Mumbai
English Speaking Course in Mumbai
English Speaking Institute in Mumbai
Spoken English Training near me

Xplore IT Corp said...

Hey Nice Blog!! Thanks For Sharing!!!Wonderful blog & good post.Its really helpful for me, waiting for a more new post. Keep Blogging!
dot net course training in coimbatore
IT security training in coimbatore

vijaykumar said...


the article is nice.most of the important points are there.thankyou for sharing a good one.
RPA course in Chennai
RPA Training Institute in Chennai
Blue Prism Training Chennai
Blue Prism Training Institute in Chennai
UiPath Training Institutes in Chennai
rpa Training in Anna Nagar
rpa Training in T Nagar

Web Designing Company in Uttam Nagar said...

We make exact website designs that will attract possibility customers to your industry supporting to future enlargement and gainfulness. The design draws the visual aspect of a site. Generally, the design consolidates di...

Website_designing_company_in_palam

Web Designing Company in Uttam Nagar said...

COMPLETE CUSTOMER SATISFACTION
It is our promise that you will always get a good as well as satisfying experience and results from our end.

ROUND THE CLOCK SUPPORT
Our experts are always there to answer your questions and queries.

100% PROJECT COMMITMENT
We are fully dedicated to you and your projects to maximize your profits.

GOAL ORIENTED EXECUTIONS
We study your business objectives thoroughly to give extremely suitable and relevant solutions.





Dynamic Website Designing Company in Janakpuri

Dynamic Website Designing Company in Janakpuri

Dynamic Website Designing Company in Janakpuri

Dynamic Website Designing Company in Janakpuri

Dynamic Website Designing Company in Janakpuri

Dynamic Website Designing Company in Janakpuri

Dynamic Website Designing Company in Janakpuri


Dynamic Website Designing Company in Janakpuri

Joyal said...

Effective blog with a lot of information. I just Shared you the link below for Courses .They really provide good level of training and Placement,I just Had Appium Classes in this institute , Just Check This Link You can get it more information about the Appium course.


Java training in chennai | Java training in annanagar | Java training in omr | Java training in porur | Java training in tambaram | Java training in velachery

Best Ooty Taxi - Taxi Services in Ooty said...

Thanks for your valuable information shared on this site.This information very useful to me.

cabs in ooty | Travels in ooty


kitchentale said...

*This really answered my problem, thank you! Visit My site: online kitchen game

Rice Store said...

Nice job, it’s a great post. The info is good to know! online business

vé máy bay đi Canada said...

Mua vé máy bay tại Aivivu, tham khảo

gia ve may bay di my

vé máy bay mỹ về việt nam

vé máy bay về việt nam từ nhật

vé máy bay khứ hồi từ đức về việt nam

vé máy bay từ Toronto về việt nam

đặt vé máy bay từ hàn quốc về việt nam

Vé máy bay từ Canada về việt nam said...

Liên hệ Aivivu, đặt vé máy bay tham khảo

vé máy bay đi Mỹ Vietnam Airline

chuyến bay từ mỹ về việt nam

vé máy bay vietjet từ nhật về việt nam

giá vé máy bay từ đức về việt nam

thông tin chuyến bay từ canada về việt nam

Lịch bay từ Hàn Quốc về Việt Nam hôm nay

khách sạn cách ly ở nha trang

racesite.pro said...

When your website or blog goes live for the first time, it is exciting. 경마

casinositehot.com said...

Good write-up, I'm regular visitor of one's blog, maintain up the excellent operate, and It is going to be a regular visitor for a lengthy time.카지노사이트핫

야설 said...

Hello, I log on to your blog regularly. Your humorist style is awesome, keep up the good work!|

성인야설

오피헌터 said...

I was very happy to locate this web-site. I wished to many thanks for your time for this fantastic read!! I most definitely appreciating every little of it and also I have you bookmarked to take a look at new stuff you blog post.오피

건마탑 said...

I could not refrain from commenting. Exceptionally well written.

마사지

murat said...

instagram takipçi satın al
casino siteleri
GXB5WL

porz said...

slot siteleri
kralbet
betpark
tipobet
betmatik
kibris bahis siteleri
poker siteleri
bonus veren siteler
mobil ödeme bahis
NHSİQG

Bay said...

elf bar
binance hesap açma
sms onay
ETZ3

yBap said...

binance hesap açma
elf bar
sms onay
KUMMA

Anonymous said...

شركة تنظيف مكيفات بالدمام
شركة تنظيف مسابح بالقصيم

Rutvi said...

hello blogger

this is very informative blog , and you explained it in very easy language.

thank you so much for sharing & keep updating.

Digital Marketing Course In Surat
Social Media Marketing Course In Surat
Full Stack Developer Course In Surat

osman said...

çekmeköy
kepez
manavgat
milas
balıkesir
S6P

ilayda said...

bayrampaşa
güngören
hakkari
izmit
kumluca
EB8NW

Orkun said...

salt likit
salt likit
X3Q

Post a Comment