Capturing absolute offsets for JavaCC/JJTree tokens

I use JavaCC to generate parsers in Java, and JJTree to build an AST from the parse. JJTree creates the AST nodes, and you can configure JavaCC to record tokens in each node, i.e. so that each node holds its start and end tokens. The Token class that JavaCC generates by default stores positions relative to the start of the line, in the fields beginLine, beginColumn, endLine and endColumn. The line fields are absolute line numbers (starting from 1) and the column fields are offsets (also starting from 1) within the corresponding lines.
However, you often want the absolute offset of a token in the input stream, not just its offset within a line. I wish there were a JavaCC option to enable this, but it is not too hard to do yourself.
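
To make the relationship between the two coordinate systems concrete, here is a minimal sketch that converts a 1-based (line, column) pair into a 1-based absolute offset. LineColumnOffsets and toAbsoluteOffset are hypothetical names, not part of JavaCC. The sketch assumes the whole input is available as a String, that lines end in a single "\n", and that the text contains no tabs (SimpleCharStream advances the column to a tab stop when it sees "\t", so columns and character counts diverge there).

```java
/** Hypothetical helper, not part of JavaCC. */
final class LineColumnOffsets {
    private LineColumnOffsets() {}

    /**
     * Maps a 1-based (line, column) pair to a 1-based absolute offset.
     * Assumes '\n' line endings and no tab characters in the text.
     */
    static int toAbsoluteOffset(String text, int line, int column) {
        int offset = 0;      // number of characters on the lines before `line`
        int currentLine = 1;
        while (currentLine < line) {
            int newline = text.indexOf('\n', offset);
            if (newline < 0) {
                throw new IllegalArgumentException("no such line: " + line);
            }
            offset = newline + 1; // skip past the line terminator
            currentLine++;
        }
        return offset + column;   // column is 1-based, so the result is too
    }
}
```

For the two-line input used later in this post, toAbsoluteOffset("1 +\n2 * (3+4);", 2, 1) gives 5. This only works while you still hold the full text, which is one reason to record the offset during tokenizing instead.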

To explain how to do this, I will take the grammar file generated by the JavaCC wizard of the JavaCC Eclipse plugin. This is the default grammar file it generates:

/**
* JJTree file
*/

options{
  JDK_VERSION ="1.5";
}

PARSER_BEGIN(eg2)
package test;

public class eg2 {
  public static void main(String args[]){
    System.out.println("Reading from standard input...");
    System.out.print("Enter an expression like \"1+(2+3)*var;\" :");
    eg2 parser = new eg2(System.in);
    try {
      SimpleNode startNode = parser.Start();
      startNode.dump("");
      System.out.println("Thank you.");
    } catch (Exception e) {
      System.out.println("Oops.");
      System.out.println(e.getMessage());
    }
  }
}
PARSER_END(eg2)

SKIP :
{
  " "
| "\t"
| "\n"
| "\r"
| <"//" (~["\n","\r"])* ("\n"|"\r"|"\r\n")>
| <"/*" (~["*"])* "*" (~["/"] (~["*"])* "*")* "/">
}
TOKEN : /* LITERALS */
{
  < INTEGER_LITERAL:
        <DECIMAL_LITERAL> (["l","L"])?
      | <HEX_LITERAL> (["l","L"])?
      | <OCTAL_LITERAL> (["l","L"])?
  >
|  < #DECIMAL_LITERAL: ["1"-"9"] (["0"-"9"])* >
|  < #HEX_LITERAL: "0" ["x","X"] (["0"-"9","a"-"f","A"-"F"])+ >
|  < #OCTAL_LITERAL: "0" (["0"-"7"])* >
}
TOKEN : /* IDENTIFIERS */
{
  < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
|  < #LETTER: ["_","a"-"z","A"-"Z"] >
|  < #DIGIT: ["0"-"9"] >
}

SimpleNode Start():{}
{
  Expression() ";"
  { return jjtThis; }
}
void Expression():{ }
{
  AdditiveExpression()
}
void AdditiveExpression():{}
{
  MultiplicativeExpression() ( ( "+" | "-" ) MultiplicativeExpression() )*
}
void MultiplicativeExpression():{}
{
  UnaryExpression() ( ( "*" | "/" | "%" ) UnaryExpression() )*
}
void UnaryExpression():{}
{
  "(" Expression() ")" | Identifier() | Integer()
}
void Identifier():{}
{
  <IDENTIFIER>
}
void Integer():{}
{
  <INTEGER_LITERAL>
}

I have slightly modified the main function so that it does not use the eg2 class as a static class. The above is a grammar for parsing simple arithmetic expressions such as 1 + 2 * (3 + 4). Now consider the following input (with a new line):
1 +
2 * (3+4);
For ‘1’, JavaCC generates a token with beginLine=1 and beginColumn=1. For ‘2’ it generates a token with beginLine=2 and beginColumn=1. However, you might want to know the absolute position of ‘2’ in the input stream. Note also that the AST nodes generated by the above grammar do not contain start and end tokens; for that you need to set the JavaCC option TRACK_TOKENS to true.

To capture absolute offsets, we need to modify the SimpleCharStream class to keep track of the total number of chars read, and add fields to the Token class to store the absolute offsets. However, we need to go through some intermediate steps to do this.

Setting additional JavaCC options:

Set the following options in the grammar file:

options{
  JDK_VERSION = "1.5";
  TRACK_TOKENS = true;
  TOKEN_EXTENDS = "BaseToken"; // specify the package if it is not the same as the parser's
  COMMON_TOKEN_ACTION = true;
}

We will not edit the generated Token class to add fields for the absolute offsets. Instead we specify a parent class for Token, using the TOKEN_EXTENDS option, and put the fields there.
We also want a hook in the token manager so that we can set the absolute offset after a new Token is created, so we set COMMON_TOKEN_ACTION. When this is true, the token manager calls CommonTokenAction after creating a new Token. This method must be declared in the TOKEN_MGR_DECLS section.

Create Base Class for Token

public class BaseToken {
	public int absoluteBeginColumn = 0;
	public int absoluteEndColumn = 0;
}

Modify SimpleCharStream

This class reads input from the input stream (such as file), buffers it and keeps track of current token offset. You can specify initial buffer size for this class. SimpleCharStream reads from the input stream and stores data in a char buffer (of the size you specified or default). When all the chars in the buffer are read (by the TokenManager) and more data is available to read, then it expands the buffer.

We will add two fields to this class:

protected int totalCharsRead = 0;
protected int absoluteTokenBengin = 0;

As the name suggests, totalCharsRead keeps a count of the total chars read from the input stream, and absoluteTokenBengin holds the absolute offset of the beginning of the current Token. We will add an accessor function for absoluteTokenBengin:

public final int getAbsoluteTokenBengin() {
    return absoluteTokenBengin;
}

We will increment totalCharsRead whenever a character is read from the buffer, and decrement it when characters are put back in the buffer (in the backup function). The read and backup functions look as below after the modifications:

/** Read a character. */
  public char readChar() throws java.io.IOException
  {
    if (inBuf > 0)
    {
      --inBuf;

      if (++bufpos == bufsize)
        bufpos = 0;

      totalCharsRead++;
      return buffer[bufpos];
    }

    if (++bufpos >= maxNextCharInd)
      FillBuff();

    totalCharsRead++;
    char c = buffer[bufpos];

    UpdateLineColumn(c);
    return c;
  }

/** Backup a number of characters. */
  public void backup(int amount) {

    inBuf += amount;
    totalCharsRead -= amount;
    if ((bufpos -= amount) < 0)
      bufpos += bufsize;
  }

The modifications are the lines that update totalCharsRead. Next, we will modify the BeginToken function to set absoluteTokenBengin:

/** Start. */
  public char BeginToken() throws java.io.IOException
  {
    tokenBegin = -1;
    char c = readChar();
    tokenBegin = bufpos;
    absoluteTokenBengin = totalCharsRead;

    return c;
  }
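
The counting logic can be tried out in isolation. Below is a simplified stand-in for SimpleCharStream, assuming the whole input is a String: it has no circular buffer and no line/column tracking, and it throws an unchecked exception at end of input where the real class throws java.io.IOException. The class name CountingCharStream and the lower-case method names are illustrative only.

```java
/** Simplified stand-in for SimpleCharStream; illustrative names, not generated code. */
final class CountingCharStream {
    private final String input;
    private int position = 0;           // index of the next character to hand out
    private int totalCharsRead = 0;     // same role as the counter added to SimpleCharStream
    private int absoluteTokenBegin = 0; // 1-based offset of the current token's first char

    CountingCharStream(String input) {
        this.input = input;
    }

    /** Hands out the next character and counts it. */
    char readChar() {
        if (position >= input.length()) {
            // The real SimpleCharStream signals end of input with java.io.IOException.
            throw new IllegalStateException("end of input");
        }
        totalCharsRead++;
        return input.charAt(position++);
    }

    /** Pushes back `amount` characters, so they must be un-counted as well. */
    void backup(int amount) {
        position -= amount;
        totalCharsRead -= amount;
    }

    /** Reads the first character of a token and records where the token starts. */
    char beginToken() {
        char c = readChar();
        absoluteTokenBegin = totalCharsRead; // already incremented, hence 1-based
        return c;
    }

    int getTotalCharsRead() {
        return totalCharsRead;
    }

    int getAbsoluteTokenBegin() {
        return absoluteTokenBegin;
    }
}
```

Reading "12+3" as beginToken, readChar, readChar, backup(1), beginToken reports token starts of 1 and 3: the "+" that was read as lookahead is un-counted by backup and counted again when it begins the next token. In this toy version the counter always equals position; the real class needs a separate counter because bufpos wraps around inside the buffer.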

Modify TokenManager

Finally, we will add the method CommonTokenAction to the token manager. As mentioned above, we need to add this function to the TOKEN_MGR_DECLS section of the grammar file:

TOKEN_MGR_DECLS :
{
	public void CommonTokenAction(Token t)
	{
		t.absoluteBeginColumn = getCurrentTokenAbsolutePosition();
		t.absoluteEndColumn = t.absoluteBeginColumn + t.image.length();
	}

	public int getCurrentTokenAbsolutePosition()
	{
		if (input_stream instanceof SimpleCharStream)
			return ((SimpleCharStream)input_stream).getAbsoluteTokenBengin();
		return -1;
	}
}
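
Note the convention these two fields end up with: absoluteBeginColumn is 1-based (the counter is incremented before it is recorded), and absoluteEndColumn is begin plus the image length, i.e. one past the last character of the token. A minimal sketch of how such a pair maps back onto the input text, assuming the input is held as a String and every character of the token appears verbatim in its image; OffsetSlices and slice are hypothetical names:

```java
/** Hypothetical helper showing how to use the 1-based begin / exclusive end pair. */
final class OffsetSlices {
    private OffsetSlices() {}

    /**
     * Returns the text covered by [absoluteBegin, absoluteEnd), where both
     * values are 1-based as stored by CommonTokenAction above.
     */
    static String slice(String input, int absoluteBegin, int absoluteEnd) {
        // String indices are 0-based, so shift both ends down by one.
        return input.substring(absoluteBegin - 1, absoluteEnd - 1);
    }
}
```

For a node, passing the first token's absoluteBeginColumn and the last token's absoluteEndColumn gives the node's full source text. With the input "1 +\n2 * (3+4);", slice(input, 9, 14) returns "(3+4)". The offsets count Java chars as delivered by the Reader, not bytes in the underlying file.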

This is not related to the problem we are trying to solve, but if you have to write a large block of Java code in TOKEN_MGR_DECLS, which could be delegated to parser class, then you might want to use TOKEN_MANAGER_USES_PARSER option. If set to true, JavaCC creates TokenManager class with a field, parser, that holds reference to the main parser class. Then from within TOKEN_MGR_DECLS, you can call methods on the parser object.

You can check that the absolute offsets are set on each Token by stepping through the code in a debugger. Alternatively, you can modify the dump method of SimpleNode to print additional token information:

/* Override this method if you want to customize how the node dumps
     out its children. */

  public void dump(String prefix) {
    System.out.println(toString(prefix));
    printTokenInfo(prefix);
    if (children != null) {
      for (int i = 0; i < children.length; ++i) {
        SimpleNode n = (SimpleNode)children[i];
        if (n != null) {
          n.dump(prefix + " ");
        }
      }
    }
  }

  // New method added to print token information.
  // Both lines print begin positions: of the node's first token and of its last token.
  public void printTokenInfo(String prefix)
  {
    prefix += "\t";
    System.out.println(prefix + "StartCol = " + firstToken.beginColumn +
        " AbsStartCol = " + firstToken.absoluteBeginColumn + " Image - " +
        firstToken.image);
    System.out.println(prefix + "EndCol = " + lastToken.beginColumn +
        " AbsEndCol = " + lastToken.absoluteBeginColumn + " Image - " +
        lastToken.image);
  }

Now, for the input:
1 +
2 * (3+4);
the following output is printed:

Start
	StartCol = 1 AbsStartCol = 1 Image - 1
	EndCol = 10 AbsEndCol = 14 Image - ;
 Expression
 	StartCol = 1 AbsStartCol = 1 Image - 1
 	EndCol = 9 AbsEndCol = 13 Image - )
  AdditiveExpression
  	StartCol = 1 AbsStartCol = 1 Image - 1
  	EndCol = 9 AbsEndCol = 13 Image - )
   MultiplicativeExpression
   	StartCol = 1 AbsStartCol = 1 Image - 1
   	EndCol = 1 AbsEndCol = 1 Image - 1
    UnaryExpression
    	StartCol = 1 AbsStartCol = 1 Image - 1
    	EndCol = 1 AbsEndCol = 1 Image - 1
     Integer
     	StartCol = 1 AbsStartCol = 1 Image - 1
     	EndCol = 1 AbsEndCol = 1 Image - 1
   MultiplicativeExpression
   	StartCol = 1 AbsStartCol = 5 Image - 2
   	EndCol = 9 AbsEndCol = 13 Image - )
    UnaryExpression
    	StartCol = 1 AbsStartCol = 5 Image - 2
    	EndCol = 1 AbsEndCol = 5 Image - 2
     Integer
     	StartCol = 1 AbsStartCol = 5 Image - 2
     	EndCol = 1 AbsEndCol = 5 Image - 2
    UnaryExpression
    	StartCol = 5 AbsStartCol = 9 Image - (
    	EndCol = 9 AbsEndCol = 13 Image - )
     Expression
     	StartCol = 6 AbsStartCol = 10 Image - 3
     	EndCol = 8 AbsEndCol = 12 Image - 4
      AdditiveExpression
      	StartCol = 6 AbsStartCol = 10 Image - 3
      	EndCol = 8 AbsEndCol = 12 Image - 4
       MultiplicativeExpression
       	StartCol = 6 AbsStartCol = 10 Image - 3
       	EndCol = 6 AbsEndCol = 10 Image - 3
        UnaryExpression
        	StartCol = 6 AbsStartCol = 10 Image - 3
        	EndCol = 6 AbsEndCol = 10 Image - 3
         Integer
         	StartCol = 6 AbsStartCol = 10 Image - 3
         	EndCol = 6 AbsEndCol = 10 Image - 3
       MultiplicativeExpression
       	StartCol = 8 AbsStartCol = 12 Image - 4
       	EndCol = 8 AbsEndCol = 12 Image - 4
        UnaryExpression
        	StartCol = 8 AbsStartCol = 12 Image - 4
        	EndCol = 8 AbsEndCol = 12 Image - 4
         Integer
         	StartCol = 8 AbsStartCol = 12 Image - 4
         	EndCol = 8 AbsEndCol = 12 Image - 4
Thank you.
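
As a sanity check on the numbers above, the expected offsets can be recomputed without JavaCC. The sketch below is a naive scanner, not the generated token manager: it skips whitespace, groups runs of letters, digits and underscores, treats every other character as a one-character token, and ignores comments. OffsetCrossCheck and tokenOffsets are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Naive cross-check of absolute offsets; not a replacement for the generated lexer. */
final class OffsetCrossCheck {
    private OffsetCrossCheck() {}

    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    /** Returns entries of the form image@offset, with 1-based absolute offsets. */
    static List<String> tokenOffsets(String input) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;              // whitespace is skipped but still counted
                continue;
            }
            int start = i;
            if (isWordChar(c)) {
                while (i < input.length() && isWordChar(input.charAt(i))) {
                    i++;          // extend over the whole literal or identifier
                }
            } else {
                i++;              // operators and punctuation are single characters
            }
            result.add(input.substring(start, i) + "@" + (start + 1));
        }
        return result;
    }
}
```

For "1 +\n2 * (3+4);" this yields 1@1, +@3, 2@5, *@7, (@9, 3@10, +@11, 4@12, )@13 and ;@14, which agrees with the AbsStartCol and AbsEndCol values in the dump. With "\r\n" line endings every offset on the second line would be one higher.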

– Ram Kulkarni

5 Replies to “Capturing absolute offsets for JavaCC/JJTree tokens”

    1. This should work for special tokens too. After all, special tokens are instances of the Token class. In this example Token extends BaseToken, to which the fields that track absolute positions are added.

      1. I have found this is not the case. It appears that the generated XXTokenManager class is missing a call to the CommonTokenAction method when a specialToken is built. I don't know if this is specific to my version or if there is an additional flag to make this work.
