Handling some of the warnings and errors generated by JavaCC

I am currently building a parser using JavaCC. I have used JavaCC in the past, but whenever I use it after a long gap, I have to relearn a few things about it – particularly handling warnings. So I thought this time I would blog about ways to handle some of the frequent warnings that I have seen.

If you are unfamiliar with JavaCC, then it is a parser generator. You create grammer using EBNF (Extended Backus-Naur Form) and feed it to JavaCC. JavaCC then creates Java classes for the parser. I do not want to make this post into JavaCC tutorial. There are some very good tutorials available at JavaCC Documentation page and FAQ. I especially find Lookahead MiniTutorial and Token Manager MiniTutorial very useful. If you use Eclipse IDE, then you would find JavaCC plugin for Eclipse useful – it provides wizard to create JavaCC or JJTree (JJTree creates AST, Abstract Syntax Tree, after parsing the input) files, provides code colorization, outline, code hyper link, syntax checking and compilation. You can also set JavaCC debug options easily using this plugin.

I will use following tokens that are generated by default if you use the wizard provided by JavaCC Eclipse plugin to create a JavaCC grammer file. I have created a .jjt file for examples in this blog.

SKIP:
{
  " "
| "\t"
| "\n"
| "\r"
| <"//" (~["\n","\r"])* ("\n"|"\r"|"\r\n")>
| <"/*" (~["*"])* "*" (~["/"] (~["*"])* "*")* "/">
}
TOKEN:/* LITERALS */
{
  < INTEGER_LITERAL:
        <DECIMAL_LITERAL> (["l","L"])?
      | <HEX_LITERAL> (["l","L"])?
      | <OCTAL_LITERAL> (["l","L"])?
  >
|  < #DECIMAL_LITERAL: ["1"-"9"] (["0"-"9"])* >
|  < #HEX_LITERAL: "0" ["x","X"] (["0"-"9","a"-"f","A"-"F"])+ >
|  < #OCTAL_LITERAL: "0" (["0"-"7"])* >
}
TOKEN:/* IDENTIFIERS */
{
  < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
|  < #LETTER: ["_","a"-"z","A"-"Z"] >
|  < #DIGIT: ["0"-"9"] >
}

Tokens that start with # are private to that block of token declaration. I am going to start with a simple one –

Error: Missing return statement in function

Consider following BNF production –

SimpleNode Start():{}
{
	<INTEGER_LITERAL>
	|
	<IDENTIFIER>

  { return jjtThis; }
}

JavaCC compiler will not throw any error for the above production. But when you run the parser on input “123”, you will get error – “Exception in thread “main” java.lang.Error: Missing return statement in function”

This is because Start production is converted to Java function called Start that returns SimpleNode. In the grammer file, this object is returned at the end {return jjtThis}, but it returns only if the matched token is <IDENTIFIER>. This is because Java block of {return jjtThis} is associated with <IDENTIFIER>. If input is “abc” then you will not get this error.
We can fix this issue by grouping expression choices and associating Java block to the group –

SimpleNode Start():{}
{
  (
	<INTEGER_LITERAL>
	|
	<IDENTIFIER>
  )
  { return jjtThis; }
}

Warning : Choice conflict involving two expansions at

The warning is quite obvious – that there are more than one options to match at the given location. JavaCC warning message, in fact, clearly indicates common token that caused this warning. However fixing this warning may not be easy in a large and complex grammer. Here I will take a simple example –

SimpleNode Start():{}
{
  (
	prod1()
	|
	prod2()
  )
  { return jjtThis; }
}

void prod1():{}
{
	<IDENTIFIER>
}

void prod2():{}
{
	<IDENTIFIER> <INTEGER_LITERAL>
}

The above grammer will cause following warning –

Warning: Choice conflict involving two expansions at
         line 55, column 9 and line 57, column 9 respectively.
         A common prefix is: <IDENTIFIER>
         Consider using a lookahead of 2 for earlier expansion.
Warning: Choice conflict involving two expansions at
         line 57, column 9 and line 59, column 9 respectively.
         A common prefix is: <IDENTIFIER>
         Consider using a lookahead of 2 for earlier expansion

The warning is caused by token <IDENTIFIER>, because both prod1 and prod2 start with <IDENTIFIER> and JavaCC cannot make correct decision (to either pick prod1 or prod2) without additional hint. Even though this is flagged a warning, JavaCC compiler generates incorrect code for the above grammer –

        case IDENTIFIER:
          prod1();
          break;
          prod2();
          break;

The error here is that prod2() call is unreachable, because it is after a break statement.

Warning message by JavaCC gives hint for fixing this issue. So if we tell JavaCC to look for next two tokens before deciding which path to take (prod1 or prod2), then this warning will go away.

SimpleNode Start():{}
{
  (
  	LOOKAHEAD(2)
	prod1()
	|
	prod2()
  )
  { return jjtThis; }
}

Specifying end of input:

However above grammer is still not correct if we expect input to have only one IDENTIFIER or one IDENTIFIER followed by one INTEGER_LITERAL. For example, feed an input of “abc def” to the above grammer and it will parse the input without any error. However you would expect the parser to throw error in this case. This can be fixed by telling JavaCC that input should end after either prod1 or prod2. We can do this by adding <EOF> token at the end of Start production –

SimpleNode Start():{}
{
  (
  	LOOKAHEAD(2)
	prod1()
	|
	prod2()
  )
  <EOF>
  { return jjtThis; }
}

Order of choices in production:

Suppose we change prod1 and prod2 to add one more <IDENTIFIER> token –

void prod1():{}
{
	<IDENTIFIER> <IDENTIFIER>
}

void prod2():{}
{
	<IDENTIFIER> <IDENTIFIER> <INTEGER_LITERAL>
}

If you input “abc def 123” to the parser, then you will get following error –

Encountered " <INTEGER_LITERAL> "123 "" at line 1, column 9.
Was expecting:
    <EOF>

In this case you might expect that prod2 should match the input. However JavaCC would look next two tokens before deciding if prod1 or prod2 should be matched and would see that prod1 matches next two tokens and would select it for matching. However after matching prod1, the parser expects end of input <EOF> and when it finds integer 123, it throws error.

We can fix this by moving longest match at the top, in this case prod2, so that JavaCC will try to match that first. We also have to change number of tokens to look ahead to 3, else input “abc def” will throw error, even though it could be matched by prod1.

SimpleNode Start():{}
{
  (
  	LOOKAHEAD(3)
	prod2()
	|
	prod1()
  )
  <EOF>
  { return jjtThis; }
}

Debugging parser

It is useful to know what choices JavaCC has made when parsing the input. This helps in debugging the grammer, when things do not work as you expect. Add DEBUG_PARSER=true; option at the top of the grammer file.

options{
//other options
DEBUG_PARSER=true;
}

When you run the parser on input “abc def 123”, you will see the following output in the console.

Call:   Start
  Call:   prod1
    Consumed token: <: "abc" at line 1 column 1>
    Consumed token: <: "def" at line 1 column 5>
  Return: prod1
  Consumed token: < at line 1 column 7>
Return: Start

If you want to see how parser processes lookaheads, then set option DEBUG_LOOKAHEAD = true. To see how token manager creates tokens from the input, set  DEBUG_TOKEN_MANAGER = true. With this option you can see how each character in the input stream is matched to create tokens. Though the output could be very verbose, it is very valuable when debugging complex grammer.

Warning: Choice conflict in (…)* construct

This is a common warning if there is a conflict of choices in loop or repetitive expressions. Consider following example –

SimpleNode Start():{}
{
	prod4()

  	<EOF>
  { return jjtThis; }
}

void prod3():{}
{
	<IDENTIFIER>
	(
		"-"
		<IDENTIFIER>
	)*
}

void prod4():{}
{
	prod3() "-" <INTEGER_LITERAL>
}

Valid input for this grammer is IDENTIFIER followed by zero or more pairs of “-” and IDENTIFIER followed by “-” and INTEGER_LITERAL. So there is loop/repetition in prod3. JavaCC will create following warning for the above grammer –

Warning: Choice conflict in (...)* construct at line -, column -.
         Expansion nested within construct and expansion following construct
         have common prefixes, one of which is: "-"
         Consider using a lookahead of 2 or more for nested expansion.

Warning is caused because in prod4, we have “-” that follows call to prod3 and prod3 has optional production that starts with “-“. So if the input is “abc – 123”, the parser generated by JavaCC has choice to go into (“-” <IDENTIFIER>) match in prod3 or end matching prod3 and match (“-” <INTEGER_LITERAL>) in prod4. So there is conflict of choices. If you tun the parser on this input you will get following error –

Encountered " <INTEGER_LITERAL> "100 "" at line 1, column 7.
Was expecting:
    <IDENTIFIER>...

We can add a hint (LOOKAHEAD) in the loop of prod3 to fix this warning.

void prod3():{}
{
	<IDENTIFIER>
	(
		LOOKAHEAD(2)
		"-"
		<IDENTIFIER>
	)*
}

So before matching “-” in the loop in prod3, the parser will check next token and will proceed with the match only if the token is of type IDENTIFIER.

Warnings may not necessarily result in incorrect parsing:

However not all warnings abut choice conflict cause error in parsing. Consider following grammer –

SimpleNodeStart():{}
{
	(
		prod5()
		|
		<INTEGER_LITERAL>
	)+

  	<EOF>
  { return jjtThis; }
}

voidprod5():{}
{
	(
		<IDENTIFIER> "-" <IDENTIFIER>
	)+
}

Valid input for the above grammer is INTEGER or one or more IDENTIFIER followed by “-” and IDENTIFIER, optionally followed by INTEGER. Example of valid input to this grammer is – “abc – def ghi – jkl 123 456”. JavaCC will create following warning for this grammer –

Warning: Choice conflict in (...)+ construct at line -, column -.
         Expansion nested within construct and expansion following construct
         have common prefixes, one of which is:
         Consider using a lookahead of 2 or more for nested expansion.

The choice conflict occurs in this grammer, for the above input string,  because after matching “abc – def” using prod5, the parser has two choices to match “ghi – jkl” – it can either continue matching in prod5 (because of repetition specified by +) or can come out of prod5, back in the loop of Start and match prod5 again. In any case, the next three token will be matched by prod5, so it is not really going to result in incorrect parsing. We can suppress warning in this case by specifying LOOKAHEAD(1), which is kind of redundant, because default lookahead in JavaCC is 1.  But JavaCC does not create warnings at the choice conflict if you specify any lookahead – it assumes that you know what you are doing in the grammer.

void prod5():{}
{
	(
		LOOKAHEAD(1)
		<IDENTIFIER> "-" <IDENTIFIER>
	)+
}

Order of TOKEN declaration matters:

JavaCC parser matches tokens in the order in which they are declared. We already have <IDENTIFIER> token declared, which will match any alpha character followed by any number of alpha numeric characters and “_”, with no spaces in between them. Suppose we want to declare a new token called VAR and we add this token after <IDENTIFIER> is declared –

TOKEN:/* IDENTIFIERS */
{
  < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
|  < #LETTER: ["_","a"-"z","A"-"Z"] >
|  < #DIGIT: ["0"-"9"] >
}

TOKEN:
{
	<VAR : "var">
}

You will get following warning from JavaCC  –

Warning:  “var” cannot be matched as a string literal token at line -, column -. It will be matched as  <IDENTIFIER>.

So “var” in your input will be matched as <IDENTIFIER> and not as <VAR> token. The solution is to move more specific token declaration at the top, in this case move <VAR> above <IDENTIFIER>

TOKEN:
{
	<VAR : "var">
}

TOKEN:/* IDENTIFIERS */
{
  < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
|  < #LETTER: ["_","a"-"z","A"-"Z"] >
|  < #DIGIT: ["0"-"9"] >
}

Nested syntactic lookaheads are ignored:

JavaCC ignores syntactic lookaheads. I have spent a lot of time debugging the grammer because I ignored this fact. In syntactic lookahead, you call function that represents the production, in LOOKAHEAD e.g. LOOKAHEAD(prod1()). See the grammer below –

SimpleNode Start():{}
{
	prod6()
  	<EOF>
  { return jjtThis; }
}

void prod6():{}
{
	LOOKAHEAD(prod7())
		prod7()
	|
	<INTEGER_LITERAL>
}

void prod7():{}
{
	(
	LOOKAHEAD(prod8() ".")
		prod8()
	|
	<IDENTIFIER> <INTEGER_LITERAL>
	)
	"."
}

void prod8():{}
{
	<IDENTIFIER>
}

If you input “abc 123” to the parser, you might expect that <IDENTIFIER> <INTEGER_LITERAL> in prod7 would match it. Let’s start from prod6(). It does syntactic lookahead for prod7(). In prod7(), there is another (nested) syntactic lookahead for prod8(). prod8() would match <IDENTIFIER> (for “abc” input). Then parser looks for “.”. But the next token is <INTEGER_LITERAL> (“123”). So lookahead in prod7 fails. At this point you would expect that parser will try to lookahead <IDENTIFIER> <INTEGER_LITERAL> declared as alternate choice in prod7(). But the parser does not do this, because it had ignored nested LOOKAHEAD in prod7() and had actually tried to match content of LOOKAHEAD. When it did not match “.”, the whole lookahead for prod7 fails and the parser looks for <INTEGER_LITERAL> as declared in prod6(). When it does not find integer, it throws error. If you run the parser with DEBUG_LOOKAHEAD = true, you will get following output –

Call:   Start
  Call:   prod6
    Call:   prod7(LOOKING AHEAD...)
      Call:   prod8(LOOKING AHEAD...)
        Visited token: <: "abc" at line 1 column 1>; Expected token: <>
      Return: prod8(LOOKAHEAD SUCCEEDED)
      Visited token: <<INTEGER_LITERAL>: "123" at line 1 column 5>; Expected token: <".">
    Return: prod7(LOOKAHEAD FAILED)
  Return: prod6
Return: Start
test.parser.ParseException: Encountered "  "abc "" at line 1, column 1.
Was expecting:
    <INTEGER_LITERAL> ...

Once you know that nested lookaheads are ignored, you can modify the grammer to work around that. In the above example, we can modify the grammer as follows –

void prod7():{}
{
	<IDENTIFIER> [<INTEGER_LITERAL>] "."
}

-Ram Kulkarni

4 Replies to “Handling some of the warnings and errors generated by JavaCC”

Leave a Reply

Social