March 31, 2013 by Ram

Understanding AST created by Mozilla Rhino parser

For an application I am developing, I needed to get all functions and variables declared in JavaScript code. Because the application I am developing is in Java, I started looking for readily available JavaScript parser written in Java. I found Mozilla Rhino to be a good fit for my requirements and decided to use it.

Parsing with Rhino is quite simple and well documented.

private AstRoot parse(String src, int startLineNum)
	throws IOException {

	CompilerEnvirons env = new CompilerEnvirons();
	env.setRecoverFromErrors(true);
	env.setGenerateDebugInfo(true);
	env.setRecordingComments(true);

	StringReader strReader = new StringReader(src);

	IRFactory factory = new IRFactory(env);
	return factory.parse(strReader, null, startLineNum);

}

AstNode class has methods like getSymbols and getSymbolTable which I presumed would give me all variables and functions declared in the code. However these methods returned only functions and not variables declared. So I had to traverse the tree and get all symbols. I thought I would use visit method of AstRoot to process all nodes.

root.visit(new NodeVisitor() {

	@Override
	public boolean visit(AstNode node) {
		int nodeType  = node.getType();
		//TODO: process the node based on node type
		return true; //process children
	}
});

However the above code threw ClassCastException exception in the visit method of ScriptNode class.

java.lang.ClassCastException: org.mozilla.javascript.Node cannot be cast to org.mozilla.javascript.ast.AstNode at org.mozilla.javascript.ast.ScriptNode.visit(ScriptNode.java:312)

And the test code was a simple JS script –

function test()
{
	var k = 10;
}

i = 100;

So finally I decided to traverse the AST from root and process each node. That is when I realized Rhino created AST differently from that AST I had worked with earlier. The AST I had worked with created nodes with their children as array of nodes. If you start with the root and visit each child, you are sure to traverse the entire tree and hence the code.

However Rhino does not create nodes with array of child nodes. Each node has methods to access the first node and then you can traverse remaining child nodes by accessing ‘next’ member. So instead of an array, it creates linked list of child nodes, which is not such a big problem. However a few things surprised me when I traversed the tree.

I started with the root node and for each node I followed following rules –

Get first child and traverse the linked list by calling next till the next is null
Call next on the parent node and repeat the loop.

Here is what I found –

When you traverse as above, you will only visit function name, and not function body
If you want to visit function body, you will have to call getFunctions (or getFunctionNode) method of ScriptNode. Both AstNode and FunctionNode are of type ScriptNode.
If you traverse the node as I described above (children first and then next node), you are not guaranteed to visit all AST nodes.

For the simple snippet of JS code above, the AST creates is as follows

As you can see, only function Name node is in the tree. AstRoot has two children – Name (Function) and Node (Expr_Result). Expr_Result node has one child – Node (setName). SetName has two children- Name (BindName) and NumberLiteral. This tree represents the entire code, except body of the function test. As mentioned earlier, you can get body of the function by calling getFunctionNode on AstRoot.

Now, if you inspect Name (BindName) node, you will see that it’s parent is Assignment node. Assignment has left node as Name(BindName) and right node as NumberLiteral. However Assignment is not part of the tree if you traverse it as described above.

So you also need to consider type (type constants are decalred in org.mozilla.javascript.Token class) and parent of the node when processing the AST created by Rhino.

-Ram Kulkarni

15 Replies to “Understanding AST created by Mozilla Rhino parser”

Martin says:

September 18, 2013 at 8:29 pm

Hello. Nice article. I am currently struggling with Rhino AST. You state that it is quite well documented, I have not found much in the way of documentation apart from the JavaDocs that have some gaps. Do you have a link to some useful documentation? Thanks again for the helpful article.

Reply
1. Ram Kulkarni says:
  
  September 19, 2013 at 9:39 am
  
  When I said that parsing with Rhino was well documented, I was talking about documentation for using parser APIs, not how AST was structured. However I can’t seem to find where I saw the documentation about using Parser APIs now. I am sure I saw it before. I might have also looked at test cases under testsrc/org folder, when you download Rhino.
  
  Reply
Ta The Ultimate says:

February 9, 2014 at 2:35 am

HI, thanks for the clear example. I was wandering if you know a way to change the function body. I currently want to wrap a function’s body with a “with(some_object){ }”

Reply
1. Ram Kulkarni says:
  
  February 9, 2014 at 11:58 am
  
  I haven’t done something like this before, but one possible solution could be that after you parse JS code, get all FunctionNode. You can get the start offset of each function by calling getAbsolutePosition method. The end offset could be startOffset + node.getText().length(). Once you know start and end positions, then insert wrapper code at those positions in the source file.
  
  Reply
Shreeharsha Voonna says:

March 30, 2014 at 11:14 pm

I am new to Rhino parser. I need to find out how to fetch words present in given script. I have tried doing this but I could retrieve only names of variables and functions. But I need to extract every single word( which could become a bit difficult in obfuscated script). Here is my approach to find variable and function names in script

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.mozilla.javascript.Parser;
import org.mozilla.javascript.ast.AstNode;
import org.mozilla.javascript.ast.FunctionCall;
import org.mozilla.javascript.ast.NodeVisitor;

public class JavascriptParser {

public static void main(String[] args) throws IOException {
class Printer implements NodeVisitor {

public boolean visit(AstNode node) {

/* Here I am using Name instead of FunctionCall because it gives errors for a few functions present in script */

if (node instanceof Name) {

System.out.println(name.getString());
/* this prints name of variable as well as functions present in script*/
}
return true;
}
}

String file = “/dss2.js”;
Reader reader = new FileReader(file);
try {
AstNode node = new Parser().parse(reader, file, 1);
node.visit(new Printer());
} finally {
reader.close();
}
}

Reply
1. Ram Kulkarni says:
  
  March 31, 2014 at 9:26 am
  
  What do you mean by capturing every single word in JS script? With the script you posted, you will be able to capture all functions and variables. Can you give example of what you are not able to capture with your code?
  
  Reply
  1. Shreeharsha Voonna says:
    
    March 31, 2014 at 10:43 am
    
    Basically I wanted to find the ratio of keywords to word and probability of shell code being present in JS script (part of implementation of research paper) to check if script is vulnerable. So considering a dummy script like this :
    
    var emptyFun = function () {};
    
    var methods = [‘assert’, ‘clear’, ‘count’, ‘debug’, ‘dir’, ‘dirxml’, ‘error’,
    ‘exception’, ‘group’, ‘groupCollapsed’, ‘groupEnd’, ‘info’, ‘log’,
    ‘markTimeline’, ‘profile’, ‘profileEnd’, ‘table’, ‘time’, ‘timeEnd’,
    ‘timeStamp’, ‘trace’, ‘warn’];
    
    var length = methods.length;
    
    var console = (window.console = window.console || {});
    
    while (length1–) {
    method = methods[length];
    if (!console[method1]) {
    console[method] = emptyFun;
    }
    }
    
    In this case, from the previous code I will be able to extract [ emptyfun, methods, length, length1, console, window, method ] but I will not be able to extract words like [ function, assert, clear, etc]. Even though I have the script I cant separate words based on when a space character occurs (will not work in case of obfuscated script).
    
    Reply
    1. Ram Kulkarni says:
      
      March 31, 2014 at 12:12 pm
      
      Just processing Name is not going to give you all words/keywords. You will have to handle each type of node separately. For example –
      
      public boolean visit(AstNode node) { printNode(node); return true; } void printNode (AstNode node) { String nodeStr = ""; int type = node.getType(); switch (type) { case Token.NAME: if (node instanceof Name) { Name name = (Name)node; nodeStr = name.getIdentifier(); } break; case Token.STRING: if (node instanceof StringLiteral) nodeStr = ((StringLiteral)node).getValue(); break; case Token.WHILE: nodeStr = "while"; break; case Token.TRUE: nodeStr = "true"; break; case Token.FALSE: nodeStr = "false"; break; case Token.FUNCTION: nodeStr = "function"; break; default: //find constant from Token.java and handle the case if you want. System.out.println(String.valueOf(type)); return; } if (nodeStr.length() > 0) System.out.println(nodeStr); }
      
      You will need to handle more types depending on what keywords you want to process. You can find more type constants in Token.java. If you have downloaded Rhino, you will find this class in rhino1_7R4/src/org/mozilla/javascript/Token.java
      
      Reply
      1. Shreeharsha Voonna says:
        
        April 1, 2014 at 8:41 am
        
        Thanks a lot. It worked. 🙂
Marion says:

May 14, 2014 at 2:20 pm

I want to use rhino to get the AST of a piece of code and make some transformations on this tree. Then I try to generate the corresponding JavaScript code by the transformed AST, is there any API provided by rhino?

Reply
1. Ram Kulkarni says:
  
  May 15, 2014 at 5:42 pm
  
  Not sure. I was only interested in parsing JS code, not regenerating it from AST – so did not look for such API. But if I find any such API, I will post it here
  
  Reply
Priya Sheth says:

July 2, 2014 at 1:39 pm

thank you sir…your method printnode() in above comment is quite useful for me.
but i have some problem..while running code ,if my javascript has some stmt like var a=10;
then switch goes to default case and print final int value like 145. i want to print var name.
and other problem is that if my script contains function code then various parse exception occurs.
so please guide me.

Reply
Priya Sheth says:

July 2, 2014 at 4:40 pm

if stmt is like var a=eval(c+d);
then please give me hint for how to reach eval() while parsing script.

Reply
1. Ram Kulkarni says:
  
  July 3, 2014 at 9:20 am
  
  The code snippet I posted in one of the comment for just an example of how to process nodes. I did not handle all the cases. To evaluate function calls, like eval, you can either check parent of node in Token.NAME case (in case of function call the parent would be of type FunctionCall) or handle a new case Token.CALL. Use debugger to see details of each node. Also refer to source of Token class of Mozilla to know what node type constants mean.
  
  Reply
Jayant says:

September 2, 2016 at 4:26 pm

Hello Ram,

Your article is very helpful and thank you for writing such article.

Where can I get all types of AST nodes? I have searched but no luck.

Could you please help me how can I check the if below regular expression is present in my JS file after I have parsed it using AST?

JS: steal(function($) {
init: function(){
this.validate(“irefNo”, function () {
var test=””;
test = this.iReferenceNumber.match(/[$+,:=?@#|!`^()~[]><\%*_=+]/g);
return "Invalid characters' "+test+" 'found in field";
}
}
}

Reply