[ Jocelyn Paine's Home Page | Free software | Publications ]
[ Uses of JavaCC 1: Web-O-Matic | Uses of JavaCC 2: Spreadsheet front-end | Uses of JavaCC 3: Fortran format interpreter | How to Generate a Tree-Building Parser in Java using JJTree ]

Introduction to JJTree

Note: I wrote this introduction in 1998. Although what it says about the version of JJTree then current is true, there are newer versions, and the site from which you get them has moved.

I wrote this little introduction in response to a posting on the JavaCC list asking how to get started with JJTree. (For those who don't know, JavaCC is a parser-generator written in Java, and JJTree is an add-on which allows the generated parsers to produce syntax trees. They are both freely downloadable, with documentation, from the JavaCC home page).

I used JavaCC to write the parser for a Web authoring tool that I developed to make CGI programming easier. Instead of having to write Perl scripts or similar, you can just specify your dynamic pages in HTML, with some added tags and Java code inserts to control the interaction. The tool, Web-O-Matic, is freely available from my Web-O-Matic page.

I prototyped the Web-O-Matic compiler in Lisp. I then decided to make it more portable, and rewrote it in Java, modifying it so it would generate Java code (it originally generated Object Rexx, because of the system I did the original project on). So I needed to find a quick way to write parsers in Java, and was lucky enough to come across JavaCC. I also needed to build abstract syntax trees that I could walk to do type-checking and code-generation, and decided to use JJTree for this.

Web-O-Matic's parser doesn't have an easy job, because its input is a mixture of HTML, plain text, and Java. However, I managed, and some of my experience may be of help to others. So let's get on to my JJTree introduction. As I say, the parser I'm talking about parses this extended HTML superset, but the principles of JJTree are the same for any parser.

Looking at my grammar, I find that almost all my grammar rules are of the form

  void WOMHeaderContainer(): {}
  {
      <LT> <WOM_HEADER> WOMArgumentList("Header") <GT>
      Header()
      <END_CONTAINER> <WOM_HEADER> <GT>
  }
In other words, they are exactly what one would put into JavaCC. So almost everywhere, I'm allowing JJTree to take the default action in generating trees, rather than me customising them with its annotations. This default action is that the rule generates a node whose class has the same name as the rule's left-hand side, preceded by AST. JJTree generates the class definitions for these nodes automatically - you'll find them if you list your directory after running it.

The default structure of the trees is that when a rule R generates a tree, that tree's children are the trees generated by the rules called on R's right-hand side. So my above rule will generate a tree of class ASTWOMHeaderContainer, whose children are of class ASTWOMArgumentList and ASTHeader. All nodes are subclasses of SimpleNode.java (also generated by JJTree). If you look in the source of that file, you'll find that the children are held in an array called children. However, you should only use the public methods to get at them. These are jjtGetChild(int i) and jjtGetNumChildren(). Using these two, you can get a child at a specified position, if you know it. Positions start at 0: in my rule above, the ASTWOMArgumentList node would be at position 0 and the ASTHeader at position 1. You can also iterate over all the children in the normal way. You may need to test the class of each child, which you can do with Java's instanceof operator.

Now, the structure of the trees gives you the abstract syntax of your input, but not - in itself - the tokens. Sometimes the grammar is enough to tell you what these are. In my rule, the first token has to be an <LT>.

But in other cases, they can vary, and you'll want to capture the token as a string. There are two ways to do this. One is to use semantic actions as in JavaCC. Capture the token, get its image, and then store that (or a function of it, maybe just an integer denoting its type) in part of the tree node being generated by the current rule. The variable jjtThis always refers to that node. You'll need to add a field to your node class to hold the token image in, and you can do that either by defining your own subclass of the node, or by actually editing its definition file.

The other way is to alter JJTree's behaviour so that all tokens are captured. I think you do that by

  1. adding the
      NODE_SCOPE_HOOK = true;
    
    option. This causes methods jjtreeOpenNodeScope and jjttreeCloseNodeScope to be called whenever your node starts and finishes being built.

  2. Defining these methods as
      void jjtreeOpenNodeScope( Node node )
      {
        ((SimpleNode)node).first_token = getToken(1);
      }
    
      void jjtreeCloseNodeScope( Node node )
      {
        ((SimpleNode)node).last_token = getToken(0);
      }
    
    I know someone explained this in a posting to the JavaCC list submitted around the middle of 1997. This causes the fields first_token and last_token to be set to the first and last tokens encountered when your node is built. Tokens are chained together, so from the first, you can iterate over all the others, stopping when you hit the last. You will need to add the first_token and last_token fields: they're not part of the standard SimpleNode.

    If parsing HTML, this gives you an easy way to capture all those bits of plain text you'll want to emit verbatim - write a grammar rule that represents the region between two of your own special tags, and then treat all the tokens from that as verbatim.

So that gives you a tree which both tells you the abstract syntax of your input, and gives you all its tokens.

If you don't want trees from some of your rules, you can decorate them with #void, as in

  void HTMLArgumentList() #void: {}
  {
     ( HTMLArgumentKeyAndValue() )*
  }

  void RegionPart() {}
  {
    (
      <LT> <ID>
      HTMLArgumentList()
      <GT>
    )
  }
This means that the tree for RegionPart will not contain a child subtree for HTMLArgumentList (because of that rule's #void), but will contain subtrees for HTMLArgumentKeyAndValue.

You need to be able to get the tree from your top-level rule. Do this by giving that rule a non-void result type, the same type as that of the node you want to return. And inside the rule, make the last semantic action return jjtThis:

  ASTFile File(): {}
  {
     [ Layout() ]
     [ LOOKAHEAD( <LT> <WOM_HEADER> )
       WOMHeaderContainer()
     ]
     [ Layout() ]
     ( LOOKAHEAD( <LT> <WOM_CLASS> )
       WOMClassContainer()
     | TopContents()
     )
     { return jjtThis; }
  }
As I've previously said, jjtThis is always the node currently being built.

A good way to start with JJTree is just to use it as though it were JavaCC, not annotating any of the rules so that all the trees are built by default. Then at the end of a successful parse, dump the tree and see what it looks like. I think there's code to do that in Rob Duncan's HTML grammar.

There are lots of things I haven't covered, including conditionally-built trees and the mods to handle Visitor patterns. But I hope this has been of some help.

6th January 1998.

[ Jocelyn Paine's Home Page | Free software | Publications ]
[ Uses of JavaCC 1: Web-O-Matic | Uses of JavaCC 2: Spreadsheet front-end | Uses of JavaCC 3: Fortran format interpreter ]