[ Jocelyn Paine's Home Page
| Free software
| Publications
]
[
Uses of JavaCC 1: Web-O-Matic
| Uses of JavaCC 2: Spreadsheet
front-end
| Uses
of JavaCC 3: Fortran format interpreter
| How
to Generate a
Tree-Building Parser in Java using
JJTree
]
Note: I wrote this introduction in 1998. Although what it says about the version of JJTree then current is true, there are newer versions, and the site from which you get them has moved.
I wrote this little introduction in response to a posting on the JavaCC list asking how to get started with JJTree. (For those who don't know, JavaCC is a parser-generator written in Java, and JJTree is an add-on which allows the generated parsers to produce syntax trees. They are both freely downloadable, with documentation, from the JavaCC home page).
I used JavaCC to write the parser for a Web authoring tool that I developed to make CGI programming easier. Instead of having to write Perl scripts or similar, you can just specify your dynamic pages in HTML, with some added tags and Java code inserts to control the interaction. The tool, Web-O-Matic, is freely available from my Web-O-Matic page.
I prototyped the Web-O-Matic compiler in Lisp. I then decided to make it more portable, and rewrote it in Java, modifying it so it would generate Java code (it originally generated Object Rexx, because of the system I did the original project on). So I needed to find a quick way to write parsers in Java, and was lucky enough to come across JavaCC. I also needed to build abstract syntax trees that I could walk to do type-checking and code-generation, and decided to use JJTree for this.
Web-O-Matic's parser doesn't have an easy job, because its input is a mixture of HTML, plain text, and Java. However, I managed, and some of my experience may be of help to others. So let's get on to my JJTree introduction. As I say, the parser I'm talking about parses this extended HTML superset, but the principles of JJTree are the same for any parser.
Looking at my grammar, I find that almost all my grammar rules are of the form
void WOMHeaderContainer(): {} { <LT> <WOM_HEADER> WOMArgumentList("Header") <GT> Header() <END_CONTAINER> <WOM_HEADER> <GT> }In other words, they are exactly what one would put into JavaCC. So almost everywhere, I'm allowing JJTree to take the default action in generating trees, rather than me customising them with its annotations. This default action is that the rule generates a node whose class has the same name as the rule's left-hand side, preceded by
AST
.
JJTree generates
the class definitions for these nodes automatically - you'll find them
if you list your directory after running it.
The default structure of the trees is that when a rule R generates
a tree, that tree's children are the trees generated by the rules
called on R's right-hand side. So my above rule will generate a
tree
of class ASTWOMHeaderContainer
, whose children are of class
ASTWOMArgumentList
and ASTHeader
.
All nodes are subclasses of
SimpleNode.java
(also generated by JJTree). If you look in the source of
that file, you'll find that the children are held in an array called
children
.
However, you should only use the public methods to get at them.
These are jjtGetChild(int
i)
and jjtGetNumChildren()
.
Using these two,
you can get a child at a specified position, if you know it. Positions
start at 0: in my rule above, the ASTWOMArgumentList
node would be
at position 0 and the ASTHeader
at position 1. You can also iterate
over all the children in the normal way. You may need to test the
class of each child, which you can do with Java's instanceof
operator.
Now, the structure of the trees gives you the abstract syntax of
your input, but not - in itself - the tokens. Sometimes the grammar
is enough to tell you what these are. In my rule, the first token
has to be an <LT>
.
But in other cases, they can vary, and you'll want to capture the
token as a string. There are two ways to do this. One is to use semantic
actions as in JavaCC. Capture the token, get its image, and then store
that (or a function of it, maybe just an integer denoting its type) in
part of the tree node being generated by the current rule. The variable
jjtThis
always refers to that node. You'll need to add a field to
your node class to hold the token image in, and you can do that either
by defining your own subclass of the node, or by actually editing its
definition file.
The other way is to alter JJTree's behaviour so that all tokens are captured. I think you do that by
NODE_SCOPE_HOOK = true;option. This causes methods
jjtreeOpenNodeScope
and
jjttreeCloseNodeScope
to be called whenever your node
starts
and finishes being built.
void jjtreeOpenNodeScope( Node node ) { ((SimpleNode)node).first_token = getToken(1); } void jjtreeCloseNodeScope( Node node ) { ((SimpleNode)node).last_token = getToken(0); }I know someone explained this in a posting to the JavaCC list submitted around the middle of 1997. This causes the fields
first_token
and last_token
to be set to the first
and last tokens encountered when your node is built. Tokens
are chained together, so from the first, you can iterate over
all the others, stopping when you hit the last. You will need
to add the first_token
and last_token
fields: they're not part
of the standard SimpleNode
.
If parsing HTML, this gives you an easy way to capture all those bits of plain text you'll want to emit verbatim - write a grammar rule that represents the region between two of your own special tags, and then treat all the tokens from that as verbatim.
So that gives you a tree which both tells you the abstract syntax of your input, and gives you all its tokens.
If you don't want trees from some of your rules, you can decorate them
with #void
, as in
void HTMLArgumentList() #void: {} { ( HTMLArgumentKeyAndValue() )* } void RegionPart() {} { ( <LT> <ID> HTMLArgumentList() <GT> ) }This means that the tree for
RegionPart
will not contain a child subtree
for HTMLArgumentList
(because of that rule's #void
), but will contain
subtrees for HTMLArgumentKeyAndValue
.
You need to be able to get the tree from your top-level rule. Do this
by giving that rule a non-void result type, the same type as that of
the node you want to return. And inside the rule, make the last semantic
action return jjtThis
:
ASTFile File(): {} { [ Layout() ] [ LOOKAHEAD( <LT> <WOM_HEADER> ) WOMHeaderContainer() ] [ Layout() ] ( LOOKAHEAD( <LT> <WOM_CLASS> ) WOMClassContainer() | TopContents() ) { return jjtThis; } }As I've previously said,
jjtThis
is always the node currently being
built.
A good way to start with JJTree is just to use it as though it were JavaCC, not annotating any of the rules so that all the trees are built by default. Then at the end of a successful parse, dump the tree and see what it looks like. I think there's code to do that in Rob Duncan's HTML grammar.
There are lots of things I haven't covered, including conditionally-built trees and the mods to handle Visitor patterns. But I hope this has been of some help.
6th January 1998.
[ Jocelyn Paine's Home Page
| Free software
| Publications
]
[
Uses of JavaCC 1: Web-O-Matic
| Uses of JavaCC 2: Spreadsheet
front-end
| Uses
of JavaCC 3: Fortran format interpreter
]