[ Jocelyn Ireson-Paine's Home Page | Publications | Dobbs Code Talk Index | Dobbs Blog Version ]

An Online Budget Questionnaire, JavaCC, and the Three Ways of Putting Together

After the ash, the papers will again be full of post-Election austerity. Roads or railways, councils or care homes: where will the Chancellor cut to chop the UK's massive deficit? I and a colleague have put up a poll that surveys how people would trade-off these cuts, and how this correlates with voting preference, at www.sharesim.arachsys.com/login/budget10.jsp and Facebook group Balance the Budget. We used a questionnaire-compiler which is a nice application of the parser-builder JavaCC.

The questionnaire was written by my colleague John Green of Research for Today. It is based round four grids, rendered as HTML tables. Each grid has a row for each area of public spending surveyed: affordable housing; Trident nuclear submarines; other defence; nationwide 40Mb broadband; care homes. Each row has cells for possible levels of service: 20,000 more affordable homes per year; 40,000; 60,000. Cells are also equipped with a radio button and a score stating the relative cost of that level of service:

In each grid, the person doing the poll has to click buttons so that the costs sum to a particular budget saving: £60 billion for the first grid; £40 billion for the second; £20 billion; as now. It turns out to be surprisingly difficult — one might say taxing — as you discover just how much the UK cannot afford.

The questionnaire also contains multiple-choice questions having just one row or column of radio buttons or checkboxes. These ask for such characteristics as sex, age band, and political preference: Labour, Tory, or Lib Dem.

These and the rest of the questionnaire are coded in a mini-language that I implemented. Here's an example:

text |We are now going to ask you
      some questions about yourself
      and your politics.|

question Age
  asks |How old are you?|
  options |Under 18|
          |Between 18 to 65|
          |Over 65|
  layout vertical

question Party
  asks |Which party do you support?|
  options |Labour|
          |Tory|
          |Lib Dem|
          |Other|
  layout vertical

page break
By the way, I used the bar | as a string delimiter rather than double quote, to make it easier to include quotes in strings.

It is useful to design mini-languages for tasks like this, but of course one needs to specify their grammar. Here is a grammar for the fragment above, written in BNF. I've extended it with the + operator, which by convention means "one or more occurrences of":

<questionnaire> ::=
  <item>+

<item> ::=
  <question> |
  <non-question> 

<question> ::=
  "question" 
  "asks" <string>
  "options" <string>+
  <layout-specifier>

<layout-specifier> ::=
  "layout" ( "horizontal" | "vertical" )

<non-question> ::=
 <text> |
 <page-break>

<text> ::=
 "text" <string> 

<page-break> ::=
  "page break"

To write a parser in JavaCC, one transcribes BNF into its grammar-description language. This is easy: JavaCC's grammar-description language is designed to be close to BNF.

An important thing JavaCC needs to know is which are the reserved words, or what compiler-writers call "tokens", in its input. For my mini-language example, they are asks, horizontal, layout, options, page break, question, text, and vertical.

One must also say how tokens are separated. In common with almost all programming languages — except Fortran, Snobol, and I suppose, Python — our mini-language is completely free-format. So I tell JavaCC to ignore spaces, tabs, and newlines and carriage returns between tokens. One does this by writing a regular expression that specifies the characters to be skipped.

I let questionnaire authors put comments in their files. This is good practice for any mini-language. It allows the author to annotate their file with information about creation and revision dates, why changes were made, the purpose of a particular statement, and so on. I decided to use C-style comments: either two slashes up to the end of the line, or slash-star up to star-slash. Again, one specifies these with regular expressions.

Having told JavaCC about tokens, the gaps between tokens, and comments, one can now code a grammar. In JavaCC's grammar-description language, each rule looks rather like a Java method. For example, the rule for item resembles a method named item, with a body that says "either call the question rule, or, call the non-question rule".

It isn't enough merely to recognise whether a file holds a valid questionnaire or not. One wants the parser to build a representation of the questionnaire. To do this, one augments the grammar with "semantic actions". That's the computer scientist's name for pieces of code that take the strings recognised by the right-hand side of a grammar rule and do stuff with them, such as storing them in the fields of a data structure. In JavaCC, the semantic actions are Java statements.

So a JavaCC file contains definitions of tokens and of the spaces between tokens, definitions of comments, a grammar, and the semantic actions inside the grammar. One must embed this in a class definition. This can contain methods and variables to be called by the semantic actions. One must also set up the input stream that the parser will read text from, and set some options. Amongst other things, these control the diagnostics to be generated. And then, one feeds the JavaCC file through JavaCC, and out comes a Java file. Compile this, and you have your parser.

The above would apply, mutatis mutandis, to any parser-builder, not just JavaCC. Something else that should always apply is the tight connection between grammar, data structure, and control structure. In sequential programming, there are three main control structures. These are: sequencing, or doing one thing and then another thing; alternation, or doing one thing OR another thing; and iteration, or doing one thing over and over again. Java does sequencing with semicolons, alternation with "if" and "switch", and iteration with loops.

Corresponding notions in a grammar are: recognising one thing and then another thing; recognising one thing OR another thing; and recognising one thing over and over again. The BNF notation I used indicates the first by concatenation, the second by the | operator, and the third by the + operator.

And corresponding to these in data structures are: storing one thing and another thing; storing one thing OR another thing; and storing one thing over and over again. Almost every language uses multiple fields for the first, and arrays, lists, vectors and so on for the third. A principled language like Pascal uses "variant types" for the second; some other languages call these "unions". Java doesn't quite have this notion, but bodges it with subclasses.

What is the connection of these "Three Ways of Putting Together" with our questionnaires? I needed to decide what kind of data structure the semantic actions in my grammar would build. The answer is simple: data structures that parallel the structure of the grammar.

So I represented the entire questionnaire as an instance of class Questionnaire, with a field that was a vector of Items. I defined class Item, and its subclasses Question and NonQuestion. A Question had instance variables holding the text that it asks, a vector of options, and a LayoutSpecifier. The LayoutSpecifier was a Java Enum with two elements: HORIZONTAL and VERTICAL. Class NonQuestion had subclasses Text and PageBreak. And a Text contained one instance variable, a String. Compare this with my BNF.

By the way, the previous paragraph, and my grammar, are for the questionnaire fragment I showed earlier. The questionnaires that we use in real life contain other kinds of question, such as the grids. They also contain other constructs. For example, conditional jumps that skip part of a questionnaire if the respondent gives certain answers.

How does the structure of questionnaires relate to the Web? When somebody logs in to our budget poll, my Java code looks up the name of the file that contains the mini-language definition for the budget10 questionnaire. It passes the name to the parser instance I generated with JavaCC. JavaCC has translated my grammar rules into methods, the top-level one being named questionnaire. My code calls this, passing it an input stream opened onto this file. This method calls the subordinate methods corresponding to subordinate rules, as well as the semantic actions.

The semantic actions build an instance of class Questionnaire. This then has to be rendered as HTML. And here we again see the Three Ways of Putting Together. The top-level render-as-HTML method sequences through the fields of the Questionnaire instance. It iterates over the vector of Item instances in one of these fields. And it has alternative methods for outputting an Item, depending on whether this is a Question or a NonQuestion.

So the Three Ways of Putting Together apply to grammar, to data, and to control. By designing these so that one mirrors another, we can write really clean code.

The same applies to processing the answers to a questionnaire. Once we have answers from enough respondents, John will analyse them with his SIMALTO ("Simultaneous, Multi-Attribute, Level Trade-Off") algorithms, and publish the results, correlating with respondents' expressed voting preference. But how do the answers get to that stage? The respondent's browser sends each page's answers to our server in the normal way, as a string of name=value parameters at the end of a URL. The server passes these to my Java code, which stores them in an instance of my class AnswersToQuestionnaire. This parallels the question parts of class Questionnaire. (The non-question parts, namely text and page breaks, don't generate answers, and therefore are irrelevant.) Class AnswersToQuestionnaire contains a vector of instances of class AnswerToQuestion, analogous to my Question class.

Having these parallel structures makes it easy to associate answers with the question they belong to. This helps when saving the answers to file, and when displaying already-answered questions. It is also useful in reporting errors, as when the respondent's budget expenditure falls outside the permitted range. Luckily, on the Web, the consequence is one of our red error messages. If you are Chancellor, the consequences of inapposite budget expenditure are less benign.