After the ash, the papers will again be full of post-Election austerity. Roads or railways, councils or care homes: where will the Chancellor cut to chop the UK's massive deficit? I and a colleague have put up a poll that surveys how people would trade-off these cuts, and how this correlates with voting preference, at www.sharesim.arachsys.com/login/budget10.jsp and Facebook group Balance the Budget. We used a questionnaire-compiler which is a nice application of the parser-builder JavaCC.
The questionnaire was written by
my colleague John Green
of Research for Today.
It is based round
four grids, rendered as HTML tables.
Each grid has a row for each area of
public spending surveyed: affordable
housing; Trident nuclear
submarines; other defence; nationwide 40Mb
broadband; care homes.
Each row has cells for possible
levels of service: 20,000 more affordable
homes per year; 40,000; 60,000.
Cells are also equipped with a radio button
and a score stating the relative cost
of that level of service:
In each grid, the person doing the poll has to click buttons so that the costs sum to a particular budget saving: £60 billion for the first grid; £40 billion for the second; £20 billion; as now. It turns out to be surprisingly difficult — one might say taxing — as you discover just how much the UK cannot afford.
The questionnaire also contains multiple-choice questions having just one row or column of radio buttons or checkboxes. These ask for such characteristics as sex, age band, and political preference: Labour, Tory, or Lib Dem.
These and the rest of the questionnaire are coded in a mini-language that I implemented. Here's an example:
text |We are now going to ask you some questions about yourself and your politics.| question Age asks |How old are you?| options |Under 18| |Between 18 to 65| |Over 65| layout vertical question Party asks |Which party do you support?| options |Labour| |Tory| |Lib Dem| |Other| layout vertical page breakBy the way, I used the bar
|
as a string
delimiter rather than double quote, to make
it easier to include quotes in strings.
It is useful to design mini-languages
for tasks like this, but of course one needs
to specify their grammar. Here is a grammar for
the fragment above, written in
BNF. I've extended it with the +
operator, which by convention means "one or more
occurrences of":
<questionnaire> ::= <item>+ <item> ::= <question> | <non-question> <question> ::= "question" "asks" <string> "options" <string>+ <layout-specifier> <layout-specifier> ::= "layout" ( "horizontal" | "vertical" ) <non-question> ::= <text> | <page-break> <text> ::= "text" <string> <page-break> ::= "page break"
To write a parser in JavaCC, one transcribes BNF into its grammar-description language. This is easy: JavaCC's grammar-description language is designed to be close to BNF.
An important thing JavaCC needs to know is
which are the reserved words, or what
compiler-writers call "tokens", in
its input.
For my mini-language example, they
are asks
,
horizontal
,
layout
,
options
,
page break
,
question
,
text
,
and
vertical
.
One must also say how tokens are separated. In common with almost all programming languages — except Fortran, Snobol, and I suppose, Python — our mini-language is completely free-format. So I tell JavaCC to ignore spaces, tabs, and newlines and carriage returns between tokens. One does this by writing a regular expression that specifies the characters to be skipped.
I let questionnaire authors put comments in their files. This is good practice for any mini-language. It allows the author to annotate their file with information about creation and revision dates, why changes were made, the purpose of a particular statement, and so on. I decided to use C-style comments: either two slashes up to the end of the line, or slash-star up to star-slash. Again, one specifies these with regular expressions.
Having told JavaCC about tokens, the gaps
between tokens, and comments, one can now
code a grammar. In JavaCC's
grammar-description language, each
rule looks
rather like a Java method.
For example, the rule for item
resembles a method named item
,
with a body that says "either
call the
question
rule, or, call
the non-question
rule".
It isn't enough merely to recognise whether a file holds a valid questionnaire or not. One wants the parser to build a representation of the questionnaire. To do this, one augments the grammar with "semantic actions". That's the computer scientist's name for pieces of code that take the strings recognised by the right-hand side of a grammar rule and do stuff with them, such as storing them in the fields of a data structure. In JavaCC, the semantic actions are Java statements.
So a JavaCC file contains definitions of tokens and of the spaces between tokens, definitions of comments, a grammar, and the semantic actions inside the grammar. One must embed this in a class definition. This can contain methods and variables to be called by the semantic actions. One must also set up the input stream that the parser will read text from, and set some options. Amongst other things, these control the diagnostics to be generated. And then, one feeds the JavaCC file through JavaCC, and out comes a Java file. Compile this, and you have your parser.
The above would apply, mutatis mutandis, to any parser-builder, not just JavaCC. Something else that should always apply is the tight connection between grammar, data structure, and control structure. In sequential programming, there are three main control structures. These are: sequencing, or doing one thing and then another thing; alternation, or doing one thing OR another thing; and iteration, or doing one thing over and over again. Java does sequencing with semicolons, alternation with "if" and "switch", and iteration with loops.
Corresponding notions in a grammar are:
recognising one thing and then another thing;
recognising one thing OR another thing;
and recognising one thing over and
over again. The BNF notation I used
indicates the first by concatenation, the second
by the |
operator, and the third
by the +
operator.
And corresponding to these in data structures are: storing one thing and another thing; storing one thing OR another thing; and storing one thing over and over again. Almost every language uses multiple fields for the first, and arrays, lists, vectors and so on for the third. A principled language like Pascal uses "variant types" for the second; some other languages call these "unions". Java doesn't quite have this notion, but bodges it with subclasses.
What is the connection of these "Three Ways of Putting Together" with our questionnaires? I needed to decide what kind of data structure the semantic actions in my grammar would build. The answer is simple: data structures that parallel the structure of the grammar.
So I represented the entire questionnaire
as an instance of class Questionnaire
,
with a field that was a vector of Item
s.
I defined class Item
, and
its subclasses Question
and NonQuestion
. A
Question
had instance variables
holding the text
that it asks,
a vector of options,
and a LayoutSpecifier
.
The LayoutSpecifier
was a Java Enum
with two elements: HORIZONTAL
and VERTICAL
. Class
NonQuestion
had subclasses Text
and PageBreak
.
And a Text
contained
one instance variable, a String
.
Compare this with my BNF.
By the way, the previous paragraph, and my grammar, are for the questionnaire fragment I showed earlier. The questionnaires that we use in real life contain other kinds of question, such as the grids. They also contain other constructs. For example, conditional jumps that skip part of a questionnaire if the respondent gives certain answers.
How does the structure of
questionnaires relate to the Web?
When somebody logs in to our budget poll,
my Java code looks up the name of the file
that contains the mini-language
definition for the budget10
questionnaire. It passes the name
to the parser instance I generated
with JavaCC. JavaCC has translated
my grammar rules into methods,
the top-level one being named
questionnaire
. My code
calls this,
passing it
an input stream opened
onto this file. This method
calls the subordinate
methods corresponding to subordinate
rules, as well as the semantic actions.
The semantic actions build an
instance of class
Questionnaire
. This then
has to be rendered as
HTML. And here we again see the
Three Ways of Putting Together.
The top-level render-as-HTML
method sequences through the fields of
the Questionnaire
instance.
It iterates over the
vector of Item
instances in one of these fields.
And it
has alternative
methods for outputting
an Item
, depending
on whether this is a
Question
or
a NonQuestion
.
So the Three Ways of Putting Together apply to grammar, to data, and to control. By designing these so that one mirrors another, we can write really clean code.
The same applies to processing
the answers to a questionnaire.
Once we have answers from enough
respondents, John will analyse them with
his SIMALTO ("Simultaneous, Multi-Attribute,
Level Trade-Off") algorithms, and
publish the results, correlating
with respondents' expressed voting preference.
But how do the answers get
to that stage? The respondent's
browser sends each page's answers
to our server
in the normal way,
as a string of
name=value parameters at the end of a
URL. The server passes these to
my Java code, which stores
them in an instance of my
class AnswersToQuestionnaire
.
This
parallels
the question parts of
class Questionnaire
.
(The non-question parts, namely text
and page breaks, don't generate answers, and therefore
are irrelevant.)
Class AnswersToQuestionnaire
contains a vector
of instances of class AnswerToQuestion
,
analogous to my
Question
class.
Having these parallel structures makes it easy to associate answers with the question they belong to. This helps when saving the answers to file, and when displaying already-answered questions. It is also useful in reporting errors, as when the respondent's budget expenditure falls outside the permitted range. Luckily, on the Web, the consequence is one of our red error messages. If you are Chancellor, the consequences of inapposite budget expenditure are less benign.