For a conference,
I once drew a cartoon about spreadsheets. It showed a City gent staring despairingly
at an Excel formula while behind him, Excel developers in Canary
Wharf fretted. Like this:
In the cartoon,
I'd assumed it was his own spreadsheet, but it could have been
someone else's "legacy"
spreadsheet that he was trying to decipher — and that's
what I'm going to write about. My mind has been running on
the topic since a colleague asked me to explain my work
on what I call "structure discovery":
making explicit the things an author knows about
his spreadsheet but didn't write down.
To demonstrate, I've written a tiny cash-flow
spreadsheet, pictured below. Its formulae are realistic, but I've
reduced the number of rows and simplified the currency and date formats
so that it doesn't take up too much space on the screen:
The first thing one needs is a listing of the
spreadsheet's formulae. Oddly, Excel doesn't provide this;
but below, I do so by reading the spreadsheet
into a reverse-engineering
tool I've developed. This parses the Excel
file, extracts the formulae, associates
them with their cells, and lists them. That's something that
will be useful again later on, when I go on to transform
the formulae to make them easier to read. But to start with, here
are the formulae without any processing:
Cells
Sheet1!A1:F12
where:
Sheet1!B1 = "Simple cash-flow spreadsheet"
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
Sheet1!F4 = "Opening cash"
Sheet1!F5 = 1000
Sheet1!B7 = "Month"
Sheet1!C7 = "Income"
Sheet1!D7 = "Expenses"
Sheet1!E7 = "Net profit"
Sheet1!F7 = "Cash at start of period"
Sheet1!B8 = 1
Sheet1!C8 = 120
Sheet1!D8 = 27
Sheet1!E8 = Sheet1!C8-Sheet1!D8
Sheet1!F8 = Sheet1!F5
Sheet1!B9 = 2
Sheet1!C9 = 120
Sheet1!D9 = 41
Sheet1!E9 = Sheet1!C9-Sheet1!D9
Sheet1!F9 = Sheet1!F8+Sheet1!E8
Sheet1!B10 = 3
Sheet1!C10 = 150
Sheet1!D10 = 64
Sheet1!E10 = Sheet1!C10-Sheet1!D10
Sheet1!F10 = Sheet1!F9+Sheet1!E9
Sheet1!A12 = "Totals"
Sheet1!C12 = SUM(Sheet1!C8:Sheet1!C10)
Sheet1!D12 = SUM(Sheet1!D8:Sheet1!D10)
Sheet1!E12 = SUM(Sheet1!E8:Sheet1!E10)
Having the formulae in a single
listing I can print is nice, but it would be
even better if the cell addresses were
replaced by meaningful names. So suppose again that
the cash-flow spreadsheet is someone else's that
I'm trying to decipher, and suppose also that
from its layout, cell colouring, and labels, I've
guessed which cells belong together. For example, suppose
I've guessed that cells B8:B10 are successive months, and that
C8:C10 are successive incomes. I could summarise my
guesses like this:
'Sheet1'!F5 is opening_cash.
'Sheet1'!B8:B10 is months.
'Sheet1'!C8:C10 is income.
'Sheet1'!D8:D10 is expenses.
'Sheet1'!E8:E10 is net_profit.
'Sheet1'!F8:F10 is cash_at_start_of_period.
'Sheet1'!D12 is total_expenses.
'Sheet1'!C12:C12 is total_income.
'Sheet1'!E12:E12 iso total_profit.
Equipped with such a mapping from addresses to names, one
could then regard each group of cells as an array. For
example, B8:B10 would consist of the elements
months[1], months[2]
and months[3]. Similarly, C8:C10 would consist of
income[1], income[2], and
income[3]. It's then fairly simple to
rewrite each formula, replacing cell addresses in it
by named array elements. I've implemented such
transformations in my reverse-engineering software, and
below, apply them to the cash-flow spreadsheet. Notice how much
more intelligible it becomes:
Cells
total_profit[1]
total_expenses[1]
total_income[1]
Sheet1!A1:F12
cash_at_start_of_period[1:3]
net_profit[1:3]
expenses[1:3]
income[1:3]
months[1:3]
opening_cash[1]
where:
Sheet1!B1 = "Simple cash-flow spreadsheet"
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
Sheet1!F4 = "Opening cash"
Sheet1!B7 = "Month"
Sheet1!C7 = "Income"
Sheet1!D7 = "Expenses"
Sheet1!E7 = "Net profit"
Sheet1!F7 = "Cash at start of period"
Sheet1!A12 = "Totals"
cash_at_start_of_period[1] = opening_cash[1]
cash_at_start_of_period[2] = cash_at_start_of_period[1]+net_profit[1]
cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2]
expenses[1] = 27
expenses[2] = 41
expenses[3] = 64
income[1] = 120
income[2] = 120
income[3] = 150
months[1] = 1
months[2] = 2
months[3] = 3
net_profit[1] = income[1]-expenses[1]
net_profit[2] = income[2]-expenses[2]
net_profit[3] = income[3]-expenses[3]
opening_cash[1] = 1000
total_expenses[1] = SUM(expenses[1]:expenses[3])
total_income[1] = SUM(income[1]:income[3])
total_profit[1] = SUM(net_profit[1]:net_profit[3])
Although such a listing is more intelligible than the
original formulae, it can become tedious and very long if the
spreadsheet has lots of repeated formulae such as those for
net_profit.
To overcome this, I've implemented
a "run detector". This compares each formula with the
ones next to it, looking to see whether it's either
identical or differs by 1 in a cell address or constant. When
it finds a run of two or more formulae that are thus related,
it summarises them by factoring out
the common parts and replacing these by a variable,
written here as V0. Here's output
from the run-detector applied to the cash-flow spreadsheet:
total_profit[1] = SUM(net_profit[1]:net_profit[3]).
total_expenses[1] = SUM(expenses[1]:expenses[3]).
total_income[1] = SUM(income[1]:income[3]).
Sheet1!A12 = "Totals".
Sheet1!B1 = "Simple cash-flow spreadsheet".
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software".
Sheet1!B7 = "Month".
Sheet1!C7 = "Income".
Sheet1!D7 = "Expenses".
Sheet1!E7 = "Net profit".
Sheet1!F4 = "Opening cash".
Sheet1!F7 = "Cash at start of period".
cash_at_start_of_period[1] = opening_cash[1].
cash_at_start_of_period[V0>1] = cash_at_start_of_period[V0-1]+net_profit[V0-1].
net_profit[V0] = income[V0]-expenses[V0].
expenses[1] = 27.
expenses[2] = 41.
expenses[3] = 64.
income[V0<3] = 120.
income[3] = 150.
months[V0] = V0.
opening_cash[1] = 1000.
In case this isn't clear, the equation months[V0] = V0
means that for all V0 within the bounds of
the months table, the V0 month is V0.
That is, the first element of the months table is 1, the second
is 2, and
so on.
Similarly, the equations for
cash_at_start_of_period say that the first element of cash_at_start_of_period
is equal to opening_cash, and that each subsequent
element is the value for the previous month plus that
month's net profit. This, by the way, is a
common pattern in spreadsheets:
an initialisation formula followed by a run of formulae that
pass information from one row to the next. In effect, it's
recursion, but over space rather than time.
The run-detector also works on "raw" spreadsheet listings: those
without name substitutions. Here it is, applied again to the
cash-flow spreadsheet:
Sheet1!A12 = "Totals".
Sheet1!B1 = "Simple cash-flow spreadsheet".
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software".
Sheet1!B7 = "Month".
Sheet1!B[V1 in 8:10] = V1-7.
Sheet1!C7 = "Income".
Sheet1!C[8:9] = 120.
Sheet1!C10 = 150.
Sheet1![V0 in C:E]12 = SUM(Sheet1![V0]8:Sheet1![V0]10).
Sheet1!D7 = "Expenses".
Sheet1!D8 = 27.
Sheet1!D9 = 41.
Sheet1!D10 = 64.
Sheet1!E7 = "Net profit".
Sheet1!E[V1 in 8:10] = Sheet1!C[V1]-Sheet1!D[V1].
Sheet1!F4 = "Opening cash".
Sheet1!F5 = 1000.
Sheet1!F7 = "Cash at start of period".
Sheet1!F8 = Sheet1!F5.
Sheet1!F[V1 in 9:10] = Sheet1!F[V1-1]+Sheet1!E[V1-1].
I had to invent a
notation for variables that range over row and column numbers.
This is illustrated by the line Sheet1!B[V1 in 8:10] = V1-7.
This says that Sheet1!B8 = 1, and that
Sheet1!B9 = 2, and so on. Similarly, the
left-hand side Sheet1![V0 in C:E]12 says
that V0 ranges over column addresses
C to E. And on its right-hand side, the
subexpression Sheet1![V0]8 denotes
Sheet1!C8 or Sheet1!D8
or Sheet1!E8.
Such a listing is useful because it helps
discover runs of related formulae, and hence determine which cells
are part of the same group. For example,
it refers repeatedly to rows 8:10 and
9:10. If we didn't already know that the cells in these
rows of each column were related, the listing would be
a good hint.
That isn't actually necessary in my cash-flow example,
where the grouping is obvious by eyeballing the spreadsheet,
but it can be very useful. I once
worked on a 10,000-cell 200-sheet monster of a housing-finance spreadsheet which
contained some 60 20×40 interrelated tables. The run-detector
was invaluable in showing which formulae were related. 60 × 20×40
is 48,000 cells; I was able to reduce these to about 120 equations. I
say 120, because
each table tended to be defined by two sets of formulae: one set to initialise its
first row, and one to progress down subsequent rows, in a similar
way to the recursion mentioned above. The formulae
within each set differed only by column address, and hence the run-detector
could summarise them into one equation.
Finally, here is a spreadsheet that I've seeded with errors:
This spreadsheet looks like the one I showed earlier, but Excel is not
happy with it, which it signals with the little green triangles
in cells E9 and D12. It displays these when it thinks a
formula is out of step with those around it. To find out what's
wrong, I listed the formulae with name substitutions again; and
three errors become apparent. The value for
cash_at_start_of_period[2]
is a hard-wired constant rather than a formula.
The formula for net_profit[2] subtracts from income[1]
and not income[2], an error hard to see by inspecting
the spreadsheet because both cells have the same value. And
the SUM making up total_expenses[1] has run off the end
of its range:
Cells
total_profit[1]
total_expenses[1]
total_income[1]
Sheet1!A1:F12
cash_at_start_of_period[1:3]
net_profit[1:3]
expenses[1:3]
income[1:3]
months[1:3]
opening_cash[1]
where:
Sheet1!B1 = "Simple cash-flow spreadsheet"
Sheet1!B2 = "Demonstrates Jocelyn Ireson-Paine's reverse-engineering software"
Sheet1!F4 = "Opening cash"
Sheet1!B7 = "Month"
Sheet1!C7 = "Income"
Sheet1!D7 = "Expenses"
Sheet1!E7 = "Net profit"
Sheet1!F7 = "Cash at start of period"
Sheet1!A12 = "Totals"
cash_at_start_of_period[1] = opening_cash[1]
cash_at_start_of_period[2] = 1093 Error
cash_at_start_of_period[3] = cash_at_start_of_period[2]+net_profit[2]
expenses[1] = 27
expenses[2] = 41
expenses[3] = 64
income[1] = 120
income[2] = 120
income[3] = 150
months[1] = 1
months[2] = 2
months[3] = 3
net_profit[1] = income[1]-expenses[1]
net_profit[2] = income[1]-expenses[2] Error
net_profit[3] = income[3]-expenses[3]
opening_cash[1] = 1000
total_expenses[1] = SUM(expenses[1]:Sheet1!D11) Error
total_income[1] = SUM(income[1]:income[3])
total_profit[1] = SUM(net_profit[1]:net_profit[3])
In bigger spreadsheets, the "out-of-step" errors would also
become evident when I list by runs. The spreadsheet here doesn't
really have enough related formulae for this to happen. However,
the following extract from a by-runs listing does show that something
is amiss with net_profit:
net_profit[V0<3] = income[1]-expenses[V0].
net_profit[3] = income[3]-expenses[3].
All three kinds of error, by the way, have occurred in real
spreadsheets. I've got repeated formulae wrong
in my own spreadsheets, though I've always caught
the error before the spreadsheets went live.
I seem to remember horror stories of naïve
spreadsheeters overtyping a formula with the number
it has calculated, to "save the computer work";
and other anecdotes of the same being done to defraud.
And errors in SUM ranges are notorious. SF author and critic
David Langford once
received
only half the
royalties he was due because a SUM in his publisher's royalties spreadsheet
missed some cells.
And the European Spreadsheet Risks Interest Group has a
horror story (search for
"$38K spreadsheet error") of a hospital's Medicare
bad-debts spreadsheet whose SUM ranges were misprogrammed.
Because of this, the hospital claimed $38,240
more to compensate it for bad debts
than it should have.
Making existing spreadsheets easier to understand is one important
aspect of spreadsheet safety, and
is the main point of this posting. But it's also important to
avoid bugs
when writing new spreadsheets. I've worked on this too, and I'll
finish with a summary of the work. It's based on a language I've
developed called Excelsior, which can compile
listings like those above into spreadsheets. This means that you can
write meaningful identifiers
rather than cell addresses. Separate layout statements in the
program specify how these identifiers map to cells.
Excelsior also lets you write reuseable modules that can
be shared between different spreadsheets. In other words, code reuse.
That's something lacking from Excel, where to share code between spreadsheets,
you have to copy cells from one spreadsheet and paste into another. And then,
of course, you end up having to maintain several copies of your formulae.
The following links give more information on this
work. In How to Document
a Spreadsheet: an Exercise with Cash-Flow and Loans, I show
how to develop, in Excelsior, a spreadsheet that
represents several loans, and for each month, decides
which is best to borrow from. The algorithm is
fairly simple, but it's subtle, and it would
be tricky to get right if coding directly in Excel. Having the
formulae all available in front of you, with sensible
cell names, makes it much easier.
A much longer exercise of the same kind is
Gliders, Hasslers, and the Toadsucker:
Writing and Explaining a Structured Excel Life Game.
In it, I show how to
develop a spreadsheet that implements a well-known cellular automaton,
John Conway's game of Life. I wrote my article for maths teachers
who might want to introduce secondary-school students to Life.
Having the primary source code be an Excelsior listing, rather than
a spreadsheet, makes it much easier to show students the equations
that update the cellular automaton.
Rather different is the science-fiction
generator at Ireson-Paine Spreadsheet
Parts Repository
Excel science-fiction generator. I based this on a
jokey diagram in an old SF anthology, Gahan Wilson's
The Science Fiction Horror Movie Pocket Computer from Harry
Harrison and Brian Aldiss's
The Year's Best Science Fiction No. 5. It's a familiar
idea: a network of linked boxes where each box contains a plot event
such as "Earth falls into the Sun" or "And so everybody dies". The
links between boxes indicate which events can follow which. My spreadsheet
implements this as a transition graph, which it recurses over to
generate a story. I wrote the spreadsheet entirely in
Excelsior; the recursion would have been very tricky to
program if using only Excel.
Those examples use the idea that one can avoid bugs by
generating spreadsheets from
easily readable programs. Another way to avoid bugs is to
build spreadsheets from prefabricated components.
That's demonstrated in
Spreadsheet
Components, Google Spreadsheets, and Code Reuse,
where I apply it to Google's online spreadsheet, enabling users
to load library formulae into it in the same way they can
load charts. The use of modules is an important part
of spreadsheet safety, but much neglected.