[ Jocelyn Ireson-Paine's Home Page | Publications | Document (shortened) in Word format ]
We introduce spreadsheet algebra. This treats spreadsheets as mathematical objects and operates on expressions containing spreadsheets, giving us rules for expanding, factorising, simplifying and rearranging them. We implemented it in our spreadsheet analysis program Model Master, and demonstrate by reverse engineering one of Panko's spreadsheets, producing a concise listing of the formulae with cells renamed to meaningful identifiers derived from the labels. This is useful when error-checking. We have also used it to automatically add numerical differentiation code to a process-control spreadsheet, and to generate new spreadsheets by joining existing ones.
We take a relativist approach, believing there is no one "right" way to display spreadsheet expressions. Different tasks need different presentations. In this, we imitate conventional computer algebra, supporting the trial and error needed during reverse-engineering.
In 1953, while IBM were busy inventing Fortran, Kahrimanian and Nolan
independently wrote the first programs that could differentiate symbolically.
Fortran, seeing the world in terms of numbers, only lets you differentiate
numerically; by using arrays of the things, you can lay a grid over your
variables and get a numerical answer to their rate of change, to as fine a
degree of accuracy as you want. But what Kahrimanian and Nolan had realised was
that the process of symbolically differentiating the expression x^{3}+2x
to make 3x^{2}+2 is as much about "getting an answer" as is that of
numerically evaluating the expression (x[1]-x[0])/dt
to get 5.
Their work kicked off the development of computer algebra systems. And since
then, computer algebra
has pushed this notion of "getting an answer" ever further into mathematical
territory, capturing the behaviour of polynomials, series, tensors, ..
One of the things about such objects is that they exist in pieces. There is a single symbol for the number 5, but not for the polynomial (1+x)^{n}. The polynomial doesn.t even have a single presentation. Sometimes, we prefer to write it as 1+nx+n(n-1)x^{2}/2+..., 1+n!x/1!(n-1)!+n!x^{2}/2!(n-2)!+..., or as ∑C(n,k)x^{k}. It all depends on which particular patterns we want to make explicit and how far we are trying to squeeze the expression down to fit into the rudimentary capacity of human short-term memory. Computer algebra designers know the importance of rearranging and re-presenting expressions, and have put years of effort into algorithns for factorisation, expansion and simplification. We shall argue the need to extend this work to spreadsheets.
While computer algebra was extending the range of mathematical objects we can compute with, computer science was extending the range of computational entities we can mathematise about. Not least of these were computer programs themselves. It took computing some time to understand programs mathematically; but once it had, it realised that they too could be the objects of computer manipulation. Spreadsheets are programs; and in this paper, we describe a computer algebra system for "spreadsheet algebra", where the fundamental mathematical components are spreadsheets.
The rest of this paper is structured as follows. The next section illustrates conventional computer algebra. We then introduce "spreadsheet algebra", starting with an introduction to the idea of computer programs as mathematical entities. This is followed by a section on how spreadsheet algebra relates to our spreadsheet--analysis program Model Master and - the longest section in the paper - a demonstration of reverse engineering, generating a comprehensible listing from one of Panko's spreadsheets. This is followed by two smaller examples, in one of which we used spreadsheet algebra to insert numerical-differentiation code into a process-control spreadsheet. We finish with a section on further work.
For those unfamiliar with it, we introduce computer algebra via a short session with the Maple algebra system [Hart and Wolfe]. We shall go on to compare it with our spreadsheet algebra system. The lines ending in semicolons below are the user.s input; they are followed by indented lines showing Maple.s responses:
p := (a+b)^2; p := (a + b)^{2} q := expand(p); q := a^{2} + 2 a b + b^{2} r := factor(q); r := (a + b)^{2}
The first command assigns the expression (a+b)^2
to the Maple
"expression variable" p
. If you saw a similar statement in Fortran,
you would expect it to evaluate the expression, adding a
to
b
, squaring the result, and assigning the resulting number to
p
. However, like other computer
algebra
systems, Maple is
about computing with expressions, not just numbers. One adds, subtracts, squares
numbers; one simplifies, factorises, expands expressions. So Maple needs
variables, like p
, in which expressions can be held.
In the second command, the user passes this variable to expand
,
a Maple function that treats its argument as an expression and expands it out as
far as possible. The result is assigned to variable q
.
Finally, we call another built-in, factor
, to refactorise it,
showing that we can get back to the original expression. The key points are:
Computer algebra systems manipulate algebraic expressions without evaluating them into numbers.
You can display the same expression in different ways, for example by factorising or expanding it.
You can store expressions in variables.
You can have several versions of an expression on the go at the same time.
As we have said, programs and spreadsheets are mathematical entities, and we can do algebra with them too. In this section, we explain.
What does it mean, to do
algebra with programs?
Consider the following program, which assigns the numbers 1 to 10 to the arrays
p
and q
,
for i := 1 to 10 do p[i] := i; for i := 1 to 10 do q[i] := i;Now, this has the same effect as
for i := 1 to 10 do a[i] := i; b[i] := i;From this, it's tempting to propose a general law of programs which says
for I := L to U do S1; for I := L to U do S2;always has the same effect as
for I := L to U do S1; S2;where
S1
and S2
are arbitrary statements. Notice
how this resembles the distributive law a(b+c)=ab+ac for numbers.
Why is this useful? Well, were this law always true, it would help in optimising our code, because it would give us a rule we could follow to merge loops, thus reducing the time spent testing and incrementing loop counters.
As it happens, this law is not universally applicable: consider what would
happen if S2
were a statement that halted the program, or if
S1
and S2
both updated the same array but with
different values. Such problems arise, for example, arises with variables and
assignment when different statements find themselves updating the same piece of
store. However, that doesn.t destroy the validity of program algebra
laws; it.s just that
conventional languages are complicated and it.s hard to give simple examples
here. The topic is discussed in [Bowen; Visser et. al.].
The good news for EuSpRIG is that spreadsheets are actually simpler to handle in this way, as we shall see..
So if spreadsheets are programs, and we can do algebra on programs, we can do algebra with spreadsheets. But why bother?
The answer is that we want to achieve quality through reverse engineering. In its raw or unrefined state, a spreadsheet file is about as legible as a stretch of DNA or the inside of an amoeba. We think we can make out some parts because of the boundaries between them; others, we reveal by staining with powerful dyes. But point mutations have corrupted evolution's drag-and-drop copying; different dyes disagree on the regions they delimit; much of what we find looks suspiciously like left-over junk from previous versions; and the Author wrote no comments and won't pick up His phone. In such circumstances, it's essential to carry out trial dissections, saving useful parts in killing bottles or pinning them labelled to a board; bracketting repeated patterns in a database and comparing them for matches. We need to classify, clean, label and store parts for later reuse; to store, as well, different dissections; and to transpose component parts from one to another. Spreadsheet algebra lets us do this.
The slogan is: preserve meaning while maximising utility. In our program algebra example, utility was efficiency: the program with merged loops was faster than that without. In other contexts, it could be memory occupancy or ease of data entry . or, as in our big example later on, readability. Preserve meaning, but don.t preserve presentation if utility is better served by a different presentation.
At EuSpRIG 2001, we presented a paper on Model Master (MM), a spreadsheet design language and its compiler [Ireson-Paine, 2001]. The MM language enables spreadsheets to be described in concise mathematical notation. Meaningful identifiers can be used in place of cell names; programs can be divided into modules which can be stored in files and reused from one spreadsheet to another; cells to which the same calculation is being applied can be grouped so that the calculation appears only once; formatting commands can be used to display the same program in many different layouts.
To demonstrate, we reimplemented one of Tom Grossman's queueing simulations in MM. Our program appears at the end of [Ireson-Paine, 2001], and implements a 4-server queue. Grossman's original spreadsheet represents each server as a column in which successive rows represent successive transactions. To alter the number of servers in the simulation entails changing the number of columns, a major structural change. We were able to do this in MM just by editing one constant in our code.
MM also lets us decompile spreadsheets into MM programs, regardless of whether they were originally generated in MM. The program will lack much of the structure intended by its spreadsheet's author - for example, in an accounts spreadsheet where one column tracks income over several years, there's nothing to indicate that all these cells are "doing the same thing". However, MM can transform the program into one that maps the entire column onto one multi-cell attribute. Discovering though which cells should be grouped, how they should be named, and what their indices are, can take a lot of trial and error. Spreadsheet algebra supports this, allowing users to put a through a gamut of transformations until they find the one that best reveals its structure.
This, the main section, demonstrates
spreadsheet
algebra
for reverse
engineering. We started with a
spreadsheet
sent by Steve
Davis of Clemson University, who is evaluating MM as a debugging tool. The
spreadsheet
turned out to
have been written by Ray Panko for the work reported in "Applying code
inspection to
spreadsheet
testing" [Panko]. We show
it below in value view, omitting to save space some labels found below and to
the right of the calculation area. These all described errors - for example,
cell J28 contained the string F5;; living costs summer;; cells in sum
should be f14..f20
- and were evidently not part of the original problem.
Cash Budget | Assume: | 2% | inflation/semester | ||
Fall | Spring | Summer | |||
Cash, beginning | $1,000 | $1,360 | $673 | ||
Outflows - | School costs | $4,468 | $4,474 | $4,480 | |
Living costs | $4,172 | $4,213 | $4,485 | ||
Inflows- | Loans | 3000 | 3000 | 3000 | |
Support from home | $6,000 | $5,000 | $6,000 | ||
Cash,end | $1,360 | $673 | $980 | ||
======= | ======= | ======= | |||
School (contractual) - | Tuition | $4,115 | $4,115 | $4,115 | |
Fees | $53 | $53 | $53 | ||
$300 | 306 | 312 | |||
Living (contractual) | Monthly | Months | |||
Housing | $450 | 4 | $1,800 | $1,800 | $1,800 |
Insurance | $53 | 4 | $212 | $212 | $212 |
Living Costs (other) | |||||
Food | $330 | 4 | $1,320 | 1346 | 1373 |
Entertainment | $150 | 4 | $600 | 612 | 624 |
Transportation | $40 | 4 | $160 | $161 | 164 |
Clothing | $21 | 4 | $80 | 82 | 84 |
The following sections show some spreadsheet algebra manipulations, and should be compared with the Maple session in Section 2.
Our first step was to start the spreadsheet algebra system and load the spreadsheet:
panko := load( "Panko.slk" );
Because of the file's extension, load
assumes it to be a
spreadsheet
and
automatically decompiles it into an MM program. The":=
"assigns this
program to the variable panko
. Just as p
,
r
in the Maple session are variables which hold
algebraic expressions, so panko
is a variable which holds a
expression. And,
just as Maple automatically lists the expression resulting from the commands
typed at it, so MM displays the
spreadsheet resulting from
the load
command, listing it as an MM program:
<A1 A3 A4 A6 A8 A10 A13 A14 A15 A16 A17 A18 A19 A20 B1 B4 B5 B6 B7 B10 B11 B13 B14 B15 B17 B18 B19 B20 C13 C14 C15 C17 C18 C19 C20 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D14 D15 D17 D18 D19 D20 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 E14 E15 E17 E18 E19 E20 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F14 F15 F17 F18 F19 F20 H1 H7 I19 I20 I23 I24 I27 I32 I37 J25 J26 J28 J29 J30 J31 J33 J34 J37 L5 L7 > where A1 = "Cash Budget" B1 = " Assume:" D1 = 0.02 E1 = "inflation/semester" H1 = "Panko, R.R. Applying code inspection to spreadsheet testing, Journal of Management Information Systems, (in press?)" D2 = "Fall" E2 = "Spring" F2 = "Summer" A3 = "Cash, beginning" D3 = 1000 <etc> D9 = "=======" E9 = "=======" F9 = "=======" <etc> I27 = "4 mechanical errors in formulas with 3 opns and cell references" J28 = "F5;; living costs summer;; cells in sum should be f14..f20" <etc>The program contains a bracketed list of attribute declarations followed by the equations (some omitted to save space). As always in MM, all attributes referenced in the equations are declared. This provides redundancy for the compiler (when compiling and error-checking MM programs) and for the reader (when struggling to make sense of other people's code).
The next step was to remove the labels, as they are not part of the
calculations. We did this by calling detach_labels
, which removes
all cells containing string constants not referenced by other cells. These can
fairly be assumed to be labels:
panko_delabelled := detach_labels( panko );MM then displays the result:
<B14 B15 B17 B18 B19 B20 C14 C15 C17 C18 C19 C20 D1 D3 D4 D5 D6 D7 D8 D10 D11 D12 D14 D15 D17 D18 D19 D20 E3 E4 E5 E6 E7 E8 E10 E11 E12 E14 E15 E17 E18 E19 E20 F3 F4 F5 F6 F7 F8 F10 F11 F12 F14 F15 F17 F18 F19 F20 > where D1 = 0.02 D3 = 1000 E3 = D8 F3 = E8 D4 = SUM(D10:D12) E4 = SUM(E10:E12) F4 = SUM(F10:F12) D5 = SUM(D14:D20) E5 = SUM(E14:E20) F5 = SUM(F12:F19) D6 = 3000 E6 = 3000 F6 = 3000 D7 = ROUND(1000+D4+D5-D3-D6,-3) E7 = ROUND(1000+E4+E5-E3-E6,-3) F7 = ROUND(1000+F4+F5-F3-F6,-3) D8 = D3-D4-D5+D6+D7 E8 = E3-E4-E5+E6+E7 F8 = F3-F4-E5+F6+F7 D10 = 4115 <etc>Note that the labels have been removed. Note also that we now have two versions of the spreadsheet on the go simultaneously, just as our Maple session had several versions of an expression
When MM loads a spreadsheet, it gives all attributes the same name as the cell they came from. We wanted to rename them to make the code easier to understand. Inspection suggested the labels in columns A and B might make meaningful names for the cells in columns D, E and F, so to test this, we sliced columns A and B out of the labels and then joined them:
labels := panko without panko_delabelled; col_A := by_row( slice( labels, A1:A20 ) ); col_B := by_row( slice( labels, B1:B20 ) ); cols_AB := zip( col_A, col_B, append ); cols_AB_cleaned := zip( cols_AB, mangle );
Here, the first line subtracts the
spreadsheet
without labels,
built in Stage
2, from the entire
spreadsheet, giving just the
labels. The next two lines extract columns A and B (up to row 20, the final one
of importance). The by_row
function modifies slice
's
result to be a function from row number to label rather than from cell address
(i.e. row and column) to label.
Next, in the fourth command, we zip together corresponding rows of columns A
and B. We use MM's zip
function, which runs down the columns passed
in the first two arguments, taking corresponding cells from each and applying
the third argument, itself a function, to them. Here, this function is
append
: it joins or appends two strings. So the effect is to append
the A and B labels of each row.
We then had to clean up the results. The problem is that we want to use the
labels as identifiers. As in most languages, these can contain letters and
digits, but not such things as minus signs, brackets, and spaces. So we ran down
the joined names, passing each through the mangle
function. This
"mangles" each name (the word is commonly used by compiler writers) to remove or
replace unwanted characters. Thus the labels Cash Budget
and
Assume
become Cash_BudgetAssume
, for example. The next
thing is to pair the cell names of column D with the new identifiers that
replace them:
renamings := make_map( by_row(col_D), cols_AB_cleaned );giving us
D3 > Cash_beginning D4 > Outflows_School_costs D5 > Living_costs D6 > Inflows_Loans D7 > Support_from_home D8 > Cash_end D10 > School_contractual_Tuition D11 > Fees D14 > Housing D15 > Insurance D17 > Food D18 > Entertainment D19 > Transportation D20 > Clothingwhere the listing uses > to mean "maps to". So what this listing shows is meaningful names suggested for the cells in column D.
Columns D, E and F are obviously related, reporting the same quantities for the three school terms autumn ("fall"), spring and summer. A feature of MM - indeed, a key point - is that an attribute can name not just one cell as in the examples so far, but an entire collection of cells. So for example, assume we have a spreadsheet where column A represents the income of some company, with rows 1 (for 1990) up to 14 (for 2003). We could declare this in MM as
<Income : [1990..2003]>and write references such as
Income[1995]
in the equations.
In the previous paragraph, the multi-cell attribute Income
had
numeric indices. MM also allows symbolic indices akin to Pascal's enumerated
types. We use these when building tables whose cells do not have a natural
ordering; for example the branches of a department store or the colours a dress
can be bought in. Thus we might have a business with branches in Oxford,
Cambridge and London. To store the income for each branch at one particular
time, we would declare:
<Income : {Oxford,Cambridge,London}>while to hold income for these branches from 1990 onwards, the attribute would gain a dimension, becoming:
<Income : {Oxford,Cambridge,London} * [1990..2003]>
We now show how we applied multi-cell attributes and enumerated types to our
running example. Since our renaming of the cells in column D made sense, we
decided to do the same to columns E and F. But we also decided the three cells
D1, E1 and F1 should be treated as part of one attribute, indexed by the
symbolic indices fall
, spring
and summer
.
Thus any reference to D1 would be replaced by
Cash_BudgetAssume[fall]
; reference to E1 by
Cash_BudgetAssume[spring]
; and so on. We did the same for all the
other attributes in these three columns.
cols_DEF := slice( panko_delabelled, D1:F20 ); cols_DEF_merged := merge( cols_DEF, cols_AB_cleaned, D:F, [fall,spring,summer] );which gave:
<B14 B15 B17 B18 B19 C14 C15 C17 C18 C19 C20 Cash_BudgetAssume[fall,spring,summer] Cash_beginning[fall,spring,summer] Cash_end[fall,spring,summer] Clothing[fall,spring,summer] D12 E12 Entertainment[fall,spring,summer] F12 Fees[fall,spring,summer] Food[fall,spring,summer] Housing[fall,spring,summer] Inflows_Loans[fall,spring,summer] Insurance[fall,spring,summer] Living_costs[fall,spring,summer] Outflows_School_costs[fall,spring,summer] School_contractual_Tuition[fall,spring,summer] Support_from_home[fall,spring,summer] Transportation[fall,spring,summer] > where Cash_BudgetAssume[fall] = 0.02 Cash_beginning[fall] = 1000 Cash_beginning[spring] = Cash_end[fall] Cash_beginning[summer] = Cash_end[spring] Cash_end[fall] = Cash_beginning[fall]-Outflows_School_costs[fall]- Living_costs[fall]+Inflows_Loans[fall]+Support_from_home[fall] Cash_end[spring] = Cash_beginning[spring]-Outflows_School_costs[spring]- Living_costs[spring]+Inflows_Loans[spring]+Support_from_home[spring] Cash_end[summer] = Cash_beginning[summer]-Outflows_School_costs[summer]- Living_costs[spring]+Inflows_Loans[summer]+Support_from_home[summer] Clothing[fall] = C20*20 Clothing[spring] = ROUND(Clothing[fall]*(1+Cash_BudgetAssume[fall]),0) Clothing[summer] = ROUND(Clothing[spring]*(1+Cash_BudgetAssume[fall]),0) D12 = 300 E12 = ROUND(D12*(1+Cash_BudgetAssume[fall]),0) Entertainment[fall] = C18*B18 Entertainment[spring] = ROUND(Entertainment[fall]*(1+Cash_BudgetAssume[fall]),0) Entertainment[summer] = ROUND(Entertainment[spring]*(1+Cash_BudgetAssume[fall]),0) F12 = ROUND(E12*(1+Cash_BudgetAssume[fall]),0) Fees[fall] = 53 Fees[spring] = 53 Fees[summer] = 53 Food[fall] = C17*B17 Food[spring] = ROUND(Food[fall]*(1+Cash_BudgetAssume[fall]),0) Food[summer] = ROUND(Food[spring]*(1+Cash_BudgetAssume[fall]),0) Housing[fall] = C14*B14 Housing[spring] = Housing[fall] Housing[summer] = Housing[fall] Inflows_Loans[fall] = 3000 Inflows_Loans[spring] = 3000 Inflows_Loans[summer] = 3000 Insurance[fall] = C15*B15 Insurance[spring] = Insurance[fall] Insurance[summer] = Insurance[fall] Living_costs[fall] = SUM(Housing[fall]:Clothing[fall]) Living_costs[spring] = SUM(Housing[spring]:Clothing[spring]) Living_costs[summer] = SUM(F12:Transportation[summer]) Outflows_School_costs[fall] = SUM(School_contractual_Tuition[fall]:D12) Outflows_School_costs[spring] = SUM(School_contractual_Tuition[spring]:E12) Outflows_School_costs[summer] = SUM(School_contractual_Tuition[summer]:F12) School_contractual_Tuition[fall] = 4115 School_contractual_Tuition[spring] = 4115 School_contractual_Tuition[summer] = 4115 Support_from_home[fall] = ROUND(1000+Outflows_School_costs[fall]+ Living_costs[fall]-Cash_beginning[fall]-Inflows_Loans[fall],-3) Support_from_home[spring] = ROUND(1000+Outflows_School_costs[spring]+ Living_costs[spring]-Cash_beginning[spring]-Inflows_Loans[spring],-3) Support_from_home[summer] = ROUND(1000+Outflows_School_costs[summer]+ Living_costs[summer]-Cash_beginning[summer]-Inflows_Loans[summer],-3) Transportation[fall] = C19*B19 Transportation[spring] = ROUND(Transportation[fall]+1+Cash_BudgetAssume[fall],0) Transportation[summer] = ROUND(Transportation[spring]*(1+Cash_BudgetAssume[fall]),0)Not bad, is it?
Recall that we made the new attribute names by concatenating labels from
corresponding rows in columns A and B. That isn't actually quite good enough,
because the column A labels sometimes qualify more than one cell in column B. A4
- Outflows
- appears to qualify both B4 and B5, so we should
have prefixed it to both. Even so, our transformation has worked pretty
well.
In developing MM, we want to make spreadsheets easier to read, and, of course, to check. While the main point of this particular paper is to introduce spreadsheet algebra, we ought also to say something about error detection. The spreadsheet in our example originated, as we mentioned, in Panko's work on detecting errors via code inspection, for which he seeded it with eight errors. Of these, four were in columns D, E and F:
E19=ROUND(D19+(1+$D$1),0)
. D19+
should be
D19*
.
F3=F4-F5+E6+E7
. Should be F3=F4-E5+E6+E7
.
D20=C20*20
. Should be C20*B20
.
F5=SUM(F12..F19)
. Should be SUM(F14..F20)
.
These formulae correspond to the equations for Transportation[spring]
,
Cash_end[summer]
,
Clothing[fall]
and Living_costs[summer]
in Stage 5
of the example. The first error is relatively easy to spot by inspection, helped
by the way the listing groups together the elements of an attribute and makes it
easy to compare the spring and summer equations. Likewise, the second error
stands out because the right hand side refers to
Living_costs[spring]
and not Living_costs[summer]
. The
third error, C20*20
is not obvious (though it might be had we
bothered to work out out a good replacement name for C20
), but the
fourth error stands out because the attribute F12
in the right hand
side of Living_costs[summer]
should, by comparison with the other
elements of Living_costs
, be Housing[summer]
.
Boolean algebra,
linear algebra, and of
course the ordinary kind of
algebra that operates on
numbers, all posess primitive operators such as set union, addition, or matrix
multiplication, and fundamental laws that define how these behave. Boolean
algebra
has De Morgan's law;
numerical
algebra
has the
distributive law and others. So far, we haven't seen such things in
spreadsheet
algebra, where the functions
have been ad-hocced up for convenience in reverse engineering:
rename
, merge
, slice
,
detach_labels
. We did, however, invent some, and the others are
based on them. They include:
spreadsheet addition, which joins together two spreadsheets or MM programs, concatenating their attributes and equations:
(<a,b,c> where a=b+c) plus (<d,e,f> where d=e*f) <a b c d e f > where a = b+c d = e*f
spreadsheet extension, an analogue to drag-and-drop copying, which extends attributes and calculations over a range:
(< a b c > where a=b+c) over [1..10] <a[1..10] b[1..10] c[1..10] > where a[all I] = b[I]+c[I]
spreadsheet division, which "divides" a set of identical attributes by their range:
<a[1..10] b[1..10] c[1..10> / [1..10] <a b c>
This is how we use our operators to combine calculations and data from several spreadsheets into one:
a := load("Company.slk"); <A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 > where C1 = A1+B1 C2 = A2+B2 <etc> b := merge( a, "Income", A1:A14, [1990..2003 ] ); b := merge( b, "Outgoings", B1:B14, [1990..2003 ] ); b := merge( b, "Profit", C1:C14, [1990..2003 ] ); b := roll( b, "Income", $+$ ); b := roll( b, "Outgoings", $+$ ); b := roll( b, "Profit", $+$ ); <Income[1990..2003] Outgoings[1990..2003] Profit[1990..2003] > where Profit[all T] = Income[T]-Outgoings[T] c := (b plus < MonthlyProfit: [1990..2003] > where MonthlyProfit[all Y] = Profit[Y] / 12 ); <Income[1990..2003] MonthlyProfit[1990..2003] Outgoings[1990..2003] Profit[1990..2003] > where Profit[all T] = Income[T]-Outgoings[T] MonthlyProfit[all Y] = Profit[Y]/12Here, we first renamed and merged the cells in the input spreadsheet, and then added a
MonthlyProfit
attribute. The main point here is how we added the
new attribute, something that would entail inserting a new column and
cell-by-cell formulae if done directly in Excel. We did have to go to some
effort to do the renaming and merging (i.e. to make the structure of the input
spreadsheet
explicit),
but we plan various ways to simplify the process: by using formats, via
user-defined
spreadsheet
algebra
functions
which, for example, would package up the above sequence of merge
s
and roll
s into one call; by automating equation rolling. Also, the
user can always store the
spreadsheet
after it has
been converted to an MM program, and start from that each time.
We next show a more complicated example where we load and combine three spreadsheets:
a := load("company.slk"); <etc> <Income[1990..2003] Outgoings[1990..2003] Profit[1990..2003] > where Profit[all T] = Income[T]-Outgoings[T] b := load("OxfordData.slk"); <etc> <Income[1990..2003] > where Income = [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ] Outgoings = [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ] c := load("CambridgeData.slk"); <etc> <Income[1990..2003] > where Income = [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ] Outgoings = [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ] d := (a on b) plus (a on c); <Income_1[1990..2003] Income_2[1990..2003] Outgoings_1[1990..2003] Outgoings_2[1990..2003] Profit_1[1990..2003] Profit_2[1990..2003] > where Income_1 = [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ] Income_2 = [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ] Outgoings_1 = [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ] Outgoings_2 = [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ] Profit_1[all T] = Income_1[T]-Outgoings_1[T] Profit_2[all T] = Income_2[T]-Outgoings_2[T]In this example, we loaded another income-outgoings-profit spreadsheet and reformatted it (with similar commands as in the previous example, here omitted to save space). We then loaded two two-column data spreadsheets, one for the Oxford branch of the business and one for the Cambridge branch. Then we glued all three together with the expression
(a on b) plus (a on c)
. Each
on
superimposes attributes of the same name. So the first one
copies the Oxford data into the income-outgoings-profit
spreadsheet; the second
copies the Cambridge data into a new copy of that
spreadsheet. The net result
is a
spreadsheet
with two
copies of the calculations. With a little more effort, we can merge the
attributes to make the branch an explicit index, similar to the season in our
first example: <Income[1990..2003]*{Oxford,Cambridge} Outgoings[1990..2003]*{Oxford,Cambridge} Profit[1990..2003]*{Oxford,Cambridge} > where Income = [ [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ], [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ] ] Outgoings = [ [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ], [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ] ] Profit[all T,all B] = Income[T,B]-Outgoings[T,B]
Incidentally, we can also generate spreadsheet templates where the attributes' bounds (such as 1990 and 2003) are not wired in as above but passed as parameters. This process of generalisation enables us to produce a module that can be glued into any spreadsheet, regardless of the size of the row or column we are attaching it to.
As a final example, we show an application inspired by work with Jack Ponton
of Edinburgh University's Chemical Engineering Department. Suppose we have a
spreadsheet that
calculates
the concentration x
of some chemical at various timepoints, and we
want also to see the rate of change of concentration, x_dot
. If we
assume the difference between timepoints is always the same - 1000, say - we can
approximate the rate of change by just dividing this into the successive
differences in concentration. To do this with MM simply entailed loading the
spreadsheet (which, to save
space in the listing, is just one column into which the concentrations can be
typed as data), and adding to it a new attribute x_dot
and the
difference code:
a := load( "Ponton.xls" ); <A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 > b := merge( a, "x", A1:A10, [1..10] ); <x[1..10] > c := (b plus <x_dot[2..10]>) where x_dot[all T] = (x[T] - x[T-1]) / 1000; <x[1..10] x_dot[2..10] > where x_dot[all T] = (x[T]-x[T-1])/1000
We can do more ambitious transformations in the same way, for example adding various schemes for numerical integration. Since we can store these in libraries and parameterise them to fit any range of attribute, spreadsheet algebra gives us an extremely powerful generic way to add them to almost any spreadsheet we could want.
Listings much bigger than our example would become tedious to inspect, and there is always a risk of missing errors even in small listings. Can we automatically compares formulae and flags a warning when they differ? Markus Clermont reported his research into this at EuSpRIG 2002; dissertation, [Clermont]. His program "stains" the raw spreadsheet, proposing sets of cells that, in our terminology, might be grouped to become elements of a single attribute. Some cells get proposed because they contain similar formulae, but he uses other means too, for example proposing that cells mentioned in a cell range should be grouped. In our example, this would be like saying that D14 to D20 should be considered for grouping, as should D10 to D12. In MM, we have independently experimented with something similar, using unification and inverse unification as a way to detect related formulae. [Our next paper in this series, on structure discovery and logic programming, takes this further.]
We believe there are other methods worth investigating. Discrete Fourier transforms might help discover repeating patterns of related but non-contiguous cells like these:
XYZ XYZ XYZor
XYZ XYZ XYZIt might be worth looking in general into the perceptual techniques developed in computer vision, and certainly at blackboard systems and other methods that artificial intelligence has developed for fusing knowledge from different sources.
Another possibility is searching with parallel terraced scans, a technique Hofstadter's Fluid Analogy Research Group have used in solving analogy problems [Hofstadter,Hofstadter (2)], and one we believe promising for resolving tensions between different ways to view a problem. With spreadsheets, one such tension is that between lumping and splitting: grouping too coarsely (the extreme being to cover the entire worksheet with just one attribute) versus dividing too finely (the extreme being to split the spreadsheet into one attribute for every cell).
Perhaps the most promising technique is inductive logic programming [ILP]. This learns, given logical descriptions of examples of a concept, general rules describing the concept itself. Unlike neural nets and other machine learning techniques, the learnt concepts are easy to translate into English and to use in programs. ILP has some very convincing industrial applications - drug companies use it to learn the relation between biological activity and 3D structure in vast collections of molecules - and we believe it could be invalauable in learning libraries of common patterns. This is something we are experimenting with.
There are so many possible structure-discovery techniques that we are unlikely to find them all in the same program. This means we need a standard representation of patterns, errors and other information, and a way to share it between tools. What kind of information should be exchanged? We suggest:
High-level specification of what a spreadsheet, or regions therein, is doing.
Semantically related regions and the reasons for believing them to be related, together with a description of how they map onto the worksheet surface.
Names, derived from labels, proposed for regions. Inductive logic programming will be useful for learning from example where authorss tend to put labels in relation to the cells they name.
Errors, such as a cell where an equation "ought" to resemble those around it but doesn't.
Proposed corrections for errors.
Is our algebra language appropriate? We had to give the user a means of performing element-by-element manipulation of rows and columns, as when appending the cells in columns A and B to make attribute names. We provided a functional language, partly because we liked it and partly because it mapped cleanly onto our implementation language Kawa, a dialect of Scheme. However, this requires expertise the average user will certainly lack.
It is worth asking whether a graphical interface might help - something along the lines of MINOS: Steps Towards a Syntax-free Computer Algebra Program [Tintarev, 1999], where the author adopts a desktop metaphor, with different icons on the desktop representing different expressions to be manipulated.
We are working on algorithms by which a user can compile MM code to a spreadsheet, change the compiled spreadsheet, then calculate the equivalent change to the MM program. This would be useful if you develop a spreadsheet in MM and then want to change cell styles and formats from Excel, having the changes reflected back to your MM program. Doing so involves some reverse-engineering of the kind described earlier, but under constraints that the new MM program constructed by decompiling the modified spreadsheet resemble the original as much as possible. In general, this is not easy because changes to the spreadsheet may cut across parts of the MM program, rather as though we were to translate a sentence into French, throw a random eror into it, and then try to translate the result back into an English sentence with the "same" error. We are experimenting with an appropriate formalisation of the notion of "part", so that if we change something in the compiled spreadsheet, we can find the minimal enclosing entity in the MM code that it is a part of, and adjust the code accordingly. This may be a new approach to reverse engineering.
Jonathan Bowen. Recent Developments in the Algebra of Programs, http://vmoc.museophile.com/algebra/section3_6.html, in A Brief History of Algebra and Computing: An Eclectic Oxonian View, http://vmoc.museophile.com/algebra/algebra.html.
Markus Clermont. A Scalable Approach to Spreadsheet Visualisation, http://www.ifi.uni-klu.ac.at/Publications/Dissertations/pubfiles/pdffiles/2003-0175-MC.pdf, Dissertation, Fakultät für Wirtschaftswissenschaften und Informatik, Universität Klagenfurt.
Thomas Grossman. Spreadsheet Queueing Simulation Templates, http://www.ucalgary.ca/%7Egrossman/simulation/index.html.
Dave Hart and Clinton Wolfe. Basic Algebra in Maple, http://www.indiana.edu/~statmath/math/maple/index.html.
Douglas Hofstadter. Analogies and Roles in Human and Machine Thinking, in Metamagical Themas: Questing for the Essence of Mind and Pattern, Basic Books, 1996.
Douglas Hofstadter. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought, Penguin, 1998.
Inductive Logic Programming, http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/ilp.html.
Jocelyn Ireson-Paine. Ensuring Spreadsheet Integrity with Model Master, http://www.j-paine.org/eusprig2001_as_html/eusprig2001.html, Proceedings of EuSpRIG 2001, Amsterdam 5-6 July 2001.
Raymond R. Panko. Applying Code Inspection to Spreadsheet Testing, http://panko.cba.hawaii.edu/papers/ss/2phase.htm.
Kyril Tintarev. MINOS: Steps Towards a Syntax-free Computer Algebra Program, http://science.uniserve.edu.au/pubs/callab/vol3/tintarev.html, CAL-laborate, Volume 3, October 1999.
Eelco Visser et. al. The Program Transformation Wiki, http://www.program-transformation.org/twiki/bin/view/Transform.
[ Jocelyn Ireson-Paine's Home Page | Publications | Document (shortened) in Word format ]