[ Jocelyn Ireson-Paine's Home Page | Publications | Document (shortened) in Word format ]
We introduce spreadsheet algebra. This treats spreadsheets as mathematical objects and operates on expressions containing spreadsheets, giving us rules for expanding, factorising, simplifying and rearranging them. We implemented it in our spreadsheet analysis program Model Master, and demonstrate by reverse engineering one of Panko's spreadsheets, producing a concise listing of the formulae with cells renamed to meaningful identifiers derived from the labels. This is useful when error-checking. We have also used it to automatically add numerical differentiation code to a process-control spreadsheet, and to generate new spreadsheets by joining existing ones.
We take a relativist approach, believing there is no one "right" way to display spreadsheet expressions. Different tasks need different presentations. In this, we imitate conventional computer algebra, supporting the trial and error needed during reverse-engineering.
In 1953, while IBM were busy inventing Fortran, Kahrimanian and Nolan
independently wrote the first programs that could differentiate symbolically.
Fortran, seeing the world in terms of numbers, only lets you differentiate
numerically; by using arrays of the things, you can lay a grid over your
variables and get a numerical answer to their rate of change, to as fine a
degree of accuracy as you want. But what Kahrimanian and Nolan had realised was
that the process of symbolically differentiating the expression x3+2x
to make 3x2+2 is as much about "getting an answer" as is that of
numerically evaluating the expression (x[1]-x[0])/dt
to get 5.
Their work kicked off the development of computer algebra systems. And since
then, computer algebra
has pushed this notion of "getting an answer" ever further into mathematical
territory, capturing the behaviour of polynomials, series, tensors, ..
One of the things about such objects is that they exist in pieces. There is a single symbol for the number 5, but not for the polynomial (1+x)n. The polynomial doesn.t even have a single presentation. Sometimes, we prefer to write it as 1+nx+n(n-1)x2/2+..., 1+n!x/1!(n-1)!+n!x2/2!(n-2)!+..., or as ∑C(n,k)xk. It all depends on which particular patterns we want to make explicit and how far we are trying to squeeze the expression down to fit into the rudimentary capacity of human short-term memory. Computer algebra designers know the importance of rearranging and re-presenting expressions, and have put years of effort into algorithns for factorisation, expansion and simplification. We shall argue the need to extend this work to spreadsheets.
While computer algebra was extending the range of mathematical objects we can compute with, computer science was extending the range of computational entities we can mathematise about. Not least of these were computer programs themselves. It took computing some time to understand programs mathematically; but once it had, it realised that they too could be the objects of computer manipulation. Spreadsheets are programs; and in this paper, we describe a computer algebra system for "spreadsheet algebra", where the fundamental mathematical components are spreadsheets.
The rest of this paper is structured as follows. The next section illustrates conventional computer algebra. We then introduce "spreadsheet algebra", starting with an introduction to the idea of computer programs as mathematical entities. This is followed by a section on how spreadsheet algebra relates to our spreadsheet--analysis program Model Master and - the longest section in the paper - a demonstration of reverse engineering, generating a comprehensible listing from one of Panko's spreadsheets. This is followed by two smaller examples, in one of which we used spreadsheet algebra to insert numerical-differentiation code into a process-control spreadsheet. We finish with a section on further work.
For those unfamiliar with it, we introduce computer algebra via a short session with the Maple algebra system [Hart and Wolfe]. We shall go on to compare it with our spreadsheet algebra system. The lines ending in semicolons below are the user.s input; they are followed by indented lines showing Maple.s responses:
p := (a+b)^2; p := (a + b)2 q := expand(p); q := a2 + 2 a b + b2 r := factor(q); r := (a + b)2
The first command assigns the expression (a+b)^2
to the Maple
"expression variable" p
. If you saw a similar statement in Fortran,
you would expect it to evaluate the expression, adding a
to
b
, squaring the result, and assigning the resulting number to
p
. However, like other computer
algebra
systems, Maple is
about computing with expressions, not just numbers. One adds, subtracts, squares
numbers; one simplifies, factorises, expands expressions. So Maple needs
variables, like p
, in which expressions can be held.
In the second command, the user passes this variable to expand
,
a Maple function that treats its argument as an expression and expands it out as
far as possible. The result is assigned to variable q
.
Finally, we call another built-in, factor
, to refactorise it,
showing that we can get back to the original expression. The key points are:
Computer algebra systems manipulate algebraic expressions without evaluating them into numbers.
You can display the same expression in different ways, for example by factorising or expanding it.
You can store expressions in variables.
You can have several versions of an expression on the go at the same time.
As we have said, programs and spreadsheets are mathematical entities, and we can do algebra with them too. In this section, we explain.
What does it mean, to do
algebra with programs?
Consider the following program, which assigns the numbers 1 to 10 to the arrays
p
and q
,
for i := 1 to 10 do p[i] := i; for i := 1 to 10 do q[i] := i;Now, this has the same effect as
for i := 1 to 10 do a[i] := i; b[i] := i;From this, it's tempting to propose a general law of programs which says
for I := L to U do S1; for I := L to U do S2;always has the same effect as
for I := L to U do S1; S2;where
S1
and S2
are arbitrary statements. Notice
how this resembles the distributive law a(b+c)=ab+ac for numbers.
Why is this useful? Well, were this law always true, it would help in optimising our code, because it would give us a rule we could follow to merge loops, thus reducing the time spent testing and incrementing loop counters.
As it happens, this law is not universally applicable: consider what would
happen if S2
were a statement that halted the program, or if
S1
and S2
both updated the same array but with
different values. Such problems arise, for example, arises with variables and
assignment when different statements find themselves updating the same piece of
store. However, that doesn.t destroy the validity of program algebra
laws; it.s just that
conventional languages are complicated and it.s hard to give simple examples
here. The topic is discussed in [Bowen; Visser et. al.].
The good news for EuSpRIG is that spreadsheets are actually simpler to handle in this way, as we shall see..
So if spreadsheets are programs, and we can do algebra on programs, we can do algebra with spreadsheets. But why bother?
The answer is that we want to achieve quality through reverse engineering. In its raw or unrefined state, a spreadsheet file is about as legible as a stretch of DNA or the inside of an amoeba. We think we can make out some parts because of the boundaries between them; others, we reveal by staining with powerful dyes. But point mutations have corrupted evolution's drag-and-drop copying; different dyes disagree on the regions they delimit; much of what we find looks suspiciously like left-over junk from previous versions; and the Author wrote no comments and won't pick up His phone. In such circumstances, it's essential to carry out trial dissections, saving useful parts in killing bottles or pinning them labelled to a board; bracketting repeated patterns in a database and comparing them for matches. We need to classify, clean, label and store parts for later reuse; to store, as well, different dissections; and to transpose component parts from one to another. Spreadsheet algebra lets us do this.
The slogan is: preserve meaning while maximising utility. In our program algebra example, utility was efficiency: the program with merged loops was faster than that without. In other contexts, it could be memory occupancy or ease of data entry . or, as in our big example later on, readability. Preserve meaning, but don.t preserve presentation if utility is better served by a different presentation.
At EuSpRIG 2001, we presented a paper on Model Master (MM), a spreadsheet design language and its compiler [Ireson-Paine, 2001]. The MM language enables spreadsheets to be described in concise mathematical notation. Meaningful identifiers can be used in place of cell names; programs can be divided into modules which can be stored in files and reused from one spreadsheet to another; cells to which the same calculation is being applied can be grouped so that the calculation appears only once; formatting commands can be used to display the same program in many different layouts.
To demonstrate, we reimplemented one of Tom Grossman's queueing simulations in MM. Our program appears at the end of [Ireson-Paine, 2001], and implements a 4-server queue. Grossman's original spreadsheet represents each server as a column in which successive rows represent successive transactions. To alter the number of servers in the simulation entails changing the number of columns, a major structural change. We were able to do this in MM just by editing one constant in our code.
MM also lets us decompile spreadsheets into MM programs, regardless of whether they were originally generated in MM. The program will lack much of the structure intended by its spreadsheet's author - for example, in an accounts spreadsheet where one column tracks income over several years, there's nothing to indicate that all these cells are "doing the same thing". However, MM can transform the program into one that maps the entire column onto one multi-cell attribute. Discovering though which cells should be grouped, how they should be named, and what their indices are, can take a lot of trial and error. Spreadsheet algebra supports this, allowing users to put a through a gamut of transformations until they find the one that best reveals its structure.
This, the main section, demonstrates
spreadsheet
algebra
for reverse
engineering. We started with a
spreadsheet
sent by Steve
Davis of Clemson University, who is evaluating MM as a debugging tool. The
spreadsheet
turned out to
have been written by Ray Panko for the work reported in "Applying code
inspection to
spreadsheet
testing" [Panko]. We show
it below in value view, omitting to save space some labels found below and to
the right of the calculation area. These all described errors - for example,
cell J28 contained the string F5;; living costs summer;; cells in sum
should be f14..f20
- and were evidently not part of the original problem.
Cash Budget | Assume: | 2% | inflation/semester | ||
Fall | Spring | Summer | |||
Cash, beginning | $1,000 | $1,360 | $673 | ||
Outflows - | School costs | $4,468 | $4,474 | $4,480 | |
Living costs | $4,172 | $4,213 | $4,485 | ||
Inflows- | Loans | 3000 | 3000 | 3000 | |
Support from home | $6,000 | $5,000 | $6,000 | ||
Cash,end | $1,360 | $673 | $980 | ||
======= | ======= | ======= | |||
School (contractual) - | Tuition | $4,115 | $4,115 | $4,115 | |
Fees | $53 | $53 | $53 | ||
$300 | 306 | 312 | |||
Living (contractual) | Monthly | Months | |||
Housing | $450 | 4 | $1,800 | $1,800 | $1,800 |
Insurance | $53 | 4 | $212 | $212 | $212 |
Living Costs (other) | |||||
Food | $330 | 4 | $1,320 | 1346 | 1373 |
Entertainment | $150 | 4 | $600 | 612 | 624 |
Transportation | $40 | 4 | $160 | $161 | 164 |
Clothing | $21 | 4 | $80 | 82 | 84 |
The following sections show some spreadsheet algebra manipulations, and should be compared with the Maple session in Section 2.
Our first step was to start the spreadsheet algebra system and load the spreadsheet:
panko := load( "Panko.slk" );
Because of the file's extension, The next step was to remove the labels, as they are not part of the
calculations. We did this by calling When MM loads a
spreadsheet, it gives all
attributes the same name as the cell they came from. We wanted to rename them to
make the code easier to understand. Inspection suggested the labels in columns A
and B might make meaningful names for the cells in columns D, E and F, so to
test this, we sliced columns A and B out of the labels and then joined them: Here, the first line subtracts the
spreadsheet
without labels,
built in Stage
2, from the entire
spreadsheet, giving just the
labels. The next two lines extract columns A and B (up to row 20, the final one
of importance). The Next, in the fourth command, we zip together corresponding rows of columns A
and B. We use MM's We then had to clean up the results. The problem is that we want to use the
labels as identifiers. As in most languages, these can contain letters and
digits, but not such things as minus signs, brackets, and spaces. So we ran down
the joined names, passing each through the Columns D, E and F are obviously related, reporting the same quantities for
the three school terms autumn ("fall"), spring and summer. A feature of MM -
indeed, a key point - is that an attribute can name not just one cell as in the
examples so far, but an entire collection of cells. So for example, assume we
have a
spreadsheet
where
column A represents the income of some company, with rows 1 (for 1990) up to 14
(for 2003). We could declare this in MM as In the previous paragraph, the multi-cell attribute We now show how we applied multi-cell attributes and enumerated types to our
running example. Since our renaming of the cells in column D made sense, we
decided to do the same to columns E and F. But we also decided the three cells
D1, E1 and F1 should be treated as part of one attribute, indexed by the
symbolic indices Recall that we made the new attribute names by concatenating labels from
corresponding rows in columns A and B. That isn't actually quite good enough,
because the column A labels sometimes qualify more than one cell in column B. A4
- In developing MM, we want to make spreadsheets easier to read, and, of
course, to check. While the main point of this particular paper is to introduce
spreadsheet
algebra, we ought also to
say something about error detection. The
spreadsheet
in our example
originated, as we mentioned, in Panko's work on detecting errors via code
inspection, for which he seeded it with eight errors. Of these, four were in
columns D, E and F:
These formulae correspond to the equations for Boolean algebra,
linear algebra, and of
course the ordinary kind of
algebra that operates on
numbers, all posess primitive operators such as set union, addition, or matrix
multiplication, and fundamental laws that define how these behave. Boolean
algebra
has De Morgan's law;
numerical
algebra
has the
distributive law and others. So far, we haven't seen such things in
spreadsheet
algebra, where the functions
have been ad-hocced up for convenience in reverse engineering:
spreadsheet
addition, which joins together two spreadsheets or MM programs, concatenating
their attributes and equations: spreadsheet
extension, an analogue to drag-and-drop copying, which extends attributes and
calculations over a range: spreadsheet division, which "divides" a set of identical attributes
by their range: This is how we use our operators to combine calculations and data from
several spreadsheets into one: We next show a more complicated example where we load and combine three
spreadsheets: Incidentally, we can also generate
spreadsheet
templates where
the attributes' bounds (such as 1990 and 2003) are not wired in as above but
passed as parameters. This process of generalisation enables us to produce a
module that can be glued into any
spreadsheet, regardless of
the size of the row or column we are attaching it to. As a final example, we show an application inspired by work with Jack Ponton
of Edinburgh University's Chemical Engineering Department. Suppose we have a
spreadsheet that
calculates
the concentration We can do more ambitious transformations in the same way, for example adding
various schemes for numerical integration. Since we can store these in libraries
and parameterise them to fit any range of attribute,
spreadsheet
algebra gives us an
extremely powerful generic way to add them to almost any
spreadsheet we could want.
Listings much bigger than our example would become tedious to inspect, and
there is always a risk of missing errors even in small listings. Can we
automatically compares formulae and flags a warning when they differ? Markus
Clermont reported his research into this at EuSpRIG 2002; dissertation,
[Clermont].
His program "stains" the raw
spreadsheet, proposing sets
of cells that, in our terminology, might be grouped to become elements of a
single attribute. Some cells get proposed because they contain similar formulae,
but he uses other means too, for example proposing that cells mentioned in a
cell range should be grouped. In our example, this would be like saying that D14
to D20 should be considered for grouping, as should D10 to D12. In MM, we have
independently experimented with something similar, using unification and inverse
unification as a way to detect related formulae. [Our next paper in this series,
on structure discovery and logic programming, takes this further.] We believe there are other methods worth investigating. Discrete Fourier
transforms might help discover repeating patterns of related but non-contiguous
cells like these: Another possibility is searching with parallel terraced scans, a technique
Hofstadter's Fluid Analogy Research Group have used in solving analogy problems
[Hofstadter,Hofstadter
(2)], and one we believe promising for resolving tensions between different
ways to view a problem. With spreadsheets, one such tension is that between
lumping and splitting: grouping too coarsely (the extreme being to cover the
entire worksheet with just one attribute) versus dividing too finely (the
extreme being to split the
spreadsheet into one
attribute for every cell). Perhaps the most promising technique is inductive
logic programming [ILP]. This
learns, given logical descriptions of examples of a concept, general rules
describing the concept itself. Unlike neural nets and other machine learning
techniques, the learnt concepts are easy to translate into English and to use in
programs. ILP has some very convincing industrial applications - drug companies
use it to learn the relation between biological activity and 3D structure in
vast collections of molecules - and we believe it could be invalauable in
learning libraries of common patterns. This is something we are experimenting
with. There are so many possible structure-discovery techniques that we are
unlikely to find them all in the same program. This means we need a standard
representation of patterns, errors and other information, and a way to share it
between tools. What kind of information should be exchanged? We suggest:
High-level specification of what a
spreadsheet, or regions
therein, is doing. Semantically related regions and the reasons for believing them to be
related, together with a description of how they map onto the worksheet
surface. Names, derived from labels, proposed for regions. Inductive logic
programming will be useful for learning from example where authorss tend to
put labels in relation to the cells they name. Errors, such as a cell where an equation "ought" to resemble those around
it but doesn't. Proposed corrections for errors. Is our
algebra
language appropriate? We had to give the user a means of performing
element-by-element manipulation of rows and columns, as when appending the cells
in columns A and B to make attribute names. We provided a functional language,
partly because we liked it and partly because it mapped cleanly onto our
implementation language Kawa, a dialect of Scheme. However, this requires
expertise the average user will certainly lack. It is worth asking whether a graphical interface might help - something along
the lines of MINOS:
Steps Towards a Syntax-free Computer Algebra Program [Tintarev,
1999], where the author adopts a desktop metaphor, with different icons on the
desktop representing different expressions to be manipulated. We are working on algorithms by which a user can compile MM code to a
spreadsheet, change the
compiled
spreadsheet,
then calculate the equivalent change to the MM program. This would be useful if
you develop a
spreadsheet
in MM and then want to change cell styles and formats from Excel, having the
changes reflected back to your MM program. Doing so involves some
reverse-engineering of the kind described earlier, but under constraints that
the new MM program constructed by decompiling the modified
spreadsheet
resemble the
original as much as possible. In general, this is not easy because changes to
the
spreadsheet
may cut
across parts of the MM program, rather as though we were to translate a sentence
into French, throw a random eror into it, and then try to translate the result
back into an English sentence with the "same" error. We are experimenting with
an appropriate formalisation of the notion of "part", so that if we change
something in the compiled
spreadsheet, we can find the
minimal enclosing entity in the MM code that it is a part of, and adjust the
code accordingly. This may be a new approach to reverse engineering. Jonathan Bowen. Recent Developments
in the
Algebra of
Programs, http://vmoc.museophile.com/algebra/section3_6.html, in
A Brief History of
Algebra and Computing: An
Eclectic Oxonian View, http://vmoc.museophile.com/algebra/algebra.html. Markus Clermont. A
Scalable Approach to
Spreadsheet
Visualisation,
http://www.ifi.uni-klu.ac.at/Publications/Dissertations/pubfiles/pdffiles/2003-0175-MC.pdf,
Dissertation, Fakultät für Wirtschaftswissenschaften und Informatik, Universität
Klagenfurt. Thomas Grossman.
Spreadsheet
Queueing
Simulation Templates,
http://www.ucalgary.ca/%7Egrossman/simulation/index.html. Dave Hart and Clinton Wolfe. Basic
Algebra in Maple,
http://www.indiana.edu/~statmath/math/maple/index.html.
Douglas Hofstadter. Analogies and Roles in Human and Machine Thinking,
in Metamagical Themas: Questing for the Essence of Mind and Pattern,
Basic Books, 1996. Douglas Hofstadter. Fluid Concepts and Creative Analogies: Computer Models
of the Fundamental Mechanisms of Thought, Penguin, 1998. Inductive
Logic Programming,
http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/ilp.html. Jocelyn Ireson-Paine. Ensuring
Spreadsheet Integrity
with Model Master,
http://www.j-paine.org/eusprig2001_as_html/eusprig2001.html, Proceedings of
EuSpRIG 2001, Amsterdam 5-6 July 2001. Raymond R. Panko. Applying Code
Inspection to
Spreadsheet
Testing, http://panko.cba.hawaii.edu/papers/ss/2phase.htm. Kyril Tintarev. MINOS:
Steps Towards a Syntax-free Computer
Algebra Program,
http://science.uniserve.edu.au/pubs/callab/vol3/tintarev.html,
CAL-laborate, Volume 3, October 1999. Eelco Visser et. al. The
Program Transformation Wiki,
http://www.program-transformation.org/twiki/bin/view/Transform. [ Jocelyn Ireson-Paine's Home Page |
Publications
| Document (shortened) in
Word format ] load
assumes it to be a
spreadsheet
and
automatically decompiles it into an MM program. The":=
"assigns this
program to the variable panko
. Just as p
,
r
in the Maple session are variables which hold
algebraic expressions, so panko
is a variable which holds a
expression. And,
just as Maple automatically lists the expression resulting from the commands
typed at it, so MM displays the
spreadsheet resulting from
the load
command, listing it as an MM program: <A1 A3 A4 A6 A8 A10 A13 A14 A15 A16 A17 A18 A19 A20
B1 B4 B5 B6 B7 B10 B11 B13 B14 B15 B17 B18 B19 B20
C13 C14 C15 C17 C18 C19 C20
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D14 D15 D17 D18 D19 D20
E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 E14 E15 E17 E18 E19 E20
F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F14 F15 F17 F18 F19 F20
H1 H7
I19 I20 I23 I24 I27 I32 I37
J25 J26 J28 J29 J30 J31 J33 J34 J37
L5 L7
>
where
A1 = "Cash Budget"
B1 = " Assume:"
D1 = 0.02
E1 = "inflation/semester"
H1 = "Panko, R.R. Applying code inspection to spreadsheet testing, Journal of Management Information Systems, (in press?)"
D2 = "Fall"
E2 = "Spring"
F2 = "Summer"
A3 = "Cash, beginning"
D3 = 1000
<etc>
D9 = "======="
E9 = "======="
F9 = "======="
<etc>
I27 = "4 mechanical errors in formulas with 3 opns and cell references"
J28 = "F5;; living costs summer;; cells in sum should be f14..f20"
<etc>
The program contains a bracketed list of attribute declarations followed
by the equations (some omitted to save space). As always in MM, all attributes
referenced in the equations are declared. This provides redundancy for the
compiler (when compiling and error-checking MM programs) and for the reader
(when struggling to make sense of other people's code).
Stage 2
detach_labels
, which removes
all cells containing string constants not referenced by other cells. These can
fairly be assumed to be labels: panko_delabelled := detach_labels( panko );
MM then displays the result: <B14 B15 B17 B18 B19 B20
C14 C15 C17 C18 C19 C20
D1 D3 D4 D5 D6 D7 D8 D10 D11 D12 D14 D15 D17 D18 D19 D20
E3 E4 E5 E6 E7 E8 E10 E11 E12 E14 E15 E17 E18 E19 E20
F3 F4 F5 F6 F7 F8 F10 F11 F12 F14 F15 F17 F18 F19 F20
>
where
D1 = 0.02
D3 = 1000
E3 = D8
F3 = E8
D4 = SUM(D10:D12)
E4 = SUM(E10:E12)
F4 = SUM(F10:F12)
D5 = SUM(D14:D20)
E5 = SUM(E14:E20)
F5 = SUM(F12:F19)
D6 = 3000
E6 = 3000
F6 = 3000
D7 = ROUND(1000+D4+D5-D3-D6,-3)
E7 = ROUND(1000+E4+E5-E3-E6,-3)
F7 = ROUND(1000+F4+F5-F3-F6,-3)
D8 = D3-D4-D5+D6+D7
E8 = E3-E4-E5+E6+E7
F8 = F3-F4-E5+F6+F7
D10 = 4115
<etc>
Note that the labels have been removed. Note also that we now have two
versions of the
spreadsheet on the go
simultaneously, just as our Maple session had several versions of an expression
Stage 3
labels := panko without panko_delabelled;
col_A := by_row( slice( labels, A1:A20 ) );
col_B := by_row( slice( labels, B1:B20 ) );
cols_AB := zip( col_A, col_B, append );
cols_AB_cleaned := zip( cols_AB, mangle );
by_row
function modifies slice
's
result to be a function from row number to label rather than from cell address
(i.e. row and column) to label. zip
function, which runs down the columns passed
in the first two arguments, taking corresponding cells from each and applying
the third argument, itself a function, to them. Here, this function is
append
: it joins or appends two strings. So the effect is to append
the A and B labels of each row. mangle
function. This
"mangles" each name (the word is commonly used by compiler writers) to remove or
replace unwanted characters. Thus the labels Cash Budget
and
Assume
become Cash_BudgetAssume
, for example. The next
thing is to pair the cell names of column D with the new identifiers that
replace them:
renamings := make_map( by_row(col_D), cols_AB_cleaned );
giving us D3 > Cash_beginning
D4 > Outflows_School_costs
D5 > Living_costs
D6 > Inflows_Loans
D7 > Support_from_home
D8 > Cash_end
D10 > School_contractual_Tuition
D11 > Fees
D14 > Housing
D15 > Insurance
D17 > Food
D18 > Entertainment
D19 > Transportation
D20 > Clothing
where the listing uses > to mean "maps to". So what this listing shows
is meaningful names suggested for the cells in column D.
Stage 4
<Income : [1990..2003]>
and write references such as Income[1995]
in the equations.
Income
had
numeric indices. MM also allows symbolic indices akin to Pascal's enumerated
types. We use these when building tables whose cells do not have a natural
ordering; for example the branches of a department store or the colours a dress
can be bought in. Thus we might have a business with branches in Oxford,
Cambridge and London. To store the income for each branch at one particular
time, we would declare: <Income : {Oxford,Cambridge,London}>
while to hold income for these branches from 1990 onwards, the attribute
would gain a dimension, becoming: <Income : {Oxford,Cambridge,London} * [1990..2003]>
fall
, spring
and summer
.
Thus any reference to D1 would be replaced by
Cash_BudgetAssume[fall]
; reference to E1 by
Cash_BudgetAssume[spring]
; and so on. We did the same for all the
other attributes in these three columns. cols_DEF := slice( panko_delabelled, D1:F20 );
cols_DEF_merged := merge( cols_DEF, cols_AB_cleaned, D:F, [fall,spring,summer] );
which gave: <B14 B15 B17 B18 B19
C14 C15 C17 C18 C19 C20
Cash_BudgetAssume[fall,spring,summer]
Cash_beginning[fall,spring,summer]
Cash_end[fall,spring,summer]
Clothing[fall,spring,summer]
D12
E12
Entertainment[fall,spring,summer]
F12
Fees[fall,spring,summer]
Food[fall,spring,summer]
Housing[fall,spring,summer]
Inflows_Loans[fall,spring,summer]
Insurance[fall,spring,summer]
Living_costs[fall,spring,summer]
Outflows_School_costs[fall,spring,summer]
School_contractual_Tuition[fall,spring,summer]
Support_from_home[fall,spring,summer]
Transportation[fall,spring,summer]
>
where
Cash_BudgetAssume[fall] = 0.02
Cash_beginning[fall] = 1000
Cash_beginning[spring] = Cash_end[fall]
Cash_beginning[summer] = Cash_end[spring]
Cash_end[fall] = Cash_beginning[fall]-Outflows_School_costs[fall]-
Living_costs[fall]+Inflows_Loans[fall]+Support_from_home[fall]
Cash_end[spring] = Cash_beginning[spring]-Outflows_School_costs[spring]-
Living_costs[spring]+Inflows_Loans[spring]+Support_from_home[spring]
Cash_end[summer] = Cash_beginning[summer]-Outflows_School_costs[summer]-
Living_costs[spring]+Inflows_Loans[summer]+Support_from_home[summer]
Clothing[fall] = C20*20
Clothing[spring] = ROUND(Clothing[fall]*(1+Cash_BudgetAssume[fall]),0)
Clothing[summer] = ROUND(Clothing[spring]*(1+Cash_BudgetAssume[fall]),0)
D12 = 300
E12 = ROUND(D12*(1+Cash_BudgetAssume[fall]),0)
Entertainment[fall] = C18*B18
Entertainment[spring] = ROUND(Entertainment[fall]*(1+Cash_BudgetAssume[fall]),0)
Entertainment[summer] = ROUND(Entertainment[spring]*(1+Cash_BudgetAssume[fall]),0)
F12 = ROUND(E12*(1+Cash_BudgetAssume[fall]),0)
Fees[fall] = 53
Fees[spring] = 53
Fees[summer] = 53
Food[fall] = C17*B17
Food[spring] = ROUND(Food[fall]*(1+Cash_BudgetAssume[fall]),0)
Food[summer] = ROUND(Food[spring]*(1+Cash_BudgetAssume[fall]),0)
Housing[fall] = C14*B14
Housing[spring] = Housing[fall]
Housing[summer] = Housing[fall]
Inflows_Loans[fall] = 3000
Inflows_Loans[spring] = 3000
Inflows_Loans[summer] = 3000
Insurance[fall] = C15*B15
Insurance[spring] = Insurance[fall]
Insurance[summer] = Insurance[fall]
Living_costs[fall] = SUM(Housing[fall]:Clothing[fall])
Living_costs[spring] = SUM(Housing[spring]:Clothing[spring])
Living_costs[summer] = SUM(F12:Transportation[summer])
Outflows_School_costs[fall] = SUM(School_contractual_Tuition[fall]:D12)
Outflows_School_costs[spring] = SUM(School_contractual_Tuition[spring]:E12)
Outflows_School_costs[summer] = SUM(School_contractual_Tuition[summer]:F12)
School_contractual_Tuition[fall] = 4115
School_contractual_Tuition[spring] = 4115
School_contractual_Tuition[summer] = 4115
Support_from_home[fall] = ROUND(1000+Outflows_School_costs[fall]+
Living_costs[fall]-Cash_beginning[fall]-Inflows_Loans[fall],-3)
Support_from_home[spring] = ROUND(1000+Outflows_School_costs[spring]+
Living_costs[spring]-Cash_beginning[spring]-Inflows_Loans[spring],-3)
Support_from_home[summer] = ROUND(1000+Outflows_School_costs[summer]+
Living_costs[summer]-Cash_beginning[summer]-Inflows_Loans[summer],-3)
Transportation[fall] = C19*B19
Transportation[spring] = ROUND(Transportation[fall]+1+Cash_BudgetAssume[fall],0)
Transportation[summer] = ROUND(Transportation[spring]*(1+Cash_BudgetAssume[fall]),0)
Not bad, is it?
Outflows
- appears to qualify both B4 and B5, so we should
have prefixed it to both. Even so, our transformation has worked pretty
well.ERROR DETECTION
Has our listing, with its renamed cells, helped us detect them?
E19=ROUND(D19+(1+$D$1),0)
. D19+
should be
D19*
.
F3=F4-F5+E6+E7
. Should be F3=F4-E5+E6+E7
.
D20=C20*20
. Should be C20*B20
.
F5=SUM(F12..F19)
. Should be SUM(F14..F20)
.
Transportation[spring]
,
Cash_end[summer]
,
Clothing[fall]
and Living_costs[summer]
in Stage 5
of the example. The first error is relatively easy to spot by inspection, helped
by the way the listing groups together the elements of an attribute and makes it
easy to compare the spring and summer equations. Likewise, the second error
stands out because the right hand side refers to
Living_costs[spring]
and not Living_costs[summer]
. The
third error, C20*20
is not obvious (though it might be had we
bothered to work out out a good replacement name for C20
), but the
fourth error stands out because the attribute F12
in the right hand
side of Living_costs[summer]
should, by comparison with the other
elements of Living_costs
, be Housing[summer]
. SPREADSHEET
ALGEBRA EXAMPLE: NEW
SPREADSHEETS FROM OLD
rename
, merge
, slice
,
detach_labels
. We did, however, invent some, and the others are
based on them. They include:
(<a,b,c> where a=b+c) plus (<d,e,f> where d=e*f)
<a b c d e f
>
where
a = b+c
d = e*f
(< a b c > where a=b+c) over [1..10]
<a[1..10] b[1..10] c[1..10]
>
where
a[all I] = b[I]+c[I]
<a[1..10] b[1..10] c[1..10> / [1..10]
<a b c>
a := load("Company.slk");
<A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14
B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14
>
where
C1 = A1+B1
C2 = A2+B2
<etc>
b := merge( a, "Income", A1:A14, [1990..2003 ] );
b := merge( b, "Outgoings", B1:B14, [1990..2003 ] );
b := merge( b, "Profit", C1:C14, [1990..2003 ] );
b := roll( b, "Income", $+$ );
b := roll( b, "Outgoings", $+$ );
b := roll( b, "Profit", $+$ );
<Income[1990..2003] Outgoings[1990..2003] Profit[1990..2003]
>
where
Profit[all T] = Income[T]-Outgoings[T]
c := (b plus < MonthlyProfit: [1990..2003] >
where MonthlyProfit[all Y] = Profit[Y] / 12
);
<Income[1990..2003] MonthlyProfit[1990..2003]
Outgoings[1990..2003] Profit[1990..2003]
>
where
Profit[all T] = Income[T]-Outgoings[T]
MonthlyProfit[all Y] = Profit[Y]/12
Here, we first renamed and merged the cells in the input
spreadsheet, and then added
a MonthlyProfit
attribute. The main point here is how we added the
new attribute, something that would entail inserting a new column and
cell-by-cell formulae if done directly in Excel. We did have to go to some
effort to do the renaming and merging (i.e. to make the structure of the input
spreadsheet
explicit),
but we plan various ways to simplify the process: by using formats, via
user-defined
spreadsheet
algebra
functions
which, for example, would package up the above sequence of merge
s
and roll
s into one call; by automating equation rolling. Also, the
user can always store the
spreadsheet
after it has
been converted to an MM program, and start from that each time.
a := load("company.slk"); <etc>
<Income[1990..2003] Outgoings[1990..2003] Profit[1990..2003]
>
where
Profit[all T] = Income[T]-Outgoings[T]
b := load("OxfordData.slk"); <etc>
<Income[1990..2003]
>
where
Income = [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ]
Outgoings = [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ]
c := load("CambridgeData.slk"); <etc>
<Income[1990..2003]
>
where
Income = [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ]
Outgoings = [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ]
d := (a on b) plus (a on c);
<Income_1[1990..2003] Income_2[1990..2003] Outgoings_1[1990..2003]
Outgoings_2[1990..2003] Profit_1[1990..2003] Profit_2[1990..2003]
>
where
Income_1 = [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ]
Income_2 = [ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ]
Outgoings_1 = [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ]
Outgoings_2 = [ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ]
Profit_1[all T] = Income_1[T]-Outgoings_1[T]
Profit_2[all T] = Income_2[T]-Outgoings_2[T]
In this example, we loaded another income-outgoings-profit
spreadsheet
and reformatted
it (with similar commands as in the previous example, here omitted to save
space). We then loaded two two-column data spreadsheets, one for the Oxford
branch of the business and one for the Cambridge branch. Then we glued all three
together with the expression (a on b) plus (a on c)
. Each
on
superimposes attributes of the same name. So the first one
copies the Oxford data into the income-outgoings-profit
spreadsheet; the second
copies the Cambridge data into a new copy of that
spreadsheet. The net result
is a
spreadsheet
with two
copies of the calculations. With a little more effort, we can merge the
attributes to make the branch an explicit index, similar to the season in our
first example: <Income[1990..2003]*{Oxford,Cambridge} Outgoings[1990..2003]*{Oxford,Cambridge}
Profit[1990..2003]*{Oxford,Cambridge}
>
where
Income = [ [ 431, 455, 498, 501, 522, 526, 547, 561, 558, 412, 494, 504, 509, 517 ],
[ 347, 326, 291, 285, 279, 251, 277, 301, 340, 461, 312, 332, 324, 318 ] ]
Outgoings = [ [ 341, 358, 293, 287, 260, 287, 281, 315, 327, 363, 389, 290, 287, 281 ],
[ 213, 210, 225, 252, 274, 287, 291, 312, 345, 311, 341, 221, 219, 232 ] ]
Profit[all T,all B] = Income[T,B]-Outgoings[T,B]
SPREADSHEET
ALGEBRA
EXAMPLE: ADDING CODE
FOR NUMERICAL DIFFERENTIATION
x
of some chemical at various timepoints, and we
want also to see the rate of change of concentration, x_dot
. If we
assume the difference between timepoints is always the same - 1000, say - we can
approximate the rate of change by just dividing this into the successive
differences in concentration. To do this with MM simply entailed loading the
spreadsheet (which, to save
space in the listing, is just one column into which the concentrations can be
typed as data), and adding to it a new attribute x_dot
and the
difference code: a := load( "Ponton.xls" );
<A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
>
b := merge( a, "x", A1:A10, [1..10] );
<x[1..10]
>
c := (b plus <x_dot[2..10]>) where
x_dot[all T] = (x[T] - x[T-1]) / 1000;
<x[1..10] x_dot[2..10]
>
where
x_dot[all T] = (x[T]-x[T-1])/1000
FURTHER WORK AND CONNECTIONS TO OTHER RESEARCH
Automatic structure discovery
XYZ XYZ XYZ
or XYZ
XYZ
XYZ
It might be worth looking in general into the perceptual techniques
developed in computer vision, and certainly at blackboard systems and other
methods that artificial intelligence has developed for fusing knowledge from
different sources.
A standard for interchange of structure-discovery data
The
algebra
language
Incremental reverse-engineering
REFERENCES