In other words, plan before you program!
Top-level message: it is best to design on paper, and only code afterwards.
Other top-level message: write up your design so others can understand it. Explain how every cell group works, and what it's for. That way, you will have documentation for all who maintain or audit your spreadsheet.
When more precise or concise than natural language, don't be afraid to use mathematical notation. Including Excel formulae.
I shall propose one way to do this, using a particular kind of "pseudo-code". Pseudo-code is notation that looks like a programming language but isn't executable.
With it, I shall build a simple loans spreadsheet. At the end, I shall have two important things: the spreadsheet; and its documentation.
"Formal specification" is the computer science term for explaining precisely what a program does and how. There are many formal specification techniques. The one I propose is simple to use, because close to Excel.
Ray Panko reports, in What We Know About Spreadsheet Errors ( www.opssys.com/instantkb/attachments/What_We_Know_About_Spreadsheet_Errors_Whitepaper-GUID9b35763e2d504ddab36b9e26a4eee631.pdf ), that:
Cragg and King, 1993, inspected 20 operational spreadsheets from 10 firms. They found serious errors in 5 spreadsheets. "One universal aspect of the Cragg and King spreadsheets was informal iterative development. None of the spreadsheets used formal specification, design, or coding. In addition, there was extensive revision because of poor initial design."
(Cragg, P. G., & King, M. (1993). Spreadsheet Modelling Abuse: An Opportunity for OR? In Journal of the Operational Research Society, 44(8), 743-752.)
Has this changed?
Computer scientist Donald Knuth, Literate Programming (1984) in Literate Programming. CSLI, 1992, pg. 99. Quoted at the top of www.literateprogramming.com :
"I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: 'Literate Programming.'
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other."
(The money values are at most £100. That's merely to narrow the cells for these pictures.)
-- This is a simple spreadsheet that calculates cash at the start and end of each period from expenses and initial cash.
bounds time_span: 1 to 12.
-- The number of periods over which the model runs. In this version, it's 12 months.
table time : time_span → date.
-- time[t] is the date of the first day of period t. The final day of period t is the final day of its month.
-- I don't use this table in my calculations, but I do use copies of it for captions to the other tables. This separates how captions are calculated from how they are displayed. As Phil Bewig points out in his paper How do you know your spreadsheet is right? Principles, Techniques and Practice of Spreadsheet Style, at www.eusprig.org/hdykysir.pdf , separating calculation from display is a Good Thing.
date( 2009, t, 1 ).
-- Now I define the cash and expenses tables:
table initial_cash : → currency.
table expenses_during_period : time_span → currency.
table total_cash_at_start_of_period : time_span → currency.
table total_cash_at_end_of_period : time_span → currency.
-- initial_cash is the "opening cash balance", the value for the first period's total_cash_at_start_of_period. This will be input by the user.
-- expenses_during_period[t] is the expenses incurred during period t: during the first day to the final day inclusive. This will be input by the user.
-- total_cash_at_start_of_period[t] is the total cash held at the beginning of the first day of period t. Similarly, total_cash_at_end_of_period[t] is the total cash held at the end of the final day of period t.
total_cash_at_start_of_period[ 1 ] =
total_cash_at_start_of_period[ t>1 ] =
total_cash_at_end_of_period[ t-1 ].
total_cash_at_end_of_period[ t ] =
total_cash_at_start_of_period[ t ] - expenses_during_period[ t ].
Table definitions: name cell groups and state their size.
Bounds declarations: name table sizes, for when many tables share the same size.
Equations: the calculations!
Commentary: free text.
Perhaps the stylised notation looks odd, but so does maths ("dy/dx", "lim x → ∞") when you first see it. Its convenience is that you don't have to invent for yourself a way to say these things.
-- I shall start the loans model by putting a borrowing term into the equation for total_cash_at_end_of_period. This records the amount actually borrowed in each period, and is calculated from the amount the user wants to borrow. To hold these amounts, I need two more tables:
table want_to_borrow_during_period : time_span → currency.
table actually_borrowed_during_period : time_span → currency.
-- want_to_borrow_during_period[t] is the amount the user wants to borrow during period t.
-- actually_borrowed_during_period[t] is the amount the loans enable the user to borrow, taking borrowing limits into account. This will be less than or equal to want_to_borrow_during_period[t]. In this first version of the model, I just make them equal.
total_cash_at_end_of_period[ t ] =
total_cash_at_start_of_period[ t ] - expenses_during_period[ t ] +
actually_borrowed_during_period[ t ].
actually_borrowed_during_period[ t ] =
want_to_borrow_during_period[ t ].
Now I add the loan data structures and calculations. I no longer need the following equation, useful though it was in testing the spreadsheet above:
actually_borrowed_during_period[ t ] =.
want_to_borrow_during_period[ t ]
-- I shall now define the two attributes that characterise each loan. These are: the maximum amount one can borrow; and whether or not there is such a maximum. I need two more tables for these. They will each contain one element for each loan, so their sizes depend on the number of loans:
bounds loans_span: 1 to 4.
-- The number of loans.
table has_ceiling : loans_span → boolean.
table ceiling : loans_span → currency.
-- has_ceiling[l] is true if loan l has a ceiling.
-- ceiling[l] is l's ceiling. That is, the user can borrow any amount B such that B ≤ ceiling[l]. If not( has_ceiling[l] ), the user can borrow without limit.
-- It's worth noting that some spreadsheeters might prefer to have only the ceiling table, using a "nonsense" value such as -1 or #N/A to indicate that a loan has no ceiling. But ceiling and has_ceiling are, logically speaking, different kinds of data, and it is cleaner to separate them.
-- Now that I've described the attributes characterising a loan, I shall model the loans' progress through time. First, I need an initial value, because the loans could have been arranged and borrowed from before the first cash-flow period. So I shall make a table analogous to initial_cash:
table initial_loan : loans_span → currency.
-- initial_loan[l] is the total already borrowed from loan l at time.
-- Now I'll define tables that record the progress of each loan over time. They are analogous to my total_cash tables:
table total_loan_at_start_of_period : loans_span time_span → currency.
table total_loan_at_end_of_period : loans_span time_span → currency.
table lent_during_period : loans_span time_span → currency.
-- total_loan_at_start_of_period[l,t] is the total lent by loan l at the start of period t, i.e. the total borrowed from that loan. Similarly, total_loan_at_end_of_period[l,t] is the total lent by loan l at the end of period t.
-- lent_during_period[l,t] is the amount lent by loan l in period t. This is the core of the spreadsheet, for it's here that I record which loan lends during each period, and how much.
total_loan_at_start_of_period[ l, 1 ] =
initial_loan[ l ].
total_loan_at_start_of_period[ l, t>1 ] =
total_loan_at_end_of_period[ l, t-1 ].
total_loan_at_end_of_period[ l, t ] =
total_loan_at_start_of_period[ l, t ] + lent_during_period[ l, t ].
-- Before calculating lent_during_period, I shall jump to another part of the calculation, and make a table with one element for each loan and period. It will indicate whether there is enough money left in the loan to borrow from it:
table can_supply_wants : loans_span time_span → boolean.
-- can_supply_wants is a table of Boolean (logical) values. The element can_supply_wants[l,t] is true if loan l is capable of lending what the user wants to borrow at period t. This does not imply that we will in fact use the loan.
-- Layout is important here, because later on, I'm going to scan the table with the "match" function. So I shall arrange can_supply_wants with time running vertically, and l running from left to right with no gaps.
can_supply_wants[ l, t ] =
or( not( has_ceiling[l] )
, want_to_borrow_during_period[t] + total_loan_at_start_of_period[l,t] <= ceiling[l]
-- This equation means: can_supply_wants[l,t] is true if l has no ceiling, or if the amount the user wants to borrow, plus the amount lent at the start of the period, is less than or equal to l's ceiling.
-- That table tells us which loans the user can borrow from, but how do we decide which they will in fact borrow from? Let's assume it's the first eligible loan, the one with the smallest l. So I want to find the first "true" in the table. I can do so by scanning with Excel's "match" function:
table first_that_can_supply_wants : time_span → general.
-- first_that_can_supply_wants[l,t] is the first l for which can_supply_wants[l,t] holds. It is 1 if the user wants to borrow zero in period t, and #N/A if can_supply_wants[l,t] is false for all l, i.e. if no loan can supply what the user wants.
first_that_can_supply_wants[ t ] =
match( true, can_supply_wants[ all, t ], 0 ).
-- In the pseudo-code, I've used the subscript "all" to indicate that the match ranges over the entire first dimension of lent_during_period.
-- Now I am going to return to first_that_can_supply_wants and link it to lent_during_period. I said earlier that lent_during_period[l,t] is the amount lent by loan l in period t. Now I have everything necessary to calculate it:
lent_during_period[ l, t ] =
if( isna( first_that_can_supply_wants[t] )
, if( l = first_that_can_supply_wants[ t ]
, want_to_borrow_during_period[ t ]
-- This equation means: the amount lent by loan l in period t is 0 if first_that_can_supply_wants[t] is #N/A: that is, if the match found no trues. Otherwise, if first_that_can_supply_wants[t] is l, it is the amount the user wants to borrow. Otherwise it is 0.
-- And at last, I can define actually_borrowed_during_period:
actually_borrowed_during_period[ t ] =
sum( lent_during_period[ all, t ] ).
-- This is just the sum over all loans of the amount lent during period t. It works because at most one loan's element will ever be non-zero.
And now, having finished the documentation, I I build the spreadsheet by deciding where to put my tables, and then translating the equations into Excel formulae. (I know I showed pictures of it earlier, but that's because I went back over this tutorial and inserted them once I'd built it.)
The crucial point is that I now have a nice formal description of the spreadsheet, with every cell group documented. You'll get it from these slides by collecting the indented pseudo-code fragments.
This tutorial was originally my third blog posting for Dr Dobbs Code Talk in a series about documenting spreadsheets. It follows from Gliders, Hasslers, and the Toadsucker: Writing and Explaining a Structured Excel Life Game and How to Document a Spreadsheet: an Exercise with Cash-Flow and Loans.
In these, I used my spreadsheet-description language Excelsior to show how to document first, then write an Excelsior program around the documentation, and then translate this into Excel. But Excelsior isn't essential; and here, I am using Excelsior notation as pseudo-code.
It gives you a precise design which you can work out on paper before ever touching Excel. Which is good, because to change a design in Excel, you may need to erase ten tables and a thousand cells, then retype, re-fill, and re-fill yet again. To change a design on paper, you need only a rubber.