9th October 2002
Writing for posterity
This section concentrates on making your programs more error-free. It emphasises the importance of structured design and testing of programs, and making sure at each stage that you are clear about what you are doing. The algebra of GAUSS translates almost from the page into code, but there are few checks to ensure that your algebra is correct. This section aims to correct that.
1 Programming methods
Because GAUSS is tolerant in the range of errors and mistakes it will let pass, a systematic approach to writing code is important: a program should be designed rather than just developed. In a structured language like GAUSS, paper solutions will tend to resemble the finished code. The two main approaches to program design are top-down and bottom-up.
1.1 Top-down design
To econometricians used to dealing with packages, this is the most logical approach. The idea is to write down an algorithm; then take each part of the first algorithm and write down an algorithm for that bit; then find algorithms for all the elements of the sub-algorithm; and so on. This progressive approach is called step-wise refinement.
For example, consider writing a program to run OLS regressions on a data set. The first algorithm might be
(1) Get options
(2) Read data
(3) Run regression
(4) Print results
Now refine stage (3):
(3.1) Get x and y matrices from dataset
(3.2) Calculate coefficient estimates
(3.3) Calculate statistics
and then (3.3):
(3.1) Get x and y matrices from dataset
(3.2) Calculate coefficient estimates
(3.3) Calculate statistics
(3.3.1) Find TSS, ESS, RSS
(3.3.2) Calculate s
(3.3.3) Calculate standard errors and t-stats
(3.3.4) Calculate R2
The first stage is similar to the instructions that would be given to, say, TSP. The difference with GAUSS is that all the sub-stages need to be written as well. On the other hand, in this scheme it is becoming clear that the problem degenerates rapidly into a simple set of tasks. Other problems will of course be more difficult, but the principle of breaking down a problem into more detailed (but also simpler) actions is clear.
Also clear is that much of this can be translated directly into GAUSS code. The first algorithm might almost be the main section of a program, with the tasks being procedure calls. This is why a structured approach to design improves the quality of programs: as well as forcing the programmer to write down all the steps to be taken (and so, hopefully, all the pitfalls to be avoided), the correlation between the outline of the original algorithm and the final program structure aids verification of the program.
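As a minimal sketch of how the first algorithm can become the main section of a program, the outline might be coded as below. The procedure names (ReadData, RunOLS, PrintResults) are hypothetical, the data are simulated rather than read from a file, and only part of stage (3.3) is filled in.

```gauss
/* (2) Read data: simulated here so that the sketch is self-contained */
proc (2) = ReadData(n);
    local x, y;
    x = ones(n, 1) ~ rndn(n, 2);        /* constant and two regressors */
    y = x * (1 | 2 | 3) + rndn(n, 1);   /* arbitrary true coefficients */
    retp(x, y);
endp;

/* (3) Run regression: returns estimates, standard errors and s squared */
proc (3) = RunOLS(x, y);
    local xxi, b, e, s2, se;
    xxi = inv(x'x);
    b = xxi * (x'y);                    /* coefficient estimates */
    e = y - x * b;                      /* residuals */
    s2 = (e'e) / (rows(x) - cols(x));   /* (3.3.2) */
    se = sqrt(diag(s2 * xxi));          /* (3.3.3) */
    retp(b, se, s2);
endp;

/* (4) Print results */
proc (0) = PrintResults(b, se);
    local out;
    out = b ~ se ~ (b ./ se);
    print "estimate, standard error, t-stat";
    print out;
endp;

/* Main section: one line per step of the first algorithm */
n = 100;                                /* (1) Get options */
{ x, y } = ReadData(n);                 /* (2) */
{ b, se, s2 } = RunOLS(x, y);           /* (3) */
PrintResults(b, se);                    /* (4) */
```

The main section at the bottom reads almost exactly like the paper algorithm, which is what makes it easy to check one against the other.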
1.2 Bottom-up design
The bottom-up approach takes the opposite tack. Problems are solved at the lowest level, and programs are built up by using earlier solutions as building blocks.
In the above example, the first task might be to design a procedure to take as input TSS, ESS, n and k and produce R2, s2, and standard errors. When this procedure is fully tested, a procedure taking as input the x'x and x'y matrices will use the first routine in the production of OLS estimates, variances, and significance levels. This procedure is then fully tested and only when it functions correctly does consideration of the next stage begin; but then in this next stage, the written procedures can be taken as proven code.
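A lowest-level building block of this kind might look like the sketch below. The name FitStats is hypothetical, the standard errors are left to the next level up, and the calculation assumes that TSS = ESS + RSS (that is, that the regression includes a constant).

```gauss
/* FitStats: fit statistics from sums of squares (hypothetical name)
   Inputs : tss, ess - scalars, total and explained sums of squares
            n, k     - scalars, observations and regressors
   Outputs: r2, s2   - scalars, R-squared and error variance estimate
   Assumes TSS = ESS + RSS */
proc (2) = FitStats(tss, ess, n, k);
    local r2, s2;
    r2 = ess / tss;
    s2 = (tss - ess) / (n - k);
    retp(r2, s2);
endp;

/* Test on values whose answers can be checked by hand */
{ r2, s2 } = FitStats(100, 75, 27, 2);
print "xtestx r2, expect 0.75:" r2;
print "xtestx s2, expect 1:" s2;
```

Because the procedure does nothing but arithmetic on scalars, a handful of hand-checked cases like this is enough to treat it as proven before building on it.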
This approach, while as valid as top-down design, is not often the immediate choice, particularly when the programmer is used to working at a much higher level of abstraction (as in econometric packages). It also gives less of a "feel" to a program's structure. On the other hand, testing procedures built from the bottom up is usually simpler. Procedures are tested at the lowest possible level, and only the procedure being built is being tested. This is much more reliable than trying to test a complete program.
The choice of a design method is up to the programmer, and most programs have an element of both. Generally, the top-down style works best on large projects which need a disciplined approach, but when it comes to actually programming rather than designing, starting from the simplest bits of code and working outwards is usually the most effective (and safest) route. However, most programmers will over time build up their own libraries of useful little functions, and so the bulk of design will tend to concentrate on the "grand scheme" side.
2 Comments
One of the most important aids to writing better programs is the use of comments. Comments generate no executable code and have no effect whatsoever on the performance of the program. They are entirely for the programmer's benefit. How then do they make programs safer? By allowing complicated pieces of code to be explained in the program; by identifying what variables are used where; by proclaiming the purpose of procedures; in short, by encouraging descriptions within the program of what a piece of code does, why it does it, what variables it uses, and what results it gives out.
A comment is anything enclosed in a slash-asterisk combination:
/* this is a comment */
/* a = b + c; */
/* so is the above instruction as it is enclosed in comment marks */
The start of a comment is marked by /*, the end by */. Anything enclosed in these marks will be treated as a comment and ignored by the program: the instruction in the above example no longer exists as far as the program is concerned.
Comments can be nested; that is, one comment can contain another comment. This is useful when, for example, the user wants to temporarily "block out" a piece of code to test something:
a = b + c;
/******* remove this bit of code temporarily
Mutate (b, c); /* proc to do something to b and c */
d = b * c;
end of removed section *******/
Having multiple asterisks after the start or before the end of the comment block is fine by GAUSS; all it checks for is the /* or */ combination. Everything else within these two is ignored.
This is one of the few places in GAUSS where spacing is important. The comment
/* this is a comment with a space in the final marker * /
will lead to the error message "Open comment at end of file" because GAUSS will not recognise "* /" as the intended token "*/".
2.1 When to use comments
Too many comments in a program are not as bad as too few, although an excess may distract from the code itself; in practice, an excess is difficult to achieve. Comments amongst code are usually only wanted where a complex operation is being carried out, or where the control structure of the program is not immediately obvious, or where a particular variable value is not clear; basically, anywhere a new reader might be confused by some aspect of the program. The programmer may also want to include comments on variables as they are declared, saying what their purpose is, their type, and so on for his own reference.
Comment blocks can be used to keep track of programs. A comment of some sort should always be included at the start of the program, identifying the program's purpose and possibly also authorship details.
Where procedures are declared, comments become very important. Because a GAUSS procedure header only says how many variables are returned, a comment saying which of the local variables and parameters are returned would be useful - along with a note of any global variables used or updated. As GAUSS variables can change size and form very easily, comments explaining the types of the variables expected as parameters and returned are often useful. Finally, a note of what the procedure actually does makes the whole block much more readable.
Consider the following comment block. The procedure TestColl is used to test each of the nSubs square submatrices, concatenated vertically into one matrix, for multicollinearity:
This consists of a one-line description of the procedure's function; details of the input and output parameters; and a reference to the mathematical basis of the function. It also informs us that the procedure does not access any (user-defined) global variables.
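A header with these elements might look like the following sketch. Everything in it beyond the name TestColl and the parameter nSubs is an assumption made for illustration: the parameter name mats, the rank test and the body of the procedure are not taken from the author's program.

```gauss
/* TestColl: test stacked square submatrices for multicollinearity
   Inputs : mats  - (nSubs*k) x k matrix, the submatrices stacked vertically
            nSubs - scalar, number of submatrices
   Output : found - scalar, 1 if any submatrix is rank deficient, else 0
   Globals: none
   Method : rank from the singular value decomposition (an assumed test) */
proc (1) = TestColl(mats, nSubs);
    local k, i, lo, hi, sub, found;
    k = cols(mats);
    found = 0;
    i = 1;
    do while i <= nSubs;
        lo = (i - 1) * k + 1;          /* first row of submatrix i */
        hi = i * k;                    /* last row of submatrix i */
        sub = mats[lo:hi, .];
        if rank(sub) < k;
            found = 1;
        endif;
        i = i + 1;
    endo;
    retp(found);
endp;

/* Check against two cases with known answers */
a = eye(2);                            /* full rank */
b = (1 ~ 2) | (2 ~ 4);                 /* second row is twice the first */
r1 = TestColl(a | a, 2);
r2 = TestColl(a | b, 2);
print "xtestx expect 0:" r1;
print "xtestx expect 1:" r2;
```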
The aim of a block such as this is twofold. Firstly, the author of the procedure can check its function against the claims in the comment block (ie that given the correct sort of data it will return a boolean variable set to true if multicollinearity is found in any submatrix). Secondly, the programmer wanting to use this procedure can find out what the procedure does and what are the types of the input and output parameters without having to study the procedure in detail.
3 Testing
The laxity of the GAUSS syntax, the weak typing of variables, and the poor handling of input all contribute to making testing a necessity for all but the smallest programs. We consider here some aspects of testing programs. However, it should be remembered that testing is inherently Popperian: a program can only be proved not to work by testing; it cannot be proved to work.
Essentially, there are three things that can go wrong with a program: it is given the wrong instructions; the instructions are entered wrongly; or the data it uses is wrong or inappropriate. All three areas should at least be considered before a program is pronounced "finished".
3.1 Semantic errors
Semantic errors are those where the program does not work as intended because it has been told to do the wrong thing. For example, the instruction sequences
are both valid programs; however, the second correctly calculates the variance of an IV estimate of beta, while the first does - well, something else.
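A hypothetical pair of this kind is sketched below; it is an illustration only, with simulated data and invented variable names. It assumes the just-identified case, where the variance of the IV estimate is s2 * inv(z'x) * (z'z) * inv(x'z). Both variance lines are conformable and run without complaint, but only the second transposes the final factor as the formula requires.

```gauss
n = 50;
z = ones(n, 1) ~ rndn(n, 1);                /* instruments */
x = ones(n, 1) ~ (z[., 2] + rndn(n, 1));    /* regressors, correlated with z */
y = x * (1 | 2) + rndn(n, 1);

zxi = inv(z'x);
b = zxi * (z'y);                            /* IV estimate of beta */
e = y - x * b;
s2 = (e'e) / (n - cols(x));

v1 = s2 * zxi * (z'z) * zxi;                /* valid code, wrong formula */
v2 = s2 * zxi * (z'z) * zxi';               /* last factor transposed */

print "xtestx first version:";
print v1;
print "xtestx second version:";
print v2;
```

The two results differ only in their off-diagonal structure, which is exactly the sort of discrepancy that is easy to overlook in printed output.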
GAUSS cannot detect these errors. It is entirely up to the programmer to find them. This is where a rigorous approach to defining the problem and implementing the solution will make a difference. If a program is well structured and commented, then the actions of each part of a program can be checked against the claimed result; this claimed result should itself be checked against the solution algorithm to see if the result was intended.
Procedurisation simplifies this somewhat by turning sections of the code into "black boxes" which can be tested independently and then, once they appear to work, can be taken for granted to some extent. Small sections of code should be tested where possible; waiting until a program is finished before testing commences may well be counterproductive if the program is large and complex.
Semantic errors are the most difficult to find because there is nothing for GAUSS to report as an error. The program is only "wrong" in the sense that it does not work as intended. Unfortunately, some errors will still slip by - particularly those to do with matrix size and orientation. In one program I missed a transpose operator; the fact that a number of calculations were therefore being done on a row vector when they should have been using column vectors and scalars left GAUSS unfazed. As the results were sensible (largely due to luck in the way the matrix was indexed), the error did not come to light for some months, until the program was altered and an associated operation failed.
The most obvious way to test for this is to create test data; for example, testing an IV estimator might involve creating a number of observation sets with different variances and correlations between the variables. One test data set might have zero error terms, to test the model in the "ideal" case; another might have instruments uncorrelated with explanatory variables; another leads to a singular covariance matrix to see if the program picks that error up; and so on.
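The "ideal" case might be set up as in the sketch below, where the variable names and the values chosen are arbitrary. With zero error terms the IV estimator should return the true coefficients, apart from rounding error.

```gauss
n = 20;
t = seqa(1, 1, n);                 /* 1, 2, ..., n */
x = ones(n, 1) ~ t;                /* explanatory variables */
z = ones(n, 1) ~ (t .* t);         /* instruments */
btrue = 3 | 0.5;
y = x * btrue;                     /* zero error terms */

bhat = inv(z'x) * (z'y);
maxerr = maxc(abs(bhat - btrue));
print "xtestx largest error, expect a value near zero:" maxerr;
```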
GAUSS does have a run-time debugger, but this is signally difficult to use and rarely informative. The easiest way to test particular portions of code is to use PRINT statements to inform the user where the program has got to and what values the variables of interest currently hold. For example, suppose an unexpected result seems to arise from the code
a = b * c;
IF b > c;
    a = ThisProc(a, b, c);
ELSE;
    a = ThatProc(a, b, c);
ENDIF;
Then this could be augmented with
a = b * c;
PRINT "xtestx a is currently size " ROWS(a) COLS(a);
PRINT "xtestx Current value of a: " a;
IF b > c;
    PRINT "xtestx IF section; b>c";
    a = ThisProc(a, b, c);
ELSE;
    PRINT "xtestx ELSE section, b<=c";
    a = ThatProc(a, b, c);
ENDIF;
PRINT "xtestx Out of IF statement: new value of a:" a;
This seems like overkill, but this is often the easiest and quickest way to find errors. Note that the PRINT statements write "xtestx" at the start of each message. Adding easily identifiable text fragments makes it easier to see which statements are test messages. It also makes it easier to find them later when the program works and they need to be removed.
3.2 Syntactic errors
Syntactic errors - mistakes in the coding of a program - are usually fairly simple to discover. GAUSS will pick up some when it prepares to run a program; others will only come to light when a particular piece of code is executing. For example, if a procedure does not return the number of variables claimed in the procedure declaration, this will only be picked up when the procedure is called.
However, it will be discovered at some point, and so testing should make sure that all the instructions in the program are called at some time during the test stage. Again, PRINT statements and test data can be helpful in finding these errors.
3.3 User errors
GAUSS's worst feature is undoubtedly its handling of user input. The CON command is extremely user-unfriendly, and its file handling is based on shaky assumptions of existence.
The CON command assumes that the program instructs the user well and that the user neither makes mistakes nor changes his mind during the entry of streams of numbers. These are unjustified assumptions in most practical cases. If a program expects a stream of numbers, then the authors suggest replacing CON with CONS, the string input function. This allows the user to edit the list of numbers as they are entered. The output from CONS can then be converted using the function STOF, which converts a string full of numbers into a column vector. Thus these two are equivalent:
unless the user types in fewer than r*c numbers. However, the second form is much more usable in almost every case.
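In outline, and assuming r and c hold the intended numbers of rows and columns, the two forms might be written as below. The use of RESHAPE to turn the column vector from STOF into an r by c matrix is an assumption of this sketch.

```gauss
r = 2;
c = 3;

/* First form: CON reads r*c numbers directly from the keyboard */
print "Enter six numbers:";
x1 = con(r, c);

/* Second form: read an editable line of text, then convert it */
print "Enter six numbers on one line:";
s = cons;
x2 = reshape(stof(s), r, c);
```

Keeping the input as a string until the last moment also gives the program a chance to inspect it before any conversion takes place.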
On files, GAUSS generally assumes that files exist. Therefore, GAUSS will often crash if files are not found. This tends to be more annoying than a serious problem. If, however, a file not being found would have devastating impact, then file opening should be carried out at the beginning of the program - or at least, before any permanent work is carried out. There is no "exist" command in GAUSS, but the FILES command provides a feasible if irritatingly awkward way to test for existence. In GAUSS 4.0 FILES is deprecated in favour of FILESA and FILEINFO.
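An existence check made before any real work starts might be sketched as follows. The file name is hypothetical, and the sketch assumes that FILES returns a scalar zero when no file matches and one row of information per matching file otherwise; this should be checked against the manual for the version in use.

```gauss
fname = "mydata.dat";              /* hypothetical file name */
info = files(fname, 0);            /* 0: ordinary files only */

/* assumed convention: a scalar result means no match */
if rows(info) == 1 and cols(info) == 1;
    print "xtestx file not found:" fname;
    end;
endif;

print "xtestx file found, safe to continue";
```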
Once the program has its input, that input may need to be tested. The amount and rigour of this depends on the type of input. For example, one program used by the authors uses information in one file to analyse another file. Because the information in the first is crucial to successful management of the second, the program will not accept an information file which it considers inconsistent with the data file.
A program should be able to deal with all kinds of user input; anything it cannot deal with should be weeded out and thrown away. Testing a program only against sensible inputs is often not good enough, especially if the program is to be used by other people. Making a program robust to errors in data entry can require some thought as to what might actually be entered.
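As a small sketch of weeding out unusable entries, suppose a program wants positive integers no greater than some limit. The limit, the variable names and the sample string (which stands in for a line read with CONS) are all invented for the example.

```gauss
maxval = 10;                               /* hypothetical upper limit */
s = "3 7 -2 4.5 12 1";                     /* stands in for s = cons; */
v = stof(s);

/* 1 where an entry is a positive integer within the limit, else 0 */
ok = (v .> 0) .and (v .<= maxval) .and (v .== floor(v));
v = selif(v, ok);                          /* keep only acceptable entries */
print "xtestx accepted entries:";
print v;
```

With this sample string the entries -2, 4.5 and 12 are discarded, leaving 3, 7 and 1.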
Unlike syntactic or semantic errors, some error in the user input may be allowable. A procedure of mine expects positive integers up to a certain number. It does not check the input string for dud entries, because the relevant code ignores them anyway. Foolproof routines for checking data are not always desirable. In the 1.6-million-iteration program described in an earlier section, only essential variables are checked for missing values; missing values in other variables are ignored because they do no harm, and the time wasted checking for them would not be well spent.
|Copyright © 2002 Trig Consulting Ltd|