Guide to GAUSS Programming - a basic introduction


 
Writing for posterity

Some programs are one-offs, written quickly to solve a particular task and then discarded. However, most programs will be in use for a few weeks at least, and possibly years. Writing with an eye to maintenance and amendment in the first stages makes future changes much easier - especially if the original author is not the one altering the program. Even if the original author does come back to the program, the reasons for or effects of particular code segments may not be immediately apparent.

Far and away the most important factor in increasing the longevity of programs is the use of comments. These have already been covered in Safer programming. Other factors are now considered.


1 Styles and conventions

Throughout this manual, a fairly consistent style has been used. This makes no odds to GAUSS; it just makes the code more readable. The whole point of having a language where commands are separated by semi-colons and spaces are ignored is that variations in layout can be put to good use. Any users who have seen a BASIC or Fortran program with one statement per line and no extraneous spaces will immediately recognise the improved legibility that comes with structure.

The free-and-easy structure of the language can, of course, be ignored at the programmer's whim. There is nothing to stop the homesick BASIC programmer writing

i=1;
DO WHILE i<10;
PRINT "Hello Mum";
i=i+1;
ENDO;

but some simple indentation would have made the start and end of the WHILE loop immediately obvious, even to someone unfamiliar with GAUSS.
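
As a minimal sketch, here is the same loop with the body indented (the four-space indent and the spaces around operators are only one possible convention):

i = 1;
DO WHILE i < 10;
    PRINT "Hello Mum";
    i = i + 1;
ENDO;

GAUSS treats the two versions identically; only the layout differs.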

Similarly with variable and procedure names. There is nothing to stop a program using "i1" and "i2" as variable names, although "rowNum" and "colNum" would be much more readable. A descriptive name does not need more memory space than a short unhelpful one: both "i1" and "rowNum" will be allocated eight bytes of memory for their names.

Short names are not necessarily unhelpful in context. i, j, k etcetera are commonly used to index variables; in a program making IV estimates, variables called "xx", "zx", and "zy" are meaningful to econometricians. Consistent use of a name is also sensible.

Other styles are more concerned with personal choice. For example, this coursebook has always used capital letters for GAUSS standard words and procedures. The view of the author is that it makes clear what functions and features are integral to GAUSS and which are the responsibility of the programmer (and so should be defined in the program somewhere). This is not reflected in the official GAUSS documentation, but it has no functional impact and it suits me, so I maintain it as my way of making programs readable.

The key to a good style is that it should
  • highlight the flow of the program
  • add meaning to otherwise anonymous code, and
  • be consistent, even if it can't manage the first two
Readability is the defining characteristic of a good style.

2 Structures

Structures were introduced in GAUSS 4.0, and mentioned in GAUSS basics. It was noted there that they are effectively grouping variables, and so they were ignored by the subsequent discussion. However, they can be extremely useful in clarifying programs, and also have a role to play in making programs safer and easier to maintain.

Consider a procedure to do a simple OLS regression and another one to print the results:


PROC (6) = Do_OLS(x, y, degs_of_freedom);

/* calc in here */

RETP (est_beta, beta_errors, v_cov_matrix, tss, ess, sigma_sq);

ENDP;


PROC (0) = Print_OLS(est_beta, beta_errors, v_cov_matrix, tss,
  ess, sigma_sq, n, degs_of_freedom, name);

/* display in here */

ENDP;

with the appropriate calling code:

{est_beta, beta_errors, v_cov_matrix, tss, ess, sigma_sq} =
  Do_OLS(x, y, degs_of_freedom);
Print_OLS(est_beta, beta_errors, v_cov_matrix, tss, ess, sigma_sq,
  n, degs_of_freedom, name);

Although sensible variable names and an examination of the code will soon reveal what the procedure calls are doing, this is not, as it stands, very readable. Nor is this an unusually large number of parameters: in some programs written before structures were available, I have procedures passing over twenty parameters in and out.

A structure is a definition of a group of related variables. The important point is that the variables do not need to be of the same type. If we have all strings or all single numbers we can group them into string arrays or matrices. But where we have items of a different size and/or type, we need to use a structure.

A structure is defined and used as follows:

STRUCT structure_type {type1 name1; type2 name2 ...};
STRUCT structure_type s1;
STRUCT structure_type s2;

This defines a structure of type "structure_type" containing a set of objects, each of which has a name and a type. The second and third lines create particular "instances" of structure_type, which can then be used like other variables. However, we do not do anything with the structure directly. Instead we access an element within the structure by joining the name of the instance to the "inner name" of the element with a dot:

s1.name1 = ...
PRINT s2.name1;
x = s1.name2;

and so on. For example, consider a structure called "briefcase" defined to include objects called "wallet" and "newspaper". Richard and Judy are both instances of "briefcase"; that is, we have "Richard's briefcase" and "Judy's briefcase". We can then refer to the items each of them has in their briefcase as "richard.wallet" or "judy.newspaper", and so on.

How does this improve our coding? Well, consider a more meaningful example of a structure (with added comments):

STRUCT OLS_data
{
/* descriptive fields */

STRING name; /* name for this particular regression */

/* input fields */

MATRIX x; /* explanatory vars */
MATRIX y; /* dependent var */
MATRIX n; /* no. of obs. */
MATRIX degs_of_freedom; /* correction for dof */

/* results fields */

MATRIX est_beta; /* coeff. estimates */
MATRIX beta_errors; /* estimate errors */
MATRIX v_cov_matrix; /* covariance matrix */
MATRIX tss; /* total sum of sq. */
MATRIX ess; /* estimated s.s. */
MATRIX sigma_sq; /* estimated sigma^2 */
};

Suppose a specific instance "ols1" is created and the appropriate input values filled in:

STRUCT OLS_data ols1;
ols1.name = "First regression";
ols1.x = x_data;
ols1.y = y_data;
ols1.n = ROWS(x_data);
ols1.degs_of_freedom = ols1.n - COLS(x_data);

Then the main code can be redefined as:

ols1 = Do_OLS(ols1);
Print_OLS(ols1);

which is suddenly much more readable. The procedures can operate on the inner elements of the structure as if they were passed as normal parameters.
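
As a rough sketch of how the two procedures might look once they take a structure, assuming the OLS_data definition above appears earlier in the program. The parameter name "o" and the calculations are illustrative assumptions rather than the original procedures, and only some of the result fields are filled in:

PROC (1) = Do_OLS(STRUCT OLS_data o);

    LOCAL resid;

    /* coefficient estimates from the normal equations */
    o.est_beta = INVPD(o.x' * o.x) * (o.x' * o.y);

    /* residual variance, using the supplied degrees of freedom */
    resid = o.y - o.x * o.est_beta;
    o.sigma_sq = (resid' * resid) / o.degs_of_freedom;

    /* covariance matrix and standard errors of the estimates */
    o.v_cov_matrix = o.sigma_sq * INVPD(o.x' * o.x);
    o.beta_errors = SQRT(DIAG(o.v_cov_matrix));

    RETP (o);

ENDP;


PROC (0) = Print_OLS(STRUCT OLS_data o);

    PRINT o.name;
    PRINT "Estimates and standard errors:";
    PRINT o.est_beta~o.beta_errors;

ENDP;

Do_OLS works on its own copy of the structure and hands the filled-in copy back, which is why the calling code assigns the result to ols1 again.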

As well as clarity, this has an advantage if your program requires development. If I decide to add another input or output parameter to the above procedures, I only have to change the structure definition. Had I been using separate parameters, I would have had to edit the procedure definitions - and make sure that every call to those procedures in the program had its parameters adjusted accordingly.

A structure is also one of the few places where GAUSS will check to see that the actual variable matches up with the defined type, which reduces the chance of using invalid data.

One downside of structures is that extra information is required to store them, which means they require more memory and are slower to use, compared to having the raw variables to hand. This is a small problem for modern computers but, as with everything else in GAUSS, you need to be aware of the potential problem when dealing with very large or complex structures.

There is also a tendency to create over-inclusive structures. For example, in the above case it might have been better to separate off input fields from output fields. This would allow you to create several regressions from the same basic data structure without duplicating the main data matrices, for example.
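
One way such a separation might look, as a sketch only: the type names OLS_input and OLS_results and the instance names are hypothetical, and the field lists are shortened for illustration.

STRUCT OLS_input
{
    MATRIX x;   /* explanatory vars */
    MATRIX y;   /* dependent var */
    MATRIX n;   /* no. of obs. */
};

STRUCT OLS_results
{
    STRING name;              /* name for this particular regression */
    MATRIX degs_of_freedom;   /* correction for dof */
    MATRIX est_beta;          /* coeff. estimates */
    MATRIX sigma_sq;          /* estimated sigma^2 */
};

/* one set of data, two sets of results */
STRUCT OLS_input data1;
STRUCT OLS_results res1;
STRUCT OLS_results res2;

The data matrices are then held once in data1, while each regression keeps its own, much smaller, results structure.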

 
3 Separating code

GAUSS allows code to be split up into several files. GAUSS is then told where the files are and reads them in when it prepares to run a program. Separating the code over several files makes no difference to the running of the program or the memory used. This is because all GAUSS does is to insert the file into the main program file before running.

The command for this is

#INCLUDE fileName;

Note the hash sign #; this tells GAUSS that this command is something to be done when it is preparing the run (a compile time instruction). When the RUN command is given, GAUSS loads the program file into memory and then checks it for instructions of this sort (there are others, but less important for now). When it comes across the #INCLUDE, it inserts all the code in fileName at that point in the text of the main program file; in other words, the effect is just the same as if all the code that was in the file fileName had been written in the main program file.

If this is the case, then why bother with #INCLUDE? The reason is twofold. Firstly, it allows the code to be broken into a number of chunks. A small file is more easily read and edited than a large one. Global variables are more likely to be missed in a large file. If one part of code wants changing, then perhaps only one file needs to be edited, while other files can be left untouched.

Secondly, this allows code which is useful in a general context to be placed in a file for access by a number of programs. This saves duplicating code in a number of programs. Note that the effect is exactly the same as if the code had been duplicated; however, because the code used in several programs is in only one file, maintaining and updating the code is much easier than if the procedure had been copied and inserted into each file separately.

The #INCLUDE files can be nested: one #INCLUDEd file may contain another #INCLUDE. If the same file is #INCLUDEd twice, then it should have no effect unless the program redefines some of the variables or procedures in the #INCLUDE file between #INCLUDEs. The file name should be a constant string. It may include a complete path, in which case GAUSS will only look in the specified directory; or it may just be the file name, in which case GAUSS will search in a number of "standard" locations (usually starting in the GAUSS directory; see the manual for configuration information).

3.1 Examples

Suppose the user has written a number of useful input and output routines, and stored them in two files "InUtils.GL" and "OutUtils.GL"; the first file is in the directory C:\GAUSS, and the second is in the sub-directory OUTPUT. Then

#INCLUDE "InUtils.GL";
#INCLUDE "C:\GAUSS\OUTPUT\OutUtils.GL";

would lead to both these files being incorporated into the program. Note that the complete contents of the file are inserted into the main program file. If there is a lot of extraneous material in the #INCLUDEd files, then all this will be brought in even though it is unused. For this reason, files containing general-purpose routines should not be enormous files with every possible useful function in them, but relatively small and pertinent.

As an illustration, suppose the user has written ten input procedures. Placing them in one file means that all ten procedures will be incorporated into any program using just one procedure. Placing each procedure in a different file means that only the minimum amount of code is incorporated into any program; however, a program then might need ten #INCLUDEs, and it may be difficult keeping track of each file.

For examples of #INCLUDE in use, see the code samples on this site.


4 Documentation

Documentation for a program can be intended for the end user or the programmer. This coursebook is not concerned with the former. For the latter, the need for documentation is directly related to the complexity of the program.

A basic level of documentation should always be associated with a program: at a minimum, some description of what the program does, how it does it, and what results it should produce. The best programs will be self-documenting, achieved through
  • copious comments
  • sensible variable and procedure names
  • intelligent structuring of code
Among the comments should be: notices of changes made to the code; descriptions of procedures and parameters; explanations of particularly complex or abstruse operations.

Added to this should ideally be some sort of paper documentation. The more complex parts of an operation should be explained in detail if necessary. The cross-product program, above, has a large amount of documentation on the underlying matrix algebra and some on the statistical basis (but admittedly is badly documented on the general features; still, that's what self-documentation is all about).

Again, much of this depends on the program that has been written, its longevity, its distribution, and the people who will edit it in future. However, even if the original programmer will be the only person to look at or edit the program, some investment in documentation will always be worth it.

In addition, documentation will often be a natural result of the development process: the matrix algebra for the cross-product program is well-specified because exactly which equations were needed had to be pinned down before programming could begin. Commenting on pieces of code (especially procedures) as they are written forces the programmer to be specific about the purpose of a particular action. A well-documented program is not necessarily more efficient; but the chances of it being correct are rather better.
