High-Quality Software Through Semiformal Specification and Verification

Allan M. Stavely
Computer Science Department
New Mexico Tech
Socorro, NM 87801 USA
al@nmt.edu

A version of this paper was presented at the 12th Conference on Software Engineering Education and Training (CSEE&T'99), March 22--24, 1999, New Orleans, Louisiana, USA.

Introduction

Since 1993 I have been teaching the most effective software engineering course I have ever taught, as measured by the quality of the programs that students can produce at the end of it. The core of the course is semiformal specification and verification, using techniques adapted from the method called "Cleanroom Software Engineering" [Mills], [Linger].

Yes, we really do verify all of our programs, even programs of realistic size that do real work --- especially those. By avoiding excessive formality while retaining the discipline of mathematical reasoning, we are able to do verification with reasonable effort and apply it to real programs; the techniques seem to scale up to large programs very naturally. And we use the verification, not as an end in itself, but as a tool for detecting and removing defects.

In what follows, I will first describe the principles and techniques on which the course is based. I will then describe the course and how I teach it, and finish by presenting results from student projects and from real development projects done using the methods of the course.

Cleanroom

The Cleanroom method, developed by the late Harlan Mills and his colleagues at IBM and elsewhere, attempts to do for software what cleanroom fabrication does for semiconductors: to achieve quality by keeping defects out during fabrication. In semiconductors, dirt or dust that is allowed to contaminate a chip as it is being made cannot possibly be removed later. Yet that is essentially what we attempt in software when we write programs full of bugs and then try to remove them all by debugging.

The Cleanroom method, then, uses a number of techniques to develop software carefully, in a well-controlled way, so as to avoid or eliminate as many defects as possible before the software is ever executed. Elements of the method include precise specification of each part of the program, verification of the code against its specifications in team reviews before the code is ever executed, development of the software in small increments, and testing driven by expected patterns of usage, with the results used for statistical quality control.

Cleanroom practitioners have shown their methods to be very effective in a number of real-world projects at IBM and other organizations ([Linger] p. 56, [Hausler] p. 92, [Trammell]) and in a controlled experiment in an academic environment [Selby]. The levels of quality obtained using Cleanroom methods have been impressive. Estimates of industry averages for defect density, in defects per thousand non-comment lines of code, vary widely (see, for example, [DeMarco] and [Jones]), but every published estimate that I have seen is many times the Cleanroom figure quoted below.

In contrast, the average for Cleanroom-developed code in 17 real-world projects from 1987 to 1994, a total of almost a million lines of code, was 2.3 defects per thousand lines of code measured from the time of first execution in testing ([Linger] p. 56). This is clearly much better than the industry average, whichever estimate is used for the latter. The testing then removed most of these defects; many of the Cleanroom-developed systems showed no errors at all when put into production use.

Cleanroom practitioners also claim that their methods add nothing to effort, schedule or cost, and often even increase productivity and shorten development time: the extra time spent in specification and verification is more than recovered in greatly reduced time spent in debugging and defect correction. Again, the data from real projects support these claims ([Linger] p. 56).

The right level of formality

Journal papers on Cleanroom [Mills], [Selby], [Cobb] began to appear in the late 1980s, presenting the method as described in the previous section. I read them with great interest. I found the concept appealing, and if the early reports of successful Cleanroom projects were any indication, this was a technology well worth paying attention to.

I was especially eager to see how verification was done in Cleanroom. I had taught, and occasionally actually done, program verification using traditional methods (Hoare's axioms, for example; see [Hoare] and, for example, [Baber]) for many years. However, as sympathetic as I was to the idea, I had to admit that this kind of verification was tedious and time-consuming, and not practical for everyday use. It could be done on small programs, as academic exercises, and occasionally on a few larger programs, in cases in which unusual safety or security requirements justified the substantial extra effort. But I had trouble arguing that all programming projects in the real world should be done using verification. My students didn't believe it and, frankly, neither did I.

Yet Mills and his colleagues claimed to be performing verification routinely on typical real-world applications, tens of thousands of lines long! Furthermore, they claimed a decrease in overall effort when they used verification.

Unfortunately, the papers didn't explain how the verification was done. They gave tantalizing hints in statements like these ([Mills] p. 22):

The method ... used in Cleanroom development, called functional verification, is quite different than the method of axiomatic verification usually taught in universities. It is based on functional semantics and on the reduction of software verification to ordinary mathematical reasoning about sets and functions as directly as possible.

But nothing that had been published at that time described the process in enough detail that I could learn to do it myself.

Then in June of 1992 I had the great good fortune to attend a workshop on Cleanroom methods. It was organized specifically for undergraduate faculty, sponsored by NSF, and conducted by members of IBM's Cleanroom Software Technology Center. In fact, this "workshop" was a condensed version of IBM's in-house training course.

I learned that "functional semantics" means that program statements are treated as functions from one program state to another. These functions are described by giving the new values of any variables whose values are changed, as functions of the values of variables in the original state. This is a more direct way of describing computations than the assertions used in axiomatic verification, which state facts about values of variables rather than stating what the values actually are.
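To make the contrast concrete, here is a small illustration of my own (the variable names are hypothetical, not taken from the Cleanroom papers). The functional description states the new value of the one variable that changes; the axiomatic description instead states facts about the values before and after, using a logical variable T for the original value of total:

    [ total := total + price * quantity ]
    total = total + price * quantity;

    { total == T }
    total = total + price * quantity;
    { total == T + price * quantity }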

Specifications of programs and parts of programs, then, are written as "intended functions", and verification means showing that each piece computes what its intended function says that it should compute. The intended functions may contain mathematical formulas, programming-language expressions, English, or any other notation that is appropriate.

Intended functions are written for every part of the code, down to the level of each control construct and nontrivial statement. This discipline forces the programmer to document every part of the code and to think very carefully about each step. Furthermore, it places a high premium on clear and simple intended functions and code, and code that obviously matches its intended function. Many errors are prevented in this way alone.

Here is a simple example of code annotated with intended functions, as it might appear in a C program:

 1  [ report := description of each item in selected_items ]
 2  {
 3      itemlist L = selected_items;
 4
 5      [ report := empty ]
 6      reset(report);
 7
 8      [ report, L := report || description of each item in L, empty ]
 9      while (L != NULL)
10      {
11          [ report, L := report || description of first item in L,
12              all but first item in L ]
13          put_description(report, L->first);
14          L = L->rest;
15      }
16  }

The intended functions are the parts in square brackets; each one is the specification for the code immediately below it. (The operator "||" denotes concatenation.)

Verification is done following the structure of the control constructs. In this case, the top level is the sequence of the statement on line 3, the computation on lines 5 and 6, and the computation on lines 8 through 15. We must show that the composition of the function computed by line 3 (which is obvious), the function on line 5, and the function on line 8 does what the intended function on line 1 specifies; we don't even look at the code below the functions in this step. By substitution of selected_items for L and "empty" for report in line 8, the composition is

[report, L := empty || description of each item in selected_items,
              empty ]

which (since concatenating empty with anything leaves it unchanged, and since the effect on L is not visible outside the braces on lines 2 and 16) agrees with line 1.

To verify that the while-loop computes its intended function, we would ask:

  1. Does the loop always terminate? From the context of the rest of the program, we would see that an itemlist is a linear linked list terminated by NULL, so that repeatedly computing

    [ L := all but first item in L ]
    
    will eventually leave L containing NULL, if it is not initially NULL.

  2. When the loop condition is false, does doing nothing compute the loop's intended function? Yes, because when L is NULL the set "each item in L" is empty.

  3. When the loop condition is true, does executing the loop body, followed by computing the loop's intended function in the resulting state, compute the loop's intended function from the state before the loop is entered? By substituting the results of the function on line 11 into the function on line 8, we obtain

    [ report, L := (report || description of first item in L)
                           || description of each item in
                                  (all but first item in L),
                    empty ]
    
    which reduces to
    [ report, L := report || description of each item in L, empty ]

    because the description of the first item in L, followed by the descriptions of all the rest, is simply the description of each item in L; this is exactly the loop's intended function on line 8.

    Notice how we can do this reasoning just by manipulating the English phrases in simple ways; we don't need to formalize what an "item" or a "description" is, or even spend time thinking about what the terms mean.

The remaining steps are to show that the statements on line 6 and lines 13--14 compute their respective intended functions. To see what each procedure call computes, we would look at the intended functions for the corresponding procedure definitions and substitute the values of the arguments. The rest is trivial.
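For example, the definitions of reset and put_description are not shown here, but their declarations might carry intended functions along these lines (the parameter types are assumptions, made purely for illustration):

    [ r := empty ]
    void reset(report_t r);

    [ r := r || description of item ]
    void put_description(report_t r, item_t item);

To verify line 6, we substitute report for r in the intended function of reset and obtain [ report := empty ], which is exactly line 5; line 13 is handled the same way using the intended function of put_description.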

Cleanroom-style verification is done, not by a lone programmer endlessly pushing symbols around on paper, but by a group of developers discussing each correctness issue among themselves, as a group of mathematicians might do. This is done in a review meeting, led by the program's author and including several other team members.

The author of the code would present the correctness arguments very much as I have just done in the example. The other members of the team might raise questions or suggest ways of improving the code, and would be watching for mistakes in the code, the intended functions, and the correctness arguments. Each part is considered verified when all members agree that it is correct.

The team members do not do detailed written proofs of every point. If a part of the program obviously does what its intended function says, the team members note that fact and go on. In more complicated cases, they might spend a bit more time discussing why a particular section is correct, perhaps drawing pictures or writing and simplifying expressions on a whiteboard. They will do detailed derivations and proofs, using the full power of mathematics, only when absolutely necessary.

This, then, is the way that verification is made really practical: it is done at the appropriate level of formality, and no more formally than is necessary. It is not desirable to prove every point using full formality, right down to the level of mathematical axioms. In fact, in most cases it is counterproductive and a waste of time. After all, even mathematicians don't do it in all their work --- why should programmers?

Furthermore, the real purpose of the verification is not to establish that the code is a mathematically perfect object. On the contrary, the verifiers assume that there may be defects in the code, and they watch carefully for places where the verification fails. This is the way that defects are detected. The value of the verification is not in the proofs themselves; it is in the bugs that are found and removed in the process.

If a bug fix is obvious, the correction is made on the spot and the verification continues; otherwise, the specification and code are sent back to the author for rework. The review process also encourages careful examination of the program with respect to all aspects of quality, much as in other kinds of inspection meetings [Fagan]. If the reviewers conclude that any part of the code is lacking in some aspect of quality, that part is sent back for rework.

After all of the defects found in the review are fixed and the program is successfully verified, it is not assumed to be perfect even then. That is because the verifiers are human, and may make mistakes and overlook defects. Therefore, the program is tested thoroughly. It is not too unusual to find a few bugs. Fortunately, most such bugs are simple oversights that show up very early in testing and are easy to diagnose and fix. They are seldom the deep or subtle algorithm flaws that can escape detection until they cause mysterious failures in production use much later. It turns out that the latter are the kinds of defects that the verification is particularly good at exposing.

The testing serves another purpose that is even more important: as quality control. If too many bugs are found in testing, it is a sign that the development, and the verification in particular, were not done carefully enough. The code is sent back to the developers to be redone, and steps are taken to tighten up the development process. Data from the testing can even be used to draw statistical inferences about the reliability of the final product.

Thus, the Cleanroom approach to specification and verification, even though solidly based on mathematics and formal methods, is a very pragmatic one. These are its key characteristics: (1) using the "functional" style, (2) verifying for the purpose of finding bugs, (3) verifying by group interaction, (4) testing to complement the verification, and, probably most importantly, (5) using as much formality as necessary, and no more.

The course

I immediately returned to my college and began developing a course based on these principles. I offered it for the first time in the spring semester of 1993, and have taught it five times in all (as of the end of 1998).

The course is listed as Computer Science 427, "Zero-Defect Software Development". I mean the title to be provocative. I explain to the students that we can't really achieve zero defects consistently, even by using Cleanroom methods. But we can come a lot closer than most people do, and "zero defects" should be our goal.

Most students who have taken the course have been computer science undergraduates in their third or fourth year. A few have been graduate students, and a few have been programming professionals who are not full-time students.

The programming prerequisite for the course is the third course in our sequence, which happens to be systems programming; this is not for any specific systems programming topics, but just for a certain level of programming maturity and experience. Actually, most students who take the course have more programming training and experience than this. An additional prerequisite has been a first course in discrete mathematics; this requirement is now redundant, since our students are now required to take such a course in their first year.

The textbook is my own [Stavely]. I found that no previous book on Cleanroom was suitable for a course such as mine. As often happens, the book evolved from a collection of notes distributed to the students in place of a textbook.

I have modified and adapted the standard Cleanroom methods somewhat for the course. In most cases, the adaptations have been simplifications. One important omission is the use of "box structures": I find that my students can do successful specification and verification of the well-structured programs that they already know how to write.

Specification and verification are the core topics in the course. I also cover Cleanroom-style testing, which is done based on patterns of expected usage rather than code coverage, and the incremental development which is another standard part of the Cleanroom process. I omit statistical inference from the results of testing, because I consider it a more advanced topic and don't want to limit my course to students who have the necessary background in statistics.
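To give the flavor of usage-based test selection, here is a minimal sketch of my own (the operations and their probabilities are invented for illustration, and this is not code from the course materials or the Cleanroom literature). Each step of a test script is chosen at random with the frequency that the corresponding operation is expected to have in production use:

    #include <stdio.h>
    #include <stdlib.h>

    /* hypothetical user operations and their assumed usage probabilities */
    static const char *operation[] = { "view item", "add item", "delete item" };
    static const double probability[] = { 0.80, 0.15, 0.05 };

    int main(void)
    {
        /* [ print a 20-step test script in which each step is chosen at
             random according to the usage probabilities above ] */
        for (int step = 0; step < 20; step++) {
            double r = (double)rand() / ((double)RAND_MAX + 1.0);
            double cumulative = 0.0;
            int chosen = 2;               /* fall back to the last operation */
            for (int i = 0; i < 3; i++) {
                cumulative += probability[i];
                if (r < cumulative) {
                    chosen = i;
                    break;
                }
            }
            printf("step %2d: %s\n", step + 1, operation[chosen]);
        }
        return 0;
    }

Real Cleanroom testing uses much richer usage models and statistical analysis of the results, but the principle is the same: test cases are drawn from the expected pattern of use rather than from the structure of the code.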

I usually spend the first week giving an introduction to the Cleanroom process, followed by the concepts of functions and intended functions. This requires introducing a bit of notation, like the notation used in the example above, but not much. The next topic is the theory of verification, including the correctness questions that must be answered for each of the standard control constructs. I use plenty of examples, of course; wherever possible these are realistic sections of code in a real programming language such as C. I spend a lot of time on how to write good intended functions, because this is critical for later success. Intended functions must express accurately what you intend to specify, they must be clear and unambiguous, they must be as simple as possible without omitting anything important, and they must be at the right level of formality.
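For example (an illustration of my own, not an exercise from the textbook), an intended function such as

    [ process the list ]

is too vague to verify anything against, whereas

    [ total := sum of the amount fields of all records in list ]

states exactly what the new value of total must be, and so gives the reviewers something definite to compare the code with.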

By the third week the students should be ready for a real exercise, the writing and verification of a very small program. One that I often assign at this point counts characters, words and lines in a stream of ASCII text (like the Unix utility wc). This assignment is deceptively tricky: a student must choose whether to count beginnings of words or ends of words, and what to do about possible special cases at the beginning and end of the input. A poor combination of choices, or any other aspect of the code that is complicated or obscure, will force the student's intended functions to be complicated, and then the verification will be long and tedious. Students see very quickly that writing verifiable code means working hard to make the code as clean and simple as it can be.
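For concreteness, here is a minimal sketch of my own of the kind of program involved (not a student solution from the course), written to count word beginnings so that no special case is needed at the end of the input:

    #include <stdio.h>
    #include <ctype.h>

    /* [ print the numbers of lines, words and characters in the text on
         standard input, where a word is a maximal run of non-whitespace
         characters ] */
    int main(void)
    {
        long chars = 0, words = 0, lines = 0;
        int c;
        int in_word = 0;    /* was the previous character part of a word? */

        /* [ chars, words, lines, in_word := chars + number of characters
             remaining on input, words + number of word beginnings remaining
             on input, lines + number of newlines remaining on input,
             anything ] */
        while ((c = getchar()) != EOF) {
            chars++;
            if (c == '\n')
                lines++;
            if (!isspace(c)) {
                if (!in_word)
                    words++;        /* a new word begins here */
                in_word = 1;
            } else {
                in_word = 0;
            }
        }
        printf("%ld %ld %ld\n", lines, words, chars);
        return 0;
    }

Counting beginnings rather than ends of words means that a final word not followed by whitespace needs no special handling, and the intended functions stay correspondingly simple.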

Once the students think that they have a successful solution to the exercise, I display a few of their programs to the rest of the class. (I find that students are usually willing to volunteer their programs for this purpose.) First we examine the intended functions to see if they look adequate. Often they will not be, and I and the other students make suggestions for improving them. But if they appear to be, I then attempt a few steps of the verification. Often this will fail, because the intended functions and the code really don't match. Usually the code actually works, but the intended functions don't accurately express what the code should be doing. The students learn a lot from seeing exactly why their early attempts are not quite right.

Specification and verification are skills that require practice to master, and so I normally assign two more programs that are a bit more complicated than this one, but not much. Meanwhile, I lecture on the protocol for group verification reviews. By the third assignment, I ask the students to verify their own programs in groups of three or four people. I circulate around the room, listen in, and mediate or give hints where necessary. By this time most students are writing programs and intended functions that really are verifiable, so the reviews usually go reasonably smoothly.

Once a group claims that a program is verified, I test it, playing the role of an independent testing team, and tell the group of any errors that I find. Occasionally I see a program or section of code that obviously doesn't verify, and then I ask questions. It almost always happens that the group didn't really verify it --- they merely inspected it to see if it "looked right". This is not acceptable procedure, of course, and it is important to watch out for it.

I also encourage the students to keep records of defects found in their programs and whether they are found in review, compilation or test, as well as other data such as code size and time spent in each activity of program development. Once students are in the habit of doing this, they can use the data to track their progress and measure their performance against their expectations, as in the Personal Software Process [Humphrey].

By this time we are at or just beyond the middle of the semester. At this point I assign a larger programming project for the students to do in teams, due at the end of the semester. The first three times, the class was small enough (no more than eight students) that I made the whole class a team. When enrollment is larger, the class is broken into teams of three or four people. I try to choose a project that will result in about a thousand lines of code (not counting blank lines and lines containing only comments or intended functions), developed in at least two increments.

Much of the project work is done outside of class time. Meanwhile, I cover the rest of the lecture material. This includes Cleanroom-style testing and incremental development, and special techniques for specifying and verifying such things as data structures and object-oriented programs, and recursion and programs in functional programming languages. If time permits, I may introduce other topics, including other formal methods in software development.
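As a small taste of the recursion topic (my own illustration, not an example from the textbook), a recursive function carries an intended function just as a loop does, and verifying it means showing that the body computes that function on the assumption that the recursive call meets its own intended function, plus showing that the recursion terminates:

    [ return value := number of items in L ]
    int length(itemlist L)
    {
        if (L == NULL)
            return 0;

        [ return value := 1 + number of items in all but the first item of L ]
        return 1 + length(L->rest);
    }

The termination argument is the same as for the loop in the earlier example: each recursive call receives a strictly shorter list, which must eventually become NULL.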

For further details on the material covered in the course, please see [Stavely].

Results: the class projects

One good measure of the effectiveness of the course is the quality of the students' final projects. Here are the results.

Each project was intended to be a real application program constructed to meet a real need; two of the five are still being used. The projects were a program to generate and fill in forms in the TeX markup language, an electronic mail handler for a subset of the SMTP protocol, a program to trace a network of HTML pages on local and remote machines and report broken links, a suite of programs to gather statistics on web usage from log files, and a program to generate stereotyped HTML pages from (almost) plain text files.

The first three projects produced programs of just under 1,000 lines each; the defect densities were 7.3, 6.5 and 2.1 per thousand lines respectively, measured from first execution. These results are not quite as good (on average) as those from industry Cleanroom projects, but are, in my opinion, quite impressive for undergraduates with no industrial programming experience and no previous exposure to formal methods.

The fourth and fifth times the course was offered, the results of the projects cannot be compared directly with those of the first three times, because the conditions were different. The fourth time, I allowed the students to use a programming language (Perl) which many of them did not know, because they wanted to learn it while doing the project. This was a mistake. The students let too many errors go undetected in verification because of their unfamiliarity with the language. The resulting bug density, over 20 per thousand lines, was higher than it should have been, and did not impress the students with the effectiveness of the method as strongly as it should have. The project was successful even so: the bugs were easy to find and fix, and the programs were soon running with no errors, as we normally expect. Still, I should have insisted that the students do their projects in a language that they knew very thoroughly.

The fifth time, a few students were not fully committed to the course, didn't do all of the assignments, and never really mastered the techniques. A few others didn't complete their projects because of unusual external demands on their time. However, of the increments of the project that were actually completed, most had bug densities after verification of less than ten per thousand lines. For some the numbers were much lower, and one student's first increment of 501 lines had no bugs at all.

Thus, except for a few anomalies, the students' final projects have been very successful. The record shows that, under normal circumstances, any competent student, applying the methods taught in the course with a reasonable amount of care, should be able to construct software containing ten or fewer bugs per thousand lines by the end of the course. Again, this is before the code is ever executed, and most of these bugs will be removed in testing. By current real-world standards, this is an impressive level of quality, especially for students.

Results: outside the class

Being in the academic world and not in industry, my students and I seldom have an opportunity to undertake really large software development projects, such as those of tens or hundreds of thousands of lines that have been reported in the Cleanroom literature. However, graduates of the course and I have constructed several real applications of modest size using the methods that I teach. Here are the results.

To gain experience in the methods that I was teaching, in 1995 I conducted a small-scale development project to produce a real application program for a real client. The client was the New Mexico Tech registrar, who needed a program to detect time conflicts and missing information of various kinds in a proposed class schedule for the college.

The development went very smoothly. The resulting program was 478 nonblank, non-comment lines in the C language. The development took 35 person-hours, broken down as follows (in hours and minutes):

requirements gathering and writing specification document      8:25
design and coding                                               9:10
verification review (4:05 elapsed time, 2 people)               8:10
removal of defects caught during review, and reverification     4:15
compiling and testing                                           5:00
debugging                                                       0:00

The number of participants in a verification review is typically three to eight. However, it happened that I had only one other trained person available to help me: John Shipman, an early graduate of the course. Thus we used a verification team of only two people.

In the verification reviews, we caught four bugs. We overlooked two, both of which I happened to notice while reading the code again after the verification. Perhaps a larger verification team would have caught one or both of these. Still, the bug density after verification was only 4.2 per thousand lines. No bugs at all were found in testing. The program has now been in production use for three years, with no malfunctions ever observed.

I also helped Mr. Shipman to verify parts of five programs which he wrote for his consulting and contracting business (Zoological Data Processing, Socorro, NM) and for the New Mexico Tech Computer Center. The verified parts totalled 3,022 nonblank, non-comment lines in the Icon and Python languages. In all, 29 bugs were left after verification, for a bug density of 9.6 per thousand lines at first execution. This number is somewhat misleading, however: 13 of these bugs were actually mistakes in types of variables, number and types of arguments to functions, etc. Icon and Python do not have strong typing; in languages with strong typing, these errors would have been detected by the compiler. When these errors are subtracted, the bug density was 5.3 per thousand lines. Again, this was with a verification team of only two people.

More recently, another graduate of the course, Alex Kent, developed a program using the methods of the course in his capacity as programmer for the New Mexico Tech Computer Center. The program was a printer-control system, written to replace an unreliable system already in place. The resulting code was 2,062 nonblank, non-comment lines of C.

Mr. Kent first tried to eliminate as many bugs as possible by himself using semiformal verification. He found and removed 17 bugs in this way. He then used a three-person group to verify the resulting code in the normal way; they detected 11 further bugs. Five more bugs were detected in testing. As of this writing, the program has been in production use on one printer for about a week, and has processed close to a thousand print jobs with no errors. Assuming that no bugs remain, the bug density after the one-person verification was 7.6 per thousand lines, which the three-person verification reduced to 2.4 per thousand lines and the testing reduced to zero.

All of these projects followed the expected pattern for programs developed in the Cleanroom style. Development went smoothly; there were no unpleasant surprises, all-night debugging sessions, or periods of panic. Bug density after verification was low, of the expected order of magnitude. Testing detected most or all of the remaining bugs, and most of those bugs were easy to diagnose and fix. The programs were soon up and running, and they kept running with few or no malfunctions after that.

Conclusion

There is no magic in the techniques that I teach in this course --- in fact, in hindsight, the whole method seems less a high-tech breakthrough than an application of good common sense. As is often the case, we owe this "common sense" to the brilliant insights and hard work of those who came before us! In any case, the method works, reliably and cost-effectively. And it can be taught, even to undergraduates.

These techniques give us one way of constructing software that can truly be said to be engineered, not just cranked out by ad-hoc coding and debugging that we try to keep under control through project management, as in much of what has been called "software engineering". Methods like these --- methods that are firmly grounded in fundamentals, that are adapted to the needs of the real world while retaining their essence, and that really do work in practice --- are what engineering is all about. Perhaps some day, when our field is truly mature, methods with these properties will be accepted as the core of software engineering education.

References

[Baber] Robert Laurence Baber. Error-Free Software: Know-how and Know-why of Program Correctness. Chichester, England: Wiley, 1991.
[Cobb] Richard H. Cobb and Harlan D. Mills. "Engineering software under statistical quality control." IEEE Software 7, 6 (November 1990), 44-54.
[DeMarco] Tom DeMarco. Controlling Software Projects: Management Measurement & Estimation. Englewood Cliffs, N.J.: Prentice-Hall, 1982.
[Fagan] Michael E. Fagan. "Advances in software inspections." IEEE Transactions on Software Engineering, vol. SE-12, no. 7 (July 1986), 744-751.
[Hausler] P. A. Hausler, R. C. Linger and C. J. Trammell. "Adopting Cleanroom software engineering with a phased approach." IBM Systems Journal, vol. 33, no. 1, 1994, 89-109.
[Hoare] C. A. R. Hoare. "An axiomatic basis for computer programming." Communications of the ACM 12, 10 (October 1969), 576-583.
[Humphrey] Watts S. Humphrey. Introduction to the Personal Software Process. Reading, Mass.: Addison-Wesley, 1997.
[Jones] Capers Jones. Programming Productivity. New York: McGraw-Hill, 1986.
[Linger] Richard C. Linger. "Cleanroom process model." IEEE Software 11, 2 (March 1994), 50-58.
[Mills] Harlan D. Mills, Michael Dyer and Richard C. Linger. "Cleanroom software engineering." IEEE Software 4, 5 (September 1987), 19-24.
[Selby] Richard W. Selby, Victor R. Basili and F. Terry Baker. "Cleanroom software development: an empirical evaluation." IEEE Transactions on Software Engineering, vol. SE-13, no. 9 (September 1987), 1027-1037.
[Stavely] Allan M. Stavely. Toward Zero-Defect Programming. Reading, Mass.: Addison Wesley Longman, 1999.
[Trammell] Carmen J. Trammell, Leon H. Binder and Catherine E. Snyder. "The automated production control system: a case study in Cleanroom Software Engineering." ACM Transactions on Software Engineering and Methodology 1, 1 (January 1992), 81-94.

Acknowledgement: most of the preparation of this page was done by Addison Wesley Longman, publisher of my book Toward Zero-Defect Programming, from text supplied by me (Stavely).