Software Bugs
Fault Model for Software Components:
Bohrbugs:
This class of software faults is easily reproducible and hence can be
easily removed.
Ideally, these faults should have been removed during the debugging phase.
If such faults remain in the operational phase, then the only
way out is design diversity, wherein applications that provide the same
functionality but use different designs and implementations are combined to mask
faults in individual implementations.
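As a rough illustration of design diversity, the sketch below runs three independently written sort routines and returns the answer that a majority agrees on. The function names (`sort_builtin`, `sort_insertion`, `sort_buggy`, `vote`) are hypothetical and exist only for this example; `sort_buggy` contains a deliberate, reproducible fault (it drops duplicates) that the voter masks.

```python
from collections import Counter


def sort_builtin(xs):
    # Version 1: relies on the language's built-in sort.
    return sorted(xs)


def sort_insertion(xs):
    # Version 2: an independently written insertion sort.
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def sort_buggy(xs):
    # Version 3: deliberately faulty (a Bohrbug): duplicates are lost.
    return sorted(set(xs))


def vote(versions, data):
    """Run every version on a copy of the input and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(tuple(version(list(data))))
        except Exception:
            continue  # a crashing version simply loses its vote
    if not results:
        raise RuntimeError("all versions failed")
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return list(winner)


print(vote([sort_builtin, sort_insertion, sort_buggy], [3, 1, 3, 2]))
```

The voter only helps when the versions fail independently; if two versions share the same design fault, the wrong answer can win the vote.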
Heisenbugs:
Obviously, most design faults in software are likely to have been detected and
removed during testing and subsequently as a result of feedback during
field use. However, even mature software can be expected to have what
are known as "Heisenbugs"[GRAY 1986]. These are bugs in the software that
are revealed only during specific collisions of events. For instance, a
sequence of operations may leave the software in a state that results in an
error on an operation executed next. Synchronization oversights in multithreaded
software are another example, where errors occur during some executions, but do
not occur when repeated. Such errors are said to be caused by transient faults.
Simply retrying a failed operation or, if the application process has crashed,
restarting the process (the restart could be performed by middleware providing
Software Implemented Fault Tolerance, SIFT) might resolve the problem.
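The retry idea can be sketched in a few lines. Everything here is an assumption made for illustration: `TransientError` is a hypothetical marker for failures believed to be transient, and `make_flaky` simulates a Heisenbug by failing a fixed number of times before succeeding. Real SIFT middleware would also handle process restart, which this sketch does not attempt.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures believed to be transient."""


def retry(operation, attempts=3, delay=0.0):
    """Call `operation` up to `attempts` times, re-raising the last transient error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay)  # give the transient condition time to clear
    raise last_error


def make_flaky(failures):
    """Build an operation that fails `failures` times and then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientError(f"transient failure on call {state['calls']}")
        return state["calls"]

    return operation


print(retry(make_flaky(2)))  # the third call succeeds
```

Only errors classified as transient are retried; any other exception propagates immediately, since retrying a reproducible fault would fail the same way every time.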
Software failure due to resource exhaustion:
Another type of fault observed in software systems is due to the phenomenon of
resource exhaustion. Operating system resources such as swap space and free
memory are progressively depleted by software defects such as
memory leaks and incomplete cleanup of resources after use. These faults may
exist in the operating system, middleware, and the application software. The
estimation of the rate of resource exhaustion and consequently the expected
time of software failure has been the focus of research on "software rejuvenation"
techniques. Periodically restarting a process or rebooting a node, or performing
prediction-based rejuvenation based on the observed rate of resource exhaustion,
may help prevent the software (operating system, middleware, or
application) from crashing. For more details, please see the ISSRE '98
paper and the ISSRE '99 paper.
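A minimal sketch of prediction-based rejuvenation follows. It fits a plain least-squares line to samples of free memory and extrapolates to the time at which free memory would reach zero. The linear model, the function names (`estimate_exhaustion_time`, `should_rejuvenate`), the units, and the sample values are all assumptions of this sketch, not the estimation method of the cited papers.

```python
def estimate_exhaustion_time(samples):
    """Estimate when free memory reaches zero.

    `samples` is a list of (time_seconds, free_megabytes) pairs.
    Returns the extrapolated time, or None if no downward trend is found.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        return None  # all samples share one timestamp
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    slope = sxy / sxx  # megabytes per second; negative means depletion
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_t
    return -intercept / slope


def should_rejuvenate(samples, now, safety_margin):
    """Return True when predicted exhaustion is within `safety_margin` seconds of `now`."""
    eta = estimate_exhaustion_time(samples)
    return eta is not None and eta - now <= safety_margin


# Hypothetical readings: free memory falls by 60 MB every minute.
readings = [(0, 1000), (60, 940), (120, 880), (180, 820)]
print(estimate_exhaustion_time(readings))
print(should_rejuvenate(readings, now=180, safety_margin=600))
```

A monitor could call `should_rejuvenate` on each new sample and schedule a restart during a quiet period once it returns True, rather than waiting for the process to crash.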
References:
[GRAY 1986] J. Gray, "Why do computers stop and what can be done about it?",
Proc. of 5th Symp. on Reliability in Distributed Software and Database Systems,
pp. 3-12, January 1986.