Chapter 10 Metrics and Models in Software Testing
How do we measure the progress of testing? When do we release the software? Why do we devote more time and resources to testing a particular module? What is the reliability of the software at the time of release? Who is responsible for the selection of a poor test suite? How many faults do we expect during testing? How much time and how many resources are required to test a software product? How do we know the effectiveness of a test suite? We may keep on framing such questions without much effort. However, finding answers to such questions is not easy and may require a significant amount of effort. Software testing metrics may help us to measure and quantify many things, which in turn may provide answers to such important questions.
10.1 Software Metrics
“What cannot be measured, cannot be controlled” is a reality in this world. If we want to control something, we should first be able to measure it. Therefore, everything should be measurable. If a thing is not measurable, we should make an effort to make it measurable. The area of measurement is very important in every field, and most fields have mature and established metrics to quantify various things. However, in software engineering this “area of measurement” is still in its developing stage and may require significant effort to make it mature, scientific and effective.
10.1.1 Measure, Measurement and Metrics
These terms are often used interchangeably. However, we should understand the difference amongst these terms. Pressman explained this clearly as [PRES05]:
“A measure provides a quantitative indication of the extent, amount, dimension, capacity or size of some attributes of a product or process. Measurement is the act of determining a measure. The metric is a quantitative measure of the degree to which a product or process possesses a given attribute”. For example, a measure is the number of failures experienced during testing. Measurement is the way of recording such failures. A software metric may be average number of failures experienced per hour during testing.
Fenton [FENT04] has defined measurement as:
“It is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to describe them according to clearly defined rules”.
The basic issue is that we want to measure every attribute of an entity. We should have established metrics to do so. However, we are in the process of developing metrics for many attributes of various entities used in software engineering.
Software metrics can be defined as [GOOD93]: “The continuous application of measurement based techniques to the software development process and its products to supply meaningful and timely management information, together with the use of those techniques to improve that process and its products.”
Many things are covered in this definition. Software metrics are related to measures which, in turn, involve numbers for quantification. These numbers are used to produce a better product and to improve its related process. We may like to measure quality attributes such as testability, complexity, reliability, maintainability, efficiency, portability, enhanceability, usability etc. for a software product. We may also like to measure size, effort, development time and resources for a software product.
Software metrics are applicable in all phases of the software development life cycle. In the software requirements and analysis phase, where the output is the SRS document, we may have to estimate the cost, manpower requirement and development time for the software. The customer may like to know the cost of the software and the development time before signing the contract. As we all know, the SRS document acts as a contract between customer and developer. The readability and effectiveness of the SRS document may help to increase the confidence level of the customer and may provide better foundations for designing the product. Some metrics are available for cost and size estimation, like COCOMO, the Putnam resource allocation model, the function point estimation model etc. Some metrics are also available for the SRS document, like number of mistakes found during verification, change request frequency, readability etc.
In the design phase, we may like to measure the stability of a design, coupling amongst modules, cohesion of a module etc. We may also like to measure the amount of data input to a software, processed by the software and produced by the software. A count of the amount of data input to, processed in, and output from software is called a data structure metric. Many such metrics are available, like number of variables, number of operators, number of operands, number of live variables, variable spans, module weakness etc. Some information flow metrics are also popular, like fan-in and fan-out. Use cases may also be used to design metrics, like counting actors, counting use cases, counting number of links etc. Some metrics may also be designed for various applications of websites, like number of static web pages, number of dynamic web pages, number of internal page links, word count, number of static and dynamic content objects, time taken to search a web page and retrieve the desired information, similarity of web pages etc.
Software metrics have a number of applications during the implementation phase and after its completion. Halstead software size measures, like token count, program length, program volume, program level, difficulty, estimation of time and effort, language level etc., are applicable after coding. Some complexity measures are also popular, like cyclomatic complexity, knot count, feature count etc.
Software metrics have found a good number of applications during testing. One area is reliability estimation, where popular models are Musa’s basic execution time model and the logarithmic Poisson execution time model. The Jelinski-Moranda model [JELI72] is also used for the calculation of reliability. Source code coverage metrics are available that calculate the percentage of source code covered during testing. Test suite effectiveness may also be measured. Number of failures experienced per unit of time, number of paths, number of independent paths, number of du paths, percentage of statement coverage and percentage of branch conditions covered are also useful software metrics.
The maintenance phase may have many metrics, like number of faults reported per year, number of requests for changes per year, percentage of source code modified per year, percentage of obsolete source code per year etc.
We may find a number of applications of software metrics in every phase of the software development life cycle. They provide meaningful and timely information which may help us to take corrective actions as and when required. Effective implementation of metrics may improve the quality of software and may help us to deliver the software on time and within budget.
10.2 Categories of Metrics
There are two broad categories of software metrics namely product metrics and process metrics. Product metrics describe the characteristics of the product such as size, complexity, design features, performance, efficiency, reliability, portability, etc. Process metrics describe the effectiveness and quality of the processes that produce the software product. Examples are effort required in the process, time to produce the product, effectiveness of defect removal during development, number of defects found during testing, maturity of the process [AGGA08].
10.2.1 Product metrics for testing
These metrics provide information about the testing status of a software product. The data for such metrics are also generated during testing and may help us to know the quality of the product. Some of the basic metrics are given as:
(i) Number of failures experienced in a time interval
(ii) Time interval between failures
(iii) Cumulative failures experienced up to a specified time
(iv) Time of failure
(v) Estimated time for testing
(vi) Actual testing time
With these basic metrics, we may find some additional metrics as given below:
(ii) Average time interval between failures
(iii) Maximum and minimum failures experienced in any time interval
(iv) Average number of failures experienced in time intervals
(v) Time remaining to complete the testing.
We may design similar metrics to find the indications about the quality of the product.
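As a minimal sketch of how the derived product metrics above could be computed from the basic ones, the Python fragment below starts from a list of failure times. The failure times, the 20-minute interval width, the 80-minute test duration and the function name are illustrative assumptions, not data from this chapter.

```python
from statistics import mean

def derived_product_metrics(failure_times, interval_width, total_time):
    """Derive simple product metrics from failure times (all values in minutes)."""
    # Intervals between successive failures; the first is measured from time 0.
    gaps = [t - prev for prev, t in zip([0] + failure_times[:-1], failure_times)]
    # Failures falling in each fixed-width interval [k*w, (k+1)*w).
    n_intervals = -(-total_time // interval_width)  # ceiling division
    counts = [0] * n_intervals
    for t in failure_times:
        counts[min(t // interval_width, n_intervals - 1)] += 1
    return {
        "avg_time_between_failures": mean(gaps),
        "max_failures_in_interval": max(counts),
        "min_failures_in_interval": min(counts),
        "avg_failures_per_interval": mean(counts),
    }

# Hypothetical failure log: five failures observed in 80 minutes of testing.
metrics = derived_product_metrics([5, 12, 30, 45, 70], interval_width=20, total_time=80)
print(metrics)
```

Keeping the raw failure times, rather than only the summary values, lets us recompute such metrics later with a different interval width.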
10.2.2 Process metrics for testing
These metrics are developed to monitor the progress of testing, status of design and development of test cases and outcome of test cases after execution.
Some of the basic process metrics are given below:
(i) Number of test cases designed
(ii) Number of test cases executed
(iii) Number of test cases passed
(iv) Number of test cases failed
(v) Test case execution time
(vi) Total execution time
(vii) Time spent for the development of a test case
(viii) Total time spent for the development of all test cases
On the basis of the above direct measures, we may design the following additional metrics, which convert the base metric data into more useful information.
(i) % of test cases executed
(ii) % of test cases passed
(iii) % of test cases failed
(iv) Total actual execution time / total estimated execution time
(v) Average execution time of a test case
These metrics, although simple, may help us to know the progress of testing and may provide meaningful information to the testers and project manager.
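The conversion of base process measures into derived metrics can be sketched as follows. This is only an illustration: the class and field names are hypothetical, the counts are invented, and we assume that the percentage of test cases executed is taken relative to the test cases designed, while the pass and fail percentages are taken relative to the test cases executed.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    """Base process measures for one test cycle (hypothetical container)."""
    designed: int
    executed: int
    passed: int
    failed: int
    actual_exec_time: float      # total actual execution time, in hours
    estimated_exec_time: float   # total estimated execution time, in hours

def derived_process_metrics(s: RunSummary) -> dict:
    """Convert base measures into the derived process metrics."""
    return {
        "pct_executed": 100.0 * s.executed / s.designed,
        "pct_passed": 100.0 * s.passed / s.executed,
        "pct_failed": 100.0 * s.failed / s.executed,
        "actual_to_estimated_time": s.actual_exec_time / s.estimated_exec_time,
        "avg_exec_time_per_test": s.actual_exec_time / s.executed,
    }

# Invented numbers, for illustration only.
summary = RunSummary(designed=200, executed=160, passed=136, failed=24,
                     actual_exec_time=480.0, estimated_exec_time=400.0)
report = derived_process_metrics(summary)
print(report)
```

A ratio of actual to estimated execution time above 1 suggests that the test cycle is running behind its estimate.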
An effective test plan may force us to capture data and convert it into useful metrics for both process and product. The test plan also guides the organization in future projects and may suggest changes in the existing processes in order to produce a good quality, maintainable software product.
10.3 Object Oriented Metrics used in Testing
Object oriented metrics capture many attributes of a software and some of them are relevant in testing. Measuring structural design attributes of a software system, such as coupling, cohesion or complexity, is a promising approach towards early quality assessments. There are several metrics available in the literature to capture the quality of design and source code.
10.3.1 Coupling Metrics
Coupling relations increase complexity, reduce encapsulation and potential reuse, and limit understanding and maintainability. The coupling metrics require information about attribute usage and method invocations of other classes. These metrics are given in table 10.1. Higher values of coupling metrics indicate that a class under test will require a larger number of stubs during testing. In addition, each interface will need to be tested thoroughly.
Coupling between Objects (CBO): CBO for a class is a count of the number of other classes to which it is coupled.
Data Abstraction Coupling (DAC): Data abstraction is a technique of creating new data types suited for an application to be programmed. DAC = number of abstract data types (ADTs) defined in a class.
Message Passing Coupling (MPC): It counts the number of send statements defined in a class.
Response for a Class (RFC): It is defined as the set of methods that can potentially be executed in response to a message received by an object of that class. It is given by RFC = |RS|, where RS, the response set of the class, is given by $RS = \{M\} \cup_{\text{all } i} \{R_i\}$, with $\{M\}$ the set of all methods in the class and $\{R_i\}$ the set of methods called by method $i$.
Information flow-based coupling (ICP): The number of methods invoked in a class, weighted by the number of parameters of the methods invoked.
Information flow-based inheritance coupling (IHICP): Same as ICP, but only counts method invocations of ancestors of classes.
Information flow-based non-inheritance coupling (NIHICP): Same as ICP, but only counts method invocations of classes not related through inheritance.
Fan-in: Count of modules (classes) that call a given class, plus the number of global data elements.
Fan-out: Count of modules (classes) called by a given module, plus the number of global data elements altered by the module (class).
Table 10.1: Coupling Metrics
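To make the counting behind some of these coupling metrics concrete, the sketch below works on a hand-written "uses" relation between classes. The class names and the relation are hypothetical. Two simplifying assumptions are made: CBO counts coupling in either direction, and the fan-in and fan-out helpers count only calling relationships and ignore the global data elements mentioned in table 10.1.

```python
# Hypothetical "uses" relation: class -> classes whose methods or attributes it uses.
USES = {
    "Order": {"Customer", "Item", "Invoice"},
    "Invoice": {"Customer"},
    "Customer": set(),
    "Item": set(),
}

def fan_out_calls(cls):
    """Number of other classes that `cls` calls (global data ignored)."""
    return len(USES[cls] - {cls})

def fan_in_calls(cls):
    """Number of other classes that call `cls` (global data ignored)."""
    return sum(1 for name, used in USES.items() if name != cls and cls in used)

def cbo(cls):
    """Number of other classes coupled to `cls`, in either direction."""
    coupled = set(USES[cls]) | {name for name, used in USES.items() if cls in used}
    coupled.discard(cls)
    return len(coupled)

for name in USES:
    print(name, "CBO =", cbo(name),
          "fan-in =", fan_in_calls(name), "fan-out =", fan_out_calls(name))
```

In this invented example, a unit test of Order in isolation would need stubs for three collaborating classes, which is what its fan-out suggests.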
10.3.3 Inheritance Metrics
Inheritance metrics require information about the ancestors and descendants of a class. They also collect information about methods overridden, inherited and added (i.e. neither inherited nor overridden). These metrics are summarized in table 10.3. If a class has a larger number of children (or subclasses), more testing may be required for the methods of that class. The greater the depth of the inheritance tree, the more complex the design, as a larger number of methods and classes are involved. Thus, we may have to test all the inherited methods of a class, and the testing effort will increase accordingly.
Number of Children (NOC): The NOC is the number of immediate subclasses of a class in a hierarchy.
Depth of Inheritance Tree (DIT): The depth of a class within the inheritance hierarchy is the maximum number of steps from the class node to the root of the tree and is measured by the number of ancestor classes.
Number of Parents (NOP): The number of classes that a class directly inherits from (i.e. multiple inheritance).
Number of Descendants (NOD): The number of subclasses (both direct and indirectly inherited) of a class.
Number of Ancestors (NOA): The number of superclasses (both direct and indirectly inherited) of a class.
Number of Methods Overridden (NMO): When a method in a subclass has the same name and type signature as in its superclass, the method in the superclass is said to be overridden by the method in the subclass. NMO counts such overriding methods in a class.
Number of Methods Inherited (NMI): The number of methods that a class inherits from its super (ancestor) class.
Number of Methods Added (NMA): The number of new methods added in a class (neither inherited, nor overriding).
Table 10.3: Inheritance Metrics
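Several of the inheritance metrics of table 10.3 can be computed directly by introspection. The sketch below uses Python's built-in `__subclasses__()`, `__bases__` and `__mro__` on a small, hypothetical class hierarchy. It assumes that Python's implicit root class `object` is not counted, so a user-defined root has depth 0.

```python
# Hypothetical hierarchy, for illustration only.
class Shape: ...
class Polygon(Shape): ...
class Circle(Shape): ...
class Rectangle(Polygon): ...
class Square(Rectangle): ...

def _parents(cls):
    """Direct superclasses, ignoring Python's implicit root `object`."""
    return [b for b in cls.__bases__ if b is not object]

def noc(cls):
    """Number of Children: immediate subclasses."""
    return len(cls.__subclasses__())

def nop(cls):
    """Number of Parents: classes directly inherited from."""
    return len(_parents(cls))

def dit(cls):
    """Depth of Inheritance Tree: longest path from cls up to a root class."""
    parents = _parents(cls)
    return 0 if not parents else 1 + max(dit(p) for p in parents)

def nod(cls):
    """Number of Descendants: direct and indirect subclasses."""
    seen, stack = set(), list(cls.__subclasses__())
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(c.__subclasses__())
    return len(seen)

def noa(cls):
    """Number of Ancestors: direct and indirect superclasses."""
    return len([c for c in cls.__mro__[1:] if c is not object])

print(noc(Shape), dit(Square), nop(Square), nod(Shape), noa(Square))
```

For a class such as Shape in this example, the number of descendants gives a rough idea of how many subclasses may need retesting when one of its methods changes.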
10.3.4 Size Metrics
Size metrics indicate the length of a class in terms of lines of source code and methods used in the class. These metrics are given in table 10.4. If a class has a larger number of methods with greater complexity, then a larger number of test cases will be required to test that class. When such a class is inherited, it will require more rigorous testing. Similarly, a class with a larger number of public methods will require thorough testing of those methods, as they may be used by other classes.
Number of Attributes per Class (NA): It counts the total number of attributes defined in a class.
Number of Methods per Class (NM): It counts the number of methods defined in a class.
Weighted Methods per Class (WMC): The WMC is the sum of the complexities of all methods in a class. Consider a class K1 with methods $M_1, \ldots, M_n$ that are defined in the class. Let $C_1, \ldots, C_n$ be the complexities of the methods. Then $WMC = \sum_{i=1}^{n} C_i$.
Number of public methods (PM): It counts the number of public methods defined in a class.
Number of non-public methods (NPM): It counts the number of non-public (for example, private) methods defined in a class.
Lines Of Code (LOC): It counts the lines in the source code.
Table 10.4: Size Metrics
10.4 What should we measure during testing?
We should measure everything (if possible) which we want to control and which may help us to find answers to the questions given at the beginning of this chapter. Test metrics may help us to measure the current performance of any project. The collected data may become historical data for future projects. This data is very important because, in the absence of historical data, all estimates are just guesses. Hence, it is essential to record the key information about current projects. Test metrics may become an important indicator of the effectiveness and efficiency of a software testing process and may also identify risky areas that may need more testing.
We may measure many things during testing with respect to time and some of them are given as:
1) Time required to run a test case.
2) Total time required to run a test suite.
3) Time available for testing
4) Time interval between failures
5) Cumulative failures experienced up to a given time
6) Time of failure
7) Failures experienced in a time interval
A test case requires some time for its execution. A measurement of this time may help to estimate the total time required to execute a test suite. This is the simplest metric and may be used to estimate the testing effort. We may calculate the time available for testing at any point during testing if we know the total time allotted for testing. Generally, the execution time of a test case is expressed in seconds, minutes or hours. Total testing time and the time needed to execute a planned test suite may be defined in terms of hours.
When we test a software, we experience failures. These failures may be recorded in different ways, like time of failure, time interval between failures, cumulative failures experienced up to a given time and failures experienced in a time interval. Consider table 10.5 and table 10.6, where the time based failure specification and the failure based failure specification are given:
Columns: Sr. No. of failure occurrence; Failure time (minutes); Failure interval (minutes)
Table 10.5: Time based failure specification
Columns: Time (minutes); Failures in interval of 20 minutes
Table 10.6: Failure based failure specification
These two tables give us the idea about failure pattern and may help us to define the following:
1) Time taken to experience ‘n’ failures
2) Number of failures in a particular time interval
3) Total number of failures experienced after a specified time
4) Maximum / minimum number of failures experienced in any regular time interval.
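The four quantities listed above can be derived from a time based failure log with a few lines of code. The sketch below is illustrative only: the failure times are hypothetical (they are not the values of table 10.5), and an interval is assumed to include its right end point, i.e. a failure at exactly 20 minutes belongs to the first 20-minute interval.

```python
from bisect import bisect_right

# Hypothetical failure times in minutes, sorted in increasing order.
FAILURE_TIMES = [8, 18, 25, 36, 45, 57, 71, 86, 98, 115]

def time_to_n_failures(times, n):
    """Time taken to experience the n-th failure (n starts at 1)."""
    return times[n - 1]

def cumulative_failures(times, t):
    """Total number of failures experienced up to and including time t."""
    return bisect_right(times, t)

def failures_in_interval(times, start, end):
    """Number of failures with start < failure time <= end."""
    return cumulative_failures(times, end) - cumulative_failures(times, start)

def interval_counts(times, width, horizon):
    """Failure based specification: failures in each interval of `width` minutes."""
    return [failures_in_interval(times, s, s + width)
            for s in range(0, horizon, width)]

counts = interval_counts(FAILURE_TIMES, width=20, horizon=120)
print(time_to_n_failures(FAILURE_TIMES, 5))   # time of the 5th failure
print(cumulative_failures(FAILURE_TIMES, 60)) # failures in the first hour
print(counts, max(counts), min(counts))
```

The same log therefore yields both views: the sorted failure times are the time based specification, and `interval_counts` produces the failure based one.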
10.4.2 Quality of source code
We may know the quality of the delivered source code after reasonable time of release using the following formula:
Where WDB: Number of weighted defects found before release
WDA: Number of weighted defects found after release
The weight for each defect is defined on the basis of defect severity and removal cost. A severity is assigned to each defect by testers based on how important or serious the defect is. A lower value of this metric indicates that fewer errors, or less serious errors, were detected.
We may also calculate the number of defects per executed test case. This may also be used as an indicator of source code quality as the source code progresses through the series of test activities [STEP03].
10.4.3 Source Code Coverage
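We may like to execute every statement of a program at least once before its release to the customer. Hence, percentage of source code coverage may be calculated as:
\[ \text{Percentage of source code coverage} = \frac{\text{Number of statements of source code covered by the test suite}}{\text{Total number of statements of source code}} \times 100 \]
A higher value of this metric gives confidence about the effectiveness of a test suite. We should write additional test cases to cover the uncovered portions of the source code.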
10.4.4 Test Case Defect Density
This metric may help us to know the efficiency and effectiveness of our test cases.
\[ \text{Test case defect density} = \frac{\text{Number of failed test cases}}{\text{Number of executed test cases}} \times 100 \]
Where Failed test case: A test case that, when executed, produced an undesired output.
Passed test case: A test case that, when executed, produced a desired output.
A higher value of this metric indicates that the test cases are effective and efficient, because they are able to detect a larger number of defects.
10.4.5 Review Efficiency
Review efficiency is a metric that gives insight on the quality of review process carried out during verification.
The higher the value of this metric, the better the review efficiency.
10.5 Software Quality Attributes Prediction Models
Software quality is dependent on many attributes like reliability, maintainability, fault proneness, testability, complexity, etc. A number of models are available for the prediction of one or more such attributes of quality. These models are especially beneficial for large-scale systems, where testing experts need to focus their attention and resources on problem areas in the system under development.
10.5.1 Reliability Models
Many reliability models for software are available where the emphasis is on failures rather than faults. We experience failures during the execution of any program. A fault in the program may lead to failure(s) depending upon the input(s) given to the program when it is executed. Hence, time of failure and time between failures may help us to find the reliability of software. As we all know, software reliability is the probability of failure-free operation of software in a given time under specified conditions. Generally, we consider the calendar time. We may like to know the probability that a given software will not fail in one month or one week and so on. However, most of the available models are based on execution time. The execution time is the time for which the computer actually executes the program. Reliability models based on execution time normally give better results than those based on calendar time. In many cases, we have a mapping table that converts execution time to calendar time for the purpose of reliability studies. In order to differentiate the two, execution time is represented by τ and calendar time by t.
Most of the reliability models are applicable at system testing level. Whenever software fails, we note the time of failure and also try to locate and correct the fault that caused the failure. During system testing, software may not fail at regular intervals and may also not follow a particular pattern. The variation in time between successive failures may be described in terms of following functions:
μ(τ): average number of failures up to time τ
λ(τ): average number of failures per unit time at time τ, known as the failure intensity function.
It is expected that the reliability of a program increases due to fault detection and correction over time and hence the failure intensity decreases accordingly.
(i) Basic Execution Time Model
This is one of the popular models of software reliability assessment and was developed by J.D. Musa [MUSA79] in 1979. As the name indicates, it is based on execution time (τ). The basic assumption is that failures may occur according to a non-homogeneous Poisson process (NHPP) during testing. Many real world events may be described by Poisson processes. A few examples are given as:
* Number of users using a website in a given period of time.
* Number of persons requesting for railway tickets in a given period of time
* Number of e-mails expected in a given period of time.
The failures during testing represent a non-homogeneous process, and failure intensity decreases as a function of time. J.D. Musa assumed that the decrease in failure intensity, as a function of the number of failures observed, is constant and is given as:
\[ \lambda(\mu) = \lambda_0 \left( 1 - \frac{\mu}{\nu_0} \right) \]
Where $\lambda_0$: Initial failure intensity at the start of testing.
$\nu_0$: Total number of failures experienced up to infinite time.
$\mu$: Number of failures experienced up to a given point in time.
Musa [MUSA79] has also given the relationship between failure intensity (λ) and the mean failures experienced (μ) and is given in 10.1.
If we take the first derivative of the equation given above, we get the slope of the failure intensity as given below:
\[ \frac{d\lambda}{d\mu} = -\frac{\lambda_0}{\nu_0} \]
The negative sign shows that there is a negative slope, indicating a decrementing trend in failure intensity.
This model also assumes a uniform failure pattern, meaning thereby equal probability of failures due to various faults. The relationship between execution time (τ) and mean failures experienced (μ) is given in 10.2.
The derivation of the relationship of 10.2 may be obtained as:
\[ \frac{d\mu(\tau)}{d\tau} = \lambda(\tau) = \lambda_0 \left( 1 - \frac{\mu(\tau)}{\nu_0} \right) \]
Solving this differential equation with the initial condition $\mu(0) = 0$ gives:
\[ \mu(\tau) = \nu_0 \left( 1 - \exp\left( -\frac{\lambda_0 \tau}{\nu_0} \right) \right) \]
The failure intensity as a function of time is given in 10.3.
This relationship is useful for calculating the present failure intensity at any given value of execution time. We may find this relationship by differentiating $\mu(\tau)$ with respect to $\tau$:
\[ \lambda(\tau) = \lambda_0 \exp\left( -\frac{\lambda_0 \tau}{\nu_0} \right) \]
Two additional equations are given to calculate the additional failures required to be experienced to reach a failure intensity objective ($\lambda_F$) and the additional time required to reach the objective. These equations are given as:
\[ \Delta\mu = \frac{\nu_0}{\lambda_0} \left( \lambda_P - \lambda_F \right) \]
\[ \Delta\tau = \frac{\nu_0}{\lambda_0} \ln\left( \frac{\lambda_P}{\lambda_F} \right) \]
Where $\Delta\mu$: Expected number of additional failures to be experienced to reach the failure intensity objective.
$\Delta\tau$: Additional time required to reach the failure intensity objective.
$\lambda_P$: Present failure intensity.
$\lambda_F$: Failure intensity objective.
$\Delta\mu$ and $\Delta\tau$ are very interesting metrics to know the additional failures and additional time required to achieve a failure intensity objective.
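Example 10.1: A program will experience 100 failures in infinite time. It has now experienced 50 failures. The initial failure intensity is 10 failures/hour. Use the basic execution time model for the following:
(a) Find the present failure intensity.
(b) Calculate the decrement of failure intensity per failure.
(c) Determine the failures experienced and failure intensity after 10 and 50 hours of execution.
(d) Find the additional failures and additional execution time needed to reach the failure intensity objective of 2 failures/hour.
Solution: Here $\nu_0 = 100$ failures, $\mu = 50$ failures and $\lambda_0 = 10$ failures/hour.
(a) Present failure intensity can be calculated using the following equation:
\[ \lambda(\mu) = \lambda_0 \left( 1 - \frac{\mu}{\nu_0} \right) = 10 \left( 1 - \frac{50}{100} \right) = 5 \text{ failures/hour} \]
(b) Decrement of failure intensity per failure can be calculated using the following:
\[ \frac{d\lambda}{d\mu} = -\frac{\lambda_0}{\nu_0} = -\frac{10}{100} = -0.1 \text{ /hour per failure} \]
(c) Failures experienced and failure intensity after 10 and 50 hours of execution can be calculated as:
(i) After 10 hours of execution
\[ \mu(10) = 100 \left( 1 - \exp\left( -\frac{10 \times 10}{100} \right) \right) = 100 (1 - 0.3679) \approx 63.21 \text{ failures} \]
\[ \lambda(10) = 10 \exp\left( -\frac{10 \times 10}{100} \right) \approx 3.68 \text{ failures/hour} \]
(ii) After 50 hours of execution
\[ \mu(50) = 100 \left( 1 - \exp\left( -\frac{10 \times 50}{100} \right) \right) = 100 (1 - 0.00674) \approx 99.33 \text{ failures} \]
\[ \lambda(50) = 10 \exp\left( -\frac{10 \times 50}{100} \right) \approx 0.067 \text{ failures/hour} \]
(d) $\Delta\mu$ and $\Delta\tau$ with failure intensity objective of 2 failures/hour
\[ \Delta\mu = \frac{\nu_0}{\lambda_0} (\lambda_P - \lambda_F) = \frac{100}{10} (5 - 2) = 30 \text{ failures} \]
\[ \Delta\tau = \frac{\nu_0}{\lambda_0} \ln\left( \frac{\lambda_P}{\lambda_F} \right) = \frac{100}{10} \ln\left( \frac{5}{2} \right) \approx 9.16 \text{ hours} \]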
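The calculations of Example 10.1 are easy to check numerically. The sketch below encodes the closed forms of Musa's basic execution time model as small Python functions. The function and parameter names are our own choices for illustration, and the input values are those stated in the example.

```python
import math

def basic_intensity(lam0, nu0, mu):
    """Failure intensity after mu failures: lam0 * (1 - mu / nu0)."""
    return lam0 * (1.0 - mu / nu0)

def basic_mean_failures(lam0, nu0, tau):
    """Mean failures experienced after execution time tau."""
    return nu0 * (1.0 - math.exp(-lam0 * tau / nu0))

def basic_intensity_at(lam0, nu0, tau):
    """Failure intensity after execution time tau."""
    return lam0 * math.exp(-lam0 * tau / nu0)

def basic_additional_failures(lam0, nu0, lam_p, lam_f):
    """Additional failures needed to go from intensity lam_p down to lam_f."""
    return (nu0 / lam0) * (lam_p - lam_f)

def basic_additional_time(lam0, nu0, lam_p, lam_f):
    """Additional execution time needed to go from lam_p down to lam_f."""
    return (nu0 / lam0) * math.log(lam_p / lam_f)

LAM0, NU0 = 10.0, 100.0                      # values from Example 10.1
lam_present = basic_intensity(LAM0, NU0, mu=50)
print(lam_present)
for tau in (10, 50):
    print(tau, basic_mean_failures(LAM0, NU0, tau), basic_intensity_at(LAM0, NU0, tau))
print(basic_additional_failures(LAM0, NU0, lam_present, 2.0),
      basic_additional_time(LAM0, NU0, lam_present, 2.0))
```

Because the failure intensity falls linearly with the failures experienced, almost all of the 100 expected failures have been experienced after 50 hours of execution.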
(ii) Logarithmic Poisson Execution time model
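With a slight modification in the failure intensity function, Musa presented the logarithmic Poisson execution time model. The failure intensity function is given as:
\[ \lambda(\mu) = \lambda_0 \exp(-\theta \mu) \]
Where θ: Failure intensity decay parameter, which represents the relative change of failure intensity per failure experienced.
The slope of failure intensity is given as:
\[ \frac{d\lambda}{d\mu} = -\theta \lambda_0 \exp(-\theta \mu) = -\theta \lambda \]
The expected number of failures for this model is always infinite at infinite time. The relation for mean failures experienced is given as:
\[ \mu(\tau) = \frac{1}{\theta} \ln(\lambda_0 \theta \tau + 1) \]
The expression for failure intensity with respect to time is given as:
\[ \lambda(\tau) = \frac{\lambda_0}{\lambda_0 \theta \tau + 1} \]
The relationships for the additional number of failures and additional execution time are given as:
\[ \Delta\mu = \frac{1}{\theta} \ln\left( \frac{\lambda_P}{\lambda_F} \right) \]
\[ \Delta\tau = \frac{1}{\theta} \left( \frac{1}{\lambda_F} - \frac{1}{\lambda_P} \right) \]
When execution time is large, the logarithmic Poisson model may give larger values of failure intensity than the basic model.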
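Example 10.2: The initial failure intensity of a program is 10 failures/hour. The program has experienced 50 failures. The failure intensity decay parameter is 0.01/failure. Use the logarithmic Poisson execution time model for the following:
(a) Find the present failure intensity.
(b) Calculate the decrement of failure intensity per failure.
(c) Determine the failures experienced and failure intensity after 10 and 50 hours of execution.
(d) Find the additional failures and additional execution time needed to reach the failure intensity objective of 2 failures/hour.
Solution:
(a) Present failure intensity can be calculated as:
$\lambda_0 = 10$ failures/hour
$\mu = 50$ failures
$\theta = 0.01$/failure
\[ \lambda(\mu) = \lambda_0 \exp(-\theta \mu) = 10 \exp(-0.01 \times 50) \approx 6.065 \text{ failures/hour} \]
(b) Decrement of failure intensity per failure can be calculated as:
\[ \frac{d\lambda}{d\mu} = -\theta \lambda = -0.01 \times 6.065 \approx -0.061 \text{ /hour per failure} \]
(c) Failures experienced and failure intensity after 10 and 50 hours of execution can be calculated as:
(i) After 10 hours of execution
\[ \mu(10) = \frac{1}{0.01} \ln(10 \times 0.01 \times 10 + 1) = 100 \ln 2 \approx 69.31 \text{ failures} \]
\[ \lambda(10) = \frac{10}{10 \times 0.01 \times 10 + 1} = 5 \text{ failures/hour} \]
(ii) After 50 hours of execution
\[ \mu(50) = \frac{1}{0.01} \ln(10 \times 0.01 \times 50 + 1) = 100 \ln 6 \approx 179.18 \text{ failures} \]
\[ \lambda(50) = \frac{10}{10 \times 0.01 \times 50 + 1} \approx 1.667 \text{ failures/hour} \]
(d) $\Delta\mu$ and $\Delta\tau$ with failure intensity objective of 2 failures/hour
\[ \Delta\mu = \frac{1}{\theta} \ln\left( \frac{\lambda_P}{\lambda_F} \right) = 100 \ln\left( \frac{6.065}{2} \right) \approx 110.94 \text{ failures} \]
\[ \Delta\tau = \frac{1}{\theta} \left( \frac{1}{\lambda_F} - \frac{1}{\lambda_P} \right) = 100 \left( \frac{1}{2} - \frac{1}{6.065} \right) \approx 33.51 \text{ hours} \]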
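As with the basic model, the relationships of the logarithmic Poisson execution time model can be checked with a short script. This is a sketch under the standard closed forms of that model. The function names are illustrative, and the inputs are the values stated in Example 10.2.

```python
import math

def logp_intensity(lam0, theta, mu):
    """Failure intensity after mu failures: lam0 * exp(-theta * mu)."""
    return lam0 * math.exp(-theta * mu)

def logp_mean_failures(lam0, theta, tau):
    """Mean failures experienced after execution time tau."""
    return math.log(lam0 * theta * tau + 1.0) / theta

def logp_intensity_at(lam0, theta, tau):
    """Failure intensity after execution time tau."""
    return lam0 / (lam0 * theta * tau + 1.0)

def logp_additional_failures(theta, lam_p, lam_f):
    """Additional failures needed to go from intensity lam_p down to lam_f."""
    return math.log(lam_p / lam_f) / theta

def logp_additional_time(theta, lam_p, lam_f):
    """Additional execution time needed to go from lam_p down to lam_f."""
    return (1.0 / lam_f - 1.0 / lam_p) / theta

LAM0, THETA = 10.0, 0.01                     # values from Example 10.2
lam_now = logp_intensity(LAM0, THETA, mu=50)
print(lam_now, -THETA * lam_now)             # present intensity and its slope
for tau in (10, 50):
    print(tau, logp_mean_failures(LAM0, THETA, tau), logp_intensity_at(LAM0, THETA, tau))
print(logp_additional_failures(THETA, lam_now, 2.0),
      logp_additional_time(THETA, lam_now, 2.0))
```

Comparing the two examples at 50 hours of execution, the logarithmic Poisson model still predicts about 1.67 failures/hour while the basic model predicts about 0.07 failures/hour, which illustrates its slower decay for large execution times.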
(iii) The Jelinski – Moranda Model
The Jelinski-Moranda model [JELI72] is the earliest and simplest software reliability model. It proposed a failure intensity function in the form of
\[ \lambda(t) = \phi (N - i + 1) \]
Where $\phi$ = Constant of proportionality
N = total number of errors present
i = number of errors found by time interval $t_i$.
This model assumes that all failures have the same failure rate. It means that the failure rate is a step function and there will be an improvement in reliability after fixing an error. Hence, every failure contributes equally to the overall reliability. Here, failure intensity is directly proportional to the number of errors remaining in a software.
Once we know the value of the failure intensity function using any reliability model, we may calculate reliability using the equation given below:
\[ R(t) = e^{-\lambda t} \]
Where λ is the failure intensity and t is the operating time. The lower the failure intensity, the higher the reliability, and vice versa.
Example 10.3: A program may experience 200 failures in infinite time of testing. It has experienced 100 failures. Use the Jelinski-Moranda model to calculate the failure intensity after the experience of 150 failures.
Total expected number of failures (N) = 200
Failures experienced (i) = 100
Constant of proportionality ($\phi$) = 0.02
\[ \lambda(t) = \phi (N - i + 1) = 0.02 (200 - 100 + 1) = 2.02 \text{ failures/hour} \]
After 150 failures
\[ \lambda(t) = 0.02 (200 - 150 + 1) = 1.02 \text{ failures/hour} \]
Failure intensity will decrease with every additional failure experienced.
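The Jelinski-Moranda failure intensity and the reliability expression R(t) = exp(-λt) can be combined in a few lines. In the sketch below, the function names are illustrative, the inputs are those of Example 10.3, and the one-hour operating time used for the reliability values is an assumption made only for this illustration.

```python
import math

def jm_failure_intensity(phi, n_total, i):
    """Jelinski-Moranda failure intensity: phi * (N - i + 1)."""
    return phi * (n_total - i + 1)

def reliability(lam, t):
    """Probability of failure-free operation for time t at constant intensity lam."""
    return math.exp(-lam * t)

PHI, N = 0.02, 200                            # values from Example 10.3
lam_100 = jm_failure_intensity(PHI, N, 100)   # after 100 failures
lam_150 = jm_failure_intensity(PHI, N, 150)   # after 150 failures
print(lam_100, lam_150)
# Assumed operating time of 1 hour, for illustration only.
print(reliability(lam_100, 1.0), reliability(lam_150, 1.0))
```

The reliability for the same operating time improves as more failures are experienced and the corresponding faults are removed, because the failure intensity steps down with each fix.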
10.5.2 An example of fault prediction model in practice
It is clear that software metrics can be used to capture the quality of object oriented design and code. These metrics provide ways to evaluate the quality of software and their use in earlier phases of software development can help organizations in assessing a large software development quickly, at a low cost.
To help in planning and executing testing by focusing resources on the fault prone parts of the design and code, a model that predicts faulty classes should be used. The fault prediction model can also be used to identify classes that are prone to have severe faults. One can use this model with respect to high severity of faults to focus the testing on those parts of the system that are likely to cause serious failures. In this section, we describe models used to find the relationship between object oriented metrics and fault proneness, and how such models can be of great help in planning and executing testing activities [MALH09, SING10].
In order to perform the analysis, we used the public domain KC1 NASA data set [NASA04]. The data set is available on www.mdp.ivv.nasa.gov. The 145 classes in this data set were developed using the C++ language.
The goal of our analysis is to explore empirically the relationship between object oriented metrics and fault proneness at the class level. Therefore, fault proneness is the binary dependent variable and object oriented metrics (namely WMC, CBO, RFC, LCOM, DIT, NOC and SLOC) are the independent variables. Fault proneness is defined as the probability of fault detection in a class. We first associated defects with each class according to their severities. The value of severity quantifies the impact of the defect on the overall environment, with 1 being most severe and 5 being least severe. Faults with severity rating 1 were classified as high severity faults and faults with severity rating 2 as medium severity faults. Faults with severity ratings 3, 4 and 5 were combined into low severity faults, because no class was found to be faulty at severity rating 4 and only one class was faulty at severity rating 5. Table 10.7 summarizes the distribution of faults and faulty classes at high, medium and low severity levels in the KC1 NASA data set after preprocessing of faults in the data set.
Columns: Level of severity; Number of faulty classes; % of faulty classes; Number of faults; % of distribution of faults
Table 10.7: Distribution of Faults and Faulty Classes at High, Medium and Low Severity Levels
The "min", "max", "mean", "median", "std dev", "25% quartile" and "75% quartile" for all metrics in the analysis are shown in table 10.8.
Table 10.8: Descriptive Statistics for metrics
The low values of DIT and NOC indicate that inheritance is not much used in the system. The LCOM metric has high values. Table 10.9 shows the correlation among metrics, which is an important statistical quantity.
Table 10.9: Correlations among Metrics
The correlation coefficients shown in bold are significant at the 0.01 level. The WMC, LOC and DIT metrics are correlated with the RFC metric. Similarly, the WMC and CBO metrics are correlated with the LOC metric. This shows that these metrics are not totally independent and represent redundant information.
The next step of our analysis found the combined effect of object oriented metrics on fault proneness of a class at various severity levels. We obtained four multivariate fault prediction models using the logistic regression (LR) method. The first one is for high severity faults, the second one is for medium severity faults, the third one is for low severity faults and the fourth one is for ungraded severity faults.
We used the multivariate logistic regression approach in our analysis. In a multivariate logistic regression model, the coefficient and the significance level of an independent variable represent the net effect of that variable on the dependent variable, which in our case is fault proneness. Tables 10.10, 10.11, 10.12 and 10.13 provide the coefficient (B), standard error (SE), statistical significance (sig) and odds ratio (exp(B)) for the metrics included in each model.
Two metrics, CBO and SLOC, were included in the multivariate model for predicting high severity faults. Four metrics, CBO, LCOM, NOC and SLOC, were included in the multivariate model for predicting medium severity faults. Four metrics, CBO, WMC, RFC and SLOC, were included in the model for predicting low severity faults. Similarly, the CBO, LCOM, NOC, RFC and SLOC metrics were included in the ungraded severity model.
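The roles of the coefficient B and the odds ratio exp(B) can be illustrated with a small sketch. The intercept and coefficient values below are hypothetical and are not the values reported in Tables 10.10 to 10.13; the function names are also chosen only for this illustration.

```python
import math

def fault_probability(intercept, coefficients, metrics):
    """Logistic regression: P(fault prone) = 1 / (1 + e^(-z)),
    where z = intercept + sum of B_i * metric_i."""
    z = intercept + sum(coefficients[name] * metrics[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))

def odds_ratios(coefficients):
    """exp(B): factor by which the odds of fault proneness change
    when a metric increases by one unit, other metrics held fixed."""
    return {name: math.exp(b) for name, b in coefficients.items()}

# Hypothetical two-metric model (CBO and SLOC), for illustration only.
B0 = -3.0
B = {"CBO": 0.10, "SLOC": 0.005}
one_class = {"CBO": 12, "SLOC": 200}

p = fault_probability(B0, B, one_class)
print(round(p, 3), odds_ratios(B))
```

An odds ratio above 1 means that a higher metric value is associated with higher odds of the class being fault prone; a value below 1 means the opposite.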
Table 10.10: High severity faults model statistics
Table 10.11: Medium severity faults model statistics
Table 10.12: Low severity faults model statistics
Table 10.13: Ungraded severity fault model statistics
To validate our findings, we performed 10-fold cross validation of all the models. For the 10-fold cross validation, the classes were randomly divided into 10 parts of approximately equal size (14 partitions of ten data points each and 1 partition of five data points each).
The performance of binary prediction models is typically evaluated using a confusion matrix (see Table 10.14). In order to validate the findings of our analysis, we used the commonly used evaluation measures: sensitivity, specificity, completeness, precision and ROC analysis.
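A random 10-way partition of the classes can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the count of 145 classes is an assumption derived from the partition sizes quoted above (14 × 10 + 5), and the fold sizes produced here (14 or 15 classes) simply follow from an even 10-way split.

```python
import random

def k_fold_indices(n_items, k=10, seed=1):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_splits(n_items, k=10, seed=1):
    """Yield (training indices, validation indices); each fold is held out once."""
    folds = k_fold_indices(n_items, k, seed)
    for held_out, validation in enumerate(folds):
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield training, validation

folds = k_fold_indices(145)          # 145 classes is an assumed total
print(sorted(len(f) for f in folds))
```

In each round the model is fitted on the nine training folds and evaluated on the held-out fold, so every class is used for validation exactly once.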
Observed \ Predicted | 1.00 (Fault-Prone) | 0.00 (Not Fault-Prone)
1.00 (Fault-Prone) | True Fault Prone (TFP) | False Not Fault Prone (FNFP)
0.00 (Not Fault-Prone) | False Fault Prone (FFP) | True Not Fault Prone (TNFP)
Table 10.14: Confusion matrix
Precision is defined as the ratio of the number of classes correctly predicted to the total number of classes.
Sensitivity is defined as the ratio of the number of classes correctly predicted as fault prone to the total number of classes that are actually fault prone.
Completeness is defined as the number of faults in classes classified fault-prone, divided by the total number of faults in the system.
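These measures can be computed directly from the four confusion matrix cells (TFP, FNFP, FFP, TNFP). The sketch below follows the definitions used in this section, where the ratio of correctly predicted classes to all classes is the measure referred to as precision. Specificity is not defined in the text, so the usual definition (correctly predicted not-fault-prone classes over all classes that are actually not fault prone) is assumed. The counts in the example are made up.

```python
def evaluation_measures(tfp, fnfp, ffp, tnfp):
    """Measures derived from the confusion matrix cells."""
    total = tfp + fnfp + ffp + tnfp
    return {
        # correctly predicted classes / all classes
        "precision": (tfp + tnfp) / total,
        # correctly predicted fault prone / actually fault prone
        "sensitivity": tfp / (tfp + fnfp),
        # assumed standard definition, not stated in the text
        "specificity": tnfp / (ffp + tnfp),
    }

def completeness(faults_per_class, predicted_fault_prone):
    """Faults in classes classified fault prone / all faults in the system."""
    found = sum(f for f, flagged in zip(faults_per_class, predicted_fault_prone) if flagged)
    return found / sum(faults_per_class)

# Made-up counts, for illustration only.
m = evaluation_measures(tfp=40, fnfp=10, ffp=20, tnfp=75)
c = completeness([3, 0, 2, 5], [True, False, False, True])
print(m, c)
```

Completeness differs from sensitivity because it weights each class by the number of faults it contains, so a model that finds the few classes holding most of the faults scores well even if it misses many lightly affected classes.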
Receiver Operating Characteristics (ROC) Curve
The performance of the outputs of the predicted models was evaluated using ROC analysis. The ROC curve, which is defined as a plot of sensitivity on the y-coordinate versus 1 - specificity on the x-coordinate, is an effective method of evaluating the quality or performance of predicted models [EMAM99]. While constructing ROC curves, we selected many cutoff points between 0 and 1, and calculated sensitivity and specificity at each cutoff point. The optimal choice of the cutoff point (the one that maximizes both sensitivity and specificity) can be selected from the ROC curve [EMAM99]. Hence, by using the ROC curve one can easily determine the optimal cutoff point for a model. The area under the ROC curve (AUC) is a combined measure of sensitivity and specificity. In order to compute the accuracy of the predicted models, we use the area under the ROC curve.
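The construction described above can be sketched in a few lines. The example assumes that a class is predicted fault prone when its predicted probability is at or above the cutoff, computes the AUC with the trapezoidal rule, and interprets "maximizes both sensitivity and specificity" as maximizing their sum; all three choices are assumptions made for this illustration, and the data are made up.

```python
def roc_points(probabilities, actual, cutoffs):
    """Return (1 - specificity, sensitivity, cutoff) for each cutoff."""
    points = []
    for c in cutoffs:
        pairs = list(zip(probabilities, actual))
        tfp = sum(1 for p, a in pairs if p >= c and a == 1)
        fnfp = sum(1 for p, a in pairs if p < c and a == 1)
        ffp = sum(1 for p, a in pairs if p >= c and a == 0)
        tnfp = sum(1 for p, a in pairs if p < c and a == 0)
        sensitivity = tfp / (tfp + fnfp)
        specificity = tnfp / (ffp + tnfp)
        points.append((1 - specificity, sensitivity, c))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    xy = sorted({(x, y) for x, y, _ in points} | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(xy, xy[1:]))

def best_cutoff(points):
    """Cutoff with the largest sensitivity + specificity."""
    return max(points, key=lambda t: t[1] - t[0])[2]

# Made-up predictions: 1 = fault prone, 0 = not fault prone.
probs = [0.9, 0.8, 0.3, 0.2]
truth = [1, 1, 0, 0]
pts = roc_points(probs, truth, [i / 10 for i in range(11)])
print(area_under_curve(pts), best_cutoff(pts))
```

An AUC close to 1 indicates that the model ranks fault prone classes above the others at almost every cutoff, while an AUC near 0.5 indicates performance no better than chance.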
We summarized the results of cross validation of predicted models via the LR approach in Table 10.15.
Table 10.15: Results of 10-fold cross validation of models
The ROC curves for the LR models with respect to the high, medium, low and ungraded severity of faults are shown in Figure 10.4.
Based on the findings from this analysis, we can use the SLOC and CBO metrics in the earlier phases of software development to measure the quality of the systems and predict which classes with higher severity faults need extra attention. This can help management focus resources on those classes that are likely to cause serious failures. Also, if required, developers can reconsider the design and thus take corrective actions. The models predicted in the previous section could be of great help for planning and executing testing activities. For example, if one has the resources available to inspect 26 percent of the code, one should test the 26 percent of classes predicted to have more severe faults. If these classes are selected for testing, one can expect the maximum number of severe faults to be covered.
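The idea of spending a fixed testing budget on the classes most likely to contain severe faults can be sketched as a simple ranking. The class names and probabilities below are hypothetical; in practice the probabilities would come from the high severity fault model.

```python
import math

def classes_to_test(predicted, fraction=0.26):
    """Rank classes by predicted fault proneness and keep the top fraction."""
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    budget = math.ceil(fraction * len(ranked))
    return ranked[:budget]

# Hypothetical predicted probabilities of a high severity fault per class.
predicted = {
    "ClassA": 0.91, "ClassB": 0.12, "ClassC": 0.47, "ClassD": 0.05, "ClassE": 0.78,
    "ClassF": 0.33, "ClassG": 0.08, "ClassH": 0.64, "ClassI": 0.21, "ClassJ": 0.02,
}
print(classes_to_test(predicted))
```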
There are many schools of thought about the usefulness and applications of software metrics. However, every school of thought accepts the old quote of software engineering, i.e. “You cannot improve what you cannot measure; and you cannot control what you cannot measure”. In order to control and improve various activities, we should have “something” with which to measure such activities. This “something” differs from one school of thought to another. Despite different views, most of us feel that software metrics help to improve productivity and quality. Software process metrics are widely used, for example in the capability maturity model for software (CMM-SW) and ISO 9001. Many organizations are making serious efforts to implement these metrics.
10.5.3 Maintenance effort prediction model
The cost of software maintenance is increasing day by day. The development may take 2 to 3 years, but the same software may have to be maintained for another 10 or more years. Hence, maintenance effort is becoming an important factor for software developers. The obvious question is “how should we estimate maintenance effort in the early phases of the software development life cycle?”. The estimation may help us to calculate the cost of software maintenance, which a customer may like to know as early as possible in order to plan the costing of the project.
Maintenance effort is defined as the number of lines of source code added or changed during the maintenance phase. A model has been used to predict maintenance effort using an Artificial Neural Network (ANN) [AGGA06, MALH09]. This is a simple model and its predictions are quite realistic. In this model, maintenance effort is used as the dependent variable. The independent variables are eight object oriented metrics, namely WMC, CBO, RFC, LCOM, DIT, NOC, DAC and NOM. The model is trained and tested on two commercial software products, User Interface Management System (UIMS) and Quality Evaluation System (QUES), which are presented in [LI93]. The UIMS system consists of 39 classes and the QUES system consists of 71 classes.
The ANN used in model prediction belongs to the family of Multilayer Feed Forward networks and is referred to as an M-H-Q network, with M source nodes, H nodes in the hidden layer and Q nodes in the output layer. The input nodes are connected to every node of the hidden layer but are not directly connected to the output node. Thus, the network does not have any lateral or shortcut connections.
The ANN repetitively adjusts its weights so that the difference between the desired output and the actual output of the network is minimized. The network learns by finding a vector of connection weights that minimizes the sum of squared errors on the training data set. The summary of the ANN used in the model for predicting maintenance effort is shown in Table 10.16.
Table 10.16: ANN Summary
The ANN was trained by the standard error back propagation algorithm at a learning rate of 0.005, having the minimum square error as the training stopping criterion.
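The training procedure can be illustrated with a minimal M-H-1 feed forward network trained by error back propagation. This is a sketch under stated assumptions and not the network summarized in Table 10.16: the tanh hidden units, the single linear output node, the number of hidden nodes, the fixed number of epochs used as the stopping rule and the toy data are all chosen only for illustration. Only the learning rate of 0.005 is taken from the text.

```python
import math
import random

def train_mlp(X, y, hidden=3, lr=0.005, epochs=2000, seed=0):
    """Train an M-H-1 feed forward network with per-sample back propagation.
    Returns the weights and the sum of squared errors recorded per epoch."""
    rng = random.Random(seed)
    m = len(X[0])
    # W1[j] holds the weights from the m inputs to hidden node j, then its bias.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(m + 1)] for _ in range(hidden)]
    # W2 holds the weights from the hidden nodes to the output, then its bias.
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    history = []
    for _ in range(epochs):
        sse = 0.0
        for x, target in zip(X, y):
            # Forward pass (zip stops before the bias entry of each weight row).
            h = [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in W1]
            out = sum(w * v for w, v in zip(W2, h)) + W2[-1]
            err = out - target
            sse += err * err
            # Backward pass: error signal at each hidden node (tanh' = 1 - h^2),
            # computed before the output weights are changed.
            delta_h = [err * W2[j] * (1 - h[j] ** 2) for j in range(hidden)]
            for j in range(hidden):
                W2[j] -= lr * err * h[j]
            W2[-1] -= lr * err
            for j in range(hidden):
                for i in range(m):
                    W1[j][i] -= lr * delta_h[j] * x[i]
                W1[j][-1] -= lr * delta_h[j]
        history.append(sse)
    return W1, W2, history

# Toy data: two normalized inputs and a made-up target.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
y = [1 + a + 2 * b for a, b in X]
W1, W2, history = train_mlp(X, y)
print(round(history[0], 3), round(history[-1], 3))
```

The recorded sum of squared errors should fall as training proceeds, which is the quantity the network is trying to minimize on the training data set.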
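The main measure used for evaluating model performance is the Mean Absolute Relative Error (MARE). MARE is the preferred error measure for software measurement researchers and is calculated as follows [FINN96]:
$$\mathrm{MARE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\mathit{estimate}_i - \mathit{actual}_i}{\mathit{actual}_i}\right|$$
where:
estimate is the predicted output by the network for each observation
actual is the observed value for each observation
n is the number of observations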
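To establish whether models are biased and tend to over or under estimate, the Mean Relative Error (MRE) is calculated as follows [FINN96]:
$$\mathrm{MRE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\mathit{estimate}_i - \mathit{actual}_i}{\mathit{actual}_i}$$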
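Both error measures are straightforward to compute. The sketch below assumes the usual definitions, in which the relative error of an observation is (estimate - actual) / actual: MARE averages its absolute value and MRE averages the signed value. The numbers are made up.

```python
def mare(estimates, actuals):
    """Mean Absolute Relative Error: average of |estimate - actual| / actual."""
    return sum(abs(e - a) / a for e, a in zip(estimates, actuals)) / len(actuals)

def mre(estimates, actuals):
    """Mean Relative Error: average of (estimate - actual) / actual.
    Positive values indicate over-estimation, negative values under-estimation."""
    return sum((e - a) / a for e, a in zip(estimates, actuals)) / len(actuals)

# Made-up predicted and observed maintenance effort (lines added or changed).
estimates = [110, 90, 50]
actuals = [100, 100, 40]
print(mare(estimates, actuals), mre(estimates, actuals))
```

Because errors of opposite sign cancel in MRE but not in MARE, a model can have an MRE near zero and still have a large MARE; the two measures are read together.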
We use the following steps in model prediction:
1. The input metrics are normalized using min-max normalization. Min-max normalization performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. It maps a value v of A to v' in the range 0 to 1 using the formula:
$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}$$
2. Perform principal components (P.C.) analysis on the normalized metrics to produce domain metrics.
3. Divide the data into training, test and validation sets using a 3:1:1 ratio.
4. Develop the ANN model based on the training and test data sets.
5. Apply the ANN model to the validation data set in order to evaluate the accuracy of the model.
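Steps 1 and 3 of the procedure above can be sketched as follows. The handling of a constant attribute (mapped to 0), the random shuffling with a fixed seed and the 100 placeholder rows are assumptions made for this illustration; the principal components step is not shown.

```python
import random

def min_max_normalize(values):
    """Map each value v of an attribute to (v - min) / (max - min), in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant attribute carries no information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def split_3_1_1(rows, seed=0):
    """Shuffle rows and split them into training, test and validation sets (3:1:1)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * 3 / 5)
    n_test = round(len(rows) * 1 / 5)
    return rows[:n_train], rows[n_train:n_train + n_test], rows[n_train + n_test:]

print(min_max_normalize([2, 4, 6]))
training, test, validation = split_3_1_1(range(100))   # 100 placeholder rows
print(len(training), len(test), len(validation))
```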
The P.C. extraction analysis and the varimax rotation method are applied to all metrics. The rotated component matrix is given in Table 10.17, which shows the relationship between the original object oriented metrics and the domain metrics. The values above 0.7 (shown in bold in Table 10.17) identify the metrics that are used to interpret the PCs. For each PC, we also provide its eigenvalue, variance percent and cumulative percent. The interpretations of the PCs are given as follows:
* P1: DAC, LCOM, NOM, RFC and WMC are cohesion, coupling and size metrics, so this dimension combines size, coupling and cohesion. This shows that there are classes with a high number of internal methods (methods defined in the class) and external methods (methods called by the class). This means that cohesion and coupling are related to the number of methods and attributes in the class.
* P2: MPC is a coupling metric that counts the number of send statements defined in a class.
* P3: NOC and DIT are inheritance metrics that count the number of children and the depth of the inheritance tree of a class.
Table 10.17: Rotated principal components
We employed the ANN technique to predict the maintenance effort of the classes. The inputs to the network were all the domain metrics P1, P2 and P3. The network was trained using the back propagation algorithm. Table 10.16 shows the best architecture, which was experimentally determined. The model is trained using the training and test data sets and evaluated on the validation data set. Table 10.18 shows the MARE, MRE, r and p-value results of the ANN model evaluated on the validation data. The correlation between the predicted change and the observed change is represented by the coefficient of correlation (r). The significance level of a validation is indicated by a p-value. A commonly accepted significance threshold for the p-value is 0.05.
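The coefficient of correlation between predicted and observed change can be computed as shown below. The sketch assumes that r is the Pearson product-moment correlation, which the text does not state explicitly, and the sample values are made up.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up predicted and observed numbers of lines added or changed.
predicted_change = [12, 30, 45, 80, 150]
observed_change = [10, 35, 40, 90, 140]
print(round(pearson_r(predicted_change, observed_change), 3))
```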
Table 10.18: Validation results of ANN model
Table 10.19: Analysis of model evaluation accuracy
For the validation data set, the percentage of predictions with an error smaller than 10 percent, 27 percent and 55 percent is shown in Table 10.19. We conclude that the impact of prediction is valid in the population. Figure 10.5 plots the predicted number of lines added or changed against the actual number of lines added or changed.
Software testing metrics are one part of metrics studies and focus on the testing issues of processes and products. Test suite effectiveness, source code coverage, defect density and review efficiency are some of the popular testing metrics. Testing efficiency may also be calculated as the size of the software tested divided by the resources used. We should also have metrics that provide immediate, real time feedback to testers and project managers on the quality of testing during each test phase, rather than waiting until the release of the software.
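The share of predictions falling under a given error threshold can be computed as follows. This sketch assumes that "error" means the absolute relative error |estimate - actual| / actual of each prediction; the data are made up and are not the values behind Table 10.19.

```python
def share_within(estimates, actuals, threshold):
    """Fraction of predictions whose absolute relative error is below threshold."""
    errors = [abs(e - a) / a for e, a in zip(estimates, actuals)]
    return sum(1 for r in errors if r < threshold) / len(errors)

# Made-up predictions against an observed value of 100 in each case.
estimates = [105, 120, 160, 50]
actuals = [100, 100, 100, 100]
for threshold in (0.10, 0.27, 0.55):
    print(threshold, share_within(estimates, actuals, threshold))
```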
MULTIPLE CHOICE QUESTIONS
Note: Select the most appropriate answer for each of the following questions.
10.1 One fault may lead to
(a) One failure
(b) Two failures
(c) Many failures
(d) All of the above
10.2 Failure occurrences can be represented as:
(a) Time to failure
(b) Time interval between failures
(c) Failure experienced in a time interval
(d) All of the above
10.3 What is the maximum value of reliability?
(d) None of the above
10.4 What is the minimum value of reliability?
(d) None of the above
10.5 As the failure intensity decreases reliability:
(c) No effect
(d) None of the above
10.6 Basic and logarithmic execution time models were developed by:
(a) Victor Basili
(b) J.D. Musa
(c) R. Binder
(d) B. Littlewood
10.7 Which is not a cohesion metric?
(a) Lack of cohesion in methods
(b) Tight class cohesion
(c) Response for a class
(d) Information flow cohesion
10.8 Which is not a size metric?
(a) Number of attributes per class
(b) Number of methods per class
(c) Number of children
(d) Weighted methods per class
10.9 Choose an inheritance metric:
(a) Number of children
(b) Response for a class
(c) Number of methods per class
(d) Message passing coupling
10.10 Which is not a coupling metric?
(a) Coupling between objects
(b) Data abstraction coupling
(c) Message passing coupling
(d) Number of children
10.11 What can be measured with respect to time during testing?
(a) Time available for testing
(b) Time to failure
(c) Time interval between failures
(d) All of the above
10.12 NHPP stands for
(a) Non-homogeneous Poisson process
(b) Non-heterogeneous Poisson process
(c) Non-homogeneous programming process
(d) Non-heterogeneous programming process
10.13 Which is not a test process metric?
(a) Number of test cases designed
(b) Number of test cases executed
(c) Number of failures experienced in a time interval
(d) Number of test cases failed
10.14 Which is not a test product metric?
(a) Time interval between failures
(b) Time to failure
(c) Estimated time for testing
(d) Test case execution time
10.15 Testability is dependent on
(a) Characteristics of the representation
(b) Characteristics of the implementation
(c) Built in test capabilities
(d) All of the above
10.16 Which of the following is true?
(a) Testability is inversely proportional to complexity
(b) Testability is directly proportional to complexity
(c) Testability is equal to complexity
(d) None of the above
10.17 Cyclomatic complexity of code provides
(a) An upper limit for the number of test cases needed for the code coverage criterion
(b) A lower limit for the number of test cases needed for the code coverage criterion
(c) A direction for testing
(d) None of the above
10.18 The higher the cyclomatic complexity:
(a) The greater the testing effort
(b) The lesser the testing effort
(c) The testing effort becomes infinite
(d) None of the above
10.19 Which is not an object oriented metric given by Chidamber and Kemerer?
(a) Coupling between objects
(b) Lack of cohesion
(c) Response for a class
(d) Number of branches in a tree
10.20 Precision is defined as:
(a) Ratio of number of classes correctly predicted to the total number of classes
(b) Ratio of number of classes wrongly predicted to the total number of classes
(c) Ratio of total number of classes to the classes wrongly predicted
(d) None of the above
10.21 Sensitivity is defined as:
(a) Ratio of number of classes correctly predicted as fault prone to the total number of classes
(b) Ratio of the number of classes correctly predicted as fault prone to the total number of classes that are actually fault prone
(c) Ratio of faulty classes to total number of classes
(d) None of the above
10.22 Reliability is measured with respect to
10.23 Choose an event where a Poisson process is not used:
(a) Number of users using a website in a given time interval
(b) Number of persons requesting for railway tickets in a given period of time
(c) Number of students in a class
(d) Number of e-mails expected in a given period of time
10.24 Choose a data structure metric:
(a) Number of live variables
(b) Variable span
(c) Module weakness
(d) All of the above
10.25 Software testing metrics are used to measure
(a) Progress of testing
(b) Reliability of software
(c) Time spent during testing
(d) All of the above
EXERCISES
10.1 What is a software metric? Why do we really need metrics in software? Discuss the areas of application and the problems faced during implementation of metrics.
10.2 Define the following terms:
10.3 (a) What should we measure during testing?
(b) Discuss things which can be measured with respect to time
(c) Explain any reliability model where emphasis is on failures rather than faults
10.4 Describe the following metrics:
(a) Quality of source code
(b) Source code coverage
(c) Test case defect density
(d) Review efficiency
10.5 (a) What metrics are required to be captured during testing?
(b) Identify some test process metrics
(c) Identify some test product metrics
10.6 What is the relationship between testability and complexity? Discuss the factors which affect software testability.
10.7 Explain the software fault prediction model. List the metrics used in the analysis of the model. Define precision, sensitivity and completeness. What is the purpose of using the receiver operating characteristics (ROC) curve?
10.8 Discuss the basic model of software reliability. How can we calculate and?
10.9 Explain the logarithmic Poisson model and find the values of ∆μ and ∆Γ.
10.10 Assume that the initial failure intensity is 20 failures / CPU hr. The failure intensity decay parameter is 0.05 / failure. We assume that 50 failures have been experienced. Calculate the current failure intensity.
10.11 Assume that the initial failure intensity is 5 failures / CPU hr. The failure intensity decay parameter is 0.01 / failure. We have experienced 25 failures up to this time. Find the failures experienced and the failure intensity after 30 and 60 CPU hrs of execution.
10.12 Explain the Jelinski-Moranda model of reliability theory. What is the relationship between ‘λ’ and ‘t’?
10.13 A program is expected to have 500 faults. Assume that one fault may lead to one failure. The initial failure intensity is 10 failures / CPU hr. The program is released with a failure intensity objective of 6 failures / CPU hr. Calculate the number of failures experienced before release.
10.14 Assume that a program will experience 200 failures in infinite time. It has now experienced 100. The initial failure intensity was 10 failures / CPU hr.
(a) Determine the present failure intensity
(b) Calculate failures experienced and failure intensity after 50 and 100 CPU hrs. of execution.
Use the basic execution time model for the above calculations.
10.15 What is software reliability? Does it exist? Describe the following terms:
(iii) Failure intensity
(iv) Failures experienced in a time interval