National Aeronautics and Space Administration

Classifying Software Faults to Improve Fault Detection Effectiveness
Executive Briefing
NASA OSMA Software Assurance Symposium, September 9-11, 2008
Allen P. Nikora, JPL/Caltech

This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The work was sponsored by the NASA Office of Safety and Mission Assurance under the Software Assurance Research Program led by the NASA Software IV&V Facility. This activity is managed locally at JPL through the Assurance and Technology Program Office.

09/09/2008 SAS08_Classify_Defects_Nikora

Agenda
- Problem/Approach
- Relevance to NASA
- Accomplishments and/or Tech Transfer Potential
- Next Steps

Problem/Approach
- All software systems contain faults.
  - Different types of faults exhibit different types of failure behavior.
  - Different types of faults require different identification techniques.
  - Some faults are easier to find than others.
- The likelihood of detecting and removing software faults during development and testing, as well as the possible strategies for dealing with residual faults during mission operations, depends on the fault type.
- Goals are to:
  - Determine the relative frequencies of specific types of faults and identify trends in those frequencies.
  - Develop effective techniques for identifying and removing faults or masking their effects.
  - Develop guidelines, based on the analysis of faults and failures, for applying the techniques in the context of current and future missions.

Problem/Approach (contd)
What must be done?
- Analyze software failure data (test and operations) from historical and current JPL and NASA missions and classify the underlying software faults.
- Further classify the faults by criticality (e.g., non-critical, significant mission impact, mission critical) and by detection phase.
- Perform statistical analysis:
  - Proportions of faults of each category.
  - Conditional frequencies (e.g., percentage of critical faults among aging-related bugs, percentage of aging-related bugs among the critical faults).
  - Trends in conditional frequencies (within and across missions).
- Determine criteria for further classifying faults (e.g., for the aging-related bugs: faults causing round-off errors, faults causing memory leaks, etc.) to identify classes of faults with high criticality and low detectability.
- For highly critical faults that are difficult to detect prior to release, develop techniques for:
  - Identifying the component(s) most likely to contain these types of faults.
  - Improving the detectability of the faults with model-based verification or static analysis tools, as well as during testing.
  - Masking the faults via fault tolerance (e.g., software rejuvenation for aging-related faults). Such techniques must be able to accurately distinguish between behavioral changes resulting from normal changes in the system's operating environment and input space and those brought about by aging-related faults.
- Develop guidelines for implementing techniques in the context of current and future missions.

Relevance to NASA
Different types of faults have different types of effects. Choose fault identification/mitigation strategies based on the types of failures encountered in the system being developed.
- Bohrbugs
  - Deterministically cause failures.
  - Easiest to find during testing.
  - Fault tolerance of the operational system can mainly be achieved with design diversity.
- Mandelbugs
  - Difficult to find, isolate, and correct during testing.
  - Re-execution of an operation that failed because of a Mandelbug will generally not result in another failure.
  - Fault tolerance can be achieved by simple retries or by more sophisticated approaches such as checkpointing and recovery-oriented computing.
- Aging-related bugs
  - The tendency to cause a failure increases with the system run time.
  - Proactive measures that clean the internal system state (software rejuvenation) and thus reduce the failure rate are useful.
  - Aging can be a significant threat to NASA software systems (e.g., continuously operating planetary exploration spacecraft flight control systems), since aging-related faults are often difficult to find during development.
- Related work
  - Rejuvenation has been implemented in many different kinds of software systems, including telecommunications systems, transaction processing systems, and cluster servers.
  - Various types of software systems, such as web servers and military systems, have been found to age.
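The two mitigation styles named for Mandelbugs and aging-related bugs, simple retries and software rejuvenation, can be illustrated with a minimal Python sketch. The helper `retry` and the class `RejuvenatingWorker` are hypothetical names introduced here for illustration, not part of any JPL system. The sketch also assumes that rejuvenation can be approximated by rebuilding internal state after a fixed number of calls.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Re-run an operation up to `attempts` times (hypothetical helper).

    This helps with Mandelbug-style failures, which tend not to recur on
    re-execution. A Bohrbug fails deterministically, so every attempt fails
    and the last error is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as err:  # a real system would catch narrower error types
            last_error = err
            if attempt < attempts - 1 and delay_s > 0:
                time.sleep(delay_s)
    raise last_error


class RejuvenatingWorker:
    """Resets internal state every `max_calls` calls (assumed rejuvenation policy)."""

    def __init__(self, make_state, step, max_calls):
        self.make_state = make_state  # builds a fresh, clean state
        self.step = step              # step(state, item) does the actual work
        self.max_calls = max_calls
        self.state = make_state()
        self.calls = 0
        self.rejuvenations = 0

    def run(self, item):
        # Proactively discard accumulated state before it can degrade behavior.
        if self.calls >= self.max_calls:
            self.state = self.make_state()
            self.calls = 0
            self.rejuvenations += 1
        self.calls += 1
        return self.step(self.state, item)
```

A count-based trigger is only one possible policy. Time-based or measurement-based triggers (for example, on observed memory growth) are alternatives, and the choice would depend on the mission context.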
Accomplishments and/or Tech Transfer Potential
- Collected over 40,000 failure records from the JPL problem reporting system
  - Operational failures and failures observed during system test and ATLO operations
  - All failures (software and non-software)
  - Over two dozen projects represented
    - Planetary exploration
    - Earth orbiters
    - Instruments
- Continued analysis of software failures
  - Classified flight software failures for 18 projects
  - Classification of ground software failures for the same 18 missions in progress
- Completed statistical analysis of flight software failure data
- Started application of machine learning/data mining techniques to improve classification accuracy
  - Software vs. non-software failures
  - Types of software failures
  - Supervised vs. unsupervised learning

Next Steps
- Complete analysis of failures
  - Complete analysis of ground software ISAs by the end of September 2008.
  - Complete statistical analyses for all failures to identify trends:
    - Proportions of software failures
    - Proportions of Bohrbugs vs. Mandelbugs vs. aging-related bugs
- Complete experiments with machine learning/data mining; identify the most appropriate failure data representations and learning models to distinguish between:
  - Software and non-software failures
    - Find additional software failures in the problem reporting system and classify them.
    - Can improve the accuracy of software failure type classification.
  - Different types of software failures
- Based on analyses of proportions and trends in failure data, identify/develop appropriate fault prevention/mitigation strategies (e.g., software rejuvenation).
- Other software improvement/defect analysis tasks and organizations at JPL have expressed interest in collaborating with this effort:
  - JPL Software Product and Process Assurance Group
  - JPL Software Quality Improvement project

Backup Information

Fault Classifications
Classification scheme: the following definitions of software fault types are based on [Grottke05a, Grottke05b].
- Mandelbug := A fault whose activation and/or error propagation are complex, where complexity can be caused either by interactions of the software application with its system-internal environment (hardware, operating system, other applications) or by a time lag between the fault activation and the occurrence of a failure. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by it are not systematically reproducible. (Sometimes, Mandelbugs are incorrectly referred to as Heisenbugs.)
- Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity. Complementary antonym of Mandelbug.
- Aging-related bug := A fault that leads to the accumulation of internal error states, resulting in an increased failure rate and/or degraded performance. Sub-type of Mandelbug.
According to these definitions, the classes of Bohrbugs, aging-related bugs, and non-aging-related Mandelbugs partition the space of all software faults.

References:
[Grottke05a] M. Grottke and K. S. Trivedi, "Software faults, software aging and software rejuvenation," Journal of the Reliability Engineering Association of Japan 27(7):425-438, 2005.
[Grottke05b] M. Grottke and K. S. Trivedi, "A classification of software faults," Supplemental Proc. Sixteenth International Symposium on Software Reliability Engineering, 2005, pp. 4.19-4.20.

Mission Characteristics Summary
[Table of mission characteristics by mission ID; contents not recoverable from the extracted text.]
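The statistical analysis named in the approach (proportions of each fault category, and conditional frequencies such as the percentage of critical faults among aging-related bugs) can be sketched in a few lines of Python. The records below are invented purely for illustration and are not JPL data. The two-level criticality label and the function names are also assumptions of this sketch.

```python
from collections import Counter

# Hypothetical (fault_type, criticality) records, invented for illustration only.
# "mandelbug" here stands for a non-aging-related Mandelbug.
FAULTS = [
    ("bohrbug", "non-critical"),
    ("bohrbug", "critical"),
    ("bohrbug", "non-critical"),
    ("mandelbug", "critical"),
    ("mandelbug", "non-critical"),
    ("aging", "critical"),
    ("aging", "critical"),
    ("aging", "non-critical"),
]


def proportions(records):
    """Share of each fault type; the three classes partition all faults."""
    counts = Counter(fault_type for fault_type, _ in records)
    total = sum(counts.values())
    return {fault_type: n / total for fault_type, n in counts.items()}


def conditional_frequency(records, target, given):
    """Fraction of records satisfying `target` among those satisfying `given`."""
    selected = [r for r in records if given(r)]
    if not selected:
        return None  # undefined when no record satisfies the condition
    return sum(1 for r in selected if target(r)) / len(selected)


def is_aging(record):
    return record[0] == "aging"


def is_critical(record):
    return record[1] == "critical"


if __name__ == "__main__":
    print(proportions(FAULTS))
    # Critical faults among aging-related bugs
    print(conditional_frequency(FAULTS, is_critical, is_aging))
    # Aging-related bugs among critical faults
    print(conditional_frequency(FAULTS, is_aging, is_critical))
```

The two conditional frequencies generally differ, which is why the approach lists both directions. Computing them per mission, or per normalized time window, would give the trend data.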
[Figure: Fault type proportions for the eight projects with the largest number of unique faults. Horizontal axis: mission ID (in launch order).]

Analysis Results (contd)
[Figure: Proportion of Bohrbugs for the four earlier missions (Missions 2, 10, 11, and 16). Axes: normalized duration vs. proportion of Bohrbugs.]
[Figure: Proportion of non-aging-related Mandelbugs for the four earlier missions (Missions 2, 10, 11, and 16). Axes: normalized duration vs. proportion of non-aging-related Mandelbugs.]
[Figure: Proportion of Bohrbugs for missions 3 and 9, and 95% confidence interval based on the four earlier missions.]
[Figure: Proportion of Bohrbugs for missions 6 and 14, and 95% confidence interval based on the four earlier missions.]
[Figure: Proportion of non-aging-related Mandelbugs for missions 3 and 9, and 95% confidence interval based on the four earlier missions.]
[Figure: Proportion of non-aging-related Mandelbugs for missions 6 and 14, and 95% confidence interval based on the four earlier missions.]

Machine Learning/Text Mining Results
[Figure: ROC curves, FSW vs. other ISA types (flight software failures vs. all other failures). Axes: pf vs. pd. Classifier label: NaiveBayesMultinomial -x 10 -i -k.]
[Figure: ROC curves, GSW vs. other ISA types (ground software failures vs. all other failures). Axes: pf vs. pd.]
[Figure: ROC curves, flight and ground software vs. other ISA types (flight and ground software failures vs. all other failures). Axes: pf vs. pd. Classifier label: ComplementNaiveBayes -S 1.0 -x 10 i -k.]
[Figure: ROC curves, PROC vs. other ISA types (procedural/process errors vs. all other failures). Axes: pf vs. pd. Classifier label: ComplementNaiveBayes -S 1.0 -x 10 -i -k.]
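The ROC plots are drawn on pd and pf axes. Reading these as probability of detection (true positive rate) and probability of false alarm (false positive rate) is an assumption based on common usage, since the slides do not define them. Under that assumption, the sketch below shows how one ROC point per decision threshold could be computed from classifier scores. The labels and scores are invented for illustration and are not results from the briefing.

```python
def pd_pf(labels, predictions):
    """Return (pd, pf) for boolean ground-truth labels and boolean predictions.

    pd = TP / (TP + FN): fraction of true software failures that were flagged.
    pf = FP / (FP + TN): fraction of other records that were wrongly flagged.
    """
    tp = fn = fp = tn = 0
    for actual, predicted in zip(labels, predictions):
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return pd, pf


def roc_points(labels, scores, thresholds):
    """One (threshold, pd, pf) triple per threshold; score >= threshold means flagged."""
    points = []
    for threshold in thresholds:
        predictions = [score >= threshold for score in scores]
        pd, pf = pd_pf(labels, predictions)
        points.append((threshold, pd, pf))
    return points


# Hypothetical data: True marks a record that really is a software failure.
LABELS = [True, True, True, True, False, False, False, False]
SCORES = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

if __name__ == "__main__":
    for threshold, pd, pf in roc_points(LABELS, SCORES, [0.85, 0.5, 0.25]):
        print(f"threshold={threshold:.2f} pd={pd:.2f} pf={pf:.2f}")
```

Lowering the threshold raises pd and pf together. A useful classifier is one whose curve stays well above the pd = pf diagonal.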