
Guiding deep learning system testing using surprise adequacy

Published: 25 May 2019
DOI: 10.1109/ICSE.2019.00108

Abstract

Deep Learning (DL) systems are rapidly being adopted in safety- and security-critical domains, urgently calling for ways to test their correctness and robustness. Testing of DL systems has traditionally relied on manual collection and labelling of data. Recently, a number of coverage criteria based on neuron activation values have been proposed. These criteria essentially count the number of neurons whose activation during the execution of a DL system satisfied certain properties, such as being above predefined thresholds. However, existing coverage criteria are not sufficiently fine grained to capture subtle behaviours exhibited by DL systems. Moreover, evaluations have focused on showing correlation between adversarial examples and proposed criteria rather than evaluating and guiding their use for actual testing of DL systems. We propose a novel test adequacy criterion for testing of DL systems, called Surprise Adequacy for Deep Learning Systems (SADL), which is based on the behaviour of DL systems with respect to their training data. We measure the surprise of an input as the difference in the DL system's behaviour between the input and the training data (i.e., what was learnt during training), and subsequently develop this as an adequacy criterion: a good test input should be sufficiently but not overtly surprising compared to the training data. Empirical evaluation using a range of DL systems, from simple image classifiers to autonomous driving car platforms, shows that systematic sampling of inputs based on their surprise can improve classification accuracy of DL systems against adversarial examples by up to 77.5% via retraining.
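The abstract describes surprise only at a high level. As a rough illustration, the sketch below shows one way an input's surprise relative to the training data could be estimated, using kernel density estimation over a layer's activation values (in the spirit of the paper's likelihood-based instantiation). The toy activation function, data, thresholding, and all names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: likelihood-style "surprise" of test inputs relative
# to training-time activations. The activation function and data are toy
# placeholders, not the SADL authors' code.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def activation_trace(x, w):
    """Stand-in for a chosen DL layer's activations on input x (placeholder)."""
    return np.tanh(x @ w)

# Toy "training data" and a fixed random projection standing in for a trained layer.
d_in, d_act = 20, 5
w = rng.normal(size=(d_in, d_act))
train_x = rng.normal(size=(1000, d_in))
train_at = activation_trace(train_x, w)        # shape (n_train, d_act)

# Fit a kernel density estimate over training activation traces.
kde = gaussian_kde(train_at.T)                 # gaussian_kde expects shape (d, n)

def surprise(x):
    """Higher value = less likely under training behaviour = more surprising."""
    at = activation_trace(np.atleast_2d(x), w)
    density = np.maximum(kde(at.T), 1e-30)     # guard against log(0)
    return -np.log(density)

# Rank candidate test inputs by surprise and sample across the range of scores,
# a crude stand-in for surprise-guided selection of retraining inputs.
test_x = rng.normal(size=(200, d_in)) * 1.5
scores = surprise(test_x)
order = np.argsort(scores)
selected = test_x[order[::len(order) // 20]]   # spread picks over surprise levels
print("median surprise:", float(np.median(scores)))
```

The design intuition follows the abstract: inputs whose activations are very likely under the training distribution add little new information, while moderately surprising inputs are the ones worth sampling for testing and retraining.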



Published In

ICSE '19: Proceedings of the 41st International Conference on Software Engineering
May 2019, 1318 pages
Publisher: IEEE Press

Author Tags

  1. deep learning systems
  2. test adequacy

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate: 276 of 1,856 submissions, 15%
