
Guiding deep learning system testing using surprise adequacy

Published: 25 May 2019
DOI: 10.1109/ICSE.2019.00108

Abstract

Deep Learning (DL) systems are rapidly being adopted in safety- and security-critical domains, urgently calling for ways to test their correctness and robustness. Testing of DL systems has traditionally relied on manual collection and labelling of data. Recently, a number of coverage criteria based on neuron activation values have been proposed. These criteria essentially count the number of neurons whose activation during the execution of a DL system satisfied certain properties, such as being above predefined thresholds. However, existing coverage criteria are not sufficiently fine grained to capture subtle behaviours exhibited by DL systems. Moreover, evaluations have focused on showing correlation between adversarial examples and proposed criteria rather than evaluating and guiding their use for actual testing of DL systems. We propose a novel test adequacy criterion for testing of DL systems, called Surprise Adequacy for Deep Learning Systems (SADL), which is based on the behaviour of DL systems with respect to their training data. We measure the surprise of an input as the difference in the DL system's behaviour between the input and the training data (i.e., what was learnt during training), and subsequently develop this as an adequacy criterion: a good test input should be sufficiently but not overtly surprising compared to the training data. Empirical evaluation using a range of DL systems, from simple image classifiers to autonomous driving car platforms, shows that systematic sampling of inputs based on their surprise can improve classification accuracy of DL systems against adversarial examples by up to 77.5% via retraining.
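The abstract describes surprise only at a high level. As a rough illustration, the sketch below shows one way an input's surprise relative to the training data could be estimated, using kernel density estimation over a layer's activation values (in the spirit of the paper's likelihood-based instantiation). The toy activation function, data, thresholding, and all names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: likelihood-style "surprise" of test inputs relative
# to training-time activations. The activation function and data are toy
# placeholders, not the SADL authors' code.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def activation_trace(x, w):
    """Stand-in for a chosen DL layer's activations on input x (placeholder)."""
    return np.tanh(x @ w)

# Toy "training data" and a fixed random projection standing in for a trained layer.
d_in, d_act = 20, 5
w = rng.normal(size=(d_in, d_act))
train_x = rng.normal(size=(1000, d_in))
train_at = activation_trace(train_x, w)        # shape (n_train, d_act)

# Fit a kernel density estimate over training activation traces.
kde = gaussian_kde(train_at.T)                 # gaussian_kde expects shape (d, n)

def surprise(x):
    """Higher value = less likely under training behaviour = more surprising."""
    at = activation_trace(np.atleast_2d(x), w)
    density = np.maximum(kde(at.T), 1e-30)     # guard against log(0)
    return -np.log(density)

# Rank candidate test inputs by surprise and sample across the range of scores,
# a crude stand-in for surprise-guided selection of retraining inputs.
test_x = rng.normal(size=(200, d_in)) * 1.5
scores = surprise(test_x)
order = np.argsort(scores)
selected = test_x[order[::len(order) // 20]]   # spread picks over surprise levels
print("median surprise:", float(np.median(scores)))
```

The design intuition follows the abstract: inputs whose activations are very likely under the training distribution add little new information, while moderately surprising inputs are the ones worth sampling for testing and retraining.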



Published In

ICSE '19: Proceedings of the 41st International Conference on Software Engineering
May 2019, 1318 pages
Publisher: IEEE Press

Author Tags

  1. deep learning systems
  2. test adequacy

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate: 276 of 1,856 submissions, 15%
