
SoK: A Modularized Approach to Study the Security of Automatic Speech Recognition Systems

Published: 19 May 2022
Abstract

    With the wide use of Automatic Speech Recognition (ASR) in applications such as human-machine interaction, simultaneous interpretation, and audio transcription, its security protection becomes increasingly important. Although recent studies have brought to light the weaknesses of popular ASR systems that enable out-of-band signal attacks, adversarial attacks, and others, and have further proposed various remedies (signal smoothing, adversarial training, etc.), a systematic understanding of ASR security (both attacks and defenses) is still missing, especially of how realistic such threats are and how general existing protections could be. In this article, we present our systematization of knowledge for ASR security and provide a comprehensive taxonomy of existing work based on a modularized workflow. More importantly, we align the research in this domain with that on the security of Image Recognition Systems (IRS), which has been extensively studied, using the domain knowledge of the latter to understand where we stand in the former. Both IRS and ASR are perceptual systems. Their similarities allow us to systematically study the existing literature on ASR security against the spectrum of attacks and defenses proposed for IRS, and to pinpoint directions toward more advanced attacks and more effective protection in ASR. Their differences, especially the greater complexity of ASR compared with IRS, reveal unique challenges and opportunities in ASR security. In particular, our experimental study shows that transfer attacks across ASR models are feasible, even in the absence of knowledge about the models (even their types) and training data.
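    To make the closing claim concrete, below is a minimal, hypothetical sketch of how such a transfer attack is typically evaluated: an adversarial perturbation is crafted with white-box gradients on a surrogate model, then replayed against an independently built target model whose internals the attacker never sees. This is not the paper's experimental setup; the toy 1-D convolutional classifiers, random waveform batch, perturbation budget, and FGSM attack step are all illustrative assumptions.

```python
# Minimal transfer-attack sketch (illustrative only, NOT the paper's setup).
# Perturbations are crafted with white-box gradients on a surrogate model,
# then tested against a separate target model that is never differentiated.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

NUM_CLASSES = 10   # assumed: a 10-word speech-command vocabulary
SAMPLES = 16000    # 1 second of 16 kHz audio

def make_model(width: int) -> nn.Module:
    """Tiny 1-D CNN over raw waveforms; different widths stand in for
    different (and, here, untrained) architectures."""
    return nn.Sequential(
        nn.Conv1d(1, width, kernel_size=80, stride=16), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        nn.Linear(width, NUM_CLASSES),
    )

surrogate = make_model(32)  # attacker's white-box model
target = make_model(64)     # victim model: no gradient access, no queries

def fgsm(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
         eps: float = 0.02) -> torch.Tensor:
    """One Fast Gradient Sign Method step on the surrogate's loss."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach().clamp(-1.0, 1.0)

# A toy batch of "utterances", labeled with the surrogate's own predictions.
x = torch.rand(16, 1, SAMPLES) * 2 - 1
y = surrogate(x).argmax(dim=1)
x_adv = fgsm(surrogate, x, y)

# Transfer rate: how often the perturbation also flips the target's output.
flipped = target(x_adv).argmax(dim=1) != target(x).argmax(dim=1)
print(f"transfer rate on target: {flipped.float().mean():.2f}")
```

    On trained speech-command models, the printed transfer rate would estimate how often surrogate-crafted perturbations also change the target's output, which is the quantity behind the abstract's transferability claim.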



      Published In

      ACM Transactions on Privacy and Security, Volume 25, Issue 3
      August 2022, 288 pages
      ISSN: 2471-2566
      EISSN: 2471-2574
      DOI: 10.1145/3530305

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 May 2022
      Online AM: 29 March 2022
      Accepted: 01 January 2022
      Revised: 01 November 2021
      Received: 01 March 2021
      Published in TOPS Volume 25, Issue 3


      Author Tags

      1. Adversarial attacks
      2. machine learning security
      3. speech system security

      Qualifiers

      • Research-article
      • Refereed


      Cited By

      • (2024) Disposable identities: Solving web tracking. Journal of Information Security and Applications 84, 103821. DOI: 10.1016/j.jisa.2024.103821. Online publication date: Aug-2024.
      • (2024) Secure speech-recognition data transfer in the internet of things using a power system and a tried-and-true key generation technique. Cluster Computing. DOI: 10.1007/s10586-024-04649-3. Online publication date: 29-Jul-2024.
      • (2023) Towards the transferable audio adversarial attack via ensemble methods. Cybersecurity 6, 1. DOI: 10.1186/s42400-023-00175-8. Online publication date: 5-Dec-2023.
      • (2023) Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition. IEEE Transactions on Dependable and Secure Computing 20, 5, 3970–3987. DOI: 10.1109/TDSC.2022.3220673. Online publication date: 1-Sep-2023.
      • (2023) Combining Deep Learning with Domain Adaptation and Filtering Techniques for Speech Recognition in Noisy Environments. 2023 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), 1–6. DOI: 10.1109/ROPEC58757.2023.10409492. Online publication date: 18-Oct-2023.
      • (2023) Knowledge Distillation Based Defense for Audio Trigger Backdoor in Federated Learning. GLOBECOM 2023 - 2023 IEEE Global Communications Conference, 4271–4276. DOI: 10.1109/GLOBECOM54140.2023.10437601. Online publication date: 4-Dec-2023.
      • (2023) Defense Mechanisms Against Audio Adversarial Attacks: Recent Advances and Future Directions. Edge Computing and IoT: Systems, Management and Security, 166–175. DOI: 10.1007/978-3-031-28990-3_12. Online publication date: 31-Mar-2023.
      • (2022) Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language. Sensors 22, 10, 3683. DOI: 10.3390/s22103683. Online publication date: 12-May-2022.
      • (2022) Detecting Audio Adversarial Examples in Automatic Speech Recognition Systems Using Decision Boundary Patterns. Journal of Imaging 8, 12, 324. DOI: 10.3390/jimaging8120324. Online publication date: 9-Dec-2022.
      • (2022) Your Voice is Not Yours? Black-Box Adversarial Attacks Against Speaker Recognition Systems. 2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 692–699. DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom57177.2022.00094. Online publication date: Dec-2022.
