
Towards Understanding and Mitigating Audio Adversarial Examples for Speaker Recognition

Published: 01 September 2023 in IEEE Transactions on Dependable and Secure Computing (IEEE Computer Society Press, Washington, DC, United States)

Abstract

Speaker recognition systems (SRSs) have recently been shown to be vulnerable to adversarial attacks, raising significant security concerns. In this work, we systematically investigate transformation-based and adversarial-training-based defenses for securing SRSs. Guided by the characteristics of SRSs, we present 22 diverse transformations and thoroughly evaluate them using 7 recent promising adversarial attacks (4 white-box and 3 black-box) on speaker recognition. Following best practices in defense evaluation, we analyze the ability of these transformations to withstand adaptive attacks, and we evaluate and interpret their effectiveness against adaptive attacks when combined with adversarial training. Our study yields thirteen useful insights and findings, many of which are new or inconsistent with conclusions from the image and speech recognition domains, e.g., variable and constant bit-rate speech compressions perform differently, and some non-differentiable transformations remain effective against current promising evasion techniques that often work well in the image domain. We demonstrate that the proposed novel feature-level transformation combined with adversarial training is markedly more effective than adversarial training alone in a complete white-box setting, e.g., increasing accuracy by 13.62% and attack cost by two orders of magnitude, while other transformations do not necessarily improve the overall defense capability. This work sheds further light on research directions in this field. We also release our evaluation platform SpeakerGuard to foster further research.
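To make the defense recipe described above concrete, the sketch below illustrates the general pattern the paper evaluates: an input transformation applied in front of the speaker classifier, combined with PGD-based adversarial training. This is a minimal, assumption-laden illustration in PyTorch, not the paper's SpeakerGuard implementation; the amplitude-quantization transform, the model interface (waveforms in [-1, 1] in, speaker logits out), and all hyperparameter values are hypothetical placeholders.

```python
# Hedged sketch: input transformation + adversarial training for a speaker
# classifier. Illustrative only; not the paper's SpeakerGuard code.
import torch
import torch.nn.functional as F

def quantize(waveform, levels=256):
    # Toy input transformation: amplitude quantization of audio in [-1, 1].
    # A stand-in for the time-/feature-domain transformations studied here.
    scaled = (waveform + 1.0) / 2.0 * (levels - 1)
    return torch.round(scaled) / (levels - 1) * 2.0 - 1.0

def pgd_attack(model, x, y, eps=0.002, alpha=0.0004, steps=10):
    # Untargeted L_inf PGD on raw waveforms (white-box). NB: this loop does
    # not differentiate through quantize(); a fully adaptive attacker would
    # approximate the transform (e.g., via BPDA), which is exactly the
    # adaptive setting the paper stresses.
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project to eps-ball
        x_adv = torch.clamp(x_adv, -1.0, 1.0)          # keep valid audio range
    return x_adv.detach()

def train_step(model, optimizer, x, y):
    # One adversarial-training step: craft adversarial waveforms, then fit
    # the model on their transformed versions.
    model.eval()
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(quantize(x_adv)), y)
    loss.backward()   # quantize() needs no gradient here: only the model
    optimizer.step()  # parameters are updated, not the inputs
    return loss.item()
```

In this combination, the transformation raises the attacker's cost (gradients through the non-differentiable quantize step must be approximated) while adversarial training hardens the model itself, mirroring the finding that the two defenses are stronger together than adversarial training alone.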


Cited By

  • (2024) "Attack as Detection: Using Adversarial Attack Methods to Detect Abnormal Examples," ACM Transactions on Software Engineering and Methodology, vol. 33, no. 3, pp. 1–45. DOI: 10.1145/3631977. Online publication date: 15 Mar. 2024.
  • (2024) "AFPM: A Low-Cost and Universal Adversarial Defense for Speaker Recognition Systems," IEEE Transactions on Information Forensics and Security, vol. 19, pp. 2273–2287. DOI: 10.1109/TIFS.2023.3348232. Online publication date: 1 Jan. 2024.
