Sparks of Large Audio Models: A Survey and Outlook

Authors: Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuayáhuitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, Björn W. Schuller

Abstract: This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources, from human voices to musical instruments and environmental sounds, poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess in a variety of audio tasks, spanning from Automatic Speech Recognition and Text-To-Speech to Music Generation, among others. Notably, these Foundational Audio Models, like SeamlessM4T, have recently started showing abilities to act as universal translators, supporting multiple speech tasks for up to 100 languages without any reliance on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies regarding Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models with the intent to spark further discussion, thereby fostering innovation in the next generation of audio-processing systems.
Furthermore, to cope with the rapid development in this area, we will consistently update the relevant repository with recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.

Submitted 21 September, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

Comments: Under review, Repo URL: https://github.com/EmulationAI/awesome-large-audio-models

arXiv:2303.11607

Transformers in Speech Processing: A Survey

Authors: Siddique Latif, Aun Zaidi, Heriberto Cuayahuitl, Fahad Shamshad, Moazzam Shoukat, Junaid Qadir

Abstract: The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their potential for modeling long-range dependencies within speech sequences. Recently, transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech para-linguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications. In this paper, we present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology. By consolidating findings from across the speech technology landscape, we provide a valuable resource for researchers interested in harnessing the power of transformers to advance the field. We identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.

Submitted 21 March, 2023; originally announced March 2023.

Comments: under-review

arXiv:2201.09873

Transformers in Medical Imaging: A Survey

Authors: Fahad Shamshad, Salman Khan, Syed Waqas Zamir, Muhammad Haris Khan, Munawar Hayat, Fahad Shahbaz Khan, Huazhu Fu

Abstract: Following unprecedented success on natural language tasks, Transformers have been successfully applied to several computer vision problems, achieving state-of-the-art results and prompting researchers to reconsider the supremacy of convolutional neural networks (CNNs) as de facto operators. Capitalizing on these advances in computer vision, the medical imaging field has also witnessed growing interest in Transformers that can capture global context compared to CNNs with local receptive fields. Inspired by this transition, in this survey, we attempt to provide a comprehensive review of the applications of Transformers in medical imaging covering various aspects, ranging from recently proposed architectural designs to unsolved issues. Specifically, we survey the use of Transformers in medical image segmentation, detection, classification, reconstruction, synthesis, registration, clinical report generation, and other tasks. In particular, for each of these applications, we develop a taxonomy, identify application-specific challenges, provide insights to solve them, and highlight recent trends. Further, we provide a critical discussion of the field's current state as a whole, including the identification of key challenges and open problems, and an outline of promising future directions. We hope this survey will ignite further interest in the community and provide researchers with an up-to-date reference regarding applications of Transformer models in medical imaging.
Finally, to cope with the rapid development in this field, we intend to regularly update the relevant latest papers and their open-source implementations at https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging.

Submitted 24 January, 2022; originally announced January 2022.

Comments: 41 pages, https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging

arXiv:2101.00240

A Survey on Deep Reinforcement Learning for Audio-Based Applications

Authors: Siddique Latif, Heriberto Cuayáhuitl, Farrukh Pervez, Fahad Shamshad, Hafiz Shehbaz Ali, Erik Cambria

Abstract: Deep reinforcement learning (DRL) is poised to revolutionise the field of artificial intelligence (AI) by endowing autonomous systems with high levels of understanding of the real world. Currently, deep learning (DL) is enabling DRL to effectively solve a range of intractable problems in various fields. Most importantly, DRL algorithms are also being employed in audio signal processing to learn directly from speech, music and other sound signals in order to create audio-based autonomous systems that have many promising applications in the real world. In this article, we conduct a comprehensive survey on the progress of DRL in the audio domain by bringing together the research studies across different speech and music-related areas. We begin with an introduction to the general field of DL and reinforcement learning (RL), then progress to the main DRL methods and their applications in the audio domain. We conclude by presenting challenges faced by audio-based DRL agents and highlighting open areas for future research and investigation.

Submitted 1 January, 2021; originally announced January 2021.

Comments: Under Review

arXiv:2005.07026

Subsampled Fourier Ptychography using Pretrained Invertible and Untrained Network Priors

Authors: Fahad Shamshad, Asif Hanif, Ali Ahmed

Abstract: Recently, pretrained generative models have shown promising results for subsampled Fourier Ptychography (FP) in terms of quality of reconstruction at extremely low sampling rates and high noise. However, one of the significant drawbacks of these pretrained generative priors is their limited representation capabilities. Moreover, training these generative models requires access to a large number of fully-observed clean samples of a particular class of images, like faces or digits, which is prohibitive to obtain in the context of FP. In this paper, we propose to leverage the power of pretrained invertible and untrained generative models to mitigate, respectively, the representation error issue and the requirement of a large number of example images (for training generative models). Through extensive experiments, we demonstrate the effectiveness of the proposed approaches in the context of FP for low sampling rates and high noise levels.

Submitted 13 May, 2020; originally announced May 2020.

Comments: Part of this work has been accepted in NeurIPS Deep Inverse Workshop, 2019

arXiv:2002.12578

Class-Specific Blind Deconvolutional Phase Retrieval Under a Generative Prior

Authors: Fahad Shamshad, Ali Ahmed

Abstract: In this paper, we consider the highly ill-posed problem of jointly recovering two real-valued signals from the phaseless measurements of their circular convolution. The problem arises in various imaging modalities such as Fourier ptychography, X-ray crystallography, and in visible light communication. We propose to solve this inverse problem using an alternating gradient descent algorithm under two pretrained deep generative networks as priors; one is trained on sharp images and the other on blur kernels. The proposed recovery algorithm strives to find a sharp image and a blur kernel in the range of the respective pre-generators that best explain the forward measurement model. In doing so, we are able to reconstruct quality image estimates. Moreover, the numerics show that the proposed approach performs well on challenging measurement models that reflect physically realizable imaging systems and is also robust to noise.

Submitted 28 February, 2020; originally announced February 2020.

Comments: 10 pages

arXiv:1910.08792

Sub-Nyquist Sampling of Sparse and Correlated Signals in Array Processing

Authors: Ali Ahmed, Fahad Shamshad, Humera Hameed

Abstract: This paper considers efficient sampling of simultaneously sparse and correlated (S&C) signals. Such signals arise in various applications in array processing. We propose an implementable sampling architecture for the acquisition of S&C signals at a sub-Nyquist rate. We prove a sampling theorem showing exact and stable reconstruction of the acquired signals even when the sampling rate is smaller than the Nyquist rate by orders of magnitude. Quantitatively, our results state that an ensemble of $M$ signals, composed of $R$ a priori unknown latent signals, each bandlimited to $W/2$ but only $S$-sparse in the Fourier domain, can be reconstructed exactly from compressive sampling at a rate of only $RS\log^{\alpha} W$ samples per second. When $R \ll M$ and $S \ll W$, this amounts to a significant reduction in sampling rate compared to the Nyquist rate of $MW$ samples per second. This is the first result that presents an implementable sampling architecture and a sampling theorem for the compressive acquisition of S&C signals. The signal reconstruction from the sub-Nyquist rate boils down to a sparse and low-rank (S&L) matrix recovery from a few linear measurements. The conventional convex penalties for S&L matrices are provably not optimal in the number of measurements. We resort to a two-step algorithm to recover the S&L matrix from a near-optimal number of measurements. This result then translates into a signal reconstruction algorithm from a sub-Nyquist sampling rate.

Submitted 18 January, 2023; v1 submitted 19 October, 2019; originally announced October 2019.

arXiv:1812.11065

Deep Ptych: Subsampled Fourier Ptychography using Generative Priors

Authors: Fahad Shamshad, Farwa Abbas, Ali Ahmed

Abstract: This paper proposes a novel framework to regularize the highly ill-posed and non-linear Fourier ptychography problem using generative models. We demonstrate experimentally that our proposed algorithm, Deep Ptych, outperforms the existing Fourier ptychography techniques, in terms of quality of reconstruction and robustness against noise, using far fewer samples. We further modify the proposed approach to allow the generative model to explore solutions outside the range, leading to improved performance.

Submitted 22 December, 2018; originally announced December 2018.

Showing 1–8 of 8 results for author: Shamshad, F