Search | arXiv e-print repository

USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

Authors: Shaojin Ding, David Qiu, David Rim, Yanzhang He, Oleg Rybakov, Bo Li, Rohit Prabhavalkar, Weiran Wang, Tara N. Sainath, Zhonglin Han, Jian Li, Amir Yazdanbakhsh, Shivani Agrawal

Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios… ▽ More End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured sparsity aware paradigm on the model weights, reducing the model complexity from parameter precision and matrix topology perspectives. We conducted extensive experiments with a 2-billion parameter USM on a large-scale voice search dataset to evaluate our proposed method. A series of ablation studies validate the effectiveness of up to int4 quantization and 2:4 sparsity. However, a single compression technique fails to recover the performance well under extreme setups including int2 quantization and 1:4 sparsity. By contrast, our proposed method can compress the model to have 9.4% of the size, at the cost of only 7.3% relative word error rate (WER) regressions. We also provided in-depth analyses on the results and discussions on the limitations and potential solutions, which would be valuable for future studies. △ Less

Submitted 16 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Accepted by ICASSP 2024. Preprint

arXiv:2210.13761 [pdf, other]

Streaming Parrotron for on-device speech-to-speech conversion

Authors: Oleg Rybakov, Fadi Biadsy, Xia Zhang, Liyang Jiang, Phoenix Meadowlark, Shivani Agrawal

Abstract: We present a fully on-device streaming Speech2Speech conversion model that normalizes a given input speech directly to synthesized output speech. Deploying such a model on mobile devices pose significant challenges in terms of memory footprint and computation requirements. We present a streaming-based approach to produce an acceptable delay, with minimal loss in speech conversion quality, when com… ▽ More We present a fully on-device streaming Speech2Speech conversion model that normalizes a given input speech directly to synthesized output speech. Deploying such a model on mobile devices pose significant challenges in terms of memory footprint and computation requirements. We present a streaming-based approach to produce an acceptable delay, with minimal loss in speech conversion quality, when compared to a reference state of the art non-streaming approach. Our method consists of first streaming the encoder in real time while the speaker is speaking. Then, as soon as the speaker stops speaking, we run the spectrogram decoder in streaming mode along the side of a streaming vocoder to generate output speech. To achieve an acceptable delay-quality trade-off, we propose a novel hybrid approach for look-ahead in the encoder which combines a look-ahead feature stacker with a look-ahead self-attention. We show that our streaming approach is almost 2x faster than real time on the Pixel4 CPU. △ Less

Submitted 24 May, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

arXiv:2203.15952 [pdf, other]

4-bit Conformer with Native Quantization Aware Training for Speech Recognition

Authors: Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Abstract: Reducing the latency and model size has always been a significant research problem for live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model quantization has become an increasingly popular approach to compress neural networks and reduce computation cost. Most of the existing practical ASR systems apply post-training 8-bit quantization. To achieve a higher compr… ▽ More Reducing the latency and model size has always been a significant research problem for live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model quantization has become an increasingly popular approach to compress neural networks and reduce computation cost. Most of the existing practical ASR systems apply post-training 8-bit quantization. To achieve a higher compression rate without introducing additional performance regression, in this study, we propose to develop 4-bit ASR models with native quantization aware training, which leverages native integer operations to effectively optimize both training and inference. We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique. First, we explored the impact of different precisions for both weight and activation quantization on the LibriSpeech dataset, and obtained a lossless 4-bit Conformer model with 5.8x size reduction compared to the float32 model. Following this, we for the first time investigated and revealed the viability of 4-bit quantization on a practical ASR system that is trained with large-scale datasets, and produced a lossless Conformer ASR model with mixed 4-bit and 8-bit weights that has 5x size reduction compared to the float32 model. △ Less

Submitted 2 March, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Published at INTERSPEECH 2022

arXiv:2201.13271 [pdf, other]

doi 10.1016/j.compbiomed.2022.106093

StRegA: Unsupervised Anomaly Detection in Brain MRIs using a Compact Context-encoding Variational Autoencoder

Authors: Soumick Chatterjee, Alessandro Sciarra, Max Dünnwald, Pavan Tummala, Shubham Kumar Agrawal, Aishwarya Jauhari, Aman Kalra, Steffen Oeltze-Jafra, Oliver Speck, Andreas Nürnberger

Abstract: Expert interpretation of anatomical images of the human brain is the central part of neuro-radiology. Several machine learning-based techniques have been proposed to assist in the analysis process. However, the ML models typically need to be trained to perform a specific task, e.g., brain tumour segmentation or classification. Not only do the corresponding training data require laborious manual an… ▽ More Expert interpretation of anatomical images of the human brain is the central part of neuro-radiology. Several machine learning-based techniques have been proposed to assist in the analysis process. However, the ML models typically need to be trained to perform a specific task, e.g., brain tumour segmentation or classification. Not only do the corresponding training data require laborious manual annotations, but a wide variety of abnormalities can be present in a human brain MRI - even more than one simultaneously, which renders representation of all possible anomalies very challenging. Hence, a possible solution is an unsupervised anomaly detection (UAD) system that can learn a data distribution from an unlabelled dataset of healthy subjects and then be applied to detect out of distribution samples. Such a technique can then be used to detect anomalies - lesions or abnormalities, for example, brain tumours, without explicitly training the model for that specific pathology. Several Variational Autoencoder (VAE) based techniques have been proposed in the past for this task. Even though they perform very well on controlled artificially simulated anomalies, many of them perform poorly while detecting anomalies in clinical data. This research proposes a compact version of the "context-encoding" VAE (ceVAE) model, combined with pre and post-processing steps, creating a UAD pipeline (StRegA), which is more robust on clinical data, and shows its applicability in detecting anomalies such as tumours in brain MRIs. The proposed pipeline achieved a Dice score of 0.642$\pm$0.101 while detecting tumours in T2w images of the BraTS dataset and 0.859$\pm$0.112 while detecting artificially induced anomalies, while the best performing baseline achieved 0.522$\pm$0.135 and 0.783$\pm$0.111, respectively. △ Less

Submitted 4 September, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

Journal ref: Computers in Biology and Medicine, 106093 (2022)

arXiv:2201.10981 [pdf, other]

Joint Liver and Hepatic Lesion Segmentation in MRI using a Hybrid CNN with Transformer Layers

Authors: Georg Hille, Shubham Agrawal, Pavan Tummala, Christian Wybranski, Maciej Pech, Alexey Surov, Sylvia Saalfeld

Abstract: Deep learning-based segmentation of the liver and hepatic lesions therein steadily gains relevance in clinical practice due to the increasing incidence of liver cancer each year. Whereas various network variants with overall promising results in the field of medical image segmentation have been successfully developed over the last years, almost all of them struggle with the challenge of accurately… ▽ More Deep learning-based segmentation of the liver and hepatic lesions therein steadily gains relevance in clinical practice due to the increasing incidence of liver cancer each year. Whereas various network variants with overall promising results in the field of medical image segmentation have been successfully developed over the last years, almost all of them struggle with the challenge of accurately segmenting hepatic lesions in magnetic resonance imaging (MRI). This led to the idea of combining elements of convolutional and transformer-based architectures to overcome the existing limitations. This work presents a hybrid network called SWTR-Unet, consisting of a pretrained ResNet, transformer blocks as well as a common Unet-style decoder path. This network was primarily applied to single-modality non-contrast-enhanced liver MRI and additionally to the publicly available computed tomography (CT) data of the liver tumor segmentation (LiTS) challenge to verify the applicability on other modalities. For a broader evaluation, multiple state-of-the-art networks were implemented and applied, ensuring a direct comparability. Furthermore, correlation analysis and an ablation study were carried out, to investigate various influencing factors on the segmentation accuracy of the presented method. With Dice scores of averaged 98+-2% for liver and 81+-28% lesion segmentation on the MRI dataset and 97+-2% and 79+-25%, respectively on the CT dataset, the proposed SWTR-Unet proved to be a precise approach for liver and hepatic lesion segmentation with state-of-the-art results for MRI and competing accuracy in CT imaging. The achieved segmentation accuracy was found to be on par with manually performed expert segmentations as indicated by inter-observer variabilities for liver lesion segmentation. In conclusion, the presented method could save valuable time and resources in clinical practice. △ Less

Submitted 22 March, 2023; v1 submitted 26 January, 2022; originally announced January 2022.

arXiv:2104.14713 [pdf, other]

Simultaneous Denoising and Localization Network for Photoacoustic Target Localization

Authors: Amirsaeed Yazdani, Sumit Agrawal, Kerrick Johnstonbaugh, Sri-Rajasekhar Kothapalli, Vishal Monga

Abstract: A significant research problem of recent interest is the localization of targets like vessels, surgical needles, and tumors in photoacoustic (PA) images. To achieve accurate localization, a high photoacoustic signal-to-noise ratio (SNR) is required. However, this is not guaranteed for deep targets, as optical scattering causes an exponential decay in optical fluence with respect to tissue depth. T… ▽ More A significant research problem of recent interest is the localization of targets like vessels, surgical needles, and tumors in photoacoustic (PA) images. To achieve accurate localization, a high photoacoustic signal-to-noise ratio (SNR) is required. However, this is not guaranteed for deep targets, as optical scattering causes an exponential decay in optical fluence with respect to tissue depth. To address this, we develop a novel deep learning method designed to explicitly exhibit robustness to noise present in photoacoustic radio-frequency (RF) data. More precisely, we describe and evaluate a deep neural network architecture consisting of a shared encoder and two parallel decoders. One decoder extracts the target coordinates from the input RF data while the other boosts the SNR and estimates clean RF data. The joint optimization of the shared encoder and dual decoders lends significant noise robustness to the features extracted by the encoder, which in turn enables the network to contain detailed information about deep targets that may be obscured by noise. Additional custom layers and newly proposed regularizers in the training loss function (designed based on observed RF data signal and noise behavior) serve to increase the SNR in the cleaned RF output and improve model performance. To account for depth-dependent strong optical scattering, our network was trained with simulated photoacoustic datasets of targets embedded at different depths inside tissue media of different scattering levels. The network trained on this novel dataset accurately locates targets in experimental PA data that is clinically relevant with respect to the localization of vessels, needles, or brachytherapy seeds. We verify the merits of the proposed architecture by outperforming the state of the art on both simulated and experimental datasets. △ Less

Submitted 29 April, 2021; originally announced April 2021.

Comments: Accepted by IEEE Transactions on Medical Imaging

arXiv:2012.02978 [pdf, other]

Design and Implementation of Path Trackers for Ackermann Drive based Vehicles

Authors: Adarsh Patnaik, Manthan Patel, Vibhakar Mohta, Het Shah, Shubh Agrawal, Aditya Rathore, Ritwik Malik, Debashish Chakravarty, Ranjan Bhattacharya

Abstract: This article is an overview of the various literature on path tracking methods and their implementation in simulation and realistic operating environments.The scope of this study includes analysis, implementation,tuning, and comparison of some selected path tracking methods commonly used in practice for trajectory tracking in autonomous vehicles. Many of these methods are applicable at low speed d… ▽ More This article is an overview of the various literature on path tracking methods and their implementation in simulation and realistic operating environments.The scope of this study includes analysis, implementation,tuning, and comparison of some selected path tracking methods commonly used in practice for trajectory tracking in autonomous vehicles. Many of these methods are applicable at low speed due to the linear assumption for the system model, and hence, some methods are also included that consider nonlinearities present in lateral vehicle dynamics during high-speed navigation. The performance evaluation and comparison of tracking methods are carried out on realistic simulations and a dedicated instrumented passenger car, Mahindra e2o, to get a performance idea of all the methods in realistic operating conditions and develop tuning methodologies for each of the methods. It has been observed that our model predictive control-based approach is able to perform better compared to the others in medium velocity ranges. △ Less

Submitted 5 December, 2020; originally announced December 2020.

Comments: 24 pages, 24 figures

arXiv:2001.08539 [pdf, other]

Automatic Differentiation and Continuous Sensitivity Analysis of Rigid Body Dynamics

Authors: David Millard, Eric Heiden, Shubham Agrawal, Gaurav S. Sukhatme

Abstract: A key ingredient to achieving intelligent behavior is physical understanding that equips robots with the ability to reason about the effects of their actions in a dynamic environment. Several methods have been proposed to learn dynamics models from data that inform model-based control algorithms. While such learning-based approaches can model locally observed behaviors, they fail to generalize to… ▽ More A key ingredient to achieving intelligent behavior is physical understanding that equips robots with the ability to reason about the effects of their actions in a dynamic environment. Several methods have been proposed to learn dynamics models from data that inform model-based control algorithms. While such learning-based approaches can model locally observed behaviors, they fail to generalize to more complex dynamics and under long time horizons. In this work, we introduce a differentiable physics simulator for rigid body dynamics. Leveraging various techniques for differential equation integration and gradient calculation, we compare different methods for parameter estimation that allow us to infer the simulation parameters that are relevant to estimation and control of physical systems. In the context of trajectory optimization, we introduce a closed-loop model-predictive control algorithm that infers the simulation parameters through experience while achieving cost-minimizing performance. △ Less

Submitted 21 January, 2020; originally announced January 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1905.10706

arXiv:1906.01299 [pdf, other]

Grid-based Localization Stack for Inspection Drones towards Automation of Large Scale Warehouse Systems

Authors: Ashwary Anand, Shubh Agrawal, Shivang Agrawal, Aman Chandra, Krishnakant Deshmukh

Abstract: SLAM based techniques are often adopted for solving the navigation problem for the drones in GPS denied environment. Despite the widespread success of these approaches, they have not yet been fully exploited for automation in a warehouse system due to expensive sensors and setup requirements. This paper focuses on the use of low-cost monocular camera-equipped drones for performing warehouse manage… ▽ More SLAM based techniques are often adopted for solving the navigation problem for the drones in GPS denied environment. Despite the widespread success of these approaches, they have not yet been fully exploited for automation in a warehouse system due to expensive sensors and setup requirements. This paper focuses on the use of low-cost monocular camera-equipped drones for performing warehouse management tasks like inventory scanning and position update. The methods introduced are at par with the existing state of warehouse environment present today, that is, the existence of a grid network for the ground vehicles, hence eliminating any additional infrastructure requirement for drone deployment. As we lack scale information, that in itself forbids us to use any 3D techniques, we focus more towards optimizing standard image processing algorithms like the thick line detection and further developing it into a fast and robust grid localization framework. In this paper, we show different line detection algorithms, their significance in grid localization and their limitations. We further extend our proposed implementation towards a real-time navigation stack for an actual warehouse inspection case scenario. Our line detection method using skeletonization and centroid strategy works considerably even with varying light conditions, line thicknesses, colors, orientations, and partial occlusions. A simple yet effective Kalman Filter has been used for smoothening the ρろー and θしーた outputs of the two different line detection methods for better drone control while grid following. A generic strategy that handles the navigation of the drone on a grid for completion of the allotted task is also developed. Based on the simulation and real-life experiments, the final developments on the drone localization and navigation in a structured environment are discussed. △ Less

Submitted 4 June, 2019; originally announced June 2019.

Comments: 8 pages

Showing 1–9 of 9 results for author: Agrawal, S