Audio and Speech Processing

Total of 21 entries

[1] arXiv:2408.05554 [pdf, html, other]: Title: Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

Jinpeng Li, Yu Pu, Qi Sun, Wei-Qiang Zhang

Comments: Accepted by INTERSPEECH 2024; minor typo correction

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented end of transcript (EOT) judgment modification and hallucination penalty to improve the performance of speech recognition. Further, we employed the decoding average token log probability as a criterion to select samples from unlabeled speech data and used pseudo-labeled data to fine-tune the model to further improve its performance. Ultimately, we achieved more than 10% absolute WER reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
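
As a rough illustration of the sample-selection step described in the abstract above, the following sketch filters pseudo-labeled utterances by their average token log probability. The function names, the tuple layout, and the threshold value of -0.5 are assumptions made for this example; they are not taken from the paper.

```python
def average_log_prob(token_log_probs):
    """Mean of the per-token log probabilities of one decoded hypothesis."""
    return sum(token_log_probs) / len(token_log_probs)


def select_pseudo_labels(hypotheses, threshold=-0.5):
    """Keep hypotheses whose average token log probability is high enough.

    hypotheses: iterable of (utterance_id, text, token_log_probs) tuples.
    threshold:  hypothetical cut-off; a real system would tune this value.
    """
    selected = []
    for utt_id, text, token_log_probs in hypotheses:
        # Skip empty decodings, which have no defined average.
        if token_log_probs and average_log_prob(token_log_probs) >= threshold:
            selected.append((utt_id, text))
    return selected


if __name__ == "__main__":
    decoded = [
        ("utt1", "confident transcript", [-0.1, -0.2, -0.3]),
        ("utt2", "uncertain transcript", [-1.5, -2.0, -0.9]),
    ]
    print(select_pseudo_labels(decoded))
```

The retained pairs could then serve as pseudo-labeled fine-tuning data; how the cut-off is chosen is left open here.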
[2] arXiv:2408.05641 [pdf, html, other]: Title: Towards a Quantitative Analysis of Coarticulation with a Phoneme-to-Articulatory Model

Chaofei Fan, Jaimie M. Henderson, Chris Manning, Francis R. Willett

Comments: To be published in Interspeech 2024

Subjects: Audio and Speech Processing (eess.AS)

Prior coarticulation studies focus mainly on limited phonemic sequences and specific articulators, providing only approximate descriptions of the temporal extent and magnitude of coarticulation. This paper is an initial attempt to comprehensively investigate coarticulation. We leverage existing Electromagnetic Articulography (EMA) datasets to develop and train a phoneme-to-articulatory (P2A) model that can generate realistic EMA for novel phoneme sequences and replicate known coarticulation patterns. We use model-generated EMA on 9K minimal word pairs to analyze coarticulation magnitude and extent up to eight phonemes from the coarticulation trigger, and compare coarticulation resistance across different consonants. Our findings align with earlier studies and suggest a longer-range coarticulation effect than previously found. This model-based approach can potentially compare coarticulation between adults and children and across languages, offering new insights into speech production.
[3] arXiv:2408.05721 [pdf, other]: Title: Extracting Urban Sound Information for Residential Areas in Smart Cities Using an End-to-End IoT System

Ee-Leng Tan, Furi Andi Karnapi, Linus Junjia Ng, Kenneth Ooi, Woon-Seng Gan

Comments: 13 pages, 15 figures, journal

Journal-ref: IEEE IoT Journal, 2021

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

With rapid urbanization comes the increase of community, construction, and transportation noise in residential areas. The conventional approach of solely relying on sound pressure level (SPL) information to decide on the noise environment and to plan out noise control and mitigation strategies is inadequate. This paper presents an end-to-end IoT system that extracts real-time urban sound metadata using edge devices, providing information on the sound type, location and duration, rate of occurrence, loudness, and azimuth of a dominant noise in nine residential areas. The collected metadata on environmental sound is transmitted to and aggregated in a cloud-based platform to produce detailed descriptive analytics and visualization. Our approach to integrating different building blocks, namely, hardware, software, cloud technologies, and signal processing algorithms to form our real-time IoT system is outlined. We demonstrate how some of the sound metadata extracted by our system are used to provide insights into the noise in residential areas. A scalable workflow to collect and prepare audio recordings from nine residential areas to construct our urban sound dataset for training and evaluating a location-agnostic model is discussed. Some practical challenges of managing and maintaining a sensor network deployed at numerous locations are also addressed.
[4] arXiv:2408.05758 [pdf, html, other]: Title: VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25 Hz from 24 kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at this https URL
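
The 960-fold figure follows directly from the two rates quoted in the abstract. A minimal check, with variable names chosen only for this example:

```python
input_rate_hz = 24_000  # waveform sampling rate stated in the abstract
token_rate_hz = 25      # rate of the compressed representation

# Ratio between the two rates, i.e. input samples covered by one frame.
reduction_factor = input_rate_hz / token_rate_hz

# Duration of audio covered by one frame of the compressed representation.
token_duration_ms = 1000 / token_rate_hz

print(reduction_factor)   # 960.0
print(token_duration_ms)  # 40.0
```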

[5] arXiv:2306.17670 (cross-list from cs.NE) [pdf, html, other]: Title: Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings

Ilyass Hammouamri, Ismail Khalfaoui-Hassani, Timothée Masquelier

Journal-ref: ICLR 2024

Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Spiking Neural Networks (SNNs) are a promising research direction for building power-efficient information processing systems, especially for temporal tasks such as speech recognition. In SNNs, delays refer to the time needed for one spike to travel from one neuron to another. These delays matter because they influence the spike arrival times, and it is well-known that spiking neurons respond more strongly to coincident input spikes. More formally, it has been shown theoretically that plastic delays greatly increase the expressivity in SNNs. Yet, efficient algorithms to learn these delays have been lacking. Here, we propose a new discrete-time algorithm that addresses this issue in deep feedforward SNNs using backpropagation, in an offline manner. To simulate delays between consecutive layers, we use 1D convolutions across time. The kernels contain only a few non-zero weights - one per synapse - whose positions correspond to the delays. These positions are learned together with the weights using the recently proposed Dilated Convolution with Learnable Spacings (DCLS). We evaluated our method on three datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC) and its non-spiking version Google Speech Commands v0.02 (GSC) benchmarks, which require detecting temporal patterns. We used feedforward SNNs with two or three hidden fully connected layers, and vanilla leaky integrate-and-fire neurons. We showed that fixed random delays help and that learning them helps even more. Furthermore, our method outperformed the state-of-the-art in the three datasets without using recurrent connections and with substantially fewer parameters. Our work demonstrates the potential of delay learning in developing accurate and precise models for temporal data processing. Our code is based on PyTorch / SpikingJelly and available at: this https URL
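
To make the idea of "delays as 1D convolutions across time" concrete, here is a minimal sketch in which a kernel with a single non-zero weight shifts a spike train by a fixed number of time steps. It assumes an integer delay set by hand and plain Python lists; it does not implement DCLS or any learning of the delay positions, and the function names are hypothetical.

```python
def delay_kernel(delay, weight, max_delay):
    """Kernel with one non-zero weight; its position encodes the delay."""
    kernel = [0.0] * (max_delay + 1)
    kernel[delay] = weight
    return kernel


def causal_conv(spikes, kernel):
    """Causal 1D convolution: out[t] = sum_d kernel[d] * spikes[t - d]."""
    out = []
    for t in range(len(spikes)):
        acc = 0.0
        for d, w in enumerate(kernel):
            if t - d >= 0:
                acc += w * spikes[t - d]
        out.append(acc)
    return out


if __name__ == "__main__":
    spikes = [1, 0, 0, 1, 0, 0]
    kernel = delay_kernel(delay=2, weight=0.5, max_delay=3)
    # Each input spike reappears two steps later, scaled by the weight.
    print(causal_conv(spikes, kernel))  # [0.0, 0.0, 0.5, 0.0, 0.0, 0.5]
```

In this toy version the synaptic weight and the delay are the value and the index of the single non-zero kernel entry, which mirrors the "one non-zero weight per synapse" description in the abstract.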
[6] arXiv:2408.05493 (cross-list from cs.SD) [pdf, html, other]: Title: Stream-based Active Learning for Anomalous Sound Detection in Machine Condition Monitoring

Tuan Vu Ho, Kota Dohi, Yohei Kawaguchi

Comments: Accepted as a conference paper in INTERSPEECH 2024

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

This paper introduces an active learning (AL) framework for anomalous sound detection (ASD) in machine condition monitoring systems. Typically, ASD models are trained solely on normal samples due to the scarcity of anomalous data, leading to decreased accuracy for unseen samples during inference. AL is a promising solution to this problem, enabling the model to learn new concepts more effectively with fewer labeled examples and thus reducing manual annotation efforts. However, its effectiveness in ASD remains unexplored. To minimize update costs and time, our proposed method focuses on updating the scoring backend of the ASD system without retraining the neural network model. Experimental results on the DCASE 2023 Challenge Task 2 dataset confirm that our AL framework significantly improves ASD performance even with low labeling budgets. Moreover, our proposed sampling strategy outperforms other baselines in terms of the partial area under the receiver operating characteristic score.
[7] arXiv:2408.05769 (cross-list from cs.CL) [pdf, html, other]: Title: LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition

Eunseop Yoon, Hee Suk Yoon, John Harvill, Mark Hasegawa-Johnson, Chang D. Yoo

Comments: INTERSPEECH 2024

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Test-Time Adaptation (TTA) has emerged as a crucial solution to the domain shift challenge, wherein the target environment diverges from the original training environment. A prime exemplification is TTA for Automatic Speech Recognition (ASR), which enhances model performance by leveraging output prediction entropy minimization as a self-supervision signal. However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. LI-TTA integrates corrections from an external language model to merge linguistic with acoustic information by minimizing the CTC loss from the correction alongside the standard TTA loss. With extensive experiments, we show that LI-TTA effectively improves the performance of TTA for ASR in various distribution shift situations.
[8] arXiv:2408.05928 (cross-list from cs.SD) [pdf, html, other]: Title: Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh, Ian Mcloughlin

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.
[9] arXiv:2408.05950 (cross-list from cs.NE) [pdf, html, other]: Title: Robust online reconstruction of continuous-time signals from a lean spike train ensemble code

Anik Chattopadhyay, Arunava Banerjee

Comments: 22 pages, including a 9-page appendix, 8 figures. A GitHub link to the project implementation is embedded in the paper

Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Sensory stimuli in animals are encoded into spike trains by neurons, offering advantages such as sparsity, energy efficiency, and high temporal resolution. This paper presents a signal processing framework that deterministically encodes continuous-time signals into biologically feasible spike trains, and addresses the questions about representable signal classes and reconstruction bounds. The framework considers encoding of a signal through spike trains generated by an ensemble of neurons using a convolve-then-threshold mechanism with various convolution kernels. A closed-form solution to the inverse problem, from spike trains to signal reconstruction, is derived in the Hilbert space of shifted kernel functions, ensuring sparse representation of a generalized Finite Rate of Innovation (FRI) class of signals. Additionally, inspired by real-time processing in biological systems, an efficient iterative version of the optimal reconstruction is formulated that considers only a finite window of past spikes, ensuring robustness of the technique to ill-conditioned encoding; convergence guarantees of the windowed reconstruction to the optimal solution are then provided. Experiments on a large audio dataset demonstrate excellent reconstruction accuracy at spike rates as low as one-fifth of the Nyquist rate, while showing clear competitive advantage in comparison to state-of-the-art sparse coding techniques in the low spike rate regime.
[10] arXiv:2408.06022 (cross-list from cs.SD) [pdf, html, other]: Title: Controlling Surprisal in Music Generation via Information Content Curve Matching

Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer

Comments: 8 pages, 4 figures, 2 tables, accepted at the 25th Int. Society for Music Information Retrieval Conf., San Francisco, USA, 2024

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

In recent years, both the quality of and public interest in music generation systems have grown, encouraging research into various ways to control these systems. We propose a novel method for controlling surprisal in music generation using sequence models. To achieve this goal, we define a metric called Instantaneous Information Content (IIC). The IIC serves as a proxy function for the perceived musical surprisal (as estimated from a probabilistic model) and can be calculated at any point within a music piece. This enables the comparison of surprisal across different musical content even if the musical events occur in irregular time intervals. We use beam search to generate musical material whose IIC curve closely approximates a given target IIC. We experimentally show that the IIC correlates with harmonic and rhythmic complexity and note density. The correlation decreases with the length of the musical context used for estimating the IIC. Finally, we conduct a qualitative user study to test if human listeners can identify the IIC curves that have been used as targets when generating the respective musical material. We provide code for creating IIC interpolations and IIC visualizations on this https URL.
[11] arXiv:2408.06043 (cross-list from cs.CL) [pdf, html, other]: Title: Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning

Wonjun Lee, San Kim, Gary Geunbae Lee

Comments: 11 pages, 2 figures, accepted to SIGDIAL 2024

Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Recent dialogue systems rely on turn-based spoken interactions, requiring accurate Automatic Speech Recognition (ASR). Errors in ASR can significantly impact downstream dialogue tasks. To address this, using dialogue context from user and agent interactions for transcribing subsequent utterances has been proposed. This method incorporates the transcription of the user's speech and the agent's response as model input, using the accumulated context generated by each turn. However, this context is susceptible to ASR errors because it is generated by the ASR model in an auto-regressive fashion. Such noisy context can further degrade the benefits of context input, resulting in suboptimal ASR performance. In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. To maximize the advantage of context awareness, our approach includes decoder pre-training using text-based dialogue data and noise representation learning for a context encoder. Based on the evaluation of speech dialogues, our method shows superior results compared to baselines. Furthermore, the strength of our approach is highlighted in noisy environments where user speech is barely audible due to real-world noise, relying on contextual information to transcribe the input accurately.
[12] arXiv:2408.06053 (cross-list from cs.SD) [pdf, html, other]: Title: PyNeuralFx: A Python Package for Neural Audio Effect Modeling

Yen-Tung Yeh, Wen-Yi Hsiao, Yi-Hsuan Yang

Comments: toolkit paper

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

We present PyNeuralFx, an open-source Python toolkit designed for research on neural audio effect modeling. The toolkit provides an intuitive framework and offers a comprehensive suite of features, including standardized implementation of well-established model architectures, loss functions, and easy-to-use visualization tools. As such, it helps promote reproducibility for research on neural audio effect modeling, and enables in-depth performance comparison of different models, offering insight into the behavior and operational characteristics of models through DSP methodology. The toolkit can be found at this https URL.
[13] arXiv:2408.06065 (cross-list from cs.CL) [pdf, html, other]: Title: An Investigation Into Explainable Audio Hate Speech Detection

Jinmyeong An, Wonjun Lee, Yejin Jeon, Jungseul Ok, Yunsu Kim, Gary Geunbae Lee

Comments: Accepted to SIGDIAL 2024

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Research on hate speech has predominantly revolved around detection and interpretation from textual inputs, leaving verbal content largely unexplored. While there has been limited exploration into hate speech detection within verbal acoustic speech inputs, the aspect of interpretability has been overlooked. Therefore, we introduce a new task of explainable audio hate speech detection. Specifically, we aim to identify the precise time intervals, referred to as audio frame-level rationales, which serve as evidence for hate speech classification. Towards this end, we propose two different approaches: cascading and End-to-End (E2E). The cascading approach initially converts audio to transcripts, identifies hate speech within these transcripts, and subsequently locates the corresponding audio time frames. Conversely, the E2E approach processes audio utterances directly, which allows it to pinpoint hate speech within specific time frames. Additionally, due to the lack of explainable audio hate speech datasets that include audio frame-level rationales, we curated a synthetic audio dataset to train our models. We further validated these models on actual human speech utterances and found that the E2E approach outperforms the cascading method in terms of the audio frame Intersection over Union (IoU) metric. Furthermore, we observed that including frame-level rationales significantly enhances hate speech detection accuracy for the E2E approach.
Disclaimer: The reader may encounter content of an offensive or hateful nature. However, given the nature of the work, this cannot be avoided.
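
For readers unfamiliar with the audio frame Intersection over Union (IoU) metric mentioned in the abstract above, the sketch below shows one common way to compute IoU between predicted and reference rationale frames. Representing rationales as collections of frame indices, and returning 1.0 when both are empty, are assumptions for this example; the paper's exact evaluation protocol is not given in the listing.

```python
def frame_iou(predicted_frames, reference_frames):
    """IoU between two collections of audio frame indices."""
    predicted = set(predicted_frames)
    reference = set(reference_frames)
    union = predicted | reference
    if not union:
        # Assumption: two empty rationales count as a perfect match.
        return 1.0
    return len(predicted & reference) / len(union)


if __name__ == "__main__":
    # Frames 2 and 3 overlap; the union covers frames 1 through 4.
    print(frame_iou([1, 2, 3], [2, 3, 4]))  # 0.5
```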
[14] arXiv:2408.06227 (cross-list from cs.CL) [pdf, other]: Title: FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks

Min Ma, Yuma Koizumi, Shigeki Karita, Heiga Zen, Jason Riesa, Haruko Ishikawa, Michiel Bacchiani

Journal-ref: INTERSPEECH 2024

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)

This paper introduces FLEURS-R, a speech restoration applied version of the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) corpus. FLEURS-R maintains an N-way parallel speech corpus in 102 languages as FLEURS, with improved audio quality and fidelity by applying the speech restoration model Miipher. The aim of FLEURS-R is to advance speech technology in more languages and catalyze research including text-to-speech (TTS) and other speech generation tasks in low-resource languages. Comprehensive evaluations with the restored speech and TTS baseline models trained from the new corpus show that the new corpus obtained significantly improved speech quality while maintaining the semantic contents of the speech. The corpus is publicly released via Hugging Face.
[15] arXiv:2408.06264 (cross-list from cs.SD) [pdf, html, other]: Title: Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance

Manuel Milling, Shuo Liu, Andreas Triantafyllopoulos, Ilhan Aslan, Björn W. Schuller

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications are associated with speech and non-speech tasks concerning semantic and non-semantic features, transient and global information, and the experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios (SNRs), for a wide range of computer audition tasks in everyday-life noisy environments.

[16] arXiv:2301.12258 (replaced) [pdf, html, other]: Title: Cross-domain Neural Pitch and Periodicity Estimation

Max Morrison, Caedon Hsieh, Nathan Pruyne, Bryan Pardo

Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)

Pitch is a foundational aspect of our perception of audio signals. Pitch contours are commonly used to analyze speech and music signals and as input features for many audio tasks, including music transcription, singing voice synthesis, and prosody editing. In this paper, we describe a set of techniques for improving the accuracy of widely-used neural pitch and periodicity estimators to achieve state-of-the-art performance on both speech and music. We also introduce a novel entropy-based method for extracting periodicity and per-frame voiced-unvoiced classifications from statistical inference-based pitch estimators (e.g., neural networks), and show how to train a neural pitch estimator to simultaneously handle both speech and music data (i.e., cross-domain estimation) without performance degradation. Our estimator implementations run 11.2x faster than real-time on an Intel i9-9820X 10-core 3.30 GHz CPU (approaching the speed of state-of-the-art DSP-based pitch estimators) or 408x faster than real-time on an NVIDIA GeForce RTX 3090 GPU. We release all of our code and models as Pitch-Estimating Neural Networks (penn), an open-source, pip-installable Python module for training, evaluating, and performing inference with pitch- and periodicity-estimating neural networks. The code for penn is available at this https URL.
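
One plausible reading of an entropy-based periodicity measure is one minus the normalized entropy of the per-frame distribution over pitch bins: a peaked distribution suggests a clearly pitched frame, a flat one suggests an unvoiced frame. The sketch below assumes that formulation and a voicing threshold of 0.5. Both are assumptions for illustration and are not confirmed by the abstract; the released penn code is the reference for the actual definition.

```python
import math


def periodicity(pitch_probs):
    """1 - H(p) / log(N) for a categorical distribution over N pitch bins."""
    n_bins = len(pitch_probs)
    if n_bins < 2:
        raise ValueError("need at least two pitch bins")
    # Terms with zero probability contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in pitch_probs if p > 0)
    return 1.0 - entropy / math.log(n_bins)


def is_voiced(pitch_probs, threshold=0.5):
    """Hypothetical voiced/unvoiced decision from the periodicity value."""
    return periodicity(pitch_probs) >= threshold


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]
    flat = [0.25, 0.25, 0.25, 0.25]
    print(is_voiced(peaked), is_voiced(flat))
```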
[17] arXiv:2309.08166 (replaced) [pdf, html, other]: Title: Residual Speaker Representation for One-Shot Voice Conversion

Le Xu, Jiangyan Yi, Tao Wang, Yong Ren, Rongxiu Zhong, Zhengqi Wen, Jianhua Tao

Comments: Accepted by INTERSPEECH 2024

Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)

Recently, there have been significant advancements in voice conversion, resulting in high-quality performance. However, there are still two critical challenges in this field. Firstly, current voice conversion methods have limited robustness when encountering unseen speakers. Secondly, they also have limited ability to control timbre representation. To address these challenges, this paper presents a novel approach that leverages tokens of multi-layer residual approximations to enhance robustness when dealing with unseen speakers, called the residual speaker module. Introducing multi-layer approximations facilitates the separation of information from the timbre, enabling effective control over timbre in voice conversion. The proposed method outperforms baselines in subjective and objective evaluations, demonstrating superior performance and increased robustness. Our demo page is publicly available.
[18] arXiv:2403.14402 (replaced) [pdf, html, other]: Title: XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

Comments: ACL 2024

Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
[19] arXiv:2404.03179 (replaced) [pdf, html, other]: Title: UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng, Ling Shao

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods over-specialize on each task, overlooking the fact that these instances often occur in the same video to form the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations in datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture unique knowledge for each task. Besides, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances and previously unseen ones by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performances compared to state-of-the-art task-specific methods across ActivityNet 1.3, DESED and UnAV-100 benchmarks.
[20] arXiv:2404.09466 (replaced) [pdf, html, other]: Title: Scoring Time Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Yujia Yan, Zhiyao Duan

Comments: Camera-ready version

Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed time intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. This paper introduces a simple method for scoring intervals using scaled inner product operations that resemble how attention scoring is done in transformers. We show theoretically that, due to the special structure from encoding the non-overlapping intervals, under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that can yield the correct transcription result. We then demonstrate that an encoder-only structured non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. The experiment shows that our approach achieves the new state-of-the-art performance across all subtasks in terms of the F1 measure on the Maestro dataset.
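
The scaled inner product scoring mentioned in the abstract can be pictured as follows: each time frame gets a "start" vector and an "end" vector, and the score of the closed interval [i, j] is their dot product divided by the square root of the vector dimension, as in transformer attention. This is a sketch under that assumption; how the paper derives the vectors and any additional terms is not stated in the listing, and the function name is hypothetical.

```python
import math


def score_intervals(start_vecs, end_vecs):
    """Score every interval [i, j] with i <= j by a scaled inner product.

    start_vecs[i] and end_vecs[j] are equal-length lists of floats, one
    pair per time frame. Returns a dict mapping (i, j) to the score.
    """
    scale = math.sqrt(len(start_vecs[0]))
    n_frames = len(start_vecs)
    scores = {}
    for i in range(n_frames):
        for j in range(i, n_frames):  # an interval cannot end before it starts
            dot = sum(a * b for a, b in zip(start_vecs[i], end_vecs[j]))
            scores[(i, j)] = dot / scale
    return scores


if __name__ == "__main__":
    starts = [[1.0, 0.0], [0.0, 2.0]]
    ends = [[1.0, 1.0], [2.0, 0.0]]
    for interval, score in score_intervals(starts, ends).items():
        print(interval, round(score, 3))
```

A semi-CRF decoder would then search over non-overlapping intervals using these scores; that step is omitted here.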
[21] arXiv:2408.02113 (replaced) [pdf, other]: Title: Diseño de sonido para producciones audiovisuales e historias sonoras en el aula. Hacia una docencia creativa mediante el uso de herramientas inteligentes

Miguel Civit, Francisco Cuadrado

Comments: 11 pages, in Spanish language. 1 figure. Preprint from La nueva era del podcast (2023)

Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)

This study shares an experience of teaching sound design for audiovisual productions and compares different projects tackled by students. It is not intended to be a comparative analysis of different types of teaching but rather an analysis of different problems observed in different profiles of students of the subject who study it in different grades. The world of audio can be very interesting for a large part of the students, both those with creative and those with technical inclinations. Musical creation and production, synchronization with images, dubbing, and similar disciplines are generally interesting but can have a very high barrier to entry due to their great technical complexity. It can take weeks or even months for the uninitiated to begin to use audio editing programs with the necessary ease, and these programs are not always particularly intuitive for students. Learning through the use of PBL methodologies generates, in our experience, results much superior to those that can be observed through the use of other teaching methods such as master classes. Students acquire technical skills while developing creative projects in which they get personally involved. Despite everything mentioned above, most interactions between teachers and students focus on aspects of technical correction, from different parameters in reverbs (such as pre-delay, decay, and modulation) to how to correctly adjust compressors, noise gates, and so on. The number of tools for working with audio is incredibly extensive, as is the range of their features, which can differ significantly between manufacturers.

New submissions for Tuesday, 13 August 2024 (showing 4 of 4 entries)

Cross submissions for Tuesday, 13 August 2024 (showing 11 of 11 entries)

Replacement submissions for Tuesday, 13 August 2024 (showing 6 of 6 entries)