Search | arXiv e-print repository

Guided Masked Self-Distillation Modeling for Distributed Multimedia Sensor Event Analysis

Authors: Masahiro Yasuda, Noboru Harada, Yasunori Ohishi, Shoichiro Saito, Akira Nakayama, Nobutaka Ono

Abstract: Observations with distributed sensors are essential in analyzing a series of human and machine activities (referred to as 'events' in this paper) in complex and extensive real-world environments. This is because the information obtained from a single sensor is often missing or fragmented in such an environment; observations from multiple locations and modalities should be integrated to analyze events comprehensively. However, a learning method has yet to be established to extract joint representations that effectively combine such distributed observations. Therefore, we propose Guided Masked sELf-Distillation modeling (Guided-MELD) for inter-sensor relationship modeling. The basic idea of Guided-MELD is to learn to supplement the information from the masked sensor with information from other sensors needed to detect the event. Guided-MELD is expected to enable the system to effectively distill the fragmented or redundant target event information obtained by the sensors without being overly dependent on any specific sensors. To validate the effectiveness of the proposed method in novel tasks of distributed multimedia sensor event analysis, we recorded two new datasets that fit the problem setting: MM-Store and MM-Office. These datasets consist of human activities in a convenience store and an office, recorded using distributed cameras and microphones. Experimental results on these datasets show that the proposed Guided-MELD improves event tagging and detection performance and outperforms conventional inter-sensor relationship modeling methods. Furthermore, the proposed method performed robustly even when sensors were reduced.

Submitted 12 April, 2024; originally announced April 2024.

Comments: 13 pages, 7 figures, under review

arXiv:2307.12232 [pdf, other]

Signal Reconstruction from Mel-spectrogram Based on Bi-level Consistency of Full-band Magnitude and Phase

Authors: Yoshiki Masuyama, Natsuki Ueno, Nobutaka Ono

Abstract: We propose an optimization-based method for reconstructing a time-domain signal from a low-dimensional spectral representation such as a mel-spectrogram. Phase reconstruction has been studied to reconstruct a time-domain signal from the full-band short-time Fourier transform (STFT) magnitude. The Griffin-Lim algorithm (GLA) has been widely used because it relies only on the redundancy of STFT and is applicable to various audio signals. In this paper, we jointly reconstruct the full-band magnitude and phase by considering the bi-level relationships among the time-domain signal, its STFT coefficients, and its mel-spectrogram. The proposed method is formulated as a rigorous optimization problem and estimates the full-band magnitude based on the criterion used in GLA. Our experiments demonstrate the effectiveness of the proposed method on speech, music, and environmental signals.

Submitted 23 July, 2023; originally announced July 2023.

Comments: Accepted to IEEE WASPAA 2023

arXiv:2307.12231 [pdf, other]

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

Authors: Yoshiki Masuyama, Xuankai Chang, Wangyou Zhang, Samuele Cornell, Zhong-Qiu Wang, Nobutaka Ono, Yanmin Qian, Shinji Watanabe

Abstract: Neural speech separation has made remarkable progress and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. In detail, we explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model. We employ the recent self-supervised learning representation (SSLR) as a feature and improve the recognition performance from the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate on the reverberant WHAMR! test set, significantly outperforming an existing mask-based MVDR beamforming and filterbank integration (28.9%).

Submitted 23 July, 2023; originally announced July 2023.

Comments: Accepted to IEEE WASPAA 2023

arXiv:2302.10536 [pdf, other]

Nonparallel Emotional Voice Conversion For Unseen Speaker-Emotion Pairs Using Dual Domain Adversarial Network & Virtual Domain Pairing

Authors: Nirmesh Shah, Mayank Kumar Singh, Naoya Takahashi, Naoyuki Onoe

Abstract: The primary goal of an emotional voice conversion (EVC) system is to convert the emotion of a given speech signal from one style to another style without modifying the linguistic content of the signal. Most of the state-of-the-art approaches convert emotions for seen speaker-emotion combinations only. In this paper, we tackle the problem of converting the emotion of speakers for whom only neutral data are present during the time of training and testing (i.e., unseen speaker-emotion combinations). To this end, we extend a recently proposed StarGANv2-VC architecture by utilizing dual encoders for learning the speaker and emotion style embeddings separately along with dual domain source classifiers. For achieving the conversion to unseen speaker-emotion combinations, we propose a Virtual Domain Pairing (VDP) training strategy, which virtually incorporates the speaker-emotion pairs that are not present in the real data without compromising the min-max game of a discriminator and generator in adversarial training. We evaluate the proposed method using a Hindi emotional database.

Submitted 21 February, 2023; originally announced February 2023.

Comments: Demo Samples at https://demosamplesites.github.io/EVCUP/

arXiv:2302.07928 [pdf, other]

Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

Authors: Samuele Cornell, Zhong-Qiu Wang, Yoshiki Masuyama, Shinji Watanabe, Manuel Pariente, Nobutaka Ono

Abstract: This paper describes our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of target speech enhancement for hearing-aid (HA) devices in noisy-reverberant environments with multiple interferers such as music and competing speakers. Our approach builds upon the powerful iterative neural/beamforming enhancement (iNeuBe) framework introduced in our recent work, and this paper extends it for target speaker extraction. We therefore name the proposed approach iNeuBe-X, where the X stands for extraction. To address the challenges encountered in the CEC2 setting, we introduce four major novelties: (1) we extend the state-of-the-art TF-GridNet model, originally designed for monaural speaker separation, for multi-channel, causal speech enhancement, and large improvements are observed by replacing the TCNDenseNet used in iNeuBe with this new architecture; (2) we leverage a recent dual window size approach with future-frame prediction to ensure that iNeuBe-X satisfies the 5 ms constraint on algorithmic latency required by CEC2; (3) we introduce a novel speaker-conditioning branch for TF-GridNet to achieve target speaker extraction; (4) we propose a fine-tuning step, where we compute an additional loss with respect to the target speaker signal compensated with the listener audiogram. Without using external data, on the official development set our best model reaches a hearing-aid speech perception index (HASPI) score of 0.942 and a scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 18.8 dB. These results are promising given the fact that the CEC2 data is extremely challenging (e.g., on the development set the mixture SI-SDR is -12.3 dB). A demo of our submitted system is available at WAVLab CEC2 demo.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2210.10742 [pdf, other]

End-to-End Integration of Speech Recognition, Dereverberation, Beamforming, and Self-Supervised Learning Representation

Authors: Yoshiki Masuyama, Xuankai Chang, Samuele Cornell, Shinji Watanabe, Nobutaka Ono

Abstract: Self-supervised learning representation (SSLR) has demonstrated its significant effectiveness in automatic speech recognition (ASR), mainly with clean speech. Recent work pointed out the strength of integrating SSLR with single-channel speech enhancement for ASR in noisy environments. This paper further advances this integration by dealing with multi-channel input. We propose a novel end-to-end architecture by integrating dereverberation, beamforming, SSLR, and ASR within a single neural network. Our system achieves the best performance reported in the literature on the CHiME-4 6-channel track with a word error rate (WER) of 1.77%. While the WavLM-based strong SSLR demonstrates promising results by itself, the end-to-end integration with the weighted power minimization distortionless response beamformer, which simultaneously performs dereverberation and denoising, improves WER significantly. Its effectiveness is also validated on the REVERB dataset.

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE SLT 2022

arXiv:2209.00937 [pdf, other]

Inverse-free Online Independent Vector Analysis with Flexible Iterative Source Steering

Authors: Taishi Nakashima, Nobutaka Ono

Abstract: In this paper, we propose a new online independent vector analysis (IVA) algorithm for real-time blind source separation (BSS). In many BSS algorithms, the iterative projection (IP) has been used for updating the demixing matrix, a parameter to be estimated in BSS. However, it requires matrix inversion, which can be costly, particularly in online processing. To improve this situation, we introduce iterative source steering (ISS) to online IVA. ISS does not require any matrix inversions, and thus its computational complexity is less than that of IP. Furthermore, when only part of the sources are moving, ISS enables us to update the demixing matrix flexibly and effectively so that the steering vectors of only the moving sources are updated. Numerical experiments under a dynamic condition confirm the efficacy of the proposed method.

Submitted 2 September, 2022; originally announced September 2022.

Comments: 5 pages, 2 figures. Submitted to APSIPA 2022

arXiv:2207.04357 [pdf, ps, other]

Joint Analysis of Acoustic Scenes and Sound Events with Weakly labeled Data

Authors: Shunsuke Tsubaki, Keisuke Imoto, Nobutaka Ono

Abstract: Considering that acoustic scenes and sound events are closely related to each other, in some previous papers, a joint analysis of acoustic scenes and sound events utilizing multitask learning (MTL)-based neural networks was proposed. In conventional methods, a strongly supervised scheme is applied to sound event detection in MTL models, which requires strong labels of sound events in model training; however, annotating strong event labels is quite time-consuming. In this paper, we thus propose a method for the joint analysis of acoustic scenes and sound events based on the MTL framework with weak labels of sound events. In particular, in the proposed method, we introduce the multiple-instance learning scheme for weakly supervised training of sound event detection and evaluate four pooling functions, namely, max pooling, average pooling, exponential softmax pooling, and attention pooling. Experimental results obtained using parts of the TUT Acoustic Scenes 2016/2017 and TUT Sound Events 2016/2017 datasets show that the proposed MTL-based method with weak labels outperforms the conventional single-task-based scene classification and event detection models with weak labels in terms of both the scene classification and event detection performances.
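Editor's illustration (not from the paper): the four pooling functions named in the abstract above each turn frame-level event probabilities into one clip-level probability that a weak label can supervise. The sketch below uses commonly cited definitions of these functions; the function names, the toy `frame_probs` values, and the `scores` argument (stand-in attention logits that a network would normally produce) are assumptions made for illustration, and the paper's exact formulations may differ.

```python
import math

def max_pool(probs):
    # Clip-level probability is the most confident frame.
    return max(probs)

def average_pool(probs):
    # Every frame contributes equally.
    return sum(probs) / len(probs)

def exp_softmax_pool(probs):
    # Frames are weighted by exp(p), so confident frames count more.
    weights = [math.exp(p) for p in probs]
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

def attention_pool(probs, scores):
    # `scores` are hypothetical per-frame attention logits; they are
    # normalized with a numerically stable softmax before weighting.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

# Toy frame-level probabilities for one event class (made-up values).
frame_probs = [0.1, 0.2, 0.9, 0.8]
clip_max = max_pool(frame_probs)
clip_avg = average_pool(frame_probs)
clip_exp = exp_softmax_pool(frame_probs)
clip_att = attention_pool(frame_probs, [0.0, 0.0, 0.0, 0.0])
```

With uniform attention scores, attention pooling reduces to average pooling, and exponential softmax pooling falls between the average and the maximum.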

Submitted 9 July, 2022; originally announced July 2022.

Comments: Accepted to IWAENC 2022

arXiv:2206.13014 [pdf, other]

Joint Optimization of Sampling Rate Offsets Based on Entire Signal Relationship Among Distributed Microphones

Authors: Yoshiki Masuyama, Kouei Yamaoka, Nobutaka Ono

Abstract: In this paper, we propose to simultaneously estimate all the sampling rate offsets (SROs) of multiple devices. In a distributed microphone array, the SRO is inevitable, which deteriorates the performance of array signal processing. Most of the existing SRO estimation methods focused on synchronizing two microphones. When synchronizing more than two microphones, we select one reference microphone and estimate the SRO of each non-reference microphone independently. Hence, the relationship among signals observed by non-reference microphones is not considered. To address this problem, the proposed method jointly optimizes all SROs based on a probabilistic model of a multichannel signal. The SROs and model parameters are alternately updated to increase the log-likelihood based on an auxiliary function. The effectiveness of the proposed method is validated on mixtures of various numbers of speakers.

Submitted 26 June, 2022; originally announced June 2022.

Comments: 5 pages, 2 figures, accepted by Interspeech 2022

arXiv:2206.02187 [pdf, other]

M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

Authors: Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe

Abstract: Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using text information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to extract latent features from the audio and visual modality. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, the existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.

Submitted 5 June, 2022; originally announced June 2022.

Comments: Accepted for publication in the 5th Multimodal Learning and Applications (MULA) Workshop at CVPR 2022

arXiv:2204.03173 [pdf, other]

Automated Sleep Staging via Parallel Frequency-Cut Attention

Authors: Zheng Chen, Ziwei Yang, Lingwei Zhu, Wei Chen, Toshiyo Tamura, Naoaki Ono, MD Altaf-Ul-Amin, Shigehiko Kanaya, Ming Huang

Abstract: This paper proposes a novel framework for automatically capturing the time-frequency nature of electroencephalogram (EEG) signals of human sleep based on the authoritative sleep medicine guidance. The framework consists of two parts: the first part extracts informative features by partitioning the input EEG spectrograms into a sequence of time-frequency patches. The second part is constituted by an attention-based architecture to efficiently search for the correlation between partitioned time-frequency patches and defining factors of sleep stages in parallel. The proposed pipeline is validated on the Sleep Heart Health Study dataset with new state-of-the-art results for the stages wake, N2, and N3, obtaining respective F1 scores of 0.93, 0.88, and 0.87, with only EEG signals used. The proposed method also has a high inter-rater reliability of 0.80 kappa. We also visualize the correspondence between sleep staging decisions and features extracted by the proposed method, providing strong interpretability for our model.

Submitted 12 January, 2023; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: 10 pages, 9 figures

arXiv:2203.09723 [pdf, other]

Estimation of Consistent Time Delays in Subsample via Auxiliary-Function-Based Iterative Updates

Authors: Kouei Yamaoka, Yukoh Wakabayashi, Nobutaka Ono

Abstract: In this paper, we propose a new algorithm for the estimation of multiple time delays (TDs). Since a TD is a fundamental spatial cue for sensor array signal processing techniques, many methods for estimating it have been studied. Most of them, including generalized cross correlation (CC)-based methods, focus on how to estimate a TD between two sensors. These methods can then be easily adapted for multiple TDs by applying them to every pair of a reference sensor and another one. However, these pairwise methods can use only the partial information obtained by the selected sensors, resulting in inconsistent TD estimates and limited estimation accuracy. In contrast, we propose joint optimization of entire TD parameters, where spatial information obtained from all sensors is taken into account. We also introduce a consistent constraint regarding TD parameters to the observation model. We then consider a multidimensional CC (MCC) as the objective function, which is derived on the basis of maximum likelihood estimation. To maximize the MCC, which is a nonconvex function, we derive the auxiliary function for the MCC and design efficient update rules. We additionally estimate the amplitudes of the transfer functions for supporting the TD estimation, where we maximize the Rayleigh quotient under the non-negative constraint. We experimentally analyze essential features of the proposed method and evaluate its effectiveness in TD estimation. Code will be available at https://github.com/onolab-tmu/AuxTDE.
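Editor's illustration (not the paper's algorithm): the pairwise baseline that the abstract above contrasts itself with can be sketched as a plain time-domain cross-correlation search between a reference sensor and one other sensor. This sketch assumes integer-sample delays and short real-valued signals; the function name `estimate_delay` and the toy signals are hypothetical. The subsample-precision, jointly optimized estimator described in the abstract is a different method, whose code is at the repository the authors cite.

```python
def estimate_delay(reference, other, max_lag):
    """Return the integer lag (in samples) at which `other` best matches
    `reference`, found by exhaustive cross-correlation over +/- max_lag."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(reference):
            m = n + lag
            if 0 <= m < len(other):
                score += x * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Toy signals: the same pulse reaches sensor 1 three samples later
# and sensor 2 five samples later than the reference sensor.
sensor_ref = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
sensor_1 = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
sensor_2 = [0, 0, 0, 0, 0, 0, 0, 1, 2, 1]

# Pairwise estimation: each non-reference sensor is handled independently.
delay_1 = estimate_delay(sensor_ref, sensor_1, max_lag=6)
delay_2 = estimate_delay(sensor_ref, sensor_2, max_lag=6)
```

Each call uses only two sensors at a time, which is the limitation the abstract points to: nothing ties the separate estimates together across sensor pairs.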

Submitted 23 March, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

Comments: 13 pages, 8 figures

arXiv:2102.06322 [pdf, other]

doi 10.1109/ICASSP39728.2021.9413478

Joint Dereverberation and Separation with Iterative Source Steering

Authors: Taishi Nakashima, Robin Scheibler, Masahito Togami, Nobutaka Ono

Abstract: We propose a new algorithm for joint dereverberation and blind source separation (DR-BSS). Our work builds upon the ILRMA-T framework that applies a unified filter combining dereverberation and separation. One drawback of this framework is that it requires several matrix inversions, an operation inherently costly and with potential stability issues. We leverage the recently introduced iterative source steering (ISS) updates to propose two algorithms mitigating this issue. Albeit derived from first principles, the first algorithm turns out to be a natural combination of weighted prediction error (WPE) dereverberation and ISS-based BSS, applied alternatingly. In this case, we manage to reduce the number of matrix inversions to only one per iteration and source. The second algorithm updates the ILRMA-T matrix using only sequential ISS updates requiring no matrix inversion at all. Its implementation is straightforward and memory efficient. Numerical experiments demonstrate that both methods achieve the same final performance as ILRMA-T in terms of several relevant objective metrics. In the important case of two sources, the number of iterations required is also similar.

Submitted 31 May, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: 5 pages, 2 figures, accepted at ICASSP 2021

arXiv:2004.03926 [pdf, other]

MM Algorithms for Joint Independent Subspace Analysis with Application to Blind Single and Multi-Source Extraction

Authors: Robin Scheibler, Nobutaka Ono

Abstract: In this work, we propose efficient algorithms for joint independent subspace analysis (JISA), an extension of independent component analysis that deals with parallel mixtures, where not all the components are independent. We derive an algorithmic framework for JISA based on the majorization-minimization (MM) optimization technique (JISA-MM). We use a well-known inequality for super-Gaussian sources to derive a surrogate function of the negative log-likelihood of the observed data. The minimization of this surrogate function leads to a variant of the hybrid exact-approximate diagonalization problem, but where multiple demixing vectors are grouped together. In the spirit of auxiliary function based independent vector analysis (AuxIVA), we propose several updates that can be applied alternately to one, or jointly to two, groups of demixing vectors. Recently, blind extraction of one or more sources has gained interest as a reasonable way of exploiting larger microphone arrays to achieve better separation. In particular, several MM algorithms have been proposed for overdetermined IVA (OverIVA). By applying JISA-MM, we are not only able to rederive these in a general manner, but also find several new algorithms. We run extensive numerical experiments to evaluate their performance, and compare it to that of full separation with AuxIVA. We find that algorithms using pairwise updates of two sources, or of one source and the background have the fastest convergence, and are able to separate target sources quickly and precisely from the background. In addition, we characterize the performance of all algorithms under a large number of noise, reverberation, and background mismatch conditions.

Submitted 8 April, 2020; originally announced April 2020.

Comments: 15 pages, 4 figures

arXiv:1910.10654 [pdf, other]

Fast Independent Vector Extraction by Iterative SINR Maximization

Authors: Robin Scheibler, Nobutaka Ono

Abstract: We propose fast independent vector extraction (FIVE), a new algorithm that blindly extracts a single non-Gaussian source from a Gaussian background. The algorithm iteratively computes beamforming weights maximizing the signal-to-interference-and-noise ratio for an approximate noise covariance matrix. We demonstrate that this procedure minimizes the negative log-likelihood of the input data according to a well-defined probabilistic model. The minimization is carried out via the auxiliary function technique whereas, unlike related methods, the auxiliary function is globally minimized at every iteration. Numerical experiments are carried out to assess the performance of FIVE. We find that it is vastly superior to competing methods in terms of convergence speed, and has high potential for real-time applications.

Submitted 23 October, 2019; originally announced October 2019.

Comments: 5 pages, 4 figures, Submitted to ICASSP 2020

arXiv:1905.07880 [pdf, other]

Independent Vector Analysis with more Microphones than Sources

Authors: Robin Scheibler, Nobutaka Ono

Abstract: We extend frequency-domain blind source separation based on independent vector analysis to the case where there are more microphones than sources. The signal is modelled as non-Gaussian sources in a Gaussian background. The proposed algorithm is based on a parametrization of the demixing matrix decreasing the number of parameters to estimate. Furthermore, orthogonal constraints between the signal and background subspaces are imposed to regularize the separation. The problem can then be posed as a constrained likelihood maximization. We propose efficient alternating updates guaranteed to converge to a stationary point of the cost function. The performance of the algorithm is assessed on simulated signals. We find that the separation performance is on par with that of the conventional determined algorithm at a fraction of the computational cost.

Submitted 7 August, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

Comments: Accepted to WASPAA 2019, 5 pages, 3 figures

arXiv:1904.02334 [pdf, other]

doi 10.1109/ICASSP.2019.8682594

Multi-modal Blind Source Separation with Microphones and Blinkies

Authors: Robin Scheibler, Nobutaka Ono

Abstract: We propose a blind source separation algorithm that jointly exploits measurements by a conventional microphone array and an ad hoc array of low-rate sound power sensors called blinkies. While providing less information than microphones, blinkies circumvent some difficulties of microphone arrays in terms of manufacturing, synchronization, and deployment. The algorithm is derived from a joint probabilistic model of the microphone and sound power measurements. We assume the separated sources to follow a time-varying spherical Gaussian distribution, and the non-negative power measurement space-time matrix to have a low-rank structure. We show that alternating updates similar to those of independent vector analysis and Itakura-Saito non-negative matrix factorization decrease the negative log-likelihood of the joint distribution. The proposed algorithm is validated via numerical experiments. Its median separation performance is found to be up to 8 dB more than that of independent vector analysis, with significantly reduced variability.

Submitted 3 April, 2019; originally announced April 2019.

Comments: Accepted at IEEE ICASSP 2019, Brighton, UK. 5 pages. 3 figures

arXiv:1808.08056 [pdf, other]

Independent Low-Rank Matrix Analysis Based on Time-Variant Sub-Gaussian Source Model

Authors: Shinichi Mogami, Norihiro Takamune, Daichi Kitamura, Hiroshi Saruwatari, Yu Takahashi, Kazunobu Kondo, Hiroaki Nakajima, Nobutaka Ono

Abstract: Independent low-rank matrix analysis (ILRMA) is a fast and stable method for blind audio source separation. Conventional ILRMAs assume time-variant (super-)Gaussian source models, which can only represent signals that follow a super-Gaussian distribution. In this paper, we focus on ILRMA based on a generalized Gaussian distribution (GGD-ILRMA) and propose a new type of GGD-ILRMA that adopts a time-variant sub-Gaussian distribution for the source model. By using a new update scheme called generalized iterative projection for homogeneous source models, we obtain a convergence-guaranteed update rule for demixing spatial parameters. In the experimental evaluation, we show the versatility of the proposed method, i.e., the proposed time-variant sub-Gaussian source model can be applied to various types of source signal.

Submitted 24 August, 2018; originally announced August 2018.

Comments: 8 pages, 5 figures, To appear in the Proceedings of APSIPA ASC 2018

arXiv:1806.10307 [pdf, other]

Independent Deeply Learned Matrix Analysis for Multichannel Audio Source Separation

Authors: Shinichi Mogami, Hayato Sumino, Daichi Kitamura, Norihiro Takamune, Shinnosuke Takamichi, Hiroshi Saruwatari, Nobutaka Ono

Abstract: In this paper, we address a multichannel audio source separation task and propose a new efficient method called independent deeply learned matrix analysis (IDLMA). IDLMA estimates the demixing matrix in a blind manner and updates the time-frequency structures of each source using a pretrained deep neural network (DNN). Also, we introduce a complex Student's t-distribution as a generalized source generative model including both complex Gaussian and Cauchy distributions. Experiments are conducted using music signals with a training dataset, and the results show the validity of the proposed method in terms of separation accuracy and computational cost.

Submitted 27 June, 2018; originally announced June 2018.

Comments: 5 pages, 4 figures, To appear in the Proceedings of the 26th European Signal Processing Conference (EUSIPCO 2018)

Showing 1–19 of 19 results for author: Ono, N