FGA: Fourier-Guided Attention Network for Crowd Count Estimation

Authors: Yashwardhan Chaudhuri, Ankit Kumar, Arun Balaji Buduru, Adel Alshamrani

Abstract: Crowd counting is gaining societal relevance, particularly in domains of Urban Planning, Crowd Management, and Public Safety. This paper introduces Fourier-guided attention (FGA), a novel attention mechanism for crowd count estimation designed to address the inefficient full-scale global pattern capture in existing works on convolution-based attention networks. FGA efficiently captures multi-scale information, including full-scale global patterns, by utilizing Fast-Fourier Transformations (FFT) along with spatial attention for global features and convolutions with channel-wise attention for semi-global and local features. The architecture of FGA involves a dual-path approach: (1) a path for processing full-scale global features through FFT, allowing for efficient extraction of information in the frequency domain, and (2) a path for processing remaining feature maps for semi-global and local features using traditional convolutions and channel-wise attention. This dual-path architecture enables FGA to seamlessly integrate frequency and spatial information, enhancing its ability to capture diverse crowd patterns. We apply FGA in the last layers of two popular crowd-counting works, CSRNet and CANNet, to evaluate the module's performance on benchmark datasets such as ShanghaiTech-A, ShanghaiTech-B, UCF-CC-50, and JHU++ crowd. The experiments demonstrate a notable improvement across all datasets based on Mean-Squared-Error (MSE) and Mean-Absolute-Error (MAE) metrics, showing comparable performance to recent state-of-the-art methods. Additionally, we illustrate the interpretability using qualitative analysis, leveraging Grad-CAM heatmaps, to show the effectiveness of FGA in capturing crowd patterns.

Submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted to IJCNN'24

arXiv:2406.10448 [pdf, other]

AVR: Synergizing Foundation Models for Audio-Visual Humor Detection

Authors: Sarthak Sharma, Orchid Chetia Phukan, Drishti Singh, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this work, we present AVR, an application for audio-visual humor detection. While humor detection has traditionally centered around textual analysis, recent advancements have spotlighted multimodal approaches. However, these methods lean on textual cues as a modality, necessitating the use of ASR systems for transcribing the audio data. This heavy reliance on ASR accuracy can pose challenges in real-world applications. To address this bottleneck, we propose an innovative audio-visual humor detection system that circumvents textual reliance, eliminating the need for ASR models. Instead, the proposed approach hinges on the intricate interplay between audio and visual content for effective humor detection.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2406.09156 [pdf, other]

Towards Multilingual Audio-Visual Question Answering

Authors: Orchid Chetia Phukan, Priyabrata Mallick, Swarup Ranjan Behera, Aalekhya Satya Narayani, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English, and replicating it for AVQA in other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. To this end, we propose the MERA framework, which leverages state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future works in multilingual AVQA.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

MSC Class: 68T45

arXiv:2406.06798 [pdf, other]

The Reasonable Effectiveness of Speaker Embeddings for Violence Detection

Authors: Sarthak Jain, Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this paper, we focus on audio violence detection (AVD). AVD is necessary for several reasons, especially in the context of maintaining safety, preventing harm, and ensuring security in various environments. This calls for accurate AVD systems. As in many related applications in audio processing, the most common approach to improving performance is to leverage self-supervised (SSL) pre-trained models (PTMs). However, these SSL models are very large, with millions of parameters, which can hinder real-world deployment, especially in compute-constrained environments. To resolve this, we propose the usage of speaker recognition models, which are much smaller than the SSL models. Experimenting with speaker recognition model embeddings and using SVM and Random Forest as classifiers, we show that speaker recognition model embeddings perform the best in comparison to state-of-the-art (SOTA) SSL models and achieve SOTA results.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 24 Show & Tell Demonstrations

arXiv:2406.06781 [pdf, other]

PERSONA: An Application for Emotion Recognition, Gender Recognition and Age Estimation

Authors: Devyani Koshal, Orchid Chetia Phukan, Sarthak Jain, Arun Balaji Buduru, Rajesh Sharma

Abstract: Emotion Recognition (ER), Gender Recognition (GR), and Age Estimation (AE) constitute paralinguistic tasks that rely not on the spoken content but primarily on speech characteristics such as pitch and tone. While previous research has made significant strides in developing models for each task individually, there has been comparatively less emphasis on concurrently learning these tasks, despite their inherent interconnectedness. As such, in this demonstration, we present PERSONA, an application for predicting ER, GR, and AE with a single model in the backend. Notably, we show through a comparative study that representations from a speaker recognition pre-trained model (PTM) are better suited for such a multi-task learning format than those from the state-of-the-art (SOTA) self-supervised (SSL) PTM. Our methodology obviates the need for deploying separate models for each task and can potentially conserve resources and time during the training and deployment phases.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2406.06774 [pdf, other]

ComFeAT: Combination of Neural and Spectral Features for Improved Depression Detection

Authors: Orchid Chetia Phukan, Sarthak Jain, Shubham Singh, Muskaan Singh, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this work, we focus on the detection of depression through speech analysis. Previous research has widely explored features extracted from pre-trained models (PTMs) primarily trained for paralinguistic tasks. Although these features have led to sufficient advances in speech-based depression detection, their performance declines in real-world settings. To address this, in this paper, we introduce ComFeAT, an application that employs a CNN model trained on a combination of features extracted from PTMs (a.k.a. neural features) and spectral features to enhance depression detection. Spectral features are robust to domain variations but do not perform as well as neural features. Surprisingly, combining them shows complementary behavior and improves over both neural and spectral features individually. The proposed method also improves over previous state-of-the-art (SOTA) works on the E-DAIC benchmark.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to INTERSPEECH 2024 Show & Tell Demonstrations

arXiv:2406.03205 [pdf, other]

CoLLAB: A Collaborative Approach for Multilingual Abuse Detection

Authors: Orchid Chetia Phukan, Yashasvi Chaurasia, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this study, we investigate representations from a paralingual Pre-Trained model (PTM) for Audio Abuse Detection (AAD), which have not previously been explored for AAD. Our results demonstrate their superiority compared to other PTM representations on the ADIMA benchmark. Furthermore, combining PTM representations enhances AAD performance. Despite these improvements, challenges with cross-lingual generalizability remain, and certain languages require training in the same language. This demands individual models for different languages, leading to scalability, maintenance, and resource allocation issues and hindering the practical deployment of AAD systems in linguistically diverse real-world environments. To address this, we introduce CoLLAB, a novel framework that doesn't require training and allows seamless merging of models trained in different languages through weight-averaging. This results in a unified model with competitive AAD performance across multiple languages.

Submitted 5 June, 2024; originally announced June 2024.
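
The weight-averaging merge described in the CoLLAB abstract above can be illustrated with a minimal sketch. It assumes each per-language model is a plain dictionary of parameter lists with identical names and shapes; the function name `average_weights` and the toy models are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def average_weights(models: List[Params]) -> Params:
    """Return the element-wise mean of parameters across models.

    Assumes every model has the same parameter names and lengths
    (an illustrative simplification, not the paper's procedure).
    """
    if not models:
        raise ValueError("need at least one model")
    reference = models[0]
    for model in models[1:]:
        if model.keys() != reference.keys():
            raise ValueError("models must share the same parameter names")
        for name in reference:
            if len(model[name]) != len(reference[name]):
                raise ValueError(f"shape mismatch for parameter {name!r}")
    count = len(models)
    merged: Params = {}
    for name in reference:
        # Group the i-th value of this parameter from every model, then average.
        columns = zip(*(model[name] for model in models))
        merged[name] = [sum(column) / count for column in columns]
    return merged


# Hypothetical toy "models" trained on two different languages.
model_lang_a = {"w": [1.0, 2.0], "b": [0.0]}
model_lang_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_weights([model_lang_a, model_lang_b])
```

No gradient steps are involved in the merge itself, which is consistent with the abstract's description of a framework that does not require training.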

arXiv:2405.06049 [pdf, other]

BB-Patch: BlackBox Adversarial Patch-Attack using Zeroth-Order Optimization

Authors: Satyadwyoom Kumar, Saurabh Gupta, Arun Balaji Buduru

Abstract: Deep Learning has become popular due to its vast applications in almost all domains. However, models trained using deep learning are prone to failure for adversarial samples and carry a considerable risk in sensitive applications. Most adversarial attack strategies assume that the adversary has access to the training data, the model parameters, and the input during deployment, and hence focus on perturbing the pixel-level information present in the input image. Adversarial Patches were introduced to the community and helped bring out the vulnerability of deep learning models in a much more pragmatic manner, but here the attacker has white-box access to the model parameters. Recently, there has been an attempt to develop these adversarial attacks using black-box techniques. However, certain assumptions, such as the availability of large training data, are not valid in real-life scenarios. In a real-life scenario, the attacker can only assume the type of model architecture used from a select list of state-of-the-art architectures while having access to only a subset of the input dataset. Hence, we propose a black-box adversarial attack strategy that produces adversarial patches which can be applied anywhere in the input image to perform an adversarial attack.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.00827 [pdf, other]

SONIC: Synergizing VisiON Foundation Models for Stress RecogNItion from ECG signals

Authors: Orchid Chetia Phukan, Ankita Das, Arun Balaji Buduru, Rajesh Sharma

Abstract: Stress recognition through physiological signals such as Electrocardiogram (ECG) signals has garnered significant attention. Traditionally, research in this field predominantly focused on utilizing handcrafted features or raw signals as inputs for learning algorithms. However, there is now a burgeoning interest within the community in leveraging large-scale vision foundation models (VFMs) like ResNet50, VGG19, and others. These VFMs are increasingly preferred due to their ability to capture complex features, enhancing the accuracy and effectiveness of stress recognition systems. However, no particular focus has been given to combining these VFMs. The combination of VFMs offers promising benefits by harnessing their collective knowledge to extract richer representations for improved stress recognition. So, to address this research gap, we focus on combining different VFMs for stress recognition from ECG and propose SONIC, a novel framework that combines VFMs through their logits and trains a fully connected network on the combined logits. Through extensive experimentation, SONIC outperformed the individual VFMs on the WESAD benchmark. With SONIC, we report state-of-the-art (SOTA) performance in WESAD with 99.36% and 99.24% (stress vs non-stress) and 97.66% and 97.10% (amusement vs stress vs baseline) in accuracy and F1 respectively.

Submitted 31 March, 2024; originally announced April 2024.

arXiv:2404.00809 [pdf, other]

Heterogeneity over Homogeneity: Investigating Multilingual Speech Pre-Trained Models for Detecting Audio Deepfake

Authors: Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma

Abstract: In this work, we investigate multilingual speech Pre-Trained models (PTMs) for Audio deepfake detection (ADD). We hypothesize that multilingual PTMs trained on large-scale diverse multilingual data gain knowledge about diverse pitches, accents, and tones during their pre-training phase, making them more robust to variations. As a result, they will be more effective for detecting audio deepfakes. To validate our hypothesis, we extract representations from state-of-the-art (SOTA) PTMs, including monolingual and multilingual PTMs as well as PTMs trained for speaker and emotion recognition, and evaluate them on the ASVSpoof 2019 (ASV), In-the-Wild (ITW), and DECRO benchmark databases. We show that representations from multilingual PTMs, with simple downstream networks, attain the best performance for ADD compared to other PTM representations, which validates our hypothesis. We also explore the possibility of fusion of selected PTM representations for further improvements in ADD, and we propose a framework, MiO (Merge into One), for this purpose. With MiO, we achieve SOTA performance on ASV and ITW and comparable performance on DECRO with current SOTA works.

Submitted 31 March, 2024; originally announced April 2024.

Comments: Accepted to NAACL (Findings) 2024

arXiv:2402.01579 [pdf, other]

Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?

Authors: Orchid Chetia Phukan, Gautam Siddharth Kashyap, Arun Balaji Buduru, Rajesh Sharma

Abstract: The availability of representations from pre-trained models (PTMs) has facilitated substantial progress in speech emotion recognition (SER). In particular, representations from PTMs trained for paralinguistic speech processing have shown state-of-the-art (SOTA) performance for SER. However, such paralinguistic PTM representations haven't been evaluated for SER in linguistic environments other than English. Also, paralinguistic PTM representations haven't been investigated in benchmarks such as SUPERB, EMO-SUPERB, and ML-SUPERB for SER. This makes it difficult to assess the efficacy of paralinguistic PTM representations for SER in multiple languages. To fill this gap, we perform a comprehensive comparative study of five SOTA PTM representations. Our results show that paralinguistic PTM (TRILLsson) representations perform the best, and this performance can be attributed to their capturing pitch, tone, and other speech characteristics more effectively than other PTM representations.

Submitted 11 July, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: Accepted to INTERSPEECH 24

arXiv:2401.05968 [pdf, other]

A Lightweight Feature Fusion Architecture For Resource-Constrained Crowd Counting

Authors: Yashwardhan Chaudhuri, Ankit Kumar, Orchid Chetia Phukan, Arun Balaji Buduru

Abstract: Crowd counting finds direct applications in real-world situations, making computational efficiency and performance crucial. However, most of the previous methods rely on a heavy backbone and a complex downstream architecture that restricts the deployment. To address this challenge and enhance the versatility of crowd-counting models, we introduce two lightweight models. These models maintain the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT. We leverage Adjacent Feature Fusion to extract diverse scale features from a Pre-Trained Model (PTM) and subsequently combine these features seamlessly. This approach empowers our models to achieve improved performance while maintaining a compact and efficient design. Compared with previously available state-of-the-art (SOTA) methods on the ShanghaiTech-A, ShanghaiTech-B, and UCF-CC-50 datasets, our proposed models achieve comparable results while being the most computationally efficient. Finally, we present a comparative study and an extensive ablation study, along with pruning, to show the effectiveness of our models.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2307.09938 [pdf, other]

Tracking an Untracked Space Debris After an Inelastic Collision Using Physics Informed Neural Network

Authors: Harsha M., Gurpreet Singh, Vinod Kumar, Arun Balaji Buduru, Sanat K. Biswas

Abstract: With the sustained rise in satellite deployment in Low Earth Orbits, the collision risk from untracked space debris is also increasing. Small-sized space debris (below 10 cm) is often hard to track using the existing state-of-the-art methods. However, knowing such space debris' trajectory is crucial to avoid future collisions. We present a Physics Informed Neural Network (PINN)-based approach for estimation of the trajectory of space debris after a collision event between an active satellite and space debris. In this work, we have simulated 8565 inelastic collision events between active satellites and space debris. Using the velocities of the colliding objects before the collision, we calculate the post-collision velocities and record the observations. The state (position and velocity), coefficient of restitution, and mass estimation of untracked space debris after an inelastic collision event, along with the tracked active satellite, can be posed as an optimization problem by observing the deviation of the active satellite from its trajectory. We have applied the classical optimization method, the Lagrange multiplier approach, for solving the above optimization problem and observed that its state estimation is not satisfactory as the system is under-determined. Subsequently, we have designed deep neural network-based methods and Physics Informed Neural Network (PINN)-based methods for solving the above optimization problem. We have compared the performance of the models using root mean square error (RMSE) and interquartile range of the predictions. It has been observed that the PINN-based methods provide a better prediction for position, velocity, mass and coefficient of restitution of the space debris compared to other methods.

Submitted 25 January, 2024; v1 submitted 19 July, 2023; originally announced July 2023.

Comments: 23 pages, 18 figures (consolidated into 13 figures by using sub-figures), accepted as a journal paper by Nature Scientific Report
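
The post-collision velocity calculation mentioned in the space-debris abstract above can be illustrated in one dimension. This is a sketch under textbook assumptions (conservation of momentum plus a coefficient of restitution `e` between 0 and 1), not the paper's simulation; the function name and the masses and velocities are hypothetical.

```python
from typing import Tuple


def post_collision_velocities(
    m1: float, u1: float, m2: float, u2: float, e: float
) -> Tuple[float, float]:
    """1-D collision of two bodies with coefficient of restitution e.

    Solves the two textbook equations:
      m1*u1 + m2*u2 = m1*v1 + m2*v2      (momentum conservation)
      v2 - v1       = e * (u1 - u2)      (definition of restitution)
    """
    if not 0.0 <= e <= 1.0:
        raise ValueError("coefficient of restitution must lie in [0, 1]")
    total_mass = m1 + m2
    momentum = m1 * u1 + m2 * u2
    v1 = (momentum - m2 * e * (u1 - u2)) / total_mass
    v2 = (momentum + m1 * e * (u1 - u2)) / total_mass
    return v1, v2


# Hypothetical values: a 2 kg body at 3 m/s hits a 1 kg body at rest, e = 0.5.
v1, v2 = post_collision_velocities(m1=2.0, u1=3.0, m2=1.0, u2=0.0, e=0.5)
```

Setting `e = 1` recovers a perfectly elastic collision and `e = 0` a perfectly inelastic one in which both bodies leave with the same velocity.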

arXiv:2306.10338 [pdf, other]

Trauma lurking in the shadows: A Reddit case study of mental health issues in online posts about Childhood Sexual Abuse

Authors: Orchid Chetia Phukan, Rajesh Sharma, Arun Balaji Buduru

Abstract: Childhood Sexual Abuse (CSA) is a menace to society and has long-lasting effects on the mental health of the survivors. From time to time CSA survivors are haunted by various mental health issues in their lifetime. Proper care and attention towards CSA survivors facing mental health issues can drastically improve the mental health conditions of CSA survivors. Previous works leveraging online social media (OSM) data for understanding mental health issues haven't focused on mental health issues in individuals with CSA background. Our work fills this gap by studying Reddit posts related to CSA to understand their mental health issues. Mental health issues such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are most commonly observed in posts with CSA background. Observable differences exist between posts related to mental health issues with and without CSA background. Keeping this difference in mind, we develop a two-stage framework for identifying mental health issues in posts with CSA exposure. The first stage involves classifying posts with and without CSA background, and the second stage involves recognizing mental health issues in posts that are classified as belonging to CSA background. The top model in the first stage achieves an accuracy and macro F1-score of 96.26% and 96.24%, respectively, and in the second stage, the top model reports a Hamming score of 67.09%. Content Warning: Reader discretion is recommended as our study tackles topics such as child sexual abuse, molestation, etc.

Submitted 17 June, 2023; originally announced June 2023.

arXiv:2305.18640 [pdf, other]

Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks

Authors: Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

Abstract: Speech emotion recognition (SER) is a field that has drawn a lot of attention due to its applications in diverse fields. A current trend in methods used for SER is to leverage embeddings from pre-trained models (PTMs) as input features to downstream models. However, the use of embeddings from speaker recognition PTMs hasn't garnered much focus in comparison to other PTM embeddings. To fill this gap and to understand the efficacy of speaker recognition PTM embeddings, we perform a comparative analysis of five PTM embeddings. Among all, x-vector embeddings performed the best, possibly due to their training for speaker recognition, which leads to capturing various components of speech such as tone, pitch, etc. Our modeling approach, which utilizes x-vector embeddings and mel-frequency cepstral coefficients (MFCC) as input features, is the most lightweight approach while achieving comparable accuracy to previous state-of-the-art (SOTA) methods in the CREMA-D benchmark.

Submitted 29 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023

arXiv:2304.11472 [pdf, other]

A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition

Authors: Orchid Chetia Phukan, Arun Balaji Buduru, Rajesh Sharma

Abstract: Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks. One such crucial task is Speech Emotion Recognition (SER), which has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning. PTM embeddings have helped advance SER; however, a comprehensive comparison of these PTM embeddings that considers multiple facets such as embedding model architecture, data used for pre-training, and the pre-training procedure being followed is missing. A thorough comparison of PTM embeddings will aid in the faster and more efficient development of models and enable their deployment in real-world scenarios. In this work, we address this research gap and perform a comparative analysis of embeddings extracted from eight speech and audio PTMs (wav2vec 2.0, data2vec, wavLM, UniSpeech-SAT, wav2clip, YAMNet, x-vector, ECAPA). We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms (XGBoost, Random Forest, FCN) on the derived embeddings. The results of our study indicate that the best performance is achieved by algorithms trained on embeddings derived from PTMs trained for speaker recognition, followed by wav2clip and UniSpeech-SAT. This suggests that the top performance by embeddings from speaker recognition PTMs is most likely due to the model taking up information about numerous speech features such as tone, accent, pitch, and so on during its speaker recognition training. Insights from this work will assist future studies in their selection of embeddings for applications related to SER.

Submitted 22 April, 2023; originally announced April 2023.

arXiv:2110.15923 [pdf, other]

Efficient Representation of Interaction Patterns with Hyperbolic Hierarchical Clustering for Classification of Users on Twitter

Authors: Tanvi Karandikar, Avinash Prabhu, Avinash Tulasi, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: Social media platforms play an important role in democratic processes. During the 2019 General Elections of India, political parties and politicians widely used Twitter to share their ideals, advocate their agenda and gain popularity. Twitter served as a ground for journalists, politicians and voters to interact. The organic nature of these interactions can be upended by malicious accounts on Twitter, which end up being suspended or deleted from the platform. Such accounts aim to modify the reach of content by inorganically interacting with particular handles. These interactions are a threat to the integrity of the platform, as such activity has the potential to affect entire results of democratic processes. In this work, we design a feature extraction framework which compactly captures potentially insidious interaction patterns. Our proposed features are designed to bring out communities amongst the users that work to boost the content of particular accounts. We use Hyperbolic Hierarchical Clustering (HypHC) which represents the features in the hyperbolic manifold to further separate such communities. HypHC gives the added benefit of representing these features in a lower dimensional space -- thus serving as a dimensionality reduction technique. We use these features to distinguish between different classes of users that emerged in the aftermath of the 2019 General Elections of India. Amongst the users active on Twitter during the elections, 2.8% of the users participating were suspended and 1% of the users were deleted from the platform. We demonstrate the effectiveness of our proposed features in differentiating between regular users (users who were neither suspended nor deleted), suspended users and deleted users. By leveraging HypHC in our pipeline, we obtain F1 scores of up to 93%.

Submitted 1 November, 2021; v1 submitted 29 October, 2021; originally announced October 2021.

arXiv:2107.05104 [pdf, other]

"A Virus Has No Religion": Analyzing Islamophobia on Twitter During the COVID-19 Outbreak

Authors: Mohit Chandra, Manvith Reddy, Shradha Sehgal, Saurabh Gupta, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: The COVID-19 pandemic has disrupted people's lives, driving them to act in fear, anxiety, and anger, leading to worldwide racist events in the physical world and online social networks. Though there are works focusing on Sinophobia during the COVID-19 pandemic, less attention has been given to the recent surge in Islamophobia. A large number of positive cases arising out of the religious Tablighi Jamaat gathering has driven people towards forming anti-Muslim communities around hashtags like #coronajihad, #tablighijamaatvirus on Twitter. In addition to the online spaces, the rise in Islamophobia has also resulted in increased hate crimes in the real world. Hence, an investigation is required to create interventions. To the best of our knowledge, we present the first large-scale quantitative study linking Islamophobia with COVID-19. In this paper, we present the CoronaBias dataset, which focuses on anti-Muslim hate spanning four months, with over 410,990 tweets from 244,229 unique users. We use this dataset to perform longitudinal analysis. We find the relation between the trend on Twitter and the offline events that happened over time, measure the qualitative changes in the context associated with the Muslim community, and perform macro and micro topic analysis to find prevalent topics. We also explore the nature of the content, focusing on the toxicity of the URLs shared within the tweets present in the CoronaBias dataset.
Apart from the content-based analysis, we focus on user analysis, revealing that the portrayal of religion as a symbol of patriotism played a crucial role in deciding how the Muslim community was perceived during the pandemic. Through these experiments, we reveal the existence of anti-Muslim rhetoric around COVID-19 in the Indian sub-continent.

Submitted 25 July, 2021; v1 submitted 11 July, 2021; originally announced July 2021.

arXiv:2009.13854 [pdf, other]

Multi-objective Reinforcement Learning based approach for User-Centric Power Optimization in Smart Home Environments

Authors: Saurabh Gupta, Siddhant Bhambri, Karan Dhingra, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: Smart homes require every device inside them to be connected with each other at all times, which leads to a lot of power wastage on a daily basis. As the devices inside a smart home increase, it becomes difficult for the user to control or operate every individual device optimally. Therefore, users generally rely on power management systems for such optimization but often are not satisfied with the results. In this paper, we present a novel multi-objective reinforcement learning framework with two-fold objectives of minimizing power consumption and maximizing user satisfaction. The framework explores the trade-off between the two objectives and converges to a better power management policy when both objectives are considered while finding an optimal policy. We experiment on real-world smart home data, and show that the multi-objective approaches: i) establish a trade-off between the two objectives, ii) achieve better combined user satisfaction and power consumption than single-objective approaches. We also show that the devices that are used regularly and have several fluctuations in device modes at regular intervals should be targeted for optimization, and the experiments on data from other smart homes fetch similar results, hence ensuring transferability of the proposed framework.

Submitted 29 September, 2020; originally announced September 2020.

Comments: 8 pages, 7 figures, Accepted at IEEE SMDS'2020

arXiv:2009.13839 [pdf, other]

imdpGAN: Generating Private and Specific Data with Generative Adversarial Networks

Authors: Saurabh Gupta, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: Generative Adversarial Network (GAN) and its variants have shown promising results in generating synthetic data. However, the issues with GANs are: (i) the learning happens around the training samples and the model often ends up remembering them, consequently, compromising the privacy of individual samples - this becomes a major concern when GANs are applied to training data including personally identifiable information, (ii) the randomness in generated data - there is no control over the specificity of generated samples. To address these issues, we propose imdpGAN - an information maximizing differentially private Generative Adversarial Network. It is an end-to-end framework that simultaneously achieves privacy protection and learns latent representations. With experiments on the MNIST dataset, we show that imdpGAN preserves the privacy of the individual data point, and learns latent codes to control the specificity of the generated samples. We perform binary classification on digit pairs to show the utility versus privacy trade-off. The classification accuracy decreases as we increase privacy levels in the framework. We also experimentally show that the training process of imdpGAN is stable but experiences a 10-fold time increase as compared with other GAN frameworks. Finally, we extend the imdpGAN framework to the CelebA dataset to show how the privacy and learned representations can be used to control the specificity of the output.

Submitted 29 September, 2020; originally announced September 2020.

Comments: 9 pages, 7 figures, Accepted at IEEE TPS'2020

arXiv:2007.06078 [pdf, other]

Fine-grained Language Identification with Multilingual CapsNet Model

Authors: Mudit Verma, Arun Balaji Buduru

Abstract: Due to a drastic improvement in the quality of internet services worldwide, there is an explosion of multilingual content generation and consumption. This is especially prevalent in countries with large multilingual audiences, who are increasingly consuming media outside their linguistic familiarity/preference. Hence, there is an increasing need for real-time and fine-grained content analysis services, including language identification, content transcription, and analysis. Accurate and fine-grained spoken language detection is an essential first step for all the subsequent content analysis algorithms. Current techniques in spoken language detection may be lacking on one of these fronts: accuracy, fine-grained detection, data requirements, manual effort in data collection & pre-processing. Hence in this work, a real-time language detection approach to detect spoken language from 5 seconds' audio clips with an accuracy of 91.8% is presented with exiguous data requirements and minimal pre-processing. Novel architectures for Capsule Networks are proposed which operate on spectrogram images of the provided audio snippets. We use previous approaches based on Recurrent Neural Networks and iVectors to present the results. Finally, we show a "Non-Class" analysis to further stress why the CapsNet architecture works for the LID task.

Submitted 12 July, 2020; originally announced July 2020.

Comments: 5 pages, 6 figures

arXiv:1912.03298 [pdf, other]

Making Smart Homes Smarter: Optimizing Energy Consumption with Human in the Loop

Authors: Mudit Verma, Siddhant Bhambri, Saurabh Gupta, Arun Balaji Buduru

Abstract: Rapid advancements in the Internet of Things (IoT) have facilitated more efficient deployment of smart environment solutions for specific user requirements. With the increase in the number of IoT devices, it has become difficult for the user to control or operate every individual smart device to achieve some desired goal like optimized power consumption, scheduled appliance running time, etc. Furthermore, existing solutions to automatically adapt the IoT devices are not capable enough to incorporate the user behavior. This paper presents a novel approach to accurately configure IoT devices while achieving the twin objectives of energy optimization along with conforming to user preferences. Our work comprises unsupervised clustering of devices' data to find the states of operation for each device, followed by probabilistically analyzing user behavior to determine their preferred states. Eventually, we deploy an online reinforcement learning (RL) agent to find the best device settings automatically. Results for three different smart homes' datasets show the effectiveness of our methodology. To the best of our knowledge, this is the first time that a practical approach has been adopted to achieve the above-mentioned objectives without any human interaction within the system.

Submitted 4 May, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

arXiv:1912.01667 [pdf, other]

A Survey of Black-Box Adversarial Attacks on Computer Vision Models

Authors: Siddhant Bhambri, Sumanyu Muku, Avinash Tulasi, Arun Balaji Buduru

Abstract: Machine learning has seen tremendous advances in the past few years, which has led to deep learning models being deployed in varied applications of day-to-day life. Attacks on such models using perturbations, particularly in real-life scenarios, pose a severe challenge to their applicability, pushing research into the direction which aims to enhance the robustness of these models. After the introduction of these perturbations by Szegedy et al. [1], a significant amount of research has focused on the reliability of such models, primarily in two aspects - white-box, where the adversary has access to the targeted model and related parameters; and the black-box, which resembles a real-life scenario with the adversary having almost no knowledge of the model to be attacked. To provide a comprehensive security cover, it is essential to identify, study, and build defenses against such attacks. Hence, in this paper, we propose to present a comprehensive comparative study of various black-box adversarial attacks and defense techniques.

Submitted 7 February, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

Comments: 33 pages

arXiv:1909.10012 [pdf, other]

Is change the only constant? Profile change perspective on #LokSabhaElections2019

Authors: Kumari Neha, Shashank Srikanth, Sonali Singhal, Shwetanshu Singh, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: Users on Twitter are identified with the help of their profile attributes that consist of username, display name, profile image, to name a few. The profile attributes that users adopt can reflect their interests, belief, or thematic inclinations. Literature has proposed the implications and significance of profile attribute change for a random population of users. However, the use of profile attributes for endorsements and to start a movement has been under-explored. In this work, we consider #LokSabhaElections2019 as a movement and perform a large-scale study of the profile of users who actively made changes to profile attributes centered around #LokSabhaElections2019. We collect the profile metadata for 49.4M users for a period of 2 months from April 5, 2019 to June 5, 2019 amid #LokSabhaElections2019. We investigate how the profile changes vary for the influential leaders and their followers over the social movement. We further differentiate the organic and inorganic ways to show the political inclination from the prism of profile changes. We report how the addition of election campaign related keywords leads to the spread of behavior contagion and further investigate it with respect to "Chowkidar Movement" in detail.

Submitted 22 September, 2019; originally announced September 2019.

Comments: 8 pages, 11 figures, 4 tables

arXiv:1909.07151 [pdf, other]

Hashtags are (not) judgemental: The untold story of Lok Sabha elections 2019

Authors: Saurabh Gupta, Asmit Kumar Singh, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: Hashtags in online social media have become a way for users to build communities around topics, promote opinions, and categorize messages. In the political context, hashtags on Twitter are used by users to campaign for their parties, spread news, or to get followers and get a general idea by following a discussion built around a hashtag. In the past, researchers have studied certain types and specific properties of hashtags by utilizing a lot of data collected around hashtags. In this paper, we perform a large-scale empirical analysis of elections using only the hashtags shared on Twitter during the 2019 Lok Sabha elections in India. We study the trends and events that unfolded on the ground, the latent topics to uncover representative hashtags and semantic similarity to relate hashtags with the election outcomes. We collect over 24 million hashtags to perform extensive experiments. First, we find the trending hashtags to cross-reference them with the tweets in our dataset to list down notable events. Second, we use Latent Dirichlet Allocation to find topic patterns in the dataset. In the end, we use a skip-gram word embedding model to find semantically similar hashtags. We propose a popularity and an influence metric to predict election outcomes using just the hashtags. Empirical results show that influence is a good measure to predict the election outcome.

Submitted 28 April, 2020; v1 submitted 16 September, 2019; originally announced September 2019.

arXiv:1909.07144 [pdf, other]

Catching up with trends: The changing landscape of political discussions on twitter in 2014 and 2019

Authors: Avinash Tulasi, Kanay Gupta, Omkar Gurjar, Sathvik Sanjeev Buggana, Paras Mehan, Arun Balaji Buduru, Ponnurangam Kumaraguru

Abstract: The advent of 4G increased the usage of the internet in India, which took a huge number of discussions online. Online Social Networks (OSNs) are the center of these discussions. During elections, political discussions constitute a significant portion of the trending topics on these networks. Politicians and political parties catch up with these trends, and social media then becomes a part of their publicity agenda. We cannot ignore this trend in any election, be it the U.S., Germany, France, or India. Twitter is a major platform where we observe these trends. In this work, we examine the magnitude of political discussions on Twitter by contrasting the platform usage on levels like gender, political party, and geography, in the 2014 and 2019 Indian General Elections. In a further attempt to understand the strategies followed by political parties, we compare Twitter usage by Bharatiya Janata Party (BJP) and Indian National Congress (INC) in the 2019 General Elections in terms of how efficiently they make use of the platform. We specifically analyze the handles of politicians who emerged victorious. We then proceed to compare political handles held by frontmen of BJP and INC: Narendra Modi (@narendramodi) and Rahul Gandhi (@RahulGandhi) using parameters like "following", "tweeting habits", "sources used to tweet", along with text analysis of tweets. With this work, we also introduce a rich dataset covering a majority of tweets made during the election period in 2014 and 2019.

Submitted 18 September, 2019; v1 submitted 16 September, 2019; originally announced September 2019.

arXiv:1810.11937 [pdf, other]

An approach to predictively securing critical cloud infrastructures through probabilistic modeling

Authors: Satvik Jain, Arun Balaji Buduru, Anshuman Chhabra

Abstract: Cloud infrastructures are being increasingly utilized in critical infrastructures such as banking/finance, transportation and utility management. Sophistication and resources used in recent security breaches including those on critical infrastructures show that attackers are no longer limited by monetary/computational constraints. In fact, they may be aided by entities with large financial and human resources. Hence there is an urgent need to develop predictive approaches for cyber defense to strengthen cloud infrastructures specifically utilized by critical infrastructures. Extensive research has been done in the past on applying techniques such as Game Theory, Machine Learning and Bayesian Networks among others for the predictive defense of critical infrastructures. However, a major drawback of these approaches is that they do not incorporate probabilistic human behavior, which limits their predictive ability. In this paper, a stochastic approach is proposed to predict less secure states in critical cloud systems which might lead to potential security breaches. These less-secure states are deemed as 'risky' states in our approach. A Markov Decision Process (MDP) is used to accurately incorporate user behavior(s) as well as operational behavior of the cloud infrastructure through a set of features. The developed reward/cost mechanism is then used to select appropriate 'actions' to identify risky states at future time steps by learning an optimal policy. Experimental results show that the proposed framework performs well in identifying future 'risky' states.
Through this work we demonstrate the effectiveness of using probabilistic modeling (MDP) to predictively secure critical cloud infrastructures.

Submitted 28 October, 2018; originally announced October 2018.

Showing 1–27 of 27 results for author: Buduru, A B