Search | arXiv e-print repository

FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification

Authors: Yu Tian, Congcong Wen, Min Shi, Muhammad Muneeb Afzal, Hao Huang, Muhammad Osama Khan, Yan Luo, Yi Fang, Mengyu Wang

Abstract: Addressing fairness in artificial intelligence (AI), particularly in medical AI, is crucial for ensuring equitable healthcare outcomes. Recent efforts to enhance fairness have introduced new methodologies and datasets in medical AI. However, the fairness issue under the setting of domain transfer is almost unexplored, while it is common that clinics rely on different imaging technologies (e.g., di… ▽ More Addressing fairness in artificial intelligence (AI), particularly in medical AI, is crucial for ensuring equitable healthcare outcomes. Recent efforts to enhance fairness have introduced new methodologies and datasets in medical AI. However, the fairness issue under the setting of domain transfer is almost unexplored, while it is common that clinics rely on different imaging technologies (e.g., different retinal imaging modalities) for patient diagnosis. This paper presents FairDomain, a pioneering systemic study into algorithmic fairness under domain shifts, employing state-of-the-art domain adaptation (DA) and generalization (DG) algorithms for both medical segmentation and classification tasks to understand how biases are transferred between different domains. We also introduce a novel plug-and-play fair identity attention (FIA) module that adapts to various DA and DG algorithms to improve fairness by using self-attention to adjust feature importance based on demographic attributes. Additionally, we curate the first fairness-focused dataset with two paired imaging modalities for the same patient cohort on medical segmentation and classification tasks, to rigorously assess fairness in domain-shift scenarios. Excluding the confounding impact of demographic distribution variation between source and target domains will allow clearer quantification of the performance of domain transfer models. Our extensive evaluations reveal that the proposed FIA significantly enhances both model performance accounted for fairness across all domain shift settings (i.e., DA and DG) with respect to different demographics, which outperforms existing methods on both segmentation and classification. The code and data can be accessed at https://ophai.hms.harvard.edu/datasets/harvard-fairdomain20k. △ Less

Submitted 11 July, 2024; originally announced July 2024.

Comments: ECCV 2024; Codes available at https://github.com/Harvard-Ophthalmology-AI-Lab/FairDomain

arXiv:2407.07054 [pdf, other]

A Differentially Private Blockchain-Based Approach for Vertical Federated Learning

Authors: Linh Tran, Sanjay Chari, Md. Saikat Islam Khan, Aaron Zachariah, Stacy Patterson, Oshani Seneviratne

Abstract: We present the Differentially Private Blockchain-Based Vertical Federal Learning (DP-BBVFL) algorithm that provides verifiability and privacy guarantees for decentralized applications. DP-BBVFL uses a smart contract to aggregate the feature representations, i.e., the embeddings, from clients transparently. We apply local differential privacy to provide privacy for embeddings stored on a blockchain… ▽ More We present the Differentially Private Blockchain-Based Vertical Federal Learning (DP-BBVFL) algorithm that provides verifiability and privacy guarantees for decentralized applications. DP-BBVFL uses a smart contract to aggregate the feature representations, i.e., the embeddings, from clients transparently. We apply local differential privacy to provide privacy for embeddings stored on a blockchain, hence protecting the original data. We provide the first prototype application of differential privacy with blockchain for vertical federated learning. Our experiments with medical data show that DP-BBVFL achieves high accuracy with a tradeoff in training time due to on-chain aggregation. This innovative fusion of differential privacy and blockchain technology in DP-BBVFL could herald a new era of collaborative and trustworthy machine learning applications across several decentralized application domains. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06889 [pdf, other]

A Neurosymbolic Approach to Adaptive Feature Extraction in SLAM

Authors: Yasra Chandio, Momin A. Khan, Khotso Selialia, Luis Garcia, Joseph DeGol, Fatima M. Anwar

Abstract: Autonomous robots, autonomous vehicles, and humans wearing mixed-reality headsets require accurate and reliable tracking services for safety-critical applications in dynamically changing real-world environments. However, the existing tracking approaches, such as Simultaneous Localization and Mapping (SLAM), do not adapt well to environmental changes and boundary conditions despite extensive manual… ▽ More Autonomous robots, autonomous vehicles, and humans wearing mixed-reality headsets require accurate and reliable tracking services for safety-critical applications in dynamically changing real-world environments. However, the existing tracking approaches, such as Simultaneous Localization and Mapping (SLAM), do not adapt well to environmental changes and boundary conditions despite extensive manual tuning. On the other hand, while deep learning-based approaches can better adapt to environmental changes, they typically demand substantial data for training and often lack flexibility in adapting to new domains. To solve this problem, we propose leveraging the neurosymbolic program synthesis approach to construct adaptable SLAM pipelines that integrate the domain knowledge from traditional SLAM approaches while leveraging data to learn complex relationships. While the approach can synthesize end-to-end SLAM pipelines, we focus on synthesizing the feature extraction module. We first devise a domain-specific language (DSL) that can encapsulate domain knowledge on the important attributes for feature extraction and the real-world performance of various feature extractors. Our neurosymbolic architecture then undertakes adaptive feature extraction, optimizing parameters via learning while employing symbolic reasoning to select the most suitable feature extractor. Our evaluations demonstrate that our approach, neurosymbolic Feature EXtraction (nFEX), yields higher-quality features. It also reduces the pose error observed for the state-of-the-art baseline feature extractors ORB and SIFT by up to 90% and up to 66%, respectively, thereby enhancing the system's efficiency and adaptability to novel environments. △ Less

Submitted 9 July, 2024; originally announced July 2024.

Comments: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2407.06345 [pdf, other]

Multi-person eye tracking for real-world scene perception in social settings

Authors: Shreshth Saxena, Areez Visram, Neil Lobo, Zahid Mirza, Mehak Rafi Khan, Biranugan Pirabaharan, Alexander Nguyen, Lauren K. Fink

Abstract: Eye movements provide a window into human behaviour, attention, and interaction dynamics. Previous research suggests that eye movements are highly influenced by task, setting, and social others; however, most eye tracking research is conducted in single-person, in-lab settings and is yet to be validated in multi-person, naturalistic contexts. One such prevalent real-world context is the collective… ▽ More Eye movements provide a window into human behaviour, attention, and interaction dynamics. Previous research suggests that eye movements are highly influenced by task, setting, and social others; however, most eye tracking research is conducted in single-person, in-lab settings and is yet to be validated in multi-person, naturalistic contexts. One such prevalent real-world context is the collective viewing of a shared scene in social settings, for example, viewing a concert, film, lecture, sports, etc. Here, we apply mobile eye tracking in a real-world multi-person setup and develop a system to stream, record, and analyse synchronised data. We tested our proposed, open-source system while participants (N=60) watched a live concert and a documentary film screening during a public event. We tackled challenges related to networking bandwidth requirements, real-time monitoring, and gaze projection from individual egocentric perspectives to a common coordinate space for shared gaze analysis. Our system achieves precise time synchronisation and accurate gaze projection in challenging dynamic scenes. Further, to illustrate the potential of collective eye-tracking data, we introduce and evaluate novel analysis metrics and visualisations. Overall, our approach contributes to the development and application of versatile multi-person eye tracking systems in real-world social settings. This advancement enables insight into collaborative behaviour, group dynamics, and social interaction, with high ecological validity. Moreover, it paves the path for innovative, interactive tools that promote collaboration and coordination in social contexts. △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: Please refer to the supplementary video illustrating the proposed approach in this paper here: https://tinyurl.com/multipersonET

ACM Class: I.4.8; J.4; J.5; C.4; D.2.10

arXiv:2407.04519 [pdf, other]

Success or Failure? Analyzing Segmentation Refinement with Few-Shot Segmentation

Authors: Seonghyeon Moon, Haein Kong, Muhammad Haris Khan

Abstract: The purpose of segmentation refinement is to enhance the initial coarse masks generated by segmentation algorithms. The refined masks are expected to capture the details and contours of the target objects. Research on segmentation refinement has developed as a response to the need for high-quality initial masks. However, to our knowledge, no method has been developed that can determine the success… ▽ More The purpose of segmentation refinement is to enhance the initial coarse masks generated by segmentation algorithms. The refined masks are expected to capture the details and contours of the target objects. Research on segmentation refinement has developed as a response to the need for high-quality initial masks. However, to our knowledge, no method has been developed that can determine the success of segmentation refinement. Such a method could ensure the reliability of segmentation in applications where the outcome of the segmentation is important, and fosters innovation in image processing technologies. To address this research gap, we propose JFS~(Judging From Support-set), a method to identify the success of segmentation refinement leveraging a few-shot segmentation (FSS) model. The traditional goal of the problem in FSS is to find a target object in a query image utilizing target information given by a support set. However, in our proposed method, we use the FSS network in a novel way to assess the segmentation refinement. When there are two masks, a coarse mask and a refined mask from segmentation refinement, these two masks become support masks. The existing support mask works as a ground truth mask to judge whether the quality of the refined segmentation is more accurate than the coarse mask. We first obtained a coarse mask and refined it using SEPL (SAM Enhanced Pseduo-Labels) to get the two masks. Then, these become input to FSS model to judge whether the post-processing was successful. JFS is evaluated on the best and worst cases from SEPL to validate its effectiveness. The results showed that JFS can determine whether the SEPL is a success or not. △ Less

Submitted 5 July, 2024; originally announced July 2024.

Comments: 4 pages

arXiv:2407.04069 [pdf, other]

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Authors: Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang

Abstract: Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the comple… ▽ More Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them in real-world applications to ensure they produce reliable performance. Despite the well-established importance of evaluating LLMs in the community, the complexity of the evaluation process has led to varied evaluation setups, causing inconsistencies in findings and interpretations. To address this, we systematically review the primary challenges and limitations causing these inconsistencies and unreliable evaluations in various steps of LLM evaluation. Based on our critical review, we present our perspectives and recommendations to ensure LLM evaluations are reproducible, reliable, and robust. △ Less

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.02871 [pdf, other]

LMBF-Net: A Lightweight Multipath Bidirectional Focal Attention Network for Multifeatures Segmentation

Authors: Tariq M Khan, Shahzaib Iqbal, Syed S. Naqvi, Imran Razzak, Erik Meijering

Abstract: Retinal diseases can cause irreversible vision loss in both eyes if not diagnosed and treated early. Since retinal diseases are so complicated, retinal imaging is likely to show two or more abnormalities. Current deep learning techniques for segmenting retinal images with many labels and attributes have poor detection accuracy and generalisability. This paper presents a multipath convolutional neu… ▽ More Retinal diseases can cause irreversible vision loss in both eyes if not diagnosed and treated early. Since retinal diseases are so complicated, retinal imaging is likely to show two or more abnormalities. Current deep learning techniques for segmenting retinal images with many labels and attributes have poor detection accuracy and generalisability. This paper presents a multipath convolutional neural network for multifeature segmentation. The proposed network is lightweight and spatially sensitive to information. A patch-based implementation is used to extract local image features, and focal modulation attention blocks are incorporated between the encoder and the decoder for improved segmentation. Filter optimisation is used to prevent filter overlaps and speed up model convergence. A combination of convolution operations and group convolution operations is used to reduce computational costs. This is the first robust and generalisable network capable of segmenting multiple features of fundus images (including retinal vessels, microaneurysms, optic discs, haemorrhages, hard exudates, and soft exudates). The results of our experimental evaluation on more than ten publicly available datasets with multiple features show that the proposed network outperforms recent networks despite having a small number of learnable parameters. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.02598 [pdf, other]

AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction

Authors: Mustafa Khan, Hamidreza Fazlali, Dhruv Sharma, Tongtong Cao, Dongfeng Bai, Yuan Ren, Bingbing Liu

Abstract: Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse views. We propose AutoSplat, a framework employing Gaussian splatti… ▽ More Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse views. We propose AutoSplat, a framework employing Gaussian splatting to achieve highly realistic reconstructions of autonomous driving scenes. By imposing geometric constraints on Gaussians representing the road and sky regions, our method enables multi-view consistent simulation of challenging scenarios including lane changes. Leveraging 3D templates, we introduce a reflected Gaussian consistency constraint to supervise both the visible and unseen side of foreground objects. Moreover, to model the dynamic appearance of foreground objects, we estimate residual spherical harmonics for each foreground Gaussian. Extensive experiments on Pandaset and KITTI demonstrate that AutoSplat outperforms state-of-the-art methods in scene reconstruction and novel view synthesis across diverse driving scenarios. Visit our project page at https://autosplat.github.io/. △ Less

Submitted 3 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01440 [pdf, other]

GAT-Steiner: Rectilinear Steiner Minimal Tree Prediction Using GNNs

Authors: Bugra Onal, Eren Dogan, Muhammad Hadir Khan, Matthew R. Guthaus

Abstract: The Rectilinear Steiner Minimum Tree (RSMT) problem is a fundamental problem in VLSI placement and routing and is known to be NP-hard. Traditional RSMT algorithms spend a significant amount of time on finding Steiner points to reduce the total wire length or use heuristics to approximate producing sub-optimal results. We show that Graph Neural Networks (GNNs) can be used to predict optimal Steiner… ▽ More The Rectilinear Steiner Minimum Tree (RSMT) problem is a fundamental problem in VLSI placement and routing and is known to be NP-hard. Traditional RSMT algorithms spend a significant amount of time on finding Steiner points to reduce the total wire length or use heuristics to approximate producing sub-optimal results. We show that Graph Neural Networks (GNNs) can be used to predict optimal Steiner points in RSMTs with high accuracy and can be parallelized on GPUs. In this paper, we propose GAT-Steiner, a graph attention network model that correctly predicts 99.846% of the nets in the ISPD19 benchmark with an average increase in wire length of only 0.480% on suboptimal wire length nets. On randomly generated benchmarks, GAT-Steiner correctly predicts 99.942% with an average increase in wire length of only 0.420% on suboptimal wire length nets. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: Preprint for The 2024 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2024)

arXiv:2406.18325 [pdf, ps, other]

Linear codes with few weights over $\mathbb{F}_{p}+u\mathbb{F}_{p}$

Authors: Pavan Kumar, Noor Mohammad Khan

Abstract: For any positive integer $m$ and an odd prime $p$; let $\mathbb{F}_{q}+u\mathbb{F}_{q}$, where $q=p^{m}$, be a ring extension of the ring $\mathbb{F}_{p}+u\mathbb{F}_{p}.$ In this paper, we construct linear codes over $\mathbb{F}_{p}+u\mathbb{F}_{p}$ by using trace function defined on $\mathbb{F}_{q}+u\mathbb{F}_{q}$ and determine their Hamming weight distributions by employing symplectic-weight… ▽ More For any positive integer $m$ and an odd prime $p$; let $\mathbb{F}_{q}+u\mathbb{F}_{q}$, where $q=p^{m}$, be a ring extension of the ring $\mathbb{F}_{p}+u\mathbb{F}_{p}.$ In this paper, we construct linear codes over $\mathbb{F}_{p}+u\mathbb{F}_{p}$ by using trace function defined on $\mathbb{F}_{q}+u\mathbb{F}_{q}$ and determine their Hamming weight distributions by employing symplectic-weight distributions of their Gray images. △ Less

Submitted 26 June, 2024; originally announced June 2024.

MSC Class: 94B05; 11T71

arXiv:2406.18307 [pdf, ps, other]

Five-Lee-weight linear codes over $\mathbb{F}_{q}+u\mathbb{F}_{q}$

Authors: Pavan Kumar, Noor Mohammad Khan

Abstract: In this study, linear codes having their Lee-weight distributions over the semi-local ring $\mathbb{F}_{q}+u\mathbb{F}_{q}$ with $u^{2}=1$ are constructed using the defining set and Gauss sums for an odd prime $q $. Moreover, we derive complete Hamming-weight enumerators for the images of the constructed linear codes under the Gray map. We finally show an application to secret sharing schemes. In this study, linear codes having their Lee-weight distributions over the semi-local ring $\mathbb{F}_{q}+u\mathbb{F}_{q}$ with $u^{2}=1$ are constructed using the defining set and Gauss sums for an odd prime $q $. Moreover, we derive complete Hamming-weight enumerators for the images of the constructed linear codes under the Gray map. We finally show an application to secret sharing schemes. △ Less

Submitted 5 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

MSC Class: 94B05; 11T71

arXiv:2406.17746 [pdf, other]

Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

Authors: USVSN Sai Prashanth, Alvin Deng, Kyle O'Brien, Jyothir S V, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra

Abstract: Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, recons… ▽ More Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17190 [pdf, other]

Sound Tagging in Infant-centric Home Soundscapes

Authors: Mohammad Nur Hossain Khan, Jialu Li, Nancy L. McElwain, Mark Hasegawa-Johnson, Bashima Islam

Abstract: Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or… ▽ More Certain environmental noises have been associated with negative developmental outcomes for infants and young children. Though classifying or tagging sound events in a domestic environment is an active research area, previous studies focused on data collected from a non-stationary microphone placed in the environment or from the perspective of adults. Further, many of these works ignore infants or young children in the environment or have data collected from only a single family where noise from the fixed sound source can be moderate at the infant's position or vice versa. Thus, despite the recent success of large pre-trained models for noise event detection, the performance of these models on infant-centric noise soundscapes in the home is yet to be explored. To bridge this gap, we have collected and labeled noises in home soundscapes from 22 families in an unobtrusive manner, where the data are collected through an infant-worn recording device. In this paper, we explore the performance of a large pre-trained model (Audio Spectrogram Transformer [AST]) on our noise-conditioned infant-centric environmental data as well as publicly available home environmental datasets. Utilizing different training strategies such as resampling, utilizing public datasets, mixing public and infant-centric training sets, and data augmentation using noise and masking, we evaluate the performance of a large pre-trained model on sparse and imbalanced infant-centric data. Our results show that fine-tuning the large pre-trained model by combining our collected dataset with public datasets increases the F1-score from 0.11 (public datasets) and 0.76 (collected datasets) to 0.84 (combined datasets) and Cohen's Kappa from 0.013 (public datasets) and 0.77 (collected datasets) to 0.83 (combined datasets) compared to only training with public or collected datasets, respectively. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted in IEEE/ACM CHASE 2024

arXiv:2406.15831 [pdf, other]

Shape2.5D: A Dataset of Texture-less Surfaces for Depth and Normals Estimation

Authors: Muhammad Saif Ullah Khan, Muhammad Zeshan Afzal, Didier Stricker

Abstract: Reconstructing texture-less surfaces poses unique challenges in computer vision, primarily due to the lack of specialized datasets that cater to the nuanced needs of depth and normals estimation in the absence of textural information. We introduce "Shape2.5D," a novel, large-scale dataset designed to address this gap. Comprising 364k frames spanning 2635 3D models and 48 unique objects, our datase… ▽ More Reconstructing texture-less surfaces poses unique challenges in computer vision, primarily due to the lack of specialized datasets that cater to the nuanced needs of depth and normals estimation in the absence of textural information. We introduce "Shape2.5D," a novel, large-scale dataset designed to address this gap. Comprising 364k frames spanning 2635 3D models and 48 unique objects, our dataset provides depth and surface normal maps for texture-less object reconstruction. The proposed dataset includes synthetic images rendered with 3D modeling software to simulate various lighting conditions and viewing angles. It also includes a real-world subset comprising 4672 frames captured with a depth camera. Our comprehensive benchmarks, performed using a modified encoder-decoder network, showcase the dataset's capability to support the development of algorithms that robustly estimate depth and normals from RGB images. Our open-source data generation pipeline allows the dataset to be extended and adapted for future research. The dataset is publicly available at \url{https://github.com/saifkhichi96/Shape25D}. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: This dataset paper was originally written in 2022

arXiv:2406.14583 [pdf]

doi 10.32604/jai.2022.038875

Modeling & Evaluating the Performance of Convolutional Neural Networks for Classifying Steel Surface Defects

Authors: Nadeem Jabbar Chaudhry, M. Bilal Khan, M. Javaid Iqbal, Siddiqui Muhammad Yasir

Abstract: Recently, outstanding identification rates in image classification tasks were achieved by convolutional neural networks (CNNs). to use such skills, selective CNNs trained on a dataset of well-known images of metal surface defects captured with an RGB camera. Defects must be detected early to take timely corrective action due to production concerns. For image classification up till now, a model-bas… ▽ More Recently, outstanding identification rates in image classification tasks were achieved by convolutional neural networks (CNNs). to use such skills, selective CNNs trained on a dataset of well-known images of metal surface defects captured with an RGB camera. Defects must be detected early to take timely corrective action due to production concerns. For image classification up till now, a model-based method has been utilized, which indicated the predicted reflection characteristics of surface defects in comparison to flaw-free surfaces. The problem of detecting steel surface defects has grown in importance as a result of the vast range of steel applications in end-product sectors such as automobiles, households, construction, etc. The manual processes for detections are time-consuming, labor-intensive, and expensive. Different strategies have been used to automate manual processes, but CNN models have proven to be the most effective rather than image processing and machine learning techniques. By using different CNN models with fine-tuning, easily compare their performance and select the best-performing model for the same kinds of tasks. However, it is important that using different CNN models either from fine tuning can be computationally expensive and time-consuming. Therefore, our study helps the upcoming researchers to choose the CNN without considering the issues of model complexity, performance, and computational resources. In this article, the performance of various CNN models with transfer learning techniques are evaluated. These models were chosen based on their popularity and impact in the field of computer vision research, as well as their performance on benchmark datasets. According to the outcomes, DenseNet201 outperformed the other CNN models and had the greatest detection rate on the NEU dataset, falling in at 98.37 percent. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.14498 [pdf, other]

LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors

Authors: Sheikh Asif Imran, Mohammad Nur Hossain Khan, Subrata Biswas, Bashima Islam

Abstract: Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpret… ▽ More Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpreting and responding to activity and motion analysis queries. Our evaluation demonstrates LLaSA's effectiveness in activity classification and question answering, highlighting its potential in healthcare, sports science, and human-computer interaction. These contributions advance sensor-aware language models and open new research avenues. Our code repository and datasets can be found on https://github.com/BASHLab/LLaSA. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Under review at ARR (for EMNLP 2024)

arXiv:2406.14370 [pdf, other]

Enhanced Bank Check Security: Introducing a Novel Dataset and Transformer-Based Approach for Detection and Verification

Authors: Muhammad Saif Ullah Khan, Tahira Shehzadi, Rabeya Noor, Didier Stricker, Muhammad Zeshan Afzal

Abstract: Automated signature verification on bank checks is critical for fraud prevention and ensuring transaction authenticity. This task is challenging due to the coexistence of signatures with other textual and graphical elements on real-world documents. Verification systems must first detect the signature and then validate its authenticity, a dual challenge often overlooked by current datasets and meth… ▽ More Automated signature verification on bank checks is critical for fraud prevention and ensuring transaction authenticity. This task is challenging due to the coexistence of signatures with other textual and graphical elements on real-world documents. Verification systems must first detect the signature and then validate its authenticity, a dual challenge often overlooked by current datasets and methodologies focusing only on verification. To address this gap, we introduce a novel dataset specifically designed for signature verification on bank checks. This dataset includes a variety of signature styles embedded within typical check elements, providing a realistic testing ground for advanced detection methods. Moreover, we propose a novel approach for writer-independent signature verification using an object detection network. Our detection-based verification method treats genuine and forged signatures as distinct classes within an object detection framework, effectively handling both detection and verification. We employ a DINO-based network augmented with a dilation module to detect and verify signatures on check images simultaneously. Our approach achieves an AP of 99.2 for genuine and 99.4 for forged signatures, a significant improvement over the DINO baseline, which scored 93.1 and 89.3 for genuine and forged signatures, respectively. This improvement highlights our dilation module's effectiveness in reducing both false positives and negatives. Our results demonstrate substantial advancements in detection-based signature verification technology, offering enhanced security and efficiency in financial document processing. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: Accepted for publication in 16th IAPR International Workshop on Document Analysis Systems 2024

arXiv:2406.13439 [pdf, other]

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra

Abstract: Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework d… ▽ More Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs, that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50\% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at https://github.com/AI4Bharat/FBI. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13302 [pdf, other]

Situational Instructions Database: Task Guidance in Dynamic Environments

Authors: Muhammad Saif Ullah Khan, Sankalp Sinha, Didier Stricker, Muhammad Zeshan Afzal

Abstract: The Situational Instructions Database (SID) addresses the need for enhanced situational awareness in artificial intelligence (AI) systems operating in dynamic environments. By integrating detailed scene graphs with dynamically generated, task-specific instructions, SID provides a novel dataset that allows AI systems to perform complex, real-world tasks with improved context sensitivity and operati… ▽ More The Situational Instructions Database (SID) addresses the need for enhanced situational awareness in artificial intelligence (AI) systems operating in dynamic environments. By integrating detailed scene graphs with dynamically generated, task-specific instructions, SID provides a novel dataset that allows AI systems to perform complex, real-world tasks with improved context sensitivity and operational accuracy. This dataset leverages advanced generative models to simulate a variety of realistic scenarios based on the 3D Semantic Scene Graphs (3DSSG) dataset, enriching it with scenario-specific information that details environmental interactions and tasks. SID facilitates the development of AI applications that can adapt to new and evolving conditions without extensive retraining, supporting research in autonomous technology and AI-driven decision-making processes. This dataset is instrumental in developing robust, context-aware AI agents capable of effectively navigating and responding to unpredictable settings. Available for research and development, SID serves as a critical resource for advancing the capabilities of intelligent systems in complex environments. Dataset available at \url{https://github.com/mindgarage/situational-instructions-database}. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: 9 pages, 6 figures

arXiv:2406.08775 [pdf, other]

ALINA: Advanced Line Identification and Notation Algorithm

Authors: Mohammed Abdul Hafeez Khan, Parth Ganeriwala, Siddhartha Bhattacharyya, Natasha Neogi, Raja Muthalagu

Abstract: Labels are the cornerstone of supervised machine learning algorithms. Most visual recognition methods are fully supervised, using bounding boxes or pixel-wise segmentations for object localization. Traditional labeling methods, such as crowd-sourcing, are prohibitive due to cost, data privacy, amount of time, and potential errors on large datasets. To address these issues, we propose a novel annot… ▽ More Labels are the cornerstone of supervised machine learning algorithms. Most visual recognition methods are fully supervised, using bounding boxes or pixel-wise segmentations for object localization. Traditional labeling methods, such as crowd-sourcing, are prohibitive due to cost, data privacy, amount of time, and potential errors on large datasets. To address these issues, we propose a novel annotation framework, Advanced Line Identification and Notation Algorithm (ALINA), which can be used for labeling taxiway datasets that consist of different camera perspectives and variable weather attributes (sunny and cloudy). Additionally, the CIRCular threshoLd pixEl Discovery And Traversal (CIRCLEDAT) algorithm has been proposed, which is an integral step in determining the pixels corresponding to taxiway line markings. Once the pixels are identified, ALINA generates corresponding pixel coordinate annotations on the frame. Using this approach, 60,249 frames from the taxiway dataset, AssistTaxi have been labeled. To evaluate the performance, a context-based edge map (CBEM) set was generated manually based on edge features and connectivity. The detection rate after testing the annotated labels with the CBEM set was recorded as 98.45%, attesting its dependability and effectiveness. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: Paper has been accepted to The 3rd CVPR Workshop on Vision Datasets Understanding, 2024

arXiv:2406.08344 [pdf, other]

Blind Image Deblurring using FFT-ReLU with Deep Learning Pipeline Integration

Authors: Abdul Mohaimen Al Radi, Prothito Shovon Majumder, Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Md. Haider Ali, Md. Mosaddek Khan

Abstract: Blind image deblurring is the process of deriving a sharp image and a blur kernel from a blurred image. Blurry images are typically modeled as the convolution of a sharp image with a blur kernel, necessitating the estimation of the unknown blur kernel to perform blind image deblurring effectively. Existing approaches primarily focus on domain-specific features of images, such as salient edges, dar… ▽ More Blind image deblurring is the process of deriving a sharp image and a blur kernel from a blurred image. Blurry images are typically modeled as the convolution of a sharp image with a blur kernel, necessitating the estimation of the unknown blur kernel to perform blind image deblurring effectively. Existing approaches primarily focus on domain-specific features of images, such as salient edges, dark channels, and light streaks. These features serve as probabilistic priors to enhance the estimation of the blur kernel. For improved generality, we propose a novel prior (ReLU sparsity prior) that estimates blur kernel effectively across all distributions of images (natural, facial, text, low-light, saturated etc). Our approach demonstrates superior efficiency, with inference times up to three times faster, while maintaining high accuracy in PSNR, SSIM, and error ratio metrics. We also observe noticeable improvement in the performance of the state-of-the-art architectures (in terms of aforementioned metrics) in deep learning based approaches when our method is used as a post-processing unit. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: 20 pages, 13 figures

arXiv:2406.06533 [pdf, other]

Pragmatic Formal Verification Methodology for Clock Domain Crossing (CDC)

Authors: Aman Kumar, Muhammad Ul Haque Khan, Bijitendra Mittra

Abstract: Modern System-on-Chip (SoC) designs are becoming more and more complex due to the technology upscaling. SoC designs often operate on multiple asynchronous clock domains, further adding to the complexity of the overall design. To make the devices power efficient, designers take a Globally-Asynchronous Locally-Synchronous (GALS) approach that creates multiple asynchronous domains. These Clock Domain… ▽ More Modern System-on-Chip (SoC) designs are becoming more and more complex due to the technology upscaling. SoC designs often operate on multiple asynchronous clock domains, further adding to the complexity of the overall design. To make the devices power efficient, designers take a Globally-Asynchronous Locally-Synchronous (GALS) approach that creates multiple asynchronous domains. These Clock Domain Crossings (CDC) are prone to metastability effects, and functional verification of such CDC is very important to ensure that no bug escapes. Conventional verification methods, such as register transfer level (RTL) simulations and static timing analysis, are not enough to address these CDC issues, which may lead to verification gaps. Additionally, identifying these CDC-related bugs is very time-consuming and is one of the most common reasons for costly silicon re-spins. This paper is focused on the development of a pragmatic formal verification methodology to minimize the CDC issues by exercising Metastability Injection (MSI) in different CDC paths. △ Less

Submitted 20 April, 2024; originally announced June 2024.

Comments: Published in DVCon Europe 2023

arXiv:2406.06500 [pdf, ps, other]

Adaptive Opponent Policy Detection in Multi-Agent MDPs: Real-Time Strategy Switch Identification Using Running Error Estimation

Authors: Mohidul Haque Mridul, Mohammad Foysal Khan, Redwan Ahmed Rizvee, Md Mosaddek Khan

Abstract: In Multi-agent Reinforcement Learning (MARL), accurately perceiving opponents' strategies is essential for both cooperative and adversarial contexts, particularly within dynamic environments. While Proximal Policy Optimization (PPO) and related algorithms such as Actor-Critic with Experience Replay (ACER), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradient (DDPG) perfo… ▽ More In Multi-agent Reinforcement Learning (MARL), accurately perceiving opponents' strategies is essential for both cooperative and adversarial contexts, particularly within dynamic environments. While Proximal Policy Optimization (PPO) and related algorithms such as Actor-Critic with Experience Replay (ACER), Trust Region Policy Optimization (TRPO), and Deep Deterministic Policy Gradient (DDPG) perform well in single-agent, stationary environments, they suffer from high variance in MARL due to non-stationary and hidden policies of opponents, leading to diminished reward performance. Additionally, existing methods in MARL face significant challenges, including the need for inter-agent communication, reliance on explicit reward information, high computational demands, and sampling inefficiencies. These issues render them less effective in continuous environments where opponents may abruptly change their policies without prior notice. Against this background, we present OPS-DeMo (Online Policy Switch-Detection Model), an online algorithm that employs dynamic error decay to detect changes in opponents' policies. OPS-DeMo continuously updates its beliefs using an Assumed Opponent Policy (AOP) Bank and selects corresponding responses from a pre-trained Response Policy Bank. Each response policy is trained against consistently strategizing opponents, reducing training uncertainty and enabling the effective use of algorithms like PPO in multi-agent environments. Comparative assessments show that our approach outperforms PPO-trained models in dynamic scenarios like the Predator-Prey setting, providing greater robustness to sudden policy shifts and enabling more informed decision-making through precise opponent policy insights. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2405.20363 [pdf, other]

LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild

Authors: Zhiqiang Wang, Dejia Xu, Rana Muhammad Shahroz Khan, Yanbin Lin, Zhiwen Fan, Xingquan Zhu

Abstract: Image geolocation is a critical task in various image-understanding applications. However, existing methods often fail when analyzing challenging, in-the-wild images. Inspired by the exceptional background knowledge of multimodal language models, we systematically evaluate their geolocation capabilities using a novel image dataset and a comprehensive evaluation framework. We first collect images f… ▽ More Image geolocation is a critical task in various image-understanding applications. However, existing methods often fail when analyzing challenging, in-the-wild images. Inspired by the exceptional background knowledge of multimodal language models, we systematically evaluate their geolocation capabilities using a novel image dataset and a comprehensive evaluation framework. We first collect images from various countries via Google Street View. Then, we conduct training-free and training-based evaluations on closed-source and open-source multi-modal language models. we conduct both training-free and training-based evaluations on closed-source and open-source multimodal language models. Our findings indicate that closed-source models demonstrate superior geolocation abilities, while open-source models can achieve comparable performance through fine-tuning. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: 7 pages, 3 figures, 5 tables, CVPR 2024 Workshop on Computer Vision in the Wild

arXiv:2405.20084 [pdf, other]

Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach

Authors: Muhammad Saif Ullah Khan, Dhavalkumar Limbachiya, Didier Stricker, Muhammad Zeshan Afzal

Abstract: Human pose estimation is a key task in computer vision with various applications such as activity recognition and interactive systems. However, the lack of consistency in the annotated skeletons across different datasets poses challenges in developing universally applicable models. To address this challenge, we propose a novel approach integrating multi-teacher knowledge distillation with a unifie… ▽ More Human pose estimation is a key task in computer vision with various applications such as activity recognition and interactive systems. However, the lack of consistency in the annotated skeletons across different datasets poses challenges in developing universally applicable models. To address this challenge, we propose a novel approach integrating multi-teacher knowledge distillation with a unified skeleton representation. Our networks are jointly trained on the COCO and MPII datasets, containing 17 and 16 keypoints, respectively. We demonstrate enhanced adaptability by predicting an extended set of 21 keypoints, 4 (COCO) and 5 (MPII) more than original annotations, improving cross-dataset generalization. Our joint models achieved an average accuracy of 70.89 and 76.40, compared to 53.79 and 55.78 when trained on a single dataset and evaluated on both. Moreover, we also evaluate all 21 predicted points by our two models by reporting an AP of 66.84 and 72.75 on the Halpe dataset. This highlights the potential of our technique to address one of the most pressing challenges in pose estimation research and application - the inconsistency in skeletal annotations. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: 15 pages (with references)

arXiv:2405.17520 [pdf, other]

Advancing Medical Image Segmentation with Mini-Net: A Lightweight Solution Tailored for Efficient Segmentation of Medical Images

Authors: Syed Javed, Tariq M. Khan, Abdul Qayyum, Arcot Sowmya, Imran Razzak

Abstract: Accurate segmentation of anatomical structures and abnormalities in medical images is crucial for computer-aided diagnosis and analysis. While deep learning techniques excel at this task, their computational demands pose challenges. Additionally, some cutting-edge segmentation methods, though effective for general object segmentation, may not be optimised for medical images. To address these issue… ▽ More Accurate segmentation of anatomical structures and abnormalities in medical images is crucial for computer-aided diagnosis and analysis. While deep learning techniques excel at this task, their computational demands pose challenges. Additionally, some cutting-edge segmentation methods, though effective for general object segmentation, may not be optimised for medical images. To address these issues, we propose Mini-Net, a lightweight segmentation network specifically designed for medical images. With fewer than 38,000 parameters, Mini-Net efficiently captures both high- and low-frequency features, enabling real-time applications in various medical imaging scenarios. We evaluate Mini-Net on various datasets, including DRIVE, STARE, ISIC-2016, ISIC-2018, and MoNuSeg, demonstrating its robustness and good performance compared to state-of-the-art methods. △ Less

Submitted 12 July, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.15563 [pdf]

Heterogeneous virus classification using a functional deep learning model based on transmission electron microscopy images (Preprint)

Authors: Niloy Sikder, Md. Al-Masrur Khan, Anupam Kumar Bairagi, Mehedi Masud, Jun Jiat Tiang, Abdullah-Al Nahid

Abstract: Viruses are submicroscopic agents that can infect all kinds of lifeforms and use their hosts' living cells to replicate themselves. Despite having some of the simplest genetic structures among all living beings, viruses are highly adaptable, resilient, and given the right conditions, are capable of causing unforeseen complications in their hosts' bodies. Due to their multiple transmission pathways… ▽ More Viruses are submicroscopic agents that can infect all kinds of lifeforms and use their hosts' living cells to replicate themselves. Despite having some of the simplest genetic structures among all living beings, viruses are highly adaptable, resilient, and given the right conditions, are capable of causing unforeseen complications in their hosts' bodies. Due to their multiple transmission pathways, high contagion rate, and lethality, viruses are the biggest biological threat faced by animal and plant species. It is often challenging to promptly detect the presence of a virus in a possible host's body and accurately determine its type using manual examination techniques; however, it can be done using computer-based automatic diagnosis methods. Most notably, the analysis of Transmission Electron Microscopy (TEM) images has been proven to be quite successful in instant virus identification. Using TEM images collected from a recently published dataset, this article proposes a deep learning-based classification model to identify the type of virus within those images correctly. The methodology of this study includes two coherent image processing techniques to reduce the noise present in the raw microscopy images. Experimental results show that it can differentiate among the 14 types of viruses present in the dataset with a maximum of 97.44% classification accuracy and F1-score, which asserts the effectiveness and reliability of the proposed method. Implementing this scheme will impart a fast and dependable way of virus identification subsidiary to the thorough diagnostic procedures. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.14497 [pdf, other]

Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment

Authors: Muhammad Sohail Danish, Muhammad Haris Khan, Muhammad Akhtar Munir, M. Saquib Sarfraz, Mohsen Ali

Abstract: In this work, we tackle the problem of domain generalization for object detection, specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly, we demonstrate that by carefully selecting a set o… ▽ More In this work, we tackle the problem of domain generalization for object detection, specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly, we demonstrate that by carefully selecting a set of augmentations, a base detector can outperform existing methods for single domain generalization by a good margin. This highlights the importance of domain diversification in improving the performance of object detectors. Secondly, we introduce a method to align detections from multiple views, considering both classification and localization outputs. This alignment procedure leads to better generalized and well-calibrated object detector models, which are crucial for accurate decision-making in safety-critical applications. Our approach is detector-agnostic and can be seamlessly applied to both single-stage and two-stage detectors. To validate the effectiveness of our proposed methods, we conduct extensive experiments and ablations on challenging domain-shift scenarios. The results consistently demonstrate the superiority of our approach compared to existing methods. Our code and models are available at: https://github.com/msohaildanish/DivAlign △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.13518 [pdf, other]

PerSense: Personalized Instance Segmentation in Dense Images

Authors: Muhammad Ibraheem Siddiqui, Muhammad Umer Sheikh, Hassan Abid, Muhammad Haris Khan

Abstract: Leveraging large-scale pre-training, vision foundational models showcase notable performance benefits. While recent years have witnessed significant advancements in segmentation algorithms, existing models still face challenges to automatically segment personalized instances in dense and crowded scenarios. The primary factor behind this limitation stems from bounding box-based detections, which ar… ▽ More Leveraging large-scale pre-training, vision foundational models showcase notable performance benefits. While recent years have witnessed significant advancements in segmentation algorithms, existing models still face challenges to automatically segment personalized instances in dense and crowded scenarios. The primary factor behind this limitation stems from bounding box-based detections, which are constrained by occlusions, background clutter, and object orientation, particularly when dealing with dense images. To this end, we propose PerSense, an end-to-end, training-free, and model-agnostic one-shot framework to address the personalized instance segmentation in dense images. Towards developing this framework, we make following core contributions. (a) We propose an Instance Detection Module (IDM) and leverage a Vision-Language Model, a grounding object detector, and a few-shot object counter (FSOC) to realize a new baseline. (b) To tackle false positives within candidate point prompts, we design Point Prompt Selection Module (PPSM). Both IDM and PPSM transform density maps from FSOC into personalized instance-level point prompts for segmentation and offer a seamless integration in our model-agnostic framework. (c) We introduce a feedback mechanism which enables PerSense to harness the full potential of FSOC by automating the exemplar selection process. (d) To promote algorithmic advances and effective tools for this relatively underexplored task, we introduce PerSense-D, a dataset exclusive to personalized instance segmentation in dense images. We validate the effectiveness of PerSense on the task of personalized instance segmentation in dense images on PerSense-D and comparison with SOTA. Additionally, our qualitative findings demonstrate the adaptability of our framework to images captured in-the-wild. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: Technical report of PerSense

arXiv:2405.12427 [pdf, other]

Deep learning approaches to indoor wireless channel estimation for low-power communication

Authors: Samrah Arif, Muhammad Arif Khan, Sabih Ur Rehman

Abstract: In the rapidly growing development of the Internet of Things (IoT) infrastructure, achieving reliable wireless communication is a challenge. IoT devices operate in diverse environments with common signal interference and fluctuating channel conditions. Accurate channel estimation helps adapt the transmission strategies to current conditions, ensuring reliable communication. Traditional methods, su… ▽ More In the rapidly growing development of the Internet of Things (IoT) infrastructure, achieving reliable wireless communication is a challenge. IoT devices operate in diverse environments with common signal interference and fluctuating channel conditions. Accurate channel estimation helps adapt the transmission strategies to current conditions, ensuring reliable communication. Traditional methods, such as Least Squares (LS) and Minimum Mean Squared Error (MMSE) estimation techniques, often struggle to adapt to the diverse and complex environments typical of IoT networks. This research article delves into the potential of Deep Learning (DL) to enhance channel estimation, focusing on the Received Signal Strength Indicator (RSSI) metric - a critical yet challenging aspect due to its susceptibility to noise and environmental factors. This paper presents two Fully Connected Neural Networks (FCNNs)-based Low Power (LP-IoT) channel estimation models, leveraging RSSI for accurate channel estimation in LP-IoT communication. Our Model A exhibits a remarkable 99.02% reduction in Mean Squared Error (MSE), and Model B demonstrates a notable 90.03% MSE reduction compared to the benchmarks set by current studies. Additionally, the comparative studies of our model A with other DL-based techniques show significant efficiency in our estimation models. △ Less

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.08252 [pdf, other]

Smart Sampling: Self-Attention and Bootstrapping for Improved Ensembled Q-Learning

Authors: Muhammad Junaid Khan, Syed Hammad Ahmed, Gita Sukthankar

Abstract: We present a novel method aimed at enhancing the sample efficiency of ensemble Q learning. Our proposed approach integrates multi-head self-attention into the ensembled Q networks while bootstrapping the state-action pairs ingested by the ensemble. This not only results in performance improvements over the original REDQ (Chen et al. 2021) and its variant DroQ (Hi-raoka et al. 2022), thereby enhanc… ▽ More We present a novel method aimed at enhancing the sample efficiency of ensemble Q learning. Our proposed approach integrates multi-head self-attention into the ensembled Q networks while bootstrapping the state-action pairs ingested by the ensemble. This not only results in performance improvements over the original REDQ (Chen et al. 2021) and its variant DroQ (Hi-raoka et al. 2022), thereby enhancing Q predictions, but also effectively reduces both the average normalized bias and standard deviation of normalized bias within Q-function ensembles. Importantly, our method also performs well even in scenarios with a low update-to-data (UTD) ratio. Notably, the implementation of our proposed method is straightforward, requiring minimal modifications to the base model. △ Less

Submitted 13 May, 2024; originally announced May 2024.

Comments: FLAIRS-37 (2024)

arXiv:2405.06835 [pdf, other]

Automating Code Adaptation for MLOps -- A Benchmarking Study on LLMs

Authors: Harsh Patel, Buvaneswari A. Ramanan, Manzoor A. Khan, Thomas Williams, Brian Friedman, Lawrence Drabeck

Abstract: This paper explores the possibilities of the current generation of Large Language Models for incorporating Machine Learning Operations (MLOps) functionalities into ML training code bases. We evaluate the performance of OpenAI (gpt-3.5-turbo) and WizardCoder (open-source, 15B parameters) models on the automated accomplishment of various MLOps functionalities in different settings. We perform a benc… ▽ More This paper explores the possibilities of the current generation of Large Language Models for incorporating Machine Learning Operations (MLOps) functionalities into ML training code bases. We evaluate the performance of OpenAI (gpt-3.5-turbo) and WizardCoder (open-source, 15B parameters) models on the automated accomplishment of various MLOps functionalities in different settings. We perform a benchmarking study that assesses the ability of these models to: (1) adapt existing code samples (Inlining) with component-specific MLOps functionality such as MLflow and Weights & Biases for experiment tracking, Optuna for hyperparameter optimization etc., and (2) perform the task of Translation from one component of an MLOps functionality to another, e.g., translating existing GitPython library based version control code to Data Version Control library based. We also propose three different approaches that involve teaching LLMs to comprehend the API documentation of the components as a reference while accomplishing the Translation tasks. In our evaluations, the gpt-3.5-turbo model significantly outperforms WizardCoder by achieving impressive Pass@3 accuracy in model optimization (55% compared to 0% by WizardCoder), experiment tracking (100%, compared to 62.5% by WizardCoder), model registration (92% compared to 42% by WizardCoder) and hyperparameter optimization (83% compared to 58% by WizardCoder) on average, in their best possible settings, showcasing its superior code adaptability performance in complex MLOps tasks. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: The work was completed during 2Q, 3Q of Year 2023, when WizardCoder was the top performing Open source LLM for coding. Newer and better models have emerged since then. The processes and methodologies utilized for this benchmarking can still be utilized for evaluating the current SoTA models

arXiv:2405.06128 [pdf, other]

Enhanced Multimodal Content Moderation of Children's Videos using Audiovisual Fusion

Authors: Syed Hammad Ahmed, Muhammad Junaid Khan, Gita Sukthankar

Abstract: Due to the rise in video content creation targeted towards children, there is a need for robust content moderation schemes for video hosting platforms. A video that is visually benign may include audio content that is inappropriate for young children while being impossible to detect with a unimodal content moderation system. Popular video hosting platforms for children such as YouTube Kids still p… ▽ More Due to the rise in video content creation targeted towards children, there is a need for robust content moderation schemes for video hosting platforms. A video that is visually benign may include audio content that is inappropriate for young children while being impossible to detect with a unimodal content moderation system. Popular video hosting platforms for children such as YouTube Kids still publish videos which contain audio content that is not conducive to a child's healthy behavioral and physical development. A robust classification of malicious videos requires audio representations in addition to video features. However, recent content moderation approaches rarely employ multimodal architectures that explicitly consider non-speech audio cues. To address this, we present an efficient adaptation of CLIP (Contrastive Language-Image Pre-training) that can leverage contextual audio cues for enhanced content moderation. We incorporate 1) the audio modality and 2) prompt learning, while keeping the backbone modules of each modality frozen. We conduct our experiments on a multimodal version of the MOB (Malicious or Benign) dataset in supervised and few-shot settings. △ Less

Submitted 9 May, 2024; originally announced May 2024.

Comments: 8 pages, 3 figures, Accepted at The 37th International FLAIRS Conference

arXiv:2405.04735 [pdf, other]

Cryptanalysis of the SIMON Cypher Using Neo4j

Authors: Jonathan Cook, Sabih ur Rehman, M. Arif Khan

Abstract: The exponential growth in the number of Internet of Things (IoT) devices has seen the introduction of several Lightweight Encryption Algorithms (LEA). While LEAs are designed to enhance the integrity, privacy and security of data collected and transmitted by IoT devices, it is hazardous to assume that all LEAs are secure and exhibit similar levels of protection. To improve encryption strength, cry… ▽ More The exponential growth in the number of Internet of Things (IoT) devices has seen the introduction of several Lightweight Encryption Algorithms (LEA). While LEAs are designed to enhance the integrity, privacy and security of data collected and transmitted by IoT devices, it is hazardous to assume that all LEAs are secure and exhibit similar levels of protection. To improve encryption strength, cryptanalysts and algorithm designers routinely probe LEAs using various cryptanalysis techniques to identify vulnerabilities and limitations of LEAs. Despite recent improvements in the efficiency of cryptanalysis utilising heuristic methods and a Partial Difference Distribution Table (PDDT), the process remains inefficient, with the random nature of the heuristic inhibiting reproducible results. However, the use of a PDDT presents opportunities to identify relationships between differentials utilising knowledge graphs, leading to the identification of efficient paths throughout the PDDT. This paper introduces the novel use of knowledge graphs to identify intricate relationships between differentials in the SIMON LEA, allowing for the identification of optimal paths throughout the differentials, and increasing the effectiveness of the differential security analyses of SIMON. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: 10 pages, 10 figures, 2 algorithms, accepted by the 4th International Conference on Electrical, Computer and Energy Technologies (ICECET) to be presented in July 2024

arXiv:2405.03660 [pdf, other]

CICA: Content-Injected Contrastive Alignment for Zero-Shot Document Image Classification

Authors: Sankalp Sinha, Muhammad Saif Ullah Khan, Talha Uddin Sheikh, Didier Stricker, Muhammad Zeshan Afzal

Abstract: Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in th… ▽ More Zero-shot learning has been extensively investigated in the broader field of visual recognition, attracting significant interest recently. However, the current work on zero-shot learning in document image classification remains scarce. The existing studies either focus exclusively on zero-shot inference, or their evaluation does not align with the established criteria of zero-shot evaluation in the visual recognition domain. We provide a comprehensive document image classification analysis in Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings to address this gap. Our methodology and evaluation align with the established practices of this domain. Additionally, we propose zero-shot splits for the RVL-CDIP dataset. Furthermore, we introduce CICA (pronounced 'ki-ka'), a framework that enhances the zero-shot learning capabilities of CLIP. CICA consists of a novel 'content module' designed to leverage any generic document-related textual information. The discriminative features extracted by this module are aligned with CLIP's text and image features using a novel 'coupled-contrastive' loss. Our module improves CLIP's ZSL top-1 accuracy by 6.7% and GZSL harmonic mean by 24% on the RVL-CDIP dataset. Our module is lightweight and adds only 3.3% more parameters to CLIP. Our work sets the direction for future research in zero-shot document classification. △ Less

Submitted 6 May, 2024; originally announced May 2024.

Comments: 18 Pages, 4 Figures and Accepted in ICDAR 2024

arXiv:2405.01310 [pdf, other]

Overcoming LLM Challenges using RAG-Driven Precision in Coffee Leaf Disease Remediation

Authors: Dr. Selva Kumar S, Afifah Khan Mohammed Ajmal Khan, Imadh Ajaz Banday, Manikantha Gada, Vibha Venkatesh Shanbhag

Abstract: This research introduces an innovative AI-driven precision agriculture system, leveraging YOLOv8 for disease identification and Retrieval Augmented Generation (RAG) for context-aware diagnosis. Focused on addressing the challenges of diseases affecting the coffee production sector in Karnataka, The system integrates sophisticated object detection techniques with language models to address the inhe… ▽ More This research introduces an innovative AI-driven precision agriculture system, leveraging YOLOv8 for disease identification and Retrieval Augmented Generation (RAG) for context-aware diagnosis. Focused on addressing the challenges of diseases affecting the coffee production sector in Karnataka, The system integrates sophisticated object detection techniques with language models to address the inherent constraints associated with Large Language Models (LLMs). Our methodology not only tackles the issue of hallucinations in LLMs, but also introduces dynamic disease identification and remediation strategies. Real-time monitoring, collaborative dataset expansion, and organizational involvement ensure the system's adaptability in diverse agricultural settings. The effect of the suggested system extends beyond automation, aiming to secure food supplies, protect livelihoods, and promote eco-friendly farming practices. By facilitating precise disease identification, the system contributes to sustainable and environmentally conscious agriculture, reducing reliance on pesticides. Looking to the future, the project envisions continuous development in RAG-integrated object detection systems, emphasizing scalability, reliability, and usability. This research strives to be a beacon for positive change in agriculture, aligning with global efforts toward sustainable and technologically enhanced food production. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: 6 pages, 3 figures

arXiv:2405.00025 [pdf, other]

Leveraging Pre-trained CNNs for Efficient Feature Extraction in Rice Leaf Disease Classification

Authors: Md. Shohanur Islam Sobuj, Md. Imran Hossen, Md. Foysal Mahmud, Mahbub Ul Islam Khan

Abstract: Rice disease classification is a critical task in agricultural research, and in this study, we rigorously evaluate the impact of integrating feature extraction methodologies within pre-trained convolutional neural networks (CNNs). Initial investigations into baseline models, devoid of feature extraction, revealed commendable performance with ResNet-50 and ResNet-101 achieving accuracies of 91% and… ▽ More Rice disease classification is a critical task in agricultural research, and in this study, we rigorously evaluate the impact of integrating feature extraction methodologies within pre-trained convolutional neural networks (CNNs). Initial investigations into baseline models, devoid of feature extraction, revealed commendable performance with ResNet-50 and ResNet-101 achieving accuracies of 91% and 92%, respectively. Subsequent integration of Histogram of Oriented Gradients (HOG) yielded substantial improvements across architectures, notably propelling the accuracy of EfficientNet-B7 from 92\% to an impressive 97%. Conversely, the application of Local Binary Patterns (LBP) demonstrated more conservative performance enhancements. Moreover, employing Gradient-weighted Class Activation Mapping (Grad-CAM) unveiled that HOG integration resulted in heightened attention to disease-specific features, corroborating the performance enhancements observed. Visual representations further validated HOG's notable influence, showcasing a discernible surge in accuracy across epochs due to focused attention on disease-affected regions. These results underscore the pivotal role of feature extraction, particularly HOG, in refining representations and bolstering classification accuracy. The study's significant highlight was the achievement of 97% accuracy with EfficientNet-B7 employing HOG and Grad-CAM, a noteworthy advancement in optimizing pre-trained CNN-based rice disease identification systems. The findings advocate for the strategic integration of advanced feature extraction techniques with cutting-edge pre-trained CNN architectures, presenting a promising avenue for substantially augmenting the precision and effectiveness of image-based disease classification systems in agricultural contexts. △ Less

Submitted 26 February, 2024; originally announced May 2024.

arXiv:2404.15337 [pdf, other]

RSSI Estimation for Constrained Indoor Wireless Networks using ANN

Authors: Samrah Arif, M. Arif Khan, Sabih Ur Rehman

Abstract: In the expanding field of the Internet of Things (IoT), wireless channel estimation is a significant challenge. This is specifically true for low-power IoT (LP-IoT) communication, where efficiency and accuracy are extremely important. This research establishes two distinct LP-IoT wireless channel estimation models using Artificial Neural Networks (ANN): a Feature-based ANN model and a Sequence-bas… ▽ More In the expanding field of the Internet of Things (IoT), wireless channel estimation is a significant challenge. This is specifically true for low-power IoT (LP-IoT) communication, where efficiency and accuracy are extremely important. This research establishes two distinct LP-IoT wireless channel estimation models using Artificial Neural Networks (ANN): a Feature-based ANN model and a Sequence-based ANN model. Both models have been constructed to enhance LP-IoT communication by lowering the estimation error in the LP-IoT wireless channel. The Feature-based model aims to capture complex patterns of measured Received Signal Strength Indicator (RSSI) data using environmental characteristics. The Sequence-based approach utilises predetermined categorisation techniques to estimate the RSSI sequence of specifically selected environment characteristics. The findings demonstrate that our suggested approaches attain remarkable precision in channel estimation, with an improvement in MSE of $88.29\%$ of the Feature-based model and $97.46\%$ of the Sequence-based model over existing research. Additionally, the comparative analysis of these techniques with traditional and other Deep Learning (DL)-based techniques also highlights the superior performance of our developed models and their potential in real-world IoT applications. △ Less

Submitted 9 April, 2024; originally announced April 2024.

arXiv:2404.14955 [pdf, other]

A Comprehensive Survey for Hyperspectral Image Classification: The Evolution from Conventional to Transformers

Authors: Muhammad Ahmad, Salvatore Distifano, Adil Mehmood Khan, Manuel Mazzara, Chenyu Li, Jing Yao, Hao Li, Jagannath Aryal, Gemine Vivone, Danfeng Hong

Abstract: Hyperspectral Image Classification (HSC) is a challenging task due to the high dimensionality and complex nature of Hyperspectral (HS) data. Traditional Machine Learning approaches while effective, face challenges in real-world data due to varying optimal feature sets, subjectivity in human-driven design, biases, and limitations. Traditional approaches encounter the curse of dimensionality, strugg… ▽ More Hyperspectral Image Classification (HSC) is a challenging task due to the high dimensionality and complex nature of Hyperspectral (HS) data. Traditional Machine Learning approaches while effective, face challenges in real-world data due to varying optimal feature sets, subjectivity in human-driven design, biases, and limitations. Traditional approaches encounter the curse of dimensionality, struggle with feature selection and extraction, lack spatial information consideration, exhibit limited robustness to noise, face scalability issues, and may not adapt well to complex data distributions. In recent years, DL techniques have emerged as powerful tools for addressing these challenges. This survey provides a comprehensive overview of the current trends and future prospects in HSC, focusing on the advancements from DL models to the emerging use of Transformers. We review the key concepts, methodologies, and state-of-the-art approaches in DL for HSC. We explore the potential of Transformer-based models in HSC, outlining their benefits and challenges. We also delve into emerging trends in HSC, as well as thorough discussions on Explainable AI and Interoperability concepts along with Diffusion Models (image denoising, feature extraction, and image fusion). Additionally, we address several open challenges and research questions pertinent to HSC. Comprehensive experimental results have been undertaken using three HS datasets to verify the efficacy of various conventional DL models and Transformers. Finally, we outline future research directions and potential applications that can further enhance the accuracy and efficiency of HSC. The Source code is available at \url{https://github.com/mahmad00/Conventional-to-Transformer-for-Hyperspectral-Image-Classification-Survey-2024}. △ Less

Submitted 12 June, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.12957 [pdf, other]

Towards Reliable Latent Knowledge Estimation in LLMs: In-Context Learning vs. Prompting Based Factual Knowledge Extraction

Authors: Qinyuan Wu, Mohammad Aflah Khan, Soumi Das, Vedant Nanda, Bishwamittra Ghosh, Camila Kolling, Till Speicher, Laurent Bindschaedler, Krishna P. Gummadi, Evimaria Terzi

Abstract: We propose an approach for estimating the latent knowledge embedded inside large language models (LLMs). We leverage the in-context learning (ICL) abilities of LLMs to estimate the extent to which an LLM knows the facts stored in a knowledge base. Our knowledge estimator avoids reliability concerns with previous prompting-based methods, is both conceptually simpler and easier to apply, and we demo… ▽ More We propose an approach for estimating the latent knowledge embedded inside large language models (LLMs). We leverage the in-context learning (ICL) abilities of LLMs to estimate the extent to which an LLM knows the facts stored in a knowledge base. Our knowledge estimator avoids reliability concerns with previous prompting-based methods, is both conceptually simpler and easier to apply, and we demonstrate that it can surface more of the latent knowledge embedded in LLMs. We also investigate how different design choices affect the performance of ICL-based knowledge estimation. Using the proposed estimator, we perform a large-scale evaluation of the factual knowledge of a variety of open source LLMs, like OPT, Pythia, Llama(2), Mistral, Gemma, etc. over a large set of relations and facts from the Wikidata knowledge base. We observe differences in the factual knowledge between different model families and models of different sizes, that some relations are consistently better known than others but that models differ in the precise facts they know, and differences in the knowledge of base models and their finetuned counterparts. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.09342 [pdf, other]

Face-voice Association in Multilingual Environments (FAME) Challenge 2024 Evaluation Plan

Authors: Muhammad Saad Saeed, Shah Nawaz, Muhammad Salman Tahir, Rohan Kumar Das, Muhammad Zaigham Zaheer, Marta Moscati, Markus Schedl, Muhammad Haris Khan, Karthik Nandakumar, Muhammad Haroon Yousaf

Abstract: The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2… ▽ More The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under a unique condition of multilingual scenario. This condition is inspired from the fact that half of the world's population is bilingual and most often people communicate under multilingual scenario. The challenge uses a dataset namely, Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines and task details for the FAME Challenge. △ Less

Submitted 16 April, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: ACM Multimedia Conference - Grand Challenge

arXiv:2404.08168 [pdf, other]

Conformal Prediction via Regression-as-Classification

Authors: Etash Guha, Shlok Natarajan, Thomas Möllenhoff, Mohammad Emtiyaz Khan, Eugene Ndiaye

Abstract: Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in reality, such approaches can be sensitive to estimation error and yield unstable intervals.~Here, we circumvent the challenges by converting regression to a classifica… ▽ More Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in reality, such approaches can be sensitive to estimation error and yield unstable intervals.~Here, we circumvent the challenges by converting regression to a classification problem and then use CP for classification to obtain CP sets for regression.~To preserve the ordering of the continuous-output space, we design a new loss function and make necessary modifications to the CP classification techniques.~Empirical results on many benchmarks shows that this simple approach gives surprisingly good results on many practical problems. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: International Conference of Learning Representations 2024

Journal ref: International Conference of Learning Representations 2024

arXiv:2404.08165 [pdf, other]

Lightweight Cryptanalysis of IoT Encryption Algorithms : Is Quota Sampling the Answer?

Authors: Jonathan Cook, Sabih ur Rehman, M. Arif Khan

Abstract: Rapid growth in the number of small sensor devices known as the Internet of Things (IoT) has seen the development of lightweight encryption algorithms. Two well-known lightweight algorithms are SIMON and SIMECK which have been specifically designed for use on resource-constrained IoT devices. These lightweight encryption algorithms are based on the efficient Feistel block structure which is known… ▽ More Rapid growth in the number of small sensor devices known as the Internet of Things (IoT) has seen the development of lightweight encryption algorithms. Two well-known lightweight algorithms are SIMON and SIMECK which have been specifically designed for use on resource-constrained IoT devices. These lightweight encryption algorithms are based on the efficient Feistel block structure which is known to exhibit vulnerabilities to differential cryptanalysis. Consequently, it is necessary to test these algorithms for resilience against such attacks. While existing state-of-the-art research has demonstrated novel heuristic methods of differential cryptanalysis that improve time efficiency on previous techniques, the large state sizes of these encryption algorithms inhibit cryptanalysis time efficiency. In this paper, we introduce Versatile Investigative Sampling Technique for Advanced Cryptanalysis (VISTA-CRYPT) - a time-efficient enhancement of differential cryptanalysis of lightweight encryption algorithms. The proposed technique introduces a simple framework of quota sampling that produces state-of-the-art results with time reductions of up to $76\%$ over existing techniques. Further, we present a preliminary graph-based analysis of the output differentials for the identification of relationships within the data and future research opportunities to further enhance the performance of differential cryptanalysis. The code designed for this work and associated datasets will be available at https://github.com/johncook1979/simon-cryptanalysis. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: 24 pages, 21 figures, 7 tables

arXiv:2404.08024 [pdf, other]

The OxMat dataset: a multimodal resource for the development of AI-driven technologies in maternal and newborn child health

Authors: M. Jaleed Khan, Ioana Duta, Beth Albert, William Cooke, Manu Vatish, Gabriel Davis Jones

Abstract: The rapid advancement of Artificial Intelligence (AI) in healthcare presents a unique opportunity for advancements in obstetric care, particularly through the analysis of cardiotocography (CTG) for fetal monitoring. However, the effectiveness of such technologies depends upon the availability of large, high-quality datasets that are suitable for machine learning. This paper introduces the Oxford M… ▽ More The rapid advancement of Artificial Intelligence (AI) in healthcare presents a unique opportunity for advancements in obstetric care, particularly through the analysis of cardiotocography (CTG) for fetal monitoring. However, the effectiveness of such technologies depends upon the availability of large, high-quality datasets that are suitable for machine learning. This paper introduces the Oxford Maternity (OxMat) dataset, the world's largest curated dataset of CTGs, featuring raw time series CTG data and extensive clinical data for both mothers and babies, which is ideally placed for machine learning. The OxMat dataset addresses the critical gap in women's health data by providing over 177,211 unique CTG recordings from 51,036 pregnancies, carefully curated and reviewed since 1991. The dataset also comprises over 200 antepartum, intrapartum and postpartum clinical variables, ensuring near-complete data for crucial outcomes such as stillbirth and acidaemia. While this dataset also covers the intrapartum stage, around 94% of the constituent CTGS are antepartum. This allows for a unique focus on the underserved antepartum period, in which early detection of at-risk fetuses can significantly improve health outcomes. Our comprehensive review of existing datasets reveals the limitations of current datasets: primarily, their lack of sufficient volume, detailed clinical data and antepartum data. The OxMat dataset lays a foundation for future AI-driven prenatal care, offering a robust resource for developing and testing algorithms aimed at improving maternal and fetal health outcomes. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.05508 [pdf, other]

Synergy of Large Language Model and Model Driven Engineering for Automated Development of Centralized Vehicular Systems

Authors: Nenad Petrovic, Fengjunjie Pan, Krzysztof Lebioda, Vahid Zolfaghari, Sven Kirchner, Nils Purschke, Muhammad Aqib Khan, Viktor Vorobev, Alois Knoll

Abstract: We present a prototype of a tool leveraging the synergy of model driven engineering (MDE) and Large Language Models (LLM) for the purpose of software development process automation in the automotive industry. In this approach, the user-provided input is free form textual requirements, which are first translated to Ecore model instance representation using an LLM, which is afterwards checked for co… ▽ More We present a prototype of a tool leveraging the synergy of model driven engineering (MDE) and Large Language Models (LLM) for the purpose of software development process automation in the automotive industry. In this approach, the user-provided input is free form textual requirements, which are first translated to Ecore model instance representation using an LLM, which is afterwards checked for consistency using Object Constraint Language (OCL) rules. After successful consistency check, the model instance is fed as input to another LLM for the purpose of code generation. The generated code is evaluated in a simulated environment using CARLA simulator connected to an example centralized vehicle architecture, in an emergency brake scenario. △ Less

Submitted 8 April, 2024; originally announced April 2024.

Report number: TUM-I24109 ACM Class: D.2.1; D.2.2; D.2.4; I.2.7; I.2.2; I.7.0

arXiv:2404.01878 [pdf, other]

Real, fake and synthetic faces -- does the coin have three sides?

Authors: Shahzeb Naeem, Ramzi Al-Sharawi, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Hasan Al-Nashash

Abstract: With the ever-growing power of generative artificial intelligence, deepfake and artificially generated (synthetic) media have continued to spread online, which creates various ethical and moral concerns regarding their usage. To tackle this, we thus present a novel exploration of the trends and patterns observed in real, deepfake and synthetic facial images. The proposed analysis is done in two pa… ▽ More With the ever-growing power of generative artificial intelligence, deepfake and artificially generated (synthetic) media have continued to spread online, which creates various ethical and moral concerns regarding their usage. To tackle this, we thus present a novel exploration of the trends and patterns observed in real, deepfake and synthetic facial images. The proposed analysis is done in two parts: firstly, we incorporate eight deep learning models and analyze their performances in distinguishing between the three classes of images. Next, we look to further delve into the similarities and differences between these three sets of images by investigating their image properties both in the context of the entire image as well as in the context of specific regions within the image. ANOVA test was also performed and provided further clarity amongst the patterns associated between the images of the three classes. From our findings, we observe that the investigated deeplearning models found it easier to detect synthetic facial images, with the ViT Patch-16 model performing best on this task with a class-averaged sensitivity, specificity, precision, and accuracy of 97.37%, 98.69%, 97.48%, and 98.25%, respectively. This observation was supported by further analysis of various image properties. We saw noticeable differences across the three category of images. This analysis can help us build better algorithms for facial image generation, and also shows that synthetic, deepfake and real face images are indeed three different classes. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.01438 [pdf]

Generation and Detection of Sign Language Deepfakes -- A Linguistic and Visual Analysis

Authors: Shahzeb Naeem, Muhammad Riyyan Khan, Usman Tariq, Abhinav Dhall, Carlos Ivan Colon, Hasan Al-Nashash

Abstract: A question in the realm of deepfakes is slowly emerging pertaining to whether we can go beyond facial deepfakes and whether it would be beneficial to society. Therefore, this research presents a positive application of deepfake technology in upper body generation, while performing sign-language for the Deaf and Hard of Hearing (DHoH) community. The resulting videos are later vetted with a sign lan… ▽ More A question in the realm of deepfakes is slowly emerging pertaining to whether we can go beyond facial deepfakes and whether it would be beneficial to society. Therefore, this research presents a positive application of deepfake technology in upper body generation, while performing sign-language for the Deaf and Hard of Hearing (DHoH) community. The resulting videos are later vetted with a sign language expert. This is particularly helpful, given the intricate nature of sign language, a scarcity of sign language experts, and potential benefits for health and education. The objectives of this work encompass constructing a reliable deepfake dataset, evaluating its technical and visual credibility through computer vision and natural language processing models, and assessing the plausibility of the generated content. With over 1200 videos, featuring both previously seen and unseen individuals for the generation model, using the help of a sign language expert, we establish a deepfake dataset in sign language that can further be utilized to detect fake videos that may target certain people of determination. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: 13 pages, 13 figures, Computer Vision and Image Understanding Journal

arXiv:2404.00068 [pdf, other]

A Data-Driven Predictive Analysis on Cyber Security Threats with Key Risk Factors

Authors: Fatama Tuz Johora, Md Shahedul Islam Khan, Esrath Kanon, Mohammad Abu Tareq Rony, Md Zubair, Iqbal H. Sarker

Abstract: Cyber risk refers to the risk of defacing reputation, monetary losses, or disruption of an organization or individuals, and this situation usually occurs by the unconscious use of cyber systems. The cyber risk is unhurriedly increasing day by day and it is right now a global threat. Developing countries like Bangladesh face major cyber risk challenges. The growing cyber threat worldwide focuses on… ▽ More Cyber risk refers to the risk of defacing reputation, monetary losses, or disruption of an organization or individuals, and this situation usually occurs by the unconscious use of cyber systems. The cyber risk is unhurriedly increasing day by day and it is right now a global threat. Developing countries like Bangladesh face major cyber risk challenges. The growing cyber threat worldwide focuses on the need for effective modeling to predict and manage the associated risk. This paper exhibits a Machine Learning(ML) based model for predicting individuals who may be victims of cyber attacks by analyzing socioeconomic factors. We collected the dataset from victims and non-victims of cyberattacks based on socio-demographic features. The study involved the development of a questionnaire to gather data, which was then used to measure the significance of features. Through data augmentation, the dataset was expanded to encompass 3286 entries, setting the stage for our investigation and modeling. Among several ML models with 19, 20, 21, and 26 features, we proposed a novel Pertinent Features Random Forest (RF) model, which achieved maximum accuracy with 20 features (95.95\%) and also demonstrated the association among the selected features using the Apriori algorithm with Confidence (above 80\%) according to the victim. We generated 10 important association rules and presented the framework that is rigorously evaluated on real-world datasets, demonstrating its potential to predict cyberattacks and associated risk factors effectively. Looking ahead, future efforts will be directed toward refining the predictive model's precision and delving into additional risk factors, to fortify the proposed framework's efficacy in navigating the complex terrain of cybersecurity threats. △ Less

Submitted 28 March, 2024; originally announced April 2024.

Comments: The paper contains 15 pages, 7 tables and 6 figures

arXiv:2403.19949 [pdf, other]

FairCLIP: Harnessing Fairness in Vision-Language Learning

Authors: Yan Luo, Min Shi, Muhammad Osama Khan, Muhammad Muneeb Afzal, Hao Huang, Shuaihang Yuan, Yu Tian, Luo Song, Ava Kouhana, Tobias Elze, Yi Fang, Mengyu Wang

Abstract: Fairness is a critical concern in deep learning, especially in healthcare, where these models influence diagnoses and treatment decisions. Although fairness has been investigated in the vision-only domain, the fairness of medical vision-language (VL) models remains unexplored due to the scarcity of medical VL datasets for studying fairness. To bridge this research gap, we introduce the first fair… ▽ More Fairness is a critical concern in deep learning, especially in healthcare, where these models influence diagnoses and treatment decisions. Although fairness has been investigated in the vision-only domain, the fairness of medical vision-language (VL) models remains unexplored due to the scarcity of medical VL datasets for studying fairness. To bridge this research gap, we introduce the first fair vision-language medical dataset Harvard-FairVLMed that provides detailed demographic attributes, ground-truth labels, and clinical notes to facilitate an in-depth examination of fairness within VL foundation models. Using Harvard-FairVLMed, we conduct a comprehensive fairness analysis of two widely-used VL models (CLIP and BLIP2), pre-trained on both natural and medical domains, across four different protected attributes. Our results highlight significant biases in all VL models, with Asian, Male, Non-Hispanic, and Spanish being the preferred subgroups across the protected attributes of race, gender, ethnicity, and language, respectively. In order to alleviate these biases, we propose FairCLIP, an optimal-transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distributions corresponding to each demographic group. As the first VL dataset of its kind, Harvard-FairVLMed holds the potential to catalyze advancements in the development of machine learning models that are both ethically aware and clinically effective. Our dataset and code are available at https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k. △ Less

Submitted 5 April, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: CVPR 2024

arXiv:2403.16194 [pdf, other]

Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery

Authors: Siddharth Tourani, Ahmed Alwheibi, Arif Mahmood, Muhammad Haris Khan

Abstract: Unsupervised landmarks discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework, we explore the potential of a recent paradigm of self-supervised learning algorithms, known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. Towards harnessing the potential of d… ▽ More Unsupervised landmarks discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework, we explore the potential of a recent paradigm of self-supervised learning algorithms, known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. Towards harnessing the potential of diffusion models for the ULD task, we make the following core contributions. First, we propose a ZeroShot ULD baseline based on simple clustering of random pixel locations with nearest neighbour matching. It delivers better results than existing ULD methods. Second, motivated by the ZeroShot performance, we develop a ULD algorithm based on diffusion features using self-training and clustering which also outperforms prior methods by notable margins. Third, we introduce a new proxy task based on generating latent pose codes and also propose a two-stage clustering mechanism to facilitate effective pseudo-labeling, resulting in a significant performance improvement. Overall, our approach consistently outperforms state-of-the-art methods on four challenging benchmarks AFLW, MAFL, CatHeads and LS3D by significant margins. △ Less

Submitted 24 March, 2024; originally announced March 2024.

Comments: Accepted in CVPR 2024

Showing 1–50 of 644 results for author: Khan, M