Search | arXiv e-print repository

arXiv:2405.20485 [pdf, other]

Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

Authors: Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A. Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, Alina Oprea

Abstract: Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs) in chatbot applications, enabling developers to adapt and personalize the LLM output without expensive training or fine-tuning. RAG systems use an external knowledge database to retrieve the most relevant documents for a given query, providing this context to the LLM generator. While RAG achieves impressive utility in many applications, its adoption to enable personalized generative models introduces new security risks. In this work, we propose new attack surfaces for an adversary to compromise a victim's RAG system, by injecting a single malicious document in its knowledge database. We design Phantom, a general two-step attack framework against RAG augmented LLMs. The first step involves crafting a poisoned document designed to be retrieved by the RAG system within the top-k results only when an adversarial trigger, a specific sequence of words acting as a backdoor, is present in the victim's queries. In the second step, a specially crafted adversarial string within the poisoned document triggers various adversarial attacks in the LLM generator, including denial of service, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2404.02181 [pdf, other]

Leveraging Machine Learning for Early Autism Detection via INDT-ASD Indian Database

Authors: Trapti Shrivastava, Harshal Chaudhari, Vrijendra Singh

Abstract: Machine learning (ML) has advanced quickly, particularly throughout the area of health care. The diagnosis of neurodevelopment problems using ML is a very important area of healthcare. Autism spectrum disorder (ASD) is one of the developmental disorders that is growing the fastest globally. The clinical screening tests used to identify autistic symptoms are expensive and time-consuming. But now that ML has been advanced, it's feasible to identify autism early on. Previously, many different techniques have been used in investigations. Still, none of them have produced the anticipated outcomes when it comes to the capacity to predict autistic features utilizing a clinically validated Indian ASD database. Therefore, this study aimed to develop a simple, quick, and inexpensive technique for identifying ASD by using ML. Various machine learning classifiers, including Adaboost (AB), Gradient Boost (GB), Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), Gaussian Naive Bayes (GNB), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), K-Nearest Neighbors (KNN), and Support Vector Machine (SVM), were used to develop the autism prediction model. The proposed method was tested with records from the AIIMS Modified INDT-ASD (AMI) database, which were collected through an application developed by AIIMS in Delhi, India. Feature engineering has been applied to make the proposed solution easier than already available solutions.
Using the proposed model, we succeeded in predicting ASD using a minimized set of 20 questions rather than the 28 questions presented in AMI with promising accuracy. In a comparative evaluation, SVM emerged as the superior model among others, with 100 $\pm$ 0.05\% accuracy, higher recall by 5.34\%, and improved accuracy by 2.22\%-6.67\% over RF. We have also introduced a web-based solution supporting both Hindi and English.

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2401.00170 [pdf, ps, other]

doi 10.1145/3632754.3632764

L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models

Authors: Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar, Raviraj Joshi

Abstract: This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in Marathi informal text. The L3Cube-MahaSocialNER dataset offers user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set thus highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP

Submitted 30 December, 2023; originally announced January 2024.

Comments: Accepted at Forum for Information Retrieval Evaluation (FIRE 2023)

arXiv:2312.01306 [pdf, other]

doi 10.1007/978-981-99-6550-2_37

On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi

Authors: Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar, Raviraj Joshi, Sachin Pande

Abstract: Named Entity Recognition (NER) systems play a vital role in NLP applications such as machine translation, summarization, and question-answering. These systems identify named entities, which encompass real-world concepts like locations, persons, and organizations. Despite extensive research on NER systems for the English language, they have not received adequate attention in the context of low resource languages. In this work, we focus on NER for low-resource language and present our case study in the context of the Indian language Marathi. The advancement of NLP research revolves around the utilization of pre-trained transformer models such as BERT for the development of NER models. However, we focus on improving the performance of shallow models based on CNN, and LSTM by combining the best of both worlds. In the era of transformers, these traditional deep learning models are still relevant because of their high computational efficiency. We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models. We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT. We show the importance of using sub-word tokenization for NER and present our study toward building efficient NLP systems. The evaluation is performed on L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT, and mBERT.

Submitted 3 December, 2023; originally announced December 2023.

Comments: Accepted at ICDAM 2023

arXiv:2311.10005 [pdf, other]

Towards Flexibility and Robustness of LSM Trees

Authors: Andy Huynh, Harshal A. Chaudhari, Evimaria Terzi, Manos Athanassoulis

Abstract: Log-Structured Merge trees (LSM trees) are increasingly used as part of the storage engine behind several data systems, and are frequently deployed in the cloud. As the number of applications relying on LSM-based storage backends increases, the problem of performance tuning of LSM trees receives increasing attention. We consider both nominal tunings - where workload and execution environment are accurately known a priori - and robust tunings - which consider uncertainty in the workload knowledge. This type of workload uncertainty is common in modern applications, notably in shared infrastructure environments like the public cloud. To address this problem, we introduce ENDURE, a new paradigm for tuning LSM trees in the presence of workload uncertainty. Specifically, we focus on the impact of the choice of compaction policy, size ratio, and memory allocation on the overall performance. ENDURE considers a robust formulation of the throughput maximization problem and recommends a tuning that offers near-optimal throughput when the executed workload is not the same, instead in a neighborhood of the expected workload. Additionally, we explore the robustness of flexible LSM designs by proposing a new unified design called K-LSM that encompasses existing designs. We deploy our robust tuning system, ENDURE, on a state-of-the-art key-value store, RocksDB, and demonstrate throughput improvements of up to 5x in the presence of uncertainty.
Our results indicate that the tunings obtained by ENDURE are more robust than tunings obtained under our expanded LSM design space. This indicates that robustness may not be inherent to a design, instead, it is an outcome of a tuning process that explicitly accounts for uncertainty.

Submitted 16 November, 2023; originally announced November 2023.

Comments: 25 pages, 19 figures, VLDB-J. arXiv admin note: substantial text overlap with arXiv:2110.13801

arXiv:2310.03838 [pdf, other]

Chameleon: Increasing Label-Only Membership Leakage with Adaptive Poisoning

Authors: Harsh Chaudhari, Giorgio Severi, Alina Oprea, Jonathan Ullman

Abstract: The integration of machine learning (ML) in numerous critical applications introduces a range of privacy concerns for individuals who provide their datasets for model training. One such privacy risk is Membership Inference (MI), in which an attacker seeks to determine whether a particular data sample was included in the training dataset of a model. Current state-of-the-art MI attacks capitalize on access to the model's predicted confidence scores to successfully perform membership inference, and employ data poisoning to further enhance their effectiveness. In this work, we focus on the less explored and more realistic label-only setting, where the model provides only the predicted label on a queried sample. We show that existing label-only MI attacks are ineffective at inferring membership in the low False Positive Rate (FPR) regime. To address this challenge, we propose a new attack Chameleon that leverages a novel adaptive data poisoning strategy and an efficient query selection method to achieve significantly more accurate membership inference than existing label-only attacks, especially at low FPRs.

Submitted 16 January, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: To appear at International Conference on Learning Representations (ICLR) 2024

arXiv:2303.00989 [pdf, other]

Role of modified cloud microphysics parameterization in coupled climate model for studying ISM rainfall: small-scale cloud model and climate model work better together

Authors: Moumita Bhowmik, Anupam Hazra, Ankur Srivastava, Dipjyoti Mudiar, Hemantkumar S. Chaudhari, Suryachandra A. Rao, Lian-Ping Wang

Abstract: An unresolved problem of present generation coupled climate models is the realistic distribution of rainfall over Indian monsoon region, which is also related to the persistent dry bias over Indian land mass. Therefore, quantitative prediction of the intensity of rainfall events has remained a challenge for the state-of-the-art global coupled models. Guided by the observation, it is hypothesized that insufficient growth of cloud droplets and processes responsible for the cloud to rain water conversion are key components to distinguish between shallow to convective clouds. The new diffusional growth rates and relative dispersion based autoconversion from the Eulerian-Lagrangian particle-by-particle based small-scale model provide a pathway to revisit the parameterizations in climate models for monsoon clouds. The realistic information of cloud drop size distribution is incorporated in the microphysical parameterization scheme of climate model. Two sensitivity simulations are conducted using coupled forecast system (CFSv2) model. When our physically based small-scale derived modified parameterization is used, a coupled climate model simulates the probability distribution (PDF) of rainfall and accompanying specific humidity, liquid water content, and outgoing long-wave radiation (OLR) with increasing accuracy. The improved simulation of rainfall PDF appears to have been aided by much improved simulation of OLR and resulted in better simulation of the ISM rainfall.

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2208.12348 [pdf, other]

SNAP: Efficient Extraction of Private Properties with Poisoning

Authors: Harsh Chaudhari, John Abascal, Alina Oprea, Matthew Jagielski, Florian Tramèr, Jonathan Ullman

Abstract: Property inference attacks allow an adversary to extract global properties of the training dataset from a machine learning model. Such attacks have privacy implications for data owners sharing their datasets to train machine learning models. Several existing approaches for property inference attacks against deep neural networks have been proposed, but they all rely on the attacker training a large number of shadow models, which induces a large computational overhead. In this paper, we consider the setting of property inference attacks in which the attacker can poison a subset of the training dataset and query the trained target model. Motivated by our theoretical analysis of model confidences under poisoning, we design an efficient property inference attack, SNAP, which obtains higher attack success and requires lower amounts of poisoning than the state-of-the-art poisoning-based property inference attack by Mahloujifar et al. For example, on the Census dataset, SNAP achieves 34% higher success rate than Mahloujifar et al. while being 56.5x faster. We also extend our attack to infer whether a certain property was present at all during training and estimate the exact proportion of a property of interest efficiently. We evaluate our attack on several properties of varying proportions from four datasets and demonstrate SNAP's generality and effectiveness. An open-source implementation of SNAP can be found at https://github.com/johnmath/snap-sp23.

Submitted 21 June, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

Comments: 28 pages, 16 figures

arXiv:2205.09986 [pdf, other]

SafeNet: The Unreasonable Effectiveness of Ensembles in Private Collaborative Learning

Authors: Harsh Chaudhari, Matthew Jagielski, Alina Oprea

Abstract: Secure multiparty computation (MPC) has been proposed to allow multiple mutually distrustful data owners to jointly train machine learning (ML) models on their combined data. However, by design, MPC protocols faithfully compute the training functionality, which the adversarial ML community has shown to leak private information and can be tampered with in poisoning attacks. In this work, we argue that model ensembles, implemented in our framework called SafeNet, are a highly MPC-amenable way to avoid many adversarial ML attacks. The natural partitioning of data amongst owners in MPC training allows this approach to be highly scalable at training time, provide provable protection from poisoning attacks, and provably defend against a number of privacy attacks. We demonstrate SafeNet's efficiency, accuracy, and resilience to poisoning on several machine learning datasets and models trained in end-to-end and transfer learning scenarios. For instance, SafeNet reduces backdoor attack success significantly, while achieving $39\times$ faster training and $36 \times$ less communication than the four-party MPC framework of Dalskov et al. Our experiments show that ensembling retains these benefits even in many non-iid settings. The simplicity, cheap setup, and robustness properties of ensembling make it a strong first choice for training ML models privately in MPC.

Submitted 8 September, 2022; v1 submitted 20 May, 2022; originally announced May 2022.

arXiv:2110.13801 [pdf, other]

Endure: A Robust Tuning Paradigm for LSM Trees Under Workload Uncertainty

Authors: Andy Huynh, Harshal A. Chaudhari, Evimaria Terzi, Manos Athanassoulis

Abstract: Log-Structured Merge trees (LSM trees) are increasingly used as the storage engines behind several data systems, frequently deployed in the cloud. Similar to other database architectures, LSM trees take into account information about the expected workload (e.g., reads vs. writes, point vs. range queries) to optimize their performance via tuning. Operating in shared infrastructure like the cloud, however, comes with a degree of workload uncertainty due to multi-tenancy and the fast-evolving nature of modern applications. Systems with static tuning discount the variability of such hybrid workloads and hence provide an inconsistent and overall suboptimal performance. To address this problem, we introduce Endure - a new paradigm for tuning LSM trees in the presence of workload uncertainty. Specifically, we focus on the impact of the choice of compaction policies, size-ratio, and memory allocation on the overall performance. Endure considers a robust formulation of the throughput maximization problem, and recommends a tuning that maximizes the worst-case throughput over a neighborhood of each expected workload. Additionally, an uncertainty tuning parameter controls the size of this neighborhood, thereby allowing the output tunings to be conservative or optimistic. Through both model-based and extensive experimental evaluation of Endure in the state-of-the-art LSM-based storage engine, RocksDB, we show that the robust tuning methodology consistently outperforms classical tuning strategies.
We benchmark Endure using 15 workload templates that generate more than 10000 unique noisy workloads. The robust tunings output by Endure lead up to a 5$\times$ improvement in throughput in presence of uncertainty. On the flip side, when the observed workload exactly matches the expected one, Endure tunings have negligible performance loss.

Submitted 2 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

Comments: 21 pages, 30 figures

arXiv:2110.03956 [pdf]

doi 10.1029/2021GL096489

Seasonal Predictability of Lightning over the Global Hotspot Regions

Authors: Chandrima Mallick, Anupam Hazra, Subodh K. Saha, Hemantkumar S. Chaudhari, Samir Pokhrel, Mahen Konwar, Ushnanshu Dutta, Greeshma M. Mohan, K. Gayatri Vani

Abstract: Skillful seasonal prediction of lightning is crucial over several global hotspot regions, as it causes severe damages to infrastructures and losses of human life. While major emphasis has been given for predicting rainfall, prediction of lightning in one season advance remained uncommon, owing to the nature of the problem, which is short-lived local phenomenon. Here we show that on the seasonal time scale, lightning over the major global hot-spot regions is strongly tied with slowly varying global predictors (e.g., El Nino and Southern Oscillation). Moreover, the sub-seasonal variance of lightning is highly correlated with global predictors, suggesting a seminal role played by the global climate mode in shaping the local land-atmosphere interactions, which eventually affects seasonal lightning variability. It is shown that the seasonal predictability of lightning over the hotspot is comparable to that of seasonal rainfall, which opens up an avenue for reliable seasonal forecasting of lightning for special awareness and preventive measures. Keywords: Lightning, Seasonal forecasting, SST, Global predictors

Submitted 8 October, 2021; originally announced October 2021.

arXiv:2109.07122 [pdf]

doi 10.1016/j.gloplacha.2022.103873

Unraveling the Global Teleconnections of Indian Summer Monsoon Clouds: Expedition from CMIP5 to CMIP6

Authors: Ushnanshu Dutta, Anupam Hazra, Hemantkumar S. Chaudhari, Subodh Kumar Saha, Samir Pokhrel, Utkarsh Verma

Abstract: We have analyzed the teleconnection of total cloud fraction (TCF) with global sea surface temperature (SST) in multi-model ensembles (MME) of the fifth and sixth Coupled Model Intercomparison Projects (CMIP5 and CMIP6). CMIP6-MME has a more robust and realistic teleconnection (TCF and global SST) pattern over the extra-tropics (R ~0.43) and North Atlantic (R ~0.39) region, which in turn resulted in improvement of rainfall bias over the Asian summer monsoon (ASM) region. CMIP6-MME can better reproduce the mean TCF and have reduced dry (wet) rainfall bias on land (ocean) over the ASM region. CMIP6-MME has improved the biases of seasonal mean rainfall, TCF, and outgoing longwave radiation (OLR) over the Indian Summer Monsoon (ISM) region by ~40%, ~45%, and ~31%, respectively, than CMIP5-MME and demonstrates better spatial correlation with observation/reanalysis. Results establish the credibility of the CMIP6 models and provide a scientific basis for improving the seasonal prediction of ISM.

Submitted 20 September, 2021; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: 12 pages, 4 main figures, 2 supplementary figures

arXiv:2101.04521 [pdf]

Examining the variability of cloud hydrometeors and its importance on the Indian summer monsoon rainfall predictability

Authors: Ushnanshu Dutta, Anupam Hazra, Subodh Kumar Saha, Hemantkumar S. Chaudhari, Samir Pokhrel, Mahen Konwar

Abstract: Skilful prediction of the seasonal Indian summer monsoon (ISM) rainfall (ISMR) at least one season in advance has great socio-economic value. It represents a lifeline for about a sixth of the world's population. The ISMR prediction remained a challenging problem with the sub-critical skills of the dynamical models attributable to limited understanding of the interaction among clouds, convection, and circulation. The variability of cloud hydrometeors (cloud ice and cloud water) in different time scales (3-7 days, 10-20 days and 30-60 days bands) are examined from re-analysis data during Indian summer monsoon (ISM). Here, we also show that the 'internal' variability of cloud hydrometeors (particularly cloud ice) associated with the ISM sub-seasonal (synoptic + intra-seasonal) fluctuations is partly predictable as they are found to be tied with slowly varying forcing (e.g., El Niño and Southern Oscillation). The representation of deep convective clouds, which involve ice phase processes in a coupled climate model, strongly modulates ISMR variability in association with global predictors. The results from the two sensitivity simulations using coupled global climate model (CGCM) are provided to demonstrate the importance of the cloud hydrometeors on ISM rainfall predictability. Therefore, this study provides a scientific basis for improving the simulation of the seasonal ISMR by improving the physical processes of the cloud on a sub-seasonal time scale and motivating further research in this direction.

Submitted 12 January, 2021; originally announced January 2021.

Comments: 36 Pages, 14 figures

arXiv:2009.02423 [pdf, other]

A General Framework for Fairness in Multistakeholder Recommendations

Authors: Harshal A. Chaudhari, Sangdi Lin, Ondrej Linda

Abstract: Contemporary recommender systems act as intermediaries on multi-sided platforms serving high utility recommendations from sellers to buyers. Such systems attempt to balance the objectives of multiple stakeholders including sellers, buyers, and the platform itself. The difficulty in providing recommendations that maximize the utility for a buyer, while simultaneously representing all the sellers on the platform has led to many interesting research problems. Traditionally, they have been formulated as integer linear programs which compute recommendations for all the buyers together in an \emph{offline} fashion, by incorporating coverage constraints so that the individual sellers are proportionally represented across all the recommended items. Such approaches can lead to unforeseen biases wherein certain buyers consistently receive low utility recommendations in order to meet the global seller coverage constraints. To remedy this situation, we propose a general formulation that incorporates seller coverage objectives alongside individual buyer objectives in a real-time personalized recommender system. In addition, we leverage highly scalable submodular optimization algorithms to provide recommendations to each buyer with provable theoretical quality bounds. Furthermore, we empirically evaluate the efficacy of our approach using data from an online real-estate marketplace.

Submitted 4 September, 2020; originally announced September 2020.

Comments: 7 pages, 3 figures

ACM Class: I.2.1

arXiv:2006.10904 [pdf, other]

Learn to Earn: Enabling Coordination within a Ride Hailing Fleet

Authors: Harshal A. Chaudhari, John W. Byers, Evimaria Terzi

Abstract: The problem of optimizing social welfare objectives on multi sided ride hailing platforms such as Uber, Lyft, etc., is challenging, due to misalignment of objectives between drivers, passengers, and the platform itself. An ideal solution aims to minimize the response time for each hyper local passenger ride request, while simultaneously maintaining high demand satisfaction and supply utilization across the entire city. Economists tend to rely on dynamic pricing mechanisms that stifle price sensitive excess demand and resolve the supply demand imbalances emerging in specific neighborhoods. In contrast, computer scientists primarily view it as a demand prediction problem with the goal of preemptively repositioning supply to such neighborhoods using black box coordinated multi agent deep reinforcement learning based approaches. Here, we introduce explainability in the existing supply repositioning approaches by establishing the need for coordination between the drivers at specific locations and times. Explicit need based coordination allows our framework to use a simpler non deep reinforcement learning based approach, thereby enabling it to explain its recommendations ex post. Moreover, it provides envy free recommendations i.e., drivers at the same location and time do not envy one another's future earnings. Our experimental evaluation demonstrates the effectiveness, the robustness, and the generalizability of our framework.
Finally, in contrast to previous works, we make available a reinforcement learning environment for end to end reproducibility of our work and to encourage future comparative studies.

Submitted 16 July, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: 16 pages, 9 figures

MSC Class: 68T05 ACM Class: I.2; K.4; J.6

arXiv:1912.02631 [pdf, ps, other]

doi 10.14722/ndss.2020.23005

Trident: Efficient 4PC Framework for Privacy Preserving Machine Learning

Authors: Harsh Chaudhari, Rahul Rachuri, Ajith Suresh

Abstract: Machine learning has started to be deployed in fields such as healthcare and finance, which propelled the need for and growth of privacy-preserving machine learning (PPML). We propose an actively secure four-party protocol (4PC), and a framework for PPML, showcasing its applications on four of the most widely-known machine learning algorithms -- Linear Regression, Logistic Regression, Neural Networks, and Convolutional Neural Networks. Our 4PC protocol tolerating at most one malicious corruption is practically efficient as compared to the existing works. We use the protocol to build an efficient mixed-world framework (Trident) to switch between the Arithmetic, Boolean, and Garbled worlds. Our framework operates in the offline-online paradigm over rings and is instantiated in an outsourced setting for machine learning. Also, we propose conversions especially relevant to privacy-preserving machine learning. The highlights of our framework include using a minimal number of expensive circuits overall as compared to ABY3. This can be seen in our technique for truncation, which does not affect the online cost of multiplication and removes the need for any circuits in the offline phase. Our B2A conversion has an improvement of $\mathbf{7} \times$ in rounds and $\mathbf{18} \times$ in the communication complexity. The practicality of our framework is argued through improvements in the benchmarking of the aforementioned algorithms when compared with ABY3. All the protocols are implemented over a 64-bit ring in both LAN and WAN settings. Our improvements go up to $\mathbf{187} \times$ for the training phase and $\mathbf{158} \times$ for the prediction phase when observed over LAN and WAN.

Submitted 8 June, 2021; v1 submitted 5 December, 2019; originally announced December 2019.

Comments: This work appeared at the 26th Annual Network and Distributed System Security Symposium (NDSS) 2020. Update: An improved version of this framework is available at arXiv:2106.02850

arXiv:1912.02592 [pdf, other]

doi 10.1145/3338466.3358922

ASTRA: High Throughput 3PC over Rings with Application to Secure Prediction

Authors: Harsh Chaudhari, Ashish Choudhury, Arpita Patra, Ajith Suresh

Abstract: The concrete efficiency of secure computation has been the focus of many recent works. In this work, we present concretely-efficient protocols for secure $3$-party computation (3PC) over a ring of integers modulo $2^{\ell}$ tolerating one corruption, both with semi-honest and malicious security. Owing to the fact that computation over ring emulates computation over the real-world system architectures, secure computation over ring has gained momentum of late. Cast in the offline-online paradigm, our constructions present the most efficient online phase in concrete terms. In the semi-honest setting, our protocol requires communication of $2$ ring elements per multiplication gate during the {\it online} phase, attaining a per-party cost of {\em less than one element}. This is achieved for the first time in the regime of 3PC. In the {\it malicious} setting, our protocol requires communication of $4$ elements per multiplication gate during the online phase, beating the state-of-the-art protocol by $5$ elements. Realized with both the security notions of selective abort and fairness, the malicious protocol with fairness involves slightly more communication than its counterpart with abort security for the output gates {\em alone}. We apply our techniques from $3$PC in the regime of secure server-aided machine-learning (ML) inference for a range of prediction functions-- linear regression, linear SVM regression, logistic regression, and linear SVM classification. Our setting considers a model-owner with trained model parameters and a client with a query, with the latter willing to learn the prediction of her query based on the model parameters of the former. The inputs and computation are outsourced to a set of three non-colluding servers. Our constructions catering to both semi-honest and the malicious world, invariably perform better than the existing constructions.

Submitted 5 December, 2019; originally announced December 2019.

Comments: This article is the full and extended version of an article appeared in ACM CCSW 2019

arXiv:1809.00878 [pdf, other]

doi 10.1029/2018JD030082

Unraveling the Mystery of Indian Summer Monsoon Prediction: Improved Estimate of Predictability Limit

Authors: Subodh Kumar Saha, Anupam Hazra, Samir Pokhrel, Hemantkumar S. Chaudhari, K. Sujith, Archana Rai, Hasibur Rahaman, B. N. Goswami

Abstract: Large socio-economic impact of the Indian Summer Monsoon (ISM) extremes motivated numerous attempts at its long range prediction over the past century. However, a rather estimated low potential predictability limit (PPL) of seasonal prediction of the ISM, contributed significantly by 'internal' interannual variability was considered insurmountable. Here we show that the 'internal' variability contributed by the ISM sub-seasonal (synoptic + intra-seasonal) fluctuations, so far considered chaotic, is partly predictable as found to be tied to slowly varying forcing (e.g. El Nino and Southern Oscillation). This provides a scientific basis for predictability of the ISM rainfall beyond the conventional estimates of PPL. We establish a much higher actual limit of predictability (r~0.82) through an extensive re-forecast experiment (1920 years of simulation) by improving two major physics in a global coupled climate model, which raises a hope for a very reliable dynamical seasonal ISM forecasting in the near future.

Submitted 4 September, 2018; originally announced September 2018.

arXiv:1801.07722 [pdf, other]

doi 10.1137/1.9781611975321.50

Markov Chain Monitoring

Authors: Harshal A. Chaudhari, Michael Mathioudakis, Evimaria Terzi

Abstract: In networking applications, one often wishes to obtain estimates about the number of objects at different parts of the network (e.g., the number of cars at an intersection of a road network or the number of packets expected to reach a node in a computer network) by monitoring the traffic in a small number of network nodes or edges. We formalize this task by defining the 'Markov Chain Monitoring' problem. Given an initial distribution of items over the nodes of a Markov chain, we wish to estimate the distribution of items at subsequent times. We do this by asking a limited number of queries that retrieve, for example, how many items transitioned to a specific node or over a specific edge at a particular time. We consider different types of queries, each defining a different variant of the Markov chain monitoring. For each variant, we design efficient algorithms for choosing the queries that make our estimates as accurate as possible. In our experiments with synthetic and real datasets we demonstrate the efficiency and the efficacy of our algorithms in a variety of settings.

Submitted 23 January, 2018; originally announced January 2018.

Comments: 13 pages, 10 figures, 1 table

Showing 1–19 of 19 results for author: Chaudhari, H