Search | arXiv e-print repository

Showing 1–50 of 93 results for author: Raff, E

Searching in archive cs.
  1. arXiv:2407.14644  [pdf, other]

    cs.CL

    Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

    Authors: Nilanjana Das, Edward Raff, Manas Gaur

    Abstract: Previous research on testing the vulnerabilities in Large Language Models (LLMs) using adversarial attacks has primarily focused on nonsensical prompt injections, which are easily detected upon manual or automated review (e.g., via byte entropy). However, the exploration of innocuous human-understandable malicious prompts augmented with adversarial injections remains limited. In this research, we…

    Submitted 25 July, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

  2. arXiv:2407.06346  [pdf, other]

    cs.LG cs.DC stat.ML

    High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates

    Authors: Fred Lu, Ryan R. Curtin, Edward Raff, Francis Ferraro, James Holt

    Abstract: As the size of datasets used in statistical learning continues to grow, distributed training of models has attracted increasing attention. These methods partition the data and exploit parallelism to reduce memory and runtime, but suffer increasingly from communication costs as the data size or the number of iterations grows. Recent work on linear models has shown that a surrogate likelihood can be…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: KDD 2024, Research Track

  3. arXiv:2406.12058  [pdf, other]

    cs.AI cs.CL

    WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions

    Authors: Seyedali Mohammadi, Edward Raff, Jinendra Malekar, Vedant Palit, Francis Ferraro, Manas Gaur

    Abstract: Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelit…

    Submitted 28 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: 26 pages, including reference and appendix sections, 8 figures, and 16 tables

  4. arXiv:2406.01753  [pdf, other]

    cs.LG cs.DC stat.ML

    Optimizing the Optimal Weighted Average: Efficient Distributed Sparse Classification

    Authors: Fred Lu, Ryan R. Curtin, Edward Raff, Francis Ferraro, James Holt

    Abstract: While distributed training is often viewed as a solution to optimizing linear models on increasingly large datasets, inter-machine communication costs of popular distributed approaches can dominate as data dimensionality increases. Recent work on non-interactive algorithms shows that approximate solutions for linear models can be obtained efficiently with only a single round of communication among…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Under review

  5. arXiv:2405.03991  [pdf, other]

    cs.CR cs.LG

    Assemblage: Automatic Binary Dataset Construction for Machine Learning

    Authors: Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski

    Abstract: Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis…

    Submitted 7 May, 2024; originally announced May 2024.

  6. arXiv:2405.02228  [pdf, other]

    cs.CL cs.AI cs.IR

    REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs

    Authors: Deepa Tilwani, Yash Saxena, Ali Mohammadi, Edward Raff, Amit Sheth, Srinivasan Parthasarathy, Manas Gaur

    Abstract: Automatic citation generation for sentences in a document or report is paramount for intelligence analysts, cybersecurity, news agencies, and education personnel. In this research, we investigate whether large language models (LLMs) are capable of generating references based on two forms of sentence queries: (a) Direct Queries, LLMs are asked to provide author names of the given research article,…

    Submitted 8 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

    Comments: Work in progress

  7. arXiv:2404.01141  [pdf, other]

    cs.LG cs.CR stat.ML

    SoK: A Review of Differentially Private Linear Models For High-Dimensional Data

    Authors: Amol Khanna, Edward Raff, Nathan Inkawhich

    Abstract: Linear models are ubiquitous in data science, but are particularly prone to overfitting and data memorization in high dimensions. To guarantee the privacy of training data, differential privacy can be used. Many papers have proposed optimization techniques for high-dimensional differentially private linear models, but a systematic comparison between these methods does not exist. We close this gap…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: 21 pages, 7 figures. To be published at the 2nd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)

    ACM Class: I.2

  8. arXiv:2403.17978  [pdf, other]

    cs.CR cs.AI cs.LG stat.ML

    Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection

    Authors: Mohammad Mahmudul Alam, Edward Raff, Stella Biderman, Tim Oates, James Holt

    Abstract: Malware detection is an interesting and valuable domain to work in because it has significant real-world impact and unique machine-learning challenges. We investigate existing long-range techniques and benchmarks and find that they're not very suitable in this problem area. In this paper, we introduce Holographic Global Convolutional Networks (HGConv) that utilize the properties of Holographic Red…

    Submitted 23 March, 2024; originally announced March 2024.

    Comments: To appear in Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS) 2024, Valencia, Spain

  9. arXiv:2401.10176  [pdf, ps, other]

    cs.LG cs.CV

    Comprehensive OOD Detection Improvements

    Authors: Anish Lakkapragada, Amol Khanna, Edward Raff, Nathan Inkawhich

    Abstract: As machine learning becomes increasingly prevalent in impactful decisions, recognizing when inference data is outside the model's expected input distribution is paramount for giving context to predictions. Out-of-distribution (OOD) detection methods have been created for this task. Such methods can be split into representation-based or logit-based methods from whether they respectively utilize the…

    Submitted 18 January, 2024; originally announced January 2024.

  10. arXiv:2312.15813  [pdf, other]

    cs.LG

    Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits!

    Authors: Tirth Patel, Fred Lu, Edward Raff, Charles Nicholas, Cynthia Matuszek, James Holt

    Abstract: Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines, meaning a 0.1% change can cause an overwhelming number of false positives. However, academic research is often restrained to public datasets on the order of ten thousand samples and is too small to detect improvements that may be relevant to ind…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: To appear in Conference on Applied Machine Learning for Information Security 2023

  11. arXiv:2312.15310  [pdf, other]

    cs.CV cs.LG q-bio.NC

    Towards Generalization in Subitizing with Neuro-Symbolic Loss using Holographic Reduced Representations

    Authors: Mohammad Mahmudul Alam, Edward Raff, Tim Oates

    Abstract: While deep learning has enjoyed significant success in computer vision tasks over the past decade, many shortcomings still exist from a Cognitive Science (CogSci) perspective. In particular, the ability to subitize, i.e., quickly and accurately identify the small (less than 6) count of items, is not well learned by current Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) when usi…

    Submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted in 38th Annual AAAI Workshop on Neuro-Symbolic Learning and Reasoning in the Era of Large Language Models (NuCLeaR), 2024

  12. arXiv:2312.01242  [pdf, other]

    cs.LG cs.AI

    DDxT: Deep Generative Transformer Models for Differential Diagnosis

    Authors: Mohammad Mahmudul Alam, Edward Raff, Tim Oates, Cynthia Matuszek

    Abstract: Differential Diagnosis (DDx) is the process of identifying the most likely medical condition among the possible pathologies through the process of elimination based on evidence. An automated process that narrows a large set of pathologies down to the most likely pathologies will be of great importance. The primary prior works have relied on the Reinforcement Learning (RL) paradigm under the intuit…

    Submitted 2 December, 2023; originally announced December 2023.

    Comments: Accepted at 1st Workshop on Deep Generative Models for Health at NeurIPS 2023

  13. arXiv:2311.09228  [pdf, other]

    cs.CY

    Does Starting Deep Learning Homework Earlier Improve Grades?

    Authors: Edward Raff, Cynthia Matuszek

    Abstract: Intuitively, students who start a homework assignment earlier and spend more time on it should receive better grades on the assignment. However, existing literature on the impact of time spent on homework is not clear-cut and comes mostly from K-12 education. It is not clear that these prior studies can inform coursework in deep learning due to differences in demographics, as well as the computati…

    Submitted 30 September, 2023; originally announced November 2023.

    Comments: To appear in AI for AI Education, co-located with ECAI 2023

  14. arXiv:2310.19978  [pdf, other]

    cs.LG stat.CO stat.ML

    Scaling Up Differentially Private LASSO Regularized Logistic Regression via Faster Frank-Wolfe Iterations

    Authors: Edward Raff, Amol Khanna, Fred Lu

    Abstract: To the best of our knowledge, there are no methods today for training differentially private regression models on sparse input data. To remedy this, we adapt the Frank-Wolfe algorithm for $L_1$ penalized linear regression to be aware of sparse inputs and to use them effectively. In doing so, we reduce the training time of the algorithm from $\mathcal{O}( T D S + T N S)$ to…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: To appear in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  15. arXiv:2310.17867  [pdf, other]

    stat.ML cs.AI cs.LG

    Reproducibility in Multiple Instance Learning: A Case For Algorithmic Unit Tests

    Authors: Edward Raff, James Holt

    Abstract: Multiple Instance Learning (MIL) is a sub-domain of classification problems with positive and negative labels and a "bag" of inputs, where the label is positive if and only if a positive element is contained within the bag, and otherwise is negative. Training in this context requires associating the bag-wide label to instance-level information, and implicitly contains a causal assumption and asymm…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: To appear in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  16. arXiv:2310.11706  [pdf, other]

    cs.CR

    MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers

    Authors: Robert J. Joyce, Edward Raff, Charles Nicholas, James Holt

    Abstract: Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family. However, malware can be categorized according to many other types of attributes, and the ability to identify these attributes in newly-emerging malware using machine learning could provide significant value to analysts. In particu…

    Submitted 18 October, 2023; originally announced October 2023.

  17. arXiv:2309.06643  [pdf, other]

    cs.CR

    Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

    Authors: Maksim E. Eren, Manish Bhattarai, Robert J. Joyce, Edward Raff, Charles Nicholas, Boian S. Alexandrov

    Abstract: Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-qua…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: Accepted at ACM TOPS

  18. arXiv:2308.12271  [pdf, other]

    cs.CV

    A Generative Approach for Image Registration of Visible-Thermal (VT) Cancer Faces

    Authors: Catherine Ordun, Alexandra Cha, Edward Raff, Sanjay Purushotham, Karen Kwok, Mason Rule, James Gulley

    Abstract: Since thermal imagery offers a unique modality to investigate pain, the U.S. National Institutes of Health (NIH) has collected a large and diverse set of cancer patient facial thermograms for AI-based pain research. However, differing angles from camera capture between thermal and visible sensors has led to misalignment between Visible-Thermal (VT) images. We modernize the classic computer vision…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: 2nd Annual Artificial Intelligence over Infrared Images for Medical Applications Workshop (AIIIMA) at the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2023)

    Journal ref: 2nd Annual Artificial Intelligence over Infrared Images for Medical Applications Workshop 2023

  19. arXiv:2307.13855  [pdf, other]

    cs.CV cs.LG

    Exploring the Sharpened Cosine Similarity

    Authors: Skyler Wu, Fred Lu, Edward Raff, James Holt

    Abstract: Convolutional layers have long served as the primary workhorse for image classification. Recently, an alternative to convolution was proposed using the Sharpened Cosine Similarity (SCS), which in theory may serve as a better feature detector. While multiple sources report promising results, there has not been to date a full-scale empirical analysis of neural network performance using these new lay…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Accepted to I Can't Believe It's Not Better Workshop (ICBINB) at NeurIPS 2022

  20. arXiv:2306.16354  [pdf, ps, other]

    cs.LG stat.ML

    cuSLINK: Single-linkage Agglomerative Clustering on the GPU

    Authors: Corey J. Nolet, Divye Gala, Alex Fender, Mahesh Doijade, Joe Eaton, Edward Raff, John Zedlewski, Brad Rees, Tim Oates

    Abstract: In this paper, we propose cuSLINK, a novel and state-of-the-art reformulation of the SLINK algorithm on the GPU which requires only $O(Nk)$ space and uses a parameter $k$ to trade off space and time. We also propose a set of novel and reusable building blocks that compose cuSLINK. These building blocks include highly optimized computational patterns for $k$-NN graph construction, spanning trees, a…

    Submitted 28 June, 2023; originally announced June 2023.

    Comments: To appear in ECML PKDD 2023 by Springer Nature

  21. arXiv:2306.15790  [pdf, other]

    cs.LG cs.CR

    Probing the Transition to Dataset-Level Privacy in ML Models Using an Output-Specific and Data-Resolved Privacy Profile

    Authors: Tyler LeBlond, Joseph Munoz, Fred Lu, Maya Fuchs, Elliott Zaresky-Williams, Edward Raff, Brian Testa

    Abstract: Differential privacy (DP) is the prevailing technique for protecting user data in machine learning models. However, deficits to this framework include a lack of clarity for selecting the privacy budget $ε$ and a lack of quantification for the privacy leakage for a particular data row by a particular trained model. We make progress toward these limitations and a new perspective by which to visualiz…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Approved for Public Release; Distribution Unlimited. PA #:AFRL-2022-3639

  22. arXiv:2306.09951  [pdf, ps, other]

    cs.LG stat.ML

    You Don't Need Robust Machine Learning to Manage Adversarial Attack Risks

    Authors: Edward Raff, Michel Benaroch, Andrew L. Farris

    Abstract: The robustness of modern machine learning (ML) models has become an increasing concern within the community. The ability to subvert a model into making errant predictions using seemingly inconsequential changes to input is startling, as is our lack of success in building models robust to this concern. Existing research shows progress, but current mitigations come with a high cost and simultaneousl…

    Submitted 16 June, 2023; originally announced June 2023.

  23. arXiv:2306.06505  [pdf, other]

    cs.CV

    Vista-Morph: Unsupervised Image Registration of Visible-Thermal Facial Pairs

    Authors: Catherine Ordun, Edward Raff, Sanjay Purushotham

    Abstract: For a variety of biometric cross-spectral tasks, Visible-Thermal (VT) facial pairs are used. However, due to a lack of calibration in the lab, photographic capture between two different sensors leads to severely misaligned pairs that can lead to poor results for person re-identification and generative AI. To solve this problem, we introduce our approach for VT image registration called Vista Morph…

    Submitted 10 June, 2023; originally announced June 2023.

    Journal ref: 2023, 7th IEEE International Joint Conference on Biometrics (IJCB)

  24. arXiv:2306.06228  [pdf, other]

    cs.CR cs.LG

    AVScan2Vec: Feature Learning on Antivirus Scan Data for Production-Scale Malware Corpora

    Authors: Robert J. Joyce, Tirth Patel, Charles Nicholas, Edward Raff

    Abstract: When investigating a malicious file, searching for related files is a common task that malware analysts must perform. Given that production malware corpora may contain over a billion files and consume petabytes of storage, many feature extraction and similarity search approaches are computationally infeasible. Our work explores the potential of antivirus (AV) scan data as a scalable source of feat…

    Submitted 9 June, 2023; originally announced June 2023.

  25. arXiv:2306.03819  [pdf, other]

    cs.LG cs.CL cs.CY

    LEACE: Perfect linear concept erasure in closed form

    Authors: Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman

    Abstract: Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing t…

    Submitted 29 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

  26. arXiv:2305.19534  [pdf, other]

    cs.LG cs.AI stat.ML

    Recasting Self-Attention with Holographic Reduced Representations

    Authors: Mohammad Mahmudul Alam, Edward Raff, Stella Biderman, Tim Oates, James Holt

    Abstract: In recent years, self-attention has become the dominant paradigm for sequence modeling in a variety of domains. However, in domains with very long sequence lengths the $\mathcal{O}(T^2)$ memory and $\mathcal{O}(T^2 H)$ compute costs can make using transformers infeasible. Motivated by problems in malware detection, where sequence lengths of $T \geq 100,000$ are a roadblock to deep learning, we re-…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: To appear in Proceedings of the 40th International Conference on Machine Learning (ICML)

  27. arXiv:2304.12429  [pdf, ps, other]

    cs.LG cs.CR

    Sparse Private LASSO Logistic Regression

    Authors: Amol Khanna, Fred Lu, Edward Raff, Brian Testa

    Abstract: LASSO regularized logistic regression is particularly useful for its built-in feature selection, allowing coefficients to be removed from deployment and producing sparse solutions. Differentially private versions of LASSO logistic regression have been developed, but generally produce dense solutions, reducing the intrinsic utility of the LASSO penalty. In this paper, we present a differentially pr…

    Submitted 28 April, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: 20 pages, 5 figures

  28. arXiv:2304.11158  [pdf, other]

    cs.CL

    Emergent and Predictable Memorization in Large Language Models

    Authors: Stella Biderman, USVSN Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, Edward Raff

    Abstract: Memorization, or the tendency of large language models (LLMs) to output entire sequences from their training data verbatim, is a key concern for safely deploying language models. In particular, it is vital to minimize a model's memorization of sensitive datapoints such as those containing personal identifiable information (PII). The prevalence of such undesirable memorization can pose issues for m…

    Submitted 31 May, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

  29. arXiv:2304.01373  [pdf, other]

    cs.CL

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Authors: Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal

    Abstract: How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools…

    Submitted 31 May, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

    Comments: Code at https://github.com/EleutherAI/pythia

  30. arXiv:2303.10303  [pdf, other]

    cs.LG cs.CR

    The Challenge of Differentially Private Screening Rules

    Authors: Amol Khanna, Fred Lu, Edward Raff

    Abstract: Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data analysis, especially in information retrieval problems where n-grams over text with TF-IDF or Okapi feature values are a strong and easy baseline. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models.…

    Submitted 17 March, 2023; originally announced March 2023.

    Comments: 5 pages, 2 figures

    ACM Class: I.2; K.4

  31. arXiv:2302.09395  [pdf, other]

    cs.CV cs.AI eess.IV

    When Visible-to-Thermal Facial GAN Beats Conditional Diffusion

    Authors: Catherine Ordun, Edward Raff, Sanjay Purushotham

    Abstract: Thermal facial imagery offers valuable insight into physiological states such as inflammation and stress by detecting emitted radiation in the infrared spectrum, which is unseen in the visible spectra. Telemedicine applications could benefit from thermal imagery, but conventional computers are reliant on RGB cameras and lack thermal sensors. As a result, we propose the Visible-to-Thermal Facial GA…

    Submitted 18 February, 2023; originally announced February 2023.

    Journal ref: 2023 IEEE International Conference on Image Processing

  32. arXiv:2302.08973  [pdf, other]

    cs.LG

    Measuring Equality in Machine Learning Security Defenses: A Case Study in Speech Recognition

    Authors: Luke E. Richards, Edward Raff, Cynthia Matuszek

    Abstract: Over the past decade, the machine learning security community has developed a myriad of defenses for evasion attacks. An understudied question in that community is: for whom do these defenses defend? This work considers common approaches to defending learned systems and how security defenses result in performance inequities across different sub-populations. We outline appropriate parity metrics fo…

    Submitted 22 August, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

    Comments: Accepted to AISec'23

  33. arXiv:2301.06163  [pdf, other]

    cs.LG cs.AI stat.ML

    A Coreset Learning Reality Check

    Authors: Fred Lu, Edward Raff, James Holt

    Abstract: Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods.…

    Submitted 15 January, 2023; originally announced January 2023.

    Comments: To appear in the Thirty-Seventh AAAI Conference on Artificial Intelligence

  34. arXiv:2212.09535  [pdf, other]

    cs.CL cs.AI cs.LG

    BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

    Authors: Zheng-Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Indra Winata, Stella Biderman, Edward Raff, Dragomir Radev, Vassilina Nikoulina

    Abstract: The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages. To extend the benefits of BLOOM to other languages without incurring prohibitively large costs, it is desirable to adapt BLOOM to new languages not seen during pretraining. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot pro…

    Submitted 27 May, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  35. arXiv:2212.02663  [pdf, other]

    cs.LG cs.AI cs.CR

    Efficient Malware Analysis Using Metric Embeddings

    Authors: Ethan M. Rudd, David Krisiloff, Scott Coull, Daniel Olszewski, Edward Raff, James Holt

    Abstract: In this paper, we explore the use of metric learning to embed Windows PE files in a low-dimensional vector space for downstream use in a variety of applications, including malware detection, family classification, and malware attribute tagging. Specifically, we enrich labeling on malicious and benign PE files using computationally expensive, disassembly-based malicious capabilities. Using these ca…

    Submitted 5 December, 2022; originally announced December 2022.

    Comments: Pre-print of a manuscript submitted to the ACM Digital Threats: Research and Practice (DTRAP) Special Issue on Applied Machine Learning for Information Security. 19 Pages

  36. arXiv:2211.13250  [pdf, other]

    cs.LG cs.AI

    Lempel-Ziv Networks

    Authors: Rebecca Saul, Mohammad Mahmudul Alam, John Hurwitz, Edward Raff, Tim Oates, James Holt

    Abstract: Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences. Compression-based methods have demonstrated more robustness when processing such sequences -- in particular, an appro…

    Submitted 23 November, 2022; originally announced November 2022.

    Comments: I Can't Believe It's Not Better Workshop at NeurIPS 2022

  37. arXiv:2211.01786  [pdf, other]

    cs.CL cs.AI cs.LG

    Crosslingual Generalization through Multitask Finetuning

    Authors: Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel

    Abstract: Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks wi…

    Submitted 29 May, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: 9 main pages (119 with appendix), 16 figures and 11 tables

  38. arXiv:2210.08643  [pdf, other]

    cs.LG cs.CR

    A General Framework for Auditing Differentially Private Machine Learning

    Authors: Fred Lu, Joseph Munoz, Maya Fuchs, Tyler LeBlond, Elliott Zaresky-Williams, Edward Raff, Francis Ferraro, Brian Testa

    Abstract: We present a framework to statistically audit the privacy guarantee conferred by a differentially private machine learner in practice. While previous works have taken steps toward evaluating privacy loss through poisoning attacks or membership inference, they have been tailored to specific models or have demonstrated low statistical power. Our work develops a general methodology to empirically eva…

    Submitted 6 January, 2023; v1 submitted 16 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022

  39. arXiv:2209.03148  [pdf, other]

    cs.LG

    Improving Out-of-Distribution Detection via Epistemic Uncertainty Adversarial Training

    Authors: Derek Everett, Andre T. Nguyen, Luke E. Richards, Edward Raff

    Abstract: The quantification of uncertainty is important for the adoption of machine learning, especially to reject out-of-distribution (OOD) data back to human experts for review. Yet progress has been slow, as a balance must be struck between computational efficiency and the quality of uncertainty estimates. For this reason many use deep ensembles of neural networks or Monte Carlo dropout for reasonable u…

    Submitted 9 September, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: 8 pages, 5 figures

  40. arXiv:2206.05893  [pdf, other]

    cs.LG cs.CR cs.CV stat.ML

    Deploying Convolutional Networks on Untrusted Platforms Using 2D Holographic Reduced Representations

    Authors: Mohammad Mahmudul Alam, Edward Raff, Tim Oates, James Holt

    Abstract: Due to the computational cost of running inference for a neural network, the need to deploy the inferential steps on a third party's compute environment or hardware is common. If the third party is not fully trusted, it is desirable to obfuscate the nature of the inputs and outputs, so that the third party can not easily determine what specific task is being performed. Provably secure protocols fo…

    Submitted 12 June, 2022; originally announced June 2022.

    Comments: To appear in the Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022

  41. arXiv:2206.04763  [pdf, other]

    cs.LG

    Neural Bregman Divergences for Distance Learning

    Authors: Fred Lu, Edward Raff, Francis Ferraro

    Abstract: Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. The study of non-Euclidean geometries is often not explored, which we believe is due to…

    Submitted 20 November, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: Published in ICLR 2023, more related works added

  42. arXiv:2206.03265  [pdf, other]

    cs.CR cs.LG

    Marvolo: Programmatic Data Augmentation for Practical ML-Driven Malware Detection

    Authors: Michael D. Wong, Edward Raff, James Holt, Ravi Netravali

    Abstract: Data augmentation has been rare in the cyber security domain due to technical difficulties in altering data in a manner that is semantically consistent with the original data. This shortfall is particularly onerous given the unique difficulty of acquiring benign and malicious training data that runs into copyright restrictions, and that institutions like banks and governments receive targeted malw…

    Submitted 7 June, 2022; originally announced June 2022.

    Comments: 15 pages, 7 figures

  43. arXiv:2204.08583  [pdf, other]

    cs.CV

    VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

    Authors: Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, Edward Raff

    Abstract: Generating and editing images from open domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity without any training by using a multimodal encoder to guide image generations. We demo…

    Submitted 4 September, 2022; v1 submitted 18 April, 2022; originally announced April 2022.

    Comments: Accepted for publication at ECCV 2022. Code available at https://github.com/EleutherAI/vqgan-clip/tree/main/notebooks

  44. arXiv:2204.04372  [pdf, ps, other]

    cs.LG cs.AI cs.SE

    A Siren Song of Open Source Reproducibility

    Authors: Edward Raff, Andrew L. Farris

    Abstract: As reproducibility becomes a greater concern, conferences have largely converged to a strategy of asking reviewers to indicate whether code was attached to a submission. This is part of a larger trend of taking action based on assumed ideals, without studying if those actions will yield the desired outcome. Our argument is that this focus on code for replication is misguided if we want to improve…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: To be presented at the ML Evaluation Standards Workshop at ICLR 2022

  45. arXiv:2204.04214  [pdf, other]

    eess.IV cs.CV cs.LG

    Intelligent Sight and Sound: A Chronic Cancer Pain Dataset

    Authors: Catherine Ordun, Alexandra N. Cha, Edward Raff, Byron Gaskin, Alex Hanson, Mason Rule, Sanjay Purushotham, James L. Gulley

    Abstract: Cancer patients experience high rates of chronic pain throughout the treatment process. Assessing pain for this patient population is a vital component of psychological and functional well-being, as it can cause a rapid deterioration of quality of life. Existing work in facial pain detection often has deficiencies in labeling or methodology that prevent it from being clinically relevant. This p…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: Published as conference paper at the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks

    Journal ref: 2021, Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  46. arXiv:2204.03829  [pdf, other]

    cs.DL cs.AI cs.LG

    Does the Market of Citations Reward Reproducible Work?

    Authors: Edward Raff

    Abstract: The field of bibliometrics, studying citations and behavior, is critical to the discussion of reproducibility. Citations are one of the primary incentive and reward systems for academic work, and so we desire to know if this incentive rewards reproducible work. Yet to the best of our knowledge, only one work has attempted to look at this combined space, concluding that non-reproducible work is mor…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: To be presented at the ML Evaluation Standards Workshop at ICLR 2022

  47. arXiv:2202.14010   

    cs.CR cs.AI cs.GT cs.LG

    Proceedings of the Artificial Intelligence for Cyber Security (AICS) Workshop at AAAI 2022

    Authors: James Holt, Edward Raff, Ahmad Ridley, Dennis Ross, Arunesh Sinha, Diane Staheli, William Streilen, Milind Tambe, Yevgeniy Vorobeychik, Allan Wollaber

    Abstract: The workshop will focus on the application of AI to problems in cyber security. Cyber systems generate large volumes of data; utilizing this effectively is beyond human capabilities. Additionally, adversaries continue to develop new attacks. Hence, AI methods are required to understand and protect the cyber domain. These challenges are widely studied in enterprise networks, but there are many gaps…

    Submitted 1 March, 2022; v1 submitted 28 February, 2022; originally announced February 2022.

  48. arXiv:2202.08985  [pdf, ps, other]

    cs.LG

    Out of Distribution Data Detection Using Dropout Bayesian Neural Networks

    Authors: Andre T. Nguyen, Fred Lu, Gary Lopez Munoz, Edward Raff, Charles Nicholas, James Holt

    Abstract: We explore the utility of information contained within a dropout based Bayesian neural network (BNN) for the task of detecting out of distribution (OOD) data. We first show how previous attempts to leverage the randomized embeddings induced by the intermediate layers of a dropout BNN can fail due to the distance metric used. We introduce an alternative approach to measuring embedding uncertainty,…

    Submitted 17 February, 2022; originally announced February 2022.

  49. arXiv:2202.07005  [pdf, other]

    cs.LG

    Continuously Generalized Ordinal Regression for Linear and Deep Models

    Authors: Fred Lu, Francis Ferraro, Edward Raff

    Abstract: Ordinal regression is a classification task where classes have an order and prediction error increases the further the predicted class is from the true class. The standard approach for modeling ordinal data involves fitting parallel separating hyperplanes that optimize a certain loss function. This assumption offers sample efficient learning via inductive bias, but is often too restrictive in real…

    Submitted 14 February, 2022; originally announced February 2022.

  50. arXiv:2201.07406  [pdf, other]

    cs.CL cs.AI

    Fooling MOSS Detection with Pretrained Language Models

    Authors: Stella Biderman, Edward Raff

    Abstract: As artificial intelligence (AI) technologies become increasingly powerful and prominent in society, their misuse is a growing concern. In educational settings, AI technologies could be used by students to cheat on assignments and exams. In this paper we explore whether transformers can be used to solve introductory level programming assignments while bypassing commonly used AI tools to detect simi…

    Submitted 6 September, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

    Comments: To appear in the Proceedings of the 29th ACM International Conference on Information & Knowledge Management (CIKM)