Search | arXiv e-print repository

Showing 1–46 of 46 results for author: Hadfield-Menell, D

  1. arXiv:2404.02949 [pdf, other]

    cs.LG cs.AI

    The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

    Authors: Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell

    Abstract: Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured compet…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Competition for SaTML 2024

  2. arXiv:2403.05030 [pdf, other]

    cs.CR cs.AI cs.LG

    Defending Against Unforeseen Failure Modes with Latent Adversarial Training

    Authors: Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell

    Abstract: Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors. Finding and fixing these is challenging because the attack surface is so large -- it is not tractable to exhaustively search for inputs that may elicit harmful behaviors. Red-teaming and adversarial training (AT) are commonly used to improve robustness, however, they empirically st…

    Submitted 1 April, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  3. arXiv:2402.16835 [pdf, other]

    cs.CL

    Eight Methods to Evaluate Robust Unlearning in LLMs

    Authors: Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

    Abstract: Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it. In this paper, we first survey techniques and limitations of existing unlearning evaluations. Second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "Who's…

    Submitted 26 February, 2024; originally announced February 2024.

  4. arXiv:2401.14446 [pdf, other]

    cs.CY cs.AI cs.CR

    Black-Box Access is Insufficient for Rigorous AI Audits

    Authors: Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell

    Abstract: External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workin…

    Submitted 29 May, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: FAccT 2024

    Journal ref: The 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24), June 3-6, 2024, Rio de Janeiro, Brazil

  5. arXiv:2312.08358 [pdf, other]

    cs.LG cs.AI stat.ML

    Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

    Authors: Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell

    Abstract: In practice, preference learning from human feedback depends on incomplete data with hidden context. Hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. This captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irration…

    Submitted 16 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Comments: Presented at ICLR 2024

  6. arXiv:2312.03729 [pdf, other]

    cs.CL cs.AI

    Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

    Authors: Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas

    Abstract: Neural language models (LMs) can be used to evaluate the truth of factual statements in two ways: they can be either queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherw…

    Submitted 27 November, 2023; originally announced December 2023.

    Comments: Accepted to EMNLP, 2024

  7. arXiv:2307.15217 [pdf, other]

    cs.AI cs.CL cs.LG

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, et al. (7 additional authors not shown)

    Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and rel…

    Submitted 11 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

  8. arXiv:2307.04028 [pdf, other]

    cs.CV cs.AI cs.LG

    Measuring the Success of Diffusion Models at Imitating Human Artists

    Authors: Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell

    Abstract: Modern diffusion models have set the state-of-the-art in AI image generation. Their success is due, in part, to training on Internet-scale data which often includes copyrighted work. This prompts questions about the extent to which these models learn from, imitate, or copy the work of human artists. This work suggests that tying copyright liability to the capabilities of the model may be useful gi…

    Submitted 8 July, 2023; originally announced July 2023.

    Comments: Accepted to the 1st Workshop on Generative AI and Law

  9. arXiv:2306.09442 [pdf, other]

    cs.CL cs.AI cs.LG

    Explore, Establish, Exploit: Red Teaming Language Models from Scratch

    Authors: Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

    Abstract: Deploying large language models (LMs) can pose hazards from harmful outputs such as toxic or false text. Prior work has introduced automated tools that elicit harmful outputs to identify these risks. While this is a valuable step toward securing models, these approaches rely on a pre-existing way to efficiently classify undesirable outputs. Using a pre-existing classifier does not allow for red-te…

    Submitted 10 October, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

  10. arXiv:2302.10894 [pdf, other]

    cs.LG cs.AI cs.CV

    Red Teaming Deep Neural Networks with Feature Synthesis Tools

    Authors: Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell

    Abstract: Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze model behavior by…

    Submitted 21 September, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  11. arXiv:2302.06559 [pdf, other]

    cs.CY cs.GT cs.IR econ.TH

    Recommending to Strategic Users

    Authors: Andreas Haupt, Dylan Hadfield-Menell, Chara Podimata

    Abstract: Recommendation systems are pervasive in the digital economy. An important assumption in many deployed systems is that user consumption reflects user preferences in a static sense: users consume the content they like with no other considerations in mind. However, as we document in a large-scale online survey, users do choose content strategically to influence the types of content they get recommend…

    Submitted 13 February, 2023; originally announced February 2023.

    Comments: 35 pages

  12. arXiv:2211.10024 [pdf, other]

    cs.LG cs.AI cs.CR

    Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

    Authors: Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell

    Abstract: This paper considers the problem of helping humans exercise scalable oversight over deep neural networks (DNNs). Adversarial examples can be useful by helping to reveal weaknesses in DNNs, but they can be difficult to interpret or draw actionable conclusions from. Some previous works have proposed using human-interpretable adversarial attacks including copy/paste attacks in which one natural image…

    Submitted 5 May, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: Best paper award at the NeurIPS 2022 ML Safety Workshop -- https://neurips2022.mlsafety.org/

  13. arXiv:2209.02167 [pdf, other]

    cs.AI cs.CR cs.LG

    Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

    Authors: Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell

    Abstract: Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks where the adversary only observes the world state and treats the target agent as any other pa…

    Submitted 13 October, 2023; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: Code is available at https://github.com/thestephencasper/lm_white_box_attacks

  14. arXiv:2208.10469 [pdf, other]

    cs.AI cs.GT cs.MA econ.TH

    Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL

    Authors: Andreas A. Haupt, Phillip J. K. Christoffersen, Mehul Damani, Dylan Hadfield-Menell

    Abstract: Multi-agent Reinforcement Learning (MARL) is a powerful tool for training autonomous agents acting independently in a common environment. However, it can lead to sub-optimal behavior when individual incentives and group incentives diverge. Humans are remarkably capable at solving these social dilemmas. It is an open problem in MARL to replicate such cooperative behaviors in selfish agents. In this…

    Submitted 29 January, 2024; v1 submitted 22 August, 2022; originally announced August 2022.

  15. arXiv:2208.01534 [pdf, other]

    cs.IR cs.AI cs.HC

    Towards Psychologically-Grounded Dynamic Preference Models

    Authors: Mihaela Curmei, Andreas Haupt, Dylan Hadfield-Menell, Benjamin Recht

    Abstract: Designing recommendation systems that serve content aligned with time varying preferences requires proper accounting of the feedback effects of recommendations on human behavior and psychological condition. We argue that modeling the influence of recommendations on people's preferences must be grounded in psychologically plausible models. We contribute a methodology for developing grounded dynamic…

    Submitted 6 August, 2022; v1 submitted 1 August, 2022; originally announced August 2022.

    Comments: In Sixteenth ACM Conference on Recommender Systems, September 18-23, 2022, Seattle, WA, USA, 14 pages

  16. arXiv:2207.13243 [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

    Authors: Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell

    Abstract: The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to…

    Submitted 18 August, 2023; v1 submitted 26 July, 2022; originally announced July 2022.

  17. arXiv:2207.10192 [pdf]

    cs.IR cs.SI

    Building Human Values into Recommender Systems: An Interdisciplinary Synthesis

    Authors: Jonathan Stray, Alon Halevy, Parisa Assar, Dylan Hadfield-Menell, Craig Boutilier, Amar Ashar, Lex Beattie, Michael Ekstrand, Claire Leibowicz, Connie Moon Sehat, Sara Johansen, Lianne Kerlin, David Vickrey, Spandana Singh, Sanne Vrijenhoek, Amy Zhang, McKane Andrus, Natali Helberger, Polina Proutskova, Tanushree Mitra, Nina Vasan

    Abstract: Recommender systems are the algorithms which select, filter, and personalize content across many of the world's largest platforms and apps. As such, their positive and negative effects on individuals and on societies have been extensively theorized and studied. Our overarching question is how to ensure that recommender systems enact the values of the individuals and societies that they serve. Addre…

    Submitted 20 July, 2022; originally announced July 2022.

    ACM Class: J.4; H.3.3; K.4.2

  18. arXiv:2206.07870 [pdf, other]

    cs.AI

    How to talk so AI will learn: Instructions, descriptions, and autonomy

    Authors: Theodore R Sumers, Robert D Hawkins, Mark K Ho, Thomas L Griffiths, Dylan Hadfield-Menell

    Abstract: From the earliest years of our lives, humans use language to express our beliefs and desires. Being able to talk to artificial agents about our preferences would thus fulfill a central goal of value alignment. Yet today, we lack computational models explaining such language use. To address this challenge, we formalize learning from language in a contextual bandit setting and ask how a human might…

    Submitted 10 October, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: 10 pages, 5 figures. Published as a conference paper at NeurIPS 2022

  19. arXiv:2204.11966 [pdf, other]

    cs.LG cs.IR

    Estimating and Penalizing Induced Preference Shifts in Recommender Systems

    Authors: Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell

    Abstract: The content that a recommender system (RS) shows to users influences them. Therefore, when choosing a recommender to deploy, one is implicitly also choosing to induce specific internal states in users. Even more, systems trained via long-horizon optimization will have direct incentives to manipulate users: in this work, we focus on the incentive to shift user preferences so they are easier to sati…

    Submitted 14 July, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

    Comments: Accepted to ICML 2022 (Spotlight)

    Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:2686-2708, 2022

  20. arXiv:2204.05091 [pdf, other]

    cs.AI cs.CL

    Linguistic communication as (inverse) reward design

    Authors: Theodore R. Sumers, Robert D. Hawkins, Mark K. Ho, Thomas L. Griffiths, Dylan Hadfield-Menell

    Abstract: Natural language is an intuitive and expressive way to communicate reward information to autonomous agents. It encompasses everything from concrete instructions to abstract descriptions of the world. Despite this, natural language is often challenging to learn from: it is difficult for machine learning methods to make appropriate inferences from such a wide range of input. This paper proposes a ge…

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: 6 pages, 3 figures. Accepted at Learning from Natural Language Supervision workshop (ACL 2022)

  21. arXiv:2112.03386 [pdf, other]

    cs.RO cs.AI cs.LG

    Guided Imitation of Task and Motion Planning

    Authors: Michael James McDonald, Dylan Hadfield-Menell

    Abstract: While modern policy optimization methods can do complex manipulation from sensory data, they struggle on problems with extended time horizons and multiple sub-goals. On the other hand, task and motion planning (TAMP) methods scale to long horizons but they are computationally expensive and need to precisely track world state. We propose a method that draws on the strength of both methods: we train…

    Submitted 6 December, 2021; originally announced December 2021.

    Comments: 16 pages, 6 figures, 2 tables, submitted to Conference on Robot Learning 2021, to be published in Proceedings of Machine Learning Research

  22. arXiv:2110.03605 [pdf, other]

    cs.LG cs.AI cs.CV

    Robust Feature-Level Adversaries are Interpretability Tools

    Authors: Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman

    Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we…

    Submitted 11 September, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2022, code available at https://github.com/thestephencasper/feature_level_adv

  23. arXiv:2107.10939 [pdf, ps, other]

    cs.IR cs.CY cs.LG

    What are you optimizing for? Aligning Recommender Systems with Human Values

    Authors: Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, Dylan Hadfield-Menell

    Abstract: We describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy. From this we identify the current practice of values engineering: the creation of classifiers from human-created data with value-based labels. This has worked in practice for a variety of issues, but problems are addre…

    Submitted 22 July, 2021; originally announced July 2021.

    Comments: Originally presented at the ICML 2020 Participatory Approaches to Machine Learning workshop

  24. arXiv:2107.00441 [pdf, ps, other]

    cs.CY

    When Curation Becomes Creation: Algorithms, Microcontent, and the Vanishing Distinction between Platforms and Creators

    Authors: Liu Leqi, Dylan Hadfield-Menell, Zachary C. Lipton

    Abstract: Ever since social activity on the Internet began migrating from the wilds of the open web to the walled gardens erected by so-called platforms, debates have raged about the responsibilities that these platforms ought to bear. And yet, despite intense scrutiny from the news media and grassroots movements of outraged users, platforms continue to operate, from a legal standpoint, on the friendliest t…

    Submitted 1 July, 2021; originally announced July 2021.

  25. arXiv:2102.03896 [pdf, other]

    cs.AI

    Consequences of Misaligned AI

    Authors: Simon Zhuang, Dylan Hadfield-Menell

    Abstract: AI systems often rely on two key components: a specified goal or reward function and an optimization algorithm to compute the optimal behavior for that goal. This approach is intended to provide value for a principal: the user on whose behalf the agent acts. The objectives given to these agents often refer to a partial specification of the principal's goals. We consider the cost of this incomplete…

    Submitted 7 February, 2021; originally announced February 2021.

    Journal ref: NeurIPS 2020

  26. arXiv:2012.14536 [pdf, other]

    cs.GT cs.AI

    Multi-Principal Assistance Games: Definition and Collegial Mechanisms

    Authors: Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell

    Abstract: We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism. In an MPAG, a single agent assists N human principals who may have widely different preferences. MPAGs generalize assistance games, also known as cooperative inverse reinforcement learning game…

    Submitted 28 December, 2020; originally announced December 2020.

    Comments: arXiv admin note: text overlap with arXiv:2007.09540

  27. arXiv:2007.09540 [pdf, other]

    cs.AI

    Multi-Principal Assistance Games

    Authors: Arnaud Fickinger, Simon Zhuang, Dylan Hadfield-Menell, Stuart Russell

    Abstract: Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the human's payoff function. This paper studies multi-principal assistance games, which cover the more general case in which the robot acts on behalf of N humans who may hav…

    Submitted 18 July, 2020; originally announced July 2020.

  28. arXiv:2001.09318 [pdf, other]

    cs.MA cs.AI

    Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors

    Authors: Raphael Köster, Dylan Hadfield-Menell, Gillian K. Hadfield, Joel Z. Leibo

    Abstract: How can societies learn to enforce and comply with social norms? Here we investigate the learning dynamics and emergence of compliance and enforcement of social norms in a foraging game, implemented in a multi-agent reinforcement learning setting. In this spatiotemporally extended game, individuals are incentivized to implement complex berry-foraging policies and punish transgressions against soci…

    Submitted 25 January, 2020; originally announced January 2020.

  29. arXiv:1906.02641 [pdf, other]

    cs.LG cs.HC cs.RO stat.ML

    An Extensible Interactive Interface for Agent Design

    Authors: Matthew Rahtz, James Fang, Anca D. Dragan, Dylan Hadfield-Menell

    Abstract: In artificial intelligence, we often specify tasks through a reward function. While this works well in some settings, many tasks are hard to specify this way. In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging. Instead, we present an interface for specifying tasks interactively using demonstrations. Our approach…

    Submitted 8 August, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: Presented at 2019 ICML Workshop on Human in the Loop Learning (HILL 2019), Long Beach, USA

  30. arXiv:1905.01019 [pdf, other]

    cs.LG cs.CG stat.ML

    Adversarial Training with Voronoi Constraints

    Authors: Marc Khoury, Dylan Hadfield-Menell

    Abstract: Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the im…

    Submitted 2 May, 2019; originally announced May 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1811.00525

  31. Conservative Agency via Attainable Utility Preservation

    Authors: Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli

    Abstract: Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a…

    Submitted 10 June, 2020; v1 submitted 25 February, 2019; originally announced February 2019.

    Comments: Published in AI, Ethics, and Society 2020

  32. arXiv:1901.08654 [pdf, other]

    cs.LG cs.AI stat.ML

    The Assistive Multi-Armed Bandit

    Authors: Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan

    Abstract: Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science. However, most work makes the assumption that humans are acting (noisily) optimally with respect to their preferences. Such approaches can fail when people are themselves learning about what they want. In this work, we introduce the assistive multi-armed bandit, where a robot a…

    Submitted 24 January, 2019; originally announced January 2019.

    Comments: Accepted to HRI 2019

  33. arXiv:1901.01291 [pdf, other]

    cs.RO cs.LG stat.ML

    On the Utility of Model Learning in HRI

    Authors: Gokul Swamy, Jens Schulz, Rohan Choudhury, Dylan Hadfield-Menell, Anca Dragan

    Abstract: Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly? In the context of HRI, part of the world to be modeled is the human. One option is for the robot to treat the human as a black box and learn a policy for how they act directly. But it can also model the human as an agent, and rely on…

    Submitted 21 May, 2020; v1 submitted 4 January, 2019; originally announced January 2019.

  34. arXiv:1812.09376 [pdf, other]

    cs.AI

    Human-AI Learning Performance in Multi-Armed Bandits

    Authors: Ravi Pandya, Sandy H. Huang, Dylan Hadfield-Menell, Anca D. Dragan

    Abstract: People frequently face challenging decision-making problems in which outcomes are uncertain or unknown. Artificial intelligence (AI) algorithms exist that can outperform humans at learning such tasks. Thus, there is an opportunity for AI agents to assist people in learning these tasks more effectively. In this work, we use a multi-armed bandit as a controlled setting in which to explore this direc…

    Submitted 21 December, 2018; originally announced December 2018.

    Comments: Artificial Intelligence, Ethics and Society (AIES) 2019

  35. arXiv:1811.01267 [pdf, other]

    cs.AI cs.CY cs.HC

    Legible Normativity for AI Alignment: The Value of Silly Rules

    Authors: Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield

    Abstract: It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior--social norms and laws. But human laws and norms are complex and culturally varied systems; in many cases, agents will have to learn the rules. This requires autonomous agents to have models of how human rule systems work so that they can make reliable predictions about rules. In this…

    Submitted 3 November, 2018; originally announced November 2018.

  36. arXiv:1811.00525 [pdf, other]

    cs.LG stat.ML

    On the Geometry of Adversarial Examples

    Authors: Marc Khoury, Dylan Hadfield-Menell

    Abstract: Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. We propose a geometric framework, drawing on tools from the manifold reconstruction literature, to analyze the high-dimensional geometry of adversarial examples. In particular, we highlight the im…

    Submitted 11 December, 2018; v1 submitted 1 November, 2018; originally announced November 2018.

    Comments: Improvements to clarity and presentation over initial submission

  37. arXiv:1809.03060 [pdf, other]

    cs.LG cs.AI stat.ML

    Active Inverse Reward Design

    Authors: Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

    Abstract: Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true rewar…

    Submitted 6 November, 2019; v1 submitted 9 September, 2018; originally announced September 2018.

  38. arXiv:1806.03820 [pdf, other]

    cs.AI

    An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning

    Authors: Dhruv Malik, Malayandi Palaniappan, Jaime F. Fisac, Dylan Hadfield-Menell, Stuart Russell, Anca D. Dragan

    Abstract: Our goal is for AI systems to correctly identify and act according to their human user's objectives. Cooperative Inverse Reinforcement Learning (CIRL) formalizes this value alignment problem as a two-player game between a human and robot, in which only the human knows the parameters of the reward function: the robot needs to learn them as the interaction unfolds. Previous work showed that CIRL can…

    Submitted 11 June, 2018; originally announced June 2018.

  39. arXiv:1806.02501 [pdf, other]

    cs.RO cs.AI cs.LG

    Simplifying Reward Design through Divide-and-Conquer

    Authors: Ellis Ratner, Dylan Hadfield-Menell, Anca D. Dragan

    Abstract: Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating. The reward needs to work across multiple different environments, and that often requires many iterations of tuning. We introduce a novel divide-and-conquer approach that enables the designer to specify a reward separately for each environment. By treating thes…

    Submitted 6 June, 2018; originally announced June 2018.

    Comments: Robotics: Science and Systems (RSS) 2018

  40. arXiv:1804.04268 [pdf, other]

    cs.AI

    Incomplete Contracting and AI Alignment

    Authors: Dylan Hadfield-Menell, Gillian Hadfield

    Abstract: We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions. We first provide an overview of the incomplete contracting literature and explore parallels between this work and the problem of AI alignment. As we emphasize, mis…

    Submitted 11 April, 2018; originally announced April 2018.

  41. Expressive Robot Motion Timing

    Authors: Allan Zhou, Dylan Hadfield-Menell, Anusha Nagabandi, Anca D. Dragan

    Abstract: Our goal is to enable robots to time their motion in a way that is purposefully expressive of their internal states, making them more transparent to people. We start by investigating what types of states motion timing is capable of expressing, focusing on robot manipulation and keeping the path constant while systematically varying the timing. We find that users naturally pick up on certain…

    Submitted 5 February, 2018; originally announced February 2018.

    Journal ref: HRI '17: Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction, pages 22-31

  42. arXiv:1711.02827  [pdf, other]

    cs.AI cs.LG

    Inverse Reward Design

    Authors: Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan

    Abstract: Autonomous agents optimize the reward function we give them. What they don't know is how hard it is for us to design a reward function that actually captures what we want. When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios. Inevitably, agents encounter new scenarios (e.g., new types of terr…

    Submitted 7 October, 2020; v1 submitted 7 November, 2017; originally announced November 2017.

    Comments: Advances in Neural Information Processing Systems 30 (NIPS 2017). Revised Oct 2020 to fix a typo in Eq. 3.

  43. arXiv:1707.06354  [pdf, other]

    cs.AI cs.HC cs.LG cs.RO

    Pragmatic-Pedagogic Value Alignment

    Authors: Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, Anca D. Dragan

    Abstract: As intelligent systems gain autonomy and capability, it becomes vital to ensure that their objectives match those of their human users; this is known as the value-alignment problem. In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go. We argue that a meaningful s…

    Submitted 5 February, 2018; v1 submitted 19 July, 2017; originally announced July 2017.

    Comments: Published at the International Symposium on Robotics Research (ISRR 2017)

    MSC Class: 68T05

    ACM Class: I.2.0; I.2.6; I.2.8; I.2.9

    Journal ref: International Symposium on Robotics Research, 2017

  44. arXiv:1705.09990  [pdf, other]

    cs.AI

    Should Robots be Obedient?

    Authors: Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell

    Abstract: Intuitively, obedience -- following the order that a human gives -- seems like a good property for a robot to have. But, we humans are not perfect and we may give orders that are not best aligned to our preferences. We show that when a human is not perfectly rational then a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that…

    Submitted 28 May, 2017; originally announced May 2017.

    Comments: Accepted to IJCAI 2017

  45. arXiv:1611.08219  [pdf, other]

    cs.AI

    The Off-Switch Game

    Authors: Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

    Abstract: It is clear that one of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agents create strong incentives for…

    Submitted 15 June, 2017; v1 submitted 24 November, 2016; originally announced November 2016.

  46. arXiv:1606.03137  [pdf, other]

    cs.AI

    Cooperative Inverse Reinforcement Learning

    Authors: Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

    Abstract: For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans. We propose a formal definition of the value alignment problem as cooperative inverse reinforcement learning (CIRL). A CIRL problem is a cooperative, partial-…

    Submitted 17 February, 2024; v1 submitted 9 June, 2016; originally announced June 2016.