Search | arXiv e-print repository

Optimal Multi-Fidelity Best-Arm Identification

Authors: Riccardo Poiani, Rémy Degenne, Emilie Kaufmann, Alberto Maria Metelli, Marcello Restelli

Abstract: In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible. We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost. Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm. Our first contribution is a tight, instance-dependent lower bound on the cost complexity. The study of the optimization problem featured in the lower bound provides new insights to devise computationally efficient algorithms, and leads us to propose a gradient-based approach with asymptotically optimal cost complexity. We demonstrate the benefits of the new algorithm compared to existing methods in experiments. Our theoretical and empirical findings also shed light on an intriguing concept of optimal fidelity for each arm.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.02235 [pdf, other]

Power Mean Estimation in Stochastic Monte-Carlo Tree Search

Authors: Tuan Dam, Odalric-Ambrym Maillard, Emilie Kaufmann

Abstract: Monte-Carlo Tree Search (MCTS) is a widely-used strategy for online planning that combines Monte-Carlo sampling with forward tree search. Its success relies on the Upper Confidence bound for Trees (UCT) algorithm, an extension of the UCB method for multi-arm bandits. However, the theoretical foundation of UCT is incomplete due to an error in the logarithmic bonus term for action selection, leading to the development of Fixed-Depth-MCTS with a polynomial exploration bonus to balance exploration and exploitation~\citep{shah2022journal}. Both UCT and Fixed-Depth-MCTS suffer from biased value estimation: the weighted sum underestimates the optimal value, while the maximum valuation overestimates it~\citep{coulom2006efficient}. The power mean estimator offers a balanced solution, lying between the average and maximum values. Power-UCT~\citep{dam2019generalized} incorporates this estimator for more accurate value estimates but its theoretical analysis remains incomplete. This paper introduces Stochastic-Power-UCT, an MCTS algorithm using the power mean estimator and tailored for stochastic MDPs. We analyze its polynomial convergence in estimating root node values and show that it shares the same convergence rate of $\mathcal{O}(n^{-1/2})$, where $n$ is the number of visited trajectories, as Fixed-Depth-MCTS, with the latter being a special case of the former. Our theoretical results are validated with empirical tests across various stochastic MDP environments.

Submitted 4 June, 2024; originally announced June 2024.

Comments: UAI 2024 conference

arXiv:2405.17108 [pdf, ps, other]

Finding good policies in average-reward Markov Decision Processes without prior knowledge

Authors: Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann

Abstract: We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$ where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires the knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S,A$ and $H$, ruling out the possibility to easily make the previous algorithm agnostic to $H$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not need any form of prior knowledge on the MDP. Its sample complexity scales in $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H$ cannot be achieved in this setting. Then, we propose an online algorithm with a sample complexity in $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2311.05638 [pdf, ps, other]

Towards Instance-Optimality in Online PAC Reinforcement Learning

Authors: Aymen Al-Marjani, Andrea Tirinzoni, Emilie Kaufmann

Abstract: Several recent works have proposed instance-dependent upper bounds on the number of episodes needed to identify, with probability $1-\delta$, an $\varepsilon$-optimal policy in finite-horizon tabular Markov Decision Processes (MDPs). These upper bounds feature various complexity measures for the MDP, which are defined based on different notions of sub-optimality gaps. However, as of now, no lower bound has been established to assess the optimality of any of these complexity measures, except for the special case of MDPs with deterministic transitions. In this paper, we propose the first instance-dependent lower bound on the sample complexity required for the PAC identification of a near-optimal policy in any tabular episodic MDP. Additionally, we demonstrate that the sample complexity of the PEDEL algorithm of \cite{Wagenmaker22linearMDP} closely approaches this lower bound. Considering the intractability of PEDEL, we formulate an open question regarding the possibility of achieving our lower bound using a computationally-efficient algorithm.

Submitted 31 October, 2023; originally announced November 2023.

arXiv:2311.03992 [pdf, other]

Bandit Pareto Set Identification: the Fixed Budget Setting

Authors: Cyrille Kone, Emilie Kaufmann, Laura Richert

Abstract: We study a multi-objective pure exploration problem in a multi-armed bandit model. Each arm is associated to an unknown multi-variate distribution and the goal is to identify the distributions whose mean is not uniformly worse than that of another distribution: the Pareto optimal set. We propose and analyze the first algorithms for the \emph{fixed budget} Pareto Set Identification task. We propose Empirical Gap Elimination, a family of algorithms combining a careful estimation of the ``hardness to classify'' each arm in or out of the Pareto set with a generic elimination scheme. We prove that two particular instances, EGE-SR and EGE-SH, have a probability of error that decays exponentially fast with the budget, with an exponent supported by an information theoretic lower-bound. We complement these findings with an empirical study using real-world and synthetic datasets, which showcase the good performance of our algorithms.

Submitted 7 November, 2023; originally announced November 2023.

Comments: 42 pages

arXiv:2307.06100 [pdf, other]

doi 10.1126/scirobotics.abl6259

Agilicious: Open-Source and Open-Hardware Agile Quadrotor for Vision-Based Flight

Authors: Philipp Foehn, Elia Kaufmann, Angel Romero, Robert Penicka, Sihao Sun, Leonard Bauersfeld, Thomas Laengle, Giovanni Cioffi, Yunlong Song, Antonio Loquercio, Davide Scaramuzza

Abstract: Autonomous, agile quadrotor flight raises fundamental challenges for robotics research in terms of perception, planning, learning, and control. A versatile and standardized platform is needed to accelerate research and let practitioners focus on the core problems. To this end, we present Agilicious, a co-designed hardware and software framework tailored to autonomous, agile quadrotor flight. It is completely open-source and open-hardware and supports both model-based and neural-network--based controllers. Also, it provides high thrust-to-weight and torque-to-inertia ratios for agility, onboard vision sensors, GPU-accelerated compute hardware for real-time perception and neural-network inference, a real-time flight controller, and a versatile software stack. In contrast to existing frameworks, Agilicious offers a unique combination of flexible software stack and high-performance hardware. We compare Agilicious with prior works and demonstrate it on different agile tasks, using both model-based and neural-network--based controllers. Our demonstrators include trajectory tracking at up to 5g and 70 km/h in a motion-capture system, and vision-based acrobatic flight and obstacle avoidance in both structured and unstructured environments using solely onboard perception. Finally, we demonstrate its use for hardware-in-the-loop simulation in virtual-reality environments. Thanks to its versatility, we believe that Agilicious supports the next generation of scientific and industrial quadrotor research.

Submitted 12 July, 2023; originally announced July 2023.

Comments: 14 pages, 5 figures, 2 tables

Journal ref: Science Robotics Vol. 7, Issue 67, 2022

arXiv:2307.00424 [pdf, other]

Adaptive Algorithms for Relaxed Pareto Set Identification

Authors: Cyrille Kone, Emilie Kaufmann, Laura Richert

Abstract: In this paper we revisit the fixed-confidence identification of the Pareto optimal set in a multi-objective multi-armed bandit model. As the sample complexity to identify the exact Pareto set can be very large, a relaxation allowing to output some additional near-optimal arms has been studied. In this work we also tackle alternative relaxations that allow instead to identify a relevant subset of the Pareto set. Notably, we propose a single sampling strategy, called Adaptive Pareto Exploration, that can be used in conjunction with different stopping rules to take into account different relaxations of the Pareto Set Identification problem. We analyze the sample complexity of these different combinations, quantifying in particular the reduction in sample complexity that occurs when one seeks to identify at most $k$ Pareto optimal arms. We showcase the good practical performance of Adaptive Pareto Exploration on a real-world scenario, in which we adaptively explore several vaccination strategies against Covid-19 in order to find the optimal ones when multiple immunogenicity criteria are taken into account.

Submitted 3 November, 2023; v1 submitted 1 July, 2023; originally announced July 2023.

MSC Class: 68T05

arXiv:2306.13601 [pdf, other]

Active Coverage for PAC Reinforcement Learning

Authors: Aymen Al-Marjani, Andrea Tirinzoni, Emilie Kaufmann

Abstract: Collecting and leveraging data with good coverage properties plays a crucial role in different aspects of reinforcement learning (RL), including reward-free exploration and offline learning. However, the notion of "good coverage" really depends on the application at hand, as data suitable for one context may not be so for another. In this paper, we formalize the problem of active coverage in episodic Markov decision processes (MDPs), where the goal is to interact with the environment so as to fulfill given sampling requirements. This framework is sufficiently flexible to specify any desired coverage property, making it applicable to any problem that involves online exploration. Our main contribution is an instance-dependent lower bound on the sample complexity of active coverage and a simple game-theoretic algorithm, CovGame, that nearly matches it. We then show that CovGame can be used as a building block to solve different PAC RL tasks. In particular, we obtain a simple algorithm for PAC reward-free exploration with an instance-dependent sample complexity that, in certain MDPs which are "easy to explore", is lower than the minimax one. By further coupling this exploration algorithm with a new technique to do implicit eliminations in policy space, we obtain a computationally-efficient algorithm for best-policy identification whose instance-dependent sample complexity scales with gaps between policy values.

Submitted 23 June, 2023; originally announced June 2023.

Comments: Accepted at COLT 2023

arXiv:2305.16041 [pdf, other]

An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond

Authors: Marc Jourdan, Rémy Degenne, Emilie Kaufmann

Abstract: We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings.

Submitted 6 November, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 68 pages, 14 figures, 4 tables. To be published in the Thirty-seventh Conference on Neural Information Processing Systems

arXiv:2304.04128 [pdf, other]

Learning Agile, Vision-based Drone Flight: from Simulation to Reality

Authors: Davide Scaramuzza, Elia Kaufmann

Abstract: We present our latest research in learning deep sensorimotor policies for agile, vision-based quadrotor flight. We show methodologies for the successful transfer of such policies from simulation to the real world. In addition, we discuss the open research questions that still need to be answered to improve the agility and robustness of autonomous drones toward human-pilot performance.

Submitted 8 April, 2023; originally announced April 2023.

arXiv:2301.13089 [pdf, ps, other]

Can an AI Win Ghana's National Science and Maths Quiz? An AI Grand Challenge for Education

Authors: George Boateng, Victor Kumbol, Elsie Effah Kaufmann

Abstract: There is a lack of enough qualified teachers across Africa which hampers efforts to provide adequate learning support such as educational question answering (EQA) to students. An AI system that can enable students to ask questions via text or voice and get instant answers will make high-quality education accessible. Despite advances in the field of AI, there exists no robust benchmark or challenge to enable building such an (EQA) AI within the African context. Ghana's National Science and Maths Quiz competition (NSMQ) is the perfect competition to evaluate the potential of such an AI due to its wide coverage of scientific fields, variety of question types, highly competitive nature, and live, real-world format. The NSMQ is a Jeopardy-style annual live quiz competition in which 3 teams of 2 students compete by answering questions across biology, chemistry, physics, and math in 5 rounds over 5 progressive stages until a winning team is crowned for that year. In this position paper, we propose the NSMQ AI Grand Challenge, an AI Grand Challenge for Education using Ghana's National Science and Maths Quiz competition (NSMQ) as a case study. Our proposed grand challenge is to "Build an AI to compete live in Ghana's National Science and Maths Quiz (NSMQ) competition and win - performing better than the best contestants in all rounds and stages of the competition." We describe the competition, and key technical challenges to address along with ideas from recent advances in machine learning that could be leveraged to solve this challenge. This position paper is a first step towards conquering such a challenge and importantly, making advances in AI for education in the African context towards democratizing high-quality education across Africa.

Submitted 30 January, 2023; originally announced January 2023.

arXiv:2301.01755 [pdf, other]

doi 10.1109/TRO.2024.3400838

Autonomous Drone Racing: A Survey

Authors: Drew Hanover, Antonio Loquercio, Leonard Bauersfeld, Angel Romero, Robert Penicka, Yunlong Song, Giovanni Cioffi, Elia Kaufmann, Davide Scaramuzza

Abstract: Over the last decade, the use of autonomous drone systems for surveying, search and rescue, or last-mile delivery has increased exponentially. With the rise of these applications comes the need for highly robust, safety-critical algorithms which can operate drones in complex and uncertain environments. Additionally, flying fast enables drones to cover more ground which in turn increases productivity and further strengthens their use case. One proxy for developing algorithms used in high-speed navigation is the task of autonomous drone racing, where researchers program drones to fly through a sequence of gates and avoid obstacles as quickly as possible using onboard sensors and limited computational power. Speeds and accelerations exceed 80 kph and 4 g, respectively, raising significant challenges across perception, planning, control, and state estimation. To achieve maximum performance, systems require real-time algorithms that are robust to motion blur, high dynamic range, model uncertainties, aerodynamic disturbances, and often unpredictable opponents. This survey covers the progression of autonomous drone racing across model-based and learning-based approaches. We provide an overview of the field, its evolution over the years, and conclude with the biggest challenges and open questions to be faced in the future.

Submitted 8 July, 2024; v1 submitted 4 January, 2023; originally announced January 2023.

Comments: 26 pages

Journal ref: IEEE Transactions on Robotics (T-RO), Vol. 40, 2024

arXiv:2211.12181 [pdf, ps, other]

User-Conditioned Neural Control Policies for Mobile Robotics

Authors: Leonard Bauersfeld, Elia Kaufmann, Davide Scaramuzza

Abstract: Recently, learning-based controllers have been shown to push mobile robotic systems to their limits and provide the robustness needed for many real-world applications. However, only classical optimization-based control frameworks offer the inherent flexibility to be dynamically adjusted during execution by, for example, setting target speeds or actuator limits. We present a framework to overcome this shortcoming of neural controllers by conditioning them on an auxiliary input. This advance is enabled by including a feature-wise linear modulation layer (FiLM). We use model-free reinforcement-learning to train quadrotor control policies for the task of navigating through a sequence of waypoints in minimum time. By conditioning the policy on the maximum available thrust or the viewing direction relative to the next waypoint, a user can regulate the aggressiveness of the quadrotor's flight during deployment. We demonstrate in simulation and in real-world experiments that a single control policy can achieve close to time-optimal flight performance across the entire performance envelope of the robot, reaching up to 60 km/h and 4.5g in acceleration. The ability to guide a learned controller during task execution has implications beyond agile quadrotor flight, as conditioning the control policy on human intent helps safely bring learning-based systems out of the well-defined laboratory environment into the wild.

Submitted 2 April, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: 6 pages + 1 page of references

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), London, 2023

arXiv:2210.15287 [pdf, other]

Learned Inertial Odometry for Autonomous Drone Racing

Authors: Giovanni Cioffi, Leonard Bauersfeld, Elia Kaufmann, Davide Scaramuzza

Abstract: Inertial odometry is an attractive solution to the problem of state estimation for agile quadrotor flight. It is inexpensive, lightweight, and it is not affected by perceptual degradation. However, only relying on the integration of the inertial measurements for state estimation is infeasible. The errors and time-varying biases present in such measurements cause the accumulation of large drift in the pose estimates. Recently, inertial odometry has made significant progress in estimating the motion of pedestrians. State-of-the-art algorithms rely on learning a motion prior that is typical of humans but cannot be transferred to drones. In this work, we propose a learning-based odometry algorithm that uses an inertial measurement unit (IMU) as the only sensor modality for autonomous drone racing tasks. The core idea of our system is to couple a model-based filter, driven by the inertial measurements, with a learning-based module that has access to the thrust measurements. We show that our inertial odometry algorithm is superior to the state-of-the-art filter-based and optimization-based visual-inertial odometry as well as the state-of-the-art learned-inertial odometry in estimating the pose of an autonomous racing drone. Additionally, we show that our system is comparable to a visual-inertial odometry solution that uses a camera and exploits the known gate location and appearance. We believe that the application in autonomous drone racing paves the way for novel research in inertial odometry for agile quadrotor flight.

Submitted 28 February, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Journal ref: Robotics and Automation Letters (RA-L), 2023

arXiv:2210.00974 [pdf, other]

Dealing with Unknown Variances in Best-Arm Identification

Authors: Marc Jourdan, Rémy Degenne, Emilie Kaufmann

Abstract: The problem of identifying the best arm among a collection of items having Gaussian rewards distribution is well understood when the variances are known. Despite its practical relevance for many applications, few works studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.

Submitted 23 January, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: 73 pages, 5 figures, 3 tables. To be published in the 34th International Conference on Algorithmic Learning Theory, Singapore, 2023

arXiv:2207.05852 [pdf, other]

Optimistic PAC Reinforcement Learning: the Instance-Dependent View

Authors: Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann

Abstract: Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new "target trick" of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.

Submitted 12 July, 2022; originally announced July 2022.

Comments: arXiv admin note: text overlap with arXiv:2203.09251

arXiv:2206.05979 [pdf, other]

Top Two Algorithms Revisited

Authors: Marc Jourdan, Rémy Degenne, Dorian Baudry, Rianne de Heide, Emilie Kaufmann

Abstract: Top Two algorithms arose as an adaptation of Thompson sampling to best arm identification in multi-armed bandit models (Russo, 2016), for parametric families of arms. They select the next arm to sample from by randomizing among two candidate arms, a leader and a challenger. Despite their good empirical performance, theoretical guarantees for fixed-confidence best arm identification have only been obtained when the arms are Gaussian with known variances. In this paper, we provide a general analysis of Top Two methods, which identifies desirable properties of the leader, the challenger, and the (possibly non-parametric) distributions of the arms. As a result, we obtain theoretically supported Top Two algorithms for best arm identification with bounded distributions. Our proof method demonstrates in particular that the sampling step used to select the leader inherited from Thompson sampling can be replaced by other choices, like selecting the empirical best arm.

Submitted 4 October, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

Comments: 75 pages, 8 figures, 3 tables

arXiv:2206.00121 [pdf, ps, other]

Near-Optimal Collaborative Learning in Bandits

Authors: Clémence Réda, Sattar Vakili, Emilie Kaufmann

Abstract: This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify, in pure exploration, or play, in regret minimization, its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization (Shi et al., 2021). In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.

Submitted 28 October, 2022; v1 submitted 31 May, 2022; originally announced June 2022.

arXiv:2203.15052 [pdf, other]

doi 10.1109/LRA.2022.3181755

Learning Minimum-Time Flight in Cluttered Environments

Authors: Robert Penicka, Yunlong Song, Elia Kaufmann, Davide Scaramuzza

Abstract: We tackle the problem of minimum-time flight for a quadrotor through a sequence of waypoints in the presence of obstacles while exploiting the full quadrotor dynamics. Early works relied on simplified dynamics or polynomial trajectory representations that did not exploit the full actuator potential of the quadrotor, and, thus, resulted in suboptimal solutions. Recent works can plan minimum-time trajectories; yet, the trajectories are executed with control methods that do not account for obstacles. Thus, a successful execution of such trajectories is prone to errors due to model mismatch and in-flight disturbances. To this end, we leverage deep reinforcement learning and classical topological path planning to train robust neural-network controllers for minimum-time quadrotor flight in cluttered environments. The resulting neural network controller demonstrates substantially better performance of up to 19% over state-of-the-art methods. More importantly, the learned policy solves the planning and control problem simultaneously online to account for disturbances, thus achieving much higher robustness. As such, the presented method achieves a 100% success rate of flying minimum-time policies without collision, while traditional planning and control approaches achieve only 40%. The proposed method is validated in both simulation and the real world, with quadrotor speeds of up to 42 km/h and accelerations of 3.6g.

Submitted 17 June, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

Journal ref: IEEE Robotics and Automation Letters, vol. 7, no. 3, pp. 7209-7216, July 2022

arXiv:2203.10883 [pdf, other]

Efficient Algorithms for Extreme Bandits

Authors: Dorian Baudry, Yoan Russac, Emilie Kaufmann

Abstract: In this paper, we contribute to the Extreme Bandit problem, a variant of Multi-Armed Bandits in which the learner seeks to collect the largest possible reward. We first study the concentration of the maximum of i.i.d. random variables under mild assumptions on the tail of the reward distributions. This analysis motivates the introduction of Quantile of Maxima (QoMax). The properties of QoMax are sufficient to build an Explore-Then-Commit (ETC) strategy, QoMax-ETC, achieving strong asymptotic guarantees despite its simplicity. We then propose and analyze a more adaptive, anytime algorithm, QoMax-SDA, which combines QoMax with a subsampling method recently introduced by Baudry et al. (2021). Both algorithms are more efficient than existing approaches in two respects: (1) they lead to better empirical performance, and (2) they enjoy a significant reduction of the memory and time complexities.

Submitted 21 March, 2022; originally announced March 2022.

Comments: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022

arXiv:2203.09251 [pdf, other]

Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs

Authors: Andrea Tirinzoni, Aymen Al-Marjani, Emilie Kaufmann

Abstract: In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $ε$-optimal policy with probability $1-δ$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first nearly matching (up to a horizon squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.

Submitted 24 October, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

arXiv:2203.07747 [pdf, other]

doi 10.1109/LRA.2023.3246839

Real-time Neural-MPC: Deep Learning Model Predictive Control for Quadrotors and Agile Robotic Platforms

Authors: Tim Salzmann, Elia Kaufmann, Jon Arrizabalaga, Marco Pavone, Davide Scaramuzza, Markus Ryll

Abstract: Model Predictive Control (MPC) has become a popular framework in embedded control for high-performance autonomous systems. However, to achieve good control performance using MPC, an accurate dynamics model is key. To maintain real-time operation, the dynamics models used on embedded systems have been limited to simple first-principle models, which substantially limits their representative power. In contrast to such simple models, machine learning approaches, specifically neural networks, have been shown to accurately model even complex dynamic effects, but their large computational complexity hindered combination with fast real-time iteration loops. With this work, we present Real-time Neural MPC, a framework to efficiently integrate large, complex neural network architectures as dynamics models within a model-predictive control pipeline. Our experiments, performed in simulation and the real world onboard a highly agile quadrotor platform, demonstrate the capabilities of the described system to run learned models with, previously infeasible, large modeling capacity using gradient-based online optimization MPC. Compared to prior implementations of neural networks in online optimization MPC we can leverage models of over 4000 times larger parametric capacity in a 50 Hz real-time window on an embedded platform. Further, we show the feasibility of our framework on real-world problems by reducing the positional tracking error by up to 82% when compared to state-of-the-art MPC approaches without neural network dynamics.

Submitted 25 July, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

Journal ref: IEEE Robotics and Automation Letters (Volume: 8, Issue: 4, April 2023)

arXiv:2202.10796 [pdf, ps, other]

A Benchmark Comparison of Learned Control Policies for Agile Quadrotor Flight

Authors: Elia Kaufmann, Leonard Bauersfeld, Davide Scaramuzza

Abstract: Quadrotors are highly nonlinear dynamical systems that require carefully tuned controllers to be pushed to their physical limits. Recently, learning-based control policies have been proposed for quadrotors, as they would potentially allow learning direct mappings from high-dimensional raw sensory observations to actions. Due to sample inefficiency, training such learned controllers on the real platform is impractical or even impossible. Training in simulation is attractive but requires transferring policies between domains, which demands that trained policies be robust to the domain gap. In this work, we make two contributions: (i) we perform the first benchmark comparison of existing learned control policies for agile quadrotor flight and show that training a control policy that commands body-rates and thrust results in more robust sim-to-real transfer compared to a policy that directly specifies individual rotor thrusts, (ii) we demonstrate for the first time that such a control policy trained via deep reinforcement learning can control a quadrotor in real-world experiments at speeds over 45 km/h.

Submitted 22 February, 2022; originally announced February 2022.

Comments: 6 pages (+1 references)

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, 2022

arXiv:2110.05832 [pdf, other]

doi 10.1093/mnras/stac1734

Cometary dust analogues for physics experiments

Authors: A. Lethuillier, C. Feller, E. Kaufmann, P. Becerra, N. Hänni, R. Diethelm, C. Kreuzig, B. Gundlach, J. Blum, A. Pommerol, G. Kargl, E. Kührt, H. Capelo, D. Haack, X. Zhang, J. Knollenberg, N. S. Molinski, T. Gilke, H. Sierks, P. Tiefenbacher, C. Güttler, K. A. Otto, D. Bischoff, M. Schweighart, A. Hagermann, et al. (1 additional author not shown)

Abstract: The CoPhyLab (Cometary Physics Laboratory) project is designed to study the physics of comets through a series of earth-based experiments. For these experiments, a dust analogue was created with physical properties comparable to those of the non-volatile dust found on comets. This "CoPhyLab dust" is planned to be mixed with water and CO$_2$ ice and placed under cometary conditions in vacuum chambers to study the physical processes taking place on the nuclei of comets. In order to develop this dust analogue, we mixed two components representative of the non-volatile materials present in cometary nuclei. We chose silica dust as representative of the mineral phase and charcoal for the organic phase, which also acts as a darkening agent. In this paper, we provide an overview of known cometary analogues before presenting measurements of eight physical properties of different mixtures of the two materials and a comparison of these measurements with known cometary values. The physical properties of interest are: particle size, density, gas permeability, spectrophotometry, mechanical, thermal and electrical properties. We found that the analogue dust that matches the highest number of physical properties of cometary materials consists of a mixture of either 60%/40% or 70%/30% of silica dust/charcoal by mass. These best-fit dust analogues will be used in future CoPhyLab experiments.

Submitted 12 October, 2021; originally announced October 2021.

arXiv:2110.05113 [pdf, other]

doi 10.1126/scirobotics.abg5810

Learning High-Speed Flight in the Wild

Authors: Antonio Loquercio, Elia Kaufmann, René Ranftl, Matthias Müller, Vladlen Koltun, Davide Scaramuzza

Abstract: Quadrotors are agile. Unlike most other machines, they can traverse extremely complex environments at high speeds. To date, only expert human pilots have been able to fully exploit their capabilities. Autonomous operation with on-board sensing and computation has been limited to low speeds. State-of-the-art methods generally separate the navigation problem into subtasks: sensing, mapping, and planning. While this approach has proven successful at low speeds, the separation it builds upon can be problematic for high-speed navigation in cluttered environments. Indeed, the subtasks are executed sequentially, leading to increased processing latency and a compounding of errors through the pipeline. Here we propose an end-to-end approach that can autonomously fly quadrotors through complex natural and man-made environments at high speeds, with purely onboard sensing and computation. The key principle is to directly map noisy sensory observations to collision-free trajectories in a receding-horizon fashion. This direct mapping drastically reduces processing latency and increases robustness to noisy and incomplete perception. The sensorimotor mapping is performed by a convolutional network that is trained exclusively in simulation via privileged learning: imitating an expert with access to privileged information. By simulating realistic sensor noise, our approach achieves zero-shot transfer from simulation to challenging real-world environments that were never experienced during training: dense forests, snow-covered terrain, derailed trains, and collapsed buildings. Our work demonstrates that end-to-end policies trained in simulation enable high-speed autonomous flight through challenging environments, outperforming traditional obstacle avoidance pipelines.

Submitted 11 October, 2021; originally announced October 2021.

Comments: 16 pages (+7 supplementary)

Journal ref: Science Robotics 2021 Vol. 6, Issue 59, abg5810

arXiv:2109.04210 [pdf, other]

doi 10.1109/LRA.2021.3131690

Performance, Precision, and Payloads: Adaptive Nonlinear MPC for Quadrotors

Authors: Drew Hanover, Philipp Foehn, Sihao Sun, Elia Kaufmann, Davide Scaramuzza

Abstract: Agile quadrotor flight in challenging environments has the potential to revolutionize shipping, transportation, and search and rescue applications. Nonlinear model predictive control (NMPC) has recently shown promising results for agile quadrotor control, but relies on highly accurate models for maximum performance. Hence, model uncertainties in the form of unmodeled complex aerodynamic effects, varying payloads and parameter mismatch will degrade overall system performance. In this paper, we propose L1-NMPC, a novel hybrid adaptive NMPC to learn model uncertainties online and immediately compensate for them, drastically improving performance over the non-adaptive baseline with minimal computational overhead. Our proposed architecture generalizes to many different environments from which we evaluate wind, unknown payloads, and highly agile flight conditions. The proposed method demonstrates immense flexibility and robustness, with more than 90% tracking error reduction over non-adaptive NMPC under large unknown disturbances and without any gain tuning. In addition, the same controller with identical gains can accurately fly highly agile racing trajectories exhibiting top speeds of 70 km/h, offering tracking performance improvements of around 50% relative to the non-adaptive NMPC baseline.

Submitted 3 December, 2021; v1 submitted 9 September, 2021; originally announced September 2021.

Comments: 8 Pages, 6 figures, Accepted RAL 2021

Journal ref: IEEE Robotics and Automation Letters 0, 2377-3766, 2021

arXiv:2109.01365 [pdf, other]

A Comparative Study of Nonlinear MPC and Differential-Flatness-Based Control for Quadrotor Agile Flight

Authors: Sihao Sun, Angel Romero, Philipp Foehn, Elia Kaufmann, Davide Scaramuzza

Abstract: Accurate trajectory tracking control for quadrotors is essential for safe navigation in cluttered environments. However, this is challenging in agile flights due to nonlinear dynamics, complex aerodynamic effects, and actuation constraints. In this article, we empirically compare two state-of-the-art control frameworks: the nonlinear-model-predictive controller (NMPC) and the differential-flatness-based controller (DFBC), by tracking a wide variety of agile trajectories at speeds up to 20 m/s (i.e., 72 km/h). The comparisons are performed in both simulation and real-world environments to systematically evaluate both methods from the aspect of tracking accuracy, robustness, and computational efficiency. We show the superiority of NMPC in tracking dynamically infeasible trajectories, at the cost of higher computation time and risk of numerical convergence issues. For both methods, we also quantitatively study the effect of adding an inner-loop controller using the incremental nonlinear dynamic inversion (INDI) method, and the effect of adding an aerodynamic drag model. Our real-world experiments, performed in one of the world's largest motion capture systems, demonstrate more than 78% tracking error reduction of both NMPC and DFBC, indicating the necessity of using an inner-loop controller and aerodynamic drag model for agile trajectory tracking.

Submitted 4 January, 2024; v1 submitted 3 September, 2021; originally announced September 2021.

Journal ref: The paper has been published in the IEEE Transactions on Robotics (T-RO), 2022

arXiv:2106.08015 [pdf, other]

doi 10.15607/RSS.2021.XVII.042

NeuroBEM: Hybrid Aerodynamic Quadrotor Model

Authors: Leonard Bauersfeld, Elia Kaufmann, Philipp Foehn, Sihao Sun, Davide Scaramuzza

Abstract: Quadrotors are extremely agile, so much so, in fact, that classic first-principles models reach their limits. Aerodynamic effects, while insignificant at low speeds, become the dominant model defect during high speeds or agile maneuvers. Accurate modeling is needed to design robust high-performance control systems and enable flying close to the platform's physical limits. We propose a hybrid approach fusing first principles and learning to model quadrotors and their aerodynamic effects with unprecedented accuracy. First principles fail to capture such aerodynamic effects, rendering traditional approaches inaccurate when used for simulation or controller tuning. Data-driven approaches try to capture aerodynamic effects with blackbox modeling, such as neural networks; however, they struggle to robustly generalize to arbitrary flight conditions. Our hybrid approach unifies and outperforms both first-principles blade-element theory and learned residual dynamics. It is evaluated in one of the world's largest motion-capture systems, using autonomous-quadrotor-flight data at speeds up to 65 km/h. The resulting model captures the aerodynamic thrust, torques, and parasitic effects with astonishing accuracy, outperforming existing models with 50% reduced prediction errors, and shows strong generalization capabilities beyond the training set.

Submitted 15 June, 2021; originally announced June 2021.

Comments: 9 pages + 1 page of references

Journal ref: Robotics: Science and Systems (RSS), 2021

arXiv:2105.03543 [pdf]

doi 10.1016/j.icarus.2020.114098

A Statistical Review of Light Curves and the Prevalence of Contact Binaries in the Kuiper Belt

Authors: Mark R. Showalter, Susan D. Benecchi, Marc W. Buie, William M. Grundy, James T. Keane, Carey M. Lisse, Cathy B. Olkin, Simon B. Porter, Stuart J. Robbins, Kelsi N. Singer, Anne J. Verbiscer, Harold A. Weaver, Amanda M. Zangari, Douglas P. Hamilton, David E. Kaufmann, Tod R. Lauer, D. S. Mehoke, T. S. Mehoke, J. R. Spencer, H. B. Throop, J. W. Parker, S. Alan Stern

Abstract: We investigate what can be learned about a population of distant KBOs by studying the statistical properties of their light curves. Whereas others have successfully inferred the properties of individual, highly variable KBOs, we show that the fraction of KBOs with low amplitudes also provides fundamental information about a population. Each light curve is primarily the result of two factors: shape and orientation. We consider contact binaries and ellipsoidal shapes, with and without flattening. After developing the mathematical framework, we apply it to the existing body of KBO light curve data. Principal conclusions are as follows. (1) When using absolute magnitude H as a proxy for size, it is more accurate to use the maximum of the light curve rather than the mean. (2) Previous investigators have noted that smaller KBOs have higher-amplitude light curves, and have interpreted this as evidence that they are systematically more irregular in shape than larger KBOs; we show that a population of flattened bodies with uniform proportions could also explain this result. (3) Our analysis indicates that prior assessments of the fraction of contact binaries in the Kuiper Belt may be artificially low. (4) The pole orientations of some KBOs can be inferred from observed changes in their light curves; however, these KBOs constitute a biased sample, whose pole orientations are not representative of the population overall. (5) Although surface topography, albedo patterns, limb darkening, and other surface properties can affect individual light curves, they do not have a strong influence on the statistics overall. (6) Photometry from the OSSOS survey is incompatible with previous results and its statistical properties defy easy interpretation. We also discuss the promise of this approach for the analysis of future, much larger data sets such as the one anticipated from the Rubin Observatory.

Submitted 7 May, 2021; originally announced May 2021.

Journal ref: Icarus 356, id. 114098 (2021)

arXiv:2103.14666 [pdf, other]

Autonomous Overtaking in Gran Turismo Sport Using Curriculum Reinforcement Learning

Authors: Yunlong Song, HaoChih Lin, Elia Kaufmann, Peter Duerr, Davide Scaramuzza

Abstract: Professional race-car drivers can execute extreme overtaking maneuvers. However, existing algorithms for autonomous overtaking either rely on simplified assumptions about the vehicle dynamics or try to solve expensive trajectory-optimization problems online. When the vehicle approaches its physical limits, existing model-based controllers struggle to handle highly nonlinear dynamics, and cannot leverage the large volume of data generated by simulation or real-world driving. To circumvent these limitations, we propose a new learning-based method to tackle the autonomous overtaking problem. We evaluate our approach in the popular car racing game Gran Turismo Sport, which is known for its detailed modeling of various cars and tracks. By leveraging curriculum learning, our approach leads to faster convergence as well as increased performance compared to vanilla reinforcement learning. As a result, the trained controller outperforms the built-in model-based game AI and achieves comparable overtaking performance with an experienced human driver.

Submitted 9 May, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: Accepted for publication at the IEEE International Conference on Robotics and Automation (ICRA), Xi An, 2021

arXiv:2103.10070 [pdf, other]

Top-m identification for linear bandits

Authors: Clémence Réda, Emilie Kaufmann, Andrée Delahaye-Duriez

Abstract: Motivated by an application to drug repurposing, we propose the first algorithms to tackle the identification of the $m \ge 1$ arms with largest means in a linear bandit model, in the fixed-confidence setting. These algorithms belong to the generic family of Gap-Index Focused Algorithms (GIFA) that we introduce for Top-m identification in linear bandits. We propose a unified analysis of these algorithms, which shows how the use of features might decrease the sample complexity. We further validate these algorithms empirically on simulated data and on a simple drug repurposing task.

Submitted 18 March, 2021; originally announced March 2021.

arXiv:2103.08624 [pdf, other]

Autonomous Drone Racing with Deep Reinforcement Learning

Authors: Yunlong Song, Mats Steinweg, Elia Kaufmann, Davide Scaramuzza

Abstract: In many robotic tasks, such as autonomous drone racing, the goal is to travel through a set of waypoints as fast as possible. A key challenge for this task is planning the time-optimal trajectory, which is typically solved by assuming perfect knowledge of the waypoints to pass in advance. The resulting solution is either highly specialized for a single-track layout, or suboptimal due to simplifying assumptions about the platform dynamics. In this work, a new approach to near-time-optimal trajectory generation for quadrotors is presented. Leveraging deep reinforcement learning and relative gate observations, our approach can compute near-time-optimal trajectories and adapt the trajectory to environment changes. Our method exhibits computational advantages over approaches based on trajectory optimization for non-trivial track configurations. The proposed approach is evaluated on a set of race tracks in simulation and the real world, achieving speeds of up to 60 km/h with a physical quadrotor.

Submitted 2 August, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

Comments: This paper has been accepted for publication at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, 2021. Copyright @ IEEE

arXiv:2102.05773 [pdf, other]

Data-Driven MPC for Quadrotors

Authors: Guillem Torrente, Elia Kaufmann, Philipp Foehn, Davide Scaramuzza

Abstract: Aerodynamic forces render accurate high-speed trajectory tracking with quadrotors extremely challenging. These complex aerodynamic effects become a significant disturbance at high speeds, introducing large positional tracking errors, and are extremely difficult to model. To fly at high speeds, feedback control must be able to account for these aerodynamic effects in real-time. This necessitates a modelling procedure that is both accurate and efficient to evaluate. Therefore, we present an approach to model aerodynamic effects using Gaussian Processes, which we incorporate into a Model Predictive Controller to achieve efficient and precise real-time feedback control, leading to up to 70% reduction in trajectory tracking error at high speeds. We verify our method by extensive comparison to a state-of-the-art linear drag model in synthetic and real-world experiments at speeds of up to 14 m/s and accelerations beyond 4g.

Submitted 3 March, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

Comments: 8 pages

Journal ref: IEEE Robotics and Automation Letters (RA-L), 2021

arXiv:2101.12641 [pdf, other]

doi 10.1016/j.icarus.2021.114355

Gas flow in Martian spider formation

Authors: Nicholas Attree, Erika Kaufmann, Axel Hagermann

Abstract: Martian araneiform terrain, located in the Southern polar regions, consists of features with central pits and radial troughs which are thought to be associated with the solid state greenhouse effect under a CO$_{2}$ ice sheet. Sublimation at the base of this ice leads to gas buildup, fracturing of the ice and the flow of gas and entrained regolith out of vents and onto the surface. There are two possible pathways for the gas: through the gap between the ice slab and the underlying regolith, as proposed by Kieffer et al (2007), or through the pores of a permeable regolith layer, which would imply that regolith properties can control the spacing between adjacent spiders, as suggested by Hao et al. We test this hypothesis quantitatively in order to place constraints on the regolith properties. Based on previously estimated flow rates and thermophysical arguments, we suggest that there is insufficient depth of porous regolith to support the full gas flow through the regolith. By contrast, free gas flow through a regolith--ice gap is capable of supplying the likely flow rates for gap sizes on the order of a centimetre. This size of gap can be opened in the centre of a spider feature by gas pressure bending the overlying ice slab upwards, or by levitating it entirely as suggested in the original Kieffer et al (2007) model. Our calculations therefore support at least some of the gas flowing through a gap opened between the regolith and ice. Regolith properties most likely still play a role in the evolution of spider morphology, by regolith cohesion controlling the erosion of the central pit and troughs, for example.

Submitted 29 January, 2021; originally announced January 2021.

Comments: Accepted in Icarus

arXiv:2012.05754 [pdf, other]

Optimal Thompson Sampling strategies for support-aware CVaR bandits

Authors: Dorian Baudry, Romain Gautron, Emilie Kaufmann, Odalric-Ambrym Maillard

Abstract: In this paper we study a multi-arm bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded on physical resources. Building on a recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that enables to theoretically bound their CVaR regret minimization performance. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we illustrate empirically the benefit of Thompson Sampling approaches both in a realistic environment simulating a use-case in agriculture and on various synthetic examples.

Submitted 21 March, 2022; v1 submitted 10 December, 2020; originally announced December 2020.

Comments: Presented at the Thirty-eighth International Conference on Machine Learning (ICML 2021). In this version we refine Lemma 2 and correct its proof (does not change the main theorems)

arXiv:2010.14323 [pdf, other]

Sub-sampling for Efficient Non-Parametric Bandit Exploration

Authors: Dorian Baudry, Emilie Kaufmann, Odalric-Ambrym Maillard

Abstract: In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli, Gaussian and Poisson distributions). Unlike Thompson Sampling which requires to specify a different prior to be optimal in each case, our proposal RB-SDA does not need any distribution-dependent tuning. RB-SDA belongs to the family of Sub-sampling Duelling Algorithms (SDA) which combines the sub-sampling idea first used by the BESA [1] and SSMC [2] algorithms with different sub-sampling schemes. In particular, RB-SDA uses Random Block sampling. We perform an experimental study assessing the flexibility and robustness of this promising novel approach for exploration in bandit models.

Submitted 27 October, 2020; originally announced October 2020.

Comments: NeurIPS 2020, Dec 2020, Vancouver, Canada

arXiv:2010.03531 [pdf, ps, other]

Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

Authors: Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko

Abstract: In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the non-stationary case in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a novel lower bound of $Ωおめが((H^3SA/εいぷしろん^2)\log(1/δでるた))$ on the sample complexity of an $(\varepsilon,δでるた)$-PAC algorithm for best policy identification in a non-stationary MDP. This lower bound relies on a construction of "hard MDPs" which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $Ωおめが(\sqrt{H^3SAT})$ regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.

Submitted 7 October, 2020; originally announced October 2020.

arXiv:2009.00563 [pdf, other]

Flightmare: A Flexible Quadrotor Simulator

Authors: Yunlong Song, Selim Naji, Elia Kaufmann, Antonio Loquercio, Davide Scaramuzza

Abstract: State-of-the-art quadrotor simulators have a rigid and highly-specialized structure: they are either really fast, physically accurate, or photo-realistic. In this work, we propose a novel quadrotor simulator: Flightmare. Flightmare is composed of two main components: a configurable rendering engine built on Unity and a flexible physics engine for dynamics simulation. Those two components are totally decoupled and can run independently of each other. This makes our simulator extremely fast: rendering achieves speeds of up to 230 Hz, while physics simulation of up to 200,000 Hz on a laptop. In addition, Flightmare comes with several desirable features: (i) a large multi-modal sensor suite, including an interface to extract the 3D point-cloud of the scene; (ii) an API for reinforcement learning which can simulate hundreds of quadrotors in parallel; and (iii) integration with a virtual-reality headset for interaction with the simulated environment. We demonstrate the flexibility of Flightmare by using it for two different robotic tasks: quadrotor control using deep reinforcement learning and collision-free path planning in a complex 3D environment.

Submitted 9 May, 2021; v1 submitted 1 September, 2020; originally announced September 2020.

Comments: Accepted for publication at 4th Conference on Robot Learning (CoRL), Cambridge MA, USA. 2020

arXiv:2008.07971 [pdf, other]

doi 10.1109/LRA.2021.3064284

Super-Human Performance in Gran Turismo Sport Using Deep Reinforcement Learning

Authors: Florian Fuchs, Yunlong Song, Elia Kaufmann, Davide Scaramuzza, Peter Duerr

Abstract: Autonomous car racing is a major challenge in robotics. It raises fundamental problems for classical approaches such as planning minimum-time trajectories under uncertain dynamics and controlling the car at the limits of its handling. Besides, the requirement of minimizing the lap time, which is a sparse objective, and the difficulty of collecting training data from human experts have also hindered researchers from directly applying learning-based approaches to solve the problem. In the present work, we propose a learning-based system for autonomous car racing by leveraging a high-fidelity physical car simulation, a course-progress proxy reward, and deep reinforcement learning. We deploy our system in Gran Turismo Sport, a world-leading car simulator known for its realistic physics simulation of different race cars and tracks, which is even used to recruit human race car drivers. Our trained policy achieves autonomous racing performance that goes beyond what had been achieved so far by the built-in AI, and, at the same time, outperforms the fastest driver in a dataset of over 50,000 human players.

Submitted 9 May, 2021; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the IEEE Robotics and Automation Letters (RA-L) 2021, and International Conference on Robotics and Automation (ICRA) 2021

Journal ref: IEEE Robotics and Automation Letters (RAL) 2021

arXiv:2007.13442 [pdf, other]

Fast active learning for pure exploration in reinforcement learning

Authors: Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, Michal Valko

Abstract: Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, the feedback, in the beginning, can be completely absent, and the agents may first choose to devote all their effort on exploring efficiently. The exploration remains a challenge while it has been addressed with many hand-tuned heuristics with different levels of generality on one side, and a few theoretically-backed exploration strategies on the other. Many of them are incarnated by intrinsic motivation and in particular exploration bonuses. A common rule of thumb for exploration bonuses is to use a $1/\sqrt{n}$ bonus that is added to the empirical estimates of the reward, where $n$ is the number of times this particular state (or a state-action pair) was visited. We show that, surprisingly, for a pure-exploration objective of reward-free exploration, bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon $H$. Furthermore, we show that with an improved analysis of the stopping time, we can improve by a factor $H$ the sample complexity in the best-policy identification setting, which is another pure-exploration objective, where the environment provides rewards but the agent is not penalized for its behavior during the exploration phase.

Submitted 10 October, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

arXiv:2007.05078 [pdf, other]

A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

Abstract: In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which quantifies its level of non-stationarity. Our method generalizes previous approaches based on sliding windows and exponential discounting used to handle changing environments. We further propose a practical implementation of KeRNS, we analyze its regret and validate it experimentally.

Submitted 23 March, 2022; v1 submitted 9 July, 2020; originally announced July 2020.

Comments: Update following the publication in AISTATS 2021. Fixed typos and lemma about runtime

arXiv:2006.06294 [pdf, other]

Adaptive Reward-Free Exploration

Authors: Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

Abstract: Reward-free exploration is a reinforcement learning setting studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead give a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm of Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order $({SAH^4}/{\varepsilon^2})(\log(1/δでるた) + S)$ episodes to output, with probability $1-δでるた$, an $\varepsilon$-approximation of the optimal policy for any reward function. This bound improves over existing sample-complexity bounds in both the small $\varepsilon$ and the small $δでるた$ regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.

Submitted 7 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

arXiv:2006.05879 [pdf, other]

Planning in Markov Decision Processes with Gap-Dependent Sample Complexity

Authors: Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, Michal Valko

Abstract: We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support. We prove an upper bound on the number of calls to the generative models needed for MDP-GapE to identify a near-optimal action with high probability. This problem-dependent sample complexity result is expressed in terms of the sub-optimality gaps of the state-action pairs that are visited during exploration. Our experiments reveal that MDP-GapE is also effective in practice, in contrast with other algorithms with sample complexity guarantees in the fixed-confidence setting, that are mostly theoretical.

Submitted 10 June, 2020; originally announced June 2020.

arXiv:2006.05768 [pdf, other]

Deep Drone Acrobatics

Authors: Elia Kaufmann, Antonio Loquercio, René Ranftl, Matthias Müller, Vladlen Koltun, Davide Scaramuzza

Abstract: Performing acrobatic maneuvers with quadrotors is extremely challenging. Acrobatic flight requires high thrust and extreme angular accelerations that push the platform to its physical limits. Professional drone pilots often measure their level of mastery by flying such maneuvers in competitions. In this paper, we propose to learn a sensorimotor policy that enables an autonomous quadrotor to fly extreme acrobatic maneuvers with only onboard sensing and computation. We train the policy entirely in simulation by leveraging demonstrations from an optimal controller that has access to privileged information. We use appropriate abstractions of the visual input to enable transfer to a real quadrotor. We show that the resulting policy can be directly deployed in the physical world without any fine-tuning on real data. Our methodology has several favorable properties: it does not require a human expert to provide demonstrations, it cannot harm the physical system during training, and it can be used to learn maneuvers that are challenging even for the best human pilots. Our approach enables a physical quadrotor to fly maneuvers such as the Power Loop, the Barrel Roll, and the Matty Flip, during which it incurs accelerations of up to 3g.

Submitted 11 June, 2020; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: 8 pages + 2 pages references. Video: https://youtu.be/2N_wKXQ6MXA. Code: https://github.com/uzh-rpg/deep_drone_acrobatics

Journal ref: Robotics, Science, and Systems (RSS), 2020

arXiv:2005.12813 [pdf, other]

AlphaPilot: Autonomous Drone Racing

Authors: Philipp Foehn, Dario Brescianini, Elia Kaufmann, Titus Cieslewski, Mathias Gehrig, Manasi Muglikar, Davide Scaramuzza

Abstract: This paper presents a novel system for autonomous, vision-based drone racing combining learned data abstraction, nonlinear filtering, and time-optimal trajectory planning. The system has successfully been deployed at the first autonomous drone racing world championship: the 2019 AlphaPilot Challenge. Contrary to traditional drone racing systems, which only detect the next gate, our approach makes use of any visible gate and takes advantage of multiple, simultaneous gate detections to compensate for drift in the state estimate and build a global map of the gates. The global map and drift-compensated state estimate allow the drone to navigate through the race course even when the gates are not immediately visible and further enable to plan a near time-optimal path through the race course in real time based on approximate drone dynamics. The proposed system has been demonstrated to successfully guide the drone through tight race courses reaching speeds up to 8m/s and ranked second at the 2019 AlphaPilot Challenge.

Submitted 20 August, 2021; v1 submitted 26 May, 2020; originally announced May 2020.

Comments: This paper is an extended version of an accepted publication from Robotics: Science and Systems, 2020. This version has been accepted for publication in Autonomous Robots (Springer). Please cite as "AlphaPilot: Autonomous Drone Racing", P. Foehn, Autonomous Robots 2021. Associated video at https://youtu.be/DGjwm5PZQT8

arXiv:2004.05599 [pdf, other]

Kernel-Based Reinforcement Learning: A Finite-Time Analysis

Authors: Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, which requires very weak assumptions on the MDP and has been previously applied to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.

Submitted 23 March, 2022; v1 submitted 12 April, 2020; originally announced April 2020.

Comments: Update following the publication in ICML 2021, including fixed typos

arXiv:2004.01017 [pdf]

doi 10.1126/science.aaw9771

Initial results from the New Horizons exploration of 2014 MU69, a small Kuiper Belt Object

Authors: S. A. Stern, H. A. Weaver, J. R. Spencer, C. B. Olkin, G. R. Gladstone, W. M. Grundy, J. M. Moore, D. P. Cruikshank, H. A. Elliott, W. B. McKinnon, J. Wm. Parker, A. J. Verbiscer, L. A. Young, D. A. Aguilar, J. M. Albers, T. Andert, J. P. Andrews, F. Bagenal, M. E. Banks, B. A. Bauer, J. A. Bauman, K. E. Bechtold, C. B. Beddingfield, N. Behrooz, K. B. Beisser , et al. (180 additional authors not shown)

Abstract: The Kuiper Belt is a distant region of the Solar System. On 1 January 2019, the New Horizons spacecraft flew close to (486958) 2014 MU69, a Cold Classical Kuiper Belt Object, a class of objects that have never been heated by the Sun and are therefore well preserved since their formation. Here we describe initial results from these encounter observations. MU69 is a bi-lobed contact binary with a flattened shape, discrete geological units, and noticeable albedo heterogeneity. However, there is little surface color and compositional heterogeneity. No evidence for satellites, ring or dust structures, gas coma, or solar wind interactions was detected. By origin MU69 appears consistent with pebble cloud collapse followed by a low velocity merger of its two lobes.

Submitted 2 April, 2020; originally announced April 2020.

Comments: 43 pages, 8 figures

Journal ref: Science 364, eaaw9771 (2019)

arXiv:2004.00727 [pdf]

doi 10.1126/science.aay3999

The Geology and Geophysics of Kuiper Belt Object (486958) Arrokoth

Authors: J. R. Spencer, S. A. Stern, J. M. Moore, H. A. Weaver, K. N. Singer, C. B. Olkin, A. J. Verbiscer, W. B. McKinnon, J. Wm. Parker, R. A. Beyer, J. T. Keane, T. R. Lauer, S. B. Porter, O. L. White, B. J. Buratti, M. R. El-Maarry, C. M. Lisse, A. H. Parker, H. B. Throop, S. J. Robbins, O. M. Umurhan, R. P. Binzel, D. T. Britt, M. W. Buie, A. F. Cheng , et al. (53 additional authors not shown)

Abstract: The Cold Classical Kuiper Belt, a class of small bodies in undisturbed orbits beyond Neptune, are primitive objects preserving information about Solar System formation. The New Horizons spacecraft flew past one of these objects, the 36 km long contact binary (486958) Arrokoth (2014 MU69), in January 2019. Images from the flyby show that Arrokoth has no detectable rings, and no satellites (larger than 180 meters diameter) within a radius of 8000 km, and has a lightly-cratered smooth surface with complex geological features, unlike those on previously visited Solar System bodies. The density of impact craters indicates the surface dates from the formation of the Solar System. The two lobes of the contact binary have closely aligned poles and equators, constraining their accretion mechanism.

Submitted 1 April, 2020; originally announced April 2020.

Journal ref: Science, 367, aay3999 (2020)

arXiv:1912.03074 [pdf, other]

Solving Bernoulli Rank-One Bandits with Unimodal Thompson Sampling

Authors: Cindy Trinh, Emilie Kaufmann, Claire Vernade, Richard Combes

Abstract: Stochastic Rank-One Bandits (Katarya et al, (2017a,b)) are a simple framework for regret minimization problems over rank-one matrices of arms. The initially proposed algorithms are proved to have logarithmic regret, but do not match the existing lower bound for this problem. We close this gap by first proving that rank-one bandits are a particular instance of unimodal bandits, and then providing a new analysis of Unimodal Thompson Sampling (UTS), initially proposed by Paladino et al (2017). We prove an asymptotically optimal regret bound on the frequentist regret of UTS and we support our claims with simulations showing the significant improvement of our method compared to the state-of-the-art.

Submitted 6 December, 2019; originally announced December 2019.

arXiv:1910.10945 [pdf, other]

Fixed-Confidence Guarantees for Bayesian Best-Arm Identification

Authors: Xuedong Shang, Rianne de Heide, Emilie Kaufmann, Pierre Ménard, Michal Valko

Abstract: We investigate and provide new insights on the sampling rule called Top-Two Thompson Sampling (TTTS). In particular, we justify its use for fixed-confidence best-arm identification. We further propose a variant of TTTS called Top-Two Transportation Cost (T3C), which disposes of the computational burden of TTTS. As our main contribution, we provide the first sample complexity analysis of TTTS and T3C when coupled with a very natural Bayesian stopping rule, for bandits with Gaussian rewards, solving one of the open questions raised by Russo (2016). We also provide new posterior convergence results for TTTS under two models that are commonly used in practice: bandits with Gaussian and Bernoulli rewards and conjugate priors.

Submitted 28 October, 2019; v1 submitted 24 October, 2019; originally announced October 2019.

Showing 1–50 of 81 results for author: Kaufmann, E