DOI: 10.1145/3493700.3493716

Smooth Imitation Learning via Smooth Costs and Smooth Policies

Published: 08 January 2022
Abstract

Imitation learning (IL) is a popular approach in the continuous control setting because, among other reasons, it circumvents the problems of reward mis-specification and exploration in reinforcement learning (RL). In IL from demonstrations, an important challenge is to obtain agent policies that are smooth with respect to the inputs. Learning, through imitation, a policy that is smooth as a function of a large state-action (s-a) space (typical of high-dimensional continuous control environments) can be challenging. We take a first step towards tackling this issue by using smoothness-inducing regularizers on both the policy and the cost models of adversarial imitation learning. Our regularizers work by ensuring that the cost function changes in a controlled manner as a function of the s-a space, and that the agent policy is well behaved with respect to the state space. We call our new smooth IL algorithm Smooth Policy and Cost Imitation Learning (SPaCIL, pronounced “Special”). We introduce a novel metric to quantify the smoothness of the learned policies. We demonstrate SPaCIL’s superior performance on continuous control tasks from MuJoCo. The algorithm not only outperforms the state-of-the-art IL algorithm on our proposed smoothness metric, but also enjoys the added benefits of faster learning and a substantially higher average return.
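To make the idea of a smoothness-inducing policy regularizer concrete, below is a minimal PyTorch-style sketch that penalizes how much the policy's action output changes under a small random perturbation of the input state. The function name smoothness_penalty, the Gaussian perturbation, the scale epsilon, and the weight lam are illustrative assumptions for this sketch and are not taken from the paper; SPaCIL's exact regularizers on the policy and cost models may differ.

```python
import torch

def smoothness_penalty(policy, states, epsilon=0.01):
    # Perturb each state with small Gaussian noise and measure how much the
    # policy's action output moves; a smooth policy keeps this change small.
    noise = epsilon * torch.randn_like(states)
    actions = policy(states)
    perturbed_actions = policy(states + noise)
    return ((perturbed_actions - actions) ** 2).mean()

# Illustrative use inside a policy update (lam is a regularization weight):
# loss = imitation_policy_loss + lam * smoothness_penalty(policy_net, batch_states)
```

An analogous penalty can be applied to the learned cost (discriminator) over state-action pairs, so that the cost, too, changes in a controlled manner across the s-a space.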


Cited By

• (2023) On adaptivity and safety in sequential decision making. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 7077–7078. https://doi.org/10.24963/ijcai.2023/813. Online publication date: 19 August 2023.


Published In

CODS-COMAD '22: Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD)
January 2022, 357 pages

Publisher

Association for Computing Machinery, New York, NY, United States


          Author Tags

          1. continuous control
          2. deep reinforcement learning
          3. imitation learning
          4. regularization
          5. smooth policy

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Funding Sources

          • Robert Bosch Center for Data Science and Artificial Intelligence

          Conference

          CODS-COMAD 2022

          Article Metrics

• Downloads (last 12 months): 10
• Downloads (last 6 weeks): 1
          Reflects downloads up to 07 Aug 2024
