DOI: 10.1145/3493700.3493716

Smooth Imitation Learning via Smooth Costs and Smooth Policies

Published: 08 January 2022
Abstract

Imitation learning (IL) is a popular approach in the continuous control setting because, among other reasons, it circumvents the problems of reward mis-specification and exploration in reinforcement learning (RL). In IL from demonstrations, an important challenge is to obtain agent policies that are smooth with respect to the inputs. Learning, through imitation, a policy that is smooth as a function of a large state-action (s-a) space (typical of high-dimensional continuous control environments) can be challenging. We take a first step towards tackling this issue by using smoothness-inducing regularizers on both the policy and the cost models of adversarial imitation learning. Our regularizers work by ensuring that the cost function changes in a controlled manner as a function of the s-a space, and that the agent policy is well behaved with respect to the state space. We call our new smooth IL algorithm Smooth Policy and Cost Imitation Learning (SPaCIL, pronounced “Special”). We introduce a novel metric to quantify the smoothness of the learned policies. We demonstrate SPaCIL’s superior performance on continuous control tasks from MuJoCo. The algorithm not only outperforms the state-of-the-art IL algorithm on our proposed smoothness metric, but also enjoys the added benefits of faster learning and a substantially higher average return.
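To make the idea of a smoothness-inducing policy regularizer concrete, below is a minimal PyTorch-style sketch that penalizes how much the policy's action output changes under a small random perturbation of the input state. The function name smoothness_penalty, the Gaussian perturbation, the scale epsilon, and the weight lam are illustrative assumptions for this sketch and are not taken from the paper; SPaCIL's exact regularizers on the policy and cost models may differ.

```python
import torch

def smoothness_penalty(policy, states, epsilon=0.01):
    # Perturb each state with small Gaussian noise and measure how much the
    # policy's action output moves; a smooth policy keeps this change small.
    noise = epsilon * torch.randn_like(states)
    actions = policy(states)
    perturbed_actions = policy(states + noise)
    return ((perturbed_actions - actions) ** 2).mean()

# Illustrative use inside a policy update (lam is a regularization weight):
# loss = imitation_policy_loss + lam * smoothness_penalty(policy_net, batch_states)
```

An analogous penalty can be applied to the learned cost (discriminator) over state-action pairs, so that the cost, too, changes in a controlled manner across the s-a space.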


Cited By

• (2023) On adaptivity and safety in sequential decision making. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 7077–7078. https://doi.org/10.24963/ijcai.2023/813. Online publication date: 19 August 2023.


Published In

CODS-COMAD '22: Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD)
January 2022, 357 pages

Publisher

Association for Computing Machinery, New York, NY, United States


          Author Tags

          1. continuous control
          2. deep reinforcement learning
          3. imitation learning
          4. regularization
          5. smooth policy

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Funding Sources

          • Robert Bosch Center for Data Science and Artificial Intelligence

          Conference

          CODS-COMAD 2022

          Article Metrics

• Downloads (last 12 months): 10
• Downloads (last 6 weeks): 1
          Reflects downloads up to 07 Aug 2024
