# Gaussian process dynamic programming

Neurocomputing, no. 7-9 (2009): 1508-1524

Abstract

Reinforcement learning (RL) and optimal control of systems with continuous states and actions require approximation techniques in most interesting cases. In this article, we introduce Gaussian process dynamic programming (GPDP), an approximate value function-based RL algorithm. We consider both a classic optimal control problem, where pro…

Introduction

- Reinforcement learning (RL) is based on the principle of experience-based, goal-directed learning.
- If we call the RL algorithm the “controller” and identify actions with the “control signal,” there is a one-to-one mapping from reinforcement learning to optimal control when the surrounding world is fully known.
- For an initial state x0 ∈ R^{nx} and a policy π, the expected cumulative cost over a finite N-step optimization horizon is Vπ(x0) = E[gterm(xN) + Σ_{k=0}^{N−1} g(xk, uk)].
- The function gterm is a control-independent terminal cost incurred at the last time step N.
- The associated state-value function V∗ satisfies Bellman’s equation, V∗_k(x) = min_u { g(x, u) + E[V∗_{k+1}(x_{k+1})] }, solved backward in time starting from V∗_N = gterm.
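The backward recursion above can be illustrated with a minimal finite-horizon dynamic-programming sketch on a discretized grid. The grid, dynamics, and costs below are toy stand-ins for illustration, not the paper's pendulum setting:

```python
import numpy as np

# Finite-horizon DP on a discretized state-action grid (toy problem).
states = np.linspace(-1.0, 1.0, 21)
actions = np.linspace(-0.5, 0.5, 11)
N = 10  # optimization horizon

def f(x, u):          # deterministic toy dynamics
    return np.clip(x + u, -1.0, 1.0)

def g(x, u):          # stage cost g(x_k, u_k)
    return x**2 + 0.1 * u**2

def g_term(x):        # control-independent terminal cost at step N
    return x**2

def nearest(x):       # index of the grid point closest to x
    return int(np.argmin(np.abs(states - x)))

V = g_term(states)                      # V_N = g_term
for k in range(N - 1, -1, -1):          # backward in time
    Q = np.array([[g(x, u) + V[nearest(f(x, u))] for u in actions]
                  for x in states])
    V = Q.min(axis=1)                   # Bellman backup: V_k = min_u Q_k

print(round(float(V[nearest(0.5)]), 3))
```

GPDP replaces the grid-based value representation above with Gaussian process models, which is what makes continuous state and action spaces tractable.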

Highlights

- Reinforcement learning (RL) is based on the principle of experience-based, goal-directed learning
- In contrast to supervised learning, where labels are provided from an external supervisor, a reinforcement learning algorithm must be able to learn from experience collected through interaction with the surrounding world
- We introduce and analyze the Gaussian process dynamic programming (GPDP) algorithm
- We assume that the dynamics are deterministic and known, but the value function is unknown and modeled by GPv. This case corresponds to the standard Gaussian process dynamic programming setting (Algorithm 2) we proposed in previous work (Deisenroth et al., 2008a)
- We introduced Gaussian process dynamic programming (GPDP), a value-function-based reinforcement learning algorithm for continuous-valued state and action spaces
- In the context of a classic optimal control problem, the under-actuated pendulum swing-up, we have shown that Gaussian process dynamic programming yields a near-optimal solution
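The core idea of modeling an unknown value function with a GP can be sketched as follows: fit GP regression to value samples at a small set of support states, then use the GP posterior mean inside a one-step Bellman backup at an arbitrary query state. This is a minimal numpy illustration, not the paper's GPv/GPq machinery; kernel hyperparameters, support states, and the toy value samples are all assumptions:

```python
import numpy as np

def se_kernel(A, B, ell=0.4, sf2=1.0):
    # squared exponential covariance: sf2 * exp(-(a-b)^2 / (2 ell^2))
    d = A[:, None] - B[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

X = np.linspace(-1.0, 1.0, 9)        # support states of the value-function GP
y = X ** 2                           # toy value samples V(X)
noise = 1e-6                         # jitter for numerical stability

K = se_kernel(X, X) + noise * np.eye(len(X))
alpha = np.linalg.solve(K, y)        # GP "weights" K^{-1} y

def V_gp(xq):
    # GP posterior mean of the value function at query states xq
    return se_kernel(np.atleast_1d(xq), X) @ alpha

# one-step Bellman backup at state x, using the GP value model for V(x')
x = 0.6
actions = np.linspace(-0.5, 0.5, 11)
q = [x**2 + 0.1 * u**2 + V_gp(np.clip(x + u, -1.0, 1.0))[0] for u in actions]
print(round(float(min(q)), 3))
```

Because the GP provides values (and uncertainties) at arbitrary continuous states, the backup no longer needs a dense discretization of the state space.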

Results

- For the particular trajectory considered, the total cost of the GPDP controller is approximately 9% higher than the total cost incurred when applying the DP controller.
- Remember that the reward function (15) is independent of the angular velocity and the control signal.
- For this particular trajectory, the cumulative reward of ALGPDP is approximately 7% lower than the cumulative reward of the optimal dynamic programming solution.

Conclusion

- Training GPq scales cubically in the number of actions used for training. If the action space cannot be covered with training points, sub-sampling actions is possible to speed up training: Assume that the most relevant part of the Q∗k-function (line 7 of Algorithm 2) is the one close to the optimum and choose those M actions that yield the lowest expected cost in state xi.
- These M actions define U and are the training inputs of GPq in Algorithm 2.
- Sparse approximations will be unavoidable if the data sets become substantially larger
- This fact is due to the scaling properties of Gaussian process training.
- Probabilistic models in artificial learning algorithms can speed up learning noticeably, as they quantify uncertainty in experience-based knowledge and alleviate model bias.
- In this setting, we still required problem-specific knowledge
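The action sub-sampling heuristic described above can be sketched in a few lines: from a large candidate action set, keep only the M actions with the lowest expected cost at state x_i, and use those as the training inputs U of GPq. The cost model below is a hypothetical stand-in for the expected Q-values of the paper:

```python
import numpy as np

# Sub-sampling actions: keep the M cheapest candidates at state x_i.
rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=200)   # large candidate action set
x_i, M = 0.3, 10

def expected_cost(x, u):
    # illustrative proxy for the expected cost Q(x, u); not the paper's model
    return (x + u) ** 2 + 0.1 * u ** 2

costs = expected_cost(x_i, candidates)
U = candidates[np.argsort(costs)[:M]]           # the M lowest-cost actions
print(len(U))
```

Training the GP on these M actions instead of all 200 cuts the cubic training cost by roughly (200/10)^3, at the price of ignoring the Q-function far from the optimum.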

- Table 1: Solutions to integral (11)
- Table 2: Training sets of GP models involved in Algorithm 4
- Table 3: Real computation times of ALGPDP on a standard computer

Funding

- MPD acknowledges support by the German Research Foundation (DFG) through grant RA 1030/1-3 to CER.

Appendix A: Gaussian Process Prediction with Uncertain Inputs

- In the following, we re-state results from Rasmussen and Ghahramani (2003), O’Hagan (1991), Girard et al. (2003), and Kuss (2006) on how to predict with Gaussian processes when the test input is uncertain. Consider the problem of predicting a function value h(x∗) for an uncertain test input x∗, where h ∼ GP with a squared exponential kernel kh and x∗ ∼ N(μ, Σ)
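The cited results give exact moment-matching expressions for this prediction; the same quantity can be approximated by simple Monte Carlo, which makes the setup concrete. The sketch below fits a 1-D GP posterior mean and averages it over samples of the uncertain input; the kernel, training data, and distribution parameters are all toy values:

```python
import numpy as np

# Predicting h(x*) for an uncertain input x* ~ N(mu, sigma^2), by
# Monte Carlo over the GP posterior mean (illustrative approximation
# of the exact moment-matching results cited in the text).
rng = np.random.default_rng(1)

def se_kernel(A, B, ell=0.5):
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

X = np.linspace(-2.0, 2.0, 15)        # training inputs of h ~ GP
y = np.sin(X)                         # toy training targets
alpha = np.linalg.solve(se_kernel(X, X) + 1e-6 * np.eye(len(X)), y)

def h_mean(xq):                       # GP posterior mean of h
    return se_kernel(np.atleast_1d(xq), X) @ alpha

mu, sigma = 0.8, 0.2                  # uncertain test input x* ~ N(mu, sigma^2)
samples = rng.normal(mu, sigma, size=5000)
pred_mean = h_mean(samples).mean()    # E[h(x*)] by Monte Carlo
print(round(float(pred_mean), 2))
```

Note that E[h(x*)] differs from h(mu) whenever h is nonlinear over the input distribution, which is exactly why uncertain-input prediction matters when GP models are composed, as in GPDP.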

References

- Atkeson, C. G., 1994. Using Local Trajectory Optimizers to Speed up Global Optimization in Dynamic Programming. In: Hanson, J. E., Moody, S. J., Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann, pp. 503–521.
- Atkeson, C. G., Santamaría, J. C., 1997. A Comparison of Direct and Model-Based Reinforcement Learning. In: Proceedings of the International Conference on Robotics and Automation.
- Atkeson, C. G., Schaal, S., July 1997. Robot Learning from Demonstration. In: Fisher Jr., D. H. (Ed.), Proceedings of the 14th International Conference on Machine Learning. Morgan Kaufmann, Nashville, TN, USA, pp. 12–20.
- Bellman, R. E., 1957. Dynamic Programming. Princeton University Press, Princeton, NJ, USA.
- Bertsekas, D. P., 2005. Dynamic Programming and Optimal Control, 3rd Edition. Vol. 1 of Optimization and Computation Series. Athena Scientific, Belmont, MA, USA.
- Bertsekas, D. P., 2007. Dynamic Programming and Optimal Control, 3rd Edition. Vol. 2 of Optimization and Computation Series. Athena Scientific, Belmont, MA, USA.
- Bertsekas, D. P., Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming. Optimization and Computation. Athena Scientific, Belmont, MA, USA.
- Bryson, A. E., Ho, Y.-C., 1975. Applied Optimal Control: Optimization, Estimation, and Control. Hemisphere, New York City, NY, USA.
- Chaloner, K., Verdinelli, I., 1995. Bayesian Experimental Design: A Review. Statistical Science 10, 273–304.
- Deisenroth, M. P., Peters, J., Rasmussen, C. E., June 2008a. Approximate Dynamic Programming with Gaussian Processes. In: Proceedings of the 2008 American Control Conference. Seattle, WA, USA, pp. 4480–4485.
- Deisenroth, M. P., Rasmussen, C. E., Peters, J., April 2008b. Model-Based Reinforcement Learning with Continuous States and Actions. In: Proceedings of the 16th European Symposium on Artificial Neural Networks. Bruges, Belgium, pp. 19–24.
- Doya, K., January 2000. Reinforcement Learning in Continuous Time and Space. Neural Computation 12 (1), 219–245.
- Engel, Y., Mannor, S., Meir, R., August 2003. Bayes Meets Bellman: The Gaussian Process Approach to Temporal Difference Learning. In: Proceedings of the 20th International Conference on Machine Learning. Vol. 20.
- Engel, Y., Mannor, S., Meir, R., August 2005. Reinforcement Learning with Gaussian Processes. In: Proceedings of the 22nd International Conference on Machine Learning. Vol. 22.
- Ernst, D., Geurts, P., Wehenkel, L., 2005. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research 6, 503–556.
- Ghavamzadeh, M., Engel, Y., 2007. Bayesian Policy Gradient Algorithms. In: Schölkopf, B., Platt, J. C., Hoffman, T. (Eds.), Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA, USA, pp. 457–464.
- Girard, A., Rasmussen, C. E., Quiñonero-Candela, J., Murray-Smith, R., 2003. Gaussian Process Priors with Uncertain Inputs—Application to Multiple-Step Ahead Time Series Forecasting. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Advances in Neural Information Processing Systems 15. The MIT Press, Cambridge, MA, USA, pp. 529–536.
- Gordon, G. J., 1995. Stable Function Approximation in Dynamic Programming. In: Prieditis, A., Russell, S. (Eds.), Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, USA, pp. 261–268.
- Howard, R. A., 1960. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, USA.
- Jacobs, R. A., Jordan, M. I., Nowlan, S. J., Hinton, G. E., 1991. Adaptive Mixtures of Local Experts. Neural Computation 3, 79–87.
- Julier, S. J., Uhlmann, J. K., March 2004. Unscented Filtering and Nonlinear Estimation. Proceedings of the IEEE 92 (3), 401–422.
- Kalman, R. E., 1960. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME — Journal of Basic Engineering 82 (Series D), 35–45.
- Ko, J., Fox, D., September 2008. GP-BayesFilters: Bayesian Filtering Using Gaussian Process Prediction and Observation Models. In: Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Nice, France, pp. 3471–3476.
- Ko, J., Klein, D. J., Fox, D., Haehnel, D., April 2007a. Gaussian Processes and Reinforcement Learning for Identification and Control of an Autonomous Blimp. In: Proceedings of the International Conference on Robotics and Automation. Rome, Italy, pp. 742–747.
- Ko, J., Klein, D. J., Fox, D., Haehnel, D., October 2007b. GP-UKF: Unscented Kalman Filters with Gaussian Process Prediction and Observation Models. In: Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. San Diego, CA, USA, pp. 1901–1907.
- Kocijan, J., Murray-Smith, R., Rasmussen, C. E., Likar, B., September 2003. Predictive Control with Gaussian Process Models. In: Zajc, B., Tkalcic, M. (Eds.), Proceedings of IEEE Region 8 Eurocon 2003: Computer as a Tool. Piscataway, NJ, USA, pp. 352–356.
- Krause, A., Singh, A., Guestrin, C., February 2008. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research 9, 235–284.
- Kuss, M., February 2006. Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. Ph.D. thesis, Technische Universität Darmstadt, Germany.
- MacKay, D. J. C., 1992. Information-Based Objective Functions for Active Data Selection. Neural Computation 4, 590–604.
- MacKay, D. J. C., 1999. Comparison of Approximate Methods for Handling Hyperparameters. Neural Computation 11 (5), 1035–1068.
- MacKay, D. J. C., 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, The Edinburgh Building, Cambridge CB2 2RU, UK.
- Martinez-Cantin, R., de Freitas, N., Doucet, A., Castellanos, J., June 2007. Active Policy Learning for Robot Planning and Exploration under Uncertainty. In: Proceedings of Robotics: Science and Systems III. Atlanta, GA, USA.
- Matheron, G., 1973. The Intrinsic Random Functions and Their Applications. Advances in Applied Probability 5, 439–468.
- Minka, T. P., January 2001. A Family of Algorithms for Approximate Bayesian Inference. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Murray-Smith, R., Sbarbaro, D., July 2002. Nonlinear Adaptive Control Using Non-Parametric Gaussian Process Prior Models. In: Proceedings of the 15th IFAC World Congress. Vol. 15. Academic Press, Barcelona, Spain.
- Murray-Smith, R., Sbarbaro, D., Rasmussen, C. E., Girard, A., August 2003.
- O’Hagan, A., 1991. Bayes-Hermite Quadrature. Journal of Statistical Planning and Inference 29, 245–260.
- Ormoneit, D., Sen, S., November 2002. Kernel-Based Reinforcement Learning. Machine Learning 49 (2–3), 161–178.
- Peters, J., Schaal, S., 2008a. Learning to Control in Operational Space. The International Journal of Robotics Research 27 (2), 197–212.
- Peters, J., Schaal, S., 2008b. Natural Actor-Critic. Neurocomputing 71 (7–9), 1180– 1190.
- Peters, J., Schaal, S., 2008c. Reinforcement Learning of Motor Skills with Policy Gradients. Neural Networks 21, 682–697.
- Pfingsten, T., September 2006. Bayesian Active Learning for Sensitivity Analysis. In: Proceedings of the 17th European Conference on Machine Learning. pp. 353–364.
- Quiñonero-Candela, J., Rasmussen, C. E., 2005. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research 6 (2), 1939–1960.
- Rasmussen, C. E., 1996. Evaluation of Gaussian Processes and other Methods for Non-linear Regression. Ph.D. thesis, Department of Computer Science, University of Toronto.
- Rasmussen, C. E., Deisenroth, M. P., November 2008. Probabilistic Inference for Fast Learning in Control. In: Girgin, S., Loth, M., Munos, R., Preux, P., Ryabko, D. (Eds.), Recent Advances in Reinforcement Learning. Vol. 5323 of Lecture Notes in Computer Science. Springer-Verlag, pp. 229–242.
- Rasmussen, C. E., Ghahramani, Z., 2003. Bayesian Monte Carlo. In: Becker, S., Thrun, S., Obermayer, K. (Eds.), Advances in Neural Information Processing Systems 15. The MIT Press, Cambridge, MA, USA, pp. 489–496.
- Rasmussen, C. E., Kuss, M., June 2004. Gaussian Processes in Reinforcement Learning. In: Thrun, S., Saul, L. K., Schölkopf, B. (Eds.), Advances in Neural Information Processing Systems 16. The MIT Press, Cambridge, MA, USA, pp. 751–759.
- Rasmussen, C. E., Williams, C. K. I., 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA. URL http://www.gaussianprocess.org/gpml/
- Riedmiller, M., 2000. Concepts and Facilities of a Neural Reinforcement Learning Control Architecture for Technical Process Control. Neural Computation and Application 8, 323–338.
- Riedmiller, M., 2005. Neural Fitted Q Iteration—First Experiences with a Data
- Efficient Neural Reinforcement Learning Method. In: Proceedings of the 16th European Conference on Machine Learning. Porto, Portugal.
- Riedmiller, M., Braun, H., 1993. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. In: Proceedings of the IEEE International Conference on Neural Networks. pp. 586–591.
- Snelson, E., Ghahramani, Z., 2006. Sparse Gaussian Processes using Pseudo-inputs. In: Weiss, Y., Schölkopf, B., Platt, J. C. (Eds.), Advances in Neural Information Processing Systems 18. The MIT Press, Cambridge, MA, USA, pp. 1257–1264.
- Sutton, R. S., Barto, A. G., 1998. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA.
- Verdinelli, I., Kadane, J. B., June 1992. Bayesian Designs for Maximizing Information and Outcome. Journal of the American Statistical Association 87 (418), 510–515.
- Wasserman, L., 2006. All of Nonparametric Statistics. Springer Texts in Statistics. Springer Science+Business Media, Inc., New York, NY, USA.
- Williams, C. K. I., Rasmussen, C. E., 1996. Gaussian Processes for Regression. In: Touretzky, D. S., Mozer, M. C., Hasselmo, M. E. (Eds.), Advances in Neural Information Processing Systems 8. The MIT Press, Cambridge, MA, USA, pp. 598–604.
- Marc Peter Deisenroth is a Ph.D. candidate at Universität Karlsruhe (TH), Germany, and a visiting graduate student at the Computational and Biological Learning Lab at the Department of Engineering, University of Cambridge, UK. He graduated from Universität Karlsruhe (TH) in August 2006 with a German Master’s degree in Informatics. From October 2006 to September 2007, he was a graduate research assistant at the Max Planck Institute for Biological Cybernetics in Tübingen, Germany. He was a visiting researcher at Osaka University, Japan, in 2006 and at Kanazawa University, Japan, in 2004. His research interests include Bayesian inference, reinforcement learning, and optimal and nonlinear control.
- Carl Edward Rasmussen is a lecturer in the Computational and Biological Learning Lab at the Department of Engineering, University of Cambridge, and an adjunct research scientist at the Max Planck Institute for Biological Cybernetics, Tübingen, Germany. His main research interests are Bayesian inference and machine learning. He received his Master’s in Engineering from the Technical University of Denmark and his Ph.D. in Computer Science from the University of Toronto in 1996. Since then he has been a postdoc at the Technical University of Denmark, a senior research fellow at the Gatsby Computational Neuroscience Unit at University College London from 2000 to 2002, and a junior research group leader at the Max Planck Institute for Biological Cybernetics in Tübingen, Germany, from 2002 to 2007.
