Document Type: Review Article

Authors

1 Faculty of Mechanical Engineering, and Mechatronics Research Laboratory, School of Engineering Emerging Technologies, University of Tabriz, Tabriz, Iran

2 Mechatronics Research Laboratory, School of Engineering Emerging Technologies, University of Tabriz, Tabriz, Iran

Abstract

In recent years, research on reinforcement learning (RL) has focused on bridging the gap between adaptive optimal control and bio-inspired learning techniques. Neural network reinforcement learning (NNRL) is among the most popular families of algorithms in the RL framework. Using neural networks enables RL to search for optimal policies more efficiently in several real-life applications. Although many surveys have investigated general RL, none is specifically dedicated to the combination of artificial neural networks and RL. This paper therefore describes the state of the art of NNRL algorithms, with a focus on robotics applications. The survey begins with a discussion of the basic concepts of RL, followed by a review of several different NNRL algorithms. The performance of these algorithms is then evaluated and compared empirically on learning prediction and learning control tasks. The paper concludes with a discussion of open issues.
