In this part of the book we describe three fundamental classes of methods for solving the reinforcement learning problem: dynamic programming, Monte Carlo methods, and temporal-difference learning. All of these methods solve the full version of the problem, including delayed rewards.

Each class of methods has its strengths and weaknesses. Dynamic programming methods are well developed mathematically, but require a complete and accurate model of the environment. Monte Carlo methods do not require a model and are conceptually simple, but are not suited for step-by-step incremental computation. Finally, temporal-difference methods require no model and are fully incremental, but are more complex to analyze. The methods also differ in their efficiency and speed of convergence. In the third part of the book we explore how these methods can be combined so as to obtain the best features of each of them.
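To make the contrast between the two model-free classes concrete, here is a minimal sketch (not from the text, and using a hypothetical five-state random walk for illustration) of value prediction with both updates. TD(0) adjusts V(s) after every single step, toward the bootstrapped target r + γV(s'); Monte Carlo must wait until the episode ends, then updates each visited state toward the actual return G.

```python
import random

random.seed(0)

# Hypothetical example: a 5-state random walk with states 0..4, starting
# at state 2. Episodes end at state 0 (reward 0) or state 4 (reward 1).
TERMINALS = {0, 4}

def episode():
    """Generate one episode as a list of (state, reward, next_state) steps."""
    s = 2
    traj = []
    while s not in TERMINALS:
        s2 = s + random.choice([-1, 1])
        r = 1.0 if s2 == 4 else 0.0
        traj.append((s, r, s2))
        s = s2
    return traj

def td0(n_episodes, alpha=0.1, gamma=1.0):
    """TD(0): fully incremental -- V(s) is updated after every step,
    toward the bootstrapped target r + gamma * V(s')."""
    V = [0.0] * 5
    for _ in range(n_episodes):
        for s, r, s2 in episode():
            V[s] += alpha * (r + gamma * V[s2] - V[s])
    return V

def monte_carlo(n_episodes, alpha=0.1):
    """Every-visit Monte Carlo: updates happen only at episode end,
    moving each V(s) toward the full return G observed from s."""
    V = [0.0] * 5
    for _ in range(n_episodes):
        traj = episode()
        G = 0.0
        for s, r, _ in reversed(traj):  # accumulate returns backward
            G = r + G                   # undiscounted (gamma = 1)
            V[s] += alpha * (G - V[s])
    return V
```

Both estimators converge toward the same true values (0.25, 0.5, 0.75 for the three nonterminal states), but only TD(0) can be applied step by step, without waiting for an episode's outcome.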

- 4. Dynamic Programming
  - 4.1 Policy Evaluation
  - 4.2 Policy Improvement
  - 4.3 Policy Iteration
  - 4.4 Value Iteration
  - 4.5 Asynchronous Dynamic Programming
  - 4.6 Generalized Policy Iteration
  - 4.7 Efficiency of Dynamic Programming
  - 4.8 Summary
  - 4.9 Bibliographical and Historical Remarks

- 5. Monte Carlo Methods
  - 5.1 Monte Carlo Policy Evaluation
  - 5.2 Monte Carlo Estimation of Action Values
  - 5.3 Monte Carlo Control
  - 5.4 On-Policy Monte Carlo Control
  - 5.5 Evaluating One Policy While Following Another
  - 5.6 Off-Policy Monte Carlo Control
  - 5.7 Incremental Implementation
  - 5.8 Summary
  - 5.9 Bibliographical and Historical Remarks

- 6. Temporal-Difference Learning
  - 6.1 TD Prediction
  - 6.2 Advantages of TD Prediction Methods
  - 6.3 Optimality of TD(0)
  - 6.4 Sarsa: On-Policy TD Control
  - 6.5 Q-Learning: Off-Policy TD Control
  - 6.6 Actor-Critic Methods
  - 6.7 R-Learning for Undiscounted Continuing Tasks
  - 6.8 Games, Afterstates, and Other Special Cases
  - 6.9 Summary
  - 6.10 Bibliographical and Historical Remarks
