This section covers dynamic programming. The key points are the principles of, and differences between, two classic algorithms: Policy Iteration (which alternates evaluation and improvement) and Value Iteration. It closes with asynchronous backups.
GridWorld demo by karpathy
What is Dynamic Programming?
- Dynamic: a sequential or temporal component to the problem
- Programming: optimising a "program", i.e. a policy
Requirements for Dynamic Programming
- Optimal substructure: the optimal solution can be decomposed into subproblems
- Overlapping subproblems: subproblems recur many times, so solutions can be cached and reused
Markov decision processes satisfy both properties: the Bellman equation gives a recursive decomposition, and the value function stores and reuses solutions.
Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP. It is used for planning in an MDP:
- A model of the environment is known
- The agent performs computations with its model (without any external interaction)
- The agent improves its policy
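To make this concrete, here is a minimal sketch of iterative policy evaluation (the prediction step) for the uniform random policy, assuming a hypothetical 4x4 GridWorld in the spirit of the classic Sutton & Barto example: every move costs -1 and the two opposite corner cells are terminal. The helper names (`step`, `ACTIONS`, `TERMINALS`) are illustrative, not from the lecture.

```python
import numpy as np

N = 4                                          # grid side length
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TERMINALS = {(0, 0), (N - 1, N - 1)}           # two terminal corners

def step(s, a):
    """Deterministic transition: bumping into a wall leaves the state unchanged."""
    r, c = s
    nr, nc = r + a[0], c + a[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    return (nr, nc), -1.0                      # every transition costs -1

def policy_evaluation(gamma=1.0, theta=1e-6):
    """Sweep the Bellman expectation backup for the uniform random policy."""
    v = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s in TERMINALS:
                    continue
                # v(s) <- sum_a pi(a|s) * (R + gamma * v(s')), pi uniform over 4 actions
                new_v = sum(0.25 * (rew + gamma * v[s2])
                            for a in ACTIONS
                            for s2, rew in [step(s, a)])
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
        if delta < theta:
            return v

print(np.round(policy_evaluation(), 1))        # approaches the classic 0/-14/-20/-22 table
```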
How to Improve a Policy (greedy approach)
Both David Silver's lecture and Sutton's book work through a Jack's Car Rental example.
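Below is a hedged sketch of full policy iteration on the same hypothetical GridWorld: evaluate the current deterministic policy, then improve it by acting greedily with respect to the resulting value function, and repeat until the policy is stable. The `max_sweeps` parameter is an assumption added here to connect to modified policy iteration in the next subsection.

```python
# Reuses N, ACTIONS, TERMINALS, step from the evaluation sketch above.
import numpy as np

def evaluate(policy, gamma, theta=1e-6, max_sweeps=None):
    """Iterative evaluation of a deterministic policy, optionally truncated."""
    v = np.zeros((N, N))
    sweeps = 0
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s in TERMINALS:
                    continue
                s2, rew = step(s, policy[s])   # deterministic policy: one action per state
                new_v = rew + gamma * v[s2]
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
        sweeps += 1
        if delta < theta or (max_sweeps is not None and sweeps >= max_sweeps):
            return v

def policy_iteration(gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable.

    A discount < 1 keeps evaluation convergent even for poor initial
    policies that never reach a terminal state.
    """
    policy = {(r, c): ACTIONS[0]               # arbitrary initial policy: always "up"
              for r in range(N) for c in range(N) if (r, c) not in TERMINALS}
    while True:
        v = evaluate(policy, gamma)
        stable = True
        for s in policy:
            # Greedy improvement: argmax_a (R + gamma * v(s'))
            best = max(ACTIONS, key=lambda a: step(s, a)[1] + gamma * v[step(s, a)[0]])
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, v
```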
Modified Policy Iteration
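In the sketch above, modified policy iteration amounts to passing a small `max_sweeps` (an assumed knob) so that evaluation is truncated after k sweeps rather than run to convergence; with k = 1 the scheme becomes equivalent to value iteration.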
Value Iteration in MDPs
Principle of Optimality
Any optimal policy can be subdivided into two components:
- An optimal first action $A_*$
- Followed by an optimal policy from the successor state $S'$
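This principle motivates backing up values directly with the Bellman optimality equation, $v_{k+1}(s) = \max_a \left( R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_k(s') \right)$, without maintaining an explicit policy. A minimal sketch of synchronous value iteration on the same hypothetical GridWorld (names are illustrative, not from the lecture):

```python
# Reuses N, ACTIONS, TERMINALS, step from the earlier sketches.
import numpy as np

def value_iteration(gamma=1.0, theta=1e-6):
    """Apply the synchronous Bellman optimality backup until values stop changing."""
    v = np.zeros((N, N))
    while True:
        v_new = v.copy()                       # synchronous: all backups read v_k
        for r in range(N):
            for c in range(N):
                s = (r, c)
                if s in TERMINALS:
                    continue
                # v_{k+1}(s) = max_a (R + gamma * v_k(s'))
                v_new[s] = max(rew + gamma * v[s2]
                               for a in ACTIONS
                               for s2, rew in [step(s, a)])
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```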
Summary of DP Algorithms
Difference between Policy Iteration and Value Iteration (key distinction)
- Policy iteration consists of policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
- Value iteration consists of finding the optimal value function + one step of policy extraction. There is no repetition of the two, because once the value function is optimal, the policy extracted from it is also optimal (i.e. it has converged); see the extraction sketch after this list.
- Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of $v(s)$ after just one sweep of all states, regardless of convergence).
- The algorithms for policy evaluation and for finding the optimal value function are highly similar, differing only in a max operation.
- Similarly, the key steps of policy improvement and policy extraction are identical, except that the former involves a stability check.
In practice, policy iteration often needs fewer iterations than value iteration, since the policy typically converges before the value function does; each of its iterations is more expensive, however, because it runs policy evaluation to convergence.
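For completeness, a sketch of the one-off policy extraction step mentioned above, reusing the hypothetical GridWorld helpers from the earlier sketches:

```python
def extract_policy(v, gamma=1.0):
    """One greedy pass over the states -- no further evaluation, no stability check."""
    policy = {}
    for r in range(N):
        for c in range(N):
            s = (r, c)
            if s in TERMINALS:
                continue
            # argmax_a (R + gamma * v(s')) under the known model
            policy[s] = max(ACTIONS,
                            key=lambda a: step(s, a)[1] + gamma * v[step(s, a)[0]])
    return policy

greedy_policy = extract_policy(value_iteration())
```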
Extensions to Dynamic Programming
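One family of extensions is asynchronous dynamic programming: backups are applied in place, one state at a time, in any order, rather than in full synchronous sweeps, and still converge as long as all states continue to be selected. A minimal sketch of in-place value iteration with a randomised sweep order, again reusing the hypothetical GridWorld helpers:

```python
# Reuses N, ACTIONS, TERMINALS, step from the earlier sketches.
import random
import numpy as np

def in_place_value_iteration(gamma=1.0, theta=1e-6):
    """Asynchronous (in-place) backups: each update immediately sees the newest values."""
    v = np.zeros((N, N))
    states = [(r, c) for r in range(N) for c in range(N)
              if (r, c) not in TERMINALS]
    while True:
        random.shuffle(states)                 # backups applied in an arbitrary order
        delta = 0.0
        for s in states:
            # In-place Bellman optimality backup: no second array for v_{k+1}
            new_v = max(rew + gamma * v[s2]
                        for a in ACTIONS for s2, rew in [step(s, a)])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v
```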