RL-3 Planning by Dynamic Programming

This lecture covers planning by dynamic programming. The key is to understand how the two classic algorithms work and how they differ: Policy Iteration (which alternates policy evaluation and policy improvement) and Value Iteration. The lecture closes with asynchronous backups.

Material

PDF

Video by David Silver

Video by Siraj

«An Introduction to Reinforcement Learning», Sutton and Barto, 1998, Chapter 4

GridWorld mini-game by karpathy

Introduction

What is Dynamic Programming?

  • Dynamic: a sequential or temporal component to the problem
  • Programming: optimising a “program”, i.e. a policy

Requirements for Dynamic Programming

Dynamic programming applies to problems that have two properties, and MDPs satisfy both:

  • Optimal substructure: the optimal solution can be decomposed into subproblems (the Bellman equation gives this recursive decomposition)
  • Overlapping subproblems: subproblems recur many times, so their solutions can be cached and reused (the value function does exactly this caching)

Planning by Dynamic Programming

Dynamic programming assumes full knowledge of the MDP. It is used for planning in an MDP.

PLANNING

  • A model of the environment is known
  • The agent performs computations with its model (without any external interaction)
  • The agent improves its policy

DP can be used for both prediction and control; the difference is whether the policy $\pi$ is being optimised.
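
To make the sketches in the rest of these notes concrete, here is a minimal tabular MDP representation in Python. The 4-state chain MDP, the `P`/`GAMMA` names, and the `(prob, next_state, reward, done)` transition format are illustrative assumptions of mine, not something from the lecture (the format is modelled on Gym-style tabular environments).

```python
# A hypothetical 4-state chain MDP used by the sketches below (not the
# lecture's gridworld).  P[s][a] is a list of (prob, next_state, reward, done)
# tuples; "done" marks transitions into the terminal state.
N_STATES = 4          # states 0..3, state 3 is terminal
ACTIONS = [0, 1]      # 0 = step left, 1 = step right
GAMMA = 1.0           # undiscounted, episodic task

P = {
    s: {
        0: [(1.0, max(s - 1, 0), -1.0, False)],                             # left
        1: [(1.0, min(s + 1, N_STATES - 1), -1.0, s + 1 == N_STATES - 1)],  # right
    }
    for s in range(N_STATES)
}
P[N_STATES - 1] = {a: [(1.0, N_STATES - 1, 0.0, True)] for a in ACTIONS}    # absorbing terminal state
```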

Policy Evaluation (prediction)


Both David Silver's lecture and Sutton's book walk through a small gridworld demo. Karpathy's website has a JavaScript version with a larger grid that you can play with.
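
A minimal sketch of iterative policy evaluation with synchronous backups, i.e. repeatedly applying the Bellman expectation backup $v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_k(s') \right)$ until the values stop changing. It assumes the toy `P`/`GAMMA` representation sketched above; function and parameter names are illustrative.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=1.0, theta=1e-8):
    """Evaluate a fixed policy by sweeping the Bellman expectation backup
    until no state value changes by more than `theta`.

    P[s][a]   : list of (prob, next_state, reward, done) tuples
    policy[s] : dict mapping action -> probability of taking it in s
    """
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        V_new = np.copy(V)              # synchronous backup: v_k stays on the right-hand side
        for s in P:
            v = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, reward, done in P[s][a]:
                    v += pi_sa * prob * (reward + gamma * (0.0 if done else V[s_next]))
            V_new[s] = v
            delta = max(delta, abs(v - V[s]))
        V = V_new
        if delta < theta:
            return V

# e.g. evaluate the uniform-random policy on the toy chain MDP:
# random_policy = {s: {a: 0.5 for a in ACTIONS} for s in range(N_STATES)}
# v_pi = policy_evaluation(P, random_policy, gamma=GAMMA)
```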

Policy Iteration (control)

How to Improve a Policy (greedy approach)

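In the tabular case, acting greedily means a one-step lookahead with $q_\pi$ and picking $\pi'(s) = \operatorname{argmax}_{a} q_\pi(s, a)$. A sketch under the same assumed MDP format as above (the helper names are mine):

```python
def q_from_v(P, V, s, gamma=1.0):
    """One-step lookahead: q(s, a) = sum over s' of p(s'|s,a) * (r + gamma * v(s'))."""
    return {
        a: sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
               for prob, s_next, reward, done in transitions)
        for a, transitions in P[s].items()
    }

def greedy_improvement(P, V, gamma=1.0):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = {}
    for s in P:
        q = q_from_v(P, V, s, gamma)
        best_a = max(q, key=q.get)                      # argmax over actions
        policy[s] = {a: 1.0 if a == best_a else 0.0 for a in P[s]}
    return policy
```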

Policy Iteration


Both David Silver's lecture and Sutton's book work through a Jack's Car Rental example.
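
Putting the two pieces together gives policy iteration: evaluate the current policy, act greedily with respect to the result, and repeat until the greedy policy stops changing. This sketch reuses the hypothetical `policy_evaluation` and `greedy_improvement` helpers from above.

```python
def policy_iteration(P, gamma=1.0, theta=1e-8):
    """Alternate full policy evaluation (E) and greedy improvement (I)
    until the policy is stable; the stable policy is optimal."""
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}  # start uniform random
    while True:
        V = policy_evaluation(P, policy, gamma, theta)   # E: compute v_pi
        new_policy = greedy_improvement(P, V, gamma)     # I: pi' = greedy(v_pi)
        if new_policy == policy:                         # no change -> converged
            return new_policy, V
        policy = new_policy
```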

Policy Improvement (with proof)

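Reconstructing the standard argument (the same one used on the slide, up to notation): let $\pi'(s) = \operatorname{argmax}_{a} q_\pi(s, a)$ be the greedy policy. Acting greedily improves the value over one step,

$$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s),$$

and unrolling this one step at a time shows it improves the whole value function, $v_{\pi'}(s) \ge v_\pi(s)$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right] \\
&\le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \right] = v_{\pi'}(s).
\end{aligned}
$$

If improvements stop, then $\max_{a} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$, which is exactly the Bellman optimality equation, so $v_\pi(s) = v_*(s)$ and $\pi$ is an optimal policy.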

Modified Policy Iteration

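The idea of modified policy iteration is that the evaluation step does not have to run to convergence: stop it at an $\epsilon$-threshold, or simply after $k$ sweeps (with $k = 1$, policy iteration collapses into value iteration). A sketch of a truncated evaluation step, again using the assumed `P` format and hypothetical names:

```python
def truncated_policy_evaluation(P, policy, V, gamma=1.0, k=3):
    """Run only k sweeps of the Bellman expectation backup instead of
    iterating to convergence.  V is a dict mapping state -> current estimate."""
    for _ in range(k):
        V_new = dict(V)
        for s in P:
            V_new[s] = sum(
                pi_sa * prob * (reward + gamma * (0.0 if done else V[s_next]))
                for a, pi_sa in policy[s].items()
                for prob, s_next, reward, done in P[s][a]
            )
        V = V_new
    return V
```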

Value Iteration

Value Iteration in MDPs

Principle of Optimality

Any optimal policy can be subdivided into two components:

  • An optimal first action $A_*$
  • Followed by an optimal policy from the successor state $S'$
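
In other words, if we already know the solution to the subproblems $v_*(s')$, then $v_*(s)$ follows from a one-step lookahead with the Bellman optimality backup (in the lecture's notation):

$$v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \right)$$

Value iteration applies this backup to every state, repeatedly, starting from an arbitrary initial guess.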

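A minimal sketch of value iteration plus the final greedy policy extraction, under the same assumed MDP format as before:

```python
def value_iteration(P, gamma=1.0, theta=1e-8):
    """Sweep the Bellman optimality backup until the values stop changing,
    then extract the greedy policy once at the end.  Unlike policy iteration,
    no explicit policy is maintained during the iterations."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        V_new = dict(V)                                   # synchronous backup
        for s in P:
            V_new[s] = max(
                sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
                    for prob, s_next, reward, done in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:
            break
    # policy extraction: act greedily with respect to the (near-)optimal values
    policy = {}
    for s in P:
        q = {a: sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
                    for prob, s_next, reward, done in P[s][a])
             for a in P[s]}
        best_a = max(q, key=q.get)
        policy[s] = {a: 1.0 if a == best_a else 0.0 for a in P[s]}
    return V, policy
```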

Summary of DP Algorithms

The algorithms covered so far, organised by the problem they solve and the Bellman equation they rely on:

  • Prediction: Bellman expectation equation → iterative policy evaluation
  • Control: Bellman expectation equation + greedy policy improvement → policy iteration
  • Control: Bellman optimality equation → value iteration

Difference between Policy Iteration and Value Iteration (key point)

The comparison below is adapted from a Stack Overflow answer.


Key points:

  1. Policy iteration consists of policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
  2. Value iteration consists of finding the optimal value function + a single policy extraction step. The two are not repeated, because once the value function is optimal, the policy extracted from it is also optimal (i.e. it has converged).
  3. Finding the optimal value function can also be seen as a combination of policy improvement (because of the max over actions) and truncated policy evaluation (the reassignment of $v(s)$ after just one sweep of all states, regardless of convergence).
  4. The algorithms for policy evaluation and for finding the optimal value function are nearly identical, differing only in the max over actions.
  5. Similarly, the key steps of policy improvement and policy extraction are identical, except that the former includes a policy-stability check.

In practice, policy iteration often converges in fewer iterations than value iteration, because the policy tends to converge before the value function does.

Extensions to Dynamic Programming

The main extension is asynchronous dynamic programming: states are backed up individually, in any order, rather than in full synchronous sweeps. Three simple variants are in-place dynamic programming, prioritised sweeping, and real-time dynamic programming.
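
As an illustration of the simplest asynchronous variant, here is a sketch of in-place value iteration: a single value table is updated in place, so states backed up later in a sweep immediately see the new values of states backed up earlier. The function name and the fixed sweep order are my own choices; prioritised sweeping would replace the fixed sweep with a priority queue ordered by Bellman error.

```python
def in_place_value_iteration(P, gamma=1.0, theta=1e-8):
    """Asynchronous backups: update V[s] in place instead of keeping a
    separate copy for the next sweep.  Convergence still holds as long as
    every state keeps being selected for backup."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:                 # fixed sweep order; prioritised sweeping would reorder this
            v_old = V[s]
            V[s] = max(
                sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
                    for prob, s_next, reward, done in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:
            return V
```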
