RL-13: tips for reinforcement learning

MDP

  1. In a fully observed MDP, there always exists a deterministic policy that is at least as good as any other (possibly stochastic) policy, so restricting the search to deterministic policies loses nothing; see the first sketch after this list.
  2. Does the Markov property still hold during exploration? E.g., with a count-based exploration strategy, the bonus depends on visit counts accumulated over old states, so the augmented reward is no longer a function of the current state alone; see the second sketch after this list.
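
A minimal sketch of tip 1, assuming a toy two-state MDP whose transitions and rewards are invented for illustration: value iteration converges to the optimal values, and taking an argmax over the resulting Q-values yields a deterministic policy, which is optimal.

```python
import numpy as np

# Hypothetical toy MDP (all numbers invented for illustration):
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

# Greedy w.r.t. Q is deterministic (one action per state) and optimal,
# so no stochastic policy can do better in this fully observed MDP.
pi = Q.argmax(axis=1)
print("V* =", V_new, "deterministic optimal policy:", pi)
```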
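
And a hedged sketch of tip 2, assuming a bonus of the common form beta / sqrt(N(s)) (one standard choice, not necessarily the one this note had in mind). Because N(s) is accumulated over the whole run, revisiting the same state yields a different reward, which breaks the Markov property unless the counts are folded into the state.

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)  # N(s), accumulated over the whole history

def count_bonus(state, beta=0.1):
    """Exploration bonus beta / sqrt(N(s)) added to the reward."""
    visit_counts[state] += 1
    return beta / math.sqrt(visit_counts[state])

# Two visits to the *same* state get different bonuses, so the
# bonus-augmented reward depends on history, not on the state alone.
print(count_bonus("s0"))  # first visit: 0.1
print(count_bonus("s0"))  # second visit: ~0.0707
```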


Value-Based

Convergence of value iteration, proof:

[Slide from CMU]
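
The slide itself is not reproduced here. As a hedged reconstruction of the standard argument (not necessarily the exact one on the CMU slide): the Bellman optimality operator is a gamma-contraction in the sup norm, so by the Banach fixed-point theorem value iteration converges to the unique fixed point V*.

```latex
% Standard contraction argument for value iteration (sketch).
(\mathcal{T}V)(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big]

% For any V, W (using |max_a f(a) - max_a g(a)| <= max_a |f(a) - g(a)|):
\|\mathcal{T}V - \mathcal{T}W\|_\infty
  \le \gamma \max_{s,a} \sum_{s'} P(s' \mid s,a)\, |V(s') - W(s')|
  \le \gamma \|V - W\|_\infty

% Banach fixed-point theorem: V_{k+1} = \mathcal{T}V_k converges to the
% unique V^* with \|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty.
```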

