RL-13 tips for reinforcement learning


  1. If you have a fully observed MDP, then there always exists a deterministic policy that is at least as good as any other (possibly stochastic) policy; in other words, some optimal policy is deterministic.
  2. Does the Markov property still hold during exploration? E.g., with a count-based exploration strategy, the bonus depends on visit counts accumulated over all past states, so the augmented reward is no longer a function of the current state alone.
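A minimal sketch of tip 2, assuming a simple count-based bonus of the form beta / sqrt(N(s)) (the state names and `beta` are illustrative, not from the original notes). The bonus changes across repeated visits to the same state, which is exactly the non-Markov behavior described above:

```python
import math
from collections import defaultdict

def exploration_bonus(counts, state, beta=1.0):
    # Bonus decays with the visit count N(s): beta / sqrt(N(s)).
    return beta / math.sqrt(counts[state])

# Toy rollout: counts are a function of the whole visitation history,
# which is what breaks the Markov property of the bonus reward.
counts = defaultdict(int)
bonuses = []
for s in ["A", "B", "A", "A"]:
    counts[s] += 1
    bonuses.append(exploration_bonus(counts, s))

# Visiting "A" again yields a smaller bonus even though the MDP state
# is identical, so the bonus cannot be written as r(s, a) of the
# current state alone.
```

Here the third and fourth visits to `A` receive bonuses of 1/√2 and 1/√3, strictly smaller than the first visit's bonus of 1.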


Value Based

Convergence of value iteration, proof:

slide from CMU
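A minimal sketch of the standard argument on that slide (notation assumed: Bellman optimality operator $T$, discount $\gamma \in [0,1)$, sup norm $\|\cdot\|_\infty$):

```latex
(TV)(s) = \max_a \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]
```

$T$ is a $\gamma$-contraction in the sup norm,

```latex
\|TV - TU\|_\infty \le \gamma \, \|V - U\|_\infty,
```

so by the Banach fixed-point theorem the iterates $V_{k+1} = TV_k$ converge to the unique fixed point $V^*$ from any initialization, with geometric rate $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$.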


hackerHugo wechat