- If you have a fully observed MDP, then there always exists a deterministic policy that is at least as good as any stochastic policy; in particular, some optimal policy is deterministic (see the derivation after this list).
- Does the Markov property still hold during exploration? E.g., when using a count-based exploration strategy, the bonus is computed from visitation counts accumulated over old states, so the augmented reward depends on the history rather than on the current state alone (see the sketch after this list).
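A short way to see the first point (a standard argument, using the usual definitions of $Q^\pi$ and $Q^*$): any policy's value is a convex combination of its Q-values and therefore cannot exceed their max, so the deterministic greedy policy w.r.t. $Q^*$ already achieves the optimal value:

$$
V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a) \;\le\; \max_a Q^\pi(s, a),
\qquad
\pi^*(s) = \arg\max_a Q^*(s, a) \;\Rightarrow\; V^{\pi^*}(s) = \max_a Q^*(s, a) = V^*(s).
$$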
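A minimal Python sketch of the second point (the class name, the bonus form $\beta/\sqrt{N(s)}$, and the parameter values are illustrative assumptions, not from these notes): the same $(s, r)$ pair yields different augmented rewards as the count table fills up, which is exactly the non-Markovian dependence on old states.

```python
from collections import defaultdict


class CountBonus:
    """Count-based exploration bonus: r~(s) = r + beta / sqrt(N(s))."""

    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s): visits accumulated over the whole history

    def augmented_reward(self, state, reward: float) -> float:
        self.counts[state] += 1
        # The bonus depends on N(s), i.e. on *past* visits, so the augmented
        # reward is not a function of the current transition alone.
        return reward + self.beta / (self.counts[state] ** 0.5)


bonus = CountBonus()
print(bonus.augmented_reward("s0", 1.0))  # 1.0 + 0.1/sqrt(1) = 1.1
print(bonus.augmented_reward("s0", 1.0))  # 1.0 + 0.1/sqrt(2) ~= 1.0707: same input, different output
```

One common fix is to fold the counts into an augmented state, restoring the Markov property at the cost of a larger state space.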
Proof of convergence of value iteration:
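A standard sketch, assuming the tabular setting with discount factor $\gamma \in [0, 1)$: the Bellman backup operator $\mathcal{B}$ is a $\gamma$-contraction in the $\infty$-norm, so by the Banach fixed-point theorem, repeated application $V \leftarrow \mathcal{B}V$ converges to the unique fixed point $V^*$:

$$
\begin{aligned}
\|\mathcal{B}V - \mathcal{B}\bar{V}\|_\infty
&= \max_s \Big|\max_a \big(r(s,a) + \gamma \,\mathbb{E}_{s' \sim p(\cdot \mid s,a)}[V(s')]\big)
         - \max_a \big(r(s,a) + \gamma \,\mathbb{E}_{s' \sim p(\cdot \mid s,a)}[\bar{V}(s')]\big)\Big| \\
&\le \gamma \max_{s,a} \big|\mathbb{E}_{s' \sim p(\cdot \mid s,a)}[V(s') - \bar{V}(s')]\big|
\;\le\; \gamma \,\|V - \bar{V}\|_\infty .
\end{aligned}
$$

The first inequality uses $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$, and the second uses that an expectation of a bounded quantity is bounded by its max norm.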