# Material

• PDF
• Video 1 (English)
• Video 2 (Chinese)
• *Reinforcement Learning: An Introduction*, Sutton and Barto, 1998

# Markov Processes

## Intro to MDPs

• Markov decision processes formally describe an environment for reinforcement learning

• Where the environment is fully observable

• The current state completely characterises the process

• Almost all RL problems can be formalised as MDPs

## Markov Property

The future is independent of the past given the present

Definition

A state $S_t$ is Markov if and only if

$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$

• The state captures all relevant information from the history
• Once the state is known, the history may be thrown away
• i.e. The state is a sufficient statistic of the future

### State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$. The state transition matrix $\mathcal{P}$ collects these probabilities, one row per source state; each row sums to 1.

## Markov Chains

### Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.

Definition

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$

• $\mathcal{S}$ is a (finite) set of states
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
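As a concrete illustration (not from the original notes), here is a minimal sketch of sampling a state sequence from a Markov chain; the 3-state transition matrix `P` is a made-up example:

```python
import numpy as np

# A minimal sketch: sampling a state sequence from a Markov chain.
# P is a made-up 3-state transition matrix; row P[s] is the
# distribution over successor states of state s.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

rng = np.random.default_rng(0)
s = 0
trajectory = [s]
for _ in range(10):
    # The next state depends only on the current state (Markov property).
    s = rng.choice(3, p=P[s])
    trajectory.append(s)
print(trajectory)
```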

# Markov Reward Process

## MRP

A Markov reward process is a Markov chain with values.

Definition

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

• $\mathcal{S}$ is a finite set of states

• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

• $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$

• $\gamma$ is a discount factor, $\gamma \in[0,1]$

## Return

Definition

The return $G_t$ is the total discounted reward from time-step $t$:

$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
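As a quick illustration, the return of one sampled episode can be computed directly from this definition; the reward sequence and $\gamma$ below are made-up values:

```python
# A minimal sketch of the return G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ...
# for one sampled episode; the rewards and gamma are made up.
gamma = 0.9
rewards = [-2, -2, -2, 10]   # R_1, R_2, R_3, R_4
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)                    # the total discounted reward from time-step 0
```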

### Why discount?

• Mathematically convenient to discount rewards
• Avoids infinite returns in cyclic Markov processes
• Uncertainty about the future may not be fully represented
• ……

## Value Function

The value function $v(s)$ gives the long-term value of state $s$

Definition

The state-value function $v(s)$ of an MRP is the expected return starting from state $s$:

$v(s) = \mathbb{E}[G_t \mid S_t = s]$

(Here $\mathbb{E}$ denotes the expected value.)

### Example: Student MRP Returns

Note:

• The sampled returns are random, but the value function is not: it is the expectation of those random variables (the returns)!

• In $G_t$, the G stands for Goal and the subscript $t$ is the time-step

• The definition of the return does not contain an expectation:

The return $G_t$ is the total discounted reward from time-step t.

but the definition of the value function does:

The state-value function $v(s)$ of an MRP is the expected return starting from state $s$.

• The return is indexed by a time-step, whereas the value function is indexed by a state
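To make this distinction concrete, here is a hedged sketch using a made-up 3-state MRP: each sampled return $G_0$ comes out different (it is random), while the average over many samples estimates the expectation $v(s)$:

```python
import numpy as np

# Sketch: returns are random variables; v(s) is their expectation.
# Made-up 3-state MRP; state 2 is terminal with zero reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])   # expected reward on leaving each state
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(s=0):
    """Sample one episode from state s and return its (random) G_0."""
    G, discount = 0.0, 1.0
    while s != 2:                 # roll forward until the terminal state
        G += discount * R[s]
        discount *= gamma
        s = rng.choice(3, p=P[s])
    return G

returns = [sample_return() for _ in range(10_000)]
print(returns[:3])                # individual returns vary: they are random
print(np.mean(returns))           # their mean estimates v(0), an expectation
```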

## Bellman Equation

The Bellman equation gives a recursive decomposition of the value function.

The value function can be decomposed into two parts:

• immediate reward $R_{t+1}$
• discounted value of successor state $\gamma v(S_{t+1})$

$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$

### Solving the Bellman Equation

• The Bellman equation is a linear equation

• It can be solved directly:

$v = \mathcal{R} + \gamma \mathcal{P} v$

$(I - \gamma \mathcal{P})\, v = \mathcal{R}$

$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$

• Computational complexity is $O(n^3)$ for $n$ states

• Direct solution is only possible for small MRPs (see the numerical sketch after this list)

• There are many iterative methods for large MRPs, e.g.

• Dynamic programming
• Monte-Carlo evaluation
• Temporal-Difference learning
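The numerical sketch promised above: solving $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$ with a linear solver, for the same made-up 3-state MRP used earlier, so the Monte-Carlo estimate above should agree with this exact solution:

```python
import numpy as np

# A minimal sketch of the direct solution v = (I - gamma*P)^{-1} R
# for the same made-up 3-state MRP as above (state 2 is absorbing).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])
gamma = 0.9

# Solving the linear system costs O(n^3) for n states, which is why
# the direct method only scales to small MRPs.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)   # exact state values; v[2] = 0 at the terminal state
```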

# Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Definition

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

• $\mathcal{S}$ is a finite set of states

• $\mathcal{A}$ is a finite set of actions

• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$

• $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$

• $\gamma$ is a discount factor, $\gamma \in[0,1]$

## Policy

Definition

A policy $\pi$ is a distribution over actions given states:

$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$

• A policy fully defines the behaviour of an agent

• MDP policies depend on the current state (not the history)

• i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$

In effect, a policy is a stochastic transition matrix over actions, as the sketch below illustrates.
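A hedged sketch of that view: the policy is stored as an $|\mathcal{S}| \times |\mathcal{A}|$ matrix whose rows are probability distributions over actions (the numbers are made up):

```python
import numpy as np

# Sketch: a policy as a stochastic matrix pi[s, a] = P[A_t = a | S_t = s].
# Each row is a distribution over actions; the numbers are made up.
pi = np.array([[0.7, 0.3],    # state 0
               [0.2, 0.8],    # state 1
               [0.5, 0.5]])   # state 2
rng = np.random.default_rng(0)

s = 1
a = rng.choice(2, p=pi[s])    # the action depends on the state, not on time
print(a)
```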

Policies are time-independent but state-dependent.

## Value Function

Definition

• state-value function: The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

• action-value function: The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

## Bellman Expectation Equation (key point)

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

The action-value function can similarly be decomposed:

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$

## Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies: $v_*(s) = \max_\pi v_\pi(s)$

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies: $q_*(s, a) = \max_\pi q_\pi(s, a)$

• The optimal value function specifies the best possible performance in the MDP
• An MDP is “solved” when we know the optimal value function

### Optimal Policy

Define a partial ordering over policies:

$\pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s), \forall s$
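Comparing two policies under this ordering requires their value functions. Below is a hedged sketch of iterative policy evaluation, which repeatedly applies the Bellman expectation backup to a made-up 2-state, 2-action MDP (all of `P`, `R`, and `pi` are illustrative assumptions):

```python
import numpy as np

# Sketch: iterative policy evaluation via the Bellman expectation equation,
#   v(s) <- sum_a pi(a|s) * ( R[s,a] + gamma * sum_s' P[a,s,s'] * v(s') )
# The MDP (P, R) and the policy pi are made up for illustration.
P = np.array([                  # P[a, s, s']: transitions under action a
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.3, 0.7], [0.6, 0.4]],   # action 1
])
R = np.array([[1.0, 0.0],       # R[s, a]: expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],      # pi[s, a]: action probabilities
               [0.5, 0.5]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):           # iterate the backup until v converges
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # q_pi[s, a]
    v = (pi * q).sum(axis=1)    # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
print(v)                        # v_pi for this policy; compare across policies
```

Evaluating two policies this way and comparing their `v` vectors element-wise is exactly the partial ordering defined above.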