# Material

• PDF
• Video 1 (English)
• Video 2 (Chinese)
• *Reinforcement Learning: An Introduction*, Sutton and Barto, 1998

# Markov Processes

## Intro to MDPs

• Markov decision processes formally describe an environment for reinforcement learning

• Where the environment is fully observable

• The current state completely characterises the process

• Almost all RL problems can be formalised as MDPs

## Markov Property

The future is independent of the past given the present

Definition

A state $S_t$ is Markov if and only if

$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$

• The state captures all relevant information from the history
• Once the state is known, the history may be thrown away
• i.e. The state is a sufficient statistic of the future

### State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$. The state transition matrix $\mathcal{P}$ collects these probabilities, one row per source state; each row sums to 1.

## Markov Chains

### Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.

Definition

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$

• $\mathcal{S}$ is a (finite) set of states
• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
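As a concrete illustration (not from the original notes), here is a minimal sketch of sampling a state sequence from a Markov chain; the 3-state transition matrix `P` is a made-up example:

```python
import numpy as np

# A minimal sketch: sampling a state sequence from a Markov chain.
# P is a made-up 3-state transition matrix; row P[s] is the
# distribution over successor states of state s.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

rng = np.random.default_rng(0)
s = 0
trajectory = [s]
for _ in range(10):
    # The next state depends only on the current state (Markov property).
    s = rng.choice(3, p=P[s])
    trajectory.append(s)
print(trajectory)
```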

# Markov Reward Process

## MRP

A Markov reward process is a Markov chain with values.

Definition

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

• $\mathcal{S}$ is a finite set of states

• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

• $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$

• $\gamma$ is a discount factor, $\gamma \in[0,1]$

## Return

Definition

The return $G_t$ is the total discounted reward from time-step $t$:

$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
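As a quick illustration, the return of one sampled episode can be computed directly from this definition; the reward sequence and $\gamma$ below are made-up values:

```python
# A minimal sketch of the return G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ...
# for one sampled episode; the rewards and gamma are made up.
gamma = 0.9
rewards = [-2, -2, -2, 10]   # R_1, R_2, R_3, R_4
G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)                    # the total discounted reward from time-step 0
```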

### Why discount?

• Mathematically convenient to discount rewards
• Avoids infinite returns in cyclic Markov processes
• Uncertainty about the future may not be fully represented
• ……

## Value Function

The value function $v(s)$ gives the long-term value of state $s$

Definition

The state-value function $v(s)$ of an MRP is the expected return starting from state $s$:

$v(s) = \mathbb{E}[G_t \mid S_t = s]$

(Here $\mathbb{E}$ denotes the expected value.)

### Example: Student MRP Returns

Note:

• The sampled returns are random, but the value function is not: it is the expectation of those random variables (the returns)!

• In $G_t$, the G stands for Goal and the subscript $t$ is the time-step

• The definition of the return does not contain an expectation:

The return $G_t$ is the total discounted reward from time-step t.

but the definition of the value function does:

The state-value function $v(s)$ of an MRP is the expected return starting from state $s$.

• The return is indexed by a time-step, whereas the value function is indexed by a state
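To make this distinction concrete, here is a hedged sketch using a made-up 3-state MRP: each sampled return $G_0$ comes out different (it is random), while the average over many samples estimates the expectation $v(s)$:

```python
import numpy as np

# Sketch: returns are random variables; v(s) is their expectation.
# Made-up 3-state MRP; state 2 is terminal with zero reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])   # expected reward on leaving each state
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(s=0):
    """Sample one episode from state s and return its (random) G_0."""
    G, discount = 0.0, 1.0
    while s != 2:                 # roll forward until the terminal state
        G += discount * R[s]
        discount *= gamma
        s = rng.choice(3, p=P[s])
    return G

returns = [sample_return() for _ in range(10_000)]
print(returns[:3])                # individual returns vary: they are random
print(np.mean(returns))           # their mean estimates v(0), an expectation
```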

## Bellman Equation

The Bellman equation gives a recursive decomposition of the value function.

The value function can be decomposed into two parts:

• immediate reward $R_{t+1}$
• discounted value of successor state $\gamma v(S_{t+1})$

$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$

### Solving the Bellman Equation

• The Bellman equation is a linear equation

• It can be solved directly:

$v = \mathcal{R} + \gamma \mathcal{P} v$

$(I - \gamma \mathcal{P})\, v = \mathcal{R}$

$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$

• Computational complexity is $O(n^3)$ for $n$ states

• Direct solution is only possible for small MRPs (see the numerical sketch after this list)

• There are many iterative methods for large MRPs, e.g.

• Dynamic programming
• Monte-Carlo evaluation
• Temporal-Difference learning
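The numerical sketch promised above: solving $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$ with a linear solver, for the same made-up 3-state MRP used earlier, so the Monte-Carlo estimate above should agree with this exact solution:

```python
import numpy as np

# A minimal sketch of the direct solution v = (I - gamma*P)^{-1} R
# for the same made-up 3-state MRP as above (state 2 is absorbing).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])
gamma = 0.9

# Solving the linear system costs O(n^3) for n states, which is why
# the direct method only scales to small MRPs.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)   # exact state values; v[2] = 0 at the terminal state
```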

# Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Definition

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

• $\mathcal{S}$ is a finite set of states

• $\mathcal{A}$ is a finite set of actions

• $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$

• $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$

• $\gamma$ is a discount factor, $\gamma \in[0,1]$

## Policy

Definition

A policy $\pi$ is a distribution over actions given states:

$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$

• A policy fully defines the behaviour of an agent

• MDP policies depend on the current state (not the history)

• i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$

In effect, a policy is a stochastic transition matrix over actions, as the sketch below illustrates.
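A hedged sketch of that view: the policy is stored as an $|\mathcal{S}| \times |\mathcal{A}|$ matrix whose rows are probability distributions over actions (the numbers are made up):

```python
import numpy as np

# Sketch: a policy as a stochastic matrix pi[s, a] = P[A_t = a | S_t = s].
# Each row is a distribution over actions; the numbers are made up.
pi = np.array([[0.7, 0.3],    # state 0
               [0.2, 0.8],    # state 1
               [0.5, 0.5]])   # state 2
rng = np.random.default_rng(0)

s = 1
a = rng.choice(2, p=pi[s])    # the action depends on the state, not on time
print(a)
```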

Policies are time-independent but state-dependent.

## Value Function

Definition

• state-value function: The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

• action-value function: The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

## Bellman Expectation Equation (key point)

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

The action-value function can similarly be decomposed:

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$

## Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies: $v_*(s) = \max_\pi v_\pi(s)$

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies: $q_*(s, a) = \max_\pi q_\pi(s, a)$

• The optimal value function specifies the best possible performance in the MDP
• An MDP is “solved” when we know the optimal value function

### Optimal Policy

Define a partial ordering over policies:

$\pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s), \forall s$
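Comparing two policies under this ordering requires their value functions. Below is a hedged sketch of iterative policy evaluation, which repeatedly applies the Bellman expectation backup to a made-up 2-state, 2-action MDP (all of `P`, `R`, and `pi` are illustrative assumptions):

```python
import numpy as np

# Sketch: iterative policy evaluation via the Bellman expectation equation,
#   v(s) <- sum_a pi(a|s) * ( R[s,a] + gamma * sum_s' P[a,s,s'] * v(s') )
# The MDP (P, R) and the policy pi are made up for illustration.
P = np.array([                  # P[a, s, s']: transitions under action a
    [[0.8, 0.2], [0.1, 0.9]],   # action 0
    [[0.3, 0.7], [0.6, 0.4]],   # action 1
])
R = np.array([[1.0, 0.0],       # R[s, a]: expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],      # pi[s, a]: action probabilities
               [0.5, 0.5]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):           # iterate the backup until v converges
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # q_pi[s, a]
    v = (pi * q).sum(axis=1)    # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
print(v)                        # v_pi for this policy; compare across policies
```

Evaluating two policies this way and comparing their `v` vectors element-wise is exactly the partial ordering defined above.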