RL-2 Markov Decision Processes

This section covers Markov Processes (MP), Markov Reward Processes (MRP), Markov Decision Processes (MDP), the Bellman equation, and what "optimal" means and how it can be computed.

Material

  • PDF
  • Video 1 (English)
  • Video 2 (Chinese)
  • «An Introduction to Reinforcement Learning, Sutton and Barto, 1998»

Markov Processes

Intro to MDPs

  • Markov decision processes formally describe an environment for reinforcement learning

  • Where the environment is fully observable

  • The current state completely characterises the process

  • Almost all RL problems can be formalised as MDPs

Markov Property

The future is independent of the past given the present

Definition

A state $S_t$ is Markov if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$

  • The state captures all relevant information from the history
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future

State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$. The state transition matrix $\mathcal{P}$ collects these probabilities for all pairs of states, and each row of the matrix sums to 1.

Markov Chains

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2$, … with the Markov property

Definition

A Markov Process (or Markov Chain) is a tuple $<\mathcal{S},\mathcal{P}>$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$ (see the sketch below)
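A minimal sketch of how such a chain can be represented and sampled. The three states and all probabilities below are invented for illustration; they are not the Student MRP from the lecture.

```python
import numpy as np

# Hypothetical 3-state Markov chain; row s of P is the distribution over successor states.
states = ["A", "B", "End"]
P = np.array([
    [0.5, 0.5, 0.0],   # P[s, s'] = Pr(S_{t+1} = s' | S_t = s)
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],   # "End" is absorbing
])

def sample_chain(P, start=0, steps=10, seed=0):
    """Sample S_1, S_2, ...; the next state depends only on the current state (Markov property)."""
    rng = np.random.default_rng(seed)
    s, path = start, [start]
    for _ in range(steps):
        s = rng.choice(len(P), p=P[s])
        path.append(int(s))
    return path

print([states[s] for s in sample_chain(P)])
```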

Markov Reward Process

MRP

A Markov reward process is a Markov chain with values.

Definition

A Markov Reward Process is a tuple $<\mathcal{S},\mathcal{P},\mathcal{R},\gamma>$

  • $\mathcal{S}$ is a finite set of states

  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$

  • $\gamma$ is a discount factor, $\gamma \in[0,1]$

Return

Definition

The return $G_t$ is the total discounted reward from time-step $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
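A tiny numeric sketch of the definition (the reward sequence and $\gamma$ below are arbitrary):

```python
# Return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
def discounted_return(rewards, gamma):
    """rewards = [R_{t+1}, R_{t+2}, ...] observed along one sample episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([-2, -2, -2, 10], gamma=0.9))  # arbitrary reward sequence
```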

Why discount?

  • Mathematically convenient to discount rewards
  • Avoids infinite returns in cyclic Markov processes
  • Uncertainty about the future may not be fully represented
  • ……

Value Function

The value function $v(s)$ gives the long-term value of state $s$

Definition

The state-value function $v(s)$ of an MRP is the expected return starting from state $s$: $v(s) = \mathbb{E}[G_t \mid S_t = s]$

(Here $\mathbb{E}$ denotes the expected value, i.e. the mathematical expectation.)

Example: Student MRP Returns

(Figure: returns computed for several sample episodes of the Student MRP.)

NOTICE!!!

  • The sample returns are random; the value function is not random, but rather the expectation of those random variables (the returns)! (See the sketch after this list.)

  • In $G_t$, the G stands for Goal and the subscript $t$ is the time-step.

  • The definition of the return does not contain an expectation:

    The return $G_t$ is the total discounted reward from time-step t.

    but the definition of the value function does:

    The state value function $v(s)$ of an MRP is the expected return starting from state $s$

  • The return is indexed by a time-step, whereas the value function is indexed by a state.
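The first point can be checked numerically: sample many episodes from the same start state, compute each (random) return, and average them; the average is a Monte-Carlo estimate of the (non-random) value $v(s)$. A rough sketch, reusing the hypothetical 3-state chain from above with assumed per-state rewards:

```python
import numpy as np

# Same hypothetical 3-state chain as above, plus an assumed reward received in each state.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])
gamma = 0.9

def sample_return(rng, s=0, max_steps=200):
    """One random return G_t from start state s (truncated after max_steps)."""
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(len(P), p=P[s])
    return g

rng = np.random.default_rng(1)
returns = [sample_return(rng) for _ in range(5000)]
print(np.mean(returns))   # individual returns vary; their mean estimates v(s=0)
```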

Bellman Equation

Apply a recursive decomposition to the value function.

The value function can be decomposed into two parts:

  • immediate reward $R_{t+1}$
  • discounted value of successor state $\gamma v(S_{t+1})$

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

Averaging over the successor states with the transition probabilities gives the equivalent form $v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$.

Bellman Equation in Matrix Form (key point)

The Bellman equation can be expressed concisely using matrices: $v = \mathcal{R} + \gamma \mathcal{P} v$, where $v$ is a column vector with one entry per state.

Solving the Bellman Equation

  • The Bellman equation is a linear equation

  • It can be solved directly: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$ (see the sketch after this list)

  • Computational complexity is $O(n^3)$ for $n$ states

  • Direct solution only possible for small MRPs

  • There are many iterative methods for large MRPs, e.g.

    • Dynamic programming
    • Monte-Carlo evaluation
    • Temporal-Difference learning
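A minimal sketch of the direct solution for a small MRP, reusing the same hypothetical transition matrix and rewards as in the earlier sketches (assumed numbers, not the lecture's Student MRP):

```python
import numpy as np

# Hypothetical 3-state MRP (same toy numbers as in the earlier sketches).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.2, 0.8],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -2.0, 0.0])
gamma = 0.9

# Direct solution of the linear Bellman equation v = R + gamma * P v,
# i.e. v = (I - gamma * P)^(-1) R; O(n^3) in the number of states.
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)   # v[0] should roughly match the Monte-Carlo estimate above, up to sampling noise
```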

Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

Definition

A Markov Decision Process is a tuple $<\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma>$

  • $\mathcal{S}$ is a finite set of states

  • $\mathcal{A}$ is a finite set of actions

  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$

  • $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$

  • $\gamma$ is a discount factor, $\gamma \in[0,1]$

Policy

Definition

A policy $\pi$ is a distribution over actions given states, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$

This formula defines a stochastic policy.

There is no reward in the formula because the reward is already captured by the state.

  • A policy fully defines the behaviour of an agent

  • MDP policies depend on the current state (not the history)

  • i.e. Policies are stationary (time-independent): $A_t \sim \pi(\cdot \mid S_t),\ \forall t > 0$

A policy is in effect a stochastic transition matrix: each row corresponds to a state and gives a distribution over the actions (see the sketch below).

Time-independent, but state-dependent.

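A minimal sketch of that view: the policy is stored as an $|\mathcal{S}| \times |\mathcal{A}|$ matrix whose rows sum to 1, and acting means sampling a column index from the row of the current state. The states, actions, and probabilities are invented for illustration.

```python
import numpy as np

actions = ["study", "relax"]            # assumed action names
pi = np.array([
    [0.7, 0.3],                         # pi[s, a] = pi(a | s); each row sums to 1
    [0.4, 0.6],
    [0.0, 1.0],
])

def act(pi, s, rng=None):
    """Sample an action A_t ~ pi(. | S_t = s); depends only on the state, not on time."""
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(pi.shape[1], p=pi[s]))

print(actions[act(pi, s=0)])
```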

Value Function

Definition

  • state-value function: The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

  • action-value function: The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

The difference between the two value functions is whether a specific action has already been taken.

Bellman Expectation Equation (key point)

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state, $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

The action-value function can similarly be decomposed, $q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$

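A sketch of iterative policy evaluation, which simply applies the Bellman expectation equation repeatedly until $v_\pi$ stops changing. The 2-state, 2-action MDP below (tensors P[a, s, s'] and R[s, a]) and the policy are all invented for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2],             # P[a, s, s'] = Pr(S_{t+1}=s' | S_t=s, A_t=a)
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],               # R[s, a] = expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],              # pi[s, a] = pi(a | s)
               [0.9, 0.1]])
gamma = 0.9

# Repeatedly apply the Bellman expectation equation:
#   q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v(s')
#   v(s)   = sum_a pi(a|s) q(s,a)
v = np.zeros(R.shape[0])
for _ in range(1000):
    q = R + gamma * np.einsum("asn,n->sa", P, v)
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)
print(q)
```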

Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies, $v_*(s) = \max_\pi v_\pi(s)$

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies, $q_*(s, a) = \max_\pi q_\pi(s, a)$

  • The optimal value function specifies the best possible performance in the MDP
  • An MDP is “solved” when we know the optimal value fn.


Optimal Policy

Define a partial ordering over policies: $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all states $s$.

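To make these definitions concrete: for a finite discounted MDP the maximum over all policies is attained by a deterministic policy, so for a tiny MDP one can literally enumerate every deterministic policy, evaluate each one exactly, and take the elementwise maximum to get $v_*$. A sketch, reusing the hypothetical 2-state, 2-action MDP from the policy-evaluation block above:

```python
import numpy as np
from itertools import product

# Same hypothetical 2-state, 2-action MDP as in the policy-evaluation sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[a, s, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
gamma, n_states, n_actions = 0.9, 2, 2

def evaluate(policy):
    """Exact v_pi for a deterministic policy (a tuple giving one action per state)."""
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])   # induced MRP transitions
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

values = {pol: evaluate(pol) for pol in product(range(n_actions), repeat=n_states)}
v_star = np.max(np.stack(list(values.values())), axis=0)   # v*(s) = max over policies of v_pi(s)
best = max(values, key=lambda pol: values[pol].sum())      # a policy attaining the maximum
print(v_star, best, values[best])
```

For realistic MDPs this enumeration is infeasible; it is shown here only because it mirrors the max-over-policies definition directly.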
