RL-1 Introduction to Reinforcement Learning

Reinforcement learning is, as LeCun put it, the cherry on top of the cake, so it is clearly worth a close look. David Silver's RL course is very well known. I am planting a flag here: finish the entire course and complete a project by the end of March. This blog post contains my notes on David Silver's first lecture, which gives a brief introduction to RL and focuses on what the RL problem is.





«Reinforcement Learning: An Introduction», Sutton and Barto, 1998

About RL

Many Faces of RL

RL sits at the intersection of many fields: machine learning (computer science), optimal control (engineering), the reward system (neuroscience), classical/operant conditioning (psychology), operations research (mathematics), and bounded rationality (economics).


Differences between RL and other ML paradigms

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data, i.e. not independent and identically distributed)
  • Agent’s actions affect the subsequent data it receives (this is somewhat similar to active learning; I wonder whether anyone has written a paper on applying RL to active learning)

The RL Problem


A reward $R_t$ is a scalar feedback signal that indicates how well the agent is doing at step $t$. The agent’s job is to maximise cumulative reward.

Definition (Reward Hypothesis)

All goals can be described by the maximisation of expected cumulative reward

Sequential Decision Making

  • Actions may have long term consequences

  • Reward may be delayed

  • It may be better to sacrifice immediate reward to gain more long-term reward


Examples:

  • A financial investment (may take months to mature)
  • Refuelling a helicopter (might prevent a crash in several hours)
  • Blocking opponent moves (might help winning chances many moves from now)


Agent and Environment

At each step $t$, the agent executes action $A_t$ and receives observation $O_t$ and scalar reward $R_t$; the environment receives action $A_t$, then emits observation $O_{t+1}$ and reward $R_{t+1}$. This interaction loop repeats at every time step.
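To make the interaction loop concrete, here is a minimal Python sketch; the `Environment` class and its `step` method are hypothetical stand-ins for illustration, not anything from the lecture:

```python
import random

class Environment:
    """A hypothetical two-state toy environment, for illustration only."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # The environment receives action A_t and emits
        # observation O_{t+1} and scalar reward R_{t+1}.
        self.state = (self.state + action) % 2
        observation = self.state
        reward = 1.0 if self.state == 1 else 0.0
        return observation, reward

env = Environment()
total_reward = 0.0
for t in range(100):
    action = random.choice([0, 1])          # agent executes action A_t
    observation, reward = env.step(action)  # agent receives O_{t+1} and R_{t+1}
    total_reward += reward                  # the agent's job: maximise cumulative reward
```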

History and State

  • The history is the sequence of observations, actions, rewards: $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$

  • i.e. all observable variables up to time t

  • i.e. the sensorimotor stream of a robot or embodied agent

  • What happens next depends on the history:

    • The agent selects actions
    • The environment selects observations/rewards
  • State is the information used to determine what happens next

  • Formally, state is a function of the history: $S_t = f(H_t)$

Environment State

  • The environment state $S_t^e$ is the environment’s private representation
  • i.e. whatever data the environment uses to pick the next observation/reward
  • The environment state is not usually visible to the agent
  • Even if $S_t^e$ is visible, it may contain irrelevant information

Agent State

  • The agent state $S_t^a$ is the agent’s internal representation
  • i.e. whatever information the agent uses to pick the next action
  • i.e. it is the information used by reinforcement learning algorithms
  • It can be any function of history: $S_t^a = f(H_t)$

Information State

An information state (a.k.a. Markov state) contains all useful information from the history.


A state $S_t$ is Markov if and only if $P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$

  • “The future is independent of the past given the present”

  • Once the state is known, the history may be thrown away

  • i.e. The state is a sufficient statistic of the future

  • The environment state $S_t^e$ is Markov

  • The history $H_t$ is Markov

Fully Observable Environments

Full observability: the agent directly observes the environment state, $O_t = S_t^a = S_t^e$

  • Agent state = environment state = information state
  • Formally, this is a Markov decision process (MDP)

Partially Observable Environments

Partial observability: agent indirectly observes environment

  • Now agent state $\neq $ environment state
  • Formally this is a partially observable Markov decision process (POMDP)
  • Agent must construct its own state representation $S_t^a$, e.g.
    • Complete history: $S_t^a=H_t$
    • Beliefs of environment state: $S_t^a = (P[S_t^e = s^1], \ldots, P[S_t^e = s^n])$
    • Recurrent neural network: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$ (see the sketch below)
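The recurrent update in the last bullet can be written out directly; a minimal NumPy sketch, where the dimensions and random weights are arbitrary assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3                      # arbitrary sizes, for illustration
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))  # recurrent weights
W_o = rng.normal(size=(obs_dim, state_dim))    # observation weights

s = np.zeros(state_dim)                        # initial agent state S_0^a
for t in range(10):
    o = rng.normal(size=obs_dim)               # placeholder observation O_t
    s = sigmoid(s @ W_s + o @ W_o)             # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
```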

Inside An RL Agent

Major Components of an RL Agent

  • Policy: agent’s behaviour function
  • Value function: how good is each state and/or action
  • Model: agent’s representation of the environment


Policy

  • A policy is the agent’s behaviour
  • It is a map from state to action, e.g.
  • Deterministic policy: $a=\pi(s)$
  • Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
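As a small illustration (the states, actions, and probabilities below are made up), a deterministic policy can be a plain table lookup, while a stochastic policy samples from a distribution:

```python
import random

# Deterministic policy a = pi(s): a plain table lookup.
pi_det = {"s1": "left", "s2": "right"}

def act_deterministic(s):
    return pi_det[s]

# Stochastic policy pi(a|s) = P[A_t = a | S_t = s]: sample an action.
pi_stoch = {"s1": {"left": 0.8, "right": 0.2},
            "s2": {"left": 0.3, "right": 0.7}}

def act_stochastic(s):
    actions, probs = zip(*pi_stoch[s].items())
    return random.choices(actions, weights=probs)[0]
```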

Value Function

  • Value function is a prediction of future reward (expected future total reward)

  • Used to evaluate the goodness/badness of states

  • And therefore to select between actions, e.g. $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$
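To make "expected future total reward" concrete, here is a minimal sketch of the discounted return $G_t$ for one sampled reward sequence; the discount factor and rewards are made-up numbers:

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```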


Model

Note: building a model is not always required (model-free agents work without one).

  • A model predicts what the environment will do next
  • Transitions: $\mathcal{P}$ predicts the next state (state transition model), e.g. $\mathcal{P}_{ss'}^a = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • Rewards: $\mathcal{R}$ predicts the next (immediate) reward, e.g. $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
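A tabular model can simply store $\mathcal{P}$ and $\mathcal{R}$ as tables keyed by (state, action); the tiny numbers below are invented for illustration:

```python
# Transition model: P[(s, a)] is a distribution over next states s' (P_{ss'}^a).
# Reward model: R[(s, a)] = E[R_{t+1} | S_t = s, A_t = a] (R_s^a).
P = {("s1", "go"): {"s1": 0.1, "s2": 0.9},
     ("s2", "go"): {"s1": 0.5, "s2": 0.5}}
R = {("s1", "go"): 0.0, ("s2", "go"): 1.0}

def next_state_distribution(s, a):
    return P[(s, a)]

def expected_reward(s, a):
    return R[(s, a)]
```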

RL Agent Taxonomy

Agents can be categorised by which of these components they use: value based (value function, no explicit policy), policy based (explicit policy, no value function), and actor critic (both a policy and a value function). Orthogonally, an agent is model free (policy and/or value function, no model) or model based (uses a model).

Problems within RL

Learning and Planning

  • Reinforcement Learning
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy

Exploration and Exploitation Trade Off

  • Exploration finds more information about the environment
  • Exploitation exploits known information to maximise reward

For example, going to your favourite restaurant is exploitation, while trying a new restaurant is exploration.
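One standard way to balance the two is epsilon-greedy action selection (a common technique, not specific to this lecture); a minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Usage: q_values holds the current action-value estimates for one state.
action = epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1)
```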

Prediction and Control

  • Prediction: evaluate the future
    • Given a policy
  • Control: optimise the future
    • Find the best policy