RL-5 Model-Free Prediction

Model-free prediction means:

Estimate the value function of an unknown MDP

This lecture is about prediction for problems where no model is available (control is not covered here). It introduces the Monte-Carlo learning algorithm, temporal-difference learning (TD(0)), and its generalisation TD($\lambda$); the key point is to understand the forward and backward views of TD($\lambda$).

Material

PDF

Video by David Silver

«Reinforcement Learning: An Introduction», Sutton and Barto, 1998, Chapters 5 and 6

Monte-Carlo Learning

Monte-Carlo Reinforcement Learning

  • MC methods learn directly from episodes of experience
  • MC is model-free: no knowledge of MDP transitions / rewards
  • MC learns from complete episodes: no bootstrapping
  • MC uses the simplest possible idea: value = mean return
  • Caveat: can only apply MC to episodic MDPs
    • All episodes must terminate

Monte-Carlo Policy Evaluation

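Goal: learn $v_\pi$ from episodes of experience generated under policy $\pi$. Recall from the earlier MDP lectures that the return is the total discounted reward and the value function is its expectation:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_{T},
\qquad
v_\pi(s) = \mathbb{E}_\pi\left[\,G_t \mid S_t = s\,\right]
$$

Monte-Carlo policy evaluation simply replaces the expected return with the empirical mean of the returns observed in the sampled episodes.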

First-Visit Monte-Carlo Policy Evaluation


Every-Visit Monte-Carlo Policy Evaluation


NOTICE

  1. The first-visit MC method estimates $v_\pi(s)$ as the average of the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s.
  2. First-visit MC has been most widely studied
  3. Every-visit MC extends more naturally to function approximation and eligibility traces (I don't fully understand this part yet)

—— excerpted from Sutton & Barto's «Reinforcement Learning: An Introduction»
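
As an illustration of the difference, here is a minimal sketch of tabular MC prediction covering both variants. It assumes a hypothetical episodic environment with `env.reset()` / `env.step(action)` returning `(next_state, reward, done)` and a `policy(state)` function; none of these names come from the lecture.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes, gamma=1.0, first_visit=True):
    """Tabular Monte-Carlo policy evaluation: V(s) = mean of observed returns."""
    returns_sum = defaultdict(float)   # S(s): total return accumulated for s
    returns_count = defaultdict(int)   # N(s): number of returns averaged for s
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one complete episode following the policy (MC needs full episodes).
        episode = []                   # (S_t, R_{t+1}) pairs
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            episode.append((state, reward))
            state = next_state

        # Compute the returns G_t backwards, then average them into V.
        returns, G = [], 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()              # forward order, so "first visit" is well defined

        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue               # first-visit MC: ignore later visits in this episode
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```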

Blackjack Example


Incremental Monte-Carlo

The mean $\mu_1, \mu_2, \dots$ of a sequence $x_1, x_2, \dots$ can be computed incrementally:

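$$
\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j
      = \frac{1}{k}\Big(x_k + (k-1)\mu_{k-1}\Big)
      = \mu_{k-1} + \frac{1}{k}\big(x_k - \mu_{k-1}\big)
$$

Applying this to MC gives incremental updates after each episode: for every state $S_t$ with return $G_t$,

$$
N(S_t) \leftarrow N(S_t) + 1, \qquad
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)
$$

In non-stationary problems the running count is replaced by a fixed step size, $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)$, which gradually forgets old episodes.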

Temporal-Difference Learning

  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess

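The simplest TD algorithm, TD(0), updates $V(S_t)$ towards the one-step bootstrapped target instead of the full return:

$$
V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)
$$

$R_{t+1} + \gamma V(S_{t+1})$ is called the TD target, and $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ the TD error; MC, by contrast, updates $V(S_t)$ towards the actual return $G_t$.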

Driving Home Example


Advantages and Disadvantages of MC vs. TD

  • TD can learn before knowing the final outcome
    • TD can learn online after every step
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome
    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments
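
A minimal sketch of tabular TD(0), to contrast with the MC code above: the value table is updated after every single step, so learning happens online and does not require the episode to finish. The same hypothetical `env` / `policy` interface as before is assumed.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): move V(S_t) towards R_{t+1} + gamma * V(S_{t+1}) after every step."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # Bootstrapped TD target; a terminal state is worth 0 by convention.
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```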

Bias/Variance Trade-Off

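In brief, the trade-off discussed on the slides:

  • The return $G_t = R_{t+1} + \gamma R_{t+2} + \dots$ is an unbiased estimate of $v_\pi(S_t)$
  • The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate, because it relies on the current guess $V$
  • But the TD target has much lower variance: the return depends on many random actions, transitions and rewards, whereas the TD target depends on only one

So MC has high variance and zero bias, while TD has low variance and some bias, and is usually more efficient in practice.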

Batch MC and TD

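The key result shown here, for the batch setting where a finite set of episodes is replayed repeatedly:

  • MC converges to the solution that minimises the mean-squared error on the observed returns
  • TD(0) converges to the solution of the maximum-likelihood Markov model fitted to the data (the certainty-equivalence estimate)

TD therefore exploits the Markov property and tends to be more efficient in Markov environments; MC does not, and tends to be more robust in non-Markov (e.g. partially observed) environments.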

Unified View


Bootstrapping means the update relies on an estimated value rather than a complete return (a "self-starting" guess). Sampling means the update is based on sampled experience: taking actions and stepping forward one step, several steps, or a whole episode.

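In those terms:

  • DP bootstraps but does not sample: it backs up over the full distribution of successor states, using estimated values
  • MC samples but does not bootstrap: it backs up a single sampled trajectory to the end of the episode, using the actual return
  • TD does both: it samples one step and then bootstraps from the estimated value of the next state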

TD(λ)

n-Step TD

Simply use the $n$-step return as the TD target.

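The $n$-step return takes $n$ real rewards and then bootstraps from the value estimate:

$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
$$

$$
V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)
$$

$n = 1$ recovers TD(0), and $n \to \infty$ (running to the end of the episode) recovers MC.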

Forward View of TD(λ)


It is easy to verify that the weights assigned to the individual $n$-step returns sum to 1.

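The $\lambda$-return combines all $n$-step returns with geometrically decaying weights $(1-\lambda)\lambda^{n-1}$, and the geometric series shows that these weights sum to 1:

$$
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},
\qquad
(1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} = \frac{1-\lambda}{1-\lambda} = 1
$$

The forward-view update moves $V(S_t)$ towards this target, $V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{\lambda} - V(S_t)\big)$; like MC, it can only be computed from complete episodes.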

Backward View TD(λ)

  • Forward view provides theory
  • Backward view provides mechanism
  • Update online, every step, from incomplete sequences

Eligibility Traces


This is a concept that is somewhat hard to grasp. My understanding: every state is assigned a credit, or weight, which in the formulas appears as the coefficient $E_t(s)$; here it is stored in a table (in the next lecture on function approximation a neural network can be used instead). The update rule for the eligibility trace itself is simple, as shown below: states visited more recently and more frequently receive more credit.

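Concretely, the accumulating eligibility trace and the backward-view TD($\lambda$) update are:

$$
E_0(s) = 0, \qquad
E_t(s) = \gamma\lambda\,E_{t-1}(s) + \mathbf{1}(S_t = s)
$$

$$
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad
V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s) \;\; \text{for every state } s
$$

The trace decays by $\gamma\lambda$ at every step and is bumped by 1 whenever the state is visited, which implements exactly the frequency-plus-recency credit assignment described above.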

Relationship Between Forward and Backward TD

When $\lambda = 0$ the backward view is exactly TD(0); when $\lambda = 1$ the total update accumulated over an episode is the same as for MC.

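The $\lambda = 0$ case is immediate: the trace credits only the current state,

$$
E_t(s) = \mathbf{1}(S_t = s)
\quad\Rightarrow\quad
V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t
$$

which is exactly the TD(0) update. The $\lambda = 1$ case follows from summing the deferred updates over an episode, where the TD errors telescope into the MC error $G_t - V(S_t)$ (the derivation is in the slides).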
