1. The benefits of sharing knowledge across tasks
2. The transfer learning problem in RL
3. Transfer learning with source and target domains
4. Next time: multi-task learning, meta-learning
• Goal:

• Understand how reinforcement learning algorithms can benefit from structure learned on prior tasks

• Understand a few transfer learning methods

Transfer learning in RL is an open question, not a solved problem. This lecture focuses on understanding the pros and cons of various emerging methods.

# Material

Video_transfer_learning_in_CS294-112

PDF_by_Levine

# Introduction

## Motivation

• If we’ve solved prior tasks, we might acquire useful knowledge for solving a new task
• How is the knowledge stored?
• Q-function: tells us which actions or states are good
• Policy: tells us which actions are potentially useful
• some actions are never useful! (they can be ruled out, improving exploration efficiency)
• Models (dynamics models): what are the laws of physics that govern the world?
• Features/hidden states: provide us with a good representation
• Don’t underestimate this!

Aside (on the last item above, features/hidden states): the paper «Loss is its own Reward: Self-Supervision for Reinforcement Learning» studies the value of learned representations in Atari games.

The gap between initial optimization and recovery (after reinitialization) reveals a representation learning bottleneck.

## Transfer learning terminology

transfer learning: using experience from **one set of tasks** ($\textcolor{green}{source\enspace domain}$) for faster learning and better performance on a new task ($\textcolor{yellow}{target\enspace domain}$)

in RL, a $\textcolor{red}{task}$ = an $\textcolor{blue}{MDP}$!
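Concretely, a task is a full MDP specification

$$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\},$$

so the source and target tasks may differ in their state space, action space, dynamics, or reward.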

• “shot”: number of attempts in the target domain
• 0-shot: just run a policy trained in the source domain
• 1-shot: try the task once
• few shot: try the task a few times

## How can we frame transfer learning problems?

1. “Forward” transfer: train on one task, transfer to a new task
• Just try it and hope for the best
• Architectures for transfer: progressive networks
• Finetune on the new task
2. Multi-task transfer: train on many tasks, transfer to a new task
• Model-based reinforcement learning
• Model distillation
• Contextual policies
• Modular policy networks
3. Multi-task meta-learning: learn to learn from many tasks
• RNN-based meta-learning

# Try to Solve

## Forward transfer method

Policies trained for one set of circumstances might just work in a new domain, but there are no promises or guarantees. Two robot examples (screwing a cap onto a bottle and pouring water) are omitted here; whether this approach succeeds depends on how similar the source domain is to the target domain. We therefore need methods that stack the odds in our favor.

### Finetuning

The most popular transfer learning method in (supervised) deep learning!

Challenges with finetuning in RL:

1. RL tasks are generally much less diverse
• Features are less general
• Policies & value functions become overly specialized
2. Optimal policies in fully observed MDPs are deterministic
• Loss of exploration at convergence
• Low-entropy policies adapt very slowly to new settings

### Finetuning with maximum-entropy policies

How can we increase diversity and entropy? Recall the maximum-entropy policies from the earlier lecture on the connection between inference and control, and the quadruped robot example.

Act as randomly as possible while collecting high rewards!
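The standard maximum-entropy RL objective adds an entropy bonus to the reward:

$$\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_{\pi}} \big[ r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid \mathbf{s}_t)) \big]$$

where $\alpha$ trades reward against randomness; keeping the pretrained policy stochastic preserves the exploration needed for finetuning.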

Example: pre-training for robustness

Example: pre-training for diversity

Downside:

1. the pretrained policy may be worse in the source domain
2. the algorithm is more complex

### Architectures for transfer: progressive networks

• An issue with finetuning
• Deep networks work best when they are big
• When we finetune, we typically only have a little bit of experience
• Little bit of experience + big network = overfitting
• Can we somehow finetune a small network, but still pretrain a big network?
• Idea 1: finetune just a few layers
• Limited expressiveness (if we finetune only the last few layers)
• Big error gradients can wipe out the initialization (if we finetune everything, with the risk of overfitting)
• Idea 2 (progressive networks): add a new, smaller column and freeze the old layers
• Freezing the old layers means no forgetting (the initialization cannot be wiped out)
• Because the new column is a separate (smaller) architecture, the old conv layers cannot lose the information we need for the new game; see the sketch below

(Figures: finetuning only the last few layers vs. adding new structure.)
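A minimal PyTorch sketch of the progressive-network idea, assuming the frozen source column is an `nn.Module` that maps observations to a hidden feature vector; all names and sizes are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    """New (small) target-task column with a lateral connection from a
    frozen source-task column."""
    def __init__(self, source_column, obs_dim, src_hidden, hidden, n_actions):
        super().__init__()
        self.source = source_column
        for p in self.source.parameters():
            p.requires_grad = False                   # freeze: no forgetting
        self.fc1 = nn.Linear(obs_dim, hidden)         # new, smaller column
        self.lateral = nn.Linear(src_hidden, hidden)  # lateral connection
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        with torch.no_grad():
            src_feat = self.source(obs)   # frozen features from the old column
        h = torch.relu(self.fc1(obs) + self.lateral(src_feat))
        return self.out(h)
```

Only the new column's (small) parameter set is trained on the target task, so the big pretrained network still contributes features without being overwritten.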

pros: alleviates some issues with finetuning

cons: not obvious how serious these issues are

### Finetuning Summary

• Try and hope for the best
• Sometimes there is enough variability during training to generalize
• Finetuning
• A few issues with finetuning in RL
• Maximum entropy training can help
• Architectures for finetuning: progressive networks
• Addresses some overfitting and expressivity problems by construction

What if we can manipulate the source domain?

• So far: the source domain (e.g., empty room) and target domain (e.g., corridor) are fixed (recall the earlier example of an ant learning to run in an empty room and transferring to a hallway)
• What if we can design the source domain, and we have a difficult target domain?
• Often the case for simulation to real world transfer
• Same idea: the more diversity we see at training time, the better we will transfer!

### EPOpt: randomizing physical parameters

The conventional wisdom, for example in robust control, is that training under greater variability buys robustness at the cost of task performance. The interesting finding in deep RL is that while this may still generally be the case, sometimes these robust policies remain robust without sacrificing performance. For example, in the figure above, the policy trained on the ensemble performs as well at mass 3/6/9 as policies trained directly on the corresponding mass.

The more you randomize the physical parameters during training, the more likely you are to succeed when, at test time, there is a new physical phenomenon that you did not vary (an unmodeled effect!).
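A hedged sketch of one EPOpt-style iteration, assuming `make_env`, `rollout`, and `update_policy` are helpers from your own training stack and the parameter ranges are illustrative: physical parameters are randomized per episode, and the policy is updated only on the worst ε-fraction of rollouts (a CVaR-style objective) to emphasize robustness.

```python
import numpy as np

def epopt_iteration(policy, make_env, rollout, update_policy,
                    n_rollouts=100, epsilon=0.1):
    # Sample randomized physical parameters for each episode.
    params = [{"mass": np.random.uniform(3.0, 9.0),
               "friction": np.random.uniform(0.5, 1.5)}
              for _ in range(n_rollouts)]
    trajs = [rollout(policy, make_env(p)) for p in params]
    # Total return of each trajectory; steps assumed to be (s, a, r) tuples.
    returns = np.array([sum(r for (_, _, r) in t) for t in trajs])
    # Keep only the worst epsilon-fraction of rollouts (CVaR-style selection).
    cutoff = np.percentile(returns, 100 * epsilon)
    worst = [t for t, ret in zip(trajs, returns) if ret <= cutoff]
    # Any batch policy update (e.g., policy gradient) on the worst rollouts.
    update_policy(policy, worst)
```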

### Preparing for the unknown: explicit system ID

This method also varies physical parameters. (Figure: the x-axis is the offset of the center of mass.)

Reference: «Preparing for the Unknown: Learning a Universal Policy with Online System Identification». The system consists of two components: a Universal Policy (UP) and a function for Online System Identification (OSI). The latter is an RNN that outputs the environment's parameters (such as friction and mass). On this task, the UP-OSI system performs close to UP-true (the UP given the true environment parameters).
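A rough sketch of the OSI half of this idea, assuming a GRU over a short (state, action) history; `universal_policy` and all sizes are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class OSI(nn.Module):
    """Online System Identification: an RNN maps a short history of
    (state, action) pairs to estimated physical parameters (friction, mass, ...)."""
    def __init__(self, state_dim, act_dim, n_params, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(state_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_params)

    def forward(self, history):        # history: (batch, T, state_dim + act_dim)
        _, h = self.rnn(history)       # h: (num_layers, batch, hidden)
        return self.head(h[-1])        # estimated parameters mu_hat

# The Universal Policy takes the estimate as an extra input:
#   action = universal_policy(state, mu_hat)
# The UP is trained with the true simulator parameters; at test time OSI's
# estimate replaces them (the UP-OSI vs. UP-true comparison in the paper).
```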

Reference: «Sim-to-Real Transfer of Robotic Control with Dynamics Randomization». This paper combines the two methods above, randomizing physical parameters and using a recurrent policy, but it does not explicitly attempt to predict the parameters; instead, the recurrence performs an implicit system identification.

Reference: «Sadeghi et al., “CAD2RL: Real Single-Image Flight without a Single Real Image”», collision avoidance via deep reinforcement learning. A quadrotor is trained in CAD simulators and transferred to the real world. The simulated environments are deliberately less realistic but highly randomized. The paper observes that enough diversity in the source domain can go a long way toward enabling transfer, but you do have to avoid pathological regularities, such as the lack of reflections.

• So far: pure 0-shot transfer: learn in source domain so that we can succeed in unknown target domain

• Not possible in general: if we know nothing about the target domain, the best we can do is be as robust as possible

• What if we saw a few images of the target domain?

### Better transfer through domain adaptation

Confusion loss: a classifier looks at the features inside the convolutional network and guesses whether they come from a simulated image or a real image; the classifier's gradient is then reversed and backpropagated into the convolutional layers.

This confusion loss provides domain adaptation machinery only for perception (we want the features extracted from the source and target domains to be invariant). It does not attempt in any way to account for the physical discrepancy.
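The gradient reversal trick can be written as a tiny custom autograd function; `conv_net`, `domain_clf`, and the loss weighting in the usage comment are assumed placeholders:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the
    backward pass, so the shared features are trained to *confuse* the domain
    classifier while the classifier tries to tell the domains apart."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# Sketch of how it is wired in:
#   feats = conv_net(obs)                                    # shared features
#   domain_logits = domain_clf(GradReverse.apply(feats, 1.0))
#   loss = task_loss + lam * ce(domain_logits, domain_label) # confusion loss
```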

### Forward transfer summary

• Pretraining and finetuning

• Standard finetuning with RL is hard
• Maximum entropy formulation can help
• How can we modify the source domain for transfer?

• Randomization can help a lot: the more diverse the better!
• How can we use modest amounts of target domain data?

• Domain adaptation: make the network unable to distinguish observations from the two domains
• or modify the source domain observations to look like target domain
• Only provides invariance – assumes all differences are functionally irrelevant; this is not always enough!

Handling functionally relevant variation between the two domains is a frontier of current research in this field.

### Multiple source domains

• So far: more diversity = better transfer
• Need to design this diversity
• E.g., simulation to real world transfer: randomize the simulation
• What if we transfer from multiple different tasks?
• In a sense, closer to what people do: build on a lifetime of experience
• Substantially harder: past tasks don’t directly tell us how to solve the task in the target domain!

## Model-based reinforcement learning

• If the past tasks are all different, what do they have in common?
• Idea 1: the laws of physics (the universe has invariant rules; try to extract those rules)
• Same robot doing different chores
• Same car driving to different destinations
• Trying to accomplish different things in the same open-ended video game
• Simple version: train model on past tasks, and then use it to solve new tasks
• More complex version: adapt or finetune the model to new task
• Easier than finetuning the policy if the task is very different but the physics are mostly the same (see the sketch below)
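A minimal sketch of the "simple version", assuming `dynamics_model(s, a) -> s'` is the model trained on past tasks and `reward_fn` / `sample_action` specify the new task; random-shooting MPC is just one possible planner:

```python
import numpy as np

def mpc_action(dynamics_model, reward_fn, sample_action, state,
               horizon=15, n_candidates=1000):
    """Reuse a dynamics model trained on past tasks to plan for a *new* task."""
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        s, total = state, 0.0
        plan = [sample_action() for _ in range(horizon)]
        for a in plan:
            s = dynamics_model(s, a)   # the shared "laws of physics"
            total += reward_fn(s, a)   # only the reward is task-specific
        if total > best_return:
            best_return, best_action = total, plan[0]
    return best_action
```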

### Example: 1-shot learning with model priors

We have a large collection of source domains, and we want to use them to accelerate one-shot learning in a target domain. The model takes the two previous states and two previous actions and predicts the next state. It is not used directly for planning in the target domain; instead, it produces a prior, which is combined with the most recent batch of experience to quickly obtain a model of the target domain.
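A heavily hedged sketch of the prior idea, assuming the source-trained network `nn_prior` maps the stacked input [s_{t-1}, s_t, a_{t-1}, a_t] to a next-state prediction; mixing pseudo-observations into a local least-squares fit is one simple way to realize "prior + recent batch", not necessarily the paper's exact machinery:

```python
import numpy as np

def fit_target_model(nn_prior, recent_batch, rng, n_prior=32, noise=0.01):
    """Fit a quick local linear model from the latest target-domain batch,
    regularized by pseudo-observations from the source-trained network.
    `recent_batch` holds (x, s_next) pairs with x = [s_prev, s, a_prev, a]."""
    X = [x for x, _ in recent_batch]
    Y = [y for _, y in recent_batch]
    real_X = list(X)
    for _ in range(n_prior):
        # Prior pseudo-observation: perturb a recent input, label it with
        # the source-trained network's prediction.
        x = real_X[rng.integers(len(real_X))] \
            + noise * rng.standard_normal(len(real_X[0]))
        X.append(x)
        Y.append(nn_prior(x))
    W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return lambda x: x @ W   # quick local model of the target domain

# usage: model = fit_target_model(nn_prior, batch, np.random.default_rng(0))
```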

### Can we solve multiple tasks at once?

• Sometimes learning a model is very hard
• Can we learn a multi-task policy that can simultaneously perform many tasks?
• Idea 1: construct a joint MDP (e.g., the initial state determines which task's MDP the episode proceeds in)

• Idea 2: train in each MDP separately, and then combine the policies

### Actor-mimic and policy distillation

Background: Ensembles & Distillation

Ensemble models: single models are often not the most robust – instead train many models and average their predictions

• this is how most ML competitions (e.g., Kaggle) are won

• this is very expensive at test time.

Can we make a single model that is as good as an ensemble?

Temperature: the softmax temperature controls how soft the teacher's output distribution is (see the formula below).
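The softened output distribution is the usual softmax with temperature $T$:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

With $T = 1$ this is the ordinary softmax; a higher $T$ softens the distribution and exposes the teacher's relative preferences between actions.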

Reference: «Distilling the Knowledge in a Neural Network» by Hinton et al.

• A single policy is trained with distillation to have an action distribution similar to the source policies'.

• Loss function: sample states and actions from the expert policies (basically taking the data directly from the replay buffers of those policies, e.g. from Q-learning), then maximize the log-probability of those actions under the actor-mimic network.
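As a sketch, that loss is just a cross-entropy between the (optionally softened) teacher distribution and the student's:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0):
    """Match the student's action distribution to the softened teacher's.
    Equivalent to maximizing the log-probability of the teacher's actions
    when the teacher is near-deterministic; states come from the teachers'
    replay buffers."""
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()
```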

### How does the model know what to do?

• So far: what to do is apparent from the input (e.g., which game is being played)
• What if the policy can do multiple things in the same environment?
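Contextual policies address this by conditioning on an extra context (e.g., a goal) $\omega$:

$$\pi_\theta(\mathbf{a} \mid \mathbf{s}, \omega)$$

Formally this is just a policy over an augmented state $\tilde{\mathbf{s}} = (\mathbf{s}, \omega)$, so the usual RL machinery applies unchanged.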

So far, everything we've discussed assumes that the state space dimensionality does not change between tasks. There are a number of ways to handle different state dimensionalities, based around either embeddings or modularity (modular policy networks).