# Material

PDF_FROM_BERKELEY

VIDEO_BY_LEVINE

• Some mistakes matter more than others!
• Behavior is stochastic
• But good behavior is still the most likely

# Inference with a probabilistic graphical model of decision making

## probabilistic graphical model

• Can model suboptimal behavior (important for inverse RL)
• Can apply inference algorithms to solve control and planning problems
• Provides an explanation for why stochastic behavior might be preferred (useful for exploration and transfer learning)

## Backward messages

### Summary

Compute the backward messages by recursing from $\beta_T$ down to $\beta_1$; each $\beta_t$ is “Q-function-like”.
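
Concretely, with $\beta_t(s_t,a_t)=p(\mathcal{O}_{t:T}\mid s_t,a_t)$ and $\beta_t(s_t)=p(\mathcal{O}_{t:T}\mid s_t)$, the recursion is $\beta_t(s_t,a_t)=p(\mathcal{O}_t\mid s_t,a_t)\,E_{s_{t+1}\sim p(\cdot\mid s_t,a_t)}[\beta_{t+1}(s_{t+1})]$ and $\beta_t(s_t)=E_{a_t\sim p(a_t\mid s_t)}[\beta_t(s_t,a_t)]$, so $\log\beta_t(s_t,a_t)$ plays the role of a Q-function. A minimal tabular sketch, assuming $p(\mathcal{O}_t\mid s_t,a_t)=\exp(r(s_t,a_t))$ and a uniform action prior (`backward_messages` and its arguments are illustrative names):

```python
import numpy as np

def backward_messages(reward, transition, T):
    """Compute backward messages for a tabular MDP, assuming
    p(O_t | s_t, a_t) = exp(r(s_t, a_t)) and a uniform action prior.
    (For illustration only; in practice one works in log space.)

    reward:     (S, A) array of r(s, a)
    transition: (S, A, S) array, transition[s, a, s'] = p(s' | s, a)
    T:          horizon (number of time steps)

    Returns beta_sa[t][s, a] = p(O_{t:T} | s_t = s, a_t = a).
    """
    S, A = reward.shape
    beta_s = np.ones(S)                    # beta_{T+1}(s) = 1
    beta_sa = [None] * T
    for t in reversed(range(T)):
        # beta_t(s, a) = p(O_t | s, a) * E_{s' ~ p(.|s,a)}[beta_{t+1}(s')]
        beta_sa[t] = np.exp(reward) * (transition @ beta_s)
        # beta_t(s) = E_{a ~ p(a|s)}[beta_t(s, a)], uniform action prior
        beta_s = beta_sa[t].mean(axis=1)
    return beta_sa
```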

## Policy computation

### Summary

• Natural interpretation: better actions are more probable
• Random tie-breaking
• Analogous to Boltzmann exploration
• Approaches the greedy policy as the temperature decreases (see the sketch below)
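
In the graphical model, with a uniform action prior, $\pi(a_t\mid s_t)=\beta_t(s_t,a_t)/\beta_t(s_t)$, which is exactly a Boltzmann (softmax) distribution over $Q(s_t,a_t)=\log\beta_t(s_t,a_t)$. A minimal sketch with an explicit temperature (`soft_policy` is an illustrative name):

```python
import numpy as np

def soft_policy(q_values, temperature=1.0):
    """pi(a|s) = exp((Q(s,a) - V(s)) / temperature), where
    V(s) = temperature * logsumexp(Q(s, .) / temperature):
    a Boltzmann distribution over the soft Q-values.
    q_values: (A,) array of Q(s, a) for a fixed state s."""
    logits = q_values / temperature
    logits = logits - logits.max()   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Usage: better actions are more probable; ties get equal probability;
# temperature -> 0 recovers the greedy (argmax) policy.
rng = np.random.default_rng(0)
q = np.array([1.0, 1.0, 0.2])
action = rng.choice(len(q), p=soft_policy(q, temperature=0.5))
```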

## Summary

1. Probabilistic graphical model for optimal control

2. Control = inference (similar to HMM, EKF, etc.)

3. Very similar to dynamic programming, value iteration, etc. (but “soft”; see the sketch below)
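
The “soft” variant replaces the hard $\max_a$ in the value-iteration backup with a log-sum-exp. A minimal tabular sketch, assuming known dynamics (`soft_value_iteration` and its arguments are illustrative names):

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(reward, transition, gamma=0.99, n_iters=500):
    """Soft value iteration: identical to standard value iteration except
    that the hard max over actions is replaced by a log-sum-exp.

    reward:     (S, A) array of r(s, a)
    transition: (S, A, S) array, transition[s, a, s'] = p(s' | s, a)
    """
    S, A = reward.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = reward + gamma * (transition @ V)   # Q(s,a) = r(s,a) + gamma * E[V(s')]
        V = logsumexp(Q, axis=1)                # "soft max": V(s) = log sum_a exp Q(s,a)
    return Q, V
```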

# Algorithms with Soft Optimality

## Policy gradient with soft optimality

Policy gradient with soft optimality optimizes $\sum_t E_{\pi(s_t,a_t)}[r(s_t,a_t)] + E_{\pi(s_t)}[\mathcal{H}(\pi(a_t|s_t))]$: in addition to the expected total reward, the objective includes a policy-entropy term as a regularizer, which prevents the policy from collapsing prematurely into a deterministic one.
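
A minimal REINFORCE-style sketch of this objective in PyTorch, assuming a discrete-action (categorical) policy; `soft_pg_loss`, `returns`, and `entropy_coef` are illustrative names, with `entropy_coef` playing the role of a temperature on the entropy regularizer:

```python
import torch

def soft_pg_loss(logits, actions, returns, entropy_coef=0.01):
    """Policy-gradient loss with an entropy bonus.

    logits:  (N, A) action logits from the policy network
    actions: (N,) sampled action indices
    returns: (N,) reward-to-go (or advantage) estimates
    """
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    pg_term = (log_probs * returns).mean()    # expected-return part of the objective
    entropy_term = dist.entropy().mean()      # H(pi(.|s)), the regularizer
    return -(pg_term + entropy_coef * entropy_term)
```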

Related papers:

• *Equivalence Between Policy Gradients and Soft Q-Learning* (Schulman et al., 2017)
• *Bridging the Gap Between Value and Policy Based Reinforcement Learning* (Nachum et al., 2017)

## Benefits of soft optimality

• Improve exploration and prevent entropy collapse
• Easier to specialize (finetune) policies for more specific tasks
• Principled approach to break ties
• Better robustness (due to wider coverage of states)
• Can reduce to hard optimality as reward magnitude increases
• A good model of human behavior (more on this later, in inverse reinforcement learning)

# Practical

## Tractable?

How can we sample from $\pi$? One possible approach:
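
A minimal sketch, assuming discrete actions so that $\pi(a_t|s_t) \propto \exp(Q(s_t,a_t))$: the Gumbel-max trick adds independent Gumbel noise to the soft Q-values and takes the argmax, which yields an exact sample from this Boltzmann distribution (`gumbel_max_sample` is an illustrative name):

```python
import numpy as np

def gumbel_max_sample(q_values, rng):
    """Draw a ~ pi(a|s) with pi(a|s) proportional to exp(Q(s, a)).

    argmax_a [Q(s, a) + g_a], with g_a i.i.d. standard Gumbel noise,
    is an exact sample from the softmax distribution over Q(s, .).
    q_values: (A,) array of soft Q-values for the current state.
    """
    gumbel_noise = rng.gumbel(size=q_values.shape)
    return int(np.argmax(q_values + gumbel_noise))

rng = np.random.default_rng(0)
action = gumbel_max_sample(np.array([1.0, 2.0, 0.5]), rng)
```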