Deep Reinforcement Learning Course (4)

Unit 4

  • Policy Gradient with PyTorch

Introduction

  • Optimize the policy directly
  • Policy gradient is one family of such methods
    • Monte Carlo REINFORCE is a policy-gradient method

policy-based methods

  • What are policy-based methods?
  • Three families of reinforcement learning methods
    • value-based methods
    • policy-based methods
    • actor-critic methods: a combination of the two above
  • Policy-based: parameterize the policy directly
    • For example, use a neural network to output a distribution over actions, \(\pi_\theta(s)=\mathbb{P}[A|s;\theta]\), and optimize the parameters with gradient descent (see the sketch after this list)
  • Workflow: using the CartPole-v1 game as an example

  • Difference between policy-based methods and policy-gradient methods
    • Policy-gradient methods are a subset of policy-based methods: they optimize the policy parameters directly by gradient ascent
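
A minimal sketch of such a parameterized policy in PyTorch, sized for CartPole-v1 (the layer sizes and the `act` helper are illustrative choices, not the course's exact code):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """pi_theta(a|s): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1),  # action probabilities
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

    def act(self, state: torch.Tensor):
        """Sample an action and keep its log-probability (needed later by REINFORCE)."""
        dist = Categorical(self.forward(state))
        action = dist.sample()
        return action.item(), dist.log_prob(action)

# CartPole-v1: 4-dimensional observation, 2 discrete actions
policy = PolicyNetwork(state_dim=4, action_dim=2)
```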

Advantages and disadvantages of policy-gradient methods

Advantages

  • The policy can be estimated directly, without storing additional data such as action values
    • No need to hand-craft an exploration/exploitation trade-off
  • Can learn a stochastic policy
    • No need to deal with the perceptual aliasing problem: identical or similar-looking states that require different actions

    • Example task: a robot treasure hunt, where the robot can only perceive whether there is a wall up, down, left, and right

      • The two gray cells have exactly the same state representation, \(\phi(s)=\overbrace{(\underbrace{1}_{\text{up}}\underbrace{0}_{\text{right}}\underbrace{1}_{\text{down}}\underbrace{0}_{\text{left}})}^{\text{walls=state}}\)

      • If a deterministic policy outputs the same action in both cells (always left or always right), the treasure is hard to reach (the agent can only rely on exploration)

      • A stochastic policy (50% left, 50% right) does much better here

  • Works better in high-dimensional and continuous action spaces
    • Deep Q-Learning cannot handle an infinite number of actions
    • For example, in autonomous driving the steering-wheel angle can take infinitely many values
  • Better convergence properties
    • With a stochastic policy, the action probabilities change smoothly during training
    • With a deterministic (greedy) policy, if the Q-values are \(\text{(l,r)=(0.10,0.09)}\) at one step and \(\text{(l,r)=(0.10,0.11)}\) at the next, the policy changes abruptly: the tiny update flips the argmax from left to right (see the numeric illustration after this list)
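
A tiny numeric illustration of that last point, using the hypothetical Q-values from the bullet above; the softmax policy here is just one example of a smooth mapping from values to action probabilities:

```python
import torch
import torch.nn.functional as F

# Hypothetical Q-values for (left, right) before and after one small update.
q_before = torch.tensor([0.10, 0.09])
q_after = torch.tensor([0.10, 0.11])

# Greedy (deterministic) policy: the tiny change flips the chosen action entirely.
print(q_before.argmax().item(), "->", q_after.argmax().item())  # 0 -> 1 (left -> right)

# Softmax (stochastic) policy: the action distribution barely moves.
print(F.softmax(q_before, dim=0))  # ~[0.5025, 0.4975]
print(F.softmax(q_after, dim=0))   # ~[0.4975, 0.5025]
```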

Disadvantages

  • Frequently converges to a local maximum instead of the global optimum
  • Training is slower
  • The gradient estimates have high variance

A deeper look at policy-gradient methods

  • Parameterized stochastic policy
  • Action preference: the probability of taking each action
  • Goal: sample actions with higher returns more often
  • Idea: if an episode yields a high return, treat every action taken during that episode as good
  • Pseudocode of the training loop:
    • Collect an episode with the \(\pi\) (policy).
    • Calculate the return (sum of rewards).
    • Update the weights of the \(\pi\):
      • If positive return \(\to\) increase the probability of each (state, action) pair taken during the episode.
      • If negative return \(\to\) decrease the probability of each (state, action) pair taken during the episode.
  • Score/objective function: \(J(\theta)=\mathbb{E}_{\tau\sim\pi}[R(\tau)]\)
    • Unlike an episode, a trajectory \(\tau\) is just a sequence of states and actions, without the rewards
    • \(J(\theta)\): the expected return of the trajectories generated by following \(\pi_\theta\) (a Monte Carlo estimate is sketched after the formulas below)
    • expected return = expected cumulative reward
  • Intuition: it feels like compressing information, baking the reward signal directly into \(\theta\)

\[ J(\theta)=\sum_\tau P(\tau;\theta)R(\tau) \]

\[ P(\tau;\theta)=\mu(s_0)\left[\prod_{t=0}P(s_{t+1}\mid s_t,a_t)\,\pi_\theta(a_t\mid s_t)\right] \]
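
Since the sum over all trajectories is intractable, \(J(\theta)\) is estimated in practice by sampling: roll out episodes with the current policy and average their returns. A minimal sketch, assuming the `PolicyNetwork.act` helper from the earlier block and the Gymnasium CartPole-v1 environment:

```python
import gymnasium as gym
import torch

def estimate_objective(policy, n_episodes: int = 100) -> float:
    """Monte Carlo estimate of J(theta): the average return over sampled episodes."""
    env = gym.make("CartPole-v1")
    total = 0.0
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action, _ = policy.act(torch.as_tensor(state, dtype=torch.float32))
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
    return total / n_episodes
```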

  • Goal: find the \(\theta\) that maximizes the objective function

Gradient Ascent

  • Maximization \(\to\) gradient ascent
  • Gradient ascent update: \(\theta\leftarrow\theta+\alpha\,\nabla_\theta J(\theta)\)
  • Problems
    • The true gradient cannot be computed exactly, so it is estimated from samples
    • We cannot differentiate through the state distribution (the Markov Decision Process dynamics are unknown)
  • How do we differentiate then? The Policy Gradient Theorem!
    • For any differentiable policy and any policy objective function, we have
      • (The sum over \(t\), shown in red below, should indeed be included; the derivation later confirms it.)

\[ \nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[{\color{red}\sum_{t=0}}\nabla_\theta\log\pi_\theta(a_t\mid s_t)R(\tau)\right] \]

Reinforce algorithm

  • Also called Monte Carlo REINFORCE or Monte-Carlo policy-gradient
  • Algorithm (loop):
    • Use the policy \(\pi_\theta\) to collect an episode \(\tau\)
    • Use the episode to estimate the gradient \(\hat{g}=\nabla_\theta J(\theta)\)
      • \(\nabla_\theta J(\theta)\approx\hat{g}=\sum_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)R(\tau)\)
    • Update the weights of the policy: \(\theta\leftarrow\theta+\alpha\hat{g}\)
  • Collect multiple episodes (trajectories) and average the gradient estimate over them (which lowers its variance):

\[ \nabla_\theta J(\theta)\approx\hat{g}=\frac1m\sum_{i=1}^m\sum_{t=0}\nabla_\theta\log\pi_\theta(a_t^{(i)}|s_t^{(i)})R(\tau^{(i)}) \]
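
In PyTorch this estimate is usually implemented by minimizing the negative of the sampled objective, so that a standard optimizer performs gradient ascent on \(J(\theta)\). A sketch under that convention, assuming the log-probabilities were collected with the `PolicyNetwork.act` helper above (names are illustrative):

```python
import torch

def reinforce_loss(batch_log_probs, batch_returns):
    """Negative sampled objective:
    -(1/m) * sum_i [ (sum_t log pi_theta(a_t^(i)|s_t^(i))) * R(tau^(i)) ]."""
    per_episode = [
        -torch.stack(log_probs).sum() * episode_return
        for log_probs, episode_return in zip(batch_log_probs, batch_returns)
    ]
    return torch.stack(per_episode).mean()

# Typical update step:
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# loss = reinforce_loss(batch_log_probs, batch_returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```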

The Policy Gradient Theorem

  • The derivative log trick (also called the likelihood ratio trick or REINFORCE trick)

\[ \begin{aligned} \nabla_\theta J(\theta) &=\nabla_\theta\sum_\tau P(\tau;\theta)R(\tau)\\ &=\sum_\tau \nabla_\theta P(\tau;\theta)R(\tau)\\ &=\sum_\tau \frac{P(\tau;\theta)}{P(\tau;\theta)}\,\nabla_\theta P(\tau;\theta)R(\tau)\\ &=\sum_\tau P(\tau;\theta)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\,R(\tau)\\ &=\sum_\tau P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau)\\ \end{aligned} \]

  • Expanding \(P\)
    • Initial state distribution \(\mu(s_0)\), state transition dynamics \(P\)
      • Neither depends on \(\theta\), so their gradients are zero

\[ \begin{aligned} \nabla_\theta \log P(\tau;\theta) &=\nabla_\theta \log\left[\mu(s_0)\prod_{t=0}^HP(s_{t+1}\mid s_t,a_t)\pi_\theta(a_t\mid s_t)\right]\\ &=\nabla_\theta\left[\log\mu(s_0)+\sum\limits_{t=0}^H\log P(s_{t+1}\mid s_t,a_t)+\sum\limits_{t=0}^H\log\pi_\theta(a_t\mid s_t)\right]\\ &=\nabla_\theta \log\mu(s_0)+\nabla_\theta\sum_{t=0}^H\log P(s_{t+1}\mid s_t,a_t)+\nabla_{\theta}\sum_{t=0}^{H}\log\pi_{\theta}(a_{t}\mid s_{t})\\ &=\nabla_{\theta}\sum_{t=0}^{H}\log\pi_{\theta}(a_{t}\mid s_{t}) \end{aligned} \]

  • Therefore

\[ \begin{aligned} \nabla_\theta J(\theta) &=\sum_\tau P(\tau;\theta)\left[\nabla_{\theta}\sum_{t=0}^{H}\log\pi_{\theta}(a_{t}\mid s_{t})\right]R(\tau)\\ &=\sum_\tau\sum_{t=0}^{H} P(\tau;\theta)\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})R(\tau) \end{aligned} \]
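
The weighted sum over trajectories is exactly an expectation under \(\pi_\theta\), which recovers the theorem stated earlier and justifies REINFORCE's sample-based estimate:

\[ \nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t\mid s_t)R(\tau)\right]\approx\frac1m\sum_{i=1}^m\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t^{(i)}\mid s_t^{(i)})R(\tau^{(i)}) \]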

HW4

Pole balancing

  • CartPole-v1
    • Observation space: 4 (cart position and velocity; pole angle and angular velocity)
    • Action space: 2 (push left, push right)
  • Algorithm: REINFORCE (pseudocode below, with a PyTorch sketch after it)

\[ \begin{aligned} &\text{1: }\textbf{procedure}~\text{REINFORCE}\\ &\text{2: }\quad\text{Start with policy model}~\pi_\theta\\ &\text{3: }\quad \textbf{repeat}:\\ &\text{4: }\quad \quad\text{Generate an episode}~S_0,A_0,r_0,\ldots,S_{T-1},A_{T-1},r_{T-1}\text{ following } \pi_\theta(\cdot)\\ &\text{5: }\quad \quad \textbf{for}~t~\text{from}~T-1~\text{to}~0:\\ &\text{6: }\quad \quad \quad G_t=\sum_{k=t}^{T-1}\gamma^{k-t}r_k\\ &\text{7: }\quad \quad L(\theta)=\frac1T\sum_{t=0}^{T-1}G_t\log\pi_\theta(A_t|S_t)\\ &\text{8: }\quad \quad \text{Optimize}~\pi_\theta~\text{using}~\nabla L(\theta)\\ &\text{9: }\textbf{end procedure}\end{aligned} \]
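
A PyTorch sketch of steps 5-8 above (the backward loop accumulates the discounted returns \(G_t\) from \(t=T-1\) down to \(0\); variable names and hyperparameters are illustrative):

```python
import torch

def reinforce_update(rewards, log_probs, optimizer, gamma: float = 0.99):
    """One REINFORCE update from a single episode.
    rewards[t] and log_probs[t] correspond to step t of the episode."""
    # Steps 5-6: compute discounted returns G_t backwards, from T-1 down to 0.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)

    # Step 7: L(theta) = (1/T) * sum_t G_t * log pi_theta(A_t | S_t).
    # Minimizing the negative makes the optimizer ascend on L(theta).
    loss = -(returns * torch.stack(log_probs)).mean()

    # Step 8: optimize pi_theta.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```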

Flappy Bird

  • PixelCopter
    • Observation space: 7 (player y position; player velocity; distance to the floor; distance to the ceiling; horizontal x distance to the next block; y position of the next block's top; y position of the next block's bottom)
    • Action space: 2 (go up, do nothing)