Deep Reinforcement Learning Course (4)

Unit 4

  • Policy Gradient with PyTorch

Introduction

  • Optimize the policy directly
  • Policy gradient is one family of such methods
    • Monte Carlo REINFORCE is a policy-gradient method

policy-based methods

  • What are policy-based methods?
  • Three families of reinforcement learning methods
    • value-based methods
    • policy-based methods
    • actor-critic methods: a combination of the two above
  • Policy-based: parameterize the policy directly
    • For example, use a neural network to output a distribution over actions, \(\pi_\theta(s)=\mathbb{P}[A|s;\theta]\), and optimize the parameters with gradient descent (see the sketch after this list)
  • Workflow: using the CartPole-v1 game as an example

  • Difference between policy-based methods and policy-gradient methods
    • Policy-gradient methods are a subset of policy-based methods: they optimize the policy parameters directly by gradient ascent
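
A minimal sketch of such a parameterized policy in PyTorch, sized for CartPole-v1 (the layer sizes and the `act` helper are illustrative choices, not the course's exact code):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """pi_theta(a|s): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1),  # action probabilities
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

    def act(self, state: torch.Tensor):
        """Sample an action and keep its log-probability (needed later by REINFORCE)."""
        dist = Categorical(self.forward(state))
        action = dist.sample()
        return action.item(), dist.log_prob(action)

# CartPole-v1: 4-dimensional observation, 2 discrete actions
policy = PolicyNetwork(state_dim=4, action_dim=2)
```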

Advantages and disadvantages of policy-gradient methods

Advantages

  • The policy can be estimated directly, without storing additional data such as action values
    • No need to hand-craft an exploration/exploitation trade-off
  • Can learn a stochastic policy
    • No need to deal with the perceptual aliasing problem: identical or similar-looking states that require different actions

    • Example task: a robot treasure hunt, where the robot can only perceive whether there is a wall up, down, left, and right

      • The two gray cells have exactly the same state representation, \(\phi(s)=\overbrace{(\underbrace{1}_{\text{up}}\underbrace{0}_{\text{right}}\underbrace{1}_{\text{down}}\underbrace{0}_{\text{left}})}^{\text{walls=state}}\)

      • If a deterministic policy outputs the same action in both cells (always left or always right), the treasure is hard to reach (the agent can only rely on exploration)

      • A stochastic policy (50% left, 50% right) does much better here

  • Works better in high-dimensional and continuous action spaces
    • Deep Q-Learning cannot handle an infinite number of actions
    • For example, in autonomous driving the steering-wheel angle can take infinitely many values
  • Better convergence properties
    • With a stochastic policy, the action probabilities change smoothly during training
    • With a deterministic (greedy) policy, if the Q-values are \(\text{(l,r)=(0.10,0.09)}\) at one step and \(\text{(l,r)=(0.10,0.11)}\) at the next, the policy changes abruptly: the tiny update flips the argmax from left to right (see the numeric illustration after this list)
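
A tiny numeric illustration of that last point, using the hypothetical Q-values from the bullet above; the softmax policy here is just one example of a smooth mapping from values to action probabilities:

```python
import torch
import torch.nn.functional as F

# Hypothetical Q-values for (left, right) before and after one small update.
q_before = torch.tensor([0.10, 0.09])
q_after = torch.tensor([0.10, 0.11])

# Greedy (deterministic) policy: the tiny change flips the chosen action entirely.
print(q_before.argmax().item(), "->", q_after.argmax().item())  # 0 -> 1 (left -> right)

# Softmax (stochastic) policy: the action distribution barely moves.
print(F.softmax(q_before, dim=0))  # ~[0.5025, 0.4975]
print(F.softmax(q_after, dim=0))   # ~[0.4975, 0.5025]
```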

Disadvantages

  • Frequently converges to a local maximum instead of the global optimum
  • Training is slower
  • The gradient estimates have high variance

A deeper look at policy-gradient methods

  • Parameterized stochastic policy
  • Action preference: the probability of taking each action
  • Goal: sample actions with higher returns more often
  • Idea: if an episode yields a high return, treat every action taken during that episode as good
  • Pseudocode of the training loop:
    • Collect an episode with the \(\pi\) (policy).
    • Calculate the return (sum of rewards).
    • Update the weights of the \(\pi\):
      • If positive return \(\to\) increase the probability of each (state, action) pair taken during the episode.
      • If negative return \(\to\) decrease the probability of each (state, action) pair taken during the episode.
  • Score/objective function: \(J(\theta)=\mathbb{E}_{\tau\sim\pi}[R(\tau)]\)
    • Unlike an episode, a trajectory \(\tau\) is just a sequence of states and actions, without the rewards
    • \(J(\theta)\): the expected return of the trajectories generated by following \(\pi_\theta\) (a Monte Carlo estimate is sketched after the formulas below)
    • expected return = expected cumulative reward
  • Intuition: it feels like compressing information, baking the reward signal directly into \(\theta\)

\[ J(\theta)=\sum_\tau P(\tau;\theta)R(\tau) \]

\[ P(\tau;\theta)=\mu(s_0)\left[\prod_{t=0}P(s_{t+1}\mid s_t,a_t)\,\pi_\theta(a_t\mid s_t)\right] \]
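
Since the sum over all trajectories is intractable, \(J(\theta)\) is estimated in practice by sampling: roll out episodes with the current policy and average their returns. A minimal sketch, assuming the `PolicyNetwork.act` helper from the earlier block and the Gymnasium CartPole-v1 environment:

```python
import gymnasium as gym
import torch

def estimate_objective(policy, n_episodes: int = 100) -> float:
    """Monte Carlo estimate of J(theta): the average return over sampled episodes."""
    env = gym.make("CartPole-v1")
    total = 0.0
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action, _ = policy.act(torch.as_tensor(state, dtype=torch.float32))
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
    return total / n_episodes
```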

  • Goal: find the \(\theta\) that maximizes the objective function

Gradient Ascent

  • Maximization \(\to\) gradient ascent
  • Gradient ascent update: \(\theta\leftarrow\theta+\alpha\,\nabla_\theta J(\theta)\)
  • Problems
    • The true gradient cannot be computed exactly, so it is estimated from samples
    • We cannot differentiate through the state distribution (the Markov Decision Process dynamics are unknown)
  • How do we differentiate then? The Policy Gradient Theorem!
    • For any differentiable policy and any policy objective function, we have
      • (The sum over \(t\), shown in red below, should indeed be included; the derivation later confirms it.)

\[ \nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[{\color{red}\sum_{t=0}}\nabla_\theta\log\pi_\theta(a_t\mid s_t)R(\tau)\right] \]

Reinforce algorithm

  • Also called Monte Carlo REINFORCE or Monte-Carlo policy-gradient
  • Algorithm (loop):
    • Use the policy \(\pi_\theta\) to collect an episode \(\tau\)
    • Use the episode to estimate the gradient \(\hat{g}=\nabla_\theta J(\theta)\)
      • \(\nabla_\theta J(\theta)\approx\hat{g}=\sum_{t=0}\nabla_\theta\log\pi_\theta(a_t|s_t)R(\tau)\)
    • Update the weights of the policy: \(\theta\leftarrow\theta+\alpha\hat{g}\)
  • Collect multiple episodes (trajectories) and average the gradient estimate over them (which lowers its variance):

\[ \nabla_\theta J(\theta)\approx\hat{g}=\frac1m\sum_{i=1}^m\sum_{t=0}\nabla_\theta\log\pi_\theta(a_t^{(i)}|s_t^{(i)})R(\tau^{(i)}) \]
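
In PyTorch this estimate is usually implemented by minimizing the negative of the sampled objective, so that a standard optimizer performs gradient ascent on \(J(\theta)\). A sketch under that convention, assuming the log-probabilities were collected with the `PolicyNetwork.act` helper above (names are illustrative):

```python
import torch

def reinforce_loss(batch_log_probs, batch_returns):
    """Negative sampled objective:
    -(1/m) * sum_i [ (sum_t log pi_theta(a_t^(i)|s_t^(i))) * R(tau^(i)) ]."""
    per_episode = [
        -torch.stack(log_probs).sum() * episode_return
        for log_probs, episode_return in zip(batch_log_probs, batch_returns)
    ]
    return torch.stack(per_episode).mean()

# Typical update step:
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# loss = reinforce_loss(batch_log_probs, batch_returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```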

The Policy Gradient Theorem

  • The derivative log trick (also called the likelihood ratio trick or REINFORCE trick)

\[ \begin{aligned} \nabla_\theta J(\theta) &=\nabla_\theta\sum_\tau P(\tau;\theta)R(\tau)\\ &=\sum_\tau \nabla_\theta P(\tau;\theta)R(\tau)\\ &=\sum_\tau \frac{P(\tau;\theta)}{P(\tau;\theta)}\,\nabla_\theta P(\tau;\theta)R(\tau)\\ &=\sum_\tau P(\tau;\theta)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\,R(\tau)\\ &=\sum_\tau P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau)\\ \end{aligned} \]

  • Expanding \(P\)
    • Initial state distribution \(\mu(s_0)\), state transition dynamics \(P\)
      • Neither depends on \(\theta\), so their gradients are zero

\[ \begin{aligned} \nabla_\theta \log P(\tau;\theta) &=\nabla_\theta \log\left[\mu(s_0)\prod_{t=0}^HP(s_{t+1}\mid s_t,a_t)\pi_\theta(a_t\mid s_t)\right]\\ &=\nabla_\theta\left[\log\mu(s_0)+\sum\limits_{t=0}^H\log P(s_{t+1}\mid s_t,a_t)+\sum\limits_{t=0}^H\log\pi_\theta(a_t\mid s_t)\right]\\ &=\nabla_\theta \log\mu(s_0)+\nabla_\theta\sum_{t=0}^H\log P(s_{t+1}\mid s_t,a_t)+\nabla_{\theta}\sum_{t=0}^{H}\log\pi_{\theta}(a_{t}\mid s_{t})\\ &=\nabla_{\theta}\sum_{t=0}^{H}\log\pi_{\theta}(a_{t}\mid s_{t}) \end{aligned} \]

  • Therefore

\[ \begin{aligned} \nabla_\theta J(\theta) &=\sum_\tau P(\tau;\theta)\left[\nabla_{\theta}\sum_{t=0}^{H}\log\pi_{\theta}(a_{t}\mid s_{t})\right]R(\tau)\\ &=\sum_\tau\sum_{t=0}^{H} P(\tau;\theta)\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})R(\tau) \end{aligned} \]
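
The weighted sum over trajectories is exactly an expectation under \(\pi_\theta\), which recovers the theorem stated earlier and justifies REINFORCE's sample-based estimate:

\[ \nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t\mid s_t)R(\tau)\right]\approx\frac1m\sum_{i=1}^m\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t^{(i)}\mid s_t^{(i)})R(\tau^{(i)}) \]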

HW4

Pole balancing

  • CartPole-v1
    • Observation space: 4 (cart position and velocity; pole angle and angular velocity)
    • Action space: 2 (push left, push right)
  • Algorithm: REINFORCE (pseudocode below, with a PyTorch sketch after it)

\[ \begin{aligned} &\text{1: }\textbf{procedure}~\text{REINFORCE}\\ &\text{2: }\quad\text{Start with policy model}~\pi_\theta\\ &\text{3: }\quad \textbf{repeat}:\\ &\text{4: }\quad \quad\text{Generate an episode}~S_0,A_0,r_0,\ldots,S_{T-1},A_{T-1},r_{T-1}\text{ following } \pi_\theta(\cdot)\\ &\text{5: }\quad \quad \textbf{for}~t~\text{from}~T-1~\text{to}~0:\\ &\text{6: }\quad \quad \quad G_t=\sum_{k=t}^{T-1}\gamma^{k-t}r_k\\ &\text{7: }\quad \quad L(\theta)=\frac1T\sum_{t=0}^{T-1}G_t\log\pi_\theta(A_t|S_t)\\ &\text{8: }\quad \quad \text{Optimize}~\pi_\theta~\text{using}~\nabla L(\theta)\\ &\text{9: }\textbf{end procedure}\end{aligned} \]
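
A PyTorch sketch of steps 5-8 above (the backward loop accumulates the discounted returns \(G_t\) from \(t=T-1\) down to \(0\); variable names and hyperparameters are illustrative):

```python
import torch

def reinforce_update(rewards, log_probs, optimizer, gamma: float = 0.99):
    """One REINFORCE update from a single episode.
    rewards[t] and log_probs[t] correspond to step t of the episode."""
    # Steps 5-6: compute discounted returns G_t backwards, from T-1 down to 0.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)

    # Step 7: L(theta) = (1/T) * sum_t G_t * log pi_theta(A_t | S_t).
    # Minimizing the negative makes the optimizer ascend on L(theta).
    loss = -(returns * torch.stack(log_probs)).mean()

    # Step 8: optimize pi_theta.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```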

Flappy Bird

  • PixelCopter
    • Observation space: 7 (player y position; player velocity; distance to the floor; distance to the ceiling; horizontal x distance to the next block; y position of the next block's top; y position of the next block's bottom)
    • Action space: 2 (go up, do nothing)