Deep Reinforcement Learning Course(1)

Posted on 2024-06-19 Edited on 2024-07-23 In hg-rl

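An introduction to the reinforcement learning framework and its basic concepts, plus two small experiments that simply run other people's code.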

Course

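Concepts
- state \(S_0\): the state the agent receives from the environment
- action \(A_0\): the action the agent takes in that state
- reward \(R_1\): the reward the environment returns for that action
- next state \(S_1\): the state the environment moves to
Goal: maximize the cumulative reward (the expected return)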
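
The state, action, reward, next state loop above can be sketched in a few lines. This is only an illustration: `MatchEnv`, its `reset`/`step` methods, and `run_episode` are hypothetical names made up for this note, not the API of any library used in the course.

```python
import random


class MatchEnv:
    """Hypothetical toy environment: the state is a bit, and the agent
    earns reward 1 whenever its action equals that bit."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.bit = 0

    def reset(self):
        self.t = 0
        self.bit = self.rng.randint(0, 1)
        return self.bit  # initial state S_0

    def step(self, action):
        reward = 1.0 if action == self.bit else 0.0  # R_{t+1}
        self.t += 1
        self.bit = self.rng.randint(0, 1)            # S_{t+1}
        done = self.t >= self.episode_length
        return self.bit, reward, done


def run_episode(env, policy):
    """Run one episode and return the (undiscounted) cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent picks A_t from S_t
        state, reward, done = env.step(action)  # environment answers
        total_reward += reward
    return total_reward


# A policy that simply echoes the state collects every reward.
print(run_episode(MatchEnv(episode_length=5), lambda s: s))
```

The agent only ever sees the state and the reward; everything else about the environment stays hidden behind `step`.
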
Markov Decision Process (MDP)
Markov property: the decision depends only on the current state, not on the states that came before it
Observation/State Space
- State \(s\): a complete description of the state of the world
  - e.g. the entire board in a game of chess
- Observation \(o\): a partial description of the state
  - e.g. the visible field of view in a video game
Action Space
- all possible actions in an environment
- can be discrete or continuous (discrete or continuous space)
Rewards and the discounting
- reward: the only feedback the agent receives
- return (cumulative reward): \(R(\tau)=r_{t+1}+r_{t+2}+\cdots=\sum_{k=0}^{\infty}r_{t+k+1}\)
  - \(\tau\): trajectory, a sequence of states and actions
- discount rate \(\gamma\in[0,1]\)
  - \(R(\tau)=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k+1}\)
  - \(\gamma\): the larger it is, the more the agent cares about long-term reward; the smaller it is, the more it cares about short-term reward
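
To make the effect of \(\gamma\) concrete, here is a minimal sketch that evaluates \(\sum_{k}\gamma^{k}r_{t+k+1}\) for a finite list of rewards. The function name `discounted_return` and the reward list are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k] for a finite reward sequence.

    Works backwards through the list using G_t = r_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # hypothetical rewards r_{t+1}, r_{t+2}, r_{t+3}
print(discounted_return(rewards, 1.0))  # 3.0: no discounting
print(discounted_return(rewards, 0.5))  # 1.75: later rewards count less
print(discounted_return(rewards, 0.0))  # 1.0: only the immediate reward counts
```
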

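Two approaches to solving RL problems: Policy-Based Methods and Value-Based Methods
policy-based: \(\text{state}\to\pi(\text{state})\to\text{action}\)
- learn the optimal policy directly
- the point of RL is to find the optimal policy \(\pi^{\ast}\)
- types of policy
  - deterministic: \(a=\pi(s)\)
    - each state maps to exactly one action
  - stochastic: \(\pi(a\mid s)=P\left[A_t=a\mid S_t=s\right]\)
    - each state maps to a probability distribution over actions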
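
The difference between the two kinds of policy can be sketched as follows. The state names, action names, and probabilities are all invented for this example.

```python
import random

# Hypothetical lookup tables for a two-state toy problem.
DETERMINISTIC_TABLE = {"left_edge": "right", "right_edge": "left"}
STOCHASTIC_TABLE = {
    "left_edge": {"left": 0.1, "right": 0.9},
    "right_edge": {"left": 0.9, "right": 0.1},
}


def deterministic_policy(state):
    """a = pi(s): the same state always yields the same action."""
    return DETERMINISTIC_TABLE[state]


def stochastic_policy(state, rng):
    """pi(a | s): sample an action from the distribution for this state."""
    dist = STOCHASTIC_TABLE[state]
    actions = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(deterministic_policy("left_edge"))                        # always "right"
print([stochastic_policy("left_edge", rng) for _ in range(5)])  # mostly "right"
```
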
value-based
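- learn which states are most valuable, then take the actions that lead to those states
- definition of value: the larger the return the agent can expect when starting from a state, the more valuable that state is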
  - \(v_\pi(s)=\mathbb{E}_\pi\left[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots\mid S_t=s\right]\)
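
One way to read the definition of \(v_\pi(s)\) is as an average: start in \(s\), follow \(\pi\), record the discounted return, and repeat many times. The sketch below does that for an invented three-state corridor; the environment, the 50/50 policy, and every name in it are assumptions for illustration and are not taken from the course.

```python
import random

GAMMA = 0.9
TERMINAL = 2  # hypothetical corridor with states 0, 1 and terminal state 2


def policy(state, rng):
    """Assumed stochastic policy: move forward or stay with equal probability."""
    return "forward" if rng.random() < 0.5 else "stay"


def step(state, action):
    """Assumed dynamics: reward 1 only on the step that reaches the terminal state."""
    next_state = state + 1 if action == "forward" else state
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def sample_return(start_state, rng):
    """Discounted return of one episode that starts in start_state."""
    state, g, discount = start_state, 0.0, 1.0
    while state != TERMINAL:
        action = policy(state, rng)
        state, reward = step(state, action)
        g += discount * reward
        discount *= GAMMA
    return g


def estimate_value(start_state, episodes=20000, seed=0):
    """Monte Carlo estimate of v_pi(start_state): the mean sampled return."""
    rng = random.Random(seed)
    total = sum(sample_return(start_state, rng) for _ in range(episodes))
    return total / episodes


print(estimate_value(0), estimate_value(1))
```

For this toy problem the exact values can be worked out by hand: \(v_\pi(1)=0.5/(1-0.5\gamma)\approx 0.909\) and \(v_\pi(0)=0.5\gamma\,v_\pi(1)/(1-0.5\gamma)\approx 0.744\), so the estimates should land close to those numbers. State 1 is more valuable than state 0 because it is closer to the reward.
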