# RL Notes (15): SAC

## Introduction

Soft Actor-Critic (SAC) is an off-policy algorithm based on maximum-entropy reinforcement learning, proposed in 2018. Its predecessor is Soft Q-learning (SQL). Whereas SQL requires a complicated sampling procedure, SAC introduces an actor-critic architecture, which makes training more stable and efficient. SAC performs strongly on a wide range of benchmarks and on real robotic tasks, is known for its robustness to disturbances and to hyperparameter settings, and is one of the cornerstone algorithms of modern deep reinforcement learning.


## Maximum Entropy RL

### Definition of Entropy

Entropy measures how random a random variable is. If $X$ is a random variable with probability density $p$, its entropy $H$ is defined as:

$$
\begin{align}
H(X)&=\mathbb{E}_{x\sim p}[-\log p(x)]\notag \\
&=-\int_{x}p(x)\log p(x) \,\mathrm{d}x \notag
\end{align}
$$

In reinforcement learning, $H(\pi(\cdot|s))$ can be used to measure how random the policy $\pi$ is at state $s$.
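
As a quick sanity check of this definition, the hedged Python sketch below (my own illustration, not from the original post) estimates $H(X)=\mathbb{E}_{x\sim p}[-\log p(x)]$ for a Gaussian by Monte Carlo and compares it with the closed-form entropy provided by `torch.distributions`.

```python
# Monte Carlo check of the entropy definition H(X) = E_{x~p}[-log p(x)] for a Gaussian.
import torch
from torch.distributions import Normal

p = Normal(loc=0.0, scale=2.0)            # a 1-D Gaussian "policy" at a single state
x = p.sample((100_000,))                  # x ~ p
mc_entropy = (-p.log_prob(x)).mean()      # Monte Carlo estimate of E[-log p(x)]

print(f"MC estimate : {mc_entropy.item():.4f}")
print(f"closed form : {p.entropy().item():.4f}")   # 0.5 * log(2*pi*e*sigma^2)
```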

### Objective Function

The idea of maximum entropy RL is to make the policy more stochastic in addition to maximizing the cumulative reward. To this end, an entropy regularization term is added to the RL objective, which is defined as:

$$
\begin{align}
\pi_{\textbf{MaxEnt}}^*=\arg\max_{\pi}\mathbb{E}_{\pi}\left[\sum_{t=0}^\infty r(s_t,a_t)+\alpha H(\pi(\cdot|s_t))\right]\notag
\end{align}
$$

Here, $\alpha$ is a regularization coefficient that weighs the importance of the entropy term. Adding the entropy regularizer encourages exploration: the larger $\alpha$ is, the more exploratory the policy, which helps speed up policy learning and reduces the chance of getting stuck in a local optimum.
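
As a toy illustration of this objective (a hypothetical sketch: the rewards, entropies, and the discount factor $\gamma$, which the later soft Bellman equations use, are my own choices, not data from the post), the snippet below accumulates $r(s_t,a_t)+\alpha H(\pi(\cdot|s_t))$ along a short recorded trajectory.

```python
# A toy, discounted version of the entropy-regularized return
# sum_t gamma^t [ r(s_t,a_t) + alpha * H(pi(.|s_t)) ] for one recorded trajectory.
# The numbers below are made-up illustration data.
import torch

gamma, alpha = 0.99, 0.2
rewards   = torch.tensor([1.0, 0.5, 0.0, 2.0])   # r(s_t, a_t)
entropies = torch.tensor([1.4, 1.1, 0.9, 0.7])   # H(pi(.|s_t)) at each visited state

discounts   = gamma ** torch.arange(len(rewards), dtype=torch.float32)
soft_return = torch.sum(discounts * (rewards + alpha * entropies))
print(soft_return.item())   # the quantity the MaxEnt objective maximizes in expectation
```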


## Energy-Based Model (EBM)

An energy-based model (EBM) is a class of probabilistic models rooted in statistical physics. An energy function assigns a scalar energy to every possible configuration, and low-energy regions correspond to high-probability regions. In reinforcement learning, an EBM maps a state-action pair $(s,a)$ to an energy $\mathcal{E}$ and thereby represents the policy distribution, defined as:

$$
\begin{align}
\pi(a|s)=\frac{\exp(-\mathcal{E}(s,a))}{Z(s)}\notag
\end{align}
$$

where $Z(s)=\int_{A}\exp(-\mathcal{E}(s,a))\,\mathrm{d}a$ is the partition function, which normalizes the energies into a valid probability distribution.
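
A minimal sketch of such an energy-based policy over a discrete action set (my own illustration; the $Q$ values are made up): with $\mathcal{E}(s,a)=-Q(s,a)/\alpha$, the policy $\exp(-\mathcal{E}(s,a))/Z(s)$ is simply a softmax over $Q(s,\cdot)/\alpha$.

```python
# A toy energy-based policy over 3 discrete actions: pi(a|s) = exp(-E(s,a)) / Z(s).
# With the energy E(s,a) = -Q(s,a)/alpha this reduces to a softmax over Q(s,.)/alpha.
import torch

alpha = 0.5
q_values = torch.tensor([1.0, 2.0, 0.5])   # assumed Q(s, a) for one state, 3 actions
energy = -q_values / alpha                 # E(s, a)
pi = torch.softmax(-energy, dim=0)         # exp(-E(s,a)) / Z(s)
print(pi, pi.sum())                        # a valid distribution; probabilities sum to 1
```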


## Soft Policy Iteration

### Soft Bellman Equations

$$
\begin{align}
V_{\textbf{soft}}^\pi(s)&=\mathbb{E}_{a\sim\pi(\cdot|s)}[Q^{\pi}_{\textbf{soft}}(s,a)-\alpha\log\pi(a|s)]\notag \\
&=\mathbb{E}_{a\sim \pi(\cdot|s)}[Q^{\pi}_{\textbf{soft}}(s,a)]+\alpha H(\pi(\cdot|s)) \notag
\end{align}
$$

$$
\begin{align}
Q^{\pi}_{\textbf{soft}}(s,a)=r(s,a)+\gamma \mathbb{E}_{s^\prime\sim p(\cdot|s,a)}[V_{\textbf{soft}}^\pi(s^\prime)]\notag
\end{align}
$$

### Soft Policy Evaluation

In SAC, the soft value function is defined as:

$$
\begin{align}
V_{\textbf{soft}}^\pi(s)\triangleq\mathbb{E}_{a\sim\pi(\cdot|s)}[Q^{\pi}_{\textbf{soft}}(s,a)-\alpha\log\pi(a|s)]\notag
\end{align}
$$

The soft action-value function is defined as:

$$
\begin{align}
Q^{\pi}_{\textbf{soft}}(s,a)\triangleq r(s,a)+\gamma \mathbb{E}_{s^\prime\sim p(\cdot|s,a)}[V_{\textbf{soft}}^\pi(s^\prime)]\notag
\end{align}
$$

The soft Bellman backup operator for the action-value function is defined as:

$$
\begin{align}
\mathcal{T}^\pi Q^{k}_{\textbf{soft}}(s,a)&\triangleq r(s,a)+\gamma \mathbb{E}_{s^\prime\sim p(\cdot|s,a)}[V_{\textbf{soft}}^{k}(s^\prime)]\notag \\
&= r(s,a)+\gamma \mathbb{E}_{s^\prime\sim p(\cdot|s,a)}\big[\mathbb{E}_{a^\prime\sim\pi(\cdot|s^\prime)}[Q^{k}_{\textbf{soft}}(s^\prime,a^\prime)-\alpha\log\pi(a^\prime|s^\prime)]\big]\notag
\end{align}
$$

Soft policy evaluation theorem: consider the operator $\mathcal{T}^\pi$ above and an initial mapping $Q^0_{\textbf{soft}}:S\times A\rightarrow \mathbb{R}$ with $|A|<\infty$, and define $Q^{k+1}_{\textbf{soft}}=\mathcal{T}^\pi Q^k_{\textbf{soft}}$. Then, as $k\rightarrow\infty$, the sequence $\{Q^k_{\textbf{soft}}\}$ converges to $Q_{\textbf{soft}}^\pi$.

We now show that this is a contraction mapping. The paper redefines the reward as $r_\pi(s,a)$:

$$
\begin{align}
r_{\pi}(s,a)\triangleq r(s,a)+\mathbb{E}_{s^\prime\sim p(\cdot|s,a)}[H(\pi(\cdot|s^\prime))]\notag
\end{align}
$$

The backup can then be rewritten as:

$$
\begin{align}
\mathcal{T}^\pi Q^{k}_{\textbf{soft}}(s,a)\triangleq r_\pi(s,a)+\gamma \mathbb{E}_{s^\prime\sim p(\cdot|s,a),\, a^\prime \sim \pi(\cdot|s^\prime)}[Q^{k}_{\textbf{soft}}(s^\prime,a^\prime)]\notag
\end{align}
$$

Since we assume $|A|<\infty$, the reward $r_\pi(s,a)$ is bounded, so it suffices to prove the following inequality:

$$
\begin{align}
\big|\big|\mathcal{T}^\pi Q^N_{\textbf{soft}}(s,a)-\mathcal{T}^\pi Q^M_{\textbf{soft}}(s,a)\big|\big|_\infty \le k\big|\big| Q^N_{\textbf{soft}}(s,a)-Q^M_{\textbf{soft}}(s,a)\big|\big|_\infty ,\quad\exists k\in(0,1)\notag
\end{align}
$$

The left-hand side can be bounded as:

$$
\begin{align}
\big|\mathcal{T}^\pi Q^N_{\textbf{soft}}(s,a)-\mathcal{T}^\pi Q^M_{\textbf{soft}}(s,a)\big| &=\gamma\big|\mathbb{E}_{s^\prime\sim p(\cdot|s,a),\, a^\prime\sim\pi(\cdot|s^\prime)}\left[Q^N_{\textbf{soft}}(s^\prime,a^\prime)-Q^M_{\textbf{soft}}(s^\prime,a^\prime)\right]\big|\notag \\
&\le \gamma\mathbb{E}_{s^\prime\sim p(\cdot|s,a),\, a^\prime\sim\pi(\cdot|s^\prime)}\left[\big|Q^N_{\textbf{soft}}(s^\prime,a^\prime)-Q^M_{\textbf{soft}}(s^\prime,a^\prime)\big|\right] \notag \\
&\le \gamma\mathbb{E}_{s^\prime\sim p(\cdot|s,a)}\left[\max_{a^\prime\in A}\big|Q^N_{\textbf{soft}}(s^\prime,a^\prime)-Q^M_{\textbf{soft}}(s^\prime,a^\prime)\big|\right] \notag \\
&\le \gamma \max_{s^\prime\in S,a^\prime\in A} \big|Q^N_{\textbf{soft}}(s^\prime,a^\prime)-Q^M_{\textbf{soft}}(s^\prime,a^\prime)\big| \notag
\end{align}
$$

Taking the maximum over $(s,a)$ on the left finally yields:

$$
\begin{align}
\max_{s\in S,a\in A}\big|\mathcal{T}^\pi Q^N_{\textbf{soft}}(s,a)-\mathcal{T}^\pi Q^M_{\textbf{soft}}(s,a)\big| \le \gamma \max_{s^\prime\in S,a^\prime\in A} \big|Q^N_{\textbf{soft}}(s^\prime,a^\prime)-Q^M_{\textbf{soft}}(s^\prime,a^\prime)\big| \notag
\end{align}
$$

Since the discount factor $\gamma\in(0,1)$, this shows that $\mathcal{T}^\pi$ is a contraction mapping.
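
The contraction can also be checked numerically. The sketch below (my own toy example, not from the post) builds a small random finite MDP, applies the soft Bellman operator $\mathcal{T}^\pi$ to two different Q-tables, and prints the sup-norm gap, which shrinks by at least a factor of $\gamma$ per application.

```python
# Toy numerical check of the contraction: apply the soft Bellman operator T^pi to two
# different Q tables on a random finite MDP and watch the sup-norm gap decay.
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, alpha = 4, 3, 0.9, 0.2
P = rng.random((n_s, n_a, n_s)); P /= P.sum(-1, keepdims=True)   # p(s'|s,a)
r = rng.random((n_s, n_a))                                       # r(s,a)
pi = rng.random((n_s, n_a)); pi /= pi.sum(-1, keepdims=True)     # a fixed policy pi(a|s)

def soft_backup(Q):
    V = (pi * (Q - alpha * np.log(pi))).sum(-1)   # V(s') = E_{a'~pi}[Q - alpha*log pi]
    return r + gamma * P @ V                      # (T^pi Q)(s,a) = r + gamma*E_{s'}[V(s')]

Q_n = rng.normal(size=(n_s, n_a))
Q_m = rng.normal(size=(n_s, n_a))
for k in range(5):
    print(f"k={k}  ||Q_N - Q_M||_inf = {np.abs(Q_n - Q_m).max():.5f}")
    Q_n, Q_m = soft_backup(Q_n), soft_backup(Q_m)
```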


### Soft Policy Improvement

Soft policy improvement theorem: for any $\pi_{\text{old}} \in \Pi$, let $\pi_{\text{new}}$ be obtained by the optimization below. Assuming $|A|<\infty$, we have $Q_{\textbf{soft}}^{\pi_{\text{new}}}(s,a)\ge Q_{\textbf{soft}}^{\pi_{\text{old}}}(s,a)$ for all $(s,a) \in S \times A$.

$$
\begin{align}
\pi_{\text{new}}(\cdot|s)=\arg \min_{\pi^\prime\in\Pi}D_{\textbf{KL}}\left(\pi^\prime(\cdot|s)\,\Big|\Big|\,\frac{\exp(\frac{1}{\alpha}Q^{\pi_{\text{old}}}_{\textbf{soft}}(s,\cdot))}{Z^{\pi_{\text{old}}}(s)}\right)\notag
\end{align}
$$

where $Z^{\pi_{\text{old}}}(s)$ is the partition function that normalizes the distribution.

Define $J(\pi(\cdot|s);\pi_{\text{old}})$ as:

$$
\begin{align}
J(\pi(\cdot|s);\pi_{\text{old}})&= D_{\textbf{KL}}\left(\pi(\cdot|s)\,\Big|\Big|\,\frac{\exp(\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi_\text{old}}(s,\cdot))}{Z^{\pi_\text{old}}(s)}\right)\notag \\
&=\int_a \pi(a|s)\log\frac{\pi(a|s)}{\exp(\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)-\log Z^{\pi_{\text{old}}}(s))}\,\text{d}a \notag \\
&=\mathbb{E}_{a\sim\pi(\cdot|s)}\left[\log\pi(a|s)-\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)+\log Z^{\pi_\text{old}}(s)\right] \notag
\end{align}
$$

Since $\pi_\text{new}$ minimizes $J(\cdot\,;\pi_\text{old})$ over $\Pi$ and $\pi_\text{old}\in\Pi$, we have $J(\pi_\text{new}(\cdot|s);\pi_\text{old})\le J(\pi_\text{old}(\cdot|s);\pi_\text{old})$, i.e.:

$$
\begin{align}
\mathbb{E}_{a\sim\pi_\text{new}(\cdot|s)}\left[\log\pi_\text{new}(a|s)-\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)+\log Z^{\pi_\text{old}}(s)\right] \le \mathbb{E}_{a\sim\pi_\text{old}(\cdot|s)}\left[\log\pi_\text{old}(a|s)-\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)+\log Z^{\pi_\text{old}}(s)\right] \notag
\end{align}
$$

Because $Z^{\pi_\text{old}}(s)$ does not depend on $a$, it cancels on both sides, giving:

$$
\begin{align}
\mathbb{E}_{a\sim\pi_\text{new}(\cdot|s)}\left[\log\pi_\text{new}(a|s)-\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)\right] \le \mathbb{E}_{a\sim\pi_\text{old}(\cdot|s)}\left[\log\pi_\text{old}(a|s)-\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)\right] \notag
\end{align}
$$

which simplifies to:

$$
\begin{align}
\mathbb{E}_{a\sim\pi_\text{new}(\cdot|s)}\left[Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)-\alpha\log\pi_\text{new}(a|s)\right] \ge \mathbb{E}_{a\sim\pi_\text{old}(\cdot|s)}\left[Q_{\textbf{soft}}^{\pi_\text{old}}(s,a)-\alpha\log\pi_\text{old}(a|s)\right] = V_{\textbf{soft}}^{\pi_\text{old}}(s) \notag
\end{align}
$$

Next, we show $Q_{\textbf{soft}}^{\pi_{\text{new}}}(s,a)\ge Q_{\textbf{soft}}^{\pi_{\text{old}}}(s,a)$ by repeatedly expanding the soft Bellman equation with the inequality above:

$$
\begin{align}
Q_{\textbf{soft}}^{\pi_{\text{old}}}(s,a) &= r_0 + \gamma \mathbb{E}_{s_1\sim p(\cdot|s,a)}\left[ V_{\textbf{soft}}^{\pi_\text{old}}(s_1)\right]\notag \\
&\le r_0 + \gamma \mathbb{E}_{s_1\sim p(\cdot|s,a)}\left[\mathbb{E}_{a_1\sim \pi_\text{new}(\cdot|s_1)}\left[Q_{\textbf{soft}}^{\pi_\text{old}}(s_1,a_1) - \alpha \log \pi_\text{new}(a_1|s_1)\right]\right] \notag \\
&= r_0 + \gamma \mathbb{E}_{s_1\sim p(\cdot|s,a)}\left[\mathbb{E}_{a_1\sim \pi_\text{new}(\cdot|s_1)}\left[r(s_1, a_1) + \gamma \mathbb{E}_{s_2}[V_{\textbf{soft}}^{\pi_\text{old}}(s_2)] + \alpha H(\pi_\text{new}(\cdot|s_1))\right]\right] \notag \\
&\cdots \notag \\
&\le \sum_{t=0}^\infty \gamma^t \mathbb{E}_{(s_t,a_t)\sim \rho^{\pi_\text{new}}}\left[r(s_t,a_t)+\alpha H(\pi_\text{new}(\cdot|s_t))\right]\notag \\
&=Q_{\textbf{soft}}^{\pi_\text{new}}(s,a) \notag
\end{align}
$$

This completes the proof.
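
A quick numerical check of the improvement step (my own sketch, with made-up $Q$ values and $\Pi$ taken to be all distributions over three actions): setting $\pi_\text{new}(a|s)\propto\exp(Q(s,a)/\alpha)$ does not decrease $\mathbb{E}_a[Q(s,a)-\alpha\log\pi(a|s)]$ relative to an arbitrary $\pi_\text{old}$.

```python
# Toy check of soft policy improvement at a single state: pi_new proportional to
# exp(Q/alpha) does not decrease E_a[Q(s,a) - alpha*log pi(a|s)] versus any pi_old.
import numpy as np

alpha = 0.5
Q = np.array([1.0, 2.0, 0.5])                        # assumed Q_soft^{pi_old}(s, .)
pi_old = np.array([0.5, 0.2, 0.3])                   # an arbitrary old policy at s

pi_new = np.exp(Q / alpha)
pi_new /= pi_new.sum()                               # the KL minimizer when Pi is unrestricted

def soft_value(pi):                                  # E_{a~pi}[Q(s,a) - alpha*log pi(a|s)]
    return np.sum(pi * (Q - alpha * np.log(pi)))

print(soft_value(pi_old), "<=", soft_value(pi_new))  # the improvement inequality holds
```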


### Soft Policy Iteration Theorem

Assume $|A|<\infty$. Starting from any $\pi \in \Pi$, repeatedly applying soft policy evaluation and soft policy improvement converges to a policy $\pi^*$ such that $Q^{\pi^*}_\textbf{soft}(s,a)\ge Q^{\pi}_\textbf{soft}(s,a)$ for every $\pi \in \Pi$ and every $(s,a)\in S\times A$.

Proof: Let $\pi_i$ denote the policy at iteration $i$. By the soft policy improvement theorem, the sequence $\{Q^{\pi_i}_\textbf{soft}\}$ is monotonically increasing. For any $\pi \in \Pi$, $Q^\pi_\textbf{soft}$ is bounded above (both the reward and the entropy are bounded), so the sequence converges to some $\pi^*$; it remains to show that $\pi^*$ is optimal. At convergence, for any $\pi\in\Pi$ with $\pi\neq \pi^*$, we must have $J(\pi^*(\cdot|s);\pi^*) < J(\pi(\cdot|s);\pi^*)$, i.e.:

$$
\begin{align}
\mathbb{E}_{a\sim\pi^*(\cdot|s)}\left[\log\pi^*(a|s)-\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi^*}(s,a)\right] < \mathbb{E}_{a\sim\pi(\cdot|s)}\left[\log\pi(a|s)-\frac{1}{\alpha}Q_{\textbf{soft}}^{\pi^*}(s,a)\right] \notag
\end{align}
$$

which simplifies to:

$$
\begin{align}
V_\textbf{soft}^{\pi^*}(s) > \mathbb{E}_{a\sim \pi(\cdot|s)}\left[Q_\textbf{soft}^{\pi^*}(s,a)-\alpha \log \pi(a|s)\right]\notag
\end{align}
$$

We then show that $\pi^*$ is optimal, i.e. that $Q_\textbf{soft}^{\pi^* }(s,a) > Q^\pi_\textbf{soft}(s,a)$:

$$
\begin{align}
Q_\textbf{soft}^{\pi^* }(s,a)&=r_0+\gamma\mathbb{E}_{s_1\sim p(\cdot|s,a)}\left[V_\textbf{soft}^{\pi^*}(s_1)\right]\notag \\
&>r_0+\gamma\mathbb{E}_{s_1\sim p(\cdot|s,a)}\left[\mathbb{E}_{a_1\sim \pi(\cdot|s_1)}\left[Q_\textbf{soft}^{\pi^*}(s_1,a_1)-\alpha \log \pi(a_1|s_1)\right]\right] \notag \\
&= r_0+\gamma\mathbb{E}_{s_1\sim p(\cdot|s,a)}\left[\mathbb{E}_{a_1\sim \pi(\cdot|s_1)}\left[r(s_1, a_1) + \gamma \mathbb{E}_{s_2}[V_\textbf{soft}^{\pi^*}(s_2)] + \alpha H(\pi(\cdot|s_1))\right]\right] \notag \\
&> \sum_{t=0}^\infty \gamma^t \mathbb{E}_{(s_t,a_t)\sim\rho^\pi}\left[r(s_t,a_t)+\alpha H(\pi(\cdot|s_t))\right] \notag \\
&=Q^\pi_\textbf{soft}(s,a)\notag
\end{align}
$$

This completes the proof.
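
Putting the two steps together, the toy sketch below (my own illustration on a random finite MDP, not from the post) alternates soft policy evaluation and the softmax improvement step and prints the mean Q-value, which increases across iterations.

```python
# Toy soft policy iteration on a random finite MDP: alternate (approximate) soft policy
# evaluation with the softmax improvement step; the Q-values increase across iterations.
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, alpha = 4, 3, 0.9, 0.2
P = rng.random((n_s, n_a, n_s)); P /= P.sum(-1, keepdims=True)   # p(s'|s,a)
r = rng.random((n_s, n_a))                                       # r(s,a)

pi = np.full((n_s, n_a), 1.0 / n_a)      # start from the uniform policy
Q = np.zeros((n_s, n_a))
for it in range(5):
    for _ in range(200):                 # soft policy evaluation (run the backup to convergence)
        V = (pi * (Q - alpha * np.log(pi))).sum(-1)
        Q = r + gamma * P @ V
    print(f"iteration {it}: mean soft Q = {Q.mean():.4f}")
    # soft policy improvement: pi(a|s) proportional to exp(Q(s,a)/alpha)
    logits = (Q - Q.max(-1, keepdims=True)) / alpha
    pi = np.clip(np.exp(logits), 1e-12, None)
    pi /= pi.sum(-1, keepdims=True)
```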


## Algorithm Implementation

In the SAC algorithm, we model two action-value functions $Q_\textbf{soft}$ (with parameters $\omega_1$ and $\omega_2$) and one policy function $\pi$ (with parameters $\theta$). Following the idea of Double DQN, SAC maintains two $Q_\textbf{soft}$ networks and, whenever a $Q_\textbf{soft}$ value is needed, takes the smaller of the two, which mitigates overestimation of $Q_\textbf{soft}$.
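
A minimal PyTorch sketch of these function approximators (class names, layer sizes, activations, and the log-std clamp range are common choices I am assuming, not code from the post): two Q-networks that take $(s,a)$, and a Gaussian policy head that outputs $\mu_\theta(s)$ and $\sigma_\theta(s)$.

```python
# Minimal modules for the two critics and the Gaussian actor.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):                      # Q_soft^omega(s, a)
        return self.net(torch.cat([s, a], dim=-1))

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)        # mean mu_theta(s)
        self.log_std = nn.Linear(hidden, action_dim)   # log of std sigma_theta(s)

    def forward(self, s):
        h = self.body(s)
        return self.mu(h), self.log_std(h).clamp(-20, 2).exp()
```

In practice, two independent `QNetwork` instances (plus two target copies) would play the roles of $Q_\textbf{soft}^{\omega_1}$ and $Q_\textbf{soft}^{\omega_2}$.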

### Action-Value Function (Critic)

The loss for either $Q_\textbf{soft}$ network is:

$$
\begin{align}
L_{Q_\textbf{soft}}(\omega)&=\mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim R}\left[\frac{1}{2}\left(Q_\textbf{soft}^\omega(s_t,a_t)-\left(r_t+\gamma V_{\textbf{soft}}^{\omega^-}(s_{t+1})\right)\right)^2\right]\notag \\
&= \mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim R}\left[\frac{1}{2}\left(Q_\textbf{soft}^\omega(s_t,a_t)-\left(r_t+\gamma\mathbb{E}_{a_{t+1}\sim \pi_\theta(\cdot|s_{t+1})}\left[\min_{j=1,2} Q_\textbf{soft}^{\omega_j^-}(s_{t+1},a_{t+1})-\alpha \log \pi_{\theta}(a_{t+1}|s_{t+1})\right]\right)\right)^2\right]\notag
\end{align}
$$

Here $R$ is the replay buffer of data collected by past policies, since SAC is an off-policy algorithm. To make training more stable, two target networks $Q_\textbf{soft}^{\omega_j^-}$ are used (a code sketch of the full critic step follows the update rule below). The target networks are updated with a soft update:

$$
\begin{align}
\omega_j^-\leftarrow \tau \omega_j + (1-\tau) \omega_j^-, \quad j=1,2\notag
\end{align}
$$
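
To make the critic update concrete, here is a hedged PyTorch sketch (helper names such as `critic_loss`, `soft_update`, and the `(1 - done)` terminal mask are my own assumptions, not the post's code): the target uses the minimum of the two target critics minus the entropy term, both critics are regressed to it, and `soft_update` applies the Polyak rule above parameter by parameter.

```python
# Sketch of one critic step; q1, q2 are the online critics, q1_targ, q2_targ their target
# copies, and policy is the GaussianPolicy sketched earlier.
import torch
import torch.nn.functional as F

def critic_loss(batch, q1, q2, q1_targ, q2_targ, policy, gamma=0.99, alpha=0.2):
    s, a, r, s_next, done = batch                        # a minibatch from the replay buffer R
    with torch.no_grad():
        mu, std = policy(s_next)
        dist = torch.distributions.Normal(mu, std)
        a_next = dist.sample()                           # a_{t+1} ~ pi_theta(.|s_{t+1})
        logp_next = dist.log_prob(a_next).sum(-1, keepdim=True)
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        # (1 - done) masks terminal transitions -- an implementation detail not in the loss above
        target = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    return F.mse_loss(q1(s, a), target) + F.mse_loss(q2(s, a), target)

@torch.no_grad()
def soft_update(q, q_targ, tau=0.005):
    # omega^- <- tau * omega + (1 - tau) * omega^-, parameter by parameter
    for p, p_targ in zip(q.parameters(), q_targ.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```

After each gradient step on the critic loss, `soft_update(q1, q1_targ)` and `soft_update(q2, q2_targ)` would be called once.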

### Policy Function (Actor)

Next, the loss $L_{\pi}(\theta)$ for the policy $\pi_\theta$ is derived from the KL divergence used in the soft policy improvement step:

$$
\begin{align}
\theta&=\arg \min_{\theta} L_\pi(\theta) \notag \\
&=\arg \min_{\theta} \mathbb{E}_{s_t\sim R}\left[D_\textbf{KL}\left(\pi_\theta(\cdot|s_t)\,\Big|\Big|\,\frac{\exp(\frac{1}{\alpha }Q_\textbf{soft}^\omega(s_t,\cdot))}{Z(s_t)}\right)\right]\notag \\
&= \arg\min_\theta \mathbb{E}_{s_t\sim R}\left[\mathbb{E}_{a_t\sim \pi_\theta(\cdot|s_t)}\left[\log\left(\frac{\pi_\theta(a_t|s_t)Z(s_t)}{\exp(\frac{1}{\alpha}Q_\textbf{soft}^\omega(s_t,a_t))}\right)\right]\right] \notag \\
&=\arg\min_\theta \mathbb{E}_{s_t\sim R,\, a_t\sim \pi_\theta(\cdot|s_t)}\left[\log \pi_\theta(a_t|s_t)-\frac{1}{\alpha}Q_\textbf{soft}^\omega(s_t,a_t)+\log Z(s_t)\right] \notag \\
&=\arg\min_\theta \mathbb{E}_{s_t\sim R,\, a_t\sim \pi_\theta(\cdot|s_t)}\left[\log \pi_\theta(a_t|s_t)-\frac{1}{\alpha}Q_\textbf{soft}^\omega(s_t,a_t)\right] \notag \\
&=\arg\min_\theta \mathbb{E}_{s_t\sim R,\, a_t\sim \pi_\theta(\cdot|s_t)}\left[\alpha\log \pi_\theta(a_t|s_t)-Q_\textbf{soft}^\omega(s_t,a_t)\right] \notag \\
&=\arg\max_\theta \mathbb{E}_{s_t\sim R}\left[V_\textbf{soft}^\omega (s_t)\right]\notag
\end{align}
$$

Taking the two $Q_\textbf{soft}^\omega$ networks into account, the loss is written as:

$$
\begin{align}
L_\pi(\theta)=\mathbb{E}_{s_t\sim R,\, a_t\sim \pi_\theta(\cdot|s_t)}\left[\alpha\log \pi_\theta(a_t|s_t)-\min_{j=1,2}Q_\textbf{soft}^{\omega_j}(s_t,a_t)\right] \notag
\end{align}
$$

### Reparameterization Trick

For environments with continuous action spaces, the policy outputs the mean and standard deviation of a Gaussian distribution, but sampling an action from that Gaussian is not differentiable. We therefore use the reparameterization trick: first sample from a standard Gaussian $\mathcal{N}$, then multiply the sample by the standard deviation and add the mean:

$$
\begin{align}
a_t&=f_\theta(\epsilon_t;s_t) \notag \\
&= \mu_\theta(s_t)+\epsilon_t\,\sigma_\theta(s_t), \quad \epsilon_t\sim\mathcal{N}(0,I)\notag
\end{align}
$$

The result can be regarded as a sample from the policy's Gaussian distribution, and the operation is differentiable with respect to the policy parameters. The loss function then becomes:

$$
\begin{align}
L_\pi(\theta)=\mathbb{E}_{s_t\sim R,\, \epsilon_t \sim \mathcal{N}}\left[\alpha\log \pi_\theta(f_\theta(\epsilon_t;s_t)|s_t)-\min_{j=1,2}Q_\textbf{soft}^{\omega_j}(s_t,f_\theta(\epsilon_t;s_t))\right] \notag
\end{align}
$$
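
A hedged sketch of this actor update (function and variable names follow the earlier sketches and are my own assumptions): `rsample()` draws $a_t=\mu_\theta(s_t)+\epsilon_t\sigma_\theta(s_t)$ with $\epsilon_t\sim\mathcal{N}(0,I)$, so the gradient flows through the sampled action into $\theta$, and the loss uses the minimum of the two critics.

```python
# Sketch of the reparameterized actor loss.
import torch

def actor_loss(s, q1, q2, policy, alpha=0.2):
    mu, std = policy(s)
    dist = torch.distributions.Normal(mu, std)
    a = dist.rsample()                                 # f_theta(eps; s_t), differentiable in theta
    logp = dist.log_prob(a).sum(-1, keepdim=True)      # log pi_theta(a_t|s_t)
    q = torch.min(q1(s, a), q2(s, a))                  # min over the two critics
    return (alpha * logp - q).mean()
```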

### Automating Entropy Adjustment

Choosing the coefficient of the entropy regularizer is crucial in SAC, and different states call for different amounts of entropy: in states where the optimal action is uncertain, the entropy should be larger, while in states where the optimal action is fairly clear, it can be smaller. To adjust the entropy term automatically, SAC rewrites the RL objective as a constrained optimization problem:

$$
\begin{align}
\arg\max_\theta\mathbb{E}_{(s_t,a_t)\sim \rho^{\pi_\theta}}\left[\sum_{t=0}^\infty r(s_t,a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{(s_t,a_t)\sim\rho^{\pi_\theta}}\left[-\log\pi_\theta(a_t|s_t)\right]\ge H_0\notag
\end{align}
$$

Rewriting this as a constrained minimization in the standard form required by the KKT (Karush-Kuhn-Tucker) conditions gives:

$$
\begin{align}
\arg\min_\theta\mathbb{E}_{(s_t,a_t)\sim \rho^{\pi_\theta}}\left[-\sum_{t=0}^\infty r(s_t,a_t)\right] \quad \text{s.t.} \quad H_0-\mathbb{E}_{(s_t,a_t)\sim\rho^{\pi_\theta}}\left[-\log\pi_\theta(a_t|s_t)\right]\le 0\notag
\end{align}
$$

Using the method of Lagrange multipliers, this becomes an unconstrained problem with the Lagrangian:

$$
\begin{align}
L(\theta,\alpha)=\mathbb{E}_{(s_t,a_t)\sim \rho^{\pi_\theta}}\left[-\sum_{t=0}^\infty r(s_t,a_t)\right] + \alpha\left[H_0-\mathbb{E}_{(s_t,a_t)\sim\rho^{\pi_\theta}}\left[-\log\pi_\theta(a_t|s_t)\right]\right] \notag
\end{align}
$$

where $\alpha\ge0$ is the Lagrange multiplier. The dual variable $\alpha$ is updated by maximizing the Lagrangian with respect to $\alpha$, or equivalently by minimizing the negation of its $\alpha$-dependent terms; extracting those terms gives the temperature loss:

$$
\begin{align}
L(\alpha)=\mathbb{E}_{s_t\sim R,\,a_t\sim\pi_\theta(\cdot|s_t)}\left[-\alpha\log\pi_\theta(a_t|s_t)-\alpha H_0\right] \notag
\end{align}
$$

In other words, when the policy's entropy falls below the target $H_0$, minimizing $L(\alpha)$ increases $\alpha$, which raises the weight of the entropy term when minimizing the policy loss $L_\pi(\theta)$; when the policy's entropy exceeds $H_0$, minimizing $L(\alpha)$ decreases $\alpha$, so policy training focuses more on improving the value.
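
A hedged sketch of this temperature update (the parameterization $\alpha=\exp(\log\alpha)$, the Adam optimizer, and the common choice $H_0=-\dim(\mathcal{A})$ are implementation conventions I am assuming, not stated in the derivation above):

```python
# Sketch of the temperature update: alpha = exp(log_alpha) stays positive, and L(alpha)
# is minimized by gradient descent so that alpha rises when the entropy is below H_0.
import torch

log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -4.0                        # H_0; often set to -action_dim by convention

def update_alpha(logp):                      # logp: log pi_theta(a_t|s_t) for a minibatch
    alpha = log_alpha.exp()
    loss = -(alpha * (logp.detach() + target_entropy)).mean()   # L(alpha) = E[-alpha*(log pi + H_0)]
    alpha_optimizer.zero_grad()
    loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().detach()          # the alpha used by the actor and critic losses
```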
