From policy gradient theorem to variational autoencoder
The policy gradient theorem is the basis of many deep reinforcement learning methods (e.g., PPO, TRPO, and SAC). Meanwhile, the Variational Auto-encoder (VAE) is one of the most popular generative models in machine learning. Since the optimisation problems of these approaches share a similar formulation, we can use the policy gradient as an alternative training method for the VAE.
What is the policy gradient theorem
The policy gradient theorem is widely applied to solve RL problems. The theoretical model that reinforcement learning (RL) addresses is the Markov Decision Process (MDP). In general, an MDP consists of a six-element tuple $\mathcal{T} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \mathcal{P}_0, \gamma \rangle$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ denotes the action space, $\mathcal{R}$ is a reward function, $\mathcal{P}$ is the state-transition probability matrix, $\mathcal{P}_0$ is the initial state distribution, and $\gamma \in [0, 1]$ is the discount factor. A policy $\pi(a | s)$, where $a \in \mathcal{A}$ and $s \in \mathcal{S}$, is a mapping from state $s$ to the probability of selecting action $a$. A trajectory $\tau$ is a sequence of states, actions, and rewards sampled under a policy:
\[\tau_{\pi}=(s_0, a_0, r_0, s_1, a_1, r_1, ...).\]The state value function of a state $s_t$ under a policy $\pi$ is denoted as $v_\pi(s_t)$, which is the expected return when starting in $s_t$ and following $\pi$ thereafter. Thus the value function can be calculated as:
\[v_{\pi}(s_t) = \mathbb{E}_{\tau \sim P_{\mathcal{T}}(\tau|\pi)} \left [ \sum_{k=t}^{\infty} \gamma^{k-t} r_k \right ].\]In the literature, researchers often use the state-action value function, denoted as $q_\pi (s_t, a_t)$, which is the expected return when starting in $s_t$, taking action $a_t$, and following $\pi$ thereafter. The state value function is then the expectation of the state-action value function under the policy:
\[v_{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi (a_t |s_t) } \left [ q_\pi(s_t, a_t) \right ].\]The goal of reinforcement learning is to obtain the optimal policy $\pi^* $ that maximises the expected total reward:
\[J = \mathbb{E}_{s_0 \sim \mathcal{P}_0} \left [ v_{\pi}(s_0) \right ] = \mathbb{E}_{\tau \sim \pi, \mathcal{P}}\left [ \sum_{t=0}^{T} \gamma^t r_t \right ] .\]
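To make the objective concrete, the return of a single sampled trajectory can be computed directly from its rewards. Below is a minimal Python sketch (the helper `discounted_return` is just illustrative, not part of any library):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return G = sum_t gamma^t * r_t of one trajectory."""
    g = 0.0
    # Iterate backwards so each step reuses the partial sum of future rewards.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a short trajectory with three rewards.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
```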
This can be seen as a typical numerical optimisation problem, where the policy is parametrised by $\theta$. The key to finding the optimal policy is to compute the gradient of the objective:
\[\nabla_{\theta}J = \nabla_{\theta} \mathbb{E}_{s_0 \sim \mathcal{P}_0} \mathbb{E}_{a_0 \sim \pi(a_0|s_0)} q_\pi(s_0, a_0) = \mathbb{E}_{s_0 \sim \mathcal{P}_0} \nabla_{\theta} \mathbb{E}_{a_0 \sim \pi(a_0|s_0)} q_\pi(s_0, a_0).\] \[\begin{equation} \begin{split} \nabla_{\theta} \mathbb{E}_{a_0 \sim \pi(a_0|s_0)} q_\pi(s_0, a_0) = \nabla_{\theta} v_{\pi}(s_0) &= \sum_{s} \left (\sum_{k=0}^{\infty} Pr(s_0 \rightarrow s, k, \pi) \right) \sum_{a}\nabla_{\theta}\pi(a|s)q_{\pi}(s, a) \\ &\propto \sum_{s}\mu(s) \sum_a \nabla_{\theta} \pi(a|s)q_\pi(s, a) \\ &= \sum_{s}\mu(s) \sum_a \frac{\nabla_{\theta}\pi(a|s)}{\pi(a|s)}\pi(a|s)q_\pi(s, a) \\ &= \sum_{s}\mu(s) \sum_a \pi(a|s) q_\pi(s,a) \nabla_{\theta}\log\pi(a|s). \end{split} \end{equation}\]Here we apply the Monte-Carlo method (see Reinforcement Learning: An Introduction, Chapter 5) to estimate the gradient from sampled trajectories $\tau = (s_0,a_0,r_0, \dots, s_T, a_T, r_T)$, thus we have:
\[\begin{equation} \begin{split} \sum_{s}\mu(s) \sum_a \pi(a|s) q_\pi(s,a) \nabla_{\theta}\log\pi(a|s) &\approx \mathbb{E}_{\tau \sim \pi, \mathcal{P}} \left [ \sum_{t=0}^{T} \gamma^t \nabla_{\theta}\log\pi(a_t|s_t) q_\pi(s_t,a_t) \right ] \end{split} \end{equation}\]The above equation is the basis of policy gradient methods such as PPO, TRPO, and SAC. A simple way to approximate the policy gradient is to use the total return of each trajectory instead of the q-values:
\[\nabla_{\theta}J \approx \mathbb{E}_{\tau \sim \pi, \mathcal{P}} \left [ \sum_{t=0}^{T} \left ( G_{\tau} - b(s_t) \right ) \nabla_{\theta}\log \pi(a_t | s_t)\right ],\]where $G_{\tau}$ is the total return of the trajectory and $b(s_t)$ is a baseline that does not depend on the action $a_t$; subtracting it reduces the variance of the estimate without changing its expectation.
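As a sketch of how this estimator is typically used, the snippet below builds the REINFORCE-style surrogate loss for one trajectory in PyTorch; the function name and the dummy log-probabilities are illustrative assumptions, not a reference implementation.

```python
import torch

def reinforce_loss(log_probs, total_return, baseline=0.0):
    """Surrogate loss whose gradient equals -(G - b) * sum_t grad log pi(a_t|s_t).

    log_probs: 1-D tensor of log pi(a_t|s_t) along one trajectory.
    total_return: scalar return G of that trajectory.
    baseline: any value that does not depend on the actions.
    """
    advantage = total_return - baseline       # (G - b), treated as a constant
    return -(advantage * log_probs.sum())     # minimising this ascends the objective

# Usage with dummy log-probabilities standing in for a policy network's output:
log_probs = torch.log(torch.tensor([0.3, 0.5, 0.2], requires_grad=True))
loss = reinforce_loss(log_probs, total_return=2.62, baseline=1.0)
loss.backward()  # gradients would flow into the policy parameters
```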
What is the VAE
The VAE is a generative model based on latent variables. The fundamental mathematical problem that a generative model needs to solve is to model the distribution $p(x)$ over the original data, which may live in a high-dimensional space, with a parametrised approximation $p_{\theta}(x) \approx p(x)$. However, obtaining this distribution can be challenging. Many methods find the optimal parameter $\theta$ by a maximum-likelihood fit:
\[\theta^{*} = \arg\max_{\theta} \sum_{i} \log p_{\theta}(x_i).\]
Existing methods that directly model $p(x)$ tend to suffer from one of three drawbacks: 1) they might require strong assumptions about the structure in the data; 2) they might make severe approximations, leading to suboptimal models; 3) they might rely on computationally expensive inference procedures like Markov Chain Monte Carlo.
The VAE addresses this problem with a latent-variable model. Instead of modelling $p(x)$ directly, the VAE writes it as:
\[p(x) = \int_{z} p(x|z)p(z)\,dz,\]where $p(z)$ is a simple distribution (e.g., a Gaussian) and $p(x|z)$ is approximated by a parametrised function (e.g., a neural network). We could then fit the parameters by maximum likelihood:
\[\theta^{*} = \arg\max_{\theta} \sum_{i} \log \int_{z} p_{\theta}(x_i|z)p(z)\,dz.\]However, computing the term $\int_{z} p_{\theta}(x|z)p(z)\,dz$ is intractable: it would require an extremely large number of samples of $z$ in high-dimensional spaces.
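To see why, one could estimate this integral naively by sampling $z$ from the prior and averaging $p(x|z)$; in high dimensions almost all such samples contribute essentially nothing, so the variance of the estimate is huge. A minimal sketch, assuming (purely for illustration) a unit-variance Gaussian likelihood and a toy linear decoder:

```python
import numpy as np

def naive_log_px(x, decoder_mean, n_samples=10_000, z_dim=20):
    """Naive Monte Carlo estimate of log p(x) = log E_{z ~ p(z)}[p(x|z)]."""
    z = np.random.randn(n_samples, z_dim)               # z ~ N(0, I)
    mu_x = decoder_mean(z)                               # parameters of p(x|z)
    # log N(x; mu_x, I) for each sample, up to an additive normalising constant
    log_px_given_z = -0.5 * np.sum((x - mu_x) ** 2, axis=1)
    # log-mean-exp for numerical stability
    m = log_px_given_z.max()
    return m + np.log(np.mean(np.exp(log_px_given_z - m)))

# Toy decoder: a fixed random linear map from z to x (illustration only).
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 5))
x = rng.standard_normal(5)
print(naive_log_px(x, decoder_mean=lambda z: z @ W))
```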
The VAE solves this intractability by noting that only the values of $z$ that are likely to have produced $x$ matter, and computing $p(x)$ mainly from those sampled values. This means that we need a new function $p(z|x)$ that takes a value of $x$ and returns a distribution over the $z$ values likely to produce it. Hopefully, the space of $z$ values that are likely under $p(z|x)$ is much smaller than the space of all $z$'s that are likely under the prior $p(z)$. However, $p(z|x)$ is itself hard to compute, so the VAE uses variational inference to approximate it with a parameterised distribution $q(z|x; \phi) \approx p(z|x)$. In order to minimise the distance between $q(z|x)$ and $p(z|x)$, we introduce the Kullback-Leibler divergence (KL divergence), which is defined as:
\[D_{KL}\left(q(z|x)\,\|\,p(z|x)\right) = \mathbb{E}_{z \sim q(z|x)} \left[ \log q(z|x) - \log p(z|x) \right].\]
By applying Bayes' rule to $p(z|x)$, we have:
\[D_{KL}\left(q(z|x)\,\|\,p(z|x)\right) = \mathbb{E}_{z \sim q(z|x)} \left[ \log q(z|x) - \log p(x|z) - \log p(z) \right] + \log p(x).\]This equation contains the quantity we actually want to maximise, $\log p(x)$, so we rearrange it as:
\[\log p(x) - D_{KL}\left(q(z|x)\,\|\,p(z|x)\right) = \mathbb{E}_{z \sim q(z|x)} \left[ \log p(x|z) \right] - D_{KL}\left(q(z|x)\,\|\,p(z)\right).\]Because the KL divergence is always greater than or equal to zero, we obtain a lower bound on $\log p(x)$:
\[\log p(x) \ge \mathbb{E}_{z \sim q(z|x)} \left[ \log p(x|z) \right] - D_{KL}\left(q(z|x)\,\|\,p(z)\right).\]
This is in fact the training objective of the VAE, which consists of two parts. The first term is the reconstruction loss $\mathbb{E}_{z \sim q(z|x)} \left [ \log p(x|z) \right ]$, which corresponds to sampling a $z$ from $q(z|x)$ and generating $x$ from that $z$. The second term is a regularisation term that measures the KL divergence between $q(z|x)$ and $p(z)$, where the prior over $z$ is a simple Gaussian distribution $p(z) = N(0, I)$.
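For the common choice of a diagonal Gaussian posterior $q(z|x) = N(\mu(x), \sigma^2(x)I)$ and the standard normal prior, this KL term has the well-known closed form $-\frac{1}{2}\sum_j \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$. A minimal PyTorch sketch:

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)

# Example: when q(z|x) equals the prior, the KL term is zero.
mu = torch.zeros(4, 8)
log_var = torch.zeros(4, 8)
print(kl_to_standard_normal(mu, log_var))  # tensor of zeros, one entry per batch element
```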
However, the sampling step makes the computational graph non-differentiable, so we cannot directly use gradient descent to optimise this objective. To address this problem, we use the reparametrisation trick to sample $z$:
\[z \sim q(z|x) = N(\mu(x), \sigma^2(x) I), \qquad z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim N(0,I).\]
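A minimal sketch of this trick, assuming the encoder outputs $\mu(x)$ and $\log\sigma^2(x)$; pushing the randomness into an auxiliary noise variable $\epsilon$ lets gradients flow through $\mu$ and $\sigma$:

```python
import torch

def reparametrise(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and log_var."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)   # the only stochastic node, independent of the parameters
    return mu + std * eps

# Gradients flow back through the sampled z into the encoder outputs.
mu = torch.zeros(3, 8, requires_grad=True)
log_var = torch.zeros(3, 8, requires_grad=True)
z = reparametrise(mu, log_var)
z.sum().backward()
print(mu.grad.shape, log_var.grad.shape)  # torch.Size([3, 8]) torch.Size([3, 8])
```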
In addition to the derivation above, there is another way to derive the VAE's lower bound:
\[\begin{equation} \begin{split} \log p(x) &= \log \int_{z} p(x|z)p(z)\,dz = \log \int_{z} p(x|z)p(z) \frac{q(z|x)}{q(z|x)}\,dz \\ &= \log \mathbb{E}_{z \sim q(z|x)} \left [\frac{p(x|z)p(z)}{q(z|x)} \right ] \ \ \ {\rm applying \ Jensen's \ inequality} \\ &\ge \mathbb{E}_{z \sim q(z|x)} \left [ \log \frac{p(x|z)p(z)}{q(z|x)} \right ]. \end{split} \end{equation}\]In general, we call this objective the evidence lower bound (ELBO).
Using the policy gradient theorem to train the VAE (amortised variational inference)
Based on the discussion above, the objectives of policy gradient methods and the VAE share a similar high-level structure:
\[\begin{equation} L(\theta, \phi) = \mathbb{E}_{x \sim p_{\phi}(x)}\left [ f_{\theta}(x) \right ], \end{equation}\]where $p_{\phi}(x)$ is a distribution parametrised by $\phi$ and $f_{\theta}(x)$ is an objective function of $x$. The goal is to find the parameters that optimise this target. In the policy gradient setting, $f_{\theta}(x)$ is the value function; in the VAE, it is the log-likelihood term. Therefore, we can use the policy gradient to train the VAE.
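Concretely, the log-derivative trick turns this shared structure into a single estimator, $\nabla_{\phi} \mathbb{E}_{x \sim p_{\phi}(x)}\left[ f_{\theta}(x) \right] = \mathbb{E}_{x \sim p_{\phi}(x)}\left[ f_{\theta}(x)\,\nabla_{\phi}\log p_{\phi}(x) \right]$. A minimal sketch of the corresponding surrogate loss (the helper name and the Bernoulli toy example are illustrative assumptions):

```python
import torch

def score_function_surrogate(log_prob_x, f_x):
    """Surrogate whose gradient w.r.t. phi estimates E[f(x) * grad_phi log p_phi(x)].

    log_prob_x: log p_phi(x) evaluated at sampled x (a function of phi).
    f_x: f_theta(x) at the same samples, treated as a constant for this gradient.
    """
    return (f_x.detach() * log_prob_x).mean()

# Example: phi parametrises a Bernoulli over x in {0, 1}, and f(x) = x.
phi = torch.tensor(0.0, requires_grad=True)
dist = torch.distributions.Bernoulli(logits=phi)
x = dist.sample((4096,))
surrogate = score_function_surrogate(dist.log_prob(x), f_x=x)
surrogate.backward()
print(phi.grad)  # Monte Carlo estimate of d/dphi E[x] = sigmoid'(phi) = 0.25
```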
Let’s take a look at the ELBO:
\[\begin{equation} \begin{split} \log p(x) &\ge \mathbb{E}_{z \sim q(z|x)} \left [ \log \frac{p(x|z)p(z)}{q(z|x)} \right ] \\ & = \mathbb{E}_{z \sim q(z|x)} \left [ \log p(x|z) + \log p(z) \right ] - \mathbb{E}_{z \sim q(z|x)} \left [ \log q(z|x) \right ] \\ & = \mathbb{E}_{z \sim q(z|x)} \left [ \log p(x|z) + \log p(z) \right ] + \mathcal{H}(q(z|x)), \end{split} \end{equation}\]where $\mathcal{H}(q(z|x))$ is the entropy of $q(z|x)$. Our goal is to maximise this ELBO to obtain the optimal parameters of $p(x|z)$ and $q(z|x)$, denoted $\theta$ and $\phi$ respectively. We denote the ELBO as $L(\theta, \phi)$. To apply gradient ascent we need $\nabla_{\theta}L(\theta, \phi)$ and $\nabla_{\phi}L(\theta, \phi)$. Since only the first term depends on $\theta$, the Monte Carlo method gives the partial derivative with respect to $\theta$ directly:
\[\nabla_{\theta}L(\theta, \phi) = \mathbb{E}_{z \sim q(z|x)} \left [ \nabla_{\theta} \log p(x|z) \right ] \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_{\theta} \log p(x|z_i), \quad z_i \sim q(z|x).\]
Calculating $\nabla_{\phi}L(\theta, \phi)$ is more difficult. However, we can view $q(z|x)$ as the policy, $x$ as the state, $z$ as the action, and $\log p(x|z)+\log p(z)$ as the reward function. The gradient of the entropy term can be computed directly, while the policy gradient theorem lets us compute the gradient of the term $\mathbb{E}_{z \sim q(z|x)} \left [ \log p(x|z) + \log p(z) \right ]$ as:
\[\nabla_{\phi}\, \mathbb{E}_{z \sim q(z|x)} \left [ r(z,x) \right ] = \mathbb{E}_{z \sim q(z|x)} \left [ r(z,x)\, \nabla_{\phi}\log q(z|x) \right ],\]where $r(z,x) = \log p(x|z) + \log p(z)$.
In summary, we have the partial derivatives for both $\theta$ and $\phi$:
\[\begin{equation} \begin{split} \nabla_{\theta}L(\theta, \phi) &= \mathbb{E}_{z \sim q(z|x)} \left [ \nabla_{\theta}\log p(x|z) \right ], \\ \nabla_{\phi}L(\theta, \phi) &= \mathbb{E}_{z \sim q(z|x)} \left [ r(z,x)\, \nabla_{\phi} \log q(z|x) \right ] + \nabla_{\phi} \mathcal{H}(q(z|x)), \end{split} \end{equation}\]both of which can be estimated with Monte Carlo samples $z \sim q(z|x)$, exactly as in the policy gradient methods above.
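Putting the two gradients together, the sketch below trains a toy VAE whose encoder is updated with the score-function (policy gradient) estimator rather than the reparametrisation trick. The network sizes, the Bernoulli decoder, and the random data are illustrative assumptions, not a reference implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreFunctionVAE(nn.Module):
    """Toy VAE whose encoder gradient uses the score-function (policy gradient) estimator."""

    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(), nn.Linear(hidden, x_dim))

    def loss(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        q = torch.distributions.Normal(mu, torch.exp(0.5 * log_var))
        z = q.sample()                                  # plain sampling: no gradient flows through z
        log_q = q.log_prob(z).sum(-1)                   # log q(z|x)
        log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)  # log p(z)
        log_px_z = -F.binary_cross_entropy_with_logits(self.dec(z), x, reduction='none').sum(-1)

        # "Reward" r(z, x) = log p(x|z) + log p(z), a constant w.r.t. the encoder parameters.
        reward = (log_px_z + log_pz).detach()
        entropy = 0.5 * (1.0 + math.log(2.0 * math.pi) + log_var).sum(-1)  # closed-form H(q)
        decoder_term = -log_px_z.mean()                 # gives grad_theta of -E[log p(x|z)]
        encoder_term = -(reward * log_q).mean() - entropy.mean()  # score-function term + entropy term
        return decoder_term + encoder_term              # minimising this maximises the ELBO estimate

# One optimisation step on random binary data (illustration only).
model = ScoreFunctionVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.bernoulli(torch.rand(32, 784))
opt.zero_grad()
model.loss(x).backward()
opt.step()
```

In practice this estimator has much higher variance than the reparametrised gradient, which is why the reparametrisation trick is the default for Gaussian latents; the score-function route remains useful when $z$ is discrete or the sampling step cannot be reparametrised.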