
Chapter 7 - Temporal Difference Methods

by Chaitanya


TD learning of State Values

TD learning refers to a broad class of Reinforcement Learning algorithms that learn by bootstrapping from the current estimate of the value function.

For some policy $\pi$, we want to estimate the value $v^\pi(s)$ of every state $s \in \mathcal{S}$. Suppose we are given a sequence of states, actions and rewards $(s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ generated by following the policy $\pi$. We can then construct a simple TD algorithm to estimate $v^\pi(s)$ as follows:

\[\begin{align*} v_{t+1}(s_t) &= v_t(s_t) - \alpha \big[v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))\big] \\ v_{t+1}(s) &= v_t(s) \quad \text{for } s \neq s_t \end{align*}\]

Here, $v_t(s)$ is the estimate of $v^\pi(s)$ at time $t$, and $\alpha$ is the step-size parameter. This TD algorithm can be thought of as a special case of the stochastic approximation algorithm for solving the Bellman equation for $v^\pi$:

\[\begin{align} v^\pi(s) &= \mathbb{E}_\pi \big[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s \big] \label{TD-value} \end{align}\]

This TD algorithm can be derived by applying the Robbins-Monro algorithm to the above Bellman equation:

For every state, we define a function \(g(v^\pi(s)) = v^\pi(s) - \mathbb{E}_\pi \big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s \big]\).

Hence, solving the Bellman equation is equivalent to solving \(g(v^\pi(s)) = 0\). At every timestep, we observe a noisy sample of \(g(v^\pi(s_t))\):

\[\begin{equation*} \begin{split} \tilde{g}(v^\pi(s_t)) &= v^\pi(s_t) - (r_{t+1} + \gamma v^\pi(s_{t+1}))\\ &= \underbrace{\left( v^\pi(s_t) - \mathbb{E}\big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s_t \big] \right)}_{g(v^\pi(s_t))} + \underbrace{\left( \mathbb{E}\big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s_t \big] - (r_{t+1} + \gamma v^\pi(s_{t+1})) \right)}_{\eta_t} \end{split} \end{equation*}\]
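One step worth making explicit: conditioned on $S_t = s_t$, the noise term $\eta_t$ has zero mean, which is exactly the condition under which the Robbins-Monro results apply:

\[\begin{equation*} \mathbb{E}\big[\eta_t \mid S_t = s_t\big] = \mathbb{E}\big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s_t \big] - \mathbb{E}\big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s_t \big] = 0 \end{equation*}\]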

Therefore, the Robbins-Monro algorithm for solving $g(v^\pi(s)) = 0$ gives us the TD update rule:

\[\begin{equation*} \begin{split} v_{t+1}(s_t) &= v_t(s_t) - \alpha \tilde{g}(v_t(s_t)) \\ &= v_t(s_t) - \alpha \big[v_t(s_t) - (r_{t+1} + \gamma v^\pi(s_{t+1}))\big] \end{split} \end{equation*}\]

Note that this assumes we already have the correct value estimates for every other state \(s \neq s_t\). If we were estimating the value function for all states simultaneously, we would have \(v_{t}(s_{t+1})\) in place of \(v^\pi(s_{t+1})\) in the above equation. Nevertheless, as shown later, the algorithm still converges to the correct value function in that case.
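To make this practical version concrete (the one that uses the current estimates $v_t$ for every state), here is a minimal self-contained sketch in Python. The random-walk environment and all constants are invented purely for illustration; the update line is the one from the equation above, written in the same "estimate minus error" form.

```python
import numpy as np

# Toy illustration of tabular TD(0) policy evaluation.
# Environment (invented for this sketch): a 5-state random walk, states 0..4.
# The fixed policy pi moves left/right uniformly at random; episodes terminate
# off either end, and the only reward is +1 for exiting on the right.
rng = np.random.default_rng(0)
num_states, gamma, alpha = 5, 1.0, 0.1
v = np.zeros(num_states)                      # v_t(s), initialised to zero

for _ in range(5_000):                        # episodes generated by following pi
    s = num_states // 2                       # start in the middle state
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next < 0:                        # terminated on the left, reward 0
            r, done = 0.0, True
        elif s_next >= num_states:            # terminated on the right, reward +1
            r, done = 1.0, True
        else:
            r, done = 0.0, False
        td_target = r + (0.0 if done else gamma * v[s_next])
        # v_{t+1}(s_t) = v_t(s_t) - alpha * (v_t(s_t) - td_target); other states unchanged
        v[s] -= alpha * (v[s] - td_target)
        if done:
            break
        s = s_next

print(np.round(v, 2))  # hovers around the true values 1/6, 2/6, 3/6, 4/6, 5/6
```

With a constant step size the estimates keep fluctuating around the true values; a decaying step size satisfying the usual Robbins-Monro conditions would make them converge.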

Properties of the TD algorithm

  1. TD target: The quantity \(\overline{v}_t = (r_{t+1} + \gamma v_t(s_{t+1}))\) is called the TD target. It is the value that the algorithm is trying to make $v_t(s_t)$ approach.
\[\begin{equation*} \begin{split} v_{t+1}(s_t) - \overline{v}_t &= v_t(s_t) - \alpha \big[v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))\big] - \overline{v}_t \\ &= v_t(s_t) - \overline{v}_t - \alpha (v_t(s_t) - \overline{v}_t) \\ &= (1 - \alpha)(v_t(s_t) - \overline{v}_t) \\ \implies |v_{t+1}(s_t) - \overline{v}_t| &= |1 - \alpha| \, |v_t(s_t) - \overline{v}_t| \end{split} \end{equation*}\]

Therefore, for $0 < \alpha < 1$ we have $|1 - \alpha| < 1$, so the updated estimate $v_{t+1}(s_t)$ is closer to the TD target $\overline{v}_t$ than $v_t(s_t)$ is.
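A quick numerical check of this contraction, with arbitrary numbers chosen only to illustrate the identity:

```python
alpha = 0.1
v_s, td_target = 2.0, 5.0                      # arbitrary current estimate and TD target
v_s_new = v_s - alpha * (v_s - td_target)      # one TD update with the target held fixed
print(abs(v_s_new - td_target))                # 2.7
print((1 - alpha) * abs(v_s - td_target))      # 2.7 = |1 - alpha| * |v_t(s_t) - TD target|
```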

Convergence of TD algorithm