The room of our random experiments
by Chaitanya
Temporal-difference (TD) learning refers to a broad class of Reinforcement Learning algorithms that learn by bootstrapping from the current estimate of the value function.
For some policy $\pi$, we want to estimate the value $v^\pi(s)$ of all the states $s \in \mathcal{S}$. Suppose we are given a sequence of states, actions, and rewards $(s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ generated by following the policy $\pi$. We can then construct a simple TD algorithm to estimate $v^\pi(s)$ as follows:
\[\begin{align*} v_{t+1}(s_t) &= v_t(s_t) - \alpha \big[v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))\big] \\ v_{t+1}(s) &= v_t(s) \quad \text{for } s \neq s_t \end{align*}\]Here, $v_t(s)$ is the estimate of $v^\pi(s)$ at time $t$ and $\alpha$ is the step-size parameter. This TD algorithm can be thought of as a special case of the stochastic approximation algorithm for solving the Bellman equation for $v^\pi$:
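To make the update concrete, here is a minimal tabular TD(0) sketch in Python. The environment interface (`env.reset()` returning a state index and `env.step(action)` returning `(next_state, reward, done)`) and the `policy` function are assumptions made for this illustration, not anything prescribed by the update rule itself.

```python
import numpy as np

def td0_evaluate(env, policy, num_states, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation (a sketch under the assumed interface).

    Assumes states are integers in [0, num_states), env.reset() -> state,
    env.step(action) -> (next_state, reward, done), and policy(state) -> action.
    """
    v = np.zeros(num_states)  # v_t(s): current estimate of v^pi(s)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD target: r_{t+1} + gamma * v_t(s_{t+1}); terminal states contribute 0
            target = r if done else r + gamma * v[s_next]
            # v_{t+1}(s_t) = v_t(s_t) - alpha * [v_t(s_t) - target]
            v[s] -= alpha * (v[s] - target)
            s = s_next
    return v
```

Only the entry for the visited state $s_t$ changes at each step; every other entry is left untouched, exactly as in the second line of the update above.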
\[\begin{align} v^\pi(s) &= \mathbb{E}_\pi \big[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s \big] \label{TD-value} \end{align}\]This TD algorithm can be derived by applying the Robbins-Monro algorithm to the above Bellman equation:
For every state $s$, we define a function \(g(v^\pi(s)) = v^\pi(s) - \mathbb{E}_\pi \big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s \big]\).
Hence, the Bellman equation is equivalent to solving \(g(v^\pi(s)) = 0\). At every timestep, we observe a noisy sample of \(g(v^\pi(s_t))\) for the visited state $s_t$:
\[\begin{equation*} \begin{split} \tilde{g}(v^\pi(s_t)) &= v^\pi(s_t) - (r_{t+1} + \gamma v^\pi(s_{t+1}))\\ &= \underbrace{\left( v^\pi(s_t) - \mathbb{E}_\pi\big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s_t \big] \right)}_{g(v^\pi(s_t))} + \underbrace{\left( \mathbb{E}_\pi\big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s_t \big] - (r_{t+1} + \gamma v^\pi(s_{t+1})) \right)}_{\eta_t} \end{split} \end{equation*}\]The noise $\eta_t$ has zero mean conditioned on $S_t = s_t$, which is exactly the setting the Robbins-Monro algorithm requires. Therefore, the Robbins-Monro algorithm for solving $g(v^\pi(s)) = 0$ gives us the TD update rule:
\[\begin{equation*} \begin{split} v_{t+1}(s_t) &= v_t(s_t) - \alpha \tilde{g}(v_t(s_t)) \\ &= v_t(s_t) - \alpha \big[v_t(s_t) - (r_{t+1} + \gamma v^\pi(s_{t+1}))\big] \end{split} \end{equation*}\]Note that this assumes we already have the correct value estimates for every other state \(s \neq s_t\). If we were estimating the value function for all states, we would have \(v_{t}(s_{t+1})\) instead of \(v^\pi(s_{t+1})\) in the above equation. Nevertheless, as shown later, the algorithm still converges to the correct value function even when we estimate the value function for all states.
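To see the Robbins-Monro iteration in isolation, here is a small sketch that applies the same recipe, $x_{t+1} = x_t - \alpha_t \tilde{g}(x_t)$, to a toy root-finding problem: estimating the mean of a distribution by solving $g(x) = x - \mathbb{E}[Y] = 0$ from noisy samples. The Gaussian distribution and the $1/t$ step-size schedule are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: find the root of g(x) = x - E[Y], where Y ~ Normal(3, 1).
# Each sample y gives a noisy observation g~(x) = x - y = g(x) + (E[Y] - y),
# so the noise (E[Y] - y) has zero mean, just like eta_t above.
x = 0.0
for t in range(1, 10_001):
    y = rng.normal(loc=3.0, scale=1.0)
    g_noisy = x - y          # noisy sample of g(x)
    alpha_t = 1.0 / t        # Robbins-Monro step sizes: sum alpha_t = inf, sum alpha_t^2 < inf
    x -= alpha_t * g_noisy   # x_{t+1} = x_t - alpha_t * g~(x_t)

print(x)  # approaches E[Y] = 3
```

The TD update is this same iteration applied entry-wise to the value table, with the noisy sample coming from the observed transition $(s_t, r_{t+1}, s_{t+1})$; in practice a small constant step size $\alpha$ is often used instead of a decaying one.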
Writing the full TD update (the one that uses $v_t(s_{t+1})$) in terms of the TD target $\overline{v}_t = r_{t+1} + \gamma v_t(s_{t+1})$, we get \[\begin{equation*} v_{t+1}(s_t) = v_t(s_t) - \alpha \big[v_t(s_t) - \overline{v}_t\big] = (1 - \alpha)\, v_t(s_t) + \alpha\, \overline{v}_t. \end{equation*}\]Therefore, for $0 < \alpha < 1$, the estimate $v_{t+1}(s_t)$ is closer to the TD target $\overline{v}_t$ than $v_t(s_t)$ is.