The room of our random experiments
by Chaitanya
Consider a sequence of timesteps \(t = 0, 1, 2, 3, \ldots\). At each timestep \(t\), the agent is in a state \(S_t\) and takes an action \(A_t\) following a policy \(\pi\). The agent receives a reward \(R_{t+1}\) and transitions to a new state \(S_{t+1}\). This can be represented as:
\[S_t \xrightarrow{A_t} R_{t+1}, S_{t+1}\]\(S_t, A_t, R_{t+1}, S_{t+1}\) are all random variables belonging to their respective spaces in the MDP. Then, starting from state \(S_t\), the state-action-reward trajectory is:
\[S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, S_{t+3} \ldots\]The return \(G_t\) is the total discounted reward from timestep \(t\) onwards:
\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]where \(\gamma \in (0, 1)\) is the discount factor. Note that the return is a random variable as it depends on the random variables \(R_{t+1}, R_{t+2}, \ldots\).
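As a quick numerical sketch (the reward sequence and \(\gamma\) below are made up for illustration), the return of a finite sampled trajectory can be accumulated backwards, which also previews the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) used later:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return of a finite reward sequence R_{t+1}, R_{t+2}, ...
    accumulated backwards via G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```

The state value function \(v_{\pi}(s)\) is the expected return from state \(s\) under policy \(\pi\):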
\[v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]\]In a nutshell, the Bellman equation is a set of linear equations that express the relationship between the values of states. First, we can see that:
\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = R_{t+1} + \gamma G_{t+1}\]Then, the state value function can be expressed as:
\[\begin{equation} \begin{split} v_{\pi}(s) & = \mathbb{E}_{\pi}[G_t | S_t = s] \\ & = \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ & = \mathbb{E}_{\pi}[R_{t+1} | S_t = s] + \gamma \mathbb{E}_{\pi}[G_{t+1} | S_t = s] \end{split} \end{equation} \label{eq:valuereturn}\]The first term on the RHS of Equation $\eqref{eq:valuereturn}$ is the expected immediate reward from state \(s\) under policy \(\pi\), and can be expressed as:
\[\begin{equation} \begin{split} \mathbb{E}_{\pi}[R_{t+1} | S_t = s] & = \sum_{a \in \mathcal{A}} \pi(a|s) \mathbb{E}[R_{t+1} | S_t = s, A_t = a] \\ & = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s, a) r \end{split} \end{equation}\]The second term on the RHS of Equation $\eqref{eq:valuereturn}$ is the expected future return. This can be expressed as:
\[\begin{equation} \begin{split} \mathbb{E}_{\pi}[G_{t+1} | S_t = s] & = \sum_{s' \in \mathcal{S}} \mathbb{E}_{\pi}[G_{t+1} | S_{t+1} = s'] p(s' | s) \\ & = \sum_{s' \in \mathcal{S}} v_{\pi}(s') p(s' | s) \\ & = \sum_{s' \in \mathcal{S}} v_{\pi}(s') \sum_{a \in \mathcal{A}} \pi(a|s) p(s' | s, a) \\ & = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} v_{\pi}(s') p(s' | s, a) \end{split} \end{equation}\]Here the first step uses the law of total expectation together with the Markov property: conditioned on \(S_{t+1} = s'\), the future return \(G_{t+1}\) is independent of \(S_t\). Substituting the above two equations back into Equation $\eqref{eq:valuereturn}$, we get the Bellman equation for the state value function:
\[\begin{equation} \begin{split} v_{\pi}(s) & = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s, a) r + \gamma \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} v_{\pi}(s') p(s' | s, a) \\ & = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \sum_{r \in \mathcal{R}} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} v_{\pi}(s') p(s' | s, a) \right) \\ & = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s', r} p(s', r | s, a) \left( r + \gamma v_{\pi}(s') \right) \end{split} \end{equation} \label{eq:bellmanvalue}\]A few things worth noting about Equation $\eqref{eq:bellmanvalue}$:
The Bellman equation is a set of linear equations, one for each state, rather than a single equation.
Solving the Bellman equation for a given policy \(\pi\) gives the state value function \(v_{\pi}(s)\); this process is known as policy evaluation.
\(p(r|s, a)\) and \(p(s' | s, a)\) represent the system model.
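To make the structure of Equation $\eqref{eq:bellmanvalue}$ concrete, here is a minimal sketch of one evaluation of its right-hand side for every state. The tabular arrays `P`, `R`, and `pi` are hypothetical stand-ins for the system model and policy: `P[s, a, s1]` holds \(p(s'|s, a)\), `R[s, a]` holds the expected reward \(\sum_r p(r|s, a) r\), and `pi[s, a]` holds \(\pi(a|s)\).

```python
import numpy as np

def bellman_rhs(v, P, R, pi, gamma=0.9):
    """Evaluate the RHS of the Bellman equation for every state:
    v(s) = sum_a pi(a|s) * (R(s,a) + gamma * sum_s1 p(s1|s,a) * v(s1))."""
    n_states, n_actions = R.shape
    v_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            v_new[s] += pi[s, a] * (R[s, a] + gamma * P[s, a] @ v)
    return v_new
```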
The Bellman equation can be expressed in matrix-vector form as:
\[\begin{equation} v_{\pi} = r^{\pi} + \gamma P^{\pi} v_{\pi} \end{equation}\]or simply:
\[\begin{equation} v = r + \gamma P v \end{equation}\]where:
\(v \in \mathbb{R}^{|\mathcal{S}|}\) is the state value vector, with \([v]_s = v_{\pi}(s)\).
\(r \in \mathbb{R}^{|\mathcal{S}|}\) is the expected immediate reward vector, with \([r]_s = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r' \in \mathcal{R}} p(r'|s, a) r'\).
\(P \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}\) is the state transition probability matrix under \(\pi\), with \([P]_{ss'} = \sum_{a \in \mathcal{A}} \pi(a|s) p(s'|s, a)\).
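As a sketch of how these quantities could be assembled in code (with the same hypothetical arrays `P[s, a, s1]`, `R[s, a]`, and `pi[s, a]` as above):

```python
import numpy as np

def policy_matrices(P, R, pi):
    """Build r and P for the matrix-vector form v = r + gamma * P v.
    P[s, a, s1] = p(s1|s,a), R[s, a] = expected reward, pi[s, a] = pi(a|s)."""
    r_pi = (pi * R).sum(axis=1)            # [r]_s    = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum("sa,sat->st", pi, P)  # [P]_ss1  = sum_a pi(a|s) p(s1|s, a)
    return r_pi, P_pi
```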
Since the Bellman equation is a linear system in \(v\), a closed-form solution can be obtained directly as:
\[v = (I - \gamma P)^{-1} r\]Some properties of the matrix \(I - \gamma P\) are (a numerical sketch of the solve follows this list):
$(I - \gamma P)$ is invertible: by the Gershgorin circle theorem, every eigenvalue of \(\gamma P\) lies in a disc centered at \(\gamma [P]_{ii}\) with radius \(\gamma \sum_{j \neq i} [P]_{ij}\); since each row of \(P\) sums to 1 and \(\gamma < 1\), every eigenvalue of \(\gamma P\) has magnitude at most \(\gamma < 1\). Hence no eigenvalue of \(I - \gamma P\) is zero, and the matrix is invertible.
\((I - \gamma P)^{-1} \ge I\): every element of \((I - \gamma P)^{-1}\) is greater than or equal to the corresponding element of the identity matrix. Since $P$ has non-negative entries, the Neumann series expansion of the inverse gives \((I - \gamma P)^{-1} = I + \gamma P + \gamma^2 P^2 + \ldots \ge I\), as every term after the first is non-negative.
For any vector $r \ge 0$, \((I - \gamma P)^{-1} r \ge r \ge 0\), which follows by applying the series expansion above to \(r\).
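A minimal numerical sketch of the closed-form solve; the two-state model below is made up for illustration:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],   # [P]_ss1 under some fixed policy
              [0.3, 0.7]])
r = np.array([1.0, 0.0])    # expected immediate rewards

# Solve (I - gamma * P) v = r directly.
v = np.linalg.solve(np.eye(2) - gamma * P, r)
print(v)
```

In practice the linear system is solved as above rather than by forming \((I - \gamma P)^{-1}\) explicitly, which is cheaper and numerically more stable; for large state spaces even this \(O(|\mathcal{S}|^3)\) solve becomes impractical, which motivates the iterative solution below.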
Consider the iterative solution to the Bellman equation:
\[v_{k+1} = r + \gamma P v_k\]This algorithm produces a sequence of values \(v_0, v_1, v_2, \ldots\) starting from an arbitrary initial guess \(v_0\), and converges to the true value function \(v_{\pi}\) as \(k \rightarrow \infty\). Proof:
We define \( \delta_k = v_{k} - v_{\pi}\) and show that \( \delta_k \rightarrow 0\) as \(k \rightarrow \infty\). Substituting \(v_{k+1} = \delta_{k+1} + v_{\pi}\) and \(v_k = \delta_k + v_{\pi}\) into the iteration:
\[\begin{equation} \begin{split} \delta_{k+1} + v_{\pi} & = r + \gamma P(\delta_k + v_{\pi}) \\ \delta_{k+1} & = \gamma P \delta_k \end{split} \end{equation}\]where the second line uses the Bellman equation \(v_{\pi} = r + \gamma P v_{\pi}\) to cancel \(v_{\pi}\) on both sides. Hence, \( \delta_{k+1} = \gamma P \delta_k = \gamma^2 P^2 \delta_{k-1} = \ldots = \gamma^{k+1} P^{k+1} \delta_0\). Since \(P^k\) is itself a transition probability matrix (every entry lies in \([0, 1]\)) and \(\gamma \in (0, 1)\), \(\gamma^k P^k \rightarrow 0\) as \(k \rightarrow \infty\). Hence, \( \delta_k \rightarrow 0\) as \(k \rightarrow \infty\).
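A sketch of the iteration on the same made-up two-state model, checking the iterates against the closed-form solution:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
r = np.array([1.0, 0.0])

v = np.zeros(2)                    # arbitrary initial guess v_0
for k in range(1000):
    v_next = r + gamma * P @ v     # v_{k+1} = r + gamma * P v_k
    if np.max(np.abs(v_next - v)) < 1e-12:
        break
    v = v_next

v_closed = np.linalg.solve(np.eye(2) - gamma * P, r)
print(k, np.max(np.abs(v - v_closed)))  # the error delta_k shrinks like gamma^k
```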
The state-action value function \(q_{\pi}(s, a)\) is the expected return from state \(s\) and action \(a\) under policy \(\pi\). Note that the policy is followed after taking the action \(a\). The state-action value function can be expressed as:
\[q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]\]Decomposing the return as \(G_t = R_{t+1} + \gamma G_{t+1}\) as before, and then expanding \(v_{\pi}(s') = \sum_{a' \in \mathcal{A}} \pi(a'|s') q_{\pi}(s', a')\) in the second step, we get:
\[\begin{equation} \begin{split} q_{\pi}(s, a) &= \sum_{r \in \mathcal{R}} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} v_{\pi}(s') p(s' | s, a) \\ &= \sum_{r \in \mathcal{R}} p(r|s, a) r + \gamma \sum_{s' \in \mathcal{S}} p(s' | s, a) \sum_{a' \in \mathcal{A}} \pi(a'|s') q_{\pi}(s', a') \end{split} \end{equation}\]This is the Bellman equation for the state-action value function. In matrix-vector form, it can be expressed as:
\[q_{\pi} = \widetilde{r} + \gamma P \Pi q_{\pi}\]Here:
\(q_{\pi}\) is the state-action value function vector under policy \(\pi\).
\(\widetilde{r}\) is the immediate reward vector; it does not depend on the policy because the action \(a\) is already specified in the state-action pair.
\(P\) is the transition probability matrix which, unlike in the state value Bellman equation, does not depend on the policy.
\(\Pi\) is the policy matrix, with \([\Pi]_{s', (s', a')} = \pi(a'|s')\) and zeros elsewhere, so that \(\Pi q_{\pi} = v_{\pi}\).
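To make both the elementwise and the matrix-vector forms concrete, here is a sketch with the same hypothetical tabular arrays as before; state-action pairs are flattened in row-major order, so the pair \((s, a)\) maps to index \(s \cdot |\mathcal{A}| + a\):

```python
import numpy as np

def q_from_v(v, P, R, gamma=0.9):
    """Elementwise form: q(s,a) = R(s,a) + gamma * sum_s1 p(s1|s,a) * v(s1),
    with P[s, a, s1] = p(s1|s, a) and R[s, a] the expected reward."""
    return R + gamma * np.einsum("sat,t->sa", P, v)

def q_matrices(P, R, pi):
    """Matrix-vector form: build r~, P, and Pi for q = r~ + gamma * P @ Pi @ q."""
    n_s, n_a, _ = P.shape
    r_tilde = R.reshape(n_s * n_a)      # r~[(s,a)]       = expected immediate reward
    P_flat = P.reshape(n_s * n_a, n_s)  # P[(s,a), s1]    = p(s1|s, a)
    Pi = np.zeros((n_s, n_s * n_a))     # Pi[s1, (s1,a1)] = pi(a1|s1), zero elsewhere
    for s in range(n_s):
        Pi[s, s * n_a:(s + 1) * n_a] = pi[s]
    return r_tilde, P_flat, Pi

# Since this Bellman equation is again linear, q can be obtained by solving
# (I - gamma * P_flat @ Pi) q = r_tilde, and v recovered as v = Pi @ q.
```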