Reinforcement Learning
Most of the material here comes from Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto. I illustrate it with Python-ish pseudocode.
Bellman equations
The optimal value of a state is obtained by taking the best action: it is the sum, over all possible transitions, of the immediate reward plus the discounted optimal value of the state you land on, each term weighted by the probability of that transition occurring.
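In the notation of Sutton and Barto, this is the Bellman optimality equation for the state-value function:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right]$$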
$\gamma$ is the discount factor, which shrinks the weight given to rewards collected in future states.
The optimal value of an action is the sum, over all possible transitions, of the immediate reward plus the discounted value of the best action available from the state you land on, each term weighted by the probability of that transition occurring when taking the action.
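In the same notation, the Bellman optimality equation for the action-value function:

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]$$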
Dynamic programming (deterministic policy)
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
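The pseudocode below assumes the model is stored as a transition table next[s][a] of (probability, next state, reward, terminal) tuples, together with states, actions, nb_states and an argmax helper. None of these come from the book; here is a minimal sketch of that layout with a hypothetical two-state MDP:

nb_states = 2
states = range(nb_states)
actions = [[0, 1], [0, 1]]  # actions[s] = actions available in state s

# next[s][a] = list of (probability, next_state, reward, terminal) tuples
# (the name shadows the next() builtin, but matches the pseudocode below)
next = [
    [  # state 0
        [(1.0, 0, 0.0, False)],                       # action 0: stay in state 0
        [(0.9, 1, 1.0, True), (0.1, 0, 0.0, False)],  # action 1: usually reach the terminal state
    ],
    [  # state 1: terminal, self-loops with no reward
        [(1.0, 1, 0.0, True)],
        [(1.0, 1, 0.0, True)],
    ],
]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])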
Policy iteration
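Alternate policy evaluation and policy improvement until the policy stops changing; the fixed point is an optimal policy.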
def policy_iteration():
    old_pi = None
    pi = [0] * nb_states  # start from an arbitrary deterministic policy
    while pi != old_pi:   # stop once the policy is stable
        V = policy_evaluation(pi)
        old_pi = pi.copy()
        policy_improvement(pi, V)  # updates pi in place
    return pi
Policy evaluation
Iteratively update the value of every state under the given policy, using the Bellman equation for $v_\pi$, until the values converge.
def policy_evaluation(pi, gamma=0.9, eps=1e-12):
    V = [0] * nb_states
    max_delta = float("inf")
    while max_delta > eps:  # sweep until the largest update is negligible
        max_delta = 0
        old_V = V.copy()
        for s in states:
            V[s] = 0
            a = pi[s]  # deterministic policy: a single action per state
            for proba, s_next, reward, terminal in next[s][a]:
                # expected immediate reward, plus the discounted value of the
                # next state unless the transition ends the episode
                V[s] += proba * (reward + (0 if terminal else gamma * old_V[s_next]))
            max_delta = max(max_delta, abs(V[s] - old_V[s]))
    return V
Policy improvement
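Make the policy greedy with respect to the current value function: compute the action values $Q$ from $V$ and pick the best action in every state.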
def policy_improvement(pi, V, gamma=0.9):
    Q = [[0] * len(actions[s]) for s in states]
    for s in states:
        for a in actions[s]:
            for proba, s_next, reward, terminal in next[s][a]:
                # expected return of taking action a in state s, then following the policy valued by V
                Q[s][a] += proba * (reward + (0 if terminal else gamma * V[s_next]))
        pi[s] = argmax(Q[s])  # greedy policy, updated in place
    return Q
Value iteration
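Value iteration folds evaluation and improvement into a single loop: each sweep backs up every state with the Bellman optimality equation, and the greedy policy is read off at the end.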
def value_iteration(gamma=0.9, eps=1e-12):
    V = [0] * nb_states
    max_delta = float("inf")
    while max_delta > eps:  # sweep until the largest update is negligible
        max_delta = 0
        Q = [[0] * len(actions[s]) for s in states]
        for s in states:
            for a in actions[s]:
                for proba, s_next, reward, terminal in next[s][a]:
                    # Bellman optimality backup using the current value estimates
                    Q[s][a] += proba * (reward + (0 if terminal else gamma * V[s_next]))
            max_delta = max(max_delta, abs(V[s] - max(Q[s])))
        for s in states:
            V[s] = max(Q[s])
    # extract the greedy policy from the final action values
    pi = [0] * nb_states
    for s in states:
        pi[s] = argmax(Q[s])
    return pi, V
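With the toy model sketched above, both approaches can be run and compared; the exact numbers depend on that hypothetical MDP:

pi, V = value_iteration()
print(pi, V)  # greedy policy and its state values

pi2 = policy_iteration()
print(pi2, policy_evaluation(pi2))  # should agree with value iteration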