Reinforcement Learning

Most of the material here comes from Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto. I illustrated it with Python-ish pseudocode.

Bellman equations

The optimal value of a state is obtained by taking the best action: it is the sum, over all possible transitions, of the immediate reward plus the discounted optimal value of the state you land in, each term weighted by the probability of that transition occurring.

$$ v_*(s) = \max_a \sum_{s', r} p(s', r |s, a)[r + \gamma v_*(s')] $$

$\gamma$ is the discount factor, which down-weights rewards received in future states.
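
As a small sketch of this backup, assuming the model is available as a hypothetical model[s][a] list of (probability, next_state, reward, terminal) tuples (the same shape the pseudocode further down uses):

def optimal_state_backup(s, V, model, actions, gamma=0.9):
    # v*(s): take the best action, weighting each transition by its probability
    returns = []
    for a in actions[s]:
        expected = 0
        for proba, s_next, reward, terminal in model[s][a]:
            expected += proba * (reward + (0 if terminal else gamma * V[s_next]))
        returns.append(expected)
    return max(returns)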

The optimal value of an action is the sum, over all possible transitions, of the immediate reward plus the discounted value of the best action available from the landing state, each term weighted by the probability of that transition occurring when taking the action.

$$ q_*(s, a) = \sum_{s', r} p(s', r |s, a)[r + \gamma \max_{a'} q_*(s', a')] $$
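
A matching sketch for the action-value backup, under the same hypothetical model, maximises over the actions available in the landing state:

def optimal_action_backup(s, a, Q, model, actions, gamma=0.9):
    # q*(s, a): expected reward plus the discounted value of the best next action
    expected = 0
    for proba, s_next, reward, terminal in model[s][a]:
        if terminal:
            expected += proba * reward
        else:
            expected += proba * (reward + gamma * max(Q[s_next][a_next] for a_next in actions[s_next]))
    return expected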

Dynamic programming (deterministic policy)

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
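
The pseudocode in this section assumes a few globals describing a tabular MDP: nb_states, states, actions[s] (the actions available in state s), next[s][a] (a list of (probability, next_state, reward, terminal) tuples), and an argmax helper. These names are my own conventions, not the book's. A minimal, made-up two-state example of that layout:

nb_states = 2
states = range(nb_states)
actions = [[0, 1], [0]]                # actions available in each state
next = [                               # next[s][a] -> [(proba, s_next, reward, terminal), ...]
    [[(1.0, 1, 0.0, False)],           # state 0, action 0: move to the terminal state
     [(1.0, 0, 1.0, False)]],          # state 0, action 1: stay and collect a reward of 1
    [[(1.0, 1, 0.0, True)]],           # state 1, action 0: absorbing terminal transition
]                                      # note: this shadows Python's built-in next()

def argmax(values):
    # index of the largest entry (helper used by the pseudocode below)
    return max(range(len(values)), key=values.__getitem__)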

Policy iteration

def policy_iteration():
    old_pi = None
    pi = [0]*nb_states                  # start from an arbitrary deterministic policy
    while pi != old_pi:                 # stop once the policy is stable
        V = policy_evaluation(pi)       # evaluate the current policy
        old_pi = pi.copy()
        _ = policy_improvement(pi, V)   # make pi greedy with respect to V (in place)
    return pi

Policy evaluation

Iteratively update the value function of the current policy using the Bellman expectation equation, until the largest update falls below a threshold:

$$ v_{k+1}(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r |s, a)[r + \gamma v_k(s')] $$
def policy_evaluation(pi, gamma=0.9, eps=10e-12):
    V = [0]*nb_states
    max_delta = float("inf")

    while max_delta > eps:
        max_delta = 0
        old_V = V.copy()
        for s in states:
            # deterministic policy: pi(a|s) is 1 for a = pi[s] and 0 otherwise
            a = pi[s]
            V[s] = 0
            for proba, s_next, reward, terminal in next[s][a]:
                V[s] += proba * (reward + (0 if terminal else gamma * old_V[s_next]))
            max_delta = max(max_delta, abs(V[s] - old_V[s]))
    return V

Policy improvement

$$ \pi'(s) = \arg\max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_{\pi}(s')] $$
def policy_improvement(pi, V, gamma=0.9):
    Q = [[0]*len(actions[s]) for s in states]

    for s in states:
        for a in actions[s]:
            for proba, s_next, reward, terminal in next[s][a]:
                Q[s][a] += proba * (reward + (0 if terminal else gamma * V[s_next]))

        # act greedily: pick the action with the highest expected return
        pi[s] = argmax(Q[s])
    return Q

Value iteration

$$ v_{k+1}(s) = \max_a\sum_{s', r} p(s', r |s, a)[r + \gamma v_k(s')] $$
def value_iteration(gamma=0.9, eps=10e-12):
    V = [0]*nb_states
    max_delta = float("inf")

    while max_delta > eps:
        Q = [[0]*len(actions[s]) for s in states]
        max_delta = 0
        old_V = V.copy()
        for s in states:
            for a in actions[s]:
                for proba, s_next, reward, terminal in next[s][a]:
                    Q[s][a] += proba * (reward + (0 if terminal else gamma * old_V[s_next]))
            V[s] = max(Q[s])                            # back up the best action value
            max_delta = max(max_delta, abs(V[s] - old_V[s]))

    # extract the greedy policy from the final action values
    pi = [0]*nb_states
    for s in states:
        pi[s] = argmax(Q[s])

    return pi, V
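
A quick usage sketch on the toy two-state model above (all names assumed from that sketch): both approaches should end up with the same greedy policy.

pi_vi, V_vi = value_iteration()
pi_pi = policy_iteration()
assert pi_pi == pi_vi          # both pick action 1 in state 0 on the toy MDP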

Bandits

Monte Carlo