Bandit
The multi-armed bandit is quite a simple problem, which makes it a good introduction to the basic concepts of reinforcement learning.
There are $k$ levers available for you to pull. You are given $T$ turns, and in every turn you can pull one lever. Right after pulling a lever the machine gives you money. The amount of money doesn't depend on the number of times you've pulled that lever before. Every lever $a$ has a true value $q_*(a)$, its average reward, which is hidden from you. Every reward is a sample from a normal distribution with mean $q_*(a)$ and standard deviation $\sigma$. Find the best strategy to accomplish the goal of having the biggest amount of money after your last turn!
Let's describe this problem in reinforcement learning terms!
- $T$ - number of time steps in a single episode
- $t$ - time step; at every time step you must choose the lever to pull
- $s$ - state; there is always only one state, whatever you do at any time step you have the same actions available and you can't influence the environment in any way
- $\mathcal{A}$ - action set; since there is only one state, it's equal to the set of actions available in state $s$
- $A_t$ - action taken at time $t$, the number of the chosen lever
- $R_t$ - reward for time step $t$, the money obtained from pulling the lever chosen at that step
- $G_t$ - return following time $t$, the sum of all rewards after time $t$
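Written out, the return for this one-state problem is simply the sum of the rewards still to come,

$$G_t = R_{t+1} + R_{t+2} + \dots + R_T,$$

so maximizing the money collected over the whole game means maximizing the return from the very first step.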
Now to the core of the problem: how should we choose our actions, what should $A_t$ be determined by? If we are rational beings, we probably shouldn't choose randomly... To abstract things a bit more, let us say that $A_t$ should be determined by some function that takes into account the information available to us at time $t$ and some masterminded strategy. That function will be our policy, $\pi_t(a) = \Pr(A_t = a)$; for any given action it returns the probability of choosing it at time $t$. At any time we should be more or less able to tell which of the available actions is the most viable, and our policy should be based mainly on this evaluation of every action at any given time in any given state. We call this helping hand the state-action value function, but since we have only one state it's simply the action-value function. Let's denote it by $Q_t(a)$. But this is only our own evaluation; somewhere out there is the perfect function... $q_*(a)$. By the rules of our example the perfect function doesn't depend on the time step, so we've just dropped the $t$.
So what should the value of an action be? In this problem, with only one state, every action taken is independent; it cannot influence anything beyond its immediate reward. And if our goal is to maximize the cumulated reward, our subjective value of an action should reflect the immediate reward we expect to get after choosing it, and we can estimate that expectation from the rewards the action has given us so far. Therefore we could describe our action-value function as:
$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} \quad (1.1)$$

where $\mathbb{1}_{A_i = a}$ is 1 if action $a$ was chosen at step $i$ and 0 otherwise, and $Q_t(a)$ falls back to the default initial value as long as $a$ has never been chosen.
This is called the sample-average method, because for every action we take the average of its historical rewards. If our strategy were simply to follow the rule of choosing the action with the highest value according to our action-value function, then our policy, led completely by greed and instant gratification, would look like this:
$$A_t = \underset{a}{\operatorname{argmax}}\, Q_t(a) \quad (1.2)$$
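Equation 1.1 doesn't have to be recomputed from the whole history at every step. If $Q_n$ is the estimate of an action's value after it has been chosen $n - 1$ times and $R_n$ is the reward obtained from its $n$-th selection, the same average can be maintained incrementally:

$$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$$

This incremental form is exactly the update performed by the simulation code at the end of this section.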
So what would our first action be? If we follow the definition of argmax from the extremely popular Python library numpy, the first index would be returned, so we would choose action 0. On the other hand, we could also decide that if there is more than one action with the highest value, we choose among them at random; that's the most popular approach. So we chose our first action randomly and got some reward for it; what's next? What would our next action be? That depends on the reward we just got and on the default value we assigned to every action beforehand. If the reward is higher than that default, we are going to choose this action again; otherwise we will choose randomly from all the actions excluding the one just taken.
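As a rough sketch of that approach (the helper name `greedy_with_random_tie_break` and the slice `q` of current estimates are illustrative only, not part of the simulation code listed further below):

use rand::seq::SliceRandom;
use rand::Rng;

/// Pick a greedy action over the current estimates, breaking exact ties at random.
fn greedy_with_random_tie_break(q: &[f64], rng: &mut impl Rng) -> usize {
    // The highest estimated value among all actions.
    let best = q.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    // Every action whose estimate equals that maximum.
    let candidates: Vec<usize> = q
        .iter()
        .enumerate()
        .filter(|(_, &value)| value == best)
        .map(|(index, _)| index)
        .collect();
    // Choose uniformly among the tied leaders.
    *candidates.choose(rng).unwrap()
}

When one action clearly dominates this behaves exactly like equation 1.2; when several estimates are tied, it picks among the leaders at random.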
Let's decide what the default value should be. If we set it extremely low, so low that we could never receive a lower reward, then after choosing the first action randomly we will keep choosing it until the end. That doesn't seem too promising, unless we are feeling extremely lucky today... If we set it around the median of the expected rewards of all the actions, there is some probability that we will 'test' most of the actions throughout the entire game. Now, the most interesting case: setting the default value so high that no reward from any of the actions could possibly match it. We will go from action to action, getting our 'expectations' shattered, but after $k$ steps we will have tasted every action once and will have estimates of their real values grounded in reality. We call this strategy of setting the default values of actions extremely high the strategy of optimistic initial values. Depending on the problem it can be more or less advantageous, but it's worth being aware of its existence.
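Here is a tiny self-contained sketch of why optimism forces exploration (the numbers are made up; in the simulation listed further below the same effect is obtained simply by raising the `initial_value` field of the configuration):

fn main() {
    // Three levers whose real rewards are small, but whose estimates start at an
    // optimistic 100.0. Each "disappointing" update drops the tried lever below the
    // still-optimistic untried ones, so even a purely greedy rule tries them all.
    let mut q = vec![100.0f64; 3];
    let mut n = vec![0u64; 3];
    let rewards = [0.4, -0.1, 0.7]; // pretend these are the sampled rewards
    for _ in 0..3 {
        // Greedy choice over the current estimates.
        let a = (0..q.len())
            .max_by(|&i, &j| q[i].partial_cmp(&q[j]).unwrap())
            .unwrap();
        n[a] += 1;
        let update = (rewards[a] - q[a]) / n[a] as f64;
        q[a] += update; // incremental sample average
    }
    assert!(n.iter().all(|&count| count == 1)); // every lever was tried exactly once
}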
There is a more popular way of not getting stuck on one and the same action every time: $\varepsilon$-greediness. We still prefer to choose the most valuable action according to our action-value function, but at every time step there is some probability $\varepsilon$ of choosing a random action instead.
$$A_t = \begin{cases} \underset{a}{\operatorname{argmax}}\, Q_t(a) & \text{with probability } 1 - \varepsilon \\ \text{a uniformly random action} & \text{with probability } \varepsilon \end{cases} \quad (1.3)$$
If we set $\varepsilon$ high enough with respect to the number of time steps and actions available, there is a good chance of taking every action at least once!
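A quick sketch of that rule in Rust (the helper name `epsilon_greedy` is just illustrative; the full simulation listed further below implements the same idea, merely phrased as `rng.gen_bool(1.0 - epsilon)` for the greedy branch):

use rand::Rng;

/// One epsilon-greedy decision: explore with probability `epsilon`, otherwise exploit.
fn epsilon_greedy(q: &[f64], epsilon: f64, rng: &mut impl Rng) -> usize {
    if rng.gen_bool(epsilon) {
        // Explore: any action, uniformly at random.
        rng.gen_range(0..q.len())
    } else {
        // Exploit: the action with the highest estimated value.
        q.iter()
            .enumerate()
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
            .map(|(index, _)| index)
            .unwrap()
    }
}

With $\varepsilon = 0$ it degenerates into the purely greedy rule of equation 1.2.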
Let's now tackle this problem less theoretically!
We have 10 possible actions to take at each of the 1000 time steps. Every action's true value is a sample from the normal distribution with mean 0 and standard deviation 0.5. After choosing an action we get a reward equal to a sample from the normal distribution whose mean is the true value of that action and whose standard deviation is 0.5. We must devise a strategy to achieve the goal of maximizing the cumulated reward, the return.
We are going to use equations 1.1 and 1.3 to define our strategy. Below you can see the visualization of 1000 simulations with the mean probability of taking the action with the highest true value at every time step. We will consider 3 cases: greedy with $\varepsilon = 0$, $\varepsilon$-greedy with $\varepsilon = 0.1$ and $\varepsilon$-greedy with $\varepsilon = 0.01$. We set the initial value of every action to 0.
As can just about be seen, initially the greediest option has a small advantage over the others. But once it soon gets stuck on not-certainly-optimal actions, the explorative alternatives do much better. The $\varepsilon = 0.01$ option is quite slow at recognizing the possibly more optimal actions, but by the law of large numbers, after a large enough number of steps it surely will. And when it does, it will still choose at random only 1% of the time, compared with the 10% of the 'yellow' ($\varepsilon = 0.1$) strategy. Therefore in the long run the 'green' ($\varepsilon = 0.01$) strategy will surely be the best one. The question is, how long a run are we preparing ourselves for? Some strategies may be better over shorter timeframes and others over longer ones. Things are relative.
Out of pure curiosity, let's look at what would happen to these strategies if we tweak them so that the first priority is always to choose unexplored actions. Any guesses?
We can see a weirdly straight line at the beginning. It lasts for the first 10 steps. That's a pure representation of our tweak: first we had to explore all 10 actions once, so it's only logical that we chose the optimal action 10% of the time! More interesting is the dominance of the extremely greedy strategy over the first half of the race. Surely, if we first explored all the options, the greedy strategy would be the most logical, right? But our rewards aren't so straightforward: they are not equal to the true values of the given actions, they are just samples from normal distributions for which those true values are only the means. So exploring everything once doesn't give us all the information. Therefore, after some lagging behind, the explorative strategies once more take the lead! The full code of the simulation follows.
use rand::prelude::*;
use rand::Rng;
use rand_distr::Normal;
/// Results of a single bandit run, together with the settings that produced them.
pub struct BanditRunStatistics {
    pub steps: u64,
    pub epsilon: f64,
    pub initial_value: f64,
    /// Cumulative count of optimal choices after every time step.
    pub optimal_choices: Vec<u64>,
    pub examination_standard_deviation: f64,
    pub true_values_standard_deviation: f64,
}
/// Parameters of a single bandit run.
pub struct BanditsConfiguration {
    /// Number of levers (actions) available.
    pub bandits_count: u64,
    pub steps: u64,
    /// Probability of taking a random action at any time step.
    pub epsilon: f64,
    /// Default action-value estimate assigned to every lever before the first pull.
    pub initial_value: f64,
    pub true_values_mean: f64,
    pub true_values_standard_deviation: f64,
    /// Standard deviation of the reward sampled around a lever's true value.
    pub examination_standard_deviation: f64,
    /// If true, pull every not-yet-tried lever before acting greedily.
    pub explore_all_first: bool,
    /// Optional fixed true values; when provided, overrides `bandits_count`.
    pub true_values_arg: Option<Vec<f64>>,
}
pub fn bandits(
    BanditsConfiguration {
        mut bandits_count,
        steps,
        epsilon,
        initial_value,
        true_values_mean,
        true_values_standard_deviation,
        examination_standard_deviation,
        explore_all_first,
        true_values_arg,
    }: BanditsConfiguration,
) -> BanditRunStatistics {
    // Record the run's configuration alongside its results.
    let mut statistics: BanditRunStatistics = BanditRunStatistics {
        steps,
        epsilon,
        initial_value,
        optimal_choices: Vec::new(),
        true_values_standard_deviation,
        examination_standard_deviation,
    };
    // Running count of how many times the truly optimal lever has been chosen.
    let mut optimal_choices: u64 = 0;
    let mut rng = rand::thread_rng();
    // True mean reward of every lever: either supplied by the caller or sampled
    // from a normal distribution with the configured mean and standard deviation.
    let true_values: Vec<f64> = match true_values_arg {
        Some(values) => {
            bandits_count = values.len() as u64;
            values
        }
        None => {
            let normal: Normal<f64> =
                Normal::new(true_values_mean, true_values_standard_deviation).unwrap();
            (0..bandits_count).map(|_| normal.sample(&mut rng)).collect()
        }
    };
    // Estimated value (Q) and pull count for every lever.
    let mut Q: Vec<f64> = vec![initial_value; bandits_count as usize];
    let mut checks: Vec<u64> = vec![0; bandits_count as usize];
    for _ in 0..steps {
        let chosen: u64;
        // With probability 1 - epsilon act greedily, otherwise explore at random.
        let greedy: bool = rng.gen_bool(1.0 - epsilon);
        if explore_all_first && checks.iter().any(|&x| x == 0) {
            // Optional tweak: first pull every lever that has never been pulled.
            chosen = checks
                .iter()
                .enumerate()
                .find(|(_, &x)| x == 0)
                .map(|(index, _)| index)
                .unwrap() as u64;
        } else if greedy {
            // Greedy choice: the lever with the highest estimated value.
            chosen = Q
                .iter()
                .enumerate()
                .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
                .map(|(index, _)| index)
                .unwrap() as u64;
        } else {
            // Exploration: a uniformly random lever.
            chosen = rng.gen_range(0..bandits_count);
        }
        // The reward is a sample centred on the chosen lever's true value.
        let true_value = true_values[chosen as usize];
        let normal_examination: Normal<f64> =
            Normal::new(true_value, examination_standard_deviation).unwrap();
        let reward = normal_examination.sample(&mut rng);
        // Incremental sample-average update: Q_{n+1} = Q_n + (R_n - Q_n) / n.
        let expected = Q[chosen as usize];
        checks[chosen as usize] += 1;
        let checked = checks[chosen as usize];
        Q[chosen as usize] = expected + (reward - expected) / checked as f64;
        // Track how often the truly best lever has been picked so far.
        let optimal_choice = true_values
            .iter()
            .enumerate()
            .max_by(|(_, a), (_, b)| a.partial_cmp(b).unwrap())
            .map(|(index, _)| index)
            .unwrap() as u64;
        if optimal_choice == chosen {
            optimal_choices += 1;
        }
        statistics.optimal_choices.push(optimal_choices);
    }
    statistics
}
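As a usage sketch, a single run of the $\varepsilon = 0.1$ strategy from the experiment above (10 levers, 1000 steps, both standard deviations equal to 0.5, initial values 0) could be launched like this; averaging over 1000 independent runs and the plotting itself are left out:

fn main() {
    let stats = bandits(BanditsConfiguration {
        bandits_count: 10,
        steps: 1000,
        epsilon: 0.1,                        // 0.0 for the greedy strategy, 0.01 for the slow explorer
        initial_value: 0.0,                  // raise this to try optimistic initial values
        true_values_mean: 0.0,
        true_values_standard_deviation: 0.5,
        examination_standard_deviation: 0.5,
        explore_all_first: false,            // true reproduces the "explore everything once" tweak
        true_values_arg: None,
    });
    // optimal_choices is cumulative, so the last entry divided by the number of steps
    // is the overall fraction of pulls that hit the truly best lever in this run.
    let optimal_fraction = *stats.optimal_choices.last().unwrap() as f64 / stats.steps as f64;
    println!("optimal lever chosen {:.1}% of the time", 100.0 * optimal_fraction);
}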