
Reinforcement Learning

Reinforcement Learning is used to solve interactive problems, where the data observed up to time \(t\) is used to decide which action to take at time \(t + 1\).

  • Used in Artificial Intelligence when training machines to perform various tasks
    • Machines learn through trial and error
      • Desired outcomes provide the AI with a reward
      • Undesired outcomes result in a punishment

The Multi Armed Bandit Problem

  • There are \(d\) options to achieve a goal
  • Each time an option is applied constitutes a round
  • At each round \(n\), one of the possible \(d\) options is applied
  • For each round \(n\),
    • if the application of option \(i\) results in the goal being achieved, a reward (1) is given
    • if the application of option \(i\) does not result in the goal being achieved, no reward (0) is given
  • The objective of the algorithm is to maximize the total rewards over many rounds
  • The Dilemma: Exploration vs Exploitation
    • If we continue to explore, we may find a better option than the one selected so far
      • However, we may have to deal with more sub-optimal options in the process while exploring for a longer period
    • If we exploit the best option selected thus far, there is a chance that we miss the truly optimal option
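The exploration/exploitation trade-off can be made concrete with a small simulation. The sketch below uses an epsilon-greedy strategy (not one of the algorithms covered in these notes, just a simple baseline for illustration); the success probabilities are made-up values for the hidden arms.

```python
import random

random.seed(1)
probs = [0.2, 0.5, 0.8]   # hidden success probability of each option (illustrative)
counts = [0] * 3          # times each option was selected
rewards = [0] * 3         # total reward of each option
epsilon = 0.1             # fraction of rounds spent exploring

for n in range(1000):
    if random.random() < epsilon or 0 in counts:
        i = random.randrange(3)  # explore: pick a random option
    else:
        # exploit: pick the option with the best observed average reward
        i = max(range(3), key=lambda j: rewards[j] / counts[j])
    reward = 1 if random.random() < probs[i] else 0
    counts[i] += 1
    rewards[i] += reward

print(counts)  # the best option is selected far more often than the others
```

Setting `epsilon` higher means more exploration (more sub-optimal pulls, but lower risk of locking onto the wrong option); lower means more exploitation.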

Upper Confidence Bound

  • Intuition:
    • With \(d\) options available, it is uncertain which one will lead to success
    • Try to predict the options that will lead to the goal being achieved in most cases
  • It is a deterministic algorithm

UCB Algorithm Steps

  • At the end of each round, the following numbers are considered for each of the options
    • the number of times an option \(i\) was selected (\(N_i(n)\))
    • the sum of the rewards for each option, i.e. the number of times the option selection resulted in a reward (\(R_i(n)\))
  • The above numbers are used to calculate
    • the average reward value (observed average) for each option up to that round, \(\overline r_i(n) = \frac{R_i(n)}{N_i(n)}\)
      • By the law of large numbers, it converges to the expected value in the long run
    • The confidence interval \([\overline r_i(n) - \Delta_i(n), \overline r_i(n) + \Delta_i(n)]\) at round \(n\), where \(\Delta_i(n)\) is the confidence width, commonly taken as \(\Delta_i(n) = \sqrt{\frac{3}{2}\frac{\log n}{N_i(n)}}\)
  • The algorithm selects the option with the highest upper bound (\(\overline r_i(n) + \Delta_i(n)\)) of the confidence interval
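The steps above can be sketched as follows. The confidence width uses the \(\sqrt{\frac{3}{2}\frac{\log n}{N_i(n)}}\) form; the hidden success probabilities are illustrative values.

```python
import math
import random

def ucb_select(N, R, n):
    """Select the option with the highest upper confidence bound.
    N[i]: times option i was selected; R[i]: total reward of option i;
    n: current round (1-indexed)."""
    best_i, best_ucb = 0, float("-inf")
    for i in range(len(N)):
        if N[i] == 0:
            return i  # try every option at least once
        avg = R[i] / N[i]                             # observed average reward
        delta = math.sqrt(1.5 * math.log(n) / N[i])   # confidence width
        if avg + delta > best_ucb:
            best_i, best_ucb = i, avg + delta
    return best_i

# Simulate d = 3 options with hidden success probabilities (illustrative)
random.seed(0)
probs = [0.1, 0.3, 0.7]
N = [0] * len(probs)
R = [0] * len(probs)
for n in range(1, 1001):
    i = ucb_select(N, R, n)
    reward = 1 if random.random() < probs[i] else 0
    N[i] += 1
    R[i] += reward
print(N)  # the best option (index 2) should dominate the selection counts
```

Note the algorithm is deterministic given the observed history: the only randomness comes from the rewards themselves.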

Thompson Sampling Algorithm

  • It is a probabilistic algorithm
  • Has more allowance for exploring less/unexplored options while exploiting the proven options
  • Uses the concept of Beta Distribution
  • Intuition
    • Start with a Beta distribution of 0 wins and 0 losses for each option
    • In the first round, try each option and update its Beta distribution based on the result (win or loss)
    • In subsequent rounds, take a draw from each option's Beta distribution, select the option with the highest draw, and update its distribution based on the result

Thompson Sampling Algorithm Steps

  • At the end of each round, the following numbers are considered for each of the options
    • the number of times selection of the option \(i\) resulted in a reward (\(N_i^1(n)\))
    • the number of times selection of the option \(i\) did not result in a reward (\(N_i^0(n)\))
  • For each option, we take a random draw \(\theta_i(n)\) from the Beta distribution \(\mathrm{Beta}(N_i^1(n) + 1, N_i^0(n) + 1)\)
  • The algorithm selects the option with the highest \(\theta_i(n)\)