Fitting simulated models#
In our last class, we looked at fitting models based on a known probability distribution. This let us write down a likelihood function that we could calculate and subsequently optimize to find the model’s parameters. But what if the probability distribution changes from trial to trial? What about models where there’s no simple formula we can write down?
Thankfully, we can still make progress in cases like this, because all that is required is that we can compute the likelihood algorithmically, not that it have a simple form we can write down. For this exercise, we’ll examine a very simple example of this by fitting a reinforcement learning model to data from a two-alternative forced choice (2AFC) task.
The task#
The task is simple: subjects are repeatedly presented with a pair of choices. One option (A) delivers a reward of 2 units 70% of the time (and 0 otherwise), while the other option (B) delivers a reward of 4 units 40% of the time (and 0 otherwise). (Quick: which is better?) Subjects learn by trial and error about the values of each option and attempt to maximize reward.
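As a quick illustration of this reward structure, here is a minimal sketch of how one trial of such a task could be simulated. This is not the code used to generate the class data; the names `reward_size`, `reward_prob`, and `play` are made up for this example.

```python
import numpy as np

# Hypothetical simulation of one trial of the task described above
# (illustrative only; not the code used to generate the data set).
rng = np.random.default_rng(0)

reward_size = {'A': 2, 'B': 4}      # reward (in units) when delivered
reward_prob = {'A': 0.7, 'B': 0.4}  # probability that reward is delivered

def play(option):
    """Return the reward obtained by choosing 'A' or 'B' on one trial."""
    rewarded = rng.random() < reward_prob[option]
    return reward_size[option] if rewarded else 0

print(play('A'), play('B'))

# expected reward per trial: A gives 2 * 0.7 = 1.4, B gives 4 * 0.4 = 1.6
print({opt: reward_size[opt] * reward_prob[opt] for opt in ('A', 'B')})
```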
The data set for this session consists of two vectors, one entry per trial, of a single subject’s choices and outcomes.
Exercise
Load the data. Which option does the subject choose more frequently over time? You can download the data using
!gdown 12h6xHzDIATnbf2pqApKNA9pBCAqdkHnq
Solution:#
import numpy as np
import scipy.io as sci

# Load data
data = sci.loadmat('choice.mat')

# check choice frequency
choice = data['choice']
outcome = data['outcome']

n1 = np.sum(choice == 1)
n2 = np.sum(choice == 2)
print(f"choice 1: {n1}, choice 2: {n2}")
choice 1: 186, choice 2: 314
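Since the exercise asks how choices evolve over time, an optional follow-up (not part of the provided solution) is a moving average of the choice sequence. The sketch below reuses the `choice` array loaded above, assumes the coding 1 = A and 2 = B used in the solution code later on this page, assumes `matplotlib` is available, and uses an arbitrary 25-trial window.

```python
import numpy as np
import matplotlib.pyplot as plt

window = 25  # arbitrary smoothing window (in trials)

# 1 wherever option B (coded as 2) was chosen, 0 otherwise
chose_B = (choice.ravel() == 2).astype(float)

# running fraction of B choices over a sliding window
running_pB = np.convolve(chose_B, np.ones(window) / window, mode='valid')

plt.plot(running_pB)
plt.xlabel('trial')
plt.ylabel(f'fraction of B choices (window = {window})')
plt.show()
```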
The model#
Our goal is to fit a simple reinforcement learning algorithm to these data. This model consists of two pieces:
A value model that describes the benefit of choosing each option, along with a method of updating these values based on outcomes.
A choice model or policy that uses an estimate of the value of each option to make decisions.
The value model will assign each option a value \(Q\), its best estimate of the average reward gained by choosing that option. If, on a particular trial, choosing option A results in reward \(R\), then we update the value of \(Q_A\) according to

\[
Q_A \leftarrow Q_A + \alpha\,(R - Q_A),
\]

with \(\alpha\in [0, 1]\) the learning rate and \(R - Q_A\) the difference between observed and expected outcomes, known as the reward prediction error. Option B is updated in the same way when it is chosen.
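For concreteness, here is a minimal sketch of this delta-rule update (the function name `update_Q` is just illustrative):

```python
def update_Q(Q_old, reward, alpha):
    """Move the value estimate toward the observed reward by a fraction alpha."""
    rpe = reward - Q_old        # reward prediction error
    return Q_old + alpha * rpe

# e.g. starting from Q = 0 with alpha = 0.1, a reward of 2 moves Q to 0.2
print(update_Q(0.0, 2.0, 0.1))
```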
When making choices, we assume the agent uses a softmax (in this case logistic) function that transforms the difference in value between options A and B into a choice probability. That is,

\[
p(B) = \frac{1}{1 + e^{-\beta\,(Q_B - Q_A)}},
\]

and \(p(A) = 1 - p(B)\). The parameter \(\beta\) captures the variability in choice: large \(\beta\) corresponds to an agent that almost always chooses the option with the higher \(Q\), while \(\beta = 0\) corresponds to an agent that chooses randomly.
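A minimal sketch of this choice rule (again with an illustrative name, `p_choose_B`):

```python
import numpy as np

def p_choose_B(Q_A, Q_B, beta):
    """Logistic (two-option softmax) probability of choosing option B."""
    return 1.0 / (1.0 + np.exp(-beta * (Q_B - Q_A)))

# beta = 0 gives random choice (0.5); larger beta favors the higher-valued option
print(p_choose_B(1.4, 1.6, 0.0), p_choose_B(1.4, 1.6, 5.0))
```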
Our goal will be to use the observed choices and outcomes to find the most likely values of the agent’s parameters \(\alpha\) and \(\beta\). To do so, we will maximize the likelihood of the observed choices by adjusting the parameter values.
Exercise
Write a function that takes as inputs a vector of parameters, a vector of choices, and a vector of outcomes and returns the log likelihood of the choices and outcomes given the parameters. You will want to break this down into steps:
The strategy is to loop over trials. For each trial you want to:
Calculate the probabilities of choosing A or B given \(Q_A\) and \(Q_B\).
Use these probabilities to compute the log likelihood of the observed choice. (Hint: if one has two mutually exclusive options, each chosen with a given probability, what’s the distribution? What form does the likelihood take?)
Calculate the reward prediction error (RPE).
Use the RPE to update the \(Q\)-values for the next trial.
The result will be a vector of log likelihoods, one per trial. If the trials are all independent (given the \(Q\)s), then there is a straightforward way of combining all these log likelihoods into a single value to be returned by your function.
Write a function to be optimized by SciPy. Two important notes:
Your function to be optimized can only take a single input, the vector of parameters.
SciPy only finds the minimum of functions, whereas we’d like to maximize the log likelihood.
Optimize your function to find the maximum likelihood values of \(\alpha\) and \(\beta\). Since we need \(\alpha \in [0, 1]\) and \(\beta \ge 0\), you’ll need to give the optimizer constraints (using the `bounds=` keyword argument to `minimize`). If a particular parameter is only bounded on one side (e.g., below) or unbounded, you can use `(0, None)` or `(None, None)`, respectively.
Solution:#
def LL_RL(beta, choice, outcome):
    """
    Given model parameters in the vector beta, calculate the per-trial log
    likelihoods of the observed choices given the outcomes.
    """
    Ntrials = choice.shape[0]
    alpha = beta[1]  # learning rate
    gamma = beta[0]  # softmax inverse temperature (called beta in the text)

    # allocate empty arrays for Q values and log likelihoods
    Q = np.empty((Ntrials, 2))
    LL = np.empty((Ntrials, 1))

    # initialize action values to 0
    Q[0, :] = 0

    # loop over trials
    for ii in range(Ntrials):
        QA = Q[ii, 0]
        QB = Q[ii, 1]

        # probability of choosing B
        pB = 1 / (1 + np.exp(-gamma * (QB - QA)))

        # calculate log likelihood based on observed choice
        if choice[ii, 0] == 1:
            LL[ii] = np.log(1 - pB)
        else:
            LL[ii] = np.log(pB)

        # update the action value of the chosen option according to the
        # learning rule; the unchosen option's value carries over unchanged
        this_Q = Q[ii, choice[ii, 0] - 1]
        RPE = outcome[ii, 0] - this_Q
        new_Q = this_Q + alpha * RPE

        if ii < Ntrials - 1:
            Q[ii + 1, :] = Q[ii, :]
            Q[ii + 1, choice[ii, 0] - 1] = new_Q

    return LL
# function we want to **minimize**
# when trials are independent, the total log likelihood is
# just the sum of the log likelihoods for each trial
def cost(beta, choice, outcome):
    return -np.sum(LL_RL(beta, choice, outcome))
from scipy.optimize import minimize

# fit the model; the first parameter is the softmax beta (bounded away from 0),
# the second is the learning rate alpha in [0, 1]
bnds = ((0.001, np.inf), (0, 1))
res = minimize(cost, x0=[1, 0.5], args=(choice, outcome), bounds=bnds)
print(f"params = {res.x}")
params = [1.13388567 0.14891938]
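As an optional sanity check (not part of the exercise), one can run the value model forward with the fitted parameters and compare its average predicted probability of choosing B with the subject’s observed choice frequency. The helper `predicted_pB` below is introduced here for illustration and reuses `choice`, `outcome`, and `res` from the cells above.

```python
import numpy as np

def predicted_pB(params, choice, outcome):
    """Forward-run the Q-learning model and return p(choose B) on each trial."""
    gamma, alpha = params            # softmax beta and learning rate, as in LL_RL
    Ntrials = choice.shape[0]
    Q = np.zeros(2)
    pB = np.empty(Ntrials)
    for ii in range(Ntrials):
        pB[ii] = 1 / (1 + np.exp(-gamma * (Q[1] - Q[0])))
        c = choice[ii, 0] - 1        # 0 = option A, 1 = option B
        Q[c] += alpha * (outcome[ii, 0] - Q[c])
    return pB

pB_hat = predicted_pB(res.x, choice, outcome)
print(f"mean predicted p(B): {pB_hat.mean():.3f}, "
      f"observed fraction of B choices: {np.mean(choice == 2):.3f}")
```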