Notes, assignments, and code for NEUROBIO 735 (Spring 2018).
1/10 – 2/8:
Wednesday, Thursday
3:00 – 4:30
DIBS conference room
For this homework, we’ll explore another tabular dataset involving animal choice behavior from this paper. In the experiment, male monkeys were given a choice between delivery of two juice rewards, one of which was paired with an image. The images were chosen from one of four categories — dominant male monkeys, subordinate male monkeys, images of female monkey sex skin, and a gray square — while the relative amount of juice available for each option varied across choices. When animals were indifferent between the juice and juice-plus-image options for a given discrepancy in juice values, that difference was taken to be the value of the paired image. At issue in the experiment is how image values differed by category and across individuals.
In this homework, we will replicate pieces of this analysis by fitting choice models to subsets of the data. In doing so, we’ll have recourse to the split-apply-combine pattern, which allows us to first define an analysis on an arbitrary subset of data and then apply this analysis across some grouping of the data set. The key advantage of this method is that it separates the analysis itself from the method used to split the data into groups, resulting in more flexible, reusable code.
The data are contained in this csv file with columns:

- `monk`: animal doing the task
- `session`: session number (unique across all animals)
- `piccat`: picture category (coded numerically; the four categories are described above)
- `dv`: difference in value between the juice-plus-image and juice-only options; positive values mean more fluid for choosing the juice-plus-image option
- `Nimg`: number of times the animal chose the juice-plus-image option
- `Ntot`: total number of times the choice was presented
- `sessdate`: date of the session in which these data were collected

Load the data into a table, converting variables to `categorical` as needed. Our goal will be to design an analysis capable of calculating the value of each image category in each session. We’ll do this by using split-apply-combine.
For many applications of the split-apply-combine method, it’s easiest to begin by designing the “apply” portion first. That is, we design and test the analysis we want to perform on an arbitrary subset of the data. To do this, we’ll often extract a representative subset of the data in order to perform testing.
Extract a subset of the data to use in developing the analysis. For instance, session 9, picture category 4.
Remove any rows in the data subset that correspond to no observations (`Ntot` = 0).
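As a sketch, the two steps above might look like this in Matlab (the csv file name here is a placeholder for wherever you saved the data):

```matlab
% Load the full dataset (file name is a placeholder).
data = readtable('monkey_choices.csv');

% Extract a representative subset: session 9, picture category 4.
subset = data(data.session == 9 & data.piccat == 4, :);

% Remove rows with no observations.
subset = subset(subset.Ntot > 0, :);
```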
The most important tests we tend to do in scientific software involve plotting data and results. Because your eye is a much better pattern detector than most programmatic tests you might write, the fastest way to assess whether your code is functioning correctly is to generate plots.
For each value difference in your data subset, calculate the proportion of trials on which the animal chose the juice-plus-image option.
Plot this proportion as a set of points.
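A minimal sketch of these two steps, assuming `subset` is the filtered data subset from the previous step:

```matlab
% Proportion of trials on which the image option was chosen, per dv.
p_obs = subset.Nimg ./ subset.Ntot;

% Plot the raw choice proportions as points.
figure;
plot(subset.dv, p_obs, 'o');
xlabel('dv (juice-plus-image minus juice-only)');
ylabel('proportion choosing juice-plus-image');
```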
We expect that as `dv` increases, animals are more likely to choose the juice-plus-image option. Our observations in this case are numbers of times each choice was presented and numbers of times the image option was selected. For such observations, the standard statistical distribution is a binomial, and the typical model is logistic regression. In logistic regression, we assume
\[
\log \frac{p}{1-p} = X \beta
\]
where \(p\) is the probability of choosing the image, \(X\) is the matrix of regressors in our model (the variables we think might affect this probability), and \(\beta\) is a vector of coefficients (weights) for these variables. In our case, we will assume the simplest possible model, in which the probability of choosing the juice-plus-image option depends only on the difference in juice between the two options:
\[
\log \frac{p}{1-p} = \beta_1 + \beta_2 dv
\]
Our measure of interest will be the value of \(dv\) at which the animal is indifferent between the juice-only and juice-plus-image options, the so-called “point of subjective equality.” At this point, we can reason that \(V_1 = V_2 + V_{\mathrm{image}} \Rightarrow V_{\mathrm{image}} = V_1 - V_2 = -dv\). That is, the image value is minus the value of \(dv\) at which \(p=\frac{1}{2}\).
Derive an expression for the image value as a function of \(\beta\).
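To check your answer: at the point of subjective equality, \(p = \frac{1}{2}\), so the log odds are zero, and

\[
0 = \beta_1 + \beta_2 \, dv^* \quad \Rightarrow \quad dv^* = -\frac{\beta_1}{\beta_2},
\]

so the image value is

\[
V_{\mathrm{image}} = -dv^* = \frac{\beta_1}{\beta_2}.
\]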
Using Matlab’s `fitglm` command, fit the logistic regression. (Hint: you may want to use the `'BinomialSize'` argument to specify the number of times each choice was presented.)
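One way this call might look, assuming the subset is stored in a table named `subset`. With `fitglm`, the response can be the count of successes, with `'BinomialSize'` supplying the number of trials behind each row:

```matlab
% Model log odds of choosing the image option as a linear function of dv.
mdl = fitglm(subset, 'Nimg ~ dv', ...
    'Distribution', 'binomial', 'BinomialSize', subset.Ntot);

% Inspect the fitted intercept and slope.
disp(mdl.Coefficients)
```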
For model checking, we want to be sure that our fitted model captures any trends visible in the data. We can do this by plotting a model fit line on the same figure as our raw data.
In Matlab, we can use the `predict` command to generate this fit line. We pass `predict` our fitted model object and a table (alternatively, a vector of inputs, if we didn’t pass a table to `fitglm`) with values at which to predict the model’s output.
Generate a table of new `dv` values. These should be numerous and close enough together that the predicted values will plot as a smooth line.
Use the `predict` command to get predictions for \(p\) at the new `dv` values. Alternatively, you can use the formula above to solve for \(p\) in terms of \(\beta\) and \(dv\).
Plot these predictions on the same figure as the original data.
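Sketching these three steps, assuming `mdl` is the fitted model and `subset` the data subset from above:

```matlab
% A dense grid of dv values so the predictions plot as a smooth curve.
dv_grid = linspace(min(subset.dv), max(subset.dv), 200)';
newdata = table(dv_grid, 'VariableNames', {'dv'});

% Predicted probability of choosing the juice-plus-image option.
p_hat = predict(mdl, newdata);

% Overlay the fit line on the raw choice proportions.
figure; hold on;
plot(subset.dv, subset.Nimg ./ subset.Ntot, 'o');
plot(dv_grid, p_hat, '-');
xlabel('dv');
ylabel('P(choose juice-plus-image)');
hold off;
```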
Now that we know the analysis we want to perform on each data subset, we should be able to follow split-apply-combine:

- We would call `splitapply`, which breaks the data into groups. Ideally, we would give this a list of grouping variables.
- `splitapply` would recombine the results of our analysis in a new table, one row for each group.

This is what happens in most other languages (R, Python, etc.). Unfortunately, because The MathWorks is racing to catch up on support for data tables, it doesn’t quite work that way in Matlab yet:

- `splitapply` requires a single numeric vector labeling the groups. If you look at the `splitapply` documentation, you can see that this is done with the `findgroups` command.
- `splitapply` doesn’t feed the analysis function a data subset in the form of a table. Rather, it feeds the function individual columns.

So here’s what you’ll need to do. Either:

- Pass the individual columns your analysis needs to `splitapply`, and use the individual columns within the function to do the calculation.
- Use `varargin` and reconstruct the data table inside the function. That would look something like this:
```matlab
function img_val = fit_model(varargin)
    data = table(varargin{:});
    data.Properties.VariableNames = {...
        ...
end
```
That is, we take the `varargin` cell array, use `{:}` to turn it into a comma-separated list, and feed the resulting list to `table`. The only downside to this is that you must then set the variable names in the resulting table for the code to work as before.
Both of these solutions are significantly limited, since they require your function to know exactly how many columns are coming and in what order (in the first case to provide an argument list; in the second to set variable names). Ideally, we would simply be able to accept a data table and reference the columns we need by name, regardless of where they are in the table or what else is included. As it is, we’re forced to implement a solution that’s more rigid and thus more brittle.
Likewise, be aware that `splitapply` does not perform the “combine” step: it will instead return a series of vectors, one for each output of the analysis function you provide. Thus it will be up to you to recombine these into a table, if needed.
Use `findgroups` and `splitapply` along with your analysis function to calculate the value of each image category in each session.
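Putting the pieces together, here is one possible sketch of the full pipeline using the column-based approach (the table name `data` is an assumption, and this `fit_model` is just one version of the analysis function, saved in its own file `fit_model.m`):

```matlab
% fit_model.m: analysis for one session-by-category group.
% Returns the estimated image value, beta1 / beta2.
function img_val = fit_model(dv, Nimg, Ntot)
    keep = Ntot > 0;   % drop rows with no observations
    mdl = fitglm(dv(keep), Nimg(keep), ...
        'Distribution', 'binomial', 'BinomialSize', Ntot(keep));
    b = mdl.Coefficients.Estimate;
    img_val = b(1) / b(2);
end
```

The split, apply, and (by-hand) combine steps then look like:

```matlab
% Split: one numeric group label per (session, piccat) combination.
[g, session, piccat] = findgroups(data.session, data.piccat);

% Apply: splitapply hands each group's columns to fit_model.
img_val = splitapply(@fit_model, data.dv, data.Nimg, data.Ntot, g);

% Combine: rebuild a results table by hand, one row per group.
results = table(session, piccat, img_val);
```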