Quantitative Neurobiology

Notes, assignments, and code for NEUROBIO 735 (Spring 2024).

Class details:

1/11 – 4/17:
Tuesday, Thursday
3:00 – 4:30
301 Bryan Research Building

Syllabus

Exercises

Solutions

Book

Homework: split-apply-combine

For this homework, we’ll explore another tabular dataset involving animal choice behavior from this paper. In the experiment, male monkeys were given a choice between delivery of two juice rewards, one of which was paired with an image. The images were chosen from one of four categories — dominant male monkeys, subordinate male monkeys, images of female monkey sex skin, and a gray square — while the relative amount of juice available for each option varied across choices. When animals were indifferent between the juice and juice-plus-image options for a given discrepancy in juice values, that difference was taken to be the value of the paired image. At issue in the experiment is how image values differed by category and across individuals.

In this homework, we will replicate pieces of this analysis by fitting choice models to subsets of the data. In doing so, we’ll have recourse to the split-apply-combine pattern, which allows us to first define an analysis on an arbitrary subset of data and then apply this analysis across some grouping of the data set. The key advantage of this method is that it separates the analysis itself from the method used to split the data into groups, resulting in more flexible, reusable code.

Loading the data

The data are contained in this csv file.

  1. Load the data
  2. Transform any columns into categorical, as needed.
  3. Tidy the data, as needed.
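A minimal sketch of these three steps. The column names used here (session, category, dv, Nimg, Ntot) are assumptions about the file's schema, and the inline data below merely stands in for the real csv:

```python
from io import StringIO

import pandas as pd

# Hypothetical excerpt standing in for the real csv; in practice you would
# call pd.read_csv("choices.csv") on the downloaded file instead.
csv_text = """session,category,dv,Nimg,Ntot
9,4,-2,1,10
9,4,-1,3,10
9,4,0,5,10
9,4,1,8,10
9,4,2,0,0
"""
df = pd.read_csv(StringIO(csv_text))

# Grouping variables are labels, not quantities, so make them categorical
df["session"] = df["session"].astype("category")
df["category"] = df["category"].astype("category")
print(df.dtypes)
```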

Analysis: image value

Our goal will be to design an analysis capable of calculating the value of each image category in each session. We’ll do this using split-apply-combine.

Subsetting the data

For many applications of the split-apply-combine method, it’s easiest to begin by designing the “apply” portion first. That is, we design and test the analysis we want to perform on an arbitrary subset of the data. To do this, we’ll often extract a representative subset of the data in order to perform testing.

  1. Extract a subset of the data to use in developing the analysis. For instance, session 9, picture category 4.

  2. Remove any rows in the data subset that correspond to no observations (Ntot = 0).
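The two steps above might look like the following sketch, again assuming hypothetical column names (session, category, dv, Nimg, Ntot) and a toy stand-in for the real table:

```python
import pandas as pd

# Toy stand-in for the loaded dataset
df = pd.DataFrame({
    "session": [9, 9, 9, 8],
    "category": [4, 4, 4, 4],
    "dv": [-1, 0, 1, 0],
    "Nimg": [2, 5, 9, 3],
    "Ntot": [10, 10, 0, 10],
})

# Prototype the "apply" step on one subset: session 9, picture category 4
subset = df[(df["session"] == 9) & (df["category"] == 4)]

# Drop rows corresponding to no observations (Ntot = 0)
subset = subset[subset["Ntot"] > 0]
print(subset)
```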

Examining the data

The most important tests we tend to do in scientific software involve plotting data and results. Because your eye is a much better pattern detector than most programmatic tests you might write, the fastest way to assess whether your code is functioning correctly is to generate plots.

  1. For each value difference in your data subset, calculate the proportion of trials on which the animal chose the juice-plus-image option.

  2. Plot this proportion as a set of points.
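One way to sketch this check, assuming the same hypothetical columns (dv, Nimg for image choices, Ntot for total trials):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Toy subset standing in for one (session, category) slice of the data
subset = pd.DataFrame({
    "dv": [-2, -1, 0, 1, 2],
    "Nimg": [1, 3, 5, 8, 9],
    "Ntot": [10, 10, 10, 10, 10],
})

# Proportion of trials on which the juice-plus-image option was chosen
subset["p_img"] = subset["Nimg"] / subset["Ntot"]

fig, ax = plt.subplots()
ax.scatter(subset["dv"], subset["p_img"])
ax.set_xlabel("dv (juice difference)")
ax.set_ylabel("P(choose image option)")
```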

Fitting the model

We expect that as dv increases, animals are more likely to choose the juice-plus-image option. Our observations in this case are the number of times each choice was presented and the number of times the image option was selected. For such observations, the standard statistical distribution is the binomial, and the typical model is logistic regression. In logistic regression, we assume

\[ \log \frac{p}{1-p} = X \beta \]

where \(p\) is the probability of choosing the image, \(X\) is the matrix of regressors in our model (the variables we think might affect this probability), and \(\beta\) is a vector of coefficients (weights) for these variables.

In our case, we will assume the simplest possible model, in which the probability of choosing the juice-plus-image option depends only on the difference in juice between the two options:

\[ \log \frac{p}{1-p} = \beta_1 + \beta_2 \, dv \]

Our measure of interest will be the value of \(dv\) at which the animal is indifferent between the juice-only and juice-plus-image options, the so-called “point of subjective equality.” At this point, we can reason that \(V_1 = V_2 + V_{\mathrm{image}} \Rightarrow V_{\mathrm{image}} = V_1 - V_2 = -dv\). That is, the image value is minus the value of \(dv\) at which \(p = \frac{1}{2}\).

  1. Derive an expression for the image value as a function of \(\beta\).

  2. Using statsmodels’ GLM class, fit the logistic regression. (Note: for Binomial-family regressions, including the logistic, the dependent variable is given as two columns: the number of successes and the number of failures.)
  3. Very important: statsmodels does not add a constant to your regression by default! Be sure to include one with sm.add_constant().

Checking the fit

For model checking, we want to be sure that our fitted model captures any trends visible in the data. We can do this by plotting a model fit line on the same figure as our raw data.

Given a GLMResults object, we can use the get_prediction method to generate this fit line. We pass get_prediction a NumPy array of new values at which to predict the model’s output.

  1. Generate a table of new dv values. These should be numerous and close enough together that the predicted values will plot as a smooth line.

  2. Use the get_prediction command to get predictions for \(p\) at the new dv values. Alternately, you can use the formula above to solve for \(p\) in terms of \(\beta\) and \(dv\).

  3. Plot these predictions on the same figure as the original data.

Putting it all together

Now that we know the analysis we want to perform on each data subset, we should be able to follow split-apply-combine:

  1. Wrap our existing analysis in a function that takes a data subset in the form of a table.
  2. Call groupby, which breaks the data into groups. Ideally, we would give this a list of grouping variables using the by keyword argument.
  3. Call apply on the resulting GroupBy object. apply takes as its first argument a function used to evaluate each chunk of the DataFrame.
  1. Write a function following the guidelines above that fits a logistic regression to a given data subset.
    • Note that not all sessions produce a well-behaved fit, so statsmodels may complain that some fits are ill-behaved or fail to return a result. This is expected.
    • Because not all sessions produce a well-behaved fit, the results may contain missing values. To be able to identify which result belongs to which data subset, your function should therefore return not only the value of each image category, but also the session and category to which the data belong. Remember that this information is passed into the function as part of the data subset.
  2. Use groupby and apply along with your analysis function to calculate the value of each image category in each session.
  3. Remove any results with \(\vert V_{\mathrm{image}} \vert > 1000\) and produce a boxplot of the remaining image values broken out by category.