Analyzing table data#

Last time, we practiced loading, recoding, and tidying an example dataset. We also worked on a very limited example of exploratory analysis, constructing summary statistics and plots. Today, we'll push that analysis further, using Pandas to filter our data and ask whether some of our variables have statistically detectable effects on the outcomes we care about.

From last time: loading and merging data#

To begin, follow the same approach as last time to download the data, load it, and combine it (a rough sketch of this step follows the list below). This time, you will not need to add the cumulative reward or cumulative trial variables. Instead, our goal will be to create a data frame that will allow us to model how total rewards earned vary across sexes as a function of behavioral state. These behavioral states are not observed directly but are inferred by using a Hidden Markov Model (not as difficult as Wikipedia makes it look) to model behavior. That's a bit much for us to bite off at this point, but thankfully, the authors have already assigned each trial to one of three states:

  1. explore

  2. exploit (left)

  3. exploit (right)
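If you need a refresher, here is a minimal sketch of that load-and-merge step. The file names below are placeholders (assumptions, not the dataset's actual file names); all that matters for the rest of this notebook is that the combined frame, called dat below, has the columns subject, session, state, reward, and sex.

# hedged sketch of the load-and-merge step from last time;
# 'trials.csv' and 'subjects.csv' are placeholder file names
import pandas as pd

trials = pd.read_csv('trials.csv')      # one row per trial: subject, session, state, reward, ...
subjects = pd.read_csv('subjects.csv')  # one row per subject: subject, sex, ...

# combine trial-level data with per-subject metadata
dat = trials.merge(subjects, on='subject')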

Exercise

  1. Use the split-apply-combine method to calculate total rewards earned for each animal for each session in each state.

  2. Add sex back in as a column to the resulting data frame. There are multiple ways to do this, but a solution that works well in more complicated examples is to merge the resulting data frame from the previous question with a subset of columns from the original data and drop duplicate rows.

Solution:#

# total reward in each state
rwd_by_state = dat.groupby(['subject', 'session', 'state']).reward.sum().reset_index()

# add sex back in by merging
rwd_by_state = rwd_by_state.merge(dat[['subject', 'sex']].drop_duplicates())

rwd_by_state

Now, let’s make a couple of plots to see how dividing trials by state suggests effects that might be present in the data:

Exercise

  1. Using the data frame you just made, make a box plot that shows the distribution of rewards earned for each animal across sessions. Color the boxes by sex.

  2. Aggregating across individuals and sessions, make a box plot that compares the distribution of rewards earned in each state by sex. Color the boxes by sex.

  3. What patterns do you see in the results? Does sex appear to make a difference?

Solution#

# make box plot
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(8, 3))
sns.boxplot(ax=ax[0], data=rwd_by_state, x='subject', y='reward', hue='sex')
sns.boxplot(ax=ax[1], data=rwd_by_state, x='state', y='reward', hue='sex')
ax[0].legend([], frameon=False)  # hide the left panel's legend; the right panel's is enough
ax[0].set_xticks([])  # too many subjects to label each tick
ax[1].set_xticklabels(['explore', 'exploit (L)', 'exploit (R)'])  # human-readable state names
plt.tight_layout()

Statistical modeling#

Of course, eyeballing our plots will only get us so far. Eventually, we will want to produce some kind of statistical model that helps us determine whether the results we see could be due to chance. Note that I'm speaking in terms of statistical modeling because, rather than just running a test, we want to think about what kind of process might have generated the data and build our assumptions about that process into the model. Even very simple statistical tests are built on models that encapsulate our assumptions. For example, every ANOVA is just a linear model, and the class of linear models (and their cousins) is much more flexible and powerful.
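To see the "every ANOVA is just a linear model" point concretely, you can ask statsmodels for an ANOVA table computed directly from an ordinary least squares fit. This is a side illustration (not part of the exercise), using the rwd_by_state frame from above; anova_lm lives in statsmodels' stats module.

# a one-way ANOVA on state is just a summary of the corresponding linear model
import statsmodels.api as sm
import statsmodels.formula.api as smf

fit = smf.ols("reward ~ C(state)", rwd_by_state).fit()
print(sm.stats.anova_lm(fit, typ=2))  # F-test for the state factor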

So let’s build a linear model.

Our goal is to model the reward as a function of state and sex. In R’s formula notation, this regression would be

\[ reward \sim state + sex \]

which adds an intercept plus a linear term for each variable. But of course the plots appear to indicate that reward might depend on an interaction between state and sex, which we would code as

\[ reward \sim state * sex \]

which adds an intercept, a main effect for each variable, and the interaction.
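If you want to see exactly which columns each formula produces, you can expand it with patsy, the formula library statsmodels has traditionally used under the hood (a quick optional check, not required for the exercise):

# expand the interaction formula into its design-matrix columns:
# an intercept, the state dummies, the sex dummy, and their products
from patsy import dmatrix

print(dmatrix("C(state) * sex", rwd_by_state, return_type='dataframe').columns.tolist())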

Exercise

  1. Using statsmodels

    import statsmodels
    import statsmodels.formula.api as smf
    

    fit the two regressions described above. You'll want the smf.ols command (ordinary least squares) for regression. You'll also need to tell statsmodels that state is a categorical variable.

  2. What do you find? Is there an effect of sex?

Solution:#

import statsmodels
import statsmodels.formula.api as smf

# main effects only
md = smf.ols("reward ~ C(state) + sex", rwd_by_state)
mdf = md.fit()
print(mdf.summary())
# model with interaction term

md = smf.ols("reward ~ C(state) * sex", rwd_by_state)
mdf = md.fit()
print(mdf.summary())
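Before moving on, it is worth a quick optional look at whether the residuals of the interaction model really behave the same way for every animal. This sketch reuses the mdf fit and the seaborn/matplotlib imports from the cells above.

# quick diagnostic: plot the OLS residuals grouped by subject;
# systematic offsets between animals suggest the errors are not identically distributed
resid_dat = rwd_by_state.assign(resid=mdf.resid)
sns.boxplot(data=resid_dat, x='subject', y='resid')
plt.xticks([])  # too many subjects to label individually
plt.tight_layout()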

Mixed effects modeling#

Of course, something about the above models stinks. If you’ve done any amount of regression, you know that standard linear models assume that the errors are independent and identically distributed. This means that once we specify the sex of the animal and its behavioral state, all the deviations of the data from the model’s prediction are drawn from the same normal distribution.

But this obviously ignores two extremely important sources of variation in the real data—subject and session—and as we showed above, these sources of variation are not negligible. One (bad) attempt at solving this problem is to simply do a separate regression for each animal (and perhaps each session), but in that case, we have very few data points to go on. Another (also bad) solution is to try fitting separate parameters for, e.g., each session’s mean rewards, but that drastically increases the number of parameters we have to fit. Instead we’ll assume the following:

  1. We start by using state and sex to make a (linear) prediction for the total reward obtained. These are called our fixed effects.

  2. But we assume that every subject’s individual prediction is jittered around this value. Some subjects, even accounting for state and sex, earn a little more or a little less.

  3. In addition, we assume that every session is distributed around the subject’s mean performance. Again, some are a little better, some a little worse.

These latter two additions are called random effects, and models that include both fixed and random effects are known as mixed effects models. In math form, this is

\[ reward = \beta_0 + \beta_{sex} \cdot sex + \beta_{state} \cdot state + \beta_{int} \cdot sex \cdot state + \eta + \delta + \varepsilon \]

where \(\eta\) is the subject random effect (normally distributed with mean 0), \(\delta\) is the session effect, and \(\varepsilon\) is the unaccounted-for (residual) error. Rather than estimating these values individually, we focus on estimating their variance.
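Concretely, each of these terms is assumed to be independent and normally distributed with mean zero, and it is their variances that the model estimates:

\[ \eta \sim \mathcal{N}(0, \sigma^2_{subject}), \qquad \delta \sim \mathcal{N}(0, \sigma^2_{session}), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]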

In statsmodels, we use mixedlm to do mixed effects modeling. Here, I’ll be honest: it’s easier in R. In our case, it’s particularly tricky because session is “nested” within subject. In R, we’d do this with

lmer("reward ~ sex * state + (1|subject/session)", 
     data=rwd_by_state)

but in Python, we need

md = smf.mixedlm("reward ~ C(state) * sex", rwd_by_state, 
                 groups='subject',
                 re_formula="1",
                 vc_formula={'session': "0 + C(session)"}
                 )

That last bit, the vc_formula, tells statsmodels that session is nested inside the grouping variable (subject).

Exercise

  1. Fit this model. I suggest

    md.fit(method='lbfgs', maxiter=1000) 
    

    to avoid some numerical instabilities.

  2. What do you find? Is there an effect of sex? Do the variance parameters for the random effects have the right sizes based on your plot of all subjects earlier?

Solution#
md = smf.mixedlm("reward ~ C(state) * sex", rwd_by_state, 
                 groups='subject',
                 re_formula="1",
                 vc_formula={'session': "0 + C(session)"}
                 )
mdf = md.fit(method='lbfgs', maxiter=1000)
print(mdf.summary())
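To answer the second question, it can help to pull the estimated variances out of the fitted result directly rather than reading them off the summary table. A short sketch using the mdf object above:

# estimated variance components from the mixed model fit
print(mdf.cov_re)  # variance of the subject-level random intercept
print(mdf.vcomp)   # variance component for the nested session term
print(mdf.scale)   # residual error variance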