# Analyzing table data
Last time, we practiced loading, recoding, and tidying an example dataset. We also worked on a very limited example of exploratory analysis, constructing summary statistics and plots. Today, we’ll push that analysis further, using Pandas to filter our data and ask whether some of our variables make a statistical difference to the outcomes we care about.
## From last time: loading and merging data
To begin, follow the same approach as last time to download the data, load it, and combine it. This time, you will not need to add the cumulative reward or cumulative trial variables. Instead, our goal will be to create a data frame that will allow us to model how total rewards earned vary across sexes as a function of behavioral state. These behavioral states are not observed directly but are inferred by fitting a Hidden Markov Model (not as difficult as Wikipedia makes it look) to the behavior. That's a bit much for us to bite off at this point, but thankfully, the authors have already assigned each trial to one of three states:
- explore
- exploit (left)
- exploit (right)
Exercise
1. Use the split-apply-combine method to calculate total rewards earned for each animal for each session in each state.
2. Add `sex` back in as a column to the resulting data frame. There are multiple ways to do this, but a solution that works well in more complicated examples is to `merge` the resulting data frame from the previous question with a subset of columns from the original data and drop duplicate rows.
### Solution
```python
# total reward in each state
rwd_by_state = dat.groupby(['subject', 'session', 'state']).reward.sum().reset_index()
# add sex back in by merging
rwd_by_state = rwd_by_state.merge(dat[['subject', 'sex']].drop_duplicates())
rwd_by_state
```
Now, let’s make a couple of plots to see how dividing trials by state suggests effects that might be present in the data:
Exercise
1. Using the data frame you just made, make a box plot that shows the distribution of rewards earned for each animal across sessions. Color the boxes by sex.
2. Aggregating across individuals and sessions, make a box plot that compares the distribution of rewards earned in each state by sex. Color the boxes by sex.

What patterns do you see in the results? Does sex appear to make a difference?
### Solution
```python
# make box plots
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(8, 3))

# left: per-animal distributions, colored by sex
sns.boxplot(ax=ax[0], data=rwd_by_state, x='subject', y='reward', hue='sex')
# right: per-state distributions, colored by sex
sns.boxplot(ax=ax[1], data=rwd_by_state, x='state', y='reward', hue='sex')

ax[0].legend([], frameon=False)  # suppress the duplicate legend
ax[0].set_xticks([])             # too many subjects to label individually
ax[1].set_xticklabels(['explore', 'exploit (L)', 'exploit (R)'])
plt.tight_layout()
```
## Statistical modeling
Of course, eyeballing our plots will only get us so far. Eventually, we will want to produce some kind of statistical model that helps us determine whether the results we see could be due to chance. Note that I'm speaking in terms of statistical modeling because, rather than just running a test, we want to think about what kind of process might have generated the data and build our assumptions about that process into the model. Even very simple statistical tests are built on models that encapsulate our assumptions. For example, every ANOVA is just a linear model, and the class of linear models (and their cousins) is much more flexible and powerful.
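That last claim is easy to check numerically. Here is a minimal sketch on made-up data (none of it from this lesson's dataset) showing that the one-way ANOVA F statistic equals the overall F test of the corresponding linear model:

```python
# Toy demonstration: a one-way ANOVA is the same test as a linear model
# with a single categorical predictor.
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "y": rng.normal(0, 1, 60) + np.repeat([0.0, 0.5, 1.0], 20),
})

# classical one-way ANOVA
f_classic, _ = stats.f_oneway(*(g.to_numpy() for _, g in toy.groupby("group")["y"]))

# the same test as a linear model with a categorical predictor
fit = smf.ols("y ~ C(group)", toy).fit()
print(np.isclose(f_classic, fit.fvalue))  # True
```

The F statistics agree to floating-point precision, because both procedures compare the same two nested models: a grand mean versus a separate mean per group.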
So let’s build a linear model.
Our goal is to model the reward as a function of state and sex. In R's formula notation, this regression would be `reward ~ state + sex`, which adds an intercept plus a linear term for each variable. But of course the plots appear to indicate that reward might depend on an interaction between state and sex, which we would code as `reward ~ state * sex`, which adds an intercept, a main effect for each variable, and the interaction.
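To see concretely what these formulas expand to, we can ask patsy (the formula engine behind statsmodels) for the design-matrix columns. This is a sketch on a toy data frame that mimics this lesson's column names:

```python
# Sketch: how main-effect and interaction formulas expand into
# design-matrix columns (toy data, not the real dataset).
import pandas as pd
import patsy

toy = pd.DataFrame({
    "state": ["explore", "exploit (left)", "exploit (right)"] * 2,
    "sex": ["F", "M", "F", "M", "F", "M"],
    "reward": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# main effects only: intercept + one dummy per non-reference level
X_main = patsy.dmatrix("C(state) + sex", toy)
print(X_main.design_info.column_names)

# with the interaction: the same columns plus their products
X_int = patsy.dmatrix("C(state) * sex", toy)
print(X_int.design_info.column_names)
```

With three states and two sexes, the main-effects model has four columns (intercept, two state dummies, one sex dummy), and the interaction model adds two more product columns.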
Exercise
Using `statsmodels`,

```python
import statsmodels
import statsmodels.formula.api as smf
```

fit the two regressions described above. You'll want the `smf.ols` command (ordinary least squares) for regression. You'll also need to tell `statsmodels` that `state` is a categorical variable.

What do you find? Is there an effect of sex?
### Solution
```python
import statsmodels
import statsmodels.formula.api as smf

# main effects only
md = smf.ols("reward ~ C(state) + sex", rwd_by_state)
mdf = md.fit()
print(mdf.summary())
```
```python
# model with interaction term
md = smf.ols("reward ~ C(state) * sex", rwd_by_state)
mdf = md.fit()
print(mdf.summary())
```
## Mixed effects modeling
Of course, something about the above models stinks. If you’ve done any amount of regression, you know that standard linear models assume that the errors are independent and identically distributed. This means that once we specify the sex of the animal and its behavioral state, all the deviations of the data from the model’s prediction are drawn from the same normal distribution.
But this obviously ignores two extremely important sources of variation in the real data—subject and session—and as we showed above, these sources of variation are not negligible. One (bad) attempt at solving this problem is to simply do a separate regression for each animal (and perhaps each session), but in that case, we have very few data points to go on. Another (also bad) solution is to try fitting separate parameters for, e.g., each session’s mean rewards, but that drastically increases the number of parameters we have to fit. Instead we’ll assume the following:
1. We start by using state and sex to make a (linear) prediction for the total reward obtained. These are called our fixed effects.
2. But we assume that every subject's individual prediction is jittered around this value. Some subjects, even accounting for state and sex, earn a little more or a little less.
3. In addition, we assume that every session is distributed around the subject's mean performance. Again, some are a little better, some a little worse.
These latter two additions are called random effects, and models that include both fixed and random effects are known as mixed effects models. In math form, this is

\[
\text{reward} = X\beta + \eta_{\text{subject}} + \delta_{\text{session}} + \varepsilon
\]

where \(\eta\) is the subject random effect (normally distributed with mean 0), \(\delta\) is the session effect, and \(\varepsilon\) is the unaccounted-for (residual) error. Rather than estimating these values individually, we focus on estimating their variances.
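One good way to internalize this generative story is to simulate from it. Here is a minimal sketch with made-up sizes and variance parameters (nothing below comes from the real data):

```python
# Simulate the mixed-model story: a fixed-effect mean, plus subject-level
# jitter, plus session-within-subject jitter, plus residual noise.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_sessions = 8, 5                      # hypothetical sizes
fixed_mean = 10.0                                  # stands in for X @ beta
sd_subject, sd_session, sd_resid = 2.0, 1.0, 0.5   # made-up variance components

eta = rng.normal(0.0, sd_subject, size=n_subjects)                  # one per subject
delta = rng.normal(0.0, sd_session, size=(n_subjects, n_sessions))  # nested in subject
eps = rng.normal(0.0, sd_resid, size=(n_subjects, n_sessions))      # residual error

reward = fixed_mean + eta[:, None] + delta + eps
print(reward.shape)  # (8, 5): one simulated reward per subject-session
```

Note how the broadcasting mirrors the nesting: each subject's \(\eta\) shifts every one of that subject's sessions, while \(\delta\) and \(\varepsilon\) vary session by session.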
In `statsmodels`, we use `mixedlm` to do mixed effects modeling. Here, I'll be honest: it's easier in R. In our case, it's particularly tricky because session is "nested" within subject. In R, we'd do this with
```r
lmer("reward ~ sex * state + (1|subject/session)", data=rwd_by_state)
```
but in Python, we need
```python
md = smf.mixedlm("reward ~ C(state) * sex", rwd_by_state,
                 groups='subject',
                 re_formula="1",
                 vc_formula={'session': "0 + C(session)"})
```
That last bit, the `vc_formula`, tells `statsmodels` that `session` is nested inside the grouping variable (`subject`).
Exercise
Fit this model. I suggest `md.fit(method='lbfgs', maxiter=1000)` to avoid some numerical instabilities.
What do you find? Is there an effect of sex? Do the variance parameters for the random effects have the right sizes based on your plot of all subjects earlier?
```python
md = smf.mixedlm("reward ~ C(state) * sex", rwd_by_state,
                 groups='subject',
                 re_formula="1",
                 vc_formula={'session': "0 + C(session)"})
mdf = md.fit(method='lbfgs', maxiter=1000)
print(mdf.summary())
```