# Analyzing table data
Last time, we practiced loading, recoding, and tidying an example dataset. We also worked on a very limited example of exploratory analysis, constructing summary statistics and plots. Today, we’ll push that analysis further, using Pandas to filter our data and ask whether some of our variables make a statistical difference to the outcomes we care about.
## From last time: loading and merging data
To begin, follow the same approach as last time to download the data, load it, and combine it. This time, you will not need to add the cumulative reward or cumulative trial variables. Instead, our goal will be to create a data frame that will allow us to model how total rewards earned vary across sexes as a function of behavioral state. These behavioral states are not observed directly but are inferred by fitting a Hidden Markov Model (not as difficult as Wikipedia makes it look) to the behavior. That's a bit much for us to bite off at this point, but thankfully, the authors have already assigned each trial to one of three states:
- explore
- exploit (left)
- exploit (right)
Exercise
1. Use the split-apply-combine method to calculate total rewards earned for each animal for each session in each state.
2. Add `sex` back in as a column to the resulting data frame. There are multiple ways to do this, but a solution that works well in more complicated examples is to `merge` the resulting data frame from the previous question with a subset of columns from the original data and drop duplicate rows.
### Solution
```python
# total reward in each state
rwd_by_state = dat.groupby(['subject', 'session', 'state']).reward.sum().reset_index()
# add sex back in by merging
rwd_by_state = rwd_by_state.merge(dat[['subject', 'sex']].drop_duplicates())
rwd_by_state
```
Now, let’s make a couple of plots to see how dividing trials by state suggests effects that might be present in the data:
Exercise
1. Using the data frame you just made, make a box plot that shows the distribution of rewards earned for each animal across sessions. Color the boxes by sex.
2. Aggregating across individuals and sessions, make a box plot that compares the distribution of rewards earned in each state by sex. Color the boxes by sex.

What patterns do you see in the results? Does sex appear to make a difference?
### Solution
```python
# make box plots
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']
import seaborn as sns

fig, ax = plt.subplots(1, 2, figsize=(8, 3))

# left: per-animal distributions, colored by sex
sns.boxplot(ax=ax[0], data=rwd_by_state, x='subject', y='reward', hue='sex')
# right: per-state distributions, colored by sex
sns.boxplot(ax=ax[1], data=rwd_by_state, x='state', y='reward', hue='sex')

ax[0].legend([], frameon=False)  # suppress the duplicate legend
ax[0].set_xticks([])             # too many subjects to label individually
ax[1].set_xticklabels(['explore', 'exploit (L)', 'exploit (R)'])
plt.tight_layout()
```
## Statistical modeling
Of course, eyeballing our plots will only get us so far. Eventually, we will want to produce some kind of statistical model that helps us determine whether the results we see could be due to chance. Note that I'm speaking in terms of statistical modeling because, rather than just running a test, we want to think about what kind of process might have generated the data and build our assumptions about that process into the model. Even very simple statistical tests are built on models that encapsulate our assumptions. For example, every ANOVA is just a linear model, and the class of linear models (and their cousins) is much more flexible and powerful.
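That last claim is easy to check numerically. Here is a minimal sketch on made-up data (none of it from this lesson's dataset) showing that the one-way ANOVA F statistic equals the overall F test of the corresponding linear model:

```python
# Toy demonstration: a one-way ANOVA is the same test as a linear model
# with a single categorical predictor.
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "y": rng.normal(0, 1, 60) + np.repeat([0.0, 0.5, 1.0], 20),
})

# classical one-way ANOVA
f_classic, _ = stats.f_oneway(*(g.to_numpy() for _, g in toy.groupby("group")["y"]))

# the same test as a linear model with a categorical predictor
fit = smf.ols("y ~ C(group)", toy).fit()
print(np.isclose(f_classic, fit.fvalue))  # True
```

The F statistics agree to floating-point precision, because both procedures compare the same two nested models: a grand mean versus a separate mean per group.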
So let’s build a linear model.
Our goal is to model the reward as a function of state and sex. In R's formula notation, this regression would be `reward ~ state + sex`, which adds an intercept plus a linear term for each variable. But of course the plots appear to indicate that reward might depend on an interaction between state and sex, which we would code as `reward ~ state * sex`, which adds an intercept, a main effect for each variable, and the interaction.
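To see concretely what these formulas expand to, we can ask patsy (the formula engine behind statsmodels) for the design-matrix columns. This is a sketch on a toy data frame that mimics this lesson's column names:

```python
# Sketch: how main-effect and interaction formulas expand into
# design-matrix columns (toy data, not the real dataset).
import pandas as pd
import patsy

toy = pd.DataFrame({
    "state": ["explore", "exploit (left)", "exploit (right)"] * 2,
    "sex": ["F", "M", "F", "M", "F", "M"],
    "reward": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# main effects only: intercept + one dummy per non-reference level
X_main = patsy.dmatrix("C(state) + sex", toy)
print(X_main.design_info.column_names)

# with the interaction: the same columns plus their products
X_int = patsy.dmatrix("C(state) * sex", toy)
print(X_int.design_info.column_names)
```

With three states and two sexes, the main-effects model has four columns (intercept, two state dummies, one sex dummy), and the interaction model adds two more product columns.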
Exercise
Using `statsmodels`,

```python
import statsmodels
import statsmodels.formula.api as smf
```

fit the two regressions described above. You'll want the `smf.ols` command (ordinary least squares) for regression. You'll also need to tell `statsmodels` that `state` is a categorical variable.

What do you find? Is there an effect of sex?
### Solution
```python
import statsmodels
import statsmodels.formula.api as smf

# main effects only
md = smf.ols("reward ~ C(state) + sex", rwd_by_state)
mdf = md.fit()
print(mdf.summary())
```
```python
# model with interaction term
md = smf.ols("reward ~ C(state) * sex", rwd_by_state)
mdf = md.fit()
print(mdf.summary())
```
## Mixed effects modeling
Of course, something about the above models stinks. If you’ve done any amount of regression, you know that standard linear models assume that the errors are independent and identically distributed. This means that once we specify the sex of the animal and its behavioral state, all the deviations of the data from the model’s prediction are drawn from the same normal distribution.
But this obviously ignores two extremely important sources of variation in the real data—subject and session—and as we showed above, these sources of variation are not negligible. One (bad) attempt at solving this problem is to simply do a separate regression for each animal (and perhaps each session), but in that case, we have very few data points to go on. Another (also bad) solution is to try fitting separate parameters for, e.g., each session’s mean rewards, but that drastically increases the number of parameters we have to fit. Instead we’ll assume the following:
1. We start by using state and sex to make a (linear) prediction for the total reward obtained. These are called our fixed effects.
2. But we assume that every subject's individual prediction is jittered around this value. Some subjects, even accounting for state and sex, earn a little more or a little less.
3. In addition, we assume that every session is distributed around the subject's mean performance. Again, some are a little better, some a little worse.
These latter two additions are called random effects, and models that include both fixed and random effects are known as mixed effects models. In math form, this is

\[
\text{reward} = X\beta + \eta_{\text{subject}} + \delta_{\text{session}} + \varepsilon
\]

where \(\eta\) is the subject random effect (normally distributed with mean 0), \(\delta\) is the session effect, and \(\varepsilon\) is the unaccounted-for (residual) error. Rather than estimating these values individually, we focus on estimating their variances.
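One good way to internalize this generative story is to simulate from it. Here is a minimal sketch with made-up sizes and variance parameters (nothing below comes from the real data):

```python
# Simulate the mixed-model story: a fixed-effect mean, plus subject-level
# jitter, plus session-within-subject jitter, plus residual noise.
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_sessions = 8, 5                      # hypothetical sizes
fixed_mean = 10.0                                  # stands in for X @ beta
sd_subject, sd_session, sd_resid = 2.0, 1.0, 0.5   # made-up variance components

eta = rng.normal(0.0, sd_subject, size=n_subjects)                  # one per subject
delta = rng.normal(0.0, sd_session, size=(n_subjects, n_sessions))  # nested in subject
eps = rng.normal(0.0, sd_resid, size=(n_subjects, n_sessions))      # residual error

reward = fixed_mean + eta[:, None] + delta + eps
print(reward.shape)  # (8, 5): one simulated reward per subject-session
```

Note how the broadcasting mirrors the nesting: each subject's \(\eta\) shifts every one of that subject's sessions, while \(\delta\) and \(\varepsilon\) vary session by session.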
In `statsmodels`, we use `mixedlm` to do mixed effects modeling. Here, I'll be honest: it's easier in R. In our case, it's particularly tricky because session is "nested" within subject. In R, we'd do this with
```r
lmer("reward ~ sex * state + (1|subject/session)", data=rwd_by_state)
```
but in Python, we need
```python
md = smf.mixedlm("reward ~ C(state) * sex", rwd_by_state,
                 groups='subject',
                 re_formula="1",
                 vc_formula={'session': "0 + C(session)"})
```
That last bit, the `vc_formula`, tells `statsmodels` that `session` is nested inside the grouping variable (`subject`).
Exercise
Fit this model. I suggest `md.fit(method='lbfgs', maxiter=1000)` to avoid some numerical instabilities.
What do you find? Is there an effect of sex? Do the variance parameters for the random effects have the right sizes based on your plot of all subjects earlier?
```python
md = smf.mixedlm("reward ~ C(state) * sex", rwd_by_state,
                 groups='subject',
                 re_formula="1",
                 vc_formula={'session': "0 + C(session)"})
mdf = md.fit(method='lbfgs', maxiter=1000)
print(mdf.summary())
```