Exploring table data#
This week, our in-class work will focus on two topics in tandem. First, we’ll be getting practice with handling real-world tabular data using Pandas. Second, we’ll be thinking about how we deal with sex differences in neuroscience and sex as a biological variable.
Let’s tackle those in reverse. Since 2016, NIH has required that its projects properly account for sex as a biological variable in research designs. To grossly oversimplify: if females make up 50% of the human population we hope to benefit, you need a strong reason for using only a single sex (usually males) in your study. Of course, such reasons exist (some courtship behaviors, maternal behavior), but the purpose of the policy is to move researchers away from viewing “male” as a biological default.
As several papers have pointed out (e.g., Shansky and Murphy [2021]), this will require a shift in thinking, not only because it’s typical to treat female physiology as more complicated [Shansky, 2019] but also because underpowered or misinterpreted studies of sex differences have been used to reinforce stereotypes [Grissom and Reyes, 2019]. For this reason, it’s important to think in a methodologically rigorous way about how we investigate sex differences, even when they’re not the primary aim of our research.
Our data this week come from a recent study in this vein from Chen et al. [2021] at the University of Minnesota, who examined sex differences in mice performing a restless bandit task. Briefly, the task presents animals with two options, both of which are (probabilistically) rewarded but whose values change in time. This requires animals to trade off exploiting an option with known value against exploring the value of an option they haven’t chosen in a while. As the authors report in the paper, while both sexes achieved the same reward rates in the task, the strategies they used to do so differed in interesting ways.
Fortunately, the authors posted their data to an online repository (yay open science!). If you’re working locally, you can download from this link. If not, you can start your Colab notebook with the following cell, which will download the data and unzip it for you:
%%bash
curl --output tmp.zip -L -X GET "https://datadryad.org/api/v2/datasets/doi%3A10.5061%2Fdryad.z612jm6c0/download" -H "accept: application/zip"
unzip tmp.zip
unzip cleaned_up_restless_final_data.zip
rm tmp.zip
rm README.txt
rm cleaned_up_restless_final_data.zip
chmod -R u+wx cleaned\ up\ restless\ final\ data
mv cleaned\ up\ restless\ final\ data data
The files are in the `data` folder, with one subfolder for each session. Animals 1–16 are male, 17–32 female. Our goal for today will be to aggregate all this data into a single tidy data frame that we can use to perform some exploratory plotting.
Loading table data#
The most common format for storing tabular data is comma-separated value or `.csv` format, in which each line of the file contains one row of the table, with columns separated by commas. Often, the first line of the file lists the column names, also separated by commas.
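For instance, here is a miniature, made-up `csv` read directly from a string (the column names mimic those in our data, but the values are invented for illustration):

```python
import io
import pandas as pd

# a tiny csv in memory: first (unnamed) column is an index,
# first line holds the column names
csv_text = """,left,right,choice
0,0.3,0.7,2
1,0.4,0.6,1
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.columns.tolist())  # ['left', 'right', 'choice']
```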
Exercise
Start by loading the data for a single session and a single animal into Pandas. This can be done via the `pandas.read_csv` function. Note that the first column of the `csv` file is an index, so you might want to provide the `index_col=0` keyword argument.

Print the first 5 rows of the table. What are the column names? What is a way to find this out without printing the table directly (i.e., what if you had to find this out within a function)?
Solution:#
Show code cell content
# import pandas package
import pandas as pd
# load data from a csv into a table
data_path = 'data/'
dat = pd.read_csv(data_path + 'session1/1.csv', index_col=0)
dat.head(5)
Show code cell content
# get column names without printing
dat.columns
First plots#
As we did last week, let’s try out some ideas on this small chunk of data by making plots.
Exercise
The `left` and `right` columns list the reward probability for nosepokes to the left and right targets on each trial. Plot the values of both of these over time. What would the optimal strategy for the mice be in this situation if they knew this information? How should this change if they don’t know the values?

The `choice` variable is coded `1` for a left choice and `2` for a right choice. Make a plot that shows how choices track reward probabilities for the two options. Since choice only has two values, you might want to smooth the series. You can either adopt the convolutional approach we used last week or use the Pandas windowing operations.

Plot the cumulative reward earned by the animal as a function of trial in the session.
Solution:#
Show code cell content
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ['svg']
# plot left and right values over time
plt.plot(dat[['left', 'right']])
plt.xlabel('Trial')
plt.ylabel('Reward probability')
plt.ylim(0, 1);
If the mice knew the values, they would choose whichever is maximal at a given time. If they don’t know the values of each option, they need to estimate them, and explore when it’s likely another option is better. (Actually, calculating the optimal strategy in situations like this is a surprisingly rich mathematical problem.)
Show code cell content
# plot choice probability
choice_prob = dat.choice.rolling(10).mean() - 1 # 10-trial trailing window: by default this uses the *past only*
plt.plot(choice_prob - 0.5, label='choice probability - 0.5')
plt.plot(dat.right - dat.left, label='right - left reward probability')
plt.legend(loc=1, bbox_to_anchor=(1.55, 1))
plt.xlabel('Trial')
plt.title('Choice tracks reward probability');
Show code cell content
# plot cumulative reward
plt.plot(dat.reward.cumsum())
plt.xlabel('Trial')
plt.ylabel('Cumulative Reward');
Combining data sets#
With these simple analyses under our belt, we can think about how we want to analyze data for all the animals combined. There are two basic approaches:
Loop over data sets, applying the same analysis to each one. Then write out the results to disk or hold them all in a list in memory.
Combine all the data together into a single
DataFrame
object that can be easily manipulated for a wide variety of analyses.
Tidying data#
Many data tables are structured with multiple observations in each row, but for many analysis and visualization needs, data are better structured with one observation per row, or “tidy,” data. While not perfect for all purposes, tidying gives data a canonical form, so that software can focus on transferring to and from a single predictable structure.
In Pandas, the key functions for this purpose are `melt` and `stack`, along with several others. For a more full-featured approach in R, see tidyverse and the comparison here.
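As a sketch of what `melt` does, consider a small hypothetical wide table with one column per option (the names and values below are invented, not from the study):

```python
import pandas as pd

# wide format: one row per trial, one column per option
wide = pd.DataFrame({'trial': [0, 1],
                     'left': [0.3, 0.4],
                     'right': [0.7, 0.6]})

# melt gathers the option columns into (variable, value) pairs,
# giving one observation per row
tidy = pd.melt(wide, id_vars=['trial'],
               value_vars=['left', 'right'],
               var_name='option', value_name='reward_prob')
print(tidy)
```

Each (trial, option) combination now gets its own row, which is exactly the one-observation-per-row shape we're after.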
Exercise
Given that each `csv` file contains data from one session from one animal, what additional columns do we need to put the final combined data frame into tidy format?

Write code that performs this conversion for a single dataset:

Load the data set.

Add columns for the session, subject number, and sex of the animal. Note that these will be the same for the entire data set.

Make the index (trial number) into a new column called `trial`. Hint: `reset_index`.

Is there any additional information you will need to uniquely identify each row? Can that be added at this stage?
Solution:#
Show code cell content
# add columns to make the dataframe tidy
# we need columns for session and subject
chunk = pd.read_csv(data_path + 'session1/1.csv', index_col=0)
session = 1
sub = 1
# add columns for session and subject
chunk['session'] = session
chunk['subject'] = sub
if sub <= 16:
    chunk['sex'] = 'M'
else:
    chunk['sex'] = 'F'
# make a trial column by resetting the current index and renaming to trial
chunk = chunk.reset_index()
chunk = chunk.rename(columns={"index": "trial"})
chunk
Data in multiple directories#
Now that we can load and add needed information to each data chunk, we have to think about how to systematically go through subdirectories and load data. For this, we’ll use the `walk` function from the `os` module, which takes a directory name as input and returns a generator we can loop over. Our overall strategy will be to walk through the directory, load all the `csv` files, process each one, hold the results in a list, and use `concat` to combine them into a single data frame at the end.
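To get a feel for what `os.walk` yields, here is a sketch using a throwaway temporary directory that mimics the expected layout (the names are illustrative, not the actual dataset):

```python
import os
import tempfile

# build a tiny fake directory tree: root/session1/1.csv
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'session1'))
open(os.path.join(root, 'session1', '1.csv'), 'w').close()

# os.walk yields one (dirname, subdirs, files) triple per directory
for dirname, subdirs, files in os.walk(root):
    print(dirname, subdirs, files)
```

The first triple is the root itself (listing `session1` as a subdirectory); the second is the session folder (listing its files). For our task, `dirname` and `files` are the pieces we need.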
This bit can be a little tricky, so we’ll break it down in pieces:
Exercise
Write a loop that simply prints the information given by `os.walk` as we go through the data directory. Which of these pieces do you need?

We don’t need to process all the directories returned by this process. Write code to prevent us from doing so.

Write code that extracts the session number from the directory name. Remember that `int` converts a string to an integer.

Write code that ensures we only process files that end in `.csv`.

Write code that gets the subject number from the file name.

Using the code you wrote above, add session, subject, sex, and trial columns to the data set.

Append this data frame to the list you’re using to hold the results.

Finally, concatenate the results list to form a new data frame. Note that the index is not unique, so for technical reasons, it will help to `reset_index` again to get a unique index for each row.
Solution:#
Show code cell content
import os

chunk_list = []
for dirname, _, files in os.walk(data_path):
    folder = dirname.split('/')[-1]
    if 'session' in folder:
        session = int(folder[7:])  # make it an integer
        for fname in files:
            if fname.endswith('.csv'):
                sub = int(fname.split('.')[0])  # make it an integer
                chunk = pd.read_csv(dirname + '/' + fname, index_col=0)
                # add columns for session and subject
                chunk['session'] = session
                chunk['subject'] = sub
                if sub <= 16:
                    chunk['sex'] = 'M'
                else:
                    chunk['sex'] = 'F'
                # make a trial column by resetting the current index and renaming to trial
                chunk = chunk.reset_index()
                chunk = chunk.rename(columns={"index": "trial"})
                chunk_list.append(chunk)

print("Total number of chunks: {}".format(len(chunk_list)))

# concatenate all data frames together
dat = pd.concat(chunk_list).reset_index(drop=True)  # overwrites our old dat variable!
dat
Group-level statistics#
Having data in tidy format facilitates comparisons across observation types. In Pandas, the relevant code pattern is to call `groupby` on the `DataFrame`, followed by a call to a summary statistic like `mean` or `std`. That is, if `data` is a `DataFrame`:

data.groupby(by=['column1', 'column5']).mean()

will calculate a groupwise mean. More generally, this pattern of breaking data into chunks based on a combination of variables, performing some analysis, and aggregating the results is known as Split-Apply-Combine.
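As a minimal illustration of the pattern (with made-up values, not the real data):

```python
import pandas as pd

# toy data frame: two subjects, a few reaction times each
df = pd.DataFrame({'subject': [1, 1, 2, 2],
                   'sex': ['M', 'M', 'F', 'F'],
                   'RT': [0.5, 0.7, 0.4, 0.6]})

# split by sex, apply the mean to RT, combine into one row per group
means = df.groupby('sex')['RT'].mean()
print(means)
```

Here the split is by `sex`, the apply step is `mean`, and the combine step stitches the per-group results into a single Series indexed by group.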
Exercise
Use `groupby` to calculate the mean and variance of the reaction time for each animal in each session.
Solution:#
Show code cell content
dat.groupby(['subject', 'session']).RT.agg(['mean', 'var'])
Exploratory plotting redux#
Now let’s try to generalize our plot above of accumulated rewards over time. This time, however, we’d like to plot accumulated rewards for all animals across all sessions. As you might expect, `groupby` will be our friend, but we’ll have to figure out how to calculate the information we need. As we’ll see, it’s easiest to do this by adding the new variables as columns to our data frame.
The final tidy#
Exercise
To make sure the steps below work, we need to start by sorting the data frame by subject and session. (Be sure to save the result or use `inplace=True`!)

Using `groupby`, calculate the cumulative reward for each subject and each session as we did for a single dataset above. Assign this to a new column in the data frame.

Similarly, calculate the cumulative reward for each subject across all sessions and assign this as a different column in the data frame.

The one relevant piece of information we still don’t have, the one that would uniquely identify each row and make the data frame tidy, is a unique trial number for each animal across sessions. If `dat` is the data frame and `trial` is the column containing the (within-session) trial number, the following line of code creates such a column:

dat['cumulative_trial'] = dat.groupby(['subject']).trial.transform(lambda x: x.reset_index().index)

How does this work?
Solution:#
The strategy here is as follows:
Group the data frame into chunks, one per subject.
Pull out the column we want.
Perform some transformation (e.g., cumulative sum).
Assign the result back as a new column.
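On a small, made-up data frame, the same transform pattern makes the effect easy to see: within each subject's chunk, `reset_index` builds a fresh 0-to-n index, and `.index` retrieves it; `transform` then aligns the result back to the original rows.

```python
import pandas as pd

# toy frame: trial numbers restart within each session
df = pd.DataFrame({'subject': [1, 1, 2, 2, 2],
                   'session': [1, 2, 1, 1, 2],
                   'trial':   [0, 0, 0, 1, 0]})

# count trials continuously within each subject, across sessions
df['cumulative_trial'] = (df.groupby('subject')['trial']
                            .transform(lambda x: x.reset_index().index))
print(df['cumulative_trial'].tolist())  # [0, 1, 0, 1, 2]
```

Subject 1's two rows get 0 and 1; subject 2's three rows get 0, 1, and 2, even though the within-session `trial` column keeps restarting.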
Show code cell content
dat = dat.sort_values(by=['subject', 'session'])
dat['cumulative_reward_session'] = dat.groupby(['subject', 'session']).reward.cumsum()
dat['cumulative_reward_lifetime'] = dat.groupby(['subject']).reward.cumsum()
dat['cumulative_trial'] = dat.groupby(['subject']).trial.transform(lambda x: x.reset_index().index)
dat.head()
Why did we do all this?#
The above approach of adding a bunch of variables to our data frame, rather than extracting just what we need, might seem like overkill, but as R users know, it really shines when we want to start grouping our analyses by different sets of variables. As an example, with our expanded data set, Seaborn lets us easily make both of the following plots with a simple variation of the same command:
Exercise
Using `lineplot`, plot cumulative reward across all sessions, aggregating by sex. (Hint: since `lineplot` automatically calculates bootstrapped confidence intervals and this can take a long time for big datasets, you might want to supply the keyword `errorbar=None`; in older versions of Seaborn, this was `ci=None`.)

Make the same plot, showing cumulative rewards for all individuals.
Solution:#
Show code cell content
import seaborn as sns
fig, ax = plt.subplots(1, 2, figsize=(8, 3))
sns.lineplot(ax=ax[0], data=dat, x='cumulative_trial',
y='cumulative_reward_lifetime', hue='sex', errorbar=None)
sns.lineplot(ax=ax[1], data=dat, x='cumulative_trial',
y='cumulative_reward_lifetime', hue='subject', errorbar=None, legend=None)
ax[0].set_xlabel('Total Trials')
ax[1].set_xlabel('Total Trials')
ax[0].set_ylabel('Total Rewards')
ax[1].set_ylabel('Total Rewards')
plt.tight_layout()