Patterns, functions, and duck typing#
In the last section, we looked at the basic ingredients of Python (its data types) and the ways we can use them (iteration and logic, more formally known as “control flow”).
We also touched on the idea that both these ingredients and the way we mix them are designed so as to be consistent. This consistency, that lists behave like tuples behave like dicts (whenever that makes sense), is one of the hallmarks of Python.
This consistency also gives rise to one of the most common patterns or idioms in Python: duck typing. Unlike languages that require you to strictly define what types of data a particular piece of code can work with, Python takes the standpoint that, “If it walks like a duck and quacks like a duck, it’s a duck.” In this paradigm, it doesn’t matter what kind of data you pass to a function so long as that data supports all the necessary operations.
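For example, here is a minimal sketch (the function name summarize is invented for illustration) of a single function that happily accepts a list, a tuple, or a string, because all three support len() and iteration:
def summarize(collection):
    # works for anything that supports len() and iteration
    first = next(iter(collection))
    print(len(collection), 'items, starting with', first)
summarize(['pooh', 'piglet', 'eeyore'])  # a list
summarize((1, 2, 3, 4))                  # a tuple
summarize('quack')                       # a string: its "items" are characters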
We’ll explore some of this below.
Patterns#
Programmers are extremely lazy — a certain type of lazy. A good programmer is the kind of person who will spend six hours coding a solution to a problem that saves her six seconds thousands of times.
As a result, since the 1990s, programmers have collected and studied several dozen (good) solutions to program-structuring problems that come up over and over again. If you've written anything more complicated than a simple script, you may even have come up with one or two of these yourself. Many of them only really apply to larger software projects. These solutions are known as Design Patterns, and they're the next step up the ladder in your programming education once you feel comfortable writing snippets of code.
But let’s make this concrete. To review, consider:
mylist = ['1', 1, 'pooh', 'piglet', [2, 7, True]]
mytup = ('a', 1, 2, True)
mystr = "we can do this \U0001F389"
As a trivial example of duck typing, note that we don't have to do anything special to print these variables: it's the same print function, regardless of the data type.
print(mylist)
print(mytup)
print(mystr)
['1', 1, 'pooh', 'piglet', [2, 7, True]]
('a', 1, 2, True)
we can do this 🎉
Less trivially, all of these are collections (they have a notion of “contained in”) and they are iterable (we can get elements out of them one by one). Because of this, the code to iterate over all these collections and print elements one at a time is identical:
collection = mylist # you can change this to different collections defined above
for element in collection:
    print(element)
1
1
pooh
piglet
[2, 7, True]
This may seem simple, but it’s because Python has taken one of the most common design patterns — the iterator pattern — and baked it right into the language.
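Under the hood, a for loop just asks the collection for an iterator and then repeatedly asks that iterator for the next element. The short sketch below spells that out by hand for mytup:
iterator = iter(mytup)  # ask the collection for an iterator
print(next(iterator))   # 'a'
print(next(iterator))   # 1
# ...and so on, until next() raises StopIteration, which is what ends a for loop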
Exercise
We've seen that functions like print and the iterator pattern are shared across data types. What are other examples of such similarities?
Functions#
Functions are the single greatest invention in the history of computer programming. By allowing us to reuse code, functions allow for greater abstraction (as we'll see), more readable code, and greater modularity (you don't need to know the inner details of everything).
More explicitly, functions are named blocks of code with inputs and a single output (though we can get around this restriction). To define a function, we can use the def keyword:
def myfunc(x):
    print(x + 1)
    return 2 * x
print(myfunc(4))
y = myfunc(-7)
print(y)
5
8
-6
-14
Here, def says we are about to define a function. This keyword is followed by the name of the function and a list of its arguments in parentheses. Python has several neat features in the way arguments are defined, including the ability to take arguments by name, to leave the number of arguments unspecified, and to give default values to certain arguments.
Finally, the return keyword specifies the output of the function. Note that, as with a for loop, the line defining the function ends in a colon and the entire function body is indented.
def anotherfunc(x, y=2): # y has the default value 2
    z = x ** y # x to the power y
    return z / 2.0
print(anotherfunc(4, 0.5)) # here we specify both x and y
print(anotherfunc(4)) # here we specify only x, so y = 2
1.0
8.0
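Here is a rough sketch of those argument features (the function names are invented for illustration), along with the common trick for getting around the single-output restriction by returning a tuple:
def report(*args, sep=', '):
    # *args collects any number of positional arguments into a tuple
    return sep.join(str(a) for a in args)
def minmax(values):
    # "multiple" outputs are really one tuple, which we can unpack on assignment
    return min(values), max(values)
print(report(1, 2, 3))             # any number of arguments
print(report(1, 2, 3, sep=' | '))  # passing an argument by name
lo, hi = minmax([3, 1, 4, 1, 5])   # tuple unpacking of the two return values
print(lo, hi)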
Understanding variable scope#
Functions are like black boxes. The information that gets passed into them is bound to the input variable name, but this variable only exists while the function is running. This is a tricky topic that goes under the name of variable scoping, but the following examples illustrate that you have to be careful about what information is and isn’t being passed into a function.
x = 'foo'
print('x = ' + x)
# appending an empty string seems like a sensible default
def reverser(x, appender=''):
"""
This is a docstring. It tells us what the function does.
This function reverses its input and appends its second argument to the end.
"""
print('x = ' + x)
return x[::-1] + appender
print(help(reverser))
print(reverser('bar'))
print(reverser('elephant', ' monkey'))
print('x = ' + x)
x = foo
Help on function reverser in module __main__:
reverser(x, appender='')
This is a docstring. It tells us what the function does.
This function reverses its input and appends its second argument to the end.
None
x = bar
rab
x = elephant
tnahpele monkey
x = foo
Note that the value of x inside the function had nothing to do with the value of x outside the function. Within the function, x took on the value of whatever we passed in as the first argument of reverser. When the function returned, the x outside the function still had its original value.
This may seem confusing, but we actually want this behavior. The fact that variables defined within the function live and die inside the function means that we can use functions without worrying that they will overwrite variables we ourselves define. Imagine if you used a function that had an argument x or defined a variable data. You may well have these variables running around in your own code, and scoping makes sure that you don't need to worry about someone else's function overwriting them.
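Here is a tiny sketch (the variable names are made up) showing that rebinding a name inside a function leaves a variable of the same name outside it untouched:
data = [1, 2, 3]
def biggest(data):
    data = sorted(data, reverse=True)  # rebinds the name data inside the function only
    return data[0]
print(biggest([5, 2, 9]))  # 9
print(data)                # still [1, 2, 3]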
Functions take inputs, perform their work, and return outputs. You don’t have to know what they are doing under the hood, and your own functions should play nice in the same way.
What makes a good function?#
Some things to consider:
functions do one thing
functions give us the chance to replace confusing behavior with clearly named behavior
functions allow us to obey the DRY principle (don’t repeat yourself)
functions can call other functions
So how about we practice by writing some code?
Handling multiple data sets:#
To illustrate these features, we'll be examining data from a comparative study of cognitive ability between macaques and lemurs (link). Data are available here, but to make things easy, we will create a directory here on Colab and download the files directly from Google Drive. After that, we will assume that these data live in a directory data/primates inside the working directory.
To accomplish that, we'll use the os library:
import os
if not os.path.isdir('data/primates'):
    os.makedirs('data/primates')
Then we'll use the gdown package to download the files directly from Google Drive:
import gdown
data_urls = [
    'https://drive.google.com/uc?export=download&id=1t-USz89jo4DPuyxUcr6DzX1FKdEeIOl6',
    'https://drive.google.com/uc?export=download&id=1sZNoxO8Hf4sw6-dvtOKz2TYAydVnihHo',
    'https://drive.google.com/uc?export=download&id=1lnl5zGDSWHZLKuiiKgPmwPuKBGGIv5AF',
    'https://drive.google.com/uc?export=download&id=1eBYozAifLLbBfMMdU5CNfYTGuDzH4YnO',
    'https://drive.google.com/uc?export=download&id=1QKRi0HdKoRDAUOdQJ05ozh065KN0D-IN'
]
data_names = [
    'macaque.csv',
    'trained_Macaque.csv',
    'Black.csv',
    'Mongoose.csv',
    'Catta.csv'
]
prefix = 'data/primates/'
for url, name in zip(data_urls, data_names):
    gdown.download(url, prefix + name, quiet=True)
Warning
If you don’t already have a package installed, lines like
import gdown
above will produce errors. In that case, you can install the missing package by creating a new cell and running, e.g.,
!pip install gdown
In a code cell, a leading ! (often read "bang") tells Jupyter to run the rest of the line in the shell, as if it had been typed into the terminal. pip is Python's package manager, which will try to find and download the relevant package and its dependencies.
Normally, we could just use the %ls magic to get the list of files in a given directory:
%ls data/primates
Black.csv Catta.csv Mongoose.csv macaque.csv trained_Macaque.csv
But since we eventually want to move over to pure Python, we'll use the os library, which gives us operating system commands.
pathparts = ('data', 'primates')
# this command will work on both Windows and Mac/Unix
# the * expands the tuple, so it's as if we'd written
# os.path.join('data', 'primates')
fullpath = os.path.join(*pathparts)
print(fullpath)
data/primates
datfiles = os.listdir(fullpath) # note: we're not guaranteed an order here
print(datfiles)
['Mongoose.csv', 'Black.csv', 'macaque.csv', 'trained_Macaque.csv', 'Catta.csv']
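Since the order is not guaranteed, a small aside: if a reproducible order ever matters, we could sort the names ourselves (shown here without reassigning datfiles, so the rest of this section is unaffected):
print(sorted(datfiles))  # sorted() returns a new, alphabetically ordered list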
Our first order of business is to figure out our analysis from a single dataset.
Warning
The analysis below uses DataFrames. We’ll do more on those later. For now, just see what’s possible and play along. The important thing is how we’ll reorganize (or “refactor”) the code below.
We'll load the csv file (which you can view as a spreadsheet) as a Pandas DataFrame.
# make the filename by joining with the path (works cross-platform!)
fname = os.path.join(fullpath, datfiles[0])
import pandas as pd
# df is short for dataframe
# in code with a lot of dataframes, we would choose a more descriptive name
# index_col=0 says the first column of the file is row names
df = pd.read_csv(fname, index_col=0)
df.head()
|       | Sub     | Species  | Date    | Block (approx 72 trials) | Trial | NumA | NumB | Accuracy | RT    | Surface Area |
|-------|---------|----------|---------|--------------------------|-------|------|------|----------|-------|--------------|
| 12765 | eduardo | Mongoose | 12/2/08 | 1.0                      | 1     | 4    | 5    | 1        | 1.183 | equal        |
| 12766 | eduardo | Mongoose | 12/2/08 | 1.0                      | 2     | 3    | 9    | 0        | 0.883 | equal        |
| 12767 | eduardo | Mongoose | 12/2/08 | 1.0                      | 3     | 4    | 7    | 1        | 2.750 | congruent    |
| 12768 | eduardo | Mongoose | 12/2/08 | 1.0                      | 4     | 5    | 7    | 1        | 1.000 | congruent    |
| 12769 | eduardo | Mongoose | 12/2/08 | 1.0                      | 5     | 2    | 3    | 0        | 1.250 | congruent    |
We can find out some interesting things:
df['Sub'].unique()
array(['eduardo', 'felipe', 'pedro', 'sancho'], dtype=object)
df['Species'].unique(), df['Date'].unique()
(array(['Mongoose'], dtype=object),
array(['12/2/08', '12/3/08', '12/4/08', '12/5/08', '12/10/08', '12/11/08',
'12/12/08', '12/15/08', '1/6/09', '1/7/09', '1/8/09', '1/12/09',
'1/13/09', '1/14/09', '1/16/09', '11/16/09', '11/17/09',
'11/18/09', '11/19/09', '11/25/09', '11/30/09', '12/1/09',
'12/8/09', '12/9/09', '12/10/09', '12/11/09', '12/14/09',
'12/15/09', '12/16/09', '12/17/09', '1/21/09', '1/22/09',
'1/23/09', '9/8/09', '9/10/09', '9/11/09', '9/14/09', '9/15/09',
'9/17/09', '9/18/09', '9/21/09', '9/22/09', '9/24/09', '9/25/09',
'9/28/09', '9/29/09', '10/2/09', '10/6/09', '10/9/09', '10/13/09'],
dtype=object))
Groupby: Split-Apply-Combine:#
It’s pretty typical in a dataset like this that we want to do some analysis for each subset of the data, however that subset is defined. Pandas makes this very easy:
# reading left to right:
# group the data by subject,
# take the accuracy and response time columns,
# compute the mean of each
df.groupby('Sub')[['Accuracy', 'RT']].mean()
| Sub     | Accuracy | RT       |
|---------|----------|----------|
| eduardo | 0.766854 | 3.534282 |
| felipe  | 0.593519 | 1.147807 |
| pedro   | 0.763109 | 1.454616 |
| sancho  | 0.661111 | 1.890181 |
df.groupby(['Sub', 'Surface Area'])[['Accuracy', 'RT']].mean()
| Sub     | Surface Area | Accuracy | RT       |
|---------|--------------|----------|----------|
| eduardo | congruent    | 0.732210 | 4.433112 |
|         | equal        | 0.801498 | 2.635451 |
| felipe  | congruent    | 0.594444 | 1.150126 |
|         | equal        | 0.592593 | 1.145489 |
| pedro   | congruent    | 0.762172 | 1.458337 |
|         | equal        | 0.764045 | 1.450895 |
| sancho  | congruent    | 0.657407 | 1.758446 |
|         | equal        | 0.664815 | 2.021917 |
groupby has much more sophisticated behavior than this (if you want to group by something other than a specific set of columns, you can supply your own criterion), which you can read about here.
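As a quick sketch of that flexibility (the "close versus far" grouping below is invented for illustration, built from the NumA and NumB columns shown earlier), we can group by any computed key rather than an existing column:
# group trials by a computed key: whether the two numerosities are close together
# (ratio of smaller to larger above 0.5) or far apart
ratio = df[['NumA', 'NumB']].min(axis=1) / df[['NumA', 'NumB']].max(axis=1)
df.groupby(ratio > 0.5)[['Accuracy', 'RT']].mean()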
In addition, we can plot things like reaction time distributions:
import matplotlib.pyplot as plt
%matplotlib inline
df[['Sub', 'RT']].boxplot(by='Sub');
df['RT'].hist(by=df['Sub'], bins=100);
Pandas plotting is best for quick and dirty plots; if we want to do better, we need to dive more into Matplotlib or Seaborn. We’ll see how to prettify our outputs later on.
But we can plot all on the same axis if we simply tell Pandas which axis to plot into.
So here’s our strategy:
- create an axis object to plot into (gca = get current axis)
- split the RT portion of the dataframe into groups using groupby
- iterate over these groups (the iterator gives us a name and a dataframe for each group)
- call plot on each dataframe, passing the name as the label and the axis we want to reuse
ax = plt.figure().gca()
# now we're going to use iteration!
# the grouped dataframe is a collection of (key, dataframe) tuples
for name, grp in df.groupby('Sub'):
    # plot repeatedly into the same axes
    grp['RT'].plot(kind='density', ax=ax, label=name.capitalize());
plt.legend(); # draw plot legend
# adjust x limits of plot
plt.xlim(-1, 10);
So we’ve seen that we can do some neat things with this individual dataset. In fact, we’d like to do these analyses and aggregate across all datasets.
Here's the plan:
- load each dataset in turn
- get the average RT and Accuracy for each animal, store it in a dataframe
- plot the RT curve for each animal
- load the next dataset, repeat
Multiple datasets: pulling it together:#
Let’s try to combine the above code into a single chunk of code. We’ll iterate over data files and simply repeat the same code each time. (Note how we made a good decision in encoding the file name in a variable we can change instead of hard coding it.)
# make an empty list to hold each summary dataframe
df_pieces = []
ax = plt.figure().gca() # make a figure and get its current axis object
# iterate over datfiles
for f in datfiles:
    fname = os.path.join(fullpath, f)
    df = pd.read_csv(fname, index_col=0)
    mean_data = df.groupby('Sub')[['Accuracy', 'RT']].mean()
    df_pieces.append(mean_data)
    for name, grp in df.groupby('Sub'):
        grp['RT'].plot(kind='density', ax=ax, label=name.capitalize());
plt.xlim(0, 6)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
combined_data = pd.concat(df_pieces)
combined_data.head()
| Sub     | Accuracy | RT       |
|---------|----------|----------|
| eduardo | 0.766854 | 3.534282 |
| felipe  | 0.593519 | 1.147807 |
| pedro   | 0.763109 | 1.454616 |
| sancho  | 0.661111 | 1.890181 |
| hopkins | 0.700803 | 2.023438 |
Note that we basically just copied over the code from before. (For the slick arguments to legend that put the box outside and to the right, see the example here.)
Building code that lasts#
The above chunk of code is pretty nifty. It works, it produces good output, and it's something we can come back and run in six months to reproduce that figure.
But how well will you understand that code in six months? What if you need to change it? What if we'd like to reuse the code elsewhere? Typically, researchers use a few approaches:
generate plots interactively when needed; don’t bother with a script
modify this script as needed to produce new output
cut and paste from this script when you need to do something similar
The first of these is a terrible idea. The other two are less bad, but they still have disadvantages:
if you modify this script, you need to remember what you modified and where so that you can produce the original figure again
if you cut and paste, and you later improve the code or find a bug, you need to remember all the places you cut and pasted, correct the code, and re-run
if you cut and paste, your code will contain lots of repetition; it will be harder to see how what you’re doing differs across scripts
The strategy that good coders use to surmount these difficulties is code reuse. There are lots of ways to reuse code, but the oldest and arguably best is to modularize our code by writing functions. Modular code is built up from smaller subunits; originally these units were scripts, but over time functions have become the preferred building block. Functions are like scripts in that they are named sections of code, but they have a few advantages over scripts, as we will see.
Modularization strategy#
For scientists, the path to modularization generally takes this form:
start by exploring data interactively in the console or a notebook
tidy up the code in a notebook that illustrates a particular analysis
when you start to see chunks of code that do a single (non-obvious) task, collect those chunks into functions
rewrite the analysis to call the functions
remove the functions from the notebook and put them into modules that can be imported
The emphasis here is first on deciding what we want to do (exploring analyses), getting it working (illustrating in a notebook), and only lastly on making our code cleaner and more reusable. The same goes for making our code faster, which comes as a last step. As you become a better programmer, you will develop the ability to think about reuse and speed from the early stages, but even very good coders can be bad guessers at how best to design things early on.
def get_data_files(pathparts):
"""
This function takes an iterable of path parts (directories),
finds all files in that directory, and returns a list of those files.
"""
import os
fullpath = os.path.join(*pathparts)
datfiles = os.listdir(fullpath)
# now add the fullpath to each of these file names so
# we output a list of absolute paths
output_list = [os.path.join(fullpath, f) for f in datfiles] # whoa!
return output_list
print(get_data_files(pathparts)) # should work as before
print(get_data_files(list(pathparts))) # even works if the input is a list
['data/primates/Mongoose.csv', 'data/primates/Black.csv', 'data/primates/macaque.csv', 'data/primates/trained_Macaque.csv', 'data/primates/Catta.csv']
['data/primates/Mongoose.csv', 'data/primates/Black.csv', 'data/primates/macaque.csv', 'data/primates/trained_Macaque.csv', 'data/primates/Catta.csv']
Note that Python is smart enough to handle a list here, since the * operator will convert any iterable object (one that can be stepped through) into a tuple and then unpack as normal. Also, we used a fancy trick inside the function called a list comprehension, which makes it easy to do some operations where we would normally have to iterate (i.e., use a for loop).
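Here is a small sketch of both ideas in isolation (the values are toy examples chosen for illustration):
import os
parts = ['data', 'primates']   # a list works just as well as a tuple
print(os.path.join(*parts))    # * unpacks the list into separate arguments
# a list comprehension is shorthand for building up a list with a for loop
squares = [n ** 2 for n in range(5)]
print(squares)                 # [0, 1, 4, 9, 16]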
And we can define a couple of other functions:
def extract_data(df):
"""
Calculate the mean RT and Accuracy per subject for the dataframe df.
Return result as a data frame.
"""
groupvar = 'Sub'
colvars = ['Accuracy', 'RT']
return df.groupby(groupvar)[colvars].mean()
def plot_RT_dist(df, ax):
"""
Given a file name and axis object, plot the RT distribution for
each animal in the file into the axis object.
"""
groupvar = 'Sub'
colvar = 'RT'
for name, grp in df.groupby(groupvar):
grp[colvar].plot(kind='density', ax=ax, label=name.capitalize());
return ax
Now let’s use those functions to put together the entire analysis into a single function.
Note how much easier this is to read than the code before. Even though it's more lines, calling functions with descriptive names like plot_RT_dist makes the goal we're trying to achieve clearer.
def do_all_analysis(files):
"""
This function plots the reaction time density for each subject in each file
contained in the iterable files. It also calculates the mean accuracy and
reaction time for each subject and returns these in a data frame.
Files should be full file paths.
"""
import matplotlib.pyplot as plt
import pandas as pd
df_pieces = []
ax = plt.figure().gca()
for f in files:
# read in data
df = pd.read_csv(f, index_col=0)
# process summary data from df
summary_data = extract_data(df)
df_pieces.append(summary_data)
# plot Reaction Time distribution
plot_RT_dist(df, ax)
# add legend to figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
# get figure corresponding to axis
fig = ax.get_figure()
# concatenate all extracted dataframe pieces into one
combined_data = pd.concat(df_pieces)
# now return a tuple with the combined data frame and the figure object
return combined_data, fig
flist = get_data_files(pathparts)
summary_data, fig = do_all_analysis(flist)
plt.xlim(0, 6);
summary_data
| Sub       | Accuracy | RT       |
|-----------|----------|----------|
| eduardo   | 0.766854 | 3.534282 |
| felipe    | 0.593519 | 1.147807 |
| pedro     | 0.763109 | 1.454616 |
| sancho    | 0.661111 | 1.890181 |
| hopkins   | 0.700803 | 2.023438 |
| quinn     | 0.749074 | 1.345008 |
| redford   | 0.597222 | 1.260423 |
| tarantino | 0.725096 | 1.687411 |
| broome    | 0.751622 | 0.825900 |
| huxley    | 0.738889 | 2.045129 |
| solly     | 0.733333 | 1.271243 |
| yerkes    | 0.636905 | 0.770189 |
| feinstein | 0.910185 | 0.843286 |
| mikulski  | 0.846296 | 0.583694 |
| schroeder | 0.726852 | 0.978447 |
| agathon   | 0.725000 | 3.105548 |
| berisades | 0.727778 | 0.798131 |
| capnlee   | 0.664193 | 0.826445 |
| licinius  | 0.704630 | 1.454208 |
Writing your own modules:#
First, let’s download the module:
!gdown 1keZWN340R0KT3vxF9ul-2xb4avYRR31v --quiet
Tip
This is an alternate way to run gdown for downloading files from Google Drive. It uses the same gdown package we imported above, but the ! at the beginning of the line tells Jupyter to run this at the command line rather than in Python.
In lemurs.py (the file we just downloaded), I've extracted this code into its own standalone module. That is, I've excerpted the functions into their own file. There are at least three ways we can use this separate code file:
- Because the file has a line if __name__ == '__main__':, the code will check whether it is being run as a standalone program from the command line. In that case, the special variable __name__ has the value '__main__', and the code following the colon will execute. So we can simply type python lemurs.py at the command line to load the module and run all the code that follows the if statement above.
- Similarly, we can use the %run magic function in the IPython notebook: %run lemurs. This will have the same effect, except we will be able to carry on at the end of the code with all the variables still intact. This is great if we want to pull part of our analysis into a separate file but try out new ideas in the notebook from the place the code left off.
- Finally, if we just want to make use of some function in the module, we can do import lemurs, which loads all of the definitions from the module into memory where we can call them. This is exactly what we did in importing pandas or matplotlib, but this time with our own code!
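To make the first of these concrete, here is a rough sketch of the shape such a file can take; this shows the general pattern, not the actual contents of lemurs.py:
# sketch_module.py -- a hypothetical module following the same pattern
import os
def get_data_files(pathparts):
    """Return the full path of every file in the directory given by pathparts."""
    fullpath = os.path.join(*pathparts)
    return [os.path.join(fullpath, f) for f in os.listdir(fullpath)]
# ...other function definitions would go here...
if __name__ == '__main__':
    # this block runs only when the file is executed directly,
    # e.g. with python sketch_module.py, not when it is imported
    print(get_data_files(('data', 'primates')))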
%run lemurs
<Figure size 640x480 with 0 Axes>
import lemurs
dir(lemurs)
['__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'do_all_analysis',
'extract_data',
'get_data_files',
'os',
'pd',
'plot_RT_dist',
'plt']
lemurs.get_data_files(pathparts)
['data/primates/Mongoose.csv',
'data/primates/Black.csv',
'data/primates/macaque.csv',
'data/primates/trained_Macaque.csv',
'data/primates/Catta.csv']
As we've seen, Python and its ecosystem provide a powerful set of tools for moving quickly from prototyping in a notebook to building modular programs that automate data analysis. As projects get bigger and their pipelines grow more complex, this workflow is essential for producing maintainable code and reproducible results.