Patterns, functions, and duck typing#

In the last section, we looked at the basic ingredients of Python (its data types) and the ways we can use them (iteration and logic, more formally known as “control flow”).

We also touched on the idea that both these ingredients and the way we mix them are designed so as to be consistent. This consistency, that lists behave like tuples behave like dicts (whenever that makes sense), is one of the hallmarks of Python.

This consistency also gives rise to one of the most common patterns or idioms in Python: duck typing. Unlike languages that require you to strictly define what types of data a particular piece of code can work with, Python takes the standpoint that, “If it walks like a duck and quacks like a duck, it’s a duck.” In this paradigm, it doesn’t matter what kind of data you pass to a function so long as that data supports all the necessary operations.
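As a tiny illustration (the function name double is made up here), a function never has to check its argument’s type; it only needs the operations it uses to work:

def double(x):
    # works on anything that supports +, no matter its type
    return x + x

print(double(7))          # 14
print(double('quack'))    # quackquack
print(double([1, 2]))     # [1, 2, 1, 2]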

We’ll explore some of this below.

Patterns#

Programmers are extremely lazy — a certain type of lazy. A good programmer is the kind of person who will spend six hours coding a solution to a problem that saves her six seconds thousands of times.

As a result, since the 90s, programmers have collected and studied several dozen (good) solutions to program-structuring problems that come up over and over again. If you’ve written anything more complicated than a simple script, you might have even come up with one or two of these yourself. Many of them only really apply to larger software projects. These solutions are known as Design Patterns, and they’re the next step up the ladder in your programming education once you feel comfortable writing snippets of code.

But let’s make this concrete. To review, consider:

mylist = ['1', 1, 'pooh', 'piglet', [2, 7, True]]
mytup = ('a', 1, 2, True)
mystr = "we can do this \U0001F389"

As a trivial example of duck typing, note that we don’t have to do anything special to print these variables: it’s the same print function, regardless of the data type.

print(mylist)
print(mytup)
print(mystr)
['1', 1, 'pooh', 'piglet', [2, 7, True]]
('a', 1, 2, True)
we can do this 🎉

Less trivially, all of these are collections (they have a notion of “contained in”) and they are iterable (we can get elements out of them one by one). Because of this, the code to iterate over all these collections and print elements one at a time is identical:

collection = mylist  # you can change this to different collections defined above

for element in collection:
    print(element)
1
1
pooh
piglet
[2, 7, True]

This may seem simple, but it’s because Python has taken one of the most common design patterns — the iterator pattern — and baked it right into the language.
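In fact, you can peek at that machinery yourself: iter() asks a collection for an iterator object, and next() pulls out one element at a time, which is exactly what a for loop does behind the scenes.

it = iter(mytup)   # ask the tuple for an iterator object
print(next(it))    # a
print(next(it))    # 1
# a for loop keeps calling next() like this until StopIteration is raised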

Exercise

We’ve seen that functions like print and the iterator pattern are shared across data types. What are other examples of such similarities?

Functions#

Functions are the single greatest invention in the history of computer programming. By letting us reuse code, functions give us greater abstraction (as we’ll see), more readable code, and greater modularity (you don’t need to know the inner details of everything).

More explicitly, functions are named blocks of code with inputs and a single output (though we can get around this restriction). To define a function, we can use the def keyword:

def myfunc(x):
    print(x + 1)
    return 2 * x

print(myfunc(4))
y = myfunc(-7)
print(y)
5
8
-6
-14

Here, def says we are about to define a function. This keyword is followed by the name of the function and a list of its arguments in parentheses. Python has several neat features in the way arguments are defined, including the ability to take arguments by name, to leave the number of arguments unspecified, and to give default values to certain arguments.
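Here is a small sketch of those features (the function and argument names are invented purely for illustration):

def flexible(a, b=10, *rest, **named):
    print(a, b)    # b falls back to its default value if not supplied
    print(rest)    # extra positional arguments arrive as a tuple
    print(named)   # extra keyword arguments arrive as a dict

flexible(1)                        # 1 10, then (), then {}
flexible(1, 2, 3, 4, color='red')  # 1 2, then (3, 4), then {'color': 'red'}
flexible(b=5, a=0)                 # arguments passed by name, in any order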

Finally, the return keyword specifies the output of the function. Note that, like for, the line defining the function ends in a colon and the entire function body is indented.

def anotherfunc(x, y=2):  # y has the default value 2
    z = x ** y  # x to the power y
    return z / 2.0

print(anotherfunc(4, 0.5))  # here we specify both x and y
print(anotherfunc(4))  # here we specify only x, so y = 2
1.0
8.0

Understanding variable scope#

Functions are like black boxes. The information that gets passed into them is bound to the input variable name, but this variable only exists while the function is running. This is a tricky topic that goes under the name of variable scoping, but the following examples illustrate that you have to be careful about what information is and isn’t being passed into a function.

x = 'foo'
print('x = ' + x)

# appending an empty string seems like a sensible default
def reverser(x, appender=''):  
    """
    This is a docstring. It tells us what the function does. 
    This function reverses its input and appends its second argument to the end.
    """
    print('x = ' + x)
    return x[::-1] + appender

print(help(reverser))

print(reverser('bar'))
print(reverser('elephant', ' monkey'))

print('x = ' + x)
x = foo
Help on function reverser in module __main__:

reverser(x, appender='')
    This is a docstring. It tells us what the function does.
    This function reverses its input and appends its second argument to the end.

None
x = bar
rab
x = elephant
tnahpele monkey
x = foo

Note that the value of x inside the function had nothing to do with the value of x outside the function. Within the function, x took on the value of whatever we passed in as the first argument of reverser. When the function returned, x was restored to its original value.

This may seem confusing, but we actually want this behavior. The fact that variables defined within the function live and die inside the function means that we can use functions without worrying that they will overwrite variables we ourselves define. Imagine if you used a function that had an argument x or defined a variable data. You may well have these variables running around in your own code, and scoping makes sure that you don’t need to worry about someone else’s function overwriting them.
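To make that concrete, here is a minimal sketch (the variable name data is chosen just to echo the example above):

def summarize():
    data = [1, 2, 3]   # local to the function: created when it runs, gone when it returns
    return sum(data)

print(summarize())     # 6
# print(data)          # would raise a NameError: data never existed out here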

Functions take inputs, perform their work, and return outputs. You don’t have to know what they are doing under the hood, and your own functions should play nice in the same way.

What makes a good function?#

Some things to consider:

  • functions do one thing

  • functions give us the chance to replace confusing behavior with clearly named behavior

  • functions allow us to obey the DRY principle (don’t repeat yourself)

  • functions can call other functions (see the short sketch after this list)
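As a tiny, hypothetical sketch of these ideas, the helper below does one clearly named thing, and the second function reuses it rather than repeating the arithmetic:

def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp_f - 32) * 5 / 9

def average_temp_celsius(temps_f):
    """Average a list of Fahrenheit readings, reported in Celsius."""
    total = 0
    for t in temps_f:
        total = total + fahrenheit_to_celsius(t)
    return total / len(temps_f)

print(average_temp_celsius([32, 212, 98.6]))  # about 45.67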

So how about we practice by writing some code?

Handling multiple data sets:#

To illustrate these features, we’ll be examining data from a comparative study of cognitive ability between macaques and lemurs. The data are available online, but to make things easy, we will create a directory here on Colab and download the files directly from Google Drive. After that, we will assume that these data live in a directory data/primates inside the working directory.

To accomplish that, we’ll use the os library:

import os
if not os.path.isdir('data/primates'):
    os.makedirs('data/primates')

Then we’ll use the gdown package to download the files directly from Google Drive:

import gdown
data_urls = [
    'https://drive.google.com/uc?export=download&id=1t-USz89jo4DPuyxUcr6DzX1FKdEeIOl6',
    'https://drive.google.com/uc?export=download&id=1sZNoxO8Hf4sw6-dvtOKz2TYAydVnihHo',
    'https://drive.google.com/uc?export=download&id=1lnl5zGDSWHZLKuiiKgPmwPuKBGGIv5AF',
    'https://drive.google.com/uc?export=download&id=1eBYozAifLLbBfMMdU5CNfYTGuDzH4YnO',
    'https://drive.google.com/uc?export=download&id=1QKRi0HdKoRDAUOdQJ05ozh065KN0D-IN'
]
data_names = [
    'macaque.csv',
    'trained_Macaque.csv',
    'Black.csv',
    'Mongoose.csv',
    'Catta.csv'
]
prefix = 'data/primates/'
for url, name in zip(data_urls, data_names):
    gdown.download(url, prefix + name, quiet=True) 

Warning

If you don’t already have a package installed, lines like

import gdown

above will produce errors. In this case, you can install them by creating a new cell and running, e.g.,

!pip install gdown

In a code cell, a leading ! (often read “bang”) tells Jupyter to run the rest of the line in the shell, as if it had been typed into the terminal. pip is Python’s package manager; it will try to find and download the relevant package and its dependencies.

Normally, we could just use the %ls magic to get the list of files in a given directory:

%ls data/primates
Black.csv  Catta.csv  Mongoose.csv  macaque.csv  trained_Macaque.csv

But since we eventually want to move over to pure Python, we will use the os library, which gives us operating system commands.

pathparts = ('data', 'primates')

# this command will work on both Windows and Mac/Unix
# the * expands the tuple, so it's as if we'd written
# os.path.join('data', 'primates')
fullpath = os.path.join(*pathparts)

print(fullpath)
data/primates
datfiles = os.listdir(fullpath) # note: we're not guaranteed an order here
print(datfiles)  
['Mongoose.csv', 'Black.csv', 'macaque.csv', 'trained_Macaque.csv', 'Catta.csv']

Our first order of business is to figure out our analysis from a single dataset.

Warning

The analysis below uses DataFrames. We’ll do more on those later. For now, just see what’s possible and play along. The important thing is how we’ll reorganize (or “refactor”) the code below.

We’ll load the csv file (which you can view as a spreadsheet) as a Pandas DataFrame.

# make the filename by joining with the path (works cross-platform!)
fname = os.path.join(fullpath, datfiles[0])  

import pandas as pd
# df is short for dataframe
# in code with a lot of dataframes, we would choose a more descriptive name
# index_col=0 says the first column of the file is row names 
df = pd.read_csv(fname, index_col=0)  

df.head()
           Sub   Species     Date  Block (approx 72 trials)  Trial  NumA  NumB  Accuracy     RT  Surface Area
12765  eduardo  Mongoose  12/2/08                       1.0      1     4     5         1  1.183         equal
12766  eduardo  Mongoose  12/2/08                       1.0      2     3     9         0  0.883         equal
12767  eduardo  Mongoose  12/2/08                       1.0      3     4     7         1  2.750     congruent
12768  eduardo  Mongoose  12/2/08                       1.0      4     5     7         1  1.000     congruent
12769  eduardo  Mongoose  12/2/08                       1.0      5     2     3         0  1.250     congruent

We can find out some interesting things:

df['Sub'].unique()
array(['eduardo', 'felipe', 'pedro', 'sancho'], dtype=object)
df['Species'].unique(), df['Date'].unique()
(array(['Mongoose'], dtype=object),
 array(['12/2/08', '12/3/08', '12/4/08', '12/5/08', '12/10/08', '12/11/08',
        '12/12/08', '12/15/08', '1/6/09', '1/7/09', '1/8/09', '1/12/09',
        '1/13/09', '1/14/09', '1/16/09', '11/16/09', '11/17/09',
        '11/18/09', '11/19/09', '11/25/09', '11/30/09', '12/1/09',
        '12/8/09', '12/9/09', '12/10/09', '12/11/09', '12/14/09',
        '12/15/09', '12/16/09', '12/17/09', '1/21/09', '1/22/09',
        '1/23/09', '9/8/09', '9/10/09', '9/11/09', '9/14/09', '9/15/09',
        '9/17/09', '9/18/09', '9/21/09', '9/22/09', '9/24/09', '9/25/09',
        '9/28/09', '9/29/09', '10/2/09', '10/6/09', '10/9/09', '10/13/09'],
       dtype=object))

Groupby: Split-Apply-Combine:#

It’s pretty typical in a dataset like this that we want to do some analysis for each subset of the data, however that subset is defined. Pandas makes this very easy:

# reading left to right: 
# group the data by subject, 
# take the accuracy and response time columns,
# compute the mean of each

df.groupby('Sub')[['Accuracy', 'RT']].mean()
         Accuracy        RT
Sub
eduardo  0.766854  3.534282
felipe   0.593519  1.147807
pedro    0.763109  1.454616
sancho   0.661111  1.890181
df.groupby(['Sub', 'Surface Area'])[['Accuracy', 'RT']].mean()
                      Accuracy        RT
Sub     Surface Area
eduardo congruent     0.732210  4.433112
        equal         0.801498  2.635451
felipe  congruent     0.594444  1.150126
        equal         0.592593  1.145489
pedro   congruent     0.762172  1.458337
        equal         0.764045  1.450895
sancho  congruent     0.657407  1.758446
        equal         0.664815  2.021917

groupby has much more sophisticated behavior than this (if you want to group by something other than a specific set of columns, you can supply your own criterion), which you can read about here.
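As a quick sketch of that flexibility (the 0.5 cutoff and the name hard_ratio are arbitrary, purely for illustration), you can group by any Series you compute on the fly, not just by existing columns:

# group trials by whether the ratio of the two numerosities exceeds 0.5
hard_ratio = (df['NumA'] / df['NumB']) > 0.5
df.groupby(hard_ratio)[['Accuracy', 'RT']].mean()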

In addition, we can plot things like reaction time distributions:

import matplotlib.pyplot as plt
%matplotlib inline
df[['Sub', 'RT']].boxplot(by='Sub');
[Figure: boxplot of RT grouped by subject]
df['RT'].hist(by=df['Sub'], bins=100);
[Figure: histograms of RT, one panel per subject]

Pandas plotting is best for quick and dirty plots; if we want to do better, we need to dive more into Matplotlib or Seaborn. We’ll see how to prettify our outputs later on.

But we can plot all on the same axis if we simply tell Pandas which axis to plot into.

So here’s our strategy:

  • create an axis object to plot into (gca = get current axis)

  • split the RT portion of the dataframe into groups using groupby

  • iterate over these groups (the iterator gives us a name and a dataframe for each group)

  • call plot on each dataframe, passing the name as the label and the axis we want to reuse

ax = plt.figure().gca()

# now we're going to use iteration!
# the grouped dataframe is a collection of (key, dataframe) tuples

for name, grp in df.groupby('Sub'):  
    # plot repeatedly into the same axes
    grp['RT'].plot(kind='density', ax=ax, label=name.capitalize());

plt.legend();  # draw plot legend

# adjust x limits of plot
plt.xlim(-1, 10);  
[Figure: RT density curves for each subject on a shared axis]

So we’ve seen that we can do some neat things with this individual dataset. In fact, we’d like to do these analyses and aggregate across all datasets.

Here’s the plan:

  • load each dataset in turn

  • get the average RT and Accuracy for each animal and store it in a dataframe

  • plot the RT curve for each animal

  • load the next dataset and repeat

Multiple datasets: pulling it together:#

Let’s combine the code above into a single chunk. We’ll iterate over the data files and simply repeat the same code each time. (Note how we made a good decision in storing the file name in a variable we can change instead of hard-coding it.)

# make an empty piece to hold each dataframe
df_pieces = []

ax = plt.figure().gca()  # make a figure and get its current axis object

# iterate over datfiles
for f in datfiles:
    fname = os.path.join(fullpath, f)
    
    df = pd.read_csv(fname, index_col=0)

    mean_data = df.groupby('Sub')[['Accuracy', 'RT']].mean()
    
    df_pieces.append(mean_data)
    
    for name, grp in df.groupby('Sub'):
        grp['RT'].plot(kind='density', ax=ax, label=name.capitalize());
        
plt.xlim(0, 6)

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);

combined_data = pd.concat(df_pieces)

combined_data.head()
         Accuracy        RT
Sub
eduardo  0.766854  3.534282
felipe   0.593519  1.147807
pedro    0.763109  1.454616
sancho   0.661111  1.890181
hopkins  0.700803  2.023438
[Figure: RT density curves for the subjects in all data files, on a shared axis]

Note that we basically just copied over the code from before. (For the slick arguments to legend that put the box outside and to the right, see the example here.)

Building code that lasts#

The above chunk of code is pretty nifty. It works, it produces good output, and it’s something we can come back and run in six months to reproduce that figure.

But how well will you understand that code in six months? What if you need to change it? What if we’d like to reuse the code elsewhere? Typically, researchers use a few approaches:

  • generate plots interactively when needed; don’t bother with a script

  • modify this script as needed to produce new output

  • cut and paste from this script when you need to do something similar

The first of these is a terrible idea. The others are less so, but they still have disadvantages:

  • if you modify this script, you need to remember what you modified and where so that you can produce the original figure again

  • if you cut and paste, and you later improve the code or find a bug, you need to remember all the places you cut and pasted, correct the code, and re-run

  • if you cut and paste, your code will contain lots of repetition; it will be harder to see how what you’re doing differs across scripts

The strategy that good coders use to surmount these difficulties is code reuse. There are lots of ways to reuse code, but the oldest and arguably best is to modularize our code by writing functions. Modular code is built up from smaller subunits. Originally, these units were scripts, but over time the preferred approach has become writing functions. Functions are like scripts in that they are named sections of code, but they have a few advantages over scripts, as we will see.

Modularization strategy#

For scientists, the path to modularization generally takes this form:

  • start by exploring data interactively in the console or a notebook

  • tidy up the code in a notebook that illustrates a particular analysis

  • when you start to see chunks of code that do a single (non-obvious) task, collect those chunks into functions

  • rewrite the analysis to call the functions

  • remove the functions from the notebook and put them into modules that can be imported

The emphasis here is first on deciding what we want to do (exploring analyses), getting it working (illustrating in a notebook), and only lastly on making our code cleaner and more reusable. The same goes for making our code faster, which comes as a last step. As you become a better programmer, you will develop the ability to think about reuse and speed from the early stages, but even very good coders can be bad guessers at how best to design things early on.

def get_data_files(pathparts):
    """
    This function takes an iterable of path parts (directories), 
    finds all files in that directory, and returns a list of those files.
    """
    
    import os
    
    fullpath = os.path.join(*pathparts)

    datfiles = os.listdir(fullpath)
    
    # now add the fullpath to each of these file names so
    # we output a list of absolute paths
    
    output_list = [os.path.join(fullpath, f) for f in datfiles]  # whoa! 
    
    return output_list
print(get_data_files(pathparts))  # should work as before

print(get_data_files(list(pathparts)))  # even works if the input is a list
['data/primates/Mongoose.csv', 'data/primates/Black.csv', 'data/primates/macaque.csv', 'data/primates/trained_Macaque.csv', 'data/primates/Catta.csv']
['data/primates/Mongoose.csv', 'data/primates/Black.csv', 'data/primates/macaque.csv', 'data/primates/trained_Macaque.csv', 'data/primates/Catta.csv']

Note that Python is smart enough to handle a list as well, since the * operator will unpack any iterable object (one that can be stepped through) into separate arguments, just as it does for a tuple.

Also, we used a fancy trick inside called a list comprehension that makes it easy to do some operations where we would normally have to iterate (i.e., use a for loop).
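For the curious, the comprehension inside get_data_files is just shorthand for an explicit loop like this one:

output_list = []
for f in datfiles:
    output_list.append(os.path.join(fullpath, f))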

And we can define a couple of other functions:

def extract_data(df):
    """
    Calculate the mean RT and Accuracy per subject for the dataframe df. 
    Return result as a data frame.
    """
    
    groupvar = 'Sub'
    colvars = ['Accuracy', 'RT']
    
    return df.groupby(groupvar)[colvars].mean()
def plot_RT_dist(df, ax):
    """
    Given a dataframe and an axis object, plot the RT distribution for
    each animal in the dataframe into the axis object.
    """
    
    groupvar = 'Sub'
    colvar = 'RT'
    
    for name, grp in df.groupby(groupvar):
        grp[colvar].plot(kind='density', ax=ax, label=name.capitalize());
        
    return ax

Now let’s use those functions to put together the entire analysis into a single function.

Note how much easier this is to read than the code before. Even though it’s more lines, calling functions with descriptive names like plot_RT_dist makes the goal we’re trying to achieve clearer.

def do_all_analysis(files):
    """
    This function plots the reaction time density for each subject in each file
    contained in the iterable files. It also calculates the mean accuracy and 
    reaction time for each subject and returns these in a data frame.
    Files should be full file paths.
    """
    import matplotlib.pyplot as plt
    import pandas as pd
    
    df_pieces = []
    ax = plt.figure().gca()
    
    for f in files:
        # read in data
        df = pd.read_csv(f, index_col=0)
        
        # process summary data from df
        summary_data = extract_data(df)
        df_pieces.append(summary_data)
        
        # plot Reaction Time distribution
        plot_RT_dist(df, ax)
        
    # add legend to figure
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
    
    # get figure corresponding to axis
    fig = ax.get_figure()  
    
    # concatenate all extracted dataframe pieces into one
    combined_data = pd.concat(df_pieces)
    
    # now return a tuple with the combined data frame and the figure object
    return combined_data, fig 
flist = get_data_files(pathparts)

summary_data, fig = do_all_analysis(flist)

plt.xlim(0, 6);

summary_data
           Accuracy        RT
Sub
eduardo    0.766854  3.534282
felipe     0.593519  1.147807
pedro      0.763109  1.454616
sancho     0.661111  1.890181
hopkins    0.700803  2.023438
quinn      0.749074  1.345008
redford    0.597222  1.260423
tarantino  0.725096  1.687411
broome     0.751622  0.825900
huxley     0.738889  2.045129
solly      0.733333  1.271243
yerkes     0.636905  0.770189
feinstein  0.910185  0.843286
mikulski   0.846296  0.583694
schroeder  0.726852  0.978447
agathon    0.725000  3.105548
berisades  0.727778  0.798131
capnlee    0.664193  0.826445
licinius   0.704630  1.454208
[Figure: RT density curves for all subjects across datasets]

Writing your own modules:#

First, let’s download the module:

!gdown 1keZWN340R0KT3vxF9ul-2xb4avYRR31v --quiet

Tip

This is an alternate way to run gdown for downloading files from Google Drive. It uses the same gdown package we imported above, but the ! at the beginning of the line tells Jupyter to run this at the command line rather than in Python.

In lemurs.py (the file we just downloaded), I’ve extracted this code into its own standalone module. That is, I’ve excerpted the functions into their own file. There are at least three ways we can use this separate code file:

  1. Because the file has a line

    if __name__ == '__main__':
    

    the code checks whether it is being run as a standalone program from the command line. In that case, the special variable __name__ has the value '__main__', and the code following the colon will execute (a hypothetical sketch of that block appears just after this list). We can then simply type

    python lemurs.py
    

    at the command line to load the module and run all the code that follows the if statement above.

  2. Similarly, we can use the %run magic function in the IPython notebook:

    %run lemurs
    

    This will have the same effect, except we will be able to carry on at the end of the code with all the variables still intact. This is great if we want to pull part of our analysis into a separate file but try out new ideas in the notebook from the place the code left off.

  3. Finally, if we just want to make use of some function in the module we can do

    import lemurs
    

    which loads all of the definitions from the module into memory where we can call them. This is exactly what we did in importing pandas or matplotlib, but this time with our own code!
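For reference, the block guarded by if __name__ == '__main__': in lemurs.py plausibly looks something like the sketch below. This is a guess based on the functions we wrote above, not the file’s exact contents.

# hypothetical sketch of the bottom of lemurs.py; the real file may differ
if __name__ == '__main__':
    # this branch runs only when the file is executed directly
    # (python lemurs.py or %run lemurs), not when it is imported
    pathparts = ('data', 'primates')
    flist = get_data_files(pathparts)
    summary_data, fig = do_all_analysis(flist)
    print(summary_data)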

%run lemurs
[Figure: RT density plot produced by running lemurs.py]
import lemurs

dir(lemurs)
['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'do_all_analysis',
 'extract_data',
 'get_data_files',
 'os',
 'pd',
 'plot_RT_dist',
 'plt']
lemurs.get_data_files(pathparts)
['data/primates/Mongoose.csv',
 'data/primates/Black.csv',
 'data/primates/macaque.csv',
 'data/primates/trained_Macaque.csv',
 'data/primates/Catta.csv']

As we’ve seen, Python and its ecosystem provide a powerful set of tools for moving quickly from prototyping in a notebook to building modular programs that automate data analysis. As projects get bigger and their pipelines grow more complex, this workflow is essential for producing maintainable code and reproducible results.