Basic Programming in Python: Data Structures

Effective Python programming starts with knowing how to use Python’s built-in data types. We’re going to focus on the most important cases: lists, dictionaries, tuples, and strings. Each has its own important uses, but thankfully, Python prizes consistency in its interfaces: knowing how to use one will give good intuition for how to use the others.

Today, we’re going to cover the basics of data types in Python and use simple programming structures to repeat an analysis on multiple data sets. Lastly, we will learn about refactoring our code to make it easier to read and more reusable.

Lists

Lists are one of the fundamental data types in Python. A list is, well, a list of objects:

a = 1.5
b = 'spam'
mylist = [2, 7, a, 'eggs', b]
print(mylist)
[2, 7, 1.5, 'eggs', 'spam']

Lists are defined using square brackets and can contain values of any type, separated by commas. If you’re used to Matlab, lists are like cell arrays, but a lot more flexible. In the R world, lists are, well, lists.

Again, elements of a list can be anything, even other lists:

another_list = ['pooh', 'piglet']
mylist.append(another_list)  # we can use this to append to a list (in-place)
print(mylist)
print(len(mylist))

mylist = mylist + [1, 1.7, -2]  # concatenation is easy!
print(mylist)
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet']]
6
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2]
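
The "(in-place)" comment above is worth a closer look. Here is a minimal sketch, using made-up variable names, of how append changes the existing list while + builds a new one:

```python
# Illustrative data; these names are not part of the lesson's running example.
original = [1, 2, 3]
alias = original              # a second name for the *same* list object

original.append(4)            # in-place: the change is visible through both names
print(alias)                  # [1, 2, 3, 4]

combined = original + [5]     # concatenation builds a brand-new list
print(original)               # [1, 2, 3, 4]  (unchanged by the +)
print(combined)               # [1, 2, 3, 4, 5]
print(combined is original)   # False
```

This is why the concatenation example had to assign the result back to mylist: without the assignment, the new list would simply be thrown away.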

We can access the elements of a list using square brackets. Python is a very consistent language, so the pattern “get an item using square brackets” will show up over and over again.

print(mylist[0])  # first element
print(mylist[2])  # third element
print(mylist[-1])  # last element
print(mylist[-2])  # penultimate element
2
1.5
-2
1.7

We can also use a technique called slicing to get subsequences from the list:

print(mylist[:])  # all elements
print(mylist[1:3])  # indices >= 1 and < 3 (note that index 3 is *not* included)
print(mylist[:-1])  # all but last element
print(mylist[3:])  # elements 3 to end
print(mylist[::2])  # every other element
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2]
[7, 1.5]
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7]
['eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2]
[2, 1.5, 'spam', 1, -2]

Note that we can use a slice inside the brackets. A slice has the form start:stop:step. The start element is included, but the stop element is not. Any of these parts can be omitted, in which case (for a positive step):

  • start is assumed to be 0

  • stop is assumed to be len(mylist)

  • step is assumed to be 1
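
As a quick check of these defaults, here is a small sketch with an illustrative list. It assumes a positive step, and uses the built-in slice function, which creates the same object the colon notation stands for:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']   # illustrative data

# Writing out every default gives the same result as omitting them all.
full = letters[0:len(letters):1]
print(letters[:] == full)                  # True

# The colon notation is shorthand for a built-in slice object.
every_other = slice(1, 5, 2)               # same as 1:5:2
print(letters[every_other])                # ['b', 'd']
print(letters[1:5:2])                      # ['b', 'd']
```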

Note: If you’re coming from Matlab or R, keep in mind that indexing in Python is 0-based, not 1-based. Which is better is a sort of holy war in programming languages. R, Matlab, and Julia are all 1-based; most other widely used languages (C, Java, and JavaScript, for example) are 0-based. Zero-based indexing makes some kinds of expressions easier to write, but there’s no difference in what you can accomplish with the language.

Exercises

(Some of this is just self-study Googling)

  1. Can you think of a way to reverse a list using slicing (i.e., not using the reverse method or reversed command)?

  2. If mylist is a list, what’s the difference between mylist.append and mylist.extend? (Hint: what if the argument to these functions is another list?)

Tuples: Lists that refuse to change

A tuple looks a lot like a list without the square brackets:

mylist = ['a', 'b', 'c', 'd', 'e']
mytup = 'a', 'b', 'c', 'd', 'e'

print(mylist)
print(mytup)
['a', 'b', 'c', 'd', 'e']
('a', 'b', 'c', 'd', 'e')

You can see that tuples are printed with parentheses. These are not required when defining a tuple, but they make the syntax easier to read. In fact, we will often define tuples this way:

another_tup = (1, 2, 3)
one_element_tuple = 'a',

print(another_tup)
print(one_element_tuple)
(1, 2, 3)
('a',)
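
The one-element case trips people up, so here is a short sketch (with made-up names) showing that it is the comma, not the parentheses, that makes a tuple. It also shows unpacking, a common way to pull a tuple apart:

```python
not_a_tuple = ('a')       # parentheses alone just group an expression
is_a_tuple = ('a',)       # the trailing comma is what makes the tuple

print(type(not_a_tuple).__name__)    # str
print(type(is_a_tuple).__name__)     # tuple

# A tuple can be unpacked into separate names in one assignment.
first, second, third = (1, 2, 3)
print(first, third)                  # 1 3
```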

Lists and tuples behave a lot alike:

print(len(mylist))
print(len(mytup))
5
5
print(mylist[1])
print(mytup[1])

mylist[1] == mytup[1]
b
b
True
print(mylist[2:])
print(mytup[2:])
['c', 'd', 'e']
('c', 'd', 'e')

But there is one important way in which tuples and lists differ: tuples are immutable. This means that you cannot add to, delete from, or change a tuple. Once created, its contents cannot be altered.

mylist[-1] = 'egg'
print(mylist)

mytup[-1] = 'egg'
print(mytup)
['a', 'b', 'c', 'd', 'egg']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 4
      1 mylist[-1] = 'egg'
      2 print(mylist)
----> 4 mytup[-1] = 'egg'
      5 print(mytup)

TypeError: 'tuple' object does not support item assignment

Among other things, this means that tuples cannot be sorted or reversed in place (they have no sort or reverse methods). Using the + operator still works, since it creates a new tuple, but only when both operands are tuples:

print(mylist + ['f', 'g', 'h'])
print(mytup + ['f', 'g', 'h'])
['a', 'b', 'c', 'd', 'egg', 'f', 'g', 'h']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 2
      1 print(mylist + ['f', 'g', 'h'])
----> 2 print(mytup + ['f', 'g', 'h'])

TypeError: can only concatenate tuple (not "list") to tuple

Of course, we can’t add lists and tuples! Python wouldn’t know what we wanted the result to be. But since lists and tuples are so similar, we can convert them to one another using their constructors:

print(list(mytup))
print(tuple(mylist))

print(mytup + tuple(['f', 'g', 'h']))
['a', 'b', 'c', 'd', 'e']
('a', 'b', 'c', 'd', 'egg')
('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h')

This will often be handy when we have data in some type of sequence, that is, an ordered collection that we can iterate over (more on that below).
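
As one example of where converting is handy, here is a sketch with illustrative data. The built-in sorted and reversed functions accept any sequence, and their results can be converted back into a tuple:

```python
scores = (3, 1, 2)                   # illustrative data

print(sorted(scores))                # [1, 2, 3]  (sorted always returns a new list)
print(tuple(sorted(scores)))         # (1, 2, 3)
print(tuple(reversed(scores)))       # (2, 1, 3)
print(scores)                        # (3, 1, 2)  (the original is untouched)
```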

Exercise

  1. Really a conceptual question: Lists seem pretty good. Why do we even need tuples?

Strings

Believe it or not, strings are also sequences in Python. They are a lot like both lists and tuples:

mystr = 'like a sir'
print(mystr)
like a sir
print(mystr[5:])
print(list(mystr))
a sir
['l', 'i', 'k', 'e', ' ', 'a', ' ', 's', 'i', 'r']

In fact, we can think of a string as just like a list or tuple with some special functions attached that are particularly useful for strings:

print(mystr.upper())
print(mystr.capitalize())
print(mystr.split(' '))  # that is, split the string on the spaces
LIKE A SIR
Like a sir
['like', 'a', 'sir']

But strings, like tuples, are immutable.

mystr[-1] = 'n'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 mystr[-1] = 'n'

TypeError: 'str' object does not support item assignment
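
Since item assignment fails, one workaround is to build a new string out of pieces of the old one. A minimal sketch, using a hypothetical variable holding the same text as mystr:

```python
greeting = 'like a sir'            # same text as mystr above
changed = greeting[:-1] + 'n'      # everything but the last character, plus 'n'

print(changed)                     # like a sin
print(greeting)                    # like a sir  (the original is unchanged)
```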

Even so, operations like reversing and sorting are supported. They just return a new object (a new string for the slice, a list of characters for sorted), leaving the old string alone:

print(mystr[::-1])
print(sorted(mystr))
print(''.join(sorted(mystr)))
ris a ekil
[' ', ' ', 'a', 'e', 'i', 'i', 'k', 'l', 'r', 's']
  aeiiklrs

Note

If the join syntax looks funky to you, you’re not alone. We might like to write

print(sorted(mystr).join(''))

meaning we want to join the elements of the list with no space in between, but remember that we called

mystr.split(' ')

and since split is a thing we do to strings, join must be a thing we do to strings, not lists. As a result, the pattern for merging a list of strings is to call join on the string we want in between the elements of the list:

print('-'.join(['trick', 'or', 'treaters']))
trick-or-treaters

Go figure.
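
One way to make peace with the syntax is to see split and join as inverses of each other. A small sketch with an illustrative phrase:

```python
phrase = 'trick or treaters'       # illustrative text
words = phrase.split(' ')          # string -> list of strings
print(words)                       # ['trick', 'or', 'treaters']

rejoined = ' '.join(words)         # list of strings -> string
print(rejoined == phrase)          # True
```

In both calls, the separator string is the thing we call the method on or pass in, and the list is what comes out or goes in.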

Tip

If you use Python to deal with non-Latin character sets (e.g., Arabic, Korean), handling strings gets more complex, because some characters take more bytes than others to specify. Rest assured, Python has full support for Unicode:

# Note *two* code points: one for thumbs up, one for skin tone
print("\U0001F44D\U0001F3FF")
👍🏿

But the moral is that if you need this complexity, better read up.
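
To make the "more bytes" point concrete, here is a sketch that encodes a few sample characters as UTF-8 (one common encoding, chosen here as an assumption) and counts the bytes. Each character is still a single element of the string:

```python
# Sample characters written as escapes: a, e with acute accent,
# a Korean syllable, and the thumbs-up emoji.
samples = ['a', '\u00e9', '\ud55c', '\U0001F44D']

# Each sample is one character long, but its UTF-8 encoding varies in size.
char_counts = [len(ch) for ch in samples]
byte_counts = [len(ch.encode('utf-8')) for ch in samples]

print(char_counts)    # [1, 1, 1, 1]
print(byte_counts)    # [1, 2, 3, 4]

# The emoji with a skin tone modifier is two code points, so its length is 2.
print(len("\U0001F44D\U0001F3FF"))    # 2
```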

Exercises

  1. Take the string "Stu should mind Stu's own business" and replace Stu with John. Your answer should work no matter how many times the name appears in the string, but there are many ways to do this. Some internet searching may be in order.

  2. Say we run an experiment with variables day (2 digits), month (2 digits), year (4 digits), and subject (5 digits). Can you write a line of code that turns these into a unique string separated by underscores (_)? Note that none of the variables above are strings.

Dictionaries

There are times when we might want to store a collection of variables together, but there is no natural ordering for the variables. For instance, a row in a spreadsheet might be converted to a list, but it would be more helpful to have the column name associated with each variable than to get the entries in any particular order. We would much prefer to get the value by the column name than by an index.

This concept of storing not just values (as in a list or tuple) but key-value pairs is realized by the Python dictionary type. Dictionaries underlie much of how Python objects work internally, so they are heavily optimized. Unlike lists, dicts are written with curly braces:

mydict = {'a': 5, 'b': 7}
print(mydict)
{'a': 5, 'b': 7}

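Note that in older versions of Python, the order in which a dictionary printed could differ from the order in which we specified it. Since Python 3.7, dictionaries remember insertion order, but they are still not sequences: entries are looked up by key, not by position. If you need behavior that depends on order (for example, reordering entries or order-sensitive equality), the standard library provides collections.OrderedDict, but needing it often means that a dictionary is not the right data type for your problem.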

Unlike lists or tuples, we get elements from a dictionary by providing a key, which returns the corresponding value:

print(mydict['a'])
print(mydict['b'])
5
7

As with lists, we can add to dictionaries. A dictionary key can be any object that cannot change (technically it must be “hashable”), and its corresponding value can be anything:

# we can add a key, value pair 
mydict[101] = ['a', 0.1, [4, 5, 7]]
mydict[(1, 2)] = 10
print(mydict)
print(len(mydict))
{'a': 5, 'b': 7, 101: ['a', 0.1, [4, 5, 7]], (1, 2): 10}
4
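
To see the “hashable” requirement in action, here is a sketch (with a hypothetical dict name) that tries a tuple and then a list as a key. The exact wording of the error message varies between Python versions, so the code only reports it:

```python
lookup = {}
message = None

lookup[(1, 2)] = 'tuple key works'       # tuples are immutable, so hashable

try:
    lookup[[1, 2]] = 'list key fails'    # lists are mutable, so unhashable
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

print(len(lookup))                        # 1  (only the tuple key was stored)
```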

And because dicts are collections of key-value pairs, we can make dicts from a tuple or list of 2-tuples:

print(dict([('a', 0), ('b', 1), ('last', 'foo')]))
{'a': 0, 'b': 1, 'last': 'foo'}
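
This ties back to the spreadsheet example: if we have column names in one list and a row of values in another, the built-in zip function can pair them up into the 2-tuples that dict expects. A sketch with made-up column names and values:

```python
columns = ['subject', 'age', 'score']    # hypothetical column names
row = [101, 34, 0.87]                    # hypothetical values for one row

record = dict(zip(columns, row))         # zip pairs items up as 2-tuples
print(record)                            # {'subject': 101, 'age': 34, 'score': 0.87}
print(record['age'])                     # 34
```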

Warning

  • Dicts are not sequences (I’m saying this again because it’s important)

  • as a result, dicts can’t be sliced, and we can’t get an element by position

  • we can iterate over a dict (see below); since Python 3.7 the keys come out in insertion order, but in older versions there were no guarantees about which keys come first or last

Exercises

  1. How do we reverse a Python dictionary? That is, how do we get a “reverse lookup” where the values become the keys and vice versa? When is this idea well-defined? (That is, what could go wrong with the process?)

  2. How do we merge two dictionaries? What could go wrong?

Containers and Iteration

Among the most important commonalities that lists, strings, tuples, and dicts all have (and share with data frames and numpy arrays) are containment and iteration. This is one of the best examples of how very different data types in Python can behave very similarly, lessening our need to learn unique syntax for every type of data.

Containment

In Python, we can check whether an element is in a collection with the in keyword:

print(mytup)
print(mylist)
print(mystr)
print(mydict)
('a', 'b', 'c', 'd', 'e')
['a', 'b', 'c', 'd', 'egg']
like a sir
{'a': 5, 'b': 7, 101: ['a', 0.1, [4, 5, 7]], (1, 2): 10}
print('b' in mytup)
print('egg' in mylist)
print('sir' in mystr)
print('ik' in mystr)
print(101 in mydict)
print('a' in mydict)
True
True
True
True
True
True

Note that for dicts, containment checks keys, not values. That is, we can ask whether a key is present in the dictionary, but in on its own will not tell us whether a particular value is stored there.
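
If we do want to search the values, one option is to ask for them explicitly with the dict's values method. A minimal sketch with an illustrative dict:

```python
inventory = {'a': 5, 'b': 7}          # illustrative dict

print('a' in inventory)               # True: 'a' is a key
print(5 in inventory)                 # False: 5 is a value, not a key
print(5 in inventory.values())        # True: search the values explicitly
```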

Iteration

In many cases, we want to perform some logic for every element in a collection. To do so, we need a way of stepping through that collection, looking at one element at a time. In Python, this is done with the for keyword:

for elem in mylist:   # note: this line must end in a colon
    print(elem)        # this line must be indented
a
b
c
d
egg
for char in mystr:    # char is the variable name we give to each element as we step through
    print(char + '-letter')
l-letter
i-letter
k-letter
e-letter
 -letter
a-letter
 -letter
s-letter
i-letter
r-letter
for key in mydict:       # note: iterating over a dict gives us keys
    print(mydict[key])    # every indented line gets repeated
    print('--------')
print(len(mydict))        # this line is not indented, so doesn't get repeated
5
--------
7
--------
['a', 0.1, [4, 5, 7]]
--------
10
--------
4

Almost every data structure in Python can be iterated over, and the ability to do this will allow us to repeat a block of code for each element of a collection. This ability to build code that works for a single element of a collection and easily repeat it is part of the essence of programming.
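
Two built-in helpers come up constantly alongside for loops. The sketch below, using made-up data, shows a dict's items method (which gives key and value together) and the enumerate function (which gives each element along with its position):

```python
ages = {'pooh': 5, 'piglet': 3}          # hypothetical data

# .items() yields (key, value) tuples, unpacked here into two names.
descriptions = []
for name, age in ages.items():
    descriptions.append(name + ' is ' + str(age))
print(descriptions)                      # ['pooh is 5', 'piglet is 3']

# enumerate() pairs each element with its position, starting at 0.
positions = []
for index, name in enumerate(['pooh', 'piglet']):
    positions.append((index, name))
print(positions)                         # [(0, 'pooh'), (1, 'piglet')]
```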

Using Logic

It’s common to want to decide whether or not to run a block of code based on some logical condition. The most basic conditional in Python is the if statement:

if len(mylist) > 2:
    print(mylist[2])
c
testvar = mytup

if isinstance(testvar, list):
    print("variable is a list")
elif isinstance(testvar, tuple):
    print("variable is a tuple")
else:
    print("variable is neither list nor tuple")
variable is a tuple

And we can combine conditions with logical operations:

vowels = 'aeiou'
sentence = 'it was the best of times, it was the worst of times'.split(' ')
print(sentence)

for word in sentence:
    firstletter = word[0]
    if firstletter in vowels or (len(word) > 4):
        print(word.upper())
['it', 'was', 'the', 'best', 'of', 'times,', 'it', 'was', 'the', 'worst', 'of', 'times']
IT
OF
TIMES,
IT
WORST
OF
TIMES
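
The example above used or. Here is a companion sketch, on an illustrative word list, that combines conditions with and and not instead, collecting the matches into a new list rather than printing them:

```python
vowels = 'aeiou'
words = ['it', 'was', 'the', 'best', 'of', 'times']   # illustrative list

short_consonant_words = []
for word in words:
    # keep words that do NOT start with a vowel AND have at most 3 letters
    if word[0] not in vowels and len(word) <= 3:
        short_consonant_words.append(word)

print(short_consonant_words)    # ['was', 'the']
```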

Exercise

Write a block of code that filters a dictionary to create a new one with a subset of keys. The new dictionary should contain only those entries from the old dictionary where the key is a string and starts with a vowel.