%config InlineBackend.figure_formats = ['svg']

Working with Array Data#

It shouldn’t be hard to make the case to a bunch of scientists that arrays are a natural and important data structure for scientific computing. What we discuss less often is why arrays are actually different from data structures we’ve already met and why they help us compute efficiently.

For these needs the NumPy package provides a general data type, the n-dimensional array or ndarray. This data type is so powerful that it forms the basis for most serious numerical processing in Python, including Pandas, SciPy, and image processing packages like scikit-image.

That’s what we’ll get into today. Specifically, we’ll focus on the syntax needed to get subsets of data into and out of arrays, as well as on basic math, which is the foundation for using the more powerful features available in SciPy.

Warm-up: Lists#

Let’s review some facts about lists.

A list is, well, a list of objects:

a = 1.5
b = 'spam'
mylist = [2, 7, a, 'eggs', b]
print(mylist)
[2, 7, 1.5, 'eggs', 'spam']

Lists are defined using square brackets and can contain objects of any type, separated by commas.

Again, the elements of a list can be anything, even other lists:

another_list = ['pooh', 'piglet']
mylist.append(another_list)  # we can use this to append to a list (in-place)
print(mylist)
print(len(mylist))

mylist = mylist + [1, 1.7, -2]  # concatenation is easy!
print(mylist)
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet']]
6
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2]

We can access the elements of a list using square brackets, just as we did with Pandas. Python is a very consistent language, so the pattern “get an item using square brackets” will show up over and over again.

print(mylist[0])  # first element
print(mylist[2])  # third element
print(mylist[-1])  # last element
print(mylist[-2])  # penultimate element
2
1.5
-2
1.7

We can also use a technique called slicing to get subsequences from the list:

print(mylist[:])  # all elements
print(mylist[1:3])  # elements >= 1 and < 3 (note that element 3 is *not* included)
print(mylist[:-1])  # all but last element
print(mylist[3:])  # elements 3 to end
print(mylist[::2])  # every other element
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2]
[7, 1.5]
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7]
['eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2]
[2, 1.5, 'spam', 1, -2]

Note that we can use a slice inside the brackets, of the form start:stop:step. The start element is included, but the stop element is not. Any of these arguments can be omitted (as in the short check after this list), in which case

  • start is assumed to be 0

  • stop is assumed to be len(mylist)

  • step is assumed to be 1
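
A quick check of those defaults, using mylist from above:

# omitted slice arguments really do fall back to 0, len(mylist), and 1
print(mylist[::2] == mylist[0:len(mylist):2])  # True
print(mylist[1:3] == mylist[1:3:1])            # True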

Warm-up: Lists of lists#

In cases where we have a matrix of data we want to represent, a natural way to do that is as a list of lists:

mydata = [[1, 2, 3], [4, 5, 6]]
print(mydata)
print(len(mydata))
[[1, 2, 3], [4, 5, 6]]
2

This is a nested list. It has two elements:

mydata[0]
[1, 2, 3]

and we can get at individual entries by indexing into the list inside the list:

print(mydata[0][1])
2

We can think of this in analogy with a 2 x 3 matrix, where the first index gives the row, and the second index gives the column.

In this case, mydata[0][1] instructs us to get the 0th element of mydata (the first row), and the 1st element from that (the second column).

However, there are some drawbacks to using lists to represent large arrays of data:

  • Lists are flexible in that the elements of the list can be of any type, but this makes them slower and less memory-efficient than arrays where all the elements are of the same type (e.g., integer or decimal numbers).

  • It’s pretty clear what mydata[0][1] does, but for arrays with many dimensions, we might like to write something simpler, like mydata[0, 1]. Doing this with lists gives an error, as demonstrated just after this list.

  • We would also like to do math with matrices, but since lists can contain anything, Python has no notion of math operations on lists.
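
For instance, a quick check of that second point, using mydata from above:

# lists don't understand the comma syntax: indexing with a tuple fails
try:
    mydata[0, 1]
except TypeError as err:
    print(err)  # list indices must be integers or slices, not tuple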

The NumPy package resolves these issues by defining the ndarray.

The NumPy array#

If we already have a list of lists, NumPy gives us a clear way to convert it to an array:

import numpy as np  # this nickname (numpy = np) is so common as to be ubiquitous

We’ve not really talked about import statements, but I hope you’ve done enough of the reading that this makes sense. Unlike Matlab (but like R), Python requires you to load packages to get specialized functionality. This serves two purposes:

  1. It keeps the core language small. Python has to do a lot of tasks, from scientific computing to running on very small computing devices with limited memory, and there are times when it’s helpful to not load every possible function.

  2. Modules (the name for libraries in Python) also have the benefit of creating “namespaces.” You can think of these like folders: writing np.sort looks in the numpy module for a function called sort. This is unambiguous, and therefore won’t conflict with any other sort function your program has defined. Time and experience have shown that this sort of namespacing, while requiring a few more characters of typing, has a lot of benefits in terms of code readability and sustainability.
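
As a tiny illustration of namespacing (np.sort is NumPy's sort; sorted is the Python built-in):

values = [3, 1, 2]
print(np.sort(values))  # looked up inside the np namespace; returns a sorted array
print(sorted(values))   # the built-in; the two names never collide

With that aside, back to converting our list of lists: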

myarray = np.array(mydata)
print(myarray)
print(type(myarray))
[[1 2 3]
 [4 5 6]]
<class 'numpy.ndarray'>

Note that the myarray variable is a numpy.ndarray, and that it prints a little prettier than the list of lists, making clear that it’s a 2 x 3 matrix. In fact, we can find out the dimensions of an array with the shape attribute. (This will also be true of dataframes. Note again the consistency of syntax across data types in Python.)

print(myarray.shape)
(2, 3)

We can use some of the same syntax as before:

print(myarray[0])
print(myarray[0].shape)
[1 2 3]
(3,)

Just as before, the 0th element is the first row. In this case, however, the 0th element is an array of shape (3,), a vector of length 3. We will see that selecting rows or columns from an array returns a view of that array, with dimensions correspondingly reduced.

Along the same lines:

print(myarray[1])
print(myarray[1][0])
print(myarray[1][1:])
[4 5 6]
4
[5 6]

But we can now use a more natural notation:

print(myarray[1, 0])
print(myarray[1, 1:])
print(myarray[:, 0])
4
[5 6]
[1 4]

And this is just the tip of the iceberg. You can use slicing along any dimension of an array, arrays to index into arrays, and much more. See the SciPy docs for details.
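
Two of those patterns, sketched briefly with myarray from above:

# boolean indexing: keep the elements that satisfy a condition
print(myarray[myarray > 2])     # [3 4 5 6]

# integer ("fancy") indexing: pick out specific (row, column) pairs
print(myarray[[0, 1], [2, 0]])  # elements [0, 2] and [1, 0], i.e. [3 4]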

Note

From time to time, it helps to think about how computers work to understand why some types of code go faster.

For instance, our mental model of a two-dimensional array is as some sort of rectangle with row and column coordinates, but memory on your computer is linearly addressed. That is, the numbers stored in memory are just a big line. Now, for a two-dimensional array, there are two possible (sensible) ways to think about mapping this rectangle to a line:

  1. Row-major order: In this scheme, the elements of the array are listed by going across the first row (index 0), then the second row (index 1), and so on until the matrix is completed. Thus, elements indexed [0, 0] and [0, 1] are located next to each other in memory, while [0, 0] and [1, 0] are not. This is the convention in C and in Python (NumPy’s default).

  2. Column-major order: This is the other option, where the first column is contiguous in memory. This is the convention in FORTRAN, R, Matlab, and Julia.

Why does this matter? Because operations that pull contiguous chunks from memory will be more efficient than those that have to pull non-contiguous chunks. Hence, for instance, the usual advice in Matlab is to put the data you want to operate on together in columns. In some cases, this will have very real performance benefits.

Note that in Python/NumPy, row-major ordering means that expressions like myarray[0,:] or simply myarray[0] pull contiguous chunks from the array. Thus, an added benefit of Python syntax is that there’s a simple rule for organizing your data in arrays for efficient access: arrange data so that accessing the chunks you need requires the fewest indices.
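
A small check of this layout, using the contiguity flags NumPy stores on every array:

layout = np.arange(6).reshape(2, 3)
print(layout.ravel())                      # [0 1 2 3 4 5]: memory order runs across each row
print(layout.flags['C_CONTIGUOUS'])        # True: the array is stored in row-major order
print(layout[0].flags['C_CONTIGUOUS'])     # True: a row is one contiguous chunk of memory
print(layout[:, 0].flags['C_CONTIGUOUS'])  # False: a column is scattered across memory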

Loading data#

But let’s move on to a real example:

These data are in .npy format, which is NumPy’s native array format. NumPy can also save arrays as compressed .npz archives or as (optionally gzipped) plain text, though text output only works for 1- and 2-dimensional arrays. Bear in mind that NumPy and SciPy can also read data from other common formats, including those used by Matlab.
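
A brief sketch of those I/O options (the file names here are made up for illustration):

small = np.arange(6).reshape(2, 3)
np.save('small.npy', small)                # NumPy's native binary format
np.savetxt('small.txt', small)             # plain text; only 1-D and 2-D arrays
np.savez_compressed('small.npz', x=small)  # compressed archive of one or more arrays
print(np.load('small.npy'))

# Matlab .mat files can be read with SciPy (uncomment if you have such a file):
# from scipy.io import loadmat
# mat_contents = loadmat('experiment.mat')  # returns a dict mapping variable names to arrays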

Now let’s load the data

!gdown 1hLP90kH74R_bw_cFHtriT7Y3FpCbKmyb
arr = np.load('arraydata.npy')

Well, that was easy!

Let’s see what we can learn about this variable we just loaded data into:

print(type(arr))
print(arr.shape)
<class 'numpy.ndarray'>
(91, 109, 91)

We see that this type of object is a numpy.ndarray, as expected, and that its shape is given by three numbers. That is, it’s a three-dimensional array!

What does it look like?

print(arr)
[[[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 ...

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]

 [[0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  ...
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]
  [0 0 0 ... 0 0 0]]]

That probably doesn’t seem very helpful. In general, with large arrays, we simply can’t look at all the data. As usual, the way we will address this is by plotting. However, we can’t plot all the dimensions of the array at once…

Plotting array data#

# this is also common; pyplot contains most of the common functions
import matplotlib.pyplot as plt  

# this makes sure plots are displayed in the notebook instead of in a pop-up window
%matplotlib inline  

So we would like to plot our data (to see it better), but first, we need to select a subset of the data we can actually plot. The way we will do this is by taking slices through the data. For instance, if we do

slc = arr[64, :, :]
print(slc.shape)
(109, 91)

Just as before, we’ve asked for position 64 along axis 0, with all elements in axes 1 and 2. Just as with a coordinate grid, fixing one coordinate has left us two coordinates unspecified, so the result is a 2d array.

To view this array, we can think of plotting it as an image, where the value of the array entry becomes an image intensity value:

# the matplotlib function to show a matrix is matshow
plt.matshow(slc)
<matplotlib.image.AxesImage at 0x7fde3d0021e0>

Hmmmm. What if we instead plotted this as an image? (matshow is mostly for looking at matrices. The following command is better for visualizing data that are a gridded version of smooth information, like an image. You may often want to try this both ways.)

plt.imshow(slc, cmap='gray')  # cmap is for colormap
<matplotlib.image.AxesImage at 0x7fde3ce66cf0>

Let’s take another slice, this time letting axes 0 and 1 vary:

plt.imshow(arr[:, 48, :], cmap='gray')
<matplotlib.image.AxesImage at 0x7fde3cf22240>

And one more…

plt.imshow(arr[:, :, 37], cmap='gray')
<matplotlib.image.AxesImage at 0x7fde3cc46480>

So the array in question is actually a 3d image! And by fixing one of the coordinate dimensions of the array, we have reduced the 3d image to a 2d image. Note also a few important conventions:

  • For MRI images, the data array indices correspond to x, y, and z coordinates. These are typically in RAS+ convention, where x increases to the Right, y increases as we move Anterior, and z increases as we move Superior.

  • Fixing one coordinate (z above) gives a 2d array with x and y coordinates. However, in the image, the x direction (left-right) is plotted up-down, while the y direction (anterior-posterior) is plotted left-right. This is because functions like matshow and imshow follow the matrix convention, in which the first index is for rows and the second index is for columns. This is typically what we want: the rows of the matrix become the rows of the image.

  • Along the same lines, the vertical axis increases as one moves down, which is the opposite of normal plotting convention but the most sensible for matrices, where row index increases as we move down. When plotting matrices as data, make sure you know which convention is being used.
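
To make these conventions concrete, here is a tiny sketch of where array entries land in an imshow plot:

# entry [0, 0] is drawn in the *top-left* corner; rows run downward
demo = np.zeros((5, 8))
demo[0, 0] = 1.0   # first row, first column
demo[4, 7] = 0.5   # last row, last column -> bottom-right corner
plt.imshow(demo, cmap='gray')
plt.xlabel('second index (columns)')
plt.ylabel('first index (rows)');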

Activity: localizing a structure#

The orbitofrontal cortex is an area of the brain known to be involved in a large number of cognitive functions, particularly related to reward and decision making. Suppose you perform a study and find a large BOLD (blood oxygenation level-dependent) signal response at coordinates (50, 85, 25).

Exercise

Plot three sections through the brain that contain this point, showing the point with a bright red dot in each one.

Hints:

  • Go ahead and issue the command

plt.figure(figsize=(15, 15))

before you plot anything. This will make the figure size bigger and easier to see.

  • It will be helpful to use the subplot command (example) to arrange the three plots into one. You will need to specify how many rows your images should be arranged into, how many columns, and which image you want (note that, as opposed to typical Python, these indices start at 1).

  • Use plt.scatter (example) to place a single dot on each image.

    • You will have to specify only one x and one y, but keep in mind that the first number is horizontal and the second number is vertical, as with standard plotting (but not matrix plotting)!

    • You will need to supply both the color and s (size) arguments.

Solution:#

cc = (50, 85, 25)
plt.figure(figsize=(15, 15)) 
plt.subplot(1, 3, 1) 
plt.imshow(arr[cc[0], :, :], cmap='gray') 
plt.scatter(cc[2], cc[1], color='r', s=30) 
plt.subplot(1, 3, 2)
plt.imshow(arr[:, cc[1], :], cmap='gray') 
plt.scatter(cc[2], cc[0], color='r', s=30) 
plt.subplot(1, 3, 3) 
plt.imshow(arr[:, :, cc[2]], cmap='gray') 
plt.scatter(cc[1], cc[0], color='r', s=30)
<matplotlib.collections.PathCollection at 0x7fde3c8c7c50>

Array Math#

The benefit of storing data in arrays is not merely that it makes organizing multidimensional data easier. DataFrames have plenty of advantages on that front and automate many powerful transformations. Instead, arrays are primarily useful to us because they make numerical operations easier.

Consider, for example, the following array

aa = np.array([[5, -7, 3.4], [6, -1, -1]])
print(aa)
[[ 5.  -7.   3.4]
 [ 6.  -1.  -1. ]]

If we write something like

aa + 1.1

we typically mean that we want to add 1.1 to each entry, and this is exactly what NumPy does:

aa + 1.1
array([[ 6.1, -5.9,  4.5],
       [ 7.1,  0.1,  0.1]])

Note, however, that multiplication is elementwise by default!

A = np.array([[2, -2], [3, 0]])
B = np.array([[1, 0], [0, 1]])

print(A * B)  # elementwise multiplication
print(A @ B)  # matrix multiplication
[[2 0]
 [0 0]]
[[ 2 -2]
 [ 3  0]]

Broadcasting#

But NumPy is much smarter than that. If we create two more arrays

bb = np.array([3, -3])
cc = np.array([7, 4, 0])

We might want to add them, but addition is typically only defined for matrices of the same shape:

print(aa.shape)
print(bb.shape)
(2, 3)
(2,)

What we would want in this case is for NumPy to treat bb as having shape (2, 1) (a column vector) and simply add this column to each column of aa. This is mathematically equivalent to creating a new array of shape (2, 3) with one copy of bb in each column, which is then the same size as aa.

That is, we could do

# make a matrix by repeating the bb column three times
bbb = np.c_[bb, bb, bb]  # concatenate along columns
print(bbb)
print(bbb.shape)

print(bbb + aa)
[[ 3  3  3]
 [-3 -3 -3]]
(2, 3)
[[ 8.  -4.   6.4]
 [ 3.  -4.  -4. ]]

But clearly, this is difficult to type out if aa is of shape (2, 5000) and bb is of shape (2,). If we ask NumPy to combine arrays of different shapes, it should make some reasonable guesses about how to expand their dimensions to make the operation work, and complain if it can’t.

This feature of NumPy goes under the name of broadcasting. The rules for how NumPy matches up the shapes of different arrays are detailed here. In the simplest cases, this happens like so:

print(bb.shape)
print(bb[:, np.newaxis].shape)
print(aa + bb[:, np.newaxis])
(2,)
(2, 1)
[[ 8.  -4.   6.4]
 [ 3.  -4.  -4. ]]

Basically, we use the np.newaxis keyword to insert an additional dimension (of size 1) at the position we choose, so that bb[:, np.newaxis] has shape (2, 1). NumPy then sees that it shares a first dimension with aa and broadcasts the size-1 dimension across aa’s columns.
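
Equivalently, you can reshape bb into an explicit column vector instead of inserting a new axis (a small sketch):

col = bb.reshape(-1, 1)  # shape (2, 1); -1 means "infer this dimension"
print(np.array_equal(col, bb[:, np.newaxis]))  # True
print(aa + col)          # same result as aa + bb[:, np.newaxis]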

Similarly, the following also works:

print(aa + cc[np.newaxis, :])
[[12.  -3.   3.4]
 [13.   3.  -1. ]]

Altering arrays#

Clearly, another thing we might want to do to our arrays is change their entries. In the case of changing single entries, this is pretty easy to do:

print(aa)
print('-----------------')
aa[0][1] = 5
print(aa)
[[ 5.  -7.   3.4]
 [ 6.  -1.  -1. ]]
-----------------
[[ 5.   5.   3.4]
 [ 6.  -1.  -1. ]]

But we can assign slices as well:

print(aa)
print('-----------------')
aa[1] = 0.1
print(aa)
[[ 5.   5.   3.4]
 [ 6.  -1.  -1. ]]
-----------------
[[5.  5.  3.4]
 [0.1 0.1 0.1]]
print(aa)
print('-----------------')
aa[1] = [1, 2, 3]
print(aa)
[[5.  5.  3.4]
 [0.1 0.1 0.1]]
-----------------
[[5.  5.  3.4]
 [1.  2.  3. ]]

Note that in the latter case, we’ve actually assigned a list, not an array, to the slice aa[1], but NumPy is smart enough to convert it to an array before assigning it.

Important: Notice how each time we perform an assignment above, we altered the original array, aa. This is almost always what we want. Often, our arrays are very large, and it costs less memory to only have one copy of a giant array, with different variables referring to slices of that array (like addresses). In NumPy parlance, we say that indexing or slicing returns a view of the array, not a copy.

Be careful this distinction doesn’t bite you. For instance

aaa = aa[0]
print(aa)
print('-----------------')
print(aaa)
print('-----------------')
aaa[2] = -100
print(aa)
print('-----------------')
print(aaa)
[[5.  5.  3.4]
 [1.  2.  3. ]]
-----------------
[5.  5.  3.4]
-----------------
[[   5.    5. -100.]
 [   1.    2.    3.]]
-----------------
[   5.    5. -100.]

Here, we simply assigned the name aaa to a view of aa; both variables still point to the same place in memory, so when we changed the data through one name, the change showed up under both.

Often, this is what we want. When we don’t, it’s best to make a copy:

aaa = aa[0].copy()
print(aa) 
print('-----------------')
print(aaa)
print('-----------------')
aaa[2] = 501
print(aa)
print('-----------------')
print(aaa)
[[   5.    5. -100.]
 [   1.    2.    3.]]
-----------------
[   5.    5. -100.]
-----------------
[[   5.    5. -100.]
 [   1.    2.    3.]]
-----------------
[  5.   5. 501.]
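
If you’re ever unsure whether two variables share the same underlying data, np.shares_memory gives a direct answer (a quick sketch with the arrays above):

view = aa[0]
copied = aa[0].copy()
print(np.shares_memory(aa, view))    # True: a slice is a view into aa's buffer
print(np.shares_memory(aa, copied))  # False: .copy() allocated new memory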

Changing dimensions#

Finally, there are times when you would like to change which dimension corresponds to rows and which to columns (for instance, flipping an image around the diagonal). In the 2d case, this is equivalent to matrix transposition:

print(aa)
print('-----------------')
print(aa.T)
print('-----------------')
print(aa.shape, aa.T.shape)
[[   5.    5. -100.]
 [   1.    2.    3.]]
-----------------
[[   5.    1.]
 [   5.    2.]
 [-100.    3.]]
-----------------
(2, 3) (3, 2)

For an array with more than two dimensions, transposition reverses the order of the dimensions, so that the last becomes first, the first becomes last, and so on. You can also specify a custom reordering (see here).

randarr = np.random.rand(5, 3, 7, 2, 6)  # a 5 x 3 x 7 x 2 x 6 array of random numbers
print(randarr.shape, randarr.T.shape)
(5, 3, 7, 2, 6) (6, 2, 7, 3, 5)
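
For a custom reordering, np.transpose takes an explicit permutation of the axes, and np.moveaxis moves a single axis (a brief sketch):

print(np.transpose(randarr, (1, 0, 2, 3, 4)).shape)  # swap the first two axes: (3, 5, 7, 2, 6)
print(np.moveaxis(randarr, 0, -1).shape)             # move axis 0 to the end: (3, 7, 2, 6, 5)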

Activity: Color Images#

On computers, the simplest type of color image encoding is an M x N x 3 array. This represents an M x N image in which the final dimension indexes the three color channels (red, green, and blue). Matplotlib understands this encoding and will handle it automatically.
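
As a minimal sketch of this encoding, here is a small image that is pure red:

tiny = np.zeros((10, 20, 3))  # for float images, channel values run from 0 to 1
tiny[:, :, 0] = 1.0           # channel 0 is red (1 is green, 2 is blue)
plt.imshow(tiny);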

To start, download any png file from the internet and save it to your working directory. (You may try using a .jpg or other file formats, but they are not guaranteed to work.) Then use imread and imshow to display it in the notebook like so:

!gdown --id 1yYXD1wIVS3sv_0CXr5KoSrHprqUbJCyP
img = plt.imread('marmoset.jpg')
print(img.shape)
plt.imshow(img);
(153, 128, 3)

Now use what you’ve learned about arrays to adjust the red-green-blue color balance of the image.

img2 = img.copy()
img2[:, :, 2] = 0
plt.imshow(img2);
img3 = img.copy()
img3[:, :, 0] = img3[:, :, 0] * 1.3  # boost the red channel (for uint8 images, values above 255 can overflow)
plt.imshow(img3);
plt.imshow(img);

I used to have an example where we simply assigned, e.g.,

img3 = img

(without copying) and then modified img3 as in the cell above, showing that the original image changed as well. Apparently you can no longer do that.

Bonus Exercises

  1. Can you flip the image horizontally?

  2. Can you rotate the image \(90^\circ\)?
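
One possible approach, sketched here (there are several; slicing and np.rot90 both work):

plt.figure(figsize=(8, 4))
plt.subplot(1, 2, 1)
plt.imshow(img[:, ::-1])    # reverse the column order: a horizontal flip
plt.subplot(1, 2, 2)
plt.imshow(np.rot90(img));  # rotate 90 degrees counterclockwise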

If you’re interested in learning more about images in Matplotlib, you can read the tutorial here.

Learning More#

NumPy is a powerful library, with incredibly flexible syntax, but for beginners, the code can be a little cryptic. Some suggested reading:

  • An extensive tutorial on NumPy is available here.

  • Once you’re familiar with basic Python, Python for Data Analysis is a good book, covering some NumPy and SciPy, but particularly focused on Pandas (Wes McKinney, the author, was also the creator and lead developer of Pandas).

Bonus: NaN and missing data#

If you analyze data for any amount of time, you will have to deal with missing data. Sometimes data are lost — subjects withdraw, files are lost, or a given condition was not run for a particular session — and in that case, we would like a way to indicate that a matrix of data has missing entries.

Other times, we have data, but some quantity we calculate turns out to be ill-defined. For instance, we might want to standardize our data, subtracting the mean and dividing by the standard deviation (this results in data with mean 0 and standard deviation 1 — a convenient thing to have).

But what if we only have one data point? In this case the standard deviation of that data set is 0, and division by 0 is undefined.
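
A minimal sketch of that failure mode:

single = np.array([4.2])
print((single - single.mean()) / single.std())  # [nan], with a RuntimeWarning about 0/0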

Clearly these are different types of situations. In the R language, the first case of missing data is coded NA (not available), while the second case is coded NaN (not a number).

Python handles the second case in exactly the same way (NaN is part of the IEEE floating-point standard), whereas the first case is handled with masking, in the form of NumPy’s masked arrays.

print(np.array([1., np.inf, -7, 0]) / 0.0)
print(np.inf / np.inf)
[ inf  inf -inf  nan]
nan
/tmp/ipykernel_3717/3629404404.py:1: RuntimeWarning: divide by zero encountered in divide
  print(np.array([1., np.inf, -7, 0]) / 0.0)
/tmp/ipykernel_3717/3629404404.py:1: RuntimeWarning: invalid value encountered in divide
  print(np.array([1., np.inf, -7, 0]) / 0.0)

Note that some constants like \(\infty\) and NaN are accessible within the NumPy package as np.inf, np.nan.

# we can test for nan:
myarr = np.array([1, -3, np.nan, 8, np.nan])
np.isnan(myarr)
array([False, False,  True, False,  True])

We can use a boolean (True/False) array to index into a data array of the same size. When the index array is true, we pull out the corresponding elements of the data array:

nan_arr = np.isnan(myarr)

# get entries that are not NaN (~ is negation)
myarr[~nan_arr]
array([ 1., -3.,  8.])

Ignoring NaNs#

Something that often happens is that we want to ignore NaNs in our data. Perhaps we want to take a mean that does not take into account any of these undefined entries:

print(myarr.mean())
print(np.nanmean(myarr))
nan
2.0

In addition, Matplotlib ignores NaN entries when plotting:

xx = np.linspace(0, 10, 500)  # make 500 uniformly spaced points between 0 and 10
yy = 1.5 * xx - 2
plt.plot(xx, yy);
yy[240:300] = np.nan
plt.plot(xx, yy);

Masking Data#

Masking is a way of adding additional information to an ndarray to tell Python that a given entry is missing or otherwise invalid. We might want to do this for several reasons:

  • The data are legitimately missing, as in the examples above.

  • The data are censored. That is, the reading from our sensor is pegged at some maximum value, \(M\), and so all we know is that \(x \ge M\).

  • The measurement is somehow invalid. For instance, a negative number for a length.

In all of these cases, we might want to mask the invalid data.

For example:

npts = 500
xx = np.linspace(0, 10, npts)
yy = np.sin(xx) + 0.8

plt.plot(xx, yy);

Let’s say all the readings below 0 are invalid. We can fix this. The numpy.ma module contains functions relating to masked arrays, including functions to mask data inside a range, greater or less than a given value, or where a particular truth condition is met:

masked_yy = np.ma.masked_less(yy, 0)  # mask where yy < 0
print(yy.mean())
print(masked_yy.mean())
0.9829891749263486
1.1467817135462555

Clearly, the second average is higher. This is because it excludes all the negative numbers.
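
A few more of those helpers from numpy.ma, sketched briefly:

inside = np.ma.masked_inside(yy, 0.5, 1.0)  # mask values inside a range
above = np.ma.masked_greater(yy, 1.5)       # mask values greater than a threshold
where = np.ma.masked_where(xx > 5, yy)      # mask wherever a condition (here on xx) holds
print(inside.count(), above.count(), where.count())  # number of unmasked points in each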

Similarly, plots exclude masked values:

plt.plot(xx, masked_yy);

This is potentially most useful with images. For instance, we can mask our brain images from earlier by constructing the mask ourselves:

# the mask should be logical and of the same size as arr
# here, we set all entries to True, which masks all data by default
mask = np.ones(arr.shape, dtype='bool')
# the mask should be true for the data we *don't* want to use
# unmask the window: x in [60, 75), y in [50, 90), z in [20, 60)
mask[60:75, 50:90, 20:60] = False
masked_arr = np.ma.masked_array(arr, mask=mask)
cc = (65, 85, 25)

plt.figure(figsize=(15, 15))
plt.subplot(1, 3, 1)
plt.imshow(masked_arr[cc[0], :, :], cmap='gray')
plt.subplot(1, 3, 2)
plt.imshow(masked_arr[:, cc[1], :], cmap='gray')
plt.subplot(1, 3, 3)
plt.imshow(masked_arr[:, :, cc[2]], cmap='gray');

In this way, we can restrict our analysis to a particular region of interest inside the image; all other data will be ignored. In real MRI analysis, the mask is often based on anatomy, so that we restrict our data to a particular brain structure or structures.