Quantitative Neurobiology

Notes, assignments, and code for NEUROBIO 735 (Spring 2018).

Class details:

1/10 – 2/8:
Wednesday, Thursday
3:00 – 4:30
DIBS conference room


Week 2: Spike data

In our second week, we’ll be working with a ubiquitous data type in neurobiology: point process or event data. Typically, these events are action potentials, but they could also be EPSPs, vesicle releases, or even communication in social networks. We’ll focus on common exploratory data approaches, including the peri-stimulus/peri-event time histogram and estimation of firing rates.
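To make the first of those concrete, here is a minimal sketch of a single-trial peri-stimulus time histogram in Matlab; the spike times, bin width, and window below are all made up for illustration:

    % Hypothetical spike times (in seconds), aligned so the stimulus occurs at t = 0
    spikeTimes = [-0.42 -0.15 0.01 0.03 0.08 0.12 0.35 0.41];

    % Peri-stimulus time histogram: count spikes in fixed-width bins around the event
    binWidth = 0.05;                           % 50 ms bins
    edges = -0.5:binWidth:0.5;
    counts = histcounts(spikeTimes, edges);

    % Convert counts to a firing-rate estimate (spikes/s); with many trials,
    % you would also divide by the number of trials
    rate = counts / binWidth;
    bar(edges(1:end-1) + binWidth/2, rate, 1)
    xlabel('Time from stimulus (s)')
    ylabel('Firing rate (spikes/s)')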

We’ll also be talking about refactoring, the process of iteratively rewriting code to make it cleaner, more useful, and more maintainable. As a result, much of this week’s in-class work will have you writing and rewriting the same analyses, each time making them more generic and reusable. Along the way, we’ll touch on related issues such as code smells, testing, and the red-green-refactor method.
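As a small taste of what testing looks like in Matlab, here is a hypothetical function-based test (both the file name test_zscore_rows.m and the function zscore_rows it exercises are invented) that you could run with runtests('test_zscore_rows'):

    % test_zscore_rows.m -- hypothetical tests for a hypothetical zscore_rows function
    function tests = test_zscore_rows
        tests = functiontests(localfunctions);
    end

    % Every local function whose name starts with "test" becomes one test case
    function testRowMeansAreZero(testCase)
        Z = zscore_rows(rand(3, 10));
        verifyEqual(testCase, mean(Z, 2), zeros(3, 1), 'AbsTol', 1e-12)
    end

In the red-green-refactor cycle, you would write a failing test like this first (red), make it pass (green), and only then clean up the implementation (refactor).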

So before our first class:

and before our second class:

More on regression coding

Following up on our discussion in class today:

When using a categorical variable in your regression (e.g., genotype, drug/no-drug), if you fit an intercept (and this is the default), a full set of indicator variables for each category would be redundant with that intercept, so the software must pick a particular coding for the categories. For example, in a model with genotype (WT, KO) and drug (Y, N) as predictors:

  • There are four conditions to be fit: WT-Y, WT-N, KO-Y, KO-N
  • The coefficients in the model y ~ g * d (with g genotype, d drug, and an interaction term) are
    • an intercept (corresponding to WT-N)
    • a difference between WT and KO genotypes
    • a difference between drug and no drug
    • an interaction coefficient that captures how the KO-Y condition differs from the prediction based on the main effects alone
  • But we could also have done:
    • an intercept that is the mean of all four groups
    • two numbers that specify how WT and KO differ from the mean (one is the negative of the other)
    • two numbers that specify how Y/N drug differ from the mean (these also add to 0)
    • four numbers that specify how each individual case differs from the predictions based on the main effects above

In both cases, however, there are only four independent numbers being estimated, and the two parameterizations are just linear transformations of one another. The first scheme is known as treatment coding, while the second goes by the name of sum (or effects) coding. Depending on your use case, one or the other is usually more interpretable.

For Matlab, the default is treatment (dummy) coding. Relevant docs are:

More info on how these work in R and Python.
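In Matlab, the switch between these schemes is the DummyVarCoding option to fitlm/fitglm. A minimal sketch, with invented data and variable names:

    % Hypothetical data: genotype (WT/KO), drug (Y/N), and a response y
    g = categorical({'WT'; 'WT'; 'KO'; 'KO'; 'WT'; 'KO'; 'WT'; 'KO'});
    d = categorical({'N'; 'Y'; 'N'; 'Y'; 'Y'; 'N'; 'N'; 'Y'});
    y = randn(8, 1);
    tbl = table(g, d, y);

    % Treatment (reference/dummy) coding -- Matlab's default
    mdl_treatment = fitlm(tbl, 'y ~ g*d');   % same as 'DummyVarCoding', 'reference'

    % Sum (effects) coding -- coefficients are deviations from the grand mean
    mdl_sum = fitlm(tbl, 'y ~ g*d', 'DummyVarCoding', 'effects');

    % The two fits make identical predictions; only the parameterization differs
    disp(mdl_treatment.Coefficients)
    disp(mdl_sum.Coefficients)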

Clarifying tidy data

One of the issues that came up this week in our discussion of tidy data is whether there’s ever a case when having multiple observed numbers on each row of a table could be considered tidy. The example is Table 12 in the tidy data paper, in which both tmin and tmax are recorded in each row.

The answer is that, in the absence of a rigorous mathematical definition, there is some wiggle room in deciding whether a data set is tidy. For instance, consider an eye-tracking data set with eye position recorded at every point in time:

Time    X       Y
0.5     0.7     0.8
0.55    0.72    0.81
0.6     0.75    0.88

Note that we have two observations, the x and y coordinates, in each row. By the strictest definition, the data are not tidy. However, there is never a case in which we would measure one of these values without the other; the two could just as well be considered a single multivariate observation.

In general, I personally favor keeping multiple observations in the same row when

  • Multiple columns are semantically related (eye coordinates, blood pressure numbers).
  • These observations are either always present together or always absent. One cannot observe one without the other.
  • It makes working with your data easier. Practicality beats purity.

That’s my take. Your mileage may vary. Once you know the rule, you’re free to break the rule.
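That said, if you ever do want the strictly long form, Matlab's stack function will reshape a wide table for you. A small sketch using the eye-tracking example above:

    % Eye-tracking data in "wide" form: one row per time point, X and Y together
    eyeData = table([0.5; 0.55; 0.6], [0.7; 0.72; 0.75], [0.8; 0.81; 0.88], ...
        'VariableNames', {'Time', 'X', 'Y'});

    % Strictly tidy ("long") form: one measured value per row, plus a variable
    % recording which coordinate that value is
    eyeLong = stack(eyeData, {'X', 'Y'}, ...
        'NewDataVariableName', 'Position', ...
        'IndexVariableName', 'Coordinate');
    disp(eyeLong)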

Week 1: Tabular data

In our first week of class, we’ll be exploring Matlab’s capabilities for working with one of the most common data formats in all of science: tabular data. Data tables are the organizing principle behind spreadsheets, databases, and even advanced data visualization. Tabular data are typically organized with observations in rows and measured variables in columns, with the freedom to mix numbers, text, and dates in the same data structure.

Traditionally, Matlab has lacked facilities for dealing with tables, but in the last several years, it has introduced both categorical variables and data tables, making analysis of mixed-type data sets like surveys and behavioral data much more tractable.
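For instance, a mixed-type data set might look something like this in a Matlab table (the variables below are invented for illustration):

    % A hypothetical behavioral data set mixing text, dates, categories, and numbers
    subject  = {'S01'; 'S02'; 'S03'};
    testDate = datetime({'2018-01-10'; '2018-01-11'; '2018-01-11'});
    genotype = categorical({'WT'; 'KO'; 'WT'});
    nCorrect = [37; 42; 29];

    behavior = table(subject, testDate, genotype, nCorrect);
    summary(behavior)     % per-variable summaries that respect each column's type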

So before our first class:

  • Make sure you have a recent version of Matlab installed and ready to go. R2016a or later is best.
  • Read Hadley Wickham’s Tidy Data paper. Don’t worry about the R syntax, but do internalize the concepts. How you organize your data can make a huge difference in how easy it is to analyze later.
  • Read about categorical arrays. You should be able to define them and convert string and numeric data to them.
  • Watch a short background video about tables in Matlab:
  • and another about function handles:
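As a rough sketch of those last few items (all of the data here are made up):

    % Converting string and numeric data to categorical arrays
    conditions = categorical({'drug', 'control', 'drug', 'control'});
    doses      = categorical([1 2 1 3], [1 2 3], {'low', 'medium', 'high'});
    summary(conditions)

    % A function handle stores a reference to a function in a variable,
    % which can then be passed to other functions
    squareIt = @(x) x.^2;
    squareIt(4)                                              % returns 16
    cellfun(@upper, {'wt', 'ko'}, 'UniformOutput', false)    % apply upper to each cell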

and before our second meeting:

  • Read about the split-apply-combine method. Again, don’t worry about the R syntax; focus on the analysis pattern, which generalizes across programming languages. In our second session, we’ll cover Matlab’s syntax for performing many of the same operations.
  • Watch this video about generalized linear models:
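Matlab's version of the split-apply-combine pattern is built around findgroups and splitapply; here is a minimal sketch with invented data:

    % Hypothetical behavioral data: one row per trial
    subject = categorical({'S1'; 'S1'; 'S2'; 'S2'; 'S3'; 'S3'});
    rt      = [0.31; 0.45; 0.52; 0.40; 0.38; 0.61];   % reaction times in seconds
    trials  = table(subject, rt);

    % Split the rows into groups by subject, apply mean to each group, recombine
    [groupIdx, subjects] = findgroups(trials.subject);
    meanRT = splitapply(@mean, trials.rt, groupIdx);
    disp(table(subjects, meanRT))

    % For the GLM side, fitglm takes a table and a formula, e.g. (hypothetical):
    % mdl = fitglm(trials, 'rt ~ subject');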

Matlab readiness quiz

First of all, don’t panic. This quiz is supposed to be challenging. My goal is to figure out how much basic Matlab programming you’re familiar with so we can calibrate the course properly. In principle, the material here should be doable for anyone who’s completed something like this online class.

You can download the data for all the questions here. Data for each question are in the corresponding .mat file, which you can load into Matlab. This is a self-assessment. You do not need to mail your answers to me, but feel free to contact me to ask questions if you’re stuck. Use whatever resources you want to complete the assignment; programming is not about having things memorized.

So here goes:

  1. FizzBuzz: Write a program that prints the numbers 1 to 100 to the Matlab console, with the following exceptions:
    • if the number is a multiple of 3, print “fizz.”
    • if the number is a multiple of 5, print “buzz.”
    • if the number is a multiple of both 3 and 5, print “fizzbuzz.”
  2. Write a function that accepts as input a probability p and an integer n. The function should return a vector of n random binary numbers (either 0 or 1). Each entry should be 1 with probability p.

  3. Given the data matrix D, with a separate data series in each row, write a function that returns a matrix Z of the same dimensions. Each row of Z contains the same data as D, but z-scored (i.e., mean 0, standard deviation 1).

  4. Given the matrix A, make a new matrix B consisting of all rows in A that contain no negative elements.

  5. Given the two-column matrix A (first column, observations; second column, time points), plot the raw data (as points) and a solid line showing the smoothed data using any method you choose.

  6. Write a function that takes as input the array of structs S and returns a cell array with one cell for each field of the structs in S. Each cell of the returned array should collect all the values from the corresponding field across the elements of S. For example, if the elements of S are points with x and y fields, the returned value would be a cell array with two cells: one containing a vector of all the x values, the other all the y values.

Matlab markup

For the last several years, particularly with the advent of Project Jupyter, documents integrating code with text, equations, and visualization (plots) have been a hot topic in scientific computing. While it is possible to run Matlab through a Jupyter notebook, Matlab has its own answer in the form of Matlab markup. It’s similar in spirit to the better-known Markdown, which is just a shorthand way of writing text that can easily be transformed to HTML. This HTML can then be inserted directly into webpages. In fact, a previous incarnation of this Matlab course was generated using this method.

I encourage you to get familiar with Matlab Markup prior to the semester, since it may be useful in preparing your homeworks. The ability to mix text with code and visualization allows us to get closer to the idea of an analysis lab notebook, a practice we’ll encourage you to continue outside the class.
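As a quick, hypothetical sketch (the file name my_analysis.m is invented), a script using this markup might look like the snippet below, and publish('my_analysis.m') turns it into an HTML report with the formatted text, the code, and the resulting figure:

    %% Example analysis section
    % Text in the comment block under a section header is formatted when published.
    %
    % * bulleted items start with an asterisk
    % * inline equations use LaTeX, e.g. $y = \beta_0 + \beta_1 x$
    % * |monospaced text| goes between pipes

    x = linspace(0, 2*pi, 100);
    plot(x, sin(x))              % the figure is embedded in the published HTML
    title('A published figure')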

For those of you running Matlab 2016a or later, you may also want to take advantage of Live Scripts.