Quantitative Neurobiology

Notes, assignments, and code for NEUROBIO 735 (Spring 2018).

Class details:

1/10 – 2/8:
Wednesday, Thursday
3:00 – 4:30
DIBS conference room


Week 5: Fitting models

In our last week on programming, we’ll pivot toward the rest of the semester’s material by talking about fitting computational models. Perhaps surprisingly, the actual mechanics behind fitting models to data are rarely discussed, so we’ll spend a little of our time on the theory of fitting and the rest on the practical mechanics of writing code that does it.

So before our first class:

  • Read about Matlab’s fminsearch function.
  • Watch a video on the math behind linear regression:

In the most generic case, finding the minimum of an arbitrary function \(f(x)\) is computationally intractable: there is no guarantee we can find the global minimum. However, in many cases of interest, special properties of \(f(x)\) can be exploited to solve the problem efficiently. Much of the research on optimization involves either proving that some class of optimization problems can be reduced to another with nice mathematical properties (so that optimization can be performed efficiently) or designing new algorithms that solve some class of problems faster (in fewer steps or less time per step).

For our second class this week, we’ll focus on simply using Matlab’s facilities for optimization without getting into the details of the algorithms involved. That is, we’ll treat these as “black box” methods that take in a function \(f\) and spit out an answer.
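To make that concrete, here is a minimal sketch of the workflow, fitting a two-parameter exponential decay by least squares with fminsearch. The data, parameter values, and variable names are illustrative, not from the course materials:

    % Simulated data from y = a*exp(-b*t) plus noise (illustrative values)
    t = linspace(0, 10, 100)';
    y = 2.5 * exp(-0.7 * t) + 0.1 * randn(size(t));

    % The loss: sum of squared errors as a function of the parameter
    % vector p = [a, b]
    sse = @(p) sum((y - p(1) * exp(-p(2) * t)).^2);

    % Hand the black box an initial guess and let it search
    p0 = [1, 1];
    pHat = fminsearch(sse, p0)

The only things we supply are the loss function and a starting point; fminsearch handles the search itself.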

So, before our second class:

Week 4: Debugging

In our fourth week, we’re going to pivot from our tour of data analysis to a set of skills that will become increasingly important as you investigate topics in quantitative neurobiology. Each week so far has featured a programming theme along with a data type:

  1. Table data
  2. Refactoring
  3. Profiling

These are all useful and applicable to even simple cases of running small data analyses, but for more complex work in modeling, we expect that code will grow and with it the complexity of writing and maintaining larger software projects. To begin to tackle that complexity, this week will focus on debugging.

All beginning programming students are either taught or stumble upon two methods of debugging:

  1. Pepper your code with print statements like disp('got to line 17'), also known as “debugging by printf.”

  2. Aimlessly change things until the code sort of works. Call it “debugging by guesswork” or, less charitably, “debugging by random walk.”

To these, the last few years have made available another advanced technique: debugging by posting on StackOverflow.

These techniques can be highly effective. No kidding. But as you do a better job of refactoring code into functions and profiling to remove bottlenecks, code can become more difficult to explore interactively without doing major surgery. That is why almost every serious programming language makes a debugger available.
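In Matlab, the debugger is driven by a small family of db* commands. A minimal sketch (the function computeRates is a made-up example, not course code):

    % Pause execution whenever an error is thrown, workspace intact
    dbstop if error

    % Or set a breakpoint at a specific line of a function
    dbstop in computeRates at 17   % computeRates is hypothetical

    computeRates(data);   % runs until the breakpoint, then pauses

    % At the K>> debug prompt, inspect variables, then:
    dbstep       % execute the next line
    dbstack      % show where you are in the call stack
    dbcont       % continue to the next breakpoint or error
    dbquit       % abandon the paused run
    dbclear all  % remove all breakpoints when done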

Unfortunately, most advice about debugging tends to be language- and tool-specific. That can be helpful when you’re first starting, since your primary method of learning will be generalizing from your own experience with particular bugs, and knowing how to use a good tool at that point will result in dramatic improvements. But as you do more programming, you begin to see general patterns and gain experience with tracking down subtle bugs. The first set of readings this week deals with Matlab-specific material, the second with more general debugging strategy.

So before our first class:

and before our second class:

Read this, this, and this for debugging strategies and tips.

Week 3: Imaging data

For week three, we’ll be working with imaging data. Two-photon imaging, to be precise, though other types of imaging present similar challenges. Unlike point process (i.e., spike) data, which are just collections of events in time, imaging data are spatiotemporal: we must deal not only with time, but with space as well. In practice, this means not only working with time series, but with images: time series of images.

On the coding side, we’ll be devoting some time this week to learning what makes code run fast, why some programming languages are faster than others, the strengths and weaknesses of Matlab as a coding platform when we need lots of computation, and the tools Matlab provides for helping us find and remove speed bottlenecks in our code.
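As a preview of those tools, here is a minimal sketch; analyzeImages is a stand-in for whatever analysis function you are tuning, not a real course function:

    % Coarse timing of a single run
    tic;
    result = analyzeImages(data);   % hypothetical analysis function
    toc;

    % More reliable timing: timeit runs the handle repeatedly and averages
    t = timeit(@() analyzeImages(data));

    % Full profile: which functions and lines dominate the runtime?
    profile on
    result = analyzeImages(data);
    profile viewer   % opens an interactive per-line timing report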

Much of what we will learn can be summarized in the classic quote by computer scientist Donald Knuth:

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.”

So before our first class:

and before our second class:

Week 2: Spike data

In our second week, we’ll be working with a ubiquitous data type in neurobiology: point process or event data. Typically, these events are action potentials, but they could also be EPSPs, vesicle releases, or even communication in social networks. We’ll focus on common exploratory data approaches, including the peri-stimulus/peri-event time histogram, and estimation of firing rates.
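As a flavor of the computation involved, here is a minimal PSTH sketch built on histcounts. The bin width, window, trial count, and stand-in spike times are all illustrative assumptions, not from the assignments:

    % Stand-in data: spike times (s) relative to the event, pooled over
    % trials; replace with real event-aligned spike times
    nTrials = 50;
    spikeTimes = rand(1, 5000) * 1.5 - 0.5;

    binWidth = 0.010;                        % 10 ms bins (illustrative)
    edges = -0.5:binWidth:1.0;               % window around the event
    counts = histcounts(spikeTimes, edges);  % spikes per bin

    % Convert counts to a firing rate estimate (spikes/s, per trial)
    rate = counts / (binWidth * nTrials);

    centers = edges(1:end-1) + binWidth/2;
    bar(centers, rate, 1);
    xlabel('Time from event (s)'); ylabel('Firing rate (spikes/s)');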

We’ll also be talking about refactoring, the process of iteratively rewriting code to make it cleaner, more useful, and more maintainable. As a result, much of this week’s in-class work will have you writing and rewriting the same analyses, each time making them more generic and reusable. Along the way, we’ll touch on related issues such as code smells, testing, and the red-green-refactor method.
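To make the idea concrete, a refactoring pass on the sketch above might pull the histogram logic into a small reusable function. This is my illustration, not an assignment solution:

    function rate = psth(spikeTimes, edges, nTrials)
    %PSTH Peri-stimulus time histogram as a firing-rate estimate.
    %   RATE is spikes/s in each bin defined by EDGES, averaged
    %   over NTRIALS trials.
        binWidth = edges(2) - edges(1);
        counts = histcounts(spikeTimes, edges);
        rate = counts / (binWidth * nTrials);
    end

    % usage: rate = psth(spikeTimes, -0.5:0.01:1.0, 50);

The payoff is that the window, bin width, and trial count become arguments rather than magic numbers scattered through a script.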

So before our first class:

and before our second class:

More on regression coding

Following up on our discussion in class today:

When using a categorical variable in your regression (e.g., genotype, drug/no-drug), if you fit an intercept (and this is the default), then a full set of indicator variables would be redundant, so the parameters must be constrained somehow. For example, in a model with genotype (WT, KO) and drug (Y, N) as predictors:

  • There are four conditions to be fit: WT-Y, WT-N, KO-Y, KO-N
  • The coefficients in the model y ~ g * d (with g genotype, d drug, and an interaction term) are
    • an intercept (corresponding to WT-N)
    • a difference between WT and KO genotypes
    • a difference between drug and no drug
    • an interaction coefficient that captures how the KO-Y condition deviates from the prediction based on the main effects alone
  • But we could also have done:
    • an intercept that is the mean of all four groups
    • two numbers that specify how WT and KO differ from the mean (one is the negative of the other)
    • two numbers that specify how Y/N drug differ from the mean (these also add to 0)
    • four numbers that specify how each individual case differs from the predictions based on the main effects above

In both cases, however, there are only four independent parameters, and the two sets are just linear combinations of one another. The first case is known as treatment coding, while the second goes by the name of sum coding. Depending on your use case, one or the other is usually more interpretable.

For Matlab, the default is treatment (dummy) coding; in fitlm and fitglm, the 'DummyVarCoding' option switches between schemes.
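A minimal sketch of both schemes, assuming a table tbl with response y and categorical predictors g and d:

    % Treatment (reference) coding: the default; intercept = reference cell
    mdlTreat = fitlm(tbl, 'y ~ g*d');

    % Sum (effects) coding: intercept = mean of cell means; effects sum to 0
    mdlSum = fitlm(tbl, 'y ~ g*d', 'DummyVarCoding', 'effects');

    % Identical fitted values, different (but equivalent) coefficients
    disp(mdlTreat.Coefficients)
    disp(mdlSum.Coefficients)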

More info on how these work in R and Python.

Clarifying tidy data

One of the issues that came up this week in our discussion of tidy data is whether there’s ever a case when having multiple observed numbers on each row of a table could be considered tidy. The example is Table 12 in the tidy data paper, in which both tmin and tmax are recorded in each row.

The answer is that, apart from a rigorous mathematical definition, there is some wiggle room in deciding whether a data set is tidy. For instance, consider an eye tracking data set with eye position locations at every point in time:

    Time    X       Y
    0.5     0.7     0.8
    0.55    0.72    0.81
    0.6     0.75    0.88

Note that we have two observations, the x and y coordinates, in each row. By the strictest definition, the data are not tidy. However, there is never a case in which we would measure one of these values without the other; the two could equally be considered a single multivariate observation.

In general, I personally favor keeping multiple observations in the same row when

  • Multiple columns are semantically related (eye coordinates, blood pressure numbers).
  • These observations are either always present together or always absent. One cannot observe one without the other.
  • It makes working with your data easier. Practicality beats purity.

That’s my take. Your mileage may vary. Once you know the rule, you’re free to break the rule.
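If you do decide to reshape between the two layouts, Matlab’s stack and unstack functions do the bookkeeping. A minimal sketch, assuming a hypothetical table weather with one row per day and tmin/tmax columns, as in the paper’s example:

    % Wide (one multivariate observation per row) to long (one value per row)
    longT = stack(weather, {'tmin', 'tmax'}, ...
                  'NewDataVariableName', 'temperature', ...
                  'IndexVariableName', 'measurement');

    % ... and back again
    wideT = unstack(longT, 'temperature', 'measurement');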

Week 1: Tabular data

In our first week of class, we’ll be exploring Matlab’s capabilities for working with one of the most common data formats in all of science: tabular data. Data tables are the organizing principle behind spreadsheets, databases, and even advanced data visualization. Tabular data are typically organized with observations in rows and measured variables in columns, with the freedom to mix numbers, text, and dates in the same data structure.

Traditionally, Matlab has lacked facilities for dealing with tables, but in the last several years, it has introduced both categorical variables and data tables, making analysis of mixed-type data sets like surveys and behavioral data much more tractable.
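For a first taste of both features, here is a minimal sketch with made-up data:

    % Build a small mixed-type table
    subject  = (1:4)';
    genotype = categorical({'WT'; 'KO'; 'WT'; 'KO'});
    score    = [12.1; 9.3; 11.8; 8.7];
    T = table(subject, genotype, score);

    % Categorical variables make grouping and comparison natural
    summary(T.genotype)                % counts per category
    mean(T.score(T.genotype == 'KO'))  % select rows by category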

So before our first class:

  • Make sure you have a recent version of Matlab installed and ready to go. R2016a or later is best.
  • Read Hadley Wickham’s Tidy Data paper. Don’t worry about the R syntax, but do internalize the concepts. How you organize your data can make a huge difference in how easy it is to analyze later.
  • Read about categorical arrays. You should be able to define them and convert string and numeric data to them.
  • Watch a short background video about tables in Matlab:
  • and another about function handles:

and before our second meeting:

  • Read about the split-apply-combine method. Again, don’t worry about the R syntax; focus on the analysis pattern, which generalizes across programming languages. In our second session, we’ll cover Matlab’s syntax for performing many of the same operations (a minimal sketch follows this list).
  • Watch this video about generalized linear models:
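As promised above, here is a minimal split-apply-combine sketch in Matlab using findgroups and splitapply, reusing the made-up table T from the sketch earlier in this section:

    % Split rows by genotype, apply a function per group, combine results
    [g, genotypes] = findgroups(T.genotype);
    meanScore = splitapply(@mean, T.score, g);
    results = table(genotypes, meanScore)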

Matlab readiness quiz

First of all, don’t panic. This quiz is supposed to be challenging. My goal is to figure out how much basic programming you’re familiar with in Matlab so we can calibrate the course properly. In principle, the material here should be doable for anyone who’s completed something like this online class.

You can download the data for all the questions here. Data for each question are in the corresponding .mat file, which you can load into Matlab. This is a self-assessment. You do not need to mail your answers to me, but feel free to contact me to ask questions if you’re stuck. Use whatever resources you want to complete the assignment; programming is not about having things memorized.

So here goes:

  1. FizzBuzz: Write a program that prints the numbers 1 to 100 to the Matlab console, with the following exceptions:
    • if the number is a multiple of 3, print “fizz.”
    • if the number is a multiple of 5, print “buzz.”
    • if the number is a multiple of 3 and 5, print “fizzbuzz.”
  2. Write a function that accepts as input a probability p and an integer n. The function should return a vector of n random binary numbers (either 0 or 1). Each entry should be 1 with probability p.

  3. Given the data matrix D, with a separate data series in each row, write a function that returns a matrix Z of the same dimensions. Each row of Z contains the same data as D, but z-scored (i.e., mean 0, standard deviation 1).

  4. Given the matrix A, make a new matrix B consisting of all rows in A that contain no negative elements.

  5. Given the two-column matrix A (first column, observations; second column, time points), plot the raw data (as points) and a solid line showing the smoothed data using any method you choose.

  6. Write a function that takes as input the array of structs S and returns a cell array whose length equals the number of fields of each struct in S. Each element of the returned cell array should collect all the values from the corresponding field in S. For example, if the elements of S are points with x and y fields, the returned value would be a cell array with two cells, one containing a vector of all the x values, the other all the y values.

Matlab markup

For the last several years, particularly with the advent of Project Jupyter, documents integrating code with text, equations, and visualization (plots) have been a hot topic in scientific computing. While it is possible to run Matlab through a Jupyter notebook, Matlab has its own answer in the form of Matlab markup. It’s similar in spirit to the better-known Markdown, which is just a shorthand way of writing text that can easily be transformed to HTML. This HTML can then be inserted directly into webpages. In fact, a previous incarnation of this Matlab course was generated using this method.
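As a small example of what the markup looks like inside a script meant for publish (the content is illustrative):

    %% A tiny published document
    % Comment lines after a section header become formatted text,
    % with *bold*, _italic_, and |monospace| available, plus equations:
    %
    % $$y = \sin(2 \pi f t)$$

    f = 2;
    t = linspace(0, 1, 500);
    plot(t, sin(2*pi*f*t))   % figures appear inline in the published output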

I encourage you to get familiar with Matlab markup prior to the semester, since it may be useful in preparing your homework. The ability to mix text with code and visualization allows us to get closer to the idea of an analysis lab notebook, a practice we’ll encourage you to continue outside the class.

For those of you running Matlab R2016a or later, you may also want to take advantage of Live Scripts.