John Pearson










Thinking With Data by Max Shron; O'Reilly Media

17 December 2014

Disclaimer: I received a free copy of this work under the O’Reilly Blogger Review Program.

What you need to know up front is that Max Shron was once the data scientist for OkCupid, in which capacity he wrote one of the most interesting and insightful data blogs on the net. In Thinking With Data, he wants to teach you how to make your own analysis relevant to whatever audience you’re trying to reach: the public, your colleagues, maybe even your boss.

It seems like we shouldn’t even need a book like this. I suspect we desperately, desperately do. As Shron points out, there is a big gap in practice between the activities that data scientists love — exploring data, fitting models, making plots — and what they are actually asked to do: deliver value.

This is even true in science. Perhaps especially true in science. That is mostly because, to postdocs and graduate students (and even to some investigators), the stuff of doing science seems to be data collection and analysis, when in fact the substance of doing science is extracting insight and meaning from data and communicating that to other scientists and the public.

We all know this, and yet it’s very, very easy to get distracted.

Thinking With Data is a very short book about how not to get distracted.

In Chapter 1, Shron introduces a model for scoping data projects he calls “CoNVO” (Context, Needs, Vision, Outcome), the idea being that productive data science begins by understanding the background of a problem, delineating actual needs to be addressed, planning what a valuable solution would look like, and ensuring that the eventual solution is adopted by its intended users.

In Chapter 2, he uses in-depth case studies to illustrate what this process looks like and the techniques data scientists can bring to bear at each different stage.

I liked both of these chapters a lot. They, along with Chapter 5, which pulls everything together in an extended example, constitute 90% of the reason I would recommend this book.

For my money, Chapter 3 is the weakest. I suspect that Shron is personally fascinated by rhetoric, and he may be convinced that patterns of reasoning are the key to communicating data, but I think he is indulging himself, and the rhetorical approach of this chapter, an exhaustive classification of arguments, feels pedantic. Somehow this manages at the same time to overstate the importance of rhetorical strategies (do intelligent people really need a rubric of evidence types and objections to know they matter?) and to fail to do them justice (surely this treatment is just the bare bones of a beginning, as Shron himself notes). Given how short the book is, this doesn’t seem to me the best use of the pages devoted to it. Still, it may turn a light on for some readers.

Chapter 4 is also weak. Suffice it to say that if you are serious about making causal inferences from data, 1) you are in for a world of hurt, and 2) you had better be reading something more serious than this.

Even so, on balance, I would still recommend this book. It can be read in an afternoon, and it’s an important counterweight to all the big data hype and model mania that dominate much of the material on data science. Our goal is to solve problems and produce insights, and Shron gets that. For those of us who could use the occasional reminder, his little book is both engaging and worthwhile.