Dear Students,

Now that we’ve had our first guest lecture, I’d like to revisit the general framework I proposed for thinking about the data science process on the first day of class (when I generalized the example from Google Plus), and show how Jake’s lecture fits within this framework. Throughout the semester we’ll see that our guest lecturers each provide different perspectives and help us dig deeper into the various intricacies of this framework. By the end of the semester, we may want to amend it, generalize it, or re-imagine it, but this is my current mental model.

**Traditional Statistics Pedagogy**

First, allow me to describe the way traditional statistics classes and textbooks present data analysis. A standard homework problem presents you with a clean data set and tells you to run a regression with *y*=weight and *x*=height, for example. I look unfavorably upon this because it takes the creativity and life out of everything, and doesn’t resemble in the least what it’s like to actually be a researcher or statistician in the real world. As mentioned previously, in the “real world” (no offense, classrooms aren’t the real world), no one is going to hand you a nice clean data set (and if they did, I’d be skeptical! Where did it come from?), and no one is going to tell you what method to use. (Why are we even running this regression in the first place? What questions are we even trying to answer?)

The homework problem might then have some questions about interpreting or understanding the model, such as “interpret the coefficients”. Let’s face it: most people think these are blow-off questions and don’t take them all that seriously. There are no consequences for misinterpreting on a homework problem, so they may think about it for… a minute, and then write some plausible interpretation. (I’m sure some might argue that a student needs to first learn statistics in this “clean” way, away from the mess of the world, before being ready to tackle the mess. I’m not convinced this is true.)

Here’s a visual for you visual thinkers:

Further limitations of this approach are that the data set will be relatively small, and certainly fits on one machine, and not much coding will be necessary to do any of it.

**My Current Mental Model of the Data Science Process**

Now, here’s the view of data science (aka statistics?) that I’d like to set forth (this is just the first half):


But wait! This model so far seems to suggest this will all magically happen without human intervention. By “human” here, I mean “data scientist”. Someone has to make the decisions about what data to collect, and why. And that someone is the data scientist or my beloved data science team. Let’s revise:


Here’s the second half:


**When is the first question asked?**

There’s a bit of a chicken-and-egg problem here. What starts the data collection in the first place? Do you ask a research question and then go out and try to find data to answer it? Or are you handed a data set, or do you go about collecting one, *without* some question or problem in mind? This happens. Recall RD: they are collecting data without having anything in particular in mind yet for what they want to do with it. The data is coming before the questions. How do they know they’re recording the “right” information? Why are they recording what they are recording? What statistical or sampling bias might exist in the data sets? All this needs to be carefully examined by the data scientist.

**How Jake’s Lecture Fits in**

Jake Hofman (Microsoft Research)

Part 1: Spam classification and Naive Bayes

Problem he wants to solve: He wants to build a spam classifier. Why? Because people don’t want spam in their inbox. But they also don’t want important messages to be thrown out. If our only criterion were to reduce spam, we could just throw all email (spam and ham) into the spam folder.
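To see why "reduce spam" alone is not enough of a goal, here is a minimal sketch of the throw-everything-away rule. The labels are made up for illustration (they are not from Jake's lecture), and `confusion_counts` is a hypothetical helper name.

```python
# Hypothetical toy labels: 1 = spam, 0 = ham (not from the lecture).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# The degenerate rule: send every message to the spam folder.
y_pred = [1] * len(y_true)

def confusion_counts(y_true, y_pred):
    """Return (true_pos, false_pos, false_neg, true_neg) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
spam_caught = tp / (tp + fn)  # fraction of spam kept out of the inbox
ham_lost = fp / (fp + tn)     # fraction of real mail wrongly thrown out
print(spam_caught, ham_lost)
```

Every spam message is caught, but every real message is lost too, which is why the problem has to be stated in terms of both kinds of error.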


**Note about Naive Bayes**: Naive Bayes is considered the baseline for classification. It makes simplifying assumptions about the independence of features, but still manages to perform fairly well. In the machine learning literature, researchers introducing new classification algorithms compare the performance of their proposed algorithms to that of Naive Bayes. Also note that the goal is to label or classify unlabeled observations, and the actual parameters themselves (estimated probabilities) aren’t commonly considered of interest. The classification is evaluated using a mis-classification rate. For spam detection in particular, consider which is worse: false positives (calling something spam when it’s not, e.g. putting email from your boss in the spam folder) or false negatives (letting spam get into your inbox).
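Here is a minimal sketch of the idea, assuming a word-count (multinomial) version of Naive Bayes with add-`alpha` smoothing. This may not be the exact variant Jake presented. The tiny data set and the names `fit`, `predict`, and `alpha` are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training set (not from the lecture): (text, label).
train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
    ("project notes attached", "ham"),
]

def fit(examples, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-probabilities."""
    labels = sorted({label for _, label in examples})
    vocab = sorted({w for text, _ in examples for w in text.split()})
    word_counts = {c: Counter() for c in labels}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    model = {"vocab": set(vocab), "log_prior": {}, "log_lik": {}}
    for c in labels:
        model["log_prior"][c] = math.log(doc_counts[c] / len(examples))
        # alpha is added to every word count so unseen words never get
        # probability zero; the denominator grows by alpha * |vocab|.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model

def predict(model, text):
    """Pick the class with the highest score; words outside the vocab are skipped."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        scores[c] = log_prior + sum(
            model["log_lik"][c][w] for w in text.split() if w in model["vocab"]
        )
    return max(scores, key=scores.get)

model = fit(train)

# Hypothetical held-out messages with their true labels.
test = [("cheap money", "spam"), ("meeting notes", "ham"), ("win lunch", "spam")]
preds = [predict(model, text) for text, _ in test]

errors = sum(p != y for p, (_, y) in zip(preds, test))
misclassification_rate = errors / len(test)
false_pos = sum(p == "spam" and y == "ham" for p, (_, y) in zip(preds, test))
false_neg = sum(p == "ham" and y == "spam" for p, (_, y) in zip(preds, test))
print(preds, misclassification_rate, false_pos, false_neg)
```

Notice that the mis-classification rate alone does not tell you which kind of mistake was made, so the sketch counts false positives and false negatives separately.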

**Comparing Naive Bayes to k-nearest neighbors**: Jake pointed out that k-nearest neighbors has one tuning parameter (k, the number of neighbors), while Naive Bayes can include two hyper-parameters, one in the numerator and one in the denominator, for smoothing. Naive Bayes is a linear classifier, while k-nearest neighbors is not. k-nearest neighbors suffers from the curse of dimensionality when the feature set is large, while Naive Bayes still performs well in that setting. k-nearest neighbors requires no “training” (just load in the data set), while Naive Bayes does. All the classification problems we have looked at so far have been supervised learning (the data comes labeled).
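For contrast, here is a minimal k-nearest-neighbors sketch, assuming Euclidean distance and a simple majority vote. The points and the name `knn_predict` are hypothetical. There is no fitting step at all: the "model" is just the labeled data, and `k` is the only thing to tune.

```python
import math
from collections import Counter

# Hypothetical labeled points (feature vector, label); not from the lecture.
data = [
    ((1.0, 1.0), "ham"),
    ((1.5, 2.0), "ham"),
    ((2.0, 1.5), "ham"),
    ((6.0, 6.5), "spam"),
    ((7.0, 6.0), "spam"),
    ((6.5, 7.5), "spam"),
]

def knn_predict(data, query, k=3):
    """Label `query` by majority vote among its k nearest labeled points."""
    # All the work happens at prediction time: every stored point is compared
    # to the query, using Euclidean distance over all features.
    nearest = sorted(data, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(data, (1.2, 1.4)))
print(knn_predict(data, (6.4, 6.8)))
```

Because the prediction rests entirely on distances computed over every feature, it is easy to see why a very large feature set becomes a problem here.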

Part 2: APIs, scraping the web and the Olympic Record Analysis

Problem: Jake was inspired by a New York Times visualization of Olympic records over time. He had in mind a data set and analysis, and then went out and used his data science super power, “data wrangling”, to build the data set and visualize it:

For you to think about:

– I leave it to you to think about the questions on the first homework assignment and how they fit within this framework. Also recall how my Google Plus example fits.

– Where should model selection, model evaluation, and mis-classification rate go in all this? Right now they’re all being lumped in with Machine Learning.

– Layered through all of this is technology: infrastructure, tools, software, data products, code.

– Does “Big Data” change this framework?

– The word “interpret” should be in there with “visualize, communicate and report findings”.
