Category Models

Data products in the wild

Hi Students, Monday’s lecture will focus on Human Factors in Data Science. The class will be an onslaught of needs finding, design, prototyping, and evaluation. It will be intense; brace yourselves. As data scientists, you will ultimately produce a data product, be it a graph or a report or a presentation. This product will affect the […]

Week 12: Predictive modeling, Data Leakage, Model Evaluation

Each week Cathy O’Neil blogs about the class. Cross-posted from This week’s guest lecturer in Rachel Schutt’s Columbia Data Science class was Claudia Perlich. Claudia has been the Chief Scientist at m6d for 3 years. Before that she was in a data analytics group at the IBM center that developed Watson, the computer that won […]

Experiments, A/B Testing and Causal Modeling

Screenshot of article by Brian Christian from Wired magazine, The A/B Test: Inside the Technology That’s Changing the Rules of Business from April 2012

Dear Students,

I want to address explicitly why Causal Modeling and Experiments are part of this course. The last two lectures have addressed observational studies and causal modeling and a bit on experiments. [...]

Week 11: Estimating Causal Effects

Each week Cathy O’Neil blogs about the class. Cross-posted from This week in Rachel Schutt’s Data Science course at Columbia we had Ori Stitelman, a data scientist at Media6Degrees. We also learned last night of a new Columbia course: STAT 4249 Applied Data Science, taught by Rachel Schutt and Ian Langmore. More information can […]

Brief Introduction to Social Network Modeling

Here is a brief, and not comprehensive, introduction to social network modeling drawing from academic literature in mathematics, statistics, computer science, sociology and physics. This will simply introduce some basic models. I won’t get into stochastic (random) processes on networks (such as epidemics or “cascades”), dynamics of networks, nor algorithms for approximating metrics on large-scale […]

Week 7: Recommendation Engines, SVD, Alternating Least Squares, Convexity, Filter Bubbles

Each week Cathy O’Neil blogs about the class. Cross-posted from Last night in Rachel Schutt’s Columbia Data Science course we had Matt Gattis come and talk to us about recommendation engines. Matt graduated from MIT in CS, worked at SiteAdvisor, and co-founded hunch as its CTO, which recently got acquired by eBay. Here’s what […]

Week 5: GetGlue, time series, financial modeling, advanced regression, and ethics

Each week Cathy O’Neil blogs about the class. Cross-posted from But what makes this week unique is that Cathy was our guest lecturer. So first I need to introduce her, and then what follows is her blog post. Students in the class already know Cathy because she comes each week, asks good questions and […]

Next-Gen Data Scientists

The following is a prologue to a discussion of what makes for a good data scientist. Data is information and is extremely powerful. Models and algorithms that use data can literally change the world. Quantitatively-minded people have always been able to solve important problems, so this is nothing new, and there’s always been data, so […]

Week 4: The Data Science Process, k-means, Classifiers, Logistic Regression and Evaluation

Each week Cathy O’Neil blogs about the class. Cross-posted from This week our guest lecturer for the Columbia Data Science class was Brian Dalessandro. Brian works at Media6Degrees as a VP of Data Science, and he’s super active in the research community. He’s also served as co-chair of the KDD competition. Before Brian started, […]

The Data Science Process

Dear Students, Now that we’ve had our first guest lecture, I’d like to revisit the general framework I proposed for thinking about the data science process on the first day of class (when I generalized the example from Google Plus), and show how Jake’s lecture fits within this framework. Throughout the semester we’ll see that […]

