*This is the version published at the beginning of the semester. Logistics and content get modified slightly throughout the semester. The most current information can be viewed on the About the Class and the Guest Speakers pages.*

**Introduction to Data Science (W4242)**

Fall 2012

Professor: Dr. Rachel Schutt

Wednesdays 6:10-8:55pm

Location: 503 Hamilton Hall

Email: rrs2117@columbia.edu

Lab Instructor: Jared Lander

Labs: Mondays 6:10-7:25pm

Location: 503 Hamilton Hall

Teaching Assistant: Benjamin Reddy

Problem Sessions: Thursdays 7:45-9:15pm

Location: TBD

Office Hours: Tuesdays time TBD

Location: Statistics Department, 1255 Amsterdam Ave, School of Social Work Building, 10th floor

Email: reddy@stat.columbia.edu

**Prerequisites:** Some linear algebra, previous exposure to probability and statistics, and some programming experience are ideal.

**Goals of Course:**

1) Learn about what it’s like to be a data scientist

2) Be able to do some of what a data scientist does

**Course Structure:**

I’ll teach the first two weeks to build up sufficient background and foundation. After that, each class will be divided into two parts: (1) Review of previous material and introduction of any new material necessary to understand the guest lecture, (2) Guest lecturer teaching new algorithms, methods, or models, giving case studies, showing their actual code, and describing their role as a data scientist emphasizing the course themes.

**Course Themes:**

Machine learning and data mining algorithms, and statistical models and methods; prediction vs. description; exploratory data analysis; communication; visualization; data processing, munging and engineering; big data; coding; ethics; asking good questions

**Course Schedule and Topics**

September 5: Introduction: What is Data Science?, Getting started with R, Exploratory Data Analysis, Review of probability and probability distributions, Bayes Rule

September 12: Supervised Learning, Regression, polynomial regression, local regression, k-nearest neighbors

September 19: Unsupervised Learning, Kernel density estimation, k-means, Naive Bayes, Data and Data Scraping (Guest Lecturer: Jake Hofman, Microsoft Research)

September 26: Classification, ranking, logistic regression (Guest Lecturer: Brian Dalessandro, Media 6 Degrees)

October 3: Ethics, time series, advanced regression, finance (Guest Lecturer: Cathy O’Neil)

October 10: Decision trees, best practices, feature selection (Guest Lecturer: William Cukierski, Kaggle). Kaggle competition (final project) announced. Applying data science in a hybrid research environment (Guest Lecturer: David Huffaker, Google)

October 17: Recommendation engines, dimensionality reduction, indexing large-scale data, and implementing / optimizing machine learning algorithms. (Guest Lecturer: Matt Gattis, eBay)

October 24: Data visualization, data journalism, dashboards? (Guest Lecturer: Mark Hansen, Columbia)

October 31: Social network analysis (Guest Lecturer: John Kelly, Morningside Analytics)

November 7: Sampling, Stratification, Experimental design, pharma (Guest Lecturer: David Madigan, Columbia)

November 14: Observational causal modeling (Guest Lecturer: Ori Stitelman, Media 6 Degrees)

November 19*: Sampling, data leakage, data incest (Guest Lecturer: Claudia Perlich, Media 6 Degrees)

*Scheduled for Monday because Wednesday, November 21 is the evening before Thanksgiving

November 28: Data engineering, sharding, Hadoop, MapReduce and protocol buffers (Guest Lecturer: Josh Wills, Cloudera)

December 5: Data engineering (Guest Lecturer: David Crawshaw, Google)

**Recommended Texts and Readings**

As this is an emerging field, there is no single good textbook for it yet.

I will be drawing from some of the following texts:

Data Mining and Machine Learning:

The Elements of Statistical Learning: Data Mining, Inference and Prediction, *Trevor Hastie, et al.*

Pattern Recognition and Machine Learning, *Christopher Bishop*

Bayesian Reasoning and Machine Learning, *David Barber*

Programming Collective Intelligence, *Toby Segaran*

Data Mining with R: Learning with Case Studies, *Luis Torgo*

Data Mining: Practical Machine Learning Tools and Techniques, *Ian H. Witten, et al.*

Artificial Intelligence: A Modern Approach, *Stuart Russell and Peter Norvig*

Introduction to Machine Learning (Adaptive Computation and Machine Learning), *Ethem Alpaydin*

Programming Languages:

R in a Nutshell: A Desktop Quick Reference, *Joseph Adler*

Learning Python (O’Reilly), *Mark Lutz and David Ascher*

The Art of R Programming: A Tour of Statistical Software Design, *Norman Matloff*

Hadoop:

Hadoop: The Definitive Guide, *Tom White*

Visualization:

The Elements of Graphing Data, *William Cleveland*

Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, *Nathan Yau*

Experiments:

Statistics for Experimenters: Design, Innovation, and Discovery, *George E. P. Box, et al.*

Probability:

A First Course in Probability or Introduction to Probability Models, *Sheldon Ross*

**Course requirements and Grading**

Homework Assignments (40%)

Final Project (40%)

Final In-Class Exam (15%)

Attendance / Participation (5%)

**Homework Assignments**

You are encouraged to discuss problems with other people, but the write-up and code must be your own. Please include a copy of your code, and format it in Courier font. No late assignments accepted.

**Final Project**

The final project will be a Kaggle-style competition. You will form teams and work together. The competition will be announced October 10th and the deadline will be in December. More details to come in October. But feel free to check out Kaggle in the meantime.

**A Note on Programming Languages**

Most of my instruction will involve either R or Python. Guest lecturers may give examples using different languages but will explain what the code means. Homework assignments will generally require R or Python. If you feel that you can complete them successfully in a different language, you may, but we won’t necessarily be able to help you if you get stuck.
