Syllabus

This is the version published at the beginning of the semester. Logistics and content get modified slightly throughout the semester. The most current information can be viewed on the About the Class and the Guest Speakers pages.

Introduction to Data Science (W4242)
Fall 2012
Professor: Dr. Rachel Schutt
Wednesdays 6:10-8:55pm
Location: 503 Hamilton Hall
Email: rrs2117@columbia.edu

Lab Instructor: Jared Lander
Labs: Mondays 6:10-7:25pm
Location: 503 Hamilton Hall

Teaching Assistant: Benjamin Reddy
Problem Sessions: Thursdays 7:45-9:15pm
Location: TBD
Office Hours: Tuesdays, time TBD
Location: Statistics Department, 1255 Amsterdam Ave, School of Social Work Building, 10th floor
Email: reddy@stat.columbia.edu

Prerequisites: Some linear algebra, previous exposure to probability and statistics, and some programming experience are ideal.

Goals of Course:
1) Learn about what it’s like to be a data scientist
2) Be able to do some of what a data scientist does

Course Structure:
I’ll teach the first two weeks to build up sufficient background and foundation. After that, each class will be divided into two parts: (1) a review of previous material and an introduction to any new material necessary to understand the guest lecture, and (2) a guest lecturer teaching new algorithms, methods, or models, giving case studies, showing their actual code, and describing their role as a data scientist, with emphasis on the course themes.

Course Themes:
Machine learning and data mining algorithms, and statistical models and methods; prediction vs. description; exploratory data analysis; communication; visualization; data processing, munging and engineering; big data; coding; ethics; asking good questions

Course Schedule and Topics
September 5: Introduction: What is Data Science?; Getting started with R, Exploratory Data Analysis, Review of probability and probability distributions, Bayes’ Rule

September 12: Supervised Learning, Regression, polynomial regression, local regression, k-nearest neighbors

September 19: Unsupervised Learning, Kernel density estimation, k-means, Naive Bayes, Data and Data Scraping (Guest Lecturer: Jake Hofman, Microsoft Research)

September 26: Classification, ranking, logistic regression (Guest Lecturer: Brian Dalessandro, Media 6 Degrees)

October 3: Ethics, time series, advanced regression, finance (Guest Lecturer: Cathy O’Neil)

October 10: Decision trees, best practices, feature selection (Guest Lecturer: William Cukierski, Kaggle). Kaggle competition (final project) announced.
Applying data science in a hybrid research environment (Guest Lecturer: David Huffaker, Google)

October 17: Recommendation engines, dimensionality reduction, indexing large-scale data, and implementing / optimizing machine learning algorithms. (Guest Lecturer: Matt Gattis, eBay)

October 24: Data visualization, data journalism, dashboards? (Guest Lecturer: Mark Hansen, Columbia)

October 31: Social network analysis (Guest Lecturer: John Kelly, Morningside Analytics)

November 7: Sampling, Stratification, Experimental design, pharma (Guest Lecturer: David Madigan, Columbia)

November 14: Observational causal modeling (Guest Lecturer: Ori Stitelman, Media 6 Degrees)

November 19*: Sampling, data leakage, data incest (Guest Lecturer: Claudia Perlich, Media 6 Degrees)
*Scheduled for Monday because Wednesday, November 21 is the evening before Thanksgiving

November 28: Data engineering, sharding, Hadoop, MapReduce, and protocol buffers (Guest Lecturer: Josh Wills, Cloudera)

December 5: Data engineering (Guest Lecturer: David Crawshaw, Google)

Recommended Texts and Readings
As this is an emerging field, there is no single good textbook for it yet.
I will be drawing from some of the following texts:

Data Mining and Machine Learning:

The Elements of Statistical Learning: Data Mining, Inference and Prediction, Trevor Hastie, et al.
Pattern Recognition and Machine Learning, Christopher Bishop
Bayesian Reasoning and Machine Learning, David Barber
Programming Collective Intelligence, Toby Segaran
Data Mining with R: Learning with Case Studies, Luis Torgo
Data Mining: Practical Machine Learning Tools and Techniques, Ian H. Witten, et al.
Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig
Introduction to Machine Learning (Adaptive Computation and Machine Learning), Ethem Alpaydin

Programming Languages:

R in a Nutshell: A Desktop Quick Reference, Joseph Adler
Learning Python (O’Reilly), Mark Lutz and David Ascher
The Art of R Programming: A Tour of Statistical Software Design, Norman Matloff

Hadoop:

Hadoop: The Definitive Guide, Tom White

Visualization:

The Elements of Graphing Data, William Cleveland
Visualize This: The FlowingData Guide to Design, Visualization, and Statistics, Nathan Yau

Experiments:

Statistics for Experimenters: Design, Innovation, and Discovery, George E. P. Box, et al.

Probability:

A First Course in Probability or Introduction to Probability Models, Sheldon Ross

Course requirements and Grading
Homework Assignments (40%)
Final Project (40%)
Final In-Class Exam (15%)
Attendance / Participation (5%)

Homework Assignments
You are encouraged to discuss problems with other people, but the write-up and code must be your own. Please include a copy of your code, and format it in Courier font. No late assignments accepted.

Final Project
The final project will be a Kaggle-style competition. You will form teams and work together. The competition will be announced October 10th and the deadline will be in December. More details to come in October. But feel free to check out Kaggle in the meantime.

A Note on Programming Languages
Most of my instruction will involve either R or Python. Guest lecturers may give examples using different languages but will explain what the code means. Homework assignments will generally require R or Python. If you feel that you can complete them successfully in a different language, you can, but we won’t necessarily be able to help you if you get stuck.
