Dear Students,
We’ve now had six weeks of blog posts, guest lectures, labs and homework assignments that have raised a vast number of topics and issues across multiple dimensions, together covering some subspace of Data Science.
Finding your own way of understanding Data Science
I hope you are finding your own ways of figuring out how these all fit together, and organizing it all in your minds. Perhaps now it feels like an overwhelmingly large space of ideas. I’m sure after the course is over, certain topics and themes will emerge for you as more important than others. I’ll use this post to map out the universe/time-space I think we’ve explored so far, and identify ten important concepts (Why ten? Because we have ten fingers).
Dimensions of the Data Science Universe
Here’s one way to organize the topics and ideas we’ve explored so far. Think of the different dimensions of the space we’ve explored, and then think of examples or a range of possible values within each dimension. Let me do that here:
Types of Data:
- time-stamped event logs
- text or content (essays, email, articles)
- graph/network (nodes & edges; user-user; bipartite: users & movies, users & content)
- time series (e.g. returns)
- user-level (e.g. NYT ad click data)
Size of Data Set:
- David Huffaker’s surveys of hundreds of users
- millions of GetGlue event logs
- data sharded into multiple pieces (NYT homework 1)
Algorithms and Models:
- k-nearest neighbors (week 2)
- Regression (week 2 and 5)
- Naive Bayes (week 3)
- k-means (week 4)
- Logistic regression (week 4)
- Decision Trees (week 6)
- Random Forests (week 6)
Machine Learning Concepts:
- over-fitting
- bias-variance tradeoff
- cost or loss functions
- training set, test set
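To make a few of these concepts concrete, here is a minimal sketch (not code from any of the labs) that ties together k-nearest neighbors, a loss function, and the training set / test set split. The toy data, the 20% label noise, the function names, and the choices k = 1 and k = 15 are all assumptions made up for illustration.

```python
import random

def make_data(n, rng):
    """Hypothetical toy data: label is 1 when x > 0.5, with about 20% of labels flipped."""
    points = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.2:  # label noise (an assumption for this sketch)
            label = 1 - label
        points.append((x, label))
    return points

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (use odd k to avoid ties)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0

def zero_one_loss(train, data, k):
    """Loss function: fraction of points in `data` that are misclassified."""
    errors = sum(1 for x, y in data if knn_predict(train, x, k) != y)
    return errors / len(data)

rng = random.Random(0)
data = make_data(400, rng)
train, test = data[:300], data[300:]  # training set, test set

for k in (1, 15):
    train_loss = zero_one_loss(train, train, k)
    test_loss = zero_one_loss(train, test, k)
    print(f"k={k:2d}  training loss={train_loss:.3f}  test loss={test_loss:.3f}")
```

With k = 1, every training point is its own nearest neighbor, so the training loss is zero even though the labels are noisy. That is over-fitting in miniature: the only honest estimate of how the model generalizes comes from the held-out test set. A larger k averages over more neighbors, trading some variance for some bias.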
Goals:
- Prediction
- Classification
- Establishing causal relationships
- Recommendation/ranking
- Data-driven decision-making
Domains:
- On-line advertising (Brian, md6)
- Finance (Cathy)
- Education (Will, Kaggle; the theme of the final project)
- Entertainment (tv & movies, GetGlue)
- On-line social networks (Google+, GetGlue)
- Real estate market (Real Direct, NY Housing market data)
- Museums (Data Science of Art post)
- Astronomy (weekly data viz #3)
- Olympics (Jake’s study)
- Content classification (Spam filter + NYT article classification)
Data Products:
- Spam classifier
- Personalized tv show recommendations
- Algorithm for trading stock
- NYT article classifier
- Google+ circles
- Google+ privacy settings, …
Ideas and Concepts:
This is a space of its own that exists in our minds. I’m making it into its own post, 10 important ideas: some I’ve been thinking about since before I proposed the course, and others have emerged as important through listening to the guest lectures and having conversations with guest lecturers, friends and you.
Yours, Rachel