We’ve started using vocabulary and concepts that are the language of data science, so I’ll start listing some of them here. If you can explain each of these concepts to someone else clearly in a sentence or two, you’re in good shape. Of course, then you have to actually be able to do it! (Don’t memorize the definitions; really try to understand the meaning. Wikipedia is a good start, as is the free online version of David Barber’s Bayesian Reasoning and Machine Learning.) Students, please add to the list in the comments section as the semester progresses, and I’ll update it periodically.
— machine learning
— supervised learning
— unsupervised learning
— training set or training sample
— test set or test sample
— k-nearest neighbors
— exploratory data analysis
— regression
— residual sum of squares
— least squares estimators
— classification
— prediction
— overfitting
— cross-validation
— loss functions
— misclassification
— labels
— Euclidean distance
— bias, variance, bias-variance trade-off
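To make "actually be able to do it" a bit more concrete, here is a minimal sketch that ties several of the terms above together: a training set, a test set, labels, Euclidean distance, k-nearest neighbors, classification, and misclassification. The data points, the labels "a" and "b", the choice k=3, and the function names are all made up for illustration; this is one simple way to write it, not a reference implementation.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training indices from nearest to farthest from the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: euclidean_distance(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def misclassification_rate(predicted, actual):
    """Fraction of predictions that disagree with the true labels."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


# Toy training set (hypothetical data): two well-separated clusters.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

# Toy test set (hypothetical data), held out from training.
test_points = [(1.2, 1.4), (6.4, 6.2), (4.5, 4.5)]
test_labels = ["a", "b", "a"]

predictions = [knn_predict(train_points, train_labels, p, k=3)
               for p in test_points]
error = misclassification_rate(predictions, test_labels)

print(predictions)
print(error)
```

The third test point sits between the two clusters and ends up closer to the "b" points, so it is the one the classifier gets wrong here. Measuring the error on the test set rather than on the training set is what lets you notice overfitting.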
“Data mining” probably fits into EDA (exploratory data analysis), but I thought I’d throw it out there.
And what about Visualization? Information design?
That initial list is obviously heavy on the stats/methodology side. Perhaps we can build up the tech side a bit? Here are a (very) few to start:
- SQL
- NoSQL
- Hadoop
- MapReduce