Exploratory Data Analysis

Exploratory Data Analysis (EDA) is often relegated to Chapter 1 (by which I mean the “easiest” and lowest-level chapter) of standard introductory statistics textbooks and then forgotten about for the rest of the book. Notable exceptions among textbooks used in statistics curricula are Andrew Gelman’s books, which embrace EDA (and are by no means introductory). I was privileged to have Andrew as my thesis advisor, so he’s been a tremendous influence on my approach to practicing statistics and working with data. I am now also fortunate to work alongside two former Bell Labs/AT&T statisticians, Daryl Pregibon and Diane Lambert, who are also in this vein of applied statistics, and I’ve learned from them to make EDA a part of my best practices. Yes, even with very large Google-scale data, we do EDA. In the context of data at an internet/engineering company, EDA is done for some of the same reasons it’s done with smaller data sets, but there are additional reasons to do it with data that has been generated from logs.

Standard reasons *anyone* working with data should do EDA:
(1) Gain intuition about the data
(2) Make comparisons between distributions
(3) Sanity check: make sure the data is on the scale you think it is and in the format you think it should be
(4) Find out where data is missing or where there are outliers
(5) Summarize the data
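
To make a few of these checks concrete, here is a minimal sketch in Python using only the standard library. The function name `quick_eda`, the `session_seconds` data, and the 1.5 × IQR outlier rule are my own illustrative assumptions, not something prescribed above; the point is only that a summary, a missing-value count, and an outlier scan fit in a dozen lines.

```python
import statistics

def quick_eda(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # Quartiles via the standard library's default ("exclusive") method.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Assumed rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median,
        "max": max(present),
        "outliers": [v for v in present if v < low or v > high],
    }

# Hypothetical session lengths in seconds; one entry looks like it was
# logged in milliseconds, which is exactly what a scale check should catch.
session_seconds = [12, 15, None, 14, 13, 16, 15000, 11, None, 14]
report = quick_eda(session_seconds)
print(report)
```

A flagged value is a prompt to go look at the raw rows, not a verdict: it may be a unit mix-up, a logging bug, or a real and interesting observation.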

In the context of data generated from logs (e.g. internet-type data), EDA helps with
(1) Debugging the logging process. (“Patterns” you find in the data could actually be something wrong in the logging process that needs to be fixed. If you never go to the trouble of debugging, you’ll continue to think your patterns are real.) The engineers I’ve worked with are always grateful for help in this area.
(2) Making sure the product is performing as intended
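
As a rough sketch of what debugging the logging process can look like, the Python below scans a log for two common symptoms: event IDs that appear more than once, and hours with no events at all. The record layout (an ID plus an ISO-8601 timestamp), the function name `audit_log`, and the sample events are all hypothetical, and the sketch assumes every event falls on the same day.

```python
from collections import Counter
from datetime import datetime

def audit_log(events):
    """events: list of (event_id, ISO-8601 timestamp string) tuples from one day."""
    # IDs seen more than once may indicate double logging.
    id_counts = Counter(event_id for event_id, _ in events)
    duplicate_ids = sorted(i for i, count in id_counts.items() if count > 1)

    # Hours with no events between the first and last active hour
    # may indicate a logging outage rather than a real lull.
    per_hour = Counter(datetime.fromisoformat(ts).hour for _, ts in events)
    first, last = min(per_hour), max(per_hour)
    silent_hours = [h for h in range(first, last + 1) if per_hour[h] == 0]
    return duplicate_ids, silent_hours

# Hypothetical log records, invented for illustration.
events = [
    ("a1", "2012-09-20T09:05:00"),
    ("a2", "2012-09-20T09:40:00"),
    ("a2", "2012-09-20T09:40:00"),
    ("a3", "2012-09-20T10:15:00"),
    ("a4", "2012-09-20T13:02:00"),
]
duplicate_ids, silent_hours = audit_log(events)
print("duplicate ids:", duplicate_ids)
print("silent hours:", silent_hours)
```

Whether a silent hour is a bug or a genuine quiet period is a question for the engineers who own the logging; the check only tells you where to ask.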

Exploratory Data Analysis is distinct from Data Visualization in that EDA is done towards the beginning of an analysis, while data visualization is done towards the end to communicate one’s findings. I personally find EDA relaxing because it’s ok to make mistakes and it’s just me and the data. “Long before worrying about how to convince others, you first have to understand what’s happening yourself.” (Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, p. 551)

With EDA, you can also use the understanding you gain to inform and improve the development of algorithms. I gave specific examples of this in class. Plotting data and making comparisons can get you extremely far, and it is far better than getting a data set and immediately running a regression just because you know how.
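
One way to see why looking first matters is a sketch using two of the four data sets from Anscombe's quartet (Anscombe, 1973). This is my choice of illustration, not one of the examples from class. Both sets share the same x values and give nearly the same least-squares line, yet the residuals (and any scatter plot) show that the second set is curved. The helper `fit_line` is a hypothetical name for an ordinary least-squares fit written by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Sets I and II of Anscombe's quartet (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

slope1, intercept1 = fit_line(x, y1)
slope2, intercept2 = fit_line(x, y2)
print(f"set I:  y = {intercept1:.2f} + {slope1:.2f} x")
print(f"set II: y = {intercept2:.2f} + {slope2:.2f} x")

# Residuals ordered by x: a systematic pattern in the signs suggests
# that a straight line is the wrong model for that set.
for label, ys, slope, intercept in [("I", y1, slope1, intercept1),
                                    ("II", y2, slope2, intercept2)]:
    residuals = [round(yv - (intercept + slope * xv), 2)
                 for xv, yv in sorted(zip(x, ys))]
    print(f"set {label} residuals by increasing x:", residuals)
```

Running the regression alone would report essentially the same line for both sets; it is the plot, or here the residuals, that tells them apart.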

The father of Exploratory Data Analysis, John Tukey, also influenced the development of S at Bell Labs, the language from which R descends. R is now the preferred programming language of many statisticians.

(Thanks to Chris Wiggins for recent conversations about this.)

Some references to understand best practices and historical context:
(1) Exploratory Data Analysis, John Tukey, 1977
(2) The Visual Display of Quantitative Information, Edward Tufte, 1983
(3) The Elements of Graphing Data, William S. Cleveland, 1994
(4) Statistical graphics for research and presentation, Appendix of Data Analysis Using Regression and Multilevel/Hierarchical Models, Andrew Gelman and Jennifer Hill, 2007
(5) Exploratory Data Analysis for Complex Models, Andrew Gelman, American Statistical Association, 2004

5 comments

  4. I disagree that EDA is distinct from visualization, or that visualization is only used for data presentation and communication. Visualization is to convey information through graphical representations of data. It is both exploratory (as in EDA) and explanatory (as in presentation).

    1. I agree that EDA includes visualization (with a lower case “v”) to explore the data using graphical techniques. I was trying to distinguish between statistical graphics for exploratory purposes vs explanatory (to use your terms). Exploratory Data Analysis as a discipline or area started by John Tukey referred to statistical graphics used to understand the data yourself. Data Visualization tends, at least now in the vernacular, to refer to presentation and communication. I’m not trying to split hairs about the terminology, but rather trying to distinguish between these two distinct purposes for visual displays of information. The distinction is important because it lends itself to different methods, processes and choices.
