Exploratory Data Analysis with Time-stamped Event Data

In the age of Big Data, one of the most common data types is time-stamped event data. This post (1) explains what time-stamped event data is and (2) describes the Exploratory Data Analysis (EDA) you can do with it. It’s best to start your analysis with EDA so you can gain intuition for the data BEFORE building models.

What is time-stamped event data?

Time-stamped event data is not just common in the age of Big Data; it’s one of the “causes” of Big Data. Because computers can record every action a user takes, a single user can generate thousands of data points in a day. When people (users) visit a website, use an app, or otherwise interact with computers, phones, and other devices, their actions can be logged along with the exact time each action took place. To make this concrete, think about yourself interacting with your favorite website or app and all the actions you can take on it. When a new product or feature is built, the engineers working on it write code to capture the events users generate as they navigate and use the product. For example, imagine a user visits the New York Times home page. It’s then possible to capture which news stories rendered for them and which they clicked on, and to write those to event logs. Each record is an “event” that took place between a user and the app or website.

Let’s look at an example from GetGlue, the social TV company that allows users to “check in” to TV shows. This illustrates what the raw data can look like. Focus on the fact that there’s a timestamp.

Example of Raw Data Point:
{"userId": "rachelschutt", "numCheckins": "1", "modelName": "movies", "title": "Collaborator", "source": "http://getglue.com/stickers/tribeca_film/collaborator_coming_soon", "numReplies": "0", "app": "GetGlue", "lastCheckin": "true", "timestamp": "2012-05-18T14:15:40Z", "director": "martin donovan", "verb": "watching", "key": "rachelschutt/2012-05-18T14:15:40Z", "others": "97", "displayName": "Rachel Schutt", "lastModified": "2012-05-18T14:15:43Z", "objectKey": "movies/collaborator/martin_donovan", "action": "Checkin"}

Let’s extract four fields: {"userId": "rachelschutt", "verb": "watching", "title": "Collaborator", "timestamp": "2012-05-18T14:15:40Z"}
More generally, think of this as data taking the form {user, verb, object, timestamp}. Bob clicked on a Nike ad at 5:52am. Sue “liked” the “Romney Strives to Stand Apart in Global Policy” news story at 7:35am.
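
As a minimal sketch of that extraction in Python, assuming each raw record arrives as a JSON string with the field names shown in the GetGlue example above. The helper name to_event and the choice of “title” as the object are illustrative assumptions, not part of any GetGlue API.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the raw record above (only the fields we need).
raw = (
    '{"userId": "rachelschutt", "verb": "watching", '
    '"title": "Collaborator", "timestamp": "2012-05-18T14:15:40Z"}'
)

def to_event(record_json):
    """Reduce a raw record to a (user, verb, object, timestamp) tuple."""
    record = json.loads(record_json)
    # The trailing "Z" means UTC; strptime returns a naive datetime, so attach UTC.
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return (record["userId"], record["verb"], record["title"], ts)

event = to_event(raw)
print(event)
```

Parsing the timestamp into a real datetime object up front, rather than keeping it as a string, is what makes the sorting, bucketing, and time-zone shifting discussed later straightforward.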

Exploratory Data Analysis (EDA)

Let’s look at some basic EDA you can do with this kind of data, focusing on time.

(1) Individual user plots over time. Take a random sample of users. Start with something small like 100 users. Yes, maybe your data set has millions of users, but to start out you need to gain intuition. Looking at millions of data points is too much for you as a human. But just by looking at 100, you’ll start to understand the data. Keep in mind that 100 users is NOT a large enough sample for making inferences about the entire data set; here it only serves to build intuition. For each user, create a plot like the following.

[Figure: Looking at 4 users over time. Simple EDA to start gaining intuition.]

What I notice or think when I look at this: User 1 comes at the same time each day. User 2 is doing the action less and less frequently; why? For User 3, I need a longer time horizon. User 4 looks “normal”, but what’s normal? Hmm… what is the typical or average user doing? What does variation around that look like? Could I start classifying users into different segments who behave differently with respect to time? How would I use metrics or quantities to characterize the differences between these users?

What you should think about:

  • Data Munging Mental exercise: How would you even get the data to do this? You have the raw data where each data point is an event. Some users have multiple events. So how are you going to build a new data set where each data point is a user? Each row needs to be a user followed by a bunch of time stamps. Notice that different users will have a different number of time stamps.
  • Coding exercise: How would you write code to create a single plot like this? Then repeat for each user.
  • Generally you should be thinking along the lines of: What do you notice from this simple hand drawing about different patterns that can exist in user behavior over time? Or differences between users? What information is captured in the picture that you think would need to be captured in metrics or a predictive model?
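
One possible way to approach the munging and plotting exercises is sketched below in Python. It is only a sketch under assumptions: the toy events, the function names, and the text-based rendering are all made up for illustration, and the text timeline stands in for a real graphical plot, which a plotting library would normally draw.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical toy events: (user, timestamp string). Real data would come from logs.
events = [
    ("user1", "2012-05-14T09:00:00Z"),
    ("user2", "2012-05-14T13:30:00Z"),
    ("user1", "2012-05-15T09:05:00Z"),
    ("user1", "2012-05-16T08:55:00Z"),
    ("user2", "2012-05-18T20:10:00Z"),
]

def timestamps_by_user(events):
    """Turn one-row-per-event data into one-entry-per-user data."""
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"))
    # Sort each user's list; users end up with different numbers of timestamps.
    return {user: sorted(stamps) for user, stamps in by_user.items()}

def text_timeline(stamps, start, end, width=60):
    """Draw one user's events as '|' marks on a line running from start to end."""
    line = ["-"] * width
    span = (end - start).total_seconds() or 1.0  # avoid dividing by zero
    for ts in stamps:
        pos = int((ts - start).total_seconds() / span * (width - 1))
        line[pos] = "|"
    return "".join(line)

by_user = timestamps_by_user(events)
# Use the same time range for every user so the timelines are comparable.
start = min(ts for stamps in by_user.values() for ts in stamps)
end = max(ts for stamps in by_user.values() for ts in stamps)
for user in sorted(by_user):
    print(user, text_timeline(by_user[user], start, end))
```

A dictionary of lists handles the fact that different users have different numbers of time stamps, which a fixed-width table would not.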

(2) Adding in color: Now you can add color. Suppose a user can take two possible actions, for example “thumbs_up” and “thumbs_down”, or “like” and “comment”. You could add color to your plots to distinguish between those actions.

What I notice or think when I look at this: In this toy example, I see that all the users did the same thing at the same time towards the right end of the plots. Is this a real event or a bug in the system? What do I learn from this? That maybe I need to check somehow whether there is large co-occurrence of some action across users. How will I do that? Also, is “black” more common than “red”? I’ll need to build a metric that captures that. Maybe some users like to always do one thing, some users like to do another, and then other users are a mix. What does a “mix” look like?
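
Two of these questions can be turned into quick checks. The Python sketch below is one hedged way to do it, using made-up toy events and hypothetical function names: it collects the distinct users behind each (hour, action) pair to look for co-occurrence, and computes each user’s share of events per action as a first cut at what a “mix” looks like. The rule “every sampled user did it in the same hour” is an arbitrary threshold chosen for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical toy events: (user, action, timestamp string).
events = [
    ("u1", "thumbs_up", "2012-05-18T09:02:00Z"),
    ("u2", "thumbs_down", "2012-05-18T11:40:00Z"),
    ("u1", "thumbs_up", "2012-05-18T14:05:00Z"),
    ("u2", "thumbs_up", "2012-05-18T14:15:00Z"),
    ("u3", "thumbs_up", "2012-05-18T14:40:00Z"),
]

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def users_per_hour(events):
    """For each (hour, action), collect the distinct users who did it."""
    seen = defaultdict(set)
    for user, action, ts in events:
        hour = parse(ts).replace(minute=0, second=0)
        seen[(hour, action)].add(user)
    return seen

def action_mix(events):
    """Per user, the share of their events that fall on each action."""
    counts = defaultdict(Counter)
    for user, action, _ in events:
        counts[user][action] += 1
    return {
        user: {a: n / sum(c.values()) for a, n in c.items()}
        for user, c in counts.items()
    }

n_users = len({user for user, _, _ in events})
# Flag hours in which every user in the sample did the same action.
suspicious = [
    key for key, users in users_per_hour(events).items()
    if len(users) == n_users
]
mix = action_mix(events)
print(suspicious)
print(mix)
```

A flagged hour is only a prompt to look closer; it cannot tell you by itself whether you are seeing a real event or a logging bug.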

(3) Counting: Now that I’ve started to get some sense of variation across users, I can think about how I might want to aggregate users. The x-axis will be time, and now the y-axis will be counts.

I no longer have to restrict myself to my 100 users, although I could prototype my code on that small data set and, once I get it working, apply it to a larger one. So counting seems easy and obvious, right? Trivial? But it’s not. There are many choices you need to make here, and the choices you make will affect your perception and understanding of the data set. Here are the choices you need to make:

  • What are you counting: Number of unique users? Number of actions? Number of users who did a given action at least once during the given time segment?
  • Length of time: What time segment are you counting on? Second, minute, hour, 8-hour segment, day, week? Why?
  • Time zones: Are your users in different time zones? If user 1 is in New York and user 2 is in London, then when it’s 7am in NYC it’s noon in London. Do you want to count both users’ actions as happening at 7am? If you then look at 7am-8am and say “30,000 users did this action in the morning”, that’s not right, is it? It wasn’t morning in London. How are you going to treat this? You could shift the data so that you bucket by the user’s local time, so it’s 7am for the user, not 7am in New York. It’s a decision you have to make, and you can justify doing it either way. But you have to understand the consequences of that decision as your analysis moves forward.
  • Action types: Maybe you want to make different plots for different action types.
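
To see how much these choices matter, here is a small Python sketch that counts the same three toy events three different ways. Everything in it is an assumption made for illustration: the events, the function name count_by_hour, and especially the fixed UTC offsets, which only hold for New York and London in mid-May (daylight time in both cities). A real pipeline would look up each user’s zone in a time zone database.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical fixed UTC offsets, valid for mid-May 2012 only.
USER_TZ = {
    "user1": timezone(timedelta(hours=-4)),  # New York (daylight time)
    "user2": timezone(timedelta(hours=1)),   # London (summer time)
}

# Toy events stored in UTC: (user, action, timestamp).
events = [
    ("user1", "checkin", datetime(2012, 5, 18, 11, 5, tzinfo=timezone.utc)),
    ("user1", "checkin", datetime(2012, 5, 18, 11, 40, tzinfo=timezone.utc)),
    ("user2", "checkin", datetime(2012, 5, 18, 11, 20, tzinfo=timezone.utc)),
]

def count_by_hour(events, unique_users=False, local_time=False):
    """Count events (or distinct users) per hour of day.

    local_time=False buckets by the hour in UTC; local_time=True buckets by
    the hour on each user's own clock.
    """
    buckets = defaultdict(list)
    for user, _action, ts in events:
        if local_time:
            ts = ts.astimezone(USER_TZ[user])
        buckets[ts.hour].append(user)
    return {
        hour: len(set(users)) if unique_users else len(users)
        for hour, users in sorted(buckets.items())
    }

print(count_by_hour(events))                     # actions per UTC hour
print(count_by_hour(events, unique_users=True))  # distinct users per UTC hour
print(count_by_hour(events, local_time=True))    # actions per local hour
```

All three events land in a single bucket on the UTC clock, but on local clocks they split into a 7am bucket for the New York user and a noon bucket for the London user.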

Metrics and new variables or features
The intuition you gained from EDA can now help you in constructing metrics or features:

  • By user: Frequencies, rates, velocity, counts, time to first event, binary variables such as did action at least once, did action at least 10x,…
  • Between users: variation, similarity,…
  • Aggregated across users: counts. There could also be a value attached to each time point (such as money spent), so there could be sums or totals.
  • Aggregated by action: for a given action, how many users did it, how many did it more than once, more than 10x?
  • This is by no means a comprehensive list. Metrics and variables can be any function of the data, but you have to construct them in a purposeful way and understand their meaning.
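
A few of the by-user metrics above can be sketched in Python as follows. The data, the feature names, the 7-day observation window, and the use of a signup time as the reference point for “time to first event” are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical per-user data: a signup time plus sorted event timestamps.
signup = {
    "u1": datetime(2012, 5, 14, 8, 0),
    "u2": datetime(2012, 5, 14, 8, 0),
}
by_user = {
    "u1": [datetime(2012, 5, 14, 9, 0), datetime(2012, 5, 15, 9, 0),
           datetime(2012, 5, 16, 9, 0)],
    "u2": [datetime(2012, 5, 17, 20, 0)],
}
window_days = 7  # length of the observation window (an arbitrary choice)

def user_features(user):
    """Build a small feature dictionary for one user."""
    stamps = by_user[user]
    return {
        "count": len(stamps),
        "rate_per_day": len(stamps) / window_days,
        # Time to first event, measured from the assumed signup time.
        "hours_to_first_event": (stamps[0] - signup[user]).total_seconds() / 3600,
        "did_at_least_once": len(stamps) >= 1,
        "did_at_least_10x": len(stamps) >= 10,
    }

features = {user: user_features(user) for user in by_user}
for user, feats in features.items():
    print(user, feats)
```

Each of these numbers depends on a choice (the window length, the reference time, the 10x threshold), which is why the metrics have to be constructed purposefully.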

Where this can go next:
You’ll next want to start moving towards modeling, algorithms, and analysis. You’ll want to build some of the intelligence/intuition you built from the Exploratory Data Analysis into your models and algorithms. What you should do next depends on the problem/real-world context, which is a topic of a later post. But here are some examples of what you could do next:

  • Time series modeling (including auto-regression)
  • Clustering: Automatically detect clusters of users with respect to their behavior over time. How would you measure closeness?
  • Automatically detect common behavior patterns. How do you even define what a behavior pattern is?
  • Prediction: Can you predict whether a specific user is going to do something tomorrow? Can you predict how many users will do something tomorrow? Next week?
  • Change-point detection: Can you identify when some big event has taken place? Can you figure out what kind of behavior in your system should trigger an alarm?
  • Establishing causality
  • Recommender systems (these {user, action, object, time} data types are good for recommender systems, but most recommender systems ignore time. Should they?)
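
As one illustration of the “How would you measure closeness?” question in the clustering item, the Python sketch below compares users by the cosine similarity of their hour-of-day activity histograms. This is just one possible choice of distance among many, and the toy users and function names are hypothetical.

```python
import math
from datetime import datetime

def hour_profile(stamps):
    """24-bin histogram of a user's events by hour of day."""
    profile = [0] * 24
    for ts in stamps:
        profile[ts.hour] += 1
    return profile

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical users: two morning users and one evening user.
morning_a = [datetime(2012, 5, d, 9, 0) for d in (14, 15, 16)]
morning_b = [datetime(2012, 5, d, 9, 30) for d in (14, 16)]
evening = [datetime(2012, 5, d, 20, 0) for d in (14, 15)]

sim_same = cosine_similarity(hour_profile(morning_a), hour_profile(morning_b))
sim_diff = cosine_similarity(hour_profile(morning_a), hour_profile(evening))
print(sim_same, sim_diff)
```

This measure ignores how many events a user has and which days they fall on, so it treats a daily 9am user and an occasional 9am user as identical. Whether that is desirable depends on the problem.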

What’s new about this?
Time-stamped data itself is not new. Time series analysis is a well-established field. In some cases, the available data sets were fairly small, with events recorded once a day or reported only at aggregate levels. Some examples of time-stamped data sets that have existed for a while, even at a granular level, are:

  • Finance: stock price
  • Credit cards: credit card transactions
  • Telecommunications: phone call records (at AT&T)
  • Library: books checked out of a library

But there are a few things that make this new: (1) Many of us now carry devices with us 24 hours a day that can be used to measure and record our actions, so it’s now easy to measure human behavior throughout the day. (2) Time-stamps are accurate (unless there’s a bug), so you’re not relying on the user to self-report. (3) Computing power makes it possible to store large amounts of data and process it fairly quickly. (4) There’s an abundance and richness of data.

One comment

  1. Eurry Kim

    These issues you bring up reminded me of the corporate tax data I worked with when I was at the IRS. Only a few years ago, the IRS began accepting e-filed corporate tax returns — that is, tax data in a tree structure. My colleague and I worked on producing a file representing a single tax year. First of all, what is a corporate tax year? And what is considered the “right” tax return for each corporation? It was far from a simple task. First, apart from original tax filings, corporations can also file amended returns, bare-bones returns, superseding returns, and partial year returns. They can file extensions, so if a corporation designates their tax year from July 2011 to June 2012, they can file as late as December 2012 for their 2011 tax year. And we had access to each corporation’s subsidiary filing (sometimes there were several hundreds of subsidiary attachments each with their own credit forms and statements), which often contained the bulk of the data. And they can change their tax structure! A C corporation one year can opt to become an S corporation the next.

    In the end, we relied heavily on timestamps of the time of e-file. The statistical agency of the IRS, Statistics of Income, had already defined a tax year, so we used that range of dates. We ended up taking the latest original filings or the latest superseding filings (since amended returns didn’t require a corporation to file the full information return) to produce a set of data to work from. This experience taught me that seemingly simple questions about data do not have easy answers — or at least answers without caveats.
