Dear Students,
Check out this story in this week’s NYT: “Big Data in Your Blood.”
I want to use it to explore a couple of ideas I was struggling with before the class started this semester, ideas I wasn’t sure how to communicate to you on our first day together:
Semantics again
The current interpretation of “Data Scientist,” as it’s colloquially used and as we discussed in week 1, is someone who combines math/stats/ML/computing/engineering skills and works with massive data sets generated by user interactions with the web/internet/social media.
Simultaneously we keep hearing a lot in the press about “Big Data,” but that’s always expressed in terms of massive data sets across many different sectors: tech, internet, pharma, finance, genomics, telecommunications and marketing.
So we have all these big data sets, and there are people with a variety of technical skill levels at all these companies (and research institutions) analyzing them, or trying to, while also grappling with their sheer size.
Defining the Scope of the Course
[revised this to try to be slightly more rigorous]
We’d like to define the space of problems that could be solved with Data Science. Such a description will be given by a list of parameters.
As currently treated in the media, “data science” is a collection of tools and best practices applied to data sets in the space of massive internet/web data. In some ways it is a set of “solutions” without a clearly defined set of problems to solve.
“data science” = {technical skills} x {human/social} domain expertise, applied to {massive} {human/social} data. But to solve what space of problems?
“Big Data” = {massive} {domain} data
Observations:
— They intersect at “massive” data.
— “Big Data” has been generalized across domains, but “data science” hasn’t.
— technical skills = {stats, engineering, computation, machine learning, physics, math, etc.} (basically methods of dealing with the data)
— domain covers the space of, well, everything in the world! But if we need to simplify it to (human behavior, medicine, finance, retail, culture) for illustration, let’s do it.
Let’s say “Big Data” and “data science” are embedded in a larger space, S = {technical skills} x {domain} x {size of data sets}, where size varies from “no data” to petabytes and growing, and S represents a set of tools and data sets. Let’s just call that Solution Space. Does that make sense? Or are the data sets the very problems themselves? (It seems that’s how Big Data defines itself: the data is the problem.)
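For concreteness, the Cartesian-product framing can be sketched in a few lines of Python. The particular skill, domain, and size labels below are illustrative stand-ins I made up, not a real taxonomy:

```python
from itertools import product

# Illustrative labels only; the real axes are much richer.
technical_skills = ["stats", "engineering", "machine learning"]
domains = ["human behavior", "medicine", "finance"]
data_sizes = ["no data", "megabytes", "terabytes", "petabytes"]

# S = {technical skills} x {domain} x {size of data sets}
S = list(product(technical_skills, domains, data_sizes))

print(len(S))  # 3 * 3 * 4 = 36 points in this toy Solution Space
print(S[0])    # ('stats', 'human behavior', 'no data')
```

Even this tiny toy version has 36 points; with realistic lists of skills, domains, and scales, the space gets very large very quickly, which is part of the scoping problem.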
But if that’s Solution Space, what’s Problem Space?
Central issues in trying to determine the scope of this course:
(1) Do we expand “data science” to all domains? Not really a big deal to do, except colloquially it isn’t used that way. And it’s actually a fairly big space on its own.
(2) What set of problems do the solutions in Solution Space solve? How do we define or parameterize Problem Space? or do we just stay in Solution Space?
Let me step away from this and explore “data science” as I currently understand its evolution:
Who doesn’t love human beings?
I’m not a sociologist, but there are some of you in my class, so I’d love for you to help me flesh out my perspective on this. This is my layman’s understanding of the situation. (When I say “sociologist,” political scientists, psychologists, and other social scientists are included. I’m getting into territory I’m not that educated about, so I’m happy you guys can help me figure it out. Also, isn’t there a history of science professor in my class? I need your help.) The datasets available to social scientists in the past were fairly small and collected in the field using surveys and sampling methods. Getting data could be time-consuming and prone to human error in the collection process, and would ultimately yield a fairly small data set. You might be really happy if you could get hundreds of data points. So you didn’t need much computing power to analyze them, but you needed to have a good understanding of statistics.
Now the internet comes along and some aspects of our lives start going from offline to online.
Examples:
— finding out information: libraries -> search engines (google)
— advertising: print/television -> online advertising (google)
— relationships: phone, in person, writing letters -> social networks (facebook)
— finding relationships: in person, classifieds -> online dating (jdate, match.com, eharmony, okcupid)
— buying stuff: brick and mortar -> online shopping (amazon.com, ebay)
— listening to music: records, tapes, cds -> streaming music (pandora)
— reading: newspapers, books -> new york times, tons of online publishers
— choosing what movies to watch: movie reviews -> netflix
The implications of this didn’t seem to hit right away. At first it seemed up in the air whether people were even going to embrace the internet and technology in the way they did. Technology didn’t used to be cool! I’m not sure if Apple is singlehandedly responsible for making it so, but I’m pretty sure most kids in my middle school classes thought that computers were for nerds and could never have imagined the day that everyone they knew would have one.
(If anyone wants to help me make my arguments more rigorous, please help. I’m just trying to say how I think it went. History of Science prof?). It didn’t fully hit me that technology was cool until I went to SXSW this past March 2012. Am I oblivious? Maybe a little. Or maybe I just grew up at a time when cool was music+film festivals but absolutely NOT technology festivals.
Also, at first these companies were more focused on building functionality into their websites. Then insights like collaborative filtering in amazon’s recommender system meant that the way users interacted with the product would generate data that could be used to improve the product. So that’s taking data and building it back into the product. And that’s pretty cool. But then you also have all this information captured in logs about how HUMANS ARE BEHAVING! Who doesn’t love humans? (Not me, really. Just kidding, sort of.) So then DJ Patil and Jeff Hammerbacher were at LinkedIn and Facebook, respectively, and they both realized that they had tons of information about human behavior, and it was a sociologist’s dream! (Remember, previously the sociologist only had hundreds of data points!) And that is where “Data Scientists” were born.
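The collaborative-filtering idea mentioned above, using co-purchase patterns to feed user data back into the product, can be sketched minimally. The baskets, item names, and thresholds below are made up for illustration; real systems are far more sophisticated:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase logs: each user's set of items.
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"book", "desk"},
    {"lamp", "rug"},
]

# Count how often each ordered pair of items is bought together.
co_counts = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most often co-purchased with `item`."""
    scores = {b: n for (a, b), n in co_counts.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("book"))
print(recommend("lamp", k=1))
```

The point is the feedback loop: behavior logs (who bought what together) become the raw material for a feature (recommendations) that reshapes behavior, which generates more logs.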
Then something along the lines of: humans love hearing stuff about themselves and how they behave (more so than about, say, the pharmaceutical drug development process), so the popular press could, understandably, capitalize on that.
Implications for this class
I came into class last week struggling with how to position this course, so I began with what I understood so far, in the context of online human behavior, from my experience at Google. I used the colloquial definition of data science to create the data scientist profiles with you. But today I see what’s strange about them: why is domain expertise on the same axis as all the mathematical tools and methods? This should be a two-dimensional space instead: {methods for dealing with data} x {domains}.
I was also fairly taken with the idea that what distinguishes the problems in this space is that the data itself becomes the building blocks of the product. But I’m pretty sure now that’s just an artificial construct, and not really a meaningful way for us to narrow down the problem space we’re interested in. I think the article I began this discussion with, Big Data in Your Blood, demonstrates this fairly well.
We need to define problem space, P, that can be solved by solution space, S.
Here’s a strawman. P = {domain} x {real world problems that map to prediction, classification, causality} x {is the data set massive? or not}
I’m sure this P actually exists in a larger space that we should be concerned about.
How to Educate the World’s Problem Solvers
One thing DJ Patil was definitely onto was the importance of teams, or let me generalize that to collaboration. The world’s problem space is so large that educating individuals on all the methods and all the domains and all the data sizes is just nonsensical. But if we could create teams, and cultivate a collaborative environment, that collectively had a sufficient understanding of S (Solution Space) and could in turn cover P (Problem Space), we’d be in good shape!
We’d aspire to an institute that included people across domains and technical backgrounds, who have a variety of hammers, know how to deal with a spectrum of data sizes, and have varying degrees of introversion and extroversion, and who can pivot, group up, and separate to solve world problems, rather than sit in isolated academic departments. Somehow the people are like Tetris pieces, but in more dimensions. [I’m pretty sure I read this somewhere but can’t remember, so tell me if you know.]
I am interested in the problem of how to train future problem-solvers to exist in space S, to cover P. I want to explore with you, my class, how to educate you in such a way that we are not focused solely on individual achievement because no one person can do it all, but on the collaborative spirit with the goal that collectively we can do it all. That is overly grand, but why should data science exist in some narrow sub-space?
In the classroom setting, you the students have a variety of skills, domain expertise, abilities, and personalities. Rather than each of you in isolation solving problems and expecting yourselves to be good at everything, how can you leverage each other and your abilities so that the whole is greater than the sum of its parts? I think my challenge is to figure out a curriculum that helps you through this.
Lest you fear that I am being overly idealistic or pie-in-the-sky, I believe I’m being practical. There is a shortage of people who can do Data Science well. When I talk to people in recruiting positions, they don’t want recent grads, because recent grads don’t have enough experience. New grads then face the Catch-22: “how am I supposed to get any experience if I need experience to get experience?” So where’s the potential solution? In education! Why isn’t education training students so that when they interview, they can speak intelligently, and from experience, about what it’s like to function as part of a well-oiled team that was able to solve some set of reasonably important problems by collaborating? If students could do that, they’d have the “experience” recruiters are looking for. [I think this all deserves elaboration in another post.]
My father and my sister
My father is interested in stuff like this:
Encoding massive amounts of data in the space of biology, genomics, science, etc. (over-simplifying)
My sister (one of them) is interested in solving problems in the cultural sector where there is little data. (over-simplifying)
They’re complex people, so I can’t explain too much about it right now.
So where do we want Data Sciences to exist?
And my sister gets full credit for saying to me: “I think this could be the next great digital divide: those sectors that have data and those that do not. I said it first.”
So I think all of this is the realm and concern of a course on Data Science.
Please let me know your thoughts.
Yours, Rachel
It’s interesting to see your own view of the course evolve based on the content and, I think, in part on our reactions to it. I wonder where we’ll end up at the end of the semester!
It might be valuable to have a space to include more articles, books, chapters, etc. that we think might be useful. For example this, about Google Maps vs. UPS: http://scienceofanalytics.com/2012/09/09/google-maps-may-be-great-but-ill-bet-ups-has-better/
Thanks for the feedback.
Good idea! I like the idea of us all collaborating to build up the collection of references! Does anyone in the class want to take on the project of being the Course Librarian? Or at least curating/organizing the collection of material that others in the class contribute? Something along those lines? I’m sure it could be done somewhere on this website. Or maybe a team of you could do it?
Thanks for this (can I just say how exciting it is to be on the inside of the very tight self-defining loop of a new discipline?)
Re: the data previously available to the social sciences. It’s not true that the datasets were always tiny. The UK Household Longitudinal Study covers about a hundred thousand people (yes, that’s still not Big Data, but it’s more than the hundreds).
More important, to my mind, than the number of data points, is the resolution at which the interesting questions are asked, and that at which the answers are proposed (I refer throughout this comment to *Dissecting the Social* by Peter Hedström). Broadly speaking, sociologists are interested in explaining phenomena which operate on macro scales, at the level of society. For example, there is no way to examine virality at the level of the individual — indeed, there is no way to *define* viral contagion without speaking of more than one person. Generally, but not always, questions which can be defined with respect to a single actor are not in the domain of most of the social sciences.
However, the level at which the answers are proposed is up for grabs. It is almost certainly a deficit of understanding in my young career, but I find it difficult to fathom how correlations between, say, unemployment and health outcomes at the postcode level are supposed to provide any basis for actual explanation, however carefully confounding factors are controlled for. It’s true that theories on this basis can have reasonable predictive accuracy (or whatever metric you use to assess theories). However, the rules thus produced always seem ad-hoc, ungeneralisable, and — on a gut-feeling level — less *satisfying*. I’m more interested in micro-explanations which describe mechanisms operating at the level of the individual social actor (or, you might say, “user”, or “human being”). I’m sure there is something unjustifiably anthropocentric in taking persons as the basic unit of analysis, but it does have the advantage that we can run whatever mechanisms we come up with against our own intuition about motivations and action.
And this is linked to another problem. I don’t know anywhere near enough about this to be able to speak of it accurately, but one problem that affects social theories is the non-linearity of social processes, in the sense that small changes in the initial conditions can produce large and apparently inexplicable variations in the outcomes. In part, this is inherent in a discipline which spans levels of abstraction. (Try explaining cell biology in terms of the four fundamental forces if you don’t believe me.) If you rest with macro-explanations of macro-phenomena, your theories will probably fail when you try to expand their scope. I have other semi-formed opinions about the role of aesthetics and intuition in theory-building, and about the role of experiments, but those are for another time…
One of my reasons for taking this class is my interest in so-called *analytical sociology*, of which Hedström is a prime proponent, and which embraces this micro-level explanation. How does this relate to Big Data? Well, actually, sociologists have had Big Data since (before?) they’ve had statistics. What we need now — and what we are beginning to get — is LittleBigData (apologies to Sony Computer Entertainment): large collections of detailed, longitudinal, accurate, and above all *individual-level* data about many people. Only with LittleBigData can we provide micro-explanations of macro-phenomena. It is not unhelpful that much of the data which are being produced by OSNs and other social-oriented tech companies are relational in nature. As a consequence, we are not confined to assuming that everyone interacts in a vast social soup; rather, we can pin-point interactions and the processes which occur within them.
See you on the frontier,
Adam
Thanks for your response, Adam. These are great insights and very thought-provoking! You are onto something with LittleBigData. And it seems there are analogies to bioinformatics or genomics, which maybe one of our bioinformatics students could enlighten us on. A good understanding of LittleBigData would, I think, require domain expertise from across domains (analytical sociology, genomics, …others?); the domain experts could find analogies in the real-world problems they were trying to solve, which they would then work with ML/Stats people to map to a set of potential data sets and mathematical or ML problems. Then the people with ML/Stats expertise, who knew how to collaborate well with domain experts, would also need to work with people really good at processing and handling massive amounts of data (who would actually have been in the original conversations about how to structure the data, because that would be an important part of the challenge). Both domains will have benefited, and neither will have tried to figure it all out in isolation. Something like that. (Also great visualization expertise, etc.)
Also, I didn’t even address the infrastructure all of this requires: great engineering.
From an entirely naive perspective, it seems that Data Science is primarily concerned with questions in which a large, fine-grained data set is available. What Adam points out seems to me the fundamental obstacle, from a methodology perspective, with dealing with big data: namely, understanding what a correlation means, and particularly when it signals causation. I don’t know how it pans out in practice, but I would conjecture that this is an even larger problem with big data sets. With such sets, one has a certain high confidence in conclusions due simply to the size of the data set. On the other hand, the relative ease of tabulating data in multiple ways seems to imply a danger of stumbling upon many spurious correlations.
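The danger of stumbling upon spurious correlations is easy to make tangible with a toy simulation (all the numbers below are arbitrary): generate purely random, unrelated variables, tabulate every pairwise correlation, and some pairs will look “strongly correlated” by chance alone.

```python
import random

random.seed(0)

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# 50 completely unrelated "variables", 20 observations each.
variables = [[random.gauss(0, 1) for _ in range(20)] for _ in range(50)]

# Count pairs whose sample correlation looks "strong" by chance.
n_pairs = 50 * 49 // 2
strong = sum(
    1
    for i in range(len(variables))
    for j in range(i + 1, len(variables))
    if abs(pearson(variables[i], variables[j])) > 0.5
)

print(f"{strong} of {n_pairs} random pairs look correlated (|r| > 0.5)")
```

The more variables you tabulate against each other, the more such false positives appear, which is exactly the multiple-comparisons worry raised above.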
Naive perspectives can be the ones that help push boundaries. That’s why it’s good to have all these disciplines together in one place communicating. We can challenge each other’s assumptions and also help each other be precise about how our discipline sees the world.
In some Data Science problems the data is being used not to “describe” but rather to predict. In which case interpreting relationships between variables as causal isn’t as important as predictive power.
Adam (or anyone), in the context of analytical sociology, what are some concrete examples of the sorts of things we might want to describe vs. predict?
- As you outlined, data science can be thought of as an interdisciplinary field consisting of people / teams with skills in
S={technical skills} x {domain} x {size of data sets}
- Domain expertise is spawned from interest in the field (e.g., computational biology — interest in learning how the genome works, fraud detection — interest in understanding thieves’ behavior and keeping the system clean)
- I view technical skills in DS as gaining literacy in programming and in statistics. It’s not good enough to be interdisciplinary; you have to be multidisciplinary.
- Suppose A = {hacking, software engineering, scalable systems}, and B = {applied statistics, machine learning, data analysis, data visualization}. You should be excellent in one, and at least good in another.
- Some make the argument that a data scientist is someone who can program better than an average statistician, and do statistics better than an average programmer. I think that’s a very flawed statement, because the average programming competency of statisticians is quite low, and vice versa.
Thinking in terms of teams is a great place to start. One of the most rewarding learning experiences I’ve had is pair programming at Square. Pairing with other great engineers helps you learn different languages, styles, and concepts very quickly.
[…] Pair programming. In addition to Greg’s recommendation, anecdotally, Ian Wong at Square is an advocate (he’ll be coming in a few weeks to speak to us); also Princeton Professor Brian Kernighan […]