Big Data Domain Surfing (Part 1)

Dear Students,

As mentioned, we have diverse backgrounds in this class. And lest there be any confusion, I am not talking about our ethnicities, home countries, or spoken languages. I’m talking about the academic spaces we each inhabit, which has me thinking of Data Science as having the potential to be the Lingua Franca or Translator between disciplines. Is Data Science what happens in the space between Big Data domains, rather than within a single Big Data domain? When I say “Domain”, I just mean “academic discipline” or “field”. Even my use of the word “Domain” to mean “subject matter expertise” is an example of my having a certain perspective as a statistician.

This post is divided into the following sections:
- Game played while hiking with S’s 5-year-old
- Domain Surfing from Analytical Sociology to BioInformatics
- Feeling Stupid
- Example using BioInformatics Department

Game played while hiking with S’s 5-year-old
Here’s a little anecdote to have in mind as an analogy. One of my best friends, S, has a son R (who I love :) ). I visited a few months ago and we went hiking. At the time, he was 5, and one of his favorite games went like this: you had to give him two 3-letter words, and he would find the mapping from the first word to the last word by changing one letter at a time. For example, I’d give him “DOG” and “CAT” and he’d come back with “DOG” -> “COG” -> “COT” -> “CAT”. Notice that all the steps in between are legitimate words. There are probably multiple solutions, and some of them are really long (you could go off on a very long chain if you felt like it), but we were aiming for the shortest path.

One of the challenges was that he just wanted me to keep making up more and more of them. Before I gave him the two words, I would check in my head whether a solution existed, and for 3-letter words I could usually do that fairly fast. Then he got to 4-, 5-, and even 6-letter words, and it got harder for me to figure out whether a mapping even existed. I wanted to give him two words only if a mapping existed, because otherwise he’d get frustrated. Why didn’t I just give him two words and let him figure out if a mapping existed? I could have! That would have been a learning opportunity: mappings don’t always exist. But I wanted to give him problems with solutions. Plus, he’s 5! And he’s doing this! I’m not necessarily recommending it as a general parenting technique. Keep this in mind as an analogy for what I’m about to talk about. Think about whether this is a good analogy or whether there are aspects that don’t illustrate my points well.
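For readers who think in code, the two questions in the game (does a mapping exist, and what is the shortest one?) can both be answered with a breadth-first search over words that differ by one letter. What follows is only a minimal sketch under stated assumptions: the function name `word_ladder`, the uppercase-only alphabet, and the tiny word list are all made up for illustration, and a real dictionary would be much larger.

```python
from collections import deque
from string import ascii_uppercase


def word_ladder(start, goal, words):
    """Return a shortest chain from start to goal, or None if no mapping exists.

    Each step changes exactly one letter, and every word after `start`
    must appear in `words` (a hypothetical dictionary supplied by the caller).
    Words are assumed to be uppercase.
    """
    if len(start) != len(goal):
        return None
    words = set(words)
    # Breadth-first search: paths are explored in order of length,
    # so the first path that reaches `goal` is a shortest one.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        # Try every single-letter change of the current word.
        for i in range(len(current)):
            for letter in ascii_uppercase:
                candidate = current[:i] + letter + current[i + 1:]
                if candidate in words and candidate not in seen:
                    seen.add(candidate)
                    queue.append(path + [candidate])
    return None


# Toy word list, made up for illustration only.
WORDS = {"DOG", "COG", "COT", "CAT", "DOT", "CAR"}

print(word_ladder("DOG", "CAT", WORDS))  # ['DOG', 'COG', 'COT', 'CAT']
print(word_ladder("DOG", "EMU", WORDS))  # None: no mapping exists in this word list
```

A `None` result corresponds to the case where no mapping exists, which is exactly the check that gets hard to do in your head for longer words.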

Domain Surfing from Analytical Sociology to BioInformatics
Did you see Adam’s comment on the Big Data in My Blood post? I’ll excerpt some of it here because it’s what got me started on this whole train of thought:

One of my reasons for taking this class is my interest in so-called *analytical sociology*, of which Hedström is a prime proponent, and which embraces this micro-level explanation. How does this relate to Big Data? Well, actually, sociologists have had Big Data since (before?) they’ve had statistics. What we need now — and what we are beginning to get — is LittleBigData (apologies to Sony Computer Entertainment): large collections of detailed, longitudinal, accurate, and above all *individual-level* data about many people. Only with LittleBigData can we provide micro-explanations of macro-phenomena

Domain Expert vs Technical Expert
Now Adam is what I would call a “domain expert”. He’s spent a lot of time thinking about problems within Sociology. I would distinguish him from people like me, who you could say are “technical” [I need a better word!] experts. I’ve studied math, operations research, statistics, … I know lots of methods to solve problems in quantitative ways. But I don’t have a deep understanding of biology or sociology. Now Adam is actually also “technical” in his understanding of sociology, so the distinction between “domain expert” and “technical expert” may not be a good or fair one. I know of graduates from the Political Science Department at Columbia, for example, who know their area of political science well (domain expertise) but are also well-versed in advanced statistical methods (technical expertise), and who could in all fairness call themselves “statisticians” and get away with it if they wanted to. There are also people in sociology who would be considered domain experts with much less technical background than Adam, so it’s a spectrum.

When I read Adam’s thoughts, my reaction as a “technical expert” is that it should be possible to generalize his problem formulation mathematically, and that the mathematical generalization would then solve Adam’s problem as well as problems across other domains.
The first domain I thought of was BioInformatics. Now, recognize that I’m in over my head in both domains. But I have experts in both domains (Adam and the students sitting in from BioInformatics) who could help me figure out the mapping. So I’m saying approximately: Analytical Sociology LittleBigDataProblems are equivalent (in some “Data Sciencey” sense of the word) to BioInformatics Problems, and that I think it would be valuable to try to figure out how to translate between them.

Feeling Stupid
When we’re “experts” in some area, which to some degree or another we all are, we feel confident in that space, and it’s more comfortable to inhabit it. Right after I got my PhD, I finally felt like I could call myself an “expert”, and then I started work and “felt stupid” again because my colleagues had their own language, system for doing things, vocabulary words, notations, assumptions, etc. Even reading Adam’s post, I feel a little stupid because I can’t claim to understand everything he’s saying. But am I supposed to? I mean, am I expected to be an expert in Everything in the World? If I let my impulse to feel stupid block me from trying to understand what he’s saying, even though we speak slightly different languages, then we may not solve some important problems! But I don’t think it’s easy to figure out how to talk to people who are experts in different things than you are. We’ve all “grown up” in different academic disciplines, and sometimes we don’t even realize how much the discipline we come from affects how we approach solving problems. We need to figure out how to talk across disciplines even if it means feeling stupid!

Example using BioInformatics Department

There are a few people from the bioinformatics department sitting in on the class. Aside from them, how many of you could give a reasonable description of what they do in BioInformatics departments? Without looking on Wikipedia or Google! I have some vague ideas, but I don’t think I really know very much about what they do there, or how they see their domain, or how they think of data in their domain. So we started an email thread, which I will excerpt here for illustrative purposes (with permission):

Rachel: How many of you are there from bioinformatics [attending the class]?
Hojjat: To the best of my knowledge, there are two PhD students, one post-doc, and one MA student who are taking it for credit, and one PhD student who wants to audit. We have different perspectives, I believe, as some of us are more focused on Bioinformatics, while others are more focused on Clinical Informatics.
Rachel: Thanks! Well first maybe you could help me understand the distinction between those two fields as you perceive them.
Hojjat: Here is a try: Biomedical Informatics (aka Health Informatics) is the general discipline in which information technology and information theory (and data science?) are used on biomedical data and/or to improve healthcare. It has several subdomains in it:
* Bioinformatics: where the data is mostly at the cell/tissue level and encompasses genomics, proteomics, etc.
* Medical/Clinical informatics: where the data is mostly at the human/process level; this one frequently deals with data stored in electronic medical records.
* Other disciplines such as Systems Biology (again mostly cell level, but using a systems approach), Imaging Informatics (body organ level, ties tightly with physics, image processing, etc.), Public Health Informatics (population level data)

Now that he’s talking the language of “data”, I start seeing ways his domain problems map to some general “data”-type problems, with structures that could perhaps map back to Adam’s problems. But I need to spend more time talking to the people in the department to really get how they think about their field. And when I reread the email, I realized I don’t know what all the words mean.

Rachel: what’s proteomics?
Hojjat: Proteomics is to proteins, what genomics is to genes.
Rachel: How do you think of protein data being structured?
Hojjat: I am adding more people to the list of recipients. They are also from our department.
Hojjat: Jonathan! You can answer this one more reliably.

Heather: Hojjat gave a great overview of the field, but I wanted to add my area of interest because it’s relevant to the NYT article mentioned on the class blog “Big Data in Your Blood” and also because this area is new to me too and I’m hoping this class will help me clarify things a little more.

I’m new to the world of Biomedical Informatics, so I’m still trying to find the right way to classify my area of interest. My background is in Epidemiology and Behavioral Science, and I’m interested in understanding the way people/patients interact with “mobile” technology (phones, tablets, wearable sensors, etc.) and how those technologies can be leveraged to improve their everyday health, manage chronic diseases, and provide information (integrated into EHRs and clinician workflow) to help physicians and others on the healthcare team make more informed recommendations.

I’m planning to go into what I think the implications of this are for Data Science as a field and for communication between fields. But this is fairly long already, so maybe look back to the story about S’s son, R, and think about whether the analogy is making sense at all. You might even think of me as standing in for little R trying to find the mapping between two words, or as my adult self trying to decide whether the two words can even be mapped in the first place before I give little R the problem to solve.

Please let me know your thoughts.
Yours, Rachel

5 Comments

  1. Jed Dougherty

    Very Hofstadterian game you’re playing with the kid. Relevant xkcd: http://xkcd.com/917/

    I’m not sure I understand the analogy. Are you implying that you are attempting to give us difficult problems that are not unsolvable? Or that since we aren’t 5, you are just going to ask us to make connections that you aren’t sure exist? Or am I just being way too literal?

    In a counter analogy, if statisticians are the mathematicians of the social science world, then maybe data scientists are the physicists?

    1. Rachel Schutt

      Good question.
      I like your counter analogy. Keep it up!
      Let’s get a discussion going here of what I possibly meant by the analogy. (I will explain what I had originally been thinking in subsequent posts, but I think there are multiple possibilities including yours. If I do say so myself :)
      Does anyone else want to venture an interpretation? Or respond to Jed’s interesting counter-analogy?

  2. Albert Lee

    Hello, I am a new master’s student from Biomedical Informatics who joined the class this week. My area of interest is bioinformatics. I work mostly with sequencing data, so I will make a few brief comments on the field and why it came about.

    With the revolution in sequencing technology, biology has become a data-intensive science. The first human genome project took about 10 years. It was slow and expensive. And rightfully so, considering the human genome contains 3 billion bases. But with the advent of so-called “Next Generation Sequencing” technology, the entire human genome can now be sequenced in a matter of days. Actually, the fastest sequencer available on the market now takes as little as 2 hours. That’s right… 2 hrs. Thanks to this revolution, the -omics fields arose, such as genomics, proteomics, etc. According to Wikipedia, -ome refers to a totality of some sort. So for example, genomics is the study of the entire DNA sequence on a global scale. Same for proteomics.

    Though sequencing is getting faster and also cheaper (it used to cost billions of dollars, but now thousands), how to make sense of the data remains a huge challenge. It’s cool that we can see almost “everything” in cells, but we are still like a 5-year-old kid playing with Google Earth. There’s a buzzword going around in the field saying that we are entering the era of the “$1000 genome.” That is true, but the catch is the “$10000 analysis.”

    So, the ultimate goal of bioinformatics as a field, I think, is to solve these new kinds of data problems in biology, and that’s why I am taking this class!

  3. On Inspiring Students and Being Human | Columbia University Introduction to Data Science Statistics W4242, Fall 2012

    [...] Students, Lest you think (yes, I know I used that turn of phrase in posts before. I like it.) that I am bragging about my character traits (in which case you don’t [...]

  4. On Language, Religion and Next-Gen Data Scientists | Columbia University Introduction to Data Science Statistics W4242, Fall 2012

    [...] I want to revisit the issue of language in the following senses: programming languages, the languages that people in various disciplines (or domains) speak and the language of Data Science. I want to raise the issue of religious wars [...]
