What were you thinking when you made us do those data scientist profiles?
I had four primary reasons for going through that exercise:
Reason 1: Cultivating self-awareness
– I want you to think about who you are now with respect to data science
– I want you to think about what your goals are in this class with respect to data science and how you would like your data scientist profile to change over the course of the semester. Do you want to become an expert in one thing, a generalist, or some mix? There are career advantages and disadvantages to each, regardless of whether you’re in academia or industry.
Reason 2: Illustrating the importance of standardization in visualization
I wanted to demonstrate standardizing visualizations of individuals as a mix of characteristics. (You should think about how you might do it, and then also ask yourself whether you think a standardized visualization has any value.) In this particular case:
(a) standardizing the x-axis: I used the main buckets that I thought were approximately some of the skills one needs as a data scientist. I’m not tied to these buckets, but it seemed reasonable for the first day of class and we can revise this going forward. The chosen buckets “Data Viz”, “Software Engineer”, “Math”, “Statistics”, “Machine Learning (ML)”, “Communication skills” and “Domain expertise” are convenient but contestable. Also I said “maybe software engineer should be CS, I don’t know” and then didn’t really make a decision, and you didn’t seem to mind (thanks!), but it did result in some students having different labels than others.
I pointed out that we had to consider whether the labels would be ordered or not. One way would be to go from left to right in terms of harder to softer skills. But I felt that stating Software Engineering was a harder (more technical) skill than ML or Mathematics was problematic. Alternatively we could consider ordering according to the “data science pipeline”, starting with engineering, moving towards analysis with math, statistics, ML (would have to choose an order), and then moving into visualization and reporting and storytelling and communication. The complexity of the pipeline makes left to right ordering non-obvious. So rather than resolve this in the moment, because I could see it going any of several ways, I chose to not think of them as ordered.
So once we assume we are not interpreting them as ordered, we have to be careful not to see patterns that aren’t there, but rather are simply a manifestation of the (arbitrarily) chosen order. Also some people in the room might justifiably feel that I wasn’t being granular or broad enough depending on their frame of reference. So I acknowledge this is flawed, but also you have to start somewhere, and usually somewhere fairly simple, and that’s part of EDA!
(b) standardizing the y-axis: I drew my profile on the board and showed what my data scientist profile was when I finished my PhD and how it changed after working on a great data science team, learning from my collaborators and colleagues. Here the comparison is before and after. I chose not to label the scale because I didn’t want my notion of expertise to influence you. One man’s expert is another man’s poser. A student just learning this stuff has a different scale than someone who has been doing this for years, and each would have a different interpretation of “expertise”, which could itself be a reflection of over- or under-confidence. So we have to accept that our scales, if we label them, will be subjective at this point. (We should think about what it would mean to standardize the scale. How would we do it? What would the consequences of it be? How do we define “expert”?)
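One possible way to make the scale question concrete, offered only as a sketch: rescale each profile so its bars sum to 1, which keeps the relative shape and throws away the person's private notion of how tall "expert" is. The function name `relative_shape` and the two example profiles are made up for illustration; they are not data from the class.

```python
# Skill buckets in the (arbitrary) order used on the index cards.
SKILLS = ["Data Viz", "Software Engineer", "Math", "Statistics",
          "Machine Learning", "Communication skills", "Domain expertise"]

def relative_shape(profile):
    """Rescale bar heights so they sum to 1 (hypothetical helper).

    Two people with the same mix of skills but different ideas of
    how tall a bar should be end up with the same shape.
    """
    total = sum(profile)
    if total == 0:
        # An all-zero card has no shape to speak of.
        return [0.0] * len(profile)
    return [height / total for height in profile]

# Invented examples: same mix of skills, drawn on different private scales.
confident = [8, 6, 10, 10, 8, 4, 2]
modest = [4, 3, 5, 5, 4, 2, 1]

for skill, c, m in zip(SKILLS, relative_shape(confident), relative_shape(modest)):
    print(f"{skill:22s} {c:.3f} {m:.3f}")
```

The consequence to think about: after rescaling, a strong generalist and a weak generalist look identical, so this answers "what is your mix?" but not "how good are you?".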
Reason 3: Our first step to thinking about data science teams in the classroom
I want us to form data science teams in this classroom, and one way to think about going about it would be to try to combine complementary profiles.
Reason 4: So I could demonstrate my thought process when I do EDA
It’s a mix of intuition and math/stats know-how. I first came up with a simple standardized visualization and you generated n=80 profiles, which I could then look at to make comparisons across you. The lack of standardization of the y-axis means I would try to focus on relative shapes. Did I know what I would see before I did it? No. But I had a hunch that some of the following would happen:
(a) I’d learn something about you as people
(b) I’d see natural clusters of profiles. Some people are similar to each other. (Think: what does “similar” mean? what is the “distance” between two profiles? How do I measure similarity?)
(c) I’d get a sense of the distribution across profiles
(d) I’d be able to start getting intuition for how to construct data science teams in the classroom.
(e) I’d start thinking of machine learning or analysis problems I could potentially work on with this data set or a generalized version of it. For example, trying to automate the creation of data science teams using these profiles. Just let your imagination go here as a data scientist. How would you use these profiles, or something along these lines, as a way to think about or construct effective teams? Think about this especially after reading the DJ Patil paper (assigned reading, which we’ll discuss). I am not suggesting really great teams could be created this way, but could it get us partway to a solution?
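To make the questions in (b) and (e) a little more concrete, here is a minimal sketch under several assumptions: profiles are lists of seven bar heights on a comparable scale (which, as discussed above, they are not yet), "distance" is plain Euclidean distance, and a team is judged by the best member in each skill. The names `euclidean_distance`, `team_coverage` and `greedy_team`, and the four profiles, are all hypothetical. This is one of many possible choices, not a recommended method.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def team_coverage(team):
    """Sum over skills of the strongest member's level in that skill."""
    return sum(max(levels) for levels in zip(*team))

def greedy_team(profiles, size):
    """Build a team by repeatedly adding whoever raises coverage the most."""
    remaining = dict(profiles)
    team = {}
    while remaining and len(team) < size:
        best = max(
            remaining,
            key=lambda name: team_coverage(list(team.values()) + [remaining[name]]),
        )
        team[best] = remaining.pop(best)
    return list(team)

# Invented profiles, in the order: viz, software, math, stats, ML, communication, domain.
profiles = {
    "A": [1, 9, 3, 2, 4, 2, 1],  # engineering-heavy
    "B": [2, 8, 3, 3, 4, 2, 1],  # similar to A
    "C": [3, 2, 8, 9, 6, 3, 2],  # math/stats-heavy
    "D": [8, 2, 2, 3, 2, 9, 7],  # viz, communication, domain
}

print(euclidean_distance(profiles["A"], profiles["B"]))  # small: A and B are alike
print(euclidean_distance(profiles["A"], profiles["C"]))  # larger: different shapes
print(greedy_team(profiles, 3))
```

Notice that the greedy choice tends to skip near-duplicates like A and B on the same team, which is the "complementary profiles" idea in its crudest form. It ignores everything that makes real teams work, so treat it as a starting point for the thought experiment only.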
Meta-thoughts on this blog and analysis before I show the results
– My thoughts about this, about who I am as a data scientist, what my strengths are relative to others, and what I contribute to a team, have been shaped and influenced by many conversations I’ve had with my collaborator and friend, David Huffaker. We’ve reflected a lot together on the composition of good data science teams. So I think of him as a collaborator on this investigation, even though he wasn’t there in person. He’ll be here in a few weeks!
– I am not doing this as a representative of my employer, Google, but rather as an adjunct assistant professor at Columbia. Though my experience on the Google+ Analytics (Data Science?) team informs my understanding.
– I respect your privacy as individuals. Let’s decide on a class policy on blogging. We’ll do it next class. This is new territory for me because when I taught Intro to Statistics in the past, no one had any interest in blogging about it, including myself. Go figure.
Some examples of your profiles
Here are some photos. I am taking out any names or ids you wrote and putting some up that were representative, others that expressed personality, etc. Don’t look at these for general trends because I chose these to call out specific qualities.
Miscellaneous thoughts when I saw them
I liked that some people changed things if they didn’t like my original suggested visualization, and that others just stuck to it. My instructions also were not very clear. I basically flung index cards at everyone and asked you to do what I had done. Maybe if I did this again, I’d first make you stick exactly to my visualization (for standardization), and then ask you to make up your own version (so you could think more about the creative process of visualization and what pieces of information you think needed to be captured, and also so you could express your individuality more). We’ll revisit.
I started to get more of a feeling for your humanity, your ideas about yourselves, your levels of confidence, your senses of humor. It made me look forward to working with you all more.
I definitely saw potential clusters and distributions in the profiles in our class. I’ll chat about it with you in class. For now I hope this has given insight into why I think EDA is important, and what my thought process can be.
Final things for you to think about
Thought experiment: Generalize this problem to visualizing a team rather than an individual.
Thought experiment: Some data sets could be millions of users/humans. (Unlikely to be a set of millions of potential data scientists!) So how would you think about scaling this process? Is there a difference in what you would do if the numbers were self-reported vs. logged user actions on a website? Think of a social networking site or an online dating website to get concrete about this. How would you explore a data set of users and their attributes? If the attributes were self-reported attributes like “how happy are you on a scale of 1-10”, how would you handle the subjectivity of “10”? How would you visualize it, cluster it, characterize the distribution over it?
Scaling also means that you might start by sampling and doing it by eye yourself to gain intuition, but then build an algorithm to automate. (This is an example of machine learning, which we’re about to start learning this week.) Also remind yourself that I asked you to question standardization, and think about how having un-standardized input might affect all this. Does the importance of standardization change for you when we are dealing with smaller data sets vs. millions?
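As one hedged illustration of the subjectivity question: if each user answers several 1-10 questions (an assumption; a single answer per user cannot be handled this way), you could standardize each user's answers against that user's own mean and spread, so that one person's habitual “10” and another's habitual “6” land on a common footing. The helper name `standardize_within_user` and the ratings below are invented for this sketch.

```python
from statistics import mean, pstdev

def standardize_within_user(ratings):
    """Z-score one user's ratings against that user's own mean and spread."""
    center = mean(ratings)
    spread = pstdev(ratings)
    if spread == 0:
        # The user gave the same answer everywhere; there is no variation to rescale.
        return [0.0 for _ in ratings]
    return [(r - center) / spread for r in ratings]

# Invented answers to the same four questions from two hypothetical users.
generous = [10, 9, 10, 8]  # uses the top of the scale freely
reserved = [6, 5, 6, 4]    # same pattern, four points lower

print(standardize_within_user(generous))
print(standardize_within_user(reserved))
```

After standardizing, these two users look the same, which may or may not be what you want: you have removed each person's private scale, but also any real difference in overall level. That trade-off is worth keeping in mind whether you have 80 index cards or millions of rows.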