They don’t have to settle for models at all

Dear Students,

Check out this piece, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, published in Wired magazine four years ago by Chris Anderson, the editor-in-chief. In the data science world, four years is a long time (if you go by how long the term “data science” has even existed). In this respect, the article provides “historical” context, and it has been fairly influential. Anderson throws out several thought-provoking ideas, including equating massive amounts of data to complete information. Statistics, as a field, has long been concerned with making inferences from data, and often involves building and fitting models [think: regression, for example] in order to make those inferences. Anderson argues that no models are necessary and “correlation is enough”, and that in the context of massive amounts of data, “they [Google] don’t have to settle for models at all”. Read this and try to get a sense of his key arguments. Think critically about whether you buy what he’s saying, and note where you agree, disagree, or need more information to form an opinion.

Yours, Rachel


49 comments

  1. Jennifer Shin

    Most of the scientific examples mentioned in this article seem to be taken out of context. I suspect a closer examination of data, science, and industry may show that scientists are now entering the marketplace because their skill sets are necessary for solving the problems that have now stumped those who entered the marketplace early. In addition, one of my motivations for working with data has been to create tools that I can actually use for research purposes, and I’ve noticed a growing interest by others in science as well, ranging from the increasing popularity of SAGE among mathematicians to the creation and development of scientific tools by the Department of Energy.

    While the historical data may suggest that there are no benefits to possessing scientific know-how, the movement toward big data is still in its early stages, and there is a lack of data to really support the idea that this historical trend is an accurate and predictive indicator of the future. There are also several assumptions made in this article that underemphasize the important role science has played in shaping what we are able to do with data.

    First, I’d like to point out that though the growing popularity of collecting data may seem new and cutting edge, the most influential scientists in history have not only had an interest in data, but demonstrated a knack for understanding the information contained in it. In particular, this is evident in the notes left behind by many notable mathematicians. One such mathematician, Tobias Mayer, was profiled by the NY Times last year. Another example, also from last year: http://www.forbes.com/sites/danwoods/2011/10/07/amazons-john-rauser-on-what-is-a-data-scientist/

    Second, there is a growing interest in hiring graduates with a background in science and engineering rather than the more traditionally sought computer science major. (I can’t remember where I read this right now, but I’ll post a link when I do. This article also mentions this: http://blogs.hbr.org/cs/2012/09/data_is_useless_without_the_skills.html).

    Third, many of the efficient algorithms, machine learning techniques, and even the computer itself are the direct result of the research of physicists and mathematicians of the 20th century. In fact, I’ve worked with computer scientists who didn’t understand the fundamental assumptions of the Ising model or the implications of the Markov property, but used the methods derived from these models to solve CS problems. Without the work of scientists, what would have been their “go to” solution? And for that matter, can we really consider an application to be independent from the scientific knowledge that produced it, or consider knowledge of an application to be of equivalent value to understanding the advanced concepts from which it is derived?

    It might also be worth noting that the characterization of Newtonian models as crude in comparison to atomic models neglects to take into account that there are over 200 years between these discoveries. Weren’t we all taught in secondary school that, in science, precision is determined by the accuracy of the instruments and tools available at the time the research was conducted? I would expect that in 2212 the computers of today will indeed be crude when compared to the technology of the future.

    Before we allow ourselves to be swayed by the author’s utter confidence in his suggestion that we replace the beauty and simplicity of scientific solutions with large-scale computers filled with massive data sets, perhaps we should consider a statement made by the greatest scientist of the past century:

    “Any intelligent fool can make things bigger and more complex… It takes a touch of genius — and a lot of courage to move in the opposite direction.” - Albert Einstein

  2. I remember reading this, and I still think it’s totally absurd. It reminds me of the Pastafarians’ claim that “global warming, earthquakes, hurricanes, and other natural disasters are a direct effect of the shrinking numbers of pirates since the 1800s.” (From http://tinyurl.com/8vp3f) Perhaps a more compelling example is that of Tycho Brahe and Johannes Kepler. Brahe kept voluminous, exacting records of planetary observations. Kepler then pored over the data and was able to come up with his laws of planetary motion. These, to my mind, rise only to the level of correlation, since they essentially describe non-causal relationships between various quantities associated with planetary orbits. While these laws were significant, they were nowhere near the breakthrough of Newton’s law of universal gravitation, which single-handedly ushered in the age of modern physics.

    The lesson seems to be that understanding causal mechanisms is vastly more powerful than observing correlations.

  3. I agree that this article seems to simplify the issues here.

    I think a better statement and a more subtle argument about the points that Anderson is hinting at can be found here: http://norvig.com/chomsky.html

    Peter Norvig is trying to refute Noam Chomsky’s claim that all understanding derived in linguistics must have a simple explainable model to be true. He does a pretty good job explaining the issues at stake and presenting his claims.

    1. To explain a bit more:

      Although the article seems a bit dated now, Anderson’s attitude towards traditional statistical methods is indicative of the paradigm shift that is occurring in statistics. As we have become better able to handle larger sample sizes and more complicated models, the trade-off between bias and variance has shifted decidedly towards variance (a toy simulation at the end of this comment illustrates the shift). While Anderson explicitly argues that models are dead, he is in fact proposing that we discard our old models, which are optimized for small samples and logical and computational simplicity, for new models optimized for large samples and complexity. In a sense he is saying: “Models are dead! Long live models!”

      Peter Norvig’s article is especially interesting because his stated purpose is to explain the shift in models directly. His article was motivated by remarks by Noam Chomsky about linguistics research that uses these complicated models.

      Chomsky is essentially arguing that if the model used to describe language cannot be explained through logic, then it is irrelevant, because grammar and language in general are the products of reasoning people. Norvig has a completely different metric for evaluating the usefulness of a model: its predictive ability.
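
      To make the bias-variance point above concrete, here is a toy simulation (my own illustrative sketch, not anything from Anderson or Norvig). It fits a simple, biased model and a flexible, high-variance model to small and large samples; with little data the simple model wins, and with lots of data the flexible one overtakes it.

      ```python
      # Toy bias-variance illustration (illustrative sketch only).
      # True relationship: y = sin(x) + noise. Compare a simple model
      # (degree-1 polynomial, high bias) with a flexible one (degree-9,
      # high variance) on small vs. large training samples.
      import numpy as np

      rng = np.random.default_rng(0)

      def heldout_mse(n_train, degree, n_test=2000, noise=0.3):
          x_tr = rng.uniform(-3, 3, n_train)
          y_tr = np.sin(x_tr) + rng.normal(0, noise, n_train)
          x_te = rng.uniform(-3, 3, n_test)
          y_te = np.sin(x_te) + rng.normal(0, noise, n_test)
          coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
          pred = np.polyval(coeffs, x_te)
          return np.mean((pred - y_te) ** 2)        # held-out mean squared error

      for n in (15, 10000):
          simple = np.mean([heldout_mse(n, degree=1) for _ in range(50)])
          flexible = np.mean([heldout_mse(n, degree=9) for _ in range(50)])
          print(f"n={n:>6}:  degree-1 MSE={simple:.3f}   degree-9 MSE={flexible:.3f}")

      # With n=15 the degree-9 fit tends to overfit (higher error);
      # with n=10000 it beats the biased degree-1 fit.
      ```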

      1. As long as time exists - a contradictio in adjecto if ever there was one - there will be a need for models. We will never have data for events that haven’t happened yet, rendering Anderson’s arguments moot. I find your example a much more interesting question.
        Chomsky and Norvig’s divergence, to me, is at the heart of the argument. Do we judge a model’s worth based on its predictive abilities? Or do we judge it not just on how correct it tends to be, but on how well its actual structure helps to explain the processes of the system it attempts to describe? This raises the question - is there underlying structure in the real world? - which I’m pretty sure is what philosophers have been arguing about for the last x years.
        My personal belief (hypothesis, if you will) is that there is a correlation between the perceived underlying explanatory capabilities of a model and the longevity of its perceived usefulness and scope. This is a strictly correlative issue in my eyes, in that the longer a model remains a good predictor, the more accepted it becomes that it does in fact explain something about the underlying nature of the thing it is trying to predict. Likewise, the more a model seems to explain a situation in easily understood cause-and-effect terms, the longer it will be seen as a good predictor. Just look at the godawful models in undergraduate economics classes.
        As Jorge Luis Borges notes in Death and the Compass, “You’ll say that reality hasn’t the least obligation to be interesting. And I’ll answer you that reality may avoid that obligation but that hypotheses may not.”

  4. I respectfully disagree with the views of Anderson. Here is why:

    (1) In Anderson’s view, the reason we used to create models was that we didn’t have enough data (we only had a sample), so we needed statistical models to be able to do inference. In this point of view (shared by many other scholars of the “big data” era), now that we have “all” the data, if you will, the traditional methods of statistics don’t apply any more. We can get the “true” mean from the data, because it is not a sample anymore. We don’t need to verify causality: all we need is to find associations; these associations cannot be “unreal”, because they are the actual associations seen in “all” the data.

    That is completely untrue. First, even if you had all the data, it doesn’t mean you have all the truth. Data is collected by observations; observations are always subject to bias. The bigger the data, the higher the chances of finding associations which are due to minor biases. Secondly, you actually don’t have “all” the data. No matter how big, your data set is just a “snapshot” of the real world. Things tend to change; systems are dynamic. The associations you find in one set of data do not have to hold true forever.

    (2) Another problem with Anderson’s view is that it uses the scientific approach against the scientific approach! The fact that we now know about gene-protein interactions is NOT because we have more data, or because we run associations (rather than inferences from small samples); it is because the “hypothesis-model-test” cycle was performed! The Mendelian hypothesis was tested again and again; it explained the color of peas, but it didn’t explain the transmission of diabetes across generations. What do we do when the test fails? We create a new hypothesis! It was EXACTLY because of the hypothesis-model-test paradigm that we learned the Mendelian hypothesis is not correct (better said, it is incomplete). It was not because we stopped using that paradigm and switched to a data-driven, association-only, semantics-free scientific model.

    (3) Last but not least, I find Anderson’s conclusions rather rhetorical. He talks about gene-protein interactions in a way that makes me wonder if he actually knows what they are. He insults the mind that disagrees with his point of view by calling it “stuck in the old way of doing science”. He concludes “there’s no reason to cling to our old ways” without clarifying what the “new way” is, or showing how it can be used for anything more than error-prone Google translations and the like. (He also obviously doesn’t know Farsi, or he wouldn’t claim the Klingon-to-Farsi translation actually works!)

    My one-liner answer to Anderson would be: show me one widely accepted modern scientific theory which is purely the result of analyzing big data, without using the traditional approach of science. I bet he can’t!

  5. I agree with what the above posters have already said. I do think Anderson is oversimplifying and needs to consider the fact that just because we have this “new way” of doing things doesn’t make the old one obsolete. However, I also think he makes some good points.

    We are on the verge of something new with all the data that is accessible to companies such as Google, and even to the general public. While a lot has been done with it, we have so many more possibilities to explore how we can utilize and exploit the raw information data collection provides us. It allows us to do things people wouldn’t have dreamed of 50 years ago, or even 20 years ago. For example, the article links to a short paragraph called “The Petabyte Age”, which includes the figure of 600 terabytes being approximately the size of ancestry.com’s genealogy database, including census records from 1790-2000. Thanks to historical record keeping such as this, we have been able to look at trends of cultural expansion across our country going back to that time. But now with that data almost instantaneously accessible to anyone interested, combined with hundreds of thousands of users inputting their own genealogy information, anyone with some free time and a little data experience could come up with some fascinating insights into how our country’s population developed.

    Jake Hofman’s Olympics data example is another perfect argument for this. It probably wouldn’t have been useful, efficient, or terribly exciting for scientists in the early 1900’s to hypothesize about the progression of Olympic sprinters in the next 100 years, and then wait around to see if their theory was supported. But we can use data now to look back on that trend. And when we have data on the scale that Anderson is talking about, we have a mind-boggling amount of information we can look at to try to find these trends.

    I don’t believe the scientific method as we know it will ever truly become obsolete, but data on the scale that Google and IBM and similar groups are operating with will allow us so much more in terms of insights into the world. A biologist may argue that there is no point to Venter discovering these new species when he knows absolutely nothing about them, but work like his can motivate a team of researchers to get out there, find one of those species, and study it like they would with any other new discovery. Without the information provided by this huge data, they may not have thought to look there in the first place.

    I think the take home point of this article is too strong; that is, Anderson’s thought that “There’s no reason to cling to our old ways”. I think what we really need to do to take advantage of what big data can offer us is to cling to our old ways, but usher in the new as well, and combine these resources and abilities we have in order to continue accomplishing spectacular and insightful things. We can’t get anywhere without progress, and that should mean allowing ourselves to break from the tradition of the scientific method, and, if the data supports it, drawing scientifically justifiable conclusions from the numbers themselves.

    1. Eurry Kim

      Anderson is sorely misguided about the importance of theory. Particularly in light of Big Data, we should instead profess, “above all, theory.” Theorizing is a crucial process in the research method. It allows us to consciously think about the concepts underlying our theories and our methods of measuring those concepts. Howard Becker, an American sociologist, demonstrated the importance of this process using the example of crime. A shoot-from-the-hip researcher would immediately think of robbery, murder, and the like. Well, what about white collar crime? If you’re researching crime, then why not white collar crime? What about the definition of crime does not encompass white collar crime? In the same manner, what does big data from Twitter tell us about social unrest in the Middle East? How do we know that Twitter users in the Middle East are representative of the broader sentiment of those countries? These concepts should be thought about and defined before Big Data is established as a virtue. Defining abstract and elusive concepts is the nature of the hypothesis-developing process. If these are not consciously mulled over and empirically defined, then the conclusions will not add to human knowledge.

      A honed methodological approach to research can never be replaced by a “numbers speak for themselves” mentality. Anderson mentioned that “correlation [was] enough.” And while it is true that Target prints diaper coupons for expecting mothers and Facebook knows your childhood friends, where is the broader theory in these acted-upon correlations? How will we answer the question of pay parity? What are the causes of global warming? In a broader sense, how do we solve these wicked problems? Theory enables us to construct logical and empirical reasons for these phenomena. Without theory, there will be more people constructing correlations between cell phone usage and everything else that has grown exponentially in the past three decades.

  6. mcshaganoff

    The implicit assumption Anderson makes throughout this article is that with “Big Data” we will have samples so large that they nearly perfectly approximate their underlying populations. As a result, the likelihood of observing spurious correlations approaches zero as samples grow larger and larger, hence his statement that ultimately “correlation is enough”. But even if we can get data on this scale, correlation doesn’t tell us the true nature of a relationship. It only tells us whether Y increases or decreases when X increases (a small simulation at the end of this comment illustrates this). Correlation is only enough if one only cares that a relationship exists, not why it exists.

    This may be acceptable in certain contexts. For example, the massively profitable algorithmic trading company Renaissance Technologies is famous for trading any correlations they can find in the data, regardless of whether they understand the nature of those relationships. But I suspect that most of us yearn for a deeper understanding of why the world works the way that it does.

    Although Anderson makes several good points about the importance of new computing techniques to science, he doesn’t really articulate how this changes the scientific method at all. It seems to me that science has always proceeded from the observation of some facts (e.g. Anderson’s “correlations”) to a theory of why those facts exist, ultimately resulting in testable hypotheses to prove or disprove the theory. As Lily’s discussion of Venter’s work nicely illustrates, Big Data is allowing us to observe more facts for scientific inquiry at a quickening pace. Thus, it is as an enabler of the Scientific Method, rather than as a replacement for it, that Anderson’s “Data Deluge” shows its true value.
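
    To illustrate the point above that correlation gives direction but not mechanism, here is a small simulation (my own hypothetical sketch, not anything Anderson describes). A hidden variable Z drives both X and Y, so X and Y come out strongly correlated even though acting on X would do nothing to Y, no matter how many rows of data we collect.

    ```python
    # Correlation without mechanism (illustrative sketch).
    # A hidden common cause Z drives both X and Y; X and Y end up
    # strongly correlated even though neither causes the other.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000                      # "big data" scale changes nothing here

    z = rng.normal(size=n)             # unobserved common cause
    x = z + 0.5 * rng.normal(size=n)   # X depends only on Z
    y = z + 0.5 * rng.normal(size=n)   # Y depends only on Z, never on X

    print("corr(X, Y) =", round(float(np.corrcoef(x, y)[0, 1]), 3))  # about 0.8

    # Intervening on X (say, adding 10 to everyone's X) would leave Y
    # untouched, because Y never depended on X. The correlation is real,
    # but it says nothing about what happens when we act on X; for that
    # we need a causal model, not more rows of data.
    ```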

    1. Michael Discenza

      I would like to take up two applications that Eurry and mcshaganoff mention, social science research and global climate change, and discuss the utility of big data technology in each.

      First, in social science, I think it is important to understand that, unlike in natural science, most of the methodology and the stipulations about causality, and indeed the epistemology in general, are constructed and theorized discursively. I think the social sciences impose pretty stringent requirements for establishing causality, and that minimizes the impact that so-called big data methods might have in this type of research. One good way to think about the compatibility of these methods with social science is through the idea of latent variable methods. When we approximate real-world distributions that we learn about through the observed variables we create, we can’t learn anything about our subjects within the constraints of our human understanding, nor can we, through deductive reasoning, connect our understanding of the certainty of our findings to the basic maxims that allow us to be sure of knowledge. Latent variable models might not be useful for a social scientist trying to understand how people think in order to, say, guide policy or write a book, but they sure can be used in an applied sense for the projects of prediction and profit maximization—two things that aren’t really useful to scientists, but are quite useful for applied scientists such as entrepreneurs and those in industry.

      Global climate change is another area where we can draw similar conclusions about the usefulness of big data and latent variable models with infinite parameter spaces. We might be able to predict the rate at which the climate changes more accurately than models that draw on thousands of physical facts and “laws” enumerating the various feedback cycles and interactions of the earth’s physical properties. This is because in a system this complex, not all the interactions can be understood or determined with high enough conditional accuracy. Having a model that does a great job of predicting climate change, however, would be relatively useless when it comes to planning specific interventions to stop that change. We could not shape global policy by enacting a carbon cap low enough to save us from reaching a certain temperature, because we wouldn’t be able to say what that amount of carbon would be with any kind of certainty given a latent variable model. Moreover, because policy, and indeed science as well, are both human activities where epistemological systems are so important, a latent variable model would be of limited use. With all the trouble people (specifically of one political leaning) have accepting global climate change as a real phenomenon and accepting statistics (the science of lies), can you imagine justifying environmental policy by pointing to a latent variable data model for climate change that throws out centuries of physics knowledge?

      I guess this is my way of saying that Anderson completely overstated the utility of new big data methods for science, but I certainly appreciate how provocative his piece is.

  7. While Anderson uses Venter as a practical example, he also notes the limitations of work using “Google-quality computing resources”. I believe that for our respective fields to advance, domain expertise won’t likely be replaced by tools such as computers. Humans are gifted with the ability to abstract, which our cold and calculating friends currently lack. Yes, supercomputers of past and present have changed the data landscape that we currently live in, but ultimately there has been, and will be, a human right behind them.

  8. I’m not sure I understand why Chris mentions biological research - he implies we should be studying biology without creating models, but I don’t understand at all how this is possible. He then mentions Craig Venter’s sequencing voyage, doesn’t explain why it is significant, but wants to claim Craig is a model-less pioneer in genomics research. Shotgun sequencing absolutely uses a model of the genome. It also seems a bit vague what exactly his definition of a model is, and this may be the source of my confusion.

    Google Knowledge Graph and Google’s famous neural network, are they not prime examples of models built by Google?

  9. I cannot totally agree with Anderson’s opinion, for the several reasons below, although admittedly he raises a controversial topic in the data analysis area with some thoughtful ideas.
    Correlation is not sufficient to answer all questions, especially those raised in basic research. Anderson mentions biological science. It is true that there are many association analyses of high-throughput data from next-generation sequencing, but the truth is that, in most cases, association is just the beginning of the whole analysis process. It is not workable to stop at knowing such a relationship without digging in.
    Anderson cites many industrial cases to support the notion that traditional data analysis is obsolete and that big data is a totally new thing for which classical methods are not suitable. However, industry differs from academia in many respects. The success of Google’s ads does not necessarily demonstrate the success of its mathematical models; it might come from the fact that Google was first, and free for users. Users don’t care about the causal relationships between different websites. Research, on the other hand, is another matter: biologists care about the causal relationships between different genes and proteins. The current hot area of cancer driver gene research illustrates this point very well. To sum up, we cannot easily generalize a model or methodology from industry to academia while ignoring domain knowledge, and vice versa.

  10. Seeing so much opposition to Anderson’s article, I decided to play the role of devil’s advocate.
    To my mind, what Anderson argues is not that correlation is an answer to everything. Rather, he argues that the conventional approach of theorizing and hypothesizing before conducting the actual analysis of data is not always a must. (Note: as I view it, Anderson uses the word “model” not in a purely mathematical/statistical sense, but rather as a description of a phenomenon we hold in our minds, which in turn can be operationalized with the help of specific variables.)
    The ideal situation imagined by ideologists of research methods – first, one comes up with a hypothesis, and then one tests it against real-life data – is not the only way of making discoveries. The natural way of making discoveries (at least outside academia) often runs in reverse: one encounters something one cannot explain and, in an attempt to explain it, comes up with a theory. For various reasons such an approach, sometimes called the data-crunching or brute-force approach (one takes data, runs correlations, discovers that some of them are significant, and tries to explain them), is not even considered ethical within academic circles [Abrahamson, 2008]. Why is that?
    One frequent answer is that when the data set is large enough one can almost always discover at least one significant correlation, which may lead to excessive speculation and useless debates, stealing scientists’ time, when in reality the correlation may be an accident and might not imply causality (a small simulation at the end of this comment shows how easily such accidental correlations arise). But is that a valid reason to ban the “brute-force” approach? I would have enough courage to say “no”. The reason: it is possible to imagine situations where such a brute-force approach could be useful. Imagine looking for causes of cancer. If one day a correlation is accidentally found between cancer occurrence and some specific factor, wouldn’t there be a good reason to check the possible connection between these two variables? What if such a finding contributes to our understanding of cancer? Wouldn’t this alone vindicate the brute-force approach, where we don’t attempt to theorize/hypothesize before running correlation analysis? I would say it would.
    I could also speculate that there is a different reason why such an approach is considered unethical among some academicians. It might be because the brute-force approach turns science into a game of luck rather than a game of minds and guesses. One accidentally uncovered correlation/connection might lead to a great discovery, whereas a whole life spent theorizing might not. Such a possibility might make scientists sad and give them a reason to proclaim such a brute approach unethical. Hopefully, I am wrong.
    I am not trying to deny the value of science though. When a new discovery is made, be it through an accident or through an educated guess, science steps in to understand and explain why things are what they are, to produce a theory, and in this case looking for correlations is not useful anymore. This is fully the domain of traditional scientific approach.
    Therefore, the issue as I see it boils down to the question of purpose. What are we trying to achieve by applying these varying methods?
    1) Are we trying to explain why something happens? In this case we badly need the traditional scientific approach. By producing and testing hypotheses, based on the models we have in our minds, we can discover explanations and make sure that our models conform to reality.
    2) Are we trying to make use of the co-occurrence of events? In this case brute force is not bad. What if we just don’t want or need to understand? What if correlation is enough, as in the case of assigning rankings to web pages based on the number of links pointing to them? Then we don’t need a model describing why people find that specific website interesting whatsoever. We are able to achieve our goals without having full understanding. And such an approach deserves to exist. One way to think about it: we don’t really need an excellent knowledge of physiology to be able to walk.
    3) Finally, maybe we are just trying to take a deep dive into the sea of data to understand what to theorize about. In this case the brute-force approach can be helpful too. As data becomes very complex, it is likely that we will have no models to describe the relations between different factors. In this case correlations can be truly enlightening, at least in the sense that running correlations is better than nothing. As stated before, a correlation might be a thread that leads the researcher to a new discovery. What if something we spot in the chaos of data prompts us to discover the order? Remember the popular story according to which Newton came up with the theory of gravitation when he observed the fall of an apple.
    As a conclusion, I don’t think that Anderson is right in saying that the scientific method is obsolete or that the end of theory is here. We do need the scientific method and theories. Without them, deep understanding of the world is impossible. But at the same time there are other objectives to pursue, besides answering the question “why”, for which the scientific method is not the only (and sometimes not the best) instrument we can use.
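
    As promised above, here is a small simulation (purely illustrative, with made-up variables) of the “almost always at least one significant correlation” worry: scanning many unrelated factors against an outcome reliably produces a handful of “significant” hits by chance alone, which is exactly why I treat such hits as leads to check rather than as conclusions.

    ```python
    # Brute-force correlation scan on pure noise (illustrative sketch).
    # 500 candidate "factors" are tested against an outcome that is, by
    # construction, unrelated to all of them; several still come out
    # "significant" at the usual 5% level.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n_subjects, n_factors = 1000, 500

    outcome = rng.normal(size=n_subjects)               # e.g., some health measure
    factors = rng.normal(size=(n_subjects, n_factors))  # unrelated candidate factors

    p_values = np.array([stats.pearsonr(factors[:, j], outcome)[1]
                         for j in range(n_factors)])

    print("factors 'significant' at p < 0.05:", int((p_values < 0.05).sum()))
    # Roughly 0.05 * 500 = 25 spurious hits are expected. Each one is a
    # lead worth checking, not a discovery in itself.
    ```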

    1. Silis J.

      In response to your second question (“Are we trying to make use of the co-occurrence of events?”) and also, in a greater scope, your example about cancer, I believe that mere correlation is not sufficient. From my observations of medical publications in the last 5 years, the gold standard has been and will be (for the foreseeable future) the randomized clinical trial for establishing causal relationships. The reason I bring this point up is that we (as scientists and probably as humans) do not trust our lives to mere correlations. While we might find that protein A has a significant correlation with antioxidant uptake, this correlation cannot explain the biological mechanisms that govern the process. Without the mechanisms, it will be very difficult to produce therapies that would either stimulate or increase production of protein A. These biological mechanisms are identified through repeated randomized controlled trials, which allow us to posit that protein A increases the permeability of the intestinal lining, thereby increasing antioxidant absorption.

      Additionally, it is difficult to address the issue of confounding variables through correlations. It may be that protein A and protein B have collinear concentrations in the blood stream while, biologically, proteins A and B are completely different. I think there are too many issues left unaddressed by correlations for us not to want to pursue more knowledge.

      However, correlation has its importance in science: it allows us to know where to look and what to investigate. Without large data sets, we would not have the genomic data to identify that people with high antioxidant uptake have protein A. So yes, correlations are an indispensable part of research, and for some purposes they may be enough, but I do not believe that correlations from large data sets are capable of answering all of our questions. While we might not need an in-depth understanding of anatomy and physiology to walk, I believe such knowledge would surely be helpful for physical therapists who have to teach stroke victims how to walk again.

      (I am not saying that medicine is unique in its technical knowledge requirements; this is just what I am most familiar with for this healthy debate.)

  11. Human Ingenuity | Columbia University Introduction to Data Science Statistics W4242, Fall 2012

    […] on the web with a solution involving eigenvectors. Presented, also, for the student discussion on Anderson’s article. […]

  12. Although I agree with Anderson that science in the petabyte age has definitely shifted from the theoretical towards the quantitative, I don’t think data will ever replace theoretical modeling. In reality, models and data-driven approaches should complement one another, not replace one another.

    Just off the top of my head, a couple of problems with relying on data gathering alone: good data is expensive (the census), and data without theory can be misleading (Google had this problem with websites that did search engine optimization).

    The central idea is that just as we don’t know enough to make correct models, it is also hard for us to gather enough data to constitute a good sample.

  13. Overall, I agree with Anderson’s ideology although I would take issue with the extreme nature of his assertions. For instance, when data-driven analysis reaches conclusions which are totally antithetical to the ones which traditional theory holds dear, I would generally trust the numbers. Nonetheless, traditional theory serves as a check on the conclusions we may want to draw from the data. If the data is totally counterintuitive it is often a sign that, at the very least, further investigation is warranted. Therefore, I don’t believe that data mining really is the end of the theory, although it is the end of theory as the ideal way to draw conclusions. We will always need some theory to guide our data miners in their research and analysis. However, if there is a direct contradiction between data and theory, more and more, the data will (and should) take precedence.

  14. It is very interesting that Anderson was talking about the importance of data analysis four years back. His underlying presumptions and explanations are credible (although distracting), but his conclusion is indigestible. By relying purely on correlation, and not taking causation into account, Google might be able to predict a lot of things. But again, they are not perfect at that. And it makes me think about the next big change, a start-up that will eat up Google’s market by respecting the cause behind an event.

    At the same time, I think about the ‘butterfly effect’. A small initiation leading to a big change. And like the unified theory in physics, I’d like to believe in a unified theory in statistics. Where one statistical model with one correlation matrix (that would contain every single feature-event possible in this world if not universe) will be able to explain everything. That is some food for thought.

  15. There’s no doubt that the Petabyte Age has made it possible to store and process huge data sets with ease. However, I cannot agree with Anderson’s opinion that massive data makes the scientific method of building models useless. The reasons are as follows:

    Using the Google ad-matching example, Anderson states that we should stop building models because correlation is good enough. However, correlation is definitely not enough, and Google’s approach to data is not representative enough to speak for the rest of the data science world. The fact is, if passive data tells us anything, correlation is just a small part of the story. If there is a need to dig into more detailed information behind the scenes, say to investigate the click-through rate of the suggested ads or the trend of shifting customer preferences, we still need scientific methods (linear regression or k-nearest neighbors, etc.) to model and visualize the results.

    Regarding the use of correlation to “discover new species”, Anderson argues that with a massive database, a new sequence might correlate with other sequences that resemble those of species we know more about, and hence we could make guesses about the animal. The “matching” method works in theory, but the author may be forgetting that massive data brings noise as well. By simply examining single points of data, tons of other sequences may show a positive correlation with the new one. So without a scientific model to establish the true relationship, we won’t be able to draw any conclusions based solely on correlation.

    Finally, I am not saying modeling works fine everywhere, just as I don’t think correlation is the only method we care about. Each method has its own merits in application. To make better use of big data, data scientists have to choose the right way of processing the data instead of declaring one method superior to the other.

  16. Bianca Rahill-Marier

    I agree with much of what my classmates have said above, but I would like to focus my post not on repeating this, but on one specific issue I have with Anderson’s conviction that models are losing relevance:

    Firstly, I’d like to say that I, to some extent, agree with Anderson’s message. Anderson writes, “Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.” I tend to agree with this; deciding whether a page is ‘good enough’ by relying purely on big data is probably infinitely easier than sifting through the complex network of human decision-making that motivates a user to click or not click on an ad. The issue I take with Anderson’s argument is that, at least to me, this particular example doesn’t extend all that well to the rest of ‘science’. Google, and other Internet-based enterprises, are in the unique position of having a constant, large stream of data. And while there is increasing use of satellite imagery, as in the crop prediction example, I think the rest of ‘science’ is far behind in having access to this kind of data.

    My ‘domain expertise’ (if a B.S. degree and an in-progress M.S. even count) is in water resources and climate, an area where significant data is available on the climate front, but less so on the micro-scale of soil-atmosphere interactions, water use, and available water supply. Data is collected in a very isolated manner, and for the time being not all that many automated tools have been developed or tested that could revolutionize this any time soon. A big part of what we do is ask whether, by finding correlations between certain climate indicators (which are regularly and reliably tracked) and rainfall, we can help guide the decision-making of farmers on what and how much to plant.

    To me, this is very different than what big data offers, or at least how it is portrayed in the linked article, “Feeding the Masses: Data In, Crop Predictions Out”. The name is misleading, since the author explains how satellites use real-time data to track the progress of crops and then produce predictions (not all that far ahead of time) that help soft-commodity traders anticipate price hikes and supply shortfalls. This is not feeding the masses. Feeding the masses is being able to predict, sufficiently far ahead of time, what the agricultural conditions will be, so that farmers can respond in a way that optimizes food supply and their revenue, ideally in a way that also incorporates responsible and sustainable water use.

    Big data will certainly (and I am in this class because I think it will) help improve and maybe even revolutionize these models, but I have trouble seeing how big data can predict the future without assuming that a certain underlying causation or mechanism exists. In another of the linked articles, a doctor was using big data to scan the aging of skeletons, but ultimately he was using it to find an underlying mechanism. Certainly, there will be less and less need for models to ‘fill in the blanks’ of data as data collection and access expand, but even if correlation increasingly implies causation, simply the act of making data interact and discovering correlation is, to me at least, a model in itself. All this to say: I think, in agreement with Anderson, that we may be moving from the model-experiment-validate approach to a data collect-analyze-interpret approach. I do not think this makes models invalid – I think it just makes them better. Until everything in the world is measurable in real time, models cannot be replaced. Until then, we need causation and an underlying mechanism to know what to measure and when, in order to predict a natural phenomenon, such as crop output, far enough in advance for the prediction to be useful to people. I’m not sure big data can predict the future, but I think big data, informed by our scientific models and knowledge, can get close.

  17. The author makes a compelling argument about the end of theory; however, I would ask whether this is just another example of historical technological advancement. This is not the first time technology has caused a shift in the way that research is conducted, forcing adjustments to scientific theory (e.g. the microscope, MRI, the first computers, etc.). I would imagine that after any major technological advancement there is a rush of information that precedes theory, with researchers using the new technology while not fully understanding the direction it takes them. Then, after enough time and exposure, researchers once again use theory to direct the new methods in increasingly clever and insight-driven ways.

    Massive data does not seem like an entirely new concept. The author argues that big data needs an “entirely different approach” and that it “requires us to lose the tether on data as something that can be visualized in its totality”. My thought is that frequentist statistics has always been based around the idea of infinite data and a starting assumption of independence between variables and observations. Statistics has always been dealing with the concept of big data, and technology has finally caught up.

    I agree with the author that traditional statistics as a discipline is being challenged by data-related advances; however, this threat seems to be driven more by progress in computational technology than by the data itself. I would argue that the big shakeup is primarily centered in methodology, which will require researchers to adjust and learn without throwing out the old scientific process or the theoretical underpinnings in probability theory. Theory is still valuable, but the current shortage is in people who understand data-driven methods.

  18. After my first read of Anderson’s article, I felt that he had oversimplified the issue in order to make his point and was ready to disregard his argument. However, after reading the discussion here, I do believe that there is a middle ground where Anderson’s quantitative approach meets the scientific method. Bianca nicely stated that we may actually be moving from the model-experiment-validate approach to the data collect-analyze-interpret approach. I think there is utility in this, especially now that big data is available, because in my experience exploratory data analysis has often been more of a formality, whereas it is considered an important first step in data science. I think big data provides great opportunity for identifying issues and trends that can inform theoretical knowledge. Anderson’s point stands for those issues that we do not care to understand, but for most issues, especially those related to human behavior, we are (or at least I am) interested in understanding trends and patterns observed at a deeper level.

    Also, to go on a slight tangent, if there is no scientific method, how will we improve knowledge on those things that cannot be counted? In the context of human behavior, there are populations that are not contributing to the data deluge because they are not connected. Does that mean that scientific discovery among these populations should cease?

  19. The author claims that we don’t need models any more; instead, we just need massive data. When I read the first line of the passage, I felt it would be quite impossible to verify the author’s opinion. But after I finished reading the whole passage, I found myself thinking that the author is quite reasonable, and that we can find examples that really fit his view. In the past, we could not find enough data to describe a relationship, so we had to build all kinds of models to generalize it. But now we can get enough data to get close to the relationships among the variables. Take the PageRank algorithm: we don’t have to know why a page should be regarded as important, and we don’t even have to know its contents; we just need the link data (a toy sketch at the end of this comment shows the idea). Another point is that decision-makers can get better understanding and analysis from data visualization. As in the talk given by Hans Rosling, a Swedish statistician, we can get a vivid, clear, and exact understanding from lively and dramatic data.
    http://ed.ted.com/lessons/hans-rosling-shows-the-best-stats-you-ve-ever-seen

    But I still think the author is too optimistic. Data alone is not enough; data and models should be considered together. The reason is that I think the true purpose of modeling is to find the natural relationships between objects and then use mathematical methods to interpret them. The models we typically use are thus a bridge between natural relationships and their mathematical interpretation. So maybe data has become more and more important, but we still cannot totally get rid of models.
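
    Since I mentioned PageRank above, here is a tiny sketch (a simplified toy version, not Google’s actual implementation) of how a ranking can come purely from link data, without knowing anything about page content:

    ```python
    # Toy PageRank by power iteration (simplified illustration, not
    # Google's production algorithm). The ranking comes only from who
    # links to whom; page content never enters the computation.
    import numpy as np

    # links[i] = pages that page i links to (a tiny four-page web)
    links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    n = len(links)
    damping = 0.85

    rank = np.full(n, 1.0 / n)
    for _ in range(100):
        new_rank = np.full(n, (1.0 - damping) / n)
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank

    print({page: round(float(score), 3) for page, score in enumerate(rank)})
    # Page 2 ends up ranked highest simply because the most link weight
    # flows into it.
    ```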

  20. Shaodian

    I do agree that “all models are wrong”; however, we can do nothing without models. Claiming that “data is everything” is ridiculous.

    First, everything is a model. “No model” is itself a model too. When you try to conceive, describe, or construct something, you are modeling. Google uses models. It is true that they might not follow “hypothesize, model, test”, but MapReduce is a model, not just a random gathering of data and correlations. Google’s ad system uses models as well; in fact, every state-of-the-art search engine provider (Google, Bing, Baidu) uses logistic regression to predict user clicking behavior (a minimal sketch at the end of this comment). You can’t argue that such a model is not a model; it’s just not the kind of traditional model that is built on sound and reliable hypotheses or axioms and can be falsified. Clearly, the author has a limited interpretation of “model”.

    Philosophically, we could classify all the things in the universe into two categories: those that are “scientific” and those that are not. Apparently, the author assumes from the very beginning of the article that a “model” should be scientific, and thus deterministic, certain, falsifiable, and able to be captured by clear mathematical equations. However, that is not the case. In machine learning models, especially statistical ones, the rationale for a conclusion or result is not absent; it is hidden in the model. Statistical models reflect how people view the problem (linear? hierarchical?) and how we choose the representation (graphical models? neural networks? linear models?). The author gives the example of Google advertising and claims that “Google didn’t pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day”. This is completely wrong. First, “analytical tools” involve a lot of models. Second, “Google doesn’t know” does not mean there is no culture, convention, or, more precisely, pattern inside. Finally, “Google doesn’t know” does not mean “Google doesn’t want to know”. All the patterns and rules are hidden in the models and the data. It is irresponsible to ignore the rationale and expect to win tomorrow on data alone.

    This argument is just like the centuries-old debate between empiricists and rationalists, which will never come to an end.
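
    To make the click-prediction point concrete, here is a minimal sketch (with entirely made-up features and synthetic data, not any search engine’s real system) showing that such a click predictor is itself a model, with chosen features and a chosen functional form baked in:

    ```python
    # Minimal logistic-regression click model (illustrative sketch on
    # synthetic data; real ad systems are vastly larger, but the point
    # stands: this is a model, not "just the numbers").
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 50_000

    # Made-up features: ad relevance score, ad position, user's past CTR
    relevance = rng.uniform(0, 1, n)
    position = rng.integers(1, 6, n)           # 1 = top slot, 5 = bottom
    past_ctr = rng.beta(2, 20, n)

    # Synthetic "truth": clicks depend on the features via a logistic link
    logit = -2.0 + 2.5 * relevance - 0.4 * position + 6.0 * past_ctr
    clicked = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

    X = np.column_stack([relevance, position, past_ctr])
    model = LogisticRegression().fit(X, clicked)

    print("learned coefficients:", np.round(model.coef_[0], 2))
    print("P(click) for a relevant, top-slot ad:",
          round(float(model.predict_proba([[0.9, 1, 0.10]])[0, 1]), 3))
    ```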

  21. Charlene

    Hey
    I personally love this article a lot. I buy his idea.
    I was confused when I first looked at it. But I think the example of “discovering a new species” is great! It shows the downside of stereotypes: when we have a model inside our minds, we stop creating and just follow our routine. However, in the era of the data deluge, no one has gone before us; we are the first generation faced with this new species! We are all Darwin. There are no books, no models. We should radically change our minds to fit the new era. Google is a good example of a company that jumped out and succeeded in the data era.

    However, as a beginner in data science, I have to say that if you did not teach me models like k-nearest neighbors, I would feel lost. The scientific method is a good way to get things moving. So my perspective on this issue is that, in education, we should combine models with the freedom to go beyond them. This means we have to leave space for students to think outside the box, while still giving them some kind of help.

    So it is too extreme to reject the scientific method entirely; there should be some balance point. However, I think only when we become experts do we have the knowledge and confidence to view the question calmly.

  22. Zaiming Yao

    I think even in the new era, we should never move away from models or theories. That is the ultimate goal of science. We do research to explain the “why” in the world, to find the paradigms we can fit things into, to dig into the relations between the constructs. Models may not be to the taste of the data deluge, but I believe that with the development of data science, we will gain more and more insight into these messy concepts.

    Also, I doubt his argument that we are not concerned about why we do something, but just need to track and measure behavior and let the numbers speak for themselves. If that happened, you cannot imagine the chaos. We would just look at the sales numbers for a product without analyzing consumer behavior to see the underlying psychological changes, and then we would never be in the right place! Numbers can only tell us the phenomena, not the theory, not the why!

  23. It is difficult to say I wholeheartedly agree or disagree with Chris Anderson’s perspective on the usefulness of models. The practice of “correlation is enough” will never satisfy the scientific need to understand activities and events. By simply recognizing patterns or identifying outliers, we limit our knowledge to a single direction of information. The scientific method has maintained its usefulness over the centuries because of its ability to foster the understanding of the world and of scientific events. The ability to model an event based on an understanding of its components serves to enhance our understanding of the subject being studied (irrespective of the success or failure of a hypothesis). This practice allows knowledge to be developed, as opposed to being “told” to us.

    However, the direction and information that can be acquired by Machine Learning cannot be ignored. As opposed to seeing the two worlds as mutually exclusive, it may benefit us to find an over-arching methodology that blends the practices. It can be argued that discovering the existence of a species without theorizing its existence essentially robs scientists of the knowledge that is required to make such findings.

    But with a newly adopted method of beginning at the “conclusion”, the discovery of additional information can be propelled by modeling from the results. The relationship becomes a leap-frog effect, one method constantly outperforming the success of the previous one. To draw an analogy between the two: I see the world as a dark room and consider the scientific method a careful series of steps that can only benefit from the flashlight that is machine learning.

  24. I am not clear about my own position for or against the article. From my own perspective, the language of the text reads like a lawsuit against the history of human thinking and cognition so far. Since big data is a very newly disputed topic, I wanted to know how it has helped expand the scientific approach compared to the so-called “traditional” style. (I am asking; I really don’t know much about the different contributions and implications of big data.)
    Models were never meant to be perfect explanations of the real world; they are simply convenient expressions of the phenomena.
    To the claim that correlation is fair enough, I would say: good, so I assume pseudo-logic is what we use then. But if it works for everyone (say, in the case of showing people their targeted ads), then it’s fine. Thinking rationally, this might not result in the best choice, but that’s what people do; we’ll make products for such behaviors.
    Consider the classic example in which we have more consumers of ice cream when it gets hot and also more murders in the summer! But there’s a loop here (I assume). Is the correlation enough because of the huge amount of data that lets us infer, or because we don’t have enough data (which I believe could be the case) to really explain it? Who knows! When we cannot come up with perfect reasoning, we make cut-off points. What is the argument against causation? Is it that the other way (correlation only) worked somewhere successfully? Or is the article saying: do not think only about causal relations; correlations will lead you through many problems, so make cut-offs? But I still have a problem with the assumptions being made. We are told there is some entity called “data” that will be satisficing for our inferences, but this ignores how and what data we are collecting.

  25. The author overestimates data processing technology as a substitute for statistical models. Mathematical modeling has been the foundational approach to solving quantitative problems, in the sense of abstracting masses of real-life facts into seasoned theories. Although “black box” pattern-detection approaches can solve problems implicitly, without the pain of reasoning, for people who lack the required knowledge, this won’t be a long-term gain, since they will never learn anything new from the problem-solving procedure itself. Theories and models will certainly be invalidated as more data is revealed, but the scientific approach brings more than just an answer, which may become even more nonsensical with data processing technologies alone.

    It is always a dilemma to argue models versus implicit approaches, since big data technologies, such as cloud computing and machine learning, are themselves based on statistical or mathematical models. Programmed computers may solve problems, but they will never give us inspiration for innovation from their results without insight into models and theories.

  26. I always enjoy reading provocative writings and I feel this assignment obliged me to structure some thoughts I have had in mind for a while.
    Reading the thoughtful reactions to Anderson’s article, I noted two interpretations of the term “model”. I think Rachel calls attention to “statistical models” when she suggests that “Statistics, as a field, has long been concerned with making inferences from data, and often involves building and fitting models [think: regression, for example] in order to make those inferences.” Framed in this way, I am very optimistic about the benefits of algorithms over traditional statistical modeling in the social sciences. I find very persuasive the critiques elaborated by authors such as David Freedman, Richard Berk, and Christopher Achen on the dangers of using statistical models for causal inference. However, some time ago I found this article written by Leo Breiman (the originator of CART and Random Forests), and I was compelled by his case for the “black box” techniques developed in machine learning. In this sense, I would agree with Anderson when he says “With enough data, the numbers speak for themselves”. I feel very comfortable putting aside several assumptions imposed by statistical models that are ignored in algorithmic modeling.
    http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1009213726
    On the other hand, I noticed that some posts interpret the term “models” as “theoretical models”. Probably Anderson had this kind of model in mind when he characterized science: “These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works”. I am convinced that the claim that a high volume of data might substitute for theoretical thinking cannot be taken seriously, only as a provocation (although there have been acute attempts to substitute algorithmic modeling for causal thinking; below is a review by David Freedman of one of these attempts).
    http://philpapers.org/rec/HUMTGL
    As far as I can see, Anderson’s assertions immediately raise the mantra “correlation is not causation”. Although I totally agree, I may change my mind after reading this post on the supposed perils of correlations.
    http://www.slate.com/articles/health_and_science/science/2012/10/correlation_does_not_imply_causation_how_the_internet_fell_in_love_with_a_stats_class_clich_.single.html
    As the blog post points out, we should not be tempted to automatically discard correlations, arguing “Your hype is busted. Your study debunked. End of conversation. Thank you and good night.” Indeed, correlation may not imply causation, “but it surely is a hint.” For some reason, I don’t feel repulsion for Anderson’s article. Maybe I am somewhat naïve, but I am sure that a dose of common sense could fix the pitfalls of plain correlation in many, many applications. Or not?
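    As a concrete (and entirely simulated) illustration of the contrast between the two modeling cultures discussed in the Breiman article linked above, here is a small sketch: an ordinary linear regression versus a random forest on data whose true relationship is nonlinear. The data, parameters, and model choices are my own illustrative assumptions, not anything taken from Breiman or Anderson.

```python
# A small sketch in the spirit of the "two cultures": a classical linear model
# versus an algorithmic "black box" (random forest) on simulated data whose
# true relationship is nonlinear. All numbers here are made up.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(2000, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.3, size=2000)  # nonlinear truth

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# The forest typically predicts much better here, but it offers far less insight
# into *why* -- the price of the black box.
print("linear R^2:", linear.score(X_test, y_test))
print("forest R^2:", forest.score(X_test, y_test))
```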

  27. Course Announcements (Wednesday, 10/3) | Columbia University Introduction to Data Science Statistics W4242, Fall 2012 · · Reply

    […] #2 is due tonight. I love the healthy debate you’re having on the “End of Theory” post. Keep up the good work! Homework #3 will be assigned tonight. Share […]

  28. Boti Li · · Reply

    I cannot entirely agree with Anderson’s opinion in this article. He argues that, relying on data science alone, scientists can make breakthroughs in their fields with little background knowledge and few models. As an undergraduate I majored in chemistry, where breakthroughs are generally based on experiments in the lab. An expert in chemistry needs some insight into a topic at the very beginning, and based on rich domain knowledge he or she forms new but subjective hunches or vague assumptions about it. In the following months, he or she has to devote all of their time in the laboratory to testing those new ideas in almost every possible or promising way. This tough process cannot be carried out by data science alone, because experimental conditions cannot be predicted from previous data: chemical interactions between different substances are extremely complicated and influenced by countless environmental factors. A successful experiment therefore relies largely on personal experience, and someone who completely trusts the results of analyzing previous chemistry data sets may be troubled by plenty of accidents during the experiment. In sum, from my point of view, disciplines that need intuition grounded in rich domain knowledge cannot be done merely with data science.

  29. kimnbeul · · Reply

    It’s a very interesting article! I somehow buy Chris Anderson’s opinion. It’s true that scientists have long been concerned with making inferences from data, which often involves building and fitting models, such as regression, in order to make those inferences. Even though their models sometimes have problems and biases when generalized, we have followed this approach to better grasp the concepts or circumstances we want to understand. That has been the historical, traditional way of understanding the world so far. Anderson’s notion presents a different view of that former way, arguing that no models are necessary and “correlation is enough”, and that in the context of massive amounts of data, “they (like Google) don’t have to settle for models at all.”

    In my opinion, data and theory should not have an “antagonistic relationship” but should pursue “complementary cooperation”. In the piece, Chris Anderson strongly argues that the traditional approach to science (hypothesize, model, test) is becoming obsolete. However, I still believe that theories are necessary to see the world better. First of all, data alone cannot explain everything in fields that require in-depth knowledge, such as human behavior, sociology, taxonomy, ontology, and psychology, because we can never collect 100% of the data needed to understand them; some analysis and interpretation is required. What’s more, data itself can contain errors, so depending on data alone can be misleading and lead to worse results. Thus, I think we have to keep looking for ways to make data and theory cooperate.

  30. I find it hard to take this argument seriously. We’re supposed to believe that with increasingly massive quantities of data, and sufficient computational power able to process it, theory will become obsolete as the “numbers speak for themselves.” But it seems to me that theory is just as useful when data is abundant; perhaps more so. Spurious correlations become more of a hazard as the amount of data increases: it’s easy to find statistically significant relationships that may mean nothing, which is why we’re often warned against “star gazing” when analyzing our data (a small simulation at the end of this comment illustrates the point). That’s where theory comes in, because it can specify in advance the kinds of evidence needed to support or refute it.

    The idea that more powerful computers can magically find the correlations that are meaningful is seductive. But even correlations are based on an underlying statistical model, and the algorithms have to be made by someone. In other words, modeling assumptions are always there — the question is whether we’re being explicit about it. If we give up any theoretical ambitions, we’ll be left with the kind of “advanced biology” described in the article, in which (apparently) the computer tells us that some new species have to exist without specifying what they are or even how to look for them. To paraphrase Jaron Lanier, scientific knowledge is only meaningful insofar as it is intelligible to humans.
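    Here is a minimal simulation of the “star gazing” hazard mentioned above; everything in it is made up for illustration. The outcome is pure noise, unrelated to any of the 200 candidate predictors, yet at the usual 5% threshold a handful of “significant” correlations still show up by chance alone.

```python
# A minimal sketch of the multiple-comparisons ("star gazing") hazard: the
# outcome is independent of every predictor, yet some correlations look
# "significant" at p < 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 500, 200
X = rng.normal(size=(n, p))   # 200 candidate predictors, all pure noise
y = rng.normal(size=n)        # outcome, independent of every predictor

p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])
print("spuriously 'significant' predictors at p < 0.05:", np.sum(p_values < 0.05))
# Expect roughly 200 * 0.05 = 10 false discoveries; more data alone does not
# remove the need to specify in advance what counts as evidence.
```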

  31. I disagree with Mr. Anderson’s statement that correlation is making the scientific method obsolete. Data correlation never explains why something is happening; it only says that something is happening along with X, Y, and Z. That may be helpful for analysis and product design now, but it stunts future growth. The scientific method helps you figure out why something is happening, so you can manipulate the science to make something new. An example comes from materials science. Using data correlation, it could be detected that metals are stronger than glasses and that at certain temperatures they become brittle. This correlation would be helpful in choosing which material to use in constructing a building, but it is limited to material selection. If one looks closer and understands what the atoms and molecular structures are doing, one can make a composite material that works better in certain situations. One could argue that if the correlated data also included observations of the atoms and molecules, this discovery could still be made. That is true, but how would you know to look at the molecular structure rather than any of the numerous other characteristics, such as the color of the material? A simulation that took so many factors into account would be bulky and inefficient and would not get to the heart of why something is happening.

  32. Andrew Ghazi · · Reply

    I’m going to jump on the bandwagon here and say that this article seems really misguided. Anderson seems to think that the entire purpose of statistics is prediction. While that may be more true from a commercial point of view, I feel that within academia, for instance, models are less about predictive power and more about showing that one understands the process underlying a given phenomenon. Models also serve as a means of communication (this ties in with what the instructor of the computational workshop said about code last week); they summarize complex relationships in ways that people can understand. Model parsimony and interpretability are valuable characteristics not because they increase predictive power (they frequently make it worse) but because they allow others to wrap their heads around what you have come up with. If you just “throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot,” that doesn’t mean you’ll have any idea what to make of what pops out.

    Also, I’d like to mention that only 2 of the above 33 comments seem to buy Anderson’s argument to some extent. So there’s some data pointing to a conclusion, for what it’s worth :)

  33. Adam Obeng · · Reply

    I can’t fault Anderson’s enthusiasm, but it seems like there are a couple of connected but distinct issues that are insufficiently distinguished here. There is quite some theory in what Google does: for returning search results, it takes a theoretical leap to assume that the best predictor of what people want to find is incoming links, and not only incoming links but links weighted in a very particular way (a toy sketch of that weighting appears at the end of this comment). Also, Google doesn’t just rely on correlation in the data it has acquired; it is a pioneer of A/B testing. In other words, experiments.

    Also, I think there is a relevant difference between model and theory. To take Anderson’s example of Venter’s shotgun sequencing, what would have happened to the project of matching new gene sequences to those of existing species if Venter had compared the distribution of nucleotides rather than using a matching algorithm? Or, given that everything is just numbers, what if he had calculated the mean and standard deviation of adenine, cytosine, guanine and thymine? While we need not have a specific model of the relationship between mass, distance and the gravitational constant, it might be beneficial to have a concept of a planet when we’re examining data on planetary movements.

    Imagine the extreme case for a moment: we have a record of the state of every particle (or wave-function, or whatever, physicists correct me[1]) in the Universe over time. After having collected enough data, we’ve got a giant lookup table: if the configuration is like *this*, then next it will be like *this*. What’s the next step? This black box can predict whatever we want, but wouldn’t we want to understand how it works? Wouldn’t we pull out the matrices and try to find patterns we can understand? Wouldn’t this be the beginning — not the end — of theory?

    P. S. Just because there is a consensus, can we conclude that there is a right answer?

    [1] Yes, yes, Uncertainty Principle, Maxwell’s Demon, etc., etc.
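    A toy sketch of the “links weighted in a very particular way” point: the core PageRank idea as power iteration on a tiny, entirely hypothetical four-page web graph. The graph and the damping factor here are illustrative assumptions, not Google’s actual system.

```python
# A toy power-iteration PageRank sketch on a hypothetical four-page web graph.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}  # page -> pages it links to
n = 4
damping = 0.85  # illustrative value

# Column-stochastic transition matrix: each page splits its vote among its outlinks.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - damping) / n + damping * (M @ rank)

print(rank)  # pages with many (and well-ranked) incoming links score highest
```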

  34. For the longest time, many areas of academia and research have relied on the fact that, by creating a sample large enough to represent the larger population in question, they could model reality with a certain level of confidence and minimal constraint. As someone who also participates in large-scale research in the natural sciences, I know that being able to model with better overall accuracy creates a better foundation from which research can proceed; accuracy reflects how usefully the research can be interpreted. However, it is becoming more and more astonishing what we are able to do with current technological advances and data-manipulation techniques. To say that models are becoming obsolete is a little extreme in my personal opinion, but they are straying from being mere models and converging toward “the truth”. It is not far-fetched to say that as our ability to collect data progresses, our ability to forecast becomes more and more accurate, which will make modeling somewhat redundant, and models may even stray from what is actually there. However, that time has not quite come in certain fields of study. There will nonetheless be a day when, “with enough data, the numbers [will be able to] speak for themselves,” and modeling will become only a relic of history.

  35. A few paragraphs into his piece, Chris Anderson implies – likely inadvertently – the difficulty of separating data from science: “This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear.” Treating data analysis as a replacement for science means prioritizing tools over process. Models are still posited, but data allows our initial understanding to be more complete and our models to be fuller; our tools do not create solutions, and data only helps answer the questions we choose to ask.

    The term data science seems to fit this field well, but it needs to be understood as an extension of science. While I agree it requires a more evolved way of thinking and analyzing, it is embedded in the scientific process. Data collection tools have simply expanded the laboratory to a scope we did not foresee until recently. No one would argue that the Manhattan Project rendered the Curies’ process obsolete.

    We are on a continuum in our evolution and our understanding of ourselves and of the world around us. While it may be tempting to see ourselves in a new age relying on probabilistic analysis rather than absolute theories, traditional sciences and engineering have been shifting towards this understanding since at least the Enlightenment and Laplace. The human tendency is to see a Great Leap Forward, when the current state of affairs is the result of deliberate change through action, observation, reflection and accountability. This is science.

  36. More data is never a problem. The problem is the way we interpret it and the approach we use to access the truth of the world: models. However, accuracy built on ever more assumptions is not a good approach. Human beings need another shift in mindset.

  37. I totally disagree with Chris’s point. “What is science?” As far as I am concerned, science is the system used to describe and explain natural phenomena through observation and experiment. That means that in order to explore phenomena, we need to look for the “why”. In the real world, even if you have a huge amount of data and some of the variables are correlated with each other, we cannot draw results or conclusions from the correlation that answer the “why”. I believe the conventional scientific method is more trustworthy, because it tests hypotheses in search of the “why” behind the phenomena.

  38. Yige (Tony) Wang · · Reply

    It seems to me that the attitude of this article is a little too cynical, and taking Box’s “all models are wrong” to this extreme goes too far. Models sometimes do work, and sometimes work perfectly. Models provide researchers a theoretical foundation which, I believe, better supports the cogency of an argument. The approach to science (hypothesize, model, test) still has its validity in logic per se.

    However, paradoxically, I somewhat agree with Anderson’s claim that “The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.” The word “falsify” catches my attention. The idea that models or laboratory experiments could falsify theories reminds me of my experience reading psychology research papers, in which, I believe, participants can easily be manipulated in experimental settings and thus be subject to biases, which could in turn “confirm” the hypothesis. People might react differently outside of a lab setting, in ways that go against what the theory predicts. As a result, these pre-constructed lab experiments could, to some extent, “falsify” theories.

    But our approaches to questions should not be circumscribed by what the model sets out. I agree with Anderson’s point that analysis without a coherent model, theory, or hypothesis might generate interesting results as well. The pattern in the data can be strong enough to tell a compelling story about people’s behavior without any theoretical foundation; that is what gives birth to data science. I think both approaches can work, depending on your objectives. The traditional “hypothesize, model, test” approach might be more appropriate for rigorous studies in academia where theory is demanded, but in a corporate environment an approach through the lens of correlation can be more efficient.

  39. Luyao Zhao · · Reply

    Indeed, data science provides us a useful new approach, but it may be a bit of an exaggeration to claim that massive amounts of data and applied mathematics can replace EVERY other tool, including the whole of science. In the article, the author states, “We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” Well, that may be true, but:

    First, we have to have the numbers to throw into the computer. Not everything comes with a big pool of data for us to use. For example, if we want to create a new medicine that can cure cancer, can we just try every chemical compound on lots of people in order to obtain the data set? In many situations we simply do not have enough data. In addition, even if we have all the data, it is much more efficient to use only the relevant data, and to know which data sets are relevant and which are not, we need scientific theories and structure.

    Second, we need to build or choose the most suitable statistical methods and algorithms, which also requires that we have the relevant scientific knowledge and understand the underlying mechanism.

    Third, after we throw the numbers into computers, what we can get is still data. We still need to comprehend the results, understand why the results are like this and how to utilize them. Without the knowledge of the scientific fields that the data belong to, we cannot fully take advantage of the data.

  40. This article posits a provocative point of view in which the author pretty much dismisses the value of human judgement, letting data (actions, views, clicks, …) speak for itself as the holistic manifestation of human behaviour. One problem with negating the value of causation and reasoning is figuring out what to do with the data once the correlations are understood. The fact that models can be tested allows randomness, spuriousness, the effects of time, and so on to be taken into account. If anything, incredibly large amounts of data do not make the scientific method obsolete but all the more powerful: with more (sound and comprehensive) data, less robust models and weaker hypotheses can better be weeded out.

    The power of machine-generated algorithms and pattern recognition is undeniably evident across numerous fields and industries. Nonetheless, to take the author’s example (and perhaps question the blindness of his admiration for Google), there is yet to exist a tool (Google’s or any other) that produces perfect translation (regardless of the variety of translation theories) from one language to another. It is precisely an example where context is necessary and the machine-generated prediction is still imprecise. The translations, though constantly improving, are almost always imperfect when more than a clause is translated. Similarly, the way ads are matched to content is often understandable, but also a misrepresentation of the consumer’s actual interests or desires. Ultimately, models (and people) are necessary, helping not only to comprehend underlying relationships and correlations but also to make sense of the outliers and misfits.

    In a way, just as there is debate as to the extent to which social sciences can claim the ‘science’ in their name, we should ask ourselves how ‘scientific’ a data scientist is, or should be. Should a data scientist, in addition to all the roles alluded to in class, be attempting to apply the scientific method? Should we be presenting axioms and stating assumptions, developing and re-postulating hypotheses before and upon EDA, testing relationships, exploring causality, presenting the data (visualising it), making projections, developing reproducible models with general and/or specific rules and applications etc… ? And with what degree of confidence?

  41. Anderson’s article is, first of all, a polemic, and as such it is highly successful, as evidenced by the interesting debate it can still instigate after four years. However, even a polemic is based on arguments, and I want to focus on what I see as the two main argumentative pillars of Anderson’s article, pillars that, as I will try to show, are not robust.

    First, Anderson presents the reader with a vision of a better new world, based on data science, with the example of Google ads. However, Google ads for news websites are, even four years after his writing, still far from being able to generate enough revenue to support the kind of quality journalism that is crucial to the functioning of a democracy. That revenue is still mostly generated by ad salespeople who work their networks of business contacts and maybe, once in a while, have a survey carried out. Chalk one up for small data.

    Second, Anderson tries to expand his argument from the mundane world of online business to the echelons of science that analyze human behavior. However, it is notable that he picks his examples of the productivity of big data mostly from the natural sciences. This, it seems, is no coincidence. Even all the data collected by online services does not give us the whole picture; for example, we usually get much lower-quality data on demographic background and opinions than in traditional surveys. In addition, Anderson assumes that big-data analysis of human behavior would merely describe this behavior. It is far more probable that it would also influence behavior, as the Google, Facebook, and similar algorithms already do by directing people’s attention. This influence may even give rise to a short circuit in which big-data analysis unearths relationships that it has itself created. This problem and others, ironically, are beyond the scope of big data; they must be approached with the toolkit of the traditional social sciences.

  42. Albert L. · · Reply

    I think Anderson’s argument rests on the assumptions that theories and models are merely means to an end, that the available “tools” are not good enough to explain a complicated world, and that we have reached an era where we can get to that end without those tools.

    Although this is kind of off topic, his argument reminds me of two types of students in a math class: 1) those who merely memorize equations and recipes to tackle questions, without understanding how and why they work, just to get an A, and 2) those who spend time trying to really understand the fundamental mathematical concepts (even if that means not getting an A). Whose approach is right? Ideally, a student would want to understand the concepts well AND get an A, but what if a student has no option but to choose one (because of limitations of time, resources, etc.)? I don’t think there is a fundamentally “right” answer; it depends on the student’s ultimate goal.

    Models and theories are good because they allow us to understand the data and ultimately predict the future. But we live in a fast-paced world full of complex systems, one that requires us to make decisions quickly. Think of a flu epidemic, for example. If scientists are interested in the mechanism behind the epidemic, they would start with a hypothesis, test it, and validate their theory. This iterative process lets us understand the phenomenon better step by step, but it can be slow and costly and might not even point us toward the best “next action”. If our priority is to detect where the imminent flu outbreak will occur, and some heuristic based on Twitter or other big-data analysis predicts better than the available models, then we should use it. One interesting example that comes to mind is that in 2009 Google was able to detect influenza epidemics using search engine query data [1]. This is an example where models and theories are not what answers the question of interest (a rough sketch of this kind of query-based prediction appears at the end of this comment).

    So I guess it boils down to what goal you have: “how does the system work?” versus “what is my next action?” There should be a balance between the two, and saying one method is “more” right than the other is a bit of a stretch. Therefore, to Anderson’s comments that “numbers speak for themselves … and … correlation is good enough,” I would say: well, it depends.
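    A rough sketch of the query-based flu prediction idea mentioned above, on simulated data: regress a flu-activity measure on the weekly frequencies of a few hypothetical flu-related search queries. The numbers, the query construction, and the plain linear regression are my own simplifying assumptions; the actual 2009 system was considerably more elaborate.

```python
# A minimal sketch of query-based flu prediction on simulated data: fit a
# linear model from hypothetical search-query frequencies to flu activity.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
weeks = 150
true_flu = np.clip(np.sin(np.linspace(0, 6 * np.pi, weeks)), 0, None) * 5 + 1  # flu activity index

# Hypothetical weekly query frequencies that track flu activity with noise.
queries = np.column_stack([
    true_flu * rng.uniform(0.8, 1.2) + rng.normal(0, 0.5, weeks)
    for _ in range(10)
])

train, test = slice(0, 100), slice(100, weeks)
model = LinearRegression().fit(queries[train], true_flu[train])
print("held-out R^2:", model.score(queries[test], true_flu[test]))
# Useful for "what is my next action?" even with no mechanistic model of transmission.
```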

  43. Even in advertising, there is a big difference between correlations and meaningful correlations, so saying that correlation is everything is highly misleading, I think. Forget causation: sifting through the millions of correlations you would need to look at, and knowing which kinds of correlations to look for, are probably a science in themselves. The last thing I want is my doctor telling me that I should smoke because smoking is correlated with not dying of *some old-age-related condition*. Sure, Google can optimize its ads this way, but understanding why a correlation exists, and under which particular conditions it holds, is probably as useful as assuming that whatever big dataset you are working from can not only measure but also generalize beyond itself on any feature in the data.

  44. Alexandra Boghosian · · Reply

    Chris Anderson’s “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” asserts that the scientific method is becoming obsolete because of “massive data.” In illustrating his point, Anderson highlights the fact that analytical tools (like the ones that Google uses to translate languages) need not take into account any properties or context of their field of study. As he says, “correlation is enough.” While correlation is a powerful thing, and indeed powerful enough in some instances, to reduce the scientific method to correlation is to go against the very nature of the field. Anderson so demonizes the models of science that he fails to see that they have a value outside of how they explain data. The scientific method gives scientists a reproducible framework from which they can ask and answer the question, “why?”

    While I do not dispute the success of these techniques, I do find fault with the idea that the objects for which we collect scientific data are secondary to the data itself. Anderson points to J. Craig Venter, who has used supercomputers to sequence different environments and has discovered thousands of new species. In the next paragraph, Anderson points out, and even seems to celebrate, the fact that Venter does not know anything else about these species. Do we call this discovery? It seems as though Anderson is content with the status of these species as a “statistical blip”. I would counter that scientists’ aspirations go far beyond these “blips.” A biologist in this situation would want to know how the species came to be. In other words, what caused its existence? Scientists are in the business of finding out why we observe a phenomenon, and to even ask this question, you first must assume causality.

    In addition to offending the nature of science and reducing the definition of discovery to exclude the properties of the object in question (Venter finds his species by using properties of other objects, and noting a difference), I would like to know if Anderson is personally ready to accept a world where objects do not all fall at a constant rate. (I understand the counter-argument would be that the data would draw the same conclusion. For the purposes of this response, I wish only to illustrate the fact that the force of gravity is very hard to part with. I could elaborate.) Are we prepared to throw away our models of how the world works? Are we content to find new species, but not ask why they exist? Are we comfortable to throw away causality in favor of correlation when it comes to medicine? To say that there is no cause of an epidemic? This is a slippery slope.

    Science can be viewed in many different ways. A series of paradigm shifts, all building on previous models, is a very common view. Another view, however, is one where science is a reflection of our technology. Our scientific discoveries are constrained by what we can measure, what we can record, as well as how we make sense of the results. Kepler’s laws of planetary motion would never have been possible without the telescope. Newton’s gravity would not have been developed without calculus. Perhaps Anderson’s point is more comfortably viewed in this light, where the petabytes of data that we have can be analyzed with supercomputers, and used to support the scientific method without stamping out its ability to ask deeper questions.
