This is a guest post by Professor Matthew Jones, from Columbia’s History department, who has been attending the course.
Data & Hubris
In the wake of the recent election, data people, those who love them, and especially those who idealize them, exploded in Schadenfreude about the many errors of the traditional punditocracy. Computational statistics and data analysis had vanquished prognostication based on older forms of intuition, gut instinct, long-term journalistic experience, and a decadent web of Washington connections. The apparent success of the Obama team and of others using quantitative prediction revealed that a new age of political analysis had been cemented. Older forms of “expertise,” now with scare quotes, were invited to take a long overdue retirement and permit a new data-driven political analysis to emerge fully.
It’s a compelling tale, with an easy and attractive bifurcation of old and new forms of knowledge. Yet it’s a division that the data and analysis in our course, our training set, suggest is highly improbable. Good data science, our lecturers have suggested time and again, is far more reflective about the dangers of throwing away existing domain knowledge and its experts entirely.
Origin stories legitimate hierarchies of expertise. Data mining has long had a popular, if perhaps apocryphal, origin story: the surprising discovery, using the APRIORI algorithm, that men buy diapers and beer at the same time in drug stores. Traditional marketing people, with their quaint folk psychologies and intuitions about business, were henceforth to be vanquished before what the press probably still called an “electronic brain.” The story follows a classic template. Probability and statistics, from their origins in the European Enlightenment, have long challenged traditional forms of expertise: the pricing of insurance and annuities using data, rather than reflection on the character of the applicant, entailed the diminution and disappearance of older experts. In the very book that introduced our much beloved epsilons and deltas into real analysis, the great mathematician Augustin-Louis Cauchy blamed statisticians for the French Revolution: “Let us cultivate the mathematical sciences with ardor, without wanting to extend them beyond their domain; and let us not imagine that one can attack history with formulas, nor give sanction to morality through theories of algebra or the integral calculus.”
These narratives fit nicely into the celebration of disruption so central to Silicon Valley libertarianism, Schumpeterian capitalism, and one major variant of tech journalism. However locally crucial in extirpating rent-seeking forms of political analysis and nescience, the dichotomy utterly mistakes the real skills and knowledge that often appear to give the data sciences the traction they have. Rachel’s course, dedicated to reflection upon the capacities of the data scientist, has made mincemeat of any facile dichotomy of the data expert and the traditional expert. An important instantiation of the worldwide phenomenon of defining the data scientist, our course has put a tempering of hubris, especially algorithmic hubris, at the center of its technical training.
Obama’s data team has explained much of their success as a matter of taking the dangers of hubris rather seriously, indeed, of building a technical system premised on avoiding the dangers of overestimation, from the choice and tuning of algorithms to the redundancy of the back-end and network systems: “I think the Republicans [nsfw] up in the hubris department,” Harper Reed explained to Atlantic writer Alexis Madrigal. “I know we had the best technology team I’ve ever worked with, but we didn’t know if it would work. I was incredibly confident it would work. I was betting a lot on it. We had time. We had resources. We had done what we thought would work, and it still could have broken. Something could have happened.”
Debate over the value of domain knowledge has long polarized the KDD community. Much of the power of unsupervised learning, after all, lies in overcoming a crippling dependence on our wonted categories of social and scientific analysis, as seen in the celebrations of the Obama analytics team:
The notion of a campaign looking for groups such as “soccer moms” or “waitress moms” to convert is outdated. Campaigns can now pinpoint individual swing voters. “White suburban women? They’re not all the same. The Latino community is very diverse with very different interests,” Wagner said. “What the data permits you to do is to figure out that diversity.”
In productive tension with this escape from deadening classifications, however, the movement to revalorize domain expertise within statistics seems about as old as formalized data mining. In our lecture of November 7, David Madigan reflected (I paraphrase) that twenty years ago substantive expertise and collaboration with domain experts were far less common in academic statistics; academic statisticians then generally proved theorems, or dreamt up new methods or tests and then ran around to find data sets upon which to use them, rather than taking sets of data and finding appropriate models or algorithms. Today, he explained, statisticians are much more likely to engage in deep collaboration with people in the social and medical sciences, in application areas. Not all our speakers were as interested in deep collaboration with domain experts, though none dismissed them outright. Perhaps they might have, faced with our political pundits.
In a now infamous Wall Street Journal article, Peggy Noonan mocked the job ad for the Obama analytics department: “It read like politics as done by Martians.” The campaign was simply insufficiently human, with its war room both “high-tech and bloodless.” (Left unmentioned was that the contemporaneous Romney ads necessarily read similarly.)
Data science depends utterly on algorithms but does not reduce to those algorithms. The use of those algorithms rests fundamentally on what sociologists of science call “tacit knowledge”: practical knowledge not easily reducible to articulated rules, or perhaps impossible to reduce to rules at all. Using algorithms well is fundamentally a very human endeavor, something not particularly algorithmic.
No warning to the wise has so dominated our course as the many dangers of overfitting: taking noise for signal in a given training set or, alternatively, learning too much from a training set to generalize properly. From the third lecture onward, the importance of properly tuning k in performing k-nearest neighbors has recurred time and again; each time we sought a nice algorithmic means of picking k; each time we learned about the very human nature of such tuning, with the help of any number of evaluative procedures such as cross-validation and the like. Other algorithms are no different.
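To make that tuning concrete, here is a minimal sketch, not from the course itself, of choosing k by cross-validation. It assumes scikit-learn, and the iris data stands in for any labeled training set X, y:

```python
# A minimal sketch of choosing k for k-nearest neighbors by cross-validation.
# Assumes scikit-learn; the iris data is a stand-in for any labeled set X, y.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k)
    # Average accuracy over five held-out folds, so k is judged on data
    # the model has not already memorized.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best k by cross-validated accuracy: {best_k} ({scores[best_k]:.3f})")
```

Even here the decision is not fully automatic: the cross-validated scores for neighboring values of k are often nearly indistinguishable, and the final choice still reflects judgment about the data and the problem.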
Central to the concern with overfitting is the non-reflective use of algorithms in many quarters. Machine learning and data mining texts tend to comprise a sequence of chapters on different algorithms. Brian Dalessandro stressed the need really to grasp an algorithm in order to avoid overfitting, as Cathy O’Neil summarized in her invaluable record of the course:
Ask yourself carefully, do you understand it for real? Really? Admit it if you don’t. You don’t have to be a master of every algorithm to be a good data scientist. The truth is, getting the “best-fit” of an algorithm often requires intimate knowledge of said algorithm. Sometimes you need to tweak an algorithm to make it fit your data. A common mistake for people not completely familiar with an algorithm is to overfit.
The hubris one might have when using an algorithm must be tempered through a profound familiarity with that algorithm and its particular instantiation. Precisely such a tempering of hubris permits the tuning of parameters necessary to avoid overfitting and thus to generalize well.
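As a small illustration of the failure mode Dalessandro warns about, here is a hedged sketch, again assuming scikit-learn, in which synthetic data and a polynomial-degree knob stand in for whatever flexibility parameter a given algorithm exposes. As the fit becomes more flexible, training error keeps falling while held-out error turns back up:

```python
# Overfitting in miniature: more model flexibility drives training error down
# while error on held-out data eventually climbs. Synthetic data; the
# polynomial degree is a stand-in for any algorithm's flexibility parameter.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)  # signal plus noise

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

for degree in (1, 3, 10, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    # A widening gap between the two errors is the classic sign that the
    # model has begun to learn the noise in this particular training set.
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The widening gap between training and held-out error is the quantitative face of the hubris our lecturers kept warning against.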
Reflection upon the splendors and miseries of existing models figured prominently in the Obama campaign’s job ad:
Responsibilities include:
- Develop and build statistical/predictive/machine learning models to assist in field, digital media, paid media and fundraising operations
- Assess the performance of previous models and determine when these models should be updated
- Design and execute experiments to test the applicability and validity of these models in the field
- Create metrics to assess performance of various campaign tactics
- Collaborate with the data team to improve existing database and suggest new data sources
- Work with stakeholders to identify other research needs and priorities
The facile, automatic application of models is simply not at issue here: criticism and evaluation are. No Martian unfamiliar with the territory could do this: existing data, of all kinds, is simply too vital to pass up.
The working subtitle of my nascent project on the history of data mining is the “critique of artificial reason,” referring to Kant’s famous exercise in boundary demarcation between philosophy and mathematics. “I shall show that in philosophy the mathematician brings about by his method nothing but houses of cards, and that the philosopher can by his method only arouse chatter in the share of [cognition belonging to] mathematics” (KrV A727/B755). Kant’s bifurcation missed the productive nature of the violations of such boundaries. If hubris has to be checked for data science to succeed, being hubristic about the possibility of acquiring and analyzing data, especially low-quality data, seems essential.
All these remarks rest primarily on our training set, our course; time, I think, to attend to the dangers of overfitting my model of the very recent history of the data sciences, and to wonder whether I’m using the data of this, our future, in modeling its past.