Hi Students,
Sorry for the gap in posts. Last week was the Research at Google summit in Mountain View. As an aside, if you work at a remote office, as some of you will, you will occasionally have to dock with the mothership. These periods are often a blur of work and networking, and sometimes things fall through the cracks.
I digress; I want to talk to y’all about big data. As Aaron pointed out, the term “big” is relative. Big can range from hundreds of gigabytes to yottabytes (10^24 bytes) of data. Semantically, big can mean anything from “Excel chokes on this dataset” to “I need to buy a datacenter… or three”.
“Bigger is better.” This meme is firmly entrenched in the mind of many a corporate executive. And they aren’t completely bonkers. While people differ on how they would define big, there are scientific papers that back up this folk wisdom. Those papers talk about the unreasonable effectiveness of data.
However, going big may not always be the right solution.
Data is not free…
The argument often gets made that storage is cheap and we have enough spare computing power to wade through large datasets for insight. Often this is only half true. Yes, hard drives and computing power are cheap for most purposes, but at a large enough scale it does cost money to make big data work. You need to maintain data centers, provide hardware redundancy, or pay Google or Amazon to do it for you.
But those are still the cheaper costs. The real cost is human time, which of course can be translated to money. As data moves from a desktop to a datacenter, you may need to supplement your army of analysts with systems programmers building data pipelines. These pipelines will require updates and maintenance as your data evolves.
Moreover, as data continues to grow, the rate at which your team can create and test hypotheses becomes slower. Data essentially gums up the works of your analytics machine. The slowdown increases the time it takes to fix errors in your data collection and in your analytics pipelines. While you are busy crunching data, your competitors may be moving products to market and making their processes more efficient.
Finally, getting data is often hard. Take, for example, supervised learning. You need ground truth labels for your examples to train a model. Ground truth for small datasets often comes from human experts, and human labels may be cost effective for 10,000 examples. However, as your data grows, human labeling rapidly becomes too costly. You may have to shift from experts to Mechanical Turk workers to automated techniques. Each of those shifts introduces new sources of error that you have to account for.
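To make that cost curve concrete, here is a rough back-of-envelope sketch in Python. The per-label prices are numbers I made up for illustration, not real quotes; plug in whatever your labelers actually charge.

    # Back-of-envelope labeling cost at different dataset sizes.
    # Per-label prices below are assumptions for illustration only.
    EXPERT_COST = 0.50   # assumed dollars per expert label
    CROWD_COST = 0.05    # assumed dollars per crowdsourced label

    for n_examples in (10_000, 1_000_000, 100_000_000):
        expert = n_examples * EXPERT_COST
        crowd = n_examples * CROWD_COST
        print(f"{n_examples:>12,} examples: "
              f"experts ${expert:>12,.0f}, crowd ${crowd:>12,.0f}")

Even with made-up numbers, the shape of the problem is clear: the expert bill at 10,000 examples is tolerable, but at 100 million examples even the crowdsourced option runs into the millions.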
… and you may not need all that data
For applications like speech recognition, computer vision, and social network analysis, these costs are justified. Rapid advances in these fields are being made by machine learning systems that crunch tons of data. The results are astounding: speech recognition works well enough that Apple and Google are willing to deeply incorporate it into their phones (though there are well-known dissenters).
However, some insights can be gathered from smaller data sources. For some queries, a subset or sample of the dataset may be enough. For others, you may recognize that certain pieces of information are extraneous, and by removing this information, you may be able to process the data using more powerful tools.
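To sketch what I mean, both ideas fit in a few lines of Python. The file name and column names here are hypothetical, and I’m assuming your data lives in a plain CSV that pandas can read:

    import pandas as pd

    # Hypothetical file and columns, just to illustrate the idea.
    # Read only the columns you actually need, dropping the extraneous ones.
    df = pd.read_csv(
        "events.csv",
        usecols=["user_id", "timestamp", "purchase_amount"],
    )

    # Work on a 1% random sample first; fix your pipeline before scaling up.
    sample = df.sample(frac=0.01, random_state=42)
    print(sample.describe())

If your conclusions hold up on the 1% sample, you’ve saved yourself a round trip through the datacenter; if they don’t, you’ve found a bug cheaply.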
Small data is not always the answer, but you should try it first. Big data is expensive and unwieldy. Using a smaller dataset, you will be able to iterate faster. You will find bugs quickly. And if and when you do need to ramp up, you will be in a better place.
I’m not saying that big data is a sham, or that it is necessarily ineffective. Rather, I’m arguing that bigger is not always better and that data comes at a cost. Before we give in to the hype, we should think about what we actually need from our data sources and our computational tools. Before we promise too much from data, we should make sure that we can live up to those promises.
I want to close with a quote from an article in the Wall Street Journal that Ben DeCoudres sent me. It’s a great article, but this quote kind of nails it. When it comes to the big promises from big data,
This inevitably sets us up for disappointment. Amidst all the claims that big data will help us find the magic needle in an ever-larger haystack, there’s the risk that we’ll eventually start to become frustrated when all of this big data, well, doesn’t seem to produce much of anything — except calls for more spending on higher and higher storage capacity and faster and faster computing ability to make sense of it all.