Hi Students,
Last week Steve Lohr and Andy Lehren from the New York Times came in to talk about data journalism. Given how amazing that lecture was, I thought you might want more. For more of Andy, you can watch this interview of Andy talking about investigative journalism. For more about data journalism you should check out the Guardian’s data store.
Data journalism is not new, just easier…
Supplementing stories with data is at the core of many investigative journalism efforts. Tables are popular, and they’ve been around for a while. As Andy pointed out, the NYTimes brought down Boss Tweed by publishing a simple table. They acquired and published the ledgers from the comptroller’s office. These documents showed that money slated to go into the construction of a new courthouse was being siphoned off to Tweed’s cronies. The public was, of course, outraged.
The first piece of data journalism by the Guardian came out in 1821. For context, this was 40 years before the Civil War, James Monroe had just been reelected as president, and we were on less than great terms with England after they burned down our capitol. The Guardian provided a table breaking down schools by how many children attended the school and how much that school cost. While riddled with data collection errors, it was still a better analysis than anyone had done before. It showed that there were far more students getting a free education than previously thought, which also meant that poverty affecting children was far higher than previously thought.
Data journalism is not new. What’s different now is that it’s easier. The barrier to entry has been reduced, both in terms of gathering and analyzing data. Spreadsheets make simple analysis easier, and the internet is brimming with data sources. Simon Rogers describes the change to journalism in the following way:
But now statistics have become democratised, no longer the preserve of the few but of everyone who has a spreadsheet package on their laptop, desktop or even their mobile and tablet. Anyone can take on a fearsome set of data now and wrangle it into shape. Of course, they may not be right, but now you can easily find someone to help you. We are not wandering alone any more.
…but it’s still journalism
In the same article, Rogers also explains that data journalism is still journalism. He says, “Data journalism is not graphics and visualisations. It’s about telling the story in the best way possible.” The key deliverable is the story. In some cases the story will need a bar chart or a map or some sort of detailed infographic. In others, it may require a single number or simply just prose.
As data scientists, we tend to fetishize the power of data. It seeps into how we view the world, specifically in how we think people should do their jobs. This world view has led to some backlash from the journalism community. Many data scientists do not understand that the goal of journalism is to write an interesting, compelling, and accurate story. The notion that everything can be counted, that data is power, is in fact false. Jonathan Gray from the Open Knowledge Foundation expands on this point: “The value that data can potentially deliver to society is to be realised by human beings who use data to do useful things.”
Data is a means to an end.
With that in mind, students, I’d like you to scour the web for a good data journalism article. Describe what makes it good from both a data scientist’s point of view and a journalist’s point of view.
Thinking about a good data journalism piece, I remembered this article that came out this summer from the New York Times on intergenerational mobility across the United States in terms of income level:
http://www.nytimes.com/2013/07/22/business/in-climbing-income-ladder-location-matters.html?hp&_r=1&
This article, written by David Leonhardt, is supported by interactive graphs and visuals generated from data gathered in a very large study of millions of earnings records by a team of academic economists. The study was led by the Lab for Economic Applications and Policy at Harvard, the Center for Equitable Growth at Berkeley, and the Laura and John Arnold Foundation, and was funded by the National Science Foundation.
The official webpage of the study also includes detailed comments on the findings, the methods used, and downloadable documentation of the data:
http://www.equality-of-opportunity.org/
The mobility gap is a well-known problem in the United States. Some of the insights that can be gained from this article and the accompanying charts are not that revolutionary (for example, you would expect that in diverse, metropolitan areas children born into families in the bottom fifth of the income distribution have a considerably higher chance of moving up into the top fifth than children raised in the South and the Midwest). Still, I believe it is a very good example of data journalism: it backs up the story with the power of data by giving readers something they can play around with, so that they can engage with the story, understand it better, and derive their own insights and comments.
From the data science perspective, this is groundbreaking empirical work. It exploits a dataset that is richer than any that had been available before. The study focuses on mobility measures for children born in 1980-81: their adult income levels in 2010-11 and the income levels of their families between 1996 and 2000. The study maps children to 741 commuting zones, analyzes how they have moved up or down the national income percentiles, and tries to understand the factors driving this spread in mobility. As also noted in the article, the data only shows correlation between differences in the mobility measure and various factors, not causality.
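To make the mobility measure a bit more concrete, here is a minimal sketch of how one might compute, for each commuting zone, the share of children from bottom-fifth families who end up in the top fifth as adults. This is not the study's actual code or data: the record layout, the zone names, the helper functions `quintile` and `upward_mobility`, and every number below are made up purely for illustration.

```python
from collections import Counter


def quintile(percentile):
    """Map a national income percentile (1-100) to a quintile (1-5)."""
    return min((percentile - 1) // 20 + 1, 5)


def upward_mobility(records):
    """For each zone, return the share of children with bottom-quintile
    parents who reach the top quintile as adults."""
    bottom = Counter()       # children whose parents were in the bottom fifth
    reached_top = Counter()  # ...and who reached the top fifth themselves
    for zone, parent_pct, child_pct in records:
        if quintile(parent_pct) == 1:
            bottom[zone] += 1
            if quintile(child_pct) == 5:
                reached_top[zone] += 1
    return {zone: reached_top[zone] / bottom[zone] for zone in bottom}


# Hypothetical toy records: (zone, parent percentile, child percentile).
records = [
    ("Zone A", 10, 85),
    ("Zone A", 15, 40),
    ("Zone A", 5, 90),
    ("Zone A", 18, 30),
    ("Zone B", 12, 20),
    ("Zone B", 8, 95),
    ("Zone B", 3, 10),
    ("Zone B", 19, 50),
    ("Zone B", 60, 99),  # parents not in the bottom fifth, so not counted
]

rates = upward_mobility(records)
for zone, rate in sorted(rates.items()):
    print(f"{zone}: {rate:.0%} of bottom-fifth children reached the top fifth")
```

A table of rates like this, one row per commuting zone, is the kind of input a mobility map could be colored from. As the article itself stresses, differences between zones in such a table show correlation, not causation.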
From a journalism point of view, the article is very successful in combining the “human” factor with the numbers and colors we see in the maps. Leonhardt juxtaposes anecdotes of real families and explains where and how they fall on the mobility scale. He also makes some political connections. We can see how social policies, especially ones on tax credits and educational systems, are actually correlated with intergenerational mobility. Policy makers could possibly learn a lot from this study.
I selected an article from The Economist’s The World In 2014 print edition, “From baby boom to bust: Crashing fertility will transform the Asian family”, which can be retrieved here: http://www.economist.com/news/21589151-crashing-fertility-will-transform-asian-family-baby-boom-bust
The article begins with an infographic that quickly captures both the title of the article and the attention of the reader: a descending pattern in Asian fertility rates over the past 50-60 years. For nearly every idea that the author discusses, there are numbers close by that clarify and support it; the statistics used are mainly from the United Nations. The author does not use the statistics to obfuscate ideas. On the contrary, the numbers are presented clearly, with definitions explained whenever the author believes that an unsophisticated reader may need some additional help to get the most information out of the article.
What I found interesting was a blog entry on the newspaper’s site providing commentary on some of the interesting points made in the article. Accompanying the blog entry, the site also presents a 3-minute videographic full of interesting statistics, presented and contextualized by a narrator in a very compelling manner (a must see!). I found that after viewing the videographic, the combined experience with the published article was much enhanced. The blog and videographic can be found here: http://www.economist.com/blogs/theworldin2014/2013/11/falling-fertility
The article discusses a very important topic that has been very much in the news lately, especially in the context of the just-ended Third Plenum of the Chinese Communist Party. The article begins by outlining its principal idea and clearly explaining some of the terminology that will be used. Each paragraph begins with a seminal idea capturing the interest of the reader, which the author then explains in the rest of the paragraph with the help of statistics when necessary. The ideas are presented chronologically, and in at least two instances the author explicitly numbers the ideas to follow, which gives the message logic and continuity. The article’s main idea is how fertility rates are affecting Asian demographics. However, to drive the point home, the author sets the Asian fertility issue in the context of fertility rates from other parts of the world, enabling a reader outside Asia to gain a better understanding of, and form a personal relationship with, the information presented.
The first thought that came to my mind after hearing the lectures and reading these articles is that data journalism is kind of like short “graphic novels” (is that an oxymoron?). Let me first start by saying that I know little-to-nothing about “graphic novels”. But from what I do know, the general format, from a (very) naive perspective, is to tell a story in a traditional book format supported by comic-book art. This rings a similar tone to what data journalism tries to accomplish: tell an interesting and accurate story with the support of some graphics and visualizations.
Here is an article from the Guardian telling the story of the US Presidential candidates’ journey to the 2012 election in a “graphic novel” format: http://www.theguardian.com/world/interactive/2012/nov/06/america-elect-graphic-novel
The data scientist can look at this article and see flashy graphics and visualizations used to tell the story, while the data journalist can look at this article as an accurate and informative story with some cool graphics compelling the reader to continue reading.
Data journalism isn’t really new. In fact, it has existed for a long time; data is used in journalism to tell compelling stories every day. And a great data visualization can tell a story better than words: http://www.huffingtonpost.com/2013/03/22/gun-deaths-us-newtown_n_2935686.html
I am sure many of us remember the story of Sandy Hook and the public attention it received at the time. There was a great push to tighten gun control, and the above article emphasizes just that. The piece includes an interactive map with data on gun deaths to tell a story. We start out from the small town of Newtown and then zoom out to see the rest of the US light up with gun deaths in just one month. Each point represents a person, a personal story. The stories are further enriched with the faces and stories of individuals (children, moms, etc.). The journalists used one particular incident that was receiving great media attention as a starting point and expanded upon it with data and personalized stories in an attempt to garner more attention. They try to show us that Sandy Hook is not a unique incident but a potential tragedy that can happen on any day and that is happening every day, even if on different scales. From a data science perspective, they successfully use great data visualization to prove a point, and from a journalism perspective, the data is not just data but people and the stories of individuals that can tug at people’s heartstrings.
Until very recently, many people in sports believed that statistics could only significantly improve your team in the sport of baseball. Even with the amount of recognition that “Moneyball” received for fueling the Oakland A’s path to success, it has been extremely difficult for other sports to quantify any particular player’s performance and come up with a credible expectation of what the future may hold. With the help of technological advancements, the NBA has truly taken the analytical side of the sport by storm. I personally believe that statistics in sports are the most interesting form of statistics. Some of this may be bias on my part, as I am an athlete myself. However, it is becoming extremely difficult to ignore the rising relevance of statistics in the sport of basketball, and this is not in reference to the gaudy numbers that we are all accustomed to being shown by SportsCenter. New technology has been infused into the game that is actually capable of quantifying a player’s skill, not just overall, but in certain situations. This could be revolutionary for the sport; however, it is still in its early stages and has not been widely accepted by all teams in the Association.
http://www.grantland.com/story/_/id/9068903/the-toronto-raptors-sportvu-cameras-nba-analytical-revolution
As we learned from class and the above article, data journalism is indeed journalism. Data is a tool to support the opinion in an article or to help tell a story.
The data journalism article I want to recommend is:
http://www.newyorker.com/online/blogs/johncassidy/2013/11/inequality-and-growth-what-do-we-know.html
whose title is “American Inequality in Six Charts”
From a data science view, we can see that all the charts are easy to understand and each shows one part of the story. Charts 1 and 2 are line charts that show the trend and are easy to compare. Chart 3 is a bar chart that shows three things: income before tax, income after tax, and the tax rate, all of which can be read from the chart. The fourth one is a heat map that clearly shows mobility in all areas of the US. In Chart 5, he uses a simple regression that shows us the trend and rho.
From a journalism view, we can see that each chart tells a different story that supports the core idea of the article: the state of inequality in the US. The first and second charts compare the pre-tax income share of the richest people with that of ordinary people. The third one is a graph showing inequality across countries. The fourth graph shows a more fundamental problem underlying inequality, which is the mobility of society. The fifth one shows the relationship between mobility and inequality, which explains why the mobility of a society is a reflection of its inequality.
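For anyone curious about what goes into a chart like Chart 5, here is a small sketch of a simple regression line and a correlation coefficient computed by hand. I am assuming that “rho” refers to the Pearson correlation coefficient. The function names `pearson_r` and `fit_line` and the five data points are invented for illustration and are not the values behind the New Yorker chart.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy values: an inequality index and a mobility index
# for five imaginary countries.
inequality = [0.25, 0.30, 0.35, 0.40, 0.45]
mobility = [0.80, 0.70, 0.65, 0.50, 0.45]

r = pearson_r(inequality, mobility)
slope, intercept = fit_line(inequality, mobility)
print(f"r = {r:.2f}, trend line: mobility = {slope:.2f} * inequality + {intercept:.2f}")
```

In this toy data the slope and r are both negative, meaning higher inequality goes with lower mobility. A real chart would need the actual figures, and even a strong correlation would not by itself show that one causes the other.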
I read a series of 3 articles about data journalism:
http://onlinejournalismblog.wordpress.com/2013/09/20/ethics-in-data-journalism-automation-feeds-and-a-world-without-gatekeepers/
http://onlinejournalismblog.wordpress.com/2013/09/16/ethics-in-data-journalism-privacy-user-data-collaboration-and-the-clash-of-codes/
http://onlinejournalismblog.wordpress.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/
They talk about the ethics of data and data journalism.
From a data scientist’s perspective, the amount of data available is amazing. So much data is being collected so quickly that there is always a question that can be asked and answered. We are able to analyze a stream of data online and generate real-time data visualizations.
The problems arise from a journalism point of view, when we rely too heavily on automation and automated processes without much oversight. We might be scraping from data sources with lower-quality data, and sources may lose quality over time. Additionally, automated processes run into problems of copyright infringement and plagiarism. Bad-quality data sources used for published journalism could also ruin the credibility of a news source.
The problems arise from a journalism point of view, when we rely too heavily on automation and automated processes without much oversight. We might be scraping from data sources with lower-quality data, and sources may lose quality over time. Additionally, automated processes run into the problem of checking for data quality over time. This could lead to poor-quality journalism from poor-quality data, and could ruin the reputation of journalists.
Journalism with data science in it is very popular nowadays. In the old days, when statistics and computer science were not that advanced, or when there were no computers or calculators, people found it hard to collect data and do statistics. At that time, journalism was always a description of what people saw and heard. For example, when a journalist went to a battle in World War II, he had to follow the troops for a long time and describe what he saw and heard. However, this way he did not know how many people died in the war, which areas were safe, which side was more likely to win… He could only describe the environment around himself: there are bodies everywhere, the soldiers are tired of marching…
But today, with the help of math and computers, journalism can describe things from a data science view, which is a more efficient way to describe them. Staying with war journalism, a journalist can now find out the number of deaths, the amount of money lost, and which side is stronger by doing statistics and visualization.
Below is an example of data journalism.
http://www.nielsen.com/us/en/reports/2013/how-loyal-are-your-customers.html
http://www.nytimes.com/newsgraphics/2013/10/27/south-china-sea/
I came across this article several weeks ago and I found this new way of reporting to be amazing. The report supplements its story with data analysis, data visualization, pictures, and also animation. I think it will revolutionize the online news industry.
From a data scientist’s perspective, firstly, the report combines several data sets to tell one story. Secondly, the data sets are used to interpret several different aspects of the story, which helps readers understand and think about the story thoroughly.