Dear Students,
I want to let you know about the following:
(1) Institute for Data Sciences and Engineering: Columbia University’s new Institute for Data Sciences and Engineering has launched a website: http://idse.columbia.edu. The word “Data” also modifies “Engineering,” in case there was any confusion: Data Sciences and Data Engineering.
(2) Kaggle Competition: As you know, our class Kaggle essay-scoring competition is underway. You have all now entered the competition using “essay length” as a predictor (number of words or number of letters) and perhaps other features as well. You can view the leaderboard on the competition site. You should now start using natural language processing to generate or extract other features and do feature selection to improve your performance. Use any models or algorithms you think might work!
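For example, here is a minimal sketch, in Python with pandas and scikit-learn, of the kind of feature extraction and feature selection you might try. The file name and column names ("training_essays.csv", "essay", "score") are placeholders, so adapt them to the actual competition data:

    # Minimal sketch: hand-crafted text features plus univariate feature selection
    # for the essay-scoring task. File and column names are placeholders.
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression

    essays = pd.read_csv("training_essays.csv")   # placeholder training file

    def extract_features(df):
        """Build a few simple NLP-flavored features from the raw essay text."""
        feats = pd.DataFrame(index=df.index)
        feats["n_chars"] = df["essay"].str.len()                     # length in characters
        feats["n_words"] = df["essay"].str.split().str.len()         # length in words
        feats["avg_word_len"] = feats["n_chars"] / feats["n_words"]  # rough vocabulary proxy
        feats["n_sentences"] = df["essay"].str.count(r"[.!?]") + 1   # crude sentence count
        feats["n_unique_words"] = (
            df["essay"].str.lower().str.split().apply(lambda words: len(set(words)))
        )
        return feats

    X = extract_features(essays)
    y = essays["score"]                           # placeholder target column

    # Keep only the k features most associated with the score.
    selector = SelectKBest(f_regression, k=3).fit(X, y)
    X_selected = selector.transform(X)
    print("Selected features:", X.columns[selector.get_support()].tolist())

    # Fit any model you like on the selected features; a linear model is just a baseline.
    model = LinearRegression().fit(X_selected, y)

From there, swap in whatever features, selection methods, or models you prefer.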
(3) Student’s Visualization: Chris Mulligan, a Columbia student who audits our class, created this cool visualization, which updates hourly to capture the leaderboard over time rather than just a current snapshot. Chris writes:
A while back I participated in a MITRE competition similar to Kaggle, focusing on fuzzy name matching. During that competition I wanted to track who was improving, but the site only listed the current leaderboard. I wrote a scraper to download the site every hour, and an R script to visualize the changes over time. (Seen here: http://chmullig.com/2011/02/mitre-challenge-graph/ )
I realized Kaggle provides a leaderboard CSV very similar to what I was generating in that previous competition, so I could easily adjust my original script to generate a visual for the in-class Kaggle project. I thought folks might be interested in seeing how the rankings change over time. A few minutes of coding (mostly because I’d hard-coded some assumptions about a range of 0-100, now made flexible) resulted in this: http://chmullig.com/2012/11/intro-to-data-science-kaggle-leaderboard/
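If you want to try something like this yourself, here is a rough Python sketch of the workflow Chris describes: fetch the leaderboard CSV on a schedule (say, hourly from cron), append a timestamped snapshot to a local file, and plot each team’s score over time. The URL and column names are placeholders, and Chris’s actual scripts were written in R:

    # Rough sketch: snapshot the leaderboard CSV periodically, then plot
    # each team's score over time. URL and column names are placeholders.
    import datetime
    import os

    import pandas as pd
    import matplotlib.pyplot as plt

    LEADERBOARD_CSV = "https://example.com/leaderboard.csv"  # placeholder URL
    HISTORY_FILE = "leaderboard_history.csv"

    def snapshot():
        """Download the current leaderboard and append it to the history file."""
        board = pd.read_csv(LEADERBOARD_CSV)          # assumes columns like "TeamName", "Score"
        board["fetched_at"] = datetime.datetime.now()
        board.to_csv(HISTORY_FILE, mode="a",
                     header=not os.path.exists(HISTORY_FILE), index=False)

    def plot_history():
        """Plot each team's score against the time each snapshot was taken."""
        history = pd.read_csv(HISTORY_FILE, parse_dates=["fetched_at"])
        for team, rows in history.groupby("TeamName"):
            plt.plot(rows["fetched_at"], rows["Score"], marker="o", label=team)
        plt.xlabel("Time")
        plt.ylabel("Leaderboard score")
        plt.legend(fontsize="small")
        plt.tight_layout()
        plt.show()

Run snapshot() on a schedule and plot_history() whenever you want an updated picture.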
(4) Next Semester: Registration for the Spring 2013 semester begins this week. As mentioned before, Ian Langmore and I are offering a new course called Applied Data Science. Also, Mark Hansen is offering what looks to be an awesome course (though I don’t yet see it in the Directory of Classes) that I recommend for journalists and non-journalists alike; in other words, everyone. Here is Mark’s brief version of the course description:
Formats, Protocols & Algorithms: A sampling of journalistic computing
The flow of information is governed by the trio of formats, protocols and algorithms. Our workshop will consider these three as categories both to organize a set of practical skills (you will write code) and to anchor discussions about the newsworthiness of specific technologies and their actions in society (you will write stories). We will take a journalistic approach, adapting your familiar techniques of inquiry to data technologies: What is it designed for? Who is the audience? What does it reveal? What does it hide? Do I need to pay attention to this?
Our goals in proposing this course are simple: 1) provide journalists with hands-on experience collecting, processing, and analyzing data; 2) demystify the tools and methods behind data technologies; and 3) supply sufficient background so that students might become creators of new technologies, transitioning from tool users to tool makers.
At the time of writing, we will anchor the course on three case studies: The document clustering behind Google News, the social media mining from Storyful, and the automated story construction from Narrative Science. The course will involve individual and group projects that directly engage with these technologies, surfacing basic principles and vocabulary behind our trio of format, protocol and algorithm. We will also assign smaller coding projects based on real data to familiarize students with good practices for writing code. The course will end with a final project, an “act of journalism,” that might be a story, a data visualization, or a new data set or algorithm (appropriately published).
Our main programming language will be Python; however, we assume NO PRIOR CODING EXPERIENCE.