Each week Cathy O’Neil blogs about the class. Cross-posted from mathbabe.org.
This week in Rachel Schutt’s Columbia Data Science course we had two excellent guest speakers.
The first speaker of the night was Mark Hansen, who recently came from UCLA via the New York Times to Columbia with a joint appointment in journalism and statistics. He is a renowned data visualization expert and also an energetic and generous speaker. We were lucky to have him on a night when he'd been drinking an XXL latte from Starbucks to highlight his natural effervescence.
Mark started by telling us a bit about Gabriel Tarde (1843-1904).
Tarde was a sociologist who believed that the social sciences had the capacity to produce vastly more data than the physical sciences. His reasoning was as follows.
The physical sciences observe from a distance: they typically model or incorporate models which talk about an aggregate in some way - for example, biology talks about the aggregate of our cells. What Tarde pointed out was that this is a deficiency, basically a lack of information. We should instead be tracking every atom.
In the social realm, Tarde pointed out, we can do exactly that: the cells are replaced by people, and we can collect a huge amount of information about those individuals.
But wait, are we not missing the forest for the trees when we do this? Bruno Latour weighs in with his take on Tarde as follows:
“But the ‘whole’ is now nothing more than a provisional visualization which can be modified and reversed at will, by moving back to the individual components, and then looking for yet other tools to regroup the same elements into alternative assemblages.”
In 1903, Tarde even foresaw the emergence of Facebook, although he referred to it as a "daily press":
“At some point, every social event is going to be reported or observed.”
Mark then laid down the theme of his lecture using a 2009 quote of Bruno Latour:
“Change the instruments and you will change the entire social theory that goes with them.”
Kind of like that famous physics cat, I guess, Mark (and Tarde) want us to newly consider
- the way the structure of society changes as we observe it, and
- ways of thinking about the relationship of the individual to the aggregate.
Mark’s Thought Experiment:
As data become more personal, as we collect more data about “individuals”, what new methods or tools do we need to express the fundamental relationship between ourselves and our communities, our communities and our country, our country and the world? Could we ever be satisfied with poll results or presidential approval ratings when we can see the complete trajectory of public opinions, individuated and interacting?
What is data science?
Mark threw up this quote from our own John Tukey:
“The best thing about being a statistician is that you get to play in everyone’s backyard”
But let’s think about that again - is it so great? Is it even reasonable? In some sense, to think of us as playing in other people’s yards, with their toys, is to draw a line between “traditional data fields” and “everything else”.
It’s maybe even implying that all our magic comes from the traditional data fields (math, stats, CS), and we’re some kind of super humans because we’re uber-nerds. That’s a convenient way to look at it from the perspective of our egos, of course, but it’s perhaps too narrow and arrogant.
And it raises the question: what is "traditional" and what is "everything else" anyway?
Mark claims that everything else should include:
- social science,
- physical science,
- geography,
- architecture,
- education,
- information science,
- digital humanities,
- journalism,
- design,
- media art
There’s more to our practice than being technologists, and we need to realize that technology itself emerges out of the natural needs of a discipline. For example, GIS emerges from geographers and text data mining emerges from digital humanities.
In other words, it's not math people ruling the world; it's domain practices being informed by techniques that grow organically from those fields. When data hits a practice, that practice learns from it in its own way; its concerns are unique to it.
Responsible data science integrates those lessons, and it’s not a purely mathematical integration. It could be a way of describing events, for example. Specifically, it’s not necessarily a quantifiable thing.
Bottom-line: it’s possible that the language of data science has something to do with social science just as it has something to do with math.
Processing
Mark then told us a bit about his profile ("expansionist") and about the programming language Processing, in answer to a question about what's different when a designer takes up data or starts to code.
He explained it by way of another thought experiment: what is the use case for a language for artists? Students came up with a bunch of ideas:
- being able to specify shapes,
- faithful rendering of what visual thing you had in mind,
- being able to sketch,
- 3-d,
- animation,
- interactivity,
- Mark added publishing - artists must be able to share and publish their end results.
Processing is Java-based, with a simple "publish" button, etc. The language is adapted to the practice of artists. He mentioned that teaching designers to code meant, for him, stepping back and talking about iteration, if statements, and so on - in other words, stuff that seemed obvious to him but is not obvious to someone who is an artist. He needed to unpack his assumptions, which is what's fun about teaching the uninitiated.
He next moved on to close versus distant reading of texts, and mentioned Franco Moretti from Stanford.
Franco thinks about “distant reading”, which means trying to get a sense of what someone’s talking about without reading line by line. This leads to PCA-esque thinking, a kind of dimension reduction of novels.
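To make "distant reading" a little more concrete, here is a minimal sketch, assuming plain-text copies of a few novels (the file names are hypothetical). The TF-IDF-plus-truncated-SVD recipe is just one standard way to get the PCA-esque reduction Mark alluded to, not Moretti's actual method:

```python
# A toy "distant reading" pipeline: represent each novel as a bag of
# words and project it into a low-dimensional space, so books cluster
# by vocabulary rather than by line-by-line reading.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical plain-text files of a few novels.
novels = {
    "moby_dick": open("moby_dick.txt").read(),
    "emma": open("emma.txt").read(),
    "dracula": open("dracula.txt").read(),
}

# TF-IDF down-weights words that appear in every novel and
# up-weights words distinctive to a few.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X = tfidf.fit_transform(novels.values())

# TruncatedSVD plays the role of PCA on a sparse term matrix,
# reducing thousands of word dimensions to two.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X)

for title, (x, y) in zip(novels, coords):
    print(f"{title}: ({x:.2f}, {y:.2f})")
```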
In other words, another cool example of how data science should integrate the way the experts in various fields figure things out. We don't just go into their backyards and play; maybe instead we go in, watch them play, and formalize and inform their process with our bells and whistles. In this way they can teach us new games, games that actually expand our fundamental conceptions of data and the approaches we need to analyze them.
Mark’s favorite viz projects
1) Nuage Vert, Helen Evans & Heiko Hansen: a projection onto a power plant’s steam cloud. The size of the green projection corresponds to the amount of energy the city is using. Helsinki and Paris.
2) One Tree, Natalie Jeremijenko: The artist cloned trees and planted the genetically identical seeds in several areas. Displays among other things the environmental conditions in each area where they are planted.
3) Dusty Relief, New Territories: here the building collects pollution around it, displayed as dust.
4) Project Reveal, New York Times R&D lab: this is a kind of magic mirror that connects wirelessly, uses facial recognition technology to identify you, and gives you information about yourself. As you stand at the mirror in the morning you get that "come-to-Jesus moment," according to Mark.
5) Million Dollar Blocks, Spatial Information Design Lab (SIDL): There are crime stats overlaid on Google Maps, and they're typically painful to look at. The SIDL, headed by Laura Kurgan, flipped those statistics in this piece. She went into the prison population data and, for every incarcerated person, looked at their home address, measuring per home how much money the state was spending to keep the people who lived there in prison. She discovered that on some single blocks the state was spending upward of $1,000,000 to keep residents in prison.
Moral of the above: just because you can put something on the map, doesn’t mean you should. Doesn’t mean there’s a new story. Sometimes you need to dig deeper and flip it over to get a new story.
New York Times lobby: Moveable Type
Mark walked us through a project he did on commission with Ben Rubin for the NYTimes (he later went to the NYTimes on sabbatical). It's in the lobby of the paper's midtown headquarters on Eighth Avenue.
It consists of 560 text displays, two walls with 280 on each, and the idea is they cycle through various “scenes” which each have a theme and an underlying data science model.
For example, in one scene, waves upon waves of digital ticker-tape-like text sweep across the walls and leave behind clusters of text, where each cluster represents a different story from the paper. Within each cluster, the phrases that distinguish that story from the others, in an information-theory sense, are highlighted.
In another scene the numbers coming out of stories are highlighted, so you might see on a given box “18 gorillas”. In a third scene, crossword puzzles play themselves with sounds of pencil and paper.
The display boxes themselves are retro, with embedded Linux processors running Python, and a sound card on each box that makes clicky sounds, wavy sounds, or typing sounds depending on what scene is playing.
The data taken in is text from NY Times articles, blogs, and search engine activity. Every sentence is parsed using Stanford NLP techniques, which diagrams sentences.
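To give a flavor of that parsing step, here's a small sketch using Stanza, the Stanford NLP group's current Python library (the installation itself used the Java-based Stanford parser of the time; the sentence below is made up):

```python
# Dependency-parse a headline-style sentence, the kind of structural
# information the scenes select phrases from.
# Requires: pip install stanza, then stanza.download("en") on first run.
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")
doc = nlp("The mayor announced a new plan for the city's aging subway system.")

for sentence in doc.sentences:
    for word in sentence.words:
        # word.head is 1-indexed; 0 means the sentence root.
        head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text:<10} {word.upos:<6} --{word.deprel}--> {head}")
```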
Altogether there are about 15 “scenes” so far, and it’s code so one can keep adding to it. Here’s an interview with them about the exhibit:
Project Cascade: Lives on a Screen
Mark next told us about Cascade, joint work with Jer Thorp, data artist-in-residence at the New York Times. Cascade came about from thinking about how people share New York Times links on Twitter, and it was done in partnership with bitly.
The idea was to collect enough data so that we could see someone browse, encode the link in bitly, tweet that encoded link, see other people click on that tweet and see bitly decode the link, and then see those new people browse the New York Times. It’s a visualization of that entire process, much as Tarde suggested we should do.
There were of course data decisions to be made: a loose matching of tweets and clicks through time, for example. If 17 different tweets contain the same URL, they don't know which one you clicked on, so they guess (the guess actually involved probabilistic matching on timestamps, so it's an educated guess). They used the Twitter graph of who follows whom: if someone you follow tweets about something before you do, it counts as a retweet. The system covers any nytimes.com link.
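As a rough sketch of what "loose matching on timestamps" could look like, here's a pandas version that attributes each click to the most recent earlier tweet of the same URL within an hour. The column names, window, and data are made up; the real pipeline was surely more elaborate:

```python
import pandas as pd

# Hypothetical event logs: tweets of nytimes.com links and bitly clicks.
tweets = pd.DataFrame({
    "url": ["nyti.ms/abc", "nyti.ms/abc", "nyti.ms/xyz"],
    "user": ["alice", "bob", "carol"],
    "ts": pd.to_datetime(["2012-11-01 09:00", "2012-11-01 09:40", "2012-11-01 10:05"]),
})
clicks = pd.DataFrame({
    "url": ["nyti.ms/abc", "nyti.ms/abc", "nyti.ms/xyz"],
    "ts": pd.to_datetime(["2012-11-01 09:05", "2012-11-01 09:40", "2012-11-01 11:30"]),
})

# For each click, attribute it to the most recent earlier tweet of the
# same URL, but only if that tweet happened within the past hour.
tweets = tweets.sort_values("ts")
clicks = clicks.sort_values("ts")
matched = pd.merge_asof(
    clicks, tweets,
    on="ts", by="url",
    direction="backward",
    tolerance=pd.Timedelta("1h"),
)
print(matched)  # clicks outside the window get no attributed tweeter
```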
Here’s a NYTimes R&D video about Project Cascade:
Note: this was done 2 years ago, and Twitter has gotten a lot bigger since then.
Cronkite Plaza
Next Mark told us about something he worked on with Jer and Ben that opened just a month and a half ago. It's also news related, but this time the text is projected onto the outside of a building rather than shown in a lobby - specifically, the communications building at UT Austin, on Cronkite Plaza.
The majority of the projected text is sourced from Cronkite's broadcasts, but it also draws on local closed-captioned news. One scene of this project extracts the questions asked during local news - things like "How did she react?" or "What type of dog would you get?" The project uses six projectors.
Goals of these exhibits
They are meant to be graceful and artistic, but should also teach something. At the same time we don’t want to be overly didactic. The aim is to live in between art and information. It’s a funny place: increasingly we see a flattening effect when tools are digitized and made available, so that statisticians can code like a designer (we can make things that look like design) and similarly designers can make something that looks like data.
What data can we get? Be a good investigator: a small polite voice which asks for data usually gets it.
eBay transactions and books
Again working jointly with Jer Thorp, Mark investigated a day's worth of eBay transactions that went through PayPal and, for whatever reason, two years of book sales. How do you visualize this? Take a look at the yummy underlying data:
Here's how they did it (it's ingenious). They started with the text of Death of a Salesman by Arthur Miller and used a Mechanical Turk mechanism to locate objects in the text that you can buy on eBay.
When an object is found, it gets moved into a special bin - "chair," say, or "flute" or "table." Once a few buyable objects have been collected, the piece looks up where each of them is for sale in the day's worth of transactions, and looks at details, outliers, and such. After examining the sales, the code finds a zipcode in some quiet place like Montana.
Then it flips over to the book sales data, looks at all the books bought or sold in that zip code, picks a book (which is also on Project Gutenberg), and begins to read that book and collect “buyable” objects from that. And it keeps going. Here’s a video:
Public Theater Shakespeare Machine
The last thing Mark showed us is joint work with Rubin and Thorp, installed in the lobby of the Public Theater. The piece itself is an oval structure with 37 bladed LED displays, set above the bar.
There's one blade for each of Shakespeare's plays. The longer plays sit at the long end of the oval; Hamlet is the one you see when you come in.
The data input is the text of each play. Each scene does something different. For example, one collects noun phrases that have something to do with the body from each play, so the Hamlet blade will only show a body phrase from Hamlet. In another scene, various kinds of combinations and linguistic constructs are mined (a toy sketch of this kind of mining appears below):
- "high and mighty," "good and gracious," etc.
- “devilish-holy” “heart-sore” “ill-favored” “sea-tossed” “light-winged” “crest-fallen” “hard-favoured” etc.
Note here that the digital humanities, through the MONK Project, offer intensely detailed XML descriptions of the plays. Every single word is given its own hooha, and there's something on the order of 150 different parts of speech.
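A toy version of this kind of mining over the plain text of a single play might look like the sketch below; the MONK markup obviously gives much richer handles than a regex, and the file name here is hypothetical:

```python
import re
from collections import Counter

# Plain text of a play (hypothetical file).
text = open("hamlet.txt").read().lower()

# "X and Y" constructions: two words joined by "and".
pairs = re.findall(r"\b([a-z]+) and ([a-z]+)\b", text)

# Hyphenated compounds like "heart-sore" or "crest-fallen".
compounds = re.findall(r"\b[a-z]+-[a-z]+\b", text)

print(Counter(" and ".join(p) for p in pairs).most_common(10))
print(Counter(compounds).most_common(10))
```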
As Mark said, it’s Shakespeare so it stays awesome no matter what you do, but here we see we’re successively considering words as symbols, or as thematic, or as parts of speech. It’s all data.
Ian Wong from Square
Next, Ian Wong, an "Inference Scientist" at Square who dropped out of an electrical engineering Ph.D. program at Stanford, talked to us about data science in risk.
He conveniently started with his takeaways:
- Machine learning is not equivalent to R scripts. ML is founded in math, expressed in code, and assembled into software. You need to be an engineer and learn to write readable, reusable code: your code will be reread more times by other people than by you, so learn to write it so that others can read it.
- Data visualization is not equivalent to producing a nice plot. Rather, think about visualizations as pervasive and part of the environment of a good company.
- Together, they augment human intelligence. We have limited cognitive abilities as human beings, but if we can learn from data, we create an exoskeleton, an augmented understanding of our world through data.
Square
Square was founded in 2009. There were 40 employees in 2010, and there are 400 now. The mission of the company is to make commerce easy. Right now transactions are needlessly complicated: there's too much for a vendor to understand and to do, and it's hard even to know where to start. For that matter, it's too complicated for buyers as well. The question we set out to answer is: how do we make transactions simple and easy?
We send out a white piece of plastic, which we refer to as the iconic square. It’s something you can plug into your phone or iPad. It’s simple and familiar, and it makes it easy to use and to sell.
It's even possible to buy things hands-free using Square. A buyer can open a tab on their phone so that they can pay just by saying their name; the merchant then taps the buyer's name on their screen. This makes sense if you are a frequent visitor to a certain store, like a coffee shop.
Our goal is to make it easy for sellers to sign up for Square and accept payments. Of course, it’s also possible that somebody may sign up and try to abuse the service. We are therefore very careful at Square to avoid losing money on sellers with fraudulent intentions or bad business models.
The Risk Challenge
At Square we need to balance the following goals:
- to provide a frictionless and delightful experience for buyers and sellers,
- to fuel rapid growth, and in particular to avoid inhibiting growth by asking new sellers for too much information, which adds needless barriers to joining, and
- to maintain low financial loss.
Today we’ll just focus on the third goal through detection of suspicious activity. We do this by investing in machine learning and viz. We’ll first discuss the machine learning aspects.
Part 1: Detecting suspicious activity using machine learning
First of all, what’s suspicious? Examples from the class included:
- lots of micro transactions occurring,
- signs of money laundering,
- high frequency or inconsistent frequency of transactions.
Example: Say Rachel has a food truck, but then for whatever reason starts to have $1000 transactions (mathbabe can’t help but insert that Rachel might be a food douche which would explain everything).
On the one hand, if we let the money go through, Square is liable in the case that it was unauthorized. Technically the fraudster - in this case Rachel - would be liable, but our experience is that fraudsters are usually insolvent, so the loss ends up on Square.
On the other hand, the customer experience is bad if we stop payments that turn out to be legitimate. If she's innocent and we deny her charges, she will probably hate us, may even sully our reputation, and in any case her trust in us is lost after that.
This example crystallizes the important challenges we face: false positives erode customer trust, false negatives make us lose money.
And since Square processes millions of dollars worth of sales per day, we need to do this systematically and automatically. We need to assess the risk level of every event and entity in our system.
So what do we do?
First of all, we take a look at our data. We’ve got three types:
- payment data, where the fields are transaction_id, seller_id, buyer_id, amount, success (0 or 1), timestamp,
- seller data, where the fields are seller_id, sign_up_date, business_name, business_type, business_location,
- settlement data, where the fields are settlement_id, state, timestamp.
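To make the payment table concrete, here's a minimal pandas sketch of the kind of per-seller aggregates a risk feature set might start from - think of the food-truck example above. The values are made up, and this is not Square's actual pipeline:

```python
import pandas as pd

# A toy payment table with the fields described above.
payments = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "seller_id": ["s1", "s1", "s2", "s2", "s2"],
    "buyer_id": ["b1", "b2", "b3", "b4", "b4"],
    "amount": [4.50, 6.25, 1000.00, 950.00, 1200.00],
    "success": [1, 1, 1, 0, 1],
    "timestamp": pd.to_datetime([
        "2012-11-01 08:01", "2012-11-01 08:45",
        "2012-11-01 09:00", "2012-11-01 09:02", "2012-11-01 09:03",
    ]),
})

# Per-seller aggregates: transaction count, total and max amount, and
# failure rate - simple signals of the "suddenly running $1000 charges" sort.
features = payments.groupby("seller_id").agg(
    n_txns=("transaction_id", "count"),
    total_amount=("amount", "sum"),
    max_amount=("amount", "max"),
    failure_rate=("success", lambda s: 1 - s.mean()),
)
print(features)
```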
Important fact: we settle to our customers the next day so we don’t have to make our decision within microseconds. We have a few hours. We’d like to do it quickly of course, but in certain cases we have time for a phone call to check on things.
So here's the process: given a bunch (as in hundreds or thousands) of payment events, we run each through the risk engine and send the iffy-looking ones on to "manual review." An ops team then reviews those cases individually. Specifically, anything that looks rejectable gets sent to ops, which makes phone calls to double-check unless it's outrageously, obviously fraud.
Also, to be clear, there are actually two kinds of fraud to worry about, seller-side fraud and buyer-side fraud. For the purpose of this discussion, we’ll focus on the former.
So now it’s a question of how we set up the risk engine. Note that we can think of the risk engine as putting things in bins, and those bins each have labels. So we can call this a labeling problem.
But that kind of makes it sound like unsupervised learning, like a clustering problem, and although it shares some properties with that, it’s certainly not that simple - we don’t reject a payment and then merely stand pat with that label, because as we discussed we send it on to an ops team to assess it independently. So in actuality we have a pretty complicated set of labels, including for example:
- initially rejected but ok,
- initially rejected and bad,
- initially accepted but on further consideration might have been bad,
- initially accepted and things seem ok,
- initially accepted and later found to be bad, …
So in other words we have ourselves a semi-supervised learning problem, straddling the worlds of supervised and unsupervised learning. We first check our old labels, and modify them, and then use them to help cluster new events using salient properties and attributes common to historical events whose labels we trust. We are constantly modifying our labels even in retrospect for this reason.
We estimate performance using precision and recall. Note there are very few positive examples so accuracy is not a good metric of success, since the “everything looks good” model is dumb but has good accuracy.
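Here's a quick illustration of why accuracy is the wrong yardstick when positives are rare, using a made-up label vector with roughly 1% bad events:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% "bad" events

# The dumb "everything looks good" model: predict 0 for everyone.
y_dumb = np.zeros_like(y_true)

print("accuracy :", accuracy_score(y_true, y_dumb))                   # ~0.99, looks great
print("recall   :", recall_score(y_true, y_dumb, zero_division=0))    # 0.0, catches nothing
print("precision:", precision_score(y_true, y_dumb, zero_division=0)) # 0.0 by convention
```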
Labels are what Ian considers the "neglected half of the data" (recall T = {(x_i, y_i)}). In undergrad statistics education and in data-mining competitions, the availability of labels is often taken for granted. In reality, labels are tough to define and capture, and they are really important: it's not just the objective function, it is the objective.
As is probably familiar to people, we have a problem with sparsity of features. This is exacerbated by class imbalance (i.e., there are few positive samples). We also don’t know the same information for all of our sellers, especially when we have new sellers. But if we are too conservative we start off on the wrong foot with new customers.
Also, we might have a data point, say zipcode, for every seller, but the zipcode alone doesn't tell us much, because so few sellers share a zipcode. In this case we want to do some clever binning of the zipcodes, which amounts to a sub-model within our model.
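One hedged sketch of what that "clever binning" could look like: collapse rare zipcodes into a catch-all bucket, or replace the zipcode with a smoothed historical bad rate. The thresholds and column names below are invented for illustration:

```python
import pandas as pd

def bin_rare_zipcodes(sellers: pd.DataFrame, min_count: int = 50) -> pd.Series:
    """Collapse zipcodes seen fewer than `min_count` times into 'OTHER'."""
    counts = sellers["zipcode"].value_counts()
    rare = counts[counts < min_count].index
    return sellers["zipcode"].where(~sellers["zipcode"].isin(rare), "OTHER")

def smoothed_bad_rate(history: pd.DataFrame, prior: float, weight: float = 20.0) -> pd.Series:
    """Per-zipcode fraud rate, shrunk toward the global prior when data is thin."""
    grouped = history.groupby("zipcode")["is_bad"].agg(["sum", "count"])
    return (grouped["sum"] + prior * weight) / (grouped["count"] + weight)
```

Either output would be fed into the main model as a feature rather than used directly as a decision rule.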
Finally, and this is typical for predictive algorithms, we need to tweak the algorithm to optimize it: we need to consider whether features interact linearly or nonlinearly, and to account for class imbalance. We also have to be aware of adversarial behavior. An example of adversarial behavior in e-commerce is new-buyer fraud, where one person sets up 10 new accounts with slightly different spellings of their name and address.
Since models degrade over time as people learn to game them, we need to continually retrain them. The keys to building models that perform well are as follows:
- It's not a black box. You can't build a good model by assuming that the algorithm will take care of everything. For instance, if I need to know why I'm misclassifying certain people, I'll have to roll up my sleeves and dig into my model.
- We need to perform rapid iterations of testing, with experiments like you’d do in a science lab. If you’re not sure whether to try A or B, then try both.
- When you hear someone say, “So which models or packages do you use?” then you’ve got someone who doesn’t get it. Models and/or packages are not magic potion.
Mathbabe cannot resist paraphrasing Ian here as saying, "It's not about the package; it's about what you do with it." But what Ian really thinks it's about, at least for code, is:
- readability
- reusability
- correctness
- structure
- hygiene
So, if you're coding a random forest algorithm and you've hardcoded the number of trees: you're an idiot. Put a friggin parameter there so people can reuse it. Make it tweakable. And write the tests, for pity's sake; clean code and clarity of thought go together.
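In the spirit of that rant, here's a minimal sketch of the difference using scikit-learn: the number of trees is a parameter, and there are tests, however small. (This is an illustration, not Square's code.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def train_forest(X, y, n_trees: int = 100, max_depth=None, random_state: int = 0):
    """Train a random forest; the knobs are parameters, not magic constants."""
    model = RandomForestClassifier(
        n_estimators=n_trees, max_depth=max_depth, random_state=random_state
    )
    return model.fit(X, y)

def test_train_forest_respects_n_trees():
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    model = train_forest(X, y, n_trees=7)
    assert len(model.estimators_) == 7

def test_train_forest_predictions_have_right_shape():
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    model = train_forest(X, y, n_trees=5)
    assert model.predict(X).shape == (200,)
```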
At Square we try to maintain reusability and readability — we structure our code in different folders with distinct, reusable components that provide semantics around the different parts of building a machine learning model: model, signal, error, experiment.
We only write scripts in the experiments folder where we either tie together components from model, signal and error or we conduct exploratory data analysis. It’s more than just a script, it’s a way of thinking, a philosophy of approach.
What does such a discipline give you? Every time you run an experiment you should incrementally increase your knowledge. This discipline helps you make sure you don't do the same work twice. Without it you can't even figure out what you or someone else has already attempted.
For more on what every project directory should contain, see Project Template, written by John Myles White.
We had a brief discussion of how reading other people’s code is a huge problem, especially when we don’t even know what clean code looks like. Ian stayed firm on his claim that “if you don’t write production code then you’re not productive.”
In this light, Ian suggests exploring and actively reading Github’s repository of R code. He says to try writing your own R package after reading this. Also, he says that developing an aesthetic sense for code is analogous to acquiring the taste for beautiful proofs; it’s done through rigorous practice and feedback from peers and mentors. The problem is, he says, that statistics instructors in schools usually do not give feedback on code quality, nor are they qualified to.
For extra credit, Ian suggests that the reader contrast the implementation of the caret package (poor code) with that of scikit-learn (clean code).
Important things Ian skipped
- how is a model “productionized”?
- how are features computed in real-time to support these models?
- how do we make sure "what we see is what we get," meaning the features we build in a training environment will be the ones we see in real time? Turns out this is a pretty big problem.
- how do you test a risk engine?
Next Ian talked to us about how Square uses visualization.
Data Viz at Square
Ian talked to us about a bunch of different ways the Inference Team at Square uses visualizations to monitor the transactions going on at any given time. He mentioned that these monitors aren't necessarily trying to predict fraud per se; rather, they provide a way of keeping an eye on things, of looking for trends and patterns over time, and they serve as the kind of "data exoskeleton" he mentioned at the beginning. People at Square believe in ambient analytics, which means passively ingesting data constantly so that you develop a visceral feel for it.
After all, it is only by becoming very familiar with our data that we even know what kinds of patterns are unusual or deserve their own model. To go further into the philosophy of this approach, he said two things:
“What gets measured gets managed,” and “You can’t improve what you don’t measure.”
He described a workflow tool to review users, which shows features of the seller, including the history of sales and geographical information, reviews, contact info, and more. Think mission control.
In addition to the raw transactions, there are risk metrics that Ian keeps a close eye on. So for example he monitors the “clear rates” and “freeze rates” per day, as well as how many events needed to be reviewed. Using his fancy viz system he can get down to which analysts froze the most today and how long each account took to review, and what attributes indicate a long review process.
In general people at Square are big believers in visualizing business metrics (sign-ups, activations, active users, etc.) in dashboards; they think it leads to more accountability and better improvement of models as they degrade. They run a kind of constant EKG of their business through ambient analytics.
Ian ended with his data scientist profile. He thinks it should be on a logarithmic scale, since it doesn’t take very long to be okay at something (good enough to get by) but it takes lots of time to get from good to great. He believes that productivity should also be measured in log-scale, and his argument is that leading software contributors crank out packages at a much higher rate than other people.
Ian’s advice to aspiring data scientists
- play with real data
- build a good foundation in school
- get an internship
- be literate, not just in statistics
- stay curious
Ian’s thought experiment
Suppose you know about every single transaction in the world as it occurs. How would you use that data?