Marginally Interesting: My thoughts on the NY Times article: Troves of Personal Data, Forbidden to Researchers

The NY Times has an article that essentially complains that the big social network sites aren’t releasing their data, and that this is hurting research.

Actually, I can understand the companies here. Releasing such data is a big privacy issue because it’s very hard to make sure the data is truly anonymized. Does anyone still remember why there wasn’t a second Netflix competition? They got sued after the first run and decided to cancel the second one because they couldn’t guarantee their users’ privacy.

For many of those companies, that big pile of data is basically all they have, so they won’t just give it away for free, whether for research purposes or not.

Also, data has always been pretty scarce in social network research. If you look at review articles on social networks like this one, you see that most of the research has focused on a small number of data sets, for example, the karate school data set, the dolphin data set, or the monastery data set, all of which were assembled by hand by researchers. Ironically, the largest available data set so far is the Enron data set, which was released during the legal proceedings that followed the Enron bankruptcy.

So I think it’s wrong to expect companies like Twitter to happily release a substantial portion of their data for research purposes. On the other hand, I also think there is a very real problem of poorly validated research in that area. For example, Daniel Gayo-Avello has a very interesting review article on arXiv in which he argues that many papers on predicting elections are seriously flawed. Another example is the paper “Twitter mood predicts the stock market” by Johan Bollen et al., which also has serious methodological flaws.

Again, I think it is wrong to blame the lack of available data here. Of course it’s easier to validate research if you have the data to rerun the experiments and analyses, but I think (as I’ve said before) that we also need to resist the urge to jump on the current big data and data science bandwagon, and get back to doing properly validated research in the first place.

Posted by Mikio L. Braun at 2012-05-23 12:35:00 +0200
