Marginally Interesting: My thoughts on the NY Times article: Troves of Personal Data, Forbidden to Researchers
The NY Times has an article basically complaining that the big social network sites aren’t releasing their data and that they are hurting research.
Actually, I can understand the companies here. Releasing such data is
a big privacy issue because it’s very hard to make sure your data is
anonymized. Does anyone still remember why there wasn’t a second Netflix
competition? They got sued after the first run and decided to cancel
it because
For many of those companies, that big pile of data is basically all they have, so they won’t just give it away for free, whether for research purposes or not.
Also, data always used to be pretty scarce in social network
research. If you look at review articles on social networks like
So I think it’s wrong to expect companies like Twitter to happily release a substantial portion of their data for research purposes. On the other hand, I also think there is a very real problem of poorly validated research in that area. For example, Daniel Gayo-Avello has a very interesting review article on arXiv in which he argues that many papers on predicting elections are seriously flawed. Another example is the paper “Twitter mood predicts the stock market” by Johan Bollen et al., which is also seriously methodologically flawed.
Again, I think it is wrong to blame the lack of available data here. Of course it’s easier to validate research if you have the data to rerun the experiments and analyses, but I think (as I’ve said before) that we also need to resist the urge to jump on the current big data and data science wave and get back to doing properly validated research in the first place.
Posted by Mikio L. Braun at 2012-05-23 12:35:00 +0200