Marginally Interesting: Analyzing Social Media Data
This is all pretty exciting and interesting, but there are a few areas where there is still room for improvement.
There is very little work on real-time analysis. Many papers boast about the hundreds of millions of tweets (and the access to Twitter’s firehose necessary to get that amount of data) which form the basis for the paper. However, many papers later introduce some more or less arbitrary way of truncating the data, for example by taking a number of “most active users”. This is both true for
However, I think that getting to real time is extremely important, because you cannot just wait days or longer for your analysis. By that time, more data will have streamed in, and when are you going to analyze that data?
Another problem with many of the analyses is that they focus on positive cases only: they develop some method to detect bursts or trends and then use a famous real-world example (like Japan winning the women’s soccer championship) to show that the method is triggered by the data. However, few publications go so far as to validate their method on negative examples as well, showing that it not only detects trends well, but also does so robustly, with few false positives.
A classical example is the highly cited 2003 paper by Jon Kleinberg, “Bursty and Hierarchical Structure in Streams”, which explains how to detect areas of higher-than-usual activity, for example in email streams. But the paper then shows how the detected structure coincides with real deadlines for two examples, without discussing negative examples in depth.
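To make the point about negative examples concrete, here is a toy sketch. It is not Kleinberg’s state-machine model, just a crude sliding-window z-score rule on made-up Poisson traffic, and all the numbers (window size, thresholds, rates) are invented for illustration. The interesting part is the second check: running the same detector on a stream that contains no burst at all.

```python
import numpy as np

def detect_bursts(counts, window=24, z_thresh=3.0):
    """Flag time steps whose count exceeds the trailing-window mean by
    more than z_thresh standard deviations. A crude stand-in for a real
    burst detector, not Kleinberg's state-machine model."""
    flags = np.zeros(len(counts), dtype=bool)
    for t in range(window, len(counts)):
        past = counts[t - window:t]
        mu, sigma = past.mean(), past.std() + 1e-9
        flags[t] = (counts[t] - mu) / sigma > z_thresh
    return flags

rng = np.random.default_rng(0)

# Positive example: steady Poisson traffic with one injected burst.
bursty = rng.poisson(lam=20, size=500)
bursty[300:310] += 80

# Negative example: the same kind of traffic with no burst at all.
quiet = rng.poisson(lam=20, size=500)

print("detections on the bursty stream:", detect_bursts(bursty).sum())
print("false alarms on the quiet stream:", detect_bursts(quiet).sum())
```

If the quiet stream still produces a noticeable number of alarms, the threshold is too aggressive, and that is exactly the kind of number which rarely shows up in the papers.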
Many authors also seem to assume that an analysis based on hundreds of millions of data points is automatically true in general. While this holds for simple statistics which you can estimate well, there are other methods which can overfit. And for those, as other disciplines like bioinformatics have had to learn the hard way, the more data you have, the higher the probability that you find some evidence for your hypothesis purely by chance.
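A small made-up simulation illustrates why sheer volume does not protect you: if you scan enough candidate signals against pure noise, one of them will always look like evidence. Everything below (the number of series, the length, the seed) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# A "target" signal and a large pool of candidate predictors,
# all of them pure noise with no relationship to each other.
n_days = 200
target = rng.normal(size=n_days)
candidates = rng.normal(size=(10_000, n_days))

# Pearson correlation of every candidate with the target.
target_c = target - target.mean()
cands_c = candidates - candidates.mean(axis=1, keepdims=True)
corrs = cands_c @ target_c / (
    np.linalg.norm(cands_c, axis=1) * np.linalg.norm(target_c)
)

print("best absolute correlation among 10,000 noise series:",
      round(float(np.abs(corrs).max()), 2))
# Typically around 0.3 -- it looks like a finding, but something of that
# size is guaranteed to turn up once you search enough hypotheses.
```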
To get reliable results, you need to follow the same rules as when validating the performance of a machine learning algorithm: Test on data which is disjoint from training data. If your method detects trends, check it on data which you believe has no structure. If you aggregate topics, check it on days when nothing special was happening. If you analyze the structure of the data, check on an independent sample (ideally from a period of time which is a bit removed from the original sample).
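As a sketch of that kind of protocol, again with purely synthetic traffic: pick your detection threshold on one period, then count how often it fires on a later, disjoint period where you believe nothing was happening. The percentile and the Poisson rates are made up; the point is the disjoint split.

```python
import numpy as np

rng = np.random.default_rng(2)

# Purely synthetic hourly counts: an earlier period for picking the
# threshold, and a later, disjoint period believed to contain no events.
tuning_period = rng.poisson(lam=30, size=24 * 30)   # one "month" of hours
quiet_period = rng.poisson(lam=30, size=24 * 30)    # a later quiet month

# Pick the alarm threshold on the tuning period only ...
threshold = np.percentile(tuning_period, 99.5)

# ... then report how often it fires on data it has never seen and
# where, by assumption, nothing special was happening.
false_alarms = int((quiet_period > threshold).sum())
print("threshold:", threshold)
print("false alarms in one quiet month:", false_alarms)
```

A handful of alarms per quiet month may be perfectly acceptable, but you only know the number if you measure it on data the method was not tuned on.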
That way you might have less data available, but your results will be a lot more reliable.
Posted by Mikio L. Braun at 2011-11-01 22:20:00 +0100