1. 程式人生 > >How Analysts “Read” 1,846 Political Tweets Instantly

How Analysts “Read” 1,846 Political Tweets Instantly

Trying out the algorithms:

The topics derived from LSA seemed pretty unclear, with a lot of overlapping words. Topic zero seemed roughly to be about former president Megawati Sukarnoputri’s support for Presidential Candidate and Jakarta Governor Jokowi and his running mate, Jusuf Kalla (also known as JK) against Former General Prabowo Subianto. Topic one was almost identical, substituting the word for “talks” with the word for “governor.” Topic three was interesting, discussing Presidential Candidate Prabowo and rival Vice Presidential Candidate Kalla, as well as two dropouts from the presidential race, Aburizal Bakrie and Hanura Wiranto, as well as the words for candidate and Vice Presidential candidate. Perhaps these tweets were regarding a “dream team” of Prabowo as President and Kalla as Vice President. Topic three once again discussed Prabowo and Kalla, with mention of a convention, Aburizal Bakrie, and the words “election”, “enter”, and “forward”. The final topic once again included Golkar Chairman Aburizal Bakrie, Hatta, Wiranto, Prabowo, Dahlan, Prabowo’s PAN party, and the words “vice president”, “open”, and “evaluation”. Overall, these groupings don’t appear particularly useful or enlightening.

LDA with a bag of words seemed better, with topics such as “Kalla”, “Leader(ship)”, “Governor (Jokowi)”, “Real”, and “Popular” (Topic zero: Jokowi), “Dahlan”, “Prabowo”, “Wiranto”, “Fill in”, “Yes”, “Support”, “Ahok”, and “Governor” (Topic one: Famous endorsements for Jokowi against Prabowo), “Kalla”, “Aburizal Bakrie”, “Prabowo” “Tri Rismaharini”, and “Mahfud MD” (Topic 2: Famous Endorsements of Prabowo against the Jokowi-Kalla ticket, or in Rismaharini’s case, a fake endorsement), “convention”, “PDI-P party”, “Golkar party”, “Mahfud MD”, “survey”, “People’s/public”, “win” (Topic three: polling and party comparison), and “Megawati”, “PAN”, “Presidential candidacy”, and “Hatta Radjasa” (Topic four: Prabowo). The topics also had a nice symmetry, with one topic per major candidate, one topic of famous supporters of each candidacy, and a topic on polling and party comparison.

LDA with Tf-idf gave interesting topics such as “Dahlan”, “PDI-P”, “Forward”, “Support”, “Mega(wati)”, “Governor”, “Kalla” (Topic zero: Jokowi), “Hatta”, “Megawati”, “Party”, “Kalla”, “Vice President”, “PAN” and “Evaluation” (Topic one: Vice Presidents), “Jakarta”, “Dahlan”, “Convention”, “Vice President”, “Megawati”, “Yudhoyono”, “Mahfud MD” (Topic two: national political players), “Aburizal Bakrie”, “Candidate”, “Ad”, “Survey”, “Tri Rismaharini”, “Yudhoyono”, “Prabowo”, “Vice Presidential Candidate”, and “Kalla” (Topic three: unclear, possibly political advertisements and polling), and “Prabowo”, “Vice”, “Kalla”, “Wiranto”, “Megawati”, “People”, and “Indonesia” (Topic five: unclear, more famous political players). However, the topics seemed less clearly-defined than LDA with the bag of words.

Now for some fun testing the algorithm on fake tweets:

For this particular corpus, and perhaps this is related to the small total number of tweets and the limited words in each tweet, LDA performed better when not weighted with Tf-idf, so that is the model and topic grouping I chose to proceed with.

I ran the LDA with BOW algorithm on a tweet about Jokowi — “kalau jadi presiden jokowi tetep jadi gubernur jakarta tidak” — which matched 73% with the first topic: Jokowi. A fake Indonesian tweet I wrote supporting Jokowi, PDI-P, and Kalla — “Saya mendukung JK dan Kalla! PDI-P selamanya!” — also scored as an 80% match with the Jokowi topic.

The verdict:

Despite my small corpus and limited vocabulary in each tweet, LSA and LDA helped me quickly suss out topics within the dataset and see sensible topic clusterings. A simple LDA with a bag of words gave me the most sensible clusterings of the two presidential candidates, famous endorsements or supporters of each ticket, and public polling. The model also seemed to perform well on a sample tweet and synthetically-created tweet.

To really shine, though, these models should be applied to a much larger corpus to better represent the Indonesian twitter population during the 2014 Indonesian election. A more complete corpus could also enable me to map the sizes of each topic cluster more meaningfully to answer questions such as whether Jokowi or Prabowo seemed to generate more tweets, or how many tweets seem to have been about endorsements or supporters rather than focusing on the actual candidates. Still, the fioundation is here, and there’s lots of room to expand this basic code framework for use by politicians and political campaigns (as well as political historians examining past elections and political movements in the digital age).

For more information and the code behind this analysis, please check out the respository and full report on my GitHub account.