Using NLTK to visualize my favorite albums’ lyrics

A few weeks ago I enrolled in Python for Data Science by UCSD on edX.org. It is an introductory course, so it starts with the basics, but by the end of it you have worked with Twitter’s API, predicted weather using machine learning and even done some natural language processing using NLTK.

The last of these grabbed my attention because, before then, I had hardly thought of language as a source of information. I used to think of language as just a tool, a way to get information from my head into yours.

I was so wrong. So, so wrong.

My project: Números Fantasma

If you read Spanish, you can check out a more detailed version of this note on my website, elblogdehiphop.com.

Números Fantasma translates roughly to Phantom Numbers. The title derives from one of my all-time favorite albums, Luces Fantasma by La Banda Bastön. The album’s title relates to the supernatural and “those things that are there all the time but you cannot see” (Noisey en Español).

This resonated with me instantly because, as a data analyst, my job is literally to show or highlight “things that are there all the time but you cannot see” right away.

My objective was (semi-)clear: what other information is there in this album, the kind that one may not get right away? I loved the album instantly, but I couldn’t point out exactly why. Could I analyze it differently (not just by listening to it) to find that out? Could I do this for other albums?

What I did

Using Python’s NLTK library I quickly cleaned the lyrics for the album and created a word count table. I had transcribed the album for Genius.com earlier this year so I already had Luces Fantasma but to make this more interesting I also looked at Kendrick Lamar’s “DAMN.” (I also used Genius.com to get its lyrics — I love Genius).
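For illustration, here is a minimal sketch of what that cleaning and counting step could look like. It is not the exact script used for this project. The `[Verse 1]`-style section tags, the tiny `STOPWORDS` set and the function names are all assumptions made for the example. It uses NLTK's `RegexpTokenizer` and `FreqDist`, neither of which needs a corpus download, and falls back to the standard library if NLTK is not installed.

```python
import re
from collections import Counter

try:
    # FreqDist is a Counter subclass; RegexpTokenizer needs no corpus download.
    from nltk import FreqDist
    from nltk.tokenize import RegexpTokenizer
except ImportError:
    # Fallback so the sketch still runs without NLTK.
    FreqDist = Counter
    RegexpTokenizer = None

# Runs of letters (accents included), optionally joined by apostrophes.
WORD_PATTERN = r"[^\W\d_]+(?:['’][^\W\d_]+)*"

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "and", "i", "you", "to", "el", "la", "y", "de", "que"}


def tokenize(lyrics):
    """Lowercase the lyrics, drop [Verse 1]-style tags and split into words."""
    text = re.sub(r"\[.*?\]", " ", lyrics).lower()
    if RegexpTokenizer is not None:
        return RegexpTokenizer(WORD_PATTERN).tokenize(text)
    return re.findall(WORD_PATTERN, text)


def word_counts(lyrics, stopwords=STOPWORDS):
    """Return a word -> count table, skipping stopwords."""
    return FreqDist(w for w in tokenize(lyrics) if w not in stopwords)


# Made-up sample text, not actual lyrics from either album.
sample = "[Verse 1]\nLuces, luces fantasma\nFantasma en la ciudad"
counts = word_counts(sample)
for word, count in counts.most_common(3):
    print(word, count)
```

Calling `counts.most_common(n)` gives the n most repeated words, which is the kind of table the rest of this note relies on.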

After I had created these tables, I found the website WordCountTools.com, which does pretty much the same thing BUT also gives you a ton of other metrics, like the number of monosyllabic and multisyllabic words. I was very interested in those because this is a hip hop album.

Second, I looked to Spotify for other information that might be of interest. I manually noted the number of plays each track had and used Spotify’s API to grab each track’s audio features.
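As a rough sketch, a request for one track's audio features could be built like this with only the standard library. This is an assumption about how such a call might look, not the code used here. The helper names are hypothetical, you need a valid OAuth access token from Spotify, and you should check Spotify's current documentation to confirm the audio features endpoint is available to your app.

```python
import json
import urllib.request

API_BASE = "https://api.spotify.com/v1"


def build_audio_features_request(track_id, access_token):
    """Build (but do not send) a GET request for one track's audio features."""
    url = f"{API_BASE}/audio-features/{track_id}"
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers)


def fetch_audio_features(track_id, access_token):
    """Send the request and decode the JSON response (needs network access)."""
    request = build_audio_features_request(track_id, access_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values: swap in a real track ID and access token before fetching.
request = build_audio_features_request("TRACK_ID", "ACCESS_TOKEN")
print(request.full_url)
```

The JSON that comes back includes fields such as `danceability`, `energy`, `valence` and `tempo`, which can then be joined to the word count tables by track.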

Third, I looked for the best way to show this information. These are just numbers, so how could I make them tell me something I did not know before?

The Viz

I decided to go with Tableau for the visualizations. I had been working on developing my Tableau skills so this seemed like good practice.

Here I show 4 aspects of each track:

  1. The red diamond represents the percentage of the track rapped by the artists. Two skits in Luces Fantasma are performed by featured artists, so their diamonds sit at 0%.
  2. The red dotted line shows the percentage of unique (non-repeated) words in a track. I chose to show this metric because I have been fascinated with rappers’ vocabularies since Matt Daniels’ The Largest Vocabulary in Hip Hop.
  3. The blue bar shows the number of monosyllabic words.
  4. The blue line shows the number of multisyllabic words.
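The last three metrics can be approximated in a few lines of Python. The sketch below is only an illustration and rests on two loud assumptions: "unique" is taken to mean distinct words divided by total words, and syllables are estimated by counting groups of consecutive vowels. That heuristic is crude (it miscounts silent vowels in English and hiatus in Spanish) and is surely not how WordCountTools.com does it. The function names are hypothetical.

```python
import re

# Runs of letters (accents included), optionally joined by apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")
# Assumption: each run of consecutive vowels counts as one syllable.
VOWEL_GROUP = re.compile(r"[aeiouyáéíóúü]+")


def estimate_syllables(word):
    """Rough syllable estimate: number of vowel groups, never less than 1."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def track_metrics(lyrics):
    """Return unique-word percentage and mono/multisyllabic counts for a track."""
    words = [w.lower() for w in WORD.findall(lyrics)]
    if not words:
        return {"total_words": 0, "unique_pct": 0.0,
                "monosyllabic": 0, "multisyllabic": 0}
    mono = sum(1 for w in words if estimate_syllables(w) == 1)
    return {
        "total_words": len(words),
        # Assumption: distinct words over total words.
        "unique_pct": round(100 * len(set(words)) / len(words), 1),
        "monosyllabic": mono,
        "multisyllabic": len(words) - mono,
    }


# Made-up sample text, not actual lyrics.
metrics = track_metrics("la luz la sombra")
print(metrics)
```

Running this once per track gives one row per track, which is the shape Tableau needs to draw the chart described above.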

First impressions

Muelas de Gallo (La Banda Bastön’s MC) is an incredible MC. I knew this, but I was never able to quantify it. Now, looking at his work side by side with Kendrick’s, I can see how amazing a lyricist he truly is. I could not make that comparison before mostly because American hip hop artists’ skills have been better documented and analyzed.

In Hip Hop, content is as important as how you deliver it. Luces Fantasma talks about love, death, modern México and the struggles of its citizens. DAMN. is just as amazing and I believe that’s well established.

This analysis is not about the content (even though I tried doing some sentiment analysis). It is about the raw delivery (I’m not even looking at flow). In the raw numbers, Muelas de Gallo uses a “more complex” vocabulary. He has more multisyllabic and more non-repeated words overall. Muelas also repeats himself less: his most repeated words have a count of 51, while Kendrick’s go up to 104.