🎶 Most popular song topics around the world 🌍

Discover the diverse themes that resonate globally.

🎵 About the project

While going through a break-up, I tried to distract myself with music, just like most of us do. However, I quickly realised that songs weren't much help: almost all of them celebrated love and romantic feelings.

I started wondering:

And since I had to choose a project for my Computational Linguistics class, I decided to explore this idea.

💬 Results

After running topic modeling on every country, I identified the following most common themes:

Topic 1

Love-related emotions/Relationships (who could have guessed?)

Topic 2

Urban life / Hip-hop culture / Gangsta life

Topic 3

Self-reflection, abstract topics

Topic 4

"Find", "better", "cry", "hurt", "stay"... Looks like a heartbreak to me

Topic 5

"Night", "crazy", "kiss", "tequila", "dancing"... Sounds like a wild party!

Topic 6

This topic appeared in Egypt only and deviates notably from other common topics.

Topic 7

This topic appeared in Greece only and deviates notably from other common topics.

Topic 8

This is a data artifact containing song and artist names.

Topic 9

This is a data artifact containing poorly translated words from low-resource languages.

As you can see, love in its different forms (and money) does indeed rule the charts in practically every country!

To my disappointment, I didn't find very specific topics, like "food", "family", or something even more unexpected. Well, maybe, the topics weren't that diverse to begin with? Maybe the most popular songs usually don’t mention anything too eccentric?

However, there were singular cases that diverged from the common topics. I would name them:

  • Politics and religion - a topic observed in Egypt
  • Violence and conflict - a topic observed in Greece

Sometimes clusters were also formed of noise and artifacts:

  • Music and artist names
  • Untranslated words (for low-resource languages)

  • I took the top 200 songs in each of the 38 countries. To select them, I used Spotify's weekly total charts and picked the most popular songs of all time* for each country.

    Then I scraped the lyrics, translated them into English, and preprocessed them with basic techniques (tokenization, lemmatization, lowercasing, and removal of punctuation and stop words**). Finally, I applied topic modeling with LDA (Latent Dirichlet Allocation) to get the topics for each country.
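The preprocessing steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the `preprocess` function, the tiny `STOP_WORDS` set, and the `LEMMAS` lookup table are hypothetical stand-ins for a full stop-word list and a real lemmatizer.

```python
import string

# Illustrative placeholders only: a real run would use a full stop-word list
# and a proper lemmatizer instead of these tiny hand-written tables.
STOP_WORDS = {"the", "a", "and", "i", "you", "is", "are", "were", "to", "of", "in", "my"}
LEMMAS = {"songs": "song", "crying": "cry", "nights": "night", "kissed": "kiss"}


def preprocess(text, stop_words=STOP_WORDS, lemmas=LEMMAS):
    """Lowercase, strip punctuation, tokenize, lemmatize, drop stop words."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization is enough for this sketch.
    tokens = text.split()
    # Dictionary lookup stands in for lemmatization.
    tokens = [lemmas.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in stop_words]


tokens = preprocess("The songs were crying, crying in my nights!")
print(tokens)
```

Each song then becomes a list of tokens, which is the input format a bag-of-words topic model such as LDA expects.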

    Tools utilized:

    The very first results were pretty meaningless (image below), so I had to go through many iterations of preprocessing before I started seeing some coherence in the topics.

    Early results of LDA

    As with many other unsupervised methods, LDA doesn't tell us the name or the subject of a cluster: we have to label the clusters ourselves. Naming the topics was one of the hardest parts! Click on the results to see what I found out.

    *from the moment Spotify entered the country till March 2023

    **I also used the filter_extremes function from the gensim module to remove the most common words (those found in more than 50% of the documents) and the rarest, overly specific ones (those found in fewer than 3 documents), for example entity names and rare words (like “pathointelligence” or “bermuda triangle”). The function then keeps only the N most frequent words from what remains (in my case, N was set to 15,000).
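To make the filtering rule in this footnote concrete, here is a small pure-Python sketch of the same idea. It is a re-implementation for illustration, not gensim's code; in gensim itself the settings described above would correspond to something like `dictionary.filter_extremes(no_below=3, no_above=0.5, keep_n=15000)`. The toy `docs` corpus is made up.

```python
from collections import Counter


def filter_extremes(docs, no_below=3, no_above=0.5, keep_n=15000):
    """Return the tokens that survive document-frequency filtering.

    docs is a list of token lists. A token is kept if it appears in at
    least `no_below` documents and in at most a `no_above` share of them;
    of those, only the `keep_n` most frequent tokens are retained.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    kept = [
        (tok, df)
        for tok, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    ]
    # Most frequent first; ties broken alphabetically for determinism.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {tok for tok, _ in kept[:keep_n]}


# Made-up toy corpus: "love" is in every document, "bermuda" in only one.
docs = [
    ["love", "night", "cry"],
    ["love", "night", "dance"],
    ["love", "night", "cry"],
    ["love", "tequila", "cry"],
    ["love", "heart", "bermuda"],
    ["love", "heart", "stay"],
    ["love", "heart", "stay"],
    ["love", "stay", "dance"],
]
vocab = filter_extremes(docs)
print(sorted(vocab))
```

In this toy corpus "love" is dropped for being too common, while "tequila", "bermuda" and "dance" are dropped for being too rare.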

    After applying topic modeling, I ended up with various clusters of words per country.

    One thing about LDA is that we don't know for sure how many topics there actually are. LDA simply returns exactly the number of clusters that we ask it to produce. So, for each country, I tried producing from 3 to 7 topics, just to see how it goes.

    The good news is that there are metrics that help us evaluate the quality of the clusters. I used the UMass Coherence Score:

    Formula for UMass Coherence Score

    The UMass Coherence Score reflects how often the top words of a topic occur together in the same documents. For a pair of words, we take the logarithm of a ratio: the number of documents in which both words appear, divided by the number of documents in which one of them appears. The greater the value, the more coherent the pair.

    The best u_mass score will be as close to 0 as possible (since the logarithm of 1 is 0).
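For reference, the pairwise score is commonly written in the form below. This is the usual textbook formulation rather than a transcription of the formula image; in particular, the +1 in the numerator is a standard smoothing term (it avoids taking the logarithm of zero) and is an assumption here.

```latex
% D(w_j): number of documents containing w_j
% D(w_i, w_j): number of documents containing both w_i and w_j
C_{\mathrm{UMass}}(w_i, w_j) = \log \frac{D(w_i, w_j) + 1}{D(w_j)}
```

When two words rarely share a document, the fraction is small and the logarithm is strongly negative, which is why typical scores fall below zero.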

    Formula for UMass Coherence Score

    To measure the coherence of a whole topic (cluster), we extend the calculation to every pair among the topic's top-N words and take the average pairwise coherence score.
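The averaging step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's code: it uses the common document-count formulation with +1 smoothing, the function names `umass_pair` and `umass_topic` are hypothetical, the toy corpus is made up, and every top word is assumed to occur in at least one document.

```python
import math
from itertools import combinations


def umass_pair(w_i, w_j, doc_sets):
    """Pairwise score log((D(w_i, w_j) + 1) / D(w_j)), with D counting documents."""
    d_j = sum(1 for doc in doc_sets if w_j in doc)
    d_ij = sum(1 for doc in doc_sets if w_i in doc and w_j in doc)
    return math.log((d_ij + 1) / d_j)


def umass_topic(top_words, docs):
    """Average pairwise score over all pairs of a topic's top words.

    top_words is ordered from most to least probable; each lower-ranked
    word is scored against every higher-ranked one.
    """
    doc_sets = [set(doc) for doc in docs]
    scores = [
        umass_pair(top_words[m], top_words[l], doc_sets)
        for l, m in combinations(range(len(top_words)), 2)
    ]
    return sum(scores) / len(scores)


# Made-up toy corpus of preprocessed lyrics.
docs = [
    ["love", "heart"],
    ["love", "heart"],
    ["love", "cry"],
    ["night", "dance"],
]
coherent = umass_topic(["love", "heart"], docs)
incoherent = umass_topic(["love", "dance"], docs)
print(coherent, incoherent)
```

Words that share documents ("love", "heart") score near 0, while words that never co-occur ("love", "dance") score clearly below it.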