Blog post by Yin Yin Lu: NLP Meets K-Pop at the Meertens Instituut

The Meertens Instituut, which is dedicated to the study of contemporary Dutch language and culture, was on one level a strange place to develop natural language processing (NLP) tools for a corpus of 1.6 million comments on Korean pop (K-pop) YouTube videos. However, as K-pop has become a global cultural phenomenon and as the Meertens now has a dedicated team of digital humanities researchers, on another level it was an ideal environment for my mobility grant visit. The purpose of the visit was to prepare materials for the full-day methods workshop on analysing large volumes of online text using NLP libraries in Python, jointly sponsored by CLARIN and the Université libre de Bruxelles.

My visit began not at the Meertens but at the International Institute of Social History (IISH), the venue of the CLARIAH Tech Day on Friday 6 October. This event provided me with an excellent introduction to digital humanities activity in the Netherlands; CLARIAH, the Common Lab Research Infrastructure for the Arts and Humanities, is a joint partnership between CLARIN and DARIAH-NL, which provides digital tools and data and teaches computational methods to researchers. I learned about data integration and modelling, workflow systems, and the Text Encoding Initiative (TEI) as a basis for digital editions and (linguistic) querying. I also spoke extensively with Marieke van Erp, the team leader of the recently established Digital Humanities group at the KNAW Humanities Cluster.

Thus, I arrived at the Meertens the following Monday with a broader perspective on not only its connections with CLARIN, but also its role in promoting new computational methodologies within the Netherlands, in collaboration with other institutes. I spent the duration of my visit working with Dr Folgert Karsdorp, a tenure-track researcher who describes his field as ‘computational folkloristics’: the application of data science methods to ethnology and the study of cultural evolution (as exemplified, e.g., by folktales). In the digital humanities these methods are referred to as forms of ‘distant reading’, as opposed to ‘close reading’: they allow scholars to process far larger corpora than more qualitative, hermeneutical approaches can accommodate.

One of the ‘distant reading’ techniques that Folgert has implemented is topic modelling, an unsupervised technique for the detection of themes in text corpora. A ‘theme’ is essentially a collection of words; topic models assign themes to documents based upon the co-occurrences of words in the documents. They operate under a very naïve assumption: a document is defined by the distribution of its vocabulary across various themes; syntax (and thereby context) is not taken into consideration. That being said, this naïve model can generate some powerful insights about a corpus of text that instigate further qualitative analyses.
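To make that bag-of-words assumption concrete, here is a minimal illustration (not part of the workshop materials): because only word counts matter to the model, two comments containing the same words in a different order are indistinguishable.

```python
from collections import Counter

# Topic models operate on word counts alone, so these two 'documents',
# which differ only in word order (syntax), look identical to the model.
doc_a = "the fans love the dance".split()
doc_b = "the dance love the fans".split()

assert Counter(doc_a) == Counter(doc_b)  # identical bag-of-words representations
```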

There are many different types of topic models, and we chose latent Dirichlet allocation (LDA) for the workshop, given its popularity and effectiveness with digital text. We then had to decide upon the best implementation for our K-pop corpus, which consisted of 1,586,671 comments on 202 YouTube videos from the four most popular boy and girl groups: Bangtan Boys (BTS), EXO, BLACKPINK, and TWICE. Like most social media text, the comments vary greatly in style and length, and many contain emoji and foreign characters rendered as Unicode (even after we attempted to filter out all non-English posts via a common-word heuristic). We tested Radim Řehůřek's gensim library as well as Pedregosa et al.'s scikit-learn library, but both produced uninteresting results.
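Our exact filter is not reproduced here, but a minimal sketch of a common-word heuristic might look as follows; the word list and threshold are illustrative assumptions.

```python
# Keep a comment only if it contains at least one highly frequent English
# word. The word set and min_hits threshold below are illustrative; the
# workshop's actual heuristic may have differed.
COMMON_ENGLISH = {"the", "a", "i", "you", "it", "is", "this", "so", "and", "love"}

def looks_english(comment, min_hits=1):
    tokens = comment.lower().split()
    return sum(token in COMMON_ENGLISH for token in tokens) >= min_hits

comments = ["I love this song so much", "사랑해요!!"]
english_comments = [c for c in comments if looks_english(c)]
print(english_comments)  # ['I love this song so much']
```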

We decided to use Allen Riddell's lda library for the workshop tutorial, which implements LDA with collapsed Gibbs sampling. This produced significantly more meaningful results than either gensim or scikit-learn, although the quality of the results was directly correlated with the time taken to run the model (the Gibbs sampler is computationally intensive). Training a model on the K-pop corpus with 1,500 iterations and 25 topics took a little over an hour on Folgert's 2015 MacBook Pro. As that is equivalent to the length of the topic modelling session during the workshop, we had to cache the results for the participants.
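The lda package exposes a scikit-learn-style interface. A minimal sketch of the setup described above, on toy data (the example comments and vectoriser settings are assumptions, not the workshop code):

```python
import numpy as np
import lda  # Allen Riddell's implementation: LDA via collapsed Gibbs sampling
from sklearn.feature_extraction.text import CountVectorizer

comments = ["i love this song so much",
            "the choreography in this video is amazing",
            "best group ever, their vocals are incredible"]

# lda expects a document-term matrix of raw integer counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

# The workshop settings described above: 25 topics, 1,500 Gibbs iterations.
model = lda.LDA(n_topics=25, n_iter=1500, random_state=1)
model.fit(X)

# Print the ten most likely words per topic (cf. the Topic 19 figure below).
vocab = np.array(vectorizer.get_feature_names_out())
for k, word_dist in enumerate(model.topic_word_):
    top_words = vocab[np.argsort(word_dist)][:-11:-1]
    print(f"Topic {k}: {' '.join(top_words)}")
```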

Ten Most Likely Words in Topic 19 (Allen Riddell's LDA Library)

The next step was to visualise the results. This is extremely important, as it impacts the ease of analysis; moreover, many of the workshop attendees had limited experience with programming, and one of our objectives was to make them excited about the potential of data science techniques for the humanities and social sciences. First, we created a grouped bar chart of the mean topic distributions per K-pop group, to easily see which topics feature prominently for each group, and to compare groups within each topic. For some topics, there was not much difference among the distributions; for other topics there were extreme differences. The former topics tended to be extremely general (e.g., expressions of love), and the latter focused on specific groups (or even specific members of specific groups).
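A sketch of how such a grouped bar chart can be produced with pandas and matplotlib; synthetic data stands in for the model's document-topic matrix and the per-comment group labels, which in the workshop came from the fitted model and the corpus metadata.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_topics = 25

# Synthetic stand-ins: one topic distribution per comment (in the workshop
# this came from the fitted model) plus the group each comment belongs to.
doc_topic = rng.dirichlet(np.ones(n_topics), size=1000)
groups = rng.choice(["BTS", "EXO", "BLACKPINK", "TWICE"], size=1000)

df = pd.DataFrame(doc_topic, columns=[f"topic {k}" for k in range(n_topics)])
df["group"] = groups

# Mean topic distribution per group, transposed so that topics run along
# the x-axis and the four groups appear as grouped bars within each topic.
df.groupby("group").mean().T.plot(kind="bar", figsize=(14, 5))
plt.ylabel("Mean topic proportion")
plt.tight_layout()
plt.show()
```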

Grouped Bar Chart of Average Topic Distributions for K-pop Groups

We then created a more advanced interactive visualisation using the third-party library pyLDAvis, a Python port of the LDAvis R package created by Carson Sievert and Kenny Shirley. This visualisation facilitates the exploration of topics and their relationships, and consists of two parts. On the left-hand side, the topics are plotted as circles in a two-dimensional intertopic distance map, created via multidimensional scaling (principal coordinate analysis by default); the closer two topics are in the plot, the more similar they are. Moreover, the size of each circle is proportional to how prominent the topic is in the entire corpus. On the right-hand side of the visualisation, when a topic (circle) is selected in the plot, the 30 most relevant terms for that topic are displayed in a horizontal stacked bar chart. The bars are stacked because each one also displays the overall corpus frequency of the term, which provides a sense of how unique the term is to the topic.
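Wiring the fitted model into pyLDAvis takes only a few lines. In this sketch, X, model and vocab are assumed to carry over from the lda example above:

```python
import numpy as np
import pyLDAvis

# Corpus statistics pyLDAvis needs alongside the model's two distributions.
doc_lengths = np.asarray(X.sum(axis=1)).ravel()      # tokens per comment
term_frequency = np.asarray(X.sum(axis=0)).ravel()   # corpus-wide term counts

vis = pyLDAvis.prepare(
    topic_term_dists=model.topic_word_,
    doc_topic_dists=model.doc_topic_,
    doc_lengths=doc_lengths,
    vocab=vocab,
    term_frequency=term_frequency,
)
pyLDAvis.display(vis)  # renders the interactive visualisation in the Notebook
```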

pyLDAvis Interactive Visualisation

Given that the ‘documents’ in our corpus were YouTube comments, we thought it would be interesting, in order to better understand the topics, to see which videos were associated with them (and subsequently to watch those videos). We ended the topic modelling session script with a simple function that retrieved the ID of the most relevant video for a given topic and embedded the video within the Jupyter Notebook (the platform we used to deliver the workshop).
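The embedding itself is a one-liner with IPython's display utilities. The scoring in this sketch (averaging comment-level topic proportions per video) is one plausible definition of ‘most relevant’, not necessarily the one we used:

```python
import pandas as pd
from IPython.display import YouTubeVideo

def most_relevant_video(topic, doc_topic, video_ids):
    # One plausible scoring (an assumption, not necessarily the workshop's):
    # average each video's comment-level topic proportions and return the
    # video whose comments score highest on the given topic.
    means = pd.DataFrame(doc_topic).assign(video=video_ids).groupby("video").mean()
    return means[topic].idxmax()

# Inside a Jupyter Notebook this renders the video in the output cell, e.g.:
# YouTubeVideo(most_relevant_video(19, doc_topic, video_ids))
```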

 

After finalising the details of the topic modelling Notebook, Folgert and I mapped out the structure of the rest of the workshop, namely the two introductory ‘NLP with Python’ sessions that would provide the backdrop to the more focused sessions. We decided that it would be most beneficial for the morning session to focus on the pandas library, which is indispensable for data analysis tasks. The objective of the session was to obtain a high-level statistical overview of the dataset after loading it into a dataframe. This versatile structure allows the comments to be sorted and filtered in various ways (e.g., by number of likes, date published, or author). We ended the session with some basic yet powerful visualisations: comments over time, number of likes per comment, number of comments and likes per author, and number of comments per group in histogram form.
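A sketch of the kind of dataframe work the morning session covered; the file name and column names here are illustrative assumptions, not the workshop's actual schema.

```python
import pandas as pd

# Illustrative file and column names; the workshop data may differ.
comments = pd.read_csv("kpop_comments.csv", parse_dates=["published_at"])

# Sorting and filtering the comments in various ways:
most_liked = comments.sort_values("likes", ascending=False).head(10)
bts_only = comments[comments["group"] == "BTS"]
per_author = comments.groupby("author").agg(
    n_comments=("text", "size"), total_likes=("likes", "sum"))

# One of the overview plots: weekly comment volume over time.
comments.set_index("published_at").resample("W").size().plot(
    title="Comments over time")
```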

Comments Over Time

Number of Comments by Like Count

Log-log Plot of Number of Comments per Author

Log-log Plot of Number of Likes per Author

Scatterplot of Comments per Author vs Likes per Author

Histogram of Comments per K-pop Group

After the high-level overview, we designed the afternoon introductory session to focus on the Natural Language Toolkit (NLTK) library. This is an excellent tool for analysing linguistic data, as it has built-in corpora and text processing functionality for everything from tokenisation to semantic reasoning. We applied some basic functions to a few YouTube comment files in our K-pop corpus and examined them individually as well as comparatively, after applying a special tweet tokeniser (although designed for tweets, it works for other forms of social media text as well). We calculated the lexical diversity of the comments, the frequency distributions of specific keywords (e.g., singer names), the words that appeared only once (hapax legomena), the most popular verbs (using the part-of-speech tagger), n-grams, and collocations. This allowed for a more fine-grained linguistic analysis of what was being said about each video.
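A condensed sketch of those NLTK steps on a single invented comment (in the workshop, whole comment files were read from disk instead):

```python
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

nltk.download("averaged_perceptron_tagger")  # required for POS tagging

text = "I love this MV so much!! The dance and the vocals are amazing <3"
tokens = TweetTokenizer().tokenize(text)  # handles emoticons, repeated punctuation

lexical_diversity = len(set(tokens)) / len(tokens)  # unique tokens / all tokens
fdist = nltk.FreqDist(tokens)            # keyword frequency distributions
hapaxes = fdist.hapaxes()                # words appearing only once
verbs = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("VB")]
bigrams = list(nltk.bigrams(tokens))     # n-grams with n = 2

# Collocations: word pairs that co-occur more often than chance (here by PMI).
finder = BigramCollocationFinder.from_words(tokens)
collocations = finder.nbest(BigramAssocMeasures().pmi, 10)
```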

Lexical Diversity of Four Comment Files

Hapax Legomena in One Comment File

20 Most Frequent Bigrams and Trigrams in One Comment File

All in all, the mobility grant visit was indispensable to the success of the methods training day. On the one hand, it allowed the structure and content of the workshop sessions to be designed so that they complemented each other, provided multiple perspectives on the K-pop corpus, and fit into a cohesive larger narrative. On the other hand, the in-person collaboration allowed many technical issues to be addressed preemptively: everything from file names and folder paths (which had to be consistent for the data to load properly into the Jupyter Notebooks) to the setup of the GitHub page that contains all of the scripts. I had an extremely productive and enjoyable time, and I very much hope that I might find another excuse to visit the Meertens Instituut in future!

 


Blog post written by Yin Yin Lu, who received the CLARIN Mobility Grant in October 2017.

More information about the CLARIN Mobility Grants can be found here.