Research

Subcorpus TM Browser

by Gautham Badrinathan, Peter Leonard, and Timothy R. Tangherlini

The browser allows one to curate topic models and then use the curated models to find passages in a large, unlabeled corpus.

STM-browser

The Subcorpus topic model site is based on work by Timothy Tangherlini (UCLA) and Peter Leonard (Yale), and grew out of work on the Google Books project for Humanities research. The site allows one to curate topic models and then use those models to find passages in a large, unlabeled corpus.

For example, from the curation site (the default landing tab), typing in "bran" will bring up three existing versions of Hovedstrømninger in the type-ahead box. Clicking on the topic modeling tab (as opposed to curation) allows one to create additional models based on the existing corpora.

From the default landing page, selecting topics in the upper right-hand corner allows one to choose among various inferencers (i.e., topic model collections) and explore these models.

For example, visiting http://stm.scandinavian.ucla.edu/topics, one could select hs_01_500-stopped, a subcorpus consisting of the first volume of Hovedstrømninger, split into 500-word chunks with stop words removed, and modeled with 50 topics. Clicking through on the first topic takes one to the topic browser, where one can see the results of the k=50 topic model for the first volume of Hovedstrømninger, as well as the same model inferred on the remaining volumes of the work. The goal of such an approach is to find passages in the other volumes of Hovedstrømninger that have topical affinities with passages from the first volume.
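The preprocessing behind a subcorpus such as hs_01_500-stopped can be illustrated with a minimal Python sketch. It is not the site's own code: the function names, the tiny Danish stop list, and the order of operations (chunk first, then remove stop words) are all assumptions made for illustration.

```python
import re

# Hypothetical stop list for illustration only; the list actually used
# by the site is not given in the text.
STOPWORDS = {"og", "i", "at", "det", "en", "den", "til", "er", "som", "på"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def chunk_words(words, size=500):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def remove_stopwords(words, stopwords=STOPWORDS):
    """Drop every word that appears in the stop list."""
    return [w for w in words if w not in stopwords]


def prepare_volume(text, size=500):
    """Chunk a volume into `size`-word pieces, then stop-word each piece.

    Assumption: chunking happens before stop-word removal, so chunks may
    end up shorter than `size` words.
    """
    return [remove_stopwords(chunk) for chunk in chunk_words(tokenize(text), size)]
```

Each resulting chunk would then be treated as one document by the topic modeling step, which is why passages, rather than whole volumes, are what the browser surfaces.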

Distributed under an MIT license

Code available at: github