Whether topic modeling counts as a quantitative or a qualitative approach depends on the nature of the data and how the method is used. The process requires data in text form, which can be obtained through interviews, by converting reports to text, or by web scraping. Sample size can also be significant: many samples consist of individual documents, while others are large bodies of text. Because the purpose of topic modeling is to identify patterns and themes within the data, it serves as a tool for qualitative analysis. However, once the data is transformed into a format suitable for analysis, the modeling itself is quantitative, using algorithms to identify and extract topics from the text. Ultimately, whether topic modeling is considered quantitative or qualitative depends on the specific step in the process and the overall purpose of the analysis.
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is an unsupervised learning algorithm used for topic modeling. It treats each document as a bag of words and maps those words onto a set of latent topics, ignoring word order and syntactic information.
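To make the bag-of-words idea concrete, the following minimal sketch fits a two-topic LDA model with the gensim library; the toy review snippets, topic count, and parameter settings are illustrative assumptions, not values from any particular study.

```python
# Minimal LDA sketch with gensim; the toy corpus and parameter values
# are illustrative assumptions.
from gensim import corpora, models

docs = [
    "customers praised the fast shipping and careful packaging",
    "the battery drains quickly and the charger feels flimsy",
    "shipping was slow but the packaging arrived undamaged",
    "great battery life although the charger gets warm",
]

# Bag-of-words representation: word order and syntax are discarded.
texts = [doc.lower().split() for doc in docs]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic model and inspect the word distribution of each topic.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=0)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```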
LDA is especially helpful for large websites, where topic analysis of page content can inform a basic internal link structure and improve a site's performance and visibility in search engines. It can also sharpen a competitive edge by enabling a better understanding of user-generated content; one recent study used it to extract information from online reviews and classify topics according to sentiment.
The technique is similar in spirit to Principal Component Analysis, which reduces the dimensionality of a data set by forming linear combinations of its variables: a large matrix is broken into smaller components, or sub-components. It is commonly used in strategic business optimization.
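The decomposition analogy can be illustrated with latent semantic analysis, which factors a document-term matrix into a handful of components much as PCA forms linear combinations of variables. The sketch below uses scikit-learn's TruncatedSVD; the toy documents and the choice of two components are assumptions made only for illustration.

```python
# LSA sketch: factor a document-term matrix into a small number of
# components, analogous to PCA's linear combinations of variables.
# The corpus and component count are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "pricing strategy drives quarterly revenue growth",
    "the marketing team revised the pricing strategy",
    "server downtime delayed the product release",
    "engineering fixed the downtime before the release",
]

X = TfidfVectorizer().fit_transform(docs)        # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)
doc_components = svd.fit_transform(X)            # documents x components

print(doc_components.round(2))                   # low-dimensional view of each document
```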
Hierarchical latent tree analysis (HLTA)
HLTA is a newer approach to hierarchical topic modeling. Its key idea is to replace standard expectation-maximization (EM) with progressive EM, which learns the structure of the topic hierarchy more effectively and has been reported to be faster and more accurate than hLDA and CorEx.
HLTA models the hierarchy of topics by estimating word co-occurrences through latent variables. It produces a soft clustering of documents and can improve on LDA. HLTA has its limitations, but it also offers many advantages.
The approach has been used to model topics across a variety of data sets. For example, Block and Newman applied topic modeling to the Pennsylvania Gazette for the period 1728 to 1800, and topic modeling was also applied to PNAS abstracts to identify topics that rose in popularity between 1991 and 2001.
Other topic models
Topic modeling more broadly is a statistical approach to analyzing textual data. It uses the statistical associations of words in a text to generate latent topics: clusters of words whose co-occurrences represent higher-order concepts. While the method is not a fully automatic text-analysis application, it offers a lens through which to examine textual data.
This approach can help researchers generate management theory from textual data. It helps them label texts and construct measures for statistical analysis and hypothesis testing. It is a useful tool for both qualitative and quantitative data analysis and can help shift the paradigm of data analysis, although there are limitations to its use.
Topic models require textual data, which can be obtained from interviews, by converting reports to text, or through web scraping. Sample size can be critical to the accuracy of topic modeling. In most cases each document is treated as a separate text, although some samples are large masses of text that contain a variety of topics.
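As a rough illustration of how such a corpus might be assembled, the sketch below reads plain-text files and splits very long ones into smaller chunks so that a single file does not mix too many topics; the folder name and the 300-word chunk size are hypothetical choices, not recommendations from the text.

```python
# Sketch of turning raw text sources into a list of documents for topic
# modeling; the folder name and chunk size are hypothetical.
from pathlib import Path

def load_documents(folder: str, chunk_words: int = 300) -> list[str]:
    """Read each .txt file; split very long files into ~chunk_words pieces
    so that a single file does not mix too many unrelated topics."""
    documents = []
    for path in sorted(Path(folder).glob("*.txt")):
        words = path.read_text(encoding="utf-8").split()
        if len(words) <= chunk_words:
            documents.append(" ".join(words))
        else:
            for i in range(0, len(words), chunk_words):
                documents.append(" ".join(words[i:i + chunk_words]))
    return documents

# docs = load_documents("interview_transcripts")  # hypothetical folder
```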
Limitations of topic models
Topic modeling is often a useful tool, but it has its limitations. For instance, if your goal is to analyze text content, it may be more useful to read the text in its natural-language form than to rely on a machine-generated model. Topic models are complex beasts whose internal workings lie beyond the comprehension of the average user. They can play a game of bait-and-switch, offering astonishing results straight out of the box and then revealing a paralyzing array of options once the dials are turned. Fortunately, automated measures of coherence help narrow down the ocean of candidates and possibilities.
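One way such coherence measures are applied in practice is to score several candidate topic counts and keep the best-scoring model. The sketch below uses gensim's CoherenceModel with the c_v measure; the tokenized toy corpus and the candidate counts are illustrative assumptions.

```python
# Score candidate topic counts with an automated coherence measure
# (gensim's CoherenceModel, c_v variant); the corpus is illustrative.
from gensim import corpora, models
from gensim.models import CoherenceModel

texts = [
    ["shipping", "packaging", "delivery", "courier"],
    ["battery", "charger", "power", "adapter"],
    ["delivery", "courier", "shipping", "late"],
    ["battery", "power", "drain", "charger"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3, 4):
    lda = models.LdaModel(corpus, num_topics=k, id2word=dictionary,
                          passes=20, random_state=0)
    score = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v", topn=4).get_coherence()
    print(f"{k} topics: coherence = {score:.3f}")
```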
Another limitation is the need for manual data pre-processing before topic modeling. This involves removing topic-general words and stop words from the corpus, a time-consuming manual process that can alter the results. Topic-general words are common across a corpus and tend to distort the semantic structure of the resulting topics; at the same time, removing them can reduce the validity of word-pair topics.
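A hedged sketch of that pre-processing step, assuming gensim's built-in stop-word list plus a hand-picked set of corpus-specific "topic-general" words, might look like this:

```python
# Pre-processing sketch: drop stop words and corpus-specific
# "topic-general" words before modeling; the extra_general set is an
# illustrative assumption.
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

extra_general = {"company", "report", "year"}   # words common across this hypothetical corpus
remove = STOPWORDS | extra_general

def clean(doc: str) -> list[str]:
    # simple_preprocess lowercases, tokenizes, and drops very short tokens
    return [tok for tok in simple_preprocess(doc) if tok not in remove]

print(clean("The company report for this year shows rising battery complaints"))
```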
Methods for evaluating topic models
Whether to use qualitative or quantitative methods to evaluate topic models depends on your purpose and the kind of data you’re looking for. Topic modeling can be used for many different tasks, from document classification to semantic theme exploration. But, evaluating topic models is not a straightforward task and requires some knowledge of the domain and the purposes for which you plan to use the model.
A common evaluation method is based on human judgment. It can produce good results, but it is time-consuming and expensive, and humans tend to disagree about what makes a good topic. It is therefore useful to apply quantitative metrics to standardize the evaluation of topic models. One such metric is perplexity, calculated by splitting the dataset into a training part and a held-out part and measuring how well the trained model predicts the held-out documents; lower perplexity indicates a better fit.
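A minimal sketch of that split-and-score procedure, using scikit-learn's LDA implementation and its perplexity method on a held-out portion of a toy corpus (the documents and the train/held-out split are assumptions for illustration):

```python
# Perplexity sketch: fit on part of the corpus, score the held-out part.
# The documents and the train/test split are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "fast shipping and careful packaging",
    "battery drains quickly and the charger feels flimsy",
    "slow shipping but the packaging arrived undamaged",
    "great battery life although the charger gets warm",
    "courier lost the package during shipping",
    "replacement battery solved the power issue",
]
train, held_out = docs[:4], docs[4:]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train)
X_test = vectorizer.transform(held_out)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_train)
print("held-out perplexity:", round(lda.perplexity(X_test), 1))  # lower is better
```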