COMP4432 – Group Project
Data Product: A Million News Headlines Topic Modelling and Topic Classification by 7 models
Student Name: WANG Zhuchen (20076035d), MENG Guanlin (20099185d), GENG Longling (20080439d)
1 Introduction
In this project, our group conducts topic modelling and topic classification based on the data provided in "A Million News Headlines" on Kaggle. Data preprocessing of the news headlines is introduced first. Topic modelling is then carried out with baseline models such as Latent Semantic Analysis and with advanced models such as T5 + Clustering, Top2Vec, and BERTopic. t-SNE, word clouds, and other visualization methods are used to visualize the results. For topic classification, an Auto-Encoder and BERT are deployed to classify positive and negative topics. The performance of the models is evaluated with several metrics, including topic coherence, topic similarity, and topic diversity.
2 Data Preprocessing and Preliminary Analysis
1) Deduplication: To address redundancy within the dataset comprising over one million entries, duplicate records are identified and removed, retaining only the initial occurrence of each unique data point.
2) Stemming: Stemming involves reducing words to their root forms by removing affixes such as prefixes and suffixes, or transforming them into lemmas. This process enhances the accuracy of topic modeling by standardizing word forms.
3) Removal of stop words: Stop words, such as "the", "a", "an", and "in", which are commonly used but lack substantive meaning, are excluded from the topic modeling phase to focus on more meaningful content.
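The three steps above can be sketched in a few lines of Python (a minimal sketch using pandas and NLTK; the file name abcnews-date-text.csv and the column headline_text follow the Kaggle dataset, and the exact stemmer and token handling in our notebook may differ):

```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the stop-word list

# Load the Kaggle dataset (file and column names follow "A Million News Headlines")
df = pd.read_csv("abcnews-date-text.csv")

# 1) Deduplication: keep only the first occurrence of each headline
df = df.drop_duplicates(subset="headline_text", keep="first")

# 2) + 3) Stemming and stop-word removal
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    tokens = text.lower().split()
    tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

df["clean_text"] = df["headline_text"].apply(preprocess)
headlines = df["clean_text"].tolist()  # used by the models in Section 3
```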
Table 2.1 shows the dataset statistics after preprocessing [1].
Number of headlines | 1,195,191
Number of headline tokens in vocabulary | 78,280
Average length of headlines | 33.292 characters, 5.493 tokens
Range of publish dates | 2003-02-19 to 2020-12-31
Table 2.1: Dataset statistics
Figure 2.1: Histograms of word_count, unique_word_count, char_count, punctuation_count, stop_word_count, and mean_word_length, before and after preprocessing
3 Topic Modelling
3.1 Baselines: LSA and LDA
3.1.1 Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a natural language processing (NLP) technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to those documents and terms, based on the assumption that words used in the same contexts tend to have similar meanings. LSA is useful for topic modelling, where the goal is to discover the underlying thematic structure in a collection of documents. By analyzing patterns of word distributions across documents, LSA can capture the underlying semantic structure of the language without any prior linguistic or perceptual similarity knowledge.
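As a rough illustration, LSA on the headlines can be implemented with scikit-learn by factorizing a TF-IDF matrix with truncated SVD (a sketch only; `headlines` is the preprocessed text from Section 2, and the parameters in our notebook may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Build a TF-IDF document-term matrix from the preprocessed headlines
vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(headlines[:50000])

# LSA = truncated SVD of the TF-IDF matrix; each component acts as one "topic"
lsa = TruncatedSVD(n_components=10, random_state=42)
doc_topic_lsa = lsa.fit_transform(X)            # document-topic weights

# Print the top-5 words of each topic
terms = vectorizer.get_feature_names_out()
for k, comp in enumerate(lsa.components_):
    top = comp.argsort()[-5:][::-1]
    print(k, [terms[i] for i in top])
```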
3.1.2 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the context of text data, these unobserved groups are topics, and LDA is a popular method for performing topic modeling in large collections of documents. LDA assumes that each document in a corpus can be described by a distribution over a fixed number of topics, and each topic can be described by a distribution over words. It also assumes that there are latent topics that generate the documents through a stochastic process.
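A corresponding LDA sketch with scikit-learn (again illustrative; LDA is fit on raw term counts rather than TF-IDF weights, and the hyperparameters are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words counts for the same 50,000 preprocessed headlines
cv = CountVectorizer(max_features=20000)
X_counts = cv.fit_transform(headlines[:50000])

lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                random_state=42)
doc_topic_lda = lda.fit_transform(X_counts)     # per-document topic distributions

# Each row of components_ is an (unnormalized) topic-word distribution
terms = cv.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = comp.argsort()[-5:][::-1]
    print(k, [terms[i] for i in top])
```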
Figure 3.1.2: The LDA algorithm [1]
3.1.3 Experiments and Result Analysis
Using the first 50,000 headlines from the dataset, we apply LSA and LDA to obtain 10 topic clusters each. The results are shown below:
Topic Cluster | Top 5 words | # of headlines
0 | police, death, probe, missing, seek | 14,916
1 | govt, boost, health, urged, fund | 10,785
2 | court, man, face, charge, murder | 1,921
3 | iraq, say, war, troop, pm | 5,383
4 | council, plan, merger, water, seek | 1,193
5 | new, open, set, sars, sign | 2,704
6 | dy, dead, car, man, hospital | 1,037
7 | plan, water, power, sought, reef | 1,051
8 | win, world, cup, australia, final | 9,294
9 | killed, crash, blast, baghdad, iraqi | 1,547
Table 3.1.3.1: Latent Semantic Analysis results
Topic Cluster | Top 5 words | # of headlines
0 | police, sa, win, drug, claim | 4,277
1 | man, police, court, face, crash | 6,363
2 | police, win, lead, say, offer | 4,460
3 | report, set, plan, govt, farmer | 4,535
4 | plan, council, service, coast, union | 4,864
5 | cup, world, say, war, open | 5,290
6 | water, rain, test, wa, group | 4,745
7 | police, death, council, search, korea | 5,018
8 | govt, council, new, vic, market | 4,507
9 | iraq, killed, new, iraqi, govt | 5,772
Table 3.1.3.2: Latent Dirichlet Allocation results
Figure 3.1.3.2: t-SNE plots for the LSA (left) and LDA (right) results
From the figures we can see that the LSA topic clusters vary widely in size, whereas LDA produces a flatter size distribution across clusters. The LDA topic clusters are sparse and discontinuous in the t-SNE projection.
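The t-SNE plots can be reproduced roughly as follows (a sketch; `doc_topic_lsa` and `doc_topic_lda` are the document-topic matrices from the snippets above, and t-SNE on 50,000 points may need subsampling to run quickly):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(doc_topic, title):
    # Project the 10-dimensional document-topic vectors down to 2-D
    xy = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(doc_topic)
    labels = doc_topic.argmax(axis=1)           # hard-assign each headline to its top topic
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=2, cmap="tab10")
    plt.title(title)
    plt.show()

plot_tsne(doc_topic_lsa, "t-SNE of LSA document-topic vectors")
plot_tsne(doc_topic_lda, "t-SNE of LDA document-topic vectors")
```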
3.2 T5 + Clustering
3.2.1 Introduction to T5
T5 [1] is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, for which every task is converted into a text-to-text format. T5 works well on a variety of tasks out of the box by prepending a different prefix to the input for each task, e.g., for translation: "translate English to German: …". In this project, we use T5 to encode the headline texts into a feature matrix. (The T5 model was presented in "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.)
3.2.2 How is T5 studied?
All analysis performed in [1] uses the unified, text-to-text framework described above, as it allows a variety of different language understanding tasks to be converted into a shared format. Additionally, analysis of T5 uses the same underlying transformer architecture and pre-training dataset.
T5 [2] uses an encoder-decoder architecture that closely resembles the original transformer. The modifications T5 makes to the encoder-decoder transformer architecture are the following: LayerNorm [3] is applied immediately before each attention and feed-forward transformation (i.e., outside the residual path), and no additive bias is used in LayerNorm (only the scale parameter is kept). A simple relative position embedding scheme adds a scalar to the corresponding logit used to compute attention weights. Dropout is applied throughout the network (e.g., attention weights, feed-forward network, skip connections). These modifications are illustrated in Figure 3.2.2.1.
Figure 3.2.2.1: The T5 model
3.2.3 Clustering
Figure 3.2.3.1: Hierarchical clustering of the T5 features; Figure 3.2.3.2: Word cloud
Topic Cluster | Top 5 words | # of headlines
0 | police, govt, us, iraq, kill | 7,370
1 | us, polic, govt, new, plan | 7,354
2 | govt, iraq, polic, us, iraqi | 6,366
3 | kill, man, dace, court, murder | 4,676
4 | nsw, sa, sar, sydny, polic | 4,590
5 | us, call, plan, man, farmer | 4,457
6 | rise, dy, dead, car, man | 4,337
7 | hospital, plan, council, fund, new | 3,982
8 | world, sar, nsw, concern, hous | 3,898
9 | new, iraq, govt, iraqi, us, polic | 2,801
Table 3.2.3.1: Top-10 topics (T5 + Clustering)
Using the T5 pre-trained model as an encoder, we obtain a 49,832 x 512 feature matrix, where the 512 columns are the features constructed for each headline. Before clustering, we inspect the hierarchical clustering of these 512 features (Figure 3.2.3.1) and the word cloud (Figure 3.2.3.2) to get an intuitive picture and roughly judge the subsequent processing direction. Based on this feature matrix and the preprocessed text, we compute TF-IDF weights and then cluster the headlines into 10 topic groups with k-means. This yields the quantitative summary of the clusters shown in Table 3.2.3.1.
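A rough sketch of this pipeline is given below. We assume t5-small (whose encoder hidden size of 512 matches the 49,832 x 512 feature matrix) and mean pooling over the encoder states; the exact pooling and the TF-IDF weighting step used in our notebook may differ.

```python
import numpy as np
import torch
from transformers import T5Tokenizer, T5EncoderModel
from sklearn.cluster import KMeans

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small").eval()

# Encode every headline into a 512-dimensional vector
features = []
with torch.no_grad():
    for text in headlines[:50000]:
        ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
        hidden = encoder(input_ids=ids).last_hidden_state   # (1, seq_len, 512)
        features.append(hidden.mean(dim=1).squeeze(0).numpy())
features = np.vstack(features)                               # (n_headlines, 512)

# Cluster the feature matrix into 10 topic groups
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
cluster_ids = kmeans.fit_predict(features)
```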
3.3 Top2Vec
3.3.1 Introduction to Top2Vec
Top2Vec is an algorithm for topic modeling and dimensionality reduction that is designed to identify topics in a collection of documents automatically. Unlike traditional topic modeling approaches such as Latent Dirichlet Allocation and Latent Semantic Analysis, which rely on bag-of-words representations and matrix factorization techniques, Top2Vec leverages word embeddings to capture the semantic meaning of words and documents.
3.3.2 Architecture of Top2Vec
Word and Document Embeddings: Top2Vec creates jointly embedded document and word vectors using Doc2Vec, the Universal Sentence Encoder, or a BERT sentence transformer. Documents are placed close to other similar documents and close to the most distinguishing words. In our code, we use Doc2Vec, which trains a doc2vec model from scratch.
Dimensionality Reduction: To make the data more manageable and to improve the quality of the clustering, Top2Vec applies a dimensionality reduction technique such as UMAP (Uniform Manifold Approximation and Projection) to the document embeddings. This step projects the high-dimensional embeddings into a lower-dimensional space while preserving the local and global structure of the data.
Clustering: In the reduced-dimensional space, Top2Vec uses a density-based clustering algorithm, HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), to group documents into clusters based on their density in the embedding space [5]. Each cluster is considered to represent a different topic.
Topic Creation: For each document cluster, Top2Vec identifies the most representative words by finding the word embeddings that are closest to the document embeddings within the cluster. These words form the topic's keyword list, which can be used to interpret and label the topic. If there are too many similar topics, Top2Vec can merge them based on the similarity of their topic keywords to produce a more concise set of topics.
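With the top2vec package, the whole pipeline above (Doc2Vec embedding, UMAP, HDBSCAN, topic creation) runs end-to-end from a single call. The sketch below shows the calls we rely on; the parameters are illustrative rather than our exact configuration, and `df` is the deduplicated DataFrame from the preprocessing sketch in Section 2.

```python
from top2vec import Top2Vec

# Train on the raw headlines; with embedding_model="doc2vec" a Doc2Vec model
# is trained from scratch, and Top2Vec handles its own tokenization.
model = Top2Vec(documents=df["headline_text"].tolist(),
                embedding_model="doc2vec", speed="learn", workers=8)

print(model.get_num_topics())                   # number of topics found by HDBSCAN
topic_words, word_scores, topic_nums = model.get_topics(10)   # top-10 topics

# Inspect the headlines that are most representative of topic 0
docs, doc_scores, doc_ids = model.search_documents_by_topic(topic_num=0, num_docs=5)
```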
Figure 3.3.2.1: An example of a semantic space [6]; Figure 3.3.2.2: UMAP visualization of headline vectors; Figure 3.3.2.3: HDBSCAN visualization of headline vectors
3.3.3 Experiments and Result Analysis
Topic Cluster | Top 5 words | # of headlines
0 | find, quarter, spark, close, celebrates | 14,916
1 | praise, approves, surgery, brisbane, timor | 10,785
2 | sought, witness, officer, councillor, check | 1,921
3 | sink, fan, federer, advance, weekend | 5,383
4 | vote, today, miner, criticises, teacher | 1,193
5 | sell, approves, remains, date, stabbing | 2,704
6 | fined, schumacher, blair, scud, turkey | 1,037
7 | sentenced, euro, guilty, thorpe, growth | 1,051
8 | suu, zealand, concerned, pool, criticises | 9,294
9 | wild, indemnity, insurance, row, nation | 1,547
Table 3.3.3.1: Top2Vec analysis results
Figure 3.3.3.1: Word clouds for Topic 0 and Topic 1
3.4 BERTopic
3.4.1 Model Innovation
In this section, BERTopic is adopted for the preprocessed text. BERTopic is an innovative topic modelling technique that revolutionizes the interpretation and clustering of textual data, and it is especially efficient in terms of computation time [7].
Compared with traditional topic modeling methods, BERTopic excels in capturing the semantic nuances of text by leveraging pre-trained BERT embeddings. By encoding text into dense vector representations, BERTopic preserves intricate relationships between words and documents, enabling the creation of dense clusters that correspond to coherent topics [8]. Moreover, the incorporation of c-TF-IDF ensures that important words are prioritized within the topic descriptions, enhancing the interpretability of the resulting clusters [8]. In this endeavor, BERTopic offers flexibility by providing two default embedding models tailored to different language requirements. For English language text, the default embedding model utilized is all-MiniLM-L6-v2, while for multilingual text, the paraphrase-multilingual-MiniLM-L12-v2 model is employed. Our group adopts the “all-MiniLM-L6-v2” embedding model for this task.
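The BERTopic pipeline we use can be sketched as follows (illustrative defaults; the built-in visualization helpers named in the comments produce plots of the kind shown later in Figures 3.4.2.1-3.4.2.4):

```python
from bertopic import BERTopic

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2",  # default English model
                       language="english",
                       calculate_probabilities=False,
                       verbose=True)

topics, _ = topic_model.fit_transform(headlines)   # headlines: preprocessed strings

print(topic_model.get_topic_info().head(11))       # topic -1 collects outliers
print(topic_model.get_topic(0))                    # (word, c-TF-IDF score) pairs for topic 0

# Visualizations of the kind shown later in this section:
# topic_model.visualize_barchart(), topic_model.visualize_heatmap(),
# topic_model.visualize_term_rank(), topic_model.visualize_hierarchy()
```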
Figure 3.4.1.1: Computation time (wall time), in seconds, of each topic model on the testing dataset [7]
In this task, BERTopic offers a robust solution that harnesses the power of BERT embeddings and class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) weighting to unveil latent topics within the corpus. The c-TF-IDF technique in equation (2) extends the traditional TF-IDF procedure in equation (1) to clusters of documents. Initially, all documents within a cluster are treated collectively as a single document by concatenating them; TF-IDF is then applied to this aggregated representation, with clusters playing the role of individual documents. This class-based TF-IDF approach evaluates the significance of words within clusters rather than within individual documents.
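For reference, the two weighting schemes can be written as follows (the formulas appear only as an image in the original report, so the notation here is ours, following [7]): the standard TF-IDF weight of term t in document d is
W(t, d) = tf(t, d) · log(N / df(t)),   (1)
where N is the number of documents and df(t) is the number of documents containing t, while the class-based weight of term t in class (cluster) c is
W(t, c) = tf(t, c) · log(1 + A / tf(t)),   (2)
where tf(t) is the frequency of t across all classes and A is the average number of words per class.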
Figure 3.4.1.2: Class-based TF-IDF calculation [7]
Apart from the c-TF-IDF weighting, we adopt Jaccard similarity as a metric for topic similarity in the experiments and tests that follow. The Jaccard similarity between two sets A and B is defined as the size of their intersection divided by the size of their union; mathematically, it can be expressed as follows. For inter-topic similarity, a lower value indicates better topic modelling (topics are more distinct from each other); for intra-topic similarity, a higher value indicates better modelling. We adopt the inter-topic Jaccard similarity in the model comparison part.
J(A, B) = |A ∩ B| / |A ∪ B|
3.4.2 Experiment Results and Analysis
Table 3.4.2.1 below summarizes the top-10 topics found by BERTopic, and Figures 3.4.2.1-3.4.2.4 provide visualizations of the results. In Figure 3.4.2.1 and Table 3.4.2.1, the top-10 topic word scores are shown. Figure 3.4.2.2 summarizes the similarity matrix between topics: for example, "Baghdad Iraq Iraqi" has a similarity score of 1.0 with itself because it is the same topic, while "Baghdad Iraq Iraqi" and "age care nurs" have a similarity score of approximately 0.2 because they are dissimilar. Figure 3.4.2.3 visualizes the c-TF-IDF term score decline per topic. Figure 3.4.2.4 summarizes the hierarchical clustering results for the first 200 topics.
Topic Number | Topic Content
0 | baghdad, iraq, iraqi, kuwait, un
1 | fire, firefight, blaze, arson, burn
2 | water, restrict, irri, supply, waterhouse
3 | sar, hong, kong, quarantine, singapore
4 | govt, local, qid, fed, vic
5 | council, councilor, super, boundaries, cbd
6 | korea, korean, north, nth, nuclear
7 | pga, sorenstam, lead, lpga, open
8 | drought, aid, farmer, nino, relief
9 | protest, anti, war, rally, march
Table 3.4.2.1: Top-10 topic summary
Figure 3.4.2.1: Top-10 topic word scores; Figure 3.4.2.2: Similarity scores among topics; Figure 3.4.2.3: Term score decline per topic
Figure 3.4.2.4: Hierarchical clustering results (for the first 200 topics)
3.5 Model Comparison and Metric Analysis
To evaluate the performance of the topic models, we adopt four metrics: Coherence Score (CS), Topic Similarity (TS), Topic Diversity (TD), and Human Evaluation. Experiment results are shown in Tables 3.5.1 and 3.5.2. The experiments are run in Python 3.9 in a Jupyter Notebook environment.
Key observations, reasons, and explanations:
1) Coherence Score: indicates how interpretable and coherent the topics are. Higher coherence scores generally indicate better topics, as they suggest that the words within each topic are more semantically related and form more coherent themes. BERTopic offers coherence measures such as 'c_v', 'u_mass', 'c_uci', and 'c_npmi'; 'c_v' is adopted for this metric.
  • Closer to 1: higher coherence and interpretability of each topic, and thus better topic modelling.
  • Closer to 0: lower coherence and interpretability of each topic, and thus worse topic modelling.
2) Topic Similarity: measures the average semantic similarity between words of different topics. For the inter-topic case, a lower Topic Similarity value suggests that the topics are more dissimilar. This metric provides a finer-grained assessment of topic quality than the Coherence Score, as it evaluates the relationship among different topics. The inter-topic Jaccard Similarity score is adopted for this metric.
  • Closer to 0: lower semantic similarity between words of different topics, indicating more distinct topics, and thus better topic modelling.
  • Closer to 1: higher semantic similarity between words of different topics, indicating more similar topics, and thus worse topic modelling.
3) Topic Diversity: measures how diverse the topics are in terms of the range of unique words they contain. While high diversity is desirable to ensure comprehensive coverage of the dataset, overly diverse topics may lack focus and coherence, so this metric should be evaluated together with the two metrics above. (A computation sketch for metrics 1-3 is given after this list.)
  • Closer to 1: higher diversity of topics, with a broader range of concepts covered, and thus better topic modelling.
  • Closer to 0: lower diversity of topics, focused on a narrower set of concepts, and thus worse topic modelling.
4) Human Evaluation: we measure the interpretability of the first 1% of topics by human evaluation, with two sub-metrics:
  • Positive Feedback: topics are deemed interpretable, relevant, and coherent by human evaluators, reflecting better topic modelling.
  • Negative Feedback: topics are confusing, irrelevant, or lack coherence according to human evaluators, reflecting worse topic modelling.
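A sketch of how the first three metrics can be computed (assuming `topics` holds a model's list of top-word lists and `texts` the tokenized headlines; the exact wiring in our notebook may differ):

```python
import itertools
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def evaluate_topics(topics, texts):
    """topics: e.g. [["police", "death", ...], ...]; texts: tokenized headlines."""
    dictionary = Dictionary(texts)

    # 1) Coherence Score (C_V), estimated from word co-occurrence in the corpus
    coherence = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()

    # 2) Inter-topic Jaccard Similarity, averaged over all pairs of topics
    pair_sims = [len(set(a) & set(b)) / len(set(a) | set(b))
                 for a, b in itertools.combinations(topics, 2)]
    jaccard = sum(pair_sims) / len(pair_sims)

    # 3) Topic Diversity: fraction of unique words across all topic word lists
    all_words = [w for topic in topics for w in topic]
    diversity = len(set(all_words)) / len(all_words)

    return coherence, jaccard, diversity
```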
Tables 3.5.1 and 3.5.2 report the comparison results among the five models: two baselines (LSA, LDA) and three advanced models (Top2Vec, BERTopic, and T5 + Clustering).
Models | Coherence Score (C_V) | Topic Similarity (Jaccard) | Topic Diversity (TD) | Positive Feedback Rate | Negative Feedback Rate
LSA (baseline) | 0.4165 | 0.0145 | 0.89 | 70% | 30%
LDA (baseline) | 0.2699 | 0.0336 | 0.82 | 60% | 40%
Top2Vec | 0.7072 | 0.0197 | 0.6689 | 80% | 20%
BERTopic | 0.5457 | 0.0085 | 0.53 | 90% | 10%
T5 + Clustering | 0.3432 | 0.1434 | 0.25 | 60% | 40%
Table 3.5.1: Experiment results summary
LSA
• Innovations: can capture word synonyms; deals well with data sparsity; a solid understanding of probability theory and statistics is not necessary.
• Limitations: assumes a linear relationship between terms and topics, which may not always hold; does not capture polysemy well; does not account for word order.
LDA
• Innovations: can deal with small data [9]; generates a smaller number of topics compared with word-embedding-based approaches [9]; domain knowledge is not extremely important [9].
• Limitations: requires many experiments to fine-tune parameters; depends on the frequency of common words and assumes topic independence; the number of topics is a user-defined parameter.
Top2Vec
• Innovations: supports multilingual analysis [10]; the optimal number of topics is not defined by the user [10]; supports very large datasets [10]; preprocessing is not needed, since word embeddings are used [10].
• Limitations: the quality of the generated topics depends on the quality of the embeddings used; does not work well with small datasets; generates many outliers.
BERTopic
• Innovations: leverages pre-trained BERT embeddings, which capture intricate semantic relationships; automatically determines the number of topics without requiring the user to specify it; considers both semantic and syntactic information, leading to more accurate topic representations; supports large datasets and is computationally efficient due to BERT's scalability.
• Limitations: requires substantial computational resources, especially when using large pre-trained BERT models; may not perform optimally on domain-specific or niche datasets where BERT embeddings are less effective; interpretability of topics may be challenging due to the complexity of BERT embeddings and the black-box nature of the model; fine-tuning hyperparameters for specific tasks or datasets may be necessary for optimal performance.
T5 + Clustering
• Innovations: versatility, as T5's ability to handle various tasks in a unified text-to-text format makes it suitable for topic modelling; transfer learning, as T5 is pre-trained on a mixture of unsupervised and supervised tasks, which helps capture broad linguistic patterns and semantic relationships in the news headlines dataset; semantic understanding, as T5's architecture and training methodology allow it to capture intricate semantic relationships within the text.
• Limitations: computational resources, as T5 is a large-scale model that requires substantial compute; complexity and interpretability, as the black-box nature of deep learning models like T5 can hinder interpretation of the results; fine-tuning challenges, i.e., finding the right balance between model complexity, training duration, and performance metrics.
Table 3.5.2: Models summary
4 Topic Classification
In addition to the topic modelling task, we also deploy two models, an Auto-Encoder and BERT, for positive and negative topic classification. For positive news, we consider words such as "boosts," "great," "develops," and "promising," among others; conversely, for negative news, we look for words like "war," "turmoil," "trouble," and "injury."
1) The Auto-Encoder model we adopt is built around a bidirectional RNN (either GRU or LSTM) with max pooling. The encoder is composed of two RNN layers, while the decoder uses an RNN with a dense layer on top of it; the reconstruction of the original input is produced by a final dense layer. (A PyTorch sketch follows the model descriptions below.)
2) The BERT model we adopt consists of the following (a usage sketch is also given below):
Embeddings layer, containing three sub-layers:
  • Word embeddings: an embedding layer with a vocabulary size of 30,522 and an embedding dimension of 768; it converts input tokens into dense vectors of fixed size.
  • Position embeddings: an embedding layer with 512 positions and an embedding dimension of 768; it encodes the position of each token in the input sequence.
  • Token type embeddings: an embedding layer with 2 token types and an embedding dimension of 768; it encodes the segment or sequence type information.
Encoder layer, consisting of 12 BertLayer modules, each with the following sub-layers:
  • BertAttention: a self-attention mechanism with linear transformations for the query, key, and value vectors; it captures the importance of each token in relation to the other tokens in the sequence.
  • BertIntermediate: an intermediate dense layer with input dimension 768 and output dimension 3072; it applies a GELU activation function.
  • BertOutput: an output dense layer with input dimension 3072 and output dimension 768; it applies layer normalization and dropout.
Finally, the model includes a pooler layer with a linear transformation from 768 to 768 followed by a hyperbolic tangent activation; it generates a fixed-size (pooled) representation of the entire sequence. The classifier is a linear layer with input dimension 768 and output dimension 3.
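A minimal PyTorch sketch of the Auto-Encoder described in 1) above (layer sizes, the GRU choice, and the pooling details are assumptions; the classification head trained on the pooled code is omitted):

```python
import torch
import torch.nn as nn

class HeadlineAutoEncoder(nn.Module):
    """Bi-directional GRU auto-encoder with max pooling over time."""
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoder: two bi-directional GRU layers
        self.encoder = nn.GRU(emb_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Decoder: a GRU with a dense layer on top
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.dense = nn.Linear(hidden, hidden)
        # Final dense layer reconstructs token logits over the vocabulary
        self.reconstruct = nn.Linear(hidden, vocab_size)

    def forward(self, x):                       # x: (batch, seq_len) token ids
        h, _ = self.encoder(self.embed(x))      # (batch, seq_len, 2*hidden)
        code = h.max(dim=1).values              # max pooling -> sentence code
        dec_in = code.unsqueeze(1).repeat(1, x.size(1), 1)   # feed code at every step
        d, _ = self.decoder(dec_in)
        return self.reconstruct(torch.relu(self.dense(d)))  # (batch, seq_len, vocab)
```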
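The BERT classifier in 2) matches the standard bert-base-uncased architecture with a 3-way classification head. A usage sketch with the Hugging Face transformers library follows (the fine-tuning loop on our labelled headlines is omitted; the example headlines are illustrative):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# bert-base-uncased: vocab 30,522, hidden size 768, 12 encoder layers
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=3).eval()

batch = tokenizer(["war erupts amid regional turmoil",
                   "new funding boosts promising health program"],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits              # (batch, 3) class scores
print(logits.argmax(dim=-1))                    # predicted class per headline
```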
Results are summarized in Table 4.1. The RNN-based Auto-Encoder achieves higher validation accuracy but a lower F1-score, while the BERT model achieves a higher F1-score but lower accuracy on this task.
Figure 4.1: Auto-Encoder [11]; Figure 4.2: BERT architecture [8]
Model | Validation Accuracy | Validation Precision | Validation Recall | Validation F1-score
RNN-based Auto-Encoder | 0.7560 | - | - | 0.6496
BERT | 0.2742 | 0.9559 | 0.9961 | 0.9071
Table 4.1: Experiment summary of topic classification
5 Conclusion and Future Work
In this project, we deploy 7 different models for the tasks of topic modelling and topic classification. For topic modelling, the dataset is preprocessed, two baseline models (LSA and LDA) are built and tested, and three advanced models (Top2Vec, BERTopic, and T5 + Clustering) are deployed and compared; t-SNE and word clouds are used to visualize the results. For topic classification, the dataset is labelled, and two models (an Auto-Encoder and BERT) are deployed and compared. Moreover, four metrics, including Coherence Score, Topic Similarity, Topic Diversity, and Human Evaluation, are defined and evaluated across the five topic models. We summarize the following key innovations and findings.
1) Compared to the LSA/LDA baselines, Top2Vec achieves a markedly higher Coherence Score, which reflects more relevance within each topic. This is likely because Top2Vec supports multilingual analysis, determines the number of topics automatically, and is well suited to large datasets.
2) Compared to the LSA/LDA baselines, BERTopic has a smaller inter-topic Jaccard similarity, which indicates that its topics are more distinct from one another.
3) Compared to the previous four models, T5 + Clustering has a smaller Coherence Score and Topic Diversity and a larger inter-topic Jaccard similarity, which reflects that it is less suitable for this task.
4) For positive and negative topic classification, the RNN-based Auto-Encoder has higher validation accuracy and a lower F1-score, while the BERT model has a higher F1-score and lower accuracy on this task.
Finally, we point out several research directions for this project. For topic modelling, the research directions include:
1) Graph-based Topic Modeling to explore the integration of graph-based methods with topic modeling techniques to capture the inherent relationships and structures within documents.
2) Interpretable Topic Models: Research efforts may include the exploration of novel priors, regularization techniques, or interactive topic modeling frameworks aimed at enhancing interpretability while maintaining model performance.
For topic classification, semi-supervised learning and active learning could be investigated to leverage limited labeled data efficiently:
3) Semi-supervised learning approaches, such as self-training or co-training, can be deployed.
4) Additionally, active learning strategies can be employed to intelligently select the most informative data points for annotation, thereby reducing the annotation burden while maintaining classification accuracy.
References
[1] yasshramchandani, "task6_EDA_News-Headlines," Kaggle.com, Apr. 28, 2021. https://www.kaggle.com/code/yasshramchandani/task6-eda-news-headlines (accessed May 12, 2024).
[2] C. R. Wolfe, "T5: Text-to-Text Transformers (Part One)," Substack.com. https://cameronrwolfe.substack.com/p/t5-text-to-text-transformers-part (accessed May 12, 2024).
[3] Hugging Face, "T5," https://huggingface.co/docs/transformers/en/model_doc/t5 (accessed May 12, 2024).
[4] A. Srivastava and C. Sutton, "Autoencoding Variational Inference For Topic Models," arXiv.org, 2017, doi: 10.48550/arxiv.1703.01488.
[5] R. J. G. B. Campello, D. Moulavi, and J. Sander, "Density-Based Clustering Based on Hierarchical Density Estimates," in Advances in Knowledge Discovery and Data Mining, Berlin, Heidelberg: Springer, 2013, pp. 160-172, doi: 10.1007/978-3-642-37456-2_14.
[6] D. Angelov, "Top2Vec: Distributed Representations of Topics," arXiv.org, 2020, doi: 10.48550/arxiv.2008.09470.
[7] M. Grootendorst, "BERTopic: Neural topic modeling with a class-based TF-IDF procedure," arXiv:2008.07909, 2020.
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805, 2018.
[9] D. Bank, N. Koenigstein, and R. Giryes, "Autoencoders," arXiv:2003.05991v2 [cs.LG], Apr. 2021.
[10] R. Albalawi, T. H. Yeap, and M. Benyoucef, "Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis," Frontiers in Artificial Intelligence, vol. 3, p. 42, 2020, doi: 10.3389/frai.2020.00042.
[11] R. Egger and J. Yu, "A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts," Frontiers in Sociology, vol. 7, p. 886498, 2022, doi: 10.3389/fsoc.2022.886498.