Evaluation of Word Embedding Techniques for the Vietnamese SMS Spam Detection Model

Authors

  • Vu Minh Tuan, Hanoi University
  • Tran Quang Anh
  • Do Thuy Duong

Keywords:

Vietnamese spam, SMS Spam, deep learning, CNN, word embedding

Abstract

The growing volume of SMS spam in Vietnamese text messages has prompted the adoption of machine learning and deep learning models for detection. This paper investigates how the choice of word embedding technique affects the performance of SMS spam detection models. Traditional statistical methods (BoW, TF-IDF) are compared with more advanced techniques (Word2Vec, fastText, GloVe, PhoBERT) on a proprietary dataset. The evaluation uses accuracy, precision, recall, and F1 score. PhoBERT combined with a CNN model achieved the highest accuracy (0.968) and an F1 score of 0.941. The study clarifies the role of word embeddings in building robust spam detection models and offers guidance for model selection. The methodology, comparative analysis, and future directions are presented.
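
As an illustration of the four evaluation metrics named in the abstract, the following minimal sketch computes them for a binary spam/ham classifier. The function name `classification_metrics` and the label lists are hypothetical and used only for this example; they are not taken from the paper's dataset or code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary task.

    `positive` is the label treated as spam (an assumption of this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against a zero denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = spam, 0 = legitimate message.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because spam is typically the minority class, accuracy alone can look high even when many spam messages are missed, which is presumably why precision, recall, and F1 are reported alongside it.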


Published

2024-05-04 — Updated on 2024-05-04
