GOWGCNStream: an integrated text graph representation learning for semantic and syntactic enhanced short text stream clustering
Keywords:
GCN, BERT, DPMM, graph-of-words, text stream clustering
Abstract
Text stream clustering is a fundamental task in natural language processing (NLP) that poses unique challenges: the input documents are sparse and noisy, the stream is unbounded in length, and clusters evolve over time. In recent years, mixture topic models, in particular algorithms based on the Dirichlet Process Mixture Model (DPMM) such as MStream and OSDM, have markedly improved the accuracy of short text stream clustering. However, these DPMM-based models remain limited in their ability to capture the sequential and long-range syntactic dependencies between words in a text, which can help improve the quality of the clusters extracted from a stream. To address this limitation, we propose GOWGCNStream, a novel model that integrates a graph convolutional network (GCN) with DPMM for text stream clustering. GOWGCNStream combines GCN with BERT to learn joint syntactic-structural and contextual representations of texts, which are then fed into the DPMM framework to perform short text stream clustering. Extensive experiments on benchmark datasets (Tweet-Set and Google-News) demonstrate the effectiveness of the proposed GOWGCNStream model compared with recent state-of-the-art baselines.
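To make the graph-of-words and GCN components concrete, the following is a minimal sketch and not the GOWGCNStream architecture itself. It assumes a sliding-window co-occurrence graph and a single propagation step with the standard symmetric normalization, ReLU(D^-1/2 (A + I) D^-1/2 X W). The helper names `graph_of_words`, `gcn_layer`, and `matmul`, the window size, and the random vectors standing in for BERT word embeddings are all illustrative assumptions; the DPMM clustering step is not shown.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_of_words(tokens, window=2):
    """Link each token to the next `window` tokens (hypothetical helper).

    Returns the sorted vocabulary and a symmetric 0/1 adjacency matrix.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    adj = [[0.0] * n for _ in range(n)]
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            u, v = index[w], index[tokens[j]]
            if u != v:  # no self-loops here; they are added in gcn_layer
                adj[u][v] = adj[v][u] = 1.0
    return vocab, adj


def gcn_layer(adj, features, weight):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[a_hat[i][j] * d_inv_sqrt[i] * d_inv_sqrt[j] for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, features), weight)
    return [[max(x, 0.0) for x in row] for row in out]


tokens = "stream clustering of short text stream".split()
vocab, adj = graph_of_words(tokens, window=2)

rng = random.Random(0)
# Random vectors stand in for BERT word embeddings (assumption for illustration).
features = [[rng.gauss(0, 1) for _ in range(8)] for _ in vocab]
weight = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]

hidden = gcn_layer(adj, features, weight)
# Mean-pool the node states into one document vector.
doc_vector = [sum(col) / len(hidden) for col in zip(*hidden)]
```

In a pipeline of the kind the abstract describes, a document vector like `doc_vector` would be one plausible input to the clustering stage; how GOWGCNStream actually pools node representations and couples them to DPMM is not specified here.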