Using content and non-dictionary words for author profiling of Vietnamese forum posts
Keywords:
author profiling, machine learning, content-based features, non-dictionary words
Abstract
This paper reports the results of an author profiling task on Vietnamese forum posts, in which personal traits of the author, such as gender, age, occupation, and location, are identified using content words and non-dictionary words. Experiments were conducted on datasets we collected from popular Vietnamese forums, and compared the performance of different types of features: stylometric features (lexical, syntactic, and structural), content-based features (the most important content words), and non-dictionary words (such as slang and abbreviations). Three learning methods, Decision Tree, Bayes Network, and Support Vector Machine (SVM), were tested, and SVM achieved the best results. The results show that these kinds of features work well on short, informal messages such as forum posts. Content word features yielded much better results than stylometric and non-dictionary word features when used individually. However, the combination of stylometric and non-dictionary word features also achieved good results.