Published in International Journal of Advanced Research in Computer Science Engineering and Information Technology
ISSN: 2321-3337 Impact Factor: 1.521 Volume: 4 Issue: 3 Year: 01 April, 2016 Pages: 803-811
Guaranteeing the quality of relevance features discovered in text documents to describe user preferences is a major challenge in text mining because of the large scale of terms and data patterns. Most popular text mining and classification methods adopt term-based approaches, which suffer from the problems of polysemy and synonymy. It has long been hypothesized that pattern-based methods should outperform term-based ones in describing user preferences; however, how to use large-scale patterns effectively remains a hard problem in text mining. To address this issue, this paper presents an innovative model for relevance feature discovery. The model discovers both positive and negative patterns in text documents as higher-level features and deploys them over low-level features (terms). It also classifies terms into categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics, and Reuters-21578 show that the proposed model significantly outperforms both state-of-the-art term-based methods and pattern-based methods.
Text mining, text feature extraction, text classification
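The abstract's idea of classifying terms into categories by their specificity can be illustrated with a minimal sketch. The abstract does not give the scoring formula or thresholds, so everything here is an assumption made for illustration: the function names `specificity` and `classify_terms`, the score (positive-document coverage minus negative-document coverage, divided by the number of positive documents), the three category labels, and the threshold values. Pattern mining and the weight update are not shown.

```python
def specificity(term, pos_docs, neg_docs):
    """Assumed score: (positive coverage - negative coverage) / #positive docs.

    Each document is represented as a set of terms.
    """
    if not pos_docs:
        raise ValueError("at least one positive document is required")
    cov_pos = sum(1 for doc in pos_docs if term in doc)
    cov_neg = sum(1 for doc in neg_docs if term in doc)
    return (cov_pos - cov_neg) / len(pos_docs)


def classify_terms(pos_docs, neg_docs, upper=0.3, lower=-0.3):
    """Group terms from the positive documents into three assumed categories.

    `upper` and `lower` are hypothetical thresholds, not values from the paper.
    """
    vocabulary = set().union(*pos_docs)
    categories = {}
    for term in vocabulary:
        score = specificity(term, pos_docs, neg_docs)
        if score > upper:
            categories[term] = "positive_specific"
        elif score < lower:
            categories[term] = "negative_specific"
        else:
            categories[term] = "general"
    return categories


if __name__ == "__main__":
    # Toy data: documents a user marked relevant (positive) or not (negative).
    pos_docs = [{"stock", "market", "bank"},
                {"stock", "trade", "bank"},
                {"stock", "market"}]
    neg_docs = [{"bank", "river"},
                {"bank", "fish"},
                {"river", "bank", "trade"}]
    for term, category in sorted(classify_terms(pos_docs, neg_docs).items()):
        print(term, category)
```

In this toy data, a polysemous term such as "bank" appears in both relevant and irrelevant documents and so receives a low score, which is the kind of ambiguity the abstract attributes to term-based approaches.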