Implementation of Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data (HDD)

T. Gayathri, D. Suvidha, P. V. Monisha, U. JothiLakshmi

Published in International Journal of Advanced Research in Electronics, Communication & Instrumentation Engineering and Development

ISSN: 2347-7210          Impact Factor: 1.9          Volume: 1          Issue: 3          Date: 08 March 2014          Pages: 101-100

Abstract

Feature selection involves identifying a subset of the most useful features that produces results comparable to those of the original, entire set of features. A feature selection algorithm may be evaluated from both the efficiency and the effectiveness points of view: efficiency concerns the time required to find a subset of features, while effectiveness concerns the quality of that subset. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature, the one most strongly related to the target classes, is selected from each cluster to form the final subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt an efficient minimum-spanning-tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. The results, on 35 publicly available real-world high-dimensional image, microarray, and text datasets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
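
The abstract only outlines the two steps, so the following is a minimal Python sketch of how such a pipeline could be assembled, under stated assumptions: features are discrete-valued, symmetric uncertainty (a standard information-theoretic relevance measure) serves as both the feature-feature edge weight and the feature-class relevance score, and a spanning-tree edge is cut when its weight falls below both endpoints' class relevance, so that the surviving forest's components form the clusters. All names (fast_select, symmetric_uncertainty, relevance_threshold) and the exact pruning rule are illustrative, not the authors' implementation.

import math
from collections import Counter, defaultdict
from itertools import combinations

def entropy(xs):
    """Shannon entropy H(X) of a discrete-valued sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0.0:
        return 0.0
    hxy = entropy(list(zip(xs, ys)))  # joint entropy H(X, Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)

def maximum_spanning_tree(nodes, weight):
    """Prim's algorithm on a complete graph; returns the tree's edge list."""
    in_tree, edges = {nodes[0]}, []
    while len(in_tree) < len(nodes):
        u, v = max(((a, b) for a in in_tree for b in nodes if b not in in_tree),
                   key=lambda e: weight(*e))
        in_tree.add(v)
        edges.append((u, v))
    return edges

def fast_select(features, target, relevance_threshold=0.0):
    """features: {name: list of discrete values}; target: list of class labels."""
    # Relevance of each feature to the class; drop clearly irrelevant ones.
    su_c = {f: symmetric_uncertainty(v, target) for f, v in features.items()}
    names = [f for f in features if su_c[f] > relevance_threshold]
    if len(names) <= 1:
        return names
    # Step 1a: maximum spanning tree over pairwise feature-feature SU.
    pair = {frozenset(e): symmetric_uncertainty(features[e[0]], features[e[1]])
            for e in combinations(names, 2)}
    w = lambda a, b: pair[frozenset((a, b))]
    tree = maximum_spanning_tree(names, w)
    # Step 1b: cut tree edges weaker than both endpoints' class relevance;
    # the connected components of the remaining forest are the clusters.
    adj = defaultdict(set)
    for a, b in tree:
        if not (w(a, b) < su_c[a] and w(a, b) < su_c[b]):
            adj[a].add(b)
            adj[b].add(a)
    # Step 2: keep the most class-relevant feature from each cluster.
    selected, seen = [], set()
    for start in names:
        if start in seen:
            continue
        comp, stack = [], [start]
        while stack:                      # DFS over one cluster
            f = stack.pop()
            if f not in seen:
                seen.add(f)
                comp.append(f)
                stack.extend(adj[f])
        selected.append(max(comp, key=su_c.get))
    return selected

if __name__ == "__main__":
    y = [0, 0, 1, 1, 0, 1]
    feats = {
        "f1": [0, 0, 1, 1, 0, 1],   # perfectly predictive
        "f2": [0, 0, 1, 1, 0, 1],   # exact duplicate of f1 (redundant)
        "f3": [1, 0, 0, 1, 1, 0],   # only weakly related to the class
    }
    print(fast_select(feats, y))    # -> ['f1']

On this toy input the duplicate f2 and the weak f3 fold into f1's cluster, so only f1 survives as the representative. On real high-dimensional data the choice of relevance threshold and the discretization of continuous features would matter considerably more than this sketch suggests.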

Keywords

Feature subset selection, filter method, feature clustering, graph-based clustering
