Performance Enhancement of Imbalanced Data Using Meta-Cost Algorithm

M. Kiruthiga, P. Sangeetha

Published in International Journal of Advanced Research in Computer Science Engineering and Information Technology

ISSN: 2321-3337          Impact Factor: 1.521          Volume: 4          Issue: 3          Date: 29 April 2016          Pages: 999-1004


Abstract

Class imbalance is one of the major issues in classification and degrades the performance of data mining. It often arises when non-experts label the objects: online outsourcing systems such as Amazon's Mechanical Turk allow many users to label the same objects, with little control over quality. The resulting imbalance frequently increases the cost of misclassification. Thus, a meta-cost algorithm is proposed to handle the problem of imbalanced noisy labeling and to reduce the misclassification cost. The main objective is to generate the training dataset by integrating the labels of the examples. This method resolves the minority-sample issue and can also deal with imbalanced multiple noisy labeling. The algorithm is applied to imbalanced datasets collected from the UCI repository, and the results show that the meta-cost algorithm performs better than the other methods. A sketch of the meta-cost procedure follows.
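For context, the MetaCost approach (Domingos, ref. #3) wraps cost-sensitivity around any base learner: it estimates class probabilities with a bagged ensemble, relabels each training example with the class that minimizes expected cost, and retrains on the relabelled data. Below is a minimal sketch in Python, assuming scikit-learn; the 2x2 cost matrix and toy dataset are illustrative, not taken from the paper.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# C[i, j] = cost of predicting class i when the true class is j;
# illustrative values: missing the minority class (j = 1) is five
# times as expensive as the opposite error.
C = np.array([[0.0, 5.0],
              [1.0, 0.0]])

# An imbalanced toy dataset (about 10% minority examples).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Step 1: estimate P(j | x) with a bagged ensemble of trees.
rng = np.random.RandomState(0)
n_models, n = 10, len(X)
probs = np.zeros((n, 2))
for _ in range(n_models):
    idx = rng.randint(0, n, n)  # bootstrap resample
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    probs += tree.predict_proba(X)
probs /= n_models

# Step 2: relabel each example with argmin_i sum_j P(j|x) * C[i, j].
y_relabelled = np.argmin(probs @ C.T, axis=1)

# Step 3: retrain any classifier on the relabelled data; it becomes
# cost-sensitive without modifying its learning algorithm.
model = DecisionTreeClassifier(random_state=0).fit(X, y_relabelled)

With the cost matrix above, examples whose estimated minority-class probability exceeds roughly 1/6 are relabelled as positive, which is how the relabelling step counteracts the imbalance.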

Keywords

Repeated labeling, majority voting, positive and negative labels.
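The keywords point to repeated labeling integrated by majority voting, the baseline label-integration strategy the proposed method improves on. A minimal sketch, with purely illustrative objects and labels, of how multiple noisy positive/negative labels collapse to a single training label:

from collections import Counter

# Several annotators label the same objects; 1 = positive, 0 = negative.
noisy_labels = {
    "object_a": [1, 1, 0, 1],   # integrated label: 1
    "object_b": [0, 0, 1, 0],   # integrated label: 0
}

def majority_vote(labels):
    # Most frequent label wins; Counter breaks ties by first occurrence.
    return Counter(labels).most_common(1)[0][0]

training_labels = {obj: majority_vote(ls) for obj, ls in noisy_labels.items()}
print(training_labels)   # {'object_a': 1, 'object_b': 0}

When most annotators favor the majority (negative) class, this vote drifts toward negative labels and worsens the imbalance, which is the failure mode of imbalanced multiple noisy labeling that the meta-cost approach targets.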

References

#1. C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases [Online]. Available: http://archive.ics.uci.edu/ml/, 1998.
#2. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
#3. P. Domingos, "MetaCost: A general method for making classifiers cost-sensitive," in Proc. 5th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 1999, pp. 155–164.
#4. P. Donmez, J. G. Carbonell, and J. Schneider, "Efficiently learning the accuracy of labeling sources for selective sampling," in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, pp. 259–268.
#5. H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
#6. H. Kajino, Y. Tsuboi, and H. Kashima, "A convex formulation for learning from crowds," in Proc. 26th AAAI Conf. Artif. Intell., 2012, pp. 73–79.
#7. A. Kumar and M. Lease, "Modeling annotator accuracies for supervised learning," in Proc. 4th ACM WSDM Workshop Crowdsourcing Search Data Mining, 2011, pp. 19–22.
#8. X.-Y. Liu, J. Wu, and Z.-H. Zhou, "Exploratory under-sampling for class-imbalance learning," in Proc. IEEE 6th Int. Conf. Data Mining, 2006, pp. 965–969.
#9. H.-Y. Lo, J.-C. Wang, H.-M. Wang, and S.-D. Lin, "Cost-sensitive multi-label learning for audio tag annotation and retrieval," IEEE Trans. Multimedia, vol. 13, no. 3, pp. 518–529, 2011.
#10. C. Parker, "On measuring the performance of binary classifiers," Knowl. Inform. Syst., vol. 35, no. 1, pp. 131–152, 2013.
#11. V. S. Sheng, "Simple multiple noisy label utilization strategies," in Proc. IEEE 11th Int. Conf. Data Mining, 2011, pp. 635–644.
#12. V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Proc. 14th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2008, pp. 614–622.
#13. P. Smyth, M. C. Burl, U. M. Fayyad, P. Perona, and P. Baldi, "Inferring ground truth from subjective labelling of Venus images," Adv. Neural Inf. Process. Syst., vol. 8, pp. 1085–1092, 1995.
#14. R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng, "Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks," in Proc. Conf. Empirical Methods Natural Lang. Process., 2008, pp. 254–263.
#15. P. Welinder and P. Perona, "Online crowdsourcing: Rating annotators and obtaining cost-effective labels," in Proc. Workshop Adv. Comput. Vis. Humans in the Loop, 2010, pp. 25–32.
#16. J. Zhang, X. Wu, and V. S. Sheng, "Imbalanced multiple noisy labeling," IEEE Trans. Knowl. Data Eng., vol. 27, no. 2, Feb. 2015.