Published in International Journal of Advanced Research in Computer Science Engineering and Information Technology
ISSN: 2321-3337 Impact Factor: 1.521 Volume: 2 Issue: 2 Year: 08 March, 2014 Pages: 106-111
Abstract: FoCUS (Forum Crawler Under Supervision) is a supervised web-scale forum crawler. Its goal is to crawl relevant forum content from the web with minimal overhead. FoCUS is an automation engine that dynamically crawls the relevant content in a forum. Forum threads contain the informational content that a forum crawler targets. The major scope of the project is cleaning up the data and moving content to the appropriate web pages. Forum content may consist of queries asked by users. After crawling the content, FoCUS dynamically moves each query to the related forum, that is, the forum that deals with that particular query. FoCUS then removes unrelated queries from a forum, and the freed space is allocated to new queries posted by users. FoCUS takes six paths from an entry page to a thread page, which supports frequent thread updates in the forum. FoCUS uses a technique called differential content extraction, which maintains a record of already crawled data. Rather than crawling the forum data from the beginning each time, FoCUS consults this record and processes only the newly posted queries.
Keywords: EIT Path, Forum Crawling, ITF Regex, URL Type
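The differential content extraction described in the abstract can be illustrated with a minimal Python sketch. It assumes that every post carries a stable unique identifier and that the record of crawled data is a simple in-memory set. The names `Post`, `CrawlRecord`, and `extract_new` are hypothetical and are not taken from the FoCUS implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Post:
    post_id: str  # assumed: a stable, unique identifier for each forum post
    text: str


@dataclass
class CrawlRecord:
    """Record of posts that earlier crawls have already processed."""

    seen_ids: set = field(default_factory=set)

    def extract_new(self, posts):
        """Return only the posts not seen before, and remember them."""
        new_posts = []
        for post in posts:
            if post.post_id not in self.seen_ids:
                self.seen_ids.add(post.post_id)
                new_posts.append(post)
        return new_posts


# Two successive crawls of the same (made-up) thread.
record = CrawlRecord()
first_crawl = [
    Post("1", "How do I reset my password?"),
    Post("2", "Login fails on mobile"),
]
second_crawl = first_crawl + [Post("3", "Where is the export option?")]

print([p.post_id for p in record.extract_new(first_crawl)])   # ['1', '2']
print([p.post_id for p in record.extract_new(second_crawl)])  # ['3']
```

On the second crawl only the newly posted query is returned, so later steps such as moving a query to its related forum would run on new content only. A real crawler would likely persist the record between runs instead of keeping it in memory.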