proficient pattern discovery in sequence data sets

Yogesh S Lonkar,

Published in International Journal of Advanced Research in Computer Science Engineering and Information Technology

ISSN: 2321-3337          Impact Factor:1.521         Volume:4         Issue:3         Year: 22 March,2015         Pages:384-398

International Journal of Advanced Research in Computer Science Engineering and Information Technology

Abstract

Available sequence mining algorithms mostly focus on subsequence mining. In this mining biological DNA and protein motif mining, require proficient mining of “estimated” patterns that are adjacent. The few available algorithms that can be useful to find such adjacent estimated pattern mining have disadvantage like poor scalability, lack of guarantee in finding the pattern, and difficult in adapting to other applications. In this paper, we present a new algorithm called FLexible and Accurate Motif DEtector (FLAME). FLAME is a flexible suffix-tree-based algorithm that can be used to find frequent patterns with a variety of definitions of motif (pattern) models. It is also correct, as it always finds the pattern if it live. Using both actual and artificial data sets, we show that FLAME is rapid, scalable existing algorithms on a variety of performance metrics. In addition, based on FLAME, we also address a more general problem, named complete structured motif extraction, which allows mining recurrent combinations of motifs under comfortable limitation.

Kewords

Motif, sequence mining, suffix tree.

Reference

[1] M. O. Dayhoff, R. M. Schwartz, and B. Orcutt, "A Model for Evolutionary Changes in Proteins," Atlas of Protein Sequence and Structure, vol. 5, pp. 345-352, 1978. [2] S. Henikoff and J. Henikoff, "Amino Acid Substitution Matrices from Protein Blocks," National Academy of Sciences, USA, vol. 89, no. 22, pp. 10915-9, 1992. [3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in VLDB, 1994, pp. 487-499. [4] ——, "Mining Sequential Patterns," in ICDE, 1995, pp. 3-14. [5] M. J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 42, no. 1/2, pp. 31-60, 2001. [6] J. Wang and J. Han, "BIDE: Efficient Mining of Frequent Closed Sequences," in ICDE, 2004, pp. 79-90. [7] X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Datasets," in SDM, 2003. [8] J. Yang, W. Wang, P. S. Yu, and J. Han, "Mining Long Sequential Patterns in a Noisy Environment," in SIGMOD, 2002, pp. 406-417. [9] M. J. Zaki, "Sequence Mining in Categorical Domains: Incorporating Constrains," in CIKM, 2000, pp. 442-429. [10] J. Pei, J. Han, and W. Wang, "Mining Sequential Patterns With Constraints in Large Databases," in CIKM, 2002, pp. 18-25. [11] A. Floratou, S. Tata, and J. M. Patel, "Finding Hidden Patterns in Sequences, Tech. Rep. http://www.pages.cs.wisc.edu/∼floratou, 2009. [12] S. Sinha and M. Tompa, "YMF: A Program for Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation," Nucleic Acids Research, vol. 31, no. 13, 2003. [13] G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, "Weeder Web: Discovery of Transcription Factor Binding Sites in a Set of Sequences From Co-Regulated Genes," Nucleic Acids Research, vol. 32(Web Server issue), pp. W199-W203, 2004. [14] E. Eskin and P. A. Pevzner, "Finding Composite Regulatory Patterns in DNA Sequences," in ISMB, 2002, pp. S354-63. [15] J. Buhler and M. Tompa, "Finding Motifs Using Random Projections," Journal Computational Biology, vol. 9, no. 2, pp. 225-242, 2002. [16] G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth, "Rule Discovery From Time Series," in KDD, 1998, pp. 16-22. [17] S. Hoppner, "Discovery of Temporal Patterns - Learning Rules about the Qualitative Behaviour of Time Series," in 5th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2001, pp. 192-203. [18] P. Patel, E. Keogh, J. Lin, and S. Lonardi, "Mining Motifs in Massive Time Series Databases," in ICDM, 2002, pp. 370-377. [19] H. Wu, B. Salzberg, G. C. Sharp, S. B. Jiang, H. Shirato, and D. Kaeli, "Subsequence Matching on Structured Time Series Data," in SIGMOD, 2005, pp. 682-693. [20] B. Y.-C. Chiu, E. J. Keogh, and S. Lonardi, "Probabilistic Discovery of Time Series Motifs," in KDD, 2003, pp. 493-498. [21] W. Wang and J. Yang, Mining Sequential Patterns from Large Data Sets. Springer-Verlag, 2005, vol. 28. [22] M. Das and H. K. Dai, "A Survey of DNA Motif Finding Algorithms," BMC Bioinformatics, vol. 8, 2007. [23] G. K. Sandve and F. Drabløs, "A Survey of Motif Discovery Methods in an Integrated Framework," Biology Direct, vol. 1, 2006. [24] Y. Zhang and M. J. Zaki, "ExMOTIF: Efficient Structured Motif Extraction," Algorithms for Molecular Biology, vol. 1, no. 21, November 2006. [25] A. M. Carvalho et al., "An Efficient Algorithm for the Identification of Structured Motifs in DNA Promoter Sequences," IEEE/ACM Transactions on computational biology and bioinformatics, vol. 3, 2006. [26] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu, "PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth," in ICDE, 2001, pp. 215-224. [27] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, "Approaches to the Automatic Discovery of Patterns in Biosequences," Journal of Computational Biology, vol. 5, pp. 279-305, 1998. [28] L. Marsan and M.-F. Sagot, "Algorithms for Extracting Structured Motifs Using a Suffix Tree with Application to Promoter and Regulatory Site Consensus Identification," Journal of Computational Biology, vol. 7, no. 3/4, pp. 345-360, 2000. [29] F. Zhu, X. Yan, J. Han, and P. S. Yu, "Efficient discovery of frequent approximate sequential patterns," in ICDM, 2007. [30] T. L. Bailey and C. Elkan, "Unsupervised Learning of Multiple Motifs in Biopolymers using EM," Machine Learning, vol. 21, no. 1-2, pp. 51-80, 1995. [31] W. Thompson, E. C. Rouchka, and C. E. Lawrence, "Gibbs Recursive Sampler: Finding Transcription Factor Binding Sites," Nucleic Acids Research, vol. 31, no. 13, pp. 3580-3585, 2003. [32] M. Tompa et al., "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites," Nature Biotechnology, vol. 23, pp. 137-144, 2005. [33] A. W.-C. Fu, E. J. Keogh, L. Y. H. Lau, and C. A. Ratanamahatana, "Scaling and Time Warping in Time Series Querying," in VLDB, 2005, pp. 649-660. [34] M. Vlachos, G. Kollios, and D. Gunopulos, "Discovering Similar Multidimensional Trajectories," in ICDE, 2002, pp. 673-684. [35] L. Chen, M. Tamer Ozsu, and V. Oria, "Robust and Fast Similarity Search for Moving Object Trajectories," in SIGMOD, 2005, pp. 491- 502. [36] Y. Zhu and D. Shasha, "Warping Indexes with Envelope Transforms for Query by Humming," in SIGMOD, 2003, pp. 181-192. [37] A. Udechukwu, K. Barker, and R. Alhajj, "Discovering all frequent trends in time series," in Proc. of Winter Int. Sym. on Information and Comm. Tech., vol. 58, 2004, pp. 1-6. [39]"Data Sets from Analysis of Financial Time http://www.gsb.uchicago.edu/fac/ruey.tsay/teaching/fts/. [40] R. S. Tsay, Analysis of Financial Time Series, 1st ed. [41] P. A. Pevzner and S.-H. Sze, "Combinatorial Approaches to Finding Subtle Signals in DNA Sequences," in ISMB, 2000, pp. 269-278. [42] "YMF Source Code," http://bio.cs.washington.edu/software.html. [43] "Weeder Source Code," http://www.pesolelab.it/Tool/ind.php. [44] "Random Projections Source Code," http://www.cse.wustl.edu/ jbuhler/pgt/. [45] "cSPADE Source Code," http://www.cs.rpi.edu/ zaki/software/. [46] S. Tata, R. A. Hankins, and J. M. Patel, "Practical Suffix Tree Construction," in VLDB, 2004, pp. 36-47. [47] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing," in VLDB, 1999, pp. 518-529. [48] T. L. Bailey and C. Elkan, "Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers," in ISMB, 1994, pp. 28-36. [49] "TRANSFAC," http://www.gene-regulation.com/pub/databases.html