MADMX: A Strategy for Maximal Dense Motif Extraction
Publication: Journal of Computational Biology
Volume 18, Issue Number 4
Abstract
We develop, analyze, and experiment with a new tool, called madmx, which extracts frequent motifs from biological sequences. We introduce the notion of density to single out the “significant” motifs. The density is a simple and flexible measure for bounding the number of don't cares in a motif, defined as the fraction of solid (i.e., different from don't care) characters in the motif. A maximal dense motif has density above a certain threshold, and any further specialization of a don't care symbol in it or any extension of its boundaries decreases its number of occurrences in the input sequence. By extracting only maximal dense motifs, madmx reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by madmx.
Get full access to this article
View all available purchase options and get full access to this article.
References
Agrawal R.Srikant R.1994. Fast algorithms for mining association rulesProc. 20th VLDB487-499. Agrawal, R., and Srikant, R. 1994. Fast algorithms for mining association rules. Proc. 20th VLDB 487–499.
Apostolico A.Comin M.Parida L.2009. VARUN: discovering extensible motifs under saturation constraintsIEEE Trans. Comput. Biol. Bioinformhttp://doi.ieeecomputersociety.org/10.1109/TCBB.2008.123. Apostolico, A., Comin, M., and Parida, L. 2009. VARUN: discovering extensible motifs under saturation constraints. IEEE Trans. Comput. Biol. Bioinform. http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.123.
Apostolico A.Parida L.2004. Incremental paradigms of motif discoveryJ. Comput. Biol.1115-25. Apostolico, A., and Parida, L. 2004. Incremental paradigms of motif discovery. J. Comput. Biol. 11, 15–25.
Apostolico A.Tagliacollo C.2007. Optimal offline extraction of irredundant motif basesLect. Notes Comput. Sci.4598360-371. Apostolico, A., and Tagliacollo, C. 2007. Optimal offline extraction of irredundant motif bases. Lect. Notes Comput. Sci. 4598, 360–371.
Apostolico A.Tagliacollo C.2008. Incremental discovery of the irredundant motif bases for all suffixes of a string in O (n2 log n) timeTheor. Comput. Sci.408106-115. Apostolico, A., and Tagliacollo, C. 2008. Incremental discovery of the irredundant motif bases for all suffixes of a string in O (n2 log n) time. Theor. Comput. Sci. 408, 106–115.
Arimura H.Uno T.2008. Mining maximal flexible patterns in a sequenceLect. Notes Comput. Sci.4914307-317. Arimura, H., and Uno, T. 2008. Mining maximal flexible patterns in a sequence. Lect. Notes Comput. Sci. 4914, 307–317.
Jurka J.Kapitonov V.V.Pavlicek A. et al.2005. Repbase update: a database of eukaryotic repetitive elementsCytogenet. Genome Res.110462-467. Jurka, J., Kapitonov, V.V., Pavlicek, A., et al. 2005. Repbase update: a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467.
Morris M.Nicolas F.Ukkonen E.2008. On the complexity of finding gapped motifsCoRR. Morris, M., Nicolas, F., and Ukkonen, E. 2008. On the complexity of finding gapped motifs. CoRR abs/0802.0314.
Parida L.2000. Some results on flexible-pattern discoveryLect. Notes Comput. Sci.184833-45. Parida, L. 2000. Some results on flexible-pattern discovery. Lect. Notes Comput. Sci. 1848, 33–45.
Parida L.2008Pattern Discovery in BioinformaticsChapman & Hall/CRCBoca Raton, FL. Parida, L. 2008. Pattern Discovery in Bioinformatics. Chapman & Hall/CRC, Boca Raton, FL.
Pisanti N.2002Segment-based distances and similarities in genomic sequences [Ph.D. dissertation]University of PisaItaly. Pisanti, N. 2002. Segment-based distances and similarities in genomic sequences [Ph.D. dissertation]. University of Pisa, Italy.
Pisanti N.Crochemore M.Grossi R. et al.2005. Bases of motifs for generating repeated patterns with wild cardsIEEE Trans. Comput. Biol. Bioinform.240-50. Pisanti, N., Crochemore, M., Grossi, R., et al. 2005. Bases of motifs for generating repeated patterns with wild cards. IEEE Trans. Comput. Biol. Bioinform. 2, 40–50.
Rigoutsos I.Floratos A.1998. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithmBioinformatics1455-67. Rigoutsos, I., and Floratos, A. 1998. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14, 55–67.
Saha S.Bridges S.Magbanua Z.V. et al.2008. Empirical comparison of ab initio repeat finding programsNucleic Acids Res.362284-2294. Saha, S., Bridges, S., Magbanua, Z.V., et al. 2008. Empirical comparison of ab initio repeat finding programs. Nucleic Acids Res. 36, 2284–2294.
Smit A.F.A.Hubley R.Green P.1996. RepeatMasker Open-3.0www.repeatmasker.orgJanuary152011. Smit, A.F.A., Hubley, R., and Green, P. 1996. RepeatMasker Open-3.0. Available at: www.repeatmasker.org. Accessed January 15, 2011.
Smith T.F.Waterman M.S.1981. Identification of common molecular subsequencesJ. Mol. Biol.147195-197. Smith, T.F., and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Ukkonen E.2007. Structural analysis of gapped motifs of a stringLect. Notes Comput. Sci.4708681-690. Ukkonen, E. 2007. Structural analysis of gapped motifs of a string. Lect. Notes Comput. Sci. 4708, 681–690.
Information & Authors
Information
Published In
Journal of Computational Biology
Volume 18 • Issue Number 4 • April 2011
Pages: 535 - 545
PubMed: 21417937
Copyright
Copyright 2011, Mary Ann Liebert, Inc.
History
Published online: 8 April 2011
Published in print: April 2011
Published ahead of print: 21 March 2011
Topics
Authors
Disclosure Statement
No competing financial interests exist.
Metrics & Citations
Metrics
Citations
Export Citation
Export citation
Select the format you want to export the citations of this publication.
View Options
Get Access
Access content
To read the fulltext, please use one of the options below to sign in or purchase access.⚠ Society Access
If you are a member of a society that has access to this content please log in via your society website and then return to this publication.