Research Article
No access
Published Online: 8 April 2011

MADMX: A Strategy for Maximal Dense Motif Extraction

Publication: Journal of Computational Biology
Volume 18, Issue Number 4

Abstract

We develop, analyze, and experiment with a new tool, called madmx, which extracts frequent motifs from biological sequences. We introduce the notion of density to single out the “significant” motifs. The density is a simple and flexible measure for bounding the number of don't cares in a motif, defined as the fraction of solid (i.e., different from don't care) characters in the motif. A maximal dense motif has density above a certain threshold, and any further specialization of a don't care symbol in it or any extension of its boundaries decreases its number of occurrences in the input sequence. By extracting only maximal dense motifs, madmx reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by madmx.

Get full access to this article

View all available purchase options and get full access to this article.

References

Agrawal R.Srikant R.1994. Fast algorithms for mining association rulesProc. 20th VLDB487-499. Agrawal, R., and Srikant, R. 1994. Fast algorithms for mining association rules. Proc. 20th VLDB 487–499.
Apostolico A.Comin M.Parida L.2009. VARUN: discovering extensible motifs under saturation constraintsIEEE Trans. Comput. Biol. Bioinformhttp://doi.ieeecomputersociety.org/10.1109/TCBB.2008.123. Apostolico, A., Comin, M., and Parida, L. 2009. VARUN: discovering extensible motifs under saturation constraints. IEEE Trans. Comput. Biol. Bioinform. http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.123.
Apostolico A.Parida L.2004. Incremental paradigms of motif discoveryJ. Comput. Biol.1115-25. Apostolico, A., and Parida, L. 2004. Incremental paradigms of motif discovery. J. Comput. Biol. 11, 15–25.
Apostolico A.Tagliacollo C.2007. Optimal offline extraction of irredundant motif basesLect. Notes Comput. Sci.4598360-371. Apostolico, A., and Tagliacollo, C. 2007. Optimal offline extraction of irredundant motif bases. Lect. Notes Comput. Sci. 4598, 360–371.
Apostolico A.Tagliacollo C.2008. Incremental discovery of the irredundant motif bases for all suffixes of a string in O (n2 log n) timeTheor. Comput. Sci.408106-115. Apostolico, A., and Tagliacollo, C. 2008. Incremental discovery of the irredundant motif bases for all suffixes of a string in O (n2 log n) time. Theor. Comput. Sci. 408, 106–115.
Arimura H.Uno T.2008. Mining maximal flexible patterns in a sequenceLect. Notes Comput. Sci.4914307-317. Arimura, H., and Uno, T. 2008. Mining maximal flexible patterns in a sequence. Lect. Notes Comput. Sci. 4914, 307–317.
Jurka J.Kapitonov V.V.Pavlicek A. et al.2005. Repbase update: a database of eukaryotic repetitive elementsCytogenet. Genome Res.110462-467. Jurka, J., Kapitonov, V.V., Pavlicek, A., et al. 2005. Repbase update: a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467.
Morris M.Nicolas F.Ukkonen E.2008. On the complexity of finding gapped motifsCoRR. Morris, M., Nicolas, F., and Ukkonen, E. 2008. On the complexity of finding gapped motifs. CoRR abs/0802.0314.
Parida L.2000. Some results on flexible-pattern discoveryLect. Notes Comput. Sci.184833-45. Parida, L. 2000. Some results on flexible-pattern discovery. Lect. Notes Comput. Sci. 1848, 33–45.
Parida L.2008Pattern Discovery in BioinformaticsChapman & Hall/CRCBoca Raton, FL. Parida, L. 2008. Pattern Discovery in Bioinformatics. Chapman & Hall/CRC, Boca Raton, FL.
Pisanti N.2002Segment-based distances and similarities in genomic sequences [Ph.D. dissertation]University of PisaItaly. Pisanti, N. 2002. Segment-based distances and similarities in genomic sequences [Ph.D. dissertation]. University of Pisa, Italy.
Pisanti N.Crochemore M.Grossi R. et al.2005. Bases of motifs for generating repeated patterns with wild cardsIEEE Trans. Comput. Biol. Bioinform.240-50. Pisanti, N., Crochemore, M., Grossi, R., et al. 2005. Bases of motifs for generating repeated patterns with wild cards. IEEE Trans. Comput. Biol. Bioinform. 2, 40–50.
Rigoutsos I.Floratos A.1998. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithmBioinformatics1455-67. Rigoutsos, I., and Floratos, A. 1998. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 14, 55–67.
Saha S.Bridges S.Magbanua Z.V. et al.2008. Empirical comparison of ab initio repeat finding programsNucleic Acids Res.362284-2294. Saha, S., Bridges, S., Magbanua, Z.V., et al. 2008. Empirical comparison of ab initio repeat finding programs. Nucleic Acids Res. 36, 2284–2294.
Smit A.F.A.Hubley R.Green P.1996. RepeatMasker Open-3.0www.repeatmasker.orgJanuary152011. Smit, A.F.A., Hubley, R., and Green, P. 1996. RepeatMasker Open-3.0. Available at: www.repeatmasker.org. Accessed January 15, 2011.
Smith T.F.Waterman M.S.1981. Identification of common molecular subsequencesJ. Mol. Biol.147195-197. Smith, T.F., and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Ukkonen E.2007. Structural analysis of gapped motifs of a stringLect. Notes Comput. Sci.4708681-690. Ukkonen, E. 2007. Structural analysis of gapped motifs of a string. Lect. Notes Comput. Sci. 4708, 681–690.

Information & Authors

Information

Published In

cover image Journal of Computational Biology
Journal of Computational Biology
Volume 18Issue Number 4April 2011
Pages: 535 - 545
PubMed: 21417937

History

Published online: 8 April 2011
Published in print: April 2011
Published ahead of print: 21 March 2011

Permissions

Request permissions for this article.

Topics

Authors

Affiliations

Roberto Grossi
Dipartimento di Informatica, Università di Pisa, Italy.
Andrea Pietracaprina
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Italy.
Nadia Pisanti
Dipartimento di Informatica, Università di Pisa, Italy.
Geppino Pucci
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Italy.
Eli Upfal
Department of Computer Science, Brown University, Providence, Rhode Island.
Fabio Vandin
Department of Computer Science, Brown University, Providence, Rhode Island.

Notes

Address correspondence to:Dr. Fabio VandinDepartment of Computer ScienceBrown UniversityProvidence, RI 02906E-mail: [email protected]

Disclosure Statement

No competing financial interests exist.

Metrics & Citations

Metrics

Citations

Export citation

Select the format you want to export the citations of this publication.

View Options

Get Access

Access content

To read the fulltext, please use one of the options below to sign in or purchase access.

Society Access

If you are a member of a society that has access to this content please log in via your society website and then return to this publication.

Restore your content access

Enter your email address to restore your content access:

Note: This functionality works only for purchases done as a guest. If you already have an account, log in to access the content to which you are entitled.

View options

PDF/EPUB

View PDF/ePub

Full Text

View Full Text

Media

Figures

Other

Tables

Share

Share

Copy the content Link

Share on social media

Back to Top