The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches
Publication: Journal of Computational Biology
Volume 29, Issue Number 2
Abstract
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
Get full access to this article
View all available purchase options and get full access to this article.
REFERENCES
Bankevich, A., Nurk, S., Antipov, D., et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477.
Broder, A.Z. 1997. On the resemblance and containment of documents, pp. 21–29. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, Salerno, Italy.
Brown, C.T., Olm, M.R., Thomas, B.C., and Banfield, J.F. 2016. Measurement of bacterial replication rates in microbial communities. Nat. Biotechnol. 34, 1256.
Brown, L.D., Cai, T.T., and DasGupta, A. 2001. Interval estimation for a binomial proportion. Stat. Sci. 16, 101–117.
Burden, C.J., Leopardi, P., and Forêt, S. 2014. The distribution of word matches between markovian sequences with periodic boundary conditions. J. Comput. Biol. 21, 41–63.
Casella, G., and Berger, R.L. 2002. Statistical Inference, Vol. 2. Duxbury Pacific Grove, CA.
Denti, L., Previtali, M., Bernardini, G., et al. 2019. MALVA: Genotyping by Mapping-free ALlele detection of known VAriants. iScience 18, 20–27.
Fan, H., Ives, A.R., Surget-Groba, Y., and Cannon, C.H. 2015. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16, 522.
Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.
Harris, R.S., and Medvedev, P. 2020. Improved representation of sequence bloom trees. Bioinformatics 36, 721–727.
Haubold, B., Pfaffelhuber, P., Domazet-Loso, M., and Wiehe, T. 2009. Estimating mutation distances from unaligned genomes. J. Comput. Biol. 16, 1487–1500.
Hoeffding, W., Robbins, H., et al. 1948. The central limit theorem for dependent random variables. Duke Math. J. 15, 773–780.
Jain, C., Dilthey, A., Koren, S., et al. 2017. A fast approximate algorithm for mapping long reads to large reference databases, pp. 66–81. In: International Conference on Research in Computational Molecular Biology. Ed: S. Cenk Sahinalp. Springer, Hongkong.
Lander, E.S., and Waterman, M.S. 1988. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2, 231–239.
Li, H. 2018. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100.
Lu, Y.Y., Tang, K., Ren, J., et al. 2017. Cafe: Accelerated alignment-free sequence analysis. Nucleic Acids Res. 45(W1), W554–W559.
Miao, W., and Gastwirth, J.L. 2004. The effect of dependence on confidence intervals for a population proportion. Am. Stat. 58, 124–130.
Miclotte, G., Heydari, M., Demeester, P., et al. 2016. Jabba: Hybrid error correction for long sequencing reads. Algorithms Mol. Biol. 11, 1–12.
Morgenstern, B., Zhu, B., Horwege, S., and Leimeister, C.A. 2015. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol. Biol. 10, 5.
Ondov, B.D., Starrett, G.J., Sappington, A., et al. 2019. Mash Screen: High-throughput sequence containment estimation for genome discovery. Genome Biol. 20, 232.
Ondov, B.D., Treangen, T.J., Melsted, P., et al. 2016. Mash: Fast genome and metagenome distance estimation using minhash. Genome Biol. 17, 132.
Reinert, G., Chew, D., Sun, F., and Waterman, M.S. 2009. Alignment-free sequence comparison (i): Statistics and power. J. Comput. Biol. 16, 1615–1634.
Röhling, S., Linne, A., Schellhorn, J., et al. 2020. The number of k-mer matches between two dna sequences as a function of k and applications to estimate phylogenetic distances. PLos One 15, e0228070.
Ross, N. 2011. Fundamentals of Stein's method. Probab. Surv. 8, 210–293.
Salmela, L., Walve, R., Rivals, E., and Ukkonen, E. 2017. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 33, 799–806.
Sarmashghi, S., Bohmann, K., Gilbert, M.T.P., et al. 2019. Skmer: Assembly-free and alignment-free sample identification using genome skims. Genome Biol. 20, 1–20.
Schwengers, O., Hain, T., Chakraborty, T., and Goesmann, A. 2019. Referenceseeker: Rapid determination of appropriate reference genomes. BioRxiv. Vol. 863621.
Solomon, B., and Kingsford, C. 2016. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302.
Song, K., Ren, J., Reinert, G., et al. 2014. New developments of alignment-free sequence comparison: Measures, statistics and next-generation sequencing. Briefings Bioinf. 15, 343–353.
Standage, D.S., Brown, C.T., and Hormozdiari, F. 2019. Kevlar: A mapping-free framework for accurate discovery of de novo variants. bioRxiv. Vol. 549154.
Sun, C., and Medvedev, P. 2018. Toward fast and accurate snp genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics 35, 415–420.
Tang, T., Liu, Y., Zhang, B., et al. 2019. Sketch distance-based clustering of chromosomes for large genome database compression. BMC Genomics 20, 1–9.
Wang, A., and Au, K.F. 2020. Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads. Genome Biol. 21, 14.
Wasserman, L. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, Berlin, Germany.
Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
Wood, D.E., and Salzberg, S.L. 2014. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46.
Wu, T.-J., Huang, Y.-H., and Li, L.-A. 2005. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics 21, 4125–4132.
Information & Authors
Information
Published In
Journal of Computational Biology
Volume 29 • Issue Number 2 • February 2022
Pages: 155 - 168
PubMed: 35108101
Copyright
Copyright 2022, Mary Ann Liebert, Inc., publishers.
History
Published online: 16 February 2022
Published in print: February 2022
Published ahead of print: 1 February 2022
Topics
Authors
Author Disclosure Statement
The authors declare they have no conflicting financial interests
Funding Information
P.M. was supported by NSF awards 1453527 and 1439057. A.B. was supported, in part, by the NSF grant CCF-1850443. This material is based upon work supported by the National Science Foundation under Grant No. 1664803.
Metrics & Citations
Metrics
Citations
Export Citation
Export citation
Select the format you want to export the citations of this publication.
View Options
Get Access
Access content
To read the fulltext, please use one of the options below to sign in or purchase access.⚠ Society Access
If you are a member of a society that has access to this content please log in via your society website and then return to this publication.