Research Article
No access
Published Online: 16 February 2022

The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches

Publication: Journal of Computational Biology
Volume 29, Issue Number 2


k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.

Get full access to this article

View all available purchase options and get full access to this article.


Bankevich, A., Nurk, S., Antipov, D., et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477.
Broder, A.Z. 1997. On the resemblance and containment of documents, pp. 21–29. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171). IEEE, Salerno, Italy.
Brown, C.T., Olm, M.R., Thomas, B.C., and Banfield, J.F. 2016. Measurement of bacterial replication rates in microbial communities. Nat. Biotechnol. 34, 1256.
Brown, L.D., Cai, T.T., and DasGupta, A. 2001. Interval estimation for a binomial proportion. Stat. Sci. 16, 101–117.
Burden, C.J., Leopardi, P., and Forêt, S. 2014. The distribution of word matches between markovian sequences with periodic boundary conditions. J. Comput. Biol. 21, 41–63.
Casella, G., and Berger, R.L. 2002. Statistical Inference, Vol. 2. Duxbury Pacific Grove, CA.
Denti, L., Previtali, M., Bernardini, G., et al. 2019. MALVA: Genotyping by Mapping-free ALlele detection of known VAriants. iScience 18, 20–27.
Fan, H., Ives, A.R., Surget-Groba, Y., and Cannon, C.H. 2015. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16, 522.
Gusfield, D. 1997. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.
Harris, R.S., and Medvedev, P. 2020. Improved representation of sequence bloom trees. Bioinformatics 36, 721–727.
Haubold, B., Pfaffelhuber, P., Domazet-Loso, M., and Wiehe, T. 2009. Estimating mutation distances from unaligned genomes. J. Comput. Biol. 16, 1487–1500.
Hoeffding, W., Robbins, H., et al. 1948. The central limit theorem for dependent random variables. Duke Math. J. 15, 773–780.
Jain, C., Dilthey, A., Koren, S., et al. 2017. A fast approximate algorithm for mapping long reads to large reference databases, pp. 66–81. In: International Conference on Research in Computational Molecular Biology. Ed: S. Cenk Sahinalp. Springer, Hongkong.
Lander, E.S., and Waterman, M.S. 1988. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics 2, 231–239.
Li, H. 2018. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100.
Lu, Y.Y., Tang, K., Ren, J., et al. 2017. Cafe: Accelerated alignment-free sequence analysis. Nucleic Acids Res. 45(W1), W554–W559.
Miao, W., and Gastwirth, J.L. 2004. The effect of dependence on confidence intervals for a population proportion. Am. Stat. 58, 124–130.
Miclotte, G., Heydari, M., Demeester, P., et al. 2016. Jabba: Hybrid error correction for long sequencing reads. Algorithms Mol. Biol. 11, 1–12.
Morgenstern, B., Zhu, B., Horwege, S., and Leimeister, C.A. 2015. Estimating evolutionary distances between genomic sequences from spaced-word matches. Algorithms Mol. Biol. 10, 5.
Ondov, B.D., Starrett, G.J., Sappington, A., et al. 2019. Mash Screen: High-throughput sequence containment estimation for genome discovery. Genome Biol. 20, 232.
Ondov, B.D., Treangen, T.J., Melsted, P., et al. 2016. Mash: Fast genome and metagenome distance estimation using minhash. Genome Biol. 17, 132.
Reinert, G., Chew, D., Sun, F., and Waterman, M.S. 2009. Alignment-free sequence comparison (i): Statistics and power. J. Comput. Biol. 16, 1615–1634.
Röhling, S., Linne, A., Schellhorn, J., et al. 2020. The number of k-mer matches between two dna sequences as a function of k and applications to estimate phylogenetic distances. PLos One 15, e0228070.
Ross, N. 2011. Fundamentals of Stein's method. Probab. Surv. 8, 210–293.
Salmela, L., Walve, R., Rivals, E., and Ukkonen, E. 2017. Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics 33, 799–806.
Sarmashghi, S., Bohmann, K., Gilbert, M.T.P., et al. 2019. Skmer: Assembly-free and alignment-free sample identification using genome skims. Genome Biol. 20, 1–20.
Schwengers, O., Hain, T., Chakraborty, T., and Goesmann, A. 2019. Referenceseeker: Rapid determination of appropriate reference genomes. BioRxiv. Vol. 863621.
Solomon, B., and Kingsford, C. 2016. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302.
Song, K., Ren, J., Reinert, G., et al. 2014. New developments of alignment-free sequence comparison: Measures, statistics and next-generation sequencing. Briefings Bioinf. 15, 343–353.
Standage, D.S., Brown, C.T., and Hormozdiari, F. 2019. Kevlar: A mapping-free framework for accurate discovery of de novo variants. bioRxiv. Vol. 549154.
Sun, C., and Medvedev, P. 2018. Toward fast and accurate snp genotyping from whole genome sequencing data for bedside diagnostics. Bioinformatics 35, 415–420.
Tang, T., Liu, Y., Zhang, B., et al. 2019. Sketch distance-based clustering of chromosomes for large genome database compression. BMC Genomics 20, 1–9.
Wang, A., and Au, K.F. 2020. Performance difference of graph-based and alignment-based hybrid error correction methods for error-prone long reads. Genome Biol. 21, 14.
Wasserman, L. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, Berlin, Germany.
Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
Wood, D.E., and Salzberg, S.L. 2014. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46.
Wu, T.-J., Huang, Y.-H., and Li, L.-A. 2005. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics 21, 4125–4132.

Information & Authors


Published In

cover image Journal of Computational Biology
Journal of Computational Biology
Volume 29Issue Number 2February 2022
Pages: 155 - 168
PubMed: 35108101


Published online: 16 February 2022
Published in print: February 2022
Published ahead of print: 1 February 2022


Request permissions for this article.




Antonio Blanca
Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA.
Robert S. Harris
Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
David Koslicki
Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA.
Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA.
Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA.
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA.
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.


Address correspondence to: Prof. Paul Medvedev, Department of Computer Science and Engineering, The Pennsylvania State University, W205 Westgate Bldg., University Park, PA 16802, USA [email protected]
This is the full version of the article of the same title appearing in the proceedings of RECOMB 2021. A preprint of this full version appears in Authors are listed in alphabetical order.

Author Disclosure Statement

The authors declare they have no conflicting financial interests

Funding Information

P.M. was supported by NSF awards 1453527 and 1439057. A.B. was supported, in part, by the NSF grant CCF-1850443. This material is based upon work supported by the National Science Foundation under Grant No. 1664803.

Metrics & Citations



Export citation

Select the format you want to export the citations of this publication.

View Options

Get Access

Access content

To read the fulltext, please use one of the options below to sign in or purchase access.

Society Access

If you are a member of a society that has access to this content please log in via your society website and then return to this publication.

Restore your content access

Enter your email address to restore your content access:

Note: This functionality works only for purchases done as a guest. If you already have an account, log in to access the content to which you are entitled.

View options


View PDF/ePub

Full Text

View Full Text







Copy the content Link

Share on social media

Back to Top