Research Article
No access
Published Online: 1 September 2016

On the Minimum Error Correction Problem for Haplotype Assembly in Diploid and Polyploid Genomes

Publication: Journal of Computational Biology
Volume 23, Issue Number 9

Abstract

In diploid genomes, haplotype assembly is the computational problem of reconstructing the two parental copies, called haplotypes, of each chromosome starting from sequencing reads, called fragments, possibly affected by sequencing errors. Minimum error correction (MEC) is a prominent computational problem for haplotype assembly and, given a set of fragments, aims at reconstructing the two haplotypes by applying the minimum number of base corrections. MEC is computationally hard to solve, but some approximation-based or fixed-parameter approaches have been proved capable of obtaining accurate results on real data. In this work, we expand the current characterization of the computational complexity of MEC from the approximation and the fixed-parameter tractability point of view. In particular, we show that MEC is not approximable within a constant factor, whereas it is approximable within a logarithmic factor in the size of the input. Furthermore, we answer open questions on the fixed-parameter tractability for parameters of classical or practical interest: the total number of corrections and the fragment length. In addition, we present a direct 2-approximation algorithm for a variant of the problem that has also been applied in the framework of clustering data. Finally, since polyploid genomes, such as those of plants and fishes, are composed of more than two copies of the chromosomes, we introduce a novel formulation of MEC, namely the k-ploid MEC problem, that extends the traditional problem to deal with polyploid genomes. We show that the novel formulation is still both computationally hard and hard to approximate. Nonetheless, from the parameterized point of view, we prove that the problem is tractable for parameters of practical interest such as the number of haplotypes and the coverage, or the number of haplotypes and the fragment length.

Get full access to this article

View all available purchase options and get full access to this article.

References

Aguiar D., and Istrail S. 2012. HapCompass: A fast cycle basis algorithm for accurate haplotype assembly of sequence data. J. Comput. Biol. 19, 577–590.
Aguiar D., and Istrail S. 2013. Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 29, 352–360.
Aldinucci M., Bracciali A., Marschall T., et al. 2014. High-performance haplotype assembly, 245–258. In Computational Intelligence Methods for Bioinformatics and Biostatistics—11th International Meeting. CIBB 2014, Cambridge, United Kingdom, June 26–28, 2014, Revised Selected Papers.
Ausiello G., Crescenzi P., Gambosi G., et al. 1999. Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer-Verlag, Berlin.
Bansal V., and Bafna V. 2008. HapCUT: An efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24, i153–i159.
Bao E., Jiang T., and Girke T. 2013. BRANCH: Boosting RNA-Seq assemblies with partial or related genomic sequences. Bioinformatics 29, 1250–1259.
Beerenwinkel N., Beretta S., Bonizzoni P., et al. 2015. Covering pairs in directed acyclic graphs. Comput. J. 58, 1673–1686.
Berger E., Yorukoglu D., Peng J., et al. 2014. Haptree: A novel bayesian framework for single individual polyplotyping using ngs data. PLoS Comput. Biol. 10:e1003502.
Bonizzoni P., Della Vedova G., Dondi R., et al. 2003. The haplotyping problem: An overview of computational models and solutions. J. Comput. Sci. Technol. 18, 675–688.
Bonizzoni P., Dondi R., Klau G.W., et al. 2015. On the fixed parameter tractability and approximability of the minimum error correction problem, 100–113. In 26th Annual Symposium on Combinatorial Pattern Matching (CPM), vol. 9133 of LNCS, Springer International Publishing, Zurich, Switzerland.
Browning B., and Browning S. 2008. Haplotypic analysis of Wellcome Trust case control consortium data. Hum. Genet. 123, 273–280.
Chen Z.Z., Deng F., and Wang L. 2013. Exact algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 29, 1938–45.
Cilibrasi R., Van Iersel L., Kelk S., et al. 2007. The complexity of the single individual SNP haplotyping problem. Algorithmica 49, 13–36.
Das S., and Vikalo H. 2015. SDhaP: Haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics 16, 260.
Dondi R. 2012. New results for the longest haplotype reconstruction problem. Discrete Appl. Math. 160, 1299–1310.
Downey R.G., and Fellows M.R. 2013. Fundamentals of Parameterized Complexity. Texts in Computer Science. Springer, London.
Duitama J., et al. 2012. Fosmid-based whole genome haplotyping of a HapMap trio child: Evaluation of single individual haplotyping techniques. Nucleic Acids Res. 40, 2041–2053.
Eriksson N., Pachter L., Mitsuya Y., et al. 2008. Viral population estimation using pyrosequencing. PLoS Comput. Biol. 4, e1000074.
Fouilhoux P., and Mahjoub A. 2012. Solving VLSI design and DNA sequencing problems using bipartization of graphs. Comput. Optim. Appl. 51, 749–781.
Fulkerson D.R. 1956. Note on Dilworth's decomposition theorem for partially ordered sets. Proc. Am. Math. Soc. 7, 701–702.
Garey M.R., and Johnson D.S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco.
Garg N., Vazirani V.V., and Yannakakis M. 1996. Approximate max-flow min-(multi) cut theorems and their applications. SIAM J. Comput. 25, 235–251.
Geraci F. 2010. A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem. Bioinformatics 26, 2217–2225.
Guo J., Gramm J., Hüffner F., et al. 2006. Compression-based fixed-parameter algorithms for feedback vertex set and edge bipartization. J. Comput. Syst. Sci. 72, 1386–1396.
Halldórsson B.V., Aguiar D., and Istrail S. 2011. Haplotype phasing by multi-assembly of shared haplotypes: Phase-dependent interactions between rare variants, 88–99. In Pacific Symposium on Biocomputing. World Scientific Publishing.
He D., Choi A., Pipatsrisawat K., et al. 2010. Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 26, i183–i190.
Jiao Y., Xu J., and Li M. 2004. On the k-closest substring and k-consensus pattern problems, 130–144. In Combinatorial Pattern Matching, vol. 3109 of LNCS. Springer, Berlin, Heidelberg.
Khot S. 2002. On the power of unique 2-prover 1-round games, 767–775. In Proceeding of the Thirty-Four Annual ACM Symposium on theory of computing. ACM.
Kleinberg J., Papadimitriou C., and Raghavan P. 1998. Segmentation problems, 473–482. In Proceeding of the Thirties Annual ACM Symposium on theory of computing. ACM.
Lancia G., Bafna V., Istrail S., et al. 2001. SNPs problems, complexity, and algorithms, 182–193. In ESA, vol. 2161 of LNCS.
Lippert R., Schwartz R., Lancia G., et al. 2002. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Brief. Bioinform. 3, 23–31.
Meidanis J., Porto O., and Telles G.P. 1998. On the consecutive ones property. Discrete Appl. Math. 88, 325–354.
Ostrovsky R., and Rabani Y. 2002. Polynomial-time approximation schemes for geometric min-sum median clustering. J. ACM 49, 139–156.
Patterson M., Marschall T., Pisanti N., et al. 2014. WhatsHap: Haplotype assembly for future-generation sequencing reads, 237–249. In Sharan R., ed. RECOMB, vol. 8394 of LNCS. Springer International Publishing, Zurich, Switzerland.
Patterson M., Marschall T., Pisanti N., et al. 2015. WhatsHap: Weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 6, 498–509.
Pirola Y., Bonizzoni P., and Jiang T. 2012. An efficient algorithm for haplotype inference on pedigrees with recombinations and mutations. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 12–25.
Pirola Y., Della Vedova G., Bonizzoni P., et al. 2013. Haplotype-based prediction of gene alleles using pedigrees and SNP genotypes, 33–41. In Proceeding of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics. ACM.
Pirola Y., Zaccaria S., Dondi R., et al., 2015a. HapCol. Available at: http://hapcol.algolab.eu Accessed September 17, 2015
Pirola Y., Zaccaria S., Dondi R., et al., 2015b. HapCol: Accurate and memory-efficient haplotype assembly from long reads. Bioinformatics.
Reed B., Smith K., and Vetta A. 2004. Finding odd cycle transversals. Oper. Res. Lett. 32, 299–301.
Rizzi R., Tomescu A., and Mäkinen V. 2014. On the complexity of minimum path cover with subpath constraints for multi-assembly. BMC Bioinformatics 15, S5.
Song L., and Florea L. 2013. CLASS: Constrained transcript assembly of RNA-seq reads. BMC Bioinformatics 14, 1–8.
Trapnell C., Williams B.A., Pertea G., et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 516–520.
Wang R.S., Wu L.Y., Li Z.P., et al. 2005. Haplotype reconstruction from snp fragments by minimum error correction. Bioinformatics 21, 2456–2462.
Yannakakis M. 1978. Node-and edge-deletion NP-complete problems, 253–264. In Proc. Symp. Theory of Computing (STOC). ACM.

Information & Authors

Information

Published In

cover image Journal of Computational Biology
Journal of Computational Biology
Volume 23Issue Number 9September 2016
Pages: 718 - 736
PubMed: 27280382

History

Published in print: September 2016
Published online: 1 September 2016
Published ahead of print: 9 June 2016

Permissions

Request permissions for this article.

Topics

Authors

Affiliations

Paola Bonizzoni
Department of Computer Science (DISCO), University of Milano-Bicocca, Milan, Italy.
Riccardo Dondi
Department of Social and Human Sciences, University of Bergamo, Bergamo, Italy.
Gunnar W. Klau
Life Sciences Group, Centrum Wiskunde & Informatica (CWI), Amsterdam, The Netherlands.
ERABLE Team, INRIA, Lyon, France.
Yuri Pirola
Department of Computer Science (DISCO), University of Milano-Bicocca, Milan, Italy.
Nadia Pisanti
ERABLE Team, INRIA, Lyon, France.
Department of Computer Science, University of Pisa, Pisa, Italy.
Simone Zaccaria
Department of Computer Science (DISCO), University of Milano-Bicocca, Milan, Italy.

Notes

A preliminary version of this article partially appeared in Bonizzoni et al. (2015).
Address correspondence to:Dr. Paola BonizzoniDip. di Informatica Sistemistica e ComunicazioneUniv. degli Studi di Milano-BicoccaViale Sarca 336Milan 20126Italy
E-mail: [email protected]

Author Disclosure Statement

No competing financial interests exist.

Metrics & Citations

Metrics

Citations

Export citation

Select the format you want to export the citations of this publication.

View Options

Get Access

Access content

To read the fulltext, please use one of the options below to sign in or purchase access.

Society Access

If you are a member of a society that has access to this content please log in via your society website and then return to this publication.

Restore your content access

Enter your email address to restore your content access:

Note: This functionality works only for purchases done as a guest. If you already have an account, log in to access the content to which you are entitled.

View options

PDF/EPUB

View PDF/ePub

Full Text

View Full Text

Media

Figures

Other

Tables

Share

Share

Copy the content Link

Share on social media

Back to Top