Research Article
Published Online: 4 October 2018

A Two-Level Scheme for Quality Score Compression

Publication: Journal of Computational Biology
Volume 25, Issue Number 10

Abstract

Previous studies on quality score compression fall into two main lines: lossy schemes and lossless schemes. Lossy schemes allow better management of computational resources, so in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data, in which case they must be kept, and lossless compression is then needed in addition to lossy compression. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows users to work with lossy quality scores in routine analysis without sacrificing the ability to recover the original quality scores when further investigation is required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are compressed separately, such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows users to extract the original quality scores. Experiments on real data reveal that downstream analysis with the lossy part, which spends only 0.49 bits per quality score on average, shows competitive performance, and that the total space usage, including the compressed second dimension, is comparable to that of competing lossless schemes.
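The two-level idea can be illustrated with a minimal sketch. The abstract does not specify the decomposition used by QScomp, so the example below substitutes a simple uniform binning as a stand-in: each Phred score is split into a coarse bin index (the first, lossy dimension) and a residual within the bin (the second dimension). The bin width, the function names, and the sample quality string are all hypothetical and chosen only for illustration; the entropy coding of each dimension is omitted.

```python
from typing import List, Tuple

BIN_WIDTH = 8  # hypothetical bin width; not taken from the paper


def decompose(q: int, width: int = BIN_WIDTH) -> Tuple[int, int]:
    """Split a quality score into (coarse bin index, residual within the bin)."""
    return divmod(q, width)


def lossy_value(coarse: int, width: int = BIN_WIDTH) -> int:
    """Representative score for a bin (its midpoint); needs only the first level."""
    return coarse * width + width // 2


def reconstruct(coarse: int, residual: int, width: int = BIN_WIDTH) -> int:
    """Recover the original score exactly from both levels."""
    return coarse * width + residual


def phred33_to_scores(quality_string: str) -> List[int]:
    """Decode a Sanger-style (Phred+33) FASTQ quality string into integers."""
    return [ord(c) - 33 for c in quality_string]


# Hypothetical quality string, used only to exercise the functions.
scores = phred33_to_scores("IIIIHHG@@5+#")
pairs = [decompose(q) for q in scores]

first_level = [c for c, _ in pairs]    # would be compressed as the lossy stream
second_level = [r for _, r in pairs]   # would be compressed separately

lossy = [lossy_value(c) for c in first_level]       # routine analysis
restored = [reconstruct(c, r) for c, r in pairs]    # deeper analysis

assert restored == scores
```

In this sketch a user who decodes only `first_level` gets scores that are off by at most half a bin, while a user who also decodes `second_level` recovers the input exactly. Whether QScomp's decomposition resembles binning at all is not stated in the abstract.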

Information & Authors

Information

Published In

Journal of Computational Biology
Volume 25, Issue Number 10, October 2018
Pages: 1141 - 1151
PubMed: 30059248

History

Published online: 4 October 2018
Published in print: October 2018
Published ahead of print: 30 July 2018

Authors

Affiliations

Jan Voges
Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany.
Ali Fotouhi
Electronics and Communication Engineering Department, Istanbul Technical University, Istanbul, Turkey.
Jörn Ostermann
Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany.
Muhammed Oğuzhan Külekci [email protected]
Informatics Institute, Istanbul Technical University, Istanbul, Turkey.

Notes

Address correspondence to:
Jan Voges, Leibniz Universität Hannover, Institut für Informationsverarbeitung, Appelstr. 9A, Hannover 30167, Germany
[email protected]
Assoc. Prof. Muhammed Oğuzhan Külekci, Informatics Institute, Istanbul Technical University, Istanbul 34469, Turkey
[email protected]

Author Disclosure Statement

The authors declare that no competing financial interests exist.
