
Abstract

The goal of this document is to introduce the reader to the current state of the field of genetic data compression, provide a high-level overview of the existing solutions on the market, and present our vision and the current state of our project.

Who is this article written for?

This article may be useful for technical specialists and organization managers involved in genome sequencing projects, especially those looking to optimize their data processing pipeline through more efficient genetic data compression.


Introduction to the problem of genetic data compression

Raw genome sequencing data files, namely .fastq files, are huge. The size of one uncompressed file can vary from tens of GB to almost 1 TB. A typical human whole-genome sequencing experiment can produce hundreds of GB of data in FASTQ files, with sequencing coverage often being 30x or higher. This leads to the problem of optimizing disk space for storing those files.
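A quick back-of-the-envelope calculation shows where these numbers come from. The genome length and overhead factor below are illustrative assumptions, not measurements:

```python
# Rough estimate of uncompressed FASTQ size for a human whole-genome
# sequencing run (illustrative numbers only).

GENOME_SIZE = 3.1e9   # approximate human genome length in base pairs
COVERAGE = 30         # typical sequencing depth (30x)

total_bases = GENOME_SIZE * COVERAGE

# Each FASTQ record stores the sequence AND a same-length quality string,
# plus a header and a separator line, so the file takes a bit more than
# 2 bytes per sequenced base (the 2.2 factor is an assumption).
BYTES_PER_BASE = 2.2

fastq_bytes = total_bases * BYTES_PER_BASE
print(f"~{fastq_bytes / 1e9:.0f} GB uncompressed")  # on the order of 200 GB
```

This is consistent with the "hundreds of GB" figure quoted above.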

General purpose compression tools

The most straightforward way to reduce file size is to use standard compression tools such as 7-Zip, gzip, or pigz. The last one is a parallel implementation of gzip and is probably the fastest option among general-purpose compression tools. In fact, the .fastq.gz format is still the industry standard for genetic data storage, and many bioinformatics tools accept it as input. However, general-purpose compression tools do not take into account the nature of .fasta/.fastq files, which in practice are highly redundant due to high coverage.
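Producing a standard .fastq.gz file requires nothing beyond a stock compressor; a minimal sketch using only the Python standard library (equivalent to running `gzip file.fastq`, while pigz produces the same format faster by compressing blocks in parallel):

```python
import gzip
import os
import shutil
import tempfile

# Compress a FASTQ file into the industry-standard .fastq.gz format.
def gzip_fastq(src: str, dst: str, level: int = 6) -> None:
    with open(src, "rb") as f_in, gzip.open(dst, "wb", compresslevel=level) as f_out:
        shutil.copyfileobj(f_in, f_out)

# Tiny demo on a synthetic, highly redundant FASTQ fragment.
record = b"@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.fastq")
    dst = src + ".gz"
    with open(src, "wb") as f:
        f.write(record * 1000)
    gzip_fastq(src, dst)
    original = os.path.getsize(src)
    compressed = os.path.getsize(dst)
    print(f"{original} -> {compressed} bytes")
```

The synthetic data compresses well precisely because it is repetitive, which is also why real high-coverage FASTQ files are good compression targets.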

Genetic data specialized tools

To exploit this redundancy, specialized compression tools have been developed. These tools use various properties of genetic data files to make compression more efficient. The first such specialized tools started to appear circa 2014 and have kept evolving ever since.

Here you can see a graph of the evolution of some open-source solutions we were able to identify during our market research.


  • Blue: developed for short reads (NGS).

  • Green: developed for long reads (TGS).

  • Square: reference-free.

  • Rhombus: both reference-free and reference-based modes available.

  • An arrow indicates that the authors of the paper claimed their tool performed better than the previous solution.

  • A solid line means the new solution was developed by the same team as the previous one.


It is worth noting that the tools represented on the graph above are not a complete list; there are more data compression projects, but some of them are obsolete (their GitHub repositories have not been updated for 5+ years) and are not of great interest. The full list of tools we found during our research includes around 15 solutions, which you can see in our compression tools overview list. We also put commercial solutions into a separate category.

Commercial solutions

Most of them are closed source and require a paid subscription, but in return you can expect better performance metrics and available developer support.

For example, you can have a look at PetaGene or Genozip.

Performance metrics and potential challenges

There are several metrics that should be kept in mind when choosing the right solution for your needs.

Compression ratio

Obviously, compression ratio is the most important metric for a compression algorithm. For .fastq file compression it can vary from about 2.3 bits/base for a general-purpose algorithm like gzip down to about 0.3 bits/base for highly specialized tools such as RENANO, CoLoRd, or NanoSpring.
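Bits per base is simply the compressed size in bits divided by the number of sequenced bases. The file sizes below are hypothetical numbers chosen to illustrate the two ends of the range quoted above:

```python
# Compression ratio for FASTQ data is often reported in bits per base:
# compressed size in bits divided by the number of sequenced bases.
def bits_per_base(compressed_bytes: int, num_bases: int) -> float:
    return compressed_bytes * 8 / num_bases

# Hypothetical example: a 90-gigabase run compressed to ~26 GB is roughly
# 2.3 bits/base (gzip territory), while ~3.4 GB is ~0.3 bits/base
# (specialized-tool territory).
print(bits_per_base(26_000_000_000, 90_000_000_000))  # ≈ 2.31
print(bits_per_base(3_400_000_000, 90_000_000_000))   # ≈ 0.30
```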

Compression speed

Compression speed is also an important metric for evaluating compression tools. It can depend on several factors: algorithm complexity, how well the algorithm parallelizes, and the available hardware, especially a sufficient amount of memory and the number of CPU cores/threads available for parallel computation.

Speed vs CR tradeoff

For some algorithms, compressing a human-genome .fastq file can take up to 4 hours. For example, open-source tools such as SPRING or CoLoRd prioritize compression ratio in order to get closer to the theoretical limit, which leads to poor speed. Others try to find a tradeoff between speed and CR; we think a good example of such a tool is Genozip. It does not reach the compression ratio of SPRING or CoLoRd, but it is still far better than general-purpose compression algorithms while remaining fast.

Other parameters to consider

  • Decompression speed.
    Decompression usually requires less time and fewer resources than compression.

  • Memory usage during compression and decompression; 32 GB is usually enough.

  • Ability to deal with PacBio and Oxford Nanopore long reads. With the appearance of new sequencing technologies such as PacBio and Oxford Nanopore, read lengths have become significantly larger (100k to 1M base pairs per read), which can be a problem for compression tools designed for short reads only.

  • Price. One of the main reasons for data compression is storage cost optimization, and everyone wants to save money, so price is also an important factor when choosing between solutions. Open-source solutions are free, but in many cases they lack proper support, so one might consider commercial solutions, some of which are more expensive than others.

  • Open/closed source. Licence. Code maintenance.

  • Reference free vs reference based.

  • Lossy/Lossless compression.

  • gzipped input support.

|  | Gzip | CoLoRd | NanoSpring | RENANO | Genozip | PetaGene |
| --- | --- | --- | --- | --- | --- | --- |
| Compression ratio | low | high | high | high | medium | No data |
| Compression speed | medium | slow (hours) | slow (hours) | slow (hours) | fast (tens of minutes) | No data |
| Decompression speed | high | low | low | low | high | No data |
| Memory usage |  |  |  |  |  |  |
| Long read support | yes | yes | yes | yes | yes | yes |
| Price | Free | Free, but no changes allowed (GPL licence) | Free | Free | $3500-$20000/year | No data |
| Source code access | Open source | Open source | Open source | Open source | Not open source, but source code is available | No source code available |
| Licence type | GPLv3 | GPLv3 | MIT | MIT | Custom licence | Commercial |
| Lossy/Lossless compression | Lossless | Lossy and lossless modes | Lossless | Lossless | Lossless | No data |
| gzipped input support | Not applicable | No | No | No | Yes | No data |
| VCF support | Yes | No | No | No | Yes | Yes |
| BAM support | Yes | No | No | No | Yes | Yes |
| Reference-free compression | Yes | Yes | Yes | Yes | Yes | No data |
| Reference-based compression | No | Yes | Yes | Yes | Yes | No data |

Our vision

We can see that the field of genetic data compression is evolving, with new solutions appearing every year. From our perspective, however, there is still a lot of room for improvement and competition, especially in finding the right balance between execution speed and compression ratio.

Our current solution

For now we have a short-read-only compression tool with a better CR and faster compression and decompression speed than the current industry standard. Our approach combines simple preprocessing heuristics with standard compression tools. Although this approach is less efficient in terms of compression ratio than tools such as CoLoRd, NanoSpring, or RENANO, we see this direction as the most promising due to the algorithm's linear complexity and its ability to utilize multiple cores. This makes it far faster (minutes instead of hours) while keeping the compression ratio at a decent level.
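To give a flavor of what "preprocessing plus a standard compressor" can mean (this is an illustration of one well-known heuristic, not our actual algorithm): a FASTQ file can be split into three homogeneous streams, headers, sequences, and quality strings, and each stream compressed independently. Homogeneous streams compress better than the interleaved file, the pass over the data is linear, and the streams can be handled on separate cores:

```python
import gzip

# Split FASTQ text into homogeneous streams. FASTQ records are 4 lines:
# header, sequence, "+" separator, quality string.
def split_streams(fastq_text: str):
    lines = fastq_text.strip().split("\n")
    headers = lines[0::4]   # @-lines
    seqs = lines[1::4]      # base calls
    quals = lines[3::4]     # quality strings
    return ["\n".join(s) for s in (headers, seqs, quals)]

sample = "@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGA\n+\nIIIIIIIH\n"
# Each stream could be compressed by a different core or a different codec.
compressed = [gzip.compress(s.encode()) for s in split_streams(sample)]
```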

Experiment

We conducted a comparison test on a small dataset between several tools, including our solution, which we called fastqpress.


Experiment result data

| Tool | initial_size [byte] | result_size [byte] | time_to_compress [sec] | time_to_decompress [sec] | coverage | CR |
| --- | --- | --- | --- | --- | --- | --- |
| fastqpress | 78544441 | 16327075 | 0.79 | 0.482 | 16 | 4.810686605 |
| fastqpress | 157185841 | 25535324 | 1.63 | 0.741 | 32 | 6.15562352 |
| fastqpress | 315420410 | 39648233 | 3.45 | 1.48 | 64 | 7.955472064 |
| genozip | 78544441 | 15354516 | 1.761 | 0.578 | 16 | 5.115396734 |
| genozip | 157185841 | 29662480 | 3.66 | 0.867 | 32 | 5.29914697 |
| genozip | 315420410 | 59240011 | 5.778 | 1.36 | 64 | 5.324448876 |
| colord | 78544441 | 7073538 | 6.384 | 1.262 | 16 | 11.10398234 |
| colord | 157185841 | 12443013 | 16.291 | 2.423 | 32 | 12.63245815 |
| colord | 315420410 | 22479791 | 41.947 | 4.185 | 64 | 14.03128748 |
| gzip | 78544441 | 22166272 | 10.104 | 0.383 | 16 | 3.543421329 |
| gzip | 157185841 | 44308116 | 20.281 | 0.768 | 32 | 3.547563182 |
| gzip | 315420410 | 88743282 | 39.807 | 1.494 | 64 | 3.554301834 |
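The CR column above is simply initial_size divided by result_size; recomputing the first fastqpress row as a sanity check:

```python
# Sanity check: CR = initial_size / result_size for the fastqpress 16x row.
initial_size = 78_544_441
result_size = 16_327_075
cr = initial_size / result_size
print(round(cr, 6))  # 4.810687, matching the table's 4.810686605
```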

Figure: dependency of compression ratio on sequencing coverage.


Figure: dependency of compression ratio on initial file size.

More benchmarks

more benchmarks

Collaboration proposal and feedback collection

We are still at an early stage of development and want to prioritize the features that are most valuable to our potential customers. We value every bit of information from industry experts and would be happy to discuss in detail your current issues in the genomic data processing pipeline. Please feel free to contact us at isachenkoa@gmal.com or subscribe to our email update list.