FASTQ compression tools comparison
Compression tools list
Open-source tools
General-purpose (non-FASTQ-specific) compression tools
- gzip - By far the most popular way to compress FASTQ files is simply to gzip them. Most bioinformatics tools accept gzipped files as input. gzip offers compression levels from 1 to 9, letting users trade off compression time against compression ratio. Level 1 is the fastest at compressing a given file, at the cost of a larger output. Level 9, on the other hand, should produce a smaller file at the cost of a longer compression time.
- pigz - pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data.
- xz - XZ Utils is free general-purpose data compression software with a high compression ratio. XZ Utils were written for POSIX-like systems, but also work on some not-so-POSIX systems. XZ Utils are the successor to LZMA Utils.
- zstd - Zstandard (zstd) is Facebook's fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios. It is backed by a very fast entropy stage, provided by the Huff0 and FSE library.
- bzip2 - bzip2 is a freely available, patent-free, high-quality data compressor. It typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), whilst being around twice as fast at compression and six times faster at decompression than those techniques.
- brotli - Brotli is Google's general-purpose lossless compression algorithm. It compresses data using a combination of a modern variant of the LZ77 algorithm, Huffman coding and second-order context modeling, with a compression ratio comparable to the best currently available general-purpose compression methods. It is similar in speed to deflate but offers denser compression.
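The time versus size trade-off described for gzip levels, and the difference between general-purpose compressors, can be checked on a small scale. The following is a minimal sketch that uses only the codecs shipped in the Python standard library (gzip, bz2, lzma) on synthetic FASTQ data. The helper names `make_fastq` and `compare_compressors` are hypothetical, the data is randomly generated rather than real sequencing output, and pigz, zstd and brotli are left out because they need external binaries or third-party packages. Sizes on real data will differ.

```python
import bz2
import gzip
import lzma
import random


def make_fastq(n_reads=500, read_len=100, seed=0):
    """Build a small synthetic FASTQ file in memory (illustrative data only)."""
    rng = random.Random(seed)
    # Reads are sampled from a short random "genome" so that they overlap.
    genome = "".join(rng.choice("ACGT") for _ in range(2000))
    records = []
    for i in range(n_reads):
        start = rng.randrange(len(genome) - read_len)
        seq = genome[start:start + read_len]
        qual = "".join(rng.choice("FFFF:,#") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def compare_compressors(data):
    """Return {label: compressed size in bytes} for several stdlib codecs."""
    codecs = {
        "gzip level 1": (lambda d: gzip.compress(d, compresslevel=1), gzip.decompress),
        "gzip level 9": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
        "bzip2 level 9": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
        "xz preset 6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
    }
    sizes = {}
    for label, (compress, decompress) in codecs.items():
        packed = compress(data)
        # All of these codecs are lossless, so the round trip must be exact.
        assert decompress(packed) == data
        sizes[label] = len(packed)
    return sizes


fastq = make_fastq()
sizes = compare_compressors(fastq)
print(f"raw: {len(fastq)} bytes")
for label, size in sizes.items():
    print(f"{label}: {size} bytes ({size / len(fastq):.1%} of raw)")
```

To compare speed as well as size, each `compress` call could be wrapped with `time.perf_counter()`; timings on such a small input are noisy, so a larger file gives a more meaningful picture.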
FASTQ-specific tools
- FaStore - The FaStore compressor offers both lossless and lossy compression. For lossy compression, it can alter both read identifiers and quality scores.
- SPRING - SPRING is a free, open-source compressor that offers lossless and lossy modes.
- NanoSpring - A tool for compression of nanopore genomic reads in FASTQ format (gzipped input is also supported). It compresses only the read sequences (i.e., it ignores quality values and read identifiers).
- FQSqueezer - FQSqueezer is an experimental high-end compressor of FASTQ files. The main goal of the tool is to offer the best possible compression ratio with running times that still allow it to be run even on WGS human datasets. FQSqueezer usually offers compression ratios tens of percent better than those of state-of-the-art tools such as FaStore, Minicom and Spring. The running times are, however, significantly longer.
- Repaq - repaq is a tool to compress FASTQ files with an ultra-high compression ratio and high speed. repaq supports compressing FASTQ to the .rfq or .rfq.xz formats. Compressing to .rfq is ultra fast, while compressing to .rfq.xz provides a very high compression ratio. For NovaSeq data, as an example: the .rfq file can be much smaller than .fq.gz, and the compression time is usually less than 1/5 of that of gzip compression. The .rfq.xz file can be as small as 5% of the original FASTQ file, or smaller than 30% of the .fq.gz file. For paired-end FASTQ files, repaq compresses them into one single file to provide a higher compression ratio. This tool also supports non-Illumina format FASTQ (e.g. the BGI-SEQ format), but the compression ratio is not as good as for Illumina format FASTQ.
- CoLoRd - A versatile compressor of third generation sequencing reads.
- FastqCLS - A robust FASTQ-specific compressor for recent generation data via score-based reordering. FastqCLS has no external dependencies, using only Linux commands and the general-purpose compressor zpaq, and is available on Docker Hub.
- RENANO - RENANO is a reference-based lossless FASTQ data compressor, specifically tailored to compress FASTQ files generated with nanopore sequencing technologies. RENANO improves on its state-of-the-art predecessor, ENANO, by providing a more efficient base call sequence compression component. Two compression algorithms are introduced, corresponding to the following scenarios: (1) a reference genome is available without cost to both the compressor and the decompressor; (2) the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed file.
- PgRC - Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream of FASTQ datasets, based on the idea of building an approximation of the shortest common superstring over high-quality reads.
- SCA-NGS - Secure compression algorithm for next generation sequencing data using genetic operators and block sorting.
- Minicom - Minicom is a tool for compressing short reads in FASTQ. The minicom program is written in C++11 and works on Linux. It is available under an open-source license. Note: Minicom only compresses the DNA sequences in the FASTQ file. It does not support compressing the whole FASTQ file.
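Several entries above note that a tool compresses only the read sequences, or treats identifiers and quality scores separately. The general idea of splitting a FASTQ file into per-field streams can be sketched as below. This is an illustration under stated assumptions, not the method of any tool in this list: the function names `split_streams`, `compress_streams` and `restore` are hypothetical, records are assumed to be exactly four lines with a bare `+` separator line, and xz (via the standard `lzma` module) stands in for the specialised models real tools use.

```python
import lzma


def split_streams(fastq_text):
    """Split 4-line FASTQ records into identifier, sequence and quality lists."""
    lines = fastq_text.splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError("expected a non-empty input with 4 lines per record")
    # Line 0 of each record is the identifier, 1 the bases, 3 the qualities.
    return lines[0::4], lines[1::4], lines[3::4]


def compress_streams(fastq_text):
    """Compress each stream on its own so similar data sits together."""
    ids, seqs, quals = split_streams(fastq_text)
    streams = {"ids": ids, "seqs": seqs, "quals": quals}
    return {
        name: lzma.compress("\n".join(values).encode("ascii"))
        for name, values in streams.items()
    }


def restore(packed):
    """Rebuild the FASTQ text from the three compressed streams."""
    ids, seqs, quals = (
        lzma.decompress(packed[name]).decode("ascii").split("\n")
        for name in ("ids", "seqs", "quals")
    )
    # Assumption: the separator line was a bare "+", so it is regenerated here.
    return "".join(f"{i}\n{s}\n+\n{q}\n" for i, s, q in zip(ids, seqs, quals))


example = "@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nF:FF\n"
packed = compress_streams(example)
print({name: len(blob) for name, blob in packed.items()})
print("lossless round trip:", restore(packed) == example)
```

A sequence-only mode, in the spirit of what is described for NanoSpring and Minicom, would correspond to keeping just `packed["seqs"]` and discarding the other two streams.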
Commercial solutions
- MZPAQ - a FASTQ data compression tool.
- Petagene - Petagene offers a suite of compression tools that can losslessly compress both fastq.gz files and .bam files. The software will even capture gzip settings so that md5sums of compressed->decompressed files will match those of their original.
- Genozip - Lossless compression of FASTQ, BAM, VCF. genozip git repo.
- Dragen - DRAGEN FASTQ Toolkit allows manipulation of FASTQ files, including adapter trimming, quality trimming, length filtering, format conversions and down-sampling. Acquired by Illumina in 2018.
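The Petagene entry mentions capturing gzip settings so that md5sums match after a round trip. The reason settings matter is that the same FASTQ content gzipped with different options produces different .gz bytes, and therefore different checksums, even though the decompressed data is identical. The sketch below illustrates this with the Python standard library only. It is not Petagene's implementation, the helper `md5_hex` is hypothetical, the data is synthetic, and the `mtime` argument of `gzip.compress` assumes Python 3.8 or later.

```python
import gzip
import hashlib
import random


def md5_hex(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


# Synthetic FASTQ records (illustrative data only).
rng = random.Random(0)
records = []
for i in range(200):
    seq = "".join(rng.choice("ACGT") for _ in range(80))
    qual = "".join(rng.choice("F:,#") for _ in range(80))
    records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
fastq = "".join(records).encode("ascii")

# mtime=0 fixes the timestamp in the gzip header so runs are reproducible.
fast = gzip.compress(fastq, compresslevel=1, mtime=0)
best = gzip.compress(fastq, compresslevel=9, mtime=0)

print("payload identical:", gzip.decompress(fast) == gzip.decompress(best))
print("md5 of level 1 .gz:", md5_hex(fast))
print("md5 of level 9 .gz:", md5_hex(best))

# Re-compressing with the recorded settings reproduces the original bytes.
again = gzip.compress(fastq, compresslevel=9, mtime=0)
print("same settings reproduce the file:", again == best)
```

A tool that wants a byte-identical .fastq.gz after decompression therefore has to remember, at minimum, the compression level and header fields used to create the original file.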