File Formats for Storing Genomic Data

Modern genomics relies on high-throughput, next-generation sequencing (NGS) technologies, which generate massive volumes of biological data. Efficient storage, processing, and exchange of this data require specialized file formats, each optimized for a specific stage of analysis.

At first glance, the genomic ecosystem may appear fragmented: dozens of formats, some text-based, some binary, some domain-specific. However, these formats are not arbitrary. They form a well-defined hierarchy that closely mirrors the standard data-processing pipelines used in genomics.

In most genomic workflows, data passes through several conceptual stages:

Stage 0: Raw sequence data
Immediately after sequencing, data is represented as raw nucleotide sequences, with or without base-level quality scores. Regardless of whether the experiment involves bulk DNA sequencing, RNA-seq, single-cell, or spatial transcriptomics, the initial output is always conceptually similar. These data are stored in FASTA or FASTQ formats.

Stage 1: Read alignment
Raw reads are then aligned to a reference genome. This process produces alignment-based formats such as SAM, BAM, or CRAM, which describe how each read maps to genomic coordinates and how reliable that alignment is.

Stage 2: Derived biological representations
Aligned reads enable higher-level biological interpretations. Variant calling produces VCF or BCF files, genome annotation relies on formats such as GFF, GTF, or GVF, and single-cell or spatial transcriptomics workflows often transform alignment data into matrix-based formats such as h5ad.

Stage 3: Specialized and auxiliary formats
Some formats do not fit neatly into this linear pipeline but serve important complementary roles. These include consumer genotyping outputs (e.g., 23andMe), cloud-native array storage formats (Zarr), visualization and browser-specific formats (TRACK), and platform-specific internal formats such as Illumina LOC files.

Understanding this hierarchy reveals the underlying structure of genomic data processing and clarifies why so many formats exist. In the following sections, we examine each format individually while keeping its role in the broader analytical workflow in mind.

1. FASTA

The FASTA format is one of the most fundamental and widely used formats in bioinformatics. It is designed to store nucleotide or amino acid sequences without any associated quality information.

A FASTA file consists of one or more records, each of which begins with a header line indicated by the > symbol, followed by a sequence of characters representing a biological sequence.

The greater-than symbol (>) must appear in the first column of the header line. The word following it, up to the first whitespace, is the sequence identifier; the rest of the line is an optional free-text description.

>sequence_1 Homo_sapiens chromosome_1
ATGCTAGCTAGCTACGATCGATCGATCG
GCTAGCTAGCTAGCATCGATCGATCGA

The lines after the header contain the biological sequence itself. Typically, lines in FASTA format are limited to 80–120 characters in length (for historical reasons), but modern programs can recognize sequences written entirely on a single line. A single file may contain multiple sequences.
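The record structure described above can be sketched as a small parser. This is a minimal illustration in plain Python, not a replacement for an established library; the function name `parse_fasta` is hypothetical, and the sketch assumes the whole input fits in memory.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence identifier to its full, unwrapped sequence."""
    parts: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            # The identifier runs from '>' to the first whitespace.
            header = line[1:].split(None, 1)
            current = header[0] if header else ""
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(chunks) for name, chunks in parts.items()}


example = [
    ">sequence_1 Homo_sapiens chromosome_1",
    "ATGCTAGCTAGCTACGATCGATCGATCG",
    "GCTAGCTAGCTAGCATCGATCGATCGA",
]
sequences = parse_fasta(example)
print(sequences["sequence_1"])
```

Because wrapped lines are joined, the result is the same whether the sequence was written on one line or split across many.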

2. FASTQ

FASTQ is a standard format for storing raw sequencing data generated using NGS platforms. Unlike FASTA, it includes not only the nucleotide sequence but also quality information for each base.

Each FASTQ record consists of four lines: a read identifier beginning with @, the sequence, a separator line beginning with +, and a quality string with one ASCII character per base, encoded on the Phred scale. A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so the reliability of each nucleotide can be assessed quantitatively.

@sequence_1
ATGCTAGCTAGCTACGATCG
+
IIIIIIIIIIIIIIIIIIII

FASTQ is the primary input format for most bioinformatics pipelines, including quality control, filtering, alignment, and assembly. A significant drawback of the format is its large size, which requires substantial computational and disk resources.
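The sketch below decodes the four-line record shown earlier in this section. It assumes the Phred+33 encoding (ASCII code minus 33), which is the convention used by Sanger and current Illumina output; older files may use a different offset. The names `parse_fastq` and `error_probability` are illustrative.

```python
from typing import Iterator, List, Sequence, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def parse_fastq(lines: Sequence[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line record."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield header[1:], seq, scores


def error_probability(q: int) -> float:
    """Convert a Phred score to the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


record = ["@sequence_1", "ATGCTAGCTAGCTACGATCG", "+", "IIIIIIIIIIIIIIIIIIII"]
read_id, seq, scores = next(parse_fastq(record))
print(read_id, scores[0], error_probability(scores[0]))
```

The character `I` has ASCII code 73, so every base in this record has a Phred score of 40.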

3. SAM

The SAM (Sequence Alignment/Map) format is a text-based format used to store the results of aligning sequenced reads to a reference sequence. It was developed as a universal and human-readable standard for representing alignment data and serves as the foundation for the binary BAM format.

The SAM format is used to represent information on how sequences generated by next-generation sequencing (NGS) technologies correspond to a reference genome. It is primarily applied at the stages of debugging, analysis, and interpretation of alignment data, as well as for data exchange between various bioinformatics tools.

A SAM file is a plain text file and consists of two main parts: a header and an alignment section.

The header of a SAM file contains lines starting with the @ symbol and includes metadata about the data format, reference sequences, alignment parameters, read groups, and the software tools used in the analysis. Although the header is optional, its presence significantly improves the correctness and reproducibility of the analysis.

@HD	VN:1.6	SO:coordinate
@SQ	SN:chr1	LN:248956422
@PG	ID:bwa	PN:bwa	VN:0.7.17

read_001	0	chr1	100	255	10M	*	0	0	ATGCTAGCTA	IIIIIIIIII

The main part of a SAM file consists of lines, each corresponding to a single alignment of a sequenced read. Each record contains eleven mandatory fields separated by tab characters: the read identifier, a bitwise flag, the reference sequence name, the 1-based alignment position, the mapping quality score, the CIGAR string, three fields describing the paired read (its reference name, its position, and the template length), the nucleotide sequence, and base quality scores. In addition, optional tags may follow, providing extended annotation information.
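As a rough illustration of how these fields are consumed, the sketch below splits one alignment line, tests two bits of the flag, and uses the CIGAR string to find where the alignment ends on the reference. The class and function names are hypothetical; production code would normally use an established library instead.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SamRecord:
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: only M, D, N, = and X consume the reference."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")


line = "read_001\t0\tchr1\t100\t255\t10M\t*\t0\t0\tATGCTAGCTA\tIIIIIIIIII"
rec = parse_sam_line(line)
is_unmapped = bool(rec.flag & 0x4)   # flag bit 0x4: read is unmapped
is_reverse = bool(rec.flag & 0x10)   # flag bit 0x10: read is on the reverse strand
end = rec.pos + reference_span(rec.cigar) - 1
print(rec.rname, rec.pos, end, is_unmapped, is_reverse)
```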

The key drawback of the SAM format is its large file size compared to binary formats.

Although BAM is often described as a "compressed SAM file," this description is somewhat misleading.

BAM is not created by simply applying a generic compression algorithm (such as ZIP or gzip) to a SAM file. Instead, BAM uses a specialized format called BGZF (Blocked GNU Zip Format). BGZF is based on the standard DEFLATE compression algorithm (the same algorithm used by gzip), but it introduces a block-based structure that enables efficient random access.

This design allows BAM files to be indexed and queried by genomic coordinates, which would not be possible with a naively compressed text file. As a result, BAM achieves both compact storage and high-performance data access, properties that are essential for large-scale genomic analyses.
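The block structure can be seen in the smallest possible BGZF block: the 28-byte end-of-file marker that the SAM specification defines as the last block of a BAM file. The sketch below reads the block size from the gzip extra field and confirms that the block is still an ordinary gzip member. The helper name `bgzf_block_size` is hypothetical.

```python
import gzip
import struct

# The BGZF end-of-file marker: an empty BGZF block.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"  # gzip header, FEXTRA flag set, XLEN=6
    "42430200" "1b00"                    # subfield 'B','C', SLEN=2, BSIZE=27
    "0300" "00000000" "00000000"         # empty DEFLATE stream, CRC32, ISIZE
)


def bgzf_block_size(block: bytes) -> int:
    """Return the total size in bytes of the BGZF block starting at offset 0."""
    id1, id2, method, flags, _mtime, _xfl, _os, xlen = struct.unpack_from(
        "<BBBBIBBH", block, 0
    )
    if (id1, id2, method) != (0x1F, 0x8B, 8) or not flags & 0x04:
        raise ValueError("not a gzip member with an extra field")
    offset, end = 12, 12 + xlen
    while offset < end:
        si1, si2, slen = struct.unpack_from("<BBH", block, offset)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            (bsize,) = struct.unpack_from("<H", block, offset + 4)
            return bsize + 1  # BSIZE stores the block size minus one
        offset += 4 + slen
    raise ValueError("no BGZF 'BC' subfield found")


print(bgzf_block_size(BGZF_EOF))   # size of this block in bytes
print(gzip.decompress(BGZF_EOF))   # a standard gzip reader accepts the block
```

Because every block records its own compressed size, a reader can jump to a block boundary recorded in an index and start decompressing there, without touching the rest of the file.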

4. BAM

The BAM (Binary Alignment/Map) format is a binary representation of the SAM format and is widely used in bioinformatics for storing the results of aligning next-generation sequencing (NGS) data to a reference genome.

The BAM format is used to store information about the positioning of sequenced reads relative to a reference sequence. It is applied after the alignment stage of data obtained in FASTQ format, using specialized algorithms and software tools such as BWA, Bowtie2, or STAR.

A BAM file has a binary structure and logically consists of two main components: a header and alignment records. The header contains metadata about the file and alignment parameters, including information on reference sequences, the format version, the software tools used, and read groups. The presence of a correct header is essential for proper data interpretation.

For efficient processing, BAM files are typically used together with index files in .bai or .csi formats. Indexing enables fast access to data corresponding to specific genomic regions without the need to read the entire file sequentially.
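One ingredient of the .bai index is a binning scheme: each alignment is assigned to the smallest bin that fully contains it, so a region query only needs to inspect a handful of bins. The function below transcribes the `reg2bin` routine given in the SAM specification into Python; treat it as an illustration rather than a reference implementation. Coordinates are zero-based and half-open.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the BAI bin number for the interval [beg, end)."""
    end -= 1  # switch to the last covered position
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kb bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kb bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mb bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mb bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mb bins
    return 0                                       # the single 512 Mb bin


# A read inside one 16 kb window lands in a leaf bin;
# a read that straddles a 16 kb boundary moves up one level.
print(reg2bin(100, 200), reg2bin(16_000, 17_000))
```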

One of the main advantages of the BAM format is its compactness.

5. CRAM

The CRAM (Compressed Reference-oriented Alignment Map) format is a compressed binary format designed for storing alignment results of next-generation sequencing (NGS) data. It was developed as a more storage-efficient alternative to the BAM format and is based on the use of a reference sequence for data compression.

CRAM is especially valuable in projects where minimizing storage requirements is critical, such as biobanks, clinical studies, and large genomic consortia.

A CRAM file has a binary structure and logically consists of a header and data blocks. The header contains metadata about the format, reference sequences, alignment parameters, and the software tools used.

The key difference between CRAM and BAM is that read bases are stored as differences relative to the reference genome rather than in full. Base quality scores and other fields are not reference-encoded; they are grouped by type and compressed with codecs suited to each kind of data. Together, these techniques enable highly efficient data compression.

A defining feature of the CRAM format is therefore reference-based compression. To correctly read and decode a CRAM file, the same reference sequence used during file creation is required. Tools typically verify this using the MD5 checksums of the reference sequences recorded in the header, because decoding against a different reference would reconstruct incorrect read sequences.

The main advantage of CRAM is its substantially reduced file size compared to BAM, which lowers storage requirements and facilitates faster data transfer. The format also supports indexing, enabling rapid access to specific genomic regions.
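The idea behind reference-based compression can be shown with a deliberately simplified toy. This is not the CRAM codec: it handles substitutions only and ignores indels, quality scores, and entropy coding. All names are hypothetical.

```python
from typing import List, Tuple

Diff = Tuple[int, str]  # (offset within the read, substituted base)


def encode_read(reference: str, pos: int, read: str) -> Tuple[int, int, List[Diff]]:
    """Store a read as its position, length, and mismatches against the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_read(reference: str, pos: int, length: int, diffs: List[Diff]) -> str:
    """Rebuild the read by copying the reference and re-applying the mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ATGCTAGCTAGCTACGATCG"
read = "TAGGTAGC"  # equals reference[4:12] except for one substitution
encoded = encode_read(reference, 4, read)
restored = decode_read(reference, *encoded)
# Decoding against a different reference yields a different read.
wrong = decode_read("A" * len(reference), *encoded)
print(encoded, restored, wrong)
```

An eight-base read is reduced to a position, a length, and one mismatch, which is why well-aligned data compresses so well and why the original reference is indispensable for decoding.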

6. VCF

The VCF (Variant Call Format) is used to store information about genetic variants identified during sequencing data analysis. It is designed to represent single nucleotide polymorphisms (SNPs), insertions and deletions (indels), as well as more complex structural variants.

A VCF file is text-based and consists of a header and a data section. The header contains metadata describing the format, analysis parameters, and annotations. The main section includes variant coordinates, reference and alternative alleles, quality scores, filter status, and genotype information for one or more samples.

##fileformat=VCFv4.2
##source=FreeBayes
##reference=hg38
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
chr1	105	rs123456	A	G	99	PASS	DP=42

VCF is the standard output format for variant calling pipelines and is widely used in population genetics, clinical genomics, and genome-wide association studies.
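A minimal reader for the header and data section described above might look like the following sketch. It assumes tab-separated columns, keeps INFO values as strings, and does not interpret per-sample genotype columns; the function names are illustrative.

```python
from typing import Dict, Iterable, Iterator


def parse_info(info: str) -> Dict[str, object]:
    """Split the INFO column into key/value pairs; flags without '=' map to True."""
    result: Dict[str, object] = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_vcf(lines: Iterable[str]) -> Iterator[dict]:
    columns = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # the #CHROM header names the columns
            continue
        if columns is None:
            raise ValueError("data line found before the #CHROM header")
        record = dict(zip(columns, line.split("\t")))
        record["POS"] = int(record["POS"])
        record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
        record["INFO"] = parse_info(record["INFO"])
        yield record


vcf_text = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t105\trs123456\tA\tG\t99\tPASS\tDP=42",
]
variants = list(parse_vcf(vcf_text))
print(variants[0]["CHROM"], variants[0]["POS"], variants[0]["INFO"])
```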

BCF follows the same conceptual logic as BAM in relation to SAM. While VCF is a human-readable, text-based format, BCF encodes the same information using a binary representation.

This binary encoding significantly improves parsing speed and reduces file size, making BCF particularly useful for large cohort studies and high-throughput variant analyses. Importantly, BCF is not a compressed VCF file in the traditional sense; it is a structurally different representation optimized for computational efficiency.

7. BCF

The BCF (Binary Call Format) is the binary counterpart of the VCF format. It was developed to improve performance and reduce file size when handling large-scale variant datasets.

BCF preserves the logical structure and fields of VCF but encodes them in a binary form, enabling faster reading, writing, and processing. The format is widely used in tools such as samtools and bcftools.

8. GFF

GFF (General Feature Format) is a file format used for storing annotations of genes and other features of DNA, RNA, and protein sequences.

A GFF file is a text file in which each functional genomic element is represented by a single line. Each line contains nine fields separated by tab characters. This structure allows for efficient and straightforward extraction of the required data.

IV	curated	mRNA	5506800	5508917	.	+	.	Transcript B0273.1; Note "Zn-Finger"
IV	curated	5'UTR	5506800	5508999	.	+	.	Transcript B0273.1
IV	curated	exon	5506900	5506996	.	+	.	Transcript B0273.1
IV	curated	exon	5506026	5506382	.	+	.	Transcript B0273.1
IV	curated	exon	5506558	5506660	.	+	.	Transcript B0273.1
IV	curated	exon	5506738	5506852	.	+	.	Transcript B0273.1
IV	curated	3'UTR	5506852	5508917	.	+	.	Transcript B0273.1

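GFF is used to store the results of gene prediction or experimental determination of genes, as well as more complex functional elements of the genome. The example above uses the older GFF2 attribute style in the ninth column; GFF3, the current version, writes attributes as tag=value pairs.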

9. GTF

The GTF (Gene Transfer Format) is a specialized variant of GFF and is primarily used to represent gene and transcript structures. It is widely applied in RNA-seq analyses.

GTF provides detailed hierarchical information about genes, transcripts, and exons, making it suitable for expression quantification.

chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "2";
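The gene, transcript, and exon hierarchy lives in the ninth column, so most GTF processing starts by parsing its attributes. The sketch below assumes tab-separated columns and quoted attribute values, as in the example above; real files may also contain unquoted numeric values or repeated keys, which this simplified pattern does not handle. The function name is hypothetical.

```python
import re
from typing import Dict

# Matches pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line: str) -> Dict[str, object]:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GTF line needs exactly 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": dict(ATTRIBUTE.findall(attrs)),
    }


line = (
    "chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\t"
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";'
)
exon = parse_gtf_line(line)
length = exon["end"] - exon["start"] + 1  # both ends are inclusive
print(exon["attributes"]["transcript_id"], length)
```

Grouping exon records by `transcript_id`, and transcripts by `gene_id`, reconstructs the hierarchy that expression quantification tools rely on.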


10. GVF

The GVF (Genome Variation Format) is an extension of GFF designed to store genome variation data with annotations.

GVF enables the description of variant effects, mutation types, and functional impact within the context of genomic annotations.

11. BED

BED (Browser Extensible Data) format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.

# chrom  chromStart  chromEnd  name  score  strand  thickStart  thickEnd  itemRgb  blockCount  blockSizes  blockStarts
chr1 1000 2000 GeneA 100 + 1050 1950 0,0,255 5 100,200,300,100,100 0,150,400,750,900
chr1 3000 4000 GeneB 50 - 3100 3900 255,0,0 3 150,200,150 0,400,850
chr2 5000 5500 PeakC 200 . 5000 5500 0,0,0 1 500 0

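A BED file consists of lines, each describing a single genomic interval. Coordinates are zero-based and half-open: chromStart is the first base of the interval counted from 0, and chromEnd is the position just past its last base.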

The BED structure allows for compact and clear representation of genomic coordinate data and enables visualization in genome browsers such as the UCSC Genome Browser.
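The sketch below reads a 12-field BED line, expands its blocks into absolute coordinates, and converts the interval to the 1-based, inclusive convention used by formats such as GFF and VCF. It assumes the UCSC convention of zero-based, half-open BED coordinates; the example line and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_bed(line: str) -> Dict[str, object]:
    f = line.split()
    if len(f) < 3:
        raise ValueError("BED needs at least chrom, chromStart and chromEnd")
    start, end = int(f[1]), int(f[2])
    record: Dict[str, object] = {"chrom": f[0], "start": start, "end": end}
    if len(f) >= 12:
        count = int(f[9])
        sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        starts = [int(x) for x in f[11].rstrip(",").split(",")]
        if not (len(sizes) == len(starts) == count):
            raise ValueError("blockCount does not match blockSizes/blockStarts")
        if starts[-1] + sizes[-1] != end - start:
            raise ValueError("the last block must end at chromEnd")
        # blockStarts are offsets relative to chromStart.
        blocks: List[Tuple[int, int]] = [
            (start + s, start + s + n) for s, n in zip(starts, sizes)
        ]
        record["blocks"] = blocks
    return record


def to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open interval to 1-based inclusive coordinates."""
    return start + 1, end


rec = parse_bed("chr1 1000 2000 GeneA 100 + 1050 1950 0,0,255 2 300,200 0,800")
print(rec["blocks"], to_one_based(rec["start"], rec["end"]))
```

With half-open coordinates the interval length is simply `end - start`, which is one reason the convention is popular for interval arithmetic.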

12. TRACK

The TRACK format is used in the UCSC Genome Browser to define and customize user visualization tracks for genomic data. This format allows researchers to specify display parameters such as color, name, display type, and other properties, providing a clear visual representation of genomic datasets.

A TRACK file is a text file that begins with a track directive, which sets metadata for the track, followed by the actual data associated with a chosen format (e.g., BED, WIG, GFF).

track type=bedGraph name="Coverage" description="Read Depth" visibility=2 color=0,0,255 altColor=255,0,0
chr1 10000 10100 10.5
chr1 10100 10200 12.0
chr1 10200 10300 8.1
chr1 10300 10400 15.3
chrX 500000 500100 25.0
chrX 500100 500200 28.2

Following the track directive are lines containing coordinate data and additional parameters defined by the data format (BED, WIG, GFF). The TRACK format itself does not store the data but serves as a visualization controller.
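Since the track directive is a list of key=value settings with optional quoting, it can be split with the standard library's shell-style tokenizer. This is a sketch for illustration; the function name is hypothetical, and it assumes every setting has the form key=value.

```python
import shlex
from typing import Dict


def parse_track_line(line: str) -> Dict[str, str]:
    """Turn a UCSC track directive into a dictionary of display settings."""
    tokens = shlex.split(line)  # honours quotes, so "Read Depth" stays one value
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track line")
    settings: Dict[str, str] = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        settings[key] = value
    return settings


settings = parse_track_line(
    'track name="Coverage" description="Read Depth" visibility=2 '
    "color=0,0,255 altColor=255,0,0"
)
red, green, blue = (int(c) for c in settings["color"].split(","))
print(settings["description"], (red, green, blue))
```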

13. 23andMe

The 23andMe format is a tab-delimited text format used to store consumer-level genotyping results. This format is employed by commercial services such as 23andMe to provide users with information about their single nucleotide polymorphisms (SNPs) and other genetic variants.

# rsid	chromosome	position	genotype
rs4477212	1	82154	AA
rs3094315	1	752566	AG

Files may also include additional metadata about the platform version or reference genome build used for genotyping.
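Reading such a file takes only a few lines. The sketch below skips the comment lines that begin with #, splits the four tab-separated columns, and flags heterozygous calls. It assumes that uncalled genotypes are written with dashes, which should be checked against the actual export; the function names are illustrative.

```python
from typing import Iterable, Iterator, Tuple


def parse_23andme(lines: Iterable[str]) -> Iterator[Tuple[str, str, int, str]]:
    """Yield (rsid, chromosome, position, genotype) for each data line."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # comments carry the column header and metadata
        rsid, chrom, pos, genotype = line.split("\t")
        yield rsid, chrom, int(pos), genotype


def is_heterozygous(genotype: str) -> bool:
    # Assumption: no-calls contain '-' and are never counted as heterozygous.
    return len(genotype) == 2 and "-" not in genotype and genotype[0] != genotype[1]


raw_lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs4477212\t1\t82154\tAA",
    "rs3094315\t1\t752566\tAG",
]
calls = list(parse_23andme(raw_lines))
het = [rsid for rsid, _chrom, _pos, gt in calls if is_heterozygous(gt)]
print(len(calls), het)
```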

14. h5ad

The h5ad format is a binary format based on HDF5 and is used for storing single-cell transcriptomics data. It is the on-disk representation of the AnnData object from the Python ecosystem and is tightly integrated with the Scanpy library, which is designed for analysis and visualization of large single-cell RNA sequencing datasets.

h5ad enables storage not only of the gene expression matrix but also of cell and gene metadata, clustering results, low-dimensional embeddings (e.g., UMAP or t-SNE), and annotations, making it a comprehensive format for all stages of single-cell transcriptomics analysis.

An h5ad file is binary and based on the hierarchical HDF5 structure, which supports storage of multidimensional arrays and attribute dictionaries.

This approach allows all necessary data for analysis to be stored in a single file, improving reproducibility and simplifying data sharing between researchers.

15. Zarr

The Zarr data format is an open, community-maintained format designed for efficient, scalable storage of large N-dimensional arrays. It stores data as compressed and chunked arrays in a format well-suited to parallel processing and cloud-native workflows.

Zarr is most commonly used through its Python library, which reads and writes these chunked arrays on local disk or in cloud object storage.
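The chunking idea can be illustrated without the Zarr library itself. The sketch below computes how many chunks an array is divided into and which chunk holds a given element. It assumes the dot-separated chunk keys used by default in the Zarr version 2 layout; other layouts use different separators. The array shape and function names are hypothetical.

```python
from math import ceil
from typing import Sequence, Tuple


def chunk_grid(shape: Sequence[int], chunks: Sequence[int]) -> Tuple[int, ...]:
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(ceil(n / c) for n, c in zip(shape, chunks))


def chunk_key(index: Sequence[int], chunks: Sequence[int], separator: str = ".") -> str:
    """Key of the chunk that stores the element at the given index."""
    return separator.join(str(i // c) for i, c in zip(index, chunks))


shape = (10_000, 2_500)   # hypothetical cells x genes matrix
chunks = (1_000, 1_000)   # each chunk is compressed and stored separately
grid = chunk_grid(shape, chunks)
key = chunk_key((4_321, 2_100), chunks)
print(grid, key)
```

Because each chunk is an independent object, workers can read or write different chunks in parallel, which is what makes the layout a good fit for object storage.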

16. Illumina LOC files

The LOC (Location) file format is used in Illumina sequencing platforms to store spatial information about cluster positions on the flowcell surface. These files are part of Illumina's internal data formats and are involved in the early stages of sequencing data processing.

LOC files contain X and Y coordinates for each cluster detected during image analysis. These coordinates are essential for mapping fluorescence signals to individual clusters during downstream image processing and base calling steps.

Typically, LOC files are generated after image analysis and before base calling. In earlier versions of Illumina software, cluster coordinates were stored in text-based _pos.txt files. Later versions replaced these with binary .locs files and their compressed variant, .clocs, which significantly reduce file size and improve processing performance.

LOC files are not intended for direct end-user analysis and are mainly consumed by internal Illumina tools such as RTA (Real-Time Analysis) and bcl2fastq. However, understanding their role and structure is important for bioinformaticians working with low-level sequencing data, developing custom pipelines, or investigating the internal workflows of Illumina sequencing systems.

Conclusion

Despite the apparent diversity of genomic file formats, they form a coherent and logically structured system aligned with the main stages of genomic data analysis. From raw sequencing reads to aligned data, annotated genomes, and high-level biological interpretations, each format reflects a specific computational and conceptual need.

This hierarchy not only simplifies data processing but also enables scalability, reproducibility, and interoperability across tools and research groups. At the same time, the continued development of formats such as CRAM and Zarr highlights an ongoing effort to improve storage efficiency, data transfer, and cloud-native analysis.

As sequencing technologies evolve and datasets continue to grow, file formats will remain a critical component of genomic infrastructure, balancing human readability, computational performance, and long-term data sustainability.