Complete Guide to Genomic Data Formats: FASTA, FASTQ, BAM, VCF & Beyond

Feb 14, 2026 · 16 min read · Bioinformatics, Data Science

The explosion of genomic data has given rise to a complex ecosystem of file formats, each designed for specific use cases in the sequencing pipeline. From raw sequence reads to annotated variant calls, understanding these formats is essential for bioinformaticians, computational biologists, and anyone working with genomic data. This guide provides a comprehensive overview of the most important formats in 2026.

FASTA Format

FASTA is the simplest and most widely supported sequence format. Each entry begins with a header line starting with ">" followed by an identifier and an optional description, then the sequence on one or more subsequent lines.

    >NM_001101.5 Homo sapiens actin beta (ACTB), mRNA
    ATGGATGATGATATCGCCGCGCTCGTCGTCGACAACGGCTCCGGCATGTGCAAAGCCGGC
    TTCGCGGGCGACGATGCCCCCCGGGCCGTCTTCCCCTCCATCGTGGGGCGCCCCAGGCAC
    CAGGGCGTGATGGTGGGCATGGGTCAGAAGGATTCCTATGTGGGCGACGAGGCCCAGAGC

FASTA is used for reference genomes, protein sequences, and any context where quality scores are not needed. Multi-FASTA files contain multiple sequences separated by header lines. The format supports DNA (A,T,G,C,N), RNA (A,U,G,C), and protein (20 standard amino acid codes) sequences.
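To make the header-plus-sequence layout concrete, here is a minimal Python sketch of a multi-FASTA reader. The function name `parse_fasta` and the two example records are hypothetical and chosen only for illustration; the sketch assumes well-formed input and does not validate the sequence alphabet.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated; wrapping width is irrelevant.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record multi-FASTA input.
example = """>seq1 hypothetical first record
ACGTAC
GTNN
>seq2 hypothetical second record
GGCC
"""

records = dict(parse_fasta(example.splitlines()))
for name, seq in records.items():
    print(name.split()[0], len(seq))
```

Because it is a generator, the same function can be fed an open file handle, so a large reference genome is read one record at a time. For production work an established parser is usually the safer choice.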

FASTQ Format

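FASTQ is the standard format for raw sequencing reads, extending FASTA with quality information. Each read is represented by four lines: a header starting with "@", the sequence, a separator line starting with "+", and a quality string with one character per base.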

    @ERR000589.1 HSQ1008:237:C1YG9:1:1101:1171:2088/1
    ACGTACGTACGTACGTACGTACGTACGTACGT
    +
    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

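Quality scores use Phred encoding, Q = -10 log10(P_error), so Q30 corresponds to a 1-in-1000 chance that a base call is wrong. Each score is stored as a single ASCII character. The Phred+33 offset (Sanger, Illumina 1.8+) is now the de facto standard, although older Illumina pipelines used Phred+64. Modern Illumina instruments typically produce paired-end FASTQ files (R1 and R2) with reads of roughly 150-300 bp and more than 85% of bases at Q30 or higher.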
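The Phred relationship can be worked through in a few lines of Python. This is a sketch that assumes Phred+33 encoding; the helper names and the example quality string are hypothetical.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred33_to_scores(quality: str) -> List[int]:
    """Convert a FASTQ quality string to integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P_error) to recover P_error."""
    return 10 ** (-q / 10)


def fraction_at_least(quality: str, threshold: int = 30) -> float:
    """Fraction of bases whose score is at or above the threshold."""
    scores = phred33_to_scores(quality)
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical quality string: four high-quality bases, then a decaying tail.
quality_line = "FFFF:,##"
scores = phred33_to_scores(quality_line)
q30_fraction = fraction_at_least(quality_line)

print(scores)
print(q30_fraction)
print(error_probability(30))
```

Here "F" is ASCII 70, which decodes to Q37, while "#" is ASCII 35, which decodes to Q2. Half of the bases in this made-up string reach Q30.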

SAM/BAM/CRAM

SAM (Sequence Alignment/Map) stores aligned reads against a reference genome. BAM is the binary compressed version, and CRAM provides even higher compression using reference-based encoding.

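Each alignment line has 11 mandatory tab-separated fields: read name, bitwise FLAG (paired, mapped, strand), reference name, mapping position, mapping quality (MAPQ), CIGAR string (alignment operations), mate information (mate reference, mate position, template length), sequence, and quality. Optional tags such as NM:i:0 follow these fields.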

    @HD  VN:1.6  SO:coordinate
    @SQ  SN:chr1  LN:248956422
    read001  0  chr1  10001  60  100M  *  0  0  ACGT...  FFFF...  NM:i:0
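The FLAG and CIGAR fields are the two that most often need decoding by hand. The Python sketch below shows one way to do it. The bit values follow the SAM specification, but the short names in `FLAG_BITS` and the function names are labels made up for this example, not official identifiers.

```python
import re
from typing import List, Tuple

# Bit values are from the SAM specification; the names are informal labels.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # CIGAR unavailable
    ops = CIGAR_OP.findall(cigar)
    # Reject strings containing anything other than valid length/op pairs.
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in ops]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


flag_names = decode_flag(99)
span = reference_span("5S90M2D5M")
# 1-based inclusive end of an alignment starting at 10001 with CIGAR 100M.
end_pos = 10001 + reference_span("100M") - 1

print(flag_names)
print(span, end_pos)
```

A FLAG of 99 (1 + 2 + 32 + 64) describes the first read of a properly paired fragment whose mate maps to the reverse strand. Soft clips (S) and insertions (I) consume read bases but not reference bases, which is why `5S90M2D5M` spans 97 reference positions.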

VCF (Variant Call Format)

VCF stores genetic variants (SNPs, indels, structural variants) identified from sequencing data. It includes the reference allele, alternate allele(s), quality score, filter status, and sample-level genotype information.

    ##fileformat=VCFv4.3
    #CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO            FORMAT  SAMPLE1
    chr1    10177  rs367896724  A    AC   100   PASS    AF=0.425;DP=50  GT:DP   0/1:25
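As a rough illustration of how the columns fit together, the following Python sketch splits the record above into its fixed fields, INFO key-value pairs, and per-sample genotype fields. The function names and the shape of the returned dictionary are assumptions for this example. INFO and FORMAT values are left as strings because their types are declared in header lines that this minimal file omits.

```python
from typing import Any, Dict, List


def parse_info(info: str) -> Dict[str, Any]:
    """Parse the INFO column; flag entries without '=' become True."""
    parsed: Dict[str, Any] = {}
    if info == ".":
        return parsed
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    record: Dict[str, Any] = {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # multiple ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": parse_info(info),
        "samples": {},
    }
    if len(fields) > 9:
        # FORMAT names the colon-separated keys used in every sample column.
        keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1"
line = "chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tAF=0.425;DP=50\tGT:DP\t0/1:25"

sample_names = header.lstrip("#").split("\t")[9:]
record = parse_vcf_line(line, sample_names)
print(record["pos"], record["alt"], record["samples"])
```

The genotype `0/1` indexes into the allele list (0 is REF, 1 is the first ALT), so SAMPLE1 is heterozygous for the insertion of a C after the reference A.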

Conversion Best Practices

Reference: Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-2079. doi:10.1093/bioinformatics/btp352
Disclaimer: This guide is for educational purposes. Always refer to official format specifications for production use.