Complete Guide to Genomic Data Formats in 2026: FASTA, FASTQ, BAM, VCF & Beyond

The explosion of genomic data has given rise to a complex ecosystem of file formats, each designed for specific use cases in the sequencing pipeline. From raw sequence reads to annotated variant calls, understanding these formats is essential for bioinformaticians, computational biologists, and anyone working with genomic data. This guide provides a comprehensive overview of the most important formats in 2026.

FASTA Format

The simplest and most universal sequence format. Each entry begins with a header line starting with ">" followed by a description, then the sequence on subsequent lines.

>NM_001101.5 Homo sapiens actin beta (ACTB), mRNA
ATGGATGATGATATCGCCGCGCTCGTCGTCGACAACGGCTCCGGCATGTGCAAAGCCGGC
TTCGCGGGCGACGATGCCCCCCGGGCCGTCTTCCCCTCCATCGTGGGGCGCCCCAGGCAC
CAGGGCGTGATGGTGGGCATGGGTCAGAAGGATTCCTATGTGGGCGACGAGGCCCAGAGC

FASTA is used for reference genomes, protein sequences, and any context where quality scores are not needed. Multi-FASTA files contain multiple sequences separated by header lines. The format supports DNA (A,T,G,C,N), RNA (A,U,G,C), and protein (20 standard amino acid codes) sequences.
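The header-then-sequence layout described above can be read with a few lines of Python. This is a minimal sketch rather than a production parser: the function name parse_fasta and the two example records are hypothetical, and it assumes the whole file fits in memory.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word of the header) to its joined sequence."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # Header line: the ID is the text up to the first whitespace.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            chunks[current_id] = []
        elif current_id is not None:
            # Sequence line: belongs to the most recent header.
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


# Hypothetical two-record multi-FASTA input.
example = """>seq1 hypothetical first record
ACGTAC
GTNN
>seq2 hypothetical second record
GGCC
"""
sequences = parse_fasta(example.splitlines())
print(sequences)
```

For large references, an indexed reader (for example samtools faidx) is usually a better fit than loading every sequence into a dictionary.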

FASTQ Format

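The standard format for raw sequencing reads, extending FASTA with per-base quality information. Each read is represented by four lines: a header starting with "@", the sequence, a separator line starting with "+", and a quality string with one character per base.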

@ERR000589.1 HSQ1008:237:C1YG9:1:1101:1171:2088/1
ACGTACGTACGTACGTACGTACGTACGTACGT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

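Quality scores use Phred encoding, Q = -10 log10(P_error), with each score stored as a single ASCII character. The Phred+33 offset (Sanger, Illumina 1.8+) is now the de facto standard; older Illumina pipelines (1.3 to 1.7) used Phred+64. Q30 means a 1-in-1000 chance of error. Paired-end Illumina runs produce two FASTQ files (R1 and R2), typically with reads of roughly 150-300 bp and more than 85% of bases at or above Q30.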
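To make the four-line record and the Phred arithmetic concrete, here is a small sketch in Python. The helper names (read_fastq, phred33_scores, error_probability) and the example read are hypothetical, and the code assumes unwrapped four-line records with Phred+33 qualities.

```python
from typing import Iterator, List, Tuple


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from four-line records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, sep, qual = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:], seq, qual


def phred33_scores(qual: str) -> List[int]:
    """Decode a Phred+33 quality string into integer Q scores."""
    return [ord(ch) - 33 for ch in qual]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


# Hypothetical single-read input.
records = list(read_fastq(["@read1", "ACGT", "+", "!5?I"]))
for read_id, seq, qual in records:
    scores = phred33_scores(qual)
    print(read_id, seq, scores, [error_probability(q) for q in scores])
```

The length check matters in practice: a quality string that is longer or shorter than its sequence is one of the most common signs of a truncated or corrupted FASTQ file.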

SAM/BAM/CRAM

SAM (Sequence Alignment/Map) stores aligned reads against a reference genome. BAM is the binary compressed version, and CRAM provides even higher compression using reference-based encoding.

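Each alignment line has 11 mandatory tab-separated fields: read name (QNAME), bitwise FLAG (paired, mapped, strand), reference name (RNAME), 1-based mapping position (POS), mapping quality (MAPQ), CIGAR string (alignment operations), mate reference name and position (RNEXT, PNEXT), template length (TLEN), sequence (SEQ), and base qualities (QUAL). Optional TAG:TYPE:VALUE fields, such as NM (edit distance to the reference), may follow.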

@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
read001  0  chr1  10001  60  100M  *  0  0  ACGT...  FFFF...  NM:i:0
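The FLAG and CIGAR fields are the two that most often need decoding by hand. The sketch below shows one way to do it in Python; the function names are hypothetical, while the bit values and CIGAR operators follow the SAM specification.

```python
import re
from typing import List, Tuple

# Bit values defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (operations M, D, N, = and X)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("5S90M2I3D5M"))
print(reference_span("100M"))
```

A FLAG of 0, as in the example record above, means that none of the bits are set: an unpaired read mapped to the forward strand.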

VCF (Variant Call Format)

VCF stores genetic variants (SNPs, indels, structural variants) identified from sequencing data. It includes the reference allele, alternate allele(s), quality score, filter status, and sample-level genotype information.

##fileformat=VCFv4.3
#CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO            FORMAT  SAMPLE1
chr1    10177  rs367896724  A    AC   100   PASS    AF=0.425;DP=50  GT:DP   0/1:25
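A VCF data line is tab-delimited, with INFO packed as semicolon-separated key=value pairs and each sample column laid out according to FORMAT. The following Python sketch unpacks the record shown above. The function names are hypothetical, values are left as strings rather than typed from the header, and real pipelines would normally rely on bcftools or a dedicated library instead.

```python
from typing import Any, Dict


def parse_info(info: str) -> Dict[str, Any]:
    """Split the INFO column; keys without '=' are treated as boolean flags."""
    if info == ".":
        return {}
    parsed: Dict[str, Any] = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            parsed[key] = value
        else:
            parsed[item] = True
    return parsed


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse one tab-delimited VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    record: Dict[str, Any] = {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # multiple ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": parse_info(info),
        "samples": [],
    }
    if len(fields) > 9:
        # FORMAT names the per-sample keys, in order.
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


line = "chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tAF=0.425;DP=50\tGT:DP\t0/1:25"
record = parse_vcf_line(line)
print(record)
```

In this record REF is A and ALT is AC, so the variant is a one-base insertion, and the genotype 0/1 marks the sample as heterozygous.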

Conversion Best Practices

Use samtools for SAM/BAM/CRAM interconversion (samtools view -b for SAM-to-BAM; samtools view -C with -T to supply the reference FASTA for BAM-to-CRAM)
Use bcftools for VCF manipulation, normalization, and filtering
Use seqtk or seqkit for FASTA/FASTQ format conversion and manipulation
Always validate converted files (samtools quickcheck to catch truncated or headerless BAM/CRAM files, vcf-validator for VCF)
Maintain provenance metadata through all conversion steps
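The samtools steps in the list above can be scripted so that conversion and validation always run together. The sketch below only builds and prints the command lines. All file names are hypothetical, and the run helper assumes samtools is installed and on the PATH.

```python
import shutil
import subprocess
from typing import List


def sam_to_bam_cmd(sam_path: str, bam_path: str) -> List[str]:
    """samtools view -b writes BAM output."""
    return ["samtools", "view", "-b", "-o", bam_path, sam_path]


def bam_to_cram_cmd(bam_path: str, cram_path: str, reference_fasta: str) -> List[str]:
    """samtools view -C writes CRAM; -T names the reference FASTA."""
    return ["samtools", "view", "-C", "-T", reference_fasta, "-o", cram_path, bam_path]


def quickcheck_cmd(path: str) -> List[str]:
    """samtools quickcheck exits non-zero if the file looks damaged."""
    return ["samtools", "quickcheck", path]


def run(cmd: List[str]) -> None:
    """Run a command, failing loudly if the tool is missing or returns an error."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names; the commands are printed here, not executed.
pipeline = [
    sam_to_bam_cmd("sample.sam", "sample.bam"),
    quickcheck_cmd("sample.bam"),
    bam_to_cram_cmd("sample.bam", "sample.cram", "reference.fa"),
    quickcheck_cmd("sample.cram"),
]
for cmd in pipeline:
    print(" ".join(cmd))
```

Passing each list to run would execute the steps in order and stop at the first failure. Logging the exact command lines is also a simple way to keep the provenance record mentioned above.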

Reference: Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-2079. doi:10.1093/bioinformatics/btp352

Disclaimer: This guide is for educational purposes. Always refer to official format specifications for production use.
