The explosion of genomic data has given rise to a complex ecosystem of file formats, each designed for specific use cases in the sequencing pipeline. From raw sequence reads to annotated variant calls, understanding these formats is essential for bioinformaticians, computational biologists, and anyone working with genomic data. This guide provides a comprehensive overview of the most important formats in 2026.
FASTA is the simplest and most universal sequence format. Each entry begins with a header line starting with ">" followed by a description, and the sequence itself follows on one or more subsequent lines.
FASTA is used for reference genomes, protein sequences, and any context where quality scores are not needed. Multi-FASTA files contain multiple sequences, each introduced by its own header line. The format supports DNA (A, T, G, C, N), RNA (A, U, G, C), and protein (20 standard amino acid codes) sequences.
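As an illustration of the header-plus-sequence layout, here is a minimal sketch of a multi-FASTA reader in Python. The function name `parse_fasta` and the toy records are hypothetical, chosen only for this example; production pipelines would normally rely on an established library instead.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input (hypothetical data): one DNA record wrapped over two lines,
# followed by one short protein record.
example = [
    ">seq1 toy DNA record",
    "ACGTN",
    "ACGT",
    ">seq2 toy protein record",
    "MKV",
]
records = list(parse_fasta(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which keeps memory use proportional to the largest single record rather than the whole file.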
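FASTQ is the standard format for raw sequencing reads, extending FASTA with per-base quality information. Each read is represented by four lines: a header starting with "@", the sequence, a separator line starting with "+", and a quality string with one character per base.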
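A minimal sketch of reading four-line FASTQ records in Python follows. The names `FastqRecord` and `parse_fastq` are hypothetical, and the sketch assumes unwrapped records (exactly four lines each), which is what modern instruments emit.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per group of four lines."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")   # line 2: bases
            separator = next(it).rstrip("\r\n")  # line 3: "+" separator
            quality = next(it).rstrip("\r\n")    # line 4: quality string
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Toy input (hypothetical reads). Note that a quality string may itself
# begin with "@", so records are counted by position, not by scanning for "@".
example = ["@read1 toy", "ACGT", "+", "@III", "@read2", "GG", "+read2", "!!"]
reads = list(parse_fastq(example))
```

Reading by fixed groups of four lines is a deliberate choice here: "@" is a legal quality character, so treating every line that starts with "@" as a header would split records incorrectly.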
Quality scores use Phred encoding: Q = -10 log10(P_error), where P_error is the estimated probability that the base call is wrong. The Phred+33 ASCII offset (Sanger/Illumina 1.8+) is now universal. Q30 means a 1-in-1000 chance of error. Modern Illumina instruments produce paired-end FASTQ files (R1 and R2) with reads of roughly 150-300 bp, and typically more than 85% of bases at Q30 or higher.
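The Phred relationship and the +33 ASCII offset can be sketched in a few lines of Python. The function names below are hypothetical helpers for illustration, and the quality string is made-up toy data.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P_error) to get P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error: float) -> float:
    """Compute Q = -10 * log10(P_error)."""
    return -10 * math.log10(p_error)


def decode_phred33(quality: str) -> List[int]:
    """Convert a Phred+33 quality string to integer scores (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in quality]


def fraction_at_least(scores: List[int], threshold: int = 30) -> float:
    """Fraction of bases whose score is at or above the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Toy quality string: 'I' is ASCII 73, '?' is 63, '5' is 53, '!' is 33.
scores = decode_phred33("I?5!")
q30_fraction = fraction_at_least(scores, 30)
```

Here `scores` holds 40, 30, 20 and 0, so half of the toy bases reach Q30, and `phred_to_error_prob(30)` corresponds to the 1-in-1000 error rate mentioned above.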
SAM (Sequence Alignment/Map) is a tab-delimited text format that stores reads aligned to a reference genome. BAM is the binary compressed version, and CRAM provides even higher compression using reference-based encoding.
Key fields include: read name, bitwise FLAG (paired, mapped, strand), reference name, mapping position, mapping quality (MAPQ), CIGAR string (alignment operations), mate information, sequence, and quality.
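To make these fields concrete, the following Python sketch splits one SAM alignment line into its 11 mandatory columns, decodes the bitwise FLAG, and breaks the CIGAR string into operations. The helper names (`decode_flag`, `parse_cigar`, `parse_sam_line`) and the example read are hypothetical; the column names and flag bits follow the SAM specification. For real BAM or CRAM files a dedicated library would be the usual choice.

```python
import re
from typing import Dict, List, Tuple

# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# FLAG bits defined by the SAM specification, in ascending order.
FLAG_BITS = [
    (0x1, "paired"), (0x2, "proper_pair"), (0x4, "unmapped"),
    (0x8, "mate_unmapped"), (0x10, "reverse"), (0x20, "mate_reverse"),
    (0x40, "first_in_pair"), (0x80, "second_in_pair"), (0x100, "secondary"),
    (0x200, "qc_fail"), (0x400, "duplicate"), (0x800, "supplementary"),
]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs; '*' means none."""
    if cigar == "*":
        return []
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Parse one alignment line (not an '@' header line) into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("SAM alignment lines need at least 11 columns")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(cols[SAM_FIELDS.index(key)])
    record["flags"] = decode_flag(int(cols[1]))
    record["cigar_ops"] = parse_cigar(cols[5])
    record["tags"] = cols[11:]  # optional TAG:TYPE:VALUE fields, left unparsed
    return record


# Toy alignment line (hypothetical read): FLAG 99 = 1 + 2 + 32 + 64.
line = "read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII"
aln = parse_sam_line(line)
```

For this toy read, `aln["flags"]` lists paired, proper_pair, mate_reverse and first_in_pair, which is how a FLAG of 99 is typically read for the forward-strand first mate of a properly paired fragment.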
VCF (Variant Call Format) stores genetic variants (SNPs, indels, structural variants) identified from sequencing data. Each record includes the reference allele, alternate allele(s), quality score, filter status, and sample-level genotype information.
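As a rough sketch of how one VCF data line maps onto these pieces, the Python function below splits the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and, when present, the FORMAT and per-sample columns. The name `parse_vcf_line`, the sample names, and the variant itself are hypothetical; header lines starting with "#" are assumed to have been handled separately.

```python
from typing import Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, object]:
    """Parse one VCF data line into a dict of typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("VCF data lines need at least 8 columns")
    chrom, pos, var_id, ref, alt, qual, filt, info = cols[:8]

    # INFO is a ';'-separated list of key=value pairs or bare flags.
    info_dict: Dict[str, object] = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    # FORMAT (column 9) names the ':'-separated keys in each sample column.
    samples: Dict[str, Dict[str, str]] = {}
    if len(cols) > 9:
        keys = cols[8].split(":")
        for name, col in zip(sample_names, cols[9:]):
            samples[name] = dict(zip(keys, col.split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": None if var_id == "." else var_id,
        "ref": ref,
        "alt": [] if alt == "." else alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
        "samples": samples,
    }


# Toy record (hypothetical variant): two alternate alleles, two samples.
line = "chr1\t12345\trs1\tA\tG,T\t50\tPASS\tDP=30;DB\tGT:DP\t0/1:14\t1|2:16"
variant = parse_vcf_line(line, ["S1", "S2"])
```

In the genotype strings, allele index 0 refers to the reference allele and 1, 2, ... to the alternate alleles in order, with "/" marking unphased and "|" marking phased calls; the sketch keeps them as strings rather than interpreting them.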