Low level contamination confounds population genomic analysis 

Audrey K Ward

G3 (Bethesda). 2026 Jan 30:jkag021. doi: 10.1093/g3journal/jkag021. Online ahead of print.

ABSTRACT

Genome sequence contamination has a variety of causes and can originate from within or between species. Previous research focused on contamination between distantly related species or on prokaryotes. Here we test for intra-species contamination by mapping short read genome data to a reference and visualizing the frequency of reads with single nucleotide differences from the reference. Out of 1,298 publicly available genome sequences investigated for Saccharomyces cerevisiae, a small number (8 genomes) show at least 5% contamination. Contamination rates differed however among sequencing centers: one unusually large study had a low contamination rate (below 0.2%) but the contamination rate was higher for other studies (2% or 15% of genomes). Using genome data contaminated in silico to known degrees, we showed that contamination is recognizable in plots with unexpected secondary allele (B-allele) frequencies of at least 5% and measured contamination effects on admixture and phylogenetic analysis in two fungal species. With a standard base calling pipeline, we found that contaminated genomes superficially appeared to produce good quality genome data. Yet as little as 5-10% genome contamination was enough to change phylogenetic tree topologies and make contaminated strains appear as hybrids between lineages (genetically admixed). We recommend the use of B-allele frequency plots to screen genome resequencing data for intra-species contamination.
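
As a rough illustration of why intra-species contamination is visible in B-allele frequencies, the following minimal Python sketch simulates read counts for a sample mixed in silico with a second strain. It is not the authors' pipeline. The function names, the haploid (or fully homozygous) focal strain, the absence of sequencing error, and all parameter values are assumptions made for this example only. Under these assumptions, sites where the contaminant differs from the focal strain show a secondary allele at a frequency close to the contamination fraction.

```python
import random

def simulate_baf(n_sites, depth, contamination, divergence, seed=0):
    """Simulate per-site B-allele frequencies for a contaminated sample.

    Assumptions (hypothetical, for illustration): the focal strain matches
    the reference at every site, the contaminant differs at a fraction
    `divergence` of sites, and reads carry no sequencing errors.
    """
    rng = random.Random(seed)
    bafs = []
    for _ in range(n_sites):
        # Does the contaminating strain carry a different allele here?
        differs = rng.random() < divergence
        # Each read comes from the contaminant with probability `contamination`.
        p_alt = contamination if differs else 0.0
        alt_reads = sum(rng.random() < p_alt for _ in range(depth))
        bafs.append(alt_reads / depth)
    return bafs

def estimate_contamination(bafs):
    """Mean B-allele frequency over sites that show any secondary allele."""
    informative = [b for b in bafs if b > 0]
    return sum(informative) / len(informative) if informative else 0.0

if __name__ == "__main__":
    for c in (0.0, 0.05, 0.10):
        bafs = simulate_baf(n_sites=2000, depth=100, contamination=c,
                            divergence=0.05, seed=1)
        print(f"simulated contamination {c:.2f}: "
              f"estimated {estimate_contamination(bafs):.3f}")
```

In real data, sequencing errors, heterozygous sites, and mapping artifacts also produce secondary alleles, so a plotted distribution of B-allele frequencies is more informative than a single summary value.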

PMID:41616078 | DOI:10.1093/g3journal/jkag021