This readme explains the indel filtering applied to produce versions 2 and 3 of the 20110521 integrated variant release.

Indel Filtering
---------------

1. Exclusion of excessive 1-bp outliers

We found that a subset of indels from the low-coverage data has a very high false positive rate. In particular, the following 10 samples showed an excessive number of singleton indels (~1,000 to 23,000), most of which are 1bp insertions.

NA12144
NA20752
NA18626
NA19437
NA19439
NA19436
NA19448
NA18627
NA19313
NA19446

Upon further investigation, we found that the excessive 1bp singleton insertions are due to technical artifacts introduced in the sequencing step. We removed 162,928 1bp singleton insertions specific to the 10 outlier samples.

2. In addition, we found a much higher fraction of frameshift indels among low-coverage specific indels than among the indels shared between low-coverage and exome data, suggesting that low-coverage specific coding indels may have an elevated false positive rate. We removed an additional 3,014 protein-coding frameshift indels exclusive to low-coverage samples to increase the specificity of the protein-coding indels.

The indels identified above (categories 1 and 2) were removed from version 1 of the 20110521 release to produce version 2. A list of these excluded sites can be found at:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_indel_sites_to_exclude/ALL.wgs.indels_to_exclude.20101123.indels.sites.vcf.gz/

3. Preliminary evaluations of the INDEL call sets showed a high apparent false positive rate even after the above steps, and rare INDELs showed higher discordance with independent datasets. To extract high quality INDELs, we restricted the call set to INDELs with an allele frequency (before integration) of at least 0.5%, and additionally applied an SVM approach to filter out potential false positive INDELs, guided by indel genotypes from the Affymetrix Axiom genotyping chip (provided by Jeannette Schmidt & Jeremy Gollub).
The SVM was trained using multiple features, including:

(a) allele balance
(b) inbreeding coefficient
(c) flanking sequence complexity
(d) homopolymer runs
(e) strand bias
(f) cycle bias
(g) mapping quality
(h) number of supporting non-ref reads
(i) distance to nearby INDELs

This set of indels (category 3) was removed from version 2 of the 20110521 release to produce version 3. A list of these excluded sites can be found at:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120312_phase1_v2_indel_cleaned_sites_list/ALL.wgs.phase1_release_v2.20101123.snps_indels_sv_indel_clean_exclusion.20120312.sites.vcf.gz
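
For readers who want to apply one of the exclusion lists to their own copy of a call set, the following is a minimal Python sketch. It is not the procedure used to build the release. It assumes that a record should be dropped when its CHROM, POS, REF and any ALT allele match a site in the exclusion list; that matching rule, the function names, and the file names in the usage note are all illustrative assumptions.

```python
import gzip


def open_vcf(path):
    """Open a VCF as text; files ending in .gz are read through gzip."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def site_keys(fields):
    """Return one (CHROM, POS, REF, ALT) key per ALT allele of a VCF record."""
    chrom, pos, _id, ref, alt = fields[:5]
    return {(chrom, pos, ref, allele) for allele in alt.split(",")}


def load_exclusion_sites(path):
    """Read a sites-only VCF and collect the keys of all listed sites."""
    sites = set()
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            sites |= site_keys(line.rstrip("\n").split("\t"))
    return sites


def filter_vcf_lines(lines, excluded):
    """Yield header lines unchanged and records not in the exclusion set.

    Assumption: a record is dropped if ANY of its ALT alleles matches
    an excluded site.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        if site_keys(line.rstrip("\n").split("\t")) & excluded:
            continue
        yield line
```

A typical use would be to call load_exclusion_sites("exclusion.sites.vcf.gz"), then pass the lines of open_vcf("calls.vcf.gz") through filter_vcf_lines and write the result to a new file. Both file names here are placeholders. Keying on position and alleles rather than position alone avoids dropping a SNP that happens to share a coordinate with an excluded indel.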