This readme explains the indel filtering carried out for versions 2 and 3 of the
20110521 integrated variant release.

Indel Filtering
---------------

1. Exclusion of excessive 1bp singleton insertions in outlier samples

We found that a subset of indels from the low-coverage data has a very high false 
positive rate. In particular, the following 10 samples showed an excessive number of 
singleton indels (~1,000 to 23,000 per sample), most of which are 1bp insertions.
		
NA12144
NA20752
NA18626
NA19437
NA19439
NA19436
NA19448
NA18627
NA19313
NA19446

Upon further investigation, we found that the excessive 1bp singleton insertions are 
due to technical artifacts introduced in the sequencing step. We removed 162,928 1bp 
singleton insertions specific only to the 10 outlier samples.
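
As an illustration only, the selection rule described above could be sketched as follows. 
This is not the pipeline that was actually used. The function name, the genotype 
dictionary layout, and the definition of "singleton" as exactly one copy of the ALT allele 
across all samples are assumptions made for this sketch.

```python
# Hypothetical sketch: flag 1bp singleton insertions carried only by an outlier sample.
OUTLIER_SAMPLES = {
    "NA12144", "NA20752", "NA18626", "NA19437", "NA19439",
    "NA19436", "NA19448", "NA18627", "NA19313", "NA19446",
}


def is_outlier_singleton_insertion(ref, alt, genotypes, outliers=OUTLIER_SAMPLES):
    """Return True for a 1bp insertion seen exactly once, in an outlier sample.

    genotypes maps sample ID -> genotype string such as "0|1" or "0/0"
    (an assumed layout for this example, biallelic sites only).
    """
    # In VCF representation a 1bp insertion has ALT one base longer than REF,
    # with ALT starting with REF (the shared anchor base).
    if "," in alt or len(alt) != len(ref) + 1 or not alt.startswith(ref):
        return False
    carriers = []
    alt_allele_count = 0
    for sample, gt in genotypes.items():
        n_alt = gt.replace("|", "/").split("/").count("1")
        if n_alt:
            carriers.append(sample)
            alt_allele_count += n_alt
    # Assumed singleton definition: exactly one ALT allele across all samples.
    return alt_allele_count == 1 and carriers[0] in outliers
```

A site flagged by this check would be a candidate for exclusion; sites whose single 
carrier is not one of the 10 outlier samples are left alone.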

2. Exclusion of low-coverage-specific frameshift indels

In addition, we found a much higher fraction of frameshift indels among low-coverage-specific 
indels than among the indels shared between the low-coverage and exome data, suggesting that 
low-coverage-specific coding indels may have an elevated false positive rate. We removed 
an additional 3,014 protein-coding frameshift indels exclusive to low-coverage samples to 
increase the specificity of the protein-coding indels.
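
For readers unfamiliar with the term, a frameshift indel is one whose length change is 
not a multiple of 3. A minimal sketch of that check, and of the frameshift fraction 
compared above, is given below. The function names are hypothetical, and the sketch 
assumes each indel lies entirely within protein-coding sequence; real annotation would 
need a gene model.

```python
# Hypothetical sketch: classify coding indels as frameshift or in-frame.
def is_frameshift(ref, alt):
    """True when the indel changes the sequence length by a non-multiple of 3."""
    length_change = abs(len(alt) - len(ref))
    return length_change != 0 and length_change % 3 != 0


def frameshift_fraction(indels):
    """Fraction of (ref, alt) indels that are frameshifts; 0.0 for an empty input."""
    indels = list(indels)
    if not indels:
        return 0.0
    return sum(is_frameshift(ref, alt) for ref, alt in indels) / len(indels)
```

Computing frameshift_fraction separately for the low-coverage-specific set and for the 
set shared with the exome data would give the kind of comparison described above.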

 
The indels identified above (categories 1 and 2) were removed from version 1 of the 20110521 release.
A list of these excluded sites can be found at:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_indel_sites_to_exclude/ALL.wgs.indels_to_exclude.20101123.indels.sites.vcf.gz


3. Allele frequency threshold and SVM filtering

Preliminary evaluations of the indel call sets showed a high apparent false positive rate
even after the above steps, and rare indels showed higher discordance with independent 
datasets. To extract high-quality indels, we required a minimum allele frequency 
(before integration) of 0.5%, and additionally applied a support vector machine (SVM) 
approach to filter out potential false positive indels, guided by indel genotypes from 
the Affymetrix Axiom genotyping chip (provided by Jeannette Schmidt & Jeremy Gollub). 
The SVM was trained using multiple features, including (a) allele balance, (b) inbreeding 
coefficient, (c) flanking sequence complexity, (d) homopolymer runs, (e) strand bias, 
(f) cycle bias, (g) mapping quality, (h) number of supporting non-ref reads, and 
(i) distance to nearby indels. 
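
The two ingredients of this step can be illustrated with a small sketch. It shows only 
the 0.5% allele frequency cut and one plausible way to compute a homopolymer run feature. 
It does not reproduce the SVM itself, and the function names, the use of allele counts 
as input, and the exact feature definition are assumptions rather than the method used 
for the release.

```python
# Hypothetical sketch: allele frequency pre-filter and one example SVM input feature.
def passes_af_threshold(alt_count, total_alleles, min_af=0.005):
    """Keep an indel only if its allele frequency is at least min_af (0.5%)."""
    if total_alleles <= 0:
        return False
    return alt_count / total_alleles >= min_af


def homopolymer_run_length(flank):
    """Length of the run of identical bases at the start of the flanking sequence.

    This is one assumed definition of a "homopolymer run" feature.
    """
    if not flank:
        return 0
    run = 1
    while run < len(flank) and flank[run] == flank[0]:
        run += 1
    return run
```

For example, with 2,184 chromosomes an indel would need at least 11 copies of the 
alternate allele to pass the 0.5% cut under this sketch.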

This set of indels (category 3) was removed from version 2 of the 20110521 release to produce version 3.

A list of these excluded sites can be found at:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120312_phase1_v2_indel_cleaned_sites_list/ALL.wgs.phase1_release_v2.20101123.snps_indels_sv_indel_clean_exclusion.20120312.sites.vcf.gz
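
Either exclusion list can be applied to a VCF with a few lines of code. The following 
is a minimal sketch, assuming the list has been downloaded locally and that sites are 
matched on the CHROM, POS, REF and ALT columns. The function names and the matching rule 
are assumptions made for illustration, not a description of how the release was built.

```python
import gzip


def _site_key(line):
    """Build a (CHROM, POS, REF, ALT) key from one tab-separated VCF record."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def load_excluded_sites(sites_vcf_path):
    """Read site keys from a gzipped sites-only VCF (path is a local copy)."""
    excluded = set()
    with gzip.open(sites_vcf_path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            excluded.add(_site_key(line))
    return excluded


def filter_vcf_lines(lines, excluded):
    """Yield header lines and every record whose site is not in the excluded set."""
    for line in lines:
        if line.startswith("#"):
            yield line
        elif line.strip() and _site_key(line) not in excluded:
            yield line
```

The excluded set is held in memory, which is practical for lists of this size; the 
records themselves are streamed one line at a time.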