18-JUN-2012

dbSNP currently supports VCF v4.0
(http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0).  

The VCF files are available at 
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF

The following are omitted from all VCF files:
  Variations listed as microsatellites or named variations
  Variations that are not mapped on the reference genome (GRCh37.x)
  Variations that are mapped to more than one location on the reference 
    genome. (Weight > 1)

Following are the files in the VCF main directory:

./00-All.vcf.gz
  VCF of all variations that meet the criteria to be in a VCF file.  This file 
  is created once per dbSNP build.

common_all.vcf.gz
  VCF of all variations that are polymorphic in at least one population of the 
  1000 Genomes project or any of the following handles:
    1000GENOMES
    CSHL-HAPMAP
    EGP_SNPS
    NHLBI-ESP
    PGA-UW-FHCRC
  A variation is polymorphic if the minor allele frequency is at least 0.01
  and the minor allele is present in at least two samples.
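
  The polymorphism rule above can be sketched in Python as follows. This is
  only an illustration, not dbSNP's actual pipeline: the function name, the
  genotype string format ("0/1"), and the choice of the second most common
  allele as the "minor allele" are all assumptions made for the example.

```python
from collections import Counter


def is_polymorphic(genotypes, min_maf=0.01, min_samples=2):
    """Apply the README's rule to one population's genotypes at one site.

    genotypes: one string per sample, e.g. "0/0", "0/1", "1|1", "./."
    (an assumed encoding; missing alleles are written as ".").
    """
    allele_counts = Counter()   # how many times each allele was observed
    carriers = Counter()        # how many samples carry each allele
    for gt in genotypes:
        alleles = [a for a in gt.replace("|", "/").split("/") if a != "."]
        allele_counts.update(alleles)
        carriers.update(set(alleles))

    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return False  # no data, or only one allele observed

    # Assumption: the minor allele is the second most common allele.
    minor, minor_count = allele_counts.most_common()[1]
    return minor_count / total >= min_maf and carriers[minor] >= min_samples


# Two heterozygous samples out of 100: MAF = 2/200 = 0.01, two carriers.
print(is_polymorphic(["0/0"] * 98 + ["0/1", "0/1"]))
# One homozygous sample: same frequency, but only one carrier.
print(is_polymorphic(["0/0"] * 99 + ["1/1"]))
```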

clinvar_YYYYMMDD.vcf.gz
  VCF of variations from clinvar where 'YYYYMMDD' represents the date 
  the file was created.
  This file is created weekly.


common_and_clinical_YYYYMMDD.vcf.gz
  Variations from common_all.vcf.gz that are clinical.  A clinical
  variation is one that appears in clinvar_YYYYMMDD.vcf.gz with at least
  one of the following clinical significance codes:
    4 - probable-pathogenic
    5 - pathogenic
    6 - drug-response
    7 - histocompatibility
    255 - other
  This file is created weekly.
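
  The split between the two "common" weekly files can be sketched as a small
  Python check. The function names are hypothetical, and this README does not
  describe how the significance codes are stored in the VCF, so the sketch
  simply takes them as a list of integers.

```python
# Clinical significance codes listed in this README.
CLINICAL_CODES = {
    4: "probable-pathogenic",
    5: "pathogenic",
    6: "drug-response",
    7: "histocompatibility",
    255: "other",
}


def is_clinical(significance_codes):
    """True if at least one code is in the clinical list above."""
    return any(code in CLINICAL_CODES for code in significance_codes)


def weekly_file_prefix(significance_codes):
    """Name prefix of the weekly file a common variation would fall into."""
    if is_clinical(significance_codes):
        return "common_and_clinical"
    return "common_no_known_medical_impact"


print(weekly_file_prefix([2, 5]))
print(weekly_file_prefix([2]))
```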

common_no_known_medical_impact_YYYYMMDD.vcf.gz
  Variations from common_all.vcf.gz that do not meet the clinical criteria
  described above.  This file is created weekly.

clinvar_00-newest.vcf.gz
common_and_clinical_00-newest.vcf.gz
common_no_known_medical_impact_00-newest.vcf.gz
  Symbolic links to the latest versions of the weekly files described above.

Following are the subdirectories of the VCF directory:

PreviousWeekly
  Older versions of files that are created weekly

ByChromosome
  VCF with genotypes and genotype frequencies listed by chromosome 
  and population ID.
  example: 14-12162-MKK.vcf.gz

ByChromosomeNoGeno
  VCF with genotype frequencies, but the genotypes are omitted. These are 
  listed by chromosome and population ID.
  example: 14-12162-MKK-nogeno.vcf.gz 

ByPopulation
  VCF with genotypes and genotype frequencies listed by population and 
  chromosome.  These files are symbolic links to the files in 'ByChromosome'.
  example: MKK-12162-14.vcf.gz

ByPopulationNoGeno
  VCF with genotype frequencies, but the genotypes are omitted. These are 
  listed by population and chromosome and are symbolic links to the
  files in 'ByChromosomeNoGeno'.
  example: MKK-12162-14-nogeno.vcf.gz

File naming convention example:

14-12162-MKK.vcf.gz

14 - Chromosome
12162 - dbSNP population ID.  For more information
  see http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?pop=12162 or
  http://www.ncbi.nlm.nih.gov/projects/SNP/snp_tableList.cgi?fld=Population+handle&cond=contains&str=CSHL-HAPMAP&type=pop

MKK - three-letter population ID.  For more information see 
  http://ccr.coriell.org/sections/collections/NHGRI/?SsId=11
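
The naming convention above can be parsed with a short Python sketch. The
names VcfName and parse_vcf_name are hypothetical; the by_population flag
covers the reversed field order used in 'ByPopulation' and
'ByPopulationNoGeno', and the optional '-nogeno' suffix is handled as shown
in the directory examples.

```python
from typing import NamedTuple


class VcfName(NamedTuple):
    chromosome: str
    population_id: int      # dbSNP population ID, e.g. 12162
    population_code: str    # population code, e.g. MKK
    has_genotypes: bool     # False for "-nogeno" files


def parse_vcf_name(filename, by_population=False):
    """Split a per-chromosome/per-population VCF file name into its fields."""
    suffix = ".vcf.gz"
    if not filename.endswith(suffix):
        raise ValueError("expected a .vcf.gz file name: %r" % filename)
    stem = filename[:-len(suffix)]

    has_genotypes = True
    if stem.endswith("-nogeno"):
        stem = stem[:-len("-nogeno")]
        has_genotypes = False

    parts = stem.split("-")
    if len(parts) != 3:
        raise ValueError("expected three '-' separated fields: %r" % filename)
    first, pop_id, last = parts
    if by_population:
        # ByPopulation layout: <population>-<pop id>-<chromosome>
        first, last = last, first
    return VcfName(first, int(pop_id), last, has_genotypes)


print(parse_vcf_name("14-12162-MKK.vcf.gz"))
print(parse_vcf_name("MKK-12162-14-nogeno.vcf.gz", by_population=True))
```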

A note about the position.

The RSPOS tag is the position of the SNP in dbSNP.  The position reported in
column 2 (POS) may differ from the RSPOS tag.  All alleles for an INDEL or
multi-base SNP must begin with the same nucleotide.  To accomplish this, the
preceding base pair is prefixed to each allele and the position of this base
pair is reported.
Also, if all of the alleles consist of the same repeated sequence or a
deletion, the beginning of the repeat is calculated and the position of the
base pair preceding it is reported.
For example, if the alleles are AT/ATAT/-, the position in column 2 is
the location of the first repeat (AT) minus one.
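
The anchoring step described above can be sketched in Python. This is an
illustration under stated assumptions, not dbSNP's code: the function name
anchor_alleles is hypothetical, the reference sequence is a made-up string,
and the start of the repeat is passed in rather than calculated, since this
README does not say how that calculation is done.

```python
def anchor_alleles(reference, start, alleles):
    """Prefix every allele with the base preceding the variation.

    reference: reference sequence as a string (illustrative input)
    start:     1-based position of the first base of the variation
               (for a repeat, the beginning of the first repeat unit)
    alleles:   allele strings, with "-" standing for a deletion
    Returns (position for column 2, anchored alleles).
    """
    if start < 2:
        raise ValueError("no preceding base is available before position 1")
    anchor_pos = start - 1                      # position reported in column 2
    anchor_base = reference[anchor_pos - 1]     # convert to 0-based index
    anchored = [anchor_base + ("" if a == "-" else a) for a in alleles]
    return anchor_pos, anchored


# Made-up reference in which the AT repeat begins at position 3.
pos, anchored = anchor_alleles("GCATATG", 3, ["AT", "ATAT", "-"])
print(pos, anchored)   # 2 ['CAT', 'CATAT', 'C']
```

Every anchored allele now begins with the same nucleotide (C in this made-up
case), and the reported position is the repeat start minus one.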

Following is a sample VCF header from ./ByChromosome/14-12162-MKK.vcf.gz

##fileformat=VCFv4.0
##fileDate=20120604
##source=dbSNP
##dbSNP_BUILD_ID=137
##reference=GRCh37.p5
##phasing=partial
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
##dbSNP_POP_ID=12162
##dbSNP_LOC_POP_ID=HAPMAP-MKK
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id.  The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Code, 0 - unspecified, 1 - Paralog, 2 - byEST, 3 - Para_EST, 4 - oldAlign, 5 - other">
##INFO=<ID=GMAF,Number=1,Type=Float,Description="Global Minor Allele Frequency [0, 0.5]; global population is 1000GenomesProject phase 1 genotype data from 629 individuals, released in the 11-23-2010 dataset">
##INFO=<ID=WGT,Number=1,Type=Integer,Description="Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious(Clinical,Pubmed Cited)">
##INFO=<ID=TPA,Number=0,Type=Flag,Description="Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)">
##INFO=<ID=PMC,Number=0,Type=Flag,Description="Links exist to PubMed Central article">
##INFO=<ID=S3D,Number=0,Type=Flag,Description="Has 3D structure - SNP3D table">
##INFO=<ID=SLO,Number=0,Type=Flag,Description="Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=REF,Number=0,Type=Flag,Description="Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8">
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3">
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=OTH,Number=0,Type=Flag,Description="Has other variant with exactly the same set of mapped positions on NCBI refernce assembly.">
##INFO=<ID=CFL,Number=0,Type=Flag,Description="Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.">
##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the variant only maps to one assembly">
##INFO=<ID=MUT,Number=0,Type=Flag,Description="Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources">
##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is Validated.  This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.">
##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations">
##INFO=<ID=G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations">
##INFO=<ID=HD,Number=0,Type=Flag,Description="Marker is on high density genotyping kit (50K density or greater).  The variant may have phenotype associations present in dbGaP.">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available. The variant has individual genotype (in SubInd table).">
##INFO=<ID=KGValidated,Number=0,Type=Flag,Description="1000 Genome validated">
##INFO=<ID=KGPhase1,Number=0,Type=Flag,Description="1000 Genome phase 1 (incl. June Interim phase 1)">
##INFO=<ID=KGPilot123,Number=0,Type=Flag,Description="1000 Genome discovery all pilots 2010(1,2,3)">
##INFO=<ID=KGPROD,Number=0,Type=Flag,Description="Has 1000 Genome submission">
##INFO=<ID=OTHERKG,Number=0,Type=Flag,Description="non-1000 Genome submission">
##INFO=<ID=PH3,Number=0,Type=Flag,Description="HAP_MAP Phase 3 genotyped: filtered, non-redundant">
##INFO=<ID=CDA,Number=0,Type=Flag,Description="Variation is interrogated in a clinical diagnostic assay">
##INFO=<ID=LSD,Number=0,Type=Flag,Description="Submitted from a locus-specific database">
##INFO=<ID=MTP,Number=0,Type=Flag,Description="Microattribution/third-party annotation(TPA:GWAS,PAGE)">
##INFO=<ID=OM,Number=0,Type=Flag,Description="Has OMIM/OMIA">
##INFO=<ID=NOC,Number=0,Type=Flag,Description="Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation.">
##INFO=<ID=WTD,Number=0,Type=Flag,Description="Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set.  If all member ss' are withdrawn, then the rs is deleted to SNPHistory">
##INFO=<ID=NOV,Number=0,Type=Flag,Description="Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common.">
##INFO=<ID=GCF,Number=0,Type=Flag,Description="Has Genotype Conflict Same (rs, ind), different genotype.  N/N is not included.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT2,Number=1,Type=String,Description="Second Genotype">
##FORMAT=<ID=GT3,Number=1,Type=String,Description="Third Genotype">
##FORMAT=<ID=GC,Number=1,Type=Integer,Description="Genotype Count">
##FORMAT=<ID=GC2,Number=1,Type=Integer,Description="Second Genotype Count">
##FORMAT=<ID=GC3,Number=1,Type=Integer,Description="Third Genotype Count">
##FILTER=<ID=NC,Description="Inconsistent Genotype Submission For At Least One Sample">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA21295	NA21297	NA21300	NA21301	NA21303	NA21307	NA21308	NA21311	NA21312	NA21314	NA21316	NA21318	NA21320	NA21333	NA21336	NA21339	NA21344	NA21352	NA21353	NA21355	NA21356	NA21357	NA21359	NA21360	NA21362	NA21363	NA21364	NA21365	NA21367	NA21368	NA21370	NA21371	NA21378	NA21379	NA21381	NA21382	NA21384	NA21385	NA21387	NA21388	NA21390	NA21391	NA21399	NA21400	NA21402	NA21403	NA21405	NA21408	NA21414	NA21415	NA21417	NA21418	NA21420	NA21421	NA21423	NA21424	NA21434	NA21435	NA21436	NA21438	NA21440	NA21441	NA21447	NA21448	NA21451	NA21453	NA21454	NA21457	NA21473	NA21475	NA21476	NA21478	NA21479	NA21485	NA21486	NA21488	NA21489	NA21491	NA21493	NA21509	NA21510	NA21512	NA21513	NA21517	NA21519	NA21520	NA21521	NA21522	NA21523	NA21524	NA21526	NA21528	NA21529	NA21573	NA21574	NA21575	NA21576	NA21577	NA21578	NA21580	NA21582	NA21583	NA21587	NA21596	NA21597	NA21599	NA21600	NA21611	NA21613	NA21614	NA21615	NA21616	NA21617	NA21619	NA21620	NA21631	NA21632	NA21634	NA21635	NA21647	NA21650	NA21678	NA21682	NA21683	NA21685	NA21686	NA21689	NA21693	NA21716	NA21717	NA21719	NA21722	NA21723	NA21733	NA21738	NA21739	NA21740	NA21741	NA21768	NA21776	NA21784	NA21825	NA21826
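
The ##INFO and ##FORMAT lines in the header above share one layout, and the
GENEINFO description spells out its own value format. A minimal Python sketch
for reading both follows. The function names are hypothetical, the regular
expression only covers the field order seen in this sample header, and the
GENEINFO value in the usage lines is a made-up placeholder rather than real
data.

```python
import re

# Matches lines such as:
# ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
META_RE = re.compile(
    r'^##(INFO|FORMAT)=<ID=([^,]+),Number=([^,]+),Type=([^,]+),'
    r'Description="(.*)">$'
)


def parse_meta_line(line):
    """Return a dict for an INFO/FORMAT header line, or None for other lines."""
    match = META_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    kind, tag_id, number, value_type, description = match.groups()
    return {
        "kind": kind,
        "id": tag_id,
        "number": number,          # kept as text because "." is allowed
        "type": value_type,
        "description": description,
    }


def parse_geneinfo(value):
    """Split a GENEINFO value into (gene symbol, gene id) pairs.

    Pairs are delimited by "|"; symbol and id are delimited by ":".
    """
    pairs = []
    for item in value.split("|"):
        symbol, _, gene_id = item.rpartition(":")
        pairs.append((symbol, int(gene_id)))
    return pairs


line = '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">'
print(parse_meta_line(line))
print(parse_geneinfo("GENE1:111|GENE2:222"))   # placeholder symbols and ids
```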