18-JUN-2012

dbSNP currently supports VCF v4.0
(http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0).

The VCF files are available at
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF

The following are omitted from all VCF files:
    Variations listed as microsatellites or named variations
    Variations that are not mapped on the reference genome (GRCh37.x)
    Variations that are mapped to more than one location on the
    reference genome (Weight > 1)

Following are the files in the VCF main directory:

./00-All.vcf.gz
    VCF of all variations that meet the criteria to be in a VCF file.
    This file is created once per dbSNP build.

common_all.vcf.gz
    VCF of all variations that are polymorphic in at least one population
    of the 1000 Genomes project or of any of the following handles:
        1000GENOMES
        CSHL-HAPMAP
        EGP_SNPS
        NHLBI-ESP
        PGA-UW-FHCRC
    A variation is polymorphic if the minor allele frequency is at least
    0.01 and the minor allele is present in at least two samples.

clinvar_YYYYMMDD.vcf.gz
    VCF of variations from clinvar, where 'YYYYMMDD' represents the date
    the file was created. This file is created weekly.

common_and_clinical_YYYYMMDD.vcf.gz
    Variations from common_all.vcf.gz that are clinical. A clinical
    variation is one that appears in clinvar_YYYYMMDD.vcf.gz with at least
    one of the following clinical significance codes:
        4   - probable-pathogenic
        5   - pathogenic
        6   - drug-response
        7   - histocompatibility
        255 - other
    This file is created weekly.

common_no_known_medical_impact_YYYYMMDD.vcf.gz
    Variations from common_all.vcf.gz that do not meet the clinical
    criteria described above. This file is created weekly.

clinvar_00-newest.vcf.gz
common_and_clinical_00-newest.vcf.gz
common_no_known_medical_impact_00-newest.vcf.gz
    Symbolic links to the latest versions of the weekly files described
    above.
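The selection rules for the common and clinical files can be sketched in a few lines of Python. This is only an illustration of the criteria stated above, not the code dbSNP uses. The function names and the way the inputs are passed in (a frequency, a sample count, a list of integer significance codes) are assumptions made for the example; how those values are stored in the VCF files is not described here.

```python
# Clinical significance codes that make a variation "clinical" (from the list above).
CLINICAL_CODES = {
    4: "probable-pathogenic",
    5: "pathogenic",
    6: "drug-response",
    7: "histocompatibility",
    255: "other",
}


def is_polymorphic(minor_allele_freq, minor_allele_sample_count):
    """True if MAF is at least 0.01 and the minor allele is in at least two samples."""
    return minor_allele_freq >= 0.01 and minor_allele_sample_count >= 2


def is_clinical(significance_codes):
    """True if at least one code is in the clinical list."""
    return any(code in CLINICAL_CODES for code in significance_codes)


def weekly_file_for(in_common_all, significance_codes):
    """Return the weekly file prefix for a variation (hypothetical helper).

    Variations that are not in common_all.vcf.gz belong to neither file.
    """
    if not in_common_all:
        return None
    if is_clinical(significance_codes):
        return "common_and_clinical"
    return "common_no_known_medical_impact"
```

A variation from common_all.vcf.gz with code 5 would land in common_and_clinical, while one with no listed code would land in common_no_known_medical_impact.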

Following are the subdirectories of the VCF directory:

PreviousWeekly
    Older versions of the files that are created weekly.

ByChromosome
    VCF with genotypes and genotype frequencies, listed by chromosome and
    population ID.
    example: 14-12162-MKK.vcf.gz

ByChromosomeNoGeno
    VCF with genotype frequencies, but the genotypes are omitted. These
    are listed by chromosome and population ID.
    example: 14-12162-MKK-nogeno.vcf.gz

ByPopulation
    VCF with genotypes and genotype frequencies, listed by population and
    chromosome. These files are symbolic links to the files in
    ByChromosome.
    example: MKK-12162-14.vcf.gz

ByPopulationNoGeno
    VCF with genotype frequencies, but the genotypes are omitted. These
    are listed by population and chromosome and are symbolic links to the
    files in ByChromosomeNoGeno.
    example: MKK-12162-14-nogeno.vcf.gz

File naming convention example: 14-12162-MKK.vcf.gz
    14    - Chromosome
    12162 - dbSNP population ID - see
            http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?pop=12162
            or
            http://www.ncbi.nlm.nih.gov/projects/SNP/snp_tableList.cgi?fld=Population+handle&cond=contains&str=CSHL-HAPMAP&type=pop
    MKK   - three-letter population ID. For more information see
            http://ccr.coriell.org/sections/collections/NHGRI/?SsId=11

A note about the position:
    The RSPOS tag is the position of the SNP in dbSNP. The position
    reported in column 2 may differ from the RSPOS tag. All alleles for an
    INDEL or multi-byte SNP must begin with the same nucleotide. To
    accomplish this, the preceding base pair is prefixed to each allele
    and the position of this base pair is reported. Also, if all of the
    alleles consist of the same repeated sequence or a deletion, the
    beginning of the repeat is calculated and the preceding base pair is
    reported. For example, if the variations are AT/ATAT/-, the position
    in column 2 is the location of the first repeat (AT) minus one.
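The file naming convention and the position rule above can be illustrated with a short Python sketch. Both helpers are hypothetical and written only for this example. parse_vcf_filename assumes that none of the three name parts contains a hyphen. anchor_alleles takes the start of the variation (or of the repeat) as an input, because how dbSNP calculates the beginning of a repeat is not described here; the reference string in the usage note is made up.

```python
from typing import NamedTuple


class VcfFileName(NamedTuple):
    chromosome: str
    population_id: int      # dbSNP population ID, e.g. 12162
    population_code: str    # three-letter population ID, e.g. MKK
    has_genotypes: bool     # False for the "-nogeno" files


def parse_vcf_filename(name, by_population=False):
    """Split a per-population VCF file name into its parts.

    by_population=False: ByChromosome layout, e.g. 14-12162-MKK.vcf.gz
    by_population=True:  ByPopulation layout, e.g. MKK-12162-14.vcf.gz
    """
    suffix = ".vcf.gz"
    if not name.endswith(suffix):
        raise ValueError("expected a .vcf.gz file name: " + name)
    stem = name[: -len(suffix)]
    has_genotypes = not stem.endswith("-nogeno")
    if not has_genotypes:
        stem = stem[: -len("-nogeno")]
    first, pop_id, last = stem.split("-")
    chromosome, code = (last, first) if by_population else (first, last)
    return VcfFileName(chromosome, int(pop_id), code, has_genotypes)


def anchor_alleles(reference, variant_start, alleles):
    """Prefix every allele with the base before the variation.

    reference     - reference sequence as a string (position 1 is reference[0])
    variant_start - 1-based position where the variation (or repeat) begins
    alleles       - alleles as written in dbSNP; "-" denotes a deletion
    Returns (column 2 position, anchored alleles).
    """
    pos = variant_start - 1
    if pos < 1:
        raise ValueError("no preceding base at the start of the sequence")
    anchor = reference[pos - 1]
    return pos, [anchor + ("" if allele == "-" else allele) for allele in alleles]
```

With the made-up reference GCATATG and the AT repeat beginning at position 3, anchor_alleles("GCATATG", 3, ["AT", "ATAT", "-"]) reports position 2 and the alleles CAT, CATAT and C, so that every allele begins with the same nucleotide.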

Following is a sample VCF header from ./ByChromosome/14-12162-MKK.vcf.gz

##fileformat=VCFv4.0
##fileDate=20120604
##source=dbSNP
##dbSNP_BUILD_ID=137
##reference=GRCh37.p5
##phasing=partial
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
##dbSNP_POP_ID=12162
##dbSNP_LOC_POP_ID=HAPMAP-MKK
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Code, 0 - unspecified, 1 - Paralog, 2 - byEST, 3 - Para_EST, 4 - oldAlign, 5 - other">
##INFO=<ID=GMAF,Number=1,Type=Float,Description="Global Minor Allele Frequency [0, 0.5]; global population is 1000GenomesProject phase 1 genotype data from 629 individuals, released in the 11-23-2010 dataset">
##INFO=<ID=WGT,Number=1,Type=Integer,Description="Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious(Clinical,Pubmed Cited)">
##INFO=<ID=TPA,Number=0,Type=Flag,Description="Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)">
##INFO=<ID=PMC,Number=0,Type=Flag,Description="Links exist to PubMed Central article">
##INFO=<ID=S3D,Number=0,Type=Flag,Description="Has 3D structure - SNP3D table">
##INFO=<ID=SLO,Number=0,Type=Flag,Description="Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=REF,Number=0,Type=Flag,Description="Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8">
##INFO=<ID=SYN,Number=0,Type=Flag,Description="Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3">
##INFO=<ID=U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53">
##INFO=<ID=U5,Number=0,Type=Flag,Description="In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=DSS,Number=0,Type=Flag,Description="In donor splice-site FxnCode = 75">
##INFO=<ID=INT,Number=0,Type=Flag,Description="In Intron FxnCode = 6">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=OTH,Number=0,Type=Flag,Description="Has other variant with exactly the same set of mapped positions on NCBI refernce assembly.">
##INFO=<ID=CFL,Number=0,Type=Flag,Description="Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.">
##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the variant only maps to one assembly">
##INFO=<ID=MUT,Number=0,Type=Flag,Description="Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources">
##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is Validated. This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.">
##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations">
##INFO=<ID=G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations">
##INFO=<ID=HD,Number=0,Type=Flag,Description="Marker is on high density genotyping kit (50K density or greater). The variant may have phenotype associations present in dbGaP.">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available. The variant has individual genotype (in SubInd table).">
##INFO=<ID=KGValidated,Number=0,Type=Flag,Description="1000 Genome validated">
##INFO=<ID=KGPhase1,Number=0,Type=Flag,Description="1000 Genome phase 1 (incl. June Interim phase 1)">
##INFO=<ID=KGPilot123,Number=0,Type=Flag,Description="1000 Genome discovery all pilots 2010(1,2,3)">
##INFO=<ID=KGPROD,Number=0,Type=Flag,Description="Has 1000 Genome submission">
##INFO=<ID=OTHERKG,Number=0,Type=Flag,Description="non-1000 Genome submission">
##INFO=<ID=PH3,Number=0,Type=Flag,Description="HAP_MAP Phase 3 genotyped: filtered, non-redundant">
##INFO=<ID=CDA,Number=0,Type=Flag,Description="Variation is interrogated in a clinical diagnostic assay">
##INFO=<ID=LSD,Number=0,Type=Flag,Description="Submitted from a locus-specific database">
##INFO=<ID=MTP,Number=0,Type=Flag,Description="Microattribution/third-party annotation(TPA:GWAS,PAGE)">
##INFO=<ID=OM,Number=0,Type=Flag,Description="Has OMIM/OMIA">
##INFO=<ID=NOC,Number=0,Type=Flag,Description="Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation.">
##INFO=<ID=WTD,Number=0,Type=Flag,Description="Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set. If all member ss' are withdrawn, then the rs is deleted to SNPHistory">
##INFO=<ID=NOV,Number=0,Type=Flag,Description="Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common.">
##INFO=<ID=GCF,Number=0,Type=Flag,Description="Has Genotype Conflict Same (rs, ind), different genotype. N/N is not included.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GT2,Number=1,Type=String,Description="Second Genotype">
##FORMAT=<ID=GT3,Number=1,Type=String,Description="Third Genotype">
##FORMAT=<ID=GC,Number=1,Type=Integer,Description="Genotype Count">
##FORMAT=<ID=GC2,Number=1,Type=Integer,Description="Second Genotype Count">
##FORMAT=<ID=GC3,Number=1,Type=Integer,Description="Third Genotype Count">
##FILTER=<ID=NC,Description="Inconsistent Genotype Submission For At Least One Sample">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA21295 NA21297 NA21300 NA21301 NA21303 NA21307 NA21308 NA21311 NA21312 NA21314 NA21316 NA21318 NA21320 NA21333 NA21336 NA21339 NA21344 NA21352 NA21353 NA21355 NA21356 NA21357 NA21359 NA21360 NA21362 NA21363 NA21364 NA21365 NA21367 NA21368 NA21370 NA21371 NA21378 NA21379 NA21381 NA21382 NA21384 NA21385 NA21387 NA21388 NA21390 NA21391 NA21399 NA21400 NA21402 NA21403 NA21405 NA21408 NA21414 NA21415 NA21417 NA21418 NA21420 NA21421 NA21423 NA21424 NA21434 NA21435 NA21436 NA21438 NA21440 NA21441 NA21447 NA21448 NA21451 NA21453 NA21454 NA21457 NA21473 NA21475 NA21476 NA21478 NA21479 NA21485 NA21486 NA21488 NA21489 NA21491 NA21493 NA21509 NA21510 NA21512 NA21513 NA21517 NA21519 NA21520 NA21521 NA21522 NA21523 NA21524 NA21526 NA21528 NA21529 NA21573 NA21574 NA21575 NA21576 NA21577 NA21578 NA21580 NA21582 NA21583 NA21587 NA21596 NA21597 NA21599 NA21600 NA21611 NA21613 NA21614 NA21615 NA21616 NA21617 NA21619 NA21620 NA21631 NA21632 NA21634 NA21635 NA21647 NA21650 NA21678 NA21682 NA21683 NA21685 NA21686 NA21689 NA21693 NA21716 NA21717 NA21719 NA21722 NA21723 NA21733 NA21738 NA21739 NA21740 NA21741 NA21768 NA21776 NA21784 NA21825 NA21826
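
As an illustration of how the ##INFO definitions in a header like the one above can be used, the following Python sketch reads the Number and Type of each tag and then decodes the INFO column of a record. It is a minimal example and not a complete VCF parser. The function names are made up for this note, the four header lines are copied from the sample header, and any INFO string passed to parse_info_field in your own tests should be treated as illustrative data rather than a real dbSNP record.

```python
import re

# Matches one ##INFO definition line of the kind shown in the sample header.
INFO_HEADER = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>.*)">$'
)

# A few definition lines copied from the sample header.
SAMPLE_HEADER = [
    '##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">',
    '##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">',
    '##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">',
    '##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available. The variant has individual genotype (in SubInd table).">',
]

_CASTS = {"Integer": int, "Float": float}


def parse_info_definitions(header_lines):
    """Map each INFO tag to its (Number, Type) pair."""
    definitions = {}
    for line in header_lines:
        match = INFO_HEADER.match(line.strip())
        if match:
            definitions[match.group("id")] = (match.group("number"), match.group("type"))
    return definitions


def parse_info_field(info, definitions):
    """Decode the INFO column of one record into a dict.

    Flags become True, Number=1 tags become a single value, and all other
    tags become a list. Tags missing from the header are kept as strings.
    """
    values = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, has_value, raw = item.partition("=")
        if not has_value:
            values[key] = True  # flag tag such as GNO
            continue
        number, value_type = definitions.get(key, (".", "String"))
        cast = _CASTS.get(value_type, str)
        if number == "1":
            values[key] = cast(raw)
        else:
            # "." marks a missing value and is kept as None.
            values[key] = [None if part == "." else cast(part) for part in raw.split(",")]
    return values


def parse_geneinfo(value):
    """Split a GENEINFO value into (gene symbol, gene id) pairs."""
    return [tuple(pair.split(":", 1)) for pair in value.split("|") if pair]
```

Reading the types from the header rather than hard-coding them keeps the sketch usable if tags are added or changed in a later build.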