Oops, I mean the "C" is reference and "A" is variant -- swap all my letters, I got the columns confused, I should've reviewed that link first!
On Thursday, May 17, 2012 6:52:06 PM UTC-4, Madeleine Ball wrote:
--> The VCF format might be frustrating to you in that it fails to distinguish between "not sufficiently covered to make a genotype call" and "matches the reference genome". (It is theoretically possible to report this, but to date it's only been done in a base-by-base manner, which results in ridiculously large files.)
This sounds bad. So the 23andMe-files also report those SNPs which haven't been called as a match to the reference?They don't report the SNPs which have been called as a homozygous match to reference. (Not sure if this was what you said.)An example out of one of the VCF-files of the PGP:
…
1 753405 rs61770173 C A 203.64 […excluded…] GT:AD:DP:GQ:PL 0/1:17,16:34:99:234,0,489
…
So in this case the genotype is given as 0/1, an unphased C/A, but it also could be that the allele which is given as C wasn't called at all? But there is also information about the read-depth (DP), Genotype-Quality (GQ) and a phred-scaled likelihood (PL), so I could use those to determine how accurate the genotype-calling was, couldn't I?Ah, you've cut out a lot of the important data in that line. I know, the "INFO" field is ugly... I think these catch-all fields end up getting used a lot because people didn't really know what they wanted when they invented the format? :-) You can find a description of VCF format here, btw:The "C" is the variant, the "A" is the reference. This person is either a homozygous or heterozygous carrier of the variant. If the person had "A" homozygously, this line would not be reported.From the full line:1 753405 rs61770173 C A 445.79 MQFilter40 AC=2;AF=1.00;AN=2;DB;DP=20;Dels=0.0;FS=0.0;HRun=0; HaplotypeScore=0.0;MQ=28.45; MQ0=1;QD=22.29;SNPEFF_EFFECT= TRANSCRIPT;SNPEFF_FUNCTIONAL_ CLASS=NONE;SNPEFF_GENE_ BIOTYPE=lincRNA;SNPEFF_GENE_ NAME=FAM87B;SNPEFF_IMPACT= MODIFIER;SNPEFF_TRANSCRIPT_ID= ENST00000326734 GT:AD:DP:GQ:PL 1/1:0,17:20:45.08:479,45,0 I interpret this as saying: AC=2 ... allele count is two, AF=1.00 allele frequency of the variant is "100%" ... so this position is homozygous for the variant in this data. What I believe to be a het line from another PGP exome:1 753405 rs61770173 C A 203.64 MQFilter40 AB=0.515;AC=1;AF=0.50;AN=2;BaseQRankSum=2.916;DB;DP=34; Dels=0.0;FS=0.0;HRun=0; HaplotypeScore=0.0;MQ=38.34; MQ0=3;MQRankSum=-3.837;QD=5. 99;ReadPosRankSum=0.197; SNPEFF_EFFECT=TRANSCRIPT; SNPEFF_FUNCTIONAL_CLASS=NONE; SNPEFF_GENE_BIOTYPE=lincRNA; SNPEFF_GENE_NAME=FAM87B; SNPEFF_IMPACT=MODIFIER;SNPEFF_ TRANSCRIPT_ID=ENST00000326734 GT:AD:DP:GQ:PL 0/1:17,16:34:99:234,0,489 I should mention: For openSNP we currently only aim to read the known SNPs (those with Rs-IDs).... Now up to 41 million (dbSNP 134)? It's getting close to having nearly every variant in a genome as "known" by dbSNP. There's only around 3 million nonreference variants in a given genome, so at some point reporting genotypes for "all dbSNP IDs" gets worse than simply reporting the positions that are non-reference.
You received this message because you are subscribed to the Google Groups "DIYbio" group.
To view this discussion on the web visit https://groups.google.com/d/msg/diybio/-/e1NanOC42j8J.
To post to this group, send email to diybio@googlegroups.com.
To unsubscribe from this group, send email to diybio+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/diybio?hl=en.






0 comments:
Post a Comment