Re: [DIYbio] 23andMe exomes

> The VCF format might be frustrating to you in that it fails to distinguish between "not sufficiently covered to make a genotype call" and "matches the reference genome". (It is theoretically possible to report this, but to date it's only been done in a base-by-base manner, which results in ridiculously large files.)

This sounds bad. So the 23andMe-files also report those SNPs which haven't been called as a match to the reference?

They don't report the SNPs which have been called as a homozygous match to reference. (Not sure if this was what you said.)
 
An example out of one of the VCF-files of the PGP:


1       753405  rs61770173      C       A       203.64  […excluded…]     GT:AD:DP:GQ:PL  0/1:17,16:34:99:234,0,489


So in this case the genotype is given as 0/1, an unphased C/A, but it also could be that the allele which is given as C wasn't called at all? But there is also information about the read-depth (DP), Genotype-Quality (GQ) and a phred-scaled likelihood (PL), so I could use those to determine how accurate the genotype-calling was, couldn't I?

Ah, you've cut out a lot of the important data in that line. I know, the "INFO" field is ugly... I think these catch-all fields end up getting used a lot because people didn't really know what they wanted when they invented the format? :-) You can find a description of VCF format here, btw:

The "C" is the variant, the "A" is the reference. This person is either a homozygous or heterozygous carrier of the variant. If the person had "A" homozygously, this line would not be reported.

From the full line:
1 753405 rs61770173 C A 445.79 MQFilter40 AC=2;AF=1.00;AN=2;DB;DP=20;Dels=0.0;FS=0.0;HRun=0;HaplotypeScore=0.0;MQ=28.45;MQ0=1;QD=22.29;SNPEFF_EFFECT=TRANSCRIPT;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=lincRNA;SNPEFF_GENE_NAME=FAM87B;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000326734 GT:AD:DP:GQ:PL 1/1:0,17:20:45.08:479,45,0

I interpret this as saying: AC=2 ... allele count is two, AF=1.00 allele frequency of the variant is "100%" ... so this position is homozygous for the variant in this data. What I believe to be a het line from another PGP exome:

1 753405 rs61770173 C A 203.64 MQFilter40 AB=0.515;AC=1;AF=0.50;AN=2;BaseQRankSum=2.916;DB;DP=34;Dels=0.0;FS=0.0;HRun=0;HaplotypeScore=0.0;MQ=38.34;MQ0=3;MQRankSum=-3.837;QD=5.99;ReadPosRankSum=0.197;SNPEFF_EFFECT=TRANSCRIPT;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=lincRNA;SNPEFF_GENE_NAME=FAM87B;SNPEFF_IMPACT=MODIFIER;SNPEFF_TRANSCRIPT_ID=ENST00000326734 GT:AD:DP:GQ:PL 0/1:17,16:34:99:234,0,489

I should mention: For openSNP we currently only aim to read the known SNPs (those with Rs-IDs).

... Now up to 41 million (dbSNP 134)? It's getting close to having nearly every variant in a genome as "known" by dbSNP. There's only around 3 million nonreference variants in a given genome, so at some point reporting genotypes for "all dbSNP IDs" gets worse than simply reporting the positions that are non-reference.

--
You received this message because you are subscribed to the Google Groups "DIYbio" group.
To view this discussion on the web visit https://groups.google.com/d/msg/diybio/-/17zcKKBpiokJ.
To post to this group, send email to diybio@googlegroups.com.
To unsubscribe from this group, send email to diybio+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/diybio?hl=en.

  • Digg
  • Del.icio.us
  • StumbleUpon
  • Reddit
  • RSS

0 comments:

Post a Comment