Re: [DIYbio] full genome sequencing and exome data storage

Thinking about whole-genome storage: the human genome is about 3.2 Gbp in size. That's a base-four data type, if you discount methylation and other forms of chemical DNA modification.

However, how do they encode that to binary? Do they store it as ASCII text, which in practice uses a full byte (8 bits) to store each character? Encoding a base-four datatype like a/c/t/g to binary 1/0 needs only two bits per base, so storing 8 bits per base (ASCII or similar) would be inflating the data 4x, making your 3.2 Gbp genome occupy 3.2 GB instead of a minimum of 0.8 GB before compression.

Or am I wildly off the mark? :)
That accounts only for the actual base sequence of course; your annotations etc. will still outstrip the DNA vastly in size, although they will probably have a better compression ratio.
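
To put numbers on that, here is a rough Python sketch of the arithmetic. It assumes a fixed number of bits per base and ignores headers, annotations, and compression; the function name storage_bytes is just something I made up for illustration.

```python
GENOME_BASES = 3_200_000_000  # 3.2 Gbp, the figure used above


def storage_bytes(n_bases, bits_per_base):
    """Bytes needed for n_bases at a fixed bit width, rounded up to whole bytes."""
    return (n_bases * bits_per_base + 7) // 8


# Compare one byte per base (ASCII), four bits, and the two-bit minimum.
for bits in (8, 4, 2):
    size_gb = storage_bytes(GENOME_BASES, bits) / 1e9
    print(f"{bits} bits/base: {size_gb:.1f} GB")
```

That gives 3.2 GB at a byte per base, 1.6 GB at four bits, and 0.8 GB at two bits, using decimal gigabytes.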

Has anyone looked into a direct basepair->bit codec for minimisation of genome storage? Even a compression from a full byte per base down to four bits would halve storage size while leaving plenty of headroom (16 possible symbols) for modified bases etc.
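
For what it's worth, the two-bit version of such a codec is only a few lines. The sketch below is my own illustration, not any particular tool's format: the A/C/G/T to 0/1/2/3 mapping, the bit order within each byte, and the names pack and unpack are all assumptions. It only handles the four plain bases (an N or a modified base raises KeyError), and the sequence length has to be stored separately because the last byte may be padded.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, two bits per base
BASES = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        # The first base of each group of four goes in the lowest two bits.
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n_bases):
    """Recover n_bases bases from bytes produced by pack()."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

Seven bases fit in two bytes here instead of seven. As far as I know, the UCSC .2bit format is built on the same four-bases-per-byte idea, with Ns recorded separately.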

On 23 November 2012 23:42, Giovanni <giovanni.lostumbo@gmail.com> wrote:
A bit of a speculative post into the future. Reading about new ventures into exome sequencing and data amounting to >6GB just for the base pairs, I was curious about the media formats that are used, such as DVDs. A couple of articles got me thinking that full genome sequencing (3x10^9 base pairs and about 50TB of storage) would be possible on several 4-terabyte HDDs, but terabyte (1-15TB) optical discs like the ones by Fujifilm may make portable genomic data a lot simpler to handle. A few links I found on it:
http://fudzilla.com/home/item/29581-1tb-optical-discs-coming-in-2015
http://www.tweaktown.com/news/26908/1tb_optical_discs_are_coming_but_you_ll_have_to_wait_until_2015/index.html
http://news.yahoo.com/1-000-genome-almost-ready-111300774.html
http://www.genomeweb.com/clinical-genomics/23andme-opens-research-portal-outside-investigators-effort-advance-genomics-know
http://www.wired.com/wiredscience/2012/11/social-codes/
http://www.kinexus.ca/pdf/graphs_charts/HumanGenomeSequence.pdf

I guess if full genome sequencing is available, the most practical storage medium would be one that doesn't comprise a major part of the cost of sequencing. I think the cost of TB discs, like Blu-ray, and DVDs before them, might be as little as $0.20 to a few dollars each, but their price might not be as low if TB discs have popular adoption, which would come with UltraHD cinema discs for 4K resolution televisions and Playstation 4 discs (if they exceed 25-50GB). The idea of an entire genome fitting on just a few optical discs instead of 50 is actually a little encouraging.

--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diybio@googlegroups.com. To unsubscribe from this group, send email to diybio+unsubscribe@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org
---
You received this message because you are subscribed to the Google Groups "DIYbio" group.
To post to this group, send email to diybio@googlegroups.com.
To unsubscribe from this group, send email to diybio+unsubscribe@googlegroups.com.
Visit this group at http://groups.google.com/group/diybio?hl=en.
To view this discussion on the web visit https://groups.google.com/d/msg/diybio/-/WAatPz52dX4J.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
www.indiebiotech.com
twitter.com/onetruecathal
joindiaspora.com/u/cathalgarvey
PGP Public Key: http://bit.ly/CathalGKey

