IRC log for #bioperl, 2009-08-16

All times shown according to UTC.

Time Nick Message
02:47 tacolou joined #bioperl
02:49 tacolou Hi.  Not specifically a bioperl question, but what is the preferred format for storing a full genome sequence and allowing FAST random access?
02:52 deafferret like you need to jump around all over the sequence, reading small subseqs?
02:53 deafferret by location?
02:53 deafferret (I'm jhannah, by the way.)
02:54 deafferret I haven't heard anyone ask this specific question before, in any forum, I don't think
02:54 tacolou Yes.  Pull sequences as fast as possible using genomic coordinates.
02:56 deafferret As a total guess, I'd think perhaps splitting the genome into 100K chunks, shoving those into a database, and indexing by start_position, might be the fastest?
02:57 tacolou I considered a database, but it's too much preprocessing overhead for this.
02:58 deafferret oh? SQLite is pretty lightweight. I would think a genome would chop into SQLite in less than a minute...
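The chunk-and-index idea deafferret floats at 02:56 could look roughly like the Python sketch below. It is only an illustration under assumptions: the table name `chunks`, the column names other than `start_position`, and the function names `build_index` and `fetch_region` are hypothetical, and coordinates are treated as 0-based, half-open. Nothing here was benchmarked in the channel.

```python
import sqlite3

CHUNK = 100_000  # chunk size suggested in the log; tune as needed


def build_index(conn, chrom, sequence, chunk=CHUNK):
    """Split one chromosome into fixed-size chunks keyed by start position."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "chrom TEXT, start_position INTEGER, seq TEXT, "
        "PRIMARY KEY (chrom, start_position))"
    )
    rows = (
        (chrom, i, sequence[i:i + chunk])
        for i in range(0, len(sequence), chunk)
    )
    conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)", rows)
    conn.commit()


def fetch_region(conn, chrom, start, end, chunk=CHUNK):
    """Return sequence[start:end] by reading only the chunks that overlap it."""
    if end <= start:
        return ""
    first = (start // chunk) * chunk  # start of the chunk containing `start`
    cur = conn.execute(
        "SELECT seq FROM chunks WHERE chrom = ? "
        "AND start_position >= ? AND start_position < ? "
        "ORDER BY start_position",
        (chrom, first, end),
    )
    joined = "".join(row[0] for row in cur)
    return joined[start - first:end - first]
```

The primary key on (chrom, start_position) doubles as the index, so a lookup touches only the one or two chunks that overlap the requested range.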
02:59 deafferret i'm not sure how to make random access extremely fast without cutting and indexing
02:59 deafferret but I'm not a low-level seek sort of guy...
03:00 tacolou The best I came up with is splitting by chromosome and accessing a given byte position with seek.
03:00 deafferret have you benchmarked that? too slow?
03:01 tacolou It's actually very fast, but there are other more compact formats like 2bit that may be even smaller and just as fast.  Wanted to see if anyone else has tackled this.
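The seek-based approach tacolou describes at 03:00 might be sketched in Python as follows. The assumptions are loud ones: one FASTA file per chromosome, a single header line, every sequence line except the last exactly `line_width` bases wide, and single-byte "\n" line endings. The names `fasta_offset` and `fetch_fasta` are hypothetical, and coordinates are 0-based, half-open.

```python
import io


def fasta_offset(pos, header_len, line_width):
    """Byte offset of base `pos`, skipping the header and one newline per full line."""
    return header_len + pos + pos // line_width


def fetch_fasta(fh, start, end, line_width):
    """Return bases [start, end) from a seekable binary handle on a one-record FASTA."""
    fh.seek(0)
    header_len = len(fh.readline())  # header line including its trailing newline
    lo = fasta_offset(start, header_len, line_width)
    hi = fasta_offset(end, header_len, line_width)
    fh.seek(lo)
    # The byte range may span line breaks, so strip the newlines after reading.
    return fh.read(hi - lo).replace(b"\n", b"").decode("ascii")


# Tiny in-memory demo; a real call would pass open(path, "rb") instead.
demo = io.BytesIO(b">chr1\nACGT\nTTGA\nCC\n")
print(fetch_fasta(demo, 2, 6, line_width=4))
```

Because the offset is computed arithmetically, each lookup costs one seek and one read regardless of chromosome length.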
03:03 deafferret so you're worried about storage space too?
03:05 tacolou Not so much storage space.  Since I can't possibly be the first to need this, I was looking for some community consensus about what the correct and fastest way to do this might be.
03:05 deafferret right. much smarter people than I will be here during the day Monday. usually pretty quiet weekends, though
03:06 deafferret I'd like to hear people's ideas
03:07 tacolou I have a workable solution for now, so maybe I'll leave it.  I was more curious about the utility of alternative formats like 2bit and nibble for doing this kind of stuff.  They have the advantage of being more compact, but may come at the cost of random access speed.
03:08 tacolou Thanks.
03:08 deafferret I
03:09 deafferret I've used 2bit before, for something... mummer I think. but I never got into the details
03:09 ptl joined #bioperl
03:11 tacolou It's a binary format that stores A,T,C,G in two bits and N's and other "masks" in another set of bits.
03:11 tacolou Entire human genome with masking data fits in about 750MB.
03:13 deafferret "another set of bits" ?
03:15 tacolou http://genome.ucsc.edu/FAQ/FAQformat#format7
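To make the two-bits-per-base idea concrete, here is a minimal Python sketch of the packing step only. It uses the base codes given in the UCSC description linked above (T=00, C=01, A=10, G=11, first base in the most significant bits of each byte). It is not a reader for real .2bit files, which also carry a header, a sequence index, and separately recorded blocks for runs of N and for masked regions. The names `pack` and `unpack_range` are hypothetical.

```python
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}  # 2-bit codes per the UCSC description
BASES = "TCAG"                           # inverse lookup, indexed by code


def pack(seq):
    """Pack an uppercase A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base 0 of each byte sits in bits 7-6, base 3 in bits 1-0.
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_range(packed, start, end):
    """Decode bases [start, end) directly from the packed bytes."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(start, end)
    )
```

Random access stays cheap in this layout because base i always lives in byte i // 4, so a lookup is still a single offset computation followed by some bit shifting. At four bases per byte, roughly three billion bases come to about 750 MB, in line with the figure quoted above.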
05:13 ptl_ joined #bioperl
14:53 kyanardag_ joined #bioperl
19:56 kyanardag_ joined #bioperl