
IRC log for #bioruby, 2012-04-25


All times shown according to UTC.

Time Nick Message
00:59 shevy joined #bioruby
05:51 ilpuccio joined #bioruby
06:01 ilpuccio joined #bioruby
10:41 marjan joined #bioruby
11:05 pjotrp hi marjan
11:13 marjan hi
12:00 pjotrp an application for gff3 is a genome browser
12:01 pjotrp marjan: http://www.wormbase.org/species/c_elegans/gene/WBGene00000003#0-9e-3
12:02 marjan thx
12:03 pjotrp obviously they use a DB now
12:03 pjotrp but if the parser was fast enough...
12:03 pjotrp the original data is in gff3 - see the ftp area
12:04 pjotrp this is even clearer: http://www.wormbase.org/tools/genome/gbrowse/c_elegans/?name=V:9243999..9246453
12:05 pjotrp if you click on the floppy disk icon you can even download GFF3 tracks
12:06 pjotrp Wormbase was an early adopter of GFF3
12:06 marjan ok, I'll take a better look at their website.
13:22 ilpuccio joined #bioruby
13:31 rbuels pjotrp, marjan: no matter how fast a gff3 parser is, it can't really be used to drive a genome browser.  the data has to be indexed in some way for lookups by the genomic range, and by the attrs.  many people use a RDBMS for this indexing, but that's certainly not the only way.
13:33 rbuels pjotrp, marjan: jbrowse, for example, has a flat-file format for storing the data for range lookups
13:33 rbuels (which is nested containment lists encoded as JSON)
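
The range-lookup idea rbuels describes can be sketched with a small nested containment list. This is only an illustration of the general technique, not JBrowse's actual JSON layout; the names `build_nclist` and `query`, and the tuple shape `(start, end, payload, children)`, are assumptions made for this example. Coordinates are treated as inclusive, as in GFF3.

```python
def build_nclist(intervals):
    """Nest (start, end, payload) tuples so that no sibling contains another.

    Returns a list of (start, end, payload, children) nodes. Hypothetical helper.
    """
    # Sort by start ascending, then end descending, so containers come first.
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    top = []
    stack = []  # chain of containers that are still "open"
    for start, end, payload in ordered:
        node = (start, end, payload, [])
        # Close every container that cannot hold this interval.
        while stack and end > stack[-1][1]:
            stack.pop()
        (stack[-1][3] if stack else top).append(node)
        stack.append(node)
    return top


def query(nodes, qs, qe):
    """Return payloads of all intervals overlapping [qs, qe]."""
    # Within one sibling list, ends are increasing, so binary search
    # finds the first node whose end reaches the query start.
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid][1] < qs:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    for i in range(lo, len(nodes)):
        start, end, payload, children = nodes[i]
        if start > qe:
            break  # starts are sorted too, so nothing later can overlap
        hits.append(payload)
        hits.extend(query(children, qs, qe))
    return hits
```

Because the nested structure is just lists of tuples, it could be serialized to JSON and queried without re-parsing the GFF3, which is the point rbuels makes about needing an index rather than only a fast parser.
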
13:34 rbuels pjotrp, marjan: but JBrowse needs a fast GFF3 parser so that creating this indexed representation doesn't take for-freaking-ever
13:34 marjan :)
13:35 rbuels and since JBrowse's formatting stuff is in perl, we use Bio::GFF3::LowLevel::Parser, which is probably the current fastest perl parser
13:35 marjan rbuels: ok, thx for your info
13:36 marjan So, that would be one usecase for my parser.
13:36 marjan Because the perl parser isn't parallel, right?
13:37 rbuels marjan: no, i'm not aware of any parallel gff3 parsers
14:00 pjotrp rbuels: I am not saying it is useful for (popular) web services. But we
14:00 pjotrp have a genome browser for QTL mapping - reading GFF3 on the fly is feasible
14:00 pjotrp and kinda interesting, as we generate GFF3 on the fly too ;)
14:01 rbuels only if it's very small gff3.
14:01 rbuels a fast parser could move the definition of what 'very small' is, but there's a limit to how far.
14:01 pjotrp My bio-gff3 parser can parse 1Gb in 2 minutes. 8 cores would make that 15 seconds. Acceptable
14:02 pjotrp the D parser will be magnitudes faster.
14:02 pjotrp At least, that is what I hope ;)
14:03 rbuels even if you are super-duper smart about how you divide the tasks among the workers, the parse time isn't going to scale even close to linearly with the number of cores.
14:04 * rbuels thinks about it a bit
14:04 rbuels if you give workers the chunks of gff3 between '###' marks, you might get close to linear actually
14:04 pjotrp sorry, got my numbers a bit optimistic. But it is possible.
14:04 pjotrp yes
14:04 pjotrp you need to parse a sorted file
14:05 rbuels ordered by line-dependency, not by genomic coordinate.
14:05 pjotrp the chunks I call blocks, and yes
14:05 pjotrp this is what we call an 'optimistic' parser
14:06 rbuels you can't write a parser that will break if the gff3 input is not very nice.
14:06 pjotrp absolutely
14:07 * rbuels nods
14:07 pjotrp we can fall back on more robust parsers for that
14:07 pjotrp this is for speed and 90% of cases
14:07 pjotrp at least, that is the idea
14:08 pjotrp also we intend it as a bit of a show case
14:08 pjotrp on D and parallelization
14:08 rbuels i wouldn't recommend falling back to other parsers, that would be a pain to use.
14:08 rbuels if you can parse a *chunk* correctly, no matter how horrible
14:08 rbuels or a block
14:08 rbuels calling them blocks.
14:09 rbuels and the parallelization is according to blocks.
14:09 rbuels then the degenerate case would be no parallelization, parsing the whole file as a single block.
14:09 rbuels which is no worse than current parsers.
14:09 pjotrp yes
14:09 pjotrp in theory
14:09 rbuels and if your file has blocks, you can parallelize
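
The block-based scheme discussed above can be sketched as follows: split the input at `###` lines, parse each block independently, and treat a file without `###` marks as a single block. This is a minimal sketch under stated assumptions, not the bio-gff3 or D parser; `split_blocks`, `parse_block`, and `parse_gff3` are hypothetical names, malformed lines are simply skipped, and a `##FASTA` section is not handled. A thread pool is used to keep the example self-contained; in CPython a process pool would likely be needed to see a real speedup on CPU-bound parsing.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote


def split_blocks(text):
    """Split GFF3 text into lists of feature lines, one list per '###' block."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip() == "###":
            if current:
                blocks.append(current)
                current = []
        elif line and not line.startswith("#"):
            current.append(line)  # skip blank lines, comments and directives
    if current:
        blocks.append(current)
    return blocks


def parse_block(lines):
    """Parse one block into feature dicts; no state is shared between blocks."""
    features = []
    for line in lines:
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # sketch only: ignore malformed lines
        attrs = {}
        for pair in cols[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
        features.append({
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "attributes": attrs,
        })
    return features


def parse_gff3(text, workers=4):
    """Parse blocks concurrently; with no '###' marks this is one serial block."""
    blocks = split_blocks(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_block, blocks))
```

The design choice here mirrors rbuels's point: because `###` promises that all forward references are resolved, Parent/ID links never cross a block boundary, so each worker can resolve them locally.
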
14:10 pjotrp but with this parser we may not take all edge cases
14:10 pjotrp like circular references
14:10 pjotrp GFF3 allows for that
14:10 * rbuels nods
14:10 pjotrp but maybe, in time, it will be a full implementation
14:11 rbuels with circular references, you just output data structures that have circular references.
14:11 rbuels that's what the perl lowlevel parser does.
14:11 pjotrp it is one interpretation ;)
14:11 pjotrp and just one example I am aware of
14:12 rbuels course, that means in perl that it will never be freed  :-/
14:12 pjotrp there are always problems in a flexible format
14:12 rbuels does ruby have a mark-and-sweep GC?
14:12 * rbuels hopes so
14:12 * rbuels looks it up
14:12 pjotrp uhm
14:12 pjotrp first there are more Rubies
14:12 rbuels oh right
14:12 pjotrp second there are pluggable GC implementations
14:12 pjotrp at least on the JVM
14:13 pjotrp hard to answer right ;)
14:13 rbuels sounds like 'mostly yes'
14:13 pjotrp yes
14:14 pjotrp if you choose so
14:14 pjotrp the default is actually mark and sweep in 1.9, I think
14:15 pjotrp http://patshaughnessy.net/2012/3/23/why-you-should-be-excited-about-garbage-collection-in-ruby-2-0
14:15 pjotrp stuff people get excited about :/
14:19 rbuels ah, that's pretty cool!
14:19 * rbuels is kind of excited
14:20 rbuels glad to see ruby maturing
14:24 pjotrp it is actually an interesting read
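
The concern about circular references never being freed under pure reference counting can be illustrated with a small sketch. It uses CPython rather than Perl or Ruby, purely because CPython lets the cycle collector be switched off: with it disabled, only reference counting remains, and a parent/child cycle stays alive until a tracing pass runs. The `Feature` class and `cycle_survives_refcounting` function are hypothetical names for this example, and CPython's cycle detector is not the same algorithm as Ruby's mark-and-sweep collector.

```python
import gc
import weakref


class Feature:
    """Stand-in for a parsed feature that may point back at another feature."""

    def __init__(self, name):
        self.name = name
        self.parent = None


def cycle_survives_refcounting():
    """Return (alive_before_collect, alive_after_collect) for a two-node cycle."""
    gc.disable()  # leave only reference counting active
    try:
        a, b = Feature("a"), Feature("b")
        a.parent, b.parent = b, a  # circular reference, as GFF3 permits
        probe = weakref.ref(a)  # observe the object without keeping it alive
        del a, b  # refcounts never reach zero because of the cycle
        alive_before = probe() is not None
        gc.collect()  # an explicit tracing pass finds the unreachable cycle
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        gc.enable()
```
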
17:29 ilpuccio joined #bioruby
21:37 ilpuccio joined #bioruby
23:30 pyrimidine joined #bioruby
