Camelia, the Perl 6 bug

IRC log for #bioperl, 2010-05-04

| Channels | #bioperl index | Today | | Search | Google Search | Plain-Text | summary

All times shown according to UTC.

Time Nick Message
01:21 brandi joined #bioperl
01:21 brandi left #bioperl
02:58 melic joined #bioperl
04:17 splut joined #bioperl
05:06 cawiss joined #bioperl
06:05 bag_ joined #bioperl
06:57 cawiss joined #bioperl
10:20 cawiss left #bioperl
10:20 cawiss joined #bioperl
11:00 cawiss_ joined #bioperl
13:36 rbuels joined #bioperl
13:51 abooz joined #bioperl
14:53 jhannah o.  ooo..        oo..   ec2
15:38 spekki01 joined #bioperl
15:49 spekki01 joined #bioperl
15:53 abooz Hey guys, what can bioperl extract from a fasta file? I'm not being able to get anything other than the accession number and sequence. Are there any specialized parsers or do i go and do the parsing with a regex and then use them to populate each object?
15:53 abooz Cheers.
15:54 jhannah I think the definition of FASTA is >ID DESCRIPTION\nSEQUENCE
15:54 jhannah so that's all Bio::SeqIO does for you
15:54 jhannah if your descriptiong is mightly creative then you regex that our yourself
15:54 jhannah I could be wrong
15:55 jhannah http://www.bioperl.org/wiki/HOWTO:SeqIO
15:55 abooz people tend to try to fit many things in these tiny headers..
15:55 jhannah yup, but there's no standard for the crazy
15:57 abooz i guess regex it and then assign the attributes. Is there some kind of template to manipulate in order to make seqio understand each specific format perhaps?
15:58 jhannah you want to... um... you can investigate / write your own classes in the SeqIO/ directory ...
15:59 jhannah but I'm not sure that anyone else would want yours unless there's a published name for this new standard?
15:59 jhannah dunno... probably easiest just to subclass SeqIO into your own darkPAN module?
16:00 abooz yeah. I guess that's what i'll do.
16:00 abooz thanks!
16:08 * rbuels encodes his postgres database dumps base64 in his fasta description lines
16:13 jhannah ya, I like putting the sequences into the description too just so I have a backup
16:15 jhannah base64 of the associated GenBank and ENSEMBL entries are also handy in fasta descriptions
16:16 jhannah so you don't have to covert things
16:16 jhannah n
16:19 abooz so if i have something like that loc= complement(join(2691..4571,4918..5163)) in the fasta header, how would i go feeding it to a range object?
16:19 jhannah omg, really? ouch
16:20 jhannah you can't use genbank or ensembl formatted data?
16:21 abooz to be honest i haven't seen it but just preparing for the worst.
16:21 jhannah http://www.bioperl.org/wiki/HOWTO:F​eature-Annotation#Location_Objects  ?
16:29 cawiss joined #bioperl
16:55 spekki01 joined #bioperl
16:56 spekki01 I'm having some trouble getting eutils esearch to work here is the code im trying to run http://codepad.org/q5e0eyrT basically i start with a array of accession numbers and i want to get there corresponding gi number and then use efetch to get the entire record.
16:57 spekki01 here is the error i keep getting http://codepad.org/SO3j036f
17:00 jhannah spekki01: are you on bioperl-live?
17:01 spekki01 hmmm lemme check whats the current version number
17:01 jhannah not sure what latest "stable" is... I run out of svn/git
17:03 jhannah recent 3rd party interface errors are more likely corrected in the latest code. i can run your test
17:04 spekki01 im running 1.006
17:05 jhannah are you getting this?   Can't locate Bio/DB/EUtilities.pm in @INC
17:06 jhannah that's in the "Output" section of your first paste, which was a false alarm last time iirc
17:06 spekki01 no im not getting that error
17:07 spekki01 i guess i can try update my bioperl install
17:07 jhannah ha... you should probably stop using codepad then... your pastes confuse with those false alarms
17:07 jhannah i can try your code against trunk, please hold
17:07 spekki01 yeah sry about that
17:08 spekki01 the only error that i get is http://codepad.org/SO3j036f
17:08 jhannah and this is different from our Apr. 7 conversation?  :)
17:10 spekki01 im trying to think back
17:11 jhannah OK, bleed BioPerl throwing No History Data returned at efetch2.pl line 23.
17:12 jhannah last time was   -db => 'nuccore'    that does not seem to fix this time
17:12 spekki01 yeah nuccore was the wrong db i found out in the end
17:13 spekki01 the accessions numbers should be searched for in the nucleotide in order to get a gi number
17:16 jhannah your paren is funky after -term
17:18 spekki01 im just going by the example from http://www.bioperl.org/wiki/HOW​TO:EUtilities_Cookbook#esearch-.3Eesummary "Get GIs (as well as other information) for a list of accessions" near the bottom of the page
17:20 jhannah got the History part working:   http://github.com/jhannah/sandbo​x/blob/master/spekkio/efetch2.pl
17:21 jhannah um... the example does    -id => \@ids    yours was    -term => join (   and a missing paren
17:23 spekki01 well that example uses esummary im not sure if efetch has that same paren
17:23 jhannah ahh... maybe you meant this in the wiki    -term => join(',',@accs)      that one is fine, yours was missing the end paren to join
17:24 jhannah your join was also joining '-usehistory' and 'y'
17:24 spekki01 oh i see it now
17:24 jhannah :)
17:24 spekki01 ./facepalm
17:25 jhannah k, so now I'm down below on a new error      Can't call method "isa" without a package or object reference at /mnt/home/jhannah/src/bioperl-live/Bi​o/Tools/EUtilities/EUtilParameters.pm line 490
17:25 jhannah it doesn't like your   $factory -> set_parameters(    for some reason
17:26 spekki01 yeah i just saw that hmmm
17:28 spekki01 ah i put single quotes around the $hist variable
17:28 spekki01 when setting the paren for history in efetch
17:28 jhannah oh  laugh
17:28 jhannah ya
17:29 jhannah program now finishes without dying... but "Retrieved 0"
17:30 spekki01 i think thats ok, that retrieved thing is in blocks of 500 so 0 means the first block im assuming
17:31 jhannah cool. i pushed the function version to github
17:31 jhannah :)    /me clocks back in
17:50 spekki01 when using bio::seqio  is there a limit to the size of the file that can be fed to it?
18:03 jhannah just your RAM
18:04 jhannah if you stream and don't horde all your sequence objects you'll be fine
18:04 jhannah you can also tune what is parsed and what isn't
18:05 jhannah Bio::Seq::SeqBuilder   http://www.bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
18:09 spekki01 how does the streaming work because the file i have is 3gigs or can you link me an example of doing it?
18:10 jhannah just   while (my $seq = $in->next_seq) {   }     then let $seq go out of score
18:10 jhannah scope
18:10 jhannah and you should be fine
18:11 jhannah if you push every sequence into an array or something silly like that then a 3G file will be ... oh i dunno ... 20GB of perl RAM?
18:13 splut depends on how many objects are in the 3G file
18:14 splut but, more RAM than is available to a 32-bit process
18:14 jhannah um... do 3 gigabase sequences exist? of what?
18:15 splut chromosomes
18:16 jhannah hmm... human chr 1 is 247M base
18:16 splut you asked how much RAM a 3GB file would take up
18:16 splut I never specified what, just that a 3GB file would eat up more memory than a 32-bit process could have
18:17 jhannah right-o
18:29 splut and for the record. Amoeba Dubia has a genome of 670,000,000,000 bp
18:30 splut amoeba proteus has 290,000,000,000
18:31 spekki01 sry just finished lunch, its 3 gigs of like 700 records, basically i have an array of 700 accessin numbers which i need the full seqence for. anyway the code i was posting about earlier creates a file with all of these records in it one after another. i then need to go through this file with bio:seqio and find the coding sequence feature corresponding to the protein
19:00 jhannah wow... is that 670GB contiguous? do amoeboids have chromosomes?
19:00 jhannah entrez search finds no nucleotide hits... hm
19:00 jhannah not assembled I guess?
19:03 spekki01 at Jhannah: that code you helped me figure out earlier, with the file it makes how can i test if the file has a record in it for each of the accession numbers i had? reason is that im going to have a lot of accession numbers so i want to make sure that i get a record for each of them.
19:06 spekki01 nm i think i have an idea.
19:06 jhannah :)
19:07 jhannah if you're going to be doing lots of searches against 3GB of data I tend to pull out what I need to search on into a SQLite database and index it. makes searches for those specific things a jillion times faster
19:07 splut or use a variation on a grep search to find strings and offsets
19:13 dnewkirk joined #bioperl
19:18 dnewkirk Is there an efficient way to modify the output of Bio::Align to change the color of sequence that is different than the reference sequence? IE, anything that is different in the aligned sequence would be hightlighted.
19:19 dnewkirk Well, at least change the color of the characters, not the background
19:20 splut There is no Bio::Align module. There is an AlignIO, but that is a frontend for a number of formats
19:20 jhannah dnewkirk: nopaste your code?
19:28 dnewkirk I haven't started coding yet, was trying to get ideas before I spent lots of time coding something that will probably get used once by my post-doc friend :/
19:28 dnewkirk er, I meant AlignIO
19:29 jhannah you might be talking about a Bio::Graphic something or other. you'll have to send me to some sample code or a website or whever you got the idea
19:30 splut sounds like he just needs to get access to something like Sequencher
19:30 dnewkirk ok
19:30 splut can align sequences and shows how and where they differ
19:31 splut http://www.genecodes.com/
19:36 dnewkirk Thanks splut, that's what he needed
19:49 spekki01 can somone have a look at this code http://codepad.org/MIZkjO5m am i doing streaming right? because im going to be feeding bio:seqio a file with over 700 records in it and i don't want to overload anything.
19:57 jhannah spekki01: looks fine to me. if none of those 700 records by itself is bigger than your RAM you should be fine
19:58 jhannah does that program run ok for you?
19:58 jhannah and quickly?  against your 3GB file?
20:09 spekki01 sry was afk, im about to test the program on about 500 of those records so ill let you know how it goes so far it seems to be doing ok on small batches 62ish records
20:12 jhannah of course if you *really* only needed ACCESSION, grep is lots faster... :)
20:13 jhannah but usually SeqIO is oh-so awesome
20:16 spekki01 no i need to do more than that but i just used grabbing accession number as a test to make sure my file hade a records for each accession number i originally looked up
20:18 jhannah right  :)  SeqIO to the rescue! woot!
20:49 jhannah heh.... after 10,000 commits, svn diff -r1:2 takes a while
20:50 jhannah git++
21:37 kyanardag joined #bioperl
21:47 spekki01 hey jhannah: jfyi so i tested that code on a 3gig file and it took about an hour id say to download the full genbank record with the sequence data via esearch/efetch then parse through that data with bio::seqio. Ran faily well id have to say.
22:03 abooz joined #bioperl
22:06 jhannah spekki01: how long did the parse part take?
22:07 jhannah spekki01: fyi: my IRC client only highlights if 'jhannah' is the first set of characters. 'hey jhannah' doesn't cut it  :)
22:21 splut heh, might want to improve on that, jhannah :)
22:37 jhannah splut: that's pretty typical default behaviour. I use irssi
22:38 jhannah i'm sure theres a setting to highlight keywords, but I don't care enough
22:38 jhannah :)
22:40 dnewkirk if $_ =~ m/jhannah/ { $hide = TRUE; }
22:41 jhannah dnewkirk: /ignore is easier  :)
22:41 dnewkirk lol

| Channels | #bioperl index | Today | | Search | Google Search | Plain-Text | summary