Camelia, the Perl 6 bug

IRC log for #bioperl, 2012-03-12

| Channels | #bioperl index | Today | | Search | Google Search | Plain-Text | summary

All times shown according to UTC.

Time Nick Message
00:54 gtuckerkellogg joined #bioperl
01:09 CIA-66 bioperl-live: Florent Angly amplicons * rbed3729 / Bio/SeqFeature/SubSeq.pm : Fixed error in documentation - http://git.io/LQvSOA
06:26 j_wright joined #bioperl
07:35 j_wright joined #bioperl
08:40 leont joined #bioperl
11:18 gtuckerkellogg joined #bioperl
11:32 leont joined #bioperl
12:58 donut_1 joined #bioperl
12:58 donut_1 hello everyone
13:00 donut_1 I'm afraid I have yet another problem related to gaining sequences. Pyrimidine has gotten me nearly all the way there.
13:00 donut_1 The perl script just is not adding up.  If I specify the start/stop point it doesn't seem to be giving me the right data.
13:01 donut_1 For instance, say I want to get the first 10 bases before the gene.  I set my start point as "-10" and my stop point as "0". This should give me the first 10 bases before teh gene seq. that I've specified
13:01 donut_1 unless of course i'm misunderstanding what this code does
13:01 donut_1 gist.github.com/2021686
13:01 donut_1 http://gist.github.com/2021686
13:02 donut_1 in this particular example, I want to get the first 5k bases upstream
13:03 donut_1 Okay so here's whats happening.  Say I specify my start as "0" and stop as "1".  I get X number of bases.  If I change my start to "-5000" I get X+5000 bases.
13:27 rbuels joined #bioperl
13:59 scottcain joined #bioperl
14:00 pyrimidine donut_1: NCBI coordinates (as well as bioperl ones) are both 1-based, and the start is always less than the end
14:01 ank joined #bioperl
14:01 pyrimidine so, if you were looking for the upstream region of a gene (e.g. promoter) you have to take strand into account
14:01 donut_1 Hey pyrimidine
14:02 pyrimidine o/
14:02 donut_1 Well what I'm thinking is, run the seq script to pull 5k+Gene then run again to pull gene, then post diff the two.  The 5k I'll pump to its own file and the gene to its own file.
14:03 donut_1 essentially my colleague needs the 5k seq and gene seq to be split up
14:04 pyrimidine what if the gene started at 3000?
14:04 pyrimidine just need the upstream 3000 in that case?
14:04 donut_1 Oh.
14:05 pyrimidine it sounds like you only need the promoter regions
14:05 donut_1 According to her, she needs 5k upstream and downstream
14:05 donut_1 so 10k bases total
14:05 pyrimidine biomart may still be your best bet if you are using ensembl IDs
14:05 donut_1 well i have all the uids now thanks to you
14:06 pyrimidine ok
14:07 donut_1 so the easiest thing to do for me is to iterate thruogh all the UIDs twice.  One UID being pumped to a script that gives me the gene seq. and then the same UID being pumped to a second script that gives me the 5k upstream
14:07 donut_1 i'm still just toying around with ideas atm
14:11 pyrimidine you could do something like, check the start/end for each gene, recalculate the retrieved sequence based on what you want to retrieve (w/ the caveat that there may be less than 5000bp available upstream or downstream), then just split the retrieved sequence up internally using substr() using the offsets
14:11 donut_1 Wow.  I assume you are referring to Perl syntax, something with which I have no background in. :(
14:12 pyrimidine not completely
14:12 pyrimidine python has a substring command IIRC
14:12 pyrimidine actually, IIRC it's even simpler in python, something like a[0..9]
14:13 * pyrimidine has to look that up
14:14 pyrimidine x = s[3:9]
14:14 pyrimidine range symbol was throwing me
14:18 pyrimidine heh, actually, looking again at how NCBI stores genes, start > end means the gene is on the opposite strand
14:20 pyrimidine however, it doesn't matter if you are pulling from the actual sequence, in that case start has to be < end IIRC (unless NCBI has changed things, which is entirely possible)
14:20 donut_1 hey sorry we had a fire alarm, i had to step out
14:21 pyrimidine s'okay
14:21 pyrimidine you can backscroll through the magic of IRC
14:21 pyrimidine or, you can use the log :)
14:22 pyrimidine http://irclog.perlgeek.de/bioperl/2012-03-12
14:22 donut_1 Oh I can see what you wrote me :), I just wasn't here to respond right away
14:23 carandraug joined #bioperl
14:34 donut_1 Well I'm officially lost now.  I can't figure this out for some reason.
14:34 donut_1 I can't imagine getting 5k up + gene + 5k down in separate files being very difficult though.
14:36 pyrimidine right, but you should only need to retrieve the sequence once
14:36 pyrimidine not three times
14:36 donut_1 That would make this go a lot quicker for sure.
14:36 donut_1 I sleep the script 3 seconds after every query.
14:37 pyrimidine you can reduce that sleep to 1 seq
14:37 pyrimidine s/seq/sec/
14:37 pyrimidine biologist freudian slip
14:37 donut_1 True but I dont want the script to advance faster than my internet connection
14:38 donut_1 i've no safeguards built in if i hit a query error
14:38 pyrimidine you can technically submit 3 queries per second, unless they've changed the rules
14:38 pyrimidine that is true
14:39 donut_1 this thing will have to run overnight anyway so I'd rather be on the safer side
14:39 pyrimidine but you can check for errors using an eval {} block and rerun them if there is a problem
14:41 donut_1 ah but then I'll have to learn Perl to figure out how to make that work. :(
14:42 pyrimidine well, this *is* the #bioperl channel
14:42 pyrimidine :)
14:42 donut_1 :D
14:42 donut_1 Very true.
14:42 donut_1 Well I know how to modify the chrstart chrstop values
14:42 donut_1 As long as I know what exactly I'm getting as far as the SEQ goes, I'm golden.
14:42 scottcain joined #bioperl
14:43 donut_1 Because I can do some diffs and separate the data.
14:45 pyrimidine not sure the diff approach will work based on the returned format, but it will be interesting to see how things go
14:45 donut_1 So.  Okay, going back to my code example.  If I set chrstart = -5000, then I should be getting 5k + gene
14:46 donut_1 then if I set it to +0 (leaving chrstop = +0) I should get gene.
14:46 pyrimidine right
14:46 donut_1 so if I subtract off the gene seq from the first query, I should be left with 5k bases
14:46 pyrimidine though, again, that is dependent on the strand
14:47 donut_1 Well I don't recall the script giving me an option regarding strand.
14:47 pyrimidine sure
14:47 pyrimidine my $strand = $item_data{ChrStart} > $item_data{ChrStop} ? 2 : 1;
14:47 pyrimidine but, I think efetch will adjust that for you
14:48 donut_1 Oh right. There it is.
14:48 donut_1 So, wow okay so the seq I'm getting could be strand 1 or 2 seqs?
14:48 donut_1 Bah this is getting much more complicated
14:48 scottcain_ joined #bioperl
14:49 pyrimidine heh, welcome to bioinformatcs
14:49 pyrimidine *bioinformatics
14:49 donut_1 I'm just glad you're here so I can bounce ideas off to
14:49 donut_1 Otherwise I'd be completely lost.
14:49 donut_1 Well I kinda am still but not as much
14:51 pyrimidine let's take the simplest examples, where the gene start < end
14:51 pyrimidine in  this case, the strand is 1
14:52 pyrimidine when you retrieve the sequence based on that, for the *genome* the sequence start is always less than the sequence end
14:53 pyrimidine the strand indicates the direction the gene is transcribed (DNA is double-stranded)
14:54 pyrimidine so for the case where the *gene* coordinates have a start < end, the *genome* coordinates will match up (start < end), and the strand will be 1
14:55 pyrimidine where it gets a little confusing is when the *gene* coordinates have a start > end
14:55 pyrimidine in this case, to get the right sequence information from the chromosome (again where start must be < end), you have to do two things:
14:56 pyrimidine (1) strand = 2
14:56 donut_1 Oh right.
14:56 donut_1 This is starting to sound familiar from biochem
14:56 donut_1 Its been a while
14:56 donut_1 Well, my colleague never said or made the distinction about strand 1 or 2
14:57 pyrimidine ah, probably assumed you knew that
14:57 donut_1 crap
14:57 donut_1 Well lets assume we only care about strand 1
14:57 donut_1 or maybe thats not a good idea...
14:58 pyrimidine I was going to say (2) switch start and end, but it looks as if the script doesn't do that
15:00 pyrimidine kind of odd, actually, why they would need a strand in that case, may have to check that out to see if it makes a difference
15:01 donut_1 *sigh
15:03 pyrimidine again, welcome to the world of bioinformatics :P
15:03 donut_1 so here's something odd
15:03 pyrimidine however, this did work fine for me a number of years ago
15:04 donut_1 If i get seq for UID... using chrstart -5000 and chrstop +0, I get 36879 bases.  If I use +0 +0 I get 41963 bases
15:04 donut_1 seems kinda backwards
15:05 pyrimidine which UID?
15:05 pyrimidine I can use it to debug
15:05 donut_1 57147
15:05 pyrimidine have the original Ensembl ID for that one?
15:05 donut_1 ENSG00000000457
15:06 donut_1 Man I'm supposed to be writing two papers right now, deadline next week for revision/submission to journals, but I'm trying to help a friend.
15:06 rbuels joined #bioperl
15:06 donut_1 I didnt realize it would be this difficult lol.
15:07 pyrimidine NCBI never makes it very easy ;)
15:07 pyrimidine does she need the GenBank format, or can it be simple FASTA/raw?
15:08 donut_1 fasta
15:08 donut_1 my script does all that though
15:08 donut_1 so i get the summary from the perl script, then I just manipulate the data
15:08 donut_1 So I pretty much snag everything after "ORIGIN" in the summary, remove numbers & spaces and change it all to uppercase
15:08 donut_1 throw in a header for each block of bases (uid, eid, etc.)
15:08 donut_1 and it worked nicely
15:09 pyrimidine you can get FASTA directly from NCBI
15:09 donut_1 great
15:09 donut_1 lol
15:09 pyrimidine -rettype   => 'fasta'
15:10 pyrimidine and they have nicely added in the coordinate info at the top
15:10 pyrimidine for the descriptor:
15:10 donut_1 nice!
15:10 pyrimidine >gi|224589800:c169858075-169821804 Homo sapiens chromosome 1, GRCh37.p5 Primary Assembly
15:10 scottcain joined #bioperl
15:10 pyrimidine c = complement I would guess
15:10 pyrimidine so strand = 2
15:10 donut_1 yup thats what my output says
15:10 donut_1 right after the coords
15:11 donut_1 hmmm
15:12 donut_1 i'd like to have the UID in there as well
15:12 donut_1 well either way
15:12 donut_1 i'm good on getting the data into fasta format already
15:12 donut_1 i'm just not sure what exactly i'm getting by doing a -5000,0 and 0,0 query
15:12 donut_1 since the latter is getting an extra 5000 bases
15:13 pyrimidine could just name the file based on the gene ID
15:13 donut_1 well, long story short, i'm naming files after the chromosomes and listing the genes inside
15:13 donut_1 lets me process each file by chromosome rather than thousands of individua gene files
15:13 donut_1 the former being much much much much quicker
15:19 pyrimidine heh, just checked out whether flipping start/end makes a difference, and it doesn't *if* you retrieve the gene alone
15:20 pyrimidine that's a gotcha
15:20 pyrimidine NCBI just reverses them for you, how friendly of them
15:21 donut_1 Okay so... maybe I'm thnking about how the start/stop thing works
15:21 donut_1 If I do a -5000, +0 and a +0, +0... the latter has 5k more bases than former
15:21 donut_1 however if I do...
15:21 donut_1 If I do a +5000, +0 and a +0, +0... the former has 5k more bases than latter
15:22 donut_1 so +5000k bases <====== start
15:22 donut_1 and-5000 bases start=====>
15:22 donut_1 I think?
15:22 pyrimidine what efetch is assuming is, if you switch out start and end (e.g. start > end) then you mean (start, end) = (end, start)
15:22 pyrimidine e.g. you need to switch them
15:24 pyrimidine but unless you correct for this in your script (always switch them out if strand == 2), if start > end you will end up with 5000 fewer bases
15:25 pyrimidine in other words, NCBI is trying to be helpful, but they end up shooting you in the foot
15:25 donut_1 great
15:26 pyrimidine easy to fix
15:33 pyrimidine ah, I have one even better
15:36 donut_1 I'm all ears
15:37 pyrimidine finishing it up...
15:40 donut_1 Alrighty. I'll be right back.
15:43 pyrimidine donut_1: something like https://gist.github.com/2022673
15:43 pyrimidine (may need to check the offsets to make sure that matches up with the gene)
15:54 scottcain joined #bioperl
16:01 donut_1 back
16:02 pyrimidine donut_1: just updated that gist
16:02 jhannah pyrimidine++
16:02 jhannah look at you go!
16:02 pyrimidine splits the sequence into 4 files: all, gene, up, and down
16:02 jhannah he's an unstoppable  bioinf machine!
16:03 pyrimidine working on something like this locally, actually, so it scratches a particular itch
16:03 donut_1 I am in love with this guy/gal
16:03 pyrimidine guy
16:03 donut_1 i love you (no homo)
16:03 jhannah he prefers "genius"
16:03 donut_1 so my _down.fn files have one base in them
16:03 donut_1 and _up.fn files
16:04 pyrimidine look at $start_offset
16:04 pyrimidine set to 1
16:04 donut_1 Okay so if I want 5k upstream set it to 5000?
16:04 donut_1 and $end_offset = 5000 for downstream?
16:04 pyrimidine $start_offset = 5000, I believe
16:04 donut_1 Wow I feel like i'm overextending your help
16:05 donut_1 okay and $end = 5000 for 5k down?
16:05 pyrimidine $end_offset = 5000
16:05 pyrimidine yes
16:05 pyrimidine I think that will work
16:05 donut_1 oh my colleuge said strand doesnt matter as long as we know what strand it is coming from (which the script outputs which is good)
16:06 pyrimidine of course, she needs to check on the output, but it seems fine
16:07 donut_1 holy crap this might just work for me
16:07 donut_1 one last thing
16:07 pyrimidine gonna check one more thing (could be a small off-by-one error)
16:07 donut_1 the 5k downstream, thats 5k extra bases from the gene body end?
16:11 pyrimidine I believe so
16:12 pyrimidine I'm seeing a couple of off-by-one errors, though, I'm checking the output vs what NCBI gives
16:12 pyrimidine http://www.ncbi.nlm.nih.gov/nuccore/NC​_000001.10?from=169821804&amp;to=16986​3076&amp;report=fasta&amp;strand=true
16:12 donut_1 you sir are magnificent
16:12 pyrimidine well, this could be a very nice general purpose tool I can add to eutils
16:13 pyrimidine (which I wrote up)
16:13 donut_1 looks like the start is off by 1 and the end is off by... +3
16:16 pyrimidine yes
16:16 pyrimidine but it's close
16:17 pyrimidine I think it's from the switching around of coordinates
16:17 donut_1 it probably won't matter to be honest
16:17 donut_1 as long as we're close and we know what magnitude of 'close' we are
16:17 donut_1 i'll go ask (she doesn't check her emails often...)
16:17 pyrimidine well, it would be nice to get the behavior consistent
16:18 donut_1 collaboration could be so much easier if ppl did check their emails
16:18 donut_1 instead I have to negotiate 3 flights of stairs everytime I need to ask her a question.
16:19 donut_1 I'd get her in chat but "irc" wouldnt even be in her vocabulary
16:19 pyrimidine donut_1: that's an easy one; if she doesn't respond, it's obviously not a high enough priority for her, and thus you :)
16:19 pyrimidine otherwise, if it were she would be checking them regulalry
16:19 pyrimidine *regularly
16:25 donut_1 lol
16:25 donut_1 she said "why didnt you just call me?"
16:25 donut_1 i hate using telephones
16:25 donut_1 Anyhow, I'd like to ask you if you would be okay, after this is all said and done, acknowledging you on the paper this is all geared towards
16:26 donut_1 oh, she said an offset of a couple basis is ok
16:27 pyrimidine ok
16:27 pyrimidine yeah, sure, let me know re: paper
16:27 pyrimidine acknowledgement is fine
16:27 donut_1 also, did you end up tweaking the perl script to fix the offset consistency?
16:28 donut_1 Absolutely.  I should get your email so we can discuss the paper in the future.
16:29 donut_1 I told her it would be appropriate to acknowledge you.
16:30 pyrimidine you can /msg me your email, I can send info
16:30 donut_1 whats the syntax for an /msg command?
16:31 pyrimidine /msg donut_1
16:31 pyrimidine (without the leading space)
16:32 donut_1 so if i were to msg you, it'd be "/msg @pyrimidine"?
16:32 pyrimidine w/o the '@'
16:48 pyrimidine okay, seems as if another change has occurred at NCBI, bit frustrating.
16:49 pyrimidine Seems as if the gene info returned from the database assumes 0-based coordinates, but efetch assumes 1-based
16:54 pyrimidine https://gist.github.com/2023289
16:55 donut_1 Anyway to set the script to take an argument (aka UID?)
16:56 donut_1 rather than predefining it inside
16:56 pyrimidine yeah, I can do that.  Just need to set @gene_ids = @ARGV
16:57 pyrimidine probably remove the esearch bit right before it
17:01 donut_1 "Global symbol "@ids" requires explicit package name"
17:02 pyrimidine here: https://gist.github.com/2023358
17:03 donut_1 beautiful
17:05 donut_1 so something isnt matching up when i verify with website
17:06 donut_1 nope wait
17:06 donut_1 nope it matches
17:06 donut_1 the gene body has an extra base on it
17:07 donut_1 but it shouldnt matter
17:07 donut_1 anyway to manually verify up/down bases?
17:07 donut_1 i forget what link to click on at the ncbi website
17:37 donut_1 okay one other thing.  We need to log the strand number to each header
17:42 pyrimidine sorry, stepped away ($job)
17:43 pyrimidine the strand should come across in the header
17:43 donut_1 so the strand info prints to screen but doesnt log itself into the files
17:43 pyrimidine well, it comes across as something like '>GeneID:57147:gi|224589800:c169863077-169821803'
17:43 donut_1 okay I've got that
17:43 donut_1 well
17:43 pyrimidine where the 'c' in that seq range is strand=2
17:44 donut_1 that last number is 804 for me
17:44 donut_1 what is strand=1?
17:44 pyrimidine default strand (normal).  strand=2 is complementary strand
17:45 donut_1 gotcha
17:45 donut_1 c means strand 2, no c means strand 1
17:45 donut_1 i see it now
17:45 donut_1 it looks like the bases matched up with what the website gives
17:45 donut_1 give or take a base or two
17:45 donut_1 which is fine
17:45 donut_1 I tihnk i'm set to run this for the thousands of IDs I have now.
17:46 pyrimidine let me know how it goes
17:49 donut_1 its chuggin now.  Should be done by tomorrow.
17:49 donut_1 I sleep the script at right under a second per query so
17:49 donut_1 that'll be the limiting factor.
17:51 donut_1 hmm
17:51 donut_1 anyway to have it print the EID number in the header?
17:51 donut_1 we already have the UID
17:53 pyrimidine This could be done if you take the original version of the script, which accepted the EnsemblID and looked up the UID
17:54 donut_1 ok, well its not a huge deal
17:54 donut_1 just a convenience thing
17:54 donut_1 I'm fine with this
17:54 donut_1 and its amazing
17:54 pyrimidine could always just make a EID => UID map
17:54 donut_1 no worries
17:55 donut_1 Anyhow we'll be sure to keep in touch re: the paper in the future
17:56 pyrimidine sounds good!
17:56 donut_1 Thanks a million once again.  Couldn't have done it without your help.
18:03 pyrimidine found the error with the end: https://gist.github.com/2023358
18:04 donut_1 oh the offset error?
18:05 pyrimidine yes, I think
18:06 donut_1 looks like its spot on now
19:20 * rbuels scratches idly at his javascript hives
19:31 pyrimidine ick
19:31 pyrimidine (both javascript and the hives)
19:38 scottcain joined #bioperl
20:01 scottcain joined #bioperl
20:38 sl33v3_ joined #bioperl
22:44 leont joined #bioperl
22:51 * rbuels merrily writes things like (function() { blahblah })()
22:51 rbuels JS is a pretty damn crappy language.
22:51 * rbuels is reminded of people merrily pithing frogs
23:11 jhannah oh really? i've got a friend who uses JS for *everything* server-side. he says he wants to fight you

| Channels | #bioperl index | Today | | Search | Google Search | Plain-Text | summary