IRC log for #bioperl, 2012-07-06


All times shown according to UTC.

Time Nick Message
08:29 leont joined #bioperl
08:56 leont joined #bioperl
10:55 Ulti left #bioperl
11:53 pyrimidine joined #bioperl
12:59 hash_table joined #bioperl
14:22 hash_table joined #bioperl
16:40 leont joined #bioperl
17:13 hash_table joined #bioperl
17:57 flu Anyone else notice NCBI eutils returning a high volume of "<ERROR>Unable to obtain query #1</ERROR>" messages?
17:58 flu I'm playing nice with the speed at which I'm executing queries, so I don't think they are throttling me.
18:57 pyrimidine flu: not sure, but this is a high-traffic time of day
18:57 pyrimidine do you have a script?  I can test from my end
18:57 pyrimidine (gist)
18:58 flu pyrimidine, I've been seeing it for about a week, at both on- and off-peak times.
18:58 pyrimidine flu: huh, not sure then, but I wouldn't be surprised if they changed something on their end that is causing this
19:02 flu I'm essentially using http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#How_do_I_retrieve_a_long_list_of_sequences_using_a_query.3F
19:04 flu query is "Nasonia vitripennis[organism] NOT 13660[BioProject] NOT WGS[Keyword]"
19:04 flu on nuccore
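
The cookbook recipe linked above boils down to this pattern: an esearch with the NCBI history server enabled, then efetch calls that page through the stored result set in batches. A minimal sketch assuming Bio::DB::EUtilities, using flu's query; the email address and output filename are placeholders:

    use strict;
    use warnings;
    use Bio::DB::EUtilities;

    # esearch with history, so efetch can page through the server-side result set
    my $factory = Bio::DB::EUtilities->new(
        -eutil      => 'esearch',
        -email      => 'you@example.org',   # placeholder; NCBI wants a real address
        -db         => 'nuccore',
        -usehistory => 'y',
        -term       => 'Nasonia vitripennis[organism] NOT 13660[BioProject] NOT WGS[Keyword]',
    );

    my $count = $factory->get_count;
    my $hist  = $factory->next_History || die "No history data returned";

    # reuse the search history for batched efetch calls
    $factory->set_parameters(
        -eutil   => 'efetch',
        -rettype => 'fasta',
        -history => $hist,
    );

    my ($retmax, $retstart, $retry) = (500, 0, 0);

    RETRIEVE_SEQS:
    while ($retstart < $count) {
        $factory->set_parameters(-retmax => $retmax, -retstart => $retstart);
        # '>>' appends each batch to the output file
        eval { $factory->get_Response(-file => '>>seqs.fa') };
        if ($@) {
            die "Server error: $@. Try again later" if $retry == 5;
            print STDERR "Server error, retry #$retry\n";
            $retry++;
            redo RETRIEVE_SEQS;
        }
        $retry = 0;
        print STDERR "Retrieved $retstart\n";
        $retstart += $retmax;
    }
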
19:08 * pyrimidine testing it out
19:11 pyrimidine flu: seems to be working.  at least, sequences are being retrieved
19:12 flu You need to check the output, because the <ERROR> tag goes into the output and doesn't get caught by the eval.
19:12 pyrimidine output from esearch?
19:12 flu yep
19:14 pyrimidine flu: https://gist.github.com/3062213
19:15 pyrimidine the query seemed to go through
19:16 pyrimidine I do know that if the email isn't specified they (NCBI) have been cracking down some
19:16 flu sorry, the output from efetch when pulling down batches of 500 FASTA sequences.  Not the initial efetch.
19:16 flu I'm supplying both tool and email.
19:17 pyrimidine ok
19:17 flu initial esearch, I mean
19:18 pyrimidine script is running, at 9K now...
19:19 pyrimidine flu: yep, it's there
19:20 pyrimidine the errors seem to be roughly equivalent to the number of retrievals
19:20 pyrimidine so, each iteration in the loop
19:22 pyrimidine flu: https://gist.github.com/3062213
19:23 pyrimidine actually, no, this seems to be an issue on their end
19:23 pyrimidine the counts are off
19:24 pyrimidine (i.e. this isn't occurring within the loop, but within a specific set of sequences)
19:24 pyrimidine seeing if it's reproducible...
19:27 pyrimidine nope, completely random, appears in clusters
19:30 flu Random is what I was seeing as well.
19:31 pyrimidine flu, there is a Biopython post related to this from April; blog post here
19:31 pyrimidine http://opensourcepharmacist.wordpress.com/2012/04/16/the-community-biopython-overcomes-limitations/
19:33 pyrimidine I recall this, actually (I responded on the thread)
19:33 pyrimidine http://thread.gmane.org/gmane.comp.python.bio.general/6962
19:36 pyrimidine they basically checked each response for the <ERROR> tag and resent the request if it was found
19:37 pyrimidine (they did this by passing it through their SeqIO parser, but I think a simple match for the XML tag would work as well)
19:43 flu That's essentially what I did: I modified the eval to check for the XML error and die if it's found. However, my number of records fetched is still way off from $count. I'll continue to debug and follow up with NCBI if necessary.
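
flu's check presumably amounts to something like the following inside the batch loop: keep the batch in memory, scan it for the error tag, and die inside the eval so the existing retry logic fires. A sketch only; the match against <ERROR> is an assumption based on the messages quoted above:

    my $ret = eval {
        my $res = $factory->get_Response;   # no -file, so the content stays in memory
        die "NCBI <ERROR> in batch at retstart=$retstart\n"
            if $res->content =~ m{<ERROR>};
        $res;
    };
    # a die above lands in $@ and is handled by the retry/redo branch
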
19:46 pyrimidine flu: the redo didn't work?  That's... scary
19:47 flu My feeling as well.
19:48 pyrimidine there is a slight problem with the script as well, based on this.
19:48 pyrimidine it assumes all retrieved data is valid and prints it to a single filehandle
19:49 pyrimidine would be better to cache each response and concatenate to a file if no error is found
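
A sketch of that suggestion, under the same assumptions as above: hold each batch in memory (a temp file works equally well, as pyrimidine notes below) and append it to the output only once it passes the error check, so a bad batch never reaches the file. This replaces the write-straight-to-file step inside the RETRIEVE_SEQS loop:

    my $chunk = eval {
        my $res = $factory->get_Response;
        die "NCBI <ERROR> in batch at retstart=$retstart\n"
            if $res->content =~ m{<ERROR>};
        $res->content;   # cache the batch rather than printing it immediately
    };
    if ($@) {
        die "Giving up: $@" if $retry == 5;
        $retry++;
        redo RETRIEVE_SEQS;
    }
    # only a clean batch reaches the output file
    open my $out, '>>', 'seqs.fa' or die "Can't append: $!";
    print $out $chunk;
    close $out;
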
19:51 flu I'll try that. Based on how far off my numbers were, I suspected there was something funny going on with the filehandle.
19:54 pyrimidine something like: https://gist.github.com/3062213
19:54 pyrimidine though one could use a simple temp file as well
19:54 pyrimidine funny, that error is very sporadic
19:55 pyrimidine last few tries I haven't had any problems
19:57 leont joined #bioperl
20:01 pyrimidine flu: I think this should be sent to NCBI
20:02 flu k, I'll draft something up and send it to one of our contacts there.
20:17 pyrimidine flu: I'll update the code example on the wiki
20:20 pyrimidine flu: new version catches the error and retries; you can see why the numbers are off
20:20 pyrimidine same gist: https://gist.github.com/3062213
20:21 pyrimidine after a certain # of requests, the error hits
20:23 flu I wasn't seeing any messages about the limit being hit (I used 10).
20:23 pyrimidine I would set retmax to 2000-5000
20:23 pyrimidine just ran it with 2000 w/o problems
20:24 pyrimidine (using the above script)
20:25 flu Here is one problem I had: I was dying in the callback.
20:25 pyrimidine yes, that is a bug in the script
20:27 pyrimidine it *should* work, though (checking the $data chunk for the error), but it isn't triggered for some reason
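
One plausible explanation for the check never firing (an inference, not confirmed in the channel): LWP::UserAgent traps a die raised inside a content callback, aborts the request, and records the message in the response's X-Died header rather than letting it propagate, so a surrounding eval never sees it. Checking for that header would look roughly like:

    my $res = $factory->get_Response(
        -cb => sub {
            my ($data, $response, $protocol) = @_;
            # a die here does NOT reach a surrounding eval...
            die "NCBI <ERROR> in chunk\n" if $data =~ m{<ERROR>};
            print {$out} $data;   # $out: wherever the batch is written (placeholder)
        },
    );
    # ...because LWP turns it into a response header instead:
    if (my $died = $res->header('X-Died')) {
        warn "batch failed: $died";
        # treat as a failed batch: discard the partial data and retry
    }

A second possibility is that the tag straddles two chunks, which a per-chunk match would miss.
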
23:10 hash_table joined #bioperl
