
IRC log for #bioperl, 2011-10-12


All times shown according to UTC.

Time Nick Message
00:32 philsf joined #bioperl
10:38 carandraug joined #bioperl
15:20 Lynx_ joined #bioperl
17:04 ank joined #bioperl
17:19 ulaek joined #bioperl
17:56 ashleydev joined #bioperl
17:56 ashleydev hi there
17:57 ashleydev I'm doing some research in computer storage deduplication algorithms
17:57 ashleydev and maybe you guys can give me a few pointers.
17:58 ashleydev I've got a backup application that takes a stream of (backup) data and cuts it into variable length segments and then tries to de-duplicate those segments.
17:59 ashleydev if you hash those segments you can look at the problem as a stream of 64 bit hashes that may repeat in various patterns and have various runs of similarity
18:00 ashleydev I think it's probably pretty similar to trying to look for similar sections of DNA in a genome -- though I'm a computer systems guy, not a genome/bio guy... so
18:04 ashleydev What I'm trying to do is to understand the repeating but morphing patterns in the stream of data so that I can try to create an algorithm for producing a synthetic pattern that is very similar.
18:05 ashleydev so where do I go to understand how to analyze repeating and or similar patterns in a stream of data?
18:05 ashleydev a lot of this seems fractal like to me...
18:06 ashleydev hit me up with suggestions.
18:16 rbuels joined #bioperl
18:26 drin i just learned about stem/stemming (I forget which it is) trees. They are apparently very good for finding repeats in a string. In this case it seems possible that you can take your 64 bit streams of data, produce a stemming tree and then use that for your deduplication methods.
18:27 ashleydev drin: i'll look it up
18:28 drin I'm not sure what you mean by trying to produce a synthetic pattern to the stream of data
18:28 ashleydev i'm testing ddup engines
18:28 drin ohh
18:28 drin that makes sense
18:28 ashleydev I have a few real world data sets
18:28 ashleydev but they are a PITA to use
18:28 drin you're probably better off with real world data
18:28 ashleydev I am writing a synthetic data generator
18:29 drin or you should analyze a bunch of real world data
18:29 ashleydev that's what I'm doing
18:29 drin and essentially use that as a seed for your generator
18:29 ashleydev but how to analyze it is the question.
18:29 ashleydev I.e. what to look for and then how to replicate it
18:30 drin probably break it up into sections
18:30 drin like 3-mer pieces
18:30 ashleydev 3-mer?
18:30 drin basically... 3 nucleotides
18:30 * ashleydev looks up the meaning of nucleotide...
18:30 drin nucleotide = spot in DNA
18:30 drin so like
18:30 drin A is a nucleotide
18:31 drin T is a nucleotide
18:31 ashleydev a
18:31 drin basically...
18:31 ashleydev ah
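[Editor's sketch] The "3-mer pieces" idea above can be tried directly on a stream of hashes as well as on DNA letters. The snippet below is a minimal illustration, not anything either speaker wrote: the helper name `kmers` and the sample values are made up for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every window of k consecutive items from seq as a tuple."""
    for i in range(len(seq) - k + 1):
        yield tuple(seq[i:i + k])


# DNA example: count every 3-mer in a short string of nucleotides.
dna = "ATGATGC"
dna_counts = Counter(kmers(dna, 3))

# Same idea on a (made-up) stream of segment hashes.
hashes = [11, 22, 33, 44, 11, 22, 33, 55]
hash_counts = Counter(kmers(hashes, 3))

# Windows that occur more than once are candidate repeats.
repeated = [kmer for kmer, n in hash_counts.items() if n > 1]
print(repeated)
```

Larger values of k give fewer but more specific repeats; which k is useful for backup data is an open question here.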
18:31 drin you said you're trying to look at DNA right?
18:31 ashleydev basically
18:31 ashleydev I figure that my 64 bit hashes which represent segments of data
18:31 ashleydev would likely be comparable to a segment of DNA
18:32 drin how so
18:32 ashleydev well i've got a list of these 64 bit integers, millions long
18:32 ashleydev and some of them repeat
18:32 ashleydev seems like DNA to me
18:32 drin isn't that true of all data?
18:32 drin it is of some length
18:32 drin and some repeats
18:33 drin you could say it's similar to english sentences
18:33 drin i'm just curious why you are looking at DNA in particular
18:33 ashleydev exactly -- i've been looking at some web search algorithms and data mining too
18:33 drin mmm
18:33 drin well I am no expert on the subject by a long shot
18:33 drin but I'm not sure that trying to make a de-duplication algorithm
18:33 ashleydev but I figure the genomics guys know how to classify this stuff in their heads
18:34 drin that shotguns all forms of data
18:34 drin will necessarily work
18:34 drin but it'd be interesting to see the results
18:34 ashleydev here's what I have now:
18:34 ashleydev look at all hashes, and associate a repeat count
18:35 ashleydev so you have a stream of repeat counts for each hash
18:35 ashleydev then you can break that up
18:35 ashleydev hmm it's more of a ref-count than a repeat count, so
18:36 ashleydev a ref-count of 1 means that hash is only seen once in the data
18:36 ashleydev so,
18:36 ashleydev I have broken down the data into runs where you have consecutive hashes that have a ref-count > 1
18:37 drin k
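[Editor's sketch] The ref-count and run-splitting steps ashleydev describes could look roughly like the following. This is an assumption about the procedure based only on the description in the log; the function name `shared_runs` and the sample stream are hypothetical.

```python
from collections import Counter


def shared_runs(hashes):
    """Split a hash stream into runs of consecutive hashes with ref-count > 1."""
    ref_count = Counter(hashes)  # how many times each hash occurs overall
    runs, current = [], []
    for h in hashes:
        if ref_count[h] > 1:
            current.append(h)      # hash is seen elsewhere too: extend the run
        elif current:
            runs.append(current)   # a unique hash ends the current run
            current = []
    if current:
        runs.append(current)
    return runs


# Made-up stream: 9 and 8 occur once, everything else occurs twice.
stream = [1, 2, 3, 9, 2, 3, 4, 8, 1, 4]
runs = shared_runs(stream)
print(runs)
```

Each run can then be treated as a set of hashes for clustering.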
18:37 ashleydev then I can cluster these runs based on jaccard-distance
18:37 ashleydev so that's interesting
18:37 ashleydev and I'm using circos to look at where the similar runs repeat
18:38 ashleydev which is also interesting but there's a lot of subtlety that isn't picked up in that
18:38 ashleydev for example
18:38 drin i don't know much about jaccard distance
18:38 drin how are you applying it to your data?
18:38 drin yea, i'm not sure what circos are either
18:39 ashleydev basically if the intersection of the sets of hashes in each run is 90% or more of the union of the hashes of both runs
18:39 ashleydev http://circos.ca
18:39 drin yea, i'm just not sure what you're considering to be elements in each set
18:39 drin the individual bits?
18:40 ashleydev the hashes are my smallest unit of divisibility
18:40 drin so.. it's the hashcode (this is what i assume when you say hash) that you're considering to be an element in your set
18:40 drin where your set is the data stream as a whole?
18:41 ashleydev so if a run contains (1, 2, 3, 4) and another run contains (2, 3, 4, 5) then the intersection is 3 and the union is 5 so that's a jaccard similarity of 3/5 (a jaccard distance of 2/5)
18:41 drin right
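[Editor's sketch] For the two example runs, the intersection has 3 elements and the union has 5, so the Jaccard similarity is 3/5. The Jaccard distance is conventionally one minus the similarity. A minimal version, with a hypothetical helper name:

```python
def jaccard_similarity(run_a, run_b):
    """Size of the intersection divided by size of the union of two runs."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty runs count as identical
    return len(a & b) / len(a | b)


similarity = jaccard_similarity([1, 2, 3, 4], [2, 3, 4, 5])
distance = 1 - similarity
print(similarity, distance)
```

The 90% threshold mentioned earlier would then correspond to keeping pairs of runs with `similarity >= 0.9`. Note that sets ignore both order and repetition within a run.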
18:41 drin where those numbers
18:41 drin are hashes of 64 bit segments
18:41 ashleydev no they are the actual hashes
18:41 ashleydev remeber my hashe
18:42 ashleydev s/.*//
18:42 ashleydev my hashes represent 8K segments of data
18:42 ashleydev and those 8K segments of data may repeat
18:42 ashleydev which would be represented by repeats of the hash of that 8K of data
18:43 drin ohhh
18:43 drin i thought you were hashing 64 bit segments of the data
18:43 ashleydev na
18:43 drin you have 64bit hashes
18:43 ashleydev yes
18:43 drin that is more sensical
18:43 drin but still, those hashes are of data segments
18:44 ashleydev yes
18:44 drin interesting
18:45 ashleydev ...
18:45 ashleydev so as I was saying regarding the runs
18:45 ashleydev that is interesting but it does mask a lot of the mutations
18:45 ashleydev and that mutation structure seems important
18:45 ashleydev is there a classification of types of mutations?
18:45 ashleydev in this field?
18:46 drin to some extent
18:46 drin i can't say it's as well established as you may think
18:46 drin but i'm not a biologist so i can't say
18:46 ashleydev I figure there's something I could publish here....
18:46 ashleydev what is your expertise?
18:46 drin cs
18:46 ashleydev daa
18:46 ashleydev me too
18:46 drin you can look up SNPs
18:46 ashleydev but you know more about genomics than I do
18:47 drin single nucleotide polymorphisms
18:47 drin i know that's a type of mutation
18:47 drin and as far as looking at what may or may not be masked
18:47 ashleydev http://en.wikipedia.org/wiki/Mutation#Classification_of_mutation_types
18:48 drin lol yea
18:48 drin point mutations describes SNPs
18:48 drin insertions and deletions are pretty easy to explain too
18:48 ashleydev interesting
18:48 ashleydev I see all of these in my data
18:48 ashleydev including the amplifications
18:48 drin i was assuming there were more classifications than these
18:48 drin but perhaps not
18:48 * ashleydev yay, getting somewhere
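[Editor's sketch] One way to see point mutations, insertions and deletions between two runs of hashes is an edit script. The example below uses Python's standard `difflib.SequenceMatcher`, which is a generic sequence differ rather than a genomics tool; the function name `classify_changes` and the sample runs are made up for illustration.

```python
from difflib import SequenceMatcher


def classify_changes(old_run, new_run):
    """Return (tag, old_items, new_items) for each differing region.

    Tags come from difflib: 'replace' (substitution, like a point mutation),
    'insert' and 'delete'.
    """
    matcher = SequenceMatcher(a=old_run, b=new_run, autojunk=False)
    return [
        (tag, old_run[i1:i2], new_run[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


substitution = classify_changes([10, 20, 30, 40, 50], [10, 20, 99, 40, 50])
insertion = classify_changes([10, 20, 30], [10, 20, 25, 30])
deletion = classify_changes([10, 20, 30], [10, 30])
print(substitution, insertion, deletion)
```

An amplification (a segment copied several times) would show up here as an insert whose items already occur elsewhere in the run, so it would need an extra check to tell apart from a plain insertion.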
18:48 drin basically consider you have some strand of DNA
18:49 drin that strand of DNA will code for various things for various types of cells
18:49 drin wait. this is actually a different problem
18:49 ashleydev oh?
18:49 drin but basically depending on the cell, it will convert certain parts of the DNA into RNA
18:49 drin though I think that from a CS perspective
18:50 drin you may be able to treat this the same as insertions/deletions
18:50 drin at least similarly to deletions
18:50 drin insertions and deletions actually stem from duplicating DNA
18:50 drin and the.. rna? I forget what copies DNA
18:50 drin whatever copies it basically messes up
18:50 drin and so the DNA has extra or loses some number of base pairs
18:51 drin I'm not sure how you would catch that really
18:51 drin unless you were to notice some pattern in when/where some dna strand appears in the whole DNA strand
18:52 drin and be able to uhh.. say with some amount of confidence that the given strand appears in an odd location or seems to be missing
18:52 drin and further, I'm not sure how you would be able to tell if your method is obscuring that a bit too much
18:52 ashleydev hmmm
18:53 drin you should look up the e-value that is calculated for BLAST
18:53 drin i'm not too sure where you can find that
18:53 ashleydev what's that?
18:53 drin but I'm pretty sure that value describes how they score dna similarity
18:53 ashleydev hmm
18:53 drin based on possible mutations
18:53 drin that will be useful for you to see if you're obscuring SNPs
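[Editor's note] For reference, the BLAST E-value is not itself the similarity score. It is the number of alignments with score at least S that would be expected by chance in a search of that size. In the Karlin-Altschul statistics that BLAST uses, it takes the form below, where m is the query length, n is the total database length, and K and λ are parameters that depend on the scoring system.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

Smaller E means a hit is less likely to be a chance match. The score S is what reflects matches, mismatches and gaps.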
18:53 ashleydev maybe I should look into blast then?
18:54 drin i'm not sure if you necessarily want to look into blast as a whole
18:54 drin it's a sequence alignment/searching tool
18:54 drin basically given some dna string
18:54 ashleydev hmm ok so assembling dna
18:54 drin it finds best matches in its database
18:54 ashleydev similar but not the same problem
18:54 drin right
18:54 drin unless you were to try applying BLAST for finding your duplicates
18:55 drin instead of hashes
18:55 drin then suddenly you're looking at a similar problem with different approaches
18:55 ashleydev yeah I could do that
18:55 drin but again, that's really only useful for biological data
18:55 drin if you're trying to make something more general
18:55 drin i'm not sure that other datasets have the same restrictions/patterns as DNA at all
18:55 ashleydev well... I'm trying to plunder where I can for my problem... heh
18:56 drin and beyond that, I'm not sure how applicable such a thing would be to protein sequences
18:56 drin ah, gotcha lol
18:56 drin DNA has only 4 letters essentially
18:56 ashleydev one thing I've done to mask the problem of SNPs
18:56 drin but they code for proteins in 3-mer
18:56 drin proteins have something like 20 letters
18:56 drin so depending on what you're looking for
18:56 drin variability or not
18:57 drin you might consider looking at proteins instead of nucleotides
18:57 ashleydev right
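[Editor's sketch] The arithmetic behind "4 letters, read in 3-mers" is that there are 4^3 = 64 possible triplets (codons), which the genetic code maps onto roughly 20 amino acids plus stop signals. The snippet below only enumerates the triplets and numbers them in base 4. The letter ordering "ACGT" is an arbitrary choice for the example, and no real codon table is included.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # arbitrary ordering, chosen only for this example

# Every possible 3-letter word over the 4-letter alphabet.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]


def codon_index(codon):
    """Read a codon as a 3-digit base-4 number (0..63)."""
    value = 0
    for base in codon:
        value = value * 4 + NUCLEOTIDES.index(base)
    return value


print(len(codons), codon_index("AAA"), codon_index("TTT"))
```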
18:57 drin anyways
18:57 drin i gotta go downtown
18:57 ashleydev that's kind of what I've done but upped it just one more level
18:57 drin get a lightup frisbee for later
18:57 ashleydev saying that my proteins are 8K
18:57 ashleydev and that my letters are a 64 bit number
18:57 ashleydev thx...
18:57 drin yea
18:58 drin it's a bit diff
18:58 drin DNA is more like octal
18:58 ashleydev yeah
18:58 drin in strands of 3
18:58 ashleydev ah
18:58 ashleydev yeah
18:58 drin it translates to decimal
18:58 ashleydev got it
18:58 drin yea
18:58 drin anyways
18:58 drin have fun good luck
18:58 ashleydev octal is easy mapping
18:58 drin i'd say talk more to people in here
18:58 drin I may have explained something incorrectly
18:58 drin or led you down a wrong path
18:58 drin i'd def get a 2nd and 3rd opinion
18:59 ashleydev yes
18:59 drin adios
18:59 ashleydev \o
18:59 ashleydev anyone else feel free to give me some suggestions
18:59 ashleydev i'm open to any pointers.
18:59 ashleydev :)
18:59 ashleydev i'll be around.
19:57 ank joined #bioperl
21:30 carandraug joined #bioperl
21:32 dukeleto joined #bioperl
22:13 carandraug joined #bioperl
22:14 dukeleto joined #bioperl
22:47 dukeleto joined #bioperl
22:47 dukeleto joined #bioperl
22:49 dukeleto joined #bioperl
