IRC log for #opentreeoflife, 2015-03-26

All times shown according to UTC.

Time Nick Message
00:06 kcranstn joined #opentreeoflife
02:47 ilbot3 joined #opentreeoflife
02:47 Topic for #opentreeoflife is now Open Tree Of Life | opentreeoflife.org | github.com/opentreeoflife | http://irclog.perlgeek.de/opentreeoflife/today
07:48 7JTAAANBR joined #opentreeoflife
11:34 kcranstn joined #opentreeoflife
14:32 josephwb joined #opentreeoflife
15:12 mtholder joined #opentreeoflife
15:50 josephwb hey mtholder: you around for a bit?
15:50 mtholder yes
15:51 josephwb sweet. off to make a tea, then back to pick yer brains.
15:51 mtholder ok. slim pickings. or pickens, perhaps
15:52 josephwb Dr. Strangelove is my favourite movie.
15:53 josephwb well, that and Goonies (of course)
15:53 mtholder what a pair
16:03 josephwb ok, back
16:03 josephwb 2 tiny things to talk about:
16:03 josephwb 1) unsupported edges
16:03 josephwb 2) subproblems
16:05 mtholder ok
16:05 josephwb ok, "unsupported edges". the issue here is that we are creating more nodes than we should when we are combining data across incompletely-overlapping sources
16:06 josephwb we want to be able to identify these bad nodes before we commit them to the graph
16:06 josephwb testing a synth tree vs. inputs is straightforward (thanks to your notes and Wilkinson)
16:06 mtholder ok. or could be post-processing.
16:06 josephwb right
16:07 josephwb but this is *the* bottleneck; would rather be able to break early
16:07 mtholder the rule in my code is:
16:07 josephwb let me dig up an example
16:07 josephwb ok, you go
16:07 mtholder 1 the child is a MRCA of the ingroup of a split in the tree
16:08 josephwb just a sec
16:08 mtholder 2. the parent's descendants include >0 members of the "outgroup"
16:08 mtholder for that split.
16:09 josephwb for "1", i think you mean "the ingroup split in the tree is entirely contained in the ingroup of the graph node", yes?
16:09 josephwb they do not have to be identical
16:09 josephwb child == graph node
16:10 mtholder the descendants don't have to be equal
16:10 mtholder but, it must be an MRCA
16:10 mtholder not a deeper ancestor
16:10 josephwb essentially pruning down to the intersection of label sets
16:10 mtholder yes
16:11 mtholder so, in practice, more than one child of the graph node has a descendant of the node in the input tree.
16:11 josephwb hmm. "must be an MRCA". what does that mean?
16:11 mtholder if all of the children from the phylo tree are coming from 1 child, then that child (or one of its descendants) is "more recent"
16:12 mtholder If the input split was AB | C and the synth was (((A,B), D), C)
16:13 mtholder i would not consider either internal edge in the synth supported
16:13 mtholder by the input
16:13 mtholder both are compatible with it.
16:13 josephwb that differs from Wilkinson (and Ruchi)
16:13 mtholder yes
16:14 josephwb so, if i glean things correctly, Wilkinson is a node-based test, whereas yours is edge-based
16:14 josephwb i.e. (A,B) does appear in an input (Wilkinson)
16:15 josephwb you would prefer ((A,B),C,D)
16:15 mtholder so, 1. I don't really think that we *know* that no unsupported edges (in the sense that I use the phrase) is a requirement.
16:15 mtholder yes. ((A,B),C,D) displays the split in the input
16:15 mtholder in that sense it is just as close to the inputs as the more resolved tree.
16:15 josephwb got it
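
A minimal Python sketch of the edge-support rule mtholder states above, applied to the AB | C example. It assumes trees are nested tuples of leaf labels and a split is an (ingroup, outgroup) pair of sets; the function names are hypothetical and this is not the code mtholder refers to.

```python
def leaves(node):
    """Return the set of leaf labels below a node (a label or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def internal_edges(tree):
    """Yield (parent, child) for every edge that leads to an internal node."""
    stack = [tree]
    while stack:
        parent = stack.pop()
        for child in parent:
            if not isinstance(child, str):
                yield parent, child
                stack.append(child)


def edge_supported(parent, child, ingroup, outgroup):
    """Apply the two-part rule from the log to one edge and one input split."""
    child_leaves = leaves(child)
    # The child must contain the whole ingroup and none of the outgroup.
    if not ingroup <= child_leaves or child_leaves & outgroup:
        return False
    # Rule 1: the child must be the MRCA of the ingroup, not a deeper ancestor,
    # so no single child of it may already contain the whole ingroup.
    if any(ingroup <= leaves(grandchild) for grandchild in child):
        return False
    # Rule 2: the parent's descendants include at least one outgroup member.
    return bool(leaves(parent) & outgroup)


def supported_edges(tree, ingroup, outgroup):
    """Return the child subtrees whose incoming edge the split supports."""
    return [child for parent, child in internal_edges(tree)
            if edge_supported(parent, child, ingroup, outgroup)]


split = (frozenset("AB"), frozenset("C"))
resolved = ((("A", "B"), "D"), "C")
polytomy = (("A", "B"), "C", "D")
print(supported_edges(resolved, *split))
print(supported_edges(polytomy, *split))
```

Under these assumptions the resolved tree has no edge supported by AB | C, while the polytomy keeps one supported edge, the one leading to (A,B).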
16:15 josephwb let me run this by you:
16:16 mtholder If that is true for all splits (if the polytomy is just as close) then we should return the polytomy (IMHO)
16:16 josephwb inputs: (((A,B),C),E) and (((A,B),D),E)
16:16 josephwb [i agree]
16:17 josephwb so, we make a bunch of nodes. the following proposed node comes up: A,B,C | D,E
16:17 josephwb it does not conflict with either input
16:18 josephwb not supported, either
16:18 josephwb but what would be the test here?
16:18 josephwb that node being proposed because of:
16:19 josephwb wait, maybe a bad example
16:21 josephwb oh: it comes up when trying to combine:
16:21 josephwb A,B,C | E and
16:21 josephwb A,B | D,E
16:21 josephwb from trees 1 & 2, respectively
16:22 mtholder i understand that it is tough to decide.
16:22 josephwb no conflict (i.e. ingroup in one appears in outgroup of the other)
16:22 josephwb in this instance, it is clear we do *not* want to make that node
16:23 josephwb not sure how to articulate that in test form
16:23 josephwb maybe because we are looking at the tree structures?
16:23 josephwb harder with just sets
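
A small Python sketch of the pairwise set test described above, with the conflict check worded as in the log (an ingroup label of one bipartition appears in the outgroup of the other). The names `conflicts` and `naive_sum` are hypothetical, and the real implementation may differ.

```python
def conflicts(a, b):
    """True if an ingroup label of one bipartition is in the outgroup of the other."""
    (in_a, out_a), (in_b, out_b) = a, b
    return bool(in_a & out_b) or bool(in_b & out_a)


def naive_sum(a, b):
    """Combine two non-conflicting bipartitions by taking unions of both sides."""
    if conflicts(a, b):
        return None
    return (a[0] | b[0], a[1] | b[1])


tree1_split = (frozenset("ABC"), frozenset("E"))   # A,B,C | E
tree2_split = (frozenset("AB"), frozenset("DE"))   # A,B | D,E
proposed = naive_sum(tree1_split, tree2_split)
print(proposed)
```

With only sets to look at, the two inputs pass the conflict check and the sum is A,B,C | D,E, which is exactly the node the discussion wants to avoid creating.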
16:24 mtholder there is a set of trees compatible with both A,B,C | E and A,B | D,E
16:24 mtholder how to articulate that set?
16:24 mtholder is the tough part.
16:25 josephwb so, at least one tree split that is consistent with both?
16:26 mtholder the set is the set of trees that can be produced by resolving (((A,B),C,D),E) and ((A,B,C),D,E)
16:26 mtholder i think
16:26 mtholder tough to express that in one split.
16:27 mtholder (I mean the union of the sets)
16:30 josephwb so this is very different from the ITEB support test
16:31 mtholder not all trees in the set would have all edges that are supported.
16:31 mtholder if that is what you mean.
16:32 mtholder In this case, with just these 2 splits, the tree with no unsupported edges would be ((A,B,C),D,E)
16:32 josephwb i mean, ITEB is straightforward: find a single input that supports, and stop. does not depend on multiple trees
16:33 mtholder but if you have more splits to consider, it is useful to retain the set of trees that are maximally compatible with the splits that you have considered.
16:33 mtholder later input splits may improve the resolution.
16:36 jimallman left #opentreeoflife
16:37 jimallman joined #opentreeoflife
16:37 josephwb hmm. still trying to figure out how to implement a test that says the proposed node is bogus
16:38 josephwb maybe need to keep the respective tree structures in mind
16:38 josephwb when considering "summing" A,B,C | E and A,B | D,E, we can ask:
16:39 josephwb try going up (parent) one node in tree2: is it likewise compatible with tree1?
16:39 josephwb [this is actually what we want to do: create the node: A,B,C,D | E]
16:39 mtholder I'm not following you.
16:40 mtholder why do you want A,B,C,D | E ?
16:40 mtholder that split is fine, but there are other plausible ones.
16:40 josephwb the graph node A,B,C,D | E is consistent with both inputs: we want to make that one
16:41 josephwb the proposed node A,B,C | D,E (from combining A,B,C | E and A,B | D,E) is not
16:41 mtholder OK. but ((A,B,C),(D,E)) is also consistent with both (and lacks A,B,C,D | E)
16:42 josephwb that is my problem!
16:43 josephwb we do not want to create any graph node which separates C and D
16:43 josephwb no input tree contains both
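
To make the disagreement above concrete, here is a hedged Python sketch that checks whether a tree displays a rooted split. It assumes nested-tuple trees and treats a split as displayed when some node contains the whole ingroup and none of the outgroup; `displays` and the other names are hypothetical.

```python
def leaves(node):
    """Return the set of leaf labels below a node (a label or a tuple of children)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(tree):
    """Yield the leaf set of every node in the tree, including the root."""
    stack = [tree]
    while stack:
        node = stack.pop()
        yield leaves(node)
        if not isinstance(node, str):
            stack.extend(node)


def displays(tree, ingroup, outgroup):
    """True if some node holds all of the ingroup and none of the outgroup."""
    return any(ingroup <= clade and not (clade & outgroup)
               for clade in clades(tree))


candidate = (("A", "B", "C"), ("D", "E"))
print(displays(candidate, frozenset("ABC"), frozenset("E")))   # A,B,C | E
print(displays(candidate, frozenset("AB"), frozenset("DE")))   # A,B | D,E
print(displays(candidate, frozenset("ABCD"), frozenset("E")))  # A,B,C,D | E
```

Under this definition ((A,B,C),(D,E)) displays both input splits but not A,B,C,D | E, which is mtholder's point that no single full-leaf-set split is shared by every compatible tree.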
16:45 josephwb maybe i need to step back
16:45 mtholder well... it depends on the goal. If you want to maximize the number of splits that you display and you have more trees to consider, it seems you need to keep the resolutions of (((A,B),C,D),E) and ((A,B,C),D,E) "in play"
16:46 josephwb right; i think we can assume that we've got everything
16:46 mtholder there is no single split (for the full leaf set) in all of those trees
16:46 josephwb yup
16:47 josephwb IF we create the graph node A,B,C | D,E, it could potentially end up in the synth tree, and would be unsupported
16:47 josephwb so we don't want to make it
16:47 josephwb IF != IndexFungorum
16:48 mtholder I'm not optimistic about being able to predict which nodes you need when we are just considering pairs of statements.
16:48 mtholder in this case if these were the last splits then
16:49 mtholder yes you could tell that A,B,C | D, E was unsupported
16:49 mtholder so there would be no point in adding it.
16:49 josephwb yes, i am also concerned about the pairedness
16:50 josephwb let's shift for a second: what *could* support that node (without completely overlapping label sets)?
16:50 mtholder D,E| C
16:51 mtholder or D,E | B or D,E | A
16:51 josephwb just anything with C | D?
16:51 mtholder anything with "ingroup" D,E and outgroup being any of A, B, and C
16:52 mtholder (and you can add other labels in either the ingroup or outgroup if they aren't A-E)
16:52 josephwb ooh, i think we are close here.
16:54 josephwb not sure why you focus on D,E in the ingroup
16:54 josephwb i see how, not why
16:55 mtholder sorry. which node are we trying to support?
16:55 josephwb A,B,C | D,E
16:56 josephwb currently have: A,B,C | E and A,B | D,E
16:56 mtholder sorry, I misunderstood
16:56 josephwb would a 3rd tree allow making the node?
16:56 mtholder A,B,C | D
16:56 josephwb so: anything with C | D?
16:57 mtholder no. not X,C | D
16:57 josephwb without conflicting with the other ingroup taxa
16:57 josephwb how about:
16:58 mtholder no. not X,C | D
16:58 mtholder sorry.
16:58 mtholder that was bogus arrow up
16:58 josephwb (C + any of {A,B}) | D {and maybe other non-conflicting junk}
16:58 mtholder yes
16:59 josephwb that looks like a test to me
16:59 mtholder but pairs of additional splits could do it too
16:59 josephwb right: the hard part
17:00 mtholder C,X| D and A,X | C
17:00 mtholder for instance
17:01 josephwb i detect infinite digression...
17:01 mtholder np-hardness
17:03 josephwb for "bipartition" i (in some tree) we keep track of which other bipartitions (from other trees) are compatible
17:03 josephwb if such a combination exists, it will be in that set
17:04 josephwb e.g. in trying to combine A,B,C | E and A,B | D,E, we know all bipartitions that are compatible with each
17:05 mtholder ok
17:05 josephwb would the intersection of those two "lists" narrow us?
17:06 josephwb i.e. if a combination of 3 trees can get us that proposed node, they will all be in the intersection
17:06 josephwb 3 trees -> nodes from 3 separate input trees
17:06 mtholder sure - you don't need to consider splits that conflict with the subtree you are building up.
17:07 josephwb i believe we do that, but not beyond pairwise
17:07 josephwb i.e. do not consider conflicting
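
A brief Python sketch of the bookkeeping josephwb proposes above: keep, for each bipartition, the set of other bipartitions compatible with it, then intersect the sets for the two sources. Compatibility here uses the simple set test worded earlier in the log, and the candidate pool is made up for illustration.

```python
def compatible(a, b):
    """True if no ingroup label of one bipartition is in the outgroup of the other."""
    (in_a, out_a), (in_b, out_b) = a, b
    return not (in_a & out_b) and not (in_b & out_a)


def compatible_with(target, candidates):
    """Return the candidates (other than target) that are compatible with target."""
    return {c for c in candidates if c != target and compatible(target, c)}


source_1 = (frozenset("ABC"), frozenset("E"))    # A,B,C | E
source_2 = (frozenset("AB"), frozenset("DE"))    # A,B | D,E

# Hypothetical bipartitions from other input trees.
c1 = (frozenset("AC"), frozenset("D"))           # A,C | D
c2 = (frozenset("AB"), frozenset("C"))           # A,B | C
c3 = (frozenset("BC"), frozenset("DE"))          # B,C | D,E
pool = {c1, c2, c3}

shared = compatible_with(source_1, pool) & compatible_with(source_2, pool)
print(shared)
```

Here A,B | C drops out because it conflicts with A,B,C | E, so only the other two candidates remain for further checks.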
17:08 josephwb so, how about this:
17:09 josephwb in considering "summing" A,B,C | E and A,B | D,E (to give A,B,C | D,E)
17:10 josephwb do XOR separately for ingroups and outgroups of "source bipartitions"
17:10 josephwb that would give us:
17:10 josephwb C | D
17:10 josephwb i.e. the new information from the combined nodes
17:10 mtholder yes.
17:11 josephwb that is what we have to satisfy (from some combination of inputs) in order to create the node
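
The XOR step above can be sketched in Python with set symmetric difference; `new_information` is a hypothetical name.

```python
def new_information(a, b):
    """XOR the ingroups and the outgroups of two source bipartitions separately."""
    (in_a, out_a), (in_b, out_b) = a, b
    return (in_a ^ in_b, out_a ^ out_b)


source_1 = (frozenset("ABC"), frozenset("E"))    # A,B,C | E
source_2 = (frozenset("AB"), frozenset("DE"))    # A,B | D,E
residual = new_information(source_1, source_2)
print(residual)
```

For these two sources the residual is C | D, the relationship that neither input states on its own.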
17:11 josephwb let's call A,B,C | E (1) and A,B | D,E (2)
17:12 josephwb for all compatible bipartitions of (1) and (2), find intersection. call this (3)
17:13 josephwb ugh.
17:13 josephwb i was going to say throw out from (3) all that do not contain any of C, D
17:13 josephwb but i don't think that is right
17:13 josephwb still, (3) is what we should be working with
17:17 josephwb e.g. C|X, X|Y, Y|W, W|D
17:17 josephwb i think the above would support C|D, but the intervening data do not contain C or D
17:18 mtholder you need something more on the left because C|D is trivial
17:19 josephwb yes?
17:19 josephwb wait, from the XOR?
17:19 mtholder I just mean the statements
17:19 mtholder C|X doesn't say anything
17:20 josephwb right
17:20 mtholder C,X | Y   and A, X | Y implies that A, C | Y
17:21 mtholder So those 2 plus C,Y | D would imply A,C | D (without saying it in one statement)
17:22 mtholder X, Y | D rather than C,Y | D would also do it.
17:22 josephwb don't you just need #1 and #3
17:23 josephwb oh, no
17:24 josephwb need something (somewhere) from the intersection of the ingroups of the source nodes
17:24 josephwb here, A
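
The chain of statements above can be checked with a small Python sketch. It assumes a split `ingroup | outgroup` means the MRCA of the ingroup excludes every outgroup label, and it relies on the fact that two clades sharing a leaf must be nested. The function names are hypothetical, and this handles only one pair of statements at a time.

```python
def combine(s1, s2):
    """Return the split implied by two rooted splits, or None.

    None means either that the ingroups share no label (this rule implies
    nothing) or that the two statements contradict each other.
    """
    (in1, out1), (in2, out2) = s1, s2
    if not in1 & in2:
        return None
    one_inside_two = bool(out1 & in2)  # clade 2 holds a label clade 1 excludes
    two_inside_one = bool(out2 & in1)  # clade 1 holds a label clade 2 excludes
    if one_inside_two and two_inside_one:
        return None                    # neither clade can contain the other
    if one_inside_two:
        return (in1 | in2, out2)
    if two_inside_one:
        return (in1 | in2, out1)
    # Nesting order unknown: only labels excluded by both are surely excluded.
    return (in1 | in2, out1 & out2)


def implies(split, ingroup, outgroup):
    """True if the split, restricted to fewer labels, gives ingroup | outgroup."""
    return ingroup <= split[0] and outgroup <= split[1]


s1 = (frozenset("CX"), frozenset("Y"))   # C,X | Y
s2 = (frozenset("AX"), frozenset("Y"))   # A,X | Y
s3 = (frozenset("CY"), frozenset("D"))   # C,Y | D

step = combine(s1, s2)                   # A,C,X | Y
chain = combine(step, s3)                # A,C,X,Y | D
print(implies(step, frozenset("AC"), frozenset("Y")))
print(implies(chain, frozenset("AC"), frozenset("D")))
print(implies(combine(s1, s3), frozenset("AC"), frozenset("D")))
```

The last line combines only the first and third statements: the result never mentions A, which matches josephwb's conclusion that something from the shared ingroup is needed.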
17:30 josephwb i think i could eke out something that would compare relevant things from (3)
17:31 josephwb as a first test, this seems prudent:
17:31 josephwb (C + any of {A,B}) | D {and maybe other non-conflicting junk}
17:31 josephwb i.e. if any single tree displays that, get to stop early
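
A hedged Python sketch of this first, early-exit test. It generalizes "(C + any of {A,B}) | D" as: the candidate ingroup holds at least one XOR-ingroup label and at least one shared-ingroup label, the candidate outgroup holds at least one XOR-outgroup label, and the candidate does not conflict with the proposed node. That generalization and the function name are assumptions, not something settled in the log.

```python
def single_split_supports(candidate, source_a, source_b):
    """Check whether one input split alone justifies summing two source splits."""
    (in_a, out_a), (in_b, out_b) = source_a, source_b
    cand_in, cand_out = candidate
    proposed_in, proposed_out = in_a | in_b, out_a | out_b
    # "Other junk" is allowed only if it does not contradict the proposed node.
    if cand_in & proposed_out or cand_out & proposed_in:
        return False
    new_in, new_out = in_a ^ in_b, out_a ^ out_b   # C | D in the example
    shared_in = in_a & in_b                        # {A, B} in the example
    return (bool(cand_in & new_in)
            and bool(cand_in & shared_in)
            and bool(cand_out & new_out))


source_1 = (frozenset("ABC"), frozenset("E"))      # A,B,C | E
source_2 = (frozenset("AB"), frozenset("DE"))      # A,B | D,E

inputs = [
    (frozenset("XC"), frozenset("D")),             # X,C | D
    (frozenset("AC"), frozenset("D")),             # A,C | D
]
found = any(single_split_supports(s, source_1, source_2) for s in inputs)
print(found)
```

Because `any` stops at the first supporting split, a single qualifying input tree ends the search; the harder multi-split cases still need a separate check.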
17:36 josephwb ok, i definitely have a bunch of new stuff to mull over
17:38 guest|41520 joined #opentreeoflife
17:41 jimallman mtholder: quick question regarding peyotl refactoring, when you have a moment
17:41 josephwb do you have time to (briefly) discuss subproblems mtholder?
17:41 josephwb hey jimallman: back off
17:41 jimallman :D    no rush here
17:41 josephwb i'm monopolizing mtholder's time
17:41 josephwb jk
17:41 * jimallman saw a pause, took a shot
17:42 josephwb i'm afraid i put mtholder to sleep ;P
17:42 josephwb i can give mtholder a break from me
17:42 josephwb my tea got cold anyway
17:42 jimallman never sleeps… it’s always there, watching
17:42 * jimallman shivers
17:44 mtholder Actually I have to step away. But I'll check back later...
17:46 jimallman danke
18:11 jimallman nothing new in the PR queue this week, i think: https://github.com/pulls?user=OpenTreeOfLife
18:13 jar286 joined #opentreeoflife
18:14 jar286 agreed, no new PRs
18:21 mtholder joined #opentreeoflife
18:22 mtholder hi jimallman. just back for a bit. did you want to discuss something?
18:23 jimallman yes. i’m looking at study-id minting, wondering how to handle this for new types that won’t work the same way
18:23 mtholder because the prefix is user specific?
18:23 jimallman i see collections (and other “minor” types) just having a uniqueness constraint, and (iiuc) always getting their name from the creator or convention
18:24 jimallman prefix=username, yes
18:24 mtholder the "read the prefix from a file in Phylesystem" thing was a late addition to make it easier to test.
18:25 jimallman ...so i can treat the id-minting stuff as specific to phylesystem (nexson) stores, or keep it as vestigial code for all (which seems cluttered and confusing)
18:25 jimallman ah, i was wondering what the rationale for prefix_file was.. :)
18:26 mtholder if we have a character prohibited from the prefix (e.g. '_') then we can be confident that each user's prefix will lead to a unique concatenation of prefix_#
18:26 mtholder I guess we don't even need to prohibit a char
18:26 mtholder last _ in the string could be the break point
18:27 jimallman ah, good thought about underscores in GitHub username. but i was anticipating user
18:27 mtholder we could serialize a dict of prefix-> last numeric code for that prefix
18:27 jimallman ‘/’ as the separator for these
18:27 jimallman eg, mtholder/trees-about-bees
18:27 mtholder OK. serialize a dict of usernames -> used strings.
18:27 jimallman so no numeric component to these, more like unique “slugs”
18:28 jimallman i see what you mean. the dict as a quick test for uniqueness.
18:28 mtholder the dict could get slightly big, but it would be read on launch and then in memory.
18:28 mtholder yeah.
18:28 mtholder a json list of slugs could be transformed into a python set of slugs for fast uniqueness checks.
18:29 jimallman the only glitch i can see to this approach is in PhylesystemShard.create_git_action_for_new_study…
18:29 jimallman ah, and that’s not a problem if we supply an explicit doc-id
18:29 mtholder \mtholder looks at that...
18:29 mtholder one sec
18:30 mtholder yeah. I think it could all be in different "namespaces"
18:30 mtholder we could just have a boolean that says the phylesystem uses a numeric counter
18:31 mtholder and the collections classes just want unique strings.
18:31 mtholder slugs, that is.
18:31 mtholder 'unique' for this user only
18:31 mtholder I want mtholder/trees-about-trees not bees ;-)
18:31 jimallman right. if id = ‘{username}/{docid}’, then unique is unique
18:32 jimallman just like ‘ot_123’ and ‘pg_123’ are allowed
18:32 mtholder sounds good.
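
The two id schemes discussed above can be sketched in Python as follows. All function names here are hypothetical illustrations, not peyotl's API: one path keeps a numeric counter per prefix and splits ids at the last underscore, the other only checks `username/slug` strings against an in-memory set loaded from a JSON list.

```python
import json


def mint_numeric_id(prefix, last_used):
    """Return the next prefix_# id; last_used maps prefix -> last numeric code."""
    number = last_used.get(prefix, 0) + 1
    last_used[prefix] = number
    return f"{prefix}_{number}"


def split_numeric_id(doc_id):
    """Split at the last underscore, so the prefix itself may contain '_'."""
    prefix, _, number = doc_id.rpartition("_")
    return prefix, int(number)


def load_slugs(json_text):
    """Turn a JSON list of used ids into a set for fast uniqueness checks."""
    return set(json.loads(json_text))


def mint_slug_id(username, slug, used):
    """Return 'username/slug', refusing ids that are already taken."""
    doc_id = f"{username}/{slug}"
    if doc_id in used:
        raise ValueError(f"{doc_id} is already in use")
    used.add(doc_id)
    return doc_id


counters = {"ot": 123}
print(mint_numeric_id("ot", counters))
print(split_numeric_id("my_prefix_12"))

used = load_slugs('["mtholder/trees-about-bees"]')
print(mint_slug_id("mtholder", "trees-about-trees", used))
```

Keeping the two minting paths as separate functions mirrors the suggestion that the numeric counter stays specific to phylesystem while collections only need unique strings.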
18:32 mtholder the / may interact with web2py routing in less-than-fun ways
18:32 mtholder but I'm sure there is a way to deal with that.
18:33 jimallman yes, if there’s an action that intercepts the early path, we can treat the rest as web2py ‘args’
18:33 jimallman i’m looking for an easy alignment of doc-ids and URLs (for simpler types)
18:33 jimallman time for one more…?
18:33 mtholder yes.
18:34 jimallman i’m looking at docstore and shard configuration, as seen here: http://api.opentreeoflife.org/phylesystem/phylesystem_config
18:34 mtholder ok
18:34 jimallman like many current features, this assumes studies, nexson, etc.
18:34 mtholder yeah.
18:35 jimallman it looks like i can override this for each type of docstore/shard, translating from type-specific terms to generic (study=>doc, etc)
18:35 mtholder sounds good. It was mainly intended to help folks who had a local copy of the git repo.
18:35 mtholder which is a corner case.
18:35 mtholder and less relevant to collections.
18:35 jimallman ah, good to know. are there other consumers for these that might be confused?
18:36 mtholder i don't think anyone uses that method.
18:36 mtholder it is not in the public API, but it is documented in phylesystem-api/doc
18:36 mtholder I think that we can tweak it w/o fear of annoying anyone.
18:37 jimallman right. iirc, the web2py apps occasionally fetch and act on these values. in any case, i’m trying to maintain identical API and outward behavior for phylesystem.
18:37 jimallman oh, one more! what are the “aliases” referred to in the shard class? i don’t get this at all, implication is aliased studies, or references to studies..
18:38 mtholder alias to allow 91 to mean "pg_91"
18:38 mtholder old pre-namespace legacy stuff.
18:38 mtholder should not be needed going forward.
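
For readers unfamiliar with the legacy behavior, a tiny Python illustration of what such an alias does. The function name and the default prefix argument are assumptions based only on the "91 means pg_91" example above, not on the shard class itself.

```python
def expand_alias(study_id, default_prefix="pg"):
    """Map a bare numeric id such as 91 to its namespaced form, e.g. 'pg_91'."""
    text = str(study_id)
    if text.isdigit():
        return f"{default_prefix}_{text}"
    return text  # already namespaced, leave untouched


print(expand_alias(91))
print(expand_alias("ot_123"))
```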
18:39 jimallman doh! OK, i’ll work accordingly. ok to weed this stuff out entirely?
18:39 mtholder I think so.
18:40 jimallman ok. i’ll try to do this in a single commit.
18:40 mtholder Great. OK. I've got to run. I'll be online tomorrow.
18:40 mtholder bye.
18:40 jimallman thanks!