
IRC log for #opentreeoflife, 2013-11-19


All times shown according to UTC.

Time Nick Message
00:17 jimallman joined #otol
00:21 travis-ci joined #otol
00:21 travis-ci [travis-ci] OpenTreeOfLife/api.opentreeoflife.org#164 (master - ce32c47 : Jonathan "Duke" Leto): The build passed.
00:21 travis-ci [travis-ci] Change view : https://github.com/OpenTreeOfLife/api.opentreeoflife.org/compare/50b513670d79...ce32c4764e75
00:21 travis-ci [travis-ci] Build details : http://travis-ci.org/OpenTreeOfLife/api.opentreeoflife.org/builds/14171998
00:21 travis-ci left #otol
00:43 travis-ci joined #otol
00:43 travis-ci [travis-ci] OpenTreeOfLife/api.opentreeoflife.org#165 (master - cfe95b9 : Jonathan "Duke" Leto): The build passed.
00:43 travis-ci [travis-ci] Change view : https://github.com/OpenTreeOfLife/api.opentreeoflife.org/compare/ce32c4764e75...cfe95b990c6b
00:43 travis-ci [travis-ci] Build details : http://travis-ci.org/OpenTreeOfLife/api.opentreeoflife.org/builds/14172752
00:43 travis-ci left #otol
01:11 dukeleto jimallman: ping?
01:12 jimallman hey man
01:13 jimallman dukeleto: ^
01:18 dukeleto jimallman: howdy
01:18 dukeleto jimallman: https://github.com/OpenTreeOfLife/api.opentreeoflife.org/issues/28
01:18 dukeleto jimallman: do you have thoughts on that?
01:19 jimallman um.. that it sucks?
01:19 dukeleto jimallman: yeah, dude
01:19 dukeleto jimallman: i really wish i had seen that one innocuous line of docs a lot sooner
01:19 jimallman maybe we could start using 1MB chunks...
01:19 jimallman but i've never liked that idea. weird to diff, hard to prettify
01:20 dukeleto jimallman: yeah, that seems like a world of pain
01:21 jimallman any chance that's a bogus limitation..? conservative "engineering" max? worth testing, i guess.
01:21 jimallman but hell, we're probably going to have >20MB files eventually
01:22 dukeleto asking for http://localhost:8080/api/default/v1/study/235.json gives me {"FULL_RESPONSE": ""}
01:22 dukeleto you can't even GET a file bigger than 1MB, which seems lame since you can ask for the raw github URL: https://raw.github.com/OpenTreeOfLife/treenexus/master/study/235/235.json
01:24 jimallman oof. study 235.json is only 1.3 MB... they're not kidding around.
01:24 jimallman i suppose if we could send only JSON patches, we might get away with < 1MB.
01:26 jimallman per POST, i mean. and resolve all GETs from raw URLs, as you point out.
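
The workaround being discussed amounts to reading large studies from the raw file host instead of the contents API. A minimal sketch in Python, using the treenexus URL quoted above (the requests library is assumed to be available):

    import requests

    RAW_BASE = "https://raw.github.com/OpenTreeOfLife/treenexus/master/study"

    def fetch_study(study_id):
        # GET the study's Nexson from the raw URL, which is not subject to
        # the 1 MB response limit seen on the contents API.
        url = "{0}/{1}/{1}.json".format(RAW_BASE, study_id)
        resp = requests.get(url)
        resp.raise_for_status()
        return resp.json()

    # study 235 (~1.3 MB) comes back fine this way, while the API call
    # above returned an empty FULL_RESPONSE for it.
    study = fetch_study(235)
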
01:28 * jimallman is googling for rays of hope... nothing so far
01:30 dukeleto jimallman: the github api mentions JSON patch, but most of the API methods only talk about POST/PUT
01:31 jimallman yeah, i doubt they're going to support it for us (i'd love to be wrong)... and I don't see how we could "build up" initial commits without having a process running local to a real repo. :-/
01:32 dukeleto jimallman: yeah, this is no bueno
01:33 jimallman we should sleep on it before jumping out a window, but ... agreed.
01:33 dukeleto the github api supports PATCH for editing a gist: http://developer.github.com/v3/gists/#edit-a-gist
01:33 jimallman interesting!
01:34 dukeleto and here: http://developer.github.com/v3/repos/#edit
01:34 dukeleto seemingly for updating a few keys of a dict without having to specify the entire dict
01:35 jimallman right, sparse updates. which is nice, but it looks like they're only using it in known (repo) structures.
01:35 dukeleto yes
01:35 jimallman it's cool that they're open to using the PATCH method, but not likely they'd support it for blobs since it assumes JSON files in a repo.
01:36 jimallman kind of a special case
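
For reference, the gist edit endpoint linked above takes an HTTP PATCH containing only the keys being changed. A rough Python sketch (the gist id and token are placeholders; the requests library is assumed):

    import json
    import requests

    GITHUB_API = "https://api.github.com"
    TOKEN = "..."  # placeholder: a personal access token with gist scope

    def edit_gist_file(gist_id, filename, new_content):
        # Sparse update: only the listed file is touched, per
        # http://developer.github.com/v3/gists/#edit-a-gist
        payload = {"files": {filename: {"content": new_content}}}
        resp = requests.patch(
            "{0}/gists/{1}".format(GITHUB_API, gist_id),
            data=json.dumps(payload),
            headers={"Authorization": "token " + TOKEN},
        )
        resp.raise_for_status()
        return resp.json()
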
01:36 jimallman i'm wondering... should we consider requesting these changes as issues? so far, I can't find a repo for the Github API.
01:38 jimallman googling for that is beyond tricky: "github api repo" hits SO many other projects...
01:38 dukeleto jimallman: yeah, their API is tightly coupled to their internals and is not open
01:39 dukeleto https://github.com/github shows their code
01:39 dukeleto https://github.com/github/developer.github.com is the api docs
01:39 jimallman right. so hosting elsewhere would mean re-inventing the whole API...
01:40 jimallman looks like your earlier approach (maintain another repo, do heavy lifting there, push changes to Github repo) might be the only way after all.
01:41 dukeleto blarg
01:42 jimallman i'm doing a quick search for weird API method names, in case it's buried under a code name (i know, unlikely)...
01:43 dukeleto jsonpatch just became an rfc in april and does not seem widely supported yet
01:44 dukeleto HTTP PATCH is from 2010, tho
01:44 jimallman right, yet another unknown.. if we control both ends, we might pull it off.
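
An RFC 6902 (JSON Patch) document is just an array of operations; applying one needs library support on both ends. A small sketch assuming the third-party python-json-patch package, with made-up Nexson paths for illustration:

    import jsonpatch  # third-party python-json-patch package (assumed)

    doc = {"nexml": {"meta": [{"property": "ot:studyYear", "content": 2012}]}}

    # A JSON Patch is an array of operations against JSON Pointer paths.
    patch = [
        {"op": "replace", "path": "/nexml/meta/0/content", "value": 2013},
        {"op": "add", "path": "/nexml/meta/-",
         "value": {"property": "ot:tag", "content": "revised"}},
    ]

    updated = jsonpatch.apply_patch(doc, patch)
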
01:45 dukeleto This is going to get messy
01:46 jimallman yeah, retreating from git (to nosql or whatever) would mean lots of wheel-reinvention.
01:46 dukeleto it is essentially the same problem as the 50MB filesize limit, but a lot sooner
01:46 jimallman right
01:46 jimallman so maybe tractable in the same ways.. chunking large Nexson files might be the LEAST ugly solution
01:47 dukeleto yeah
01:47 jimallman maybe some very generalized "sub-object" annotation we can add, so any part of the Nexson structure can be popped out to a separate file.
01:48 dukeleto looks like storing nexson files as directories with meaningful filenames
01:48 dukeleto jimallman: or that
01:48 jimallman ideally yes, but arbitrary chunk-IDs would work in a pinch.
01:49 jimallman we wouldn't want anyone to have to do much more than load the "core" file and build the whole thing before doing anything with it.
01:49 jimallman but at that point, maybe simple segments (unstructured) would work as well. oh, but the validator...
01:49 dukeleto jimallman: yeah, this gets really messy
01:50 jimallman your notion of a fixed structure in each folder sounds better to me now... something fine-grained enough that it would "never" (ha) exceed 1MB in a single chunk
01:50 dukeleto jimallman: but the "otus" part of the nexson can grow pretty darn big
01:50 jimallman not sure if that's realistic, we might well have single trees that are > 1MB
01:50 jimallman right
01:51 jimallman at the very least, the validator (and curation tool, and other consumers) would need to understand the convention and be able to build a complete Nexson to do anything interesting.
01:52 jimallman conceivably we could wrap that up in common Python code, shared with validator, web2py apps, etc.
01:52 dukeleto jimallman: this new information means that meaningful diffs for large trees are going to be very "involved"
01:53 jimallman right, we lose lots of nice Github UI for stuff like that. sigh.
01:53 jimallman would need to be offline, with the un/bundling tools involved.
01:53 jimallman i think
01:56 dukeleto oy vey
02:02 dukeleto jimallman: do you think we will see trees >50MB, excluding metadata ?
02:02 jimallman not sure, but someone on the team will know...
02:03 jimallman i'll review my notes from curator interviews. different domains had very different profiles, e.g., a microbe study can have thousands of trees, but each tree is very focused.
02:03 jimallman of course, that could mean LOTS of otus
02:03 dukeleto jimallman: if we make large commits locally and push them, and use raw.github.com to grab large things, we may be able to still use Github
02:04 dukeleto jimallman: but the 50MB limit will still loom in the distance
02:04 jimallman yep. we'll just need to watch the lag to make sure we're getting the real stuff from Github
02:05 jimallman true that (50 MB limit)
02:06 jimallman as i recall, it's a certainty that we'll have files bigger than that. so splitting mega-Nexson files is also an eventual certainty. this might have simply moved it closer in time.
02:06 dukeleto jimallman: yes, it sure did. Like "it has already happened" :/
02:06 jimallman yes, the future is now.
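
A minimal sketch of the "commit locally, push" half of that workaround, driving plain git via subprocess (the clone location and branch name are assumptions):

    import subprocess

    def commit_study(repo_dir, study_path, message):
        # Write a large study by committing in a local clone and pushing,
        # rather than going through the size-limited contents API.
        subprocess.check_call(["git", "add", study_path], cwd=repo_dir)
        subprocess.check_call(["git", "commit", "-m", message], cwd=repo_dir)
        subprocess.check_call(["git", "push", "origin", "master"], cwd=repo_dir)
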
02:08 jimallman dukeleto: interesting hits on "split segment large json files"... maybe there's a decent standard for this..
02:10 dukeleto jimallman: it seems that a way of stitching nexson together from multiple files will be needed sooner or later
02:11 dukeleto jimallman: but everyone was hoping for "later"
02:11 jimallman yeah, there are some interesting ideas in these results. like using a compact JSON manifest to orchestrate the chunks
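
One way such a manifest could look under the segmentation approach, sketched in Python (chunk size and file-naming scheme are invented here):

    import json

    CHUNK = 900 * 1024  # stay safely under the 1 MB per-file limit

    def segment(nexson_obj, study_id):
        # Split the serialized Nexson into fixed-size string segments and
        # return a small manifest recording how to stitch them back together.
        blob = json.dumps(nexson_obj)
        parts = [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]
        manifest = {"study": study_id,
                    "segments": ["{0}.part{1}".format(study_id, n)
                                 for n in range(len(parts))]}
        return manifest, parts

    def reassemble(manifest, read_segment):
        # read_segment is any callable returning a segment's text
        # (e.g. a raw-URL fetch); concatenation restores the original JSON.
        return json.loads("".join(read_segment(name)
                                  for name in manifest["segments"]))
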
02:15 dukeleto jimallman: time to head home soon. Please let me know if you find anything promising
02:16 jimallman will do. i'm chasing some existing projects that serialize and deserialize humongous JSON (related to Hadoop, etc)
02:16 jimallman will send links and thoughts as i learn more
02:16 dukeleto jimallman: thanks!
02:47 _ilbot joined #otol
04:06 mtholder joined #otol
04:36 jimallman joined #otol
11:53 travis-ci joined #otol
11:53 travis-ci [travis-ci] OpenTreeOfLife/api.opentreeoflife.org#166 (master - 020c926 : Mark T. Holder): The build passed.
11:53 travis-ci [travis-ci] Change view : https://github.com/OpenTreeOfLife/api.opentreeoflife.org/compare/cfe95b990c6b...020c926be0e7
11:53 travis-ci [travis-ci] Build details : http://travis-ci.org/OpenTreeOfLife/api.opentreeoflife.org/builds/14192710
11:53 travis-ci left #otol
13:57 towodo joined #otol
14:30 jimallman joined #otol
16:29 jimallman towodo: FYI - we've discovered a hard file-size limit in the GitHub API (Duke found it buried in the API docs).. I'm exploring workarounds and options for segmenting large Nexson now, will post some findings shortly.
16:30 towodo yes, I saw that.
16:30 towodo we have similar issues for the taxonomy and for the synthetic tree.  wondering if there's a hierarchy-aware solution
16:31 towodo i've been assuming taxo / synth tree would be stored outside of github
16:31 towodo but if they could be in github, that would be pretty cool
16:32 jimallman i'm exploring two general approaches: segmentation into arbitrary-length strings, and a more structured approach based on pruning and separate sub-objects, leaving a sort of token in their place. There are interesting pros and cons to each.
16:32 towodo i've been imagining an algorithm that 'clips' a tree into chunks of size no bigger than N
16:33 towodo but the structured clipping idea was one i was going to put off until after our release
16:33 towodo worried about how much still needs to get done
16:35 jimallman agreed, it was nice having this on the back burner...
16:35 towodo is this a reason to move to bitbucket, or to pay $ to github?
16:36 jimallman i don't believe paid GitHub would relax these limits, though we should probably ask.
16:36 jimallman I thought someone in the know (Duke?) was clear that bitbucket doesn't have anything like the Github API, but I can take a quick look...
16:37 jimallman i stand corrected: http://blog.bitbucket.org/2013/11/12/api-2-0-new-function-and-enhanced-usability/
16:38 jimallman looking at this now, to see if they have the features we need.
16:38 towodo well it's a longshot, need to weigh the alternatives.  implementing segmentation or clipping introduces several risks
16:40 jimallman agreed. we'll need to provide simple, general tools to help people assemble and disassemble monolithic Nexson docs.
16:41 jimallman i'm playing with the Bitbucket interactive explorer:
16:41 jimallman http://restbrowser.bitbucket.org/
16:42 dukeleto joined #otol
16:43 jimallman dukeleto: hi, jar (towodo) and i were just talking about next steps.. i'm looking at bitbucket's API, which just made a major jump to v2 in October.
16:44 dukeleto jimallman: interesting
16:44 jimallman https://confluence.atlassian.com/display/BITBUCKET/Use+the+Bitbucket+REST+APIs
16:44 jimallman interactive API explorer here: http://restbrowser.bitbucket.org/
16:44 jimallman (kinda hard to use without repos, auth, etc)
16:45 jimallman not sure yet whether it wraps git vs. mercurial in a common model...
16:46 dukeleto jimallman: the API docs look pretty sparse still
16:46 jimallman yes, they're definitely playing catch-up
16:47 jimallman https://confluence.atlassian.com/display/BITBUCKET/src+Resources
16:48 jimallman "You can use the Bitbucket src resource to browse directories and view files. This is a read-only resource."  Hm, let's see if we can make commits..
16:49 dukeleto jimallman: i only see GET requests for Changesets
16:49 jimallman same here. it seems their API is more informational, not intended for full participation.
16:51 towodo just curious, what's the biggest of our too-big files? which #?
16:51 towodo is compression an option?
16:52 jimallman it's certainly worth a look, but I'm afraid we're off by orders of magnitude.
16:53 jimallman (correction: i don't know that for sure.)
16:54 jimallman i don't have a local copy of the treenexus repo, so i don't have a sense of which are the biggest existing Nexson files, sorry. i think there's a recent copy on the old DEV though... lemme go see...
16:56 dukeleto towodo: 438 is 39MB
16:56 dukeleto towodo: here is a gist of the large files: https://gist.github.com/leto/7548604
16:57 dukeleto jimallman: ^
16:57 towodo ouch.  now I see that I made a note of that particular one on sept 13
16:57 dukeleto gzip'ing 438 makes it 2.8mb
16:58 towodo cool, you have them checked out, any chance you could do a du -s for me as well? was wondering what the clone overhead was
17:00 jimallman that's much better compression than i would have guessed. i suppose we'd need to store them zipped in the repo for this to make any difference (re: the API's file-size limit), right?
17:01 towodo yeah...
17:01 towodo but maybe just those particular ones
17:01 jimallman good point.
17:02 dukeleto towodo: you want a du -s for each study?
17:02 towodo no, no, just the total
17:03 towodo for the directory.  i.e. how much space is gobbled up by a clone
17:03 dukeleto towodo: 532MB
17:04 dukeleto towodo: 159MB of which is the .git dir
17:05 towodo thanks.
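
A small Python sketch of the kind of survey being done here: walk a local treenexus clone, flag studies over the 1 MB limit, and report how much compression buys back (directory layout per the raw URLs quoted earlier):

    import os
    import zlib

    LIMIT = 1024 * 1024  # the GitHub contents-API write limit

    def oversized_studies(treenexus_dir):
        # Yield (filename, raw size, compressed size) for every Nexson
        # file over the limit in a study/<id>/<id>.json layout.
        for root, _, files in os.walk(os.path.join(treenexus_dir, "study")):
            for name in files:
                if not name.endswith(".json"):
                    continue
                path = os.path.join(root, name)
                raw = os.path.getsize(path)
                if raw <= LIMIT:
                    continue
                with open(path, "rb") as f:
                    packed = len(zlib.compress(f.read(), 9))
                yield name, raw, packed
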
17:06 jimallman quick update on the Bitbucket API. i've been cross-checking the API docs and interactive API browser, which sometimes has the real McCoy. neither shows any way to submit data, so I'm thinking this is not an option.
17:15 towodo maybe we can special-case study 438, compress the other big ones, and leave the others as is.
17:15 towodo ok, going off to lunch, then there's a nescent meeting, so i should be back on irc around 4.
17:16 dukeleto jimallman: i also just found this: https://help.github.com/articles/what-is-my-disk-quota
17:16 dukeleto "1GB per repo"
17:16 dukeleto jimallman: it doesn't say if that is the working copy or .git dir, but we are approaching that, either way
17:17 jimallman dear god, why can't they put all these limitations in one place?
17:18 dukeleto jimallman: seriously. They are speckled about in very hard-to-find places
17:20 jimallman i guess we're far from their anticipated use cases. i spoke briefly with Chris Chacon at a recent tech conference here about what we're doing, but just for a minute. i wish he had more time to talk, he might have been able to warn me.
18:04 jimallman joined #otol
18:24 jimallman dukeleto: towodo: here's my first look at two approaches ("segmentation" into arbitrary chunks, vs "decomposition" into sub-objects): https://docs.google.com/document/d/1PkKPeWQW1mk7tEOf5jdcU2zSrptFjCk6qg54M-Sg2PE/edit?usp=sharing
18:33 jimallman all of which might be sadly moot based on the 1GB / repo limit Duke found.
18:50 dukeleto jimallman: looking at that now
18:51 jimallman me too. i thought maybe the paid hosting or Enterprise versions would relax these limits. no sign so far, but it's possible that the Enterprise (self-hosted) version would let you configure for larger repos.
18:51 dukeleto jimallman: if we run our own git server, the file size limits go away
18:51 dukeleto jimallman: but then authentication and provenance info becomes a lot more involved
18:51 jimallman right, but then we have no API (right?).
18:52 jimallman the Enterprise version does support API and OAuth, but it's priced per user (ouch) and isn't intended for public use.
18:52 jimallman so we'd need major exceptions to use it in an open project, and probably a serious price break.
18:53 dukeleto jimallman: yeah, github enterprise ain't cheap and it is meant to be used behind a firewall. I think it doesn't allow mixing public+private repos in the same enterprise install, as well
18:53 jimallman that would make sense.
18:55 jimallman dukeleto: ok, crazy thought: what about submodules for large studies?
18:55 jimallman split studies across n repos, just like we're splitting big Nexson files.
18:56 dukeleto jimallman: i was thinking of something like "put every N studies in their own repo"
18:56 dukeleto jimallman: if we only put "large" nexson files into submodules, keeping track of what is a submodule and what isn't would be hairy. And merges would become extremely complex
18:57 jimallman hm, this could actually help someone who needs to clone the monster as well. they could choose which submodule(s) are relevant for them and just clone those...
18:58 jimallman true. and what about if a submodule starts with small studies in it, and they all grow (massively)? moving a study from sub to sub could be weird in terms of preserving its history, though I think doable with git.
18:59 jimallman if we allocate submodules very conservatively, and always put a new study into a sub with lots of space, this could work.
19:00 jimallman how many repos would we (likely) need? ot_nexson_1, ot_nexson_2, ...
19:00 dukeleto jimallman: that depends on how fast our studies are growing and the speed of adding new studies, right?
19:01 dukeleto jimallman: going back to your question of "but then we have no API (right?)"
19:01 jimallman if we allocate (very conservatively) 10 studies/repo, we could get into the hundreds if all goes well.
19:02 jimallman re: API, i'm assuming we're talking about how to stay on GitHub. We'd just need a way to quickly determine which repo a given study is in and connect to it.
19:03 dukeleto jimallman: yes, staying on Github is quite important for visibility of the project and ease of contributing
19:03 jimallman i guess the whole submodule idea is kind of a red herring, unless we just offer the "master" treenexus repo as a quick way of cloning all studies.
19:04 jimallman really we'd just be "sharding" our Nexson storage across n GitHub repos.
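
A sketch of the simplest version of that sharding, assuming numeric study ids and the ot_nexson_N naming floated above:

    STUDIES_PER_REPO = 10  # deliberately conservative, per the discussion

    def repo_for_study(study_id):
        # Deterministically map a study id to its shard repo, so the API
        # layer knows which GitHub repo to talk to for a given study.
        shard = int(study_id) // STUDIES_PER_REPO
        return "ot_nexson_{0}".format(shard + 1)

    # e.g. repo_for_study(235) -> "ot_nexson_24"
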
19:05 dukeleto yes. and since we never know a max size for any one nexson file, it just puts off the eventual day when one study grows too big
19:06 dukeleto we will need to shard the repo and the files, seemingly
19:06 jimallman paranoid mode: each study gets its own repo! can grow to 1 GB
19:06 jimallman at some point it becomes indigestible by any reasonable curation UI, in any case
19:07 jimallman i should torture test with that monster study in the curation UI, just to see how long it takes to load..
19:07 dukeleto jimallman: i think we might get a nasty email from Github if we created 3000 github repos in our org...
19:07 jimallman :D
19:07 dukeleto jimallman: but i can't find any mention of the maximum number of repos per org
19:08 jimallman there isn't one, but most of GH's limits are "soft" and they seem flexible. maybe we should initiate a proper conversation with someone there who's senior enough to look at what we want to do and make a general decision to support the project or not?
19:09 jimallman they have like 2 million repos, so we're kind of just another grain of sand on the beach. but the project and its mission are certainly compelling, so maybe we can get their attention.
19:11 dukeleto jimallman: what exactly should we ask for?
19:12 dukeleto jimallman: i know a few different people at GH, but not sure of their seniority or ability to bend rules
19:12 jimallman first, a reality check. is what we're trying to do something they support in principle? if so, will they recognize that we have... unusual needs for large repos (and possible files)?
19:13 jimallman and third, can they make exceptions for things like repo size or total size for an organization? fwiw, i gather the soft 1GB limit on repo size is a performance issue, so it shouldn't have to do with "reasonable" space for an org.
19:14 dukeleto jimallman: it is performance but also it comes down to the possibility of running out of memory on their machines
19:15 jimallman hmm, that makes sense.
19:15 jimallman they must be making similar decisions all the time about sharding user space.
19:15 dukeleto git takes 2-3x more memory than the largest file you operate on, which is why their max file size is reasonably small
19:15 jimallman i wonder if they need to keep an organization (all its repos) on one machine
19:16 jimallman yes, i read that somewhere (2x largest file size)
19:16 dukeleto jimallman: their internal architecture changes extremely fast
19:16 dukeleto jimallman: i saw a presentation about it a few years ago and I am sure it is all completely different now
19:16 dukeleto jimallman: even then, it talked about 5 or so different ways they had gone through organizing/sharding their data
19:17 dukeleto jimallman: the big question: Where do we go from here?
19:18 dukeleto jimallman: we can easily shard a new datastore repo when we get close to the 1GB limit. Writing files >1MB seems to be the most pressing issue
19:19 dukeleto jimallman: that, and dealing with studies that grow bigger than 50MB
19:19 jimallman agreed. i think solving #1 solves #2, by definition.
19:20 dukeleto jimallman: it would be interesting to see if the majority of the size of large studies is metadata or OTUs
19:20 jimallman if my pros/cons document seems reasonable, we could open this up for a wider conversation.
19:20 jimallman yes, that's an interesting question. moving annotations "outboard" would certainly be one way to handle things.
19:21 dukeleto jimallman: it is a nice description of where we are at
19:21 dukeleto jimallman: i think wider discussion is good
19:21 jimallman i'm afraid i've raced to those two options and i'm missing some alternative..
19:22 jimallman then again, it's a basic question: does the division respect the JSON structure or not?
19:25 dukeleto jimallman: yeah, i am imagining the process of writing to a nexson "file" if it's split up across many files
19:26 jimallman i'm assuming (kinda) that the OTOL API, or some other middleman, would be responsible for bundling all the mini-files into a monolithic Nexson, and vice versa.
19:28 jimallman see my pseudo-code for the decomposition approach. it involves simultaneously traversing the JS object, and its "stringified" parts to figure out how big things are, and where to make splits.
19:28 jimallman the other approach is tempting for its brute simplicity.
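
The actual pseudo-code for the decomposition approach lives in the shared document; the following is only a rough illustration of that style, measuring the stringified size of each value and popping oversized ones out behind a placeholder token (token format, limit, and names are invented here):

    import json

    LIMIT = 900 * 1024  # stay under the 1 MB per-file limit

    def decompose(obj, limit=LIMIT, _out=None):
        # Mutates obj: any dict member whose serialized form exceeds the
        # limit is replaced by a token, and the member itself is collected
        # into _out (recursing first, in case it is also too large).
        # Note: a huge flat array (the otus worry above) would still come
        # out as a single oversized piece and needs separate chunking.
        if _out is None:
            _out = {}
        if isinstance(obj, dict):
            for key, value in obj.items():
                if len(json.dumps(value)) > limit:
                    decompose(value, limit, _out)
                    token = "@sub_{0}".format(len(_out))
                    _out[token] = value
                    obj[key] = token
        return obj, _out

    def recompose(core, subs):
        # Rebuild the monolithic Nexson by swapping tokens back in.
        if isinstance(core, dict):
            return {k: recompose(v, subs) for k, v in core.items()}
        if isinstance(core, list):
            return [recompose(v, subs) for v in core]
        if isinstance(core, str) and core in subs:
            return recompose(subs[core], subs)
        return core
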
19:31 dukeleto jimallman: i agree
19:31 dukeleto jimallman: we can always segment any size file into N files that meet our needs
19:32 dukeleto jimallman: decomposing into sub-objects sounds like a lot more code will need to be written, to decide where to decompose
19:32 jimallman all we lose is an easy view of intra-file changes.. at least in GitHub. to diff this, you'd need to concatenate the files from both versions and diff in another tool. ugh.
19:32 jimallman ^ that was for the segmentation approach, i mean
19:33 dukeleto jimallman: understood
19:33 jimallman hmmmm
19:33 dukeleto jimallman: and if we decompose something into 5 sub-objects and then one of those sub-objects wants to grow too big, we still have the same problem
19:34 dukeleto jimallman: whereas if we segment, that can't happen
19:35 jimallman yeah, my pseudo-code expects this to happen, one file becomes two, becomes four over time. sub-objects can have their own sub-objects if they grow too large.
19:35 jimallman it sounds gnarly, but the effect would actually be pretty consistent, and recursively loading all the sub-objects into a single Nexson obj should be pretty easy.
19:36 jimallman i'm more worried about the huge-pile-of-OTUs scenario, where there's no obvious attachment point for a sub-object.
19:37 dukeleto jimallman: yes, the case you mentioned of one huge flat collection of OTUs
19:37 jimallman we might need to have a token that basically represents "100 items in this array". then we could line them up in the parent array. ugh.
19:38 dukeleto jimallman: yes, ugh
19:38 jimallman ... meanwhile, i'm looking at git-annex, which tries to deal with the "big files in git" problem by storing pointers to large files elsewhere...
19:39 jimallman http://git-annex.branchable.com/
19:39 dukeleto jimallman: i met the author of that a few years ago. Very interesting tool, but it doesn't version control the large files
19:40 dukeleto jimallman: it allows you to organize, for instance, huge collections of media and use git to "sync" them
19:40 jimallman hmm. but could we version them? or store files on a journaling filesystem or the like?
19:44 dukeleto jimallman: seems possible
19:52 jimallman dukeleto: it looks like (hard to tell with minimal documentation) you can in fact version the big data files with git-annex: http://git-annex.branchable.com/forum/Retrieve_previous_version_in_direct_mode/
19:53 jimallman i gather "indirect mode" will version naturally, while "direct mode" means you're sort of editing the one-and-only version of a file.
19:57 * jimallman has to run a quick errand. back in a bit...
21:00 towodo (reading irc transcript)
21:04 jimallman joined #otol
21:05 towodo I think their limits may have to do with abuse, so contacting GH for exceptions might be the way to go
21:29 jimallman i'm looking at several different projects that offer separate storage (and versioning) of big data files: git-annex, git-fat, git-media, boar... all of these maintain metadata and a pointer in git proper, so might work for our purposes.
21:29 jimallman not clear which of these are compatible with a GitHub repo yet...
23:51 towodo ahoy, anyone there to field a python newbie question?
