IRC log for #opentreeoflife, 2013-11-20

All times shown according to UTC.

Time Nick Message
00:02 mtholder joined #otol
00:03 mtholder whoa! that is a long irc log.
00:03 mtholder Does someone still have a python newbie question?
00:11 towodo nope, figured it out… you can use a tuple (immutable) as a dictionary key
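
The point about tuples as dictionary keys can be shown in a few lines. This is only an illustrative sketch: the study and tree ids below are made-up values, not data from the project.

```python
# Hypothetical example data: (study id, tree id) pairs used as dictionary keys.
tree_labels = {}
tree_labels[("study_10", "tree_1")] = "preferred tree"
tree_labels[("study_10", "tree_2")] = "alternate tree"

# Tuples are immutable and hashable, so lookup by an equal tuple works.
print(tree_labels[("study_10", "tree_1")])

# A list is mutable and therefore unhashable, so it is rejected as a key.
try:
    tree_labels[["study_10", "tree_3"]] = "oops"
except TypeError as err:
    print("list key rejected:", err)
```

A tuple works as a key only if everything inside it is hashable too; a tuple that contains a list fails the same way.
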
00:12 towodo towodo = jar
00:12 towodo (short for "total world domination")
00:27 mtholder joined #otol
00:36 dukeleto joined #otol
00:51 jimallman joined #otol
01:04 dukeleto jimallman: your gdoc is a great overview of the situation. Thanks for making that
01:04 jimallman dukeleto: cool. i've sketched out a possible solution using git-fat with the OTOL API. it's at the end of my notes. happily, the git-fat code is one small Python script, so it should be very easy to use and tweak as described there.
01:05 jimallman the more i think about it, the better this looks to me. at least compared to every alternative.
01:06 dukeleto jimallman: git-fat does look more promising than the other alternatives
01:06 dukeleto jimallman: and the fact that it is python is good. I noticed git-annex is written in Haskell...
01:08 jimallman yeah, not my strong suit. same for git-media and Ruby. git-media and git-fat share the most important trait, which is using git's clean and smudge filters to transparently intercept some files for special treatment as they enter or leave the repo.
01:09 jimallman it's a remarkably powerful feature, but (afaict) not supported on GitHub, since it requires running trusted scripts alongside the repo.
01:12 jimallman dukeleto: my proposed OTOL API + git-fat uses some sleight-of-hand to keep the activity on GitHub with original authors. Certainly not what they originally imagined. But the pieces are simple enough that I'm quite confident this will work.
01:17 dukeleto jimallman: manually setting the author via --author seems reasonable
01:17 jimallman esp. since the OTOL API will have their credentials (i believe we always pass these, fresh from the GitHub profile)
01:17 dukeleto jimallman: how would we see diffs of large nexsons with git-fat? Do we lose that ability?
01:18 jimallman i think so (see the merge notes at the end)... it's a wart, for sure.
01:19 jimallman for the power user (using a cloned repo), they can do this themselves with a proper diff tool.
01:20 jimallman for others, i'm thinking maybe a super-simple website that lets 'em GET any Nexson file from storage (by its SHA1, see notes) or a diff of any two.
01:21 jimallman maybe something like this:
01:21 jimallman http://data.opentreeoflife.org/nexson/1f218834a137f7b185b498924e7a030008aee2ae/diff
01:21 jimallman (would show the specified Nexson+version diff'd against the previous version of the same file)
01:22 jimallman could be a raw diff, or we find an easy tool to show it all pretty-like with CSS
01:30 jimallman pretty diffs in HTML: http://stackoverflow.com/questions/7661045/how-to-highlight-more-than-two-characters-per-line-in-difflibs-html-output
01:30 jimallman looks like Python difflib or Google's diff-match-patch will do a nice job.
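
A minimal sketch of the difflib option mentioned above, using only the standard library. The function names and the sample Nexson-like text are assumptions made for illustration; nothing here reflects an existing OTOL service.

```python
import difflib


def html_diff(old_text, new_text, old_label="previous", new_label="current"):
    """Return a complete HTML page with a side-by-side, colorized diff."""
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        old_text.splitlines(),
        new_text.splitlines(),
        fromdesc=old_label,
        todesc=new_label,
        context=True,  # show only changed regions plus a few nearby lines
        numlines=2,
    )


def raw_diff(old_text, new_text):
    """Return a plain unified diff as a single string."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(lines)


# Made-up sample content standing in for two versions of one file.
old = '{\n  "label": "Homo sapiens"\n}'
new = '{\n  "label": "Homo sapiens sapiens"\n}'

page = html_diff(old, new)
print(raw_diff(old, new))
```

`page` is a standalone HTML document that could be served as-is, and `raw_diff` covers the "raw diff" case. Because difflib compares line by line, a Nexson file stored as one very long line would need pretty-printing first to give a useful result.
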
01:32 dukeleto interesting
01:33 jimallman https://code.google.com/p/google-diff-match-patch/
01:34 jimallman this rocks, by the way. in a previous project, we used it (from C#) to generate nice "redline" views of truly gnarly changes.
01:35 jimallman here's a demo, see esp. the "efficiency cleanup":
01:35 jimallman https://neil.fraser.name/software/diff_match_patch/svn/trunk/demos/demo_diff.html
03:55 lcoghill joined #otol
04:56 jimallman joined #otol
14:02 towodo joined #otol
14:23 jimallman joined #otol
14:51 mtholder joined #otol
16:24 mtholder joined #otol
17:38 dukeleto joined #otol
17:39 Topic for #otol is now opentreeoflife.org | github.com/opentreeoflife
17:53 dukeleto jimallman: looking at your most recent gdoc now
17:57 jimallman joined #otol
17:57 dukeleto jimallman: welcome back
17:59 dukeleto jimallman: one clarification that I would like to talk about: the main issue we have is storing our *data* on Github, not our code
18:00 dukeleto jimallman: and I am not sure we would want to use git-fat if we were hosting our own datastore git repo
18:01 jimallman dukeleto: hi! we have a few decisions in front of us, and it's hard to avoid some cross-talk between them..
18:01 jimallman re: big files, i'm being swayed by the chorus of voices (pretty much everyone) that cautions against keeping big files (text *or* binary) in git.
18:03 jimallman ... so i'm drawn to git-fat, even if we host our own data repo. i suppose we could try keeping it all in git and change later if needed... but i think it will be needed.
18:03 dukeleto jimallman: i definitely understand your point of view
18:04 jimallman if we do start with Nexson files in the git repo, there are of course ways of deflating it later. but they all involve messing with history, so we'd need to touch base with any cloned repos and have them push and re-clone..
18:08 dukeleto I guess I am looking for a short term solution that is flexible enough to meet our needs in the medium and long-term
18:09 jimallman i'm happy to set up the git-fat end of things, if we want to try it.
18:10 dukeleto jimallman: the main issue i have with git-fat is that the "big" files are not versioned
18:11 jimallman they are, actually. file data is stored in the back-end store, keyed by SHA1.
18:11 jimallman so (if i understand correctly) it's not very efficient, just big files side-by-side. but the versions are there if we have room for them.
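
The "keyed by SHA1, big files side-by-side" idea can be sketched as a tiny content-addressed store. This is a simplified illustration under assumptions, not git-fat's actual code: the `FatStore` class, its directory layout, and the sample bytes are all hypothetical.

```python
import hashlib
import os
import tempfile


class FatStore:
    """Hypothetical store: one full copy per version, named by its SHA1."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, data):
        """Save the bytes under their SHA1 hex digest and return the digest."""
        digest = hashlib.sha1(data).hexdigest()
        path = os.path.join(self.root, digest)
        if not os.path.exists(path):  # identical content is stored only once
            with open(path, "wb") as fh:
                fh.write(data)
        return digest

    def get(self, digest):
        """Return the bytes previously stored under this digest."""
        with open(os.path.join(self.root, digest), "rb") as fh:
            return fh.read()


store = FatStore(tempfile.mkdtemp())
v1 = store.put(b'{"study": "made-up", "trees": 1}')
v2 = store.put(b'{"study": "made-up", "trees": 2}')
print(v1 != v2, sorted(os.listdir(store.root)) == sorted([v1, v2]))
```

Every version stays retrievable by its digest, which is the sense in which the files are versioned. Nothing is delta-compressed, so the store grows by the full file size on each change.
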
18:12 * jimallman is double-checking to make sure i'm not mis-remembering something from git-annex...
18:12 dukeleto jimallman: and git-fat works by file extension, so it would want something like "*.json filter=fat", but we would want only certain "large" json files to be dealt with, correct ?
18:13 jimallman in this scenario we should (in my opinion) treat all Nexson files equally... but we could probably add a size-test to the "clean filter" that makes the call.
18:14 jimallman i suppose we could start new studies in git, then (if/when they exceed a preset size threshold) shunt them into the git-fat store.
18:16 dukeleto jimallman: but how to "shunt" ? I want to believe, but the details are murky :)
18:18 dukeleto jimallman: the solution of the OTOL API transparently segmenting/stitching together nexson files is appealing to me
18:18 dukeleto jimallman: who do we expect to use git-fat? Would it only be internally used by the OTOL API, or do we expect others to use it?
18:18 jimallman git-fat (and git-media) work largely by using git's filter feature. the "clean" filter is used for incoming files (commits etc) and is commonly used for normalizing whitespace, etc. The "smudge" filter is applied to outgoing files, usually to add stuff like $Version to the user's local file.
18:19 jimallman git-fat uses these as hooks to detect and divert big files to/from a separate store, dropping in a little proxy file into git instead.
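
A rough sketch of the clean/smudge round trip described above, written as pure functions over bytes so it can run on its own. The placeholder format and the in-memory dict store are assumptions for illustration and are not git-fat's real format; an actual git filter would read the file from stdin and write the result to stdout.

```python
import hashlib

# Hypothetical placeholder format, chosen for this sketch only.
PLACEHOLDER_PREFIX = b"#placeholder sha1="


def clean(data, store):
    """Incoming file: divert the content to the store, hand git a small proxy."""
    if data.startswith(PLACEHOLDER_PREFIX):
        return data  # already a proxy, so leave it alone
    digest = hashlib.sha1(data).hexdigest()
    store[digest] = data
    return PLACEHOLDER_PREFIX + digest.encode("ascii") + b"\n"


def smudge(data, store):
    """Outgoing file: swap a proxy back for the real content, if available."""
    if not data.startswith(PLACEHOLDER_PREFIX):
        return data  # ordinary file, pass through untouched
    digest = data[len(PLACEHOLDER_PREFIX):].strip().decode("ascii")
    # If the store lacks the content, keep the proxy rather than fail.
    return store.get(digest, data)


store = {}
big_file = b"x" * 1000
proxy = clean(big_file, store)
restored = smudge(proxy, store)
print(len(proxy), len(restored))
```

Git itself only ever sees the small proxy, while the working copy gets the full content back on checkout whenever the store holds it.
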
18:20 jimallman if we go with git-fat, the OTOL API would use it to divert big files to a remote (?) store, and users of the API would be oblivious. anyone working directly with a cloned git repo would need to set up and use git-fat.
18:21 jimallman i'm still chasing the question of how it handles cloned repos... whether we could all safely share a remote datastore, etc. i'm guessing anyone who wants to work outside the API will have the chops to set up their own store.
18:21 jimallman i'll check to make sure there are tools to handle merges and pull requests in this case.
18:22 dukeleto jimallman: well, if somebody cloned the repo, it would have a .gitfat file which lists the same remote
18:24 jimallman yes, and shared history. which suggests they could safely share. i'm wondering about possible SHA collisions (unlikely, i know) and in particular any "cleanup unused" commands that might clobber the other repo's stuff.
18:24 jimallman otherwise we'd want to treat the .gitfat file like our web2py private/config files, and keep the active file out of the repo
18:25 mtholder joined #otol
18:26 dukeleto jimallman: all these issues come down to our want/need to allow individual nexson files to grow arbitrarily large
18:26 jimallman true
18:26 jimallman and mtholder has pointed out that there might be other reasons to decompose the big'uns... so there's that.
18:29 jimallman re: git-fat, some answers:
18:29 jimallman its data store can be shared across clones, which is cool
18:30 jimallman it has push/pull commands to sync up between local and remote fat-store (not repo, just the raw files, so we don't need the API machine to hold everything locally)
18:33 dukeleto jimallman: deciding how and when to decide if a file is "fat" is an issue. What happens when a file wasn't fat and now it is? How does the tooling deal with that?
18:34 dukeleto jimallman: generating a diff between two commits, where a file wasn't fat and now it is, seems like it could be hairy
18:35 jimallman the clean filter runs on each commit of a file, so each time a Nexson doc comes back to the repository we should be able to check its size and (if it has crossed the line) bump it into the fat-file store. A tiny placeholder would be dropped in its place, so git would now hold the previous version as a Nexson file, and the new version as a git-fat placeholder.
18:36 dukeleto jimallman: git-fat seems really interesting, but it doesn't seem like it is used by many people and doesn't have active development, which gives me pause
18:37 jimallman true, getting a diff across that particular change won't be easy. i'd suggest we might catch the newly-fat file, bump its PREVIOUS version to fat-file storage, and then use our non-git tools to diff the fat-file versions, as we talked about last night.
18:37 jimallman i would share your concern if it wasn't so. damn. simple.
18:37 jimallman git-media is very similar and less active. it really looks like a proof-of-concept demo.
18:38 jimallman git-annex is very active, but much more complex and (IMO) overkill for us.
18:38 dukeleto jimallman: "bump it into the fat-file store" but how? The way filter works is by file extension, and I don't clearly see a way to selectively make certain files "fat"
18:39 jimallman the file extension sends an incoming file to the filter. the filter can decide whether to move it to fat-file storage (and return a tiny placeholder) or that it's too small (and return the original file)
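
The size test could look something like the sketch below: the extension routes every Nexson file to the filter, and the filter decides per file. This is an assumed modification, not stock git-fat behavior. The function name, placeholder format, dict store, and the tiny threshold used in the demo are all hypothetical.

```python
import hashlib

# Hypothetical placeholder format, chosen for this sketch only.
PLACEHOLDER_PREFIX = b"#placeholder sha1="


def clean_with_threshold(data, store, threshold_bytes):
    """Return small files unchanged; divert large ones and return a proxy."""
    if data.startswith(PLACEHOLDER_PREFIX):
        return data  # already a proxy
    if len(data) < threshold_bytes:
        return data  # too small to bother with, so git keeps the real file
    digest = hashlib.sha1(data).hexdigest()
    store[digest] = data
    return PLACEHOLDER_PREFIX + digest.encode("ascii") + b"\n"


store = {}
small = b'{"nexml": {}}'
large = small + b" " * 200  # the same made-up file after it has grown

committed_v1 = clean_with_threshold(small, store, threshold_bytes=100)
committed_v2 = clean_with_threshold(large, store, threshold_bytes=100)
print(committed_v1 == small, committed_v2.startswith(PLACEHOLDER_PREFIX))
```

The demo reproduces the mixed history discussed here: the first version is committed as a real file and the second as a proxy, so an ordinary `git diff` between them would compare Nexson text against a placeholder.
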
18:39 dukeleto jimallman: i guess i am looking for a solution that sticks with basic git, if one exists
18:39 jimallman i hear that. we're going to be doing something weird in any case, even if it's simple segmentation.
18:39 dukeleto jimallman: if we were to use git-fat, what would your file size boundary be, to consider a file fat?
18:40 jimallman there are some other benefits to an off-sides store for large files. someone cloning the repo will have a reasonable size to work with..
18:41 towodo hey all - any input on what to do on today's call?  need any questions answered?
18:41 jimallman re: the boundary size, i'm beginning to think that GitHub's sizes aren't pulled from a hat. so maybe use their soft limit of 50MB?
18:43 jimallman i'd feel better (for consistency and simplicity) just treating all Nexson files equally. i feel like it's going to surprise the hell out of somebody to see their Nexson file grow and suddenly *poof* it's moved somewhere else. but i think i'm alone there, and most people want to keep the small studies close by in git.
18:43 dukeleto jimallman: sure. but they could also do git clone --depth=1 or "sparse checkout" described here: http://stackoverflow.com/questions/600079/is-there-any-way-to-clone-a-git-repositorys-sub-directory-only
18:45 jimallman true, but then they don't get history. maybe that's moot though.
18:45 dukeleto towodo: well, we need to decide what we *must* have in the short term for nexson storage vs. what would be *nice to have*
18:45 dukeleto jimallman: yep. If you don't want to download all of history, you don't get it. That seems reasonable
18:46 jimallman yes. if we think the 50MB limit will work in the short term, we could probably postpone any special handling of big files if we use a local repo. i just don't like the idea of playing chicken with the first big study that shows up...
18:47 towodo hmm.  this is about the curation interface, what it can do.  that's been mainly a Karen thing, the idea being that maybe we won't get a second chance with many people, if they decide it's no good they won't come back when it works better.
18:47 jimallman i'm guessing even the "hard" file-size limit on GitHub (100MB) is safe, or they wouldn't allow it.
18:48 towodo i don't think editable huge studies figures very highly in that equation
18:48 towodo the main thing is, can one of our visitors add a study.  clean, featureful git interface has not been a must-have for release, so far as I've heard
18:49 dukeleto jimallman: yes, a 100MB file means you will need 200MB of free memory to operate on it
18:49 towodo but yes, sounds like a good thing to discuss on the call
18:49 jimallman towodo: i see what you mean, assuming we have some control over what studies come into the system. or can have a "fire drill" when the first whale appears.
18:49 dukeleto jimallman: github has very conservative limits, because their servers are dealing with massive quantities of users and repos
18:49 dukeleto jimallman: we don't have that issue, and our limits can be much less conservative
18:50 jimallman makes sense (assuming our API server is beefy)
18:50 towodo yes. no whales until later in 2014, or something like that
18:52 jimallman if that's our policy, then we can postpone the big-file solution, which is nice.
18:52 towodo that would be my call, but need Karen's signoff
18:53 jimallman OK. and we do have ways (slightly painful) of moving the odd large file out of the repo later, if need be.
19:01 towodo "Uh oh. Something's not right.
19:01 towodo We're having trouble loading this Hangout."
19:02 towodo closing and reopening chrome
19:03 towodo still not working
19:03 jimallman !
19:04 towodo jim, can you try initiating? at least with me, and maybe we can add people?
19:04 jimallman sure, just a sec (first time here)
19:05 * dukeleto is ready and on g+
19:07 mtholder me, too.
19:08 jimallman dukeleto: we're on the call...
19:08 jimallman trying to bring you in
19:09 dukeleto i see two of everybody and then when I join, I am alone
19:09 jimallman awesome.
19:10 dukeleto jimallman: can you try to add me again?
19:10 jimallman yes
19:11 jimallman sorry, it's on the way...
19:11 jimallman you should have it, i mean
20:33 dukeleto exciting times, indeed
22:07 travis-ci joined #otol
22:07 travis-ci [travis-ci] OpenTreeOfLife/treenexus#41 (testing_deploy - 5762c81 : Mark T. Holder): The build passed.
22:07 travis-ci [travis-ci] Change view : https://github.com/OpenTreeOfLife/treenexus/compare/testing_deploy
22:07 travis-ci [travis-ci] Build details : http://travis-ci.org/OpenTreeOfLife/treenexus/builds/14277633
22:07 travis-ci left #otol
22:16 jimallman joined #otol
22:39 dukeleto joined #otol
23:18 jimallman joined #otol
