
IRC log for #marpa, 2015-12-02


All times shown according to UTC.

Time Nick Message
03:09 JPGainsborough joined #marpa
04:37 ceridwen Idiosyncrat, Is there somewhere in the Marpa documentation or code you can point me to that shows how exactly you lay out the memory for the SPPFs/bocages?
04:41 Idiosyncrat ceridwen: the only thing would be the CWeb source.
04:45 ceridwen Where is that?  I'm not sure I understand the layout very well.
04:52 Idiosyncrat In the "official" Aria edition, http://dinhe.net/~aredridel/.notmine/PDFs/Parsing/KEGLER,%20Jeffrey%20-%20marpa%20annotated%20implementation.pdf ...
04:53 Idiosyncrat it seems to begin on page 214, with the description of OR-nodes
04:54 Idiosyncrat I heavily optimize, so there are actually two structures for both OR-nodes and AND-nodes.
04:54 Idiosyncrat "draft" and "final".
04:55 Idiosyncrat The draft version is used while putting the bocage together, the final version is what is used in later phases.
04:56 Idiosyncrat You may find it easier to create your own layout from the description in Scott's paper.
04:57 Idiosyncrat Also, if you have not studied layouts of forests with AND- and OR-nodes in general, you'll want to do that.
04:59 ceridwen I've actually been looking at this in detail lately, partly because I've hit some irritating limitations in CPython that I'm going to have to work around.  If I'm going to end up having to write a special data structure implementation for CPython, I might as well try to make it good.
05:00 Idiosyncrat Doesn't CPython do linked lists?
05:01 ceridwen CPython has no built-in linked list type suitable for this application, unfortunately.
05:01 Idiosyncrat Also, you can "cheat" and instead of linked lists use an associative database -- hashes or AVLs
05:01 ceridwen If you're curious I can explain the other technical problems I've encountered, but they're very specific.
05:02 Idiosyncrat Even if you do have linked lists, the associative approach is tempting.
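
A minimal Python sketch of the associative approach suggested above: rather than linking nodes with references, nodes are keyed in a dict by their labels, and "links" are just other labels. The (symbol, start, end) label format and the node fields here are hypothetical, not taken from Marpa or any particular implementation.

    # Associative node store in place of linked lists.
    nodes = {}  # label -> node data

    def intern_node(symbol, start, end):
        """Return the node for this label, creating it on first use."""
        label = (symbol, start, end)
        if label not in nodes:
            # Children hold labels, not object references.
            nodes[label] = {"label": label, "children": []}
        return nodes[label]

    def add_child(parent_label, child_label):
        nodes[parent_label]["children"].append(child_label)

    # Usage: build a tiny two-node structure, then follow a "link".
    root = intern_node("S", 0, 3)
    kid = intern_node("NP", 0, 1)
    add_child(root["label"], kid["label"])
    assert nodes[("S", 0, 3)]["children"] == [("NP", 0, 1)]
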
05:03 ceridwen I was looking again today at Johnstone and Scott's latest paper, which talks about the problems they've had.  Some key quotes:
05:03 ceridwen "One approach is to use a table of all possible elements, implemented using
05:03 ceridwen sparse matrix techniques [12] in which table rows are allocated on demand.
05:03 ceridwen This technique is impractical for realistic sized problems, since even with sparse
05:03 ceridwen optimisations, high memory consumption leads to swapping during the parse."
05:04 Idiosyncrat What date?
05:04 ceridwen "Our HashPool implementation maintains a vector of refer-
05:04 ceridwen ences to arrays of integers (called the pool) as the underlying memory model.
05:04 ceridwen The vector is initially populated by null values except for the first entry which
05:04 ceridwen is allocated an array of primitive integers sized so as to contain around 1,000
05:04 ceridwen elements. Allocation simply involves incrementing an index by the size of the
05:04 ceridwen allocated element, unless the end of the array has been reached in which case
05:04 ceridwen a new primitive array is created in the next vector element. If the vector is
05:04 ceridwen full, a new larger vector is allocated and the references copied over. Java does
05:04 ceridwen not allow us to directly re-interpret the contents of these arrays as structured
05:04 ceridwen types: instead we use indexed addressing and a set of symbolic constant offsets
05:04 ceridwen to access individual fields. The resulting programming style is ugly and error
05:04 ceridwen prone, since we have effectively removed the safety net of type checking."
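
To make the quoted bump-allocation scheme concrete, here is a rough Python transcription; the block size, record layout, and symbolic offsets below are invented for illustration, and the paper's actual Java HashPool differs in detail.

    from array import array

    # Bump-allocation pool: a growable list of fixed-size integer
    # blocks; allocation just advances an index within the current
    # block. Sizes and layout are invented for illustration.
    BLOCK_SIZE = 1000

    class Pool:
        def __init__(self):
            self.blocks = [array("l", [0] * BLOCK_SIZE)]
            self.block_idx = 0   # block currently being filled
            self.next_free = 0   # next free slot in that block

        def alloc(self, size):
            """Reserve `size` consecutive ints; return (block, offset)."""
            if self.next_free + size > BLOCK_SIZE:
                self.blocks.append(array("l", [0] * BLOCK_SIZE))
                self.block_idx += 1
                self.next_free = 0
            addr = (self.block_idx, self.next_free)
            self.next_free += size
            return addr

    # Field access via symbolic offsets, as the quote describes --
    # "ugly and error prone", with no type checking to catch mistakes.
    SYMBOL, START, END = 0, 1, 2  # hypothetical 3-int node record

    pool = Pool()
    block, offset = pool.alloc(3)
    rec = pool.blocks[block]
    rec[offset + SYMBOL], rec[offset + START], rec[offset + END] = 7, 0, 3
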
05:04 ceridwen The preprint I have is from 2013, but it was published recently, maybe this year?
05:05 Idiosyncrat Is it online?
05:05 ceridwen https://pure.royalholloway.ac.uk/portal/files/25269159/softwareMicroengineeringPostPrint.pdf
05:05 ceridwen (Took me a moment to find it.)
05:08 Idiosyncrat Anyway, you can fake linked lists with a dynamic array.
05:09 ceridwen My current plan, such as it was, involved a hash table with the standard node labels as keys and weak references to the nodes themselves as values.  Unfortunately, the obvious layout for or-nodes in Python is using the immutable vector type, but it can't be weak-referenced on CPython for historical reasons.
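
The CPython limitation mentioned here is easy to demonstrate: built-in tuples reject weak references, while instances of ordinary classes accept them, so one workaround is to wrap the node in a small class and hold it in a WeakValueDictionary. A sketch, assuming CPython (class and label names hypothetical):

    import weakref

    # Built-in tuples cannot be weak-referenced in CPython ...
    try:
        weakref.ref(("S", 0, 3))
    except TypeError as err:
        print(err)  # cannot create weak reference to 'tuple' object

    # ... but instances of plain classes can, so one workaround is to
    # wrap the children in a small node class and keep the table as a
    # WeakValueDictionary from labels to nodes (names hypothetical).
    class OrNode:
        def __init__(self, children):
            self.children = children

    table = weakref.WeakValueDictionary()
    node = OrNode((("NP", 0, 1), ("VP", 1, 3)))
    table[("S", 0, 3)] = node
    assert ("S", 0, 3) in table
    del node  # last strong reference gone: CPython drops the entry
    assert ("S", 0, 3) not in table
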
05:09 Idiosyncrat One thing you can take advantage of, is that you don't need to delete individual nodes in this application.
05:10 ceridwen Side-note: I tend to refer to what you're calling and-nodes as sequence or concatenation nodes, because of the use of "and" in Boolean grammars.
05:10 Idiosyncrat So perhaps you don't have to worry about weak references.
05:10 ceridwen How so?
05:11 Idiosyncrat When you're done with the bocage, just wipe everything.
05:12 ceridwen Oh, I see.
05:12 ceridwen I picked up the weak references from Yakker, as a way to avoid having to do manual deallocation for dead parse trees during the parse.
05:12 Idiosyncrat For "weak" references you can use indexes -- you know the things aren't going anywhere until you are done with the whole thing.
05:13 Idiosyncrat Here's a big, big trick I learned with memory allocation for Marpa ....
05:13 Idiosyncrat In parsing, you don't usually need to deallocate individual pieces of data.
05:14 Idiosyncrat What you are doing is gradually building big structures, so you do need small allocations.
05:14 ceridwen Yes.
05:14 Idiosyncrat But if you put them into one big structure, that you can wipe out without understanding its structure, that works beautifully.
05:15 Idiosyncrat So none of my deallocation code has to worry about following links, what refers to what, etc.
05:15 ceridwen Right.
05:15 Idiosyncrat which is good because otherwise it would be a nightmare.
05:16 Idiosyncrat So, for example, I can delete the bocage without following links to or-nodes or and-nodes ...
05:16 Idiosyncrat I just put all the nodes into one big array ...
05:17 Idiosyncrat I delete the array, poof, it's gone, and I don't have to deal with the actual structure or links of all the things in the array.
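
A small self-contained sketch of this container trick, assuming Python: every node goes into one growable arena list, links between nodes are integer indexes (the "weak references by index" mentioned above), and deallocation is a single wipe with no link-chasing. This is an illustration of the idea, not Marpa's actual code.

    # Obstack-style arena: every node is appended to one list, links
    # are integer indexes, and teardown ignores structure entirely.
    class Bocage:
        def __init__(self):
            self.arena = []  # all or-/and-nodes, in allocation order

        def new_node(self, kind, children=()):
            """Append a node; return its index (an index works as a
            'weak reference' -- nothing moves until the final wipe)."""
            self.arena.append((kind, tuple(children)))
            return len(self.arena) - 1

        def wipe(self):
            """Free everything at once: no link-chasing, no per-node
            deallocation, no dangling half-freed structure."""
            self.arena = []

    b = Bocage()
    leaf_a = b.new_node("token")
    leaf_b = b.new_node("token")
    and_node = b.new_node("and", [leaf_a, leaf_b])
    root = b.new_node("or", [and_node])
    b.wipe()  # poof -- the whole forest is gone in one step
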
05:18 Idiosyncrat This is a trick I learned from GNU's obstacks, and I think it is a big part of the explanation ...
05:18 ceridwen Yours is, as far as I know, the only production implementation of an SPPF.  Maybe Iguana?  Anyway, re-reading Scott and Johnstone's paper reminded me of the problems they'd run into with memory on a canonical OO implementation---if I have to work around CPython's limitations anyway, I might as well try to figure out what the right implementation is.
05:18 ceridwen Good thought.
05:18 Idiosyncrat of why I am able to create large, complicated structures with no memory leaks, segmentation faults, etc.
05:19 Idiosyncrat Those other workers whose code I've read have huge problems with deallocation ...
05:20 Idiosyncrat if you try to do it by following all the links in something like a bocage ...
05:20 Idiosyncrat it's maddening.
05:20 Idiosyncrat So I just put the bocage pieces into a container, and when I am done, delete the container.
05:21 Idiosyncrat It means I cannot deallocate piecemeal, but that is usually not a problem when parsing.
05:21 Idiosyncrat Another advantage of this is error handling.
05:21 ceridwen You don't run into memory limitations during the parse?  I.e., cases where retaining temporary trees that won't contribute to the final SPPF is too expensive?
05:22 Idiosyncrat I don't create temporary trees.
05:22 Idiosyncrat Why would they be necessary?
05:23 Idiosyncrat Although some of the problems you may be talking about are things I deal with using the "draft" and "final" OR- and AND-nodes.
05:24 ceridwen Yes, I think they are.
05:24 Idiosyncrat For what?
05:26 ceridwen We're talking about the same thing with different words, I think?  What's the difference between your draft and final nodes?
05:26 Idiosyncrat Well, there I create a good working structure, a "draft", then I copy it over into a final structure,
05:27 ceridwen This might also be a difference between Earley and GLL---in GLL it's possible for a speculative parse to create nodes belonging to a tree that isn't ultimately linked to the root of the final SPPF.
05:27 Idiosyncrat and I delete the draft -- again, it's all in a container, so no following links, etc.
05:27 ceridwen Right.
05:27 Idiosyncrat OK.  That may be the difference, GLL.
05:28 Idiosyncrat In Earley, it's easy to tell what will be used in the parse ...
05:28 Idiosyncrat Just start at the root and traverse.
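
Combining the two ideas just described, a hedged sketch of how a draft arena might be compacted into a final one with a single traversal from the root, so only reachable nodes survive and the whole draft can then be wiped at once. The node shape matches the arena sketch above and is likewise invented.

    def finalize(draft_arena, root_index):
        """Copy only the nodes reachable from the root into a fresh
        'final' arena, remapping index links; the draft can then be
        wiped whole, exactly as in the arena sketch above."""
        final = []
        remap = {}  # draft index -> final index

        def copy(i):
            if i in remap:
                return remap[i]
            kind, children = draft_arena[i]
            j = len(final)
            remap[i] = j
            final.append(None)  # reserve the slot before recursing
            final[j] = (kind, tuple(copy(c) for c in children))
            return j

        new_root = copy(root_index)
        return final, new_root

    # Usage: node 1 is never linked from the root, so it is dropped.
    draft = [
        ("token", ()),   # 0: reachable leaf
        ("token", ()),   # 1: dead node from an abandoned path
        ("and", (0,)),   # 2
        ("or", (2,)),    # 3: root
    ]
    final, new_root = finalize(draft, 3)
    assert len(final) == 3 and new_root == 0
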
05:29 ceridwen Oddly, Yakker was originally based on Earley, so I'm not sure why they found the weak references necessary if it's always possible to tell which nodes will be in the final SPPF as soon as they're created.
05:31 Idiosyncrat Yakker may not have noticed that trick of dumping allocations into a container which can be tossed all at once.
05:31 ceridwen That doesn't mean that I can't deal with the problem the way you do: build a hash table with the labels and then delete it once the SPPF is finished.
05:32 ronsavage joined #marpa
05:33 Idiosyncrat As for the GLL aspect, I experimented with stacks of trees at one point and I can't imagine how folks get that working.
05:34 Idiosyncrat I haven't wanted to put myself in the position of dissing an approach you've decided to explore, so I haven't said a lot about GLL.
05:35 ceridwen GLL doesn't really need to deal with stacks of trees, because of how it uses the SPPF.
05:35 ceridwen I imagine that stacks of trees must be awful to implement.
05:36 Idiosyncrat OK, but you mentioned "speculative parses", which sounds like you're getting into data structures whose contents are entire trees.
05:37 ceridwen I think I know what you think of GLL without your having to say it :).  It certainly has its issues.  I'm planning to look at an Earley implementation too.  My reasons for using GLL over Earley would be entirely related to simplicity, if it turns out to be simpler.
05:37 Idiosyncrat Anyway, I didn't diss GLL, because after all, they said the same stuff about Earley and if I had listened there would be no Marpa.
05:39 ceridwen GLL parses from the top down, but it builds trees from the bottom up, like a conventional recursive-descent parser.  Is that clear?
05:39 Idiosyncrat Yes, OK.
05:41 Idiosyncrat So the "speculative parses" are the standard top-down stuff, where you're trying to either backtrack or packrat.
05:44 ceridwen Yes.  GLL and GLR are both roughly asynchronous versions of LL and LR, which follow all possible parses rather than simply balking when encountering an ambiguity.
05:46 Idiosyncrat I think I'm pretty familiar with the approach.
05:46 Idiosyncrat I read Tomita's book long ago.  I think I donated it somewhere because I no longer have it.
05:48 Idiosyncrat Grune and Jacobs actually credit Bernard Lang with the invention of GLL and GLR, but Lang apparently was not enthused with it as a parsing approach ...
05:48 Idiosyncrat whereas Tomita was and he ran with it and for a while GLR was called Tomita's algorithm.
05:48 ceridwen Anyway, for GLL versus Earley: Earley has the clear advantage that with Leo's changes, the algorithm runs in linear time on LR-regular grammars.  GLL goes quadratic on local ambiguities---you can do lookahead to limit this, of course.
05:50 ceridwen That said, I'm interested in investigating Parr's claims about lookahead + recursive descent being faster than any general CFG algorithm.
05:51 Idiosyncrat Parr's claims are probably true for all the grammars he can actually parse.
05:52 Idiosyncrat Although doesn't ANTLR also have an option to do some backtracking?
05:53 Idiosyncrat Anyway, as long as you don't packrat or backtrack, if your grammar really is LALR or LL, the traditional methods will be faster.
05:53 ceridwen The latest version of ANTLR has the world's most complicated lookahead algorithm, but it defaults to prioritized choice if its lookahead algorithm can't resolve an ambiguity, at least as far as I can tell from his last paper.
05:54 ceridwen The lookahead algorithm is supposed to function for all LL-regular grammars, so Earley still handles a larger set of grammars, not to mention that it can handle ambiguities.
05:54 Idiosyncrat Right, and I'm wondering if it doesn't go non-linear in his lookahead, which kinda defeats the purpose, at least if you look at the world from an Earley point of view.
05:54 ceridwen He claims it doesn't, and there is a proof in the paper---I can't claim to have gone through and checked it in detail.
05:56 Idiosyncrat I think he actually *does* the lookahead, and that has to take time if you're looking indefinitely far ahead.
05:57 Idiosyncrat The thing about Leo's LR-regular claim is that his algorithm doesn't actually have to do the lookahead -- Leo's algorithm parses in linear time ...
05:57 Idiosyncrat those grammars which an LR parser would need to perform infinite lookahead in order to parse ...
05:57 Idiosyncrat without actually looking ahead.
05:58 Idiosyncrat Btw, LR-regular was actually invented with the idea that there might be some cool 2-pass trick for parsing.
05:59 Idiosyncrat Pass 1: do a regular expression analysis of the input.
05:59 Idiosyncrat Pass 2: do the real parsing, using pass 1 as a guide.
05:59 ceridwen That said---Parr also pointed out when he was running ANTLR against some other parsers that Elkhound did better than the other GLR parsers, because Elkhound attempts to simplify to an LR parser when it doesn't have to deal with ambiguities.  From my perspective, I think defaulting to prioritized choice at any point is not a good quality in a parser, but the possibility of writing a GLL or GLR parser that runs as fast as an LL or LR parser except when it has to deal with ambiguities has promise.
06:00 ceridwen Yes.  Parr's claim is that ANTLR can parse any LL-regular grammar in linear time, which means that it can in fact deal with some cases of infinite lookahead, though not as many cases as Earley+Leo can.
06:00 ceridwen Assuming he's right.
06:01 Idiosyncrat I would expect any explicit claim by Prof. Parr in a paper to be correct.
06:01 ceridwen If there weren't so many errors in published algorithms in parsing, I'd be confident he's right.  As it is, I'm *almost* certain he's right.
06:02 ceridwen If that makes sense?
06:04 ceridwen I don't mean to be unwarrantedly skeptical---mostly I'm saying that I can't recreate the proof myself, which means my understanding is not complete.
06:05 ceridwen Another reason I might prefer GLL to Earley is implementation complexity---from my PoV as a maintainer and also for trying to get users/contributors, simplicity is a huge virtue, and Earley unfortunately seems to be hard to understand.
06:06 Idiosyncrat Looking at Parr's paper I don't see a claim of O(n) for LL-regular stated anywhere.
06:07 Idiosyncrat http://www.antlr.org/papers/LL-star-PLDI11.pdf
06:13 ceridwen I think I was thinking of this: "ANTLR generates LL(*) parsers that are linear in practice and that greatly reduce speculation, reducing memoization overhead over pure packrat parsers."
06:14 ceridwen So he doesn't actually have a proof.
06:15 Idiosyncrat "linear in practice" I take to mean "linear if you avoid non-linear grammars"
06:16 Idiosyncrat This kind of language is standard in backtracking/packratting, where the whole hope is that you can get away with microdosing an exponential algorithm on top of a linear one.
06:17 ceridwen http://www.antlr.org/papers/allstar-techreport.pdf
06:18 Idiosyncrat You hope you'll get the best of both -- that just a wee bit of packratting/backtracking will make all the difference, and you won't have to use it so much that it wrecks your performance.
06:19 ceridwen Yes, I misremembered.  I'll blame the late hour where I am :).  Anyway, he and his coauthors make the "linear in practice" claim in more depth in that paper---the real bound is O(n^4).
06:19 Idiosyncrat Btw, Prof. Parr's approach is the one I would be taking if I believed in top-down parsing.
06:20 Idiosyncrat He is systematic and careful with the math.
06:20 * ceridwen nods.
06:21 Idiosyncrat What page is the O(n**4) stated on?
06:21 Idiosyncrat "Unfortunately, your brain does full LL(k) and ANTLR does a slightly weaker linear approximate lookahead"
06:22 Idiosyncrat That Parr quote is from http://www.antlr2.org/doc/glossary​.html#Linear_approximate_lookahead
06:22 ceridwen The proof is on pp. 15-16.
06:22 Idiosyncrat ... and I'd like to point out something significant about it.
06:22 Idiosyncrat That is, why can't our parsing algorithms parse as well as our brains?
06:23 Idiosyncrat I think you'll find with O(n) LR-regular, the computer is finally smarter than we are about parsing.
06:24 Idiosyncrat Which I take as a sign that this is the right track.
06:29 Idiosyncrat Also, I think even with the O(n**4) bound, left recursion, direct or indirect, is not allowed.
06:30 ceridwen Maybe!  Trying to figure out how the human brain does anything is not a task I'm taking on :).  As it is, the brain can parse some context-sensitive and probably even recursively-enumerable languages, so our computers aren't quite there yet.
06:30 ceridwen Yes.
06:30 ceridwen That's another limitation that ANTLR has.
06:33 Idiosyncrat It's late CA time.
06:33 Idiosyncrat I hope what I've said was helpful.
06:33 ceridwen Yes, thanks :).
06:33 Idiosyncrat And that I didn't go too far with the LL-bashing. :-)
06:33 Idiosyncrat Good night!
06:33 ceridwen No, not at all.
06:33 ceridwen Good night!
08:18 ronsavage joined #marpa
16:12 sadmac joined #marpa
16:12 shadowpaste joined #marpa
16:12 hobbs joined #marpa
16:12 aredridel joined #marpa
16:12 ernimril joined #marpa
16:12 JPGainsborough joined #marpa
16:12 Pursuit joined #marpa
16:12 iarna joined #marpa
16:12 VsyachePuz joined #marpa
16:28 ceridwen joined #marpa
16:28 btyler joined #marpa
16:47 sadmac joined #marpa
16:47 shadowpaste joined #marpa
16:47 hobbs joined #marpa
16:47 aredridel joined #marpa
16:47 ernimril joined #marpa
16:47 VsyachePuz joined #marpa
16:47 iarna joined #marpa
16:47 Pursuit joined #marpa
16:47 JPGainsborough joined #marpa
16:52 ilbot3 joined #marpa
16:52 Topic for #marpa is now Start here: http://savage.net.au/Marpa.html - Pastebin: http://scsys.co.uk:8002/marpa - Jeffrey's Marpa site: http://jeffreykegler.github.io/Marpa-web-site/ - IRC log: http://irclog.perlgeek.de/marpa/today
16:56 sivoais joined #marpa
18:44 sivoais joined #marpa
19:19 sivoais joined #marpa
19:23 iarna joined #marpa
20:42 ronsavage joined #marpa
22:18 pczarn joined #marpa
22:34 JPGainsborough joined #marpa
