Time |
Nick |
Message |
01:00 |
|
idiosyncrat_ joined #marpa |
01:00 |
idiosyncrat_ |
https://www.reddit.com/r/programming/comments/3h8dsn/marpa_is_a_parsing_algorithm/ |
01:00 |
idiosyncrat_ |
Apparently a discussion of Marpa has started over at reddit.com |
01:54 |
|
idiosyncrat_ joined #marpa |
06:08 |
|
ronsavage joined #marpa |
10:31 |
|
nox joined #marpa |
10:31 |
nox |
Hello. |
10:31 |
nox |
How can Marpa handle all grammars covered by recursive descent parsers while it mentions restrictions with ambiguous grammars? |
10:32 |
nox |
Recursive descent parsers correspond to context-free grammars, and context-free grammars can be ambiguous. The claims on Marpa's web page sound contradictory. Am I missing something? |
11:31 |
rns |
Marpa's web page says "Ambiguous grammars that are _unions of a finite set of any of the above grammars_." and, above "All grammar classes that recursive descent parses" |
11:32 |
rns |
which I can't read other than "Marpa parses all ambiguous grammars that recursive descent parses" |
11:33 |
rns |
in linear time, that is. |
11:34 |
rns |
Oh and hello and welcome, of course. -- politeness is better late than never. :) |
12:47 |
|
idiosyncrat_ joined #marpa |
12:47 |
idiosyncrat_ |
nox: Hi! |
12:48 |
idiosyncrat_ |
First off, recursive descent is a strictly speaking not an algorithm, but a technique ... |
12:48 |
idiosyncrat_ |
this makes it hard to describe which class of grammars it handles ... |
12:49 |
idiosyncrat_ |
for example, you could have a recursive descent implementation that calls Marpa at points, |
12:50 |
idiosyncrat_ |
In practice it seems fair to treat recursive descent as LL(1) -- top-down with one character as look-ahead. |
12:51 |
idiosyncrat_ |
No LL-parser handles ambiguous grammars -- it's a deterministic method. |
12:51 |
idiosyncrat_ |
Marpa *does* handle all context-free grammars, even the ambiguous ones. |
12:52 |
idiosyncrat_ |
The restrictions you mention are on grammars that are parsed in *linear* time. |
12:54 |
idiosyncrat_ |
You can be far more generous about recursive descent and the same holds -- if you call recursive descent LL(k), for any k, it is still deterministic, and Marpa still handles all the grammars it handles in linear time. |
13:10 |
nox |
I see. |
13:10 |
nox |
Why Marpa::HTML gives 404 everywhere? |
13:54 |
rns |
nox: Marpa::HTML is backpan-only; the replacement is Marpa::R2::HTML -- https://metacpan.org/pod/distribution/Marpa-R2/html/pod/HTML.pod |
13:55 |
nox |
Ok. |
13:56 |
nox |
rns: I guess it's just a showcase of what Marpa can do, right? |
13:58 |
rns |
Not quite -- it was used in production by Jeffrey and others, IIRC. |
13:58 |
nox |
But it does things that are wrong. |
13:58 |
nox |
Like the part "The same logic works even if a table is very defective:" |
14:01 |
rns |
This is by design -- "Marpa::R2::HTML is an extremely liberal HTML parser. Marpa::R2::HTML does not reject any documents, no mater how poorly they fit the HTML standards." |
14:02 |
nox |
There is a difference between rejecting documents, and doing the thing in the HTML spec. |
14:02 |
nox |
Which now completely specify how trees should be built, and which tags can be omitted, and which tags must be ignored in which context. |
14:03 |
rns |
Yes, but one of the goals was to demonstrate Marpa parser's error recovery capability and configurability -- BTW, it has fully confgurable http://blogs.perl.org/users/jeffrey_kegler/2012/10/a-configurable-html-parser.html |
14:04 |
rns |
Standard compliance was not the goal, as cited above. |
14:04 |
rns |
s/it has/it is/ |
14:51 |
|
djns joined #marpa |
14:53 |
djns |
hi |
15:07 |
rns |
djns: hello |
15:13 |
idiosyncrat_ |
nox: which HTML spec are you referring to? |
15:13 |
nox |
HTML 5. |
15:14 |
nox |
(I don't think trying to abstract over HTML 4.01 and HTML 5 is a good idea at all.) |
15:15 |
idiosyncrat_ |
In writing Marpa::R2::HTML, I based some of its behavior on an earlier CPAN module ... |
15:16 |
idiosyncrat_ |
in particular its test suite repaired tables, and I copied that behavior |
15:16 |
nox |
I see. |
15:18 |
idiosyncrat_ |
At the time Marpa::R2::HTML was designed, HTML 5 was not as dominant, and in any case its design criteria was to turn *anything* into HTML. |
15:18 |
nox |
For the curious, HTML 5 says you must clear the stack back to a table context if you encounter a <tr> start tag in text insertion mode. https://html.spec.whatwg.org/multipage/syntax.html#parsing-main-intable:clear-the-stack-back-to-a-table-context-5 |
15:18 |
idiosyncrat_ |
You can give it a Perl script, and it will turn it into HTML. |
15:19 |
nox |
Err, misread spec. Anyway I'm pretty sure you just don't infer new <table> start tags. But with your explanation it makes sense. |
15:21 |
idiosyncrat_ |
The CPAN module (whose name I forget right now) was pre-HTML 5 and I used the table repair strategy in its test suite. |
15:22 |
nox |
Got it. |
15:23 |
idiosyncrat_ |
IIRC Marpa::R2::HTML's preference is to keep items on the stack and put them into some sort of context, making up as much stuff as needed. |
15:23 |
idiosyncrat_ |
Apparently HTML 5 mandates things be thrown away if they don't make sense. |
15:25 |
idiosyncrat_ |
Also I don't know how recent that HTML 5 language is -- the page you cite is July 2015, though its contents may be older. |
15:25 |
idiosyncrat_ |
I did study HTML 5 at the time for language as to what to do in this case, though it is not impossible I missed some that already existed. |
15:26 |
nox |
idiosyncrat_: HTML 5 is a living standard now. |
15:26 |
nox |
So you implement whatever it says that moment, and then you keep up and cry; |
15:26 |
nox |
s/;/./ |
15:27 |
idiosyncrat_ |
I suspect most HTML renderer teams do not catch up instantly with each change. |
15:27 |
idiosyncrat_ |
Marpa::R2::HTML sure won't. |
15:27 |
nox |
idiosyncrat_: But i'm pretty sure no version of HTML mandated the introduction of a missing <table> start tag. |
15:28 |
nox |
Modern browsers all have quite complete HTML 5 parsers. |
15:28 |
idiosyncrat_ |
IIRC correctly at the time I read the standard it was silent. |
15:29 |
idiosyncrat_ |
That is, silent as to what to do with an "orphan" <tr> tag |
15:30 |
idiosyncrat_ |
Note also that if you are using Marpa::R2::HTML as a utility (its more typical use) and not as a renderer ... |
15:30 |
idiosyncrat_ |
throwing things away is not usually what you want. |
15:31 |
idiosyncrat_ |
If it builds a table, you'll say "where'd that come from?" and fix the table. |
15:31 |
idiosyncrat_ |
And while I've not read the HTML 5 language, I suspect it is a mandate for *renderers* |
15:33 |
idiosyncrat_ |
And it makes sense in the context of a rendering engine, which has next to no probability of turning the lone '<tr>' into a useful table, and whose user is *not* going to change the original HTML ... |
15:33 |
idiosyncrat_ |
to simply throw that information away. |
15:35 |
idiosyncrat_ |
In the context of a user who'd want to find and fix the tag, it's annoying to throw it away silently. |
15:35 |
idiosyncrat_ |
Another point: |
15:36 |
idiosyncrat_ |
I'd hoped to use Marpa::R2::HTML to illustrate Marpa's power and flexibility for HTML parser -- its behavior is table-driven, and could be made configurable. |
15:37 |
idiosyncrat_ |
... and to inspire others to build on it. |
15:38 |
idiosyncrat_ |
Unfortunately, I wrote it before Marpa's latest interface, the SLIF, and so its actual implementation is quite obsolete in Marpa terms. |
15:47 |
idiosyncrat_ |
By the way, I'd *love* to see somebody fork Marpa:;R2::HTML, in which case they'd be able to make these choices. |
15:49 |
idiosyncrat_ |
My guess is that they fork-er will see its primary use as a utility, and decide that simply throwing away tags silently is, in that context, counter-productive. |
15:49 |
idiosyncrat_ |
s/they fork-er/the fork-er/ |
15:50 |
idiosyncrat_ |
But I'd hope they also made behavior configurable, and a strict HTML 5 mode would certainly be nice to have. |
16:05 |
djns |
HTML is fairly easy to parse, I wrote an HTML parser once just with a loop and state variable |
16:06 |
Aria |
I suspect you may not have encountered some HTML that's out there in the wild. |
16:50 |
djns |
yeah lol |
16:50 |
idiosyncrat_ |
HTML is fairly easy to *tokenize*. Parsing in the strict sense is creating the tree hierarchy formed by its elements. |
16:51 |
idiosyncrat_ |
And strict HTML parsing is not that hard either. |
16:51 |
djns |
Yes. I did work once with the SAX API in Perl. SAX is generally a good idea |
16:51 |
djns |
yes |
16:52 |
idiosyncrat_ |
But Marpa::R2::HTML parses "liberal" HTML, including highly defective HTML. |
16:52 |
djns |
SAX is for XML, but also can made to work on HTML |
16:52 |
djns |
yes |
16:52 |
idiosyncrat_ |
SAX-ish approached often beat Marpa, as long as the problem is one that you can actually solve with a SAX-ish approach |
16:52 |
idiosyncrat_ |
s/approached/approaches/ |
16:53 |
djns |
I am sure you could make a SAX front end for Marpa-based HTML parser |
16:53 |
djns |
for the SAX API |
16:54 |
idiosyncrat_ |
djns: true, but you'd wind up with the overhead of Marpa -- a great benefit of SAX is that it is low overhead |
16:55 |
idiosyncrat_ |
djns: that's actually a nice formulation of it -- if you problem can be parsed with "a loop and state variable", Marpa is probably overkill for your problem. |
16:59 |
djns |
right |
18:14 |
|
roxfan joined #marpa |
18:40 |
|
mauke joined #marpa |
20:12 |
|
koo7 joined #marpa |
21:15 |
|
purmou joined #marpa |
21:15 |
purmou |
hi all. my parser generator is pretty much complete! |
21:15 |
purmou |
noticed a ton of interest in the recent marpa post on r/programming, so I posted my parser generator there too |
21:15 |
purmou |
https://www.reddit.com/r/programming/comments/3hd1xx/pearleyparser_javascript_parser_generator_using/ |
21:16 |
purmou |
feel free to keep lookout there and answer any questions -- you all here are the experts after all :) |
22:30 |
|
ronsavage joined #marpa |
23:31 |
|
idiosyncrat_ joined #marpa |