IRC log for #marpa, 2016-03-19

All times shown according to UTC.

Time Nick Message
00:15 choroba much better now: https://github.com/choroba/karel/commit/b11d13133e2890f36e194a36d57873debb2d77e0
00:21 choroba But it seems I can't use accented characters in lexeme names, not even in angle brackets
01:09 idiosyncrat_ joined #marpa
02:00 idiosyncrat_ choroba: symbol names are documented as accepting any "Perl word character".
02:01 idiosyncrat_ https://github.com/jeffreykegler/Marpa--R2/blob/master/cpan/lib/Marpa/R2/meta/metag.bnf is the file that describes them.
02:01 idiosyncrat_ Bare names (that is, no brackets) are
02:01 idiosyncrat_ <bare name> ~ [\w]+
02:02 idiosyncrat_ And the part of a bracketed symbol name between the brackets is
02:02 idiosyncrat_ <bracketed name string> ~ [\s\w]+
02:03 idiosyncrat_ So it all depends on what "\w" matches.
02:05 idiosyncrat_ and that varies by locale, Unicode support, etc., see http://perldoc.perl.org/perlrecharclass.html
02:06 idiosyncrat_ or its equivalent for your Perl version.
02:06 idiosyncrat_ But by definition what "\w" matches is a "Perl word character"
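The two character classes quoted above can be tried directly in Perl. This is a minimal sketch, not Marpa's own code: the helper names is_bare_name and is_bracketed_name are hypothetical, and the \A and \z anchors are added here so that the whole candidate string has to match.

```perl
use strict;
use warnings;

# Hypothetical helpers mirroring the two character classes quoted from metag.bnf.
sub is_bare_name      { my ($s) = @_; return $s =~ /\A[\w]+\z/   ? 1 : 0 }
sub is_bracketed_name { my ($s) = @_; return $s =~ /\A[\s\w]+\z/ ? 1 : 0 }

# Each entry is [ASCII label, candidate name]. "\x{160}\x{C1}" spells "ŠÁ";
# the code point above 255 makes Perl store the string as UTF-8 internally.
my @candidates = (
    [ 'plain',    'lexeme' ],
    [ 'spaced',   'my symbol' ],
    [ 'accented', "\x{160}\x{C1}" ],
    [ 'hyphen',   'a-b' ],
);

for my $c (@candidates) {
    my ( $label, $name ) = @{$c};
    printf "%-8s bare=%d bracketed=%d\n",
        $label, is_bare_name($name), is_bracketed_name($name);
}
```

Whether an accented name passes depends on what \w matches for that particular string, which is the point made above.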
02:08 kaare_ joined #marpa
02:48 ilbot3 joined #marpa
02:48 Topic for #marpa is now Start here: http://savage.net.au/Marpa.html - Pastebin: http://scsys.co.uk:8002/marpa - Jeffrey's Marpa site: http://jeffreykegler.github.io/Marpa-web-site/ - IRC log: http://irclog.perlgeek.de/marpa/today
04:31 idiosyncrat_ Good night!
08:47 choroba joined #marpa
08:52 choroba http://paste.scsys.co.uk/508089
08:52 choroba I have "use utf8" there
08:53 choroba doesn't work, with or without <> around symbol names
09:06 choroba "ŠÁ" =~ /^\w+$/ returns 1, though
10:20 koo7 joined #marpa
13:45 kaare_ joined #marpa
14:02 jdurand joined #marpa
14:33 koo7 joined #marpa
15:12 idiosyncrat_ joined #marpa
16:10 idiosyncrat_ choroba: re http://irclog.perlgeek.de/marpa/2016-03-19#i_12209115 -- thanks.
16:10 idiosyncrat_ I am looking at this.
16:11 idiosyncrat_ Just to let you know, if it turns out to be a Marpa bug, I am undecided how to handle it.
16:12 idiosyncrat_ Marpa::R2 is now stable, and I do not think future versions of Marpa, such as Kollos, will support non-ASCII symbol names.
16:12 idiosyncrat_ So I don't know if I'll do a code fix or a documentation fix.
16:14 idiosyncrat_ Also, I am not doing new releases of Marpa::R2 except for major issues, and I am not sure whether I will call this issue "major".
16:15 idiosyncrat_ If it is not major, it would have to wait until a major issue came along, justifying a new release.
16:54 idiosyncrat_ re http://irclog.perlgeek.de/marpa/2016-03-19#i_12209115
16:54 idiosyncrat_ I have been able to duplicate the problem.
16:55 idiosyncrat_ choroba: you've given me all I need to act on this, but if you could file an issue on Github, that would be an additional help.
16:56 idiosyncrat_ Github issues ensure these don't get forgotten, and if you file it, it's on record that you were the one to identify the problem.
17:36 choroba joined #marpa
18:04 choroba issue created
18:08 choroba thanks for looking into it
18:09 choroba I'll need to add some kind of mapping (lexemes to UTF-8 strings) into my code to correctly report the expected lexemes, but I already have an idea that should simplify the code at the same time.
18:17 jdurand joined #marpa
18:19 jdurand Re http://irclog.perlgeek.de/marpa/2016-03-19#i_12210433 - if this can help (Jeffrey may confirm or deny), this is only an issue of recognizing utf8 characters in symbol names. Having utf8 characters in lexeme values is ok, cf. https://gist.github.com/jddurand/8066327 for an example
18:22 idiosyncrat_ jdurand: were you ever able to use accented characters in symbol names?  Or haven't you tried that?
18:22 idiosyncrat_ choroba has filed this as a Github issue: https://github.com/jeffreykegler/Marpa--R2/issues/268
18:22 idiosyncrat_ choroba: thanks!
18:23 choroba jdurand: yes, recognizing utf-8 strings works OK, I just wanted to have accented characters in symbol names.
18:24 jdurand Never did that - for the issue itself I reproduced it as well, and a "funny" way to isolate it is to run choroba's grammar using original marpa grammar - I will pastebin that - though you probably already did that...
18:25 jdurand I tried to bypass it with a utf8->import::into('Some::Internal::Marpa::Package::Like::MetaAST') but I believe this was either not the way to go or not where the problem is
18:27 shadowpaste "jdurand" at 217.168.150.38 pasted "Source to isolating the parsing issue with symbols having unicode characters with meta grammar" (245 lines) at http://fpaste.scsys.co.uk/508118
18:28 shadowpaste "jdurand" at 217.168.150.38 pasted "Output of: "Source to isolating the parsing issue with symbols having unicode characters with meta grammar "" (134 lines) at http://fpaste.scsys.co.uk/508119
18:30 choroba left #marpa
18:31 choroba joined #marpa
18:42 idiosyncrat_ jdurand: No, I didn't try copying the meta-grammar -- it's a very good idea.
18:43 idiosyncrat_ If you'd like to play with this issue, I'll be interested in what you find.
18:50 jdurand with trace_terminals set to 99 we can clearly see that \x{c1} is not registered under \w - will look a bit more into this tomorrow - I agree this is not a major issue - although clearly annoying for our friend choroba!
18:53 choroba thanks :)
18:59 jdurand where has gone my latest pastebin...
18:59 shadowpaste "jdurand" at 217.168.150.38 pasted "the interesting part with trace_terminals => 99" (9 lines) at http://fpaste.scsys.co.uk/508122
19:00 jdurand Ah voilà - you see it Jeffrey? I wonder if this is not a more fundamental issue
19:01 koo7 joined #marpa
19:07 jdurand and \xc1 is not a valid utf8 character
19:08 jdurand but valid as a unicode code point
19:11 jdurand lunch time - will continue later -;
19:14 choroba maybe it's related to how Perl stores unicode characters internally? Latin1 characters are stored as Latin1, IIRC.
19:28 idiosyncrat_ A Latin1 vs. Unicode issue was something I thought might be the issue.
19:31 idiosyncrat_ Latin1 representation is not the same as UTF-8 representation for characters > 127.
19:31 idiosyncrat_ My personal solution to this has always been to avoid Latin1, which is easy for me since the only language I know is English.
19:40 choroba A bit naïve approach, I'd say ;-)
19:42 idiosyncrat_ This may be relevant: http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22
19:44 idiosyncrat_ I didn't talk about this when you first mentioned the problem, but it was not clear to me whether the encoding in your script is wrong, or whether you'd found a bug in Marpa ...
19:45 idiosyncrat_ and since I'm rusty at this stuff, I didn't want to throw out advice until I was more sure that I wasn't sending you in the wrong direction.
19:51 choroba Adding "use feature 'unicode_strings';" doesn't help, though
19:55 jdurand choroba: yep just tried that few secs ago
20:05 idiosyncrat_ AFK
20:22 jdurand This is related to this short one-line script: perl -Mstrict -MEncode -e 'my @this = ("\N{U+00C1}", "\x{c1}", "\x{c1}"); utf8::upgrade($this[2]); map { if ($this[$_] =~ /\w/) { print "OK $_\n" } } (0..$#this)';
20:24 jdurand and if one reads the section on utf8::upgrade in perldoc utf8, I effectively solve it by adding a utf8::upgrade in Marpa::R2::Scanless::R::resume
20:25 jdurand The section in perldoc utf8 says "
20:25 jdurand Can
20:25 jdurand be used to make sure that the UTF-8 flag is on, so that "\w" or "lc()"
20:25 jdurand work as Unicode on strings containing non-ASCII characters whose code
20:25 jdurand points are below 256."
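The one-liner and the perldoc passage above can be expanded into a short script. This is a sketch of the general Perl behaviour only, not of the Marpa patch itself; is_word_char is a hypothetical helper, and the script assumes an ASCII-based Perl with neither "use locale" nor the unicode_strings feature in scope (so it deliberately avoids "use v5.12" or later, which would turn that feature on).

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the one-character string matches \w, else 0.
# No /u or /a modifier is given, so Perl chooses the matching rules from
# the string's internal representation.
sub is_word_char { my ($s) = @_; return $s =~ /\A\w\z/ ? 1 : 0 }

my $native   = "\x{c1}";      # U+00C1, stored as a single native byte
my $upgraded = "\x{c1}";
utf8::upgrade($upgraded);     # same character, now stored with the UTF-8 flag on

printf "native:   flag=%d word=%d\n",
    ( utf8::is_utf8($native) ? 1 : 0 ), is_word_char($native);
printf "upgraded: flag=%d word=%d\n",
    ( utf8::is_utf8($upgraded) ? 1 : 0 ), is_word_char($upgraded);

# The two strings still compare equal; only the internal storage differs.
print "same character: ", ( $native eq $upgraded ? "yes" : "no" ), "\n";
```

Under those assumptions the native string should report word=0 and the upgraded one word=1, even though the two compare equal as strings.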
20:26 jdurand *IF* Jeffrey and other Unicode experts (I am not one) think it is safe, they might want to try the patch that I will submit as a reply to choroba's github issue
21:03 jdurand Jeffrey, take care nevertheless, this may croak, well, not sure - unicode experts opinion is welcome
22:21 ronsavage joined #marpa
22:37 idiosyncrat_ joined #marpa
22:51 idiosyncrat_ "This anomaly stems from Perl's attempt to not disturb older programs that didn't use Unicode, along with Perl's desire to add Unicode support seamlessly. But the result turned out to not be seamless."
22:52 idiosyncrat_ From perlunicode: http://perldoc.perl.org/perlunicode.html#The-%22Unicode-Bug%22
22:53 idiosyncrat_ I note they've improved this language.  IIRC previous accounts of this, in an attempt to avoid admitting this was a mistake, used a lot of murky language.
22:54 idiosyncrat_ As best I understand it at the moment,
22:54 idiosyncrat_ 1.) Perl wanted to add UTF-8 support.
22:55 idiosyncrat_ 2.) Didn't want to force 8-bit string logic to pay the price for Unicode processing (in their defense, the price can be major.)
22:56 idiosyncrat_ 3.) Wanted it all to be DWIM -- do what I mean.
22:56 idiosyncrat_ So they created a system whereby only strings using Unicode symbols are upgraded to UTF-8.
22:58 idiosyncrat_ But they didn't notice until it was too late that it is impossible to read the intent with 100% accuracy when the character is in the range 0x80-0xFF
22:59 idiosyncrat_ And when they noticed, they couldn't fix the DWIM-ery without breaking a lot of scripts using 8-bit Latin1 and expecting the old 8-bit logic to work.
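The ambiguity described above goes away once the intent is stated explicitly. The sketch below assumes Perl 5.14 or later on an ASCII platform and uses the documented /u and /a regex modifiers and the lexically scoped unicode_strings feature; the variable names are illustrative only and nothing here is taken from Marpa.

```perl
use strict;
use warnings;

my $char = "\x{c1}";    # in the ambiguous 0x80-0xFF range, stored as a native byte

# Default rules: the result depends on how the string happens to be stored.
my $default = $char =~ /\A\w\z/ ? 1 : 0;

# /u: always apply Unicode rules, whatever the internal storage.
my $unicode = $char =~ /\A\w\z/u ? 1 : 0;

# /a: restrict \w to ASCII word characters.
my $ascii = $char =~ /\A\w\z/a ? 1 : 0;

my $scoped;
{
    # Lexically scoped: only regexes compiled inside this block are affected.
    use feature 'unicode_strings';
    $scoped = $char =~ /\A\w\z/ ? 1 : 0;
}

print "default=$default unicode=$unicode ascii=$ascii scoped=$scoped\n";
```

Because the feature is lexically scoped, enabling it in a calling script does not change regexes compiled in another module's files. That may be one reason the pragma made no difference in the test reported earlier, though that is a guess rather than a diagnosis.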
22:59 idiosyncrat_ This is going to take some research and some thinking.
23:00 idiosyncrat_ This kind of thing, by the way, is why I avoid aggressive DWIM-ery, things like adding single-quoted strings into terminals expected ...
23:01 idiosyncrat_ there are a bunch of ways that I could have done that, and they would have benefited 95% of all cases ...
23:02 idiosyncrat_ but the tricky bugs in the corner cases would make the DWIM-ery counter-productive.
23:02 idiosyncrat_ Better to force the user to specify what they mean, if there is any doubt.
23:05 idiosyncrat_ With regard to Unicode experts, I believe we will be able to call on those folks when we think we have a solution ...
23:05 idiosyncrat_ and it would probably be wise to do so before changing Marpa::R2, ...
23:06 idiosyncrat_ but first we in the Marpa community should do our homework on this one.