Ruby Can’t Be Parsed?

Is this valid Ruby code? (Syntax highlighting skipped because it obscures the point.)

whatever / 25 ; # / ; raise SystemExit.new;

Is it possible for whatever’s arity to be either zero or one, and unknown at the point when the code is parsed? (Pretty sure this is “yes”: after all, Ruby has to parse the code assuming that whatever may be added at runtime. If it’s added at runtime, it may be given an arity equal to the next bit off /dev/random.)
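
A minimal sketch of that scenario, assuming `rand(2)` as a stand-in for the bit off /dev/random; the `Host` class is a hypothetical name used only for illustration. It shows only that the arity can be undecided until the program runs, not how the parser reacts to it.

```ruby
# Hypothetical illustration: the arity of `whatever` is chosen at runtime.
class Host
  # rand(2) stands in for "the next bit off /dev/random".
  if rand(2).zero?
    define_method(:whatever) { 100 }    # defined with arity 0
  else
    define_method(:whatever) { |x| x }  # defined with arity 1
  end
end

arity = Host.instance_method(:whatever).arity
puts "whatever was defined with arity #{arity}"
```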

[EDIT: See the comments for extensive conversation on what's going on with Ruby.]

If so, then parsing Ruby is equivalent to solving the halting problem. The proof is the same as Perl Cannot Be Parsed: A Formal Proof, with appropriate translation of the code. Notably, this proof is thwarted by the use of perl prototypes (unless you have optional arguments): more on prototypes at A Defense of Prototypes, or, Why Does Tom Christiansen Hate Perl?.

Groovy avoids it purely accidentally: since a comment in Groovy always begins with a forward slash (/), there’s no way to sneak the comment character into the regex.

Comments

August 18, 2009, Matt Brubeck wrote: Except this doesn't work in Ruby like it does in Perl 5. Ruby has an actual grammar. Its parser is generated by YACC. There's only one way that Ruby will parse your expression. (It always passes 0 arguments to "whatever.") You can easily test this by trying your code snippet with two different definitions of whatever. The 1-argument version will fail with ArgumentError, not SystemExit: "test.rb:4:in `whatever': wrong number of arguments (0 for 1) (ArgumentError)."
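The test Matt describes can be reproduced with a short script. This is a minimal sketch, assuming CRuby; the `outcome` helper and the anonymous host classes are hypothetical names introduced here so that both definitions of whatever can be tried in one run.
```ruby
SNIPPET = "whatever / 25 ; # / ; raise SystemExit.new;"

# Evaluate SNIPPET against an object whose class defines `whatever`
# as given in the block, and report what happened.
def outcome(&definition)
  host = Class.new(&definition).new
  host.instance_eval(SNIPPET)
  :returned_normally
rescue ArgumentError
  :argument_error
rescue SystemExit
  :system_exit
end

nullary = outcome { def whatever; 100; end }
unary   = outcome { def whatever(x); x; end }

puts "nullary definition: #{nullary}"
puts "unary definition:   #{unary}"
```
Neither case reaches the raise: the nullary version divides and returns, and the unary version fails with ArgumentError because it is called with no arguments.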
August 18, 2009, syskill wrote: Bah. You should be able to catch that with a unit test.
August 18, 2009, Robert Fischer wrote: @Matt Where's the grammar of Ruby documented? The fact that a particular implementation behaves a particular way doesn't mean it's a parseable grammar. The perl implementation on my OS-X also parses and does something (it also assumes whatever is nullary), but that doesn't mean the grammar is parseable in a formal sense. It just means that the parser plowed over the syntax ambiguity by fiating in a decision.
August 18, 2009, Alexandr Ciornii wrote: Robert Fischer: Padre IDE will highlight such Perl code correctly. Both in Scintilla and PPI (Perl parser) modes. Matt Brubeck: AFAIK, Perl also has a grammar.
August 18, 2009, Matt Brubeck wrote: The official, cross-implementation spec of the Ruby language is the RubySpec project. As far as I can tell, it does not yet include any specs for the low-level grammar or parser. So just as with everything else in Ruby that's not yet covered by RubySpec, the only "spec" is Matz's Ruby source, specifically parse.y.
The Ruby Grammar Project is working on documenting the grammar and developing a reusable formal description in ANTLR: http://rubyforge.org/forum/forum.php?forum_id=4788
Their SVN repository has their work in progress, including both the formal grammars and some test cases: http://rubygrammar.rubyforge.org/svn/
Matz has acknowledged the parser as a weak point of Ruby that he is willing to have cleaned up: http://blog.nicksieger.com/articles/2006/10/22/rubyconf-matz-keynote
Obviously the situation for Ruby is not ideal. It would be better if there were an officially blessed spec other than the current implementation. However, the situation is very different from Perl. Since Ruby implementations actually use a well-defined grammar, it's possible to statically parse Ruby programs in a way that matches how they will be compiled and executed. As a concrete example, the Ruby syntax highlighter for Vim highlights these two lines differently, and its highlighting matches the behavior of the Ruby implementations I tested:
whatever / 25 ; # / ; raise SystemExit.new;
whatever /25 ; # / ; raise SystemExit.new;
(The second line is missing the space character after the first slash.) It's not possible for an editor to correctly highlight the syntax of the corresponding line from the Perl proof, because it may actually "parse" differently depending on runtime conditions. (I'm not sure what's up with your Perl on OS X, but Perl 5.10.0 on Ubuntu 9.04 changes its behavior depending on whether the line is preceded by "sub whatever {}" or "sub whatever() {}".)
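The static-parsing claim can be checked without running either line. The sketch below assumes CRuby's bundled Ripper library, which is not mentioned in the thread; `node_types` is a hypothetical helper that only collects the node names appearing in the parse tree.
```ruby
require 'ripper'

WITH_SPACE    = "whatever / 25 ; # / ; raise SystemExit.new;"
WITHOUT_SPACE = "whatever /25 ; # / ; raise SystemExit.new;"

# Return the node types (symbols) that appear anywhere in the parse tree.
# The source is parsed only; nothing in it is executed.
def node_types(source)
  Ripper.sexp(source).flatten.grep(Symbol).uniq
end

[WITH_SPACE, WITHOUT_SPACE].each do |source|
  kind = node_types(source).include?(:regexp_literal) ? "regexp argument" : "division"
  puts "#{source.inspect} parses as: #{kind}"
end
```
No definition of whatever exists when these lines are parsed, which is the point: the tree is chosen from the text alone.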
August 18, 2009, Matt Brubeck wrote: Alexandr, how is an IDE supposed to highlight the code from this comment? http://www.perlmonks.org/?node_id=663393 It's clear from the behavior of that code on my system that Perl does not have a complete, statically parseable grammar. (Apparently Robert's Perl implementation differs.) Sometimes the "die" statement is part of a comment; other times it is executable code.
August 18, 2009, Robert Fischer wrote: I get the same variance in behaviors in perl. The issue was on my side. Here's some weirdness with the Ruby parser I encountered while playing around.
```
class Foo
  def whatever(x)
    puts "Unary method called"
  end
end

class Foo
  def bar
    whatever / 25 ; # /
  end
end
a = Foo.new
a.bar() # ArgumentError: wrong number of arguments (0 for 1)

class Foo
  def bar
    whatever 3
  end
end
a = Foo.new
a.bar() # prints "Unary method called"

class Foo
  def bar
    a = / 25 ; # /
    whatever a
  end
end
a = Foo.new
a.bar #prints "Unary method called"
```
So you can't pass regex patterns (literal Regexps) as the first argument to a method without using parens. Is this an actual explicit disambiguation rule in the Ruby grammar, or is it just a consequence of the current implementation? Whatever it is, that weirdness is all that's standing between the Ruby syntax and the same takedown.
August 18, 2009, Matt Brubeck wrote: "Is this an actual explicit disambiguation rule in the Ruby grammar, or is it just a consequence of the current implementation?" You actually can pass Regexp literals to methods without parens:
```
def whatever(x)
  puts x
end
whatever /25 ; # / # prints (?-mix:25 ; # )
```
This is indeed an intentional disambiguation in the Ruby grammar. After the lexer sees a "/" character, line 7142 of parse.y (in a recent SVN snapshot) tests whether the next character ISSPACE before returning tREGEXP_BEG ("regexp begin" token). But even if it weren't intentional, the fact that Ruby has an explicit parser, generated by bison, forces the grammar to at least be unambiguous. So there may be no specification, and certain behavior may be an accident of history, but (for a given implementation, at least) there is exactly one way that a particular file will be parsed, and therefore things like syntax highlighting are possible. Contrast this to Perl's behavior, which is clearly intentional but (because "parsing" is entangled with interpretation) has no unambiguous result in the general case. That's my issue with your idea that "that weirdness is all that’s standing between the Ruby syntax and the same takedown." It's not just today's specific implementation that gives Ruby (and most other languages) an unambiguous grammar; it's the tools and fundamental methods that *force* it to have an unambiguous grammar. (I exaggerate somewhat since e.g. Bison and Flex make it easy to mix code into your lexer+parser and maintain potentially complex state at compile time. But when used normally, tools like Bison or ANTLR do a good job of generating deterministic parsers, and forcing designers to fix ambiguous grammars.)
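The token-level decision can be observed directly. This is a rough sketch, assuming CRuby's bundled Ripper library as a window onto the lexer (an assumption, since the thread only refers to parse.y); `token_events` is a hypothetical helper name.
```ruby
require 'ripper'

# Return the lexer event names for a line of source, ignoring whitespace tokens.
def token_events(source)
  Ripper.lex(source)
        .map { |_position, event, _text, _state| event }
        .reject { |event| event == :on_sp }
end

space_after    = token_events("whatever / 25 ; # /")
no_space_after = token_events("whatever /25 ; # /")

puts "whatever / 25 ; # /  -> #{space_after.inspect}"
puts "whatever /25 ; # /   -> #{no_space_after.inspect}"
```
With the space, the slash comes back as an ordinary operator token followed by a comment; without it, the slash opens a regexp whose body swallows the "#".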
August 18, 2009, Robert Fischer wrote: If you add a space before the 25, though, it's all hosed up (as my code demonstrates). So the rule is that you can pass Regexp literals as long as there's not a space after the initial /. So you can pass SOME Regexp literals, but not others. Still definitely wonky. Which gets to my point: I can envision a Ruby implementation which would parse:
```
whatever / 25 ; # /
```
as being unary instead of nullary. After all, as you note, this is unary:
```
whatever /25 ; # /
```
That's some significant whitespace indeed! There's no clear or obvious reason why that should go one way or another: the only reason one way is "right" is because the current CRuby implementation happens to go that way. I do see how perl's in a much worse theoretical state, because how to parse the statement is explicitly dependent upon runtime semantics of the program. That's something which at least CRuby's avoided, somewhat surprisingly, because perl's behavior is actually pretty practically useful (if theoretically annoying): it's doing the right thing based on the context, after all. But, in any case, Ruby's still got some serious wonkiness around this example.
August 18, 2009, Matt Brubeck wrote: That's right. What I was trying to explain with my example (ISSPACE on parse.y:7142) is that in this lexer context, regexp literals are explicitly forbidden from starting with whitespace characters; this is a disambiguation between regexp-delimiting '/' and operator '/'. Not a very intuitive one in my opinion, but you find a lot of ad-hoc disambiguations like these in languages whose grammars weren't carefully thought out. If you read through PEP discussions on python-dev, you'll find that syntax proposals are often rejected because they would force such things into the Python parser. "That’s something which at least CRuby’s avoided, somewhat surprisingly..." I still find it weird that you're surprised. Take your Groovy comment from the original post ("Groovy avoids it purely accidentally..."). To me this is clearly not accidental or surprising. Groovy has an ANTLR-generated parser. Ruby has a Bison-generated parser. If you did want Perl-style ambiguity in Groovy or Ruby (or almost any other modern language) it would be quite difficult to implement, because the algorithms used to generate their parsers allow only unambiguous grammars. The fact that parsing is usually a separate step before execution means that normal parsers can't depend on runtime information.
August 19, 2009, Robert Fischer wrote: Regexp literals can start with space: line 25 in my example above (a = / 25 ; # /) works just fine. When I was saying Groovy avoids it accidentally, I was referring to the fact that I doubt Groovy chose // instead of # because # might create an issue with slashy strings.
August 19, 2009, Matt Brubeck wrote: "Regexp literals can start with space" Hence my qualifier "in this lexer context." The specific context is "(IS_ARG() && space_seen)", apparently meaning the lexer is expecting zero or more method arguments, and a space character was already consumed.
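The rule as described in this exchange can be written down as a toy decision function. This is a hypothetical, heavily simplified model and not a transcription of parse.y: the state names and `classify_slash` are invented for illustration, and other cases such as `/=` are ignored.
```ruby
# Toy model of how a "/" might be classified, based only on the rule
# described in the comments above. All names here are hypothetical.
def classify_slash(state:, space_before: false, next_char: "")
  case state
  when :expecting_value
    # e.g. right after `=`: a value must start here, so "/" opens a regexp.
    :regexp_begin
  when :after_value
    # e.g. right after a number or local variable: "/" can only divide.
    :division
  when :after_method_name
    # Arguments may follow. A regexp is chosen only when a space was seen
    # before the slash and the character after it is not whitespace.
    space_before && next_char !~ /\s/ ? :regexp_begin : :division
  else
    raise ArgumentError, "unknown state: #{state.inspect}"
  end
end

puts classify_slash(state: :after_method_name, space_before: true, next_char: "2") # whatever /25
puts classify_slash(state: :after_method_name, space_before: true, next_char: " ") # whatever / 25
puts classify_slash(state: :expecting_value, space_before: true, next_char: " ")   # a = / 25 ; # /
```
The third call is Robert's counterexample: after `=` the whitespace check never applies, so a regexp literal may start with a space there.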
August 19, 2009, Robert Fischer wrote: Got it. That kind of lexer inconsistency is what I'm considering wonky.

This article was a post on the EnfranchisedMind blog. EnfranchisedMind Blog by Robert Fischer, Brian Hurt, and Other Authors is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
