Show HN: Using Parsing Expression Grammars to rewrite source code (github.com/sebcat)
63 points by sebcat on Sept 14, 2016 | 26 comments


Parsing expression grammars are my favorite way to parse text.

I also use Lua and LPeg. Here's a tutorial I wrote on them: http://leafo.net/guides/parsing-expression-grammars.html

Additionally, I've written an entire programming language with them: http://moonscript.org/

Here's the grammar:

https://github.com/leafo/moonscript/blob/master/moonscript/p...

My favorite part is that you express the grammar in code, not in some simplified language designed just for writing parsers. So you get all the nice things a programming language normally gives you (e.g. you can use functions instead of needing macro support).
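
For instance, since patterns are plain Lua values, an ordinary function can serve as a reusable rule. A toy sketch of mine, not the MoonScript grammar (`sep_by` is a made-up helper):

    lpeg = require "lpeg"
    P, R, C = lpeg.P, lpeg.R, lpeg.C

    -- an ordinary Lua function that builds a pattern: a separated list
    -- of `item`. A classic parser generator would need a macro or a
    -- template rule for this kind of reuse.
    function sep_by(item, sep)
      local ws = P" "^0
      return item * (ws * sep * ws * item)^0
    end

    name = C(R"az"^1)
    list = sep_by(name, P",")
    print(list:match("foo, bar, baz"))  --> foo  bar  baz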

I've also started experimenting with compiling LPeg style grammars to C, for even more speed. This uses the peg library: https://github.com/leafo/moonparse

You can see how the grammar looks here: https://github.com/leafo/moonparse/blob/master/parse.peg.moo...


Nice work!

I agree with you about LPeg values being first-class objects in Lua. It's a really nice fit, and a much nicer way to solve parsing problems than using flex/bison IMO. Having it in Lua allows for pretty fast write -> test cycles too, if you're cooking up a one-off grammar on the fly for something.

I had a university course where we wrote recursive descent parsers in Java, using the standard library. One of the grammars was the classic calculator grammar. We're talking an abstract class for parse tree nodes, one class for every node type (Node, NumberNode, AddNode, DivNode) &c. If someone back then had walked in with the example in your tutorial and told me that you can write a parser for what we were doing in less than 20 lines which, once you understand the core concepts of PEG, is far more legible and comprehensible than all that Java code... well, I wish I knew then what I know now :)
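
For reference, something in the spirit of that calculator example, as I'd write it with LPeg today (my own rough sketch, not the tutorial's code):

    lpeg = require "lpeg"
    P, R, S, C, V = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.V

    ws     = P" "^0
    number = C(R"09"^1) / tonumber * ws

    -- left-fold a captured chain like 1 "+" 2 "-" 3 into a value
    function eval(a, op, b, ...)
      if op == nil then return a end
      if op == "+" then a = a + b elseif op == "-" then a = a - b
      elseif op == "*" then a = a * b else a = a / b end
      return eval(a, ...)
    end

    calc = P{
      "expr",
      expr   = (V"term" * (C(S"+-") * ws * V"term")^0) / eval,
      term   = (V"factor" * (C(S"*/") * ws * V"factor")^0) / eval,
      factor = number + P"(" * ws * V"expr" * P")" * ws,
    }
    print(calc:match("1 + 2 * (3 - 1)"))  --> 5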


Last time I used LPeg for a big project I ran into difficulties with error detection and recovery --- writing the parser for correct input was beautifully easy, but writing a parser that would gracefully handle incorrect input was very hard. (I'm reminded of the apocryphal Prolog compiler which, if you gave it a program with a syntax error, would just reply 'No.'.)

When I asked about the best way to handle errors it was suggested to me that I should add alternatives to my rules that would catch errors and return an error token from there; but of course, that short circuits any alternatives further up the grammar tree, so it wasn't very satisfactory. A quick look at your grammar doesn't show anything like this --- how are you dealing with errors?


> how are you dealing with errors?

The example looks for string literals outside of what it considers comments and preprocessor directives. If you feed it random data it will just end up in various states in the graph. It matches everything and uses captures for substitution when it's in the right state.
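
For instance, the substitution-capture idea in miniature (a toy sketch, not the actual grammar from the repo):

    lpeg = require "lpeg"
    P, Cs = lpeg.P, lpeg.Cs

    -- Cs reproduces the whole match, with the replacement spliced in
    -- wherever the inner pattern matched
    rewrite = Cs(((P"cat" / "dog") + 1)^0)
    print(rewrite:match("cat sat on a catalog"))  --> dog sat on a dogalog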

When I need to break on invalid grammar, I use match-time captures (Cmt). I think if you have problems with short-circuiting you might have the match-time capture in the wrong place, but I might be wrong. Do you have a short example?

Trivial example (maybe too trivial?) of error handling:

  lpeg = require "lpeg"
  P, V, S = lpeg.P, lpeg.V, lpeg.S
  function errfunc(match)
    error("invalid token: "..tostring(match))
  end
  P{
    "tokens";
    space   = S" \t",
    -- anything that isn't whitespace; the function capture raises
    invalid = (1-V"space")^1/errfunc,
    token   = P"foo" + P"bar" + P"baz" + V"invalid",
    tokens  = (V"token" * V"space"^0)^0 * -1
  }:match("foo bar baz woops foo")

EDIT: or, to get the position:

    lpeg = require "lpeg"
    P, Cmt, V, S = lpeg.P, lpeg.Cmt, lpeg.V, lpeg.S
    -- Cmt calls errfunc during the match with the entire subject, the
    -- position *after* the matched text, and the matched text itself
    function errfunc(subject, pos, cap)
      error(string.format("invalid token at position %d: %s",
        pos-#cap, cap))
    end
    P{
      "tokens";
      space   = S" \t",
      invalid = Cmt((1-V"space")^1, errfunc),
      token   = P"foo" + P"bar" + P"baz" + V"invalid",
      tokens  = (V"token" * V"space"^0)^0 * -1
    }:match("foo bar baz woops foo")


Hi, you might want to check out LPegLabel[1]. It's an extension of LPeg that allows error detection and recovery by throwing labels (done manually, but it can be semi-automated). To put it simply, throwing a label is like throwing an exception (so it bubbles up until it is caught). For error recovery, you can catch those labels, record the error somewhere, and then continue parsing, possibly skipping ahead to a certain token like a semicolon first.
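
A minimal sketch of throwing a label; note this is only an approximation, since the exact API (string vs. integer labels, what a failed match returns) has varied between versions, so check the README:

    m = require "lpeglabel"
    -- like the "foo/bar" toy grammar upthread, but instead of failing
    -- silently we throw a label; plain ordered choice does NOT catch
    -- it, so it bubbles up out of the grammar
    g = m.P{
      "tokens";
      space  = m.S" \t",
      token  = m.P"foo" + m.P"bar" + #m.P(1) * m.T"ErrToken",
      tokens = m.V"token" * (m.V"space"^1 * m.V"token")^0 * -1,
    }
    print(g:match("foo bar woops"))  -- fails with label ErrToken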

I worked on modifying LPeg grammars to add error detection and recovery via LPegLabel for my Google Summer of Code, so please ask me questions about it if you have any. Right now, LPegLabel is still quite young, but it is fully usable, and since it is a project under LabLua (the same research lab that is in charge of Lua and LPeg), I'm hoping it will be integrated into LPeg one day.

[1] https://github.com/sqmedeiros/lpeglabel


>> (I'm reminded of the apocryphal Prolog compiler which, if you gave it a program with a syntax error, would just reply 'No.'.)

I don't believe you remember this correctly.

The Prolog interpreter will raise an error for syntax errors. If you misspell something but don't cause a syntax error then you may get an unexpected "no" (or "false"), but a syntax error causes compilation to fail, in Prolog as in any language.

In any case, a "no" ("false") is the proof procedure failing to prove your query true (or, more accurately, finding a way to prove it false). It's not an error and it's not a failure of the interpreter.


The expanded version of the saying, which I think I got from my Prolog lecturer at university, is:

Writing an optimising compiler in Prolog (note, in Prolog, not for Prolog) which will accept valid programs is trivially easy. Writing an optimising compiler in Prolog which will say anything other than 'No.' for invalid programs is intractably hard.


Ah, I see- you meant a compiler written in Prolog, not a Prolog compiler.

Who was your lecturer? I haven't heard of that problem with optimising compilers in Prolog. I don't know why it should be harder to do in Prolog than in any other language.


Hey, while you are trying out parser combinators, what do you think of this one? http://github.com/meric/leftry It's very new and I would like some comments on how to make it easier to use. You can also check out http://github.com/meric/l2l, a hybrid Lisp and Lua programming language that relies on the parser combinator library. (See lua.lua in that project, where the Lua grammar is implemented using Leftry.)


Neat.

I did find the readme confusing; I had to jump up and down the page to make sense of it. 'any' is mentioned but not documented, and the first few examples made absolutely zero sense until I'd read quite a bit further, despite my being familiar with parsers and grammars -- it's not a very gentle introduction.

OK, but your actual question was about usability. I don't understand why 'factor' takes a generation function as an argument instead of something much less verbose, like an array/table; that would be my biggest complaint. Is it an implementation detail, or to allow self-reference, or to allow arbitrary code inside one? If the latter, can providing a function instead of simpler syntax be made optional? There ought to be alternatives for allowing self-reference too.

It's cute that a parser can be used as an iterator. How useful is it in practice to match concatenated valid sentences like that? I never saw a use for it in my own tasks.


Thanks for taking a look at it; I could really use your feedback on the documentation - definitely an area I need to improve. The anonymous function is there so that it waits until all the non-terminals are defined.
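
To illustrate the problem with toy combinators (hypothetical code of mine, not Leftry's actual API):

    -- toy combinators, just to show the indirection
    local function lit(c)
      return function(s, i) return s:sub(i, i) == c and i + 1 or nil end
    end
    local function alt(a, b)
      return function(s, i) return a(s, i) or b(s, i) end
    end
    local function seq(a, b, c)
      return function(s, i)
        i = a(s, i); i = i and b(s, i); return i and c(s, i)
      end
    end

    local expr  -- forward declaration; still nil at this point
    -- wrapping the reference in a function defers the lookup of `expr`
    -- until the parser actually runs, when it has been assigned:
    local paren = function(s, i) return seq(lit"(", expr, lit")")(s, i) end
    expr = alt(lit"x", paren)
    print(expr("((x))", 1))  --> 6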

The iterator feature is cute, which was reason enough for me to do it. :)

Much appreciated.


I've used PEG parsers for two projects, one written with peg/leg (which leafo mentioned), and one with a significantly modified fork of pyPEG [2]. My method for handling incorrect input is equivalent to what you described: I extended the pyPEG grammar to add 'checkpoints'. (Earlier, when using peg/leg, I used alternative code blocks that threw errors.) Once a checkpoint is passed, the parser is not allowed to backtrack past it again. This avoids the problem of the parser backtracking all the way to the very beginning of the input and producing an error equivalent to 'No'.

As a bonus, adding checkpoints speeds up parsing and massively reduces memory usage, because as soon as you hit one you can delete the state needed for backtracking. Yes, it does mean that you have to think about whether adding a checkpoint somewhere cuts off valid parses; the checkpoints really ought to be added automatically.

Ideally you would organise your rules in such a way that all the alternative things that could appear in some context are in a rule called Foo, and put a checkpoint at the beginning, so if none of them match you get an "expected Foo" error. And if you see an opening quote or bracket or anything like that, immediately add a checkpoint, because it has to be closed. This worked very well for me, but I imagine it would not scale to something as nightmarish as Perl.
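
In LPeg terms (to keep with the examples elsewhere in the thread), a checkpoint can be approximated by raising an error once the committed prefix has matched; a rough sketch of mine, not pyPEG's actual mechanism:

    lpeg = require "lpeg"
    P, R, S, V, Cmt = lpeg.P, lpeg.R, lpeg.S, lpeg.V, lpeg.Cmt

    -- once the comma has matched we are past the "checkpoint": either
    -- an item follows or we raise, instead of backtracking out
    function expected(what)
      return Cmt(P(true), function(s, pos)
        error(string.format("expected %s at position %d", what, pos))
      end)
    end

    list = P{
      "list",
      ws   = S" "^0,
      item = R"az"^1 * V"ws",
      list = V"item" * (P"," * V"ws" * (V"item" + expected("item")))^0 * -1,
    }
    list:match("foo, bar,")  -- raises: expected item at position 10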

Real example:

   def expressionList():       return QUES, (expression, STAR, (PLUS, ",", CHECKPNT, expression))
(Note: pyPEG is funky; the ?, * and + operators (QUES, STAR, PLUS) precede the thing they act on. I really should change that.) Basically, in this language, if you see a comma in an expressionList, something must follow it. The error produced by this rule is:

   On line 79 in rbtest.rbas:
   Syntax error: while parsing expressionList, expected expression
        dim x as integer = y(a, b,)
                                  ^
This is a parser for a DSL extension of FreeBASIC (which I called ReloadBasic). In addition to the DSL stuff it parses just enough of FB to determine the types of all the variables and arguments and the function/sub beginnings and ends. For the small fragment it handles, it often produces better errors than the handwritten recursive-descent FB compiler. (It's also about 1% the speed...) Source is here: [1]

[1] https://bitbucket.org/rbv/ohrrpgce/src/6ba83910c564aa474220e...

[2] https://bitbucket.org/rbv/ohrrpgce/src/6ba83910c564aa474220e...


Minor technical nitpick for the README: you're not moving up the Chomsky hierarchy but rather moving sideways ;). PEGs are a non-Chomskyan parsing formalism, in the sense that the family of languages describable by PEGs is not to be found in the Chomsky hierarchy.

For example, they are not regular, as they can match patterns such as a^n b^n. On the other hand, they are also not context-free, as they can also match non-context-free patterns such as a^n b^n c^n. Finally, since PEG can be parsed in linear time, there must also be some CFG which cannot be recognized by any PEG; this follows from a result by Lee [1].
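
For the curious, here's a sketch of Ford's classic grammar for a^n b^n c^n, transcribed to LPeg:

    lpeg = require "lpeg"
    P, V = lpeg.P, lpeg.V

    -- the and-predicate # checks that the a's and b's balance without
    -- consuming input; B then checks the b's against the c's.
    -- No CFG can express this language.
    G = P{
      "S",
      S = #(V"A" * P"c") * P"a"^1 * V"B" * -1,
      A = P"a" * V"A"^-1 * P"b",
      B = P"b" * V"B"^-1 * P"c",
    }
    print(G:match("aabbcc"))   --> 7 (matched)
    print(G:match("aabbbcc"))  --> nil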

[1] L. Lee. Fast Context Free Grammar Parsing Requires Fast Boolean Matrix Multiplication. http://arxiv.org/pdf/cs/0112018.pdf


I did not know that, thanks. Replaced it with a mention of regular and context-free grammars instead.


Interesting. I see that the Wikipedia PEG article still states that the existence of CFG- but not PEG-parsable languages is an open problem.


It is! It is surprisingly difficult to come up with a grammar and a proof that it cannot be recognized by a PEG. Part of the reason is probably that PEGs and CFGs are quite different: A CFG is a set of generative rules which specify a set of strings, whereas a PEG is actually more like a recursive program.

There's an open question on Stack Exchange for anyone who feels like taking a stab at the problem: http://cstheory.stackexchange.com/questions/34792/does-peg-c...


OK. I thought you meant that a non-constructive proof existed that there is such a CFG. That's how I interpreted "Finally, since PEG can be parsed in linear time, there must also be some CFG which cannot be recognized by any PEG". But I see that on cstheory you wrote instead (being more precise?) "most probably ... PEG does not contain CFG".


Ah, yes, that's because the non-linear complexity of Boolean matrix multiplication (which CFG parsing can be reduced to) is strongly suspected to be a lower bound, but I don't think there is a formal proof of it (I am not an expert in complexity theory, so I am just basing this on what some authors seem to suggest). So the non-constructive argument is a "proof" in the same sense that some results can be proved by first assuming that P /= NP.


I really wish the expressions were better documented/explained in the source. They're dense, similar to regexps, and regexps also benefit immensely from explanation when used as a demo/teaching tool.


Good point. I added a new branch where I broke the "literal" rule into multiple rules and added some comments where I try to explain what's going on in English. I also added an anchor to it in the master README.md.

I find it a bit hard to accurately describe the operations in words though; if there are any ambiguities in the wording, let me know!

https://github.com/sebcat/move-literals/blob/commented_expre...


Awesome, thanks! It's the first time I've seen example PEGs explained so nicely. I've taken the liberty of adding a few small comments with some additional suggestions.

Some additional comments I didn't know how to add on github:

- in load_data(), you should most probably use `local` on l.99 and 103;

- in load_data() l.99, you could consider using the idiom:

    local f = assert(io.open(file, "rb"))
though arguably it's clearer as it is now.

Other than that, it would be even cooler if you could also add some examples in the repo of how to deal with parsing errors, as you showed in another comment!


Thank you for the feedback. The commented_expressions branch looks a lot better than master now, I'll probably merge it. I added the code from the comment to README.md.


I viewed this on my phone and it was annoying and difficult to read due to the typing effect in the header which caused the content below to move up and down.


It is a readme on GitHub. What typing effect are you talking about?


Wrong tab. Comment was meant for https://hackertimes.com/item?id=12498976


A while back I made a PEG parser in Python in ~400 LOC that can parse itself. It's actually quite straightforward: https://gist.github.com/orlp/e880287287985ccd9288def8f6741b4....

I've also added a 'skipper' feature that makes handling whitespace a breeze.
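
The idea, sketched in LPeg terms rather than my actual Python implementation: wrap every lexical element so it consumes trailing whitespace.

    lpeg = require "lpeg"
    P, R, S = lpeg.P, lpeg.R, lpeg.S

    skip = S" \t\n"^0

    -- wrap every lexical element once; the grammar rules themselves
    -- never have to mention whitespace again
    function token(patt)
      return P(patt) * skip
    end

    ident = token(R"az"^1)
    comma = token(",")
    list  = skip * ident * (comma * ident)^0 * -1
    print(list:match("foo,  bar,\n  baz"))  --> 17 (whole string matched)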



