L(L|[A]R)(k) parser

byuu · Post by **byuu** » Sat Sep 13, 2008 12:58 am

Okay, one of the most important snippets of code I've worked on was based off some sample code from gladius a while back. A recursive descent parser, used to evaluate string expressions. Eg, "2+3*4" = 14.

This is quite important for things like scripting, my cross assembler, etc.

The way it works essentially forces you to parse a string from left to right, and eats up a lot of time in recursions. I don't mind the speed much, and was able to get it quite fast, but there was one issue I couldn't avoid.

Without assignment operators, and with some magic on parenthesis and unary operators, the entire thing was trivial to write as left-associative. Luckily, it matched C's operator associativity in all but one case: ternary expressions. They're right-to-left in C. Now, I could take the cheap way out and claim it's following PHP's style, which is also left-to-right, but I figured it'd be interesting to revisit the idea, and make it more adaptable, so that right-to-left evaluations would be easy to add. Eg in case there's ever a need to add assignment support or somesuch.

To allow interchangeable left/right associativity, and to speed things up a bit, I broke things up into individual functions and used a dual sliding window approach.

The idea is you start with markers on the left and right ends of the string. Left-associative operators move the markers on the left toward the right, and right-associative move the right markers left, until they meet up. Recursion is used to enable operator precedence.

The idea seems to work great, and so far it seems quite fast and clean. My ultimate goal is to write some base templates and allow custom grammars to be generated quickly and easily. Something like boost::spirit for dummies.

But I have one problem I can't solve ... supporting the grouping operator, or parenthesis. I don't suppose anyone here could help me out with that?

It seems easy enough. You'd think I could just add parenthesis below at the highest precedence (eg above muldivmod), and make that call the lowest precedence evaluator (eg ternary.) However, I run into the problem that the lower precedence operators hit their own symbols and thus prematurely shrink the left/right markers.

Eg for example, (1+2)*3. You go to ternary, there is none so go to addition, and it sees +, so it splits the expression to "(1" + "2)*3", which then goes finally to parenthesis, where it is only given "(1".

The only solution I can think of is to embed parenthesis scanning at each and every eval_n() call. And that just seems really messy. Surely there has to be a way to hide that overhead.

How it worked with the RDP was that we could simply detect the first byte as an opening parenthesis, and evaluate that to a single value. With everything packed into one function and only scanning in one direction, that worked fine. But I can't do that the same way with this new left+right marking scanner.

Even if I were to resolve parenthesis first, I wouldn't be able to do anything with the values. For instance, "(1+2)*3" -> I see (1+2) and resolve it. Great, now I have 3 and "*3". But the multiplication function can only evaluate "3*3", not 3 and "*3".

And even if I could, it seems like it'd cause problems with associativity. Eg if I resolve parenthesis, but I need right associativity, I'd have to evaluate the parenthesis in the other direction. Seems like a major hassle.

So, one, can anyone help with this? And two, I suck with the terminology. Is there a name for the type of parser I've created so far? I don't think it strictly qualifies as recursive descent anymore, even though it is still recursive. I also don't see what kind of grammar it cannot parse that a LL/LR/LALR parser cannot. So it doesn't seem to be purely RDP.

Many thanks in advance. Here's the code I have so far:

Code: Select all

//no associativity
int eval_integer(const char *sl, const char *sr) {
  int result = 0;
  for(; sl < sr; sl++) {
    if(*sl < '0' || *sl > '9') printf("bad match %c\n", *sl);
    result *= 10;
    result += *sl - '0';
  }
  return result;
}

//left-associative
int eval_muldivmod(const char *sl, const char *sr) {
  for(int n = 1; sr - n > sl; n++) {
    if(*(sr - n) == '*' || *(sr - n) == '/' || *(sr - n) == '%') {
      int left  = eval_muldivmod(sl, sr - n);
      int right = eval_integer(sr - n + 1, sr);
      return
        *(sr - n) == '*' ? left * right :
        *(sr - n) == '/' ? left / right :
        *(sr - n) == '%' ? left % right : 0;
    }
  }

  return eval_integer(sl, sr);
}

//left-associative
int eval_addsub(const char *sl, const char *sr) {
  for(int n = 1; sr - n > sl; n++) {
    if(*(sr - n) == '+' || *(sr - n) == '-') {
      int left  = eval_addsub(sl, sr - n);
      int right = eval_muldivmod(sr - n + 1, sr);
      return
        *(sr - n) == '+' ? left + right :
        *(sr - n) == '-' ? left - right : 0;
    }
  }

  return eval_muldivmod(sl, sr);
}

//right-associative
int eval_ternary(const char *sl, const char *sr) {
  for(int n = 0; sl + n < sr; n++) {
    if(*(sl + n) == '?') {
      int condition = eval_addsub(sl, sl + n);

      for(int o = n + 1; sl + o < sr; o++) {
        if(*(sl + o) == ':') {
          int left  = eval_addsub(sl + n + 1, sl + o);
          int right = eval_ternary(sl + o + 1, sr);
          return condition ? left : right;
        }
      }

      throw;
    }
  }

  return eval_addsub(sl, sr);
}

int eval(const char *sl) {
  const char *sr = sl + strlen(sl);
  return eval_ternary(sl, sr);
}

int main() {
  int n = eval("1?2:3?4:5");
  fprintf(stdout, "%d\n", n);
  return 0;
}

Associativity example ...

Left-associative:

Code: Select all

2+3-1+2 -> (((2+3)-1)+2) -> 6

Right-associative:

Code: Select all

1?2:3?4:5 -> 1?2:(3?4:5) -> 2

Jipcy · Post by **Jipcy** » Sat Sep 13, 2008 1:30 pm

In the left-associative example, wouldnt the answer be 2? I thought you would evaluate all the additions before the subtraction.

Verdauga Greeneyes · Post by **Verdauga Greeneyes** » Sat Sep 13, 2008 2:02 pm

As far as I know addition and subtraction have the same precedence and so they're just read from left to right. I recall that there's been some discussion about that, but graphical calculators all do it that way, so it's become the norm. (I dunno what the C++ standard has to say about it though)

franpa · Post by **franpa** » Sat Sep 13, 2008 2:03 pm

based on BODMAS, addition is done first out of addition/subtraction.

ZH/Franky · Post by **ZH/Franky** » Sat Sep 13, 2008 3:04 pm

franpa wrote:based on BODMAS, addition is done first out of addition/subtraction.

BIDMAS! Fucking hell!

FitzRoy · Post by **FitzRoy** » Sat Sep 13, 2008 3:25 pm

http://www.mathsisfun.com/operation-order-bodmas.html

You're all wrong. Div/mul before add/sub. But there is no precedence within div/mul or add/sub, you do them left to right.

In fact, whoever came up with BODMAS is an idiot. The whole concept of using an acronym to remember order of precedence is confounded by the fact that D has no precedence over M, and A has no precedence over S.

franpa · Post by **franpa** » Sat Sep 13, 2008 5:18 pm

FitzRoy wrote:You're all wrong. Div/mul before add/sub.

I ignored that intentionally since we were not talking about multiplication or division.

funkyass · Post by **funkyass** » Sat Sep 13, 2008 5:21 pm

No, there is nothing wrong with the acronym.

byuu · Post by **byuu** » Sat Sep 13, 2008 7:16 pm

Perhaps I could do better with a generic tokenizer, and without the recursion on left-associative types (but with it on right-associative), eg for the sake of keeping a string pointer in only one direction. Not sure if that would work or not ...

Code: Select all

3-1+2

right-to-left with recursion:
<addsub> + 2
  <addsub> - 1
    return 3
  return 3 - 1
return 2 + 2

left-to-right with recursion:
3 - <addsub>
  1 + <addsub>
    return 2
  return 1 + 2
return 3 - 3

left-to-right sans recursion:
3 - 1 -> 2
2 + 2 -> 4

Fes · Post by **Fes** » Sat Sep 13, 2008 10:21 pm

I wasn't able to help much with the v-sync thing, but this is an area I actually have experience with, so let's see if I can't help.

Just to get a clear picture, is this something you want to get done as quickly as possible, or more of a learning excersize? If the former, YACC/bison/ANTLR/GPB/etc would steamroll this whole thing. I'll assume the latter for sake of discussion though :)

First off, I strongly suggest your tokenize the string regardless of approach. This will become especially important as you add identifiers and types and functions and such.

Next, the problem I see with parens is that, by directly scanning the string for operators, it's effectively assuming that the top-level expression is an AddSubExpression (discounting ternary for the moment), when that may not actually be the case. To use the (1+2)*3 example again, the top level is actually a MulDivExpression. This is where the difficulty with RDP style parsers comes in. The technique requires you to be able to correctly identify which production to use from the current non-terminal. If the guess is wrong, it has to back out and try again.

I think using a tokenized approach will help a lot here. The first token picked up is '(' which tells you immediately that whatever is inside the parens is deeper in the syntax tree and not relevant to the current production. As you point out though, scanning from right to left makes this more complicated. For simplicity I think you should always scan from left to right, looking for the LAST mul/div, rather than starting from the right and looking for the first. Doing it this way allows you to have only a single procedure for paren-grouping, as well as using a stack to speed things up a bit.

For example, lets say you have (1+2)*(3+4)*5. While scanning, we find the first * token at position 5, note it in a stack, and continue, finding the next one at position 11. Noting that, and continuing, we find no more, and can unwrap the stack starting with the last * and recursively invoking the previous ones. This will save you from having to scan the string again each time in recursive descent of left-to-right ops at each level. For right-to-left operators, no stack is necessary and the first occurance would kick off a recursion. Doing this will result in the input for an evaluator being in one of two forms:

<Expression>
"(" <Expression> ")"

If it's the second form, the parens can be disregarded and the expression within is evaluated. If its the first form, any parens within are sub-expressions and should be saved for recursions. The rub is figuring out which form it is efficiently, which would require another scan of the stream to distinguish between (1+2) and (1+2)*(3+4).

Another concern is that it seems like this approach has the potential to break down as the grammar becomes increasingly complex. Additionally, This approach has a higher order of complexity because the token stream is being scanned repreatedly at each level, rather than a single time in traditional LL(k) class parser. It looks O(n^2) to me at first glance, although more detailed analysis of the average case would be necessary to know for sure what it's practical efficiency is. If it flies for your grammar though, I think it would be competitive with LR(k) class parsers.

To answer your last question, I think this would be an LL(inf) parser, as it descends from the root nonterminal, and uses unlimited token lookahead.

byuu · Post by **byuu** » Sat Sep 13, 2008 11:41 pm

Thanks for the help. Glad someone here was familiar with this.

Yeah, I want to learn how to do this, so no to YACC.

This will save you from having to scan the string again each time in recursive descent of left-to-right ops at each level.

I was going to say, a last marker scan would add a lot more overhead, but yeah; this would help with that.

The issue is that I'd need to keep my own variable stack, which kind of sucks. I like using recursion for the stack.

Another concern is that it seems like this approach has the potential to break down as the grammar becomes increasingly complex.

Yeah, but it's at least a very minor step up from the old RDP I was using, in terms of power.

Additionally, This approach has a higher order of complexity because the token stream is being scanned repreatedly at each level, rather than a single time in traditional LL(k) class parser. It looks O(n^2) to me at first glance, although more detailed analysis of the average case would be necessary to know for sure what it's practical efficiency is.

That it does. But it's more O(n^2) as in quick sort (without merge sort for small chunks). Eg only in the worst possible case where your list is sorted will you get O(n^2). In most real world cases, this parser shouldn't be too bad.

I compared it to the old single-function depth RDP. It's already ~10% slower evaluating "(1*2)+3?4*(5+6):(7*8)+9" a million times, and I only parse four of thirteen precedences so far. Not looking good for speed.

To answer your last question, I think this would be an LL(inf) parser, as it descends from the root nonterminal, and uses unlimited token lookahead.

I'm still having trouble with understanding the differences. I understand the top-down vs bottom-up part, but I don't see how even an LL(inf) parser can get stuck in an infinite loop ... at least not with the grammar I am parsing. Which is probably because it's LL compatible to begin with. Meh.

---

Anyway, doing what I really didn't want to do, I was able to get it working.

Code: Select all

int eval_parenthesis(const char *sl, const char *sr) {
  return eval_ternary(sl + 1, sr - 1);
}

int eval_integer(const char *sl, const char *sr) {
  if(*sl == '(' && *(sr - 1) == ')') return eval_parenthesis(sl, sr);

  int result = 0;
  for(; sl < sr; sl++) {
    if(*sl < '0' || *sl > '9') throw;
    result *= 10;
    result += *sl - '0';
  }
  return result;
}

int eval_muldivmod(const char *sl, const char *sr) {
  for(int depth = 0, n = 1; sr - n > sl; n++) {
    if(*(sr - n) == ')') depth++;
    if(*(sr - n) == '(') depth--;
    if(depth) continue;

    if(*(sr - n) == '*' || *(sr - n) == '/' || *(sr - n) == '%') {
      int left  = eval_muldivmod(sl, sr - n);
      int right = eval_integer(sr - n + 1, sr);
      return *(sr - n) == '*' ? left * right :
             *(sr - n) == '/' ? left / right :
             *(sr - n) == '%' ? left % right : 0;
    }
  }

  return eval_integer(sl, sr);
}

int eval_addsub(const char *sl, const char *sr) {
  for(int depth = 0, n = 1; sr - n > sl; n++) {
    if(*(sr - n) == ')') depth++;
    if(*(sr - n) == '(') depth--;
    if(depth) continue;

    if(*(sr - n) == '+' || *(sr - n) == '-') {
      int left  = eval_addsub(sl, sr - n);
      int right = eval_muldivmod(sr - n + 1, sr);
      return *(sr - n) == '+' ? left + right :
             *(sr - n) == '-' ? left - right : 0;
    }
  }

  return eval_muldivmod(sl, sr);
}

int eval_ternary(const char *sl, const char *sr) {
  for(int depth = 0, n = 0; sl + n < sr; n++) {
    if(*(sl + n) == '(') depth++;
    if(*(sl + n) == ')') depth--;
    if(depth) continue;

    if(*(sl + n) == '?') {
      int condition = eval_addsub(sl, sl + n);

      for(int depth = 0, o = n + 1; sl + o < sr; o++) {
        if(*(sl + o) == '(') depth++;
        if(*(sl + o) == ')') depth--;
        if(depth) continue;

        if(*(sl + o) == ':') {
          int left  = eval_addsub(sl + n + 1, sl + o);
          int right = eval_ternary(sl + o + 1, sr);
          return condition ? left : right;
        }
      }

      throw;
    }
  }

  return eval_addsub(sl, sr);
}

int eval(const char *sl) {
  const char *sr = sl + strlen(sl);
  return eval_ternary(sl, sr);
}

Eg, adding depth checking to each and every level. It's still going to have problems with unary expressions that share non-terminals, though.

Ultimately, you're right. I need to reduce the complexity to O(n) somehow. Merge all token scanning into a single function. But doing that without a table (stack), I'm not sure how it's possible. The RDP from before only worked because everything was already left-associative. Switching it up based on depth level would make the function far too large and unwieldy. It really needs to be broken up, as I'm trying with the above code ...

Guess I'll keep working at it ...

FitzRoy · Post by **FitzRoy** » Sat Sep 13, 2008 11:52 pm

franpa wrote:
FitzRoy wrote:You're all wrong. Div/mul before add/sub.
I ignored that intentionally since we were not talking about multiplication or division.

Right, you were talking about addition and subtraction, and neither has precedence over the other whereas you said addition had precedence over subtraction. That is only the case between div/mul and add/sub, not between div and mul, not between add and sub. That's why BODMAS confused you, because according to the acronym itself, addition comes before subtraction. It doesn't. It has equal precedence, thus it is entirely possible for subtraction to be done before addition if that's what comes first.

Fes · Post by **Fes** » Sun Sep 14, 2008 12:09 am

The only way you're going to get O(n) complexity is with an LL(k) grammar, which infix arithmetic is not (at least not with implicit order of operations). So one way or another you'll be dealing with longer worst case running time. On the plus side, this means that you can relax a little bit about the efficiency, since it's not possible to do better anyway.

If you really want to dig deep on this, the wikipedia pages have good definitions of LL vs. LR parsers, BNF grammars, implementation details, etc. There is also no shortage of books on the subject, as this is a rather fundamental problem in CS, it's also one that is for all practical intents and purposes, "solved." For this reason, as you add functionality to this technique, I think you're gonig to be railroaded into a more robust solution anyway, so it might be prudent to cut to the chase and examine LR parser theory and existing implementations. It really is a fascinating topic.

I myself have not actually written an LR parser, only LL ones, but I"d be happy to bounce ideas around.