Table of Contents
- 1. Introduction
- 2. Overview
- 3. Concepts of Pratt Parsing
- 4. Implementation of a Pratt Parser
- 4.1 Prerequisites and AST elements
- 4.2 Implementation Part I: Parsing prefix expressions
- 4.3 Implementation Part II: Adding infix expressions (and postfix) parsing
- 4.4 Implementation Part III: Adding operator precedence
- 4.5 Implementation Part IV: Adding associativity
- 4.6 Implementation Part V: Adding parentheses
- 4.7 Implementation completion
- 5. Appendix
1. Introduction
In my personal journey of discovering and dealing with parsing approaches for practical use cases, eventually I came across the most
elegant parsing algorithm up to now: Pratt parsing, named after its inventor Vaughan Pratt who initially described this parsing approach in his paper "Top Down Operator Precedence"
in the year 1973. The paper is available from [1] and [2].
The original paper is a bit cumbersome to read nowadays (while still worth to do so), there are other excellent articles
on the web: The article "Top Down Operator Precedence" by Douglas Crockford [3] is referred to quite often as the modern explanation.
Personally my absolute favorite article is "Pratt Parsers: Expression Parsing Made Easy" by Bob Nystrom [4].
My description and implementation (which you are currently reading right now) is heavily based on his superb tutorial.
2. Overview
The presented pratt parsing algorithm will parse mathematical expressions, quite similar as already done in [5].
Further, the details about the tokenizer are skipped - for details see [6].
It is assumed that there is a tokenizer available with a ProcessToken() method that consumes and return the next token of a token stream.
The supported tokens are:
- Floating-point numbers of type double (NUMBER token))
- Plus operator "+" (PLUS token)
- Minus operator "-" (MINUS token)
- Multiplication operator "*" (MUL token)
- Division operator "/" (DIV token)
- Exponentiation operator "**" (EXP token)
- Increment operator "++" (INCREMENT token)
- Decrement operator "--" (DECREMENT token)
- Inversion operator "~" (INVERT token)
- Left parenthesis operator "(" (CURLY_BRACKET_OPEN token)
- Right parenthesis operator ")" (CURLY_BRACKET_CLOSE token)
The minus and plus operators are supported as prefix operator (as in "-5" or "+3") as well as infix operator (as in "4 + 5").
The increment and decrement operators are supported as postfix operators (as in "4++").
The following table defines the operators with their precedence and associativity which the final parser will support,
from highest precedence to lowest:
| Precedence value | Operator | Operator name | Associativity |
| 6 | () | Parentheses | Left-to-right |
| 5 | ++ -- | Postfix operators | Right-to-left |
| 4 | ** | Exponentiation | Right-to-left |
| 3 | + - ~ | Unary operators | Left-to-right |
| 2 | \ * | Division Multiplication | Left-to-right |
| 1 | + - | Addition Subtraction | Left-to-right |
For sure, the explanation of the Pratt parser will be the heart of this article. After having discussed all the details
and developed the algorithm step-by-step, even two forms of the algorithm will be implemented: One providing an AST (abstract syntax tree)
of the mathematical expression and one providing the reverse polish notation as string of the mathematical expression.
3. Concepts of Pratt Parsing
For me, the main differences of Pratt parser to most other common (top-down) parsers are the following properties:
Semantic Code
Most parser implementation use one parser function to parse one rule of the grammar,
which is often provided in EBNF (Extended Backus-Naur Form).
For example, the JSON parser in [7] has one function to parse an object, another function to parse an array and so on.
In contrast, a pratt parser provides a function for each token type (Pratt calls it "semantic code").
So there is one function to parse a number, one function to parse the add sign '+' and so.
Differentiation of Operator Tokens
Pratt differs between two types of operator tokens:
- Token without preceding expression: In Pratt's terminology, this is called a null denotation (code) = nud.
For us, we call this operator token type just prefix operator.
Prefix operators are easier because the first token denotes the operator, then followed only by another expression (in the simplest case, only a token) follows. - Token with preceding expression: In Pratt's terminology, this is called a left denotation (code) = led.
For us, we call this operator token type just infix operator.
Infix operators need to first parse an expression (the left side) to come to the operator which denotes the type of expression. Any token which is not an infix operator is treated as prefix operator.
Note that for each token, there could be even two parsing functions - one to handle the token as prefix, the other one to handle the token as infix.
This just depends on the token position inside the token stream.
For example, the minus token '-' can be a prefix token, like in "-5" to denote a negative number,
as well as an infix token like in "5 - 3" to denote the subtraction operator. So for the token '-', two different parsing functions would be provided.
Operator precedence and associativity
In top-down parsers the operator precedence is usually somehow indirectly implemented. While the precedence is clearly given by the language grammar, it is hard to figure out
inside the application. Further, associativity is normally handled in a different way than precedence.
A Pratt parser handles precedence (and associativity) uniquely - using so-called binding powers. The binding power is not a global grammar property,
instead the binding power is actually a property of a token. The binding power denotes how strong an operator bind to the expressions around it.
Pratt differs between two types of binding powers:
- Left binding power (LBP): The LBP is a value that specifies how tight an operator belongs to the expression on its left side.
- Right binding power (RBP): The RBP is a value that specifies how tight an operator belongs to the expression on its right side.
The LBP is the precedence of token when parsing infix expressions. It tells us if the parsing of an infix expression should stop or continue.
The RBP value provides the information how many tokens to the right of the current operator are parsed into the current expression.
It will become clearer when we put our hands on operator precedence.
So let's dive into it in a more practical way...
4. Implementation of a Pratt Parser
We have already defined above that we use simple expressions with (mathematic) operations again like in the previous articles as they are ideal for demonstration:
Numbers and mathematical expressions and their precedence rules are well known and it is easy to validate the evaluation results.
4.1 Prerequisites and AST elements
At first, let's define some basic types. Our Pratt Parser will return an abstract syntax tree (AST) of our mathematical expression
and each node will represent a specific part of the overall expression.
Each specific expression will inherit from our base expression which is an abstract class (acting as placeholder) than can print its expression for easy debugging:
{
public abstract string ToDbgString();
}
The simplest expression is a single number like "5", defined as:
{
public double Number { get; set; }
public override string ToDbgString()
{
return $"({Number})";
}
}
Based on this, we can define a class to represent an prefix expression like "- 5" or " - (5+4)" which contains an operator token (in the example the minus token) followed by another expression:
{
public Token OperatorToken { get; set; }
public PpsBaseExpression RightLeaf { get; set; }
public override string ToDbgString()
{
return $"({OperatorToken} {RightLeaf.ToDbgString()})";
}
}
The same way, classes for infix and postfix expressions can be defined.
4.2 Implementation Part I: Parsing prefix expressions
Caution
The implementation of the Pratt parser in this chapter is incomplete: It does only parse our tokens and prefix operators.
It does not handle infix operators, parenthesis, precedence or associativity correctly (yet)!
This will be added in the subsequent chapters.
For each token type, implement a function that parses this token.
Let's start with prefix operators. First we define a function type to parse a prefix expression like "- 5":
As there is a 1:1 mapping between token and parsing function, we can get rid of the often length if-else / switch-case cascades to differentiate between the different token types. Instead we store for each token it's corresponding parsing function inside a map. This makes it far more understandable and structured.
When the parser is instantiated, all prefix operator token are mapped to the prefix parser function. Note that a single number token is also treated as infix operator.
{
{ PpsTokenType.NUMBER, ParseNumber },
{ PpsTokenType.PLUS, ParsePrefixExpression },
{ PpsTokenType.MINUS, ParsePrefixExpression },
{ PpsTokenType.INVERT, ParsePrefixExpression }
};
The first version of our main parsing function which only supports prefix operators is as follows: Assuming the current token is the prefix operator token, it is checked that a parsing function is registered for the current token. If this is the case, the actual parsing function for this token is retrieved from the map and just invoked, finally the parsed expression (tree) is returned:
{
// check existence and get prefix parse function for current token type
if (!prefixParseFuncs.ContainsKey(GetTokenType(currentToken)))
{
throw new Exception($"No prefix parse function registered for token type {currentToken.TokenType}");
}
PrefixParseFunc prefixFunc = prefixParseFuncs[GetTokenType(currentToken)];
// call the registered prefix function to parse the expression
PpsBaseExpression expr;
expr = prefixFunc();
// return the parsed expression
return expr;
}
Moving on to the concrete parsing functions, the function for parsing and constructing a numerical expression is trivial:
{
PpsNumberExpression expr = new PpsNumberExpression();
// we know that the token value here is a double number
expr.Number = (double)currentToken.TokenValue;
AdvanceToken();
return expr;
}
Let's continue straight to parse a real prefix expression.
A new prefix expression is instantiated and the prefix operator token is stored. Then it is advanced to the next token and note that now the expression to the right side
of the prefix operator is parsed by calling again our main parser function ParseExpression()!
{
PpsPrefixExpression expr = new PpsPrefixExpression();
expr.OperatorToken = GetCurrentToken();
AdvanceToken();
expr.RightLeaf = ParseExpression();
return expr ;
}
Note that tokens are consumed / advanced in prefix, infix and postfix parse functions. But never in the overall main ParseExpression() function!
So it's the task of the parsing functions to proceed the consumed tokens and after return of such parsing function, the tokenizer points to next token to be processed.
This is our defined contract on how all the parsing functions process the tokens. And this is important - do never break this rule otherwise the parser will not work correctly!
Side note: Alternatively, we could have also defined to never advance tokens in the parsing functions but only in the overall ParseExpression() function.
This would also work - important is just to not mix both variants up. In this approach we chose the advance the token in the parsing functions.
So at this point we can parse an expression like "-5"! The parser will return an PpsPrefixExpression object with operator token "-" and
right left a PpsNumberExpression object with number value 5. Calling the ToDbgString() on this object will return
"(- (5))".
Great - first part accomplished!
4.3 Implementation Part II: Adding infix expressions (and postfix) parsing
Caution
The implementation of the Pratt parser in this chapter is incomplete: It does only parse our tokens, prefix and infix operators.
It does not handle parenthesis, precedence or associativity correctly (yet)!
This will be added in the subsequent chapters.
Let's follow the same approach as for infix operators. First define a function type to parse a infix expressions like "2 + 3":
Note that the infix parse function expects a parameter - this is the already parsed expression, i.e the left node of the infix expression. In the example "2 + 3", the left parameter would be the (prefix) expression containing the number 2. We will see below how this works in detail.
Again we store for each token it's corresponding parsing function inside a map. Note that a separate mapping for infix expressions is used. This is necessary as the same operator might be used as prefix and infix operator, e.g the token MINUS is added to both mapping tables to handle "-5" (as prefix operator) and "5 - 2" (as infix operators):
infixParseFuncs = new Dictionary<PpsTokenType, InfixParseFunc>
{
{ PpsTokenType.PLUS, ParseInfixExpression },
{ PpsTokenType.MINUS, ParseInfixExpression },
{ PpsTokenType.MUL, ParseInfixExpression },
{ PpsTokenType.DIV, ParseInfixExpression },
{ PpsTokenType.EXP, ParseInfixExpression },
{ PpsTokenType.INCREMENT, ParsePostfixExpression },
{ PpsTokenType.DECREMENT, ParsePostfixExpression }
};
How to integrate the infix expression parsing functions to our parsers? Let's go back to our ParseExpression()
method which currently only handles prefix expressions.
Consider the expression "-5 + 2": The prefix expression "-5" is already parsed correctly. Now the next token to be processed is the "+" infix operator
and we are already at correct location! So instead of just returning the already parsed prefix expression, we check for an infix operator and if so,
call the corresponding infix parsing function from our map. Let's have a look at the changes in dark orange in the next listing:
{
// check existence and get prefix parse function for current token type
if (!prefixParseFuncs.ContainsKey(GetTokenType(currentToken)))
{
throw new Exception($"No prefix parse function registered for token type {currentToken.TokenType}");
}
PrefixParseFunc prefixFunc = prefixParseFuncs[GetTokenType(currentToken)];
// call the registered prefix function to parse the expression
PpsBaseExpression expr;
expr = prefixFunc();
// look at the current token and check if it tells us that this is an infix expression
if (infixParseFuncs.ContainsKey(GetTokenType(currentToken)))
{
// call the registered infix function to parse the expression
InfixParseFunc infixFunc = infixParseFuncs[GetTokenType(currentToken)];
expr = infixFunc(expr);
}
// return the parsed expression
return expr;
}
Note that at first we advance to the next token due to our parsing function contract defined above.
In our example "-5 + 2", the current token is now "+" and we check if an infix parsing function is registered for this token - if so, we invoke the infix parsing function.
Note that our infix parsing function expects as first parameter the already parsed expression, so that's why the variable is expr is passed to it.
This is needed to construct our AST - the already parsed expression is the left node of our infix expression!
Let's move forward to a concrete infix parsing function:
{
PpsInfixExpression expr = new PpsInfixExpression();
expr.LeftLeaf = Left;
expr.OperatorToken = GetCurrentToken();
AdvanceToken();
expr.RightLeaf = ParseExpression();
return expr;
}
An infix expression has a left and a right leaf and an operator. As already discussed, the left leaf is the already parsed expression. The operator is the infix operator. The expression to the right side of the infix operator is parsed by calling again our main parser function ParseExpression()! This is analogous to the prefix parsing functions to allow parsing of longer / nested expressions like "1 + 2 + 3".
So at this point we can also parse infix expressions like "-5 + 2", this is a great step.
But wait - maybe following questions arise:
Question 1: Does the adapted ParseExpression() sill parses prefix expressions correctly?
Answer: Sure. Consider the expression "-5" again. After this prefix expression is parsed, AdvanceToken() is called which will then proceed to the EOF token.
For this token type, no infix parsing function is registered, so the if-condition is skipped and the expression is returned.
Question 2: What about postfix expressions? How are they handled?
Answer: The short and easy answer is - they are handled like infix expression! Consider the postfix expression "5++". The only difference to an infix expression is that there is no right leaf.
So we can add postfix parsing functions to our infix mapping tables as well:
{
...
{ PpsTokenType.EXP, ParseInfixExpression },
{ PpsTokenType.INCREMENT, ParsePostfixExpression },
{ PpsTokenType.DECREMENT, ParsePostfixExpression }
};
The actual postfix parsing function is then similar to the the infix parsing function except for the right leaf:
{
PpsPostfixExpression expr = new PpsPostfixExpression();
expr.LeftLeaf = Left;
expr.OperatorToken = GetCurrentToken();
AdvanceToken();
// there is no right operand
return expr;
}
So prefix, infix and postfix expressions - but unfortunately not in the correct way! No operator precedence and associativity is supported yet.
For example, the expression "2 * 3 + 4" is incorrectly parsed as "((2) * ((3) + (4)))" which would result in 14, but actually it should be "(((2) * (3)) + (4))" with correct result of 10.
So let's tackle this in the next chapter.
4.4 Implementation Part III: Adding operator precedence
Caution
The implementation of the Pratt parser in this chapter is incomplete: It does parse our expressions with correct precedence.
But it does not handle parenthesis and associativity correctly (yet)!
This will be added in the subsequent chapters.
Now our binding powers come into play! To take operator precedence into account, it is sufficient for now to use only the left binding power.
In this subchapter, the terms operator precedence and left binding power are used synonymously.
Let's represent the operator precedence table from chapter 2 as an enum where a lower value means a lower precedence.
Also a lowest threshold is defined which will be used as start value of the parser:
{
LOWEST = 0,
SUM = 1,
PRODUCT = 2,
PREFIX = 3,
EXPONENTIATION = 4,
POSTFIX = 5
}
We said that the left binding power is a property of a token, so we need some mapping to get the operator precedence of a particular token. Let's use a dictionary which maps the token types to its precedence values:
{
{ PpsTokenType.PLUS, PpsOperatorPrecedenceLBP.SUM },
{ PpsTokenType.MINUS, PpsOperatorPrecedenceLBP.SUM},
{ PpsTokenType.MUL, PpsOperatorPrecedenceLBP.PRODUCT },
{ PpsTokenType.DIV, PpsOperatorPrecedenceLBP.PRODUCT },
{ PpsTokenType.EXP, PpsOperatorPrecedenceLBP.EXPONENTIATION },
{ PpsTokenType.DECREMENT, PpsOperatorPrecedenceLBP.POSTFIX },
{ PpsTokenType.INCREMENT, PpsOperatorPrecedenceLBP.POSTFIX },
{ PpsTokenType.INVERT, PpsOperatorPrecedenceLBP.POSTFIX }
};
As all prefix operators have the same binding power, things can be kept more simple by not adding the prefix operator tokens to above table, but only the infix and postfix operators.
We can define a small utility function to get the operator precedence of a given token type with the fallback to return the lowest precedence if the token type cannot be found (e.g. for the EOF token):
{
PpsOperatorPrecedenceLBP lbp;
PpsTokenType tokenType = GetTokenType(GetCurrentToken());
if (!precedenceTable.TryGetValue(tokenType, out lbp))
{
lbp = PpsOperatorPrecedenceLBP.LOWEST;
}
return lbp;
}
The parser shall continue parsing the expression as long as the precedence / of the current operator is greater than a minimum binding power.
The minimum binding power is the lowest binding power an operator must have in order to be allowed to continue parsing the current expression.
So the current operator precedence needs to be passed to the parsing functions:
Of course, we start with the lowest binding power to allow parsing at all:
{
return ParseExpression(PpsOperatorPrecedenceLBP.LOWEST);
}
As long as the current operator binds more tightly than current expression context represented by the min binding power, the parser shall continue parsing the expression.
This leads to following change in our ParseExpression() function:
{
// check existence and get prefix parse function for current token type
if (!prefixParseFuncs.ContainsKey(GetTokenType(currentToken)))
{
throw new Exception($"No prefix parse function registered for token type {currentToken.TokenType}");
}
PrefixParseFunc prefixFunc = prefixParseFuncs[GetTokenType(currentToken)];
// call the registered prefix function to parse the expression
PpsBaseExpression expr;
expr = prefixFunc();
// consume token, move forward to next one
AdvanceToken();
// as long as the current operator binds more tightly than current expression context represented by the min binding power,
// continue parsing the infix expression
while (GetLeftBindingPowerPrecedence(currentToken) > minBP)
{
// look at the current token and check if it tells us that this is an infix expression
if (infixParseFuncs.ContainsKey(GetTokenType(currentToken)))
{
// call the registered infix function to parse the expression
InfixParseFunc infixFunc = infixParseFuncs[GetTokenType(currentToken)];
expr = infixFunc(expr);
}
}
// return the parsed expression
return expr;
}
Then only a two other changes are required. In our ParsePrefixExpression, the left binding power must be passed to ParseExpression() when parsing the expression right to the operator. As described already above, all our prefix operators have the same left binding power so no need to retrieve the binding power from the token type, instead simply pass PREFIX precedence directly:
{
...
expr.RightLeaf = ParseExpression(PpsOperatorPrecedenceLBP.PREFIX);
return expr ;
}
The second change is on our ParseInfixExpression(). The left binding power precedence of the infix operator token must be retrieved from and passed to ParseExpression() for parsing the right side of the infix expression:
{
PpsInfixExpression expr = new PpsInfixExpression();
expr.LeftLeaf = Left;
expr.OperatorToken = GetCurrentToken();
PpsOperatorPrecedenceLBP precedence = GetLeftBindingPowerPrecedence(GetCurrentToken());
AdvanceToken();
expr.RightLeaf = ParseExpression(precedence);
return expr;
}
That's it! Operator precedence is handled correctly now! For example, "2 * 3 + 4" is correctly parsed to "(((2) * (3)) + (4))".
Let's have a look at a concrete program flow as example.
Flow sequence for the expression 2 + 3 + 4.
ParseExpression(lowest) [currentToken = 2]
Prefix function called for number : expr = 2
AdvanceToken -> [currentToken = +]
LBP of current token + (sum) IS greater than passed binding power lowest
ParseInfixExpression [currentToken = +]
Left = "2"
Op = "+"
AdvanceToken -> [currentToken = 3]
ParseExpression(+ (sum) precedence) [currentToken = 3]
Prefix function called for number : expr = 3,
AdvanceToken -> -> [currentToken = +]
LBP of current token + (sum) IS NOT greater than passed binding power + (sum)
return expr = 3
right = "3"
expr = return from ParseInfixExpression = "2 + 3"
LBP of current token + (sum) IS greater than lowest binding power
ParseInfixExpression [currentToken = +]
Left = "2 + 3"
Op = +
AdvanceToken -> [currentToken = 4]
ParseExpression(+ (sum) precedence) [currentToken = 4]
Prefix function called for number : expr = 4,
AdvanceToken -> -> [currentToken = EOF]
LBP of current token EOF IS NOT greater than passed binding power + (sum)
return expr = 4
right = "4"
expr = return from ParseInfixExpression = "left(2 + 3) + right (4)"
LBP of current token EOF IS NOT greater than passed binding power + (sum)
return expr = "left(2 + 3) + right (4)"
4.5 Implementation Part IV: Adding associativity
Caution
The implementation of the Pratt parser in this chapter is incomplete: It does parse our expressions with correct precedence and associativity.
But it does not handle parenthesis (yet)!
This will be added in the subsequent chapters.
Up to now, all expressions are parsed left-to-right associative, so "1 + 2 + 3" is parsed as "(1 + 2) + 3".
The exponentiation operator is right-to-left associative so "2 ** 3 ** 2" needs to be correctly parsed as "2 ** (3 ** 2)" instead of the current wrong result "(2 ** 3) ** 2".
For correct handling of associativity, the right binding power comes into play!
The right binding power (RBP), analogous to the LBP, specifies how tight an operator belongs to the expression on its right side.
Here comes the trick: For right associative operators, the RBP is defined less than the LBP! If the RBP is passed to ParseExpression(minBP),
this will cause the parser to continue for right associative operators because the condition "while (GetLeftBindingPowerPrecedence(currentToken) > minBP)" will be true!
Because the LPB of the right associative operator is greater than the passed minimum binding value which is the RBP.
Let's jump into the implementation and add support for RBP.
Let's redefine the existing operator precedence table PpsOperatorPrecedenceLBP by adding "gaps" between the enum values (we will see soon why) and rename it to PpsOperatorPrecedence:
{
LOWEST = 0,
SUM = 10,
PRODUCT = 20,
PREFIX = 30,
EXPONENTIATION = 40,
POSTFIX = 50
}
A simple struct is defined to hold both values, the left binding power and the right binding power, and the existing table precedenceTable is changed so that
it maps both binding power values to a given token.
This new precedenceTable dictionary table contains a very important fact: While for all left-to-right associative operators, the left and right binding powers are equal,
for all right-to-left associative operators like exponentiation, the right binding is less than the left binding power!
As seen below, we just subtract 1 from the precedence value for the right binding power. To avoid collisions of precedence values due to this subtraction, we have introduced
the "gaps" in the precedence values above.
{
public PpsOperatorPrecedence LBP;
public PpsOperatorPrecedence RBP;
}
private readonly Dictionary<PpsTokenType, PpsOperatorPrecedenceTuple> precedenceTable = new Dictionary<PpsTokenType, PpsOperatorPrecedenceTuple>
{
{ PpsTokenType.PLUS, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.SUM, RBP = PpsOperatorPrecedence.SUM } },
{ PpsTokenType.MINUS, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.SUM, RBP = PpsOperatorPrecedence.SUM } },
{ PpsTokenType.MUL, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.PRODUCT, RBP = PpsOperatorPrecedence.PRODUCT } },
{ PpsTokenType.DIV, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.PRODUCT, RBP = PpsOperatorPrecedence.PRODUCT } },
{ PpsTokenType.EXP, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.EXPONENTIATION, RBP = PpsOperatorPrecedence.EXPONENTIATION - 1 } },
{ PpsTokenType.DECREMENT, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.POSTFIX, RBP = PpsOperatorPrecedence.POSTFIX } },
{ PpsTokenType.INCREMENT, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.POSTFIX, RBP = PpsOperatorPrecedence.POSTFIX } },
{ PpsTokenType.INVERT, new PpsOperatorPrecedenceTuple { LBP = PpsOperatorPrecedence.POSTFIX, RBP = PpsOperatorPrecedence.POSTFIX } }
};
The function GetLeftBindingPowerPrecedence() needs to be slightly changed due to the precedenceTable. Also a correspondingGetRightBindingPowerPrecedence()
utility function is introduced (which are not listed here due to its simplicity).
Now the second important (and last) change to complete the associativity functionality: In ParseInfixExpression(), the right binding power is passed to the ParseExpression()
function instead of the left binding power:
{
PpsInfixExpression expr = new PpsInfixExpression();
expr.LeftLeaf = Left;
expr.OperatorToken = GetCurrentToken();
PpsOperatorPrecedence precedence = GetLeftBindingPowerPrecedence(GetCurrentToken());
PpsOperatorPrecedence precedence = GetRightBindingPowerPrecedence(GetCurrentToken());
AdvanceToken();
expr.RightLeaf = ParseExpression(precedence);
return expr;
}
Let's check the program flow for example "4 ** 3 ** 2" which should not make only associativity clear but also precedence handling of the algorithm in general:
Flow sequence for the expression 4 ** 3 ** 2.
ParseExpression(lowest) [currentToken = 4]
Prefix function called for number : expr = 4
AdvanceToken -> [currentToken = **]
LBP of current token ** (exponentiation) IS greater than passed binding power lowest
ParseInfixExpression [currentToken = **]
Left = "4"
Op = "**"
AdvanceToken -> [currentToken = 3]
ParseExpression(RBP of **) [currentToken = 3]
Prefix function called for number : expr = 3,
AdvanceToken -> -> [currentToken = **]
LBP of current token ** (exponentiation) IS greater than given right binding power of ** (exponentiation)
ParseInfixExpression [currentToken = **]
Left = "3"
Op = "**"
AdvanceToken -> [currentToken = 2]
ParseExpression(RBP of **) [currentToken = 2]
Prefix function called for number : expr = 2,
AdvanceToken -> -> [currentToken = EOF]
LBP of current token EOF IS NOT greater than given right binding power of ** (exponentiation)
return expr = "2"
right = "2"
expr = return from ParseInfixExpression = "3 ** 2"
LBP of current token EOF IS NOT greater than given right binding power of ** (exponentiation)
return expr = "3 ** 2"
right = expr = "3 ** 2"
return expr = "left(4) ** right(3 ** 2)"
LBP of current token EOF IS NOT greater than given right binding power of ** (exponentiation)
expr = "left(4) ** right(3 ** 2)"
4.6 Implementation Part V: Adding parentheses
Now the final step - adding support for parentheses. And as you will see, this is surprisingly easy to implement! A Pratt parser handles parentheses in an very elegant way.
The key concept is that parentheses is not treated as another operator precedence layer but just as a grouped expression.
This means no additional operator needs to be added to the operator precedence table nor an extension to the right or left binding power handling is required.
Instead, an expression in parentheses is just handled by another prefix parsing function which consumes the opening and closing bracket and parses the inner expression as any other expression.
The important fact here is that the ParseExpression function is called with lowest binding power: We know that a Pratt Parser stops if the binding power of the next token is too low, i.e.
lower than current used binding power. But an expression like "(5+2*3)" shall be parsed as single expression "5+2*3" regardless of the surrounding expression parts. It's like when the parser is started
with the initial call to ParseExpression where also the lowest binding power is passed.
At first, register a new prefix parsing function for a grouped expression, that is an expression in parentheses:
{
{ PpsTokenType.NUMBER, ParseNumber },
{ PpsTokenType.MINUS, ParsePrefixExpression },
{ PpsTokenType.PLUS, ParsePrefixExpression },
{ PpsTokenType.INVERT, ParsePrefixExpression },
{ PpsTokenType.CURLY_BRACKET_OPEN, ParseExpressionWithParentheses }
};
The second step is the implementation of ParseExpressionWithParentheses() which is straightforward: Extract the parentheses and parse the inner expression with lower binding power:
{
// check and consume the "("
if (GetCurrentToken().TokenType != PpsTokenType.CURLY_BRACKET_OPEN)
{
throw new InvalidOperationException("Invalid expression. Current token: " + GetCurrentToken() + ", expected is opening bracket.");
}
AdvanceToken();
// parse the inner full expression with minimum binding power
PpsBaseExpression expr = ParseExpression(PpsOperatorPrecedence.LOWEST);
// check and consume the ")"
if (GetCurrentToken().TokenType != PpsTokenType.CURLY_BRACKET_CLOSE)
{
throw new InvalidOperationException("Invalid expression. Current token: " + GetCurrentToken() + ", expected is closing bracket.");
}
AdvanceToken();
return expr;
}
4.7 Implementation completion
And that's it! A fully working Pratt Parser, implemented step-by-step from scratch!
Nothing more to say. Download the source code at the top of the site or check the appendix for different approaches on how to evaluate the AST returned by our Pratt parser.
5. Appendix
5.1 Appendix I: Recursive evaluation of the AST
The recursive evaluation of our constructed abstract syntax tree is simple: Start at the root node of the AST and check its node type. A leaf node of the tree is always a number expression which can directly be returned as double value. All other expressions are just evaluated depending on the token type which is in fact the mathematical operator. The source code is straightforward and should make it clear:
{
if (expr is PpsNumberExpression numExpr)
{
return (double)numExpr.Number.TokenValue;
}
else if (expr is PpsPrefixExpression prefixExpr)
{
PpsTokenType tokenType = (PpsTokenType)(prefixExpr.OperatorToken.TokenType);
switch (tokenType.Type)
{
case EPpsTokenType.PLUS: return +EvalExpr(prefixExpr.RightLeaf);
case EPpsTokenType.MINUS: return -EvalExpr(prefixExpr.RightLeaf);
case EPpsTokenType.INVERT: return ~(long)(EvalExpr(prefixExpr.RightLeaf));
default: throw new NotImplementedException($"{tokenType.Type} not implemented for prefix expression");
}
}
else if (expr is PpsInfixExpression infixExpr)
{
PpsTokenType tokenType = (PpsTokenType)(infixExpr.OperatorToken.TokenType);
switch (tokenType.Type)
{
case EPpsTokenType.PLUS: return EvalExpr(infixExpr.LeftLeaf) + EvalExpr(infixExpr.RightLeaf);
case EPpsTokenType.MINUS: return EvalExpr(infixExpr.LeftLeaf) - EvalExpr(infixExpr.RightLeaf);
case EPpsTokenType.MUL: return EvalExpr(infixExpr.LeftLeaf) * EvalExpr(infixExpr.RightLeaf);
case EPpsTokenType.DIV: return EvalExpr(infixExpr.LeftLeaf) / EvalExpr(infixExpr.RightLeaf);
case EPpsTokenType.EXP: return Math.Pow(EvalExpr(infixExpr.LeftLeaf), EvalExpr(infixExpr.RightLeaf));
default: throw new NotImplementedException($"{tokenType.Type} not implemented for infix expressions");
}
}
else if (expr is PpsPostfixExpression postfixExpr)
{
PpsTokenType tokenType = (PpsTokenType)(postfixExpr.OperatorToken.TokenType);
switch (tokenType.Type)
{
case EPpsTokenType.DECREMENT: return EvalExpr(postfixExpr.LeftLeaf) - 1;
case EPpsTokenType.INCREMENT: return EvalExpr(postfixExpr.LeftLeaf) + 1;
default: throw new NotImplementedException($"{tokenType.Type} not implemented for postfix expressions");
}
}
else
{
throw new InvalidOperationException("Unsupported expression: " + expr.GetType());
}
}
5.2 Appendix II: Converting the AST to postfix notation
Another approach for evaluation is to convert the AST which is returned by the Pratt parser to postfix notation (also called reverse polish notation).
In my previous article [5] Shunting Yard Algorithm for Parsing Mathematical Expressions, it has been shown how to evaluate a mathematical expression
given in postfix notation, so the only task left is to convert the AST to postfix notation which is, more precisely, a list of expression tokens in postfix order.
This is pretty simple: Remember that in postfix notation the operands precede its operators. If the AST is traversed in a way that first the leafs of a node are visited (first left, then right)
and the operator token of the node at last, then the tokens are exactly in the desired order already.
Wow, that's simple! Here the main part of the recursive function that provides the AST expressions as list of token in postfix order:
{
List<PostfixToken> output = new List<PostfixToken>();
if (expression is PpsNumberExpression)
{
var expr = expression as PpsNumberExpression;
output.Add(new PostfixToken(expr.Number, false));
}
else if (expression is PpsPrefixExpression)
{
var expr = expression as PpsPrefixExpression;
output.AddRange(ConvertToPostFixNotation(expr.RightLeaf));
output.Add(new PostfixToken(expr.OperatorToken, true));
}
else if (expression is PpsInfixExpression)
{
var expr = expression as PpsInfixExpression;
output.AddRange(ConvertToPostFixNotation(expr.LeftLeaf));
output.AddRange(ConvertToPostFixNotation(expr.RightLeaf));
output.Add(new PostfixToken(expr.OperatorToken, false));
}
else if (expression is PpsPostfixExpression)
{
var expr = expression as PpsPostfixExpression;
output.AddRange(ConvertToPostFixNotation(expr.LeftLeaf));
output.Add(new PostfixToken(expr.OperatorToken, false));
}
else
{
throw new InvalidOperationException("Unsupported expression: " + expression.GetType());
}
return output;
}
Caution
However, please note following simplification of above snippet: In the Pratt parser, we used only one token for "-" (as well as for "+", but let's stick to the minus),
despite of being as prefix or infix operator. So there is only one MINUS token type.
This means "5 - 3" is represented as tokens NUMBER MINUS NUMBER while "-3" is MINUS NUMBER. While the Pratt Parser can handle these two types of "MINUS" (because it knows if it processes
an prefix or infix expression), this information is lost when converting to postfix notation.
While "5 - 3" is in postfix notation "5 3 -", the "-3" is in postfix "3 -". So if the postfix evaluator
comes to the "-", it does not know if it needs to pop one operand from stack (in case of prefix minus operator) or two operands (in case of infix minus operator).
One simple solution for this problem is too use different token types for both kind of minus, a "prefix minus" token and a "infix minus" token.
Another solution is to track if a token was encountered in a prefix expression or not - this workaround is used above.
5.3 Appendix III: A Pratt Parser directly emitting postfix notation
Of course, it is also possible to adapt the Pratt parser so that it returns the parsed mathematical expression in postfix notation instead of an AST.
The main changes are to get rid of all the AST elements (the type PpsBaseExpression and its subclasses): Instead of emitting AST expression nodes, the parsing functions add the tokens
directly into the list with postfix order.
The parsing functions itself process the operators in correct precedence order as follows: First emit the operands, then the operators.
Let's have a look at the modified prefix parsing function:
{
// store operator token
Token operatorToken = GetCurrentToken();
AdvanceToken();
// parse the expression after the prefix operator
ParseExpression(PpsOperatorPrecedence.PREFIX, output);
// emit operator token after(!) the operands
output.Add(new PostfixToken(operatorToken, true));
}
The operator is temporarily stored, then the following expression is parsed and only the the very end, the operator is stored in the token list.
The same rule applies to the infix parsing function:
{
// store operator token
Token operatorToken = GetCurrentToken();
// left operand has already been added to postfix token list
PpsOperatorPrecedence precedence = GetRightBindingPowerPrecedence(GetCurrentToken());
AdvanceToken();
// parse the expression right to the operator
ParseExpression(precedence, output);
// emit operator token after(!) both operands
output.Add(new PostfixToken(operatorToken, false));
}
The complete postfix Pratt parser implementation can be found in the source code package linked at the top of the page.
Conclusion
Nothing more to say - we have implemented a fully working Pratt parser implementation step-by-step. Check out the full source code linked at the top of the page.
References
* [1] Top down operator precedence by Vaughan R. Pratt @ ACM Digital Library
* [2] Top down operator precedence by Vaughan R. Pratt @ github.io
* [3] Top Down Operator Precedence by Douglas Crockford
* [4] Pratt Parsers: Expression Parsing Made Easy by Bob Nystrom
* [5] Shunting Yard Algorithm for Parsing Mathematical Expressions
* [6] Writing a simple generic tokenizer
* [7] A practical approach to write a simple JSON parser yourself
History
- 2026/03/09: Initial version.