Table of Contents
- 1. Introduction
- 2. Primer to JSON
- 3. Implementation of the JSON tokenizer
- 4. Implementation of the JSON parser
- 4.1 Overview and design
- 4.2 Implementation of the parser
- 4.2.1 Parsing an object
- 4.2.2 Parsing a member
- 4.2.3 Parsing a value
- 4.2.4 Parsing an array
- 4.3 JSON data model
- 4.3.1 Creation of the data model
- 4.3.2 Usage of the data model
- 4.3.3 Example usage
- 5. Conclusion & References
1. Introduction
Some time ago, I implemented my own generic tokenizer and explained its functionality (see [1]).
So the logical next step is to use this library to write a simple parser.
I have decided to write a JSON parser: its simple grammar makes it an ideal target for a first parser built on my generic tokenizer library.
Further, JSON is currently trendy and popular, so diving into this simple data format has a beneficial side effect.
Disclaimer: This article focuses on a possible way to implement a parser.
However, the result is probably not fully conformant with the official JSON specification RFC7159 (see [2]).
Full conformance is not the goal either (do not use it in production to parse real JSON files; there are more sophisticated libraries available).
But if you are reading this to find out the first steps for implementing a simple parser yourself, then you are in exactly the right place!
2. Primer to JSON
The official JSON web site (see [3]) provides an excellent and clear introduction to the JSON format. However, for a better understanding of the parser later, here is a short and simplified textual introduction to the main elements of the JSON format:
- A value can either be an object, array, string, number or any of the special keywords true, false or null.
- An object is a list of key-value pairs. A key is always a string while a value is of element type value.
An object is started with an opening curly bracket { and is completed with a closing curly bracket }. The key and value are separated by a colon :, the key-value pairs are separated by a comma ,. An object can be empty.
- An array is an ordered list of values of element type value.
An array is started with an opening square bracket [ and is completed with a closing square bracket ]. The values are separated by a comma ,. An array can be empty.
- A string is a sequence of Unicode characters which is enclosed by double quotes ("). Empty strings are allowed.
- A number is a sequence of digits, optionally with a fractional part, exponent part or a leading minus sign.
With these key facts in mind, have a look at some examples to get an idea of JSON files.
In this example, the JSON contains an array as root element which itself contains elements of different types.
In this example, the JSON contains an object as root element. First there are some keys with string, number, boolean and null values. But the value of a key-value pair can also be an array (as for key "item_collection"), and an array itself can contain other arrays or objects. The value can also be another object (as for key "item_map"), and the nested object can contain any number of other objects and arrays.
Here is a summary of useful observations and additional facts:
- Note the recursion of JSON elements.
- An object contains elements of type value, each of which can itself be an array or object (besides the other basic types).
- An array contains elements of type value, each of which can itself be an array or object (besides the other basic types).
- An array can contain elements of different value types. In other words, the array elements are not required to all have the same JSON type.
- At the time of writing, JSON does not officially support comments.
- The root of a JSON document can be an object, an array or any other value.
- The keywords true, false and null are _not_ enclosed by double quotes ".
- Whitespace between tokens is ignored.
- Details of Unicode and escape sequences are omitted in the description above.
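The recursive nesting described above can be made concrete with a small sketch. It does not use the parser from this article; it assumes .NET's built-in System.Text.Json library instead, and the sample document is an illustration written for this sketch (only the keys "item_collection" and "item_map" are taken from the description above). The helper names JsonShapeDemo, Depth and DepthOf are made up for the example.

```csharp
using System;
using System.Text.Json;

public static class JsonShapeDemo
{
    // Illustrative document, not the article's original example.
    public const string Sample = @"{
        ""name"": ""demo"",
        ""count"": 3,
        ""active"": true,
        ""nothing"": null,
        ""item_collection"": [1, ""two"", [3, 4], { ""five"": 5 }],
        ""item_map"": { ""inner"": { ""values"": [] } }
    }";

    // Returns how many levels of objects/arrays are nested at and below 'element'.
    public static int Depth(JsonElement element)
    {
        int deepestChild = 0;
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (JsonProperty member in element.EnumerateObject())
                    deepestChild = Math.Max(deepestChild, Depth(member.Value));
                return 1 + deepestChild;
            case JsonValueKind.Array:
                foreach (JsonElement item in element.EnumerateArray())
                    deepestChild = Math.Max(deepestChild, Depth(item));
                return 1 + deepestChild;
            default:
                return 0; // string, number, true, false or null
        }
    }

    // Convenience wrapper: parses the text and measures the root element.
    public static int DepthOf(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        return Depth(doc.RootElement);
    }
}
```

The function mirrors the grammar: objects and arrays recurse into their children, all other value types end the recursion. The same shape reappears in the parser methods later in the article.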
3. Implementation of the JSON tokenizer
First, the individual JSON tokens need to be identified. The presented approach defines and uses the following token types:
- Structural token: This token class contains token types for the six distinct, single-character terminal values:
- { : Opening curly bracket.
- } : Closing curly bracket.
- [ : Opening square bracket.
- ] : Closing square bracket.
- : : Colon as separator between key and value.
- , : Comma as separator between key-value pairs and array elements.
- Literal token: This token class contains token types for the three keyword terminal values:
- true
- false
- null
- String token: Any token that begins and ends with double quote ".
- Number token: A numerical value, starting either with a digit or with a minus sign.
The parsing of the tokens is straightforward. Any kind of whitespace separates two tokens but is ignored in further processing.
The first character of any encountered token determines its type.
So the creation of JSON tokens can be implemented with the generic tokenizer as follows:
{
    Token token;
    skipWhiteSpaces(); /* ignore whitespaces */
    char currentChar = source.GetCurrentChar();
    if (currentChar == InputSource.EOF)
    {
        token = new TokenEOF(source, JsonTokenType.EOF);
    }
    /* current char is any of '{', '}', '[', ']', ',' or ':' -> structural token */
    else if (JsonStructuralToken.IsStructuralChar(currentChar))
    {
        token = new JsonStructuralToken(source);
    }
    /* current char is any of 't', 'f' or 'n' -> literal token */
    else if (JsonLiteralToken.IsLiteralTokenStart(currentChar))
    {
        token = new JsonLiteralToken(source);
    }
    /* current char is '"' -> string token */
    else if (currentChar == '"')
    {
        token = new JsonStringToken(source);
    }
    /* current char is digit or minus -> number token */
    else if (Char.IsDigit(currentChar) || currentChar == '-')
    {
        token = new JsonNumberToken(source);
    }
    /* else token is unknown */
    else
    {
        throw new TokenizerException(source, null, currentChar.ToString());
    }
    currentToken = token;
    return token;
}
The implementation of each token type is left out as the approach has already been addressed in detail in [1]. Check out the source code linked at the top of the article for details.
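The first-character dispatch shown above can be tried in isolation with the following minimal sketch. It is not part of the article's library: TokenClass, FirstCharClassifier and the Eof marker are hypothetical names introduced only for this example.

```csharp
using System;

public enum TokenClass { EndOfInput, Structural, Literal, String, Number, Unknown }

public static class FirstCharClassifier
{
    // Hypothetical stand-in for the tokenizer's end-of-input marker.
    public const char Eof = '\0';

    // Decides the token class from the first character alone.
    public static TokenClass Classify(char c)
    {
        if (c == Eof) return TokenClass.EndOfInput;
        if ("{}[]:,".IndexOf(c) >= 0) return TokenClass.Structural;
        if (c == 't' || c == 'f' || c == 'n') return TokenClass.Literal;
        if (c == '"') return TokenClass.String;
        // Only ASCII digits are accepted here, see the note below.
        if ((c >= '0' && c <= '9') || c == '-') return TokenClass.Number;
        return TokenClass.Unknown;
    }
}
```

One design note: Char.IsDigit() also returns true for decimal digits outside the ASCII range, which the JSON grammar does not allow in numbers. The sketch therefore compares against '0' to '9' directly; whether the number token of the library rejects such characters later is not covered here.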
Let's proceed to the actual parser implementation.
4. Implementation of the JSON parser
4.1 Overview and design
The parser is implemented recursively because recursion is the easiest way to handle arbitrarily nested JSON arrays and objects.
The input to the parser is a tokenizer instance. The parser extracts the tokens one after another and builds up the JSON data structure objects for further processing.
It also checks the validity of the JSON input and aborts as soon as an unexpected token is encountered.
The parser is executed with a single call to Parse(), which parses the whole JSON file and returns the content as data structures.
Internally, a separate method is implemented for each kind of data object: ParseRoot(), ParseObject(), ParseArray(), ParseMember() and ParseValue().
The question of data storage remains: each time the parser completes the parsing of a JSON element, this element needs to be stored so that the JSON content can be accessed later.
Therefore, the parser is provided with an instance of the interface IJsonDataCreatorRoot, which provides the functionality to store the JSON elements.
The default example implementation of IJsonDataCreatorRoot is the class JsonDataCreator. It uses a tree-like data structure to store the JSON elements in the same form as in the input file, but it is possible to write your own implementation of IJsonDataCreatorRoot and pass it to the parser.
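To illustrate the recursive design before looking at the real code, here is a deliberately reduced sketch. It is not the article's parser: it works on a list of token strings that has already been split, keeps terminal tokens as raw text, and returns plain List and Dictionary instances instead of a dedicated data model. MiniParser and all of its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class MiniParser
{
    // Parses one value starting at tokens[pos] and advances pos past it.
    public static object ParseValue(IReadOnlyList<string> tokens, ref int pos)
    {
        string tok = tokens[pos++];
        if (tok == "[") return ParseArray(tokens, ref pos);
        if (tok == "{") return ParseObject(tokens, ref pos);
        if (tok == "]" || tok == "}" || tok == "," || tok == ":")
            throw new FormatException($"Unexpected token '{tok}'");
        return tok; // terminal token, kept as raw text in this sketch
    }

    // Called after '[' has been consumed.
    private static List<object> ParseArray(IReadOnlyList<string> tokens, ref int pos)
    {
        var items = new List<object>();
        if (tokens[pos] == "]") { pos++; return items; } // empty array
        while (true)
        {
            items.Add(ParseValue(tokens, ref pos)); // recursion for nested elements
            string sep = tokens[pos++];
            if (sep == "]") return items;
            if (sep != ",") throw new FormatException($"Expected ',' or ']' but found '{sep}'");
        }
    }

    // Called after '{' has been consumed.
    private static Dictionary<string, object> ParseObject(IReadOnlyList<string> tokens, ref int pos)
    {
        var members = new Dictionary<string, object>();
        if (tokens[pos] == "}") { pos++; return members; } // empty object
        while (true)
        {
            string key = tokens[pos++];
            if (tokens[pos++] != ":") throw new FormatException("Expected ':' after key");
            members[key] = ParseValue(tokens, ref pos); // recursion for nested values
            string sep = tokens[pos++];
            if (sep == "}") return members;
            if (sep != ",") throw new FormatException($"Expected ',' or '}}' but found '{sep}'");
        }
    }
}
```

The sketch omits what the real parser handles: it does not check that a key is a string token, and truncated input ends in an index exception instead of a parser error. What it does show is the mutual recursion between the value, array and object methods.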
Let's start with the parser implementation.
4.2 Implementation of the parser
The parser provides a ParseRoot() function. Remember that the top-level element in a JSON file can be an object, an array or a value.
{
    Token currentToken = tokenizer.ProcessToken();
    JsonTokenType t = (JsonTokenType)currentToken.TokenType;
    switch (t.Type)
    {
        case EJsonTokenType.CURLY_BRACKET_OPEN:
            rootObj.CreateJsonElem(ParseObject());
            break;
        case EJsonTokenType.SQUARE_BRACKET_OPEN:
            rootObj.CreateJsonElem(ParseArray());
            break;
        case EJsonTokenType.STRING:
        case EJsonTokenType.NUMBER:
        case EJsonTokenType.TRUE:
        case EJsonTokenType.FALSE:
        case EJsonTokenType.NULL:
            rootObj.CreateJsonElemFromObject(currentToken.TokenValue);
            break;
        default:
            throw new JsonParserException();
    }
}
The ParseRoot() function evaluates the first token to determine whether it is the beginning of an object, the beginning of an array or just a value.
In the first two cases, it then calls ParseObject() or ParseArray(), respectively, and stores the result as the root element.
A value, however, can be extracted directly and stored in the data model.
4.2.1 Parsing an object
The ParseObject() method first asks the JsonDataCreator to start a new JSON object data instance.
Then, for consistency, it asserts that the current token is an opening curly bracket, so that it really is the start of a JSON object.
It then enters a loop that processes the whole JSON object: it fetches the tokens one after another, compares each one to the token types expected at that point according to the JSON grammar, and processes it. At the start, one of two token types is expected: either a closing curly bracket, indicating the end of the JSON object (which is then an empty object), or a string token as the key of a key-value pair.
A key is stored until the complete key-value pair has been parsed and saved in the data model.
The actual key-value pair is processed in the method ParseMember().
The created JSON object data instance is passed to ParseMember() so that the key-value pair can be linked to the correct parent JSON object.
{
    // start creation of new JSON object data object
    IJsonDataCreatorObject newObj = dataCreator.CreateNewJsonObject();
    /* opening curly bracket is already processed but check for consistency */
    AssertTokenType(tokenizer.GetCurrentToken(), EJsonTokenType.CURLY_BRACKET_OPEN);
    bool continueMemberProcessing = true;
    EJsonTokenType expectedTokens = EJsonTokenType.STRING | EJsonTokenType.CURLY_BRACKET_CLOSE;
    while (continueMemberProcessing)
    {
        // next token must either be a closing curly bracket or a string (key)
        Token token = tokenizer.ProcessToken();
        EJsonTokenType tokenType = ((JsonTokenType)token.TokenType).Type;
        switch (tokenType)
        {
            case EJsonTokenType.CURLY_BRACKET_CLOSE:
                // object is closed, finish object parsing
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    continueMemberProcessing = false;
                }
                break;
            case EJsonTokenType.STRING:
                // key of json object, parse member
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    CurrentKey = token.TokenValue as string;
                    ParseMember(newObj);
                    expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.CURLY_BRACKET_CLOSE;
                }
                break;
            case EJsonTokenType.COMMA:
                // after comma, a new key : value is expected
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    expectedTokens = EJsonTokenType.STRING;
                }
                break;
            default:
                throw new JsonParserException($"parseObject: Invalid token type {tokenType}");
        }
    }
    return newObj.FinalizeObject();
}
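The code above combines several token types with the | operator and checks the current token against the combination. A minimal sketch of how such a check can work is shown below. It assumes that the token type enum is a flags enum with one bit per token type, which the usage suggests but the article does not show. TokenKind and AssertTokenKind are hypothetical stand-ins for EJsonTokenType and AssertTokenType.

```csharp
using System;

// One bit per token kind so that kinds can be combined with '|'.
[Flags]
public enum TokenKind
{
    None = 0,
    CurlyOpen = 1 << 0,
    CurlyClose = 1 << 1,
    String = 1 << 2,
    Comma = 1 << 3,
    Colon = 1 << 4
}

public static class TokenChecks
{
    // Returns true if 'actual' is contained in the 'expected' set, throws otherwise.
    public static bool AssertTokenKind(TokenKind actual, TokenKind expected)
    {
        if ((actual & expected) == 0)
        {
            throw new InvalidOperationException(
                $"Unexpected token {actual}, expected one of: {expected}");
        }
        return true;
    }
}
```

With this layout, the set of allowed next tokens is a single value that the loop can overwrite after every processed token, which is how the expectedTokens variable is used above.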
4.2.2 Parsing a member
The ParseMember() function parses a single member (= key-value pair) of a JSON object. It is straightforward and needs no further explanation.
{
    /* the starting key token is already processed */
    AssertTokenType(tokenizer.GetCurrentToken(), EJsonTokenType.STRING);
    // process and expect colon token
    Token token = tokenizer.ProcessToken();
    AssertTokenType(token, EJsonTokenType.COLON);
    // parse element value
    ParseValue(jsonObj);
}
4.2.3 Parsing a value
The ParseValue() function is a bit more interesting. Note that a value can be an object, array, string, number, true, false or null.
Depending on the value type, the appropriate parse function is called and the returned result is stored as a key-value pair in the data model.
{
    Token currentToken = tokenizer.ProcessToken();
    JsonTokenType t = (JsonTokenType)currentToken.TokenType;
    switch (t.Type)
    {
        case EJsonTokenType.CURLY_BRACKET_OPEN:
            jsonObj.AddMember(CurrentKey, ParseObject());
            break;
        case EJsonTokenType.SQUARE_BRACKET_OPEN:
            jsonObj.AddMember(CurrentKey, ParseArray());
            break;
        case EJsonTokenType.STRING:
        case EJsonTokenType.NUMBER:
        case EJsonTokenType.TRUE:
        case EJsonTokenType.FALSE:
        case EJsonTokenType.NULL:
            // all terminal values are stored in the same way
            jsonObj.AddMember(CurrentKey, dataCreator.CreateNewJsonDataValueFromObject(currentToken.TokenValue));
            break;
        default:
            throw new JsonParserException();
    }
}
4.2.4 Parsing an array
The ParseArray() function follows the same pattern as ParseObject(), but it has to handle more cases because the individual array elements can be of any value type.
{
    // start creation of new JSON array data object
    IJsonDataCreatorArray newObj = dataCreator.CreateNewJsonArray();
    // inside an array, following elements are expected:
    // - closing square bracket (array is finished)
    // - opening square bracket (new element is an array)
    // - opening curly bracket (new element is an object)
    // - any type of a value as element
    EJsonTokenType elementToken = EJsonTokenType.SQUARE_BRACKET_CLOSE |
                                  EJsonTokenType.CURLY_BRACKET_OPEN |
                                  EJsonTokenType.SQUARE_BRACKET_OPEN |
                                  EJsonTokenType.STRING |
                                  EJsonTokenType.NUMBER |
                                  EJsonTokenType.NULL |
                                  EJsonTokenType.FALSE |
                                  EJsonTokenType.TRUE;
    /* opening square bracket is already processed but check for consistency */
    AssertTokenType(tokenizer.GetCurrentToken(), EJsonTokenType.SQUARE_BRACKET_OPEN);
    bool continueMemberProcessing = true;
    EJsonTokenType expectedTokens = elementToken;
    while (continueMemberProcessing)
    {
        Token token = tokenizer.ProcessToken();
        EJsonTokenType tokenType = ((JsonTokenType)token.TokenType).Type;
        switch (tokenType)
        {
            case EJsonTokenType.SQUARE_BRACKET_CLOSE:
                // array is closed, finish array parsing
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    continueMemberProcessing = false;
                }
                break;
            case EJsonTokenType.COMMA:
                // another element is expected after a comma
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    expectedTokens = elementToken;
                }
                break;
            case EJsonTokenType.STRING:
            case EJsonTokenType.NUMBER:
            case EJsonTokenType.TRUE:
            case EJsonTokenType.FALSE:
            case EJsonTokenType.NULL:
                // a terminal value, add it to the array
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    newObj.AddElement(dataCreator.CreateNewJsonDataValueFromObject(token.TokenValue));
                    expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.SQUARE_BRACKET_CLOSE;
                }
                break;
            case EJsonTokenType.SQUARE_BRACKET_OPEN:
                // a new array is started, parse it recursively
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    newObj.AddElement(ParseArray());
                    expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.SQUARE_BRACKET_CLOSE;
                }
                break;
            case EJsonTokenType.CURLY_BRACKET_OPEN:
                // a new object is started, parse it recursively
                if (AssertTokenType(tokenType, expectedTokens))
                {
                    newObj.AddElement(ParseObject());
                    expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.SQUARE_BRACKET_CLOSE;
                }
                break;
            default:
                throw new JsonParserException($"parseArray: Invalid token type {tokenType}");
        }
    }
    return newObj.FinalizeObject();
}
4.3 JSON data model
4.3.1 Creation of the data model
As mentioned above, the actual JSON parser and the data model are separated. The parser is provided with an instance of the interface IJsonDataCreator, which provides methods to put the parsed JSON elements into a data model on the fly. The interface is simple and contains only the following methods:
- IJsonDataCreatorRoot CreateRoot():
Creates the main JSON creator object (root object). Called at the beginning of parsing.
- IJsonDataCreatorObject CreateNewJsonObject():
Creates a new JSON object instance. Called if the parser encounters the beginning of a new JSON object.
- IJsonDataCreatorArray CreateNewJsonArray():
Creates a new JSON array instance. Called if the parser encounters the beginning of a new JSON array.
- IJsonDataElemValue CreateNewJsonDataValueFromObject(object obj):
Creates a new JSON value instance. Called if the parser encounters a new JSON terminal value.
- IJsonDataRoot FinalizeDataRoot(IJsonDataCreatorRoot rootObj):
Creates the actual JSON data object. Called when the parser is finished.
Each of the Create methods returns another interface that is used to fill the data object step by step. For example, the IJsonDataCreatorArray interface returned by the CreateNewJsonArray() function has a method AddElement(IJsonDataElemBase value) to store the array elements one after another as the JSON parser processes them.
Each of the creator interfaces has a FinalizeObject() method to indicate that the creation of the data is complete. The method returns the completed, immutable JSON data object that is later used to access the data.
For example, the IJsonDataCreatorArray interface has the following methods:
- void AddElement(IJsonDataElemBase value):
Adds a new element to the JSON array which is currently being set up.
- IJsonDataElemArray FinalizeObject():
Creates the actual JSON array object after all elements have been added.
The FinalizeDataRoot() method of the IJsonDataCreator finally returns the complete JSON data model.
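The creator pattern described in this section (collect elements step by step, then finalize into a read-only object) can be sketched as follows. The types IElem, ValueElem, ArrayElem and ArrayCreator are simplified, hypothetical stand-ins written for this example; only the method names AddElement() and FinalizeObject() are taken from the article.

```csharp
using System.Collections.Generic;

// Simplified stand-in for IJsonDataElemBase.
public interface IElem { }

// Simplified stand-in for a terminal JSON value.
public sealed class ValueElem : IElem
{
    public object Value { get; }
    public ValueElem(object value) { Value = value; }
}

// Simplified stand-in for the finished, read-only array element.
public sealed class ArrayElem : IElem
{
    public IReadOnlyList<IElem> Elements { get; }
    public ArrayElem(IReadOnlyList<IElem> elements) { Elements = elements; }
}

// Simplified stand-in for IJsonDataCreatorArray.
public sealed class ArrayCreator
{
    private readonly List<IElem> pending = new List<IElem>();

    // Called by the parser for every parsed array element.
    public void AddElement(IElem value) => pending.Add(value);

    // Wraps a copy of the collected elements in a read-only collection,
    // so later AddElement() calls cannot change an already finalized array.
    public ArrayElem FinalizeObject() =>
        new ArrayElem(new List<IElem>(pending).AsReadOnly());
}
```

Separating the mutable creator from the read-only result keeps the parser free of any knowledge about how the data is stored, which is what makes the data model exchangeable.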
4.3.2 Usage of the data model
When the parser has completed the parsing of the JSON file, an object with interface type IJsonDataRoot is returned.
It provides a GetJsonRootElem() method to get an IJsonDataElemBase object as the root element.
The IJsonDataElemBase interface provides methods to identify the JSON data type and to get the data instance with correct data type, namely:
- JsonDataElemType GetJsonDataType();
- bool IsJsonDataObject();
- bool IsJsonDataArray();
- bool IsJsonDataValue();
- IJsonDataElemObject GetAsJsonObject();
- IJsonDataElemArray GetAsJsonArray();
- IJsonDataElemValue GetAsJsonValue();
The returned objects of types IJsonDataElemObject, IJsonDataElemArray and IJsonDataElemValue allow the recursive traversal of the whole data model, which is built up like a tree. For example, IJsonDataElemArray has a GetElements() method to retrieve its elements of type IJsonDataElemBase:
- IReadOnlyList<IJsonDataElemBase> GetElements();
- int GetNumOfElements();
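A recursive traversal based on the methods listed above might look like the following sketch. The two interfaces are reduced re-declarations containing only the members used here, and it is assumed that IJsonDataElemArray derives from IJsonDataElemBase. DemoValue, DemoArray and ModelWalker are hypothetical helper classes that exist only to make the sketch runnable; they are not part of the library. Objects are skipped because their accessor methods are not listed in this article.

```csharp
using System;
using System.Collections.Generic;

// Reduced re-declarations of the interfaces described above.
public interface IJsonDataElemBase
{
    bool IsJsonDataArray();
    bool IsJsonDataValue();
    IJsonDataElemArray GetAsJsonArray();
}

public interface IJsonDataElemArray : IJsonDataElemBase
{
    IReadOnlyList<IJsonDataElemBase> GetElements();
}

// Hypothetical in-memory implementations for the demo.
public sealed class DemoValue : IJsonDataElemBase
{
    public bool IsJsonDataArray() => false;
    public bool IsJsonDataValue() => true;
    public IJsonDataElemArray GetAsJsonArray() =>
        throw new InvalidOperationException("not an array");
}

public sealed class DemoArray : IJsonDataElemArray
{
    private readonly IReadOnlyList<IJsonDataElemBase> elements;
    public DemoArray(params IJsonDataElemBase[] elements) { this.elements = elements; }
    public bool IsJsonDataArray() => true;
    public bool IsJsonDataValue() => false;
    public IJsonDataElemArray GetAsJsonArray() => this;
    public IReadOnlyList<IJsonDataElemBase> GetElements() => elements;
}

public static class ModelWalker
{
    // Counts the terminal values reachable from 'elem' through nested arrays.
    public static int CountValues(IJsonDataElemBase elem)
    {
        if (elem.IsJsonDataValue()) return 1;
        if (!elem.IsJsonDataArray()) return 0; // objects are not handled in this sketch
        int total = 0;
        foreach (IJsonDataElemBase child in elem.GetAsJsonArray().GetElements())
        {
            total += CountValues(child);
        }
        return total;
    }
}
```

The type check followed by the matching GetAs...() call is the access pattern the interface is designed for: first ask what the element is, then fetch it with the correct type.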
4.3.3 Example usage
The following code snippet shows how to load and parse a JSON file:
JsonTokenizer tokenizer = new JsonTokenizer(new JsonInputSource(fileNamePath));
SimpleJsonParser parser = new SimpleJsonParser(tokenizer);
IJsonDataRoot jsonData = parser.Parse();
// access the parsed JSON data via jsonData object
Have a look into the source code (linked at the top of the article) of the provided demo application or unit test application to get a better overview of the parser usage.
5. Conclusion & References
This article gives an overview of my own approach to implementing a JSON parser. It is not meant as user documentation for the parser itself (for this, please refer to the source).
Instead, after a short introduction to the JSON file format, it presents an easy approach to implementing your own JSON parser.
Hopefully you found it interesting and learned something! Maybe now it's the time to start your own parser implementation ...
Sunshine, December 2022
References
* [1] Writing a simple generic tokenizer
* [2] RFC7159
* [3] JSON.org
History
- 2022/12/22: Initial version.