A practical approach to writing a simple JSON parser yourself


Download C# source code and binaries (103 kb)

1. Introduction

Some time ago, I implemented my own generic tokenizer and explained its functionality (see [1]).
So the logical next step is to use this library to write a simple parser.
I have decided to write a JSON parser: its simple grammar makes it an ideal target for a first parser built on my generic tokenizer library. JSON is also widely used, so diving into this simple data format has a beneficial side effect.

Disclaimer: This article focuses on one possible way to implement a parser. The result probably does not fully conform to the official JSON specification, RFC 7159 (see [2]).
Full conformance is not the goal, though (do not use it in production to parse real JSON files; more sophisticated libraries are available).
But if you are reading this to learn the first steps of implementing a simple parser yourself, you are in exactly the right place!


2. Primer to JSON

The official JSON web site (see [3]) provides an excellent and clear introduction to the JSON format. However, for a better understanding of the parser later on, here is a short and simplified textual introduction to the main elements of the JSON format:

* A JSON value is an object, an array, a string, a number, or one of the literals true, false and null.
* An object is enclosed in curly brackets and contains comma-separated key-value pairs (members); each key is a string and is separated from its value by a colon.
* An array is enclosed in square brackets and contains comma-separated values.
* Objects and arrays can be nested to any depth.
* Whitespace between tokens has no meaning.


With these key facts in mind, have a look at some examples to get an idea of JSON files.

Example 1: Array with different types


[ 1, 2, false, "empty" ]

In this example, the root element of the JSON document is an array which contains elements of different types.


Example 2: Object with nested elements


{
  "name": "John",
  "age": 42,
  "balance": -274.87,
  "married": false,
  "european": true,
  "extended_info": null,
  "item_collection": [ 3e2, "some_string", [ "elem1", "elem2", null ], { "password": 123 } ],
  "item_map": { "count": 2, "elems": [ 1.0, 2.2 ] }
}

In this example, the root element is an object. It starts with some keys that have string, number, boolean and null values. The value of a key-value pair can also be an array (as for key "item_collection"), and an array can itself contain other arrays or objects. The value can also be another object (as for key "item_map"), and the nested object can in turn contain any number of further objects and arrays.


Here is a summary of useful observations and additional facts:


3. Implementation of the JSON tokenizer

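First, the individual JSON tokens need to be identified. The presented approach defines and uses the following token types:

* Structural tokens: '{', '}', '[', ']', ',' and ':'
* Literal tokens: true, false and null
* String tokens, starting with '"'
* Number tokens, starting with a digit or a minus sign
* An EOF token marking the end of the input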

The parsing of the tokens is straightforward. Whitespace of any kind separates two tokens but is otherwise ignored. The first character of an encountered token determines its type.
So the creation of JSON tokens can be implemented with the generic tokenizer as follows:

public override Token ProcessToken()
{
  Token token;

  skipWhiteSpaces(); /* ignore whitespaces */

  char currentChar = source.GetCurrentChar();
  if (currentChar == InputSource.EOF)
  {
    token = new TokenEOF(source, JsonTokenType.EOF);
  }
  /* current char is any of '{', '}', '[', ']', ',' or ':' -> structural token */
  else if (JsonStructuralToken.IsStructuralChar(currentChar))
  {
    token = new JsonStructuralToken(source);
  }
  /* current char is any of 't', 'f' or 'n' -> literal token */
  else if (JsonLiteralToken.IsLiteralTokenStart(currentChar))
  {
    token = new JsonLiteralToken(source);
  }
  /* current char is '"' -> string token */
  else if (currentChar == '"')
  {
    token = new JsonStringToken(source);
  }
  /* current char is digit or minus -> number token */
  else if (Char.IsDigit(currentChar) || currentChar == '-')
  {
    token = new JsonNumberToken(source);
  }
  /* else token is unknown */
  else
  {
    throw new TokenizerException(source, null, currentChar.ToString());
  }

  currentToken = token;

  return token;
}

The implementation of each token type is left out as it has already been addressed in detail in [1]. Check out the source code linked at the top of the article for details.
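
To give a rough idea of what such a token class has to do, the following sketch shows one possible way to turn the text of a number token into a double. It is not taken from the article's source: the class name NumberTextSketch and the method ToDouble are hypothetical, and the real JsonNumberToken may work differently.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the article's source code.
public static class NumberTextSketch
{
    // Converts JSON number text such as "-274.87" or "3e2" to a double.
    public static double ToDouble(string text)
    {
        // NumberStyles.Float allows a leading sign, a decimal point and an exponent.
        // InvariantCulture forces '.' as the decimal separator on every machine.
        return double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
    }
}
```

Fixing the culture matters because on a machine with, for example, a German locale the default double.Parse would expect a comma as the decimal separator. Note that double.Parse is more lenient than the JSON grammar (it accepts a leading '+' or surrounding whitespace, for instance), so a stricter tokenizer would have to validate the characters itself.
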
Let's proceed to the actual parser implementation.


4. Implementation of the JSON parser


4.1 Overview and design

The parser is implemented recursively because this is the easiest way to handle the recursively nested data structures of JSON arrays and objects.
The input to the parser is a tokenizer instance. The parser extracts the tokens one after another and builds up the JSON data structure objects for further processing. It also checks the validity of the JSON input and aborts as soon as an unexpected token is encountered.
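
Why recursion fits nested data so well can be seen in a deliberately tiny sketch. It is not the article's parser: it only understands nested arrays of integers, reads tokens that are already split into strings, and the names NestedArraySketch, Parse and ParseArray are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical, simplified stand-in: nested arrays of integers only.
public static class NestedArraySketch
{
    // Parses tokens such as "[", "1", ",", "[", "2", "]", "]" into nested lists.
    public static List<object> Parse(IReadOnlyList<string> tokens)
    {
        int pos = 0;
        if (tokens.Count == 0 || tokens[pos] != "[")
            throw new FormatException("Expected '[' at start");
        List<object> result = ParseArray(tokens, ref pos);
        if (pos != tokens.Count)
            throw new FormatException("Unexpected trailing tokens");
        return result;
    }

    // On entry tokens[pos] is "[", on exit pos is just past the matching "]".
    private static List<object> ParseArray(IReadOnlyList<string> tokens, ref int pos)
    {
        var items = new List<object>();
        pos++;                      // consume "["
        bool needElement = false;   // true directly after a comma
        bool needSeparator = false; // true directly after an element

        while (pos < tokens.Count)
        {
            string tok = tokens[pos];
            if (tok == "]" && !needElement)
            {
                pos++;              // consume "]"
                return items;
            }
            if (tok == "," && needSeparator)
            {
                pos++;              // consume ","
                needElement = true;
                needSeparator = false;
                continue;
            }
            if (needSeparator || tok == "]" || tok == ",")
                throw new FormatException($"Unexpected token '{tok}'");

            if (tok == "[")
            {
                // a nested array is handled by the very same method
                items.Add(ParseArray(tokens, ref pos));
            }
            else
            {
                items.Add(int.Parse(tok, CultureInfo.InvariantCulture));
                pos++;
            }
            needElement = false;
            needSeparator = true;
        }
        throw new FormatException("Unterminated array");
    }
}
```

Calling NestedArraySketch.Parse with the tokens of [1,[2,3]] yields a list whose second entry is itself a list. Each nesting level simply becomes one more recursive call, which is why no explicit stack is needed.
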

The parser is executed with a single call to Parse(), which parses the whole JSON file and returns its content as data structures.
Internally, a separate method is implemented for each kind of data object: ParseRoot(), ParseObject(), ParseArray(), ParseMember() and ParseValue().

That leaves the question of data storage: each time the parser completes the parsing of a JSON element, this element needs to be stored so that the JSON content can be accessed later. Therefore, the parser is provided with an instance of the interface IJsonDataCreatorRoot, which provides the functionality to store the JSON elements.
The default example implementation of IJsonDataCreatorRoot is the class JsonDataCreator. It uses a tree-like data structure to store the JSON elements in the same form as in the input file, but it is possible to write your own implementation of IJsonDataCreatorRoot and pass it to the parser.

Let's start with the parser implementation.


4.2 Implementation of the parser

The parser provides a ParseRoot() function. Remember that the top-level element in a JSON file can be an object, an array or a value.

private void ParseRoot(IJsonDataCreatorRoot rootObj)
{
  Token currentToken = tokenizer.ProcessToken();

  JsonTokenType t = (JsonTokenType)currentToken.TokenType;

  switch (t.Type)
  {
    case EJsonTokenType.CURLY_BRACKET_OPEN:
      rootObj.CreateJsonElem(ParseObject());
      break;

    case EJsonTokenType.SQUARE_BRACKET_OPEN:
      rootObj.CreateJsonElem(ParseArray());
      break;

    case EJsonTokenType.STRING:
    case EJsonTokenType.NUMBER:
    case EJsonTokenType.TRUE:
    case EJsonTokenType.FALSE:
    case EJsonTokenType.NULL:
      rootObj.CreateJsonElemFromObject(currentToken.TokenValue);
      break;

    default:
      throw new JsonParserException();
  }
}

The ParseRoot() function evaluates the first token to determine whether it is the beginning of an object, the beginning of an array or just a value.
In the first two cases, it calls ParseObject() or ParseArray(), respectively, and stores the result as the root element. A value, however, can be extracted directly and stored in the data model.


4.2.1 Parsing an object

The ParseObject() method first tells the JsonDataCreator to start a new JSON object data instance.
Then, for consistency, it asserts that the current token is an opening curly bracket, so that it really is the start of a JSON object.
It then enters a loop that processes the whole JSON object: it fetches the tokens one after another, compares each to the token types expected at that point according to the JSON grammar, and processes it. Initially, one of two token types is expected: either a closing curly bracket, indicating the end of the JSON object (which is then an empty object), or a string token as the key of a key-value pair.
A key is stored until the complete key-value pair has been parsed and saved in the data model.
The actual key-value pair is processed in the method ParseMember(). The created JSON object data instance is passed to ParseMember() so that the key-value pair can be linked to the correct parent JSON object.

private IJsonDataElemObject ParseObject()
{
  // start creation of new JSON object data object
  IJsonDataCreatorObject newObj = dataCreator.CreateNewJsonObject();

  /* opening curly bracket is already processed but check for consistency */
  AssertTokenType(tokenizer.GetCurrentToken(), EJsonTokenType.CURLY_BRACKET_OPEN);

  bool continueMemberProcessing = true;
  EJsonTokenType expectedTokens = EJsonTokenType.STRING | EJsonTokenType.CURLY_BRACKET_CLOSE;

  while (continueMemberProcessing)
  {
    // next token must match expectedTokens; initially a closing curly bracket or a string (key)
    Token token = tokenizer.ProcessToken();
    EJsonTokenType tokenType = ((JsonTokenType)token.TokenType).Type;

    switch (tokenType)
    {
    case EJsonTokenType.CURLY_BRACKET_CLOSE:
      if (AssertTokenType(tokenType, expectedTokens))
      {
        continueMemberProcessing = false;
      }
      break;

    case EJsonTokenType.STRING:
      // key of json object, parse member
      if (AssertTokenType(tokenType, expectedTokens))
      {
        CurrentKey = token.TokenValue as string;
        ParseMember(newObj);
        expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.CURLY_BRACKET_CLOSE;
      }
      break;

    case EJsonTokenType.COMMA:
      // after comma, a new key : value is expected
      if (AssertTokenType(tokenType, expectedTokens))
      {
        expectedTokens = EJsonTokenType.STRING;
      }
      break;

    default: throw new JsonParserException($"parseObject: Invalid token type {tokenType}");
    }
  }

  return newObj.FinalizeObject();
}
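
The listing above combines token types with '|' and checks them with AssertTokenType(), but neither the enum definition nor the helper is shown in this article. The following is a minimal sketch of how such a check could work, assuming the token types are bit flags. The enum SketchTokenType, its numeric values and the class TokenAssertSketch are invented for this illustration, and the behavior (throw on mismatch, otherwise return true) is an assumption that merely fits the way the helper is used above.

```csharp
using System;

// Assumption: each token type occupies its own bit so that several
// types can be OR-ed into one "expected" mask. Values are invented here.
[Flags]
public enum SketchTokenType
{
    NONE = 0,
    CURLY_BRACKET_OPEN = 1 << 0,
    CURLY_BRACKET_CLOSE = 1 << 1,
    SQUARE_BRACKET_OPEN = 1 << 2,
    SQUARE_BRACKET_CLOSE = 1 << 3,
    COMMA = 1 << 4,
    COLON = 1 << 5,
    STRING = 1 << 6,
    NUMBER = 1 << 7,
    TRUE = 1 << 8,
    FALSE = 1 << 9,
    NULL = 1 << 10
}

public static class TokenAssertSketch
{
    // Returns true if 'actual' is one of the types in 'expected'; throws otherwise.
    public static bool AssertTokenType(SketchTokenType actual, SketchTokenType expected)
    {
        if ((actual & expected) == 0)
        {
            throw new InvalidOperationException(
                $"Unexpected token {actual}, expected one of: {expected}");
        }
        return true;
    }
}
```

With a mask of STRING | CURLY_BRACKET_CLOSE, a STRING token passes the check while a COMMA token raises an exception. This mirrors the state at the start of an object, where only a key or the closing bracket is valid.
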

4.2.2 Parsing a member

The ParseMember() function parses a single member (= key-value pair) of a JSON object. It is straightforward: it expects a colon after the key and then delegates to ParseValue().

private void ParseMember(IJsonDataCreatorObject jsonObj)
{
  /* the starting key token is already processed */
  AssertTokenType(tokenizer.GetCurrentToken(), EJsonTokenType.STRING);

  // process and expect colon token
  Token token = tokenizer.ProcessToken();
  AssertTokenType(token, EJsonTokenType.COLON);

  // parse element value
  ParseValue(jsonObj);
}

4.2.3 Parsing a value

The ParseValue() function is a bit more interesting. Note that a value can be an object, array, string, number, true, false or null.
Depending on the value type, the appropriate parse function is called and the returned result is stored as a key-value pair in the data model.

private void ParseValue(IJsonDataCreatorObject jsonObj)
{
  Token currentToken = tokenizer.ProcessToken();

  JsonTokenType t = (JsonTokenType)currentToken.TokenType;

  switch (t.Type)
  {
  case EJsonTokenType.CURLY_BRACKET_OPEN:
    jsonObj.AddMember(CurrentKey, ParseObject());
    break;

  case EJsonTokenType.SQUARE_BRACKET_OPEN:
    jsonObj.AddMember(CurrentKey, ParseArray());
    break;

  case EJsonTokenType.STRING:
  case EJsonTokenType.NUMBER:
  case EJsonTokenType.TRUE:
  case EJsonTokenType.FALSE:
  case EJsonTokenType.NULL:
    // all terminal values are stored the same way
    jsonObj.AddMember(CurrentKey, dataCreator.CreateNewJsonDataValueFromObject(currentToken.TokenValue));
    break;

  default:
    throw new JsonParserException();
  }
}

4.2.4 Parsing an array

The ParseArray() function follows the same pattern as ParseObject(), but has to handle more cases because each array element can be a value of any type.

private IJsonDataElemArray ParseArray()
{
  // start creation of new JSON array data object
  IJsonDataCreatorArray newObj = dataCreator.CreateNewJsonArray();

  // inside an array, the following tokens are expected:
  // - closing square bracket (array is finished)
  // - opening square bracket (new element is an array)
  // - opening curly bracket (new element is an object)
  // - any type of a value as element
  EJsonTokenType elementToken = EJsonTokenType.SQUARE_BRACKET_CLOSE |
    EJsonTokenType.CURLY_BRACKET_OPEN |
    EJsonTokenType.SQUARE_BRACKET_OPEN |
    EJsonTokenType.STRING |
    EJsonTokenType.NUMBER |
    EJsonTokenType.NULL |
    EJsonTokenType.FALSE |
    EJsonTokenType.TRUE;

  /* opening square bracket is already processed but check for consistency */
  AssertTokenType(tokenizer.GetCurrentToken(), EJsonTokenType.SQUARE_BRACKET_OPEN);

  bool continueMemberProcessing = true;
  EJsonTokenType expectedTokens = elementToken;

  while (continueMemberProcessing)
  {
    Token token = tokenizer.ProcessToken();
    EJsonTokenType tokenType = ((JsonTokenType)token.TokenType).Type;
    switch (tokenType)
    {
    case EJsonTokenType.SQUARE_BRACKET_CLOSE:
      // array is closed, finish array parsing
      if (AssertTokenType(tokenType, expectedTokens))
      {
        continueMemberProcessing = false;
      }
      break;

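    case EJsonTokenType.COMMA:
      // another element is expected after a comma; a closing bracket directly
      // after a comma (trailing comma, e.g. [1,]) is not valid JSON
      if (AssertTokenType(tokenType, expectedTokens))
      {
        expectedTokens = elementToken & ~EJsonTokenType.SQUARE_BRACKET_CLOSE;
      }
      break;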

    case EJsonTokenType.STRING:
    case EJsonTokenType.NUMBER:
    case EJsonTokenType.TRUE:
    case EJsonTokenType.FALSE:
    case EJsonTokenType.NULL:
      // a terminal value, add it to the array
      if (AssertTokenType(tokenType, expectedTokens))
      {
        newObj.AddElement(dataCreator.CreateNewJsonDataValueFromObject(token.TokenValue));
        expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.SQUARE_BRACKET_CLOSE;
      }
      break;

    case EJsonTokenType.SQUARE_BRACKET_OPEN:
      // a new array is started, parse it recursively
      if (AssertTokenType(tokenType, expectedTokens))
      {
        newObj.AddElement(ParseArray());
        expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.SQUARE_BRACKET_CLOSE;
      }
      break;

    case EJsonTokenType.CURLY_BRACKET_OPEN:
      // a new object is started, parse it recursively
      if (AssertTokenType(tokenType, expectedTokens))
      {
        newObj.AddElement(ParseObject());
        expectedTokens = EJsonTokenType.COMMA | EJsonTokenType.SQUARE_BRACKET_CLOSE;
      }
      break;

    default: throw new JsonParserException($"parseArray: Invalid token type {tokenType}");
    }
  }

  return newObj.FinalizeObject();
}

4.3 JSON data model

4.3.1 Creation of the data model

As mentioned above, the actual JSON parser and the data model are separated. The parser is provided with an instance of the interface IJsonDataCreator, which provides methods to put the parsed JSON elements into a data model on the fly. The interface is simple and contains only the following methods:

Each of the Create methods returns another interface for filling the data object step by step. For example, the IJsonDataCreatorArray interface returned by the CreateNewJsonArray() function has a method AddElement(IJsonDataElemBase value) to store the array elements one after another as the JSON parser processes them.
Each of the creator interfaces has a FinalizeObject() method to indicate that the creation of a data object is complete. It returns the completed, immutable JSON data object that is later used to access the data.
For example, the IJsonDataCreatorArray interface has the following methods:

The FinalizeDataRoot() method of the IJsonDataCreator finally returns the complete JSON data model.
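
The create, fill and finalize pattern described in this section can be sketched in a few lines. The types below (IElemSketch, ValueElemSketch, ArrayElemSketch, ArrayCreatorSketch) are hypothetical stand-ins written for this illustration; only the method names AddElement() and FinalizeObject() are borrowed from the text, and the real interfaces in the source code may look different.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins; the real interfaces in the source code may differ.
public interface IElemSketch { }

public sealed class ValueElemSketch : IElemSketch
{
    public object Value { get; }
    public ValueElemSketch(object value) { Value = value; }
}

// Read-only result handed out after finalization.
public sealed class ArrayElemSketch : IElemSketch
{
    private readonly IReadOnlyList<IElemSketch> elements;
    public ArrayElemSketch(IReadOnlyList<IElemSketch> elements) { this.elements = elements; }
    public IReadOnlyList<IElemSketch> GetElements() => elements;
}

// Mutable creator used only while the parser is still reading the array.
public sealed class ArrayCreatorSketch
{
    private readonly List<IElemSketch> pending = new List<IElemSketch>();
    private bool finalized;

    public void AddElement(IElemSketch value)
    {
        if (finalized) throw new InvalidOperationException("Array is already finalized");
        pending.Add(value);
    }

    public ArrayElemSketch FinalizeObject()
    {
        if (finalized) throw new InvalidOperationException("Array is already finalized");
        finalized = true;
        // Hand out a copy so the result no longer shares storage with the creator.
        return new ArrayElemSketch(pending.ToArray());
    }
}
```

Separating the mutable creator from the read-only result means the parser can add elements freely while reading, whereas code that later consumes the data only sees an object without any mutating methods.
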

4.3.2 Usage of the data model

When the parser has finished parsing the JSON file, it returns an object of interface type IJsonDataRoot.
This provides a GetJsonRootElem() method to get an IJsonDataElemBase object as the root element.

The IJsonDataElemBase interface provides methods to identify the JSON data type and to get the data instance with the correct data type, namely:

The returned objects of types IJsonDataElemObject, IJsonDataElemArray and IJsonDataElemValue allow the recursive traversal of the whole data model which is built up like a tree. For example, the IJsonDataElemArray has a GetElements() method to retrieve its elements of type IJsonDataElemValue:
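
Because the data model is a tree, a recursive walk is the natural way to visit every element. The following sketch illustrates the idea with hypothetical node classes (NodeSketch and its subclasses) rather than the actual IJsonDataElem interfaces, whose exact members are not reproduced here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree nodes standing in for the object/array/value elements.
public abstract class NodeSketch { }

public sealed class ValueNodeSketch : NodeSketch
{
    public object Value { get; }
    public ValueNodeSketch(object value) { Value = value; }
}

public sealed class ArrayNodeSketch : NodeSketch
{
    public List<NodeSketch> Elements { get; } = new List<NodeSketch>();
}

public sealed class ObjectNodeSketch : NodeSketch
{
    public Dictionary<string, NodeSketch> Members { get; } = new Dictionary<string, NodeSketch>();
}

public static class TreeWalkSketch
{
    // Counts all terminal values by descending into arrays and objects.
    public static int CountValues(NodeSketch node)
    {
        switch (node)
        {
            case ValueNodeSketch _:
                return 1;
            case ArrayNodeSketch array:
            {
                int count = 0;
                foreach (NodeSketch element in array.Elements)
                    count += CountValues(element);
                return count;
            }
            case ObjectNodeSketch obj:
            {
                int count = 0;
                foreach (NodeSketch member in obj.Members.Values)
                    count += CountValues(member);
                return count;
            }
            default:
                throw new ArgumentException("Unknown node type", nameof(node));
        }
    }
}
```

The same shape (identify the element type, then recurse into children) works for any other traversal, such as printing the tree or searching for a key.
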

4.3.3 Example usage

The following code snippet shows how to load and parse a JSON file:

// fileNamePath is the full path to the JSON file to parse
JsonTokenizer tokenizer = new JsonTokenizer(new JsonInputSource(fileNamePath));

SimpleJsonParser parser = new SimpleJsonParser(tokenizer);

IJsonDataRoot jsonData = parser.Parse();

// access the parsed JSON data via jsonData object

Have a look at the source code (linked at the top of the article) of the provided demo application or unit test application to get a better overview of how the parser is used.


5. Conclusion & References

This article gives an overview of my own approach to implementing a JSON parser. It is not intended as user documentation for the parser itself (for this, please refer to the source).
Instead, after a short introduction to the JSON file format, it presents an easy approach to implementing your own JSON parser.
Hopefully you found it interesting and learned something! Maybe now is the time to start your own parser implementation ...

Sunshine, December 2022


References

* [1] Writing a simple generic tokenizer
* [2] RFC 7159: The JavaScript Object Notation (JSON) Data Interchange Format
* [3] JSON.org

