View profile

Day #6: Two ways of writing a JSON Parser (first way: without tokenization)

Puppy Nomadic
Puppy Nomadic
In my next two posts, I will write about two ways to approach the JSON Parser problem. My first approach does not use a Tokenizer. I simply followed the JSON documentation diagrams, and one by one, translated the logic in these diagrams into code. I created helper methods to loop through each character in the JSON string.
The second approach, which I will write about in one of my next posts, is something I hope to learn tomorrow from my friend Sattwyk. It requires writing a tokenizer, and understanding more deeply how lexical analysis works in Javascript. Since JSON has a relatively simple syntax, it’s not necessary to use a tokenizer to parse JSON, but for the purpose of learning more about lexical analysis, and the proper way to write parsers for compilers, I want to tackle re-doing this problem using a tokenizer in another post.
If you want to skip forward and just look at my code, you can click here. Would be grateful for any feedback to refactor and make it better!

Problem: Writing a JSON Parser
Here is the infamous JSON Parser problem in CodeSmith’s Precourse-3:
…JavaScript Object Notation, is a standard format for passing data between various components of the web…A JSON Parser is used to transform the data from the readable JSON format (which is a string) into a structured JavaScript object. Your task is to create this parser that will take a string in JSON format and return a JavaScript object in the appropriate key/value format.
The idea seems simple enough. How do you transform a string like:
"{"firstName": "John", "lastName": "Smith", "isAlive": true,"age": 25}"
and transform it into an actual Javascript object, wherein each of the string properties are assigned to data type values such as numbers, booleans, strings, arrays, and objects.
—–
NOTE: Before implementing my solution to this problem, I discussed some approaches with Sattwyk and Max in my class on Slack. Many thanks to Max for linking me to this helpful blog post, written by former CodeSmith resident Wayne Wilcox. Also, many thanks to Andres for pair programming with me last week to work on the parseObject() function together.
Understanding the Documentation
I have never taken a class on compilers / interpreters before, and did not know what a tokenizer was or AST. Before tackling this problem, I went down a bit of a Wikipedia rabbit-hole trying to understand these concepts, and spending lots of time trying to understand an abstract framework for tackling this problem, before realizing that I was spending too much time this way, and that the best way to start this problem is just to begin coding and testing functions, one at a time, to build up to the larger solution.
While I was researching the problem, I ran into this blog post by Tai Li Hau, which pointed me to the direction of understanding the Syntax Diagram (or Railroad Diagram) in the JSON.org documentation.
I spent a bit more time on Wikipedia learning the grammar/mechanics of a Syntax Diagram, and then decided to work on my solution to the JSON Parser problem following these diagrams, and translating them into code:
For ex, a white space is defined in the JSON documentation as these 5 options. (Omitting is first.)
For ex, a white space is defined in the JSON documentation as these 5 options. (Omitting is first.)
An object is defined in this sequence. You can use if-else statements to translate this to code.
An object is defined in this sequence. You can use if-else statements to translate this to code.
A value can be any of these data types, preceded and followed by a white space.
A value can be any of these data types, preceded and followed by a white space.
It’s also critical to understand the bigger picture of how these data types are nested within one another. The McKeeman Form diagram represents the same information as the Railroad diagram in text form.
It helpfully classifies escape and hex characters that need to be defined in a String and various components of a Number:
A number could be an integer, fraction, or exponent.
A number could be an integer, fraction, or exponent.
While I did not use Tai Li Hau’s methods or code to implement my JSON parser, I borrowed many of my ideas from the conceptual framework that Hau outlined. So I need to acknowledge his excellent post.
Breaking Down the Problem into Parts
In order to break this function down into parts, it is important to create a hierarchy of the simplest to most complex data types. It’s best to start by defining the primitive data types (Number, Boolean, Null) before moving on to defining the complex data types (String, Array, Object). This is because the more complex data types will need to call on methods to parse the simpler data types within.
Simplest to Most Complex –>
  1. Boolean, 2. Null, 3. Number, 4. String, 5. Array, 6. Object
Approaching the Boolean or Null function required some kind of text-reading function that triggers a True parser function when the character “T” is read, or a False parser function when the character “F” is read, or a Null parser function when the character “F” is read.
So in order to read the characters, one at a time, the next thing I needed to do was to write some helper functions.
Writing Helper Functions
I decided to put all my helper functions for reading characters from the JSON string into a “Reader” object with the following methods:
  • index – this property keeps track of the index number of the character in the string that is being read.
  • nextChar() – for reading the next character in the string, while incrementing the index by one.
  • peek() – allows me to get the value of the next character in the string, without incrementing the index, or “eating” the next character as Hau called it.
  • eatWS() – “eats” the next character if it’s a white space by incrementing the index by one, and moving on to the next character.
  • nextNonWS() – loops through next characters until a non-white-space character is found
  • position() – a getter method that returns the value of the index
The Reader object returns these properties, which can be called by each of the Data Type functions to move through the string, like this:
reader.nextChar()
reader.peek()
reader.eatWS()
Functions for JS Primitive Types
Booleans & Null
After writing and testing these helper functions for reading the character stream, I was ready to move on to writing functions for the primitive types.
I combined True, False, and Null in one function called:
parseWord(word, reader)
which takes an argument “word” depending on what character is being detected in the reader stream. The function checks the next three letters in the stream to make sure they are as expected; if not, it throws an error.
I then defined three functions parseTrue(), parseFalse(), and parseNull() that call on the parseWord() function, passing in the respective words as arguments.
Numbers
Defining numbers is quite a bit trickier than defining booleans and null. It took a little while to understand what this diagram means, and the sequence in which optional values are being presented for formulating a number:
I was very intimidated by this at first. It took some time to fully understand this diagram.
I was very intimidated by this at first. It took some time to fully understand this diagram.
Here’s a breakdown of what this diagram is saying:
  • A number can be negative and start with a “-” or a digit between 1-9.
  • If the number is a fraction, it can start with a 0 instead, followed necessarily by a “.” (In JSON, you can’t have decimals that start with a period, such as .9340, which reads without a problem in plain Javscript. Every decimal, called a “fraction” in JSON, must start with a 0.xxx) After the period follows more digits (0-9).
  • A number can be an exponent in scientific notation if it starts with a decimal (number, period, number), followed by “E” or “e”, followed by a “-” or “+”, and ending with digits. For example, a 2-decimal scientific notation displays 12345678901 as 1.23E+10, which is 1.23 times 10 to the 10th power.
This was a bit of a tricky function to write, and I ran into a few bugs as I tried to put this together. I enjoyed thinking through this particular function, so I’ll include the code snippet here. (Full code at the end of this article).
function isDigit(c) {
 return c != null && c >= '0' && c <= '9';
}
// ===== parseNumber() function =====
function parseNumber(reader) {
 let parsed = [];
 let c = reader.nextChar();
// read negative number
 if (c === '-') {
  parsed.push('-');
  c = reader.nextChar();
 }
// read whole number
 if (c >= '1' && c <= '9') {
  parsed.push(c);
  while (isDigit(c = reader.peek())) {
   parsed.push(c);
   reader.nextChar();
  }
 } else if (c === '0') {
  parsed.push(c);
  c = reader.peek();
  if (c !== '.') {
   throw makeError(c, reader);
  }
 } else {
  throw makeError(c, reader);
 }
// read fraction
 if (c === '.') {
  parsed.push(c);
  reader.nextChar();
  while (isDigit(c = reader.peek())) {
   parsed.push(c);
   reader.nextChar();
  }
 }
// read exponent
 if (c === 'e' || c === 'E') {
  parsed.push(c);
  reader.nextChar();
  c = reader.peek();
  if (c === '-' || c === '+') {
   c = reader.nextChar();
   parsed.push(c);
  }
  while (isDigit(c = reader.nextChar())) {
   parsed.push(c);
  }
 }
 return parseFloat(parsed.join(''));
}
// ===== end of function =====
Functions for Composite Data Types
After defining these primitive data types, I went on to define composite data types: String, Array, and Object – in that order.
Each of these three data types start with a character " or [ or { which continue onwards until the end of this object is denoted with a " or ] or }.
Strings
The tricky part about writing strings is that it can be “escaped” using the “/” character. The JSON documentation gives 9 possible escape characters, which I implemented using a switch / case statement.
The final case, “/u” is a hex character, which is defined by 4 alphanumeric characters made up of either the numbers 0-9, or the letters “a-f” or “A-F”.
If the reader encounters an escape character, it needs to add “/” plus whatever character follows it into the string. (This requires adding another “/” to parse the character.) Otherwise, the String function can directly add the character into the string, including if the character is a Number.
Pretty straightforward. CodeSmith's tests require you to also parse numbers inside of strings.
Pretty straightforward. CodeSmith's tests require you to also parse numbers inside of strings.
Code snippet for escape character case: unicode / HEX
while ((c = reader.nextChar()) !== '"') {
  if (c === '\\') {
   c = reader.nextChar()
   switch (c) {
...
case 'u':
     let uni = [];
     for (let i = 0; i < 4; i++) {
      c = reader.nextChar();
      if (c != null && ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))) {
       uni.push(c);
      } else {
       throw makeError(c, reader);
      }
     }
Arrays
The main syntactical feature to note about arrays is that there is a comma in between values in an array. After each value, you have to reader.peek() to see if there is a comma or a “]” following that value.
NOTE: It’s important to peek() first and not to nextChar() here! Major headache and debugging can be avoided if you do so.
Using the reader.peek()method to check before reader.nextChar() allows this function to navigate cases of nested arrays.
function parseArray(reader) {
 let c = reader.nextChar();
 if (c !== '[') {
  throw makeError(c, reader);
 }
 let parsed = [];
 reader.eatWS();
 while ((c = reader.peek()) !== ']') {
  parsed.push(value(reader));
  c = reader.peek();
  if (c === ',') {
   reader.nextChar();
  } else if (c !== ']') {
   throw makeError(c, reader);
  }
 }
 reader.nextChar();
 return parsed;
}
Objects
Objects have a slightly more complex syntax than arrays, and you must make sure to have your parseString() method working before you write your parseObject() method.
The first string in an object can be parsed and stored in a variable as the “key”. This is followed by a colon. Then the next string can be parsed and stored in a “value” variable. The returned object will then assign the value its key property. If the value is followed by a comma, another key-value pair will follow, until a “}” closes the the object.
[See full code in Replit.]
Values
One more thing you need to define is the Value function, which detects which sort of data type the Reader is encountering at each step/index, and “eats the white space” before and after each value. I implemented this as one big switch/case statement. This is the primary function I call in my main JSON parser function to move through the string.
My main engine for moving through the JSON string, by eating white space around the values.
My main engine for moving through the JSON string, by eating white space around the values.
Nesting
I found that by writing these functions in a clear, modular way, nesting actually takes care of itself, and there was not really any additional functions I needed to write to pass the CS nesting tests. The only thing to be wary of is the times when I’m calling reader.peek() and forgetting to call reader.nextChar() after to move the index up. This could lead to an error in nesting, such as an issue I struggled with for several hours, due to dropping the last property in an object after a nested object.
However, if written correctly, the functions will call each other and execute each set of instructions in the proper local execution context, without needing to manually keep track of the number of levels of nesting you are doing. For example, for one of the tests:
JSONParser(‘{ “a”: { “b”: 1 }, “c”: 2 }’
parseObject() is being called inside of parseString(), which is being called by parseValue(), after being triggered inside the larger parseObject(), which moves through the object with parseString()
This could lead to weird errors, which requires tons of agonizing tracing of variable values up and down stacks to debug. But if all of the functions are written clearly, nesting takes care of itself.
JSONparser - Replit
FINAL NOTE
In the process of working through these nested functions, I had to get a lot more comfortable with using the Debugger in VS Code, to step through and into some of the wild loops that are being created here. I was running into so many bugs at first. To use the debugger, I would set breakpoints in the middle of the while loops, to check the values of the variables at each level of the call stack.
This was much clearer to understand than using the Quokka extension which prints all the looped variable values in-line, but would often start with a “…” because of how long the loops would run if you don’t set a breakpoint in the debugger. (Essentially impossible to print the starting values of a long loop in a single line with Quokka, and difficult to understand what’s really happening inside the loop).
I would not have been able to untangle this mess without taking the detour (and time) to get more comfortable with the VS Code debugger.
This video is really worth watching:
Debugging JavaScript in Visual Studio Code and Google Chrome
Next steps:
In an upcoming post, I hope to talk to my friend Satty about Abstract Syntax Trees (AST), lexical analysis (lexing), compilers and interpreters, and how to approach this problem a different way, by writing a tokenizer for the JSON parser.
Here’s the course on Front End Masters he recommended that I watch beforehand. It’s kinda blowing my mind right now:
Make Your Own Programming Language with Steve Kinney | FEM
Did you enjoy this issue? Yes No
Puppy Nomadic
Puppy Nomadic @puppynomadic

Let's pull each other up by our Bootstrap.js 🪴 👩🏻‍💻 🦮

=== Journey of software craftsmanship ===

{"Formerly": ["nonprofit community organizer", "teacher", "researcher", "government director"],
"Future": ["software engineer", "web3 dev", "developer advocate", "technical program manager"]
}

In order to unsubscribe, click here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Created with Revue by Twitter.