Minifying JSON Text Beyond Whitespace



JSON is a common data serialization format to transmit information over the Internet. However, as I mentioned in a previous article, it's far from optimal. Nevertheless, due to business requirements, producing data in this format may be necessary.

I won't go into the details of how one could structure JSON text differently to contain the same information more efficiently, as there is no general solution. Instead, I want to focus on the following problem: how can JSON text be formatted more compactly in a way that JSON parsers would still interpret the same way? This is a useful question to ponder, since smaller payloads may reduce data consumption costs.

The thing is, it's possible to parse JSON text in arbitrary ways. As such, it's not possible to decide whether different JSON texts are equivalent without first making certain assumptions about how parsers should operate. Here are the assumptions that we are going to work with:

Interestingly, I have not come across a single JSON minification algorithm that implements all of the strategies I discovered, despite trying countless programs that are supposed to minify JSON. As such, I wanted to document them here.

With that said, let's get to it!

Whitespace

This is the obvious one. As whitespace surrounding JSON text and its structural characters is optional, it can be safely removed outside of strings.
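As a minimal sketch of this step in Python, assuming the input is already valid JSON text, the function below scans the text once and tracks whether it is inside a string, so that whitespace there is preserved. The name `strip_whitespace` is only illustrative.

```python
def strip_whitespace(text: str) -> str:
    """Remove insignificant whitespace from valid JSON text."""
    out = []
    in_string = False  # True while scanning between two double quotes
    escaped = False    # True right after a backslash inside a string
    for ch in text:
        if in_string:
            out.append(ch)  # everything inside a string is kept verbatim
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch not in " \t\n\r":  # the four JSON whitespace characters
            out.append(ch)
    return "".join(out)


print(strip_whitespace('{ "a b" : [1, 2,\n "x" ] }'))
```

Working on the text directly, rather than parsing and re-serializing it, leaves numbers and strings untouched, so this step can be reasoned about independently of the others.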

Easy, but we can do more.

Strings

There are multiple ways to encode Unicode code points within a JSON string. Namely, each code point can be represented in at least one of the following ways:

It's important to note here that the only character with a two-character escape representation that can also be written unescaped is the solidus (/), which requires only 1 byte to encode unescaped. Because of this and the above, it's always better to represent a code point unescaped when possible, with the two-character escape representation preferred over the six-character one otherwise.

One very important point, which is not explicit in the JSON grammar, is that unpaired surrogates cannot be safely unescaped. While JSON supports ill-formed code unit sequences in strings, no Unicode encoding form (including UTF-8) allows them, so special care should be taken to handle this case properly.

It's also worth noting that names in JSON objects are strings too, so all of the above applies to them as well.
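The rules above can be sketched in Python as follows, assuming the string has already been decoded into code points (for instance by `json.loads`) and that the output will be encoded as UTF-8. The name `minify_string` and the `_SHORT` table are hypothetical helpers, not part of any library.

```python
import json

# Characters that must be escaped and have a two-character escape.
_SHORT = {'"': '\\"', "\\": "\\\\", "\b": "\\b", "\f": "\\f",
          "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def minify_string(value: str) -> str:
    """Write a decoded string as a JSON string with the shortest escapes."""
    parts = ['"']
    for ch in value:
        cp = ord(ch)
        if ch in _SHORT:
            parts.append(_SHORT[ch])
        elif cp < 0x20 or 0xD800 <= cp <= 0xDFFF:
            # Remaining control characters and unpaired surrogates
            # have to stay in the six-character form.
            parts.append(f"\\u{cp:04x}")
        else:
            parts.append(ch)  # everything else is written unescaped
    parts.append('"')
    return "".join(parts)


decoded = json.loads('"\\u0041\\/\\u000a\\ud800\\u00e9"')
print(minify_string(decoded))
```

Here `\u0041`, `\/` and `\u00e9` come out unescaped, `\u000a` shrinks to `\n`, and the unpaired surrogate `\ud800` is left escaped so that the result can still be encoded as UTF-8.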

Numbers

As each character in a number is always encoded in exactly 1 byte, only the number of characters needed to write the entire number needs to be taken into account.

There are a few ways numbers can be shortened directly by deleting unneeded characters. Specifically:

In addition, rational numbers can be freely converted between their decimal expansion and scientific notation, and there are infinitely many ways to write a number in scientific notation depending on the chosen exponent. Consider the following mathematical transformations, which may reduce the length of a number:

With such transformations, you have a wide range of equal numbers with various character lengths to choose from. Therefore, numbers may potentially be further reduced with a bit of mathematics by modifying the exponent and balancing the rest of the number accordingly for equality.
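To illustrate, here is a small Python check, using the standard `decimal` module, that several spellings of one number are mathematically equal while differing in length. The sample values are my own and only serve as an illustration.

```python
from decimal import Decimal

# Five spellings of the same number, obtained by moving the decimal
# point and adjusting the exponent to compensate.
forms = ["1500000", "15e5", "1.5e6", "0.15e7", "150e4"]

# Decimal compares by exact value, so equal numbers collapse in a set.
values = {Decimal(form) for form in forms}
shortest = min(forms, key=len)

print(len(values), shortest)
```

All five forms denote the same value, yet their lengths range from 4 to 7 characters.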

While there are potentially many optimal representations for the same number with these transformations, I have determined that only the following ones need to be taken into consideration to achieve minimal length when not including the unneeded characters described previously:

The reason is that one of these representations is guaranteed to minimize the absolute value of the exponent without leading or trailing zeros elsewhere, while the first representation covers the case with no exponent and the last one covers the case with the fewest possible characters before the exponent.

Numbers equal to 0 are trivial to minify, since the decimal expansion 0 is always optimal in this case. For every other case, I recommend first converting the number to scientific notation with no fraction part and no trailing zeros, then using the resulting integer component and exponent as function parameters to calculate the relative length of each of these representations, in order to pick one that is optimal among them. Just be warned that this calculation should be performed carefully, since these parameters may be of arbitrary size.
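The sketch below follows that recipe in Python, with one loudly stated difference: instead of comparing only a short list of representations, it brute-forces every position of the decimal point within the digits, plus the zero-padded decimal expansion, and keeps the shortest candidate it finds. The name `minify_number` is hypothetical. Lengths are computed arithmetically before any string is built, so that huge exponents do not allocate huge strings.

```python
import re

_NUMBER = re.compile(r"(-?)(0|[1-9][0-9]*)(?:\.([0-9]+))?(?:[eE]([+-]?[0-9]+))?")


def minify_number(text: str) -> str:
    match = _NUMBER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a JSON number: {text!r}")
    sign, integer, fraction, exponent = match.groups()
    fraction = fraction or ""
    # Normalize to digits * 10**exp: no fraction part, no trailing zeros.
    digits = (integer + fraction).lstrip("0")
    exp = int(exponent or "0") - len(fraction)
    stripped = digits.rstrip("0")
    exp += len(digits) - len(stripped)
    digits = stripped
    if not digits:
        return "0"  # as in the article, any zero becomes 0 (sign dropped)
    n = len(digits)

    def cost(shift: int) -> int:
        """Length when `shift` digits follow the decimal point."""
        e = exp + shift
        return n + (shift > 0) + (1 + len(str(e)) if e else 0)

    shift = min(range(n), key=cost)
    head, tail = digits[:n - shift], digits[n - shift:]
    best = head + ("." + tail if tail else "")
    if exp + shift:
        best += f"e{exp + shift}"
    # Zero-padded decimal expansions, preferred on ties.
    if exp > 0 and n + exp <= len(best):
        best = digits + "0" * exp
    elif -exp >= n and 2 - exp <= len(best):
        best = "0." + "0" * (-exp - n) + digits
    return sign + best


for sample in ["1000000", "1.50", "0.001", "-12.0E+02", "0.000"]:
    print(sample, "->", minify_number(sample))
```

This prints `1e6`, `1.5`, `1e-3`, `-1200` and `0` for the five samples. Ties are resolved in favor of the plain decimal expansion purely as a matter of taste.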

As a final note, if we were to also assume that, for historical reasons, parsers should not be more precise than the IEEE 754 binary64 representation of a number, then a conversion to that format and back could be performed to determine the proper precision to keep. This would limit the length of the integer component and fraction part to a total of 17 decimal digits at most. However, this may not be a safe assumption, and the temporary conversion attempt may overflow or underflow, so I recommend against this particular simplification.
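For illustration only, here is what such a round trip might look like in Python, whose `float` is a binary64 value on common platforms and whose `repr` yields the shortest decimal string that converts back to the same value. The function name `binary64_round_trip` is hypothetical, and returning `None` on overflow or underflow to zero is my own choice for making the hazards visible.

```python
import math
import re
from typing import Optional


def binary64_round_trip(text: str) -> Optional[str]:
    """Return the shortest round-tripping form, or None if out of range."""
    value = float(text)
    mantissa = re.split("[eE]", text)[0]
    is_zero = not any(digit in mantissa for digit in "123456789")
    # Overflow shows up as infinity, underflow as an unexpected zero.
    if math.isinf(value) or (value == 0.0 and not is_zero):
        return None
    return repr(value)


print(binary64_round_trip("0.1000000000000000055511151231257827"))
print(binary64_round_trip("1e400"))
print(binary64_round_trip("1e-400"))
```

The first call yields `0.1`, while the other two yield `None` because the values do not fit in binary64. Note that Python's own output is not minimal either (it writes `1e+16`, for example), so the result would still need the number minification described earlier.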
