Current Data Serialization Formats May Be a Waste of Money
- Programming, Business
Storing data. Transmitting data. Processing data. These fundamental topics of computer science are often overlooked nowadays thanks to the historical exponential growth of processing power, storage availability and bandwidth capabilities, along with a myriad of existing solutions to tackle them. So much so that we tend to assume these technologies are properly adapted for today's needs.
Specifically, we're going to look at the cloud computing costs of data serialization, and question whether current data serialization technologies are adapted for them. (Spoiler: They're probably not.)
The money problem
Let's consider a scenario where we would like to offer a service that would send and receive data over the Internet. We would have to deal with the following expenses:
- Implementation and maintenance costs
- Processing power for data serialization and deserialization
- Bandwidth and storage consumption
As such, we would like to minimize the total sum of these costs over the lifetime of the service. In addition, we would also like to minimize these same costs for our consumers to give ourselves a competitive advantage.
Picking optimal data serialization formats is therefore critical to achieving this objective, because it will have an impact on all of these costs.
For implementation and maintenance, we can assume that once a data serialization format becomes popular, plenty of people will have already done the groundwork, so these costs are not considered further here.
CSV, XML, JSON, YAML... those are all great data serialization formats because anyone can read them and modify them using a simple text editor. In terms of compactness, however, they are pretty terrible because they are very verbose by design.
Let's say, for example, that you would like to represent an object with 5 boolean properties. Writing the values as text would require multiple bytes just for each "True" or "False", plus the delimiters between them. Similarly, if the names of the properties must be included in the format, even more bytes are consumed for writing them.
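To make the overhead concrete, here is a minimal sketch comparing a JSON rendering of such an object with a single byte holding the same five values. The property names are hypothetical, and the packed form assumes both sides have agreed on the field order in advance.

```python
import json

# Hypothetical object with 5 boolean properties (names are illustrative only).
flags = {
    "active": True,
    "verified": False,
    "admin": False,
    "premium": True,
    "archived": False,
}

# Compact JSON: no optional whitespace, but names and literals are still text.
as_json = json.dumps(flags, separators=(",", ":")).encode("utf-8")

# Same information in the low 5 bits of one byte, assuming the reader
# already knows the order of the properties.
packed = 0
for position, value in enumerate(flags.values()):
    packed |= int(value) << position
as_byte = bytes([packed])

# Decoding simply reverses the packing.
decoded = {
    name: bool((as_byte[0] >> position) & 1)
    for position, name in enumerate(flags)
}

print(len(as_json), "bytes as JSON vs", len(as_byte), "byte packed")
```

Even with all optional whitespace removed, the text form spends dozens of bytes on names, quotes and literals for what amounts to 5 bits of information.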
As such, not only does it take a bunch of space, but it also requires parsing text to deserialize the data, which is not very efficient. Removing some of the optional padding may help, but doing so has its limits.
One quick fix in terms of bandwidth and storage consumption is to apply data compression over text data. However, general-purpose compression knows nothing about the structure of the data, so the results are generally not optimal. Also, while compression saves bandwidth and storage, it requires additional processing power, although the net result is usually worth it in terms of raw expenses.
As for the existing data compression algorithms themselves, some common issues include:
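As a rough illustration of that trade-off, the sketch below compresses a JSON payload with zlib from the Python standard library and compares it with the size the same information would need if it were packed at one bit per boolean. The payload (1,000 records of 5 hypothetical boolean properties, filled with seeded pseudo-random values) is an assumption made purely for the comparison.

```python
import json
import random
import zlib

# Hypothetical payload: 1,000 records of 5 boolean properties each.
rng = random.Random(42)  # fixed seed so the run is repeatable
names = ["active", "verified", "admin", "premium", "archived"]
records = [
    {name: bool(rng.getrandbits(1)) for name in names}
    for _ in range(1000)
]

text = json.dumps(records, separators=(",", ":")).encode("utf-8")

# General-purpose compression at the highest standard zlib level.
compressed = zlib.compress(text, 9)

# What the raw information needs at one bit per boolean value.
packed_size = (len(records) * len(names) + 7) // 8

print("JSON text:      ", len(text), "bytes")
print("zlib-compressed:", len(compressed), "bytes")
print("bit-packed:     ", packed_size, "bytes")
```

Compression removes most of the textual redundancy, but it still lands above the bit-packed size, and it costs CPU time on both ends of the connection.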
- Byte as the smallest component
- Upper size limit
- Equivalent values written differently
- Limited predefined dictionary
As a need for pure binary data serialization arose from the above issues, Protocol Buffers rose to fill the need. While not the only binary serialization solution, it became popular thanks to its open-source nature, its versatile data encoding, the powerful object definition, and the possibility of extending it using gRPC to define full web services. However, the encoding of Protocol Buffers is a bit strange, which may lead to some unexpected issues. For example:
- Defining data requires transforming the definition into an API using an external tool, then embedding that API in the main code, which may be problematic for compatibility and maintenance. This is especially a problem when having to deal with consumers stuck with legacy systems.
- Data types do not match between definition (scalar value types) and serialization (wire types), probably to simplify the conversion to common variable types in popular programming languages.
- Integers may be serialized longer than necessary, due to a base 128 encoding in which each byte carries only 7 bits of the value plus a continuation bit. This issue also affects the encoding of the data type and field ID.
- Strings are encoded as UTF-8, even when a better encoding may exist. This is especially true if strings do not require the full range of Unicode characters, or even of ASCII characters.
- Repeated values or simple patterns are not compressed. While this may be partially mitigated by implementing data compression over the serialized data, this will likely not be done optimally.
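The integer point can be checked with a small sketch of the base 128 varint scheme. The function name `encode_varint` is my own and this is not the Protocol Buffers library API, only a hand-written illustration of the encoding for non-negative integers.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base 128 varint.

    Each output byte holds 7 bits of the value (least significant group
    first); the high bit is set when more bytes follow.
    """
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)         # last byte: high bit left clear
            return bytes(out)


for number in (1, 127, 128, 300, 2**32 - 1, 2**64 - 1):
    encoded = encode_varint(number)
    print(number, "->", encoded.hex(), f"({len(encoded)} bytes)")
```

Small values benefit from the scheme, but 128 already needs two bytes although it fits in 8 bits, a full 32-bit value needs five bytes instead of four, and a full 64-bit value needs ten instead of eight.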
As such, it's not a surprise that Protocol Buffers became popular, as each potential issue also has related advantages. Still, there is room for potential improvements.
Based on the above, here are ideas that I could identify as potential optimizations for the original objective of minimizing costs:
- Concatenate data at the bit level instead of the byte level
- Use a data compression algorithm that is specifically designed for the serialized data format
- Define a data serialization negotiation algorithm for simpler implementation and maintenance
- Allow dynamic data serialization within the same stream
- Use artificial intelligence to improve optimization of data compression
This is far from an exhaustive list, and I do not know if these ideas could lead to a significantly better solution than those that currently exist, but I believe they are certainly worth consideration for future designs and prototypes.
Disclaimer: I originally wrote this article on 2020-10-12 at the request of Steeve Leblanc as an independent analysis of his data encoding invention, but he asked me to refrain from publishing it at the time due to a pending patent application. As this is no longer an issue, I have released the above article in its exact original wording. Note that since then, he has founded TS-Alpha, a company I have acquired shares in, and later joined as a full-time employee in order to help him realize said future technologies.
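As an illustration of the first idea only, here is a minimal sketch of concatenating fields at the bit level. The `BitWriter` class and the example record layout (5 flags, a 3-bit level and a 12-bit reading) are hypothetical; this is not the design of any existing format, and it assumes the reader knows the width of every field in advance.

```python
class BitWriter:
    """Append fixed-width unsigned fields back to back, most significant bit first."""

    def __init__(self) -> None:
        self._acc = 0    # all bits written so far, as one big integer
        self._nbits = 0  # number of bits written so far

    def write(self, value: int, width: int) -> None:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        self._acc = (self._acc << width) | value
        self._nbits += width

    def to_bytes(self) -> bytes:
        # Pad with zero bits on the right up to the next byte boundary.
        padding = (-self._nbits) % 8
        return (self._acc << padding).to_bytes((self._nbits + padding) // 8, "big")


writer = BitWriter()
for flag in (True, False, False, True, False):  # 5 booleans, 1 bit each
    writer.write(int(flag), 1)
writer.write(5, 3)      # hypothetical level in the range 0..7, 3 bits
writer.write(3000, 12)  # hypothetical reading in the range 0..4095, 12 bits

encoded = writer.to_bytes()
print(encoded.hex(), f"({len(encoded)} bytes)")
```

The seven fields occupy 20 bits, so they fit in 3 bytes once padded. A byte-aligned layout with one byte per flag, one for the level and two for the reading would take 8 bytes for the same record.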