MessagePack: like JSON, but fast and small (msgpack.org)
326 points by signa11 on March 10, 2020 | 379 comments


> MessagePack: It's like JSON. but fast and small

and complete.

Both are minimal self-describing data serialization formats, but JSON is incomplete. It's missing a type for one of the fundamental types: byte blobs. It also can't represent parts of the float domain.

Which means there are a lot of inconsistent ways to hack that in, e.g. base64-encoded strings. But then it partially loses its property of being self-describing (same if you allow numbers as strings, e.g. "-12").

Just to be clear, I'm not saying JSON should have raw bytes directly in it. But a "native" base64 string type in addition to string, number, null, list and map would help. E.g. `{ "bytes": b"YWJjZGU=" }`
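
To make the hack concrete, here is a rough sketch in Python (stdlib json and base64; the third-party msgpack package and the field name are just assumptions for illustration):

  import base64, json
  import msgpack  # third-party: pip install msgpack (assumed available)

  payload = b"abcde"

  # JSON: bytes have to be smuggled in as a base64 string; nothing in the
  # document itself says this string is binary data rather than ordinary text.
  as_json = json.dumps({"bytes": base64.b64encode(payload).decode("ascii")})
  # -> '{"bytes": "YWJjZGU="}'

  # MessagePack: the bin type tags the value as raw bytes, so it stays self-describing.
  packed = msgpack.packb({"bytes": payload})
  assert msgpack.unpackb(packed)["bytes"] == payload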

Wrt. floats, it's missing +/- Infinity. NaN is more like an error variant, so it's ok-ish for that to be missing, but then again, why not have it.

Also, for completeness it would be better to differentiate between int and float, as floats are imprecise due to rounding errors.

----

PS: I don't like `null`, but having it in this kind of data serialization format is still required. However, the difference between a field not being there and it being `null` can be a mess, and not all tools handle it well.

PPS: Both JSON and MsgPack can be used for non-self-describing serialization, e.g. by serializing a record (fixed number of fields in a known order) as a list of values instead of a mapping. But both are focused on enabling self-describing serialization.

EDIT: PPPS: Yes, it's a very weak form of self-describing in use here; there are systems which are much more self-describing, e.g. XML plus an XML Schema linked from the XML document, but those also tend to be far more complex.


JSON is also missing a date type. Sure, you can use a string serialization but that's ambiguous.

It's also non-extensible, so the types it's missing cannot be added without assumptions or the overhead of convention.

Transit is a good solution to this.


THIS! Not having a defined Date format has caused so much confusion with consumers of our API since every json API might have a different ISO date format (or some homegrown madness).


In APIs I've integrated in recent years, I've seen:

1. ISO UTC
2. ISO local
3. Epoch
4. 2018-04-20
5. Jan 01 2016 11:58am

Probably others that I can't recall. It's a damn mess.


PHP's strtotime() will probably handle them all, even though it will make you cringe a bit.

I never understood why simple integer Unix timestamps are not more prevalent in APIs. Or some monotonic count from any reasonable epoch, depending on the context. How many APIs ever really have to return dates predating the Unix epoch?


`/Date(1234567890)/`

That will probably cause a twitch in any dev that used ASP.NET up until a few years ago.


I had to deal with that two weeks ago. Cringe.


I’m twitching away...


Wow :(


To be fair, 1, 2 and 4 are all just ISO.


That's true. But 2 is substantially less useful than 1 and I encountered 4 mixed with 3.


Why wouldn't you just use epoch time?


aside from the other issues mentioned here, alternative encodings like ISO8601 and others are human-readable and human-editable


Epoch doesn't carry timezone information.


And it better not. Timezone information has no place in a timestamp.

A timestamp is a point in time, whatever the time is in Paris or Tokyo. It is an abstract value and it is way better this way.

A timezone is a filter through which you display your timestamp, and it tells you what your local time was when that timestamp occurred.

So yeah, always store and exchange time information as timestamps. Timezone is extra styling information, just like CSS.
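
As a concrete sketch of that model in Python (stdlib only; zoneinfo needs Python 3.9+ with tz data available): one stored epoch value, rendered through different timezone "filters".

  from datetime import datetime, timezone
  from zoneinfo import ZoneInfo

  ts = 1583862185  # one point in time, stored/exchanged as a plain integer

  print(datetime.fromtimestamp(ts, tz=timezone.utc))              # 2020-03-10 17:43:05+00:00
  print(datetime.fromtimestamp(ts, tz=ZoneInfo("Europe/Paris")))  # 2020-03-10 18:43:05+01:00
  print(datetime.fromtimestamp(ts, tz=ZoneInfo("Asia/Tokyo")))    # 2020-03-11 02:43:05+09:00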


This is totally wrong, and a very dangerous way of handling date/time issues. If you set a wake-up alarm for 7am, you want that to go off at 7am in your local timezone, not 7am in whatever timezone you set the alarm in (which is what an ”absolute” time stamp would store).

There are lots of examples like this. Some time events you want to happen at absolute time points (in which case time zones are only for display purposes), but very often your events need to be ”time-zone aware”. There is no universal rule, and thinking that there is will lead to all sorts of trouble.


It's not wrong. What you're describing is not a timestamp.

A timestamp is a point in time, and is the same everywhere (excluding relativistic effects).

What you describe is definitely a valid use case (and having developed software that programs hardware that is controlled by a calendar, I have painful experience in this). However, it's not a timestamp. I'd call it time-of-day or wallclock. Some systems like SQL refer to it as simply "date" and "time". Whatever you prefer to call it, it's not a timestamp.


Could you add something saying explicitly what a timestamp _is_ under your terminology? Is it an integer offset from the start of the Unix epoch?


It's an abstract concept. It's nothing more than a given point in time.

How you represent it doesn't change the definition. It could be stored as the number of seconds since some arbitrary point in time, such as Unix timestamp, or a Julian timestamp.

Or, it could be a given time and date combination with some fixed point of reference, such as UTC. I'm sure we all have favourite ways of representing timestamps.


Ok, one thing — perhaps not the most important — is that “timestamp” is a very unfortunate terminology for the abstract concept you describe. The word timestamp very much suggests that it’s referring to an explicit representation of some sort. “timepoint” would be much better for what you’re talking about. Not a criticism of you of course! I mostly write python at work so I rarely am directly exposed to the integer offset values, but I know many people use the word “timestamp” to refer to that integer, as opposed to a formatted string. Which also seems like unfortunate terminology, since what is a stamp if not formatted?


You are assuming that timezones don't change ever in relation to UTC. This is wrong. Timezones change all the time.

When I set an alarm for any time in CEST in 2022 and convert this to UTC before saving it, it will very likely ring at the wrong time, simply because CEST will probably not exist by then and be replaced by CET due to the EU getting rid of DST.


But an alarm is not supposed to use timestamps. An alarm is set for a certain time in a certain location. In other words, a wallclock time. Not a timestamp.


No, timezone information can be critical and assuming you can always strip it is just wrong.

You might not have encountered such scenarios but they are very much out there.


> No, timezone information can be critical and assuming you can always strip it is just wrong.

Expressing time based on the standard reference timezones (i.e., UTC) is not stripping the time zone away.


It is. Timezones change all the time in relation to UTC. If you set a date in the future and think converting it to UTC doesn't lose information, you will be surprised when you get bitten by it.

Example: The EU will most likely get rid of DST around 2022. Any time you set beforehand in CEST or CET will be an hour off, depending on which one the EU gets rid of. Maybe the EU doesn't get rid of DST and keeps it, you don't know. So unless you have a time machine, you cannot convert the time to UTC.


> A timestamp is a point in time, whatever the time is in Paris or Tokyo. It is an abstract value and it is way better this way.

> Timezone is extra styling information, just like CSS.

This is an incredibly reductive view of time data and the possible applications that use it.

Not all time data is timestamps, and UTC timestamps strip out information that is not "styling".


> Epoch doesn't carry timezone information.

Dates should not be encoded with time zone info anyway, as timezones are context-dependent. Dates should be encoded in UTC, and clients should then interpret them according to their context.


What do you do when timezones change? This happens all the time around the world.


Epoch time is INHERENTLY anchored to a reference point: it's seconds since January 1, 1970 (midnight UTC/GMT). It'd be useless if it was just "you know, midnight _somewhere_ on Jan 1 1970... pfft, it was a long time ago, who cares".


When you go one hour back for daylight savings, you can't tell with just epoch if it's the first or second time you're at that time. With timezones you can, since it switches between PST and PDT.


This is fundamentally incorrect. Because UTC doesn't change, the two "identical" times are different UTC values.


What happened when you set an alarm for 2022 in CEST? The EU will very likely get rid of DST by then.

Your alarm will be an hour off when you convert to UTC beforehand.


I don't know that much about epoch, but isn't the point of it to be completely independent of stuff like timezones or daylight savings? Doesn't it track every second since 1970 no matter if meanwhile time jumped back or forward in some countries?


UTC has leap seconds, but otherwise this is correct.


1:30PST and 1:30PDT bijectively map to distinct UTC timestamps.
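
A small stdlib-Python sketch of that, using the 2019-11-03 fall-back in America/Los_Angeles as the ambiguous local time (zoneinfo needs Python 3.9+ and tz data):

  from datetime import datetime
  from zoneinfo import ZoneInfo

  la = ZoneInfo("America/Los_Angeles")
  # 01:30 occurs twice that night; `fold` selects the first (PDT) or second (PST) occurrence.
  first = datetime(2019, 11, 3, 1, 30, tzinfo=la, fold=0)   # PDT, UTC-7
  second = datetime(2019, 11, 3, 1, 30, tzinfo=la, fold=1)  # PST, UTC-8

  print(second.timestamp() - first.timestamp())  # 3600.0 -- two distinct epoch values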


1. It is lower fidelity (seconds vs milliseconds). 2. It's also not a type. How do you distinguish epoch from number?


> It's also not a type. How do you distinguish epoch from number?

I dunno, how do you distinguish "April" (the month) vs "April" (someone's first name) or 18 (kg) vs 18 ($) vs 18 (-th of the month) vs 18 (page number)?

> It is lower fidelity (seconds vs milliseconds).

This isn't a problem if you have a conforming implementation (2^52 milliseconds is over 100'000 years), but as nitrogen pointed out, you do apparently have to worry about that.


What the original poster means is that if I go JSON.parse(...) then how does that know to give me an object with a Date type instead of a Number? Answer: It can't.

This is a frequent gotcha with Typescript, where even if your type is declared with a Date field, Javascript won't care when it deserializes it, as that type information is all erased and not available at runtime.

(Also, Number in JS being floating point and all, it lacks the integer precision for high resolution timestamps - if you start serialising 64 bit timestamps and expect a JS-style runtime to do good things with them it doesn't end well.)
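
The same gap exists with any schemaless JSON parser. A stdlib-Python sketch of re-attaching the type after parsing (the "time" field name and its seconds-since-epoch convention are assumptions for illustration):

  import json
  from datetime import datetime, timezone

  raw = '{"time": 1583862185, "name": "tom"}'

  plain = json.loads(raw)
  # type(plain["time"]) is int -- the parser cannot know this is a timestamp

  def attach_dates(obj):
      # convention applied by hand, outside of the format itself
      if "time" in obj:
          obj["time"] = datetime.fromtimestamp(obj["time"], tz=timezone.utc)
      return obj

  typed = json.loads(raw, object_hook=attach_dates)
  # type(typed["time"]) is datetime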


> how does that know to give me an object with a Date type instead of a Number

Ah, thanks, that makes sense; I'd forgotten that Javascript had a built-in Date type. (Strictly speaking, then, you ought to be able to write something like `{"time":new Date(...)}`, but that obviously doesn't work in practice.)


Since Typescript is mentioned, might as well mention the awesome io-ts[1] library. It gives you both runtime validation and static type safety with very simple syntax.

[1]: https://github.com/gcanti/io-ts


> I dunno, how do you distinguish "April" (the month) vs "April" (someone's first name) or 18 (kg) vs 18 ($) vs 18 (-th of the month) vs 18 (page number)?

Types! Extensible types. Like I said above, Transit is a good solution to this. With a fixed set of types, arbitrary data will always have ambiguity.


1. You could use a float. 2. Probably in the documentation of whatever it is that's using it


Floats are ugly because the distance between any two successive numbers is not the same. This makes them inappropriate for discretized time and anything accounting-related (money).


Epoch floats and doubles are not good for timestamps, as the further we get from 1970 the less precise they become.

Current precision with a 32-bit float (JSON/JavaScript usually use 64-bit, though outside JS it's common to use BigDecimal, in slight non-compliance) is much worse than one second.


> in the documentation of whatever it is that's using it

That's the point, timestamps should be self-describing. If "look up the structure in documentation" suffices we might as well just use protobufs.


1. YUK 2. ISO 8601 is a better format for this.


That's a timestamp, not a date.

Maybe OP meant timestamp, but that's not what they wrote.


JSON is also missing an integer type.

Read the spec. Basically all it says is that the numeric type is a float, but doesn't say anything about its precision.

I'm amazed that there haven't been any security vulnerabilities found yet that takes advantage of different number formats in different JSON implementations.

You'd think that a data serialisation format should get at least numbers right. All the ones that came before (and after) did, yet we're still stuck with JSON.


Is that like saying it doesn't have a username type? How could you ever know this bit of data is a username if it doesn't have a username type?


I think there is a rather large difference between not having a username type vs. not having an integer type.

As it turns out, the only way you can reliably store an integer in JSON is to use a string field. This is incidentally also the correct way to store a username in JSON.


And comments. Not being able to put comments in JSON files is beyond stupid.


JSON is a binary computer-to-computer interchange format. No computer-to-computer interchange format has comments.

The problem with JSON is that it's just human-readable enough that people think it's a config file format. It's not.


The problem with JSON is that it's just human-readable enough that people think it's a config file format. It's not.

That attitude is misguided. Many people do in fact use JSON as a config file format, so it is a config file format. Real world usage outweighs prescriptions about what is and isn’t correct.

Looking at the first standardized JSON spec, it’s not particularly prescriptive about usage, simply saying:

JSON is syntax of braces, brackets, colons, and commas that is useful in many contexts, profiles, and applications.

Of course, it would be a better config file format if it supported comments, but people use it even despite that. I think we should try to understand the reasons for that rather than just telling people they’re doing it wrong.


The reason for that is that there is often a JSON parser library available for any language (especially JavaScript) but no ini or YAML config parser. So people use JSON because it's already easily accessible.

JSON is one of the worst config formats, even worse than the Apache or nginx config formats. Not being able to simply comment out a config line temporarily for testing purposes, or to leave an explanatory comment for a special setting, is simply bad.


Having worked with a bit of YAML, RAML (which is a superset of YAML), and JSON-style configs/schemas... I'd take JSON over YAML if only because it seems to be more expressive.

Still, everyone forgets HOCON [1] is a thing, which solves many of the problems expressed here about JSON -for configuration-. It's easy to clearly specify things like time, reference other parts of the config, or if you want to change just one value, you can add that to the 'end' of the HOCON file and be GTG.

[1] - https://github.com/lightbend/config/blob/master/HOCON.md


HOCON is meant to be a friendlier JSON, looks like? At first glance, it has some nice ideas but takes it way too far -- just the table of contents on that repo looks longer than a summary of JSON in its entirety.

Skimming through it, concatenation of unquoted values is where it goes off the rails for me. This is quite the gotcha: https://github.com/lightbend/config/blob/master/HOCON.md#not...

My wishlist for a friendlier JSON would be, in priority order:

- Allow trailing commas

- Comments

- Allow newlines instead of commas

- Allow unquoted dictionary keys

And I think that’s it. I’m not even sure the last one is worth the extra complexity.


I’ll have to find something wrong with HOCON and add it to my list :)

https://twitter.com/styfle/status/1237182409239658500


Though most tools that use "JSON" as the config format actually use some superset of JSON, like JSON5, and do support comments. eslintrc, babelrc, vscode config... pretty much every JS tool configuration apart from package.json.

I don't know why people would prefer that over YAML or others but at least you can actually add comments in them and the reason for their popularity doesn't seem to be baked in parser support because these tools are adding their additional "JSON" parsing anyway.


I have a hard time remembering YAML syntax (am I constructing a list or a dict now? Oh well, better look it up) and I know for a fact that it's not just me. JSON is much simpler.


I even prefer JSON over TOML because JSON is simple. TOML has arbitrary rules like how `table = { foo = bar }` can't be multiline, and all that time you spend debugging your attempts at its nested table syntax, you could've just intuitively nested some dictionaries in JSON.


> That attitude is misguided. Many people do in fact use JSON as a config file format

...and those people need to face the consequences of their poor and misguided technical decisions.

In this day and age we have no excuse to repeat the "but everyone is using XML for that" mistake. Just pick the right tool for the job and stop complaining that the tool needs to change to compensate for your poor judgement.


It's not poor and misguided. For example, VSCode has first-class support for JSON Schema, giving you auto-complete and things like drop-down boxes for arbitrary project JSON config. It's better done and more mainstream than any other solution I've seen, compared to people saying "well technically you could build that for <pet format>."

If you want comments, pipe it through json5 first or something. VSCode again supports comments as a courtesy. More tools that use JSON are starting to as well.

It's just not a big deal.


> It's not poor and misguided. For example, VSCode has first-class support for JSON Schema, giving you auto-complete and things like drop-down boxes for arbitrary project JSON config.

That's just tooling compensating for the shortcomings of a format being shoehorned into a use-case that falls outside of its scope.

It's the XML nonsense all over again.

> It's better done and more mainstream than any other solution I've seen compared to people saying "well technically you could build that for <pet format>."

That's the same short-sighted line of argument that was used to force the mistake of using XML everywhere.

There are right tools for the right job. JSON is the right tool for a lot of jobs, but config files are not one of them.


What are the alternatives?

Even though it’s not ideal, I actively prefer JSON over every other random format I’ve come across. This is speaking as somebody who mostly has to tweak existing configs, rather than extending them or writing new ones.

YAML (along with I guess TOML etc) looks nice at first glance, but it has too many weird syntax shortcuts that make it hard to figure out what’s actually going on. And googling for syntax like “[ ]” is hard!

With JSON, the data model is super simple, and for a given set of data there’s only one way to write it down. Sometimes the data model is too simple, sure. But even for fiddly cases like dates and times, there’s often an obvious solution (in this case, ISO 8601 strings).


JSON5 is a good alternative, mentioned a few times in this thread. JSON, plus comments, trailing commas, and unquoted object keys. Solves all my frustration with using JSON as a config file language.


Wow, I hadn’t heard of this before, even after reading other replies to this post! Looks pretty nice, hits just about every point on my personal wishlist. I’m going to use this.


Either it is a terrible format because it is such a massively wasteful (in terms of payload size and cpu overhead to serialize and deserialize) computer to computer interchange format, or it is terrible because it doesn't have comments.


Yeah, it's not a very good computer-to-computer format. No bytes type. No integer type. No Any->Any maps. Being bad at something doesn't automatically make you good at something else, and JSON is no exception. It's bad at everything except being popular.

You have to look at JSON in a historical context to understand it. It exists so that someone could get some data into their Javascript program with "eval". It was then standardized so that you didn't need a full Javascript engine to understand it, because Python and Perl and Ruby didn't have one, and it turned out that evalling random data from the Internet was a security disaster. Does that make it a good human-readable config file? Nope. Does that make it a good interface definition language? Nope.

It exists because it got popular early, and now we're stuck with it. Now it's too late to apply band-aids to make it a human-readable config language or a good computer-to-computer interchange format, because you will always have old parsers around and people will naturally want to target those. Trust me, 99% of developers will moan when you tell them that they have to use GRPC+Protos to access your API. They will be equally mad about your JSON extension that has comments and integers in it, because the dialect of Brainfuck they use for all their projects doesn't have a library that supports those extensions and their editor can't syntax-highlight it or autoformat it.

I would like to give you a solution to this problem, but there isn't one. JSON "won", but it's bad at everything except being well-understood. That seems to be all that people really care about.


Any format that can be edited with a text editor and retain proper formatting should be able to have embedded comments. No reason not to, and it can come in handy sometimes.


The argument, originally, I believe was that comments might be hacked to “extend” JSON with various additions (e.g. dates via an end of line marker). Comments were excluded to keep the format stable, which, FWIW, has worked.


That's a bad argument, since you can still do that with "magic" fields or metadata fields.


JSON isn't binary, it's text-based and human-readable by design (similar to YAML in that sense). JSONB is a binary variant, though, potentially more suitable for machine-to-machine communication and stores more efficiently in databases as a BLOB.


No computer-to-computer interchange format has comments.

I call BS.

As a counterexample search for the word "comment" in https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html. Or are you saying that http is not a computer-to-computer interchange format?


You call BS because the HTTP/1.1 standard calls pieces of text that contain the name of the software at the other end of the connection "comments"? (To those that didn't read the spec, it uses the word comment to refer to the values in the User-Agent, Server, and Via headers.)

It turns out these "comments" were a disaster and user agents are moving away from even providing them. You put "firefox" in the user-agent header, and web servers would read the comment and send you "your browser doesn't work" instead of the actual document. That did not make for a very compatible web... turns out comments in machine-to-machine communication are a bad idea.


I think the "calling BS" objection was to your categorical statement: "No computer-to-computer interchange format has comments," and not about whether they are a good idea or not (that is a separate argument).

How about ".comment" sections in ELF binaries, do they count?

https://wiki.osdev.org/ELF


Yes, comments tend to be turned over time into hints, directives, etc, etc, etc.

But that doesn't change the fact that people put comments into machine protocols.

For another example, lots of machine to machine protocols specify XML. Such as SOAP. They therefore all support comments.

In accordance with the general trend, eventually they get abused into having a semantic meaning. For example, https://support.ptc.com/help/windchill/wc111_hc/whc_en/index... shows how one system uses comments in SOAP to create documentation and a WSDL that lets interfaces to your code be automatically generated.

And so comments eventually become executable. But that doesn't change the fact that comments can exist in machine to machine formats.


BTW, XML actually has what can be called "executable comments": processing instructions. They are very basic, but unlike plain comments they have two parts: target name and opaque instruction for the target. If one needs to use comments for some sinister purpose, at least processing instructions may be a better fit.


XML is frequently used as a computer-to-computer interchange format, and it has comments.


JSON? A binary computer-to-computer interchange format?

Do we use the same JSON, you and me?

> Json is an open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects

- https://en.m.wikipedia.org/wiki/JSON

JSON is text-based, and meant for storage and transmission in a human-readable format.


What are the material differences between a config file format, and a machine-to-machine only format?


Well, for one: most config file formats specifically include a method for the user to add bytes to the data which won't be parsed or interpreted. In JSON... you might try adding a {"comment": "this is a bad way to comment"} object or key/value and hope you don't collide with an object's dedicated comment field, and also don't cause the parser to raise an error for having a field which it didn't expect.


Config files are typically a "load once, read many" pattern. Machine-to-machine communication protocols can be VERY high rate and over the network, so they need to be highly efficient in both memory footprint and serialization overhead.


So... JSON isn’t great as a machine to machine protocol, then? So I guess it must be a config format after all!


The key difference is the target audience. The format's "UX", so to speak, is optimized for the target audience (machine or human). @jrockway makes an excellent point.


Why must those be mutually exclusive? There's value in being understandable by both.


Having comments in a "wire format" provides a means for the spec to become meaningless as random applications start using comments to convey information (or exfiltrate data) that other applications cannot parse correctly.


There should be a special place in hell reserved for people who use comments to convey machine readable information. The whole point of a comment is to provide something that has no semantic meaning.


I would think being able to have comments is a big one, so you at least have an idea of what changes what in a config file.


One must not forget JSON (JavaScript Object Notation) had to be Javascript compatible. It was almost. That is one of the main features. It could be eval'ed back into JS object after all.

> Although Douglas Crockford originally asserted that JSON is a strict subset of JavaScript, his specification actually allows valid JSON documents that are not valid JavaScript; JSON allows the Unicode line terminators U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR to appear unescaped in quoted strings.

> JSON is a strict subset of ECMAScript as of the language's 2019 revision.

https://en.m.wikipedia.org/wiki/JSON#Data_portability_issues


I have hated the lack of comments so many times.

But Crockford said it was done to keep the parsing simple (and thus secure), and between seeing how XML has it, and seeing how poorly implemented many JSON parsers are despite this simplicity, I have begrudgingly decided to accept it. Still wish there was a way without those problems though.


It had NOTHING to do with the simplicity of the parser. He's said in his lectures that the real reason was that he wanted to prevent someone hiding custom pre-processor data in the comments and causing incompatibility.

I think this was stupid. First, it happens anyway to some extent with people just using custom byte encodings for "strings". Second, using comments for metadata could be useful -- even if optional. For example, adding types after strings to specify things like the date format.


JSON also misses dictionaries/"objects" with non-string keys. Msgpack discourages you from using them because languages like JS don't support them, but it allows them.


JavaScript has Maps which allow anything as a key, but they were added after JSON was thought up, and they don't have a literal format.


I've become particularly fond of Ion[1]. The main benefit JSON has over Ion is its ubiquity, even in many standard libraries.

[1] https://amzn.github.io/ion-docs/


This looks quite nice. But then S-expressions and the soon-to-arrive templates add quite a bit of unnecessary complexity IMHO.


I doubt it's any more complicated to add S-Expressions than to change list parsing to take either [] or () and set a type flag based on which one it finds. Semantically, it's not more complex either, just giving one extra piece of user-defined metadata to the actual data that's still basically a list.


> Also for completeness it would be better to differentiate between int and float as float is imprecise due to rounding errors.

Not true for most numbers. A 64-bit IEEE float can handle any 53-bit integer with no error at all.


It can represent less than 0.1% of all 64-bit ints. That doesn’t seem like most to me.


I didn't claim 64-bit ints, I claimed 53-bit ints. How often do you use ints in the range from 54 to 64 bits? I'm willing to bet it's close to never. That's what I meant by "most". You have to consider the distribution.


randomly generated IDs as 64 bit ints, pretty often. I mean, tweet IDs are up there too.


Almost all numbers aren't 53-bit integers (or smaller) :P


Since everybody feels like getting all pedantic on me, let me clarify what I was trying to say. Most of the integers you encounter in real life will be 53 bits or less, and can be held in an IEEE-754 64-bit float without introducing any error. It is not worth introducing a separate integer type for the exceptions, particularly when the language used as the basis of the format (Javascript) does not have an integer type.
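
A quick sanity check of that claim in Python (whose floats are 64-bit IEEE 754 doubles):

  # Integers up to 2**53 round-trip through a double exactly...
  assert int(float(2**53 - 1)) == 2**53 - 1

  # ...but one bit further, distinct integers collapse onto the same double:
  assert float(2**53) == float(2**53 + 1)  # both are 9007199254740992.0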


JSON doesn't use floating point numbers, it uses signed decimal numbers. It's perfectly valid to store a number in JSON that isn't representable in an n-bit IEEE 754 floating point number.


Why are we passing around binary data re-encoded to only use 6 bits per transmitted byte? In some cases it might be practical to shoehorn binary data where a text string has historically gone (e.g., data URLs, emails) but making a new format that can’t handle binary data without costly armoring seems like we’ve basically given up on using the right tool for the job.


> seems like we’ve basically given up on using the right tool for the job

That's webtech in a nutshell isn't it?


Cognitive load. A readable data format is easy to understand; binary is not readable. The thing is, standards for formats get broken all the time, and a broken CSV file is usually easier to fix than a non-standard binary format.


The same argument could have been made when the ppm image format was replaced with binary formats like JPEG.

Nobody says "what a shame I can't open that jpeg image in a text editor".

Everyone understands that an image is an image, and there are tools for dealing with images, like photoshop.

The same could be said for data interchange formats like json.


> Nobody says "what a shame I can't open that jpeg image in a text editor".

Nobody except me, of course. I'd be thrilled if I could open up a JPEG in Emacs and e.g. view/edit the EXIF metadata in a text buffer (and/or pop up a graphics buffer for editing the image itself). Similarly, I'd be thrilled if Emacs was able to parse MessagePack data and let me view/edit it in a buffer. Both of these things are theoretically possible, but to my knowledge nobody's actually done them yet.

Granted, calling Emacs "just" a text editor is pretty stretchy, but still.


This was specifically about having a file format encoding binary data, and in that vein there is a place for PPM; it's probably better for many small bitmaps than PNG. My old PPMs from my email archive look horrible as JPEGs https://xkcd.com/1683/ .

JSON vs. Bencode, JSON can get you the data good enough a lot faster, while a bencode parser is easier to write. My feeling is that most formats with schemas usually fail somewhere in the details anyways. Honestly I think I want too much from them; e.g. the YANG Data model language is a kitchen sink but lacks a lot.


> Why are we passing around binary data re-encoded to only use 6 bits per transmitted byte?

We're not. We're gzipping the json when we transmit it.


It’s “incomplete” because JS doesn’t natively have the concept of bytes or the other things you mention. JSON is JavaScript Object Notation, so there’s no reason it would support features beyond what JavaScript itself can. Of course JSON is now commonly used by other languages and platforms, but its origins in JavaScript are the source of its limitations.


Also, I never understood why JSON allows duplicate keys at the same level. Behavior differs across implementations: some throw an error on parsing duplicate keys, while others simply overwrite the first value encountered with the second.
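
For example, Python's stdlib parser silently keeps the last value, with no warning (other implementations may keep the first or raise an error):

  import json
  json.loads('{"a": 1, "a": 2}')  # {'a': 2} -- the first value is silently dropped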


See my comment about ASN.1 which is anything but incomplete...


ASN.1 doesn't have a schemaless encoding.


That's on the plus side of its attributes.

Schemaless has become a horror-show.


ASN.1 already exists!!!


I was looking at MessagePack for communicating to and from my STM32F1-based microcontroller project from the PC controller software I'd be writing. At least the official C library was not optimized for memory usage and code size. I also considered BSON, but it also lacked suitable libraries.

So I ended up using JSON. Yes the message sizes are larger in byte size with JSON but using the jsmn[1] parser I could avoid dynamic memory usage and code size was small. The jsmn parser outputs an array of tokens that point to the buffer holding the message (ie start and end of key name etc), so overhead is quite limited.

For JSON output I modified json-maker[2]. It already allowed for static memory usage and rather small code size, but I changed it to support a write-callback so I could send output directly over the data link, so I didn't have to buffer the whole message. This is nice when sending larger arrays of data for example.

Combined it took about 10kB of program (flash) memory, of which float-to-string support is about 50%. Memory usage is determined by how large the incoming messages I need to handle are; for now 1kB is plenty.

A nice advantage of using JSON is that it's very easy to debug over UART.

Though having compact messages would be nice for wireless stuff and similar, so does anyone know of a MessagePack C/C++ library that is microcontroller friendly?

[1]: https://github.com/zserge/jsmn

[2]: https://github.com/rafagafe/json-maker


My MessagePack implementation is designed for embedded:

https://github.com/ludocode/mpack

It can be built to a very small code size, especially when you disable libc, allocations, etc. There are some people using it on embedded devices like Arduino. There's someone working on a port to 8-bit microcontrollers that don't have a 64-bit float, so you may want to look into that as well; see the open issue for it on the GitHub link above.


Protobufs/nanopb would be my go-to for minimal message size.

If you want small code size, CBOR seems like a good bet:

> The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation. [1]

This [2] C-implementation fits in under 1KiB of ARM code.

[1]: https://cbor.io/

[2]: https://github.com/cabo/cn-cbor



CBOR is also used in WebAuthn; usage in a web spec means to me that someone smart considered it a sane choice -- and more importantly that the format is here to stay.


It's great that CBOR is being accepted in a wider area, but I am personally curious why WebAuthn chose CBOR instead of JSON. WebAuthn is a web browser feature, so why would the W3C introduce a new data exchange format in their specs? Maybe WebAuthn needed a binary data type?


I'm guessing a binary format is nice when interacting with a device..

Anybody know if (and why) U2F uses CBOR?


CBOR’s RFC: https://tools.ietf.org/html/rfc7049

And Amazon has picked it up as a first class citizen in some of their IoT Core features. It’s definitely here to stay.


CBOR was originally part of the MsgPack project, by the way, before its designer forked it and renamed it after himself.


Ah yes I looked at CBOR too, but I dismissed it for reasons I can't recall right now. Will have to take another look.


That's strange, because CBOR is almost literally msgpack that got an RFC and has extensions. I can't remember what MsgPack does for online streaming and indefinite lengths.

They’re extremely similar.


Looked at it again, seems memory management is a bit of an issue, it supports memory allocation callback but not just handing it a buffer to work with (though I guess allocation should be predictable).

Also I don't know how they got "code sizes appreciably under 1 KiB". On my STM32F1 release mode with -Os it adds about 12kB.

But yeah, maybe I should reevaluate CBOR.


For reference, I'm using TinyCBOR because it's included with Amazon FreeRTOS.

You’re on your own for malloc, which for me is great because FreeRTOS Heap4 management is quite good. So I malloc an object I’m decoding into and parse away.

There are two options for parsing arrays and strings/bytestrings, and I chose the option where I specify the pointer to use, vs. them using normal malloc and then free() later.

I really like this setup. I made a deinit(bad_message) that works anywhere it failed (parse, validate, eval, etc), goes through and looks for pointers that I previously would have malloc’ed.

There is another popular library but I forget what it’s called.


Yes. CBOR is designed especially with IoT in mind.


> MessagePack ... official C library was not optimized for memory usage and code size

But libmpack is: https://github.com/libmpack/libmpack

- libmpack serialization/deserialization API is callback-based, making it simple to serialize/deserialize directly from/to application-specific objects

- libmpack does no allocation at all, and provides some helpers to simplify dynamic allocation by the user, if required.

- C89


ArduinoJson https://arduinojson.org supports MessagePack. I haven't looked at its static or runtime memory requirements.


Not messagepack, but if protobuf is ok, then nanopb has given me good results on uC projects.


I'll second nanopb as pretty good. Used it on an STM32F4.


It's good. But it's a one-man show and you are very much in "this is how it's done" territory.

I threw away NanoPB in favor of TinyCBOR and haven’t looked back.


Ah, that looks pretty spiffy, thanks!


No idea if it suits your needs, but here's my pet project for microcontroller friendly communication protocols: https://github.com/jean-roland/LCSF_C_Stack


This library is very small and you need to implement your own I/O via well-defined functions. The parser itself does not use any library (including libc):

https://github.com/camgunz/cmp


I took the first example from http://www.json.org/example.html and msgpack makes it 304 bytes. A simple gzip on the JSON is 289 bytes. The larger examples are even more in gzip's "favour".

I am not 100% sure why I would choose to use this - maybe for super tiny documents?


It's also likely faster to parse; string and array sizes are statically known, and all parsing happens in order, so there's no backtracking. Want to parse a string in JSON? okay, this is a starting quote, so let's iterate down the rest of the document, looking for unescaped end quotes, and then copy that into a buffer. In msgpack, it's a single memcpy. Additionally, with number heavy messages, then it shines.

{"red": 127,"green": 28,"blue": 27,"x": -18,"y": 27,"z": 2008,"time": 1583862185,"dollars": 1298,"cents":12,"name":"tom",timeseries":[1,2,3,4,55,66,77,218,239,340,9009,80008,400004]}

gives 106 bytes with msgpack and 157 with gzip. Also, if you gzip the msgpack, for your message example, it's 248 bytes.
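
For anyone who wants to reproduce this kind of comparison, a rough Python sketch (assumes the third-party msgpack package; exact byte counts depend on the gzip level and library versions):

  import gzip, json
  import msgpack  # third-party, assumed installed

  doc = {"red": 127, "green": 28, "blue": 27, "x": -18, "y": 27, "z": 2008,
         "time": 1583862185, "dollars": 1298, "cents": 12, "name": "tom",
         "timeseries": [1, 2, 3, 4, 55, 66, 77, 218, 239, 340, 9009, 80008, 400004]}

  as_json = json.dumps(doc, separators=(",", ":")).encode()
  as_msgpack = msgpack.packb(doc)

  for label, blob in [("json", as_json), ("json+gzip", gzip.compress(as_json)),
                      ("msgpack", as_msgpack), ("msgpack+gzip", gzip.compress(as_msgpack))]:
      print(label, len(blob))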


And you trust a message from an unknown source? You can't simply memcpy according to some length indicator, that's just not safe. You still have to parse and validate.


Obviously you need to check that the length doesn't go past the end of the message, but that's a trivial O(1) check. You don't have to scan the bytes of the string first to decide if they are safe to memcpy.


You might want to validate those byte sequences are valid character encodings.


You should be doing that with JSON as well, so this isn't a pro/con of either format.


That was obviously my point.

i.e. just because MessagePack is a binary format doesn't mean you can skip the same string checks that JSON requires; which means parsing MessagePack strings is unlikely to be any faster than JSON strings (contrary to the suggestions others have implied with the "just memcpy" comments). It's just that with JSON that validation is done as part of the parser (remember JSON technically only supports a subset of ASCII, and any extended characters or Unicode are encoded via escape codes), whereas with MessagePack you'd need to do that validation as an additional step.

Integers, on the other hand, might differ, since JSON would need additional validation (again, baked into the parser) which MessagePack would not, because MessagePack encodes integers as binary integers whereas JSON encodes them as ASCII digits that need converting back to binary integers.

(hint: read the message I'm replying to).


Many (most?) applications do not actually care whether a byte blob of text is structurally valid UTF-8. They are either passing it around as an opaque byte blob, or already applying much stricter application-specific validation. Validating UTF-8 automatically at the serialization layer is a huge waste of cycles, especially in a big distributed system.


On closed systems where you control both the input and output, then sure (Though I’d still recommend against that particular short cut because it’s an easy way for bugs to go undetected).

However if you’re accepting MessagePack encoded data from insecure systems (such as end users) then you absolutely should be validating your input somewhere along the pipeline and it’s usually better to do that early on.

Also, it's not generally the distributed systems you worry about when it comes to this specific degree of micro-optimisation (which is basically what this is). It's the monolithic ones. Distributed architecture is meant to solve various problems (for example, but not limited to, high availability, reduced geographical latency, or a single site running on cheaper commodity hardware), but often at the cost of CPU cycles. Whereas your monolithic infrastructures where you have fewer servers (such as Stack Overflow's setup) would be much more dependent on reducing computational overhead wherever corners can be cut. However, they'd also be significantly less likely to need networked RPCs via MessagePack anyway (simply due to the monolithic design of their architecture).


As long as you know the length of the entire buffer, you just ensure that:

  current_addr + message_len - start_addr < buffer_len
Or am I missing something?
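
In sketch form (Python, with a hypothetical 4-byte length-prefixed framing rather than MessagePack's actual wire format), the check being discussed looks something like this:

  import struct

  def read_field(buf: bytes, offset: int):
      # Read one length-prefixed field: 4-byte big-endian length, then payload.
      if offset + 4 > len(buf):
          raise ValueError("truncated length prefix")
      (length,) = struct.unpack_from(">I", buf, offset)
      end = offset + 4 + length
      if end > len(buf):  # the O(1) bounds check discussed above
          raise ValueError("declared length runs past the end of the buffer")
      return buf[offset + 4:end], end  # no scanning of the payload bytes needed

  payload, next_offset = read_field(b"\x00\x00\x00\x05hello", 0)
  assert payload == b"hello"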


Invalid unicode sequences?


buffer_len could be larger than the message, copying some incorrect things into memory.

Similar to HeartBleed, where there wasn't validation on the heartbeat message, and the server would echo back buffer_len instead of just what was sent.


I believe the author intended buffer_len to be the length of the incoming buffer (size of the HTTP payload, number of bytes read from a file, length of the database entry, etc...). So the worst that can happen is that the entire input message is consumed -- like a JSON payload that's missing its closing quote.

I can think of a very contrived situation where this can be a problem, but in most cases this will be perfectly safe.


https://capnproto.org serialisation scheme skips the decoding. Does that make it not safe?



None of those are serialization schemes. XML can be used for serialization, but if you look at the whole ecosystem it is a Turing-complete complexity monster, so of course it isn't safe.


It depends on what constraints apply to the data. Any bit pattern could be used for an int, but to guarantee a UTF-8 string it would need to be validated.


Genuine question - is it dangerous to memcpy X bytes that we know must be interpreted as, say, an integer?


No. Everything is 0s and 1s after all. Take, for example, a byte. It has 8 bits, and by permuting the 0s and 1s you end up with all the possible values of a signed byte: all the numbers -128 to 127. So now, if you were to copy a byte from a random memory location, that byte will just contain a permutation of 0s and 1s which, when interpreted as a signed int, will simply be a number between -128 and 127.


That's what I thought ...


Potentially. Most network protocols are big-endian while x86 is little-endian.
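
A small Python sketch of that caveat: the same four bytes give different integers depending on the assumed byte order.

  import struct

  raw = b"\x00\x00\x01\x00"
  (big,) = struct.unpack(">i", raw)     # network / big-endian reading: 256
  (little,) = struct.unpack("<i", raw)  # x86 / little-endian reading: 65536
  print(big, little)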


Keep fighting the good fight.


> Want to parse a string in JSON? okay, this is a starting quote, so let's iterate down the rest of the document, looking for unescaped end quotes, and then copy that into a buffer.

Yes, so what's the problem?

Meanwhile, keep in mind that the JSON document was already passed in an HTTP request body, or that it's trivial to put string length checks on a parser, or even nesting limit checks.


Delimited formats are, in general, slower to parse than formats where the record sizes are encoded in the message.

That said, not something I want to prematurely optimize for.


I don't think it's that premature. Making the (big) assumption that serialization format X has good stable support in your language, and has the same expressive capabilities as JSON, you should be able to use it as a drop in replacement (as long as you control both ends of the communication...). And doesn't a huge portion of what we do around here amount to "shuffle serialized data around as efficiently as we can"?


> Delimited formats are, in general, slower to parse than formats where the record sizes are encoded in the message.

I don't see your point at all. The state machine required to parse a JSON string essentially has only 2 states: current token is either a string character or a string delimiter. That's it. Adding a record size increases the number of states because now you not only need to track if a token is a string character but also if the character makes sense to be there within that state.

Moreover, when you parse a JSON string, once you hit the string's end delimiter you are already able to calculate the string size. Thus, if the string is short enough to fit the buffer, then not only is the lexer simpler, but it also requires the same memory allocations as string formats with record sizes. If, however, a string doesn't fit the buffer, then we are already in the territory of a rope data structure in both cases, so the number of memory allocations tends to be equivalent.


Why would you make a comparison by gzipping one and not the other? Running gzip on the msgpack example reduced it from 304 to 248 bytes.


Sure, but I think what he meant is that what the website is showing isn't the real gains.

Without GZIP:

JSON 583 bytes

MessagePack 304 bytes

52 %

With GZIP:

JSON (GZIP) 289 bytes

MessagePack (GZIP) 248 bytes

85%


So MessagePack lets you skip the CPU and memory usage needed to compress the JSON, yet get similarly sized messages to compressed JSON. Compressing the JSON is a non-trivial extra load on a busy server.


Most likely you'll have GZIP enabled on all your HTTP requests. Your bandwidth will be a bottleneck way before GZIP CPU load becomes one.

I just took a quick look and they do compress their queries, using Brotli, which is even more efficient than GZIP. They are behind Cloudflare, so it's probably from them.

I just tried with https://jsonplaceholder.typicode.com/todos and I get this:

Raw:

JSON: 24,311

MessagePack: 14,704

60 %

With GZIP:

JSON: 3,965

MessagePack: 4,063

102%

With Brotli:

JSON: 3,495

MessagePack: 3,704

106%

So essentially, compressed MessagePack is WORSE than compressed JSON, and uncompressed it is at least 3 times worse than compressed.


So you're saying MessagePack with no compression is way smaller than JSON. Which means with no load for compression it has a 40% bandwidth savings. With compression it's negligibly worse than compressed JSON. Seems like a back foot argument for MessagePack over JSON. That's to say nothing of the encode/decode efficiency differences.

In the case where your data can't be cached but CloudFlare proxies and compresses for you, MessagePack wins because your uncompressed connection to CF is 40% more bandwidth efficient so egress out of your app server for any traffic is reduced. In the case where CF can cache responses you get the same back end bandwidth win for a negligible (if any actual) amount of extra egress out of CF.

Use what you want but if you need a schemaless serialization format, MessagePack works well and is compact for "free".


A different way of looking at this is that MessagePack is a worse way to compress JSON than gzip or Brotli, at least under some circumstances.


Or you just use JSON with gzip and never think about this problem ever again. Choosing msgpack vs JSON is a micro optimization at this point. There are significantly better alternatives if you want to save CPU cycles.


Performance. Overhead of JSON encoding followed by gzip will be slower and more CPU expensive than msgpack.


Sure, but if performance is the actual issue, then you'd likely pick an even faster and more compact wire format (proto, avro, thrift).

I guess the ability to have a schemaless format with no type checking (just like JSON), while enjoying some performance benefits? It just feels like a weird niche to me.


Protobuf/Thrift require predefined schemas, which is a different model.

Avro is pretty similar to MsgPack in concept. It seems less popular (at least in my circles), and has fewer implementations, which may or may not matter. As for performance and efficiency, the first benchmark I found [1] shows MsgPack is more space efficient, and faster to serialize, while Avro is faster to de-serialize. The second benchmark I found [2] found exactly the opposite. So it's not clear to me that either is better, and I'm sure it depends on your data, which library you're using, and how you're using it.

Being nearly as performant as the predefined-schema serializers, while being nearly a drop-in replacement for JSON, seems like a major and valuable use case to me.

[1]: https://medium.com/@nitinpaliwal87/compression-and-serializa... [2]: https://github.com/saint1991/serialization-benchmark


> shows MsgPack is more space efficient

While I see the same numbers as you do, there's no way a format that includes plaintext keys alongside the values is going to be smaller than a format that only includes a packed field number+type and the corresponding values. There's something fishy going on here, and I can't seem to find a link to the source behind his benchmarks?


I'm not sure if the benchmarks above reflect this, but MessagePack allows you to use integers (or any data type) for map keys. This makes it a lot smaller than using plaintext keys, and makes it comparable to formats like Protobuf despite the lack of formal schema. (You also lose a lot of the context-free readability of messages, so it's a tradeoff.)
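
A rough illustration of the size difference (Python, assuming the third-party msgpack package; exact byte counts depend on the values):

  import msgpack

  with_name_keys = msgpack.packb({"red": 127, "green": 28, "blue": 27})
  with_int_keys = msgpack.packb({1: 127, 2: 28, 3: 27})

  print(len(with_name_keys), len(with_int_keys))  # 19 vs 7 bytes for this toy map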


I guess. . . for my part, it's always seemed to me like messagepack falls into a sort of uncanny valley of serialization formats. Schemaless is very undesirable for internal APIs, IMO. Plaintext is desirable for public-facing APIs that don't need to be particularly performant, but only if you're willing to go full HATEOAS. If you're going binary for performance or don't want to bother with HATEOAS, I'm back to preferring a published schema over something that's (semi-)self-describing plus some documentation that's invariably incomplete or out of date.


I have never seen any benefits of HATEOAS materialise - it is too free-form to be machine parseable, and it is far more difficult to navigate than Swagger or GraphQL. Maybe I don't get something?


I don't ever machine parse it, but it can be useful for manual discovery if you get an API that isn't super well documented. I think I might generally prefer Swagger, too, but not everything is implemented in a language that has a library like that.

At the end of the day, though, I'd seriously much rather just have a halfway well commented *.proto file. Work smarter, not harder. It's just that I won't recommend that for public-facing APIs because GRPC isn't universally well supported, while JSON over HTTP is.


What was wrong with ASN.1 for the wire? It's not a direct replacement but a sanitising schema that's truly amazing in application breadth, support, tooling, and implementation efficiency.

In force documentation:

https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-X.68...

Blog on ASN.1 schema for JSON:

https://www.obj-sys.com/blog/?p=508


It's incredibly annoying to marshal and unmarshal arbitrary ASN.1/BER. Variable-length, odd-bit-length length fields are annoying. BER is like the worst parts of every other format, collected in a single standard.


I don't know why, but I constantly hear about things that never caught on and will never catch on in the future. Ok, let's say I am hyped for ASN.1 as a schema for JSON. How do I integrate it into my bog-standard Java application? Suddenly we are running into a huge problem.

>ASN.1 is a mature standard. As I already mentioned, it has been around since the 1980’s. Though stable, it is not stagnant; the most recent revision occurred in 2008.

It might be old but it is not widely used. To me it doesn't matter if something was created yesterday or 20 years ago as long as I can use it and using ASN.1 is significantly harder than necessary.

>ASN.1 has tool support. There are both commercial and open source tools that will generate code from your ASN.1 specification.

Ok, so where is it and why is it so difficult to find?

Even if we assume that we should just build it today then all the above claims suddenly become worthless. If someone builds ASN.1 tooling in 2020 then it is just as immature as e.g. GraphQL tooling that was built 2020. If there is renewed interest in ASN.1 then it might gain new features that will cause it to become less mature/stable again.

Having "mature" software is useless if it doesn't meet user demands.


There is an excellent article about ASN.1 and how it relates to JSON and other forms of encoding.

Here: https://www.thanassis.space/asn1.html


Sadly, ASN.1 is widely used — both TLS and SNMP use it, which means that we're using ASN.1 every time we read or post on HN.


ASN.1 is enormously more complex than JSON or even MsgPack, very bug-prone, has much worse tooling, is much less widely supported, and, surprisingly, most implementations are not even that efficient. Basically you pay a lot more in bugs and get a lot less in functionality.

I've spent a lot of hours poring over hex dumps of BER messages and CBOR messages (which are basically MsgPack) and I vastly prefer CBOR. But I prefer JSON way more.


I really like ASN.1.

The big difference is that messagepack is schemaless.



It's just a nice default to pick. (But JSON is "simple" so msgpack did not really catch on. And probably won't for the reasons others mentioned.)


> It's just a nice default to pick.

Why would it be your default?


It's really very-very fast, and no need for external schema registry. And it's really as simple as JSON. Has great language support.

https://github.com/thekvs/cpp-serializers#results

That said, currently I don't work with it in any projects. Apparently it hasn't really caught on.


It's my go to serialization format for redis payloads. Fast, compact, can represent a lot of types without too much magic.


It is relatively efficient, schemaless, and cross-language and -platform. It supports floats and binary values as well.


Because all your existing JSON will just work; you don't have to change your data representation to support some other format (proto, avro, thrift).


> Overhead of JSON encoding followed by gzip will be slower

How much time would it take you to either

a) turn on gzip support on your HTTP server and keep using JSON

b) rip out your JSON stuff from controllers and whatnot and replace it with a custom serializer/deserializer that's far from an industry standard?


A decade ago I benchmarked and used msgpack in a Python system instead of json because it was so much faster that it halved my hardware requirements. I think thrift and protobuf and stuff were around then but I don’t recall how they compared.

Edit: found my post https://stackoverflow.com/questions/9884080/fastest-packing-...


I made a similar choice for a Ruby project, with msgpack over RabbitMQ.

I no longer have any connection to this repo, but the benchmarks (and a schema validator gem I wrote) are here: https://github.com/deseretbook/classy_hash/blob/master/READM...


I fail to see the use case too. If you want a compact wire format, there are numerous options (protobufs, avro, thrift, etc) and their size is achieved by storing the data in a compact binary format and the schema separately.

With MessagePack the schema is embedded alongside the data, exactly like JSON. So it's just binary instead of text encoding, which saves some space, but as you pointed out, standard text compression algorithms are going to likely perform similar or superior.


One particular advantage of msgpack is that it allows embedding binary data without binary-to-text overhead (~30 %). I don't really see msgpack as something that makes much sense to use inside a browser, though.
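
To make that concrete, here's a minimal Python sketch (assuming the msgpack-python package; exact numbers depend on payload size):

  import base64, json, os
  import msgpack  # assumed: the msgpack-python package

  blob = os.urandom(3000)  # some arbitrary binary payload

  packed = msgpack.packb({"data": blob}, use_bin_type=True)
  as_json = json.dumps({"data": base64.b64encode(blob).decode("ascii")}).encode()

  print(len(packed))   # ~ len(blob) + a few bytes of framing
  print(len(as_json))  # ~ 4/3 * len(blob) plus quoting and key overhead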


The first piece of software I ever wrote was a base92 encoder. When I looked at b64 it was even worse than the 3:4 I had been told.

B64 as originally used (with an extra >2.6% for line splitting) is handily over 35%. As we use it now it’s still a bit over 33%, because of the == endings to detect truncation... 66% of the time.


Msgpack is a different encoding, not quite the same as compression. What happens if you do both?


I'm not sure this is a fair apples to apples comparison. Some thoughts:

1. msgpack + gz would be a more fair comparison to json + gz when comparing file size

2. have you run any time comparison? It would seem to me that msgpack gets pretty close to json + gz with far fewer resources and in less time


Yeah, that is a very fair point. I suppose if I can shave a few bytes per request, why not, and if the parsing is indeed faster, then all the better.

I haven't had need to deal with such constraints personally, but maybe it's something I should consider more. It doesn't need to be a 10x win to still be useful, and if it's a simple import to use then I would likely be crazy not to consider it.


Do a speed test comparison as well. Serialization to msgpack is likely much quicker than json + gzip. This is for latency sensitive applications.
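
Something along these lines is enough for a first impression (a rough Python sketch, assuming msgpack-python; results depend heavily on the shape of your data):

  import gzip, json, timeit
  import msgpack  # assumed: msgpack-python

  doc = {"users": [{"id": i, "name": "user%d" % i, "active": i % 2 == 0} for i in range(1000)]}

  def json_gz():
      return gzip.compress(json.dumps(doc).encode())

  def mpack():
      return msgpack.packb(doc)

  print(len(json_gz()), len(mpack()))        # compare payload sizes
  print(timeit.timeit(json_gz, number=200))  # encode + compress time
  print(timeit.timeit(mpack, number=200))    # encode-only time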


Gzip'ing the msgpack encoded example results in 238 bytes.


Nobody gzips their JSON over HTTPS; this is due to compression attacks.


Not every use case is affected by the BREACH attack.


JSON is mostly used on the web, and with HTTPS, where compression is disabled.


gzip is very slow. I don't know why anyone would still use it today.


Gzip is old but still good. It's not as good as Zstd, nor in cases where lz4 and its ilk shine, but it's not fair to claim it's slow.

Gzip is quite close to the Pareto frontier, meaning it is a good trade off of time and space. See the charts at http://mattmahoney.net/dc/text.html (And read up on the Hutter Prize!)


When you say, "very slow" what do you mean?

gzip is fast relative to other compression algorithms and relative to internet speeds.

gzip is slower than an ideal gigabit network, but faster than 100 Mbit.

https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...


"gzip vs bzip2 vs xz performace comparison" is like comparing which centenarian can limp faster, it might be entertaining for some people, but is generally not relevant.

> gzip is fast relative to other compression algorithms

gzip looks fast perhaps compared to "xz -9", but not to anything modern.


Your comment could benefit from pointing out the alternatives that outperform gzip.


Check out zstd


Really? What do you suggest that is faster? Our shop has tested a number of compression formats (xz, bz2, gzip, etc.) and gzip is good enough and faster than the others we tested.


https://quixdb.github.io/squash-benchmark/

Brotli is better, zstd is faster, lz4 is often good enough and a lot faster.


The algorithms you name are all rather outdated.

Typical gzip decompression speed is somewhere in the 200-250 MB/s region, compression is much slower. LZ4 for example tends to compress at ~600-700 MB/s, and decompress at several GB/s. zstd is tweakable over a very wide range of ratio-speed trade-offs.

LZMA(2) (xz) is a rather troubled format and should not be used any more. bzip2 has always been slower than gzip with usually marginally better compression. It has been irrelevant for a long time.


> LZMA(2) (xv) is a rather troubled format and should not be used any more.

You mean xz. Not sure where you got the idea that it’s a troubled format, but if you’re talking about the infamous “Xz format inadequate for long-term archiving”, IMO that’s just bzip2 authors taking a dump on xz for no good reason, and fortunately for us it’s bzip2 that’s basically irrelevant today, not xz.


> It has been irrelevant for a long time.

Funny you say that, my company uses bz2 for compressing pretty much everything.


And many companies use fixed-width record formats for data exchange... what's your point exactly?


I guess that it's not irrelevant. Deprecated, old, etc sure, but irrelevant?


Well yah, those are all ratio-tuned codecs that are very slow. LZMA beats bzip2 on both ratio and speed so you might as well forget about bzip2 forever. Zstd, snappy, LZ4, or even brotli are probably better choices than zlib for most people. Brotli has LZMA-like ratios at dramatically higher speeds.


Depends on what you're doing... for real-time data streaming, sure... for request-response, I'm less convinced. JSON being human readable, relatively lightweight, and highly compressible is pretty convincing.

I'm not against messagepack or protobuf, however, much like all things, I'd rather start with simple http+gz+json (maybe websockets) and optimize as needed. Not everyone is at the scale of FAANG, and most don't really need this level of optimization.


Agree. Most systems have to be optimized in other areas first, and I wouldn't give up the debuggability of JSON for a fraction of a percent increase in performance.


>[JSON] relatively lightweight

Um... no? Epoch as binary, 4 bytes. As JSON, 10 bytes. A 64-bit bignum in binary, 8 bytes; in JSON, 19 bytes.

For a small example object I can think of

{ "x" : "y" }

In binary, 5 bytes. In JSON, 9 bytes.

Now... light ENOUGH because you're using a PC with gigs of RAM and a 100 Mbit internet connection, sure. Light in terms of a microcontroller? No.

I flat out could not use JSON in any of my projects and I’m not at FAANG level.


seems like what you're talking about would fall under the real-time streaming category... HTTP overhead alone would outweigh the JSON differences you're talking about.


Double. That's the inefficiency you can count on with JSON over CBOR or MsgPack if you're dealing with number data or short strings.

Sure, the HTTP overhead and TCP overhead under that are significant if you're transferring a single bignum. How about 10,000 of them?

Even in the best case of all strings, where binary is limited by ASCII's inefficiency, "":"", is "only" 5 bytes of wasted data per item; but what about a million-item JSON? 5 million wasted bytes does seem like it outpaces the NIC, then TCP, then HTTP layer overhead. But yeah, IDK.

The claim was that JSON is lightweight, it is not. It's not as bad as XML, I'll give it that.


Protobuf is entirely different to MessagePack and JSON because it has a schema.


schema isn't part of the actual encoding though afaik.


I implemented a streaming deserializer for MessagePack data in C for small microcontrollers. It is quite small and not complete.

Then I tested it with test data streams generated by the reference C++ implementation and Python implementation.

It's actually kind of a pain, because the C++ serializer generates a variety of different data types depending on the values you are packing, not the type of the values you are packing. Let's say I encode a uint32_t field. The stream might get a uint8, a uint16, a uint32, or even an int32 (for reasons that completely elude me).

Also, C++ strings come out as the 'ext' type while Python strings come out as the 'string' type, so I have to accommodate both, even though they are both basically byte strings.

So, I want to tell the deserializer what kind of output field I'm expecting, for each struct member or array, and then look in the data stream to see if there is a data object there that came from the same type. But this is impossible, so the per-type decode functions have to be quite complicated to handle a variety of types.

So - it works but I can't do much on the deserialization side to verify that the data I'm unpacking really matches what was encoded. I can only detect very broken cases, like when I'm unpacking a uint32_t and in the stream there is an int32_t with a negative value.

I guess this is mostly done for optimization, but two things would make it a lot better:

- if the spec actually specified how data types in different languages were allowed to be encoded

- if the encoded data contained _two_ type fields, one indicating the original source data type and another indicating the type it was encoded into.

Basically the spec is just way too "loose" to make it usable for the use case I'm trying to use it for, which is to easily generate data that is sent to a micro and stored in EEPROM, then deserialized out of EEPROM later.

That's probably not very close to a use case the original designer had in mind. But I haven't found anything that works better (less decode logic).


The spec doesn't really make this clear, but reifying the packed encoding types is not, I think, how MessagePack was intended to be used. At least it's not how it's implemented in MessagePack libraries. As far as I know they pretty much all use the most efficient representation for all values, so the original data type is always lost.

You generally shouldn't worry about the low-level type that a value was encoded into in the MessagePack stream. It's dynamically typed, so you should just care about values. When encoding you should allow the encoder to use the most efficient representation, and when decoding you should be able to tell your MessagePack parser the integer width you want instead of caring about the original type or how it was encoded. It should then accept any packed integer type as long as the value is in range.
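
You can see this directly with the Python library, for instance (a tiny sketch; the byte layouts are the integer formats from the MessagePack spec):

  import msgpack

  # The same logical type (an int) gets a different wire width depending on its value:
  print(msgpack.packb(5).hex())      # '05'         -> positive fixint, 1 byte
  print(msgpack.packb(300).hex())    # 'cd012c'     -> uint16 marker + 2 bytes
  print(msgpack.packb(70000).hex())  # 'ce00011170' -> uint32 marker + 4 bytes

  # A decoder that just asks for "an integer" accepts any of these representations.
  print(msgpack.unpackb(msgpack.packb(300)))  # 300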

This is how my MessagePack implementation works as well. If you expect to receive an integer that fits in, say, `uint16_t`, you can call `mpack_expect_u16()` or `mpack_node_u16()`, and it will allow any integer representation as long as the value is in range.

It sounds like this is where you were going with your implementation as well, so this may not be comforting because it's not what you want, but it is at least the correct way to understand the format. I've talked about this pretty extensively and wrote up a protocol clarifications document that explains a bit more about how and why MessagePack libraries discard integer width and signedness:

https://github.com/ludocode/mpack/issues/35

https://github.com/ludocode/mpack/blob/develop/docs/protocol...

If you really want things like original integer width represented in the format, ultimately you're going to want to use a different format, probably one that is non-dynamic and uses schemas.

As far as the string vs ext, you may have meant string vs bin; there was a format change a while back that separated string and bin types and not all MessagePack libraries have adapted to that. Many libraries (including mine) support a compatibility mode so they will use only compatible string representations.



Ha! Yes, I have been using MessagePack for several years. It never occurred to me that people would be interested to have a discussion about it on Hacker News. MessagePack to me is as ubiquitous as JSON and Protobuf, I submitted www.json.org [1] just to prove my point. I figure OP is today’s lucky 10,000 [2] and that’s why they decided to submit the link, thinking it was a new type of data interchange format.

[1] https://hackertimes.com/item?id=22538794

[2] https://xkcd.com/1053/


Yeah, this same conversation about Message Pack happens once every two years here but that’s ok I guess...


Have there been threads in the past 8 years?


I guess you’re right that these are the top level threads before 2012. What I’m remembering (and found via google search) is this kind of discussion around message pack comes up in most other threads about serialization formats or adjacent topics.


There definitely have been some comments: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


Under PHP...

> Msgpack is an PECL extension, thus you can simply install it by:

Having something available in PECL is a good first step, but nobody will use it unless you either:

1. Get it into the standard library (which requires an RFC for PHP Internals), OR

2. Write a pure-PHP polyfill installable from Composer, OR

3. Do #2 then #1 (using the polyfill's popularity to argue for the importance of the RFC acceptance to make #1 a reality).

Reason: A lot of the places PHP is deployed, you can't compile C code or install binary dependencies (.so, .dll files). You can't access the OS package manager, either.

But Composer is a pure-PHP package manager that still operates in these environments.

So if anyone on HN ever wants your thing to be used by PHP developers, don't just stop at "PHP extension, written in C, available in PECL".


If you literally read further down the page there is a pure PHP implementation, the rybakit/msgpack composer module.


> Reason: A lot of the places PHP is deployed, you can't compile C code or install binary dependencies (.so, .dll files). You can't access the OS package manager, either.

If you’re in the target audience for msgpack you probably aren’t relying on shared hosting and can build a pecl extension.


That's some strange logic.

I have multiple racks of company-owned bare metal that I deploy to. Still don't want to build a pecl extension. Much simpler and way less likely to run into random build issues if I can just install via composer.


> Still don't want to build a pecl extension.

Docker? Build it as a system (RPM/DEB) package? Man, my life would be difficult if I just flat out refused to use packages for PHP/Python/Ruby requiring native extensions because it required some minimal effort on my part to deploy it.


I found myself using msgpack as a drop-in alternative for acceptable mimetypes for HTTP responses in a flask app. Browsers would get the response data with the pretty-printed json embedded (or even a custom template with the data fitted in, if there was one for that specific endpoint), api clients asking for nothing in particular would get pure json, and clients asking for msgpack would get that.

Seemed like a free way to offer slightly better performance on the API (both serialize/parse times and bandwidth), since it can just serialize any data without specifying protocols or schemas, like json serializers can too. I didn't know about any alternatives that would also require no further configuration or infrastructure, so it seems like msgpack fills this 'free performance boost' niche quite nicely
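
The negotiation part is only a few lines; roughly like this Flask sketch (the endpoint, mimetype string and example data are made up for illustration):

  from flask import Flask, Response, jsonify, request
  import json
  import msgpack  # assumed: msgpack-python

  app = Flask(__name__)

  @app.route("/api/things/<int:thing_id>")  # hypothetical endpoint
  def get_thing(thing_id):
      data = {"id": thing_id, "name": "example"}
      best = request.accept_mimetypes.best_match(
          ["application/x-msgpack", "application/json", "text/html"])
      if best == "application/x-msgpack":
          return Response(msgpack.packb(data), mimetype="application/x-msgpack")
      if best == "text/html":
          # browsers get pretty-printed JSON (or a rendered template, if one exists)
          return Response(json.dumps(data, indent=2), mimetype="text/plain")
      return jsonify(data)  # default: plain JSON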


Or you could use IETF standard CBOR (Concise Binary Object Representation) https://cbor.io/ RFC 7049


Is there any advantage of msgpack over json or gzipped json on one side and something like protobuf or flatbuffers on the other?

Msgpack, unlike json, is not human readable on the wire, not a purely text based format, and I doubt it is smaller or faster than protobufs or flatbuffers.


Protobuf requires schemas, which is good practice anyway, but maybe you don't have one or don't want to write one for some reason.

FlatBuffers doesn't have as many client libraries. There's a MessagePack library for a ton of languages.

One thing I really like about MessagePack is that the Python client (and others too) supports reading from a stream. So you can write a bunch of msgpack messages to a file or TCP socket and it just works.
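
For example, with the Python library you can point a streaming unpacker straight at a file or socket (a small sketch):

  import msgpack

  # Write several messages back-to-back, no framing layer needed...
  with open("events.bin", "wb") as f:
      for i in range(3):
          f.write(msgpack.packb({"seq": i, "payload": b"abc"}))

  # ...and read them back as a stream.
  with open("events.bin", "rb") as f:
      for msg in msgpack.Unpacker(f, raw=False):
          print(msg)  # {'seq': 0, 'payload': b'abc'}, ...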

Protobuf can't do this out of the box because it doesn't include how long the message is. You can write a wrapper that specifies the message length, which isn't that hard and I've done before, but it is another thing to maintain. And other formats do that out of the box (ex. Cap'n Proto)

And as someone else mentioned, Protobuf doesn't have NULL which is useful in some cases. (I understand that Go has strong opinions about there being a useful default value, but that doesn't map well to a lot of languages)

One other thing with Protobuf is that the Python client is not very pythonic. I've been keeping my eye on this project [0] which makes Protobuf messages work just like dataclasses. They don't support OneOf types currently, which I happen to need for some of my use cases. But they're working on it [1]

[0] https://github.com/eigenein/protobuf

[1] https://github.com/eigenein/protobuf/issues/85


There's nothing fundamental about protobuf itself that prevents streaming output. You just have to make a pass over the proto structure first to compute the sizes, then stream it out using the precomputed sizes. At no time do you necessarily need the entire representation of the output in memory. The C++ library, for example, offers this.


Yes, and that's what I did. In my case, I was writing to a file. And my languages were C, Python, and Lua. So, I wrote a small package (plus bindings for Python and Lua) that would write out the message length as a uint64, then a uint8 for which schema the message is (I was working with multiple different message types), and then the actual message bytes.

So, it's do-able. But, I'm just saying that other formats/libraries support this natively out of the box.
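
The framing itself is only a few lines; e.g. a Python sketch of the length-plus-tag scheme described above (the tag-to-class mapping and the little-endian layout are assumptions for illustration):

  import struct

  def write_framed(f, type_id, msg):
      # 8-byte length, 1-byte schema tag, then the serialized protobuf message.
      body = msg.SerializeToString()
      f.write(struct.pack("<QB", len(body), type_id))
      f.write(body)

  def read_framed(f, msg_types):
      # msg_types maps tag -> generated protobuf message class (hypothetical mapping).
      while True:
          header = f.read(9)
          if len(header) < 9:
              return
          length, type_id = struct.unpack("<QB", header)
          msg = msg_types[type_id]()
          msg.ParseFromString(f.read(length))
          yield msg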


It sounds a bit like you invented RecordIO[1]. I didn't realize you meant decoding from a stream containing several protobufs. In that case you do indeed need some kind of framing, because there is no end-of-message delimiter. The positive tradeoff is you can easily concatenate several protobufs into another, valid protobuf. That is impossible if there is a terminating symbol.

1: https://www.tensorflow.org/tutorials/load_data/tfrecord#tfre...


Oh neat. I'm not familiar with the ML/AI ecosystem. So, I didn't think to look there for formats; a bit of a shame I reinvented the wheel here, but it worked great for my specific use case.

Yeah, it's all about the tradeoffs with formats. Protobuf is a good choice a lot of the time and I always think about it as one of the first/preferred options. gRPC is great too.


If you're streaming messages I guess the good comparison would be protobuf + gRPC?


You can use, e.g., writeDelimitedTo in the Java API to do streaming protobufs



My languages were C, Python, and Lua. Lua supports this, but C and Python don't (at least not in the library generated by protoc at the time I did this project, ie. like 2 years ago).


One advantage is that you can pack and unpack msgpack on an Arduino but you can't run gzip on that hardware.


> Is there any advantage of msgpack over json or gzipped json on one side and soemthing like protobuf or flatbuffers on the other?

Or CBOR, that has an RFC (which is important for some folks):

* https://en.wikipedia.org/wiki/CBOR


Plus the fact that protobuf is more flexible about interface changes. My previous company had countless incidents that happened when new code deserialized msgpack records with unexpected fields from redis...


I was asking myself the same thing and apparently protobuf has no concept of "null", thus I can see how MessagePack might have an advantage depending on your use case here.

https://github.com/neuecc/MessagePack-CSharp#comparison-with...


protobuf can do nullable values, but you have to ask for them as primitives are the default. e.g. StringValue (nullable) vs string (not)


Well, msgpack does support blobs, unlike JSON, and is schemaless, unlike protobufs and FlatBuffers. I think the second one is kind of a hard constraint, while the first one is something you can work around with base64 in many cases.


It's about as fast as protobuf using .NET. It is not human readable.


I used MessagePack in a real-time streaming app. It was smaller than Protobuf (and about as fast in .NET). In fact, if you are a .NET dev and use the popular SignalR it can use MessagePack for real-time browser binary messaging. I highly recommend it. But it is not a direct replacement for JSON, as it's not human readable or accessible without libraries.


I've also used MessagePack on .NET, but as a serialisation format for use with a message queue - I don't recall the numbers off hand, but it was something like an order of magnitude faster to serialise and deserialise than JSON, and resulted in a lot less allocations too.

It supports compression too. Even if the resulting serialised size isn't always smaller than compressed JSON, from a performance standpoint it's a lot better.


It might be better than text formats, but it still encodes most things as variable-size and this is not particularly machine-friendly. A format with separate metadata and fixed length encodings for most of the things (eg numbers) would be much more efficient to serialize / deserialize.


> resulted in a lot less allocations too

Especially if you use the MessagePack-CSharp library. That dude knows his high-perf .Net.


I've worked extensively with MessagePack and JSON. I much prefer JSON because it's human readable. It's just a pain to debug something that looks like raw binary data, and I usually have to debug it at times I'm just not in the mood to deal with shenanigans.


Reminds me of Adobe AMF (Action Message Format) that was used in their Flash Player:

https://www.adobe.com/content/dam/acom/en/devnet/pdf/amf-fil...


If you want a binary JSON, https://google.github.io/flatbuffers/flexbuffers.html is worth looking at, since it carries over the advantages of FlatBuffers (in-place access without unpacking), but without the need for a schema like FlatBuffers.


I thought the appeal of JSON was that it is, among other things, somewhat human readable. The gain in bytes does not seem to justify the loss of human readability. Am I missing something?


It depends on your use case; having human readable serialised messages isn't always a big deal. For example, I recently used it for a high-throughput messaging system, where the vastly improved performance was a huge selling point.


Are you referring to the reduced payloads or does this also provide a faster way to serialize/de-serialize?


Much faster serialisation/deserialisation (and a lot less allocations on .NET).

It does support compression, but messages sizes are generally comparable to compressed JSON (YMMV, it will depend on your messages).


Probably we need a human readable text view for messagepack

That would let it fully take on the role of json

But what's the text format? -- now there is endless happy bikeshedding

Maybe a library with json (or compatible superset, to get all messagepack features). Then the standard just serializes to messagepack, before gzip or whatever


I wrote a MessagePack to JSON conversion tool which is ideal for viewing MessagePack. It supports a pseudo-JSON debug output so you can view messages even when they use features outside of JSON, like binary blobs or arbitrary key types:

https://github.com/ludocode/msgpack-tools

It works best if you use MessagePack like JSON where all map (object) keys are strings, so you can easily understand a message without context. If you want to optimize your MessagePack more, you would tend to use integers for map keys, but this makes the JSON-equivalent view not super clear because you just see a bunch of numbers in a tree structure.


Thanks! Your `msgpack-tools` are really handy. When I use MessagePack usually I'll add it as an option to an endpoint (`/api/request/123?type=mpack`) in addition to json but there's still cases where having a mpack-to-json tool comes in handy.


Ok so without taking away from this library, what's wrong with using a fully binary format of your own? No string parsing, no nonsense, just binary data. It seems like people are becoming afraid of plain binary, and I don't understand why. It's so easy, plain binary is very small, and is extremely easy to parse when compared to anything with text in it, especially if you have to support multiple encodings.

1.) Decide upon a binary format.

2.) Use it.

No 3rd party libraries required; you could write writer or reader code for a Commodore 64, if you had to.

Keep it simple. I suspect any simple "type-length-value" type of binary file format would be written or read at least as quickly as this, without third party code.


Communication. The new guy joining your team can be immediately productive with JSON, protobuf, etc, but will take some time to understand the homegrown format. Also much more error prone. Most teams don't want to be writing binary encoders and decoders as that's not where their expertise lies, and would much rather focus their time on business features.


Been there, done that, switched to proto bufs because maintaining 4 different parsers in 3 different languages was a pain.

Also proto bufs gave us those nice schemas, the code gen, and easy forwards/backwards compat.

Was it slower? Yeah. Was it crap ton less work? Yup.


I haven't had the same experience I guess. Keeping binary readers and writers maintained was a very small portion of the time I spent on the applications which used those writers and readers.


In our case we were adding new APIs and expanding our format rapidly for years on end. After the 20th or so miscommunication that led to a week+ delay because someone got a field order wrong, or in one case because we hit a bug in the C# compiler in regards to struct layouts (!!), we switched away from rolling our own.


It's unlikely you can write a serializer/unserializer that is faster than what we already have. And bug-free on complex objects.


It's unlikely that I can? That's a strong statement. Binary file formats are not difficult. Writing code that does the right thing is also not difficult.

I am opposed to the blind use of libraries like this. Developers need to understand what they're doing and how what they're doing is being done at a reasonably low level if they ever hope to become better developers. Masking it all away behind a third party library is not how you understand the code that's running on your systems.


How about you write a simple binary substitute that covers just the standard JSON types (number, string, true, false, null, array, object), and see how you do compared to the existing JSON serializers/deserializers and MsgPack.


Fun POV: Text is a binary format. Just standard, with tooling.


Unfortunately messagepack also forces IEEE754 standard for floating point numbers. That means it is useless for things like money or large/arbitrary precision numbers.

In JSON a number-type value has to be parsed like in javascript, so IEEE754 double.

for example:

  {"number": 0.1}
in reality it is parsed as:

  {"number": 0.100000000000000005551115123126}
Merely deserializing and then re-serializing might introduce a change in the serialized value.

Because of problems with IEEE754, in all of our APIs we use floats only as strings, like {"number": "0.1"}. For us, the enforced IEEE754/double format for floating point numbers renders the number type nearly useless.


This is technically correct, except that almost all JSON libraries parse JSON numbers into either 64-bit integer or 64-bit double, so you get those lossy conversions anyway.

I've worked on a production app where we had to swap out the JSON parsing libraries in the server (Rails) and all clients (Android, iOS, Rails again) for ones that preserved numbers as BigDecimals. This was a huge pain, it made everything slower, and even then it wasn't ideal because different BigDecimal libraries aren't even necessarily compatible. Foundation's NSDecimalNumber has different limits than Android's BigDecimal for example, and these are the types used by the parsers so you can't just get the raw data to do it yourself.

If I had to do it over again I would never rely on JSON's decimal support. I'd rather stuff my decimals in strings and parse them myself.


I think you're conflating implementation details of your JSON decoder with the JSON spec. There is no requirement that JSON numbers are parsed into IEEE floats or doubles. Many JSON parsers support decoding numbers into other decimal data types.

https://www.json.org
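
Python's standard library is one example; a tiny sketch:

  import json
  from decimal import Decimal

  doc = json.loads('{"number": 0.1}', parse_float=Decimal)
  print(doc["number"])                 # Decimal('0.1') -- exact, no binary rounding
  print(json.dumps(doc, default=str))  # {"number": "0.1"}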


Small lesson learned with MessagePack:

We prematurely utilized it and paid the price of "nothing's human readable without first unpacking it" without actually benefiting much, given we weren't shipping much data that often.


Anyone know of an efficient binary format similar to MessagePack that supports deserializing only whitelisted keys, and not paying the penalty of parsing data that isn't needed?

MessagePack is great, but libraries generally only support deserializing the whole thing. I have an application where these structured documents can be very large, and scanning code sometimes only needs a very small subset of keys.

I believe Cap'n Proto has this feature, but unlike MessagePack it's not schemaless.

For example, given a struct like this:

  {
    "id": "123",
    "name": "Developers",
    "members": [{
      "id": "567",
      "permissions": [
        {"type": "read"},
        {"type": "write"}
      ]
    }]
  }
Let's say I only want the name, the ID of each member, and whether permissions.read is set. I may want to do something like (Go):

  StreamingUnmarshal(b,
    func(keypath string, parse func() interface{}) {
      switch keypath {
        case "id", "members.id":
          value := parse()
          // ... use value ...
        case "members.permissions.type":
          if parse().(string) == "read" {
            // ...
          }
      }
    })
Random access could also work, as long as it didn't need to sequentially parse from the beginning of the data to get to the right value each time. Something like:

  id := GetKey(b, "id")
  memberIDs := GetKey(b, "members.id")
  permissionTypes := GetKey(b, "members.permissions.type")
The trickiest bit is treating arrays of nested structs (as in "members.permissions.type") correctly, although efficient scanning of keys becomes an important optimization point, too.

Probably the best method would be to store the keys pre-sorted at the beginning of the data, so that they'd better fit in the CPU cache, and have pointers to the offsets of the values:

  KEY1,KEY2,KEY3,VALUE1,VALUE2,VALUE3
Arrays of structs are tricky here, again, but this is solvable.


Have you considered using a sqlite file?

No, seriously: It doesn't require an external schema, supports every kind of indexing and fast random access you could want, is supported in like every language, OS, and architecture, has copious tooling, documentation, and community support, and is battle-tested across literally billions of installations worldwide.


No, that would not make any sense. SQLite cannot deal with structured, hierarchical data like the stuff I described in my comment.

Also, I am talking about individual documents that already live in a database such as PostgreSQL. I can't store an entire SQLite database in a single column.


> I can't store an entire SQLite database in a single column.

Sure you can, it's just a file. :) sqlite scales down nicely to data sets of just a few kilobytes -- if you're worried about parse time of your documents then I assume they are larger than that.

That said, if you're already loading the whole blob from a single row in Postgres anyway, then is random access such a big win? Or is the idea that you would selectively read byte ranges out of Postgres? If you're already pulling the bytes into RAM then avoiding the parse isn't that huge of a win.

(I say this as the author of Cap'n Proto which is all about zero-copy random access... it's only a big win in certain use cases, like mmap() or shared memory IPC.)


Well, many documents are several megabytes, but many are in the order of 100 bytes. I need a serialization scheme that scales up to large documents and down to tiny ones.

I can't imagine that the overhead of initializing an SQLite database from a small byte array in memory is that small, not to mention the overhead of maintaining the table schema.

Out of pure curiosity, I glanced at the Go bindings for SQLite, and there's no provision for initializing a database from a byte array, or accessing the raw underlying byte data of a live database. The C API supports implementing your own VFS for custom storage, but that's not supported by the Go bindings, and seems like a lot of work.

You're right about loading whole blobs; I was misremembering a little bit. The application in question already pares down the document keys in its queries to avoid sending everything. I'm in the middle of a research project into an alternative backend where the documents are stored as binary data, not JSON, and given a set of keys/keypaths, I want to do a little better than deserializing the whole blob.


Take a look at FlexBuffers, part of FlatBuffers. FlatBuffers itself is similar to CapnProto and requires a schema, but FlexBuffers is a related schema-less format that uses a bunch of the encoding techniques from FlatBuffers to gain advantages like (I believe) not having to parse unneeded substructures.


Thanks, that looks nice. But it looks like FlexBuffers is only implemented in C++ and Java.


I was looking into binary JSON formats recently (there are a ton), and there are some important differences to note. Especially for my application - you can't do what Amazon calls a "sparse read". Say I have a 10 GB JSON file and I only want to read one key, no dice. You have to parse the entire file.

"But, maybe some of these binary JSON formats are smarter!" you think. Well, almost all of them aren't. Only two are: BSON, and Amazon Ion. Unfortunately BSON limits the message size to 2 GB.

Amazon Ion is also the only format that actually deduplicates object keys. Definitely the most capable and well-designed of these formats. Unfortunately it is also the most complicated.

Sadly a lot of these formats make questionable choices, like storing numbers in big endian format (why?), using explicit `uint8`/`uint16`/`uint32` sizes rather than something like Protobuf's varint, etc. And there are also a load of them that are nearly identical. You really have to dig deep to find the critical flaws.


Binn also does binary length prefixing of all values, allowing you to skip through it. It's also a lot simpler than Amazon Ion and doesn't have the awful mistakes of BSON so it might do the trick for you. It's not at all popular though.

https://github.com/liteserver/binn

The reason most formats don't length-prefix everything is because it makes it costly to encode in both time and space. You have to basically encode a message inside-out to calculate the nested sizes of everything. This is going to be hugely slow and memory-intensive if you're encoding a 10 GB file, and it's useless for messages on the scale of kilobytes so there isn't any point. MessagePack on the other hand can be encoded in one pass from start to finish as long as you know the element counts of your maps and arrays beforehand.

> storing numbers in big endian format (why?)

Embedded processors tended to be big-endian, like older PowerPC and older ARM. These formats are designed for embedded so it (probably?) improved performance on those processors. This is less true now since virtually all modern ARM processors and probably most other embedded processors now run in little-endian mode.

Ultimately what it comes down to is that these formats are designed for the opposite of your use case. I don't know what you're using a 10 GB JSON file for but there must be a better storage solution for you than a schemaless serialization format.


> The reason most formats don't length-prefix everything is because it makes it costly to encode in both time and space.

Yeah this is true, except for BSON because it uses fixed-size length prefixes, so you can just go back and fill them in later. Presumably that's why they used fixed-size lengths. The downsides are it is less space efficient and limited to 2GB.

In any case Amazon make the very good point that formats are read more often than they are written. It makes sense to optimise for the read case.

> Embedded processors tended to be big-endian, like older PowerPC and older ARM.

Nobody uses PowerPC anymore, and ARM hasn't been big endian for ages. Also MessagePack isn't designed for embedded systems and it still uses big endian. I don't think that's the reason. I suspect it's from a misguided belief that "network byte order" still matters.

And I totally agree, a schema-based format makes way more sense for my use case - changing is difficult though.


I looked into Binn, but unfortunately it has a 2GB file size limit too.


I've skimmed over it and didn't see if there's a compelling reason to use this over CBOR, or vice versa.

Anyone have any insight?


They're almost the same. To see how similar they are, compare their implementation in nlohmann's JSON library for C++. They are both processed by the same class template, only some constants are different.


CBOR went to the trouble of being an IETF standard, RFC 7049. So, you know, the lovely thing about using standards!


CBOR also made a lot of changes to MessagePack making it far more complicated, both to use and to implement. I've talked about this on HN before so I'm repeating myself a bit but here's a short list:

- CBOR has two ways of encoding maps and arrays: fixed length and variable-length. This complicates decoders, especially those that would pre-allocate arrays and maps to the proper sizes, which significantly reduces decoding performance. The CBOR spec has nothing useful to say about this; it just requires you to allocate indefinitely.

- CBOR defines a canonical representation, including a key sorting order based on binary representation which is just awful. It requires multi-pass encoding which is slow, complex, error prone, and completely non-intuitive: [1,2,3] comes before 100000 which comes before [1,2,3,4].

- CBOR has more types in the core spec, ones that are extremely specific to certain applications or programming languages. It has a 16-bit float, and it has both null and undefined as separate types.

- CBOR defined a system of "tags" with a huge number of extension types. These are supposed to be optional, but of course they only work if both ends support them. Some features like BigNum are well-supported in some programming languages but not others, so CBOR implementations tend to diverge in supported message types.

CBOR as a standard is far worse than the "non-standard" MessagePack it purports to replace. Here's a great HN comment on it from another user (and another MessagePack library implementer) a few years back: https://hackertimes.com/item?id=14072598


All those gripes are optional features.

Your parser or encoder does not need to support indefinite arrays; that feature is clearly designed to be used with some practical limitations, like "I don't know how many, but let's assume fewer than x, and I'll send a STOP when I'm done". Canonical ordering is optional. Yes, it's a typed system that has more types; IDK what to say about that other than you don't have to use them. And yes, tags need to be supported on both ends, just like ANY DATA that is being transferred; compare that to a strictly schema'd system and I see no difference, except that you're only partially required to adhere to the plan.

Maybe msgpack is just objectively better because it has fewer features. IDK. Doesn't matter, because CBOR got an RFC and is actually popping up in places. If there was a competition, CBOR won, right or wrong.


I tried to work with MessagePack last year while teaching someone who was building it into their product. Absolute nightmare.


JSON is as pervasive these days as XML and despite its shortcomings is not going away for the foreseeable future.


These formats are great for private communication, but become a nice attack surface area if publicly exposed. I was using thrift on an old project and had to add a sanity check layer to make sure an attacker couldn’t just specify a valid request with a list set to have int.max elements.


But any parser needs validity checks of this kind, including the JSON one.

A common way to mess with JSON parsers is e.g. to nest a lot of arrays and objects. So you need a max nesting depth. There are a bunch of other ways to mess with JSON parsers, too.

(In the case of MsgPack lists and maps: a parser should only pre-allocate memory for a given size if it has done proper sanity checks, e.g. comparing the given length with the remaining byte length of the message. Alternatively you can simply not preallocate and instead, like in JSON parsers, grow your list on demand and just use the length to know when the list ends.)

But yes, you have to make sure the parser works for your use case.
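
For what it's worth, some libraries expose such limits directly. E.g. the Python msgpack library lets you cap sizes at decode time (a small sketch; the exact option names and defaults may vary by version):

  import msgpack

  data = msgpack.packb({"items": list(range(10))})

  # Reject messages that claim absurdly large collections before allocating for them.
  obj = msgpack.unpackb(
      data,
      raw=False,
      max_array_len=1000,
      max_map_len=1000,
      max_str_len=1024 * 1024,
  )
  print(obj)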


This is great, but, to complement this, does anyone know a better alternative to TCP? I've recently needed to make two processes communicate over the network and, while MessagePack handles the serialization, I found myself needing something higher level than TCP but lower than HTTP.

An annoyance of TCP was that I never knew whether I read all the data. I either read less and leave data unread, or I read more and end up blocking for a long time (or implement timeouts and get the worst of both worlds).

What's a good alternative? Maybe 0mq? All I need is to send some MsgPack bytes to another client over the wire, hopefully without having to guess whether there's more to read in the socket or not.


ØMQ is probably a good fit (I think nanomsg is dead?) but there are also a number of protocols at the TCP level that provide a message-sequence interface rather than a byte-sequence interface: SCTP is perhaps the best-known, and Plan9 IL is another (deprecated) solution.

For some applications, though, the simplest solution is to shut down one half of the TCP connection once you've finished sending your data. That's how rsh, finger, and HTTP/0.9 responses work, and it's a supported option in HTTP/1.0 and HTTP/1.1. Failing that, preceding each message with a byte count, a la netstrings, is fairly simple; or you can use SLIP-like or COBS framing.


Hmm, interesting, and 0mq is a bit heavy. Unfortunately I can't shut down the connection, as the server needs to provide real-time updates to the clients (it's pub/sub), but I'll look into SCTP, thanks.


There are a couple of ghetto ways to do pubsub. Webhooks is one, and it's often by far the easiest to implement, but in other cases it's impossible. "Long polling" is another: you open a connection and tell the server what you think the current state of a variable is, and the server just sits there with the connection open until that isn't the current state of the variable any more, at which point it sends you the new current state, or the delta from the state you had to the current state, and closes the connection. If you were wrong about the current state, this happens immediately. Again, though, there are pubsub cases where this works, and pubsub cases where the extra latency and kernel CPU of opening a new TCP connection for every message are intolerable.

So, to take the canonical concrete example, a chat channel might number the messages on it in a monotonically increasing order, and you might tell the server the channel name and the number of the last message you saw, at which point it sends you the messages since that point, if any, then closes the connection. As I understand it, this is how Kafka works, except for the connection-closing part.

In all probability, your life will be easier and your performance will be better with ØMQ, but these hacks are things that work reasonably well and are extremely easy to implement with off-the-shelf tech.

SCTP in many cases suffers from the fact that it doesn't run on top of TCP, so NATs don't know what to do with it. If you have enough control over your network that that isn't a concern for you, UDP with IP multicast is another plausible solution, the one TIBCO used originally IIRC; you can allocate a multicast IP address per pubsub channel or multiplex them. With IP multicast, recovery from lost messages is a concern, especially if 802.11 is part of your network (since 802.11 uses hop-by-hop ACKs for unicast packets) but there are a variety of reliable multicast protocols like SRM to handle that.

Feel free to hit me up for more info, I've been hacking around with different ways of doing pubsub since the previous millennium.


That's very informative, thank you for taking the time. Just so you have more context, this is what I'm using this in:

https://gitlab.com/stavros/itsalive

Clients can connect to the server and get updates for the commands that are currently running, which is not high throughput or complex from a networking perspective. I was wondering if there was something lightweight that will do the same, and 0mq seems like the best choice, but a simple loop over the connections seems to work well as well.

I played around with 0mq for this and it works great, but in this instance I might not want to add the extra dependency (especially since I've already implemented it, minus a bug where it'll block if a packet is exactly 4k).

I think adding an "end of message" character (eg a newline) would be the simplest thing to do in this instance.


Yeah, that sounds like the best choice. (That's the SLIP framing approach.) SCTP probably isn't viable if you want random people to be able to watch the presentation without rebuilding their kernels, and multicast IP isn't viable on the global internet. An IRC server would work fine, and you might even be able to just use a secret channel on Freenode, but some places block IRC because of other pubsub software that uses it.


Yeah, I wouldn't want to burden Freenode with that, but IRC is an interesting choice. I'll use the terminating character, thanks for your time!


Right, the benefit of using IRC is that you don't have to write the server; there are dozens of well-known, actively-maintained free-software servers, they're well-documented, and they already support epoll and kqueue and have reasonable ways of handling all kinds of pathological network conditions. But maybe a simple asyncio-based event loop, or even threads, would be fine for itsalive.


Do you mean just opening a port and listening for connections? And a client which connects to that port and sends/receives data? That's not very low level and it's pretty easy to do, if I'm understanding you correctly.


How do you guarantee that you've read the entire buffer without trying to read more and blocking?


Maybe I misunderstood, but if you add a length header to every packet, reading becomes trivial.
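
The whole thing fits in a few lines; a minimal Python sketch with msgpack payloads (the 4-byte big-endian prefix is just one reasonable choice):

  import struct
  import msgpack  # assumed: msgpack-python

  def send_msg(sock, obj):
      body = msgpack.packb(obj)
      sock.sendall(struct.pack(">I", len(body)) + body)  # length prefix, then payload

  def recv_exact(sock, n):
      buf = b""
      while len(buf) < n:
          chunk = sock.recv(n - len(buf))
          if not chunk:
              raise ConnectionError("peer closed mid-message")
          buf += chunk
      return buf

  def recv_msg(sock):
      (length,) = struct.unpack(">I", recv_exact(sock, 4))
      return msgpack.unpackb(recv_exact(sock, length), raw=False)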


It does, but if I can have a library handle that for me, that'd be better. Looks like 0mq is what I want, it also does pub/sub so it frees me from doing that myself.


The "Try!" demo fails with large JSON submissions with an exception in jQuery:

Uncaught RangeError: Maximum call stack size exceeded

I'm curious to see the savings difference and hoped to with "Try!" but it'll have to wait.


Slightly related: I recently wrote a binary encoder/decoder that used bitpacking, delta-encoding, and other well-known 'tricks' to efficiently pack batches of 10k-100k events of the same uniform type (essentially encoding column by column and using similarities to my advantage). Nothing too fancy, but it was A) a huge success in terms of compression ratio and B) a hassle to write. Do any, more or less, turn-key solutions exist for this? Specifically targeting Node but a command-line util might work as well.


I had a play with something similar and I noticed that -- at least for my use case -- you can get ~90% of the gains through just a couple of simple tricks:

1) Identify medium-scale similarity boundaries in the data structures. E.g.: a sequence of messages in a protocol, such as a C "struct" with a bunch of fields.

2) Compute the binary difference between these structures so that most of the subsequent bytes after the first message are either zeroes or small numbers. Both the sender and receiver have to keep the previous message in a buffer to allow this.

3) Use a high-performance compression algorithm that supports "user provided dictionaries", such as Zstandard. Train it with sample data.

The above is surprisingly straightforward because it doesn't require complex changes to the underlying data structures. You don't even necessarily need to be able to parse the data at all, as long as it has large-scale repeating structures that you can identify.
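
A rough Python sketch of steps 2 and 3 (assuming the python-zstandard package; the byte-wise XOR delta is just one simple way to do step 2, and the samples here are stand-ins for real captured messages):

  import zstandard as zstd  # assumed: the python-zstandard package

  def delta(prev: bytes, cur: bytes) -> bytes:
      # Byte-wise XOR against the previous message: unchanged bytes become zeroes.
      prev = prev.ljust(len(cur), b"\0")
      return bytes(a ^ b for a, b in zip(prev, cur))

  # Stand-in sample messages; in practice, use real captured traffic.
  samples = [(b"sensor=%d temp=%d hum=%d " % (i % 8, 20 + i % 5, 40 + i % 7)) * 32
             for i in range(200)]

  # Train a shared dictionary offline; both sender and receiver must ship it.
  dictionary = zstd.train_dictionary(16 * 1024, samples)
  compressor = zstd.ZstdCompressor(dict_data=dictionary)

  prev = b""
  for msg in samples:
      wire = compressor.compress(delta(prev, msg))  # receiver needs the dict and the previous message
      prev = msg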


Agreed. It really blows generic compression algos out of the water. Didn't know about Zstandard dictionary encoding. This might just be what I'm after. Thanks


https://medium.com/unbabel/the-need-for-speed-experimenting-...

> It’s that it is not enough to just know some new cool technology, nod along and go about your day with your assumptions unchallenged. You need to find out more, test it out, have a grasp before committing to it, and, if you’re lucky, learn a thing or two in the process.


This reminds me of a quite popular mobile game that sends master game data using messagepack which is then gzipped and then encrypted and then base64 encoded in a json response. Fun.


That reminds me of one of my first contracts, 10 years ago, for an Android mobile app. It was an app made to quickly record something using video, audio or picture, which would then be uploaded through their API. The API worked essentially like you said: it converted the binary file to base64, added it to a JSON document, which was put in a GET variable (thus urlencoded) and sent over an HTTP connection (I don't remember if it was HTTPS or not though, I hope it was, but at the time I could have ignored that part). Android didn't allow more than 16 MB of memory for an app, so I had to build streams to handle each step individually, which was an interesting challenge. I was amazed to find out that there were officially supported Base64 streams, but their API strangely didn't accept the default Base64; it only accepted the URL-safe variant (which replaces a few characters with other ones), so I had to add another stream on top to do this.


I once used an API where to send a command, you would make a POST where the message body was an x-www-form-urlencoded dictionary where one of the values is a binary string which is a zipped XML document.

It was very clearly just a page on the vendor's site where you could manually upload a zip of documents, which they had simply declared to be an API.


You can't beat JSON by very much with a format that has the same free-form structure. (Particularly when you use gzip, zstd, ...)

Parsing code has to be branchy to handle many different possible structures.

If you want extreme performance, variable-length strings are a problem -- the old mainframes that had fixed-length "HOLLERITH" strings had a good idea. I like just about everything about Apache Arrow except that it ignores the problem of fast/portable string handling.


The serialization format here was designed so you can parse it (or detect a parse error and bail out) in a single O(n) pass through the message data. No backtracking required.

(You mentioned strings, so I'll use that to give flavor: a string is represented as a one-byte prefix, followed by a byte length, followed by that many bytes of UTF-8. I'm not sure whether you'd categorize this as fixed or variable length, but that's how it's represented on the wire.)
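
You can poke at this interactively with the Python library (the byte layouts are the string formats from the MessagePack spec):

  import msgpack

  print(msgpack.packb("hi").hex())          # 'a26869' -> fixstr: 0xa2 encodes "str, length 2", then UTF-8
  print(msgpack.packb("x" * 40).hex()[:4])  # 'd928'   -> str8: marker 0xd9, then a 1-byte length (0x28 = 40)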


Well, you can't always beat it much from a size perspective, but MessagePack can be an order of magnitude faster to serialise/deserialise than JSON.


It is somewhat faster, but other formats are even faster than that.


In all the cases I've used it, it was a lot faster than JSON.

Other formats might have been faster, but MessagePack was very easy to use.


Why would one use msgpack over flat buffers? AFAIK FlatBuffers offers an amazing zero-copy feature, meaning you can get to the data of interest without having to parse the whole object into memory. Since every type is length prefixed you can jump around indexes quickly and access only what's needed.

It seems msgpack still needs a deserialize step to partially read data, right?

Netflix uses flatbuffers and it works wonders on low-powered devices.


Obligatory Cap'n Proto[1] reference whenever serialization format discussions crop up.

[1] https://capnproto.org/index.html


Capnproto is schema based rather than self-describing. So it's related, and it does compete with formats like json and msgpack, but there isn't total overlap in the applications.


It would be nice to have a taxonomy, or at least a bestiary, of serialisation formats. Tentatively:

Schema-driven no-compromise fast compact binary formats with no cross-version compatibility: Cap'n Proto, FlatBuffers, SBE, ASN.1 PER, XDR, OMG CDR.

Schema-driven binary formats which allow some cross-version compatibility: Protocol Buffers, Thrift.

Self-describing binary formats: MessagePack, CBOR, BJSON, Bencode, ASN.1 BER, Avro (?), Fast Infoset, AMF3.

Self-describing textual formats: JSON, XML, YAML, TOML.

I'm using "self-describing" here to mean simply that you can recover the structure of the encoded data without a separate schema, rather than that you can attach any semantic meaning to it.


> no cross-version compatibility: Cap'n Proto, FlatBuffers

This is incorrect: Cap'n Proto absolutely allows cross-version compatibility, using roughly the same semantics as Protobuf. I believe FlatBuffers does too. (I'm unsure about the rest, haven't studied them in a while.)

> I'm using "self-describing" here to mean simply that you can recover the structure of the encoded data without a separate schema, rather than that you can attach any semantic meaning to it.

Protobuf, Cap'n Proto, and probably several of the other binary formats can parse data into a message tree without the help of a schema, but all the fields will be labeled numerically. MessagePack is only considered "self-describing" in comparison because it encodes human-readable field names on the wire.


Needs honorable mentions to:

- Avro

- CBOR

- SMILE

And BSON, anyone? I don't think many people besides MongoDB use it, though.

Yes, compressing JSON with a gzip-style compressor usually yields 0.5-1% better results than an equally compressed binary format (in my limited testing). Still, the serialization speed and savings on compression are great to have.


I implemented this in Swift w/o using the Foundation library a few years back if anyone wants it: https://github.com/wittedhaddock/bytepress


OK, I give up... how is that two-column list of languages organized? Why is the Perl implementation [1] not listed?

[1] https://metacpan.org/pod/Data::MessagePack


So not at all like JSON, then?


But is it readable as plaintext? That's one of the main appeals of json.


I used to care about that but once the JSON gets at all complex, it's nice to switch to pretty formatting of it. At that point, a dev tool in the browser could do the same for this format so I don't think that is as pressing (for me and I suspect most people).


Actually, I think that's no longer the case.

(Except if you want to use JSON for configs, which I strongly recommend against for a bunch of reasons, including missing support for comments.)

I mean, by now, nearly every time you send things over the wire they are encrypted or at least compressed. So inspecting on-wire messages without "proper", more complex tooling doesn't really work. But if you already have more complex tooling involved, there is no reason why you need to be able to read the raw message; it could just be converted on the fly into a readable format.

The same applies to application development, e.g. logging: you always do some formatting/conversion when logging data, so it's not a problem to convert msgpack to a human readable format. And today, logging from e.g. servers should always go to some form of log server which adds features like searchability by indexing the log and similar. So no problem to have a msgpack viewer there, too.

Honestly, I believe the only reason human readable formats made (and sometimes still make) sense was limited tooling, sometimes caused by limited computation power on the developer's system. And the fact that many binary formats are totally over-engineered, making handling them painful, especially debugging slightly corrupted data. Which isn't the case for MsgPack.


Compression is not encryption, and security by obscurity is bad practice. So it doesn't matter if the data is JSON or compressed or Protobuf or MessagePack or a custom struct. As long as it's not properly encrypted, it's just plaintext.


I don't think the parent poster was confusing encryption for compression or suggesting security by obscurity. He just meant that JSON is never sent in pretty-printed plain text over the wire, so you always need tooling to view it.

This has been my experience with JSON as well. Almost all JSON in the wild especially in web services and RPC is at least minified, so you need to pass it through something like `python -m json.tool` to reformat it for viewing. So you might as well use MessagePack and pass it through `msgpack2json -d` to view it instead. It makes no difference whether the underlying format is human readable.
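
As a rough sketch of that point (assuming Python with the msgpack package installed; the payload here is just illustrative):

    import json, msgpack

    payload = {"user": {"id": 7, "roles": ["admin", "ops"]}}

    wire_json = json.dumps(payload, separators=(",", ":"))  # minified, as it usually goes over the wire
    wire_msgpack = msgpack.packb(payload)                   # binary

    # Either way, you reformat before a human looks at it:
    print(json.dumps(json.loads(wire_json), indent=2))
    print(json.dumps(msgpack.unpackb(wire_msgpack), indent=2))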


Sure doesn't look like it..

I agree, that's why JSON is honestly great. I avoided it for the longest time, but now I totally see the appeal.


What did you avoid it in favor of? XML?


It's not. IMO it's a good replacement for ProtoBuf.


Maybe where you want to have a human debug or see what is happening (like in localStorage, or cookies). But if you're going to transfer a significant amount of data between front and back-end, for example, that can save money on a cloud service.

You can easily make a function wrapper that uses JSON in a dev environment and MessagePack in Production, for example.
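
Something along these lines, as a minimal sketch (assuming Python with the msgpack package; the APP_ENV variable and function names are made up for illustration):

    import json
    import os
    import msgpack

    USE_MSGPACK = os.environ.get("APP_ENV") == "production"

    def encode(payload):
        # Compact MessagePack in production, human-readable JSON everywhere else.
        return msgpack.packb(payload) if USE_MSGPACK else json.dumps(payload).encode()

    def decode(raw):
        return msgpack.unpackb(raw) if USE_MSGPACK else json.loads(raw)

Just make sure both ends agree on the same switch, or you'll be debugging some very confusing payloads.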


If you want to transfer a significant amount of data and don't need readability, using protobufs makes even more sense, since the key names do not go as strings on the wire, so the data should be much smaller.


MessagePack lets you use integers for keys, so you can use enums or integer constants in the code instead of strings. There are good reasons to use MessagePack over Protobuf even if you don't need readability, such as easier integration into buildsystems and better support for embedded platforms.
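
A quick sketch of the integer-key idea with the Python msgpack package (the field numbers here are made up):

    import msgpack

    FIELD_ID, FIELD_NAME = 1, 2  # hypothetical field numbers instead of string keys

    packed = msgpack.packb({FIELD_ID: 42, FIELD_NAME: "alice"})
    print(packed.hex())                                    # no key strings on the wire
    print(msgpack.unpackb(packed, strict_map_key=False))   # {1: 42, 2: 'alice'}

You do lose self-description that way, of course; at that point the field numbers are effectively a schema living in your code.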


It's taking a different angle from JSON: encoding it in JavaScript is done from standard objects into a Uint8Array.


I like it! Sort of the unholy love child of XDR and JSON :-).

Now we need an IDL that will let you define a structure and have it produce <language> marshalling and unmarshalling routines.


Missing links to benchmarks. Also, JSON is not really a good baseline for comparison. If you support binary data natively in the format, you should compare it to BSON.


BSON is just terrible. Just as a taste, arrays are encoded as key-value pairs, where the key is the array index converted to a decimal string and then stored as a zero-terminated string with a 4-byte length prefixed to it. It boggles the mind why somebody would design a format like this.
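
For anyone curious, a rough way to see this from Python (assuming pymongo's bson package, which provides bson.encode() in pymongo 3.9+, with msgpack for contrast):

    import bson      # ships with pymongo
    import msgpack

    doc = {"arr": list(range(10))}

    as_bson = bson.encode(doc)
    as_msgpack = msgpack.packb(doc)

    print(len(as_bson), len(as_msgpack))  # BSON is noticeably larger for the same data
    print(as_bson)                        # the decimal index keys '0'..'9' are visible in the raw bytes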


Is there any comparison of the tooling/performance/etc of MessagePack, BSON, Protobuffers, Flatbuffers, and plain JSON?


I did some pretty extensive benchmarking of various schemaless serialization libraries a few years back. All of these libraries have advanced quite a bit since then so it's a bit out of date, but the relative speeds of MessagePack vs JSON and BSON are probably still relevant:

https://github.com/ludocode/schemaless-benchmarks

I haven't compared them to schema formats like Protobuf or FlatBuffers yet because the use cases are pretty different. I like MessagePack for small projects or rapid prototyping because you don't need to integrate any big libraries or set up code generation as part of your buildsystem. (Mostly I got sick of integrating the C++ Protobuf library into embedded projects.)

The MessagePack format is a lot simpler than Protobuf and the best implementations are nowhere near as allocation-prone as the reference implementation so I expect they would beat it flat out on performance, though the messages may be slightly larger. They would probably beat FlatBuffers for encoding speed as well, but I don't expect any schemaless format could beat FlatBuffers for decoding speed.


At noesis.gg we made a small, not-at-all scientific comparison of our JSON and flatbuffers implementations: https://www.noesis.gg/news/player-movement-speedup.html.

The somewhat silly video on that page shows the actual difference in performance our users felt after the change. It was a _huge_ benefit, both in terms of loading time, but also in terms of memory, vastly increasing the number of CS:GO rounds that could be analyzed simultaneously.


Lots... however, your best bet is to search for your language, platform, and use case. Performance can vary widely by platform and language: JSON is actually faster on some platforms or in some libraries, binary options faster in others. Connection interface and overhead are also an issue.


It's for a specific library of course, but there are some numbers here:

https://github.com/neuecc/MessagePack-CSharp


Also Avro and Thrift, please!


It also supports binary data in strings, unlike JSON. This means no more slow base64 encoding/decoding.


Not sure why you are being downvoted; sending binary data is a well-known issue with JSON, and base64 can be pretty costly if you're using a naive implementation.

In a previous life I inherited a service that shipped tons of data (billions of requests a week) as base64 encoded protobuf strings over HTTP. It was a bad solution in so many ways, but there were historical reasons why it had gotten there. The system required a number of servers and I decided to do some profiling to see if there were some quick gains that could be made. As it turns out, about 60% of CPU time was spent decoding base64 using Python's standard library. I was shocked.
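
For a feel of the size overhead alone (speed will vary a lot by implementation), a quick sketch assuming Python with the msgpack package:

    import base64, json, msgpack

    blob = bytes(range(256)) * 64  # 16 KiB of binary payload

    as_json = json.dumps({"blob": base64.b64encode(blob).decode("ascii")}).encode()
    as_msgpack = msgpack.packb({"blob": blob})

    print(len(as_json), len(as_msgpack))  # base64 adds roughly 33% on top of the raw bytes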


Apologies if I missed it, but what are the advantages of this over Bencode?


Size and speed comparison between JSON.zst and MessagePack please.


Why only compress the JSON? MessagePack will compress about as well as JSON does.


The sales pitch for MessagePack is: like JSON, but fast and small.


Correct: Uncompressed MessagePack is faster and smaller than uncompressed JSON, and compressed MessagePack is faster and smaller than compressed JSON.

I still don't see why you'd compare uncompressed MessagePack to compressed JSON.
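
If anyone wants to check on their own data, a quick and unscientific way to get numbers (assuming Python with the msgpack package; zlib stands in for whatever compressor you actually use, and the sample data is made up):

    import json, zlib, msgpack

    data = {"users": [{"id": i, "name": f"user{i}", "active": i % 2 == 0}
                      for i in range(1000)]}

    as_json = json.dumps(data).encode()
    as_msgpack = msgpack.packb(data)

    for label, raw in (("json", as_json), ("msgpack", as_msgpack)):
        print(label, len(raw), len(zlib.compress(raw)))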


Because I suspect that MessagePack, compressed or not, is not worth the effort, and I won't know until I see a comparison.

In other words, compressed MessagePack is probably only a tiny amount smaller than compressed JSON.


Sure, a comparison between compressed JSON vs. compressed MessagePack is interesting.

I interpreted your original message as requesting a comparison between compressed JSON vs. uncompressed MessagePack, which didn't make sense (but which I see people ask a lot, including elsewhere in this thread). Sorry if I misunderstood.


>I interpreted your original message as requesting a comparison between compressed JSON vs. uncompressed MessagePack

You interpreted it correctly, and it makes sense: MessagePack is an alternative to JSON, so I'd compare it against JSON.zstd if what I need is compactness.


Bit of a shameless plug, but yet another alternative is VTON, though it is typeless.

https://github.com/scandum/vton


I'll throw my hat into the ring as well.

I've been building a new ad-hoc data format to replace JSON for a couple of years [1], and am nearing completion of the reference implementation in Go. It natively supports the following types:

* Nil : No data (NULL)

* Boolean : True or false

* Integer : Positive or negative, arbitrary size

* Float : Binary or decimal floating point, arbitrary size

* Time : Date, time, or timestamp, arbitrary size

* URI : RFC-3986 URI

* String : UTF-8 string, arbitrary length

* Bytes : Array of octets, arbitrary length

* List : List of objects

* Map : Mapping keyable objects to other objects

* Markup : Presentation data, similar to XML

* Reference : Points to previously defined objects or other documents

* Metadata : Data about data

* Comment : Arbitrary comments about anything, nesting supported

But the most important feature is that it is a paired format: a binary format [2] and a text format [3], which are 1:1 compatible. This allows you to transmit in the binary format, and only convert to text when a human is involved.

I've put together a quick comparison here: https://github.com/kstenerud/concise-encoding#comparison-to-...

Currently, I'm finishing off the Go implementation [4], which so far I've managed to get running 30% faster than the JSON codec, using less than half the memory. I'll be pushing the binary codec to master soon, and the text codec shouldn't take much longer since the code is pretty modular.

[1] https://github.com/kstenerud/concise-encoding#concise-encodi...

[2] https://github.com/kstenerud/concise-encoding/blob/master/cb...

[3] https://github.com/kstenerud/concise-encoding/blob/master/ct...

[4] https://github.com/kstenerud/go-cbe/tree/new-implementation


Why add special support for URIs and not any other types that can be easily represented as strings, e.g. UUID or ISBN?


It's mostly a question of how common they are. Both UUIDs and ISBNs can be represented as URIs. However, I'm still on the fence regarding UUIDs, because they tend to get a lot of use in general... I may add it after all.

The idea is to make a format for the 80% case, so things like ISBN are definitely out.


Explicitly supporting URI has a performance cost with no clear size win. Supporting UUID at least could win back some bytes.


Oh, it's turd polishing anyway. Numbers in JSON are a pain.


The data types that msgpack supports aren't quite the same as JSON -- msgpack has encodings for signed and unsigned 8, 16, 32, and 64-bit integers, as well as single- and double-precision floats. (This runs into trouble if you're dealing with JavaScript, which doesn't support the full 64-bit range of integers, but everybody else should be okay.)
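
You can see the variable-width integer encodings directly with the Python msgpack package:

    import msgpack

    for n in (5, 300, 2**40):
        print(n, msgpack.packb(n).hex())
    # 5             -> 05                  (positive fixint, 1 byte)
    # 300           -> cd012c              (uint 16)
    # 1099511627776 -> cf0000010000000000  (uint 64)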



