MessagePack: like JSON, but fast and small (msgpack.org)
326 points by signa11 on March 10, 2020 | 379 comments


> MessagePack: It's like JSON. but fast and small

and complete.

Both are minimal self-describing data serialization formats, but JSON is incomplete. It's missing a type for one of the fundamental types: byte blobs. It also can't represent parts of the float domain.

Which means there are a lot of inconsistent ways to hack that in, e.g. base64-encoded strings. But then it partially loses its property of being self-describing (same if you allow numbers as strings, e.g. "-12").

Just to be clear, I'm not saying JSON should have raw bytes directly in it. But a "native" base64 string type in addition to string, number, null, list and map would help. E.g. `{ "bytes": b"YWJjZGU=" }`
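
To make the hack concrete, here is a rough sketch in Python (stdlib json and base64; the third-party msgpack package and the field name are just assumptions for illustration):

  import base64, json
  import msgpack  # third-party: pip install msgpack (assumed available)

  payload = b"abcde"

  # JSON: bytes have to be smuggled in as a base64 string; nothing in the
  # document itself says this string is binary data rather than ordinary text.
  as_json = json.dumps({"bytes": base64.b64encode(payload).decode("ascii")})
  # -> '{"bytes": "YWJjZGU="}'

  # MessagePack: the bin type tags the value as raw bytes, so it stays self-describing.
  packed = msgpack.packb({"bytes": payload})
  assert msgpack.unpackb(packed)["bytes"] == payload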

Wrt. floats, it's missing +/- Infinity. NaN is more like an error variant, so it's ok-ish for that to be missing, but then again, why not have it.

Also, for completeness it would be better to differentiate between int and float, as floats are imprecise due to rounding errors.

----

PS: I don't like `null`, but having it in this kind of data serialization format is still required. However, the difference between a field not being there and it being `null` can be a mess, and not all tools handle it well.

PPS: Both JSON and MsgPack can be used for non-self-describing serialization, e.g. by serializing a record (fixed number of fields in a known order) as a list of values instead of a mapping. But both are focused on enabling self-describing serialization.

EDIT: PPPS: Yes, it's a very weak form of self-describing in use here; there are systems which are much more self-describing, e.g. XML plus an XML Schema linked from the XML document, but those also tend to be far more complex.


JSON is also missing a date type. Sure, you can use a string serialization but that's ambiguous.

It's also non-extensible, so the types it's missing cannot be added without assumptions or the overhead of convention.

Transit is a good solution to this.


THIS! Not having a defined Date format has caused so much confusion with consumers of our API since every json API might have a different ISO date format (or some homegrown madness).


In APIs I've integrated in recent years, I've seen:

1. ISO UTC
2. ISO local
3. Epoch
4. 2018-04-20
5. Jan 01 2016 11:58am

Probably others that I can't recall. It's a damn mess.


PHP's strtotime() will probably handle them all, even though it will make you cringe a bit.

I never understood why simple integer Unix timestamps are not more prevalent in APIs. Or some monotonic count from any reasonable epoch, depending on the context. How many APIs ever really have to return dates predating the Unix epoch?


`/Date(1234567890)/`

That will probably cause a twitch in any dev that used ASP.NET up until a few years ago.


I had to deal with that two weeks ago. Cringe.


I’m twitching away...


Wow :(


To be fair, 1, 2 and 4 are all just ISO.


That's true. But 2 is substantially less useful than 1 and I encountered 4 mixed with 3.


Why wouldn't you just use epoch time?


aside from the other issues mentioned here, alternative encodings like ISO8601 and others are human-readable and human-editable


Epoch doesn't carry timezone information.


And it better not. Timezone information has no place in a timestamp.

A timestamp is a point in time, whatever the time is in Paris or Tokyo. It is an abstract value and it is way better this way.

A timezone is a filter through which you display your timestamp, and it tells you what your local time was when that timestamp occurred.

So yeah, always store and exchange time information as timestamps. Timezone is extra styling information, just like CSS.
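
As a concrete sketch of that model in Python (stdlib only; zoneinfo needs Python 3.9+ with tz data available): one stored epoch value, rendered through different timezone "filters".

  from datetime import datetime, timezone
  from zoneinfo import ZoneInfo

  ts = 1583862185  # one point in time, stored/exchanged as a plain integer

  print(datetime.fromtimestamp(ts, tz=timezone.utc))              # 2020-03-10 17:43:05+00:00
  print(datetime.fromtimestamp(ts, tz=ZoneInfo("Europe/Paris")))  # 2020-03-10 18:43:05+01:00
  print(datetime.fromtimestamp(ts, tz=ZoneInfo("Asia/Tokyo")))    # 2020-03-11 02:43:05+09:00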


This is totally wrong, and a very dangerous way of handling date/time issues. If you set a wake-up alarm for 7am, you want that to go off at 7am in your local timezone, not 7am in whatever timezone you set the alarm in (which is what an ”absolute” time stamp would store).

There are lots of examples like this. Some time events you want to happen at absolute time points (in which case time zones are only for display purposes), but very often your events need to be ”time-zone aware”. There is no universal rule, and thinking that there is will lead to all sorts of trouble.


It's not wrong. What you're describing is not a timestamp.

A timestamp is a point in time, and is the same everywhere (excluding relativistic effects).

What you describe is definitely a valid use case (and having developed software that programs hardware that is controlled by a calendar, I have painful experience in this). However, it's not a timestamp. I'd call it time-of-day or wallclock. Some systems like SQL refer to it as simply "date" and "time". Whatever you prefer to call it, it's not a timestamp.


Could you add something saying explicitly what a timestamp _is_ under your terminology? Is it an integer offset from the start of the Unix epoch?


It's an abstract concept. It's nothing more than a given point in time.

How you represent it doesn't change the definition. It could be stored as the number of seconds since some arbitrary point in time, such as Unix timestamp, or a Julian timestamp.

Or, it could be a given time and date combination with some fixed point of reference, such as UTC. I'm sure we all have favourite ways of representing timestamps.


Ok, one thing — perhaps not the most important — is that “timestamp” is a very unfortunate terminology for the abstract concept you describe. The word timestamp very much suggests that it’s referring to an explicit representation of some sort. “timepoint” would be much better for what you’re talking about. Not a criticism of you of course! I mostly write python at work so I rarely am directly exposed to the integer offset values, but I know many people use the word “timestamp” to refer to that integer, as opposed to a formatted string. Which also seems like unfortunate terminology, since what is a stamp if not formatted?


You are assuming that timezones don't change ever in relation to UTC. This is wrong. Timezones change all the time.

When I set an alarm for any time in CEST in 2022 and convert this to UTC before saving it, it will very likely ring at the wrong time, simply because CEST will probably not exist by then and be replaced by CET due to the EU getting rid of DST.


But an alarm is not supposed to use timestamps. An alarm is set for a certain time in a certain location. In other words, a wallclock time. Not a timestamp.


No, timezone information can be critical and assuming you can always strip it is just wrong.

You might not have encountered such scenarios but they are very much out there.


> No, timezone information can be critical and assuming you can always strip it is just wrong.

Expressing time based on the standard reference timezones (i.e., UTC) is not stripping the time zone away.


It is. Timezones change all the time in relation to UTC. If you set a date in the future and think converting it to UTC doesn't lose information, you will be surprised when you get bitten by it.

Example: The EU will most likely get rid of DST around 2022. Any time you set beforehand in CEST or CET will be an hour off, depending on which one the EU gets rid of. Maybe the EU doesn't get rid of DST and keeps it, you don't know. So unless you have a time machine, you cannot convert the time to UTC.


> A timestamp is a point in time, whatever the time is in Paris or Tokyo. It is an abstract value and it is way better this way.

> Timezone is extra styling information, just like CSS.

This is an incredibly reductive view of time data and the possible applications that use it.

Not all time data is timestamps, and UTC timestamps strip out information that is not "styling".


> Epoch doesn't carry timezone information.

Dates should not be encoded with time zone info anyway, as timezones are context-dependent. Dates should be encoded in UTC, and clients should then interpret them according to their context.


What do you do when timezones change? This happens all the time around the world.


Epoch time is INHERENTLY anchored to a reference point: it's seconds since January 1, 1970 (midnight UTC/GMT). It'd be useless if it was just "you know, midnight _somewhere_ on Jan 1 1970... pfft, it was a long time ago, who cares".


When you go one hour back for daylight savings, you can't tell with just epoch if it's the first or second time you're at that time. With timezones you can, since it switches between PST and PDT.


This is fundamentally incorrect. Because UTC doesn't change, the two "identical" times are different UTC values.


What happened when you set an alarm for 2022 in CEST? The EU will very likely get rid of DST by then.

Your alarm will be an hour off when you convert to UTC beforehand.


I don't know that much about epoch, but isn't the point of it to be completely independent of stuff like timezones or daylight savings? Doesn't it track every second since 1970 no matter if meanwhile time jumped back or forward in some countries?


UTC has leap seconds, but otherwise this is correct.


1:30PST and 1:30PDT bijectively map to distinct UTC timestamps.
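
A small stdlib-Python sketch of that, using the 2019-11-03 fall-back in America/Los_Angeles as the ambiguous local time (zoneinfo needs Python 3.9+ and tz data):

  from datetime import datetime
  from zoneinfo import ZoneInfo

  la = ZoneInfo("America/Los_Angeles")
  # 01:30 occurs twice that night; `fold` selects the first (PDT) or second (PST) occurrence.
  first = datetime(2019, 11, 3, 1, 30, tzinfo=la, fold=0)   # PDT, UTC-7
  second = datetime(2019, 11, 3, 1, 30, tzinfo=la, fold=1)  # PST, UTC-8

  print(second.timestamp() - first.timestamp())  # 3600.0 -- two distinct epoch values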


1. It is lower fidelity (seconds vs milliseconds). 2. It's also not a type. How do you distinguish epoch from number?


> It's also not a type. How do you distinguish epoch from number?

I dunno, how do you distinguish "April" (the month) vs "April" (someone's first name) or 18 (kg) vs 18 ($) vs 18 (-th of the month) vs 18 (page number)?

> It is lower fidelity (seconds vs milliseconds).

This isn't a problem if you have a conforming implementation (2^52 milliseconds is over 100'000 years), but as nitrogen pointed out, you do apparently have to worry about that.


What the original poster means is that if I go JSON.parse(...) then how does that know to give me an object with a Date type instead of a Number? Answer: It can't.

This is a frequent gotcha with Typescript, where even if your type is declared with a Date field, Javascript won't care when it deserializes it, as that type information is all erased and not available at runtime.

(Also, Number in JS being floating point and all, it lacks the integer precision for high resolution timestamps - if you start serialising 64 bit timestamps and expect a JS-style runtime to do good things with them it doesn't end well.)
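
The same gap exists with any schemaless JSON parser. A stdlib-Python sketch of re-attaching the type after parsing (the "time" field name and its seconds-since-epoch convention are assumptions for illustration):

  import json
  from datetime import datetime, timezone

  raw = '{"time": 1583862185, "name": "tom"}'

  plain = json.loads(raw)
  # type(plain["time"]) is int -- the parser cannot know this is a timestamp

  def attach_dates(obj):
      # convention applied by hand, outside of the format itself
      if "time" in obj:
          obj["time"] = datetime.fromtimestamp(obj["time"], tz=timezone.utc)
      return obj

  typed = json.loads(raw, object_hook=attach_dates)
  # type(typed["time"]) is datetime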


> how does that know to give me an object with a Date type instead of a Number

Ah, thanks, that makes sense; I'd forgotten that Javascript had a built-in Date type. (Strictly speaking, then, you ought to be able to write something like `{"time":new Date(...)}`, but that obviously doesn't work in practice.)


Since Typescript is mentioned, might as well mention the awesome io-ts[1] library. It gives you both runtime validation and static type safety with very simple syntax.

[1]: https://github.com/gcanti/io-ts


> I dunno, how do you distinguish "April" (the month) vs "April" (someone's first name) or 18 (kg) vs 18 ($) vs 18 (-th of the month) vs 18 (page number)?

Types! Extensible types. Like I said above, Transit is a good solution to this. With a fixed set of types, arbitrary data will always have ambiguity.


1. You could use a float. 2. Probably in the documentation of whatever it is that's using it


Floats are ugly because the distance between any two successive numbers is not the same. This makes them inappropriate for discretized time and anything accounting-related (money).


Epoch floats and doubles are not good for timestamps, as the further we get from 1970 the less precise they become.

Current precision with a 32-bit float (JSON/JavaScript usually use 64-bit, though outside JS it's common to use BigDecimal, in slight non-compliance) is much worse than one second.


> in the documentation of whatever it is that's using it

That's the point, timestamps should be self-describing. If "look up the structure in documentation" suffices we might as well just use protobufs.


1. YUK 2. ISO 8601 is a better format for this.


That's a timestamp, not a date.

Maybe OP meant timestamp, but that's not what they wrote.


JSON is also missing an integer type.

Read the spec. Basically all it says is that the numeric type is a float, but doesn't say anything about its precision.

I'm amazed that there haven't been any security vulnerabilities found yet that takes advantage of different number formats in different JSON implementations.

You'd think that a data serialisation format should get at least numbers right. All the ones that came before (and after) did, yet we're still stuck with JSON.


Is that like saying it doesn't have a username type? How could you ever know this bit of data is a username if it doesn't have a username type?


I think there is a rather large difference between not having a username type vs. not having an integer type.

As it turns out, the only way you can reliably store an integer in JSON is to use a string field. This is incidentally also the correct way to store a username in JSON.


And comments. Not being able to put comments in JSON files is beyond stupid.


JSON is a binary computer-to-computer interchange format. No computer-to-computer interchange format has comments.

The problem with JSON is that it's just human-readable enough that people think it's a config file format. It's not.


The problem with JSON is that it's just human-readable enough that people think it's a config file format. It's not.

That attitude is misguided. Many people do in fact use JSON as a config file format, so it is a config file format. Real world usage outweighs prescriptions about what is and isn’t correct.

Looking at the first standardized JSON spec, it’s not particularly prescriptive about usage, simply saying:

JSON is syntax of braces, brackets, colons, and commas that is useful in many contexts, profiles, and applications.

Of course, it would be a better config file format if it supported comments, but people use it even despite that. I think we should try to understand the reasons for that rather than just telling people they’re doing it wrong.


The reason for that is that there is often a JSON parser library available for any language (especially JavaScript) but no ini or YAML config parser. So people use JSON because it's already easily accessible.

JSON is one of the worst config formats, even worse than the Apache or nginx config formats. Not being able to simply comment out a config line temporarily for testing purposes, or to leave an explanatory comment for a special setting, is simply bad.


Having worked with a bit of YAML, RAML (which is a superset of YAML), and JSON-style configs/schemas... I'd take JSON over YAML if only because it seems to be more expressive.

Still, everyone forgets HOCON [1] is a thing, which solves many of the problems expressed here about JSON -for configuration-. It's easy to clearly specify things like time, reference other parts of the config, or if you want to change just one value, you can add that to the 'end' of the HOCON file and be GTG.

[1] - https://github.com/lightbend/config/blob/master/HOCON.md


HOCON is meant to be a friendlier JSON, looks like? At first glance, it has some nice ideas but takes it way too far -- just the table of contents on that repo looks longer than a summary of JSON in its entirety.

Skimming through it, concatenation of unquoted values is where it goes off the rails for me. This is quite the gotcha: https://github.com/lightbend/config/blob/master/HOCON.md#not...

My wishlist for a friendlier JSON would be, in priority order:

- Allow trailing commas

- Comments

- Allow newlines instead of commas

- Allow unquoted dictionary keys

And I think that’s it. I’m not even sure the last one is worth the extra complexity.


I’ll have to find something wrong with HOCON and add it to my list :)

https://twitter.com/styfle/status/1237182409239658500


Though most tools that use "JSON" as the config format actually use some superset of JSON, like JSON5, and do support comments. eslintrc, babelrc, vscode config... pretty much every JS tool configuration apart from package.json.

I don't know why people would prefer that over YAML or others but at least you can actually add comments in them and the reason for their popularity doesn't seem to be baked in parser support because these tools are adding their additional "JSON" parsing anyway.


I have a hard time remembering YAML syntax (am I constructing a list or a dict now? Oh well, better look it up) and I know for a fact that it's not just me. JSON is much simpler.


I even prefer JSON over TOML because JSON is simple. TOML has arbitrary rules like how `table = { foo = bar }` can't be multiline, and all that time you spend debugging your attempts at its nested table syntax, you could've just intuitively nested some dictionaries in JSON.


> That attitude is misguided. Many people do in fact use JSON as a config file format

...and those people need to face the consequences of their poor and misguided technical decisions.

In this day and age we have no excuse to repeat the "but everyone is using XML for that" mistake. Just pick the right tool for the job and stop complaining that the tool needs to change to compensate for your poor judgement.


It's not poor and misguided. For example, VSCode has first-class support for JSON Schema, giving you auto-complete and things like drop-down boxes for arbitrary project JSON config. It's better done and more mainstream than any other solution I've seen, compared to people saying "well technically you could build that for <pet format>."

If you want comments, pipe it through json5 first or something. VSCode again supports comments as a courtesy. More tools that use JSON are starting to as well.

It's just not a big deal.


> It's not poor and misguided. For example, VSCode has first-class support for JSON Schema, giving you auto-complete and things like drop-down boxes for arbitrary project JSON config.

That's just tooling compensating for the shortcomings of a format being shoehorned into a use-case that falls outside of its scope.

It's the XML nonsense all over again.

> It's better done and more mainstream than any other solution I've seen compared to people saying "well technically you could build that for <pet format>."

That's the same short-sighted line of argument that was used to force the mistake of using XML everywhere.

There are right tools for the right job. JSON is the right tool for a lot of jobs, but config files are not one of them.


What are the alternatives?

Even though it’s not ideal, I actively prefer JSON over every other random format I’ve come across. This is speaking as somebody who mostly has to tweak existing configs, rather than extending them or writing new ones.

YAML (along with I guess TOML etc) looks nice at first glance, but it has too many weird syntax shortcuts that make it hard to figure out what’s actually going on. And googling for syntax like “[ ]” is hard!

With JSON, the data model is super simple, and for a given set of data there’s only one way to write it down. Sometimes the data model is too simple, sure. But even for fiddly cases like dates and times, there’s often an obvious solution (in this case, ISO 8601 strings).


JSON5 is a good alternative, mentioned a few times in this thread. JSON, plus comments, trailing commas, and unquoted object keys. Solves all my frustration with using JSON as a config file language.


Wow, I hadn’t heard of this before, even after reading other replies to this post! Looks pretty nice, hits just about every point on my personal wishlist. I’m going to use this.


Either it is a terrible format because it is such a massively wasteful (in terms of payload size and cpu overhead to serialize and deserialize) computer to computer interchange format, or it is terrible because it doesn't have comments.


Yeah, it's not a very good computer-to-computer format. No bytes type. No integer type. No Any->Any maps. Being bad at something doesn't automatically make you good at something else, and JSON is no exception. It's bad at everything except being popular.

You have to look at JSON in a historical context to understand it. It exists so that someone could get some data into their Javascript program with "eval". It was then standardized so that you didn't need a full Javascript engine to understand it, because Python and Perl and Ruby didn't have one, and it turned out that evalling random data from the Internet was a security disaster. Does that make it a good human-readable config file? Nope. Does that make it a good interface definition language? Nope.

It exists because it got popular early, and now we're stuck with it. Now it's too late to apply band-aids to make it a human-readable config language or a good computer-to-computer interchange format, because you will always have old parsers around and people will naturally want to target those. Trust me, 99% of developers will moan when you tell them that they have to use GRPC+Protos to access your API. They will be equally mad about your JSON extension that has comments and integers in it, because the dialect of Brainfuck they use for all their projects doesn't have a library that supports those extensions and their editor can't syntax-highlight it or autoformat it.

I would like to give you a solution to this problem, but there isn't one. JSON "won", but it's bad at everything except being well-understood. That seems to be all that people really care about.


Any format that can be edited with a text editor and retain proper formatting should be able to have embedded comments. No reason not to, and it can come in handy sometimes.


The argument, originally, I believe was that comments might be hacked to “extend” JSON with various additions (e.g. dates via an end of line marker). Comments were excluded to keep the format stable, which, FWIW, has worked.


That's a bad argument, since you can still do that with "magic" fields or metadata fields.


JSON isn't binary, it's text-based and human-readable by design (similar to YAML in that sense). JSONB is a binary variant, though, potentially more suitable for machine-to-machine communication and stores more efficiently in databases as a BLOB.


No computer-to-computer interchange format has comments.

I call BS.

As a counterexample search for the word "comment" in https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html. Or are you saying that http is not a computer-to-computer interchange format?


You call BS because the HTTP/1.1 standard calls pieces of text that contain the name of the software at the other end of the connection "comments"? (To those that didn't read the spec, it uses the word comment to refer to the values in the User-Agent, Server, and Via headers.)

It turns out these "comments" were a disaster and user agents are moving away from even providing them. You put "firefox" in the user-agent header, and web servers would read the comment and send you "your browser doesn't work" instead of the actual document. That did not make for a very compatible web... turns out comments in machine-to-machine communication are a bad idea.


I think the "calling BS" objection was to your categorical statement: "No computer-to-computer interchange format has comments," and not about whether they are a good idea or not (that is a separate argument).

How about ".comment" sections in ELF binaries, do they count?

https://wiki.osdev.org/ELF


Yes, comments tend to be turned over time into hints, directives, etc, etc, etc.

But that doesn't change the fact that people put comments into machine protocols.

For another example, lots of machine to machine protocols specify XML. Such as SOAP. They therefore all support comments.

In accordance with the general trend, eventually they get abused into having a semantic meaning. For example, https://support.ptc.com/help/windchill/wc111_hc/whc_en/index... shows how one system uses comments in SOAP to create documentation and a WSDL that lets interfaces to your code be automatically generated.

And so comments eventually become executable. But that doesn't change the fact that comments can exist in machine to machine formats.


BTW, XML actually has what can be called "executable comments": processing instructions. They are very basic, but unlike plain comments they have two parts: target name and opaque instruction for the target. If one needs to use comments for some sinister purpose, at least processing instructions may be a better fit.


XML is frequently used as a computer-to-computer interchange format, and it has comments.


JSON? A binary computer-to-computer interchange format?

Do we use the same JSON, you and me?

> Json is an open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects

- https://en.m.wikipedia.org/wiki/JSON

JSON is text-based, and meant for storage and transmission in a human-readable format.


What are the material differences between a config file format, and a machine-to-machine only format?


Well, for one: most config file formats specifically include a method for the user to add bytes to the data which won't be parsed or interpreted. In JSON... you might try adding a {"comment": "this is a bad way to comment"} object or key/value and hope you don't collide with an object's dedicated comment field, and also don't cause the parser to raise an error for having a field which it didn't expect.


Config files are typically a "load once, read many" pattern. Machine-to-machine communication protocols can be VERY high rate and over the network, so they need to be highly efficient in both memory footprint and serialization overhead.


So... JSON isn’t great as a machine to machine protocol, then? So I guess it must be a config format after all!


The key difference is the target audience. The format's "UX", so to speak, is optimized for the target audience (machine or human). @jrockway makes an excellent point.


Why must those be mutually exclusive? There's value in being understandable by both.


Having comments in a "wire format" provides a means for the spec to become meaningless as random applications start using comments to convey information (or exfiltrate data) that other applications cannot parse correctly.


There should be a special place in hell reserved for people who use comments to convey machine readable information. The whole point of a comment is to provide something that has no semantic meaning.


I would think being able to have comments is a big one, so you at least have an idea of what changes what in a config file.


One must not forget JSON (JavaScript Object Notation) had to be Javascript compatible. It was almost. That is one of the main features. It could be eval'ed back into JS object after all.

> Although Douglas Crockford originally asserted that JSON is a strict subset of JavaScript, his specification actually allows valid JSON documents that are not valid JavaScript; JSON allows the Unicode line terminators U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR to appear unescaped in quoted strings.

> JSON is a strict subset of ECMAScript as of the language's 2019 revision.

https://en.m.wikipedia.org/wiki/JSON#Data_portability_issues


I have hated the lack of comments so many times.

But Crockford said it was done to keep the parsing simple (and thus secure), and between seeing how XML has it, and seeing how poorly implemented many JSON parsers are despite this simplicity, I have begrudgingly decided to accept it. Still wish there was a way without those problems though.


It had NOTHING to do with the simplicity of the parser. He's said in his lectures that the real reason was that he wanted to prevent someone hiding custom pre-processor data in the comments and causing incompatibility.

I think this was stupid. First, it happens anyway to some extent with people just using custom byte encodings for "strings". Second, using comments for metadata could be useful -- even if optional. For example, adding types after strings to specify things like the date format.


JSON also misses dictionaries/"objects" with non-string keys. Msgpack discourages you from using them because languages like JS don't support them, but it allows them.


JavaScript has Maps which allow anything as a key, but they were added after JSON was thought up, and they don't have a literal format.


I've become particularly fond of Ion[1]. The main benefit JSON has over Ion is its ubiquity, even in many standard libraries.

[1] https://amzn.github.io/ion-docs/


This looks quite nice. But then S-expressions and the soon-to-arrive templates add quite a bit of unnecessary complexity IMHO.


I doubt it's any more complicated to add S-Expressions than to change list parsing to take either [] or () and set a type flag based on which one it finds. Semantically, it's not more complex either, just giving one extra piece of user-defined metadata to the actual data that's still basically a list.


> Also for completeness it would be better to differentiate between int and float as float is imprecise due to rounding errors.

Not true for most numbers. A 64-bit IEEE float can handle any 53-bit integer with no error at all.


It can represent less than 0.1% of all 64-bit ints. That doesn’t seem like most to me.


I didn't claim 64-bit ints, I claimed 53-bit ints. How often do you use ints in the range from 54 to 64 bits? I'm willing to bet it's close to never. That's what I meant by "most". You have to consider the distribution.


randomly generated IDs as 64 bit ints, pretty often. I mean, tweet IDs are up there too.


Almost all numbers aren't 53-bit integers (or smaller) :P


Since everybody feels like getting all pedantic on me, let me clarify what I was trying to say. Most of the integers you encounter in real life will be 53 bits or less, and can be held in an IEEE-754 64-bit float without introducing any error. It is not worth introducing a separate integer type for the exceptions, particularly when the language used as the basis of the format (Javascript) does not have an integer type.
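
A quick sanity check of that claim in Python (whose floats are 64-bit IEEE 754 doubles):

  # Integers up to 2**53 round-trip through a double exactly...
  assert int(float(2**53 - 1)) == 2**53 - 1

  # ...but one bit further, distinct integers collapse onto the same double:
  assert float(2**53) == float(2**53 + 1)  # both are 9007199254740992.0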


JSON doesn't use floating point numbers, it uses signed decimal numbers. It's perfectly valid to store a number in JSON that isn't representable in an n-bit IEEE 754 floating point number.


Why are we passing around binary data re-encoded to only use 6 bits per transmitted byte? In some cases it might be practical to shoehorn binary data where a text string has historically gone (e.g., data URLs, emails) but making a new format that can’t handle binary data without costly armoring seems like we’ve basically given up on using the right tool for the job.


> seems like we’ve basically given up on using the right tool for the job

That's webtech in a nutshell isn't it?


Cognitive load. A readable data format is easy to understand; binary is not readable. The thing is, standards for formats get broken all the time, and a broken CSV file is usually easier to fix than a non-standard binary format.


The same argument could have been made when the ppm image format was replaced with binary formats like JPEG.

Nobody says "what a shame I can't open that jpeg image in a text editor".

Everyone understands that an image is an image, and there are tools for dealing with images, like photoshop.

The same could be said for data interchange formats like json.


> Nobody says "what a shame I can't open that jpeg image in a text editor".

Nobody except me, of course. I'd be thrilled if I could open up a JPEG in Emacs and e.g. view/edit the EXIF metadata in a text buffer (and/or pop up a graphics buffer for editing the image itself). Similarly, I'd be thrilled if Emacs was able to parse MessagePack data and let me view/edit it in a buffer. Both of these things are theoretically possible, but to my knowledge nobody's actually done them yet.

Granted, calling Emacs "just" a text editor is pretty stretchy, but still.


This was specifically about having a file format encoding binary data, and in that vein there is a place for PPM; it's probably better for many small bitmaps than PNG. My old PPMs from my email archive look horrible as JPEGs https://xkcd.com/1683/ .

JSON vs. Bencode, JSON can get you the data good enough a lot faster, while a bencode parser is easier to write. My feeling is that most formats with schemas usually fail somewhere in the details anyways. Honestly I think I want too much from them; e.g. the YANG Data model language is a kitchen sink but lacks a lot.


> Why are we passing around binary data re-encoded to only use 6 bits per transmitted byte?

We're not. We're gzipping the json when we transmit it.


It’s “incomplete” because JS doesn’t natively have the concept of bytes or the other things you mention. JSON is JavaScript Object Notation, so there’s no reason it would support features beyond what JavaScript itself can. Of course JSON is now commonly used by other languages and platforms, but its origins in JavaScript are the source of its limitations.


Also, I never understood why JSON allows duplicate keys at the same level. Behavior differs across implementations: some throw an error on parsing duplicate keys, while others simply overwrite the first value encountered with the second.
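
For example, Python's stdlib parser silently keeps the last value, with no warning (other implementations may keep the first or raise an error):

  import json
  json.loads('{"a": 1, "a": 2}')  # {'a': 2} -- the first value is silently dropped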


See my comment about ASN.1 which is anything but incomplete...


ASN.1 doesn't have a schemaless encoding.


That's on the plus side of its attributes.

Schemaless has become a horror-show.


ASN.1 already exists!!!


I was looking at MessagePack for communicating to and from my STM32F1-based microcontroller project from the PC controller software I'd be writing. At least the official C library was not optimized for memory usage and code size. I also considered BSON, but it also lacked suitable libraries.

So I ended up using JSON. Yes the message sizes are larger in byte size with JSON but using the jsmn[1] parser I could avoid dynamic memory usage and code size was small. The jsmn parser outputs an array of tokens that point to the buffer holding the message (ie start and end of key name etc), so overhead is quite limited.

For JSON output I modified json-maker[2]. It already allowed for static memory usage and rather small code size, but I changed it to support a write-callback so I could send output directly over the data link, so I didn't have to buffer the whole message. This is nice when sending larger arrays of data for example.

Combined it took about 10kB of program (flash) memory, of which float-to-string support is about 50%. Memory usage is determined by how large the incoming messages I need to handle are; for now 1kB is plenty.

A nice advantage of using JSON is that it's very easy to debug over UART.

Though having compact messages would be nice for wireless stuff and similar, so does anyone know of a MessagePack C/C++ library that is microcontroller friendly?

[1]: https://github.com/zserge/jsmn

[2]: https://github.com/rafagafe/json-maker


My MessagePack implementation is designed for embedded:

https://github.com/ludocode/mpack

It can be built to a very small code size, especially when you disable libc, allocations, etc. There are some people using it on embedded devices like Arduino. There's someone working on a port to 8-bit microcontrollers that don't have a 64-bit float, so you may want to look into that as well; see the open issue for it on the GitHub link above.


Protobufs/nanopb would be my go-to for minimal message size.

If you want small code size, CBOR seems like a good bet:

> The Concise Binary Object Representation (CBOR) is a data format whose design goals include the possibility of extremely small code size, fairly small message size, and extensibility without the need for version negotiation. [1]

This [2] C-implementation fits in under 1KiB of ARM code.

[1]: https://cbor.io/

[2]: https://github.com/cabo/cn-cbor



CBOR is also used in WebAuthn; usage in a web spec means to me that someone smart considered it a sane choice -- and more importantly that the format is here to stay.


It's great that CBOR is being accepted in a wider area, but I am personally curious why WebAuthn chose CBOR instead of JSON. WebAuthn is a web browser feature, so why would the W3C introduce a new data exchange format in their specs? Maybe WebAuthn needed a binary data type?


I'm guessing a binary format is nice when interacting with a device..

Anybody know if (and why) U2F uses CBOR?


CBOR’s RFC: https://tools.ietf.org/html/rfc7049

And Amazon has picked it up as a first class citizen in some of their IoT Core features. It’s definitely here to stay.


CBOR was originally part of the MsgPack project, by the way, before its designer forked it and renamed it after himself.


Ah yes I looked at CBOR too, but I dismissed it for reasons I can't recall right now. Will have to take another look.


That's strange, because CBOR is almost literally msgpack that got an RFC and has extensions. I can't remember what MsgPack does for online streaming and indefinite lengths.

They’re extremely similar.


Looked at it again, seems memory management is a bit of an issue, it supports memory allocation callback but not just handing it a buffer to work with (though I guess allocation should be predictable).

Also I don't know how they got "code sizes appreciably under 1 KiB". On my STM32F1 release mode with -Os it adds about 12kB.

But yeah, maybe I should reevaluate CBOR.


For reference, I'm using TinyCBOR because it's included with Amazon FreeRTOS.

You’re on your own for malloc, which for me is great because FreeRTOS Heap4 management is quite good. So I malloc an object I’m decoding into and parse away.

There are two options for parsing arrays and strings/bytestrings, and I chose the option where I specify the pointer to use, vs. them using normal malloc and then free() later.

I really like this setup. I made a deinit(bad_message) that works anywhere it failed (parse, validate, eval, etc), goes through and looks for pointers that I previously would have malloc’ed.

There is another popular library but I forget what it’s called.


Yes. CBOR is designed especially with IoT in mind.


> MessagePack ... official C library was not optimized for memory usage and code size

But libmpack is: https://github.com/libmpack/libmpack

- libmpack serialization/deserialization API is callback-based, making it simple to serialize/deserialize directly from/to application-specific objects

- libmpack does no allocation at all, and provides some helpers to simplify dynamic allocation by the user, if required.

- C89


ArduinoJson https://arduinojson.org supports MessagePack. I haven't looked at its static or runtime memory requirements.


Not messagepack, but if protobuf is ok, then nanopb has given me good results on uC projects.


I'll second nanopb as pretty good. Used it on an STM32F4.


It's good. But it's a one-man show and you are very much in "this is how it's done" territory.

I threw away NanoPB in favor of TinyCBOR and haven’t looked back.


Ah, that looks pretty spiffy, thanks!


No idea if it suits your needs, but here's my pet project for microcontroller friendly communication protocols: https://github.com/jean-roland/LCSF_C_Stack


This library is very small and you need to implement your own I/O via well-defined functions. The parser itself does not use any library (including libc):

https://github.com/camgunz/cmp


I took the first example from http://www.json.org/example.html and msgpack makes it 304 bytes. A simple gzip on the JSON is 289 bytes. The larger examples are even more in gzip's "favour".

I am not 100% sure why I would choose to use this - maybe for super tiny documents?


It's also likely faster to parse; string and array sizes are statically known, and all parsing happens in order, so there's no backtracking. Want to parse a string in JSON? okay, this is a starting quote, so let's iterate down the rest of the document, looking for unescaped end quotes, and then copy that into a buffer. In msgpack, it's a single memcpy. Additionally, with number heavy messages, then it shines.

{"red": 127,"green": 28,"blue": 27,"x": -18,"y": 27,"z": 2008,"time": 1583862185,"dollars": 1298,"cents":12,"name":"tom",timeseries":[1,2,3,4,55,66,77,218,239,340,9009,80008,400004]}

gives 106 bytes with msgpack and 157 with gzip. Also, if you gzip the msgpack, for your message example, it's 248 bytes.
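
For anyone who wants to reproduce this kind of comparison, a rough Python sketch (assumes the third-party msgpack package; exact byte counts depend on the gzip level and library versions):

  import gzip, json
  import msgpack  # third-party, assumed installed

  doc = {"red": 127, "green": 28, "blue": 27, "x": -18, "y": 27, "z": 2008,
         "time": 1583862185, "dollars": 1298, "cents": 12, "name": "tom",
         "timeseries": [1, 2, 3, 4, 55, 66, 77, 218, 239, 340, 9009, 80008, 400004]}

  as_json = json.dumps(doc, separators=(",", ":")).encode()
  as_msgpack = msgpack.packb(doc)

  for label, blob in [("json", as_json), ("json+gzip", gzip.compress(as_json)),
                      ("msgpack", as_msgpack), ("msgpack+gzip", gzip.compress(as_msgpack))]:
      print(label, len(blob))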


And you trust a message from an unknown source? You can't simply memcpy according to some length indicator, that's just not safe. You still have to parse and validate.


Obviously you need to check that the length doesn't go past the end of the message, but that's a trivial O(1) check. You don't have to scan the bytes of the string first to decide if they are safe to memcpy.


You might want to validate those byte sequences are valid character encodings.


You should be doing that with JSON as well, so this isn't a pro/con of either format.


That was obviously my point.

i.e. just because MessagePack is a binary format doesn't mean you can skip the same string checks that JSON requires; which means parsing MessagePack strings is unlikely to be any faster than JSON strings (contrary to the suggestions others have implied with the "just memcpy" comments). It's just that with JSON that validation is done as part of the parser (remember JSON technically only supports a subset of ASCII, and any extended characters or Unicode are encoded via escape codes), whereas with MessagePack you'd need to do that validation as an additional step.

Integers, on the other hand, might differ, since JSON would need additional validation (again, baked into the parser) which MessagePack would not, because MessagePack encodes integers as binary integers whereas JSON encodes them as ASCII digits that need converting back to binary integers.

(hint: read the message I'm replying to).


Many (most?) applications do not actually care whether a byte blob of text is structurally valid UTF-8. They are either passing it around as an opaque byte blob, or already applying much stricter application-specific validation. Validating UTF-8 automatically at the serialization layer is a huge waste of cycles, especially in a big distributed system.


On closed systems where you control both the input and output, then sure (Though I’d still recommend against that particular short cut because it’s an easy way for bugs to go undetected).

However if you’re accepting MessagePack encoded data from insecure systems (such as end users) then you absolutely should be validating your input somewhere along the pipeline and it’s usually better to do that early on.

Also, it's not generally the distributed systems you worry about when it comes to this specific degree of micro-optimisation (which is basically what this is). It's the monolithic ones. Distributed architecture is meant to solve various problems (for example, but not limited to, high availability, reduced geographical latency, or a single site running on cheaper commodity hardware), but often at the cost of CPU cycles. Whereas your monolithic infrastructures where you have fewer servers (such as Stack Overflow's setup) would be much more dependent on reducing computational overhead wherever corners can be cut. However, they'd also be significantly less likely to need networked RPCs via MessagePack anyway (simply due to the monolithic design of their architecture).


As long as you know the length of the entire buffer, you just ensure that:

  current_addr + message_len - start_addr < buffer_len
Or am I missing something?
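
In sketch form (Python, with a hypothetical 4-byte length-prefixed framing rather than MessagePack's actual wire format), the check being discussed looks something like this:

  import struct

  def read_field(buf: bytes, offset: int):
      # Read one length-prefixed field: 4-byte big-endian length, then payload.
      if offset + 4 > len(buf):
          raise ValueError("truncated length prefix")
      (length,) = struct.unpack_from(">I", buf, offset)
      end = offset + 4 + length
      if end > len(buf):  # the O(1) bounds check discussed above
          raise ValueError("declared length runs past the end of the buffer")
      return buf[offset + 4:end], end  # no scanning of the payload bytes needed

  payload, next_offset = read_field(b"\x00\x00\x00\x05hello", 0)
  assert payload == b"hello"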


Invalid unicode sequences?


buffer_len could be larger than the message, copying some incorrect things into memory.

Similar to HeartBleed, where there wasn't validation on the heartbeat message, and the server would echo back buffer_len instead of just what was sent.


I believe the author intended buffer_len to be the length of the incoming buffer (size of the HTTP payload, number of bytes read from a file, length of the database entry, etc...). So the worst that can happen is that the entire input message is consumed -- like a JSON payload that's missing its closing quote.

I can think of a very contrived situation where this can be a problem, but in most cases this will be perfectly safe.


https://capnproto.org serialisation scheme skips the decoding. Does that make it not safe?



None of those are serialization schemes. XML can be used for serialization, but if you look at the whole ecosystem it is a Turing-complete complexity monster, so of course it isn't safe.


It depends on what constraints apply to the data. Any bit pattern could be used for an int, but to guarantee a UTF-8 string it would need to be validated.


Genuine question - is it dangerous to memcpy X bytes that we know must be interpreted as, say, an integer?


No. Everything is 0s and 1s after all. Take, for example, a byte. It has 8 bits, and by permuting the 0s and 1s you end up with all the possible values of a signed byte: all the numbers -128 to 127. So now, if you were to copy a byte from a random memory location, that byte will just contain a permutation of 0s and 1s which, when interpreted as a signed int, will simply be a number between -128 and 127.


That's what I thought ...


Potentially. Most network protocols are big-endian while x86 is little-endian.
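
A small Python sketch of that caveat: the same four bytes give different integers depending on the assumed byte order.

  import struct

  raw = b"\x00\x00\x01\x00"
  (big,) = struct.unpack(">i", raw)     # network / big-endian reading: 256
  (little,) = struct.unpack("<i", raw)  # x86 / little-endian reading: 65536
  print(big, little)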


Keep fighting the good fight.


> Want to parse a string in JSON? okay, this is a starting quote, so let's iterate down the rest of the document, looking for unescaped end quotes, and then copy that into a buffer.

Yes, so what's the problem?

Meanwhile, keep in mind that the JSON document was already passed in an HTTP request body, or that it's trivial to put string length checks on a parser, or even nesting limit checks.


Delimited formats are, in general, slower to parse than formats where the record sizes are encoded in the message.

That said, not something I want to prematurely optimize for.


I don't think it's that premature. Making the (big) assumption that serialization format X has good stable support in your language, and has the same expressive capabilities as JSON, you should be able to use it as a drop in replacement (as long as you control both ends of the communication...). And doesn't a huge portion of what we do around here amount to "shuffle serialized data around as efficiently as we can"?


> Delimited formats are, in general, slower to parse than formats where the record sizes are encoded in the message.

I don't see your point at all. The state machine required to parse a JSON string essentially has only 2 states: current token is either a string character or a string delimiter. That's it. Adding a record size increases the number of states because now you not only need to track if a token is a string character but also if the character makes sense to be there within that state.

Moreover, when you parse a JSON string, once you hit the string's end delimiter you are already able to calculate the string size. Thus, if the string is short enough to fit the buffer, then not only is the lexer simpler, but it also requires the same memory allocations as string formats with record sizes. If, however, a string doesn't fit the buffer, then we are already in the territory of a rope data structure in both cases, so the number of memory allocations tends to be equivalent.


Why would you make a comparison by gzipping one and not the other? Running gzip on the msgpack example reduced it from 304 to 248 bytes.


Sure, but I think what he meant is that what the website is showing isn't the real gains.

Without GZIP:

JSON 583 bytes

MessagePack 304 bytes

52 %

With GZIP:

JSON (GZIP) 289 bytes

MessagePack (GZIP) 248 bytes

85%


So MessagePack lets you skip the CPU and memory usage needed to compress the JSON, yet get similarly sized messages to compressed JSON. Compressing the JSON is a non-trivial extra load on a busy server.


Most likely you'll have GZIP enabled on all your HTTP requests. Your bandwidth will be a bottleneck way before GZIP CPU load becomes one.

I just took a quick look and they do compress their queries, using Brotli, which is even more efficient than GZIP. They are behind Cloudflare, so it's probably from them.

I just tried with https://jsonplaceholder.typicode.com/todos and I get this:

Raw:

JSON: 24,311

MessagePack: 14,704

60 %

With GZIP:

JSON: 3,965

MessagePack: 4,063

102%

With Brotli:

JSON: 3,495

MessagePack: 3,704

106%

So essentially, compressed MessagePack is WORSE than compressed JSON, and uncompressed it is at least 3 times worse than compressed.


So you're saying MessagePack with no compression is way smaller than JSON. Which means with no load for compression it has a 40% bandwidth savings. With compression it's negligibly worse than compressed JSON. Seems like a back foot argument for MessagePack over JSON. That's to say nothing of the encode/decode efficiency differences.

In the case where your data can't be cached but CloudFlare proxies and compresses for you, MessagePack wins because your uncompressed connection to CF is 40% more bandwidth efficient so egress out of your app server for any traffic is reduced. In the case where CF can cache responses you get the same back end bandwidth win for a negligible (if any actual) amount of extra egress out of CF.

Use what you want but if you need a schemaless serialization format, MessagePack works well and is compact for "free".


A different way of looking at this is that MessagePack is a worse way to compress JSON than gzip or Brotli, at least under some circumstances.


Or you just use JSON with gzip and never think about this problem ever again. Choosing msgpack vs JSON is a micro optimization at this point. There are significantly better alternatives if you want to save CPU cycles.


Performance. Overhead of JSON encoding followed by gzip will be slower and more CPU expensive than msgpack.


Sure, but if performance is the actual issue, then you'd likely pick an even faster and more compact wire format (proto, avro, thrift).

I guess the ability to have a schemaless format with no type checking (just like JSON), while enjoying some performance benefits? It just feels like a weird niche to me.


Protobuf/Thrift require predefined schemas, which is a different model.

Avro is pretty similar to MsgPack in concept. It seems less popular (at least in my circles), and has fewer implementations, which may or may not matter. As for performance and efficiency, the first benchmark I found [1] shows MsgPack is more space efficient, and faster to serialize, while Avro is faster to de-serialize. The second benchmark I found [2] found exactly the opposite. So it's not clear to me that either is better, and I'm sure it depends on your data, which library you're using, and how you're using it.

Being nearly as performant as the predefined-schema serializers, while being nearly a drop-in replacement for JSON, seems like a major and valuable use case to me.

[1]: https://medium.com/@nitinpaliwal87/compression-and-serializa... [2]: https://github.com/saint1991/serialization-benchmark


> shows MsgPack is more space efficient

While I see the same numbers as you do, there's no way a format that includes plaintext keys alongside the values is going to be smaller than a format that only includes a packed field number+type and the corresponding values. There's something fishy going on here, and I can't seem to find a link to the source behind his benchmarks?


I'm not sure if the benchmarks above reflect this, but MessagePack allows you to use integers (or any data type) for map keys. This makes it a lot smaller than using plaintext keys, and makes it comparable to formats like Protobuf despite the lack of formal schema. (You also lose a lot of the context-free readability of messages, so it's a tradeoff.)
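
A rough illustration of the size difference (Python, assuming the third-party msgpack package; exact byte counts depend on the values):

  import msgpack

  with_name_keys = msgpack.packb({"red": 127, "green": 28, "blue": 27})
  with_int_keys = msgpack.packb({1: 127, 2: 28, 3: 27})

  print(len(with_name_keys), len(with_int_keys))  # 19 vs 7 bytes for this toy map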


I guess. . . for my part, it's always seemed to me like messagepack falls into a sort of uncanny valley of serialization formats. Schemaless is very undesirable for internal APIs, IMO. Plaintext is desirable for public-facing APIs that don't need to be particularly performant, but only if you're willing to go full HATEOAS. If you're going binary for performance or don't want to bother with HATEOAS, I'm back to preferring a published schema over something that's (semi-)self-describing plus some documentation that's invariably incomplete or out of date.


I have never seen any benefits of HATEOAS materialise - it is too free-form to be machine parseable, and it is far more difficult to navigate than Swagger or GraphQL. Maybe I don't get something?


I don't ever machine parse it, but it can be useful for manual discovery if you get an API that isn't super well documented. I think I might generally prefer Swagger, too, but not everything is implemented in a language that has a library like that.

At the end of the day, though, I'd seriously much rather just have a halfway well commented *.proto file. Work smarter, not harder. It's just that I won't recommend that for public-facing APIs because GRPC isn't universally well supported, while JSON over HTTP is.


What was wrong with ASN.1 for the wire? It's not a direct replacement but a sanitising schema that's truly amazing in application breadth, support, tooling, and implementation efficiency.

In force documentation:

https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-X.68...

Blog on ASN.1 schema for JSON:

https://www.obj-sys.com/blog/?p=508


It's incredibly annoying to marshal and unmarshal arbitrary ASN.1/BER. Variable-length, odd-bit-length length fields are annoying. BER is like the worst parts of every other format, collected in a single standard.


I don't know why, but I constantly hear about things that never caught on and will never catch on in the future. Ok, let's say I am hyped for ASN.1 as a schema for JSON. How do I integrate it into my bog-standard Java application? Suddenly we are running into a huge problem.

>ASN.1 is a mature standard. As I already mentioned, it has been around since the 1980’s. Though stable, it is not stagnant; the most recent revision occurred in 2008.

It might be old but it is not widely used. To me it doesn't matter if something was created yesterday or 20 years ago as long as I can use it and using ASN.1 is significantly harder than necessary.

>ASN.1 has tool support. There are both commercial and open source tools that will generate code from your ASN.1 specification.

Ok, so where is it and why is it so difficult to find?

Even if we assume that we should just build it today then all the above claims suddenly become worthless. If someone builds ASN.1 tooling in 2020 then it is just as immature as e.g. GraphQL tooling that was built 2020. If there is renewed interest in ASN.1 then it might gain new features that will cause it to become less mature/stable again.

Having "mature" software is useless if it doesn't meet user demands.


There is an excellent article about ASN.1 and how it relates to JSON and other forms of encoding.

Here: https://www.thanassis.space/asn1.html


Sadly, ASN.1 is widely used — both TLS and SNMP use it, which means that we're using ASN.1 every time we read or post on HN.


ASN.1 is enormously more complex than JSON or even MsgPack, very bug-prone, has much worse tooling, is much less widely supported, and, surprisingly, most implementations are not even that efficient. Basically you pay a lot more in bugs and get a lot less in functionality.

I've spent a lot of hours poring over hex dumps of BER messages and CBOR messages (which are basically MsgPack) and I vastly prefer CBOR. But I prefer JSON way more.


I really like ASN.1.

The big difference is that messagepack is schemaless.



It's just a nice default to pick. (But JSON is "simple" so msgpack did not really catch on. And probably won't for the reasons others mentioned.)


> It's just a nice default to pick.

Why would it be your default?


It's really very-very fast, and no need for external schema registry. And it's really as simple as JSON. Has great language support.

https://github.com/thekvs/cpp-serializers#results

That said, currently I don't work with it in any projects. Apparently it hasn't really caught on.


It's my go to serialization format for redis payloads. Fast, compact, can represent a lot of types without too much magic.


It is relatively efficient, schemaless, and cross-language and -platform. It supports floats and binary values as well.


Because all your existing JSON will just work; you don't have to change your data representation to support some other format (proto, avro, thrift).


> Overhead of JSON encoding followed by gzip will be slower

How much time would it take you to either

a) turn on gzip support on your HTTP server and keep using JSON

b) rip out your JSON stuff from controllers and whatnot and replace it with a custom serializer/deserializer that's far from an industry standard?


A decade ago I benchmarked and used msgpack in a Python system instead of json because it was so much faster that it halved my hardware requirements. I think thrift and protobuf and stuff were around then but I don’t recall how they compared.

Edit: found my post https://stackoverflow.com/questions/9884080/fastest-packing-...


I made a similar choice for a Ruby project, with msgpack over RabbitMQ.

I no longer have any connection to this repo, but the benchmarks (and a schema validator gem I wrote) are here: https://github.com/deseretbook/classy_hash/blob/master/READM...


I fail to see the use case too. If you want a compact wire format, there are numerous options (protobufs, avro, thrift, etc) and their size is achieved by storing the data in a compact binary format and the schema separately.

With MessagePack the schema is embedded alongside the data, exactly like JSON. So it's just binary instead of text encoding, which saves some space, but as you pointed out, standard text compression algorithms are going to likely perform similar or superior.


One particular advantage of msgpack is that it allows embedding binary data without binary-to-text overhead (~30 %). I don't really see msgpack as something that makes much sense to use inside a browser, though.
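
To make that concrete, here's a minimal Python sketch (assuming the msgpack-python package; exact numbers depend on payload size):

  import base64, json, os
  import msgpack  # assumed: the msgpack-python package

  blob = os.urandom(3000)  # some arbitrary binary payload

  packed = msgpack.packb({"data": blob}, use_bin_type=True)
  as_json = json.dumps({"data": base64.b64encode(blob).decode("ascii")}).encode()

  print(len(packed))   # ~ len(blob) + a few bytes of framing
  print(len(as_json))  # ~ 4/3 * len(blob) plus quoting and key overhead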


The first piece of software I ever wrote was a base92 encoder. When I looked at b64 it was even worse than the 3:4 I had been told.

B64 as originally used (with an extra >2.6% for line splitting) is handily over 35%. As we use it now it’s still a bit over 33%, because of the == endings to detect truncation... 66% of the time.


Msgpack is a different encoding, not quite the same as compression. What happens if you do both?


I'm not sure this is a fair apples to apples comparison. Some thoughts:

1. msgpack + gz would be a more fair comparison to json + gz when comparing file size

2. have you run any time comparison? It would seem to me that msgpack gets pretty close to json + gz with far fewer resources and in less time


Yeah, that is a very fair point. I suppose if I can shave a few bytes per request, why not, and if the parsing is indeed faster, then all the better.

I haven't had need to deal with such constraints personally, but maybe it's something I should consider more. It doesn't need to be a 10x win to still be useful, and if it's a simple import to use then I would likely be crazy not to consider it.


Do a speed test comparison as well. Serialization to msgpack is likely much quicker than json + gzip. This is for latency sensitive applications.
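
Something along these lines is enough for a first impression (a rough Python sketch, assuming msgpack-python; results depend heavily on the shape of your data):

  import gzip, json, timeit
  import msgpack  # assumed: msgpack-python

  doc = {"users": [{"id": i, "name": "user%d" % i, "active": i % 2 == 0} for i in range(1000)]}

  def json_gz():
      return gzip.compress(json.dumps(doc).encode())

  def mpack():
      return msgpack.packb(doc)

  print(len(json_gz()), len(mpack()))        # compare payload sizes
  print(timeit.timeit(json_gz, number=200))  # encode + compress time
  print(timeit.timeit(mpack, number=200))    # encode-only time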


Gzip'ing the msgpack encoded example results in 238 bytes.


Nobody gzips their JSON over HTTPS; this is due to compression attacks.


Not every use case is affected by the BREACH attack.


JSON is mostly used on the web, and with HTTPS, where compression is disabled.


gzip is very slow. I don't know why anyone would still use it today.


Gzip is old but still good. It's not as good as Zstd, nor in cases where lz4 and its ilk shine, but it's not fair to claim it's slow.

Gzip is quite close to the Pareto frontier, meaning it is a good trade off of time and space. See the charts at http://mattmahoney.net/dc/text.html (And read up on the Hutter Prize!)


When you say, "very slow" what do you mean?

gzip is fast relative to other compression algorithms and relative to internet speeds.

gzip is slower than an ideal gigabit network, but faster than 100 Mbit.

https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...


"gzip vs bzip2 vs xz performace comparison" is like comparing which centenarian can limp faster, it might be entertaining for some people, but is generally not relevant.

> gzip is fast relative to other compression algorithms

gzip looks fast perhaps compared to "xz -9", but not to anything modern.


Your comment could benefit from pointing out the alternatives that outperform gzip.


Check out zstd


Really? What do you suggest that is faster? Our shop has tested a number of compression formats (xz, bz2, gzip, etc.) and gzip is good enough and faster than the others we tested.


https://quixdb.github.io/squash-benchmark/

Brotli is better, zstd is faster, lz4 is often good enough and a lot faster.


The algorithms you name are all rather outdated.

Typical gzip decompression speed is somewhere in the 200-250 MB/s region, compression is much slower. LZ4 for example tends to compress at ~600-700 MB/s, and decompress at several GB/s. zstd is tweakable over a very wide range of ratio-speed trade-offs.

LZMA(2) (xz) is a rather troubled format and should not be used any more. bzip2 has always been slower than gzip with usually marginally better compression. It has been irrelevant for a long time.


> LZMA(2) (xv) is a rather troubled format and should not be used any more.

You mean xz. Not sure where you got the idea that it’s a troubled format, but if you’re talking about the infamous “Xz format inadequate for long-term archiving”, IMO that’s just bzip2 authors taking a dump on xz for no good reason, and fortunately for us it’s bzip2 that’s basically irrelevant today, not xz.


> It has been irrelevant for a long time.

Funny you say that, my company uses bz2 for compressing pretty much everything.


And many companies use fixed-width record formats for data exchange... what's your point exactly?


I guess that it's not irrelevant. Deprecated, old, etc sure, but irrelevant?


Well yah, those are all ratio-tuned codecs that are very slow. LZMA beats bzip2 on both ratio and speed so you might as well forget about bzip2 forever. Zstd, snappy, LZ4, or even brotli are probably better choices than zlib for most people. Brotli has LZMA-like ratios at dramatically higher speeds.


Depends on what you're doing... for real-time data streaming, sure... for request-response, I'm less convinced. JSON being human readable, relatively lightweight, and highly compressible is pretty convincing.

I'm not against messagepack or protobuf, however, much like all things, I'd rather start with simple http+gz+json (maybe websockets) and optimize as needed. Not everyone is at the scale of FAANG, and most don't really need this level of optimization.


Agree. Most systems have to be optimized in other areas first, and I wouldn't give up the debuggability of JSON for a fraction of a percent increase in performance.


>[JSON] relatively lightweight

Um... no? Epoch as binary, 4 bytes. As JSON, 10 bytes. A 64-bit bignum in binary, 8 bytes; in JSON, 19 bytes.

For a small example object I can think of

{ "x" : "y" }

In binary, 5 bytes. In JSON, 9 bytes.

Now... light ENOUGH because you're using a PC with gigs of RAM and a 100 Mbit internet connection, sure. Light in terms of a microcontroller? No.

I flat out could not use JSON in any of my projects and I’m not at FAANG level.


seems like what you're talking about would fall under the real-time streaming category... HTTP overhead alone would outweigh the JSON differences you're talking about.


Double. That's the inefficiency you can count on with JSON over CBOR or MsgPack if you're dealing with number data or short strings.

Sure, the HTTP overhead and TCP overhead under that are significant if you're transferring a single bignum. How about 10,000 of them?

Even in the best case of all strings, where binary is limited by ASCII's inefficiency, "":"", is "only" 5 bytes of wasted data per item; but what about a million-item JSON? 5 million wasted bytes does seem like it outpaces the NIC, then TCP, then HTTP layer overhead. But yeah, IDK.

The claim was that JSON is lightweight, it is not. It's not as bad as XML, I'll give it that.


Protobuf is entirely different to MessagePack and JSON because it has a schema.


schema isn't part of the actual encoding though afaik.


I implemented a streaming deserializer for MessagePack data in C for small microcontrollers. It is quite small and not complete.

Then I tested it with test data streams generated by the reference C++ implementation and Python implementation.

It's actually kind of a pain, because the C++ serializer generates a variety of different data types depending on the values you are packing, not the type of the values you are packing. Let's say I encode a uint32_t field. The stream might get a uint8, a uint16, a uint32, or even an int32 (for reasons that completely elude me).

Also, C++ strings come out as the 'ext' type while Python strings come out as the 'string' type, so I have to accommodate both, even though they are both basically byte strings.

So, I want to tell the deserializer what kind of output field I'm expecting, for each struct member or array, and then look in the data stream to see if there is a data object there that came from the same type. But this is impossible, so the per-type decode functions have to be quite complicated to handle a variety of types.

So - it works but I can't do much on the deserialization side to verify that the data I'm unpacking really matches what was encoded. I can only detect very broken cases, like when I'm unpacking a uint32_t and in the stream there is an int32_t with a negative value.

I guess this is mostly done for optimization, but two things would make it a lot better:

- if the spec actually specified how data types in different languages were allowed to be encoded

- if the encoded data contained _two_ type fields, one indicating the original source data type and another indicating the type it was encoded into.

Basically the spec is just way too "loose" to make it usable for the use case I'm trying to use it for, which is to easily generate data that is sent to a micro and stored in EEPROM, then deserialized out of EEPROM later.

That's probably not very close to a use case the original designer had in mind. But I haven't found anything that works better (less decode logic).


The spec doesn't really make this clear, but reifying the packed encoding types is not, I think, how MessagePack was intended to be used. At least it's not how it's implemented in MessagePack libraries. As far as I know they pretty much all use the most efficient representation for all values, so the original data type is always lost.

You generally shouldn't worry about the low-level type that a value was encoded into in the MessagePack stream. It's dynamically typed, so you should just care about values. When encoding you should allow the encoder to use the most efficient representation, and when decoding you should be able to tell your MessagePack parser the integer width you want instead of caring about the original type or how it was encoded. It should then accept any packed integer type as long as the value is in range.
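
You can see this directly with the Python library, for instance (a tiny sketch; the byte layouts are the integer formats from the MessagePack spec):

  import msgpack

  # The same logical type (an int) gets a different wire width depending on its value:
  print(msgpack.packb(5).hex())      # '05'         -> positive fixint, 1 byte
  print(msgpack.packb(300).hex())    # 'cd012c'     -> uint16 marker + 2 bytes
  print(msgpack.packb(70000).hex())  # 'ce00011170' -> uint32 marker + 4 bytes

  # A decoder that just asks for "an integer" accepts any of these representations.
  print(msgpack.unpackb(msgpack.packb(300)))  # 300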

This is how my MessagePack implementation works as well. If you expect to receive an integer that fits in, say, `uint16_t`, you can call `mpack_expect_u16()` or `mpack_node_u16()`, and it will allow any integer representation as long as the value is in range.

It sounds like this is where you were going with your implementation as well, so this may not be comforting because it's not what you want, but it is at least the correct way to understand the format. I've talked about this pretty extensively and wrote up a protocol clarifications document that explains a bit more about how and why MessagePack libraries discard integer width and signedness:

https://github.com/ludocode/mpack/issues/35

https://github.com/ludocode/mpack/blob/develop/docs/protocol...

If you really want things like original integer width represented in the format, ultimately you're going to want to use a different format, probably one that is non-dynamic and uses schemas.

As far as the string vs ext, you may have meant string vs bin; there was a format change a while back that separated string and bin types and not all MessagePack libraries have adapted to that. Many libraries (including mine) support a compatibility mode so they will use only compatible string representations.



Ha! Yes, I have been using MessagePack for several years. It never occurred to me that people would be interested to have a discussion about it on Hacker News. MessagePack to me is as ubiquitous as JSON and Protobuf, I submitted www.json.org [1] just to prove my point. I figure OP is today’s lucky 10,000 [2] and that’s why they decided to submit the link, thinking it was a new type of data interchange format.

[1] https://hackertimes.com/item?id=22538794

[2] https://xkcd.com/1053/


Yeah, this same conversation about Message Pack happens once every two years here but that’s ok I guess...


Have there been threads in the past 8 years?


I guess you’re right that these are the top level threads before 2012. What I’m remembering (and found via google search) is this kind of discussion around message pack comes up in most other threads about serialization formats or adjacent topics.


There definitely have been some comments: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


Under PHP...

> Msgpack is an PECL extension, thus you can simply install it by:

Having something available in PECL is a good first step, but nobody will use it unless you either:

1. Get it into the standard library (which requires an RFC for PHP Internals), OR

2. Write a pure-PHP polyfill installable from Composer, OR

3. Do #2 then #1 (using the polyfill's popularity to argue for the importance of the RFC acceptance to make #1 a reality).

Reason: A lot of the places PHP is deployed, you can't compile C code or install binary dependencies (.so, .dll files). You can't access the OS package manager, either.

But Composer is a pure-PHP package manager that still operates in these environments.

So if anyone on HN ever wants your thing to be used by PHP developers, don't just stop at "PHP extension, written in C, available in PECL".


If you literally read further down the page there is a pure PHP implementation, the rybakit/msgpack composer module.


> Reason: A lot of the places PHP is deployed, you can't compile C code or install binary dependencies (.so, .dll files). You can't access the OS package manager, either.

If you’re in the target audience for msgpack you probably aren’t relying on shared hosting and can build a pecl extension.


That's some strange logic.

I have multiple racks of company-owned bare metal that I deploy to. Still don't want to build a pecl extension. Much simpler and way less likely to run into random build issues if I can just install via composer.


> Still don't want to build a pecl extension.

Docker? Build it as a system (RPM/DEB) package? Man, my life would be difficult if I just flat out refused to use packages for PHP/Python/Ruby requiring native extensions because it required some minimal effort on my part to deploy it.


I found myself using msgpack as a drop-in alternative for acceptable mimetypes for HTTP responses in a flask app. Browsers would get the response data with the pretty-printed json embedded (or even a custom template with the data fitted in, if there was one for that specific endpoint), api clients asking for nothing in particular would get pure json, and clients asking for msgpack would get that.

Seemed like a free way to offer slightly better performance on the API (both serialize/parse times and bandwidth), since it can just serialize any data without specifying protocols or schemas, like json serializers can too. I didn't know about any alternatives that would also require no further configuration or infrastructure, so it seems like msgpack fills this 'free performance boost' niche quite nicely
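
The negotiation part is only a few lines; roughly like this Flask sketch (the endpoint, mimetype string and example data are made up for illustration):

  from flask import Flask, Response, jsonify, request
  import json
  import msgpack  # assumed: msgpack-python

  app = Flask(__name__)

  @app.route("/api/things/<int:thing_id>")  # hypothetical endpoint
  def get_thing(thing_id):
      data = {"id": thing_id, "name": "example"}
      best = request.accept_mimetypes.best_match(
          ["application/x-msgpack", "application/json", "text/html"])
      if best == "application/x-msgpack":
          return Response(msgpack.packb(data), mimetype="application/x-msgpack")
      if best == "text/html":
          # browsers get pretty-printed JSON (or a rendered template, if one exists)
          return Response(json.dumps(data, indent=2), mimetype="text/plain")
      return jsonify(data)  # default: plain JSON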


Or you could use IETF standard CBOR (Concise Binary Object Representation) https://cbor.io/ RFC 7049


Is there any advantage of msgpack over json or gzipped json on one side and something like protobuf or flatbuffers on the other?

Msgpack, unlike json, is not human readable on the wire, not a purely text based format, and I doubt it is smaller or faster than protobufs or flatbuffers.


Protobuf requires schemas, which is good practice anyway, but maybe you don't have one or don't want to write one for some reason.

FlatBuffers doesn't have as many client libraries. There's a MessagePack library for a ton of languages.

One thing I really like about MessagePack is that the Python client (and others too) supports reading from a stream. So you can write a bunch of msgpack messages to a file or TCP socket and it just works.
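
For example, with the Python library you can point a streaming unpacker straight at a file or socket (a small sketch):

  import msgpack

  # Write several messages back-to-back, no framing layer needed...
  with open("events.bin", "wb") as f:
      for i in range(3):
          f.write(msgpack.packb({"seq": i, "payload": b"abc"}))

  # ...and read them back as a stream.
  with open("events.bin", "rb") as f:
      for msg in msgpack.Unpacker(f, raw=False):
          print(msg)  # {'seq': 0, 'payload': b'abc'}, ...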

Protobuf can't do this out of the box because it doesn't include how long the message is. You can write a wrapper that specifies the message length, which isn't that hard and I've done before, but it is another thing to maintain. And other formats do that out of the box (ex. Cap'n Proto)

And as someone else mentioned, Protobuf doesn't have NULL which is useful in some cases. (I understand that Go has strong opinions about there being a useful default value, but that doesn't map well to a lot of languages)

One other thing with Protobuf is that the Python client is not very pythonic. I've been keeping my eye on this project [0] which makes Protobuf messages work just like dataclasses. They don't support OneOf types currently, which I happen to need for some of my use cases. But they're working on it [1]

[0] https://github.com/eigenein/protobuf

[1] https://github.com/eigenein/protobuf/issues/85


There's nothing fundamental about protobuf itself that prevents streaming output. You just have to make a pass over the proto structure first to compute the sizes, then stream it out using the precomputed sizes. At no time do you necessarily need the entire representation of the output in memory. The C++ library, for example, offers this.


Yes, and that's what I did. In my case, I was writing to a file. And my languages were C, Python, and Lua. So, I wrote a small package (plus bindings for Python and Lua) that would write out the message length as a uint64, then a uint8 for which schema the message is (I was working with multiple different message types), and then the actual message bytes.

So, it's do-able. But, I'm just saying that other formats/libraries support this natively out of the box.
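
The framing itself is only a few lines; e.g. a Python sketch of the length-plus-tag scheme described above (the tag-to-class mapping and the little-endian layout are assumptions for illustration):

  import struct

  def write_framed(f, type_id, msg):
      # 8-byte length, 1-byte schema tag, then the serialized protobuf message.
      body = msg.SerializeToString()
      f.write(struct.pack("<QB", len(body), type_id))
      f.write(body)

  def read_framed(f, msg_types):
      # msg_types maps tag -> generated protobuf message class (hypothetical mapping).
      while True:
          header = f.read(9)
          if len(header) < 9:
              return
          length, type_id = struct.unpack("<QB", header)
          msg = msg_types[type_id]()
          msg.ParseFromString(f.read(length))
          yield msg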


It sounds a bit like you invented RecordIO[1]. I didn't realize you meant decoding from a stream containing several protobufs. In that case you do indeed need some kind of framing, because there is no end-of-message delimiter. The positive tradeoff is you can easily concatenate several protobufs into another, valid protobuf. That is impossible if there is a terminating symbol.

1: https://www.tensorflow.org/tutorials/load_data/tfrecord#tfre...


Oh neat. I'm not familiar with the ML/AI ecosystem. So, I didn't think to look there for formats; a bit of a shame I reinvented the wheel here, but it worked great for my specific use case.

Yeah, it's all about the tradeoffs with formats. Protobuf is a good choice a lot of the time and I always think about it as one of the first/preferred options. gRPC is great too.


If you're streaming messages I guess the good comparison would be protobuf + gRPC?


You can use, e.g., writeDelimitedTo in the Java API to do streaming protobufs



My languages were C, Python, and Lua. Lua supports this, but C and Python don't (at least not in the library generated by protoc at the time I did this project, ie. like 2 years ago).


One advantage is that you can pack and unpack msgpack on an Arduino but you can't run gzip on that hardware.


> Is there any advantage of msgpack over json or gzipped json on one side and soemthing like protobuf or flatbuffers on the other?

Or CBOR, that has an RFC (which is important for some folks):

* https://en.wikipedia.org/wiki/CBOR


Plus the fact that protobuf is more flexible about interface changes. My previous company had countless incidents that happened when new code deserialized msgpack records with unexpected fields from redis...


I was asking myself the same thing and apparently protobuf has no concept of "null", thus I can see how MessagePack might have an advantage depending on your use case here.

https://github.com/neuecc/MessagePack-CSharp#comparison-with...


protobuf can do nullable values, but you have to ask for them as primitives are the default. e.g. StringValue (nullable) vs string (not)


Well, msgpack does support blobs, unlike JSON, and is schemaless, unlike protobufs and FlatBuffers. I think the second one is kind of a hard constraint, while the first one is something you can work around with base64 in many cases.


It's about as fast as protobuf using .NET. It is not human readable.


I used MessagePack in a real-time streaming app. It was smaller than Protobuf (and about as fast in .NET). In fact, if you are a .NET dev and use the popular SignalR it can use MessagePack for real-time browser binary messaging. I highly recommend it. But it is not a direct replacement for JSON, as it's not human readable or accessible without libraries.


I've also used MessagePack on .NET, but as a serialisation format for use with a message queue - I don't recall the numbers off hand, but it was something like an order of magnitude faster to serialise and deserialise than JSON, and resulted in a lot less allocations too.

It supports compression too. Even if the resulting serialised size isn't always smaller than compressed JSON, from a performance standpoint it's a lot better.


It might be better than text formats, but it still encodes most things as variable-size and this is not particularly machine-friendly. A format with separate metadata and fixed length encodings for most of the things (eg numbers) would be much more efficient to serialize / deserialize.


> resulted in a lot less allocations too

Especially if you use the MessagePack-CSharp library. That dude knows his high-perf .Net.


I've worked extensively with MessagePack and JSON. I much prefer JSON because it's human readable. It's just a pain to debug something that looks like raw binary data, and I usually have to debug it at times I'm just not in the mood to deal with shenanigans.


Reminds me of Adobe AMF (Action Message Format) that was used in their Flash Player:

https://www.adobe.com/content/dam/acom/en/devnet/pdf/amf-fil...


If you want a binary JSON, https://google.github.io/flatbuffers/flexbuffers.html is worth looking at, since it carries over the advantages of FlatBuffers (in-place access without unpacking), but without the need for a schema like FlatBuffers.


I thought the appeal of JSON was that it is, among other things, somewhat human readable. The gain in bytes does not seem to justify the loss of human readability. Am I missing something?


It depends on your use case; having human readable serialised messages isn't always a big deal. For example, I recently used it for a high-throughput messaging system, where the vastly improved performance was a huge selling point.


Are you referring to the reduced payloads or does this also provide a faster way to serialize/de-serialize?


Much faster serialisation/deserialisation (and a lot less allocations on .NET).

It does support compression, but messages sizes are generally comparable to compressed JSON (YMMV, it will depend on your messages).


Probably we need a human readable text view for messagepack

That would let it fully take on the role of json

But what's the text format? -- now there is endless happy bikeshedding

Maybe a library with json (or compatible superset, to get all messagepack features). Then the standard just serializes to messagepack, before gzip or whatever


I wrote a MessagePack to JSON conversion tool which is ideal for viewing MessagePack. It supports a pseudo-JSON debug output so you can view messages even when they use features outside of JSON, like binary blobs or arbitrary key types:

https://github.com/ludocode/msgpack-tools

It works best if you use MessagePack like JSON where all map (object) keys are strings, so you can easily understand a message without context. If you want to optimize your MessagePack more, you would tend to use integers for map keys, but this makes the JSON-equivalent view not super clear because you just see a bunch of numbers in a tree structure.


Thanks! Your `msgpack-tools` are really handy. When I use MessagePack usually I'll add it as an option to an endpoint (`/api/request/123?type=mpack`) in addition to json but there's still cases where having a mpack-to-json tool comes in handy.


Ok so without taking away from this library, what's wrong with using a fully binary format of your own? No string parsing, no nonsense, just binary data. It seems like people are becoming afraid of plain binary, and I don't understand why. It's so easy, plain binary is very small, and is extremely easy to parse when compared to anything with text in it, especially if you have to support multiple encodings.

1.) Decide upon a binary format.

2.) Use it.

No 3rd party libraries required; you could write writer or reader code for a Commodore 64, if you had to.

Keep it simple. I suspect any simple "type-length-value" type of binary file format would be written or read at least as quickly as this, without third party code.


Communication. The new guy joining your team can be immediately productive with JSON, protobuf, etc, but will take some time to understand the homegrown format. Also much more error prone. Most teams don't want to be writing binary encoders and decoders as that's not where their expertise lies, and would much rather focus their time on business features.


Been there, done that, switched to proto bufs because maintaining 4 different parsers in 3 different languages was a pain.

Also proto bufs gave us those nice schemas, the code gen, and easy forwards/backwards compat.

Was it slower? Yeah. Was it crap ton less work? Yup.


I haven't had the same experience I guess. Keeping binary readers and writers maintained was a very small portion of the time I spent on the applications which used those writers and readers.


In our case we were adding new APIs and expanding our format rapidly for years on end. After the 20th or so miscommunication that led to a week+ delay because someone got a field order wrong, or in one case because we hit a bug in the C# compiler in regards to struct layouts (!!), we switched away from rolling our own.


It's unlikely you can write a serializer/unserializer that is faster than what we already have. And bug-free on complex objects.


It's unlikely that I can? That's a strong statement. Binary file formats are not difficult. Writing code that does the right thing is also not difficult.

I am opposed to the blind use of libraries like this. Developers need to understand what they're doing and how what they're doing is being done at a reasonably low level if they ever hope to become better developers. Masking it all away behind a third party library is not how you understand the code that's running on your systems.


How about you write a simple binary substitute that covers just the standard JSON types (number, string, true, false, null, array, object), and see how you do compared to the existing JSON serializers/deserializers and MsgPack.


Fun POV: Text is a binary format. Just standard, with tooling.


Unfortunately messagepack also forces IEEE754 standard for floating point numbers. That means it is useless for things like money or large/arbitrary precision numbers.

In JSON a number-type value has to be parsed like in javascript, so IEEE754 double.

for example:

  {"number": 0.1}
in reality it is parsed as:

  {"number": 0.100000000000000005551115123126}
Merely deserializing and then re-serializing might introduce a change in the serialized value.

Because of problems with IEEE754, in all of our APIs we use floats only as strings, like {"number": "0.1"}. For us, the enforced IEEE754/double format for floating point numbers renders the number type nearly useless.


This is technically correct, except that almost all JSON libraries parse JSON numbers into either 64-bit integer or 64-bit double, so you get those lossy conversions anyway.

I've worked on a production app where we had to swap out the JSON parsing libraries in the server (Rails) and all clients (Android, iOS, Rails again) for ones that preserved numbers as BigDecimals. This was a huge pain, it made everything slower, and even then it wasn't ideal because different BigDecimal libraries aren't even necessarily compatible. Foundation's NSDecimalNumber has different limits than Android's BigDecimal for example, and these are the types used by the parsers so you can't just get the raw data to do it yourself.

If I had to do it over again I would never rely on JSON's decimal support. I'd rather stuff my decimals in strings and parse them myself.


I think you're conflating implementation details of your JSON decoder with the JSON spec. There is no requirement that JSON numbers are parsed into IEEE floats or doubles. Many JSON parsers support decoding numbers into other decimal data types.

https://www.json.org
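
Python's standard library is one example; a tiny sketch:

  import json
  from decimal import Decimal

  doc = json.loads('{"number": 0.1}', parse_float=Decimal)
  print(doc["number"])                 # Decimal('0.1') -- exact, no binary rounding
  print(json.dumps(doc, default=str))  # {"number": "0.1"}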


Small lesson learned with MessagePack:

We prematurely utilized it and paid the price of "nothing's human readable without first unpacking it" without actually benefiting much, given we weren't shipping much data that often.


Anyone know of an efficient binary format similar to MessagePack that supports deserializing only whitelisted keys, and not paying the penalty of parsing data that isn't needed?

MessagePack is great, but libraries generally only support deserializing the whole thing. I have an application where these structured documents can be very large, and scanning code sometimes only needs a very small subset of keys.

I believe Cap'n Proto has this feature, but unlike MessagePack it's not schemaless.

For example, given a struct like this:

  {
    "id": "123",
    "name": "Developers",
    "members": [{
      "id": "567",
      "permissions": [
        {"type": "read"},
        {"type": "write"}
      ]
    }]
  }
Let's say I only want the name, the ID of each member, and whether permissions.read is set. I may want to do something like (Go):

  StreamingUnmarshal(b,
    func(keypath string, parse func() interface{}) {
      switch keypath {
        case "id", "members.id":
          value := parse()
          // ... use value ...
        case "members.permissions.type":
          if parse().(string) == "read" {
            // ...
          }
      }
    })
Random access could also work, as long as it didn't need to sequentially parse from the beginning of the data to get to the right value each time. Something like:

  id := GetKey(b, "id")
  memberIDs := GetKey(b, "members.id")
  permissionTypes := GetKey(b, "members.permissions.type")
The trickiest bit is treating arrays of nested structs (as in "members.permissions.type") correctly, although efficient scanning of keys becomes an important optimization point, too.

Probably the best method would be to store the keys pre-sorted at the beginning of the data, so that they'd better fit in the CPU cache, and have pointers to the offsets of the values:

  KEY1,KEY2,KEY3,VALUE1,VALUE2,VALUE3
Arrays of structs are tricky here, again, but this is solvable.


Have you considered using a sqlite file?

No, seriously: It doesn't require an external schema, supports every kind of indexing and fast random access you could want, is supported in like every language, OS, and architecture, has copious tooling, documentation, and community support, and is battle-tested across literally billions of installations worldwide.


No, that would not make any sense. SQLite cannot deal with structured, hierarchical data like the stuff I described in my comment.

Also, I am talking about individual documents that already live in a database such as PostgreSQL. I can't store an entire SQLite database in a single column.


> I can't store an entire SQLite database in a single column.

Sure you can, it's just a file. :) sqlite scales down nicely to data sets of just a few kilobytes -- if you're worried about parse time of your documents then I assume they are larger than that.

That said, if you're already loading the whole blob from a single row in Postgres anyway, then is random access such a big win? Or is the idea that you would selectively read byte ranges out of Postgres? If you're already pulling the bytes into RAM then avoiding the parse isn't that huge of a win.

(I say this as the author of Cap'n Proto which is all about zero-copy random access... it's only a big win in certain use cases, like mmap() or shared memory IPC.)


Well, many documents are several megabytes, but many are in the order of 100 bytes. I need a serialization scheme that scales up to large documents and down to tiny ones.

I can't imagine that the overhead of initializing an SQLite database from a small byte array in memory is that small, not to mention the overhead of maintaining the table schema.

Out of pure curiosity, I glanced at the Go bindings for SQLite, and there's no provision for initializing a database from a byte array, or accessing the raw underlying byte data of a live database. The C API supports implementing your own VFS for custom storage, but that's not supported by the Go bindings, and seems like a lot of work.

You're right about loading whole blobs; I was misremembering a little bit. The application in question already pares down the document keys in its queries to avoid sending everything. I'm in the middle of a research project into an alternative backend where the documents are stored as binary data, not JSON, and given a set of keys/keypaths, I want to do a little better than deserializing the whole blob.


Take a look at FlexBuffers, part of FlatBuffers. FlatBuffers itself is similar to CapnProto and requires a schema, but FlexBuffers is a related schema-less format that uses a bunch of the encoding techniques from FlatBuffers to gain advantages like (I believe) not having to parse unneeded substructures.


Thanks, that looks nice. But it looks like FlexBuffers is only implemented in C++ and Java.


I was looking into binary JSON formats recently (there are a ton), and there are some important differences to note. Especially for my application - you can't do what Amazon calls a "sparse read". Say I have a 10 GB JSON file and I only want to read one key, no dice. You have to parse the entire file.

"But, maybe some of these binary JSON formats are smarter!" you think. Well, almost all of them aren't. Only two are: BSON, and Amazon Ion. Unfortunately BSON limits the message size to 2 GB.

Amazon Ion is also the only format that actually deduplicates object keys. Definitely the most capable and well-designed of these formats. Unfortunately it is also the most complicated.

Sadly a lot of these formats make questionable choices, like storing numbers in big endian format (why?), using explicit `uint8`/`uint16`/`uint32` sizes rather than something like Protobuf's varint, etc. And there are also a load of them that are nearly identical. You really have to dig deep to find the critical flaws.


Binn also does binary length prefixing of all values, allowing you to skip through it. It's also a lot simpler than Amazon Ion and doesn't have the awful mistakes of BSON so it might do the trick for you. It's not at all popular though.

https://github.com/liteserver/binn

The reason most formats don't length-prefix everything is because it makes it costly to encode in both time and space. You have to basically encode a message inside-out to calculate the nested sizes of everything. This is going to be hugely slow and memory-intensive if you're encoding a 10 GB file, and it's useless for messages on the scale of kilobytes so there isn't any point. MessagePack on the other hand can be encoded in one pass from start to finish as long as you know the element counts of your maps and arrays beforehand.

> storing numbers in big endian format (why?)

Embedded processors tended to be big-endian, like older PowerPC and older ARM. These formats are designed for embedded so it (probably?) improved performance on those processors. This is less true now since virtually all modern ARM processors and probably most other embedded processors now run in little-endian mode.

Ultimately what it comes down to is that these formats are designed for the opposite of your use case. I don't know what you're using a 10 GB JSON file for but there must be a better storage solution for you than a schemaless serialization format.


> The reason most formats don't length-prefix everything is because it makes it costly to encode in both time and space.

Yeah this is true, except for BSON because it uses fixed-size length prefixes, so you can just go back and fill them in later. Presumably that's why they used fixed-size lengths. The downsides are it is less space efficient and limited to 2GB.

In any case Amazon make the very good point that formats are read more often than they are written. It makes sense to optimise for the read case.

> Embedded processors tended to be big-endian, like older PowerPC and older ARM.

Nobody uses PowerPC anymore, and ARM hasn't been big endian for ages. Also MessagePack isn't designed for embedded systems and it still uses big endian. I don't think that's the reason. I suspect it's from a misguided belief that "network byte order" still matters.

And I totally agree, a schema-based format makes way more sense for my use case - changing is difficult though.


I looked into Binn, but unfortunately it has a 2GB file size limit too.


I've skimmed over it and didn't see if there's a compelling reason to use this over CBOR, or vice versa.

Anyone have any insight?


They're almost the same. To see how similar they are, compare their implementation in nlohmann's JSON library for C++. They are both processed by the same class template, only some constants are different.


CBOR went to the trouble of being an IETF standard, RFC 7049. So, you know, the lovely thing about using standards!


CBOR also made a lot of changes to MessagePack making it far more complicated, both to use and to implement. I've talked about this on HN before so I'm repeating myself a bit but here's a short list:

- CBOR has two ways of encoding maps and arrays: fixed length and variable-length. This complicates decoders, especially those that would pre-allocate arrays and maps to the proper sizes, which significantly reduces decoding performance. The CBOR spec has nothing useful to say about this; it just requires you to allocate indefinitely.

- CBOR defines a canonical representation, including a key sorting order based on binary representation which is just awful. It requires multi-pass encoding which is slow, complex, error prone, and completely non-intuitive: [1,2,3] comes before 100000 which comes before [1,2,3,4].

- CBOR has more types in the core spec, ones that are extremely specific to certain applications or programming languages. It has a 16-bit float, and it has both null and undefined as separate types.

- CBOR defined a system of "tags" with a huge number of extension types. These are supposed to be optional, but of course they only work if both ends support them. Some features like BigNum are well-supported in some programming languages but not others, so CBOR implementations tend to diverge in supported message types.

CBOR as a standard is far worse than the "non-standard" MessagePack it purports to replace. Here's a great HN comment on it from another user (and another MessagePack library implementer) a few years back: https://hackertimes.com/item?id=14072598


All those gripes are optional features.

Your parser or encoder does not need to support indefinite arrays; that feature is clearly designed to be used with some practical limitations, like "I don't know how many, but let's assume fewer than x, and I'll send a STOP when I'm done". Canonical ordering is optional. Yes, it's a typed system that has more types; IDK what to say about that other than you don't have to use them. And yes, tags need to be supported on both ends, just like ANY DATA that is being transferred; compare that to a strictly schema'd system and I see no difference, except that you're only partially required to adhere to the plan.

Maybe msgpack is just objectively better because it has fewer features. IDK. Doesn't matter, because CBOR got an RFC and is actually popping up in places. If there was a competition, CBOR won, right or wrong.


I tried to work with MessagePack last year while teaching someone who was building it into their product. Absolute nightmare.


JSON is as pervasive these days as XML and despite its shortcomings is not going away for the foreseeable future.


These formats are great for private communication, but become a nice attack surface area if publicly exposed. I was using thrift on an old project and had to add a sanity check layer to make sure an attacker couldn’t just specify a valid request with a list set to have int.max elements.


But any parser needs validity checks of this kind, including the JSON one.

A common way to mess with JSON parsers is e.g. to nest a lot of arrays and objects. So you need a max nesting depth. There are a bunch of other ways to mess with JSON parsers, too.

(In the case of MsgPack lists and maps: a parser should only pre-allocate memory for a given size if it has done proper sanity checks, e.g. comparing the given length with the remaining byte length of the message. Alternatively you can simply not preallocate and instead, like in JSON parsers, grow your list on demand and just use the length to know when the list ends.)

But yes, you have to make sure the parser works for your use case.
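
For what it's worth, some libraries expose such limits directly. E.g. the Python msgpack library lets you cap sizes at decode time (a small sketch; the exact option names and defaults may vary by version):

  import msgpack

  data = msgpack.packb({"items": list(range(10))})

  # Reject messages that claim absurdly large collections before allocating for them.
  obj = msgpack.unpackb(
      data,
      raw=False,
      max_array_len=1000,
      max_map_len=1000,
      max_str_len=1024 * 1024,
  )
  print(obj)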


This is great, but, to complement this, does anyone know a better alternative to TCP? I've recently needed to make two processes communicate over the network and, while MessagePack handles the serialization, I found myself needing something higher level than TCP but lower than HTTP.

An annoyance of TCP was that I never knew whether I read all the data. I either read less and leave data unread, or I read more and end up blocking for a long time (or implement timeouts and get the worst of both worlds).

What's a good alternative? Maybe 0mq? All I need is to send some MsgPack bytes to another client over the wire, hopefully without having to guess whether there's more to read in the socket or not.


ØMQ is probably a good fit (I think nanomsg is dead?) but there are also a number of protocols at the TCP level that provide a message-sequence interface rather than a byte-sequence interface: SCTP is perhaps the best-known, and Plan9 IL is another (deprecated) solution.

For some applications, though, the simplest solution is to shut down one half of the TCP connection once you've finished sending your data. That's how rsh, finger, and HTTP/0.9 responses work, and it's a supported option in HTTP/1.0 and HTTP/1.1. Failing that, preceding each message with a byte count, a la netstrings, is fairly simple; or you can use SLIP-like or COBS framing.


Hmm, interesting, and 0mq is a bit heavy. Unfortunately I can't shut down the connection, as the server needs to provide real-time updates to the clients (it's pub/sub), but I'll look into SCTP, thanks.


There are a couple of ghetto ways to do pubsub. Webhooks is one, and it's often by far the easiest to implement, but in other cases it's impossible. "Long polling" is another: you open a connection and tell the server what you think the current state of a variable is, and the server just sits there with the connection open until that isn't the current state of the variable any more, at which point it sends you the new current state, or the delta from the state you had to the current state, and closes the connection. If you were wrong about the current state, this happens immediately. Again, though, there are pubsub cases where this works, and pubsub cases where the extra latency and kernel CPU of opening a new TCP connection for every message are intolerable.

So, to take the canonical concrete example, a chat channel might number the messages on it in a monotonically increasing order, and you might tell the server the channel name and the number of the last message you saw, at which point it sends you the messages since that point, if any, then closes the connection. As I understand it, this is how Kafka works, except for the connection-closing part.

In all probability, your life will be easier and your performance will be better with ØMQ, but these hacks are things that work reasonably well and are extremely easy to implement with off-the-shelf tech.

SCTP in many cases suffers from the fact that it doesn't run on top of TCP, so NATs don't know what to do with it. If you have enough control over your network that that isn't a concern for you, UDP with IP multicast is another plausible solution, the one TIBCO used originally IIRC; you can allocate a multicast IP address per pubsub channel or multiplex them. With IP multicast, recovery from lost messages is a concern, especially if 802.11 is part of your network (since 802.11 uses hop-by-hop ACKs for unicast packets) but there are a variety of reliable multicast protocols like SRM to handle that.

Feel free to hit me up for more info, I've been hacking around with different ways of doing pubsub since the previous millennium.


That's very informative, thank you for taking the time. Just so you have more context, this is what I'm using this in:

https://gitlab.com/stavros/itsalive

Clients can connect to the server and get updates for the commands that are currently running, which is not high throughput or complex from a networking perspective. I was wondering if there was something lightweight that will do the same, and 0mq seems like the best choice, but a simple loop over the connections seems to work well as well.

I played around with 0mq for this and it works great, but in this instance I might not want to add the extra dependency (especially since I've already implemented it, minus a bug where it'll block if a packet is exactly 4k).

I think adding an "end of message" character (eg a newline) would be the simplest thing to do in this instance.


Yeah, that sounds like the best choice. (That's the SLIP framing approach.) SCTP probably isn't viable if you want random people to be able to watch the presentation without rebuilding their kernels, and multicast IP isn't viable on the global internet. An IRC server would work fine, and you might even be able to just use a secret channel on Freenode, but some places block IRC because of other pubsub software that uses it.


Yeah, I wouldn't want to burden Freenode with that, but IRC is an interesting choice. I'll use the terminating character, thanks for your time!


Right, the benefit of using IRC is that you don't have to write the server; there are dozens of well-known, actively-maintained free-software servers, they're well-documented, and they already support epoll and kqueue and have reasonable ways of handling all kinds of pathological network conditions. But maybe a simple asyncio-based event loop, or even threads, would be fine for itsalive.


Do you mean just opening a port and listening for connections? And a client which connects to that port and sends/receives data? That's not very low level and it's pretty easy to do, if I'm understanding you correctly.


How do you guarantee that you've read the entire buffer without trying to read more and blocking?


Maybe I misunderstood, but if you add a length header to every packet, reading becomes trivial.
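
The whole thing fits in a few lines; a minimal Python sketch with msgpack payloads (the 4-byte big-endian prefix is just one reasonable choice):

  import struct
  import msgpack  # assumed: msgpack-python

  def send_msg(sock, obj):
      body = msgpack.packb(obj)
      sock.sendall(struct.pack(">I", len(body)) + body)  # length prefix, then payload

  def recv_exact(sock, n):
      buf = b""
      while len(buf) < n:
          chunk = sock.recv(n - len(buf))
          if not chunk:
              raise ConnectionError("peer closed mid-message")
          buf += chunk
      return buf

  def recv_msg(sock):
      (length,) = struct.unpack(">I", recv_exact(sock, 4))
      return msgpack.unpackb(recv_exact(sock, length), raw=False)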


It does, but if I can have a library handle that for me, that'd be better. Looks like 0mq is what I want, it also does pub/sub so it frees me from doing that myself.


The "Try!" demo fails with large JSON submissions with an exception in jQuery:

Uncaught RangeError: Maximum call stack size exceeded

I'm curious to see the savings difference and hoped to with "Try!" but it'll have to wait.


Slightly related: I recently wrote a binary encoder/decoder that used bitpacking, delta-encoding, and other well-known 'tricks' to efficiently pack batches of 10k-100k events of the same uniform type (essentially encoding column by column and using similarities to my advantage). Nothing too fancy, but it was A) a huge success in terms of compression ratio and B) a hassle to write. Do any, more or less, turn-key solutions exist for this? Specifically targeting Node but a command-line util might work as well.


I had a play with something similar and I noticed that -- at least for my use case -- you can get ~90% of the gains through just a couple of simple tricks:

1) Identify medium-scale similarity boundaries in the data structures. E.g.: a sequence of messages in a protocol, such as a C "struct" with a bunch of fields.

2) Compute the binary difference between these structures so that most of the subsequent bytes after the first message are either zeroes or small numbers. Both the sender and receiver have to keep the previous message in a buffer to allow this.

3) Use a high-performance compression algorithm that supports "user provided dictionaries", such as Zstandard. Train it with sample data.

The above is surprisingly straightforward because it doesn't require complex changes to the underlying data structures. You don't even necessarily need to be able to parse the data at all, as long as it has large-scale repeating structures that you can identify.
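
A rough Python sketch of steps 2 and 3 (assuming the python-zstandard package; the byte-wise XOR delta is just one simple way to do step 2, and the samples here are stand-ins for real captured messages):

  import zstandard as zstd  # assumed: the python-zstandard package

  def delta(prev: bytes, cur: bytes) -> bytes:
      # Byte-wise XOR against the previous message: unchanged bytes become zeroes.
      prev = prev.ljust(len(cur), b"\0")
      return bytes(a ^ b for a, b in zip(prev, cur))

  # Stand-in sample messages; in practice, use real captured traffic.
  samples = [(b"sensor=%d temp=%d hum=%d " % (i % 8, 20 + i % 5, 40 + i % 7)) * 32
             for i in range(200)]

  # Train a shared dictionary offline; both sender and receiver must ship it.
  dictionary = zstd.train_dictionary(16 * 1024, samples)
  compressor = zstd.ZstdCompressor(dict_data=dictionary)

  prev = b""
  for msg in samples:
      wire = compressor.compress(delta(prev, msg))  # receiver needs the dict and the previous message
      prev = msg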


Agreed. It really blows generic compression algos out of the water. Didn't know about Zstandard dictionary encoding. This might just be what I'm after. Thanks


https://medium.com/unbabel/the-need-for-speed-experimenting-...

> It’s that it is not enough to just know some new cool technology, nod along and go about your day with your assumptions unchallenged. You need to find out more, test it out, have a grasp before committing to it, and, if you’re lucky, learn a thing or two in the process.


This reminds me of a quite popular mobile game that sends master game data using messagepack which is then gzipped and then encrypted and then base64 encoded in a json response. Fun.


That reminds me of one of my first contracts, 10 years ago, for an Android mobile app. It was an app made to quickly record something using video, audio or picture, which would then be uploaded through their API. The API worked essentially like you said: it converted the binary file to base64, added it to a JSON document, which was put in a GET variable (thus urlencoded) and sent over an HTTP connection (I don't remember if it was HTTPS or not though, I hope it was, but at the time I could have ignored that part). Android didn't allow more than 16 MB of memory for an app, so I had to build streams to handle each step individually, which was an interesting challenge. I was amazed to find out that there were officially supported Base64 streams, but their API strangely didn't accept the default Base64; it only accepted the URL-safe variant (which replaces a few characters with other ones), so I had to add another stream on top to do this.


I once used an API where to send a command, you would make a POST where the message body was an x-www-form-urlencoded dictionary where one of the values is a binary string which is a zipped XML document.

It was very clearly just a page on the vendor's site where you could manually upload a zip of documents, which they had simply declared to be an API.


You can't beat JSON by very much with a format that has the same free-form structure. (Particularly when you use gzip, zstd, ...)

Parsing code has to be branchy to handle many different possible structures.

If you want extreme performance, variable-length strings are a problem -- the old mainframes that had fixed-length "HOLLERITH" strings had a good idea. I like just about everything about Apache Arrow except that it ignores the problem of fast/portable string handling.


The serialization format here was designed so you can parse it (or detect a parse error and bail out) in a single O(n) pass through the message data. No backtracking required.

(You mentioned strings, so I'll use that to give flavor: a string is represented as a one-byte prefix, followed by a byte length, followed by that many bytes of UTF-8. I'm not sure whether you'd categorize this as fixed or variable length, but that's how it's represented on the wire.)
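
You can poke at this interactively with the Python library (the byte layouts are the string formats from the MessagePack spec):

  import msgpack

  print(msgpack.packb("hi").hex())          # 'a26869' -> fixstr: 0xa2 encodes "str, length 2", then UTF-8
  print(msgpack.packb("x" * 40).hex()[:4])  # 'd928'   -> str8: marker 0xd9, then a 1-byte length (0x28 = 40)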


Well, you can't always beat it much from a size perspective, but MessagePack can be an order of magnitude faster to serialise/deserialise than JSON.


It is somewhat faster, but other formats are even faster than that.


In all the cases I've used it, it was a lot faster than JSON.

Other formats might have been faster, but MessagePack was very easy to use.


Why would one use msgpack over flat buffers? AFAIK FlatBuffers offers an amazing zero-copy feature, meaning you can get to the data of interest without having to parse the whole object into memory. Since every type is length prefixed you can jump around indexes quickly and access only what's needed.

It seems msgpack still needs a deserialize step to partially read data, right?

Netflix uses flatbuffers and it works wonders on low-powered devices.


Obligatory Cap'n Proto[1] reference whenever serialization format discussions crop up.

[1] https://capnproto.org/index.html


Capnproto is schema based rather than self-describing. So it's related, and it does compete with formats like json and msgpack, but there isn't total overlap in the applications.


It would be nice to have a taxonomy, or at least a bestiary, of serialisation formats. Tentatively:

Schema-driven no-compromise fast compact binary formats with no cross-version compatibility: Cap'n Proto, FlatBuffers, SBE, ASN.1 PER, XDR, OMG CDR.

Schema-driven binary formats which allow some cross-version compatibility: Protocol Buffers, Thrift.

Self-describing binary formats: MessagePack, CBOR, BJSON, Bencode, ASN.1 BER, Avro (?), Fast Infoset, AMF3.

Self-describing textual formats: JSON, XML, YAML, TOML.

I'm using "self-describing" here to mean simply that you can recover the structure of the encoded data without a separate schema, rather than that you can attach any semantic meaning to it.


> no cross-version compatibility: Cap'n Proto, FlatBuffers

This is incorrect: Cap'n Proto absolutely allows cross-version compatibility, using roughly the same semantics as Protobuf. I believe FlatBuffers does too. (I'm unsure about the rest, haven't studied them in a while.)

> I'm using "self-describing" here to mean simply that you can recover the structure of the encoded data without a separate schema, rather than that you can attach any semantic meaning to it.

Protobuf, Cap'n Proto, and probably several of the other binary formats can parse data into a message tree without the help of a schema, but all the fields will be labeled numerically. MessagePack is only considered "self-describing" in comparison because it encodes human-readable field names on the wire.


Needs honorable mentions to:

- Avro

- CBOR

- SMILE

And BSON, anyone? I don't think many people besides MongoDB use it, though.

Yes, compressing JSON with a gzip-style compressor usually yields 0.5-1% better results than an equally compressed binary format (in my limited testing). Still, the serialization speed and savings on compression are great to have.


I implemented this in Swift w/o using the Foundation library a few years back if anyone wants it: https://github.com/wittedhaddock/bytepress


OK, I give up... how is that two-column list of languages organized? Why is the Perl implementation [1] not listed?

[1] https://metacpan.org/pod/Data::MessagePack


So not at all like JSON, then?


But is it readable as plaintext? That's one of the main appeals of json.


I used to care about that but once the JSON gets at all complex, it's nice to switch to pretty formatting of it. At that point, a dev tool in the browser could do the same for this format so I don't think that is as pressing (for me and I suspect most people).


Actually, I think that's no longer the case.

(Except if you want to use JSON for configs, which I strongly recommend against for a bunch of reasons, including missing support for comments.)

I mean, by now, nearly every time you send things over the wire they are encrypted or at least compressed. So inspecting on-wire messages without "proper", more complex tooling doesn't really work. But if you already have more complex tooling involved, there is no reason why you need to be able to read the raw message; it could just be converted on the fly into a readable format.

The same applies to application development, e.g. logging: you always do some formatting/conversion when logging data, so it's not a problem to convert msgpack to a human readable format. And today, logging from e.g. servers should always go to some form of log server which adds features like searchability by indexing the log and similar. So no problem to have a msgpack viewer there, too.

Honestly, I believe the only reason human readable formats made (and sometimes still make) sense was limited tooling, sometimes caused by limited computation power on the developer's system. And the fact that many binary formats are totally over-engineered, making handling them painful, especially debugging slightly corrupted data. Which isn't the case for MsgPack.


Compression is not encryption, and security by obscurity is bad practice. So it doesn't matter if the data is JSON or compressed or Protobuf or MessagePack or a custom struct. As long as it's not properly encrypted, it's just plaintext.


I don't think the parent poster was confusing encryption for compression or suggesting security by obscurity. He just meant that JSON is never sent in pretty-printed plain text over the wire, so you always need tooling to view it.

This has been my experience with JSON as well. Almost all JSON in the wild especially in web services and RPC is at least minified, so you need to pass it through something like `python -m json.tool` to reformat it for viewing. So you might as well use MessagePack and pass it through `msgpack2json -d` to view it instead. It makes no difference whether the underlying format is human readable.
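
As a rough sketch of that point (assuming Python with the msgpack package installed; the payload here is just illustrative):

    import json, msgpack

    payload = {"user": {"id": 7, "roles": ["admin", "ops"]}}

    wire_json = json.dumps(payload, separators=(",", ":"))  # minified, as it usually goes over the wire
    wire_msgpack = msgpack.packb(payload)                   # binary

    # Either way, you reformat before a human looks at it:
    print(json.dumps(json.loads(wire_json), indent=2))
    print(json.dumps(msgpack.unpackb(wire_msgpack), indent=2))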


Sure doesn't look like it..

I agree, that's why JSON is honestly great. I avoided it for the longest time, but now I totally see the appeal.


What did you avoid it in favor of? XML?


It's not. IMO it's a good replacement for ProtoBuf.


Maybe where you want to have a human debug or see what is happening (like in localStorage, or cookies). But if you're going to transfer a significant amount of data between front and back-end, for example, that can save money on a cloud service.

You can easily make a function wrapper that uses JSON in a dev environment and MessagePack in Production, for example.
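
Something along these lines, as a minimal sketch (assuming Python with the msgpack package; the APP_ENV variable and function names are made up for illustration):

    import json
    import os
    import msgpack

    USE_MSGPACK = os.environ.get("APP_ENV") == "production"

    def encode(payload):
        # Compact MessagePack in production, human-readable JSON everywhere else.
        return msgpack.packb(payload) if USE_MSGPACK else json.dumps(payload).encode()

    def decode(raw):
        return msgpack.unpackb(raw) if USE_MSGPACK else json.loads(raw)

Just make sure both ends agree on the same switch, or you'll be debugging some very confusing payloads.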


If you want to transfer a significant amount of data and don't need readability, using protobufs makes even more sense, since the key names do not go as strings on the wire, so the data should be much smaller.


MessagePack lets you use integers for keys, so you can use enums or integer constants in the code instead of strings. There are good reasons to use MessagePack over Protobuf even if you don't need readability, such as easier integration into buildsystems and better support for embedded platforms.
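
A quick sketch of the integer-key idea with the Python msgpack package (the field numbers here are made up):

    import msgpack

    FIELD_ID, FIELD_NAME = 1, 2  # hypothetical field numbers instead of string keys

    packed = msgpack.packb({FIELD_ID: 42, FIELD_NAME: "alice"})
    print(packed.hex())                                    # no key strings on the wire
    print(msgpack.unpackb(packed, strict_map_key=False))   # {1: 42, 2: 'alice'}

You do lose self-description that way, of course; at that point the field numbers are effectively a schema living in your code.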


It's taking a different angle from JSON: encoding it in JavaScript is done from standard objects into a Uint8Array.


I like it! Sort of the unholy love child of XDR and JSON :-).

Now we need an IDL that will let you define a structure and have it produce <language> marshalling and unmarshalling routines.


Missing links to benchmarks. Also, JSON is not really a good baseline for comparison. If you support binary data natively in the format, you should compare it to BSON.


BSON is just terrible. Just as a taste, arrays are encoded as key-value pairs, where the key is the array index converted to a decimal string and then stored as a zero-terminated string with a 4-byte length prefixed to it. It boggles the mind why somebody would design a format like this.
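
For anyone curious, a rough way to see this from Python (assuming pymongo's bson package, which provides bson.encode() in pymongo 3.9+, with msgpack for contrast):

    import bson      # ships with pymongo
    import msgpack

    doc = {"arr": list(range(10))}

    as_bson = bson.encode(doc)
    as_msgpack = msgpack.packb(doc)

    print(len(as_bson), len(as_msgpack))  # BSON is noticeably larger for the same data
    print(as_bson)                        # the decimal index keys '0'..'9' are visible in the raw bytes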


Is there any comparison of the tooling/performance/etc of MessagePack, BSON, Protobuffers, Flatbuffers, and plain JSON?


I did some pretty extensive benchmarking of various schemaless serialization libraries a few years back. All of these libraries have advanced quite a bit since then so it's a bit out of date, but the relative speeds of MessagePack vs JSON and BSON are probably still relevant:

https://github.com/ludocode/schemaless-benchmarks

I haven't compared them to schema formats like Protobuf or FlatBuffers yet because the use cases are pretty different. I like MessagePack for small projects or rapid prototyping because you don't need to integrate any big libraries or set up code generation as part of your buildsystem. (Mostly I got sick of integrating the C++ Protobuf library into embedded projects.)

The MessagePack format is a lot simpler than Protobuf and the best implementations are nowhere near as allocation-prone as the reference implementation so I expect they would beat it flat out on performance, though the messages may be slightly larger. They would probably beat FlatBuffers for encoding speed as well, but I don't expect any schemaless format could beat FlatBuffers for decoding speed.


At noesis.gg we made a small, not-at-all scientific comparison of our JSON and flatbuffers implementations: https://www.noesis.gg/news/player-movement-speedup.html.

The somewhat silly video on that page shows the actual difference in performance our users felt after the change. It was a _huge_ benefit, both in terms of loading time, but also in terms of memory, vastly increasing the number of CS:GO rounds that could be analyzed simultaneously.


Lots... however, your best bet is to search for your language, platform, and use case. Performance can vary widely by platform and language: JSON is actually faster on some platforms or in some libraries, binary options faster in others. Connection interface and overhead are also an issue.


It's for a specific library of course, but there are some numbers here:

https://github.com/neuecc/MessagePack-CSharp


Also Avro and Thrift, please!


It also supports binary data in strings, unlike JSON. This means no more slow base64 encoding/decoding.


Not sure why you are being downvoted; sending binary data is a well-known issue with JSON, and base64 can be pretty costly if you're using a naive implementation.

In a previous life I inherited a service that shipped tons of data (billions of requests a week) as base64 encoded protobuf strings over HTTP. It was a bad solution in so many ways, but there were historical reasons why it had gotten there. The system required a number of servers and I decided to do some profiling to see if there were some quick gains that could be made. As it turns out, about 60% of CPU time was spent decoding base64 using Python's standard library. I was shocked.
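
For a feel of the size overhead alone (speed will vary a lot by implementation), a quick sketch assuming Python with the msgpack package:

    import base64, json, msgpack

    blob = bytes(range(256)) * 64  # 16 KiB of binary payload

    as_json = json.dumps({"blob": base64.b64encode(blob).decode("ascii")}).encode()
    as_msgpack = msgpack.packb({"blob": blob})

    print(len(as_json), len(as_msgpack))  # base64 adds roughly 33% on top of the raw bytes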


Apologies if I missed it, but what are the advantages of this over Bencode?


Size and speed comparison between JSON.zst and MessagePack please.


Why only compress the JSON? MessagePack will compress about as well as JSON does.


The sales pitch for MessagePack is: like JSON, but fast and small.


Correct: Uncompressed MessagePack is faster and smaller than uncompressed JSON, and compressed MessagePack is faster and smaller than compressed JSON.

I still don't see why you'd compare uncompressed MessagePack to compressed JSON.
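
If anyone wants to check on their own data, a quick and unscientific way to get numbers (assuming Python with the msgpack package; zlib stands in for whatever compressor you actually use, and the sample data is made up):

    import json, zlib, msgpack

    data = {"users": [{"id": i, "name": f"user{i}", "active": i % 2 == 0}
                      for i in range(1000)]}

    as_json = json.dumps(data).encode()
    as_msgpack = msgpack.packb(data)

    for label, raw in (("json", as_json), ("msgpack", as_msgpack)):
        print(label, len(raw), len(zlib.compress(raw)))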


Because I suspect that MessagePack, compressed or not, is not worth the effort, and I won't know until I see a comparison.

In other words, compressed MessagePack is probably only a tiny amount smaller than compressed JSON.


Sure, a comparison between compressed JSON vs. compressed MessagePack is interesting.

I interpreted your original message as requesting a comparison between compressed JSON vs. uncompressed MessagePack, which didn't make sense (but which I see people ask a lot, including elsewhere in this thread). Sorry if I misunderstood.


>I interpreted your original message as requesting a comparison between compressed JSON vs. uncompressed MessagePack

You interpreted it correctly, and it makes sense: MessagePack is an alternative to JSON, so I'd compare it against JSON.zstd if what I need is compactness.


Bit of a shameless plug, but yet another alternative is VTON, though it is typeless.

https://github.com/scandum/vton


I'll throw my hat into the ring as well.

I've been building a new ad-hoc data format to replace JSON for a couple of years [1], and am nearing completion of the reference implementation in Go. It natively supports the following types:

* Nil : No data (NULL)

* Boolean : True or false

* Integer : Positive or negative, arbitrary size

* Float : Binary or decimal floating point, arbitrary size

* Time : Date, time, or timestamp, arbitrary size

* URI : RFC-3986 URI

* String : UTF-8 string, arbitrary length

* Bytes : Array of octets, arbitrary length

* List : List of objects

* Map : Mapping keyable objects to other objects

* Markup : Presentation data, similar to XML

* Reference : Points to previously defined objects or other documents

* Metadata : Data about data

* Comment : Arbitrary comments about anything, nesting supported

But the most important feature is that it is a paired format: a binary format [2] and a text format [3], which are 1:1 compatible. This allows you to transmit in the binary format, and only convert to text when a human is involved.

I've put together a quick comparison here: https://github.com/kstenerud/concise-encoding#comparison-to-...

Currently, I'm finishing off the Go implementation [4], which so far I've managed to get running 30% faster than the JSON codec, using less than half the memory. I'll be pushing the binary codec to master soon, and the text codec shouldn't take much longer since the code is pretty modular.

[1] https://github.com/kstenerud/concise-encoding#concise-encodi...

[2] https://github.com/kstenerud/concise-encoding/blob/master/cb...

[3] https://github.com/kstenerud/concise-encoding/blob/master/ct...

[4] https://github.com/kstenerud/go-cbe/tree/new-implementation


Why add special support for URIs and not any other types that can be easily represented as strings, e.g. UUID or ISBN?


It's mostly a question of how common they are. Both UUIDs and ISBNs can be represented as URIs. However, I'm still on the fence regarding UUIDs, because they tend to get a lot of use in general... I may add it after all.

The idea is to make a format for the 80% case, so things like ISBN are definitely out.


Explicitly supporting URI has a performance cost with no clear size win. Supporting UUID at least could win back some bytes.


Oh, it's turd polishing anyway. Numbers in JSON are a pain.


The data types that msgpack supports aren't quite the same as JSON -- msgpack has encodings for signed and unsigned 8, 16, 32, and 64-bit integers, as well as single- and double-precision floats. (This runs into trouble if you're dealing with JavaScript, which doesn't support the full 64-bit range of integers, but everybody else should be okay.)
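
You can see the variable-width integer encodings directly with the Python msgpack package:

    import msgpack

    for n in (5, 300, 2**40):
        print(n, msgpack.packb(n).hex())
    # 5             -> 05                  (positive fixint, 1 byte)
    # 300           -> cd012c              (uint 16)
    # 1099511627776 -> cf0000010000000000  (uint 64)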



