Hacker Timesnew | past | comments | ask | show | jobs | submitlogin

I don't get the XML hate, to be honest. Everyone who drops XML to make it simpler to use eventually re-invents most of XML.

I find typing YAML to be much better than typing XML, but when it comes to serialization, I honestly don't see why JSON is that much better. To solve interoperability, there are three or four JSON schema standards and sometimes extensive written documentation just to explain the types of fields and when they can occur.

Now I need to find out which type of JSON an application uses (does it always use UTF8 or does it break the standard? Which version of the JSON Schema does it use? Is it using a schema at all? What happens when "$ref" appears in the content body? How should I deal with duplicate keys?) and how to properly encode it. It's the quick & dirty way to serialize data, and that's great for messing around with prototypes and getting started quick, but terrible for business critical applications and public-facing APIs.

All serialization formats are stupid and messy in some way but I think XML gets a bad rep because of the defaults many parsing libraries picked (and the vulnerabilities they introduced). Protobuf and friends are probably a much better serialization system for chat messages.

I don't see why developers fear XML so much. I think it has to do with the fact that everything is built by web developers now. Sockets have been replaced by sockets and concise protocols have been replaced by messy HTTP requests. I wonder how long it'll be before ISPs start blocking any traffic not directed towards port 80/443.



When do you put something in an attribute and when do you give it its own tag? What if semantically it belongs in an attribute but the kind of complicated parsing rules make putting it there cumbersome? Should you use <![CDATA[]]> or base64 to encode non-xml data? If you use CDATA what if the data has a ]]>? If you use tags for hierarchical information, what do you do if there happens to be some random non-whitespace/comment data between the tags as if it was html?

I don't really like interacting with json as a human user (mostly because restricting trailing commas and comments are both utterly awful design choices), but I'd take it a hundred times over the enormous ball of complexity and decision paralysis that xml imposes on you. There's just too much there there.


> When do you put something in an attribute and when do you give it its own tag? [...]

Oh, the same old red herrings around irrelevant details.

The simplest answer is: it doesn't matter. Let your serializer of choice handle it.

More in-depth answer: depends on what your schema tools and serializers support best. For example, I define my classes in C# and use dot net's built-in schema exporter to generate schemas (which in turn get compiled again into strongly-typed classes for other languages, e.g., Java). I chose DataContractSerializer, and its rules are simple: "user" data belongs to elements, serializer metadata belongs to attributes. Which makes sense because arbitrary attributes can appear on elements without breaking the schema (e.g., DCS uses xsi:type attribute for polymorphic deserialization). It also decides for you to use base64 for binary data.

Bottom line: use tooling and don't try to tweak the details of the XML form.


>The simplest answer is: it doesn't matter. Let your serializer of choice handle it.

Except, this absolutely matters! I've had serializer incompatibilities across platforms on both things like json and much simpler binary formats.

Letting the "serializer handle it" falls apart in reality, especially on extremely complicated formats like XML. The only reliable way to make complicated cross platform de/serializing work is to define a feature subset and adhere to that.


OK, but this approach is totally useless when it comes to a public API, which is the topic of this thread. Especially when it comes to backward compatibility.


The topic is about decision problems, not about backward compatibility. That said, JSON has corner cases too, similar to XML or worse. All those complaints are just rationalization of hipster propaganda.


JSON has corner cases but the advantage is that every JSON document has a single obvious mapping to language primitives — dicts, lists, strings, floats. Nobody has to agree beforehand how to load JSON data.

In contrast the generic mapping for XML is a Tree[str (name), Dict[str, str] (attrs), Union[str, Tree] (body)] which maps so poorly between languages that people do one of two things — implement formats on top of XML to do serialization which leads to non-interoperability when different software does it differently, or parse to a database-like “Abstract XML object that you query with xpath.


JSON maps well to javascript types, but not anything else. And float isn't an obvious mapping: the standard does have a concept of an integer number, and most numbers are indeed integers. Arrays do everything objects do, but have better performance and better defined behavior.

>Nobody has to agree beforehand how to load JSON data.

Such agreement is never necessary, it's up to the programmer what to write, and the standard doesn't specify behavior of JSON parsers anyway, it only defines JSON documents. For example there's no need to use hashtables, it's a random javascript artifact due to parsing JSON with the eval function.

>when different software does it differently

I assume you mean schemaless documents here. Those are always abstract databases, both XML and JSON. I suppose there's jq that can query abstract JSON databases.


> JSON maps well to JavaScript types, but not anything else.

I grant you that JSON might be equally as awkward as XML languages like C but pretty much every language -- Python, Ruby, Java have very sane mappings to and from JSON types. You don't ever really have to "query" a JSON object, you just `json.loads` and `for item in obj["key"]:`. Even in the cases with schemas you're still usually only working with primitive types.

> Such agreement is never necessary, it's up to the programmer what to write...

What I mean is that there's not weirdness like having to encode types in the base document. You don't have to do things like

    <blahblah type="dict">
       <item>
         <key>dlkj</key>
         <value>kjdf</value>
       </item>
    <blahblah>
where different projects / parsers might do it differently. The "abstract JSON types" are actually useful and expressive where in XML everyone has to carve out their own way to represent lists, mappings, and numbers out of trees because basically nobody works with just trees in day-to-day work.

I think we might be talking about two different use-cases. If what you want to do with XML / JSON is serialize arbitrary classes in $specific_language and then read it back then nothing really matters; the on-disk format is just an implementation detail. But abstract JSON works really really well as a schema everyone agrees on and supported by every language.


> You don't have to do things like [...] carve out their own way to represent lists, mappings, and numbers

I work with XML extensively and out of hundreds of classes and fields, I've needed an arbitrary dictionary maybe a handful of times. Mapping/dictionary is json's abysmal replacement for a class/struct in which case you'd have XML like

    <MyClass>
        <Field1>Value</Field1>
    </MyClass>
IOW, _the tag is the key_ ! List? Simply repeated elements. Numbers? What are you talking about, they're directly representable in XML and XSD knows about integers, floats, etc. (unlike json).


You don't show anything with that. Sure you can walk through any schemaless JSON document, because it has generic JSON document structure, but the same can be done for XML too in any language. You can't make sense of the document this way beyond its wellformedness. There being numbers don't help you much, you can't tell anything about them beyond them being numbers.


>JSON maps well to javascript types, but not anything else.

Except it does. Take python. Ruby. Any language that has a notion of dicts, lists and strings/ints/floats. That's basically every high level language ever. Even exotic stuff like tcl. And e.g. C, a low level language, has a thousand implementations of those same structures.


> That said, JSON has corner cases too, similar to XML or worse.

Feel free to list some.


The often mentioned design paralysis of choice between elements and attributes - in JSON there can be many ways to implement a collection of name/value pairs. One interesting case is compound key: you can use a mini serialization format and still make it an object (actually saw this in the wild).


> in JSON there can be many ways to implement a collection of name/value pairs.

Are there? JSON has dicts and lists. I mean you could store a collection of name value pairs in a list, maybe even a list of lists, but that's just stupid and an incorrect usage of the format. Where as in xml there really are tons of ways to do that, and ALL of them are awkward



Most of these just amount to "some parsers are bad", which.. I mean sure? But xml's surface area for parsers to be 'bad' is so much greater, and its gotchas are thus much more subtle. I hope you're not trying to suggest that all xml parsers are identical and perfect.

Your assertion was that json's edge cases are "as bad or a little worse" but imo this document doesn't suggest that at all. Every single thing listed in it can go more wrong with xml, not less.


1) If _you_ are designing a new API, then my answer still holds. Choose a serializer and let it decide the wire format for you.

2) If someone else has already published the API, well, then it's already decided.

In both cases, your objections are moot.


This just don't work for interoperable protocols (or file formats). If you have a protocol with multiple implementations, they may be written in different languages, with different XML libraries.


The serializer exports the schema and you map the schema to whatever code is needed, using whatever XML library, to extract the data based on the schema. XML rules are strict and how to extract data from the document, given schema, is unambiguous. If you don't have a decent XML library, then you're stuck. Oh wait, the same holds for HTML any other format.


This is really not a simple answer. Now you've just compounded the problem with N solutions, where N is the number of encoders in the wild in use by people who want to talk to your service.

And then you've basically got SOAP, which is one of those acronyms that manages to be none of its constituent words in practice.


Posted in a comment above, but the idea is that text content is text content, because XML was intended as a document interchange format, not a data interchange format.

When you look at SVG content for instance, you'll notice colors and coordinates are contained within attributes, because it means a document reader which cannot understand some crazy new drawing element would still interpret that data is text content, leaving the document accessible (if perhaps badly formatted).

If you care about that rule, then structured, semantic, non-user-accessible data uses elements for structure and attributes for data. This also lets you ignore the difference between semantic and non-semantic whitespace - no whitespace has semantics.

CDATA sections are (usually) represented in tooling, but should be considered just a text node with different escaping rules (unless your document format actually assigns a purpose, which is a really bad idea from an interoperability perspective). CDATA is not meant to provide a way to embed binary data (both XML and JSON are somewhat bad for information which is not primarily text).

Similarly, processing instructions are somewhat orthogonal to the document format. I believe XSLT is the only spec which defined a standard behavior for them, but there were examples such as commercial document editors which saved information like the cursor position as processing instructions.

FWIW (as a human interacting with JSON) - there are extensions to JSON such as JSON5 which aim to add the extra flexibility that makes data entry easier. JSON5 adds comments, trailing commas, unquoted symbols, single quoted strings, and so on. Perhaps its biggest issue is that it allows non-finite number values like NaN, which makes it a superset of JSON at the data model layer. So you can't be guaranteed JSON5 text can be stripped and quoted into valid JSON text.


Agreed. I wish json5 had become, or becomes the standard.

https://json5.org/


One of the virtues of JSON is that you can break it into parseable chunks by line, which also enables stuff like line-delimited JSON streams. JSON5 seems like a good format for a configuration file, though.


some folks even specced it via https://jsonlines.org/ and http://ndjson.org/


That reminds me SAX parsing.


The problem with JSON alternatives is that now I come to say that I wish HJSON [0] had become or becomes the standard... so now we have the N+1 Standards xkcd comic strip again

[0]: https://hjson.github.io/faq.html


I liked this summary: XML is almost always misused [1]

The gist is that XML is best used for markup, because it inherits assumptions from the "document" metaphor that are not needed and are sometimes unhelpful for data interchange. An example of this is hashmaps & sets--they intentionally leave "order" or "sequence" out of their representation, but documents as a form of data interchange force you to keep thinking about it.

[1] https://www.devever.net/~hl/xml


>XML is best used for markup,

Funny enough, the one thing XMPP does not use xml for is markup! See

https://xmpp.org/extensions/xep-0393.html - an ad hoc markup format

https://xmpp.org/extensions/xep-0394.html - a laughably perverse usage of the already perverse XML format I don't know why they did this but I guess they had reasons - perhaps XML doesn't actually do that well even for markups


Do not mention the format war.

Once upon a time, we had XHTML-IM https://xmpp.org/extensions/xep-0071.html

But then web developers came along and just put this directly into the DOM of their web clients, leading to endless XXS exploits, so XEP-0071 was burned at the stake.

XEP-0393 might look ad-hoc, but it's essentially what people were typing into their chats and emails since time immemorial.

People sometimes think this is Markdown and then pick a markdown library off the shelf and then the HTML passtrough bites them, leading us back to the beginning.

I really don't understand how Matrix and Mastodon etc are allowed to pass around HTML embedded in JSON as if that somehow solves all those problems.


Tbh if a client is dumb enough to put xhtml-im directly into DOM with no verification that is the client's problem, not the XEP's, and that should be no reason to cancel it.


Well, if it was _a_ client. But if 100% of clients make the same mistake, then it is the spec that is the problem, or so the argument went.

Maybe one day xhtml-im will make a glorious return as a 2.0, with bigger, better and scarier warnings about sanitizing your inputs


JSON seems to have better parsers than XML, and XML is more verbose. XML has comments though. That said, I think they're more or less equivalent. I hate YAML, though. Awful awful language. Surprisingly complex, and whitespace sensitivity in a language that's templated all the time? Must be the work of the devil. JSON > XML >>>> YAML, IMO.


I used to come from the whitespace-hate camp but I just use linters now and presto literally all of my problems ever with YAML gone.

Use YAML for k:v things, programming languages everywhere else. It's eminently readable and writable and with those linters I mentioned, untroublesome.


> I just use linters now and presto literally all of my problems ever with YAML gone.

Copy a valid fragment of YAML from one file into another file, run the linter, and presto! garbage. Whitespace-indented formats are poison for auto-formatters. They might have made sense back in the 90s when such "advanced" tooling was rare, but they're a bad choice today.


I don't understand this. If you copy a piece of JSON out from one file to another without thinking it will also result in garbage. You need to know where to place code. That seems reasonable to me.


If you copy&paste an object {"a": 5, "b": 6} or an array [1, 2, 3] from one JSON file to another (or from one location to another in the same file), then the structure of the result is unambiguously determined by the opening and closing delimiters {}[]. When you do the same with an indented YAML fragment, the indentation level at the source location might be different from the indentation level at the destination, wreaking havoc on the structure of the result. You have to manually adjust the indentation of the pasted fragment before auto-formatting the file, or else the information about the intended indentation is lost.


I don't know, man. Have you ever had the need to template a yaml file so that a piece of text is inserted from another yaml file? Indentation becomes quite important, and you remember that experience. Indentation math is not my idea of fun.


What linters are you using?


I think XML gets a bad rep for a similar reason that CSV gets a bad rep.

i.e. there is a wide variance of what “passes” for XML, and two systems that profess to “speak” XML often speak mutually incomprehensible dialects.

I think each of your valid complaints about JSON extend to XML, for example. Totally agreed that XML gets an especially bad rep because most xml parsing/generating libraries in popular programming languages are frustrating.


> i.e. there is a wide variance of what “passes” for XML

not really, it's standardized.

what differs is how it's used, i.e. a top of most serialization formats there is another implicit often overlooked serialization layer which roughly maps domain logic to the structures the serialization format supports.

The problem with XML is similar to that of XMPP, to many features and variations, nobs to twist, things to easily subtle get wrong, etc.

Also string content encoding is terrifying in XML as it overlaps string content and string formatted control structures and pretty printing in a messy way. (Like imagine pretty printing _inside_ of a json string without clear separation of weather or not a newline is content or formatting.)


Then you hit the hell of having to interact with some system (almost always written in Java) that's using some 20 year old SAX parser that barfs if the tags aren't in the exact order it expects.


> I don't see why developers fear XML so much. I think it has to do with the fact that everything is built by web developers now.

I heard exactly this same thing 15 years ago from a mainframe programmer I was working with on an integration project, but flipped around: "this XML nonsense... you web developers want everything to look like HTML!" Wasn't a good look then, isn't a good look now.


> does it always use UTF8 or does it break the standard?

It always uses UTF8, or it is invalid JSON and MUST NOT be parsed. For the same argument against XML, you could argue whether it actually honours the encoding field (i've found many cases where things lie in the encoding field, and still parse fine by XML parsers)

> Which version of the JSON Schema does it use?

Quite literally the only one I have ever seen in use, ever, is JSON Schema[0].

> Is it using a schema at all?

Most often, no.

> What happens when "$ref" appears in the content body?

Nothing? JSON doesn't have lookups. There is no way to do lookups in JSON. If you do otherwise, you do not have JSON anymore.

> How should I deal with duplicate keys?

This is an actual, real issue with JSON.

XML is a terrible format. It's filled to the brim with footgun features like entity expansion (only ever used to DOS servers), no data types (everything is a string, your parser just needs to know better), no meaningful reason to have both attributes and content, ambiguity between an array of one element and an element, etc. etc. etc.

[0]: https://json-schema.org/


Maybe I'm misunderstanding, but JSON as defined in RFC 7159 doesn't require UTF-8.


8.1. Character Encoding

   JSON text exchanged between systems that are not part of a closed
   ecosystem MUST be encoded using UTF-8 [RFC3629].
JSONs definition was improved in RFC8259 (2017) to mandate UTF-8.


Missed that. Thanks!


XML was originally meant to be a document markup format, an evolution of SGML. While the markup was consistent, the interpretation of it was to be left to the actual document format definition.

Unfortunately, there was another camp which was trying to change it to not just be an extensible document interchange format but a data interchange format. These have different requirements.

For example, someone asked when you put data in an element vs an attribute. There was a push to provide guidance at one point based on SVG - everything other than text data (such as coordinates making up graphics) were attributes, such that a non-SVG view of the document would just be all of the textual data appended.

Most of the tooling issues came from this disconnect between document and data oriented interchange, such as tooling having options to toggle between interpretations of pretty fundamental concepts such as namespaces.

It also became pretty common for technologies to come out of the document-oriented space (e.g. XPath and XSLT) which led to it being basically impossible to compose or decompose XML-based data without potentially changing its meaning - unless you were doing so with tools that understood the interpretation of the data itself.


I saw this earlier today as hoytech's github was on HN front page: https://github.com/hoytech/serialipedia

Curious how would HN readers rate YAML vs CBOR.

TinyCBOR has a json2cbor utility which uses cjson. Is there a similiar utility for converting JSON to YAML.


CBOR is a pretty good as a smaller / "better" JSON if you have a free hand choosing.

It has ambitions to replace ASN.1 / X.509 coding for certs, but I don't see it being used.

It is a bytewise binary coding, so you can't really write it by hand, whereas you definitely can expect to do that for JSON. However if you must send binary data, then it is a more natural fit, it will send it with a short header overhead rather than have to bloat it with base64.


All of these complaints also seem like valid complaints about XML, except the thing about duplicate keys.

Also as far as I know, there is only one "JSONSchema", and it has a clearly defined version scheme. Are there other JSON Schema standards that I'm not aware of? I'd be interested to see them.


"All human readable serialization formats are stupid and messy in some way".

Fixed that for you.

There's plenty of simple binary serialization formats that, while not perfect, don't generally qualify as stupid or messy.


like msgpack

as long as you don't take stuff as ASN.1 representative for binary formats. Wait, there is a human readable serialization format for ASN.1 using XML...


ASN.1 is so difficult to defend. ;-)


>Now I need to find out which type of JSON an application uses (does it always use UTF8 or does it break the standard?

Utf-8 isn't standard to XML, it's required by XMPP. So in this scenario the same could be required from json.

>Which version of the JSON Schema does it use?

No one uses those and everyone is happy. As for XML schemas, xmpp makes even harder to use than it makes everything else. Because it has this endless xml document that you end up having to parse with a streaming parser and those understandably don't support schemas. You end up having to build the little DOMs yourself, for every stanza. This isn't me theorizing, this is what clients (dino, gajim, tkabber) do. Now with the homebrew DOM you're unable to use xpath (unless again you implement it yourself which is no easy feat) which makes xml's awkwardness i.e. the fact that xml doesn't map into common language structures like dicts and lists so, so much worse.

> How should I deal with duplicate keys?

Generally you don't because libraries don't even support that, and people generally are decent enough to never use those.

>All serialization formats are stupid and messy in some way but I think XML gets a bad rep because of the defaults many parsing libraries picked (and the vulnerabilities they introduced).

And because it is messy. Not pushing for json in particular but even json doesn't have dtds, entities and so on and so forth. Thus parsing libraries don't implement such features, thus fewer vulnerabilities. Whereas vulns in xml parsers are basically the norm - look at python's out of the box xml parsers - there are four of them and each one has at least some vuln marked in the table in official python docs! Again, I'm no json fan, but it's already light years better than xml. "Any damn fool could produce a better data format than XML." Look at bittorrent's ad-hoc format, bencode. Even that is much better than xml. And in a high level language an entire parser would take you one evening to write with no prior knowledge of the format. And that implementation would be probably about the size of those ad-hoc DOM implementations in xmpp clients, that don't even do any parsing themselves, and would be much more pleasant to use.


XML is also superfluous by this standard, since by induction on the principle of "unnecessary reimplementation", we could be exchanging data via S-expressions.


>re-invents most of XML.

That's fine though. Only a few bits of XML make the format very cumbersome to support.


> Now I need to find out which type of JSON an application uses (does it always use UTF8 or does it break the standard? Which version of the JSON Schema does it use? Is it using a schema at all? What happens when "$ref" appears in the content body? How should I deal with duplicate keys?) and how to properly encode it. It's the quick & dirty way to serialize data, and that's great for messing around with prototypes and getting started quick, but terrible for business critical applications and public-facing APIs.

I've never known any of XML's features to help with actually solving a business problem. Oh great, your documents have a mandatory schema; what benefit does that actually give you? It doesn't mean you can skip doing logical validation of a request/response post-deserialization (a schema can enforce that an id is an integer, but not that it's an id that actually exists in your database). In theory it might make it easier to complain about bugs where the real system doesn't match the documentation, but in practice it's more likely to be considered an error in the schema than a bug in the system. As far as I can tell all that using a schema actually does is means that you'll occasionally reject a document that you could have processed, or (even better) refuse to load the document because the schema's website is down.

Similarly with namespaces: the only impact I've ever known XML namespaces to have is to frustrate users who can't understand why their XPaths aren't matching until they configure all their namespaces. I know in theory there are cases where one XML document embedded in another might mean that an XPath would match something it shouldn't, but I've literally never seen that happen in real life, whereas having everything silently not match until someone configures namespaces happens all the time.

Similarly with custom entities, the only thing I've seen them used for is DOSes.

There's also a bunch of other issues: standard XML Schema is awful (RelaxNG is better, but you can't use it because it's not the official schema format), the format is almost-but-not-quite whitespace-insensitive which gives you the worst of both worlds (and similarly for text encodings), and XML is deeply associated with verbose overengineering because that's the main thing it's historically been used for. But even putting those aside, it really is as bad as it's made out to be.

> Protobuf and friends are probably a much better serialization system for chat messages.

Completely agreed.

> I don't see why developers fear XML so much. I think it has to do with the fact that everything is built by web developers now. Sockets have been replaced by sockets and concise protocols have been replaced by messy HTTP requests. I wonder how long it'll be before ISPs start blocking any traffic not directed towards port 80/443.

That's pretty backwards IMO. Have you ever seen what SOAP requests actually look like? It's like a complete reimplementation of everything that HTTP does, in more verbose form... but it's only ever used on top of HTTP. The web/JSON stack has a long way to go to catch up with WS-* for messy overcomplication (and I say that as someone who thinks WS-* is not actually as bad as it's generally considered).


> I've never known any of XML's features to help with actually solving a business problem.

Namespaces let you version data and unambiguously mix elements with the same (simple) name in the same document. Esp. the first point is necessary for long-term data archival.

> Oh great, your documents have a mandatory schema; what benefit does that actually give you?

It can be compiled to strongly-typed DTOs for your language of choice. I.e., seamless, strongly-typed cross-language data exchange. As opposed to manually picking apart the document with DOM or letting the serializer guesstimate the type as with untyped json.

Also, schema can express (and validate) in-document references.

Etc. XML without tooling is painful, yes. With tooling it's a powerful and reliable tool.


> Namespaces let you version data and unambiguously mix elements with the same (simple) name in the same document. Esp. the first point is necessary for long-term data archival.

How do namespaces help with versioning? That seems like a complete non-sequitur.

As for unambiguously mixing elements with the same simple name, I acknowledged that that's a theoretical possibility, but I've never seen it be important in practice.

> It can be compiled to strongly-typed DTOs for your language of choice. I.e., seamless, strongly-typed cross-language data exchange.

The tooling for that is very limited and ineffective, IME, to the point that you're better off writing some class definitions and generating XML or JSON serializers from those. There's a huge impedance mismatch between the kind of constraints that are natural to express in XML schema and the kind that are natural to express in programming languages.


> How do namespaces help with versioning? That seems like a complete non-sequitur.

They tell you how to interpret data and to which schema definition the data conforms to. Elements `<a:MyElt>` and `<b:MyElt>` tell you explicitly how to interpret them. Without the namespace, you have to guess.

> The tooling for that is very limited and ineffective, IME, to the point that you're better off writing some class definitions and generating XML or JSON serializers from those. There's a huge impedance mismatch between the kind of constraints that are natural to express in XML schema and the kind that are natural to express in programming languages.

My experience is totally the opposite. If anything, XSD can express more constraints than most PLs will allow.


> They tell you how to interpret data and to which schema definition the data conforms to. Elements `<a:MyElt>` and `<b:MyElt>` tell you explicitly how to interpret them. Without the namespace, you have to guess.

So you'd mix and match elements from different versions of the schema in the same document? Does that work? I've never seen that done and can't imagine how code would handle that unless it was via some very simple translation rules (in which case the value would be minimal).

(I've seen documents that use the (single) schema declaration as a way of declaring that they're version 3.0 or version 3.1, but there doesn't seem to be any practical advantage to that over something more lightweight like "_version": "3.0" at the start of the document).

> If anything, XSD can express more constraints than most PLs will allow.

I don't actually disagree with this, but they're different constraints and it's not easy to losslessly convert. So it's very hard to use XSD as the source of truth and generate good, idiomatic versions of your constraints in the PL representation of your types. (It's also difficult to generate good, idiomatic versions of your PL constraints in XSD)


> So you'd mix and match elements from different versions of the schema in the same document?

No, the use-case is having an archive of documents conforming to different schemas. Or another use-case: schema evolves during the system's lifetime and you don't want to / can't upgrade old data to new schemas.

And yes, I even mix and match different schemas in the same document: pre-parsed information is stored in "my" elements, whereas the original data source is stored as extension in the XML, in its own namespace etc. So when the need arises for further processing/parsing, everything's already there in the document, with _the_ definitive source of truth. (Uninterpreted raw data)

> idiomatic versions of your PL constraints in XSD

That way is very easy: no PLs support (the XSD equivalent of) foreign keys, so that's "solved". Structs and inheritance are directly expressible, and even sum types from the languages that support it. Granted, XSD using sum types generates clumsy classes in PLs that don't.


> No, the use-case is having an archive of documents conforming to different schemas. Or another use-case: schema evolves during the system's lifetime and you don't want to / can't upgrade old data to new schemas.

Right, I talked about that case - AFAICS the schema is acting as a basic version tag (which is worth having, but can be done much more simply).

> And yes, I even mix and match different schemas in the same document: pre-parsed information is stored in "my" elements, whereas the original data source is stored as extension in the XML, in its own namespace etc. So when the need arises for further processing/parsing, everything's already there in the document, with _the_ definitive source of truth. (Uninterpreted raw data)

Embedding the original document sounds useful, but namespaces still seem vastly overengineered for that case - you'd presumably have a standard, well-defined place for the original document to go, so anything parsing/using your document knows about it and can just skip that node. I guess you get a little bit of value from being able to write xpaths that will never accidentally hit a node in the embedded document, but again that's something I've never seen actually be a problem in real life. Namespacing seems to be built to support the idea that you'd arbitrarily interleave nodes from multiple schemata, and that still seems like a solution in search of a problem.

> That way is very easy: no PLs support (the XSD equivalent of) foreign keys, so that's "solved". Structs and inheritance are directly expressible, and even sum types from the languages that support it.

Oh? Can you point me at a good implementation for Haskell or especially Scala? (TBH I think if we're accepting that the PL is the source of truth for what the constraints are then we don't gain much from encoding more of them into schema versus just checking them after parsing, but every little helps).


> the schema is acting as a basic version tag

Except that it's syntactically separate so no other version tag can masquerade as yours. PLs have namespaces as a separate syntactic construct as well, and for a good reason.

> Embedding the original document sounds useful, but namespaces still seem vastly overengineered for that case

Quite the opposite, it's the simplest option. Everything (original and interpreted data) is kept together, and because of NSs, there's no danger of misinterpreting the one for the other.

> Can you point me at a good implementation for Haskell or especially Scala?

Not using those.

> I think if we're accepting that the PL is the source of truth

XSD can be processed to automatically generate parsing and checking code for whatever other PL than the original one.


> Not using those.

> XSD can be processed to automatically generate parsing and checking code for whatever other PL than the original one.

Well, where are the actual working implementations of these things that you're saying are possible? You say there are tools that have good conversions between XML schema and language sum types; what tools? (and if not in Haskell/Scala then what languages?) Because my experience is that you just don't get good idiomatic representations from the tools, and end up either maintaining the schema and the code in parallel manually, or autogenerating a "dumb" schema that's missing most of your validity constraints.


> and if not in Haskell/Scala then what languages

Java and C#.


Neither of those can really been said to have sum types.


Over a decade ago I worked with a vendor who had a SOAP service where every method had a single parameter, and that parameter was a Base64-encoded XML document containing the actual parameters. Just this year I was reviewing the web services documentation for a Fortune 500 company and realized that their SOAP offering was a transparently thin (like a soap bubble?) façade over a simpler "POST us some XML" API. Between these two experiences I have worked with more SOAP services than I can count, and I can't think of a single one where SOAP made anything easier.


> I've never known any of XML's features to help with actually solving a business problem.

Arm are publishing their full ISA processor specifications in XML, so it is 100% machine readable, see [1] for the actual specification and [2, 3] for why you want a machine readable ISA spec in the first place. Arm has been really ahead with this, and all the other processor manufacturer are playing catchup.

I'm not saying that XML is the idea format for this purpose, but it clearly works for them. Which other format would you think is better for this purpose? Miminal requirements: works for gigabytes of data including graphics, automatic format checking, conversion to/from other formats, widely supported, easy to hire for, standard IDEs (e.g. VSCode plugins), long-term stability (processors need support for decades).

[1] https://developer.arm.com/architectures/cpu-architecture/a-p...

[2] https://alastairreid.github.io/papers/fmcad2016-trustworthy....

[3] https://alastairreid.github.io/uses-for-isa-specs/


It may be "100% machine readable" in some sense, but I looked at a couple of those XML files at random and it seems like the majority of the content is a human-oriented markup document. Looking at it from the other side: why are they using XML rather than JSON? I don't think they're gaining much from schema or namespaces; I suspect the main reason is that that lets them have a single-file markup document that contains embedded data in a standardised way (by using XHTML and XSLT), which is sort of legitimate but only really for a niche archival use case (because normally having structured source data and some process that generates the HTML document from it is absolutely fine).

Realistically you can achieve much the same thing by having a HTML file that embeds a chunk of JSON somewhere and then has a bit of javascript to render the page based on that JSON (i.e. filling the role of XSLT) and in most practical respects that's a lot better. The only missing part is that there's no defined standard for how you do that (it's not hard to do it, but there are different places where you could put the JavaScript and the JSON and for an archival document you want to be sure a future reader will know where to look) - and, given that XHTML+XSLT is widely seen as a dead end, I suspect there's not much enthusiasm for defining one.


How do you plan to serialize chat messages into protobuf format? Just a protobuf message with one json/xml blob in it? Protobuf can't easily contain an arbitrary tree structure.


I don't think a chat message is or should be an arbitrary tree structure?


For plaintext messages? Existing chat programs use formatted messages with quotes, bold text, smilies and whatnot.


> Existing chat programs use formatted messages with quotes, bold text, smilies and whatnot.

Right, but they're not arbitrarily nested tree structures. They're structured data that you want to represent in a structured way (e.g. lists of different kinds of spans), but I don't think a full DOM tree is a good representation.


Quotes apparently allow nesting. That said, serializing trees in linear form isn't really a problem: each node simply specifies parent's index. The problem is that these nodes are polymorphic, protobuf doesn't like that.


oneof works reasonably well. It's not perfect but it's fine.



Out of line metadata is an interesting idea. Is there a guarantee that metadata is sorted by (start,length)?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: