2012/12/16

Document and Serialization Format Thoughts and Examples

I made a few posts on reddit, in a thread about Python in the browser, arguing that HTML and XML suck and that almost anything else is better. Taking that as given, I explored some alternative document syntaxes that I would like to describe here.

1. JSON-like whitespace-insignificant functional orientation:

jsondoc {
  title : "Ninjas Are Awesome",
  p(id:"fact") : {. : "A ninja's natural enemy is a",
                  strong : "PIRATE", "!"},
  ul(id:"ninja_weapons") : { li : "sword", 
                             li : "kung fu", 
                             li : "throwing star"}
}

I feel this, or something like it, has the best chance of catching on: JSON-esque maps. The only real difference between standard JSON and this syntax is the introduction of arguments on keys, which act as attributes. The map element . denotes the wildcard content of a tag if you use a map instead of a value for a "tag". Alternatively, you could deviate from the JSON standard more, and that leads me to my next example.
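
To make the "arguments act as attributes" idea concrete, here is a rough Python sketch of what a parser for the example above might hand back, plus a tiny renderer. The (tag, attributes, children) tuple layout and the to_html name are my own assumptions for illustration, not part of any defined format. Note that repeated keys such as li rule out a plain dict, so children are kept as an ordered list, and bare strings stand in for the . text content.

```python
from html import escape

# Hypothetical in-memory form of the jsondoc example: (tag, attributes, children).
# Bare strings play the role of the "." wildcard text content.
doc = ("jsondoc", {}, [
    ("title", {}, ["Ninjas Are Awesome"]),
    ("p", {"id": "fact"}, ["A ninja's natural enemy is a",
                           ("strong", {}, ["PIRATE"]), "!"]),
    ("ul", {"id": "ninja_weapons"}, [("li", {}, ["sword"]),
                                     ("li", {}, ["kung fu"]),
                                     ("li", {}, ["throwing star"])]),
])

def to_html(node):
    """Render a (tag, attrs, children) tuple, or a text string, as HTML."""
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, attrs, children = node
    # Key arguments become ordinary attributes on the opening tag.
    attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in attrs.items())
    inner = "".join(to_html(child) for child in children)
    return f"<{tag}{attr_text}>{inner}</{tag}>"

print(to_html(doc))
```

The point is only that the syntax maps onto the same tree HTML describes, with far less punctuation on the page.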

2. YAML-like whitespace-significant:

title :: Python
list ::
- article : id=1 :
    author :: Xanny
    date :: 2012-12-15
    title : id=pirates : Ninjas are awesome
    post : id=bacon,class=words : >
        I talked about this in another fork of this 
        comment thread a bit.  You absolutely need to 
        have the capabilities of the language...
- article : id=2 :

This syntax is YAML-derived but introduces attributes between the key and the value. It is a little verbose with all the double colons, and as I thought about it, it became apparent that you don't need the : > syntax for multiline descendant text bodies, so I refined it with a few changes. > is a synonym for ::, and a value section that contains only a : indicates a descendant multiline element. Instead of using minus signs as the array element delimiter, I use commas, since commas already separate the values in the arguments. Finally, a keyless argument is assumed to be the id, and arguments without quotes are terminated by , > or :.

title > Python
list >
  , article : 1 :
      author > Xanny
      date > 2012-12-15
      title : pirates : Ninjas are awesome
      post : class = words, bacon >
        I talked about this in another fork of this 
        comment thread a bit.  You absolutely need to 
        have the capabilities of the language...
  , article : 2 :
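
The argument rules (comma-separated values, a keyless argument is the id) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: parse_args is a made-up name, quoted arguments are not handled, and the input is assumed to be just the text between the key and the value.

```python
def parse_args(text):
    """Parse an argument list such as 'class = words, bacon' into a dict."""
    attrs = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty pieces, e.g. from a trailing comma
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
        else:
            # Keyless argument: assumed to be the id.
            attrs["id"] = part
    return attrs

print(parse_args("class = words, bacon"))
print(parse_args("pirates"))
```

With this rule, article : 1 : and title : pirates : both end up with an id, without ever writing id= out.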

I like this, but I don't feel it would fly very well, so I also have a whitespace-insignificant dialect of this syntax:

doc {
  title > Python,
  list > [
    article : 1 : {
      author > Xanny,
      date > 2012-12-15,
      title : pirates : Ninjas are awesome,
      post : class = words, bacon : I talked about this in another fork of this comment thread a bit.  You absolutely need to have the capabilities of the language...
    } ,
    article : 2 : ...
  ]
}

Here, commas are string and element delimiters everywhere, and maps are denoted by curly braces. Because the grammar is concise, you would only need to escape commas, colons, and the two kinds of braces. One thing to note is that this language never uses parentheses. I feel something like this might easily become more popular. An alternative might be to keep the argument behavior from the JSON dialect and reintroduce parentheses:

doc {
  title : Python,
  list : [
    article(1) : {
      author : Xanny,
      date : 2012-12-15,
      title(pirates) : Ninjas are awesome,
      post(class=words, bacon) : I talked about this in another fork of this comment thread a bit.  You absolutely need to have the capabilities of the language...
    } ,
    article(2) : ...
  ]
}

The big deal here is that, by accepting the overhead of parentheses, the key : value syntax remains succinct. It also easily allows use as a data serialization format where you just discard the argument syntax, or where the parsing behavior is defined so that arguments are just key:value pairs added to the constructed map (here, in Python syntax), like so:

title(pirates) : Ninjas are awesome, 
>>> 
"title" : {"id" : "pirates", "body" : "Ninjas are awesome"}
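
As a minimal sketch of that parsing behavior, the following Python handles a single key(args) : value entry. The names parse_entry and ENTRY are made up for this example, quoting and nesting are ignored, and returning a plain string when there are no arguments is my own assumption about how the "just discard the arguments" case might look.

```python
import re

# One entry: key, optional (args), a colon, then the body up to an optional trailing comma.
ENTRY = re.compile(
    r"^\s*(?P<key>\w+)\s*(?:\((?P<args>[^)]*)\))?\s*:\s*(?P<body>.*?)\s*,?\s*$"
)

def parse_entry(line):
    """Turn 'title(pirates) : Ninjas are awesome,' into a one-item dict."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError(f"not a key : value entry: {line!r}")
    body = match.group("body")
    if match.group("args") is None:
        # No arguments: the entry is a plain key/value pair (assumption).
        return {match.group("key"): body}
    value = {}
    for part in match.group("args").split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            name, val = part.split("=", 1)
            value[name.strip()] = val.strip()
        else:
            value["id"] = part  # keyless argument is the id
    value["body"] = body
    return {match.group("key"): value}

print(parse_entry("title(pirates) : Ninjas are awesome, "))
print(parse_entry("date : 2012-12-15,"))
```

The arguments simply become more keys in the constructed map, next to a body key holding the value.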

The same could work for the YAML syntax, where colons past the first are disregarded (except ::\w+\n, which denotes that multiline text follows). Or you could use a glyph exclusively for multiline text, like & or *, though YAML itself already uses those two for anchors and aliases.

It really comes back to that vision of one data format to rule them all (that isn't the current ruler, XML) for documents, serialization, message passing, etc. Both JSON and YAML are significantly better than XML, and in keeping with that unified-protocol ideology, a unified textual language is a natural extension.

As a footnote, I had to rewrite this blog post in completely manual HTML, since it kept mangling the code -> paragraph transitions and inserting redundant spaces with tons of unneeded tag duplication. So the source of this should be pretty. Come on Google, get your shizzle together.

And as a final note, I'd probably go with a choice between the last two. I could easily see this better-markup-language (bml) having the extensions .bmlw for the whitespace-dependent version (better markup language (with) whitespace-significance) and .bmlb for the whitespace-agnostic version (better markup language (with) brace-significance).
