JSON - The Fat Free Alternative

So the story goes something like this.

I get into one of these "JSON is better/slimmer/faster" versus "oh no it isn't" arguments, and we are getting entrenched in our respective positions when I realise that I actually have data that can test the slimmer argument. It was created not for the purpose of a benchmark but as a solution to a problem I had.

I am creating a movie data mashup and need to integrate movie data from a JSON repository with some XML movie data. The mashup is an XSLT transformation, so the JSON has to be converted. The problem I had was that, after download and conversion, the XML file was too big for the XSLT processor and I was getting heap space errors.

Here is a snippet of JSON data for one movie.

{
  "result": [
    {
      "initial_release_date": "2006-11-30",
      "rottentomatoes_id": [],
      "key": [{
        "namespace": "/authority/imdb/title",
        "value": "tt0259822"
      }],
      "name": ".45",
      "type": "/film/film",
      "starring": [
        {
          "actor": [{
            "/common/topic/alias": [
              "Milla",
              "Milica Natasha Jovovich",
              "Milica Jovović",
              "Milla Yovovich",
              "Reigning Queen of Kick-Butt",
              "Milica Nataša Jovović"
            ],
            "name": "Milla Jovovich"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [
              "Angus McFadyen",
              "Angus MacFadyen"
            ],
            "name": "Angus Macfadyen"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [
              "Aisha N. Tyler"
            ],
            "name": "Aisha Tyler"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [
              "Stephen Dorff Jr.",
              "Brad Matlock"
            ],
            "name": "Stephen Dorff"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [],
            "name": "Sarah Strange"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [
              "Vincent LaResca",
              "Vinnie the kid"
            ],
            "name": "Vincent Laresca"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [
              "Dawn Greenhall",
              "Hazel Dawn Greenhalgh"
            ],
            "name": "Dawn Greenhalgh"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [
              "Nola Auguston"
            ],
            "name": "Nola Augustson"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [
              "Katherine Mary Craven Hawtrey",
              "Kay Hartrey",
              "Kay Hawtry",
              "Katherine Hawtrey"
            ],
            "name": "Kay Hawtrey"
          }]
        },
        {
          "actor": [{
            "/common/topic/alias": [],
            "name": "Shawn Campbell"
          }]
        }
      ],
      "mid": "/m/0c2l1s",
      "directed_by": [{
        "/common/topic/alias": [],
        "name": "Gary Lennon"
      }]
    }
  ]
}

Following a naive JSON-to-XML conversion by yours truly, I produced this.

<item>

<key>

<item>

<namespace>/authority/imdb/title</namespace>

</item>

</key>

<item>

<actor>

<item>

<alias>

<item>Milla</item>

<item>Milica Natasha Jovovich</item>

<item>Milica Jovović</item>

<item>Milla Yovovich</item>

<item>Reigning Queen of Kick-Butt</item>

<item>Milica Nataša Jovović</item>

</alias>

<name>Milla Jovovich</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<alias>

<item>Angus McFadyen</item>

<item>Angus MacFadyen</item>

</alias>

<name>Angus Macfadyen</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<alias>

<item>Aisha N. Tyler</item>

</alias>

<name>Aisha Tyler</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<alias>

<item>Stephen Dorff Jr.</item>

<item>Brad Matlock</item>

</alias>

<name>Stephen Dorff</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<name>Sarah Strange</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<alias>

<item>Vincent LaResca</item>

<item>Vinnie the kid</item>

</alias>

<name>Vincent Laresca</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<alias>

<item>Dawn Greenhall</item>

</alias>

<name>Dawn Greenhalgh</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<alias>

<item>Nola Auguston</item>

</alias>

<name>Nola Augustson</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<alias>

<item>Katherine Mary Craven Hawtrey</item>

<item>Kay Hartrey</item>

<item>Kay Hawtry</item>

<item>Katherine Hawtrey</item>

</alias>

<name>Kay Hawtrey</name>

</item>

</actor>

</item>

<item>

<actor>

<item>

<name>Shawn Campbell</name>

</item>

</actor>

</item>

</starring>

<directed_by>

<item>

<name>Gary Lennon</name>

</item>

</directed_by>

<initial_release_date>2006-11-30</initial_release_date>

<rottentomatoes_id/>

</item>
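For anyone wondering what "naive" means here, a conversion of this kind can be sketched in a few lines of Python. This is not the converter used for this post, just a minimal illustration under some assumptions: every JSON array member gets a generic `<item>` wrapper, and a key such as `/common/topic/alias` is shortened to its last path segment. The names `tag_for` and `to_xml` are made up for the example.

```python
import json
import xml.etree.ElementTree as ET


def tag_for(key):
    # Assumption: keep only the last path segment,
    # so "/common/topic/alias" becomes "alias".
    return key.rstrip("/").split("/")[-1]


def to_xml(parent, value):
    """Append a naive XML rendering of a parsed JSON value to parent."""
    if isinstance(value, dict):
        for key, child in value.items():
            to_xml(ET.SubElement(parent, tag_for(key)), child)
    elif isinstance(value, list):
        # Every array member gets a generic <item> wrapper.
        for member in value:
            to_xml(ET.SubElement(parent, "item"), member)
    elif value is not None:
        parent.text = str(value)


sample = json.loads("""
{"/common/topic/alias": ["Milla", "Milla Yovovich"],
 "name": "Milla Jovovich"}
""")
root = ET.Element("item")
to_xml(root, sample)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
# <item><alias><item>Milla</item><item>Milla Yovovich</item></alias><name>Milla Jovovich</name></item>
```

Every array costs an extra pair of `<item>` tags per member on top of the closing tags XML already needs, which is where most of the ballooning comes from.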

There were a hundred movies per file and the JSON data came in at 325k; by the time it had been converted to the XML above it had ballooned to 1.16MB.

My aim was to compact the XML sufficiently to allow a single transformation to accept the contents of about 1800 or so such files. So here is the data before the compacting transformation:

ihe@ihe-ThinkPad-T410:~/film$ ls rawFreebase/1.xml -l

-rw-r--r-- 1 ihe ihe 1160691 Aug 29 06:46 rawFreebase/1.xml

After the compacting, which was supposed to be lossless, the data looked like this:

<alias>Milla</alias>

<alias>Milica Natasha Jovovich</alias>

<alias>Milica Jovović</alias>

<alias>Milla Yovovich</alias>

<alias>Reigning Queen of Kick-Butt</alias>

<alias>Milica Nataša Jovović</alias>

</actor>

<alias>Angus McFadyen</alias>

<alias>Angus MacFadyen</alias>

</actor>

<alias>Aisha N. Tyler</alias>

</actor>

<alias>Stephen Dorff Jr.</alias>

<alias>Brad Matlock</alias>

</actor>

<alias>Vincent LaResca</alias>

<alias>Vinnie the kid</alias>

</actor>

<alias>Dawn Greenhall</alias>

</actor>

<alias>Nola Auguston</alias>

</actor>

<alias>Katherine Mary Craven Hawtrey</alias>

<alias>Kay Hartrey</alias>

<alias>Kay Hawtry</alias>

<alias>Katherine Hawtrey</alias>

</actor>

</movie>

And the size of a file of 100 such entries:

ihe@ihe-ThinkPad-T410:~/film$ ls freebase/1.xml -l

-rw-r--r-- 1 ihe ihe 351067 Aug 29 06:53 freebase/1.xml

That is about 350k of compacted XML compared to about 325k of JSON.
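The compacting transformation itself is not shown in this post, so the following is only a guess at the part that is visible in the output: text-only `<item>` wrappers are flattened into repeated elements named after their parent. It is written in Python rather than XSLT purely for illustration, and the function name `compact` is hypothetical.

```python
import xml.etree.ElementTree as ET


def compact(element):
    """Return a copy of element with text-only <item> wrappers flattened."""
    result = ET.Element(element.tag, dict(element.attrib))
    result.text = element.text
    for child in element:
        items = list(child)
        if items and all(i.tag == "item" and len(i) == 0 for i in items):
            # <alias><item>A</item><item>B</item></alias>
            # becomes <alias>A</alias><alias>B</alias>
            for i in items:
                ET.SubElement(result, child.tag).text = i.text
        else:
            result.append(compact(child))
    return result


naive = ET.fromstring(
    "<actor><alias><item>Milla</item><item>Milla Yovovich</item></alias>"
    "<name>Milla Jovovich</name></actor>"
)
compacted = ET.tostring(compact(naive), encoding="unicode")
print(compacted)
# <actor><alias>Milla</alias><alias>Milla Yovovich</alias><name>Milla Jovovich</name></actor>
```

Nothing is lost by this step as long as the order of the repeated elements is preserved, since the wrapper carried no information of its own.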

I then decided to see what would happen to the file sizes after they were compressed.

Here are the results. This is the naive XML

ihe@ihe-ThinkPad-T410:~/film$ ls -l rawFreebase/*.zip

-rw-r--r-- 1 ihe ihe 83833 Sep 30 12:41 rawFreebase/1.xml.zip

This is the compacted XML

ihe@ihe-ThinkPad-T410:~/film$ ls -l freebase/*.zip

-rw-r--r-- 1 ihe ihe 69058 Sep 30 12:41 freebase/1.xml.zip

This is the compressed JSON

-rw-r--r-- 1 ihe ihe 61528 Sep 30 12:42 Downloads/films.json.zip

83k for the naive XML,

69k for the compacted XML and

61k for the JSON.
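To put those numbers side by side, here is a small sketch that works out the ratios from the byte counts in the `ls` listings above. The exact uncompressed size of the JSON file is not listed, so it is left out of the ratio calculation; the variable names are just for the example.

```python
# (uncompressed bytes, zipped bytes), taken from the ls output above
sizes = {
    "naive XML": (1160691, 83833),
    "compacted XML": (351067, 69058),
}
json_zipped = 61528  # Downloads/films.json.zip

results = {}
for label, (raw, zipped) in sizes.items():
    ratio = raw / zipped  # how much smaller the zip is
    overhead = (zipped - json_zipped) / json_zipped * 100  # % over zipped JSON
    results[label] = (ratio, overhead)
    print(f"{label}: zip is {ratio:.1f}x smaller, "
          f"{overhead:.0f}% bigger than the zipped JSON")

# naive XML: zip is 13.8x smaller, 36% bigger than the zipped JSON
# compacted XML: zip is 5.1x smaller, 12% bigger than the zipped JSON
```

So the repeated tags that make the naive XML so fat compress very well, but even after zipping both XML versions still come out bigger than the JSON.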

Hmmmmmmmmmmm!