So the story goes something like this.
I get into one of these "JSON is better/slimmer/faster" versus "oh no it isn't" arguments, and we are getting entrenched in our respective positions when I realise that I actually have data that can test the slimmer argument, created not for the purpose of a benchmark but as a solution to a problem I had.
I am creating a movie data mashup and need to integrate movie data from a JSON repository with some XML movie data. The mashup is an XSLT transformation, so the JSON has to be converted to XML. The problem I had was that after download and conversion the XML file was too big for the XSLT processor, and I was getting heap space errors.
Here is a snippet of JSON data for one movie.
{
"result": [
{
"initial_release_date": "2006-11-30",
"rottentomatoes_id": [],
"key": [{
"namespace": "/authority/imdb/title",
"value": "tt0259822"
}],
"name": ".45",
"type": "/film/film",
"starring": [
{
"actor": [{
"/common/topic/alias": [
"Milla",
"Milica Natasha Jovovich",
"Milica Jovović",
"Milla Yovovich",
"Reigning Queen of Kick-Butt",
"Milica Nataša Jovović"
],
"name": "Milla Jovovich"
}]
},
{
"actor": [{
"/common/topic/alias": [
"Angus McFadyen",
"Angus MacFadyen"
],
"name": "Angus Macfadyen"
}]
},
{
"actor": [{
"/common/topic/alias": [
"Aisha N. Tyler"
],
"name": "Aisha Tyler"
}]
},
{
"actor": [{
"/common/topic/alias": [
"Stephen Dorff Jr.",
"Brad Matlock"
],
"name": "Stephen Dorff"
}]
},
{
"actor": [{
"/common/topic/alias": [],
"name": "Sarah Strange"
}]
},
{
"actor": [{
"/common/topic/alias": [
"Vincent LaResca",
"Vinnie the kid"
],
"name": "Vincent Laresca"
}]
},
{
"actor": [{
"/common/topic/alias": [
"Dawn Greenhall",
"Hazel Dawn Greenhalgh"
],
"name": "Dawn Greenhalgh"
}]
},
{
"actor": [{
"/common/topic/alias": [
"Nola Auguston"
],
"name": "Nola Augustson"
}]
},
{
"actor": [{
"/common/topic/alias": [
"Katherine Mary Craven Hawtrey",
"Kay Hartrey",
"Kay Hawtry",
"Katherine Hawtrey"
],
"name": "Kay Hawtrey"
}]
},
{
"actor": [{
"/common/topic/alias": [],
"name": "Shawn Campbell"
}]
}
],
"mid": "/m/0c2l1s",
"directed_by": [{
"/common/topic/alias": [],
"name": "Gary Lennon"
}]
}
Following a naive JSON to XML conversion by yours truly, I produced this.
<result>
<item>
<key>
<item>
<value>tt0259822</value>
<namespace>/authority/imdb/title</namespace>
</item>
</key>
<type>/film/film</type>
<name>.45</name>
<starring>
<item>
<actor>
<item>
<alias>
<item>Milla</item>
<item>Milica Natasha Jovovich</item>
<item>Milica Jovović</item>
<item>Milla Yovovich</item>
<item>Reigning Queen of Kick-Butt</item>
<item>Milica Nataša Jovović</item>
</alias>
<name>Milla Jovovich</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias>
<item>Angus McFadyen</item>
<item>Angus MacFadyen</item>
</alias>
<name>Angus Macfadyen</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias>
<item>Aisha N. Tyler</item>
</alias>
<name>Aisha Tyler</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias>
<item>Stephen Dorff Jr.</item>
<item>Brad Matlock</item>
</alias>
<name>Stephen Dorff</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias/>
<name>Sarah Strange</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias>
<item>Vincent LaResca</item>
<item>Vinnie the kid</item>
</alias>
<name>Vincent Laresca</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias>
<item>Dawn Greenhall</item>
</alias>
<name>Dawn Greenhalgh</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias>
<item>Nola Auguston</item>
</alias>
<name>Nola Augustson</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias>
<item>Katherine Mary Craven Hawtrey</item>
<item>Kay Hartrey</item>
<item>Kay Hawtry</item>
<item>Katherine Hawtrey</item>
</alias>
<name>Kay Hawtrey</name>
</item>
</actor>
</item>
<item>
<actor>
<item>
<alias/>
<name>Shawn Campbell</name>
</item>
</actor>
</item>
</starring>
<directed_by>
<item>
<alias/>
<name>Gary Lennon</name>
</item>
</directed_by>
<initial_release_date>2006-11-30</initial_release_date>
<alias/>
<mid>/m/0c2l1s</mid>
<rottentomatoes_id/>
</item>
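A naive conversion of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the converter that produced the listing above: the function names (`tag_for`, `build`, `json_to_xml`) are made up for the example, every array member is assumed to become an `<item>` element, and keys such as "/common/topic/alias" are assumed to be cut down to their last path segment because they are not valid XML names. Element order follows the JSON, so it may differ from the listing.

```python
import json
import xml.etree.ElementTree as ET


def tag_for(key):
    # Assumption: keep only the last path segment of keys like
    # "/common/topic/alias", since "/" is not allowed in an XML name.
    return key.rsplit("/", 1)[-1]


def build(parent, value):
    """Append a naive XML rendering of a JSON value to parent."""
    if isinstance(value, dict):
        # One child element per key.
        for key, child in value.items():
            build(ET.SubElement(parent, tag_for(key)), child)
    elif isinstance(value, list):
        # Every array member is wrapped in a generic <item> element.
        for child in value:
            build(ET.SubElement(parent, "item"), child)
    elif value is not None:
        parent.text = str(value)


def json_to_xml(text):
    """Convert a JSON object with a single top-level key to an XML string."""
    data = json.loads(text)
    root_key, root_value = next(iter(data.items()))
    root = ET.Element(tag_for(root_key))
    build(root, root_value)
    return ET.tostring(root, encoding="unicode")


print(json_to_xml('{"result": [{"name": ".45"}]}'))
# prints <result><item><name>.45</name></item></result>
```

The wrapping `<item>` elements are where most of the bloat comes from: each array level costs an extra pair of tags per member.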
There were a hundred movies per file and the JSON data came in at 325k; by the time it had been converted to the XML above it had ballooned to 1.16MB.
My aim was to compact the XML sufficiently to allow a single transformation to accept the contents of about 1800 or so such files. So here is the data before the compacting transformation:
ihe@ihe-ThinkPad-T410:~/film$ ls rawFreebase/1.xml -l
-rw-r--r-- 1 ihe ihe 1160691 Aug 29 06:46 rawFreebase/1.xml
After the compacting, which was supposed to be lossless, the data looked like this:
<movie imdb="tt0259822" name=".45" mid="/m/0c2l1s" date="2006-11-30">
<actor name="Milla Jovovich">
<alias>Milla</alias>
<alias>Milica Natasha Jovovich</alias>
<alias>Milica Jovović</alias>
<alias>Milla Yovovich</alias>
<alias>Reigning Queen of Kick-Butt</alias>
<alias>Milica Nataša Jovović</alias>
</actor>
<actor name="Angus Macfadyen">
<alias>Angus McFadyen</alias>
<alias>Angus MacFadyen</alias>
</actor>
<actor name="Aisha Tyler">
<alias>Aisha N. Tyler</alias>
</actor>
<actor name="Stephen Dorff">
<alias>Stephen Dorff Jr.</alias>
<alias>Brad Matlock</alias>
</actor>
<actor name="Sarah Strange"/>
<actor name="Vincent Laresca">
<alias>Vincent LaResca</alias>
<alias>Vinnie the kid</alias>
</actor>
<actor name="Dawn Greenhalgh">
<alias>Dawn Greenhall</alias>
</actor>
<actor name="Nola Augustson">
<alias>Nola Auguston</alias>
</actor>
<actor name="Kay Hawtrey">
<alias>Katherine Mary Craven Hawtrey</alias>
<alias>Kay Hartrey</alias>
<alias>Kay Hawtry</alias>
<alias>Katherine Hawtrey</alias>
</actor>
<actor name="Shawn Campbell"/>
<director name="Gary Lennon"/>
</movie>
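The shape of that compacting step can be sketched in Python with ElementTree. This is only an approximation of the idea, not the transformation actually used: the helper names (`compact_movie`, `add_person`) are hypothetical, the element paths are read off the naive XML sample above, and fields that do not appear in the compact sample (such as rottentomatoes_id) are simply dropped here.

```python
import xml.etree.ElementTree as ET


def add_person(movie, tag, person):
    """Add an <actor> or <director> element with nested <alias> children."""
    element = ET.SubElement(movie, tag, {"name": person.findtext("name", default="")})
    for alias in person.findall("alias/item"):
        ET.SubElement(element, "alias").text = alias.text


def compact_movie(item):
    """Turn one naive <item> movie element into a compact <movie> element."""
    # Scalar fields become attributes instead of child elements.
    movie = ET.Element("movie", {
        "imdb": item.findtext("key/item/value", default=""),
        "name": item.findtext("name", default=""),
        "mid": item.findtext("mid", default=""),
        "date": item.findtext("initial_release_date", default=""),
    })
    # The starring/item/actor/item nesting collapses to one <actor> each.
    for person in item.findall("starring/item/actor/item"):
        add_person(movie, "actor", person)
    for person in item.findall("directed_by/item"):
        add_person(movie, "director", person)
    return movie
```

Calling `compact_movie` on each `<item>` under `<result>` and serialising the results with `ET.tostring` would give one `<movie>` element per film, in the style of the listing above.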
And the size of a file of 100 such entries:
ihe@ihe-ThinkPad-T410:~/film$ ls freebase/1.xml -l
-rw-r--r-- 1 ihe ihe 351067 Aug 29 06:53 freebase/1.xml
That is 350k of compacted XML compared to 324k of JSON.
I then decided to see what would happen to the file sizes after they were compressed.
Here are the results.
This is the naive XML
ihe@ihe-ThinkPad-T410:~/film$ ls -l rawFreebase/*.zip
-rw-r--r-- 1 ihe ihe 83833 Sep 30 12:41 rawFreebase/1.xml.zip
This is the compacted XML
ihe@ihe-ThinkPad-T410:~/film$ ls -l freebase/*.zip
-rw-r--r-- 1 ihe ihe 69058 Sep 30 12:41 freebase/1.xml.zip
This is the compressed JSON
-rw-r--r-- 1 ihe ihe 61528 Sep 30 12:42 Downloads/films.json.zip
83k for the naive XML,
69k for the compacted XML and
61k for the JSON.
Hmmmmmmmmmmm!
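To put the listings above into proportion, the arithmetic can be run directly on the byte counts reported by ls. The sketch below uses only those five numbers; the constant and function names are invented for the example, and the uncompressed JSON is left out because only a rounded size is given for it.

```python
# Byte counts copied from the ls listings above.
NAIVE_XML_RAW = 1160691
COMPACT_XML_RAW = 351067
NAIVE_XML_ZIP = 83833
COMPACT_XML_ZIP = 69058
JSON_ZIP = 61528


def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return part / whole * 100


def percent_larger(size, baseline):
    """Return how much larger size is than baseline, as a percentage."""
    return (size - baseline) / baseline * 100


print(f"compacted XML is {percent_of(COMPACT_XML_RAW, NAIVE_XML_RAW):.1f}% of the naive XML")
print(f"zipped naive XML is {percent_larger(NAIVE_XML_ZIP, JSON_ZIP):.1f}% larger than zipped JSON")
print(f"zipped compacted XML is {percent_larger(COMPACT_XML_ZIP, JSON_ZIP):.1f}% larger than zipped JSON")
```

On these numbers the compacted XML zips to roughly 12% more than the JSON, and the naive XML to roughly 36% more.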