Nifty tidbits

Nifty tidbits and random thoughts on technology and anything else that catches my fancy

Archive for the 'XML' Category


Working with huge XML files - tools of the trade.

Posted by Raghu on March 27, 2008

XMLStarlet is great for slicing and dicing huge XML files. I had a run-in with one recently: an 80 MB XML file, all on a single line :D. Guess what, most editors that I tried balked and fell over. This was on a 2Gig Core2 Duo machine.

XMLSpy, vi, emacs and Notepad++ all died, and trying to do anything with an 80 MB XML file where all 80 MB sit on a single line isn't much fun. So the first order of business was to pretty-print the XML. XMLStarlet worked great:

xmlstarlet fo file.xml > output.xml

and you’re done.
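If XMLStarlet isn't at hand, the same idea can be sketched with nothing but Python's standard library. This is only a rough equivalent, not what I used: the function name pretty_print is made up for illustration, and minidom builds the whole document in memory, so it assumes the file fits in RAM.

```python
import xml.dom.minidom


def pretty_print(xml_text: str, indent: str = "  ") -> str:
    """Return an indented copy of an XML string (hypothetical helper)."""
    # parseString builds the full DOM in memory; this is not a
    # streaming formatter, so very large inputs need enough RAM.
    dom = xml.dom.minidom.parseString(xml_text)
    return dom.toprettyxml(indent=indent)


if __name__ == "__main__":
    one_line = "<books><book><title>XSLT</title></book><book/></books>"
    print(pretty_print(one_line))
```

For a real file you would read it in, call pretty_print and write the result back out, much like the redirect in the xmlstarlet command.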

The next order of business was to validate the XML document against a schema. Our first attempt was with Sun's Multi-Schema Validator (MSV). MSV does not validate the whole document but instead stops after a certain number of failures. So, MSV out, XMLStarlet in. XMLStarlet can validate documents against a W3C XML Schema, a DTD or a RELAX NG schema.

xmlstarlet val --err --xsd schema.xsd input.xml > errors.txt 2>&1

And presto! - you get an error report that you can slice and dice with sed/awk or anything else at all.

XMLStarlet also lets you write XPath expressions to query the XML. However, I found the syntax too weird and roundabout. A better alternative is a Perl-based solution: XSH2, a command-line XML editing shell. You can install it under Cygwin, and it supports basic command pipelining and redirection.

So go ahead and launch XSH. At your cygwin prompt

[~]xsh
---------------------------------------
 xsh - XML Editing Shell version 2.1.1
---------------------------------------

Copyright (c) 2002 Petr Pajas.
This is free software, you may use it and distribute it under
either the GNU GPL Version 2, or under the Perl Artistic License.
Using terminal type: Term::ReadLine::Gnu
Hint: Type `help' or `help | less' to get more help.
$scratch/>

Now, let's load up our document. Type

$scratch/>$x:=open formatted.xml

Your prompt changes to

$x/>

So go ahead and try a few xpaths

$x/> ls /path/to/node

and XSH prints out the matching nodes. Now what if you need to create a document fragment of the nodes matching a certain XPath? Piece of cake, go ahead

$x/> ls /path/to/node | tee fragment.xml
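For comparison, here is a minimal sketch of the same "pull out the matching nodes" step using Python's standard library instead of XSH2. The helper name extract_fragment is hypothetical. Two assumptions to keep in mind: ElementTree only understands a subset of XPath, and its paths are relative to the root element, so /path/to/node becomes to/node when path is the root.

```python
import xml.etree.ElementTree as ET


def extract_fragment(xml_text: str, path: str) -> str:
    """Serialize every element matching `path` (hypothetical helper).

    `path` uses ElementTree's limited XPath subset and is evaluated
    relative to the document's root element.
    """
    # fromstring loads the whole document into memory.
    root = ET.fromstring(xml_text)
    matches = root.findall(path)
    # One serialized element per line; strip() drops trailing whitespace
    # that ElementTree keeps as the element's tail text.
    return "\n".join(
        ET.tostring(match, encoding="unicode").strip() for match in matches
    )


if __name__ == "__main__":
    doc = "<path><to><node id='1'/><node id='2'/></to><other/></path>"
    print(extract_fragment(doc, "to/node"))
```

Writing the returned string to a file gives you roughly what the tee in the XSH command produces.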

XSH2 has many, many more features - but this should be good enough to get you off the ground.

Posted in HOWTO, Tips, Tools, Utilities, XML

I’m on a high. I’ve been stuck with this problem o…

Posted by Raghu on April 28, 2005

I’m on a high. I’ve been stuck with this problem of trying to understand a HUGE xslt that operates on an even bigger XML. I sorely needed something that will let me trace through the xslt execution to understand the flow.

Tried a couple of IDEs - Stylus Studio (free edition) and Marrowsoft Xselerator. Stylus Studio did a graceful exit, Xselerator went purple in the face and died a gruesome death :-(

Hmm… so after some time I was wondering whether annotating the XSL output with information on the templates matched would at least help partway. I was thinking of perl/C#/regular expressions and then suddenly the penny dropped: "for each xsl:template node, include a comment with the template match/mode" - Hang on!!! That sounds like a job for XSLT….

Anyway, there are a couple of quirks - the first one you hit will be when you try to output a template like this

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:template match="xsl:stylesheet">
		<!-- generate an output xsl:stylesheet node -->
		<xsl:stylesheet>
		</xsl:stylesheet>
	</xsl:template>
</xsl:stylesheet>

Oops! The XSLT processor complains (and with good reason too)! It doesn't know which xsl: elements are instructions for the current stylesheet and which are intended to be output to the result document. There are a couple of approaches around this. One is to use xsl:element like this

	<xsl:element name="xsl:template"></xsl:element>

But this results in enormously wordy documents. Thankfully there's a neater way out. You use something called xsl:namespace-alias. It lets you use a dummy namespace in your XSLT. You set up the dummy namespace prefix (let's say gen) to map to a real namespace prefix in the result document (xsl), and then write the elements you want to output using the dummy prefix. When generating output, the processor replaces all references to the dummy namespace in the result document with references to the real namespace. For example:

<xsl:stylesheet version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:gen="http://www.w3.org/1999/XSL/Transform/2">
	<!-- gen: elements are written to the result document as xsl: elements -->
	<xsl:namespace-alias stylesheet-prefix="gen" result-prefix="xsl"/>
	<xsl:template match="xsl:stylesheet">
		<gen:stylesheet>
			<!-- copy the attributes of the source xsl:stylesheet -->
			<xsl:for-each select="@*">
				<xsl:attribute name="{name(.)}">
					<xsl:value-of select="."/>
				</xsl:attribute>
			</xsl:for-each>
			<xsl:apply-templates/>
			<!-- add the helper template unless the source already has one -->
			<xsl:if test="not(xsl:template[@name='pseudo-xpath-to-current-node'])">
				<xsl:text></xsl:text>
				<xsl:copy-of select="document('')/xsl:stylesheet/xsl:template[@name='pseudo-xpath-to-current-node']"/>
				<xsl:text></xsl:text>
			</xsl:if>
		</gen:stylesheet>
	</xsl:template>
	<!-- remaining templates, including pseudo-xpath-to-current-node, not shown -->
</xsl:stylesheet>

Note the usage of xsl:namespace-alias and the code for generating an xsl:stylesheet element in the result document.

I’ve included my efforts here - along with a simple books.xml, a books.xsl which generates a table, and finally an instrument.xsl that instruments books.xsl to generate an instrumented version. Transforming books.xml with the instrumented XSLT generates output that is annotated with custom nodes that highlight which template got called when.
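To make the instrumentation idea concrete without reproducing instrument.xsl, here is a rough sketch in Python rather than XSLT. It is not my actual stylesheet: it emits an xsl:comment at the top of each template instead of custom nodes, and the function name instrument is made up for illustration. Also note the limits of ElementTree here: comments in the source stylesheet are dropped, and namespace prefixes that appear only inside XPath expressions may lose their declarations on output.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
# Keep the conventional xsl: prefix when serializing.
ET.register_namespace("xsl", XSL_NS)


def instrument(stylesheet_text: str) -> str:
    """Add an xsl:comment to every xsl:template (hypothetical helper)."""
    root = ET.fromstring(stylesheet_text)
    # list() so the tree is not mutated while it is being walked.
    for template in list(root.iter(f"{{{XSL_NS}}}template")):
        label = ", ".join(
            f"{attr}={template.get(attr)}"
            for attr in ("match", "name", "mode")
            if template.get(attr) is not None
        )
        marker = ET.Element(f"{{{XSL_NS}}}comment")
        marker.text = f"template {label}"
        # xsl:param children must come first in a template,
        # so the marker goes right after any leading params.
        position = 0
        for child in template:
            if child.tag != f"{{{XSL_NS}}}param":
                break
            position += 1
        template.insert(position, marker)
    return ET.tostring(root, encoding="unicode")
```

Running the instrumented stylesheet against the input then leaves a comment in the output every time a template fires, which is the same trace you get from the custom nodes, just less structured.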

After I was mostly done with the code, I came across an article on IBM developerWorks which discusses the same topic. Rather than cover the same material again, you can find the article here. What's different is that I generate custom nodes (which I thought would be useful to view in an XML IDE that allows a hierarchical display). I’ve also shamelessly borrowed the code to generate the XPath of the node (part of what you see in the snippet).

Posted in XML