I needed to generate an XML file from database tables and the plan was to use Talend Open Studio. Talend is an ETL tool that generates data integration jobs in Java. The community edition is free and I’d been using it for several other data tasks for an ecommerce client. Overall, I think it’s quicker than hand-coding in Java, but you can still dip into Java code if you need to and embed the jobs in other programs.
Unfortunately, it’s not so good when it comes to generating moderately complex XML files. By moderately complex, I mean lists of lists like this:
<products> <product id="1"> <categories> <category id="1" /> <category id="2" /> </categories> <skus> <sku>12345</sku> <sku>67890</sku> </skus> </product> <products>
Talend can do this, it’s just obscenely slow for larger file sizes. By “larger” I mean a few MB. It appears this is due to their use of DOM4J instead a SAX parser. Why a few megs of XML data takes up so much memory I don’t know, but that’s the case.1
Talend converts columnar data to XML, so you have to do it in two passes:
Products/Categories (category is the loop element):
<products> <product id="1"> <categories> <category id="1" /> <category id="2" /> </categories> </product> <products>
Products/SKUs (sku is the loop element):
<products> <product id="1"> <skus> <sku>12345</sku> <sku>67890</sku> </skus> </product> <products>
I noticed when generating these files individually, Talend is really fast, around 1,000 rows/second on my machine.2 But when you instruct it to combine the files (really append the second file to the first, joining on the product ID), it slows down to about 1 row/sec for a 5,000 row job. Yes, 1,000 times slower. Again, it’s all due to the file size, as when I restricted it to 350 rows it ran at ~120 rows/s. The problem was that in production I need to process about 18K rows and it gets exponentially slower.
The solution is to generate two separate files, then merge them using a SAX parser. I’m just starting to use Groovy, which Talend supports, and was assuming it would be faster to develop in that language over Java. Well, if it didn’t require a ton of trial and error to overcome poor documentation, maybe it would have. Hopefully, this heavily commented code makes it easier for the next person.
package com.madeupname import groovy.util.slurpersupport.GPathResult; import groovy.xml.StreamingMarkupBuilder; // If there is no match in the products/categories file, use this empty node def emptyCategoriesXML = ''' ''' // Uses a SAX parser, less memory and overhead than a DOM parser (XmlParser) // parse() method returns a GPathResult, which allows you to traverse and // manipulate an XML file or snippet using dot (.) notation. def xs = new XmlSlurper() // Main products file with SKU list def productsSKUs = xs.parse(new File('/Data/ProductSKU.xml')) // Products file with list of categories def productsCategories = xs.parse(new File('/Data/ProductCategory.xml')) def emptyCategories = xs.parseText(emptyCategoriesXML) // Output file File output = new File('/Data/Products.xml') // Store category nodes into a Map for fast retrieval later. Key is product ID. // Note: if you're using Talend, you can't statically type this and must use this: // def productsCategoriesMap = new HashMap<String, GPathResult>() HashMap<String, GPathResult> productsCategoriesMap = new HashMap<String, GPathResult>() // Loop through all the products. Note that the root category is products // (plural), but the GPathResult you get from XmlSlurper assumes you're already // in the root category. That's why it's not productsCategories.products.product.each productsCategories.product.each { // Note you must put the id in a String (Groovy style shown here) // in order to have a String key. productsCategoriesMap["${it.@id}"] = it } // This allows you to use a DSL to write the file. Note that you are not // actually doing the work specified in the closure until you start writing it. new StreamingMarkupBuilder().bind { // mkp is a special markup namespace for use within this closure. There // are other methods as well, see the docs. mkp.xmlDeclaration(["version":"1.0", "encoding":"UTF-16LE"]) // My root category products { // Loop through each product and append (insert) the categories // node to the product node with the same product id. productsSKUs.product.each { if (productsCategoriesMap["${it.@id}"] != null) { it.appendNode(productsCategoriesMap["${it.@id}"].sites) } else { it.appendNode(emptyCategories) } // Note this is not System.out, it merely ensures the // GPathResult is printed when written. out << it } } // Here we actually write the file, executing the above closure. // Note how I specify the character set to match the declaration. } .writeTo(output.newWriter("UTF-16LE"))
For Talend users, I had to use tGroovyFile instead of tGroovy because it was complaining about a missing library.3 You’ll also want to change the hard coded file paths to use context variables.
To save you some time, here are the links to the relevant documentation:
http://groovy-lang.org/processing-xml.html
http://docs.groovy-lang.org/latest/html/api/groovy/util/slurpersupport/GPathResult.html
Apologies for complaining about the docs, but I like Groovy and want to see adoption spread. To do that, it has to make the hard things easy. None the docs I read (above JavaDocs, all the XML walkthroughs on the official site, and relevant chapters of Programming Groovy) went beyond the basics of generating XML from scratch (and the JavaDocs are particularly lacking). Groovy could really benefit from a good cookbook site (maybe nowadays that’s Stack Overflow) and most of all, annotated API documentation like PHP has had for years. I found those user contributed notes to be priceless when I was learning it. I think a wiki with comments would be a great home for the Groovy API reference docs.
Hi Philip,
Just to say so many thanks to you for your little groovy script to merge XML files that helped me a lot, you’re my savior !
I have to build a complex XML file with same issue using talend.
Matthew.
Very helpful! Much appreciated. An elegant way to merge xml files.
Hi, Philip,
This appears to be just the fix I need for an impossibly slow job merging xml files. However, I cannot see the schemas of your input xml files. Would it be possible for you to send me a screenshot of them so I can better understand how your code works?
Thanks.
Hi Cara, thanks for pointing out that this was broken. Seems my code highlighting plugin was deprecated at some point and I needed a new one. Hope this makes it clear what I was doing.