I needed to generate an XML file from database tables and the plan was to use Talend Open Studio. Talend is an ETL tool that generates data integration jobs in Java. The community edition is free and I’d been using it for several other data tasks for an ecommerce client. Overall, I think it’s quicker than hand-coding in Java, but you can still dip into Java code if you need to and embed the jobs in other programs.

Unfortunately, it’s not so good when it comes to generating moderately complex XML files. By moderately complex, I mean lists of lists like this:

<products>
  <product id="1">
    <categories>
      <category id="1" />
      <category id="2" />
    </categories>
    <skus>
      <sku>12345</sku>
      <sku>67890</sku>
    </skus>
  </product>
<products>

 

Talend can do this, it’s just obscenely slow for larger file sizes. By “larger” I mean a few MB. It appears this is due to their use of DOM4J instead a SAX parser. Why a few megs of XML data takes up so much memory I don’t know, but that’s the case.1

Talend converts columnar data to XML, so you have to do it in two passes:

Products/Categories (category is the loop element):

<products>
  <product id="1">
    <categories>
      <category id="1" />
      <category id="2" />
    </categories>
  </product>
<products>

 

Products/SKUs (sku is the loop element):

<products>
  <product id="1">
    <skus>
      <sku>12345</sku>
      <sku>67890</sku>
    </skus>
  </product>
<products>

 

I noticed when generating these files individually, Talend is really fast, around 1,000 rows/second on my machine.2 But when you instruct it to combine the files (really append the second file to the first, joining on the product ID), it slows down to about 1 row/sec for a 5,000 row job. Yes, 1,000 times slower. Again, it’s all due to the file size, as when I restricted it to 350 rows it ran at ~120 rows/s. The problem was that in production I need to process about 18K rows and it gets exponentially slower.

The solution is to generate two separate files, then merge them using a SAX parser. I’m just starting to use Groovy, which Talend supports, and was assuming it would be faster to develop in that language over Java. Well, if it didn’t require a ton of trial and error to overcome poor documentation, maybe it would have. Hopefully, this heavily commented code makes it easier for the next person.

package com.madeupname

import groovy.util.slurpersupport.GPathResult;
import groovy.xml.StreamingMarkupBuilder;

// If there is no match in the products/categories file, use this empty node
def emptyCategoriesXML = '''

	

'''

// Uses a SAX parser, less memory and overhead than a DOM parser (XmlParser)
// parse() method returns a GPathResult, which allows you to traverse and 
// manipulate  an XML file or snippet using dot (.) notation. 
def xs = new XmlSlurper()
// Main products file with SKU list
def productsSKUs = xs.parse(new File('/Data/ProductSKU.xml'))
// Products file with list of categories 
def productsCategories = xs.parse(new File('/Data/ProductCategory.xml'))
def emptyCategories = xs.parseText(emptyCategoriesXML)
// Output file
File output = new File('/Data/Products.xml')

// Store category nodes into a Map for fast retrieval later. Key is product ID. 
// Note: if you're using Talend, you can't statically type this and must use this:
// def productsCategoriesMap = new HashMap<String, GPathResult>()
HashMap<String, GPathResult> productsCategoriesMap = new HashMap<String, GPathResult>()

// Loop through all the products. Note that the root category is products 
// (plural), but the GPathResult you get from XmlSlurper assumes you're already 
// in the root category. That's why it's not productsCategories.products.product.each  
productsCategories.product.each {
	// Note you must put the id in a String (Groovy style shown here) 
	// in order to have a String key.
	productsCategoriesMap["${it.@id}"] = it
}

// This allows you to use a DSL to write the file. Note that you are not 
// actually doing the work specified in the closure until you start writing it. 
new StreamingMarkupBuilder().bind {
	// mkp is a special markup namespace for use within this closure. There 
	// are other methods as well, see the docs.
	mkp.xmlDeclaration(["version":"1.0", "encoding":"UTF-16LE"])
	// My root category
	products {
		// Loop through each product and append (insert) the categories 
		// node to the product node with the same product id.
		productsSKUs.product.each {
			if (productsCategoriesMap["${it.@id}"] != null) {
				it.appendNode(productsCategoriesMap["${it.@id}"].sites)
			} else {
				it.appendNode(emptyCategories)
			}
			// Note this is not System.out, it merely ensures the 
			// GPathResult is printed when written.
			out << it
		}
	}
// Here we actually write the file, executing the above closure.
// Note how I specify the character set to match the declaration.
} .writeTo(output.newWriter("UTF-16LE"))

For Talend users, I had to use tGroovyFile instead of tGroovy because it was complaining about a missing library.3 You’ll also want to change the hard coded file paths to use context variables.

To save you some time, here are the links to the relevant documentation:

http://groovy-lang.org/processing-xml.html
http://docs.groovy-lang.org/latest/html/api/groovy/util/slurpersupport/GPathResult.html

Apologies for complaining about the docs, but I like Groovy and want to see adoption spread. To do that, it has to make the hard things easy. None the docs I read (above JavaDocs, all the XML walkthroughs on the official site, and relevant chapters of Programming Groovy) went beyond the basics of generating XML from scratch (and the JavaDocs are particularly lacking). Groovy could really benefit from a good cookbook site (maybe nowadays that’s Stack Overflow) and most of all, annotated API documentation like PHP has had for years. I found those user contributed notes to be priceless when I was learning it. I think a wiki with comments would be a great home for the Groovy API reference docs.

 

  1. I note that loading an 8MB XML file has caused Chrome to use tons of memory and crash the tab, so there must be a complexity I’m missing. []
  2. Core i7 quad 1.6, 6GB RAM, SSD. []
  3. Yes, I tried adding tLibraryLoad, but it didn’t help and I got no response on the forums. []