I needed to generate an XML file from database tables and the plan was to use Talend Open Studio. Talend is an ETL tool that generates data integration jobs in Java. The community edition is free and I’d been using it for several other data tasks for an ecommerce client. Overall, I think it’s quicker than hand coding in Java, but you can still dip into Java code if you need to and embed the jobs in other programs.
Unfortunately, it’s not so good when it comes to generating moderately complex XML files. By moderately complex, I mean lists of lists like this:
<products>
<product id="1">
<categories>
<category id="1" />
<category id="2" />
</categories>
<skus>
<sku>12345</sku>
<sku>67890</sku>
</skus>
</product>
</products>
Talend can do this; it’s just obscenely slow for larger file sizes. By “larger” I mean a few MB. It appears this is due to their use of DOM4J instead of a SAX parser. Why a few megs of XML data takes up so much memory I don’t know, but that’s the case.
Talend converts columnar data to XML, so you have to do it in two passes:
Products/Categories (category is the loop element):
<products>
<product id="1">
<categories>
<category id="1" />
<category id="2" />
</categories>
</product>
</products>
Products/SKUs (sku is the loop element):
<products>
<product id="1">
<skus>
<sku>12345</sku>
<sku>67890</sku>
</skus>
</product>
</products>
I noticed that when generating these files individually, Talend is really fast, around 1,000 rows/second on my machine. But when you instruct it to combine the files (really append the second file to the first, joining on the product ID), it slows down to about 1 row/sec for a 5,000 row job. Yes, 1,000 times slower. Again, it’s all due to the file size: when I restricted it to 350 rows it ran at ~120 rows/s. The problem was that in production I needed to process about 18K rows, and throughput keeps dropping sharply as the row count grows.
The solution is to generate two separate files, then merge them using a SAX parser. I’m just starting to use Groovy, which Talend supports, and was assuming it would be faster to develop in that language over Java. Well, if it didn’t require a ton of trial and error to overcome poor documentation, maybe it would have. Hopefully this heavily commented code makes it easier for the next person.
package com.madeupname
import groovy.util.slurpersupport.GPathResult;
import groovy.xml.StreamingMarkupBuilder;
// If there is no match in the products/categories file, use this empty node
def emptyCategoriesXML = '''<sites>
<categories>
<category />
</categories>
</sites>
'''
// Uses a SAX parser, less memory and overhead than a DOM parser (XmlParser)
// parse() method returns a GPathResult, which allows you to traverse and
// manipulate an XML file or snippet using dot (.) notation.
def xs = new XmlSlurper()
// Main products file with SKU list
def productsSKUs = xs.parse(new File('/Data/ProductSKU.xml'))
// Products file with list of categories
def productsCategories = xs.parse(new File('/Data/ProductCategory.xml'))
def emptyCategories = xs.parseText(emptyCategoriesXML)
// Output file
File output = new File('/Data/Products.xml')
// Store category nodes into a Map for fast retrieval later. Key is product ID.
HashMap<String, GPathResult> productsCategoriesMap = new HashMap<String, GPathResult>()
// Note: if you're using Talend, you can't statically type this and must use this:
// def productsCategoriesMap = new HashMap<String, GPathResult>()
// Loop through all the products. Note that the root element is products
// (plural), but the GPathResult you get from XmlSlurper assumes you're already
// in the root element. That's why it's not productsCategories.products.product.each
productsCategories.product.each {
    // Note you must put the id in a String (Groovy style shown here)
    // in order to have a String key.
    productsCategoriesMap["${it.@id}"] = it
}
// This allows you to use a DSL to write the file. Note that you are not
// actually doing the work specified in the closure until you start writing it.
new StreamingMarkupBuilder().bind {
    // mkp is a special markup namespace for use within this closure. There
    // are other methods as well, see the docs.
    mkp.xmlDeclaration(["version":"1.0", "encoding":"UTF-16LE"])
    // My root element
    products {
        // Loop through each product and append (insert) the categories
        // node to the product node with the same product id.
        productsSKUs.product.each {
            if (productsCategoriesMap["${it.@id}"] != null) {
                it.appendNode(productsCategoriesMap["${it.@id}"].sites)
            } else {
                it.appendNode(emptyCategories)
            }
            // Note this is not System.out, it merely ensures the
            // GPathResult is printed when written.
            out << it
        }
    }
// Here we actually write the file, executing the above closure.
// Note how I specify the character set to match the declaration.
} .writeTo(output.newWriter("UTF-16LE"))
For Talend users, I had to use tGroovyFile instead of tGroovy because tGroovy was complaining about a missing library. You’ll also want to change the hard-coded file paths to use context variables.
To save you some time, here are the links to the relevant documentation:
http://groovy.codehaus.org/api/groovy/util/XmlSlurper.html
http://groovy.codehaus.org/api/groovy/util/slurpersupport/GPathResult.html
http://groovy.codehaus.org/gapi/groovy/xml/StreamingMarkupBuilder.html
http://groovy.codehaus.org/api/index.html?groovy/xml/MarkupBuilderHelper.html
Apologies for complaining about the docs, but I like Groovy and want to see adoption spread. To do that, it has to make the hard things easy. None of the docs I read (the above JavaDocs, all the XML walkthroughs on the official site, and relevant chapters of Programming Groovy) went beyond the basics of generating XML from scratch (and the JavaDocs are particularly lacking). Groovy could really benefit from a good cookbook site (maybe nowadays that’s Stack Overflow) and most of all, annotated API documentation like PHP has had for years. I found those user-contributed notes to be priceless when I was learning PHP. I think a wiki with comments would be a great home for the Groovy API reference docs.
I have come up with a simple idea that will have a positive, global environmental impact. I’m talking about the end of the business card as we know it. Have you ever had a box of 500, maybe 1,000 business cards, handed out a few, then thrown the rest away when your title or contact info changed? Maybe you’ve done that a few times, or several. How much did that cost you? How did it impact the environment? How did you feel when you threw them away? What if no one ever did that again? Here’s a story about how we can make that happen.
A good friend of mine was asking for advice about business cards. He was going to be traveling in Europe for 6 weeks, meeting a ton of people, and wanted something that would stand out, something creative and memorable. But he also didn’t have a lot of time. My answer was simple. First, when you’re traveling for an extended period of time, hitting a lot of locations, you want to keep it light. The last thing you need is to lug around a box of business cards.
My advice was to create a single, sturdy business card that simply had a QR code on it. People would scan it with their phones and you’d take it back. It’s both memorable and green, which I think a lot of Europeans would respond positively to. Especially the, ahem, female Europeans whose acquaintance he wanted to make. Moreover, it goes right into their contacts, saving them the trouble of transferring it, which won’t happen if it gets lost (even at the bottom of a purse).
Then another thought hit me – why do you need the card? You can have it as an image on your phone! Their phone photographs your phone and you’re done. To test this out, I went to an online QR code generator, capable of making a vCard/meCard. I took out my relatively new HTC Evo 4G and photographed the screen.
Nothing happened.
Turns out, even though Japanese cell phones have had built-in QR code readers for several years, Google and Apple still want you to download a separate barcode reader app for this. I’ve been seeing these codes all over the place: business cards, movie posters, real estate signs. I’m sure you have as well, although maybe you didn’t notice them or know what they were called. I was quite surprised to learn that they all rely on a 3rd party app.
All you need to read a QR code is a camera and a bit of processing power. There are several free readers for iPhone, Android, Blackberry, Windows Phone 7, Palm OS, and probably several others. You can create several images, one for each virtual business card you want to share: work, personal, work + personal, etc. You can store them in a folder in your photo gallery app, or use one as your phone’s background image so it can be viewed and captured without even unlocking the phone.
Breaking Down The Branding Defense
I know, many people use business cards as part of their branding. My brother David is a talented graphic designer who has done this for many clients. But as I sort through a stack of about 30 collected business cards, very few people are doing this. What I’m seeing:
- 1/3 are nice. Here I include the traditional Fortune 500 business cards. Those have good layout, fonts, print and paper quality, and reinforce the brand message (solid, traditional), but don’t differentiate them. Maybe 2 or 3 actually looked kinda cool, but none blew me away.
- 1/3 are meh. They don’t look like they were created by experienced designers, more like professional amateurs. Or, as is often the case, the client’s choice overruled the designer’s.
- 1/3 are just bad. Cheap paper and printing, ugly design. These actually hurt the person who hands them out.
While it’s a small sample size, it feels about right. Most people think their card helps them, but most people are wrong. The best defense for paper business cards is that your potential clients are primarily dumbphone owners. For most professionals, that’s a pretty small group. If you take down their info instead of giving yours, you gain a measure of control over the transaction. If that’s not feasible, you can ask your designer about small batch printing and eco-friendly materials.
In contrast, what does the paperless business card say about you or your company? At a minimum, it says you’re tech savvy, even cutting edge, and that you’re environmentally conscious. Some people don’t care about the environment, but I can’t think of a case where appearing environmentally conscious makes you look bad.
Resources
You’re sold, right? So, where to go from here? First, make one or more QR codes. You can do that here:
QR Code and 2D Code Generator
Allows you to create many different QR codes, including vCards and meCards. You can also choose the format. I chose PNG (an image file, like GIF or JPEG), then saved it and mailed it to my phone.
ZXing QR Code Generator
From the maker of the free, open source Barcode Reader app. I tested this with a couple generators and it works fine with contacts. It also generates QR codes for you from your contact list, although I don’t know how compliant they are with the vCard or meCard formats.
Google Infographics
Allows developers to create QR codes with a simple HTTP GET or POST request.
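For a rough idea of what a meCard QR code actually encodes, here is a minimal Python sketch that builds the text payload. It assumes the MECARD layout as I understand it (fields separated by semicolons, a double semicolon at the end, backslash escapes for delimiter characters). The function names and the sample contact are made up for illustration, and you would still paste the result into one of the generators above to get the image.

```python
def escape_mecard(value: str) -> str:
    """Backslash-escape characters that MECARD uses as delimiters."""
    # Escape the backslash first so the later escapes are not doubled.
    for ch in ("\\", ";", ",", ":"):
        value = value.replace(ch, "\\" + ch)
    return value


def build_mecard(last: str, first: str, phone: str = "",
                 email: str = "", url: str = "") -> str:
    """Build a MECARD payload string (hypothetical helper, not a library API)."""
    # The name field is "last,first"; the comma between them is a real
    # delimiter, so only the two name parts themselves are escaped.
    fields = ["N:" + escape_mecard(last) + "," + escape_mecard(first)]
    if phone:
        fields.append("TEL:" + escape_mecard(phone))
    if email:
        fields.append("EMAIL:" + escape_mecard(email))
    if url:
        fields.append("URL:" + escape_mecard(url))
    # Fields are joined with ";" and the card is terminated with ";;".
    return "MECARD:" + ";".join(fields) + ";;"


if __name__ == "__main__":
    # Sample data only; swap in your own details.
    print(build_mecard("Doe", "John", phone="13035551212",
                       email="john@example.com"))
```

Keeping one small function per card makes it easy to produce the separate work and personal variants mentioned earlier.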
Google will find you many more options. After that, you need to find a reader app for your phone. Rather than list them here, I suggest you go to your favorite app store. There are many quality, free apps to choose from. My only caveat is that I first tried Google Goggles and discovered it can’t read embedded phone numbers (I tried two different generators, and both vCards and meCards). That’s a pretty major limitation for getting contact info.
How to Help
If you want to help, request that your phone maker or carrier provide this feature natively.
Android barcode reader integration – You can directly vote for this feature in Android by “starring” this request.
iPhone Feedback Form – Ask Apple for QR code reader to be integrated into the camera.
Postscript: Alternatives
There are alternatives to QR codes. One is near field communication (NFC), but most phones, including my relatively new HTC Evo 4G, don’t have an NFC chip/antenna. Another is the app Bump. Bump’s mechanism is extremely clever, and the company appears to have some brilliant, highly credentialed people working there. I think it’s a good idea and it’s on my phone. However, it’s only available for iOS and Android, and the Android version is missing some critical features like multiple contact cards or custom contact cards (it uses Android’s contact info, which doesn’t have a field for your web site). And, of course, neither works with printed advertisements like movie posters.
I recently switched from Eclipse 3.6 to STS 2.7.1 (based on Eclipse 3.7). Ditching my old .project and workspace settings files along with the move has made for a smoother experience; it seems these files get corrupted over time, and I’m too lazy to do the research to fix them. However, the upgrade resulted in performance issues. For instance, it hung for ~10 seconds every time I saved web.xml, and there were various random pauses. It’s not the hardware: I’m on a Core i7 Quad with 6GB RAM running Win7 x64. I realize you are getting more tooling with STS, but performance was much worse than I experienced with 3.6.
Well, it had slipped my mind that I had updated my 3.6 eclipse.ini settings with those I had found in an excellent Stack Overflow answer from VonC on optimal JVM settings for Eclipse. It hasn’t been updated for 3.7 (nor does it mention STS), but after some experimenting and research it appears to work well for it. Here are my settings, and below I add some commentary on what they do, which is missing from the original answer (although I still suggest you read that, as it covers other situations/issues that may affect you). Keep in mind I’m not a JVM tuning expert, YMMV, etc. Here are the contents of my sts.ini:
-vm
C:/Java/SDKs/jdk1.6.0_24x64/bin/javaw.exe
-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.100.v20110502
-product
com.springsource.sts.ide
--launcher.defaultAction
openFile
--launcher.XXMaxPermSize
384M
-vmargs
-Dosgi.requiredJavaVersion=1.6
-Xmn128m
-Xms256m
-Xmx768m
-Xss4m
-XX:PermSize=128m
-XX:MaxPermSize=384m
-XX:CompileThreshold=1000
-XX:+CMSIncrementalPacing
-XX:+UnlockExperimentalVMOptions
-XX:+UseG1GC
-XX:+UseFastAccessorMethods
-XX:CompileThreshold=1000
This is the number of method invocations/branches before compiling. This is normally set to 10,000, so we’re changing it dramatically, but the original suggestion (a threshold of 5) was leading to errors so I raised it. You will notice on startup that it takes longer, and your CPU usage jumps. However, your performance after that is much better. Those 10s save times for web.xml? Gone after this. I’m willing to take a hit at the beginning for better productivity while coding.
-XX:ReservedCodeCacheSize=64m
Related to the above, I was getting the error “Unhandled event loop exception / out of space in CodeCache for adapters” due to setting the compile threshold to 5. This is another solution to that problem, and may be redundant.
-Xss4m
This is stack size, and was previously set to 1MB, now up to 4MB per thread. Doing this will increase the overall memory used.
-XX:+UnlockExperimentalVMOptions
-XX:+UseG1GC
-XX:+UseFastAccessorMethods
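The first two enable the (still experimental in this JDK) G1 garbage collector, which does its collection work in parallel; the third is a separate optimization for simple accessor methods. I saw my CPU utilization reach 100% after this, which is rare on a Core i7 Quad. It felt like I was finally using it to its potential.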
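If you want to sanity-check how much memory settings like these can claim, here is a small Python sketch that pulls the size flags out of the text of an ini file. The function names are my own, it only understands the handful of flags listed in the code, and the result is a rough picture rather than a precise accounting of what the JVM will use.

```python
import re


def parse_size_mb(text: str) -> float:
    """Convert a JVM size such as '768m' or '1g' to megabytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG])", text)
    if not match:
        raise ValueError("unrecognized size: " + text)
    number, unit = match.groups()
    factor = {"k": 1 / 1024, "m": 1, "g": 1024}[unit.lower()]
    return int(number) * factor


# Flag prefix -> label used in the summary (only these flags are recognized).
SIZE_FLAGS = (
    ("-Xms", "initial_heap_mb"),
    ("-Xmx", "max_heap_mb"),
    ("-Xss", "thread_stack_mb"),
    ("-XX:MaxPermSize=", "max_permgen_mb"),
)


def summarize_vmargs(ini_text: str) -> dict:
    """Return the size settings found after the -vmargs line."""
    lines = [line.strip() for line in ini_text.splitlines() if line.strip()]
    if "-vmargs" not in lines:
        return {}
    summary = {}
    # Everything after -vmargs is passed to the JVM, one argument per line.
    for arg in lines[lines.index("-vmargs") + 1:]:
        for prefix, label in SIZE_FLAGS:
            if arg.startswith(prefix):
                summary[label] = parse_size_mb(arg[len(prefix):])
    return summary


if __name__ == "__main__":
    sample = "-vmargs\n-Xms256m\n-Xmx768m\n-Xss4m\n-XX:MaxPermSize=384m\n"
    print(summarize_vmargs(sample))
```

With the settings above, that works out to 768 MB of heap plus 384 MB of PermGen, so a bit over 1.1 GB before thread stacks and native overhead are counted.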
Again, I’m not an expert. I’ve found it’s more sluggish at first, but response times quickly improve. For me, it’s a clear net gain. Not documented are the things I turned off in preferences because I wasn’t using them (Maven is disconnected, etc.). Visit Window >> Preferences and filter on startup, see if there’s anything you can get rid of. Finally, I must give credit to my sources outside the original article:
http://ugosan.org/speeding-up-eclipse-a-bit-with-unlockexperimentalvmoptions/
http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
http://performance.netbeans.org/howto/jvmswitches/
I’ve created a Tech Startup Wiki on Wikispaces. Right now it’s just a list of books and web sites for those interested in tech startups. Please check it out and contribute if you can.
I was trying to launch a Tomcat instance in Eclipse and it complained that port 8080 was in use. This scared me, since to my knowledge nothing else used that port. Was it spyware? I visited http://localhost:8080 and saw:
Access violation at address 32658D8F in module 'CC3260MT.DLL'. Read of address 00000000
No, it’s not spyware, it’s TiVo. Specifically the TiVo desktop server (Bonjour, I believe it’s called). Pulling up the interface and pausing the server frees up the port. Shame on TiVo for using a port that’s very popular with developers, but I guess we can change the port in Eclipse/Tomcat/Jetty if we need to. Now back to work…
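If you’d rather check before launching the server, here is a minimal Python sketch that tests whether something is already listening on a local port. The function name and defaults are my own choices, and it only tells you that the port is taken, not which program has it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(8080):
        print("Port 8080 is already taken")
    else:
        print("Port 8080 looks free")
```

To find the culprit on Windows, running netstat -ano in a command prompt lists listening ports along with the owning process ID, which you can then look up in Task Manager.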
For some reason, searching for “wincvs revert” in Google doesn’t immediately show http://cvsgui.sourceforge.net/newfaq.htm#reversion, which it absolutely should. It explains what you have to do, but to make it extra clear here is a screen grab:

You can replace HEAD with another version like 1.12.
BTW, if you’re on Windows and using Eclipse, WinCVS is a great support tool for Eclipse’s broken CVS support. I can get history/logs for files and revert to previous versions. Eclipse remote history has been broken for me through several upgrades/reinstalls and will only revert to a tag, not a specified file version.
The only real issue I have with WinCVS is that its help system doesn’t provide any. Expect things like “Merge options dialog allows the user to change the merge options.” Still, I’m grateful for the effort and can’t complain about the price.
It occurred to me that by keeping launch dates secret, Apple never appears to suck at software estimation. Microsoft gives a date, sails right past it, and everyone is up in arms about it.
A better question is why doesn’t MS keep things secret? Or all software companies, for that matter? I know with sales, everyone wants to know when the next version is out so they can hold off on buying the current one. Or the sales person tries to keep you from buying their competitor’s product because their next version will be much better. But everyone knows there’s no guarantee of that happening, and there’s a potential opportunity cost in waiting. And since the estimates that drive the schedule were done by the wrong people at the wrong time, without being updated, you’ll probably be waiting longer than they claim.
I think the best policy would be to launch whenever it’s ready, and everyone who purchased within the last 60 days – or better yet, has a support contract – gets the new version for free.
Of course, this only applies to software. For hardware, you’re forced to apply common sense: do I need this right now? Does it do what I need at a fair price? Or you can visit http://buyersguide.macrumors.com/ and hope they’re right.
I recently had some more frustration with Eclipse, with no solution on the web, so I’m posting mine.
The problem:
I had an auto-generated task (TODO) from creating a class that implemented an interface. At some point, I noticed the task comment was gone, but the task indicator (checkbox icon) was still there. Probably because I have it set to reformat on save, but maybe I deleted the task comment without hitting the task button (or both). Anyway, I could not clear it no matter what:
- Double clicking the icon didn’t work since it couldn’t find the comment.
- Clicking the “Clean and Redetect Tasks” button did nothing.
- Restarting Eclipse (which I do more often than a Windows admin reboots) did naught.
- The Task View displayed the offending tasks, but the Delete option was greyed out. Selecting the task and hitting delete 3 million times while cursing furiously at the screen brought no justice.
The solution:
- Go to Window >> Preferences, then Java/Compiler/Task Tags. Select the TODO task tag, or whatever accursed tag haunts you.
- Click Remove. When it threatens a rebuild, call its bluff (that is, agree). When it’s done (and it took its sweet time), the offending tasks will be gone. Rejoice!
- Click New… and restore the TODO tag. All legitimate TODO tasks will be restored. Callooh! Callay!
keywords: can’t delete tasks, task tags, eclipse 3.4, mylyn
I just quit another survey before completing it, this one from Rhapsody. I like Rhapsody, and I don’t mind giving them my opinions to improve their service (or even to keep it the same). However, my time is valuable, and I can’t waste it on sites that don’t institute the simplest of usability measures. For example, if I leave a question blank, and there is a very reasonable conversion for blank (like zero or n/a), don’t come back to me with “answer all questions properly.” They didn’t even highlight which question they had a problem with or what, specifically, was wrong. The second time I got that message, I just closed the tab. They said the survey would take 10-15 minutes. Well guess what? If you coded it nicely, it’d only take us 5.
This is similar to telemarketers who give phone surveys and, because of some stupid rule set up by their management, must tell you what the numbers 1 through 5 represent for every single question. At that point, I’m thinking 1 for slightly annoyed, 2 for really annoyed, 3 for angry, 4 for hanging up right now…
And offering a chance of winning a single $100 Amazon gift card (which seems to be a new survey standard) is really no incentive at all. If you really want to incentivize, why not say 100 people will get a free month of Rhapsody to Go? Wouldn’t that improve your image without costing you much, since it’s your survey to begin with?
Look, for many topics, I’m a guy who actually cares. I’m happy to give you my opinions and insights. Please stop making me care less.
I had this idea and considered creating it as a service, but I’ve got my own web startup going and don’t need the distraction. Several sites, such as Zap2it, TV Guide, and TitanTV (beta) already have the infrastructure (as well as the TV listings I’d have to license) so hopefully this won’t be too hard for one of them to implement.
I’m looking for a clone of the TiVo Wishlist. The difference is that instead of recording, you get email alerts. I imagine if you have a DVR/PVR that is internet programmable, the service could take advantage of that, but I’ve got my cable company’s DVR (Scientific Atlanta) like most people and must program it with the remote. So this provides a wishlist feature for everyone without a TiVo, which I think is compelling.
The search features of current TV listings sites are missing critical fields for a wishlist to work (not to mention the email reminder part). Filtering (both inclusive and exclusive) by genre and channel is required.
Here are a couple strong (IMHO) use cases:
- You want to be notified if anyone on a list of people is scheduled to be on a talk show. You enter description:”Quentin Tarantino, Kevin Smith, Judd Apatow” and genre: talk and every time any of them appear on a talk show you’re notified. If Tarantino’s Reservoir Dogs is played on HBO, nothing happens.
- You’re planning a vacation and you want to record travel shows about various places. You enter keywords:”Prague,Tokyo,Paris” and interest:travel (or perhaps channels:travel,discovery,tlc,pbs) and you get notified for any travel shows relevant to you.
Of course, the above would be done via a nice GUI/query builder.
When you get your email, there would be links to hide/exclude shows in the future, which is useful for anything that gets rerun frequently (especially basic cable shows).
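To make the matching rules concrete, here is a minimal Python sketch of how a wishlist entry could be checked against listings. Everything in it is hypothetical: the Listing and Wish classes, the field names, and the sample shows are made up for illustration, and a real service would run this against licensed listings data.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    title: str
    description: str
    genre: str
    channel: str


@dataclass
class Wish:
    keywords: list                                    # any one keyword is enough
    genres: set = field(default_factory=set)          # empty means any genre
    channels: set = field(default_factory=set)        # empty means any channel
    hidden_titles: set = field(default_factory=set)   # shows the user chose to hide


def matches(wish: Wish, listing: Listing) -> bool:
    """Return True if the listing should trigger an alert for this wish."""
    if listing.title.lower() in {t.lower() for t in wish.hidden_titles}:
        return False
    if wish.genres and listing.genre.lower() not in {g.lower() for g in wish.genres}:
        return False
    if wish.channels and listing.channel.lower() not in {c.lower() for c in wish.channels}:
        return False
    # Keywords are searched in both the title and the description.
    text = (listing.title + " " + listing.description).lower()
    return any(keyword.lower() in text for keyword in wish.keywords)


def alerts(wish: Wish, listings: list) -> list:
    """Filter a batch of listings down to the ones worth emailing about."""
    return [listing for listing in listings if matches(wish, listing)]


if __name__ == "__main__":
    wish = Wish(keywords=["Quentin Tarantino", "Kevin Smith", "Judd Apatow"],
                genres={"talk"})
    listings = [
        Listing("Some Talk Show", "Guest: Quentin Tarantino", "talk", "NBC"),
        Listing("Reservoir Dogs", "Directed by Quentin Tarantino", "movie", "HBO"),
    ]
    for hit in alerts(wish, listings):
        print(hit.title)
```

The hide links in the email would simply add a title to hidden_titles, so frequent reruns stop generating alerts.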
You can monetize this through targeted ads, since the user is telling you what he/she wants.
Another service would be to send a post-mortem email that includes links to the shows you want on Hulu, YouTube, the network’s website, etc. after they’ve been uploaded. At that point you’re much closer to a real TiVo service and could possibly charge for it. Possibly.
I should point out that TiVo’s own advanced search is great and includes categories (and subs) and is open to the public.
And if you are only interested in the talk show part, you can set a calendar reminder to check the talk show lineups page once a week. However, I’d much rather have something automated that allows me to set it and forget it. I could probably whip up a script to parse that page and run it as a service/cron job to notify me when there’s a match, but still, it would only work for talk shows. And parsing poorly formed HTML is a pain.
No, the easiest solution is to convince someone else to implement it for me.
Update: If you want to see Yahoo TV implement this, upvote it here.