Thursday, May 27, 2010

XQuery: Powerful, Simple, Cool .. "Demo"

At IBM Impact this year, I gave talks on the XML Feature Pack as well as a basic introduction to XPath 2.0, XSLT 2.0 and XQuery 1.0. I think one of the most useful parts of my talk was when I demoed code in XQuery. I found that people really saw the light (how simple and fully featured XQuery is) once they saw the code in a useful application. Also, people who were experienced with XPath 1.0 appreciated the new features, and people who had experience with XSLT 1.0 appreciated the syntax (closer to imperative coding). The application I used in the demo was the download stats program I have blogged about before. Let me take a second to do the same "demo" here.

First, I have an XML input file of all the downloads over a certain time period. That XML file could come from a web service, a JMS message, or be loaded from an XML database. The data looks something like:


<?xml version="1.0" encoding="UTF-8"?>
<downloads>
<download>
<transaction>1</transaction>
<userid>user1</userid>
<uniqueCustomerId>uid-1</uniqueCustomerId>
<filename>xml_and_import_repositories.zip</filename>
<name>Mr. Andrew Spyker</name>
<email>user@email.com</email>
<companyname>IBM</companyname>
<datedownloaded>2009-11-20</datedownloaded>
</download>
<!-- more download records repeating -->
</downloads>


First, I want to get rid of all downloads that have "education" in the filename. Next, I want to split the downloads that come from IBMers (email or company has some version of IBM in it) from the downloads that come from clients. Within each of those groups, I want to group repeat downloaders (by uniqueCustomerId). I won't include it here, but I've shown how to write some of this with Java and DOM in the past. Suffice it to say that the Java code is very complex (imagine all the loops through the data you'd write for each of these steps). Let's look at these steps in XQuery:


(: Quickly get rid of education downloads :)
declare variable $allNonEducationDownloads := /downloads/download[not(contains(filename, '/education/'))];

(: Split the IBM downloads from non-IBM downloads :)
declare variable $allIBMDownloads :=
$allNonEducationDownloads[contains(upper-case(email), 'IBM')] |
$allNonEducationDownloads[contains(upper-case(companyname), 'IBM')] |
$allNonEducationDownloads[contains(upper-case(companyname), 'INTERNATIONAL BUSINESS MACHINES')];

(: Get the unique IBM downloader id's :)
declare variable $allIBMUniqueIds := distinct-values($allIBMDownloads/uniqueCustomerId);

(: Get the non-IBM downloads :)
declare variable $allNonIBMDownloads := $allNonEducationDownloads except $allIBMDownloads;

(: Get the unique non-IBM downloader id's :)
declare variable $allINonIBMUniqueIds := distinct-values($allNonIBMDownloads/uniqueCustomerId);


I think the most powerful line of the above code is the one using the "except" operator. In that one short expression, I can say that we want to take all the downloads and remove the IBM downloads, which leaves us with the non-IBM downloads. I think it's quite impressive that XQuery expresses the above steps in about the same number of lines as the English I used to describe the requirements.
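
To see the set operators in isolation, here is a minimal sketch over inline sample data. The two records and the variable names are made up for illustration and are not taken from the real download log. Both "|" and "except" compare nodes by identity rather than by value, which is why a download that matches more than one IBM test is still counted only once.

```xquery
xquery version "1.0";

(: Hypothetical sample data, inlined so the query runs without an input file :)
declare variable $sample :=
  <downloads>
    <download><userid>user1</userid><email>a@us.ibm.com</email><companyname>IBM</companyname></download>
    <download><userid>user2</userid><email>b@example.com</email><companyname>Acme</companyname></download>
  </downloads>;

declare variable $all := $sample/download;

(: user1 matches both predicates, but the union keeps that node once :)
declare variable $ibm :=
  $all[contains(upper-case(email), 'IBM')] |
  $all[contains(upper-case(companyname), 'IBM')];

(: Every node in $all that is not also in $ibm :)
declare variable $nonIbm := $all except $ibm;

<summary ibm="{ count($ibm) }" nonIbm="{ count($nonIbm) }"/>
```

With this sample data the query should produce a summary element with ibm="1" and nonIbm="1".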

Additionally, since you are telling the runtime what you want to do instead of how you want to do it, our runtime can aggressively optimize the data access in ways that we couldn't if we had to try to understand what the Java bytecode was doing on top of the DOM programming model. Also, since XQuery is functional (the above variables are final), we could spread this work across multiple cores more safely than with imperative code, as we can guarantee there are no side effects. This is why, as a performance guy, I think declarative languages are a key to the future of performance.

Back to the code. For people used to XPath 1.0 and its lack of all the built-in schema types, dealing with things as simple as dates was problematic (they were just strings). Here are a few functions that show, with schema awareness, XPath 2.0 and XQuery 1.0 are much more powerful than before:


declare function my:downloadsInDateRange($downloads, $startDate as xs:date, $endDate as xs:date) {
  $downloads[xs:date(datedownloaded) >= $startDate and xs:date(datedownloaded) <= $endDate]
};

declare function my:codeDownloadsInDateRange($downloads, $startDate as xs:date, $endDate as xs:date) {
  let $onlyCodeDownloads := my:onlyCodeDownloads($downloads)
  return my:downloadsInDateRange($onlyCodeDownloads, $startDate, $endDate)
};


These two functions give me a quick way to look for "code" downloads within a date range. In the first function, it's easy to see that the function takes the downloads and returns only the subset whose datedownloaded falls on or after the start date and on or before the end date. In the second function, you can see it's easy to call the first function (my:onlyCodeDownloads is a helper not shown in this post). At this point, I think most Java programmers might be saying "this isn't like what I expected based on my previous work with XSLT". While XSLT is a great language for transformation (XSLT 2.0 even better), I think XQuery gets a little closer to a general purpose language with the ability to declare functions and variables in a more terse syntax.
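
For anyone who wants to try the date filter on its own, here is a self-contained sketch. Three things in it are assumptions rather than part of the original program: the namespace URI bound to the "my" prefix, the body of my:onlyCodeDownloads (I've guessed that a "code" download is one whose filename contains 'repositories'), and the three inline sample records.

```xquery
xquery version "1.0";

(: Assumption: any URI works here; the post does not show how "my" is bound :)
declare namespace my = "http://example.com/my-functions";

(: Hypothetical helper: treats a "code" download as one whose filename
   contains 'repositories' :)
declare function my:onlyCodeDownloads($downloads) {
  $downloads[contains(filename, 'repositories')]
};

(: Same date filter as in the post :)
declare function my:downloadsInDateRange($downloads, $startDate as xs:date, $endDate as xs:date) {
  $downloads[xs:date(datedownloaded) >= $startDate and xs:date(datedownloaded) <= $endDate]
};

(: Hypothetical inline data :)
let $downloads := (
  <download><filename>xml_and_import_repositories.zip</filename><datedownloaded>2009-11-20</datedownloaded></download>,
  <download><filename>readme.pdf</filename><datedownloaded>2009-11-21</datedownloaded></download>,
  <download><filename>xml_and_import_repositories.zip</filename><datedownloaded>2009-12-02</datedownloaded></download>
)
return
  count(my:downloadsInDateRange(my:onlyCodeDownloads($downloads),
                                xs:date("2009-11-01"), xs:date("2009-11-30")))
```

Only the first record is both a code download and inside November 2009, so the query should return 1. Passing a plain string instead of xs:date("...") for either bound would be rejected as a type error, which is the kind of checking XPath 1.0 could not give you.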

Finally, let's cover two more powerful features: FLWOR expressions and output construction. Once I have sliced and diced the data, I need to output the data into an XML report. XQuery gives you a very nice way to mix XML and declarative code as shown below:


declare function my:downloadsByUniqid($uniqid, $downloads) {
  for $id in $uniqid
  let
    $allDownloadsByUniqueId := $downloads[uniqueCustomerId = $id],
    $allCodeDownloadsByUniqueId := $downloads[uniqueCustomerId = $id and (contains(filename, 'repositories'))]
  return
    <downloadById id="{ $id }" codeDownloads="{ count($allCodeDownloadsByUniqueId) }" >
      <name>{ data($allDownloadsByUniqueId[1]/name) }</name>
      <companyName>{ data($allDownloadsByUniqueId[1]/companyname) }</companyName>
      <codeDownloads>
      {
        for $download in $allCodeDownloadsByUniqueId order by $download/datedownloaded return
          <download>
            <filename>{ data($download/filename) }</filename>
            <datedownloaded>{ data($download/datedownloaded) }</datedownloaded>
          </download>
      }
      </codeDownloads>
    </downloadById>
};



This shows how you can create new XML documents and quickly mix in XQuery code. Some people I've talked to think this looks like a scripting language in terms of simplicity. Also, you'll see a For ($id in $uniqid), a Let ($allDownloadsByUniqueId and others) and a Return (downloadById). These three parts make up part of what people call FLWOR (pronounced "flower"), which stands for for, let, where, order by, return. The FLWOR expression is a very powerful construct, able to do all the sorts of joins of data you're used to in SQL, but in this example I've chosen to show how it can be used to simplify code in the general case where joining data isn't the focus. For Java people, think of it as a much more powerful looping construct that integrates all the power of SQL for XML.
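
Since the function above doesn't use a where clause or a join, here is a small sketch that exercises all five clauses in one expression. The $accounts list and its company and region attributes are hypothetical and are not part of the download data; they exist only to give the join something to match against.

```xquery
xquery version "1.0";

(: Hypothetical data: $accounts is a second list keyed by company name :)
let $downloads := (
  <download><name>Ann</name><companyname>Acme</companyname><datedownloaded>2009-11-22</datedownloaded></download>,
  <download><name>Bob</name><companyname>Initech</companyname><datedownloaded>2009-11-20</datedownloaded></download>,
  <download><name>Cy</name><companyname>Globex</companyname><datedownloaded>2009-11-21</datedownloaded></download>
)
let $accounts := (
  <account company="Acme" region="EMEA"/>,
  <account company="Initech" region="AMER"/>
)
return
  <joined>
  {
    for $d in $downloads, $a in $accounts      (: for: iterate over every pair :)
    let $date := xs:date($d/datedownloaded)    (: let: bind a derived value :)
    where $d/companyname = $a/@company         (: where: the join condition :)
    order by $date                             (: order by: sort the surviving pairs :)
    return                                     (: return: build one element per pair :)
      <row name="{ $d/name }" region="{ $a/@region }" date="{ $date }"/>
  }
  </joined>
```

The result should contain two row elements, Bob (2009-11-20) followed by Ann (2009-11-22). Cy is dropped because Globex has no matching account, which is the behavior of an inner join in SQL.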

In the end, I have a 200 line program that takes all the download reports and organizes them by unique IBM vs. unique non-IBM ids and produces a month by month summary. I'd be surprised if you could come up with anything shorter and more maintainable that worked with Java and DOM. I hope this "demo" encourages you to consider using XQuery in your next project where you need to work with data.

Finally, if you find people trying to convince you that XQuery isn't capable enough to be a general language, take a look at a complete ray tracer written in XQuery in a mere 300 lines of code (a real statement of XQuery's power and brevity).

PS. You can download this XQuery program here and some sample input here. You can run them by getting the XML Feature Pack thin client here. The thin client is a general purpose Java-based XQuery processor that you can use for evaluation, and in production when used with the WebSphere Application Server. All you need to do is download the thin client, unzip it and run the command below:


.\executeXQuery.bat -input downloads-fake.xml summary.xq

7 comments:

Unknown said...

As a fan of XQuery and as an employee of a company that deals extensively in XML messaging this is a great article. My question though is what java client can I use for single XML documents as in the example since I don't have an XML data base. Thnx.

Matt said...

Saxon has an XQuery processor. http://www.saxonica.com/

Andrew Spyker said...

@TN the blog concludes with a link of how to run this on the Thin Client for XML with WebSphere Application Server. Let me know if you need assistance getting that running. Pretty simple, download thin client, unzip thin client, download source and xquery file from blog, and run command on blog.

Anonymous said...

@TN maybe could you take a look at http://www.inf.uni-konstanz.de/dbis/basex/index. BaseX is an XML Database with a xquery engine ... all running from a single jar file.
Haven't tried this code yet with it

Unknown said...

Thnx to everyone for the pointers particularly BaseX

Unknown said...

www.marklogic.com. The best XML database around. You can download a free community edition.

Andrew Spyker said...

@All,

It is worth noting that TN asked for something that could run an XQuery from the command line, and many of the readers of this blog are looking for solutions in the application server space.

While the database solutions here are good for what they do, they aren't really applicable to such scenarios. Also, if we're looking at databases, I'd suggest we talk about DB2 pureXML (http://www-01.ibm.com/software/data/db2/xml/). pureXML is an excellent hybrid database which means you can integrate your XML data and relational data in a very natural and efficient way.

Among the solutions mentioned, two fit the scenario suggested: Saxon and the XML Feature Pack. Both would work on the command line and in the application server. One difference between the two is that the XML Feature Pack is free to WebSphere Application Server customers and comes with support from IBM.

Great to see so many XQuery based solutions in the market.