Saturday, August 15, 2009

Download Stats Extreme Makeover - The Value of XSLT 2.0 and XQuery 1.0

On Friday, I had the opportunity to use XSLT 2.0 and XQuery 1.0 in a way that proved the value of the new standards.

For our Beta programs, we get a weekly report of how many downloads have occurred. Over time we have asked for more and more breakdowns of the data (how many were from IBM'ers vs. non-IBM'ers, how many came from non-Gmail/Hotmail/etc. email addresses, how many were for code vs. documentation, how many unique users have downloaded, etc.). We use these reports to gauge interest in our betas and the effectiveness of our beta programs.

Currently the process for doing these reports is:

1. Load a webpage somewhere on the intranet that returns the download records either as HTML or as space- and newline-separated text (and I have neither the time nor the contacts to change this "service").
2. Import these download records into Excel.
3. Run VBScript that processes the rows into summary tables. Not every summary has VBScript written for it, because the scripts can get quite complicated.
4. Read the summary tables, hand-compute the remaining summaries, and transcribe them into emails and presentations for the beta teams.

This process can take a few hours and is error prone.

I asked for the raw data behind the service, as I knew there had to be an easier way.

First I started with the HTML, as it's pretty much XML, and I figured I could just write some simple XQuery summaries. This failed because HTML isn't XML (unless it's XHTML). Don't get me started on that rant (in this case, we had things like width=-1 with no quotes)! So I restarted with the text-file version of the data.

XSLT 2.0 has a new function, unparsed-text(), that lets you load text files. This allows you to pull any data into XSLT 2.0, but only as a string. As the XSLT 2.0 specification shows, the real power of this function is magnified by XPath 2.0's support for regular expressions. Combining unparsed-text() with the regex-aware tokenize() function, using newline as the separator, allowed me to for-each across every line in the file. Adding another XSLT 2.0 newcomer into the mix, the xsl:analyze-string instruction, with a regular expression that captured each of the fields, let me transform each line into a well-formed XML element with sub-elements for each field (a download element with child elements for filename, email address, etc.). Ah, now I have the data in a well-formed, structured XML format. Life is good.
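In case it's useful, here is a rough sketch of what that stylesheet looks like. The record layout (date, email address, filename separated by whitespace) and the regular expression are made up for illustration, since the real report format is internal; adjust the regex and field names to match your own data.

    <?xml version="1.0"?>
    <!-- Sketch only: the three-field layout and the regex below are assumptions. -->
    <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:output method="xml" indent="yes"/>

      <!-- location of the raw text report -->
      <xsl:param name="input-uri" select="'downloads.txt'"/>

      <!-- run as the initial named template, e.g. Saxon's -it:main -->
      <xsl:template name="main">
        <downloads>
          <!-- unparsed-text() loads the raw file; tokenize() splits it into lines -->
          <xsl:for-each select="tokenize(unparsed-text($input-uri), '\r?\n')[normalize-space()]">
            <!-- capture each whitespace-separated field of the line -->
            <xsl:analyze-string select="." regex="^(\S+)\s+(\S+)\s+(\S+)$">
              <xsl:matching-substring>
                <download>
                  <date><xsl:value-of select="regex-group(1)"/></date>
                  <email><xsl:value-of select="regex-group(2)"/></email>
                  <filename><xsl:value-of select="regex-group(3)"/></filename>
                </download>
              </xsl:matching-substring>
            </xsl:analyze-string>
          </xsl:for-each>
        </downloads>
      </xsl:template>

    </xsl:stylesheet>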

Once I had this transformation of the input data, I returned to the task of creating the summaries. With XQuery, any query I want is just moments away. To prove the point, I decided to write a simple query that breaks down the downloads for IBM'ers and non-IBM'ers separately. It also groups each unique user and lists all the downloads that user has done, with summaries at every level of how many code vs. how many documentation downloads had occurred. I wanted the output in XML format, so others could query my summaries, or I could turn the summaries into HTML, or load them back into Excel (if you really, really want to).
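To give a flavor of it, here is a trimmed-down sketch of that query. It assumes the download structure produced by the stylesheet above, and the rules it uses for "IBM'er" and "documentation" are deliberately naive placeholders (more on that below).

    (: Sketch only: the classification rules below are simplistic placeholders. :)
    declare function local:is-ibm($email as xs:string) as xs:boolean {
      ends-with(lower-case($email), '@ibm.com')
    };

    declare function local:is-doc($filename as xs:string) as xs:boolean {
      contains(lower-case($filename), 'doc')
    };

    (: one summary group: overall totals plus a per-user breakdown :)
    declare function local:summarize($d as element(download)*) as element(group) {
      <group code="{count($d[not(local:is-doc(filename))])}"
             documentation="{count($d[local:is-doc(filename)])}">
      {
        for $email in distinct-values($d/email)
        let $mine := $d[email = $email]
        return
          <user email="{$email}"
                code="{count($mine[not(local:is-doc(filename))])}"
                documentation="{count($mine[local:is-doc(filename)])}">
            { $mine/filename }
          </user>
      }
      </group>
    };

    let $all := doc('downloads.xml')/downloads/download
    return
      <summary>
        <ibm>{ local:summarize($all[local:is-ibm(email)]) }</ibm>
        <non-ibm>{ local:summarize($all[not(local:is-ibm(email))]) }</non-ibm>
      </summary>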

It took me approximately 30 minutes to create all of the above queries. The code is approximately 20 lines long and is factored into re-usable functions that can be customized later. The code is easy for most of my peer Java developers to understand, as XQuery looks a lot like an imperative language with SQL-like queries mixed into templating of output.

My next step is to move this into a web application (so far I have only prototyped it against sample data) that connects directly to the service. The web application could easily offer web forms that let the user specify search criteria supported by the back-end service (a date range, for example) along with which XQuery/summary view is required.
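XQuery already has the hook that glue would need: external variables. Something along these lines would let a web form pass a date range straight through to the query (the variable names and the ISO-style date strings are assumptions on my part):

    (: Sketch: the front end binds these from the form; dates assumed yyyy-mm-dd. :)
    declare variable $from as xs:string external;
    declare variable $to   as xs:string external;

    let $all := doc('downloads.xml')/downloads/download[date ge $from and date le $to]
    return <summary from="{$from}" to="{$to}" total="{count($all)}"/>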

There are still some items that can't be automated and need human intervention. As an example, deciding what constitutes an IBM'er is complex, as some IBM'ers have an "ibm.com" email address and some do not. Also, the company name could appear as "IBM", "International Business Machines", or a misspelling of either. It would take a bit of time to create services that approximate what a human eye can spot manually (a first rough cut is sketched below). Adding a human-facing pop-up that allowed visual inspection of the automated data analysis steps would be valuable. Also, I admit that I didn't create the charts and graphs - just the raw data that can be loaded into Excel to create them.
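If I do automate part of it, the first cut will probably just be a slightly smarter classification function than the naive one above - something like the following, which assumes the records also carry some kind of company/organization field, and which still needs a human to review the borderline cases:

    (: Rough approximation only: these patterns are guesses, not the real rules. :)
    declare function local:looks-like-ibmer($email as xs:string,
                                            $company as xs:string) as xs:boolean {
      ends-with(lower-case($email), 'ibm.com')
        or contains(lower-case($company), 'international business machines')
        or matches(lower-case($company), '^i\.?\s*b\.?\s*m\.?$')
    };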

But in the end I have converted a manual, error-prone data processing scenario into an automated approach (for the data queries and summaries) that produces all the same valuable raw data for reports, with the potential to add more reports much more quickly. All of this was made possible using well-documented W3C standards that have all the features (some new with XPath/XSLT 2.0 and XQuery 1.0) needed to make this scenario possible.

1 comment:

thorn said...

quite impressive! xslt 2.0 rules