Thursday, April 10, 2014

From XQuery to JavaScript - MarkLogic's Bold Platform Play

San Francisco Airport would seem an odd place to change the world, though no doubt it's been the hub of any number of game changers over the years. Still, the MarkLogic World conference, held at the Waterfront Marriott just south of the iconic airport, may very well have presaged a radical change for the company and perhaps far beyond.

Over the last couple of years, MarkLogic World has become leaner but also more focused, with announcements of changes for their eponymous data server that were both unexpected and in general have proven very successful. The announcement this year was, in that light, about par for the course: For MarkLogic 8, due out later this year, the MarkLogic team was taking the ambitious step of compiling and integrating Google's V8 JavaScript engine directly into the core of the database.

In essence, the erstwhile XML application is becoming a fully functional JavaScript/JSON database. JavaScript programmers will be able to use what amounts to the core of the Node.js server to access JSON, XML, binaries and RDF from the database, run full-text, geospatial and SPARQL queries, use the robust application server and performance monitoring capabilities, and ultimately do everything within MarkLogic that XQuery developers have been able to do from nearly the first days of the product's existence.

This move was driven by a few hard realities. One of the single biggest gating factors that MarkLogic has faced has been the need to program the application interfaces in XQuery. The language is very expressive, but has really only established itself within a few XML-oriented databases, even as JavaScript and JSON databases have seen an explosion of developers, implementations, and libraries. Organizations that purchased MarkLogic found themselves struggling to find the talent to code in it, and that in turn meant that while MarkLogic has had some very impressive successes, it was beginning to gain a reputation as being too hard to program.

MarkLogic 8 will likely reverse that trend. People will be able to write query functions and modules in JavaScript, will be able to invoke JavaScript from XQuery (and vice versa), can use JavaScript dot notation as well as XPath notation, will be able to import and use community (possibly node.js compatible) JavaScript modules, and can mix XML and JSON as native types. This may very well add rocket fuel to the MarkLogic server, as it becomes one of the first to effectively manage the trifecta of XML, JSON, and RDF (and their respective languages) within the same polyglot environment.

MarkLogic CEO Gary Bloom deserves a lot of the credit for this if he can pull it off. A couple of years ago, the company was somewhat dispirited, there were a number of high profile departures, and the organization had just gone through three CEOs in two years. Bloom managed to turn around morale, adding semantic support last year (which is significantly enhanced with MarkLogic 8, see below), cutting licensing prices by 2/3, refocusing the development team, and significantly expanding the sales teams. That's paid significant dividends - there were a number of new customers in attendance this year at the SF event, which is one stop of a six-stop "world tour" that will see key management and technical gurus reach out to clients in Washington, DC, Baltimore, New York, London, and Chicago.

In addition to the JavaScript news, MarkLogic also announced that they would take the next step in completing the semantics layer of the application. This includes completion of the SPARQL 1.1 specification (including the rest of the property paths specification and aggregate operations), adoption of the SPARQL 1.1 Update facility, and inference support. While the JavaScript/JSON announcement tended to overshadow this, there is no question that MarkLogic sees semantics as a key part of its data strategy over the next few years. This particular version represents the second year of a three year effort to create an industry leading semantic triple store, and it is very likely that most of what will be in ML 9 will be ontology modeling tools, admin capabilities and advanced analytics tools.

The inferencing support is, in its own way, as much of a gamble as the JavaScript effort: an attempt to consolidate the semantics usage by publishers, news organizations, media companies, government agencies, and others that see semantic triple stores as analytics tools. This becomes even more complex given that such inferencing needs to be done quickly within the context of dynamic updates, the MarkLogic security model, and similar constraints. If they pull it off (and there's a fair amount of evidence to indicate they will), not only will MarkLogic vault to the top of the semantics market, but this may also dramatically increase RDF/SPARQL adoption in the general development community, especially given that semantics capabilities (including SPARQL) will be as available to JavaScript developers as they are to XQuery devs.
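To make the inferencing idea concrete, here is a toy sketch in plain JavaScript of the simplest kind of RDFS inference: forward-chaining over rdfs:subClassOf, so that an asserted rdf:type triple yields derived type triples for every ancestor class. To be clear, this is not MarkLogic's API - the function and the triple representation are invented purely for illustration; MarkLogic's inferencing will run inside the database itself.

```javascript
// Toy forward-chaining inference over rdfs:subClassOf triples.
// Triples are plain objects: { s: subject, p: predicate, o: object }.
function inferTypes(triples) {
  // Index the direct subclass -> superclass links.
  var superOf = {};
  triples.forEach(function (t) {
    if (t.p === "rdfs:subClassOf") {
      (superOf[t.s] = superOf[t.s] || []).push(t.o);
    }
  });
  var inferred = [];
  triples.forEach(function (t) {
    if (t.p !== "rdf:type") return;
    // Walk the subclass chain, emitting one rdf:type triple per ancestor.
    var seen = {}, stack = [t.o];
    while (stack.length) {
      var cls = stack.pop();
      (superOf[cls] || []).forEach(function (sup) {
        if (!seen[sup]) {
          seen[sup] = true;
          inferred.push({ s: t.s, p: "rdf:type", o: sup });
          stack.push(sup);
        }
      });
    }
  });
  return inferred;
}
```

Even this trivial version hints at why doing it fast under dynamic updates is hard: every asserted triple can ripple through the class hierarchy, and the derived triples must stay consistent as documents change.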

The final announcement from MarkLogic was the introduction of a bitemporal index. Bitemporality is probably not that big of an issue in most development circles, but in financial institutions, especially those that need to deal with regulatory issues, this is a very big deal. The idea behind bitemporality is that the time at which something was true in the world (valid time) may differ from the time at which the database recorded it (system time). This distinction can make a big difference for financial transactions, and may have an impact upon regulatory restrictions. Bitemporality makes it possible for a document to effectively maintain multiple date stamps, which in turn can be used to ascertain what documents are "in effect" at a given time. In a way, this makes it possible to use MarkLogic as a "time machine", rolling the database back in time to see what resources were or weren't active at that time.
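The "time machine" behavior is easy to sketch. In the toy example below, each document carries a valid-time range (when the fact held in the world) and a system-time range (when the database knew about it), and an "as of" query filters on both axes. The field names and the inEffect function are invented for illustration; MarkLogic's actual bitemporal API had not been published at the time of writing.

```javascript
// Illustrative bitemporal "as of" query: a document is in effect only if
// the requested valid time falls in its valid-time range AND the requested
// system time falls in its system-time range. Times here are plain numbers
// for simplicity; a real system would use dateTimes.
function inEffect(docs, validTime, systemTime) {
  return docs.filter(function (doc) {
    return doc.validStart <= validTime && validTime < doc.validEnd &&
           doc.systemStart <= systemTime && systemTime < doc.systemEnd;
  });
}
```

Rolling systemTime backwards is the "time machine": you see the database as it stood at that moment, before later corrections were recorded - exactly the question a regulator tends to ask.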

Will this mean that you'll see applications being developed where all of the tools of MarkLogic - from XQuery to JavaScript, semantics to SQL and XSLT - will be used together, as MarkLogic Chief Architect Jason Hunter challenged me at lunch one day of the session? Is there a use case where that even makes sense? After a lot of thought, I have to throw in the towel. There are definitely places where you may end up using SPARQL and SQL together - if you had slurped up a relational table that you wanted to preserve in its original form while working with RDF data, the case is definitely there. Any time you work with XML content, there are also often good reasons to use XSLT, whether for formatting complex XML output or doing specialized tree-walking processing (you can do that in XQuery, but XSLT is usually more intuitive there). The challenge comes in directly using XQuery and JavaScript together.

The reason for this difficulty is that XQuery and JavaScript fulfill very similar roles. For instance, suppose that you have RDF that describes an organization's sales revenue, and you want to compare sector sales in the various quarters of 2014. This can actually be handled by a single SPARQL query (salesReport.sp) that looks something like this:

    select (?sectorLabel as ?Sector)
           (?quarterLabel as ?Quarter)
           (?salesAgentLabel as ?Agent)
           (?revenue as ?Revenue)
    where {
         ?company company:identifier ?companyID.
         ?sector sector:company ?company.
         ?quarter quarter:company ?company.
         ?sector
               rdf:type ?sectorType;
               sector:establishedDate ?sectorStartDate;
               sector:reorgDate ?sectorReorgDate;
               rdfs:label ?sectorLabel.
         filter(year(?sectorStartDate) <= ?year)
         filter(year(?sectorReorgDate) > ?year)
         ?salesAgent
               rdf:type ?salesAgentType;
               salesAgent:sale ?sale;
               rdfs:label ?salesAgentLabel.
         ?sale sale:salesQuarter ?quarter;
               sale:salesSector ?sector;
               sale:revenue ?revenue.
         ?quarter rdfs:label ?quarterLabel;
               rdf:type ?quarterType.
    }
    order by ?Sector ?Quarter ?Agent desc(?Revenue)

Now, in XQuery, the report for this is fairly simple to generate:

let $companyID := "MARKLOGIC"
let $year := 2014
return
<report>{
  sparql:invoke('salesReport.sp',
      map:new((map:entry("companyID", $companyID), map:entry("year", $year)))) !
  <record>
      <sector>{map:get(., "Sector")}</sector>
      <quarter>{map:get(., "Quarter")}</quarter>
      <agent>{map:get(., "Agent")}</agent>
      <revenue>{map:get(., "Revenue")}</revenue>
  </record>
}</report>

In a (currently hypothetical) JavaScript version, it may very well end up being about as simple:

(function () {
  var companyID = "MARKLOGIC";
  var year = 2014;
  return {report:
    sparql.invoke('salesReport.sp', {companyID: companyID, year: year})
      .map(function (obj) {
        return {record: {
          sector: obj.Sector,
          quarter: obj.Quarter,
          agent: obj.Agent,
          revenue: obj.Revenue
        }};
      })
  };
})();

Note that in both cases, I've explicitly broken out the mapping to make it obvious what was happening, but the four assignments could also have been replaced by

<record>{
     let $row := .
     return map:keys($row) ! element {fn:lower-case(.)} {map:get($row, .)}
}</record>

and

function (obj, index) {
    var newObj = {};
    Object.keys(obj).forEach(function (key) {
        newObj[key.toLowerCase()] = obj[key];
    });
    return {record: newObj};
}

respectively.

The principal difference between the two implementations is the use of functional callbacks in JavaScript as opposed to the somewhat more purely declarative model of XQuery, but these differences aren't significant in practice ... and that is the crux of Jason's (correct) assertion: it's possible that you may end up wanting to invoke XQuery inline directly from JavaScript (or vice versa), but unlikely, because there is so much overlap.

On the other hand, what I do expect to see are situations where, within a development team, some people work with XQuery and others work with JavaScript, but each breaks their efforts into modules of functions that can be imported. For instance, you may have an advanced mathematics library (say, giving you "R"-like functionality, for those familiar with the statistical analysis language) that is written in JavaScript. XQuery should be able to use those functions:

import module namespace rlite = "http://www.marklogic.com/packages/r-lite" at "/MarkLogic/Packages/R-Lite.js";
let $b := rlite:bei(2,20)
return "Order 2 Bessel at x = 20 is " || $b


Similarly, JavaScript should be able to use existing libraries as well as any that are engineered in XQuery (here an admin package):

var Admin = importPackage("http://www.marklogic.com/packages/admin", "/MarkLogic/Packages/Admin.xqm");
"There are " + fn.count(Admin.servers()) + " servers currently in operation"; 


The package variable Admin holds the module's functions as methods. It may be that this gets invoked as Admin::servers() instead, depending upon the degree to which MarkLogic is going to alter the native V8 implementation in order to facilitate such packages (and to provide support for inline XML, among a host of other issues).

Ironically, one frustrating problem for MarkLogic may be their best-practice use of the dash ("-") in variable and function names. My guess is that xdmp:get-request-field() may end up getting rendered as xdmp.get_request_field() in JavaScript, but until the EA release rolls out, it will be difficult to say for sure.
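Whatever MarkLogic settles on, the mapping itself is mechanical. The two functions below (names invented for illustration, not part of any MarkLogic API) show the two likeliest conventions: dashes to underscores, or dashes folded into camelCase.

```javascript
// Hypothetical name mappings from XQuery-style dashed names to
// JavaScript-safe identifiers. Which (if either) MarkLogic adopts
// won't be known until the EA release.

// "get-request-field" -> "get_request_field"
function toUnderscoreName(xqueryName) {
  return xqueryName.replace(/-/g, "_");
}

// "get-request-field" -> "getRequestField"
function toCamelCaseName(xqueryName) {
  return xqueryName.replace(/-([a-z])/g, function (match, letter) {
    return letter.toUpperCase();
  });
}
```

The underscore form has the advantage of being trivially reversible, which matters if JavaScript code ever needs to look names back up in XQuery modules.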

However, the biggest takeaway from this is that if you're a MarkLogic XQuery developer, XQuery will continue to be supported, while if you're a JavaScript developer looking to get into MarkLogic, the ML8 implementation is definitely something you should explore.

For those of you who work with MarkLogic closely, now's the time to get involved with the Early Adopter program. Check http://developer.marklogic.com for more information (I'll update this when I know more.). MarkLogic is planning on releasing the first Early Adopter alpha within the next couple of weeks, and then should end up with a new release about once every six weeks or so (if last year's release was any indication).

Kurt Cagle (caglek@avalonconsult.com) is the Principal Evangelist for Semantics with Avalon Consulting, LLC (http://www.avalonconsult.com). He is the author of several books on web technologies and semantics, and lives in Issaquah, WA, where he can see mountain ranges in every direction he looks.