Thursday, April 10, 2014

From XQuery to JavaScript - MarkLogic's Bold Platform Play

San Francisco Airport would seem an odd place to change the world, though no doubt it has been the hub of any number of game changers over the years. Still, the MarkLogic World conference, held at the Waterfront Marriott just south of the iconic airport, may very well have presaged a radical change for the company and perhaps far beyond.

Over the last couple of years, MarkLogic World has become leaner but also more focused, with announcements of changes for their eponymous data server that were both unexpected and have, in general, proven very successful. The announcement this year was, in that light, about par for the course: for MarkLogic 8, due out later this year, the MarkLogic team was taking the ambitious step of compiling and integrating Google's V8 JavaScript engine directly into the core of the database.

In essence, the erstwhile XML database is becoming a fully functional JavaScript/JSON database. JavaScript programmers will be able to use what amounts to the core of the Node.js server to access JSON, XML, binaries and RDF from the database, run full-text, geospatial and SPARQL queries, use the robust application server and performance monitoring capabilities, and ultimately do everything within MarkLogic that XQuery developers have been able to do from nearly the first days of the product's existence.

This move was driven by a few hard realities. One of the single biggest gating factors that MarkLogic has faced has been the need to program its application interfaces in XQuery. The language is very expressive, but has really only established itself within a few XML-oriented databases, even as JavaScript and JSON databases have seen an explosion of developers, implementations, and libraries. Organizations that purchased MarkLogic found themselves struggling to find the talent to code against it, and that in turn meant that while MarkLogic has had some very impressive successes, it was beginning to gain a reputation as being too hard to program.

MarkLogic 8 will likely reverse that trend. People will be able to write query functions and modules in JavaScript, will be able to invoke JavaScript from XQuery (and vice versa), can use JavaScript dot notation as well as XPath notation, will be able to import and use community (possibly Node.js-compatible) JavaScript modules, and can mix XML and JSON as native types. This may very well add rocket fuel to the MarkLogic server, as it becomes one of the first to effectively manage the trifecta of XML, JSON, and RDF (and their respective languages) within the same polyglot environment.

MarkLogic CEO Gary Bloom deserves a lot of the credit for this if he can pull it off. A couple of years ago, the company was somewhat dispirited, there were a number of high profile departures, and the organization had just gone through three CEOs in two years. Bloom managed to turn around morale, adding semantic support last year (which is significantly enhanced with MarkLogic 8, see below), cutting licensing prices by two-thirds, refocusing the development team, and significantly expanding the sales teams. That's paid significant dividends - there were a number of new customers in attendance this year at the SF event, which is one stop of a six-stop "world tour" that will see key management and technical gurus reach out to clients in Washington, DC, Baltimore, New York, London, and Chicago.

In addition to the JavaScript news, MarkLogic also announced that they would take the next step in completing the semantic layer of the product. This includes completing support for the SPARQL 1.1 specification (including the rest of the property paths specification and aggregate operations), adopting the SPARQL 1.1 Update facility, and adding inference support. While the JavaScript/JSON announcement tended to overshadow this, there is no question that MarkLogic sees semantics as a key part of its data strategy over the next few years. This particular version represents the second year of a three-year effort to create an industry-leading semantic triple store, and it is very likely that most of what will be in ML 9 will be ontology modeling tools, admin capabilities and advanced analytics tools.

The inferencing support is, in its own way, as much of a gamble as the JavaScript effort: an attempt to consolidate the semantics usage of publishers, news organizations, media companies, government agencies, and others that see semantic triple stores as analytics tools. This becomes even more complex given that such inferencing needs to be done quickly within the context of dynamic updates, the MarkLogic security model and similar constraints. If they pull it off (and there's a fair amount of evidence to indicate they will), not only will MarkLogic vault to the top of the semantics market, but this may also dramatically increase RDF/SPARQL adoption in the general development community, especially given that semantics capabilities (including SPARQL) will be as available to JavaScript developers as they are to XQuery developers.

The final announcement from MarkLogic was the introduction of a bitemporal index. Bitemporality is probably not that big of an issue in most development circles, but in financial institutions, especially those that need to deal with regulatory issues, this is a very big deal. The idea behind bitemporality is that the time at which a document is entered into the database may differ from the time at which the information it describes was actually in effect. This distinction can make a big difference for financial transactions, and may have an impact upon regulatory restrictions. Bitemporality makes it possible for a document to effectively maintain multiple date stamps, which in turn can be used to ascertain which documents were "in effect" at a given time. In a way, this makes it possible to use MarkLogic as a "time machine", rolling the database back in time to see what resources were or weren't active at that point.
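
To make the "time machine" idea concrete, here is a plain JavaScript toy of my own (the document shape and field names are invented for illustration, not MarkLogic's announced API) showing how a pair of time intervals per document lets you ask what was in effect at one moment, as the database understood the world at another:

// Toy illustration of bitemporality, not MarkLogic code. Each record carries
// a "valid" interval (when its contents were in effect in the real world) and
// a "system" interval (when the database believed that version to be current).
var trades = [
  { id: "T-1", symbol: "MLGC", qty: 500,
    validStart:  "2014-01-15", validEnd:  "2014-02-01",
    systemStart: "2014-01-17", systemEnd: "2014-01-25" },  // superseded below
  { id: "T-1", symbol: "MLGC", qty: 450,                   // corrected quantity
    validStart:  "2014-01-15", validEnd:  "2014-02-01",
    systemStart: "2014-01-25", systemEnd: "9999-12-31" }
];

// "Time machine" query: which trades were in effect on validAt, according to
// what the database believed as of systemAt? (ISO dates compare as strings.)
function asOf(docs, validAt, systemAt) {
  return docs.filter(function (doc) {
    return doc.validStart <= validAt && validAt < doc.validEnd &&
           doc.systemStart <= systemAt && systemAt < doc.systemEnd;
  });
}

console.log(asOf(trades, "2014-01-20", "2014-01-18")); // the original 500-share record
console.log(asOf(trades, "2014-01-20", "2014-01-26")); // only the corrected 450-share record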

Will all of the tools in MarkLogic - from XQuery to JavaScript, semantics to SQL and XSLT - end up being used together to build applications, as MarkLogic Chief Architect Jason Hunter challenged me at lunch one day during the session? Is there a use case where that even makes sense? After a lot of thought, I have to throw in the towel. There are definitely places where you may end up using SPARQL and SQL together - if you had slurped up a relational table that you wanted to preserve in its original form while working with RDF data, the case is definitely there - and any time you work with XML content there are often good reasons to use XSLT, for formatting complex XML output or doing specialized tree-walking processing (you can do that in XQuery, but XSLT is usually more intuitive there). The challenge comes in directly using XQuery and JavaScript together.

The reason for this difficulty is that XQuery and JavaScript fulfill very similar roles. For instance, suppose that you have RDF that describes an organization's sales revenue, and you want to compare sector sales in the various quarters of 2014. This can actually be handled by a single SPARQL query that looks something like this (salesReport.sp):

    select (?sectorLabel as ?Sector)
           (?quarterLabel as ?Quarter)
           (?salesAgentLabel as ?Agent)
           (?revenue as ?Revenue)
    where {
         ?company company:identifier ?companyID.
         ?sector sector:company ?company.
         ?quarter quarter:company ?company.
         ?sector
               rdf:type ?sectorType;
               sector:establishedDate ?sectorStartDate;
               sector:reorgDate ?sectorReorgDate;
               rdfs:label ?sectorLabel.
            filter(year(?sectorStartDate) <= ?year)
            filter(year(?sectorReorgDate) > ?year)
         ?salesAgent
               rdf:type ?salesAgentType;
               salesAgent:sale ?sale;
               rdfs:label ?salesAgentLabel.
         ?sale sale:salesQuarter ?quarter;
               sale:salesSector ?sector;
               sale:revenue ?revenue.
         ?quarter rdfs:label ?quarterLabel;
               rdf:type ?quarterType.
    } order by ?sectorLabel ?quarterLabel ?salesAgentLabel desc(?revenue)

Now, in XQuery, the report for this is fairly simple to generate:

let $companyID := "MARKLOGIC"
let $year := 2014
return
<report>{
 sparql:invoke('salesReport.sp', map:new((map:entry("companyID", $companyID), map:entry("year", $year)))) !
 <record>
      <sector>{map:get(., "Sector")}</sector>
      <quarter>{map:get(., "Quarter")}</quarter>
      <agent>{map:get(., "Agent")}</agent>
      <revenue>{map:get(., "Revenue")}</revenue>
 </record>
}</report>

In a (currently hypothetical) JavaScript version, it may very well end up being about as simple:

(function () {
  var companyID = "MARKLOGIC";
  var year = 2014;
  // Assumes sparql.invoke returns an array of result rows keyed by the
  // selected variable names.
  return {report: sparql.invoke('salesReport.sp', {companyID: companyID, year: year})
    .map(function (obj, index) {
      return {record: {
        sector: obj.Sector,
        quarter: obj.Quarter,
        agent: obj.Agent,
        revenue: obj.Revenue
      }};
    })};
})();

Note that in both cases, I've explicitly broken out the mapping to make it obvious what was happening, but the four assignments could also have been replaced by

<record>{
     let $row := . return
       map:keys($row) ! element {fn:lower-case(.)} {map:get($row, .)}
}</record>

and

function (obj, index) {
    var newObj = {};
    Object.keys(obj).forEach(function (key) {
        newObj[key.toLowerCase()] = obj[key];
    });
    return {record: newObj};
}

respectively.

The principal difference between the two implementations is the use of functional callbacks in JavaScript as opposed to the somewhat more purely declarative model of XQuery, but these differences aren't significant in practice ... and that is the crux of Jason's (correct) assertion: it's possible that you may end up wanting to invoke XQuery inline directly from JavaScript (or vice versa), but it's unlikely, because there is so much overlap.

On the other hand, what I do expect to see is situations where, within a development team, some people work with XQuery and others work with JavaScript, but each group breaks its efforts into modules of functions that can be imported. For instance, you may have an advanced mathematics library (say, giving you "R"-like functionality, for those familiar with the hot new statistical analysis language) written in JavaScript. XQuery should be able to use those functions:

import module namespace rlite = "http://www.marklogic.com/packages/r-lite" at "/MarkLogic/Packages/R-Lite.js";
let $b := rlite:bei(2,20)
return "Order 2 Bessel at x = 20 is " || $b


Similarly, JavaScript should be able to use existing libraries as well as any that are engineered in XQuery (here, an admin package):

var Admin = importPackage("http://www.marklogic.com/packages/admin", "/MarkLogic/Packages/Admin.xqm");
"There are " + fn.count(Admin.servers()) + " servers currently in operation"; 


The package variable Admin holds methods. It may be that this gets invoked as Admin::servers(), depending upon the degree to which MarkLogic is going to alter the native V8 implementation in order to facilitate such packages (and provide support for inline XML, among a host of other issues).

Ironically, one frustrating problem for MarkLogic may be its best-practice use of the dash ("-") in variable and function names. My guess is that xdmp:get-request-field() may end up getting rendered as xdmp.get_request_field() in JavaScript, but until the EA rolls out, it will be difficult to say for sure.

However, the biggest takeaway from this is that if you're a MarkLogic XQuery developer, XQuery will continue to be supported, while if you're a JavaScript developer looking to get into MarkLogic, the ML8 implementation is definitely something you should explore.

For those of you who work with MarkLogic closely, now's the time to get involved with the Early Adopter program. Check http://developer.marklogic.com for more information (I'll update this when I know more.). MarkLogic is planning on releasing the first Early Adopter alpha within the next couple of weeks, and then should end up with a new release about once every six weeks or so (if last year's release was any indication).

Kurt Cagle (caglek@avalonconsult.com) is the Principal Evangelist for Semantics with Avalon Consulting, LLC (http://www.avalonconsult.com). He is the author of several books on web technologies and semantics, and lives in Issaquah, WA, where he can see mountain ranges in every direction he looks.

Sunday, January 5, 2014

Through a Glass Darkly: Technology Trends for 2014

Once again a few months have passed since I last posted, partially because I've posted a number of technical commentaries elsewhere, and partially because things have just been fairly busy in the last few months. I'll be reposting the work I've done elsewhere here shortly, but in this post, I wanted to focus on my end-of-year thoughts for 2013 and what I'm expecting to see in 2014.

First, this has been a year for endings and transitions. My mother and my grandmother on my father's side died within a month of one another, my mother to multiple myeloma, my grandmother to Alzheimer's. That, more than anything, greatly colored the rest of the year for me. We moved back to the Pacific Northwest, which, while it's had a few drawbacks, was for me a necessary move, and there are few days when I breathe in the cool fir- and spruce-scented air and think that I made the wrong choice coming back - this is home for me.
Also, a bit of a warning on this. I write these analyses partially for my own benefit, as it helps me to focus on what looks like the areas to explore in more depth over the next year or two. As such, these tend to be long. I'll try to put together an abstract at the beginning of each section, so if something is not of interest, just skip over it. Also, these aren't necessarily predictions (though I do make some of those here as well) - I can tell you what I think might be bubbling to the surface and give you some interesting things to think on, but your mileage may vary.

On to the analysis:

1. Business Intelligence Becomes Big Data Analytics


Both BI and Big Data are definitely buzzwords, but at their core there's still very much a problem domain to be solved. Not surprisingly, for these two terms, the core problem is "What is the state of my business?". Business Intelligence systems tend to be internalized systems, and are built primarily upon understanding the parameters that drive that business and understanding how the business compares with other businesses.
Yet even given this, there are some very real differences between the two. BI's focus has traditionally been context sensitive - what are the internal drivers within a company, and the external factors relating it to other companies, that either make it competitive in the marketplace or reduce its ability to compete? Big Data Analytics, on the other hand, is increasingly oriented around establishing a landscape showing the nature of the business that a company is involved in. This is a more holistic approach to working with data, albeit one that requires considerably more data to accomplish. This points to another difference, one having to do with the nature of data processing.

Business Intelligence systems were heavily integrated around Data Warehousing solutions, and generally assumed a homogeneity of data formats (typically SQL relational datasets configured into OLAP or similar data cubes) that was usually easy to acquire from internal business data systems. Master Data Management has become perhaps the natural outgrowth of this, as an MDM solution became increasingly necessary to coordinate the various data stores.
One of the central differences that has arisen in the Big Data space is that the information involved has not, in general, been relational, but has instead been what many people call semi-structured, though I think of it as super-structured: XML and JSON data feeds, where schemas are externally defined and advisory rather than internally defined and structural. This data can exist in data repositories (this is what most NoSQL databases are), but it also exists as data messages - data in transit. This kind of data is typically not structured with the intent of being utilized by a single organization or department, but instead represents content outside of that organization, which means that it needs to be processed and mined for value beyond its original purpose. A shift in how we think about resource identifiers in a global space is happening at the same time, and as this information moves out of its safe database boundaries, identity management becomes a growing concern.
This means that Big Data Analytics is actually looking at storing less information long term, with more information stored for just long enough to parse, process and encode. This is one of the roles of distributed applications such as Hadoop: providing a set of filters that extract from the data streams enough information to be useful to multiple processes while removing the data that doesn't have long-term value. Such analytics tools are also increasingly reactive in nature, as the resulting data instances effectively define a declarative model. This puts a premium on effective data modeling, but also requires a data model architecture that follows an open-world assumption, one in which it is assumed that there may always be more information about a given resource entity than any one process currently has.
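
As a toy illustration of that filtering idea (this is not Hadoop code; the field names and the "keep list" are invented for the example), a per-record filter in a map-style pipeline might look something like this:

// Keep only the fields with long-term analytical value; discard the rest.
var keepFields = ["customerId", "sku", "timestamp", "amount"];

function distill(record) {
  var kept = {};
  keepFields.forEach(function (field) {
    if (field in record) { kept[field] = record[field]; }
  });
  return kept;
}

// A mapper in a map/reduce job would apply this to each incoming record:
var incoming = [
  { customerId: 17, sku: "A-100", timestamp: "2014-01-03T10:22:00Z",
    amount: 42.5, sessionCookie: "xyzzy", userAgent: "Mozilla/5.0" }
];
var distilled = incoming.map(distill);
// distilled now holds only the four fields worth persisting long term.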

This brings two radical changes to the world of data analytics. The first is the tacit realization that any such models are ultimately stochastic rather than deterministic - you can have no guarantee that the information that you have is either complete or completely accurate. While this may seem like a common-sense assumption, in practice the assumption that you cannot fully know your data model was a difficult one to accommodate in a relational context.

The second change is in who does the analysis. What used to be BI tools run by Excel-savvy business analysts are increasingly becoming modeling and mathematics suites used primarily by people with mathematics and statistics backgrounds. MatLab, long a staple of college mathematics departments, is now appearing as a must-have skill for the would-be BI analyst, and R, a mathematics-heavy interpreted language (it includes things like Bessel functions as "core" functions), is similarly in surprisingly big demand. A couple of decades ago, getting a bachelor's or even a master's degree in mathematics was considered a career-killing move unless it was in support of a computer science degree; now it's one of the most in-demand technical degrees you can get, with beginning analysts getting starting salaries in the mid-90K range.

The reason for this high-powered math background is simple - there is a growing realization that the business world is systemic and complex, and that the traditionally available set of tools can neither analyze it nor build models from that analysis. BI tools attempted to encode information in formal rules; Big Data Analytics takes that several steps further and attempts to understand why the rules that do work actually work, while weeding out false positives.

2. Semantics Goes Mainstream

I've felt for a number of years (starting about 2010) that semantics would hit the mainstream in the 2013-2014 window. There is a great deal of evidence that this is in fact now happening. Many governments are now producing their data in RDF and making available SPARQL endpoints for accessing that data. You're beginning to see the merger of asset management and master data management systems with semantics in order to more effectively track the connections between resources and resource types. Semantics are used for doing profile matching and recommendations, are smoothing out complex and byzantine services infrastructures and are increasingly the foundation on which organizations are getting a better sense of the data world in which they operate.

One of my favorite NoSQL databases, MarkLogic, just released a semantics engine as part of its 7.0 server. There are at least three different initiatives underway that I'm aware of to provide a semantic interface layer to Hadoop, and there are several libraries available for Node.js that perform at least some semantic operations with RDF triples. RDFa has become part of the HTML5 specification family, and is beginning to find its way into adoption within the ePub specification efforts.

Most significantly, the SPARQL 1.1 specification, which includes the SPARQL UPDATE capability that provides a standardized interface for updating RDF triple stores, as well as a number of much-needed functions and capabilities for SPARQL, became a formal recommendation in March 2013. UPDATE is a major deal - with it standardized, SPARQL can now be used not only to retrieve content but to create it within a triple database. To give an idea of how important this is, it's worth thinking back to the 1989 time frame when SQL standardized its own update facility, which turned what had been, up until then, a modestly utilized query language used by a few large vendors (each with their own way of updating content) into the most widely used query language in the world.

What does this mean in practice? SPARQL does something that no other query language does. It provides a standardized way of querying a data set, joining "tables" within those data sets, and retrieving that content in a variety of standard serializations - RDF Turtle, RDF/XML, JSON, CSV text and more. With an XSLT transformation (or XQuery), it should be relatively easy to generate modern Excel or ODF spreadsheets, to transform information into Atom feeds, or to produce output in other, more esoteric formats. The keyword here is "standard" - your SPARQL skills are directly transferable from one data system to another, and for the types of operations involved, even the mechanism for creating new language extensions is standardized. Information can be transported over SOAP interfaces or RESTful ones. In other words, SPARQL is the closest thing to a virtualized data interface that exists on the planet.
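
To make that concrete, here is a minimal Node.js sketch of my own (not from any vendor documentation) that sends a SPARQL query to an endpoint over plain HTTP and asks for the standardized JSON results format via the Accept header; DBpedia's public endpoint stands in for whatever endpoint you happen to have available:

// Query a SPARQL endpoint over HTTP and request SPARQL JSON results.
var http = require("http");

var query = [
  "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
  "SELECT ?label WHERE {",
  "  <http://dbpedia.org/resource/SPARQL> rdfs:label ?label .",
  "} LIMIT 5"
].join("\n");

http.get({
  host: "dbpedia.org",
  path: "/sparql?query=" + encodeURIComponent(query),
  headers: { "Accept": "application/sparql-results+json" }
}, function (res) {
  var body = "";
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () {
    // The same query could just as easily come back as CSV or XML
    // simply by changing the Accept header.
    JSON.parse(body).results.bindings.forEach(function (row) {
      console.log(row.label.value);
    });
  });
});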

So what of SQL? SQL has a few major limitations. It was developed prior to the advent of the web, and as such there has been no standardization for data transport between systems; SQL ETL is still an incredibly painful topic, as the way that you transport SQL data is to generate SQL UPDATE statements, which means that moving content between different SQL systems is still not completely reliable. Within relational databases, data is tied together with index keys, which do not do well as global identifiers. SQL federation is possible across dedicated systems, but is still far from easy, and certainly not consistent across heterogeneous systems.

These limitations are all resolved in SPARQL. It was designed for global identifiers and federated queries. ETL can be accomplished across a number of different protocols in SPARQL. Best of all, putting a SPARQL overlay on top of a relational database is surprisingly easy.

Okay, I'm moving away from analysis and into evangelism here, but I think that all of these reasons will make SPARQL an attractive proposition as people begin seeking pure data virtualization. Concurrent with this, I expect that semantics-related skills - data modeling and MDM integration, for instance - will become a big part of the toolsets that organizations will come to expect in their developers and architects.

3. Data Virtualization and Hybrid Databases

When I'm working with clients and potential clients, I listen for unfamiliar buzzwords, because these often represent saplings that were first seeded by the big analyst firms such as Gartner, Forrester, etc. Data virtualization must have come out recently, because just in the last few months it has suddenly become the next major "must-have" technological idea.

Data Virtualization in a nutshell is the idea that you as a data consumer do not need to know either the technical details about the internal representation or the specific physical infrastructure of the data storage system. It is, frankly, a marketing invention, because data structures (the metadata) are an intrinsic part of any data but, that point aside, it is also something that is becoming a reality.

My interpretation of data virtualization may vary a little bit from the marketing one, but can be summed up as follows:

Data is "virtualized" if it is resource oriented, query based on requests, and agnostic on both inbound and outbound serialization formats. What this means is that I can send to an address a CSV file, a JSON stream, an XML document file, a word document and so forth and the system will be able to store this content. I can search it through a search query language, searching either for words in free text or fields (or facets, which are specialized kinds of fields), retrieving back a set of address links and snippets that will identify these resources to a consumer, and selection of an address will allow the user to retrieve that content in a broad panoply of output formats. I can also query the address to retrieve a logical schema that will allow me to manipulate the information in greater detail, meaning that information about the logical schematics are discoverable.

In other words, it bears a great deal of resemblance to what I consider RESTful services.

One critical aspect of RESTfulness is the idea that the internal representation of a resource entity is irrelevant to its external representation, so long as that internal representation can be reliably transformed to the external one and back.
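
A rough sketch of what such an interaction might look like from JavaScript follows; the host, paths and parameter names here are hypothetical, invented purely to illustrate the pattern:

// Hypothetical "virtualized" data endpoint: store a document, then search it
// and negotiate the output format. None of these URLs are real services.
var http = require("http");

// 1. Store a document; the same address would accept JSON, XML, CSV, etc.
var doc = JSON.stringify({ title: "Q1 Sales Report", region: "EMEA" });
var put = http.request({
  host: "data.example.com",
  path: "/resources/sales/q1-2014",
  method: "PUT",
  headers: { "Content-Type": "application/json" }
}, function (res) { console.log("stored:", res.statusCode); });
put.end(doc);

// 2. Search, and ask for the results serialized as XML this time; how the
//    store represents the document internally is not the consumer's concern.
http.get({
  host: "data.example.com",
  path: "/search?q=" + encodeURIComponent("region:EMEA"),
  headers: { "Accept": "application/xml" }
}, function (res) {
  var body = "";
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () { console.log(body); });
});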

To facilitate this process, databases are beginning to converge on a common design, one in which the data is broken into a combination of a graph of nodes describing atomized content, pointers connecting those nodes, and indexes mapping keys to them. These are usually combined with in-memory sequential structures for working with sets of nodes.

This combination of capabilities is unusual in that with it you can model every other kind of database: relational, XML, JSON, columnar, graph and semantic. It can model file systems. In effect such systems are universal. The principal downside with these kinds of systems is that they are not as performant as dedicated systems, though they are getting closer.

A good analogy here is the distinction between compiled and interpreted languages. All computer languages are ultimately translated into machine language instructions. Early on, developers would hand-optimize this code. Eventually, computers took over this task, first in software, then in firmware. Compilation could take hours, but over time, those hours became minutes, then seconds. In the last few years, interpreted code has taken advantage of these optimizations, and such languages are beginning to rival compiled languages for speed.
This is happening at the database level as well, and what had been a hopelessly suboptimal approach to building and storing graph-based data systems is now feasible, especially with in-memory systems. This has given rise to a number of hybrid data systems that can not only be configured to handle different data format designs, but can increasingly recognize these on input and self-configure to process them.

There are a few systems like this now. MarkLogic, mentioned previously, comes very close to being this kind of a system, though it takes some initial work to get it there. The Apache JackRabbit project is built around a similar architecture, though again it takes a fair amount of modification to get it to that point. A number of semantic and graph triple stores are similarly close to this model, Virtuoso being a good example there. I was also very taken with the FluidOps (http://fluidops.com) system, which combined a number of data virtualization features with a solid administrative interface.

I expect in 2014 and 2015 that this field will explode as interest in data virtualization climbs, and as the current contenders for data virtualization, mainly master data management systems, begin to run into issues with scaling. Indeed, my expectation is that both Digital Asset Management Systems and Master Data Management will likely end up going down this path, primarily because both are dealing with metadata and its integration across a potentially large number of systems within the typical enterprise. I'll make the argument that Metadata Management Systems will likely be built on such hybrid systems to take advantage of both the sophisticated search capabilities of text based search systems and the relational nature of metadata in everything from publishing to media production to retail.

4. Big Data vs. Big Brother

Nearly a decade before Edward Snowden walked off with a fairly damning number of documents from the National Security Agency, there was an admiral named John Poindexter, who was tasked by George Bush with creating a program for the Defense Department called Total Information Awareness (TIA). The idea behind this program was to join surveillance of information channels, from phone calls to emails to the Internet, and then mix these with current information technologies to detect both foreign and domestic terrorists. The program was leaked to the public, and Congress quickly defunded TIA in 2003. However, most of the programs under it were rolled back into the black NSA budget, and nobody had any idea that, even though the organization with the all-seeing eye was gone, its mission continued pretty much unimpeded until Snowden's revelations in 2013. (A good overview of this can be found at http://en.wikipedia.org/wiki/Information_Awareness_Office.)
These revelations have done a huge amount of damage to the notion of the surveillance state as a positive force, and have also begun raising larger questions about the ethics inherent in Big Data, questions that will likely spill over into the IT sphere during 2014. Already, foreign concerns about NSA tracer technology built into US electronics are having a significant negative impact on sales of laptops, tablets, and smart phones manufactured in the US. Google, Microsoft and Yahoo have all been taken to task for the degree to which the NSA has been able to hack their operations; within Google there has recently been a shift towards seeing the NSA as a hostile agent, with internal protocols changed accordingly, and the chiefs of a number of technology companies traveled to the White House to make their displeasure with the current situation clear.

At the same time, the potential of Big Data to become an invasive force regardless of the gatekeepers is also being raised to the level of public discourse in a way that hasn't happened before. It is very likely that this will spill into 2014, as that anger becomes more focused. It is of course worth stating that finding information about people is in fact absurdly simple, and the sharing or selling of data and the increasing ease of building and supplying services only lowers these barriers even more. Indeed, the challenge in most cases is scrubbing the data to remove irrelevant information, not acquiring the data in the first place.

My expectation here is that pressure from Europe (which has already made the use of non-consensual cookies illegal), a growing lobby on both the Right and the Left that sees such surveillance as violating a person's reasonable right to privacy, and more high-profile cases such as the theft of millions of credit card records from the retailer Target will eventually resolve into new legislation intended to ensure that financial instruments are more secure. Longer term, this may spur the development of chipped, near-field personal data identifiers that will ultimately replace credit cards altogether, worn as rings, jewelry or glasses. Indeed, when we talk about the next generation of wearables, I think these context-setting solutions will in fact be a far bigger consideration than Google Glass.

5. The Google Ecosystem Converges

On that front, this year was a rough one for Google. The aforementioned NSA scandals dented its reputation, though it also acted swiftly to shore up trap doors and put itself on a more neutral footing vis-à-vis surveillance. Google Glass debuted, and quickly became synonymous with Geek Excess and Gargoylism (see Neal Stephenson's Snow Crash). There were several periods where critical services were down for hours at a time, and Larry Page's tenure as CEO has been turbulent and controversial since Eric Schmidt stepped down in 2011.


And yet ... Google is very rapidly coming to be in the 2010s what Microsoft was in the 1990s. Android devices outnumber Apple devices in use by a significant margin, with Microsoft a distant third. Two of the hottest selling products on Amazon this Christmas season were Google devices - a powerful, inexpensive HDMI streamer, and the Chromebook. Indeed, it's worth discussing each of these along with Google Glass to get an idea of where big data is truly heading - and what the landscape is going to increasingly look like over the next few years.

Chromebooks were roundly derided when they first debuted. They needed an Internet connection to do anything meaningful. You couldn't run many apps that people consider "standard". They were cheap to make and cheap to own, without the bells and whistles of more powerful computers. They could only run web games.

And yet - Google powered Chromebooks outsold all other "computers" (and nearly all tablets). The pundits were stymied, looking like grinches trying to figure out why Christmas came to Whoville. They sold because $199 is a magic price point. You give a kid a $199 Chromebook for school, because at that point, if the computer gets broken or stolen, it's easy to replace. You buy them for airplane trips, because if you leave one in the seat-pocket of the chair in front of you, they are easy to replace. You scatter them about the house and garage, because if the dog gets a hold of one and decides to use it as a chew toy ... yeah, you get the picture.

They're not intended to replace your work-horse laptops. They're a second screen, but one with a keyboard, enough peripheral ports to be useful, and immense portability. It's easy to write apps for them - easier, in fact, than it is for smartphones - but many of those same apps are sufficiently powerful to make them compelling. Even the "needs the Internet" limitation is not as big of an issue, since Google Office (and increasingly other applications) are available in offline modes. Combine that with nearly seamless cloud access, and what happens is that people tend to drift into the Google cloud without realizing that they are, and generally once there they stay. This is, of course, bad news for Microsoft in particular, who have always relied upon the ubiquity of the Microsoft platform to keep people on the Microsoft platform.

For the record, I have a Chromebook as a second screen. Sometimes, if my youngest daughter is with me, I'll work on my laptop while she works on the Chromebook. Other times, when she's wanting to play games, we swap out. That this is as simple a process as logging out and back in on the other machine is a testament to how subtle and powerful this convergence really is - we've now reached a stage where context resides in the cloud, and this relationship will only become stronger with time.

The other must have Google device this season was Chromecast, which looks like an oversized key with an HDMI port instead of a toothed surface that fits in the back of your television(s). It costs $35. It is giving cable company execs nightmares.

What this device does, like so much of what Google does, is deceptively simple. It connects to both the HDMI port of a TV (any TV, not a specialized one) and to a wifi signal. Any other connected device within range of that wifi signal can, with the right app, both instruct Chromecast to pull a stream off the Internet itself (for example, from Hulu Plus) and route content from a laptop or smart phone directly to the TV. Now, wireless streaming to the TV has been around for the last year, and everyone has been trying to build themselves up as content providers - but what this does is drop this capability to a point where it is essentially free. (Notice a trend developing here?)

To me, this completes the loop between big screens and small screens, and also drives a huge potential hole in the revenue models of companies such as Time Warner and Comcast, which basically have used a subscription model for access to media content. You still need the Internet, but with wireless Internet becoming increasingly common, and with companies (such as Google) that are exploring ways to make wireless more ubiquitous, this has to be a major concern for both content producers and content delivery companies. Given that Google also hosts YouTube and has been quietly introducing its own monetization model for more sophisticated or higher quality media products, it's not hard to see where this is going.

It's also worth noting that there is also an API involved here, meaning that application developers can intercept the data streams and provide both media post-processing and analysis on those streams. Expect media metadata to become a fairly critical part of such streams over time.
One final bit of Google future-casting. Google Glass has not been an unalloyed success story. This is not because of the capabilities, which, while somewhat on the light side, have nonetheless performed moderately well. Instead, the reception to Google Glass users has been ironically very much foreshadowed in the previously mentioned Snow Crash. People have reacted viscerally (and angrily) when they realized that someone was wearing a pair of the $1500 goggles. Many bars and restaurants have banned their use. People are not allowed to wear them into court, into theaters, or in workplaces. They have broken up relationships. Yet in many respects such glasses are not that much different from most modern smart phones (and indeed are usually powered by those same smart phones). So why the hostility?
My supposition here is that there is an uncertainty aspect that plays into this. With a phone, it's usually evident that someone is taking pictures or videos - they are holding up the phone in a certain way, their focus is clearly on the device, and subtle, or not so subtle, social cues can be used to indicate that such recording is considered to be in bad taste or even is illegal. With glasses, you don't know. Are you being recorded without your consent? Is everything you say or do going to broadcast to a million people on the Internet? Is that person intently focusing on a test, or looking up the answers online? Is this person preparing to blackmail you?
A simple form factor change may resolve some of this. Having a light or icon that goes on with the glasses when it is "live", one easily viewable and understandable by people on the other side of the lens, might make it easier for people to adjust their behavior in the presence of someone with these on. Professional shades may not necessarily have these - if you are a police officer or firefighter, for instance, the presumption should be that any pair of glasses you wear may be live, but consumer versions most likely would need a "social light".

Given all this, will Google Glass be a bust? No. Not even close. When you talk to people who have used Glass, one of the things that emerges is that, even in what is quite obviously an alpha project, this technology changes the way they think about computing, radically and not necessarily in obvious ways. In many respects, Glass may be the emergence of peripheral or subconscious programming - providing just enough hints in the periphery of your vision to be aware of metaphorical overloading of content, through such means as gradients, color shifts, the presence of icons or arrows and so forth.
It's not necessarily as effective at providing data directly as it is at providing "color" that provides a heat map interpretation of that data. Navigation benefits from this, of course, but so do things such as determining time to completion, providing alarms and notifications, indicating areas that may be potentially dangerous, highlighting the significance of objects, and of course, tracking a history of interest and extracting textual content from the environment, coupled with the time envelope that this content was discovered. In effect, something like this makes human sensory information part of the data stream that establishes and persists context.

My suspicion is that the Google Glass may go off market for a bit, but will reappear as a specification for other manufacturers, another part of the Google Android OS, and that such glasses will come back with a vengeance towards the end of 2014 from a host of vendors, just in time for the 2014-15 Christmas season, priced at perhaps $299 or thereabouts. (It's been my belief for some time that the current $1500 price tag for Glass was primarily to ensure a real commitment to test them and give meaningful feedback, not necessarily to recoup any costs involved.) Google is literally selling eyeballs at this stage, and the cost of the glasses should not be a significant impediment to their adoption.
I don't normally do company analysis until after I get through the top five factors, but I see what Google is doing as having a huge impact on data flows and the data flow industry. It will dramatically increase sensory contextual information, will make data access screens more universal while at the same time radically increasing the importance of generalized back end data tiers, and will have a marked impact upon creation and monetization of both content and the metadata associated with that content.

6. Government Changes How It Does IT

I was involved early on with the Affordable Care Act, or ACA, though I had to drop out from it due to the family issues discussed at the beginning of this report. I have my own observations and horror stories about it, but because this is still a live (and evolving) project, I cannot say that much about it in this venue.

However, it is very likely that the ACA (also known as ObamaCare) will become a case study that will be closely picked over for years, and will serve both as a cautionary tale and, ironically, as a tale that will likely turn out successfully in the long run. It's worth mentioning that Medicare, when it first debuted, faced similar technical and administrative snafus, yet went on to become one of the most successful (and popular) government-sponsored programs in history.

One of the key changes that will come out of this process, however, is a long overdue evaluation of how technical programs are managed, how people are hired (or contracted), and how barriers are erected by existing companies and administrators that make it harder for all but the largest companies to compete on projects. As I publish this, the White House has announced a report and set of recommendations that will shape government procurement for decades.

The first change comes with the realization that bigger is not necessarily better. Most federal procurement for IT tends to take the tack that all problems are big and require lots of programmers, program managers, designers, administrators and the like to complete the project. Large scale resource vendors in particular love this approach, because the more warm bodies they can put into chairs, the more they make (and the more they can mark up those rates). Politicians like it as well, because they're creating jobs, especially if the contractors are from their state.

However, there's an interesting realization that has been known in industry for a while, but is really just beginning to percolate at the government level - that there is a point of diminishing returns with putting people on projects, to the extent that with a sufficiently large workforce adding people actually adds to the time and cost of completing a project.

This has to do with an interesting process that occurs in networks. A network consists of nodes and links between nodes. Each link, each connection, adds to the overall energy cost of the network. That's one reason that organizations have a tendency to fold up into hierarchies - if you have one manager handling communication (and synchronization) between ten people, then you replace 45 connections with 10 (one each between the manager and each of his or her reports). However, this means that between each pair of subordinates there are now two links instead of one, and the quality of those links diminishes because the manager has to process and retransmit the signal. For a small team, this is not as critical, both because there are usually links that also exist between the various members of that team and because the manager typically tends to be fairly familiar with the domain of complexity that each programmer or other agent deals with.

When the number of nodes within that network grows, however, this becomes less and less true. Add another team and another manager, and at a minimum you increase the path between two people to three links (Programmer A to Manager A' to Manager B' to Programmer B), and the signal attenuates even more, requiring retransmission and incurring higher error costs. The amount of information that Programmer A has about what Programmer B is doing also drops. You have to develop a more sophisticated infrastructure to manage communications, which perforce involves adding more people, which ... well, you get the idea. This is why command and control structures usually work quite well for known, clearly modularized and delineated tasks with rigid actions, and usually work horribly for developing software - the cost to maintain project cohesiveness consumes ever larger portions of the total budget. (There's a training factor that comes into play as well, but it follows this same principle - when you're bringing on people too quickly, you end up spending more time getting them to a consistent level than you do getting productive work accomplished.)
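
To put rough numbers on that network argument (a back-of-the-envelope calculation, nothing more):

// A fully connected team of n people has n*(n-1)/2 direct links; a
// hub-and-spoke hierarchy trades those for n links through a manager,
// at the cost of longer, lossier paths between any two workers.
function pairwiseLinks(n) {
  return n * (n - 1) / 2;
}

[10, 20, 50, 100].forEach(function (teamSize) {
  console.log(teamSize + " people: " + pairwiseLinks(teamSize) +
    " direct links, vs " + teamSize + " links through a single manager");
});
// 10 people: 45 direct links; 100 people: 4950 - which is why, past a certain
// size, every added person adds coordination cost faster than productive work.
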
So how do you get around this? Agile methodologies work reasonably well here, but they work best when the teams involved are small - maybe half a dozen developers, working closely with the client, usually under the direction of a designer/architect. Significantly, if you look at the projects that did succeed, as often as not, they had small teams - indeed, one independent group of four developers managed to create a system which provided most of the functionality that the ACA needed in under a month, just to show that they could. They would have been unable to even compete for the project because federal procurement is so heavily weighted towards the large general contractors, many of whom have very spotty records simply because they are trying to manage large development teams.

The same holds true for government agencies and their IT staffs. In many cases, government IT organizations are handicapped in how much they can develop in-house, again because these same large contractors, once awarded a contract, tend to be zealous in their attempts to ensure that only they end up doing development. This has meant a general erosion of IT skills within many government agencies, so that when abuses or deficiencies do occur, the agencies have neither the technical acumen to identify them when they take place nor the authority to act against the contractors to correct them. It is very likely some attempts will be made to rein this in during 2014.

Finally, the roles of general contractors themselves will be evaluated. GCs do offer a valuable service, acting very much like the directors of projects and coordinating the contractors. However, the role of the GC coordinating and managing the programs involved has been overshadowed by their role as providers of contractors, and this is likely to come under fire. If this happens, it may very well accelerate the move away from open time and material contracts and towards fixed bid projects, where the general contractor needs to keep their overall costs down, but where winning the contract is not dependent upon having the lowest bid, but having the best bid as determined by an evaluation committee.

It remains to be seen how effective such reforms will end up being or even if any get implemented. The current principal players within this space are very well entrenched, and will likely do whatever it takes to keep such rules from being implemented, but certainly the recognition is there that the current system is not meeting the needs of the government.

7. Microsoft: What Happens After Ballmer

After years of predicting that stockholders would get tired of Steve Ballmer, last year they finally did. Ballmer had begun a radical restructuring of Microsoft after anemic sales of Windows 8 and the somewhat questionable purchases of both Nokia and Skype, attempting to make up for years of sitting on the sidelines while mobile computing bypassed the company altogether, when the news broke that he was retiring to spend more time with his family ... the current euphemism for being forced to fall on his sword.


I had written a fairly long breakdown of what I think Ballmer did wrong in his tenure, but decided not to print it, because I think most of the sins laid at Ballmer's feet came about because he tried to run Microsoft as conservatively as possible and had a tin ear for what worked and what didn't in the tech field at a time when his competitors were rewriting the rules of the game. That has left Microsoft comparatively weakened - it still has a solid war chest, but it has lost a lot of the advantages it held in 2000 when Ballmer succeeded Bill Gates.

A search for a new CEO continues as I write this. Investors will likely want someone who is more business savvy, to provide greater shareholder value and increased dividends. Yet my suspicion is that if they go that route, these investors are likely to be disappointed - Ballmer was actually pretty good at keeping the dividend checks flowing, perhaps better than most, but he did so primarily by relying upon the Microsoft Office cash cow, which is looking a bit long in the tooth now (far too many "free" competitors who can interchange with Microsoft formats all too easily).

My own belief is that Microsoft needs a Chief Geek. It needs someone who is willing to ramp up the R&D engine, throw out Windows altogether for platforms where it is simply not appropriate, perhaps even taking the same route as Google in building atop a Linux foundation. This is not as strange as it sounds. Microsoft has been building up its cred in the Linux space for a while now, and there are quite a number of developers working there for whom Linux was their first real operating system.

That person will need to be more comfortable with media and games than Ballmer was, because these are increasingly becoming the heart of modern software companies, and he or she will need to be able to take those games to a common platform, which means rethinking the decrepit Internet Explorer as something altogether different. It may very well be time to dust off an idea that Microsoft first explored over a decade ago that was quashed with the antitrust lawsuit - reworking the operating system so that the browser becomes the OS, something that Google is beginning to make work quite well, thank you.

Given their physical proximity, a closer relationship between Microsoft and Amazon would not be out of the question - and it may very well be Amazon driving that. Amazon has managed to establish itself as the dominant player in the cloud web services space (thanks largely to the leadership of Jeff Barr), and rather than trying to push Azure (which is still struggling to gain even significant market penetration), a strategic alliance building Microsoft services preferentially on an AWS platform could benefit both companies, especially as Google builds out its own cloud services platform with Hadoop 2.0 and the associated stack.

Amazon additionally is finding itself more and more frequently in contention with Google on the media side as it builds out its own media delivery platform, and even as the big media studios continue to try to position themselves with regards to the cable companies and online media services such as Hulu Plus and Netflix, Amazon and Google are shifting the media landscape from the software side. Microsoft, to regain its footing there, will likely need to align with one over the other, and I see Amazon as offering a more likely alliance.

I also expect that Microsoft's next logical move may very well be to embrace its hardware side. The company has actually been making a number of moves into both the robotics and automotive telematics space as well as home automation systems, but these have been largely overshadowed by its software divisions. A new CEO, unencumbered with the belief that everything has to be a platform for Windows, could very well step into what is at the moment a still amorphous market space in home robotics systems. Similarly, I expect now is definitely the time to beef up offerings in the health informatics space.
Let me touch briefly on Apple. It's ironic that Apple is in roughly the same position as Microsoft - with the death of Steve Jobs, the company is now in the hands of Tim Cook, who has excellent credibility as an operations expert, but is definitely much more low-key than Jobs. I suspect that, for a couple of years anyway, Apple will be somewhat adrift as they try to establish a new direction. I don't doubt they will - Cook doesn't have Jobs' reality distortion field working very well yet, but as mentioned before, the next couple of years will see a big step forward for wearables in general, and this is a market that Apple (and Cook) has traditionally been strongest in.

8. IT Labor Market - Tightening, Specializing, Insourcing

The IT labor market is in general a good bellwether of broader economic activity six months to a year down the road, primarily because development of software usually precedes marketing and manufacturing. Similarly, when you see layoffs in software related areas, it typically means that a particular field is reaching a point of oversaturation (especially true in social media and gaming/media).


2013 was a surprisingly good year for the IT labor market, not quite in bubble territory, but certainly more active than at any time in the last five years. A lot of the hiring that I have been seeing has been primarily in a few specific areas - Big Data solution providers (Hadoop in particular), solutions architects, business intelligence gurus and user experience designers/developers, primarily for more established companies. Insurance companies (possibly gearing up for the ACA) seemed to top the list, with healthcare and biomedical companies close behind. With the ACA rollout now behind us, it's likely that this hiring will start abating somewhat, especially by 2014Q2.

Media companies have been doing a lot of consolidation during the same period, trying to pull together digital asset systems that have become unmanageable while at the same time trying to smooth the digital distribution pipeline to external services in anticipation of an online digital boom. This still seems to be gathering steam, though the initial bout of hiring is probably past. Expect to see companies that have large media presences (such as inhouse marketing departments that increasingly resemble small production studios) to continue this trend.

Games and mobile app development is soft, and likely will remain so for at least a couple more quarters. This is a case of oversupply - a lot of startups entered the market thinking that they could make a killing, only to discover everyone else had the same idea. Layoffs and lags in commissioned apps are becoming the norm, though a shift in the deployment of new platforms (see #5) may open this market up some, especially towards the end of 2014.

Data virtualization will have a decidedly mixed effect upon the IT space. Dealing with the complexities of data access has often been one of the key differentiators in IT skill sets. As data becomes easier to access, manipulate and persist, this also reduces the need for highly specific skills, and shifts more of the development burden onto SME-programmers, who combine subject matter expertise with more limited programming skills. I think as DV becomes inherently generic, this will likely have a negative impact upon database administrators and certain classes of data programmers. Expect languages such as JavaScript to displace Java as the must-have skill in many places - especially functional JavaScript - while data flow languages such as SPARQL, SPIN, and XQuery (or JSON equivalents to the latter, such as JSONiq) also become important.
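
By "functional JavaScript" I mean the style sketched below (a generic illustration of my own, not tied to any particular library): functions as values, higher-order functions, and map/filter/reduce pipelines over collections rather than explicit loops.

// Higher-order functions and a data-flow pipeline; the data is invented.
var sales = [
  { sector: "Publishing", revenue: 120000 },
  { sector: "Media",      revenue:  80000 },
  { sector: "Publishing", revenue:  50000 }
];

// A higher-order function: returns a predicate for a given sector.
function inSector(name) {
  return function (sale) { return sale.sector === name; };
}

var publishingTotal = sales
  .filter(inSector("Publishing"))
  .map(function (sale) { return sale.revenue; })
  .reduce(function (sum, revenue) { return sum + revenue; }, 0);

console.log("Publishing revenue: " + publishingTotal); // 170000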

Data visualization requirements, on the other hand, will probably increase the demand for both User Experience developers/designers and BI analysts. As mentioned in #1, Business Intelligence and Data Analytics skills are going to be in greater demand as the need to understand what all this data means grows, and such skills are typically SME-specific within the business domain. This means, at a minimum, knowing how to do regression analysis and multivariate analysis. Proficiency in Matlab, R, Mathematica and/or various analytics tools becomes a prerequisite for most such analytics positions, just as jQuery becomes key for UX developers and information architects.

Declarative languages also dominate the list of "next language to learn". Scala still accounts for only a small share of JVM-language use, but its Haskell-esque features - higher order functions, currying, list comprehensions, tail recursion, covariance and contravariance, and lazy evaluation - give it considerable power over Java in terms of type safety, flexibility, ease of development and performance. I expect to see Scala used increasingly in the Hadoop stack as well. Indeed, because map/reduce applications generally work best with declarative languages, I see declarative programming in general catching fire in 2014, continuing a trend that has been building for the past couple of years.

XQuery 3.0 becomes official this year, but I see some serious struggles for adoption of the language beyond die-hard XML fans. Indeed, one of the trends that has become distressingly clear is that even within the XML sphere there is an increasing migration into RDF on one hand and JSON on the other, or into generalized data modeling. There is still a space for XML (a big space) in the document management domain, but I would not advise someone entering this field to become highly proficient exclusively in XML technologies. What I do see happening is the need to maintain XML processing as part of a toolkit for data virtualization. Additionally, I think that a solid grounding in graph theory and the semantics toolkit (along with machine learning, agent-based systems, natural language processing and similar tools) would serve a hypothetical entry-level programmer very well, at least for the next decade.

At this level as well, understanding distributed agents (aka robots) and their programming languages can prove highly lucrative, even though at the moment this field is still very fractured. I can't speak in detail here (it's simply a domain I don't know very well), but I see a movement beginning in late 2014 or early 2015 to develop a standardized robotics language, rather than relying on a general purpose language such as C++ or Java, with much of that effort focused on areas such as telematics.

Overall, then, I think that 2014 will probably not be as big a banner year for IT as 2013 was, though it will not be at recessionary levels either. Another factor that will contribute to this is the increase of insourcing - bringing manufacturing and design back to the US. Part of this reflects the growing discontent and disillusion that companies have with the complexities inherent in maintaining a production system with pieces half a world away; part is due to rising wages in countries such as China, India and Japan that have eroded any competitive edge; and part - perhaps the most salient part - is that when you outsource the design and implementation of a project to a cost-competitive area, you are also exporting your intellectual property and institutional know-how, and that in turn means losing control of your ability to create or modify your own products. It turns out that exporting your IP and the people who can build it is a really bad idea, something many manufacturers and resellers are discovering to their dismay.

9. Social Media and the Political Landscape

I don't see a radical shift in the social media scene in the next year. Facebook will likely retain the social media crown for another year, Google+ will continue growing but stay close to its technical roots, and Twitter, post-IPO, will be more heavily in the news than it was in 2013 - though from a business standpoint it still has a couple of years of grace period in which it can continue being relatively unprofitable before its investors start calling in their chips.

One thing that I do suspect is that social media will play a dominant role in the 2014 mid-term elections. Part of the reason for this has to do with the cable cord-cutting that I see happening with ChromeCast and other streaming player systems. The population continues to segment into the traditional media camp (mostly on the religious conservative and Tea Party side) and the new media camp (mostly on the liberal, progressive and libertarian sides), each of which has its own very distinctive narrative of events.

The latter group has been watching a radically diminished number of hours of broadcast television, and even when they do watch, it tends to be as snippets or clips on YouTube or other sources. This will make it harder for candidates to project one view to one audience and a completely different (and perhaps opposite) view to another without those discrepancies becoming a big factor. It is easier to "fact check" on the web, and also easier to discuss (or, more likely, argue) different viewpoints.

My suspicion is that the 2014 elections themselves will likely not change the balance in Congress significantly, but they may have the effect of moderating more extreme positions, given that I see what's happening (especially on the Republican side) as a struggle between two distinct, upper-income factions, trying to steer the conservative movement in very different directions. That, along with procurement reforms, may have an impact upon government IT appropriations into 2015.

10. Next Big Things

2014 will be the year that graphene-based applications move out of the laboratory and into production. This will have its most immediate effects in the area of battery lifetimes, with the very real possibility that by 2015 we could see electronic devices - phones, tablets, laptops, etc. - able to run for days on a single charge.

Memristors, first demonstrated in the lab only a few years ago, are also changing the way that people think about circuitry. When a current is passed over a memristor in one direction, its resistance increases; from the other direction, it decreases. Turn off the current and the memristor retains knowledge of its resistance at the time the current stopped. Couple this with memcapacitors and you have the ability to create much higher density solid state components that can replace transistors. While it is unlikely that we'll see products on the market incorporating these in 2014, it's very likely that you'll see prototype fabrication units by early 2015, and products built on top of them by 2017 or 2018.

Flex-screens are likely to start making their way into commercial products by mid-2014. These screens can be deformed away from the plane, making it possible to put screens on far more surfaces, and significantly reducing both the likelihood of screen breakage and the cost of replacement.

The year 2014 will also likely see a critical threshold of coverage reached, where 90% of people in the US will have access to the Internet 99% of the time. I suspect that level of reliability will change the way that we think about building distributed applications, with being offline increasingly treated as an abnormal state.

This may also be the year that you see intelligent cars deployed on an experimental basis within normal traffic conditions. Such cars would be autonomous - capable of driving themselves. I don't expect them to be commercially available until the 2018-2020 timeframe, but they are getting there. Meanwhile, expect cars to be the next battleground of "mobile" applications, with integrated web-based entertainment and environmental control systems, diagnostics, navigation, security, brake systems, and intelligent powering systems (so that you don't have to do what I do, plugging a power strip and power converter into the increasingly misnamed cigarette lighter). Ford recently showcased a solar powered electric concept car, and there is no question that other automobile manufacturers will follow suit in that arena as well.

From an IT perspective, it's worth realizing that a vehicular mobile device becomes an incredible opportunity for mining data streams and building support applications, and it is likely that within the next two to three years, being a vehicular programmer will be as common as being a UX designer is today.

Wrap-up

This was a long piece (it's taken me the better part of a week to work on it), and I apologize for the length. Some of it may be self-evident - as mentioned at the beginning, this exercise is primarily for me to help identify areas that I personally need to concentrate on moving forward.

2014 will be an evolutionary year, not a revolutionary one (at least in IT). I expect a number of new technologies to begin influencing the fabric of IT, with more tumult likely to take place around the end of 2015 or early 2016 as these things begin to assert themselves at the social level. Data virtualization, effective metadata management and semantic systems will change the way that we think about information and our access to it, and will go a long way towards turning huge geysers of largely disconnected content into contexts that tie it into other big streams of information. The Ontologist and the Technical Analyst will become critical players within IT organizations, while programming will become more focused on data flows rather than process flows. Startups today are more likely to be focused on what has traditionally been called artificial intelligence than on flinging angry birds around, and contextual, declarative development will become the norm. We might even begin to get better at putting together health care exchanges.
Kurt Cagle is the Principal Evangelist for Semantics and Machine Learning at Avalon Consulting, LLC. He lives in Issaquah, Washington with his wife, daughters and cat. When not peering through a glass darkly he programs, writes both technical books and science fiction novels, and is available for consulting. 

Tuesday, September 3, 2013

Going Functional

Local functions, like the map, concat and bang operators, originally appeared in MarkLogic 6, but even when working heavily with the technology it is easy to miss the memo that they were released. If you are familiar with JavaScript, local functions should make perfect sense; coming from other languages, though, local functions (and functions as arguments) are fairly advanced features.
Anyone who has worked with XQuery knows how to create a modular function. First you define a namespace:
declare namespace foo = "http://www.myexample.com/xmlns/foo";
or, if you’re actually creating a library module:
module namespace foo = "http://www.myexample.com/xmlns/foo";
This is used to group the functionality of all functions within “foo:”.
Once this declaration is made, you can then declare a function within the module namespace. For instance, suppose that you have a function called foo:title-case that takes a string (or a node with text content) and returns a result in which the first letter of each word is upper case and all other letters in the word are lower case:
declare function foo:title-case($expr as item()) as item() {
  fn:string-join(
    fn:tokenize(fn:string($expr),"\s") !
      (fn:upper-case(fn:substring(.,1,1)) || fn:lower-case(fn:substring(.,2))),
    " ")
};
Then, in a different xquery script you can import the foo library and use the function:
import module namespace foo = "http://www.myexample.com/xmlns/foo" at "/lib/core/foo.xq";
 let $title := xdmp:get-request-field("title")
 return <div>{foo:title-case($title)}</div>
So far, so good. This kind of approach is very useful for establishing formal APIs, and when designing your applications, you should first look to modularize your code like this. These are public interfaces.
Not all functions, however, need to be (or even should be) public or global. With MarkLogic 6.0, a new kind of function was implemented, one that looks a lot more like a JavaScript function than a traditional XQuery one. This makes use of the "function" keyword, and has the general form:
let $f := function($param1 as typeIn1,$param2 as typeIn2) as typeOut {
      (: function body :)
      }
For instance, suppose that you wanted a function that gave you the distance between two points, represented by a two or three dimensional sequence. This would then be written as:
let $distance := function($pt1 as xs:double*, $pt2 as xs:double*) as xs:double {
      if (fn:count($pt1) = fn:count($pt2)) then
             let $dif := for $index in (1 to fn:count($pt1)) return $pt2[$index] - $pt1[$index]
             (: square each component difference, then take the square root of the sum :)
             let $difsquare := $dif ! math:pow(.,2)
             return math:pow(fn:sum($difsquare),0.5)
      else fn:error(xs:QName("xdmp:custom-error"),"Error: Mismatch of coordinates","The number of coordinates in $pt1 does not match that in $pt2")
      }
let $point1 := (0,0)
let $point2 := (50,40)
return $distance($point1,$point2)
There are several things to note here. The first is that when a function is assigned to a variable, that variable can then be invoked like a function by providing arguments - e.g., $distance($point1,$point2). This carries a number of implications, not least of which is that if a function is assigned to a local variable, it has only local scope - it cannot be used outside of its calling scope by that name. If, however, a variable is declared globally in a module, this syntax can be used:
declare variable $my:distance := function(..){..};
This is perfectly valid, and externally it would be invoked as
$my:distance($pt1,$pt2)
Of course at that point there’s probably not much benefit to declaring the function as a variable rather than simply as a function, but it is possible.
Another point is that you can raise errors within such a locally defined function in precisely the same way you would in a module function, with the fn:error() function. Finally, note that local function expressions do not end with semi-colons.
In both of the functions given above, there are very few reasons to make these local. Where such functions do come in handy is when they employ a principle called closure, which means that the function is able to encapsulate a temporary bit of state that persists as long as the function itself does. A simple hello-world function provides a good example of this.
let $greeting := "Hello"
let $hello-world := function($name as xs:string?) as xs:string {
     $greeting || ", " || (if (fn:empty($name)) then "World" else $name)
     }
return $hello-world("Kurt")
==> "Hello, Kurt"
In this case, the variable $greeting, already defined in the enclosing scope, is captured by the function. The function can then reference this variable directly, effectively saving state.
In practice, this is still pretty cumbersome. Where it begins to make a difference is when the function in turn generates a function as output. We can then create a whole set of generators:
declare namespace my = "http://example.com/xmlns/my";
declare function my:generate-greeting($greet-term as xs:string) as xdmp:function {
      function($name as xs:string?) as xs:string {$greet-term || ", " || 
          (if (fn:empty($name)) then "World" else $name)}
      };
let $greet1 := my:generate-greeting("Hello")
let $greet2 := my:generate-greeting("Welcome")
return ($greet1("Kurt"),$greet2("Kurt"))
The xdmp:function return type identifies the output of the function as being a function itself. The two variables $greet1 and $greet2 then become functions that take a name and generate the appropriate output message. Using a function to create a set of functions is well known in programming, and is known as the factory pattern - each factory creates multiple functions or objects based upon some input parameter.
In working with the MarkLogic semantics capability (in MarkLogic 7), this factory pattern can definitely prove useful. For instance, consider a Sparql query builder. Let's say that there are certain queries which occur quite often, and as such it makes sense to save them as files - for example, a query that retrieves the names of all items that link to a specific sem:iri. To make such a query you need to specify all of the prefixes used within the SPARQL query, concatenate these (cleanly) with the query, set the sort order and limit size, then, if desired, serialize the results out to different formats. Because you may do this a number of times (this is a remarkably common query), it seems like a good candidate for a factory. The following illustrates one such factory:
import module namespace sem="http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
declare namespace sp="http://example.com/xmlns/sparql";
declare function sp:sparql-factory($namespace-map as item(),$query as xs:string?,$file-path as xs:string?) as xdmp:function {
     let $query := if ($file-path = "") then $query else fn:unparsed-text($file-path)
      let $prefixes := fn:string-join(for $key in map:keys($namespace-map) return ("prefix "||$key||": <"||
           map:get($namespace-map,$key) || ">"),"&#13;")||"&#13;"
      return function($arg-map as item()) as item()* {sem:sparql($prefixes||$query||" ",$arg-map)} 
 };

declare function sp:curie-factory($ns-map as item()) as xdmp:function {
    function($curie as xs:string) as sem:iri {sem:curie-expand($curie,$ns-map)}
 }; 

declare function sp:merge-maps($ns-maps as item()*) as item(){
    if (fn:count($ns-maps)=1) then 
         $ns-maps
   else 
         let $map := map:map()
         let $_ := for $ns-map in $ns-maps return for $key in map:keys($ns-map) return map:put($map,$key,map:get($ns-map,$key))
         return $map
 };
declare variable $sp:ns-map := map:new((
map:entry("rdf","http://www.w3.org/1999/02/22-rdf-syntax-ns#"),
map:entry("context","http://disney.com/xmlns/context"),
map:entry("semantics","http://marklogic.com/semantics"),
map:entry("fn","http://www.w3.org/2005/xpath-functions#"),
map:entry("xs","http://www.w3.org/2001/XMLSchema#"),
map:entry("rdfs","http://www.w3.org/2000/01/rdf-schema#"),
map:entry("owl","http://www.w3.org/2002/07/owl#"),
map:entry("skos","http://www.w3.org/2004/02/skos/core#"),
map:entry("xdmp","http://marklogic.com/xdmp#"),
map:entry("entity","http://example.com/xmlns/class/Entity/"),
map:entry("class","http://example.com/xmlns/class/Class/"),
map:entry("map","http://marklogic.com/map#")));
let $curie := sp:curie-factory($sp:ns-map)
let $links-query := sp:sparql-factory($sp:ns-map,"","/sparql/linkList.sp")
return $links-query(map:entry("s",$curie("class:Person"))) ! map:get(.,"name")
This code actually defines two factories. The curie-factory takes a map of prefixes and namespaces and binds these into a function that will attempt to match a curie's prefix against one of those in the map. If it's there, the function generates the appropriate sem:iri from the shortened curie form.
The second function, sparql-factory, takes the map of namespaces indexed by prefix and uses this to generate the prefix declarations for the query. This can become significant when the number of namespaces is high, and it's a small step from this to saving these namespaces in a file and updating them when new namespaces are added.
The factory in turn generates a new function that either takes the query supplied as text or loads it from a supplied text file stored in the data directory. The newly created function can then take a map of parameters and return the associated sequence of result maps or sem:triples. In this case the output is a list of names that satisfy the query itself.
One final note, while talking about maps. You can assign a function to a map, which can then be persisted and retrieved.  In the above example, for instance, you could have persisted the functions:
xdmp:document-insert("/functions/named-functions",
        map:new((map:entry(" curie",$curie),map:entry("links",$links-query)))
in another routine, you could then retrieve these:
let $fn-map = map:map(fn:doc("/functions/named-functions"))
let $links-fn := map:get($fn-map,"links")
let $curie-fn := map:get($fn-map,"curie")
let $links := $links-fn(map:entry("s",$curie-fn("class:Person")))
return $links ! map:get(.,"name")
This becomes especially useful when tracking “named queries” produced by users.
There are other design patterns that can also use functions as arguments (most notably the decorator pattern), but in general the principle is the same - by using local functions it becomes possible to create general functions that either produce or consume functions of their own, giving you considerably more power in writing code.
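As a quick illustration of that decorator idea, here is a minimal sketch that wraps an existing function with logging before delegating to it. The my:with-logging wrapper is purely illustrative (it is not part of any MarkLogic API), and the xdmp:function typing simply follows the convention used in the examples above:
declare namespace my = "http://example.com/xmlns/my";
(: decorator: take a one-argument function and return a new function that logs, then delegates :)
declare function my:with-logging($fn as xdmp:function, $label as xs:string) as xdmp:function {
    function($arg as item()*) as item()* {
        let $_ := xdmp:log("calling " || $label)
        return $fn($arg)
    }
};
let $hello := function($name as xs:string?) as xs:string {
    "Hello, " || (if (fn:empty($name)) then "World" else $name)
}
let $logged-hello := my:with-logging($hello,"hello")
return $logged-hello("Kurt")
==> "Hello, Kurt"  (: with "calling hello" written to the server error log :)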

Monday, August 12, 2013

Maps, JSON and Sparql - A Peek Under the Hood of MarkLogic 7.0

Blogging while trying to hold down a full time consulting job can be frustrating - you get several weeks of one article a week with some serious momentum, then life comes along and you're pulling sixty hour weeks and your dreams begin to resemble IDE screens. I've been dealing both with a personal tragedy - my mother succumbed to a form of blood cancer last month, less than a year after being diagnosed - and with a demanding semantics project for a large media broadcast company.

I've also had a chance to preview the MarkLogic 7 pre-release, and have been happily making a pest of myself on the MarkLogic forums as a consequence. My opinion about the new semantics capability is mixed but generally positive. I think that once they release 7.0, the capabilities that MarkLogic has in that space should catapult them into a major force in the semantic space at a time when semantics seems to finally be getting hot.

As I was working on code recently for a client, though, I came to a sudden, disquieting realization: for all the code that I was writing, surprisingly little of it - almost none, in fact - was involved in manipulating XML. Instead, I was spending a lot of time working with maps, JSON objects, higher order functions, and SPARQL. The XQuery code was still the substrate for all this, mind you, but this was not an XML application - it was an application that worked with hash tables, assertions, factories and other interesting ephemera that seem to be intrinsic to coding in the 2010s.

There are a few interesting tips that I picked up that illustrate what you can do with these. For instance, I first encountered the concat operator - "||" - just recently, though it seems to have sneaked into ML 6 when I wasn't looking. This operator eliminates (or at least reduces) the need for the fn:concat function:

let $str1 := "test"
let $str2 := "This is a " || $str1 || ". This is only a " || $str1 || "."
return $str2
==> "This is a test. This is only a test."

XQuery has a tendency to be parenthesis heavy, and especially when putting together complex strings, trying to track whether you are inside or outside the string scope can be an onerous chore. The || operator seems like a little win, but I find that in general it is easier to keep track of string construction this way.

Banging on Maps

Another useful operator is the map operator "!", also known as the "bang" operator. This one is specific to ML7, and you will find yourself using it a fair amount. The map operator in effect acts like a "map" operation (for those familiar with map/reduce functionality) - it iterates through a sequence of items and establishes a context for each item that subsequent operations can use. For instance, consider a sequence of colors and how these could be wrapped up in <color> elements:

let $colors = "red,orange,yellow,green,cyan,blue,violet"
return <colors>{fn:tokenize($colors,",") ! <color>{.}</color>}</colors>
=> <colors>
      <color>red</color>
      <color>orange</color>
      <color>yellow</color>
      <color>green</color>
      <color>cyan</color>
      <color>blue</color>
      <color>violet</color>
   </colors>

The dot in this case is the same as the dot context in a predicate - a context item in a sequence. This is analogous to the statements:

let $colors = "red,orange,yellow,green,cyan,blue,violet"
return <colors>{for $item in fn:tokenize($colors,",") return <color>{$item}</color>}</colors>

save that it is not necessary to declare a specific named variable for the iterator.

This can come in handy with another couple of useful functions - map:entry() and map:new(). The map:entry() function takes two arguments - a key name and a value - and, as expected, constructs a map from these:

map:entry("red","#ff0000")
=> <map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="red">
<map:value xsi:type="xs:string">#ff0000</map:value>
</map:entry>
</map:map>

The map:new() function takes a sequence of map:entry() maps as its argument, and constructs a compound map.

map:new((
   map:entry("red","#ff0000"),
   map:entry("blue","#0000ff"),
   map:entry("green","#00ff00")
   ))
=>
<map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="blue">
<map:value xsi:type="xs:string">#0000ff</map:value>
</map:entry>
<map:entry key="red">
<map:value xsi:type="xs:string">#ff0000</map:value>
</map:entry>
<map:entry key="green">
<map:value xsi:type="xs:string">#00ff00</map:value>
</map:entry>
</map:map>

Note that if the same key is used more than once, then the latter value replaces the former.

map:new((
   map:entry("red","#ff0000"),
   map:entry("blue","#0000ff"),
   map:entry("red","rouge")
   ))
=>
<map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="blue">
<map:value xsi:type="xs:string">#0000ff</map:value>
</map:entry>
<map:entry key="red">
<map:value xsi:type="xs:string">rouge</map:value>
</map:entry>
</map:map>

Additionally, the order that the keys are stored in is unpredictable - a hash is a bag, not a sequence.
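If you need a predictable ordering (for display, say), one workaround is to sort the keys explicitly before pulling out the values - a minimal sketch:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
(: sort the keys alphabetically to get a stable, repeatable ordering :)
for $key in map:keys($colors)
order by $key
return $key || " = " || map:get($colors,$key)
=> "blue = #0000ff"
=> "green = #00ff00"
=> "red = #ff0000"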

The bang operator works well with maps. For instance, you can cut down on verbiage by establishing the map itself as the context and then getting the associated values:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return $colors ! map:get(.,"red")
=> "#ff0000"

You can also use it to iterate through keys:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return map:keys($colors) ! map:get($colors,.)
=> "#ff0000"
=> "#0000ff"
=> "#00ff00"

and can chain contexts:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return <table><tr>{
    map:keys($colors) ! (
      let $key := . 
      return map:get($colors,.) ! 
        <td style="color:{.}">{$key}</td>
    )}
</tr></table>
==>
<table>
   <tr>
       <td style="color:#ff0000">red</td>
       <td style="color:#0000ff">blue</td>
       <td style="color:#00ff00">green</td>
   </tr>
</table>
Here, the context changes - after the first bang operator the dot context holds the keys ("red", "blue" and "green" respectively). After the second bang operator, the dot context holds the values retrieved from the $colors map for those keys: ("#ff0000","#0000ff","#00ff00"). These are then used in turn to set the color of the text. Notice that you can also bind the context to a variable (and use it further along, so long as you remain within the XQuery scope of that variable) - here $key is bound to the respective color name.

Again, this is primarily a shorthand for the for $item in $sequence statement, but it's a very useful shortcut.
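For comparison, here is the same table written with explicit for clauses instead of chained bangs - a bit wordier, but functionally equivalent:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return <table><tr>{
    (: $key plays the role of the first dot context, $value the second :)
    for $key in map:keys($colors)
    for $value in map:get($colors,$key)
    return <td style="color:{$value}">{$key}</td>
}</tr></table>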

Maps and JSON

MarkLogic maps look a lot like JSON objects. Internally, they are similar, though not quite identical - the primary difference being that maps are intrinsically hashes, while JSON objects may be a sequence of hashes. MarkLogic 7 supports both of these objects, and you can use the map:* functions and the bang operator to work with internal JSON objects.

For instance, suppose that you set up a JSON string (or import it from an external data call). You can use the xdmp:from-json() function to convert the string into an internal ML JSON object:

import module namespace json="http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";
let $json-str := '[{"name":"Aleria","vocation":"mage","species":"elf","gender":"female"},
                   {"name":"Gruarg","vocation":"warrior","species":"half-orc","gender":"male"},
                   {"name":"Huara","vocation":"cleric","species":"human","gender":"female"}]'
let $characters := xdmp:from-json($json-str)

This list can then be referenced using the sequence and map operators. For instance, you can get the first item using a predicate index:

$characters[1]
=> 
{"name": "Aleria","class": "mage","species": "elf","gender": "female"}

You can get a specific entry by using the map:get() operator:

map:get($characters[1],"name")
=> Aleria

You can update an entry using the map:put() operator:

let $_ := map:put($characters[1],"species","half-elf")
return $characters[1]
=> 

{"name": "Aleria","vocation": "mage","species": "half-elf","gender": "female"}

You can use keys:

map:keys($characters[1])
=> ("name","vocation","species","gender")

and you can use the ! operator:

$characters ! (map:get(.,"name") || " [" || map:get(.,"species") || "]")
=> Aleria [half-elf]
=> Gruarg [half-orc]
=> Huara [human]

The xdmp:to-json() function will convert an object back into the corresponding JSON string, making it handy to work with MarkLogic in a purely JSONic mode. You can also convert json objects into XML:

<map>{$characters}</map>/*
=>
<json:array xmlns:json="http://marklogic.com/xdmp/json" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Aleria</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">mage</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">elf</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">female</json:value>
</json:entry>
</json:object>
</json:value>
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Gruarg</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">warrior</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">half-orc</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">male</json:value>
</json:entry>
</json:object>
</json:value>
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Huara</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">cleric</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">human</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">female</json:value>
</json:entry>
</json:object>
</json:value>
</json:array>

This format can then be transformed to other XML formats, a topic for another blog post.
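And to go back the other way - from the in-memory object to a JSON string - the xdmp:to-json() function mentioned above does the serialization. A minimal sketch (the exact formatting and key ordering of the output may vary):

xdmp:to-json($characters)
==> '[{"name":"Aleria","vocation":"mage","species":"half-elf","gender":"female"},
      {"name":"Gruarg","vocation":"warrior","species":"half-orc","gender":"male"},
      {"name":"Huara","vocation":"cleric","species":"human","gender":"female"}]'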

SPARQL and Maps

These capabilities are indispensable for working with SPARQL. Unless otherwise specified, SPARQL queries return their results as a sequence of maps (one map per result row), and take a regular map for passing in parameter bindings. For instance, suppose you load the following turtle data:

let $turtle := '
@prefix class: <http://www.example.com/xmlns/Class/>.
@prefix character: <http://www.example.com/xmlns/Character/>.
@prefix species: <http://www.example.com/xmlns/Species/>.
@prefix gender:  <http://www.example.com/xmlns/Gender/>.
@prefix vocation:  <http://www.example.com/xmlns/Vocation/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

character:Aleria rdf:type class:Character;
       character:species species:HalfElf;
       character:gender gender:Female;
       character:vocation vocation:Mage;
       rdfs:label "Aleria".
character:Gruarg rdf:type class:Character;
       character:species species:HalfOrc;
       character:gender gender:Male;
       character:vocation vocation:Warrior;
       rdfs:label "Gruarg".
character:Huara rdf:type class:Character;
       character:species species:Human;
       character:gender gender:Female;
       character:vocation vocation:Cleric;
       rdfs:label "Huara".
character:Drina rdf:type class:Character;
       character:species species:HalfElf;
       character:gender gender:Female;
       character:vocation vocation:Archer;
       rdfs:label "Drina".
gender:Female rdf:type class:Gender;
       rdfs:label "Female".
gender:Male rdf:type class:Gender;
       rdfs:label "Male".
species:HalfElf rdf:type class:Species;
       rdfs:label "Half-Elf".
species:Human rdf:type class:Species;
       rdfs:label "Human".
species:HalfOrc rdf:type class:Species;
       rdfs:label "Half-Orc".
vocation:Warrior rdf:type class:Vocation;
       rdfs:label "Warrior".
vocation:Mage rdf:type class:Vocation;
       rdfs:label "Mage".
vocation:Cleric rdf:type class:Vocation;
       rdfs:label "Cleric".'
let $triples := sem:rdf-parse($turtle,"turtle")
return sem:rdf-insert($triples)

You can then retrieve the names of all female half-elves from the dataset with a sparql query:

let $namespaces := 'prefix class: <http://www.example.com/xmlns/Class/>
prefix character: <http://www.example.com/xmlns/Character/>
prefix species: <http://www.example.com/xmlns/Species/>
prefix gender:  <http://www.example.com/xmlns/Gender/>
prefix vocation:  <http://www.example.com/xmlns/Vocation/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
'

let $character-maps := sem:sparql($namespaces ||
'select ?characterName ?vocationLabel where {
?character rdfs:label ?characterName.
?character character:species ?species.
?species rdfs:label ?speciesLabel.
?character character:gender ?gender.
?gender rdfs:label ?genderLabel.
?character character:vocation ?vocation.
?vocation rdfs:label ?vocationLabel.
}',
map:new((
    map:entry("genderLabel","Female"),
    map:entry("speciesLabel","Half-Elf")
    )))
return $character-maps ! (map:get(.,"characterName") || " [" || map:get(.,"vocationLabel") || "]")  
=> "Aleria [Mage]"
=> "Drina [Archer]" 

The query passes a map with two entries, one specifying the gender label, the other the species label. Note that in this case we're not actually passing in the iris, but matching on text labels. These bindings are then used by the Sparql query to determine the associated character name and vocation label, with the output being a sequence of map objects. That result is used by the bang operator to retrieve the values for each specific record.
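If you do want to bind the iris directly rather than matching on labels, the same pattern works. A sketch, assuming the same $namespaces string and dataset as above (note that the bindings now target the ?species and ?gender variables themselves):

let $character-maps := sem:sparql($namespaces ||
'select ?characterName ?vocationLabel where {
?character rdfs:label ?characterName.
?character character:species ?species.
?character character:gender ?gender.
?character character:vocation ?vocation.
?vocation rdfs:label ?vocationLabel.
}',
map:new((
    map:entry("species",sem:iri("http://www.example.com/xmlns/Species/HalfElf")),
    map:entry("gender",sem:iri("http://www.example.com/xmlns/Gender/Female"))
    )))
return $character-maps ! (map:get(.,"characterName") || " [" || map:get(.,"vocationLabel") || "]")
==> "Aleria [Mage]"
==> "Drina [Archer]"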

It is possible to get sparql output in other formats by using the sem:query-results-serialize() function on the sparql results with an option of "xml", "json" or "triples", such as:

sem:query-results-serialize($character-maps,"json")

which is especially useful when using MarkLogic as a SPARQL endpoint, but internally, sticking with maps for processing is probably the fastest and easiest way to work with SPARQL in your applications.

Summary

There is no question that these capabilities are changing the way that applications are written in MarkLogic, and they represent a shift in the server from being primarily an XML database (though it can certainly still be used that way) to being increasingly its own beast - something capable of working just as readily with JSON and RDF while employing a more contemporary set of coding practices.

In the next column, I'm going to shift gears somewhat and look at higher order functions and how they are used in the MarkLogic environment.