Monday, August 12, 2013

Maps, JSON and SPARQL - A Peek Under the Hood of MarkLogic 7.0

Blogging while attempting to hold down a full-time consulting job can be frustrating - you get several weeks of one article a week with some serious momentum, then life comes along and you're pulling sixty-hour weeks and your dreams begin to resemble IDE screens. I've been dealing with personal tragedy - my mother succumbed to a form of blood cancer last month, less than a year after being diagnosed - and have also been busy with a semantics project for a large media broadcast company.

I've also had a chance to preview the MarkLogic 7 pre-release, and have been happily making a pest of myself on the MarkLogic forums as a consequence. My opinion about the new semantics capability is mixed but generally positive. I think that once they release 7.0, the capabilities that MarkLogic has in that space should catapult them into a major force in the semantic space at a time when semantics seems to finally be getting hot.

As I was working on code recently for a client, though, I had a sudden, disquieting realization: for all the code that I was writing, surprisingly little of it - almost none, in fact - was involved in manipulating XML. Instead, I was spending a lot of time working with maps, JSON objects, higher order functions, and SPARQL. The XQuery code was still the substrate for all this, mind you, but this was not an XML application - it was an application that worked with hash tables, assertions, factories and other interesting ephemera that seem to be intrinsic to coding in the 2010s.

There are a few interesting tips that I picked up that illustrate what you can do with these. For instance, I first encountered the concat operator - "||" - just recently, though it seems to have sneaked into ML 6 when I wasn't looking. This operator eliminates (or at least reduces) the need for the fn:concat function:

let $str1 := "test"
let $str2 := "This is a " || $str1 || ". This is only a " || $str1 || "."
return $str2
==> "This is a test. This is only a test."

XQuery has a tendency to be parenthesis heavy, and especially when putting together complex strings, trying to track whether you are inside or outside the string scope can be an onerous chore. The || operator seems like a little win, but I find that in general it is easier to keep track of string construction this way.
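As a quick side-by-side, here is a minimal sketch of the same string built both ways. The variable names and values are made up for illustration; note that, like fn:concat, the || operator converts atomic operands such as the integer to strings.

```xquery
xquery version "1.0-ml";
let $name := "Aleria"
let $level := 7
return (
  (: the function form: every piece is an argument inside one set of parentheses :)
  fn:concat($name, " is level ", $level, "."),
  (: the operator form: the pieces read left to right :)
  $name || " is level " || $level || "."
)
(: both expressions produce "Aleria is level 7." :)
```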

Banging on Maps

Another useful operator is the map operator "!", also known as the "bang" operator. This one is specific to ML7, and you will find yourself using it a fair amount. The map operator in effect acts like the "map" step (for those familiar with map/reduce functionality) rather than a fold - it iterates through a sequence of items, and for each one evaluates the expression on its right with that item as the context. For instance, consider a sequence of colors and how these could be wrapped up in <color> elements:

let $colors := "red,orange,yellow,green,cyan,blue,violet"
return <colors>{fn:tokenize($colors,",") ! <color>{.}</color>}</colors>
=> <colors>
      <color>red</color>
      <color>orange</color>
      <color>yellow</color>
      <color>green</color>
      <color>cyan</color>
      <color>blue</color>
      <color>violet</color>
   </colors>

The dot in this case is the same as the dot context in a predicate - a context item in a sequence. This is analogous to the statements:

let $colors := "red,orange,yellow,green,cyan,blue,violet"
return <colors>{for $item in fn:tokenize($colors,",") return <color>{$item}</color>}</colors>

save that it is not necessary to declare a specific named variable for the iterator.
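Because the right-hand side is simply evaluated once per item, the operator also works on plain atomic values, which an ordinary path step would not accept. A small sketch, with sample values chosen only for illustration:

```xquery
xquery version "1.0-ml";
(: each integer becomes the context item for the expression on the right :)
(1 to 5) ! (. * .),
(: function calls work the same way, with the dot as the argument :)
("red", "green", "blue") ! fn:upper-case(.)
(: yields 1, 4, 9, 16, 25, "RED", "GREEN", "BLUE" :)
```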

This can come in handy with another couple of useful functions - the map:entry() and map:new() functions. The map:entry() function takes two arguments - a key and a value - and as expected constructs a single-entry map from these:

map:entry("red","#ff0000")
=> <map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="red">
<map:value xsi:type="xs:string">#ff0000</map:value>
</map:entry>
</map:map>

The map:new() function takes a sequence of map:entries as arguments, and constructs a compound map. 

map:new((
   map:entry("red","#ff0000"),
   map:entry("blue","#0000ff"),
   map:entry("green","#00ff00")
   ))
=>
<map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="blue">
<map:value xsi:type="xs:string">#0000ff</map:value>
</map:entry>
<map:entry key="red">
<map:value xsi:type="xs:string">#ff0000</map:value>
</map:entry>
<map:entry key="green">
<map:value xsi:type="xs:string">#00ff00</map:value>
</map:entry>
</map:map>

Note that if the same key is used more than once, then the latter value replaces the former.

map:new((
   map:entry("red","#ff0000"),
   map:entry("blue","#0000ff"),
   map:entry("red","rouge")
   ))
=>
<map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="blue">
<map:value xsi:type="xs:string">#0000ff</map:value>
</map:entry>
<map:entry key="red">
<map:value xsi:type="xs:string">rouge</map:value>
</map:entry>
</map:map>

Additionally, the order that the keys are stored in is unpredictable - a hash is a bag, not a sequence.
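If you need a repeatable order, for a report or a test, one option is to sort the keys yourself instead of relying on whatever order map:keys() happens to return. A minimal sketch using the same color map:

```xquery
xquery version "1.0-ml";
let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
(: sort the keys explicitly rather than trusting the map's internal order :)
for $key in map:keys($colors)
order by $key
return $key || "=" || map:get($colors, $key)
(: yields "blue=#0000ff", "green=#00ff00", "red=#ff0000" :)
```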

The bang operator works well with maps. For instance, you can cut down on verbiage by getting the context of a map then getting the associated values:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return $colors ! map:get(.,"red")
=> "#ff0000"

You can also use it to iterate through keys:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return map:keys($colors) ! map:get($colors,.)
=> "#ff0000"
=> "#0000ff"
=> "#00ff00"

and can chain contexts:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return <table><tr>{
    map:keys($colors) ! (
      let $key := . 
      return map:get($colors,.) ! 
        <td style="color:{.}">{$key}</td>
    )}
</tr></table>
==>
<table>
   <tr>
       <td style="color:#ff0000">red</td>
       <td style="color:#0000ff">blue</td>
       <td style="color:#00ff00">green</td>
   </tr>
</table>

Here, the context changes - after the first bang operator the dot context holds the keys ("red", "blue" and "green" respectively). After the second bang operator, the dot context now holds the values that these keys return from the $colors map: ("#ff0000", "#0000ff", "#00ff00"). These are then used in turn to set the color of the text. Notice that you can also assign the context to a variable (and use that further along, so long as you are in the XQuery scope for that variable) - here $key is assigned the respective color names.

Again, this is primarily a shorthand for the for $item in $sequence statement, but it's a very useful shortcut.
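The shortcut runs in the other direction as well: since map:new() accepts a sequence of entries, the bang operator can generate those entries from a list. The sketch below is only an illustration (storing the length of each name is an arbitrary choice of value):

```xquery
xquery version "1.0-ml";
let $names := "red,orange,yellow,green"
(: one map:entry per token; the dot is the current color name :)
let $lengths := map:new(
    fn:tokenize($names, ",") ! map:entry(., fn:string-length(.))
)
return map:get($lengths, "orange")
(: yields 6 :)
```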

Maps and JSON

MarkLogic maps look a lot like JSON objects. Internally, they are similar, though not quite identical, the primary difference being that maps are intrinsically hashes, while JSON objects may be a sequence of hashes. MarkLogic 7 supports both of these objects, and can use the map: functions and the bang operator to work with internal JSON objects.

For instance, suppose that you set up a JSON string (or import it from an external data call). You can use the xdmp:from-json() function to convert the string into an internal ML JSON object:

import module namespace json="http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";
let $json-str := '[{"name":"Aleria","vocation":"mage","species":"elf","gender":"female"}, {"name":"Gruarg","vocation":"warrior","species":"half-orc","gender":"male"},{"name":"Huara","vocation":"cleric","species":"human","gender":"female"}]'
let $characters := xdmp:from-json($json-str)

This list can then be referenced using the sequence and map operators. For instance, you can get the first item using the predicate index:

$characters[1]
=> 
{"name": "Aleria","vocation": "mage","species": "elf","gender": "female"}

You can get a specific entry by using the map:get() operator:

map:get($characters[1],"name")
=> Aleria

You can update an entry using the map:put() operator:

let $_ := map:put($characters[1],"species","half-elf")
return $characters[1]
=>
{"name": "Aleria","vocation": "mage","species": "half-elf","gender": "female"}

You can use keys:

map:keys($characters[1])
=> ("name","vocation","species","gender")

and you can use the ! operator:

$characters ! (map:get(.,"name") || " [" || map:get(.,"species") || "]")
=> Aleria [half-elf]
=> Gruarg [half-orc]
=> Huara [human]

The xdmp:to-json() function will convert an object back into the corresponding JSON string, making it handy to work with MarkLogic in a purely JSONic mode. You can also convert JSON objects into XML:

<map>{$characters}</map>/*
=>
<json:array xmlns:json="http://marklogic.com/xdmp/json" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Aleria</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">mage</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">elf</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">female</json:value>
</json:entry>
</json:object>
</json:value>
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Gruarg</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">warrior</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">half-orc</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">male</json:value>
</json:entry>
</json:object>
</json:value>
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Huara</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">cleric</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">human</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">female</json:value>
</json:entry>
</json:object>
</json:value>
</json:array>

This format can then be transformed to other XML formats, a topic for another blog post.

SPARQL and Maps

These capabilities are indispensable for working with SPARQL. Unless otherwise specified, SPARQL queries generate JSON maps, and use regular maps for passing in parameters. For instance, suppose you load the following Turtle data:

import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
let $turtle := '
@prefix class: <http://www.example.com/xmlns/Class/>.
@prefix character: <http://www.example.com/xmlns/Character/>.
@prefix species: <http://www.example.com/xmlns/Species/>.
@prefix gender:  <http://www.example.com/xmlns/Gender/>.
@prefix vocation:  <http://www.example.com/xmlns/Vocation/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

character:Aleria rdf:type class:Character;
       character:species species:HalfElf;
       character:gender gender:Female;
       character:vocation vocation:Mage;
       rdfs:label "Aleria".
character:Gruarg rdf:type class:Character;
       character:species species:HalfOrc;
       character:gender gender:Male;
       character:vocation vocation:Warrior;
       rdfs:label "Gruarg".
character:Huara rdf:type class:Character;
       character:species species:Human;
       character:gender gender:Female;
       character:vocation vocation:Cleric;
       rdfs:label "Huara".
character:Drina rdf:type class:Character;
       character:species species:HalfElf;
       character:gender gender:Female;
       character:vocation vocation:Archer;
       rdfs:label "Drina".
gender:Female rdf:type class:Gender;
       rdfs:label "Female".
gender:Male rdf:type class:Gender;
       rdfs:label "Male".
species:HalfElf rdf:type class:Species;
       rdfs:label "Half-Elf".
species:Human rdf:type class:Species;
       rdfs:label "Human".
species:HalfOrc rdf:type class:Species;
       rdfs:label "Half-Orc".
vocation:Warrior rdf:type class:Vocation;
       rdfs:label "Warrior".
vocation:Mage rdf:type class:Vocation;
       rdfs:label "Mage".
vocation:Cleric rdf:type class:Vocation;
       rdfs:label "Cleric".
vocation:Archer rdf:type class:Vocation;
       rdfs:label "Archer".'
let $triples := sem:rdf-parse($turtle,"turtle")
return sem:rdf-insert($triples)

You can then retrieve the names of all female half-elves from the dataset with a SPARQL query:

let $namespaces := 'prefix class: <http://www.example.com/xmlns/Class/>
prefix character: <http://www.example.com/xmlns/Character/>
prefix species: <http://www.example.com/xmlns/Species/>
prefix gender:  <http://www.example.com/xmlns/Gender/>
prefix vocation:  <http://www.example.com/xmlns/Vocation/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
'

let $character-maps := sem:sparql($namespaces ||
'select ?characterName ?vocationLabel where {
?character rdfs:label ?characterName.
?character character:species ?species.
?species rdfs:label ?speciesLabel.
?character character:gender ?gender.
?gender rdfs:label ?genderLabel.
?character character:vocation ?vocation.
?vocation rdfs:label ?vocationLabel.
}',
map:new((map:entry("genderLabel","Female"),map:entry("speciesLabel","Half-Elf"))))
return $character-maps ! (map:get(.,"characterName") || " [" || map:get(.,"vocationLabel") || "]")
=> "Aleria [Mage]"
=> "Drina [Archer]" 

The query passes a map with two entries, one specifying the gender label, the other the species label. Note in this case that we're not actually passing in the IRIs, but using a text match. These are then used to determine via SPARQL the associated character name and vocation label, with the output then being a sequence of JSON hash-map objects. This result is used by the bang operator to retrieve the values for each specific record.
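The same binding map can carry IRIs rather than label strings, which avoids the extra label lookups in the query. The following is a sketch rather than tested code: it assumes the character data above has been loaded, and it uses sem:iri() to turn each string into an IRI value before binding it to the ?species and ?gender variables.

```xquery
xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";

(: bind the variables directly to IRIs instead of matching on label text :)
let $bindings := map:new((
    map:entry("species", sem:iri("http://www.example.com/xmlns/Species/HalfElf")),
    map:entry("gender", sem:iri("http://www.example.com/xmlns/Gender/Female"))
    ))
let $results := sem:sparql('
prefix character: <http://www.example.com/xmlns/Character/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?characterName where {
?character character:species ?species.
?character character:gender ?gender.
?character rdfs:label ?characterName.
}', $bindings)
(: each result is a map keyed by the selected variable names :)
return $results ! map:get(., "characterName")
```

With the sample data this should return the labels for Aleria and Drina, in no guaranteed order.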

It is possible to get SPARQL output in other formats. To do so, use the sem:query-results-serialize() function on the SPARQL results with the option of "xml", "json" or "triples", such as:

sem:query-results-serialize($character-maps,"json")

which is especially useful when using MarkLogic as a SPARQL endpoint, but internally, sticking with maps for processing is probably the fastest and easiest way to work with SPARQL in your applications.

Summary

There is no question that these capabilities are changing the way that applications are written in MarkLogic, and represent a shift in the server from being primarily an XML database (though certainly it can still be used this way) into being increasingly its own beast, something capable of working just as readily with JSON and RDF while employing a more contemporary set of coding practices.

In the next column, I'm going to shift gears somewhat and look at higher order functions and how they are used in the MarkLogic environment.