Tuesday, September 3, 2013

Going Functional

Local functions, like the map, concat and bang operators, slipped into MarkLogic with little fanfare (local functions themselves arrived in MarkLogic 6), and even when working heavily with the technology it is easy to miss the memo that they were released. If you are familiar with JavaScript, local functions should make perfect sense, but for developers coming from other languages, local functions (and functions as arguments) are fairly advanced features.
Anyone who has worked with XQuery knows how to create a modular function. First you define a namespace:
declare namespace foo = "http://www.myexample.com/xmlns/foo";
or, if you’re actually creating a library module:
module namespace foo = "http://www.myexample.com/xmlns/foo";
This namespace groups the functionality of all functions under the "foo:" prefix.
Once this declaration is made, you can then declare a function within the module namespace. For instance, suppose that you have a function called foo:title-case that takes a string (or a node with text content) and returns a string in which the first letter of each word is upper-cased and the remaining letters are lower-cased:
declare function foo:title-case($expr as item()) as xs:string {
  fn:string-join(
    fn:tokenize(fn:string($expr), "\s") !
      (fn:upper-case(fn:substring(., 1, 1)) || fn:lower-case(fn:substring(., 2))),
    " ")
};
Then, in a different XQuery script, you can import the foo library and use the function:
import module namespace foo = "http://www.myexample.com/xmlns/foo" at "/lib/core/foo.xq";
 let $title := xdmp:get-request-field("title")
 return <div>{foo:title-case($title)}</div>
So far, so good. This kind of approach is very useful for establishing formal APIs, and when designing your applications, you should first look to modularize your code like this. These are public interfaces.
Not all functions, however, need to be (or even should be) public or global. With MarkLogic 6.0, a new kind of function was introduced, one that looks a lot more like a JavaScript function than traditional XQuery. This makes use of the "function" keyword, and has the general form:
let $f := function($param1 as typeIn1,$param2 as typeIn2) as typeOut {
      (: function body :)
      }
For instance, suppose that you wanted a function that gave you the distance between two points, represented by a two or three dimensional sequence. This would then be written as:
let $distance := function($pt1 as xs:double*, $pt2 as xs:double*) as xs:double {
      if (fn:count($pt1) = fn:count($pt2)) then
             let $dif := for $index in (1 to fn:count($pt1)) return $pt2[$index] - $pt1[$index]
             let $difsquare := $dif ! math:pow(., 2)
             return math:pow(fn:sum($difsquare), 0.5)
      else fn:error(xs:QName("xdmp:custom-error"),
             "Error: Mismatch of coordinates",
             "The number of coordinates in $pt1 does not match that in $pt2")
      }
let $point1 := (0,0)
let $point2 := (50,40)
return $distance($point1,$point2)
There are several things to note here. The first is that when a function is assigned to a variable, that variable can then be invoked like a function by providing arguments – e.g., $distance($point1,$point2). This carries a number of implications, not least of which is that if a function is assigned to a local variable, it has only local scope – it cannot be used by that name outside the scope in which it was declared. If, however, the variable is declared globally in a module, this syntax could be used:
declare variable $my:distance := function(..){..};
This is perfectly valid, and externally it would be invoked as:
$my:distance($pt1,$pt2)
Of course, at that point there's probably not much benefit to declaring the function as a variable rather than simply as a function, but it is possible.
Another point is that you can raise errors within such a locally defined function in precisely the same way you would in a module function, with the fn:error() function. Finally, note that the function expression itself does not end with a semicolon.
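Since these functions look so much like JavaScript ones, it may help to see the distance example rendered as a JavaScript function value – a sketch for comparison only, with names of my own choosing, not MarkLogic code:

```javascript
// The XQuery $distance example as a JavaScript function value
// assigned to a local variable.
const distance = function (pt1, pt2) {
  if (pt1.length !== pt2.length) {
    // The analogue of fn:error(): signal a coordinate mismatch.
    throw new Error("Mismatch of coordinates: " +
      pt1.length + " vs " + pt2.length);
  }
  // Sum the squared differences, then take the square root.
  const sumSq = pt1.reduce((acc, v, i) => acc + Math.pow(pt2[i] - v, 2), 0);
  return Math.sqrt(sumSq);
};

console.log(distance([0, 0], [50, 40])); // ≈ 64.03
```

As in the XQuery version, the function is only reachable through the variable it was assigned to, and errors are raised from inside the function body.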
In both of the functions given above, there are few compelling reasons to make these local. Where such functions do come in handy is when they employ a principle called closure, which means that the function is able to encapsulate a temporary bit of state that persists as long as the function itself does. A simple hello-world function provides a good example of this.
let $greeting := "Hello"
let $hello-world := function($name as xs:string?) as xs:string {
     $greeting || ", " || (if (fn:empty($name)) then "World" else $name)
     }
return $hello-world("Kurt")
==> "Hello, Kurt"
In this case, the variable $greeting, having already been defined in the enclosing scope, is captured by the function. The function can then reference this variable directly, effectively saving state.
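The same closure, written as the JavaScript it resembles (an illustrative sketch, not MarkLogic code):

```javascript
// `greeting` is defined in the enclosing scope and captured by the
// function below: a closure, just as in the XQuery example.
const greeting = "Hello";
const helloWorld = function (name) {
  return greeting + ", " + (name == null ? "World" : name);
};

console.log(helloWorld("Kurt")); // "Hello, Kurt"
console.log(helloWorld(null));   // "Hello, World"
```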
In practice, this is still pretty cumbersome. Where it begins to make a difference is when the function in turn generates a function as output. We can then create a whole set of generators:
declare namespace my = "http://example.com/xmlns/my";
declare function my:generate-greeting($greet-term as xs:string) as xdmp:function {
      function($name as xs:string?) as xs:string {$greet-term || ", " || 
          (if (fn:empty($name)) then "World" else $name)}
      };
let $greet1 := my:generate-greeting("Hello")
let $greet2 := my:generate-greeting("Welcome")
return ($greet1("Kurt"),$greet2("Kurt"))
The xdmp:function return type identifies the output of the function as being itself a function. The two variables $greet1 and $greet2 then become functions that take a name and generate the appropriate output message. Using a function to create a set of functions is well known in programming, and is identified as the factory pattern – each factory creates multiple functions or objects based upon some input parameter.
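The factory pattern translates almost verbatim to JavaScript; here is the same generator as a sketch (names mine):

```javascript
// A factory: each call manufactures a new greeter function that
// closes over its own greet term.
function generateGreeting(greetTerm) {
  return function (name) {
    return greetTerm + ", " + (name == null ? "World" : name);
  };
}

const greet1 = generateGreeting("Hello");
const greet2 = generateGreeting("Welcome");
console.log(greet1("Kurt")); // "Hello, Kurt"
console.log(greet2("Kurt")); // "Welcome, Kurt"
```

Each returned function carries its own captured $greet-term equivalent, which is exactly what makes the pattern useful.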
In working with the MarkLogic semantics capability (in MarkLogic 7), this factory pattern can definitely prove useful. For instance, consider a SPARQL query builder. Let's say that there are certain queries which occur quite often, and as such it makes sense to save them as files. For instance, the following retrieves the names of all items that link to a specific sem:iri. To make this query you need to specify all of the prefixes used within the SPARQL query, concatenate these (cleanly) with the query, set the sort order and limit size, and then, if desired, serialize the results out to different formats. Because you may do this a number of times (this is a remarkably common query), it seems like a good candidate for a factory. The following illustrates one such factory:
import module namespace sem="http://marklogic.com/semantics" at "/MarkLogic/semantics.xqy";
declare namespace sp="http://example.com/xmlns/sparql";
declare function sp:sparql-factory($namespace-map as item(),$query as xs:string?,$file-path as xs:string?) as xdmp:function {
     let $query := if ($file-path = "") then $query else fn:unparsed-text($file-path)
      let $prefixes := fn:string-join(for $key in map:keys($namespace-map) return ("prefix "||$key||": <"||
           map:get($namespace-map,$key) || ">"),"&#13;")||"&#13;"
      return function($arg-map as item()) as item()* {sem:sparql($prefixes||$query||" ",$arg-map)} 
 };

declare function sp:curie-factory($ns-map as item()) as xdmp:function {
    function($curie as xs:string) as sem:iri {sem:curie-expand($curie,$ns-map)}
 }; 

declare function sp:merge-maps($ns-maps as item()*) as item(){
    if (fn:count($ns-maps)=1) then 
         $ns-maps
   else 
         let $map := map:map()
         let $_ := for $ns-map in $ns-maps return for $key in map:keys($ns-map) return map:put($map,$key,map:get($ns-map,$key))
         return $map
 };
declare variable $sp:ns-map := map:new((
map:entry("rdf","http://www.w3.org/1999/02/22-rdf-syntax-ns#"),
map:entry("context","http://disney.com/xmlns/context"),
map:entry("semantics","http://marklogic.com/semantics"),
map:entry("fn","http://www.w3.org/2005/xpath-functions#"),
map:entry("xs","http://www.w3.org/2001/XMLSchema#"),
map:entry("rdfs","http://www.w3.org/2000/01/rdf-schema#"),
map:entry("owl","http://www.w3.org/2002/07/owl#"),
map:entry("skos","http://www.w3.org/2004/02/skos/core#"),
map:entry("xdmp","http://marklogic.com/xdmp#"),
map:entry("entity","http://example.com/xmlns/class/Entity/"),
map:entry("class","http://example.com/xmlns/class/Class/"),
map:entry("map","http://marklogic.com/map#")));
let $curie := sp:curie-factory($sp:ns-map)
let $links-query := sp:sparql-factory($sp:ns-map,"","/sparql/linkList.sp")
return $links-query(map:entry("s",$curie("class:Person"))) ! map:get(.,"name")
This code actually defines two factories. The curie-factory takes a map of prefixes and namespaces and binds these into a function that will attempt to match a term’s namespace with one of those in the map. If it’s there, then this function will generate the appropriate sem:iri from the shortened curie form.
The second function, sparql-factory, takes the map of namespaces indexed by prefix and uses this to generate the prefix headers for the query. This can become significant when the number of namespaces is high, and it's a small step from this to saving these namespaces in a file and updating them when new namespaces are added.
The function in turn generates a new function that either takes the query supplied in text or loads it in from a supplied text file stored in the data directory.  The newly created function can then take a map of parameters to return an associated json map or sem:triples list. In this case the output is a list of names that satisfy the query itself.
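The core of the sparql-factory idea – bind the prefix header once, get back a function that runs any query with those prefixes attached – can be sketched in JavaScript. Here `execute` is a stand-in I've introduced for sem:sparql, not a real API:

```javascript
// Sketch of the sparql-factory pattern: the prefix header is computed
// once and captured by the returned query function.
// `execute` stands in for sem:sparql (an assumption, not a real API).
function sparqlFactory(namespaceMap, query, execute) {
  const prefixes = Object.entries(namespaceMap)
    .map(([key, ns]) => "prefix " + key + ": <" + ns + ">")
    .join("\n") + "\n";
  return function (args) {
    return execute(prefixes + query, args);
  };
}

// For illustration, the executor just echoes the assembled query text.
const linksQuery = sparqlFactory(
  { rdfs: "http://www.w3.org/2000/01/rdf-schema#" },
  "select ?name where { ?s rdfs:label ?name }",
  (text, args) => text
);
console.log(linksQuery({}));
```

The design benefit is the same as in the XQuery version: callers pass only the bindings, never the prefix boilerplate.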
One final note, while talking about maps. You can assign a function to a map, which can then be persisted and retrieved.  In the above example, for instance, you could have persisted the functions:
xdmp:document-insert("/functions/named-functions",
        map:new((map:entry("curie",$curie),map:entry("links",$links-query))))
In another routine, you could then retrieve these:
let $fn-map := map:map(fn:doc("/functions/named-functions")/map:map)
let $links-fn := map:get($fn-map,"links")
let $curie-fn := map:get($fn-map,"curie")
let $links := $links-fn(map:entry("s",$curie-fn("class:Person")))
return $links ! map:get(.,"name")
This becomes especially useful when tracking “named queries” produced by users.
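The lookup-by-name half of this technique has a direct JavaScript analogue (the document persistence itself is MarkLogic-specific; this sketch is in-memory only):

```javascript
// Named functions kept in a map and looked up by key, analogous to
// retrieving $curie and $links-query by name from a stored map.
// Both functions here are hypothetical examples.
const registry = new Map();
registry.set("curie", (s) => "<http://example.com/" + s + ">");
registry.set("shout", (s) => s.toUpperCase());

const curieFn = registry.get("curie");
console.log(curieFn("class:Person"));        // "<http://example.com/class:Person>"
console.log(registry.get("shout")("hello")); // "HELLO"
```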
There are other design patterns that can also use functions as arguments (most notably the decorator pattern), but in general the principle is the same – by using local functions it becomes possible to create general functions that either produce or consume functions of their own, giving you considerably more power in writing code.

Monday, August 12, 2013

Maps, JSON and Sparql - A Peek Under the Hood of MarkLogic 7.0

Blogging while attempting to run a full time consulting job can be frustrating - you get several weeks of one article a week with some serious momentum, then life comes along and you're pulling sixty hour weeks and your dreams begin to resemble IDE screens. I've been dealing with both personal tragedies - my mother succumbed to a form of blood cancer last month less than a year after being diagnosed - and have also been busy with a semantics project for a large media broadcast company.

I've also had a chance to preview the MarkLogic 7 pre-release, and have been happily making a pest of myself on the MarkLogic forums as a consequence. My opinion about the new semantics capability is mixed but generally positive. I think that once they release 7.0, the capabilities that MarkLogic has in that space should catapult them into a major force in the semantic space at a time when semantics seems to finally be getting hot.

As I was working on code recently for a client, though, I had a sudden, disquieting revelation: for all the code that I was writing, surprisingly little of it - almost none, in fact - was involved in manipulating XML. Instead, I was spending a lot of time working with maps, JSON objects, higher order functions, and SPARQL. The XQuery code was still the substrate for all this, mind you, but this was not an XML application - it was an application that worked with hash tables, assertions, factories and other interesting ephemera that seem to be intrinsic to coding in the 2010s.

There are a few interesting tips that I picked up that illustrate what you can do with these. For instance, I first encountered the concat operator - "||" - just recently, though it seems to have sneaked into ML 6 when I wasn't looking. This operator eliminates (or at least reduces) the need for the fn:concat function:

let $str1 := "test"
let $str2 := "This is a " || $str1 || ". This is only a " || $str1 || "."
return $str2
==> "This is a test. This is only a test."

XQuery has a tendency to be parenthesis heavy, and especially when putting together complex strings, trying to track whether you are inside or outside the string scope can be an onerous chore. The || operator seems like a little win, but I find that in general it is easier to keep track of string construction this way.

Banging on Maps

Another useful operator is the map operator "!", also known as the "bang" operator. This one is specific to ML7, and you will find yourself using it a fair amount. The map operator in effect acts like a "map" operation (for those familiar with map/reduce functionality) - it iterates through a sequence of items and establishes a context for subsequent operations. For instance, consider a sequence of colors and how these could be wrapped up in <color> elements:

let $colors := "red,orange,yellow,green,cyan,blue,violet"
return <colors>{fn:tokenize($colors,",") ! <color>{.}</color>}</colors>
=> <colors>
      <color>red</color>
      <color>orange</color>
      <color>yellow</color>
      <color>green</color>
      <color>cyan</color>
      <color>blue</color>
      <color>violet</color>
   </colors>

The dot in this case is the same as the dot context in a predicate - a context item in a sequence. This is analogous to the statements:

let $colors := "red,orange,yellow,green,cyan,blue,violet"
return <colors>{for $item in fn:tokenize($colors,",") return <color>{$item}</color>}</colors>

save that it is not necessary to declare a specific named variable for the iterator.
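For readers coming from JavaScript, the bang operator is roughly Array.prototype.map - a rough analogue of the color example (the XQuery above is the MarkLogic form; this is just for comparison):

```javascript
// The bang operator behaves much like Array.prototype.map: each item
// becomes the context for the expression on the right.
const colors = "red,orange,yellow,green,cyan,blue,violet";
const wrapped = colors.split(",").map(c => "<color>" + c + "</color>");
const xml = "<colors>" + wrapped.join("") + "</colors>";
console.log(xml);
```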

This can come in handy with another couple of useful functions - map:entry() and map:new(). The map:entry() function takes two arguments - a key and a value - and, as expected, constructs a map from these:

map:entry("red","#ff0000")
=> <map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="red">
<map:value xsi:type="xs:string">#ff0000</map:value>
</map:entry>
</map:map>

The map:new() function takes a sequence of map:entries as arguments, and constructs a compound map. 

map:new((
   map:entry("red","#ff0000"),
   map:entry("blue","#0000ff"),
   map:entry("green","#00ff00")
   ))
=>
<map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="blue">
<map:value xsi:type="xs:string">#0000ff</map:value>
</map:entry>
<map:entry key="red">
<map:value xsi:type="xs:string">#ff0000</map:value>
</map:entry>
<map:entry key="green">
<map:value xsi:type="xs:string">#00ff00</map:value>
</map:entry>
</map:map>

Note that if the same key is used more than once, then the latter value replaces the former.

map:new((
   map:entry("red","#ff0000"),
   map:entry("blue","#0000ff"),
   map:entry("red","rouge")
   ))
=>
<map:map xmlns:map="http://marklogic.com/xdmp/map" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<map:entry key="blue">
<map:value xsi:type="xs:string">#0000ff</map:value>
</map:entry>
<map:entry key="red">
<map:value xsi:type="xs:string">rouge</map:value>
</map:entry>
</map:map>

Additionally, the order that the keys are stored in is unpredictable - a hash is a bag, not a sequence.
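The last-one-wins behavior for repeated keys is the same as in a JavaScript Map, sketched here for comparison (with the caveat noted in the comment):

```javascript
// Like map:new(), a JavaScript Map keeps only the last value set for a
// repeated key. (Unlike MarkLogic maps, a JS Map does preserve
// insertion order of keys.)
const colorMap = new Map([
  ["red", "#ff0000"],
  ["blue", "#0000ff"],
  ["red", "rouge"], // repeated key: replaces the earlier value
]);
console.log(colorMap.get("red")); // "rouge"
console.log(colorMap.size);       // 2
```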

The bang operator works well with maps. For instance, you can cut down on verbiage by establishing the map as the context and then getting the associated values:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return $colors ! map:get(.,"red")
=> "#ff0000"

You can also use it to iterate through keys:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return map:keys($colors) ! map:get($colors,.)
=> "#ff0000"
=> "#0000ff"
=> "#00ff00"

and can chain contexts:

let $colors := map:new((
    map:entry("red","#ff0000"),
    map:entry("blue","#0000ff"),
    map:entry("green","#00ff00")
    ))
return <table><tr>{
    map:keys($colors) ! (
      let $key := . 
      return map:get($colors,.) ! 
        <td style="color:{.}">{$key}</td>
    )}
</tr></table>
==>
<table>
   <tr>
       <td style="color:#ff0000">red</td>
       <td style="color:#0000ff">blue</td>
       <td style="color:#00ff00">green</td>
   </tr>
</table>
Here, the context changes - after the first bang operator the dot context holds the keys ("red", "blue" and "green" respectively). After the second bang operator, the dot context holds the values those keys map to in $colors: ("#ff0000","#0000ff","#00ff00"). These are then used in turn to set the color of the text. Notice that you can also capture the context item in a variable (and use it further along, so long as you are within the XQuery scope for that variable) - here $key is bound to the respective color name.

Again, this is primarily a shorthand for the for $item in $sequence statement, but it's a very useful shortcut.

Maps and JSON

MarkLogic maps look a lot like JSON objects. Internally, they are similar, though not quite identical, the primary difference being that maps are intrinsically hashes, while JSON objects may be sequences of hashes. MarkLogic 7 supports both of these objects, and can use the map operators and the bang operator to work with internal JSON objects.

For instance, suppose that you set up a JSON string (or import it from an external data call). You can use the xdmp:from-json() function to convert the string into an internal ML JSON object:

import module namespace json="http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";
let $json-str := '[{name:"Aleria",vocation:"mage",species:"elf",gender:"female"},
{name:"Gruarg",vocation:"warrior",species:"half-orc",gender:"male"},
{name:"Huara",vocation:"cleric",species:"human",gender:"female"}]'
let $characters := xdmp:from-json($json-str)

This list can then be referenced using the sequence and map operators. For instance, you can get the first item using a predicate index:

$characters[1]
=> 
{"name": "Aleria","class": "mage","species": "elf","gender": "female"}

You can get a specific entry by using the map:get() operator:

map:get($characters[1],"name")
=> Aleria

You can update an entry using the map:put() operator:

let $_ := map:put($characters[1],"species","half-elf")
return $characters[1]
=> 

{"name": "Aleria","vocation": "mage","species": "half-elf","gender": "female"}

You can use keys:

map:keys($characters[1])
=> ("name","vocation","species","gender")

and you can use the ! operator:

$characters ! (map:get(.,"name") || " [" || map:get(.,"species") || "]")
=> Aleria [half-elf]
=> Gruarg [half-orc]
=> Huara [human]

The xdmp:to-json() function will convert an object back into the corresponding JSON string, making it handy to work with MarkLogic in a purely JSONic mode. You can also convert json objects into XML:

<map>{$characters}</map>/*
=>
<json:array xmlns:json="http://marklogic.com/xdmp/json" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Aleria</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">mage</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">elf</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">female</json:value>
</json:entry>
</json:object>
</json:value>
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Gruarg</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">warrior</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">half-orc</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">male</json:value>
</json:entry>
</json:object>
</json:value>
<json:value>
<json:object>
<json:entry key="name">
  <json:value xsi:type="xs:string">Huara</json:value>
</json:entry>
<json:entry key="vocation">
  <json:value xsi:type="xs:string">cleric</json:value>
</json:entry>
<json:entry key="species">
  <json:value xsi:type="xs:string">human</json:value>
</json:entry>
<json:entry key="gender">
  <json:value xsi:type="xs:string">female</json:value>
</json:entry>
</json:object>
</json:value>
</json:array>

This format can then be transformed to other XML formats, a topic for another blog post.
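For orientation, the xdmp:from-json()/xdmp:to-json() round trip approximates what JSON.parse()/JSON.stringify() do in JavaScript - a comparison sketch only (note that strict JSON requires quoted keys, unlike the relaxed string above):

```javascript
// Parse a JSON string, mutate an entry, inspect keys, serialize back -
// the JS analogues of xdmp:from-json, map:put, map:keys, xdmp:to-json.
const jsonStr =
  '[{"name":"Aleria","vocation":"mage","species":"elf","gender":"female"}]';
const characters = JSON.parse(jsonStr);

characters[0].species = "half-elf";      // like map:put()
const keys = Object.keys(characters[0]); // like map:keys()
console.log(keys);
console.log(JSON.stringify(characters[0])); // like xdmp:to-json()
```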

SPARQL and Maps

These capabilities are indispensable for working with SPARQL. Unless otherwise specified, SPARQL queries in MarkLogic generate sequences of JSON-style maps, and use regular maps for passing in parameters. For instance, suppose you load the following Turtle data:

let $turtle := '
@prefix class: <http://www.example.com/xmlns/Class/>.
@prefix character: <http://www.example.com/xmlns/Character/>.
@prefix species: <http://www.example.com/xmlns/Species/>.
@prefix gender:  <http://www.example.com/xmlns/Gender/>.
@prefix vocation:  <http://www.example.com/xmlns/Vocation/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

character:Aleria rdf:type class:Character;
       character:species species:HalfElf;
       character:gender gender:Female;
       character:vocation vocation:Mage;
       rdfs:label "Aleria".
character:Gruarg rdf:type class:Character;
       character:species species:HalfOrc;
       character:gender gender:Male;
       character:vocation vocation:Warrior;
       rdfs:label "Gruarg".
character:Huara rdf:type class:Character;
       character:species species:Human;
       character:gender gender:Female;
       character:vocation vocation:Cleric;
       rdfs:label "Huara".
character:Drina rdf:type class:Character;
       character:species species:HalfElf;
       character:gender gender:Female;
       character:vocation vocation:Archer;
       rdfs:label "Drina".
gender:Female rdf:type class:Gender;
       rdfs:label "Female".
gender:Male rdf:type class:Gender;
       rdfs:label "Male".
species:HalfElf rdf:type class:Species;
       rdfs:label "Half-Elf".
species:Human rdf:type class:Species;
       rdfs:label "Human".
species:HalfOrc rdf:type class:Species;
       rdfs:label "Half-Orc".
vocation:Warrior rdf:type class:Vocation;
       rdfs:label "Warrior".
vocation:Mage rdf:type class:Vocation;
       rdfs:label "Mage".
vocation:Cleric rdf:type class:Vocation;
       rdfs:label "Cleric".'
let $triples := sem:rdf-parse($turtle,"turtle")
return sem:rdf-insert($triples)

You can then retrieve the names of all female half-elves from the dataset with a sparql query:

let $namespaces := 'prefix class: <http://www.example.com/xmlns/Class/>
prefix character: <http://www.example.com/xmlns/Character/>
prefix species: <http://www.example.com/xmlns/Species/>
prefix gender:  <http://www.example.com/xmlns/Gender/>
prefix vocation:  <http://www.example.com/xmlns/Vocation/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
'

let $character-maps := sem:sparql($namespaces ||
'select ?characterName ?vocationLabel where {
?character rdfs:label ?characterName.
?character character:species ?species.
?species rdfs:label ?speciesLabel.
?character character:gender ?gender.
?gender rdfs:label ?genderLabel.
?character character:vocation ?vocation.
?vocation rdfs:label ?vocationLabel.
}',
map:new((map:entry("genderLabel","Female"),map:entry("speciesLabel","Half-Elf"))))
return $character-maps ! (map:get(.,"characterName") || " [" || map:get(.,"vocationLabel") || "]")  
=> "Aleria [Mage]"
=> "Drina [Archer]" 

The query passes a map with two entries, one specifying the gender label, the other the species label. Note that in this case we're not actually passing in the iris, but matching on text. These are then used by the SPARQL engine to determine the associated character name and vocation label, with the output being a sequence of JSON hash-map objects. This result is used by the bang operator to retrieve the values for each specific record.

It is possible to get SPARQL output in other formats: use the sem:query-results-serialize() function on the SPARQL results with an option of "xml", "json" or "triples", such as:

sem:query-results-serialize($character-maps,"json")

which is especially useful when using MarkLogic as a SPARQL endpoint, but internally, sticking with maps for processing is probably the fastest and easiest way to work with SPARQL in your applications.

Summary

There is no question that these capabilities are changing the way that applications are written in MarkLogic, and they represent a shift in the server from being primarily an XML database (though certainly it can still be used this way) to being increasingly its own beast, something capable of working just as readily with JSON and RDF while employing a more contemporary set of coding practices.

In the next column, I'm going to shift gears somewhat and look at higher order functions and how they are used in the MarkLogic environment.


Thursday, May 23, 2013

Much Ado About Nothing: Blank Nodes in RDF

Here's a secret - you want to understand a data format? Learn its query language. I've worked heavily with XQuery for several years now, but only fairly recently (three years now) did I start working with SPARQL for RDF (along with the various view languages that CouchBase and Mongo expose), and it's given me far more insight into RDF than eight years spent trying to understand what the language was all about before then. Indeed, I'd go so far as to say that SPARQL makes RDF, dare I say it, accessible.

For instance, consider one of the more vexing aspects of RDF - how do you deal with composition vs. aggregation? Now, before getting too deep into the realm of modeling, it's worth taking a look at what each of these mean.

In XML, you tend to describe aggregations and compositions the same way - a specific element of one type holds a collection of elements of another type (or subclasses thereof). For instance, a resume element may include a one to many relationship with specific jobs that were held. It may also contain a one to many relationship with the articles, papers or books that you have written. Both of these are called associations - you are associating a given entity with another entity, but they are not quite the same type of thing.

To understand why, you need to ask the question - does the associated object have any meaning outside the context of the containing object? In the case of books, the answer is most assuredly yes - you may not be the only author, the books are likely available by ISBN or on the web, and if you delete the resume, the books do not themselves disappear. In this case, you're dealing with an aggregation - the child entities have  a distinct identity outside the boundaries of the container. In RESTful terms, the child entities are addressable resources.

Composition On the Job

The case of jobs is a little harder. A job is a description of a state - what you were doing at any given time. While the job may have a job title, an associated company and the like, it effectively is a short-hand way of talking about something that you do or did at some point. Take away the context - you - and such jobs generally make much less sense. In a composition, then, the child entities being described are more like states than they are like objects - they are generally only "locally" addressable relative to the containing element or context.

In XML, such structures occur all the time:

<resume>
       <name>Jane Doe</name>
       <job>
             <jobTitle>Architect</jobTitle>
             <company>Colossal Corp.</company>
             <startDate>2011-07</startDate>
             <description>Designed cool stuff.</description>
      </job>
       <job>
             <jobTitle>Senior Programmer</jobTitle>
             <company>Big IT Corp.</company>
             <startDate>2008-05</startDate>
             <endDate>2011-04</endDate>
             <description>Programmed in many cool languages, and built some stuff.</description>
      </job>
       <job>
             <jobTitle>Junior Programmer</jobTitle>
             <company>Small IT Corp.</company>
             <startDate>2005-05</startDate>
             <endDate>2008-04</endDate>
             <description>Programmed in a couple of cool languages, and built some other stuff.</description>
      </job>
</resume>

So the question here is whether a job is an aggregation or a composition. A good way of thinking about this is to ask yourself whether, if you turned each job into a separate XML document, it would have enough context information to make sense:


       <job>
             <jobTitle>Junior Programmer</jobTitle>
             <company>Small IT Corp.</company>
             <startDate>2005-05</startDate>
             <endDate>2008-04</endDate>
             <description>Programmed in a couple of cool languages, and built some other stuff.</description>
      </job>


Here there's no "person" that this belongs to - it could be a job held by anybody (or multiple people at the same time, conceivably). Without that context the information here is insufficient to be useful, save in indicating that someone claims to have worked at a given place. <job> is clearly a composition relationship.

Now zoom down another level to company:


             <company>Small IT Corp.</company>

Curiously enough, this "document" is actually a stand-alone entity. If I had a database with resumes from a number of different people, being able to see all companies represented in the database would be a key requirement. What's more, as a database designer I'd probably want to ensure that there is one and only one such representation of that entity's name, to avoid partitioning the data unnecessarily. It's an aggregation.

In RDF (as expressed in Turtle), this becomes much more evident (I'm suppressing namespaces here for ease of reading):
person:Jane_Doe rdf:type class:Person;
          person:name "Jane Doe";
          person:job [
                  job:jobTitle "Architect";
                  job:company company:ColossalCorp;
                  job:startDate "2011-07"^^xs:date;
                  job:description "Designed cool stuff"
                  ],

                  [
                  job:jobTitle "Senior Programmer";
                  job:company company:BigITCorp;
                  job:startDate "2008-05"^^xs:date;
                  job:endDate "2011- 04"^^xs:date;
                  job:description "Programmed in many cool languages, and built some stuff."
                  ],
                  [
                  job:jobTitle "Junior Programmer";
                  job:company company:SmallITCorp;
                  job:startDate "2005-05"^^xs:date;
                  job:endDate "2008- 04"^^xs:date;
                  job:description "Programmed in a couple of cool languages, and built some other stuff."
                  ].

company:ColossalCorp rdf:type class:Company;
             company:companyName "Colossal Corporation".

company:BigITCorp rdf:type class:Company;
             company:companyName "Big IT Corporation".

company:SmallITCorp rdf:type class:Company;
             company:companyName "Small IT Corporation".





If you are new to Turtle, this may take a bit of explaining. An expression like

a  b   c;
   d   e.

is a shorthand for the statements  

a  b  c.
a  d  e.

Similarly, 

a  b  c, d, e.

is a shorthand for the statements  

a  b  c.
a  b  d.
a  b  e.

The first (semicolon) form repeats the subject with new predicates and objects, while the second (comma) form repeats both the subject and the predicate with different objects.
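Applying this shorthand to the resume above (a sketch using just the first two properties of the Jane Doe description):

```turtle
# Abbreviated form:
person:Jane_Doe rdf:type class:Person;
                person:name "Jane Doe".

# Equivalent full statements:
person:Jane_Doe rdf:type class:Person.
person:Jane_Doe person:name "Jane Doe".
```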

However, what does the [] notation mean? The brackets indicate a blank node. A blank node can be thought of as the equivalent of a composition container in XML. For instance, in XML you might have the fragment


       <job>
             <jobTitle>Architect</jobTitle>
             <company>Colossal Corp.</company>
             <startDate>2011-07</startDate>
             <description>Designed cool stuff.</description>
      </job>


The <job> element itself doesn't really serve any purpose beyond indicating that its contents are thematically part of a larger construct. It's not in and of itself a property, but more properly a property "bag". As such, it can be thought of as being (somewhat) analogous to an array: you'd reference it not by name but by position in a language such as Java or JavaScript. (We'll get back to that "somewhat" shortly.)

Internally, RDF triple stores don't use numbers directly for identifying such entities. Instead, they are treated as blank nodes. A blank node is an internal resource identifier, meaning that, unlike most resources, it can't be directly accessed. Instead (at least from the standpoint of SPARQL), blank nodes are usually referenced indirectly, determined from the local context. They are usually shown in articles with the syntax _:b0, _:b1, etc. For instance, the resume in the above listing could be rewritten as:


person:Jane_Doe rdf:type class:Person;
          person:name "Jane Doe";
          person:job _:b0, _:b1, _:b2.
_:b0      job:jobTitle "Architect";
          job:company company:ColossalCorp;
          job:startDate "2011-07"^^xs:date;
          job:description "Designed cool stuff".
_:b1      job:jobTitle "Senior Programmer";
          job:company company:BigITCorp;
          job:startDate "2008-05"^^xs:date;
          job:endDate "2011-04"^^xs:date;
          job:description "Programmed in many cool languages, and built some stuff.".
_:b2      job:jobTitle "Junior Programmer";
          job:company company:SmallITCorp;
          job:startDate "2005-05"^^xs:date;
          job:endDate "2008-04"^^xs:date;
          job:description "Programmed in a couple of cool languages, and built some other stuff.".

Notation-wise, this makes the structure of each job more obvious, but it gets a little more complicated referentially. Jane Doe has had three jobs, and each job now has an internal identifier, yet the jobs make no sense outside the context of being jobs for Jane. So why not use explicit identifiers for each job?


person:Jane_Doe rdf:type class:Person;
             person:name "Jane Doe";
             person:job job:105319, job:125912, job:272421.
job:105319   job:jobTitle "Architect";
             job:company company:ColossalCorp;
             job:startDate "2011-07"^^xs:date;
             job:description "Designed cool stuff".
job:125912   job:jobTitle "Senior Programmer";
             job:company company:BigITCorp;
             job:startDate "2008-05"^^xs:date;
             job:endDate "2011-04"^^xs:date;
             job:description "Programmed in many cool languages, and built some stuff.".
job:272421   job:jobTitle "Junior Programmer";
             job:company company:SmallITCorp;
             job:startDate "2005-05"^^xs:date;
             job:endDate "2008-04"^^xs:date;
             job:description "Programmed in a couple of cool languages, and built some other stuff.".

company:ColossalCorp rdf:type class:Company;
             company:companyName "Colossal Corporation".

company:BigITCorp rdf:type class:Company;
             company:companyName "Big IT Corporation".

company:SmallITCorp rdf:type class:Company;
             company:companyName "Small IT Corporation".

From the standpoint of SPARQL, it makes very little difference. For instance, if you wanted to get the name and description of each job a person has, the SPARQL query would look something like this:

select ?personName ?jobTitle ?companyName where {
    ?person   rdf:type      class:Person;
              person:name   ?personName;
              person:job    ?job.
    ?job      job:jobTitle  ?jobTitle;
              job:company   ?company.

    ?company  company:companyName
                            ?companyName.
    }
which would produce the table:

?personName    ?jobTitle             ?companyName
"Jane Doe"     "Architect"           "Colossal Corporation"
"Jane Doe"     "Senior Programmer"   "Big IT Corporation"
"Jane Doe"     "Junior Programmer"   "Small IT Corporation"


The ?job variable, in this case, will carry either the blank-node job entry or the explicitly defined job in exactly the same manner. So why use a blank node? In most cases, they're used because creating explicit named resource URIs can be a pain, or because there is no real advantage to working with the entities outside of the resource context from which they're referenced. Typically, RDF triple stores optimize blank nodes separately, and ensure that blank nodes never collide within the data system itself.
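If you ever do need to tell the two cases apart, SPARQL provides an isBlank() function for exactly this purpose (a sketch assuming SPARQL 1.1 projection expressions, reusing the patterns from the query above):

```sparql
select ?job (isBlank(?job) as ?anonymous) where {
    ?person   rdf:type      class:Person;
              person:job    ?job.
    }
```

The ?anonymous column will be true for blank-node jobs and false for explicitly named ones.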

Another advantage is that blank nodes simplify both Turtle and RDF-XML notation. Turtle can use the [] square-bracket approach (and can even nest such brackets within other blank node expressions). RDF-XML, on the other hand, can represent blank nodes in two modes. In the denormalized (or embedded) form, the rdf:parseType="Resource" attribute is used on the container node to indicate that its contents should be treated as a blank node:


<resume rdf:about="person:Jane_Doe">
       <name>Jane Doe</name>
       <job rdf:parseType="Resource">
             <jobTitle>Architect</jobTitle>
             <company>Colossal Corp.</company>
             <startDate>2011-07</startDate>
             <description>Designed cool stuff.</description>
      </job>
      ...
</resume>


The same expression can also be normalized:

<rdf:RDF>
   <resume rdf:about="person:Jane_Doe">
       <name>Jane Doe</name>
       <job rdf:nodeID="b0"/>
       <job rdf:nodeID="b1"/>
       <job rdf:nodeID="b2"/>
   </resume>
   <rdf:Description rdf:nodeID="b0">
       <jobTitle>Architect</jobTitle>
       <company rdf:resource="company:ColossalCorp"/>
       <startDate rdf:datatype="xs:date">2011-07</startDate>
       <description>Designed cool stuff.</description>
   </rdf:Description>
   ...
</rdf:RDF>


The embedded format actually hints at how one can convert XML into RDF: it mostly involves resource references (e.g., company) and attributes like rdf:parseType (XML-to-RDF conversion is food for another post).


Localization with Les Nœuds Anonymes


Blank nodes can actually be very useful for dealing with groups of properties that differ based upon language or locale. RDF uses language tags (@en, @de and so on) at the end of strings to handle localization, which works in simple cases, but if there are a number of properties that are all localized (such as prices and currencies) then blank nodes may be a better solution. For instance,


book:The_Art_Of_SPARQL book:bookLocal
     [
        bookLocal:lang "EN";
        bookLocal:locale "US";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "29.95";
        bookLocal:currency "USD";
     ],

     [
        bookLocal:lang "EN";
        bookLocal:locale "UK";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "21.95";
        bookLocal:currency "GBP";
     ],
     [
        bookLocal:lang "DE";
        bookLocal:locale "DE";
        bookLocal:title "Die Kunst des SPARQL";
        bookLocal:price "24.95";
        bookLocal:currency "EUR";
     ].


In this case, the implied bookLocal objects are blank nodes. You can then retrieve the title, price and currency of a book in a different locale if the US book title is known:

select ?title ?lang ?price ?currency where {
    bind ("DE" as ?locale)
    bind ("The Art of SPARQL" as ?usTitle)
    ?book book:bookLocal [
              bookLocal:title ?usTitle;
              bookLocal:locale "US"].
    ?book book:bookLocal [
              bookLocal:locale ?locale;
              bookLocal:title ?title;
              bookLocal:lang ?lang;
              bookLocal:price ?price;
              bookLocal:currency ?currency
              ].
    }

This produces the results:

?title                    ?lang   ?price    ?currency
"Die Kunst des SPARQL"    "DE"    "24.95"   "EUR"

This information can, depending upon the SPARQL engine, be output as JSON or XML. This approach can also be useful for dealing with triples where the object is itself an XML node (such as XHTML content) where different languages are involved, since the XMLLiteral type can't directly take a @lang type extension. I use such blank node assignments quite frequently for precisely that kind of issue, making data modeling when dealing with localization considerably easier.
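For comparison, the simple case mentioned earlier attaches language tags directly to the literals themselves (a sketch reusing the book: namespace from above; book:title is an assumed property name, not part of a fixed vocabulary):

```turtle
book:The_Art_Of_SPARQL book:title "The Art of SPARQL"@en,
                                  "Die Kunst des SPARQL"@de.
```

This works fine for a single string property, but once locale, price and currency need to travel together, the grouped blank-node form shown above scales considerably better.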

It's also worth noting that even though the nodes themselves are blank (or anonymous), that doesn't mean they can't be associated with schema types. For instance, in the previous example, the inclusion of a single rdf:type statement in each block means that you can use RDFS/OWL validation on each "bookLocal" object:

 book:The_Art_Of_SPARQL book:bookLocal
     [
        rdf:type class:bookLocal;
        bookLocal:lang "EN";
        bookLocal:locale "US";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "29.95";
        bookLocal:currency "USD";
     ],

These basically correspond to anonymous objects within JavaScript, with the added benefit that with RDF you can still do all of the type constraint checks that you could do with XSD (and more).


The Edge of the Graph


One final note about blank nodes: they are very useful for establishing the boundaries between objects, for purposes such as getting the result of a DESCRIBE query or deleting distinct objects (rather than just individual assertions) from a database. The DESCRIBE statement, when given a subject, follows each assertion from the initial subject to each object. If the object is an atomic value, the assertion is kept, but no further search is done. If the object is a resource URI, the same thing happens.

However, if the object is a blank node, then the blank node is added as a subject and the same process is repeated. Deleting an object then involves doing a DESCRIBE on the resource, finding all of the assertions reached this way (including those whose subjects are blank nodes that appear as objects of other assertions), and deleting all of these links.
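As a sketch of such a deletion (assuming SPARQL 1.1 Update support and the person: and job: namespaces from the earlier examples), removing Jane Doe's resume along with its blank-node jobs might look like:

```sparql
# Delete the resume's own assertions and those of any blank-node jobs.
delete {
    person:Jane_Doe ?p ?o.
    ?job ?jp ?jo.
}
where {
    person:Jane_Doe ?p ?o.
    optional {
        person:Jane_Doe person:job ?job.
        ?job ?jp ?jo.
    }
}
```

Because the blank-node jobs are unreachable once the resume is gone, deleting them at the same time keeps the store free of orphaned assertions.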

This can be especially useful for handling updates into RDF Databases. I'll be covering this in more details in a later post.


Nothing Much? Nah!


Blank nodes are a useful feature of RDF, are well supported by SPARQL and Turtle notation, and can help to differentiate between aggregate structures (references to external objects within the system) and composed structures (references to internal entities that only make sense within the context of a given external entity). They are used heavily by both Turtle and RDF-XML parsers, and they can both be used to define the boundaries of resources within the system and delete those resources in a logical and consistent fashion. They should be seen as indispensable tools of the working ontologist.