Thursday, May 23, 2013

Much Ado About Nothing: Blank Nodes in RDF

Here's a secret - you want to understand a data format? Learn its query language. I've worked heavily with XQuery for several years now, but only fairly recently (three years now) did I start working with SPARQL for RDF (along with the various view languages that Couchbase and Mongo expose), and it's given me far more insight into RDF than the eight years I spent trying to understand what the language was all about before then. Indeed, I'd go so far as to say that SPARQL makes RDF, dare I say it, accessible.

For instance, consider one of the more vexing aspects of RDF - how do you deal with composition vs. aggregation? Now, before getting too deep into the realm of modeling, it's worth taking a look at what each of these mean.

In XML, you tend to describe aggregations and compositions the same way - a specific element of one type holds a collection of elements of another type (or subclasses thereof). For instance, a resume element may have a one-to-many relationship with the specific jobs that were held. It may also have a one-to-many relationship with the articles, papers or books that you have written. Both of these are called associations - you are associating a given entity with another entity - but they are not quite the same type of thing.

To understand why, you need to ask the question - does the associated object have any meaning outside the context of the containing object? In the case of books, the answer is most assuredly yes - you may not be the only author, the books are likely available by ISBN or on the web, and if you delete the resume, the books do not themselves disappear. In this case, you're dealing with an aggregation - the child entities have a distinct identity outside the boundaries of the container. In RESTful terms, the child entities are addressable resources.

Composition On the Job

The case of jobs is a little harder. A job is a description of a state - what you were doing at any given time. While the job may have a job title, an associated company and the like, it effectively is a short-hand way of talking about something that you do or did at some point. Take away the context - you - and such jobs generally make much less sense. In a composition, then, the child entities being described are more like states than they are like objects - they are generally only "locally" addressable relative to the containing element or context.

In XML, such structures occur all the time:

<resume>
   <name>Jane Doe</name>
   <job>
      <jobTitle>Architect</jobTitle>
      <company>Colossal Corp.</company>
      <startDate>2011-07</startDate>
      <description>Designed cool stuff.</description>
   </job>
   <job>
      <jobTitle>Senior Programmer</jobTitle>
      <company>Big IT Corp.</company>
      <startDate>2008-05</startDate>
      <endDate>2011-04</endDate>
      <description>Programmed in many cool languages, and built some stuff.</description>
   </job>
   <job>
      <jobTitle>Junior Programmer</jobTitle>
      <company>Small IT Corp.</company>
      <startDate>2005-05</startDate>
      <endDate>2008-04</endDate>
      <description>Programmed in a couple of cool languages, and built some other stuff.</description>
   </job>
</resume>

So the question here is whether a job is an aggregation or a composition. A good way of thinking about this is to ask yourself whether, if you turned each job into a separate XML document, it would have enough context information to make sense:


<job>
   <jobTitle>Junior Programmer</jobTitle>
   <company>Small IT Corp.</company>
   <startDate>2005-05</startDate>
   <endDate>2008-04</endDate>
   <description>Programmed in a couple of cool languages, and built some other stuff.</description>
</job>


Here there's no "person" that this belongs to - it could be a job held by anybody (or, conceivably, by multiple people at the same time). Without that context, the information here is insufficient to be useful, save in indicating that someone claims to have worked at a given place. <job> is clearly a composition relationship.

Now zoom down another level to company:


             <company>Small IT Corp.</company>

Curiously enough, this "document" is actually a stand-alone entity. If I had a database with resumes from a number of different people, being able to see all of the companies represented in the database would be a key requirement, and what's more, as a database designer I'd probably want to ensure that there is one and only one representation of that entity's name, to avoid partitioning the data unnecessarily. It's an aggregation.

In RDF (as expressed in Turtle), this becomes much more evident (I'm suppressing namespaces here for ease of reading):
person:Jane_Doe rdf:type class:Person;
          person:name "Jane Doe";
          person:job [
                  job:jobTitle "Architect";
                  job:company company:ColossalCorp;
                  job:startDate "2011-07"^^xs:date;
                  job:description "Designed cool stuff"
                  ],
                  [
                  job:jobTitle "Senior Programmer";
                  job:company company:BigITCorp;
                  job:startDate "2008-05"^^xs:date;
                  job:endDate "2011-04"^^xs:date;
                  job:description "Programmed in many cool languages, and built some stuff."
                  ],
                  [
                  job:jobTitle "Junior Programmer";
                  job:company company:SmallITCorp;
                  job:startDate "2005-05"^^xs:date;
                  job:endDate "2008-04"^^xs:date;
                  job:description "Programmed in a couple of cool languages, and built some other stuff."
                  ].

company:ColossalCorp rdf:type class:Company;
             company:companyName "Colossal Corporation".

company:BigITCorp rdf:type class:Company;
             company:companyName "Big IT Corporation".

company:SmallITCorp rdf:type class:Company;
             company:companyName "Small IT Corporation".





If you are new to Turtle, this may take a bit of explaining. An expression like

a  b   c;
   d   e.

is a shorthand for the statements  

a  b  c.
a  d  e.

Similarly, 

a  b  c, d, e.

is a shorthand for the statements  

a  b  c.
a  b  d.
a  b  e.

The first (semicolon) set of statements assume that the subject is repeated but with new predicates and objects, while the second (comma) set of statements assume a common subject and predicate but different objects.
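To make this concrete with the resume data above, the statement

person:Jane_Doe rdf:type class:Person;
          person:name "Jane Doe".

expands to

person:Jane_Doe rdf:type class:Person.
person:Jane_Doe person:name "Jane Doe".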

However, what does the [] notation mean? The brackets indicate a blank node. A blank node can be thought of as the equivalent of a composition container in XML. For instance, in XML you have the fragment


<job>
   <jobTitle>Architect</jobTitle>
   <company>Colossal Corp.</company>
   <startDate>2011-07</startDate>
   <description>Designed cool stuff.</description>
</job>


The <job> element itself doesn't really serve any purpose beyond indicating that its contents are thematically part of a larger construct. It's not in and of itself a property, but more properly a property "bag". As such, it can be thought of as being (somewhat) analogous to an array - you'd reference it not by name but by position in a language such as Java or JavaScript. (We'll get back to that (somewhat) reference shortly).

Internally, RDF triple stores don't use numbers directly for identifying such entities. Instead, they are treated as blank nodes. A blank node is an internal resource identifier, meaning that, unlike most resources, it can't be directly accessed. Instead, (at least from the standpoint of SPARQL) they are usually referenced indirectly by being determined from the local context. Blank nodes are usually shown in articles as having the syntax _:b0, _:b1, etc. For instance, the resume in the above list could be rewritten as:


person:Jane_Doe rdf:type class:Person;
          person:name "Jane Doe";
          person:job _:b0, _:b1, _:b2.
_:b0      job:jobTitle "Architect";
          job:company company:ColossalCorp;
          job:startDate "2011-07"^^xs:date;
          job:description "Designed cool stuff".
_:b1      job:jobTitle "Senior Programmer";
          job:company company:BigITCorp;
          job:startDate "2008-05"^^xs:date;
          job:endDate "2011-04"^^xs:date;
          job:description "Programmed in many cool languages, and built some stuff.".
_:b2      job:jobTitle "Junior Programmer";
          job:company company:SmallITCorp;
          job:startDate "2005-05"^^xs:date;
          job:endDate "2008-04"^^xs:date;
          job:description "Programmed in a couple of cool languages, and built some other stuff.".

Notation-wise, this makes the structure of each job more obvious, but gets a little more complicated referentially. Jane Doe has had three jobs. Each job has an internal identifier, but because the jobs make no sense outside the context of being jobs for Jane, those identifiers stay internal. So why not use explicit identifiers for each job?


person:Jane_Doe rdf:type class:Person;
             person:name "Jane Doe";
             person:job job:105319, job:125912, job:272421.
job:105319   job:jobTitle "Architect";
             job:company company:ColossalCorp;
             job:startDate "2011-07"^^xs:date;
             job:description "Designed cool stuff".
job:125912   job:jobTitle "Senior Programmer";
             job:company company:BigITCorp;
             job:startDate "2008-05"^^xs:date;
             job:endDate "2011-04"^^xs:date;
             job:description "Programmed in many cool languages, and built some stuff.".
job:272421   job:jobTitle "Junior Programmer";
             job:company company:SmallITCorp;
             job:startDate "2005-05"^^xs:date;
             job:endDate "2008-04"^^xs:date;
             job:description "Programmed in a couple of cool languages, and built some other stuff.".

company:ColossalCorp rdf:type class:Company;
             company:companyName "Colossal Corporation".

company:BigITCorp rdf:type class:Company;
             company:companyName "Big IT Corporation".

company:SmallITCorp rdf:type class:Company;
             company:companyName "Small IT Corporation".

From the standpoint of SPARQL, it makes very little difference. For instance, if you wanted to get the name and description of each job a person has, the SPARQL query would look something like this:

select ?personName ?jobTitle ?companyName where {
    ?person   rdf:type      class:Person;
              person:name   ?personName;
              person:job    ?job.
    ?job      job:jobTitle  ?jobTitle;
              job:company   ?company.

    ?company  company:companyName
                            ?companyName.
    }
which would produce the table:
?personName   ?jobTitle            ?companyName
"Jane Doe"    "Architect"          "Colossal Corporation"
"Jane Doe"    "Senior Programmer"  "Big IT Corporation"
"Jane Doe"    "Junior Programmer"  "Small IT Corporation"


The ?job variable, in this case, will carry either the blank node job entry or the explicitly identified job in exactly the same manner. So why use a blank node? In most cases, they're used because creating explicit named resource URIs can be a pain, or because there is no real advantage to working with them outside of the resource context from which they're referenced. Typically, most RDF triple stores optimize blank nodes separately, and ensure that blank nodes never collide within the data system itself.
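In fact, SPARQL's standard isBlank() function lets you see which is which. Here's a minimal sketch of a query against the resume data above (the expression-in-select syntax assumes a SPARQL 1.1 engine) that flags the anonymous jobs - it returns the same rows either way:

select ?jobTitle (isBlank(?job) as ?anonymousJob) where {
    person:Jane_Doe person:job ?job.
    ?job            job:jobTitle ?jobTitle.
    }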

Another advantage is that blank nodes simplify both Turtle and RDF-XML notation. Turtle can use the [] square bracket approach (and can even nest such brackets within other blank node expressions). RDF-XML, on the other hand, can represent blank nodes in two modes. In the denormalized (or embedded) form, the rdf:parseType="Resource" attribute is used on the container node to indicate that it should be represented as a blank node:


<resume rdf:about="person:Jane_Doe">
   <name>Jane Doe</name>
   <job rdf:parseType="Resource">
      <jobTitle>Architect</jobTitle>
      <company>Colossal Corp.</company>
      <startDate>2011-07</startDate>
      <description>Designed cool stuff.</description>
   </job>
   ...
</resume>


The same expression can also be normalized:

<rdf:RDF>
   <resume rdf:about="person:Jane_Doe">
       <name>Jane Doe</name>
       <job rdf:nodeID="b0"/>
       <job rdf:nodeID="b1"/>
       <job rdf:nodeID="b2"/>
   </resume>
   <rdf:Description rdf:nodeID="b0">
       <jobTitle>Architect</jobTitle>
       <company rdf:resource="company:ColossalCorp"/>
       <startDate rdf:datatype="xs:date">2011-07</startDate>
       <description>Designed cool stuff.</description>
   </rdf:Description>
   ...
</rdf:RDF>


The embedded format actually hints at how one can convert XML into RDF - it mostly involves resource references (i.e., company) and using attributes like rdf:parseType (XML to RDF conversion is food for another post).


Localization with Les Nœuds Anonymes


Blank nodes can actually be very useful for dealing with groups of properties that differ based upon language or locale. RDF uses language tags (the @lang suffix at the end of strings) to handle localization, which can work in simple cases, but if there are a number of properties that are all localized (such as prices and currencies) then blank nodes may be a better solution. For instance,


book:The_Art_Of_SPARQL book:bookLocal
     [
        bookLocal:lang "EN";
        bookLocal:locale "US";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "29.95";
        bookLocal:currency "USD"
     ],
     [
        bookLocal:lang "EN";
        bookLocal:locale "UK";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "21.95";
        bookLocal:currency "GBP"
     ],
     [
        bookLocal:lang "DE";
        bookLocal:locale "DE";
        bookLocal:title "Die Kunst des SPARQL";
        bookLocal:price "24.95";
        bookLocal:currency "EUR"
     ].


In this case, the implied bookLocal objects are blank nodes. You can then retrieve the title, price and currency of a book in a different locale if the US book title is known:

select ?title ?lang ?price ?currency where {
    bind ("DE" as ?locale)
    bind ("The Art of SPARQL" as ?usTitle)
    ?book book:bookLocal [
              bookLocal:title ?usTitle;
              bookLocal:locale "US"].
    ?book book:bookLocal [
              bookLocal:locale ?locale;
              bookLocal:title ?title;
              bookLocal:lang ?lang;
              bookLocal:price ?price;
              bookLocal:currency ?currency
              ].
    }

This produces the results:
 
?title                  ?lang   ?price    ?currency
"Die Kunst des SPARQL"  "DE"    "24.95"   "EUR"

This information can, depending upon the SPARQL engine, be output as JSON or XML. This approach can also be useful for dealing with triples where the object is itself an XML node (such as XHTML content) where different languages are involved, since the XMLLiteral type can't directly take a @lang type extension. I use such blank node assignments quite frequently for precisely that kind of issue, making data modeling when dealing with localization considerably easier.
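As a sketch of that last point (the bookLocal:content predicate here is hypothetical, following the same pattern as the example above), a per-language XHTML abstract can hang off the same kind of blank node, with the language carried as a sibling property rather than a @lang tag:

book:The_Art_Of_SPARQL book:bookLocal
     [
        bookLocal:lang "DE";
        bookLocal:locale "DE";
        bookLocal:content "<div xmlns='http://www.w3.org/1999/xhtml'><p>Eine Einführung in SPARQL.</p></div>"^^rdf:XMLLiteral
     ].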

It's also worth noting that even though the nodes themselves are blank (or anonymous), that doesn't mean that they can't be associated with schematic types. For instance, in the previous example, the inclusion of a single statement in each block means that you can use RDFS/OWL validation on each "bookLocal" object:

book:The_Art_Of_SPARQL book:bookLocal
     [
        rdf:type class:bookLocal;
        bookLocal:lang "EN";
        bookLocal:locale "US";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "29.95";
        bookLocal:currency "USD"
     ],
     ...

These basically correspond to anonymous objects within JavaScript, with the added benefit that with RDF you can still do all of the type constraint checks that you could do with XSD (and more).
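A minimal sketch of what the corresponding schema declarations might look like (the exact class and property URIs are assumptions, matching the prefixes used above):

class:bookLocal rdf:type rdfs:Class.
bookLocal:price rdf:type rdf:Property;
     rdfs:domain class:bookLocal.
bookLocal:currency rdf:type rdf:Property;
     rdfs:domain class:bookLocal.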


The Edge of the Graph


One final note about blank nodes - they are very useful for establishing the boundaries between objects for purposes such as getting the result of a Describe or deleting distinct objects (rather than just individual assertions) from a database. The describe statement, when given a subject, follows each assertion from the initial subject to each object. If the object is an atomic value, the assertion is kept, but no further search is done. If the object is an object URI, the same thing happens.

However, if the object is a blank node, then the blank node is added as a subject and the same process is used. Deleting an object involves doing a "Describe" on the resource, finding all assertions for which the subject is some other assertion's object, then deleting all of these links.
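As an illustrative sketch, using SPARQL 1.1 Update syntax (which not every store supports), deleting Jane's resume along with its blank node jobs - but not the shared company resources - might look like this:

# removes Jane's own assertions plus one level of blank node children;
# a full describe-style delete would recurse until no blank nodes remain
delete {
    person:Jane_Doe ?p ?o.
    ?o ?p2 ?o2.
}
where {
    person:Jane_Doe ?p ?o.
    optional {
        ?o ?p2 ?o2.
        filter(isBlank(?o))
    }
}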

This can be especially useful for handling updates to RDF databases. I'll be covering this in more detail in a later post.


Nothing Much? Nah!


Blank nodes are a useful feature of RDF, are well supported by SPARQL and Turtle notation, and can help to differentiate between aggregate structures (references to external objects within the system) and composed structures (references to internal entities that only make sense within the context of a given external entity). They are used heavily by both Turtle and RDF-XML parsers, and they can both be used to define the boundaries of resources within the system and delete those resources in a logical and consistent fashion. They should be seen as indispensable tools of the working ontologist. 

Semantics + Search : MarkLogic 7 Gets RDF


Let's talk Org Charts for a moment. Everyone knows what an org chart looks like. At the top you have the boss guy, Mr. Bigg, CEO of Bigg Business, Inc.

The Head Honcho

At the second tier, you have his lieutenants, the guys that head up the various functions within the organization (flipping to show relationships more clearly).





What emerges as part of this is the fact that you have what appears to be a tree. This can be made even more obvious when you start jumping to the next level (just showing a couple of people from the programming department):



If you were asked at this point what the best way to store information about this organization chart would be, chances are pretty good that you'd opt for a language like XML or JSON, since there is a clear container/contained relationship between all the parties. However, suppose for a moment that the IT programmers are intended to be support personnel. While Ian Geek signs their paychecks, they actually work in different departments. Then you end up with something like this.




Oops.

What had been a nice, clean hierarchical chart suddenly gets considerably more complicated. There are actually three things worth noting here. First, we've jumped from having one relationship - "reports to" - to having two - "reports to" and "assists". Second, the "assists" relationship does not in fact follow a hierarchical distribution at all - Jane supports both Owen Munny and Bartholomew Bigg, who are ostensibly at different levels within the organization. A side note (and the third thing in the list) is that whereas the "reports to" relationship is always one-to-one, that's not true of "assists" - Jane assists two people.
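In graph terms - a minimal Turtle sketch, where the org: predicates and person URIs are made up for illustration - both relationships coexist without forcing a hierarchy:

person:Jane org:reportsTo person:Ian_Geek;
     org:assists person:Owen_Munny, person:Bartholomew_Bigg.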

This is the reason why data designers and information architects get the big bucks - the real world is usually considerably messier than a hierarchy. If you had a SQL database, modeling this wouldn't be all that hard - every person gets a primary key that identifies them, the "reports to" relationship becomes a foreign key pointing directly to that person's manager, and the "assists" relationship becomes a join table that maps the primary key to zero or more associated keys.

However, there may in fact be any number of reasons why you would prefer to keep your data in XML (or in JSON - the same arguments apply there). Hierarchies are convenient - they provide a means for organizing information in a consistent fashion, they provide levels of abstraction for collections, and they are far more useful for transferring information than linear tables. As long as you're dealing with properties, or bags of properties, they work very nicely indeed.

The problem comes when you start dealing with discrete objects. In an organization, a person is a discrete object. A person may have one or more names, one or more job titles, one or more locations, one or more responsibilities. Some of these are simple properties - a name or a job title is a simple text string, for instance, and the start (and potentially end) dates are, well, dates. Other properties get a little more complex, and really depend upon what specifically is being modeled.

Locations are good examples of this - most organizations have rooms (or at least cubicles) where specific people work, and a way of identifying each room via some form of key. The location is in fact a "thing" - a resource - it is in a certain building, on a specific floor, and may have its own phone number. It makes sense for the person to have a relationship with that location, but that relationship isn't necessarily a containment or ownership relationship. A room may have more than one person in it, or be a temporary cubicle for contractors. A person may have multiple places they work from, including possibly from home.

Again, all of this can be modeled. However, there's something of a twist here. Suppose that in your organization, one database holds the names, positions, salaries and other business-related information about a person. Another database holds the locations and the names of the people who are at those locations.

And then a reorganization occurs. You are tasked with identifying the org chart relationships and then, when possible, moving people so that they are closest first to the people they assist and then, when possible, to the people who manage them. In an organization with 50 people, this can be a time-consuming activity. In an organization with 10,000 people, it becomes a logistical nightmare. I should note here that a great number of business problems (and opportunities) share the same basic characteristics; I'm laying this one out just as an example.

There are actually several key issues that are brought up here. The first is the fact that most organizations have real problems with identity management - each database has a different (usually numeric) key that it internally uses to identify a person, location, department and so forth, and as a consequence finding whether two records refer to the same person typically comes down to having to identify sets of common characteristics, and increases the likelihood of error (as well as introduces computational costs).

Getting a new database to match keys by itself isn't enough (it in fact simply defers the problem for a bit) - you essentially need to identify and inventory the resources that your organization has, and then assign to each of them a unique ID - a GUID, or better, a Uniform Resource Identifier (URI) (also referred to as a uniform resource name (URN) or an internationalized resource identifier (IRI), though each has a subtly different meaning). A URI is a unique string that globally identifies a resource, whether that resource is a book, a web page, a person, or a room/cubicle. It may look something like schema://biggbusiness.com/person/jane_doe, or may be more cryptic (such as urn:biggbiz:1295102:39302185:1593). What's important here is that it is effectively unique - it not only identifies the person within a single database, but it identifies that person (or other resource) globally.

Once you have that globally unique identifier, then and only then can you start attaching other identifiers that may not be global. In many respects, this is the core of "semantics": uniquely identify the common resources (and resource types) in an organization, use those resource keys in what amounts to a "columnar" database or graph store, establish the relationships that exist between these resources, then associate enough properties and local identifiers with these global identifiers to make search feasible.
SPARQL is a big part of that semantic layer. SPARQL is to semantic "triple stores" what XQuery is to XML, what JavaScript (and its associated JSON query dialects) is to JSON stores, and what SQL is to relational data. It allows you to query the relationships between these global objects, returning either a binary answer, a table of variable values or other triples, depending upon what's asked, and most SPARQL engines also have a mechanism to create (simple) JSON and XML output structures.

Triple stores (so named because each statement breaks down into a three-value set consisting of a "subject", a "predicate" property, and an "object" URI or scalar value) and SPARQL together are consequently useful because they allow you to perform joins across relationships on objects, even when the objects being joined are not simple tables. The keys they are joining on are URIs, not sequential indexes, and they transcend any single database.

This really comes in handy when dealing with controlled vocabulary lists. Suppose that I wanted to get a list of all people that directly report to Ian M. Geek, and I know his URI. With SPARQL I can retrieve all people who have a "reports to" association with Ian (of course, I can do that with XQuery as well), since it's simply a string match. However, what I can also do is retrieve all people who report directly or indirectly to Ian - they report to someone who reports to someone who reports to him, for instance. This is what's called chaining in semantics, and it is one of the things that makes SPARQL very useful. Moreover, if I wanted to get the names and titles of everyone in Ian's reporting chain, with XQuery I would have to retrieve the document for each person and then get the name and title properties, while with SPARQL I'm dealing simply with assertions - an object may be described as the set of all triples that have the same identifier URI rather than an explicit "document" - and so the SPARQL processor need only look at specific relationships, not all of them.
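A minimal sketch of such a chained query, using SPARQL 1.1 property paths along with the hypothetical org:reportsTo and person:title predicates from the earlier sketch (the + means "one or more hops up the chain"):

select ?name ?title where {
    ?person org:reportsTo+ person:Ian_Geek;
            person:name  ?name;
            person:title ?title.
    }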

Having said that, SPARQL and triple stores in general are not ideal for other kinds of queries. Suppose, for instance, that I want to find out who reports to the CTO, but I don't know the CTO's name. There are actually two problems here - first, find a string match in an (unspecified) field, possibly with a regular expression, which will then retrieve the identifier of the corresponding person(s); then perform the previous query.

Triple stores do provide lexical search capabilities and indexing, but in general these are expensive operations, far more so than is the case for XML stores. However, this is a classic "search" problem, and finding whole or near matches of terms is something that a search-oriented XML store can handle quite easily.

Moreover, SPARQL doesn't handle much beyond the semantic search itself - you can transform its output with XSLT or similar applications, but these kinds of activities are often somewhat limited in scope. XQuery, on the other hand, is nearly as robust a mechanism for creating output content as it is for search.

Because of this, building a SPARQL application almost invariably has required invoking a triple store via a web service from some controlling language then pulling the result back and transforming it to meet the particular needs. Since you're sending data "over the wire" this has performance implications, and keeping the database in synchronization requires some serious headstands. However, that's now changed.

The MarkLogic 7.0 server was announced at MarkLogic World 2013 last month, becoming the first XQuery database to incorporate a semantics index and a SPARQL layer. A bit of disclosure here - Avalon Consulting, LLC, works as a primary partner for MarkLogic, largely because we value their product quite highly in the development of enterprise information solutions. I've also been lobbying MarkLogic to develop a SPARQL layer for more than three years, so I was absolutely giddy to discover that they had finally gone ahead and done it.

Because this technology is fairly complex, they are implementing it in stages, with MarkLogic 7.0 supporting the SPARQL 1.0 standard plus some low-hanging fruit from SPARQL 1.1. They will then implement the balance of the SPARQL 1.1 layer (including SPARQL UPDATE) in a subsequent release, and may include some of the more sophisticated support for the OWL 2.0 web ontology standard (commonly referred to as the inferencing layer).

I recently (finally) had a chance to review the MarkLogic Server 7.0 Early Access 2 release, after having some trepidation that it would be too little too late. After at least a preliminary analysis, I think I can safely say that if and when MarkLogic ever goes public, I would buy as much of their stock as possible. Even without the full 1.1 support, it supports the SPARQL 1.0 standard remarkably well, it is as fast as I've come to expect MarkLogic products to be (that is, very), and most of what I personally would like to do with RDF I can, albeit not necessarily directly through SPARQL.

I can also do the kind of queries I was talking about above, and combine search and semantic queries into a single, internal operation that is breathtakingly fast (in great part because it is NOT having to go out to an external server). For instance, in the queries above I would use XQuery to do the text search to get those documents that include the search term in a specific set of fields (looking for CTO, for instance), then pass these documents, with their associated triples, to a SPARQL query. This reduces the search dramatically, from comparing against the overall index to comparing against perhaps a few dozen entries. In SPARQL, such reductions are what makes working against tens of billions to trillions of triples feasible. These results, in turn, are filtered to retrieve the name and job title of those working for the CTO, and then are returned as JSON structures, custom XML, or generic map structures, depending upon need. These can then be transformed with XQuery into everything from HTML to SVG to PDFs or spreadsheets.
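As a rough sketch of what that looks like in practice (hedged: this is a minimal example against the Early Access release, and the prefixes and predicates are the hypothetical ones from the org chart discussion above), the SPARQL runs inside the server itself via the semantics library:

xquery version "1.0-ml";
import module namespace sem = "http://marklogic.com/semantics"
    at "/MarkLogic/semantics.xqy";

(: Find the names and titles of Ian's direct reports, without
   leaving the database process. The result is a sequence of
   solution maps that XQuery can then transform into HTML,
   JSON, SVG or whatever else the application needs. :)
sem:sparql('
    prefix org:    <http://biggbusiness.com/ns/org/>
    prefix person: <http://biggbusiness.com/ns/person/>
    select ?name ?title where {
        ?person org:reportsTo person:Ian_Geek;
                person:name  ?name;
                person:title ?title.
    }
')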



The real value here, then, comes in the ability to combine, transform and cache such output, as well as to manage processing. If a person quits, for instance, a workflow would be initiated upon the state change (through a process called a reverse query) that would query all locations associated with that person (those with a "claims" pointer, for instance) and would free those up. A similar query can determine whether there are any people who have placed reservations upon that room should it become freed, and will associate the room to a person based upon test criteria (the person who has the oldest reservation, for instance). This becomes the foundation for a rules-based orchestration system that can replace a complex command and control system. Because such systems often have a significant number of inter-dependencies on different resources, a semantic system is actually preferable in this regard.

To sum it up, MarkLogic with semantics has what it takes to learn, to create associations where none previously existed, to judge likelihoods (I'm salivating at the prospect of writing a Bayesian analysis parser on top of it), to generate user interfaces that change in response to what it needs, and to commit to conclusions and actions based upon its own internal analysis. I'm noting that a number of entity enrichment, business intelligence and natural language processing companies are migrating to MarkLogic as a foundational platform for their own offerings, and I fully anticipate this trend to increase dramatically as the combination of XQuery, SPARQL and SQL (yup, it supports that too - thank you, Mary Holstege!) makes MarkLogic a nexus for the intelligent machine.