Thursday, May 23, 2013

Semantics + Search : MarkLogic 7 Gets RDF


Let's talk Org Charts for a moment. Everyone knows what an org chart looks like. At the top you have the boss guy, Mr. Bigg, CEO of Bigg Business, Inc.

The Head Honcho

At the second tier, you have his lieutenants, the guys that head up the various functions within the organization (flipping to show relationships more clearly).





What emerges as part of this is the fact that you have what appears to be a tree. This can be made even more obvious when you start jumping to the next level (just showing a couple of people from the programming department):



If asked at this point what the best way to store information about this organization chart would be, chances are pretty good that you'd opt for a language like XML or JSON, since there is a clear container/contained relationship between all the parties. However, suppose for a moment that the IT programmers are intended to be support personnel. While Ian Geek signs their paychecks, they actually work in different departments. Then you end up with something like this.




Oops.

What had been a nice, clean hierarchical chart suddenly gets considerably more complicated. There are actually three things worth noting here. The first is that we've jumped from having one relationship - "reports to" - to having two - "reports to" and "assists". The second is that the "assists" relationship does not in fact support a hierarchical distribution at all - Jane supports both Owen Munny and Bartholomew Bigg, who are ostensibly at different levels within the organization. The third, something of a side note, is that whereas each person "reports to" exactly one manager, that's not true of "assists" - Jane assists two people.

This is the reason why data designers and information architects get the big bucks - the real world is usually considerably more messy than a hierarchy. If you had a SQL database, modeling this relationship wouldn't be all that hard - every person gets a primary key that identifies them, the reports to relationship would incorporate a foreign key directly to that person's manager, and the assists relation involves a pointer to a table which then maps the primary key to zero or more associated keys.
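
As a rough illustration of that relational layout, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are my own assumptions, as is the idea that Owen Munny and Ian Geek both report directly to Mr. Bigg and that Jane reports to Ian; only the shape (a primary key, a "reports to" foreign key, and a separate "assists" mapping table) comes from the description above.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    reports_to INTEGER REFERENCES person(id)  -- NULL for the person at the top
);
CREATE TABLE assists (
    assistant_id INTEGER NOT NULL REFERENCES person(id),
    assisted_id  INTEGER NOT NULL REFERENCES person(id),
    PRIMARY KEY (assistant_id, assisted_id)
);
""")

# Sample rows; the reporting lines are assumed for the sake of the example.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        (1, "Bartholomew Bigg", None),
        (2, "Owen Munny", 1),
        (3, "Ian Geek", 1),
        (4, "Jane", 3),
    ],
)
# Jane assists two people, so "assists" needs its own mapping table.
conn.executemany("INSERT INTO assists VALUES (?, ?)", [(4, 1), (4, 2)])


def assisted_by(conn, assistant_name):
    """Return the names of everyone the given person assists, sorted by name."""
    rows = conn.execute(
        """
        SELECT boss.name
        FROM assists
        JOIN person AS helper ON helper.id = assists.assistant_id
        JOIN person AS boss   ON boss.id   = assists.assisted_id
        WHERE helper.name = ?
        ORDER BY boss.name
        """,
        (assistant_name,),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling assisted_by(conn, "Jane") walks the mapping table and returns both of the people Jane supports, which is exactly the lookup a single foreign key column could not express.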

However, there may in fact be any number of reasons why you would prefer to keep your data in XML (or in JSON, the same arguments apply there). Hierarchies are convenient - they provide a means for organizing information in a consistent fashion, they provide levels of abstraction for collections, and they are far more useful for transferring information than linear tables. As long as you're dealing with properties, or bags of properties, they work very nicely indeed.

The problem comes when you start dealing with discrete objects. In an organization, a person is a discrete object. A person may have one or more names, one or more job titles, one or more locations, one or more responsibilities. Some of these are simple properties - a name or a job title is a simple text string, for instance, while the start (and potentially end) dates are, well, dates. Other properties get a little more complex, and really depend upon what specifically is being modeled.

Locations are good examples of this - most organizations have rooms (or at least cubicles) where specific people work, and a way of identifying that room via some form of key. The location is in fact a "thing" - a resource - it is in a certain building, on a specific floor, and may have its own phone number. It makes sense for the person to have a relationship with that location, but that relationship isn't necessarily a containment or ownership relationship. A room may have more than one person in it, or be a temporary cubicle for contractors. A person may have multiple places they work from, including possibly from home.

Again, all of this can be modeled. However, there's something of a twist here. Suppose that in your organization, one database holds the names, positions, salaries and other business related information about a person. Another database holds the locations and names of people who are at those locations.

And then a reorganization occurs. You are tasked with identifying the org chart relationships then, when possible, moving people so that they are closest first to the people that they assist, then, when possible to the people that manage them. In an organization with 50 people, this can be a time consuming activity. In an organization with 10,000 people, it becomes a logistical nightmare. I should note here that a great number of business problems (and opportunities) share the same basic characteristics. I'm laying this one out just as an example.

There are actually several key issues that are brought up here. The first is the fact that most organizations have real problems with identity management - each database has a different (usually numeric) key that it internally uses to identify a person, location, department and so forth, and as a consequence finding whether two records refer to the same person typically comes down to having to identify sets of common characteristics, and increases the likelihood of error (as well as introduces computational costs).

Getting a new database to match keys by itself isn't enough (it in fact simply defers the problem for a bit) - you essentially need to identify and inventory the resources that your organization has, and then assign to each of them a unique ID - a GUID, or better, a Uniform Resource Identifier (URI). You will also see the terms Uniform Resource Name (URN) and Internationalized Resource Identifier (IRI), though each has a subtly different meaning. A URI is a unique string that globally identifies a resource, whether that resource is a book, a web page, a person, or a room/cubicle. It may look something like schema://biggbusiness.com/person/jane_doe, or may be more cryptic (such as urn:biggbiz:1295102:39302185:1593). What's important here is that it is effectively unique - it not only identifies the person within a single database, but it identifies that person (or other resource) globally.

Once you have that globally unique identifier, then and only then can you start attaching other identifiers that may not be global. In many respects, this is the core of "semantics": uniquely identify the common resources (and resource types) in an organization, use those resource keys in what amounts to a "columnar" database or graph store, establish the relationships that exist between these resources, then associate enough properties and local identifiers with these global identifiers to make search feasible.

SPARQL is a big part of that semantic layer. SPARQL is to Semantic "Triple Stores" what XQuery is to XML, JavaScript (and associated JSON query dialects) is to JSON stores and SQL is to relational data. It allows you to query the relationships between these global objects, returning in turn either a yes/no answer, a table of variable values or other triples, depending upon what's asked. Most SPARQL engines also have a mechanism to create (simple) JSON and XML output structures.

Triple stores (named because columnar stores can be broken down into a three value set consisting of a "subject", a "predicate" property, and an "object" URI or scalar value) and SPARQL together are consequently useful because they allow you to perform joins across relationships on objects, even when the objects being joined are not simple tables. The keys they are joining on are URIs, not sequential indexes, and they transcend any single database.
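
To make the subject/predicate/object idea concrete, here is a minimal sketch in plain Python rather than in any real triple store. The URIs follow the illustrative pattern used earlier and the predicate names (name, reportsTo, assists) are hypothetical; the point is only to show pattern matching over triples and a join on a shared URI.

```python
# All URIs below are hypothetical and follow the illustrative pattern used earlier.
PERSON = "schema://biggbusiness.com/person/"
TERM = "schema://biggbusiness.com/term/"

# Each triple is (subject URI, predicate URI, object URI or scalar value).
triples = {
    (PERSON + "jane_doe", TERM + "name", "Jane Doe"),
    (PERSON + "jane_doe", TERM + "reportsTo", PERSON + "ian_geek"),
    (PERSON + "jane_doe", TERM + "assists", PERSON + "owen_munny"),
    (PERSON + "jane_doe", TERM + "assists", PERSON + "bartholomew_bigg"),
    (PERSON + "ian_geek", TERM + "name", "Ian M. Geek"),
    (PERSON + "ian_geek", TERM + "reportsTo", PERSON + "bartholomew_bigg"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def names_of_direct_reports(triples, manager_uri):
    """Join two patterns on the shared subject: ?s reportsTo manager, ?s name ?n."""
    names = []
    for s, _, _ in match(triples, predicate=TERM + "reportsTo", obj=manager_uri):
        for _, _, name in match(triples, subject=s, predicate=TERM + "name"):
            names.append(name)
    return sorted(names)
```

The join key here is the person's URI, not a row number, which is why the same lookup would still work if the name triples and the reportsTo triples had come from two different databases.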

This really comes in handy when dealing with controlled vocabulary lists. Suppose that I wanted to get a list of all people that directly report to Ian M. Geek and I know his URI. With SPARQL I can retrieve all people who have a "reports to" association with Ian, since it's simply a string match (of course, I can do that with XQuery as well). However, what I can also do is retrieve all people who report directly or indirectly to Ian - they report to someone who reports to someone who reports to him, for instance. This is what's called chaining in semantics and is one of the things that makes SPARQL very useful. Moreover, if I wanted to get the names and titles of everyone who is in Ian's reporting chain, with XQuery I would have to retrieve the document for each person and then get the name and title property, while with SPARQL, I'm dealing simply with assertions - an object may be described as the set of all triples that have the same identifier URI rather than an explicit "document", and so the SPARQL processor need only look at specific relationships, not all of them.
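
The chaining idea can be sketched in plain Python as a walk over "reports to" triples. This is only an illustration of what the query engine does conceptually, not how any particular SPARQL processor implements it; the short identifiers stand in for full URIs, and pat_coder is a made-up person added so that there is an indirect report to find.

```python
from collections import deque

REPORTS_TO = "reportsTo"  # hypothetical predicate name

# (subject, predicate, object) triples; short IDs stand in for full URIs.
triples = [
    ("ian_geek", REPORTS_TO, "bartholomew_bigg"),
    ("owen_munny", REPORTS_TO, "bartholomew_bigg"),
    ("jane_doe", REPORTS_TO, "ian_geek"),
    ("pat_coder", REPORTS_TO, "jane_doe"),  # hypothetical indirect report
]


def reporting_chain(triples, boss):
    """Return everyone who reports to boss, directly or through other people."""
    # Index the triples by manager so each hop is a dictionary lookup.
    direct = {}
    for subject, predicate, obj in triples:
        if predicate == REPORTS_TO:
            direct.setdefault(obj, []).append(subject)

    found = set()
    queue = deque([boss])
    while queue:
        current = queue.popleft()
        for person in direct.get(current, []):
            if person not in found:  # guards against cycles in bad data
                found.add(person)
                queue.append(person)
    return found
```

Asking for reporting_chain(triples, "ian_geek") picks up Jane as a direct report and then follows her own reports one level further, which is the "someone who reports to someone who reports to him" case.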

Having said that, SPARQL and Triple Stores in general are not ideal for other kinds of queries. Suppose, for instance, that I want to find out who reports to the CTO, but I don't know the CTO's name. There are actually two problems here - first, find a string match in an (unspecified) field, possibly with a regular expression, which will then retrieve the identifier of the corresponding person(s), then perform the previous query.

Triple stores do provide lexical search capabilities and indexing, but in general these are expensive operations, far more so than is the case for XML stores. However, this is a classic "search" problem, and finding whole or near matches of terms is something that a search engine can handle quite easily.

Moreover, SPARQL doesn't handle much beyond the semantic search - you can transform it with XSLT or similar applications, but these kinds of activities are often somewhat limited in scope. XQuery, on the other hand, is nearly as robust a mechanism for creating output content as it is for search.

Because of this, building a SPARQL application almost invariably has required invoking a triple store via a web service from some controlling language then pulling the result back and transforming it to meet the particular needs. Since you're sending data "over the wire" this has performance implications, and keeping the database in synchronization requires some serious headstands. However, that's now changed.

The MarkLogic 7.0 server was announced at MarkLogic World 2013 last month, becoming the first XQuery database to incorporate a semantics index and a SPARQL layer. A bit of disclosure here - Avalon Consulting, LLC, works as a primary partner for MarkLogic, largely because we value their product quite highly in the development of enterprise information solutions. I've also been lobbying MarkLogic to develop a SPARQL layer for more than three years, so I was absolutely giddy to discover that they had finally gone ahead and done it.

Because this technology is fairly complex, they are implementing it in stages, with MarkLogic 7.0 supporting the SPARQL 1.0 standard and several pieces of low-hanging fruit from SPARQL 1.1. They will then implement the balance of the SPARQL 1.1 layer (including SPARQL UPDATE) in a subsequent release, and may include some of the more sophisticated support for the OWL 2 web ontology standard (commonly referred to as the inferencing layer).

I recently (finally) had a chance to review the MarkLogic Server 7.0 Early Access 2 release, after having some trepidation that it would be too little too late. After at least a preliminary analysis, I think I can safely say that if and when MarkLogic ever goes public, I would buy as much of their stock as possible. Even without the full 1.1 support, it supports the SPARQL 1.0 standard remarkably well, it is as fast as I've come to expect MarkLogic products to be (that is, very), and most of what I personally would like to do with RDF I can, albeit not necessarily directly through SPARQL.

I can also do the kind of queries I was talking about above, and combine search and semantic queries into a single, internal operation that is breathtakingly fast (in great part because it is NOT having to go out to an external server). For instance, in the above queries I would use XQuery to do the text search to get those documents that include the search term in a specific set of fields (looking for CTO, for instance), then pass these documents, with their associated triples, to a SPARQL query. This reduces the search dramatically from comparing against the overall index to comparing against perhaps a few dozen entries. In SPARQL, such reductions are what make working against tens of billions to trillions of triples feasible. These results, in turn, are filtered to retrieve name and job title from those working for the CTO, and then are returned as either JSON structures, custom XML, or generic map structures, depending upon need. These can then be transformed with XQuery into everything from HTML to SVG to PDFs or spreadsheets.
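
The two-step pattern (text search first, semantic query second) can be sketched in plain Python. To be clear, none of this is MarkLogic's API; the documents, field names, job titles and identifiers are all invented for the example, and I am assuming that Ian Geek is the CTO. The sketch only shows why narrowing by text search first leaves the relationship query with very little to look at.

```python
import re

REPORTS_TO = "reportsTo"  # hypothetical predicate name

# Hypothetical "documents", keyed by a short ID standing in for a URI.
documents = {
    "bartholomew_bigg": {"name": "Bartholomew Bigg", "title": "CEO"},
    "owen_munny": {"name": "Owen Munny", "title": "CFO"},
    "ian_geek": {"name": "Ian M. Geek", "title": "CTO"},
    "jane_doe": {"name": "Jane Doe", "title": "Programmer"},
}

# Hypothetical triples describing who reports to whom.
triples = [
    ("owen_munny", REPORTS_TO, "bartholomew_bigg"),
    ("ian_geek", REPORTS_TO, "bartholomew_bigg"),
    ("jane_doe", REPORTS_TO, "ian_geek"),
]


def search_documents(documents, term, fields):
    """Step 1: text search. Return IDs whose listed fields contain the term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return {
        doc_id
        for doc_id, doc in documents.items()
        if any(pattern.search(doc.get(field, "")) for field in fields)
    }


def direct_reports(triples, managers):
    """Step 2: semantic lookup restricted to the managers found in step 1."""
    return {s for (s, p, o) in triples if p == REPORTS_TO and o in managers}


def who_reports_to(term):
    """Combine both steps and return (name, title) pairs for the matches."""
    managers = search_documents(documents, term, ["title"])
    reports = direct_reports(triples, managers)
    return sorted((documents[r]["name"], documents[r]["title"]) for r in reports)
```

Here who_reports_to("CTO") never needs to know Ian's name: the text search resolves the title to an identifier, and the triple lookup only has to consider relationships that point at that one identifier.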



The real value here then comes in the ability to combine, transform and cache such output, as well as to manage processing. If a person quits, for instance, a workflow would be initiated upon state change (through a process called a reverse query) that would query all locations that are associated with a given person (those with a "claims" pointer, for instance) and would free those up. A similar query can determine whether there are any people who have placed reservations upon that room should it become freed, and will associate the room to the person based upon test criteria (the person who has the oldest reservation, for instance). This becomes the foundation for a rules based orchestration system that can replace a complex command and control system. Because such systems often have a significant number of inter-dependencies on different resources, a semantic system is actually preferable in this regard.
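
The room-freeing rule described above can be sketched as a small Python function. This is a plain in-memory illustration of the rule itself, not of MarkLogic's reverse query mechanism; the room name, the people and the reservation dates are all made up, and the "oldest reservation wins" test is the one example criterion mentioned above.

```python
from datetime import date

# Hypothetical state: which person currently claims which location.
claims = {"room_101": "jane_doe", "room_102": "ian_geek"}

# Hypothetical waiting list: (location, person, date the reservation was placed).
reservations = [
    ("room_101", "pat_coder", date(2013, 3, 1)),
    ("room_101", "owen_munny", date(2013, 1, 15)),
]


def handle_departure(person, claims, reservations):
    """Free every location the person claims, then hand each freed location
    to whoever holds the oldest reservation on it. Returns the reassignments."""
    reassigned = {}
    freed = [loc for loc, holder in claims.items() if holder == person]
    for loc in freed:
        del claims[loc]  # the departing person no longer claims the location
        waiting = [r for r in reservations if r[0] == loc]
        if waiting:
            winner = min(waiting, key=lambda r: r[2])  # oldest reservation date
            claims[loc] = winner[1]
            reservations.remove(winner)
            reassigned[loc] = winner[1]
    return reassigned
```

In a real system the function would be triggered by the state change rather than called by hand, but the rule it applies is the same: find the claims, release them, and resolve any competing reservations by a stated criterion.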

To sum it up, MarkLogic with semantics has what it takes to learn, to create associations where none previously existed, to judge likelihoods (I'm salivating at the prospect of writing a Bayesian analysis parser on top of it), to generate user interfaces that change in response to what it needs, and to commit to conclusions and actions based upon its own internal analysis. I'm noting that a number of entity enrichment, business intelligence and natural language processing companies are migrating to MarkLogic as a foundational platform for their own offerings, and I fully expect this trend to accelerate dramatically as the combination of XQuery, SPARQL and SQL (yup, it does support that too, thank you Mary Holstege!) makes MarkLogic a nexus for the intelligent machine.
