Monday, December 3, 2012


Over the years, I've started a number of blogs and even a few dedicated web sites. Some of these were personal sites; others, including one now rather dormant site, were more focused on a given technology. I've come full circle with Semantics and Data Modeling, making use of the considerably improved interfaces that Blogger has to offer, and also coming to realize that my recent semantics work has moved outside the realm of XML, though it still touches on it in many ways.

About three years ago, I had an epiphany. Much of my writing at the time had been on REST and RESTful services, and how this paradigm, so central to the web, created an equivalency between terms in an ontology and URIs in an XML or JSON database. This was also about the same time that I encountered the superb book The Semantic Web for the Working Ontologist by Dean Allemang and Jim Hendler (both of whom I later had the privilege to meet) which effectively eschewed the fixation on complex URIs and simplified things down to qualified names for assertions in SPARQL. It was at that stage that I realized that what I'd been working toward for the last six years or so was essentially just another reflection on the Semantic Web.

Once I made that realization, a lot of other pieces fell into place. RDF-XML had always baffled me - it seemed like a remarkably idiotic way of putting together XML structures. However, once I began to think about triples as an alternative mechanism to XML (and more, as an effective means of modeling content), the structures of RDF-XML began to make more sense, to the extent that my XML database work is now being shaped by how readily content can be transformed into RDF. Similarly, the emergence of SPARQL over HTTP, which has only just begun to become a factor with triple stores, and the SPARQL Update vocabulary have together turned what had been largely an analytics technology with a passive data set into a far more interactive technology with an active data set.
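To make the "triples as an alternative to XML" idea concrete, here's a minimal sketch of re-expressing a scrap of XML content as subject-predicate-object triples, using only the Python standard library. The element names and the example base URI are hypothetical, chosen purely for illustration.

```python
# Re-expressing XML content as triples - a toy sketch.
# The element names and base URI are invented for illustration.
import xml.etree.ElementTree as ET

doc = """<book id="b1">
  <title>Working Ontologist</title>
  <author>Dean Allemang</author>
</book>"""

root = ET.fromstring(doc)
subject = "http://example.org/" + root.get("id")

# Each child element becomes one (subject, predicate, object) triple.
triples = [(subject, child.tag, child.text) for child in root]

for t in triples:
    print(t)
```

The same information survives the round trip, but the triple form flattens the hierarchy into uniform assertions - which is exactly what makes it queryable by pattern rather than by path.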

Ironically, all of this realization has come at the expense of my interest in XML. I've been working with XML for a decade, and with XQuery literally since its inception. I get it - it's a cool language for doing many things (and vastly under-appreciated by most people for the power it does have) - but I'm also coming to realize that the most interesting problem domains are no longer in the XML space. Part of this is because the areas where the problems are biggest increasingly fall into two very divergent categories: how do you handle petabytes' worth of data generated via firehose processes (think social media feeds, for a start) to find meaningful information, and how do you relate that information to other information in an increasingly interconnected data space?

The first question goes directly to the heart of Big Data, one of those wonderfully oxymoronic terms that marketers just love and programmers reluctantly use because their CEO got the marketing pitch from company X's marketers. In principle, Big Data comes down to dealing with a hyper-rich, heterogeneous, continually changing data space. In practice, Big Data has largely been co-opted by Hadoop as a commodity data mining solution. Hadoop's cool - check out the aforementioned marketer statements (*snark*) - but I also think that Hadoop will eventually be just one of a number of different approaches to processing what amounts to artifactual data - data that emerges as an artifact of some existing process.

For instance, Twitter is a good Big Data source - it generates messages that are an artifact of how the initial application works. However, it is frequently useful to take these artifacts and reprocess them for ancillary data mining - retrieving images and URLs of potentially useful links, measuring sentiment, looking for patterns. This was not the original intent of the Twitter format, but once the data is generated, this is a good way of recycling it. As people begin to understand the value of repurposed data, I expect the source formats will also become richer in content.
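The simplest version of that ancillary mining is just pattern extraction over the raw message text. Here's a rough sketch pulling URLs and hashtags out of tweets; the sample tweets and the deliberately naive regexes are illustrative only, not a production tokenizer.

```python
# Ancillary mining of artifactual data: extracting URLs and
# hashtags from raw tweet text. Samples and regexes are illustrative.
import re

tweets = [
    "Reading about #SPARQL endpoints: http://example.org/sparql",
    "New post on #semantics and #RDF at http://example.org/blog",
]

url_pattern = re.compile(r"https?://\S+")   # links worth retrieving
tag_pattern = re.compile(r"#(\w+)")         # candidate topic labels

urls = [u for t in tweets for u in url_pattern.findall(t)]
tags = [h for t in tweets for h in tag_pattern.findall(t)]

print(urls)
print(tags)
```

None of this was the point of the original 140-character format - the links and tags are artifacts - yet they're precisely the hooks that make the stream minable.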

Now, I've worked some with Hadoop, and there are definitely places where Hadoop can interact with semantics (most notably in the semantic enrichment arena), but I'll probably touch on Hadoop only peripherally in this particular blog.

The second question, though, is where I get excited. Information generally comes in three types: you are either identifying a resource, describing an attribute or property of that resource, or identifying a relationship between one resource and another. It turns out that data typing can be thought of as simply a form of the last type - the rdf:type relationship is so fundamental to OWL that SPARQL abbreviates it as "a". XML databases are very good at dealing with data that falls into the first two forms, mapping terms to document identifiers via indices. Where they are weak is in the last form - describing relationships between entities in a system or ontology. SPARQL, conversely, is somewhat awkward for managing the second, but the first and third types are intrinsic to triple stores.
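Here's a toy triple store that shows all three statement types in one place, with typing modeled as just another relationship (rdf:type - the one SPARQL spells "a"). The identifiers and predicates are invented for illustration; a real store would use full URIs and indexed lookups rather than a linear scan.

```python
# A toy triple store illustrating the three kinds of statements.
# Identifiers and predicates are invented for illustration.
RDF_TYPE = "rdf:type"

triples = {
    # typing a resource - itself just a relationship to a class
    ("ex:report42", RDF_TYPE, "ex:Report"),
    # describing an attribute of that resource
    ("ex:report42", "dc:title", "Q3 Summary"),
    # relating the resource to another resource
    ("ex:report42", "dc:creator", "ex:kurt"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# The rough analogue of SPARQL's   ?s a ex:Report
print(match(p=RDF_TYPE, o="ex:Report"))
```

The wildcard-pattern query is the whole trick: because every statement has the same three-part shape, one matching function covers identification, description, and relationships alike.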

If you can combine an XQuery-enabled XML database (such as MarkLogic or eXist) with a SPARQL-enabled triple store, then, with a bit of work, you can get the best of all three worlds at the cost of some redundancy in the data itself. This is the world that I want to play in, because it describes some very hard business problems - how are assets related to one another? Who is responsible for what asset at what time? If a set of assets is in this class, can they be processed in concert even if the assets themselves are distributed? What workflow state (in what workflow chain) is a given asset in? What discrepancies exist in the data system, and do these reflect discrepancies in the real world?
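The hybrid pattern can be sketched in a few lines: documents live in a document store (a plain dict stands in for the XML database here), while the relationships among them are mirrored as triples. The asset IDs, predicates, and workflow states are all hypothetical; the redundancy mentioned above is visible as the asset ID appearing on both sides.

```python
# Hybrid pattern sketch: documents on one side, relationship
# triples on the other. A dict stands in for the XML database;
# all IDs, predicates, and states are invented for illustration.
documents = {
    "ex:video7": "<asset><title>Launch footage</title></asset>",
    "ex:kurt":   "<person><name>Kurt</name></person>",
}

triples = [
    ("ex:video7", "ex:responsibleParty", "ex:kurt"),
    ("ex:video7", "ex:workflowState",    "ex:inReview"),
]

def assets_in_state(state):
    """Relationship query - the triple-store side of the split."""
    return [s for s, p, o in triples
            if p == "ex:workflowState" and o == state]

# ...followed by document retrieval - the XML-database side.
for asset_id in assets_in_state("ex:inReview"):
    print(asset_id, "->", documents[asset_id])
```

The division of labor is the point: the triple side answers "which assets?" by relationship, and the document side answers "what does this asset contain?" - each doing what it's optimized for.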

These are critical questions for any CMS, but especially for large ones. They're critical for knowledge management systems, where there are often both topical and atopical relationships between clusters of knowledge, and for assets such as video and audio media, where classification changes based upon frame, context, and even airing. All of these can be handled in an XML database or even a relational database, but as data becomes increasingly document-like, neither of these tools by itself is well optimized for the kind of mixed-use requirements that our data systems demand.

There is also the ancillary question of inferential analysis. Inferences are almost invariably relational in nature: A is somewhat analogous to B, B is referred to by C, therefore C may be applicable to A as well. Since relationships are seldom binary, inferences are typically fuzzier (indeed, you can introduce an n-tuple relationship that includes a relevancy index to quantify that fuzziness), which makes inference analysis a good tool for broad pattern matching that can then be used to find related content along multiple relational links.

So, given all this, and the fact that more and more of my work is built around data modeling, ontology management and semantics, I've decided to kick off a new blog called Semantics and Data Modeling. It's intended to be a vehicle for me to discuss what I'm doing in this space, a place to educate others on the benefits (and limitations) of both semantics and modeling, to put a stake in the ground about how these factor in with other tech at both the enterprise and technical levels, and occasionally to give me a venue for posting code. Feedback is welcome - and will be answered as quickly as possible.


Kurt Cagle


  1. Hello blog. Kurt, great to see you kick this off. I'll be interested to see what domain semantic technologies claim as they mature from theory to practice. I'm expecting some skirmishes launched from right here. ;-)