Tuesday, December 4, 2012

Data Modeling via RDF

I'm frequently called upon to design schemas and data models for organizations and companies. In my experience, this exercise usually starts with someone saying "We need a logical data model" and someone else (usually a relational database designer) reaching for a modeling tool such as ERwin or IBM's Rational Rose in order to build UML pictures. It's also been my experience that such data models, once created, are large, cumbersome, and often completely inappropriate to the problem domain ... especially when the end product is a structured entity such as XML or JSON.

Of late, I've begun playing with an alternative approach that has proven to be quite effective, but it's one that I see very few people using as yet. The principle is simple - in your data model, start with a particular object - it doesn't really matter which object, though usually there are a few that are more central than others - and put together live examples of what you're trying to emulate. This will not only help you to identify the properties of a given object, but will also tend to expose links to other objects that are part of the problem domain. By using RDF, you can concentrate on such links, and by working with a shorthand notation for RDF, you can do so without all of the complexity that namespaces normally introduce.

For instance, suppose that you want to model a card catalog for a library system. The most obvious starting class for such a system is a book. For convenience's sake I'm going to create a namespace with the rather odd notation xmlns:book="book:". This means that I can create an RDF subject or object of the form <book:Book1> which has the URI "book:Book1". Eventually we'll resolve the namespaces to something more typical of a URI, but for now this approach lets you concentrate on concept rather than syntax or notation.

Once we have the book, we can start thinking of properties, and can express these as RDF assertions. For instance,

<book:Book1> <book:isbn> "195292514331".
<book:Book1> <book:title> "The SPARQL Book".
<book:Book1> <book:edition> "1".
<book:Book1> <book:printing> "1".
<book:Book1> <book:publishingYear> 2012.
<book:Book1> <book:description> "This is a book about SPARQL".

Now there's a certain degree of redundancy in such lists. One way to reduce this is to employ the semi-colon notation to indicate that several assertions all have the same subject:

<book:Book1>
    <book:isbn> "195292514331";
    <book:title> "The SPARQL Book";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2012;
    <book:description> "This is a book about SPARQL".

Any assertion ending with a semi-colon indicates that the next item in the sequence will just contain the predicate and object, and should use the same subject as the current item uses. Commas can be used for a similar purpose - they indicate that a given subject-predicate pair may be used for more than one object. 
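
To see that the semi-colon and comma forms are only shorthand for flat triples, here is a minimal sketch in plain Python. The nested-dictionary layout, the expand helper and the <book:keyword> property are all assumptions invented for this illustration; none of them come from an RDF library or from the model itself.

```python
# Hypothetical layout: subject -> predicate -> list of objects.
# Sharing one subject mirrors the semi-colon notation; a list holding
# several objects mirrors the comma notation.
# "book:keyword" is an invented property used only for this sketch.
grouped = {
    "book:Book1": {
        "book:title": ["The SPARQL Book"],
        "book:edition": ["1"],
        "book:keyword": ["SPARQL", "RDF"],
    }
}


def expand(grouped):
    """Flatten the grouped form into (subject, predicate, object) triples."""
    return [
        (subject, predicate, obj)
        for subject, predicates in grouped.items()
        for predicate, objects in predicates.items()
        for obj in objects
    ]


triples = expand(grouped)
for triple in triples:
    print(triple)
```

Each printed tuple corresponds to one fully spelled-out assertion, which is all that the shorthand stands for.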

Now, one thing that should be evident in our description of a book is that we're missing a few critical pieces, such as authors. Here's where modeling begins to get exciting. I could use a string here indicating the author's name, but there are two questions that immediately come up - is there more than one author for any given book, and are there any authors who have written multiple books? Chances are high (unless you have a very tiny library) that the answer will be yes in both cases. This suggests that the author in turn may be another type of object in the system, in a namespace that we'll declare as xmlns:author = "author:".

<book:Book1> <book:author> <author:Author1>.

Note that at this stage we know absolutely nothing about the author beyond the fact that he or she exists, but this is not an insignificant bit of knowledge. At this stage it may be worth filling in some of the blanks:

<author:Author1> 
     <author:displayName> "Jane Doe";
     <author:givenName> "Jane";
     <author:middleNames> "Elizabeth";
     <author:surName> "Doe";
     <author:searchName> "Doe, Jane Elizabeth";
     <author:bio> "An author and librarian who likes cats.".

Suppose that the book had two co-authors. Adding a second entry becomes simple enough:

<author:Author2> 
     <author:displayName> "John Dee";
     <author:givenName> "John";
     <author:middleNames> "Michael";
     <author:surName> "Dee";
     <author:searchName> "Dee, John Michael";
     <author:bio> "A writer of many talents and pretensions.".
<book:Book1> <book:author> <author:Author2>.

This is about the time that XML developers start getting nervous, because authors in this case may also author multiple books. For instance, Jane might also have authored a second book "Sparql and OWL, a Guide":

<book:Book2>
    <book:isbn> "1952929129568";
    <book:title> "Sparql and OWL, a Guide";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2010;
    <book:description> "A book about SPARQL and data modeling".
<book:Book2> <book:author> <author:Author1>.

To make the relationships even more representative, let's assume that John Dee also wrote another book:

<book:Book3>
    <book:isbn> "19123977952320";
    <book:title> "Semantics";
    <book:edition> "1";
    <book:printing> "2";
    <book:publishingYear> 2011;
    <book:description> "A book on Semantics".
<book:Book3> <book:author> <author:Author2>.

There is an implicit relationship here - that an author may have multiple books. Here's where the new SPARQL update spec comes in handy. If you're using Jena or some similar triple store that supports SPARQL 1.1 update, then you can make this relationship explicit as well:

insert {?author <author:book> ?book} where
{?book <book:author> ?author.}

When the update is run, this will generate the four triples:

<author:Author1> <author:book> <book:Book1>.
<author:Author1> <author:book> <book:Book2>.
<author:Author2> <author:book> <book:Book1>.
<author:Author2> <author:book> <book:Book3>.
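
The effect of that update rule can be sketched outside a triple store as well. The following is a minimal illustration in plain Python, assuming triples are held as tuples in a set; the add_inverse helper is a hypothetical name for this sketch and is not a Jena or SPARQL API.

```python
# Triples as (subject, predicate, object) tuples. A set mirrors the fact
# that a graph holds any given triple at most once.
graph = {
    ("book:Book1", "book:author", "author:Author1"),
    ("book:Book1", "book:author", "author:Author2"),
    ("book:Book2", "book:author", "author:Author1"),
    ("book:Book3", "book:author", "author:Author2"),
}


def add_inverse(graph, predicate, inverse):
    """For every (s, predicate, o) add (o, inverse, s); return the new triples."""
    derived = {(o, inverse, s) for (s, p, o) in graph if p == predicate}
    new = derived - graph
    graph.update(new)
    return new


added = add_inverse(graph, "book:author", "author:book")
for triple in sorted(added):
    print(triple)
```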

Some systems will also let you use the SPARQL 1.0 CONSTRUCT statement to do the same thing, though this usually generates RDF as an output without adding it to the database:

construct {?author <author:book> ?book} where
{?book <book:author> ?author.}

So, what about publishers? Again, this is a case where you have a text label that could, nonetheless, also be an object. The big question in the model is a utilitarian one - are you likely to want to see books or authors grouped by publishers? If you are, then this becomes a separate class:

<book:Book1> <book:publisher> <publisher:Publisher1>.
<book:Book2> <book:publisher> <publisher:Publisher2>.
<book:Book3> <book:publisher> <publisher:Publisher1>.
<publisher:Publisher1> 
     <publisher:name> "Oriole Publishing";
     <publisher:description> "Large publisher of technical books for the programming market.".
<publisher:Publisher2> 
     <publisher:name> "Avante Books";
     <publisher:description> "A major publisher of scientific and research books.".

Thus, Book1 and Book3 are published by Oriole, while Book2 is published by Avante. Does it make sense to link publishers and authors? Generally, probably not, though you can retrieve this data indirectly with a SPARQL query:

select distinct ?pubName ?authName where
{
?book <book:publisher> ?publisher.
?book <book:author> ?author.
?publisher <publisher:name> ?pubName.
?author <author:displayName> ?authName.
} order by ?pubName ?authName

This will retrieve a listing of authors by publisher:

pubName              authName
Avante Books         Jane Doe
Oriole Publishing    Jane Doe
Oriole Publishing    John Dee

Thus, one of the big trade-offs in data modeling is determining whether a given relationship needs to be made explicit (which adds redundancy) or can remain implicit (which adds query complexity). As above, if you decide that getting authors by publisher is important, then you can always add those relationships via an update rule:

insert {?publisher <publisher:author> ?author.} where
{
?book <book:publisher> ?publisher.
?book <book:author> ?author.
}
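
What this rule does is join the two patterns on the shared ?book variable. A minimal sketch of that join in plain Python follows, assuming triples are stored as tuples in a set; the publisher_authors function is a hypothetical helper written for this illustration only.

```python
graph = {
    ("book:Book1", "book:publisher", "publisher:Publisher1"),
    ("book:Book2", "book:publisher", "publisher:Publisher2"),
    ("book:Book3", "book:publisher", "publisher:Publisher1"),
    ("book:Book1", "book:author", "author:Author1"),
    ("book:Book1", "book:author", "author:Author2"),
    ("book:Book2", "book:author", "author:Author1"),
    ("book:Book3", "book:author", "author:Author2"),
}


def publisher_authors(graph):
    """Join book->publisher with book->author on the shared book."""
    return {
        (publisher, "publisher:author", author)
        for (book, p1, publisher) in graph
        if p1 == "book:publisher"
        for (other_book, p2, author) in graph
        if p2 == "book:author" and other_book == book
    }


derived = publisher_authors(graph)
for triple in sorted(derived):
    print(triple)
```

John Dee reaches Oriole through both Book1 and Book3, but because the result is a set that link is only recorded once.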

One final addition here is to make explicit the class names involved. In this case, I normally use an internal "entity:" namespace to capture schematic properties, which I'll later map to rdf:, rdfs: or owl: equivalents when I convert namespaces. The <entity:instanceOf> property indicates that a given subject is an instance of a certain class, and is functionally equivalent to an <rdf:type> predicate, while <entity:subClassOf> is the equivalent of <rdfs:subClassOf>.


<book:Book1> <entity:instanceOf> <class:Book>.
<book:Book2> <entity:instanceOf> <class:Book>.
<book:Book3> <entity:instanceOf> <class:Book>.
<author:Author1> <entity:instanceOf> <class:Author>.
<author:Author2> <entity:instanceOf> <class:Author>.
<publisher:Publisher1> <entity:instanceOf> <class:Publisher>.
<publisher:Publisher2> <entity:instanceOf> <class:Publisher>.
<class:Book> <entity:subClassOf> <class:Entity>.
<class:Author> <entity:subClassOf> <class:Entity>.
<class:Publisher> <entity:subClassOf> <class:Entity>.

The designation of types makes it possible to ask for a list of all books, authors or publishers in the system by type, while the <entity:subClassOf> entry here makes it possible to get all classes available within the ontology itself. I also usually like to create object-agnostic labeling in my models, using the <entity:label> and <entity:description> properties. These can be populated using SPARQL update:



insert {?resource <entity:label> ?label}
where {
{?resource <book:title> ?label.}
UNION
{?resource <author:displayName> ?label.}
UNION
{?resource <publisher:name> ?label.}
};

insert {?resource <entity:description> ?description}
where {
{?resource <book:description> ?description.}
UNION
{?resource <author:bio> ?description.}
UNION
{?resource <publisher:description> ?description.}
}


When run, this will ensure that every object in the system can be described using both common and type specific properties. In a sense, the "entity:" namespace objects can be thought of as an abstract superclass that all other classes inherit, with the properties "instanceOf", "subClassOf", "label" and "description".
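
The payoff of the common properties is that one generic routine can list any class without knowing its type specific vocabulary. Here is a minimal sketch in plain Python, assuming every resource already carries <entity:instanceOf> and <entity:label>; the labels_by_class function is a hypothetical helper for this illustration.

```python
graph = {
    ("book:Book1", "entity:instanceOf", "class:Book"),
    ("book:Book1", "entity:label", "The SPARQL Book"),
    ("book:Book3", "entity:instanceOf", "class:Book"),
    ("book:Book3", "entity:label", "Semantics"),
    ("author:Author1", "entity:instanceOf", "class:Author"),
    ("author:Author1", "entity:label", "Jane Doe"),
    ("publisher:Publisher1", "entity:instanceOf", "class:Publisher"),
    ("publisher:Publisher1", "entity:label", "Oriole Publishing"),
}


def labels_by_class(graph, cls):
    """Return sorted (resource, label) pairs for every instance of cls."""
    instances = {
        s for (s, p, o) in graph if p == "entity:instanceOf" and o == cls
    }
    return sorted(
        (s, o) for (s, p, o) in graph if p == "entity:label" and s in instances
    )


books = labels_by_class(graph, "class:Book")
authors = labels_by_class(graph, "class:Author")
print(books)
print(authors)
```

The same function serves books, authors and publishers alike, because it only touches the "entity:" properties.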

One other important point to remember about the SPARQL insert commands - if a given triple already exists in the system, insert will not add it again. This means that by collecting all of the insert statements in a script, you can run the script periodically after inserting new objects to make implicit assertions explicit.
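
This behaviour follows from a graph being a set of triples. A minimal sketch in plain Python shows why re-running a rule is harmless, assuming the store behaves like a set; the insert function here is a hypothetical stand-in and not the SPARQL command itself.

```python
graph = set()


def insert(graph, triples):
    """Add triples to the graph and return how many were actually new."""
    before = len(graph)
    graph.update(triples)
    return len(graph) - before


# Output of one derivation rule, as it would be produced on every run.
derived = [
    ("author:Author1", "author:book", "book:Book1"),
    ("author:Author1", "author:book", "book:Book2"),
]

first_run = insert(graph, derived)
second_run = insert(graph, derived)
print(first_run, second_run)  # prints "2 0"
```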

The whole script can be run through the Jena update capability within its server, with the condensed script as follows:

insert data {

<book:Book1> 
    <book:isbn> "195292514331";
    <book:title> "The SPARQL Book";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2012;
    <book:description> "This is a book about SPARQL".
<author:Author1> 
     <author:displayName> "Jane Doe";
     <author:givenName> "Jane";
     <author:middleNames> "Elizabeth";
     <author:surName> "Doe";
     <author:searchName> "Doe, Jane Elizabeth";
     <author:bio> "An author and librarian who likes cats.".
<book:Book1> <book:author> <author:Author1>.
<author:Author2> 
     <author:displayName> "John Dee";
     <author:givenName> "John";
     <author:middleNames> "Michael";
     <author:surName> "Dee";
     <author:searchName> "Dee, John Michael";
     <author:bio> "A writer of many talents and pretensions.".
<book:Book1> <book:author> <author:Author2>.
<book:Book2>
    <book:isbn> "1952929129568";
    <book:title> "Sparql and OWL, a Guide";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2010;
    <book:description> "A book about SPARQL and data modeling".
<book:Book2> <book:author> <author:Author1>.
<book:Book3>
    <book:isbn> "19123977952320";
    <book:title> "Semantics";
    <book:edition> "1";
    <book:printing> "2";
    <book:publishingYear> 2011;
    <book:description> "A book on Semantics".
<book:Book3> <book:author> <author:Author2>.
<book:Book1> <book:publisher> <publisher:Publisher1>.
<book:Book2> <book:publisher> <publisher:Publisher2>.
<book:Book3> <book:publisher> <publisher:Publisher1>.
<publisher:Publisher1> 
     <publisher:name> "Oriole Publishing";
     <publisher:description> "Large publisher of technical books for the programming market.".
<publisher:Publisher2> 
     <publisher:name> "Avante Books";
     <publisher:description> "A major publisher of scientific and research books.".
<book:Book1> <entity:instanceOf> <class:Book>.
<book:Book2> <entity:instanceOf> <class:Book>.
<book:Book3> <entity:instanceOf> <class:Book>.
<author:Author1> <entity:instanceOf> <class:Author>.
<author:Author2> <entity:instanceOf> <class:Author>.
<publisher:Publisher1> <entity:instanceOf> <class:Publisher>.
<publisher:Publisher2> <entity:instanceOf> <class:Publisher>.
<class:Book> <entity:subClassOf> <class:Entity>.
<class:Author> <entity:subClassOf> <class:Entity>.
<class:Publisher> <entity:subClassOf> <class:Entity>.
};

insert {?author <author:book> ?book} where
{?book <book:author> ?author.};

insert {?publisher <publisher:author> ?author} where
{
?book <book:publisher> ?publisher.
?book <book:author> ?author.
};

insert {?resource <entity:label> ?label}
where {
{?resource <book:title> ?label.}
UNION
{?resource <author:displayName> ?label.}
UNION
{?resource <publisher:name> ?label.}
};

insert {?resource <entity:description> ?description}
where {
{?resource <book:description> ?description.}
UNION
{?resource <author:bio> ?description.}
UNION
{?resource <publisher:description> ?description.}
}

You can then see what your data space looks like thus far with the following SPARQL query:

select ?s ?p ?q where
{?s ?p ?q.}
order by ?s ?p ?q

In the case of the example given here, this will generate the following dataset:

s p q
<author:Author1> <author:bio> "An author and librarian who likes cats."
<author:Author1> <author:book> <book:Book1>
<author:Author1> <author:book> <book:Book2>
<author:Author1> <author:displayName> "Jane Doe"
<author:Author1> <author:givenName> "Jane"
<author:Author1> <author:middleNames> "Elizabeth"
<author:Author1> <author:searchName> "Doe, Jane Elizabeth"
<author:Author1> <author:surName> "Doe"
<author:Author1> <entity:description> "An author and librarian who likes cats."
<author:Author1> <entity:instanceOf> <class:Author>
<author:Author1> <entity:label> "Jane Doe"
<author:Author2> <author:bio> "A writer of many talents and pretensions."
<author:Author2> <author:book> <book:Book1>
<author:Author2> <author:book> <book:Book3>
<author:Author2> <author:displayName> "John Dee"
<author:Author2> <author:givenName> "John"
<author:Author2> <author:middleNames> "Michael"
<author:Author2> <author:searchName> "Dee, John Michael"
<author:Author2> <author:surName> "Dee"
<author:Author2> <entity:description> "A writer of many talents and pretensions."
<author:Author2> <entity:instanceOf> <class:Author>
<author:Author2> <entity:label> "John Dee"
<book:Book1> <book:author> <author:Author1>
<book:Book1> <book:author> <author:Author2>
<book:Book1> <book:description> "This is a book about SPARQL"
<book:Book1> <book:edition> "1"
<book:Book1> <book:isbn> "195292514331"
<book:Book1> <book:printing> "1"
<book:Book1> <book:publisher> <publisher:Publisher1>
<book:Book1> <book:publishingYear> "2012"^^<http://www.w3.org/2001/XMLSchema#integer>
<book:Book1> <book:title> "The SPARQL Book"
<book:Book1> <entity:description> "This is a book about SPARQL"
<book:Book1> <entity:instanceOf> <class:Book>
<book:Book1> <entity:label> "The SPARQL Book"
<book:Book2> <book:author> <author:Author1>
<book:Book2> <book:description> "A book about SPARQL and data modeling"
<book:Book2> <book:edition> "1"
<book:Book2> <book:isbn> "1952929129568"
<book:Book2> <book:printing> "1"
<book:Book2> <book:publisher> <publisher:Publisher2>
<book:Book2> <book:publishingYear> "2010"^^<http://www.w3.org/2001/XMLSchema#integer>
<book:Book2> <book:title> "Sparql and OWL, a Guide"
<book:Book2> <entity:description> "A book about SPARQL and data modeling"
<book:Book2> <entity:instanceOf> <class:Book>
<book:Book2> <entity:label> "Sparql and OWL, a Guide"
<book:Book3> <book:author> <author:Author2>
<book:Book3> <book:description> "A book on Semantics"
<book:Book3> <book:edition> "1"
<book:Book3> <book:isbn> "19123977952320"
<book:Book3> <book:printing> "2"
<book:Book3> <book:publisher> <publisher:Publisher1>
<book:Book3> <book:publishingYear> "2011"^^<http://www.w3.org/2001/XMLSchema#integer>
<book:Book3> <book:title> "Semantics"
<book:Book3> <entity:description> "A book on Semantics"
<book:Book3> <entity:instanceOf> <class:Book>
<book:Book3> <entity:label> "Semantics"
<class:Author> <entity:subClassOf> <class:Entity>
<class:Book> <entity:subClassOf> <class:Entity>
<class:Publisher> <entity:subClassOf> <class:Entity>
<publisher:Publisher1> <entity:description> "Large publisher of technical books for the programming market."
<publisher:Publisher1> <entity:instanceOf> <class:Publisher>
<publisher:Publisher1> <entity:label> "Oriole Publishing"
<publisher:Publisher1> <publisher:author> <author:Author1>
<publisher:Publisher1> <publisher:author> <author:Author2>
<publisher:Publisher1> <publisher:description> "Large publisher of technical books for the programming market."
<publisher:Publisher1> <publisher:name> "Oriole Publishing"
<publisher:Publisher2> <entity:description> "A major publisher of scientific and research books."
<publisher:Publisher2> <entity:instanceOf> <class:Publisher>
<publisher:Publisher2> <entity:label> "Avante Books"
<publisher:Publisher2> <publisher:author> <author:Author1>
<publisher:Publisher2> <publisher:description> "A major publisher of scientific and research books."
<publisher:Publisher2> <publisher:name> "Avante Books"

There are a number of other properties or relationships that can be defined, with each relationship generally identifying a class in the overall entity relationship diagram. For instance, a given book may have multiple copies, and you may have different media (such as audio books or DVDs). The model can also be extended to include which library currently holds the book, who has borrowed it, and so forth.

This process can take a while. The goal with this stage of development is to figure out which relationships exist and are worth tracking, and which properties are optional. I find that the best approach is to do an initial run, then to pull in stakeholders and show them the individual instances and how they interrelate. At that point, domain experts may point out information that needs to be captured that the model doesn't catch, but adding these properties can be done in a very ad hoc fashion.

Once the model has been sufficiently refined, the next stage of the process is converting this into an RDF-XML format, and from there into an XML format. Additionally, this is a good stage at which to use visualization tools such as raptor and dot to visualize the relationships in terms of a graph. I'll cover these in subsequent articles.
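
As a rough preview of the graph view only, the resource-to-resource links can be written out as Graphviz DOT text with a few lines of plain Python. This is a sketch under assumptions, not the workflow the later articles describe: the to_dot function and the graph name "model" are invented for this illustration, and resource names are assumed to contain no double quotes.

```python
def to_dot(triples):
    """Render resource-to-resource triples as Graphviz DOT source text."""
    lines = ["digraph model {"]
    for s, p, o in sorted(triples):
        # One labeled edge per triple: subject -> object, labeled by predicate.
        lines.append(f'  "{s}" -> "{o}" [label="{p}"];')
    lines.append("}")
    return "\n".join(lines)


links = {
    ("book:Book1", "book:author", "author:Author1"),
    ("book:Book1", "book:author", "author:Author2"),
    ("book:Book1", "book:publisher", "publisher:Publisher1"),
}

dot_source = to_dot(links)
print(dot_source)
```

Saved to a file, the output can be rendered with a command along the lines of dot -Tpng model.dot -o model.png.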



4 comments:

  1. Hi Kurt -- please elaborate on why you think conceptual and/or logical data modeling are ineffective. I agree that modeling in UML is a mixed bag (er, no pun intended...), and that some modeling tools can be cumbersome, but I still believe it's most productive to start with conceptual modeling and to jump into XML/SQL/RDF/etc. only after building consensus on the conceptual model. thx for sharing your insights, as always!

  2. Hey, Peter!

    My response got too long, so I have reposted it as a blog post.

  3. >the next stage of the process is converting this into an
    >RDF-XML format, and from there into an XML format.

    Why bother converting it to RDF/XML along the way? Because RDF/XML can represent the same triple different ways, XML-based conversions of that into some other XML can be a real pain unless they're taking for granted that the same tool will always be used to create the RDF/XML.

    I would just do a SPARQL query of the data above and ask for the results in SPARQL XML Query Results format (http://www.w3.org/TR/2008/REC-rdf-sparql-XMLres-20080115/) and then convert that to the XML that you really want. That would make things much simpler.

  4. Bob,

    I actually do exactly that with the XQuery interfaces - I make a SPARQL call that's parameterized for XQuery consumption to Jena, then retrieve the result back as Sparql-XML, which I then generally transform into a more named element structure based upon the sparql variables (it's just a little easier to query against that way).

    I use the RDF-XML for a few things - storing element-attribute keys based upon the @rdf:description attribute for use in retrieving records via XQuery (this fits in nicely with a REST model), loading in large data stores into Jena from external sources, and creating and updating new RDF ntuples from external XQuery content (as well as using this to support versioning and auditing within triple stores). In general these kinds of actions are easier to perform using rdf-xml and the /ds/data option for SOH from XQuery than they are by trying to construct SPARQL insert statements from xquery. However, as you pointed out, I do not see the rdf-xml as canonical - only the rdf triples within Jena.
