Thursday, May 23, 2013

Much Ado About Nothing: Blank Nodes in RDF

Here's a secret - you want to understand a data format? Learn its query language. I've worked heavily with XQuery for several years now, but only fairly recently (three years now) did I start working with SPARQL for RDF (along with the various view languages that Couchbase and Mongo expose), and it's given me far more insight into RDF than the eight years I spent before then trying to understand what the language was all about. Indeed, I'd go so far as to say that SPARQL makes RDF, dare I say it, accessible.

For instance, consider one of the more vexing aspects of RDF - how do you deal with composition vs. aggregation? Now, before getting too deep into the realm of modeling, it's worth taking a look at what each of these means.

In XML, you tend to describe aggregations and compositions the same way - a specific element of one type holds a collection of elements of another type (or subclasses thereof). For instance, a resume element may include a one to many relationship with specific jobs that were held. It may also contain a one to many relationship with the articles, papers or books that you have written. Both of these are called associations - you are associating a given entity with another entity, but they are not quite the same type of thing.

To understand why, you need to ask the question - does the associated object have any meaning outside the context of the containing object? In the case of books, the answer is most assuredly yes - you may not be the only author, the books are likely available by ISBN or on the web, and if you delete the resume, the books do not themselves disappear. In this case, you're dealing with an aggregation - the child entities have a distinct identity outside the boundaries of the container. In RESTful terms, the child entities are addressable resources.

Composition On the Job

The case of jobs is a little harder. A job is a description of a state - what you were doing at any given time. While the job may have a job title, an associated company and the like, it effectively is a short-hand way of talking about something that you do or did at some point. Take away the context - you - and such jobs generally make much less sense. In a composition, then, the child entities being described are more like states than they are like objects - they are generally only "locally" addressable relative to the containing element or context.

In XML, such structures occur all the time:

<resume>
       <name>Jane Doe</name>
       <job>
             <jobTitle>Architect</jobTitle>
             <company>Colossal Corp.</company>
             <startDate>2011-07</startDate>
             <description>Designed cool stuff.</description>
       </job>
       <job>
             <jobTitle>Senior Programmer</jobTitle>
             <company>Big IT Corp.</company>
             <startDate>2008-05</startDate>
             <endDate>2011-04</endDate>
             <description>Programmed in many cool languages, and built some stuff.</description>
       </job>
       <job>
             <jobTitle>Junior Programmer</jobTitle>
             <company>Small IT Corp.</company>
             <startDate>2005-05</startDate>
             <endDate>2008-04</endDate>
             <description>Programmed in a couple of cool languages, and built some other stuff.</description>
       </job>
</resume>

So the question here is whether a job is an aggregation or a composition. A good way of thinking about this is to ask yourself whether, if you turned each job into a separate XML document, it would have enough context information to make sense:


       <job>
             <jobTitle>Junior Programmer</jobTitle>
             <company>Small IT Corp.</company>
             <startDate>2005-05</startDate>
             <endDate>2008-04</endDate>
             <description>Programmed in a couple of cool languages, and built some other stuff.</description>
       </job>


Here there's no "person" that this belongs to - it could be a job held by anybody (or multiple people at the same time, conceivably). Without that context the information here is insufficient to be useful, save in indicating that someone claims to have worked at a given place. <job> is clearly a composition relationship.

Now zoom down another level to company:


             <company>Small IT Corp.</company>

Curiously enough, this "document" is actually a stand-alone entity. If I had a database with resumes from a number of different people, being able to see all companies that are represented by the database would be a key requirement, and what's more, as a database designer I'd probably want to ensure that there is one and only one representation of that entity's name, to avoid partitioning the data unnecessarily. It's an aggregation.

In RDF (as expressed in Turtle), this becomes much more evident (I'm suppressing namespaces here for ease of reading):
person:Jane_Doe rdf:type class:Person;
          person:name "Jane Doe";
          person:job [
                  job:jobTitle "Architect";
                  job:company company:ColossalCorp;
                  job:startDate "2011-07"^^xs:gYearMonth;
                  job:description "Designed cool stuff"
                  ],
                  [
                  job:jobTitle "Senior Programmer";
                  job:company company:BigITCorp;
                  job:startDate "2008-05"^^xs:gYearMonth;
                  job:endDate "2011-04"^^xs:gYearMonth;
                  job:description "Programmed in many cool languages, and built some stuff."
                  ],
                  [
                  job:jobTitle "Junior Programmer";
                  job:company company:SmallITCorp;
                  job:startDate "2005-05"^^xs:gYearMonth;
                  job:endDate "2008-04"^^xs:gYearMonth;
                  job:description "Programmed in a couple of cool languages, and built some other stuff."
                  ].

company:ColossalCorp rdf:type class:Company;
             company:companyName "Colossal Corporation".

company:BigITCorp rdf:type class:Company;
             company:companyName "Big IT Corporation".

company:SmallITCorp rdf:type class:Company;
             company:companyName "Small IT Corporation".





If you are new to Turtle, this may take a bit of explaining. An expression like

a  b   c;
   d   e.

is a shorthand for the statements  

a  b  c.
a  d  e.

Similarly, 

a  b  c, d, e.

is a shorthand for the statements  

a  b  c.
a  b  d.
a  b  e.

The first (semicolon) form assumes that the subject is repeated but with new predicates and objects, while the second (comma) form assumes a common subject and predicate but different objects.
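The two shorthands can be mimicked in a few lines of Python. This is only a sketch of the expansion rule, not a Turtle parser; the function name expand and the dictionary layout are assumptions made for illustration.

```python
def expand(subject, predicate_objects):
    """Expand Turtle-style shorthand into flat (subject, predicate, object) triples.

    predicate_objects maps each predicate to a list of objects. Moving on to a
    new key plays the role of ';' and each extra list item plays the role of ','.
    """
    triples = []
    for predicate, objects in predicate_objects.items():
        for obj in objects:
            triples.append((subject, predicate, obj))
    return triples


# Equivalent of the shorthand:  a  b  c, d;  e  f.
triples = expand("a", {"b": ["c", "d"], "e": ["f"]})
for s, p, o in triples:
    print(s, p, o, ".")
```

Each printed line is one complete statement, which is what a Turtle parser hands to the triple store once the punctuation has been resolved.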

However, what does the [] notation mean? The brackets are indicative of a blank node. A blank node can be thought of as being the equivalent of a composition container in XML. For instance, in XML you have the fragment


       <job>
             <jobTitle>Architect</jobTitle>
             <company>Colossal Corp.</company>
             <startDate>2011-07</startDate>
             <description>Designed cool stuff.</description>
       </job>


The <job> element itself doesn't really serve any purpose beyond indicating that its contents are thematically part of a larger construct. It's not in and of itself a property, but more properly a property "bag". As such, it can be thought of as being (somewhat) analogous to an array - in a language such as Java or JavaScript, you'd reference it not by name but by position. (We'll get back to that "somewhat" shortly.)

Internally, RDF triple stores don't use numbers directly for identifying such entities. Instead, they treat them as blank nodes. A blank node is an internal resource identifier, meaning that, unlike most resources, it can't be directly accessed. Instead (at least from the standpoint of SPARQL), blank nodes are usually referenced indirectly, by being determined from the local context. Blank nodes are usually shown in articles as having the syntax _:b0, _:b1, etc. For instance, the resume above could be rewritten as:


person:Jane_Doe rdf:type class:Person;
          person:name "Jane Doe";
          person:job _:b0, _:b1, _:b2.
_:b0      job:jobTitle "Architect";
          job:company company:ColossalCorp;
          job:startDate "2011-07"^^xs:gYearMonth;
          job:description "Designed cool stuff".
_:b1      job:jobTitle "Senior Programmer";
          job:company company:BigITCorp;
          job:startDate "2008-05"^^xs:gYearMonth;
          job:endDate "2011-04"^^xs:gYearMonth;
          job:description "Programmed in many cool languages, and built some stuff.".
_:b2      job:jobTitle "Junior Programmer";
          job:company company:SmallITCorp;
          job:startDate "2005-05"^^xs:gYearMonth;
          job:endDate "2008-04"^^xs:gYearMonth;
          job:description "Programmed in a couple of cool languages, and built some other stuff.".

Notation-wise, this makes the structure of each job more obvious, but gets a little more complicated referentially. Jane Doe has had three jobs. The jobs each have an internal identifier, but because the jobs make no sense outside the context of being jobs for Jane, those identifiers never need to be visible to anyone else. So why not use explicit identifiers for each job?


person:Jane_Doe rdf:type class:Person;
             person:name "Jane Doe";
             person:job job:105319, job:125912, job:272421.
job:105319   job:jobTitle "Architect";
             job:company company:ColossalCorp;
             job:startDate "2011-07"^^xs:gYearMonth;
             job:description "Designed cool stuff".
job:125912   job:jobTitle "Senior Programmer";
             job:company company:BigITCorp;
             job:startDate "2008-05"^^xs:gYearMonth;
             job:endDate "2011-04"^^xs:gYearMonth;
             job:description "Programmed in many cool languages, and built some stuff.".
job:272421   job:jobTitle "Junior Programmer";
             job:company company:SmallITCorp;
             job:startDate "2005-05"^^xs:gYearMonth;
             job:endDate "2008-04"^^xs:gYearMonth;
             job:description "Programmed in a couple of cool languages, and built some other stuff.".

company:ColossalCorp rdf:type class:Company;
             company:companyName "Colossal Corporation".

company:BigITCorp rdf:type class:Company;
             company:companyName "Big IT Corporation".

company:SmallITCorp rdf:type class:Company;
             company:companyName "Small IT Corporation".

From the standpoint of SPARQL, it makes very little difference. For instance, if you wanted to get the title and company name of each job a person has held, the SPARQL query would look something like this:

select ?personName ?jobTitle ?companyName where {
    ?person   rdf:type      class:Person;
              person:name   ?personName;
              person:job    ?job.
    ?job      job:jobTitle  ?jobTitle;
              job:company   ?company.

    ?company  company:companyName
                            ?companyName.
    }
which would produce the table:
?personName   ?jobTitle             ?companyName
"Jane Doe"    "Architect"           "Colossal Corporation"
"Jane Doe"    "Senior Programmer"   "Big IT Corporation"
"Jane Doe"    "Junior Programmer"   "Small IT Corporation"


The ?job variable, in this case, will carry either the blank node job entry or the defined job in exactly the same manner. So why use a blank node? In most cases, they're used because creating explicit named resource URIs can be a pain, or because there is no real advantage to working with them outside of the resource context from which they're called. Typically, RDF triple stores actually optimize blank nodes separately, and ensure that blank nodes never collide within the data system itself.
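To see why the query doesn't care, here is a rough Python sketch that mimics the join over plain tuples. It is not how a SPARQL engine works internally; the helper names job_rows and resume are made up for this illustration, and the data is a cut-down version of Jane's resume.

```python
def job_rows(triples):
    """Return sorted (personName, jobTitle, companyName) rows.

    The join follows the shape of the query: person -> job -> company.
    Nothing here inspects what kind of identifier a job node carries.
    """
    rows = []
    for person, predicate, job in triples:
        if predicate != "person:job":
            continue
        if (person, "rdf:type", "class:Person") not in triples:
            continue
        names = [o for s, p, o in triples if s == person and p == "person:name"]
        titles = [o for s, p, o in triples if s == job and p == "job:jobTitle"]
        companies = [o for s, p, o in triples if s == job and p == "job:company"]
        for company in companies:
            company_names = [o for s, p, o in triples
                             if s == company and p == "company:companyName"]
            rows.extend((n, t, c) for n in names for t in titles for c in company_names)
    return sorted(rows)


def resume(job_ids):
    """Build a small resume graph using whatever job identifiers are passed in."""
    first, second = job_ids
    return [
        ("person:Jane_Doe", "rdf:type", "class:Person"),
        ("person:Jane_Doe", "person:name", "Jane Doe"),
        ("person:Jane_Doe", "person:job", first),
        ("person:Jane_Doe", "person:job", second),
        (first, "job:jobTitle", "Architect"),
        (first, "job:company", "company:ColossalCorp"),
        (second, "job:jobTitle", "Senior Programmer"),
        (second, "job:company", "company:BigITCorp"),
        ("company:ColossalCorp", "company:companyName", "Colossal Corporation"),
        ("company:BigITCorp", "company:companyName", "Big IT Corporation"),
    ]


blank_rows = job_rows(resume(["_:b0", "_:b1"]))
named_rows = job_rows(resume(["job:105319", "job:125912"]))
print(blank_rows == named_rows)
```

The job identifier is only ever used as a join key, so swapping blank node labels for named resources leaves the rows untouched.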

Another advantage is that blank nodes simplify both Turtle and RDF/XML notation. Turtle notation can use the [] square bracket approach (and can even nest such brackets within other blank node expressions). RDF/XML, on the other hand, can represent blank nodes in two modes. In the denormalized (or embedded) form, the @rdf:parseType="Resource" expression is used on the container node to indicate that it should be represented as a blank node:


<resume rdf:about="person:Jane_Doe">
       <name>Jane Doe</name>
       <job rdf:parseType="Resource">
             <jobTitle>Architect</jobTitle>
             <company>Colossal Corp.</company>
             <startDate>2011-07</startDate>
             <description>Designed cool stuff.</description>
       </job>
       ..
</resume>


The same expression can also be normalized:

<rdf:RDF>
   <resume rdf:about="person:Jane_Doe">
       <name>Jane Doe</name>
       <job rdf:nodeID="b0"/>
       <job rdf:nodeID="b1"/>
       <job rdf:nodeID="b2"/>
   </resume>
   <rdf:Description rdf:nodeID="b0">
       <jobTitle>Architect</jobTitle>
       <company rdf:resource="company:ColossalCorp"/>
       <startDate rdf:datatype="http://www.w3.org/2001/XMLSchema#gYearMonth">2011-07</startDate>
       <description>Designed cool stuff.</description>
   </rdf:Description>
   ...
</rdf:RDF>


The embedded format actually hints at how one can convert XML into RDF - it mostly involves adding resource references (e.g., for company) and attributes like rdf:parseType. (XML to RDF conversion is food for another post.)


Localization with Les Nœuds Anonymes


Blank nodes can actually be very useful for dealing with groups of properties that differ based upon language or locale. RDF utilizes the @lang expression at the end of strings to handle localization, which can work in simple cases, but if there are a number of properties that are all localized (such as prices and currencies) then blank nodes may be a better solution. For instance,


book:The_Art_Of_SPARQL book:bookLocal
     [
        bookLocal:lang "EN";
        bookLocal:locale "US";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "29.95";
        bookLocal:currency "USD";
     ],

     [
        bookLocal:lang "EN";
        bookLocal:locale "UK";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "21.95";
        bookLocal:currency "GBP";
     ],
     [
        bookLocal:lang "DE";
        bookLocal:locale "DE";
        bookLocal:title "Die Kunst des SPARQL";
        bookLocal:price "24.95";
        bookLocal:currency "EUR";
     ].


In this case, the implied bookLocal objects are blank nodes. You can then retrieve the title, price and currency of a book in a different locale if the US book title is known:

select ?title ?lang ?price ?currency where {
    bind ("DE" as ?locale)
    bind ("The Art of SPARQL" as ?usTitle)
    ?book book:bookLocal [
              bookLocal:title ?usTitle;
              bookLocal:locale "US"].
    ?book book:bookLocal [
              bookLocal:locale ?locale;
              bookLocal:title ?title;
              bookLocal:lang ?lang;
              bookLocal:price ?price;
              bookLocal:currency ?currency
              ].
    }

This produces the results:
 
?title                   ?lang   ?price    ?currency
"Die Kunst des SPARQL"   "DE"    "24.95"   "EUR"

This information can, depending upon the SPARQL engine, be output as JSON or XML. This approach can also be useful for dealing with triples where the object is itself an XML node (such as XHTML content) where different languages are involved, since the XMLLiteral type can't directly take a @lang type extension. I use such blank node assignments quite frequently for precisely that kind of issue, making data modeling when dealing with localization considerably easier.
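The locale lookup described above boils down to two hops through anonymous nodes that share a parent book. The following Python sketch shows that shape over plain tuples; the function names, the "_:us" and "_:de" labels, and the one-value-per-property assumption are all mine, not something a triple store exposes.

```python
FIELDS = ("bookLocal:title", "bookLocal:lang", "bookLocal:price", "bookLocal:currency")


def properties(triples, node):
    """Gather a node's properties, assuming one value per predicate."""
    return {p: o for s, p, o in triples if s == node}


def localized_edition(triples, us_title, locale):
    """Find a book by its US title, then return its rows for another locale."""
    rows = []
    editions = [(s, o) for s, p, o in triples if p == "book:bookLocal"]
    for book, us_node in editions:
        us = properties(triples, us_node)
        if us.get("bookLocal:locale") != "US" or us.get("bookLocal:title") != us_title:
            continue
        # Second hop: any other property bag hanging off the same book.
        for other_book, node in editions:
            local = properties(triples, node)
            if (other_book == book and local.get("bookLocal:locale") == locale
                    and all(field in local for field in FIELDS)):
                rows.append(tuple(local[field] for field in FIELDS))
    return rows


graph = [
    ("book:The_Art_Of_SPARQL", "book:bookLocal", "_:us"),
    ("book:The_Art_Of_SPARQL", "book:bookLocal", "_:de"),
    ("_:us", "bookLocal:lang", "EN"),
    ("_:us", "bookLocal:locale", "US"),
    ("_:us", "bookLocal:title", "The Art of SPARQL"),
    ("_:us", "bookLocal:price", "29.95"),
    ("_:us", "bookLocal:currency", "USD"),
    ("_:de", "bookLocal:lang", "DE"),
    ("_:de", "bookLocal:locale", "DE"),
    ("_:de", "bookLocal:title", "Die Kunst des SPARQL"),
    ("_:de", "bookLocal:price", "24.95"),
    ("_:de", "bookLocal:currency", "EUR"),
]

rows = localized_edition(graph, "The Art of SPARQL", "DE")
print(rows)
```

Note that the blank node labels never appear in the result; they only serve to keep each locale's properties bundled together.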

It's also worth noting that even though the nodes themselves are blank (or anonymous), that doesn't mean that they can't be associated with schematic types. For instance, in the previous example, the inclusion of a single statement in each block means that you can use RDFS/OWL validation on each "bookLocal" object:

 book:The_Art_Of_SPARQL book:bookLocal
     [
        rdf:type class:bookLocal;
        bookLocal:lang "EN";
        bookLocal:locale "US";
        bookLocal:title "The Art of SPARQL";
        bookLocal:price "29.95";
        bookLocal:currency "USD";
     ].

These basically correspond to anonymous objects within JavaScript, with the added benefit that with RDF you can still do all of the type constraint checks that you could do with XSD (and more).


The Edge of the Graph


One final note about blank nodes - they are very useful for establishing the boundaries between objects for purposes such as getting the result of a DESCRIBE or deleting distinct objects (rather than just individual assertions) from a database. SPARQL leaves the exact output of DESCRIBE up to the implementation, but a typical engine, when given a subject, follows each assertion from the initial subject to each object. If the object is an atomic value, the assertion is kept, but no further search is done. If the object is a URI, the same thing happens.

However, if the object is a blank node, then the blank node is added as a subject and the same process is used. Deleting an object involves doing a "Describe" on the resource, finding all assertions for which the subject is some other assertion's object, then deleting all of these links.
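The traversal just described can be sketched in Python as follows. This is one reading of the process under simplifying assumptions: triples are plain tuples, anything starting with "_:" is treated as a blank node, and the deletion removes only the described assertions (triples elsewhere that point at the resource are left alone). The function names are hypothetical.

```python
def describe(triples, start):
    """Collect the assertions reachable from start, expanding only blank nodes."""
    result = []
    seen = set()
    pending = [start]
    while pending:
        subject = pending.pop()
        if subject in seen:
            continue
        seen.add(subject)
        for triple in triples:
            s, p, o = triple
            if s != subject:
                continue
            result.append(triple)
            # Literals and URIs are kept but not followed; blank nodes are.
            if isinstance(o, str) and o.startswith("_:"):
                pending.append(o)
    return result


def delete_resource(triples, start):
    """Remove a resource together with the blank nodes that hang off it."""
    doomed = set(describe(triples, start))
    return [t for t in triples if t not in doomed]


graph = [
    ("person:Jane_Doe", "person:name", "Jane Doe"),
    ("person:Jane_Doe", "person:job", "_:b0"),
    ("_:b0", "job:jobTitle", "Architect"),
    ("_:b0", "job:company", "company:ColossalCorp"),
    ("company:ColossalCorp", "company:companyName", "Colossal Corporation"),
]

description = describe(graph, "person:Jane_Doe")
remaining = delete_resource(graph, "person:Jane_Doe")
print(len(description), remaining)
```

In this toy graph the job goes away with Jane because it is a blank node, while the company, being a named resource, survives the delete. That is the composition versus aggregation distinction at work.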

This can be especially useful for handling updates into RDF databases. I'll be covering this in more detail in a later post.


Nothing Much? Nah!


Blank nodes are a useful feature of RDF, are well supported by SPARQL and Turtle notation, and can help to differentiate between aggregate structures (references to external objects within the system) and composed structures (references to internal entities that only make sense within the context of a given external entity). They are used heavily by both Turtle and RDF/XML parsers, and they can be used both to define the boundaries of resources within the system and to delete those resources in a logical and consistent fashion. They should be seen as indispensable tools of the working ontologist.
