Wednesday, April 3, 2013

The Joys of Reification

The danger of periodically producing a blog is that every so often something comes along that eats up a significant amount of time, and next thing you know, a month or more ends up passing between posts. In my case, that's largely been demo writing and wrapping up revisions after the SVG technical edits, and my editors are waiting ever so patiently (actually, no, they're not) for me to get a manuscript to them.

I wanted to take a few minutes, though, to capture the fruits of an exchange I had with a friend of mine about storing information in triple stores. The topic was a seemingly trivial one that was anything but: when you have only three values in an RDF assertion, how can you incorporate time stamps or similar information? The answer, as it turns out, is to create metadata about the metadata - a process that RDF types refer to as reification.
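
As a point of reference before diving in, RDF actually ships a vocabulary for exactly this - rdf:Statement, along with rdf:subject, rdf:predicate and rdf:object - which turns a statement into a resource you can hang other properties off of. A minimal sketch of that classic form, borrowing one of the character statements from the example below and using a placeholder timestamp, might look like:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

insert data {
# "Aleria is an elf" becomes a resource we can describe
    [] a rdf:Statement;
       rdf:subject <character:Aleria>;
       rdf:predicate <character:species>;
       rdf:object <species:Elf>;
       <owl:email> "kurt.cagle@gmail.com";
       <owl:timestamp> "2013-04-03T09:00:00Z"^^xsd:dateTime.
};

The approach I walk through below trades rdf:Statement for a hash-keyed node in a separate named graph, but the underlying idea - metadata about metadata - is the same.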

In this case, my friend wanted to capture two critical pieces of information: when a particular statement was entered into the system, and who submitted it. Other information could be added, of course, but these two illustrate the problem well enough.

Suppose that I had an RPG game and wanted to add a couple of characters to the system. I could add a timestamp property and an email identifier as properties of the subject, but what I really want to track is when a given statement about the characters was inserted. In essence, I'm trying to get information about the statement, not the subject. Here's the SPARQL 1.1 Update data:

insert data {
     <character:Aleria> <owl:instanceOf> <class:Character>;
                        <character:species> <species:Elf>;
                        <character:gender> <gender:Female>;
                        <character:location> [
                            <location:city> <city:New_Alfrans>;
                            <location:nation> <nation:Mirales>].

    <character:Merasha> <owl:instanceOf> <class:Character>;
                        <character:species> <species:HalfElf>;
                        <character:gender> <gender:Female>;
                        <character:location> [
                            <location:city> <city:Tunsany>;
                            <location:nation> <nation:Milesia>].
};

There are actually two separate (albeit related) problems here. The first is how to attach properties to specific statements. The second is how to do so while ensuring that the "auditing" information isn't itself added to the lists of things that are normally queried; otherwise you get an ongoing additive spiral.

The second problem can actually be solved quite handily with the use of a named graph, which we'll call <app:assert> for brevity. This is a graph that just contains assertions about assertions. When you query against the default graph (or some other named graph), this graph will be ignored unless you specifically reference it.
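
How strict that isolation is depends a bit on your store - some engines union all named graphs into the default graph unless told otherwise - but the basic contrast looks like this:

# Ordinary queries against the default graph never see the audit triples ...
select ?s ?p ?o
where { ?s ?p ?o }

# ... while the audit data has to be asked for by name:
select ?s ?p ?o
where { graph <app:assert> { ?s ?p ?o } }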

The first problem then comes down to creating a unique key for an assertion statement. This is a good place to pull out a hashing function (md5() is a pretty good one to start with, and is supported by most SPARQL 1.1 engines). By creating a hash from a concatenation of the subject, predicate and object, we get a reproducible, unique key.
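
As a quick sanity check, here's the key that one of the statements from the character data would get (just a throwaway query, but handy for debugging the hashing scheme):

select ?hash
where {
    bind (md5(concat(str(<character:Aleria>),
                     str(<character:species>),
                     str(<species:Elf>))) as ?hash)
}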

With these two pieces in place, you can then create a SPARQL 1.1 Update script that would be called periodically:

insert {
    graph <app:assert> {
        ?statement <owl:hash> ?hash.
# The next three statements may not be necessary
        ?statement <owl:subject> ?subject.
        ?statement <owl:predicate> ?predicate.
        ?statement <owl:object> ?object.
        ?statement <owl:timestamp> ?timestamp.
        ?statement <owl:email> ?email.
        }
    }
where {
    ?subject ?predicate ?object.
# Skip statements that already have an audit record in <app:assert>
    filter not exists {
        graph <app:assert> {
            ?s <owl:subject> ?subject.
            ?s <owl:predicate> ?predicate.
            ?s <owl:object> ?object.
            }
        }
    bind ("kurt.cagle@gmail.com" as ?email)
    bind (uuid() as ?statement)
    bind (now() as ?timestamp)
    bind (md5(concat(str(?subject),str(?predicate),str(?object))) as ?hash)
}

The insert clause creates the new records based upon the bindings from the where clause. The first triple pattern in the where clause matches every statement, while the filter ensures that only statements which haven't already been audited get picked up. Since typically only certain objects in the system really need to be audited, additional statements limiting the audit to items of a given class may be useful here:

where {
    ?subject ?predicate ?object.
    ?subject a ?systemClass.
    ...
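
Filled out, a class-restricted version of the where clause might look something like the following sketch. Note that the example data above types its characters with <owl:instanceOf> rather than rdf:type, so the pattern follows suit, and <class:Character> is simply the class from that data:

where {
    ?subject ?predicate ?object.
    ?subject <owl:instanceOf> <class:Character>.
    filter not exists {
        graph <app:assert> {
            ?s <owl:subject> ?subject.
            ?s <owl:predicate> ?predicate.
            ?s <owl:object> ?object.
            }
        }
    bind ("kurt.cagle@gmail.com" as ?email)
    bind (uuid() as ?statement)
    bind (now() as ?timestamp)
    bind (md5(concat(str(?subject),str(?predicate),str(?object))) as ?hash)
}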

The bind statements would mostly be generated by the interface that builds this script (I use a regex method myself to map variables of the form $foo into IRIs or strings). The first binds an email string (or an IRI, if that's how you're identifying such keys), the second mints a fresh IRI for the statement via uuid(), the third reads the current time at which the script is called, and the last creates an md5 hash by concatenating the subject, predicate and object together.

(It might be possible to create an IRI directly from the hash as well, but that's still in essence a hash key. This just makes the association more explicit and avoids formal semantics in identifiers.)
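
If you did want to go that route, SPARQL's iri() function will do the job - the urn:hash: scheme here is an ad hoc choice of mine, and the ?hash bind has to move ahead of it so the variable is in scope:

    bind (md5(concat(str(?subject),str(?predicate),str(?object))) as ?hash)
    bind (iri(concat("urn:hash:", ?hash)) as ?statement)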

This then creates new records in the <app:assert> graph with the relevant properties.

Retrieving this information is pretty straightforward, and assumes that the hashing algorithm in your system is the same as the one that generated the keys (most likely the case). If you have a subject, predicate and object, you can get the timestamp and other properties with a simple query:

select ?subject ?predicate ?object ?timestamp
where {
    ?subject ?predicate ?object.
    bind (md5(concat(str(?subject),str(?predicate),str(?object))) as ?hash)
    graph <app:assert> {
        ?statement <owl:hash> ?hash.
        ?statement <owl:timestamp> ?timestamp.
        }
}

Keep in mind that there is a multiplicative factor here - every audited assertion generates another half dozen triples in <app:assert>, so this database can get big fast. To that end, it may be that you don't actually need the subject, predicate and object copies in the corresponding record, though they can be useful for higher order meta-functions.
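
As an example of what those copies buy you, here's the kind of audit query they make possible - everything a given submitter asserted since some cutoff date (the date literal is purely illustrative):

prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select ?subject ?predicate ?object ?timestamp
where {
    graph <app:assert> {
        ?statement <owl:email> "kurt.cagle@gmail.com".
        ?statement <owl:timestamp> ?timestamp.
        ?statement <owl:subject> ?subject.
        ?statement <owl:predicate> ?predicate.
        ?statement <owl:object> ?object.
        }
    filter (?timestamp > "2013-03-01T00:00:00Z"^^xsd:dateTime)
}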

Beyond simple auditing, this can also be a mechanism for attaching a confidence measure to a statement (something between 0 and 1). Confidence captures the fuzziness of an assertion when it isn't absolute, allowing you to perform Bayesian analysis on it. A simple example might be friend-of-a-friend analysis, where the strength of each foaf:knows statement varies from not a friend (0.0) to a life-bonded friend (1.0).

prefix foaf: <http://xmlns.com/foaf/0.1/>

insert {
    graph <app:assert> {
        ?statement <owl:confidence> 0.7.
        }
    }
where {
# Pin the score to the single statement "kurt knows jane"
    bind (<kurt.cagle@gmail.com> as ?me)
    bind (<jane.doe@gmail.com> as ?friend)
    ?me foaf:knows ?friend.
    bind (md5(concat(str(?me),str(foaf:knows),str(?friend))) as ?hash)
    graph <app:assert> {
        ?statement <owl:hash> ?hash.
        }
    };

With something like this, you can then do things like return chains of all friends that know one another reasonably well (confidence > 0.5):

prefix foaf: <http://xmlns.com/foaf/0.1/>

select ?friend1 ?friend2 ?confidence
where {
    ?friend1 foaf:knows+ ?friend2.
    filter exists {<kurt.cagle@gmail.com> foaf:knows ?friend2}
    bind (md5(concat(str(?friend1),str(foaf:knows),str(?friend2))) as ?hash)
    bind (0.5 as ?threshold)
# The hash lookup only matches pairs that know each other directly
    graph <app:assert> {
        ?statement <owl:hash> ?hash.
        ?statement <owl:confidence> ?confidence.
        }
    filter (?confidence > ?threshold)
}

More generally, you can create queries where you have moderate to high faith that the assertions you are making are sound, or, on the flip side, where you want to explore potential inferences while still getting a sense of how likely those inferences actually are. This approach can also be used in conjunction with external programs to create a feedback loop: you make a tentative assertion, then as other data corroborates it you increase the confidence to reflect this.
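
Here's a sketch of that corroboration step - the 0.1 increment, the cap at 1.0, and the particular statement being bumped are all arbitrary choices on my part:

prefix foaf: <http://xmlns.com/foaf/0.1/>

delete { graph <app:assert> { ?statement <owl:confidence> ?old. } }
insert { graph <app:assert> { ?statement <owl:confidence> ?new. } }
where {
    bind (md5(concat(str(<kurt.cagle@gmail.com>),str(foaf:knows),str(<jane.doe@gmail.com>))) as ?hash)
    graph <app:assert> {
        ?statement <owl:hash> ?hash.
        ?statement <owl:confidence> ?old.
        }
# Nudge the confidence upward, but never past 1.0
    bind (if(?old + 0.1 > 1.0, 1.0, ?old + 0.1) as ?new)
}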

Reification is a useful technique for working with semantic content - it can help support versioning at the statement level, can be used for setting up generalized triggers, and can make archiving of assertions over time more manageable. This technique also hints at how named graphs can be used as scratchpads for operations - creating interim named graphs for establishing hypotheses, then, once these prove out, clearing the graphs and freeing up space. That, however, is grist for another article.
