Monday, April 22, 2013

Age Discrimination or Reasonable Expectations?


I recently saw a discussion about Age Discrimination in the IT Workforce (http://www.linkedin.com/today/post/article/20130422020049-8451-the-tech-industry-s-darkest-secret-it-s-all-about-age) and it got me thinking about it beyond the immediate knee-jerk reaction of fear about my own job standing. In general, while IT people tend to have an outsized impression of the value of their code within an organization (sometimes deserved, sometimes not), from an organizational perspective most code is simply a depreciating asset that will need to be replaced over time. For that reason, the cost issue of paying a senior developer at 55 two to three times a junior programmer's salary for what is perceived as a decaying asset begins to make sense.

Moreover, there are typically several career tracks which employers look for with technical people, and these are very much tied to age and experience. Looked at from that perspective, the age distribution within the industry begins to look a little bit more rational, but it also means that you should be thinking about career management from the time you leave college.

At 25, you're basically an apprentice - learning the "art" of programming, gaining experience, putting in the hours to prove yourself. You don't have a family, you're willing to work 80 hours for a 40 hour a week job, coding is still new and shiny, and the code that you write, while probably not brilliant, will likely be more innovative simply because you're not locked into specific patterns yet. Employers will hire you because they are generally less concerned with quality than they are costs. You're expendable, and for this age group being expendable is not necessarily a bad thing - it gives you exposure to different problem domains, and teaches you how to stay nimble in the market.

By 40, you're probably a good programmer, skilled in several languages, with a few major project successes and failures under your belt. However, your value to the organization increases if you can use that experience to train up, mentor and manage the younger programmers, learn how to knock heads to keep egos out of the way of making deadlines, or to transfer that experience more into architecture or product design. It's also a good time to go journeyman - gain experiences as a consultant, learn how to work directly with clients and how to recognize when problems are better handled with social programming than technological ones. Start working at the systems architecture level. Write a book or three.

As you head into your fifties, it's a better time to start a company, to expand your consulting to large companies or to go into research - it's also not a bad time to go back to school and get that PhD that you didn't have time for earlier, since kids are nearing adulthood by then if not already there, you have a much better breadth of knowledge on which to base your studies and you are far more likely to get into a faculty position or a research institute if you're seasoned with real world experience (and you'll be a better teacher to your students as well).

From an entrepreneurial standpoint you have the connections that are so critical to starting a business and building up a market. As a consultant, you don't have the demands of home and hearth holding you back to the same extent they did when your kids were still young, so you are capable of travel and long distance gigs. You're also more likely to be acting in a mixed business/technical capacity, providing architectural or business guidance while some whiz kid in his twenties bangs out the cool code.

If you're 55 and still doing nothing but coding, senior management will wonder why they are shelling out so much for that code, no matter how brilliant, since they see the code as being ultimately expendable.

I'm fifty. I am a consulting information architect, have written more than a dozen books, and regularly consult with federal agencies and Fortune 100 companies. I still write code, but most of that code now gets put into my books or articles or becomes a proof of concept for clients when trying to sell a new technology idea. In other words, coding is simply one tool in my suite of tools now. The pay is generally better, the ability to influence projects is considerably more significant, and I find it easier to deal directly with the C-level executives that make the decisions.

What this means to a young programmer in particular is simple - expand your horizons. The code is important, but it is not the only important piece in the puzzle, and in many respects no matter how good a programmer you are, it's your ability to navigate the social shoals and reefs that determines how successful you are in your career. As an employer, I'd be as skeptical about a 55 year old programmer as I would be about a 25 year old information architect - neither one has the experience that I need for the jobs at hand.

Sunday, April 7, 2013

Book Chapter and Verse - HTML5 + RDFa

Some time back I did some work for a religious publisher, and at the time began to play around with the question of how we could take a lot of the content that they had for their book publications and make them more accessible, especially given the interrelatedness of their various content. I'd begun playing at the time with RDF, and the notion of building graphs for solving this particular problem was attractive, but the tools were simply not yet there to do more than contemplate.

Recently, I had a chance to talk with a friend I'd first met there, and I got to thinking about the idea of making use of HTML plus one of the less discussed aspects of the HTML5 revolution - the utilization of RDF in Attributes, otherwise known as RDFa. The idea behind RDFa is relatively simple - put sufficient information into attributes to embed semantic markup in an HTML5 document, then, once you have this, convert the RDFa into RDF via a transformation technology called GRDDL (which is typically implemented in XSLT, but can also be done in Java).

Since I've been doing some exploration of semantics for a couple of large media clients, I decided it would be a good time to experiment, and get a better handle on working with RDFa -- including putting together some more challenging constructs in order to see how this can be effectively integrated into RDF/OWL.

What I ultimately came up with was a very tiny book of "holy writ" that may actually end up becoming a short story at some point (never throw away good concepts). At the moment, this exists primarily to show concepts, but I suspect I'll be able to build more up. The original XHTML is shown in Listing 1.

<html
    xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <header>
            <a name="Michael"><h1>The Book of Michael</h1></a>
        </header>
        <article>
            <a name="Michael-1-1" id="Michael-1-1">
                <p>In the heart of the world, not long after the beginning, a man named Michael Grenadine, he who was known as The Librarian, walked out of the Wastes.</p></a>
            <a name="Michael-1-2" id="Michael-1-2">
                <p>Travelled did he with naught but a humble mule, which he called The Horse With No Name, verily to wreak confusion and wonder among the unenlightened.</p></a>
            <a name="Michael-1-3" id="Michael-1-3">
                <p>Within his wagon, hauled by his humble mule, did Michael carry with him many books, written by the Monks of Knowledge and the Sisters of Understanding, the better to preserve the knowledge and wisdom of ages past, as well as the cautionary tales that had led to the Wastes.</p></a>
            <a name="Michael-1-4" id="Michael-1-4">
                <p>Weary was he, and sore afoot, for he had travelled many days and nights, had wandered through mountains and deserts, searching for a new home, until finally he collapsed, ill and feverish.</p></a>
            <a name="Michael-1-5" id="Michael-1-5">
                <p>There would Michael have died, had his mule not begun to bray in fear for his master, for its master was a kind man who fed the mule even when he himself had nothing, and groomed his mule before ever taking himself down to sleep.</p></a>
            <a name="Michael-2-1" id="Michael-2-1">
                <p>Not far away, near the town of Seattlec'l, a young woman named Alara Dishaean was tending her cows when she heard the braying of a mule.</p></a>
            <a name="Michael-2-2" id="Michael-2-2">
                <p>Curious, Alara left her farm on a Bantu Horse until she entered into the Great Forest, where she found Michael upon the forest floor, unconscious.</p></a>
        </article> 
    </body>
</html>
Listing 1. The original HTML for the Book of Michael.

Overall, there are a number of distinct types of entities at work here - concepts, people, locations, along with the implied book, chapter and verse that one might expect from a work of this sort. It takes a bit of data modeling to figure out what specifically you want to track (we'll come back to that later), but ultimately what tends to emerge is that you want to maintain one "namespace" for each type of object that you'll want to model. RDFa 1.0 introduced the concept of the CURIE (compact URI), but because the use of namespace prefixes is somewhat problematic, RDFa 1.1 for HTML introduced the prefix attribute, which is typically placed on the outermost viable container, and takes the form:

prefix="prfx1: http://www.example.org/prefix1 prfx2: http://www.example.org/prefix2 ..."

and so forth. These prefixes considerably reduce the overall size of the attributes within the HTML document, and make them marginally more legible. In the case of the example document, the new html header now looks as follows:

<html xmlns="http://www.w3.org/1999/xhtml" vocab="http://OrderOfTheBook.org/xmlns/verse#"
    prefix="bio: http://OrderOfTheBook.org/xmlns/bio# 
    location: http://OrderOfTheBook.org/xmlns/location# 
    concept: http://OrderOfTheBook.org/xmlns/concept#
    writ: http://OrderOfTheBook.org/xmlns/writ#
    verse: http://OrderOfTheBook.org/xmlns/verse#
    chapter: http://OrderOfTheBook.org/xmlns/chapter#
    book: http://OrderOfTheBook.org/xmlns/book#
    class: http://OrderOfTheBook.org/xmlns/class#
    dc: http://purl.org/dc/terms/"
    about="book:Michael" typeof="class:Book">

The @prefix attribute identifies all of the CURIE namespaces and their associated prefixes. Each is essentially a different vocabulary of terms.

The @about attribute identifies that this document is a description of the book "Michael". The expression book:Michael is a prefix qualified name, is equivalent to http://OrderOfTheBook.org/xmlns/book#Michael, and is effectively the "subject" of this particular document. Unless it is overridden by an inner element, this is the context that all properties belong to.

The advantage of working with CURIEs should be evident here - they make the code considerably easier to read.

The @typeof attribute is a reference to an RDF class, in this case "class:Book". This is actually a pretty significant innovation, because what we have done is defined semantically that this particular construct is a Book, which means that rules and logic that apply to books apply here as well. Moreover, there is now a different semantic markup beginning to emerge on top of the HTML semantics, which don't really even have a notion of a book.
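
The prefix and CURIE mechanics described above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not how any particular RDFa processor is implemented: the function names parse_prefix_attribute and expand_curie are hypothetical, and the sketch ignores @vocab, safe CURIEs in square brackets, and the predefined RDFa prefixes.

```python
def parse_prefix_attribute(value):
    """Turn an RDFa 1.1 @prefix string into a {prefix: namespace} dict."""
    tokens = value.split()
    # Tokens alternate between "name:" and the namespace IRI it stands for.
    if len(tokens) % 2 != 0:
        raise ValueError("prefix attribute must hold prefix/IRI pairs")
    mapping = {}
    for name, iri in zip(tokens[0::2], tokens[1::2]):
        if not name.endswith(":"):
            raise ValueError(f"expected 'prefix:' but found {name!r}")
        mapping[name[:-1]] = iri
    return mapping


def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' to a full IRI; unknown prefixes pass through."""
    prefix, separator, reference = curie.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + reference
    return curie


# A subset of the prefixes declared on the <html> element of the example.
PREFIXES = parse_prefix_attribute("""
    verse: http://OrderOfTheBook.org/xmlns/verse#
    chapter: http://OrderOfTheBook.org/xmlns/chapter#
    dc: http://purl.org/dc/terms/
""")

print(expand_curie("verse:Michael-1-1", PREFIXES))
print(expand_curie("dc:title", PREFIXES))
```

The same lookup runs every time a processor meets a value such as resource="verse:Michael-1-1", which is why the prefix declarations only need to appear once on the outermost element.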

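The @vocab attribute identifies what the default semantic namespace is, and as such is inherited by subordinate containers until a container has a different @vocab defined, at which point the innermost vocab becomes the default. In this case, the default vocabulary is http://OrderOfTheBook.org/xmlns/verse#, the same namespace that is bound to the verse: prefix.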

After the <html> element, the next element is the header, which contains the title of the chapter, as shown in Listing 2.

    <body>
        <header>
           <h1 property="dc:title">The Book of Michael</h1>
        </header>
Listing 2. An RDFa property.


This uses the Dublin Core namespace to identify the title of the document, which internally would end up creating a triple that looks like:


@prefix book: <http://OrderOfTheBook.org/xmlns/book#> .
@prefix dc: <http://purl.org/dc/terms/> .
book:Michael     dc:title   "The Book of Michael".

in Turtle notation.

The next section identifies the first chapter of the book (Listing 3):


        <span inlist="" property="book:chapter" resource="chapter:Michael-1"/>
        <article about="chapter:Michael-1" typeof="class:Chapter">
            <header>
                <h2 property="dc:title">Michael: Chapter 1</h2>
            </header>
Listing 3. Defining the chapter.

This is a bit more complicated - the first <span> statement is within the context of the global book, <book:Michael>, and asserts that there is a property called book:chapter with a link to the chapter resource URI, <chapter:Michael-1>. The inlist attribute (with an empty value in XHTML or just the attribute without a value at all for HTML) requires some explanation. Ordinarily, there is no concept of sequential ordering in RDF, but ordered sequences do occur in real life. To get around this, RDF defines the notion of a list that uses blank nodes and several key RDF properties. The @inlist attribute signals to the parser that the resource should be attached in a list sequence to the context object using the specified property predicate (in this case the property <book:chapter>). This means that the book will have a Turtle notation that looks like:

book:Michael book:chapter (chapter:Michael-1 chapter:Michael-2 chapter:Michael-3). 

and so forth.

The next line identifies the article to be "about" <chapter:Michael-1>, making this the new context that everything else is related to as subject. This also identifies this article as being of the Chapter type. This can be useful to create specific visual identities for the various object types in your system because you can create a CSS rule such as

article[typeof="class:Chapter"] {background-color:lightBlue;font-family:Arial; ...}

that would provide a visual rendering of any Chapter object, regardless of the underlying HTML semantics.

The <header><h2> elements also provide a label for the new chapter object (remember, we're now in the chapter context, not the book). Note that when you have a @property attribute with neither a @content nor a @resource attribute, the string representation of the contents of that attribute's element becomes the value used for the property association (here, the Dublin Core <dc:title> property). This holds true even if the element has child elements within it, a fact we take advantage of at the verse level.

Each verse follows a similar convention as shown in Listing 4.


         <div inlist="" property="chapter:verse" resource="verse:Michael-1-1">
            <a name="Michael-1-1" id="Michael-1-1" about="verse:Michael-1-1" typeof="class:Verse">
                <p property="text"><span property="dc:title" content="Michael 1-1"/>In the heart of the world, not long after the beginning, a man named <span property="bio" resource="bio:Michael_Grenadine">Michael Grenadine</span>, he who was known as <span property="concept" resource="concept:librarian">The Librarian</span>, walked out of <span property="location" resource="location:The_Wastes">the Wastes</span>.</p>
            </a>
          </div>
Listing 4. The HTML verse content. 


Here, you have a <div> element that identifies the resource in question as being part of the <chapter:verse> list. Within this you have the identifier itself (bound with an <a name=".."> element) along with its class indicator. Everything within this element is now part of the verse scope.

The paragraph <p> element also includes a property, but in this case it is given only as a local name. The property="text" statement takes advantage of the default namespace that was defined in the header, which defines the default semantic namespace as being the verse: namespace. What this means in practice is that the expression
<p property="text">
is actually a shorthand notation for
<p property="verse:text">,
which, since the element has neither a @content nor a @resource attribute, means that the text string of the content, minus any internal markup, is the object of the assertion:
verse:Michael-1-1   verse:text    """In the heart of the world, not long after the beginning, a man named Michael Grenadine, he who was known as The Librarian, walked out of the Wastes.""".

The triple quotes are multi-line quotes, used both to hold content that may span multiple lines and to safely encapsulate single and double quotes, which can cause problems with text parsing logic.
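
To make the "text minus internal markup" rule concrete, here is a small Python sketch that uses the standard library's ElementTree to pull the literal out of the verse paragraph. It is an approximation for illustration only: the helper name literal_value is hypothetical, and a real RDFa processor also handles @content, @datatype, language tags and so on.

```python
import xml.etree.ElementTree as ET

# The <p> element from the verse markup, written as a standalone XML fragment.
VERSE_MARKUP = (
    '<p property="text">'
    '<span property="dc:title" content="Michael 1-1"/>'
    'In the heart of the world, not long after the beginning, a man named '
    '<span property="bio" resource="bio:Michael_Grenadine">Michael Grenadine</span>'
    ', he who was known as '
    '<span property="concept" resource="concept:librarian">The Librarian</span>'
    ', walked out of '
    '<span property="location" resource="location:The_Wastes">the Wastes</span>.'
    '</p>'
)


def literal_value(element):
    """Concatenate all text inside the element, dropping the child tags."""
    return "".join(element.itertext())


paragraph = ET.fromstring(VERSE_MARKUP)
text = literal_value(paragraph)
print(text)
```

The nested <span> elements contribute their text but not their tags, and the empty dc:title span contributes nothing, since its value lives in the @content attribute rather than in the element body.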

Within the verse: namespace, there are a number of organizational concepts - verse:location for locations, verse:bio for biographical entities, verse:concept for conceptual entities (love, war, illness, death), terms that you may expect with a semantic knowledge ontology system. These keywords can often provide multiple ways of simultaneously organizing content, either in hierarchical taxonomies or in more freeform associational structures, but they also can make finding related topics far easier.

The full structure for the HTML document is given in Listing 5.


<html xmlns="http://www.w3.org/1999/xhtml" vocab="http://OrderOfTheBook.org/xmlns/verse#"
    prefix="bio: http://OrderOfTheBook.org/xmlns/bio# 
    location: http://OrderOfTheBook.org/xmlns/location# 
    concept: http://OrderOfTheBook.org/xmlns/concept#
    writ: http://OrderOfTheBook.org/xmlns/writ#
    verse: http://OrderOfTheBook.org/xmlns/verse#
    chapter: http://OrderOfTheBook.org/xmlns/chapter#
    book: http://OrderOfTheBook.org/xmlns/book#
    class: http://OrderOfTheBook.org/xmlns/class#
    owl: http://www.w3.org/2002/07/owl#
    rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
    rdfs: http://www.w3.org/2000/01/rdf-schema#
    dc: http://purl.org/dc/terms/"
    about="book:Michael" typeof="class:Book">
    <body>
        <header>
           <h1 property="dc:title">The Book of Michael</h1>
        </header>
        <span inlist="" property="book:chapter" resource="chapter:Michael-1"/>
        <article about="chapter:Michael-1" typeof="class:Chapter">
            <header>
                <h2 property="dc:title">Michael: Chapter 1</h2>
            </header>
            <span inlist="" property="chapter:verse" resource="verse:Michael-1-1"/>
            <a name="Michael-1-1" id="Michael-1-1" about="verse:Michael-1-1" typeof="class:Verse">
                <p property="text"><span property="dc:title" content="Michael 1-1"/>In the heart of the world, not long after the beginning, a man named <span property="bio" resource="bio:Michael_Grenadine">Michael Grenadine</span>, he who was known as <span property="concept" resource="concept:librarian">The Librarian</span>, walked out of <span property="location" resource="location:The_Wastes">the Wastes</span>.</p>
            </a>
            <span inlist="" property="chapter:verse" resource="verse:Michael-1-2"/>
            <a name="Michael-1-2" id="Michael-1-2" about="verse:Michael-1-2" typeof="Verse">
                <span property="dc:title" content="Michael 1-2"/>
                <p property="text">Travel did he with nought but a humble <span property="concept" resource="concept:The_Mule_Of_Michael">mule</span>, which he called <span property="concept" resource="concept:The_Horse_With_No_Name">The Horse With No Name</span>, verily to wreak confusion and wonder among the <span property="concept" resource="concept:Unenlighted">unenlightened</span>.</p>
            </a>
            <span inlist="" property="chapter:verse" resource="verse:Michael-1-3"/>
            <a name="Michael-1-3" id="Michael-1-3" about="verse:Michael-1-3" typeof="Verse">  
                <p property="text"><span property="dc:title" content="Michael 1-3"/>Within his wagon, hauled by his humble <span property="concept" resource="concept:The_Mule_Of_Michael">mule</span>, did Michael carry with him <span property="concept" resource="concept:The_Library_Of_Michael">many books</span>, written by the <span property="concept" resource="concept:Monks_Of_Knowledge">Monks of Knowledge</span> and the <span property="concept" resource="concept:Sisters_Of_Understanding">Sisters of Understanding</span>, the better to preserve the knowledge and wisdom of ages past, as well as the cautionary tales that had led to the creation of <span property="location" resource="location:The_Wastes">the Wastes</span>.</p>
            </a>
            <span inlist="" property="chapter:verse" resource="verse:Michael-1-4"/>
            <a name="Michael-1-4" id="Michael-1-4" about="verse:Michael-1-4" typeof="Verse">                
                <p property="text"><span property="dc:title" content="Michael 1-4"/><span property="concept" resource="concept:Weariness">Weary was he, and sore afoot,</span> for he had <span property="concept" resource="concept:Travel">travelled many days and nights</span>, had wandered through <span property="concept" resource="concept:Mountain">mountains</span> and <span property="concept" resource="concept:Desert">deserts</span>, searching for a new home, until finally <span property="concept" resource="concept:Illness">he collapsed, ill and feverish</span>.</p>
            </a>
            <span inlist="" property="chapter:verse" resource="verse:Michael-1-5"/>
            <a name="Michael-1-5" id="Michael-1-5" about="verse:Michael-1-5" typeof="Verse">                
                <p property="text">There would Michael have died, had his <span property="concept" resource="concept:The_Mule_Of_Michael">mule</span> not begun <span property="concept" resource="concept:Loyalty">to bray in fear for his master</span>, for its master was a <span property="concept" resource="concept:Kindness">kind man who fed the mule even when he himself had nothing</span>, and <span property="concept" resource="concept:Caring">groomed his mule before ever taking himself down to sleep</span>.</p>
            </a>
        </article>
        <span inlist="" property="book:chapter" resource="chapter:Michael-2"/>
        <article about="chapter:Michael-2" typeof="class:Chapter">
            <header>
                <h2 property="dc:title">Michael: Chapter 2</h2>
            </header>
            <span inlist="" property="chapter:verse" resource="verse:Michael-2-1"/>
            <a name="Michael-2-1" id="Michael-2-1" about="verse:Michael-2-1" typeof="class:Verse">
                <p property="text"><span property="dc:title" content="Michael 2-1"/>Not far away, near the town of <span property="location" resource="location:Seattle_Delshaean_Era">Seattlec'l</span>, a young woman named <span property="bio" resource="Alara_Dishaean">Alara Dishaean</span> was tending her <span property="concept" resource="concept:Cattle">cows</span> when she heard <span property="concept" resource="concept:Mule">the braying of a mule</span>.</p>
            </a>
            <span inlist="" property="chapter:verse" resource="verse:Michael-2-2"/>
            <a name="Michael-2-2" id="Michael-2-2" about="verse:Michael-2-2" typeof="class:Verse">
                <p property="text"><span property="dc:title" content="Michael 2-2"/><span property="concept" resource="concept:Curiosity">Curious,</span> Alara left her farm on a <span property="concept" resource="concept:Bantu_Horse">Bantu Horse</span> until she entered into the <span property="location" resource="location:Great_Forest">Great Forest</span>, where she found <span property="bio" resource="bio:Michael_Grenadine">Michael</span> upon the forest floor, unconscious.</p>
            </a>
        </article>
    </body>
</html>
Listing 5. The full HTML-based book.

Up to now, this article has focused on the construction of the RDFa, but has not yet answered how this gets translated into RDF. The answer to this is to make use of a program called an RDFa Parser/Distiller. The one I've used for these examples is available online as a Python application at http://www.w3.org/2012/pyRdfa/ . This runs as a service, and lets you pass RDFa as a text stream, upload a file, or parse an online page with embedded RDFa code. Overall pyRdfa seems to offer the most comprehensive RDFa 1.1 coverage.

Running the above example through the parser, specifying XHTML5+RDFa 1.1 input and Turtle output (see Figure 1), pyRdfa produces the Turtle triples given in Listing 6.

Figure 1. The configuration screen for pyRDFa text input.

@prefix bio: <http://OrderOfTheBook.org/xmlns/bio#> .
@prefix book: <http://OrderOfTheBook.org/xmlns/book#> .
@prefix chapter: <http://OrderOfTheBook.org/xmlns/chapter#> .
@prefix class: <http://OrderOfTheBook.org/xmlns/class#> .
@prefix concept: <http://OrderOfTheBook.org/xmlns/concept#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix location: <http://OrderOfTheBook.org/xmlns/location#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix verse: <http://OrderOfTheBook.org/xmlns/verse#> .

<> rdfa:usesVocabulary verse: .

book:Michael a class:Book;
    book:chapter ( chapter:Michael-1 chapter:Michael-2 );
    dc:title "The Book of Michael" .

chapter:Michael-1 a class:Chapter;
    chapter:verse ( verse:Michael-1-1 verse:Michael-1-2 verse:Michael-1-3 verse:Michael-1-4 verse:Michael-1-5 );
    dc:title "Michael: Chapter 1" .

chapter:Michael-2 a class:Chapter;
    chapter:verse ( verse:Michael-2-1 verse:Michael-2-2 );
    dc:title "Michael: Chapter 2" .

verse:Michael-1-1 a class:Verse;
    verse:bio bio:Michael_Grenadine;
    verse:concept concept:librarian;
    verse:location location:The_Wastes;
    verse:text "In the heart of the world, not long after the beginning, a man named Michael Grenadine, he who was known as The Librarian, walked out of the Wastes.";
    dc:title "Michael 1-1" .

verse:Michael-1-2 a verse:Verse;
    verse:concept concept:The_Horse_With_No_Name,
        concept:The_Mule_Of_Michael,
        concept:Unenlighted;
    verse:text "Travel did he with nought but a humble mule, which he called The Horse With No Name, verily to wreak confusion and wonder among the unenlightened.";
    dc:title "Michael 1-2" .

verse:Michael-1-3 a verse:Verse;
    verse:concept concept:Monks_Of_Knowledge,
        concept:Sisters_Of_Understanding,
        concept:The_Library_Of_Michael,
        concept:The_Mule_Of_Michael;
    verse:location location:The_Wastes;
    verse:text "Within his wagon, hauled by his humble mule, did Michael carry with him many books, written by the Monks of Knowledge and the Sisters of Understanding, the better to preserve the knowledge and wisdom of ages past, as well as the cautionary tales that had led to the creation of the Wastes.";
    dc:title "Michael 1-3" .

verse:Michael-1-4 a verse:Verse;
    verse:concept concept:Desert,
        concept:Illness,
        concept:Mountain,
        concept:Travel,
        concept:Weariness;
    verse:text "Weary was he, and sore afoot, for he had travelled many days and nights, had wandered through mountains and deserts, searching for a new home, until finally he collapsed, ill and feverish.";
    dc:title "Michael 1-4" .

verse:Michael-1-5 a verse:Verse;
    verse:concept concept:Caring,
        concept:Kindness,
        concept:Loyalty,
        concept:The_Mule_Of_Michael;
    verse:text "There would Michael have died, had his mule not begun to bray in fear for his master, for its master was a kind man who fed the mule even when he himself had nothing, and groomed his mule before ever taking himself down to sleep." .

verse:Michael-2-1 a class:Verse;
    verse:bio <Alara_Dishaean>;
    verse:concept concept:Cattle,
        concept:Mule;
    verse:location location:Seattle_Delshaean_Era;
    verse:text "Not far away, near the town of Seattlec'l, a young woman named Alara Dishaean was tending her cows when she heard the braying of a mule.";
    dc:title "Michael 2-1" .

verse:Michael-2-2 a class:Verse;
    verse:bio bio:Michael_Grenadine;
    verse:concept concept:Bantu_Horse,
        concept:Curiosity;
    verse:location location:Great_Forest;
    verse:text "Curious, Alara left her farm on a Bantu Horse until she entered into the Great Forest, where she found Michael upon the forest floor, unconscious.";
    dc:title "Michael 2-2" .

Listing 6. RDF Turtle output generated from test HTML+RDFa file.

This has now broken down the RDFa markup into RDF assertions. In most cases, there was a little cheating going on - I defined objects in this example for (presumed) predefined entities. However, suppose that you had only the text with no specific resources defined, something like:

<p><span property="verse:bioText">Michael Grenadine</span> was ...</p>

This would have generated the triple:
verse:Michael-1-1 verse:bioText "Michael Grenadine".

If you had already defined an entry for the person beforehand (<bio:Michael_Grenadine>), you could do an inference using SPARQL update:
insert {?verse verse:bio ?bio} 
where {
     ?verse verse:bioText $bioText.
     ?bio bio:fullName $bioText.
      };
Listing 7. Assigning a URI reference when given a string.
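
For readers who think more readily in procedural terms, the join performed by the where clause above can be mimicked in plain Python. This is a rough sketch rather than a substitute for a triple store: the function name infer_bio_links is hypothetical, triples are modelled as simple tuples of strings, and the bio:fullName value is assumed sample data that does not appear in the listings.

```python
# Triples as (subject, predicate, object) tuples. The bio:fullName entry is
# assumed sample data, added only so the join has something to match.
TRIPLES = [
    ("verse:Michael-1-1", "verse:bioText", "Michael Grenadine"),
    ("bio:Michael_Grenadine", "bio:fullName", "Michael Grenadine"),
]


def infer_bio_links(triples):
    """Return a verse:bio triple for every bioText string that matches a fullName."""
    # Index known people by full name (the second pattern of the where clause).
    people_by_name = {}
    for subject, predicate, obj in triples:
        if predicate == "bio:fullName":
            people_by_name.setdefault(obj, []).append(subject)

    # For each text-only mention, emit a link to every person with that name.
    inferred = []
    for subject, predicate, obj in triples:
        if predicate == "verse:bioText":
            for person in people_by_name.get(obj, []):
                inferred.append((subject, "verse:bio", person))
    return inferred


print(infer_bio_links(TRIPLES))
```

As with the SPARQL version, a name that matches nobody simply produces no new triple, and a name shared by two people produces two.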

You could also create more sophisticated inferences that would take shortened forms of the person's name and the attempt to do searches on these, in the context of an already extant reference somewhere within the verse itself. I'll leave that for a later post.

Once this data is put into a triple store, it opens up some interesting possibilities. As a simple example, you could retrieve all the verses in a given book in order (Listing 8).

select ?verse ?text where {
     $book book:chapter ?chapterList.
     ?chapterList rdf:rest*/rdf:first ?chapter.
     ?chapter chapter:verse ?verseList.
     ?verseList rdf:rest*/rdf:first ?verse.
     ?verse verse:text ?text.
     }
Listing 8. Retrieving verses in a book in sequential order.

where $book in this case contains the URI of the book being referenced. The rather peculiar construct of rdf:rest*/rdf:first may seem rather nonsensical, but it is an artifact of the list structure described earlier. RDF represents lists using blank nodes, with the structure for verses in a chapter looking something like Listing 9.

chapter:Michael-1 chapter:verse _:b0.
_:b0              rdf:first  verse:Michael-1-1;
                  rdf:rest   _:b1.
_:b1              rdf:first  verse:Michael-1-2;
                  rdf:rest   _:b2.
_:b2              rdf:first  verse:Michael-1-3;
                  rdf:rest   _:b3.
_:b3              rdf:first  verse:Michael-1-4;
                  rdf:rest   _:b4.
_:b4              rdf:first  verse:Michael-1-5;
                  rdf:rest   rdf:nil.
Listing 9. How a list is rendered as triples.

Given this, the construct rdf:rest*/rdf:first is a property path that says: starting from the head of the list, follow zero or more rdf:rest links (zero steps covers the head node itself), then take the rdf:first value of each node reached along the way. This will iterate over the list with the added benefit that the individual items do not then have to maintain their own pointers.
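
The linked structure in Listing 9 and the traversal it implies can be sketched in Python. This is a toy model under stated assumptions, not a triple store: triples are plain tuples, blank nodes are just strings such as "_:b0", and the helper names build_list and walk_list are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def build_list(items, triples, label="b"):
    """Append rdf:first/rdf:rest triples for items and return the head node."""
    if not items:
        return NIL
    nodes = [f"_:{label}{i}" for i in range(len(items))]
    for index, (node, item) in enumerate(zip(nodes, items)):
        triples.append((node, FIRST, item))
        # The last node points at rdf:nil, which terminates the list.
        following = nodes[index + 1] if index + 1 < len(nodes) else NIL
        triples.append((node, REST, following))
    return nodes[0]


def walk_list(head, triples):
    """Yield items in order: hop along rdf:rest, read rdf:first at each node."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}
    node, seen = head, set()
    while node != NIL and node not in seen:  # 'seen' guards against cycles
        seen.add(node)
        yield first[node]
        node = rest.get(node, NIL)


triples = []
verses = ["verse:Michael-1-1", "verse:Michael-1-2", "verse:Michael-1-3"]
head = build_list(verses, triples)
triples.append(("chapter:Michael-1", "chapter:verse", head))
print(list(walk_list(head, triples)))
```

Note that the verses themselves carry no "next" pointer: all of the ordering lives in the blank nodes, which is what lets the same verse sit in several different lists at once.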

Why is this a benefit? In many kinds of documents, such as religious works, it's not uncommon to have "parables", which consist of one or more verses that may start in the middle of one chapter and end in another, may jump from one section to another, or may appear in a different order than was originally given. By keeping the links external, you can add items sequentially without needing imperative logic and exception handling. So, we can create a parable called "The parable of the mule" (Listing 10).

parable:Parable_Of_The_Mule a class:Parable;
     dc:title "The Parable of the Mule";
     parable:verse (verse:Michael-1-2 verse:Michael-1-3 verse:Michael-1-4 verse:Michael-1-5 verse:Michael-2-1).

Listing 10. The parable of the mule.

Once you have this kind of relationship, it also becomes possible to do things like determine every parable that a given verse is used in (Listing 11), or even (assuming that verses are the only place that hold keywords) find all parables that discuss certain concepts (Listing 12).

select ?parable where {
     ?parable parable:verse ?verseList.
     ?verseList rdf:rest*/rdf:first $verse.
     }
Listing 11. Retrieving all parables that have a specific $verse.

select ?parable where {
     ?parable parable:verse ?verseList.
     ?verseList rdf:rest*/rdf:first ?verse.
     ?verse verse:concept $concept.
     }
Listing 12. Retrieving all parables that include a specific $concept.

RDFa can appear somewhat complex at first, but the advantage of being able to encode content in RDF is that you can identify relationships between entities, can make this information aware of other information that exists within a larger dataspace, and can make that information far more malleable while still maintaining useful context.

Kurt Cagle is an information architect and author working for Avalon Consulting LLC, specializing in NoSQL, Semantics, XML and JSON data systems. He is the author of eighteen books on XML and web technologies, including the upcoming HTML5 Graphics with SVG and CSS for O'Reilly Media. He can be reached at kurt.cagle@gmail.com.

Wednesday, April 3, 2013

The Joys of Reification

The danger of periodically producing a blog is that every so often something comes along that eats up a significant amount of time, and next thing you know, a month or more ends up passing between posts. In my case, that's largely been demo writing and wrapping up on revising after the SVG technical edits, and my editors are waiting ever so patiently (actually, no, they're not) for me to get a manuscript to them.

I wanted to take a few minutes, though, to capture the fruits of an exchange I had with a friend of mine about storing information in triple stores. The topic involved was a seemingly trivial one that was anything but - when you have only three values in an RDF assertion, how can you incorporate time stamps or similar information? The answer, as it turns out, is to create metadata about the metadata - a process that RDF types refer to as reification.

In this case, my friend wanted to capture two critical pieces of information - when was a particular statement entered into the system, and who submitted it. Other information could be added, of course, but this illustrates the problem well enough.

Suppose that I had an RPG game and wanted to add a couple of characters to the system. I could add a timestamp property and email identifier as properties of the subject, but what I really want to track is when was a given statement inserted about the characters. In essence, I'm trying to get information about the statement, not the subject. Here's the SPARQL 1.1 update data:

insert data {
     <character:Aleria> <owl:instanceOf> <class:Character>;
                        <character:species> <species:Elf>;
                        <character:gender> <gender:Female>;
                        <character:location> [
                            <location:city> <city:New_Alfrans>;
                            <location:nation> <nation:Mirales>].

     <character:Merasha> <owl:instanceOf> <class:Character>;
                        <character:species> <species:HalfElf>;
                        <character:gender> <gender:Female>;
                        <character:location> [
                            <location:city> <city:Tunsany>;
                            <location:nation> <nation:Milesia>].
};

There are actually two separate (albeit related) problems here. The first is how to attach properties to specific statements. The second is how to do so while ensuring that the "auditing" information isn't itself added into the lists of things that are normally queried, otherwise you get an ongoing additive spiral.
The second problem can actually be solved quite handily with the use of a named graph, which we'll call <app:assert> for brevity. This is a graph that just contains assertions about assertions. When you query against the default graph (or some other named graph) this graph will be ignored unless you specifically reference it.

The first problem then comes down to creating a unique key for an assertion statement. This is a good place to pull out a hashing function (md5() is a pretty good one to start with, and is one of the hash functions defined in SPARQL 1.1). By creating a hash from a concatenation of subject, predicate and object, we have a key that can be regenerated from the statement at any time and that is, for practical purposes, unique.
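As a rough illustration of that key, here is a Python sketch that mirrors md5(concat(str(?s), str(?p), str(?o))) outside the store. The statement_hash name is hypothetical, and I'm assuming the engine hashes the UTF-8 bytes and returns a lowercase hex string, which is worth verifying against your own store.

```python
import hashlib

def statement_hash(subject, predicate, obj):
    """Hex MD5 digest of the three string forms joined with no separator."""
    joined = subject + predicate + obj
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# The same statement always yields the same key.
key_a = statement_hash("character:Aleria", "character:species", "species:Elf")
key_b = statement_hash("character:Aleria", "character:species", "species:Elf")

# A different statement yields a different key.
key_c = statement_hash("character:Merasha", "character:species", "species:HalfElf")
```

One caveat of plain concatenation: because there is no separator, two different statements whose parts happen to join into the same string (say "ab" + "c" + "d" and "a" + "bc" + "d") share a key. With full IRIs this is unlikely in practice, but placing a delimiter between the parts would rule it out.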

With these two pieces in place, you can then create a SPARQL 1.1 Update script that would be called periodically:

insert {
graph <app:assert> {
    ?statement <owl:hash> ?hash.
# The next three statements may not be necessary for lookups by hash,
# but the duplicate check in the where clause relies on them
    ?statement <owl:subject> ?subject.
    ?statement <owl:predicate> ?predicate.
    ?statement <owl:object> ?object.
    ?statement <owl:timestamp> ?timestamp.
    ?statement <owl:email> ?email.
    }}
where {
    ?subject ?predicate ?object.
    # Skip statements that already have an audit record in <app:assert>
    filter not exists {
        graph <app:assert> {
            ?s <owl:subject> ?subject.
            ?s <owl:predicate> ?predicate.
            ?s <owl:object> ?object.
        }
    }
    bind ("kurt.cagle@gmail.com" as ?email)
    bind (uuid() as ?statement)
    bind (now() as ?timestamp)
    bind (md5(concat(str(?subject),str(?predicate),str(?object))) as ?hash)
}

The insert part of the clause creates the new record based upon the bindings from the where clause. The first line of the where clause, together with the filter, indicates that all items that haven't been audited should be audited. Since typically only certain objects in the system really need to be audited, additional statements limiting the audit only to items of a given class may be useful here.

where {
    ?subject ?predicate ?object.
    ?subject a ?systemClass.
    ...

The next pieces of information would mostly be supplied through the interface that builds this script (I use a regex method myself to map variables of the form $foo into IRIs or strings). The first bind sets an email string (or IRI, if that's how you're identifying such keys), uuid() mints a fresh IRI for the audit record, now() reads the time at which the script is called, and the last bind creates an md5 hash from concatenating subject, predicate and object together.
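The regex substitution mentioned above might look something like the following Python sketch. This is an illustration under my own assumptions rather than the exact method: the fill_template name and the ("iri" | "str", value) parameter convention are hypothetical.

```python
import re

def fill_template(template, params):
    """Replace each $name with <iri> or a quoted, escaped string literal."""
    def render(match):
        kind, value = params[match.group(1)]
        if kind == "iri":
            return "<" + value + ">"
        # Escape backslashes first, then double quotes, for a string literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return re.sub(r"\$([A-Za-z_]\w*)", render, template)

template = "?statement <owl:email> $email. ?statement <owl:subject> $subject."
filled = fill_template(template, {
    "email": ("str", "kurt.cagle@gmail.com"),
    "subject": ("iri", "character:Aleria"),
})
```

A name that has no entry in params raises a KeyError here, which seems preferable to sending a half-filled script to the store.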

(It might be possible to create an IRI directly from the hash as well, but that's still in essence a hash key. This just makes the association more explicit and avoids formal semantics in identifiers.)

This then creates new records in the <app:assert> graph with the relevant properties.

Retrieving this information is pretty straightforward, and assumes that the hashing algorithm in your system is the same as the one that generated the keys (most likely the case). If you have a subject, predicate and object, you can get the timestamp and other properties with a simple query:

select ?subject ?predicate ?object ?timestamp where {
   {?subject ?predicate ?object}
   bind (md5(concat(str(?subject),str(?predicate),str(?object ))) as ?hash)
   {
   graph <app:assert> 
         {?statement <owl:hash> ?hash.
          ?statement <owl:timestamp> ?timestamp.
         }
   }
}

Keep in mind that there is a multiplicative factor here - the script above creates six reified statements for every assertion (three if you drop the subject, predicate and object), so this database can get big fast. To that end, it may be that you don't actually need the subject, predicate and object in the corresponding record, though they can be useful for higher order meta-functions.

Beyond simple auditing, this may also be a mechanism for inserting a confidence measure into a statement (something between 0 and 1). Confidence might determine the fuzziness of the given assertion when it's not absolute, allowing you to perform Bayesian analysis on it. A simple example of this might be in something like friend of a friend analysis, where the strength of each FOAF statement varies from not a friend (0.0) to a life-bonded friend (1.0).

insert {
    graph <app:assert> {
         ?statement <owl:confidence> 0.7 .
         }
    }
where {
    values (?me ?friend) { (<kurt.cagle@gmail.com> <jane.doe@gmail.com>) }
    ?me foaf:knows ?friend.
    bind (md5(concat(str(?me),str(foaf:knows),str(?friend))) as ?hash)
    graph <app:assert> {
        ?statement <owl:hash> ?hash.
        }
     };

With something like this, you can then do things like return chains of all friends that know one another reasonably well (confidence > 0.5):

select ?friend1 ?friend2 ?confidence
where {
    ?friend1 foaf:knows+ ?friend2.
    filter exists {<kurt.cagle@gmail.com> foaf:knows ?friend2}
    bind (md5(concat(str(?friend1),str(foaf:knows),str(?friend2))) as ?hash)
    bind (0.5 as ?threshold)
    graph <app:assert> {
        ?statement <owl:hash> ?hash.
        ?statement <owl:confidence> ?confidence.
    }
    # The filter sits in the same group as the ?threshold binding so it can see it
    filter (?confidence > ?threshold)
}

More generally, you can create queries where you have moderate to high faith that the assertions you are making are sound, or on the flipside where you want to explore potential inferences but also want to get a sense of how likely any inferences you create actually are. This approach can be used in conjunction with external programs to create a feedback loop - you can make a tentative assertion, then as other data corroborates that assertion you can increase the confidence to reflect this.
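The feedback loop can be sketched in Python along the following lines. Everything here is illustrative: the names, the in-memory dictionary standing in for the <app:assert> graph, and especially the corroborate update rule (move the confidence part of the way toward 1.0), which is just one plausible choice and not something the approach prescribes.

```python
import hashlib

FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

def statement_hash(subject, predicate, obj):
    """Same key as before: hex MD5 of the concatenated string forms."""
    return hashlib.md5((subject + predicate + obj).encode("utf-8")).hexdigest()

# Hypothetical data: asserted foaf:knows pairs and their confidence records.
knows = [("kurt", "jane"), ("jane", "sam")]
confidence = {
    statement_hash("kurt", FOAF_KNOWS, "jane"): 0.7,
    statement_hash("jane", FOAF_KNOWS, "sam"): 0.3,
}

def confident_pairs(pairs, confidence, threshold=0.5):
    """Keep only the pairs whose recorded confidence exceeds the threshold."""
    kept = []
    for a, b in pairs:
        key = statement_hash(a, FOAF_KNOWS, b)
        if key in confidence and confidence[key] > threshold:
            kept.append((a, b))
    return kept

def corroborate(confidence, key, weight=0.5):
    """Assumed update rule: close a fraction 'weight' of the gap to 1.0."""
    current = confidence[key]
    confidence[key] = current + (1.0 - current) * weight

before = confident_pairs(knows, confidence)
corroborate(confidence, statement_hash("jane", FOAF_KNOWS, "sam"))
after = confident_pairs(knows, confidence)
```

Here the jane/sam statement starts below the threshold and so is left out of the first result; after one round of corroboration it clears 0.5 and shows up in the second.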

Reification is a useful technique for working with semantic content - it can help support versioning at the statement level, can be used for setting up generalized triggers and can more effectively manage archiving of assertions over time. This technique also hints at how named graphs can be used as scratchpads for operations - creating interim named graphs for establishing hypotheses, then when these prove out clearing the graphs and freeing up space. This is grist for another article.