Recently, I had a chance to talk with a friend I'd first met there, and I got to thinking about one of the less discussed aspects of the HTML5 revolution - the use of RDF in Attributes, otherwise known as RDFa. The idea behind RDFa is relatively simple - put enough information into attributes to embed semantic markup in an HTML5 document, then convert this RDFa into RDF via a transformation technology called GRDDL (which is typically implemented in XSLT, but can also be done in Java).
Since I've been doing some exploration of semantics for a couple of large media clients, I decided it would be a good time to experiment, and get a better handle on working with RDFa -- including putting together some more challenging constructs in order to see how this can be effectively integrated into RDF/OWL.
What I ultimately came up with was a very tiny book of "holy writ" that may actually end up becoming a short story at some point (never throw away good concepts). At the moment, this exists primarily to show concepts, but I suspect I'll be able to build more up. The original XHTML is shown in Listing 1.
<html
xmlns="http://www.w3.org/1999/xhtml">
<body>
<header>
<a name="Michael"><h1>The Book of Michael</h1></a>
</header>
<article>
<a name="Michael-1-1" id="Michael-1-1" > <p>In the heart of the world, not long after the beginning, a man named Michael Grenadine, he who was known as The Librarian, walked out of the Wastes.</p></a>
<a name="Michael-1-2" id="Michael-1-2" >
<p>Travelled did he with naught but a humble mule, which he called The Horse With No Name, verily to wreak confusion and wonder among the unenlightened.</p></a>
<a name="Michael-1-3" id="Michael-1-3"> <p>Within his wagon, hauled by his humble mule, did Michael carry with him many books, written by the Monks of Knowledge and the Sisters of Understanding, the better to preserve the knowledge and wisdom of ages past, as well as the cautionary tales that had led to the Wastes.</p></a>
<a name="Michael-1-4" id="Michael-1-4">
<p>Weary was he, and sore afoot, for he had travelled many days and nights, had wandered through mountains and deserts, searching for a new home, until finally he collapsed, ill and feverish.</p></a>
<a name="Michael-1-5" id="Michael-1-5">
<p>There would Michael have died, had his mule not begun to bray in fear for his master, for its master was a kind man who fed the mule even when he himself had nothing, and groomed his mule before ever taking himself down to sleep.</p></a>
<a name="Michael-2-1" id="Michael-2-1">
<p>Not far away, near the town of Seattlec'l, a young woman named Alara Dishaean was tending her cows when she heard the braying of a mule.</p></a>
<a name="Michael-2-2" id="Michael-2-2">
<p>Curious, Alara left her farm on a Bantu Horse until she entered into the Great Forest, where she found Michael upon the forest floor, unconscious.</p></a>
</article>
</body>
</html>
Listing 1. The original HTML for the Book of Michael.
Overall, there are a number of distinct types of entities at work here - concepts, people, locations, along with the implied book, chapter and verse that one might expect from a work of this sort. It takes a bit of data modeling to figure out what specifically you want to track (we'll come back to that later), but ultimately what tends to emerge is that you want to maintain one "namespace" for each type of object that you'll want to model. RDFa 1.0 introduced the concept of the CURIE (compact URI), but because the use of namespace prefixes is somewhat problematic, RDFa 1.1 for HTML introduced the prefix attribute, which is typically placed on the outermost viable container, and takes the form:
prefix="prfx1: http://www.example.org/prefix1 prfx2: http://www.example.org/prefix2 ..."
and so forth. These prefixes considerably reduce the overall size of the attributes within the HTML document, and make them marginally more legible. In the case of the example document, the new html header now looks as follows:
<html xmlns="http://www.w3.org/1999/xhtml" vocab="http://OrderOfTheBook.org/xmlns/verse#"
prefix="bio: http://OrderOfTheBook.org/xmlns/bio#
location: http://OrderOfTheBook.org/xmlns/location#
concept: http://OrderOfTheBook.org/xmlns/concept#
writ: http://OrderOfTheBook.org/xmlns/writ#
verse: http://OrderOfTheBook.org/xmlns/verse#
chapter: http://OrderOfTheBook.org/xmlns/chapter#
book: http://OrderOfTheBook.org/xmlns/book#
class: http://OrderOfTheBook.org/xmlns/class#
dc: http://purl.org/dc/terms/"
about="book:Michael" typeof="class:Book">
The @prefix attribute identifies all of the CURIE namespaces and their associated prefixes. Each is essentially a different vocabulary of terms.
The @about attribute identifies that this document is a description of the book "Michael". The expression book:Michael is a prefix-qualified name, and is equivalent to http://OrderOfTheBook.org/xmlns/book#Michael
and is effectively the "subject" of this particular document. Unless it is overridden by inner content, this is the context that all properties belong to.
The advantage of working with CURIEs should be evident here - they make the code considerably easier to read.
The @typeof attribute is a reference to an RDF class, in this case "class:Book". This is actually a pretty significant innovation, because what we have done is define semantically that this particular construct is a Book, which means that rules and logic that apply to books apply here as well. Moreover, there is now a different semantic markup beginning to emerge on top of the HTML semantics, which don't really even have a notion of a book.
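To make the prefix mechanics concrete, here is a minimal Python sketch of how a @prefix value might be split into a lookup table and used to expand a CURIE. The function names parse_prefixes and expand_curie are made up for illustration, and a real RDFa processor handles considerably more (safe CURIEs, absolute IRIs, @vocab terms), so treat this as a simplification rather than the parser's actual logic.

```python
def parse_prefixes(prefix_attr: str) -> dict:
    """Turn an RDFa @prefix value ("name: iri name: iri ...") into a dict."""
    tokens = prefix_attr.split()
    # Tokens alternate between "name:" and the IRI that the name maps to.
    return {name.rstrip(":"): iri for name, iri in zip(tokens[0::2], tokens[1::2])}


def expand_curie(curie: str, prefixes: dict) -> str:
    """Replace a known prefix with its IRI; leave anything else untouched."""
    prefix, sep, reference = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return curie


# A shortened copy of the @prefix attribute from the example document.
PREFIX_ATTR = """chapter: http://OrderOfTheBook.org/xmlns/chapter#
                 class: http://OrderOfTheBook.org/xmlns/class#
                 dc: http://purl.org/dc/terms/"""

prefixes = parse_prefixes(PREFIX_ATTR)
print(expand_curie("chapter:Michael-1", prefixes))
print(expand_curie("dc:title", prefixes))
```

Anything without a declared prefix is passed through unchanged here, which is a deliberate shortcut to keep the sketch small.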
The @vocab attribute identifies the default semantic namespace, and as such is inherited by subordinate containers until a container defines a different @vocab, at which point the innermost @vocab becomes the default. In this case, the default is the verse vocabulary, http://OrderOfTheBook.org/xmlns/verse#.
After the <html> element, the next element is the header, which contains the title of the chapter, as shown in Listing 2.
<body>
<header>
<h1 property="dc:title">The Book of Michael</h1>
</header>
Listing 2. An RDFa property.
This uses the Dublin Core namespace to identify the title of the document, which internally would end up creating a triple that looks like:
@prefix book: <http://OrderOfTheBook.org/xmlns/book#> .
@prefix dc: <http://purl.org/dc/terms/> .
book:Michael dc:title "The Book of Michael" .
in Turtle notation.
The next section identifies the first chapter of the book (Listing 3):
<span inlist="" property="book:chapter" resource="chapter:Michael-1"/>
<article about="chapter:Michael-1" typeof="class:Chapter">
<header>
<h2 property="dc:title">Michael: Chapter 1</h2>
</header>
Listing 3. Defining the chapter.
This is a bit more complicated - the first <span> statement is within the context of the global book, <book:Michael>, and asserts that there is a property called book:chapter with a link to the chapter resource URI, <chapter:Michael-1>. The inlist attribute (with an empty value in XHTML, or just the attribute without a value at all for HTML) requires some explanation. Ordinarily, there is no concept of sequential ordering in RDF, but ordered sequences do occur in real life. To get around this, RDF defines the notion of a list that uses blank nodes and several key RDF properties (rdf:first, rdf:rest and rdf:nil). The @inlist attribute signals to the parser that the resource should be attached in a list sequence to the context object using the specified property predicate (in this case the property <book:chapter>). This means that the book will have a Turtle notation that looks like:
book:Michael book:chapter (chapter:Michael-1 chapter:Michael-2 chapter:Michael-3). and so forth.
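The parenthesized form is shorthand. As a rough illustration of what a parser does with it, the following Python sketch expands an ordered sequence into the underlying rdf:first / rdf:rest triples. The function name list_to_triples and the _:bN blank node labels are assumptions for the example; real parsers generate their own blank node identifiers.

```python
def list_to_triples(subject, predicate, items):
    """Expand an ordered collection into rdf:first / rdf:rest triples."""
    if not items:
        # An empty collection is just a pointer to rdf:nil.
        return [(subject, predicate, "rdf:nil")]
    # One blank node per list cell (labels are illustrative only).
    nodes = [f"_:b{i}" for i in range(len(items))]
    triples = [(subject, predicate, nodes[0])]
    for i, (node, item) in enumerate(zip(nodes, items)):
        # The last cell points at rdf:nil to terminate the list.
        rest = nodes[i + 1] if i + 1 < len(nodes) else "rdf:nil"
        triples.append((node, "rdf:first", item))
        triples.append((node, "rdf:rest", rest))
    return triples


chapters = ["chapter:Michael-1", "chapter:Michael-2"]
for triple in list_to_triples("book:Michael", "book:chapter", chapters):
    print(" ".join(triple), ".")
```

Each cell holds one item and a pointer to the next cell, so the order lives in the chain of cells rather than in the items themselves.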
The next line identifies the article to be "about" <chapter:Michael-1>, making this the new context that everything else is related to as subject. This also identifies this article as being of the Chapter type. This can be useful to create specific visual identities for the various object types in your system because you can create a CSS rule such as
article[typeof="class:Chapter"] {background-color: lightblue; font-family: Arial; ...}
that would provide a visual rendering of any Chapter object, regardless of the underlying HTML semantics.
The <header><h2> elements also provide a label for the new chapter object (remember, we're now in the chapter context, not the book). Note that when an element has a @property attribute with neither @content nor @resource, the string representation of that element's contents becomes the value used for the property association (here, the Dublin Core <dc:title> property). This holds true even if the element has child elements within it, a fact we take advantage of at the verse level.
Each verse follows a similar convention as shown in Listing 4.
<div inlist="" property="chapter:verse" resource="verse:Michael-1-1">
<a name="Michael-1-1" id="Michael-1-1" about="verse:Michael-1-1" typeof="class:Verse">
<p property="text"><span property="dc:title" content="Michael 1-1"/>In the heart of the world, not long after the beginning, a man named <span property="bio" resource="bio:Michael_Grenadine">Michael Grenadine</span>, he who was known as <span property="concept" resource="concept:librarian">The Librarian</span>, walked out of <span property="location" resource="location:The_Wastes">the Wastes</span>.</p>
</a>
</div>
Listing 4. The HTML verse content.
Here, you have a <div> element that identifies the resource in question as being part of the <chapter:verse> list. Within this you have the identifier itself (bound with an <a name=".."> element) along with its class indicator. Everything within this element is now part of the verse scope.
The paragraph <p> element also includes a property, but in this case it is given only as a local name. The property="text" statement takes advantage of the @vocab declared on the <html> element, which sets the default semantic namespace to the verse: namespace. What this means in practice is that the expression
<p property="text">
is actually a shorthand notation for
<p property="verse:text">,
which, since the element has neither a @content nor a @resource attribute, means that the text string of the content, minus any internal markup, is the object of the assertion:
verse:Michael-1-1 verse:text """In the heart of the world, not long after the beginning, a man named Michael Grenadine, he who was known as The Librarian, walked out of the Wastes.""".
The triple quotes are multi-line quotes, used both to hold content that may span multiple lines and to safely encapsulate single and double quotes, which can otherwise cause problems with text parsing logic.
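The "text minus any internal markup" rule is easy to try out. This Python sketch uses the standard library's ElementTree to pull the literal value out of the verse paragraph from Listing 4. The helper name literal_value is hypothetical, and it only mimics the @content-or-text choice described above, not the rest of the RDFa processing rules.

```python
import xml.etree.ElementTree as ET

# The <p> element from Listing 4, split across lines for readability.
VERSE = (
    '<p property="text"><span property="dc:title" content="Michael 1-1"/>'
    'In the heart of the world, not long after the beginning, a man named '
    '<span property="bio" resource="bio:Michael_Grenadine">Michael Grenadine</span>'
    ', he who was known as '
    '<span property="concept" resource="concept:librarian">The Librarian</span>'
    ', walked out of '
    '<span property="location" resource="location:The_Wastes">the Wastes</span>.</p>'
)


def literal_value(element):
    """Return @content if present, otherwise the element's text with markup stripped."""
    if "content" in element.attrib:
        return element.attrib["content"]
    # itertext() walks the element and its children in document order.
    return "".join(element.itertext())


paragraph = ET.fromstring(VERSE)
text = literal_value(paragraph)
title = literal_value(paragraph.find("span"))
print(title)
print(text)
```

The empty dc:title span contributes nothing to the paragraph text, which is why it can sit inside the <p> without polluting the verse:text value.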
Within the verse: namespace, there are a number of organizational concepts - verse:location for locations, verse:bio for biographical entities, verse:concept for conceptual entities (love, war, illness, death) - terms that you might expect in a semantic knowledge ontology system. These keywords can often provide multiple ways of simultaneously organizing content, either in hierarchical taxonomies or in more freeform associational structures, but they also can make finding related topics far easier.
The full structure for the HTML document is given in Listing 5.
<html xmlns="http://www.w3.org/1999/xhtml" vocab="http://OrderOfTheBook.org/xmlns/verse#"
prefix="bio: http://OrderOfTheBook.org/xmlns/bio#
location: http://OrderOfTheBook.org/xmlns/location#
concept: http://OrderOfTheBook.org/xmlns/concept#
writ: http://OrderOfTheBook.org/xmlns/writ#
verse: http://OrderOfTheBook.org/xmlns/verse#
chapter: http://OrderOfTheBook.org/xmlns/chapter#
book: http://OrderOfTheBook.org/xmlns/book#
class: http://OrderOfTheBook.org/xmlns/class#
owl: http://www.w3.org/2002/07/owl#
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
dc: http://purl.org/dc/terms/"
about="book:Michael" typeof="class:Book">
<body>
<header>
<h1 property="dc:title">The Book of Michael</h1>
</header>
<span inlist="" property="book:chapter" resource="chapter:Michael-1"/>
<article about="chapter:Michael-1" typeof="class:Chapter">
<header>
<h2 property="dc:title">Michael: Chapter 1</h2>
</header>
<span inlist="" property="chapter:verse" resource="verse:Michael-1-1"/>
<a name="Michael-1-1" id="Michael-1-1" about="verse:Michael-1-1" typeof="class:Verse">
<p property="text"><span property="dc:title" content="Michael 1-1"/>In the heart of the world, not long after the beginning, a man named <span property="bio" resource="bio:Michael_Grenadine">Michael Grenadine</span>, he who was known as <span property="concept" resource="concept:librarian">The Librarian</span>, walked out of <span property="location" resource="location:The_Wastes">the Wastes</span>.</p>
</a>
<span inlist="" property="chapter:verse" resource="verse:Michael-1-2"/>
<a name="Michael-1-2" id="Michael-1-2" about="verse:Michael-1-2" typeof="Verse">
<span property="dc:title" content="Michael 1-2"/>
<p property="text">Travel did he with nought but a humble <span property="concept" resource="concept:The_Mule_Of_Michael">mule</span>, which he called <span property="concept" resource="concept:The_Horse_With_No_Name">The Horse With No Name</span>, verily to wreak confusion and wonder among the <span property="concept" resource="concept:Unenlighted">unenlightened</span>.</p>
</a>
<span inlist="" property="chapter:verse" resource="verse:Michael-1-3"/>
<a name="Michael-1-3" id="Michael-1-3" about="verse:Michael-1-3" typeof="Verse">
<p property="text"><span property="dc:title" content="Michael 1-3"/>Within his wagon, hauled by his humble <span property="concept" resource="concept:The_Mule_Of_Michael">mule</span>, did Michael carry with him <span property="concept" resource="concept:The_Library_Of_Michael">many books</span>, written by the <span property="concept" resource="concept:Monks_Of_Knowledge">Monks of Knowledge</span> and the <span property="concept" resource="concept:Sisters_Of_Understanding">Sisters of Understanding</span>, the better to preserve the knowledge and wisdom of ages past, as well as the cautionary tales that had led to the creation of <span property="location" resource="location:The_Wastes">the Wastes</span>.</p>
</a>
<span inlist="" property="chapter:verse" resource="verse:Michael-1-4"/>
<a name="Michael-1-4" id="Michael-1-4" about="verse:Michael-1-4" typeof="Verse">
<p property="text"><span property="dc:title" content="Michael 1-4"/><span property="concept" resource="concept:Weariness">Weary was he, and sore afoot,</span> for he had <span property="concept" resource="concept:Travel">travelled many days and nights</span>, had wandered through <span property="concept" resource="concept:Mountain">mountains</span> and <span property="concept" resource="concept:Desert">deserts</span>, searching for a new home, until finally <span property="concept" resource="concept:Illness">he collapsed, ill and feverish</span>.</p>
</a>
<span inlist="" property="chapter:verse" resource="verse:Michael-1-5"/>
<a name="Michael-1-5" id="Michael-1-5" about="verse:Michael-1-5" typeof="Verse">
<p property="text">There would Michael have died, had his <span property="concept" resource="concept:The_Mule_Of_Michael">mule</span> not begun <span property="concept" resource="concept:Loyalty">to bray in fear for his master</span>, for its master was a <span property="concept" resource="concept:Kindness">kind man who fed the mule even when he himself had nothing</span>, and <span property="concept" resource="concept:Caring">groomed his mule before ever taking himself down to sleep</span>.</p>
</a>
</article>
<span inlist="" property="book:chapter" resource="chapter:Michael-2"/>
<article about="chapter:Michael-2" typeof="class:Chapter">
<header>
<h2 property="dc:title">Michael: Chapter 2</h2>
</header>
<span inlist="" property="chapter:verse" resource="verse:Michael-2-1"/>
<a name="Michael-2-1" id="Michael-2-1" about="verse:Michael-2-1" typeof="class:Verse">
<p property="text"><span property="dc:title" content="Michael 2-1"/>Not far away, near the town of <span property="location" resource="location:Seattle_Delshaean_Era">Seattlec'l</span>, a young woman named <span property="bio" resource="Alara_Dishaean">Alara Dishaean</span> was tending her <span property="concept" resource="concept:Cattle">cows</span> when she heard <span property="concept" resource="concept:Mule">the braying of a mule</span>.</p>
</a>
<span inlist="" property="chapter:verse" resource="verse:Michael-2-2"/>
<a name="Michael-2-2" id="Michael-2-2" about="verse:Michael-2-2" typeof="class:Verse">
<p property="text"><span property="dc:title" content="Michael 2-2"/><span property="concept" resource="concept:Curiosity">Curious,</span> Alara left her farm on a <span property="concept" resource="concept:Bantu_Horse">Bantu Horse</span> until she entered into the <span property="location" resource="location:Great_Forest">Great Forest</span>, where she found <span property="bio" resource="bio:Michael_Grenadine">Michael</span> upon the forest floor, unconscious.</p>
</a>
</article>
</body>
</html>
Listing 5. The full HTML-based book.
Up to now, this article has focused on the construction of the RDFa, but has not yet answered how this gets translated into RDF. The answer is to make use of a program called an RDFa Parser/Distiller. The one I've used for these examples is available online as a Python application at http://www.w3.org/2012/pyRdfa/ . This runs as a service, and lets you pass RDFa as a text stream, upload a file, or parse an online page with embedded RDFa code. Overall, pyRdfa seems to offer the most comprehensive RDFa 1.1 coverage.
Running the above example through the parser, specifying XHTML5+RDFa 1.1 input and Turtle output (see Figure 1), pyRdfa produces the Turtle triples given in Listing 6.
Figure 1. The configuration screen for pyRdfa text input.
@prefix bio: <http://OrderOfTheBook.org/xmlns/bio#> .
@prefix book: <http://OrderOfTheBook.org/xmlns/book#> .
@prefix chapter: <http://OrderOfTheBook.org/xmlns/chapter#> .
@prefix class: <http://OrderOfTheBook.org/xmlns/class#> .
@prefix concept: <http://OrderOfTheBook.org/xmlns/concept#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix location: <http://OrderOfTheBook.org/xmlns/location#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfa: <http://www.w3.org/ns/rdfa#> .
@prefix verse: <http://OrderOfTheBook.org/xmlns/verse#> .

<> rdfa:usesVocabulary verse: .

book:Michael a class:Book;
    book:chapter ( chapter:Michael-1 chapter:Michael-2 );
    dc:title "The Book of Michael" .

chapter:Michael-1 a class:Chapter;
    chapter:verse ( verse:Michael-1-1 verse:Michael-1-2 verse:Michael-1-3 verse:Michael-1-4 verse:Michael-1-5 );
    dc:title "Michael: Chapter 1" .

chapter:Michael-2 a class:Chapter;
    chapter:verse ( verse:Michael-2-1 verse:Michael-2-2 );
    dc:title "Michael: Chapter 2" .

verse:Michael-1-1 a class:Verse;
    verse:bio bio:Michael_Grenadine;
    verse:concept concept:librarian;
    verse:location location:The_Wastes;
    verse:text "In the heart of the world, not long after the beginning, a man named Michael Grenadine, he who was known as The Librarian, walked out of the Wastes.";
    dc:title "Michael 1-1" .

verse:Michael-1-2 a verse:Verse;
    verse:concept concept:The_Horse_With_No_Name, concept:The_Mule_Of_Michael, concept:Unenlighted;
    verse:text "Travel did he with nought but a humble mule, which he called The Horse With No Name, verily to wreak confusion and wonder among the unenlightened.";
    dc:title "Michael 1-2" .

verse:Michael-1-3 a verse:Verse;
    verse:concept concept:Monks_Of_Knowledge, concept:Sisters_Of_Understanding, concept:The_Library_Of_Michael, concept:The_Mule_Of_Michael;
    verse:location location:The_Wastes;
    verse:text "Within his wagon, hauled by his humble mule, did Michael carry with him many books, written by the Monks of Knowledge and the Sisters of Understanding, the better to preserve the knowledge and wisdom of ages past, as well as the cautionary tales that had led to the creation of the Wastes.";
    dc:title "Michael 1-3" .

verse:Michael-1-4 a verse:Verse;
    verse:concept concept:Desert, concept:Illness, concept:Mountain, concept:Travel, concept:Weariness;
    verse:text "Weary was he, and sore afoot, for he had travelled many days and nights, had wandered through mountains and deserts, searching for a new home, until finally he collapsed, ill and feverish.";
    dc:title "Michael 1-4" .

verse:Michael-1-5 a verse:Verse;
    verse:concept concept:Caring, concept:Kindness, concept:Loyalty, concept:The_Mule_Of_Michael;
    verse:text "There would Michael have died, had his mule not begun to bray in fear for his master, for its master was a kind man who fed the mule even when he himself had nothing, and groomed his mule before ever taking himself down to sleep." .

verse:Michael-2-1 a class:Verse;
    verse:bio <Alara_Dishaean>;
    verse:concept concept:Cattle, concept:Mule;
    verse:location location:Seattle_Delshaean_Era;
    verse:text "Not far away, near the town of Seattlec'l, a young woman named Alara Dishaean was tending her cows when she heard the braying of a mule.";
    dc:title "Michael 2-1" .

verse:Michael-2-2 a class:Verse;
    verse:bio bio:Michael_Grenadine;
    verse:concept concept:Bantu_Horse, concept:Curiosity;
    verse:location location:Great_Forest;
    verse:text "Curious, Alara left her farm on a Bantu Horse until she entered into the Great Forest, where she found Michael upon the forest floor, unconscious.";
    dc:title "Michael 2-2" .
Listing 6. RDF Turtle output generated from test HTML+RDFa file.
This has now broken down the RDFa markup into RDF assertions. There was a little cheating going on, however - in most cases I defined objects in this example for (presumed) predefined entities. Suppose instead that you had only the text, with no specific resources defined, something like:
<p><span property="verse:bioText">Michael Grenadine</span> was ...</p>
This would have generated the triple:
verse:Michael-1-1 verse:bioText "Michael Grenadine";
If you had already defined an entry for the person beforehand (<bio:Michael_Grenadine>), you could do an inference using SPARQL update:
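prefix verse: <http://OrderOfTheBook.org/xmlns/verse#>
prefix bio: <http://OrderOfTheBook.org/xmlns/bio#>
insert {?verse verse:bio ?bio}
where {
?verse verse:bioText ?bioText.
?bio bio:fullName ?bioText.
}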
Listing 7. Assigning a URI reference when given a string.
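For readers who think more readily in code than in graph patterns, the update amounts to a join on the name string. The following Python sketch mimics it over an in-memory set of (subject, predicate, object) tuples. The function name infer_bio_links is made up for the example, and the sample data simply assumes that a bio:fullName entry already exists for Michael.

```python
def infer_bio_links(triples):
    """Add (verse, verse:bio, person) wherever a bioText matches a fullName."""
    inferred = {
        (verse, "verse:bio", person)
        for verse, p1, text in triples if p1 == "verse:bioText"
        for person, p2, name in triples if p2 == "bio:fullName" and name == text
    }
    # Like the SPARQL insert, existing triples are kept and new ones are added.
    return set(triples) | inferred


# Sample data: one verse mentioning a name, one previously defined person.
store = {
    ("verse:Michael-1-1", "verse:bioText", "Michael Grenadine"),
    ("bio:Michael_Grenadine", "bio:fullName", "Michael Grenadine"),
}

for triple in sorted(infer_bio_links(store)):
    print(triple)
```

As with the SPARQL version, the match is an exact string comparison, so "Michael" on its own would not be linked.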
You could also create more sophisticated inferences that would take shortened forms of the person's name and then attempt to do searches on these, in the context of an already extant reference somewhere within the verse itself. I'll leave that for a later post.
Once this data is put into a triple store, it opens up some interesting possibilities. As a simple example, you could retrieve all the verses in a given book, in order (Listing 8).
select ?verse ?text where {
$book book:chapter ?chapterList.
?chapterList rdf:rest*/rdf:first ?chapter.
?chapter chapter:verse ?verseList.
?verseList rdf:rest*/rdf:first ?verse.
?verse verse:text ?text.
}
Listing 8. Retrieving verses in a book in sequential order.
where $book in this case contains the URI of the book being referenced. The rather peculiar construct of rdf:rest*/rdf:first may seem nonsensical, but it is an artifact of the list structure described earlier. RDF represents lists using blank nodes, with the structure for verses in a chapter looking something like Listing 9.
chapter:Michael-1 chapter:verse _:b0.
_:b0 rdf:first verse:Michael-1-1;
rdf:rest _:b1.
_:b1 rdf:first verse:Michael-1-2;
rdf:rest _:b2.
_:b2 rdf:first verse:Michael-1-3;
rdf:rest _:b3.
_:b3 rdf:first verse:Michael-1-4;
rdf:rest _:b4.
_:b4 rdf:first verse:Michael-1-5;
rdf:rest rdf:nil.
Listing 9. How a list is rendered as triples.
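Before looking at the SPARQL path itself, it may help to see the same walk written out by hand. This Python sketch follows the rdf:rest chain over a shortened copy of the triples above, collecting each rdf:first value along the way. The function name list_members is hypothetical, the chapter:verse predicate on the first triple is taken from the RDFa markup, and the code assumes a well-formed list that ends in rdf:nil.

```python
# A shortened copy of the list structure: three verses instead of five.
TRIPLES = [
    ("chapter:Michael-1", "chapter:verse", "_:b0"),
    ("_:b0", "rdf:first", "verse:Michael-1-1"),
    ("_:b0", "rdf:rest", "_:b1"),
    ("_:b1", "rdf:first", "verse:Michael-1-2"),
    ("_:b1", "rdf:rest", "_:b2"),
    ("_:b2", "rdf:first", "verse:Michael-1-3"),
    ("_:b2", "rdf:rest", "rdf:nil"),
]


def list_members(triples, head):
    """Follow rdf:rest from the head cell, collecting each rdf:first value."""
    first = {s: o for s, p, o in triples if p == "rdf:first"}
    rest = {s: o for s, p, o in triples if p == "rdf:rest"}
    members = []
    node = head
    while node != "rdf:nil":
        members.append(first[node])  # the item held in this cell
        node = rest[node]            # move on to the next cell
    return members


# Find the head cell that the chapter points to, then walk the list.
head = next(o for s, p, o in TRIPLES
            if s == "chapter:Michael-1" and p == "chapter:verse")
print(list_members(TRIPLES, head))
```

The loop visits the head cell first and then every cell reachable through rdf:rest, which is the same set of nodes that a "zero or more rdf:rest steps" path reaches.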
Given this, the construct rdf:rest*/rdf:first is a property path that says: starting from the head of the list, follow zero or more rdf:rest links (zero covers the head node itself), and from each node reached take its rdf:first value. SPARQL itself makes no promise about result order without an ORDER BY clause, so treat the sequence you get back as a behavior of your particular triple store rather than a guarantee. Either way, this walks the whole list, with the added benefit that the individual items do not have to maintain their own pointers.
Why is this a benefit? In many kinds of documents, such as religious works, it's not uncommon to have "parables", which consist of one or more verses that may start in the middle of one chapter and end in another, may jump from one section to another, or may appear in a different order than was originally given. By keeping the links external, you can add items sequentially without being bound to imperative logic and exception handling. So, we can create a parable called "The Parable of the Mule" (Listing 10).
parable:Parable_Of_The_Mule a class:Parable;
    dc:title "The Parable of the Mule";
    parable:verse (verse:Michael-1-2 verse:Michael-1-3 verse:Michael-1-4 verse:Michael-1-5 verse:Michael-2-1).
Listing 10. The parable of the mule.
Once you have this kind of relationship, it also becomes possible to do things like determine every parable that a given verse is used in (Listing 11), or even (assuming that verses are the only place that hold keywords) find all parables that discuss certain concepts (Listing 12).
select ?parable where {
?parable parable:verse ?verseList.
?verseList rdf:rest*/rdf:first $verse.
}
Listing 11. Retrieving all parables that have a specific $verse.
select ?parable where {
?parable parable:verse ?verseList.
?verseList rdf:rest*/rdf:first ?verse.
?verse verse:concept $concept.
}
Listing 12. Retrieving all parables that include a specific $concept.
RDFa can appear somewhat complex at first, but the advantage of being able to encode content in RDF is that you can identify relationships between entities, make this information aware of other information that exists within a larger dataspace, and make that information far more malleable while still maintaining useful context.
Kurt Cagle is an information architect and author working for Avalon Consulting LLC, specializing in NoSQL, Semantics, XML and JSON data systems. He is the author of eighteen books on XML and web technologies, including the upcoming HTML5 Graphics with SVG and CSS for O'Reilly Media. He can be reached at kurt.cagle@gmail.com.