Thursday, January 31, 2013

SVG Book Getting Closer

Vector Graphics - an SVG graphic by Kurt Cagle

I wanted to redirect your attention to the graphic to your left. It's written in Scalable Vector Graphics, or SVG, and is rendered directly in your browser as part of HTML5 support. I've just wrapped up the first draft of my SVG Web Graphics book for O'Reilly Media, due to go to technical review today and hopefully available on Safari by early spring.

SVG is cool. There's no question about it. It gives you the ability to integrate text and graphics on a web page with comparatively few problems, lets you script graphics dynamically, gives you the means to select text within the graphics, and can be made interactive (click on me).

I'm doing this partly as a test to see how well SVG is supported in Blogger (the answer is "pretty good, actually!"). However, I hope to preview some of the book here in the future as well, and to talk about the intersection of data design and user interface design, which SVG is ideally suited for. So, watch this space.

And this one ... ... bye!

Monday, January 14, 2013

When Data Has Its Own Address

I've been thinking a fair amount lately about IPv6. As a standard it's not all that terribly well known, though it replaces a standard that is itself fading from public awareness. Back in the late 1980s, when the backbone of the Internet was first being fleshed out, the IP address was born as a way of providing a unique address for each machine on the network. Such IP addresses consisted of four eight-bit numbers - something like AF.2C.32.7D, for instance, though this was usually written in decimal notation: 175.44.50.125 - and as such could represent about 4.3 billion addresses.

That sounds like a lot of addresses, and at the time it was inconceivable that we would ever have that many devices connected to the Internet (indeed, when the scheme was designed - years before Tim Berners-Lee started to put together the World Wide Web - the number of connected devices was a vanishingly small fraction of that).

Yet as the number of devices on the web began to climb dramatically during the mid-90s, more than a few people began questioning whether inconceivable meant what they thought it meant - all of a sudden, four billion interconnected devices was beginning to seem rather confining, especially given the mapping between IP addresses and domain names with the introduction of the web.

Two trends helped that process along. The first was the rise of sensors in everything from refrigerators to automobiles to water regulators in irrigation systems. In 1997, a Coke machine with a jury-rigged sensor at MIT was a novelty; by 2013, most vendors of automated dispensing machines build in Internet connections to determine when a machine needs to be restocked, which significantly reduces the number of hours that have to be spent checking manually. Indeed, I'm now seeing soda dispensers at fast food restaurants that can serve up dozens of different soft drinks, along with various flavorings and additions, fed by cartridged syrup - at any given point the manager of the store can bring up a diagnostic screen on his or her tablet that shows exactly how much of any given syrup is available, and the machine will automatically order more when that number dips below a specified threshold.

The second new consumer of IP addresses was handhelds - mobile phones, readers, PDAs and tablets - all of which required reliable, permanent IP addresses. It was possible to stretch the existing IP addresses by partitioning blocks and creating gateways, but this effectively meant that whenever you reached a specific boundary number of devices (such as 256 or 16,384) you ended up having to allocate a new block of addresses, which constrained the total number of available high-level domains even more.

The IPv6 structure takes a somewhat different approach. First of all, it bumps the number of bits in an IP address from 32 to 128. At first glance this may seem to merely kick the can down the road a bit, but it's worth remembering the power of exponential doubling. With 128 bits, you can express about 3.4 × 10^38 - or 340,282,366,920,938,463,463,374,607,431,768,211,456 - addresses. To put this in perspective, the Internet has been expanding at a rate of about 15% year over year. Assuming that IP addresses continue to be added at this rate, we will run out of IPv6 addresses some time around 2490 AD (and if current trends continue, with the doubling itself slowing, this could push it out to around 3000 AD - long enough to extend the standard to IPv7). Put another way, if the IPv6 address space were divided evenly among every person on Earth today, each of us could assign addresses to some 48 billion billion billion devices.
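Purely as a back-of-the-envelope check (assuming a world population of roughly seven billion; the figures are rounded):

\[ 2^{128} = 340{,}282{,}366{,}920{,}938{,}463{,}463{,}374{,}607{,}431{,}768{,}211{,}456 \approx 3.4 \times 10^{38} \]
\[ 2^{128} \div \left(7 \times 10^{9}\ \text{people}\right) \approx 4.86 \times 10^{28}\ \text{addresses per person} \]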

In general, such an IP address is broken into three parts. The first part (typically 54 bits) identifies the higher-level domain, while the next 10 bits identify the subdomain. 54 bits will give you approximately 1.8 × 10^16 domains, such as those used by a corporation, while the 10 bits give you an additional 1,024 subdomains. This means that you could set up one IP address for acme.com, and then up to 1,024 of acme's divisions could each have a separate IP address that would still be tied to the core address. The remaining 64 bits then link to individual servers and devices within each subdomain. Note that the 54-bit figure given above is not absolute - a hosting service, for instance, may have only 50 bits in its domain prefix, leaving 2^14, or 16,384, subdomains.

It's the remaining 64 bits that I find most intriguing, however. These are typically assigned as specific client IPs. Assuming the 54/10/64 breakdown given above, this means that any subdomain can support roughly 1.8 × 10^19 individual addresses. While this falls (barely) into the realm of the comprehensible, it is still a lot of addresses. Again, a little gedanken experiment puts this into perspective. One of the more intriguing (and disturbing) recent innovations has been specialized sensors called smart dust that are essentially RFID transmitters. The dust isn't quite dust-sized - each mote is about the size of a piece of confetti - but it's small enough that it is easily buffeted by the wind. If you assign each one of these an IPv6 address in the same domain and dump a ton of them (let's say 1 billion smart dust transmitters) into a hurricane, you will have used up only about five billionths of a percent of your subdomain's address space. Of course, you probably need only a cup of smart dust for it to be reasonably useful in the hurricane-tracking experiment, so the actual percentage is probably three or four orders of magnitude smaller.

Indeed, this opens up a pretty valid point - even in the universe of things, the available address space is so much larger than any physical usage of it that even profligate use of IP addresses, as in the example above, will be hard-pressed to make much of a dent in that space. This is even more true as addresses expire and are recycled. For instance, if the IP address assigned to a refrigerator returns no response after a year, it's probably safe to assume the refrigerator itself has been junked or recycled. Should it suddenly come back online (say it's been in storage), the processor within the refrigerator can simply request a new IP address, assuming the initial IP address was not hard-coded ... and realistically it probably won't be. This is true of domains as well - if acme.com, with its 54-bit domain encoding, goes out of business, it's very likely that its registrar would release that domain IP address and all of its ancillary domains as well. This suggests a utilization rate that should keep much of the overall domain space empty for quite some time to come.

So what else could be assigned within those domain spaces? How about terms? I touched upon this briefly in my 2013 analysis, but wanted to expand upon the concept here. One of the central ideas of data processing is the notion of a unique id. In a self-contained SQL database, such ids are usually just sequential integers - 1, 2, 3 and so on - that identify each record in a given collection. The database generally stores an iterator, either at the database or the table level, such that when a new record is added, the system automatically increments that iterator by one.

Table iterators are attractive because they also provide a quick way of doing an ordinal count of the number of items, but in practice things get complicated pretty quickly when you try using such index identifiers that way. As a general practice, an indexed identifier should not carry any implicit semantics - there is no guarantee, for instance, that record 225 follows record 224. Organizations nonetheless tend to want to overload their identifiers - "ACC12-2561" indicates an account opened in 2012 with number 2561, for instance - when in fact each of these facts should be an individual property of the record. It would be far better to assign a unique but otherwise semantically minimal ID such as a universally unique identifier (UUID), which in practice is a 128-bit number ... curiously similar in size to an IP address.
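As a hypothetical sketch - the property names and the UUID below are invented for illustration - the overloaded account code decomposes into an opaque identifier plus ordinary properties:

# hypothetical example: the overloaded code becomes plain properties of the record
<Account:f47ac10b-58cc-4372-a567-0e02b2c3d479> <Account:year>       "2012".
<Account:f47ac10b-58cc-4372-a567-0e02b2c3d479> <Account:number>     "2561".
<Account:f47ac10b-58cc-4372-a567-0e02b2c3d479> <Account:legacyCode> "ACC12-2561".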

Now, if you have a database that is itself a subdomain within your organization, with its own IP address, you know that half of that number is by definition unique, because it is registered uniquely. That means that the remaining 64 bits - roughly 18 quintillion potential values - are available for identifiers. Assume further that within each database you have up to 16,384 (2^14) unique tables. Each table can still potentially hold 2^50, or about one quadrillion, records. As most data systems can't even begin to approach this number per table, there is still plenty of room, even with a large number of tables, to ensure that each record can be given such a unique identifier.
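The arithmetic here is just bit accounting, assuming the 14-bit table allocation above:

\[ 2^{14} = 16{,}384\ \text{tables}, \qquad 64 - 14 = 50\ \text{bits remaining per table} \]
\[ 2^{50} = 1{,}125{,}899{,}906{,}842{,}624 \approx 1.1 \times 10^{15}\ \text{records per table} \]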

From a semantics standpoint, each table can be thought of as a conceptual domain. Entity-relationship (ER) diagrams derive from precisely this equivalence, and it's not really surprising that UML grew out of such ER modeling. More generally, we can define each table as being a namespace, with each record in that namespace being a namespace term. Things get interesting when you start talking about views, however, because a view is in effect a synthetic table. The view is also itself a namespace, with each term in that namespace being one of the "generated" or "virtual" records. This means that the information covered by a given collection of data could in fact be in two distinct namespaces. Intuitively, this actually makes sense - it is entirely possible for an object to be a part of two or more distinct classification systems that are otherwise orthogonal to one another.

In an XML or NoSQL database, the same principle applies. A document may have more than one identifier applied to it, and each such identifier should be systemically unique (an identifier points to one and only one document). A query, however, is a view over a collection of documents - in effect a synthetic view of the data space. Document queries get a little fuzzier because either an array of documents or a document wrapping that array may be returned, but in either case the query effectively defines a namespace, with its parameters in turn defining a corresponding record or entity in that namespace.

This is, in a nutshell, what REST is all about. In a RESTful architecture, you have a collection of resources, each of which has one or more unique identifiers that allow that resource to be created, retrieved, updated or removed, and you have a collection of views (or named queries), each of which can specify a conceptual namespace with specific sets of terms dependent upon parametric values.

What this means in practice is that - if we accept the caveat that a query generally returns a single "document," even if it's a generated one - each collection, view or query is in effect a conceptual server domain, and each term in that domain has a one-to-one correspondence with a unique identifier.

Putting this all together, this means that acme.com (AF11:C539:48F2:) could have a database (systemData) (529D:) with a table for users (6438:) containing the record for Jane Doe (C119:8252:921C). The full 128-bit address - AF11:C539:48F2:529D:6438:C119:8252:921C - is globally unique, and could in turn map to the semantic term AcmeUser:Jane_Doe. The more general term Class:AcmeUser corresponds to the zeroed form AF11:C539:48F2:529D:6438:0000:0000:0000, which can be truncated as AF11:C539:48F2:529D:6438::0.

Note that the term class here is a semantic rather than a programmatic class. While you usually "get" a given resource (or delete a given resource), non-idempotent operations, such as creating new documents or updating existing documents, typically involve sending a bundle of data - a record, whether expressed as a structured bundle or passed parametrically - to a given collection.

This equivalency is a powerful one. Not all data is public, of course, nor should it be, but I think there's a slow realization on the part of many organizations that they hold entities which should be visible beyond the immediate boundaries of single applications: documents that may have relevance across an organization, significant amounts of LDAP data (LDAP is a technology that is beginning to show its age), research content, presentations and white papers, software applications, media files. What's more, it is as often as not the metadata about this content - who produced it, when it was created, what it's about, how it is categorized, what version it is - that matters more than the content itself. By dealing with such content at a conceptual level, then using that interface to serve the content itself, you also create a more generalized knowledge and asset management architecture that is largely independent of any one proprietary product.

By laying this foundation, it becomes possible to manage relationships between data even when the data itself resides on multiple servers. By binding an <owl:sameAs> or similar identity predicate, these conceptual servers can be manipulated indirectly: linkages that were purely symbolic before (AcmeUser:Jane_Doe, for instance) can be resolved to a formal IPv6-based URI, preserving the relationships that existed at the time of a change even as the symbolic term gets assigned to a new entity.
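A minimal sketch of that binding, reusing the hypothetical Acme addresses above (the ip: prefix is an assumed convention here, not a registered URI scheme):

<AcmeUser:Jane_Doe> <owl:sameAs> <ip:AF11:C539:48F2:529D:6438:C119:8252:921C>.
<Class:AcmeUser>    <owl:sameAs> <ip:AF11:C539:48F2:529D:6438::0>.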

I'm going to have to play with this concept some more. It's possible that such IPv6 URIs will go in a completely different direction - but to me, the concept of a web of things is not really that different from a web of ideas - even if a physical object has a URI, from the web's standpoint, you are dealing not with the object but with an abstraction of it, and IPv6 opens up the possibility of making such abstractions completely addressable.

Sunday, January 6, 2013

Modeling Workflows Semantically

Workflows are essential parts of nearly all document-centric systems, to the extent that many applications tend to encode workflows directly into the operational logic of the application itself. However, over the years I've found that it makes a great deal of sense to look at "soft" workflows - where the specific actions that a document can take at any time can be modeled as a graph.

A standard publishing workflow provides a good example of what such a graph looks like:

Publishing State Diagram
One of the first things that becomes evident with such a graph is that a workflow is very seldom a linear sequence of items. Instead, you can think of such a graph as a combination of states (such as New, Editable or Published) that identify a particular view (possibly editable) that a document is in, along with a set of actions (Create, Edit, Review, Publish, etc.) that connect these states.

The actions are links that connect nodes, and it's entirely possible for a single node to have multiple outbound and inbound links - and even re-entrant links, as is the case for the Editable state and the Edit action. However, it is precisely the fact that you have referential links that makes describing such workflows in languages such as XML problematic ... and that makes RDF ideal for the same task.

From a conceptual standpoint, each state, action and workflow "term" should be considered unique. From an RDF standpoint, each action is a predicate that joins a subject node (the starting state) and an object node (the ending state). Consequently, you could describe the state diagram via a series of RDF triples:

<State:New>      <Action:Edit>    <State:Editable>.
<State:New>      <Action:Review>  <State:Approval>.
<State:Approval> <Action:Publish> <State:Published>.
<State:Approval> <Action:Edit>    <State:Editable>.
<State:Editable> <Action:Review>  <State:Approval>.
<State:Editable> <Action:Edit>    <State:Editable>.
...


Each term or resource identified above will have additional information. For instance, the set of all states and actions within a given workflow should be identified as being part of that workflow, and the class of a given state or action should also be identified. The following identifies the workflow itself, binds the Approval state to it, and provides a human-readable label for that state:


<Workflow:Publishing> <rdf:type>       <Class:Workflow>.
<State:Approval>      <rdf:type>       <Class:State>.
<State:Approval>      <State:Workflow> <Workflow:Publishing>.
<State:Approval>      <rdfs:label>     "Document Review".

Similarly, each action can also be defined:


<Action:Publish>      <rdf:type>        <Class:Action>.
<Action:Publish>      <Action:Workflow> <Workflow:Publishing>.
<Action:Publish>      <rdfs:label>      "Publish".

Finally, a helper SPARQL Update statement can make the relationship between states and their available actions explicit:

insert {?startState <State:HasAction> ?action} where
{
?startState ?action ?endState.
?startState <rdf:type> <Class:State>.
?endState <rdf:type> <Class:State>.
}

For the Approval State, this will add the following:

<State:Approval> <State:HasAction> <Action:Publish>.
<State:Approval> <State:HasAction> <Action:Edit>.

This makes it possible, for a given state, to determine which actions are available from that state.
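For instance, a simple query (a sketch, using the same shorthand prefixes as above) can list the actions available from the Approval state, along with their labels:

select ?action ?label where
{
<State:Approval> <State:HasAction> ?action.
?action <rdfs:label> ?label.
}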

While this describes the particular states, it's important to understand that a given document or resource participates in a workflow by binding the document to a given state (and typically binding the workflow itself to the class of the document). For instance, consider a blog entry called "My First Blog Post", with identifier <BlogPost:MyFirstBlogPost>. This would be bound to a workflow using the properties <WorkflowDoc:WorkflowState> and <WorkflowDoc:Workflow> respectively:


<BlogPost:MyFirstBlogPost> <rdf:type> <Class:BlogPost>;
             <WorkflowDoc:WorkflowState> <State:Approval>.
<Class:BlogPost> <WorkflowDoc:Workflow> <Workflow:Publishing>.


I'm using the WorkflowDoc: namespace here as a base type for all documents that participate in workflows in the system, with the assumption that the BlogPost class is a subclass:

<Class:BlogPost> <rdfs:subClassOf> <Class:WorkflowDoc>.


Note that if $doc and $action are the terms for a specific document and action, you can set the next state as follows:

delete {$doc <WorkflowDoc:WorkflowState> ?oldState}
insert {$doc <WorkflowDoc:WorkflowState> ?newState} 
where
{
$doc <WorkflowDoc:WorkflowState> ?oldState.
?oldState $action ?newState.
}



(The DELETE ... INSERT construct is especially useful for replacing old property assertions with new ones, as both the DELETE and the INSERT clauses use the variable bindings that are in scope in the WHERE clause.)

This is a fairly simplified model, and doesn't take into account user or group permissions, which in general make use of a surprisingly similar model. I'll be covering that in the next posting.

The publishing workflow described here is pretty standard, but any workflow can be described in a similar manner, from wizards to complex business orchestration. In the latter situation, where an action is initiated by a system event rather than by a user, the action taken may very well be determined by testing against query conditions. Unless your SPARQL database supports triggers, this will likely be initiated externally - for instance, by a call from an XQuery engine - which retrieves the items that are in a given state and satisfy additional conditions, such as the following:


# for the $action <Action:Promote>, this will change the state
# of the document to <State:Promoted> if this action is supported
# in the workflow and if the blog post has more than three "favorites"
# on the post itself

delete {?doc <WorkflowDoc:WorkflowState> ?oldState}
insert {?doc <WorkflowDoc:WorkflowState> ?newState} 
where
{
$action <owl:sameAs> <Action:Promote>.
?doc <WorkflowDoc:WorkflowState> ?oldState.
?oldState $action ?newState.
?doc <BlogPost:numFavorites> ?numFavorites.
FILTER (?numFavorites > 3)
}

The <owl:sameAs> predicate, by the way, is immensely useful for switches and parameterized conditions, as illustrated above. It provides an identity relationship, and in general every term should have one of the form:

<foo:bar> <owl:sameAs> <foo:bar>.

One additional consideration is indicating where a given document starts and terminates within a workflow. This can be established on the workflow object itself:

<Workflow:Publishing> <Workflow:StartState> <State:New>.
<Workflow:Publishing> <Workflow:EndState> <State:Purge>.
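As a sketch built only from the properties defined above, the start state makes it easy to drop newly registered documents into their workflow automatically:

insert {?doc <WorkflowDoc:WorkflowState> ?startState} where
{
?doc <rdf:type> ?class.
?class <WorkflowDoc:Workflow> ?workflow.
?workflow <Workflow:StartState> ?startState.
FILTER NOT EXISTS {?doc <WorkflowDoc:WorkflowState> ?anyState}
}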

The workflow models given here make no assumptions about user or group permissions, which should be seen as being orthogonal to the workflow - a document must take into account user permissions when transitioning between states, but these are conditions in the WHERE clause. My next SPARQL post will cover user and group permissions, and show how they tie into workflow models.

Finally, it should be noted that workflows as discussed here use the same underlying concept as they do in most enterprise systems - a graph that describes the transitions between states via actions over time, rather than the somewhat vaguer definition of how a person in his or her day to day routine accomplishes a task, though for a sufficiently complex graph the former should approach the latter.

Thursday, January 3, 2013

You Say Tomato, I Say Ketchup

I've been moving into a new house in Issaquah, Washington lately, and one of the inevitable chores that comes with moving is putting up books, videos, DVDs, games and similar content. I had an interesting lesson in organization when my wife and daughter independently ended up organizing the various movies we'd collected over the years and discovered that each had an ever so slightly different view of what constituted proper ordering and organization. My wife organized content by publisher and alphabetical title. My daughter came in later and reorganized the same DVDs purely in alphabetical order, save that series were arranged chronologically. By the time the day was over, tempers were frayed, and each was grumbling about the other being absolutely unreasonable in how they viewed the way the world should work.

This same process takes place every day in organizations. DBAs put together databases, typically designed to handle one particular application, without thought to the broader applicability of the data they are storing. Programmers build ad hoc XML or JSON schemas without worrying about whether there is any larger scheme for promulgating this information beyond the immediate confines of the services they are writing. Services get created to retrieve very specific content, without thinking about the broader design of such interfaces vis-à-vis organizational goals.

This happens for obvious reasons. Developing standards takes specialized skills and time, and usually both are in short supply. Most people do not want to make the effort to find data structures already in use that are similar to what they have, because inevitably there will be information that process Y needs that process X didn't supply, or the information requires some transformation, or the data is locked behind a web service interface.

Beyond these more mundane factors are also bigger issues. As semi-structured data (XML, JSON, etc.) gets utilized by one group, code becomes dependent upon that data structure. This is especially true when the language processing that structured data is procedural, and the developers in question do not have the means to understand how to work well with such structured content. I've seen too many projects where the Java or .NET code was so fragile with regard to the XML that something as simple as adding an element to a schema could break the application.

This usually means that the same organization may end up with half a dozen different ways of expressing invoices, manifests, even addresses. For a small organization this isn't much of a problem, but beyond a certain point organizations are sufficiently large and distributed that a multitude of one-off schemas makes data interchange costly and in many cases not even possible, even with XML or JSON as the data format.

So what's the solution to this Babelization? In an ideal world, the organization would have an information architect lay out a broad, long-term data strategy, and any data structures that are created would then need to follow those data rules. In practice this happens very seldom - most companies tend to code first and think about data strategy only after a great deal of such code is already in place.

More often, what happens is that an executive decision gets made to create open standards or platforms because too much of the organization's data is locked up in different databases that can't talk with one another, and this translates into extra costs or potential revenue that is never realized. At this stage, it becomes the responsibility of a "data architect" to somehow de-siloize the organization's information.

At that point, the existing data is analyzed to tease out the underlying data structures, and these data structures in turn are modeled more formally. Typing is noted, taxonomies are identified, conventions are established, and relationships are clarified. This process is also very contentious, because the various stakeholders of that data want to ensure that the structures mostly conform to what already exists in the code base - even if it means violating all of the carefully laid rules from the beginning of the process.

What's more, a great deal of the information that gets captured in enterprise systems is messages, especially when XML is involved. These messages are usually derived from physical forms. Indeed, one of the most common ways of building models is to transfer every field in the form to an XML schema, all too often without even questioning whether a given field exists to capture intent rather than specific object data. Forms are certainly useful for understanding data and data relationships, but they should be used only as a guideline, and only in context with other forms in the system.

In effect, one of the primary purposes of such data modeling exercises is to establish logical models of the data space.   A logical model identifies the things in the system, and is independent of the medium being used to capture or persist the data in the first place. 

In this model, the role of a form is to provide the data necessary to construct the things in the system, to provide sufficient data for updating those objects, to pass parameters that query the data store for one or more objects satisfying given conditions, or to expedite changes in the metadata of an object (such as where the object is located).

Now, one consequence of such a rethink is that creating an enterprise data model means data entities become unique. This makes sense, given that a particular object being modeled is usually a physical entity in its own right, but it also means that, by moving to an enterprise model, systems need ways of identifying when chunks of data in two different data stores refer to the same entity, and of resolving conflicts when those chunks disagree.

This is perhaps one of the hardest things an information architect needs to do. Resources have to be uniquely identified, and resource data has to be framed within a contextual provenance. Multiple databases may contain different pieces of information about the same entity - one database may contain my employment history, while another contains my medical records, and a third my academic records. Moreover, two different databases may have the same type of information about me but be out of sync, or even contradictory. These are all things the architect needs to consider when designing systems, especially when those systems span multiple data repositories, data types, domains of control and transmission mechanisms.
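Tying this back to the identity machinery discussed earlier, here is a minimal sketch (all of the names below are hypothetical) of how two stores might describe the same person, with <owl:sameAs> asserting identity and a simple property recording where each description came from:

# hypothetical example: one person, two stores, identity plus source
<HR:Employee_1024>      <owl:sameAs>        <Medical:Patient_88213>.
<HR:Employee_1024>      <Meta:sourceSystem> <System:EmploymentDB>.
<Medical:Patient_88213> <Meta:sourceSystem> <System:MedicalRecordsDB>.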

At the same time, the architect must act as an advocate for the various programs and data systems' users. Across an enterprise, the architect does not (and should not) design each database or set of data structures. That creates a choke point in the system, and will likely result in your architect being carried off on a stretcher and your organization at a standstill waiting for him or her to recover from the heart attack. What the information architect should do, however, is establish standards - how schemas are defined, what naming and design rules are used, how the scope of a given schema should propagate upward through the organization based upon its utility and context, how auditing and versioning take place and so forth. In that respect, an enterprise information architect is a manager of departmental data architects, fostering an environment where there is communication between the various stakeholders and transparency in data modeling and design. The departmental data architects in turn should work with the system architects, business analysts and software designers within their departments to ensure that information can be adequately represented for programmers to work with, while still retaining all the information necessary for the objectives of the organization overall.

Ultimately, the information architect must recognize that the data space is a shared space - too much control and the complexity of the software becomes unmanageable, too little control and data interchange cannot occur without information, perhaps critical information, being lost in translation. It's not an easy role to fill - in my time as an information architect I've made decisions that seemed sensible at the time but that I've come to regret, and invariably there are demands to get data models out before all of the requirements are known, leading to late nights and headaches down the road. On the other hand, when you do reach that nirvana point - where the applications are done and the information is moving smoothly between systems - there are few feelings that can beat it ... until the requirements come down the pike for the next batch of software ...