Semantics and Data Modeling: When Data Has Its Own Address

I've been thinking a fair amount lately about IPv6. As a standard it's not terribly well known, though it replaces a standard that is itself fading from public awareness. Back in the early 1980s, when the backbone of the Internet was first being fleshed out, the IP address was born as a way of providing a unique address on the network. Such IP addresses consisted of four eight-bit numbers - something like AF.2C.32.7D, for instance, though this was usually written in decimal notation: 175.44.50.125 - and as such could represent about 4.3 billion addresses.

That sounds like a lot of addresses, and at the time it was inconceivable that we would ever reach that many devices connected to the Internet (indeed, the total number of devices even before Tim Berners-Lee started to put together the World Wide Web was a vanishingly small fraction of that).

Yet as the number of devices on the web began to climb dramatically during the mid-90s, more than a few people began questioning whether inconceivable meant what they thought it meant - all of a sudden, four billion interconnected devices was beginning to seem rather confining, especially given the mapping between IP addresses and domain names with the introduction of the web.

Two trends helped that process along. First was the rise of sensors in everything from refrigerators to automobiles to water regulators in irrigation systems. In 1997, a Coke machine with a jury-rigged sensor at MIT was a novelty; by 2013, most vendors of automated dispensing machines were building in Internet connections to determine when a machine needed to be restocked, which significantly reduced the number of hours spent checking machines manually. Indeed, I'm now seeing soda dispensers at fast food restaurants that allow you to serve up dozens of different soft drinks, along with various flavorings and additions, fed by syrup cartridges. At any given point a manager of the store can bring up a diagnostic screen on his or her tablet that will show exactly how much of any given syrup is available, and the system will automatically order more when that number dips below a specified threshold.

The second new consumer of IP addresses was handhelds - mobile phones, readers, PDAs and tablets - all of which required reliable, permanent IP addresses. It was possible to stretch the existing IP address space by partitioning blocks and creating gateways, but this effectively meant that whenever you reached a specific boundary number of devices (such as 256 or 16,384) you ended up having to allocate a new block of addresses, which constrained the total number of available high level domains even more.

The IPv6 structure takes a somewhat different approach. First of all, it bumps the number of bits in an IP address from 32 to 128. At absolute first glance this may seem only to kick the can down the road a bit, but it's worth remembering the concept of exponential doubling. With 128 bits, you can express 3.403 x 10³⁸, or 340,300,000,000,000,000,000,000,000,000,000,000,000, addresses. To put this in perspective, the Internet has been expanding at a rate of about 15% year over year. Assuming that IP addresses continue to be added at this rate, we will run out of IP addresses some time around 2490 AD (and if current trends continue where the doubling itself is slowing, this could push it out to around 3000 AD, long enough to extend the standard to IPv7). Put another way, if the address space were divided evenly among every person on Earth today, each of them could assign IP addresses to roughly 5 x 10²⁸ devices.
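The arithmetic behind these figures can be checked with a few lines of Python. This is only a rough sketch: the 15% growth rate comes from the paragraph above, while the starting point (a fully used 32-bit space in 2013) and a world population of 7 billion are assumptions made for illustration.

```python
import math

IPV4_BITS = 32
IPV6_BITS = 128

ipv4_space = 2 ** IPV4_BITS   # about 4.3 billion addresses
ipv6_space = 2 ** IPV6_BITS   # about 3.4 x 10^38 addresses

# Assumption: roughly 2^32 addresses in use in 2013, growing 15% per year.
GROWTH_RATE = 0.15
START_YEAR = 2013
years_to_exhaust = math.log(ipv6_space / ipv4_space) / math.log(1 + GROWTH_RATE)
exhaustion_year = START_YEAR + years_to_exhaust

# Assumption: a world population of about 7 billion.
WORLD_POPULATION = 7_000_000_000
addresses_per_person = ipv6_space // WORLD_POPULATION

print(f"IPv4 addresses: {ipv4_space:,}")
print(f"IPv6 addresses: {ipv6_space:.3e}")
print(f"Exhaustion year at 15% growth: {exhaustion_year:.0f}")
print(f"Addresses per person: {addresses_per_person:.2e}")
```

Changing GROWTH_RATE shows how sensitive the exhaustion date is to the growth assumption.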

In general, such an IP address is broken into three parts. The first part (54 bits typically) identifies higher level domains, while the next 10 bits identify the subdomain. 54 bits will give you approximately 18 x 10¹⁵ domains, such as those used by a corporation, while the 10 bits give you an additional 1024 subdomains. This means that you could set up one IP address for acme.com, then up to 1024 of acme's divisions could each have a separate IP address that would still be tied into the core IP address. The remaining 64 bits would then link to servers within each subdomain. Note that the 54 bits given above is not absolute - a hosting service, for instance, may have only 50 bits in its domain with 2¹⁴, or 16,384, subdomains.
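To make the three-part split concrete, here is a minimal sketch that carves a 128-bit address into domain, subdomain and host fields using the 54/10/64 widths described above. The function names are hypothetical, and the sample address comes from the reserved 2001:db8::/32 documentation range rather than from any real allocation.

```python
import ipaddress

# Assumed field widths, taken from the 54/10/64 breakdown in the text.
DOMAIN_BITS, SUBDOMAIN_BITS, HOST_BITS = 54, 10, 64


def split_address(address: str) -> tuple:
    """Split an IPv6 address into (domain, subdomain, host) integers."""
    value = int(ipaddress.IPv6Address(address))
    host = value & ((1 << HOST_BITS) - 1)
    subdomain = (value >> HOST_BITS) & ((1 << SUBDOMAIN_BITS) - 1)
    domain = value >> (HOST_BITS + SUBDOMAIN_BITS)
    return domain, subdomain, host


def join_address(domain: int, subdomain: int, host: int) -> ipaddress.IPv6Address:
    """Reassemble the three fields into a single IPv6 address."""
    value = (domain << (SUBDOMAIN_BITS + HOST_BITS)) | (subdomain << HOST_BITS) | host
    return ipaddress.IPv6Address(value)


print(f"{2 ** DOMAIN_BITS:,} possible domains")
print(f"{2 ** SUBDOMAIN_BITS:,} subdomains per domain")

domain, subdomain, host = split_address("2001:db8:0:5::2a")
print(domain, subdomain, host)
```

A hosting service using the 50/14 split would only need to change the two width constants.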

It's the remaining 64 bits that I find most intriguing, however. These are typically assigned as specific client IPs. Assuming the 54/10/64 breakdown given above, this means that any subdomain can support roughly 1.8 x 10¹⁹ individual addresses. While this falls (barely) into the realm of the comprehensible, it is still a lot of addresses. Again, a little gedanken experiment puts this into perspective. One of the more innovative (and disturbing) recent developments has been specialized sensors called smart dust that are essentially RFID transmitters. The dust isn't quite dust sized - each is about the size of a piece of confetti - but it's small enough that it is easily buffeted by the wind. If you assign each one of these an IPv6 address in the same domain and you dump a ton (let's say 1 billion smart dust transmitters) into a hurricane, you will have used up roughly 0.000000005% of your domain space. Of course, you probably need only a cup of smart dust for it to be reasonably useful in a storm tracking experiment, so the actual percentage is probably three or four orders of magnitude smaller.
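The smart dust thought experiment reduces to a single division. The sketch below assumes the 64-bit host field and the one billion transmitters mentioned above.

```python
HOST_BITS = 64
subdomain_space = 2 ** HOST_BITS   # addresses available in one subdomain

dust_motes = 1_000_000_000         # the "ton" of smart dust from the text
fraction_used = dust_motes / subdomain_space

print(f"Subdomain address space: {subdomain_space:.2e}")
print(f"Share consumed: {fraction_used * 100:.10f}%")
```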

Indeed, this opens up a pretty valid point - even in the universe of things, the available address space is so much larger than any physical usage of this space that even profligate use of such IP addresses as in the example above will be hard pressed to make much of a dent in this space. This is even more true as addresses expire and are recycled. For instance, if you have an IP address for a refrigerator and after a year that IP address returns no response, then it's probably safe to assume the refrigerator itself has been junked or recycled. Should it suddenly come back on (say it's been in storage), then the processor within the refrigerator can simply request a new IP address, assuming the initial IP address was not hard coded ... and realistically it probably won't be. This is true of domains as well - if acme.com with its 54 bit domain encoding goes out of business, it's very likely that its registrar would release that domain IP address and all of its ancillary domains as well. This suggests a utilization rate that should keep much of the overall domain space empty for quite some time to come.

So what else could be assigned within those domain spaces? How about terms? I touched upon this briefly in my 2013 analysis, but wanted to expand upon this concept here. One of the central ideas of data processing is the notion of a unique id. In a self-contained SQL database, such ids are usually just sequential integers - 1, 2, 3 ... etc. - that identify each record in a given collection. The database generally stores an iterator either at the database or the table level such that when a new record is added, the system increments that iterator automatically by one.

Table iterators are attractive because they also provide a quick way of doing ordinal counting of the number of items, but in practice things get complicated pretty quickly when you try using such index identifiers in that fashion. As a general practice, an indexed identifier should not have any implicit semantics - there is no guarantee that 225 follows 224, for instance. Organizations often want to overload their identifiers: "ACC12-2561" indicates an account in the year 2012 with number 2561, for instance - when in fact each of these should be individual properties of the record. It would be far better to assign a unique but otherwise semantically minimal ID such as a universally unique identifier (UUID), which in practice is a 128 bit number ... curiously similar to an IPv6 address.
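As an illustration of pulling the semantics out of an overloaded identifier, the sketch below splits "ACC12-2561" into explicit properties and gives the record an opaque UUID instead. The Account class, its field names, and the assumption that the two digits after "ACC" are a year in the 2000s are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical record: the year and number are ordinary properties."""
    year: int
    number: int
    # The identifier carries no meaning of its own; it is a random 128-bit UUID.
    id: uuid.UUID = field(default_factory=uuid.uuid4)


def parse_legacy_id(legacy_id: str) -> Account:
    """Split an overloaded ID such as 'ACC12-2561' into explicit properties."""
    prefix, number = legacy_id.split("-")
    year = 2000 + int(prefix[3:])   # assumes a two-digit year after 'ACC'
    return Account(year=year, number=int(number))


account = parse_legacy_id("ACC12-2561")
print(account.year, account.number, account.id)
```

Because the year and number are now properties, either can change (or be corrected) without invalidating references to the record.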

Now, if you have a database that is itself a subdomain within your organization with its own IP address, you know that half of that number is in fact by definition unique, because it is registered uniquely. That means that the remaining 64 bits, or about 18 quintillion potential values, are available for identifiers. Now, assume that within each database you have up to 16,384 (2¹⁴) unique tables. This still means that each table can potentially hold 2⁵⁰ or about one quadrillion records. As most data systems can't even begin to approach this number per table, it means that even in specifying a large number of tables there is still plenty of room to ensure that each record can be given such a unique identifier.
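One way to picture this layout is as bit packing: 14 bits for the table and 50 bits for the record, sitting underneath the 64-bit registered prefix. The following is a minimal sketch under those assumptions; the function names and the sample prefix are hypothetical.

```python
TABLE_BITS, RECORD_BITS = 14, 50   # widths assumed from the text
LOCAL_BITS = TABLE_BITS + RECORD_BITS   # the 64 bits left for identifiers


def make_local_id(table: int, record: int) -> int:
    """Pack a table number and a record number into one 64-bit value."""
    if not 0 <= table < (1 << TABLE_BITS):
        raise ValueError("table number does not fit in 14 bits")
    if not 0 <= record < (1 << RECORD_BITS):
        raise ValueError("record number does not fit in 50 bits")
    return (table << RECORD_BITS) | record


def make_global_id(prefix: int, table: int, record: int) -> int:
    """Prepend the registered 64-bit prefix to form a 128-bit identifier."""
    if not 0 <= prefix < (1 << 64):
        raise ValueError("prefix does not fit in 64 bits")
    return (prefix << LOCAL_BITS) | make_local_id(table, record)


# Hypothetical prefix, table and record numbers for illustration only.
global_id = make_global_id(0x20010DB800000005, table=3, record=225)
print(f"{global_id:032x}")
print(f"Records per table: {2 ** RECORD_BITS:,}")
```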

From a semantics standpoint, each table can be thought of as a conceptual domain. Entity-relationship (ER) diagrams derive from precisely this equivalence, and it's not really surprising that UML derived from such ER diagrams. More generally, we can define each table as being a namespace, with each record in that namespace being a namespace term. Things get interesting when you start talking about views, because a view is in effect a synthetic table. The view is also itself a namespace, with the terms in that namespace being the "generated" or "virtual" records. This means that it is possible that the information covered by a given collection of data could in fact be in two distinct namespaces. Intuitively, this actually makes sense - it is entirely possible for an object to be a part of two or more distinct classification systems that are otherwise orthogonal to one another.

In an XML or NoSQL database, the same principle applies. A document may have more than one identifier applied to it, and that ID should be systemically unique (an identifier points to one and only one document). A query, however, is a view of a collection of documents - in effect it is a synthetic view of the data space. Document queries get a little fuzzier because either an array of documents or a document wrapping that array of documents may be returned, but in either case, the query effectively defines a namespace, with its parameters in turn defining a corresponding record or entity in that namespace.

This is, in a nutshell, what REST is all about. In a RESTful architecture, you have a collection of resources, each of which has one or more unique identifiers that allow that resource to be created, retrieved, updated or removed, and you have a collection of views (or named queries), each of which can specify a conceptual namespace with specific sets of terms dependent upon parametric values.
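The shape of such an architecture can be sketched without any web framework at all. The Collection class below is a hypothetical in-memory stand-in, not a real REST library: it offers the four resource operations on uniquely identified documents plus named, parameterized views over them.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Query = Callable[..., Callable[[Document], bool]]


class Collection:
    """A hypothetical in-memory collection of resources keyed by unique ID."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._resources: Dict[str, Document] = {}
        self._views: Dict[str, Query] = {}

    # The four resource-level operations.
    def create(self, ident: str, doc: Document) -> None:
        if ident in self._resources:
            raise KeyError(f"{ident} already exists")
        self._resources[ident] = dict(doc)

    def retrieve(self, ident: str) -> Document:
        return self._resources[ident]

    def update(self, ident: str, changes: Document) -> None:
        self._resources[ident].update(changes)

    def remove(self, ident: str) -> None:
        del self._resources[ident]

    # A view is a named query; its parameters select the terms it exposes.
    def define_view(self, view_name: str, query: Query) -> None:
        self._views[view_name] = query

    def view(self, view_name: str, **params: object) -> List[str]:
        matches = self._views[view_name](**params)
        return sorted(i for i, doc in self._resources.items() if matches(doc))


users = Collection("users")
users.create("jane_doe", {"name": "Jane Doe", "division": "research"})
users.create("john_roe", {"name": "John Roe", "division": "sales"})
users.define_view("by_division", lambda division: lambda doc: doc["division"] == division)
print(users.view("by_division", division="research"))
```

Here the view name plus its parameters ("by_division", division="research") plays the role of a term in the view's namespace, in the same way that an ID names a term in the collection's namespace.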

What this behavior means in practice is that - if we make the caveat that a query generally returns a single "document", even if it's a generated one - then each collection, view or query is in effect a conceptual server domain, and each term in that server domain has a one to one correspondence with a unique identifier.

Putting this all together, this means that acme.com (AF11:C539:48F2:) could have a database (systemData) (529D:) with a table for users (6438:) with the record for Jane Doe (C119:8252:921C). The full identifier - AF11:C539:48F2:529D:6438:C119:8252:921C - is globally unique, and could in turn map to the semantic term AcmeUser:Jane_Doe. The more general term Class:AcmeUser corresponds to the zeroed form AF11:C539:48F2:529D:6438:0000:0000:0000, which IPv6 notation abbreviates as AF11:C539:48F2:529D:6438:: (the double colon stands for the trailing run of zero groups).
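Python's standard ipaddress module can be used to assemble this example. The sketch below follows the field widths implied by the hex groups in the example itself (48 bits for the domain, 16 for the database, 16 for the table and 48 for the record) rather than the 54/10/64 split discussed earlier; the compose function and the constant names are hypothetical.

```python
import ipaddress

# Values taken from the example in the text.
DOMAIN = 0xAF11C53948F2     # acme.com, 48 bits
DATABASE = 0x529D           # systemData, 16 bits
TABLE = 0x6438              # users, 16 bits
RECORD = 0xC1198252921C     # Jane Doe, 48 bits


def compose(domain: int, database: int, table: int, record: int = 0) -> ipaddress.IPv6Address:
    """Pack the four fields into a single 128-bit IPv6 address."""
    value = (domain << 80) | (database << 64) | (table << 48) | record
    return ipaddress.IPv6Address(value)


jane_doe = compose(DOMAIN, DATABASE, TABLE, RECORD)
acme_user_class = compose(DOMAIN, DATABASE, TABLE)   # record bits zeroed

print(jane_doe.exploded)            # full eight-group form
print(acme_user_class.compressed)   # zero groups collapsed to "::"

# The class address acts as a /80 prefix that contains every user record.
users_network = ipaddress.IPv6Network((int(acme_user_class), 80))
print(jane_doe in users_network)
```

Treating the zeroed form as a network prefix gives a cheap membership test: any address inside it is, by construction, a term in the AcmeUser namespace.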

Note that the term class here is a semantic rather than programmatic class. While you usually "get" a given resource (or delete a given resource), non-idempotent operations, such as creating new documents or updating existing documents, typically involve sending a bundle of data - a record, whether expressed as a structured bundle or passed parametrically - to a given collection.

This equivalency is a powerful one. Not all data is public, of course, nor should it be, but I think there's a slow realization on the part of many organizations that they have entities which should be visible beyond the immediate boundaries of single applications: documents that may have relevance within an organization, significant amounts of LDAP data (LDAP is a technology that is beginning to show its age), research content, presentations and white papers, software applications, media files. What matters here is that as often as not it is the metadata about this content that is more relevant than the content itself - who produced it, when it was created, what it's about, how it is categorized, what version it is. By dealing with such content at a conceptual level, then using such an interface to serve the content itself, this also creates a more generalized knowledge and asset management architecture that is largely independent of any one given proprietary product.

By laying this foundation, it becomes possible to manage relationships between data even when the data itself resides on multiple servers. By binding an <owl:sameAs> or similar identity tool, these conceptual servers can be manipulated indirectly, and then when a given resource changes the linkages that were symbolic before (AcmeUser:Jane_Doe, for instance) can then be resolved to a formal IPv6 URI and consequently can provide relationships that existed at the time of the change (even as the formal text URI gets assigned to a new entity).

I'm going to have to play with this concept some more. It's possible that such IPv6 URIs will go in a completely different direction - but to me, the concept of a web of things is not really that different from a web of ideas - even if a physical object has a URI, from the web's standpoint, you are dealing not with the object but with an abstraction of it, and IPv6 opens up the possibility of making such abstractions completely addressable.

3 comments:

Dan McCreary, January 15, 2013 at 6:35 AM
Nice article Kurt! It does make us think a bit more about the ties between the network layer and the application layer. Almost makes me think that "concept routing" could help us differentiate localized semantics.
Unknown, January 17, 2013 at 5:33 PM
Dan,

I may be reading too much into the IP address and direct network ties, but I do think the idea that every entity needs to have a permanent URI, something that I think Tim Berners-Lee hinted at many years ago but that didn't necessarily make sense for the web at the time, is a pretty important one. There's a lot of overlap between the Semantic Web and REST, one that I think actually gets pretty deep into the notion of how we think about objects in the virtual environment. Maybe one of these days I'll successfully articulate what that is.
Ruby, February 6, 2013 at 12:45 PM
Interesting post you made. Definitely something to look into since more and more data are being streamed, copied, stored, etc to different networks at huge sizes. It’s definitely a probability in the nearer future that this would be more recognized and utilized than what we currently have.

Ruby Badcoe

Monday, January 14, 2013