Monday, December 31, 2012

Data Trends of 2013


Ten years ago, I began an annual tradition of posting my analysis for the coming year. I've missed a few years in there, and of course, everyone and their brother now does this, but I find that this exercise helps me to focus on what trends I see coming up and lets me adjust to them accordingly.

This year was notable for its apocalyptic tenor. The Mayan calendar ended a major cycle a couple of weeks ago, and at that point we transitioned from one b'ak'tun to another. The Mayans were masters of the Long Now - they thought in 8,000-year chunks of time. It's worth noting that the Mayan civilization lasted longer than the American one has to date, so perhaps they were on to something.

When I've done these in the past, I've usually had a mix of the technical and non-technical. Given that I am currently running two blogs - one on Semantics, the other more on history, politics and economics - I decided that I'd actually focus on the data world for this particular analysis of trends, and will write my more non-technical analysis to the Metaphorical Web Blog.

I did do an analysis for 2012 last year, covered here. Overall, it suggests that I should probably stick to the areas that I know best - data services, social media, semantics, and the like. My analysis was pretty good, but I still made a couple of really bad calls - Steve Ballmer is still CEO of Microsoft, though I think that with the release of Windows 8 my call that it is sliding into irrelevance may be truer than not. I hope not - I think there is a lot of potential for the company, but it's playing too conservatively in a space that has generally rewarded risk-taking. I also figured that RIM would be out of business by now. They're still around, but they are definitely struggling.

So, given those lumps, we'll see how well I fare going into 2013:

1. Big Data Fizzles. Every so often, a term gets thrown out by some industry consulting group such as Gartner or Forrester that catches fire among marketing people. Web Services, Service Oriented Architecture (SOA), Enterprise Service Buses (ESB), AJAX and more all come to mind. There's usually a kernel of a technical idea there, but it is so heavily smothered in marketing BS that by the time it gains widespread currency, the term exists primarily so that a product website designer can put another bullet point in a list of features that make product/service X something new and improved.

The term Big Data very readily fits into that list. A few years back, someone realized that "data" had slipped the confines of databases and was increasingly streaming along various wires, radio waves and optical pulses. Social media seems to be a big source of such data, along with a growing number of sensors of various sorts, and as a consequence it looked like a windfall for organizations and software projects that could best parse and utilize that data for various and sundry data mining (yet another buzzworthy phrase) purposes. All of a sudden all of the cool kids were learning Hadoop, and everyone had a Hadoop pilot project or connector out.

The problem with this is that most map/reduce type solutions of any value are essentially intended to create indexes into data. An index takes one or more keys and ties them into a record, document or data file. Certainly commodity parallel processing can do other things, but most of these, at some point or another, still rely upon that index. The other thing that map/reduce does is allow for the creation of new documents that contain enriched metadata about existing document contents. Document enrichment in general isn't as sexy - it's difficult, it's far from perfect, especially for general domain documents, and it requires a fair amount of deep pre-analysis, dictionary development and establishment of grammatical rules. Whether commercial software or something such as GATE and JADE, most such tools are already designed to work in parallel on a corpus of documents. Hadoop simply makes it (somewhat) easier to spawn multiple virtual CPU threads for the processing of such documents.

This limited application set, coupled with the complexity of setting such systems up in the first place, will make Big Data databases considerably less attractive as pilot projects fail to scale. Meanwhile, vendors in this space that are already working with high-speed, high-volume systems are tailoring their particular wares to provide more specialized, more highly scalable solutions than the one-size-fits-all approach that the Big Data paradigm is engendering.

On a related note, I've been to a fair number of conferences this year, and the thing I'm seeing at this stage among customers at the enterprise level is that most senior IT managers see data mining as a lower priority than data enrichment, design or utilization. If there's benefit to be gained from the effluent stream of their data processes, they may think about exploring it, but for the most part these people are more concerned about the primary, rather than downstream, use of the data being produced in their systems.

What I anticipate will happen this year is that you'll have a few desultory Big Data products that simply don't produce the excitement that marketers hope for, and the term will begin to fade from use. However, the various problems that the Big Data domain has typically tried to pull together will each become more distinct over time, splitting into separate regimes - high-volume data processing, distributed data management, heterogeneous document semantification and data visualization. Each of these will have a part to play in the coming year.

2. Data Firehose Feeds Emerge as a Fact of Life. Twitter feeds, Facebook activity feeds, notifications - call them what you will, processes that generate high-frequency data updates are becoming the norm. These include both social media updates, where you may have millions of contributors adding to the overall stream of messages at a rate of tens of thousands per second, and arrays of distributed sensors, each reporting back bits of information once a second or so. This sensor traffic is exploding - cell phones reporting positioning information, traffic sensors, networked medical devices, even mobile "drone" feeds - all of these are effectively aggregated through data ports, then processed by load-balanced servers before being archived for post-processing.

Such feeds perforce have three major aspects: data generation, which is now increasingly distributed; data ingestion, in which the data is entered into the system, given an identity within that system and provided with initial metadata such as receipt time, service processors and similar provenance information; and finally data expression, which routes that information to the appropriate set of parties. In many cases there is also a data archiving aspect, in which the information is stored in an interim or permanent data store for post-processing. This is especially true if the data is syndicated - a person requests some or all items that have arrived since a given point in time. Syndication, which is a pull-based publishing architecture, is increasingly seen as the most efficient mechanism for dealing with data firehoses, because it gives the server the ability to schedule or throttle the amount of data being provided depending upon its load.

I see such an approach applying not just to social media feeds (which are almost all now syndicated) but also to sensor feeds and data queries. One characteristic of such a syndicated approach is that if you have distributed data sources (discussed next), then when you make a query you are essentially sampling all of those data sources, which are in turn retrieving constrained sets of data - by date, by relevance, by key order or by some similar ordering metric. The aggregating server (which federates the search) caches the responses, providing a way to create ever more representative samples. What you lose in this process is totality - you cannot guarantee at any point that you have found all, or even necessarily the best-fit, data for your query - but what you gain is speed of feedback.
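As a hedged sketch of what such a sampling query might look like, SPARQL 1.1 federation lets an aggregating server ask each source for a constrained slice and merge the results; the endpoint URLs and the feed: vocabulary here are invented for illustration only:

# Sketch: sample the most recent items from two hypothetical distributed sources.
PREFIX feed: <http://example.org/feed#>

SELECT ?item ?received
WHERE {
  {
    SERVICE <http://sensors.example.org/sparql> {
      SELECT ?item ?received WHERE { ?item feed:receivedAt ?received }
      ORDER BY DESC(?received) LIMIT 100    # each source returns only a constrained slice
    }
  }
  UNION
  {
    SERVICE <http://social.example.org/sparql> {
      SELECT ?item ?received WHERE { ?item feed:receivedAt ?received }
      ORDER BY DESC(?received) LIMIT 100
    }
  }
}
ORDER BY DESC(?received)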

A major change that's happening in this space as well is the large scale adoption of NoSQL databases - XML databases such as MarkLogic, eXist, BaseX, Zorba, XDB 10 and others, CouchDB and MongoDB for JSON data stores in conjunction with node.js, graph databases, columnar databases, RDF triple stores and so forth. The leaders in this space are trying to bridge these, providing data access and query services that can work with XML, JSON, SQL, name/value pairs and RDF, as well as binding into Hadoop systems and otherwise becoming cross-compatible. Additionally, there's a growing number of developers and standards gurus trying to find commonality between formats, and I suspect that by the end of 2013 and into 2014 what you will see is the rise of omni-capable data services that allow you to query with XQuery, SPARQL or a JavaScript API, and then bind the output of such queries into the appropriate output format transparently. A lot of these will be former XML databases seeking to diversify out of the XML ghetto, since overall they probably have the most comprehensive toolsets for dealing with richly indexed non-RDBMS data.

3. Asset and Distributed Data Management Grows Up. One of the central problems that has begun to emerge as organizations become distributed and virtual is the task of figuring out where both the tangible assets (desks, computers, printers, people, buildings, etc.) are and where the intangible assets (intellectual property, data, documents) are at any given time.

One of the more significant events of 2012 was the World IPv6 Launch in June of this year, when IPv6 was permanently enabled across major networks. While the most immediate ramification of this is that it ensures that everyone will be able to have an IP address, a more subtle aspect is that everything will have its own IP address - in effect its own networked name - as well. The IPv6 address schema provides support for 3.4×10^38 devices, and consists of eight groups of hexadecimal digits, such as 2001:0db8:85a3:0042:1000:8a2e:0370:7334. To put this into perspective, if each person on Earth were assigned an IPv6 address, each person would in turn be able to assign 43,000,000,000,000,000,000,000,000,000 different IPv6 addresses to everything he or she owned.
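As a rough back-of-the-envelope check (assuming roughly seven billion people in 2012):

2^128 / (7×10^9) ≈ 3.4×10^38 / (7×10^9) ≈ 4.9×10^28

which is the same order of magnitude as the figure above.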

One interesting consequence of this is that you can also assign IP addresses to relationships. For instance, suppose that a computer has an IP address of 593A:... (I'll truncate to one hex group for conciseness). If you (42AC:..) were an employee (8821:..) of Company X (EAC1:..) and were assigned (8825:..) that computer, which is owned by (8532:..) Company X, then you could express all of these relations as sets of three IP addresses - "you are assigned this computer" becomes <42ac:> <8825:> <593a:>, and "this computer is owned by Company X" becomes <593a:> <8532:> <eac1:>. These are semantic assertions, and with the set of all such assertions you can manage physical assets. Bind the IP address to a physical transmitter (such as an active RFID chip) and you can ask questions such as "Show me where all assets assigned to you are located." or even "Show me the locations of all desks of people who report to me, and who is currently seated at them."
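As a concrete (and purely illustrative) sketch: assuming those assertions were loaded into a triple store under a hypothetical urn:ipv6: URI scheme, and assuming a made-up asset:locatedAt property that an RFID reader keeps up to date, the question "show me where all assets assigned to me are located" becomes an ordinary SPARQL query:

# Sketch only - urn:ipv6: and the asset: vocabulary are assumptions, not standards;
# in the scheme described above even the predicates could be IPv6-derived URIs.
PREFIX asset: <urn:example:asset#>

SELECT ?asset ?location
WHERE {
  <urn:ipv6:2001:db8::42ac> asset:assignedTo ?asset .   # "you are assigned this asset"
  ?asset asset:locatedAt ?location .                    # last position reported via RFID
}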

The same thing applies in the case of virtual assets such as animation characters or marketing materials. Every document, every image, every video (and even slices of that video) can be given unique addresses (indeed, you could, with segmenting, give each ASSERTION its own IP address). This is one case where I see Semantic Web technologies playing a huge role. If every asset (or resource) could also report on its own properties in real time (or update a database in near real time), then you can not only have a real time inventory, but you can set triggers or conditions that would run when certain assertions are made. I expect to see such systems start to be implemented in 2013, though I think it will likely not be until mid-decade before they are commonplace.

4. Documents Become Meaningful. Speaking of the Semantic Web, semantics should continue to play a bigger part in a more traditional domain - intelligent documents. From both personal and anecdotal experience, one of the growth areas right now for IT services is document analytics. In basic terms, this involves taking a document - anything from a book to a legal contract to someone's blog - manually or automatically tagging that document to pick out relevant content in a specific context, then using those tags to identify relationships with other documents in a given data system.

One effect of this is that it makes it possible to identify content within a collection (or corpus, as the document enrichment people usually say) that is relevant to a given topic without necessarily having keywords that match this content. Such relevancy obviously has applicability with books and articles, but it can apply to most media, to knowledge management systems, to legal documents and elsewhere. For instance, by using semantic relationships and document enrichment on Senate testimony, I and a group of other developers from Avalon Consulting were able to determine what the topics of that testimony were, who spoke for or against the subject of the hearings and what roles they played, and could consequently link these to particular bills or resolutions.

Such tagging in turn makes navigation between related documents possible. This is one of the unexpected side effects of semantic systems. A relationship provides a link between resources. Any link can be made into a hyperlink, especially when resources can be mapped to unique addresses. Additionally, if multiple links exist (either inbound or outbound) for a given resource, then it should be possible to retrieve the list of all possible items (and their link addresses) that satisfy this relationship. This is again a form of data feed. Human beings can manually navigate across such links via browsers, while machines can spider across the link space. The result tends to be a lot like wikis (indeed, semantic wikis are an area to pay very close attention to, as they embody this principle quite well).
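As an illustrative sketch of that navigation (dcterms:subject is the standard Dublin Core property; the document URI is made up), finding every document that shares an enrichment tag with a given document is a one-pattern query, and each binding in the result is effectively a hyperlink:

# Sketch: list other documents that share a subject tag with a given document.
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?related ?topic
WHERE {
  <http://example.org/doc/hearing-42> dcterms:subject ?topic .   # tags added during enrichment
  ?related dcterms:subject ?topic .
  FILTER (?related != <http://example.org/doc/hearing-42>)
}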

5. Data Visualization. Data visualization is an intriguing domain. All data has distinct structures, although the extent of such structure can vary significantly. Most data, especially in the era of "Big Data", also exists principally to satisfy a specific application - Twitter messages and Facebook Activities each exist primarily to facilitate those specific web applications, and only secondarily do they provide additional information - either through analysis of positional or temporal data or through the existence of key words and phrases within user generated content.

With the rise of graphical tools - Canvas, SVG, and Web3D - in browsers, I think we're entering a new age of data visualization. Before this, most data visualization tools produced static imagery (the map space being a notable exception), but interactive visualization is now becoming increasingly feasible. What happens when you can create graphics that give you dynamic drill-down capabilities, that let you visualize data sets in real time in three dimensions, and that let you bundle clusters of data with hypermedia links and windows? In the past couple of years I've seen glimpses of this - Web3D in particular is an intriguing area that I think is only just beginning to make its way into common usage, and SVG has reached a point of stability across browsers that makes it attractive for web developers to start using it for everyday web applications.

6. The Android Mobile Operating System Becomes Dominant. Something happened toward the end of this year that I found rather remarkable. Microsoft released a new version of their Windows operating system, Windows 8, for both the PC and a rather expensive "tablet" system. The general reaction was "meh" - there simply was not a lot of interest in it, bad or good. Sales of Windows 8 are far below expectations, as are sales of the Surface - the aforementioned tablet PC. Why? Because right now, after a decade of Windows vs. Mac with a little side helping of Linux, the "real" OS wars are between Apple's iPhone/iPad operating system and Google's Android, which is ironically based on Linux. Desktop PCs anymore are fairly tiny boxes that are becoming specialized as media servers, while even laptop sales are eroding in the face of tablets.

In many respects this was a completely out-of-the-blue win for Linux. Booksellers have become players in the tablet game - both Amazon's Kindle Fire and Barnes & Noble's Nook are filling an interesting ecological niche - and I suspect that other media interests are watching sales of these "branded" tablets this Christmas. It's not hard to envision a Disney or Warner Brothers tablet. Meanwhile, companies such as Comcast are building security apps for handheld tablets and bundling them with home security systems.

Steve Jobs' death in late 2011 may have dealt a mortal blow ultimately to Apple's dominant position in that sector. Without Jobs' reality distortion field, and after some embarrassing gaffes with Apple Maps that marred the iPhone 5 release, Apple appears weakened going into 2013, and may end up still being a major, but not the dominant, player in that space by this time next year.

I expect these trends to continue into next year - Android-based systems will become ubiquitous (and will find their way into most home appliances, home electronics and automobiles), possibly leading to a confrontation between Google and vendors over control of the system toward the end of the year or early into 2014.

7. Google (and Other) Glasses Become Stylin'. The computer has been losing pieces of itself for a while. The mouse is going extinct - first replaced (or augmented) by touch pads, then incorporated into the screen itself in tablets and hand-helds. Ditto the speakers, which now exist primarily as tiny earbuds. The keyboard lasted a while longer, and even now is somewhat useful for working with pads, but virtual (and increasingly haptic) keyboards on tablets are becoming the norm. The computer box is now part of the display, and network, USB and HDMI connections are increasingly wireless.

Yet you still have to hold them. This is where the next major revolution in computing will come from. Put those same screens in sufficiently high resolution into glasses (or possibly even shine the image directly onto the retina) and you no longer need to hold the computer. Computer glasses have existed in a number of different forms for a couple of years, but they were typically highly specialized and had some significant limitations - poor resolution, performance issues, bandwidth problems.

I believe this is the year that such glasses will become consumer products, and they will be built using tablet-based technology. These will give true stereoscopic vision, because you can send slightly different perspectives to each lens in order to create parallax interpretations. You can create overlays on what's in front of you by changing the percentage of incoming light (from front facing cameras over each lens) vs. the graphical overlays. Coupled with ear and throat pieces, you have everything you need for both input and output, possibly either in conjunction with a glove or a bracelet that would measure electromagnetic signals from nerves in the forearm to determine the positions of various fingers.

It'll take a while - 3-5 years for this to fully take hold - but I'm expecting that toward the end of this year these will become available to the general public, and they will have a profound impact.


8. Whither TV? The media revolution continues as well, and one casualty of that is consensus television. There's a growing number of "asynchronous" users - people who may end up watching shows on their computers or media centers long after they are broadcast, often watching a season of shows over the course of an evening or weekend. This in turn is having the effect of changing the nature of these broadcasts - reflecting a rise of broader narratives rather than the stand-alone episodes that were so common until comparatively recently.

These shows are also increasingly being encoded with metadata - not just closed caption overlays but significant blocks of information about particular scenes, actors, or the production that are in turn able to tie into the web or mobile spaces. The media industry is coalescing around several standards, such as EIDR and  Ultraviolet, that also allow for data interchange, making it easier for both traditional and digital media distributors to access and manage the process flow for media content generated by these organizations, including Walt Disney, Warner Bros., Comcast and a number of others.

Such standardization will have a profound effect upon both media broadcast and interactive "game" applications. Indeed, the mobile app and gaming space, which is ripe for a major consolidation, will likely be transformed dramatically as the distinction between "game" and media production continues to disappear. In the past, media-oriented games were typically made out of band from the media production itself, usually by a third party, in order both to take advantage of completed footage and to ensure there was sufficient audience interest to justify the cost of gamification. With EIDR and Ultraviolet, on the other hand, game tie-ins are generally created in conjunction with the production of the media itself, and the ability to tag and identify specific resources uniquely makes it possible to build a strong framework and then import media and metadata into that framework to support the games.

It means that production companies may very well alternate between producing purely theatrical release media, television media, gaming media or some combination of the three, using the standards as a mechanism to most effectively multipurpose individual content. This also makes rights management, such a critical part of content production, far simpler to track.

One final note while I'm on the topic: Kickstarter (and other services of that kind) will likely face some hard times next year as the number of underfunded projects continues to pile up. Longer term, however, it will likely be an increasingly effective model for funding "narrowcast" projects and pilots that can be tested with a specialized audience. My guess is that the twenty-first century studios (and this isn't an allusion to Fox here) are already watching Kickstarter closely, more to get a feel for what seems to be appealing to various markets than for any specific talent, though individual production houses may very well see Kickstarter as an incubator from which to grab potential actors, directors, cinematographers and effects people.

9. ePublishing Hits Its Stride. It is hard to believe, looking at recent downloads and sales figures, that publishing was on the ropes as little as three years ago, though certainly distribution channels have changed. Barnes and Noble has become the last major physical bookseller standing, but this is only because its competitors are now e-retailers ... Apple (via iTunes), Amazon and Google. eBook sales exceeded hardback sales early last year and are on pace to exceed softcover sales sometime in 2013, and it can be argued that B&N is rapidly becoming an e-retailer that still maintains a physical presence rather than vice versa.

This trend will likely only solidify in 2013. This year saw a number of authors who jumped into prominence from the eBook ghetto, as well as an explosion of eBooks as everyone and their brother decided to get rich quick with their own titles. As with most creative endeavors that have undergone a sea change in the presence of the Internet, an improving economy and the work involved in putting together 100,000+ words for a novel (for relatively mediocre returns) will likely winnow a lot of people out of the field in the next few years, but in some respects we're in the gold rush phase of the eBook revolution.

There are three standards that will likely have a profound impact on eBook publication moving forward: the ePUB 3 specification, which establishes a common format for incorporating not only text layout but also vector graphics and media; the rise of HTML5 as a potential baseline for organizing book content, which informs the ePUB 3 specification fairly dramatically; and the increasing prevalence of Scalable Vector Graphics (SVG). Moreover, XML-based workflows for the production of both print and eBooks have matured considerably over the last few years, to the extent that the book publishing industry has a clear path from content generation to automated book distribution. This benefits not only the eBook industry but also the print-on-demand industry - the per-unit cost to produce one book or ten thousand is the same (though the fees to do so limit that somewhat).

I'm taking that jump in 2013, both with the SVG Graphics book I'm nearly finished with now and a couple of novels that I've had in the works for a year or so. While the challenge of writing a fiction book is one reason I'm doing this, another is to give myself the opportunity to see the benefits and pitfalls of POD and eBook publishing first hand.

One other key standard to watch in 2013 is PRISM - the Publishing Requirements for Industry Standard Metadata. This standard will affect the way that magazines are produced by establishing both a working ontology and tools for using XML and HTML5 to handle complex layout of pages for magazines in a consistent manner, along with extensions for handling DRM and metadata output. As with media metadata standards, PRISM makes it possible for magazine publishers to take advantage of XML publishing workflows in a consistent manner, extending content and template reuse to short form magazine and news print content.

Magazines and newspapers have differed from books in production for years, primarily because most newspapers require far more sophisticated layouts than can be handled via automated methods. PRISM is an attempt to rectify that, primarily by making it possible to more effectively bind articles to templates, handle overflow content, and deal effectively with images at differing resolutions. Because this is a specification supported by a large number of print media publishers, PRISM may also make it easier to manage subordinate rights, reprints and syndicated content.

10. IT Employment Trends. At the end of 2011, the overall trend for IT was positive, but it was still centered primarily in the government sector and consultant hiring rather than full time employment. The recent political games in Washington will likely cause some confusion in tax codes and structures that will make full time hires more difficult in 2013Q1 (as it has in 2012Q4) but, outside of certain sectors, it's likely that general IT hiring for permanent and long term contracting positions in the private sector will actually increase markedly by 2013Q2.

Part of the reason for this is that many of the contractor positions from 2012 were senior level - architects, requirements analysts and the like, who tend to be employed most heavily at the beginning of projects, along with a small staff of contract developers for pilot programs and groundwork coding. These projects in turn lead to more significant programmer hiring, generally for long term projects where it makes sense to hire developers for windows of one year or more (the point at which the financial and stability benefits of full-time hiring exceed the greater flexibility but higher costs of hiring consultants).

Having said that, there are some dark clouds. The mobile app market is saturated - apps are easy to create, the market is crowded, and margins in general have dropped pretty dramatically. I expect to see a lot of consolidation - app production companies going out of business or getting bought up by more established firms,  as well as competition from larger non-tech companies that nonetheless see advantages in establishing a presence. Short term, this is going to produce a lot of churn in the marketplace, though it's likely that reasonably talented programmers and content producers will continue to work if they're willing to remain adroit.

It's worth noting that this has been taking place in a lot of new tech fields over the last year, so for it to happen to mobile systems is not that surprising. Poor business execution, failure to read the market properly, impatient (or rapacious) investors or poor development practices are not uncommon at this stage of growth. On the flip side, as areas such as advanced materials engineering, bioinformatics, energy systems (both traditional and alternative), commercial aerospace ventures, transportation telematics, interactive media publishing and robotics realign, consolidate and become more competitive, this will also drive growth for related IT support work.

Federal initiatives that are already underway (such as ObamaCare) will also mature, though I expect the pace of hiring in these areas will slow down as existing projects reach completion and the consequences of the last two years of relative Congressional inactivity come home to roost. Cuts to existing programs in both the Defense Dept. (due both to the "fiscal cliff" and the drawdown of troops in Afghanistan) and in research and social programs will also likely reduce IT work in the Maryland/Virginia area, which up until now has recovered more quickly than the rest of the country from the economic collapse of 2008-9.

On the other hand, this too shall pass. While it hasn't been perfect (and there are still bubbles in certain areas, especially education), there are signs that the economy is beginning to recover significantly and shift into a new mode of development, one with a stronger emphasis on the secondary effects of the combined information and materials sciences innovations of the last couple of decades. Because these are second order jobs from the information revolution, it's worth noting that core languages such as JavaScript (and CoffeeScript), Hadoop, declarative languages such as Haskell and OCaml, data processing languages such as XQuery, NoSQL/node.js, SPARQL 1.1, and the like are likely to see continued growth, while more traditional languages - Java and C++, for instance - will be seen primarily in augmentative roles. Additionally, I've seen the rise of domain specific languages such as R, Mathematica and SciLab, as well as cloud services APIs (which are becoming increasingly RESTful), factoring into development.

The next wave of "developers" are as likely as not coming from outside the world of pure IT - geneticists, engineers, archivists, business analysts (and indeed analysts in any number of areas), researchers in physics and materials science, roboticists, city planners, environmental designers, architects and so forth, as well as creative professionals who have already adopted technology tools in the last decade. This will be an underlying trend through much of the 'teens - the software being developed and used will increasingly be domain specific, with the ability to customize existing software with extensions and toolkits coming not from dedicated "software companies" but from domain specific companies with domain experts often driving and even developing the customizations to the software as part of their duties.

Summary
My general take for 2013 is that we're not likely to see radical new technologies introduced (though I think computer glasses have some interesting potential); rather, this will be a year of consolidation, increased adoption of key industry standards and incremental development in a large number of areas that have been at the prototyping stage up to now.

Friday, December 7, 2012

Why Data Architecture Trumps Application Framework Agnosticism

A Bus Stop - a good place for thinking about how well the
Enterprise Service Bus model really works.
I recently had a discussion at work about certain clients that were very much tied into a seductive concept: they felt that the way to build a large scale application is to define a system in a modular fashion such that if one particular component didn't work out, they could just replace it with an analogous component from a different vendor. To do this, they would only use the absolute minimal capabilities of each component - in effect relying upon coding standards being relatively uniform from one component to the next.

This is an attractive proposition, since it reduces reliance upon any one vendor ... at least in theory. However, the real world experience for many organizations building large scale systems has generally been the opposite - rather than building a system that is easily reducible, they end up with systems that are highly dependent upon the general contractor for documentation, support and maintenance, perhaps more so than if they had used more specialized tools for their needs. Yes, any component may be "swapped out", but only by the contractor.

Additionally, this practice means that when you do construct such a system, you end up with a vanilla system that takes advantage of none of the optimizations that could make the system work efficiently, which can be a real problem when you start talking about Big Data - large volumes, high throughput, intensive processing. This becomes especially true when you start talking about distributed systems as well, since such distribution significantly increases the latencies inherent in such systems.

Instead, organizations should in general concentrate first on the biggest question - in a decade, when the technology has changed enough that it makes sense to take advantage of new innovations in delivering information to new platforms, what ultimately are the most important things to preserve? The answer is generally not code. Code is an expense, no doubt, but in buying code you should think of it as buying a new car. The moment the car leaves the lot, it loses 20% of its value. You'll end up buying a new tire or two, maybe a muffler. As the car gets older, you'll sink more money into repairing the radiator, replacing the solenoid or battery, maybe throwing in a new stereo system because that state of the art eight-track player you bought in the 70s just isn't cutting it anymore.

Ultimately you realize that the car is costing too much to maintain, or no longer meets your needs, or is just plain ugly. You get a new car. You won't, however, carry over that new muffler or radiator, you won't take the tires with you (even if you could), and chances are better than average that the new car's stereo is better than the one you bought several years before. Aside from a few knick-knacks, you won't in fact carry any of that car forward, save as something you can trade in to the dealer.

Code works the same way, almost exactly. It's a sunk cost. You've paid it to achieve a certain objective on a certain platform at a certain time, and neither those objectives nor those platforms stay the same over a window as long as a decade ... nor would you expect them to. Maybe, if you're lucky, you might be able to salvage the API, but even there, chances are pretty good that a new system might be a good time to roll out a new API on top of everything else.

What makes this even more problematic is that when you build a complex custom system, the quality of the code base is only as good as the most junior programmer on the team. Since at any given time you will only discover the quality of that junior programmer's work when their part fails, the more hardware intensive and API intensive your systems are, the more likely it is that your code sucks.

On the other hand, data integrity and data architecture is very important - perhaps supremely important. The purpose of an application, at the most abstract level, is to change the state of a data store. There are, admittedly, usually side effects - killing verdant domestic boars via various avians with anger management issues, for instance - but at any given point in the application you are not only changing state, but usually creating a record of previous state changes - an archive or audit trail.

This audit trail will usually consist of both the core objects that the system operates on - the "catalog cards" of a library system, and the "batch records" used to facilitate the batch transfer and management of the objects being managed by the cards, as an example - as well as the messages that mediate the management of these objects - batch transfer requests and responses, object upgrades, and so forth.

When the technology becomes obsolete (it doesn't run on the newest machines or runs too slowly compared to more contemporary technology or it no longer meets the business needs of the organization), what then becomes most important is the migration of the data content in a consistent, standardized format to a new system. This in turn means that you have a clean internal information architecture.

Now, it is certainly possible that the data architecture will have evolved over time - data architecture by its very nature reflects the business environment in which it is embedded, and that environment also changes over time. However, in most cases, such evolutions are generally additive in nature - new types of information get stored, information differentiates in the face of more refined tracking, and very occasionally a property becomes obsolete and is no longer stored.

Critically, again, what should be persisted should be, most importantly, the primary objects in their various incarnations. Usually, at a certain point in the process, message information gets condensed, digested and reorganized. This means that when designing a good data architecture, some thought should be given to that condensation process as well.

So what happens when it's time to buy that replacement car - er, application? The quiet revolution that has taken place over the last decade has been the rise of open data standards - XML, RDF and JSON. In its simplest incarnation, this means that your data can be transferred from one system to another without the need for extensive ETL tool processes. This doesn't mean that your data models themselves won't change - assume that change is inevitable - but it does mean that if you had concentrated on a RESTful architecture that captured the principal objects within the system, then the transfer from one data store to another can be accomplished via a relatively simple transformation. If you do work with REST, it also means that many of the precepts of REST - including the notion of the globally unique identifiability of resources, up to and including "records" in a data system - transfer over from one system to the next ... regardless of the systems involved.

Now, if your application is Angry Birds, this probably doesn't matter so much (though I'd argue that as applications move increasingly to the cloud and networked multiplayer mode, it matters a whole heckuva lot more than it did when these were stand-alone applications). On the other hand, if your application is intended to run for the long term, involves large amounts of distributed data that needs to be consumed and produced by multiple participants, then the data architecture is far more important than the specific implementation details - quite possibly because the code involved will almost certainly be different from end-point to end-point, even if the intent of that code remains the same.

This is part of the reason why I think that the next generation of object databases - XML databases, JSON databases, Sparql 1.1 triple stores, etc. - should be seen less as databases than as integration platforms that are particularly efficacious at "wrangling" open standards formats, which happen to have lots of memory for persisting structures. Increasingly these have embedded HTTP clients, extensible function libraries, optimized search and data retrieval capabilities, reverse-query based decision-triggers, and other features that replicate in code what traditional ESBs usually delegate to different products. These scale well, in general, they are adept at transforming data streams in-process (significantly reducing overall serialization/parsing costs), and typically they work quite effectively in virtual environments, especially when the need to spawn additional operational nodes emerges.

Moreover, consolidating these into a single system makes a great deal of sense when the messaging formats involved are themselves in the "big three" formats of XML, RDF, or JSON. Rather than imperative logic, I can use a conditional "query" to test whether a given data or metadata document fulfills given conditions, can spawn internal processes on that document (or more accurately on all documents that satisfy that query), and can then change the states of these documents or create new documents while archiving the old ones if these documents are the ones that result from the process. However, the significant thing is that these documents do not "move" in the traditional sense unless there is a specific transference process that is required out of band (i.e., when an external service gets invoked). In effect, these NoSQL data "integrators" are far more efficient at data/document processing because the documents generally do not move - document metadescriptors get changed instead.
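A minimal sketch of such a conditional-query trigger, using an invented wf: workflow vocabulary (the point here is the shape of the operation, not any particular store's trigger API): the WHERE clause is the condition, and the matching documents are re-stated in place rather than moved.

# Sketch only: the wf: vocabulary is an assumption.
PREFIX wf: <http://example.org/workflow#>

# Flag every ingested invoice over 10,000 for review without moving the documents -
# only their metadescriptors change.
DELETE { ?doc wf:status "ingested" }
INSERT { ?doc wf:status "needs-review" }
WHERE  {
  ?doc wf:type   "invoice" ;
       wf:status "ingested" ;
       wf:total  ?total .
  FILTER (?total > 10000)
}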

So, to wrap things up: by focusing on data architecture first, with an emphasis on identifying and defining the primary entities in the system rather than focusing on messaging systems and formats, you can design systems that preserve the data of the system for the long term at the expense of code which would get changed anyway within a decade. By focusing on a RESTful, resource-centric NoSQL data-store architecture (which is generally far easier to develop with model-driven approaches), you can also dramatically reduce the number of external service components necessary to accomplish the same tasks.

Given that, at least to me, it seems that trying to take advantage of code reuse is a false economy once you get to the level of enterprise data systems. Focus on the development of a solid information architecture first, identifying the primary entities that will be used by the various participants in the network and a mechanism for universal identification of those resources, identify where those entities will be persisted and accessed, then build out from these stores. This differs from the conventional wisdom of trying to minimize the work that the databases do, but in general that has more to do with the fact that these "databases" are in effect more like platforms today than they were even a few years ago.

Kurt Cagle is an information architect for Avalon Consulting, LLC., and has helped develop information systems for numerous US federal agencies and Fortune 500 clients. He has also authored or contributed to seventeen books on information architecture, XML and the web, and is preparing his 18th, on SVG, for publication now. 

Thursday, December 6, 2012

RDF and its role in Logical/Canonical Modeling

Peter O'Kelly, a friend and former colleague of mine and the Principal Analyst at O'Kelly Associates, wrote a comment with regard to my post on using RDF for data modeling, asking why I felt that RDF was a better tool than UML for modeling, especially given that tool's primacy in most enterprise spaces. I tried responding in a comment, but after writing it, I realized that, first, it was actually a good post in itself, and second, I couldn't fit it into the 4096 character limit for comments. Thus this new post ...


I don't think that conceptual/logical data modeling via UML is ineffective - I think in many respects it's a very necessary part of the modeling process. However, what I've generally found with CDM/LDM work is that by dealing at the level of class abstraction rather than individual instances for the very initial prototyping, it's not always obvious what the functional classes are. It's also hard to tell whether what you are dealing with are associations or aggregations ... or what the specific relationships are between different entities. I find that taking an initial RDF approach helps in identifying those entities by "trying them out" with real world data one element at a time, rather than trying to holistically map out an overarching canonical model.

The RDF triple approach can come in handy as well in identifying what are enumerants rather than lists of named entities - or whether it makes sense to use one vs. the other (as there is almost invariably a very fuzzy line between the two). This is a common problem that XSD encounters as well as UML, because both languages tend to see enumerations as a simple list, whereas the RDF approach typically sees enumerations as being entities in a different namespace that may have multiple expressions.

For instance, countries or states/provinces are often treated as enumerations when in fact they should be seen as lists of objects with multiple properties - full name, abbrev2, abbrev3, etc. - as well as internal relationships (a state can be both an object of type GeoState and a state of a given country). Colors are usually expressions of a given object property - silver/gray means very different things to a car dealer and a sweater manufacturer, for instance, and each has a very specific meaning. UML certainly can express these concepts as well, but because UML is focused primarily upon class relationships rather than ontological ones, the ontological relationships and divisions of namespaces are all too often far from obvious when attempting to model them graphically in UML, especially when dealing with enterprise level ontologies.
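A quick sketch in the shorthand RDF notation from the data modeling post referenced above (the geo: names here are invented for illustration) shows how a state stops being an enumerated string and becomes an entity with multiple expressions and relationships:

<geo:State_WA>
     <geo:fullName> "Washington";
     <geo:abbrev2> "WA";
     <entity:instanceOf> <class:GeoState>;
     <geo:stateOf> <geo:Country_US>.
<geo:Country_US>
     <geo:fullName> "United States";
     <geo:abbrev2> "US";
     <geo:abbrev3> "USA".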

Now, having mapped out a preliminary sample ontology, going from this to a UML is probably a good next step, as well as generating relational diagrams using Raptor or similar visualization tools (which I hope to cover in a later post). It's at that stage that the socialization of the CDM takes place, and UML is generally better at communicating this information than either instance or schema relational graphs showing RDF in my experience, especially to semi-technical business types. In this respect the initial RDF work can be seen as a test-bed, especially since it's possible using SPARQL update to change both explicit property types and more subtle relationships. It's also relatively easy to generate RDF schema or OWL2 from instance data as long as you have a few key things like sameAs relationships and preliminary rdf:type associations. XSD is not as amenable to such changes.
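For example, a rough sketch of the rdf:type side of that generation - lifting class declarations out of instance data with a single update rule (a starting point only, not full OWL2 generation):

# Sketch: declare every resource that appears as the object of rdf:type to be a class.
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

INSERT { ?class rdf:type rdfs:Class }
WHERE  { ?instance rdf:type ?class }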

There is a UML profile for RDF which makes interchange between the two possible, if not always easy. I'm still experimenting with this, mind you, so I can't speak with authority here.

I think your last point is worth addressing at a higher level as well. I've been involved now in a couple dozen large scale data modeling efforts, and am about to undertake a couple more shortly. Data modeling is still very much an art, primarily because the modeling domains themselves are becoming increasingly complex as the tools give us the ability to work with those domains, and so what works as best practice for systems with a couple dozen primary classes doesn't necessarily scale well when you start talking about thousands of such classes. Most data models are putatively designed by committee, but they typically involve an architect putting up a straw man for that committee to critique, rather than having that committee actively designing such artifacts on the fly.

Perhaps the closest analogue I can think of here would be the peer review committee for a doctoral thesis - it is the job of the architect to defend his or her design, and to do that the confidence level in the ability of the model to meet the project's needs must be high ... which means testing the data model beforehand. XML tends to be a fairly poor medium for testing, because at its core it only has a container/contained relationship, and because linking with XML is awkward and ill-supported even in XML databases (ditto JSON). Building imperative structures for managing such linking defeats the whole purpose of a data model, which is intrinsically declarative - building the application to test the data model works, but it is a time consuming process and fragile to changes. Building RDF prototypes of the data model, on the other hand, allows for incremental changes in the ontology at relatively minor cost in comparison.

So, I'm not dissing UML tools - I've used MagicDraw and Rational Rose, and they are both a key part of an ontologist's toolkit. Rather, I'm just arguing that RDF provides a good mechanism for building a proof of concept ontology quickly that can then feed into the UML if necessary for communication.

Tuesday, December 4, 2012

Data Modeling via RDF

I'm frequently called upon to design schemas and data models for organizations and companies. In my experience, this exercise usually starts with someone saying "We need a logical data model" and someone else (usually a relational database designer) grabbing a modeling tool such as ERwin or IBM's Rational Rose in order to build UML pictures. It's also been my experience that such data models, once created, are large, cumbersome, and often completely inappropriate to the problem domain ... especially when the end product is a structured entity such as XML or JSON.

Of late, I've begun playing with an alternative approach that actually has proven to be quite effective, but it's one that I see very few people doing as yet. The principle is simple - in your data model, start with a particular object - it doesn't really matter which object, though usually there are a few that are more central than others - and put together live examples of what you're trying to emulate. This will not only help you to identify the properties of a given object, but will also tend to expose links to other objects that are part of the problem domain. By using RDF, you can concentrate on such links, and by working with a shorthand notation with RDF, you can do so without all of the complexifying that namespaces normally introduce.

For instance, suppose that you want to model a card catalog for a library system. The most obvious starting class for such a system is a book. For convenience's sake I'm going to create a namespace with the rather odd notation xmlns:book="book:". This means that I can create an RDF subject or object of the form <book:Book1>, which has the URI "book:Book1". Eventually we'll resolve the namespaces to something more typical of a URI, but for now this approach lets you concentrate on concept rather than syntax or notation.

Once we have the book, we can start thinking of properties, and can express these as RDF assertions. For instance,

<book:Book1> <book:isbn> "195292514331".
<book:Book1> <book:title> "The SPARQL Book".
<book:Book1> <book:edition> "1".
<book:Book1> <book:printing> "1".
<book:Book1> <book:publishingYear> 2012.
<book:Book1> <book:description> "This is a book about SPARQL".

Now there's a certain degree of redundancy in such lists. One way to reduce this is to employ the semi-colon notation to indicate that several assertions all have the same subject:

<book:Book1>
    <book:isbn> "195292514331";
    <book:title> "The SPARQL Book";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2012;
    <book:description> "This is a book about SPARQL".

Any assertion ending with a semi-colon indicates that the next item in the sequence will just contain the predicate and object, and should use the same subject as the current item uses. Commas can be used for a similar purpose - they indicate that a given subject-predicate pair may be used for more than one object. 
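For example, if the book also had a (hypothetical) book:keyword property, the comma lets a single subject-predicate pair carry several objects:

<book:Book1>
    <book:keyword> "SPARQL", "RDF", "semantics".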

Now, one thing that should be evident in our description of a book is that we're missing a few critical pieces, such as authors. Here's where modeling begins to get exciting. I could use a string here indicating the author's name, but there are two questions that immediately come up - can a given book have more than one author, and can a given author have written more than one book? Chances are high (unless you have a very tiny library) that the answer will be yes in both cases. This suggests that the author in turn may be another type of object in the system, which we'll call xmlns:author="author:".

<book:Book1> <book:author> <author:Author1>.

Note that at this stage we know absolutely nothing about the author beyond the fact that he or she exists, but this is not an insignificant bit of knowledge. At this stage it may be worth filling in some of the blanks:

<author:Author1> 
     <author:displayName> "Jane Doe";
     <author:givenName> "Jane";
     <author:middleNames> "Elizabeth";
     <author:surName> "Doe";
     <author:searchName> "Doe, Jane Elizabeth";
     <author:bio> "An author and librarian who likes cats.".

Suppose that the book had two co-authors. Adding a second entry becomes simple enough:

<author:Author2> 
     <author:displayName> "John Dee";
     <author:givenName> "John";
     <author:middleNames> "Michael";
     <author:surName> "Dee";
     <author:searchName> "Dee, John Michael";
     <author:bio> "A writer of many talents and pretensions.".
<book:Book1> <book:author> <author:Author2>.

This is about the time that XML developers start getting nervous, because authors in this case may also author multiple books.  For instance, Jane might also have authored a second book "Sparql and OWL, a Guide":

<book:Book2>
    <book:isbn> "1952929129568";
    <book:title> "Sparql and OWL, a Guide";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2010;
    <book:description> "A book about SPARQL and data modeling".
<book:Book2> <book:author> <author:Author1>.

To make the relationships even more representative, let's assume that John Dee also wrote another book:

<book:Book3>
    <book:isbn> "19123977952320";
    <book:title> "Semantics";
    <book:edition> "1";
    <book:printing> "2";
    <book:publishingYear> 2011;
    <book:description> "A book on Semantics".
<book:Book3> <book:author> <author:Author2>.

There is an implicit relationship here - that an author may have multiple books. Here's where the new SPARQL update spec comes in handy. If you're using Jena or some similar triple store that supports SPARQL 1.1 update, then you can make this relationship explicit as well:

insert {?author <author:book> ?book} where
{?book <book:author> ?author.}

When the update is run, this will generate the four triples:

<author:Author1> <author:book> <book:Book1>.
<author:Author1> <author:book> <book:Book2>.
<author:Author2> <author:book> <book:Book1>.
<author:Author2> <author:book> <book:Book3>.

Some systems will also let you use the SPARQL 1.0 CONSTRUCT statement to do the same thing, though this usually generates RDF as an output without adding it to the database:

construct {?author <author:book> ?book} where
{?book <book:author> ?author.}

So, what about publishers? Again, this is a case where you have a text label that could, nonetheless, also be an object. The big question in the model is a utilitarian one - are you likely to want to see books or authors grouped by publisher? If you are, then this becomes a separate class:

<book:Book1> <book:publisher> <publisher:Publisher1>.
<book:Book2> <book:publisher> <publisher:Publisher2>.
<book:Book3> <book:publisher> <publisher:Publisher1>.
<publisher:Publisher1> 
     <publisher:name> "Oriole Publishing";
     <publisher:description> "Large publisher of technical books for the programming market.".
<publisher:Publisher2> 
     <publisher:name> "Avante Books";
     <publisher:description> "A major publisher of scientific and research books.".

Thus, Book1 and Book3 are published by Oriole, while Book2 is published by Avante. Does it make sense to link publishers and authors? Generally, probably not, though you can retrieve this data indirectly with a SPARQL query:

select distinct ?pubName ?authName where
{
?book <book:publisher> ?publisher.
?book <book:author> ?author.
?publisher <publisher:name> ?pubName.
?author <author:displayName> ?authName.
} order by ?pubName

This will retrieve a listing of authors by publisher:

pubName              authName
Avante Books         Jane Doe
Oriole Publishing    Jane Doe
Oriole Publishing    John Dee

Thus, one of the big trade-offs in data modeling is determining whether a given relationship needs to be made explicit (which adds redundancy) or can remain implicit (which adds query complexity). As with the author/book relationship above, if you decide that getting authors by publisher is important, you can always add those relationships via an update rule:

insert {?publisher <publisher:author> ?author.} where
{
?book <book:publisher> ?publisher.
?book <book:author> ?author.
}

One final addition here is to make explicit the class names involved. In this case, I normally use an internal "entity:" namespace to capture schematic properties, which I'll later map to rdf:, rdfs:, or owl: equivalents when I convert namespaces. The <entity:instanceOf> property indicates that a given subject is an instance of a particular class, and is functionally equivalent to the <rdf:type> predicate, while <entity:subClassOf> is the equivalent of <rdfs:subClassOf>:

<book:Book1> <entity:instanceOf> <class:Book>.
<book:Book2> <entity:instanceOf> <class:Book>.
<book:Book3> <entity:instanceOf> <class:Book>.
<author:Author1> <entity:instanceOf> <class:Author>.
<author:Author2> <entity:instanceOf> <class:Author>.
<publisher:Publisher1> <entity:instanceOf> <class:Publisher>.
<publisher:Publisher2> <entity:instanceOf> <class:Publisher>.
<class:Book> <entity:subClassOf> <class:Entity>.
<class:Author> <entity:subClassOf> <class:Entity>.
<class:Publisher> <entity:subClassOf> <class:Entity>.

The designation of types makes it possible to ask for a list of all books, authors or publishers in the system by type, while the <entity:subClassOf> entries make it possible to enumerate the classes available within the ontology itself. I also usually like to create object-agnostic labeling in my models, using the <entity:label> and <entity:description> properties. These can be populated using SPARQL update:

insert {?resource <entity:label> ?label}
where {
{?resource <book:title> ?label.}
UNION
{?resource <author:displayName> ?label.}
UNION
{?resource <publisher:name> ?label.}
};

insert {?resource <entity:description> ?description}
where {
{?resource <book:description> ?description.}
UNION
{?resource <author:bio> ?description.}
UNION
{?resource <publisher:description> ?description.}
}


When run, this will ensure that every object in the system can be described using both common and type-specific properties. In a sense, the "entity:" namespace can be thought of as defining an abstract superclass that all other classes inherit from, with the properties "instanceOf", "subClassOf", "label" and "description".
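
As an aside, the namespace conversion mentioned above can itself be handled with SPARQL update. The following is a minimal sketch of that step (my own illustration rather than part of the working script, and it assumes rdfs:label and rdfs:comment as reasonable targets for the label and description properties):

prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Rewrite the internal entity: predicates into their standard equivalents.
delete {?s <entity:instanceOf> ?class} insert {?s rdf:type ?class}
where {?s <entity:instanceOf> ?class};

delete {?class <entity:subClassOf> ?parent} insert {?class rdfs:subClassOf ?parent}
where {?class <entity:subClassOf> ?parent};

delete {?s <entity:label> ?label} insert {?s rdfs:label ?label}
where {?s <entity:label> ?label};

# rdfs:comment is an assumption here - any annotation property would do.
delete {?s <entity:description> ?desc} insert {?s rdfs:comment ?desc}
where {?s <entity:description> ?desc}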

One other important point to remember about SPARQL insert commands: because a triple store holds a set of triples, inserting a triple that already exists has no effect. This means that by collecting all of the insert statements into a script, you can run the script periodically after adding new objects to make the implicit assertions explicit.
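
A quick way to convince yourself of this is to count the triples in the store before and after re-running the script - the total should not change. (The count query below is just for illustration and isn't part of the script itself.)

select (count(*) as ?tripleCount) where
{?s ?p ?o.}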

The whole script can be run through Jena's SPARQL update service as a single request, with the individual update operations separated by semicolons. The condensed script is as follows:

insert data {

<book:Book1> 
    <book:isbn> "195292514331";
    <book:title> "The SPARQL Book";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2012;
    <book:description> "This is a book about SPARQL".
<author:Author1> 
     <author:displayName> "Jane Doe";
     <author:givenName> "Jane";
     <author:middleNames> "Elizabeth";
     <author:surName> "Doe";
     <author:searchName> "Doe, Jane Elizabeth";
     <author:bio> "An author and librarian who likes cats.".
<book:Book1> <book:author> <author:Author1>.
<author:Author2> 
     <author:displayName> "John Dee";
     <author:givenName> "John";
     <author:middleNames> "Michael";
     <author:surName> "Dee";
     <author:searchName> "Dee, John Michael";
     <author:bio> "A writer of many talents and pretensions.".
<book:Book1> <book:author> <author:Author2>.
<book:Book2>
    <book:isbn> "1952929129568";
    <book:title> "Sparql and OWL, a Guide";
    <book:edition> "1";
    <book:printing> "1";
    <book:publishingYear> 2010;
    <book:description> "A book about SPARQL and data modeling".
<book:Book2> <book:author> <author:Author1>.
<book:Book3>
    <book:isbn> "19123977952320";
    <book:title> "Semantics";
    <book:edition> "1";
    <book:printing> "2";
    <book:publishingYear> 2011;
    <book:description> "A book on Semantics".
<book:Book3> <book:author> <author:Author2>.
<book:Book1> <book:publisher> <publisher:Publisher1>.
<book:Book2> <book:publisher> <publisher:Publisher2>.
<book:Book3> <book:publisher> <publisher:Publisher1>.
<publisher:Publisher1> 
     <publisher:name> "Oriole Publishing";
     <publisher:description> "Large publisher of technical books for the programming market.".
<publisher:Publisher2> 
     <publisher:name> "Avante Books";
     <publisher:description> "A major publisher of scientific and research books.".
<book:Book1> <entity:instanceOf> <class:Book>.
<book:Book2> <entity:instanceOf> <class:Book>.
<book:Book3> <entity:instanceOf> <class:Book>.
<author:Author1> <entity:instanceOf> <class:Author>.
<author:Author2> <entity:instanceOf> <class:Author>.
<publisher:Publisher1> <entity:instanceOf> <class:Publisher>.
<publisher:Publisher2> <entity:instanceOf> <class:Publisher>.
<class:Book> <entity:subClassOf> <class:Entity>.
<class:Author> <entity:subClassOf> <class:Entity>.
<class:Publisher> <entity:subClassOf> <class:Entity>.
};

insert {?author <author:book> ?book} where
{?book <book:author> ?author.};

insert {?publisher <publisher:author> ?author} where
{
?book <book:publisher> ?publisher.
?book <book:author> ?author.
};

insert {?resource <entity:label> ?label}
where {
{?resource <book:title> ?label.}
UNION
{?resource <author:displayName> ?label.}
UNION
{?resource <publisher:name> ?label.}
};

insert {?resource <entity:description> ?description}
where {
{?resource <book:description> ?description.}
UNION
{?resource <author:bio> ?description.}
UNION
{?resource <publisher:description> ?description.}
}

You can then see what your data space looks like thus far with the following SPARQL query:

select ?s ?p ?q where
{?s ?p ?q.}
order by ?s ?p ?q

In the case of the example given here, this will generate the following dataset (listed as subject, predicate, object):

<author:Author1> <author:bio> "An author and librarian who likes cats."
<author:Author1> <author:book> <book:Book1>
<author:Author1> <author:book> <book:Book2>
<author:Author1> <author:displayName> "Jane Doe"
<author:Author1> <author:givenName> "Jane"
<author:Author1> <author:middleNames> "Elizabeth"
<author:Author1> <author:searchName> "Doe, Jane Elizabeth"
<author:Author1> <author:surName> "Doe"
<author:Author1> <entity:description> "An author and librarian who likes cats."
<author:Author1> <entity:instanceOf> <class:Author>
<author:Author1> <entity:label> "Jane Doe"
<author:Author2> <author:bio> "A writer of many talents and pretensions."
<author:Author2> <author:book> <book:Book1>
<author:Author2> <author:book> <book:Book3>
<author:Author2> <author:displayName> "John Dee"
<author:Author2> <author:givenName> "John"
<author:Author2> <author:middleNames> "Michael"
<author:Author2> <author:searchName> "Dee, John Michael"
<author:Author2> <author:surName> "Dee"
<author:Author2> <entity:description> "A writer of many talents and pretensions."
<author:Author2> <entity:instanceOf> <class:Author>
<author:Author2> <entity:label> "John Dee"
<book:Book1> <book:author> <author:Author1>
<book:Book1> <book:author> <author:Author2>
<book:Book1> <book:description> "This is a book about SPARQL"
<book:Book1> <book:edition> "1"
<book:Book1> <book:isbn> "195292514331"
<book:Book1> <book:printing> "1"
<book:Book1> <book:publisher> <publisher:Publisher1>
<book:Book1> <book:publishingYear> "2012"^^<http://www.w3.org/2001/XMLSchema#integer>
<book:Book1> <book:title> "The SPARQL Book"
<book:Book1> <entity:description> "This is a book about SPARQL"
<book:Book1> <entity:instanceOf> <class:Book>
<book:Book1> <entity:label> "The SPARQL Book"
<book:Book2> <book:author> <author:Author1>
<book:Book2> <book:description> "A book about SPARQL and data modeling"
<book:Book2> <book:edition> "1"
<book:Book2> <book:isbn> "1952929129568"
<book:Book2> <book:printing> "1"
<book:Book2> <book:publisher> <publisher:Publisher2>
<book:Book2> <book:publishingYear> "2010"^^<http://www.w3.org/2001/XMLSchema#integer>
<book:Book2> <book:title> "Sparql and OWL, a Guide"
<book:Book2> <entity:description> "A book about SPARQL and data modeling"
<book:Book2> <entity:instanceOf> <class:Book>
<book:Book2> <entity:label> "Sparql and OWL, a Guide"
<book:Book3> <book:author> <author:Author2>
<book:Book3> <book:description> "A book on Semantics"
<book:Book3> <book:edition> "1"
<book:Book3> <book:isbn> "19123977952320"
<book:Book3> <book:printing> "2"
<book:Book3> <book:publisher> <publisher:Publisher1>
<book:Book3> <book:publishingYear> "2011"^^<http://www.w3.org/2001/XMLSchema#integer>
<book:Book3> <book:title> "Semantics"
<book:Book3> <entity:description> "A book on Semantics"
<book:Book3> <entity:instanceOf> <class:Book>
<book:Book3> <entity:label> "Semantics"
<class:Author> <entity:subClassOf> <class:Entity>
<class:Book> <entity:subClassOf> <class:Entity>
<class:Publisher> <entity:subClassOf> <class:Entity>
<publisher:Publisher1> <entity:description> "Large publisher of technical books for the programming market."
<publisher:Publisher1> <entity:instanceOf> <class:Publisher>
<publisher:Publisher1> <entity:label> "Oriole Publishing"
<publisher:Publisher1> <publisher:author> <author:Author1>
<publisher:Publisher1> <publisher:author> <author:Author2>
<publisher:Publisher1> <publisher:description> "Large publisher of technical books for the programming market."
<publisher:Publisher1> <publisher:name> "Oriole Publishing"
<publisher:Publisher2> <entity:description> "A major publisher of scientific and research books."
<publisher:Publisher2> <entity:instanceOf> <class:Publisher>
<publisher:Publisher2> <entity:label> "Avante Books"
<publisher:Publisher2> <publisher:author> <author:Author1>
<publisher:Publisher2> <publisher:description> "A major publisher of scientific and research books."
<publisher:Publisher2> <publisher:name> "Avante Books"
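
With the types and the generic labels in place, you can also write queries that don't care what kind of entity they are dealing with. For example, the following query (a simple illustration, not part of the script above) lists every entity in the store along with its class and label:

select ?entity ?class ?label where
{
?entity <entity:instanceOf> ?class.
optional {?entity <entity:label> ?label.}
} order by ?class ?entity
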
There are a number of other properties and relationships that can be defined, with each new relationship generally identifying another class in the overall entity-relationship diagram. For instance, a given book may have multiple physical copies, those copies may come in different media (such as audio books or DVDs), and the model can be extended to record which library currently holds a given copy, who has it borrowed, and so forth.
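
As a rough sketch of what such an extension might look like - the copy:, media:, library: and borrower: names here are hypothetical placeholders rather than part of the model above - a physical copy could become its own entity:

<copy:Copy1>
     <entity:instanceOf> <class:Copy>;
     <copy:ofBook> <book:Book1>;
     <copy:medium> <media:AudioBook>;
     <copy:heldBy> <library:Library1>;
     <copy:borrowedBy> <borrower:Borrower1>.
<class:Copy> <entity:subClassOf> <class:Entity>.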

This process can take a while. The goal at this stage of development is to figure out which relationships exist and are worth tracking, and which properties are optional. I find that the best approach is to do an initial pass, then pull in stakeholders and walk them through the individual instances and how they interrelate. At that point, domain experts may point out information that the model doesn't yet capture, but adding those properties can be done in a fairly ad hoc fashion.

Once the model has been sufficiently refined, the next stage of the process is converting it into RDF/XML, and from there into other XML formats as needed. This is also a good stage at which to use visualization tools such as Raptor and Graphviz dot to render the relationships as a graph. I'll cover these in subsequent articles.