Friday, December 7, 2012

Why Data Architecture Trumps Application Framework Agnosticism

A Bus Stop - a good place for thinking about how well the Enterprise Service Bus model really works.
I recently had a discussion at work about certain clients that were very much tied into a seductive concept: they felt that the way to build a large-scale application was to define the system in a modular fashion, such that if one particular component didn't work out, they could simply replace it with an analogous component from a different vendor. To do this, they would use only the absolute minimal capabilities of each component - in effect relying upon coding standards being relatively uniform from one component to the next.

This is an attractive proposition, since it reduces reliance upon any one vendor ... at least in theory. However, the real-world experience for many organizations building large-scale systems has generally been the opposite - rather than building a system that is easily decomposable, they end up with systems that are highly dependent upon the general contractor for documentation, support and maintenance, perhaps more so than if they had used more specialized tools for their needs. Yes, any component may be "swapped out", but only by the contractor.

Additionally, this practice means that when you do construct such a system, you end up with a vanilla system that takes advantage of none of the optimizations that could make the system work efficiently, which can be a real problem when you start talking about Big Data - large volumes, high throughput, intensive processing. This becomes especially true for distributed systems, since distribution significantly increases the latencies inherent in such systems.

Instead, organizations should in general concentrate first on the biggest question: in a decade, when the technology has changed enough that it makes sense to take advantage of new innovations in delivering information to new platforms, what ultimately are the most important things to preserve? The answer is generally not code. Code is an expense, no doubt, but in buying code you should think of it as buying a new car. The moment the car leaves the lot, it loses 20% of its value. You'll end up buying a new tire or two, maybe a muffler. As the car gets older, you'll sink more money into repairing the radiator, replacing the solenoid or battery, maybe throwing in a new stereo system because that state-of-the-art eight-track player you bought in the 70s just isn't cutting it anymore.

Ultimately you realize that the car is costing too much to maintain, or no longer meets your needs, or is just plain ugly. You get a new car. You won't, however, carry over that new muffler or radiator, you won't take the tires with you (even if you could), and chances are better than average that the new car's stereo is better than the one you bought several years before. Aside from a few knick-knacks, you won't in fact carry any of that car forward, save as something you can trade in to the dealer.

Code works the same way, almost exactly. It's a sunk cost. You've paid it to achieve a certain objective on a certain platform at a certain time, and neither those objectives nor those platforms stay the same over a window as long as a decade ... nor would you expect them to. Maybe, if you're lucky, you might be able to salvage the API, but even there, chances are pretty good that a new system might be a good time to roll out a new API on top of everything else.

What makes this even more problematic is that when you build a complex custom hardware system, the quality of the code base is only as good as the most junior programmer on the team. Since at any given time you will only discover the quality of that junior programmer's work when their part fails, this means that the more hardware intensive and API intensive your systems, the greater the chance that your code sucks.

On the other hand, data integrity and data architecture are very important - perhaps supremely important. The purpose of an application, at the most abstract level, is to change the state of a data store. There are, admittedly, usually side effects - killing verdant domestic boars via various avians with anger management issues, for instance - but at any given point in the application you are not only changing state, but usually also creating a record of previous state changes - an archive or audit trail.
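
To make that concrete, here's a minimal sketch (in Python, with an in-memory dict standing in for the data store, and with invented field names) of what "change the state, but keep a record of the previous state" looks like:

    # A minimal sketch of "state change plus audit trail". The store layout,
    # field names, and audit record shape are illustrative assumptions, not
    # any particular product's API.
    import copy
    import datetime

    store = {}          # current state of each resource, keyed by id
    audit_trail = []    # append-only record of previous states

    def update_resource(resource_id, changes):
        """Apply a change to a resource, archiving its prior state first."""
        previous = copy.deepcopy(store.get(resource_id))
        audit_trail.append({
            "resource": resource_id,
            "previous_state": previous,
            "changes": changes,
            "timestamp": datetime.datetime.utcnow().isoformat(),
        })
        store.setdefault(resource_id, {}).update(changes)
        return store[resource_id]

    update_resource("book/1234", {"status": "checked-out", "patron": "p-789"})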

This audit trail will usually consist of both the core objects that the system operates on - the "catalog cards" of a library system and the "batch records" used to facilitate the batch transfer and management of the objects described by those cards, to take a library system as an example - as well as the messages that mediate the management of these objects - batch transfer requests and responses, object upgrades, and so forth.
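
As a purely illustrative example - the field names and identifiers here are my own invention, not any library standard - the two kinds of records might look something like this:

    # A core object (a "catalog card") and a mediating message (a batch
    # transfer request), shown as plain Python dicts. Everything here is
    # an assumption for illustration.
    catalog_card = {
        "id": "urn:library:card:1234",
        "title": "Patterns of Enterprise Application Architecture",
        "author": "Fowler, Martin",
        "location": "branch-07",
    }

    batch_transfer_request = {
        "id": "urn:library:msg:2012-12-07-0042",
        "type": "batch-transfer-request",
        "items": ["urn:library:card:1234", "urn:library:card:5678"],
        "from": "branch-07",
        "to": "branch-02",
        "requested": "2012-12-07T10:15:00Z",
    }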

When the technology becomes obsolete (it doesn't run on the newest machines, or runs too slowly compared to more contemporary technology, or it no longer meets the business needs of the organization), what then becomes most important is the migration of the data content, in a consistent, standardized format, to a new system. This in turn requires that you have a clean internal information architecture.
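
In practice, that migration can be as simple as serializing each resource under its stable identifier and letting the new system read it back in. Here's a rough sketch (the newline-delimited JSON layout and the function names are assumptions for illustration):

    # Migration as data transfer: dump each resource as one JSON record per
    # line, then rebuild the store on the other side.
    import json

    def export_store(store, path):
        """Write every resource in the store to a newline-delimited JSON file."""
        with open(path, "w", encoding="utf-8") as out:
            for resource_id, state in store.items():
                out.write(json.dumps({"id": resource_id, "state": state}) + "\n")

    def import_store(path):
        """Rebuild the store from the exported file in the new system."""
        store = {}
        with open(path, encoding="utf-8") as src:
            for line in src:
                record = json.loads(line)
                store[record["id"]] = record["state"]
        return store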

Now, it is certainly possible that the data architecture will have evolved over time - data architecture by its very nature reflects the business environment in which it is embedded, and that environment also changes over time. However, in most cases such evolutions are generally additive in nature - new types of information get stored, information differentiates in the face of more refined tracking, and very occasionally a property becomes obsolete and is no longer stored.
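
A small sketch of what "additive" means in practice: a newer record carries a property the older one lacks, and consuming code treats that property as optional rather than breaking on older data (the titles, field names and placeholder ISBN below are invented):

    # Additive schema evolution: old records stay valid, new records carry more.
    old_card = {"id": "urn:library:card:1111", "title": "SVG Essentials"}
    new_card = {"id": "urn:library:card:2222", "title": "XQuery Basics",
                "isbn": "978-0-000-00000-0"}  # placeholder value, not a real ISBN

    def isbn_of(card):
        # Older records simply have no ISBN; that is not an error.
        return card.get("isbn", "unknown")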

Critically, what should be persisted are, first and foremost, the primary objects in their various incarnations. Usually, at a certain point in the process, message information gets condensed, digested and reorganized. This means that when designing a good data architecture, some thought should be given to that condensation process as well.
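
As a rough illustration of that condensation step - the digest shape here is an assumption, not a prescription - individual transfer messages might get rolled up into per-branch counts that are kept long-term, while the raw messages are eventually pruned:

    # Condense raw transfer messages into a compact digest.
    from collections import Counter

    def digest_messages(messages):
        """Roll up transfer messages into per-route counts."""
        counts = Counter((m["from"], m["to"]) for m in messages)
        return [
            {"from": src, "to": dest, "transfers": n}
            for (src, dest), n in counts.items()
        ]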

So what happens when it's time to buy that replacement car - er, application? The quiet revolution that has taken place over the last decade has been the rise of open data standards - XML, RDF and JSON. In its simplest incarnation, this means that your data can be transferred from one system to another without the need for extensive ETL tool processes. This doesn't mean that your data models themselves won't change - assume that change is inevitable - but it does mean that if you have concentrated on a RESTful architecture that captures the principal objects within the system, then the transfer from one data store to another can be accomplished via a relatively simple transformation. If you do work with REST, it also means that many of the precepts of REST - including the notion of the globally unique identifiability of resources, up to and including "records" in a data system - transfer over from one system to the next ... regardless of the systems involved.
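
Here's a sketch of what that buys you in code terms: if every record is addressed by a stable, globally unique identifier, moving between stores is mostly a matter of re-shaping the representation, not re-keying the data (the old and new record shapes below are invented for illustration):

    # Map an old store's record onto a new store's shape. The identifier is
    # the one thing that travels unchanged from system to system.
    def to_new_representation(old_record):
        """Re-shape a record for the new store, preserving its URI."""
        return {
            "@id": old_record["id"],        # the globally unique identifier
            "type": "CatalogCard",
            "properties": {
                k: v for k, v in old_record.items() if k != "id"
            },
        }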

Now, if your application is Angry Birds, this probably doesn't matter so much (though I'd argue that as applications move increasingly to the cloud and networked multiplayer mode, it matters a whole heckuva lot more than it did when these were stand-alone applications). On the other hand, if your application is intended to run for the long term and involves large amounts of distributed data that needs to be consumed and produced by multiple participants, then the data architecture is far more important than the specific implementation details - quite possibly because the code involved will almost certainly be different from end-point to end-point, even if the intent of that code remains the same.

This is part of the reason why I think that the next generation of object databases - XML databases, JSON databases, SPARQL 1.1 triple stores, etc. - should be seen less as databases than as integration platforms that are particularly efficacious at "wrangling" open standards formats, and which happen to have lots of memory for persisting structures. Increasingly these have embedded HTTP clients, extensible function libraries, optimized search and data retrieval capabilities, reverse-query-based decision triggers, and other features that replicate in code what traditional ESBs usually delegate to different products. These scale well in general, they are adept at transforming data streams in-process (significantly reducing overall serialization/parsing costs), and they typically work quite effectively in virtual environments, especially when the need to spawn additional operational nodes emerges.

Moreover, consolidating these into a single system makes a great deal of sense when the messaging formats involved are themselves in the "big three" formats of XML, RDF, or JSON. Rather than writing imperative logic, I can use a conditional "query" to test whether a given data or metadata document fulfills given conditions, spawn internal processes on that document (or, more accurately, on all documents that satisfy that query), and then change the states of these documents or create the new documents that result from the process while archiving the old ones. The significant thing is that these documents do not "move" in the traditional sense unless a specific transference process is required out of band (i.e., when an external service gets invoked). In effect, these NoSQL data "integrators" are far more efficient at data/document processing because the documents generally do not move - their metadescriptors get changed instead.
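
A minimal sketch of that query-driven, "documents don't move" style - with an in-memory dict standing in for the document store, and with status values and metadata shapes that are my own assumptions rather than any particular NoSQL product's API - might look like this:

    # A query selects matching documents, a handler runs against each match,
    # and only the document's metadata descriptor is updated in place.
    def process_matching(store, predicate, handler):
        """Run handler over every document satisfying predicate, in place."""
        for doc_id, doc in store.items():
            if predicate(doc):
                handler(doc)
                # The document stays where it is; only its descriptor changes.
                doc.setdefault("meta", {})["status"] = "processed"

    # Example: flag every pending batch-transfer request without copying it.
    process_matching(
        store={"urn:library:msg:0042": {"type": "batch-transfer-request",
                                        "meta": {"status": "pending"}}},
        predicate=lambda d: d.get("type") == "batch-transfer-request"
                            and d.get("meta", {}).get("status") == "pending",
        handler=lambda d: None,  # real work would transform or enrich d here
    )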

So, to wrap things up: by focusing on data architecture first, with an emphasis on identifying and defining the primary entities in the system rather than focusing on messaging systems and formats, you can design systems that preserve the data of the system for the long term at the expense of code that would get changed anyway within a decade. By focusing on a RESTful, resource-centric NoSQL data-store architecture (which is generally far easier to develop with model-driven approaches), you can also dramatically reduce the number of external service components necessary to accomplish the same tasks.

Given that, at least to me, it seems that trying to take advantage of code reuse is a false economy once you get to the level of enterprise data systems. Focus on the development of a solid information architecture first, identifying the primary entities that will be used by the various participants in the network and a mechanism for universal identification of those resources; identify where those entities will be persisted and accessed; then build out from these stores. This differs from the conventional wisdom of trying to minimize the work that the databases do, but in general that has more to do with the fact that these "databases" are in effect more like platforms today than they were even a few years ago.

Kurt Cagle is an information architect for Avalon Consulting, LLC., and has helped develop information systems for numerous US federal agencies and Fortune 500 clients. He has also authored or contributed to seventeen books on information architecture, XML and the web, and is preparing his 18th, on SVG, for publication now. 
