Semantics and Data Modeling: You Say Tomato, I Say Ketchup

I've been moving into a new house in Issaquah, Washington lately, and one of the inevitable chores that comes with moving is putting up books, videos, DVDs, games and similar content. I had an interesting lesson in organization when my wife and daughter independently organized the movies we'd collected over the years and discovered that each had an ever-so-slightly different view of what constituted proper ordering. My wife organized content by publisher and then alphabetically by title. My daughter came in later and reorganized the same DVDs purely in alphabetical order, save that series were arranged in chronological order. By the time the day was over, tempers were frayed, and each was grumbling about the other being absolutely unreasonable in how she viewed the way the world should work.

This same process takes place every day in organizations. DBAs put together databases, typically designed to handle one particular application, without thought to the broader applicability of the data that they are storing. Programmers build ad hoc XML or JSON schemas, without worrying whether there is any larger scheme for promulgating this information beyond the immediate confines of the services they are writing. Services get created to get at very specific content, without thought for the broader design of such interfaces vis-à-vis organizational goals.

This happens for obvious reasons. Developing standards takes specialized skills and time, and usually both are in short supply. Most people do not want to make the effort to find data structures already in use that are similar to what they have, because inevitably there will be information that process Y needs that process X didn't supply, or the information requires some transformation of data, or the data is locked behind a web service interface.

Beyond these more mundane factors there are also bigger issues. As semi-structured data (XML, JSON, etc.) becomes utilized by one group, code becomes dependent upon that data structure. This is especially true when the language processing that structured data is procedural, and the developers in question do not know how to work well with such structured content. I've seen too many projects where the Java or .NET code was so fragile with regard to the XML that something as simple as adding an element to a schema could break the application.
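To make that fragility concrete, here is a minimal sketch in Python (chosen purely for brevity; the same pattern shows up in Java or .NET). The invoice documents and element names are hypothetical, invented for illustration. One reader assumes a fixed child position, the other looks the element up by name, and only the second survives a new element being added to the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice documents; the element names are illustrative only.
INVOICE_V1 = "<invoice><id>A-100</id><total>42.50</total></invoice>"
# A later schema revision adds <currency> ahead of <total>.
INVOICE_V2 = (
    "<invoice><id>A-100</id><currency>USD</currency>"
    "<total>42.50</total></invoice>"
)


def total_by_position(xml_text: str) -> float:
    """Fragile: assumes <total> is always the second child of <invoice>."""
    root = ET.fromstring(xml_text)
    return float(root[1].text)


def total_by_name(xml_text: str) -> float:
    """Tolerant: finds <total> by name and ignores unknown siblings."""
    root = ET.fromstring(xml_text)
    text = root.findtext("total")
    if text is None:
        raise ValueError("invoice has no <total> element")
    return float(text)
```

With the first document both functions agree. With the second, `total_by_position` tries to parse "USD" as a number and fails, while `total_by_name` keeps working.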

The usual result is that the same organization ends up with half a dozen different ways of expressing invoices, manifests, even addresses. For a small organization, this isn't that much of a problem, but beyond a certain point organizations are sufficiently large and distributed that a multitude of one-off schemas makes data interchange costly and in many cases not even possible, even with XML or JSON as the data format.

So what's the solution to this Babelization? In an ideal world, the organization would have an information architect lay out a broad data strategy, and any data structures that are created would then need to follow its rules. In practice, this happens very seldom - most companies tend to code first and think about data strategy only after a great deal of such code is already in place.

More often what happens is that an executive decision gets made to create open standards or platforms because too much of the organization's data is locked up in different databases that can't talk with one another, and this is translating into extra costs or lost potential revenue. At this point, it becomes the responsibility of a "data architect" to somehow de-siloize the organization's information.

At this stage, the existing data is organized to tease out the data structures, and these data structures in turn are modeled more formally. Typing is noted, taxonomies are identified, conventions are established, and relationships are clarified. This process is also very contentious, because the various stakeholders of that data want to ensure that the structures mostly conform to what already exists in the code base - even if it means violating all of the rules carefully laid out at the beginning of the process.

What's more, a great deal of the information that gets captured in enterprise systems consists of messages, especially when XML is involved. These messages are usually derived from physical forms. Indeed, one of the most common ways of building models is to transfer every field in the form to an XML schema, all too often without even bothering to question whether such fields exist to provide intent rather than specific object data. Forms are certainly useful to help understand data and data relationships, but should be used only as a guideline, and only in context with other forms in the system.

In effect, one of the primary purposes of such data modeling exercises is to establish logical models of the data space. A logical model identifies the things in the system, and is independent of the medium being used to capture or persist the data in the first place.
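As a rough illustration of what "independent of the medium" can mean in practice, the sketch below defines one logical `Address` entity and derives both a JSON and an XML representation from it. The class, its field names and the helper functions are all hypothetical, chosen only for the example; a real enterprise model would be considerably richer.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Address:
    """Logical model: says what an address is, not how it is stored."""
    street: str
    city: str
    region: str
    postal_code: str


def to_json(address: Address) -> str:
    """One physical representation of the logical model."""
    return json.dumps(asdict(address), sort_keys=True)


def from_json(text: str) -> Address:
    return Address(**json.loads(text))


def to_xml(address: Address) -> str:
    """A second representation derived from the same logical model."""
    root = ET.Element("address")
    for name, value in asdict(address).items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Address:
    root = ET.fromstring(text)
    return Address(**{child.tag: child.text or "" for child in root})
```

Both serializations round-trip to the same `Address` value, so code written against the logical model does not need to care which format a given service happens to speak.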

In this model, the role of a form is to provide the data necessary to construct the things in the system, to provide sufficient data to enable updating these objects, to pass parameters to query the data store for one or more objects that satisfy given conditions, or to expedite some change in the metadata of the object (such as where the object is located).
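Those four roles can be sketched as four operations on a small in-memory store. Everything here (the `Item` entity, the `Store` class, the form keys) is a hypothetical stand-in for illustration, not a prescribed design; the point is only that the form feeds the operation rather than defining the entity.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Item:
    item_id: str
    title: str
    location: str  # metadata about the object: where it currently lives


class Store:
    """Hypothetical in-memory data store keyed by item identifier."""

    def __init__(self) -> None:
        self._items: dict[str, Item] = {}

    def create(self, form: dict) -> Item:
        """Role 1: the form supplies the data needed to construct the thing."""
        item = Item(form["item_id"], form["title"], form["location"])
        self._items[item.item_id] = item
        return item

    def update(self, item_id: str, form: dict) -> Item:
        """Role 2: the form supplies changed descriptive data (title only)."""
        changes = {k: v for k, v in form.items() if k == "title"}
        self._items[item_id] = replace(self._items[item_id], **changes)
        return self._items[item_id]

    def query(self, **conditions: str) -> list[Item]:
        """Role 3: the form supplies parameters for finding matching objects."""
        return [
            item for item in self._items.values()
            if all(getattr(item, k) == v for k, v in conditions.items())
        ]

    def relocate(self, item_id: str, location: str) -> Item:
        """Role 4: the form triggers a metadata change such as location."""
        self._items[item_id] = replace(self._items[item_id], location=location)
        return self._items[item_id]
```

Note that `update` deliberately ignores any form fields it does not recognize, which is one way of keeping a form's incidental fields from leaking into the model.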

Now, one of the consequences of such a rethink is that the creation of an enterprise data model means that data entities become unique. This makes sense, given that a particular object being modeled is usually a physical entity in its own right, but it also means that by moving to an enterprise model the systems need to have ways of identifying when chunks of data in two different data stores refer to the same entity, and of resolving conflicts when these two data chunks disagree.

This is perhaps one of the hardest things that an information architect needs to do. Resources have to be uniquely identified, and resource data has to be framed within a contextual provenance. Multiple databases may actually contain different pieces of information about the same entity - one database may contain my employment history, while another may contain my medical records, and a third may contain my academic records. Moreover, two different databases may have the same type of information about me, but may be out of sync, or may be contradictory. These are all things that the architect needs to consider when designing systems, especially when those systems span multiple data repositories, types, domains of control and transmission mechanisms.
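A minimal sketch of what that involves: each claim about an entity carries its provenance (which store made it, and when), claims are grouped under a shared identifier, and disagreements are recorded rather than silently overwritten. The `Assertion` type, the identifier scheme and the "most recent value wins" policy are all assumptions made for the example; a real system might prefer an authoritative source per field instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Assertion:
    entity_id: str  # shared identifier agreed on across systems (assumed)
    field: str
    value: str
    source: str     # provenance: which data store made the claim
    as_of: date     # provenance: when that store last updated the value


def merge(assertions):
    """Merge claims per (entity, field); the newest wins, conflicts are kept.

    Returns the winning assertions keyed by (entity_id, field), plus a list
    of (older, newer) pairs whose values disagreed.
    """
    merged = {}
    conflicts = []
    for claim in sorted(assertions, key=lambda a: a.as_of):
        key = (claim.entity_id, claim.field)
        previous = merged.get(key)
        if previous is not None and previous.value != claim.value:
            conflicts.append((previous, claim))
        merged[key] = claim
    return merged, conflicts
```

Because every winning value still carries its `source` and `as_of`, a downstream consumer can see where a fact came from, and the conflict list gives someone a place to start when two stores are out of sync.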

At the same time, the architect must act as an advocate to the users of the various programs and data systems. Across an enterprise, the architect does not (and should not) design each database or set of data structures. That creates a choke point in the system, and will likely result in your architect being carried off on a stretcher and your organization reaching a standstill waiting for him to recover from the heart attack. What the information architect should do, however, is to establish standards - how schemas are defined, what naming and design rules are utilized, how the scope of a given schema should propagate upward through the organization based upon its utility and context, how auditing and versioning take place and so forth. In that respect, an enterprise information architect is a manager of departmental data architects, fostering an environment where there is communication between the various stakeholders and transparency in data modeling and design. The departmental data architects in turn should work with the system architects, business analysts and software designers within their department to ensure that information can be adequately represented for programmers to work with while still retaining all the information necessary for the objectives of the organization overall.

Ultimately, the information architect should recognize that the data space is a shared space - too much control and the complexity of the software becomes unmanageable, too little control and data interchange cannot occur without information, perhaps critical information, being lost in translation. It's not an easy role to fill - in my time as an information architect I've made decisions that seemed sensible at the time and that I've since come to regret, and invariably there are demands to get data models out before all of the requirements are known, leading to late nights and headaches down the road. On the other hand, when you do reach that nirvana point - where the applications are done and the information is moving smoothly between systems - there are few feelings that can beat that ... until the requirements come down the pike for the next batch of software ...

Thursday, January 3, 2013
