Thursday, December 6, 2012

RDF and its role in Logical/Canonical Modeling

Peter O'Kelly, a friend and former colleague of mine and the Principal Analyst at O'Kelly Associates, commented on my post about using RDF for data modeling, asking why I felt that RDF was a better tool than UML for modeling, especially given UML's primacy in most enterprise spaces. I tried responding in a comment, but after writing it, I realized that, first, it was actually a good post in itself, and second, I couldn't fit it into the 4096-character limit for comments. Thus this new post ...


I don't think that conceptual/logical data modeling via UML is ineffective - in many respects it's a very necessary part of the modeling process. However, what I've generally found with CDM/LDM work is that when you deal at the level of class abstractions rather than individual instances during the very initial prototyping, it's not always obvious what the functional classes are. It's also hard to tell whether what you are dealing with are associations or aggregations ... or what the specific relationships are between different entities. I find that taking an initial RDF approach helps in identifying those entities by "trying them out" with real-world data one element at a time, rather than trying to holistically map out an overarching canonical model.
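
To make that concrete, here is a minimal sketch of what I mean by "trying them out" - Python with rdflib, using a purely hypothetical namespace and hypothetical instances - building up candidate entities one triple at a time:

```python
# A minimal sketch of instance-first prototyping with rdflib.
# The ex: namespace and the instances are hypothetical, purely for illustration.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/model/")

g = Graph()
g.bind("ex", EX)

# Start from concrete, real-world instances rather than abstract classes.
g.add((EX.order1001, RDF.type, EX.Order))
g.add((EX.order1001, EX.placedBy, EX.customerJSmith))
g.add((EX.customerJSmith, RDF.type, EX.Customer))
g.add((EX.customerJSmith, EX.fullName, Literal("J. Smith")))

# Looking at the accumulated triples makes it easier to see which classes are
# actually functional, and whether a given link is an association or an aggregation.
print(g.serialize(format="turtle"))
```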

The RDF triple approach can also come in handy in identifying what should be enumerants rather than lists of named entities - or whether it makes sense to use one vs. the other (as there is almost invariably a very fuzzy line between the two). This is a common problem for XSD as well as UML, because both languages tend to see enumerations as a simple list, whereas the RDF approach typically sees enumerations as entities in a different namespace that may have multiple expressions.

For instance, countries or states/provinces are often treated as enumerations when in fact they should be seen as lists of objects with multiple properties - full name, abbrev2, abbrev3, etc. - as well as internal relationships (a state can be both an object of type GeoState and a state of a given country). Colors are usually expressions of a given object property with very specific meanings - silver/gray means very different things to a car dealer and a sweater manufacturer, for instance. UML can certainly express these concepts as well, but because UML is focused primarily upon class relationships rather than ontological ones, the ontological relationships and divisions of namespaces are all too often far from obvious when modeling graphically in UML, especially when dealing with enterprise-level ontologies.
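
A rough illustration (again Python with rdflib, with hypothetical namespaces and URIs): a state modeled as an entity carries its multiple expressions and relationships directly, where an XSD or UML enumeration would flatten it to a single token:

```python
# A sketch of a state as an object with multiple properties and relationships,
# rather than a bare enumerant. Namespaces and URIs are hypothetical.
from rdflib import Graph, Literal, Namespace, RDF

GEO = Namespace("http://example.org/geo/")
EX = Namespace("http://example.org/model/")

g = Graph()
g.bind("geo", GEO)
g.bind("ex", EX)

# Multiple expressions of the same entity ...
g.add((GEO.Oregon, RDF.type, EX.GeoState))
g.add((GEO.Oregon, EX.fullName, Literal("Oregon")))
g.add((GEO.Oregon, EX.abbrev2, Literal("OR")))

# ... and internal relationships: a GeoState that is also a state of a country.
g.add((GEO.Oregon, EX.stateOf, GEO.UnitedStates))
g.add((GEO.UnitedStates, RDF.type, EX.Country))
```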

Now, having mapped out a preliminary sample ontology, going from this to a UML model is probably a good next step, as is generating relational diagrams using Raptor or similar visualization tools (which I hope to cover in a later post). It's at that stage that the socialization of the CDM takes place, and in my experience UML is generally better at communicating this information than either instance or schema relational graphs showing RDF, especially to semi-technical business types. In this respect the initial RDF work can be seen as a test bed, especially since it's possible, using SPARQL Update, to change both explicit property types and more subtle relationships. It's also relatively easy to generate RDF Schema or OWL 2 from instance data as long as you have a few key things like sameAs relationships and preliminary rdf:type associations. XSD is not as amenable to such changes.
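
A sketch of what that looks like in practice (hypothetical property and class names, and assuming an rdflib build whose Graph.update() supports SPARQL 1.1 Update): renaming a property across the instance data and lifting the classes actually in use into a skeletal schema are both one-step, declarative operations:

```python
# A sketch of evolving the prototype with SPARQL Update and deriving a skeletal
# schema from instance data. Property and class names are hypothetical; assumes
# an rdflib version whose Graph.update() supports SPARQL 1.1 Update.
from rdflib import Graph, RDF, RDFS

g = Graph()
g.parse(data="""
    @prefix ex: <http://example.org/model/> .
    ex:Oregon a ex:GeoState ;
        ex:abbrev "OR" .
""", format="turtle")

# Rename a property across all instance data in one declarative step.
g.update("""
    PREFIX ex: <http://example.org/model/>
    DELETE { ?s ex:abbrev ?o }
    INSERT { ?s ex:abbrev2 ?o }
    WHERE  { ?s ex:abbrev ?o }
""")

# Lift every class actually used by the instances into a preliminary RDF Schema.
for cls in set(g.objects(None, RDF.type)):
    g.add((cls, RDF.type, RDFS.Class))

print(g.serialize(format="turtle"))
```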

There is a UML profile for RDF which makes interchange between the two possible, if not always easy. I'm still experimenting with this, mind you, so I can't speak with authority here.

I think your last point is worth addressing at a higher level as well. I've been involved now in a couple dozen large-scale data modeling efforts, and am about to undertake a couple more shortly. Data modeling is still very much an art, primarily because the modeling domains themselves are becoming increasingly complex as the tools give us the ability to work with those domains, and so what works as best practice for systems with a couple of dozen primary classes doesn't necessarily scale well when you start talking about thousands of such classes. Most data models are putatively designed by committee, but they typically involve an architect putting up a straw man for that committee to critique, rather than having that committee actively designing such artifacts on the fly.

Perhaps the closest analogue I can think of here would be the peer review committee for a doctoral thesis - it is the job of the architect to defend his or her design, and to do that the confidence level in the ability of the model to meet the project's needs must be high ... which means testing the data model beforehand. XML tends to be a fairly poor medium for testing, because at its core it only has a container/contained relationship, and because linking in XML is awkward and ill-supported even in XML databases (ditto JSON). Building imperative structures to manage such linking defeats the whole purpose of a data model, which is intrinsically declarative - building an application to test the data model works, but it is a time-consuming process and fragile in the face of change. Building RDF prototypes of the data model, on the other hand, allows for incremental changes in the ontology at relatively minor cost in comparison.
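
A small illustration (hypothetical region resource and property names, again Python with rdflib): a cross-cutting relationship that would require awkward linking machinery in an XML containment hierarchy is just one more declarative triple, and revising it later is equally cheap:

```python
# A sketch of why incremental change is cheap in an RDF prototype.
# The sales-region resource and property names are hypothetical.
from rdflib import Graph, Namespace

GEO = Namespace("http://example.org/geo/")
EX = Namespace("http://example.org/model/")

g = Graph()
g.bind("geo", GEO)
g.bind("ex", EX)

# A cross-cutting link: a state already sits "inside" a country, but it can also
# be linked to a sales region - something a pure containment hierarchy handles awkwardly.
g.add((GEO.Oregon, EX.inSalesRegion, EX.pacificNorthwest))

# Revising the ontology later is equally declarative: retract one assertion,
# make another - no imperative linking code to rewrite.
g.remove((GEO.Oregon, EX.inSalesRegion, EX.pacificNorthwest))
g.add((GEO.Oregon, EX.inSalesRegion, EX.westCoast))
```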

So, I'm not dissing UML tools - I've used MagicDraw and Rational Rose, and they are both a key part of an ontologist's toolkit. Rather, I'm just arguing that RDF provides a good mechanism for quickly building a proof-of-concept ontology that can then feed into UML if necessary for communication.

1 comment:

  1. Thanks for the detailed response and for sharing your modeling insights on this blog.

    [Preface micro-rant: what's with the rudimentary Blogger comment edit control? Have the folks at Google never heard of embedded links or images?... I just lost an earlier draft of this comment, after I foolishly tried to drag in an image from Skitch; argh...]

    A few quick responses, for now:

    Sorry if I confused conceptual modeling, logical modeling, and UML in my response to your previous post. I do not think UML is great for either conceptual or logical modeling, but there's a perplexing shortage of useful (and affordable) modeling tools for conceptual/logical modeling (Open ModelSphere [http://www.modelsphere.org/] is one example) and also a shortage of useful resources for learning about conceptual modeling (the Carlis/Maguire book [http://www.amazon.com/gp/product/020170045X] is still the best resource I've run across), so many people default to UML tools for modeling.

    I definitely agree it's useful to explore both instances and types/categories/classes when learning about and modeling a domain -- that's a best practice in the Carlis/Maguire technique as well.

    At the outset, I think it's useful to distinguish between resources and relations -- see slide 18 in a recent Gilbane presentation [http://www.slideshare.net/pbokelly/gilbane-boston-2012-xml-and-sql-not-dead-yet] on XML, SQL, and related topics for more details. In that framework, RDF gets a bit intertwingled, as it's a resource-based model that's often used for modeling relations...

    Anyways, more to follow -- thanks again for sharing your insights in this blog; I look forward to continuing the discussion.
