Sunday, February 17, 2013

Semantics and Airline Flight


Airline Routing Maps (courtesy United Airlines)
A question that I'm frequently asked about Semantic Web technology is how is this different from any other data storage or query approach? In general, data storage strategies are built less around data access speeds - in most cases the speeds are comparable - but in optimizing the storage to best fill the specific business requirements at hand.

Relational databases, for instance, are remarkably good for those situations where your data is regular, and where you have clearly defined relational keys between "flat" objects, where there is comparatively little variability in data structures, and typically where relevant data is contained within a single data store. This still describes the majority of use cases even in today's world, and as such it is unlikely that relational (RDBM) systems are going to stop being used any time soon.

Document and hash stores provide a second way of organizing data, and these generally work best when the data in question has greater variability in structure, is hierarchical, and can be organized as a discrete entity. XML and JSON are both good tools for visualizing this kind of information, where XML works better when the structures in question are largely narrative in nature (and you have the need for text interspersed with content), while JSON is increasingly seen as being useful for those situations where you are communicating primarily property information or sequences of same.

Such document stores typically index using name/value pairs, but also have effectively denormalized part of the underlying structure along "preferred" axes, in effect transferring the load of normalization from the document database to the application user. This in turn makes these formats superior for data transfer as well, especially since they have no formal requirements for transmitting the schema of such documents, not typically the case with relational databases.

Semantic Databases (triple stores, quad stores, or named graph databases) fall into a somewhat different category, and are in effect a form of columnar rather than record oriented databases. In this case, most data in the database can be thought of as a triple consisting of record identifier, a table column name and a column value where the two intersect, with that column value in turn being either a key to another record identifier or an atomic value.
Airport
AirportIDLabelStatusRoutesTo
SEASeattle Tacoma AirportActiveSFO
SEASeattle Tacoma AirportActiveLAX
SFOSan Francisco AirportActiveSEA
SFOSan Francisco AirportActiveLAX
LAXLos Angeles AirportActiveSEA
LAXLos Angeles AirportActiveSFO

If we use <airport:SFO> to identify a given airport (San Francisco International), <airport:routesTo> to identify the column that contains routing information and <airport:SEA> to contain the identifier reference to another airport (here Seattle-Tacoma International Airport), then the set:
<airport:SFO>  <airport:routesTo> <airport:SEA>.
creates an assertion that can be read as "SFO routes to SEA". This is what a Semantic Triple Store deals with.
Note that there is a fair amount of redundancy in most relational databases. Indeed, technically the above should properly be two data sets:
Airport
RecordIDAirportSignLabelStatusAirportRouteKey
1SEASeattle Tacoma AirportActive101
2SFOSan Francisco AirportActive102
3LAXLos Angeles AirportActive103

AirportRoute
RecordIDAirportRouteKeyAirportRecordID
11012
21013
31021
41023
51031
61032

Such normalization makes it easier to perform joins internally, but at the cost of both lucidity and redundancy of results - and of the complexity of writing the corresponding queries.

The Semantic approach tends to take a sparser style, making the identifier for a given resource it's own URI, and identifying links through similar URIs. It does do by breaking tables into individual columns and then mapping the relationships between the resource identifier subject and the object of the column property, as described above. However, it also has the effect of shifting the thought away from Tables and towards the notion of distinct kinds of resource. We first have the concept of airports, and their connectedness:

<airport:SEA> <rdf:type> <class:Airport>.
<airport:SFO> <rdf:type> <class:Airport>.
<airport:LAX> <rdf:type> <class:Airport>.
<airport:SEA> <airport:routesTo> <airport:SFO>.
<airport:SEA> <airport:routesTo> <airport:LAX>.
<airport:SFO> <airport:routesTo> <airport:SEA>.
<airport:SFO> <airport:routesTo> <airport:LAX>.
<airport:LAX> <airport:routesTo> <airport:SEA>.
<airport:LAX> <airport:routesTo> <airport:SFO>.
However, these routes - supplied here as a verb, can also be manifest as objects (route objects) in their own rights, which makes certain types of computations easier.
<route:SEASFO> <rdf:type> <class:Route>.
<route:SEALAX> <rdf:type> <class:Route>.
<route:SFOLAX> <rdf:type> <class:Route>.
<route:SFOSEA> <rdf:type> <class:Route>.
<route:LAXSFO> <rdf:type> <class:Route>.
<route:LAXSEA> <rdf:type> <class:Route>.
<route:SEASFO> <route:fromAirport> <airport:SEA>. <route:SEASFO> <route:toAirport> <airport:SFO>. <route:SEALAX> <route:fromAirport> <airport:SEA>. <route:SEALAX> <route:toAirport> <airport:LAX>. <route:SFOLAX> <route:fromAirport> <airport:SFO>. <route:SFOLAX> <route:toAirport> <airport:LAX>. <route:SFOSEA> <route:fromAirport> <airport:SFO>. <route:SFOSEA> <route:toAirport> <airport:SEA>. <route:LAXSFO> <route:fromAirport> <airport:LAX>. <route:LAXSFO> <route:toAirport> <airport:SFO>.
where there's a relationship that can be defined in SPARQL:

CONSTRUCT {?airport1 <airport:routesTo> ?airport2} WHERE
{
     ?route <rdf:type> <class:Route>.
     ?route <route:fromAirport> ?airport1.
     ?route <route:toAirport>   ?airport2.
}
This idea of construction is important - you can think of it as being a formal statement or definition of a given term in a taxonomy by using other terms. In this case we're defining the airport:routesTo operation in terms of the route object, making it possible to look at the same problem in two distinct ways.

Arguably, for simple problems, this is probably more work than it's worth. However, what if you're dealing with not so simple problems. For instance, consider a typical routing problem - you are at an airport in Seattle (SEA) and you want to get to an airport in Chicago (ORD - O'Hare). There may be a single connecting flight ... but then again, there may not be. In this case, you have to take one or more connecting flights. It's at this level that this particular problem becomes considerably harder, because you are at that point dealing with graph traversal. With three airports, you have six potential routes. With four airports, you have up to 24. By the time you get up to ten airports you're dealing with 3,628,800 potential routes. This also becomes more complex because you have to then cull out of those three million plus routes all circular subroutes - routes which create loops - and you have to then optimize this so that the passengers involved spend as little time in the air as possible, both for the passenger's comfort, and because you're consuming more fuel with every mile that passenger travels.

Playing with the model a bit, a couple of other assumptions come into play, one being that it should be possible to attach one or more flights to a given route. The multiple flights may leave at different times of the day, may be different aircraft and so forth. For purposes of this particular article, a few simplifying assumptions are made - first, a flight has a certain flight time that will be used to determine best flight paths (and could indirectly be used to determine ticket costs, though that's not covered). Additionally, it's preferable to order trips first by the number of flights it takes to get from A to B, and secondarily by transit time. Finally, this simulation will only provide solutions for up to four flights in a given trip, though generalizing it is not hard.

Assignments of flights to routes are straightforward:

<route:SEASFO> <route:flight> <flight:FL1>.
<route:SFOLAX> <route:flight> <flight:FL2>.
<route:LAXDEN> <route:flight> <flight:FL3>.
<route:SFOLAS> <route:flight> <flight:FL4>.
<route:LAXLAS> <route:flight> <flight:FL5>.
Similarly flight times (in hours) are given as:
<flight:FL1> <flight:flightTime> 2.0.
<flight:FL2> <flight:flightTime> 2.0.
<flight:FL3> <flight:flightTime> 3.0.
<flight:FL4> <flight:flightTime> 1.4.
...
Three more conditions are modeled. First, a given flight can be cancelled. This effectively takes the flight out of the routing path, which can impact several potential paths. Similarly, an airport can close for some reason, such as adverse weather conditions, forcing different routings. The biggest condition, at least for this simulation, is that this does not take into account scheduling, insuring that a given flight is available. This can also be modeled semantically (and is not really all that hard) but it is outside of the scope of this article.

With these conditions, it's possible to create a SPARQL query that will show, for a given trip between two airports, the potential paths that are available, the number of flights needed in the total trip, and the total time for that trip.

SELECT DISTINCT ?airportSrc ?airport2 ?airport3 ?airport4 ?airportDest ?flight1 ?flight2 ?flight3 ?flight4 ?numFlights ?flightTime WHERE {
    ?airportSrc <owl:sameAs> <airport:SEA>.
    ?airportDest <owl:sameAs> <airport:ORD>.
    ?airportSrc <airport:status> <status:Active>.
    ?airportDest <airport:status> <status:Active>.
    {?airportSrc <airport:routesTo> ?airportDest.
     ?route1 <route:fromAirport> ?airportSrc.
     ?route1 <route:toAirport> ?airportDest.
     ?route1 <route:flight> ?flight1.
     ?flight1 <flight:flightTime> ?flightTime.
     ?flight1 <flight:status> <status:Active>.
     BIND ((1) as ?numFlights)
    }
    UNION
    {?airportSrc <airport:routesTo> ?airport2.
     ?airport2 <airport:routesTo> ?airportDest.
     ?airportSrc <airport:status> <status:Active>.
     ?airport2 <airport:status> <status:Active>.
     ?airportDest <airport:status> <status:Active>.
     ?route1 <route:fromAirport> ?airportSrc.
     ?route1 <route:toAirport> ?airport2.
     ?route1 <route:flight> ?flight1.
     ?flight1 <flight:flightTime> ?flightTime1.
     ?route2 <route:fromAirport> ?airport2.
     ?route2 <route:toAirport> ?airportDest.
     ?route2 <route:flight> ?flight2.
     ?flight2 <flight:flightTime> ?flightTime2.
     ?flight1 <flight:status> <status:Active>.
     ?flight2 <flight:status> <status:Active>.
     BIND((?flightTime1 + ?flightTime2) AS ?flightTime)
     BIND ((2) as ?numFlights)
    }
    UNION
    {?airportSrc <airport:routesTo> ?airport2.
     ?airport2 <airport:routesTo> ?airport3.
     ?airport3 <airport:routesTo> ?airportDest.
     ?airportSrc <airport:status> <status:Active>.
     ?airport2 <airport:status> <status:Active>.
     ?airport3 <airport:status> <status:Active>.
     ?airportDest <airport:status> <status:Active>.
     ?route1 <route:fromAirport> ?airportSrc.
     ?route1 <route:toAirport> ?airport2.
     ?route1 <route:flight> ?flight1.
     ?flight1 <flight:flightTime> ?flightTime1.
     ?route2 <route:fromAirport> ?airport2.
     ?route2 <route:toAirport> ?airport3.
     ?route2 <route:flight> ?flight2.
     ?flight2 <flight:flightTime> ?flightTime2.
     ?route3 <route:fromAirport> ?airport3.
     ?route3 <route:toAirport> ?airportDest.
     ?route3 <route:flight> ?flight3.
     ?flight3 <flight:flightTime> ?flightTime3.
     ?flight1 <flight:status> <status:Active>.
     ?flight2 <flight:status> <status:Active>.
     ?flight3 <flight:status> <status:Active>.
     BIND((?flightTime1 + ?flightTime2 + ?flightTime3) AS ?flightTime)
     BIND ((3) as ?numFlights)
     FILTER (?airportSrc != ?airport3)
     FILTER (?airportDest != ?airport2)
    }
    UNION
    {?airportSrc <airport:routesTo> ?airport2.
     ?airport2 <airport:routesTo> ?airport3.
     ?airport3 <airport:routesTo> ?airport4.
     ?airport4 <airport:routesTo> ?airportDest.
     ?airportSrc <airport:status> <status:Active>.
     ?airport2 <airport:status> <status:Active>.
     ?airport3 <airport:status> <status:Active>.
     ?airport4 <airport:status> <status:Active>.
     ?airportDest <airport:status> <status:Active>.
     ?route1 <route:fromAirport> ?airportSrc.
     ?route1 <route:toAirport> ?airport2.
     ?route1 <route:flight> ?flight1.
     ?flight1 <flight:flightTime> ?flightTime1.
     ?route2 <route:fromAirport> ?airport2.
     ?route2 <route:toAirport> ?airport3.
     ?route2 <route:flight> ?flight2.
     ?flight2 <flight:flightTime> ?flightTime2.
     ?route3 <route:fromAirport> ?airport3.
     ?route3 <route:toAirport> ?airport4.
     ?route3 <route:flight> ?flight3.
     ?flight3 <flight:flightTime> ?flightTime3.
     ?route4 <route:fromAirport> ?airport4.
     ?route4 <route:toAirport> ?airportDest.
     ?route4 <route:flight> ?flight4.
     ?flight4 <flight:flightTime> ?flightTime4.
     ?flight1 <flight:status> <status:Active>.
     ?flight2 <flight:status> <status:Active>.
     ?flight3 <flight:status> <status:Active>.
     ?flight4 <flight:status> <status:Active>.     
     BIND((?flightTime1 + ?flightTime2 + ?flightTime3 + ?flightTime4) AS ?flightTime)
     BIND ((4) as ?numFlights)
     FILTER (?airportSrc != ?airport3)
     FILTER (?airportSrc != ?airport4)
     FILTER (?airportDest != ?airport2)
     FILTER (?airportDest != ?airport3)
     FILTER (?airport2 != ?airport4)
     }
} ORDER BY ?numFlights ?flightTime
The SELECT DISTINCT statement instructs the processor to ignore duplicate returns, then identifies which SPARQL fields should be displayed or converted to external output. ?airportSrc and ?airportDest are the source and destination airports, with ?airportN indicating the Nth airport if it's not one of these two. The next statements 
?airportSrc <owl:sameAs> <airport:SEA>. ?airportDest <owl:sameAs> <airport:ORD>.

identify the source and destination airports for the query, and in most cases would probably be passed parametrically. A condition is then placed to insure that both the source and destination airport are active (the query will return no paths if either airport is closed, as would be expected). 

After this, there are four distinct conditions, correspond to one flight through four flights. The second is suggestive of the assertions invoked:
UNION {?airportSrc <airport:routesTo> ?airport2. ?airport2 <airport:routesTo> ?airportDest. ?airportSrc <airport:status> <status:Active>. ?airport2 <airport:status> <status:Active>. ?airportDest <airport:status> <status:Active>. ?route1 <route:fromAirport> ?airportSrc. ?route1 <route:toAirport> ?airport2. ?route1 <route:flight> ?flight1. ?flight1 <flight:flightTime> ?flightTime1. ?route2 <route:fromAirport> ?airport2. ?route2 <route:toAirport> ?airportDest. ?route2 <route:flight> ?flight2. ?flight2 <flight:flightTime> ?flightTime2. ?flight1 <flight:status> <status:Active>. ?flight2 <flight:status> <status:Active>. BIND((?flightTime1 + ?flightTime2) AS ?flightTime) BIND ((2) as ?numFlights) }

This assumes three airports:?airportSrc, ?airport2 and ?airportDest are involved. The <airport:routesTo> predicate statements retrieve the set of all airports (?airport2) that both are from ?airportSrc and to ?airportDest. These in turn determine the two routes used, and once you know the route, you can retrieve the flight(s) associated with that route. Once you have the flight, you can determine the flight time. The first BIND statement adds the two flight times together and places the result in the ?flightTime variable. The second BIND then sets the ?numFlights variable to the value 2, since there are two flights. 

Things get a bit more complicated once you move beyond two flights. With three flights, there is a possibility for circular references - the trip may circle back to the a previous airport before making the jump to the final airport. It is possible (albeit fairly complicated) to short circuit such paths, but for purposes of this demo, I just checked all potential match configurations and used a FILTER to remove those paths for which this occurred. For example, with four flights, you have the following configurations, assuming that Src and Dest are not the same:
FILTER (?airportSrc != ?airport3) FILTER (?airportSrc != ?airport4) FILTER (?airportDest != ?airport2) FILTER (?airportDest != ?airport3) FILTER (?airport2 != ?airport4)

After these are determined, the resulting ntuple results are ordered first by the number of flights, then by the flight time:

} ORDER BY ?numFlights ?flightTime

This produces (in Jena or similar Triple Store) a table that will look something like this:

airportSrc
airport2
airport3
airport4
airportDest
flight1
flight2
flight3
flight4
numFlights
flightTime
<airport:SEA>
<airport:ORD>
<flight:FL17>
"1"
"7.1"
<airport:SEA>
<airport:DEN>
<airport:ORD>
<flight:FL7>
<flight:FL8>
"2"
"8.0"
<airport:SEA>
<airport:LAX>
<airport:ORD>
<flight:FL25>
<flight:FL12>
"2"
"10.5"
<airport:SEA>
<airport:DEN>
<airport:STL>
<airport:ORD>
<flight:FL7>
<flight:FL21>
<flight:FL19>
"3"
"8.4"
<airport:SEA>
<airport:SFO>
<airport:LAS>
<airport:ORD>
<flight:FL1>
<flight:FL4>
<flight:FL6>
"3"
"8.8"
<airport:SEA>
<airport:SFO>
<airport:LAX>
<airport:ORD>
<flight:FL1>
<flight:FL2>
<flight:FL12>
"3"
"10.2"
<airport:SEA>
<airport:LAX>
<airport:LAS>
<airport:ORD>
<flight:FL25>
<flight:FL5>
<flight:FL6>
"3"
"10.9"
<airport:SEA>
<airport:LAX>
<airport:STL>
<airport:ORD>
<flight:FL25>
<flight:FL23>
<flight:FL19>
"3"
"11.2"
<airport:SEA>
<airport:LAX>
<airport:STL>
<airport:ORD>
<flight:FL25>
<flight:FL24>
<flight:FL19>
"3"
"11.2"
<airport:SEA>
<airport:LAX>
<airport:DEN>
<airport:ORD>
<flight:FL25>
<flight:FL3>
<flight:FL8>
"3"
"11.8"
<airport:SEA>
<airport:SFO>
<airport:LAX>
<airport:LAS>
<airport:ORD>
<flight:FL1>
<flight:FL2>
<flight:FL5>
<flight:FL6>
"4"
"10.6"
<airport:SEA>
<airport:SFO>
<airport:LAS>
<airport:LAX>
<airport:ORD>
<flight:FL1>
<flight:FL4>
<flight:FL11>
<flight:FL12>
"4"
"10.8"
<airport:SEA>
<airport:SFO>
<airport:LAX>
<airport:STL>
<airport:ORD>
<flight:FL1>
<flight:FL2>
<flight:FL23>
<flight:FL19>
"4"
"10.9"
<airport:SEA>
<airport:SFO>
<airport:LAX>
<airport:STL>
<airport:ORD>
<flight:FL1>
<flight:FL2>
<flight:FL24>
<flight:FL19>
"4"
"10.9"
<airport:SEA>
<airport:SFO>
<airport:LAX>
<airport:DEN>
<airport:ORD>
<flight:FL1>
<flight:FL2>
<flight:FL3>
<flight:FL8>
"4"
"11.5"
<airport:SEA>
<airport:LAX>
<airport:DEN>
<airport:STL>
<airport:ORD>
<flight:FL25>
<flight:FL3>
<flight:FL21>
<flight:FL19>
"4"
"12.2"
<airport:SEA>
<airport:LAX>
<airport:SFO>
<airport:LAS>
<airport:ORD>
<flight:FL25>
<flight:FL10>
<flight:FL4>
<flight:FL6>
"4"
"13.1"
<airport:SEA>
<airport:LAX>
<airport:STL>
<airport:DEN>
<airport:ORD>
<flight:FL25>
<flight:FL23>
<flight:FL20>
<flight:FL8>
"4"
"15.7"
<airport:SEA>
<airport:LAX>
<airport:STL>
<airport:DEN>
<airport:ORD>
<flight:FL25>
<flight:FL24>
<flight:FL20>
<flight:FL8>
"4"
"15.7"
<airport:SEA>
<airport:DEN>
<airport:STL>
<airport:LAX>
<airport:ORD>
<flight:FL7>
<flight:FL21>
<flight:FL22>
<flight:FL12>
"4"
"16.1"

Note that this app doesn't concern itself with whether a route is feasible (going from Seattle to Saint Louis to Los Angeles before going to O'Hare in Chicago doesn't really make a lot of sense, for instance) but the sorting indicates that this is probably the least optimal path to take.

A full scale logistical application would certainly take into account several other factors - scheduling, seat availability, specific aircraft bindings, windspeed, ticket costs and the myriad dozens of other variables that are involved in any type of logistics problem, but the point here is not to build such an application ... it is only to show that such is possible through the use of semantic databases.

Next week, I hope to take this SPARQL application and show how it can then be bound to both a MarkLogic and node.js web application.

Kurt Cagle is an author and information architect for Avalon Consulting, LLC. His latest book HTML5 Graphics with SVG and CSS3, will be available from O'Reilly Media soon. 

No comments:

Post a Comment