Monday, December 31, 2012

Data Trends of 2013


Ten years ago, I began an annual tradition of posting my analysis for the coming year. I've missed a few years in there, and of course, everyone and their brother now does this, but I find that this exercise helps me to focus on what trends I see coming up and lets me adjust to them accordingly.

This year was notable for its apocalyptic tenor. The Mayan calendar ended a major cycle a couple of weeks ago, and at that point we transitioned from one Bak'Tun to another. The Mayans were masters of the Long Now - they thought in 8,000 year chunks of time. It's worth noting that the Mayan civilization lasted longer than the American one has to date, so perhaps they were on to something.

When I've done these in the past, I've usually had a mix of the technical and non-technical. Given that I am currently running two blogs - one on semantics, the other more on history, politics and economics - I decided that I'd focus on the data world for this particular analysis of trends, and will post my more non-technical analysis on the Metaphorical Web Blog.

I did do an analysis last year for 2012, covered here. Overall, it proves that I should probably stick to the areas that I know best - data services, social media, semantics, and the like. My analysis was pretty good, but I still made a couple of really bad calls - Steve Ballmer is still CEO of Microsoft, though I think that with the release of Windows 8 my call that it is sliding into irrelevance may be truer than not. I hope not - I think there is a lot of potential for the company, but it's playing too conservatively in a space that has generally rewarded risk-taking. I also figured that RIM would be out of business by now. They're still around, but they are definitely struggling.

So, given those lumps, we'll see how well I fare going into 2013:

1. Big Data Fizzles. Every so often, a term gets thrown out by some industry consulting group such as Gartner or Forrester that catches fire among marketing people. Web Services, Service Oriented Architecture (SOA), Enterprise Service Buses (ESB), AJAX and more all come to mind. There's usually a kernel of a technical idea there, but it is so heavily smothered in marketing BS that by the time it gains widespread currency, the term exists primarily so that a product website designer can put another bullet-point in a list of features that make product/service X something new and improved.

The term Big Data very readily fits into that list. A few years back, someone realized that "data" had slipped the confines of databases and was increasingly streaming along various wires, radio waves and optical pulses. Social media seems to be a big source of such data, along with a growing number of sensors of various sorts, and as a consequence it looked like a windfall for organizations and software projects that could best parse and utilize that data for various and sundry data mining (yet another buzzworthy phrase) purposes. All of a sudden all of the cool kids were learning Hadoop, and everyone had a Hadoop pilot project or connector out.

The problem with this is that most map/reduce type solutions of any value are essentially intended to create indexes into data. An index takes one or more keys and ties them into a record, document or data file. Certainly commodity parallel processing can do other things, but most of these, at some point or another, still rely upon that index. The other thing that map/reduce does is allow for the creation of new documents that contain enriched metadata about existing document contents. Document enrichment in general isn't as sexy - it's difficult, it's far from perfect, especially for general domain documents, and it requires a fair amount of deep pre-analysis, dictionary development and establishment of grammatical rules. Whether commercial software or something such as GATE and JADE, most such tools are already designed to work in parallel on a corpus of documents. Hadoop simply makes it (somewhat) easier to spawn multiple virtual CPU threads for the processing of such documents.
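
To make the index-building point concrete, here is a minimal sketch of a map/reduce-style inverted index in plain Python rather than Hadoop; the corpus, document names and contents are made up for illustration, and a real job would distribute the map and reduce phases across machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs for each word in a document."""
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(pairs):
    """Collect the doc ids for each term into an inverted index."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return index

# Hypothetical corpus standing in for a real document store.
corpus = {
    "doc1": "big data fizzles as pilot projects fail to scale",
    "doc2": "hadoop spawns parallel threads for document enrichment",
}

pairs = (pair for doc_id, text in corpus.items() for pair in map_phase(doc_id, text))
index = reduce_phase(pairs)
print(index["document"])   # -> {'doc2'}
```

The output is exactly what the paragraph describes: a structure that takes a key and ties it back to the records that contain it.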

This limited application set, coupled with the complexity of setting such systems up in the first place, will make such Big Data databases considerably less attractive as pilot projects fail to scale. Meanwhile, vendors in this space that are already working with high speed, high volume systems are tailoring their particular wares to provide more specialized, more highly scalable solutions than the one-size-fits-all approach that the Big Data paradigm is engendering.

On a related note, I've been to a fair number of conferences this year, and the thing I'm seeing at this stage among customers at the enterprise level is that most senior IT managers see data mining as a lower priority than data enrichment, design or utilization - if there's benefit to be gained from the effluent stream of data processes they may think about exploring it, but for the most part these people are more concerned about the primary, rather than downstream, use of the data being produced in their systems.

What I anticipate will happen this year is that you'll have a few desultory Big Data products that will simply not produce the excitement that marketers hope for, and the term will begin to fade from use. However, the various problems that the Big Data domain has typically tried to pull together will each become more distinct over time, splitting into various regimes - high volume data processing, distributed data management, heterogeneous document semantification and data visualization. Each of these will have a part to play in the coming year.

2. Data Firehose Feeds Emerge as a Fact of Life. Twitter feeds, Facebook activity feeds, notifications, call them what you will, processes that generate high frequency data updates are becoming the norm. These include both social media updates, where you may have millions of contributors adding to the overall stream of messages at a rate of tens of thousands per second, and arrays of distributed sensors, each reporting back bits of information once a second or so. This sensor traffic is exploding - cell phones reporting positioning information, traffic sensors, networked medical devices, even mobile "drone" information - all of these are effectively aggregated through data ports, and then processed by load balanced servers before being archived for post processing.

Such feeds perforce have three major aspects: data generation, which is now increasingly distributed; data ingestion, in which the data is entered into the system, given an identity within that system and provided with initial metadata such as receipt time, service processors and similar provenance information; and finally data expression, which routes that information to the appropriate set of parties. In many cases there is also a data archiving aspect, in which the information is archived in an interim or permanent data store for post processing. This is especially true if the data is syndicated - a person requests some or all items that have arrived since a given point in time. Syndication, which is a pull based publishing architecture, is increasingly seen as the most efficient mechanism for dealing with data firehoses, because it gives the server the ability to schedule or throttle the amount of data being provided depending upon its load.
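
A rough sketch of that ingestion-plus-syndication pattern, with assumed names and in-memory stand-ins for real queues and stores: items get an identity and receipt-time provenance on the way in, and consumers pull everything since a given point in time, with the server free to cap the batch size.

```python
import itertools, time

_ids = itertools.count(1)
archive = []   # illustrative in-memory stand-in for an interim data store

def ingest(payload, source):
    """Data ingestion: assign an identity and initial provenance metadata."""
    item = {
        "id": next(_ids),
        "received": time.time(),   # receipt time
        "source": source,          # service processor / provenance
        "payload": payload,
    }
    archive.append(item)
    return item["id"]

def pull_since(timestamp, limit=100):
    """Data expression via syndication: the caller pulls, the server throttles."""
    matches = [item for item in archive if item["received"] > timestamp]
    return matches[:limit]

ingest({"temp_c": 21.4}, source="sensor-17")
print(pull_since(0, limit=10))
```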

I see such an approach applying not just to social media feeds (which are almost all now syndicated) but also to sensor feeds and data queries. One characteristic of such a syndicated approach is that if you have distributed data sources (discussed next) then when you make a query you are essentially sampling all of those data sources which are in turn retrieving constrained sets of data, either by date, by relevance, by key order or by some similar ordering metric. The aggregating server (which federates the search) caches the response, providing a way to create ever more representative samples. What you lose in this process is totality - you cannot guarantee at any point that you have found all, or even necessarily the best fit data to your query - but what you gain is speed of feedback.
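
Here is a small sketch of the aggregating-server idea, under assumed names: each source callable stands in for a remote endpoint returning a constrained, ordered slice, and the aggregator merges and caches a sample rather than guaranteeing the total answer set.

```python
import heapq

_cache = {}   # the aggregating server keeps earlier samples around

def federated_sample(sources, query, per_source=5):
    """Ask each source for a small ordered slice and merge into one sample.

    `sources` is a list of callables standing in for remote endpoints; each is
    assumed to return (timestamp, record) pairs. The merged result is a
    representative sample, not a guarantee of the total or best-fit answer.
    """
    key = (query, per_source)
    if key not in _cache:
        slices = [source(query)[:per_source] for source in sources]
        _cache[key] = heapq.nlargest(per_source,
                                     (row for rows in slices for row in rows),
                                     key=lambda row: row[0])
    return _cache[key]
```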

A major change that's happening in this space as well is the large scale adoption of NoSQL databases - XML databases such as MarkLogic, eXist, BaseX, Zorba, XDB 10, and others, CouchDB and MongoDB for JSON data stores in conjunction with node.js, graph databases, columnar databases, RDF triple stores and so forth. The leaders in this space are trying to bridge these, providing data access and query services that can work with XML, JSON, SQL, Name/Value Pairs and RDF, as well as binding into Hadoop systems and otherwise becoming cross compatible. Additionally there's a growing number of developers and standards gurus that are trying to find commonality between formats, and I suspect that by the end of 2013 and into 2014 what you will see is the rise of omni-capable data services that allow you to query with XQuery or SPARQL or a JavaScript API, and to then take the output of such queries and bind them into appropriate output transparently. A lot of these will be former XML databases seeking to diversify out of the XML ghetto, since overall they probably have the most comprehensive toolsets for dealing with richly indexed non-RDBMS data.
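
The "bind the output transparently" idea boils down to keeping one internal result and rendering it into whichever format the caller asks for. A minimal sketch, using only the standard library and a made-up result row:

```python
import json
import xml.etree.ElementTree as ET

def bind_output(rows, fmt="json"):
    """Render one internal query result as JSON or XML, per the caller's choice."""
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "xml":
        root = ET.Element("results")
        for row in rows:
            item = ET.SubElement(root, "result")
            for field, value in row.items():
                ET.SubElement(item, field).text = str(value)
        return ET.tostring(root, encoding="unicode")
    raise ValueError("unsupported format: " + fmt)

# Made-up result standing in for the output of an XQuery or SPARQL query.
rows = [{"title": "Data Trends of 2013", "year": 2012}]
print(bind_output(rows, "xml"))
```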

3. Asset and Distributed Data Management Grows Up. One of the central problems that has begun to emerge as organizations become distributed and virtual is the task of figuring out where both the tangible assets (desks, computers, printers, people, buildings, etc.) are and where the intangible assets (intellectual property, data, documents) are at any given time.

One of the more significant events of 2012 was the permanent mandate for the adoption of IPv6 in June of this year. While the most immediate ramification of this is that it ensures that everyone will be able to have an IP address, a more subtle aspect is that everything will have its own IP address - in effect its own networked name - as well. The IPv6 address schema provides support for 3.4×10^38 devices, and addresses consist of eight colon-separated groups of hexadecimal digits, such as 2001:0db8:85a3:0042:1000:8a2e:0370:7334. To put this into perspective, if each person on Earth were assigned an IPv6 address, each person would in turn be able to assign 43,000,000,000,000,000,000,000,000,000 different IPv6 addresses to everything he or she owned.
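
The back-of-envelope arithmetic is easy to reproduce with Python's standard ipaddress module; the population figure below is an assumption, and the exact per-person count depends on it.

```python
import ipaddress

total = 2 ** 128                 # size of the IPv6 address space, about 3.4e38
population = 7_000_000_000       # rough 2012 world population (assumed)
print(f"{total:.2e} addresses, roughly {total / population:.2e} per person")

# Addresses are written as eight colon-separated groups of hex digits.
addr = ipaddress.IPv6Address("2001:0db8:85a3:0042:1000:8a2e:0370:7334")
print(addr.exploded)
```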

One interesting consequence of this is that you can also assign IP addresses to relationships. For instance, suppose that a computer has an IP address of 593A:... (I'll truncate to one hex number for conciseness). If you (42AC:..) were an employee (8821:..) of Company X (EAC1:..) and were assigned (8825:..) that computer owned by (8532:..) Company X, then you could express all of these relations as sets of three IP addresses - "you are assigned this computer" becomes <42ac:> <8825:> <593a:>, and "this computer is owned by Company X" becomes <593a:> <8532:> <eac1:>. These are semantic assertions, and with the set of all such assertions you can manage physical assets. Bind the IP address to a physical transmitter (such as an active RFID chip) and you can ask such questions as "Show me where all assets that are assigned to you are located," or even "Show me the locations of all desks of people who currently report to me, and who is currently seated at them."
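
A tiny in-memory sketch of those assertions, using the truncated identifiers from the paragraph above as opaque names; a production system would use a real triple store and SPARQL rather than a Python set.

```python
# The truncated identifiers from the paragraph above, used as opaque names.
YOU, EMPLOYEE, COMPANY_X = "42ac:", "8821:", "eac1:"
ASSIGNED, OWNED_BY, COMPUTER = "8825:", "8532:", "593a:"

triples = {
    (YOU, EMPLOYEE, COMPANY_X),        # you are an employee of Company X
    (YOU, ASSIGNED, COMPUTER),         # you are assigned this computer
    (COMPUTER, OWNED_BY, COMPANY_X),   # this computer is owned by Company X
}

def match(subject=None, predicate=None, obj=None):
    """Return every assertion matching the given pattern (None = wildcard)."""
    return [(s, p, o) for s, p, o in triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)]

# "Show me all assets that are assigned to you."
print([o for _, _, o in match(subject=YOU, predicate=ASSIGNED)])
```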

The same thing applies in the case of virtual assets such as animation characters or marketing materials. Every document, every image, every video (and even slices of that video) can be given unique addresses (indeed, you could, with segmenting, give each ASSERTION its own IP address). This is one case where I see Semantic Web technologies playing a huge role. If every asset (or resource) could also report on its own properties in real time (or can update a database in near real time) then you can not only have a real time inventory, but you can set triggers or conditions that would run when certain assertions are made. I expect to see such systems start to be implemented in 2013, though I think it will likely not be until mid-decade before they are commonplace.
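
The trigger idea can be sketched as callbacks keyed by predicate that fire whenever a matching assertion arrives; the predicate and asset names here are hypothetical.

```python
from collections import defaultdict

triggers = defaultdict(list)   # predicate -> conditions to run on new assertions
store = set()

def on(predicate, callback):
    """Register a condition to run whenever an assertion with this predicate arrives."""
    triggers[predicate].append(callback)

def assert_triple(subject, predicate, obj):
    """Record an assertion (e.g. a real-time property report) and fire any triggers."""
    store.add((subject, predicate, obj))
    for callback in triggers[predicate]:
        callback(subject, predicate, obj)

# Hypothetical: alert whenever any asset reports a new location.
on("locatedAt", lambda s, p, o: print(f"asset {s} is now at {o}"))
assert_triple("asset-593a", "locatedAt", "building-7/floor-2")
```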

4. Documents Become Meaningful. Speaking of the Semantic Web, semantics should continue to play a bigger part in a more traditional domain - intelligent documents. From both personal and anecdotal experience, one of the growth areas right now for IT services is document analytics. In basic terms, this involves taking a document - anything from a book to a legal contract to someone's blog - manually or automatically tagging that document to pick out relevant content in a specific context, then using those tags to identify relationships with other documents in a given data system.
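
A crude illustration of the tagging step, assuming a made-up domain dictionary and documents: terms are picked out against the dictionary, and two documents are considered related when they share a tagged term. Real enrichment pipelines use far richer lexicons plus grammatical rules, as noted earlier.

```python
# A made-up domain dictionary; real pipelines are far richer than this.
dictionary = {"hearing": "LegalEvent", "witness": "Role", "resolution": "Document"}

def tag(text):
    """Return the set of (term, tag) pairs found in a document."""
    words = text.lower().split()
    return {(w, dictionary[w]) for w in words if w in dictionary}

docs = {
    "testimony-1": "the witness spoke at the hearing",
    "bill-77": "a resolution discussed at the same hearing",
}
tagged = {name: tag(text) for name, text in docs.items()}

# Two documents are related if they share at least one tagged term.
print(tagged["testimony-1"] & tagged["bill-77"])   # -> {('hearing', 'LegalEvent')}
```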

One effect of this is that it makes it possible to identify content within a collection (or corpus, as the document enrichment people usually say) that is relevant to a given topic without necessarily having keywords that match this content. Such relevancy obviously has applicability with books and articles, but it can apply to most media, to knowledge management systems, to legal documents and elsewhere. For instance, by using semantic relationships and document enrichment on Senate testimony, I and a group of other developers from Avalon Consulting were able to determine what the topics of that testimony were, who spoke for or against the subject of the hearings and what roles they played, and could consequently link these to particular bills or resolutions.

Such tagging in turn makes navigation between related documents possible. This is one of the unexpected side effects of semantic systems. A relationship provides a link between resources. Any link can be made into a hyperlink, especially when resources can be mapped to unique addresses. Additionally, if multiple links exist (either inbound or outbound) for a given resource, then it should be possible to retrieve the list of all possible items (and their link addresses) that satisfy this relationship. This is again a form of data feed. Human beings can manually navigate across such links, via browsers, while machines can spider across the link space. The result tends to be a lot like wikis (indeed, semantic wikis are an area to pay very close attention to, as they embody this principle quite well).

5. Data Visualization. Data visualization is an intriguing domain. All data has distinct structures, although the extent of such structures can vary significantly. Most data, especially in the era of "Big Data", also exists principally to satisfy a specific application - Twitter messages and Facebook Activities each exist primarily to facilitate those specific web applications, and only secondarily do they provide additional information - either through analysis of positional or temporal data or through the existence of key words and phrases within user generated content.

With the rise of graphical tools - Canvas, SVG, and Web3D - on browsers, I think we're entering a new age of data visualization. Before this, most data visualization tools produced static imagery, with the map space being a notable exception, but the idea of interactive visualizations is now becoming increasingly feasible. What happens when you can create graphics that give you dynamic drill-down capabilities, that let you visualize data sets in real time in three dimensions, and when you can bundle clusters of data with hypermedia links and windows? In the past couple of years I've seen glimpses of this - Web3D in particular is an intriguing area that I think is only just beginning to make its way into common usage, and SVG has reached a point of stability across browsers that makes it attractive for web developers to start using it for common web applications.
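
SVG's appeal is that it is just markup generated from data; here is a minimal sketch that emits a bar chart as an SVG string (built server side here, though the same markup can equally be assembled in the browser). The values are made up.

```python
def bar_chart_svg(values, bar_width=30, scale=4):
    """Emit a minimal SVG bar chart for a list of numeric values."""
    height = max(values) * scale
    bars = []
    for i, v in enumerate(values):
        bars.append(
            f'<rect x="{i * bar_width}" y="{height - v * scale}" '
            f'width="{bar_width - 4}" height="{v * scale}" fill="steelblue"/>'
        )
    width = len(values) * bar_width
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">' + "".join(bars) + "</svg>")

print(bar_chart_svg([12, 30, 22, 8]))   # made-up values; paste into any modern browser
```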

6. The Android Mobile Operating System Becomes Dominant. Something happened toward the end of this year that I found rather remarkable. Microsoft released a new version of their Windows operating system, Windows 8, for both the PC and a rather expensive "tablet" system. The general reaction was "meh" - there simply was not a lot of interest in it, bad or good. Sales of Windows 8 are far below expectations, as are the sales of Surface - the aforementioned tablet PC. Why? Because right now, after a decade of Windows vs. Mac with a little side-helping of Linux, the "real" OS wars are between Apple's iPhone/iPad operating system and Google's Android system, ironically based on Linux. Desktop PCs these days are fairly tiny boxes that are becoming specialized as media servers, while even laptop sales are eroding in the face of tablets.

In many respects this was a completely out-of-the-blue win for Linux. Booksellers have become players in the tablet game - both Amazon's Kindle Fire and Barnes & Noble's Nook are filling an interesting ecological niche - and I suspect that other media interests are watching sales this Christmas of these "branded" tablets. It's not hard to envision a Disney or Warner Brothers tablet. Meanwhile, companies such as Comcast are building security apps for handheld tablets and bundling them with home security systems.

Steve Jobs' death in late 2011 may ultimately have dealt a mortal blow to Apple's dominant position in that sector. Without Jobs' reality distortion field, and after some embarrassing gaffes with Apple Maps that marred the iPhone 5 release, Apple appears weakened going into 2013, and may end up still being a major, but not the dominant, player in that space by this time next year.

I expect these trends to continue into next year - Android based systems will become ubiquitous (and will find their way into most home appliances, home electronics and automobiles), possibly leading to a confrontation between Google and vendors over control of the system toward the end of the year or early into 2014.

7. Google (and Other) Glasses Become Stylin'. The computer has been losing pieces of itself for a while. The mouse is going extinct - first replaced (or augmented) by touch pads, then incorporated into the screen itself in tablets and hand-helds. Ditto the speakers, which now exist primarily as tiny earbuds. The keyboard lasted a while longer, and even now is somewhat useful for working with pads, but virtual (and increasingly haptic) keyboards on tablets are becoming the norm. The computer box is now part of the display, and network, USB and HDMI connections are increasingly wireless.

Yet you still have to hold them. This is where the next major revolution in computing will come from. Put those same screens in sufficiently high resolution into glasses (or possibly even shining them directly onto the retina) and you no longer need to hold the computer. Computer glasses have existed in a number of different forms for a couple of years, but they were typically highly specialized, and had some significant limitations - poor resolution, performance issues, bandwidth problems.

I believe this is the year that such glasses will become consumer products, and they will be built using tablet-based technology. These will give true stereoscopic vision, because you can send slightly different perspectives to each lens in order to create parallax interpretations. You can create overlays on what's in front of you by changing the percentage of incoming light (from front facing cameras over each lens) vs. the graphical overlays. Coupled with ear and throat pieces, you have everything you need for both input and output, possibly in conjunction with either a glove or a bracelet that would measure electromagnetic signals from nerves in the forearm to determine the positions of various fingers.
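
The parallax arithmetic behind "slightly different perspectives to each lens" is simple similar-triangles geometry; the eye-separation and virtual image-plane figures below are assumptions, not measurements from any particular device.

```python
def screen_disparity(object_distance_m, eye_separation_m=0.065, image_plane_m=2.0):
    """Horizontal offset between the left- and right-eye renderings of a point.

    Zero at the virtual image plane, approaching the eye separation for very
    distant objects; negative values pop the object out in front of the plane.
    Both constants are assumed figures.
    """
    return eye_separation_m * (1 - image_plane_m / object_distance_m)

for d in (0.5, 2.0, 10.0):
    print(f"{d:>5} m -> {screen_disparity(d) * 1000:6.1f} mm of disparity")
```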

It'll take a while - 3-5 years for this to fully take hold - but I'm expecting that toward the end of this year these will become available to the general public, and they will have a profound impact.


8. Whither TV? The media revolution continues as well, and one casualty of that is consensus television. There's a growing number of "asynchronous" users - people who may end up watching shows on their computers or media centers long after they are broadcast, often watching a season of shows over the course of an evening or weekend. This in turn is having the effect of changing the nature of these broadcasts - reflecting a rise of broader narratives rather than the stand-alone episodes that were so common until comparatively recently.

These shows are also increasingly being encoded with metadata - not just closed caption overlays but significant blocks of information about particular scenes, actors, or the production that are in turn able to tie into the web or mobile spaces. The media industry is coalescing around several standards, such as EIDR and Ultraviolet, that also allow for data interchange, making it easier for both traditional and digital media distributors to access and manage the process flow for media content generated by these organizations, including Walt Disney, Warner Bros., Comcast and a number of others.

Such standardization will have a profound effect upon both media broadcast and interactive "game" applications. Indeed, the mobile app and gaming space, which is ripe for a major consolidation, will likely be transformed dramatically as the distinction between "game" and media production continues to disappear. In the past, media-oriented games were typically made out of band from the media production itself, usually by a third party, in order to both take advantage of completed footage and to ensure there was sufficient audience interest to justify the cost of gamification. With EIDR and Ultraviolet, on the other hand, game tie-ins are generally created in conjunction with the production of the media itself, and the ability to both tag and identify specific resources uniquely makes it possible to build a strong framework, then import media and metadata into that framework to support the games.

It means that production companies may very well alternate between producing purely theatrical release media, television media, gaming media or some combination of the three, using the standards as a mechanism to most effectively multipurpose individual content. This also makes rights management, such a critical part of content production, far simpler to track.

One final note while I'm on the topic: Kickstarter (and other services of that kind) will likely face some hard times next year as the number of underfunded projects continues to pile up. Longer term, however, it will likely be an increasingly effective model for funding "narrowcast" projects and pilots that can be tested with a specialized audience. My guess is that the twenty-first century studios (and this isn't an allusion to Fox here) are already watching Kickstarter closely, more to get a feel for what seems to be appealing to various markets than for any specific talent, though individual production houses may very well see Kickstarter as an incubator from which to grab potential actors, directors, cinematographers and effects people.

9. ePublishing Hits Its Stride. It is hard to believe, looking at recent downloads and sales figures, that publishing was on the ropes as little as three years ago, though certainly distribution channels have changed. Barnes and Noble has become the last major physical bookseller standing, but this is only because its competitors are now e-retailers ... Apple (via iTunes), Amazon and Google. eBook sales exceeded hardback sales early last year, and are on a pace to exceed softcover sales sometime in 2013, and it can be argued that B&N is rapidly becoming an e-retailer that still maintains a physical presence rather than vice versa.

This trend will likely only solidify in 2013. This year saw a number of authors who jumped into prominence from the eBook ghetto, as well as an explosion of eBooks as everyone and their brother decided to get rich quick with their own eBooks. As with most creative endeavors that have undergone a sea change in the presence of the Internet, it's likely that an improving economy and the work involved in putting together 100,000+ words for a novel (for relatively mediocre returns) will winnow a lot of people out of the field in the next few years, but in some respects we're in the gold rush phase of the eBook revolution.

There are three standards that will likely have a profound impact on eBook publication moving forward - the ePUB 3 specification, which establishes a common format for incorporating not only text layout but also vector graphics and media; the rise of HTML5 as a potential baseline for organizing book content, which informs the ePUB 3 specification fairly dramatically; and the increasing prevalence of Scalable Vector Graphics (SVG). Moreover, XML-based workflows for production of both print and eBooks have matured considerably over the last few years, to the extent that the book publishing industry has a clear path from content generation to automated book distribution. This benefits not only the eBook industry, but also the print on demand industry - the per-copy cost to produce one book or ten thousand is the same (though the fees to do so limit that somewhat).
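
To show how mechanical the packaging side of such a workflow can be, here is a minimal sketch of an ePUB container: it is an ordinary zip archive with a fixed layout, where the mimetype entry comes first and is stored uncompressed, and META-INF/container.xml points at the package document. The content and metadata below are placeholders only; a valid EPUB 3 file needs a complete package document with metadata, manifest and spine.

```python
import zipfile

container_xml = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

with zipfile.ZipFile("sketch.epub", "w") as epub:
    # The mimetype entry must come first and be stored uncompressed.
    epub.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
    epub.writestr("META-INF/container.xml", container_xml)
    # Placeholders: a real package document and content documents go here.
    epub.writestr("package.opf", "<!-- package document: metadata, manifest, spine -->")
    epub.writestr("chapter1.xhtml", "<html><body><p>Placeholder chapter.</p></body></html>")
```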

I'm taking that jump in 2013, both with the SVG Graphics book I'm nearly finished with now and a couple of novels that I've had in the works for a year or so. While the challenge of writing a fiction book is one reason I'm doing it, another is that it gives me the opportunity to see the benefits and pitfalls of POD and eBook publishing first hand.

One other key standard to watch in 2013 is PRISM - the Publishing Requirements for Industry Standard Metadata. This standard will affect the way that magazines are produced by establishing both a working ontology and tools for using XML and HTML5 to handle complex layout of pages for magazines in a consistent manner, along with extensions for handling DRM and metadata output. As with media metadata standards, PRISM makes it possible for magazine publishers to take advantage of XML publishing workflows in a consistent manner, extending content and template reuse to short form magazine and news print content.

Magazines and newspapers have differed from books in production for years, primarily because most newspapers have required far more sophisticated layouts than can be handled via automated methods. PRISM is an attempt to rectify that by making it possible to more effectively bind articles to templates, handle overflow content, and deal effectively with images at differing resolutions. Because this is a specification that is supported by a large number of print media publishers, PRISM may also make it easier to manage subordinate rights, reprints and syndicated content.

10. IT Employment Trends. At the end of 2011, the overall trend for IT was positive, but it was still centered primarily in the government sector and consultant hiring rather than full time employment. The recent political games in Washington will likely cause some confusion in tax codes and structures that will make full time hires more difficult in 2013Q1 (as it has in 2012Q4) but, outside of certain sectors, it's likely that general IT hiring for permanent and long term contracting positions in the private sector will actually increase markedly by 2013Q2.

Part of the reason for this is that many of the contractor positions from 2012 were senior level - architects, requirements analysts and the like who tend to be employed most heavily at the beginning of projects, along with a small staff of contract developers for pilot programs and groundwork coding. These projects in turn lead to more significant programmer hiring, generally for long term projects where it makes sense to hire developers for windows of one year or more (at which interval the financial and stability benefits of full time hiring exceed the greater flexibility but higher costs of hiring consultants).

Having said that, there are some dark clouds. The mobile app market is saturated - apps are easy to create, the market is crowded, and margins in general have dropped pretty dramatically. I expect to see a lot of consolidation - app production companies going out of business or getting bought up by more established firms, as well as competition from larger non-tech companies that nonetheless see advantages in establishing a presence. Short term, this is going to produce a lot of churn in the marketplace, though it's likely that reasonably talented programmers and content producers will continue to work if they're willing to remain adroit.

It's worth noting that this has been taking place in a lot of new tech fields over the last year, so for it to happen to mobile systems is not that surprising. Poor business execution, failure to read the market properly, impatient (or rapacious) investors or poor development practices are not uncommon at this stage of growth. On the flip side, as areas such as advanced materials engineering, bioinformatics, energy systems (both traditional and alternative), commercial aerospace ventures, transportation telematics, interactive media publishing and robotics realign, consolidate and become more competitive, this will also drive growth for related IT support work.

Federal initiatives that are already underway (such as ObamaCare) will also mature, though I expect the pace of hiring in these areas will slow down (as existing projects reach completion) and the consequences of the last two years of relative Congressional inactivity come home to roost. Cuts to existing programs in both the Defense Dept. (due both to the "fiscal cliff" and the drawdown of troops in Afghanistan) and in research and social programs will also likely reduce IT work in the Maryland/Virginia area, which up until now has recovered more quickly than the rest of the country from the economic collapse in 2008/9.

On the other hand this too shall pass. While it hasn't been perfect (and there are still bubbles in certain areas, especially education) there are signs that the economy is beginning to recover significantly and shift into a new mode of development, one with a stronger emphasis on the secondary effects of the combined information and materials sciences innovations of the last couple of decades. However, because these are second order jobs from the information revolution, it's worth noting that core languages and tools such as JavaScript (and CoffeeScript), Hadoop, declarative languages such as Haskell and OCaml, data processing languages such as XQuery, NoSQL/node.js, SPARQL 1.1, and the like are likely to see continued growth, while more traditional languages - Java and C++, for instance - will be seen primarily in augmentative roles. Additionally, I've seen the rise of domain specific languages such as R, Mathematica and SciLab, as well as cloud services APIs (which are becoming increasingly RESTful) factoring into development.

The next wave of "developers" is as likely as not coming from outside the world of pure IT - geneticists, engineers, archivists, business analysts (and indeed analysts in any number of areas), researchers in physics and materials science, roboticists, city planners, environmental designers, architects and so forth, as well as creative professionals who have already adopted technology tools in the last decade. This will be an underlying trend through much of the 'teens - the software being developed and used will increasingly be domain specific, with the ability to customize existing software with extensions and toolkits coming not from dedicated "software companies" but from domain specific companies, with domain experts often driving and even developing the customizations to the software as part of their duties.

Summary
My general take for 2013 is that it is unlikely that radical new technologies will be introduced (though I think computer glasses have some interesting potential); rather, this will be a year of consolidation, increased adoption of key industry standards, and incremental development in a large number of areas that have been at the prototyping stage up to now.
