Draft
Data Interoperability across the Enterprise -
Why Current Technology Can’t Achieve It
January 29, 2007
This paper was drafted and reviewed by members of the
Cross-Domain Semantic Interoperability (CDSI) Working Group, http://www.visualknowledge.com/wiki/CDSI,
and its parent organization, Semantic Interoperability Community of Practice
(SICoP), http://colab.cim3.net/cgi-bin/wiki.pl?SICoP
. Editor: James Schoening, U.S. Army, james.schoening@us.army.mil
Abstract
Enterprises
need data interoperability across all of their own systems, and with
external systems. The Federal CIO
Council Strategic Plan, FY 2007 – 2009, calls for, “interoperability across
Federal, state, tribal, and local governments, as well as partners in the
commercial and academic sectors.” [1]
Current
technologies such as XML, metadata, RDF, OWL, and stand-alone
ontologies can achieve data interoperability, but only within domains or
Communities of Interest (COI), or between limited numbers of these. These
technologies cannot achieve data interoperability across the many domains found
in most large enterprises. To achieve this goal, leading organizations will
need to invest in emerging technologies, such as common upper, middle, and
reference ontologies, and mature them to where they are ready for
enterprise-wide implementation.
Use Case
The following Use Case provides one
example of how cross-enterprise data interoperability could provide value.
“An officer on the battlefield is
directed to halt one course of action and pursue another objective, which had
not been anticipated and for which there is no plan. He has 15 minutes
to issue new orders. He spends 5 minutes plugging new parameters into
his computer system, which activates
scores of agents to access thousands of data sources (many of them
unanticipated sources found through searches) and returns 3 alternative plans
in 5 minutes, giving him 5 minutes to make the decision. Agents
search and analyze data on items such as weather, supplies, cost, training
readiness, local customs, and maps, and then integrate findings into
alternatives.”
Intended Audience
This paper is intended for
those with a stake in data interoperability across enterprises or between
independently developed systems. It requires only a basic understanding of data
interoperability requirements and technology.
Table of Contents
Part I: Why Current Technology Can't Achieve Enterprise-Wide Data Interoperability
A. Independently Developed Data Models are not Interoperable
B. Developing Larger Domains Does Not Scale
C. Cross-Domain (Many-to-Many) Mapping Does Not Scale
D. Standard Languages Do Not Produce Interoperable Data
E. World Wide Web Consortium (W3C): Doing Much – But Not This
F. Data Interoperability has Multiple Layers – All Critical
G. What are ‘Current Technologies?’
H. Part I Conclusion
Part II – A Way Forward for Enterprise Data Interoperability
A. Early, Ongoing, and Independent Assessment
B. Distributed and Open Innovation
C. Candidate Technical Solutions
D. Many Enabling Technologies
E. Part II Conclusion
References
Appendix 1: DOD Data-Sharing Strategy
Appendix 2: Software Readiness Level
Part I: Why Current Technology Cannot Achieve
Enterprise-Wide Data Interoperability
A. Independently
Developed Data Models are not Interoperable
All independently developed data models, including database
schemas, data dictionaries, metadata, taxonomies, and ontologies, are invariably
non-interoperable. Each has a unique perspective, purpose, and set of
constraints, which lead to divergence. For example, two data models may both contain the data element
‘Person.’ One requires the ‘Person’ to
be a specific instance, with a name and ID number, while the other refers to any
human being, such as the ‘next person in line.’ Even if the same individual independently develops multiple
models, they will still not interoperate. Readers are invited to compare the names
of their office email folders, paper folders, hard-drive folders, and browser
favorites folders; they will likely find different labels with different
meanings.
Common practice
demonstrates this fact: it is why information workers spend so much time
interpreting data and re-entering it into other applications, and why custom
interfaces must be developed to connect computer systems. Data developed by different
people, for different purposes, and under different constraints invariably has
different structure and meaning.
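The ‘Person’ example above can be made concrete with a minimal Python sketch (the field names are invented for illustration, not drawn from any actual data model): the two models share a label, but neither structure nor meaning.

```python
# Model A: 'Person' must be a specific, identified individual.
model_a_person = {"name": "Jane Doe", "id_number": "123-45-6789"}

# Model B: 'Person' is any human being filling a role; no identity required.
model_b_person = {"role": "next person in line"}

# A naive label-based match sees the element name 'Person' in both models,
# yet the records share no fields and carry different meanings.
shared_fields = set(model_a_person) & set(model_b_person)
print(shared_fields)  # set(): the common label conceals divergent structure
```

No automated comparison of these two records can recover the fact that both were labeled ‘Person’ for different reasons; that knowledge lives only in the heads of the two modelers.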
B. ‘Developing Larger
Domains’ Does Not Scale
Broader coordination can lead to interoperability within
larger domains, but can’t scale to the levels of a large enterprise. For example, a Finance and a Logistics
community could coordinate to develop a large data model, but this would not
solve interoperability with systems from other domains, such as Acquisition or
Human Resources. Moreover, these additional
areas could not easily be rolled into a larger data model. As the size of a data model grows, it becomes
increasingly difficult for the model developers to agree on common data
elements. This is not just due to
increasing numbers of people, but also because subgroups within a large domain
have different perspectives and use different terminology. If it were viable to develop ever larger
domain data models, the world could develop one huge data model and data
interoperability would be solved. The
solution to data stovepipes is not larger stovepipes.
C. ‘Cross-Domain
(Many-to-Many) Mapping’ Does Not Scale
Today, two domains can map their data models and
achieve data interoperability amongst all the systems in both areas. This
works for two domains or a small number of domains, but the cost grows at a rate of roughly
N-squared: N domains require N x (N-1) directional mappings. Two domains require 2 mappings, one in
each direction; three domains require 6 mappings; four require
12; and this quickly becomes prohibitively expensive with real-world
numbers of domains. For example, the U.S. Army has 18 domains
in its enterprise architecture, which could require as many as 18x17=306
mappings. Even then, this would not provide interoperability
with the domains in DOD, Air Force, Navy, Homeland Defense,
other federal agencies, the states, coalition partners, and
industry. Domain-to-domain mapping works on a small scale,
but cannot scale to real-world numbers of domains and Communities of Interest (COI).
In the past, interfaces were
developed between individual systems, which quickly suffered from the N-squared
problem of too many interfaces. To
mitigate this problem, many organizations advocated forming domains and COIs to
develop common data models. While this
reduced the problem, the approach of developing domain-to-domain interfaces still
suffers from the N-squared problem and could never scale to support large
enterprises with many domains, and could certainly never scale to a global
solution. Mathematically, an Order N-squared growth rate remains
N-squared even if N is divided by a large constant. In practice, this means that grouping
systems into domains reduces N, but the number of interfaces (between domains) still
grows at a rate of roughly N-squared.
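The counting argument above can be transcribed directly into a few lines of Python; the function simply encodes the N x (N-1) count of directional mappings and is not part of the original paper.

```python
def pairwise_mappings(n: int) -> int:
    """Directional mappings needed to connect n domains pairwise:
    each ordered pair of domains needs its own mapping, i.e. n * (n - 1)."""
    return n * (n - 1)

# The figures cited in the text:
for n in (2, 3, 4, 18):
    print(f"{n} domains -> {pairwise_mappings(n)} mappings")
```

Running this reproduces the paper's figures: 2 domains need 2 mappings, 3 need 6, 4 need 12, and the Army's 18 domains would need 306.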
D. Standard Languages Do
Not Produce Interoperable Data
Standard languages, such as RDF or OWL, do not
produce interoperable data. They
have substantial value in various respects, such as enabling easier comparison
of models, but they do not force interoperability. They are a standard
means of 'expressing' data elements or concepts, but the models or ontologies
will invariably be unique. These languages have varying degrees of
expressiveness, which are depicted on a scale known as the Semantic Spectrum
[see http://en.wikipedia.org/wiki/Semantic_spectrum]
or Ontology Spectrum (Figure 1 below).
Figure 1: Ontology Spectrum,
Dr. Leo Obrst.
At the low end of expressiveness
is a ‘Taxonomy,’ which is a hierarchy of terms, with each being a sub-class of
a more general term. Humans can
understand the terms and definitions provided they understand the field and
context, but the simple sub-class hierarchy provides very limited clues a
computer can use to determine meaning.
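To make this limitation concrete, here is a minimal Python sketch (the terms are invented for illustration) of a taxonomy and essentially the only inference it supports:

```python
# A taxonomy reduced to its essence: child -> parent sub-class links.
taxonomy = {
    "Sedan": "Car",
    "Car": "Vehicle",
    "Truck": "Vehicle",
    "Vehicle": "Thing",
}

def is_a(term: str, ancestor: str) -> bool:
    """Walk the sub-class chain upward. A bare taxonomy supports little
    more than this: no properties, constraints, or relations beyond
    sub-class are expressible."""
    while term in taxonomy:
        term = taxonomy[term]
        if term == ancestor:
            return True
    return False

print(is_a("Sedan", "Vehicle"))  # True
```

The computer can confirm that a Sedan is a Vehicle, but nothing in the structure tells it what a Sedan is, what properties it has, or how it relates to anything outside the hierarchy.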
At the high end of the scale
are languages such as OWL that can express more meaning, enabling computers to
do automated reasoning. But even the
most expressive of these languages will not produce data or ontologies
with the same meaning. Today, OWL is
used by many ontology developers, yet the ontologies are not interoperable. Standard
languages, regardless of their expressiveness, do not solve data
interoperability.
E. World Wide Web
Consortium (W3C): Doing Much – But Not This
The World Wide Web Consortium (W3C), the primary
standards developing organization for the Web and the champion for the general
initiative known as the ‘Semantic Web,’ is developing a host of standards to
enable data and semantic interoperability within domains and Communities of
Interest (COI). They also address data interoperability between limited
numbers of data models and ontologies, but not amongst large numbers of
Domains or Communities of Interest as are found in large enterprises.
F. Data Interoperability has Multiple Layers –
All Critical
Interoperability has many layers [2] [3], with the lowest
having mature standards. This paper
focuses on the ontology or semantic (i.e. the ‘meaning’) layer, which is
critical to data interoperability. The original
Semantic Stack in Figure 2 below depicts these layers.
Figure 2. Semantic Web
Wedding Cake (From
Berners-Lee, XML 2000 Conference)
At the lowest layer, a computer must
recognize characters such as z, 1, or @. The Unicode standard
enables interoperability at this level.
At the next level, XML is a standard language for syntax
(or format), so a computer can differentiate types of content, for example
between a Heading and a Quote.
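As an illustration (the markup here is invented for this example), a few lines of Python show a parser distinguishing content types purely from XML syntax, with no grasp of what either element means:

```python
import xml.etree.ElementTree as ET

# XML provides standard syntax: a parser can tell a heading from a quote
# by element name, even though it knows nothing about what either means.
doc = ET.fromstring(
    "<doc>"
    "<heading>Part I</heading>"
    "<quote>The solution to data stovepipes is not larger stovepipes.</quote>"
    "</doc>"
)
for element in doc:
    print(element.tag, "->", element.text)
```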
The Ontology layer defines the semantics (or meaning) of
data, enabling Semantic Interoperability, which is defined by Wikipedia as the
"ability of two or more computer systems to exchange information and have
the meaning of that information accurately and automatically interpreted by the
receiving system."[4] This paper focuses on 'Semantic Interoperability,' an
essential element of data interoperability if computers are to process the
meaning of data without human intervention.
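A minimal, hypothetical sketch of what this definition rules out: in the exchange below, the data transfers successfully at the syntactic level, but the receiver misinterprets its meaning (here, the units; the field names and unit assumptions are invented for illustration).

```python
# The sender's record is well-formed, so the exchange succeeds syntactically.
sender_record = {"item": "fuel", "weight": 100}  # sender means kilograms

# The receiving system assumes the same field is expressed in pounds.
KG_PER_LB = 0.45359237
misread_kg = sender_record["weight"] * KG_PER_LB  # what the receiver infers
error_kg = sender_record["weight"] - misread_kg

print(f"receiver under-reads the weight by {error_kg:.1f} kg")
```

Semantic interoperability would require the meaning of the 'weight' field to travel with the data, so the receiving system could interpret it correctly without human intervention.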
OWL is a standard language for defining ontologies, but as explained
above, such ontologies will not be interoperable. Part II of this paper describes how a standard upper ontology has
potential to enable differing ontologies to be interoperable.
Beyond Semantic Interoperability are layers such as
‘Trust,’ which are also important. For example, if computers could someday
search the Semantic Web and find and understand a source of data, they would
still need to determine if they could ‘trust’ the data for the intended purpose.
Solving
Semantic Interoperability will not provide the total solution to enterprise
data interoperability, but it is essential, and it is the next major step
in this challenge.
G. What are ‘Current
Technologies?’
This paper uses the phrase 'current technology' to mean
technology ready to be implemented by functional organizations. Many
technical communities use a Technology Readiness Level (TRL) scale to
measure the maturity of given technologies. This paper uses the
Software Readiness Scale, with the following levels:
1. Basic principles observed and reported.
2. Technology concept and/or application
formulated.
3. Analytical and experimental critical
functions and/or characteristic proof of concept.
4. Component and/or breadboard validation in
laboratory environment.
5. Component and/or breadboard validation in
relevant environment.
6. System/subsystem model or prototype
demonstration in a relevant environment.
7. System prototype demonstration in an operational
environment.
8. Actual system completed and ‘flight
qualified’ through test and demonstration.
9. Actual system ‘flight proven’ through
successful mission operations.
See Appendix 2 for a further explanation of each level.
Using
the above scale, a ‘Current Technology’ is defined as being at level 8 or
9. Technologies at lower levels need
further work before being ready for implementation by functional organizations.
Using the above scale, the following technologies are
rated for readiness [5]:
[Table: readiness-level ratings of current technologies (XML, RDF, OWL) versus emerging technologies (upper ontologies, mapped upper ontologies).]
The above ratings are meant to show the substantial
difference in maturity between current technologies (XML, RDF, OWL language) and
emerging technologies (upper ontologies and mapped upper ontologies). As described earlier in the paper, current
technologies cannot achieve enterprise-wide data interoperability. Emerging technologies have potential to
achieve this, but need substantial maturing.
H. Part I Conclusion
Current technologies provide no viable solutions for
sharing data across the many domains of large enterprises. Independently developed data models are not
interoperable. Larger domains enable internal, but not external interoperability,
and become increasingly difficult to establish as they grow in size. Highly expressive knowledge representation
languages have value within domains, but do not force consistency
between independently developed models. Mapping between domains works,
and can provide value in high-impact areas, but can't scale enterprise-wide. [6]
Current technologies cannot achieve enterprise-wide data interoperability. The only viable strategy is for stakeholders
to invest in the maturing of emerging technologies, which is addressed in the
next section.
Part II – A Way Forward for Enterprise Data
Interoperability
This portion of the paper describes
a viable path for organizations that choose to lead in the pursuit of enterprise
data interoperability. After this paper is released, the contents of
this section will continue to evolve and be posted on the web site of
the Cross-Domain Semantic Interoperability Working Group at http://www.visualknowledge.com/wiki/cdsi.
A. Early, Ongoing, and
Independent Assessment
Organizational leaders will not have the expertise to
understand the depths of this technology. As such, an independent team
should be funded to evaluate the current state of the art, the next steps
to be funded, and the progress being made by funded projects. Failure
to apportion resources to independent assessment could result
in wasted investments in non-critical technologies.
B. Distributed and Open
Innovation
The technical challenges of this goal are many and
varied. Some will require focused and
systematic engineering, while others will require multiple innovative attempts
by multiple teams. Organizations should
not put all their eggs in one basket, but rather be flexible in investing in
many innovative solutions from all available sources of expertise, including
government researchers, contractors, academics, and individual experts, both U.S. and foreign. Progress should be shared in open forums to
encourage participation by others in this field.
C. Candidate Technical
Solutions
The following are candidate technical solutions for
enterprise-wide interoperability, as vetted by the Cross Domain Semantic
Interoperability WG. See http://www.visualknowledge.com/wiki/cdsi.
1) Single upper ontology. Examples: SUMO, DOLCE, OpenCyc, BFO. Ontologies
are sets of discrete concepts, while data models are sets of labels for complex
concepts. It is easier for stakeholders
to agree on concepts than on labels.
An upper data model (for all domains) has never been developed, nor is
it feasible to do so; however, upper ontologies have been developed and have
been shown to address all domains. For
example, SUMO has been mapped to all 100,000 terms in the WordNet lexicon. [7] A large enterprise such as DOD or the U.S.
federal government could standardize on a given upper ontology, and then
develop compliant domain ontologies for areas such as Logistics, Human
Resources, Acquisition, Medical, etc. Systems could be developed or mapped to
one or more standard domain ontologies, and would thereby share common concepts
with other compliant systems. [8][9] This approach has been demonstrated in the
laboratory, but now needs to be demonstrated in an operational environment. Given success, various tools, techniques, and
engineering work will be needed to make this ready for implementation.
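The scaling advantage claimed for this approach can be sketched in a few lines of Python. The assumption that each domain needs exactly one mapping per direction to the shared upper ontology is ours, made for illustration:

```python
def pairwise(n: int) -> int:
    # Direct domain-to-domain mapping: every ordered pair of domains.
    return n * (n - 1)

def via_upper_ontology(n: int) -> int:
    # Each domain ontology is mapped once to the shared upper ontology
    # (one mapping per direction), so growth is linear in n.
    return 2 * n

for n in (18, 100):
    print(f"{n} domains: {pairwise(n)} direct vs "
          f"{via_upper_ontology(n)} via upper ontology")
```

Under this simplification, the Army's 18 domains would need 36 mappings instead of 306, and 100 data models would need 200 instead of 9,900: the hub-and-spoke structure replaces N-squared growth with linear growth.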
There have also been proposals to develop a new and better
upper ontology, which should lead to improved utility and broader
adoption.
2) Set of mapped upper ontologies: Some researchers theorize that no
single upper ontology can meet the needs of all systems. An alternative is to standardize on a small
set of upper ontologies [10] (perhaps 3 to 5), and to construct strong mappings
between each of them. Five upper
ontologies would require 5x4=20 mappings, which is probably affordable. Some amount of semantics would be lost in
the mappings, but the added flexibility could make this approach more feasible
than a single upper ontology. See the Upper Ontology Summit Joint Communiqué [10].
D. Many Enabling
Technologies
Many technologies will
enable this vision. Some are known, but others will surface as
the field evolves. Below is an initial listing of known challenges and tasks:
· How to deal with the overlap and proliferation of Communities of Interest and Domains.
· How to achieve interoperability amongst legacy systems.
· Metrics for Semantic Interoperability, so we know what we have and how much we are improving it.
· Demonstration of existing state-of-the-art technologies, with results fully and openly published.
· Utilization of existing upper ontologies.
· Establishment of a testbed (or distributed testbed), open for all to monitor and interface with.
· Mapping of two or more upper ontologies, and evaluation of the feasibility of achieving semantic interoperability.
· How to utilize partial Semantic Interoperability where perfect interoperability cannot practically be achieved.
This list will
continue to grow and be vetted. See latest list at http://www.visualknowledge.com/wiki/cdsi.
E. Part II Conclusion
Part I described the inability of current technology to
enable data interoperability across the many domains of large enterprises. Large organizations, if serious about
solving enterprise data interoperability, must invest in emerging technical
solutions and be both patient and resolute in maturing them for enterprise
scale implementation.
References:
[1] Federal CIO Council Strategic Plan, FY 2007 – 2009. http://www.cio.gov/documents/CIOCouncilStrategicPlan2007-2009.pdf
[2] Daconta, M., L. Obrst, K. Smith. 2003. The Semantic Web: The
Future of XML, Web Services, and Knowledge Management. John Wiley, Inc.,
June, 2003.
[3] Obrst, L. 2003. Ontologies for Semantically
Interoperable Systems. Proceedings of the Twelfth ACM International Conference
on Information and Knowledge Management (CIKM 2003), Ophir Frieder, Joachim
Hammer, Sajda Quershi, and Len Seligman, eds. New Orleans, LA, November 3-8,
New York: ACM, pp. 366-369.
[4] Wikipedia definition of Semantic Interoperability. http://en.wikipedia.org/wiki/Semantic_interoperability.
[5] Obrst, Leo; Patrick Cassidy; Steve Ray; Barry Smith; Dagobert
Soergel; Matthew West; Peter Yim. 2006. The
2006 Upper Ontology Summit Joint Communiqué. Journal of Applied Formal
Ontology. Volume 1: 2, 2006.
[6] Kalfoglou, Y.; Schorlemmer, M. 2003. Ontology Mapping: The State of
the Art. Knowledge Engineering Review, 18(1):1-31.
[7] Niles, I. and Pease, A. Linking Lexicons and Ontologies:
Mapping WordNet to the Suggested Upper Merged Ontology. In Proceedings of the 2003 International
Conference on Information and Knowledge Engineering (IKE’03), Las Vegas,
Nevada, June 23-26, 2003. http://home.earthlink.net/~adampease/professional/Niles-IKE.pdf
[8] Semy, Salim K.; Pulvermacher, Mary K.; Obrst, Leo J.; September
2004; Toward the Use of an Upper Ontology for U.S. Government and U.S. Military
Domains: An Evaluation; http://www.mitre.org/work/tech_papers/tech_papers_05/04_1175/04_1175.pdf
[9] Pulvermacher, Mary; Stoutenburg, Suzette; Semy, Salim; Netcentric
Semantic Linking: An Approach for Enterprise Semantic Interoperability; October
2004; http://www.mitre.org/work/tech_papers/tech_papers_04/04_1174/04_1174.pdf
[10] Upper Ontology Summit Joint Communiqué; March 15, 2006; http://ontolog.cim3.net/cgi-bin/wiki.pl?UpperOntologySummit/UosJointCommunique
Appendix 1
DoD Data-Sharing Strategy
DOD requires data interoperability across its entire
enterprise, plus with external partners [1] [2]. The main body of this paper
describes how current technology cannot achieve this for enterprises with
significant numbers of domains.
Given the limits of existing technology, DOD established
a data-sharing strategy that encouraged Communities of Interest (COI) to
develop standard data models, and then map between them [3][4].
This approach has the
following problems:
a. Most
applications need to share data with systems outside their COI. For example, a logistics system may fit well
in a logistics domain, but may also need to share data with a financial system. If it uses the logistics COI data model, it
will not be interoperable with the financial COI data model.
b. The
freedom to create COIs will result in competing COIs. If the Government Accountability Office (GAO), DoD, and Army each create
a COI for Finance, which one does a given application adopt?
c. DOD
will have far too many COIs and Domains to map between all of them. For example, the U.S. Army’s Enterprise
Architecture has 18 domains, and most of these have multiple sub-domains. (Its Acquisition Domain has 10 sub-domains,
which are unique enough to require separate data models.) The Army may end up with 100 or more data
models, even with strong governance to limit the number. These could be mapped, but would require
potentially 9,900 interfaces. Even if
only 10% of the mappings were needed, it would still be too many, and this does
not include interfaces with external systems.
Mapping between data models works fine on a small scale, but could
never scale to DoD-wide implementation.
DoD recognized in a recent report that its Net-Centric
Data Strategy is making little progress, and concluded this was due to lack of
proper implementation.[5] Technology is
actually the limiting factor, for the reasons set forth in this paper. Improvements in implementation cannot succeed
if the technology is incapable of achieving the goal.
DOD invests in hundreds of technologies. To achieve its Net-Centric goals for data
sharing across its enterprise, DOD will need to invest in the maturing, demonstration,
and piloting of emerging technologies that directly address enterprise-wide
data sharing. A listing of candidate
technologies is found in Part II of the body of this paper. The identification of these technical
challenges and technologies is a mission of the Cross Domain Semantic
Interoperability Working Group.
Appendix 1 References:
1. Network-Centric Warfare (NCW), http://en.wikipedia.org/wiki/Network-centric_warfare
2. Global Information Grid (GIG), http://en.wikipedia.org/wiki/Global_Information_Grid
3. DoD Directive 8320.2, Data Sharing in a Net-Centric
Department of Defense, http://www.dtic.mil/whs/directives/corres/pdf/d83202_120204/d83202p.pdf
4. DOD 8310.02-G, Guidance for Implementing Net-Centric
Data Sharing, http://www.dtic.mil/whs/directives/corres/pdf/p832002_041206/p832002p.pdf
5. Implementing
the Net-Centric Data Strategy Progress and Compliance Report, August 2006. https://metadata.dod.mil/mdrPortal/download?contentItemID=urn%3Auuid%3Ac85e4dfc-607e-46e2-869f-fdf3c3133b60
Appendix 2 -- Software Readiness Level
1. Basic principles observed and reported.
SW: Lowest level of software readiness. Basic research begins to be
translated into applied research and development. Examples might include a
concept that can be implemented in software or analytic studies of an
algorithm’s basic properties.
2. Technology concept and/or application formulated.
SW: Invention begins. Once basic principles are observed, practical
applications can be invented. Applications are speculative, and there is no
proof or detailed analysis to support the assumptions. Examples are limited
to analytic studies.
3. Analytical and experimental critical functions and/or characteristic proof of concept.
SW: Active research and development is initiated. This includes analytical
studies to produce code that validates analytical predictions of separate
software elements. Examples include software components that are not yet
integrated or representative but satisfy an operational need. Algorithms run
on a surrogate processor in a laboratory environment.
4. Component and/or breadboard validation in laboratory environment.
SW: Basic software components are integrated to establish that they will
work together. They are relatively primitive with regard to efficiency and
reliability compared to the eventual system. System software architecture
development is initiated, to include interoperability, reliability,
maintainability, extensibility, scalability, and security issues. Software is
integrated with simulated current/legacy elements as appropriate.
5. Component and/or breadboard validation in relevant environment.
SW: Reliability of the software ensemble increases significantly. The basic
software components are integrated with reasonably realistic supporting
elements so that the ensemble can be tested in a simulated environment.
Examples include "high fidelity" laboratory integration of software
components. System software architecture is established. Algorithms run on a
processor(s) with characteristics expected in the operational environment.
Software releases are ‘Alpha’ versions, and configuration control is
initiated. Verification, Validation, and Accreditation (VV&A) is initiated.
6. System/subsystem model or prototype demonstration in a relevant environment.
SW: A representative model or prototype system, well beyond that of TRL 5,
is tested in a relevant environment. This represents a major step up in
software-demonstrated readiness. Examples include testing a prototype in a
live/virtual experiment or in a simulated operational environment. Algorithms
run on a processor in the operational environment, integrated with actual
external entities. Software releases are ‘Beta’ versions and configuration
controlled. The software support structure is in development. VV&A is in
process.
7. System prototype demonstration in an operational environment.
SW: Represents a major step up from TRL 6, requiring the demonstration of an
actual system prototype in an operational environment, such as in a command
post or an air/ground vehicle. Algorithms run on the processor of the
operational environment, integrated with actual external entities. The
software support structure is in place. Software releases are in distinct
versions. Frequency and severity of software deficiency reports do not
significantly degrade functionality or performance. VV&A is completed.
8. Actual system completed and “flight qualified” through test and demonstration.
SW: Software has been demonstrated to work in its final form and under
expected conditions. In most cases, this TRL represents the end of system
development. Examples include test and evaluation of the software in its
intended system to determine whether it meets design specifications. Software
releases are production versions and configuration controlled, in a secure
environment. Software deficiencies are rapidly resolved through the support
structure.
9. Actual system “flight proven” through successful mission operations.
SW: Actual application of the software in its final form and under mission
conditions, such as those encountered in operational test and evaluation. In
almost all cases, this is the end of the last "bug fixing" aspects of system
development. Examples include using the system under operational mission
conditions. Software releases are production versions and configuration
controlled. Frequency and severity of software deficiencies are at a minimum.