Draft
Data Interoperability across the Enterprise -
Why Current Technology Can’t Achieve It
January 29, 2007
This paper was drafted and reviewed by members of the
Cross-Domain Semantic Interoperability (CDSI) Working Group, http://www.visualknowledge.com/wiki/CDSI,
and its parent organization, Semantic Interoperability Community of Practice
(SICoP), http://colab.cim3.net/cgi-bin/wiki.pl?SICoP
. Editor: James Schoening, U.S. Army, james.schoening@us.army.mil
Abstract
Enterprises
need data interoperability across all of their own systems, and with
external systems. The Federal CIO
Council Strategic Plan, FY 2007 – 2009, calls for, “interoperability across
Federal, state, tribal, and local governments, as well as partners in the
commercial and academic sectors.” [1]
Current
technologies such as XML, metadata, RDF, OWL, and stand-alone
ontologies can achieve data interoperability, but only within domains or
Communities of Interest (COI), or between limited numbers of these. These
technologies cannot achieve data interoperability across the many domains found
in most large enterprises. To achieve this goal, leading organizations will
need to invest in emerging technologies, such as common upper, middle, and
reference ontologies, and mature them to where they are ready for
enterprise-wide implementation.
Use Case
The following Use Case provides one
example of how cross-enterprise data interoperability could provide value.
“An officer on the battlefield is
directed to halt one course of action and pursue another objective, which had
not been anticipated and for which there is no plan. He has 15 minutes
to issue new orders. He spends 5 minutes plugging new parameters into
his computer system, which activates
scores of agents to access thousands of data sources (many of them
unanticipated sources found through searches) and returns 3 alternative plans
in 5 minutes, giving him 5 minutes to make the decision. Agents
search and analyze data on items such as weather, supplies, cost, training
readiness, local customs, and maps, and then integrate findings into
alternatives.”
Intended Audience
This paper is intended for
those with a stake in data interoperability across enterprises or between
independently developed systems. It requires only a basic understanding of data
interoperability requirements and technology.
Table of Contents
Part I: Why Current Technology Can't Achieve Enterprise-Wide Data Interoperability
A. Independently Developed Data Models are not Interoperable
B. Developing Larger Domains Does Not Scale
C. Cross-Domain (Many-to-Many) Mapping Does Not Scale
D. Standard Languages Do Not Produce Interoperable Data
E. World Wide Web Consortium (W3C): Doing Much – But Not This
F. Data Interoperability has Multiple Layers – All Critical
G. What are ‘Current Technologies?’
H. Part I Conclusion
Part II – A Way Forward for Enterprise Data Interoperability
A. Early, Ongoing, and Independent Assessment
B. Distributed and Open Innovation
C. Candidate Technical Solutions
D. Many Enabling Technologies
E. Part II Conclusion
References
Appendix 1: DOD Data-Sharing Strategy
Appendix 2: Software Readiness Level
Part I: Why Current Technology Cannot Achieve
Enterprise-Wide Data Interoperability
A. Independently
Developed Data Models are not Interoperable
All independently developed data models, including database
schemas, data dictionaries, metadata, taxonomies, and ontologies, are invariably
non-interoperable. Each has a unique perspective, purpose, and set of
constraints, which lead to divergence. For example, two data models may both contain the data element
‘Person.’ One requires the ‘Person’ to
be a specific instance, with a name and ID number, while the other refers to any
human being, such as the ‘next person in line.’ Even if the same individual independently develops multiple
models, they will still not interoperate. Readers are invited to compare the names
of their office email folders, paper folders, hard-drive folders, and browser
favorites folders; they will likely find different labels with different
meanings.
Common practice
demonstrates this fact: it is why information workers spend so much time
interpreting data and re-entering it into other applications, and why custom
interfaces must be developed to connect computer systems. Data developed by different
people, for different purposes, and under different constraints invariably has
different structure and meaning.
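The ‘Person’ example above can be made concrete with a minimal Python sketch (the field names are invented for illustration, not drawn from any actual data model): the two models share a label, but neither structure nor meaning.

```python
# Model A: 'Person' must be a specific, identified individual.
model_a_person = {"name": "Jane Doe", "id_number": "123-45-6789"}

# Model B: 'Person' is any human being filling a role; no identity required.
model_b_person = {"role": "next person in line"}

# A naive label-based match sees the element name 'Person' in both models,
# yet the records share no fields and carry different meanings.
shared_fields = set(model_a_person) & set(model_b_person)
print(shared_fields)  # set(): the common label conceals divergent structure
```

No automated comparison of these two records can recover the fact that both were labeled ‘Person’ for different reasons; that knowledge lives only in the heads of the two modelers.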
B. ‘Developing Larger
Domains’ Does Not Scale
Broader coordination can lead to interoperability within
larger domains, but can’t scale to the levels of a large enterprise. For example, a Finance and a Logistics
community could coordinate to develop a large data model, but this would not
solve interoperability with systems from other domains, such as Acquisition or
Human Resources. Moreover, these additional
areas could not easily be rolled into a larger data model. As the size of a data model grows, it becomes
increasingly difficult for the model developers to agree on common data
elements. This is not just due to
increasing numbers of people, but also because subgroups within a large domain
have different perspectives and use different terminology. If it were viable to develop ever larger
domain data models, the world could develop one huge data model and data
interoperability would be solved. The
solution to data stovepipes is not larger stovepipes.
C. ‘Cross-Domain
(Many-to-Many) Mapping’ Does Not Scale
Today, two domains can map their data models and
achieve data interoperability amongst all the systems in both areas. This
works for two domains or a small number of domains, but the cost grows at a rate of roughly
N-squared: N domains require N x (N-1) directional mappings. Two domains require 2 mappings, one in
each direction; three domains require 6 mappings; four require
12; and this quickly becomes prohibitively expensive with real-world
numbers of domains. For example, the U.S. Army has 18 domains
in its enterprise architecture, which could require as many as 18x17=306
mappings. Even then, this would not provide interoperability
with the domains in DOD, Air Force, Navy, Homeland Defense,
other federal agencies, the states, coalition partners, and
industry. Domain-to-domain mapping works on a small scale,
but cannot scale to real-world numbers of domains and Communities of Interest (COI).
In the past, interfaces were
developed between individual systems, which quickly suffered from the N-squared
problem of too many interfaces. To
mitigate this problem, many organizations advocated forming domains and COIs to
develop common data models. While this
reduced the problem, the approach of developing domain-to-domain interfaces still
suffers from the N-squared problem and could never scale to support large
enterprises with many domains, and could certainly never scale to a global
solution. Mathematically, an Order N-squared growth rate remains
N-squared even if N is divided by a large constant. In practice, this means that grouping
systems into domains reduces N, but the number of interfaces (between domains) still
grows at a rate of roughly N-squared.
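The counting argument above can be transcribed directly into a few lines of Python; the function simply encodes the N x (N-1) count of directional mappings and is not part of the original paper.

```python
def pairwise_mappings(n: int) -> int:
    """Directional mappings needed to connect n domains pairwise:
    each ordered pair of domains needs its own mapping, i.e. n * (n - 1)."""
    return n * (n - 1)

# The figures cited in the text:
for n in (2, 3, 4, 18):
    print(f"{n} domains -> {pairwise_mappings(n)} mappings")
```

Running this reproduces the paper's figures: 2 domains need 2 mappings, 3 need 6, 4 need 12, and the Army's 18 domains would need 306.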
D. Standard Languages Do
Not Produce Interoperable Data
Standard languages, such as RDF or OWL, do not
produce interoperable data. They
have substantial value in various respects, such as enabling easier comparison
of models, but they do not force interoperability. They are a standard
means of 'expressing' data elements or concepts, but the models or ontologies
will invariably be unique. These languages have varying degrees of
expressiveness, which are depicted on a scale known as the Semantic Spectrum
[see http://en.wikipedia.org/wiki/Semantic_spectrum]
or Ontology Spectrum (Figure 1 below).
Figure 1: Ontology Spectrum,
Dr. Leo Obrst.
At the low end of expressiveness
is a ‘Taxonomy,’ which is a hierarchy of terms, with each being a sub-class of
a more general term. Humans can
understand the terms and definitions provided they understand the field and
context, but the simple sub-class hierarchy provides very limited clues a
computer can use to determine meaning.
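To make this limitation concrete, here is a minimal Python sketch (the terms are invented for illustration) of a taxonomy and essentially the only inference it supports:

```python
# A taxonomy reduced to its essence: child -> parent sub-class links.
taxonomy = {
    "Sedan": "Car",
    "Car": "Vehicle",
    "Truck": "Vehicle",
    "Vehicle": "Thing",
}

def is_a(term: str, ancestor: str) -> bool:
    """Walk the sub-class chain upward. A bare taxonomy supports little
    more than this: no properties, constraints, or relations beyond
    sub-class are expressible."""
    while term in taxonomy:
        term = taxonomy[term]
        if term == ancestor:
            return True
    return False

print(is_a("Sedan", "Vehicle"))  # True
```

The computer can confirm that a Sedan is a Vehicle, but nothing in the structure tells it what a Sedan is, what properties it has, or how it relates to anything outside the hierarchy.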
At the high end of the scale
are languages such as OWL that can express more meaning, enabling computers to
do automated reasoning. But even the
most expressive of these languages will not produce data or ontologies
with the same meaning. Today, OWL is
used by many ontology developers, yet the ontologies are not interoperable. Standard
languages, regardless of their expressiveness, do not solve data
interoperability.
E. World Wide Web
Consortium (W3C): Doing Much – But Not This
The World Wide Web Consortium (W3C), the primary
standards developing organization for the Web and the champion for the general
initiative known as the ‘Semantic Web,’ is developing a host of standards to
enable data and semantic interoperability within domains and Communities of
Interest (COI). They also address data interoperability between limited
numbers of data models and ontologies, but not amongst large numbers of
Domains or Communities of Interest as are found in large enterprises.
F. Data Interoperability has Multiple Layers –
All Critical
Interoperability has many layers [2] [3], with the lowest
having mature standards. This paper
focuses on the ontology or semantic (i.e. the ‘meaning’) layer, which is
critical to data interoperability. The original
Semantic Stack in Figure 2 below depicts these layers.
Figure 2. Semantic Web
Wedding Cake (From
Berners-Lee, XML 2000 Conference)
At the lowest layer, a computer must
recognize characters such as z, 1, or @. The Unicode standard
enables interoperability at this level.
At the next level, XML is a standard language for syntax
(or format), so a computer can differentiate types of content, for example
between a Heading and a Quote.
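As an illustration (the markup here is invented for this example), a few lines of Python show a parser distinguishing content types purely from XML syntax, with no grasp of what either element means:

```python
import xml.etree.ElementTree as ET

# XML provides standard syntax: a parser can tell a heading from a quote
# by element name, even though it knows nothing about what either means.
doc = ET.fromstring(
    "<doc>"
    "<heading>Part I</heading>"
    "<quote>The solution to data stovepipes is not larger stovepipes.</quote>"
    "</doc>"
)
for element in doc:
    print(element.tag, "->", element.text)
```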
The Ontology layer defines the semantics (or meaning) of
data, enabling Semantic Interoperability, which is defined by Wikipedia as the
"ability of two or more computer systems to exchange information and have
the meaning of that information accurately and automatically interpreted by the
receiving system."[4] This paper focuses on 'Semantic Interoperability,' an
essential element of data interoperability if computers are to process the
meaning of data without human intervention.
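A minimal, hypothetical sketch of what this definition rules out: in the exchange below, the data transfers successfully at the syntactic level, but the receiver misinterprets its meaning (here, the units; the field names and unit assumptions are invented for illustration).

```python
# The sender's record is well-formed, so the exchange succeeds syntactically.
sender_record = {"item": "fuel", "weight": 100}  # sender means kilograms

# The receiving system assumes the same field is expressed in pounds.
KG_PER_LB = 0.45359237
misread_kg = sender_record["weight"] * KG_PER_LB  # what the receiver infers
error_kg = sender_record["weight"] - misread_kg

print(f"receiver under-reads the weight by {error_kg:.1f} kg")
```

Semantic interoperability would require the meaning of the 'weight' field to travel with the data, so the receiving system could interpret it correctly without human intervention.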
OWL is a standard language for defining ontologies, but as explained
above, such ontologies will not be interoperable. Part II of this paper describes how a standard upper ontology has
potential to enable differing ontologies to be interoperable.
Beyond Semantic Interoperability are layers such as
‘Trust,’ which are also important. For example, if computers could someday
search the Semantic Web and find and understand a source of data, they would
still need to determine if they could ‘trust’ the data for the intended purpose.
Solving
Semantic Interoperability will not provide the total solution to enterprise
data interoperability, but it is essential, and it is the next major step
in this challenge.
G. What are ‘Current
Technologies?’
This paper uses the phrase 'current technology' to mean
technology ready to be implemented by functional organizations. Many
technical communities use a Technology Readiness Level (TRL) scale to
measure the maturity of given technologies. This paper uses the
Software Readiness Scale, with the following levels:
1. Basic principles observed and reported.
2. Technology concept and/or application
formulated.
3. Analytical and experimental critical
functions and/or characteristic proof of concept.
4. Component and/or breadboard validation in
laboratory environment.
5. Component and/or breadboard validation in
relevant environment.
6. System/subsystem model or prototype
demonstration in a relevant environment.
7. System prototype demonstration in an operational
environment.
8. Actual system completed and ‘flight
qualified’ through test and demonstration.
9. Actual system ‘flight proven’ through
successful mission operations.
See Appendix 2 for a further explanation of each level.
Using
the above scale, a ‘Current Technology’ is defined as being at level 8 or
9. Technologies at lower levels need
further work before being ready for implementation by functional organizations.
Using the above scale, the following technologies are
rated for readiness [5]:
[Table: readiness-level ratings of current technologies (XML, RDF, OWL) versus emerging technologies (upper ontologies, mapped upper ontologies).]
The above ratings are meant to show the substantial
difference in maturity between current technologies (XML, RDF, OWL language) and
emerging technologies (upper ontologies and mapped upper ontologies). As described earlier in the paper, current
technologies cannot achieve enterprise-wide data interoperability. Emerging technologies have potential to
achieve this, but need substantial maturing.
H. Part I Conclusion
Current technologies provide no viable solutions for
sharing data across the many domains of large enterprises. Independently developed data models are not
interoperable. Larger domains enable internal, but not external interoperability,
and become increasingly difficult to establish as they grow in size. Highly expressive knowledge representation
languages have value within domains, but do not force consistency
between independently developed models. Mapping between domains works,
and can provide value in high-impact areas, but can't scale enterprise-wide. [6]
Current technologies cannot achieve enterprise-wide data interoperability. The only viable strategy is for stakeholders
to invest in the maturing of emerging technologies, which is addressed in the
next section.
Part II – A Way Forward for Enterprise Data
Interoperability
This portion of the paper describes
a viable path for organizations that choose to lead in the pursuit of enterprise
data interoperability. After this paper is released, the contents of
this section will continue to evolve and be posted on the web site of
the Cross-Domain Semantic Interoperability Working Group at http://www.visualknowledge.com/wiki/cdsi.
A. Early, Ongoing, and
Independent Assessment
Organizational leaders will not have the expertise to
understand the depths of this technology. As such, an independent team
should be funded to evaluate the current state of the art, the next steps
to be funded, and the progress being made by funded projects. Failure
to apportion resources to independent assessment could result
in wasted investments in non-critical technologies.
B. Distributed and Open
Innovation
The technical challenges of this goal are many and
varied. Some will require focused and
systematic engineering, while others will require multiple innovative attempts
by multiple teams. Organizations should
not put all their eggs in one basket, but rather be flexible in investing in
many innovative solutions from all available sources of expertise, including
government researchers, contractors, academics, and individual experts, both U.S. and foreign. Progress should be shared in open forums to
encourage participation by others in this field.
C. Candidate Technical
Solutions
The following are candidate technical solutions for
enterprise-wide interoperability, as vetted by the Cross Domain Semantic
Interoperability WG. See http://www.visualknowledge.com/wiki/cdsi.
1) Single upper ontology. Examples: SUMO, DOLCE, OpenCyc, BFO. Ontologies
are sets of discrete concepts, while data models are sets of labels for complex
concepts. It is easier for stakeholders
to agree on concepts than on labels.
An upper data model (for all domains) has never been developed, nor is
it feasible to do so; however, upper ontologies have been developed and have
been shown to address all domains. For
example, SUMO has been mapped to all 100,000 terms in the WordNet lexicon. [7] A large enterprise such as DOD or the U.S.
federal government could standardize on a given upper ontology, and then
develop compliant domain ontologies for areas such as Logistics, Human
Resources, Acquisition, Medical, etc. Systems could be developed or mapped to
one or more standard domain ontologies, and would thereby share common concepts
with other compliant systems. [8][9] This approach has been demonstrated in the
laboratory, but now needs to be demonstrated in an operational environment. Given success, various tools, techniques, and
engineering work will be needed to make this ready for implementation.
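The scaling advantage claimed for this approach can be sketched in a few lines of Python. The assumption that each domain needs exactly one mapping per direction to the shared upper ontology is ours, made for illustration:

```python
def pairwise(n: int) -> int:
    # Direct domain-to-domain mapping: every ordered pair of domains.
    return n * (n - 1)

def via_upper_ontology(n: int) -> int:
    # Each domain ontology is mapped once to the shared upper ontology
    # (one mapping per direction), so growth is linear in n.
    return 2 * n

for n in (18, 100):
    print(f"{n} domains: {pairwise(n)} direct vs "
          f"{via_upper_ontology(n)} via upper ontology")
```

Under this simplification, the Army's 18 domains would need 36 mappings instead of 306, and 100 data models would need 200 instead of 9,900: the hub-and-spoke structure replaces N-squared growth with linear growth.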
There have also been proposals to develop a new and better
upper ontology, which should lead to improved utility and broader
adoption.
2) Set of mapped upper ontologies: Some researchers theorize that no
single upper ontology can meet the needs of all systems. An alternative is to standardize on a small
set of upper ontologies [10] (perhaps 3 to 5), and to construct strong mappings
between each of them. Five upper
ontologies would require 5x4=20 mappings, which is probably affordable. Some amount of semantics would be lost in
the mappings, but the added flexibility could make this approach more feasible
than a single upper ontology. See the Upper Ontology Summit Joint Communiqué [10].
D. Many Enabling
Technologies
Many technologies will
enable this vision. Some are known, but others will surface as
the field evolves. Below is an initial listing of known challenges and tasks:
· How to deal with the overlap and proliferation of Communities of Interest and Domains.
· How to achieve interoperability amongst legacy systems.
· Metrics for Semantic Interoperability, so we know what we have and how much we are improving it.
· Demonstration of existing state-of-the-art technologies, with results fully and openly published.
· Utilization of existing upper ontologies.
· Establishment of a testbed (or distributed testbed), open for all to monitor and interface with.
· Mapping of two or more upper ontologies, and evaluation of the feasibility of achieving semantic interoperability.
· How to utilize partial Semantic Interoperability where perfect interoperability cannot practically be achieved.
This list will
continue to grow and be vetted. See latest list at http://www.visualknowledge.com/wiki/cdsi.
E. Part II Conclusion
Part I described the inability of current technology to
enable data interoperability across the many domains of large enterprises. Large organizations, if serious about
solving enterprise data interoperability, must invest in emerging technical
solutions and be both patient and resolute in maturing them for enterprise
scale implementation.
References:
[1] Federal CIO Council Strategic Plan, FY 2007 – 2009. http://www.cio.gov/documents/CIOCouncilStrategicPlan2007-2009.pdf
[2] Daconta, M., L. Obrst, K. Smith. 2003. The Semantic Web: The
Future of XML, Web Services, and Knowledge Management. John Wiley, Inc.,
June, 2003.
[3] Obrst, L. 2003. Ontologies for Semantically
Interoperable Systems. Proceedings of the Twelfth ACM International Conference
on Information and Knowledge Management (CIKM 2003), Ophir Frieder, Joachim
Hammer, Sajda Quershi, and Len Seligman, eds. New Orleans, LA, November 3-8,
New York: ACM, pp. 366-369.
[4] Wikipedia definition of Semantic Interoperability. http://en.wikipedia.org/wiki/Semantic_interoperability.
[5] Obrst, Leo; Patrick Cassidy; Steve Ray; Barry Smith; Dagobert
Soergel; Matthew West; Peter Yim. 2006. The
2006 Upper Ontology Summit Joint Communiqué. Journal of Applied Formal
Ontology. Volume 1: 2, 2006.
[6] Kalfoglou, Y.; Schorlemmer, M. 2003. Ontology Mapping: The State of
the Art. Knowledge Engineering Review, 18(1):1-31.
[7] Niles, I. and Pease, A. Linking Lexicons and Ontologies:
Mapping WordNet to the Suggested Upper Merged Ontology. In Proceedings of the 2003 International
Conference on Information and Knowledge Engineering (IKE’03), Las Vegas,
Nevada, June 23-26, 2003. http://home.earthlink.net/~adampease/professional/Niles-IKE.pdf
[8] Semy, Salim K.; Pulvermacher, Mary K.; Obrst, Leo J.; September
2004; Toward the Use of an Upper Ontology for U.S. Government and U.S. Military
Domains: An Evaluation; http://www.mitre.org/work/tech_papers/tech_papers_05/04_1175/04_1175.pdf
[9] Pulvermacher, Mary; Stoutenburg, Suzette; Semy, Salim; Netcentric
Semantic Linking: An Approach for Enterprise Semantic Interoperability; October
2004; http://www.mitre.org/work/tech_papers/tech_papers_04/04_1174/04_1174.pdf
[10] Upper Ontology Summit Joint Communiqué; March 15, 2006; http://ontolog.cim3.net/cgi-bin/wiki.pl?UpperOntologySummit/UosJointCommunique
Appendix 1
DoD Data-Sharing Strategy
DOD requires data interoperability across its entire
enterprise, plus with external partners [1] [2]. The main body of this paper
describes how current technology cannot achieve this for enterprises with
significant numbers of domains.
Given the limits of existing technology, DOD established
a data-sharing strategy that encouraged Communities of Interest (COI) to
develop standard data models, and then map between them [3][4].
This approach has the
following problems:
a. Most
applications need to share data with systems outside their COI. For example, a logistics system may fit well
in a logistics domain, but may also need to share data with a financial system. If it uses the logistics COI data model, it
will not be interoperable with the financial COI data model.
b. The
freedom to create COIs will result in competing COIs. If the Government Accountability Office (GAO), DoD, and Army each create
a COI for Finance, which one does a given application adopt?
c. DOD
will have far too many COIs and Domains to map between all of them. For example, the U.S. Army’s Enterprise
Architecture has 18 domains, and most of these have multiple sub-domains. (Its Acquisition Domain has 10 sub-domains,
which are unique enough to require separate data models.) The Army may end up with 100 or more data
models, even with strong governance to limit the number. These could be mapped, but would require
potentially 9,900 interfaces. Even if
only 10% of the mappings were needed, it would still be too many, and this does
not include interfaces with external systems.
Mapping between data models works fine on a small scale, but could
never scale to DoD-wide implementation.
DoD recognized in a recent report that its Net-Centric
Data Strategy is making little progress, and concluded this was due to lack of
proper implementation.[5] Technology is
actually the limiting factor, for the reasons set forth in this paper. Improvements in implementation cannot succeed
if the technology is incapable of achieving the goal.
DOD invests in hundreds of technologies. To achieve its Net-Centric goals for data
sharing across its enterprise, DOD will need to invest in the maturing, demonstration,
and piloting of emerging technologies that directly address enterprise-wide
data sharing. A listing of candidate
technologies is found in Part II of the body of this paper. The identification of these technical
challenges and technologies is a mission of the Cross Domain Semantic
Interoperability Working Group.
Appendix 1 References:
1. Network-Centric Warfare (NCW), http://en.wikipedia.org/wiki/Network-centric_warfare
2. Global Information Grid (GIG), http://en.wikipedia.org/wiki/Global_Information_Grid
3. DoD Directive 8320.2, Data Sharing in a Net-Centric
Department of Defense, http://www.dtic.mil/whs/directives/corres/pdf/d83202_120204/d83202p.pdf
4. DOD 8310.02-G, Guidance for Implementing Net-Centric
Data Sharing, http://www.dtic.mil/whs/directives/corres/pdf/p832002_041206/p832002p.pdf
5. Implementing
the Net-Centric Data Strategy Progress and Compliance Report, August 2006. https://metadata.dod.mil/mdrPortal/download?contentItemID=urn%3Auuid%3Ac85e4dfc-607e-46e2-869f-fdf3c3133b60
Appendix 2 -- Software Readiness Level
1. Basic principles observed and reported.
SW: Lowest level of software readiness. Basic research begins to be
translated into applied research and development. Examples might include a
concept that can be implemented in software or analytic studies of an
algorithm’s basic properties.
2. Technology concept and/or application formulated.
SW: Invention begins. Once basic principles are observed, practical
applications can be invented. Applications are speculative, and there is no
proof or detailed analysis to support the assumptions. Examples are limited
to analytic studies.
3. Analytical and experimental critical functions and/or characteristic proof of concept.
SW: Active research and development is initiated. This includes analytical
studies to produce code that validates analytical predictions of separate
software elements. Examples include software components that are not yet
integrated or representative but satisfy an operational need. Algorithms run
on a surrogate processor in a laboratory environment.
4. Component and/or breadboard validation in laboratory environment.
SW: Basic software components are integrated to establish that they will
work together. They are relatively primitive with regard to efficiency and
reliability compared to the eventual system. System software architecture
development is initiated, to include interoperability, reliability,
maintainability, extensibility, scalability, and security issues. Software is
integrated with simulated current/legacy elements as appropriate.
5. Component and/or breadboard validation in relevant environment.
SW: Reliability of the software ensemble increases significantly. The basic
software components are integrated with reasonably realistic supporting
elements so that the ensemble can be tested in a simulated environment.
Examples include "high fidelity" laboratory integration of software
components. System software architecture is established. Algorithms run on a
processor(s) with characteristics expected in the operational environment.
Software releases are ‘Alpha’ versions, and configuration control is
initiated. Verification, Validation, and Accreditation (VV&A) is initiated.
6. System/subsystem model or prototype demonstration in a relevant environment.
SW: A representative model or prototype system, well beyond that of TRL 5,
is tested in a relevant environment. This represents a major step up in
software-demonstrated readiness. Examples include testing a prototype in a
live/virtual experiment or in a simulated operational environment. Algorithms
run on a processor in the operational environment, integrated with actual
external entities. Software releases are ‘Beta’ versions and configuration
controlled. The software support structure is in development. VV&A is in
process.
7. System prototype demonstration in an operational environment.
SW: Represents a major step up from TRL 6, requiring the demonstration of an
actual system prototype in an operational environment, such as in a command
post or an air/ground vehicle. Algorithms run on the processor of the
operational environment, integrated with actual external entities. The
software support structure is in place. Software releases are in distinct
versions. Frequency and severity of software deficiency reports do not
significantly degrade functionality or performance. VV&A is completed.
8. Actual system completed and “flight qualified” through test and demonstration.
SW: Software has been demonstrated to work in its final form and under
expected conditions. In most cases, this TRL represents the end of system
development. Examples include test and evaluation of the software in its
intended system to determine whether it meets design specifications. Software
releases are production versions and configuration controlled, in a secure
environment. Software deficiencies are rapidly resolved through the support
structure.
9. Actual system “flight proven” through successful mission operations.
SW: Actual application of the software in its final form and under mission
conditions, such as those encountered in operational test and evaluation. In
almost all cases, this is the end of the last "bug fixing" aspects of system
development. Examples include using the system under operational mission
conditions. Software releases are production versions and configuration
controlled. Frequency and severity of software deficiencies are at a minimum.