A Communication Instrument
BCNGroup.org and OntologyStream Inc
(Edited) September 7, 2003
Edited
slightly April 12, 2005 to bring this exposition up to date and to
help establish the Road Map for Global
Information Framework deployment.
This is a communication instrument that has been
modified from a September 4th, 2003 submission to DARPA by
SAIC/OntologyStream in response to the REAL BAA. It is used to
communicate general principles. It is not a proposal and
does not include a budget.
The communication instrument has been modified from
the original submission so that the proposed project has a slightly broader
context. The original budget request to DARPA was
approximately $2.5 million for 18 months.
Given an award by DARPA, our group would have developed a new semantic architecture
(see RoadMap).
Since 1991, BCNGroup founders have advocated a
national project to establish the knowledge sciences as an academic
discipline. The national project has
been referred to as the Manhattan Project to Establish the Knowledge
Sciences as an Academic Discipline. This
Project to Establish the Knowledge Sciences (PEKS) could only be
successful if the project itself was designed to be self-sustaining.
The Knowledge Sharing Core concept and the Charter of the BCNGroup provide a ground on which to establish this condition.
The Project to Establish the Knowledge Sciences
seeks to derive indirect social benefits from funding that is initially
justified based on the critical national need for a non-relational database
type Human-centric Information Production systems technology. The critical national need is for
intelligence vetting systems whose input is unstructured computer data. The structuring of data is possible in real
time, so that data reflects current information that might be available to the
human user.
A. Innovative claims for the proposed research
1. Openness: The Knowledge Sharing Core Charter requires that all embedded technology be openly disclosed to the public. The common disclosure of patents, pending patents and trade secrets allows scientists and consumers to drive improvements through fair and open collaboration. Good design and good science accelerate program innovation while reducing cost and eliminating dependency on grants.
2. Infrastructure unencumbered by technology secrets: We develop a software infrastructure more conducive to enhanced memetic expression. Conventional software infrastructure is optimized for retention of proprietary positions, while the Knowledge Sharing Core is optimized to support human knowledge sharing.
3. Multiple methods produce evolution: Comparative use of radically different algorithmic processes will provide for cross validation and measurement of outcomes. As in nature, variety and selection are used to drive evolution of domain specific knowledge extraction systems.
4. Science Committee: A committee of leading scholars will provide a review of the theory and practice as realized in Core processes and activities. Specific scholars serve on our Science Oversight Committee. Interoperability with OWL ontology and with Topic Maps is built in. New types of inference capabilities are being developed.
5. Human component: The project takes into account the human component in a human/machine reasoning system. Most conventional approaches attempt to create an autonomous reasoner with only supervisory participation by humans. Core architecture develops a patented data encoding that is already deployed as intelligence technology. Our improvements to this deployed technology directly support various types of human memory and anticipation. We shift from computer science to cognitive and social science.
6. New form of mathematics: Computational processes produce a natural organizational stratification in data construction. This stratification reveals a correspondence to several areas of natural science. Stratification also reveals a relationship between discrete mathematics and continuum mathematics. Prueitt first observed this relationship in 2002 while working at Object Sciences Corporation as Senior Scientist. The context of this discovery is disclosed in documents on the OntologyStream web site.
By the expression “Differential Ontology” we mean the
interchange of structural information between Implicit (machine-based) Ontology
and Explicit (machine-based) Ontology.
Figure 1: Mapping between continuum mathematics and discrete mathematics
Mathematical convolutions over
localized bits of information, in the form of (type:value) pairs, produce one of a set of new transforms
that correspond to stratification theory.
More on our notational system is
given at: http://www.bcngroup.org/area2/KSF/Notation/notation.htm
7. Differential and Formative Ontology: The purpose of
Differential and Formative Ontology is to identify those covariance patterns
that exist in a data source. The purpose of the computational
processes is to assign meaning to these patterns and to preserve the
assignment as reusable knowledge artifacts.
We create several advances in
ontology processing, including data reduction using categorical abstraction
into localized containers. Containers are created in those places where
ambiguation or disambiguation processes are essential. The contents of these containers
are sets of (type:value) pairs that are double encoded into hash tables to
provide almost instantaneous set theoretic operations, including support for
convolutions based on a type differentiating template. The template allows situational logic to
control the formation of ontology compounds in real time.
Figure 2: The aggregation of atoms into compounds
Once this meaning has been
established, a stratified logic can be applied to predict the “properties” of
compounds based on a partial understanding of what the compound (the compound that one is looking
at) is and how it is composed.
8. Inferential support and contextualization: Traditional
foundations of mathematics and logic are extended and used to supply
inferential support for various processes, including automation of ambiguation
and disambiguation during natural language parsing as part of ontology
construction.
9. Applied memetic research over active text literatures: Basic research
produces a situational rendering from latent semantic analysis and other
techniques revealing linguistic variations that can be used to track memetic
expression in social discourse, medical research literatures and patent
disclosures.
The role of the patents and science: The 1994 and 1996 ATS patents develop a
reduction to practice of an interesting and important construction, called in
the patents the Contiguous Connection Model (CCM). In CCM-Powered systems, the atomic units are word n-tuples. These n-tuples are
directly derived from a measurement process made over the text input.
Figure 3: The Actionable Intelligence Process Model
The Actionable Intelligence
Process Model, AIPM, can be referenced as a process model where the processes
are to be fully defined by the CCM notational system. OntologyStream founders developed
this model as we studied an information-processing paradigm used in the
American intelligence communities. We
place a diagram depicting the nine aspects of this model in Figure 3 and will
develop additional references to this model in the following pages.
B. Project roadmap
Technical Goal: Our technical goal is a breakthrough in conceptual fidelity
and speed of machine ontology formation, produced within a self-sustaining,
unencumbered (open access) development program that becomes a foundation to the
public semantic web. The enabling
technologies depend less on computer programs to make inferences and more on a
flexible computer interface into data that measures co-occurrence using
structural ontology. The structural ontology
is merely a descriptive means to relate things in the world to the associations,
properties and relationships found and reified by humans.
Economic Goal: A number of innovations
to be integrated into a Peer-to-Peer knowledge sharing system will be either
public domain or patented. The science
committee will work to develop an objective evaluation of all disclosed patents
and patent applications, so as to assist in the proper ownership of these
innovations. A use-based
instrumentation of the Knowledge Sharing Cores will be used to differentially
charge for use, based on negotiated contracts. Use-based instrumentation will enable a low cost for high
value.
Benefits to the American Intelligence Community: Near-term benefits include better
intelligence findings from any large or small data stream, especially in terms
of detecting novelty, low salience changes, and broad shifts that other methods
miss. The development process, once started, increasingly augments
analyst reasoning at a greater rate and at lower cost compared to conventional
approaches. The state of the art for human/machine
inferencing is extended and enhanced.
Benefits to private communities and persons: The Knowledge Sharing Core concept has
been designed as a Peer-to-Peer technology with no dependencies on .NET or
Java. The embedded technologies are instrumented to provide for
cyber security, localized control over transmitted information, transparency, and
use metrics that drive economic compensation based on patent disclosures. These
features provide for privacy and diversity of viewpoint from the ground
up. The Peer-to-Peer knowledge sharing
capability is designed to develop a Many-to-Many communication system as an
extension to the current virtual community type software such as e-forums and
chat systems.
Technical barriers: The current barriers to real-time differential ontology
operations are mainly self-imposed due to three assumptions: (1) levels of
analysis must be removed or hidden from the operator, (2) the computer system
itself must understand language and have common sense, and (3) a single method
must be perfected rather than a combination of methods. The main technical challenge is to assemble
tools and staff that consistently support an approach without
these assumptions.
Elements of approach: Algorithms from competitive
traditions are modified, selected, compared, combined, and tested rapidly, with
reference to hypotheses derived from a specific set of natural science theories
about memetic expression, human reasoning, and human perception. Those
who have originated and continue to develop the theories will advise the
program and provide scientific peer review. We will develop learning
modules on major processes and their motivation. We provide a common
design language, and promote rapid sharing and application of original
innovations.
Rationale that builds confidence: New algorithms, and new synergistic
combinations of existing tools, are applied rapidly to outperform methods
currently applied on high-value intelligence problems. Leading scientists
will make significant contributions.
This program will benefit from an active community of practice involving
20 – 30 leading computer-social-cognitive scientists. There is active
participation in anticipation of future shared responsibilities.
Nature of expected results for the American Intelligence Community: More
intelligence analysts will prefer these tools to other deployed systems, not
only because the results will be better and more obvious, but because analyst
roles are recognized and enhanced rather than marginalized by technology that
is imposed on the community by business processes.
Nature of expected results for private communities and persons: It is not consistent with the notions of
participatory democracy that a branch of government has an information
technology that is not available to average citizens. Even if this were somehow consistent with our form of government,
the bandwidth necessary to develop the knowledge technologies and to develop a
complete curriculum for knowledge sciences is not available within any, or all,
of the world’s intelligence communities.
Risk if work is not done: In the absence of this alternative, there will be
continued elaboration, at great expense and with minimal progress, of the
general artificial intelligence paradigm, even though many scientists outside
of government circles recognize its limitations.
Social need: The Knowledge Sharing Core
will be deployed as a mechanism supporting the development of school and
university curriculum. This curriculum
will expose the knowledge technology function and operation. At the same time, part of the unified revenue
stream will be used to develop the science related to biological functions such
as perception, memory, cognition and anticipation.
On memetic complexity: We protect memetic complexity by identifying when the memetic expression
is simple and is threatening to more sophisticated knowledge sharing.
Criteria for annual progress evaluation: We expect to be
measured by the level of adoption by the public. The system is expected to be widely deployed within an 18-month
period.
C. Research objectives
Our
aim is to produce transformational technology that has broad uses in various
intelligence applications, including:
To accomplish this, we are
proposing an unusual development process that seeks to bypass typical constraints
on innovation adoption. The process
includes the following features:
The scientific objective of
this project is to use structural stratification, localization of information into
(type:value) pairs, and convolution operators to produce inference and
informational organization.
Innovators/scientists are
investigating several hypotheses concerning the utility of structural
stratification. Particular care has been
exercised in acquiring access to complex real world text sources, including
research medical literatures, web harvesting from consenting e-forums,
literatures, and patent disclosures. It is noted that, in spite of the
huge investments in text understanding systems, no text understanding
system is available for low cost use by average citizens.
Localization of information
about linguistic variation is a guiding principle. Several measurement and data encoding
innovations allow rapid complex passes over very large, or small, data
structures. Our localization processes depend on the discovery, by algorithms,
of invariance in specific informational structure. In natural systems, localization
of structure/function requires a behavioral/functional commonality and thus localization
is involved in producing natural archetypes.
When abstracted into language these archetypes are reflected in patterns
observed as linguistic variation in text.
In the Knowledge Sharing Core,
type is realized as a combination of substructural elements. How we treat type and relationships between
types is informed by the well-studied double articulation of phonemes in spoken
language, and by case grammars involving a normalized and structural use of
parts of speech. In human minds, a class of
abstractions occurs in the formation and use of natural language. This is because natural language has evolved
to reflect the causal structure of natural types in the world. We have conjectured that structural
stratification is the key to complex machine inference and high conceptual
fidelity in knowledge representation.
This conjecture should be explained.
The structural stratification
exists in the natural expression of text.
Computer programs can observe the patterns of linguistic variation. Each pattern of linguistic variation has
causes related to anticipation and memory.
The linguistic variation exists at one level of organization; memory and
anticipation exist in separate realities.
Human memory is produced from
what is, to that level of organization, a hidden reality. Anticipation is grounded in the variation’s
environment and is also hidden. The
patterns exist because humans communicate with each other. The physical properties of human memory and
human anticipation shape the patterns of expression. In the language of the social-biologists Maturana and Varela, one
can think of the pattern as a memetic expression of an autopoietic envelope
having a complex interior and a reactive mechanism that manages a structural
coupling for maintaining the re-occurrence of pattern expression. Following the double articulation
principle, internal value structure of type creates an entailment to dynamic
structure between type exemplars. The
presence of structural coupling can be observed in nature, and human
communication depends on this structural coupling in both memetic simplex and
memetic complex expression. The
relevance of the notion of structural coupling between (type:value) pairs is an
objective, well-framed scientific claim.
The BCNGroup and technology
scientists have made a map of how memetic structure is being expressed. Our work has been over the publicly
disclosed patents in the area of the knowledge technologies. We extract and abstract the properties of
patterns of expressive behavior. From
this technology we can anticipate the development and adoption of new
innovations. A new capability is being
made available. Memetic expression is as complex
as genetic expression, perhaps even more complex because the memetic expression
is within social systems (as shared concepts) and the genetic expression is
within natural ecosystems (as animals). Substructural variation in
machine inference binds inductive and deductive inferencing. How this is accomplished has as yet not been
demonstrated, but has been suggested in private research on the tri-level
architecture in conjunction with the Russian quasi-axiomatic theory. Patterns of predictive
inferencing about the “evolution” of thematic content of real time flow of
information from web sources are of particular interest. Archetypal value carries with it
the rich detail that allows natural language to be understood, by humans,
within social communication. Thus
specific words and word structures are reflective of meaning in the context of
broader experiences within the memory and anticipational aspects of human
cognition. Before concluding this section,
we should return to the conjecture that structural stratification is the key to
complex machine inference and high conceptual fidelity in knowledge
representation. The key opens the door
to a number of deep surprises, the first of which is that this key radically simplifies
the formal tasks associated with real time knowledge expression within
communities. This simplification is
relative to the artificial assumptions of statistical pattern recognition and
of classical logics. The surprise is
that the new technology will deliver value with far fewer computational
resources. The explanation as to why there
is a surprise is that the cognitive load is pushed back away from the
algorithms, where it has not and will not, by rational argument,
occur. The design of
the Knowledge Sharing Core pushes the cognitive load to human minds and
into controlled vocabularies where mediated reconciliation processes can be
instrumented (using the SchemaLogic SchemaServer, for example). The human mind can be observed to have
functionality and behavior that no computer program has even reasonably
approximated. This is the point of the
two books by Sir Roger Penrose, a point made also by scholars related to the
BCNGroup. This simplification, and this
surprise, have a place in history.
D1. Detailed description of technical approach
This section, and the next, are
long and perhaps difficult to read. We
have two purposes in writing these two sections. First is to continue the exposition of basic theory and second is
to put this theoretical work into the current political and business process
context.
(type:value) pairs have been used and implemented in
various systems since the 1950s, and they are part of almost every major
programming language and knowledge representation system in use today:
1. They are the basis for the LISP property lists in the 1950s.
2. They are the slot-filler scheme in every frame-based knowledge representation system.
3. They are the basis for the data structures in COBOL, PL/I, Pascal, C, C++, Ada, Java, etc.
4. They are the representation in the concept nodes of conceptual graphs (which were first published in 1976).
Given
the enormous number of variations in which (type:value) pairs have been used,
it is reasonable to conclude that no new patents in the area would be
possible.
Dr.
Sowa’s observation allows an important insight: namely, that some unexpected
computer science innovations are possible, if one adopts a new paradigm.
For our team the (type:value) pair has properties that are NOT anticipated by
classical computer science and cognitive models based on scientific
reductionism. It is these properties
that we judge to be foundational to knowledge technologies.
From
a quick reading of the two (1994, 1996) ATS patents, one is surprised by the
specific manner of disclosure. One can
recognize at the outset that the (type:value) pair is a good way to localize
information about type and value. The CCM constructions follow XML and
ontologies developed from objects and classes (like OWL). But there is
something in addition to the (type:value) pairing, and this has to do with
information organization and inference.
Clearly the patent officers felt that the CCM (Contiguous Connection
Model) construction was NOT anticipated by the work that John Sowa refers to.
Some language
will help us here. Inversions
involve two processes: (1) the traversal of a branch (or
tree or collection of branches), and (2) the convolution over all or
some subset of more elementary units (e.g., significant words), where the
convolution creates a partition and an equivalence relationship. Inversions are a specific type of
convolution, as defined in classical mathematics. The convolution is over a set.
As each element of the set is visited, some action takes place, that
action being defined by the convolution operator. In classical mathematics, the set can be infinite or finite in
size. In the CCM convolutions the set
contains (type:value) pairs and the action is defined by rules.
Speed of
convolution operators over hash tables will turn out to be more and more
important as we develop more complex convolutions and as we allow the user (or
researcher) the parameters needed to re-apply convolutions experimentally as
one tries to bring a specific focus into the conceptual roll-up. The convolution may occur differentially
over type-categories or over value-categories, in ways that are disclosed in
the 1996 CCM patent.
These
“constructed” equivalence relationships are expressed as part of a CCM
notational system under development by OntologyStream as part of an R&D
contract to ATS. Once these relationships are expressed in the
CCM notational system, one can formally discuss properties related to both
fidelity and to efficiency in data processes.
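As an illustration only, the following minimal Python sketch shows one way a set of (type:value) pairs might be double encoded into hash tables and then partitioned by a rule-driven pass over the set. It is not the CCM construction disclosed in the ATS patents. The function names (double_encode, convolve), the sample pairs, and the two rules are all assumptions introduced here for the example.

```python
from collections import defaultdict


def double_encode(pairs):
    """Index each (type, value) pair twice: once by type, once by value.

    Hypothetical helper: two hash tables give constant-time lookup
    from either side of the pair.
    """
    by_type = defaultdict(set)
    by_value = defaultdict(set)
    for t, v in pairs:
        by_type[t].add(v)
        by_value[v].add(t)
    return by_type, by_value


def convolve(pairs, rule):
    """Visit every pair once and apply `rule` to it.

    `rule` maps a pair to a class key. Pairs that share a key are
    treated as equivalent, so the result is a partition of `pairs`.
    """
    classes = defaultdict(set)
    for pair in pairs:
        classes[rule(pair)].add(pair)
    return dict(classes)


# Sample data (invented for illustration).
pairs = {
    ("person", "alice"),
    ("person", "bob"),
    ("place", "paris"),
    ("org", "paris"),
}

by_type, by_value = double_encode(pairs)

# The same pass applied differentially: over type-categories,
# then over value-categories.
type_classes = convolve(pairs, rule=lambda p: p[0])
value_classes = convolve(pairs, rule=lambda p: p[1])

# Set-theoretic operation on the hash tables: values shared by two types.
shared = by_type["place"] & by_type["org"]
```

Here the value "paris" appears under two types, so the value-based pass groups those two pairs into one equivalence class while the type-based pass keeps them apart. Changing the rule and re-running the pass is the kind of experimental re-application the text describes, under the assumptions stated above.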
For example, the convolution can be formally complex if
ontology is used with reconciliation containers. Complexity arises in the naturally occurring ambiguation and
disambiguation processes that are essential to the use of natural language within
communities. Logics over (type:value)
pair schema containers follow the auxiliary innovations one sees in
SchemaLogic Inc.’s SchemaServer and other similar systems.
In
both XML and in the Cycorp technologies, a problem exists in finding the proper
scope, namespaces, and situational context.
Complexity issues are at the core of the solution of this problem. Complexity can be expressed in continuum
models. We point to connectionism as
one missing component of existing information technologies. The claim is
not that connectionism supplies all of the answers, but that connectionism
exists because there is more required than a localization of information into
(type:value) pairs.
One may also understand, or
believe, that the structure of any natural system’s expression is so
constrained in the real world that the number of types and the relationships
between types are small in number, and yet open to change. Seen in this
way, one finds data regularity in context as a matter of human
observation. Even with this type of
data regularity in context taken into account, individual localizations can be
massive in number. The current architectures develop
problems in completeness and consistency (the micro-theory problem in Cycorp
and the scope problem in Topic Maps). One has to be able to organize a
reasonable number of elements, each having the (type:value) pair nature, into
situational and scoped constructions. Specifically we look to several
innovations that have been adopted as part of SchemaServer developed by
SchemaLogic Inc. SchemaServer provides both a data schema integration
process and a community based reconciliation process that works on expressing
structural ambiguities necessary to human dialog and interaction. But the
reconciliation of controlled vocabularies and database schemas is only the very
beginning of the capabilities we expect to deliver within a few months. Schema resolution is seen as both
a discrete process, involving logics over schema, as well as a continuous
process, involving techniques such as latent semantic indexing and associative
memories. Differential ontology is a formal mapping methodology between
the discrete (and explicit) ontology and the implicit (continuum mathematics
expressed) ontology. More has to be said on
differential and formative ontology, but for now we should return to the
discussion of the ATS patents. Are these patents a reduction to practice
of both the (type:value) pair AND connectionist theory? The answer is
“yes”. Specifically, a “global”
organizational process is illustrated by the “inversion” technique disclosed in
the ATS patent.
The ATS patents: ATS (Applied Technical Systems) has
developed several of the first CCM-powered referential systems with the hope
that CCM-powered systems could become a ubiquitous information and knowledge
sharing technology -- sitting at the heart of a cultural / economic knowledge
revolution. The use of these patents by the Knowledge Sharing Foundation
will demonstrate why this is a reasonable hope. We expect to build other
technologies, based on other patents, on the fundamental data structures that
now exist in a currently deployed CCM-powered NdCore ontology development
system. We feel that the disclosure of innovations that can be built on
the CCM constructions will fundamentally change what can be expected in the
near term.
OntologyStream has developed a
number of basic research tools that are available to the team, and will be made
available as part of the Knowledge Sharing Core. For example, stochastic clustering of (type:value) pairs can be
shown with the Shallow Link analysis, Iterated scatter-gather and Parcelation
(SLIP) software developed in 2001 – 2002 as part of cyber event detection
research. This software allows us to
easily show the connection between localization of information, development of
relationship and the organization of sets of localized information using an
“eventChemistry”. The focus of this
software is in the exposition of principled information production using the
stratified paradigm (localization / global organization ). With these tools, we are able to
explore various aspects of connectionism, including nearness, similarity and
complexity. Deductive inference, using first order predicate logics,
makes little sense in domains with high measures of irregularity and
novelty. So one can, and should, make a distinction between deductive
logics, which can be performed by computers, and inductive inference, a
cognitive process that is not well understood. Having made this
distinction, we nevertheless point to an unanticipated computational architecture
that is not exactly standard first order predicate logic. This architecture is based on a Russian paradigm called quasi-axiomatic
theory. From a study of this
foundational work we have simplified and extended the notion of deduction to
cover a situational and formative process involving localization and
globalization. The claim is made that
this form of deduction is more closely related to the inductive processes that
science finds at the heart of cognition.
Text Analysis patents: Our team includes
a small company, Text Analysis International Corporation (TAIC). A patent-pending
Integrated Development Environment (IDE) for developing text analyzers
has been evaluated in preliminary work by OntologyStream scientists. The
TAIC patent application allows knowledgeable users to develop a flexible
multi-pass construction process that produces a highly situational set of
parsing rules. Passes are involved in tokenizing, morphological analysis,
spelling correction, parts-of-speech tagging, entity recognition, simple
extraction (names, titles, locations, dates, quantities), and constituent
recognition (noun phrases, passages, themes).
In
this IDE, these passes are not black boxes, as is typical of deployed NLP or
ontology constructor systems, but are open to rapid modification by a
knowledgeable user. This is essential to our overall architecture design
since non-computer scientists need to be able to make adjustments to the rules
that are used in the parsing of text. The
modifications are expressed in the open construction of atoms in a situational
logic. The “inferred” compounds are
composed of those atoms and can be rendered as a taxonomy or an ontology. The
atoms themselves are “recognized” by the IDE and users are allowed to
instantiate those atoms that are deemed important. Moreover, an
additional invention (not as yet disclosed) convolves the ATS patents with the
TAIC patent application to produce a general-purpose ontology
constructor. Given
such a flexible arrangement, one can organize an NLP or ontology constructor
system in the best possible way for any given application. Furthermore,
the ability to insert passes into an existing set of passes enables a system to
grow, or be reduced, in a flexible and modular fashion. For example, some
passes can be devoted entirely to syntax, others to lexical process such as
segmenting text into lines, or a complex subsystem such as a recursive grammar
for handling lists.
During
phase 1, our effort will be in a deep linguistic and ontology analysis of text
using manually constructed multi-pass parsing of rare text. The TAIC IDE is designed to achieve this
type of domain specific measurement of the parts of speech and the parts of
ontology. Our core team has already had
experience with the TAIC IDE, and the program manager, Dr. Prueitt, has worked
on the TAIC patent description.
The Semio patents: The Semio
patents, developed by Claude Vogel but now owned by Entrieva Inc., will be
extended so that the already “best in market” results of the Entrieva
conceptual maps application will be improved and made domain-specific. A
test collection using a small number of short fables has been studied, using
Semio, as part of preliminary research at OntologyStream. Claude
Vogel’s discovery assists in the definition of concept expression and the
extraction of passage categories having similar meanings. But other
inventions have to be used along with this one if conceptual roll-up is to
become the technique of choice for text analysis. An educational module is necessary to describe the innovation,
and to document what each innovation by itself is and is not able to do.
On the need for a common language: The
issue of a common language is a complex one.
Ideally, one should have a mathematical foundation to knowledge systems,
but suitable mathematics may not be readily available. Many scholars have come to believe that
mathematical biology, for example, cannot be developed based on current notions
of category and set membership.
However, we know of several extensions of mathematics that might serve
this need, including Russian quasi-axiomatic theory and applied semiotics
(theory of sign systems). Another
approach, one that deconstructs and then reconstructs set theory, is rough sets
and polylogics. In the meantime, we
still need a common means to talk about computer program behavior, and the best
option we have found is Cubicon. Sandy
Klausner, founder of CoreTalk Inc and inventor of the Cubicon language, will
represent Cubicon concepts in meetings with scholars, and illustrate the
benefits and requirements of a common description/deployment language for
knowledge technology innovations.
Klausner and Prueitt have been discussing how to use the language since
early 2002, and in August 2003 the Cubicon language was first used to communicate
algorithmic modifications to the ATS system that increased the conceptual
fidelity of the CCM-Powered NdCore conceptual rollup process. Important
new work on the CCM system (performed by OntologyStream during a 6-month effort
to end in October 2003), while still preliminary, adds ontology and linguistic
services to CCM’s newest NdCore, creating a process for thematic
analysis. The NdCore creates an emerging ontology that depends on
the text analyzed and the variation of inputs by the users. This work is
consistent with the broader concept of the Knowledge Sharing Core proposed
here.

SchemaLogic Inc: SchemaLogic Inc. will supply their schema
reconciliation technology in the form of SchemaServer 2.0. Schema reconciliation is related to a search
for the Topic Map process model. A Topic
Map can be about a complex subject that is undergoing fundamental changes. Our scientists have been attempting to
address this type of modeling. But
formative process models are on the leading edge of the standardization
processes. We will address exactly
these issues because, unless they are addressed, it seems unlikely that
real-world, real-time situational ontology is possible.

Topic Maps: Steven Newcomb, one of the primary authors
of the Topic Maps 1.0 standard, will advise the team on the development of
scope adjustment based on ontology services in conjunction with SchemaServer’s
community-based knowledge management services.
SchemaServer uses a proprietary methodology to assist in reconciliation
of multiple controlled vocabularies from diverse and complex interacting
communities. SchemaServer will be
deployed on a dedicated OntologyStream server for 18 months. A dedicated knowledge engineer/knowledge
management engineer will be employed by OntologyStream to use and develop
knowledge artifacts based on a principled use of the SchemaServer. The
SchemaServer will NOT be integrated into the Knowledge Sharing Core but will be
an external resource.

Infrastructure. While
ontology operations can be demonstrated within conventional infrastructure,
such infrastructure is poorly suited to such operations and limits them in the
following ways: Each of these limits, taken
alone, can easily cripple ontology operations. Taken together, they keep
ontology operations as a perpetual laboratory curiosity. For example, the
infrastructure of J2EE or .NET loads unnecessary transaction baggage on
differential ontology. Also, a relational database
with SQL does not support agile metadata
transformations, except through the addition of meta modeling
(accomplished through SchemaLogic) and a process model that allows
the deconstruction and reconstruction of situational logics. Our use of this methodology will
allow us to develop a complete and proper system rather than components that
have to be expressed within .NET or J2EE. Care will be taken to stand up
the system as J2EE interoperable, but many of the processes will use Berkeley
DB and/or a key-less hash table management system within a peer-to-peer
distributed operating system, the Knowledge Sharing Core, that is independent
of the J2EE architectures.

Education. Educational
services represent a major challenge both in terms of justifying the approach
to those unfamiliar with the paradigm and in providing deep training in how
text understanding and ontology services work.
Dr. Giovanni Marchisio has begun the design of a university-level
curriculum on all methods adopted by the Knowledge Sharing Core. Dr. Larry Medsker, at American University
and Dr. Art Murray at George Washington University will collaborate on this
effort and involve other university-based colleagues. (Full development of these designs will be pursued under separate
funding.) For initial content, the team will produce a competitive
comparison between the conceptual indexing activities by ATS using NdCore
versus Entrieva using Semio maps.
Steven Newcomb will provide authoritative expertise on the Topic Map
standard and on OWL ontology standards, as well as extend some basic graph
theoretic inference mechanisms involving polylogics, HyTime and situational
logics (an OntologyStream innovation).
John Sowa will advise on other related basic research and comparable
methods.

D2: Summary of design. A model of the Knowledge Sharing
Core is given in Figures 4 and 5. The
first depicts flow and the second depicts layers. The target tasks are text analysis resulting in ontology
production. The ontology will have
reusable components so that structured signatures related to specific types of
social discourse and knowledge sharing are revealed.

Figure 4: Flow of Knowledge Sharing Core

Multi-pass parsing tools will be
used to parse and orient ontology production.
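
The multi-pass arrangement can be illustrated with a minimal sketch. This is not the TAIC tool set or its API. The names Pipeline, split_lines, split_words, and lowercase are hypothetical and are introduced only to show how a pass can be inserted into an existing ordered set of passes without changing the other passes.

```python
from typing import Callable, List

# A "pass" is assumed here to be a function from a list of text units
# to a new list of text units (hypothetical simplification).
Pass = Callable[[List[str]], List[str]]


def split_lines(units: List[str]) -> List[str]:
    """Lexical pass: segment each unit into non-empty lines."""
    return [line for u in units for line in u.splitlines() if line.strip()]


def split_words(units: List[str]) -> List[str]:
    """Lexical pass: segment each unit into whitespace-separated words."""
    return [w for u in units for w in u.split()]


def lowercase(units: List[str]) -> List[str]:
    """Normalization pass: fold every unit to lower case."""
    return [u.lower() for u in units]


class Pipeline:
    """An ordered, editable list of passes applied one after another."""

    def __init__(self, passes=None):
        self.passes: List[Pass] = list(passes or [])

    def insert(self, index: int, new_pass: Pass) -> None:
        # Grow the system by placing a pass between existing passes.
        self.passes.insert(index, new_pass)

    def remove(self, old_pass: Pass) -> None:
        # Reduce the system by dropping a pass.
        self.passes.remove(old_pass)

    def run(self, text: str) -> List[str]:
        units = [text]
        for p in self.passes:
            units = p(units)
        return units


pipeline = Pipeline([split_lines, split_words])
before = pipeline.run("The Fox\nand the Crow")

# Insert a normalization pass between the two existing passes.
pipeline.insert(1, lowercase)
after = pipeline.run("The Fox\nand the Crow")
```

In this sketch the two original passes are untouched when the third is added, which is the modular property the multi-pass design relies on.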
Test collections will be placed into a competitive analysis where one
approach is based on the Text Analysis International Corporation’s multiple
pass linguistic/ontology analyzer tool set.
SchemaLogic’s SchemaServer will be used to allow the team members to
develop and adopt taxonomy, controlled vocabularies, and ontology. The
configuration of relations is done without modifying the underlying
(type:value) pair data. This
notion has been captured in the term “eventChemistry”, which was discussed in
our 2002 NIMA proposal (deemed
fundable but not funded due to budgetary, and perhaps political, issues).

Figure 5: Layers of Knowledge Sharing Core with CCM
engine

Formative and differential
ontology is done within what has been called a tri-level architecture, because
models of memory and anticipation are developed separately and then merged in
situational ontology expressed as a middle stratum. A stratification of the system allows independent processing and
discovery at each of several levels without automatic or strict (logical)
entailment in the other levels. The middle stratum is not
logically at the same level of organization as the categorical invariances
(atoms of a logic). The chemistry of
events is developed in general terms and can be differentially applied to
produce formative reactions during the process of defining ontology scope
parameters. This follows the model of
quasi-axiomatic theory, but relies also on the co-mapping of
continuum mathematics and discrete mathematics, e.g. “differential ontology”,
that was not present in the Russian work (1950-1995). The tri-level architecture was developed to
separate the memory of invariance and the top-down anticipation of templates
into two completely different logic systems.
Tri-level logical entailment is
not a first order logic, but it may depend on a first order logic and may, once
expressed in machine language, be treated as a predicate logic. Formative and differential
ontology is an inquiring system that supports conjecture and a broad array of
potentially anomalous information. Novelty detection is immediate due to
negative search characteristics similar to what is achieved in neural network
Adaptive Resonance Theory architectures.
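
The idea of novelty detection by negative search can be sketched in a few lines. This is only a set-membership illustration under stated assumptions. It is not an Adaptive Resonance Theory network and not the project's implementation. The class NoveltyDetector and the sample (type, value) pairs are hypothetical.

```python
from typing import Hashable, Set, Tuple

# A (type, value) pair is modeled here as a plain tuple (an assumption).
Pair = Tuple[Hashable, Hashable]


class NoveltyDetector:
    """Flags an item as novel when a lookup among known items fails."""

    def __init__(self) -> None:
        self._seen: Set[Pair] = set()

    def observe(self, pair: Pair) -> bool:
        # A failed (negative) hash lookup is the novelty signal; it takes
        # roughly constant time regardless of how many pairs are stored.
        is_novel = pair not in self._seen
        if is_novel:
            self._seen.add(pair)  # the novel item becomes part of memory
        return is_novel


detector = NoveltyDetector()
stream = [
    ("actor", "fox"),
    ("actor", "crow"),
    ("actor", "fox"),      # repeat, so not novel
    ("object", "cheese"),
]
flags = [detector.observe(p) for p in stream]
```

The point of the sketch is that no scan over stored patterns is needed: the absence of a match is detected directly, and the unmatched item is then committed to memory.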
Drs. Daniel Levine and Paul Prueitt have investigated, and published in
scientific journals, issues related to perception and novelty detection since
the mid 1980s. The architecture of the human
brain system is found to be relevant in an exercise of executive function over
logically underconstrained formative processes. Karl Pribram’s work on holonomic models of perception and
behavioral expression fits into a framework that is more likely to find
scientific support than general artificial intelligence. Lakoff (1999) argues that there
is a scientific revolution under way that potentially overturns the features
that are common to "first generation" cognitive science and software
and the analytic philosophy from which it stems. Our project clearly breaks from the first generation and is part
of the movement that Lakoff identifies.

Performance measurement. We will focus our measurement on the comparison of our
system with other available methods that can be used on the same data, and the
emphasis will be on the benefits to practical (i.e., “real world”) reasoning
among human analysts. Several of our technologies may be able to
demonstrate previously unimagined speed and scope of processing, allowing for
real time ontology processing of great fidelity using massive data. The
speed, however, is less important than the performance of the whole system,
including the input of analysts, in terms of sense making effectiveness.
A reasoning system, in other words, must be evaluated in terms of reasoning and
not primarily in terms of computational speed.
Governance.
The Knowledge Sharing Core addresses the need to make a transition in how
information technology innovation is being evaluated and procured by the
federal government. SAIC management understands the need for
transformation that will benefit military and intelligence clients. Accordingly, SAIC management will not advise
on what to include in the Core as this will be a process that is governed by
scientists on the project’s advisory board. It will always be clear that
it is the scientists and not the business leaders who make these selections.
Phases Two & Three: The scale of knowledge sharing will grow in
all dimensions and into application areas that have not been initially
selected. In general, precise pattern
recognition allows real time realignment of parsers and ontology services so
that new and important linguistic variation can be routed immediately to those
who need to look for consequences relating to national security. For example, a similar functionality is
needed in responding to new patterns from medical ICD code analysis (syndromic
surveillance) and in immediately viewing digital libraries (via grid systems) from a new
viewpoint. A similar functionality is
needed in mapping vulnerability and threats in trucking infrastructure and
harbors. See Figure 4 for a sense of
how ontology operations can become widely distributed. With the high fidelity of version 1 of the system, we will be
positioned to pursue a very difficult application based on new science that
social theorist Raymond Bradley, one of our advisors, is able to
contribute. We will have the
capacity to discover patterns of linguistic variation that identify social unit
membership. By extracting signature
patterns from voice recordings of conversations, we would be able to detect
whether the speakers are members of a group, quite apart from the words that
they are using. Likely members of a
sleeper terrorist cell, for example, can be identified.

Figure 6: Two application areas for the Knowledge
Sharing Core

In phase 1 we will have
investigated additional innovations and in phase 2 expect to incorporate the
best of them. One item currently is of
high interest, but unfortunately it is not ready for incorporation in phase
1. It is a complex addressing technique
that treats data, relations, structures, code etc. strictly as addresses, not,
as traditional systems do, distinguishing between data in containers and their
addresses. This system, patent pending in
the EU, distinguishes between data and structures (while representing them in the
same way), and therefore can simulate containers. But since dimensions and
complexity are not tied to actual data, any number of dimensions or any degree
of complexity can be simulated as well.
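
As a loose illustration only, the following sketch shows structures that refer to data by address rather than containing it. It does not describe the patent-pending technique, whose details are not given here. AddressStore, timeline, and by_kind are hypothetical names, and the event strings are made-up sample data.

```python
from typing import Dict, List


class AddressStore:
    """Stores each distinct value once and hands out its integer address."""

    def __init__(self) -> None:
        self._address_of: Dict[str, int] = {}  # value -> address
        self._values: List[str] = []           # address -> value

    def intern(self, value: str) -> int:
        # Reuse the existing address when the value is already known.
        if value not in self._address_of:
            self._address_of[value] = len(self._values)
            self._values.append(value)
        return self._address_of[value]

    def resolve(self, address: int) -> str:
        return self._values[address]

    def __len__(self) -> int:
        return len(self._values)


store = AddressStore()
events = ["login", "logout", "login", "login", "logout"]

# Structure 1: an ordered sequence that holds only addresses.
timeline = [store.intern(e) for e in events]

# Structure 2: a grouping built over structure 1, again holding no data,
# only addresses (keys) and positions in the timeline (values).
by_kind: Dict[int, List[int]] = {}
for position, address in enumerate(timeline):
    by_kind.setdefault(address, []).append(position)
```

In this sketch the same data is assigned to two different structures without being copied, and the stored data grows with the number of distinct values rather than with the number of events.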
Data structures are simulations that do not actually hold data. Data is
assigned to structures. Such assignments
can be in all possible forms – multiple assignments of the same data to
different structures, or structures assigned to other structures, or code
assigned to data and structures. This
architecture produces an underconstrained data schema. The speed and flexibility of this
addressing system make sense for ontology operations, quite apart from any
other benefit, but its scaling characteristics may be even more important. Any truly massive application will have to
find a way around the linear scaling of conventional tools, and the addressing
system accomplishes that goal. We will
be able to demonstrate the relationships shown in Figure 7 during phase 2
work. The curve flattens for this
system, mostly due to similarities in the events represented. The shape of the
curve varies statistically rather than mathematically, approaching linearity in
the worst case (when all represented strings are unique). We understand that some at NSA are referring
to this type of performance as “fractal scalability”.

A recent IBM press release
states: “IBM is
developing an XML-based architecture designed to unify various
machine-learning, statistical, and analytical approaches to improve computer
systems' ability to retrieve and use data, autonomously in many cases. IBM's unstructured information management
architecture (UIMA) will apply the Combination Hypothesis to help advance data
analysis, explained David Ferrucci, a staff member at IBM Research. … IBM still considers UIMA to be a research
project and does not have a timetable for implementing the technology
commercially.”

The
"combination hypothesis" is exactly what the father of fuzzy logic,
Lotfi Zadeh, called the "generalization group". Zadeh started
to talk about this in the mid to late 1990s when he became aware that his
notion of "computing with words" had failed to find a way to reduce
natural language to computational processes. John Sowa has some related
work on "intermediate languages". The Knowledge Sharing Core
concept is designed to give the end user the knowledge needed to use complex
linguistic and knowledge tools within the notion of a generalized group, or
within the UIMA. Further, the Knowledge Sharing Core concept differs from
the AI agenda in that the cognitive load required to "make sense of"
experience of language systems (more generally semiotic systems) has to be
reallocated -- the expectation needs to be dropped that the computer can do it
autonomously. We disagree with IBM's claim
regarding XML: "It is
difficult to effectively combine multiple techniques in parallel to improve
data access and use. XML offers a key way to meet this challenge. Using XML
tags on documents provides structure and adds semantics, thereby facilitating
searching and analysis, particularly of otherwise unstructured data, Ferrucci
said. XML thus also helps integrate unstructured and structured data for
analysis." Our position is that the
experience of language systems only marginally depends on having a localization
of information from a non-(database-type) structure to a database type
structure. Differential and Formative Ontology was invented to address
differences between continuum type information representation (as in a neural
network or genetic algorithm computer program) and discrete information as in
XML or CCM. Scanning more widely for
comparable approaches, we note that the conceptual graph (CG) approach has extended the
principles of existential graphs (Charles Peirce), entity-relationship
diagrams, semantic networks, and XML-type ontology representation. Once
knowledge is expressed in a CG, various technologies facilitate a direct mapping to first order
logic. CG is used in a number of COTS systems to manage n-ary relations
in novel ways. CG systems generally assume that knowledge can be
represented as tokens in logic and rules based on these tokens, without
polylogic and analogic capability. This assumption is seen to have merit
by many technologists and by first generation cognitive scientists, but it is
in active dispute by Karl Pribram (from a cognitive neuroscience viewpoint) and
by Robert Shaw (from an ecological psychology viewpoint). Differential ontology produces
small situated ontologies through a very rapid reduction of patterns in massive
data. These small ontologies cannot be interpreted by the rules of a
first order logic. The atoms from which the ontologies are constructed
are (type:value) pairs and are rendered into a 2-, 3-, or n-dimensional visual
display, which aids the analyst and the analyst community in interpreting and
making judgments on ambiguous intelligence.

E. Statement of work

The program is to be conducted in
three phases. All six tasks are active in each phase, but they change their
focus and character as the program matures. In phase 1 (18 months) a full
differential ontology system for ontology processing will be developed and
tested in multiple and varied settings, culminating in an application that is
realistic in terms of size, difficulty, subject matter, and participation by
analysts. Phases 2 and 3 essentially recapitulate the Phase 1 cycle, beginning
with a major reconfiguration and ending with a major application that
demonstrates the generality of the system and the capacity of the development
program for continued innovation. Phases 2 and 3 are optional: the government
may elect not to continue if the system has not demonstrated advanced
performance and the likelihood of further innovation.

Task 1: Elaborate the design.
The initial design must be sufficient to guide the assembly of components and
application of the system in the first experiments. It is expected, however,
that the design keeps advancing as it is interpreted and tested by the program
participants, and that new aspects of the design, and improved expression of
the design, occur during the project, especially to prepare for reconfiguration
that occurs at the outset of phases 2 and 3. A panel of scientists will
advise on advanced concepts from various fields that will be relevant to the
design. To assure effective communication, the design will be elaborated
from different viewpoints using different media. The following design
documentation is expected:

Task 2: Evaluate and obtain
components, or create components. The Knowledge Sharing Core will be
composed of several components that form a system. The project will identify
and obtain suitable existing components and avoid developing entirely new
components unless there is a clear opportunity to innovate or a clear void in
the market. The general evaluation criteria are the following. (Additional
sub-criteria should be added as necessary or where competitive cases need to be
resolved.)

The components sufficient for
conducting experiments in phase 1 will be made available at the beginning of
the phase and will be suitable for integration as the first version of the
Core. A scanning and vetting process should be in place during phase 1 to
identify components that may be possible to add with low effort and in time for
use in experiments 3 and 4, but the main focus during phase 1 will be on
preparing for major reconfigurations at the beginning of phases 2 and 3.

Task 3. Integrate and
interface components. Programmers, usage analysts, and scientific
advisors will all identify outputs of the integrated system, which will in turn
guide specification of system operation. Programmers will execute these
specifications and perform technical testing. Preliminary operational testing
will be conducted by usage analysts and fed back for revisions. An effort must
be made to keep changing the system rapidly in pursuit of the ideal design and
avoid premature closure and refinement of a particular instantiation.

Task 4. Conduct
experiments. The program should identify a series of practical tests
that are increasingly difficult and that demonstrate the full range of
application of the system. Every test should be prepared with
suitable data, experimental conditions, and specific questions and hypotheses
that can be answered with performance results that are appropriate to the stage
of development. It is preferable that the test data will have been analyzed
previously by other means, such that results can be compared. Every test should be documented
as a case study to facilitate outside review.
Tests will arrive at specific implications for changes and
improvements. The tests should cover domains
that will be relevant to the intelligence community, though it is understood
that simplified conditions (unclassified, no involvement of working intel
analysts) are most appropriate for phase 1 tests. For larger tests, it is
understood that the project will need to develop relationships with data owners
who will want to share the results and who may either contribute research
questions or help perform the analysis.

Task 5: Communicate
findings. Major reports will consist of: four case studies based on
the tests, scientific papers (at least two during phase 1) to be presented at
conferences or published, educational presentations, and documentation of
innovations prior to patenting. (The labor for producing patent applications
will not be charged to the research contract, and thoughtful considerations, by
the science committee, will be made in each case.) The team members will
frequently produce, share, and comment on brief research notes. The educational
presentations will include briefings on the project suitable for presentation
to government reviewers and other research teams. Educational materials will be
needed to explain the background and motivation of some of the features of this
project since several aspects are unusual and deviate from the normal
background assumptions.

Task 6: Establish
collaborative structure for innovation. Near-term technical
objectives are important, but since the program holds much more promise beyond
that, it must be organized in such a fashion that a succession of innovations,
not all of which can be specified at this point, becomes a likely result.

Technical environment.
The team will establish practices and tools that promote innovation. A common
design language should be used to support rapid revision, avoid lower level
programming and infrastructure complications, and promote easy understanding of
each other's work. Communication
practices and tools should be used that promote online presence, easy
discussion when needed, and rapid access to context and reference material. It is especially important to keep remote
team members socially integrated. The team should also extend beyond those who
are directly working on the project, to those who are enlisted as role-players
within the experiments, and to a network of colleagues who will offer comment
and ideas. The
programming environment will be organized to support rapid prototyping,
testing, and feedback, without the need for lengthy performance and stability
checks, documentation cycles, and coordination meetings. The checks,
documentation, and coordination should not be ignored, however, but built in to
the development environment to the extent possible.

Bottom Line: The initial
government funding should be used to create a self-sustaining program to which
many additional funders eventually contribute, either in the form of license
fees or direct tasking for additional development. Two conditions are required:
there is open disclosure of all technology and its performance, and
contributors have a realistic prospect of profiting from intellectual property
rights. Often these two aims conflict. Work is not disclosed because of
proprietary interests, but with no disclosure, inquiry is soon stifled, and
with it the flow of economically valuable technology. In this program
especially, the inquiry must be open because the basic technology and the
science that underlies it are not widely understood, and the only way to
receive a fair examination is to show it and discuss it and not merely refer to
hidden processes. The program should thus enlist the United States patent
system as one way to ensure both the open inquiry and property rights that are
needed to sustain innovation. Any essential technology to be used in the Core
must be either patented, likely to be patented, or open source. This allows
full disclosure among team members, an extended network of colleagues who will
be interested in the program, and scientist advisors who will need to
understand how the system works. New patents created during the
program are handled under normal rules ensuring government access. The program
members will pool ownership and set fees for deployments. A portion of revenues
is to be reinvested in development efforts. In order to make this program
structure work, member companies need to be recruited that can contribute
patented technologies and who are otherwise able and willing to disclose their
work for the project. The companies must also agree to pursue additional
patents, to share ownership when appropriate, and to reinvest in a long-term
program.

F1. Detailed individual
effort description

Items listed in the schedule are treated elsewhere in the proposal, and
this space is used to extend comments on a subset of items.

Set up collaboration & communication. All team members will be
provided a copy of Groove. Every person will be required to use online
training, keep their presence marker on during working hours, and to meet a
quota of postings and exchanges. Every effort will be made to conduct all
message exchanges inside of Groove, in order to prevent fractured records and
poor sharing. All k


We
introduce the proposed system by tracing one of its motivations. John Sowa, one of the distinguished
scientific advisors to the project, made the following comments about the
(type:value) pair that is the focus of patents held by ATS (Applied Technical
Systems):


D3. Comparison with current technology