Differential System for Real-time Ontology Processing

 

A Communication Instrument

BCNGroup.org and OntologyStream Inc

(Edited) September 7, 2003

 

Edited slightly April 12, 2005 to bring this exposition up to date and to help establish the Road Map for Global Information Framework deployment. 

 

 

This is a communication instrument that has been modified from a September 4th, 2003 submission to DARPA by SAIC/OntologyStream in response to the REAL BAA.  The communication instrument is used to communicate general principles.  We do not include a budget and the communication instrument is not a proposal. 

 

The communication instrument has been modified from the original submission so that the proposed project has a slightly broader context.  The budget request to DARPA was approximately $2.5 million for 18 months.  Given an award by DARPA, our group would have developed a new semantic architecture (see RoadMap). 

 

Since 1991, BCNGroup founders have advocated a national project to establish the knowledge sciences as an academic discipline.  The national project has been referred to as the Manhattan Project to Establish the Knowledge Sciences as an Academic Discipline.  This Project to Establish the Knowledge Sciences (PEKS) could only be successful if the project itself was designed to be self-sustaining. 

 

The Knowledge Sharing Core concept and the Charter of the BCNGroup provide a ground on which to establish this condition. 

 

The Project to Establish the Knowledge Sciences seeks to derive indirect social benefits from funding that is initially justified based on the critical national need for a non-relational database type Human-centric Information Production systems technology.  The critical national need is for intelligence vetting systems whose input is unstructured computer data.  The structuring of data is possible in real time, so that data reflects current information that might be available to the human user. 

 


A.  Innovative claims for the proposed research

 

1.  Openness: The Knowledge Sharing Core Charter requires that all embedded technology be openly disclosed to the public.  The common disclosure of patents, pending patents and trade secrets allows scientists and consumers to drive improvements through fair and open collaboration.  Good design and good science accelerate program innovation while reducing cost and eliminating dependency on grants. 

 

2.  Infrastructure unencumbered by technology secrets: We develop a software infrastructure more conducive to enhanced memetic expression.  Conventional software infrastructure is optimized for retention of proprietary positions, while the Knowledge Sharing Core is optimized to support human knowledge sharing. 

 

3.  Multiple methods produce evolution:  Comparative use of radically different algorithmic processes will provide for cross validation and measurement of outcomes. As in nature, variety and selection are used to drive evolution of domain specific knowledge extraction systems.

 

4.  Science Committee: A committee of leading scholars will provide a review of the theory and practice as realized in Core processes and activities.  Specific scholars serve on our Science Oversight Committee. Interoperability with OWL ontology and with Topic Maps is built in.  New types of inference capabilities are being developed.

 

5.  Human component: The project takes into account the human component in a human/machine reasoning system.  Most conventional approaches attempt to create an autonomous reasoner with only supervisory participation by humans.  The Core architecture develops a patented data encoding that is already deployed as intelligence technology.  Our improvements to this deployed technology directly support various types of human memory and anticipation.  We shift from computer science to cognitive and social science.

 


 


 

6. New form of mathematics:  Computational processes produce a natural organizational stratification in data construction.  This stratification reveals a correspondence to several areas of natural science.  Stratification also reveals a relationship between discrete mathematics and continuum mathematics.  Prueitt first observed this relationship in 2002 while working at Object Sciences Corporation as Senior Scientist.  The context of this discovery is disclosed in documents on the OntologyStream web site.

 

By the expression “Differential Ontology” we choose to mean the interchange of structural information between Implicit (machine-based) Ontology and Explicit (machine-based) Ontology:

 

• by Implicit Ontology we mean an attractor neural network system or one of the variations of latent semantic indexing. 

 

• by Explicit Ontology we mean a bag of ordered triples { < a, r, b > }, where a and b are locations and r is a relational type, organized into a graph structure, and perhaps accompanied by first order predicate logic (such as the Topic Maps or Cyc ontologies).
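
As an illustration of the Explicit Ontology definition above, the following is a minimal sketch, assuming that locations and relational types can be written as plain strings.  The triples, the relation names, and the helper functions build_graph and neighbors are hypothetical and are not taken from any deployed system.

```python
from collections import defaultdict

# Hypothetical triples <a, r, b>; names are illustrative only.
triples = [
    ("fable", "has_part", "moral"),
    ("fable", "is_a", "text"),
    ("moral", "is_a", "concept"),
    ("fable", "has_part", "moral"),  # a bag may hold duplicates
]


def build_graph(triples):
    """Index a bag of triples as a labelled directed graph: a -> {r -> set of b}."""
    graph = defaultdict(lambda: defaultdict(set))
    for a, r, b in triples:
        graph[a][r].add(b)
    return graph


def neighbors(graph, a, r):
    """Return the locations reachable from location a through relational type r."""
    if a not in graph:
        return set()
    return set(graph[a].get(r, set()))


graph = build_graph(triples)
print(neighbors(graph, "fable", "has_part"))
```

The graph index collapses duplicate triples; a system that needs the multiplicity of the bag would keep a count alongside each edge.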

 

 

 

Figure 1: Mapping between continuum mathematics and discrete mathematics

 

Mathematical convolutions over localized bits of information, in the form of (type:value) pairs,  produce one of the set of new transforms that correspond to stratification theory. 

 

More on our notational system is given at:

 

http://www.bcngroup.org/area2/KSF/Notation/notation.htm

 


 


 

7. Differential and Formative Ontology:  The purpose of Differential and Formative Ontology is to identify those covariance patterns that exist in a data source. 

 

The purpose of the computational processes is to assign meaning to these patterns and to preserve the assignment as reusable knowledge artifacts. 

 

We create several advances in ontology processing, including data reduction using categorical abstraction into localized containers.  Containers are created in those places where ambiguation or disambiguation processes are essential. 

 

The contents of these containers are sets of (type:value) pairs that are double encoded into hash tables to provide almost instantaneous set theoretic operations, including support for convolutions based on a type differentiating template.   The template allows situational logic to control the formation of ontology compounds in real time. 
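
The container idea can be sketched as follows.  This is a minimal illustration under stated assumptions: "double encoded" is read here as keeping two hash tables, one keyed by type and one keyed by value, and the "type differentiating template" is read as a set of admitted types.  Both readings, and the class name Container, are assumptions made for illustration and do not describe the patented encoding.

```python
from collections import defaultdict


class Container:
    """A localized container of (type, value) pairs, indexed two ways."""

    def __init__(self, pairs=()):
        self.by_type = defaultdict(set)   # type  -> set of values
        self.by_value = defaultdict(set)  # value -> set of types
        for t, v in pairs:
            self.add(t, v)

    def add(self, t, v):
        self.by_type[t].add(v)
        self.by_value[v].add(t)

    def pairs(self):
        """Return the contents as a plain set of (type, value) pairs."""
        return {(t, v) for t, vs in self.by_type.items() for v in vs}

    def intersect(self, other):
        return Container(self.pairs() & other.pairs())

    def union(self, other):
        return Container(self.pairs() | other.pairs())

    def apply_template(self, template):
        """Keep only the pairs whose type is admitted by the template (a set of types)."""
        return Container(
            (t, v) for t in template if t in self.by_type for v in self.by_type[t]
        )


a = Container([("person", "Sowa"), ("year", "1976"), ("topic", "graphs")])
b = Container([("person", "Sowa"), ("topic", "frames")])
print(a.intersect(b).pairs())
print(a.apply_template({"person", "year"}).pairs())
```

Because both indexes are hash tables, a lookup by type or by value takes expected constant time, which is the property the paragraph above relies on for fast set theoretic operations.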

 

 

Figure 2: The aggregation of atoms into compounds

 

Once this meaning has been established, a stratified logic can be applied to predict the “properties” of compounds based on a partial understanding of what the compound (that is, the compound one is looking at) is and how it is composed. 

 

8.  Inferential support and contextualization:  Traditional foundations of mathematics and logic are extended and used to supply inferential support for various processes, including automation of ambiguation and disambiguation during natural language parsing as part of ontology construction.

 

9.  Applied memetic research over active text literatures:  Basic research produces a situational rendering from latent semantic analysis and other techniques revealing linguistic variations that can be used to track memetic expression in social discourse, medical research literatures and patent disclosures. 


 


 

The role of the patents and science:  The 1994 and 1996 ATS patents develop a reduction to practice of an interesting and important construction, called in the patents the Contiguous Connection Model (CCM).  In CCM-Powered systems, the atomic units are word n-tuples.  These atomic units are directly derived from a measurement process made over the text input. 
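
The measurement of word n-tuples over text can be sketched as follows.  This is a generic sliding-window extraction, offered as an assumption about what such a measurement might look like; it is not the procedure disclosed in the ATS patents, and the function name word_ntuples and the tokenizing pattern are hypothetical.

```python
import re
from collections import Counter


def word_ntuples(text, n=2):
    """Slide a window of n words over the text and return each window as a tuple."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Count how often each word pair occurs in a short sample text.
counts = Counter(word_ntuples("the cat sat on the mat and the cat slept", n=2))
print(counts.most_common(3))
```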

 

 

Figure 3: The Actionable Intelligence Process Model

 

The Actionable Intelligence Process Model, AIPM, can be referenced as a process model where the processes are to be fully defined by the CCM notational system. 

 

OntologyStream founders developed this model as we studied an information-processing paradigm used in the American intelligence communities.  We place a diagram depicting the nine aspects of this model in Figure 3 and will develop additional references to this model in the following pages. 

 

 


 

B.  Project roadmap

 

Technical Goal:  Our technical goal is a breakthrough in conceptual fidelity and speed of machine ontology formation, produced within a self-sustaining, unencumbered (open access) development program that becomes a foundation for the public semantic web.  The enabling technologies depend less on computer programs to make inferences and more on a flexible computer interface into data that measures co-occurrence using structural ontology.  The structural ontology is merely a descriptive means to relate things in the world to the associations, properties and relationships found and reified by humans. 

 

Economic Goal:  A number of innovations to be integrated into a Peer-to-Peer knowledge sharing system will be either public domain or patented.  The science committee will work to develop an objective evaluation of all disclosed patents and patent applications, so as to assist in the proper ownership of these innovations.  A use-based instrumentation of the Knowledge Sharing Cores will be used to differentially charge for use, based on negotiated contracts.  Use-based instrumentation will enable a low cost for high value.   

 

Benefits to the American Intelligence Community: Near-term benefits include better intelligence findings from any large or small data stream, especially in terms of detecting novelty, low salience changes, and broad shifts that other methods miss.  The development process, once started, increasingly augments analyst reasoning at a greater rate and at lower cost compared to conventional approaches.  The state of the art for human/machine inferencing is extended and enhanced.  

 

Benefits to private communities and persons: The Knowledge Sharing Core concept has been designed as a Peer-to-Peer technology with no dependencies on .NET or Java. 

 

The embedded technologies are instrumented to provide for cyber security, localized control over transmitted information, transparency, and use metrics that drive economic compensation based on patent disclosures. 

 

These features provide for privacy and diversity of viewpoint from the ground up.  The Peer-to-Peer knowledge sharing capability is designed to develop a Many-to-Many communication system as an extension to the current virtual community type software such as e-forums and chat systems. 

 

Technical barriers:  The current barriers to real-time differential ontology operations are mainly self-imposed due to three assumptions: (1) levels of analysis must be removed or hidden from the operator, (2) the computer system itself must understand language and have common sense, and (3) a single method must be perfected rather than a combination of methods.  The main technical challenge is to assemble tools and staff that consistently support an approach without these assumptions.

 



 

Elements of approach:  Algorithms from competitive traditions are modified, selected, compared, combined, and tested rapidly, with reference to hypotheses derived from a specific set of natural science theories about memetic expression, human reasoning, and human perception.  Those who have originated and continue to develop the theories will advise the program and provide scientific peer review.  We will develop learning modules on major processes and their motivation.  We provide a common design language, and promote rapid sharing and application of original innovations. 

 

Rationale that builds confidence:  New algorithms, and new synergistic combinations of existing tools, are applied rapidly to outperform methods currently applied on high-value intelligence problems.  Leading scientists will make significant contributions.  This program will benefit from an active community of practice involving 20 – 30 leading computer-social-cognitive scientists.  There is active participation in anticipation of future shared responsibilities.

 

Nature of expected results for the American Intelligence Community:  More intelligence analysts will prefer these tools to other deployed systems, not only because the results will be better and more obvious, but because analyst roles are recognized and enhanced rather than marginalized by technology that is imposed on the community by business processes.

 

Nature of expected results for private communities and persons:  It is not consistent with the notions of participatory democracy that a branch of government has an information technology that is not available to average citizens.  Even if this were somehow consistent with our form of government, the bandwidth necessary to develop the knowledge technologies and to develop a complete curriculum for knowledge sciences is not available within any, or all, of the world’s intelligence communities.

 

Risk if work is not done: In the absence of this alternative, there will be continued elaboration, at great expense and with minimal progress, of the general artificial intelligence paradigm, even though many scientists outside of government circles recognize its limitations.

 

Social need:  The Knowledge Sharing Core will be deployed as a mechanism supporting the development of school and university curriculum.  This curriculum will expose the knowledge technology function and operation.  At the same time, part of the unified revenue stream will be used to develop the science related to biological functions such as perception, memory, cognition and anticipation.  On memetic complexity:  We protect memetic complexity by identifying when the memetic expression is simple and is threatening to more sophisticated knowledge sharing. 

 

Criteria for annual progress evaluation:  We expect to be measured by the level of adoption by the public.  The system is expected to be widely deployed within an 18-month period. 


 

C.  Research objectives

 

Our aim is to produce transformational technology that has broad uses in various intelligence applications, including:

 

 

To accomplish this, we are proposing an unusual development process that seeks to bypass typical constraints on innovation adoption.  The process includes the following features: 

 

 

The scientific objective of this project is to use structural stratification, localization of information into (type:value) pairs, and convolution operators to produce inference and informational organization.  

 

Innovators/scientists are investigating several hypotheses concerning the utility of structural stratification. 

 

Particular care has been exercised in acquiring access to complex real world text sources, including research medical literatures, web harvesting from consenting e-forums, literatures, and patent disclosures. 

 

It is noted that, in spite of the huge investments in text understanding systems, there is no text understanding system available for low cost use by average citizens.

 



 

Localization of information about linguistic variation is a guiding principle.  Several measurement and data encoding innovations allow rapid complex passes over very large, or small, data structures.  Our localization processes depend on the discovery, by algorithms, of invariance in specific informational structure. 

 

In natural systems, localization of structure/function requires a behavioral/functional commonality and thus localization is involved in producing natural archetypes.  When abstracted into language these archetypes are reflected in patterns observed as linguistic variation in text.  In the Knowledge Sharing Core, type is realized as a combination of substructural elements.  How we treat type and relationships between types is informed by the well-studied double articulation of the phoneme in spoken language, and by case grammars involving a normalized and structural use of parts of speech.  

 

In human minds, a class of abstractions occurs in the formation and use of natural language.  This is because natural language has evolved to reflect the causal structure of natural types in the world.  We have conjectured that structural stratification is the key to complex machine inference and high conceptual fidelity in knowledge representation.  This conjecture should be explained. 

 

The structural stratification exists in the natural expression of text.  Computer programs can observe the patterns of linguistic variation.  Each pattern of linguistic variation has causes related to anticipation and memory.  The linguistic variation exists at one level of organization; memory and anticipation exist in separate realities. 

 

Human memory is produced from what is to that level of organization a hidden reality.  Anticipation is grounded in the variation’s environment and is also hidden.  The patterns exist because humans communicate with each other.  The physical properties of human memory and human anticipation shape the patterns of expression.  In the language of social-biologists, Maturana and Varela, one can think of the pattern as a memetic expression of an autopoietic envelope having a complex interior and a reactive mechanism that manages a structural coupling for maintaining the re-occurrence of pattern expression. 

 

Following the double articulation principle, the internal value structure of type creates an entailment to dynamic structure between type exemplars.  The presence of structural coupling can be observed in nature, and human communication depends on this structural coupling in both memetic simplex and memetic complex expression.  The relevance of the notion of structural coupling between (type:value) pairs is an objective, well-framed scientific claim. 


 


 

The BCNGroup and technology scientists have made a map of how memetic structure is being expressed.  Our work has been over the publicly disclosed patents in the area of the knowledge technologies.  We extract and abstract the properties of patterns of expressive behavior.  From this technology we can anticipate the development and adoption of new innovations.   A new capability is being made available. 

 

Memetic expression is as complex as genetic expression, perhaps even more complex because the memetic expression is within social systems (as shared concepts) and the genetic expression is within natural ecosystems (as animals).

 

Substructural variation in machine inference binds inductive and deductive inferencing.  How this is accomplished has as yet not been demonstrated, but has been suggested in private research on the tri-level architecture in conjunction with the Russian quasi-axiomatic theory.  

 

Patterns of predictive inferencing about the “evolution” of thematic content of real time flow of information from web sources are of particular interest. 

 

Archetypal value carries with it the rich detail that allows natural language to be understood, by humans, within social communication.  Thus specific words and word structures are reflective of meaning in the context of broader experiences within the memory and anticipational aspects of human cognition. 

 

Before concluding this section, we should return to the conjecture that structural stratification is the key to complex machine inference and high conceptual fidelity in knowledge representation.   The key opens the door to a number of deep surprises, the first of which is that this key radically simplifies the formal tasks associated with real time knowledge expression within communities.  This simplification is relative to the artificial assumptions of statistical pattern recognition and of classical logics.  The surprise is that the new technology will deliver value with far fewer computational resources. 

 

The explanation as to why there is a surprise is that the cognitive load is pushed back away from the algorithms, where it has not and will not, by rational argument, occur.  The design of the Knowledge Sharing Core pushes the cognitive load to the human minds and into controlled vocabularies where mediated reconciliation processes can be instrumented (using the SchemaLogic SchemaServer, for example).  The human mind can be observed to have functionality and behavior that no computer program has even reasonably approximated.  This is the point of the two books by Sir Roger Penrose, a point made also by scholars related to the BCNGroup. 

 

This simplification, and this surprise, has a place in history. 

 


D1.  Detailed description of technical approach

 

This section and the next are long and perhaps difficult to read.  We have two purposes in writing these two sections.  The first is to continue the exposition of basic theory, and the second is to put this theoretical work into the current political and business process context.

 

We introduce the proposed system by tracing one of its motivations.  John Sowa, one of the distinguished scientific advisors to the project, made the following comments about the (type:value) pair that is the focus of patents held by ATS (Applied Technical Systems):

(type:value) pairs have been used and implemented in various systems since the 1950s, and they are part of almost every major programming language and knowledge representation system in use today:

 

1. They are the basis for the LISP property lists in the 1950s.

2. They are the slot-filler scheme in every frame-based knowledge representation system.

3. They are the basis for the data structures in COBOL, PL/I, Pascal, C, C++, Ada, Java, etc., etc., etc.

4. They are the representation in the concept nodes of conceptual graphs (which were first published in 1976).

 

Given the enormous number of variations in which (type:value) pairs have been used, it is reasonable to conclude that no new patents in the area would be possible. 

 

Dr. Sowa’s observation allows an important insight: namely, that some unexpected computer science innovations are possible, if one adopts a new paradigm.  For our team the (type:value) pair has properties that are NOT anticipated by classical computer science and cognitive models based on scientific reductionism.  It is these properties that we judge to be foundational to knowledge technologies.

 

From a quick reading of the two (1994,1996) ATS patents, one is surprised by the specific manner of disclosure.  One can recognize at the outset that the (type:value) pair is a good way to localize information about type and value.  The CCM constructions follow XML and ontologies developed from objects and classes (like OWL).  But there is something in addition to the (type:value) pairing, and this has to do with information organization and inference.  Clearly the patent officers felt that the CCM (Contiguous Connection Model) construction was NOT anticipated by the work that John Sowa refers to.

 

Some language will help us here.  Inversions involve two processes: (1) the traversal of a branch (or tree or collection of branches), and (2) the convolution over all or some subset of more elementary units (e.g., significant words), where the convolution creates a partition and an equivalence relationship.  Inversions are a specific type of convolution, as defined in classical mathematics.  The convolution is over a set.  As each element of the set is visited some action takes place, that action being defined by the convolution operator.  In classical mathematics, the set can be infinite or finite in size.  In the CCM convolutions the set contains (type:value) pairs and the action is defined by rules.
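
The description of a convolution that visits each (type:value) pair and applies a rule can be sketched as follows.  This is a minimal illustration, assuming that a rule is a function returning a key for the equivalence class of each pair, or None to leave the pair out.  The function names and the sample rule are hypothetical and are not the operators disclosed in the CCM patents.

```python
from collections import defaultdict


def convolve(pairs, rule):
    """Visit every (type, value) pair once; the rule names the pair's equivalence class."""
    classes = defaultdict(set)
    for pair in pairs:
        key = rule(pair)
        if key is not None:  # a rule may skip a pair, restricting the set to a subset
            classes[key].add(pair)
    return dict(classes)


pairs = {
    ("word", "Ontology"),
    ("word", "ontology"),
    ("word", "graph"),
    ("year", "2003"),
}


def same_word_ignoring_case(pair):
    """Hypothetical rule: words are equivalent when they differ only in letter case."""
    t, v = pair
    return v.lower() if t == "word" else None


partition = convolve(pairs, same_word_ignoring_case)
print(partition)
```

Each pair that the rule admits lands in exactly one class, so the classes form a partition of the admitted subset, and membership in the same class is the constructed equivalence relationship.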

 

Speed of convolution operators over hash tables will turn out to be more and more important as we develop more complex convolutions and as we allow the user (or researcher) the parameters needed to re-apply convolutions experimentally as one tries to bring a specific focus into the conceptual roll-up.  The convolution may occur differentially over type-categories or over value-categories – in ways that are disclosed in the 1996 CCM patent. 

 

These “constructed” equivalence relationships are expressed as part of a CCM notational system under development by OntologyStream as part of an R&D contract to ATS.  Once expressed in the CCM notational system, one can formally discuss properties related to both fidelity and efficiency in data processes.  For example, the convolution can be formally complex if ontology is used with reconciliation containers.  Complexity arises in the naturally occurring ambiguation and disambiguation processes that are essential to the use of natural language within communities.  Logics over (type:value) pair schema containers follow the auxiliary innovations one sees in SchemaLogic Inc.’s SchemaServer and other similar systems. 

 

In both XML and the Cycorp technologies, a problem exists in finding the proper scope, namespaces, and situational context.  Complexity issues are at the core of the solution to this problem.  Complexity can be expressed in continuum models.  We point to connectionism as one missing component of existing information technologies.  The claim is not that connectionism supplies all of the answers, but that connectionism exists because there is more required than a localization of information into (type:value) pairs.

 

One may also understand, or believe, that the structure of any natural system’s expression is so constrained in the real world that the number of types and the relationships between types are small in number, and yet open to change.  Seen in this way, one finds data regularity in context as a matter of human observation.  Even with this type of data regularity in context taken into account, individual localizations can be massive in number. 

 

The current architectures develop problems in completeness and consistency (the micro-theory problem in Cycorp and the scope problem in Topic Maps).  One has to be able to organize a reasonable number of elements, each having the (type:value) pair nature, into situational and scoped constructions.  Specifically we look to several innovations that have been adopted as part of SchemaServer developed by SchemaLogic Inc.  SchemaServer provides both a data schema integration process and a community based reconciliation process that works on expressing structural ambiguities necessary to human dialog and interaction.  But the reconciliation of controlled vocabularies and database schemas is only the very beginning of the capabilities we expect to deliver within a few months.

 

Schema resolution is seen as both a discrete process, involving logics over schema, as well as a continuous process, involving techniques such as latent semantic indexing and associative memories.  Differential ontology is a formal mapping methodology between the discrete (and explicit) ontology and the implicit (continuum mathematics expressed) ontology. 
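
One direction of such a mapping, from an implicit (continuous) representation to explicit triples, can be sketched as follows.  This is a toy illustration under loud assumptions: plain co-occurrence vectors with cosine similarity stand in for latent semantic indexing (no dimensionality reduction is performed), the relation name "similar_to" and the threshold are hypothetical, and the reverse direction (explicit to implicit) is not shown.  It is not the differential ontology method itself.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_vectors(documents):
    """Implicit side: one sparse co-occurrence vector per term."""
    vectors = defaultdict(Counter)
    for doc in documents:
        terms = set(doc.lower().split())
        for a, b in combinations(sorted(terms), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def to_explicit(vectors, threshold=0.6):
    """Explicit side: emit <a, r, b> triples for sufficiently similar terms."""
    return {
        (a, "similar_to", b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    }


documents = ["wolf eats lamb", "fox eats hen", "wolf hunts lamb"]
vectors = cooccurrence_vectors(documents)
print(to_explicit(vectors, threshold=0.6))
```

In this small sample the two verbs end up related because they occur with the same neighbors, which is the kind of structural information a continuous model can hand to a discrete one.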

 

More has to be said on differential and formative ontology, but for now we should return to the discussion of the ATS patents.  Are these patents a reduction to practice of both the (type:value) pair AND connectionist theory?  The answer is “yes”.  Specifically, a “global” organizational process is illustrated by the “inversion” technique disclosed in the ATS patent.

 

The ATS patents:  ATS (Applied Technical Systems) has developed several of the first CCM-powered referential systems with the hope that CCM-powered systems could become a ubiquitous information and knowledge sharing technology -- sitting at the heart of a cultural / economic knowledge revolution.  The use of these patents by the Knowledge Sharing Foundation will demonstrate why this is a reasonable hope. 

 

We expect to build other technologies, based on other patents, on the fundamental data structures that now exist in a currently deployed CCM-powered NdCore ontology development system.  We feel that the disclosure of innovations that can be built on the CCM constructions will fundamentally change what can be expected in the near term.  

 

OntologyStream has developed a number of basic research tools that are available to the team, and will be made available as part of the Knowledge Sharing Core.  For example, stochastic clustering of (type:value) pairs can be shown with the Shallow Link analysis, Iterated scatter-gather and Parcelation (SLIP) software developed in 2001 – 2002 as part of cyber event detection research.  This software allows us to easily show the connection between localization of information, development of relationship and the organization of sets of localized information using an “eventChemistry”.  The focus of this software is in the exposition of principled information production using the stratified paradigm (localization / global organization).

 

With these tools, we are able to explore various aspects of connectionism, including nearness, similarity and complexity.  Deductive inference, using first order predicate logics, makes little sense in domains with high measures of irregularity and novelty.  So one can, and should, make a distinction between deductive logics, which can be performed by computers, and inductive inference, a cognitive process that is not well understood.  Having made this distinction, we nevertheless point to an unanticipated computational architecture that is not exactly standard first order predicate logic.  This architecture is based on a Russian paradigm called quasi-axiomatic theory.  From a study of this foundational work we have simplified and extended the notion of deduction to cover a situational and formative process involving localization and globalization.  The claim is made that this form of deduction is more closely related to the inductive processes that science finds at the heart of cognition. 

 

Text Analysis patents:  Our team includes a small company, Text Analysis International Corporation (TAIC).  A patent pending Integrated Development Environment (IDE) for developing text analyzers has been evaluated in preliminary work by OntologyStream scientists.  The TAIC patent application allows knowledgeable users to develop a flexible multi-pass construction process that produces a highly situational set of parsing rules.  Passes are involved in tokenizing, morphological analysis, spelling correction, parts-of-speech tagging, entity recognition, simple extraction (names, titles, locations, dates, quantities), and constituent recognition (noun phrases, passages, themes). 

 

In this IDE, these passes are not black boxes, as is typical to deployed NLP, or ontology constructor systems, but are open to rapid modification by a knowledgeable user.  This is essential to our overall architecture design since non-computer scientists need to be able to make adjustments to the rules that are used in the parsing of text.

 

The modifications are expressed in the open construction of atoms in a situational logic.  The “inferred” compounds are composed of those atoms and can be rendered as taxonomy or ontology.  The atoms themselves are “recognized” by the IDE and users are allowed to instantiate those atoms that are deemed important.  Moreover, an additional invention (not as yet disclosed) convolves the ATS patents with the TAIC patent application to produce a general-purpose ontology constructor.  

 

Given such a flexible arrangement, one can organize an NLP or ontology constructor system in the best possible way for any given application.  Furthermore, the ability to insert passes into an existing set of passes enables a system to grow, or be reduced, in a flexible and modular fashion.  For example, some passes can be devoted entirely to syntax, others to lexical processes such as segmenting text into lines, or to a complex subsystem such as a recursive grammar for handling lists. 
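
The idea of an open, modifiable sequence of passes can be sketched as follows.  This is a toy illustration, assuming that each pass is an ordinary function that reads and extends a shared state; the pass names and the crude recognition rules are hypothetical and do not reflect the TAIC IDE or its rule language.

```python
import re


def tokenize(state):
    """Lexical pass: split the text into alphabetic and numeric tokens."""
    state["tokens"] = re.findall(r"[A-Za-z]+|\d+", state["text"])
    return state


def tag_numbers(state):
    """Simple extraction pass: collect numeric tokens as quantities."""
    state["quantities"] = [t for t in state["tokens"] if t.isdigit()]
    return state


def tag_capitalized(state):
    """Crude entity pass: treat capitalized tokens as candidate names."""
    state["names"] = [t for t in state["tokens"] if t[0].isupper()]
    return state


def run(passes, text):
    """Apply each pass in order; every pass sees the results of the earlier ones."""
    state = {"text": text}
    for analysis_pass in passes:
        state = analysis_pass(state)
    return state


passes = [tokenize, tag_numbers]
passes.insert(1, tag_capitalized)  # insert a new pass into the existing set
result = run(passes, "Vogel studied 12 fables in Paris")
print(result["names"], result["quantities"])
```

Because the pass list is ordinary data, a knowledgeable user can add, remove or reorder passes without touching the other passes, which is the modularity the paragraph above describes.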

 

During phase 1, our effort will be in a deep linguistic and ontology analysis of text using manually constructed multi-pass parsing of rare text.  The TAIC IDE is designed to achieve this type of domain specific measurement of the parts of speech and the parts of ontology.  Our core team has already had experience with the TAIC IDE, and the program manager, Dr. Prueitt, has worked on the TAIC patent description. 

 

The Semio patents:   The Semio patents, developed by Claude Vogel but now owned by Entrieva Inc., will be extended so that the already “best in market” results of the Entrieva conceptual maps application will be improved and made domain-specific.  A test collection using a small number of short fables has been studied, using Semio, as part of preliminary research at OntologyStream. 

 

Claude Vogel’s discovery assists in the definition of concept expression and the extraction of passage categories having similar meanings.  But other inventions have to be used along with this one if conceptual roll-up is to become the technique of choice for text analysis.  An educational module is necessary to describe the innovation, and to document what each innovation by itself is and is not able to do.

 

On the need for a common language:  The issue of a common language is a complex one.  Ideally, one should have a mathematical foundation to knowledge systems, but suitable mathematics may not be readily available.  Many scholars have come to believe that mathematical biology, for example, cannot be developed based on current notions of category and set membership.  However, we know of several extensions of mathematics that might serve this need, including Russian quasi-axiomatic theory and applied semiotics (theory of sign systems).  Another approach, one that deconstructs and then reconstructs set theory, is rough sets and polylogics.  In the meantime, we still need a common means to talk about computer program behavior, and the best option we have found is Cubicon. 

 

Sandy Klausner, founder of CoreTalk Inc and inventor of the Cubicon language, will represent Cubicon concepts in meetings with scholars, and illustrate the benefits and requirements of a common description/ deployment language for knowledge technology innovations.   Klausner and Prueitt have been discussing how to use the language since early 2002, and in August 2003 the Cubicon language was first used to communicate algorithmic modifications to the ATS system that increased the conceptual fidelity of the CCM-Powered NdCore conceptual rollup process. 

 

Important new work on the CCM system (performed by OntologyStream during a 6-month effort to end in October 2003), while still preliminary, adds ontology and linguistic services to CCM’s newest NdCore, creating a process for thematic analysis.   The NdCore creates an emerging ontology that depends on the text analyzed and the variation of inputs by the users.  This work is consistent with the broader concept of the Knowledge Sharing Core proposed here. 

 

Schema Logic Inc:  Schema Logic Inc. will supply their schema reconciliation technology in the form of SchemaServer 2.0.  Schema reconciliation is related to a search for the Topic Map process model.  A Topic Map can be about a complex subject that is undergoing fundamental changes.  Our scientists have been attempting to address this type of modeling.  But formative process models are on the leading edge of the standardization processes.  We will address exactly these issues because, unless they are resolved, real-world, real-time situational ontology seems unlikely to be possible.

 

Topic Maps:  Steven Newcomb, one of the primary authors of the Topic Maps 1.0 standard, will advise the team on the development of scope adjustment based on ontology services in conjunction with SchemaServer’s community-based knowledge management services.  SchemaServer uses a proprietary methodology to assist in reconciliation of multiple controlled vocabularies from diverse and complex interacting communities.  SchemaServer will be deployed on a dedicated OntologyStream server for 18 months.  A dedicated knowledge engineer/ knowledge management engineer will be employed by OntologyStream to use and develop knowledge artifacts based on a principled use of the SchemaServer. The SchemaServer will NOT be integrated into the Knowledge Sharing Core but will be an external resource.

 

Infrastructure. While ontology operations can be demonstrated within conventional infrastructure, such infrastructure is poorly suited to such operations and limits them in the following ways:

 

 

Each of these limits, taken alone, can easily cripple ontology operations.  Taken together, they keep ontology operations a perpetual laboratory curiosity.  For example, the infrastructure of J2EE or .NET loads unnecessary transaction baggage on differential ontology.  Also, a relational database with SQL does not provide agile metadata transformations unless one adds meta modeling (accomplished through SchemaLogic) and a process model that allows the deconstruction and reconstruction of situational logics.

 

Our use of this methodology will allow us to develop a complete and proper system rather than components that have to be expressed within .NET or J2EE.  Care will be taken to stand up the system as J2EE interoperable, but many of the processes will use Berkeley Data Base and/or a key-less hash table management system within a peer-to-peer distributed operating system, the Knowledge Sharing Core, that is independent of the J2EE architectures.

 

Education.  Educational services represent a major challenge both in terms of justifying the approach to those unfamiliar with the paradigm and in providing deep training in how text understanding and ontology services work.  Dr. Giovanni Marchisio has begun the design of a university-level curriculum on all methods adopted by the Knowledge Sharing Core.  Dr. Larry Medsker at American University and Dr. Art Murray at George Washington University will collaborate on this effort and involve other university-based colleagues.  (Full development of these designs will be pursued under separate funding.)  For initial content, the team will produce a competitive comparison between the conceptual indexing activities by ATS using NdCore versus Entrieva using Semio maps.  Steven Newcomb will provide authoritative expertise on the Topic Map standard and on OWL ontology standards, as well as extend some basic graph theoretic inference mechanisms involving polylogics, HyTime and situational logics (an OntologyStream innovation).  John Sowa will advise on other related basic research and comparable methods.

 

D2: Summary of design.  

 

A model of the Knowledge Sharing Core is given in Figures 4 and 5.  The first depicts flow and the second depicts layers.  The target tasks are text analysis resulting in ontology production.  The ontology will have reusable components so that structured signatures related to specific types of social discourse and knowledge sharing are revealed.

 

 

Figure 4: Flow of Knowledge Sharing Core

 

Multi-pass parsing tools will be used to parse and orient ontology production.  Test collections will be placed into a competitive analysis where one approach is based on the Text Analysis International Corporation’s multiple pass linguistic/ontology analyzer tool set.  SchemaLogic’s SchemaServer will be used to allow the team members to develop and adopt taxonomy, controlled vocabularies, and ontology.  

 

The configuration of relations is done without modifying the underlying (type:value) pairs data.   This notion has been captured in the term “eventChemistry”, which was discussed in our 2002 NIMA proposal (deemed fundable but not funded due to budgetary, and perhaps political, issues).
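
The separation between fixed (type:value) atoms and reconfigurable relations can be sketched in a few lines of Python.  This is a sketch under stated assumptions and not the eventChemistry software: the atom values, the names `ATOMS`, `link`, and `neighbors`, and the choice to store a relation as a set of index pairs are all hypothetical.  What it shows is that two different configurations can be built over the same atoms while the atoms themselves are never edited.

```python
from typing import FrozenSet, Set, Tuple

Atom = Tuple[str, str]  # a (type, value) pair

# The underlying data: an immutable tuple of atoms (values are made up).
ATOMS: Tuple[Atom, ...] = (
    ("port", "80"),
    ("port", "443"),
    ("host", "alpha"),
    ("host", "beta"),
)

# A configuration is a set of links between atom positions, kept apart from ATOMS.
Configuration = Set[FrozenSet[int]]


def link(config: Configuration, a: Atom, b: Atom) -> None:
    """Relate two atoms by position; the atoms themselves are untouched."""
    config.add(frozenset((ATOMS.index(a), ATOMS.index(b))))


def neighbors(config: Configuration, a: Atom) -> Set[Atom]:
    """Atoms linked to `a` under the given configuration."""
    i = ATOMS.index(a)
    return {ATOMS[j] for pair in config if i in pair for j in pair if j != i}


# Two independent configurations over the same atoms.
by_traffic: Configuration = set()
link(by_traffic, ("host", "alpha"), ("port", "80"))
link(by_traffic, ("host", "alpha"), ("port", "443"))

by_site: Configuration = set()
link(by_site, ("host", "alpha"), ("host", "beta"))

snapshot = tuple(ATOMS)  # used to confirm that the atoms did not change
```

Discarding or rebuilding a configuration is cheap because it holds only positions, so an analyst could try several arrangements of the same events in succession.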

 

 

 

Figure 5: Layers of Knowledge Sharing Core with CCM engine

 

Formative and differential ontology is done within what has been called a tri-level architecture, because models of memory and anticipation are developed separately and then merged in situational ontology expressed as a middle stratum.  A stratification of the system allows independent processing and discovery at each of several levels without automatic or strict (logical) entailment in the other levels. 

 

The middle stratum is not logically at the same level of organization as the categorical invariances (atoms of a logic).  The chemistry of events is developed in general terms and can be differentially applied to produce formative reactions during the process of defining ontology scope parameters.   This follows the model of quasi-axiomatic theory, but relies also on the co-mapping of continuum mathematics and discrete mathematics, e.g. “differential ontology”, that was not present in the Russian work (1950- 1995).  The tri-level architecture was developed to separate the memory of invariance and the top-down anticipation of templates into two completely different logic systems. 

 

Tri-level logical entailment is not a first order logic, but it may depend on a first order logic and may, once expressed in machine language, be treated as a predicate logic.

 

Formative and differential ontology is an inquiring system that supports conjecture and a broad array of potentially anomalous information.  Novelty detection is immediate due to negative search characteristics similar to what is achieved in neural network Adaptive Resonance Theory architectures.  Drs. Daniel Levine and Paul Prueitt have investigated, and published in scientific journals, issues related to perception and novelty detection since the mid 1980s. 
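
To make the reference to Adaptive Resonance Theory concrete, the following Python sketch shows a much-simplified vigilance test of the kind used in ART-style networks.  It is not an ART implementation and not our system: the class `NoveltyDetector`, the feature sets, and the vigilance value are hypothetical, and the category choice and search-ordering steps of a real ART network are omitted.  An input is declared novel as soon as the search over stored categories finds none that passes the vigilance test.

```python
from typing import FrozenSet, List, Optional


def match(features: FrozenSet[str], prototype: FrozenSet[str]) -> float:
    """Fraction of the input's features that the prototype also has."""
    return len(features & prototype) / len(features) if features else 1.0


class NoveltyDetector:
    """Simplified vigilance test loosely modeled on ART (hypothetical)."""

    def __init__(self, vigilance: float) -> None:
        self.vigilance = vigilance
        self.prototypes: List[FrozenSet[str]] = []

    def present(self, features: FrozenSet[str]) -> Optional[int]:
        """Return the index of the matching category, or None if novel."""
        for i, proto in enumerate(self.prototypes):
            if match(features, proto) >= self.vigilance:
                # Resonance: keep only the features shared with the input.
                self.prototypes[i] = proto & features
                return i
        # The search failed for every stored category: the input is novel.
        self.prototypes.append(features)
        return None


detector = NoveltyDetector(vigilance=0.75)
first = detector.present(frozenset({"a", "b", "c", "d"}))   # nothing stored yet
second = detector.present(frozenset({"a", "b", "c", "e"}))  # 3 of 4 features match
third = detector.present(frozenset({"x", "y"}))             # no overlap
```

Raising the vigilance value makes the detector report novelty more readily; lowering it folds more inputs into existing categories.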

 

The architecture of the human brain system is found to be relevant in an exercise of executive function over logically underconstrained formative processes.  Karl Pribram’s work on holonomic models of perception and behavioral expression fits into a framework that is more likely to find scientific support than general artificial intelligence.

 

Lakoff (1999) argues that there is a scientific revolution under way that potentially overturns these features that are common to "first generation" cognitive science and software and the analytic philosophy from which it stems.  Our project clearly breaks from the first generation and is part of the movement that Lakoff identifies. 

 

Performance measurement.  We will focus our measurement on the comparison of our system with other available methods that can be used on the same data, and the emphasis will be on the benefits to practical (i.e., “real world”) reasoning among human analysts.  Several of our technologies may be able to demonstrate previously unimagined speed and scope of processing, allowing for real time ontology processing of great fidelity using massive data.  The speed, however, is less important than the performance of the whole system, including the input of analysts, in terms of sense making effectiveness.  A reasoning system, in other words, must be evaluated in terms of reasoning and not primarily in terms of computational speed. 

 

Governance.  The Knowledge Sharing Core addresses the need to make a transition in how information technology innovation is being evaluated and procured by the federal government.  SAIC management understands the need for transformation that will benefit military and intelligence clients.  Accordingly, SAIC management will not advise on what to include in the Core as this will be a process that is governed by scientists on the project’s advisory board.  It will always be clear that it is the scientists and not the business leaders who make these selections. 

 

Phases Two & Three:  The scale of knowledge sharing will grow in all dimensions and into application areas that have not been initially selected.  In general, precise pattern recognition allows real time realignment of parsers and ontology services so that new and important linguistic variation can be routed immediately to those who need to look for consequences relating to national security.  For example, similar functionality is needed in responding to new patterns from medical ICD code analysis (syndromic surveillance) and in immediately viewing digital libraries (via grid systems) from a new viewpoint.  It is also needed in mapping vulnerability and threats in trucking infrastructure and harbors.  See Figure 4 for a sense of how ontology operations can become widely distributed.

 

With the high fidelity of version 1 of the system, we will be positioned to pursue a very difficult application based on new science that social theorist Raymond Bradley, one of our advisors, is able to contribute.  We will have the capacity to discover patterns of linguistic variation that identify social unit membership.  By extracting signature patterns from voice recordings of conversations, we would be able to detect whether the speakers are members of a group, quite apart from the words that they are using.  Likely members of a sleeper terrorist cell, for example, can be identified.

 


 

Figure 6:  Two application areas for the Knowledge Sharing Core

 

In phase 1 we will have investigated additional innovations and in phase 2 expect to incorporate the best of them.  One item currently is of high interest, but unfortunately it is not ready for incorporation in phase 1.  It is a complex addressing technique that treats data, relations, structures, code etc. strictly as addresses, not, as traditional systems do, distinguishing between data in containers and their addresses. 

 

This system, patent pending in the EU, distinguishes between data and structures while representing them in the same way, and therefore can simulate containers.  But since dimensions and complexity are not tied to actual data, any number of dimensions or any degree of complexity can be simulated as well.  Data structures are simulations that do not actually hold data.  Data is assigned to structures.  Such assignments can take all possible forms: multiple assignments of the same data to different structures, structures assigned to other structures, or code assigned to data and structures.  This architecture produces an underconstrained data schema.
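
The idea that structures hold only addresses, and that data is assigned to structures rather than stored in them, can be illustrated with a small Python sketch.  It is a rough reading of the description above and not the patent-pending system, whose internals are not disclosed here: the class `AddressSpace`, its method names, and the use of integer addresses are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


class AddressSpace:
    """Items get integer addresses; structures hold addresses only (hypothetical)."""

    def __init__(self) -> None:
        self._items: Dict[int, object] = {}       # address -> item
        self._address_of: Dict[object, int] = {}  # item -> address
        self._members: Dict[str, Set[int]] = defaultdict(set)  # structure -> addresses

    def address(self, item: object) -> int:
        # Reuse the existing address if this item has been seen before.
        if item not in self._address_of:
            addr = len(self._items)
            self._items[addr] = item
            self._address_of[item] = addr
        return self._address_of[item]

    def assign(self, structure: str, item: object) -> None:
        # Assign an item (data, or the name of another structure) to a structure.
        self._members[structure].add(self.address(item))

    def contents(self, structure: str) -> Set[object]:
        # Simulate a container by resolving the structure's addresses.
        return {self._items[a] for a in self._members[structure]}

    def stored_items(self) -> int:
        return len(self._items)


space = AddressSpace()
space.assign("animals", "fox")         # data assigned to a structure
space.assign("tricksters", "fox")      # the same data assigned to a second structure
space.assign("themes", "tricksters")   # a structure assigned to another structure
```

In this sketch the string "fox" is stored once however many structures refer to it, and adding a new structure costs only a set of addresses.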

 

The speed and flexibility of this addressing system make it well suited to ontology operations, quite apart from any other benefit, but its scaling characteristics may be even more important.  Any truly massive application will have to find a way around the linear scaling of conventional tools, and the addressing system accomplishes that goal.  We will be able to demonstrate the relationships shown in Figure 7 during phase 2 work.  The curve flattens for this system, mostly due to similarities in the events represented.  The shape of the curve varies statistically rather than mathematically, approaching linearity in the worst case (when all represented strings are unique).  We understand that some at NSA are referring to this type of performance as “fractal scalability”.
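
The dependence of the curve on similarity among events can be illustrated with plain deduplication, used here as a stand-in technique and not as a description of the addressing system.  In the Python sketch below, the function `stored_growth` and the synthetic event strings are hypothetical.  Storage grows only when a previously unseen string arrives, so a repetitive stream flattens while a stream of unique strings grows linearly.

```python
from typing import Iterable, List


def stored_growth(events: Iterable[str]) -> List[int]:
    """Number of distinct strings stored after each event arrives."""
    seen = set()
    growth = []
    for event in events:
        seen.add(event)            # a repeated event adds no new storage
        growth.append(len(seen))
    return growth


# Similar events: ten thousand arrivals drawn from only 50 distinct strings.
repetitive = [f"event-{i % 50}" for i in range(10_000)]
# Worst case: every string is unique.
unique = [f"event-{i}" for i in range(10_000)]

flat_curve = stored_growth(repetitive)
linear_curve = stored_growth(unique)
```

How quickly a real data stream flattens depends on the statistics of that stream, which is consistent with the remark that the shape of the curve varies statistically.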

 
D3.  Comparison with current technology

 

A recent IBM press release states:

 

“IBM is developing an XML-based architecture designed to unify various machine-learning, statistical, and analytical approaches to improve computer systems' ability to retrieve and use data, autonomously in many cases.  IBM's unstructured information management architecture (UIMA) will apply the Combination Hypothesis to help advance data analysis, explained David Ferrucci, a staff member at IBM Research.  … IBM still considers UIMA to be a research project and does not have a timetable for implementing the technology commercially.”

 

The "combination hypothesis" is exactly what the father of fuzzy logic, Lotfi Zadeh, called the "generalization group".  Zadeh started to talk about this in the mid to late 1990s when he became aware that his notion of "computing with words" had failed to find a way to reduce natural language to computational processes.  John Sowa has some related work on "intermediate languages".  The Knowledge Sharing Core concept is designed to allow the end user the knowledge to use complex linguistic and knowledge tools within the notion of a generalized group, or within the UIMA.  Further, the Knowledge Sharing Core concept differs from the AI agenda in that the cognitive load required to "make sense of" experience of language systems (more generally semiotic systems) has to be reallocated -- the expectation needs to be dropped that the computer can do it autonomously.

 

We disagree with IBM's claim regarding XML: 

 

"It is difficult to effectively combine multiple techniques in parallel to improve data access and use. XML offers a key way to meet this challenge. Using XML tags on documents provides structure and adds semantics, thereby facilitating searching and analysis, particularly of otherwise unstructured data, Ferrucci said. XML thus also helps integrate unstructured and structured data for analysis." 

 

Our position is that the experience of language systems only marginally depends on having a localization of information from a non-(database-type) structure to a database type structure.  Differential and Formative Ontology was invented to address differences between continuum type information representation (as in a neural network or genetic algorithm computer program) and discrete information as in XML or CCM.

 

Scanning more widely for comparable approaches, we note that the conceptual graph (CG) approach has extended the principles of existential graphs (Charles Peirce), entity-relationship diagrams, semantic networks, and XML-type ontology representation.  Once knowledge is expressed in a CG, various technologies facilitate a direct mapping to first order logic.  CG is used in a number of COTS systems to manage n-ary relations in novel ways.  CG systems generally assume that knowledge can be represented as tokens in logic and rules based on these tokens, without polylogic and analogic capability.  This assumption is seen to have merit by many technologists and by first generation cognitive scientists, but it is in active dispute by Karl Pribram (from a cognitive neuroscience viewpoint) and by Robert Shaw (from an ecological psychology viewpoint).

 

Differential ontology produces small, situated ontologies through a very rapid reduction of patterns in massive data.  These small ontologies cannot be interpreted by the rules of a first order logic.  The atoms from which the ontologies are constructed are (type:value) pairs and are rendered into a 2-, 3-, or n-dimensional visual display, which aids the analyst and the analyst community in interpreting and making judgments on ambiguous intelligence.

 


 

E.  Statement of work

 

The program is to be conducted in three phases. All six tasks are active in each phase, but they change their focus and character as the program matures. In phase 1 (18 months) a full differential ontology system for ontology processing will be developed and tested in multiple and varied settings, culminating in an application that is realistic in terms of size, difficulty, subject matter, and participation by analysts. Phases 2 and 3 essentially recapitulate the Phase 1 cycle, beginning with a major reconfiguration and ending with a major application that demonstrates the generality of the system and the capacity of the development program for continued innovation. Phases 2 and 3 are optional: the government may elect not to continue if the system has not demonstrated advanced performance and the likelihood of further innovation.

 

Task 1: Elaborate the design.  The initial design must be sufficient to guide the assembly of components and application of the system in the first experiments. It is expected, however, that the design keeps advancing as it is interpreted and tested by the program participants, and that new aspects of the design, and improved expression of the design, occur during the project, especially to prepare for reconfiguration that occurs at the outset of phases 2 and 3.  A panel of scientists will advise on advanced concepts from various fields that will be relevant to the design.  To assure effective communication, the design will be elaborated from different viewpoints using different media. The following design documentation is expected:

 

 

Task 2: Evaluate and obtain components, or create components.  The Knowledge Sharing Core will be composed of several components that form a system. The project will identify and obtain suitable existing components and avoid developing entirely new components unless there is a clear opportunity to innovate or a clear void in the market. The general evaluation criteria are the following. (Additional sub-criteria should be added as necessary or where competitive cases need to be resolved.)

 

 

The components sufficient for conducting experiments in phase 1 will be made available at the beginning of the phase and will be suitable for integration as the first version of the Core. A scanning and vetting process should be in place during phase 1 to identify components that may be possible to add with low effort and in time for use in experiments 3 and 4, but the main focus during phase 1 will be on preparing for major reconfigurations at the beginning of phases 2 and 3.

 

Task 3: Integrate and interface components.  Programmers, usage analysts, and scientific advisors will all identify outputs of the integrated system, which will in turn guide specification of system operation. Programmers will execute these specifications and perform technical testing. Preliminary operational testing will be conducted by usage analysts and fed back for revisions. An effort must be made to keep changing the system rapidly in pursuit of the ideal design and avoid premature closure and refinement of a particular instantiation.

 

Task 4: Conduct experiments.  The program should identify a series of practical tests that are increasingly difficult and that demonstrate the full range of application of the system.  Every test should be prepared with suitable data, experimental conditions, and specific questions and hypotheses that can be answered with performance results that are appropriate to the stage of development. It is preferable that the test data will have been analyzed previously by other means, such that results can be compared.

 

Every test should be documented as a case study to facilitate outside review.  Each test will arrive at specific implications for changes and improvements.  The tests should cover domains that will be relevant to the intelligence community, though it is understood that simplified conditions (unclassified, no involvement of working intel analysts) are most appropriate for phase 1 tests.  For larger tests, it is understood that the project will need to develop relationships with data owners who will want to share the results and who may either contribute research questions or help perform the analysis.

 

Task 5: Communicate findings.  Major reports will consist of: four case studies based on the tests, scientific papers (at least two during phase 1) to be presented at conferences or published, educational presentations, and documentation of innovations prior to patenting. (The labor for producing patent applications will not be charged to the research contract, and thoughtful considerations, by the science committee, will be made in each case.)  The team members will frequently produce, share, and comment on brief research notes. The educational presentations will include briefings on the project suitable for presentation to government reviewers and other research teams. Educational materials will be needed to explain the background and motivation of some of the features of this project since several aspects are unusual and deviate from the normal background assumptions.

 

Task 6: Establish collaborative structure for innovation.   Near-term technical objectives are important, but since the program holds much more promise beyond that, it must be organized in such a fashion that a succession of innovations, not all of which can be specified at this point, become a likely result.

 

Technical environment.  The team will establish practices and tools that promote innovation. A common design language should be used to support rapid revision, avoid lower level programming and infrastructure complications, and promote easy understanding of each other's work.  Communication practices and tools should be used that promote online presence, easy discussion when needed, and rapid access to context and reference material.  It is especially important to keep remote team members socially integrated. The team should also extend beyond those who are directly working on the project, to those who are enlisted as role-players within the experiments, and to a network of colleagues who will offer comment and ideas.

 

The programming environment will be organized to support rapid prototyping, testing, and feedback, without the need for lengthy performance and stability checks, documentation cycles, and coordination meetings. The checks, documentation, and coordination should not be ignored, however, but built in to the development environment to the extent possible.

 

Bottom Line: The initial government funding should be used to create a self-sustaining program to which many additional funders eventually contribute, either in the form of license fees or direct tasking for additional development. Two conditions are required: there is open disclosure of all technology and its performance, and contributors have a realistic prospect of profiting from intellectual property rights. Often these two aims conflict. Work is not disclosed because of proprietary interests, but with no disclosure, inquiry is soon stifled, and with it the flow of economically valuable technology. In this program especially, the inquiry must be open because the basic technology and the science that underlies it are not widely understood, and the only way to receive a fair examination is to show it and discuss it and not merely refer to hidden processes. The program should thus enlist the United States patent system as one way to ensure both the open inquiry and property rights that are needed to sustain innovation. Any essential technology to be used in the Core must be either patented, likely to be patented, or open source. This allows full disclosure among team members, an extended network of colleagues who will be interested in the program, and scientist advisors who will need to understand how the system works.

 

New patents created during the program are handled under normal rules ensuring government access. The program members will pool ownership and set fees for deployments. A portion of revenues is to be reinvested in development efforts.  In order to make this program structure work, member companies need to be recruited that can contribute patented technologies and that are otherwise able and willing to disclose their work for the project. The companies must also agree to pursue additional patents, to share ownership when appropriate, and to reinvest in a long-term program.


 

F1.  Detailed individual effort description

 

Items listed in the schedule are treated elsewhere in the proposal, and this space is used to extend comments on a subset of items.

 

Set up collaboration & communication. All team members will be provided a copy of Groove. Every person will be required to use online training, keep their presence marker on during working hours, and to meet a quota of postings and exchanges. Every effort will be made to conduct all message exchanges inside of Groove, in order to prevent fractured records and poor sharing. All k