[Cover sheet generated from DARPA’s web site goes here – I don’t have a file, only a hard copy]
1. Openness: The Knowledge Sharing Core Charter requires that all embedded technology be openly disclosed to the public. The common disclosure of patents, pending patents, and trade secrets allows scientists and consumers to drive improvements through open collaboration. Good design and good science accelerate program innovation while reducing cost and eliminating dependency on grants.
2. Infrastructure unencumbered by technology secrets: We develop a software infrastructure more conducive to ontology operations. Conventional software infrastructure is optimized for retention of proprietary positions, while the Knowledge Sharing Core is optimized to support human knowledge sharing.
3. Multiple methods produce evolution: Simultaneous use of radically different algorithmic processes will provide for cross validation and measurement of outcomes. As in nature, variety and selection are used to drive evolution of domain specific knowledge extraction systems.
4. Science Committee: A committee of leading scholars will provide a review of the theory and practice as realized in Core processes and activities. Interoperability with OWL ontology and with Topic Maps will be built in. Specific scholars will be invited to serve on our Science Oversight Committee. Our team includes Steven Newcomb, one of the authors of the Topic Maps 1.0 standard, whose role will be as a knowledgeable interface with the Virginia Bioinformatics Institute. John Sowa, leading scholar on Conceptual Graphs, will advise the team.
5. Human component: The project takes into account the human component in a human/machine reasoning system. Most conventional approaches attempt to create an autonomous reasoner with only supervisory participation by humans. The Core architecture develops a patented, already deployed data encoding as an intellectual technology that directly supports various types of human memory and anticipation.
6. New form of mathematics: Computational processes produce a natural organizational stratification and mathematical convolutions over localized bits of information, in the form of (type:value) pairs residing in the computer.
7. Differential and Formative Ontology: We create several advances in ontology processing, including data reduction using categorical abstraction into localized containers. Containers are created in those places where ambiguation or disambiguation processes are essential. The contents of these containers are sets of (type:value) pairs that are double encoded into hash tables to provide almost instantaneous set theoretic operations, including support for convolutions based on a type differentiating template.
8. Inferential support and contextualization: Traditional foundations of mathematics and logic are extended and used to supply inferential support for various processes, including automation of ambiguation and disambiguation during natural language parsing as part of ontology construction.
9. Applied memetic research over active text literatures: Basic research produces a situational rendering from latent semantic analysis (patented by SAIC) and other techniques (patents supporting NdCore, Semio) revealing linguistic variations that can be used to track memetic expression in social discourse and medical research literatures.
B. Proposal roadmap
Goals: A breakthrough in conceptual fidelity and speed of ontology formation, produced within a self-sustaining, unencumbered (open access) development program that becomes a foundation for the semantic web.
Benefits: Near-term benefits include better intelligence findings from any large or small data stream, especially in terms of detecting novelty, low salience changes, and broad shifts that other methods miss. The development process, once started, increasingly augments analyst reasoning at a greater rate and at lower cost compared to conventional approaches. The state of the art for human/machine inferencing is enhanced.
Technical barriers: The current barriers to ontology operations are mainly self-imposed due to three assumptions: (1) levels of analysis must be removed or hidden from the operator, (2) the system itself must understand language and have common sense, and (3) a single method must be perfected rather than a combination of methods. The main technical challenge is to assemble tools and staff that consistently support an approach without these assumptions.
Elements of approach: Algorithms from competitive traditions are modified, selected, compared, combined, and tested rapidly, with reference to hypotheses derived from a specific set of natural science theories about memetic expression, human reasoning, and human perception. Those who have originated and continue to develop the theories will advise the program and provide scientific peer review. We will develop learning modules on major processes and their motivation. We provide a common design language, and promote rapid sharing and application of original innovations.
Rationale that builds confidence: New algorithms, and new synergistic combinations of existing tools, are applied rapidly to outperform methods currently applied on high-value intelligence problems. Leading scientists will make significant contributions. This program will benefit from an active community of practice involving 20 to 30 leading computer, social, and cognitive scientists. (Most of these scientists are not compensated under this contract. However, there is active participation in anticipation of future shared responsibilities.)
Nature of expected results: More intelligence analysts will prefer these tools to other deployed systems, not only because the results will be better and more obvious, but because analyst roles are recognized and enhanced rather than marginalized.
Risk if work is not done: In the absence of this alternative, there will be continued elaboration, at great expense and with minimal progress, of the general artificial intelligence paradigm, even though many scientists outside of government circles recognize its limitations.
Criteria for annual progress evaluation: We expect to be measured by the level of adoption by analysts in pilot settings. The system is expected to be deployed within the 18 month period. Program activity becomes self-sustaining in out-years as our funding request tapers off.
| FY 04 | FY 05 | FY 06 | FY 07 | FY 08 | Total |
| 1,600,000 | 1,000,000 | 500,000 | 500,000 | 400,000 | 4M |
C. Research objectives
Our aim is to produce transformational technology that has broad uses in various intelligence applications, including:
To accomplish this, we are proposing an unusual development process that seeks to bypass typical constraints on innovation. The process includes the following features:
The scientific objective of this project is to use structural stratification, localization of information into (type:value) pairs, and convolution operators to produce inference and informational organization. A collaborative team of innovators/scientists will investigate several hypotheses concerning the utility of structural stratification. Particular care has been exercised in acquiring access to complex real world text sources.
Localization of information about linguistic variation is a guiding principle. Several measurement and data encoding innovations allow rapid, complex passes over very large, or small, data structures. Our localization processes depend on the discovery, by algorithms, of invariance in specific informational structure. In natural systems, localization of structure/function requires a behavioral/functional commonality and thus localization is involved in producing natural archetypes. When abstracted into language these archetypes are rendered as type and reflected in linguistic variation in text. Type is realized as a combination of substructural elements as in the well-studied double articulation of phoneme in spoken language, and in case grammars involving a normalized and structural use of parts of speech. In human minds, a class of abstractions occurs in the formation and use of natural language. This is because natural language has evolved to reflect the causal structure of natural types in the world.
We have conjectured that structural stratification is the key to complex machine inference and high conceptual fidelity in knowledge representation. Substructural variation in machine inference binds inductive and deductive inferencing. Following the double articulation principle, internal value structure of type creates an entailment to dynamic structure between type exemplars. Predictive inferencing about the “evolution” of thematic content of real time flow of information from web sources is of particular interest. Archetypal value carries with it the rich detail that allows natural language to be understood, by humans, within social communication. Thus specific words and word structures are reflective of meaning in the context of broader experiences within the memory and anticipational aspects of human cognition.
D1. Detailed description of technical approach
We will introduce the proposed system by tracing one of its motivations. John Sowa, one of the distinguished scientific advisors to the project, made the following comments about the (type:value) pairs that are the focus of patents held by ATS (Applied Technical Systems):
(type:value) pairs have been used and implemented in various systems since the 1950s, and they are part of almost every major programming language and knowledge representation system in use today:
1. They are the basis for the LISP property lists in the 1950s.
2. They are the slot-filler scheme in every frame-based knowledge representation system.
3. They are the basis for the data structures in COBOL, PL/I, Pascal, C, C++, Ada, Java, etc., etc., etc.
4. They are the representation in the concept nodes of conceptual graphs (which were first published in 1976).
Given the enormous number of variations in which (type:value) pairs have been used, it is reasonable to conclude that no new patents in the area would be possible.
Dr. Sowa’s observation allows an important insight: namely, that some unexpected computer science innovations are possible, if one adopts a new paradigm. For our team the (type:value) pair has properties that are NOT anticipated by classical computer science and cognitive models. It is these properties that we judge to be foundational to knowledge technologies.
From a quick reading of the two (1994,1996) ATS patents, one is surprised by the specific manner of disclosure. Clearly the patent officers felt that the CCM (Contiguous Connection Model) construction was NOT anticipated by the enormous amount of work that John Sowa refers to. One can recognize at the outset that the (type:value) pair is a good way to localize information about type and value. The CCM constructions follow XML and ontologies developed from objects and classes (like OWL). But there is something in addition to the (type:value) pairing, and this has to do with information organization and inference.
In both XML and in the Cycorp technologies, a serious problem exists in finding the proper scope, namespaces, and situational context. We point to connectionism as one indication of the nature of a missing component to existing information technologies. The claim is not that connectionism supplies all of the answers, but that connectionism exists because there is more required than a localization of information into (type:value) pairs.
Some language will help us here. Inversions involve two processes: (1) the traversal of a branch (or tree or collection of branches), and (2) the convolution over all or some subset of more elementary units (e.g., significant words), where the convolution creates a partition and equivalence relationship. Inversions are a specific type of convolution, as defined in classical mathematics. The convolution is over a set. As each element of the set is visited some action takes place, that action being defined by the convolution operator. In classical mathematics, the set can be infinite or finite in size. In the CCM convolutions the set contains (type:value) pairs and the action is defined by rules.
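The idea of visiting each element of a set and applying a rule that yields a partition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CCM construction disclosed in the ATS patents: the function name `convolve`, the sample pairs, and the two rules are hypothetical and were chosen solely to show how a rule applied over a set of (type:value) pairs produces equivalence classes.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Set, Tuple

Pair = Tuple[str, str]  # a (type, value) pair


def convolve(pairs: Iterable[Pair], rule: Callable[[Pair], Hashable]) -> Dict[Hashable, Set[Pair]]:
    """Visit every pair once; the rule assigns each pair a class key.

    Pairs that receive the same key fall into the same block, so the
    result is a partition of the input set (an equivalence relation).
    """
    blocks: Dict[Hashable, Set[Pair]] = defaultdict(set)
    for pair in pairs:
        blocks[rule(pair)].add(pair)
    return dict(blocks)


# Hypothetical sample data, for illustration only.
pairs = {
    ("noun", "map"),
    ("noun", "lens"),
    ("verb", "map"),
    ("verb", "parse"),
}

# Two different rules over the same set give two different partitions.
by_type = convolve(pairs, rule=lambda p: p[0])
by_value = convolve(pairs, rule=lambda p: p[1])
```

Here `by_type` groups the pairs into a "noun" block and a "verb" block, while `by_value` places both ("noun", "map") and ("verb", "map") in a single "map" block. The underlying set is the same in both cases; only the rule changes.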
Speed of convolution operators over hash tables will turn out to be more and more important as we develop more complex convolutions and as we allow the user (or researcher) the parameters needed to re-apply convolutions experimentally as one tries to bring a specific focus into the conceptual roll-up. The convolution may occur differentially over type-categories or over value-categories – in ways that are disclosed in the 1996 CCM patent.
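A rough sketch of why hash tables make such operators fast follows. It assumes, purely for illustration, that each (type:value) pair is stored twice, once keyed by type and once keyed by value, so that lookups in either direction and the set operations built on them avoid scanning the whole collection. The class `PairIndex` and its methods are hypothetical names and do not describe the patented CCM encoding.

```python
from collections import defaultdict


class PairIndex:
    """Keep every (type, value) pair in two hash tables: by type and by value."""

    def __init__(self):
        self.values_by_type = defaultdict(set)
        self.types_by_value = defaultdict(set)

    def add(self, type_, value):
        # Each pair is encoded twice so either side can be used as the key.
        self.values_by_type[type_].add(value)
        self.types_by_value[value].add(type_)

    def values_of(self, type_):
        # .get avoids creating empty entries for unknown types.
        return set(self.values_by_type.get(type_, ()))

    def types_of(self, value):
        return set(self.types_by_value.get(value, ()))


# Hypothetical sample data, for illustration only.
index = PairIndex()
for type_, value in [("noun", "map"), ("verb", "map"), ("noun", "lens"), ("verb", "parse")]:
    index.add(type_, value)

# Set-theoretic operations over type-categories.
both = index.values_of("noun") & index.values_of("verb")
noun_only = index.values_of("noun") - index.values_of("verb")

# The same data read from the value side.
types_of_map = index.types_of("map")
```

Because both directions are indexed, an operator can be re-applied over type-categories or over value-categories with changed parameters without rebuilding the data.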
These “constructed” equivalence relationships are expressed as part of a CCM notational system. The CCM notational system is under development by OntologyStream as part of an R&D contract to ATS. Once expressed in the CCM notational system, one can formally discuss properties related to both fidelity and to efficiency in data processes. For example, the convolution can be formally complex if ontology is used with reconciliation containers. Complexity arises in the naturally occurring ambiguation and disambiguation processes that are essential to the use of natural language within communities. Logics over (type:value) pair schema containers follow the auxiliary innovations one sees in SchemaLogic Inc.’s SchemaServer and other similar systems.
One may also understand, or believe, that the structure of any natural system’s expression is so constrained in the real world that the number of types and the relationships between types are small in number, and yet open to change. Seen in this way, one finds data regularity in context as a matter of human observation.
Even with data regularity in context taken into account, individual localizations can be massive in number. The current architectures develop problems in completeness and consistency (the micro-theory problem in Cycorp and the scope problem in Topic Maps). One has to be able to organize a reasonable number of elements, each having the (type:value) pair nature, into situational and scoped constructions. Specifically we look to several innovations that have been adopted as part of SchemaServer developed by SchemaLogic Inc. SchemaServer provides both a data schema integration process and a community based reconciliation process that works on expressing structural ambiguities necessary to human dialog and interaction. But the reconciliation of controlled vocabularies and database schemas is only the very beginning of the capabilities we expect to deliver within a few months.
Schema resolution is seen as both a discrete process, involving logics over schema, as well as a continuous process, involving techniques such as latent semantic indexing and associative memories. Differential ontology is a formal mapping methodology between the discrete (and explicit) ontology and the implicit (continuum mathematics expressed) ontology.
More has to be said on differential and formative ontology, but for now we should return to the discussion of the ATS patents. Are these patents a reduction to practice of both the (type:value) pair AND connectionist theory? The answer is “yes”. Specifically, a “global” organizational process is illustrated by the “inversion” technique disclosed in the ATS patent.
ATS (Applied Technical Systems) has developed the CCM-powered referential system with the hope that CCM-powered systems could become a ubiquitous information and knowledge sharing technology -- sitting at the heart of a cultural / economic knowledge revolution. An early goal for phase 1 is to demonstrate why this is a reasonable hope. We also expect to build certain other technologies, based on other patents, on the fundamental data structures that now exist in a currently deployed CCM-powered NdCore ontology development system.
A single innovation will not ignite the Semantic Web, as we see with the limited use of OWL and RDF. However, the development of a method that finds and discloses innovations that can be built on the CCM constructions will fundamentally change what can be expected in the near term. The time is right for this revolution to occur.
We are able to explore various aspects of connectionism, including nearness, similarity and complexity. Deductive inference, using first order predicate logics, makes little sense in domains with high measures of irregularity and novelty. So one can, and should, make a distinction between deductive logics, which can be performed by computers, and inductive inference, a cognitive process that is not well understood. Having made this distinction, we nevertheless point to an unanticipated computational architecture that is not exactly standard first order predicate logic. This architecture is based on a Russian paradigm called quasi-axiomatic theory. From a study of this foundational work we have simplified and extended the notion of deduction to cover a situational and formative process involving localization and globalization. The claim is made that this form of deduction is more closely related to the inductive processes that science finds at the heart of cognition. Properties related to situational grounding are more easily obtained.
The team also includes a small company, Text Analysis International Corporation (TAIC). A patent-pending Integrated Development Environment (IDE) for developing text analyzers has been evaluated in preliminary work by OntologyStream scientists. The TAIC patent application allows knowledgeable users to develop a flexible multi-pass construction process that produces a highly situational set of parsing rules. Passes are involved in tokenizing, morphological analysis, spelling correction, parts-of-speech tagging, entity recognition, simple extraction (names, titles, locations, dates, quantities), and constituent recognition (noun phrases, passages, themes).
In the IDE, these passes are not black boxes, as is typical of deployed NLP or ontology constructor systems, but are open to rapid modification by a knowledgeable user. The modifications are expressed in the open construction of atoms in a situational logic. The “inferred” compounds are composed of those logic atoms and can be rendered as taxonomy or ontology. The atoms themselves are “recognized” by the IDE and users are allowed to instantiate those atoms that are deemed important. Moreover, an additional invention (not as yet disclosed) convolves the ATS patents with the TAIC patent application to produce a general-purpose ontology constructor.
Given such a flexible arrangement, one can organize an NLP or ontology constructor system in the best possible way for any given application. Furthermore, the ability to insert passes into an existing set of passes enables a system to grow, or be reduced, in a flexible and modular fashion. For example, some passes can be devoted entirely to syntax, others to lexical processes such as segmenting text into lines, or to a complex subsystem such as a recursive grammar for handling lists. The flexibility of the IDE will be applied to web harvesting to produce a competitor to the current J-39 Harvester now deployed at INSCOM.
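The notion of an open, ordered set of passes that can grow or shrink can be sketched as follows. This is a minimal illustration, not the TAIC IDE or its rule language: the `Pipeline` class, the three toy passes, and the sample sentence are all hypothetical, and real passes (morphology, spelling correction, part-of-speech tagging) would be far richer.

```python
import re
from typing import Callable, Dict, List, Tuple

# A pass reads the shared analysis state and extends it.
Pass = Callable[[Dict], Dict]


def tokenize(state: Dict) -> Dict:
    state["tokens"] = re.findall(r"[A-Za-z]+|\d+", state["text"])
    return state


def tag_quantities(state: Dict) -> Dict:
    state["quantities"] = [t for t in state["tokens"] if t.isdigit()]
    return state


def tag_names(state: Dict) -> Dict:
    # Toy heuristic: treat capitalized tokens as candidate names.
    state["names"] = [t for t in state["tokens"] if t[0].isupper()]
    return state


class Pipeline:
    """An ordered list of named passes that stays open to modification."""

    def __init__(self, passes: List[Tuple[str, Pass]]):
        self.passes = list(passes)

    def insert_after(self, name: str, new_name: str, fn: Pass) -> None:
        position = [n for n, _ in self.passes].index(name) + 1
        self.passes.insert(position, (new_name, fn))

    def remove(self, name: str) -> None:
        self.passes = [(n, f) for n, f in self.passes if n != name]

    def run(self, text: str) -> Dict:
        state: Dict = {"text": text}
        for _, fn in self.passes:
            state = fn(state)
        return state


pipeline = Pipeline([("tokenize", tokenize), ("quantities", tag_quantities)])
# Grow the system by inserting a pass into the existing sequence.
pipeline.insert_after("tokenize", "names", tag_names)
result = pipeline.run("Alice shipped 3 crates to Boston")
```

Each pass sees everything earlier passes produced, so a new pass can be added or removed without rewriting the others, which is the modularity the paragraph above describes.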
The Semio patents, now owned by Entrieva Inc., will be extended so that the already “best in market” results of the Entrieva conceptual maps application will be improved and made domain-specific. A test collection using a small number of short fables has been studied, using Semio, as part of preliminary research at OntologyStream. We feel that a discovery of fundamental importance was disclosed in Semio patents by the company’s founder, Claude Vogel. The discovery assists in the definition of concept expression and the extraction of passage categories having similar meanings. Other inventions have to be used along with this one if conceptual roll-up is to become the technique of choice for text analysis. An educational module is necessary to describe the innovation, and to document what each innovation by itself is and is not able to do.
ClearForest tools are being widely deployed (as of July 2003) as rule-based entity extraction systems searching for themes in web published text. ClearForest is part of the team because their ClearResearch toolset is compatible with and complementary to the TAIC IDE. ClearForest tools are to be used to measure social discourse, and then other tools are used to develop a weather map representation of the thematic structure of social discourse being expressed in public web sites by various social units, including those social units that represent possible asymmetric threats.
The ClearForest and TAIC IDE components will be used together to study the textual sources that the team has access to. We will enhance the research relationship between the team members and the scientists at Virginia Bioinformatics Institute (VBI) at Virginia Tech. VBI will supply a continual flow of thematically rich situational medical research literature. We also expect to tap into an existing INSCOM (Army Intelligence) open source archive of Islamic web sites. Other sources may be developed, consistent with guidelines that will be established during phase 1.
Our scholars will be in a position to turn research into actionable knowledge in experiment 2. This will be a re-analysis of patent documents, yielding a common language in which to express them, and indications of what areas have not yet been expressed and would be likely candidates for new patents. This experiment will attempt to anticipate as yet undisclosed patent applications. The problem is similar to the problem of anticipating terrorist behavior, or reactions along metabolic pathways.
The issue of a common language is a complex one. Ideally, one should have a mathematical foundation to knowledge systems, but suitable mathematics may not be readily available. Many scholars have come to believe that mathematical biology, for example, cannot be developed based on current notions of category and set membership. However, we know of several extensions of mathematics that might serve this need, including Russian quasi-axiomatic theory and applied semiotics (theory of sign systems). Another approach, one that deconstructs and then reconstructs set theory, is rough sets and polylogics. In the meantime, we still need a common means to talk about computer program behavior, and the best option we have found is Cubicon.
A more complete discussion of Cubicon, and its history, will have to be developed during phase 1. Sandy Klausner, founder of CoreTalk Inc and inventor of the Cubicon language, will train project members on the system, represent Cubicon concepts in meetings with scholars, and illustrate the benefits and requirements of a common description/ deployment language for knowledge technology innovations. Klausner and Prueitt have been discussing how to use the language since early 2002, and in August 2003 the Cubicon language was first used to communicate algorithmic modifications to the ATS system that increased the conceptual fidelity of the CCM-Powered NdCore conceptual rollup process.
Important new work on the CCM system (performed by OntologyStream during a 6-month effort to end in October 2003), while still preliminary, adds ontology and linguistic services to CCM’s newest NdCore, creating a process for thematic analysis. The NdCore creates an emerging ontology that depends on the text analyzed and the variation of inputs by the users. This work is consistent with the broader concept of the Knowledge Sharing Core proposed here.
In the later part of phase 1, we will shift our attention to the use of a tested system to demonstrate high fidelity general-purpose analysis of social discourse occurring in real time, first in translations from the Arabic press (experiment 3).
SchemaLogic Inc. will supply their schema reconciliation technology in the form of SchemaServer 2.0. Schema reconciliation is related to a search for the Topic Map process model. A Topic Map can be about a complex subject that is undergoing fundamental changes. Our scientists have been attempting to address this type of modeling. But formative process models are on the leading edge of the standardization processes. We will address exactly these issues, since without them it seems unlikely that real world, real-time, situational ontology is possible.
Steven Newcomb, one of the primary authors of the Topic Maps 1.0 standard, will advise the team on the development of scope adjustment based on ontology services in conjunction with SchemaServer’s community-based knowledge management services. SchemaServer uses a proprietary methodology to assist in reconciliation of multiple controlled vocabularies from diverse and complex interacting communities. SchemaServer will be deployed on a dedicated OntologyStream server for 18 months. A dedicated knowledge engineer/ knowledge management engineer will be employed by OntologyStream to use and develop knowledge artifacts based on a principled use of the SchemaServer. The SchemaServer will NOT be integrated into the Knowledge Sharing Core but will be an external resource.
Infrastructure. While ontology operations can be demonstrated within conventional infrastructure, such infrastructure is poorly suited to such operations and limits them in the following ways:
Each of these limits, taken alone, can easily cripple ontology operations. Taken together, they keep ontology operations as a perpetual laboratory curiosity. For example, the infrastructure of J2EE or .NET loads unnecessary transaction baggage on differential ontology. Also, the relational database with SQL does not support agile metadata transformations, except through the addition of meta modeling (accomplished through SchemaLogic) and a process model that allows the deconstruction and reconstruction of situational logics.
The path needs to be cleared if DARPA’s goals of speed, scale, and generative potential are to be responded to with more than weak measures such as code optimization, or peripheral exploration of the standard paradigms. While we cannot ourselves build a new infrastructure within the confines of this focused research, a disruptive infrastructure development methodology is known to us that is well suited to ontology operations. Our use of this methodology will allow us to develop a complete and proper system rather than components that have to be expressed within .NET or J2EE. Care will be taken to stand up the system as J2EE interoperable, but many of the processes will use Berkeley DB and/or a key-less hash table management system within a peer-to-peer distributed operating system, the Knowledge Sharing Core, that is independent of the J2EE architectures.
Education. Educational services represent a major challenge both in terms of justifying the approach to those unfamiliar with the paradigm and in providing deep training in how text understanding and ontology services work. Dr. Giovanni Marchisio has begun the design of university level curriculum on all methods adopted by the Knowledge Sharing Core. Dr. Larry Medsker, at American University, will collaborate on this effort and involve other university-based colleagues. (Full development of these designs will be pursued under separate funding.) For initial content, the team will produce a competitive comparison between the conceptual indexing activities by ATS using NdCore versus Entrieva using Semio maps. Steven Newcomb will provide authoritative expertise on the Topic Map standard and on OWL ontology standards, as well as extend some basic graph theoretic inference mechanisms involving polylogics, HyTime and situational logics (an OntologyStream innovation). John Sowa will advise on other related basic research and comparable methods.
Summary of design. A model of the Knowledge Sharing Core is given in Figures 1 and 2. The first depicts flow and the second depicts layers. The target tasks are text analysis resulting in ontology production. The ontology will have reusable components so that structured signatures related to specific types of social discourse and knowledge sharing are revealed.

Figure 1: Flow of Knowledge Sharing Core
ClearForest tools will be used to parse and orient ontology production. Test collections will be placed into a competitive analysis where one approach is based on the Text Analysis International Corporation’s multiple pass linguistic/ontology analyzer tool set. SchemaLogic’s SchemaServer will be used to allow the team members to develop and adopt taxonomy, controlled vocabularies, and ontology.
ATS and Entrieva will compete head to head and learn from the differences and similarities in the output ontologies. CoreTalk’s Cubicon language will be used to design new innovative algorithms and to link Knowledge Sharing Core components.
It is appropriate to characterize the whole system, not just the SchemaLogic component, as “Differential Ontology.” The term has a history and has influenced the development of the team’s effort in a very precise sense. Differential ontology does not force a commitment to one set of relations, under the assumption that there is one best set that will be most true. It facilitates the formation of alternative sets of relationships, and encourages the exploration of alternative configurations between (type:value) pairs. The configuration of relations is done without modifying the underlying (type:value) pairs data. This notion has been captured in the term “eventChemistry”, which was discussed in our 2002 NIMA proposal (deemed fundable but not funded due to budgetary issues).
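The separation between fixed (type:value) data and alternative configurations of relations can be illustrated with a small Python sketch. It is an assumption-laden toy, not the team's differential ontology or eventChemistry implementation: the helper `relate`, the sample pairs, and the two relation rules are hypothetical.

```python
from itertools import combinations

# Hypothetical sample data. frozenset makes the underlying pairs immutable,
# so no configuration of relations can modify them.
PAIRS = frozenset({
    ("animal", "bat"),
    ("tool", "bat"),
    ("animal", "owl"),
    ("tool", "saw"),
})


def relate(pairs, related):
    """Return the set of unordered links between pairs that `related` accepts."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(pairs), 2)
        if related(a, b)
    }


# Two alternative configurations over the same data.
by_type = relate(PAIRS, lambda a, b: a[0] == b[0])
by_value = relate(PAIRS, lambda a, b: a[1] == b[1])

# The configurations can be compared without committing to either one.
only_in_type_view = by_type - by_value
shared_links = by_type & by_value
```

Neither configuration is privileged: the type view links the two animals and the two tools, the value view links the two senses of "bat", and the pair data itself is never rewritten.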

Figure 2: Layers of Knowledge Sharing Core with CCM engine
A stratification of the system allows independent processing and discovery at each of several levels without automatic or strict (logical) entailment in the other levels. Formative and differential ontology is done within what has been called a tri-level architecture, because models of memory and anticipation are developed separately and then merged in situational ontology expressed as a middle stratum. The middle stratum is not logically at the same level of organization as the categorical invariances (atoms of a logic). The chemistry of events is developed in general terms and can be differentially applied to produce formative reactions during the process of defining ontology scope parameters. This follows the model of quasi-axiomatic theory, but relies also on the co-mapping of continuum mathematics and discrete mathematics, e.g. “differential ontology”, that was not present in the Russian work (1950- 1995). The tri-level architecture was developed to separate the memory of invariance and the top-down anticipation of templates into two completely different logic systems. Logical entailment, is thus not a first order logic; but may depend on a first order logic and may, once expressed in machine language, be treated as a predicate logic.
Formative and differential ontology is an inquiring system that supports conjecture and a broad array of potentially anomalous information. Novelty detection is immediate due to negative search characteristics similar to what is achieved in neural network Adaptive Resonance Theory architectures. Drs. Daniel Levine and Paul Prueitt have investigated, and published, issues related to perception and novelty detection since the mid 1980s. The architecture of the human brain system is found to be relevant in an exercise of executive function over logically underconstrainted formative processes. Karl Pribram’s work on holonomic models of perception and behavioral expression fits into in framework that is more likely to find scientific support than general artificial intelligence (as defined by Dr. Ben Goertzel – one of the consulting group.
Steven Newcomb will use HyTime and SchemaLogic to provides a localization of ambiguation/ disambiguation terminology and thus to inform the scope operators so essential to a Topic Maps process model.
The system includes some near-equivalent components that offer an opportunity to raise several questions that might not otherwise be asked, or asked as often, without a comparative dimension supported by traditions in the foundations of mathematics and by research literatures in the cognitive and social sciences. Rich questions are promoted by the complexity of underconstrained computer processes. Authoritative educational modules, designed to teach and not merely advertise, are to be available within the Knowledge Sharing Core.
Based on our continuing study of patent disclosures and science and logic literatures, we constantly test and are aware of practical results from alternatives. The same approach to inquiry is played out in our experiments, where we have chosen some of our data for which prior studies using different tools are available. The components and new processes we will develop are used not merely to create differences, which would end in chaos, but also to converge toward higher fidelity ontologies.
As an example of what our radical model of innovation can produce, we have specified and will develop in this project what we call an “ontology lens.” The ontology lens was made public domain in 2002. It allows the user to see documents sorted to bins that are defined by using those same documents as the bin exemplars. The lens is in focus when all exemplars are properly placed. Results from multiple passes can then be mathematically overlaid to reveal information that had not previously been extracted. The ontology lens is an extension to the LSI technology, and it is emblematic of how we expect to use combinations of tools and multi-pass analysis. The lens is not particularly useful if end users have no control of the parameters that, if adjusted, can make large variations in the computed outcomes. In an environment where parameter adjustment is allowed, these adjustments are not particularly useful unless the theory and science are available to the end user.
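The binning idea behind the ontology lens can be illustrated with a minimal sketch. It is an assumption-laden toy, not the lens itself: plain term counts and cosine similarity stand in for the LSI projection, and the names `vectorize`, `bin_profile`, `place`, and `in_focus`, the `stopwords` parameter, and the sample documents are all hypothetical. Each exemplar is scored against the bins with its own text left out, so "in focus" means every exemplar still lands in the bin it helps define.

```python
import math
from collections import Counter


def vectorize(text, stopwords=frozenset()):
    """Turn a document into term counts; `stopwords` is one adjustable parameter."""
    return Counter(t for t in text.lower().split() if t not in stopwords)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bin_profile(doc_ids, vectors, skip=None):
    """Sum the vectors of a bin's exemplars, optionally leaving one document out."""
    profile = Counter()
    for doc_id in doc_ids:
        if doc_id != skip:
            profile.update(vectors[doc_id])
    return profile


def place(doc_id, vectors, bins):
    """Return the name of the bin whose exemplars the document most resembles."""
    scores = {
        name: cosine(vectors[doc_id], bin_profile(ids, vectors, skip=doc_id))
        for name, ids in bins.items()
    }
    return max(scores, key=scores.get)


def in_focus(vectors, bins):
    """The lens is 'in focus' when every exemplar is placed in its own bin."""
    return all(
        place(doc_id, vectors, bins) == name
        for name, ids in bins.items()
        for doc_id in ids
    )


# Hypothetical sample data: two bins, each defined by two exemplar documents.
documents = {
    "d1": "fox crow cheese flattery fable",
    "d2": "fox grapes sour fable",
    "d3": "patent claim invention filing",
    "d4": "patent invention prior art claim",
    "d5": "crow fox fable moral",
}
bins = {"fables": ["d1", "d2"], "patents": ["d3", "d4"]}
vectors = {doc_id: vectorize(text) for doc_id, text in documents.items()}

print(in_focus(vectors, bins))      # are all exemplars properly placed?
print(place("d5", vectors, bins))   # bin chosen for a non-exemplar document
```

Changing the stopword set or the exemplar assignments changes the placements, which is the kind of parameter sensitivity the paragraph above says the end user needs to understand and control.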
Finally, we are explicitly accounting for the multiple perspectives of analysts and their requirements for interactive sense making. This is an old story from the beginning of operations research to the end of AI, where systems were falsely advertised as “a decision aid, not a decision made.” Such systems create hard conclusions that, however packaged, are difficult for analysts to escape and essentially take the analysts out of the process. Lakoff (1999) argues that there is a scientific revolution under way that potentially overturns these features that are common to "first generation" cognitive science and software and the analytic philosophy from which it stems. Our project clearly breaks from the first generation and is part of the movement that Lakoff identifies.
Performance measurement. We will focus our measurement on the comparison of our system with other available methods that can be used on the same data, and the emphasis will be on the benefits to practical (i.e., “real world”) reasoning among human analysts. Several of our technologies may be able to demonstrate previously unimagined speed and scope of processing, allowing for real time ontology processing of great fidelity using massive data. The speed, however, is less important than the performance of the whole system, including the input of analysts, in terms of sense making effectiveness. A reasoning system, in other words, must be evaluated in terms of reasoning and not primarily in terms of computational speed.
It will be important to develop measures that avoid unintended or destructive biases. For example, members of our group judge the TREC, Message Understanding Conferences (MUC) and TIPSTER projects conducted by DARPA, CIA and NIST as having been harmed by the use of precision-recall measures that are deeply biased towards statistical methods. Many software sensors will be embedded in the software to allow for lower level monitoring, and these data sensors become one source for performance measurement. The sensors will serve a number of purposes, including the development of use metrics for purposes of billing in deployed systems. An instrumented system is also capable of self-protection from viruses and worms, as has been demonstrated by several innovations in cyber security research (specifics can be communicated under non-disclosure).
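One way to picture an embedded software sensor is as a thin wrapper that counts calls and elapsed time for each monitored operation. The sketch below is only an illustration under that assumption; the class `UsageSensors`, its `sensor` decorator, and the `parse` function are hypothetical names and do not describe the instrumentation the project will actually use.

```python
import time
from collections import defaultdict
from functools import wraps


class UsageSensors:
    """Collects per-operation call counts and elapsed time (hypothetical design)."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    def sensor(self, name):
        """Decorator that records every call to the wrapped function under `name`."""
        def decorate(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Record the call even if the wrapped function raises.
                    self.calls[name] += 1
                    self.seconds[name] += time.perf_counter() - start
            return wrapper
        return decorate

    def usage_report(self):
        """Summarize recorded usage, e.g. as input to performance or billing metrics."""
        return {
            name: {"calls": self.calls[name], "seconds": self.seconds[name]}
            for name in self.calls
        }


sensors = UsageSensors()


@sensors.sensor("parse")
def parse(text):
    """Stand-in for a monitored component operation."""
    return text.split()


parse("linguistic variation detected")
parse("second document")
print(sensors.usage_report())
```

The same counters could feed both performance measurement and use metrics, since the wrapper observes every call without changing what the wrapped operation returns.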
Governance. The Knowledge Sharing Core addresses the need to make a transition in how information technology innovation is being evaluated and procured by the federal government. SAIC management understands the need for transformation that will benefit military and intelligence clients. Accordingly, SAIC management will not advise on what to include in the Core as this will be a process that is governed by scientists on the project’s advisory board. It will always be clear that it is the scientists and not the business leaders who make these selections.
Phases Two & Three: The scale of knowledge sharing will grow in all dimensions and into application areas that have not been initially selected. In general, precise pattern recognition allows real time realignment of parsers and ontology services so that new and important linguistic variation can be routed immediately to those who need to look for consequences relating to national security. For example, a simpler functionality is needed in responding to new patterns from medical ICD code analysis (syndromic surveillance) and in immediately viewing digital libraries (via grid systems) from a new viewpoint. A similar functionality is needed in mapping vulnerability and threats in trucking infrastructure and harbors. See figure 3 for a sense of how ontology operations can become widely distributed.
With the high fidelity of version 1 of the system, we will be positioned to pursue a very difficult application based on new science that social theorist Raymond Bradley, one of our advisors, is able to contribute. We will have the capacity to discover patterns of linguistic variation that identify social unit membership. By extracting signature patterns from voice recordings of conversations, we would be able to detect whether the speakers are members of a group, quite apart from the words that they are using. Likely members of a sleeper terrorist cell, for example, can be identified.
Figure 3: Two application areas for the Knowledge Sharing Core
In phase 1 we will have investigated additional innovations and in phase 2 expect to incorporate the best of them. One item is currently of high interest, but unfortunately it is not ready for incorporation in phase 1. It is a complex addressing technique that treats data, relations, structures, code, etc. strictly as addresses and does not, as traditional systems do, distinguish between data in containers and their addresses. This system, patent pending in the EU, distinguishes between data and structures (while representing them in the same way), and therefore can simulate containers. But since dimensions and complexity are not tied to actual data, any number of dimensions or any degree of complexity can be simulated as well. Data structures are simulations that do not actually hold data. Data is assigned to structures. Such assignments can take all possible forms: multiple assignments of the same data to different structures, structures assigned to other structures, or code assigned to data and structures. This architecture produces underconstrained data schemas.
The speed and flexibility of this addressing system make sense for ontology operations, quite apart from any other benefit, but its scaling characteristics may be even more important. Any truly massive application will have to find a way around the linear scaling of conventional tools, and the addressing system accomplishes that goal. We will be able to demonstrate the relationships shown in figure 4 during phase 2 work. The curve flattens for this system, mostly due to similarities in the events represented. The shape of the curve varies statistically rather than mathematically, approaching linearity in the worst case (when all represented strings are unique). We understand that some at NSA are referring to this type of performance as “fractal scalability”.
Figure 4: Fractal memory use (vertical axis: memory use; horizontal axis: files added to system). Curve flattens with more data added to system. Curve geometry depends on similarities in the data (compression effect). Dotted lines: connectivity provided by system / per object.
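The compression effect behind the flattening curve can be illustrated with ordinary string interning. This is explicitly not the patent-pending addressing technique, whose details are not given here; it is a stand-in that shows only why stored content grows slowly when incoming files repeat strings already held, and linearly when every string is unique. The class `InterningStore` and its methods are hypothetical.

```python
class InterningStore:
    """Stores each distinct string once and hands out integer addresses for it."""

    def __init__(self):
        self._address_of = {}  # string -> address
        self._strings = []     # address -> string
        self.growth = []       # distinct strings held after each file is added

    def add_file(self, tokens):
        """Represent a file purely as a list of addresses into the shared store."""
        addresses = []
        for token in tokens:
            address = self._address_of.get(token)
            if address is None:
                # First occurrence: allocate a new address and keep the string.
                address = len(self._strings)
                self._address_of[token] = address
                self._strings.append(token)
            addresses.append(address)
        self.growth.append(len(self._strings))
        return addresses

    def resolve(self, addresses):
        """Recover the original tokens from their addresses."""
        return [self._strings[a] for a in addresses]


store = InterningStore()
first = store.add_file(["alpha", "beta", "alpha"])
second = store.add_file(["beta", "gamma"])
third = store.add_file(["alpha", "gamma"])

print(store.growth)          # distinct strings held after each added file
print(store.resolve(third))  # the third file, reconstructed from addresses
```

In this toy the third file adds no new stored strings at all, while a stream of files with entirely unique strings would make `growth` rise in step with the input, matching the worst case described above.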
The addressing system can be compared to two innovations: the Hilbert Engine by Prementia, and certain data mining processes currently defined in the Berkeley Database (an elegant, and open source, hash table management system). We conjecture that interoperability between the computer programs based on these innovations is enhanced if each of the patented processes is rendered in the Cubicon language.
D2. Comparison with current technology
A recent IBM press release states: “IBM is developing an XML-based architecture designed to unify various machine-learning, statistical, and analytical approaches to improve computer systems' ability to retrieve and use data, autonomously in many cases. IBM's unstructured information management architecture (UIMA) will apply the Combination Hypothesis to help advance data analysis, explained David Ferrucci, a staff member at IBM Research. … IBM still considers UIMA to be a research project and does not have a timetable for implementing the technology commercially.”
The "combination hypothesis" is exactly what the father of fuzzy logic, Lotfi Zadeh, called the "generalization group". Zadeh started to talk about this in the mid to late 1990s when he became aware that his notion of "computing with words" had failed to find a way to reduce natural language to computational processes. John Sowa has some related work on "intermediate languages". The Knowledge Sharing Core concept is designed to give the end user the knowledge to use complex linguistic and knowledge tools within the notion of a generalization group, or within the UIMA. Further, the Knowledge Sharing Core concept differs from the AI agenda in that the cognitive load required to "make sense of" experience of language systems (more generally semiotic systems) has to be reallocated: the expectation that the computer can do it autonomously needs to be dropped.
We disagree with IBM's claim regarding XML: "It is difficult to effectively combine multiple techniques in parallel to improve data access and use. XML offers a key way to meet this challenge. Using XML tags on documents provides structure and adds semantics, thereby facilitating searching and analysis, particularly of otherwise unstructured data, Ferrucci said. XML thus also helps integrate unstructured and structured data for analysis."
Our position is that the experience of language systems only marginally depends on having a localization of information from a non-database-type structure to a database-type structure. Differential and Formative Ontology was invented to address differences between continuum type information representation (as in a neural network or genetic algorithm computer program) and discrete information as in XML or CCM.
Scanning more widely for comparable approaches, the cognitive graph (CG) approach has extended the principles of existential graphs (Charles Peirce), entity-relationship diagrams, semantic networks, and XML-type ontology representation. Once in a CG, various technologies facilitate a direct mapping to first order logic. CG is used in a number of COTS systems to manage n-ary relations in novel ways. CG systems generally assume that knowledge can be represented as tokens in logic and rules based on these tokens, without polylogic and analogic capability. This assumption is seen to have merit by many technologists and by first generation cognitive scientists, but it is in active dispute by Karl Pribram (from a cognitive neuroscience viewpoint) and by Robert Shaw (from an ecological psychology viewpoint).
Differential ontology produces small, situated ontologies through a very rapid reduction of patterns in massive data. These small ontologies cannot be interpreted by the rules of a first order logic. The atoms from which the ontologies are constructed are (type:value) pairs and are rendered into a 2-, 3-, or n-dimensional visual display, which aids the analyst and the analyst community in interpreting and making judgments on ambiguous intelligence.
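To make the notion of (type:value) atoms concrete, the following sketch treats each observed event as a set of such pairs and keeps only the pairs of atoms that recur together. It is a simplified assumption for illustration and not the project's reduction method; the functions `atoms_of` and `situated_ontology`, the `min_support` threshold, and the sample events are hypothetical.

```python
from collections import Counter
from itertools import combinations


def atoms_of(event):
    """Return an event's (type, value) atoms in a stable order."""
    return sorted(event.items())


def situated_ontology(events, min_support=2):
    """Keep atom pairs that co-occur in at least `min_support` events."""
    pair_counts = Counter()
    for event in events:
        for left, right in combinations(atoms_of(event), 2):
            pair_counts[(left, right)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical events, each described by type:value pairs.
events = [
    {"actor": "fox", "act": "flatter"},
    {"actor": "fox", "act": "flatter"},
    {"actor": "crow", "act": "sing"},
]

for (left, right), count in situated_ontology(events).items():
    print(left, right, count)
```

The surviving pairs form a small graph of linked atoms, which is the sort of structure that could then be laid out in a 2-, 3-, or n-dimensional display for an analyst to interpret.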
E. Statement of work
The program is to be conducted in three phases. All six tasks are active in each phase, but they change their focus and character as the program matures. In phase 1 (18 months) a full differential ontology system for ontology processing will be developed and tested in multiple and varied settings, culminating in an application that is realistic in terms of size, difficulty, subject matter, and participation by analysts. Phases 2 and 3 essentially recapitulate the Phase 1 cycle, beginning with a major reconfiguration and ending with a major application that demonstrates the generality of the system and the capacity of the development program for continued innovation. Phases 2 and 3 are optional: the government may elect not to continue if the system has not demonstrated advanced performance and the likelihood of further innovation.
Task 1: Elaborate the design. The initial design must be sufficient to guide the assembly of components and application of the system in the first experiments. It is expected, however, that the design keeps advancing as it is interpreted and tested by the program participants, and that new aspects of the design, and improved expression of the design, occur during the project, especially to prepare for reconfiguration that occurs at the outset of phases 2 and 3. A panel of scientists will advise on advanced concepts from various fields that will be relevant to the design. To assure effective communication, the design will be elaborated from different viewpoints using different media. The following design documentation is expected:
Task 2: Evaluate and obtain components, or create components. The Knowledge Sharing Core will be composed of several components that form a system. The project will identify and obtain suitable existing components and avoid developing entirely new components unless there is a clear opportunity to innovate or a clear void in the market. The general evaluation criteria are the following. (Additional sub-criteria should be added as necessary or where competitive cases need to be resolved.)
The components sufficient for conducting experiments in phase 1 will be made available at the beginning of the phase and will be suitable for integration as the first version of the Core. A scanning and vetting process should be in place during phase 1 to identify components that may be possible to add with low effort and in time for use in experiments 3 and 4, but the main focus during phase 1 will be on preparing for major reconfigurations at the beginning of phases 2 and 3.
Task 3. Integrate and interface components. Programmers, usage analysts, and scientific advisors will all identify outputs of the integrated system, which will in turn guide specification of system operation. Programmers will execute these specifications and perform technical testing. Preliminary operational testing will be conducted by usage analysts and fed back for revisions. An effort must be made to keep changing the system rapidly in pursuit of the ideal design and avoid premature closure and refinement of a particular instantiation.
Task 4. Conduct experiments. The program should identify a series of practical tests that are increasingly difficult and that demonstrate the full range of application of the system. Every test should be prepared with suitable data, experimental conditions, and specific questions and hypotheses that can be answered with performance results that are appropriate to the stage of development. It is preferable that the test data will have been analyzed previously by other means, such that results can be compared.
Every test should be documented as a case study to facilitate outside review. Each test will arrive at specific implications for changes and improvements. The tests should cover domains that will be relevant to the intelligence community, though it is understood that simplified conditions (unclassified, no involvement of working intel analysts) are most appropriate for phase 1 tests. For larger tests, it is understood that the project will need to develop relationships with data owners who will want to share the results and who may either contribute research questions or help perform the analysis.
Task 5: Communicate findings. DARPA requires annual reporting using specified formats. In addition, the team will issue regular progress reports, including DARPA conference presentations as needed. The major reports will consist of: four case studies based on the tests, scientific papers (at least two during phase 1) to be presented at conferences or published, educational presentations, and documentation of innovations prior to patenting. (The labor for producing patent applications will not be charged to the research contract, and the science committee will give thoughtful consideration to each case.) The team members will frequently produce, share, and comment on brief research notes. The educational presentations will include briefings on the project suitable for presentation to government reviewers and other research teams. Educational materials will be needed to explain the background and motivation of some of the features of this project since several aspects are unusual and deviate from the normal background assumptions.
Task 6: Establish collaborative structure for innovation. Near-term technical objectives are important, but since the program holds much more promise beyond that, it must be organized in such a fashion that a succession of innovations, not all of which can be specified at this point, become a likely result.
Technical environment. The team will establish practices and tools that promote innovation. A common design language should be used to support rapid revision, avoid lower level programming and infrastructure complications, and promote easy understanding of each other's work. Communication practices and tools should be used that promote online presence, easy discussion when needed, and rapid access to context and reference material. It is especially important to keep remote team members socially integrated. The team should also extend beyond those who are directly working on the project, to those who are enlisted as role-players within the experiments, and to a network of colleagues who will offer comment and ideas.
The programming environment will be organized to support rapid prototyping, testing, and feedback, without the need for lengthy performance and stability checks, documentation cycles, and coordination meetings. The checks, documentation, and coordination should not be ignored, however, but built into the development environment to the extent possible.
Bottom Line: The initial government funding should be used to create a self-sustaining program to which many additional funders eventually contribute, either in the form of license fees or direct tasking for additional development. Two conditions are required: there is open disclosure of all technology and its performance, and contributors have a realistic prospect of profiting from intellectual property rights. Often these two aims conflict. Work is not disclosed because of proprietary interests, but with no disclosure, inquiry is soon stifled, and with it the flow of economically valuable technology. In this program especially, the inquiry must be open because the basic technology and the science that underlies it are not widely understood, and the only way to receive a fair examination is to show it and discuss it and not merely refer to hidden processes. The program should thus enlist the United States patent system as one way to ensure both the open inquiry and property rights that are needed to sustain innovation. Any essential technology to be used in the Core must be either patented, likely to be patented, or open source. This allows full disclosure among team members, an extended network of colleagues who will be interested in the program, and scientist advisors who will need to understand how the system works.
New patents created during the program are handled under DARPA's normal rules ensuring government access. The program members will pool ownership and set fees for deployments. A portion of revenues is to be reinvested in development efforts. In order to make this program structure work, member companies need to be recruited that can contribute patented technologies and that are otherwise able and willing to disclose their work for the project. The companies must also agree to pursue additional patents, to share ownership when appropriate, and to reinvest in a long-term program.
F1. Schedule graphic
F2. Detailed individual effort description
Items listed in the schedule are treated elsewhere in the proposal, and this space is used to extend comments on a subset of items.
Set up collaboration & communication. All team members will be provided a copy of Groove. Every person will be required to use online training, to keep their presence marker on during working hours, and to meet a quota of postings and exchanges. Every effort will be made to conduct all message exchanges inside of Groove, in order to prevent fractured records and poor sharing. All key personnel, plus those with administrative responsibilities, will have Groove's add-on tool for project management. This add-on tool will be used to keep a shared, detailed, online task network that will be continuously updated and revised. All research notes and rollup reports will be posted to Groove and members will be expected to comment on them. There are several other Groove tools, such as image viewers and instant messaging, which will be used to collaborate with the full range of options. Groove is not used simply to serve the project activities; its general functions in support of human sensemaking are also a part of the system being developed. It is expected that new Groove tools will be specified that link to the other layer of ontology operations. The collaborative functions during Phase 2 may be handled by software other than Groove, but during Phase 1 Groove will serve as a very able prototype and prototyping component.
Usage scenario walkthroughs. The role of the analyst must be designed, with due regard to what is 'reasonable' in the experience of the analyst. This effort includes, but is not limited to, the elaboration of typical keyboarding sequences that accomplish work results. What is also crucial is to track the intentions, thoughts, and interactions among collaborating analysts, thereby capturing the story structure of experience. Some typical problems will be featured in scenarios that place a strain on ontology operations. The scenarios will be walked through with analysts or suitable role players, and an appropriate set of technological augmentation events will be specified. There are several results from this effort, including: more and refined requirements, identification of erroneous assumptions in the current configuration of the system, and identification of what needs to be trained in order for the analyst to take full advantage of the system. The scenarios are refined with each experiment and will continue to aid designers in understanding the variety of demands on the system and to aid the educators and trainers in communicating how the system is supposed to work.
Prepare publications. We expect to make public disclosure of our work soon and often, in order to help clarify our thinking and gain feedback. This can be done effectively through conference papers where currently formulated but unsolved problems are presented. Multiple authors will be preferred, so that we maintain shared perspectives across the team. Many of these papers will develop from the research notes that participants will be expected to generate and circulate regularly.
Prepare learning modules. The different assumptions that are used, and the reasons for them, while not specifically findings or innovations created by the project, do need to be explained on several levels since, without such explanation, observers of the project can easily become disoriented. Merely listing the assumptions is not enough, because some very basic and entrenched thinking habits are involved, and it requires some interaction and exercises to realize that there is a valid alternative.
Conduct science conference. A small conference will be conducted where more formal papers will be prepared. The project's scientific advisors are expected to prepare works for this conference and to address issues directly relevant to the system we are producing under this contract. (Separate funding for this conference has been promised by a research office in OSD. A publisher has been contacted and has expressed a willingness to produce a book, tentatively titled "Computing and Cognition," based on the conference papers.)
Phase 1 final report. The final report is a compendium of various kinds of documentation that cover most questions and needs. It includes: slides suitable for education, review of benefits and uses, scientific accomplishments, feature and operational descriptions, case study findings, and plan for phase 2.
Reconfigure team. It is likely that new players will be invited to join in order to incorporate new components and address new areas. Existing players will be dropped from direct charge to this contract, yet will remain well connected to the program due to newly funded projects that have been spun off from phase 1.
Simulate distributed application. The main application to be piloted in phase 2 will involve the participation of different kinds of users who are remotely located and who collaborate in different ways. An analyst and a workgroup of analysts were the focus in phase 1, but a differentiated community is now to be served by ontology operations, and that will require more complex preparation so that expectations, roles, and exchanges make sense. This is walked through in simulation, and adjustments made, before participants are asked to begin pilot operations.
Work plan revisions. The team is structured for innovation, and in the nature of things most innovation cannot be planned. When a good opportunity arises that is consistent with the overall purpose and design, we will adjust plans to capture the opportunity, which may result in a patent or new application area that will in turn generate new resources. The greatest tragedy would be to complete the original version of the plan but to have not, in passing, created high demand for more development, innovation, and application that others besides DARPA will want to fund.
G. Deliverables description
There are no proprietary claims. There will be patentable technologies resulting from the work, and the government will receive use rights to any patents that are filed. The discounted software licenses to be purchased for use in the program give the government rights to use the technology in development or pilot applications. Full deployments, none of which are anticipated within the terms of the contract, would require the purchase of deployment licenses under a separately negotiated contract. The receiving organization for all deliverables listed below is DARPA, though DARPA may direct that any other federal government agency also be a receiving organization.
Reporting, demonstrations, and presentations occur throughout the project, at least monthly, as required by DARPA and in support of DARPA meetings.
Month 6: Case study report on experiment 1, comparative analysis of fable discourse
Month 10: Case study report on experiment 2, comparative analysis of evolution in a network of patents
Month 13: Case study report on experiment 3, comparative analysis of Arab public discourse based on massive news harvests
Month 10 and 16: Educational presentations on how and why the system works and the approach to reasoning that it supports
Month 13: Scientific exposition of principles underlying the system and the innovations made to date.
Month 14: Interface specifications for version 1 of the differential system
Month 16: Case study report on experiment 4, comparative analysis of biosecurity concepts in a massive database of science literature.
Month 18: A working differential system for ontology operations, suitable for use on massive data and capable of interfacing to other reasoning systems.
Month 18: Plan for phase 2
Month 36: Case study report on phase 2 application
Month 36: Version 2 of differential ontology system
Month 60: Case study report on phase 3 application
Month 60: Version 3 of differential system
H. Technology transition
The experiments are chosen so as to demonstrate a range of important applications, better performance than systems with similar goals, additional kinds of functions not now attempted, and service that analysts understand and that enhances rather than marginalizes their contribution to intelligence functions. All this will induce potential users to look at the system and use practical evaluation criteria without being fearful of the system’s unusual features. The project will set up a demonstration area and will invite continuous and broad discussion. The project is structured so that members have no interest in secrecy or gaining advantage against partners while having high interest in generating creative feedback from the technology and science community. This approach has been used consistently over the last year and an invaluable result has been the recruitment of new team members who see the advantages and agree to cooperate. The network that each team member has developed forms a ready audience for new synergies that will be developed.
We are aware that full deployment and commercialization involve much more than demos for the curious, and that investment will be needed. Our structure facilitates self-funding from early-adopter applications. Beyond this, we have opened discussions with interested parties at In-Q-Tel and with private early-stage investors. The SAIC investment program will be approached at the end of phase 1, after market potential can be demonstrated. Murali Iyengar, a member of the team from SAIC, has extensive experience in developing software ventures and will take a leadership role in this aspect of the project.
We feel that many development tasks after version 1 can be accomplished in the context of application projects for both commercial and government organizations. The system is not of the kind that is refined to one final form that is appropriate for all situations. Instead, it is a platform that will be most effective when adapted to specific situations and continuously modified based on a stream of innovations and patents. Initial pricing cannot be low due to the need for services in installation, modification, and education. We believe, however, that low-priced versions are possible at phase 3, allowing small groups to either use or co-develop new functions using Cubicon code.
While the system is a platform, it does not require alternative platforms to be abandoned or modified. The ontology operations can be connected to other reasoning systems, especially those that DARPA funds under investigation areas 1 and 2. The main point of connection is Topic Maps, which Cyc and others support. But additional interfaces are possible, from the whole system or from subcomponents as needed. An effort will be made early on to know what the requirements of other systems are and to demonstrate interoperation. This is part of the justification for using SchemaServer from the outset. Relationships will be developed that will allow our system to be placed where its advantages are clear, and if that means entry to intelligence shops via alliances with other systems, we invite the opportunities.
It is possible that purchasers of the Knowledge Sharing Core will want to pool data and processing operations, once it becomes clear that our system can scale to handle the volume, and that an increase in data volume, throughput, and configuration management increases value to all those who use the collection. Also, running a shared processing center offers significant cost reduction along with more rapid adoption of improvements and standards. VGI can serve as that collection point. As another option, SAIC has an agreement in principle to use Intel’s data center in Chantilly. The data center is running Intel’s most advanced hardware and system software and Intel has been seeking just such a project – vast flow of unstructured data – that can take advantage of their most advanced processing capacity.
The team has agreed to pool ownership of new functions and patents, when appropriate, and to cross-market each other's products.