A communication

on

Knowledge Representation Fidelity

 

Dr. Paul S. Prueitt

July 20, 2003

 

 

OntologyStream’s research lab is looking for sets of text collections where one has some representation of ontology about the collection.  The ontology can be expressed as RDF, OWL, Protégé, or Topic Maps; or something that is interoperable with these.  We are seeking a high level of fidelity between a conceptual rollup process, running on a standard computer, and concepts that are evoked by the reading of text by humans.  To measure this level of fidelity we look to the correspondence between the elements of the machine ontology and the constructions developed in the roll-up. 

 

Before such a measure is offered, one needs to bring light to the current nature of machine ontology as a model of the contents of human awareness.  On the one hand, the immediacy of introspection is a constant companion of awareness.  On the other hand, this immediacy has not been reduced to a coherent formal construction, not even in a single case!  John Sowa’s cognitive maps and the Cycorp ontologies, perhaps the leading edge of ontology representation, both have a long way to go before they are complete and coherent with natural science.

 

Because of the limitations of Cycorp’s approach, there is a legitimate question about whether true “intelligence” can ever be produced based on Doug Lenat’s work!  But for practical reasons one has to be aware of history.  The measurement of Knowledge Representation Fidelity (KRF) requires constructions such as Sowa’s cognitive graphs with metaphors, together with the excellent work on first order predicate logics such as those developed at Cycorp or developed as OWL structures.

 

But the coherence between formalism and the life sciences cannot be taken for granted.  It is here that the limitation of the Cycorp approach is most apparent to the natural scientist.  First, such coherence CANNOT be found in today’s university curriculum.  So we cannot look to biomathematics, neural networks, genetic algorithms, or machine learning.  These disciplines do not recognize what bio-mathematician and category theorist Robert Rosen called the “category error” of mistaking the formalism for the natural system.  Sir Roger Penrose addresses this same limitation of formal theory, the Gödel limitation, in his several books on quantum mechanics and human consciousness.  So we are not talking about esoterics.

 

We may look to theory that has been carefully developed from a widely dispersed literature, looking at what was done by the applied semiotics group in Russia, and looking at evidence from the cognitive neuroscience and quantum neuroscience literatures.  Not all of the answers are there, but within these literatures there are some very reasonable things that can be done.  Some of these, like scatter gather and differential ontology, have actually been done, though not published.  We bring other methods that have been around a long while. 

 

There has never been a full-scale effort to apply stratification theory to machine learning and knowledge representation.  This is NOT because the talent and expertise were not available.  We suggest that stratification theory has not been attempted because it is not understood by those in the community of program managers who are responsible for funding decisions. 

 

In our current, privately funded work, we are producing (for the first time) formal measures of Knowledge Representation Fidelity (KRF) that involve all three dimensions of relatedness measures.  These dimensions have been classically called (1) syntactic, (2) semantic, and (3) pragmatic.  This measure of KRF can be expressed in a scientific language, and we will use this language and the measure to demonstrate an additional level of fidelity for several types of conceptual rollup over small and large collections of natural language text. 

 

The primary limitation of the current measurement processes is that a pragmatic aspect existing in real physical systems is missing in all computer representations.  Our Actionable Intelligence Process Model puts the measurement problem up front.

 

 

Figure 1:  Actionable Intelligence Process Model (see

http://www.ontologystream.com/area1/MemeticOntology/support )

 

Measurement is the key.  If the initial measurement process is controlled by computer science and designed by individuals and program managers who are NOT looking at the social science and cognitive science, then the data mining processes will be useless, except as a means to justify corporate welfare.

 

We have held that the pragmatic axis must be, because of the stratified nature of formal systems, missing totally in any completely formal system such as a first order predicate logic.   This is a principled argument whose force against the current IT design and deployment can only increase over the next few years. 

 

The use of traditional Codd-type relational databases for novelty detection, thematic analysis and memetic reflective control will simply be government money spent in the wrong way, as could be judged by tests that will never occur.  There is an alternative that has received no research dollars, and yet has developed in the scholarly community.  This alternative is informational localization and globalization processes structured in a stratified formalism.  See the PowerPoint presentations and materials in the folder at:

 

http://www.ontologystream.com/area1/MemeticOntology/support

 

It takes a while to explain the stratification formalism that we have developed over the last decade. 

 

A small community of scientists is looking at the thesis that through a stratification of the formalism it is possible to simulate the stability and regularity that one observes in the natural world.

 

http://www.bcngroup.org/area2/KSF/KSFconference.htm

 

The semantic aspect is also sometimes confused in the current research on formal semantics, but not always.  In some cases, a position is taken that the representation of patterns in linguistic variation is, of course, only meaningful when interpreted by a human.  But systems are often designed that do not take into account human interpretation as a necessary element of semantics.  We have felt that this is wrong-minded.

 

The pragmatic dimension is related to how the experience of knowledge by a human or human community is structurally connected to events and processes in the real world.  This dimension is immediately accessible to human awareness, but only in a hypothetical way, through conjecturing and visualization.  This limited access is due to both complexity and indeterminism.  A theory of logical and physical “entailment” is necessary to even speak coherently about the pragmatic axis. 

 

The identified small community of scientists can produce this theory and provide rapid deployment of technology based on this theory.

 


Work in progress

 

J-39 contribution:  One of the “pragmatic” approaches that we have seen is one that was developed as part of the harvester (web spider) acquisition of Islamic social discourse (the J-39 system requested by the NSC).

 

http://www.bcngroup.org/area2/KnowledgeEcologies.htm

 

In this work, a thematic analysis of what they called “support” and “blocking” concepts is related to United States public actions and statements and to worldwide reactions within the Islamic communities.  The system is developed using a fuzzy-expert system called JESS and SAIC’s latent semantic indexing engine.

 

There are very few operational systems looking at the pragmatic axis.  And, one might observe, there is professional opposition to the systems being objectively tested.  Part of this opposition is due to business processes not being comfortable with how to contract based on outcome measures that shine the light of social science on the current data mining and intellectual vetting software.  It is easy to build systems that do not work very well using products from incumbent vendors and consultants.

 

The cultural issues are very difficult.  One worries about stories that hundreds of millions of dollars are being spent by SAIC and Booz Allen Hamilton and others in several large data mining technology projects.  One can only wonder what this technology is and what the value might be to anyone other than SAIC and Booz Allen Hamilton.

 

But the problem is larger than the entrenchment of current contracts awarded and being spent with the IT and consulting cottage industries.  One can almost define “knowledge science” to be the systematic investigation of those things that classical reductionist science regards as non-addressable.  What is necessary to develop machine intelligence is exactly what mainstream computer science and most mathematicians carefully ignore.  So one also has a problem at the National Science Foundation and at the National Institute of Standards and Technology.

 

OntologyStream has traditionally taken issue with the strong AI camp, and with mainstream computer science, because the issues related to structural coupling between complex natural systems are not acknowledged, and are often marginalized as being either not relevant or not subject to any type of science. 


Semio:  We believe that the comfort level can change if a distinguished group of scientists show the business people that revenues can be generated from a more advanced type of information technology that does include semantic and pragmatic theory.

 

Specifically, we believe that, given a preexisting ontology “about” a collection, a measure of fidelity is possible.  This measure can be properly defined so that the techniques are objectively evaluated [1].
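
As a rough illustration of what "correspondence" between a preexisting ontology and a roll-up could mean at the purely syntactic level, the sketch below compares two sets of concept labels.  It is NOT the KRF measure, which this communication does not define and which is intended to be more complex than precision and recall; the function names, the label normalization and the sample labels are all assumptions made only for illustration.

```python
from typing import Dict, Iterable, Set


def normalize(labels: Iterable[str]) -> Set[str]:
    """Lowercase and trim labels so trivially different spellings compare equal."""
    return {label.strip().lower() for label in labels if label.strip()}


def overlap_scores(ontology_concepts: Iterable[str],
                   rollup_concepts: Iterable[str]) -> Dict[str, float]:
    """Baseline label overlap between an ontology and a conceptual roll-up.

    Hypothetical stand-in only: it looks at shared labels and ignores the
    semantic and pragmatic dimensions entirely.
    """
    onto = normalize(ontology_concepts)
    roll = normalize(rollup_concepts)
    shared = onto & roll
    union = onto | roll
    return {
        # share of ontology elements that the roll-up also produced
        "ontology_coverage": len(shared) / len(onto) if onto else 0.0,
        # share of roll-up constructions that the ontology contains
        "rollup_support": len(shared) / len(roll) if roll else 0.0,
        # symmetric overlap of the two label sets
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Made-up labels, not taken from any real ontology or roll-up output.
    ontology = ["Peasant", "Apple tree", "Gratitude"]
    rollup = ["peasant", "apple tree", "axe", "sparrow"]
    print(overlap_scores(ontology, rollup))
```

A set overlap of this kind is exactly the sort of measure that footnote [1] regards as insufficient; it is useful mainly as a floor against which a richer measure can be compared.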

 

Automated subsumption construction, called taxonomies, has been part of the deep research for several decades, and many groups have developed prototypes. 

 

A new company called Entrieva has, over the past year, integrated the patents and Intellectual Property of two companies, Semio and Webversa.

 

 

Figure 2: One view of the Semio representation of concepts in the fable collection

 

Per OntologyStream’s request, last week, the engineers at Entrieva quickly generated a topic map type representation of the concepts in OntologyStream’s test collection of Aesop fables.  This is a high quality conceptual rollup over a small collection of thematically rich documents.

 

The conceptual rollup of concept representations and meta-concept representations could provide thematic analysis of social discourse, and as a consequence help us understand the social discourse in various countries where there is support for terrorism.  But one has to wrest the J-39 project away from incumbent consulting companies and allow proper funding of Semio-type technologies.   Our group plans to do exactly this.

 

 

Figure 3:  A second view of concepts and concept relationships in the fable collection

 

The software interface to the fable collection is at the URL:

 

http://demo.semio.com/semio/discover.cgi?db=%2Ffables%2Cv

 

SLIP:  OntologyStream developed a unique approach to conceptual indexing using the Shallow Link analysis, Iterated scatter-gather and Parcelation technique we developed for TASC in 2000.

 

The map of linguistic functional load is shown in Figure 4.   We show two maps that provide a structural index involving verb use in the fables.  A semantic index based on generalized Latent Semantic Indexing will produce various concept based structural indices. 

        


Figure 4: conceptMaps indexing fable 265 (a) and the concept of “having” (b)

 

The numbers in Figure 4 point to specific fables.  For example, fable #265 is The Peasant and the Apple-Tree. 

 

 

Figure 5: The URL at OSI with the 265th fable

 

One can find the verbs { served, looking, having, entreated, reached } contained within the fable. 

 

The work on SLIP led to something called categoricalAbstraction and eventChemistry, which at one point did seem to interest John Poindexter, at least for a while.  SLIP also interested NIMA in 2002, as the central core of a proposal from SAIC/OntologyStream, but also only for a little while.  The proposal was deemed fundable but not funded because an alternative (Cycorp) ontology system was deemed more likely to provide for the NIMA needs.  The technical part of this proposal to NIMA is provided at:

 

 


Near future work

 

Large collections have statistical properties that do isolate linguistic variation patterns.  So the use of statistics is reasonable.  But statistics are over what?  The data has to be acquired properly.

 

Our recent work has been on a system based on a patented localization of ( type : value ) pairs.  Patents on this system were awarded in 1994 and 1996 to Applied Technology Systems (ATS), Inc of Seattle Oregon.  This system uses a word level n-gram with n = 5. 

 

However, we find that a 5-gram, at the word level, will not extract the patterns that mark ALL concepts of importance.  The 5-gram is what has been used in the development of a system for INSCOM.  In a research contract, OntologyStream Inc is developing a reification process based on generalizing the n-gram to ontology frames, such as those in the Protégé ontology constructions.  We expect to produce results as good as Semio’s within two months. 

 

The generalized n-gram is to be used experimentally on a test set, the collection of 312 Aesop fables, to pull in patterns that span a wider window and may vary in word order, which a word level n-gram would never pick up.  The preliminary work is being done using an existing experimental system developed by OntologyStream Inc under contract to ATS.
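
The difference between a strict word level n-gram and a pattern that tolerates variation in ordering can be sketched as follows.  This is only a minimal illustration and not the generalized n-gram under development; the function names, the window size and the two sample sentences are assumptions chosen for the example.

```python
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()").lower()
        if token:
            tokens.append(token)
    return tokens


def matches_in_order(tokens: List[str], pattern: List[str]) -> bool:
    """Strict word level n-gram: the pattern must occur contiguously, in order."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))


def matches_any_order(tokens: List[str], terms: Set[str], window: int) -> bool:
    """Relaxed pattern: all terms co-occur inside `window` consecutive tokens."""
    if window < len(terms):
        return False
    for i in range(max(1, len(tokens) - window + 1)):
        if terms <= set(tokens[i:i + window]):
            return True
    return False


if __name__ == "__main__":
    # Made-up sentences that express the same event with different word order.
    active = tokenize("The peasant reached for the axe.")
    passive = tokenize("The axe was reached by the peasant.")

    for name, tokens in (("active", active), ("passive", passive)):
        strict = matches_in_order(tokens, ["peasant", "reached"])
        relaxed = matches_any_order(tokens, {"peasant", "reached"}, window=5)
        print(name, "strict:", strict, "relaxed:", relaxed)
```

The strict match finds the pattern only in the first sentence, while the relaxed match finds it in both, which is the kind of wider, order-tolerant pattern the paragraph above describes.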

 

How we develop conceptual indexes for a small collection can then be applied to the large text corpus at INSCOM or elsewhere.

 

The relationship between the n-gram and a Schank-like frame is considered.  Perhaps the most important contrast is that the frame has a name and slots where values go.  Protégé is based on frames, and both Topic Maps and OWL can be used in such a way as to support the notion of frame-filling as a type of logical inference, i.e., if the frame is filled in a certain way, then a certain deductive inference is made. 
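
A frame with a name and slots, and frame-filling treated as a trigger for an inference, might be sketched as below.  The class, the "Request" frame, its slot names and the single rule are all hypothetical; they are not taken from Protégé, Topic Maps or OWL, and serve only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Frame:
    """A named frame whose slots start empty (None) and are filled later."""
    name: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def fill(self, slot: str, value: str) -> None:
        """Put a value into an existing slot; unknown slots are an error."""
        if slot not in self.slots:
            raise KeyError(f"frame {self.name!r} has no slot {slot!r}")
        self.slots[slot] = value

    def is_filled(self) -> bool:
        """True when every slot holds a value."""
        return all(value is not None for value in self.slots.values())


def infer(frame: Frame) -> Optional[str]:
    """Toy rule: a completely filled 'Request' frame licenses one conclusion."""
    if frame.name == "Request" and frame.is_filled():
        s = frame.slots
        return f"{s['requester']} wants {s['addressee']} to {s['action']}"
    return None


if __name__ == "__main__":
    request = Frame("Request", {"requester": None, "addressee": None, "action": None})
    request.fill("requester", "sparrows")
    request.fill("addressee", "peasant")
    print(infer(request))            # no inference while a slot is still empty
    request.fill("action", "spare the tree")
    print(infer(request))
```

The point of the sketch is only that the inference depends on how the slots are filled, not on where the words happened to sit in the text.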

 

The ATS 5-gram window is in fact a frame with five slots: the middle word, the two significant words to the left and the two significant words to the right.  But the means, i.e., the rules, by which these slots are filled is simply the position of each word in relation to the middle word. 
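
The five-slot reading of the 5-gram window can be sketched as follows.  This is not the patented ATS process; in particular, how "significant" words are selected is not stated here, so the small stop list, the slot names and the sample sentence are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical stop list: the text does not say how significance is decided.
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "his", "her", "it"}


def significant_words(text: str) -> List[str]:
    """Lowercase, strip punctuation, and drop stop words."""
    words = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()").lower()
        if token and token not in STOP_WORDS:
            words.append(token)
    return words


def slot(words: List[str], index: int) -> Optional[str]:
    """Return the word at `index`, or None when the window runs off the text."""
    return words[index] if 0 <= index < len(words) else None


def five_gram_frames(text: str) -> List[Dict[str, Optional[str]]]:
    """One frame per significant word; slots are filled purely by position."""
    words = significant_words(text)
    return [
        {
            "left2": slot(words, i - 2),
            "left1": slot(words, i - 1),
            "middle": middle,
            "right1": slot(words, i + 1),
            "right2": slot(words, i + 2),
        }
        for i, middle in enumerate(words)
    ]


if __name__ == "__main__":
    # Illustrative sentence, not a quotation from the test collection.
    for frame in five_gram_frames("A peasant had in his garden an apple tree."):
        print(frame)
```

Each frame is filled by position alone, which is why a differently ordered sentence produces different frames even when it expresses the same concept.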

 

There is a type of structural syntactic relationship that is being used as a measurement of the meaningful structures to be found in the text [2].

 

The semantic dimension is assumed to be a phenomenon that can be related to this purely structural phenomenon.  This particular structural relationship depends on the co-location of terms, and this co-relationship establishes the pattern that one looks for to gauge the relatedness of primitive constructions, such as the constructions produced during the rollup of possible conceptual indicators by a class of convolutions. 

 

These convolutions are only discussed in theory, but the group of scientists recognized that the ATS process has a simple form of the general class of convolutions, which they call an “inversion”.

 

The frame allows one to develop more complex structural forms that when “filled” can be used to indicate concepts also.  How the frame slots are filled depends on rules like those constructed for entity extraction by ClearForest Inc, or those constructed for Parts of Speech tagging by Text International Corporation. 

 

Summary:  Founder of OntologyStream Inc, and the BCNGroup (not for profit corporation registered in Virginia), Dr. Paul S. Prueitt has delivered to Dr. David Alberts, Director of Research for DoD C4ISR Cooperative Research Program this nine-page communication on cultural and technical issues related to Knowledge Representation Fidelity. 

 

 

 

 

 

This communication exists at

 

http://www.ontologystream.com/area2/review/communication.htm

 

Along with an index to the specific materials delivered to Dr. Alberts by Dr. Prueitt.

 

 

 

 

_____________________   _________

Paul S. Prueitt                date

 



[1] Measures of precision and recall were heavily criticized during the TREC and Tipster research and competitions in the 1990s.  Our measure will be more complex and yet rigorous in its definition. 

[2] The meaningful structure is ultimately “meaningful” to a human.  However, structures that are found, and that have been identified as being meaningful by a human, are considered to be indicators of concepts that can be evoked in the mind of a reader.