Research Note 16
July 23, 2003
This note is set in the context of the Actionable Intelligence Process model, which is discussed in the PowerPoints in the folder at:
http://www.ontologystream.com/area1/MemeticOntology/support
Large collections have statistical properties that do isolate patterns of linguistic variation, so the use of statistics is reasonable. But statistics over what? The data has to be acquired properly.
The localization of ( type : value ) pairs provides the right answer.
The NdCore 2.0 system uses a word level n-gram with n = 5.
However, we can experimentally show that a 5-gram, at the word level, will not extract the patterns that mark ALL concepts of importance. Moreover, we can experimentally show that the use of statistics without ontology will develop constructions that are nonsense and in fact point to nothing that would be regarded as a concept expressed in the text by a human reader.
One can imagine a skilled writer developing messages in which three keys to meaning are always fully separated by more than three words.
By “fully”, we mean that none of the three different key words has a further word in common with another key within the branches related to the key words. A partial linkage, by contrast, is sufficient to bring the keys together.
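The separation argument can be checked with a small sketch. It assumes plain whitespace tokenization and a window of five consecutive words; the function names and the sample messages are illustrative and are not taken from NdCore 2.0. Two words can share a five-word window only if at most three words lie between them.

```python
def word_ngrams(text, n=5):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def keys_share_window(text, key_a, key_b, n=5):
    """True if both keys occur together inside at least one n-word window."""
    return any(key_a in gram and key_b in gram for gram in word_ngrams(text, n))


# Hypothetical messages: three words, then four words, between the two keys.
near = "the fox looked at the grapes today"
far = "the fox looked up at the grapes"

near_linked = keys_share_window(near, "fox", "grapes")  # three words between
far_linked = keys_share_window(far, "fox", "grapes")    # four words between
```

In this sketch the keys in the first message fall inside a common window, while the keys in the second message never do, so a word-level 5-gram alone cannot relate them.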
For example, suppose we have, in the Input Tree, the branch segments
(1st key, 5th word, 3rd word), (2nd key, 5th word, 4th word),
and there exists a branch in the Input Tree with the third key, not having the 5th word but having a secondary linkage due to co-occurrence of the 3rd word,
(3rd key, 3rd word, 6th word);
then it is possible that nearest neighbor or clustering algorithms will bring the three keys together.
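A minimal sketch of this linkage follows. It assumes a single-linkage style of grouping, in which two keys are joined whenever their branches share any word; the note does not say which nearest neighbor or clustering algorithm is used, so the function `link_keys` and the labels in `branches` are hypothetical.

```python
def link_keys(branches):
    """Group keys whose branches are connected through shared words.

    branches maps each key to the set of other words on its branch.
    Two keys are linked directly when their word sets intersect; a group
    is a connected component of that relation (single linkage).
    """
    groups = []
    unvisited = set(branches)
    while unvisited:
        frontier = [unvisited.pop()]
        group = set(frontier)
        while frontier:
            current = frontier.pop()
            # Keys not yet grouped that share a word with the current key.
            linked = {k for k in unvisited if branches[k] & branches[current]}
            unvisited -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(group)
    return groups


# The three branch segments from the example above.
branches = {
    "key1": {"word5", "word3"},
    "key2": {"word5", "word4"},
    "key3": {"word3", "word6"},
}
groups = link_keys(branches)
```

Here `key2` and `key3` share no word directly; they end up in the same group only through `key1`. Nothing in the procedure checks whether any of the shared words carries the same meaning in each branch.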
However, this is all syntactic.
If ANY of these linkages is incorrect given the contextual truth, then the concept extracted will be a false identification. So we have a figure-ground issue. One must define the ground so that, as we change the focus (query), the figure that arises (situational ontology) is correct.
Dr. Alberts at OSD is interested in OntologyStream developing a measure of what one needs to mean by “correct”, from a cognitive science and social science point of view.
OntologyStream Inc is developing a reification process based on generalizing the n-gram to ontology frames, such as those in the Protégé ontology constructions. Our experience with Semio technology and SchemaLogic technology, as well as with Oracle Context and SLIP, helps us make decisions about how to manage the generalization and the inversions.
We expect to produce results as good as Semio within two months. (See next section.) What we hope to see is this new technology integrated with the NdCore 2.0 technology, along with an objective measure of concept fidelity.
According to my plan, the generalized n-gram is to be used experimentally on a test set consisting of the collection of 312 Aesop fables, to pull in those patterns that are widely spread and may have variations in word occurrence that a word-level n-gram would never pick up. We have the Semio result as a type of ground truth.
The way we develop conceptual indexes for a small collection can then be applied to the large text corpus at INSCOM or elsewhere. We have a small proof of concept, following other well-developed technology, and then a full proof of product.
Frames, scripts and OWL: The relationship between the n-gram and a Schank-like frame is considered essential for several reasons.
Perhaps the most important contrast is that the frame has a name and slots where values go. Protégé is based on frames, and both Topic Maps and OWL can be used in such a way as to support the notion of frame-filling as a type of logical inference; i.e., if the frame is filled in a certain way, then a certain deductive inference follows.
The ATS 5-gram window is in fact a frame with five slots: the middle word, the two significant words to the left and the two significant words to the right. But the means (that is, the rules) by which these slots are filled are simply due to the words being located in a specific position in relation to the middle word.
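The five-slot reading of the window can be sketched as follows. This is an illustration under stated assumptions: “significant” is taken to mean “not in a small stop-word list”, and the slot names and the `STOP_WORDS` set are invented for the example rather than taken from the ATS system.

```python
# Assumed stop-word list; the real notion of "significant word" may differ.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "at"}


def window_frames(text):
    """Build one five-slot frame per middle word, filled purely by position."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    frames = []
    for i in range(2, len(words) - 2):
        frames.append({
            "left2": words[i - 2],
            "left1": words[i - 1],
            "middle": words[i],
            "right1": words[i + 1],
            "right2": words[i + 2],
        })
    return frames


# Hypothetical sentence with five significant words, giving a single frame.
frames = window_frames("the hungry fox looked at the ripe grapes")
```

The only rule at work is relative position: each slot receives whatever significant word happens to sit at that offset from the middle word.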
There is a type of structural syntactic relationship that is being used as a measurement of the meaningful structures to be found in the text.
The semantic dimension is assumed to be a phenomenon that can be related to this purely structural phenomenon. This particular structural relationship depends on the co-location of terms, and this co-relationship establishes the pattern that one looks for to gauge the relatedness of primitive constructions, such as the constructions produced during the rollup of possible conceptual indicators by a class of convolutions.
These convolutions are so far discussed only in theory, but we recognize that the inversion process is a simple form of the general class of mathematically defined convolutions over a discrete data structure.
The frame allows one to develop more complex structural forms that when “filled” can be used to indicate concepts. How the frame slots are filled depends on rules like those constructed for entity extraction by ClearForest Inc, or those constructed for Parts of Speech tagging by Text International Corporation.
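As a contrast with position-based filling, the sketch below fills a named frame by rule. The class `Frame`, the slot names, and the word lists `ANIMALS` and `ACTIONS` are hypothetical; they stand in for the much richer rule sets used in entity extraction or part-of-speech tagging and do not reproduce any vendor's rules.

```python
ANIMALS = {"fox", "crow", "lion", "mouse"}   # hypothetical entity list
ACTIONS = {"flatters", "steals", "frees"}    # hypothetical action list


class Frame:
    """A named frame whose slots are filled by rules rather than by position."""

    def __init__(self, name, slot_rules):
        self.name = name
        self.slot_rules = slot_rules  # slot name -> predicate on a single word

    def fill(self, text):
        """Fill each slot with the first word that satisfies its rule."""
        words = text.lower().split()
        return {
            slot: next((w for w in words if rule(w)), None)
            for slot, rule in self.slot_rules.items()
        }

    def is_filled(self, filled):
        """A frame indicates its concept only when every slot has a value."""
        return all(value is not None for value in filled.values())


frame = Frame("animal_action", {
    "actor": lambda w: w in ANIMALS,
    "action": lambda w: w in ACTIONS,
})
filled = frame.fill("with sweet words the fox flatters the vain crow")
```

In this sketch the slots are filled wherever the matching words occur in the sentence, and the frame counts as an indicator only when `is_filled` returns true.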
Semio: We believe that the comfort level can change if a distinguished group of scientists show the business people that revenues can be generated from a more advanced type of information technology that does include semantic and pragmatic theory.
Specifically, we believe that, from a preexisting ontology “about” a collection, a measure of fidelity is possible. This measure can be properly defined so that the techniques are objectively evaluated.
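One way such a measure might be set up is sketched below. It is an assumption made for illustration, not the measure under development: it simply compares the set of extracted concept labels with the set of labels in the preexisting ontology and reports precision and recall. The function name and the sample labels are hypothetical.

```python
def concept_fidelity(extracted, reference):
    """Return (precision, recall) of extracted concepts against a reference set."""
    extracted, reference = set(extracted), set(reference)
    matched = extracted & reference
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical labels; neither set comes from an actual system run.
reference = {"flattery", "greed", "pride", "kindness"}
extracted = {"flattery", "greed", "looked ripe"}

precision, recall = concept_fidelity(extracted, reference)
```

Exact label matching is the crudest possible comparison; a usable measure would also have to credit near matches and respect the subsumption structure of the ontology.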
The automated construction of subsumption hierarchies, called taxonomies, has been a subject of deep research for several decades, and many groups have developed prototypes.
A new company called Entrieva has, over the past year, integrated the patents and Intellectual Property of two companies, Semio and Webversa. OntologyStream Inc is aware of how these patents work.

Figure 2: One view of the Semio representation of concepts in the fable collection
At OntologyStream’s request, last week the engineers at Entrieva quickly generated a topic-map-type representation of the concepts in OntologyStream’s test collection of Aesop fables. This is a high-quality conceptual rollup over a small collection of thematically rich documents.
The conceptual rollup of concept representations and meta-concept representations could provide thematic analysis of social discourse and, as a consequence, help us understand the social discourse in various countries where there is support for terrorism.

Figure 3: A second view of concepts and concept relationships in the fable collection
The software interface to the fable collection is at the URL:
http://demo.semio.com/semio/discover.cgi?db=%2Ffables%2Cv