Notes on Relatedness and Hilbert Encoding

CCM Note 18


 

July 24, 2003

 

 

 

 

 

Index

 

 

First Note

Topological Logic

Structural Nature of a Concept

Input Tree

On the Inversion

CCM ( type : value ) pair construction

Hilbert Encoding


 

 

 

First Note 

 

Two concepts are related if a certain measure of relatedness exists between words that are contained in the concepts.

 

Topological Logic

 

Clearly a concept should be most related to itself, and this holds under this measure of relatedness.  The measure leads directly to a set of formalisms called topological logic.  Topological logic supports the Minimal Voting Procedure and related situational logics.  To set up topological logic we develop a natural representation of the process involved in producing local measures of relatedness.  This requires the CCM ( type : value ) pair constructions.

 

When using statistical formalisms, the concept of relatedness becomes a function of the entire collection if there is some type of divisor that is proportional to

 

1) the number of concepts,

2) the number of words in all concepts

 

This count can be over all occurrences or over all "subjects".
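The effect of the divisor can be illustrated with a minimal sketch. The note does not define the measure itself, so the shared-word count used here, the function name relatedness, and the sample concepts are all assumptions made only for illustration.

```python
def relatedness(a, b, divisor=1.0):
    """Hypothetical measure: number of shared words, scaled by a divisor."""
    return len(set(a) & set(b)) / divisor

# Hypothetical collection of two concepts, each a list of words.
concepts = {
    "c1": ["tree", "branch", "word"],
    "c2": ["word", "subject", "occurrence"],
}

n_concepts = len(concepts)                           # divisor choice 1
n_words = sum(len(ws) for ws in concepts.values())   # divisor choice 2

by_concepts = relatedness(concepts["c1"], concepts["c2"], n_concepts)
by_words = relatedness(concepts["c1"], concepts["c2"], n_words)
self_score = relatedness(concepts["c1"], concepts["c1"], n_concepts)
```

Under this assumed measure a concept is never less related to itself than to any other concept, and changing the divisor rescales every score by a quantity that depends on the whole collection.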

 

The focus of a set of concepts can be a non-linear measure of the importance of individual concepts, thus producing a figure-ground attentional mechanism simply by changing the divisor.  (We need to see the NdCore algorithms in order to show how this might work.)

 

When using categorical abstraction, the concept of relatedness becomes a function of both the entire collection AND ontology services that have encoded the structural relationships between categories of occurrences, subjects and controlled reconciliation processes.   We need to instantiate the categorical abstraction as a means to organize the occurrences of words into containers and to organize the reconciliation processes within the containers.  The use of the Hilbert encoding facilitates this work in a way that is provably optimal. 

 

Our notion of subject and occurrence is defined as in the Topic Map standard.

 

Our notion of attentional focus is derived from our experience with neural networks and perceptual psychology.

 

 

 

 


 

Structural nature of a concept

 

This leaves us with the question of the structural nature of a concept.  We have some candidates based on the current NdCore conceptual rollup.

 

The NdCore conceptual rollup measurement process uses n-grams.  This measurement produces a set of tree branches, two for each occurrence of each significant word.

 

Each branch has some "contextual" information, but this contextual information has not undergone reification or been made subject to ontology services.  Of course this can be done, and the algorithms that then act on the Input Array (which is the name of the ordered set of tree branches) will perform as before.

 

Let us indicate this Input Array, with the symbol “I”, and any reified Input Array “I(r)”.

 

Figure 1: A very simple Input Array with two elements (branches)


 

Input Tree

 

If the data source has only one document, then the Input Array (or branches) will be a single tree.  If we have only the significant words w(1), w(2), w(3), w(4) in a single sentence, then we have the single tree in Figure 2.   The 5-grams are

 

{ [ *, w(1), w(2) , w(3), w(4) ], [ w(1), w(2) , w(3), w(4), * ] }
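The two 5-grams above can be reproduced with a short sketch. It assumes that the sentence is padded with one "*" marker on each side and that a window of five slides across the padded sequence. The function name padded_ngrams is hypothetical and this is not the NdCore implementation.

```python
def padded_ngrams(words, n=5, pad="*"):
    """Slide a window of size n over the words, padded once on each side."""
    padded = [pad] + list(words) + [pad]
    # range() is empty when the padded sequence is shorter than n.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = padded_ngrams(["w(1)", "w(2)", "w(3)", "w(4)"])
for gram in grams:
    print(gram)
```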

 

Figure 2: A simple input tree with four branches

 

 


 

On the inversion

 

A problem has confused us for the past several weeks. 

 

This problem goes away IF we see that one document will produce one Input tree because the “branches” all have the same textID.  If we have more than one document, then there will be more than one tree (and more than one textID).

 

We also need to see that each input tree can be written (to an ASCII file, for example) as a set of separated branches, so that the textID is repeated.

 

So Figure 2 would be

 

Figure 3: A single small Input Tree that is fractured into separated branches

 

This Figure 3 is written into a text file as:

 

(textID, w1, w2)

(textID, w1, w2, w3)

(textID, w4, w3)

(textID, w4, w3, w2)

 

The partial inversion (over word type only) is then accomplished by writing the n-tuples in the list (above) in reverse order, as

 

(w2, w1, textID)

(w3, w2, w1, textID)

(w3, w4, textID)

(w2, w3, w4, textID)

 

By ordering this list BY the first element (which is the word-type in fact) we have

 

 

(w2, w1, textID)

(w2, w3, w4, textID)

(w3, w2, w1, textID)

(w3, w4, textID)

 

 

and the “inversion” would look like

 

 

Figure 4: The inversion of Figure 3
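The reversal and ordering steps above can be sketched as follows. The function name partial_inversion is an assumption. The sort uses only the first element (the word type) as its key, and Python's sort is stable, so branches that share a word type keep their relative order.

```python
def partial_inversion(branches):
    """Reverse each branch, then order the branches by word type."""
    reversed_branches = [tuple(reversed(b)) for b in branches]
    return sorted(reversed_branches, key=lambda b: b[0])

# The separated branches of Figure 3, as written to the text file.
branches = [
    ("textID", "w1", "w2"),
    ("textID", "w1", "w2", "w3"),
    ("textID", "w4", "w3"),
    ("textID", "w4", "w3", "w2"),
]

inverted = partial_inversion(branches)
for branch in inverted:
    print(branch)
```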

 

Handling multiple textIDs is easy if we write all input trees as separated branches (repeating the textID information once for each branch).

 


 

Figure 5:  A single word having four occurrences, two in textID(1) and two in textID(2)

 

 

 

The ordering of the branches by word type (after the branches are written in reverse order) is the key here.  It allows a binary-tree-style search for one occurrence of a word; all textIDs related to that word are then immediately available at the ends of the inverted branches located next to the occurrence that the search has found.

 

No index is needed, and no hash table. 
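A minimal sketch of this lookup is given below, assuming the inverted branches are held in a list already ordered by word type. The function name find_text_ids and the sample data are hypothetical. The search finds any one occurrence of the word, then scans the adjacent branches in both directions.

```python
def find_text_ids(inverted, word):
    """Binary search for one occurrence, then collect adjacent textIDs."""
    lo, hi = 0, len(inverted) - 1
    hit = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        head = inverted[mid][0]
        if head == word:
            hit = mid
            break
        if head < word:
            lo = mid + 1
        else:
            hi = mid - 1
    if hit < 0:
        return []
    # Widen the hit to cover every neighbouring branch with the same word.
    start, end = hit, hit
    while start > 0 and inverted[start - 1][0] == word:
        start -= 1
    while end + 1 < len(inverted) and inverted[end + 1][0] == word:
        end += 1
    # The textID sits at the end of each inverted branch.
    return [branch[-1] for branch in inverted[start:end + 1]]

# Hypothetical data: "w2" occurs twice in text1 and twice in text2.
sample = [
    ("w1", "w0", "text1"),
    ("w2", "w1", "text1"),
    ("w2", "w3", "text1"),
    ("w2", "w1", "text2"),
    ("w2", "w5", "text2"),
    ("w3", "w4", "text2"),
]
ids = find_text_ids(sample, "w2")
```

The only structure consulted is the sorted list itself, which is the sense in which no separate index or hash table is required.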

 


 

The CCM ( t : v) construction

 

The CCM ( t : v) construction is used, if only nominally, in creating the inversion of I.  This inversion is a set of simple trees, with the root of each tree corresponding to one significant word subject.

 

The inversions bring together all occurrences of a word into a single tree-construction with the root node the "subject" of these multiple occurrences.  This process only needs EITHER I or I(r) to "work".  Given the Input Array, an Array (or hash table) is produced that encodes these "subject trees" into a memory structure.  We may call this the Output Array, or O.
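One way to picture the Output Array is as a mapping from each subject word to the list of its occurrences. The sketch below uses the hash table variant and assumes that each inverted branch has the shape (word, context..., textID). The function name build_output_array and the (context, textID) pair layout are illustrative assumptions.

```python
from collections import defaultdict

def build_output_array(inverted):
    """Group inverted branches into one "subject tree" per root word."""
    subject_trees = defaultdict(list)
    for branch in inverted:
        subject, *context, text_id = branch
        subject_trees[subject].append((tuple(context), text_id))
    return dict(subject_trees)

# The ordered inverted branches from the earlier example.
inverted_branches = [
    ("w2", "w1", "textID"),
    ("w2", "w3", "w4", "textID"),
    ("w3", "w2", "w1", "textID"),
    ("w3", "w4", "textID"),
]
O = build_output_array(inverted_branches)
```

Each key of O plays the role of a root node (the subject), and each entry under it is one occurrence with its context and textID.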

 

If, in the process of producing these "subject trees", there is a separation of occurrences of words so as to reflect word ambiguity in various contexts, then we will call this the reified Output Array, or O(r).  This separation of contexts within localized constructions (a controlled vocabulary) follows the innovation developed by SchemaLogic Inc., which is based on a localization of controlled vocabulary into containers.

 

Once the SchemaLogic innovation is followed, it is possible to develop an ontology based on these containers and easily produce knowledge-management-type reconciliation processes that reify.


 

Hilbert Encoding

 

 

 

I and I(r) form the basis for a “Hilbert Encoding” H and for a set of convolutions over this Hilbert Encoding.

 

I encodes as

 

H(I).

 

I(r)  encodes as

 

H( I(r) ).

 

The inversion with word type abstraction (only) must act transparently on either H(I) or H( I(r) ).

 

The Output Array (of branches) has been denoted as “O”.  O forms the basis for a Hilbert Encoding,

 

H( O )

 

and has a set of reification processes that create

 

H ( O(r) ).