

ARL Final Report

Section 2

The Application of CCG Generalization to the Problem of Text Parsing

December 31, 2000

 

ARL has expressed an interest in text parsing.  Thus it is important to specify how we see text parsing fitting into an architecture that deploys a generalization of Russian CCG technology.

 

The diagram for this architecture has the following parts:

 

1)       a preprocessor for data streams

2)       an image library

3)       a visualization interface and control system

4)       an indexing engine

5)       routing and retrieval engines

6)       a viewer or reader interface with a feedback loop to 2, 3, 4, and 5

 

Component 6 is where we might have a decision-support interface.  Components 2, 3, 4, and 5 are really the components of a knowledge warehouse; these components are built up over time.  In data-mining terminology, component 1 is called a data-cleaning component.
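The forward path through these components can be made concrete in a small sketch.  The following Python code is purely illustrative: all class names, method names, and data structures are our own assumptions for exposition, not part of the actual system described in this report.

```python
# Hypothetical sketch of the six-component forward path.  Every name
# here is invented for illustration; the report specifies components,
# not APIs.

class Preprocessor:              # component 1: data cleaning
    def clean(self, raw: str) -> list[str]:
        return raw.lower().split()

class ImageLibrary:              # component 2: image/token store
    def __init__(self):
        self.tokens: set[str] = set()
    def add(self, token: str) -> None:
        self.tokens.add(token)

class Indexer:                   # component 4: indexing engine
    def __init__(self):
        self.index: dict[str, int] = {}
    def index_tokens(self, tokens: list[str]) -> None:
        for t in tokens:
            self.index[t] = self.index.get(t, 0) + 1

class Retriever:                 # component 5: routing and retrieval
    def __init__(self, indexer: Indexer):
        self.indexer = indexer
    def lookup(self, token: str) -> int:
        return self.indexer.index.get(token, 0)

# Component 6 (the viewer) would feed human judgments back into
# components 2-5; here we wire only the forward path.
pre, lib, idx = Preprocessor(), ImageLibrary(), Indexer()
tokens = pre.clean("Text parsing requires text cleaning")
for t in tokens:
    lib.add(t)
idx.index_tokens(tokens)
ret = Retriever(idx)
print(ret.lookup("text"))   # -> 2
```

Components 3 and 6 are omitted because they are interactive; the sketch shows only how the batch components hand data forward.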

 

The preprocessor simply must do whatever is necessary to put the incoming information into a regular data structure.  The cleaned data structure can take many different forms; however, in the text-parsing task all of these forms have deficits.  Again, this is due to the indirect relationship that textual information has to the experience of awareness or knowledge.  We seek to transfer the interpretability of natural text into an image framework.
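A minimal example of what such cleaning might look like follows; it is an assumption about one plausible form of "regular data structure" (positional token records), not the report's actual method.

```python
# Illustrative data-cleaning step: turning a ragged incoming text
# stream into a regular structure, here a list of (position, token)
# records.  The choice of structure is our assumption.
import re

def clean_stream(raw: str) -> list[tuple[int, str]]:
    # Strip markup remnants, then pull out word tokens.
    text = re.sub(r"<[^>]+>", " ", raw)
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return list(enumerate(tokens))

records = clean_stream("Some <b>raw</b>   incoming\ttext")
print(records)  # -> [(0, 'some'), (1, 'raw'), (2, 'incoming'), (3, 'text')]
```

Any such regular form discards layout and context, which is one of the deficits noted above.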

 

The viewer interface is used in real time on problems of some consequence.  Thus there is a systems requirement that a feed-forward loop be established from one decision-making event to the next.  The feed-forward loop must touch components 2, 3, 4, and 5.

 

This architecture allows at least three places where human perception can make a difference in the system's computations.  The first is in the development of the image library (perhaps a token library is the more general concept).  The second is in the indexing of the library components and perhaps some data streams.  The third is in the decision support system (component 6).

 

Section 2.1

 

In Section 1, we made several claims.  One of these is that the flow of information from an object of investigation can be vetted by a human-computer system if and only if all real aspects of the syntactic and semantic representational problems are addressed.  The flow of information from an object of investigation is likely to suffer from the data source being somewhat indirect, as with EEG data and linguistic data.  Data sources such as astronomical data are more direct and thus more like formal data sources such as number theory.

 

The partial success of statistical methods based on word frequencies attests to the fact that a partial solution to these problems leads to an imperfect result.  The glass is either half empty or half full; we do not know how to make this judgment, because today there is no completely satisfactory automated text-parsing system.

 

The indirect nature of the data source would seem to imply that a human interpretant is necessary before there can be truly successful text parsing.  Thus the notion of vetting is apt, since it implies causation on a process that is mediated by a knowledgeable source and human judgment.  The goal of a CCG system for text parsing is to transfer the interpretive degrees of freedom of text into an image framework.  Once in the framework, certain algorithmic paths can produce suggestive consequences in new contexts.

 

It is not yet known whether new methodology, entirely separate from the existing routing and retrieval technologies, will give rise to new and more successful results.  We have suggested that much of the statistical work on word frequencies is hard-limited by the nature of anticipation.  The statistical sciences can tell us perhaps everything about the past, but cannot always predict the future.  Moreover, we have the problem of false sense-making.  The meaning of words is enabled with ambiguity precisely for the purpose of predicting the meaning of words in contexts that are bound in a perceptual loop.  This loop involves both memory and anticipation.

 

One can revisit the TREC and TIPSTER literatures, as we at TelArt will be doing over the next three months.  In this review, we find not only statistical approaches, such as those made by David Lewis at AT&T and Susan Dumais at Bellcore, but also a few linguistic and semantic methods.  These methods are being reviewed as part of the Indexing, Routing and Retrieval (IRR) evaluation conducted by TelArt for a commercial client.

 

An understanding of routing and retrieval techniques might assist in the generalization of the CCG technology.  This generalization was undertaken to establish some broad principles that might be formative to a proper text-parsing system.  The CCG technology can then be seen to have the following parts:

 

1)       A representational schema

2)       A visualization schema

3)       An induction schema
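Since the report names these three parts but not their interfaces, the following sketch of how they might be typed is entirely our assumption; every signature is invented for exposition.

```python
# Hedged sketch of the three CCG parts as abstract interfaces.
# The part names come from the report; the method signatures do not.
from abc import ABC, abstractmethod

class RepresentationalSchema(ABC):
    @abstractmethod
    def represent(self, text: str):
        """Map text into an internal (image-like) representation."""

class VisualizationSchema(ABC):
    @abstractmethod
    def render(self, representation) -> str:
        """Present a representation for human viewing and vetting."""

class InductionSchema(ABC):
    @abstractmethod
    def induce(self, representations: list):
        """Generalize over a collection of stored representations."""

# A trivial concrete representational schema, for illustration only.
class BagOfWords(RepresentationalSchema):
    def represent(self, text: str):
        return sorted(set(text.lower().split()))

print(BagOfWords().represent("Text parsing of text"))
# -> ['of', 'parsing', 'text']
```

Splitting the technology this way keeps the human-facing part (visualization) separate from the two machine-facing parts, which matches the report's emphasis on human vetting.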

 

New thinking on the indexing, sorting, or arrangement of data atoms may also provide value to our task.  As we look for tools and methods, we will of course be somewhat hampered by the now-proprietary nature of new technologies.  However, this is just part of the task we set for ourselves.

 

The application of CCG technology to the problem of text parsing requires that our group have a command of all existing IRR technologies and theory.