OntologyStream Inc.

Copyright: 2001

 

Full Text Mining

Paul S. Prueitt

June 4, 2001, rewritten from 1997 unpublished paper

The practical work of information aggregation and data warehousing can be done in real time, in small memory footprints, using an in-memory database format called structural holonomy.

Document Organization into an Information Warehouse

The book "Oracle Data Warehousing" by Donald Burleson, and other recent books on data warehouses, indicate two modes of data warehousing operation. The first is a mining mode supporting autonomous data warehouse analysis. This mode is supported during system design cycles and in lights out operations. The second is a discovery mode.

Discovery processes use the data developed during the mining operation, and are run separate from the mining mode. The idea is that discovery requires a pre-processing of massive amounts of data and the interaction of the user. Once the organization of the data warehouse reflects statistical properties of data use patterns, the discovery mode can be used interactively to produce an information warehouse.

The commercial literature, such as Burleson's book and the book from the SPIRAL Group "Data Warehousing and Decision Support", suggests that a mature data warehouse should not be updated while users are accessing the system. In this view, updating and analysis occur either over the weekend or at night. The separation is between computationally intensive data mining processes and user discovery using aggregated data.

Other literatures, for example research at IBM, suggest that a formal separation of lights out operation from user access may be viewed differently if data is compartmentalized via data replication. In general, the community is attempting to understand clearly the analytic transformation of transaction data, and this process of clarification has put some pressure on traditional views of database administration, information, and artificial intelligence.

OSI’s General Framework for Machine and Natural Intelligence suggests that the modal nature of perception, memory formation, cognition, and action should influence the architecture for information mining (data mining plus text mining) and knowledge discovery in databases [1,2,3,4,5]. Regardless of the hardware and software constraints, there is direct evidence, from the neuropsychology of memory and cognition, that a pre-processing mode should occur and that during this mode learning modifies implicit memory stores at the substructural level.

Practical aspect: The grounding of OSI’s General Framework in modern experimental literature on human perception is good business, because the Framework is endowed with long-term value. 

We suggest that data mining can produce specific tacit knowledge about the object invariants in the data as seen through the analytic processes. This leads to the notion of Computational Implicit Memory, or CIMs, and to retrieval processes based on the voting procedures discussed in [3, 6]. OSI calls this retrieval process “Implicit Query” or IQ. The CIMs concept was developed by Prueitt in 1998 and has since been partly replaced by the notion of structural holonomy.

Autonomous reorganization of memory (be this human or corporate) requires a global lights out process. For example, the categorization policy (a partitioning function that is applied to the entire data warehouse) has to be globally reevaluated due to the non-stationarity of experience. This evaluation need not be continuous, so periodic update is acceptable. For data mining systems, a global partition function can be the cause of "object consistencies" that are then modeled by aggregation techniques, and that represent a memory of the past as reflected in the data invariances. Laying out this framework for memory of the past is the first requirement in the transformation of a "data warehouse" to an "information warehouse".

However, a real time adaptive modification of within-category linkage is a second feature of biological intelligence. Real time, user-induced change to the structure of memory is a capability that can only be considered given an adaptive architecture for partitioning the substructural features into the "object consistencies". User induction can be supported at both the level of object components (theme expression, cognitive graphs, etc.) and at the level of some meaningful aggregation of components. Thus, the "interpretation" of an aggregation of data can be guided by the discovered interactivity of subcomponents. Subcomponent analysis forms the basis for OSI’s notion of knowledge warehousing.

The OSI Information Mining architecture is primarily motivated by the ability of computer systems to do various types of tasks. The user is required to do another set of tasks. Knowledge acquisition picks up the results of information mining and, through interaction with a user, develops "knowledge" by validating the degree to which meaning can be assigned to substructural aggregations.

The distinction between information and knowledge is subtle, and relates most closely to the issue of which information can be "validated" as meaningful (to an organization or some other real entity). The Computational Implicit Memory (CIMs) is a representation of statistically valid patterns in the data.

The CIMs can be regarded as stored knowledge only when the databases that store CIMs have a means to validate the meaningful interpretation by a human user.

The separation of data access from data organization provides a stable warehouse for users to interact with, while the lights out analytic mode provides an opportunity for experts and database administrators to examine user queries, database performance, and data aggregation. In the books by Burleson and the SPIRAL Group, the modes are regarded as a "mining" mode and a "discovery" mode. The mining mode places raw material into a configuration.

Discovery can be achieved by the act of a human operator or some other automated process depending on a successful mining operation.

Review of the technology

The four classes of mining functions are:

1.     associations,

2.     sequential patterns,

3.     classifiers and

4.     clustering.

Associations between two patterns (or data units) can be as simple as the observed probability that if one occurs then the other occurs. Co-occurrence provides one means to develop an enumeration of relationships that are observed to exist between patterns. The set of all these patterns, and the potential relationships derived from empirical evidence, is information (or aggregated data).
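The simplest association described above, the observed probability that one pattern occurs given that another occurs, can be sketched in a few lines. This is only an illustrative sketch: the function name `association_strengths` and the toy records are assumptions made for the example and do not come from any product discussed in this paper.

```python
from collections import Counter
from itertools import permutations


def association_strengths(records):
    """Estimate P(b occurs | a occurs) for every ordered pair of items.

    `records` is an iterable of item collections, for example the tokens
    of one document or the items of one transaction.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for record in records:
        items = set(record)                         # count each item once per record
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))  # all ordered pairs (a, b)
    return {
        (a, b): count / item_counts[a]
        for (a, b), count in pair_counts.items()
    }


# Hypothetical toy data: three short "documents" as token lists.
records = [
    ["warehouse", "mining", "discovery"],
    ["warehouse", "mining"],
    ["warehouse", "query"],
]
strengths = association_strengths(records)
print(strengths[("warehouse", "mining")])  # 2 of the 3 warehouse records contain mining
print(strengths[("mining", "warehouse")])  # every mining record contains warehouse
```

Note that the measure is directional: the strength from "warehouse" to "mining" differs from the strength in the other direction. Pairs that never co-occur are simply absent from the result.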

The process of discovering the consequences of co-occurrence is less well supported in existing commercial software, and leads from information technology into knowledge technology. The Founders of OSC have investigated a version of Mill’s logic that discovers and manages information about the consequences of word or token co-occurrence. This preliminary work suggests that the voting procedure, used with Mill's logic, can make an assignment of relationship based on a theory of types and stratified category representation.

Full text representation is the first step in information mining and knowledge acquisition.

Tool sets exist for representational substructure and sign systems, discovery tools based on neural network models, and data warehouses.

Layered neural networks can assign a pattern to a category and in this way associate a pattern to use behavior. Thus, neural network architecture is appropriate for the encoding of knowledge representations. Data warehousing language reserves the terms "category" and "cluster" for the output of an artificial neural network. Neuro-technology encodes associations between levels of organization (pattern to cluster) or between patterns in a reinforcing context.

One expects that there is at least some unadvertised use of neural networks as part of data mining and knowledge discovery - particularly in financial analysis of real time data.

Data mining’s discovery mode identifies some refinement of simple co-occurrence associations. Aggregate objects are formed to filter and route data during the next mining cycle.

There are a number of potential knowledge technologies. For example, one could extend the probability of co-occurrence to fuzzy associations by assigning language terms such as "strong", "moderate", and "weak" to a description of the association.
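One way to picture the fuzzy extension mentioned above is to map a co-occurrence probability to degrees of membership in the three language terms. The sketch below is a minimal illustration; the breakpoints (0.2, 0.5, 0.8) and the function names are assumptions chosen for the example, not values taken from any cited system.

```python
def fuzzy_association(p):
    """Map a co-occurrence probability p in [0, 1] to membership degrees
    in three linguistic terms. The breakpoints are illustrative assumptions."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    weak = max(0.0, min(1.0, (0.5 - p) / 0.3))     # full below 0.2, none above 0.5
    strong = max(0.0, min(1.0, (p - 0.5) / 0.3))   # none below 0.5, full above 0.8
    moderate = max(0.0, 1.0 - abs(p - 0.5) / 0.3)  # peaks at 0.5, none outside 0.2..0.8
    return {"weak": weak, "moderate": moderate, "strong": strong}


def dominant_term(p):
    """Return the linguistic term with the highest membership for p."""
    memberships = fuzzy_association(p)
    return max(memberships, key=memberships.get)


for p in (0.1, 0.5, 0.9):
    print(p, dominant_term(p), fuzzy_association(p))
```

Unlike fixed thresholds, the membership functions overlap, so an association near a boundary is described as partly "weak" and partly "moderate" rather than being forced into one bin.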

Sequential patterns are associations across time. Associations between complex objects, such as clustered units, are also possible. In the Oracle and IBM literature, it is generally assumed that users make discoveries. Most of the commercial literature is focused on how the data is managed and not on how the data is used or what are the foundations of a theory of knowledge discovery.

Oracle and IBM technology and tools organize data to detect patterns, and then develop aggregates (perhaps fully defined as objects) based on semi-automated analysis. The data patterns are then assigned metadata to speed access to the aggregated data structures. The aggregates can be supported by frame filling tools that add information and develop quick access to pre-processed information. Frame filling enriches the definition of patterns.

Clearly there are some strategy decisions that are made, regarding the role of users in discovery. IBM places responsibility on the data warehouse system to lay out data in a way that users can discover value.

A clear separation can be made between mining functions that autonomously identify sequential patterns, and aggregation functions that nominate units as being of high importance.

The "higher order processes", such as pattern completion (a form of categorization), forecasting (even in its simplest form), selective attention, hypothesis testing (again in its simplest form), and goal formation, are part of the discovery mode.

Discovery implies that there can be very little automation of these processes until the nature of information mining and knowledge discovery is better understood. We must rely fully on the human to make this type of judgment until advanced machine intelligence provides to us a new means to discover the meaning in aggregations of patterns.

Practical aspect: OSI has to wait until a significant part of the commercial market develops an expectation of knowledge technology. During this wait, we are developing products that are more traditional.

Document Representation based on Features Extracted and Relational Associations

As we know, traditional Information Extraction and Retrieval (IR) methods require a set of features through which retrieval is supported. The retrieval can be enhanced by Salton query expansion and other advanced query aids. Information Warehousing and Knowledge Extraction moves us beyond IR by requiring that aggregations of features be formalized as objects. Sometimes the cost is a selective violation of Codd’s third normal form. This occurs in order to provide better access or visualization of these aggregations of features.

The voting procedure identifies and refines aggregations of features and is thus an ideal data warehouse management tool. The procedure is neutral regarding how features are identified and extracted, and whether query expansion operators are available. The procedure will also integrate with any of a class of existing feature extraction tools, including Topic Maps.

The results of third party classifiers or cluster production systems can be used to define category policy. For example, a neural network classifier could be used to sort messages into categories. The sorted messages could then be used to define a category policy. Iterated ranking and trimming of the representational sets refine the policy. Refined category policies may be used to generate a description of the contents of the categories and these descriptions used to initiate informational seeking programs. The description of the categories can be associated to other descriptions in a hierarchical fashion.
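The flow described above, from sorted messages to a category policy refined by ranking and trimming, can be sketched as follows. This is a simplified stand-in under stated assumptions, not the voting procedure of [3, 6], whose details are not given in this paper. The ranking rule (share of a token's occurrences that fall in the category), the `keep` parameter, the one-token-one-vote tally, and the toy messages are all hypothetical.

```python
from collections import Counter


def build_policy(sorted_messages, keep=3):
    """Build a category policy from messages already sorted into categories.

    `sorted_messages` maps a category label to a list of token lists.
    Returns a dict mapping each category to its `keep` most specific tokens.
    """
    counts = {
        category: Counter(t for msg in msgs for t in set(msg))
        for category, msgs in sorted_messages.items()
    }
    totals = Counter()
    for counter in counts.values():
        totals.update(counter)
    policy = {}
    for category, counter in counts.items():
        # Rank by the share of a token's occurrences found in this category,
        # then by raw count, then alphabetically so the result is deterministic.
        ranked = sorted(
            counter,
            key=lambda t: (-counter[t] / totals[t], -counter[t], t),
        )
        policy[category] = set(ranked[:keep])  # trim the representational set
    return policy


def vote(policy, message):
    """Each token votes for every category whose set contains it.
    Returns the category with the most votes, or None if no token votes."""
    tally = Counter()
    for token in set(message):
        for category, tokens in policy.items():
            if token in tokens:
                tally[category] += 1
    if not tally:
        return None
    return max(sorted(tally), key=tally.get)


# Hypothetical messages, as if pre-sorted by a third party classifier.
sorted_messages = {
    "billing": [["invoice", "payment", "overdue"], ["payment", "refund", "account"]],
    "support": [["password", "reset", "account"], ["login", "password", "error"]],
}
policy = build_policy(sorted_messages)
print(policy)
print(vote(policy, ["payment", "overdue", "notice"]))
```

Iteration, in this sketch, would mean re-sorting the messages with `vote` and rebuilding the policy from the new assignment until the sets stop changing. The token "account" is trimmed from both sets because it is shared equally by the two categories.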

Data warehouse standard relational association tools are based on the co-occurrence of tokens and the "affinities" that tokens have for each other. The tools are generally based on statistical methods such as statistical regression, but can also be based on adaptive systems such as neural networks. However, the results are handled through some "presentation" software that shows these affinities to the user. If the user is provided with proper tools and has proper background knowledge, then affinities can be further refined.

The affinities are generally stored as a matrix, where each array element is a measure of co-occurrence, similarity, or affinity. Special array processors are used by the HNC hardware, and possibly by other systems such as IBM's Intelligent Miner Family™ of tools. Once useful associations are validated by the user, appropriate object aggregates can be configured to organize underlying data so that similar associations are discovered, or additional information about the association is acquired. This configuration can occur in near real time or as part of the mining/discovery cycle.
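An affinity matrix of the kind described above can be sketched with plain Python lists. The choice of the Jaccard measure (records containing both tokens divided by records containing either) is an assumption made for this example; the text leaves the measure open, and nothing here reflects how HNC or IBM tools compute affinities.

```python
def affinity_matrix(records):
    """Return (vocabulary, matrix), where matrix[i][j] is the Jaccard
    affinity between vocabulary[i] and vocabulary[j] across the records."""
    vocabulary = sorted({token for record in records for token in record})
    # For each token, the set of record indices in which it appears.
    containing = {
        token: {k for k, record in enumerate(records) if token in record}
        for token in vocabulary
    }
    matrix = []
    for a in vocabulary:
        row = []
        for b in vocabulary:
            both = containing[a] & containing[b]
            either = containing[a] | containing[b]  # never empty: a occurs somewhere
            row.append(len(both) / len(either))
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical toy data: three short "documents" as token lists.
records = [
    ["warehouse", "mining", "discovery"],
    ["warehouse", "mining"],
    ["warehouse", "query"],
]
vocabulary, matrix = affinity_matrix(records)
for token, row in zip(vocabulary, matrix):
    print(token.ljust(10), [round(value, 2) for value in row])
```

The matrix is symmetric with ones on the diagonal, so a production system would normally store only one triangle, or only the entries above some affinity threshold.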

Practical aspect:  I believe that this is the essence of the contributions being made by Richard Ballard. 

Our model of implicit knowledge is consistent with processing substructural features for associations at one level and processing category objects and associations at another level. The implied association model should be procedurally the same at either level, but differ in the data processed. The use of an implicit (or tacit) knowledge potential from bi-level computational argumentation is largely ignored in the literature. The exceptions are related to research programs in evolutionary computing and connectionism.

Document Retrieval based on Concept and Use Profiles

Use profiles can be user profiles. A use profile can be maintained in the form of a category policy relative to one user, a class of users, or a situation.

The voting procedure can be used in describing (via summarization of meaning) aggregate objects such as categories. This is a natural outcome, since the relationship between substructure and use properties of the category can be encoded. The voting procedures can be structurally modified using linguistics and situational logics.

Example: a document is placed into category 12. After a client reads this document it is routed into a work flow queue where additional information is acquired about the situation that caused the document. (The notion of causation is used non-literally to set up the Mill's logic.) This information is used to refine a category policy for all documents placed into category 12. User profiles are then developed to simulate the routing habits of the client. These habits are seen in the context of an environmental situation.

On Line Analytic Processing (OLAP) is a paradigm that mines static or dynamic data sets for associations based on co-occurrence distribution and affinities. These distributions form context free grammars, the profiles of which can be used to establish appropriate context, and refine affinity relationships. OLAP could also use evolutionary computing or neural networks - but there are few public examples where this has been done successfully.

Of course, co-occurrence and simple implementations for affinity based on thesaurus will play a role in On Line Analysis of Situations (OLAS). More has to be done to convert OLAP to Decision Support Systems (DSS).

Mill's logic, as revealed through the voting procedure, will mine the same data sets for associations based on a more powerful paradigm. The affinities, now seen in information warehousing, can be given qualities linked to a relationship variable between data aggregations. The notion of Ultrastructure can and should be introduced here; along with notions of interpretant, the notion of substructure and control, and the notions of emergence and implicit memory seen in the voting procedures.

Summary

Information extraction supports the spotting of topics in text or objects in images. In spite of the advanced power of these methods, one of the open problems is in the construction of automated processes.  Adopting a separation between mining and discovery is useful – particularly if this separation is explained as part of the data warehousing philosophy.

[1] Prueitt, Paul S. (1996c). Is Computation Something New?, Proceedings of NIST Conference on Intelligent Systems: A Semiotics Perspective. Session: Memory. Complexity and Control in Biological and Artificial Systems.

[2] Prueitt, Paul S. (1997b). The Autonomous Organization of Data through Semiotic Methods, in proceedings of the NIST Intelligent Systems and Applied Semiotics conference, September 22-25.

[3] Prueitt, Paul S. (1997c). Grounding Applied Semiotics in Neuropsychology and Open logic, in Proceedings of IEEE Systems, Man and Cybernetics Conference, October 12-15, 1997, Orlando, Florida.

[4] Prueitt, Paul S. (1998a). A General Framework for Computational Intelligence, accepted at 2nd World Multiconference on Systemics, Cybernetics and Informatics, July 12-16, 1998.

[5] Prueitt, Paul S. (1998b). An Interpretation of the Logic of J. S. Mill, accepted at IEEE, Joint Conference on the Science and Technology of Intelligent Systems, Sept. 14-17, 1998.

[6] Prueitt, Paul S. (1998). Measurement, Categorization and Bi-level Computational Memory, under review.

Analogs and Distinctions between Knowledge, Information and Data Warehousing

The language used in data warehousing commercial literature has a few terms with analogs to the language of Information and Knowledge Warehouses. These include the following:

Affinity, Aggregation, Associations, Classifiers, Data Clustering, Decision, Discovery, Frames, Forecasting, Fuzzy Reasoning, Index, Inference, Learning, Mining, Neural Network, Object, Query, Pattern Identification, Reasoning, Rule, Summarization, Sequential Patterns, Transaction