
 

Taxonomy Note 1: Note on taxonomy use

 

Overview of the problem to be addressed

Two level Fixed Upper Taxonomy from subject indicators

The topology on the set of all Subject Matter Indicator neighborhoods

Topological neighborhoods and topical logics

Issue of small text conceptual indexing

 

 

 

 

 


 

Overview of the problem to be addressed

 

Notice that the task of rolling up the concepts in a text collection is functionally similar to obtaining subject indicators that might be chosen to go into a specific taxonomy for a universe of discourse.  Such taxonomies are often developed by hand, and they have been found to be very valuable to organizations.

 

When used as a source of metadata for retrieval software, a taxonomy is often called a controlled vocabulary.

 

The core issue is the identification of subject indicators.  

 

Consider the following text:

 

 

Unintentional wireless 911 calls today can be a significant problem for the nation's Public Safety Answering Points (PSAPs).  These unintentional 911 calls can occur when a consumer accidentally presses a key on his or her handset automatically programmed to dial 911 via speed dial.  The consumer unknowingly ties up a 911 call taker at the other end of the line who has to confirm, based on little information, that the call is, in fact, accidental, and not from a person in distress.  That call taker now also is unable to field other, real, 911 emergency calls.  According to the National Emergency Number Association (NENA), individual 911 PSAPs have estimated that between 25 and 70% of their wireless calls are unintentional calls.  Thousands of PSAP call taker hours are thus wasted every year on unintentional wireless 911 calls.

 

Let us suppose that one can produce a list of subject indicators, each element of the list having a (class:object) pairing.  The (class:object) pairing is required to allow an instrumentation of subject indicator detection technologies (see Notational Paper).  The subject indicators, elements of the controlled vocabulary so anointed by a community of practice, can have the nature of a machine-based ontology (a representation of the nature of the world).

 

The ontology can reflect the elements of controlled vocabulary, while having some additional usefulness including some mechanisms for inference and for the completion of categorization when only partial information is available. 

 

Suppose that a word-level parsing pass over the text has extracted the following elements (phrases):

 

 

Subject1: Unintentional wireless 911 calls

 

Subject2: accidentally presses a key

 

Subject3: ties up a 911 call taker

 

Subject4: confirm, based on little information, that the call is, in fact, accidental, and not from a person in distress

 

Subject5: unable to field other, real, 911 emergency calls

 

Subject6: call taker hours are thus wasted every year on unintentional wireless 911 calls

 

 

We see here that the classes of occurrences indicating a subject need to be named.  We have only “Subject1, . . ., Subject6” as names.  What we desire is to “name” these extracted elements using our controlled vocabulary.
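
The naming step can be sketched in a few lines of Python. This is only an illustration, not the instrumentation described in the Notational Paper: the SubjectIndicator type, the name_indicators helper, and the vocabulary term "Accidental 911 call" are hypothetical, and the assignment of a term to a placeholder is assumed to come from a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjectIndicator:
    """A (class:object) pairing: a class name plus one specific occurrence."""
    cls: str  # class name; a placeholder until a vocabulary term is assigned
    obj: str  # the phrase as it occurred in the text


# Two of the phrases extracted above, still carrying placeholder names.
extracted = [
    SubjectIndicator("Subject1", "Unintentional wireless 911 calls"),
    SubjectIndicator("Subject2", "accidentally presses a key"),
]

# Hypothetical assignment by a human reviewer: placeholder -> vocabulary term.
assignments = {"Subject1": "Accidental 911 call"}


def name_indicators(indicators, assignments):
    """Replace placeholder class names with controlled-vocabulary terms.

    Indicators without an assignment keep their placeholder name.
    """
    return [
        SubjectIndicator(assignments.get(ind.cls, ind.cls), ind.obj)
        for ind in indicators
    ]


named = name_indicators(extracted, assignments)
for ind in named:
    print(f"({ind.cls}:{ind.obj})")
```

Keeping the occurrence (the object) separate from its class name means that a later change to the controlled vocabulary only touches the assignment table, not the extracted phrases.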

 

The controlled vocabulary consists of the names of those elements of the Upper Taxonomy that have been validated as being significant by the community of practice, plus elements of the lower level of the Upper Taxonomy.   The elements of the lower level are also validated as being significant.  Elements of the lower level have a broad-term narrow-term relationship with the upper elements.

 

Objects, specific occurrences of the subject indicator, have a specific form which may or may not be repeated exactly in other occurrences.   To instrument high-precision recall in search and retrieval tasks we are using a complex data set, also called an “element”.  The generation of Subject Matter Indicator neighborhoods is one way to create these complex data sets.

 

Note also that the subject indicator Public Safety Answering Points (PSAPs) is not in the list of extracted elements.  It is left up to a human, perhaps aided by some future machine algorithms, to make the mental association between the concept and these complex data sets indicated by these


 

Two level Fixed Upper Taxonomy from subject indicators

 

We have exposed several problems. 

 

Q1: How can we fix a specific “Upper Taxonomy” within the constraints of a small number of nodes and two levels, so that the Taxonomy covers the Universe of Discourse for a specific period of time?

Q2: How can we provide additional layers, called a Hidden Taxonomy, that interfaces with intelligent search and metadata extraction technology, while keeping the Upper Taxonomy fixed?

 

Our solution is to have humans enumerate a two-level taxonomy that is to be fixed for a period of nine months.  This human enumeration is matched to a bottom-up adaptive elaboration of the Hidden Taxonomy.

 

In addition to this two-level taxonomy, we wish to provide an adaptive elaboration of the bottom elements of the two-level taxonomy so that the taxonomy extends into the subject matter.

 

First step: Produce a single set of taxonomy node candidates

 

 

Figure 1: Taxonomy candidates derived from machine or algorithm

 

Second step:  Organize the taxonomy candidates into a two level taxonomy using Broad Term (BT) / Narrow Term (NT) relationships.

 

 

Figure 2: Figure 1 organized by BT/NT relationships

 

Figures 1 and 2 indicate a top-down enumeration of taxonomy using polling instruments and knowledge engineering/management methods.  BT/NT relationships are used.  For example, c(21) is a broad term having three terms,

 

{ c(11), c(3), c(40) }

 

with a narrower meaning or context.
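
A two-level taxonomy of this kind can be held in a simple mapping from each broad term to its narrow terms. The following Python sketch is illustrative only: the c(21) entry comes from the example above, while the helper names narrow_terms and broad_terms are hypothetical.

```python
# Two-level taxonomy: broad term (BT) -> set of narrow terms (NT).
# Only the c(21) entry is taken from the example above.
upper_taxonomy = {
    "c(21)": {"c(11)", "c(3)", "c(40)"},
}


def narrow_terms(taxonomy, broad):
    """Return the narrow terms listed under a broad term (empty if unknown)."""
    return taxonomy.get(broad, set())


def broad_terms(taxonomy, narrow):
    """Return every broad term that lists `narrow` as one of its narrow terms."""
    return {bt for bt, nts in taxonomy.items() if narrow in nts}


print(sorted(narrow_terms(upper_taxonomy, "c(21)")))
print(broad_terms(upper_taxonomy, "c(3)"))
```

Because the taxonomy has exactly two levels, no recursion is needed: every lookup is either from a top node to its children or from a child to its parents.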

 

Once the Upper Taxonomy is fixed we have a finer resolution of subject matter indicators that MUST match the bottom layer of the Upper Taxonomy.  (This matching between the lower level of the Upper Taxonomy and the topmost level of the Hidden Taxonomy is the key to our approach for government agencies.)
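
The matching requirement can be expressed as a consistency check. The Python sketch below assumes, as a simplification, that matching means equality of node names; the function name unmatched_nodes and the hidden-level labels h(1), h(2), h(3) are hypothetical.

```python
def unmatched_nodes(upper_taxonomy, hidden_taxonomy):
    """Compare the bottom layer of the Upper Taxonomy with the topmost
    level of the Hidden Taxonomy.

    Both arguments map a parent node to the set of its children.
    Returns (missing, extra): bottom-layer terms with no hidden node, and
    hidden top-level nodes that match no bottom-layer term.
    """
    bottom_layer = set().union(*upper_taxonomy.values())
    hidden_top = set(hidden_taxonomy)
    return bottom_layer - hidden_top, hidden_top - bottom_layer


# Upper Taxonomy entry from the text; the Hidden Taxonomy below is made up.
upper = {"c(21)": {"c(11)", "c(3)", "c(40)"}}
hidden = {"c(11)": {"h(1)", "h(2)"}, "c(3)": {"h(3)"}}

missing, extra = unmatched_nodes(upper, hidden)
print(missing)  # {'c(40)'}: no Hidden Taxonomy node elaborates c(40) yet
print(extra)    # set(): every hidden top-level node matches a bottom term
```

A check of this kind could be run whenever the Hidden Taxonomy is elaborated, so that the fixed Upper Taxonomy and the adaptive lower layers do not drift apart.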

 

 

Figure 3: Upper Taxonomy and Hidden Taxonomy

 

We will use elements of a class of adaptive elaboration instruments to enhance search and retrieval algorithms.  For example, new linguistic variation in categorized text indicates evolution of subject indicators. As the social discourse changes, these patterns of linguistic variation can be empirically observed to change, and with these changes comes an evolution in nuance. These changes are to be observed and then linked within the Hidden Taxonomy, and the associations between the Hidden Taxonomy and the Upper Taxonomy can be allowed to evolve as the social discourse introduces this nuance.

 

The Upper Taxonomy remains fixed unless there are reasons to change it, such as the introduction of new topics or the forgetting of topics that are no longer considered relevant to the purposes of the controlled vocabulary.

 

Documents can also be placed into repositories using the Upper Taxonomy as user-defined metadata; however, search and retrieval using the Subject Matter Indicator neighborhoods will also use a multi-pass rule engine to provide higher resolution to the subject indicators.

 

The topology on the set of all Subject Matter Indicator neighborhoods

 

A machine-derived, bottom-up taxonomy can be generated.  Take a set of complex data sets and cluster the elements to introduce metaconcept boundaries and to suggest relationships between these higher-order constructions.

 

One way of thinking about this clustering process is that there is an implicit topology on the space of complex data sets.  A method of differential ontology can be used to produce an explicit representation of these boundaries and these relationships.
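
The clustering step can be illustrated with a small Python sketch. It is not the method of differential ontology, which is not specified here. As a stand-in it links two complex data sets when their Jaccard similarity reaches a threshold and takes the connected components as clusters; the term sets, the names d1 to d3, and the threshold of 0.5 are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(data_sets, threshold):
    """Group data sets into connected components of the 'similar enough' relation.

    `data_sets` maps a name to a set of terms. Two data sets are linked when
    their Jaccard similarity is at least `threshold`.
    """
    names = list(data_sets)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(data_sets[a], data_sets[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


# Hypothetical complex data sets, each reduced to a bag of terms.
data_sets = {
    "d1": {"911", "call", "wireless"},
    "d2": {"911", "call", "unintentional"},
    "d3": {"taxonomy", "node"},
}

clusters = cluster(data_sets, threshold=0.5)
print(clusters)
```

Here d1 and d2 share two of four distinct terms and fall into one cluster, while d3 stands alone. The boundary between the two clusters plays the role of a metaconcept boundary in the sense used above.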

 

Topological neighborhoods and topical logics

 

 

Figure 4: A topology with two neighborhoods

 

Figure 4 could be used to produce a two level taxonomy with two nodes in the top level, one corresponding to each of the neighborhoods with “radius” = 2.  Under the first top node (derived from the upper left neighborhood), we have ten subject indicators within a radius of 2 units.  Under the second top node (derived from the lower right neighborhood), we also have ten subject indicators within a radius of 2 units. 

 

Neighborhoods can be made narrower by taking only the underlying nodes with distance 0 and 1.  In this case, the first top node would have 6 children and the second top node would have 4 children.

 

In the above topology, we have a simple notion of distance in graphical constructions (not trees – but more general graphic constructions).
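
This notion of distance can be sketched with a breadth-first search that collects every node within a given number of edges of a center node. The Python below is a minimal illustration: the function name neighborhood is hypothetical, and the example graph is made up because the graph of Figure 4 is not reproduced in the text.

```python
from collections import deque


def neighborhood(graph, center, radius):
    """Return every node within `radius` edges of `center` (center included).

    `graph` maps each node to the nodes adjacent to it; an edge is treated
    as undirected only if it is listed in both directions.
    """
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand past the requested radius
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)


# Hypothetical graph, not the one in Figure 4. It contains a cycle
# (a-b-c-a), so it is a general graph rather than a tree.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}

print(sorted(neighborhood(graph, "a", 1)))  # ['a', 'b', 'c']
print(sorted(neighborhood(graph, "a", 2)))  # ['a', 'b', 'c', 'd']
```

Shrinking the radius from 2 to 1 drops node d from the neighborhood of a, which is the same effect as restricting a top node to the underlying nodes at distance 0 and 1.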

 

Topological logics can be used to measure the presence of subject matter indicators.

 

Many companies have products that address concept extraction.  Cost and ease of deployment are the limiting factors in bringing knowledge of these technologies to the client.  Entrieva Inc and Applied Technical Systems Inc both have concept extraction systems.  Our technology matches any of these extraction/detection processes to the lower level of the Upper Taxonomy.

 

 

Issue of small text conceptual indexing

 

We hold that the most challenging problem lies in developing high-fidelity thematic identification in small text corpora.  We have developed the fable collection as a test set for technologies that claim to be able to produce a taxonomy or ontology from small text collections.