Noun, Verb Co-occurrence Analysis Using SLIP

Tutorial and Design Document

by Paul Prueitt, PhD

December 29, 2001

In linguistics, the concept of functional load is treated as a cause of the distribution of basic compositional elements related to spoken and written expression. In essence, the notion is that sounds that are easy to make will be used in situations where ambiguity of expression has some penalty. A study of auditory and acoustic phonetics leads to an understanding of how natural language is used and evolves.

Auditory and acoustical coherence and discordance are reflected in the structure and form of natural language. An investigation of this type leads to partial knowledge when the target of investigation is a complex system.

OSI has conjectured that incident management, involving computer intrusions, will identify the functional load of various hacker techniques and tools, as well as the compositional elements of vulnerabilities and exposures presented to hackers by modern operating systems and application software. Hacker tools co-evolve with computer system vulnerabilities and exposures. Thus the application of evolutionary programming to produce and maintain incident and intrusion taxonomy seems feasible.

But what if the target of understanding is a specific collection of writings? Can the essential characteristics of the writing style and the conceptual structure be revealed using techniques directed at discovering linguistic-type functional load? OSI is looking for collaboration on this issue, particularly with someone who has dictionaries, subsumption technology, ontology technology or linguistic parsers.

In three previous tutorials (1), (2) and (3), we have begun to reveal the issues that are involved in creating a semantic map of a text collection. Over the same period (late December 2001) Don Mitchell and Paul Prueitt have been working on the design and coding of a Natural Language Processor (NLP) browser shell. This shell will be responsible for producing the input file required by the Warehouse Browser, and will serve to apply various parsing techniques and dictionary resources for this purpose.

On the general issue of recognition

The basic framework for SLIP conjecture and the emergence of organizational categories depends on having many isolated facts that are stochastically convolved into a universe of abstractions that may correspond to the intent of the author to convey meaning. Themes, such as a relationship between the strong and the weak, will be repeated by a single author using variations. These themes will themselves be part of a set of themes, some quite distinct or isolated and some ubiquitous.

We will, for a moment, regress to a general issue that shapes any effort to use functional load to model the intent of a writer. The author has personal/social experiences and the freedoms and constraints offered by natural language. Seen through the author’s written work, these freedoms and constraints are recognizable to someone who is familiar with the work.

Suppose we have a lost Aesop fable, one that is not part of the published collections. A person who has come to love the fables would intuitively sense the character of this lost fable. The signature of the Aesop fables, as a whole, is recognized. Moreover, the substructure of themes and variations of themes would also be recognized.

Such recognition of signature is both very precise and subject to vagueness. When the veracity of a claimed lost fable is evaluated, both the preciseness and the vagueness are properties of human perception and belief. One might say, “I know that this is a fable written by Aesop”, when one believes strongly that this is in fact true. But this claim may not have precise proof. So what can be done if this issue were important?

Specific evidence could be developed. It might be pointed out, for example, that a pattern of expression is made in eight of the fables and that this pattern is reproduced in the lost fable. It might also be pointed out that a certain moral value that Aesop often expressed is expressed in this lost fable. One produces this type of evidence through an understanding of the fables and the application of this understanding to evaluate the claim. The argument is not always strictly deductive because we are talking about similarity of expression and similarity of patterns.

Recognition of patterns in Intrusion Detection systems

The same type of evidence might be asked for when we are looking at a new attack pattern in hacker activity. The specialist might say, “I know that this attack is being developed by group q”. But this is an intuition that is based on belief. Specific evidence is needed if one is to strengthen the claim or to discover more detail about the new attack patterns.

It has been claimed that the Internet is a simple, yet very complicated, system where the assumptions made in statistical modeling of case grammar by linguists are valid. An absolute claim of validity over these assumptions may be argued as incorrect when modeling natural language use in human communities.

In the artificial world of the Internet transactions, these assumptions may be argued as being correct. Why? This is because the hacker community must act through a formal system in order to effect intentions.

The computer processor conforms to a formal system, specifically modeled as a finite state machine. Moreover, scripted programs do much of the hacker’s work. Patterns in the bit stream provide exact information as to the program component nature. What we have not had, up to now, is a comprehensive approach to providing measurement devices that identify these patterns completely.

This comprehensive approach must reach to the level of individual system calls. Because the number of individual system calls is very large, it is necessary to build models of how various computers function and use these to add to the existing informational resolution coming from the standard Intrusion Detection Systems (such as RealSecure). But we know how to do this.

What OSI needs now is a deployment plan and high level (political) support. We need a client who will reasonably support the final steps required to produce the Event Browser and the stratified taxonomy.

After OSI completes the development of the NLP Browser, we will be able to apply some of the linguistic techniques to the Intrusion log files.

We look for partners in these vertical markets:

1) The analysis and event trending of trouble tickets from the network performance centers in telecommunications systems.

2) OSI eventChemistry ™ as applied to the examination of financial data.

3) The examination of the Patent and Trademark Office (PTO) database.

On co-occurrence of verbs and nouns

The previous tutorial suggested a study of the co-occurrence of nouns and verbs in the fable collection. It was suggested that we might look for the signs of linguistic functional load using co-occurrence within individual fables of pairs of nouns and verbs from a simple dictionary. This study is taken as a first approximation of meaningful linkage between concepts and styles in the Aesop fables. A second approximation follows, where we refine the means through which our datawh.txt text is created.

A 24,771 record datawh.txt file was produced for the first approximation. We parsed the fables and identified when a noun and a verb from the Dictionary were both contained in a single fable. The events were reported out to a new datawh.txt in the form

( noun, verb, fable name )
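The first approximation described above can be illustrated with a minimal sketch. It is not the NLP Browser code. The function name `cooccurrence_records`, the toy fable texts, and the tiny noun and verb sets are all hypothetical, and splitting words with a regular expression is an assumption standing in for whatever parsing the Dictionary lookup really used.

```python
import re

def cooccurrence_records(fables, nouns, verbs):
    """Return sorted (noun, verb, fable_name) triples.

    A triple is emitted whenever a dictionary noun and a dictionary verb
    both appear anywhere in the same fable (fable-level scope).
    """
    records = set()
    for name, text in fables.items():
        # Naive tokenization (assumption): lowercase alphabetic runs.
        words = set(re.findall(r"[a-z]+", text.lower()))
        for noun in words & nouns:
            for verb in words & verbs:
                records.add((noun, verb, name))
    return sorted(records)

# Toy data, invented for illustration only.
fables = {
    "The Lion and the Mouse": "A lion roared. The mouse replied.",
    "The Fox and the Monkey": "The fox laughed.",
}
nouns = {"lion", "mouse", "fox", "monkey"}
verbs = {"roared", "replied", "laughed"}

records = cooccurrence_records(fables, nouns, verbs)
for noun, verb, fable in records:
    print(f"( {noun}, {verb}, {fable} )")
```

Note that the sketch pairs the mouse with "roared" simply because both words appear in the same toy fable. This is the kind of loose association that fable-level scope produces.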

We then defined the conjecture shown in Figure 2a. Figure 2b flips this conjecture and was not used directly, because we recognize that the issue of scope was not properly addressed. It is important to review this issue, since we have also not properly addressed the issue of scope in the Incident Management and Intrusion Detection software.

Figure 2: An analytic conjecture relating nouns (a) and verbs (b) non-specifically

The conjecture in Figure 2a identifies nouns as “a” values and verbs as “b” values. It develops a non-specific relationship between two nouns if there exists a verb with which each of the two nouns co-occurs within any of the 312 fables.

Clearly the non-specific relationship is an abstraction, and the atoms linked together form an abstract category. But the abstractions may provide real insights into the structural affordance between the noun “lion” and any other noun. The two records (lion, roared, n) and (x, roared, m) might not be found to be related. This implies that there exists no noun x such that lion and x are related by the property of roaring. On the other hand, (monkey, accusing, n) and (fox, accusing, m) seems perfectly reasonable. Both monkeys and foxes are known to be accusatory.
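The linking rule of the conjecture can be sketched in a few lines. This is only an illustration of the rule as stated in the text (two nouns are linked when some verb co-occurs with each of them), not the SLIP implementation. The function name `link_nouns` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def link_nouns(records):
    """Link two nouns non-specifically when they share at least one verb.

    `records` is an iterable of (noun, verb, fable_name) triples.
    The fable name is ignored, because the shared verb may occur
    in different fables for each noun.
    """
    nouns_by_verb = defaultdict(set)
    for noun, verb, _fable in records:
        nouns_by_verb[verb].add(noun)

    links = set()
    for nouns in nouns_by_verb.values():
        # Every unordered pair of nouns sharing this verb becomes a link.
        for a, b in combinations(sorted(nouns), 2):
            links.add((a, b))
    return links

# Sample records invented for illustration.
records = [
    ("monkey", "accusing", "fable n"),
    ("fox", "accusing", "fable m"),
    ("lion", "roared", "fable n"),
]
links = link_nouns(records)
print(links)
```

In this toy input the monkey and the fox are linked through "accusing", while the lion stays unlinked because no other noun co-occurs with "roared".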

Figure 3 shows two small noun clusters from a subset of the noun atoms.

Figure 3: Noun atoms from the first approximation

In Figure 3a we have two nouns, voices and sling. Many of the associated verbs do not seem relevant. For example, (voices, died) and (voices, sort) do not seem to be properly related. This is like the data in an Intrusion Detection System that should not be there.

Voices: chirping, demanded, desiring, died, give, having, heard, inquired, lifted, lived, possess, replied, resolved, sort, take

Sling: blows, charged, chased, crying, feeding, found, inspired, killed, move, remaining, scare, sown, swung, take

This association of verbs and nouns clearly has some problems. But, of course! The verbs are simply any verbs that happen to be in any fable that contains the noun. This is the way events were generated by the first processes in the NLP Browser. But this scope is incorrect. Even though a noun might lend a certain mood to a verb that is outside of its sentence, most of the “scope” of the noun or verb will be within the sentence. A similar scope issue is found with the Incident Management and Intrusion Detection system designed for second quarter (2002) deployment.

But in our second approximation one can see a marked improvement in the quality of the relationships between nouns and verbs when the scope is constrained to the sentence. OSI can make this refinement because of a background in linguistics. However, the science of IMID analysis is still in the minds of a few hackers and a few CERT analysts.

The design of the NLP Browser will get smart about issues such as scope, similarity and contrast relationships, and even case grammar. The same will be true of the Event Browser for IMID.

Figure 4: Noun atoms from the second approximation

In Figure 4 we have four noun atoms that have been defined using the sentence as the scope for which verb matches to nouns are made.

Members: administering, carried, engaged, rebelled, refused, take

Seashore: come, feeding, find, looking, own, placed, projecting, saw, standing, suggested, take

Monkey: arrived, asked, casting, counseled, danced, descended, desiring, discovered, elected, envying, favor, found, gave, laughed, lay, leading, lying, obliged, placed, pleased, presented, promised, proposed, saw, see, stood, supposing, took, traveling, turn, watched

Flight: capture, conqueror, entreated, exultingly, find, flying, skulked, took, vanquished

Looking into a different category we find that

Voices: lifted, replied, take

Rather than

Voices: chirping, demanded, desiring, died, give, having, heard, inquired, lifted, lived, possess, replied, resolved, sort, take

The shorter list is the result of the narrower, sentence-level scope.
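The effect of narrowing the scope from the fable to the sentence can be shown with a small sketch. It rests on loud assumptions: sentences are split on ".", "!" and "?" only, words are matched by a simple regular expression, and the function name `verbs_by_noun` and the toy text are hypothetical. The actual NLP Browser may determine sentence boundaries quite differently.

```python
import re
from collections import defaultdict

def verbs_by_noun(text, nouns, verbs, scope="sentence"):
    """Collect, for each dictionary noun, the dictionary verbs in its scope.

    scope="fable"    treats the whole text as one unit (first approximation).
    scope="sentence" splits on ., ! and ? (assumed sentence boundaries).
    """
    units = re.split(r"[.!?]+", text) if scope == "sentence" else [text]
    found = defaultdict(set)
    for unit in units:
        words = set(re.findall(r"[a-z]+", unit.lower()))
        for noun in words & nouns:
            found[noun] |= words & verbs
    return {noun: sorted(vs) for noun, vs in found.items()}

# Toy fable text, invented for illustration.
toy_fable = "The voices replied. The old man died."
nouns = {"voices"}
verbs = {"replied", "died"}

fable_scope = verbs_by_noun(toy_fable, nouns, verbs, scope="fable")
sentence_scope = verbs_by_noun(toy_fable, nouns, verbs, scope="sentence")
print(fable_scope)
print(sentence_scope)
```

With fable-level scope the toy text pairs "voices" with "died", the same kind of doubtful pairing noted for the first approximation. With sentence-level scope only "replied" remains.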

Reflection: This December 29th Tutorial and Design Document is aimed at defining where OSI stands in a very important research effort, and what we need to do. The communities are invited to participate.

Comments to Dr. Paul Prueitt.