Data Structures, Theorems, and Programs

 

 

 

 

 

Obtaining Informational Transparency with Selective Attention

 

 

 

Dr. Paul S. Prueitt

 

President, OntologyStream Inc

 

October 24-25, 2001

 

 


Shallow Link analysis, Iterated scatter-gather and Parcelation (SLIP)


Public Domain Materials

 


 

Introduction

 

Only tables and algorithms were involved in the original SLIP experiments.  There were no relationships between tables.  Therefore the tables can be written as simple arrays.  These can be stored in text files and mapped into memory. 

 

Due to our design principles, we are able to move the entire SLIP technology into flat files and any programming language, such as Java, C, or Basic. 

 

Data mining operations can then be made fast enough to allow creative interaction between a domain expert and a SLIP Framework.  For example, a SLIP Framework now takes about four hours to develop, mostly because of manual steps.  The total run time is about 40 minutes, most of which is spent on the gather process.  This run time can be reduced to one to two minutes, perhaps less. 
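Because the tables have no inter-table relationships, a two-column datamart maps directly onto parallel in-memory arrays.  A minimal Python sketch (the tab-delimited format and the names here are assumptions for illustration; the actual SLIP flat files may differ):

```python
# Sketch: load a two-column SLIP datamart from flat ASCII lines into
# two parallel arrays.  The tab delimiter is an assumption.

def load_datamart(lines):
    """Each line holds one record: 'first_value<TAB>second_value'."""
    first_col, second_col = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the text file
        a, b = line.split("\t")
        first_col.append(a)
        second_col.append(b)
    return first_col, second_col

# In-memory lines stand in for the text file in this sketch:
records = ["ip1\tport80", "ip2\tport80", "ip3\tport443"]
firsts, seconds = load_datamart(records)
print(firsts)   # ['ip1', 'ip2', 'ip3']
```

Because the whole mart is held as two arrays, every later SLIP step is a pass over plain lists, which is what makes the flat-file version portable to any language.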

 

 

Figure 1: SLIP Interface 1.5, showing the radial plot for category C5

 

The Framework is the stratified concept map created through the human / SLIP interaction.  In developing the data transformation algorithms and other data processing requirements, we separated the programs and data structures into four groups. 

 

In Section 1 we provide a software specification for each of the four groups of programs and data structures.  In Section 2, we begin the exposition of a formal theory of categorization using the SLIP technology. 

 

In Sections 3 and 4, we describe the existing programs for doing the scatter-gather and the parcelation of a set into categories.  The categories have been shown to be event types.  These event types generalize and abstract single IDS events into IDS events distributed over time and location.

 


Section 1: The Full Function SLIP, Software Specification

 

The full function SLIP interface needs all four of these groups of algorithms and data structures.

 

Group 1: Set-up, Programs and Data Structures

 

1.1: Manually create the SLIP Data Mart and the Analytic Conjecture

1.2: Produce an interface between Visual Insights and/or data sources such as RealSecure (under consideration)

1.3: The only output is a single file with two columns and a written description of the Analytic Conjecture

 

Group 2: Production of the Top level of the SLIP Framework,  Programs and Data Structures (see Section 1)

 

2.1: Requires only the two-column SLIP datamart

2.2: Produces two tables:

2.2.1: A table of Atoms involved in the non-specific relationship defined by the SLIP Analytic Conjecture

2.2.2: A table called <Pairs>, which is the set of atoms that are paired by the non-specific relationship in the Conjecture

2.3: Automatically clusters the set of atoms but does not produce a category membership (done)

 

Group 3: Parcelation, filling out the SLIP framework, Programs and Data Structures (see Section 2)

 

3.1: Only requires the set of atoms, A, and the set of pairs, P.

3.2: Iteration automatically produces new levels in the SLIP Framework Tree (done)

3.3: Iteration can be manually applied under specific conditions to create modifications to the SLIP Framework (done)

 

Group 4: Enterprise management, Programs and Data Structures

 

4.1: Allows the navigation and review of various existing SLIP Frameworks (Version 1.5 available as of Oct 16th, 2001)

4.2: Allows the encoding of interpretation and comments about the SLIP categories (version 1.6 available Oct. 17th, 2001)

4.3: Allows additional parcelation due to manual use of Group 3 processes (under development)

4.4: Has a small technology footprint.  No installation process and no database.  (under development)

4.5: XML encoding and mobile agents allow very easy sharing of SLIP structures (under development)

 

Work on Group 2 and Group 3 is proceeding now.  The objective of this work is to have a Visual Basic program that

 

1)       is stand-alone

2)       has no install requirements

3)       has a small technology footprint

4)       has all of the algorithms needed to create a SLIP Framework

5)       allows domain specialists to make comments about the nature of each of the categories

 

Group 1 algorithms and data structures allow the construction of the set-up processes including the specification of an analytic conjecture, if desired, and the specification of a SLIP data mart.

 

Group 4 algorithms and data structures allow the sharing of SLIP Frameworks from one individual to another, and the establishment of an XML repository for work product.


Section 2: Definitions and Theorems on the decomposition of a SLIP data set

 

This is edited and extended from Section 6.8, “Definitions and Theorems on the decomposition of a SLIP data set,” from the October Report, “Summary of Results: July – October 2001, SLIP Intrusion Detection Technology” (74 pages).   

 

The SLIP technology does a Fourier-type decomposition of all invariance in the data set into clusters that have tight intra-cluster linkage and weak inter-cluster linkage.  This invariance is encoded as the nodes of a tree (see Figure 2). 

 

 

Figure 2: A trace route event that is gathered into category D1

 

In Figure 2 we see the SLIP interface version 1.6 (October 17th, 2001).  By scrolling the top left window we would see that a RealSecure IDS classified each of the 24 elements of category D1 as a Trace Route event.  The only computational means for gathering these elements together is the non-specific relationship between Source IP plus Source Port and Target IP plus Target Port. 

 

Observation 1: On the question of non-crisp categories: 

 

The clusters themselves are gathered from the four limiting distributions.  The set of all of these will have pairs of clusters with non-empty intersections.  What lies outside of a core might be considered an environment and may have significance.  However, we focus first on the cores of these categories.

 

Figure 3: Characterization of a core

 

Our method for automatically generating a framework to start analysis was chosen to eliminate the environment.   We are looking for what stays the same as one moves from one event type to another.  The core is the center of a category where this center is invariant across several limiting distributions.  The purpose of the SLIP Framework is to make available to domain experts the invariance that is produced by the non-specific relationship. 

 

The set of ending nodes (sometimes called the leaves) of the SLIP Tree produces a crisp partition of the original set of atoms.  This is because children are always produced in such a way as to provide this crisp partition, with no overlap between the memberships of the children.  The memberships of the children also union to produce the complete membership of the parent node in the SLIP Framework tree.   This is the important notion of disjoint union. 
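The disjoint-union property can be checked mechanically.  The sketch below verifies that the leaf memberships of a SLIP tree are pairwise disjoint and union to the root membership (the nested-dict tree representation is an assumption for illustration, not the SLIP data structure itself):

```python
# Sketch: verify the disjoint-union property of a SLIP Framework tree.
# Each node carries a membership set; ending nodes have no children.

def leaves(node):
    """Collect the membership sets of all ending nodes (leaves)."""
    if not node["children"]:
        return [node["members"]]
    out = []
    for child in node["children"]:
        out.extend(leaves(child))
    return out

def is_crisp_partition(root):
    """Leaves must be pairwise disjoint and union to the root membership."""
    union = set()
    for s in leaves(root):
        if union & s:          # overlap between children -> not crisp
            return False
        union |= s
    return union == root["members"]

tree = {"members": {1, 2, 3, 4}, "children": [
    {"members": {1, 2}, "children": []},
    {"members": {3, 4}, "children": []},
]}
print(is_crisp_partition(tree))  # True
```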

 

A series of theorems can be given.  However, the observation now is that a non-crisp partition can be developed in order to reflect what we conjecture to be category entanglement (again, see the separation of context and core in Figure 2). 

 

In more advanced implementations of the fundamental SLIP theory, one can use feature analysis and a voting procedure to route new events into a type of non-crisp classification based on profiles of categories.  This introduces the notion of a tri-level architecture for information routing, categorization, and information retrieval.  These more advanced implementations await the completion of the first fully stand-alone, full-function SLIP interface.  We mention the advanced implementations because, once SLIP Frameworks are being created and shared, a number of intelligent programs will become possible. 

 

 

Figure 5: The SLIP Interface taking comments on an event category

 

The SLIP methods produce context free core category memberships that appear in different environments.  The method is a direct intersection of sets rather than a statistical method. 

 

Metadata can be associated with these nodes.  This is done using a scripting language from the command line (see Figure 5).  Comments made by analysts are appended to a file pointed to by a metadata tag within the node tag in XML.
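A sketch of this metadata mechanism follows.  Every tag and attribute name below is an assumption, since the report does not fix the XML schema; a dict stands in for the file system:

```python
# Sketch: append an analyst comment to the file referenced by a
# metadata tag inside a node tag.  Tag/attribute names are assumed.
import xml.etree.ElementTree as ET

node_xml = '<node id="C5"><metadata file="C5_comments.txt"/></node>'

def comment_file_for(xml_text):
    """Read the file name out of the metadata tag of a node tag."""
    node = ET.fromstring(xml_text)
    return node.find("metadata").get("file")

def append_comment(store, xml_text, comment):
    """store: dict standing in for the file system in this sketch."""
    path = comment_file_for(xml_text)
    store.setdefault(path, []).append(comment)

files = {}
append_comment(files, node_xml, "All 24 elements are Trace Route events.")
print(files["C5_comments.txt"])
```

Keeping comments in a separate file pointed to by the node tag means the SLIP Framework XML itself never needs to be rewritten when an analyst adds an interpretation.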

 

Observation 2: On the fundamental SLIP theory: 

 

A set of definitions establishes an abstract mathematical language.  This language is useful for two basic reasons:

 

1)       The language allows one to conjecture about and prove properties that one can find experimentally by developing algorithms and software

2)       The language allows a peer review of the underlying intuitions about what the SLIP technology might be used for (both in ID work and elsewhere)

 

Primary concepts:

 

Datamart:  The SLIP datamart consists of a table with two columns.  In the non-database version of the SLIP technology, an ASCII text file replaces the table.  

 

The selection of the datamart is important.  The following issues might have an impact on what type of data is selected and which records from a larger database will be used.

 

1)       A time period may delimit an event or group of events.

2)       A domain expert may analyze the nature of the data itself and produce an Analytic Conjecture.  For example, the first column value might be the defensive addresses and the second column value might be the system calls that occurred during the time interval in question. 

 

It is important to realize that much of the value of the SLIP technology will come about when domain experts develop data with specific investigations in mind. 

 

Non-specific relation:  Pairs of first column values are identified by parsing the Datamart and finding occurrences where a second column value (b) appears in more than one record.

 

The critical issues are only that each line in the Pairs text file represents a record and that the record represents a single event.

 

 

Figure 5:  The non-specific relationship between the atom a1 and a2

 

Two values from the first column are paired if the associated value in the second column is the same.  This situation is represented in Figure 5. 

 

Once this pairing is done, then the pairs are used to specify atoms.  The atoms are those elements that are in one or more pairs. 

 

Formally we have:

( a1 , b )  +  ( a2 , b )   →   < a1 , r, a2 >

 

where r is the non-specific relationship.
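A minimal sketch of this pairing rule in Python (the record values and helper names are illustrative, not the SLIP code): group records by second-column value, emit a pair for every two first-column values that share one, and take the atoms as the elements that occur in some pair.

```python
# Sketch of the rule ( a1 , b ) + ( a2 , b ) -> < a1 , r, a2 >.
# Any two distinct first-column values sharing a second-column
# value b are paired by the non-specific relationship r.
from collections import defaultdict
from itertools import combinations

def build_pairs(records):
    by_b = defaultdict(set)
    for a, b in records:
        by_b[b].add(a)
    pairs = set()
    for firsts in by_b.values():
        for a1, a2 in combinations(sorted(firsts), 2):
            pairs.add((a1, a2))
    return pairs

records = [("a1", "b"), ("a2", "b"), ("a3", "c")]
print(build_pairs(records))   # {('a1', 'a2')} -- a3 shares no b-value

# The atoms are exactly the elements that appear in one or more pairs:
atoms = {a for pair in build_pairs(records) for a in pair}
print(atoms)                  # {'a1', 'a2'} (as a set, order may vary)
```

Note how a3 drops out: elements that never share a second-column value with another record are not atoms, which is the data reduction described below.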

 

Pairs.dbf:  Pairs.dbf (formerly called Two.dbf) is the data source that contains the pairs defined by shared secondnames.  The table is denoted with a script bold Q.

 

Set of Atoms:  Q is parsed to find all occurrences of second column values.  These are added into Atoms.dbf.  This set is determined by the occurrences of the non-specific relationship, and oftentimes the set of atoms is a small percentage of all second column values that are in the mart.  The percentage is sometimes 1/1000, for example.  Thus a much larger data set is reduced in size at the very beginning of the SLIP computations.  The set of atoms is denoted with a script bold A.  This reduction of data into “informational units” is not statistical.  Data aggregation is a filtering process that uses the non-specific relationship to pull together things that are related, but related only by this one well-defined non-specific relationship. 

 

All of this computational process is formal.  No meaning is assigned until the domain expert looks at the membership, uses the membership to produce a report from the original data source, and makes human judgments about the nature of the core category.  These judgments are then collected when the domain expert types into the comment property of the core category. 

 

The notion of nearness and the topological notions of analytic mathematics are used to produce a retrieval of elements into a core category.  The sets of elements produced in this way will actually be related in exactly the fashion defined by the chaining that occurs through the non-specific relationship. 

 

Ratio atoms/secondname:  This ratio is a computed value for any mart table.  It can be computed and used to tune the import of SLIP data marts.  For example, consider the dataset that produces the sample SLIP Framework (see Figure 1).  The mart columns were originally selected to reflect scanning for Trojans and the follow-up use of a port identified in Trojan scans.  A quick computation from RealSecure database reports can indicate whether there is significant Trojan scanning in this dataset.  If there is, and one wishes to compute the full SLIP Framework, then this is possible. 

 

This ratio and others like it can be used to produce a “selective attention” that automates some of the intelligence functions of ID, particularly those involving either data visualization or link analysis. 

 

Distribution of A on the circle:  The first use, that I know of, of the circle for scatter-gather was by myself in 1996.  The technique has not yet been publicly reported.  Scatter-gather generally requires both a pushing apart of atoms and a pulling of them together.  However, because the circle is the one-point compactification of the line interval, a manifold with no boundary, a pulling together is sufficient to separate those groups that are interlinked (called prime cores).  The use of the collection Q (derived from link analysis) to gather was invented (by Prueitt) in 2001, as far as I know.

 

Scatter/gather:  The scatter is done into a manifold where a distance between atoms is well defined.  This is done randomly, to eliminate any meaning of relationships that might be implied by the initial distance between atoms.  The scatter does in fact uniquely define the distance between any two atoms from A.  Moreover, we have a common manifold metric on the entire set of atoms.  This collective metric is to be contrasted with the pairwise metric given in Q.  The pairwise metric is used (again in a stochastic fashion) to induce an organization of the topology of the manifold.  This induced organization is inherited from the pairwise metrics. 

 

Halting condition:  The Halting Condition occurs when the gather process, if continued, would not move any of the atoms.  Three formal objects are required to define the halting condition.

 

a)       The paired atoms that are in the Pair table, Q.

b)       A subset of A identified from Q

c)       A distribution, D, of these atoms in a manifold

 

At any one time, the atom distribution in the manifold defines a distance between every pair of atoms.  The Pair table is used to randomly find pairs of points that have the non-specific relationship.  When such a pair is found, the two points associated with the two atoms are moved closer together by a little bit.  (This is like simulated annealing in artificial neural network systems.)  If the two points are already at the same location, then nothing will happen to the distribution.  
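Assuming the manifold is the circle used in the radial plots (parameterized by angle), the scatter, the random gather step, and the halting test can be sketched as follows.  The pull fraction, seed, and function names are assumptions for illustration, not the SLIP program:

```python
# Sketch: scatter/gather on the circle, angles in [0, 2*pi).
import math
import random

def scatter(atoms, rng):
    """Random placement, erasing any meaning in initial distances."""
    return {a: rng.uniform(0.0, 2 * math.pi) for a in atoms}

def wrapped_gap(dist, a1, a2):
    """Shortest arc between two atoms on the circle."""
    d = (dist[a2] - dist[a1] + math.pi) % (2 * math.pi) - math.pi
    return d

def gather_step(dist, pairs, rng, pull=0.5):
    """Pick a random pair from Q and move its points a bit closer."""
    a1, a2 = rng.choice(pairs)
    d = wrapped_gap(dist, a1, a2)
    dist[a1] += pull * d / 2
    dist[a2] -= pull * d / 2

def halted(dist, pairs, tol=1e-9):
    """Halting condition: every pair in Q sits at the same location."""
    return all(abs(wrapped_gap(dist, a1, a2)) < tol for a1, a2 in pairs)

rng = random.Random(0)
atoms = ["a1", "a2", "a3", "a4"]
pairs = [("a1", "a2"), ("a3", "a4")]   # two separate prime cores
dist = scatter(atoms, rng)
for _ in range(2000):
    gather_step(dist, pairs, rng)
print(halted(dist, pairs))  # True: each core has collapsed to a point
```

Each time a pair is selected its gap shrinks by a fixed factor, so linked atoms converge to a common location while unlinked groups stay apart, which is the separation of prime cores described above.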

 

The halting condition is a simple notion in the case of a well-linked set of atoms, such as the defined “prime cores” in our example.  However, the notion is not exactly simple in other cases.  For example, the halting condition can be characterized with a “chain condition,” and this recharacterization can be used to stop the gather process before the halting condition is reached. 

 

A set of atoms placed into a manifold does NOT have the halting condition with respect to Q if and only if there exists a pair in the pair table whose first element and second element are NOT at the same location in the manifold.  As the gather process proceeds, the set may or may not reach the halting condition at a later iteration.

 

One needs three objects: A, Q, and the distribution of A in the manifold.  The distribution changes due to iterations of the gather, and thus an index will uniquely specify the distribution at iteration i.  We denote the ith distribution as D(i).  So the triple (A, Q, D(i)) has the halting condition if and only if

 

(A, Q, D(i) ) = (A, Q, D(j) )  for all j > i

 

Definition 1:  A subset S of size greater than 1 from (A, Q, D(*)) is said to be prime if and only if there exists an index k such that (A, Q, D(k)) has the halting condition AND all of the elements of S are gathered to the same location. 

 

As Theorem 1 shows, the union of primes will be prime only in a special case. 

 

Figure 5: See Theorem 1

 

 

Theorem 1:  Let C(1), C(2), and C(3) be three distinct clusters.  Suppose that each, treated by itself, is prime.  The union of the three will be prime if and only if there exists a chain of length 2,

 

{ <a’, r, a(1)>, <a(1), r, a”> }

 

where a’ is in C(1), a(1) is in C(2), and a” is in C(3).
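Theorem 1 is, in effect, a connectivity statement: the union of primes is prime exactly when chains in Q link the clusters into one component.  A union-find sketch of that check (an illustrative implementation, not the SLIP code):

```python
# Sketch: prime cores behave like connected components of the pair
# graph Q.  The union of clusters is prime iff the pairs chain them
# into a single component.

def components(atoms, pairs):
    """Partition atoms into chain-connected groups using union-find."""
    parent = {a: a for a in atoms}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for a1, a2 in pairs:
        parent[find(a1)] = find(a2)
    comps = {}
    for a in atoms:
        comps.setdefault(find(a), set()).add(a)
    return list(comps.values())

# C1 = {a, b}, C2 = {c, d}, C3 = {e, f}: a chain through c links all three.
pairs = [("a", "b"), ("c", "d"), ("e", "f"), ("a", "c"), ("c", "e")]
atoms = {x for p in pairs for x in p}
print(len(components(atoms, pairs)))  # 1 -- the union is a single prime
```

Dropping the two chaining pairs ("a", "c") and ("c", "e") leaves three components, matching the three separate primes of the theorem.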

 

Corollary 2.1: Theorem 2 shows that one can create a halting condition by removing elements. 

 

Theorem 2: Let S be the subset of (A, Q, D(i)) whose members are parts of pairs (a’, a”) in Q where a’ and a” are not in the same location, and where that part of the pair has no other chaining relationship to any atom having a chaining relationship to the other part of the pair.  Then (A - S, Q, D(k)) will have the halting condition for all k > j, for some j.  

 

Figure 6: See Theorem 2 and Definition 2

 

In Figure 6 we illustrate the possibility that a set might be identified that, when removed, separates the set of atoms into three primes.  

 

 

Definition 2:  A prime set may be fractured into multiple prime sets given that one finds a set such as the subset S in Theorem 2.

 

Clearly the subset S will be the cause of future changes in the distribution.  (A - S, Q, D(i)) may be prime or composite, but this condition can be seen immediately in the distribution D(k), at a kth iteration that can be found eventually.  If prime, since we are at the halting condition, all of the atoms will be in exactly the same location.  If composite, there will be primes (subsets of the distribution) that each have the halting condition.  

 

Lemma 2.1:  Let S be a subset that fractures A.  Let a’ be an element of S, a” be an element of S, and < a’, r, a”> be an element of Q.  If there exists <a*, r, a”> in Q, then there does not exist <a’, r, a*> as an element of Q.

 

With these theorems and definitions we may obtain some insight into the SLIP algorithms. 

 

Let us review the SLIP process briefly.  We start with a set of atoms, A, and a set, Q, of pairs of these atoms that have the non-specific relationship defined by a link analysis on the mart table.  In our example the size of A is 725.  The size of Q is 18,763.  The mart contains 56,000 records. 

 

The ratio of Q to the set of all possible pairs from A is 18,763/525,625, or about 4/100.  This is the probability that a randomly selected pair will have the non-specific relationship.  We use the pairs that are in Q to gather together atoms in the manifold.  At some point we stop this process and find a candidate for the set S in Theorem 2. 
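The arithmetic can be checked directly, counting ordered pairs from A as the text does:

```python
# Quick check of the linkage probability from the example figures.
size_A = 725               # atoms in A
size_Q = 18763             # pairs in Q
all_pairs = size_A ** 2    # ordered pairs from A, as used in the text
ratio = size_Q / all_pairs
print(all_pairs)           # 525625
print(round(ratio, 4))     # 0.0357, i.e. roughly 4 in 100
```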

 

Finding exactly the situation in Theorem 2 still seems a bit difficult.  Computing chains and defining categories this way seems computationally intensive, but so is the gather process.  However, the heuristic of taking intersections across several separately computed limiting distributions seems to result in a practical solution that follows the theory exactly.  The computed cores are prime, but we get only a few at a time (at each level).  Furthermore, it is still not clear that the prime decomposition using stratified theory will produce THE unique result. 

 

The open question is whether or not we have a unique factorization theorem as one does in number theory. 

 

Conjecture 1 (October 17th, 2001):  There exists a unique decomposition of a set of atoms into primes using a purely algorithmic process.  This unique decomposition will be produced each time the SLIP algorithms are run, given that the algorithms find any splitting subset S that exists in any prime core. 

 

The SLIP algorithms and data structures form a new type of information system.  This system is a non-traditional database, unlike the usual third-normal-form relational databases, called a Referential Information Base (RIB).  These RIB structures are being developed as static in-memory structures by a number of groups.  Query and data management features are different in that RIBs do NOT allow delete or append functions.  These update functions are accomplished by completely unloading and remapping the data into a formal finite state machine.  This process is slow compared to relational database updates. 

 

However, once the remapping occurs, the update is complete, and very fast data aggregation and emergent computing processes can occur.  The RIB technologies are being developed as a new generation of data warehousing technologies, where append and delete are managed in a batch process.  SLIP uses many of the concepts that have been developed by others involved in RIB-type technologies.

 

The following is some preliminary analysis about how the input data might be configured so as to make a specific search for events of a particular type. 

 

 

Figure 7: A chain relationship between a(1) and a(n)

 

The events in Figure 7 have a specific nature if the first and second columns are (1) the attacker locations { x, a(1), a(2), a(3), a(4), . . . , a(n) } and (2) the defender location, { y }. 

 

We have seen two different interpretations of a chain relationship of this type given these two columns in the data mart.  The first interpretation is about the identification of a Trojan and the consequent use of this knowledge by the attacker.  The second interpretation is about session hijacking. 

 

So what is the difference between these two types of events? 

 

The first interpretation was the motivation for the example SLIP Framework that the demonstration (version 1.4) displays and navigates.  A port scan, from a source a(i), is used to trigger a response from a Trojan existing at location y.   Given a successful response from the scan, it is assumed that a(i) communicates off line with x and that x then addresses the same port at location y and establishes a session that in some way uses the Trojan. 

 

A different source IP, a(j), is used against a different target IP, again denoted in Figure 7 as y.  The identity of y is not used in the SLIP scatter-gather, and thus the fact that y may change location is not accounted for.  We create a category Y of y locations and treat any member of the category without distinction.

 

y(i), y(j) elements of Y   →   y(i) ~ y(j)

 

If in each case, a single IP is used once a Trojan is identified, then we have established the chain relationship.  The source IP locations { x, a(1), a(2), a(3), a(4), . . . , a(n) } will form all or part of a prime core. 

 

The second interpretation is quite a bit different, and yet has almost the same characteristics when put into the SLIP Framework.  One form of session hijacking occurs when a SYN is sent from x to y but x is spoofing its location, using the location a(i). 

 

Some considerable work needs to be done in order to catch a reply from one of the spoofed locations. 

 

Route traces will sometimes work, if the administrator has not turned off route tracing.  Then, given the ability to catch the reply, the source must manage to guess the session id number.  Here is where a vulnerability of the attacker occurs.  The ability to guess the session id is completely dependent on there being very little time elapsed between the SYN and the ACK reply.  A SLIP signature for this type of element could be membership in E4 of the first example SLIP Framework.

 

 

Figure 8: Five spoofed addresses chained to a port 1080 attack

 

Motivation:  The motivation for creating a batch of test sets and examining the cluster cores is to classify types of attacks from the patterns given in chain relationships within the pair table.  These chain relationships are due to shallow link analysis.  The clustering merely shows us where to look.  The halting condition for clustering has been shown to be equivalent to a formal property about chain relations.  So one has a computational foothold on the chains themselves.  The chaining in turn reveals real but non-specific linkage between data elements, linkage that exists in the data due to specific causes.