Shallow Link Analysis, Iterated Scatter-Gather and Parcelation (SLIP)

 


 

 

 

 

Obtaining Informational Transparency with Selective Attention

 

 

 

Dr. Paul S. Prueitt

President, OntologyStream Inc

December 7, 2001

 

Exercise on Importing an Arbitrary Event Log

 

 

The current data set is a collection of 120,246 Audit records.  Scott Wimer sent this data set to OSI from Software Systems International (SSI).  SSI's CylantSecure products provide a comprehensive, integrated Disallowed Operational Anomaly (DOA) identification technique that protects hosts from known and unknown attacks, misuse, abuse and anomalies (see http://www.softsysint.com).  We were told only that this log file came from a Linux system, so the exercise is a blind test of SLIP.

 

The examination of this data was treated as an experiment.  During the examination period OSI was told nothing about the origin of the event log; we had only the data records and our new technology.  This exercise is designed to show how the SLIP Browsers can be used to develop a model of the events reflected in this data set, even with no information about the system from which the data was collected.

 

It is important to note that we start with two problems:

 

1)      a massive amount of data, and

2)      no identification of events or of sequences of events.

 

These problems do not go away easily.

 

 

Figure 1: The Warehouse Browser after the SSI data set is pulled and exported

 

The next Exercise introduces the SLIP Event Browser.


Part 1: Use the Warehouse Browser to build the Analytic Conjecture

 

This exercise requires a single zip file called ELE.zip.  This file is 914K and can be obtained from Dr. Prueitt at beadmaster@ontologystream.com.

 

 

Figure 2: ELE.zip contains two zip files. 

 

When the two zip files in ELE.zip are unzipped, one will have a folder that looks like Figure 2.  The two Browsers are in the folder “filtered”.  They can be copied and moved, and they require no installation.

 

Start with only the two Browsers in an empty folder.  Create a Data folder and place a Datawh.txt of your choice in that folder (or just unzip the file filtered.zip).  Launch the Warehouse Browser (SLIPWhse.1.1.3.exe) and select any two of the column names that make sense together.  In our data set the Sip (Source IP) is always spoofed, so the address is always 0.0.0.0.  Moreover, time has no repeated values, so time is not a particularly interesting column with respect to an analytic conjecture.

 

The conjecture might be (b, a) = (Dport, Sport), as in Figure 1.  (Sport, Dport) might also be interesting, as would (Dport, Dip), (Dip, Dport), (Dip, Sport) and (Sport, Dip).  (Sport, Protocol) produces too many pairs to work with.  However, a secondary aggregation process will reduce the data representation.  We use a category abstraction to produce event templates (see Part 4), in this case a template selecting every 10th record.

 

To make the conjecture (b, a) = (Dport, Sport), use the commands

 

b = 4

a = 3

 

followed by the commands

 

pull

export

 

“Pull” pulls the two columns into a file called Mart.txt.  “Export” computes the pairs of linked atoms in a temporary internal data structure and exports them to a file called Paired.txt.
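
For readers who want to see the mechanics, the following is a minimal Python sketch of what “pull” and “export” do.  The delimiter, the 1-indexed column positions, and the pairing rule (two “a” values are linked when they occur with a common “b” value, as described later in this Part) are assumptions for illustration; only the file names Datawh.txt, Mart.txt and Paired.txt come from the exercise.

import csv
from collections import defaultdict
from itertools import combinations

B_COL, A_COL = 4, 3   # "b = 4" (Dport) and "a = 3" (Sport), 1-indexed

# "pull": copy the two chosen columns of Datawh.txt into Mart.txt
with open("Datawh.txt") as src, open("Mart.txt", "w") as mart:
    for row in csv.reader(src):          # assuming comma-delimited records
        mart.write(row[B_COL - 1] + "," + row[A_COL - 1] + "\n")

# "export": two "a" values sharing a common "b" value form a linked pair
a_by_b = defaultdict(set)
with open("Mart.txt") as mart:
    for line in mart:
        b, a = line.rstrip("\n").split(",")
        a_by_b[b].add(a)

with open("Paired.txt", "w") as paired:
    for b, atoms in a_by_b.items():
        for a1, a2 in combinations(sorted(atoms), 2):
            paired.write(a1 + "," + a2 + "\n")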

 

At present there are two Browsers.  The SLIP Technology Browser was developed first, and it initially relied on FoxPro programs to produce four data files.  In early December we completed the SLIP Warehouse Browser.

 

This Exercise reviews how a completed SLIP Framework is produced.  First, unzip filtered.zip into an empty folder.  Then open the Data folder and remove the A1 folder (completely) and the files Conjecture.txt, Mart.txt, and Paired.txt.

 

 

Figure 3: Datawh.txt alone in the Data Folder

 

Once you have used the Warehouse Browser to create the files Mart.txt and Paired.txt, launch the SLIP Technology Browser (SLIP.2.2.2.exe).  The imported files are very large (6 MB of data), so the application may take up to a minute to open.

 

Once it opens, the user needs to type the two commands:

 

Import

Extract

 

“Import” imports pairs from Paired.txt into an in-memory database.  “Extract” extracts atoms from Paired.txt by parsing the text and identifying unique values.  For each value there will be one or more occurrences.  An abstract category is defined that binds all occurrences of a common value into what is then called an “atom”.
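
A sketch of the extract step, under the same assumed file format as the earlier sketch: each unique value found in Paired.txt, together with all of its occurrences, is bound into one atom.

# collect the unique atom values from Paired.txt
atoms = set()
with open("Paired.txt") as paired:
    for line in paired:
        a1, a2 = line.rstrip("\n").split(",")
        atoms.add(a1)
        atoms.add(a2)
print(len(atoms), "atoms")   # 1456 for this data set under (Dport, Sport)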

 

The number of records involved in this Exercise is large.  Under the Analytic Conjecture (b, a) = (Dport, Sport), the initial 120,246 Audit records produce 626,668 pairs of atoms, each pair having one or more link relationships via a “b” value.  The extract command produces 1456 atoms.  Note that 1456 * 1455 = 2,118,480 is the size of the set of all possible ordered pairs: there are 1456 ways to fill the first position of a pair and then 1455 ways to fill the second.

 

Note also that the ratio 626,668/2,118,480 should be looked at carefully.  It seems to indicate the normal operation of a Linux kernel, where no Intrusion Detection System such as RealSecure has pre-selected events as “intrusions”.  However, there is a question about whether there is some redundancy in Paired.txt.  From the theory of distributions, one knows that this count contains redundancy, which should be removed by the Warehouse Browser before the Paired file is written out.

 

This can be done in a number of ways.  Once it is done, the ratio will become more meaningful as a measure of the anomaly in this data set.
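
One simple approach is sketched below, under the same file-format assumptions as before: keep each unordered pair exactly once, then read the ratio of unique pairs to possible pairs as a link density.

# remove redundant pairs and recompute the ratio
unique_pairs = set()
with open("Paired.txt") as paired:
    for line in paired:
        a1, a2 = line.rstrip("\n").split(",")
        unique_pairs.add((a1, a2) if a1 <= a2 else (a2, a1))

n_atoms = 1456
possible = n_atoms * (n_atoms - 1)   # 2,118,480, the ordered-pair count used above
print(len(unique_pairs), "unique pairs; density", len(unique_pairs) / possible)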

 

If your Analytic Conjecture is (Dport, Sport), then clustering the atoms will produce a large cluster and a residue.  Some experimentation will show that this behavior is typical of this data set, no matter which Analytic Conjecture is used and no matter the scale of the data set sampling template (see Part 4).

 

One reason for a spike of this type may simply be the size of the Audit Log: having 120,000 records may connect link relationships into a single event.  But the behavior also appears in random subsets of the data, so the single spike is likely the normal signature of the system.  One has to look into the event.

 

A number of approaches are possible. 


Part 2:  Look into the Residue

 

First, we may look outside the main event to see a number of small events that are completely separated and that have very little (or nothing) to do structurally with the single main event.

  

[panels a and b]

Figure 4: Clustering shows a main event having 1216 of the 1456 atoms and a residue

 

Clustering the residue will produce choices for 5 to 10 small groups of linked atoms (primes).  The user can look for these him/herself.  Type:

 

random

cluster 2000

 

This will randomize the distribution and then iterate the gather process 20,000,000 times.
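
The exact gather rule is internal to the Browser, but the general shape of the iterated scatter-gather process can be sketched in Python.  The placement of atoms on a circle matches the Plots; the specific step rule below (pulling one randomly chosen linked pair toward each other) is an assumption for illustration.

import random

def scatter(atoms):
    # the "random" command: give every atom a random angular position
    return {a: random.uniform(0.0, 360.0) for a in atoms}

def gather(positions, pairs, iterations, step=0.1):
    # the "cluster" command: repeatedly pull one linked pair together
    for _ in range(iterations):
        a1, a2 = random.choice(pairs)                  # pairs is a list read from Paired.txt
        mid = (positions[a1] + positions[a2]) / 2.0    # ignoring wrap-around at 360 for brevity
        positions[a1] += (mid - positions[a1]) * step
        positions[a2] += (mid - positions[a2]) * step
    return positions

Under repetition, linked atoms drift together, which is why a structurally connected group collapses toward a single spike.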

 

Once you have identified a small cluster and parceled it into a category, the Report will show the actual linked relationships.

 

To create a category, select a node and show the Plot.  Use the command “x, y -> name” to take the atoms in the interval (x, y) and put them into the category named “name”.  Select the new node and type “generate”, then click the Report button.
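
Continuing the sketch above, the bracket command can be thought of as a selection over an angular interval.  The function below is illustrative, not the Browser's actual implementation.

def make_category(positions, x, y, name, categories):
    # the bracket command "x, y -> name": collect atoms in the interval (x, y)
    categories[name] = {a for a, angle in positions.items() if x < angle < y}
    return categories[name]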

 

Since the groups are small, one can easily see structure in the Report.  However, the link structure can also be represented as an event graph.  Once in this form, the links themselves carry a time stamp, and thus the evolution of the event can be shown in an animation by changing the colors of the links as the event unfolds.

The event graphs will be an important part of the SLIP Technologies, but we need about six weeks to complete the Event Browser.  For now we take two of the largest events outside of the single main event shown in Figure 4a, show the Report, and show a hand-drawn revision of the event map.  (The data folder for this is called twosmallevents.zip.)

Note that version 2.2.2 requires that one use the key command to key to the atoms column, column 3 in this case.  Type “key 3” before generating Reports.

The generation of Reports is currently a brute-force in-memory process.  For the large SSI data set this takes a few minutes, even for a small Report.

The principles of RIB algorithms are well understood by OSI and the conversion is underway.  Several weeks of work are required to replace the brute-force method with the fast RIB algorithms.  Once this is done, Reports will be generated almost instantaneously.

 

[panels a and b]

Figure 5: The Reports of two small clusters, in twosmallevents.zip

 

We look forward to seeing the features and properties that OSI is putting into the SLIP Event Browser.  The Event Browser takes the next step: the categories of a SLIP Framework can indicate the boundaries and exact nature of an “event” that has occurred.

 

Event Chemistry is an active area of research at several universities.  The results of this research indicate that the event graph can be automatically generated from a SLIP Framework category.

 

The Reports are viewed in the Browser, but they can also be viewed by opening the Reports.txt file in the appropriate folder.

 

[panels a and b]

Figure 6: Two hand-drawn event maps

 

The hand-drawn event maps in Figure 6 are early representatives of what the automated construction of event maps should look like.

 

Note that Figure 6b looks like either a Port Scan or a Trace Route.  However, the nodes are source ports, not destination ports.

 

In Figure 6a the linked nodes <25970, 8231, 29555> and <25970, 8231, 29445> both involve the destination port 8231.  This means that the pairs of source ports (25970, 29555) and (25970, 29445) are both linked by the destination port 8231.

 


Part 3: Look into the main event

 

The files needed by the SLIP Technology Browser are pictured in Figure 7.  The Warehouse Browser requires only Datawh.txt.  The Warehouse Browser creates the files Conjecture.txt, Mart.txt and Paired.txt.

 

 

Figure 7: The files needed by the Technology Browser

 

The use of the Technology Browser creates a folder A1, with subfolders nested within A1.  Child nodes in the SLIP Framework have corresponding folders within the folder of the parent node.

 

From the single file, datawh.txt, one can create the SLIP Framework developed in Figure 8.  In Figure 9 we show five primes, each of which should have an event map.  The development of the event maps for this data will be shown as soon as the Event Browser is completed.

 

The development of the Event Browser is important for a number of reasons.

 

1)      The user community will be able to parse Audit Logs into targeted event types that are somewhat or completely isolated from the background noise.

a.       For example, all port scans that are connected by a structural linkage will appear as a linked circle, such as in Figure 4b.  Colored labels indicate the affected systems.

b.      The individual events will sometimes be weakly linked by links that are “external” to the event, and these external links identify sequences of events.

2)      The automated conversion of the event graph to a Petri net will allow event detection rules to be updated.

3)      The temporal coding of the links comes with the time stamp.  With the use of time stamps we will be able to see the order in which the nodes of the graph have been traversed (see the sketch after this list).

4)      The patterns expressed in the picture of the graph will become event clues that support rapid information retrieval.  “Find things that look like this” will drive information retrieval during trending analysis and even in real-time response analysis.
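
As a sketch of point 3, assume that each link carries the time stamp of the audit record that produced it.  Sorting the links of an event by time stamp then recovers the traversal order, which is exactly what an animation would recolor.  The Python below is illustrative only; the field names are hypothetical and not the Event Browser's actual data structure.

from dataclasses import dataclass

@dataclass
class Link:
    src: str       # one atom of the linked pair
    dst: str       # the other atom
    stamp: float   # time stamp of the underlying audit record

def traversal_order(links):
    # the order in which the nodes of the event graph were traversed
    return sorted(links, key=lambda link: link.stamp)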

 

The development of a set of graphical pictures of event types will fundamentally alter the way incident response and trending analysis occur in computer security organizations such as the government CERTs.

 

In Figure 8, interesting segments of the circle are identified using the bracket indicator.  For example, the command “x, y -> B1” creates the category B1.

 

[panels a through f]

Figure 8: The A and B levels of the SLIP Framework

 

Each of the B-level categories can be inspected for prime structure.  The derivation of these prime structures is an art form.  The identification of a prime has a formal correspondence to the identification of an actual event occurring in the natural world.

 

The ending nodes in Figure 9 are indeed prime.  Look into each of these categories, randomize, and then cluster to check this.  If all atoms move quickly to the same location, then the category is prime.
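
The same test can be sketched as a machine check if one assumes that “all atoms move quickly to the same location” corresponds to the category's link graph being connected.  The Python below tests that connectivity directly; it is a proxy for the Browser's visual check, not its implementation.

from collections import defaultdict

def is_prime_category(atoms, pairs):
    # build the link graph restricted to the category's atoms
    graph = defaultdict(set)
    for a1, a2 in pairs:
        if a1 in atoms and a2 in atoms:
            graph[a1].add(a2)
            graph[a2].add(a1)
    # depth-first search: a single connected component behaves as one prime
    seen, stack = set(), [next(iter(atoms))]
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(graph[a] - seen)
    return seen == set(atoms)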

 

D1 and D2 are derived from C3 and C5 by removing atoms that lie outside of the prime.  This is done by inspection and the use of the bracket command “x, y ->”.

 

 

Figure 9:  The complete SLIP Framework

 

The purpose of obtaining prime structures is related to the search for a means to separate data into events.  The events are then pictured as event graphs.  Animating the temporal sequence of link occurrences produces a visual impression of how the event unfolds.

 

The user is encouraged to work with the data set in twosmallevents.zip or with the much smaller data set filtered.zip.  twosmallevents.zip contains the very large data set discussed in Parts 1 and 2.  The two small events are those represented in the event maps in Figure 6.

 

 

 


Part 4: Filtering the data to produce substructure

 

The data set filtered.zip contains a datawh.txt that is filtered from the 120,246 records: simply taking every 10th record produces 12,024 records.  One should compare Figure 10 with Figure 4.
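
The sampling template itself is easy to reproduce.  A Python sketch, assuming the same Datawh.txt layout as before (the output file name is illustrative):

# keep every 10th record: 120,246 records -> 12,024 records
with open("Datawh.txt") as src, open("Datawh.filtered.txt", "w") as dst:
    for i, line in enumerate(src, start=1):
        if i % 10 == 0:
            dst.write(line)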

 

  

[panels a and b]

Figure 10: The filtered SSI data

 

This data set is very easy to work with because it is small.  Moreover, the production of a record set for import into the Warehouse demonstrates two techniques:

 

1)      Gain a representative set of records, but of a size that makes visualization easy.

2)      Detect rare events even in large data sets without visualization

 

The dynamics of the clustering of A1 are important to note.  First a spike develops.  Then a group gathers together and moves slowly toward the spike.  As it moves, the group and the spike periodically exchange a number of atoms.  The exchange process is captured in Figures 10a and 10b.

 

The data set filtered.zip contains the distributions shown in Figure 10.  The three Reports have an event graph.  This event graph will be the first automatically generated event graph and will be available in the next Exercise.