The SAS Institute Solution for Data Mining

The Enterprise Miner®


Contents



  1. Introduction
  2. In General
  3. The GUI
  4. The SEMMA Process
    4.1. Sampling
    4.2. Exploration
    4.3. Modification
      4.3.1. The Transform Variables Node
      4.3.2. Filter Outliers
      4.3.3. Replacement
      4.3.4. Clustering
    4.4. Modeling
      1. Trees
      2. Neural Networks
      3. Ensemble
    4.5. Assessment
  5. Client/Server Architecture
  6. Classification
  7. Conclusion


1: Introduction

This paper is a brief description of the data mining solution that SAS Institute provides. The comprehensive solution includes four categories of software: the Enterprise Miner®, SAS/Warehouse Administrator®, OLAP, and SAS/IntrNet®. As one would expect, the four components are fully integrated. To accommodate a wide range of users and applications, SAS provides a graphical user interface (GUI) that can be used for mining with little or no change to existing DBMS systems. Here we focus on the Enterprise Miner® and give a general overview of its functions and capabilities.

2: In General

One very important feature of the software is its ability to use existing DBMS systems. Enterprise Miner is capable of accessing data regardless of location or type; corporate databases, the Internet, and intranets are all acceptable sources. SAS Institute has an exclusive architecture, the MultiVendor Architecture (MVA), which makes the underlying platform transparent to operations, so applications are not restricted to one environment or one server. This raises another important feature of the software: portability. SAS applications use the Multiple Engine Architecture (MEA), which translates read, write, and update calls into the appropriate local DBMS format. As one might expect, the user of SAS software is not required to use or even understand SQL.
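
Conceptually, the MEA behaves like an adapter layer between generic data-access calls and the native format of each data source. The Python sketch below is purely illustrative; the class and function names are assumptions made for the example and are not SAS APIs.

    # Illustrative only: hypothetical classes, not SAS APIs. The idea is that a
    # generic call is translated by an engine into the dialect of its data source.

    class Engine:
        """Base class: one engine per DBMS or file format."""
        def read(self, table: str) -> str:
            raise NotImplementedError

    class OracleEngine(Engine):
        def read(self, table: str) -> str:
            return f"SELECT * FROM {table}"      # native SQL dialect

    class FlatFileEngine(Engine):
        def read(self, table: str) -> str:
            return f"open('{table}.csv')"        # no SQL involved at all

    def fetch(engine: Engine, table: str) -> str:
        # The caller issues the same generic call regardless of the source.
        return engine.read(table)

    if __name__ == "__main__":
        print(fetch(OracleEngine(), "customers"))
        print(fetch(FlatFileEngine(), "customers"))

The only point of the sketch is that the caller never writes source-specific code, which is consistent with the claim above that the user need not know SQL.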

3: The GUI

The graphical user interface is user friendly and provides all the functions needed to complete the data mining task. Its design accommodates the business analyst and the statistical expert alike, and all the usual tools found in any modern GUI are present. Some of the features include the following:

4: The SEMMA Process

SAS Institute created this iterative approach to data mining. SEMMA stands for Sampling, Exploration, Modification, Modeling, and Assessment. These make up a set of stand-alone modules that can be followed in their entirety or as a selective subset, according to the nature of the mining task at hand. As we have seen earlier, Nodes are used to specify this combination of modules. SAS software uses the iterative SEMMA process to obtain the best possible results. In some instances, the user may select Nodes that repeatedly explore and plot the data in various ways in an attempt to find anomalies. In another scenario we may be interested in a model of the data, in which case the iterative process tries different modeling techniques to arrive at the desired result.
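
To make the node/flow idea concrete, here is a minimal Python sketch of a SEMMA-style pipeline. The step functions are deliberately trivial stand-ins chosen for illustration; they are not Enterprise Miner nodes.

    import random

    # Trivial stand-ins for SEMMA steps, chained into one flow (illustration only).

    def sample(data, n=8, seed=0):
        random.seed(seed)
        return random.sample(data, min(n, len(data)))

    def explore(data):
        print("min:", min(data), "max:", max(data))   # stand-in for plots and charts
        return data

    def modify(data):
        return [x for x in data if x >= 0]            # e.g. drop invalid values

    def model(data):
        return sum(data) / len(data)                  # a trivial "model": the mean

    def assess(prediction, data):
        return sum((x - prediction) ** 2 for x in data) / len(data)

    if __name__ == "__main__":
        raw = [3, -1, 7, 2, 9, 4, 8, -2, 5, 6]
        prepared = modify(explore(sample(raw)))
        fit = model(prepared)
        print("model:", fit, "error:", assess(fit, prepared))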

4.1: Sampling

The first step is the Input Data Source node. The function of this node is to determine metadata about the data set; by default, it draws a sample of 2,000 observations. Enterprise Miner automatically generates a data characterization (e.g., data generalization and attribute reduction) from this sample and presents the result to the user, who may then change the role each attribute will play in the modeling stage.
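
As a rough illustration of what a metadata pass over a sample might compute, the generic Python sketch below profiles each attribute of a small record set. It is not the Input Data Source node's actual logic; the 2,000-observation constant simply echoes the default mentioned above.

    from collections import Counter

    SAMPLE_SIZE = 2000  # default sample size mentioned above

    # Generic metadata profiling; not the actual Input Data Source node.
    def profile(records, sample_size=SAMPLE_SIZE):
        sample = records[:sample_size]
        meta = {}
        for column in sample[0]:
            values = [row[column] for row in sample]
            if all(isinstance(v, (int, float)) for v in values):
                meta[column] = {"level": "interval", "min": min(values), "max": max(values)}
            else:
                meta[column] = {"level": "nominal", "levels": len(Counter(values))}
        return meta

    if __name__ == "__main__":
        data = [{"age": 34, "region": "N"},
                {"age": 51, "region": "S"},
                {"age": 29, "region": "N"}]
        print(profile(data))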

The second step is data sampling, which is optional. Enterprise Miner provides various sampling methods, including every nth observation, stratified sampling, first n observations, cluster sampling, and simple random sampling, with a user-specified sample size. The goal of this step is to reduce the data to a manageable size while retaining the properties of the original data set.
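
The Python sketch below illustrates three of these schemes (every nth observation, first n observations, and simple random sampling). It is an illustration of the ideas only, not the Enterprise Miner implementation.

    import random

    # Illustration of common sampling schemes; not the Enterprise Miner code.

    def every_nth(records, n):
        return records[::n]

    def first_n(records, n):
        return records[:n]

    def simple_random(records, n, seed=42):
        random.seed(seed)
        return random.sample(records, min(n, len(records)))

    if __name__ == "__main__":
        rows = list(range(1, 101))          # pretend these are 100 observations
        print(every_nth(rows, 10))          # every 10th observation
        print(first_n(rows, 5))             # first 5 observations
        print(simple_random(rows, 5))       # 5 observations chosen at random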

The final step is data partitioning. This module partitions the input data source into training, validation, and test sets for the model to be developed in a later stage of the data mining process.
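
A minimal sketch of such a split is shown below; the 60/20/20 proportions are an assumption chosen for the example, not a SAS default.

    import random

    # Minimal train/validation/test split; 60/20/20 is an illustrative choice.

    def partition(records, train=0.6, valid=0.2, seed=7):
        shuffled = records[:]
        random.seed(seed)
        random.shuffle(shuffled)
        n = len(shuffled)
        n_train = int(n * train)
        n_valid = int(n * valid)
        return (shuffled[:n_train],
                shuffled[n_train:n_train + n_valid],
                shuffled[n_train + n_valid:])

    if __name__ == "__main__":
        train_set, valid_set, test_set = partition(list(range(10)))
        print(len(train_set), len(valid_set), len(test_set))   # 6 2 2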

4.2: Exploration

This module consists of multiple tools. Its main purpose is to provide graphical representations (various kinds of graphs and charts) of the data, which allows the user to perform outlier filtering and data normalization and, in turn, helps reduce the noise level in the data. In addition, the exploration nodes provide advanced visualization techniques, including OLAP operations, which allow the user to fine-tune requirements within the concept hierarchy. Furthermore, this module supports association discovery: the user can specify various parameters for association rules and sequence rules.
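
Typical parameters for association rules are minimum support and minimum confidence. The short Python sketch below computes them for item pairs; it is a generic illustration, not the Enterprise Miner association node.

    from itertools import combinations

    # Generic support/confidence calculation for item pairs;
    # not the Enterprise Miner association node.

    def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
        n = len(transactions)
        items = {i for t in transactions for i in t}
        rules = []
        for a, b in combinations(sorted(items), 2):
            support_ab = sum(1 for t in transactions if a in t and b in t) / n
            support_a = sum(1 for t in transactions if a in t) / n
            if support_a == 0:
                continue
            confidence = support_ab / support_a
            if support_ab >= min_support and confidence >= min_confidence:
                rules.append((a, b, round(support_ab, 2), round(confidence, 2)))
        return rules

    if __name__ == "__main__":
        baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                   {"milk", "eggs"}, {"bread", "milk"}]
        print(pair_rules(baskets))   # e.g. ('bread', 'milk', 0.75, 1.0)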

4.3: Modification

This set of Nodes makes up the data preprocessing phase discussed in class. Below we briefly discuss some of the main features, using the exact names that appear in the actual software.

4.3.1: The Transform Variables Node

This node allows the creation of new variables that combine or correlate some of the existing attributes, which helps in the modeling phase where we try to fit a model via attribute transformation. Using it we can stabilize variance and correct non-linearity in variables. Some of the available transformations include Log, Sqrt, Inv, Sqr, etc. If none of these is suitable, the user is free to enter a user-defined transformation.
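
As a simple illustration of such transformations (a generic sketch, not the SAS node), a log or square-root transform compresses the range of a right-skewed variable:

    import math

    # Generic variable-transformation sketch; not the SAS Transform Variables node.
    TRANSFORMS = {
        "log":  lambda x: math.log(x),
        "sqrt": lambda x: math.sqrt(x),
        "inv":  lambda x: 1.0 / x,
        "sqr":  lambda x: x * x,
    }

    def transform(values, name):
        return [TRANSFORMS[name](v) for v in values]

    if __name__ == "__main__":
        skewed = [1, 2, 4, 8, 16, 256]               # heavily right-skewed
        print([round(v, 2) for v in transform(skewed, "log")])
        print([round(v, 2) for v in transform(skewed, "sqrt")])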

4.3.2: Filter Outliers Node

This Node enables the filtering of outliers, or of any other observations that we wish to exclude from the analysis. The automatic filter option supports this by providing:
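
As a generic illustration of automatic outlier filtering, the sketch below removes observations more than two standard deviations from the mean; the 2-sigma rule is an assumed example, not necessarily one of the node's built-in criteria.

    import statistics

    # Generic outlier filter: keep values within k standard deviations of the mean.
    # The k = 2 default is an illustrative choice, not a documented SAS setting.

    def filter_outliers(values, k=2.0):
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)
        if stdev == 0:
            return values[:]
        return [v for v in values if abs(v - mean) <= k * stdev]

    if __name__ == "__main__":
        observations = [10, 12, 11, 13, 9, 10, 11, 150]   # 150 is an obvious outlier
        print(filter_outliers(observations))               # 150 is removed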

4.3.3: Replacement Node

This is another important feature of Enterprise Miner. As we will see in the next module, any tuple with a missing field will be rejected during modeling. Since the dismissal of such tuples could result in a biased sample and inaccurate results, SAS presents two options. The first is handled in modeling, where the Tree Node can be used to organize these tuples and then generate a set of surrogate rules. The other is to use a statistical measure, in this module, to "fill in" the missing value. The replacement statistic can be any of the usual functions or user defined, and the results of the sampling process, via the metadata file, can be used to determine the "best" replacement statistic.
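
A minimal replacement sketch is shown below; the choice of the mean (or median) as the replacement statistic is an assumption made for the example.

    import statistics

    # Generic missing-value replacement; the mean is one of the usual statistics
    # and is used here purely for illustration.

    def impute(values, statistic=statistics.mean):
        observed = [v for v in values if v is not None]
        fill = statistic(observed)
        return [fill if v is None else v for v in values]

    if __name__ == "__main__":
        income = [42_000, None, 55_000, 61_000, None, 48_000]
        print(impute(income))                          # None replaced by the mean
        print(impute(income, statistics.median))       # or any other statistic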

4.3.4: Clustering Node

As we know, clustering is an important data mining function in which data is organized into classes. Enterprise Miner's clustering methods are based on Euclidean distance: the algorithm maintains and updates a set of seeds that are used to calculate the distance to each data observation. Upon completion, the results browser can be used to graphically examine the clusters so that a characterization can be obtained. Within the GUI the user has the flexibility to control the maximum number of clusters and, given that number, can also set the criterion by which the seeds are generated and updated.
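
The seed-based, Euclidean-distance approach described above is essentially k-means clustering. The sketch below is a bare-bones version for two-dimensional points, meant to illustrate the technique rather than reproduce the SAS implementation.

    import math

    # Bare-bones k-means on 2-D points; illustrates seed-based Euclidean clustering,
    # not the actual Enterprise Miner algorithm.

    def distance(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def kmeans(points, seeds, iterations=10):
        for _ in range(iterations):
            clusters = [[] for _ in seeds]
            for p in points:
                nearest = min(range(len(seeds)), key=lambda i: distance(p, seeds[i]))
                clusters[nearest].append(p)
            # Update each seed to the mean of its assigned points.
            seeds = [
                (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else s
                for s, c in zip(seeds, clusters)
            ]
        return seeds, clusters

    if __name__ == "__main__":
        data = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
        centers, groups = kmeans(data, seeds=[(0, 0), (10, 10)])
        print(centers)
        print(groups)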

4.4: Modeling

In this module data miners arrive at suitable models for their data. The GUI has a Model Manager that enables the user to save and retrieve different models, which can be interactively modified to arrive at the final model. In this phase the Model Manager can only compare results obtained by the model under consideration; cross-model comparisons are performed in the assessment phase. The method by which any selected model is created can be specified through the GUI: the backward method builds the model from the set of all inputs and discards irrelevant inputs as the algorithm runs, while the opposite method, forward selection, is also supported.
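
As a rough sketch of forward selection (an illustration of the general technique, not the SAS procedure), the code below adds one input at a time, each time regressing the remaining error on the candidates and stopping when no candidate improves the fit appreciably.

    # Greedy forward selection sketch: at each step, regress the current residual
    # on each remaining input and keep the input that reduces the error the most.
    # Illustrates the general idea only, not the SAS modeling procedure.

    def simple_fit(x, y):
        """Least-squares slope and intercept for a single input."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
        return slope, my - slope * mx

    def sse(x, y, slope, intercept):
        return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

    def forward_select(inputs, target, min_gain=0.05):
        residual, selected, remaining = target[:], [], dict(inputs)
        while remaining:
            best = min(remaining, key=lambda name: sse(remaining[name], residual,
                                                       *simple_fit(remaining[name], residual)))
            x = remaining[best]
            slope, intercept = simple_fit(x, residual)
            current = sum(r * r for r in residual)
            if current - sse(x, residual, slope, intercept) < min_gain * current:
                break                                   # no worthwhile improvement left
            residual = [r - (slope * xi + intercept) for r, xi in zip(residual, x)]
            selected.append(best)
            del remaining[best]
        return selected

    if __name__ == "__main__":
        x1, x2 = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
        noise = [0.1, -0.2, 0.0, 0.2, -0.1]             # irrelevant input
        y = [2 * a + b for a, b in zip(x1, x2)]
        print(forward_select({"x1": x1, "x2": x2, "noise": noise}, y))   # ['x1', 'x2']

The backward variant would start from the full set of inputs and drop the least useful one at each step.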

The models provided by Enterprise Miner are based on decision trees, neural networks, and ensembles of models.

4.5: Assessment

This module is capable of examining all the predictions and data models produced thus far with Enterprise Miner. Various charts can be generated to cross-compare models and to study the relation between predicted and actual values. Here are some examples of the tools available:
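
One basic way to study predicted versus actual values across models is an error summary. The sketch below compares two hypothetical candidate models on the same data; it is a generic illustration, not an Enterprise Miner chart.

    # Generic predicted-vs-actual assessment; compares candidate models by
    # mean absolute error and mean squared error. Illustration only.

    def errors(actual, predicted):
        residuals = [a - p for a, p in zip(actual, predicted)]
        mae = sum(abs(r) for r in residuals) / len(residuals)
        mse = sum(r * r for r in residuals) / len(residuals)
        return mae, mse

    if __name__ == "__main__":
        actual = [10, 12, 15, 11, 14]
        candidates = {
            "model_a": [9, 13, 14, 12, 15],
            "model_b": [11, 10, 17, 10, 12],
        }
        for name, predicted in candidates.items():
            mae, mse = errors(actual, predicted)
            print(f"{name}: MAE={mae:.2f}  MSE={mse:.2f}")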

5: Client/Server Architecture

As discussed earlier, Enterprise Miner supports a client/server architecture. When a new project is initiated, one can select the client/server option to give a locally running Enterprise Miner process access to various types of DBMS on a server. This means that data processing can be done on the source machine, thus avoiding heavy network traffic. Another important advantage is data centralization, which in turn reduces data redundancy.

6: Classification

After this brief discussion one may ask: to which class of data mining systems does Enterprise Miner belong? As discussed by Dr. Zaiane, such a classification can be made using one of the following criteria.

  1. The type of the data source.
  2. The data model used.
  3. The discovery type.
  4. Mining techniques used.

Due to its versatility, it would be difficult to classify Enterprise Miner under just one of the above; it is more easily classified under all of them. 1) With the MVA architecture, various types of data can be imported across platforms. 2) There are many data models to choose from. 3) The type of results discovered depends on 1) and 2). 4) Finally, the SEMMA approach to data mining is flexible and can be used in its entirety or as stand-alone modules.

7: Conclusion

The Enterprise Miner is a powerful, portable, scalable, and flexible package. Its open architecture makes it easy for a corporation to adopt the software and start mining immediately, whether the existing computer systems are PC based or a large client/server UNIX environment. With the SEMMA methodology, the mining process can be tailored to fit a terabyte operation or a small data warehouse. The price of Enterprise Miner is quite reasonable compared to others on the market.



