The SAS Institute Solution for Data Mining

The Enterprise Miner®


Contents



  1. Introduction
  2. In General
  3. The GUI
  4. The SEMMA Process
    4.1. Sampling
    4.2. Exploration
    4.3. Modification
      4.3.1. The Transform Variables Node
      4.3.2. Filter Outliers
      4.3.3. Replacement
      4.3.4. Clustering
    4.4. Modeling
      1. Trees
      2. Neural Networks
      3. Ensemble
    4.5. Assessment
  5. Client/Server Architecture
  6. Classification
  7. Conclusion


1: Introduction

This paper is a brief description of the data mining solution that SAS Institute provides. The comprehensive solution includes four categories of software: the Enterprise Miner®, SAS/Warehouse Administrator®, OLAP, and SAS/IntrNet®. As one would expect, the four components are fully integrated. To accommodate a wide range of users and applications, SAS provides a graphical user interface (GUI) that can be used for mining with little or no change to existing DBMS systems. Here we focus on the Enterprise Miner® and give a general overview of its functions and capabilities.

2: In General

One very important feature of the software is its ability to use existing DBMS systems. Enterprise Miner is capable of accessing data regardless of location or type; corporate databases, the Internet, and intranets are all acceptable sources. SAS Institute has an exclusive architecture, the MultiVendor Architecture (MVA), which makes the underlying platform transparent to operations, so applications are not restricted to one environment or one server. This raises another important feature of the software: portability. SAS applications use the Multiple Engine Architecture (MEA), which translates read, write, and update calls into the appropriate local DBMS format. As one might expect, the user of SAS software is not required to use or even understand SQL.
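
Conceptually, the MEA behaves like an adapter layer between generic data-access calls and the native format of each data source. The Python sketch below is purely illustrative; the class and function names are assumptions made for the example and are not SAS APIs.

    # Illustrative only: hypothetical classes, not SAS APIs. The idea is that a
    # generic call is translated by an engine into the dialect of its data source.

    class Engine:
        """Base class: one engine per DBMS or file format."""
        def read(self, table: str) -> str:
            raise NotImplementedError

    class OracleEngine(Engine):
        def read(self, table: str) -> str:
            return f"SELECT * FROM {table}"      # native SQL dialect

    class FlatFileEngine(Engine):
        def read(self, table: str) -> str:
            return f"open('{table}.csv')"        # no SQL involved at all

    def fetch(engine: Engine, table: str) -> str:
        # The caller issues the same generic call regardless of the source.
        return engine.read(table)

    if __name__ == "__main__":
        print(fetch(OracleEngine(), "customers"))
        print(fetch(FlatFileEngine(), "customers"))

The only point of the sketch is that the caller never writes source-specific code, which is consistent with the claim above that the user need not know SQL.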

3: The GUI

The graphical user interface is user friendly and provides all the functions needed to complete the data mining task. Its design accommodates the business analyst and the statistical expert alike, and all the usual tools found in any modern GUI are present. Some of the features include the following:

4: The SEMMA Process

SAS Institute created this iterative approach to data mining. SEMMA stands for Sampling, Exploration, Modification, Modeling, and Assessment. These make up a set of stand-alone modules that can be followed in their entirety or as a selective subset, according to the nature of the mining task at hand. As we have seen earlier, Nodes are used to specify this combination of modules. SAS software uses the iterative SEMMA process to obtain the best possible results. In some instances, the user may select Nodes that repeatedly explore and plot the data in various ways in an attempt to find anomalies. In another scenario we may be interested in a model of the data, in which case the iterative process tries different modeling techniques to arrive at the desired result.
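
To make the node/flow idea concrete, here is a minimal Python sketch of a SEMMA-style pipeline. The step functions are deliberately trivial stand-ins chosen for illustration; they are not Enterprise Miner nodes.

    import random

    # Trivial stand-ins for SEMMA steps, chained into one flow (illustration only).

    def sample(data, n=8, seed=0):
        random.seed(seed)
        return random.sample(data, min(n, len(data)))

    def explore(data):
        print("min:", min(data), "max:", max(data))   # stand-in for plots and charts
        return data

    def modify(data):
        return [x for x in data if x >= 0]            # e.g. drop invalid values

    def model(data):
        return sum(data) / len(data)                  # a trivial "model": the mean

    def assess(prediction, data):
        return sum((x - prediction) ** 2 for x in data) / len(data)

    if __name__ == "__main__":
        raw = [3, -1, 7, 2, 9, 4, 8, -2, 5, 6]
        prepared = modify(explore(sample(raw)))
        fit = model(prepared)
        print("model:", fit, "error:", assess(fit, prepared))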

4.1: Sampling

The first step is the Input Data Source node. The function of this node is to determine metadata about the data set; by default, it draws a sample of 2,000 observations. Enterprise Miner automatically generates a data characterization (e.g., data generalization and attribute reduction) from this sample and presents the result to the user, who may then change the role each attribute will play in the modeling stage.
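
As a rough illustration of what a metadata pass over a sample might compute, the generic Python sketch below profiles each attribute of a small record set. It is not the Input Data Source node's actual logic; the 2,000-observation constant simply echoes the default mentioned above.

    from collections import Counter

    SAMPLE_SIZE = 2000  # default sample size mentioned above

    # Generic metadata profiling; not the actual Input Data Source node.
    def profile(records, sample_size=SAMPLE_SIZE):
        sample = records[:sample_size]
        meta = {}
        for column in sample[0]:
            values = [row[column] for row in sample]
            if all(isinstance(v, (int, float)) for v in values):
                meta[column] = {"level": "interval", "min": min(values), "max": max(values)}
            else:
                meta[column] = {"level": "nominal", "levels": len(Counter(values))}
        return meta

    if __name__ == "__main__":
        data = [{"age": 34, "region": "N"},
                {"age": 51, "region": "S"},
                {"age": 29, "region": "N"}]
        print(profile(data))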

The second step is data sampling, which is optional. Enterprise Miner provides various sampling methods, including every nth observation, stratified sampling, first n observations, cluster sampling, and simple random sampling, with a user-specified sample size. The goal of this step is to reduce the data to a manageable size while retaining the properties of the original data set.
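
The Python sketch below illustrates three of these schemes (every nth observation, first n observations, and simple random sampling). It is an illustration of the ideas only, not the Enterprise Miner implementation.

    import random

    # Illustration of common sampling schemes; not the Enterprise Miner code.

    def every_nth(records, n):
        return records[::n]

    def first_n(records, n):
        return records[:n]

    def simple_random(records, n, seed=42):
        random.seed(seed)
        return random.sample(records, min(n, len(records)))

    if __name__ == "__main__":
        rows = list(range(1, 101))          # pretend these are 100 observations
        print(every_nth(rows, 10))          # every 10th observation
        print(first_n(rows, 5))             # first 5 observations
        print(simple_random(rows, 5))       # 5 observations chosen at random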

The final step is data partitioning. This module partitions the input data source into training, validation, and test sets for the model to be developed in a later stage of the data mining process.
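
A minimal sketch of such a split is shown below; the 60/20/20 proportions are an assumption chosen for the example, not a SAS default.

    import random

    # Minimal train/validation/test split; 60/20/20 is an illustrative choice.

    def partition(records, train=0.6, valid=0.2, seed=7):
        shuffled = records[:]
        random.seed(seed)
        random.shuffle(shuffled)
        n = len(shuffled)
        n_train = int(n * train)
        n_valid = int(n * valid)
        return (shuffled[:n_train],
                shuffled[n_train:n_train + n_valid],
                shuffled[n_train + n_valid:])

    if __name__ == "__main__":
        train_set, valid_set, test_set = partition(list(range(10)))
        print(len(train_set), len(valid_set), len(test_set))   # 6 2 2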

4.2: Exploration

This module consists of multiple tools. Its main purpose is to provide graphical representations (various kinds of graphs and charts) of the data, which allows the user to perform outlier filtering and data normalization and, in turn, helps reduce the noise level in the data. In addition, the exploration nodes provide advanced visualization techniques, including OLAP operations, which allow the user to fine-tune requirements within the concept hierarchy. Furthermore, this module supports association discovery: the user can specify various parameters for association rules and sequence rules.
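
Typical parameters for association rules are minimum support and minimum confidence. The short Python sketch below computes them for item pairs; it is a generic illustration, not the Enterprise Miner association node.

    from itertools import combinations

    # Generic support/confidence calculation for item pairs;
    # not the Enterprise Miner association node.

    def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
        n = len(transactions)
        items = {i for t in transactions for i in t}
        rules = []
        for a, b in combinations(sorted(items), 2):
            support_ab = sum(1 for t in transactions if a in t and b in t) / n
            support_a = sum(1 for t in transactions if a in t) / n
            if support_a == 0:
                continue
            confidence = support_ab / support_a
            if support_ab >= min_support and confidence >= min_confidence:
                rules.append((a, b, round(support_ab, 2), round(confidence, 2)))
        return rules

    if __name__ == "__main__":
        baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                   {"milk", "eggs"}, {"bread", "milk"}]
        print(pair_rules(baskets))   # e.g. ('bread', 'milk', 0.75, 1.0)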

4.3: Modification

This set of Nodes makes up the data preprocessing phase discussed in class. Below we briefly discuss some of the main features, using the exact names that appear in the actual software.

4.3.1: The Transform Variables Node

This node allows the creation of new variables that combine or correlate some of the existing attributes, which helps in the modeling phase where we try to fit a model via attribute transformation. Using it we can stabilize variance and correct non-linearity in variables. Some of the available transformations include Log, Sqrt, Inv, Sqr, etc. If none of these is suitable, the user is free to enter a user-defined transformation.
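
As a simple illustration of such transformations (a generic sketch, not the SAS node), a log or square-root transform compresses the range of a right-skewed variable:

    import math

    # Generic variable-transformation sketch; not the SAS Transform Variables node.
    TRANSFORMS = {
        "log":  lambda x: math.log(x),
        "sqrt": lambda x: math.sqrt(x),
        "inv":  lambda x: 1.0 / x,
        "sqr":  lambda x: x * x,
    }

    def transform(values, name):
        return [TRANSFORMS[name](v) for v in values]

    if __name__ == "__main__":
        skewed = [1, 2, 4, 8, 16, 256]               # heavily right-skewed
        print([round(v, 2) for v in transform(skewed, "log")])
        print([round(v, 2) for v in transform(skewed, "sqrt")])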

4.3.2: Filter Outliers Node

This Node enables the filtering of outliers, or of any other observations that we wish to exclude from the analysis. The automatic filter option supports this by providing:
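
As a generic illustration of automatic outlier filtering, the sketch below removes observations more than two standard deviations from the mean; the 2-sigma rule is an assumed example, not necessarily one of the node's built-in criteria.

    import statistics

    # Generic outlier filter: keep values within k standard deviations of the mean.
    # The k = 2 default is an illustrative choice, not a documented SAS setting.

    def filter_outliers(values, k=2.0):
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)
        if stdev == 0:
            return values[:]
        return [v for v in values if abs(v - mean) <= k * stdev]

    if __name__ == "__main__":
        observations = [10, 12, 11, 13, 9, 10, 11, 150]   # 150 is an obvious outlier
        print(filter_outliers(observations))               # 150 is removed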

4.3.3: Replacement Node

This is another important feature of Enterprise Miner. As we will see in the next module, any tuple with a missing field will be rejected during modeling. Since the dismissal of such tuples could result in a biased sample and inaccurate results, SAS presents two options. The first is handled in modeling, where the Tree Node can be used to organize these tuples and then generate a set of surrogate rules. The other is to use a statistical measure, in this module, to "fill in" the missing value. The replacement statistic can be any of the usual functions or user defined, and the results of the sampling process, via the metadata file, can be used to determine the "best" replacement statistic.
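
A minimal replacement sketch is shown below; the choice of the mean (or median) as the replacement statistic is an assumption made for the example.

    import statistics

    # Generic missing-value replacement; the mean is one of the usual statistics
    # and is used here purely for illustration.

    def impute(values, statistic=statistics.mean):
        observed = [v for v in values if v is not None]
        fill = statistic(observed)
        return [fill if v is None else v for v in values]

    if __name__ == "__main__":
        income = [42_000, None, 55_000, 61_000, None, 48_000]
        print(impute(income))                          # None replaced by the mean
        print(impute(income, statistics.median))       # or any other statistic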

4.3.4: Clustering Node

As we know, clustering is an important data mining function in which data is organized into classes. Enterprise Miner's clustering methods are based on Euclidean distance: the algorithm maintains and updates a set of seeds that are used to calculate the distance to each data observation. Upon completion, the results browser can be used to graphically examine the clusters so that a characterization can be obtained. Within the GUI the user has the flexibility to control the maximum number of clusters and, given that number, can also set the criterion by which the seeds are generated and updated.
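
The seed-based, Euclidean-distance approach described above is essentially k-means clustering. The sketch below is a bare-bones version for two-dimensional points, meant to illustrate the technique rather than reproduce the SAS implementation.

    import math

    # Bare-bones k-means on 2-D points; illustrates seed-based Euclidean clustering,
    # not the actual Enterprise Miner algorithm.

    def distance(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def kmeans(points, seeds, iterations=10):
        for _ in range(iterations):
            clusters = [[] for _ in seeds]
            for p in points:
                nearest = min(range(len(seeds)), key=lambda i: distance(p, seeds[i]))
                clusters[nearest].append(p)
            # Update each seed to the mean of its assigned points.
            seeds = [
                (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else s
                for s, c in zip(seeds, clusters)
            ]
        return seeds, clusters

    if __name__ == "__main__":
        data = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
        centers, groups = kmeans(data, seeds=[(0, 0), (10, 10)])
        print(centers)
        print(groups)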

4.4: Modeling

In this module data miners arrive at suitable models for their data. The GUI has a Model Manager that enables the user to save and retrieve different models, which can be interactively modified to arrive at the final model. In this phase the Model Manager can only compare results obtained by the model under consideration; cross-model comparisons are performed in the assessment phase. The method by which any selected model is created can be specified through the GUI: the backward method builds the model from the set of all inputs and discards irrelevant inputs as the algorithm runs, while the opposite method, forward selection, is also supported.
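
As a rough sketch of forward selection (an illustration of the general technique, not the SAS procedure), the code below adds one input at a time, each time regressing the remaining error on the candidates and stopping when no candidate improves the fit appreciably.

    # Greedy forward selection sketch: at each step, regress the current residual
    # on each remaining input and keep the input that reduces the error the most.
    # Illustrates the general idea only, not the SAS modeling procedure.

    def simple_fit(x, y):
        """Least-squares slope and intercept for a single input."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx if sxx else 0.0
        return slope, my - slope * mx

    def sse(x, y, slope, intercept):
        return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

    def forward_select(inputs, target, min_gain=0.05):
        residual, selected, remaining = target[:], [], dict(inputs)
        while remaining:
            best = min(remaining, key=lambda name: sse(remaining[name], residual,
                                                       *simple_fit(remaining[name], residual)))
            x = remaining[best]
            slope, intercept = simple_fit(x, residual)
            current = sum(r * r for r in residual)
            if current - sse(x, residual, slope, intercept) < min_gain * current:
                break                                   # no worthwhile improvement left
            residual = [r - (slope * xi + intercept) for r, xi in zip(residual, x)]
            selected.append(best)
            del remaining[best]
        return selected

    if __name__ == "__main__":
        x1, x2 = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
        noise = [0.1, -0.2, 0.0, 0.2, -0.1]             # irrelevant input
        y = [2 * a + b for a, b in zip(x1, x2)]
        print(forward_select({"x1": x1, "x2": x2, "noise": noise}, y))   # ['x1', 'x2']

The backward variant would start from the full set of inputs and drop the least useful one at each step.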

The models provided by Enterprise Miner are based on decision trees, neural networks, and ensembles of models.

4.5: Assessment

This module is capable of examining all the predictions and data models produced thus far with Enterprise Miner. Various charts can be generated to cross-compare models and to study the relation between predicted and actual values. Here are some examples of the tools available:
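
One basic way to study predicted versus actual values across models is an error summary. The sketch below compares two hypothetical candidate models on the same data; it is a generic illustration, not an Enterprise Miner chart.

    # Generic predicted-vs-actual assessment; compares candidate models by
    # mean absolute error and mean squared error. Illustration only.

    def errors(actual, predicted):
        residuals = [a - p for a, p in zip(actual, predicted)]
        mae = sum(abs(r) for r in residuals) / len(residuals)
        mse = sum(r * r for r in residuals) / len(residuals)
        return mae, mse

    if __name__ == "__main__":
        actual = [10, 12, 15, 11, 14]
        candidates = {
            "model_a": [9, 13, 14, 12, 15],
            "model_b": [11, 10, 17, 10, 12],
        }
        for name, predicted in candidates.items():
            mae, mse = errors(actual, predicted)
            print(f"{name}: MAE={mae:.2f}  MSE={mse:.2f}")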

5: Client/Server Architecture

As discussed earlier, Enterprise Miner supports a client/server architecture. When a new project is initiated, one can select the client/server option to give a locally running Enterprise Miner process access to various types of DBMS on a server. This means that data processing can be done on the source machine, thus avoiding heavy network traffic. Another important advantage is data centralization, which in turn reduces data redundancy.

6: Classification

After this brief discussion one may ask: to which class of data mining systems does Enterprise Miner belong? As discussed by Dr. Zaiane, such a classification can be made using one of the following criteria.

  1. The type of the data source.
  2. The data model used.
  3. The discovery type.
  4. Mining techniques used.

Due to its versatility, it would be difficult to classify Enterprise Miner under just one of the above; it is more easily classified under all of them. 1) With the MVA architecture, various types of data can be imported across platforms. 2) There are many data models to choose from. 3) The type of results discovered depends on 1) and 2). 4) Finally, the SEMMA approach to data mining is flexible and can be used in its entirety or as stand-alone modules.

7: Conclusion

The Enterprise Miner is a powerful, portable, scalable, and flexible package. Its open architecture makes it easy for a corporation to adopt the software and start mining immediately, whether the existing computer systems are PC based or a large client/server UNIX environment. With the SEMMA methodology, the mining process can be tailored to fit a terabyte operation or a small data warehouse. The price of Enterprise Miner is quite reasonable compared to others on the market.



