CMPUT 690: KDD Principles


©1999 Osmar R. Zaïane
(zaiane@cs.ualberta.ca)

                 

Glossary of Data Mining Terms



- A -

Analytical Model: A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Anomalous Data: Data that result from errors (for example, data entry keying errors) or that represent unusual events. Anomalous data should be examined carefully because they may carry important information.

Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.


- B -


- C -

CART: Classification and Regression Trees. A decision tree technique used for classification of a dataset. Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. Segments a dataset by creating 2-way splits. Requires less data preparation than CHAID.

CHAID: Chi-Square Automatic Interaction Detection. A decision tree technique used for classification of a dataset. Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. Segments a dataset by using chi-square tests to create multi-way splits. CHAID preceded CART and requires more data preparation.

Children: Members of a dimension that are included in a calculation to produce a consolidated total for a parent member. Children may themselves be consolidated levels, which requires that they have children. A member may be a child for more than one parent, and a child's multiple parents may not necessarily be at the same hierarchical level, thereby allowing complex, multiple hierarchical aggregations within any dimension.

Cell: A single data point that occurs at the intersection defined by selecting one member from each dimension in a multi-dimensional array. For example, if the dimensions are measures, time, product and geography, then the dimension members Sales, January 1996, Bicycle and Canada specify a precise intersection along all dimensions that uniquely identifies a single data cell, which contains the value of bicycle sales in Canada for the month of January 1996.
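As a minimal sketch of the idea, a cube can be modelled as a Python dictionary keyed by one member per dimension. The `cube` contents, the sales figures and the `get_cell` helper are hypothetical and only illustrate how a member combination identifies one cell.

```python
# Hypothetical cube: each key holds one member per dimension
# (measure, time, product, geography). The values are made-up numbers.
cube = {
    ("Sales", "January 1996", "Bicycle", "Canada"): 1200.0,
    ("Sales", "January 1996", "Bicycle", "Mexico"): 450.0,
    ("Sales", "February 1996", "Bicycle", "Canada"): 980.0,
}


def get_cell(cube, measure, time, product, geography):
    """Return the value at one member combination, or None if the cell is missing."""
    return cube.get((measure, time, product, geography))


# One member from each dimension identifies exactly one cell.
bicycle_sales = get_cell(cube, "Sales", "January 1996", "Bicycle", "Canada")

# A combination that was never entered has no cell (missing data, not zero).
missing = get_cell(cube, "Sales", "January 1996", "Snowmobile", "Mexico")
```

Returning `None` rather than 0 keeps a missing cell distinct from a cell whose value is zero.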

Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to specific variable(s) you are trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad."

Clustering: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to all available variables.

Concept Hierarchy: A concept hierarchy defines a drilling path within a dimension.

Cube: See Data Cube.


- D -

Data Cleansing: Also Data Cleaning. The process of ensuring that all values in a dataset are consistent and correctly recorded by removing redundancies and inconsistencies in data.

Data Mart: A small, single-subject warehouse used by individual departments or groups of users.

Data Cube: Also Cube, Hypercube, Multi-dimensional Array, Multi-dimensional Database. A multi-dimensional data structure: a group of data cells arranged by the dimensions of the data. For example, a spreadsheet exemplifies a two-dimensional array with the data cells arranged in rows and columns, each being a dimension. A three-dimensional array can be visualized as a cube with each dimension forming a side of the cube, including any slice parallel with that side. Higher dimensional arrays have no physical metaphor, but they organize the data in the way users think of their enterprise. Typical enterprise dimensions are time, measures, products, geographical location, sales channels, etc. It is not rare to see more than 20 dimensions. However, the more dimensions a cube has, the more complex manipulation and data mining on the cube become, and the sparser the cube may become.

Data Mining: The extraction of hidden predictive information, patterns and correlations from large databases.

Data Navigation: The process of viewing different dimensions, slices, and levels of detail of a multidimensional database. See OLAP.

Data Steward: A new role of data caretaker emerging in business units. The individual takes responsibility for data content and quality.

Data Visualization: The visual interpretation of complex relationships in multidimensional data.

Data Warehouse: A system for storing and delivering massive quantities of data.

Decision Tree: A tree-shaped structure that represents a set of decisions. These decisions generate rules for the classification of a dataset. See CART and CHAID.

Dense: A multi-dimensional database is dense if a relatively high percentage of the possible combinations of its dimension members contain data values. This is the opposite of sparse.

Dice: An OLAP function. The dice operation is a slice on more than two dimensions of a data cube (or more than two consecutive slices).

Dimension: In a flat or relational database, each field in a record represents a dimension. In a multidimensional database, a dimension is a set of similar entities; for example, a multidimensional sales database might include the dimensions Product, Time, and City.

Drill-down: An OLAP function. It is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down) along a concept hierarchy. For example, when viewing the data for North America, a drill-down operation in the Location dimension would display Canada, the United States and Mexico. A drill-down on Canada would display provinces or groups of provinces such as Maritimes, Quebec, Ontario, Prairies and British Columbia. A further drill-down on Prairies would display data for Saskatchewan, Manitoba and Alberta, or cities such as Edmonton, Calgary, Winnipeg, Regina, etc.
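The Location example above can be sketched as a small concept hierarchy stored in a dictionary. The `hierarchy`, `drill_down` and `drill_path` names are hypothetical, and only part of the hierarchy from the example is filled in.

```python
# Partial concept hierarchy for the Location dimension, following the
# example in the Drill-down entry. Each key maps to its direct children.
hierarchy = {
    "North America": ["Canada", "United States", "Mexico"],
    "Canada": ["Maritimes", "Quebec", "Ontario", "Prairies", "British Columbia"],
    "Prairies": ["Saskatchewan", "Manitoba", "Alberta"],
}


def drill_down(member):
    """Return the members one level below `member` (empty list at a leaf)."""
    return hierarchy.get(member, [])


def drill_path(member, depth=0):
    """List `member` and all of its descendants, indented by hierarchy level."""
    lines = ["  " * depth + member]
    for child in drill_down(member):
        lines.extend(drill_path(child, depth + 1))
    return lines


print("\n".join(drill_path("Canada")))
```

Each call to `drill_down` moves exactly one level toward the detailed end of the hierarchy; repeating it on a returned member continues the navigation.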

Drill-through: An OLAP function. It is a specific operation whereby the user views the raw data pertaining to a high level concept from a concept hierarchy of a given dimension.

Drill-up: An OLAP function. See Roll-Up (opposite of Drill-Down).


- E -

Exploratory Data Analysis: The use of graphical and descriptive statistical techniques to learn about the structure of a dataset.


- F -


- G -

Genetic Algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.

Gigabyte: One billion bytes.


- H -

Hypercube: Also Hyper-Cube. See Data Cube.


- I -


- J -


- K -


- L -

Linear Model: An analytical model that assumes linear relationships in the coefficients of the variables being studied.

Linear Regression: A statistical technique used to find the best-fitting linear relationship between a target (dependent) variable and its predictors (independent variables).
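For a single predictor, "best-fitting" is commonly taken to mean ordinary least squares. The sketch below assumes that criterion; the `fit_line` helper and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)
```

With several predictors the same criterion applies, but the fit is usually computed with matrix methods rather than these closed-form sums.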

Logistic Regression: A regression technique, closely related to linear regression, that predicts the proportions of a categorical target variable, such as type of customer, in a population.


- M -

Member: Also Concept, Position, Item, Attribute. A dimension member is a discrete name or identifier used to identify a data item's position and description within a dimension. For example, January 1989 and 1Qtr93 are typical members of a Time dimension; Wholesale, Retail, etc., are typical members of a Distribution Channel dimension.

Member Combination: A member combination is an exact description of a unique cell in a multi-dimensional array, consisting of a specific member selection in each dimension of the array.

Missing Data / Missing Value: A special data item which indicates that the data in this cell does not exist. This may be because the member combination is not meaningful (e.g., snowmobiles may not be sold in Miami) or has never been entered. Missing data is similar to a null value or N/A, but is not the same as a zero value.

Multi-Dimensional Array: See Data Cube.

Multidimensional Database: A database designed for on-line analytical processing. Structured as a multidimensional hypercube with one axis per dimension.

Multi-Dimensional Query Language: A computer language that allows one to specify which data to retrieve out of a cube. The user process for this type of query is usually called slicing and dicing. The result of a multi-dimensional query is either a cell, a two-dimensional slice, or a multi-dimensional sub-cube.

Multiprocessor Computer: A computer that includes multiple processors connected by a network. See parallel processing.


- N -

Nearest Neighbour: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k is greater than or equal to 1). Sometimes called a k-nearest neighbour technique.
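A minimal sketch of the technique follows. It assumes Euclidean distance as the similarity measure and a majority vote as the "combination" of classes; both choices, the `knn_classify` helper and the sample records are illustrative assumptions.

```python
import math
from collections import Counter


def knn_classify(history, record, k=3):
    """Classify `record` by majority vote among its k nearest historical records.

    `history` is a list of (features, label) pairs. Similarity is measured
    with Euclidean distance, which is an assumption of this sketch.
    """
    nearest = sorted(history, key=lambda item: math.dist(item[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up historical dataset with two numeric features per record.
history = [
    ((1.0, 1.0), "Good"),
    ((1.2, 0.8), "Good"),
    ((0.9, 1.1), "Good"),
    ((5.0, 5.0), "Bad"),
    ((5.2, 4.9), "Bad"),
]

label = knn_classify(history, (1.1, 1.0), k=3)
```

Choosing an odd k for a two-class problem avoids tied votes.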

Non-Linear Predictive Models: Analytical models that do not assume linear relationships in the coefficients of the variables being studied.


- O -

OLAP: On-Line Analytical Processing. Refers to array-oriented database applications that enable users (analysts, managers and executives) to view, navigate through, manipulate, and analyze multidimensional databases. With OLAP software, users gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. Examples of OLAP operations are: Drill-down, Drill-through, Roll-up, Slice, Dice, Pivot, etc.

OLAP Client: End user applications that can request slices from OLAP servers and provide two-dimensional or multi-dimensional displays, user modifications, selections, ranking, calculations, etc., for visualization and navigation purposes. OLAP clients may be as simple as a spreadsheet program retrieving a slice for further work by a spreadsheet-literate user or as high-functioned as a financial modeling or sales analysis application.

OLAP Server: A high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures (or databases).

Outlier: A data item whose value falls outside the bounds enclosing most of the other corresponding values in the sample. May indicate anomalous data. Should be examined carefully; may carry important information.
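The entry does not say how the bounds are chosen. One common convention, assumed here, is to flag values more than 1.5 interquartile ranges beyond the quartiles. The `find_outliers` helper and the sample readings are hypothetical.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    The 1.5 * IQR rule is an assumed convention, not part of the definition.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample; the last reading could be a keying error or a real event.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(readings)
```

The function only flags candidates; whether a flagged value is an error or an unusual event still has to be examined.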


- P -

Page Dimension: A page dimension is generally used to describe a dimension which is not one of the two dimensions of the page being displayed, but for which a member has been selected to define the specific page requested for display. All page dimensions must have a specific member chosen in order to define the appropriate page for display.

Page Display: The page display is the current orientation for viewing a multi-dimensional slice. The horizontal dimension(s) run across the display, defining the column dimension(s). The vertical dimension(s) run down the display, defining the contents of the row dimension(s). The page dimension-member selections define which page is currently displayed. A page is much like a spreadsheet, and may in fact have been delivered to a spreadsheet product where each cell can be further modified by the user.

Parallel Processing: The coordinated use of multiple processors to perform computational tasks. Parallel processing can occur on a multiprocessor computer or on a network of workstations or PCs.

Parent: The member (or concept) that is one level up in a concept hierarchy from another member. The parent value is usually a consolidation of all of its children's values.

Pivot: Also Rotate. An OLAP function. It is an operation whereby the dimensional orientation of a report or page display is changed. For example, pivoting may consist of swapping the rows and columns, or moving one of the row dimensions into the column dimension, or swapping an off-spreadsheet dimension with one of the dimensions in the page display (either to become one of the new rows or columns), etc. A specific example of the first case would be taking a report that has Time across (the columns) and Products down (the rows) and rotating it into a report that has Products across and Time down. An example of the second case would be to change a report which has Measures and Products down and Time across into a report with Measures down and Time over Products across. An example of the third case would be taking a report that has Time across and Products down and changing it into a report that has Time across and Geography down.
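The first case (swapping rows and columns) can be sketched with a report held as a dictionary of rows. The `report` contents and the `pivot` helper are hypothetical; the other two cases are not covered by this sketch.

```python
# Hypothetical report with Products down (rows) and Time across (columns).
report = {
    "Bicycle": {"Jan": 10, "Feb": 12},
    "Helmet": {"Jan": 4, "Feb": 7},
}


def pivot(report):
    """Swap the row and column dimensions of a two-dimensional report."""
    rotated = {}
    for row_member, columns in report.items():
        for col_member, value in columns.items():
            rotated.setdefault(col_member, {})[row_member] = value
    return rotated


# Now Time runs down and Products run across.
rotated = pivot(report)
```

Pivoting changes only the orientation: every cell value is preserved, and pivoting twice returns the original report.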

Predictive Model: A structure and process for predicting the values of specified variables in a dataset.

Prospective Data Analysis: Data analysis that predicts future trends, behaviors, or events based on historical data.


- Q -


- R -

RAID: Redundant Array of Inexpensive Disks. A technology for the efficient parallel storage of data for high-performance computer systems.

Retrospective Data Analysis: Data analysis that provides insights into trends, behaviors, or events that have already occurred.

Roll-up: Also Drill-Up. An OLAP function. It is a specific analytical technique whereby the user navigates among levels of data ranging from the most detailed (down) to the most summarized (up) along a concept hierarchy. For example, when viewing the data for the city of Toronto, a roll-up operation in the Location dimension would display Ontario (i.e. the direct concept parent of "Toronto" in the concept hierarchy of the dimension "Location"). A further roll-up on Ontario would display data for Canada.
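A roll-up can be sketched as consolidating each member's value into its parent. The `parent` links, the sales figures and the `roll_up` helper are hypothetical, and summation is assumed as the consolidation rule.

```python
from collections import defaultdict

# Hypothetical parent links in the Location concept hierarchy.
parent = {
    "Toronto": "Ontario",
    "Ottawa": "Ontario",
    "Edmonton": "Alberta",
    "Ontario": "Canada",
    "Alberta": "Canada",
}

# Made-up sales figures at the most detailed (city) level.
city_sales = {"Toronto": 100, "Ottawa": 40, "Edmonton": 60}


def roll_up(values):
    """Consolidate each member's value into its parent, one level up."""
    totals = defaultdict(int)
    for member, value in values.items():
        totals[parent[member]] += value
    return dict(totals)


province_sales = roll_up(city_sales)    # city level -> province level
country_sales = roll_up(province_sales)  # province level -> country level
```

Applying `roll_up` repeatedly climbs the hierarchy one level at a time, mirroring the Toronto, Ontario, Canada path in the entry.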

Rule Induction: The extraction of useful if-then rules from data based on statistical significance.


- S -

Size of data: In data mining, the size of the data is an important factor. Data to be mined is usually very large, on the order of gigabytes and higher. Here are some measures of data size:
Kilo = 10^3 = 1,000
Mega = 10^6 = 1,000,000
Giga = 10^9 = 1,000,000,000
Tera = 10^12 = 1,000,000,000,000
Peta = 10^15 = 1,000,000,000,000,000
Exa = 10^18 = 1,000,000,000,000,000,000
Zetta = 10^21 = 1,000,000,000,000,000,000,000
Yotta = 10^24 = 1,000,000,000,000,000,000,000,000
1 Gigabyte = 1 billion bytes
1 Terabyte = 1 trillion bytes
1 Exabyte = 1 billion gigabytes
1 Yottabyte = 1 trillion terabytes
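The equivalences in this entry follow from the powers of ten behind each prefix. A small sketch, with hypothetical `PREFIXES`, `to_bytes` and `convert` names, makes the arithmetic checkable:

```python
# Decimal exponents for each prefix, as listed in the entry.
PREFIXES = {
    "kilo": 3, "mega": 6, "giga": 9, "tera": 12,
    "peta": 15, "exa": 18, "zetta": 21, "yotta": 24,
}


def to_bytes(amount, prefix):
    """Convert `amount` of <prefix>bytes into bytes."""
    return amount * 10 ** PREFIXES[prefix]


def convert(amount, from_prefix, to_prefix):
    """Express `amount` of <from_prefix>bytes in <to_prefix>bytes.

    Uses integer division, so the result is exact only when converting
    from a larger unit to a smaller one.
    """
    return to_bytes(amount, from_prefix) // 10 ** PREFIXES[to_prefix]


gigabytes_per_exabyte = convert(1, "exa", "giga")     # 10^18 / 10^9
terabytes_per_yottabyte = convert(1, "yotta", "tera")  # 10^24 / 10^12
```

These are the decimal prefixes used in the entry, not the binary multiples (powers of 1,024) sometimes used for memory sizes.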

Slice: An OLAP function. It is an operation whereby a subset of a multi-dimensional array (or cube) corresponding to a single value for one or more members of the dimensions not in the subset is selected at a given concept level. A slice of a cube is also the result of a slice operation. For example, if the member United States is selected from the Location dimension, then the sub-cube of all the remaining dimensions is the slice that is specified. The data omitted from this slice would be any data associated with the non-selected members of the Location dimension, for example Canada, Mexico, etc. From an end user perspective, the term slice most often refers to a two-dimensional page selected from the cube.
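The United States example can be sketched on a cube stored as a dictionary keyed by member combinations. The `cube` contents and the `slice_cube` helper are hypothetical.

```python
# Hypothetical cube keyed by (location, product, time); values are made up.
cube = {
    ("United States", "Bicycle", "Jan"): 30,
    ("United States", "Helmet", "Jan"): 12,
    ("Canada", "Bicycle", "Jan"): 10,
    ("Mexico", "Bicycle", "Jan"): 5,
}


def slice_cube(cube, axis, member):
    """Keep the cells whose member on dimension `axis` equals `member`.

    The sliced dimension is dropped from the keys, leaving the sub-cube
    of the remaining dimensions.
    """
    return {
        key[:axis] + key[axis + 1:]: value
        for key, value in cube.items()
        if key[axis] == member
    }


# Select United States from the Location dimension (axis 0).
us_slice = slice_cube(cube, axis=0, member="United States")
```

Cells for Canada and Mexico are omitted from `us_slice`, and the result has one dimension fewer than the original cube.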

Slice and Dice: An OLAP user-initiated process of navigating by calling for page displays interactively, through the specification of slices via pivoting and drilling.

SMP: Symmetric multiprocessor. A type of multiprocessor computer in which memory is shared among the processors.

Sparse: A multi-dimensional data set is sparse if a relatively high percentage of the possible combinations (intersections) of the members from the data set's dimensions contain missing data. The total possible number of intersections can be computed by multiplying together the number of members in each dimension. Data sets containing one percent, .01 percent, or even smaller percentages of the possible data exist and are quite common. The opposite of a sparse cube is a dense cube.
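The computation described above (populated cells divided by the product of the dimension sizes) can be sketched as follows. The `density` helper, the dimension members and the cell values are hypothetical.

```python
from math import prod


def density(dimensions, cells):
    """Fraction of the possible member combinations that actually hold data."""
    # Total intersections = product of the number of members per dimension.
    possible = prod(len(members) for members in dimensions)
    return len(cells) / possible


dimensions = [
    ["Bicycle", "Helmet", "Snowmobile"],  # Product
    ["Jan", "Feb"],                       # Time
    ["Edmonton", "Miami"],                # City
]

# Only populated cells are stored; every other combination is missing data.
cells = {
    ("Bicycle", "Jan", "Edmonton"): 10,
    ("Bicycle", "Feb", "Miami"): 7,
    ("Snowmobile", "Jan", "Edmonton"): 3,
}

d = density(dimensions, cells)  # 3 populated cells out of 3 * 2 * 2 = 12
```

A low value of `d` indicates a sparse cube and a value near 1 a dense one; the entry gives no fixed threshold between the two.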


- T -

Terabyte: One trillion bytes.

Time Series Analysis: The analysis of a sequence of measurements made at specified time intervals. Time is usually the dominating dimension of the data.


- U -


- V -


- W -


- X -


- Y -


- Z -





Last updated: August 5th, 1999
(Compiled from various sources (OLAP council, etc.))