Glossary of Data Mining Terms
Anomalous Data: Data that result from errors (for example, data entry keying errors) or that represent unusual events. Anomalous data should be examined carefully because they may carry important information.
Artificial Neural Networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
CHAID: Chi-Square Automatic Interaction Detection. A decision tree technique used for classification of a dataset. Provides a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. Segments a dataset by using chi-square tests to create multi-way splits. CHAID predates CART and requires more data preparation than CART does.
Children: Members of a dimension that are included in a calculation to produce a consolidated total for a parent member. Children may themselves be consolidated levels, which requires that they have children. A member may be a child for more than one parent, and a child's multiple parents may not necessarily be at the same hierarchical level, thereby allowing complex, multiple hierarchical aggregations within any dimension.
Cell: A single datapoint that occurs at the intersection defined by selecting one member from each dimension in a multi-dimensional array. For example, if the dimensions are measures, time, product and geography, then the dimension members: Sales, January 1996, Bicycle and Canada specify a precise intersection along all dimensions that uniquely identifies a single data cell, which contains the value of bicycle sales in Canada for the month of January 1996.
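The lookup described above can be sketched as follows. This is only an illustration: storing the cube as a Python dictionary keyed by one member per dimension is an assumption, and the sales figures are made up.

```python
# Assumed dimension order for this sketch: (measure, time, product, geography).
cube = {
    ("Sales", "January 1996", "Bicycle", "Canada"): 1250,  # made-up value
    ("Sales", "January 1996", "Bicycle", "Mexico"): 430,   # made-up value
}

def get_cell(cube, measure, time, product, geography):
    """Return the value at one intersection, or None if the cell holds no data."""
    return cube.get((measure, time, product, geography))

value = get_cell(cube, "Sales", "January 1996", "Bicycle", "Canada")
```

Selecting one member from every dimension yields exactly one key, so the result is a single cell or nothing at all.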
Classification: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to the specific variable(s) you are trying to predict. For example, a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable with values "Good" and "Bad."
Clustering: The process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another, where distance is measured with respect to all available variables.
Cube: See Data Cube.
Data Mart: A small, single-subject warehouse used by individual departments or groups of users.
Data Cube: Also Cube, Hypercube, Multi-dimensional Array, Multi-dimensional Database. It is a multi-dimensional data structure, a group of data cells arranged by the dimensions of the data. For example, a spreadsheet exemplifies a two-dimensional array with the data cells arranged in rows and columns, each being a dimension. A three-dimensional array can be visualized as a cube with each dimension forming a side of the cube, including any slice parallel with that side. Higher dimensional arrays have no physical metaphor, but they organize the data in the way users think of their enterprise. Typical enterprise dimensions are time, measures, products, geographical location, sales channels, etc. It is not rare to see more than 20 dimensions. However, the more dimensions a cube has, the more complex manipulation and data mining on the cube become, and the more sparse the cube may become.
Data Mining: The extraction of hidden predictive information, patterns and correlations from large databases.
Data Steward: A new role of data caretaker emerging in business units. The individual in this role takes responsibility for data content and quality.
Data Visualization: The visual interpretation of complex relationships in multidimensional data.
Data Warehouse: A system for storing and delivering massive quantities of data.
Dense: A multi-dimensional database is dense if a relatively high percentage of the possible combinations of its dimension members contain data values. This is the opposite of sparse.
Dice: An OLAP function. The dice operation is a slice on more than two dimensions of a data cube (or more than two consecutive slices).
Dimension: In a flat or relational database, each field in a record represents a dimension. In a multidimensional database, a dimension is a set of similar entities; for example, a multidimensional sales database might include the dimensions Product, Time, and City.
Drill-down: An OLAP function. It is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down) along a concept hierarchy. For example, when viewing the data for North America, a drill-down operation in the Location dimension would display Canada, the United States and Mexico. A drill-down on Canada would display provinces or groups of provinces such as Maritimes, Quebec, Ontario, Prairies and British Columbia. A further drill-down on Prairies would display data for Saskatchewan, Manitoba and Alberta, or cities such as Edmonton, Calgary, Winnipeg, Regina, etc.
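A minimal sketch of drill-down navigation, assuming the concept hierarchy is held as a mapping from each member to its children. The `children_of` dictionary and the `drill_down` helper are hypothetical names used only for illustration; the members come from the example above.

```python
# Hypothetical concept hierarchy for the Location dimension (parent -> children).
children_of = {
    "North America": ["Canada", "United States", "Mexico"],
    "Canada": ["Maritimes", "Quebec", "Ontario", "Prairies", "British Columbia"],
    "Prairies": ["Saskatchewan", "Manitoba", "Alberta"],
}

def drill_down(member, children_of):
    """Return the members one level below `member`; empty list at the lowest level."""
    return children_of.get(member, [])

provinces = drill_down("Prairies", children_of)
```

Each call moves exactly one level down the hierarchy, so repeated calls walk from the most summarized level to the most detailed one.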
Gigabyte: One billion bytes.
Linear Regression: A statistical technique used to find the best-fitting linear relationship between a target (dependent) variable and its predictors (independent variables).
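For the simplest case of a single predictor, the best-fitting line can be sketched with ordinary least squares. "Best-fitting" is taken here to mean minimizing squared error, which is an assumption; the `fit_line` helper and the sample data are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```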
Logistic Regression: A generalization of linear regression that predicts the proportions of a categorical target variable, such as type of customer, in a population.
Missing Data / Missing Value: A special data item which indicates that the data in a cell do not exist. This may be because the member combination is not meaningful (e.g., snowmobiles may not be sold in Miami) or because the value has never been entered. Missing data is similar to a null value or N/A, but is not the same as a zero value.
Multi-Dimensional Array: See Data Cube.
Multi-Dimensional Query Language: A computer language that allows one to specify which data to retrieve out of a cube. The user process for this type of query is usually called slicing and dicing. The result of a multi-dimensional query is either a cell, a two-dimensional slice, or a multi-dimensional sub-cube.
OLAP Client: End user applications that can request slices from OLAP servers and provide two-dimensional or multi-dimensional displays, user modifications, selections, ranking, calculations, etc., for visualization and navigation purposes. OLAP clients may be as simple as a spreadsheet program retrieving a slice for further work by a spreadsheet-literate user or as high-functioned as a financial modeling or sales analysis application.
OLAP Server: A high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures (or databases).
Outlier: A data item whose value falls outside the bounds enclosing most of the other corresponding values in the sample. May indicate anomalous data. Should be examined carefully; may carry important information.
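One common way to make "the bounds enclosing most of the other values" concrete is the interquartile-range rule. The glossary does not prescribe a method, so the rule, the factor `k = 1.5`, and the sample data below are all assumptions chosen for illustration.

```python
from statistics import quantiles

def find_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Made-up sample: one value sits far from the rest.
flagged = find_outliers([10, 12, 11, 13, 12, 11, 95])
```

A flagged value is a candidate for inspection, not for automatic removal, since it may be an error or a genuinely unusual event.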
Page Display: The page display is the current orientation for viewing a multi-dimensional slice. The horizontal dimension(s) run across the display, defining the column dimension(s). The vertical dimension(s) run down the display, defining the contents of the row dimension(s). The page dimension-member selections define which page is currently displayed. A page is much like a spreadsheet, and may in fact have been delivered to a spreadsheet product where each cell can be further modified by the user.
Parent: The member (or concept) that is one level up in a concept hierarchy from another member. The parent value is usually a consolidation of all of its children's values.
Pivot: Also Rotate. An OLAP function. It is an operation whereby the dimensional orientation of a report or page display is changed. For example, pivoting may consist of swapping the rows and columns, or moving one of the row dimensions into the column dimension, or swapping an off-spreadsheet dimension with one of the dimensions in the page display (either to become one of the new rows or columns), etc. A specific example of the first case would be taking a report that has Time across (the columns) and Products down (the rows) and rotating it into a report that has Products across and Time down. An example of the second case would be to change a report which has Measures and Products down and Time across into a report with Measures down and Time over Products across. An example of the third case would be taking a report that has Time across and Products down and changing it into a report that has Time across and Geography down.
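The first case, swapping rows and columns, can be sketched as below. Representing a report as a nested dictionary of the form {row member: {column member: value}} is an assumption, and the figures are made up.

```python
def pivot(report):
    """Swap the row and column dimensions of a {row: {column: value}} report."""
    rotated = {}
    for row_member, columns in report.items():
        for col_member, value in columns.items():
            rotated.setdefault(col_member, {})[row_member] = value
    return rotated

# Products down (rows), Time across (columns); values are made up.
report = {
    "Bicycle": {"Jan": 10, "Feb": 12},
    "Helmet": {"Jan": 4, "Feb": 6},
}
rotated = pivot(report)  # now Time down, Products across
```

Pivoting changes only the orientation: every cell value is preserved, and pivoting twice returns the original report.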
Predictive Model: A structure and process for predicting the values of specified variables in a dataset.
Prospective Data Analysis: Data analysis that predicts future trends, behaviors, or events based on historical data.
Retrospective Data Analysis: Data analysis that provides insights into trends, behaviors, or events that have already occurred.
Roll-up: Also Drill-Up. An OLAP function. It is a specific analytical technique whereby the user navigates among levels of data ranging from the most detailed (down) to the most summarized (up) along a concept hierarchy. For example, when viewing the data for the city of Toronto, a roll-up operation in the Location dimension would display Ontario (i.e., the direct parent of "Toronto" in the concept hierarchy of the Location dimension). A further roll-up on Ontario would display data for Canada.
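A minimal sketch of a roll-up, assuming the consolidation is a simple sum and the concept hierarchy is stored as a child-to-parent mapping. The `parent_of` dictionary, the extra cities, and the sales figures are hypothetical.

```python
# Hypothetical Location hierarchy (child -> parent) and made-up city sales.
parent_of = {
    "Toronto": "Ontario", "Ottawa": "Ontario", "Montreal": "Quebec",
    "Ontario": "Canada", "Quebec": "Canada",
}
city_sales = {"Toronto": 100, "Ottawa": 40, "Montreal": 70}

def roll_up(values, parent_of):
    """Consolidate each member's value into its direct parent by summing."""
    totals = {}
    for member, value in values.items():
        parent = parent_of[member]
        totals[parent] = totals.get(parent, 0) + value
    return totals

province_sales = roll_up(city_sales, parent_of)     # one level up
country_sales = roll_up(province_sales, parent_of)  # a further roll-up
```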
Rule Induction: The extraction of useful if-then rules from data based on statistical significance.
Terabyte: One trillion bytes.
Exabyte: One billion gigabytes.
Yottabyte: One trillion terabytes.
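These sizes can be checked with a little arithmetic, assuming the decimal (power-of-ten) meaning of each prefix rather than the binary one:

```python
# Decimal storage units, expressed in bytes.
GIGABYTE = 10**9
TERABYTE = 10**12
EXABYTE = 10**18
YOTTABYTE = 10**24

gigabytes_per_exabyte = EXABYTE // GIGABYTE     # one billion
terabytes_per_yottabyte = YOTTABYTE // TERABYTE  # one trillion
```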
Slice: An OLAP function. It is an operation that selects a subset of a multi-dimensional array (or cube) by fixing a single member, at a given concept level, for one or more dimensions; the fixed dimensions are not part of the resulting subset. The result of a slice operation is also called a slice. For example, if the member United States is selected from the Location dimension, then the sub-cube of all the remaining dimensions is the slice that is specified. The data omitted from this slice would be any data associated with the non-selected members of the Location dimension, for example Canada, Mexico, etc. From an end user perspective, the term slice most often refers to a two-dimensional page selected from the cube.
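The United States example can be sketched as follows, assuming the cube is a dictionary keyed by tuples of dimension members. The `slice_cube` helper, the dimension order, and the values are illustrative assumptions.

```python
def slice_cube(cube, dim_index, member):
    """Keep cells whose key has `member` at `dim_index`, then drop that dimension."""
    return {
        key[:dim_index] + key[dim_index + 1:]: value
        for key, value in cube.items()
        if key[dim_index] == member
    }

# Assumed dimension order: (product, time, location); values are made up.
cube = {
    ("Bicycle", "1996", "United States"): 500,
    ("Bicycle", "1996", "Canada"): 120,
    ("Helmet", "1996", "United States"): 80,
}
us_slice = slice_cube(cube, 2, "United States")
```

The result has one fewer dimension than the original cube, and the cells for Canada are left out.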
SMP: Symmetric multiprocessor. A type of multiprocessor computer in which memory is shared among the processors.
Sparse: A multi-dimensional data set is sparse if a relatively high percentage of the possible combinations (intersections) of the members from the data set's dimensions contain missing data. The total possible number of intersections can be computed by multiplying together the number of members in each dimension. Data sets containing one percent, 0.01 percent, or even smaller percentages of the possible data exist and are quite common. The opposite of a sparse cube is a dense cube.
Time Series Analysis: The analysis of a sequence of measurements made at specified time intervals. Time is usually the dominating dimension of the data.
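The calculation described above can be sketched as follows, assuming the cube stores only its non-missing cells in a dictionary. The `density` helper and the example sizes are hypothetical.

```python
from math import prod

def density(cube, members_per_dimension):
    """Fraction of possible intersections that actually hold data."""
    possible = prod(members_per_dimension)  # multiply the member counts together
    return len(cube) / possible

# Made-up cube: 3 products x 4 quarters x 5 cities = 60 possible cells, 3 filled.
cube = {
    ("Bicycle", "Q1", "Toronto"): 10,
    ("Bicycle", "Q2", "Toronto"): 12,
    ("Helmet", "Q1", "Regina"): 4,
}
fill_ratio = density(cube, [3, 4, 5])
```

A low ratio indicates a sparse cube and a ratio near 1 indicates a dense one; the glossary gives no fixed threshold between the two.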