I. Concept Hierarchies
Concept hierarchies play a central role in data mining. They allow
knowledge discovery at different conceptual levels and support
interactive, progressive refinement of the results.
In data warehousing, concept hierarchies are necessary for operations
such as drill-down and roll-up. Concept hierarchies can be partial
orders, lattices, or even graphs.
There are many ways to implement concept hierarchy data structures in
main memory and on disk.
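To make the terminology concrete, here is a minimal in-memory sketch of a roll-up over a tree-shaped hierarchy. The location values, the `PARENT` map, and the function names are all hypothetical and chosen only for illustration; this is one possible representation among many, not a recommended answer.

```python
# Hypothetical location hierarchy (city -> province -> country).
# Each value maps to its parent one level up; the root has no entry.
PARENT = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "British Columbia": "Canada",
    "Ontario": "Canada",
}

def generalize(value, levels=1):
    """Climb up to `levels` steps in the hierarchy, stopping at the root."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value

def roll_up(measures, levels=1):
    """Aggregate a {value: measure} mapping at a coarser conceptual level."""
    result = {}
    for value, measure in measures.items():
        key = generalize(value, levels)
        result[key] = result.get(key, 0) + measure
    return result

sales_by_city = {"Vancouver": 10, "Victoria": 5, "Toronto": 7}
sales_by_province = roll_up(sales_by_city)
sales_by_country = roll_up(sales_by_city, levels=2)
```

A child-to-parent map like this only covers tree-shaped hierarchies; a lattice or general graph would need a set of parents per value.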
- Enumerate and describe as many concept hierarchy data structure representations as possible and explain their advantages and limitations.
- Indicate which concept hierarchy representation is, in your view, the most efficient in terms of space used, and which representation is the most appropriate for concept hierarchies that are frequently updated. Justify your answers.
- Suppose we choose to represent concept hierarchies with tables in a relational database.
a) What are the advantages of such a choice?
b) Explain how the generalization and specialization operations are performed. Use examples to better illustrate your ideas.
II. Data Cubes
A data cube is a data structure used to represent multidimensional
data. Although it is called a cube, it often represents more than
three dimensions. A cell in a data cube may contain one or more
measurements associated with values of the dimensions (attributes)
represented. It is common to see data cubes in which most cells are
empty. Such cubes are called sparse data cubes.
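The notion of sparsity can be quantified as the fraction of cells that actually hold a measurement. The short calculation below is only a sketch: the dimension sizes and the assumed number of products sold per store per day are made-up figures, not data from any real warehouse.

```python
# Hypothetical dimension sizes for a (product, store, day) cube.
n_products, n_stores, n_days = 1000, 100, 365

# Every combination of dimension values is a potential cell.
total_cells = n_products * n_stores * n_days

# Assumption: each store sells about 50 distinct products on a given day,
# so only those (product, store, day) combinations carry a measurement.
products_sold_per_store_day = 50
populated_cells = n_stores * n_days * products_sold_per_store_day

# Fraction of the cube that is non-empty.
density = populated_cells / total_cells
print(f"{populated_cells} of {total_cells} cells populated ({density:.1%})")
```

Under these assumed figures, adding a further dimension multiplies the number of potential cells without adding any new measurements, so the density drops further.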
- Explain why multidimensional data cubes are often sparse. Give examples to illustrate your arguments.
- Because data cubes are very large and most of their cells are empty (i.e., they are sparse), it is wiser to avoid representing the empty cells when storing and manipulating data cubes or cuboids in memory, so as not to run out of memory space.
a) Design a representation for a multidimensional data cube that
addresses the sparsity of the cube.
b) Explain how the MOLAP operations drill-down and roll-up are performed on your data structure.
c) Explain how the data cube represented with your data structure is updated when new measurement values are provided.
Due Date: October 29th 10:00 am