Total Pageviews

Wednesday, 11 June 2014

Slowly Changing Dimensions



Slowly Changing Dimensions

Handling changes to dimensional data across time is the most important aspect in designing a data warehouse. In dimensional modeling, there is a very rare chance that a dimension will remain static over time. For example, a customer address may change; a company may phase out old products and introduce new products. What if a customer name changes, sales person changes his region of sale or a company assigns new sales territory. How to record the history or preserve the old version of history? Here comes the concept of Slowly Changing Dimensions. The term Slowly Changing Dimension is about variation in dimensional attributes over time. The word slowly, in this context, might seem incorrect. A sales person may change his territory rapidly. But in general, when compared to measures in fact table, the changes in dimensions occur slowly.
Types of Slowly Changing Dimensions
In reference to Figure 3 above, lets say a sales person changes his region of sale. We may handle this change in several ways. These methods fall in various categories based on companys need to preserve an accurate history of dimensional changes. Ralph Kimball categorized the dimensional changes into three categories
  • Type One: Changes that overwrite history
  • Type Two: Preserve history
  • Type Three: Preserve a version of history
Type One (Overwrite History)
A type one change overwrites existing dimensional attribute with new information. In Sales Person Region change example, the old region name will be overwritten by the new region. Say, a sales person Rob, has territory as ASIA.
Sales_Person_Dimension
Sales_Person_Key
ID
Name
Region
100
203234
Rob Doe
ASIA

Now, if he starts looking after NorthWest Region, by implementing Type 1 dimension, the dimension table will look like:
Sales_Person_Dimension
Sales_Person_Key
ID
Name
Region
100
203234
Rob Doe
NorthWest
Advantages
  • This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages
  • All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Type Two (Preserve History)
A Type Two change writes a record with the new attribute information and preserves a record of the old dimensional data. Type Two changes let you preserve historical data. Implementing Type Two changes within a data warehouse might require significant analysis and development. Type Two changes accurately partition history across time more effectively than other types. However, because Type Two changes add records, they can significantly increase the database’s size.
In our example, lets say we identify Region as Type Two attribute. This can be handled in this way using:
Sales_Person_Dimension
Sales_Person_Key
ID
Name
Region
100
203234
Rob Doe
ASIA
153
203234
Rob Doe
NorthWest
Advantages
  • This allows us to accurately keep all historical information.
Disadvantages
  • This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
  • This necessarily complicates the ETL process.
Type Three (Preserve a Version of History)
You usually implement Type Three changes only if you have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name. Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record. You can create multiple fields to hold distinct values for separate points in time. In the case of a region change example, you could create an OLD_REGION and NEW_REGION field and a REGION_CHANGE_EFF_DATE field to record when the change occurs. This method preserves the change. But how would you handle a second name change, or a third, and so on? The side effects of this method are increased table size and, more important, increased complexity of the queries that analyze historical values from these old fields. After more than a couple of iterations, queries become impossibly complex, and ultimately you’re constrained by the maximum number of attributes allowed on a table.
This is how the table will look like in Type Three change:
Sales_Person_Dimension
Sales_Person_Key
ID
Name
Old Region
New Region
100
203234
Rob Doe
ASIA
NorthWest
Advantages
  • This does not increase the size of the table, since new information is updated.
  • This allows us to keep some part of history.
Disadvantages
  • Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Because most business requirements include tracking changes over time, data warehouse architects commonly implement Type Two changes. A data warehouse might use Type Two changes for all attributes in all tables. As an alternative, you can implement a mix of Type One and Type Two changes at an attribute level by implementing Type 2 changes for only attributes whose historical values are important when you’re slicing and dicing. For example, users might not need to an individual’s previous name if a name change occurs, so a Type One change would suffice. Users might want the system to show only the person’s current name. However, if the company reassigns sales territories, users might need to track who sold what, at what time, and in what territory, necessitating a Type Two change.

No comments:

Post a Comment