Blog to explore things related to Data warehousing, Databases, Informatica, SQL, PLSQL, and Unix scripting technologies.
Total Pageviews
Monday, 16 June 2014
Wednesday, 11 June 2014
Data warehouse definition
"It is a way of storing the data for the purpose of meaningful future analysis."
Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
Ralph Kimball provided a more concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
Different people have different definitions for a data warehouse. The most popular definition came from Bill Inmon, who provided the following:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
Ralph Kimball provided a more concise definition of a data warehouse:
A data warehouse is a copy of transaction data specifically structured for query and analysis.
Slowly Changing Dimensions
Slowly
Changing Dimensions
Handling changes to dimensional data
across time is the most important aspect in designing a data warehouse. In
dimensional modeling, there is a very rare chance that a dimension will remain
static over time. For example, a customer address may change; a company may
phase out old products and introduce new products. What if a customer name
changes, sales person changes his region of sale or a company assigns new sales
territory. How to record the history or preserve the old version of history?
Here comes the concept of Slowly Changing Dimensions. The term Slowly Changing
Dimension is about variation in dimensional attributes over time. The word
slowly, in this context, might seem incorrect. A sales person may change his
territory rapidly. But in general, when compared to measures in fact table, the
changes in dimensions occur slowly.
Types
of Slowly Changing Dimensions
In reference to Figure 3 above, lets
say a sales person changes his region of sale. We may handle this change in
several ways. These methods fall in various categories based on companys need
to preserve an accurate history of dimensional changes. Ralph Kimball
categorized the dimensional changes into three categories
- Type One: Changes that overwrite history
- Type Two: Preserve history
- Type Three: Preserve a version of history
Type
One (Overwrite History)
A type one change overwrites
existing dimensional attribute with new information. In Sales Person Region
change example, the old region name will be overwritten by the new region. Say,
a sales person Rob, has territory as ASIA.
Sales_Person_Dimension
|
||||
Sales_Person_Key
|
ID
|
Name
|
Region
|
…
|
100
|
203234
|
Rob
Doe
|
ASIA
|
…
|
Now, if he starts looking after
NorthWest Region, by implementing Type 1 dimension, the dimension table will
look like:
Sales_Person_Dimension
|
||||
Sales_Person_Key
|
ID
|
Name
|
Region
|
…
|
100
|
203234
|
Rob
Doe
|
NorthWest
|
…
|
Advantages
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Type
Two (Preserve History)
A Type Two change writes a record
with the new attribute information and preserves a record of the old
dimensional data. Type Two changes let you preserve historical data.
Implementing Type Two changes within a data warehouse might require significant
analysis and development. Type Two changes accurately partition history across
time more effectively than other types. However, because Type Two changes add
records, they can significantly increase the database’s size.
In our example, lets say we identify
Region as Type Two attribute. This can be handled in this way using:
Sales_Person_Dimension
|
||||
Sales_Person_Key
|
ID
|
Name
|
Region
|
…
|
100
|
203234
|
Rob
Doe
|
ASIA
|
…
|
153
|
203234
|
Rob
Doe
|
NorthWest
|
…
|
Advantages
- This allows us to accurately keep all historical information.
Disadvantages
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Type
Three (Preserve a Version of History)
You usually implement Type Three
changes only if you have a limited need to preserve and accurately describe
history, such as when someone gets married and you need to retain the previous
name. Instead of creating a new dimensional record to hold the attribute
change, a Type Three change places a value for the change in the original
dimensional record. You can create multiple fields to hold distinct values for
separate points in time. In the case of a region change example, you could
create an OLD_REGION and NEW_REGION field and a REGION_CHANGE_EFF_DATE field to
record when the change occurs. This method preserves the change. But how would
you handle a second name change, or a third, and so on? The side effects of this
method are increased table size and, more important, increased complexity of
the queries that analyze historical values from these old fields. After more
than a couple of iterations, queries become impossibly complex, and ultimately
you’re constrained by the maximum number of attributes allowed on a table.
This is how the table will look like
in Type Three change:
Sales_Person_Dimension
|
|||||
Sales_Person_Key
|
ID
|
Name
|
Old
Region
|
New
Region
|
…
|
100
|
203234
|
Rob
Doe
|
ASIA
|
NorthWest
|
…
|
Advantages
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Because most business requirements
include tracking changes over time, data warehouse architects commonly
implement Type Two changes. A data warehouse might use Type Two changes for all
attributes in all tables. As an alternative, you can implement a mix of Type
One and Type Two changes at an attribute level by implementing Type 2 changes
for only attributes whose historical values are important when you’re slicing
and dicing. For example, users might not need to an individual’s previous name
if a name change occurs, so a Type One change would suffice. Users might want
the system to show only the person’s current name. However, if the company
reassigns sales territories, users might need to track who sold what, at what
time, and in what territory, necessitating a Type Two change.
Star Schema
Dimensional modeling is the design
concept used by many data warehouse designers to build their data warehouse.
Dimensional model is the underlying data model used by many of the commercial
OLAP products available today in the market. Designing a data warehouse is very
different from designing an online transaction processing (OLTP) system. In
contrast to an OLTP system in which the purpose is to capture high rates of
data changes and additions, the purpose of a data warehouse is to organize
large amounts of stable data for ease of analysis and retrieval. Because of
these differing purposes, there are many considerations in data warehouse
design that differ from OLTP database design. In dimensional model, all data is
contained in two types of tables called Fact Table and Dimension Table.
Fact Table
Each data warehouse or data mart
includes one or more fact tables. The fact table captures the data that
measures the organizations business operations. A fact table might contain
business sales events such as cash register transactions or the contributions
and expenditures of a nonprofit organization. Fact tables usually contain large
numbers of rows, sometimes in the hundreds of millions of records when they
contain one or more years of history for a large organization. A key
characteristic of a fact table is that it contains numerical data (facts) that
can be summarized to provide information about the history of the operation of
the organization. Each fact table also includes a multipart index that contains
as foreign keys the primary keys of related dimension tables, which contain the
attributes of the fact records. Fact tables should not contain descriptive
information or any data other than the numerical measurement fields and the
index fields that relate the facts to corresponding entries in the dimension
tables. An example of fact table is Sales_Fact table that might contain the
information like sale_amount, unit_price, discount, etc.
Dimension Table
Dimension tables contain attributes
that describe fact records in the fact table. Some of these attributes provide
descriptive information; others are used to specify how fact table data should
be summarized to provide useful information to the analyst. Dimension tables
contain hierarchies of attributes that aid in summarization. For example, a
dimension containing product information would often contain a hierarchy that
separates products into categories such as food, drink, and non-consumable
items, with each of these categories further subdivided a number of times until
the individual product is reached at the lowest level.
Dimensional modeling produces
dimension tables in which each table contains fact attributes that are
independent of those in other dimensions. For example, a customer dimension
table contains data about customers, a product dimension table contains
information about products, and a store dimension table contains information
about stores. Queries use attributes in dimensions to specify a view into the
fact information. For example, a query might use the product, store, and time
dimensions to ask the question “What was the cost of non-consumable goods sold
in the northeast region in 1999?” Subsequent queries might drill down along one
or more dimensions to examine more detailed data, such as “What was the cost of
kitchen products in New York City in the third quarter of 1999?” In these
examples, the dimension tables are used to specify how a measure (sale_amount)
in the fact table is to be summarized.
Consider an example of Sales_Fact table
and the various attributes that describe this fact are Store, Product, Date and
say Sales Person. In this case we will have four dimension tables, viz.
Store_Dimension, Product_Dimension, Date_Dimension and Sales_Person_Dimension.
You may notice that all of these
dimensions contain a Key field. This is called Surrogate Key. This key is
substitute for a natural key in dimensions (e.g., in Sales_Person_Dimension, we
have natural key as ID). In a data warehouse a surrogate key is a
generalization of the natural production key and is one of the basic elements
of data warehouse.
As a fact table is described by the
four dimension tables described above, it will contain the Surrogate Keys of
all these dimensions. This is how the Sales_Fact table will look like:
Now if you carefully look at the
structure of above tables and how they are linked the schema will look like
this:
You can easily tell that this looks
like a STAR. Hence it is called as Star Schema.
Advantages of having Star Schema
- Star Schema is very easy to understand, even for non technical business managers
- Star Schema provides better performance and smaller query times
- Star schema is easily extensible and will handle future changes easily
Subscribe to:
Posts (Atom)