Wednesday, 11 June 2014

Star Schema


Dimensional modeling is the design approach many data warehouse designers use to build their data warehouses. The dimensional model is also the underlying data model of many commercial OLAP products on the market today. Designing a data warehouse is very different from designing an online transaction processing (OLTP) system. An OLTP system exists to capture a high rate of data changes and additions, whereas a data warehouse exists to organize large amounts of stable data for ease of analysis and retrieval. Because of these differing purposes, many considerations in data warehouse design differ from those in OLTP database design. In a dimensional model, all data is contained in two types of tables: fact tables and dimension tables.

Fact Table
Each data warehouse or data mart includes one or more fact tables. A fact table captures the data that measures the organization's business operations. It might contain business sales events such as cash register transactions, or the contributions and expenditures of a nonprofit organization. Fact tables usually contain large numbers of rows, sometimes hundreds of millions when they hold one or more years of history for a large organization. A key characteristic of a fact table is that it contains numerical data (facts) that can be summarized to provide information about the history of the organization's operations. Each fact table also includes a multipart (composite) key made up of foreign keys that reference the primary keys of the related dimension tables, which hold the attributes of the fact records. Fact tables should not contain descriptive information or any data other than the numerical measurement fields and the key fields that relate the facts to the corresponding entries in the dimension tables. An example of a fact table is a Sales_Fact table, which might contain measures such as sale_amount, unit_price and discount.

Dimension Table
Dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide descriptive information; others are used to specify how fact table data should be summarized to provide useful information to the analyst. Dimension tables contain hierarchies of attributes that aid in summarization. For example, a dimension containing product information would often contain a hierarchy that separates products into categories such as food, drink, and non-consumable items, with each of these categories further subdivided a number of times until the individual product is reached at the lowest level.
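To make the idea of a hierarchy concrete, here is a minimal Python sketch of rolling the same facts up to different levels of a product hierarchy. The rows, the three-level hierarchy (product, subcategory, category) and the `roll_up` helper are all made up for illustration; a real warehouse would do this in the database or the OLAP tool.

```python
from collections import defaultdict

# Hypothetical product dimension rows:
# (product_key, product_name, subcategory, category)
product_dimension = [
    (1, "Orange juice", "Juice", "Drink"),
    (2, "Cola", "Soda", "Drink"),
    (3, "Bread", "Bakery", "Food"),
    (4, "Frying pan", "Kitchen", "Non-consumable"),
]

# Hypothetical fact rows: (product_key, sale_amount)
sales_facts = [(1, 4.0), (2, 1.5), (2, 1.5), (3, 2.0), (4, 25.0)]


def roll_up(level):
    """Sum sale_amount at one hierarchy level: 'product', 'subcategory' or 'category'."""
    column = {"product": 1, "subcategory": 2, "category": 3}[level]
    # Map each product key to its label at the requested level.
    label_by_key = {row[0]: row[column] for row in product_dimension}
    totals = defaultdict(float)
    for product_key, sale_amount in sales_facts:
        totals[label_by_key[product_key]] += sale_amount
    return dict(totals)


category_totals = roll_up("category")        # coarsest level
subcategory_totals = roll_up("subcategory")  # one level further down
```

The facts never change; only the dimension attribute used for grouping does. That is what lets the analyst move up and down the hierarchy.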
Dimensional modeling produces dimension tables in which each table contains attributes that are independent of those in other dimensions. For example, a customer dimension table contains data about customers, a product dimension table contains information about products, and a store dimension table contains information about stores. Queries use attributes in dimensions to specify a view into the fact information. For example, a query might use the product, store, and time dimensions to ask the question “What was the cost of non-consumable goods sold in the northeast region in 1999?” Subsequent queries might drill down along one or more dimensions to examine more detailed data, such as “What was the cost of kitchen products in New York City in the third quarter of 1999?” In these examples, the dimension tables are used to specify how a measure in the fact table (here a cost figure, or equally sale_amount) is to be summarized.
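The two questions above can be sketched as queries against a tiny in-memory SQLite database using Python's standard `sqlite3` module. The table layouts, column names and sample rows are assumptions made up for this example, and `sale_amount` stands in for the measure being summed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dimension (product_key INTEGER PRIMARY KEY, name TEXT,
                                subcategory TEXT, category TEXT);
CREATE TABLE store_dimension   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE date_dimension    (date_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE sales_fact        (product_key INTEGER, store_key INTEGER,
                                date_key INTEGER, sale_amount REAL);

INSERT INTO product_dimension VALUES
  (1, 'Frying pan', 'Kitchen', 'Non-consumable'),
  (2, 'Light bulb', 'Electrical', 'Non-consumable'),
  (3, 'Bread', 'Bakery', 'Food');
INSERT INTO store_dimension VALUES
  (1, 'New York City', 'Northeast'), (2, 'Boston', 'Northeast'), (3, 'Dallas', 'South');
INSERT INTO date_dimension VALUES (1, 1999, 1), (2, 1999, 3), (3, 2000, 1);
INSERT INTO sales_fact VALUES
  (1, 1, 2, 25.0), (1, 2, 1, 20.0), (2, 1, 2, 5.0),
  (3, 1, 2, 2.0),  (1, 3, 2, 30.0), (1, 1, 3, 40.0);
""")

# Every query joins the fact table to the dimensions it filters on.
STAR_JOIN = """
SELECT SUM(f.sale_amount)
FROM sales_fact f
JOIN product_dimension p ON p.product_key = f.product_key
JOIN store_dimension   s ON s.store_key   = f.store_key
JOIN date_dimension    d ON d.date_key    = f.date_key
WHERE {filters}
"""

# Non-consumable goods, northeast region, 1999.
broad_total = conn.execute(STAR_JOIN.format(
    filters="p.category = 'Non-consumable' AND s.region = 'Northeast' AND d.year = 1999"
)).fetchone()[0]

# Drill down: kitchen products, New York City, third quarter of 1999.
drill_down_total = conn.execute(STAR_JOIN.format(
    filters="p.subcategory = 'Kitchen' AND s.city = 'New York City' "
            "AND d.year = 1999 AND d.quarter = 3"
)).fetchone()[0]
```

Note that drilling down changes only the WHERE clause. The joins stay the same, because each dimension is always one step away from the fact table.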
Consider a Sales_Fact table whose facts are described by four attributes: Store, Product, Date and Sales Person. In this case we will have four dimension tables: Store_Dimension, Product_Dimension, Date_Dimension and Sales_Person_Dimension.


You may notice that all of these dimensions contain a Key field. This is called a surrogate key. It is a substitute for the natural key of the dimension (e.g., in Sales_Person_Dimension, the natural key is ID). In a data warehouse, a surrogate key is a generalization of the natural production key and is one of the basic elements of data warehouse design.
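As a rough illustration of how surrogate keys relate to natural keys, the sketch below hands out a new integer key the first time a natural ID is seen and reuses it afterwards. The `SurrogateKeyMap` class and the `EMP-…` IDs are hypothetical; in practice the database or the ETL tool usually generates these keys.

```python
from itertools import count


class SurrogateKeyMap:
    """Assigns a stable integer surrogate key to each natural key."""

    def __init__(self):
        self._next_key = count(1)       # 1, 2, 3, ...
        self._key_by_natural_id = {}

    def key_for(self, natural_id):
        # Allocate a new key only for a natural ID we have not seen before.
        if natural_id not in self._key_by_natural_id:
            self._key_by_natural_id[natural_id] = next(self._next_key)
        return self._key_by_natural_id[natural_id]


sales_person_keys = SurrogateKeyMap()

# Hypothetical source rows: (natural ID from the production system, name).
source_rows = [("EMP-1042", "Asha"), ("EMP-0007", "Ben"), ("EMP-1042", "Asha")]

# Build the dimension keyed by surrogate key; the natural ID becomes a plain attribute.
sales_person_dimension = {}
for natural_id, name in source_rows:
    key = sales_person_keys.key_for(natural_id)
    sales_person_dimension[key] = {"ID": natural_id, "name": name}
```

The repeated source row for EMP-1042 maps to the same surrogate key, so the dimension ends up with two rows, not three.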
Because the fact table is described by the four dimension tables above, it contains the surrogate keys of all of these dimensions. This is what the Sales_Fact table looks like:
If you look carefully at the structure of the tables above and how they are linked, the schema looks like this:
 
You can easily see that this looks like a star, with the fact table at the center and the dimension tables at the points. Hence it is called a star schema.
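For readers who prefer code to diagrams, here is one possible sketch of the same shape using Python's standard `sqlite3` module. Only the table names and the three measures come from this post; the attribute columns in each dimension and the choice of a composite primary key are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Four dimension tables, each with its own surrogate key.
CREATE TABLE store_dimension        (store_key INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE product_dimension      (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE date_dimension         (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE sales_person_dimension (sales_person_key INTEGER PRIMARY KEY,
                                     id TEXT, sales_person_name TEXT);

-- The fact table holds only surrogate keys and numeric measures.
CREATE TABLE sales_fact (
    store_key        INTEGER REFERENCES store_dimension(store_key),
    product_key      INTEGER REFERENCES product_dimension(product_key),
    date_key         INTEGER REFERENCES date_dimension(date_key),
    sales_person_key INTEGER REFERENCES sales_person_dimension(sales_person_key),
    sale_amount REAL,
    unit_price  REAL,
    discount    REAL,
    PRIMARY KEY (store_key, product_key, date_key, sales_person_key)
);
""")

# Ask SQLite which tables sales_fact points at: these are the points of the star.
# Column 2 of each PRAGMA foreign_key_list row is the referenced table name.
referenced_tables = sorted(
    row[2] for row in conn.execute("PRAGMA foreign_key_list(sales_fact)")
)
```

Every foreign key leaves the fact table and lands on a dimension table, and no dimension references another dimension. That is the star.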
Advantages of a Star Schema
  • A star schema is very easy to understand, even for non-technical business managers
  • A star schema generally gives good query performance, because every dimension is a single join away from the fact table
  • A star schema is easily extensible: new dimensions or measures can usually be added without restructuring the existing tables
