Business Intelligence


YANG data modeler
Data warehousing and best practices for data warehousing
Data migration and best practices for data migration
BIRT Actuate
Business Object
Cognos BI
Information Builders
Crystal Reports
Hyperion Solutions' Essbase
Oracle's Express Server
Microsoft Analysis Services

Public data sets

Data Analysis
Data Mining
statistical analysis
decision support

Schemas design - Stars versus Snowflakes
Dimension tables
Fact tables
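To make the fact/dimension split concrete, here is a minimal star schema sketched as SQLite DDL from Python. All table and column names are hypothetical, loosely based on the court-case example later in these notes:

```python
import sqlite3

# A minimal star schema sketch (hypothetical court-case example):
# one central fact table referencing several dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_id   INTEGER PRIMARY KEY,   -- surrogate key
    year      INTEGER,
    quarter   INTEGER,
    month     INTEGER
);
CREATE TABLE county_dim (
    county_id INTEGER PRIMARY KEY,
    name      TEXT
);
CREATE TABLE judge_dim (
    judge_id  INTEGER PRIMARY KEY,
    name      TEXT
);
-- Fact table: one row per disposition event, with numeric measures.
CREATE TABLE disposition_fact (
    date_id        INTEGER REFERENCES date_dim(date_id),
    county_id      INTEGER REFERENCES county_dim(county_id),
    judge_id       INTEGER REFERENCES judge_dim(judge_id),
    cases_disposed INTEGER,
    fines_assessed REAL
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Note that the fact table contains only foreign keys to the dimensions plus numeric measures; everything descriptive lives in the dimension tables.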

Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), p. 393
Ralph Kimball and Bill Inmon - done reading
Data Warehousing Framework
DB Modelling: OLAP Database Design Part 1
What is OLAP - Business Intelligence EP2 / Part2
BI Tool Vendor Grudge Match Presentation Intro (1 of 2)
What's a Data Warehouse and How Do I Test It?
Data Warehousing Explained, Illustrated, and Subtitled (Starring Sakila)
Demystifying the Data Warehouse
Business Intelligence Demonstration

Star schema and Snowflake schema:
Data Warehousing Best Practices Star Schemas - watched
Big Data, OLAP, BI Tutorial - Stars, Wormholes, and Data Warehouse Design - watched, relevant
Data Modeling and Conceptual Sketching in the Design Process
The Theory of Normalization
DB Modelling: OLAP Database Design Part 1 - watched, relevant
Analysis Services - 11 Star and Snowflake Schemas
The Thinking Persons Guide to Data Warehouse Design
Business Dimensional Modeling with Laura Reeves - BI Knowledge Exchange - Part 1
3NF - Data Vault - Star Schema by Data Vault Academy
INF 3: What is Star Schema
INF 4: What is Snowflake Schema
Data Warehouse Star Schema - Kalman Toth
4 Snowflake dimensions, defining hierarchies, defining dimension primary keys
IBI10 - Creating Staging Areas
Slowly Changing Dimension Type 2 Illustration Using Informatica - By Mohan Vamsi

Data Warehouse tutorial. Creating an ETL using MS Visual Studio
ETL Testing: How to Achieve 100% Data Validation
What's a Data Warehouse and How Do I Test It?
ETL Essential Fundamentals

Data Warehousing and Data Mining:
Introduction to Data Warehousing and Data Mining
Data Warehousing Part 1
Sakila: Data Warehousing Explained, Illustrated, and Subtitled
Data Mining: The Tool of The Information Age

Microsoft SQL Server:
Microsoft SQL Server Analysis Services tutorial. Creating OLAP cube. Introduction to data warehouse

Statistical Aspects of Data Mining (Stats 202) Day 1
Statistical Aspects of Data Mining (Stats 202) Day 2
Statistical Aspects of Data Mining (Stats 202) Day 4
Statistical Aspects of Data Mining (Stats 202) Day 3
Statistical Aspects of Data Mining (Stats 202) Day 5
Statistical Aspects of Data Mining (Stats 202) Day 6
Statistical Aspects of Data Mining (Stats 202) Day 7
Statistical Aspects of Data Mining (Stats 202) Day 8
Statistical Aspects of Data Mining (Stats 202) Day 9
Statistical Aspects of Data Mining (Stats 202) Day 10

O'Reilly Webcast: Social Network Analysis -- Finding communities and influencers
Analyzing Social Networks on Twitter
Social Network Analysis with Python
Lecture 14 - Analyzing Big Data with Twitter: Stan Nikolov on Information Diffusion at Twitter
Brains, Meaning and Corpus Statistics
Introduction to Predictive Analytics
R and Hadoop - Big Data Analytics
Using Segmentation and Predictive Analytics to Combat Attrition
Realtime Analytics for Big Data: A Facebook Case Study

MySQL Data Warehousing:

Big Data:
Big Data - The Hadoop Data Warehouse - Part 1
Big Data tutorial by Marko Grobelnik
Analyzing Big Data with Twitter - Lec. 2 - Growing a Human-Scale Service & the Twitter Ecosystem
Big Data and Hadoop 1 | Hadoop Tutorials 1 |Big Data Tutorial 1 |Hadoop Tutorial for Beginners -1
Realtime Analytics for Big Data: A Facebook Case Study
Big Data & BI Best Practices Webinar by Yellowfin and Actian Vectorwise

Data Visualization:




Data Mining:

Data Mining using R:

Statistical Finance:
Issues in Financial Mathematics and Statistics


Head First Statistics
Head First Data Analysis

Data mining
Predictive analysis

The star schema data warehouse takes data from many transactional systems and copies it into a common format, with a relational database design completely different from that of a transactional system, containing many star schema configurations.

Each star schema has a central fact table, which represents events or occurrences that have measures. Measures are always numeric, so they can be counted, summed, or averaged. Measures could include a count of the number of cases filed, the count of dispositioned cases (as in the example in this demonstration), or the fines assessed per case in an analysis of criminal cases.

A single fact table can have many measures, so long as they pertain to the same dimensions. If important measures don't share all their dimensions, and many of them will not, we can make other fact tables in the star schema data warehouse and relate those facts to their relevant dimensions.

An interesting feature of the star schema data warehouse is that end users understand it intuitively. They understand that they are looking at related measures that can be filtered by relevant dimensions.

One challenge of designing and constructing a data warehouse is determining, out of all the tables, rows, and columns available, which measures define the relevant facts upon which to base our projections.

Another challenge concerns how we address the problems previously identified with the existing data.

Remember that one of the problems was that the data lived in databases from different vendors, each with its own schema. In the past, this meant the data warehouse was commonly built by writing low-level programs specific to each database vendor and each database's content to do all the work. Remember too that data was not coded consistently: similar items may be coded using different courthouse terminology. The differences may be small and obvious, or large and subtle. There may be misspellings of key terms, or names may be presented differently. All of these issues affect the quality of the data and the effort necessary to transform the data into a star schema data warehouse.

Performing the tasks associated with moving, correcting, and transforming the data from transactional systems to a star schema data warehouse is called extraction, transformation, and loading (ETL).

Even when the measures and dimensions are accurately specified, the shortcoming of a star schema data warehouse is that simple queries can only get one piece of data at a time: the intersection of a measure with a single value from each of the dimensions. In order to make comparisons between many values in a useful way, greater querying ability is needed.

Analysis Services stores data in yet another format, called the cube. The term cube is a convenience, because usually there will be more than three dimensions, or variables against which we wish to aggregate and analyze the data. For illustration, let's consider a cube that has only three dimensions: a cube that measures the count of dispositions of court cases. Along one axis, we have the dimension of time. Along another, we have county. And across the third, we have judges.

To keep the cube simple, we use a subset of the values that these dimensions might take on. The values a dimension can take on are called members. If we set one member of each dimension to a fixed value, a single measure results, but the cube lets the user view many of these cells at a time, which gives us a broader comparative view of the data. Because the data is pre-aggregated, we can adjust our querying parameters in real time without waiting 30 minutes for a report-generating tool to count all the detail records of a transactional database to populate the cells of the cube.
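The pre-aggregation step can be sketched as a GROUP BY over the fact table, where each result row is one cell of the (time, county, judge) cube. The table, member values, and data below are invented for illustration:

```python
import sqlite3

# Sketch: pre-aggregating fact rows into "cube cells" with GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disposition_fact (year INT, county TEXT, judge TEXT)")
rows = [
    (2012, "Lake", "Smith"),
    (2012, "Lake", "Smith"),
    (2012, "Cook", "Jones"),
    (2013, "Lake", "Jones"),
    (2013, "Cook", "Jones"),
]
conn.executemany("INSERT INTO disposition_fact VALUES (?, ?, ?)", rows)

# Each result row is one cell of the cube: the count measure at a
# fixed member of every dimension.
cells = conn.execute("""
    SELECT year, county, judge, COUNT(*) AS dispositions
    FROM disposition_fact
    GROUP BY year, county, judge
    ORDER BY year, county, judge
""").fetchall()
for cell in cells:
    print(cell)
```

A real OLAP engine computes and stores cells like these ahead of time (often at several levels of the hierarchy), which is what makes interactive slicing feel instantaneous.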

Let's look at a cube through Analysis Services based on data from several test databases that have been placed in a single star schema data warehouse. This is not actual data, or it is actual data that has been modified during testing.

Data can be allocated along some dimensions, and merged across others.

The important area of the spreadsheet for use with Analysis Services is called a pivot table. There are places on the pivot table for placing measures and dimensions. Dimensions can also be placed to filter the entire view of the data. We will place our first measure in the central area of the pivot table. This cube currently has only one measure defined, but if multiple measures existed for this cube, we could display one or more of them at a time, side by side. Here we see the total count of dispositions contained in this cube. Virtually every cube has time as a dimension, so we will place time across the horizontal axis of the pivot table.

What is a pivot table?

The user views the data through a predefined, rigid interface and can only query in the manner previously configured for them by programmers. If we perform an analysis to discover the key data columns that comprise the measures and dimensions of all the relevant databases and database types, we can extract this data, massage it into a common format, and load it into a star schema relational data warehouse. While the design of these data warehouses is often understood by end users, this presentation of the data has its limitations as well. If there are derived and calculated measures and dimensions, such as the year-to-year change calculations shown in the Excel demonstration, formulas for these and for calculations specific to the subject matter of the cube must be defined. The data is then processed from the star schema data warehouse and loaded into the Analysis Services cubes. Users are then ready to use the data through Excel, or web pages using Excel-like features.
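As a rough illustration of what a pivot table does, here is a toy pivot built in plain Python: a count measure laid out with one dimension (county) on the rows and another (year) on the columns. The data is invented:

```python
from collections import Counter

# Each tuple is one fact row reduced to its two dimension values.
facts = [
    ("Lake", 2012), ("Lake", 2012), ("Cook", 2012),
    ("Lake", 2013), ("Cook", 2013), ("Cook", 2013),
]
# Count occurrences of every (county, year) combination: the cells.
counts = Counter(facts)

years = sorted({y for _, y in facts})
print("county  " + "  ".join(str(y) for y in years))
for county in sorted({c for c, _ in facts}):
    row = "  ".join(str(counts[(county, y)]).rjust(4) for y in years)
    print(f"{county:7} {row}")
```

A spreadsheet pivot table does exactly this cross-tabulation, but interactively: dragging a dimension to the row, column, or filter area re-lays-out the same pre-aggregated cells.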

BI: Getting the right information into the right people's hands in a format that allows them to understand the data quickly. It is not just about decision support. BI is about monitoring changes: identifying lagging or growing metrics, understanding why, understanding the context behind the numbers, and identifying trends in various areas of your organization. React to changes, and cause changes.

What is data cube?

What is dimension?

What is measure?

A data warehouse is a database that collects, integrates, and stores an organization's data with the aim of producing accurate, timely information and providing a means of analyzing data to support management decisions.

Analysis often requires huge amounts of data to be processed, which is often a problem; for example, the OLTP database tables are locked during retrieval. A data warehouse can be completely detached from the information system, even running on a different system.

An OLTP system's data model is rarely optimized for analysis. We all learn to develop systems using normalized databases modeled after our entity relationships. This is a good thing for an information system, but it makes querying for large sums of aggregated data a costly operation. Furthermore, redundancy is rarely part of this database design, because redundancy is hard to maintain, often causing data inconsistencies or worse. For data analysis, redundancy can be great, because it speeds up the process.

Data in an OLTP system might change over time. A customer might move to another country, leaving you, the data analyst, with an impossible choice: either you update the customer's record, discarding their previous state and invalidating previous analyses, or you somehow create a new record in the online system and change all the references to it. Neither option is desirable, but in a data warehouse you can keep both the old and the new state of the customer and specify at what point in time it changed.

Each data source consists of a single fact table linked with multiple dimension tables.

As we create and link dimensions, an important rule is to never use the existing keys from the online system, because we have no control over how they might change or even disappear. Instead, every dimension will get its own surrogate key, called the "technical key", which is unique to the data warehouse. This also goes for the fact table.

The first step in constructing a data warehouse is to populate the data warehouse with data from the OLTP system. This step is known as ETL (Extract, Transform, Load). Extract the data needed for the fact and dimension tables from all different data sources, transform it to fit our needs, and load it into the data warehouse so it can be queried.
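The ETL step described above can be sketched end to end in a few lines. The source table, warehouse table, and the normalization applied during the transform are all invented for illustration:

```python
import sqlite3

# Extract from a "source" OLTP table, transform the values,
# load into a warehouse dimension table with surrogate keys.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (email TEXT, country TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?)",
                   [("a@x.com", " us "), ("b@y.com", "NL")])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE customer_dim (
    customer_id INTEGER PRIMARY KEY,  -- technical (surrogate) key
    email TEXT, country TEXT)""")

# Extract...
for email, country in source.execute("SELECT email, country FROM customers"):
    # ...transform (normalize inconsistently coded values)...
    country = country.strip().upper()
    # ...and load, letting the warehouse assign the surrogate key.
    warehouse.execute(
        "INSERT INTO customer_dim (email, country) VALUES (?, ?)",
        (email, country))

loaded = warehouse.execute(
    "SELECT customer_id, email, country FROM customer_dim").fetchall()
print(loaded)
```

Real ETL tools (Spoon/Kettle, SSIS, and so on) wrap this same extract/transform/load loop in configurable, restartable steps, but the shape of the work is the same.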

In order to be able to fill the central fact table, the keys to all the dimensions must be known.

Two types of dimensions (not to be confused with Ralph Kimball's slowly changing dimension types):

  1. Dimensions consisting of data already known to the online system (the data exists in another table in the online system)
  2. Dimensions that are to be generated from the fact data and surrounding sources

We generate or update the dimensions of the second type (dimensions that are to be generated from the fact data and surrounding sources) while filling the fact table, and thus know their keys at that time. However, we cannot do this for the dimensions of the first type.

In our example, the time and page dimensions are of the second type. They are generated when updating our fact table. The user dimension has to be known before then, because it is based on some independent tables in the online system. Because of this, we will fill our data warehouse in two separate transactions, ensuring first the existence of our type 1 user dimension and later the other dimensions and the fact table itself.

Updating type 1 dimensions:

Because the data exists in the source system as a separate table, all we have to do is read it and run it through Spoon's "Dimension lookup / update" step. This step is capable of creating, updating, and looking up "slowly changing dimensions" as described by Ralph Kimball. For sources that contain changing data, such as the customer records mentioned earlier, it can do this in two ways:

  • Overwriting the existing record by the updated one
  • Creating another dimension record, maintaining multiple copies of changed records over time

The second option is implemented by adding "version", "date_from", and "date_to" fields to the dimension's table, and it is almost always the more useful one.
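A minimal sketch of that second option (new versioned records with validity dates), using SQLite from Python rather than Spoon's step; the table and column names are assumptions:

```python
import sqlite3

# Type-2 slowly changing dimension: instead of overwriting a changed
# record, close the current version and insert a new one.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_id INTEGER PRIMARY KEY,  -- technical key
    email TEXT, country TEXT,
    version INTEGER, date_from TEXT, date_to TEXT)""")

def scd2_update(conn, email, country, change_date):
    """Close the current version (if any) and insert a new one."""
    row = conn.execute(
        "SELECT customer_id, version FROM customer_dim "
        "WHERE email = ? AND date_to IS NULL", (email,)).fetchone()
    version = 1
    if row:
        # End the validity period of the currently open version.
        conn.execute("UPDATE customer_dim SET date_to = ? "
                     "WHERE customer_id = ?", (change_date, row[0]))
        version = row[1] + 1
    conn.execute(
        "INSERT INTO customer_dim (email, country, version, date_from, date_to)"
        " VALUES (?, ?, ?, ?, NULL)", (email, country, version, change_date))

scd2_update(conn, "a@x.com", "US", "2013-01-01")
scd2_update(conn, "a@x.com", "NL", "2013-06-01")  # the customer moved
history = conn.execute(
    "SELECT email, country, version, date_from, date_to "
    "FROM customer_dim ORDER BY version").fetchall()
print(history)
```

The open version is the row whose "date_to" is NULL; any fact row can be matched to the version whose validity range contains the fact's date, which is what preserves old analyses.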

Because we create this dimension directly from a table in the online system, the key field to use for this dimension is trivial: it is the key used in the online system, which in our case is the user's email. We will not use this field as a key in our data warehouse; we replace it with a technical key unique to our data warehouse. The original key is merely needed so that we can retrieve this technical key later, as we will need it to link the dimension to the fact table. In our case, this technical key field is called "user_id".

In the "Fields" tab, we specify the fields that are important to this dimension, such as hierarchical data. In our case, we keep track of information like the user's full name, the company they work for, and their gender.

In the case of type 2 dimensions, the "Version field" is used to keep track of the different versions in a slowly changing dimension. The date ranges will be used to indicate the period of validity for each version. If the online system keeps track of when records change, it is useful to use the "Stream Datefield" for more accurate validity dates, but there are many more applications possible, which are outside the scope of this document. Usually, the default will suffice.

Updating type 2 dimensions and the fact table:

Now that all independent dimensions have been prepared, it is time to populate the fact table. In this process, the remaining dimensions can be updated as well, having their technical keys ready when needed.

We start by reading the basic grain for the fact table from the request log. Then the date dimension is generated, which is done by using the JavaScript step on the timestamp field in the request log. Using the "Dimension Lookup / Update" step, we put this data into our data warehouse, and keep a reference to it in the form of the generated technical key "time_id", which is added to the stream as an extra column.

Next up is the user dimension. Remember that we already updated this in the previous step; now all we need to do is link each request in the fact table to its corresponding entry. Through a complex lookup that uses the session key to match each request to a login action from the actions table, we are finally able to match a user to each request. If no user can be found for this session, we set its key (the email) to NULL, so it will match a special case created by Spoon's "Dimension Lookup / Update" step: the unknown user.

Because filtering creates separate threads, we sort the user streams afterwards using Spoon's sort step. When we reach the "Lookup user_dim" step, the stream contains an extra field named "email", which contains the email of each request, or NULL if it is not known. Remember that we specified "email" as the lookup key when updating this dimension earlier, so using this field, the transformation step can find the technical key for each entry in the stream, adding it as an extra field called "user_id".
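In outline, the lookup that resolves each request's natural key (the email) to the dimension's technical key, including the special unknown-user fallback, might look like this. It stands in for Spoon's "Dimension Lookup / Update" step, and all names and keys are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_dim (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO user_dim VALUES (0, NULL)")  # the unknown user
conn.executemany("INSERT INTO user_dim (email) VALUES (?)",
                 [("a@x.com",), ("b@y.com",)])

def lookup_user_id(conn, email):
    """Resolve a natural key to the dimension's technical key,
    falling back to the unknown-user member (key 0)."""
    if email is not None:
        row = conn.execute("SELECT user_id FROM user_dim WHERE email = ?",
                           (email,)).fetchone()
        if row:
            return row[0]
    return 0  # technical key of the unknown user

resolved = [lookup_user_id(conn, e) for e in ("a@x.com", None, "c@z.com")]
print(resolved)
```

Reserving an explicit "unknown" member means the fact table never needs NULL dimension keys, so joins and cube aggregations stay simple.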

While filling the dimension table with hierarchical data about this request (domain, path, and page), the newly generated technical key field "page_id" is added to the stream.

Finally, the data that is to go into the fact table is filtered so that only the technical keys to the dimensions and the grain's facts remain. It is then inserted into a table request_fact.

Next, we need to tell our OLAP server (Mondrian) about our data warehouse's structure. This is done using a fairly well-documented XML format, containing information about how the OLAP cubes, their dimensions, hierarchies, and measures are to be built.

In our case, the schema contains only one cube, which is called "Requests". The request cube is linked directly to the "request_facts" table. It has one measure, and four dimensions.
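For orientation, a Mondrian schema along those lines might look roughly like this, abbreviated to two dimensions. Every table and column name here is an assumption, so check it against the actual warehouse and the Mondrian schema documentation:

```xml
<!-- Hypothetical Mondrian schema sketch; names are assumptions. -->
<Schema name="RequestAnalysis">
  <Cube name="Requests">
    <Table name="request_fact"/>
    <Dimension name="Time" foreignKey="time_id">
      <Hierarchy hasAll="true" primaryKey="time_id">
        <Table name="time_dim"/>
        <Level name="Year" column="year"/>
        <Level name="Month" column="month"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="User" foreignKey="user_id">
      <Hierarchy hasAll="true" primaryKey="user_id">
        <Table name="user_dim"/>
        <Level name="Company" column="company"/>
        <Level name="User" column="full_name"/>
      </Hierarchy>
    </Dimension>
    <Measure name="Request Count" column="page_id" aggregator="count"/>
  </Cube>
</Schema>
```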

ROLAP requires a properly prepared data source in the form of a star or snowflake schema that defines a logical multi-dimensional database and maps it to a physical database model. Once we have our initial data structure in place, we must design a descriptive layer for it in the form of a Mondrian schema, which consists of one or more cubes, hierarchies, and members.

We don't have to worry too much about getting the model exactly right on the first try; just cover all of your anticipated business needs. Part of the process is coming back to the data warehouse design step and making changes to your initial data model after you've discovered your needs. What are the implications each time we redo, improve, or change our data warehouse? Do we have to tear it down and set it up again from scratch? Do we have to clean up? What does this imply for our existing operations and services?

  1. Design a star or snowflake schema. The entire process starts with a data warehouse.
  2. Populate the star / snowflake schema. After the data model is designed, the next step is to populate it with actual data, thereby creating your data warehouse.
  3. Build a Mondrian schema. Now that our initial data warehouse is complete, we must build a Mondrian schema to organize and describe it in terms that Pentaho Analysis can understand.

Do all business intelligence applications require a data warehouse?

No. Not all data warehouses are used for business intelligence nor do all business intelligence applications require a data warehouse.

What are the layers in a data warehouse architecture?

  1. OLTP
  2. ETL
  3. Metadata: There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.
  4. Informational access layer: The data accessed for reporting and analyzing and the tools for reporting and analyzing data – Business intelligence tools fall into this layer.

What are the three main approaches for designing a data warehouse?

Famous authors and data warehouse experts Ralph Kimball and Bill Inmon offer two different design methodologies for building a data warehouse; a hybrid of the two is the third approach.

  1. Kimball's approach is more of a bottom-up design, where data marts are created first for specific subject/business areas and have the capability to report and analyze. Then these data marts are combined to create a data warehouse. This approach provides a quicker path to getting the data ready for individual subjects/businesses. The major task in this design is maintaining the dimensions across multiple data marts.
  2. Inmon has defined a data warehouse as a centralized repository for the entire enterprise, in which the data warehouse is designed using a normalized enterprise data model. Data at the lowest level of detail is stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. Inmon states that the data warehouse is: Subject-oriented, Non-volatile and Integrated. This methodology generates highly consistent dimensional views of data across data marts since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage to the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the duration of time from the start of project to the point that end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.
  3. Hybrid: combine these two to provide a more comprehensive and robust data warehouse design.


Why are the schemas in a data warehouse typically designed at a level less than third normal form?

The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. As such, the tables in these schemas are not heavily normalized, and are frequently designed at a level of normalization short of third normal form.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License