Data Warehousing - System Processes & Architecture
Data Warehousing -
System Processes
We
have fixed number of operations to be applied on operational databases and we
have well defined techniques such as use
normalized data,keep table small etc.
These techniques are suitable for delivering a solution. But in case of
decision support system we do not know what query and operation need to be
executed in future. Therefore techniques applied on operational databases are
not suitable for data warehouses.
In this chapter We'll focus
on designing data warehousing solution built on the top open-system
technologies like Unix and relational databases.
Process
Flow in Data Warehouse
There are four major
processes that build a data warehouse. Here is the list of four processes:
·
Extract and load data.
·
Cleaning and transforming the data.
·
Backup and Archive the data.
·
Managing queries & directing them to the appropriate data
sources.
Extract
and Load Process
·
The Data Extraction takes data from the source systems.
·
Data load takes extracted data and loads it into data warehouse.
Note: Before loading the data into data warehouse the information
extracted from external sources must be reconstructed.
Points to remember while
extract and load process:
·
Controlling the process
·
When to Initiate Extract
·
Loading the Data
CONTROLLING THE PROCESS
Controlling
the process involves determining that when to start data extraction and
consistency check on data. Controlling process ensures that tools, logic
modules, and the programs are executed in correct sequence and at correct time.
WHEN TO INITIATE EXTRACT
Data
need to be in consistent state when it is extracted i.e. the data warehouse
should represent single, consistent version of information to the user.
For example in a customer
profiling data warehouse in telecommunication sector it is illogical to merge
list of customers at 8 pm on wednesday from a customer database with the
customer subscription events up to 8 pm on tuesday. This would mean that we are
finding the customers for whom there are no associated subscription.
LOADING THE DATA
After
extracting the data it is loaded into a temporary data store.Here in the
temporary data store it is cleaned up and made consistent.
Note: Consistency checks are executed only when all data sources have
been loaded into temporary data store.
Clean and
Transform Process
Once data is extracted and
loaded into temporary data store it is the time to perform Cleaning and
Transforming. Here is the list of steps involved in Cleaning and Transforming:
·
Clean and Transform the loaded data into a structure.
·
Partition the data.
·
Aggregation
CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE
This
will speed up the queries.This can be done in the following ways:
·
Make sure data is consistent within itself.
·
Make sure data is consistent with other data within the same data
source.
·
Make sure data is consistent with data in other source systems.
·
Make sure data is consistent with data already in the warehouse.
Transforming involves
converting the source data into a structure. Structuring the data will result
in increases query performance and decreases operational cost. Information in
data warehouse must be transformed to support performance requirement from the
business and also the ongoing operational cost.
PARTITION THE DATA
It
will optimize the hardware performance and simplify the management of data
warehouse. In this we partition each fact table into a multiple separate
partitions.
AGGREGATION
Aggregation
is required to speed up the common queries. Aggregation rely on the fact that
most common queries will analyse a subset or an aggregation of the detailed
data.
Backup and
Archive the data
In order to recover the data
in event of data loss, software failure or hardware failure it is necessary to
backed up on regular basis.Archiving involves removing the old data from the
system in a format that allow it to be quickly restored whenever required.
For example in a retail sales
analysis data warehouse, it may be required to keep data for 3 years with
latest 6 months data being kept online. In this kind of scenario there is often
requirement to be able to do month-on-month comparisons for this year and last
year. In this case we require some data to be restored from the archive.
Query
Management Process
This process performs the
following functions
·
This process manages the queries.
·
This process speed up the queries execution.
·
This Process direct the queries to most effective data sources.
·
This process should also ensure that all system sources are used
in most effective way.
·
This process is also required to monitor actual query profiles.
·
Information in this process is used by warehouse management
process to determine which aggregations to generate.
·
This process does not generally operate during regular load of
information into data warehouse.
Data Warehousing -
Architecture
In this article, we will
discuss the business analysis framework for data warehouse design and
architecture of a data warehouse.
Business
Analysis Framework
The business analyst get the
information from the data warehouses to measure the performance and make
critical adjustments in order to win over other business holders in the market.
Having data warehouse has the following advantages for the business.
·
Since the data warehouse can gather the information quickly and
efficiently therefore it can enhance the business productivity.
·
The data warehouse provides us the consistent view of customers
and items hence help us to manage the customer relationship.
·
The data warehouse also helps in bringing cost reduction by
tracking trends, patterns over a long period in a consistent and reliable
manner.
To
design an effective and efficient data warehouse we are required to understand
and analyze the business needs and construct a business analysis framework. Each
person has different views regarding the design of a data warehouse. These
views are as follows:
·
The top-down view - This view allows the selection of relevant information needed
for data warehouse.
·
The data source view - This view presents the information being captured, stored, and
managed by operational system.
·
The data warehouse view - This view includes the fact tables and dimension tables.This
represent the information stored inside the data warehouse.
·
The Business Query view - It is the view of the data from the viewpoint of the end user.
Three-Tier
Data Warehouse Architecture
Generally the data warehouses
adopt the three-tier architecture. Following are the three tiers of data
warehouse architecture.
·
Bottom Tier - The bottom tier of the architecture is the data warehouse
database server.It is the relational database system.We use the back end tools
and utilities to feed data into bottom tier.these back end tools and utilities
performs the Extract, Clean, Load, and refresh functions.
·
Middle Tier - In the middle tier we have OLAp Server. the OLAP Server can be
implemented in either of the following ways.
o By
relational OLAP (ROLAP), which is an extended relational database management
system. The ROLAP maps the operations on multidimensional data to standard
relational operations.
o By
Multidimensional OLAP (MOLAP) model, which directly implements multidimensional
data and operations.
·
Top-Tier - This tier is the front-end client layer. This layer hold the
query tools and reporting tool, analysis tools and data mining tools.
Following diagram explains
the Three-tier Architecture of Data warehouse:
Data
Warehouse Models
From the perspective of data
warehouse architecture we have the following data warehouse models:
·
Virtual Warehouse
·
Data mart
·
Enterprise Warehouse
VIRTUAL WAREHOUSE
·
The view over a operational data warehouse is known as virtual
warehouse. It is easy to built the virtual warehouse.
·
Building the virtual warehouse requires excess capacity on
operational database servers.
DATA MART
·
Data mart contains the subset of organisation-wide data.
·
This subset of data is valuable to specific group of an
organisation
Note: in other words we can say that data mart contains only that data
which is specific to a particular group. For example the marketing data mart
may contain only data related to item, customers and sales. The data mart are
confined to subjects.
Points to remember about data
marts
·
window based or Unix/Linux based servers are used to implement
data marts. They are implemented on low cost server.
·
The implementation cycle of data mart is measured in short period
of time i.e. in weeks rather than months or years.
·
The life cycle of a data mart may be complex in long run if it's
planning and design are not organisation-wide.
·
Data mart are small in size.
·
Data mart are customized by department.
·
The source of data mart is departmentally structured data
warehouse.
·
Data mart are flexible.
ENTERPRISE WAREHOUSE
·
The enterprise warehouse collects all the information all the
subjects spanning the entire organization
·
This provide us the enterprise-wide data integration.
·
This provide us the enterprise-wide data integration.
·
The data is integrated from operational systems and external
information providers.
·
This information can vary from a few gigabytes to hundreds of
gigabytes, terabytes or beyond.
Load
Manager
·
This Component performs the operations required to extract and
load process.
·
The size and complexity of load manager varies between specific
solutions from data warehouse to data warehouse.
LOAD MANAGER ARCHITECTURE
The
load manager performs the following functions:
·
Extract the data from source system.
·
Fast Load the extracted data into temporary data store.
·
Perform simple transformations into structure similar to the one
in the data warehouse.
EXTRACT DATA FROM SOURCE
The
data is extracted from the operational databases or the external information
providers. Gateways is the application programs that are used to extract data.
It is supported by underlying DBMS and allows client program to generate SQL to
be executed at a server. Open Database Connection( ODBC), Java Database
Connection (JDBC), are examples of gateway.
FAST LOAD
·
In order to minimize the total load window the data need to be
loaded into the warehouse in the fastest possible time.
·
The transformations affects the speed of data processing.
·
It is more effective to load the data into relational database
prior to applying transformations and checks.
·
Gateway technology proves to be not suitable, since they tend not
be performant when large data volumes are involved.
SIMPLE TRANSFORMATIONS
While
loading it may be required to perform simple transformations. After this has
been completed we are in position to do the complex checks. Suppose we are
loading the EPOS sales transaction we need to perform the following checks:
·
Strip out all the columns that are not required within the
warehouse.
·
Convert all the values to required data types.
Warehouse
Manager
·
Warehouse manager is responsible for the warehouse management
process.
·
The warehouse manager consist of third party system software, C
programs and shell scripts.
·
The size and complexity of warehouse manager varies between specific
solutions.
WAREHOUSE MANAGER ARCHITECTURE
The
warehouse manager includes the following:
·
The Controlling process
·
Stored procedures or C with SQL
·
Backup/Recovery tool
·
SQL Scripts
OPERATIONS PERFORMED BY WAREHOUSE MANAGER
·
Warehouse manager analyses the data to perform consistency and
referential integrity checks.
·
Creates the indexes, business views, partition views against the
base data.
·
Generates the new aggregations and also updates the existing
aggregation. Generates the normalizations.
·
Warehouse manager Warehouse manager transforms and merge the
source data into the temporary store into the published data warehouse.
·
Backup the data in the data warehouse.
·
Warehouse Manager archives the data that has reached the end of
its captured life.
Note: Warehouse Manager also analyses query profiles to determine index
and aggregations are appropriate.
Query
Manager
·
Query Manager is responsible for directing the queries to the
suitable tables.
·
By directing the queries to appropriate table the query request
and response process is speed up.
·
Query Manager is responsible for scheduling the execution of the
queries posed by the user.
QUERY MANAGER ARCHITECTURE
Query
Manager includes the following:
·
The query redirection via C tool or RDBMS.
·
Stored procedures.
·
Query Management tool.
·
Query Scheduling via C tool or RDBMS.
·
Query Scheduling via third party Software.
Detailed
information
The following diagram shows
the detailed information
The detailed information is
not kept online rather is aggregated to the next level of detail and then
archived to the tape. The detailed infomation part of data warehouse keep the
detailed information in the starflake schema. the detailed information is
loaded into the data warehouse to supplement the aggregated data.
Note: If the detailed information is held offline to minimize the disk
storage we should make sure that the data has been extracted, cleaned up, and
transformed then into starflake schema before it is archived.
Summary
Information
·
In this area of data warehouse the predefined aggregations are
kept.
·
These aggregations are generated by warehouse manager.
·
This area changes on ongoing basis in order to respond to the
changing query profiles.
·
This area of data warehouse must be treated as transient.
Points to remember about
summary information.
·
The summary data speed up the performance of common queries.
·
It increases the operational cost.
·
It need to be updated whenever new data is loaded into the data
warehouse.
·
It may not have been backed up, since it can be generated fresh
from the detailed information.
I was finding the most critical data warehousing service provider due to which I found your blog to be the topmost reputable blog. The services offered by your blog helped me in delivering the big data into the useful and standard quality of the format. If you people want to read more and unique features about the powerful data warehouse blog, then you can visit the official site of this blog.
ReplyDelete