Data Warehousing - Terminologies & Delivery Process
Data Warehousing -
Terminologies
In this article, we will
discuss some of the commonly used terms in Data Warehouse.
Data
Warehouse
Data warehouse is subject
Oriented, Integrated, Time-Variant and nonvolatile collection of data that
support of management's decision making process. Let's explore this Definition
of data warehouse.
·
Subject Oriented - The Data warehouse is subject oriented because it provide us the
information around a subject rather the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue etc. The
data warehouse does not focus on the ongoing operations rather it focuses on
modelling and analysis of data for decision making.
·
Integrated - Data Warehouse is constructed by integration of data from
heterogeneous sources such as relational databases, flat files etc. This
integration enhance the effective analysis of data.
·
Time-Variant - The Data in Data Warehouse is identified with a particular time
period. The data in data warehouse provide information from historical point of
view.
·
Non Volatile - Non volatile means that the previous data is not removed when
new data is added to it. The data warehouse is kept separate from the
operational database therefore frequent changes in operational database is not
reflected in data warehouse.
·
Metadata - Metadata is simply defined as data about data. The data that are
used to represent other data is known as metadata. For example the index of a
book serve as metadata for the contents in the book.In other words we can say
that metadata is the summarized data that lead us to the detailed data.
In terms of data warehouse we
can define metadata as following:
·
Metadata is a road map to data warehouse.
·
Metadata in data warehouse define the warehouse objects.
·
The metadata act as a directory.This directory helps the decision
support system to locate the contents of data warehouse.
Metadata
Respiratory
The Metadata Respiratory is
an integral part of data warehouse system. The Metadata Respiratory contains
the following metadata:
·
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
·
Operational Metadata -This metadata includes currency of data and data lineage.
Currency of data means whether data is active, archived or purged. Lineage of
data means history of data migrated and transformation applied on it.
·
Data for mapping from operational
environment to data warehouse -This
metadata includes source databases and their contents, data extraction,data
partition, cleaning, transformation rules, data refresh and purging rules.
·
The algorithms for summarization - This includes dimension algorithms, data on granularity,
aggregation, summarizing etc.
Data cube
Data cube help us to
represent the data in multiple dimensions. The data cube is defined by
dimensions and facts. The dimensions are the entities with respect to which an
enterprise keep the records.
Illustration
of Data cube
Suppose a company wants to
keep track of sales records with help of sales data warehouse with respect to
time, item, branch and location. These dimensions allow to keep track of
monthly sales and at which branch the items were sold.There is a table
associated with each dimension. This table is known as dimension table. This
dimension table further describes the dimensions. For example "item"
dimension table may have attributes such as item_name, item_type and
item_brand.
The following table
represents 2-D view of Sales Data for a company with respect to time,item and
location dimensions.
The above 3-D table can be
represented as 3-D data cube as shown in the following figure:
Data mart
Data mart contains the subset
of organisation-wide data. This subset of data is valuable to specific group of
an organisation. in other words we can say that data mart contains only that
data which is specific to a particular group. For example the marketing data
mart may contain only data related to item, customers and sales. The data mart
are confined to subjects.
Points to
remember about data marts:
·
window based or Unix/Linux based servers are used to implement
data marts. They are implemented on low cost server.
·
The implementation cycle of data mart is measured in short period
of time i.e. in weeks rather than months or years.
·
The life cycle of a data mart may be complex in long run if it's
planning and design are not organisation-wide.
Data mart
are small in size.
·
Data mart are customized by department.
·
The source of data mart is departmentally structured data
warehouse.
·
Data mart are flexible.
Graphical Representation of
data mart.
Virtual
Warehouse
The view over a operational
data warehouse is known as virtual warehouse. It is easy to built the virtual
warehouse. Building the virtual warehouse requires excess capacity on
operational database servers.
Data Warehousing -
Delivery Process
Introduction
The data warehouse are never
static. It evolves as the business increases. The today's need may be different
from the future needs.We must design the data warehouse to change constantly.
The real problem is that business itself is not aware of its requirement for
information in the future.As business evolves it's need also changes therefore
the data warehuose must be designed to ride with these changes. Hence the data
warehouse systems need to be flexible.
There should be a delivery
process to deliver the data warehouse.But there are many issues in data
warehouse projects that it is very difficult to complete the task and
deliverables in the strict, ordered fashion demanded by waterfall method
because the requirements are hardly fully understood. Hence when the
requirements are completed only then the architectures designs, and build
components can be completed.
Delivery
Method
The delivery method is a variant
of the joint application development approach, adopted for delivery of data
warehouse. We staged the data warehouse delivery process to minimize the risk.
The approach that i will discuss does not reduce the overall delivery
time-scales but ensures business benefits are delivered incrementally through
the development process.
Note: The delivery process is broken into
phases to reduce the project and delivery risk.
Following diagram Explain the
Stages in delivery process:
IT
Strategy
Data warehouse are strategic
investments, that require business process to generate the project benefits. IT
Strategy is required to procure and retain funding for the project.
Business
Case
The objective of Business
case is to know the projected business benefits that should be derived from
using the data warehouse. These benefits may not be quantifiable but the
projected benefits need to be clearly stated.. If the data warehouse does not
have a clear business case then the business tend to suffer from the
credibility problems at some stage during the delivery process.Therefore in
data warehouse project we need to understand the business case for investment.
Education
and Prototyping
The organization will
experiment with the concept of data analysis and educate themselves on the
value of data warehouse before determining that a data warehouse is prior
solution. This is addressed by prototyping. This prototyping activity helps in
understanding the feasibility and benefits of a data warehouse. The Prototyping
activity on a small scale can further the educational process as long as:
·
The prototype address a defined technical objective.
·
The prototype can be thrown away after the feasibility concept has
been shown.
·
The activity addresses a small subset of eventual data content if
the data warehouse.
·
The activity timescale is non- critical.
Points to remember to produce
an early release of a part of a data warehouse to deliver business benefits.
·
Identify the architecture that is capable of evolving.
·
Focus on the business requirements and technical blueprint phases.
·
Limit the scope of the first build phase to the minimum that
delivers business benefits.
·
Understand the short term and medium term requirements of the data
warehouse.
Business
Requirements
To provide the quality
deliverables we should make sure that overall requirements are understood. The
business requirements and the technical blueprint stages are required because
of the following reasons:
·
If we understand the business requirements for both short and
medium term then we can design a solution that satisfies the short term need.
·
This would be capable of growing to the full solution.
Things to determine in this
stage are following.
·
The business rule to be applied on data.
·
The logical model for information within the data warehouse.
·
The query profiles for the immediate requirement.
·
The source systems that provide this data.
Technical
Blueprint
This phase need to deliver an
overall architecture satisfying the long term requirements. This phase also
deliver the components that must be implemented in a short term to derive any
business benefit. The blueprint need to identify the followings.
·
The overall system architecture.
·
The data retention policy.
·
The backup and recovery strategy.
·
The server and data mart architecture.
·
The capacity plan for hardware and infrastructure.
·
The components of database design.
Building
the version
·
In this stage the first production deliverable is produced.
·
This production deliverable smallest component of data warehouse.
·
This smallest component adds business benefit.
History
Load
This is the phase where the
remainder of the required history is loaded into the data warehouse. In this
phase we do not add the new entities but additional physical tables would
probably be created to store the increased data volumes.
Let's have an example,
Suppose the build version phase has delivered a retail sales analysis data
warehouse with 2 months worth of history. This information will allow the user
to analyse only the recent trends and address the short term issues. The user
can not identify the annual and seasonal trends. So the 2 years worth of sales
history could be loaded from the archive to make user to analyse the sales
trend yearly and seasonal. Now the 40GB data is extended to 400GB.
Note:The
backup and recovery procedures may become complex therefore it is recommended
that perform this activity within separate phase.
Ad hoc
Query
·
In this phase we configure an ad hoc query tool.
·
This ad hoc query tool is used to operate the data warehouse.
·
These tools can generate the database query.
Note:It is
recommended that not to use these access tolls when database is being
substantially modified.
Automation
In this phase operational
management processes are fully automated. These would include:
·
Transforming the data into a form suitable for analysis.
·
Monitoring query profiles and determining the appropriate
aggregations to maintain system performance.
·
Extracting and loading the data from different source systems.
·
Generating aggregations from predefined definitions within the data
warehouse.
·
Backing Up, restoring and archiving the data.
Extending
Scope
In this phase the data
warehouse is extended to address a new set of business requirements. The scope
can be extended in two ways:
·
By loading additional data into the data warehouse.
·
By introducing new data marts using the existing information.
Note:This
phase should be performed separately since this phase involves substantial
efforts and complexity.
Requirements
Evolution
From the perspective of
delivery process the requirement are always changeable. They are not static.The
delivery process must support this and allow these changes to be reflected
within the system.
This issue is addressed by
designing the data warehouse around the use of data within business processes,
as opposed to the data requirements of existing queries.
The architecture is designed
to change and grow to match the business needs,the process operates as a pseudo
application development process, where the new requirements are continually fed
into the development activities. The partial deliverables are produced.These
partial deliverables are fed back to users and then reworked ensuring that
overall system is continually updated to meet the business needs.
Comments
Post a Comment