Data Warehousing - Metadata and Data Marting Concepts
Data Warehousing -
Metadata Concepts
What is
Metadata
Metadata is simply defined as
data about data. The data that are used to represent other data is known as
metadata. For example the index of a book serve as metadata for the contents in
the book. In other words we can say that metadata is the summarized data that
leads us to the detailed data. In terms of data warehouse we can define
metadata as following.
·
Metadata is a road map to data warehouse.
·
Metadata in data warehouse define the warehouse objects.
·
The metadata act as a directory.This directory helps the decision
support system to locate the contents of data warehouse.
Note: In data warehouse we create metadata for the data names and
definitions of a given data warehouse. Along with this metadata additional
metadata are also created for timestamping any extracted data, the source of
extracted data.
Categories
of Metadata
The metadata can be broadly
categorized into three categories:
·
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
·
Technical Metadata - Technical metadata includes database system names, table and
column names and sizes, data types and allowed values. Technical metadata also
includes structural information such as primary and foreign key attributes and
indices.
·
Operational Metadata - This metadata includes currency of data and data
lineage.Currency of data means whether data is active, archived or purged.
Lineage of data means history of data migrated and transformation applied on
it.
Role of
Metadata
Metadata has very important
role in data warehouse. The role of metadata in warehouse is different from the
warehouse data yet it has very important role. The various roles of metadata
are explained below.
·
The metadata act as a directory.
·
This directory helps the decision support system to locate the
contents of data warehouse.
·
Metadata helps in decision support system for mapping of data when
data are transformed from operational environment to data warehouse
environment.
·
Metadata helps in summarization between current detailed data and
highly summarized data.
·
Metadata also helps in summarization between lightly detailed data
and highly summarized data.
·
Metadata are also used for query tools.
·
Metadata are used in reporting tools.
·
Metadata are used in extraction and cleansing tools.
·
Metadata are used in transformation tools.
·
Metadata also plays important role in loading functions.
Diagram to understand role of
Metadata.
Metadata
Respiratory
The Metadata Respiratory is
an integral part of data warehouse system. The Metadata Respiratory has the
following metadata:
·
Definition of data warehouse - This includes the description of structure of data warehouse.
The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
·
Business Metadata - This metadata has the data ownership information, business
definition and changing policies.
·
Operational Metadata - This metadata includes currency of data and data lineage.
Currency of data means whether data is active, archived or purged. Lineage of
data means history of data migrated and transformation applied on it.
·
Data for mapping from operational
environment to data warehouse - This
metadata includes source databases and their contents, data extraction,data
partition cleaning, transformation rules, data refresh and purging rules.
·
The algorithms for summarization - This includes dimension algorithms, data on granularity,
aggregation, summarizing etc.
Challenges
for Metadata Management
The importance of metadata
can not be overstated. Metadata helps in driving the accuracy of reports,
validates data transformation and ensures the accuracy of calculations. The
metadata also enforces the consistent definition of business terms to business
end users. With all these uses of Metadata it also has challenges for metadata
management. The some of the challenges are discussed below.
·
The Metadata in a big organization is scattered across the
organization. This metadata is spreaded in spreadsheets, databases, and
applications.
·
The metadata could present in text file or multimedia file. To use
this data for information management solution, this data need to be correctly
defined.
·
There are no industry wide accepted standards. The data management
solution vendors have narrow focus.
·
There is no easy and accepted methods of passing metadata.
Data Warehousing -
Data Marting
Why to
create Datamart
The following are the reasons
to create datamart:
·
To partition data in order to impose access control strategies.
·
To speed up the queries by reducing the volume of data to be
scanned.
·
To segment data into different hardware platforms.
·
To structure data in a form suitable for a user access tool.
Note: Donot data mart for any other reason since the operation cost of
data marting could be very high. Before data marting, make sure that data
marting strategy is appropriate for your particular solution.
Steps to
determine that data mart appears to fit the bill
Following steps need to be
followed to make cost effective data marting:
·
Identify the Functional Splits
·
Identify User Access Tool Requirements
·
Identify Access Control Issues
IDENTIFY THE FUNCTIONAL SPLITS
In
this step we determine that whether the natural functional split is there in
the organization. We look for departmental splits, and we determine whether the
way in which department use information tends to be in isolation from the rest
of the organization. Let's have an example...
suppose in a retail
organization where the each merchant is accountable for maximizing the sales of
a group of products. For this the information that is valuable is :
·
sales transaction on daily basis
·
sales forecast on weekly basis
·
stock position on daily basis
·
stock movements on daily basis
As the merchant is not
interested in the products they are not dealing with, so the data marting is
subset of the data dealing which the product group of interest. Following
diagram shows data marting for different users.
Issues in
determining the functional split:
·
The structure of the department may change.
·
The products might switch from one department to other.
·
The merchant could query the sales trend of other products to
analyse what is happening to the sales.
These are issues that need to
be taken into account while determining the functional split.
Note: we need to determine the business benefits and technical
feasibility of using data mart.
IDENTIFY USER ACCESS TOOL REQUIREMENTS
For the user
access tools that
require the internal data structures we need data mart to support such tools.
The data in such structures are outside the control of data warehouse but need
to be populated and updated on regular basis.
There are some tools that
populated directly from the source system but some can not. Therefore
additional requirements outside the scope of the tool are needed to be
identified for future.
Note: In order to ensure consistency of data across all access tools the
data should not be directly populated from the data warehouse rather each tool
must have its own data mart.
IDENTIFY ACCESS CONTROL ISSUES
There
need to be privacy rules to ensure the data is accessed by the authorised users
only. For example in data warehouse for retail baking institution ensure that
all the accounts belong to the same legal entity. Privacy laws can force you to
totally prevent access to information that is not owned by the specific bank.
Data mart allow us to build
complete wall by physically separating data segments within the data warehouse.
To avoid possible privacy problems the detailed data can be removed from the
data warehouse.We can create data mart for each legal entity and load it via
data warehouse, with detailed account data.
Designing
Data Marts
The data marts should be
designed as smaller version of starflake schema with in the data warehouse and
should match to the database design of the data warehouse. This helps in
maintaining control on database instances.
The summaries are data marted
in the same way as they would have been designed within the data warehouse.
Summary tables helps to utilize all dimension data in the starflake schema.
Cost Of
Data Marting
The following are the cost
measures for Data marting:
·
Hardware and Software Cost
·
Network Access
·
Time Window Constraints
HARDWARE AND SOFTWARE COST
Although
the data marts are created on the same hardware even then they require some
additional hardware and software.To handle the user queries there is need of
additional processing power and disk storage. If the detailed data and the data
mart exist within the data warehouse then we would face additional cost to
store and manage replicated data.
Note: The data marting is more expensive than aggregations therefore it
should be used as an additional strategy not as an alternative strategy.
NETWORK ACCESS
The data mart could be on different locations from the data
warehouse so we should ensure that the LAN or WAN has the capacity to handle
the data volumes being transferred within the data mart load process.
TIME WINDOW CONSTRAINTS
The
extent to which the data mart loading process will eat into the available time
window will depend on the complexity of the transformations and the data
volumes being shipped. Feasiblity of number of data mart depend on.
·
Network Capacity.
·
Time Window Available
·
Volume of data being transferred
·
Mechanisms being used to insert data into data mart
Data Warehousing -
System Managers
Introduction
The system management is must
for the successful implementation of data warehouse. In this chapter we will
discuss the most important system managers such as following mentioned below.
·
System Configuration Manager
·
System Scheduling Manager
·
System Event Manager
·
System Database Manager
·
System Backup Recovery Manager
System
Configuration Manager
·
The system configuration manager is responsible for the management
of the setup and configuration of data warehouse.
·
The Structure of configuration manager varies from the operating
system to operating system.
·
In unix structure of configuration manager varies from vendor to
vendor.
·
Configuration manager have the single user interface.
·
The interface of configuration manager allow us to control of all
aspects of the system.
Note: The most important configuration tool is the I/O manager.
System
Scheduling Manager
The System Scheduling Manager
is also responsible for the successful implementation of the data warehouse.
The purpose of this scheduling manager is to schedule the ad hoc queries. Every
operating system has its own scheduler with some form of batch control
mechanism. Features of System Scheduling Manager are following.
·
Work across cluster or MPP boundaries.
·
Deal with international time differences.
·
Handle job failure.
·
Handle multiple queries.
·
Supports job priorities.
·
Restart or requeue the failed jobs.
·
Notify the user or a process when job is completed.
·
Maintain the job schedules across system outages.
·
Requeue jobs to other queues.
·
Support the stopping and starting of queues.
·
Log Queued jobs.
·
Deal with interqueue processing.
Note: The above are the evaluation parameters for evaluation of a good
scheduler.
Some important jobs that the
scheduler must be able to handle are as followed:
·
Daily and ad hoc query scheduling.
·
execution of regular report requirements.
·
Data load
·
Data Processing
·
Index creation
·
Backup
·
Aggregation creation
·
data transformation
Note: If the data warehouse is running on a cluster or MPP architecture,
then the system scheduling manager must be capable of running across the
architecture.
System
Event Manager
The event manager is a kind
of a software. The event manager manages the events that are defined on the
data warehouse system. We cannot manage the data warehouse manually because the
structure of data warehouse is very complex. Therefore we need a tool that
automatically handle all the events without intervention of the user.
Note: The Event manager monitor the events occurrences and deal with
them. the event manager also track the myriad of things that can go wrong on
this complex data warehouse system.
EVENTS
The
question arises is What is an event? event is nothing but the action that are
generated by the user or the system itself. It may be noted that the event is
measurable, observable, occurrence of defined action.
The following are the common
events that are required to be tracked.
·
hardware failure.
·
Running out of space on certain key disks.
·
A process dying.
·
A process returning an error.
·
CPU usage exceeding an 805 threshold.
·
Internal contention on database serialization points.
·
Buffer cache hit ratios exceeding or failure below threshold.
·
A table reaching to maximum of its size.
·
Excessive memory swapping.
·
A table failing to extend due to lack of space.
·
Disk exhibiting I/O bottlenecks.
·
Usage of temporary or sort area reaching a certain thresholds.
·
Any other database shared memory usage.
The most important thing
about is that they should be capable of executing on their own. there event
packages that defined the procedures for the predefined events. The code
associated with each event is known as event handler. This code is executed whenever
an event occurs.
System and
Database Manager
System and Database manager
are the two separate piece of software but they do the same job. The objective
of these tools is to automate the certain processes and to simplify the
execution of others. The Criteria of choosing the system and database manager
are an abitlity to:
·
increase user's Quota.
·
assign and deassign role to the users.
·
assign and deassign the profiles to the users.
·
perform database space management
·
monitor and report on space usage.
·
tidy up fragmented and unused space.
·
add and expand the space.
·
add and remove users.
·
manage user password.
·
manage summary or temporary tables.
·
assign or deassign temporary space to and from the user.
·
reclaim the space form old or outofdate temporary tables.
·
manage error and trace logs.
·
to browse log and trace files.
·
redirect error or trace information.
·
switch on and off error and trace logging.
·
perform system space management.
·
monitor and report on space usage.
·
clean up old and unused file directories.
·
add or expand space.
System
Backup Recovery Manager
The backup and recovery tool
make it easy for operations and management staff to backup the data. It is
worth noted that the system backup manager must be integrated with the schedule
manager software being used. The important features that are required for the
management of backups are following.
·
Scheduling
·
Backup data tracking
·
Database awareness.
The backup are taken only to
protect the data against loss. Following are the important points to remember.
·
The backup software will keep some from of database of where and
when the piece of data was backed up.
·
The backup recovery manager must have a good front end to that
database.
·
The backup recovery software should be database aware.
·
Being aware of database the software then can be addressed in
database terms, and will not perform backups that would not be viable.
Comments
Post a Comment