Data Warehousing - Metadata and Data Marting Concepts

Data Warehousing - Metadata Concepts

What is Metadata

Metadata is simply defined as data about data. The data that are used to represent other data is known as metadata. For example the index of a book serve as metadata for the contents in the book. In other words we can say that metadata is the summarized data that leads us to the detailed data. In terms of data warehouse we can define metadata as following.
·         Metadata is a road map to data warehouse.
·         Metadata in data warehouse define the warehouse objects.
·         The metadata act as a directory.This directory helps the decision support system to locate the contents of data warehouse.
Note: In data warehouse we create metadata for the data names and definitions of a given data warehouse. Along with this metadata additional metadata are also created for timestamping any extracted data, the source of extracted data.

Categories of Metadata

The metadata can be broadly categorized into three categories:
·         Business Metadata - This metadata has the data ownership information, business definition and changing policies.
·         Technical Metadata - Technical metadata includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.
·         Operational Metadata - This metadata includes currency of data and data lineage.Currency of data means whether data is active, archived or purged. Lineage of data means history of data migrated and transformation applied on it.


Role of Metadata

Metadata has very important role in data warehouse. The role of metadata in warehouse is different from the warehouse data yet it has very important role. The various roles of metadata are explained below.
·         The metadata act as a directory.
·         This directory helps the decision support system to locate the contents of data warehouse.
·         Metadata helps in decision support system for mapping of data when data are transformed from operational environment to data warehouse environment.
·         Metadata helps in summarization between current detailed data and highly summarized data.
·         Metadata also helps in summarization between lightly detailed data and highly summarized data.
·         Metadata are also used for query tools.
·         Metadata are used in reporting tools.
·         Metadata are used in extraction and cleansing tools.
·         Metadata are used in transformation tools.
·         Metadata also plays important role in loading functions.
Diagram to understand role of Metadata.

Metadata Respiratory

The Metadata Respiratory is an integral part of data warehouse system. The Metadata Respiratory has the following metadata:
·         Definition of data warehouse - This includes the description of structure of data warehouse. The description is defined by schema, view, hierarchies, derived data definitions, and data mart locations and contents.
·         Business Metadata - This metadata has the data ownership information, business definition and changing policies.
·         Operational Metadata - This metadata includes currency of data and data lineage. Currency of data means whether data is active, archived or purged. Lineage of data means history of data migrated and transformation applied on it.
·         Data for mapping from operational environment to data warehouse - This metadata includes source databases and their contents, data extraction,data partition cleaning, transformation rules, data refresh and purging rules.
·         The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing etc.

Challenges for Metadata Management

The importance of metadata can not be overstated. Metadata helps in driving the accuracy of reports, validates data transformation and ensures the accuracy of calculations. The metadata also enforces the consistent definition of business terms to business end users. With all these uses of Metadata it also has challenges for metadata management. The some of the challenges are discussed below.
·         The Metadata in a big organization is scattered across the organization. This metadata is spreaded in spreadsheets, databases, and applications.
·         The metadata could present in text file or multimedia file. To use this data for information management solution, this data need to be correctly defined.
·         There are no industry wide accepted standards. The data management solution vendors have narrow focus.
·         There is no easy and accepted methods of passing metadata.

Data Warehousing - Data Marting

Why to create Datamart

The following are the reasons to create datamart:
·         To partition data in order to impose access control strategies.
·         To speed up the queries by reducing the volume of data to be scanned.
·         To segment data into different hardware platforms.
·         To structure data in a form suitable for a user access tool.
Note: Donot data mart for any other reason since the operation cost of data marting could be very high. Before data marting, make sure that data marting strategy is appropriate for your particular solution.

Steps to determine that data mart appears to fit the bill

Following steps need to be followed to make cost effective data marting:
·         Identify the Functional Splits
·         Identify User Access Tool Requirements
·         Identify Access Control Issues


IDENTIFY THE FUNCTIONAL SPLITS

In this step we determine that whether the natural functional split is there in the organization. We look for departmental splits, and we determine whether the way in which department use information tends to be in isolation from the rest of the organization. Let's have an example...
suppose in a retail organization where the each merchant is accountable for maximizing the sales of a group of products. For this the information that is valuable is :
·         sales transaction on daily basis
·         sales forecast on weekly basis
·         stock position on daily basis
·         stock movements on daily basis
As the merchant is not interested in the products they are not dealing with, so the data marting is subset of the data dealing which the product group of interest. Following diagram shows data marting for different users.


Issues in determining the functional split:

·         The structure of the department may change.
·         The products might switch from one department to other.
·         The merchant could query the sales trend of other products to analyse what is happening to the sales.
These are issues that need to be taken into account while determining the functional split.
Note: we need to determine the business benefits and technical feasibility of using data mart.


IDENTIFY USER ACCESS TOOL REQUIREMENTS

For the user access tools that require the internal data structures we need data mart to support such tools. The data in such structures are outside the control of data warehouse but need to be populated and updated on regular basis.
There are some tools that populated directly from the source system but some can not. Therefore additional requirements outside the scope of the tool are needed to be identified for future.
Note: In order to ensure consistency of data across all access tools the data should not be directly populated from the data warehouse rather each tool must have its own data mart.

IDENTIFY ACCESS CONTROL ISSUES

There need to be privacy rules to ensure the data is accessed by the authorised users only. For example in data warehouse for retail baking institution ensure that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.
Data mart allow us to build complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems the detailed data can be removed from the data warehouse.We can create data mart for each legal entity and load it via data warehouse, with detailed account data.

Designing Data Marts

The data marts should be designed as smaller version of starflake schema with in the data warehouse and should match to the database design of the data warehouse. This helps in maintaining control on database instances.

The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables helps to utilize all dimension data in the starflake schema.

Cost Of Data Marting

The following are the cost measures for Data marting:
·         Hardware and Software Cost
·         Network Access
·         Time Window Constraints

HARDWARE AND SOFTWARE COST

Although the data marts are created on the same hardware even then they require some additional hardware and software.To handle the user queries there is need of additional processing power and disk storage. If the detailed data and the data mart exist within the data warehouse then we would face additional cost to store and manage replicated data.
Note: The data marting is more expensive than aggregations therefore it should be used as an additional strategy not as an alternative strategy.


NETWORK ACCESS

The data mart could be on different locations from the data warehouse so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.


TIME WINDOW CONSTRAINTS

The extent to which the data mart loading process will eat into the available time window will depend on the complexity of the transformations and the data volumes being shipped. Feasiblity of number of data mart depend on.
·         Network Capacity.
·         Time Window Available
·         Volume of data being transferred
·         Mechanisms being used to insert data into data mart

Data Warehousing - System Managers

Introduction

The system management is must for the successful implementation of data warehouse. In this chapter we will discuss the most important system managers such as following mentioned below.
·         System Configuration Manager
·         System Scheduling Manager
·         System Event Manager
·         System Database Manager
·         System Backup Recovery Manager

System Configuration Manager

·         The system configuration manager is responsible for the management of the setup and configuration of data warehouse.
·         The Structure of configuration manager varies from the operating system to operating system.
·         In unix structure of configuration manager varies from vendor to vendor.
·         Configuration manager have the single user interface.
·         The interface of configuration manager allow us to control of all aspects of the system.
Note: The most important configuration tool is the I/O manager.

System Scheduling Manager

The System Scheduling Manager is also responsible for the successful implementation of the data warehouse. The purpose of this scheduling manager is to schedule the ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. Features of System Scheduling Manager are following.
·         Work across cluster or MPP boundaries.
·         Deal with international time differences.
·         Handle job failure.
·         Handle multiple queries.
·         Supports job priorities.
·         Restart or requeue the failed jobs.
·         Notify the user or a process when job is completed.
·         Maintain the job schedules across system outages.
·         Requeue jobs to other queues.
·         Support the stopping and starting of queues.
·         Log Queued jobs.
·         Deal with interqueue processing.
Note: The above are the evaluation parameters for evaluation of a good scheduler.
Some important jobs that the scheduler must be able to handle are as followed:
·         Daily and ad hoc query scheduling.
·         execution of regular report requirements.
·         Data load
·         Data Processing
·         Index creation
·         Backup
·         Aggregation creation
·         data transformation
Note: If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.

System Event Manager

The event manager is a kind of a software. The event manager manages the events that are defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of data warehouse is very complex. Therefore we need a tool that automatically handle all the events without intervention of the user.
Note: The Event manager monitor the events occurrences and deal with them. the event manager also track the myriad of things that can go wrong on this complex data warehouse system.

EVENTS

The question arises is What is an event? event is nothing but the action that are generated by the user or the system itself. It may be noted that the event is measurable, observable, occurrence of defined action.
The following are the common events that are required to be tracked.
·         hardware failure.
·         Running out of space on certain key disks.
·         A process dying.
·         A process returning an error.
·         CPU usage exceeding an 805 threshold.
·         Internal contention on database serialization points.
·         Buffer cache hit ratios exceeding or failure below threshold.
·         A table reaching to maximum of its size.
·         Excessive memory swapping.
·         A table failing to extend due to lack of space.
·         Disk exhibiting I/O bottlenecks.
·         Usage of temporary or sort area reaching a certain thresholds.
·         Any other database shared memory usage.
The most important thing about is that they should be capable of executing on their own. there event packages that defined the procedures for the predefined events. The code associated with each event is known as event handler. This code is executed whenever an event occurs.

System and Database Manager

System and Database manager are the two separate piece of software but they do the same job. The objective of these tools is to automate the certain processes and to simplify the execution of others. The Criteria of choosing the system and database manager are an abitlity to:
·         increase user's Quota.
·         assign and deassign role to the users.
·         assign and deassign the profiles to the users.
·         perform database space management
·         monitor and report on space usage.
·         tidy up fragmented and unused space.
·         add and expand the space.
·         add and remove users.
·         manage user password.
·         manage summary or temporary tables.
·         assign or deassign temporary space to and from the user.
·         reclaim the space form old or outofdate temporary tables.
·         manage error and trace logs.
·         to browse log and trace files.
·         redirect error or trace information.
·         switch on and off error and trace logging.
·         perform system space management.
·         monitor and report on space usage.
·         clean up old and unused file directories.
·         add or expand space.

System Backup Recovery Manager

The backup and recovery tool make it easy for operations and management staff to backup the data. It is worth noted that the system backup manager must be integrated with the schedule manager software being used. The important features that are required for the management of backups are following.
·         Scheduling
·         Backup data tracking
·         Database awareness.
The backup are taken only to protect the data against loss. Following are the important points to remember.
·         The backup software will keep some from of database of where and when the piece of data was backed up.
·         The backup recovery manager must have a good front end to that database.
·         The backup recovery software should be database aware.

·         Being aware of database the software then can be addressed in database terms, and will not perform backups that would not be viable.

Comments

Popular posts from this blog

How to Configure Report Scheduler to work on your Jasper Server

Data Warehousing - Process Managers & Security

Data Warehousing - System Processes & Architecture