Sunday, November 2, 2014

Data Warehousing - Testing

Introduction

Testing is very important for data warehouse systems to make them work correctly and efficiently. There are three basic levels of testing, which are listed below:

  • Unit Testing
  • Integration Testing
  • System Testing

UNIT TESTING


  • In unit testing, each component is tested separately.
  • In this kind of testing, each module, i.e., procedure, program, SQL script, or Unix shell script, is tested.
  • This testing is performed by the developer.

INTEGRATION TESTING


  • In this kind of testing, the various modules of the application are brought together and then tested against a number of inputs.
  • It is performed to test whether the various components work well together after integration.

SYSTEM TESTING


  • In this kind of testing, the whole data warehouse application is tested together.
  • The purpose of this testing is to check whether the entire system works correctly together or not.
  • This testing is performed by the testing team.
  • Since the whole data warehouse is very large, it is usually possible to perform only minimal system testing before the test plan proper can be enacted.

Test Schedule


  • First of all, the test schedule is created as part of developing the test plan.
  • In this schedule, we predict the estimated time required for testing the entire data warehouse system.

DIFFICULTIES IN SCHEDULING THE TESTING


  • There are different methodologies available, but none of them is perfect, because the data warehouse is very complex and large. Also, the data warehouse system is evolving in nature.
  • A simple problem may involve a very large query that takes a day or more to complete, i.e., the query does not complete in the desired time scale.
  • There may be hardware failures, such as losing a disk, or human errors, such as accidentally deleting a table or overwriting a large table.
Note: Due to the above-mentioned difficulties, it is recommended to always double the amount of time you would normally allow for testing.

Testing the backup recovery

This is a very important test that needs to be performed. Here is the list of scenarios for which this testing is needed:

  • Media failure
  • Loss or damage of tablespace or data file
  • Loss or damage of redo log file
  • Loss or damage of control file
  • Instance failure
  • Loss or damage of archive file
  • Loss or damage of table
  • Failure during data movement

Testing Operational Environment

There are a number of aspects that need to be tested. These aspects are listed below.

  • Security - A separate security document is required for security testing. This document contains the list of disallowed operations, with a test devised for each of them.
  • Scheduler - Scheduling software is required to control the daily operations of the data warehouse. It needs to be tested during system testing. The scheduling software requires an interface with the data warehouse, through which the scheduler controls the overnight processing and the management of aggregations.
  • Disk Configuration - The disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.
  • Management Tools - All the management tools need to be tested during system testing. Here is the list of tools that need to be tested:
o    Event Manager
o    System Manager
o    Database Manager
o    Configuration Manager
o    Backup Recovery Manager

Testing the Database

There are three sets of tests, which are listed below:
  • Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running, and management of the test database.
  • Testing database features - Here is the list of features that we have to test:
o    Querying in parallel
o    Creating indexes in parallel
o    Loading data in parallel
  • Testing database performance - Query execution plays a very important role in data warehouse performance measures. There is a set of fixed queries that need to be run regularly, and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take the time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.
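As an illustration of exercising the regular query set, the sketch below times a small set of fixed queries against an in-memory SQLite database. The table, the queries, and the harness are hypothetical stand-ins for a real warehouse's fixed queries:

```python
import sqlite3
import time

def time_fixed_queries(conn, queries):
    """Run each fixed query and record its elapsed wall-clock time."""
    timings = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        timings[name] = time.perf_counter() - start
    return timings

# Demonstration data: a tiny stand-in for a fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("south", 20.0), ("north", 5.0)])

fixed_queries = {
    "total_by_region": "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "grand_total": "SELECT SUM(amount) FROM sales",
}
timings = time_fixed_queries(conn, fixed_queries)
for name, secs in timings.items():
    print(f"{name}: {secs:.6f}s")
```

Recording such timings across runs gives a baseline against which different index and aggregation strategies can be compared.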

Testing The Application


  • All the managers should be integrated correctly and work together in order to ensure that the end-to-end load, index, aggregate, and query operations work as per expectations.
  • Each function of each manager should work correctly.
  • It is also necessary to test the application over a period of time.
  • The week-end and month-end tasks should also be tested.

Logistic of the Test

There is a question: what are you really testing? The answer is that you are testing a suite of data warehouse application code.
The aim of the system test is to test all of the following areas:

  • Scheduling software
  • Day-to-day operational procedures
  • Backup recovery strategy
  • Management and scheduling tools
  • Overnight processing
  • Query performance
Note: The most important point is to test the scalability. Failure to do so will leave us with a system design that does not work when the system grows.

Data Warehousing - Future Aspects

Following are the future aspects of Data Warehousing.

  • As we have seen, the size of open databases has approximately doubled in magnitude in the last few years. This change in magnitude is of great significance.
  • As the size of databases grows, the estimates of what constitutes a very large database continue to grow.
  • The hardware and software that are available today do not allow a large amount of data to be kept online. For example, a telco's call records require 10 TB of data to be kept online, which is just the size of one month's records. If records of sales, marketing, customers, employees, etc. are also required, then the size will be more than 100 TB.
  • A record contains not only textual information but also some multimedia data. Multimedia data cannot be manipulated as easily as text data. Searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.
  • Apart from size, planning, building, and running ever-larger data warehouse systems is very complex. As the number of users increases, the size of the data warehouse also increases. These users will also require access to the system.

  • With the growth of the internet, there is a requirement for users to access data online.

Data Warehousing - Tuning


Introduction
The data warehouse evolves over time, and it is unpredictable what queries the users will produce in the future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter we will discuss how to tune the different aspects of a data warehouse, such as performance, data load, and queries.
Difficulties in Data Warehouse Tuning
Here is the list of difficulties that can occur while tuning a data warehouse:
·        The data warehouse never remains constant over time.
·        It is very difficult to predict what queries the users will produce in the future.
·        The needs of the business also change with time.
·        The users and their profiles never remain the same over time.
·        A user can switch from one group to another.
·        The data load on the warehouse also changes with time.
Note: It is very important to have complete knowledge of the data warehouse.
Performance Assessment
Here is the list of objective measures of performance:
·        Average query response time
·        Scan rates
·        Time used per query per day
·        Memory usage per process
·        I/O throughput rates
Following are the points to be remembered:
·        It is necessary to specify the measures in the service level agreement (SLA).
·        It is of no use trying to tune response times if they are already better than those required.
·        It is essential to have realistic expectations while assessing performance.
·        It is also essential that the users have feasible expectations.
·        To hide the complexity of the system from the user, aggregations and views should be used.
·        It is also possible that a user writes a query you had not tuned for.
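The objective measures above can be derived from a query log. A minimal sketch, assuming a hypothetical log of (query id, response seconds, rows scanned) tuples:

```python
# Hypothetical query log gathered over one day: (query_id, response_seconds, rows_scanned).
query_log = [
    ("q1", 2.5, 120_000),
    ("q2", 0.8, 4_000),
    ("q3", 4.1, 310_000),
]

total_seconds = sum(secs for _, secs, _ in query_log)
avg_response = total_seconds / len(query_log)          # average query response time
scan_rate = sum(rows for _, _, rows in query_log) / total_seconds  # rows per second

print(f"Average query response time: {avg_response:.2f}s")
print(f"Overall scan rate: {scan_rate:.0f} rows/s")
```

Comparing these figures against the SLA targets tells you whether tuning effort is actually needed.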
Data Load Tuning
·        Data load is a very critical part of overnight processing.
·        Nothing else can run until the data load is complete.
·        This is the entry point into the system.
Note: If there is a delay in transferring the data, or in the arrival of data, then the entire system is affected badly. Therefore it is very important to tune the data load first.
There are various approaches to tuning the data load, which are discussed below:
·        The very common approach is to insert data using the SQL layer. In this approach, the normal checks and constraints need to be performed. When the data is inserted into the table, code runs to check whether there is enough space available to insert the data. If sufficient space is not available, more space may have to be allocated to these tables. These checks take time to perform and are costly in CPU. However, this approach packs the data tightly, making maximal use of the space.
·        The second approach is to bypass all these checks and constraints and place the data directly into preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data. This can lead to some space wastage.
·        The third approach is that, while loading data into a table that already contains data, we can maintain the indexes during the load.
·        The fourth approach says that, to load data into tables that already contain data, drop the indexes and recreate them when the data load is complete. Which of the third and fourth approaches is better depends on how much data is already loaded and how many indexes need to be rebuilt.
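The trade-off between the third and fourth approaches can be sketched with a toy cost model. The cost ratios below are illustrative assumptions, not vendor figures; the point is only that the decision depends on the existing data volume, the incoming volume, and the number of indexes:

```python
def choose_index_strategy(existing_rows, incoming_rows, num_indexes,
                          update_cost=1.0, rebuild_cost=0.1):
    """Illustrative cost model (assumption, not a vendor rule): per-row index
    maintenance during a load (update_cost) is taken as ~10x the per-row cost
    of a bulk rebuild (rebuild_cost). Real ratios vary by RDBMS."""
    maintain = incoming_rows * num_indexes * update_cost
    drop_and_rebuild = (existing_rows + incoming_rows) * num_indexes * rebuild_cost
    return "maintain" if maintain <= drop_and_rebuild else "drop-and-rebuild"

# Small incremental load into a huge table: keeping the indexes wins.
print(choose_index_strategy(100_000_000, 50_000, 5))
# Bulk load comparable to the table size: dropping and rebuilding wins.
print(choose_index_strategy(1_000_000, 5_000_000, 5))
```

In practice you would calibrate the two cost constants by timing trial loads on your own system rather than using assumed ratios.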
Integrity Checks
Integrity checking highly affects the performance of the load.
Following are the points to be remembered:
·        The integrity checks need to be limited because the processing required can be heavy.
·        The integrity checks should be applied on the source system to avoid degrading the performance of the data load.
Tuning Queries
We have two kinds of queries in a data warehouse:
·        Fixed Queries
·        Ad hoc Queries
FIXED QUERIES
Fixed queries are well defined. The following are examples of fixed queries:
·        Regular reports
·        Canned queries
·        Common aggregations
Tuning the fixed queries in a data warehouse is the same as in a relational database system; the only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these execution plans will allow us to spot changes in data size and data skew, as these will cause the execution plan to change.
Note: We cannot do much more on the fact table, but while dealing with dimension tables or aggregations, the usual collection of SQL tweaking, storage mechanisms, and access methods can be used to tune these queries.
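One way to store and compare execution plans is to fingerprint the plan text, so a change caused by data growth or skew shows up as a changed hash. The sketch below uses SQLite's EXPLAIN QUERY PLAN purely for demonstration; a real warehouse RDBMS has its own plan facility:

```python
import hashlib
import sqlite3

def plan_fingerprint(conn, sql):
    """Hash the query plan text so a later change in plan is easy to spot."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    plan_text = "\n".join(str(r) for r in rows)
    return hashlib.sha256(plan_text.encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (key INTEGER, val REAL)")
sql = "SELECT SUM(val) FROM fact WHERE key = 1"

baseline = plan_fingerprint(conn, sql)          # full-table scan
conn.execute("CREATE INDEX idx_key ON fact(key)")
after_index = plan_fingerprint(conn, sql)       # index search: plan changes
print(baseline != after_index)
```

Comparing the stored fingerprint against the current one on each test run flags exactly the plan drift the paragraph above warns about.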
AD HOC QUERIES
To tune the ad hoc queries, it is important to know the ad hoc users of the data warehouse. Here is the list of points that need to be understood about these users:
·        The number of users in the group
·        Whether they use ad hoc queries at regular intervals of time
·        Whether they use ad hoc queries frequently
·        Whether they use ad hoc queries occasionally at unknown intervals
·        The maximum size of query they tend to run
·        The average size of query they tend to run
·        Whether they require drill-down access to the base data
·        The elapsed login time per day
·        The peak time of daily usage
·        The number of queries they run per peak hour
Following are the points to be remembered:
·        It is important to track the users' profiles and identify the queries that are run on a regular basis.
·        It is also important to ensure that the tuning performed does not affect performance elsewhere.
·        Identify similar ad hoc queries that are frequently run.
·        If these queries are identified, then the database can be changed and new indexes can be added for those queries.
·        If these queries are identified, then new aggregations can be created specifically for those queries, which would result in their efficient execution.
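Tracking user profiles and spotting frequently run ad hoc queries can be sketched as a simple frequency count over a query log. The log entries and the threshold below are hypothetical:

```python
from collections import Counter

# Hypothetical log of (user, normalized query) pairs gathered over a week.
query_log = [
    ("alice", "sales by region by month"),
    ("bob",   "sales by region by month"),
    ("alice", "top 10 customers"),
    ("carol", "sales by region by month"),
    ("bob",   "returns by product"),
]

counts = Counter(query for _, query in query_log)
# Queries run often enough to justify a new index or aggregation.
candidates = [query for query, n in counts.items() if n >= 3]
print(candidates)
```

Each candidate query is then a target for a new index or a purpose-built aggregation, as described in the points above.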

Data Warehousing - Backup

Introduction

A data warehouse contains a large volume of data, and the data warehouse system is very complex; hence it becomes important to have a backup of all the data so that it is available for recovery in the future as per requirement. In this chapter we will discuss the issues in designing the backup strategy.

Backup Terminologies

Before proceeding further, we should know some of the backup terminologies discussed below.
·        Complete backup - In a complete backup, the entire database is backed up at the same time. This backup includes all the database files, control files, and journal files.
·        Partial backup - A partial backup is not a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-by-day basis, so that the whole database is effectively backed up once a week.
·        Cold backup - A cold backup is taken while the database is completely shut down. In a multi-instance environment, all the instances should be shut down.
·        Hot backup - A hot backup is taken while the database engine is up and running. The hot backup requirements that need to be considered vary from RDBMS to RDBMS. Hot backups are extremely useful.
·        Online backup - This is the same as a hot backup.
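The round-robin partial backup strategy described above can be sketched as a simple schedule. The tablespace names here are hypothetical; the idea is that one slice is backed up each night, so the whole database is covered once a week:

```python
# Hypothetical tablespaces; one is backed up per night in round-robin order,
# so the full database is effectively backed up once a week.
tablespaces = ["ts_sales", "ts_marketing", "ts_customer", "ts_employee",
               "ts_finance", "ts_inventory", "ts_logs"]
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

schedule = dict(zip(days, tablespaces))
for day, ts in schedule.items():
    print(f"{day}: back up {ts}")
```

With more than seven slices, the rotation simply spans more than one week; the trade-off is a longer window between successive backups of any given slice.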

Hardware Backup

It is important to decide which hardware to use for the backup. The hardware sets the upper bound on the speed at which a backup or restore can be processed. That speed depends not only on the hardware being used but also on how the hardware is connected, the bandwidth of the network, the backup software, and the speed of the server's I/O system. Here we will discuss some of the hardware choices that are available, with their pros and cons. These choices are as follows:
·        Tape Technology
·        Disk Backups

TAPE TECHNOLOGY

The tape choice can be categorized into the following.
·        Tape media
·        Standalone tape drives
·        Tape stackers
·        Tape silos

Tape Media

There exist several varieties of tape media. Some tape media standards are listed in the table below:

Tape Media    Capacity    I/O rates
DLT           40 GB       3 MB/s
3490e         1.6 GB      3 MB/s
8 mm          14 GB       1 MB/s
Other factors that need to be considered are the following:
·        Reliability of the tape medium
·        Cost of tape medium per unit
·        Scalability
·        Cost of upgrades to the tape system
·        Shelf life of tape medium
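Tape I/O rates like those in the table set a hard floor on the backup window. A minimal sketch that estimates streaming time from capacity and rate (ignoring tape changes, compression, and verification passes):

```python
def backup_hours(data_gb, rate_mb_per_s):
    """Hours needed to stream data_gb gigabytes at rate_mb_per_s.
    Ignores tape changes, compression, and verification passes."""
    seconds = (data_gb * 1024) / rate_mb_per_s
    return seconds / 3600

# One DLT drive at 3 MB/s streaming 100 GB:
print(f"{backup_hours(100, 3):.1f} hours")
```

At the DLT rate of 3 MB/s, streaming 100 GB takes roughly 9.5 hours before any tape changes, which shows why drive count and throughput matter as much as capacity.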

Standalone tape drives

The tape drives can be connected in the following ways:
·        Directly to the server
·        As network-available devices
·        Remotely to another machine
Issues in connecting the tape drives:
·        Suppose the server is a 48-node MPP machine: to which node do you connect the tape drives, and how do you spread them over the server nodes to get optimal performance with the least disruption of the server and the least internal I/O latency?
·        Connecting the tape drive as a network-available device requires the network to be up to the job of the huge data transfer rates needed. Make sure that sufficient bandwidth is available during the time you require it.
·        Connecting the tape drives remotely also requires high bandwidth.

TAPE STACKERS

A tape stacker is a method of loading multiple tapes into a single tape drive. The stacker dismounts the current tape when it has finished with it and loads the next tape; hence only one tape is available at a time to be accessed. The price and capabilities may vary, but the common ability is that they can perform unattended backups.

TAPE SILOS

Tape silos provide large storage capacities. Tape silos can store and manage thousands of tapes, and they can integrate multiple tape drives. They have the software and hardware to label and store the tapes they hold. It is very common for a silo to be connected remotely over a network or a dedicated link. We should ensure that the bandwidth of that connection is up to the job.

Other Technologies

Technologies other than tape are mentioned below:
·        Disk Backups
·        Optical jukeboxes

DISK BACKUPS

Methods of disk backups are listed below.
·        Disk-to-disk backups
·        Mirror breaking
These methods are used in OLTP systems. They minimize database downtime and maximize availability.

Disk-to-disk backups

In this kind of backup, the backup is taken to disk rather than to tape. Reasons for doing disk-to-disk backups are:
·        Speed of initial backups
·        Speed of restore
Backing up data from disk to disk is much faster than backing up to tape. However, it is an intermediate step of the backup; the data is later backed up to tape. The other advantage of disk-to-disk backups is that they give you an online copy of the latest backup.

Mirror Breaking

The idea is to have disks mirrored for resilience during the working day. When a backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.
Note: The database may need to be shut down to guarantee the consistency of the backup.

OPTICAL JUKEBOXES

Optical jukeboxes allow data to be stored near-line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or tape silo. The drawback of this technique is that it has a slower write speed than disks. But optical media provide long life and reliability, which make them a good choice of medium for archiving.

Software Backups

There are software tools available that help in the backup process. These software tools come as packages. These tools not only take backups; they also effectively manage and control backup strategies. There are many software packages available in the market; some of them are listed in the following table.
Package Name    Vendor
Networker       Legato
ADSM            IBM
Epoch           Epoch Systems
Omniback II     HP
Alexandria      Sequent

CRITERIA FOR CHOOSING SOFTWARE PACKAGES

The criteria for choosing the best software package are listed below:
·        How scalable is the product as tape drives are added?
·        Does the package have a client-server option, or must it run on the database server itself?
·        Will it work in cluster and MPP environments?
·        What degree of parallelism is required?
·        What platforms are supported by the package?
·        Does the package support easy access to information about tape contents?
·        Is the package database-aware?
·        What tape drives and tape media are supported by the package?