Data Warehousing - Tuning
Introduction
The data warehouse evolves throughout the period of time and the it is
unpredictable that what query the user is going to be produced in future.
Therefore it becomes more difficult to tune data warehouse system. In this
chapter we will discuss about how to tune the different aspects of data
warehouse such as performance, data load, queries ect.
Difficulties in Data Warehouse Tuning
Here is the list of difficulties that can occur while tuning the data
warehouse.
·
The data warehouse never remain constant throughout the period of time.
·
It is very difficult to predict that what query the user is going to
produce in future.
·
The need of the business also changes with time.
·
The users and their profile never remains the same with time.
·
The user can switch from one group to another.
·
the data load on the warehouse also changes with time.
Note: It is very important to have the complete knowledge of data
warehouse.
Performance Assessment
Here is the list of objective measures of performance.
·
Average query response time
·
Scan rates.
·
Time used per day query.
·
Memory usage per process.
·
I/O throughput rates
Following are the points to be remembered.
·
It is necessary to specify the measures in service level agreement(SLA).
·
It is of no use to trying to tune response time if they are already
better than those required.
·
It is essential to have realistic expectations while performance
assessment.
·
It is also essential that the users have the feasible expectations.
·
To hide the complexity of the system from the user the aggregations and
views should be used.
·
It is also possible that the user can write a query you had not tuned
for.
Data Load Tuning
·
Data Load is very critical part of overnight processing.
·
Nothing else can run until data load is complete.
·
This is the entry point into the system.
Note: If there is delay in transferring the data or in arrival of data then
the entire system is effected badly. Therefore it is very important to tune the
data load first.
There are various approaches of tuning data load that are discussed
below:
·
The very common approach is to insert data using the SQL Layer.
In this approach the normal checks and constraints need to be performed. When
the data is inserted into the table the code will run to check is there enough
space available to insert the data. if the sufficient space is not available
then more space may have to be allocated to these tables. These checks take
time to perform and are costly to CPU. But pack the data tightly by making
maximal use of space.
·
The second approach is to bypass all these checks and constraints and
place the data directly into preformatted blocks. These blocks are later
written to the database. It is faster than the first approach but it can work
only with the whole blocks of data. This can lead to some space wastage.
·
The third approach is that while loading the data into the table that
already contains the table, we can either maintain the indexes.
·
The fourth approach says that to load the data in tables that already
contains the data, drop the indexes & recreate them when
the data load is complete. Out of third and fourth, which approach is better
depends on how much data is already loaded and how many indexes need to be
rebuilt.
Integrity Checks
The integrity checking highly affects the performance of the load
Following are the points to be remembered.
·
The integrity checks need to be limited because processing required can
be heavy.
·
The integrity checks should be applied on the source system to avoid
performance degrade of data load.
Tuning Queries
We have two kinds of queries in data warehouse:
·
Fixed Queries
·
Ad hoc Queries
FIXED QUERIES
The fixed queries are well defined. The following are the examples of
fixed queries.
·
regular reports
·
Canned queries
·
Common aggregations
Tuning the fixed queries in data warehouses is same as in relational
database systems. the only difference is that the amount of data to be queries
may be different. It is good to store the most successful execution plan while
testing the fixed queries. Storing these executing plan will allow us to spot
changing data size and data skew as this will cause the execution plan to
change.
Note: We cannot do more on fact table but while dealing with the
dimension table or the aggregations, the usual collection of SQL tweaking,
storage mechanism and access methods can be used to tune these queries.
AD HOC QUERIES
To know the ad hoc queries it is important to know the ad hoc users of
the data warehouse. Here is the list of points that need to understand about
the users of the data warehouse:
·
The number of users in the group.
·
Whether they use ad hoc queries at regular interval of time.
·
Whether they use ad hoc queries frequently.
·
whether they use ad hoc queries occasionally at unknown intervals.
·
The maximum size of query they tend to run
·
The average size of query they tend to run.
·
Whether they require drill-down access to the base data.
·
The elapsed login time per day
·
The peak time of daily usage
·
The number of queries they run per peak hour.
·
Following are the points to be remembered.
·
It is important to track the users profiles and identify the queries
that are run on regular basis.
·
It is also important to identify tuning performed does not affect the
performance.
·
Identify the similar and ad hoc queries that are frequently run.
·
If these queries are identified then the database will change and new
indexes can be added for those queries.
·
If these queries are identified then new aggregations can be created
specifically for those queries that would result in their efficient execution.
Comments
Post a Comment