Data Warehousing - Partitioning Strategy
Introduction
Partitioning is done to enhance performance and to make management easier. It also helps balance the various requirements of the system: it makes better use of the hardware and simplifies the management of the data warehouse. To do this, each fact table is split into multiple separate partitions. This chapter discusses the partitioning strategies.
Why Partition
Partitioning is done for the following reasons.
· For easy management
· To assist backup/recovery
· To enhance performance
FOR EASY MANAGEMENT
A fact table in a data warehouse can grow to many hundreds of gigabytes in size. A table this large is very hard to manage as a single entity, so it needs to be partitioned.
TO ASSIST BACKUP/RECOVERY
If the fact table is not partitioned, we have to load the complete fact table with all its data. Partitioning allows us to load only the data that is required on a regular basis. This reduces the load time and also improves the performance of the system.
Note: To cut down on the backup size, all partitions other than the current partition can be marked read-only. These partitions are put into a state where they cannot be modified, and can then be backed up. From then on, only the current partition needs to be backed up.
TO ENHANCE PERFORMANCE
Partitioning the fact table into sets of data can speed up queries. Query performance improves because a query scans only the partitions that are relevant to it and does not have to scan the whole table.
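The effect can be sketched in a few lines of Python. This is only an illustration, assuming the fact table is held in memory as a dictionary of monthly partitions; the names `partitions` and `total_sales` and the July rows are made up for the example, and a real warehouse would rely on the database's own partition handling.

```python
# Hypothetical fact table, already split into monthly partitions.
# Keys are "YYYY-MM" labels; values are lists of (sales_date, value) rows.
partitions = {
    "2013-07": [("2013-07-14", 4.10), ("2013-07-30", 1.25)],
    "2013-08": [("2013-08-03", 3.67)],
    "2013-09": [("2013-09-03", 5.33), ("2013-09-03", 2.50), ("2013-09-03", 5.66)],
}


def total_sales(partitions, months):
    """Sum values for the requested months, reading only those partitions."""
    rows_scanned = 0
    total = 0.0
    for month in months:
        for _sales_date, value in partitions.get(month, []):
            rows_scanned += 1
            total += value
    return round(total, 2), rows_scanned


# A query restricted to September reads 3 rows instead of all 6.
total, scanned = total_sales(partitions, ["2013-09"])
print(total, scanned)
```

The function never touches the July and August partitions, which is the saving the section describes.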
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the manageability requirements of the data warehouse.
PARTITIONING BY TIME INTO EQUAL SEGMENTS
In this strategy, the fact table is partitioned on the basis of time period. Each time period represents a significant retention period within the business. For example, if users query for month-to-date data, it is appropriate to partition into monthly segments. Partitioned tables can be reused by removing the data in them.
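As a rough sketch of monthly segments, the Python below routes each row to a partition named after its month. The `sales_YYYY_MM` naming scheme and the function names are assumptions made for illustration, not a convention from any particular database.

```python
from datetime import date


def monthly_partition(sales_date: date) -> str:
    """Return the label of the monthly partition a row belongs to."""
    return f"sales_{sales_date.year:04d}_{sales_date.month:02d}"


def route_rows(rows):
    """Group (sales_date, value) rows by their monthly partition label."""
    partitions = {}
    for sales_date, value in rows:
        label = monthly_partition(sales_date)
        partitions.setdefault(label, []).append((sales_date, value))
    return partitions


rows = [
    (date(2013, 8, 3), 3.67),
    (date(2013, 9, 3), 5.33),
    (date(2013, 9, 3), 2.50),
]
partitions = route_rows(rows)
print(sorted(partitions))
```

Because every partition covers the same length of time, the label can be computed from the date alone, with no lookup table.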
PARTITIONING BY TIME INTO DIFFERENT-SIZED SEGMENTS
This kind of partitioning is used where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and larger partitions for inactive data.
This technique has the following advantages.
· The detailed information remains available online.
· The number of physical tables is kept relatively small, which reduces the operating cost.
· This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.
This technique has the following disadvantage.
· This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.
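One possible way to picture different-sized segments is a labelling function that uses monthly partitions for recent rows and yearly partitions for older ones. The three-month cut-off (`recent_months`) and the label format are assumptions chosen for this sketch; the text does not prescribe specific sizes.

```python
from datetime import date


def partition_label(sales_date: date, today: date, recent_months: int = 3) -> str:
    """Small monthly partitions for recent data, large yearly ones for old data.

    `recent_months` is an assumed cut-off, not a value taken from the text.
    """
    age_in_months = (today.year - sales_date.year) * 12 + (today.month - sales_date.month)
    if age_in_months < recent_months:
        # Relatively current data: one small partition per month.
        return f"sales_{sales_date.year:04d}_{sales_date.month:02d}"
    # Inactive data: one large partition per year.
    return f"sales_{sales_date.year:04d}"


today = date(2013, 9, 15)
print(partition_label(date(2013, 9, 3), today))
print(partition_label(date(2011, 12, 1), today))
```

Note that the result depends on `today`: as time passes, rows have to be moved from monthly into yearly partitions, which is the repartitioning cost mentioned in the disadvantage above.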
PARTITION ON A DIFFERENT DIMENSION
The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Consider an example.
Suppose a market function is structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query the information captured within its own region, it is more effective to partition the fact table into regional partitions. This speeds up the queries because they do not need to scan information that is not relevant.
This technique has the following advantage.
· Queries run faster because they do not have to scan irrelevant data.
This technique has the following disadvantages.
· This technique is not appropriate where the dimension is likely to change in the future. It is therefore worth confirming that the dimension will not change.
· If the dimension changes, the entire fact table has to be repartitioned.
Note: We recommend partitioning only on the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
PARTITION BY SIZE OF TABLE
When there is no clear basis for partitioning the fact table on any dimension, we should partition it on the basis of its size. A predetermined size is set as the critical point; when the table exceeds that size, a new table partition is created.
This technique has the following disadvantage.
· This partitioning is complex to manage.
Note: This partitioning requires metadata to identify what data is stored in each partition.
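The role of the metadata can be sketched as follows. This is a minimal in-memory illustration, assuming rows arrive with ascending transaction ids and that size is measured in rows; the class `SizePartitionedTable` and its fields are hypothetical names.

```python
class SizePartitionedTable:
    """Append-only table that opens a new partition once a size limit is hit."""

    def __init__(self, max_rows_per_partition):
        self.max_rows = max_rows_per_partition
        self.partitions = [[]]
        # Metadata: for each partition, the first and last transaction id it holds.
        self.metadata = []

    def insert(self, txn_id, row):
        if len(self.partitions[-1]) >= self.max_rows:
            # Critical size reached: start a new partition.
            self.partitions.append([])
        self.partitions[-1].append((txn_id, row))
        index = len(self.partitions) - 1
        if index == len(self.metadata):
            self.metadata.append({"first_id": txn_id, "last_id": txn_id})
        else:
            self.metadata[index]["last_id"] = txn_id

    def find_partition(self, txn_id):
        """Use the metadata to locate the partition holding txn_id, if any."""
        for index, entry in enumerate(self.metadata):
            if entry["first_id"] <= txn_id <= entry["last_id"]:
                return index
        return None


table = SizePartitionedTable(max_rows_per_partition=2)
for txn_id in range(1, 6):
    table.insert(txn_id, {"value": txn_id * 1.5})
print(len(table.partitions), table.find_partition(4))
```

Without the `metadata` list, a lookup would have to open every partition, since nothing in a partition's position says which rows it contains.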
PARTITIONING DIMENSIONS
If a dimension contains a large number of entries, it may be necessary to partition the dimension itself. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to make comparisons, that dimension may become very large, which would definitely affect the response time.
ROUND ROBIN PARTITIONS
In the round robin technique, when a new partition is needed, the oldest one is archived. Metadata is used to allow the user access tool to refer to the correct table partition.
This technique has the following advantage.
· It makes it easy to automate table management facilities within the data warehouse.
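A small Python sketch of the rotation, assuming a fixed number of online partitions identified by month labels. The class `RoundRobinPartitions` and its `lookup` method are illustrative names; they stand in for whatever metadata the access tool would consult.

```python
from collections import deque


class RoundRobinPartitions:
    """Keep a fixed number of partitions online and archive the oldest."""

    def __init__(self, max_partitions):
        self.max_partitions = max_partitions
        self.active = deque()  # labels of online partitions, oldest first
        self.archived = []     # labels of partitions that were archived

    def add_partition(self, label):
        if len(self.active) == self.max_partitions:
            # No free slot: archive the oldest partition before adding a new one.
            self.archived.append(self.active.popleft())
        self.active.append(label)

    def lookup(self, label):
        """Metadata lookup: report where the data for a period lives."""
        if label in self.active:
            return "online"
        if label in self.archived:
            return "archived"
        return "unknown"


rotation = RoundRobinPartitions(max_partitions=3)
for month in ["2013-06", "2013-07", "2013-08", "2013-09"]:
    rotation.add_partition(month)
print(list(rotation.active), rotation.archived)
```

Since `add_partition` is the only operation needed each period, the rotation is straightforward to schedule as an automated job.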
Vertical Partitioning
In vertical partitioning, the data is split vertically, that is, by columns rather than by rows. Vertical partitioning can be performed in the following two ways.
· Normalization
· Row Splitting
NORMALIZATION
Normalization is the standard relational method of database organization. In this method, duplicated rows are collapsed into a single row, which reduces the space required.
Table before normalization
Product_id | Quantity | Value | sales_date | Store_id | Store_name | Location | Region
30 | 5 | 3.67 | 3-Aug-13 | 16 | sunny | Bangalore | S
35 | 4 | 5.33 | 3-Sep-13 | 16 | sunny | Bangalore | S
40 | 5 | 2.50 | 3-Sep-13 | 64 | san | Mumbai | W
45 | 7 | 5.66 | 3-Sep-13 | 16 | sunny | Bangalore | S
Table after normalization
Store_id | Store_name | Location | Region
16 | sunny | Bangalore | S
64 | san | Mumbai | W
Product_id | Quantity | Value | sales_date | Store_id
30 | 5 | 3.67 | 3-Aug-13 | 16
35 | 4 | 5.33 | 3-Sep-13 | 16
40 | 5 | 2.50 | 3-Sep-13 | 64
45 | 7 | 5.66 | 3-Sep-13 | 16
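The split shown in the tables can be reproduced with a short Python sketch that uses the rows of the table before normalization. The function name `normalize` and the tuple layout are choices made for this illustration only.

```python
# Rows of the table before normalization, in column order:
# product_id, quantity, value, sales_date, store_id, store_name, location, region
flat_rows = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "S"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai", "W"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
]


def normalize(rows):
    """Split flat rows into a store table and a sales table keyed by store_id."""
    stores = {}
    sales = []
    for product_id, quantity, value, sales_date, store_id, name, location, region in rows:
        # Repeated store details collapse into one entry per store_id.
        stores[store_id] = (name, location, region)
        sales.append((product_id, quantity, value, sales_date, store_id))
    return stores, sales


stores, sales = normalize(flat_rows)
print(stores)
print(sales)
```

The store details for store 16 appear three times in the flat rows but only once in `stores`, which is where the space saving comes from.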
ROW SPLITTING
Row splitting tends to leave a one-to-one mapping between partitions. The aim of row splitting is to speed up access to a large table by reducing its size.
Note: When using vertical partitioning, make sure that there is no requirement to perform major join operations between the two partitions.
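As an illustration of row splitting, the sketch below divides one wide row into a frequently used part and a rarely used part that share the same key. The row contents, the choice of which columns are frequently used, and the function `split_row` are all hypothetical.

```python
def split_row(row, key, hot_columns):
    """Split one wide row into a frequently used part and a rarely used part.

    Both parts keep the key column, which gives the one-to-one mapping.
    """
    hot = {name: value for name, value in row.items()
           if name == key or name in hot_columns}
    cold = {name: value for name, value in row.items()
            if name == key or name not in hot_columns}
    return hot, cold


# Hypothetical wide row; only quantity and value are assumed to be queried often.
row = {
    "product_id": 30,
    "quantity": 5,
    "value": 3.67,
    "long_description": "illustrative free-text column",
    "supplier_notes": "illustrative free-text column",
}
hot, cold = split_row(row, key="product_id", hot_columns={"quantity", "value"})
print(hot)
print(cold)
```

Queries that need only the frequently used columns read the smaller `hot` rows. Rebuilding the full row requires a join on `product_id`, which is why the note above warns against designs that need such joins often.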
Identify Key to Partition
It is crucial to choose the right partitioning key. Choosing the wrong key will force you to reorganize the fact table. Consider an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. Two possible keys are:
· region
· transaction_date
Suppose the business is organised into 30 geographical regions, each with a different number of branches. Partitioning by region gives us 30 partitions, which is reasonable. This partitioning works well because our requirements capture has shown that the vast majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, the latest transactions from every region will be in one partition. A user who wants to look at data within their own region then has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
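The trade-off can be made concrete with a small Python sketch that counts how many partitions a region-restricted query has to read under each key. The sample rows, the region names, and the choice of monthly partitions for transaction_date are assumptions made for the example.

```python
from collections import defaultdict


def partitions_touched(rows, partition_key, predicate):
    """Count partitions holding at least one row that matches the predicate."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)
    return sum(1 for part in partitions.values() if any(predicate(r) for r in part))


# Hypothetical rows from Account_Txn_Table (only the columns needed here).
rows = [
    {"transaction_id": 1, "region": "North", "transaction_date": "2013-07-02"},
    {"transaction_id": 2, "region": "South", "transaction_date": "2013-07-15"},
    {"transaction_id": 3, "region": "North", "transaction_date": "2013-08-09"},
    {"transaction_id": 4, "region": "South", "transaction_date": "2013-08-21"},
    {"transaction_id": 5, "region": "North", "transaction_date": "2013-09-03"},
]


def in_north(row):
    return row["region"] == "North"


by_region = partitions_touched(rows, lambda r: r["region"], in_north)
by_month = partitions_touched(rows, lambda r: r["transaction_date"][:7], in_north)
print(by_region, by_month)
```

With region as the key, the query for one region reads a single partition; with the date-based key, the same query has to read every monthly partition that contains a transaction from that region.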