Sunday, November 2, 2014

Data Warehousing - Partitioning Strategy

Introduction

Partitioning is done to enhance performance and make management easier. It also helps balance the various requirements of the system, optimizes hardware performance, and simplifies the management of the data warehouse. In this approach, each fact table is split into multiple separate partitions. This chapter discusses the partitioning strategies.

Why Partition

Here is the list of reasons.
·         For easy management
·         To assist backup/recovery
·         To enhance performance


FOR EASY MANAGEMENT

The fact table in a data warehouse can grow to many hundreds of gigabytes in size. A fact table this large is very hard to manage as a single entity, so it needs to be partitioned.

TO ASSIST BACKUP/RECOVERY

If the fact table is not partitioned, we have to load the complete fact table with all its data. Partitioning allows us to load only the data that is required on a regular basis. This reduces the load time and also enhances the performance of the system.
Note: To cut down on the backup size, all partitions other than the current partition can be marked read-only. We can then put these partitions into a state where they cannot be modified, and back them up. This means that afterwards only the current partition needs to be backed up.

TO ENHANCE PERFORMANCE

Partitioning the fact table into sets of data can improve query performance, because a query scans only the partitions that are relevant to it. It does not have to scan the whole table.

Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the manageability requirements of the data warehouse.

PARTITIONING BY TIME INTO EQUAL SEGMENTS

In this partitioning strategy, the fact table is partitioned on the basis of time period. Each time period represents a significant retention period within the business. For example, if users query for month-to-date data, it is appropriate to partition into monthly segments. Partitioned tables can be reused by removing the data in them.
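As a minimal sketch of monthly segments, the Python below groups fact rows by the month of their date. The function names and the sample rows are hypothetical and only illustrate the idea; a real warehouse would rely on the partitioning features of its database engine.

```python
from collections import defaultdict
from datetime import date


def month_key(d: date) -> str:
    """Return the monthly partition label for a date, e.g. '2013-09'."""
    return f"{d.year:04d}-{d.month:02d}"


def partition_by_month(rows):
    """Group fact rows into monthly partitions keyed by sales_date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[month_key(row["sales_date"])].append(row)
    return dict(partitions)


# Hypothetical fact rows, for illustration only.
sales = [
    {"product_id": 30, "quantity": 5, "sales_date": date(2013, 8, 3)},
    {"product_id": 35, "quantity": 4, "sales_date": date(2013, 9, 3)},
    {"product_id": 40, "quantity": 5, "sales_date": date(2013, 9, 3)},
]

partitions = partition_by_month(sales)

# A month-to-date query for September reads a single partition.
september = partitions[month_key(date(2013, 9, 15))]
```

Because every segment covers the same length of time, emptying an expired partition and reusing it for a new month is straightforward.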

PARTITIONING BY TIME INTO DIFFERENT-SIZED SEGMENTS

This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and larger partitions for inactive data.


Following is the list of advantages.
·         The detailed information remains available online.
·         The number of physical tables is kept relatively small, which reduces the operating cost.
·         This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.
Following is the list of disadvantages.
·         This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operating cost of the data warehouse.
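One way to picture different-sized segments is a labelling rule that keeps recent rows in monthly partitions and folds older rows into yearly ones. The sketch below assumes a three-month "recent" window and a fixed reference date; both are hypothetical choices, not values taken from this chapter.

```python
from datetime import date


def segment_label(d: date, today: date, recent_months: int = 3) -> str:
    """Label a row's partition: monthly if recent, yearly otherwise.

    recent_months is a hypothetical tuning knob chosen for illustration.
    """
    months_old = (today.year - d.year) * 12 + (today.month - d.month)
    if months_old < recent_months:
        return f"{d.year:04d}-{d.month:02d}"  # small monthly partition
    return f"{d.year:04d}"  # large yearly partition for inactive data


today = date(2014, 11, 2)  # example reference date

recent = segment_label(date(2014, 10, 20), today)
older_same_year = segment_label(date(2014, 3, 1), today)
previous_year = segment_label(date(2012, 12, 31), today)
```

Note that as time passes, rows move from a monthly label to a yearly one, which is exactly the repartitioning work the disadvantage above refers to.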

PARTITION ON A DIFFERENT DIMENSION

The fact table can also be partitioned on the basis of dimensions other than time, such as product group, region, supplier, or any other dimension. Let's have an example.
Suppose a marketing function is structured into distinct regional departments, for example on a state-by-state basis. If each region wants to query information captured within its own region, it is more effective to partition the fact table into regional partitions. This speeds up the queries because they do not need to scan information that is not relevant.
Following is the list of advantages.
·         Queries do not have to scan irrelevant data, which speeds up the query process.
Following is the list of disadvantages.
·         This technique is not appropriate where the dimension is likely to change in the future. So it is worth determining that the dimension will not change in the future.
·         If the dimension changes, then the entire fact table would have to be repartitioned.
Note: We recommend partitioning only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.

PARTITION BY SIZE OF TABLE

When there is no clear basis for partitioning the fact table on any dimension, we should partition the fact table on the basis of its size. We can set a predetermined size as the critical point. When the table exceeds that size, a new table partition is created.
Following is the list of disadvantages.
·         This partitioning is complex to manage.
Note: This partitioning requires metadata to identify what data is stored in each partition.
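The sketch below illustrates size-based partitioning with the accompanying metadata. It assumes the size limit is a row count and that rows arrive with increasing ids; the class and its method names are hypothetical.

```python
class SizePartitionedTable:
    """Append-only table that opens a new partition at a row-count limit."""

    def __init__(self, max_rows: int):
        self.max_rows = max_rows
        self.partitions = [[]]  # each partition is a list of (row_id, row)
        self.metadata = []      # one {"first_id", "last_id"} per partition

    def insert(self, row_id: int, row: dict) -> None:
        # Open a new partition once the current one reaches the limit.
        if len(self.partitions[-1]) >= self.max_rows:
            self.partitions.append([])
        self.partitions[-1].append((row_id, row))
        index = len(self.partitions) - 1
        if index == len(self.metadata):
            self.metadata.append({"first_id": row_id, "last_id": row_id})
        else:
            self.metadata[index]["last_id"] = row_id

    def find_partition(self, row_id: int) -> int:
        """Use the metadata to locate the partition that holds row_id."""
        for index, meta in enumerate(self.metadata):
            if meta["first_id"] <= row_id <= meta["last_id"]:
                return index
        raise KeyError(row_id)


table = SizePartitionedTable(max_rows=2)
for i in range(1, 6):
    table.insert(i, {"value": i * 10})

partition_of_row_4 = table.find_partition(4)
```

Without the metadata, locating a row would mean searching every partition, which is part of why this scheme is considered complex to manage.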

PARTITIONING DIMENSIONS

If a dimension contains a large number of entries, it may be necessary to partition the dimension. Here we have to check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may become very large. This would definitely affect the response time.

ROUND ROBIN PARTITIONS

In the round robin technique, when a new partition is needed, the old one is archived. Metadata is used to allow the user access tool to refer to the correct table partition.
Following is the list of advantages.
·         This technique makes it easy to automate table management facilities within the data warehouse.
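A rough sketch of the round robin idea follows. It assumes a fixed number of live partitions and that the oldest live partition is the one archived when a new one is added; the class name and labels are hypothetical.

```python
from collections import deque


class RoundRobinPartitions:
    """Keep a fixed number of live partitions; archive the oldest on rollover."""

    def __init__(self, max_live: int):
        self.max_live = max_live
        self.live = deque()  # (label, rows) pairs, oldest on the left
        self.archived = []   # labels of partitions moved to the archive

    def add_partition(self, label: str) -> None:
        # When the ring is full, archive the oldest partition first.
        if len(self.live) == self.max_live:
            old_label, _rows = self.live.popleft()
            self.archived.append(old_label)
        self.live.append((label, []))

    def lookup(self, label: str) -> list:
        """Metadata lookup: map a period label to its live partition."""
        for live_label, rows in self.live:
            if live_label == label:
                return rows
        raise KeyError(f"{label} is archived or unknown")


ring = RoundRobinPartitions(max_live=3)
for label in ["2014-08", "2014-09", "2014-10", "2014-11"]:
    ring.add_partition(label)

live_labels = [label for label, _rows in ring.live]
```

The `lookup` method plays the role of the metadata: the access tool asks for a period and is pointed at whichever table currently holds it.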

Vertical Partition

In vertical partitioning, the data is split vertically, that is, by columns rather than by rows.

Vertical partitioning can be performed in the following two ways.
·         Normalization
·         Row Splitting


NORMALIZATION

Normalization is the standard relational method of database organization. In this method, rows that repeat the same information are collapsed into a single row, which reduces space.
Table before normalization

Product_id  Quantity  Value  sales_date  Store_id  Store_name  Location   Region
30          5         3.67   3-Aug-13    16        sunny       Bangalore  S
35          4         5.33   3-Sep-13    16        sunny       Bangalore  S
40          5         2.50   3-Sep-13    64        san         Mumbai     W
45          7         5.66   3-Sep-13    16        sunny       Bangalore  S

Table after normalization

Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W

Product_id  Quantity  Value  sales_date  Store_id
30          5         3.67   3-Aug-13    16
35          4         5.33   3-Sep-13    16
40          5         2.50   3-Sep-13    64
45          7         5.66   3-Sep-13    16
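
The split shown in the tables can be sketched in Python as follows. The `normalize` function is a hypothetical helper; the input rows are the four rows of the table before normalization.

```python
def normalize(rows):
    """Split denormalized sales rows into a store table and a sales table."""
    stores = {}
    sales = []
    for row in rows:
        # Repeated store details collapse into one entry per store_id.
        stores[row["store_id"]] = {
            "store_name": row["store_name"],
            "location": row["location"],
            "region": row["region"],
        }
        # The sales row keeps only store_id as a reference to the store.
        sales.append({
            "product_id": row["product_id"],
            "quantity": row["quantity"],
            "value": row["value"],
            "sales_date": row["sales_date"],
            "store_id": row["store_id"],
        })
    return stores, sales


denormalized = [
    {"product_id": 30, "quantity": 5, "value": 3.67, "sales_date": "3-Aug-13",
     "store_id": 16, "store_name": "sunny", "location": "Bangalore", "region": "S"},
    {"product_id": 35, "quantity": 4, "value": 5.33, "sales_date": "3-Sep-13",
     "store_id": 16, "store_name": "sunny", "location": "Bangalore", "region": "S"},
    {"product_id": 40, "quantity": 5, "value": 2.50, "sales_date": "3-Sep-13",
     "store_id": 64, "store_name": "san", "location": "Mumbai", "region": "W"},
    {"product_id": 45, "quantity": 7, "value": 5.66, "sales_date": "3-Sep-13",
     "store_id": 16, "store_name": "sunny", "location": "Bangalore", "region": "S"},
]

stores, sales = normalize(denormalized)
```

The three rows for store 16 carry identical store details, so they collapse into one store entry, while all four sales rows are kept.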

ROW SPLITTING

Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
Note: While using vertical partitioning, make sure that there is no requirement to perform major join operations between the two partitions.
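
As an illustration of the one-to-one map, the sketch below splits each row into a frequently accessed part and a rarely accessed part that share the same key. The function, the choice of "hot" columns, and the sample rows are all hypothetical.

```python
def split_rows(rows, key, hot_columns):
    """Split each row into a 'hot' and a 'cold' part that share the same key."""
    hot, cold = {}, {}
    for row in rows:
        k = row[key]
        hot[k] = {c: row[c] for c in hot_columns}
        cold[k] = {
            c: v for c, v in row.items()
            if c != key and c not in hot_columns
        }
    return hot, cold


# Hypothetical rows, for illustration only.
transactions = [
    {"transaction_id": 1, "value": 120.0, "transaction_date": "2014-11-01",
     "branch_name": "Central", "region": "north"},
    {"transaction_id": 2, "value": 75.5, "transaction_date": "2014-11-02",
     "branch_name": "Harbour", "region": "south"},
]

hot, cold = split_rows(transactions, "transaction_id", ["value", "transaction_date"])
```

Every key appears exactly once in each part. Queries that need columns from both parts must join them again on the key, which is the cost the note above warns about.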

Identify Key to Partition

It is crucial to choose the right partition key. Choosing the wrong partition key will force you to reorganize the fact table. Let's have an example. Suppose we want to partition the following table.
Account_Txn_Table

transaction_id  account_id  transaction_type  value  transaction_date  region  branch_name

We can choose to partition on any key. Two possible keys are
·         region
·         transaction_date
Now suppose the business is organised in 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that the vast majority of queries are restricted to the user's own business region.
Now suppose we partition by transaction_date instead of region. Then the latest transactions from every region will be in one partition, and a user who wants to look at data within his own region has to query across multiple partitions.
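
The comparison above can be sketched by counting how many partitions a region-restricted query touches under each key. The data below is hypothetical (three regions and four months rather than 30 regions), and the helper names are made up for this illustration.

```python
from collections import defaultdict
from itertools import product


def partition(rows, key_func):
    """Group rows into partitions using the given partition key function."""
    parts = defaultdict(list)
    for row in rows:
        parts[key_func(row)].append(row)
    return dict(parts)


def partitions_touched(parts, predicate):
    """Count partitions that hold at least one row matching the predicate."""
    return sum(1 for rows in parts.values() if any(predicate(r) for r in rows))


# Hypothetical transactions: one per region per month.
transactions = [
    {"transaction_id": i, "region": region,
     "transaction_date": f"2014-{month:02d}-01"}
    for i, (region, month) in enumerate(
        product(("north", "south", "east"), (1, 2, 3, 4))
    )
]

by_region = partition(transactions, lambda r: r["region"])
by_month = partition(transactions, lambda r: r["transaction_date"][:7])


def in_north(row):
    return row["region"] == "north"


touched_by_region = partitions_touched(by_region, in_north)  # 1
touched_by_month = partitions_touched(by_month, in_north)    # 4
```

With region as the key, a user's regional query reads one partition. With the date as the key, the same query has to visit every monthly partition.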

Hence it is worth determining the right partitioning key.
