Hive Partitioning & Clustering
Hive Partitioning & Bucketing
- Hive provides way to categories data into smaller directories and files using partitioning or/and bucketing/clustering in order to improve performance of data retrieval queries and make them faster.
- Main difference between Partitioning and Bucketing is that partitioning is applied directly on the column value and data is stored within directory named with column value whereas bucketing is applied using hash function on the column value followed by MOD function with the number of buckets to store data in specific bucket file.
- Sample data file used in the examples can be downloaded here.
➠
Partitioning in Hive
- Hive provides a way to partition table data based on 1 or more columns.
- Partitioning is effective for columns which are used to filter data and limited number of values.
- Hive will create directory for each value of partitioned column(as shown below).
$ hadoop fs -ls /user/hive/warehouse/hivefirstpartitionedtable
Output:
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=CANCELED
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=CLOSED
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=COMPLETE
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=ON_HOLD
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=PENDING
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=PENDING_PAYMENT
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=PROCESSING
drwxr-xr-x /user/hive/warehouse/hivefirstpartitionedtable/order_status=SUSPECTED_FRAUD
- Query performance will only improve if partitioned column is used in filter condition as it has to read only specific set of data inside particular directory, otherwise it will degrade the performance.
- Each partitioned folder will store data containing only partitioned value as specified by folder name.
- As Hive create separate folder for each value of partitioned column, therefore partition column value will not be stored in Hive file.
$ hadoop fs -cat hdfs://localhost:9000/user/hive/warehouse/hivefirstpartitionedtable/order_status=CANCELED/000000_0
Output:
68792,2014-01-28 00:00:00.0,9809
68816,2014-03-26 00:00:00.0,8769
68867,2014-06-23 00:00:00.0,869
- If multiple columns are used for partitioning then Hive will create nested folder to store data for other column inside first column specified in partition clause.
→ Create Table Example: In the below example, partition is created on the order_status column.
Create table hiveFirstPartitionedTable
(
order_id INT,
order_date STRING,
cust_id INT
)
PARTITIONED BY (order_status STRING)
Row format delimited
fields terminated by ','
Stored as textfile;
→ Loading Data into Partitioned Table
LOAD DATA local inpath '/Users/username/data_files/order.txt' OVERWRITE INTO TABLE hiveFirstPartitionedTable;
→ Loading Data into Partitioned Table Using other table
- Syntax 1
LOAD DATA local inpath '/Users/username/data_files/order.txt' OVERWRITE INTO TABLE hiveTempTable;
INSERT OVERWRITE TABLE hivefirstpartitionedtable PARTITION(order_status) SELECT order_id, order_date, cust_id, order_status from hiveTempTable;
- Syntax 2
LOAD DATA local inpath '/Users/username/data_files/order.txt' OVERWRITE INTO TABLE hiveTempTable;
FROM hiveTempTable INSERT OVERWRITE TABLE hivefirstpartitionedtable PARTITION(order_status) SELECT order_id, order_date, cust_id, order_status;
If the above command fail with below error then set dynamic partition mode to non-strict.
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
➠
Bucketing/Clustering in Hive
- Partition should not be added to every type of column, for example: transaction_id. Transaction_id will be unique for each sales transaction.
- It will be overhead to add partition in such scenarios.
- Hive provides clustering to retrieve data faster for the scenarios like above.
- In clustering, Hive uses hash function on the clustered column and number of buckets specified to store the data into a specific bucket returned after applying MOD function(as shown below).
Hash_function(bucket_column) Mod (no of buckets)
- Hive will create as many files as the number of buckets specified(as shown below) to distribute data in those files.
$ hadoop fs -ls /user/hive/warehouse/hivefirstclusteredtable
Output:
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000000_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000001_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000002_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000003_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000004_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000005_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000006_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000007_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000008_0
-rw-r-xr-x /user/hive/warehouse/hivefirstclusteredtable/000009_0
- Query performance will only improve if clustered column is used in filter condition as it has to read only specific set of data inside a particular file, otherwise it will degrade the performance.
→ Create Table Example: In the below example, clustering is done on the order_id column and 10 is the number of buckets defined.
Create table hiveFirstClusteredTable
(
order_id INT,
order_date STRING,
cust_id INT,
order_status STRING
)
CLUSTERED by (order_id) INTO 10 buckets
Row format delimited
fields terminated by ','
Stored as textfile;
→ Loading Data into Partitioned Table
LOAD DATA local inpath '/Users/username/data_files/order.txt' OVERWRITE INTO TABLE hiveFirstClusteredTable;
➠
Partitioning & Clustering in Hive
- Partitioning & clustering can be defined simultaneously on a Hive table.
- In the below example, partitioning is done on 'order_status' column and clustering is done on 'order_id' column with 10 buckets.
Create table hiveFirstPartitionedClusteredTable
(
order_id INT,
order_date STRING,
cust_id INT
)
PARTITIONED BY (order_status STRING)
CLUSTERED by (order_id) INTO 10 buckets
Row format delimited
fields terminated by ','
Stored as textfile;
- In such cases, data will be first partitioned into clustered column value directories and then clustering will be applied. As shown in example below, data is first partitioned on 'order_status=CANCELED' and then clustered into 10 buckets.
$ hadoop fs -ls /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED
Output:
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000000_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000001_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000002_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000003_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000004_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000005_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000006_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000007_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000008_0
-rw-r-xr-x /user/hive/warehouse/hivefirstpartitionedclusteredtable/order_status=CANCELED/000009_0