Hive Materialized View.

Hive Indexing & Materialized View

Support for creating Index has been removed from the Hive starting from Hive version 3. And if users will try to create indexing in new version then user will get unrecognized input error as shown below.

CREATE INDEX index_test ON TABLE orders(order_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
IN TABLE orders_index_test;

FAILED: ParseException line 1:7 cannot recognize input near 'CREATE' 'INDEX' 'index_test' in ddl statement


Since indexing is removed, there are few other option which can be used to replace indexing in Hive. Note: In this tutorial, only Materialized view will be discussed in detail and advantage of Columnar file formats will be discussed in other tutorials later.


Materialized View: Creating materialized view is one of popular way to increase the perfomance of frequently used queries or data summary queries.

Creating a Materialized View:
CREATE MATERIALIZED VIEW mv_complete_orders
AS
SELECT orders_transactional.order_id,order_item_product_id,order_item_subtotal 
FROM orders_transactional JOIN order_items_transactional ON (orders_transactional.order_id = order_items_transactional.order_item_order_id) 
WHERE orders_transactional.order_status='COMPLETE';

Creating a Materialized View Stored in text foramt:
CREATE MATERIALIZED VIEW mv_complete_orders_delimited
ROW FORMAT delimited 
FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n'
STORED AS textfile
AS
SELECT orders_transactional.order_id,order_item_product_id,order_item_subtotal 
FROM orders_transactional JOIN order_items_transactional ON (orders_transactional.order_id = order_items_transactional.order_item_order_id) 
WHERE orders_transactional.order_status='COMPLETE';

Creating a Materialized View with Auto Rewriting:
CREATE MATERIALIZED VIEW mv_complete_orders_autorewriting
TBLPROPERTIES("hive.materializedview.rewriting.time.window"="10min")
AS
SELECT orders_transactional.order_id,order_item_product_id,order_item_subtotal 
FROM orders_transactional JOIN order_items_transactional ON (orders_transactional.order_id = order_items_transactional.order_item_order_id) 
WHERE orders_transactional.order_status='COMPLETE';

Dropping a Materialized View:
DROP MATERIALIZED VIEW mv_complete_orders;

Selecting Data From a Materialized View:
SELECT order_id,order_item_subtotal,order_item_product_id
FROM mv_complete_orders
WHERE order_item_subtotal BETWEEN 130 AND 160;

order_id	order_item_subtotal	order_item_product_id
68789		150.0			502
68844		150.0			502
68844		150.0			502
68845		150.0			502
68880		149.94			1014
68880		149.94			1014
68883		150.0			502

Explain plan of a query on Materialized View:
hive> explain SELECT order_id,order_item_subtotal,order_item_product_id
    > FROM mv_complete_orders
    > WHERE order_item_subtotal BETWEEN 130 AND 160;
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: mv_complete_orders
          Statistics: Num rows: 90 Data size: 1080 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: order_item_subtotal BETWEEN 130 AND 160 (type: boolean)
            Statistics: Num rows: 10 Data size: 120 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: order_id (type: int), order_item_subtotal (type: float), order_item_product_id (type: int)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 10 Data size: 120 Basic stats: COMPLETE Column stats: NONE
              ListSink