Hive Tables

Hive Tables

This tutorials provides most of the information related to tables in Hive. There are 2 type of tables in Hive.
Internal Table: Internal Tables stores data inside HDFS hive/warehouse with tablename as directory. Internal tables are also called managed tables.
External Table: External Tables stores data in the user defined HDFS directory.



Differences between Internal/Managed & External Table based on various parameters


Internal/Managed Table

External Table

Table Creation

User don't have to specify internal keyword in create table statement. Example: 'CREATE TABLE tablename'.

User has to specifically specify external keyword in create table statement. example: 'CREATE EXTERNAL TABLE tablename'.

Table Data Location

Data will stored in system defined HDFS hive/warehouse directory.

User has to specify the HDFS data directory in create table statement.

DROP Table

Drop table command will remove all the data from HDFS as well and data will lost.

Drop table command will not remove any data(no data loss) from HDFS and it will just remove table defintion.



Hive Internal Table: Below table example will demonstrate simple Internal table in Hive.

CREATE TABLE hiveFirstTable
(  
   order_id INT,
   order_date STRING,
   cust_id INT,
   order_status STRING
)
ROW FORMAT delimited 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
STORED AS textfile;

Hive External Table: Below table example will demonstrate simple External table in Hive.

CREATE EXTERNAL TABLE hiveFirstExternalTable
(  
   order_id INT,
   order_date STRING,
   cust_id INT,
   order_status STRING
)
ROW FORMAT delimited 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION '/user/username/somedirectory';


Hive Table Attributes: Some of the attributes that can be used in create table statement in Hive.

Attribute

Description

ROW FORMAT

This is used to specify the format of data file that will be loaded into HIVE. Possible types: DELIMITED, SERDE

TBLPROPERTIES

This attribute can be used to specify multiple properties for tables as key-value pair. Example: Ignoring 'n' number of lines of header or footer from file to be loaded.

STORED AS

This is used to specify the format of data file that will be loaded/stored into HIVE. Default is plain text file, other types includes: ORC, PARQUET, RCFILE, AVRO, JSONFILE, SEQUENCEFILE.

LINES TERMINATED BY

This is used to specify the character which can be used to differentiate multiple rows. Default is newline character.

PARTITION BY

This is used to partition the table data into multiple files based on partition column defined.

CLUSTERED BY

This is used to classify data into specified number of bucket(again multiple files) using hash function.

LOCATION

This property is specifically for external table to specify the directory location where data file loaded into the external table will be stored.

Note: PARTITION BY & CLUSTERED BY can be used simultaneously.


ROW FORMAT: File to be loaded into Hive should be either delimited or SERDE compliant.
TBLPROPERTIES: Various key-value attribute can be used in TBLPROPERTIES to define the way table should work. Multiple table properties can be specified inside TBLPROPERTIES.

More Hive Table Examples
Skipping Header From File: Below table will ignore first line from a loaded file in Hive. Number one(1) in the below example can be replaced with any number to ignore that many header rows.

CREATE TABLE hiveFirstTableSkipHeader
(  
   order_id INT,
   order_date STRING,
   cust_id INT,
   order_status STRING
)
ROW FORMAT delimited 
FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n'
STORED AS textfile
TBLPROPERTIES("skip.header.line.count"="1");


Skipping Footer From File: Below table will ignore last line from a loaded file in Hive. Number one(1) in the below example can be replaced with any number to ignore that many footer rows.

CREATE TABLE hiveFirstTableSkipFooter
(  
   order_id INT,
   order_date STRING,
   cust_id INT,
   order_status STRING
)
ROW FORMAT delimited 
FIELDS TERMINATED BY '\t' 
LINES TERMINATED BY '\n'
STORED AS textfile
TBLPROPERTIES("skip.footer.line.count"="1");


Skipping Both Header/Footer From File: Below table will ignore last line from a loaded file in Hive. Number one(1) in the below example can be replaced with any number to ignore that many header & footer rows.

CREATE TABLE hiveFirstTableSkipHeaderFooter
(  
   order_id INT,
   order_date STRING,
   cust_id INT,
   order_status STRING
)
ROW FORMAT delimited 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
STORED AS textfile
tblproperties("skip.header.line.count"="1", "skip.footer.line.count"="1");


Transactional Table: Transactional Table property should be enabled in order to delete, insert & update data in Hive table. Also table storage format should be ORC (or other hive compliant format).

CREATE TABLE hiveFirstTransTable
(  
   order_id INT,
   order_date STRING,
   cust_id INT,
   order_status STRING
)
ROW FORMAT delimited 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
STORED AS ORC
TBLPROPERTIES("transactional"="true");
Note: Below properties need to be set otherwise create table statement will fail

SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;


Partitioned & Clustered Table: Below create table example is the example of partitioned and clustered table.

Create table hiveFirstPartitionedClusteredTable
(  
   order_id INT,
   order_date STRING,
   cust_id INT
)
PARTITIONED BY (order_status STRING)
CLUSTERED BY (order_id) INTO 10 buckets
ROW format delimited 
FIELDS TERMINATED BY  ','
STORED AS textfile;


SERDE Table: SERDE is popularly used to load from sources storing data in JSON format. Check out this Hive SERDE page to know more about SERDE.

CREATE TABLE hiveFirstTransTable
(  
   order_id INT,
   order_date STRING,
   cust_id INT,
   order_status STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' 
STORED AS TEXTFILE;


Create Table Using Other Table