Teradata Primary Index(PI)

Pradeep

Teradata Primary Index(PI)

Teradata spreads data rows for each table across the AMPs. Whenever the table needs to be read then each AMP has to read portion of the table stored on that AMP. If the AMPs start reading at the same time and there are an equal amount of rows on each AMP, then parallel processing works best. This parallel processing starts with the Primary Index(PI).

The Primary Index (PI) plays 3 important roles:

Data Distribution
Fastest way to retrieve data
Incredibly important for Joins

Primary Index characteristics:

The PI in a table will determine on which AMP a row will be stored.
PI of a populated table cannot be modified but it can be altered for an empty table.
Maximum of 64 columns can be added as PI for a table.
PI is not Primary Key and allows NULL.
Only one PI can be defined on a table.

There are two types of Primary Indexes

Unique Primary Index (UPI)
Non-Unique Primary Index (NUPI)

Unique Primary Index (UPI): It means that the values for the selected column must be unique.

UPI will always store rows of the table evenly between the AMPs.
UPI access is always a one-AMP operation.

Example 1:


CREATE MULTISET TABLE [databasename].[tablename]
(                                   
  col_1   INTEGER,
  col_2   VARCHAR(50)
)
UNIQUE PRIMARY INDEX ( col_1 );

Non-Unique Primary Index (NUPI): It means that the values for the selected column can have duplicates.

UPI will almost never store rows of the table evenly.

Example 2:


CREATE MULTISET TABLE [databasename].[tablename]
(                                   
  col_1   INTEGER,
  col_2   VARCHAR(50)
)
PRIMARY INDEX ( col_1 );

General rules for defining a Primary Indexes

The PI should be selected in such a way that it should be as unique as possible ,more the unique index more evenly the rows will be distributed by the AMPs and better will be the space utilization.
PI should be defined on as few columns as possible.
The PI defined can have unique or non-unique values.
Columns having more duplicates values should never be defined as PI, as it leads to more skewness (uneven data distribution on AMPs). Since data is skewed and when join is performed, it can cause the performance issue and lead to spool space error.
PI should be chosen on the column/s which is/are frequently used in the WHERE clause for best way to retrieve rows.

Example 3: Checking data distribution of given columns on the AMP


SELECT HASHAMP(HASHBUCKET(HASHROW([columnlist]))),COUNT(*) FROM [databasename].[tablename]
GROUP BY 1 ORDER BY 2 DESC;

Example 4: Storage of a table for each AMP


SELECT vproc, CurrentPerm
FROM DBC.TableSize
WHERE DatabaseName = '[databasename]'
AND TableName = '[tablename]'
ORDER BY 1;

Example 5: Skewness of a table


SELECT TableName,SUM(CurrentPerm) AS CurrentPerm,SUM(PeakPerm) AS PeakPerm,
       CAST((100 - (AVG(CurrentPerm)/MAX(CurrentPerm)*100)) AS INTEGER) AS SkewFactor
FROM DBC.TableSize
WHERE DatabaseName='[databasename]'
AND TableName= '[tablename]'
GROUP BY 1;

Note: If table is more skewed then user may get "no more room in database" error while inserting data even if there is space available. This is because of HOT AMP i.e. there is uneven distribution of data and more data is going to single amp which does not have enough space to handle all data.

Teradata spreads data rows for each table across the AMPs with the help of Primary Index(PI)

dbmstutorials.com

Teradata Primary Index(PI)