Teradata Primary Index(PI)
Teradata spreads data rows for each table across the AMPs. Whenever the table needs to be read then each AMP has to read portion of the table stored on that AMP.
If the AMPs start reading at the same time and there are an equal amount of rows on each AMP, then parallel processing works best.
This parallel processing starts with the Primary Index(PI).
The Primary Index (PI) plays 3 important roles:
- Data Distribution
- Fastest way to retrieve data
- Incredibly important for Joins
Primary Index characteristics:
- The PI in a table will determine on which AMP a row will be stored.
- PI of a populated table cannot be modified but it can be altered for an empty table.
- Maximum of 64 columns can be added as PI for a table.
- PI is not Primary Key and allows NULL.
- Only one PI can be defined on a table.
There are two types of Primary Indexes
- Unique Primary Index (UPI)
- Non-Unique Primary Index (NUPI)
Unique Primary Index (UPI): It means that the values for the selected column must be unique.
- UPI will always store rows of the table evenly between the AMPs.
- UPI access is always a one-AMP operation.
Example 1:
CREATE MULTISET TABLE [databasename].[tablename]
(
col_1 INTEGER,
col_2 VARCHAR(50)
)
UNIQUE PRIMARY INDEX ( col_1 );
Non-Unique Primary Index (NUPI): It means that the values for the selected column can have duplicates.
- UPI will almost never store rows of the table evenly.
Example 2:
CREATE MULTISET TABLE [databasename].[tablename]
(
col_1 INTEGER,
col_2 VARCHAR(50)
)
PRIMARY INDEX ( col_1 );
General rules for defining a Primary Indexes
- The PI should be selected in such a way that it should be as unique as possible ,more the unique index more evenly the rows will be distributed by the AMPs and better will be the space utilization.
- PI should be defined on as few columns as possible.
- The PI defined can have unique or non-unique values.
- Columns having more duplicates values should never be defined as PI, as it leads to more skewness (uneven data distribution on AMPs). Since data is skewed and when join is performed, it can cause the performance issue and lead to spool space error.
- PI should be chosen on the column/s which is/are frequently used in the WHERE clause for best way to retrieve rows.
Example 3: Checking data distribution of given columns on the AMP
SELECT HASHAMP(HASHBUCKET(HASHROW([columnlist]))),COUNT(*) FROM [databasename].[tablename]
GROUP BY 1 ORDER BY 2 DESC;
Example 4: Storage of a table for each AMP
SELECT vproc, CurrentPerm
FROM DBC.TableSize
WHERE DatabaseName = '[databasename]'
AND TableName = '[tablename]'
ORDER BY 1;
Example 5: Skewness of a table
SELECT TableName,SUM(CurrentPerm) AS CurrentPerm,SUM(PeakPerm) AS PeakPerm,
CAST((100 - (AVG(CurrentPerm)/MAX(CurrentPerm)*100)) AS INTEGER) AS SkewFactor
FROM DBC.TableSize
WHERE DatabaseName='[databasename]'
AND TableName= '[tablename]'
GROUP BY 1;
Note: If table is more skewed then user may get "no more room in database" error while inserting data even if there is space available. This is because of HOT AMP i.e. there is uneven distribution of data and more data is going to single amp which does not have enough space to handle all data.