Teradata: Surrogate Key Concept

Pradeep

Teradata: Surrogate Key Concept

Surrogate Key is a unique, database supplied or generated identifier generally used as the primary key/index of a table.

When to use a surrogate key
Advantages of surrogate key
Disadvantages of surrogate key
How to generate surrogate key in Teradata

➠ When to use a surrogate key

Surrogate key should be used if each row of table cannot be uniquely identified using 1 or more columns.
Surrogate key can also be used when a unique key is too long and non-numeric.

➠ Advantages of surrogate key

Each row can be uniquely identified within a table using surrogate key value.
Surrogate key can be used as primary index for a table to distribute data evenly on all the AMPs.

➠ Disadvantages of surrogate key

We cannot derive any meaning or relationship between the surrogate key and the rest of the data columns in a row, therefore surrogate keys have no meaning to the users.
There will be cases when data will be shared among different databases. In this case, same rows(from different database) may have different surrogate key and different rows(from different database) may have same surrogate key.

➠ How to generate surrogate key in Teradata

By using analytical functions

By using CSUM analytical function, Syntax/Example 1:


SELECT
     (SELECT ZEROIFNULL(MAX(emp_no)) FROM employee) + CSUM(1,1) AS surrogate_key_1_new_emp_no
     ,emp_no
     ,emp_name
from employee;

Single AMP (usually vproc 0) generally processes all the data when using CSUM(1,1)

By using SUM analytical function, Syntax/Example 2:


SELECT
     (SELECT ZEROIFNULL(MAX(emp_no)) FROM employee) + SUM(1) OVER (ROWS UNBOUNDED PRECEDING) AS surrogate_key_2_new_emp_no
     ,emp_no
     ,emp_name
FROM employee;

By using ROW_NUMBER analytical function, Syntax/Example 3:
```
SELECT
    (SELECT ZEROIFNULL(MAX(emp_no)) FROM employee) + ROW_NUMBER() OVER (ORDER BY 1) AS surrogate_key_3_new_emp_no
    ,emp_no
    ,emp_name
FROM employee;
```
Note: "(SELECT ZEROIFNULL(MAX(emp_no)) FROM employee)" is used in above examples to generate new sequence which must be greater than the current max value present in the table.

By using Identity column, check Identity Columns page for more detail on Identity Column(Sequence) in Teradata.
Syntax/Example 4:


CREATE MULTISET TABLE surrogate_key_test1
( Id DECIMAL(38,0) GENERATED ALWAYS AS IDENTITY
    (
     START WITH 1 
     INCREMENT BY 1
     MAXVALUE 4000000
     NO CYCLE
     ),
name VARCHAR(100) NOT NULL
) PRIMARY INDEX (id);


INSERT INTO surrogate_key_test(name) VALUES('Teradata');
INSERT INTO surrogate_key_test(name) VALUES('hive');
INSERT INTO surrogate_key_test(name) VALUES('oracle');


SELECT * FROM surrogate_key_test;

 *** Query completed. 3 rows found. 2 columns returned. 
 *** Total elapsed time was 1 second.

 Id  Name
---  ---------
  3  oracle
  1  Teradata
  2  hive

Note: There will always be gaps in generated number when using identity column because in Teradata it is not one sequence but multiple parallel sequences(one on each AMP).