Teradata Collect Statistics

Pradeep

Teradata Collect Statistics

This tutorials will cover importance of collect statistics in Teradata, how to collect statistics and how to drop statistics.
Hereafter term 'stats' can be used for 'statistics'.

Decision making process of optimizer is greatly influenced by the availability of Collected statistics.
The more the optimizer knows about data, the better the guarantee that the estimate will be right on the spot.
Optimizer will always check for available stats before using 'Random Sampling' based on AMP row count, row size.
Collect stats can be taken on the columns of Tables, Hash Indexes and Join Indexes.
It is always recommended to keep the stats up to date because stale stats may mislead optimizer into generating worst execution plan which may further degrade the performance.
Teradata also provides a way to define how statistics will be sampled, For example: User defined sample percent or System defined sample.

➠ Information collected by statistics

Number of rows in the table
Number of rows per value
Row size
Column demographics
Index demographics

➠ Operations related to Collect Statistics, click on the line item to check its detail.

Define Statistics
Check Collected Statistics Detail
Refresh Collected Statistics
Show Collected Statistics
Drop Collected Statistics
Set how statistics will be sampled
Statistics on Built-in function expression
Optimizer's suggestion on Statistics

➠ Recommendations for Collect Statistics

It is recommended to collect stats at least on Primary Index column/columns.
It is recommended to update collect stats after 10% change in data of a table.
It is recommended to collect stats on non unique secondary indexes(NUSI) for optimizer to know total number of rows in NUSI subtable and make better decisions regarding the cost savings.
It is recommended to collect stats on non-indexed columns that are frequently used in 'WHERE' clause and equi-joins.
It is recommended to drop statistics that are not used anymore(i.e. stats on columns that no longer used to filter data or used in joins) as it consumes space and consumes CPU to update them.
It is recommended to schedule stats collection during non-peak hours, this is because stats collection takes high CPU usage.

➠ Restrictions for Collect Statistics

Stats cannot be collected on columns with datatypes like BLOB, CLOB, JSON, XML and Period. If user tries to collect stats on these datatypes then it will fail with a error like below.
```
*** Failure 6969 This statistics operation is not allowed on a Period column.
```
Users will get below error while refreshing collect stats on Global Temporary Table(GTT) using general syntax. User should explicitly mention TEMPORARY keyword while collecting stats on a GTT.
```
*** Failure 5343 This statement is not allowed when the target temporary table is already materialized.
```

➠ Define Statistics

Collect Statistics on column/columns


Syntax: COLLECT STATS ON [database_name.table_name] COLUMN (col_1,col_2,col_3,..);

Example 1: COLLECT STATS ON tutorial_db.employee COLUMN(emp_no);
Example 2: COLLECT STATS ON tutorial_db.employee COLUMN(dept_no,job_title);

Collect Statistics on Index


Syntax 1: COLLECT STATS ON [database_name.table_name] INDEX (index_col_1,index_col_2,index_col_3,..);
Example 1: COLLECT STATS ON tutorial_db.employee INDEX (emp_name,job_title,manager_id,salary);


Syntax 2: COLLECT STATS ON [database_name.table_name] INDEX [index_name];
Example 2: COLLECT STATS ON tutorial_db.employee INDEX Employee_table_nusi;

Collect Statistics on Global Temporary Table


Syntax : COLLECT STATS ON TEMPORARY [database_name.table_name] COLUMN (index_col_1,index_col_2,index_col_3,..);
Example : COLLECT STATS ON TEMPORARY tutorial_db.gtt_employee COLUMN (emp_name,job_title,manager_id,salary);

Collect Statistics on Partition Column


Syntax : COLLECT STATS ON [database_name.table_name] COLUMN (PARTITION);
Example : COLLECT STATS ON tutorial_db.employee COLUMN (PARTITION);

Define Multiple Collect Statistics on Table


COLLECT STATISTICS 
            COLUMN ( emp_no ) , 
            COLUMN ( dept_no,job_title ) 
  ON TUTORIAL_DB.EMPLOYEE ;

➠ Check Collected Statistics Detail

Statistics Detail: Check the detail of collected statistics on a table


Syntax 1: HELP STATISTICS [database_name.table_name];
Example 1: HELP STATISTICS tutorial_db.employee;

Syntax 2: HELP STATS [database_name.table_name];
Example 2: HELP STATS tutorial_db.employee;

Statistics Detail for GTT: Check the detail of collected statistics on a global temporary table


Syntax 1: HELP STATISTICS TEMPORARY [database_name.table_name];
Example 1: HELP STATISTICS TEMPORARY tutorial_db.gtt_employee;

Syntax 2: HELP STATS TEMPORARY [database_name.table_name];
Example 2: HELP STATS TEMPORARY tutorial_db.gtt_employee;

➠ Refresh Collected Statistics

Refresh Statistics


Syntax 1: COLLECT STATS ON [database_name.table_name] INDEX (index_col_1,index_col_2,index_col_3,..); --single Statistics
Example 1: COLLECT STATS ON tutorial_db.employee INDEX (emp_name,job_title,manager_id,salary);


Syntax 2: COLLECT STATS ON [database_name.table_name]; --all statistics
Example 2: COLLECT STATS ON tutorial_db.employee;

Refresh Statistics on GTT


Syntax 1: COLLECT STATS ON TEMPORARY[database_name.table_name] INDEX (index_col_1,index_col_2,index_col_3,..); --single Statistics
Example 1: COLLECT STATS ON TEMPORARYtutorial_db.gtt_employee INDEX (emp_name,job_title,manager_id,salary);


Syntax 2: COLLECT STATS ON TEMPORARY[database_name.table_name]; --all statistics
Example 2: COLLECT STATS ON TEMPORARYtutorial_db.gtt_employee;

➠ Show Collected Statistics

Show Statistics: Check the definitions of statistics defined on a table


Syntax 1: SHOW STATISTICS ON [database_name.table_name];
Example 1: SHOW STATISTICS ON tutorial_db.employee;

Syntax 2: SHOW STATS ON [database_name.table_name];
Example 2: SHOW STATS ON tutorial_db.employee;

Show Statistics on GTT: Check the definitions of statistics defined on a GTT table


Syntax: SHOW STATISTICS ON TEMPORARY [database_name.table_name];
Example: SHOW STATISTICS ON TEMPORARY tutorial_db.employee;

➠ Drop Collected Statistics

Drop Statistics: Drop statistics defined on all columns of a table


Syntax 1: DROP STATISTICS ON [database_name.table_name];
Example 1: DROP STATISTICS ON tutorial_db.employee;

Syntax 2: DROP STATISTICS ON [database_name.table_name];
Example 2: DROP STATISTICS ON tutorial_db.employee;

Partially Drop Statistics: Drop statistics defined on selected columns of a table


Syntax: DROP STATISTICS ON [database_name.table_name] COLUMN (col_1,col_2,col_3,..);
Example: DROP STATISTICS ON tutorial_db.employee COLUMN(dept_no,job_title);

➠ Set how statistics will be sampled: Teradata provides a way to define how statistics will be sampled.

SAMPLE n PERCENT: This example will collect the specified stats by reading 10% of the table rows as defined by user.
```
COLLECT STATISTICS
  USING SAMPLE 10 PERCENT
  COLUMN dept_no
ON tutorial_db.employee;
```
SYSTEM SAMPLE: This example may do full-table scan initially to collect the specified stats and may reduce rows percentage after collecting some history of the table.
```
COLLECT STATISTICS
  USING SYSTEM SAMPLE
  COLUMN dept_no
ON tutorial_db.employee;
```

NO SAMPLE: This example will do full-table scan to collect the specified stats.


COLLECT STATISTICS
  USING NO SAMPLE
  COLUMN dept_no
ON tutorial_db.employee;

➠ Statistics on Built-in function expression: Teradata provides a way to collect statistics on custom user defined expression.


COLLECT STATISTICS
  COLUMN CASE WHEN emp_no mod 2 = 0 THEN 'Even'
              ELSE 'Odd'
         END AS stats_id_as_odd_even
ON TUTORIAL_DB.EMPLOYEE;

➠ Optimizer's suggestion

Once the 'DIAGNOSTIC HELPSTATS' is set for the session, explain plan of query will provide recommendation as per optimizer on what all statistics should be collected for the table/tables in query.
'DIAGNOSTIC HELPSTATS' can be set using below command.
```
DIAGNOSTIC HELPSTATS ON FOR SESSION;
```

Explain plan for the query will provide optimizer's recommendation for collecting statistics. Recommendations are marked with bold in the below examples.


Syntax:  EXPLAIN [query]

EXPLAIN SELECT * FROM tutorial_db.employee WHERE salary>1000;

Explanation
--------------------------------------------------------------------------------
  1) First, we lock tutorial_db.employee in TD_MAP1 for read on a
     reserved RowHash to prevent global deadlock.
  2) Next, we lock tutorial_db.employee in TD_MAP1 for read.
  3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
     tutorial_db.employee by way of an all-rows scan with a condition
     of ("tutorial_db.employee.salary > 1000.00") into Spool 1
     (group_amps), which is built locally on the AMPs.  The size of
     Spool 1 is estimated with no confidence to be 5 rows (910 bytes).
     The estimated time for this step is 0.00 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.00 seconds.
     BEGIN RECOMMENDED STATS FOR FINAL PLAN->
     -- "COLLECT STATISTICS COLUMN (EMP_NAME ,JOB_TITLE ,MANAGER_ID
     , SALARY) ON tutorial_db.employee" (Medium Confidence)
     -- "COLLECT STATISTICS COLUMN (SALARY) ON tutorial_db.employee"
     (High Confidence)
     <- END RECOMMENDED STATS FOR FINAL PLAN

Note: Above recommendations by Optimizer is only for the query specified in explain plan and not for all the queries related to table/tables in the explain plan.