Teradata Collect Statistics
This tutorials will cover importance of collect statistics in Teradata, how to collect statistics and how to drop statistics.
Hereafter term
'stats' can be used for
'statistics'.
- Decision making process of optimizer is greatly influenced by the availability of Collected statistics.
- The more the optimizer knows about data, the better the guarantee that the estimate will be right on the spot.
- Optimizer will always check for available stats before using 'Random Sampling' based on AMP row count, row size.
- Collect stats can be taken on the columns of Tables, Hash Indexes and Join Indexes.
- It is always recommended to keep the stats up to date because stale stats may mislead optimizer into generating worst execution plan which may further degrade the performance.
- Teradata also provides a way to define how statistics will be sampled, For example: User defined sample percent or System defined sample.
➠ Information collected by statistics
- Number of rows in the table
- Number of rows per value
- Row size
- Column demographics
- Index demographics
➠ Operations related to Collect Statistics, click on the line item to check its detail.
➠
Recommendations for Collect Statistics
- It is recommended to collect stats at least on Primary Index column/columns.
- It is recommended to update collect stats after 10% change in data of a table.
- It is recommended to collect stats on non unique secondary indexes(NUSI) for optimizer to know total number of rows in NUSI subtable and make better decisions regarding the cost savings.
- It is recommended to collect stats on non-indexed columns that are frequently used in 'WHERE' clause and equi-joins.
- It is recommended to drop statistics that are not used anymore(i.e. stats on columns that no longer used to filter data or used in joins) as it consumes space and consumes CPU to update them.
- It is recommended to schedule stats collection during non-peak hours, this is because stats collection takes high CPU usage.
➠
Restrictions for Collect Statistics
- Stats cannot be collected on columns with datatypes like BLOB, CLOB, JSON, XML and Period. If user tries to collect stats on these datatypes then it will fail with a error like below.
*** Failure 6969 This statistics operation is not allowed on a Period column.
- Users will get below error while refreshing collect stats on Global Temporary Table(GTT) using general syntax. User should explicitly mention TEMPORARY keyword while collecting stats on a GTT.
*** Failure 5343 This statement is not allowed when the target temporary table is already materialized.
➠
Define Statistics
- Collect Statistics on column/columns
Syntax: COLLECT STATS ON [database_name.table_name] COLUMN (col_1,col_2,col_3,..);
Example 1: COLLECT STATS ON tutorial_db.employee COLUMN(emp_no);
Example 2: COLLECT STATS ON tutorial_db.employee COLUMN(dept_no,job_title);
- Collect Statistics on Index
Syntax 1: COLLECT STATS ON [database_name.table_name] INDEX (index_col_1,index_col_2,index_col_3,..);
Example 1: COLLECT STATS ON tutorial_db.employee INDEX (emp_name,job_title,manager_id,salary);
Syntax 2: COLLECT STATS ON [database_name.table_name] INDEX [index_name];
Example 2: COLLECT STATS ON tutorial_db.employee INDEX Employee_table_nusi;
- Collect Statistics on Global Temporary Table
Syntax : COLLECT STATS ON TEMPORARY [database_name.table_name] COLUMN (index_col_1,index_col_2,index_col_3,..);
Example : COLLECT STATS ON TEMPORARY tutorial_db.gtt_employee COLUMN (emp_name,job_title,manager_id,salary);
- Collect Statistics on Partition Column
Syntax : COLLECT STATS ON [database_name.table_name] COLUMN (PARTITION);
Example : COLLECT STATS ON tutorial_db.employee COLUMN (PARTITION);
- Define Multiple Collect Statistics on Table
COLLECT STATISTICS
COLUMN ( emp_no ) ,
COLUMN ( dept_no,job_title )
ON TUTORIAL_DB.EMPLOYEE ;
➠
Check Collected Statistics Detail
- Statistics Detail: Check the detail of collected statistics on a table
Syntax 1: HELP STATISTICS [database_name.table_name];
Example 1: HELP STATISTICS tutorial_db.employee;
Syntax 2: HELP STATS [database_name.table_name];
Example 2: HELP STATS tutorial_db.employee;
- Statistics Detail for GTT: Check the detail of collected statistics on a global temporary table
Syntax 1: HELP STATISTICS TEMPORARY [database_name.table_name];
Example 1: HELP STATISTICS TEMPORARY tutorial_db.gtt_employee;
Syntax 2: HELP STATS TEMPORARY [database_name.table_name];
Example 2: HELP STATS TEMPORARY tutorial_db.gtt_employee;
➠
Refresh Collected Statistics
- Refresh Statistics
Syntax 1: COLLECT STATS ON [database_name.table_name] INDEX (index_col_1,index_col_2,index_col_3,..); --single Statistics
Example 1: COLLECT STATS ON tutorial_db.employee INDEX (emp_name,job_title,manager_id,salary);
Syntax 2: COLLECT STATS ON [database_name.table_name]; --all statistics
Example 2: COLLECT STATS ON tutorial_db.employee;
- Refresh Statistics on GTT
Syntax 1: COLLECT STATS ON TEMPORARY[database_name.table_name] INDEX (index_col_1,index_col_2,index_col_3,..); --single Statistics
Example 1: COLLECT STATS ON TEMPORARYtutorial_db.gtt_employee INDEX (emp_name,job_title,manager_id,salary);
Syntax 2: COLLECT STATS ON TEMPORARY[database_name.table_name]; --all statistics
Example 2: COLLECT STATS ON TEMPORARYtutorial_db.gtt_employee;
➠
Show Collected Statistics
- Show Statistics: Check the definitions of statistics defined on a table
Syntax 1: SHOW STATISTICS ON [database_name.table_name];
Example 1: SHOW STATISTICS ON tutorial_db.employee;
Syntax 2: SHOW STATS ON [database_name.table_name];
Example 2: SHOW STATS ON tutorial_db.employee;
- Show Statistics on GTT: Check the definitions of statistics defined on a GTT table
Syntax: SHOW STATISTICS ON TEMPORARY [database_name.table_name];
Example: SHOW STATISTICS ON TEMPORARY tutorial_db.employee;
➠
Drop Collected Statistics
- Drop Statistics: Drop statistics defined on all columns of a table
Syntax 1: DROP STATISTICS ON [database_name.table_name];
Example 1: DROP STATISTICS ON tutorial_db.employee;
Syntax 2: DROP STATISTICS ON [database_name.table_name];
Example 2: DROP STATISTICS ON tutorial_db.employee;
- Partially Drop Statistics: Drop statistics defined on selected columns of a table
Syntax: DROP STATISTICS ON [database_name.table_name] COLUMN (col_1,col_2,col_3,..);
Example: DROP STATISTICS ON tutorial_db.employee COLUMN(dept_no,job_title);
➠
Set how statistics will be sampled: Teradata provides a way to define how statistics will be sampled.
- SAMPLE n PERCENT: This example will collect the specified stats by reading 10% of the table rows as defined by user.
COLLECT STATISTICS
USING SAMPLE 10 PERCENT
COLUMN dept_no
ON tutorial_db.employee;
- SYSTEM SAMPLE: This example may do full-table scan initially to collect the specified stats and may reduce rows percentage after collecting some history of the table.
COLLECT STATISTICS
USING SYSTEM SAMPLE
COLUMN dept_no
ON tutorial_db.employee;
- NO SAMPLE: This example will do full-table scan to collect the specified stats.
COLLECT STATISTICS
USING NO SAMPLE
COLUMN dept_no
ON tutorial_db.employee;
➠
Statistics on Built-in function expression: Teradata provides a way to collect statistics on custom user defined expression.
➠
Optimizer's suggestion
- Once the 'DIAGNOSTIC HELPSTATS' is set for the session, explain plan of query will provide recommendation as per optimizer on what all statistics should be collected for the table/tables in query.
- 'DIAGNOSTIC HELPSTATS' can be set using below command.
DIAGNOSTIC HELPSTATS ON FOR SESSION;
- Explain plan for the query will provide optimizer's recommendation for collecting statistics. Recommendations are marked with bold in the below examples.
Syntax: EXPLAIN [query]
EXPLAIN SELECT * FROM tutorial_db.employee WHERE salary>1000;
Explanation
--------------------------------------------------------------------------------
1) First, we lock tutorial_db.employee in TD_MAP1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock tutorial_db.employee in TD_MAP1 for read.
3) We do an all-AMPs RETRIEVE step in TD_MAP1 from
tutorial_db.employee by way of an all-rows scan with a condition
of ("tutorial_db.employee.salary > 1000.00") into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with no confidence to be 5 rows (910 bytes).
The estimated time for this step is 0.00 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.00 seconds.
BEGIN RECOMMENDED STATS FOR FINAL PLAN->
-- "COLLECT STATISTICS COLUMN (EMP_NAME ,JOB_TITLE ,MANAGER_ID
, SALARY) ON tutorial_db.employee" (Medium Confidence)
-- "COLLECT STATISTICS COLUMN (SALARY) ON tutorial_db.employee"
(High Confidence)
<- END RECOMMENDED STATS FOR FINAL PLAN
Note: Above recommendations by Optimizer is only for the query specified in explain plan and not for all the queries related to table/tables in the explain plan.