Teradata: Group By Vs Distinct

Pradeep

Teradata: Group By Vs Distinct

Older versions(TD 12 Version and below) of Teradata used to handle 'Group By' & 'Distinct' in a different way.
It was suggested not to use 'Distinct' in the older versions of Teradata.
This was because while using 'Distinct' in older versions, Teradata used to sort data on each amp and merge the sorted data in spool to remove duplicates and return the result. This used to be very heavy CPU/IO operation if table/s was very large.
Over the period of time in the new versions(TD 13 version and above), Teradata has ensured that Group By and Distinct works in similar way (with same performance).
In the newer versions, Teradata internally generates same query execution plan whenever 'Group By' OR Distinct is used.

➠ Remember: There is no difference between Group By and Distinct in newer version of Teradata(TD v13 and above). Reference: Teradata database Release Summary 13


Tutorial_db.product (Count--> 1975164979)
Tutorial_db.product_item (Count--> 1891505240)

LOCK ROW FOR ACCESS SELECT COUNT(*) 
                    FROM tutorial_db.product_item 
                    INNER JOIN tutorial_db.product 
                    ON (tutorial_db.product_item.product_item_id = tutorial_db.product.product_item_id);
Join Output:
   Count(*)
-----------
 1811349898

➠ Experiment 1: Group By & Distinct directly on a single table

Explanation:

DISTINCT: Explain Plan when Distinct is used on a single table.



EXPLAIN SELECT DISTINCT product_type_cd FROM tutorial_db.product_item;

 *** Help information returned. 15 rows.
 *** Total elapsed time was 1 second.

Explanation(Distinct)
--------------------------------------------------------------------------------
  1) First, we lock tutorial_db.product_item for read
     on a reserved RowHash in all partitions to prevent global deadlock.
  2) Next, we lock tutorial_db.product_item for read.
  3) We do an all-AMPs SUM step to aggregate from
     tutorial_db.product_item by way of a traversal of
     index # 20 with no residual conditions , grouping by field1 (
     tutorial_db.product_item.product_type_cd).
     Aggregate Intermediate Results are computed globally, then placed
     in Spool 1.  The size of Spool 1 is estimated with high confidence
     to be 5 rows (245 bytes).  The estimated time for this step is
     0.73 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.73 seconds.

GROUP BY: Explain Plan when Group By is used on a single table


EXPLAIN SELECT product_type_cd FROM tutorial_db.product_item GROUP BY 1;

 *** Help information returned. 15 rows.
 *** Total elapsed time was 1 second.

Explanation(Group By)
--------------------------------------------------------------------------------
  1) First, we lock tutorial_db.product_item for read
     on a reserved RowHash in all partitions to prevent global deadlock.
  2) Next, we lock tutorial_db.product_item for read.
  3) We do an all-AMPs SUM step to aggregate from
     tutorial_db.product_item by way of a traversal of
     index # 20 with no residual conditions , grouping by field1 (
     tutorial_db.product_item.product_type_cd).
     Aggregate Intermediate Results are computed globally, then placed
     in Spool 1.  The size of Spool 1 is estimated with high confidence
     to be 5 rows (245 bytes).  The estimated time for this step is
     0.73 seconds.
  4) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.73 seconds.

➠ Experiment 2: Group By & Distinct directly on joined tables

Explanation:

GROUP BY: Explain Plan when Group By is used on multiple joined tables.


EXPLAIN SELECT product_type_cd FROM tutorial_db.product_item INNER JOIN tutorial_db.product ON (tutorial_db.product_item.product_item_id = tutorial_db.product.product_item_id)
GROUP BY 1;

 *** Help information returned. 46 rows.
 *** Total elapsed time was 1 second.

Explanation(Group By)
--------------------------------------------------------------------------------
  1) First, we lock tutorial_db.product for read on a reserved
     RowHash in all partitions to prevent global deadlock.
  2) Next, we lock tutorial_db.product_item for read
     on a reserved RowHash in all partitions to prevent global deadlock.
  3) We lock tutorial_db.product for read, and we lock
     tutorial_db.product_item for read.
  4) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from
          tutorial_db.product by way of an all-rows scan with
          no residual conditions into Spool 4 (all_amps) (compressed
          columns allowed), which is built locally on the AMPs.  Then
          we do a SORT to order Spool 4 by the hash code of (
          tutorial_db.product.product_item_id).  The size of
          Spool 4 is estimated with high confidence to be 1,975,458,745
          rows (61,239,221,095 bytes).  The estimated time for this
          step is 7.33 seconds.
       2) We do an all-AMPs RETRIEVE step from
          tutorial_db.product_item by way of an
          all-rows scan with no residual conditions into Spool 5
          (all_amps) (compressed columns allowed), which is built
          locally on the AMPs.  Then we do a SORT to order Spool 5 by
          the hash code of (
          tutorial_db.product_item.product_item_id).
          The size of Spool 5 is estimated with high confidence to be
          1,891,571,758 rows (92,687,016,142 bytes).  The estimated
          time for this step is 7.95 seconds.
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
     RowHash match scan, which is joined to Spool 5 (Last Use) by way
     of a RowHash match scan.  Spool 4 and Spool 5 are joined using a
     merge join, with a join condition of ("product_item_id =
     product_item_id").  The result goes into Spool 3 (all_amps)
     (compressed columns allowed), which is built locally on the AMPs.
     The size of Spool 3 is estimated with low confidence to be
     1,891,572,164 rows (62,421,881,412 bytes).  The estimated time for
     this step is 3.17 seconds.
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     tutorial_db.product_item.product_type_cd).
     Aggregate Intermediate Results are computed globally, then placed
     in Spool 1.  The size of Spool 1 is estimated with low confidence
     to be 5 rows (245 bytes).  The estimated time for this step is
     1.19 seconds.
  7) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 12.32 seconds.

DISTINCT: Explain Plan when Distinct is used on multiple joined tables.


EXPLAIN SELECT DISTINCT product_type_cd FROM tutorial_db.product_item INNER JOIN tutorial_db.product ON (tutorial_db.product_item.product_item_id = tutorial_db.product.product_item_id);

 *** Help information returned. 46 rows.
 *** Total elapsed time was 1 second.

Explanation(Distinct)
--------------------------------------------------------------------------------
  1) First, we lock tutorial_db.product for read on a reserved
     RowHash in all partitions to prevent global deadlock.
  2) Next, we lock tutorial_db.product_item for read
     on a reserved RowHash in all partitions to prevent global deadlock.
  3) We lock tutorial_db.product for read, and we lock
     tutorial_db.product_item for read.
  4) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from
          tutorial_db.product by way of an all-rows scan with
          no residual conditions into Spool 4 (all_amps) (compressed
          columns allowed), which is built locally on the AMPs.  Then
          we do a SORT to order Spool 4 by the hash code of (
          tutorial_db.product.product_item_id).  The size of
          Spool 4 is estimated with high confidence to be 1,975,458,745
          rows (61,239,221,095 bytes).  The estimated time for this
          step is 7.33 seconds.
       2) We do an all-AMPs RETRIEVE step from
          tutorial_db.product_item by way of an
          all-rows scan with no residual conditions into Spool 5
          (all_amps) (compressed columns allowed), which is built
          locally on the AMPs.  Then we do a SORT to order Spool 5 by
          the hash code of (
          tutorial_db.product_item.product_item_id).
          The size of Spool 5 is estimated with high confidence to be
          1,891,571,758 rows (92,687,016,142 bytes).  The estimated
          time for this step is 7.95 seconds.
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
     RowHash match scan, which is joined to Spool 5 (Last Use) by way
     of a RowHash match scan.  Spool 4 and Spool 5 are joined using a
     merge join, with a join condition of ("product_item_id =
     product_item_id").  The result goes into Spool 3 (all_amps)
     (compressed columns allowed), which is built locally on the AMPs.
     The size of Spool 3 is estimated with low confidence to be
     1,891,572,164 rows (62,421,881,412 bytes).  The estimated time for
     this step is 3.17 seconds.
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     tutorial_db.product_item.product_type_cd).
     Aggregate Intermediate Results are computed globally, then placed
     in Spool 1.  The size of Spool 1 is estimated with low confidence
     to be 5 rows (245 bytes).  The estimated time for this step is
     1.19 seconds.
  7) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 12.32 seconds.

Conclusion: There is no difference between GROUP BY and DISTINCT in Teradata.

There is no difference between Group By AND Distinct in Teradata.Internally Teradata generates same query execution plan when either Group By OR Distinct is used

dbmstutorials.com

Teradata: Group By Vs Distinct