Teradata Cumulative Distinct Count

Pradeep

Teradata Cumulative Distinct Count

This tutorial will explain how to find cumulative distinct count over a period of time.
Cumulative distinct count means if the column value is already counted in earlier period then don't count it again if it re-appears in another period.
Simple application of this algorithm may be to find total different types of products sold incrementally over a period till the given date.

Lets try to understand what cumulative distinct count actually mean with a simple example.

Sample Data: Consider the below 3 days data which shows the product's unique Identifier (Id) and date(rptg_dt) on which product sale happened.


Id  rptg_dt
--- ----------
10  2017-12-31
20  2017-12-31
30  2017-12-31
30  2017-12-31

Id  rptg_dt
--- ----------
10  2018-01-01
40  2018-01-01
30  2018-01-01
40  2018-01-01

Id  rptg_dt
--- ----------
10  2018-01-02
50  2018-01-02
60  2018-01-02

Required Output: Cumulative distinct count of 'Ids' which is equivalent to total different types of products sold till the given date.
```
rptg_dt     Count(*)
----------  --------
2017-12-31  3
2018-01-01  4
2018-01-02  6
```
Result Explanation:
- There were 3 distinct Ids on first day(2017-12-31), therefore distinct count for this day is 3.
- On second day(2018-01-01), there is only 1 new Id (i.e. 40) and others(10,20,30) were already counted on previous day. Therefore distinct cumulative count is 4 (1 from current day and 3 from previous day).
- On third day(2018-01-02), there are 2 new Ids (i.e. 50 & 60) and others(10,20,30,40) were already counted on previous days. Therefore distinct cumulative count is 6 (2 from current day and 4 from previous days).

➠ Dataset: Lets convert the above dataset into a table for demonstration purpose.


CREATE MULTISET TABLE product_sale( id INTEGER, rptg_dt DATE)
primary index(id);


Data:
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',20);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);

INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);

INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',50);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',60);

➠ Query to get Cumulative Distinct Count


--Step 3:
SELECT rptg_dt,
      cum_distinct_cnt
FROM (
     --Step 2:
     SELECT id,
            rptg_dt,
            COUNT(id) OVER (ORDER BY rptg_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_distinct_cnt
     FROM (
        --Step 1:
           SELECT id,
                  rptg_dt,
                  ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt) row_num
           FROM product_sale
           QUALIFY (ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt)) =1 
           ) innerTab1
    ) innerTab2 
QUALIFY ROW_NUMBER() OVER (PARTITION BY rptg_dt ORDER BY cum_distinct_cnt DESC) = 1;

Output:
rptg_dt     cum_distinct_cnt
----------  ----------------
2017-12-31  3
2018-01-01  4
2018-01-02  6

Explanation: Above query is written in 3 steps

Step 1: Determine 1st record for each 'Id' all periods so that same Id should not be counted multiple times across different periods(dates). This step has been shown above in blue color.
Step 2: Find general cumulative count using the data from Step 1. This step has been shown above in maroon color.
Step 3: Final step will be to filter all rows for each reporting date except the one with greatest count. This step has been shown above in gray color.

Note: If dataset is very large, it is advisable to break the query at each of the 3 steps and populate intermediate data in work table for better performance.