Teradata Cumulative Distinct Count

Teradata Cumulative Distinct Count


Lets try to understand what cumulative distinct count actually mean with a simple example.


Dataset: Lets convert the above dataset into a table for demonstration purpose.

CREATE MULTISET TABLE product_sale( id INTEGER, rptg_dt DATE)
primary index(id);


Data:
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',20);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);

INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);

INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',50);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',60);


Query to get Cumulative Distinct Count

--Step 3:
SELECT rptg_dt,
      cum_distinct_cnt
FROM (
     --Step 2:
     SELECT id,
            rptg_dt,
            COUNT(id) OVER (ORDER BY rptg_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_distinct_cnt
     FROM (
        --Step 1:
           SELECT id,
                  rptg_dt,
                  ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt) row_num
           FROM product_sale
           QUALIFY (ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt)) =1 
           ) innerTab1
    ) innerTab2 
QUALIFY ROW_NUMBER() OVER (PARTITION BY rptg_dt ORDER BY cum_distinct_cnt DESC) = 1;

Output:
rptg_dt     cum_distinct_cnt
----------  ----------------
2017-12-31  3
2018-01-01  4
2018-01-02  6
Explanation: Above query is written in 3 steps
Note: If dataset is very large, it is advisable to break the query at each of the 3 steps and populate intermediate data in work table for better performance.