Teradata Cumulative Distinct Count
- This tutorial will explain how to find cumulative distinct count over a period of time.
- Cumulative distinct count means if the column value is already counted in earlier period then don't count it again if it re-appears in another period.
- Simple application of this algorithm may be to find total different types of products sold incrementally over a period till the given date.
Lets try to understand what cumulative distinct count actually mean with a simple example.
- Sample Data: Consider the below 3 days data which shows the product's unique Identifier (Id) and date(rptg_dt) on which product sale happened.
Id rptg_dt
--- ----------
10 2017-12-31
20 2017-12-31
30 2017-12-31
30 2017-12-31
Id rptg_dt
--- ----------
10 2018-01-01
40 2018-01-01
30 2018-01-01
40 2018-01-01
Id rptg_dt
--- ----------
10 2018-01-02
50 2018-01-02
60 2018-01-02
- Required Output: Cumulative distinct count of 'Ids' which is equivalent to total different types of products sold till the given date.
rptg_dt Count(*)
---------- --------
2017-12-31 3
2018-01-01 4
2018-01-02 6
Result Explanation:
- There were 3 distinct Ids on first day(2017-12-31), therefore distinct count for this day is 3.
- On second day(2018-01-01), there is only 1 new Id (i.e. 40) and others(10,20,30) were already counted on previous day. Therefore distinct cumulative count is 4 (1 from current day and 3 from previous day).
- On third day(2018-01-02), there are 2 new Ids (i.e. 50 & 60) and others(10,20,30,40) were already counted on previous days. Therefore distinct cumulative count is 6 (2 from current day and 4 from previous days).
➠
Dataset: Lets convert the above dataset into a table for demonstration purpose.
CREATE MULTISET TABLE product_sale( id INTEGER, rptg_dt DATE)
primary index(id);
Data:
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',20);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',50);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',60);
➠
Query to get Cumulative Distinct Count
--Step 3:
SELECT rptg_dt,
cum_distinct_cnt
FROM (
--Step 2:
SELECT id,
rptg_dt,
COUNT(id) OVER (ORDER BY rptg_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_distinct_cnt
FROM (
--Step 1:
SELECT id,
rptg_dt,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt) row_num
FROM product_sale
QUALIFY (ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt)) =1
) innerTab1
) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY rptg_dt ORDER BY cum_distinct_cnt DESC) = 1;
Output:
rptg_dt cum_distinct_cnt
---------- ----------------
2017-12-31 3
2018-01-01 4
2018-01-02 6
Explanation: Above query is written in 3 steps
- Step 1: Determine 1st record for each 'Id' all periods so that same Id should not be counted multiple times across different periods(dates). This step has been shown above in blue color.
- Step 2: Find general cumulative count using the data from Step 1. This step has been shown above in maroon color.
- Step 3: Final step will be to filter all rows for each reporting date except the one with greatest count. This step has been shown above in gray color.
Note:
If dataset is very large, it is advisable to break the query at each of the 3 steps and populate intermediate data in work table for better performance.