• This tutorial will explain how to find cumulative distinct count over a period of time.
• Cumulative distinct count means if the column value is already counted in earlier period then don't count it again if it re-appears in another period.
• Simple application of this algorithm may be to find total different types of products sold incrementally over a period till the given date.

Lets try to understand what cumulative distinct count actually mean with a simple example.

• Sample Data: Consider the below 3 days data which shows the product's unique Identifier (Id) and date(rptg_dt) on which product sale happened.
```Id  rptg_dt
--- ----------
10  2017-12-31
20  2017-12-31
30  2017-12-31
30  2017-12-31

Id  rptg_dt
--- ----------
10  2018-01-01
40  2018-01-01
30  2018-01-01
40  2018-01-01

Id  rptg_dt
--- ----------
10  2018-01-02
50  2018-01-02
60  2018-01-02
```

• Required Output: Cumulative distinct count of 'Ids' which is equivalent to total different types of products sold till the given date.
```rptg_dt     Count(*)
----------  --------
2017-12-31  3
2018-01-01  4
2018-01-02  6
```
Result Explanation:
• There were 3 distinct Ids on first day(2017-12-31), therefore distinct count for this day is 3.
• On second day(2018-01-01), there is only 1 new Id (i.e. 40) and others(10,20,30) were already counted on previous day. Therefore distinct cumulative count is 4 (1 from current day and 3 from previous day).
• On third day(2018-01-02), there are 2 new Ids (i.e. 50 & 60) and others(10,20,30,40) were already counted on previous days. Therefore distinct cumulative count is 6 (2 from current day and 4 from previous days).

Dataset: Lets convert the above dataset into a table for demonstration purpose.
```CREATE MULTISET TABLE product_sale( id INTEGER, rptg_dt DATE)
primary index(id);

Data:
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',20);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);

INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);

INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',50);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',60);
```

Query to get Cumulative Distinct Count
```--Step 3:
SELECT rptg_dt,
cum_distinct_cnt
FROM (
--Step 2:
SELECT id,
rptg_dt,
COUNT(id) OVER (ORDER BY rptg_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_distinct_cnt
FROM (
--Step 1:
SELECT id,
rptg_dt,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt) row_num
FROM product_sale
QUALIFY (ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt)) =1
) innerTab1
) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY rptg_dt ORDER BY cum_distinct_cnt DESC) = 1;

Output:
rptg_dt     cum_distinct_cnt
----------  ----------------
2017-12-31  3
2018-01-01  4
2018-01-02  6
```
Explanation: Above query is written in 3 steps
• Step 1: Determine 1st record for each 'Id' all periods so that same Id should not be counted multiple times across different periods(dates). This step has been shown above in blue color.
• Step 2: Find general cumulative count using the data from Step 1. This step has been shown above in maroon color.
• Step 3: Final step will be to filter all rows for each reporting date except the one with greatest count. This step has been shown above in gray color.

Note: If dataset is very large, it is advisable to break the query at each of the 3 steps and populate intermediate data in work table for better performance.