Khatri

- This tutorial will explain how to find cumulative distinct count over a period of time.
- Cumulative distinct count means if the column value is already counted in earlier period then don't count it again if it re-appears in another period.
- Simple application of this algorithm may be to find total different types of products sold incrementally over a period till the given date.

Lets try to understand what cumulative distinct count actually mean with a simple example.

**Sample Data:**Consider the below 3 days data which shows the product's unique Identifier (Id) and date(rptg_dt) on which product sale happened.`Id rptg_dt --- ---------- 10 2017-12-31 20 2017-12-31 30 2017-12-31 30 2017-12-31 Id rptg_dt --- ---------- 10 2018-01-01 40 2018-01-01 30 2018-01-01 40 2018-01-01 Id rptg_dt --- ---------- 10 2018-01-02 50 2018-01-02 60 2018-01-02`

**Required Output:**Cumulative distinct count of 'Ids' which is equivalent to total different types of products sold till the given date.`rptg_dt Count(*) ---------- -------- 2017-12-31 3 2018-01-01 4 2018-01-02 6`

**Result Explanation:**- There were 3 distinct Ids on first day(2017-12-31), therefore distinct count for this day is
**3**. - On second day(2018-01-01), there is only 1 new Id (i.e. 40) and others(10,20,30) were already counted on previous day. Therefore distinct cumulative count is
**4**(1 from current day and 3 from previous day). - On third day(2018-01-02), there are 2 new Ids (i.e. 50 & 60) and others(10,20,30,40) were already counted on previous days. Therefore distinct cumulative count is
**6**(2 from current day and 4 from previous days).

- There were 3 distinct Ids on first day(2017-12-31), therefore distinct count for this day is

➠

```
CREATE MULTISET TABLE product_sale( id INTEGER, rptg_dt DATE)
primary index(id);
```**Data:**
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',20);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',30);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',10);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',50);
INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',60);

➠

```
```**--Step 3:**
SELECT rptg_dt,
cum_distinct_cnt
FROM (
**--Step 2:**
SELECT id,
rptg_dt,
COUNT(id) OVER (ORDER BY rptg_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_distinct_cnt
FROM (
**--Step 1:**
SELECT id,
rptg_dt,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt) row_num
FROM product_sale
QUALIFY (ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt)) =1
) innerTab1
) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY rptg_dt ORDER BY cum_distinct_cnt DESC) = 1;
**Output:**
rptg_dt cum_distinct_cnt
---------- ----------------
2017-12-31 3
2018-01-01 4
2018-01-02 6

**Step 1:**Determine 1st record for each 'Id' all periods so that same Id should not be counted multiple times across different periods(dates). This step has been shown above in blue color.**Step 2:**Find general cumulative count using the data from Step 1. This step has been shown above in maroon color.**Step 3:**Final step will be to filter all rows for each reporting date except the one with greatest count. This step has been shown above in gray color.