Khatri

- This tutorial will explain how to find cumulative distinct count over a period of time.
- Cumulative distinct count means if the column value is already counted in earlier period then don't count it again if it re-appears in another period.
- Simple application of this algorithm may be to find total different types of products sold incrementally over a period till the given date.

Lets try to understand what cumulative distinct count actually mean with a simple example.

**Sample Data:**Consider the below 3 days data which shows the product's unique Identifier (Id) and date(rptg_dt) on which product sale happened.Id rptg_dt --- ---------- 10 2017-12-31 20 2017-12-31 30 2017-12-31 30 2017-12-31 Id rptg_dt --- ---------- 10 2018-01-01 40 2018-01-01 30 2018-01-01 40 2018-01-01 Id rptg_dt --- ---------- 10 2018-01-02 50 2018-01-02 60 2018-01-02

**Required Output:**Cumulative distinct count of 'Ids' which is equivalent to total different types of products sold till the given date.rptg_dt Count(*) ---------- -------- 2017-12-31 3 2018-01-01 4 2018-01-02 6

**Result Explanation:**- There were 3 distinct Ids on first day(2017-12-31), therefore distinct count for this day is
**3**. - On second day(2018-01-01), there is only 1 new Id (i.e. 40) and others(10,20,30) were already counted on previous day. Therefore distinct cumulative count is
**4**(1 from current day and 3 from previous day). - On third day(2018-01-02), there are 2 new Ids (i.e. 50 & 60) and others(10,20,30,40) were already counted on previous days. Therefore distinct cumulative count is
**6**(2 from current day and 4 from previous days).

- There were 3 distinct Ids on first day(2017-12-31), therefore distinct count for this day is

➠

CREATE MULTISET TABLE product_sale( id INTEGER, rptg_dt DATE) primary index(id);Data:INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',10); INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',20); INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30); INSERT INTO product_sale(rptg_dt, id) VALUES('2017-12-31',30); INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',10); INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40); INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',30); INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-01',40); INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',10); INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',50); INSERT INTO product_sale(rptg_dt, id) VALUES('2018-01-02',60);

➠

--Step 3:SELECT rptg_dt, cum_distinct_cnt FROM (--Step 2:SELECT id, rptg_dt, COUNT(id) OVER (ORDER BY rptg_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_distinct_cnt FROM (--Step 1:SELECT id, rptg_dt, ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt) row_num FROM product_sale QUALIFY (ROW_NUMBER() OVER (PARTITION BY id ORDER BY rptg_dt)) =1 ) innerTab1 ) innerTab2 QUALIFY ROW_NUMBER() OVER (PARTITION BY rptg_dt ORDER BY cum_distinct_cnt DESC) = 1;Output:rptg_dt cum_distinct_cnt ---------- ---------------- 2017-12-31 3 2018-01-01 4 2018-01-02 6

**Step 1:**Determine 1st record for each 'Id' all periods so that same Id should not be counted multiple times across different periods(dates). This step has been shown above in blue color.**Step 2:**Find general cumulative count using the data from Step 1. This step has been shown above in maroon color.**Step 3:**Final step will be to filter all rows for each reporting date except the one with greatest count. This step has been shown above in gray color.