sql - Oracle consolidate or re-normalize row sets -


we have table of values has been expanded de-normalized set , need re-normalize it, finding fewest number of reference sets.

a simplified version of source data looks this:

period  group  item  seq ------  -----  ----  ---      1          1    1      1          2    2      1          3    3      1      b     1    1      1      b     2    2      1      b     3    3      1      c     1    1      1      c     4    2      1      c     5    3      1      d     2    1      1      d     1    2      1      d     3    3      1      e     1    1      1      e     2    2      1      f     2    1      1      f     1    2      1      f     3    3 

i want extract minimum number of lists defined in data , assign reference list based on period , group. list consists of ordered sequence of items. here 4 lists defined in above data:

list  item  seq ----  ----  ---    1     2    1    1     1    2    1     3    3    2     1    1    2     2    2    2     3    3    3     1    1    3     4    2    3     5    3    4     1    1    4     2    2 

and output want achieve:

period  group  list ------  -----  ----      1          2      1      b     2      1      c     3      1      d     1      1      e     4      1      f     1 

i have solution works using ora_hash , list_agg generate hash on items of group, fails when number of items in group greater 400. resulting error ora-01489: result of string concatenation long.

i'm looking general solution work regardless of number of items in group in given period.

items identified integer value less 100,000. realistically, we'll never see more 4000 items in group.

this logically similar works 400 group item records:

with      the_source_data (     select 1 the_period, 'a' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'a' the_group, 2 the_item, 2 the_seq dual union     select 1 the_period, 'a' the_group, 3 the_item, 3 the_seq dual union     select 1 the_period, 'b' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'b' the_group, 2 the_item, 2 the_seq dual union     select 1 the_period, 'b' the_group, 3 the_item, 3 the_seq dual union     select 1 the_period, 'c' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'c' the_group, 4 the_item, 2 the_seq dual union     select 1 the_period, 'c' the_group, 5 the_item, 3 the_seq dual union     select 1 the_period, 'd' the_group, 2 the_item, 1 the_seq dual union     select 1 the_period, 'd' the_group, 1 the_item, 2 the_seq dual union     select 1 the_period, 'd' the_group, 3 the_item, 3 the_seq dual union     select 1 the_period, 'e' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'e' the_group, 2 the_item, 2 the_seq dual union     select 1 the_period, 'f' the_group, 2 the_item, 1 the_seq dual union     select 1 the_period, 'f' the_group, 1 the_item, 2 the_seq dual union     select 1 the_period, 'f' the_group, 3 the_item, 3 the_seq dual         ),     cte_list_hash ( select     the_period,     the_group,      ora_hash(listagg(to_char(the_item, '00000')||to_char(the_seq, '0000')) within group (order the_seq)) list_hash     the_source_data group     the_period,     the_group  ), cte_unique_lists ( select     list_hash,     min(the_period) keep (dense_rank first order the_period, the_group) the_period,     min(the_group) keep (dense_rank first order the_period, the_group) the_group     cte_list_hash group      list_hash ), cte_list_base ( select         the_period,     the_group,     list_hash,     rownum the_list             cte_unique_lists ) select     a.the_period,     a.the_group,     b.the_list     cte_list_hash     inner join     cte_list_base b         on a.list_hash = b.list_hash; 

any finding right direction take appreciated.

here way results without using listagg , without getting ora-01489 errors.

the main caveat numbers lists differently had in example, numbering seemed arbitrary me. version numbers them based on ordinal position of first period/group uses list. is, example, list used group in period 1 "list #1".

i threw in sample data period 2, make sure happening correctly, too.

hopefully comments in sql below explain approach enough.

finally... have no idea how long run on large data set. cross join may problematic.

with      the_source_data (     select 1 the_period, 'a' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'a' the_group, 2 the_item, 2 the_seq dual union     select 1 the_period, 'a' the_group, 3 the_item, 3 the_seq dual union     select 1 the_period, 'b' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'b' the_group, 2 the_item, 2 the_seq dual union     select 1 the_period, 'b' the_group, 3 the_item, 3 the_seq dual union     select 1 the_period, 'c' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'c' the_group, 4 the_item, 2 the_seq dual union     select 1 the_period, 'c' the_group, 5 the_item, 3 the_seq dual union     select 1 the_period, 'd' the_group, 2 the_item, 1 the_seq dual union     select 1 the_period, 'd' the_group, 1 the_item, 2 the_seq dual union     select 1 the_period, 'd' the_group, 3 the_item, 3 the_seq dual union     select 1 the_period, 'e' the_group, 1 the_item, 1 the_seq dual union     select 1 the_period, 'e' the_group, 2 the_item, 2 the_seq dual union     select 1 the_period, 'f' the_group, 2 the_item, 1 the_seq dual union     select 1 the_period, 'f' the_group, 1 the_item, 2 the_seq dual union     select 1 the_period, 'f' the_group, 3 the_item, 3 the_seq dual union     select 2 the_period, 'f' the_group, 1 the_item, 1 the_seq dual union     select 2 the_period, 'f' the_group, 4 the_item, 2 the_seq dual union     select 2 the_period, 'f' the_group, 5 the_item, 3 the_seq dual          ), -- cte counts number of rows in each period, group.  need avoid matching long list shorter list -- happens share same order, far goes. sd2 ( select sd.*, count(*) on ( partition sd.the_period, sd.the_group) cnt the_source_data sd ), -- cte joins every row every other rows , filters based on matches of item#, seq, , list length -- counts number of matches period , group (cnt3) sd3 (  select sd2a.the_period, sd2a.the_group, sd2a.the_item, sd2a.the_seq, sd2a.cnt, sd2b.the_period the_period2, sd2b.the_group the_group2, sd2b.the_item the_item2, sd2b.the_seq the_seq2, sd2b.cnt cnt2   , count(*) on ( partition sd2a.the_period, sd2a.the_group, sd2b.the_period, sd2b.the_group) cnt3 sd2 sd2a cross join sd2 sd2b   sd2b.the_item= sd2a.the_item  ,     sd2b.the_seq = sd2a.the_seq ,     sd2a.cnt = sd2b.cnt ), -- cte filters period, groups had same number of matches elements in original period, group.  i.e.,  -- filters perfect list matches: elements same, in same order, , list lengths same. -- each, gets first period , group # share list sd4 (  select the_period, the_group, --min(the_group2) on ( partition the_period, the_group ) first_in_group min(the_period2) keep ( dense_rank first order the_period2, the_group2 ) on ( partition the_period, the_group) first_period,  min(the_group2) keep ( dense_rank first order the_period2, the_group2 ) on ( partition the_period, the_group) first_group  sd3 cnt = cnt3 ) -- we'll arbitrarily name lists based on ordinal position of first period , group uses list.  select distinct the_period, the_group, dense_rank() on ( partition null order first_period, first_group ) list sd4 order 1,2 

Comments