we have table of values has been expanded de-normalized set , need re-normalize it, finding fewest number of reference sets.
a simplified version of source data looks this:
period group item seq ------ ----- ---- --- 1 1 1 1 2 2 1 3 3 1 b 1 1 1 b 2 2 1 b 3 3 1 c 1 1 1 c 4 2 1 c 5 3 1 d 2 1 1 d 1 2 1 d 3 3 1 e 1 1 1 e 2 2 1 f 2 1 1 f 1 2 1 f 3 3
i want extract minimum number of lists defined in data , assign reference list based on period , group. list consists of ordered sequence of items. here 4 lists defined in above data:
list item seq ---- ---- --- 1 2 1 1 1 2 1 3 3 2 1 1 2 2 2 2 3 3 3 1 1 3 4 2 3 5 3 4 1 1 4 2 2
and output want achieve:
period group list ------ ----- ---- 1 2 1 b 2 1 c 3 1 d 1 1 e 4 1 f 1
i have solution works using ora_hash , list_agg generate hash on items of group, fails when number of items in group greater 400. resulting error ora-01489: result of string concatenation long.
i'm looking general solution work regardless of number of items in group in given period.
items identified integer value less 100,000. realistically, we'll never see more 4000 items in group.
this logically similar works 400 group item records:
with the_source_data ( select 1 the_period, 'a' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'a' the_group, 2 the_item, 2 the_seq dual union select 1 the_period, 'a' the_group, 3 the_item, 3 the_seq dual union select 1 the_period, 'b' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'b' the_group, 2 the_item, 2 the_seq dual union select 1 the_period, 'b' the_group, 3 the_item, 3 the_seq dual union select 1 the_period, 'c' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'c' the_group, 4 the_item, 2 the_seq dual union select 1 the_period, 'c' the_group, 5 the_item, 3 the_seq dual union select 1 the_period, 'd' the_group, 2 the_item, 1 the_seq dual union select 1 the_period, 'd' the_group, 1 the_item, 2 the_seq dual union select 1 the_period, 'd' the_group, 3 the_item, 3 the_seq dual union select 1 the_period, 'e' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'e' the_group, 2 the_item, 2 the_seq dual union select 1 the_period, 'f' the_group, 2 the_item, 1 the_seq dual union select 1 the_period, 'f' the_group, 1 the_item, 2 the_seq dual union select 1 the_period, 'f' the_group, 3 the_item, 3 the_seq dual ), cte_list_hash ( select the_period, the_group, ora_hash(listagg(to_char(the_item, '00000')||to_char(the_seq, '0000')) within group (order the_seq)) list_hash the_source_data group the_period, the_group ), cte_unique_lists ( select list_hash, min(the_period) keep (dense_rank first order the_period, the_group) the_period, min(the_group) keep (dense_rank first order the_period, the_group) the_group cte_list_hash group list_hash ), cte_list_base ( select the_period, the_group, list_hash, rownum the_list cte_unique_lists ) select a.the_period, a.the_group, b.the_list cte_list_hash inner join cte_list_base b on a.list_hash = b.list_hash;
any finding right direction take appreciated.
here way results without using listagg
, without getting ora-01489
errors.
the main caveat numbers lists differently had in example, numbering seemed arbitrary me. version numbers them based on ordinal position of first period/group uses list. is, example, list used group in period 1 "list #1".
i threw in sample data period 2, make sure happening correctly, too.
hopefully comments in sql below explain approach enough.
finally... have no idea how long run on large data set. cross join may problematic.
with the_source_data ( select 1 the_period, 'a' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'a' the_group, 2 the_item, 2 the_seq dual union select 1 the_period, 'a' the_group, 3 the_item, 3 the_seq dual union select 1 the_period, 'b' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'b' the_group, 2 the_item, 2 the_seq dual union select 1 the_period, 'b' the_group, 3 the_item, 3 the_seq dual union select 1 the_period, 'c' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'c' the_group, 4 the_item, 2 the_seq dual union select 1 the_period, 'c' the_group, 5 the_item, 3 the_seq dual union select 1 the_period, 'd' the_group, 2 the_item, 1 the_seq dual union select 1 the_period, 'd' the_group, 1 the_item, 2 the_seq dual union select 1 the_period, 'd' the_group, 3 the_item, 3 the_seq dual union select 1 the_period, 'e' the_group, 1 the_item, 1 the_seq dual union select 1 the_period, 'e' the_group, 2 the_item, 2 the_seq dual union select 1 the_period, 'f' the_group, 2 the_item, 1 the_seq dual union select 1 the_period, 'f' the_group, 1 the_item, 2 the_seq dual union select 1 the_period, 'f' the_group, 3 the_item, 3 the_seq dual union select 2 the_period, 'f' the_group, 1 the_item, 1 the_seq dual union select 2 the_period, 'f' the_group, 4 the_item, 2 the_seq dual union select 2 the_period, 'f' the_group, 5 the_item, 3 the_seq dual ), -- cte counts number of rows in each period, group. need avoid matching long list shorter list -- happens share same order, far goes. sd2 ( select sd.*, count(*) on ( partition sd.the_period, sd.the_group) cnt the_source_data sd ), -- cte joins every row every other rows , filters based on matches of item#, seq, , list length -- counts number of matches period , group (cnt3) sd3 ( select sd2a.the_period, sd2a.the_group, sd2a.the_item, sd2a.the_seq, sd2a.cnt, sd2b.the_period the_period2, sd2b.the_group the_group2, sd2b.the_item the_item2, sd2b.the_seq the_seq2, sd2b.cnt cnt2 , count(*) on ( partition sd2a.the_period, sd2a.the_group, sd2b.the_period, sd2b.the_group) cnt3 sd2 sd2a cross join sd2 sd2b sd2b.the_item= sd2a.the_item , sd2b.the_seq = sd2a.the_seq , sd2a.cnt = sd2b.cnt ), -- cte filters period, groups had same number of matches elements in original period, group. i.e., -- filters perfect list matches: elements same, in same order, , list lengths same. -- each, gets first period , group # share list sd4 ( select the_period, the_group, --min(the_group2) on ( partition the_period, the_group ) first_in_group min(the_period2) keep ( dense_rank first order the_period2, the_group2 ) on ( partition the_period, the_group) first_period, min(the_group2) keep ( dense_rank first order the_period2, the_group2 ) on ( partition the_period, the_group) first_group sd3 cnt = cnt3 ) -- we'll arbitrarily name lists based on ordinal position of first period , group uses list. select distinct the_period, the_group, dense_rank() on ( partition null order first_period, first_group ) list sd4 order 1,2
Comments
Post a Comment