python - Working out dependent groups -


using following function can generate test data.

import random, string = list(string.ascii_lowercase) def gen_test_data():     s = []     in xrange(15):         p = random.randint(1,3)         xs = [random.choice(a) in xrange(p)]         s.append(xs)     return s 

this test data.

[     ['a', 's'],     ['f', 'c'],     ['w', 'z'],     ['z', 'p'],     ['z', 'u', 'g'],     ['v', 'q', 'w'],     ['y', 'w'],     ['d', 'x', 'i'],     ['l', 'f', 's'],     ['z', 'g'],     ['h', 'x', 'k'],     ['b'],     ['t'],     ['s', 'd', 'c'],     ['s', 'w', 'd'] ] 

if letter shares list letter dependent on letter, , of letters other dependencies. example

x dependant on ['a','c','d','g','f','i','h','k','l','q','p','s','u','w','v','y', 'x','z']

these dependencies 2 way. k dependent on including x

but x not dependant on b or t. these can placed in separate groups.

i need divide list many non dependant groups possible.

each group set of letters group depends on. non dependent letters set of one.

an example output 1 above is

[     ['t'],     ['b'],     ['a','c','d','g','f','i','h','k','l','q','p','s','u','w','v','y', 'x','z'] ] 

i trying write function can't figure out right way group correctly.

this classic connected components problem. there number of efficient linear-time or nearly-linear-time algorithms solve it, such graph search algorithm depth-first search or union-find data structure.


for search-based algorithm, set graph based on input, nodes in input sublists connected, run search of graph find nodes reachable each other. graph libraries networkx can handle of implementation you, or write yourself. example,

import collections  def build_graph(data):     graph = collections.defaultdict(list)     sublist in data:         # it's enough connect each sublist element first element.         # no need connect each sublist element each other sublist element.         item in sublist[1:]:             graph[sublist[0]].append(item)             graph[item].append(sublist[0])         if len(sublist) == 1:             # make sure add nodes if don't have edges.             graph.setdefault(sublist[0], [])     return graph  def dfs_connected_component(graph, node):     frontier = [node]     visited = set()     while frontier:         frontier_node = frontier.pop()         if frontier_node in visited:             continue         visited.add(frontier_node)         frontier.extend(graph[frontier_node])     return visited  def dependent_groups(data):     graph = build_graph(data)     components = []     nodes_with_component_known = set()     node in graph:         if node in nodes_with_component_known:             continue         component = dfs_connected_component(graph, node)         components.append(component)         nodes_with_component_known.update(component)     return components 

this have runtime linear in size of input.


you use union-find data structure. union-find data structure associates items sets, each set represented representative element. support 2 operations: find, finds representative element, , union, takes 2 elements , joins sets 1 set.

you can set union-find data structure, , each sublist in input, union each element of sublist first element of sublist. efficiently group dependent elements without joining independent elements together.

with standard implementation of union-find data structure disjoint-set forest union rank , path compression, runtime linear in size of input. it'd slower factor of inverse ackermann function of input, constant practical input sizes.


Comments