TADM2E 4.45

Let's use the example given, for words say <i>A, B, C</i>, in the problem but not get too tied to the specific values. It helps to think about sorting the search string words by their indexes in the document:

In the code example this is just a call to sort but since each word index list is sorted we could use <code>MergeSort</code> which means this portion of the code is <math>\mathcal{O}(n)</math> for <math>n</math> indexes.

Once we have the list we add push each element into a hash with a separate key for each word and update our estimate of the snippet <i>span</i> that includes all the words.

Here's python code for one simple implementation.

<pre> import sys

def smallest_snippet(*args):

   
   args -- K lists of word positions for K words
   
   >>> smallest_snippet([1,10], [2,20], [3,30])
   [1, 3]
   >>> smallest_snippet([1,9,27], [6,10,19], [8,12,14])
   [8, 10]
   >>> smallest_snippet([1,4,11,27], [3,6,10,19], [5,8,12,14])
   [3, 5]
   >>> smallest_snippet([1,4,5], [3,9,10], [2,6,15])
   [1, 3]

   master = []
   for i in range(len(args)):
       master.extend(map(lambda j: (i,j), args[i])) # ith word, jth index
   master.sort(lambda x,y: cmp(x[1], y[1]))         # TODO: mergesort
   tops = {}                                        # { word i: index j }
   best = [master[0][1], master[-1][1]]
   minspan = best[-1] - best[0] + 1

   # update span after each new index tuple
   for (i,j) in master:
       tops[i] = j
       if len(tops) == len(args):
           curr = [min(tops.values()), max(tops.values())]
           if curr[1] - curr[0] < minspan:
               minspan = curr[1] - curr[0]
               best = curr
       
   return best

if __name__ == "__main__":

   import doctest
   doctest.testmod()

   sys.exit()

</pre>

TADM2E 4.45

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools