TADM2E 3.29
Using a Hashtable
Scan the web page word by word and for each one, except the first, take the pair formed by the previous word + the current one. Use that pair as the key to a hashtable, where you store the frequency of that pair. If the entry is not already in the hash table store it with value 1, if it is there increment the current value by one. Keep the highest frequency seen and update it wherever you see a greater one. In the end return the greatest frequency seen.
Complexity, where N = number of words in the text:
Time: O(N) (you only iterate over each word once)
Space: O(N) (Worst case is when every word is distinct and thus you have no repeated pairs, and that would take N storage on the hash table)
Using a Trie
The solution with the hashtable will only work in time $ O(n) $ if the hashtable contains $ m=n $ slots (in general, lookup in a hashtable takes $ O(n/m) $ time). Still, this approach could work well if $ m $ is chosen wisely.
Another approach would be to store the text in a trie and save the number of occurrences in the leaf nodes. The implementation below iterates through the text letter by letter (in $ O(n) $ time), using two word pointers word1
and word2
that point to the position in the trie of the current word and the concatenation of the current with the last word, respectively.
If the text is long and natural, it is expected to contain many repetitions and common prefixes, and an efficient implementation would use relatively little memory. Another advantage is that no parameters have to be pre-defined (such as $ m $ in the example using the hashtable).
class Trie: def __init__(self, parent=None): self.parent = parent self.children = {} self.n = 0 def add(self, c): return self.children.setdefault(c, Trie(self)) def letter(self, child): for c in self.children: if self.children[c] == child: return c return None def spell(self): if self.parent: return self.parent.spell() + self.parent.letter(self) return '' def wrap(text): """iterates through text, yielding lowercase letter or single ' '""" inword = False for c in text: c = c.lower() if ord(c) >= ord('a') and ord(c) <= ord('z'): yield c inword = True else: if inword: yield ' ' inword = False if inword: yield ' ' def pairmax(text): """finds the word pair with maximum occurrence""" root = word2 = Trie() best = word1 = None for c in wrap(text): if c == ' ': if word1: word2.n += 1 if best is None or word2.n > best.n: best = word2 word2 = word1.add(' ') word1 = root else: word1 = root word2 = word2.add(' ') else: word2 = word2.add(c) if word1: word1 = word1.add(c) return best.spell(), best.n if __name__ == '__main__': import urllib2, sys urls = sys.argv[1:] if not urls: urls = ['http://www.rfc-editor.org/rfc/rfc19%02d.txt' % i for i in range(30, 40)] for url in urls: word, n = pairmax(urllib2.urlopen(url).read()) sys.stdout.write('%s : %dx "%s"\n' %(url, n, word))