Difference between revisions of "TADM2E 3.29"

From Algorithm Wiki
Jump to: navigation, search
(added solution using Trie)
(Replaced content with " Tus")
Line 1: Line 1:
  
== Using a Hashtable ==
+
Tus
 
+
Scan the web page word by word and for each one, except the first, take the pair formed by the previous word + the current one. Use that pair as the key to a hashtable, where you store the frequency of that pair. If the entry is not already in the hash table store it with value 1, if it is there increment the current value by one. Keep the highest frequency seen and update it wherever you see a greater one. In the end return the greatest frequency seen.
+
 
+
Complexity, where N = number of words in the text:
+
 
+
Time: O(N) (you only iterate over each word once)
+
 
+
Space: O(N) (Worst case is when every word is distinct and thus you have no repeated pairs, and that would take N storage on the hash table)
+
 
+
 
+
== Using a Trie ==
+
 
+
The solution with the hashtable will only work in time <math>O(n)</math> if the hashtable contains <math>m=n</math> slots (in general, lookup in a hashtable takes <math>O(n/m)</math> time). Still, this approach could work well if <math>m</math> is chosen wisely.
+
 
+
Another approach would be to store the text in a trie and save the number of occurrences in the leaf nodes. The implementation below iterates through the text letter by letter (in <math>O(n)</math> time), using two word pointers <code>word1</code> and <code>word2</code> that point to the position in the trie of the current word and the concatenation of the current with the last word, respectively.
+
 
+
If the text is long and natural, it is expected to contain many repetitions and common prefixes, and an efficient implementation would use relatively little memory. Another advantage is that no parameters have to be pre-defined (such as <math>m</math> in the example using the hashtable).
+
 
+
<source lang="python">
+
 
+
class Trie:
+
 
+
    def __init__(self, parent=None):
+
        self.parent = parent
+
        self.children = {}
+
        self.n = 0
+
 
+
    def add(self, c):
+
        return self.children.setdefault(c, Trie(self))
+
 
+
    def letter(self, child):
+
        for c in self.children:
+
            if self.children[c] == child:
+
                return c
+
        return None
+
 
+
    def spell(self):
+
        if self.parent:
+
            return self.parent.spell() + self.parent.letter(self)
+
        return ''
+
 
+
def wrap(text):
+
    """iterates through text, yielding lowercase letter or single ' '"""
+
    inword = False
+
    for c in text:
+
        c = c.lower()
+
        if ord(c) >= ord('a') and ord(c) <= ord('z'):
+
            yield c
+
            inword = True
+
        else:
+
            if inword:
+
                yield ' '
+
            inword = False
+
    if inword:
+
        yield ' '
+
 
+
def pairmax(text):
+
    """finds the word pair with maximum occurrence"""
+
    root = word2 = Trie()
+
    best = word1 = None
+
    for c in wrap(text):
+
        if c == ' ':
+
            if word1:
+
                word2.n += 1
+
                if best is None or word2.n > best.n:
+
                    best = word2
+
                word2 = word1.add(' ')
+
                word1 = root
+
            else:
+
                word1 = root
+
                word2 = word2.add(' ')
+
        else:
+
            word2 = word2.add(c)
+
            if word1:
+
                word1 = word1.add(c)
+
    return best.spell(), best.n
+
 
+
 
+
if __name__ == '__main__':
+
    import urllib2, sys
+
 
+
    urls = sys.argv[1:]
+
    if not urls:
+
        urls = ['http://www.rfc-editor.org/rfc/rfc19%02d.txt' % i for i in range(30, 40)]
+
 
+
    for url in urls:
+
        word, n = pairmax(urllib2.urlopen(url).read())
+
        sys.stdout.write('%s : %dx "%s"\n' %(url, n, word))
+
</source>
+

Revision as of 01:57, 14 July 2020

Tus