Login | Register

Nerd Paradise

↑ ↑ ↓ ↓ ← → ← → B A start
The Forum > Technology > Anagram finder in Python
This program reads a list of words (in order to know what permutations are 'real words') and finds all the unique permutations of an input string that are in the dictionary. Then it does it again, and repeats until you type "!" at the prompt.

Anagram finding is accomplished speedily by compiling dictionary words into a prefix tree (every word that begins with 'a' is stored in chains rooted at an 'a' node, words that start with 'ab' are all rooted at a 'b' node under the top-level 'a', etc.) and also by storing that tree via Python's pickle module for reuse in future runs.

Here's the code. Change the file names or save a dictionary to the right place, and you're in business.

import os
import pickle

bedugging = False
bedugging = True

def dprint(msg):
    global bedugging
    if bedugging:
        print(msg)

class Node:
    def __init__(self, _l, _parent=None):
        self.l = _l
        self.parent = _parent
        self.children = {}
    def __len__(self):
        return len(self.children.keys())
    def __getitem__(self, key):
        return self.children[key]
    def __contains__(self, item):
        return item in self.children
    def __setitem__(self, key, value):
        self.children[key] = value
    def __str__(self):
        return ','.join(self.children.keys())
    def stop(self):
        return '' in self.children

def insertword(line, node):
    dprint('insert %s' % (line))
    current = node
    for char in line:
        if char not in current:
            dprint('add %s' % (char))
            nextnode = Node(char, current)
            current[char] = nextnode
        else:
            dprint('follow %s' % (char))
        current = current[char]
    dprint('terminate')
    current[''] = None
         
def load_dictionary(path):
    words = Node(None)
    for line in open(path).readlines():
        insertword(line.strip(), words)
    return words

def wordscontaining(letters, tree, word=[]):
    dprint(letters)
    dprint(word)
    dprint(tree)
    if len(letters) > 0 and len(tree) > 0 :
        dprint('keep going')
        result = []
        for letter in letters:
            if letter in tree:
                nextletters = letters[:]
                nextletters.remove(letter)
                result += wordscontaining(nextletters, tree[letter], word + [letter])
        dprint('returning %s' % result)
        return result
    if len(letters) == 0 and tree.stop():
        dprint('matched %s' % (''.join(word)))
        return [''.join(word)]
    # otherwise
    dprint('cancel')
    return []

def startup():
    global prefictionary
    treepath = "prefictionary.pyobj"
    dicpath = "english-words.80.trimmed" if not bedugging else "faketionary.txt"

    if os.path.exists(treepath) and os.path.getsize(treepath)>0:
        prefictionary = pickle.load(open(treepath, 'rb'))
    else:
        print("Generating dictionary tree...", end='')
        prefictionary = load_dictionary(dicpath)
        pickle.dump(prefictionary, open(treepath, 'wb'))
        print("done generating.")

def findwords(text):
    withdups = wordscontaining(list(text), prefictionary)
    withoutdups = {}
    for word in withdups:
        dprint('found %s' % (word))
        withoutdups[''.join(word)] = True
    return list(withoutdups.keys())

def main():
    try:
        while True:
            line = input('> ')
            if len(line) > 0:
                if line[0] == '!':
                    raise EOFError()
                for word in findwords(list(line)):
                    print(word)
    except EOFError:
        pass

###

startup()
print("Ready.")
if bedugging:
    print(findwords(['l''i''e''n']))
else:
    main()


NOTES!

Repeated Letters: The search algorithm is naive about repeat-letters. If you search for a word (that exists in the dictionary) with more than five or six repetitions of the same letter, it will serve as a wonderful object lesson on the term 'combinatorial explosion'. Especially if you have debugging turned on; all that console output slows things down massively.
I have a solution in mind, a minor modification to the code, that should solve this problem pretty handily. How would YOU do it? ;)

SCOWL: ...No, I'm not scowling at you. The dictionary file I used was built from the Spell Checker Oriented Word List project. It provides not only multiple different word lists by region (i.e., a UK list as distinct from an American list, etc.), but splits the lists into commonality categories. The most common words are in the first file, the rarest words are in the last. This is handy to let you decide how comprehensive a dictionary you want to work with. And do yourself a favor and read the readme file about automatically compiling a dictionary file from the project data. I didn't do that until just now.

TODO:
* Use more comments.
* Put in some command-line argument processing to turn debugging on and off, or specify a dictionary file, or suppress prompts for using the program in a pipeline.
* Fix that repeat-letter bug.
* Learn more about proper OOP in Python.
* Start going to the gym.
* Solve world hunger.
* Go off on fewer tangents!
[Quote] [Link]
gws said:
Anagram finding is accomplished speedily by compiling dictionary words into a prefix tree (every word that begins with 'a' is stored in chains rooted at an 'a' node, words that start with 'ab' are all rooted at a 'b' node under the top-level 'a', etc.) and also by storing that tree via Python's pickle module for reuse in future runs.


I'm going to stop you there. There's a much better way to do it. Instead of your prefix tree, it is better to take the letters of a word and sort them, then associate the word with that sorted letters.

def sortword(word):
    wordl = list(word.lower())
    wordl.sort()
    return "".join(wordl)


Thus Grab, Brag, and Garb will all be sorted to "abgr"

Then you use a dict to store all of these associations:

def readdictionary(path):
    mydict = {}
    for word in open(path):
        word = word.strip()
        sortedword = sortword(word)
        if sortedword not in mydict:
            mydict[sortedword] = []
        mydict[sortedword].append(word)
    return mydict


Retrieval is very, very simple, and O(1) in every case, instead of your version, which is O(n!) in the worst case.

def getanagrams(word, dictionary):
    sortedword = sortword(word)
    if sortedword in dictionary:
        return dictionary[sortedword]
    else:
        return []


With a few modifications to the sortword function, it's easy to make it take more things, including proper nouns, and check based on only the letters. The full program is below. Feel free to objectify it, I just wanted to quickly hack this together:

import re

def sortword(word):
    wordl = list("".join(re.split("[^a-z]*", word.lower())))
    wordl.sort()
    return "".join(wordl)

def readdictionary(path):
    mydict = {}
    for word in open(path):
        word = word.strip()
        sortedword = sortword(word)
        if sortedword not in mydict:
            mydict[sortedword] = []
        mydict[sortedword].append(word)
    return mydict

def getanagrams(word, dictionary):
    sortedword = sortword(word)
    if sortedword in dictionary:
        return dictionary[sortedword]
    else:
        return []
[Quote] [Link]
Canonicalization is definitely faster...more memory use but that's usually not a problem. (average-word-length*sizeof(char)*O(words) vs. sizeof(pointer)*O(words)) I know I did it that way once, too.

I starting writing prefix tree code for something else, forgot what I was working on, and turned it into an anagram searcher.
[Quote] [Link]
The Forum > Technology > Anagram finder in Python
Current Date: 13 Ineo 14:3Current Time: 11.53.0Join us in IRC...
Server: irc.esper.net
Channel: #nerdparadise
Your IP: 50.16.36.153Browser: UnknownBrowser Version: 0