[AccessD] Wordnet dictionary database

John W. Colby jwcolby at gmail.com
Mon Feb 9 22:25:36 CST 2015


Have you ever run across something called WordNet dictionary?  It can be 
obtained in a SQLite database format, which is a ROYAL pita to deal with 
outside of SQLite.  The dictionary is "open" and if you just want to use 
the SQLite database it is fast and easy to start to use.  I will likely 
do that with Linux, use it directly to program against using C++.

In any event the dictionary is quite complex, dozens of tables, 
apparently fairly well normalized.  And of course I really don't 
understand all the stuff that they are doing (yet) as language is not my 
strong suit.

This started out as an attempt to simply get a list of words and their 
definitions (a simple dictionary).  Well I got a dictionary but it isn't 
a simple one.

Some guy was running benchmarks and came up with this task (I expect a 
pet project of his) to find all of the words that can be made from a 
given set of characters.  He didn't discuss an algorithm, merely that he 
timed it with the original Raspberry Pi, and then again with the 
Raspberry Pi M2.  However he didn't actually parallel it, simply ran the 
algorithm simultaneously 4 times on 4 separate threads, one on each 
core.   That's cheating and not a true benchmark.  I suspect however 
that he was using a brute force search of some sort.

Which got me itching to do it right.  I am trying to get my hands on 
several of the M2 Pis to build out a small cluster.  So an algorithm 
like this, really parallized to run pieces of the thing on different 
threads (cores), and in fact sets of pieces on different physical 
machines (true parallelization) would be a cool thing to do.

The algorithm basically implies discovering the characters in a word:

parallelization =  aaaeiilllnoprtz

and sorting them as I did above.  Then using a database or perhaps 
binary tree, search to find all of the words that have the same set of 
characters.

One way would be to use a database where you simply order the characters 
as above, then add a new field to each word record to hold that 
reordered set, index and done.
Another way is to build a parent table where you insert the ordered 
character set, obtaining a PK to insert as an FK in the word fieldtable  
If another word has the same set, find the ordered character set and use 
the same PK.  Thus any given set of characters only goes in the parent 
table once, with a FK in the word table pointing back to the ordered 
character set.  No doubt MUCH faster to use (and a better "database" 
solution), but more preprocessing to set it all up.

Regardless of solution, to find matching words I have to sort the 
supplied characters, then look the sorted set up in the database, 
returning all matching words.  Of course MY defined problem assumes an 
exact match.  Something like a scrabble word finder would require 
returning all words which fit the character set, no doubt sorted longest 
to shortest.  An entirely different problem!  In fact a scrabble word 
finder would need to include weighting for character values as well.  
Now that would be an interesting programming problem!



And no, I don't play scrabble.

So any way, I am in the process of obtaining a database of words, and a 
BUNCH of otherwise useless (to me) stuff like synonyms etc. For my 
purpose I will just grab the words and start building out the character 
set table.

This is interesting applying to the Raspberry Pi if for no other reason 
than the size of the database and the available memory in each processor.


-- 
John W. Colby



More information about the AccessD mailing list