John W. Colby
jwcolby at gmail.com
Mon Feb 9 22:25:36 CST 2015
Have you ever run across something called WordNet dictionary? It can be obtained in a SQLite database format, which is a ROYAL pita to deal with outside of SQLite. The dictionary is "open" and if you just want to use the SQLite database it is fast and easy to start to use. I will likely do that with Linux, use it directly to program against using C++. In any event the dictionary is quite complex, dozens of tables, apparently fairly well normalized. And of course I really don't understand all the stuff that they are doing (yet) as language is not my strong suit. This started out as an attempt to simply get a list of words and their definitions (a simple dictionary). Well I got a dictionary but it isn't a simple one. Some guy was running benchmarks and came up with this task (I expect a pet project of his) to find all of the words that can be made from a given set of characters. He didn't discuss an algorithm, merely that he timed it with the original Raspberry Pi, and then again with the Raspberry Pi M2. However he didn't actually parallel it, simply ran the algorithm simultaneously 4 times on 4 separate threads, one on each core. That's cheating and not a true benchmark. I suspect however that he was using a brute force search of some sort. Which got me itching to do it right. I am trying to get my hands on several of the M2 Pis to build out a small cluster. So an algorithm like this, really parallized to run pieces of the thing on different threads (cores), and in fact sets of pieces on different physical machines (true parallelization) would be a cool thing to do. The algorithm basically implies discovering the characters in a word: parallelization = aaaeiilllnoprtz and sorting them as I did above. Then using a database or perhaps binary tree, search to find all of the words that have the same set of characters. One way would be to use a database where you simply order the characters as above, then add a new field to each word record to hold that reordered set, index and done. Another way is to build a parent table where you insert the ordered character set, obtaining a PK to insert as an FK in the word fieldtable If another word has the same set, find the ordered character set and use the same PK. Thus any given set of characters only goes in the parent table once, with a FK in the word table pointing back to the ordered character set. No doubt MUCH faster to use (and a better "database" solution), but more preprocessing to set it all up. Regardless of solution, to find matching words I have to sort the supplied characters, then look the sorted set up in the database, returning all matching words. Of course MY defined problem assumes an exact match. Something like a scrabble word finder would require returning all words which fit the character set, no doubt sorted longest to shortest. An entirely different problem! In fact a scrabble word finder would need to include weighting for character values as well. Now that would be an interesting programming problem! And no, I don't play scrabble. So any way, I am in the process of obtaining a database of words, and a BUNCH of otherwise useless (to me) stuff like synonyms etc. For my purpose I will just grab the words and start building out the character set table. This is interesting applying to the Raspberry Pi if for no other reason than the size of the database and the available memory in each processor. -- John W. Colby