Best way to handle huge datasets

Monkey Forums/Monkey Programming/Best way to handle huge datasets

monotonic(Posted 2014) [#1]
Hi All,

I'm currently researching a couple of game ideas, one of them is a spin on ye olde word game. I have a dictionary which is a combined US/UK English dictionary it has around 200,000 words in there, I've yet to go through and do some pruning, which will probably take it down a bit.

No my question is what is the best way to handle this many items, reading it from disc upon load and storing in a sorted array, or converting the dictionary into a code file and having it embedded into the game?

Finding the words is not a problem, I've already proven a concept on this one, it's the best way to store this data that I'm thinking of.

Thanks for any suggestions in advance.


programmer(Posted 2014) [#2]
I recommend you to look at http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
Size of English dictionary in DAWG format is about 1Mb. Use brl.DataBuffer to load and process binary data.


Nobuyuki(Posted 2014) [#3]
also check http://en.wikipedia.org/wiki/Suffix_tree although to be honest with you, I haven't had a problem with lookup time for english dictionary with a simple red-black tree like the kind StringMap uses. Particularly if your game isn't real-time, a few lookups of a large dictionary won't take long enough for the user to notice.


muddy_shoes(Posted 2014) [#4]
>Finding the words is not a problem, I've already proven a concept on this one, it's the best way to store this data that I'm thinking of.

The best method of finding words will depend on how you want to search (by prefix, by suffix, with wildcard etc.) and the data structure would generally be integral to that method so I'm not sure where you're coming from with this. The general issue with data on Android is loading/processing times not storage.


Gerry Quinn(Posted 2014) [#5]
[Deleted double post]


Gerry Quinn(Posted 2014) [#6]
[Edit: original code left out "Import mojo.app"]

By chance I just did a dictionary class. I haven't fully tested speed parameters yet but it seems to load almost instantly on Android, using the Enable dictionary of 170k words. At runtime it does a binary search and generates no garbage if you search for a array of characters (you could easily modify the string search to allocate no memory if you wanted).

As muddy points out, you might want something depending on what you are searching for: this one is oriented to finding out whether a given specific word is in the dictionary.



It simply loads the whole dictionary as a string and converts it to character codes. Then it generates an array of pointers to the start of each word in the dictionary.

Of course you could optimise loading times further by saving the data in this format. But if the current method is fast enough, it provides the convenience of working with text dictionaries.


monotonic(Posted 2014) [#7]
Hi Guys,

Sorry it took me so long to respond, my internet has been down!

@programmer

I really like the idea of using a DAWG structure, plus it would make the whole thing super fast to search for. However, because of time restrictions I might not have the time to research the whole method, writing and test it. Having said that this is up there as the number one choice if the "easy" method fails.


@Nobuyuki

That's something that I might have a look into, it looks simple enough I'm just not sure that the overhead will give a greatly increased lookup time.


@muddy_shoes

Yes, sorry my power of explanation is somewhat weak! What I meant is the best way to store the word lists so that loading them into memory in a such a way that meant they can be checked very quickly.

I currently have a method which is not as fast as I would like. My game allows words between 3-10 characters long, so I have split the word lists into 8 files, one containing words of 3 chars, 4 chars, 5 chars.... Then in memory I have a 2D array, the first dimension is for the length of the word so 3 char words are at index (3 - 3), 4 char words (4 - 3). Then the second dimension is indexed by the alphabetical position of the first char, so A is at position 0, b is at position 1 etc, this splits the words into length and first character groups. The data type of this 2D array is a string map of integers, so to check if a word exists I get length of the word and the first character, then using these values I get the corresponding string map. Then I check if this string map contains the word I'm looking for. This method means that I'm only scanning roughly 1/10th of the dataset, the tests I have done take ~250-350ms to check that a word exists (this is in HTML5).

The game is real-time, like Microsoft Wordament so it needs to be as fast as possible. A problem I can see with the above method is the shear number of string maps that I create, worst case scenario is that each group has a word that starts with every letter of the alphabet, this would mean that I create 8 * 26 string maps (that's 208!).

The best and most professional method is to use the DAWG structure as pointed out by programmer, but I just don't have to the time for R&D on this, so an easier but less robust method would be preferred.


@Gerry Quinn

My word lists are around 170k, how long does it take you to find a word amongst all that? Also you spoke of Enable 170k Dictionaries, I'm not sure what is meant by this.

Thank you for your input guys, I really appreciate it, the Monkey community never fails to lend a hand to a fellow coder.


Gerry Quinn(Posted 2014) [#8]
The Enable word list is a PD word list with about 170K words. You can find it around the net - yours might well be based on it, or come from similar sources.

My code does a binary search, and while it could doubtless be improved by a linear factor using native code, it will be as fast as possible from an O(n) perspective. It also has minimal memory fragmentation and garbage collection can be zero. These things are likely to be more important than clever data structures.

Anyway I just did a test, and in HTML5 it can find over a million words per second.

Try it / use it if you like, it should work on any alphabetically-sorted word list with lower case characters a-z only, and one word per line. I will tidy it up and stick it in the code section at some stage.