Question on String Search
Blitz3D Forums/Blitz3D Programming/Question on String Search
| ||
Hey guys, Do you have any fast techniques on searching strings? Lets say, I have a string "AAFABASBEARNA" And I want to search a database on 80,000 words to find words inside this string. Do you guys have any ideas on how I could do this quick? Thanks! |
| ||
Well there is something called a DAWG (Directed Acyclic Word Graph) which I have not tried personally but may be worth a look since it appears to give a compact dictionary representation that allows fast searches. http://wiki.forthfreak.net/index.cgi?DirectedAcyclicWordGraph http://www-2.cs.cmu.edu/afs/cs/academic/class/15451-f00/www/lectures/lect0926 Plenty of other stuff about the technique on google and its implementation in scrabble is described here http://counties.csc.calpoly.edu/~team12fk/F02/acm_worlds_fastest_scrabble_program.pdf |
| ||
Thanks! Looks complicated though, and it descrambles the words, I just want a program that finds all the words in string without re-arranging the letters. DJWoodgate, do you have any blitz code examples to help me out too;) |
| ||
Yes, I know it looks complicated, I do not fancy implementing it either! (well the tricky bit is constructing it, using it should be pretty easy). Well how about a binary search? The worst case number of compares to find a word in a sorted 80000 word list using binary search will be of the order of Log2(n) or about 16 and this could be further reduced by storing additional entry points for each initial letter. So it becomes a case of searching this list for all possible substrings of your string, many of which may be discarded early or will further reduce the search range. For instance in your string "AAFABASBEARNA" if there are no words beginning with AAF there is not much point searching for AAFA or AAFAB and so on, and if your have just found EAR then you can search for the next substring EARN between the position of EAR and the position of words beginning with F or perhaps try finding EARNA first (the longest substring from that position) and binary search a reducing range between EA and EARNA (or the word that immediately preceeds EARNA since it does not exist). [edit] Preliminary tests with the ~80000 word OSPD loaded into an array indicate something of this nature will resolve your example string in less than a millisecond. I still need to work out some issues with my crapy code but I will post it here when I get the time to fix them - that's if you do not beat me to it. |
| ||
OK, here is the binary search method, not fully tested though, so beware. I have not implemented the reducing lookup range idea in the substring search, because for some obscure reason it was slower, probably to do with how the binary search works, though it might just have been a coding error on my part. BTW The version of binary search I have chosen (no I did not think it up myself) is good because it just does one compare per subdivision and does not do an explicit equality test. This is I think advantageous for string searches. You can throw an equality test in there and sometimes get an early exit, but overall that extra compare just slows it down. [edit] I have added the lookup range reduction idea back into the wordsearch function as it definately seems to improve matters with larger dictionaries and I can see no good reason why it should not be faster most of the time. [edit] Having added range reduction a better exit strategy for the substring search presented itself, which results in another slight speedup. Graphics 600,600,0,2 Dim Dict$(80000) Dim Didx(30) Dim Words(255) ; Function loads a dictionary, which is assumed to be an ascii sorted word list. ; words are stored in Dict$() and Initial character indexing in Didx() Function LoadDict(filepath$) File=ReadFile(filepath$) If file=0 Then Return False Repeat index=index+1 word$=Lower(ReadLine(file)) dict(index)=word Thischar=Asc(word)-96 If Thischar<>Lastchar Then Didx(Thischar)=index Lastchar=Thischar Until Eof(file) Didx(Thischar+1)=index : Didx(0)=index CloseFile file Return index End Function ; Function checks for word in dictionary by binary search ; if word does not exist return next nearest word. Function Dictfind(word$,startpos,endpos) Local top=endpos, bot=startpos-1, middle While (top-bot)>1 middle=(top+bot)/2 If dict(middle)<word Then bot=middle Else top=middle Wend Return top End Function ; Function finds words in search string with specified minimum length Function Wordsearch(search$,minlen) Local i,j,substr$,wcount,idx,low,high,maxlen,searchlen=Len(search) search=Lower(search) For i=1 To searchlen-minlen+1 maxlen = searchlen-i+1 substr = Mid(search,i,maxlen) idx = Asc(substr)-96 : low=didx(idx) : high=Didx(idx+1) high = dictfind(substr,low,high) If dict(high)=substr Then wcount=wcount+1 : words(wcount)=high ;: high=high-1 For j=minlen To maxlen-1 substr=Mid(search,i,j) low=Dictfind(substr,low,high) ; If Left(dict(low),j)<>substr Then Exit ; See improved exit strategy below. If dict(low)=substr Then wcount=wcount+1 : words(wcount)=low : low=low+1 If low = high Then Exit ; Range reduction allows this instead of the test above Next Next Return wcount End Function ; load the OSPD (official scrabble player dictionary) ; see http://www.puzzlers.org/wordlists/dictinfo.php time=MilliSecs() wordcount=Loaddict("C:\program files\blitz3d\tmp\OSPD.txt") If wordcount=0 Then RuntimeError " Failed to load dictionary" time=MilliSecs()-time Print "Dictionary Loaded in "+time+" Millisecs with ("+ wordcount+") Words" Print search$="AAFABASBEARNA" wordlength=2 iterations=100 Repeat Print "Searching for words with at least "+wordlength+" letters Print "in String "+search time=MilliSecs() For i = 1 To iterations wordcount=wordsearch(search,wordlength) Next time=MilliSecs()-time Print "Searched string "+iterations+" times in "+time+" milliseconds" Print "Found "+wordcount+" words" For i=1 To wordcount Print dict(words(i)) Next wordlength=wordlength+1 Until WaitKey()=27 Or wordcount=0 End |
| ||
Thanks! I was actually able to get something working with your help, and its really fast! |
| ||
I think I have stumbled across a better exit strategy for the substring search as noted above. Nothing too remarkable but as this is a slow routine any speedup is worthwhile. |