Question on String Search

Blitz3D Forums/Blitz3D Programming/Question on String Search

starfox(Posted 2005) [#1]
Hey guys,

Do you have any fast techniques on searching strings?

Lets say, I have a string "AAFABASBEARNA"
And I want to search a database on 80,000 words to find
words inside this string. Do you guys have any ideas on
how I could do this quick?

Thanks!


DJWoodgate(Posted 2005) [#2]
Well there is something called a DAWG (Directed Acyclic Word Graph) which I have not tried personally but may be worth a look since it appears to give a compact dictionary representation that allows fast searches.
http://wiki.forthfreak.net/index.cgi?DirectedAcyclicWordGraph
http://www-2.cs.cmu.edu/afs/cs/academic/class/15451-f00/www/lectures/lect0926

Plenty of other stuff about the technique on google and its implementation in scrabble is described here
http://counties.csc.calpoly.edu/~team12fk/F02/acm_worlds_fastest_scrabble_program.pdf


starfox(Posted 2005) [#3]
Thanks!

Looks complicated though, and it descrambles the words, I just want a program that finds all the words in string without re-arranging the letters.

DJWoodgate, do you have any blitz code examples to help me out too;)


DJWoodgate(Posted 2005) [#4]
Yes, I know it looks complicated, I do not fancy implementing it either! (well the tricky bit is constructing it, using it should be pretty easy).

Well how about a binary search? The worst case number of compares to find a word in a sorted 80000 word list using binary search will be of the order of Log2(n) or about 16 and this could be further reduced by storing additional entry points for each initial letter.

So it becomes a case of searching this list for all possible substrings of your string, many of which may be discarded early or will further reduce the search range. For instance in your string "AAFABASBEARNA" if there are no words beginning with AAF there is not much point searching for AAFA or AAFAB and so on, and if your have just found EAR then you can search for the next substring EARN between the position of EAR and the position of words beginning with F or perhaps try finding EARNA first (the longest substring from that position) and binary search a reducing range between EA and EARNA (or the word that immediately preceeds EARNA since it does not exist).

[edit]
Preliminary tests with the ~80000 word OSPD loaded into an array indicate something of this nature will resolve your example string in less than a millisecond. I still need to work out some issues with my crapy code but I will post it here when I get the time to fix them - that's if you do not beat me to it.


DJWoodgate(Posted 2005) [#5]
OK, here is the binary search method, not fully tested though, so beware. I have not implemented the reducing lookup range idea in the substring search, because for some obscure reason it was slower, probably to do with how the binary search works, though it might just have been a coding error on my part.

BTW The version of binary search I have chosen (no I did not think it up myself) is good because it just does one compare per subdivision and does not do an explicit equality test. This is I think advantageous for string searches. You can throw an equality test in there and sometimes get an early exit, but overall that extra compare just slows it down.

[edit] I have added the lookup range reduction idea back into the wordsearch function as it definately seems to improve matters with larger dictionaries and I can see no good reason why it should not be faster most of the time.

[edit] Having added range reduction a better exit strategy for the substring search presented itself, which results in another slight speedup.

Graphics 600,600,0,2
Dim Dict$(80000)
Dim Didx(30)
Dim Words(255)

; Function loads a dictionary, which is assumed to be an ascii sorted word list.
; words are stored in Dict$() and Initial character indexing in Didx()
Function LoadDict(filepath$)
	File=ReadFile(filepath$)
	If file=0 Then Return False
	Repeat
		index=index+1
		word$=Lower(ReadLine(file))
		dict(index)=word
		Thischar=Asc(word)-96
		If Thischar<>Lastchar Then  Didx(Thischar)=index
		Lastchar=Thischar
	Until Eof(file)
	Didx(Thischar+1)=index : 	Didx(0)=index
	CloseFile file
	Return index
End Function

; Function checks for word in dictionary by binary search
; if word does not exist return next nearest word.
Function Dictfind(word$,startpos,endpos)
	Local top=endpos, bot=startpos-1, middle
	While (top-bot)>1
		middle=(top+bot)/2
		If dict(middle)<word Then bot=middle Else top=middle 
	Wend
	Return top
End Function

; Function finds words in search string with specified minimum length
Function Wordsearch(search$,minlen)
	Local i,j,substr$,wcount,idx,low,high,maxlen,searchlen=Len(search)

	search=Lower(search)  
	
	For i=1 To searchlen-minlen+1
	
		maxlen = searchlen-i+1 
		
		substr = Mid(search,i,maxlen)
	
		idx = Asc(substr)-96 : low=didx(idx) : high=Didx(idx+1)
	
		high = dictfind(substr,low,high) 
		
		If dict(high)=substr Then wcount=wcount+1 : words(wcount)=high ;: high=high-1

		For j=minlen To maxlen-1

			substr=Mid(search,i,j)
		
			low=Dictfind(substr,low,high)
		
;			If Left(dict(low),j)<>substr Then Exit ; See improved exit strategy below.

			If dict(low)=substr Then wcount=wcount+1 : words(wcount)=low : low=low+1

			If low = high Then  Exit ; Range reduction allows this instead of the test above	

		Next
	
	Next
	
	Return wcount

End Function



; load the OSPD (official scrabble player dictionary)
; see http://www.puzzlers.org/wordlists/dictinfo.php
time=MilliSecs()
wordcount=Loaddict("C:\program files\blitz3d\tmp\OSPD.txt")
If wordcount=0 Then RuntimeError " Failed to load dictionary"
time=MilliSecs()-time
Print "Dictionary Loaded in "+time+" Millisecs with ("+ wordcount+") Words"
Print

search$="AAFABASBEARNA"
wordlength=2


iterations=100

Repeat

	Print "Searching for words with at least "+wordlength+" letters 
	Print "in String "+search 


	time=MilliSecs()
	For i = 1 To iterations
		wordcount=wordsearch(search,wordlength)
	Next
	time=MilliSecs()-time
	
	Print "Searched string "+iterations+" times in "+time+" milliseconds"
	Print "Found "+wordcount+" words"
	
	For i=1 To wordcount
		Print dict(words(i))
	Next
	
	wordlength=wordlength+1

Until WaitKey()=27 Or wordcount=0

End



starfox(Posted 2005) [#6]
Thanks!

I was actually able to get something working with your help, and its really fast!


DJWoodgate(Posted 2005) [#7]
I think I have stumbled across a better exit strategy for the substring search as noted above. Nothing too remarkable but as this is a slow routine any speedup is worthwhile.