Compare strings

Hezkore(Posted 2015) [#1]
I'm using BlitzMax's string compare function to compare two strings with each other to see how similar they are.
It spits out a value to tell you how similar two strings are and a "0" would be a perfect match.
However, it usually returns very odd results!

For example, according to BlitzMax the string "talking super jeopardy!" is closer to "super mario bros 3" than "super mario bros. 3" is.

I've made an example so you guys can test it out yourselves.
SuperStrict
Framework brl.standardio

Local str1:String = "super mario bros 3"
Local str2:String = "super mario bros. 3"
Local str3:String = "talking super jeopardy!"

Print "Str1 & Str2 Similarity: " + str2.Compare(str1)
Print "Str1 & Str3 Similarity: " + str3.Compare(str1)



Brucey(Posted 2015) [#2]
Compare returns :
0 : if the two strings are identical

<some number> : the difference in length between the two strings, if the shorter one is identical to the longer one up to its length.

<some number> : the difference between the character values of the first characters that are not identical (for example, the difference between "t" and "s").

(see blitz_string.c / bbStringCompare() for details)

Not sure you can use Compare() in the way you think you can.

;-)


Hezkore(Posted 2015) [#3]
I found your regex module Brucey, and it seems to return better results.
SuperStrict
Framework brl.standardio
Import bah.regex

Local str1:String = "super mario bros 3"
Local str2:String = "super mario bros. 3"
Local str3:String = "talking super jeopardy!"

Print "Str1 & Str2 Similarity: " + StringCompare(str1, str2)
Print "Str1 & Str3 Similarity: " + StringCompare(str1, str3)

Function StringCompare:Int(str1:String, str2:String)
	Local regex:TRegEx = TRegEx.Create(str1)
	Return regex.Compare(str1) - regex.Compare(str2)
EndFunction



Brucey(Posted 2015) [#4]
For completeness, here's a small program showing the three examples I mentioned above :
SuperStrict
Framework brl.standardio

Local same1:String = "Hello World!"
Local same2:String = "Hello World!"

Print "* SAME *"
Print "same1 = " + same1
Print "same2 = " + same2
Print "compared = " + same2.Compare(same1)

Print "~n* LENGTH DIFFERENCE *"

Local small:String = "Hello"
Local big:String = "Hello World!"

Print "small = " + small
Print "big = " + big
Print "big - small lengths = " + (big.length - small.length)
Print "compared = " + big.Compare(small)

Print "~n* CHAR DIFFERENCE *"

Local diff1:String = "Hello"
Local diff2:String = "World"

Print "diff1 = " + diff1
Print "diff2 = " + diff2
Print "W - H = " + (Asc("W") - Asc("H"))
Print "compared = " + diff2.Compare(diff1)



Derron(Posted 2015) [#5]
The compare function is defined in brl.mod/blitz.mod/blitz_string.c

int bbStringCompare( BBString *x,BBString *y ){
	int k,n,sz;
	sz=x->length<y->length ? x->length : y->length;
	for( k=0;k<sz;++k ) if( n=x->buf[k]-y->buf[k] ) return n;
	return x->length-y->length;
}


This might translate to a BlitzMax variant along the lines of:
SuperStrict
Framework brl.standardio

Local str1:String = "super mario bros 3"
Local str2:String = "super mario bros. 3"
Local str3:String = "talking super jeopardy!"

Print "Str1 & Str1 equal: " + stringCompare(str1, str1) +"  str.Compare() = " + str1.Compare(str1) 
Print "Str1 & Str2 equal: " + stringCompare(str1, str2) +"  str.Compare() = " + str1.Compare(str2)
Print "Str1 & Str3 equal: " + stringCompare(str1, str3) +"  str.Compare() = " + str1.Compare(str3)

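Function stringCompare:Int( x:String, y:String )
	' Only compare up to the length of the shorter string
	Local sz:Int
	If x.length < y.length
		sz = x.length
	Else
		sz = y.length
	EndIf

	' "Until" stops at sz - 1; "To sz" would read one element past the shorter string
	For Local k:Int = 0 Until sz
		' Return the char code difference at the first mismatch
		If x[k] - y[k] <> 0 Then Return x[k] - y[k]
	Next
	' No mismatch in the shared prefix: return the length difference
	Return x.length - y.length
End Function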


So it seems to do this: it compares the characters at each position (from index 0 up to, but not including, min(lengthX, lengthY)). As soon as the char codes differ, it returns the char code difference. If there is no difference, it returns the difference in length.

Conclusion: it returns "0" for equal strings, and any other number means: not equal.


So this is NOT a similarity check. For that you could write a function that checks for equal characters at the same position. BUT ... there is more advanced stuff to do there:
- check for equal characters (what happens with "super" versus "supper"? You have to check neighbouring characters as well, because otherwise it matches "sup" in both and from then on every character is different)
- check for equal length
- check for similar sounding characters ("super mario" versus "super marin" versus "super mariu" versus "super marioo")

In the end you have to "weight" each of the factors according to your needs (is the "sound" of a string important, or a similar length, ...)


EDIT: Seems Brucey was faster... maaan I needed more time to generate the sample code and validate that _I_ understood it correctly.

bye
Ron


Hezkore(Posted 2015) [#6]
I was expecting something like a Levenshtein distance to tell me how "similar" the strings are.
But the regex module seems to work as I expected. (See code above)


Hezkore(Posted 2015) [#7]
I've found something odd with the regex module though.
If you convert the strings to lowercase with Lower(), they don't match anymore for some reason.
SuperStrict
Framework brl.standardio
Import brl.retro
Import bah.regex

Local str1:String = "super mario bros."
Local str2:String = "super mario bros."

Print "Str1 & Str2 Similarity: " + StringCompare(Lower(str1), Lower(str2))

Function StringCompare:Int(str1:String, str2:String)
	Local regex:TRegEx = TRegEx.Create(str1)
	Return regex.Compare(str1) - regex.Compare(str2)
EndFunction
That would result in -48, even though they're exactly the same.
But if you don't convert it to lowercase, you get the result 0.


Hezkore(Posted 2015) [#8]
I've decided to use this: http://www.blitzbasic.com/codearcs/codearcs.php?code=2439


degac(Posted 2015) [#9]
I never noticed the .Compare() method on strings!
The posted example is interesting... I need something like this.