BaH.RegEx bug?

Ghost Dancer(Posted 2011) [#1]
I've just started using this in my application, and it looks like it's the solution I need. However, I think I've found a bug:

Framework BaH.RegEx
Import BRL.StandardIO

Const text$ = "120" + Chr($2013) + "125"

Local regex:TRegEx = TRegEx.Create("([\d]+)(" + Chr($2013) + ")")

Local match:TRegExMatch = regex.Find(text$)

If match Then
	Print match.SubEnd()	'prints 6, but should be 4
End If


As noted in the comment, match.SubEnd() returns 6 when it should be 4 (the position of the en dash). Replacing Chr($2013) with a standard hyphen character produces the correct result, so I'm guessing it's something UTF-8 related.

Is this a bug, or do I need to handle characters like this differently in regular expressions?

Last edited 2011


Brucey(Posted 2011) [#2]
According to the PCRE documentation:

When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of each pair is set to the byte offset of the first character
in a substring, and the second is set to the byte offset of the first
character after the end of a substring. Note: these values are always
byte offsets, even in UTF-8 mode. They are not character counts.


So it seems it's returning the byte offset rather than the character offset.
I'll need to work something out :)
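
To illustrate the difference, here is a minimal sketch of converting a UTF-8 byte offset back into a character offset. It is written in Python rather than BlitzMax purely for illustration, and byte_to_char_offset is a hypothetical helper name, not part of BaH.RegEx or PCRE. It assumes the byte offset falls on a character boundary, which should hold for offsets reported by a successful match.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a character (code point) offset.

    Hypothetical helper for illustration; not part of BaH.RegEx or PCRE.
    Raises UnicodeDecodeError if byte_offset splits a multi-byte character.
    """
    encoded = text.encode("utf-8")
    # Decoding only the bytes before the offset yields exactly the
    # characters that precede it, so their count is the character offset.
    return len(encoded[:byte_offset].decode("utf-8"))


if __name__ == "__main__":
    # Same subject string as the example above: "120", an en dash, "125".
    text = "120" + chr(0x2013) + "125"
    # The en dash (U+2013) takes 3 bytes in UTF-8, so byte offset 6
    # corresponds to character offset 4.
    print(byte_to_char_offset(text, 6))
```

With a plain hyphen in place of the en dash, every character is one byte, so the byte and character offsets coincide. That would explain why the hyphen version appears to work.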