BaH.RegEx bug?
Just started using this in my application and it looks like it's the solution I need. However, I think I've found a bug:

    Framework BaH.RegEx
    Import BRL.StandardIO

    Const text$ = "120" + Chr($2013) + "125"

    Local regex:TRegEx = TRegEx.Create("([\d]+)(" + Chr($2013) + ")")
    Local match:TRegExMatch = regex.Find(text$)

    If match Then
        Print match.SubEnd() 'prints 6, but should be 4
    End If

As noted in the comment, match.SubEnd() returns 6 when it should return 4 (the position of the en dash). Replacing Chr($2013) with a standard hyphen character produces the correct result, so I'm guessing it's something UTF-8 related. Is this a bug, or do I need to handle characters like this differently in regular expressions?

Last edited 2011
According to the PCRE documentation:

    "When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the byte offset of the first character in a substring, and the second is set to the byte offset of the first character after the end of a substring. Note: these values are always byte offsets, even in UTF-8 mode. They are not character counts."

So it seems it's returning the byte offset rather than the character offset. I'll need to work something out :)
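Until the module handles this itself, one possible workaround is to convert the byte offset back into a character offset on the calling side. The sketch below is not part of BaH.RegEx: ByteOffsetToCharOffset is a hypothetical helper, and it assumes the subject string is passed to PCRE as UTF-8 with each 16-bit BlitzMax character encoded independently (1, 2 or 3 bytes). It walks the original string, adds up the UTF-8 size of each character, and stops once the byte offset is reached.

```blitzmax
SuperStrict
Framework BRL.StandardIO

' Hypothetical helper (not part of BaH.RegEx): converts a UTF-8 byte offset,
' as reported by PCRE, into a character offset in the original string.
' Assumes each 16-bit character is encoded on its own as 1, 2 or 3 bytes.
Function ByteOffsetToCharOffset:Int(s:String, byteOffset:Int)
	Local bytes:Int = 0
	For Local i:Int = 0 Until s.Length
		' Stop as soon as we have consumed byteOffset bytes
		If bytes >= byteOffset Then Return i
		Local c:Int = s[i]
		If c < $80 Then
			bytes :+ 1 ' ASCII range
		ElseIf c < $800 Then
			bytes :+ 2 ' two-byte UTF-8 sequence
		Else
			bytes :+ 3 ' three-byte UTF-8 sequence (the en dash, $2013, lands here)
		End If
	Next
	Return s.Length
End Function

Local sample:String = "120" + Chr($2013) + "125"

' Byte offset 6 is what SubEnd() reported in the example above
Print ByteOffsetToCharOffset(sample, 6) ' prints 4
```

With the code from the first post, this would be called as ByteOffsetToCharOffset(text$, match.SubEnd()). The same conversion would apply to any start offsets, assuming they are also reported in bytes.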