Questions about bah.regex

Otus(Posted 2008) [#1]
I read through the help and the PCRE man page, but didn't quite understand everything...

What does TRegEx.Find do? It returns a match object, which has SubCount matches, but what are they?

PCRE has "standard" and "alternative" matching methods, of which common.bmx seems to import the standard one. However, the man page says this about the standard method:
If a leaf node is reached, a matching string has been found, and at that point the algorithm stops.

So why would it ever return more than one match? Also, will the match always be the first from left, the shortest, the longest or something else?

If I need to find the longest match beginning at position 0 in the string, what should I do?


Brucey(Posted 2008) [#2]
Well, I'm no expert... but I'll see what I can do :-)

What does TRegEx.Find do?

It finds the first match in the string. On subsequent calls, it will find the next match, and so on.

...a match object, which has SubCount matches, but what are they?

A regular expression can be quite complex, finding not only the main match, but all sub-pattern matches which make up the whole. For a basic search you would not necessarily be interested in these, but for other searches you might want it to pre-split a date into its constituent parts, for example.
Here's a little example of subpatterns. The expression itself is taken from the docs :
SuperStrict

Framework BaH.Regex
Import BRL.StandardIO

Local pattern:String = "the ((red|white) (king|queen))"
Local search:String = "the red king"

Local regex:TRegEx = TRegEx.Create(pattern)

Try

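	' The first call takes the string to search and returns the first match (or Null).
	Local match:TRegExMatch = regex.Find(search)
	
	While match

		' Index 0 is the whole match; 1 and up are the bracketed subpatterns.
		For Local i:Int = 0 Until match.SubCount()
			Print i + ": " + match.SubExp(i)
		Next

		' Calling Find() with no argument carries on after the previous match.
		match = regex.Find()
	Wend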

Catch e:TRegExException

	Print "Error : " + e.toString()
	End
	
End Try

It outputs this :
0: the red king
1: red king
2: red
3: king

which are the subpatterns as defined by the brackets ().
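
For the date case mentioned earlier, a minimal sketch along the same lines might look like this. The pattern and the sample date are made up for illustration, and it only uses the calls already shown above (TRegEx.Create, Find, SubCount, SubExp):

```blitzmax
SuperStrict

Framework BaH.Regex
Import BRL.StandardIO

' Illustrative pattern: year, month and day each get their own pair of brackets.
' BlitzMax uses ~ (not \) as its string escape, so \d reaches PCRE unchanged.
Local pattern:String = "(\d{4})-(\d{2})-(\d{2})"
Local search:String = "released on 2008-05-17"

Local regex:TRegEx = TRegEx.Create(pattern)

Try

	Local match:TRegExMatch = regex.Find(search)

	If match Then
		' Index 0 is the whole date, 1 to 3 are the bracketed parts.
		For Local i:Int = 0 Until match.SubCount()
			Print i + ": " + match.SubExp(i)
		Next
	Else
		Print "no date found"
	End If

Catch e:TRegExException

	Print "Error : " + e.toString()
	End

End Try
```

I would expect this to print the whole date at index 0, followed by 2008, 05 and 17 at indexes 1 to 3.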

The bbdoc docs for BaH.Regex do go over how subpatterns and suchlike work, although I have to say they are a tad on the deep and technical side.

You can also use things such as "Lookahead assertions" and "Lookbehind assertions", and a whole load of other meaty character combinations, to be very specific in your search parameters.

So why would it ever return more than one match?

I can't say I've read much on the page you linked to, but it does appear to work.

Changing the search string in the above example to
Local search:String = "the red king was here, but the white bishop was nowhere to be found. Did you see the white king, perchance?"

results in the following output:
0: the red king
1: red king
2: red
3: king
0: the white king
1: white king
2: white
3: king

so obviously Find() was able to pick up two separate matching cases in the string, and break down the subpatterns at the same time.

Also, will the match always be the first from left, the shortest, the longest or something else?

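I think that depends on how you write the expression, but I'm not an expert. As far as I can tell, the standard method gives you the first match from the left, and quantifiers like + are greedy by default, so they grab as many characters as they can.
Obviously, if you do something like this - [A]+ - it will match any run of one or more A, taking the whole run where it finds it.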
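
To see the difference between "first from the left" and "longest", here is a small sketch. The search string is made up, and the lazy form +? is standard PCRE syntax rather than anything specific to the module:

```blitzmax
SuperStrict

Framework BaH.Regex
Import BRL.StandardIO

' The second run of A is longer, but the first run is further left.
Local search:String = "xxAAAxxAAAAA"

Local greedyRegex:TRegEx = TRegEx.Create("[A]+")
Local lazyRegex:TRegEx = TRegEx.Create("[A]+?")

Try

	' Greedy: + takes as many A as it can at the first place a match is possible.
	Local greedy:TRegExMatch = greedyRegex.Find(search)
	If greedy Then Print "greedy: " + greedy.SubExp(0)

	' Lazy: +? stops as soon as the pattern is satisfied.
	Local lazy:TRegExMatch = lazyRegex.Find(search)
	If lazy Then Print "lazy: " + lazy.SubExp(0)

Catch e:TRegExException

	Print "Error : " + e.toString()
	End

End Try
```

The greedy version should give AAA (the leftmost run, not the longer one further along), and the lazy version should give a single A.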

....

Hope this helps a little bit for now?


Otus(Posted 2008) [#3]
Thanks!

I totally misunderstood the part about subexpressions, that's why I was confused. I thought the different subexpressions in a TRegExMatch object were related to different matches instead of patterns within the same match.

So, I guess what I needed to know was whether something like [a-z]+ always matches the longest string from the beginning. E.g. for search$ = "abcd efghi", will I always get the match for "abcd" instead of "abc" or "efghi" or "a", all of which match the pattern?


Brucey(Posted 2008) [#4]
For your example, I would expect it to match twice, once for "abcd" and once for "efghi".

You wouldn't get "a" or "abc" on their own in this case, since you are asking for a run of characters and + takes as many as it can, so the full "abcd" is the match.
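
If you only want a match that begins at position 0 (your original question), one option is to anchor the pattern with ^. A rough sketch, assuming the default non-multiline mode where ^ only matches at the very start of the string; the second input is made up to show the failing case:

```blitzmax
SuperStrict

Framework BaH.Regex
Import BRL.StandardIO

' ^ pins the match to the start of the string; + then takes the longest run from there.
Local regex:TRegEx = TRegEx.Create("^[a-z]+")

Local inputs:String[] = ["abcd efghi", "123 abcd"]

Try

	For Local s:String = EachIn inputs
		' Passing a string to Find() starts a fresh search on that string.
		Local match:TRegExMatch = regex.Find(s)

		If match Then
			Print s + " -> " + match.SubExp(0)
		Else
			Print s + " -> no match at position 0"
		End If
	Next

Catch e:TRegExException

	Print "Error : " + e.toString()
	End

End Try
```

I would expect "abcd" for the first string and no match for the second, since the letters there do not start at position 0.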