
This code has been declared by its author to be Public Domain code.


Lexical scanner framework, by Yasha (2012)
This is a simple library that allows you to describe a lexical scanner, and use it to tokenise an input file or string. It is suitable for use in things like console inputs, expression calculators, data file parsers, or programming/scripting language compilers.


"Lexing" or "tokenising" is the stage of reading some input where the source is broken up into homogenous "tokens", so that you can stop worrying about the input as text (with all of the annoying variations text can have, like the presence of spaces, indents, cases, and whatnot), and instead treat it as a stream of linear "chunks". e.g. a suitable lexer could take the input:

 1+2  * 4= 0xC


...which is polluted by irregular spacing, mixed number formats, and values of different lengths, and turn it into a stream of objects representing numbers and operators, so that we can just pull tokens off the top and be given either an operator or a number.
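In terms of the LEX_Token fields defined below (tType and value), the stream for that line might come out something like:

 number(1) operator(+) number(2) operator(*) number(4) operator(=) number(0xC)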

The output of a lexer is "flat" - turning it into a "tree" (i.e. parenting values to operators, putting operators in relative precedence to one another, etc.) is the job of a "parser". So this isn't enough to evaluate languages on its own, but it's the first step. Lexers and parsers are normally designed to work together, with the parser consuming the token stream produced by the lexer (a sister parser API to work with this is now also available).


This library lets you create lexer objects and add rules to them in the form of regular-expression patterns for the input to be matched against, combined with actions and extra result information. Output is directed to a list supplied by the user.

Rules are tested in the order they were added, and a longer match overrules an earlier, shorter one. If two rules match at the same length, the rule that was added first wins, so keep the order in mind.
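For instance, given a lexer lex and two illustrative rules:

;"cos" matches both rules at the same length, so the function rule,
;added first, takes it; "cost" is a longer match for the name rule,
;so the name rule wins there instead
LEX_AddLexRule lex, "sin|cos|tan", LEX_ACTION_STORE, "function"
LEX_AddLexRule lex, "[a-z]+", LEX_ACTION_STORE, "name"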

Upon match, an "action" is performed. By default the lexer can store the result, switch mode, raise an error, "include" a filename, and discard the match. The built-in action constants are all negative, so use positive values for any extra actions you care to define; add the implementations of any extra actions to LEX_HandleAction in Lexer-Interface (that's what it's there for). The Include functionality is fairly simple and will require some use of modes to be useful; it does, however, try to detect recursive includes, and can optionally guard against repeated includes, silently skipping them rather than raising an error (as Blitz Basic itself does).
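As a sketch, a custom action and an include rule might look like this (the constant and patterns are hypothetical; an Include rule's result string holds the delimiter characters used to chop the file path out of the matched directive):

Const MY_ACTION_WARN = 1	;user actions are positive, so they can't clash with the built-ins

LEX_AddLexRule lex, "TODO", MY_ACTION_WARN, "found a TODO marker"

;Matches: include "file.ext" - the result string (a quote mark) tells the
;lexer which characters delimit the path within the matched text
LEX_AddLexRule lex, "include " + Chr(34) + "[^" + Chr(34) + "]*" + Chr(34), LEX_ACTION_INCLUDE, Chr(34)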

The file Lexer-Interface.bb is intended to be supplied by the user: in it, define LEX_HandleAction(l.LEX_Lexer, r.LEX_LexRule, tok$) as explained above, to handle any extra "actions" you want to add; and LEX_LogError(l.LEX_Lexer, msg$), because you'll want error reporting to suit the logging and handling mechanism of your host program. These two functions are all that are required, and the library does not depend upon either of them doing any more than receiving their arguments (so if you don't care about errors you can even leave them empty!).

Input is read in the form of a LEX_File object, representing a custom pseudo-filestream (basically a bank, but we're calling it a file), which is passed to LEX_ScanFile for scanning; remember to close LEX_File objects with LEX_CloseFile when done (any files opened by Include actions will be closed automatically). LEX_File objects can either be opened from real files (using LEX_ReadFile), or from strings (using LEX_FileFromString); the latter is handy for scanning input from a command prompt or REPL or something, when you don't necessarily actually have a file providing input. You still need to close LEX_Files created from strings, though.
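For a real file, given a configured lexer lex, the flow looks like this (the path is hypothetical; LEX_ReadFile reports an error and returns Null if the file can't be opened):

Local f.LEX_File = LEX_ReadFile("scripts/input.txt")	;hypothetical path
If f <> Null
	LEX_ScanFile lex, f
	LEX_CloseFile f
EndIf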


Here's a small example that uses the library to create a lexer for a minimal calculator language with operators, numbers, functions and comments:
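A minimal sketch along those lines, scanning from a string (the patterns assume a conventional class/alternation regex syntax; check them against what Regex.bb actually accepts):

;Set up a lexer and an output list for it
Local tokens.LList = CreateList()
Local lex.LEX_Lexer = LEX_CreateLexer()
LEX_SetOutput lex, tokens

;Rules - exact pattern syntax depends on what Regex.bb supports, so treat these as illustrative
LEX_AddLexRule lex, "0x[0-9A-Fa-f]+|[0-9]+", LEX_ACTION_STORE, "number"
LEX_AddLexRule lex, "[-+*/=]", LEX_ACTION_STORE, "operator"
LEX_AddLexRule lex, "sin|cos|sqr", LEX_ACTION_STORE, "function"
LEX_AddLexRule lex, ";[^" + Chr(10) + "]*", LEX_ACTION_DISCARD	;comments: matched, then thrown away

;Scan from a string (characters no rule matches, like the spaces here, are simply skipped)
Local f.LEX_File = LEX_FileFromString("1+2  * 4= 0xC ;a comment")
LEX_ScanFile lex, f
LEX_CloseFile f

;Walk the token stream
Local i.Iterator = GetIterator(tokens) : While EachIn(i)
	Local t.LEX_Token = Object.LEX_Token i\Value
	Print t\tType + "(" + t\value + ")"
Wend

LEX_FreeTokenStream tokens
LEX_FreeLexer lex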



This uses the following Lexer-Interface.bb file:
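A minimal sketch of such a file, reusing the hypothetical MY_ACTION_WARN action from earlier (only the two function signatures are fixed by the library):

;Lexer-Interface.bb - user-supplied specialisation of the lexer library

Const MY_ACTION_WARN = 1	;hypothetical custom action (positive, so it can't clash with the built-ins)

;Called for any action value the library doesn't recognise
Function LEX_HandleAction(l.LEX_Lexer, r.LEX_LexRule, tok$)
	Select r\action
		Case MY_ACTION_WARN
			Print "Warning: " + r\result + " ('" + tok + "')"
	End Select
End Function

;Called when the lexer flags an error: route the message to the host program's logging
Function LEX_LogError(l.LEX_Lexer, msg$)
	Print "Lexer error: " + msg
End Function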



That's one way to get nice clean input for an expression evaluator!


This library depends on regular-expression and linked-list include files, both of which are available from the code archives at the links given in the source comments below.

This library replaces my older lexer generator. It's more or less equally powerful (same regex-powered backend), so there's no reason to use the old version, with its inconvenient preprocessing step and custom source format.
; Generic lexical scanner/tokeniser framework
;=============================================


;Dependencies
Include "LList.bb"	;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2873
Include "Regex.bb"	;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2632

;User-specialised interface: supplies LEX_LogError and LEX_HandleAction
Include "Lexer-Interface.bb"


Type LEX_Token
	Field value$, tType$
	Field file$, l, c
End Type

Type LEX_Lexer
	Field rules.LList
	Field cFile.LEX_File
	Field out.LList		;Output list is not "owned" by the lexer
	Field csMode, guardMode, errState
	Field istk.LList, prev.LList
End Type

Type LEX_LexRule
	Field rule.RegEx_Node, pattern$
	Field action, result$, mode
End Type

Type LEX_File
	Field dir$, name$
	Field stream, sLen, cPtr, cLine, cCol
End Type

Type LEX_Guard
	Field name$
End Type


Const LEX_ACTION_STORE = -1, LEX_ACTION_MODE = -2, LEX_ACTION_ERROR = -3, LEX_ACTION_DISCARD = -4, LEX_ACTION_INCLUDE = -5


;Create a new, empty lexer object
Function LEX_CreateLexer.LEX_Lexer()
	Local l.LEX_Lexer = New LEX_Lexer
	l\rules = CreateList()
	l\csMode = True
	l\guardMode = False
	Return l
End Function

;Set the output token list for the lexer
Function LEX_SetOutput(l.LEX_Lexer, out.LList)
	l\out = out
End Function

;Set whether rules added to this lexer will be case-sensitive (changing this doesn't affect old rules)
Function LEX_SetCaseSensitivity(l.LEX_Lexer, csMode)
	l\csMode = csMode
End Function

;Set whether included files are auto-guarded or not (i.e. can they be included twice?)
Function LEX_SetIncludeMode(l.LEX_Lexer, guarded)
	l\guardMode = guarded
End Function

;(Internal) Try to include a source file, guards and recursion checks permitting (returns boolean indicating success)
Function LEX_TryIncludeFile(l.LEX_Lexer, file$)
	Local i.Iterator, fd$[0], fn$[0]
	LEX_GetPathCmpts file, fd, fn : file = fd[0] + fn[0]	;Force the path to be absolute for easier comparison
	
	If l\guardMode	;Auto-guarded includes: check if it's been used already, if so ignore it
		i = GetIterator(l\prev) : While EachIn(i)
			Local g.LEX_Guard = Object.LEX_Guard i\Value
			If g\name = file Then Return True	;...return without actually changing it
		Wend
	EndIf
	
	i = GetIterator(l\istk) : While EachIn(i)	;Check against the currently-open files
		Local f.LEX_File = Object.LEX_File i\Value
		If f\dir + f\name = file
			LEX_Error l, "Cannot recursively include '" + file + "'"
			Return False
		EndIf
	Wend
	
	l\cFile = LEX_ReadFile(file)
	If l\cFile <> Null
		ListAddLast l\istk, Handle l\cFile : LEX_GuardFileName l, l\cFile
		Return True
	Else
		Return False	;Error has already been issued by LEX_ReadFile
	EndIf
End Function

;(Internal) Clear the include stacks when a scan has stopped, closing any "owned" files
Function LEX_ClearStacks(l.LEX_Lexer, keep.LEX_File)
	Local i.Iterator
	If l\istk <> Null
		i = GetIterator(l\istk) : While EachIn(i)
			Local f.LEX_File = Object.LEX_File i\Value : If f <> keep Then LEX_CloseFile f
		Wend
		FreeList l\istk
	EndIf
	If l\prev <> Null
		i = GetIterator(l\prev) : While EachIn(i)
			Delete Object.LEX_Guard i\Value
		Wend
		FreeList l\prev
	EndIf
	l\cFile = Null
End Function

;Delete a lexer object and its resources (not output)
Function LEX_FreeLexer(l.LEX_Lexer)
	Local i.Iterator = GetIterator(l\rules) : While EachIn(i)
		Local r.LEX_LexRule = Object.LEX_LexRule(i\Value)
		RegEx_Delete r\rule
		Delete r
	Wend
	FreeList l\rules
	
	LEX_ClearStacks l, Null
	If l\cFile <> Null Then LEX_CloseFile l\cFile
	
	Delete l
End Function

;Delete an output token stream (convenience function)
Function LEX_FreeTokenStream(s.LList)
	Local i.Iterator = GetIterator(s) : While EachIn(i)
		Delete Object.LEX_Token i\Value
	Wend
	FreeList s
End Function

;Add a scan rule, consisting of a match pattern, action, "result" (additional data such as type or error message) and permitted mode
Function LEX_AddLexRule(l.LEX_Lexer, rule$, action, result$ = "", mode = 0)
	Local r.LEX_LexRule = New LEX_LexRule
	r\rule = RegEx_Parse(rule, l\csMode)
	r\pattern = rule
	r\action = action
	r\result = result
	r\mode = mode
	ListAddLast l\rules, Handle r
End Function

;Scan a file (opened with LEX_ReadFile) using the given lexer
Function LEX_ScanFile(l.LEX_Lexer, f.LEX_File)
	LEX_ResetFile f
	l\errState = False
	l\cFile = f
	
	l\istk = CreateList()
	l\prev = CreateList()
	ListAddLast l\istk, Handle f
	LEX_GuardFileName l, f
	
	Repeat
		Local token$, rule.LEX_LexRule, mode = 0
		
		While l\cFile\cPtr < l\cFile\sLen
			token = "" : rule = Null
			
			Local i.Iterator = GetIterator(l\rules) : While EachIn(i)
				Local r.LEX_LexRule = Object.LEX_LexRule i\Value
				If mode = r\mode
					Local cMatch$ = RegEx_Match(r\rule, l\cFile\stream, l\cFile\cPtr)
					If Len(cMatch) > Len(token) Then token = cMatch : rule = r
				EndIf
			Wend
			
			If rule <> Null		;Something matched successfully!
				Select rule\action
					Case LEX_ACTION_STORE
						ListAddLast l\out, Handle LEX_NewToken(token, rule\result, l\cFile\name, l\cFile\cLine, l\cFile\cCol)
					Case LEX_ACTION_MODE
						mode = Int rule\result
					Case LEX_ACTION_ERROR
						LEX_Error l, rule\result
						LEX_ClearStacks l, f : Return	;Clear the include stacks and stop scanning
					Case LEX_ACTION_DISCARD
						;Do nothing
					Case LEX_ACTION_INCLUDE
						LEX_IncrementFilePtr l\cFile, Len(token)
						token = LEX_FilterIncludeString(token, rule\result)	;Shorten the token to just the file path
						If Not LEX_TryIncludeFile(l, token) Then LEX_ClearStacks l, f : Return
						
					Default
						LEX_HandleAction l, rule, token
						If l\errState Then LEX_ClearStacks l, f : Return
				End Select
				
				If rule\action <> LEX_ACTION_INCLUDE Then LEX_IncrementFilePtr l\cFile, Len(token)
			Else
				LEX_IncrementFilePtr l\cFile, 1
			EndIf
		Wend
		
		If l\cFile <> f		;Pop back to the previous file in the include stack
			LEX_CloseFile l\cFile
			ListRemoveLast l\istk
			l\cFile = Object.LEX_File ListLast(l\istk)
		Else
			Exit	;If it's f, we're done
		EndIf
	Forever
	
	LEX_ClearStacks l, f
End Function

;Load a file into a suitable format for scanning
Function LEX_ReadFile.LEX_File(path$)
	path = Replace(path, "\", "/")	;Normalise slashes
	
	Local dirName$[0], fileName$[0] : LEX_GetPathCmpts path, dirName, fileName
	
	Local stream = ReadFile(path) : If Not stream
		LEX_Error Null, "Unable to open file '" + path + "' ('" + dirName[0] + fileName[0] + "')"
		Return Null
	EndIf
	
	Local f.LEX_File = New LEX_File
	f\dir = dirName[0] : f\name = fileName[0]
	
	f\stream = CreateBank(FileSize(path))
	ReadBytes f\stream, stream, 0, BankSize(f\stream)
	CloseFile stream
	f\sLen = BankSize(f\stream)
	
	LEX_ResetFile f
	Return f
End Function

;Flag an error, preventing the lexer from continuing and possibly logging it via the interface
Function LEX_Error(l.LEX_Lexer, msg$)
	LEX_LogError l, msg
	If l <> Null Then l\errState = True	;(l may be Null when called from LEX_ReadFile)
End Function

;(Internal) Use a simple set of filter chars to chop the path out of an include directive
Function LEX_FilterIncludeString$(inc$, filter$)
	Local i, p
	For i = 1 To Len(filter)
		p = Instr(inc, Mid(filter, i, 1))
		If p Then inc = Mid(inc, p + 1) : Exit
	Next
	For i = 1 To Len(filter)
		p = Instr(inc, Mid(filter, i, 1))
		If p Then inc = Left(inc, p - 1) : Exit
	Next
	Return inc
End Function

;(Internal) Get the absolute directory and stripped filename from a path
Function LEX_GetPathCmpts(path$, dirOut$[0], fileOut$[0])
	dirOut[0] = "" : fileOut[0] = path
	Local tgt$ : If Instr(path, "/") Then tgt = "/" : Else tgt = ":"
	
	Local c : For c = Len(path) To 1 Step -1
		If Mid(path, c, 1) = tgt
			dirOut[0] = Left(path, c) : If Left(dirOut[0], 1) = "/" Then dirOut[0] = Mid(dirOut[0], 2)
			fileOut[0] = Mid(path, c + 1)
			Exit
		EndIf
	Next
	
	Local isAbsolute = (Left(path, 2) = "//") Or Instr(path, ":")	;UNC/driveletter/URL
	If Not isAbsolute Then dirOut[0] = Replace(CurrentDir(), "\", "/") + dirOut[0]
End Function

;(Internal) Reset a LEX_File's pointer fields
Function LEX_ResetFile(f.LEX_File)
	f\cPtr = 0
	f\cLine = 1
	f\cCol = 1
End Function

;(Internal) Increment a LEX_File's pointer fields
Function LEX_IncrementFilePtr(f.LEX_File, count)
	Local c : For c = 1 To count
		Local char = PeekByte(f\stream, f\cPtr)
		If char < 32	;Only count printable characters in the column field
			If char = 10 Then f\cLine = f\cLine + 1 : f\cCol = 1
		Else
			f\cCol = f\cCol + 1
		EndIf
		f\cPtr = f\cPtr + 1
	Next
End Function

;(Internal) "Guard" a file against being included twice
Function LEX_GuardFileName(l.LEX_Lexer, f.LEX_File)
	Local g.LEX_Guard = New LEX_Guard
	g\name = f\dir + f\name
	ListAddLast l\prev, Handle g
End Function

;"Close" a file opened for scanning
Function LEX_CloseFile(f.LEX_File)
	If f\stream Then FreeBank f\stream
	Delete f
End Function

;Treat the contents of a string as a lexer file (useful for short inputs)
Function LEX_FileFromString.LEX_File(s$)
	Local f.LEX_File = New LEX_File
	fdir = "" : fname = "<string>"
	
	fstream = CreateBank(Len(s))
	Local c : For c = 1 To BankSize(fstream)
		PokeByte fstream, c - 1, Asc(Mid(s, c, 1))
	Next
	fsLen = BankSize(fstream)
	
	LEX_ResetFile f
	Return f
End Function

;Create a new match result (token object)
Function LEX_NewToken.LEX_Token(val$, tt$, file$, l = 0, c = 0)
	Local t.LEX_Token = New LEX_Token
	t\value = val : t\tType = tt
	t\file = file : t\l = l : t\c = c
	Return t
End Function


;~IDEal Editor Parameters:
;~F#D#12#1A#1F#24#2D#36#3B#40#45#62#74#83#8B#96#D6#ED#F3#101#112
;~F#119#126#12D#133#142
;~C#Blitz3D
