Code archives/Algorithms/Lexer generator

This code has been declared by its author to be Public Domain code.

Download source code

Lexer generator by Yasha2009
Final update: There is no longer any reason to use this entry: it has been completely superseded by a much more convenient generic lexer API.


Updated 21/07/2011: Completely rewrote the regular expression engine in an attempt to fix some bugs, and tidied up the output of the scanner to use objects.

Updated 18/06/2010: Added the "Type" command; affects both files.

Updated 19/04/2010: Fixed a minor error with backslash-escaped backslashes in the regexen; added a couple of major speed enhancements.

Updated 01/03/2010: If you downloaded this previously, please do so again as there was a problem with the regular expression engine.


A lexical scanner, or tokeniser, is a tool that reads through souce code or some other kind of input, and breaks it up into separate tokens, identified by type. It's an absolutely essential component in a compiler or interpreter, and is useful as the first stage of parsing many other kinds of input as well, such as in a calculator or command-line interface. Writing them can be both difficult and tedious, however; so it's common to automate the process...

This generator is very loosely based on Flex, although significantly less powerful and only aimed at producing a specific kind of output.

Input is in the form of a definitions file - this is a simple text file in the following format:
Case Insensitive

Constants: {
    Digit [0-9]
    Quote "
    Point \.
    Int 1
    Float 2
    String 3
}

Modes: {
    COMMENT Exclusive
    DOG Inclusive
}

Rules: {
    -?{Digit}+ Store Int
    -?{Digit}*{Point}{Digit}+ Store Float
    {Quote}[^\n]*{Quote} Store String
    \{- Mode <COMMENT>

    <COMMENT> -\} Mode <>
    <COMMENT,DOG,> doggy {
        print "nada"
    }
}

Code: {
;Anything in here is copied straight to the main body
;so make sure it's valid BB code
}


Input is arranged in "blocks" - anything not in one of these blocks is currently ignored, with the exception of the "case insensitive" directive which should appear outside the blocks (EDIT: There's now also a "case sensitive" directive, if you need to return it to the default, which also needs to be outside the blocks). There are four kinds of block, as shown above, although none are necessary and each type may appear more than once (a file with no "Rules" block will not generate a functioning lexer, though!) and in any order. Blocks begin with a type specifier ("Rules", "Code", "Constants" or "Modes") followed by a colon and an opening brace; the definitions in the block must then begin on a new line. The block ends when a closing brace is encountered on its own new line.

The "Code" block is simplest: anything in this block is simply copied verbatim into the resulting file, in the "main body" of the code. You can specify include lines or helper functions here if you like.

The "Constants" block defines simple replacement constants for use in rule definitions and their actions (more on this below). These won't make it into the final BB code though, so don't reference them in any code sections. You can declare these before or after Rules; it makes no difference.

The "Modes" block defines different modes of operation that will determine which rules are followed at any given time (supposed to imitate Flex's "start conditions"). For example, if you wanted to scan C or C++, you could set the token "/\*" to trigger "comment" mode, which has only one rule: "\*/", which puts the scanner back in normal mode and allows it to pick up names and commands again (in both cases the asterisk must be escaped). Modes are defined as "inclusive" or "exclusive": exclusive modes, like the comment mode, only allow rules explicitly assigned to them to be followed while active; an inclusive mode will also allow rules without any specific mode to be followed. As with constants, modes can be declared anywhere in the definitions file.

Finally, the most important section is the Rules section. A rule begins with an optional list of modes, contained within <> (more than one mode can be assigned to a rule, separated with commas, including as above, the empty mode). Next is a regular expression that defines a pattern to match tokens to, following the rules outlined here with the addition of ^ at the start of a pattern to force start-of-line or -file and $ at the end of a pattern to force end-of-line/file. Note that the braces are escaped with backslash to force their literal value. If you included the "case insensitive" directive somewhere in the file (outside the blocks) then all of the rules will be case-insensitive (if you need only some of the rules to be case-sensitive, you'll need to devise regex rules that reflect this).

After the pattern, you can specify an action for the lexer to take: you can store the token and an integer type for it, just store the type (useful for saving memory on things like operators where the token itself is always known), change mode, or execute arbitrary BB code within a braced block, again allowing a new line for the final brace (unlike the code in Code:{} blocks, this is placed into the scanner function scope). If you want to do more than one of these, at the moment the only way is to do so directly in BB code (will probably change this). You don't need to specify an action at all - in a C++ lexer you might have the rule //[^\n]*\n (double slash, then anything up until the next newline) with no action, to comment out the rest of a line.

If more than one pattern could match a character string (eg. "End" and "End Function") the lexer will go with the longer match. If the matches are the same length then the first specified in the list will be chosen.

The generated scanner builds a list of tokens (as bankstrings) and their integer types in a bank, which it returns, unless you use the BB code blocks to make it do something else, so unlike some tokenising functions that are called repeatedly to cough up the next token, BBLex_ScanFile() only needs to be called once and then the tokens can be obtained by navigating the resulting bank.

Only three functions are actually created by this generator - the scanner itself (BBLex_ScanFile) and two other initialisation functions that it calls. The vast majority of the scanner actually consists of a slightly-modified version of my regular expressions library, and so is simply Included as BBLex_Functions.bb:



Here's an example program to demonstrate the results:

Local i, tokenBank

tokenBank = BBLex_ScanFile("test.txt")

For i = 0 To BankSize(tokenBank) / 4 - 1
        Local tok.BBLex_Token = BBLex_GetToken(tokenBank, i)
	Print tok\tType + " : " + tok\val
Next

WaitKey
End

Include "testlex.bb"


...and a really simple test file to tokenise:

12 345 56.45 doggy {- This is a 
comment and shouldn't be picked up -}
"String literal 1!"n
n"String literal 2!"
doggy
65 -12.34
boogledoggy 35.8 "doggy as a string literal!"


And finally, the generator itself:
Write "Generating... "
BBLex_Generate("Scythe lexer.txt","BBLex_Scythe.bb")		;Change these to the desired input and output files
Print "done!"

Print ""
Print "Press any key to exit..."

WaitKey
End


Const SIZEOF_CONST = 9

Function BBLex_Generate(defFile$,lexFile$)		;Generate a .bb lexer from the definitions given in defFile and output it as lexFile
	Local dFile,dLine$,lFile,i,caseSen,userCodeOutput
	Local ruleBank,constBank,modeBank
	
	dFile=ReadFile(defFile)
	lFile=WriteFile(lexFile)
	constBank=CreateBank()
	modeBank=CreateBank(5)
	PokeInt modeBank,0,StrToBank("")
	ruleBank=CreateBank()
	
	WriteLine lFile,""
	WriteLine lFile,";This file was automatically generated using BBLex: http://www.blitzbasic.com/codearcs/codearcs.php?code=2636"
	WriteLine lFile,""
	
	While Not Eof(dFile)
		dLine=Replace(Replace(Lower(ReadLine(dFile)),Chr(9),"")," ","")
		Select dLine
			Case "caseinsensitive","case-insensitive"
				caseSen=False
			Case "casesensitive","case-sensitive"
				caseSen=True
			Case "constants:{"
				LoadConstants(constBank,dFile)
			Case "modes:{"
				LoadModes(modeBank,dFile)
			Case "rules:{"
				LoadRules(ruleBank,dFile)
			Case "code:{"
				WriteLine lFile,""
				dLine=ReadLine(dFile)
				While Not Eof(dFile)
					If Left(Trim(dLine),1)="}" Then Exit
					If userCodeOutput=False
						WriteLine lFile,""
						WriteLine lFile,Chr(9)+";User code:"
						WriteLine lFile,""
						userCodeOutput=True
					EndIf
					WriteLine lFile,dLine
					dLine=ReadLine(dFile)
				Wend
				WriteLine lFile,""
		End Select
	Wend
	
	CloseFile dFile
	ProcessRules(ruleBank,modeBank,constBank)
	
	OutputLexer(constBank,ruleBank,lFile,caseSen,userCodeOutput)
	CloseFile lFile
	
	For i=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
		FreeBank PeekInt(constBank,i)
		FreeBank PeekInt(constBank,i+4)
	Next
	FreeBank constBank
	For i=0 To BankSize(modeBank)-5 Step 5
		FreeBank PeekInt(modeBank,i)
	Next
	FreeBank modeBank
	
	FreeBank ruleBank
End Function

Function OutputLexer(constBank,ruleBank,lexFile,caseSen,userCodeOutput)
	Local newLine$,i,j,action$
	
	newLine=Chr(13)+Chr(10)
	
	If userCodeOutput Then WriteLine lexFile,newLine+newLine+Chr(9)+";Generated code:"
	WriteLine lexFile,newLine+"Include "+Chr(34)+"BBLex_Functions.bb"+Chr(34)+newLine
	
	If BankSize(constBank)
		For i=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
			If PeekByte(constBank,i+8)=True
				WriteLine lexFile,"Const "+BankToStr(PeekInt(constBank,i))+" = "+BankToStr(PeekInt(constBank,i+4))
			EndIf
		Next
		WriteLine lexFile, ""
	EndIf
	
	WriteLine lexFile,"Function BBLex_ScanData(sBank)"
	WriteLine lexFile,Chr(9)+"Local rBank, mBank, tBank, cPtr"+Chr(9)+newLine+Chr(9)+"Local token$, cMatch$, rID, i, cMode"+newLine
	WriteLine lexFile,Chr(9)+"rBank = BBLex_InitRegexen()"+newLine+Chr(9)+"mBank = BBLex_InitModes()"
	WriteLine lexFile,Chr(9)+"tBank = CreateBank()"+newLine
	WriteLine lexFile,Chr(9)+"While cPtr < BankSize(sBank)"+newLine+Chr(9)+Chr(9)+"token = "+Chr(34)+Chr(34)+newLine
	
	WriteLine lexFile,Chr(9)+Chr(9)+"For i = 0 to "+((BankSize(ruleBank)/12)-1)
	WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"If BBLex_ModeMatch(i, mBank, cMode)"
	WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"cMatch = Regex_Match(Object.RegEx_Node(PeekInt(rBank, i * 4)), sBank, cPtr)"
	WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"If Len(cMatch) > Len(token) Then token = cMatch : rID = i"
	WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"EndIf"+newLine+Chr(9)+Chr(9)+"Next"+newLine
	
	WriteLine lexFile,Chr(9)+Chr(9)+"If token = "+Chr(34)+Chr(34)+newLine+Chr(9)+Chr(9)+Chr(9)+"cPtr = cPtr + 1"
	WriteLine lexFile,Chr(9)+Chr(9)+"Else"+newLine+Chr(9)+Chr(9)+Chr(9)+"Select rID"
	
	For i=0 To ((BankSize(ruleBank)/12)-1)
		WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"Case "+i
		action=BankToStr(PeekInt(ruleBank,i*12+8))
		Select Lower(Left(action,1))
			Case "s"
				WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"BBLex_StoreToken tBank, "+Trim(Mid(action,6))+", token"
			Case "t"
				WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"BBLex_StoreType tBank, "+Trim(Mid(action,6))
			Case "m"
				WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"cMode = "+Trim(Mid(action,5))
			Case "{"
				WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+Mid(action,2)
		End Select
	Next
	WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"End Select"
	WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"cPtr = cPtr + Len(token)"+newLine+Chr(9)+Chr(9)+"EndIf"+newLine+Chr(9)+"Wend"+newLine
	WriteLine lexFile,Chr(9)+"BBLex_DeleteRegexen(rBank)"+newLine+Chr(9)+"BBLex_ClearModes(mBank)"
	WriteLine lexFile,Chr(9)+"FreeBank sBank"+newLine
	WriteLine lexFile,Chr(9)+"Return tBank"+newLine+"End Function"+newLine
	
	WriteLine lexFile,"Function BBLex_InitRegexen()"
	WriteLine lexFile,Chr(9)+"Local regexBank"+newLine
	WriteLine lexFile,Chr(9)+"regexBank = CreateBank("+(BankSize(ruleBank)/3)+")"+newLine
	
	For i=0 To BankSize(ruleBank)/12-1
		WriteLine lexFile,Chr(9)+"PokeInt regexBank, "+(i*4)+", Handle(Regex_Parse("+ExpandQuotes(BankToStr(PeekInt(ruleBank,i*12+4)))+", "+StrFromBool(caseSen)+"))"
	Next
	
	WriteLine lexFile,newLine+Chr(9)+"Return regexBank"+newLine+"End Function"+newLine
	
	WriteLine lexFile,"Function BBLex_InitModes()"
	WriteLine lexFile,Chr(9)+"Local modeBank"+newLine+Chr(9)+"modeBank = CreateBank("+(BankSize(ruleBank)/3)+")"+newLine
	
	For i = 0 To BankSize(ruleBank) / 12 - 1
		If BankSize(PeekInt(ruleBank, i * 12))
			WriteLine lexFile, Chr(9)+"PokeInt modeBank, " + i * 4 + ", CreateBank("+BankSize(PeekInt(ruleBank, i * 12)) + ")"
			For j = 0 To BankSize(PeekInt(ruleBank, i * 12)) - 4 Step 4
				WriteLine lexFile, Chr(9) + "PokeInt PeekInt(modeBank, "+i*4+"), "+j+", " + PeekInt(PeekInt(ruleBank, i * 12), j)
			Next
		Else
			WriteLine lexFile, Chr(9) + "PokeInt modeBank, " + i * 4 + ", 0"
		EndIf
	Next
	WriteLine lexFile,newLine+Chr(9)+"Return modeBank"+newLine+"End Function"+newLine
End Function

Function ExpandQuotes$(s$)
	Local i
	
	If s = Chr(34) Then Return "Chr(34)"
	
	Local l$ = Left(s, 1), r$ = Right(s, 1), m$ = Mid(s, 2, Len(s) - 2)
	
	If l = Chr(34) Then l = "Chr(34) + " + Chr(34) : Else l = Chr(34) + l
	If r = Chr(34) Then r = Chr(34) + " + Chr(34)" : Else r = r + Chr(34)
	m = Replace(m, Chr(34), Chr(34) + " + Chr(34) + " + Chr(34))
	
	Return l + m + r
End Function

Function StrFromBool$(b)
	If b Then Return "True" Else Return "False"
End Function

Function LoadConstants(constBank,dFile)
	Local dLine$,cName$,cValue$,i, export
	
	While Not Eof(dFile)
		dLine=Trim(ReadLine(dFile))
		If Left(dLine,1)="}" Then Exit
		
		If dLine<>""
			If Left(dLine,1)<>";"
				For i=1 To Len(dLine)
					If i>1
						If Mid(dLine,i-1,1)<>"\"
							If Asc(Mid(dLine,i,1))<=32 Then Exit
						EndIf
					EndIf
					cName=cName+Mid(dLine,i,1)
				Next
				
				dLine=Trim(Mid(dLine,i+1))
				
				For i=1 To Len(dLine)
					If i>1
						If Mid(dLine,i-1,1)<>"\"
							If Asc(Mid(dLine,i,1))<=32 Then Exit
						EndIf
					EndIf
					cValue=cValue+Mid(dLine,i,1)
				Next
				dLine=Trim(Mid(dLine,i))
				
				ResizeBank constBank,BankSize(constBank)+SIZEOF_CONST
				PokeInt constBank,BankSize(constBank)-SIZEOF_CONST,StrToBank(cName)
				PokeInt constBank,BankSize(constBank)-(SIZEOF_CONST-4),StrToBank(cValue)
				PokeByte constBank,BankSize(constBank)-(SIZEOF_CONST-8),(Lower(Left(dLine,6))="export")
				cName=""
				cValue=""
			EndIf
		EndIf
	Wend
End Function

Function LoadModes(modeBank,dFile)
	Local dLine$,mName$,i
	
	While Not Eof(dFile)
		dLine=Trim(ReadLine(dFile))
		If Left(dLine,1)="}" Then Exit
		
		If dLine<>""
			If Left(dLine,1)<>";"
				For i=1 To Len(dLine)
					If i>1
						If Mid(dLine,i-1,1)<>"\"
							If Asc(Mid(dLine,i,1))<=32 Then Exit
						EndIf
					EndIf
					mName=mName+Mid(dLine,i,1)
				Next
				
				dLine=Trim(Mid(dLine,i+1))
				
				ResizeBank modeBank,BankSize(modeBank)+5
				PokeInt modeBank,BankSize(modeBank)-5,StrToBank(mName)
				
				If Lower(Left(dLine,2))="in" Then PokeByte modeBank,BankSize(modeBank)-1,1:Else PokeByte modeBank,BankSize(modeBank)-1,0
				mName=""
			EndIf
		EndIf
	Wend
End Function

Function LoadRules(ruleBank,dFile)
	Local dLine$,cPtr,mode$,rule$,action$
	
	While Not Eof(dFile)
		dLine=Trim(ReadLine(dFile))
		If Left(dLine,1)="}" Then Exit
		
		If dLine<>""
			If Left(dLine,1)<>";"
				mode=""
				
				If Left(dLine,1)="<"
					cPtr=2
					While Mid(dLine,cPtr,1)<>">"
						If Asc(Mid(dLine,cPtr,1))>32 Then mode=mode+Mid(dLine,cPtr,1)
						cPtr=cPtr+1
					Wend
					dLine=Trim(Mid(dLine,cPtr+1))
				EndIf
				
				rule=""
				For cPtr=1 To Len(dLine)
					If cPtr>2
						If Mid(dLine,cPtr-1,1)<>"\"
							If Asc(Mid(dLine,cPtr,1))<=32 Then Exit
						ElseIf Mid(dLine,cPtr-2,2)="\\"
							If Asc(Mid(dLine,cPtr,1))<=32 Then Exit	;If that backslash was part of the pattern
						EndIf
					ElseIf cPtr=2
						If Left(dLine,1)<>"\"
							If Asc(Mid(dLine,cPtr,1))<=32 Then Exit
						EndIf
					EndIf
					rule=rule+Mid(dLine,cPtr,1)
				Next
				dLine=Trim(Mid(dLine,cPtr))
				
				action=dLine
				If Left(dLine,1)="{"
					While Not Eof(dFile)
						dLine=Trim(ReadLine(dFile))
						If Left(dLine,1)="}" Then action=action+Chr(13)+Chr(10):Exit
						
						action=action+Chr(13)+Chr(10)+dLine
					Wend
				EndIf
				
				ResizeBank ruleBank,BankSize(ruleBank)+12
				PokeInt ruleBank,BankSize(ruleBank)-12,StrToBank(mode)
				PokeInt ruleBank,BankSize(ruleBank)-8,StrToBank(rule)
				PokeInt ruleBank,BankSize(ruleBank)-4,StrToBank(action)
			EndIf
		EndIf
	Wend
End Function

Function ProcessRules(ruleBank,modeBank,constBank)
	Local r,c,m,mode$,rule$,action$
	
	For r=0 To BankSize(ruleBank)-12 Step 12
		mode=BankToStr(PeekInt(ruleBank,r))
		FreeBank PeekInt(ruleBank,r)
		PokeInt ruleBank,r,CreateBank()
		
		While mode<>""
			If Right(mode,1)=","
				ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4
				PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,0
				mode=Trim(Left(mode,Len(mode)-1))
			EndIf
			If Instr(mode,",")>0
				For m=0 To BankSize(modeBank)-5 Step 5
					If Left(mode,Instr(mode,",")-1)=BankToStr(PeekInt(modeBank,m))
						ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4
						If PeekByte(modeBank,m+4)=1
							PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,-(m/5)
						Else
							PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,m/5
						EndIf
					EndIf
				Next
				mode=Mid(mode,Instr(mode,",")+1)
			Else
				For m=0 To BankSize(modeBank)-5 Step 5
					If mode=BankToStr(PeekInt(modeBank,m))
						ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4
						If PeekByte(modeBank,m+4)=1
							PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,-(m/5)
						Else
							PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,m/5
						EndIf
					EndIf
				Next
				mode=""
			EndIf
		Wend
		
		rule=BankToStr(PeekInt(ruleBank,r+4))
		FreeBank PeekInt(ruleBank,r+4)
		For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
			rule=Replace(rule,"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4)))
		Next
		PokeInt ruleBank,r+4,StrToBank(rule)
		
		action=BankToStr(PeekInt(ruleBank,r+8))
		If Left(action,1)<>"{"
			FreeBank PeekInt(ruleBank,r+8)
			If Lower(Left(action,5))="store"
				For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
					If PeekByte(constBank, c + 8) = False
						action="store "+Replace(Mid(action,6),"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4)))
					EndIf
				Next
			ElseIf Lower(Left(action,4))="type"
				For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST
					If PeekByte(constBank, c + 8) = False
						action="type "+Replace(Mid(action,5),"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4)))
					EndIf
				Next
			ElseIf Lower(Left(action,4))="mode"
				For m=0 To BankSize(modeBank)-5 Step 5
					If PeekByte(modeBank,m+4)=1
						action=Replace(action,"<"+BankToStr(PeekInt(modeBank,m))+">",-(m/5))
					Else
						action=Replace(action,"<"+BankToStr(PeekInt(modeBank,m))+">",m/5)
					EndIf
				Next
			EndIf
			PokeInt ruleBank,r+8,StrToBank(action)
		EndIf
	Next
End Function

Function StrToBank(s$)		;Return a bank containing the binary value of the given string
	Local i,bank
	bank=CreateBank(Len(s))
	For i=0 To Len(s)-1
		PokeByte bank,i,Asc(Mid(s,i+1,1))
	Next
	Return bank
End Function

Function BankToStr$(bank)		;Return a string containing the ASCII value of the given bank
	Local i,s$
	For i=0 To BankSize(bank)-1
		s=s+Chr(PeekByte(bank,i))
	Next
	Return s
End Function

;~IDEal Editor Parameters:
;~F#E#4F#9D#AB#AF#D8#F6#12E#17B#184
;~C#Blitz3D

Comments

Yasha2010
Here are a couple of more useful examples.

This generates a small lexer for a simple QuakeC-like language:



This generates a complete lexer for the C programming language, as described in the reference grammar at the back of K&R (ANSI C89, not including preprocessor):




Dabhand2010
Nice work! :)

Dabz


Code Archives Forum