Code archives/Algorithms/Lexer generator
This code has been declared by its author to be Public Domain code.
Download source code
| |||||
Final update: There is no longer any reason to use this entry: it has been completely superseded by a much more convenient generic lexer API. Updated 21/07/2011: Completely rewrote the regular expression engine in an attempt to fix some bugs, and tidied up the output of the scanner to use objects. Updated 18/06/2010: Added the "Type" command; affects both files. Updated 19/04/2010: Fixed a minor error with backslash-escaped backslashes in the regexen; added a couple of major speed enhancements. Updated 01/03/2010: If you downloaded this previously, please do so again as there was a problem with the regular expression engine. A lexical scanner, or tokeniser, is a tool that reads through souce code or some other kind of input, and breaks it up into separate tokens, identified by type. It's an absolutely essential component in a compiler or interpreter, and is useful as the first stage of parsing many other kinds of input as well, such as in a calculator or command-line interface. Writing them can be both difficult and tedious, however; so it's common to automate the process... This generator is very loosely based on Flex, although significantly less powerful and only aimed at producing a specific kind of output. Input is in the form of a definitions file - this is a simple text file in the following format: Case Insensitive Constants: { Digit [0-9] Quote " Point \. Int 1 Float 2 String 3 } Modes: { COMMENT Exclusive DOG Inclusive } Rules: { -?{Digit}+ Store Int -?{Digit}*{Point}{Digit}+ Store Float {Quote}[^\n]*{Quote} Store String \{- Mode <COMMENT> <COMMENT> -\} Mode <> <COMMENT,DOG,> doggy { print "nada" } } Code: { ;Anything in here is copied straight to the main body ;so make sure it's valid BB code } Input is arranged in "blocks" - anything not in one of these blocks is currently ignored, with the exception of the "case insensitive" directive which should appear outside the blocks (EDIT: There's now also a "case sensitive" directive, if you need to return it to the default, which also needs to be outside the blocks). There are four kinds of block, as shown above, although none are necessary and each type may appear more than once (a file with no "Rules" block will not generate a functioning lexer, though!) and in any order. Blocks begin with a type specifier ("Rules", "Code", "Constants" or "Modes") followed by a colon and an opening brace; the definitions in the block must then begin on a new line. The block ends when a closing brace is encountered on its own new line. The "Code" block is simplest: anything in this block is simply copied verbatim into the resulting file, in the "main body" of the code. You can specify include lines or helper functions here if you like. The "Constants" block defines simple replacement constants for use in rule definitions and their actions (more on this below). These won't make it into the final BB code though, so don't reference them in any code sections. You can declare these before or after Rules; it makes no difference. The "Modes" block defines different modes of operation that will determine which rules are followed at any given time (supposed to imitate Flex's "start conditions"). For example, if you wanted to scan C or C++, you could set the token "/\*" to trigger "comment" mode, which has only one rule: "\*/", which puts the scanner back in normal mode and allows it to pick up names and commands again (in both cases the asterisk must be escaped). Modes are defined as "inclusive" or "exclusive": exclusive modes, like the comment mode, only allow rules explicitly assigned to them to be followed while active; an inclusive mode will also allow rules without any specific mode to be followed. As with constants, modes can be declared anywhere in the definitions file. Finally, the most important section is the Rules section. A rule begins with an optional list of modes, contained within <> (more than one mode can be assigned to a rule, separated with commas, including as above, the empty mode). Next is a regular expression that defines a pattern to match tokens to, following the rules outlined here with the addition of ^ at the start of a pattern to force start-of-line or -file and $ at the end of a pattern to force end-of-line/file. Note that the braces are escaped with backslash to force their literal value. If you included the "case insensitive" directive somewhere in the file (outside the blocks) then all of the rules will be case-insensitive (if you need only some of the rules to be case-sensitive, you'll need to devise regex rules that reflect this). After the pattern, you can specify an action for the lexer to take: you can store the token and an integer type for it, just store the type (useful for saving memory on things like operators where the token itself is always known), change mode, or execute arbitrary BB code within a braced block, again allowing a new line for the final brace (unlike the code in Code:{} blocks, this is placed into the scanner function scope). If you want to do more than one of these, at the moment the only way is to do so directly in BB code (will probably change this). You don't need to specify an action at all - in a C++ lexer you might have the rule //[^\n]*\n (double slash, then anything up until the next newline) with no action, to comment out the rest of a line. If more than one pattern could match a character string (eg. "End" and "End Function") the lexer will go with the longer match. If the matches are the same length then the first specified in the list will be chosen. The generated scanner builds a list of tokens (as bankstrings) and their integer types in a bank, which it returns, unless you use the BB code blocks to make it do something else, so unlike some tokenising functions that are called repeatedly to cough up the next token, BBLex_ScanFile() only needs to be called once and then the tokens can be obtained by navigating the resulting bank. Only three functions are actually created by this generator - the scanner itself (BBLex_ScanFile) and two other initialisation functions that it calls. The vast majority of the scanner actually consists of a slightly-modified version of my regular expressions library, and so is simply Included as BBLex_Functions.bb: Here's an example program to demonstrate the results: Local i, tokenBank tokenBank = BBLex_ScanFile("test.txt") For i = 0 To BankSize(tokenBank) / 4 - 1 Local tok.BBLex_Token = BBLex_GetToken(tokenBank, i) Print tok\tType + " : " + tok\val Next WaitKey End Include "testlex.bb" ...and a really simple test file to tokenise: 12 345 56.45 doggy {- This is a comment and shouldn't be picked up -} "String literal 1!"n n"String literal 2!" doggy 65 -12.34 boogledoggy 35.8 "doggy as a string literal!" And finally, the generator itself: | |||||
Write "Generating... " BBLex_Generate("Scythe lexer.txt","BBLex_Scythe.bb") ;Change these to the desired input and output files Print "done!" Print "" Print "Press any key to exit..." WaitKey End Const SIZEOF_CONST = 9 Function BBLex_Generate(defFile$,lexFile$) ;Generate a .bb lexer from the definitions given in defFile and output it as lexFile Local dFile,dLine$,lFile,i,caseSen,userCodeOutput Local ruleBank,constBank,modeBank dFile=ReadFile(defFile) lFile=WriteFile(lexFile) constBank=CreateBank() modeBank=CreateBank(5) PokeInt modeBank,0,StrToBank("") ruleBank=CreateBank() WriteLine lFile,"" WriteLine lFile,";This file was automatically generated using BBLex: http://www.blitzbasic.com/codearcs/codearcs.php?code=2636" WriteLine lFile,"" While Not Eof(dFile) dLine=Replace(Replace(Lower(ReadLine(dFile)),Chr(9),"")," ","") Select dLine Case "caseinsensitive","case-insensitive" caseSen=False Case "casesensitive","case-sensitive" caseSen=True Case "constants:{" LoadConstants(constBank,dFile) Case "modes:{" LoadModes(modeBank,dFile) Case "rules:{" LoadRules(ruleBank,dFile) Case "code:{" WriteLine lFile,"" dLine=ReadLine(dFile) While Not Eof(dFile) If Left(Trim(dLine),1)="}" Then Exit If userCodeOutput=False WriteLine lFile,"" WriteLine lFile,Chr(9)+";User code:" WriteLine lFile,"" userCodeOutput=True EndIf WriteLine lFile,dLine dLine=ReadLine(dFile) Wend WriteLine lFile,"" End Select Wend CloseFile dFile ProcessRules(ruleBank,modeBank,constBank) OutputLexer(constBank,ruleBank,lFile,caseSen,userCodeOutput) CloseFile lFile For i=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST FreeBank PeekInt(constBank,i) FreeBank PeekInt(constBank,i+4) Next FreeBank constBank For i=0 To BankSize(modeBank)-5 Step 5 FreeBank PeekInt(modeBank,i) Next FreeBank modeBank FreeBank ruleBank End Function Function OutputLexer(constBank,ruleBank,lexFile,caseSen,userCodeOutput) Local newLine$,i,j,action$ newLine=Chr(13)+Chr(10) If userCodeOutput Then WriteLine lexFile,newLine+newLine+Chr(9)+";Generated code:" WriteLine lexFile,newLine+"Include "+Chr(34)+"BBLex_Functions.bb"+Chr(34)+newLine If BankSize(constBank) For i=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST If PeekByte(constBank,i+8)=True WriteLine lexFile,"Const "+BankToStr(PeekInt(constBank,i))+" = "+BankToStr(PeekInt(constBank,i+4)) EndIf Next WriteLine lexFile, "" EndIf WriteLine lexFile,"Function BBLex_ScanData(sBank)" WriteLine lexFile,Chr(9)+"Local rBank, mBank, tBank, cPtr"+Chr(9)+newLine+Chr(9)+"Local token$, cMatch$, rID, i, cMode"+newLine WriteLine lexFile,Chr(9)+"rBank = BBLex_InitRegexen()"+newLine+Chr(9)+"mBank = BBLex_InitModes()" WriteLine lexFile,Chr(9)+"tBank = CreateBank()"+newLine WriteLine lexFile,Chr(9)+"While cPtr < BankSize(sBank)"+newLine+Chr(9)+Chr(9)+"token = "+Chr(34)+Chr(34)+newLine WriteLine lexFile,Chr(9)+Chr(9)+"For i = 0 to "+((BankSize(ruleBank)/12)-1) WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"If BBLex_ModeMatch(i, mBank, cMode)" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"cMatch = Regex_Match(Object.RegEx_Node(PeekInt(rBank, i * 4)), sBank, cPtr)" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"If Len(cMatch) > Len(token) Then token = cMatch : rID = i" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"EndIf"+newLine+Chr(9)+Chr(9)+"Next"+newLine WriteLine lexFile,Chr(9)+Chr(9)+"If token = "+Chr(34)+Chr(34)+newLine+Chr(9)+Chr(9)+Chr(9)+"cPtr = cPtr + 1" WriteLine lexFile,Chr(9)+Chr(9)+"Else"+newLine+Chr(9)+Chr(9)+Chr(9)+"Select rID" For i=0 To ((BankSize(ruleBank)/12)-1) WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+"Case "+i action=BankToStr(PeekInt(ruleBank,i*12+8)) Select Lower(Left(action,1)) Case "s" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"BBLex_StoreToken tBank, "+Trim(Mid(action,6))+", token" Case "t" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"BBLex_StoreType tBank, "+Trim(Mid(action,6)) Case "m" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+"cMode = "+Trim(Mid(action,5)) Case "{" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+Chr(9)+Chr(9)+Mid(action,2) End Select Next WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"End Select" WriteLine lexFile,Chr(9)+Chr(9)+Chr(9)+"cPtr = cPtr + Len(token)"+newLine+Chr(9)+Chr(9)+"EndIf"+newLine+Chr(9)+"Wend"+newLine WriteLine lexFile,Chr(9)+"BBLex_DeleteRegexen(rBank)"+newLine+Chr(9)+"BBLex_ClearModes(mBank)" WriteLine lexFile,Chr(9)+"FreeBank sBank"+newLine WriteLine lexFile,Chr(9)+"Return tBank"+newLine+"End Function"+newLine WriteLine lexFile,"Function BBLex_InitRegexen()" WriteLine lexFile,Chr(9)+"Local regexBank"+newLine WriteLine lexFile,Chr(9)+"regexBank = CreateBank("+(BankSize(ruleBank)/3)+")"+newLine For i=0 To BankSize(ruleBank)/12-1 WriteLine lexFile,Chr(9)+"PokeInt regexBank, "+(i*4)+", Handle(Regex_Parse("+ExpandQuotes(BankToStr(PeekInt(ruleBank,i*12+4)))+", "+StrFromBool(caseSen)+"))" Next WriteLine lexFile,newLine+Chr(9)+"Return regexBank"+newLine+"End Function"+newLine WriteLine lexFile,"Function BBLex_InitModes()" WriteLine lexFile,Chr(9)+"Local modeBank"+newLine+Chr(9)+"modeBank = CreateBank("+(BankSize(ruleBank)/3)+")"+newLine For i = 0 To BankSize(ruleBank) / 12 - 1 If BankSize(PeekInt(ruleBank, i * 12)) WriteLine lexFile, Chr(9)+"PokeInt modeBank, " + i * 4 + ", CreateBank("+BankSize(PeekInt(ruleBank, i * 12)) + ")" For j = 0 To BankSize(PeekInt(ruleBank, i * 12)) - 4 Step 4 WriteLine lexFile, Chr(9) + "PokeInt PeekInt(modeBank, "+i*4+"), "+j+", " + PeekInt(PeekInt(ruleBank, i * 12), j) Next Else WriteLine lexFile, Chr(9) + "PokeInt modeBank, " + i * 4 + ", 0" EndIf Next WriteLine lexFile,newLine+Chr(9)+"Return modeBank"+newLine+"End Function"+newLine End Function Function ExpandQuotes$(s$) Local i If s = Chr(34) Then Return "Chr(34)" Local l$ = Left(s, 1), r$ = Right(s, 1), m$ = Mid(s, 2, Len(s) - 2) If l = Chr(34) Then l = "Chr(34) + " + Chr(34) : Else l = Chr(34) + l If r = Chr(34) Then r = Chr(34) + " + Chr(34)" : Else r = r + Chr(34) m = Replace(m, Chr(34), Chr(34) + " + Chr(34) + " + Chr(34)) Return l + m + r End Function Function StrFromBool$(b) If b Then Return "True" Else Return "False" End Function Function LoadConstants(constBank,dFile) Local dLine$,cName$,cValue$,i, export While Not Eof(dFile) dLine=Trim(ReadLine(dFile)) If Left(dLine,1)="}" Then Exit If dLine<>"" If Left(dLine,1)<>";" For i=1 To Len(dLine) If i>1 If Mid(dLine,i-1,1)<>"\" If Asc(Mid(dLine,i,1))<=32 Then Exit EndIf EndIf cName=cName+Mid(dLine,i,1) Next dLine=Trim(Mid(dLine,i+1)) For i=1 To Len(dLine) If i>1 If Mid(dLine,i-1,1)<>"\" If Asc(Mid(dLine,i,1))<=32 Then Exit EndIf EndIf cValue=cValue+Mid(dLine,i,1) Next dLine=Trim(Mid(dLine,i)) ResizeBank constBank,BankSize(constBank)+SIZEOF_CONST PokeInt constBank,BankSize(constBank)-SIZEOF_CONST,StrToBank(cName) PokeInt constBank,BankSize(constBank)-(SIZEOF_CONST-4),StrToBank(cValue) PokeByte constBank,BankSize(constBank)-(SIZEOF_CONST-8),(Lower(Left(dLine,6))="export") cName="" cValue="" EndIf EndIf Wend End Function Function LoadModes(modeBank,dFile) Local dLine$,mName$,i While Not Eof(dFile) dLine=Trim(ReadLine(dFile)) If Left(dLine,1)="}" Then Exit If dLine<>"" If Left(dLine,1)<>";" For i=1 To Len(dLine) If i>1 If Mid(dLine,i-1,1)<>"\" If Asc(Mid(dLine,i,1))<=32 Then Exit EndIf EndIf mName=mName+Mid(dLine,i,1) Next dLine=Trim(Mid(dLine,i+1)) ResizeBank modeBank,BankSize(modeBank)+5 PokeInt modeBank,BankSize(modeBank)-5,StrToBank(mName) If Lower(Left(dLine,2))="in" Then PokeByte modeBank,BankSize(modeBank)-1,1:Else PokeByte modeBank,BankSize(modeBank)-1,0 mName="" EndIf EndIf Wend End Function Function LoadRules(ruleBank,dFile) Local dLine$,cPtr,mode$,rule$,action$ While Not Eof(dFile) dLine=Trim(ReadLine(dFile)) If Left(dLine,1)="}" Then Exit If dLine<>"" If Left(dLine,1)<>";" mode="" If Left(dLine,1)="<" cPtr=2 While Mid(dLine,cPtr,1)<>">" If Asc(Mid(dLine,cPtr,1))>32 Then mode=mode+Mid(dLine,cPtr,1) cPtr=cPtr+1 Wend dLine=Trim(Mid(dLine,cPtr+1)) EndIf rule="" For cPtr=1 To Len(dLine) If cPtr>2 If Mid(dLine,cPtr-1,1)<>"\" If Asc(Mid(dLine,cPtr,1))<=32 Then Exit ElseIf Mid(dLine,cPtr-2,2)="\\" If Asc(Mid(dLine,cPtr,1))<=32 Then Exit ;If that backslash was part of the pattern EndIf ElseIf cPtr=2 If Left(dLine,1)<>"\" If Asc(Mid(dLine,cPtr,1))<=32 Then Exit EndIf EndIf rule=rule+Mid(dLine,cPtr,1) Next dLine=Trim(Mid(dLine,cPtr)) action=dLine If Left(dLine,1)="{" While Not Eof(dFile) dLine=Trim(ReadLine(dFile)) If Left(dLine,1)="}" Then action=action+Chr(13)+Chr(10):Exit action=action+Chr(13)+Chr(10)+dLine Wend EndIf ResizeBank ruleBank,BankSize(ruleBank)+12 PokeInt ruleBank,BankSize(ruleBank)-12,StrToBank(mode) PokeInt ruleBank,BankSize(ruleBank)-8,StrToBank(rule) PokeInt ruleBank,BankSize(ruleBank)-4,StrToBank(action) EndIf EndIf Wend End Function Function ProcessRules(ruleBank,modeBank,constBank) Local r,c,m,mode$,rule$,action$ For r=0 To BankSize(ruleBank)-12 Step 12 mode=BankToStr(PeekInt(ruleBank,r)) FreeBank PeekInt(ruleBank,r) PokeInt ruleBank,r,CreateBank() While mode<>"" If Right(mode,1)="," ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4 PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,0 mode=Trim(Left(mode,Len(mode)-1)) EndIf If Instr(mode,",")>0 For m=0 To BankSize(modeBank)-5 Step 5 If Left(mode,Instr(mode,",")-1)=BankToStr(PeekInt(modeBank,m)) ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4 If PeekByte(modeBank,m+4)=1 PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,-(m/5) Else PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,m/5 EndIf EndIf Next mode=Mid(mode,Instr(mode,",")+1) Else For m=0 To BankSize(modeBank)-5 Step 5 If mode=BankToStr(PeekInt(modeBank,m)) ResizeBank PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))+4 If PeekByte(modeBank,m+4)=1 PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,-(m/5) Else PokeInt PeekInt(ruleBank,r),BankSize(PeekInt(ruleBank,r))-4,m/5 EndIf EndIf Next mode="" EndIf Wend rule=BankToStr(PeekInt(ruleBank,r+4)) FreeBank PeekInt(ruleBank,r+4) For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST rule=Replace(rule,"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4))) Next PokeInt ruleBank,r+4,StrToBank(rule) action=BankToStr(PeekInt(ruleBank,r+8)) If Left(action,1)<>"{" FreeBank PeekInt(ruleBank,r+8) If Lower(Left(action,5))="store" For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST If PeekByte(constBank, c + 8) = False action="store "+Replace(Mid(action,6),"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4))) EndIf Next ElseIf Lower(Left(action,4))="type" For c=0 To BankSize(constBank)-SIZEOF_CONST Step SIZEOF_CONST If PeekByte(constBank, c + 8) = False action="type "+Replace(Mid(action,5),"{"+BankToStr(PeekInt(constBank,c))+"}",BankToStr(PeekInt(constBank,c+4))) EndIf Next ElseIf Lower(Left(action,4))="mode" For m=0 To BankSize(modeBank)-5 Step 5 If PeekByte(modeBank,m+4)=1 action=Replace(action,"<"+BankToStr(PeekInt(modeBank,m))+">",-(m/5)) Else action=Replace(action,"<"+BankToStr(PeekInt(modeBank,m))+">",m/5) EndIf Next EndIf PokeInt ruleBank,r+8,StrToBank(action) EndIf Next End Function Function StrToBank(s$) ;Return a bank containing the binary value of the given string Local i,bank bank=CreateBank(Len(s)) For i=0 To Len(s)-1 PokeByte bank,i,Asc(Mid(s,i+1,1)) Next Return bank End Function Function BankToStr$(bank) ;Return a string containing the ASCII value of the given bank Local i,s$ For i=0 To BankSize(bank)-1 s=s+Chr(PeekByte(bank,i)) Next Return s End Function ;~IDEal Editor Parameters: ;~F#E#4F#9D#AB#AF#D8#F6#12E#17B#184 ;~C#Blitz3D |
Comments
| ||
Here are a couple of more useful examples. This generates a small lexer for a simple QuakeC-like language: This generates a complete lexer for the C programming language, as described in the reference grammar at the back of K&R (ANSI C89, not including preprocessor): |
| ||
Nice work! :) Dabz |
Code Archives Forum