This code has been declared by its author to be Public Domain code.

BlitzMax Lexer Module by N (2010)
This sourcecode is now available under the zlib license at http://github.com/nilium/cower.bmxlexer
You don't have to go there to get the source, but the code posted here has some bugs in it. To help move people away from this copy and towards the code that's under version control, please use the URL above.



I originally wrote this in Ruby, but there is a rather annoying issue with writing any code in Ruby: using it anywhere else is an immense pain. If you've ever had to work with the C API to embed Ruby in something, you're probably aware of this. You may also be insane if you're going "I did it and I thoroughly enjoyed the experience." I can't help those people; they're clearly lost causes.

Anyhow, I ported the code to C, and overall I think it's an improvement because it's a little less messy. There aren't a lot of comments. Most of them are in the BlitzMax code, because BlitzMax sucks at actually working with C code and sometimes I need to make a note about what type something really is. The C API is private in this module, mostly because I think most BlitzMax users would find it terrifying even though it's relatively simple.

The BlitzMax API is fairly simple, so I don't think I need to explain what each method does or what the fields of each type are. If a name has an _ before it, don't touch it.

If you need to parse BlitzMax code, this is probably a decent starting point: you don't have to concern yourself with the annoying string parsing crap you'd otherwise have to do, and can just focus on structure and chunks of code. If you want to tweak the lexer to match certain other things, that's probably fairly easy to do, and it could be a decent starting point for something else. Most of what you'd change is likely covered by the token singles/pairs arrays, which you can change to match your own preferences. Case sensitivity options are in there, so you could work that in as well.
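To give a rough idea of what a table-driven "singles" lookup with a per-entry case sensitivity option might look like, here is a minimal sketch in C. It is not the module's actual code: the names `single_entry_t`, `singles`, `word_matches`, and `classify_word` are hypothetical, and only the numeric token kinds are taken from the TOK_* constants in the BlitzMax wrapper.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of token kinds. The values mirror the TOK_* constants
   in the BlitzMax wrapper; the table layout itself is an assumption. */
enum { TOK_ID = 1, TOK_FUNCTION_KW = 3, TOK_TYPE_KW = 7, TOK_PROTOCOL_KW = 70 };

typedef struct {
    const char *text;    /* spelling to match */
    int kind;            /* token kind produced on a match */
    int case_sensitive;  /* non-zero: the spelling must match exactly */
} single_entry_t;

static const single_entry_t singles[] = {
    { "Function", TOK_FUNCTION_KW, 0 },
    { "Type",     TOK_TYPE_KW,     0 },
    { "Protocol", TOK_PROTOCOL_KW, 0 },  /* extra keyword, not part of BlitzMax */
};

/* Compare the character range [from, from + len) against one table entry. */
static int word_matches(const char *from, size_t len, const single_entry_t *e)
{
    size_t i;
    if (strlen(e->text) != len)
        return 0;
    for (i = 0; i < len; ++i) {
        unsigned char a = (unsigned char)from[i];
        unsigned char b = (unsigned char)e->text[i];
        if (e->case_sensitive ? (a != b) : (tolower(a) != tolower(b)))
            return 0;
    }
    return 1;
}

/* Classify a word as a keyword, falling back to a plain identifier. */
int classify_word(const char *from, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof singles / sizeof singles[0]; ++i) {
        if (word_matches(from, len, &singles[i]))
            return singles[i].kind;
    }
    return TOK_ID;
}
```

With a layout like this, adding or removing a keyword is a one-line change to the table, and flipping `case_sensitive` on an entry changes how strictly it is matched.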

On a side note about "additional things": this will recognize certain keywords that are not keywords in BlitzMax, including Protocol, EndProtocol (and its spaced variant), and Implements. It's not hard to remove these, but I've left them in partly because I use that code and partly because they illustrate how you can create additional tokens fairly easily. However, bear in mind that I've only supported combining ordered pairs of tokens. Anything beyond that isn't really needed.
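As an illustration of what combining ordered pairs could mean in practice, the following C sketch merges two adjacent tokens (for example End followed by Protocol) into one combined token. It is only a sketch under assumptions: `pair_entry_t`, `pairs`, and `merge_pairs` are hypothetical names, tokens are reduced to bare kind values, and the real lexer would also have to merge source ranges. The numeric kinds are taken from the TOK_* constants in the BlitzMax wrapper.

```c
#include <stddef.h>

/* Hypothetical subset of token kinds, using the values of the TOK_*
   constants from the BlitzMax wrapper. */
enum {
    TOK_END_KW = 2, TOK_FUNCTION_KW = 3, TOK_ENDFUNCTION_KW = 4,
    TOK_PROTOCOL_KW = 70, TOK_ENDPROTOCOL_KW = 71
};

/* One rule: `first` immediately followed by `second` becomes `merged`. */
typedef struct {
    int first;
    int second;
    int merged;
} pair_entry_t;

static const pair_entry_t pairs[] = {
    { TOK_END_KW, TOK_FUNCTION_KW, TOK_ENDFUNCTION_KW },
    { TOK_END_KW, TOK_PROTOCOL_KW, TOK_ENDPROTOCOL_KW },
};

/* Merge matching adjacent pairs in place and return the new token count.
   Only ordered pairs are considered; longer sequences are not combined. */
size_t merge_pairs(int *kinds, size_t count)
{
    size_t in = 0, out = 0;
    while (in < count) {
        int kind = kinds[in];
        if (in + 1 < count) {
            size_t p;
            for (p = 0; p < sizeof pairs / sizeof pairs[0]; ++p) {
                if (pairs[p].first == kind && pairs[p].second == kinds[in + 1]) {
                    kind = pairs[p].merged;
                    ++in;  /* consume the second token of the pair */
                    break;
                }
            }
        }
        kinds[out++] = kind;
        ++in;
    }
    return out;
}
```

Supporting a new spaced keyword in a scheme like this would mean adding one entry to the pairs table and one new token kind.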

Anyhow, the C side of things...

lexer.h


lexer.c


The BlitzMax module:
SuperStrict

Module Cower.BMXLexer
ModuleInfo "Name: BlitzMax Lexer"
ModuleInfo "Description: Wrapped lexer for BlitzMax source code"
ModuleInfo "Author: Noel Cower"
ModuleInfo "License: Public Domain"

Import "lexer.c"

Private

Extern "C"
	Function lexer_new@Ptr(source_begin@Ptr, source_end@Ptr)
	Function lexer_destroy(lexer@Ptr)
	Function lexer_run:Int(lexer@Ptr)
	Function lexer_get_error$z(lexer@Ptr)
	Function lexer_get_num_tokens:Int(lexer@Ptr)
	Function lexer_get_token:Int(lexer@Ptr, index%, token@Ptr)
'	Function lexer_copy_tokens@Ptr(lexer@Ptr, num_tokens%Ptr) ' unused
	Function token_to_string@Ptr(tok@Ptr)
	Function free(b@Ptr)
End Extern

Public

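Type TToken
	' These fields are filled in by the C code (see lexer_get_token), so their
	' order and types have to match the C token struct. Don't rearrange them.
	Field kind%             ' token_kind_t
	Field _from:Byte Ptr    ' const char *
	Field _to_:Byte Ptr     ' const char *
	Field line%             ' int
	Field column%           ' int
	
	' BlitzMax-only field, placed after the fields shared with C.
	Field _cachedStr$=Null
	
	' Returns the token's text. The string is built once and then cached.
	Method ToString$()
		If _cachedStr = Null Then
			' The C string is allocated on the C side, so it is released with free().
			Local cstr@Ptr = token_to_string(Self)
			_cachedStr = String.FromCString(cstr)
			free(cstr)
		EndIf
		Return _cachedStr
	End Method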
	
	'#region token_kind_t
	Const TOK_INVALID% = 0

	Const TOK_ID% = 1

	Const TOK_END_KW% = 2

	Const TOK_FUNCTION_KW% = 3
	Const TOK_ENDFUNCTION_KW% = 4

	Const TOK_METHOD_KW% = 5
	Const TOK_ENDMETHOD_KW% = 6

	Const TOK_TYPE_KW% = 7
	Const TOK_EXTENDS_KW% = 8
	Const TOK_ABSTRACT_KW% = 9
	Const TOK_FINAL_KW% = 10
	Const TOK_NODEBUG_KW% = 11
	Const TOK_ENDTYPE_KW% = 12

	Const TOK_EXTERN_KW% = 13
	Const TOK_ENDEXTERN_KW% = 14

	Const TOK_REM_KW% = 15
	Const TOK_ENDREM_KW% = 16

	Const TOK_FLOAT_KW% = 17
	Const TOK_DOUBLE_KW% = 18
	Const TOK_BYTE_KW% = 19
	Const TOK_SHORT_KW% = 20
	Const TOK_INT_KW% = 21
	Const TOK_STRING_KW% = 22
	Const TOK_OBJECT_KW% = 23

	Const TOK_LOCAL_KW% = 24
	Const TOK_GLOBAL_KW% = 25
	Const TOK_CONST_KW% = 26

	Const TOK_VARPTR_KW% = 27
	Const TOK_PTR_KW% = 28
	Const TOK_VAR_KW% = 29

	Const TOK_NULL_KW% = 30

	Const TOK_STRICT_KW% = 31
	Const TOK_SUPERSTRICT_KW% = 32

	Const TOK_FRAMEWORK_KW% = 33

	Const TOK_MODULE_KW% = 34
	Const TOK_MODULEINFO_KW% = 35

	Const TOK_IMPORT_KW% = 36
	Const TOK_INCLUDE_KW% = 37

	Const TOK_PRIVATE_KW% = 38
	Const TOK_PUBLIC_KW% = 39

	Const TOK_OR_KW% = 40
	Const TOK_AND_KW% = 41
	Const TOK_SHR_KW% = 42
	Const TOK_SHL_KW% = 43
	Const TOK_SAR_KW% = 44
	Const TOK_MOD_KW% = 45
	Const TOK_NOT_KW% = 46

	Const TOK_WHILE_KW% = 47
	Const TOK_WEND_KW% = 48
	Const TOK_ENDWHILE_KW% = 49

	Const TOK_FOR_KW% = 50
	Const TOK_NEXT_KW% = 51
	Const TOK_UNTIL_KW% = 52
	Const TOK_TO_KW% = 53
	Const TOK_EACHIN_KW% = 54

	Const TOK_REPEAT_KW% = 55
	Const TOK_FOREVER_KW% = 56

	Const TOK_IF_KW% = 57
	Const TOK_ENDIF_KW% = 58
	Const TOK_ELSE_KW% = 59
	Const TOK_ELSEIF_KW% = 60
	Const TOK_THEN_KW% = 61

	Const TOK_SELECT_KW% = 62
	Const TOK_CASE_KW% = 63
	Const TOK_DEFAULT_KW% = 64
	Const TOK_ENDSELECT_KW% = 65

	Const TOK_SELF_KW% = 66
	Const TOK_SUPER_KW% = 67
	Const TOK_PI_KW% = 68
	Const TOK_NEW_KW% = 69

	Const TOK_PROTOCOL_KW% = 70
	Const TOK_ENDPROTOCOL_KW% = 71
	Const TOK_AUTO_KW% = 72
	Const TOK_IMPLEMENTS_KW% = 73

	Const TOK_COLON% = 74
	Const TOK_QUESTION% = 75
	Const TOK_BANG% = 76
	Const TOK_HASH% = 77
	Const TOK_DOT% = 78
	Const TOK_DOUBLEDOT% = 79
	Const TOK_TRIPLEDOT% = 80
	Const TOK_AT% = 81
	Const TOK_DOUBLEAT% = 82
	Const TOK_DOLLAR% = 83
	Const TOK_PERCENT% = 84
	Const TOK_SINGLEQUOTE% = 85
	Const TOK_OPENPAREN% = 86
	Const TOK_CLOSEPAREN% = 87
	Const TOK_OPENBRACKET% = 88
	Const TOK_CLOSEBRACKET% = 89
	Const TOK_OPENCURL% = 90
	Const TOK_CLOSECURL% = 91
	Const TOK_GREATERTHAN% = 92
	Const TOK_LESSTHAN% = 93
	Const TOK_EQUALS% = 94
	Const TOK_MINUS% = 95
	Const TOK_PLUS% = 96
	Const TOK_ASTERISK% = 97
	Const TOK_CARET% = 98
	Const TOK_TILDE% = 99
	Const TOK_GRAVE% = 100
	Const TOK_BACKSLASH% = 101
	Const TOK_SLASH% = 102
	Const TOK_COMMA% = 103
	Const TOK_SEMICOLON% = 104
	Const TOK_PIPE% = 105
	Const TOK_AMPERSAND% = 106
	Const TOK_NEWLINE% = 107

	Const TOK_ASSIGN_ADD% = 108
	Const TOK_ASSIGN_SUBTRACT% = 109
	Const TOK_ASSIGN_DIVIDE% = 110
	Const TOK_ASSIGN_MULTIPLY% = 111
	Const TOK_ASSIGN_POWER% = 112

	Const TOK_ASSIGN_SHL% = 113
	Const TOK_ASSIGN_SHR% = 114
	Const TOK_ASSIGN_SAR% = 115
	Const TOK_ASSIGN_MOD% = 116

	Const TOK_ASSIGN_XOR% = 117
	Const TOK_ASSIGN_AND% = 118
	Const TOK_ASSIGN_OR% = 119

	Const TOK_ASSIGN_AUTO% = 120
	Const TOK_DOUBLEMINUS% = 121
	Const TOK_DOUBLEPLUS% = 122

	Const TOK_NUMBER_LIT% = 123
	Const TOK_HEX_LIT% = 124
	Const TOK_BIN_LIT% = 125
	Const TOK_STRING_LIT% = 126

	Const TOK_LINE_COMMENT% = 127
	Const TOK_BLOCK_COMMENT% = 128

	Const TOK_EOF% = 129
	
	Const TOK_LAST%=TOK_EOF
	Const TOK_COUNT%=TOK_LAST+1
	'#endregion
End Type

Type TLexer
	Field _lexer@Ptr	' lexer_t
	Field _run:Int = False
	Field _cstr_source@Ptr
	Field _length%
	Field _tokens:TToken[]
	Field _error:String = Null
	
	Method InitWithSource:TLexer(source$)
		Assert _cstr_source=Null Else "Lexer already initialized"
		
		_cstr_source = source.ToCString()
		_length = source.Length
		_lexer = lexer_new(_cstr_source, _cstr_source+_length)
		
		Return Self
	End Method
	
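	Method Delete()
		' Destroy the lexer before freeing the source buffer it was given.
		If _lexer Then
			lexer_destroy(_lexer)
			_lexer = Null
		EndIf
		If _cstr_source Then
			MemFree(_cstr_source)
			_cstr_source = Null
		EndIf
	End Method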
	
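	' Runs the lexer once. Returns True on success; on failure, see GetError().
	Method Run:Int()
		Assert _lexer Else "Lexer not initialized"
		Assert _run = False Else "Lexer has already run"
		_run = True
		Local r% = lexer_run(_lexer)
		If r <> 0 Then
			_error = lexer_get_error(_lexer)
		EndIf
		Return (r=0)
	End Method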
	
	' Copies the tokens out of the C lexer into TToken objects on first use.
	Method _cacheTokens()
		If _tokens = Null Then
			_tokens = New TToken[lexer_get_num_tokens(_lexer)]
			For Local init_idx:Int = 0 Until _tokens.Length
				_tokens[init_idx] = New TToken
				lexer_get_token(_lexer, init_idx, _tokens[init_idx])
			Next
		EndIf
	End Method
	
	Method GetToken:TToken(index%)
		_cacheTokens()
		Return _tokens[index]
	End Method
	
	Method GetTokens:TToken[]()
		_cacheTokens()
		Return _tokens
	End Method
	
	Method NumTokens:Int()
		If _tokens Then
			Return _tokens.Length
		EndIf
		Return lexer_get_num_tokens(_lexer)
	End Method
	
	Method GetError$()
		Return _error
	End Method
End Type

Comments

GW (2010)
bmk doesn't seem to want to compile it. There is no error message because bmk is sh*t.

Removing the module stuff and compiling as an exe results in
lexer.c:(.text+0x35a): undefined reference to `asprintf'



N (2010)
What OS are you using?

Edit: Looks like asprintf is something like a GNU/BSD extension. MinGW apparently lacks it for some reason, but whatever. Easy enough to fix...


N (2010)
Should be fixed now.


N (2010)
I've updated this to fix an amazingly stupid bug in lexer_asprintf. Also an.. oddity.. in the function for checking singles. I'm still not sure how to explain that one.


Htbaa (2010)
Hey Nilium, small (?) request from my side. As this is a useful module (it's used by Maximus), could you put it on GitHub?


N (2010)
Sure, I'll throw it up there now. Only downside is I haven't been using any version control for it, so previous versions will be lost.

Edit: Additionally, this will be covered by a license other than public domain on GitHub (zlib).


Htbaa (2010)
At least now some version history can be made :-).

Doesn't GitHub allow Public Domain?

Anyhow, much appreciated.


N (2010)
This sourcecode is now available under the zlib license at http://github.com/nilium/cower.bmxlexer

"Doesn't GitHub allow Public Domain?"
It does, but I'd rather have the zlib license attached to it if I'm moving it elsewhere. Either that or BSD, but I picked zlib for this.

