Code archives/Miscellaneous/Parser framework

This code has been declared by its author to be Public Domain code.


Parser framework by Yasha2012
This is a monstrosity of a library that allows you to describe a parser using a declarative grammar syntax, and use it to parse a token stream generated by my Lexical scanner framework.

EDIT: See first comment for a utility to make defining parsers with this library even easier!


"Parsing" is the process of taking a linear input stream and turning it into something meaningful, usually an abstract syntax tree. The tree is a nice way to reorganise an arbitrarily-complex and nested input into nodes each carrying a defined, fixed amount of easily-processed meaning. e.g. the following input:
1 + 2 * 3 + 4 * 5

...could, after being tokenised, be "parsed" into a syntax tree structured like this:
    +
   / \
  1  +
    /  \
   *    *
  / \  / \
 2  3 4  5


This is easy to evaluate recursively (or e.g. output as function calls), treating each operation as a simple function that doesn't have to worry about nuisances like operator precedence or the like. We can generalise this to most formal languages, which, interestingly for us, include programming, scripting and configuration languages (Blitz, C, JSON...).
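Once such a tree has been built (this library returns it as a tree of PAR_Node objects, defined further down in the source), a recursive evaluator really is only a few lines. A rough sketch, assuming a grammar whose operator nodes end up named "+" and "*" and hold exactly two leaves each (those names are placeholders for this illustration, not something the library fixes):

;Sketch: recursively evaluate a math tree of PAR_Node objects
;Leaf nodes carry a token in \term; operator nodes carry children in \leaves
Function EvalTree#(n.PAR_Node)
	If n\term <> Null Then Return Float(n\term\value)	;Leaf: a number token
	Local l.PAR_Node = Object.PAR_Node ListFirst(n\leaves)
	Local r.PAR_Node = Object.PAR_Node ListLast(n\leaves)
	Select n\rName
		Case "+" : Return EvalTree(l) + EvalTree(r)
		Case "*" : Return EvalTree(l) * EvalTree(r)
	End Select
End Function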

Using this library is a tad more complicated than its accompanying lexer, mainly because parsing is a much more complicated task. You will need a solid grasp of the concept of a "formal grammar" in order to use this library, and indeed to design your language in the first place (it's close to impossible to design a decent programming language without defining a grammar first, so don't try). A grammar is to "sentences" or program forms, more or less, as regular expressions are to individual words.


Create a new parser with PAR_CreateParser. This sets a global state variable so that you don't have to keep passing the parser object to the rule functions, which would get boring really quickly (needless to say this means the library is absolutely not threadable, as though that matters for Blitz Classic; several other global state vars are also used internally by the parser engine).

Rules are defined using the five rule operators, and are given names with PAR_NameRule so that they can actually be referred to. The root rule that the parser will use to match the entire input stream is set with PAR_Root. The operators (cat, alt, opt, rep, err, and the plus wrapper) create anonymous rule objects and return automatic names, so they must be named in order to be useful.
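Put together, the whole definition sequence looks something like this sketch (the rule strings and names are hypothetical, not part of the library):

;Sketch: define, name and commit a tiny grammar
Local p.PAR_Parser = PAR_CreateParser()		;Also sets the current-parser global

PAR_NameRule "sum", PAR_Cat("atom $+ atom")	;Name the anonymous rule returned by Cat
PAR_NameRule "atom", "#number"				;Terminal words can also be aliased directly

PAR_Root "sum"	;The rule matched against the entire input stream
PAR_Commit		;Link names and compile; no further changes after this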

Operators:
-- PAR_Cat - concatenation: things following each other (this is usually implicit in BNF and similar syntaxes)
-- PAR_Alt - alternative: one of the listed options. Corresponds to the bar operator "|" in common syntax
-- PAR_Opt - optional: is allowed to not be present. Corresponds to "?" in common syntax
-- PAR_Rep - repetition: is repeated ZERO or more times. Corresponds to the "*" operator
-- PAR_Err - raise an error and stop parsing, returning Null
-- PAR_Plus - a wrapper around Cat/Rep: is repeated one or more times. Corresponds to "+" in common syntax

The components of a rule are separated by spaces. A terminal is represented by a string containing either its token-type or actual value, prefixed with # for type and $ for value. To raise an error on failing to match a terminal, add ! (e.g. #!number ). If multipart rules are passed to Opt or Rep, they are treated as though they were packaged by Cat (i.e. the elements are assumed to be concatenated).
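For example (the token types and values here are assumptions about your particular lexer setup, not anything fixed by the library):

;Terminal notation examples
PAR_NameRule "num",   PAR_Cat("#number")			;Any token of type "number"
PAR_NameRule "comma", PAR_Cat("$,")				;The literal token value ","
PAR_NameRule "index", PAR_Cat("$[ #!number $!]")	;Error out if "[" isn't followed correctly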

The Cat operator (and Opt or Rep, as applied to the implicit Cat) can also take an optional second "filter" argument: a filter can rename ("@name") or drop ("~") elements from the resulting match. Elements that aren't renamed, but kept with a simple "@", are numbered according to their position from the start of the match. A filter can also fold an element into its predecessor ("<"); if the predecessor is a list created with Rep, the element will be stuck on the end and numbered accordingly (this is useful when you need a slightly different rule for whatever marks the end of a list, so that it can't be part of the Rep). The reverse operator ">" folds the predecessor into the current element instead.

So:
-- PAR_Cat("atom comma atom") produces a match with the values named "0:", "1:", "2:"
-- PAR_Cat("atom comma atom", "@left ~ @right") produces a match with the atoms named "left:" and "right:", and no comma remaining in the output - it is filtered out
-- Any values not accounted for beyond the end of the filter are numbered as default

By default, the anonymous operator rules nested inside named rules are "folded": if the result consists of only one element, it is returned directly to the containing rule in order to keep the result tree flatter (or at least manageably small; otherwise even simple expressions would produce very deeply nested results). Named rules can also be set to fold by passing "^" as the last element of the filter, after the rename/drop elements. If the result contains more than one node, the match will never fold, even if folding was specifically requested with "^".
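For instance, a sketch of a fold in action (rule and token names hypothetical): a parenthesised group should parse to the same node as its bare contents, not to a one-child wrapper node.

;Drop both brackets, keep the middle element, fold the lone survivor upward
PAR_NameRule "group", PAR_Cat("$( expr $!)", "~ @ ~ ^")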

If the exclamation mark "!" appears on its own as an element of a rule, in the event that the rule fails to return a match after that point, instead of resetting and trying something else, the parser will error out. This is handy for something like e.g. a "Type" or "Function" declaration, where you know what the object is from the first word: you don't want it to try a different pattern if the match fails, because you already know there must be an error around here.
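A sketch of this (the keyword and rule names are hypothetical):

;Past the "function" keyword we are committed: a failure from here on is a
;syntax error, not a cue for the parser to back up and try another alternative
PAR_NameRule "func-decl", PAR_Cat("$function ! #!name arg-list block")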


Once all the rules have been listed, the parser must be finalised with PAR_Commit. This links all of the renames to the relevant rules, compiles token type checks, etc. etc. Once PAR_Commit has been called, the parser is ready for use or storage (you can for instance safely create other parsers if necessary after this point).

The parser accepts input in the form of a list of LEX_Token objects. It does not take ownership of this list, but the resulting parse tree will make reference to the tokens held in the list, so the list shouldn't be freed until you're finished with the parse tree making use of it.
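In code, feeding the parser looks something like this sketch (how the token list is produced is down to your lexer setup and not shown here):

;Assumes p is a committed parser and tokens is an LList of LEX_Token
Local tree.PAR_Node = PAR_Parse(p, tokens)
If tree <> Null
	;...walk or evaluate the tree here...
	PAR_FreeNode tree
EndIf
;Only now is it safe to free or reuse the token list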

Note that while designed as an API, this is really intended as just a more declarative way of expressing static parsers in Blitz code. As a result, PAR_Commit and the rule definition functions use RunTimeError to signal problems (the parser engine itself uses a safe, dynamic error handling system to report errors in the input stream without crashing). In other words, verify that your grammar and so on is correct before releasing your program! This is not really intended for dynamically loading new language grammars on the fly (that would be badass, but... at that point even I have to admit defeat and say "just go and use Lisp").

As with the lexer, the parser library expects you to supply a file "Parser-Interface.bb" containing the definition of PAR_Error, to suit your program's error handling system; and PAR_GetRuleDesc, to suit your grammar. As with the lexer, the library makes no assumptions about these functions other than they exist: leave them empty for all we care, or log errors properly. The first function is called when the Err operator matches input, with the state PAR_ERR_MATCH (in p\errorState); and to output messages generated by other internal error states as well. The second is called when a rule force-fails because of the ! marker, and is intended to expand concise rule names (e.g. "func-decl") into longer, readable ones ("function body declaration").


Warning: There are two major restrictions upon the rules that can be expressed by this API. Firstly, the parser engine that interprets the rules uses something akin to recursive descent; this means no left-recursive rules or you'll get an infinite loop that never matches anything. Secondly, the parser engine uses actual recursion (because looping is a PITA for such things), so make use of the Rep operator rather than using ludicrously deep recursion to describe list-like structures, or there will be a very real risk of stack overflow.
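As a concrete illustration of the first restriction, the classic left-recursive way of writing a sum has to be rewritten in iterated form (rule names hypothetical). Note that the operator functions return name strings, so rules compose by plain string concatenation:

;WRONG: left-recursive - matching "sum" starts by matching "sum", forever:
;  sum ::= sum "+" term | term
;RIGHT: match one term, then zero or more "+ term" tails using Rep:
PAR_NameRule "sum", PAR_Cat("term " + PAR_Rep("$+ term"))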


Here's a small example, using the parser library to parse the same input we saw before into a syntax tree (example prints to the Debug Log, so run in Debug to see anything):



...using this Parser-Interface.bb:

; Example parser error handler
;==============================


;Called in the event of user-defined error thrown by pattern match
;Examine p\strp to get file/line/column information on the point of failure
Function PAR_Error(p.PAR_Parser, msg$)
	DebugLog "Parser Error: " + msg
End Function

;Called in the event of an error within a match that is not allowed to fail (the ! marker)
;Replace the short-form rule identifier with a user-readable name
Function PAR_GetRuleDesc$(rule$)
	;Have a list here of rule names to match against and get a better error description
	Return rule
End Function



This also provides an example grammar, in non-API syntax, and several examples of filters and folds in use. Hopefully this will help if you have difficulty understanding these concepts.

The printed output is a little odd (e.g. the right-hand argument to math operators is nested; argument 2 and beyond to a function are packed into one "el" for "element list" slot): this is a product of the structure of the grammar and its nested rules. A normalisation pass, converting this tree into a slightly more intuitive layout, would be handy - but not necessary - before actually evaluating the expression, to make it more regular. At any rate, actually evaluating a little math tree like this is dead easy from here on, so I won't spoil the fun by showing the rest of the calculator (since it's only an example and that has nothing to do with parsing).


This library depends on the accompanying lexer (obviously) and linked list include files, along with any dependencies of their own (e.g. regex).

This is a new, complicated library for a reasonably complicated subject. It is not like parser frameworks for other languages (mainly because they usually use either preprocessors, or macros/operator overloading); and it is probably full of bugs still; so expect to see some stealth-updates occurring.

Either way, please feel free to ask questions, or to ask for more helpful examples, if you have any interest in using this tool, and please do report any bugs you encounter!
; Generic parser framework
;==========================


;Dependencies
Include "Lexer.bb"	;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2985
Include "LList.bb"	;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2873

;User-specialised interface: supplies PAR_Error
Include "Parser-Interface.bb"


Type PAR_Parser
	Field isCompiled
	Field rules.LList	;List of owned rules
	Field alias.LList	;List of rule aliases
	Field root.PAR_Rule, rn$
	Field tokens.LList, strp.ListNode
	Field errorState
End Type

Type PAR_RuleAlias
	Field alias$, rName$, r.PAR_Rule	;Does not own r
End Type

Type PAR_Rule
	Field rs$, fs$, genName$, alias$
	Field action, doFold, isTerminal
	Field elems.LList, filters.LList, errOnFailAfter
End Type

Type PAR_Filter
	Field name$, action
End Type

Type PAR_Node
	Field rName$, eName$	;Name of rule (as matched), name of element (as filtered)
	Field term.LEX_Token, leaves.LList
End Type


Const PAR_ACTION_CAT = 1, PAR_ACTION_ALT = 2, PAR_ACTION_OPT = 3, PAR_ACTION_REP = 4, PAR_ACTION_ERR = 5
Const PAR_ACTION_CT = 6, PAR_ACTION_CV = 7, PAR_ACTION_ET = 8, PAR_ACTION_EV = 9, PAR_ACTION_EOF = 10, PAR_ACTION_PLUS = 11

Const PAR_FILTER_NAME = 1, PAR_FILTER_DROP = 2, PAR_FILTER_FOLDPREV = 3, PAR_FILTER_FOLDWITH = 4
Const PAR_ERR_NONE = 0, PAR_ERR_EX_T = -1, PAR_ERR_EX_V = -2, PAR_ERR_INCOMP = -3, PAR_ERR_UNKNOWN = -4
Const PAR_ERR_MATCH = -5, PAR_ERR_EOF = -6, PAR_ERR_EXRULE = -7


;Private globals maintaining internal state
Global PAR_private_CurrentParser_.PAR_Parser, PAR_private_Nil_.LEX_Token
Global PAR_private_NameCounter_, PAR_private_EOF_.LEX_Token, PAR_private_Exc_$
Dim PAR_private_SplitRes_$(0) : Global PAR_private_SplitCt_


;Create a new parser object. Rules are added after creation
Function PAR_CreateParser.PAR_Parser()
	Local p.PAR_Parser = New PAR_Parser
	p\rules = CreateList()
	p\alias = CreateList()
	PAR_private_CurrentParser_ = p
	Return p
End Function

;Free a parser object
Function PAR_FreeParser(p.PAR_Parser)
	Local i.Iterator
	i = GetIterator(p\rules) : While EachIn(i)
		PAR_FreeRule Object.PAR_Rule i\Value
	Wend
	FreeList p\rules
	i = GetIterator(p\alias) : While EachIn(i)
		Delete Object.PAR_RuleAlias i\Value
	Wend
	FreeList p\alias
	Delete p
End Function

;Apply all rule additions to a parser, aliasing and compiling rules as necessary
Function PAR_Commit()
	Local i.Iterator, a.PAR_RuleAlias, r.PAR_Rule, p.PAR_Parser = PAR_private_CurrentParser_
	
	i = GetIterator(p\alias) : While EachIn(i)		;Link aliases
		a = Object.PAR_RuleAlias i\Value
		a\r = PAR_GetRuleByWord(a\rName)
		If a\r\doFold <> 1 Then a\r\doFold = 0
		a\r\alias = a\alias
	Wend
	
	i = GetIterator(p\rules) : While EachIn(i)		;Find all subelements and build elem lists
		r = Object.PAR_Rule i\Value
		If r\doFold Then r\doFold = 1	;Normalise this supposed boolean
		If Not r\isTerminal
			PAR_Split r\rs
			Local e : For e = 0 To PAR_private_SplitCt_ - 1
				If PAR_private_SplitRes_(e) = "!"
					r\errOnFailAfter = e
				Else
					ListAddLast r\elems, Handle PAR_GetRuleByWord(PAR_private_SplitRes_(e))
				EndIf
			Next
		EndIf
	Wend
	
	p\root = PAR_GetRuleByWord(p\rn)	;Set root
	
	p\isCompiled = True
	PAR_private_CurrentParser_ = Null	;No further changes permitted after the parser has been committed
End Function

;Parse a token stream generated by the generic lexer into an abstract syntax tree
Function PAR_Parse.PAR_Node(p.PAR_Parser, stream.LList)
	If Not p\isCompiled Then RuntimeError "Generic parser error: unable to run uncompiled parser, please use PAR_Commit"
	
	If PAR_private_Nil_ = Null Then PAR_private_Nil_ = LEX_NewToken("<NIL>", "", "")
	PAR_private_EOF_ = New LEX_Token : ListAddLast stream, Handle PAR_private_EOF_
	p\tokens = stream : p\strp = ListNodeAtIndex(stream, 0)		;Attach token stream to parser
	p\errorState = PAR_ERR_NONE : PAR_private_Exc_ = ""		;Zero error state
	
	Local n.PAR_Node = PAR_DispatchRule(p, p\root)	;Apply root rule to the stream
	
	If p\errorState = PAR_ERR_NONE
		If p\strp\Value <> Handle PAR_private_EOF_
			p\errorState = PAR_ERR_INCOMP
		ElseIf n = Null
			p\errorState = PAR_ERR_UNKNOWN
		ElseIf n\term <> PAR_private_EOF_ And n\term <> Null
			p\errorState = PAR_ERR_INCOMP
		EndIf
	EndIf
	
	If p\errorState
		If n <> Null Then PAR_FreeNode n
		Local tok.LEX_Token = Object.LEX_Token p\strp\Value
		Local err$ = "at line: " + tok\l + ", col: " + tok\c + ", file:'" + tok\file + "' - "
		Select p\errorState
			Case PAR_ERR_EX_T
				err = err + "expecting token of type '" + PAR_private_Exc_ + "' but found '" + tok\tType + "'"
			Case PAR_ERR_EX_V
				err = err + "expecting token '" + PAR_private_Exc_ + "' but found '" + tok\value + "'"
			Case PAR_ERR_EOF
				err = err + "expecting end-of-file"
			Case PAR_ERR_INCOMP
				err = err + "unable to match remainder of document, parsing aborted"
			Case PAR_ERR_EXRULE
				err = err + "syntax error while matching " + PAR_GetRuleDesc(PAR_private_Exc_)
			Case PAR_ERR_UNKNOWN
				err = err + "encountered an unknown error attempting to match document; parsing aborted"
		End Select
		PAR_Error p, err
	EndIf
	
	Delete PAR_private_EOF_ : ListRemoveLast stream
	p\tokens = Null : p\strp = Null			;Detach stream
	
	Return n
End Function

;(Internal) Dispatch to the correct application for a given action, redirecting errors
Function PAR_DispatchRule.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local res.PAR_Node
	If Not p\errorState
		Select r\action
			Case PAR_ACTION_CAT : res = PAR_ApplyCat(p, r)
			Case PAR_ACTION_ALT : res = PAR_ApplyAlt(p, r)
			Case PAR_ACTION_OPT : res = PAR_ApplyOpt(p, r)
			Case PAR_ACTION_REP : res = PAR_ApplyRep(p, r)
			Case PAR_ACTION_ERR : res = PAR_ApplyErr(p, r)
			Case PAR_ACTION_EOF : res = PAR_ApplyEof(p, r)
			Case PAR_ACTION_PLUS : res = PAR_ApplyPlus(p, r)
			Default		;Token rules
				res = PAR_ApplyTokenRule(p, r)
		End Select
	EndIf
	If p\errorState And (res <> Null) Then PAR_FreeNode res : Else Return res
End Function

;(Internal) Cat operator implementation
Function PAR_ApplyCat.PAR_Node(p.PAR_Parser, r.PAR_Rule, isRep = False)
	Local res.PAR_Node = PAR_CreateNode(r\alias, Null), pos.ListNode = p\strp
	Local i.Iterator = GetIterator(r\elems) : While EachIn(i)
		Local elem.PAR_Rule = Object.PAR_Rule i\Value, sub.PAR_Node = PAR_DispatchRule(p, elem)
		If (sub = Null) Or p\errorState
			PAR_FreeNode res : If Not p\errorState Then p\strp = pos
			If i\cni_ >= r\errOnFailAfter
				PAR_private_Exc_ = r\alias
				p\errorState = PAR_ERR_EXRULE
			EndIf
			IteratorBreak(i) : Return Null
		EndIf
		ListAddLast res\leaves, Handle sub
	Wend
	If r\filters <> Null Then PAR_ApplyFilter r, res
	If r\doFold Or isRep Then res = PAR_ApplyFold(res); : Else res = PAR_ApplyFold(res, True)
	Return res
End Function

;(Internal) Alt operator implementation
Function PAR_ApplyAlt.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local i.Iterator = GetIterator(r\elems) : While EachIn(i)
		Local elem.PAR_Rule = Object.PAR_Rule i\Value, sub.PAR_Node = PAR_DispatchRule(p, elem)
		If (sub <> Null)
			Local res.PAR_Node
			If r\doFold		;Alt can fold itself easily, so it simply inherits any fold state
				res = sub
			Else
				res = PAR_CreateNode(r\alias, Null) : ListAddLast res\leaves, Handle sub
			EndIf
			IteratorBreak(i) : Return res
		ElseIf p\errorState
			IteratorBreak(i) : Return Null
		EndIf
	Wend
End Function

;(Internal) Opt operator implementation
Function PAR_ApplyOpt.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local pos.ListNode = p\strp, res.PAR_Node = PAR_ApplyCat(p, r)	;Don't return Null on failure
	If res = Null And p\errorState = PAR_ERR_NONE
		res = PAR_CreateNode(r\alias, PAR_private_Nil_) : p\strp = pos
	EndIf
	Return res
End Function

;(Internal) Rep operator implementation
Function PAR_ApplyRep.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local res.PAR_Node, pos.ListNode = p\strp, sub.PAR_Node
	Repeat
		sub = PAR_ApplyCat(p, r, True)
		If sub = Null
			If p\errorState
				If res <> Null Then PAR_FreeNode res
				Return Null
			EndIf
			If res = Null Then res = PAR_CreateNode(r\alias, PAR_private_Nil_)
			p\strp = pos
		Else
			If res = Null Then res = PAR_CreateNode(r\alias, Null)
			ListAddLast res\leaves, Handle sub
			pos = p\strp
		EndIf
	Until sub = Null
	;If r\doFold And (res\term <> PAR_private_Nil_) Then res = PAR_ApplyFold(res)
	If res\leaves <> Null		;More than one, not folded
		Local i.Iterator = GetIterator(res\leaves) : While EachIn(i)
			Local el.PAR_Node = Object.PAR_Node i\Value : el\eName = i\cni_
		Wend
	EndIf
	Return res
End Function

;(Internal) Plus operator implementation
Function PAR_ApplyPlus.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local pos.ListNode = p\strp, sub.PAR_Node = PAR_ApplyCat(p, r, True)
	If sub = Null Then p\strp = pos : Return Null
	Local res.PAR_Node = PAR_CreateNode(r\alias, Null)
	ListAddLast res\leaves, Handle sub
	pos = p\strp
	Repeat
		sub = PAR_ApplyCat(p, r, True)
		If sub = Null
			If p\errorState Then PAR_FreeNode res : Return Null
			p\strp = pos
		Else
			ListAddLast res\leaves, Handle sub
			pos = p\strp
		EndIf
	Until sub = Null
	If res\leaves <> Null		;More than one, not folded
		Local i.Iterator = GetIterator(res\leaves) : While EachIn(i)
			Local el.PAR_Node = Object.PAR_Node i\Value : el\eName = i\cni_
		Wend
	EndIf
	If r\doFold Then res = PAR_ApplyFold(res)
	Return res
End Function

;(Internal) Err operator implementation
Function PAR_ApplyErr.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local sub.PAR_Node = PAR_ApplyCat(p, r)
	If sub <> Null Then PAR_Error p, r\fs : p\errorState = PAR_ERR_MATCH : PAR_FreeNode sub
End Function

;(Internal) Eof operator implementation
Function PAR_ApplyEof.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local tok.LEX_Token = Object.LEX_Token p\strp\Value
	If tok <> PAR_private_EOF_ Then p\errorState = PAR_ERR_EOF : Return Null
	Return PAR_CreateNode("EOF", tok)
End Function

;(Internal) Token-operator implementation(s)
Function PAR_ApplyTokenRule.PAR_Node(p.PAR_Parser, r.PAR_Rule)
	Local tok.LEX_Token = Object.LEX_Token p\strp\Value, res.PAR_Node
	
	If r\action = PAR_ACTION_CT Or r\action = PAR_ACTION_ET
		If tok\tType = r\rs Then res = PAR_CreateNode(r\alias, tok)
	Else	;EV or CV
		If tok\value = r\rs Then res = PAR_CreateNode(r\alias, tok)
	EndIf
	
	If res <> Null
		p\strp = p\strp\nx_
	ElseIf r\action = PAR_ACTION_ET Or r\action = PAR_ACTION_EV
		If r\action = PAR_ACTION_ET Then p\errorState = PAR_ERR_EX_T : Else p\errorState = PAR_ERR_EX_V
		PAR_private_Exc_ = r\rs
	EndIf
	
	Return res
End Function

;(Internal) Run a filter string over a node's result list to strip/rename entries
Function PAR_ApplyFilter(r.PAR_Rule, n.PAR_Node)
	Local f.PAR_Filter, e.PAR_Node, fi = 0
	Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
		f = Object.PAR_Filter ValueAtIndex(r\filters, fi)
		e = Object.PAR_Node i\Value
		If f = Null
			e\eName = Str fi
		Else	;Actually do something
			Select f\action
				Case PAR_FILTER_DROP
					PAR_FreeNode e : IteratorRemove i
				Case PAR_FILTER_NAME
					e\eName = f\name
				Case PAR_FILTER_FOLDPREV
					Local pv.PAR_Node = Object.PAR_Node i\cn_\pv_\Value
					If pv = Null Then RuntimeError "Cannot fold to previous node from start of result list"
					If pv\leaves = Null
						If pv\term = PAR_private_Nil_
							pv\term = Null : pv\leaves = CreateList()
						Else
							RuntimeError "Previous node does not have a leaf list"
						EndIf
					EndIf
					Local tail.PAR_Node = Object.PAR_Node ListLast(pv\leaves)
					If tail <> Null
						If tail\term = PAR_private_Nil_ Then PAR_FreeNode tail : ListRemoveLast pv\leaves
					EndIf
					e\eName = ListLength(pv\leaves)		;Give it an index, not a name
					ListAddLast pv\leaves, Handle e
					IteratorRemove i
				Case PAR_FILTER_FOLDWITH
					pv = Object.PAR_Node i\cn_\pv_\Value
					If pv = Null Then RuntimeError "Cannot fold to previous node from start of result list"
					If e\leaves = Null
						If e\term = PAR_private_Nil_
							e\term = Null : e\leaves = CreateList()
						Else
							RuntimeError "This node does not have a leaf list"
						EndIf
					EndIf
					Local head.PAR_Node = Object.PAR_Node ListFirst(e\leaves)
					If head <> Null
						If head\term = PAR_private_Nil_ Then PAR_FreeNode head : ListRemoveFirst e\leaves
					EndIf
					ListAddFirst e\leaves, Handle pv
					RemoveListNode i\cn_\pv_
					Local j.Iterator = GetIterator(e\leaves) : While EachIn(j)
						pv = Object.PAR_Node j\Value : pv\eName = j\cni_
					Wend
				Default
					RuntimeError "Unexpected parse filter action: " + f\action
			End Select
		EndIf
		fi = fi + 1
	Wend
End Function

;(Internal) 'Fold' a result node: if it consists of only one element, return that element instead
Function PAR_ApplyFold.PAR_Node(n.PAR_Node)
	Local ret.PAR_Node, i.Iterator, elem.PAR_Node
	i = GetIterator(n\leaves) : While EachIn(i)
		elem = Object.PAR_Node i\Value
		If elem\term <> PAR_private_Nil_
			If ret <> Null Then Return n : Else ret = elem	;Two non-Nil elements: don't fold
		EndIf
	Wend
	If ret = Null
		ret = PAR_CreateNode(n\rName, PAR_private_Nil_)	;All Nil, or none: create a Nil return value
	Else
	;	If preserve
	;		If Left(ret\rName, 2) = "<@"	;Only fold up internal nodes if they're anonymous
	;			ret\rName = n\rName : ret\eName = n\eName
	;		Else
	;			Return n
	;		EndIf
	;	EndIf
		ListRemove n\leaves, Handle ret
		PAR_FreeNode n
	EndIf
	Return ret
End Function

;Set the root rule for the whole grammar of the current parser
Function PAR_Root(name$)
	PAR_private_CurrentParser_\rn = name
End Function

;Grammar operator: concatenate arguments
Function PAR_Cat$(rule$, filter$ = "")
	Return PAR_CreateRule_(rule, filter, PAR_ACTION_CAT, True)
End Function

;Grammar operator: choose first matching argument
Function PAR_Alt$(rule$, filter$ = "")
	Return PAR_CreateRule_(rule, filter, PAR_ACTION_ALT, True)
End Function

;Grammar operator: try to match against arguments, but allow failure
Function PAR_Opt$(rule$, filter$ = "")
	Return PAR_CreateRule_(rule, filter, PAR_ACTION_OPT, True)
End Function

;Grammar operator: match arguments, with repetition
Function PAR_Rep$(rule$, filter$ = "")
	Return PAR_CreateRule_(rule, filter, PAR_ACTION_REP, True)
End Function

;Extended grammar operator: match arguments, with _at least one_ repetition
Function PAR_Plus$(rule$, filter$ = "")
	Return PAR_CreateRule_(rule, filter, PAR_ACTION_PLUS, True)
End Function

;Grammar operator: raise a specific error on match, abort
Function PAR_Err$(rule$, filter$ = "")
	Return PAR_CreateRule_(rule, filter, PAR_ACTION_ERR, False)
End Function

;Grammar operator: match the end of the input stream, error on anything else
Function PAR_Eof$()
	Return PAR_CreateRule_("EOF", "", PAR_ACTION_EOF, False)
End Function

;(Internal) Grammar operator: #
Function PAR_CT$(rule$)
	Return PAR_CreateRule_(Mid(rule, 2), "", PAR_ACTION_CT, False)
End Function

;(Internal) Grammar operator: $
Function PAR_CV$(rule$)
	Return PAR_CreateRule_(Mid(rule, 2), "", PAR_ACTION_CV, False)
End Function

;(Internal) Grammar operator: #!
Function PAR_ET$(rule$)
	Return PAR_CreateRule_(Mid(rule, 3), "", PAR_ACTION_ET, False)
End Function

;(Internal) Grammar operator: $!
Function PAR_EV$(rule$)
	Return PAR_CreateRule_(Mid(rule, 3), "", PAR_ACTION_EV, False)
End Function

;(Internal) Create a token rule out of a simple token word
Function PAR_TokenRule$(rule$)
	If Left(rule, 1) = "#"	;Type
		If Mid(rule, 2, 1) = "!" Return PAR_ET(rule) : Else Return PAR_CT(rule)
	Else		;Value
		If Mid(rule, 2, 1) = "!" Return PAR_EV(rule) : Else Return PAR_CV(rule)
	EndIf
End Function

;(Internal) Construct a new rule object
Function PAR_CreateRule_$(rule$, filter$, action, doFilter)
	Local r.PAR_Rule = New PAR_Rule
	
	r\rs = rule : r\fs = filter : r\action = action
	r\genName = PAR_GenRuleName()
	r\alias = r\genName		;May be updated later
	r\elems = CreateList()	;Populated during PAR_Commit; content not owned by the rule
	r\doFold = -1	;Will set to true or false based on filter and whether it gets named
	r\isTerminal = (action = PAR_ACTION_CT Or action = PAR_ACTION_CV Or action = PAR_ACTION_ET Or action = PAR_ACTION_EV Or action = PAR_ACTION_EOF)
	r\errOnFailAfter = 1000000000	;Unrealistically high number by default, i.e. "off"
	
	If doFilter
		PAR_Split filter
		If PAR_private_SplitCt_
			If PAR_private_SplitRes_(PAR_private_SplitCt_ - 1) = "^"	;"Fold up" filter option
				PAR_private_SplitCt_ = PAR_private_SplitCt_ - 1
				r\doFold = 1
			ElseIf PAR_private_SplitRes_(PAR_private_SplitCt_ - 1) = "."	;"No fold" filter option
				PAR_private_SplitCt_ = PAR_private_SplitCt_ - 1
				r\doFold = 0
			EndIf
		EndIf
		r\filters = CreateList()
		Local f : For f = 0 To PAR_private_SplitCt_ - 1
			ListAddLast r\filters, Handle PAR_FilterAction(PAR_private_SplitRes_(f), f)
		Next
	EndIf
	
	ListAddLast PAR_private_CurrentParser_\rules, Handle r	;Add to parser's ownership list
	Return " " + r\genName + " "
End Function

;(Internal) Generate a unique internal name for rules
Function PAR_GenRuleName$()
	PAR_private_NameCounter_ = PAR_private_NameCounter_ + 1
	Return "<@" + PAR_private_NameCounter_ + ">"
End Function

;Retrieve a rule by aliased name
Function PAR_GetRuleByName.PAR_Rule(p.PAR_Parser, name$, errFail = False)
	Local i.Iterator = GetIterator(p\alias) : While EachIn(i)
		Local a.PAR_RuleAlias = Object.PAR_RuleAlias i\Value
		If a\alias = name Then IteratorBreak(i) : Return a\r
	Wend
	If errFail Then RuntimeError "Generic-parser error: unable to find rule with alias '" + name + "'"
End Function

;(Internal) Retrieve a rule by genName
Function PAR_GetRuleByGenName.PAR_Rule(p.PAR_Parser, genName$, errFail = False)
	Local i.Iterator = GetIterator(p\rules) : While EachIn(i)
		Local r.PAR_Rule = Object.PAR_Rule i\Value
		If r\genName = genName Then IteratorBreak(i) : Return r
	Wend
	If errFail Then RuntimeError "Generic-parser error: unable to find rule with genName '" + genName + "'"
End Function

;(Internal) Retrieve a rule by name, genName, or if it's a token rule, create it anew
Function PAR_GetRuleByWord.PAR_Rule(word$)
	If Instr(word, "@") Then Return PAR_GetRuleByGenName(PAR_private_CurrentParser_, word, True)
	If Left(word, 1) = "#" Or Left(word, 1) = "$"
		PAR_TokenRule(word)
		Return Object.PAR_Rule ListLast(PAR_private_CurrentParser_\rules)
	EndIf
	Return PAR_GetRuleByName(PAR_private_CurrentParser_, word, True)
End Function

;Free a rule object
Function PAR_FreeRule(r.PAR_Rule)
	If r\filters <> Null
		Local i.Iterator = GetIterator(r\filters) : While EachIn(i)		;Free filters
			Delete Object.PAR_Filter i\Value
		Wend
		FreeList r\filters
	EndIf
	FreeList r\elems		;Free element list, but not element objects (owned by parser)
	Delete r
End Function

;Apply a fixed name to a rule object (this is the only way to refer to them in user code)
Function PAR_NameRule(name$, rName$)
	If Instr(name, "#") Or Instr(name, "$") Or Instr(name, "!") Or Instr(name, "@")
		RuntimeError "Generic-parser error: rule names may not contain the sigils #, !, $ or @"
	ElseIf Instr(name, " ") Or Instr(name, Chr(9))
		RuntimeError "Generic-parser error: rule names may not contain whitespace"
	EndIf
	
	name = Lower(name)
	
	Local i.Iterator = GetIterator(PAR_private_CurrentParser_\alias)	;Check for duplicate name
	While EachIn(i)
		Local r.PAR_RuleAlias = Object.PAR_RuleAlias i\Value
		If r\alias = name Then RuntimeError "Generic-parser error: duplicate rule named '" + name + "'"
	Wend
	
	PAR_CreateRuleAlias name, rName
End Function

;(Internal) Actually construct the above alias
Function PAR_CreateRuleAlias.PAR_RuleAlias(alias$, rName$)
	Local nr.PAR_RuleAlias = New PAR_RuleAlias
	nr\alias = alias
	nr\rName = Trim(rName)	;Remember the trim!
	If Instr(rName, "@")	;It's a genName: sort the list on this basis so that "real" rules are available first
		ListAddFirst PAR_private_CurrentParser_\alias, Handle nr
	Else
		ListAddLast PAR_private_CurrentParser_\alias, Handle nr
	EndIf
	Return nr
End Function

;(Internal) Create a new result node, named after the given match
Function PAR_CreateNode.PAR_Node(rName$, tok.LEX_Token)
	Local n.PAR_Node = New PAR_Node
	n\rName = rName
	If tok = Null Then n\leaves = CreateList() : Else n\term = tok
	Return n
End Function

;Free a result node and all of its subnodes
Function PAR_FreeNode(n.PAR_Node)
	If n\leaves <> Null
		Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
			PAR_FreeNode Object.PAR_Node i\Value
		Wend
		FreeList n\leaves
	EndIf
	Delete n
End Function

;Create a recursive copy of a node
Function PAR_CopyNode.PAR_Node(n.PAR_Node)
	Local c.PAR_Node = New PAR_Node
	c\rName = n\rName
	c\eName = n\eName
	c\term = n\term		;Nodes don't own terminals, so copying the reference is correct
	If n\leaves <> Null
		c\leaves = CreateList()
		Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
			ListAddLast c\leaves, Handle PAR_CopyNode(Object.PAR_Node i\Value)
		Wend
	EndIf
	Return c
End Function

;Return the child of a node with the given name
Function PAR_NodeChild.PAR_Node(n.PAR_Node, name$)
	If n\leaves <> Null
		Local i.Iterator = GetIterator(n\leaves) : While EachIn(i)
			Local ch.PAR_Node = Object.PAR_Node i\Value
			If ch\eName = name Then Return ch
		Wend
	EndIf
	Return Null
End Function

;(Internal) Construct filter objects based on filter keys
Function PAR_FilterAction.PAR_Filter(name$, num)
	Local f.PAR_Filter = New PAR_Filter
	Select name
		Case "~"
			f\action = PAR_FILTER_DROP
		Case "<"
			f\action = PAR_FILTER_FOLDPREV
		Case ">"
			f\action = PAR_FILTER_FOLDWITH
		Default
			If Left(name, 1) = "@"
				f\action = PAR_FILTER_NAME
				If name = "@" Then f\name = Str num : Else f\name = Mid(name, 2)
			Else
				RuntimeError "Generic-parser error: illegal filter action '" + name + "'"
			EndIf
	End Select
	Return f
End Function

;(Internal) Split a string by spaces - this is NOT a generic string splitter though
Function PAR_Split(s$)
	If Len(s) = 0 Then PAR_private_SplitCt_ = 0 : Return
	
	Local t$ = Replace(Trim(s), Chr(9), " ")
	Repeat
		s = Replace(t, "  ", " ")
		If Len(s) = Len(t) Then Exit : Else t = s
	Forever
	
	Local c, sCount : For c = 1 To Len(s)
		If Mid(s, c, 1) = " " Then sCount = sCount + 1
	Next
	Dim PAR_private_SplitRes_$(sCount) : PAR_private_SplitCt_ = sCount + 1
	
	If sCount = 0
		PAR_private_SplitRes_(0) = s
	Else
		For c = 0 To sCount
			If c < sCount
				Local p = Instr(s, " ")
				PAR_private_SplitRes_(c) = Left(s, p - 1)
				s = Mid(s, p + 1)
			Else
				PAR_private_SplitRes_(c) = s
			EndIf
		Next
	EndIf
End Function


;~IDEal Editor Parameters:
;~F#D#16#1A#20#24#39#42#50#70#A0#B3#C7#D9#E2#FD#117#11D#124#138#172
;~F#18B#190#195#19A#19F#1A4#1A9#1AE#1B3#1B8#1BD#1C2#1C7#1D0#1F1#1F7#200#209#213#21F
;~F#232#23E#246#251#260#26B#280
;~C#Blitz3D

Comments

Yasha2013
While this is waaaaaay more efficient and simple than writing a parser by hand, it's still a lot of typing, and I've found it incredibly tedious to get all of the punctuation (quote marks, commas etc.) right when entering the Blitz code. It's also a nuisance to type things like PAR_NameRule repeatedly, and it's a real pain to maintain two separate copies of the grammar - one in readable form and one in this API representation - and carry corrections from one over into the other...

To that end, I have knocked together a tiny parser-generator-generator program. Enter the grammar once, and the program will emit prepackaged parser API code for the above library, the grammar in BNF style, or both (BNF comments above the API calls).

Download it here: https://sites.google.com/site/nangdongseng/downloads/ParserGenerator.zip?attredirects=0&d=1

The program is written in pure R5RS Scheme, and requires an interpreter (note: won't work with a compiler, it dynamically "eval"s the grammar file). I recommend Gambit or Chicken.

You need to write your grammar in S-expression style like this:

; S-expression grammar for a simple calculator language, readable by parser generator:
(parser democalc
	(:= expression comp-expr)
	(:= comp-expr  (:& sum-expr   (:* comp-op sum-expr)))
	(:= sum-expr   (:& shift-expr (:* sum-op shift-expr) => "foo"))
	(:= shift-expr (:& mul-expr   (:* shift-op mul-expr)))
	(:= mul-expr   (:& pow-expr   (:* mul-op pow-expr)))
	(:= pow-expr   (:& unary-expr (:* pow-op unary-expr)))
	(:= unary-expr (:& (:* unary-op) atom))
	
	(:= atom (:/ %number func-call (:& %lparen expression %!rparen)))
	(:= func-call (:& %function %lparen expression (:* %comma expression) %!rparen))
	
	(:= comp-op  (:/ %eql %neq %leq %geq))
	(:= sum-op   (:/ %add %sub))
	(:= shift-op (:/ %shl %shr))
	(:= mul-op   (:/ %mul %div %mod))
	(:= pow-op   %pow)
	(:= unary-op (:/ %add %sub))
)
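
Reading the operators above: ':&' apparently sequences its arguments, ':/' chooses between alternatives, ':*' repeats zero or more times, and '%' marks a terminal token. Hand-translated to BNF (illustrative only - the tool's doc mode is the authoritative rendering), the first few rules read roughly like:

```
expression ::= comp-expr
comp-expr  ::= sum-expr (comp-op sum-expr)*
unary-expr ::= unary-op* atom
atom       ::= number | func-call | "(" expression ")"
func-call  ::= function "(" expression ("," expression)* ")"
```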


...and it will produce a packaged output parser like this:



Or, put the line "(set! output-mode '(doc))" or "(set! output-mode '(code))" right before the grammar definition, to get just the comments - i.e. a readable BNF-style grammar - or just the code, respectively.

Personally I find the S-expression grammar very easy to read and write, and therefore think this tool helps matters no end. Certainly there's a lot less typing involved.

Run the program with the line "echo "(file1.scm file2.scm)" | SCHEME parse-gen.scm" (note that the parenthesised list must be quoted, or the shell will reject it), where "file1.scm" etc are the files you want to process, and "SCHEME" is the command to invoke your Scheme interpreter. Output is placed in a file with the name given in the grammar definition block.

parse-gen.scm has more information in its comments. There is no attempt at error-checking in the generator, so... um, don't make mistakes.


virtlands2013
Quite interesting, ...and super complex, with lots of special constants and unusual data types.

It's true that I know nothing about SCHEME.

I'm not sure, but if I were to create a parser, I would stick to using simple stacks, FILO-style (first-in, last-out).
http://en.wikipedia.org/wiki/LIFO

I'll be following your topic, and see how this progresses.


Yasha2013
You don't need to pay any attention to the Scheme program, it's just a support utility because I'm too lazy to even write out as much as the main API requires.


As for creating parsers: this framework is powered by a simple recursive descent algorithm. It's not efficient, but it should be able to handle any right-recursive CFG (e.g. it should have no problem parsing Blitz, Lua or even JavaScript code, given a grammar). I'm not too good at parsing algorithms so I could be totally wrong, but I think there's no stack-based solution of comparable power that isn't also far more complicated (if all you want to do is parse simple math expressions, you can do the shunting-yard algorithm in one page or so of Blitz code; but that can't cover arbitrary programming languages: this can). I guess you could substantially simplify a grammar by using a secondary precedence parser for all simple expressions though.
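
To illustrate the recursive-descent idea in the abstract - this is a minimal Python sketch, not this library's API - precedence falls out naturally from each grammar level calling the next-tighter one:

```python
import re

# Minimal recursive-descent parser for + and * with precedence,
# producing nested tuples as the "syntax tree".
def tokenize(src):
    return re.findall(r"\d+|[+*]", src)

def parse(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def take():
        tok = tokens[pos[0]]
        pos[0] += 1
        return tok

    def atom():
        return int(take())          # a bare number

    def product():                  # * binds tighter than +
        node = atom()
        while peek() == "*":
            take()
            node = ("*", node, atom())
        return node

    def sum_expr():
        node = product()
        while peek() == "+":
            take()
            node = ("+", node, product())
        return node

    return sum_expr()

def evaluate(node):                 # walk the tree recursively
    if isinstance(node, int):
        return node
    op, lhs, rhs = node
    l, r = evaluate(lhs), evaluate(rhs)
    return l + r if op == "+" else l * r

tree = parse(tokenize("1 + 2 * 3 + 4 * 5"))
print(tree)            # ('+', ('+', 1, ('*', 2, 3)), ('*', 4, 5))
print(evaluate(tree))  # 27
```

(This sketch builds its additions left-associatively, so the tree leans the opposite way to the diagram at the top of the page; for + the result is the same either way.)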

Do share if you have interesting ideas!


virtlands2013
You said "covering arbitrary programming languages"...

That's going to be a lot of work. Good luck.
---------------------------------------------------------
As for interesting ideas, well, perhaps the ability
to include CONSTANTS, trigonometry, set theory, complex numbers, string functions,
memories, arrays, ... ( as in a[5]= sin(4*pi*sqrt[-1]) )
----------------------------------------------------------
For some inspiration on your parser project, maybe copy the Mathematica style of doing things.
http://functions.wolfram.com/
http://reference.wolfram.com/mathematica/guide/VariablesAndFunctions.html
http://reference.wolfram.com/mathematica/guide/FunctionalProgramming.html



Yasha2013
....

...I think we may be talking at cross-purposes.


This code is finished (barring bugs, of which I am sure there are many...). The archive entry at the top is the completed product.

I think you may be misunderstanding what it is though: it's a framework for creating a parser (it's about as close as you can get in B3D to a parser combinator). It's not a programming language by itself.

So that means two things:

1) You have to supply the language grammar yourself. The parse engine within it can handle "complicated" structures, but it can't recognise anything without a grammar. It has no built-in knowledge of any language.

2) Parsing is simply the process of breaking down a linear input stream into an organised, structured tree. A parser doesn't understand concepts like constants, functions or even data types. It just "draws" them in memory. You've got to hand the resulting parse tree over to a checker and code generator before you have a working compiler - this is just the first step!

Anything that would involve either a. "running" the input, or b. understanding the "meaning" of the input, is beyond the scope of both this framework, and parsing in general. This is just the process of turning your input stream - i.e. messy, arbitrary user files - into a neat structure akin to XML before the "real work" can begin... whatever that is (note: a parser isn't even directly tied to the concept of programming - it could just as well be used as the first stage of a very inefficient 3D model loader).
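
To make that split concrete, here's a throwaway Python sketch (nothing to do with this library's types): the tree itself carries no meaning, and the "meaning" comes entirely from whatever walker you point at it afterwards - here, one that emits the tree as nested function calls:

```python
# A parse tree is just structure; a later stage decides what it means.
# This walker emits the tree as nested function calls.
tree = ("+", 1, ("+", ("*", 2, 3), ("*", 4, 5)))  # 1 + (2*3 + 4*5)

def emit(node):
    if isinstance(node, int):
        return str(node)
    op, lhs, rhs = node
    name = "add" if op == "+" else "mul"
    return name + "(" + emit(lhs) + ", " + emit(rhs) + ")"

print(emit(tree))  # add(1, add(mul(2, 3), mul(4, 5)))
```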


Code Archives Forum