Code archives/Miscellaneous/Parser framework
This code has been declared by its author to be Public Domain code.
Download source code
| |||||
This is a monstrosity of a library that allows you to describe a parser using a declarative grammar syntax, and use it to parse a token stream generated by my Lexical scanner framework. EDIT: See first comment for a utility to make defining parsers with this library even easier! "Parsing" is the process of taking a linear input stream and turning it into something meaningful, usually an abstract syntax tree. The tree is a nice way to reorganise an arbitrarily-complex and nested input into nodes each carrying a defined, fixed amount of easily-processed meaning. e.g. the following input: 1 + 2 * 3 + 4 * 5 ...could, after being tokenised, be "parsed" into a syntax tree structured like this: + / \ 1 + / \ * * / \ / \ 2 3 4 5 This is easy to evaluate recursively (or e.g. output as function-calls), treating each operation as a simple function that doesn't have to worry about nuisances like operator precedence or the like. We can generalise this to most formal languages, which interestingly for us, includes programming, scripting and configuration languages (Blitz, C, JSON...). Using this library is a tad more complicated than its accompanying lexer, mainly because parsing is a much more complicated task. You will need to have a solid grasp of the concept of a "formal grammar" in order to use this library, and indeed to design your language in the first place (it's close to impossible to design a decent programming language without defining a grammar first, don't try). Grammar is to "sentences" or program forms as regular expressions are to individual words, more or less. Create a new parser with PAR_CreateParser. This sets a global state variable so that you don't have to keep passing the parser object to the rule functions, which would get boring really quickly (needless to say this means the library is absolutely not threadable, as though that matters for Blitz Classic; several other global state vars are also used internally by the parser engine). 
Rules are defined using the five rule operators, and are given names so they can actually be used using PAR_NameRule. The root rule that the parser will use to match the entire input stream is set with PAR_Root. The operators (cat, alt, opt, rep, err) create anonymous rule objects and return automatic names, so they must be named in order to be useful. Operators: -- PAR_Cat - concatenation (things following each other - this is usually implicit in BNF and similar syntaxes) -- PAR_Alt - alternative: one of the listed options. Bar operator "|" in common syntax -- PAR_Opt - optional: is allowed to not be present. Corresponds to "?" in common syntax -- PAR_Rep - repetition: is repeated ZERO or more times. Corresponds to the "*" operator -- PAR_Err - raise an error and stop parsing, returning null -- PAR_Plus - A wrapper around Cat/Rep. Corresponds to "+" in common syntax The components of a rule are separated by spaces. A terminal is represented by a string containing either its token-type or actual value, prefixed with # for type and $ for value. To raise an error on failing to match a terminal, add ! (e.g. #!number ). If multipart rules are passed to Opt or Rep, they are treated as though they were packaged by Cat (i.e. the elements are assumed to be concatenated). The Cat operator (and Opt or Rep, as applied to the implicit Cat) can also take an optional second "filter" argument: a filter can rename ("@name") or drop ("~") elements from the resulting match. Elements that aren't renamed, but kept with a simple "@", are numbered according to their position from the start of the match. It can also fold elements into their predecessors ("<"); if the predecessor is a list created with Rep, the element will be stuck on the end and numbered accordingly (this is useful when you need a slightly different rule for whatever marks the end of a list so it can't be part of the Rep). 
So: -- PAR_Cat("atom comma atom") produces a match with the values named "0:", "1:", "2:" -- PAR_CAT("atom comma atom", "@left ~ @right") produces a match with the atoms named "left:" and "right:", and no comma remaining in the output - it is filtered out -- Any values not accounted for beyond the end of the filter are numbered as default By default concatenating operators that form nested elements of named rules are "folded": if the result consists of only one element, it is returned directly to the containing rule in order to keep the result tree flatter (or at least, manageably small - otherwise even simple expressions would produce very deeply nested results). Named rules can also be set to "fold" by passing "^" as the last element of the filter, after the rename/drop elements. If the result contains more than one node, the match will never fold, even if specifically requested to do so with "^". If the exclamation mark "!" appears on its own as an element of a rule, in the event that the rule fails to return a match after that point, instead of resetting and trying something else, the parser will error out. This is handy for something like e.g. a "Type" or "Function" declaration, where you know what the object is from the first word: you don't want it to try a different pattern if the match fails, because you already know there must be an error around here. Once all the rules have been listed, the parser must be finalised with PAR_Commit. This links all of the renames to the relevant rules, compiles token type checks, etc. etc. Once PAR_Commit has been called, the parser is ready for use or storage (you can for instance safely create other parsers if necessary after this point). The parser accepts input in the form of a list of LEX_Token objects. It does not take ownership of this list, but the resulting parse tree will make reference to the tokens held in the list, so the list shouldn't be freed until you're finished with the parse tree making use of it. 
Note that while designed as an API, this is really intended as just a more declarative way of expressing static parsers in Blitz code. As a result, PAR_Commit and the rule definition functions use RunTimeError to signal problems (the parser engine itself uses a safe, dynamic error handling system to report errors in the input stream without crashing). In other words, verify that your grammar and so on is correct before releasing your program! This is not really intended for dynamically loading new language grammars on the fly (that would be badass, but... at that point even I have to admit defeat and say "just go and use Lisp"). As with the lexer, the parser library expects you to supply a file "Parser-Interface.bb" containing the definition of PAR_Error, to suit your program's error handling system; and PAR_GetRuleDesc, to suit your grammar. As with the lexer, the library makes no assumptions about these functions other than they exist: leave them empty for all we care, or log errors properly. The first function is called when the Err operator matches input, with the state PAR_ERR_MATCH (in p\errorState); and to output messages generated by other internal error states as well. The second is called when a rule force-fails because of the ! marker, and is intended to expand concise rule names (e.g. "func-decl") into longer, readable ones ("function body declaration"). Warning: There are two major restrictions upon the rules that can be expressed by this API. Firstly, the parser engine that interprets the rules uses something akin to recursive descent; this means no left-recursive rules or you'll get an infinite loop that never matches anything. Secondly, the parser engine uses actual recursion (because looping is a PITA for such things), so make use of the Rep operator rather than using ludicrously deep recursion to describe list-like structures, or there will be a very real risk of stack overflow. 
Here's a small example, using the parser library to parse the same input we saw before into a syntax tree (example prints to the Debug Log, so run in Debug to see anything): ...using this Parser-Interface.bb: ; Example parser error handler ;============================== ;Called in the event of user-defined error thrown by pattern match ;Examine p\strp to get file/line.column information on the point of failure Function PAR_Error(p.PAR_Parser, msg$) DebugLog "Parser Error: " + msg End Function ;Called in the event of an error within a specific match that cannot fail ;Replace the short-form rule identifier with a user-readable name Function PAR_GetRuleDesc$(rule$) ;Have a list here of rule names to match against and get a better error description Return rule End Function This also provides an example grammar, in non-API syntax, and several examples of filters and folds in use. Hopefully this will help if you have difficulty understanding these concepts. The printed output is a little odd (e.g. the right-hand argument to math operators is nested; argument 2 and beyond to a function are packed into one "el" for "element list" slot): this is a product of the structure of the grammar and its nested rules. A normalisation pass, converting this tree into a slightly more intuitive layout, would be handy - but not necessary - before actually evaluating the expression, to make it more regular. At any rate, actually evaluating a little math tree like this is dead easy from here on, so I won't spoil the fun by showing the rest of the calculator (since it's only an example and that has nothing to do with parsing). This library depends on the accompanying lexer (obviously) and linked list include files, along with any dependencies of their own (e.g. regex). This is a new, complicated library for a reasonably complicated subject. 
It is not like parser frameworks for other languages (mainly because they usually use either preprocessors, or macros/operator overloading); and it is probably full of bugs still; so expect to see some stealth-updates occurring. Either way, please, feel free to ask questions and for any more helpful examples if you have any interest in using this tool, and please do report any bugs you encounter! | |||||
; Generic parser framework ;========================== ;Dependencies Include "Lexer.bb" ;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2985 Include "LList.bb" ;Get this file at http://www.blitzbasic.com/codearcs/codearcs.php?code=2873 ;User-specialised interface: supplies PAR_Error Include "Parser-Interface.bb" Type PAR_Parser Field isCompiled Field rules.LList ;List of owned rules Field alias.LList ;List of rule aliases Field root.PAR_Rule, rn$ Field tokens.LList, strp.ListNode Field errorState End Type Type PAR_RuleAlias Field alias$, rName$, r.PAR_Rule ;Does not own r End Type Type PAR_Rule Field rs$, fs$, genName$, alias$ Field action, doFold, isTerminal Field elems.LList, filters.LList, errOnFailAfter End Type Type PAR_Filter Field name$, action End Type Type PAR_Node Field rName$, eName$ ;Name of rule (as matched), name of element (as filtered) Field term.LEX_Token, leaves.LList End Type Const PAR_ACTION_CAT = 1, PAR_ACTION_ALT = 2, PAR_ACTION_OPT = 3, PAR_ACTION_REP = 4, PAR_ACTION_ERR = 5 Const PAR_ACTION_CT = 6, PAR_ACTION_CV = 7, PAR_ACTION_ET = 8, PAR_ACTION_EV = 9, PAR_ACTION_EOF = 10, PAR_ACTION_PLUS = 11 Const PAR_FILTER_NAME = 1, PAR_FILTER_DROP = 2, PAR_FILTER_FOLDPREV = 3, PAR_FILTER_FOLDWITH = 4 Const PAR_ERR_NONE = 0, PAR_ERR_EX_T = -1, PAR_ERR_EX_V = -2, PAR_ERR_INCOMP = -3, PAR_ERR_UNKNOWN = -4 Const PAR_ERR_MATCH = -5, PAR_ERR_EOF = -6, PAR_ERR_EXRULE = -7 ;Private globals maintaining internal state Global PAR_private_CurrentParser_.PAR_Parser, PAR_private_Nil_.LEX_Token Global PAR_private_NameCounter_, PAR_private_EOF_.LEX_Token, PAR_private_Exc_$ Dim PAR_private_SplitRes_$(0) : Global PAR_private_SplitCt_ ;Create a new parser object. 
Rules are added after creation Function PAR_CreateParser.PAR_Parser() Local p.PAR_Parser = New PAR_Parser p\rules = CreateList() p\alias = CreateList() PAR_private_CurrentParser_ = p Return p End Function ;Free a parser object Function PAR_FreeParser(p.PAR_Parser) Local i.Iterator i = GetIterator(p\rules) : While EachIn(i) PAR_FreeRule Object.PAR_Rule i\Value Wend FreeList p\rules i = GetIterator(p\alias) : While EachIn(i) Delete Object.PAR_RuleAlias i\Value Wend FreeList p\alias Delete p End Function ;Apply all rule additions to a parser, aliasing and compiling rules as necessary Function PAR_Commit() Local i.Iterator, a.PAR_RuleAlias, r.PAR_Rule, p.PAR_Parser = PAR_private_CurrentParser_ i = GetIterator(p\alias) : While EachIn(i) ;Link aliases a = Object.PAR_RuleAlias i\Value a\r = PAR_GetRuleByWord(a\rName) If a\r\doFold <> 1 Then a\r\doFold = 0 a\r\alias = a\alias Wend i = GetIterator(p\rules) : While EachIn(i) ;Find all subelements and build elem lists r = Object.PAR_Rule i\Value If r\doFold Then r\doFold = 1 ;Normalise this supposed boolean If Not r\isTerminal PAR_Split r\rs Local e : For e = 0 To PAR_private_SplitCt_ - 1 If PAR_private_SplitRes_(e) = "!" 
r\errOnFailAfter = e Else ListAddLast r\elems, Handle PAR_GetRuleByWord(PAR_private_SplitRes_(e)) EndIf Next EndIf Wend p\root = PAR_GetRuleByWord(p\rn) ;Set root p\isCompiled = True PAR_private_CurrentParser_ = Null ;No further changes permitted after the parser has been committed End Function ;Parse a token stream generated by the generic lexer into an abstract syntax tree Function PAR_Parse.PAR_Node(p.PAR_Parser, stream.LList) If Not p\isCompiled Then RuntimeError "Generic parser error: unable to run uncompiled parser, please use PAR_Commit" If PAR_private_Nil_ = Null Then PAR_private_Nil_ = LEX_NewToken("<NIL>", "", "") PAR_private_EOF_ = New LEX_Token : ListAddLast stream, Handle PAR_private_EOF_ p\tokens = stream : p\strp = ListNodeAtIndex(stream, 0) ;Attach token stream to parser p\errorState = PAR_ERR_NONE : PAR_private_Exc_ = "" ;Zero error state Local n.PAR_Node = PAR_DispatchRule(p, p\root) ;Apply root rule to the stream If p\errorState = PAR_ERR_NONE If p\strp\Value <> Handle PAR_private_EOF_ p\errorState = PAR_ERR_INCOMP ElseIf n = Null p\errorState = PAR_ERR_UNKNOWN ElseIf n\term <> PAR_private_EOF_ And n\term <> Null p\errorState = PAR_ERR_INCOMP EndIf EndIf If p\errorState If n <> Null Then PAR_FreeNode n Local tok.LEX_Token = Object.LEX_Token p\strp\Value Local err$ = "at line: " + tok\l + ", col: " + tok\c + ", file:'" + tok\file + "' - " Select p\errorState Case PAR_ERR_EX_T err = err + "expecting token of type '" + PAR_private_Exc_ + "' but found '" + tok\tType + "'" Case PAR_ERR_EX_V err = err + "expecting token '" + PAR_private_Exc_ + "' but found '" + tok\value + "'" Case PAR_ERR_EOF err = err + "expecting end-of-file" Case PAR_ERR_INCOMP err = err + "unable to match remainder of document, parsing aborted" Case PAR_ERR_EXRULE err = err + "syntax error while matching " + PAR_GetRuleDesc(PAR_private_Exc_) Case PAR_ERR_UNKNOWN err = err + "encountered an unknown error attempting to match document; parsing aborted" End Select PAR_Error p, err 
EndIf Delete PAR_private_EOF_ : ListRemoveLast stream p\tokens = Null : p\strp = Null ;Detach stream Return n End Function ;(Internal) Dispatch to the correct application for a given action, redirecting errors Function PAR_DispatchRule.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local res.PAR_Node If Not p\errorState Select r\action Case PAR_ACTION_CAT : res = PAR_ApplyCat(p, r) Case PAR_ACTION_ALT : res = PAR_ApplyAlt(p, r) Case PAR_ACTION_OPT : res = PAR_ApplyOpt(p, r) Case PAR_ACTION_REP : res = PAR_ApplyRep(p, r) Case PAR_ACTION_ERR : res = PAR_ApplyErr(p, r) Case PAR_ACTION_EOF : res = PAR_ApplyEof(p, r) Case PAR_ACTION_PLUS : res = PAR_ApplyPlus(p, r) Default ;Token rules res = PAR_ApplyTokenRule(p, r) End Select EndIf If p\errorState And (res <> Null) Then PAR_FreeNode res : Else Return res End Function ;(Internal) Cat operator implementation Function PAR_ApplyCat.PAR_Node(p.PAR_Parser, r.PAR_Rule, isRep = False) Local res.PAR_Node = PAR_CreateNode(r\alias, Null), pos.ListNode = p\strp Local i.Iterator = GetIterator(r\elems) : While EachIn(i) Local elem.PAR_Rule = Object.PAR_Rule i\Value, sub.PAR_Node = PAR_DispatchRule(p, elem) If (sub = Null) Or p\errorState PAR_FreeNode res : If Not p\errorState Then p\strp = pos If i\cni_ >= r\errOnFailAfter PAR_private_Exc_ = r\alias p\errorState = PAR_ERR_EXRULE EndIf IteratorBreak(i) : Return Null EndIf ListAddLast res\leaves, Handle sub Wend If r\filters <> Null Then PAR_ApplyFilter r, res If r\doFold Or isRep Then res = PAR_ApplyFold(res); : Else res = PAR_ApplyFold(res, True) Return res End Function ;(Internal) Alt operator implementation Function PAR_ApplyAlt.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local i.Iterator = GetIterator(r\elems) : While EachIn(i) Local elem.PAR_Rule = Object.PAR_Rule i\Value, sub.PAR_Node = PAR_DispatchRule(p, elem) If (sub <> Null) Local res.PAR_Node If r\doFold ;Alt can fold itself easily, so it simply inherits any fold state res = sub Else res = PAR_CreateNode(r\alias, Null) : ListAddLast 
res\leaves, Handle sub EndIf IteratorBreak(i) : Return res ElseIf p\errorState IteratorBreak(i) : Return Null EndIf Wend End Function ;(Internal) Opt operator implementation Function PAR_ApplyOpt.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local pos.ListNode = p\strp, res.PAR_Node = PAR_ApplyCat(p, r) ;Don't return Null on failure If res = Null And p\errorState = PAR_ERR_NONE res = PAR_CreateNode(r\alias, PAR_private_Nil_) : p\strp = pos EndIf Return res End Function ;(Internal) Rep operator implementation Function PAR_ApplyRep.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local res.PAR_Node, pos.ListNode = p\strp, sub.PAR_Node Repeat sub = PAR_ApplyCat(p, r, True) If sub = Null If p\errorState If res <> Null Then PAR_FreeNode res Return Null EndIf If res = Null Then res = PAR_CreateNode(r\alias, PAR_private_Nil_) p\strp = pos Else If res = Null Then res = PAR_CreateNode(r\alias, Null) ListAddLast res\leaves, Handle sub pos = p\strp EndIf Until sub = Null ;If r\doFold And (res\term <> PAR_private_Nil_) Then res = PAR_ApplyFold(res) If res\leaves <> Null ;More than one, not folded Local i.Iterator = GetIterator(res\leaves) : While EachIn(i) Local el.PAR_Node = Object.PAR_Node i\Value : el\eName = i\cni_ Wend EndIf Return res End Function ;(Internal) Plus operator implementation Function PAR_ApplyPlus.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local pos.ListNode = p\strp, sub.PAR_Node = PAR_ApplyCat(p, r, True) If sub = Null Then p\strp = pos : Return Null Local res.PAR_Node = PAR_CreateNode(r\alias, Null) ListAddLast res\leaves, Handle sub pos = p\strp Repeat sub = PAR_ApplyCat(p, r, True) If sub = Null If p\errorState Then PAR_FreeNode res : Return Null p\strp = pos Else ListAddLast res\leaves, Handle sub pos = p\strp EndIf Until sub = Null If res\leaves <> Null ;More than one, not folded Local i.Iterator = GetIterator(res\leaves) : While EachIn(i) Local el.PAR_Node = Object.PAR_Node i\Value : el\eName = i\cni_ Wend EndIf If r\doFold Then res = PAR_ApplyFold(res) Return res End Function 
;(Internal) Err operator implementation Function PAR_ApplyErr.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local sub.PAR_Node = PAR_ApplyCat(p, r) If sub <> Null Then PAR_Error p, r\fs : p\errorState = PAR_ERR_MATCH : PAR_FreeNode sub End Function ;(Internal) Eof operator implementation Function PAR_ApplyEof.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local tok.LEX_Token = Object.LEX_Token p\strp\Value If tok <> PAR_private_EOF_ Then p\errorState = PAR_ERR_EOF : Return Null Return PAR_CreateNode("EOF", tok) End Function ;(Internal) Token-operator implementation(s) Function PAR_ApplyTokenRule.PAR_Node(p.PAR_Parser, r.PAR_Rule) Local tok.LEX_Token = Object.LEX_Token p\strp\Value, res.PAR_Node If r\action = PAR_ACTION_CT Or r\action = PAR_ACTION_ET If tok\tType = r\rs Then res = PAR_CreateNode(r\alias, tok) Else ;EV or CV If tok\value = r\rs Then res = PAR_CreateNode(r\alias, tok) EndIf If res <> Null p\strp = p\strp\nx_ ElseIf r\action = PAR_ACTION_ET Or r\action = PAR_ACTION_EV If r\action = PAR_ACTION_ET Then p\errorState = PAR_ERR_EX_T : Else p\errorState = PAR_ERR_EX_V PAR_private_Exc_ = r\rs EndIf Return res End Function ;(Internal) Run a filter string over a node's result list to strip/rename entries Function PAR_ApplyFilter(r.PAR_Rule, n.PAR_Node) Local f.PAR_Filter, e.PAR_Node, fi = 0 Local i.Iterator = GetIterator(n\leaves) : While EachIn(i) f = Object.PAR_Filter ValueAtIndex(r\filters, fi) e = Object.PAR_Node i\Value If f = Null e\eName = Str fi Else ;Actually do something Select f\action Case PAR_FILTER_DROP PAR_FreeNode e : IteratorRemove i Case PAR_FILTER_NAME e\eName = f\name Case PAR_FILTER_FOLDPREV Local pv.PAR_Node = Object.PAR_Node i\cn_\pv_\Value If pv = Null Then RuntimeError "Cannot fold to previous node from start of result list" If pv\leaves = Null If pv\term = PAR_private_Nil_ pv\term = Null : pv\leaves = CreateList() Else RuntimeError "Previous node does not have a leaf list" EndIf EndIf Local tail.PAR_Node = Object.PAR_Node ListLast(pv\leaves) If tail <> Null 
If tail\term = PAR_private_Nil_ Then PAR_FreeNode tail : ListRemoveLast pv\leaves EndIf e\eName = ListLength(pv\leaves) ;Give it an index, not a name ListAddLast pv\leaves, Handle e IteratorRemove i Case PAR_FILTER_FOLDWITH pv = Object.PAR_Node i\cn_\pv_\Value If pv = Null Then RuntimeError "Cannot fold to previous node from start of result list" If e\leaves = Null If e\term = PAR_private_Nil_ e\term = Null : e\leaves = CreateList() Else RuntimeError "This node does not have a leaf list" EndIf EndIf Local head.PAR_Node = Object.PAR_Node ListFirst(e\leaves) If head <> Null If head\term = PAR_private_Nil_ Then PAR_FreeNode head : ListRemoveFirst e\leaves EndIf ListAddFirst e\leaves, Handle pv RemoveListNode i\cn_\pv_ Local j.Iterator = GetIterator(e\leaves) : While EachIn(j) pv = Object.PAR_Node j\Value : pv\eName = j\cni_ Wend Default RuntimeError "Unexpected parse filter action: " + f\action End Select EndIf fi = fi + 1 Wend End Function ;(Internal) 'Fold' a result node: if it consists of only one element, return that element instead Function PAR_ApplyFold.PAR_Node(n.PAR_Node) Local ret.PAR_Node, i.Iterator, elem.PAR_Node i = GetIterator(n\leaves) : While EachIn(i) elem = Object.PAR_Node i\Value If elem\term <> PAR_private_Nil_ If ret <> Null Then Return n : Else ret = elem ;Two non-Nil elements: don't fold EndIf Wend If ret = Null ret = PAR_CreateNode(n\rName, PAR_private_Nil_) ;All Nil, or none: create a Nil return value Else ; If preserve ; If Left(ret\rName, 2) = "<@" ;Only fold up internal nodes if they're anonymous ; ret\rName = n\rName : ret\eName = n\eName ; Else ; Return n ; EndIf ; EndIf ListRemove n\leaves, Handle ret PAR_FreeNode n EndIf Return ret End Function ;Set the root rule for the whole grammar of the current parser Function PAR_Root(name$) PAR_private_CurrentParser_\rn = name End Function ;Grammar operator: concatenate arguments Function PAR_Cat$(rule$, filter$ = "") Return PAR_CreateRule_(rule, filter, PAR_ACTION_CAT, True) End Function 
;Grammar operator: choose first matching argument Function PAR_Alt$(rule$, filter$ = "") Return PAR_CreateRule_(rule, filter, PAR_ACTION_ALT, True) End Function ;Grammar operator: try to match against arguments, but allow failure Function PAR_Opt$(rule$, filter$ = "") Return PAR_CreateRule_(rule, filter, PAR_ACTION_OPT, True) End Function ;Grammar operator: match arguments, with repetition Function PAR_Rep$(rule$, filter$ = "") Return PAR_CreateRule_(rule, filter, PAR_ACTION_REP, True) End Function ;Extended grammar operator: match arguments, with _at least one_ repetition Function PAR_Plus$(rule$, filter$ = "") Return PAR_CreateRule_(rule, filter, PAR_ACTION_PLUS, True) End Function ;Grammar operator: raise a specific error on match, abort Function PAR_Err$(rule$, filter$ = "") Return PAR_CreateRule_(rule, filter, PAR_ACTION_ERR, False) End Function ;Grammar operator: match the end of the input stream, error on anything else Function PAR_Eof$() Return PAR_CreateRule_("EOF", "", PAR_ACTION_EOF, False) End Function ;(Internal) Grammar operator: # Function PAR_CT$(rule$) Return PAR_CreateRule_(Mid(rule, 2), "", PAR_ACTION_CT, False) End Function ;(Internal) Grammar operator: $ Function PAR_CV$(rule$) Return PAR_CreateRule_(Mid(rule, 2), "", PAR_ACTION_CV, False) End Function ;(Internal) Grammar operator: #! Function PAR_ET$(rule$) Return PAR_CreateRule_(Mid(rule, 3), "", PAR_ACTION_ET, False) End Function ;(Internal) Grammar operator: $! Function PAR_EV$(rule$) Return PAR_CreateRule_(Mid(rule, 3), "", PAR_ACTION_EV, False) End Function ;(Internal) Create a token rule out of a simple token word Function PAR_TokenRule$(rule$) If Left(rule, 1) = "#" ;Type If Mid(rule, 2, 1) = "!" Return PAR_ET(rule) : Else Return PAR_CT(rule) Else ;Value If Mid(rule, 2, 1) = "!" 
Return PAR_EV(rule) : Else Return PAR_CV(rule) EndIf End Function ;(Internal) Construct a new rule object Function PAR_CreateRule_$(rule$, filter$, action, doFilter) Local r.PAR_Rule = New PAR_Rule r\rs = rule : r\fs = filter : r\action = action r\genName = PAR_GenRuleName() r\alias = r\genName ;May be updated later r\elems = CreateList() ;Populated during PAR_Commit; content not owned by the rule r\doFold = -1 ;Will set to true or false based on filter and whether it gets named r\isTerminal = (action = PAR_ACTION_CT Or action = PAR_ACTION_CV Or action = PAR_ACTION_ET Or action = PAR_ACTION_EV Or action = PAR_ACTION_EOF) r\errOnFailAfter = 1000000000 ;Unrealistically high number by default, i.e. "off" If doFilter PAR_Split filter If PAR_private_SplitCt_ If PAR_private_SplitRes_(PAR_private_SplitCt_ - 1) = "^" ;"Fold up" filter option PAR_private_SplitCt_ = PAR_private_SplitCt_ - 1 r\doFold = 1 ElseIf PAR_private_SplitRes_(PAR_private_SplitCt_ - 1) = "." ;"No fold" filter option PAR_private_SplitCt_ = PAR_private_SplitCt_ - 1 r\doFold = 1 EndIf EndIf r\filters = CreateList() Local f : For f = 0 To PAR_private_SplitCt_ - 1 ListAddLast r\filters, Handle PAR_FilterAction(PAR_private_SplitRes_(f), f) Next EndIf ListAddLast PAR_private_CurrentParser_\rules, Handle r ;Add to parser's ownership list Return " " + r\genName + " " End Function ;(Internal) Generate a unique internal name for rules Function PAR_GenRuleName$() PAR_private_NameCounter_ = PAR_private_NameCounter_ + 1 Return "<@" + PAR_private_NameCounter_ + ">" End Function ;Retrieve a rule by aliased name Function PAR_GetRuleByName.PAR_Rule(p.PAR_Parser, name$, errFail = False) Local i.Iterator = GetIterator(p\alias) : While EachIn(i) Local a.PAR_RuleAlias = Object.PAR_RuleAlias i\Value If a\alias = name Then IteratorBreak(i) : Return a\r Wend If errFail Then RuntimeError "Generic-parser error: unable to find rule with alias '" + name + "'" End Function ;(Internal) Retrieve a rule by genName Function 
PAR_GetRuleByGenName.PAR_Rule(p.PAR_Parser, genName$, errFail = False) Local i.Iterator = GetIterator(p\rules) : While EachIn(i) Local r.PAR_Rule = Object.PAR_Rule i\Value If r\genName = genName Then IteratorBreak(i) : Return r Wend If errFail Then RuntimeError "Generic-parser error: unable to find rule with genName '" + genName + "'" End Function ;(Internal) Retrieve a rule by name, genName, or if it's a token rule, create it anew Function PAR_GetRuleByWord.PAR_Rule(word$) If Instr(word, "@") Then Return PAR_GetRuleByGenName(PAR_private_CurrentParser_, word, True) If Left(word, 1) = "#" Or Left(word, 1) = "$" PAR_TokenRule(word) Return Object.PAR_Rule ListLast(PAR_private_CurrentParser_\rules) EndIf Return PAR_GetRuleByName(PAR_private_CurrentParser_, word, True) End Function ;Free a rule object Function PAR_FreeRule(r.PAR_Rule) If r\filters <> Null Local i.Iterator = GetIterator(r\filters) : While EachIn(i) ;Free filters Delete Object.PAR_Filter i\Value Wend FreeList r\filters EndIf FreeList r\elems ;Free element list, but not element objects (owned by parser) Delete r End Function ;Apply a fixed name to a rule object (this is the only way to refer to them in user code) Function PAR_NameRule(name$, rName$) If Instr(name, "#") Or Instr(name, "$") Or Instr(name, "!") Or Instr(name, "@") RuntimeError "Generic-parser error: rule names may not contain the sigils #, !, $ or @" ElseIf Instr(name, " ") Or Instr(name, Chr(9)) RuntimeError "Generic-parser error: rule names may not contain whitespace" EndIf name = Lower(name) Local i.Iterator = GetIterator(PAR_private_CurrentParser_\alias) ;Check for duplicate name While EachIn(i) Local r.PAR_RuleAlias = Object.PAR_RuleAlias i\Value If r\alias = name Then RuntimeError "Generic-parser error: duplicate rule named '" + name + "'" Wend PAR_CreateRuleAlias name, rName End Function ;(Internal) Actually construct the above alias Function PAR_CreateRuleAlias.PAR_RuleAlias(alias$, rName$) Local nr.PAR_RuleAlias = New PAR_RuleAlias 
nr\alias = alias nr\rName = Trim(rName) ;Remember the trim! If Instr(rName, "@") ;It's a genName: sort the list on this basis so that "real" rules are available first ListAddFirst PAR_private_CurrentParser_\alias, Handle nr Else ListAddLast PAR_private_CurrentParser_\alias, Handle nr EndIf End Function ;(Internal) Create a new result node, named after the given match Function PAR_CreateNode.PAR_Node(rName$, tok.LEX_Token) Local n.PAR_Node = New PAR_Node n\rName = rName If tok = Null Then n\leaves = CreateList() : Else n\term = tok Return n End Function ;Free a result node and all of its subnodes Function PAR_FreeNode(n.PAR_Node) If n\leaves <> Null Local i.Iterator = GetIterator(n\leaves) : While EachIn(i) PAR_FreeNode Object.PAR_Node i\Value Wend FreeList n\leaves EndIf Delete n End Function ;Create a recursive copy of a node Function PAR_CopyNode.PAR_Node(n.PAR_Node) Local c.PAR_Node = New PAR_Node c\rName = n\rName c\eName = n\eName c\term = n\term ;Nodes don't own terminals, so copying the reference is correct If n\leaves <> Null c\leaves = CreateList() Local i.Iterator = GetIterator(n\leaves) : While EachIn(i) ListAddLast c\leaves, Handle PAR_CopyNode(Object.PAR_Node i\Value) Wend EndIf Return c End Function ;Return the child of a node with the given name Function PAR_NodeChild.PAR_Node(n.PAR_Node, name$) If n\leaves <> Null Local i.Iterator = GetIterator(n\leaves) : While EachIn(i) Local ch.PAR_Node = Object.PAR_Node i\Value If ch\eName = name Then Return ch Wend EndIf Return Null End Function ;(Internal) Construct filter objects based on filter keys Function PAR_FilterAction.PAR_Filter(name$, num) Local f.PAR_Filter = New PAR_Filter Select name Case "~" f\action = PAR_FILTER_DROP Case "<" f\action = PAR_FILTER_FOLDPREV Case ">" f\action = PAR_FILTER_FOLDWITH Default If Left(name, 1) = "@" f\action = PAR_FILTER_NAME If name = "@" Then f\name = Str num : Else f\name = Mid(name, 2) Else RuntimeError "Generic-parser error: illegal filter action '" + name + "'" 
EndIf End Select Return f End Function ;(Internal) Split a string by spaces - this is NOT a generic string splitter though Function PAR_Split(s$) If Len(s) = 0 Then PAR_private_SplitCt_ = 0 : Return Local t$ = Replace(Trim(s), Chr(9), " ") Repeat s = Replace(t, " ", " ") If Len(s) = Len(t) Then Exit : Else t = s Forever Local c, sCount : For c = 1 To Len(s) If Mid(s, c, 1) = " " Then sCount = sCount + 1 Next Dim PAR_private_SplitRes_$(sCount) : PAR_private_SplitCt_ = sCount + 1 If sCount = 0 PAR_private_SplitRes_(0) = s Else For c = 0 To sCount If c < sCount Local p = Instr(s, " ") PAR_private_SplitRes_(c) = Left(s, p - 1) s = Mid(s, p + 1) Else PAR_private_SplitRes_(c) = s EndIf Next EndIf End Function ;~IDEal Editor Parameters: ;~F#D#16#1A#20#24#39#42#50#70#A0#B3#C7#D9#E2#FD#117#11D#124#138#172 ;~F#18B#190#195#19A#19F#1A4#1A9#1AE#1B3#1B8#1BD#1C2#1C7#1D0#1F1#1F7#200#209#213#21F ;~F#232#23E#246#251#260#26B#280 ;~C#Blitz3D |
Comments
| ||
While this is waaaaaay more efficient and simpler than writing a parser by hand, it's still a lot of typing, and I've found it incredibly tedious to get all of the punctuation (quote marks, commas etc.) right when entering the Blitz code. It's also a nuisance to type things like PAR_NameRule repeatedly, and it's a real pain to have to maintain two separate copies of the grammar - one in readable form and one in this API representation - and port changes from one into the other...

To that end, I have knocked together a tiny parser-generator-generator program. Enter the grammar once, and the program will emit prepackaged API code for the above library, the grammar in BNF style, or both (BNF comments above the API calls). Download it here:

https://sites.google.com/site/nangdongseng/downloads/ParserGenerator.zip?attredirects=0&d=1

The program is written in pure R5RS Scheme, and requires an interpreter (note: it won't work with a compiler, as it dynamically "eval"s the grammar file). I recommend Gambit or Chicken. You need to write your grammar in S-expression style like this:

; S-expression grammar for a simple calculator language, readable by parser generator:
(parser democalc
  (:= expression comp-expr)
  (:= comp-expr (:& sum-expr (:* comp-op sum-expr)))
  (:= sum-expr (:& shift-expr (:* sum-op shift-expr) => "foo"))
  (:= shift-expr (:& mul-expr (:* shift-op mul-expr)))
  (:= mul-expr (:& pow-expr (:* mul-op pow-expr)))
  (:= pow-expr (:& unary-expr (:* pow-op unary-expr)))
  (:= unary-expr (:& (:* unary-op) atom))
  (:= atom (:/ %number func-call (:& %lparen expression %!rparen)))
  (:= func-call (:& %function %lparen expression (:* %comma expression) %!rparen))
  (:= comp-op (:/ %eql %neq %leq %geq))
  (:= sum-op (:/ %add %sub))
  (:= shift-op (:/ %shl %shr))
  (:= mul-op (:/ %mul %div %mod))
  (:= pow-op %pow)
  (:= unary-op (:/ %add %sub))
)

...and it will produce a packaged output parser like this: Or, put the line "(set! output-mode '(doc))" or "(set! output-mode '(code))" right before the grammar definition to get just the comments - i.e. a readable BNF-style grammar - or just the code, respectively.

Personally I find the S-expression grammar very easy to read and write, and therefore think this tool helps matters no end. Certainly there's a lot less typing involved.

Run the program with the line "echo (file1.scm file2.scm) | SCHEME parse-gen.scm", where "file1.scm" etc. are the files you want to process, and "SCHEME" is the command to invoke your Scheme interpreter. Output is placed in a file with the name given in the grammar definition block. parse-gen.scm has more information in its comments.

There is no attempt at error-checking in the generator, so... um, don't make mistakes.
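Read as BNF, the S-expression grammar above corresponds roughly to the following (a hand-translation for orientation, not the generator's exact doc-mode output; `{ x }` means zero or more repetitions, and uppercase names are token types from the lexer):

```
expression ::= comp-expr
comp-expr  ::= sum-expr { comp-op sum-expr }
sum-expr   ::= shift-expr { sum-op shift-expr }
shift-expr ::= mul-expr { shift-op mul-expr }
mul-expr   ::= pow-expr { mul-op pow-expr }
pow-expr   ::= unary-expr { pow-op unary-expr }
unary-expr ::= { unary-op } atom
atom       ::= NUMBER | func-call | LPAREN expression RPAREN
func-call  ::= FUNCTION LPAREN expression { COMMA expression } RPAREN
comp-op    ::= EQL | NEQ | LEQ | GEQ
sum-op     ::= ADD | SUB
shift-op   ::= SHL | SHR
mul-op     ::= MUL | DIV | MOD
pow-op     ::= POW
unary-op   ::= ADD | SUB
```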
Quite interesting... and super complex, with lots of special constants and unusual data types. It's true that I know nothing about Scheme. I'm not sure, but if I were to create a parser, I would stick to using simple stacks, FILO-style (first in, last out): http://en.wikipedia.org/wiki/LIFO

I'll be following your topic to see how this progresses.
You don't need to pay any attention to the Scheme program; it's just a support utility, because I'm too lazy to even write out as much as the main API requires.

As for creating parsers: this framework is powered by a simple recursive descent algorithm. It's not efficient, but it should be able to handle any right-recursive CFG (e.g. it should have no problem parsing Blitz, Lua or even JavaScript code, given a grammar).

I'm not too good at parsing algorithms so I could be totally wrong, but I think there's no stack-based solution of comparable power that isn't also far more complicated (if all you want to do is parse simple maths expressions, you can write the shunting-yard algorithm in a page or so of Blitz code; but that can't cover arbitrary programming languages, and this can). I guess you could substantially simplify a grammar by using a secondary precedence parser for all simple expressions, though. Do share if you have interesting ideas!
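To make "recursive descent" concrete: here is a minimal sketch in Python (purely illustrative - it is not the framework's rule engine, and all names are mine) that parses the sum/product layer of a grammar like the calculator above. Each grammar rule becomes one function, and precedence falls out of which function calls which:

```python
import re

def tokenize(src):
    # Trivial lexer: integers and single-character operators/parens.
    return re.findall(r"\d+|[-+*/()]", src)

def parse_expr(toks, pos=0):
    # expr ::= term { ("+" | "-") term }
    node, pos = parse_term(toks, pos)
    while pos < len(toks) and toks[pos] in "+-":
        op = toks[pos]
        right, pos = parse_term(toks, pos + 1)
        node = (op, node, right)   # build the tree bottom-up, left-associative
    return node, pos

def parse_term(toks, pos):
    # term ::= atom { ("*" | "/") atom }
    node, pos = parse_atom(toks, pos)
    while pos < len(toks) and toks[pos] in "*/":
        op = toks[pos]
        right, pos = parse_atom(toks, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_atom(toks, pos):
    # atom ::= NUMBER | "(" expr ")"
    if toks[pos] == "(":
        node, pos = parse_expr(toks, pos + 1)
        return node, pos + 1       # skip the closing ")"
    return int(toks[pos]), pos + 1

tree, _ = parse_expr(tokenize("1 + 2 * 3 + 4 * 5"))
```

For the example input from the archive description this yields a left-associative variant of the tree shown there: the multiplications bind tighter than the additions simply because parse_term consumes them before parse_expr ever sees a "+".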
You said "covering arbitrary programming languages", etc. That's going to be a lot of work. Good luck.

As for interesting ideas: well, perhaps the ability to include constants, trigonometry, set theory, complex numbers, string functions, memories, arrays... (as in a[5] = sin(4*pi*sqrt[-1])).

For some inspiration on your parser project, maybe copy the Mathematica style of doing things:
http://functions.wolfram.com/
http://reference.wolfram.com/mathematica/guide/VariablesAndFunctions.html
http://reference.wolfram.com/mathematica/guide/FunctionalProgramming.html
.... ...I think we may be talking at cross purposes. This code is finished (barring bugs, of which I am sure there are many...). The archive entry at the top is the completed product. I think you may be misunderstanding what it is, though: it's a framework for creating a parser (it's about as close as you can get in B3D to a parser combinator library). It is not a programming language by itself. That means two things:

1) You have to supply the language grammar yourself. The parse engine can handle "complicated" structures, but it can't recognise anything without a grammar. It has no built-in knowledge of any language.

2) Parsing is simply the process of breaking down a linear input stream into an organised, structured tree. A parser doesn't understand concepts like constants, functions or even data types; it just "draws" them in memory. You've got to hand the resulting parse tree over to a checker and code generator before you have a working compiler - this is just the first step!

Anything that would involve either a) "running" the input, or b) understanding the "meaning" of the input, is beyond the scope of both this framework and parsing in general. Parsing is just the process of turning your input stream - i.e. messy, arbitrary user files - into a neat structure akin to XML, before the "real work" can begin... whatever that is. (Note: a parser isn't even directly tied to the concept of programming - it could just as well be used as the first stage of a very inefficient 3D model loader.)
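The point in 2) - that the tree is inert structure until some later stage gives it meaning - can be sketched like this (Python, purely illustrative; the tuple tree shape and the name evaluate are my own invention, not part of the framework):

```python
def evaluate(node):
    # A separate "back end" pass that walks a parse tree like
    # ('+', 1, ('*', 2, 3)) and interprets it. The parser only
    # built the structure; this function is where "+" first
    # acquires the meaning "add the two subtrees".
    if isinstance(node, tuple):
        op, left, right = node
        ops = {"+": lambda a, b: a + b,
               "-": lambda a, b: a - b,
               "*": lambda a, b: a * b}
        return ops[op](evaluate(left), evaluate(right))
    return node  # a leaf: already a plain number

# The tree for "1 + 2 * 3 + 4 * 5", as a parser might produce it:
tree = ('+', ('+', 1, ('*', 2, 3)), ('*', 4, 5))
result = evaluate(tree)
```

A different back end could just as easily walk the same tree and emit function calls or bytecode instead of computing a value - the parser neither knows nor cares.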