Let's build a (silly) compiler!

Blitz3D Forums/Blitz3D Tutorials/Let's build a (silly) compiler!

Yasha

(Posted 2012) [#1]

Tutorial: Let's build a (script) compiler!

In this multi-post series (which is not in any way connected to Jack Crenshaw's LBAC series), we'll quickly cover the process of designing a programming language ("Linear Basic"), and implementing it as a native code compiler (no bytecode, no interpreter), in Blitz3D, or BlitzPlus if you prefer. A BlitzMax version may follow if there's demand for it.

While Blitz3D is a language horribly unsuited to this task, lacking almost all of the primitive constructs that would make doing so relatively painless, in a way that's what makes the challenge fun. More importantly, to date some twelve years after Blitz3D's release, there are still precisely zero decent game scripting libraries available for it (maybe: TinyScheme is probably borderline). This series probably won't change that fact, but it might help you change it. At any rate, once we're done, there'll be a couple more mediocre solutions. Hey, it's an excuse to write compilers, and that's all we needed.

The goal of this example project will be to define a pure Blitz3D-code compiler library, that can read source files in a particular programming language, and output DLLs for B3D programs to call with CallDLL (or DLLs to load via a userlib method, or even Windows executables, but either of those will be byproducts that happen to be easy to do as well). CallDLL has precisely one advantage compared to the userlib system, which is that it provides a means for B3D to select the name of the DLL and the function to call dynamically: this means that we can load a script source file, use it to generate a DLL (by the black magic to be explained in this series), write out the DLL as a file, and load functions from it dynamically to control game state (this is probably seventeen different kinds of prohibited by UAC: I don't know or much care about that, and will not be covering how to deal with it in the tutorials), providing a solution that can take advantage of the full speed of compiled native code, while not requiring any extra downloads or resources - B3D source only.

As it turns out, we will end up falling short of this goal in a couple of ways: since scripts sooner or later need to be able to control the main program in some way, without the ability to pass function pointers to the compiled script, it won't be able to do very much beyond number crunching, so something like MikhailV's FastPointer DLL will be necessary to make the scripts do anything useful (the only remotely efficient way to avoid this would be to implement a bytecode interpreter in Blitz3D code, instead of compiling to a real DLL). We'll also be using a third-party assembler and linker to output the final binary, because that step is boring (but might make it into another tutorial sometime).

Programme Note: this is a tutorial, not a community project. I'm happy to answer questions, but I'm not taking suggestions until after we're done. While I will be "working through" the tutorial as chapters get posted, the whole thing has already been drawn up in advance.

The structure of this series will be as follows:

SECTION A: the BASICs

The absolutely minimal steps required to get together a working compiler for a BASIC-like language, from source to machine code.

You can treat Section A as a short but complete tutorial in its own right, if you want to stick to the essentials.

- Chapter 0: Requirements

What do we need from our compiler? This tutorial goes over the preparatory step of establishing what we need the compiler to be able to achieve, and from this, devising a suitable language for our needs.

- Chapter 1: Definitions

Once we have an informal idea about the language, we need to draw up a strict definition of its syntax and meaning, so we can compile it properly.

- Chapter 2: Lexical scanning

Wherein we discover how to turn simple text into a stream of typed and tagged tokens, to make things faster and easier. Finally we get to start writing some code!

- Chapter 3: Parsing and trees

How to identify program structures, turn a flat list of tokens into something meaningful, and complain about gibberish input.

- Chapter 4: Verification and type checking

Not all programs that look right are right. We need to ensure that the code actually makes sense.

- Chapter 5: Output

Once we have a semantically valid program for the machine to run, let's get it out there! We'll use C as the output target for now. At this point, we'll have built a simple but working compiler.

SECTION B: fun with advanced features

More advanced language design and optional compiler features, put together incrementally to create a more powerful but non-BASIC language. These chapters will involve some rather more advanced programming language theoretical concepts.

These chapters will only provide very short introductions to the listed subjects, so don't be put off by the long topic list (or from buying a textbook). None of this stuff is essential for producing a working compiler, it's just good to know.

- Chapter 6: Digamma Digression

A quick introduction to the "Digamma" programming language that we'll be extending. Right now, it's pretty simple: just records and functions.

- Chapter 7: Memory management

Using garbage collection and other memory management techniques to shift more design burdens off the programmer and onto the compiler, reduce errors, and improve performance (really).

- Chapter 8: Exceptional dynamics

Dynamic vs. duck vs. static typing: levels of freedom in programming languages. We'll also take a look at exception handling, to find out what to do when those freedoms are abused.

- Chapter 9: Patterns and automation

Making the data fit the problem and the functions fit the data: introducing algebraic datatypes, object deconstruction; some alternative ideas on memory management; the evils of Null. Also, strings.

- Chapter 10: Type polymorphism

Using generic/polymorphic types to write more powerful and reusable code.

- Chapter 11: Higher-order programming

An introduction to higher-order and first-class functions, closures, and lambdas, moving closer to the math and away from the metal. Why "tail calls" matter, but setting values doesn't. A short word on overloading, which we won't be using.

--- Chapter 11.5: Continuations

Time travel, parallel universes, and programs that run backwards.

- Chapter 12: Tree rewriting

Important basic optimisations: getting rid of operations we don't need, flattening the program tree, and reducing the amount of output code with compile-time evaluation. You can skip ahead to this chapter from Chapter 4 if you like.

- Chapter 13: Type inference revisited

Much more powerful ways to reduce the amount of code you need to write, while also checking program correctness. Why the names Hindley, Damas and Milner are kinda important.

--- Chapter 13.5: Type-level programming

How to abuse polymorphic types as a metaprogramming device (skipping ahead a bit, but w/e).

- Chapter 14: Linear optimisation

Tree rewriting is nice, but it's not the best and not the only way to automatically improve a program. We'll investigate optimisations that can be performed on a linear "intermediate form", using peephole optimisation.

- Chapter 15: Modules and metaprogramming

For a language to be useful on a project scale, it really needs modules in order to break up code. For modules to be useful, they need to be easy to combine and compose.

--- Chapter 15.5: Macros

Everybody hates the poor old C preprocessor, but macros are actually pretty useful.

- Chapter 16: Output revisited

Finally what you were waiting for: compiling code to assembly. Using a template system, we'll compile the same code to Java and BlitzMax as well, because we can. You can skip ahead to this chapter from Chapter 5 if you don't care about Digamma's features.

--- Chapter 16.5: Register allocation

If you're outputting assembly and don't have a C compiler to do this part for you, it's important to get it right or you'll lose a lot of performance.

SECTION C: Linear Basic realized

At last, we'll put it all together: combining the principles we learned from Section B with BASIC to produce Linear Basic in its full glory: zero-overhead memory management, powerful module and type systems, and a beginner-friendly BASIC syntax (which, if nothing else, at least you'll know how to change).

You will need to know the material covered in the "Digamma" posts in order to understand this section properly.

- Chapter 17: The Linear Basic language

A description of the Linear Basic language as a whole.

- Chapter 18: Building Linear Basic

Sounds like a big task for one chapter, but really we'll just be putting together what we've learned in sections A and B, to create the code to the final product.

The series won't go into detail on advanced topics (like proper optimisation or static analysis, because those are whole orders of magnitude more difficult than parsing + typechecking: there are thousands of PhD theses left to be had there), but we will have a working, native-code compiler for our custom languages at the end of it, capable of outputting programs and libraries with reasonable performance! And if you didn't know about some of the topics in Section B... well, hopefully you will be able to use them after this.

I will try as hard as I can to make this series accessible to anyone with a solid grasp of Blitz Basic and a healthy sense of curiosity, but be warned in advance that some of this stuff is actually pretty hard. I'm also leaving out a lot - you can fill libraries on this topic, and my posts (walls of text that they are) will only be the most brief of introductions.

All the best, and I hope you enjoy it! I know I will!

Copyright (c) 2012 Alex Gilding, a.k.a "Yasha"

The text of this series is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. All code is released to the public domain.

Last edited 2012

Yasha

(Posted 2012) [#2]

SECTION A: the BASICs

Welcome to Section A. In the next six posts, we'll describe and then build a minimal compiler for a BASIC-like programming language. It won't be very good, but it will work, and introduces the most important compiler concepts needed to build a complete system.

- Chapter 0: Requirements

Before we begin, what will you need?

-- A copy of Blitz3D (or BlitzPlus. From now on when I say Blitz3D, assume BlitzPlus is included: if there's a difference anywhere, I'll say so).
-- A good level of confidence using Blitz3D. I'll say right now that if you're a beginner to Blitz itself, this tutorial probably isn't for you. We're going to be moving fast.
-- A C compiler. If you're serious about C programming, you should get MinGW, as it's a de facto industry standard. You can also use PCC, which is a very small and simple install but doesn't produce quite such fast code. I'll be using TCC as the backend assembler because it's tiny, superfast and freely redistributable, but it's not a good choice for general-purpose C programming. You can use any C compiler that supports GNU Assembly on your own system.
-- A reading knowledge of C would be handy, but isn't essential (to look at and understand the generated output). C's pretty easy and most of it has Blitz3D equivalents (e.g. "while" loops).

Now, I can pretty much guarantee someone is ready to jump up and start yelling "to C? That's not a compiler! That's just a translator! I want my money back!"

Firstly, if you paid for this tutorial, you were ripped off, 'cause it's free. Secondly, there is no established technical distinction between a compiler, a translator, a preprocessor, and any number of other related terms; their usage is by convention but they all do pretty much the same thing to varying degrees, which is turn input in one language into output in another. According to my personal definitions, if a program does a significant amount of work - i.e. completely takes apart the input into logical components, actually analyses and rewrites its meaning and logic, then outputs completely new generated code that bears no resemblance to the original, then it's a "true" compiler no matter how high-level the output language is. No amount of text-substitution can do what we're going to do, especially in Section B. A "mere translator", in my book, is something that wouldn't complain if you gave it a while-endif block, and just leave it to the backend compiler - we're not writing one of those. So the fact that most compilers output machine code isn't a restriction on all of them (for the record, BlitzMax outputs assembly text files: does that make it a mere translator? Um, no.).

If you don't like my definition, too bad. It's not quite the same as Mark Sibly's, but that doesn't bother either of us. Besides, we'll get to an assembly backend in Section B.

So let's get started!

What are we actually trying to achieve here? Well, we (ostensibly) want to provide scripting capability to Blitz3D, and we definitely want to deliver it with the full power of native machine code performance - no interpreter overhead for us! That makes three points:

-- Blitz3D
-- Scripting/extension
-- native code

We can only really deliver native code in two formats: executable program, and shared library (a DLL, on Windows, which is where you are if you're using B3D). We're not going to mess about with shared memory processes or anything like that, so the only appropriate way for Blitz3D to dynamically access native code is if it's packed in a DLL, that can be called via the builtin CallDLL function. So: our compiler needs to be able to compile a script down to a DLL. Since it's very easy to add, we'll also give it the ability to create executables just in case (an executable is basically just a DLL with an "entry point"). These two factors make for a very simple requirement, though: as long as the language is able to describe a list of functions (not exactly groundbreaking), it can be compiled for use with CallDLL.

The second point gives us a few more ideas. As an extension language, it will be used to write dynamic plugins for an application. This means code may be written by both the original developer, and likely by end users who want to make mods or otherwise extend the program. This gives us two more points: the language needs to be simple, so that it's easy for end users to learn and use, and it needs to be safe, preventing them from writing programs that can corrupt the application's memory, crash the machine, or worse.

There is another point which is raised by the nature of this tutorial: in order to keep our example compiler as simple as possible, the language should require a minimal or nonexistent runtime library (beyond the OS), in order to minimise the amount of rubbish we need to ship. We don't want to have to link it to a huge number of support functions necessary just to make it work.

Finally, we - sorry, I - want to do something new and slightly different. We're exploring here, let's not stick to ground that's too familiar. ("Something new" is mainly going to have to wait until Section B, though.)

In the interest of convenience and familiarity, let's choose to implement a dialect of BASIC. There's no especially good reason why the extension language has to look like the application language, but no rule says it can't either. Some people think BASIC languages are easier to learn and friendlier on the eye than others, so perhaps it will encourage end users more than some curly-brace thing.

The BASIC dialect will need to be updated a little bit from Blitz3D, sadly, since Blitz3D is a bit archaic. Luckily "BASIC" these days is largely defined as "what I say it is", so that's not too restrictive. There are several features we're going to inherit, mainly from BlitzMax/Monkey:

-- behaviour values. Blitz3D doesn't include function pointers, so there's no easy way to package an "action". Since scripts are almost entirely about "packaged actions", this one's a no-brainer - we need function pointers, methods, nested functions, or something like that, at the very least.

-- fast subtyping/polymorphism. In BlitzMax, you can "inherit" types, so that values of different sorts can be acted upon by common functions. The closest thing we have to this in Blitz3D is Object/Handle, which is fairly slow and doesn't offer any real safe code sharing.

-- generic types. These are really useful for making things like container structures efficient, by making code safe to share while not requiring runtime operations (Object/Handle come to mind again).

-- more control flow options. BlitzMax's exceptions make it much easier to jump around in code rather than have to have everything return success codes in a linear fashion. Exceptions by themselves are fairly type-safe and straightforward (unlike e.g. Goto, which we don't want). We also might like BlitzMax's Continue feature.

-- automatic memory management. This one will be controversial with the Blitz Classic community, but it's simply not safe to leave memory management in the hands of third-parties: they'll end up writing memory leaks (and then blame you!). It's also necessary if we're going to have exceptions, or behaviour-capturing values, as they might skip over "free"/"delete" commands. Making this automatic will ensure everything stays safe.

-- array values. If we can return arrays, resize them, pass them up and down and assign them to variables, we can dispense with the slow and unsafe Blitz Banks. All the better for simplifying things.

That might sound like it introduces a bit much to call "simple". Well, it is. While most of the above will make it into our BASIC in some limited form, we won't be able to do very much with them until we've covered some rather more advanced topics in Section B. So we're not going to have Fun with exceptions or modules right now. When we come back to BASIC in Section C, things will get a lot more Fun.

So, what will our BASIC look like? Pretty simple and classic for the moment. We won't bother with anything like OOP (or FP; for those wondering, that's been moved to Section B).

type Something
    field a:int, b:float
    field c:string
end

# This is a function that does a thing
func Action:int(x:int, y:int)
    if y > x
        for let i:int = y - x to 0  # Block-local variable (like BlitzMax)
            if i = 6 then continue
            print i
        next
    else
        let w = x * y + 1
        print "W is " + w
        let x = x + 1
        print "and X incremented is " + x
    end
    return x
end

# This function runs its parameter with 12 and 14
func DoSomething:int(f:Func<int(int, int)>)
    let z = call f(12, 14)
    return z
end

type NumberedThing<T>
    field n:int, thing:T
end

let arr1 = Array<NumberedThing<Something>>(50) # Assign an array via constructor
dim arr2:NumberedThing<Something>[50]    # Same thing but more BASIC-like

let arr1[5] = Something(42, 47, "foo")   # a = 42, b = 47, c = "foo" in order
let arr2[5] = Something(.b = 64)     # Only b has a value, the others are zeroed
print arr1[5].b    # Prints 47
let s = arr1[5]    # s is a *copy* of arr1[5]

So we have types, functions, comments, variables, assignments, blocks, generics and arrays on display here. Whew!

We've inherited generics (I'll assume for now that you know what these are, if not they'll be covered in more detail in Section B), simple "end" closing keywords, and type inference on assignment from Monkey (we don't want the "end" command because scripts shouldn't be allowed to end the host program); "continue", block-local variables, and punctuation from BlitzMax; "dim" to create arrays from Blitz3D; and "let" and "call" from the classic BASICs of the past. Most Blitzers won't like the sight of # as a comment character, or being forced to use "let" for assignments - keep following these tutorials and you'll learn how to change them back!

Unlike in Blitz3D, there will be no "new", and thus no "delete": types define "value structures" that are copied on assignment (if you know C/++, think of a struct rather than a pointer to a struct), so for instance all 12 bytes are copied in the final assignment statement of the example above, and modifying 's.a' will not affect 'arr1[5].a'. The other values of the array are also present, but have all-zeroed fields. You can initialise a whole type value at once using the type's name as a constructor, and as shown, a constructor can have "named arguments". The last line also shows that when "let" is used with a new variable, it can work out the type from the right-hand side. Forcing the use of "let" instead of "local" is handy for safety, because it means there are no uninitialised local variables.

The only "reference types" are arrays. Put an array into a new variable or pass it to a function, and the new variable refers to the same old content (like [ ] arrays or type instances in Blitz3D). There is a catch, though: once the array has been assigned to a new location, it's no longer visible through the old variable - in fact, the old variable is then treated as though it was never declared, and must be "let" all over again with a new value. It has been "consumed". (You can create full copies of arrays with the "copy" keyword, which doesn't consume the original. Getting an element from an array does not consume it.)

This sounds like a really odd constraint, but it's actually incredibly useful: by enforcing "one reference only" (a.k.a "linearity"), we can:

-- avoid having to insert code to count references at runtime, which is slow
-- avoid having to build a real tracing garbage collector, which is effort (and weighs us down with code to include)
-- predict at compile time when every object needs to be deleted, and insert the delete commands automatically

It's simple enough: any array that isn't passed out of a function (assigned to somewhere else, or returned) is to be deleted. Any array that is overwritten by an assignment statement... is to be deleted. An array can be kept alive by passing it up and down the call chain, so you can modify an array "in place" by passing it into a function and returning it again (there's also a much easier way, passing it to an "observer" - more on that later). It's ...not as simple as garbage collection, but it's still pretty simple, and it's guaranteed to be as fast as hand-tuned manual memory management (because there's no tracking happening) and as safe as garbage collection (because the compiler inserts the deletes). Take a look:

let v1 = Array<int>(10)  # v1 is in scope and holds a fresh array of 10 ints
let v2 = v1                    # v2 is now in scope and refers to *the same* object
# v1 is NO LONGER in scope. It cannot be used again, or the compiler will error out

let v3 = copy v2      # v3 contains a *new* array, a "full" copy of v2
let v2 = v3           # v3 has been "used up" in the assignment and is out of scope
# The array that was in v2 has been destroyed. v2 is still in scope with a new value

Also on arrays: the array constructor takes a size, but you can also use "dim" instead. An uninitialised array slot (in a type value or new array) has no elements, and is called "Null". To keep the system from crashing, out-of-bounds access or asking for an element from Null simply return the zero value of whatever type the array is supposed to hold (this is a really bad idea, by the way: it doesn't cause a runtime error, but that also means it hides logic errors from the programmer! We'll learn about a couple of ways to fix it later).

There are also function pointers, which you can call using "call". There's no real reason to force the use of "call" except to reinforce how limited the language is. There are no labels or gotos; everything else is pretty similar to the Blitz family, so we won't list it all.

To celebrate the attempt to implement a linear type system in BASIC (and recognise the fact that much of this is ripped off from Linear ML), the language will be called Linear Basic.

It's not a very good language, but it's simple; an experienced Blitzer should have no problem picking it up quickly. We'll find out how to do more interesting and elegant things with languages later on: this language is enough to do everything we need right now: it can write programs, export DLL functions, use all the usual control flows, and show us a couple of new techniques we can expand on properly later. The language also has some problems for us to fix later on down the line, like the silly business with Null.

That's about enough for the introduction. In this chapter, we determined that we need DLLs to get the full speed of the CPU; that automating memory management is better when you have to trust unknown idiots with your program; that linearity is a way to provide zero-overhead memory management; and a quick, informal outline of what our language will look like. In the next chapter, we'll start defining things a lot more formally!

Can Alex keep up the pace? Will this ever get anywhere? Tune in next time for - The Grammatical Adventures of Yasha!

Copyright (c) 2012 Alex Gilding, a.k.a "Yasha"

The text of this series is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. All code is released to the public domain.

Last edited 2012

Yasha

(Posted 2012) [#3]

Chapter 1: Definitions

In the previous chapter, we went over the requirements for this project, and how they would influence the language and compiler we're going to create. So now we have a good and fuzzy idea of the nature of the language. But we can't just decide "eh, looks like BASIC" and wing it the rest of the way; in order to build a decent compiler, we first need to formalise the language it's going to compile.

A language is made up of two parts: syntax, and semantics.

The syntax of a language is what it looks like: decisions like whether to use Blitz-style keyword blocks (If/EndIf, While/Wend, etc.) vs. C-style curly braces vs. Python-style indentation; whether to use apostrophes or semicolons (or hashes) for comments; whether strings are single-quoted or double-quoted; and so on. Syntax determines what the language looks like, but it doesn't really define its meaning: if you took Blitz3D and replaced keyword blocks with indentation, you wouldn't have changed what those blocks mean, only how they're written on the page. We have a lot of freedom when designing syntax, because even though it needs to flow logically out of both the code's meaning, and its intended use, at the end of the day as long as it's unambiguous (the machine is able to differentiate one syntactic construct from another), anything will work. We really design syntax to appeal to the humans who have to read and write the code.

The semantics of a language is what the code means: what the difference is betwene a While loop and a Repeat loop, for instance, is determined by the semantic value the compiler has recognised and attached to the whole control block (compilers aren't mindreaders: they don't actually know what the words "while" or "until" mean - those meanings have to be written into the compiler manually, in advance). Semantics is where the real creativity lies - if we don't come up with something new here, we're just defining a pretty find-and-replace layer over another language (all too many people do this). For instance, the linearity constraint described in chapter 0 (that values are "used up" by assignment), or BlitzMax's reference counting (values have their assignments counted in the background) are semantic design changes: in both cases, assignments look much the same as they would in Blitz3D - a name on the left, an operator in the middle, a value on the right - but in both cases also carry a deeper and more important meaning, by redefining what the act of assignment actually is.

Both syntax and semantics can be described formally or informally. A working, bug-free compiler by nature ends up being a formal description of both in action, although not the most useful one for the purposes of study. Formal logics exist for both, and are used by academics; most real-world compilers ditch the formal definition of the semantics, because it's easy to unambiguously say "like an assignment in C, but increments a hidden count at the same time". Describing syntax formally is more important, simply because syntax is extremely hard to describe properly in prose. So for this project, we'll create a formal grammar for Linear Basic, but leave the semantics informally specified in written English (both will appear in full in the appendix).

Terminals

What is a program? Potentially many things, but for BASIC we can narrow it down to "a list of words, numbers and symbols". When we write a program in BASIC these are the lowest-level forms we can actually move around to build a program. Thus, they form the "terminal" tokens of the language's grammar: individual letters don't have any meaning on their own in BASIC. The simplest way to describe whole tokens is using regular expressions. A regular expression is a simple expression that manipulates string components and can be used to match against an input string. So for instance, the expression:

Ax*

...says to match the letter "A", then as many as possible occurences of the letter "x" as possible (but possibly none). It will match successfully against "A", "Axx", and the first five characters of "Axxxxy" (assuming we accept partial matches), but will fail to match against "Bxx".

To describe a language's basic tokens, we generally only need very simple regular expressions: most of the keywords will be basic strings, e.g. the expression "If" will, unsurprisingly, only match the token "If" (how to tell the difference between a partial match on "Ifn" and a stupidly-named variable, we'll cover in Chapter 3 - it doesn't matter for now, just assume we can). The regular expression operations come in handy for matching against optional elements, such as in the valid Blitz3D constructs:

End Function
EndFunction
End            Function

...which are all supposedly the same, but we're not going to write all sorts of stupid code that manually tests for two words and tries to skip over spaces and so on, when we could just say that the single token that appears at the end of a function looks like this:

End[:blank:]*Function

That expression will match any number of tabs or spaces, possibly zero, that come between the strings "End" and "Function", and the best part is it will still think it's all one word, making our life much easier later. Making our lives even easier, we can also instruct the regular expression to ignore the case of the letters when matching, because BASIC is not case-sensitive: that means "eNd fUNCtioN" would have been picked up as valid too (which it is, I guess), without any extra work on our part.

Add a few operations to represent choices and ranges, and we can represent more complicated tokens like names, numbers, and strings:

[:alpha:][:word:]*      ; Any letter, followed by any number of letters, numbers, or underscores (possibly zero) - a name
[0-9]+                        ; At least one character from the range 0-9 - an integer (we could also have used the :digit: builtin range)
[0-9]*\.[0-9]+([eE]-?[0-9]+)? ; Optional digits, a decimal point, more digits, and an optional exponent
                              ; (floating point literal in Blitz3D - complicated!)

The last one of those is the most complicated regular expression needed for Blitz Basic, so that's actually not so bad.

And so, we can go through a text file, breaking it up into a stream of meaningful "token" objects (some do more interesting things, e.g. we can define "comments" as "everything between the comment sign and the newline" and then just throw them away!).

Grammar

So, if defining a set of "terminals" lets us turn a bunch of characters into a list of meaningful words, how do we turn a list of words into a meaningful code construct?

The answer is the same thing on a larger scale. Just as regular expressions build whole words out of letters, a grammar is a set of similar rules which build clauses (think of them as "sentences") out of words. Because programming languages are very limited in their expression, the number of rules is quite small. The important expressive difference that grammar rules have over regular expressions is that each one is given a name, and they can refer to each other: this lets us compose them together and use simpler rules to define more complicated ones, just as a program is built out of simple small elements. Each "terminal" forms a rule, and we can combine them together to form more complicated rules. So for a simple example:

atom = name | integer | float | string | "(" expression ")"  ; where | means "or"

product = atom ("*" atom)*             ; So * still means "at least once"
expression = product ("+" product)*    ; The parens "group" the repetition around the operator and subrule

; ...and expression can be put in brackets to recurse around to an atom again

I've written the simple terminals (in this case, operators) in quotes to make it easier to read, but technically that just means "the plus terminal" or "the star terminal". In code we would make this explicit.

For a more complicated example, let's look at the grammar of a couple of more real Blitz structures:

const-declaration = "const" const-decl-body ("," const-decl-body)*
const-decl-body = var-name "=" expression

var-name = identifier type-tag?
type-tag = "%" | "#" | "$" | ("." identifier)

Apart from expression, which is obviously more complicated, that's all we need to completely describe all of the legal permutations we could use to declare a constant. "Const" is followed by a name-value pair, and optionally commas and more name-value pairs; a name-value pair is a name, possibly tagged, with an equals and a value to assign. Simple once you get used to it, and unambiguous enough to compile (given a suitable compiler - they do exist!).

while-stmt = "while" expression local-statement-list "wend"

goto-stmt = "goto" identifier
gosub-stmt = "gosub" identifier

Again, expression and local-statement-list are more complicated rules, but the basic structure of the While statement is: "While", an expression to test, stuff to do, "Wend". Goto and Gosub are even simpler. (This is not an endorsement of Goto.)

Something to notice here is that the syntax has nothing to say about whether the code is meaningful: you can't encode in grammar the requirement that the value assigned to a constant must itself be a compile-time constant; you can't tell the difference between an identifier that represents a valid jump-label for Goto, and one that represents a variable or even type name; and you certainly can't tell whether the expression to While is a valid boolean or some other meaningless type. Those are semantic issues, and can only be checked once we have a full list of the names and types in use in the program later on. Syntax is just a matter of form.

[ If you're interested, you can find a more or less complete grammar for Blitz3D here: http://www.blitzbasic.com/Community/posts.php?topic=95526#1119767 ]

These examples use a simplified syntax to minimse the amount of stuff in the way. More professional writers will usually use a formal language called EBNF, which has many of the same operators but a few more details to its own syntax, to make it machine-readable.

We haven't discussed Linear Basic much yet in this chapter, instead using examples from Blitz3D, mainly because we'll be showing the actual code to define it later anyway. For the most part, its constructs are very recognisable BASIC structures.

Meaning

Most of the control structures in Linear Basic are the same as their equivalents in Blitz and other languages (e.g. the For loop looks slightly different, ending on "end" instead of "next", but otherwise behaves in exactly the same way as in B3D or BlitzMax, iterating over values). This means we can be content to describe them largely in informal terms that we assume the computing community will understand, e.g. "loops over each value in a range, performing the actions in the enclosed statement block for each iterated value". Much more usefully than for syntax (which is too flexible to treat this way), the finished compiler will be a working reference implementation of the language's semantics.

Constructs that have a more original meaning need more detailed description. For instance, Linear Basic allows functions to "observe" array parameters instead of "consuming" them, so when the function returns the old variable will still be valid. This means that the following Linear Basic code:

func ObserveValues(a:obs int[])
    for let e = 0 to a.length - 1
        print a[e]
    end
end

ObserveValues someArray
print someArray[20]    # OK, someArray was not "consumed" by the observer and still exists

...is perfectly OK, because the "obs" qualifier on the parameter type means that the call to ObserveValues didn't consume the array, and didn't invalidate the variable "someArray".

So we need to define that an "observed" parameter:

-- cannot be passed to any function that would "consume" it, because we don't have the "right" to do that
-- cannot be returned, or assigned to globals, because it would "escape" our control
-- can still be passed to "observer" functions
-- can still be read from, or even modified
-- can be copied, creating a new "real" linear array

This is "enough" formality for our purposes. We don't need to use a denotational calculus to express the finer details, when we can instead provide a reference implementation to demonstrate them. Finer detail than this, we will be hardcoding in very customised assembly-generating code anyway.

A complete description of Linear Basic itself will be created in the course of the rest of this series.

With luck, that ought to be enough to understand how to properly define a language's syntax in an unambiguous way. Even very complicated forms with many optional components, like a declaration that can declare many variables, or a floating-point number with all its quirks, can be described using simple formulaic expressions, in a regular language (either true Regular Expressions, or something similar but with additional named rules for grammars). Meanwhile, we can see that semantics, although infinitely more important, are actually much simpler to define in useful (if not strictly complete) terms.

...and that's all we need to define a language. Once we know what it is, we can begin the fun side: implementing it! The hard work is really in coming up with a definition that makes sense.

Next time, we get our hands dirty, and get to run a program! Tune in again for Chapter 2: Raiders of the Lost Lexer!

Copyright (c) 2012 Alex Gilding, a.k.a "Yasha"

The text of this series is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. All code is released to the public domain.

Last edited 2012

Yasha

(Posted 2012) [#4]

Interlude

Y'know what, I think there's a reason nobody has implemented a BASIC based on functional programming and an ML-like type system.

It's not that the language doesn't work, but there's a fine line between "good enough for an example" and "too annoying to actually use", and I think Linear Basic as I originally envisaged it falls squarely on the wrong side of that line. BASIC syntax simply gets in the way of too many necessary functional programming concepts, and the few advantages it brings conflict with other features I'd hoped would be helpful.

No matter how we look at this, I don't think the language actually is BASIC, and therefore maybe pretending that it is doesn't help us at all.

I'm going to go away and rethink this for a few days (also, I'll prepare the rest of this tutorial in advance before posting the next installment). The language will probably end up much closer to the original being ripped off, Linear ML, and will bear more of a visible ML heritage (starting with type inference, I think, so we'll have to add a chapter to the series). I know not everybody likes ML, but it's easier to write in than a confused mess.

We'll keep the name Linear Basic for now, though.

Expect some Orwellian retconning of the first couple of chapters.

Update: Well, it was a few months rather than a few days, whatever, but I have done my thinking and this series is officially alive again. Section A will be up in full over the next couple of days.

It was really encouraging to hear that people were actually reading this and wanting to see the rest!

Last edited 2012