Compilers: Assignments



Tue Feb 20 2018: Presentation of the first assignment.
Tue Mar 6 2018: First assignment due. Presentation of the second assignment.
Tue Mar 20 2018: Second assignment due. Presentation of the third assignment.
Tue Mar 27 2018: Presentation of LLVM. Presentation of the fourth assignment.
Tue Apr 17 2018: Third assignment due. Presentation of dynamic dispatch.
Fri May 11 2018: Fourth assignment due.


All the assignments will use the VSOP manual as a reference for the VSOP language. This file might be updated as the course goes on.

There will be four assignments, covering the different phases of your VSOP compiler.

Lexical Analysis

In this first assignment, you will write a program to do the lexical analysis of a VSOP source file, and dump the corresponding tokens on the output.

This first assignment should be submitted by the 6th of March 2018 (23:59 CET) at the latest.

Syntax Analysis

In this second assignment, you will write a program to do the syntax analysis of a VSOP source file, and dump the corresponding AST on standard output.

This second assignment should be submitted by the 20th of March 2018 (23:59 CET) at the latest.

Semantic Analysis

In this third assignment, you will write a program to do the semantic analysis of a VSOP source file, and dump the corresponding annotated AST on standard output.

This third assignment should be submitted by the 17th of April 2018 (23:59 CET) at the latest.

Code Generation and Extension

In this final assignment, you will implement code generation (and, optionally, some language extensions).

This final assignment should be submitted by the 11th of May 2018 (23:59 CEST) at the latest.

How to submit an assignment?

You will use the submission platform. You will need to register (if not already done) and subscribe to the course INFO0085. You will then be able to create or join a group, and submit your work for evaluation.

You should submit an archive named vsopcompiler.tar.xz. That archive should contain a vsopcompiler folder with your code, and a Makefile to build it. You can build your compiler with whatever tool you want (if you need a build step at all), but you must still provide a Makefile with the targets described below.

Your code will be built as follows. From inside your vsopcompiler folder, make install-tools will be run. This should install any software needed to build your compiler (make and llvm-dev are already installed). You can use sudo if you need administrator privileges. Then, make vsopc will be issued to build your compiler, which should, logically enough, be named vsopc.

Once built, your vsopc compiler will be called with arguments as described in the individual assignments. Note that it will not necessarily be called from the same folder.
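For illustration, a minimal Makefile providing these targets might look as follows. This is only a sketch: the source file names (main.cpp, lexer.cpp) and the extra tools installed (flex, bison) are assumptions; adapt everything to your own toolchain.

```make
# Sketch only: source files and extra tools are assumptions.
all: vsopc

install-tools:
	# make and llvm-dev are preinstalled; add whatever else you need.
	sudo apt-get install -y flex bison

vsopc: main.cpp lexer.cpp
	$(CXX) -std=c++14 -o vsopc main.cpp lexer.cpp

.PHONY: all install-tools
```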

Your submissions will be evaluated in a reference virtual machine based on Debian Stretch (a GNU/Linux distribution) on an amd64 architecture. To obtain the base system, you can use this VirtualBox VM (user: vagrant, password: compilers2018). Alternatively, you can use a real Debian system, the debian:stretch docker image or debian/stretch64 vagrant image. You will need to install make and llvm-dev in the VM, and you might find the skeleton files useful if you go the alternative route.

You can submit as many times as you want; only the last submission will be taken into account.


If you have questions, either ask them during the course or send them to the teaching assistant. Responses will be posted here.

Lexical Analysis

Why and how to use a hash table for keywords?

The size of your DFA will rise exponentially with the number of rules and their length. If you use one regular expression per keyword, your lexer generator might fail because you reach the pre-specified maximum size for the DFA. It is unlikely to happen for basic VSOP (which has only a handful of keywords), but could happen if you add more keywords later on.

Moreover, a lexer with a large DFA will use a lot of memory and induce a lot of cache misses, which will result in a slow lexer.

Using a hash table for keywords solves this problem by using a single regular expression for both object-identifiers and all the keywords. When you see what could be an object-identifier or a keyword in the input stream (i.e. when the stream matches the regexp [a-z][a-zA-Z0-9_]*), consult the keyword hash table to check whether the token is a keyword, and if so, which one. If there is no corresponding keyword for the token, then it must be an object-identifier.

That is, rather than doing something like

"keyword1"                  return KEYWORD1;
...                         ...
"keyword100"                return KEYWORD100;
[a-z][a-zA-Z0-9_]*          return OBJECT_IDENTIFIER;

do something like

    /* keyword table, initialized once (e.g. in the lexer prologue) */
    HashTable keywords;
    keywords["keyword1"] = KEYWORD1;
    ...
    keywords["keyword100"] = KEYWORD100;

[a-z][a-zA-Z0-9_]*          return keywords.get(yytext, OBJECT_IDENTIFIER);

where we assume that HashTable::get(key, defaultVal) returns the value stored in the hashtable for the given key, or defaultVal if there is no such value.
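As a concrete sketch, such a keyword table could look like this in C++. The token codes below are illustrative assumptions (real ones would come from your parser generator), and lookup is what the identifier rule's action would return.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token codes; real ones would come from your parser generator.
enum Token { OBJECT_IDENTIFIER = 1, KW_CLASS, KW_WHILE, KW_IF };

// Keyword table, built once at startup.
static const std::unordered_map<std::string, Token> keywords = {
    {"class", KW_CLASS},
    {"while", KW_WHILE},
    {"if",    KW_IF},
};

// Returns the keyword token for `text`, or OBJECT_IDENTIFIER if it is
// not a keyword. This is the value the identifier rule would return.
Token lookup(const std::string& text) {
    auto it = keywords.find(text);
    return it != keywords.end() ? it->second : OBJECT_IDENTIFIER;
}
```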

How to define tokens?

It will vary depending on the lexer generator you use. Read the documentation provided with your lexer generator and/or follow an online tutorial (one that actually uses tokens, not just an inline calculator).

For lexer generators derived from lex (such as flex or JFlex), the tokens are generally defined in the grammar, not in the lexer definition file. It is by running yacc (or bison or cup, etc.) on the grammar that the source code for token definitions will be generated. If you use such a generator, you will need to write a simple grammar definition file containing only the token definitions (and, optionally, the code to dump the read tokens on standard output).
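For example, with bison, such a grammar file might contain little more than the token declarations. The token names below are illustrative, not the full VSOP token set, and tokens.y is a hypothetical file name.

```yacc
/* tokens.y -- illustrative token names only */
%token CLASS WHILE IF NOT
%token INTEGER_LITERAL OBJECT_IDENTIFIER TYPE_IDENTIFIER

%%
/* bison requires at least one rule; a dummy one suffices at this stage */
program: ;
%%
```

Running bison -d on such a file generates a header containing the token definitions, which your lexer can then include.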

Why only allow non-negative integer literals?

The definition of integer-literal does not allow for any -, and this is on purpose.

While handling negative integer literals at the lexical phase is possible, and done in some compilers, it has several drawbacks:

While one might think allowing negative integer literals makes some syntax less verbose, it actually introduces ambiguities. E.g. in OCaml, the expression sin -1.57 will be interpreted as sin - 1.57, i.e. the subtraction operator applied to the function sin and to 1.57, which is a type error rather than the intended call. One would need to write sin (-1.57), which is syntactically heavier. However, what about sin -a, or sin -(a + b)? This becomes too complex to be handled at the lexical analysis phase and would require work in the later phases anyway.
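To see the lexical-level problem concretely, here is a small C++ sketch, using std::regex purely for illustration, of what a maximal-munch literal rule -?[0-9]+ would do to a subtraction:

```cpp
#include <regex>
#include <string>
#include <vector>

// Tokenizes integer literals with a rule that allows a leading minus,
// mimicking a maximal-munch lexer rule -?[0-9]+.
std::vector<std::string> lex_literals(const std::string& input) {
    static const std::regex rule("-?[0-9]+");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), rule), end;
         it != end; ++it)
        tokens.push_back(it->str());
    return tokens;
}

// lex_literals("1-2") yields {"1", "-2"}: the minus sign of the
// intended subtraction is swallowed by the literal rule.
```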

How to handle language features not described in the manual?

You should report it to the teaching assistant. Some features are left to your own judgment (e.g. if you encounter a \ followed by end-of-file in a string, should you report an unterminated string error or an illegal escape sequence error?), but some might just be omissions. If in doubt, report it.

How can I find resource files needed by my compiler?

If your vsopc compiler depends on some resource file (e.g. a JAR file), remember that your compiler will not necessarily be invoked from the folder where it resides. The submission platform actually calls your compiler from the tests folder, not from the vsopcompiler folder.

It is up to you to ensure your compiler can robustly find the files it depends on.
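One common approach (a sketch, not the only option) is to derive the resource directory from the path the compiler was invoked with, which is available in main as argv[0]:

```cpp
#include <string>

// Returns the directory part of the path the compiler was invoked with
// (available in main as argv[0]), so that resources can be located next
// to the executable instead of in the current working directory.
std::string executable_dir(const std::string& argv0) {
    std::string::size_type slash = argv0.find_last_of('/');
    return slash == std::string::npos ? "." : argv0.substr(0, slash);
}

// In main: std::string resources = executable_dir(argv[0]) + "/resources";
```

Note that argv[0] may be a bare program name if vsopc was found through PATH; on Linux, resolving /proc/self/exe is a more robust variant.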

How can I test my lexer after the deadline?

Tests on the submission platform will be cumulative, so lexical tests will also be run for all later phases. So, you just need to submit your lexer to the current phase of the project, and concentrate on the lexical tests in the output.

What is the problem with my exit code?

According to the POSIX standard, a program should return with 0 when successful, and some 8-bit non-zero value on error. You can return different values for different kinds of errors, if you want. Common generic values are 1 and 255 (-1). The OS itself might return something for you (e.g. if the program produces a segfault).

This is important to be able to detect errors. For example, a shell script with option -e will exit at the first detected error. In the following example, the script will exit before erasing all your files.

set -eu
# -e is for exit-on-error
# -u is for exit-on-access-to-undefined-variable
cd /some/inexisting/folder  # cd will return 1, terminating the script
rm -rf *  # Would delete all files in current folder
          # Hopefully, we used set -e and cd returned 1

Proper return codes also allow for conditional testing. For example,

set -eu
if grep "foo" bar >/dev/null; then
    echo "foo present in bar"
else
    echo "no foo in bar"
fi

will output one of the two sentences according to bar's content.

Finally, well-behaved shells will also report a command's non-zero exit status one way or another (e.g. through the $? variable).

In some cases, it is OK to print some warnings on stderr, continue processing, and exit with 0 if no critical error was met. In the context of this assignment, your compiler should print a message on stderr and exit with a non-zero value if an error was met.
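As a minimal sketch, this convention could look as follows in your compiler's driver. The compile function and the diagnostic message format are assumptions for illustration, not course requirements.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical driver: returns true on success, false on error.
// (A real implementation would run the lexer here.)
bool compile(const std::string& source) {
    return !source.empty();  // placeholder error condition
}

// Maps the compilation outcome to a POSIX exit code, printing a
// diagnostic on stderr on failure.
int exit_code_for(bool ok) {
    if (!ok) {
        std::cerr << "vsopc: error: compilation failed\n";
        return EXIT_FAILURE;  // non-zero: `set -e` scripts will stop
    }
    return EXIT_SUCCESS;  // 0 on success
}

// In main: return exit_code_for(compile(argv[1]));
```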