Compilers: Assignments

News

Agenda

Wed Feb 15 2017: Presentation of the first assignment.
Wed Mar 1 2017: First assignment due. Presentation of the second assignment.
Wed Mar 15 2017: Second assignment due. Presentation of the third assignment.
Mon Mar 20 2017: Presentation of LLVM. Presentation of the fourth assignment.
Wed Apr 19 2017: Third assignment due. Presentation of dynamic dispatch.
Fri May 19 2017: Fourth assignment due.

Assignments

All the assignments will use the VSOP manual as a reference for the VSOP language. This file might be updated as the course goes on.

There will be four assignments, covering the different phases of your VSOP compiler.

Lexical Analysis

In this first assignment, you will write a program to do the lexical analysis of a VSOP source file, and dump the corresponding tokens on standard output.

This first assignment should be submitted by the 1st of March 2017 (23:59 CET).

Slides from the feedback presentation.

Syntax Analysis

In this second assignment, you will write a program to do the syntax analysis of a VSOP source file, and dump the corresponding AST on standard output.

This second assignment should be submitted by the 15th of March 2017 (23:59 CET).

Semantic Analysis

In this third assignment, you will write a program to do the semantic analysis of a VSOP source file, and dump the corresponding annotated AST on standard output.

This third assignment should be submitted by the 19th of April 2017 (23:59 CET).

Code Generation and Extension

In this final assignment, you will implement code generation (and, optionally, some language extensions).

Here are the slides of the LLVM presentation, along with an example implementation of virtual dispatch.

This final assignment should be submitted by the 19th of May 2017 (23:59 CEST).

How to submit an assignment?

You will use the submission platform. You will need to register, if you have not already done so, and subscribe to the course INFO0085. You will then be able to create or join a group and submit your work for evaluation.

You should submit an archive named vsopcompiler.tar.xz. That archive should contain a vsopcompiler folder with your code, and a Makefile to build it. You can build your compiler with whatever tool you want (if you need a build step at all), but you must still provide a Makefile with the targets described below.

Your code will be built as follows. From inside your vsopcompiler folder, make install-tools will be run. This should install any software needed to build your compiler (make and llvm-dev are already installed). You can use sudo if you need administrator privileges. Then, make vsopc will be issued to build your compiler, which should, logically, be named vsopc.
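As an illustration, a minimal Makefile honoring these two targets might look like the sketch below. The flex/bison dependencies and the src/ layout are assumptions for the example, not requirements; adapt them to your own toolchain. (Recipe lines must start with a tab.)

```makefile
# Sketch of the two required targets -- adapt to your own setup.

# Install build dependencies (make and llvm-dev are already present
# in the evaluation VM).
install-tools:
	sudo apt-get update
	sudo apt-get install -y flex bison

# Build the compiler binary, which must be named vsopc.
vsopc:
	$(CC) -O2 -o vsopc src/*.c
```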

Your submissions will be evaluated in a reference virtual machine based on Debian Jessie (a GNU/Linux distribution) on an amd64 architecture. To obtain the base system, you can use this VirtualBox VM (user: vagrant, password: compilers2017). Alternatively, you can use a real Debian system, the debian:jessie Docker image, or the debian/jessie64 Vagrant image. You will need to install make and llvm-dev in the VM, and you might find the skeleton files useful if you go the alternative route.

You can submit as many times as you want; only the last submission will be taken into account.

FAQ

If you have questions, either ask them during the course or send them to the teaching assistant. Responses will be posted here.

Lexical Analysis

Why and how to use a hash table for keywords?

The size of your DFA will rise exponentially with the number of rules and their length. If you use one regular expression per keyword, your lexer generator might fail because you reach the pre-specified maximum size for the DFA. It is unlikely to happen for basic VSOP (which has only a handful of keywords), but could happen if you add more keywords later on.

Moreover, a lexer with a large DFA will use a lot of memory and induce a lot of cache misses, which will result in a slow lexer.

Using a hash table for keywords solves this problem: a single regular expression matches both object-identifiers and keywords. When the input stream matches the regexp [a-z][a-zA-Z0-9_]*, look the lexeme up in the keyword hash table. If it is there, emit the corresponding keyword token; otherwise, the lexeme must be an object-identifier.

That is, rather than doing something like

%%
"keyword1"                  return KEYWORD1;
...                         ...
"keyword100"                return KEYWORD100;
[a-z][a-zA-Z0-9_]*          return OBJECT_IDENTIFIER;

rather do something like

%{
    HashTable keywords;
    keywords["keyword1"] = KEYWORD1;
    ...
    keywords["keyword100"] = KEYWORD100;
%}
%%
[a-z][a-zA-Z0-9_]*          return keywords.get(yytext, OBJECT_IDENTIFIER);

where we assume that HashTable::get(key, defaultVal) returns the value stored in the hashtable for the given key, or defaultVal if there is no such value.

How to define tokens?

It will vary depending on the lexer generator you use. Read the documentation provided with your lexer generator and/or follow an online tutorial (one that actually uses tokens, not just an inline calculator).

For lexer generators derived from lex (such as flex or JFlex), the tokens are generally defined in the grammar, not in the lexer definition file. It is by running yacc (or bison or cup, etc.) on the grammar that the source code for token definitions will be generated. If you use such a generator, you will need to write a simple grammar definition file containing only the token definitions (and, optionally, the code to dump the read tokens on standard output).
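For example, with flex/bison, a stripped-down grammar file that exists only to declare tokens could look like the following sketch (the token names are illustrative, not the official VSOP list):

```
%{
/* parser.y -- only used to generate token definitions for the lexer */
%}
%token CLASS WHILE TRUE FALSE
%token OBJECT_IDENTIFIER TYPE_IDENTIFIER
%token INTEGER_LITERAL STRING_LITERAL
%%
/* Dummy rule: bison requires at least one. */
start: ;
```

Running bison -d on this file produces a header with the token definitions, which your flex lexer can then include.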

Why only allow non-negative integer literals?

The definition of integer-literal does not allow for any -, and this is on purpose.

While handling negative integer literals in the lexical phase is possible, and is done in some languages, it has several drawbacks:

For example, it makes expressions such as a + - b valid (interpreted as a - b) while they may actually be errors (a missing middle operand).

Allowing negative integer literals might also make some syntax less verbose. E.g. in OCaml, the expression sin -1.57 is interpreted as sin - 1.57, i.e. the subtraction of 1.57 from the function sin, which is a type error rather than the intended call. One would need to write sin (-1.57), which is syntactically heavier. However, what about sin -a, or sin -(a + b)? This becomes too complex to handle in the lexical analysis phase and would require work in the later phases anyway.

How to handle language features not described in the manual?

You should report them to the teaching assistant. Some behaviors are left to your own judgment (e.g. should you report an unterminated string or an illegal escape sequence error message if you encounter a \ followed by end-of-file in a string?), but some might just be omissions. If in doubt, report it.

How can I find resource files needed by my compiler?

If your vsopc compiler depends on some resource file (e.g. a JAR file), remember that your compiler will not necessarily be invoked from the folder where it resides. The submission platform actually calls your compiler from the tests folder, not from the vsopcompiler folder.

It is up to you to ensure your compiler can robustly find files it depends on. Here are two potential solutions:

How can I test my lexer after the deadline?

Tests on the submission platform are cumulative: lexical tests will also be run for all later phases. So, just submit your lexer to the current phase of the project and concentrate on the lexical tests in the output.

What is the problem with my exit code?

According to the POSIX standard, a program should exit with status 0 when successful, and some non-zero 8-bit value on error. You can return different values for different kinds of errors if you want. Common generic values are 1 and 255 (-1). The OS itself might set the status for you (e.g. if the program segfaults).

This is important to be able to detect errors. For example, a shell script with option -e will exit at the first detected error. In the following example, the script will exit before erasing all your files.

#!/bin/sh -eu
# -e is for exit-on-error
# -u is for exit-on-access-to-undefined-variable
cd /some/inexisting/folder # cd will return 1, terminating the script
rm -rf * # Would delete all your files, without the -e, or if cd returned 0
         # despite the target directory being non-existent

Proper return codes also allow for conditional testing. For example,

#!/bin/sh -eu
if grep "foo" bar; then
    echo "foo present in bar"
else
    echo "no foo in bar"
fi

will output one of the two sentences according to bar's content.

Finally, well-behaved shells will also indicate errors one way or another.

In some cases, it is OK to print warnings on stderr, continue processing, and exit with 0 if no critical error was met. In the context of this assignment, your compiler should print a message on stderr and exit with a non-zero value whenever an error is encountered.

Syntax Analysis

What is the deal with unary minus and exponentiation?

In mathematics, -2² is equal to -(2²) = -4 and not (-2)² = 4. We follow the same convention in VSOP.

Try it in different languages! It will work the same way in Matlab, Mathematica or Haskell, but will have the opposite sign in OCaml, bc, or Microsoft Excel.
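In a yacc/bison grammar, this convention falls out of the precedence declarations: since later declarations bind tighter, placing the exponentiation operator after the pseudo-token used for unary minus makes -2^2 parse as -(2^2). A sketch (token and operator names are ours, not the official VSOP grammar):

```
%left '+' '-'
%left '*' '/'
%nonassoc UMINUS   /* unary minus: tighter than '*', looser than '^' */
%right '^'         /* exponentiation: right-associative, binds tightest */
%%
expr: expr '+' expr
    | expr '*' expr
    | expr '^' expr
    | '-' expr %prec UMINUS
    | INTEGER_LITERAL
    ;
```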

How to report good syntax error messages?

With LL-based algorithms, it is relatively straightforward as you always know what kind of construct is expected next (it is predictive parsing, after all).

With LR-based (and LALR-based) algorithms, it is more difficult to generate good error messages. One can incrementally insert error pseudo-tokens, but it is hard to come up with a large number of specialized error messages while avoiding conflicts. One rule of thumb is to give context to your error pseudo-tokens (e.g. they should be immediately followed by tokens such as delimiters or terminators). However, it is more of an art than a science.
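For instance, with bison's error pseudo-token, anchoring the error on a terminator gives it the kind of context mentioned above (a sketch; the rule names are illustrative):

```
block: '{' stmts '}' ;

stmt: expr ';'
    | error ';'   { /* bad statement: report "malformed statement",
                       then resynchronize after the ';' */
                    yyerrok; }
    ;
```

Here the parser discards tokens until it can shift a ';', so one malformed statement does not cascade into spurious errors for the rest of the block.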

If you want to delve deeper into LR-based error message generation, in a more scientific way, have a look at Generating LR syntax error messages from examples, which describes how you can use the automaton states and parsing stack to identify specific error conditions.

Semantic Analysis

Can a class have several methods with the same name?

No. For simplicity, there can only be one method with a given name in a class, and any redefinition is an error. A method can however be overridden in a child class, with the exact same type (same number of arguments, same argument types, and same return type).

A method and a field may have the same name, as they live in different namespaces.

If you want, you can implement multiple methods with the same name but different argument types as a VSOP extension.

Code Generation

Can I go through C or another high-level language to generate LLVM?

No.

This would sidestep some important concepts entirely, and would be considered cheating.

Going through the LLVM process has (at least) the following advantages:

You can, however, write some C code and pass it to clang -S -emit-llvm to see how a given feature can be implemented. In fact, we strongly advise you to do so.

You can also write your runtime (i.e. in basic VSOP, the I/O primitives) in C if you want.

How can I do sizeof in LLVM?

If you use the LLVM C++ library, you can just use getTypeAllocSize() or CreateMalloc().

If you generate LLVM IR yourself, there is no sizeof instruction in LLVM. However, you can emulate it using the getelementptr instruction, as described here.

Let's say your struct has type %T. The idea is to take the address of the second element of a hypothetical array of %T starting at address 0 (null). The second element starts just after the first one, so its offset is the size of the first element, i.e. the size of your type. As getelementptr returns a pointer, you then have to convert it back to an integer.

%size_as_ptr = getelementptr %T* null, i32 1
%size_as_i32 = ptrtoint %T* %size_as_ptr to i32

You should avoid trying to compute the size yourself (as clang does internally), because alignment rules are target-dependent. E.g. on x86_64,

%T = type { i8, i8*, i32, i8, i32 }

will not have size 1 + 8 (for the 64-bit pointer) + 4 + 1 + 4 = 18 bytes, but 32 bytes: 7 padding bytes to align the pointer on a 64-bit boundary, 3 padding bytes to align the second i32 on a 32-bit boundary, and 4 bytes of tail padding so that the total size is a multiple of the struct's 8-byte alignment. Apart from being target-dependent, the rules are not trivial. The type

%T2 = type { i8, i32, i8*, i8, i32 } ; 2nd and 3rd fields swapped

only takes up 24 bytes: 3 padding bytes before each i32, and no tail padding, since the final i32 already ends on a 64-bit boundary.