Scanner Assignment

The first step in building a compiler is to create a scanner. You will use the Flex Scanner Generator to construct a scanner for the B-minor language. It is up to you to read this document carefully, decide what all of the token types are, and define them precisely using flex regular expressions. Make sure that you define all reserved words, identifiers, operators, constants, statements, and any other program elements. If you think that the specification is not clear, be sure to ask for clarification.

Your program will be graded on the student Red Hat 7 machines (student00-02.cse.nd.edu) using /usr/bin/gcc, so it is your responsibility to make sure that your code compiles and works there. If which gcc does not print out /usr/bin/gcc then you are using the wrong compiler for this class.

Your program must be written in plain C (not C++) using /usr/bin/gcc (not G++) and use flex to generate the scanner. You must have a Makefile such that when you type make, all the pieces are compiled and result in a binary program called bminor. make clean should also delete all temporary files, so that the program can be made again from scratch.

Your program must be invoked as follows:

./bminor -scan sourcefile.bminor
It must print to the standard output, using printf, the symbolic token type of each element of the input. For literal values (integers, strings, characters) and identifiers, you must also output the body of the token, with the quotes removed and any escape codes translated. If the input contains an invalid token, the program should print a message to the standard error stream (fprintf(stderr,"...")) and immediately call exit(1). Otherwise, the program should exit with return code zero.
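
The quote removal and escape translation described above could be handled by a small helper such as the following. This is only a sketch, not the required implementation: the function name decode_literal is hypothetical, and the escape rules it applies (\n for newline, \0 for NUL, and a backslash followed by any other character standing for that character) are an assumption that you should check against the B-minor specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: decode the raw text of a string or character token
 * (quotes included, as it might arrive in flex's yytext) into `out`.
 * Returns the decoded length, or -1 if the text is malformed or does not fit.
 * Assumed escape rules: \n is newline, \0 is NUL, and a backslash followed
 * by any other character stands for that character.
 */
int decode_literal(const char *raw, char *out, size_t out_size)
{
    assert(raw != NULL && out != NULL);

    size_t len = strlen(raw);
    if (out_size == 0)
        return -1;
    /* Need an opening quote and a matching closing quote. */
    if (len < 2 || (raw[0] != '"' && raw[0] != '\'') || raw[len - 1] != raw[0])
        return -1;

    size_t n = 0;
    for (size_t i = 1; i < len - 1; i++) {
        char c = raw[i];
        if (c == '\\') {
            if (++i >= len - 1)      /* backslash with nothing after it */
                return -1;
            switch (raw[i]) {
            case 'n': c = '\n'; break;
            case '0': c = '\0'; break;
            default:  c = raw[i]; break;
            }
        }
        if (n + 1 >= out_size)       /* keep room for the terminator */
            return -1;
        out[n++] = c;
    }
    out[n] = '\0';
    return (int)n;
}
```

A scanner action might call decode_literal(yytext, buffer, sizeof buffer) and print the buffer after the token name. Note that the helper returns 0 for an empty body, so a caller handling character literals would still need to reject '' itself.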

For example, if the input looks like this:

string
1534
'a'
Notre Dame
"Notre\nDame";
>=
@
then your output should be:
STRING
INTEGER_LITERAL 1534
CHARACTER_LITERAL a
IDENTIFIER Notre
IDENTIFIER Dame
STRING_LITERAL Notre
Dame
SEMICOLON
GE
scan error: @ is not a valid character
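
One possible way to produce the output format shown above is sketched below. The enum values and the helper names (token_t, token_name, token_has_body, print_token, scan_error) are illustrative assumptions, not part of the assignment, and only the token types from the example are listed; a complete scanner would need many more.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical token codes; only the ones from the example appear here. */
typedef enum {
    TOKEN_STRING,
    TOKEN_INTEGER_LITERAL,
    TOKEN_CHARACTER_LITERAL,
    TOKEN_IDENTIFIER,
    TOKEN_STRING_LITERAL,
    TOKEN_SEMICOLON,
    TOKEN_GE
} token_t;

/* Map a token code to the symbolic name that gets printed. */
const char *token_name(token_t t)
{
    switch (t) {
    case TOKEN_STRING:            return "STRING";
    case TOKEN_INTEGER_LITERAL:   return "INTEGER_LITERAL";
    case TOKEN_CHARACTER_LITERAL: return "CHARACTER_LITERAL";
    case TOKEN_IDENTIFIER:        return "IDENTIFIER";
    case TOKEN_STRING_LITERAL:    return "STRING_LITERAL";
    case TOKEN_SEMICOLON:         return "SEMICOLON";
    case TOKEN_GE:                return "GE";
    }
    return "UNKNOWN";
}

/* Literals and identifiers are printed together with their body. */
int token_has_body(token_t t)
{
    return t == TOKEN_INTEGER_LITERAL || t == TOKEN_CHARACTER_LITERAL ||
           t == TOKEN_IDENTIFIER || t == TOKEN_STRING_LITERAL;
}

/* Print one token to standard output; `body` is already decoded. */
void print_token(token_t t, const char *body)
{
    if (token_has_body(t)) {
        assert(body != NULL);
        printf("%s %s\n", token_name(t), body);
    } else {
        printf("%s\n", token_name(t));
    }
}

/* Report an invalid token on standard error and stop immediately. */
void scan_error(const char *text)
{
    fprintf(stderr, "scan error: %s is not a valid character\n", text);
    exit(1);
}
```

With these helpers, print_token(TOKEN_INTEGER_LITERAL, "1534") prints the line INTEGER_LITERAL 1534, and print_token(TOKEN_GE, NULL) prints GE.
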
A compiler has many odd corner cases that you must carefully handle. To encourage you to test thoroughly, we will also require you to turn in twenty test cases. Ten should be named good[0-9].bminor and should contain valid tokens. Ten should be named bad[0-9].bminor and should contain at least one erroneous token. We strongly recommend that you also try these example test cases. We will evaluate your code using these and some other hidden test cases.

As always, exercise good style in programming by choosing sensible variable names, breaking complex tasks down into smaller functions, and using constructive comments where appropriate. Please review the General Instructions for Turning In

This assignment is due on Friday, September 20th at 5PM. Late assignments are not accepted.

Frequently Asked Questions

  1. Q: Is "" a valid string literal?
    A: Yes, two double quotes represent an empty string consisting only of the null terminator.

  2. Q: Is this a valid string literal?
    "hello
    world"
    
    A: No, a newline in a string needs to be escaped, like this: "hello\nworld"

  3. Q: Is '' a valid char literal?
    A: No, two single quotes are not a valid char literal, because they do not indicate a particular character.

  4. Q: How would I represent a backslash as a character literal?
    A: '\\' (quote-backslash-backslash-quote)

  5. Q: Do we need to handle #include, #define, and so forth?
    A: No, they are not part of B-minor.

  6. Q: Can an integer have a leading negative/positive sign?
    A: Yes, but unary minus/plus are distinct tokens, so -10 would scan as MINUS INTEGER_LITERAL.

  7. Q: Does B-minor support scientific notation like 3e8, as a shorthand for 3 * 10^8?
    A: No, integers are simply sequences of digits.

  8. Q: Can an integer begin with a leading zero?
    A: Yes, this is allowed. Unlike C, a leading zero does not have any special meaning, so 010 simply means the decimal number 10.

  9. Q: How should we interpret three dashes in a row?
    A: --- indicates the two tokens DECREMENT MINUS.

  10. Q: The specification indicates strings and identifiers can be no longer than 256 characters. Should a longer string or identifier result in a scanning error?
    A: Yes.

  11. Q: Does the string size limit include the null terminator?
    A: Yes. That is, 255 printable characters plus 1 byte for the null terminator is the maximum length. Be careful: escape codes like \n represent one character!

  12. Q: Is 5hello a valid identifier?
    A: 5hello is two valid tokens: INTEGER_LITERAL followed by IDENTIFIER, which will scan correctly, but will (eventually) result in a parse error.
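
The string length rule from questions 10 and 11 could be checked along these lines. This is a minimal sketch under stated assumptions: the names MAX_LITERAL_BYTES, decoded_length, and fits_length_limit are hypothetical, and the input is assumed to be the body of a string literal with the surrounding quotes already removed but the escape codes still in source form.

```c
#include <assert.h>
#include <stddef.h>

/* Limit from the FAQ: 255 characters plus 1 byte for the null terminator. */
#define MAX_LITERAL_BYTES 256

/*
 * Count the characters a literal body decodes to. An escape pair such as
 * \n is two bytes in the source but counts as one character.
 */
size_t decoded_length(const char *body)
{
    assert(body != NULL);

    size_t n = 0;
    for (size_t i = 0; body[i] != '\0'; i++) {
        if (body[i] == '\\' && body[i + 1] != '\0')
            i++;                     /* skip the escaped character */
        n++;
    }
    return n;
}

/* Nonzero if the decoded body plus its terminator fits within the limit. */
int fits_length_limit(const char *body)
{
    return decoded_length(body) + 1 <= MAX_LITERAL_BYTES;
}
```

A scanner action could call fits_length_limit on the body of each string literal and report a scan error when it returns zero.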