CS 143 Notes
Lecture 1
The Structure of a Compiler
- Lexical Analysis: identify words
- Parsing: identify sentences
- Semantic Analysis: analyze sentences
- Optimization: editing
- Code Generation: translation
The phases can be understood by analogy to how humans comprehend English.
- Lexical Analysis: the first step is to recognize words, the smallest unit above letters. The lexical analyzer divides the program text into “words” or “tokens”.
- Parsing: once words are understood, the next step is to understand sentence structure. Parsing is diagramming sentences; the diagram is a tree.
- Semantic Analysis: once sentence structure is understood, we can try to understand “meaning”. But meaning is too hard for compilers, so they perform only limited semantic analysis to catch inconsistencies; this includes many semantic checks besides variable bindings.
- Optimization: akin to editing (minimize reading time; minimize items the reader must keep in short-term memory). The compiler automatically modifies programs so that they run faster, use less memory, or in general conserve some resource.
- Code Generation: typically produces assembly code; generally a translation into another language, analogous to human translation.
Intermediate Representations
Many compilers perform translations between successive intermediate languages
- All but the first and last are intermediate representations (IRs) internal to the compiler (a small lowering sketch follows this section). IRs are generally ordered in descending level of abstraction:
- Highest is source
- Lowest is assembly
IRs are useful because lower levels expose features hidden by higher levels:
- registers
- memory layout
- raw pointers
- etc.
But lower levels obscure high-level meaning:
- Classes
- Higher-order functions
- Even loops…
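As a concrete illustration (hypothetical, not from the lecture) of one such translation between successive IRs, the Python sketch below lowers an expression tree one level, to three-address code, exposing the temporaries ("registers") that the source level hides:

```python
import itertools

# A minimal sketch (illustrative only) of one IR-to-IR translation:
# lowering an expression tree to three-address code.
counter = itertools.count()

def lower(node, code):
    """Return the name holding node's value, appending instructions to code."""
    if isinstance(node, str):             # a variable is already a name
        return node
    op, left, right = node                # an operator node: (op, lhs, rhs)
    l = lower(left, code)
    r = lower(right, code)
    t = f"t{next(counter)}"               # fresh temporary for this operation
    code.append(f"{t} = {l} {op} {r}")
    return t

code = []
lower(("+", "a", ("*", "b", "c")), code)  # source level: a + b * c
print("\n".join(code))                    # t0 = b * c
                                          # t1 = a + t0
```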
Lecture 2
History of Ideas: Abstraction
- Abstraction = detached from concrete details
- “Abstraction is selective ignorance” - Andrew Koenig
- Abstraction is necessary to build any complex system
- The key is information hiding—expose only the essential
- Modes of abstraction
- Via languages/compilers: High-level code, few machine dependencies
- Via functions and subroutines: Abstract interface to behavior
- Via modules: Export interfaces; hide implementation
- Via classes/abstract data types: Bundle data with its operations
History of Ideas: Types
- Originally, few types
- FORTRAN: scalars, arrays
- LISP: no static type distinctions
- Realization: Types help
- Lets you express abstraction
- Lets the compiler report many frequent errors
- Sometimes to the point that programs are guaranteed “safe”
- Helps the compiler optimize your code
- More recently
- Lots of interest in types
- Experiments with various forms of parameterization
- Best developed in functional programming
History of Ideas: Reuse
- Reuse = exploit common patterns in software systems
- Goal: mass-produced software components
- Reuse is difficult
- Two popular approaches
- Type parameterization (List(int), List(double))
- Classes and inheritance: C++ derived classes
- C++ and Java have both
- Inheritance allows
- Specialization of existing abstraction
- Extension, modification, and hiding of existing behavior
Lexical Analysis
What do we want to do? Example:
if (i == j)
    z = 0;
else
    z = 1;
The input is just a string of characters:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Goal: Partition input string into substrings
- Where the substrings are called tokens
Token
A token class corresponds to a set of strings. Examples:
- Identifier: strings of letters or digits, starting with a letter
- Integer: a non-empty string of digits
- Keyword: “else” or “if” or “begin” or …
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
Lexical analysis produces a stream of tokens, which is the input to the parser. The parser relies on token distinctions:
- An identifier is treated differently than a keyword
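A small sketch (assuming Python and its re module; not part of the lecture) of the token classes above, each written as a pattern denoting a set of strings:

```python
import re

# Each token class corresponds to a set of strings, written here as a
# regular-expression pattern. Listing Keyword before Identifier makes
# keywords win when a string belongs to both sets.
TOKEN_CLASSES = {
    "Keyword":    r"else|if|begin",          # "else" or "if" or "begin" or ...
    "Identifier": r"[A-Za-z][A-Za-z0-9]*",   # letters/digits, starting with a letter
    "Integer":    r"[0-9]+",                 # a non-empty string of digits
    "Whitespace": r"[ \t\n]+",               # non-empty blanks, newlines, tabs
}

def classify(s: str) -> str:
    """Name the first token class whose set of strings contains s."""
    for name, pattern in TOKEN_CLASSES.items():
        if re.fullmatch(pattern, s):
            return name
    return "unknown"

print(classify("else"))  # Keyword
print(classify("i42"))   # Identifier
print(classify("42"))    # Integer
```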
Designing a Lexical Analyzer
- Define a finite set of tokens
- Tokens describe all items of interest
- Identifiers, integers, keywords
- Choice of tokens depends on
- language
- design of parser
Recall
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Useful tokens for this expression:
Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;
- Describe which strings belong to each token. Recall:
- Identifier: strings of letters or digits, starting with a letter
- Integer: a non-empty string of digits
- Keyword: “else” or “if” or “begin” or …
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
An implementation must do two things:
- Classify each substring as a token
- Return the value or lexeme of the token
- The lexeme is the actual substring
- From the set of substrings that make up the token
The lexer thus returns token-lexeme pairs
- And potentially also line numbers, file names, etc. to improve later error messages
The lexer usually discards “uninteresting” tokens that don’t contribute to parsing
- Examples: Whitespace, Comments
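Putting the pieces together, here is a minimal lexer sketch (hypothetical Python, using one combined pattern; not the course's implementation) that does both things: it classifies each substring and returns (token, lexeme) pairs, discarding whitespace as described:

```python
import re

# One named group per token class; finditer partitions the input left to
# right. Earlier alternatives win, so Relation "==" is tried before the
# single "=" and keywords before identifiers.
SPEC = [
    ("Whitespace", r"[ \t\n]+"),
    ("Keyword",    r"\b(?:if|else)\b"),
    ("Relation",   r"=="),
    ("Integer",    r"[0-9]+"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Punct",      r"[=;()]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

def tokenize(text: str):
    """Yield (token, lexeme) pairs; whitespace is matched but discarded."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "Whitespace":
            yield (m.lastgroup, m.group())

print(list(tokenize("\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;")))
# [('Keyword', 'if'), ('Punct', '('), ('Identifier', 'i'), ('Relation', '=='), ...]
```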
Lexical Analysis in FORTRAN
FORTRAN rule: Whitespace is insignificant
- E.g., VAR1 is the same as VA R1
Historical footnote: the FORTRAN whitespace rule was motivated by the inaccuracy of punch card operators
DO 5 I = 1,25
This is a FORTRAN loop; the first token is the keyword DO.
DO 5 I = 1.25
This is an assignment; the first token is the identifier DO5I.
Two important points:
- The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
- “Lookahead” may be required to decide where one token ends and the next token begins
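A toy sketch (hypothetical; real FORTRAN lexing is more involved) of why lookahead is needed here: with blanks removed, the two statements agree until the ',' or '.' appears:

```python
# With whitespace removed, "DO5I=1,25" and "DO5I=1.25" share a long common
# prefix, so the lexer must look ahead to the ',' or '.' before it can tell
# whether the first token is the keyword DO or the identifier DO5I.
def first_token(stmt: str) -> str:
    s = stmt.replace(" ", "")            # FORTRAN: whitespace is insignificant
    if s.startswith("DO"):
        for ch in s:                     # scan ahead for the deciding character
            if ch == ",":
                return "keyword DO"      # DO 5 I = 1,25  -- a loop header
            if ch == ".":
                break                    # DO 5 I = 1.25  -- an assignment
    return "identifier " + s.split("=")[0]

print(first_token("DO 5 I = 1,25"))  # keyword DO
print(first_token("DO 5 I = 1.25"))  # identifier DO5I
```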
Lexical Analysis in C++
Unfortunately, the problems continue today. C++ template syntax:
Foo<Bar>
C++ stream syntax:
cin >> var;
But there is a conflict with nested templates, where the closing >> looks like the stream operator:
Foo<Bar<Bazz>>
We still need
- A way to describe the lexemes of each token
- A way to resolve ambiguities
- Is “if” two variables i and f?
- Is “==” two equal signs “=” “=”?
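Such ambiguities are conventionally resolved by taking the longest matching prefix ("maximal munch"; a standard rule, though not named above), preferring keywords over identifiers on ties. A Python sketch:

```python
import re

PATTERNS = [  # keywords listed first so they win ties with identifiers
    ("Keyword",    r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Relation",   r"=="),
    ("Assign",     r"="),
]

def next_token(s: str, i: int):
    """Return the (class, lexeme) of the longest match starting at position i."""
    best = None
    for name, pat in PATTERNS:
        m = re.compile(pat).match(s, i)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("if", 0))    # ('Keyword', 'if')      -- not variables i and f
print(next_token("iff", 0))   # ('Identifier', 'iff')  -- longest match wins
print(next_token("i==j", 1))  # ('Relation', '==')     -- not two '=' tokens
```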
There are several formalisms for specifying tokens. Regular languages are the most popular:
- Simple and useful theory
- Easy to understand
- Efficient implementations
Def. Let alphabet Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.
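For example (a standard illustration, not from the notes): with Σ = {0, 1}, the set {“0”, “1”, “00”, “01”, …} of all non-empty binary strings is a language over Σ; with Σ = ASCII, the set of all well-formed C programs is a language.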