文章

CS 143 笔记

Lecture 1

The Structure of a Compiler

  1. Lexical Analysis —- identify words
  2. Parsing —- identify sentences
  3. Semantic Analysis —- analyse sentences
  4. Optimization —- editing
  5. Code Generation —- translation

Can be understood by analogy to how humans comprehend English.

  1. Lexical Analysis First step: recognize words – Smallest unit above letters Lexical analyzer divides program text into “words” or “tokens”
  2. Parsing Once words are understood, the next step is to understand sentence structure Parsing = Diagramming Sentences - The diagram is a tree Pasted image 20240122195641 Pasted image 20240122195659
  3. Semantic Analysis Once sentence structure is understood, we can try to understand “meaning” - But meaning is too hard for compilers Compilers perform limited semantic analysis to catch inconsistencies Pasted image 20240122213456 Pasted image 20240122213508 Compilers perform many semantic checks besides variable bindings
  4. Optimization Akin to editing - Minimize reading time - Minimize items the reader must keep in short-term memory Automatically modify programs so that they - Run faster - Use less memory - In general, to conserve some resource
  5. Code Generation Typically produces assembly code Generally a translation into another language - Analogous to human translation

Intermediate Representations

Many compilers perform translations between successive intermediate languages

  • All but first and last are intermediate representations (IR) internal to the compiler IRs are generally ordered in descending level of abstraction
  • Highest is source
  • Lowest is assembly

IRs are useful because lower levels expose features hidden by higher levels

  • registers
  • memory layout
  • raw pointers
  • etc

But lower levels obscure high-level meaning

  • Classes
  • Higher-order functions
  • Even loops…

Lecture 2

Pasted image 20240122220058

History of Ideas: Abstraction

  • Abstraction = detached from concrete details
    • “Abstraction is selective ignorance” - Andrew Koenig
  • Abstraction is necessary to build any complex system
    • The key is information hiding—expose only the essential
  • Modes of abstraction
    • Via languages/compilers: High-level code, few machine dependencies
    • Via functions and subroutines: Abstract interface to behavior
    • Via modules: Export interfaces; hide implementation
    • Via classes/abstract data types: Bundle data with its operations

History of Ideas: Types

  • Originally, few types
    • FORTRAN: scalars, arrays
    • LISP: no static type distinctions
  • Realization: Types help
    • Lets you to express abstraction
    • Lets the compiler report many frequent errors
    • Sometimes to the point that programs are guaranteed “safe”
    • Helps the compiler optimize your code
  • More recently
    • Lots of interest in types
    • Experiments with various forms of parameterization
    • Best developed in functional programming

History of Ideas: Reuse

  • Reuse = exploit common patterns in software systems
    • Goal: mass-produced software components
    • Reuse is difficult
  • Two popular approaches
    • Type parameterization (List(int), List(double))
    • Classes and inheritance: C++ derived classes
    • C++ and Java have both
  • Inheritance allows
    • Specialization of existing abstraction
    • Extension, modification, and hidden behavior

Lexical Analysis

What do we want to do? Example:

if(i==j)
	z=0;
else
	z=1;

The input is just a string of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Goal: Partition input string into substrings

  • Where the substrings are called tokens

Token

A token class corresponds to a set of strings Examples:

  • Identifier: strings of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: “else” or “if” or “begin” or
  • Whitespace: a non-empty sequence of blanks, newlines, and tabs

Lexical analysis produces a stream of tokens which is input to the parser Parser relies on token distinctions

  • An identifier is treated differently than a keyword

Designing a Lexical Analyzer

  1. Define a finite set of tokens
    • Tokens describe all items of interest
      • Identifiers, integers, keywords
    • Choice of tokens depends on
      • language
      • design of parser

Recall

1
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Useful tokens for this expression: Integer, Keyword, Relation, Identifier, Whitespace, (, ), , ;

  1. Describe which strings belong to each token Recall:
    • Identifier: strings of letters or digits, starting with a letter
    • Integer: a non-empty string of digits
    • Keyword: “else” or “if” or “begin” or …
    • Whitespace: a non-empty sequence of blanks, newlines, and tabs

An implementation must do two things:

  • Classify each substring as a token
  • Return the value or lexeme (value) of the token
    • The lexeme is the actual substring
    • From the set of substrings that make up the token The lexer thus returns token-lexeme pairs
  • And potentially also line numbers, file names, etc. to improve later error messages

The lexer usually discards “uninteresting” tokens that don’t contribute to parsing

  • Examples: Whitespace, Comments

Lexical Analysis in FORTRAN

FORTRAN rule: Whitespace is insignificant

  • E.g., VAR1 is the same as VA R1

Historical footnote: FORTRAN Whitespace rule motivated by inaccuracy of punch card operators

1
DO 5 I = 1,25

This is a fortran loop. The first token is `Do

1
DO 5 I = 1.25

This is a assignment. The first toekn is DO5I

Two important points:

  1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
  2. “Lookahead” may be required to decide where one token ends and the next token begins

Lexical Analysis in C++

Unfortunately, the problems continue today C++ template syntax:

Foo<Bar>

C++ stream syntax:

cin >> var;

But there is a conflict with nested templates:

Foo<Bar<Bazz>>

We still need

  • A way to describe the lexemes of each token
  • A way to resolve ambiguities
    • Is if two variables i and f?
    • Is == two equal signs = =?

There are several formalisms for specifying tokens Regular languages are the most popular

  • Simple and useful theory
  • Easy to understand
  • Efficient implementations

Def. Let alphabet Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ.

Pasted image 20240123001438

Pasted image 20240123001515 Pasted image 20240123001537 Pasted image 20240123001633 Pasted image 20240123001644 Pasted image 20240123001607 Pasted image 20240123001616 Pasted image 20240123001709 Pasted image 20240123001719

Implementation of Lexical Analysis

本文由作者按照 CC BY 4.0 进行授权

热门标签