École Polytechnique

INF564 – Compilation

Jean-Christophe Filliâtre

optimizing compiler (1/2)
goal for the next two lectures: writing an **optimizing compiler**

we intend to use x86-64 in the best possible way, notably

- its 16 registers
  - to pass parameters and to return results
  - for intermediate computations

- its instructions
  - such as the ability to add a constant to a register

```
add $3, %rdi
```
emitting optimized code in a single pass is doomed to failure

we decompose code production into several phases

1. instruction selection
2. RTL (Register Transfer Language)
3. ERTL (Explicit Register Transfer Language)
4. LTL (Location Transfer Language)
5. linearization

(we follow the architecture of the CompCert compiler by Xavier Leroy; see http://compcert.inria.fr/)
the starting point the abstract syntax tree output by the type checker

\[
\begin{align*}
T\text{tree} & \quad \downarrow \text{Is} \\
I\text{stree} & \quad \downarrow \text{Rtl} \\
R\text{tltree} & \quad \downarrow \text{Ertl} \\
E\text{rltree} & \quad \downarrow \text{Ltl} \\
L\text{ltree} & \quad \downarrow \text{Lin} \\
\text{X86/64} &
\end{align*}
\]
remark

this compiler architecture is independent of the programming paradigm (imperative, functional, object oriented, etc.)

it is illustrated on a small fragment of C
a small fragment of C with

- integers (type `int`)
- heap-allocated structures, only pointers to structures, no pointer arithmetic
- functions
- library functions `putchar` and `malloc`

to keep it simple, we assume 64-bit signed integers for values of type `int` (unusual, but standard compliant) so that integers and pointers have the same size
\[
\begin{align*}
E & \rightarrow n \\
& \quad | \ L \\
& \quad | \ L = E \\
& \quad | \ E \ op \ E \ | \ - \ E \ | \ ! \ E \\
& \quad | \ x(E, \ldots, E) \\
& \quad | \ \text{sizeof}(\text{struct } x) \\
L & \rightarrow x \\
& \quad | \ E->x \\
op & \rightarrow \text{==} | \text{!=} | \text{<} | \text{<=} | \text{>} | \text{>=} \\
& \quad | \ & \& | \ | | \ + | \ - | \ * | \ /
D & \rightarrow T \ x(T \ x, \ldots, T \ x) \ B \\
& \quad | \ \text{struct } x \ \{ \ V \ldots V \}; \\
P & \rightarrow D \ldots D \\
S & \rightarrow ; \\
& \quad | \ E; \\
& \quad | \ \text{if } (E) \ S \ \text{else } S \\
& \quad | \ \text{while } (E) \ S \\
& \quad | \ \text{return } E; \\
B & \rightarrow \{ \ V \ldots V \ S \ldots S \ \} \\
V & \rightarrow \text{int } x, \ldots, x; \\
& \quad | \ \text{struct } x \ \ast x, \ldots, \ast x; \\
T & \rightarrow \text{int} | \ \text{struct } x \ \ast \\
\end{align*}
\]
```c
int fact(int x) {
    if (x <= 1) return 1;
    return x * fact(x-1);
}

struct list { int val; struct list *next; };  

int print(struct list *l) {
    while (l) {
        putchar(l->val);
        l = l->next;
    }
    return 0;
}
```
we assume that type checking is done

in particular, we know the type of any sub-expression

note: for mini-C, types are not useful for code generation; yet,

• type checking ensures some form of safety e.g. we do not confuse an integer and a pointer

• for a larger fragment of C, types would be needed e.g. to select signed vs unsigned operations, to perform pointer arithmetic, etc.
phase 1: instruction selection

the first phase is **instruction selection**

goal:

- replace C arithmetic operations with x86-64 operations
- replace structure field access with explicit memory access
naively, we can simply translate each C arithmetic operation with the corresponding x86-64 operation

however, x86-64 provides us with better instructions in some cases, notably

- addition of a register and a constant
- bit shifting to the left or to the right, corresponding to a multiplication or a division by a power of 2
- comparison of a register and a constant
beside, it is advisable to perform as much evaluation as possible during compilation (partial evaluation)

examples: in some cases, we can simplify

- \((1 + e_1) + (2 + e_2)\) into \(e_1 + e_2 + 3\)
- \(e + 1 < 10\) into \(e < 9\)
- !\((e_1 < e_2)\) into \(e_1 \geq e_2\)
- \(0 \times e\) into 0

**important:** the semantics must be preserved
if some left/right evaluation order would be specified, we could simplify $(0 - e_1) + e_2$ into $e_2 - e_1$ only when $e_1$ and $e_2$ do not interfere

for instance if $e_1$ and $e_2$ are pure i.e. without side effect

with C, the evaluation order is not specified, so we can make the simplification
with unsigned C arithmetic, we could not replace $e + 1 < 10$ with $e < 9$ since $e + 1$ may be 0 by arithmetic overflow (the standard says that unsigned arithmetic wraps around)

if $e$ is the greatest integer, $e + 1 < 10$ holds but $e < 9$ does not

with signed arithmetic, however, arithmetic overflow is an undefined behavior (meaning that the compiler may choose any behavior)

consequently, we can turn $e + 1 < 10$ into $e < 9$ with type int
example 3

we can replace $0 \times e$ with 0 only if expression $e$ has no side effect

since our expressions may involve function calls, checking whether $e$ has no effect is not decidable

but we can over-approximate the absence of effect

\[
\begin{align*}
pure(n) &= true \\
pure(x) &= true \\
pure(e_1 + e_2) &= pure(e_1) \land pure(e_2) \\
\vdots \\
pure(e_1 = e_2) &= false \\
pure(f(e_1, \ldots, e_n)) &= false \quad \text{(we don't know)}
\end{align*}
\]
to implement partial evaluation, we can use **smart constructors**

A smart constructor behaves like a syntax tree constructor but it performs some simplifications on the fly.

Example: for addition, we introduce a smart constructor such as

```ocaml
val mk_add : expr -> expr -> expr (* OCaml *)
```

```java
Expr mkAdd(Expr e1, Expr e2) // Java
```
smart constructor for addition

Here are some simplifications for addition:

\[
\begin{align*}
\text{mkAdd}(n_1, n_2) &= n_1 + n_2 \\
\text{mkAdd}(0, e) &= e \\
\text{mkAdd}(e, 0) &= e \\
\text{mkAdd}(	ext{addi } n_1 e, n_2) &= \text{mkAdd}(n_1 + n_2, e) \\
\text{mkAdd}(n, e) &= \text{addi } n e \\
\text{mkAdd}(e, n) &= \text{addi } n e \\
\text{mkAdd}(e_1, e_2) &= \text{add } e_1 e_2 \text{ otherwise}
\end{align*}
\]
of course, the smart constructor must terminate

one has to figure out a positive measure over expressions that strictly decreases at each recursive call of the smart constructor
instruction selection is where we can make a good use of lea (see lab 1)
instruction selection is then a recursive process over the expressions

\[
\begin{align*}
IS(e_1 + e_2) &= \text{mkAdd}(IS(e_1), IS(e_2)) \\
IS(e_1 - e_2) &= \text{mkSub}(IS(e_1), IS(e_2)) \\
IS(! e_1) &= \text{mkNot}(IS(e_1)) \\
IS(-e_1) &= \text{mkSub}(0, IS(e_1)) \\
\vdots
\end{align*}
\]

and a morphism for the other constructs
instruction selection also introduces explicit memory access, written load and store.

A memory address is given by an expression together with a constant offset (so that we make good use of indirect addressing).
in our case, structure fields reads and assignments are turned into memory accesses

we have a simple schema where each field is exactly one word long (since type \texttt{int} is assumed to be 64 bits)

so

\[
\begin{align*}
\text{IS}(e_1->x) &= \text{load} \ IS(e_1) (n \times \text{wordsize}) \\
\text{IS}(e_1->x = e_2) &= \text{store} \ IS(e_1) (n \times \text{wordsize}) \ IS(e_2)
\end{align*}
\]

where \( n \) is the index for field \( x \) in the structure and \( \text{wordsize} = 8 \) (64 bits)
with the following structure

```c
struct S { int a; int b; };
```

the instruction selection for expression

```c
p->a = p->b + 2
```

is

```c
store p 0 (addi 2 (load p 8))
```
instruction selection is a morphism over statements (if, while, etc.)

for functions, we erase types (not needed anymore) and we gather all variables at the function level (type checking made all variables distinct)
```c
struct list {
    int val;
    struct list *next;
};

int print(struct list *l) {
    struct list *p;
    p = l;
    while (p) {
        int c;
        c = p->val;
        putchar(c);
        p = p->next;
    }
    return 0;
}

// no need for type list
// anymore

print(l) {
    locals p, c;
    p = l;
    while (p) {
        int c;
        c = load p 0;
        putchar(c);
        p = load p 8;
    }
    return 0;
}
```
the classic factorial

```c
int fact(int x) {
    if (x <= 1) return 1;
    return x * fact(x-1);
}
```

```c
fact(x) {
    locals:
    if (setle x $1) return 1;
    return imul x fact(addi $-1 x);
}
```
the next phase transforms the code to the language **RTL** (*Register Transfer Language*)

goal:

- get rid of the tree structure of expressions and statements, in favor of a **control-flow graph** (CFG), to ease further phases; in particular, we make no distinction between expressions and statements anymore

- introduce **pseudo-registers** to hold function parameters and intermediate computations; there are infinitely many pseudo-registers, that will later be either x86-64 registers or stack locations
let us consider the C expression

\[ b \times (3 + d) \]

that is the syntax tree

\[
\begin{array}{c}
* \\
/ \ \\
\downarrow \\
+ \\
/ \\
\downarrow \\
3 \ \\
\downarrow \\
d
\end{array}
\]

let us assume that \( b \) and \( d \) are in pseudo-registers \#1 and \#2

and the final value in \#3

then we build a CFG such as

\[
\begin{align*}
L_1 & : \text{mov } \#1 \#4 \\
L_2 & : \text{mov } \#2 \#5 \\
L_3 & : \text{add } \$3 \#5 \\
L_4 & : \text{mov } \#4 \#3 \\
L_5 & : \text{imul } \#5 \#3
\end{align*}
\]
the CFG is a map from labels (program points) to RTL instructions

conversely, each RTL instruction lists the labels of the next instructions

for instance, the RTL instruction

\[
\text{mov } n \ r \rightarrow L
\]

means “load constant \( n \) int pseudo-register \( r \) and transfer control to label \( L \)”
mov \( n \ r \rightarrow L \)
load \( n(r_1) \ r_2 \rightarrow L \)
store \( r_1 \ n(r_2) \rightarrow L \)
unop \( op \ r \rightarrow L \)
binop \( op \ r_1 \ r_2 \rightarrow L \)
ubranch \( br \ r \rightarrow L_1, L_2 \)
bbranch \( br \ r_1 \ r_2 \rightarrow L_1, L_2 \)
call \( r \leftarrow f(r_1, \ldots, r_n) \rightarrow L \)
goto \rightarrow L \)

unary operation (neg, etc.)
binary operation (add, mov, etc.)
unary branching (jz, etc.)
binary branching (jle, etc.)
we build a separate CFG for each function, with its own pseudo-registers (intra-procedural analysis)

we build the CFG from bottom to top, which means we always know the label of the continuation (the next instructions)
to translate an expression, we provide

- a pseudo-register $r_d$ to receive its value
- a label $L_d$ corresponding to the continuation

we return the label of the entry point for the evaluation of the expression

$$RTL(e, r_d, L_d)$$
the translation is pretty straightforward

\[
RTL(n, r_d, L_d) = \text{add } L_1 : \text{mov } n r_d \rightarrow L_d \quad \text{with } L_1 \text{ fresh}
\]

\[
RTL(e_1 + e_2, r_d, L_d) = \text{add } L_3 : \text{add } r_2 r_d \rightarrow L_d \quad \text{with } r_2, L_3 \text{ fresh}
\]

\[
L_2 \leftarrow RTL(e_2, r_2, L_3)
\]

\[
L_1 \leftarrow RTL(e_1, r_d, L_2)
\]

return \(L_1\)

(\text{etc.})

(\text{read the code from bottom to top})
for local variables, we set up a table where each variable is mapped to a fresh pseudo-register

then reading or writing a local variable is a `mov` instruction (one of the RTL binary operations)
to translate C operations `&&` and `||`, as well as `if` and `while` statements, we use RTL **branching** instructions

example:

```c
if (p != 0 && p->val == 1)
  ...branch 1...
else
  ...branch 2...
```

(the four blocks are sub-graphs)
to translate a condition, we provide two labels
- a label $L_t$ corresponding to the continuation if the condition holds
- a label $L_f$ when it does not hold

we return the label of the entry point for the evaluation of the condition

$$RTL_c(e, L_t, L_f)$$
translating a condition

\[
\begin{align*}
RTL_c(e_1 \&\& e_2, L_t, L_f) &= RTL_c(e_1, RTL_c(e_2, L_t, L_f), L_f) \\
RTL_c(e_1 \| |\| e_2, L_t, L_f) &= RTL_c(e_1, L_t, RTL_c(e_2, L_t, L_f)) \\
RTL_c(e_1 \leq e_2, L_t, L_f) &= \text{add } L_3: \text{bbranch } jle \ r_2 \ r_1 \rightarrow L_t, L_f \\
&\quad L_2 \leftarrow RTL(e_2, r_2, L_3) \\
&\quad L_1 \leftarrow RTL(e_1, r_1, L_2) \\
&\quad \text{return } L_1 \\
RTL_c(e, L_t, L_f) &= \text{add } L_2: \text{ubranch } jz \ r \rightarrow L_f, L_t \\
&\quad L_1 \leftarrow RTL(e, r, L_2) \\
&\quad \text{return } L_1
\end{align*}
\]

(of course, we can handle more particular cases)
to translate `return`, we provide a pseudo-register $r_{ret}$ to receive the function result and a label $L_{ret}$ corresponding to the function exit

\[
RTL(\; , L_d) &= \text{return } L_d \\
RTL(\text{return } e; , L_d) &= RTL(e, r_{ret}, L_{ret}) \\
RTL(\text{if}(e) s_1 \text{ else } s_2, L_d) &= RTL_c(e, RTL(s_1, L_d), RTL(s_2, L_d))
\]

etc.
for a while loop, we have to build a cycle in the CFG

```
while (e) {
    ...
}
```
The natural text representation of the diagram and equation is:

\[ \text{RTL(while(e)s, L_d)} = L_e \leftarrow \text{RTL}_{\text{c}}(e, , \text{RTL}(s, L), L_d) \]

add \( L : \text{goto} \ L_e \)

return \( L_e \)

Diagram:
- Node e connected to node s, s connected to node goto, goto connected to node e, and e connected back to L_e.
the formal parameters of a function, and its result, now are pseudo-registers

\[
#3 \ f(#1, \ #2) \{ \ \ldots \ \}
\]

as well as actual parameters and result in a call

\[
\text{call } \ #4 \leftarrow \ f(#5, \ #6)
\]
translating a function involves the following steps:

1. we allocate fresh pseudo-registers for its parameters, its result, and its local variables
2. we start with an empty graph
3. we pick a fresh label for the function exit
4. we translate the function body to RTL code, and the output is the entry label in the CFG
with the factorial function

```c
int fact(int x) {
    if (x <= 1) return 1;
    return x * fact(x-1);
}
```

we get

```
#2 fact(#1)
entry : L10
exit  : L1
locals:
L10: mov #1 #6 --> L9
L9  : jle $1 #6 --> L8, L7
L8  : mov $1 #2 --> L1
L7  : mov #1 #5 --> L6
L6  : add $-1 #5 --> L5
L5  : call #3<-fact(#5)---> L4
L4  : mov #1 #4 --> L3
L3  : mov #3 #2 --> L2
L2  : imul #4 #2 --> L1
```

(the graph is printed arbitrarily)
with the factorial function

```c
int fact(int x) {
    if (x <= 1) return 1;
    return x * fact(x-1);
}
```

we get
with a loop

```c
int loop(int x) {
    int r;
    r = 1;
    while (2 <= x) {
        r = r * x;
        x = x - 1;
    }
    return r;
}
```

#2 loop(#1)

- **entry**: L12
- **exit**: L1
- **locals**: #3

L12: mov $1 #3 --> L11
L11: mov $2 #6 --> L10
L10: mov #1 #7 --> L9
L9: jle #7 #6 --> L8, L2
L8: mov #3 #4 --> L7
L7: mov #1 #5 --> L6
L6: mov #4 #3 --> L5
L5: imul #5 #3 --> L4
L4: add $-1 #1 --> L3
L3: goto L11
L2: mov #3 #2 --> L1
with a loop

```c
int loop(int x) {
    int r;
    r = 1;
    while (2 <= x) {
        r = r * x;
        x = x - 1;
    }
    return r;
}
```
phase 3: ERTL

the third phase turns RTL into **ERTL** (Explicit Register Transfer Language)

goal: make **calling conventions** explicit, namely here

- the first six parameters are passed in %rdi, %rsi, %rdx, %rcx, %r8, %r9 and the next ones on the stack
- the result is returned in %rax
- in particular, putchar and malloc are library functions with a parameter in %rdi and a result in %rax
- the division `idivq` requires dividend and quotient in %rax
- callee-saved registers must be preserved by the callee (%rbx, %r12, %r13, %r14, %r15, %rbp)
the stack frame is as follows:

\[
\begin{array}{c|c|c}
\%rbp & \vdots & \%rbp \\
\hline
\text{param. } n & \vdots & \text{saved } \%rbp \\
\hline
\text{param. } 7 & \vdots & \text{local 1} \\
\hline
\text{return addr.} & \vdots & \text{local } m \\
\hline
\%rsp & \vdots & \\
\end{array}
\]

the \( m \) local variables area will hold all the pseudo-registers that could not be allocated to physical registers; register allocation (phase 4) will determine the value of \( m \).
ERTL instructions (1/3)

in ERTL, we have those same instructions as in RTL:

- `mov n r → L`
- `load n(r₁) r₂ → L`
- `store r₁ n(r₂) → L`
- `unop op r → L`
- `binop op r₁ r₂ → L`
- `ubranch br r → L₁, L₂`
- `bbranch br r₁ r₂ → L₁, L₂`
- `goto → L`

unary operation (neg, etc.)

binary operation (add, mov, etc.)

unary branching (jz, etc.)

binary branching (jle, etc.)
in RTL, we had

\[
\text{call } r \leftarrow f(r_1, \ldots, r_n) \rightarrow L
\]

in ERTL, we now have

\[
\text{call } f(k) \rightarrow L
\]

\textit{i.e.} we are only left with the name of the function to call, since new instructions will be inserted to load parameters into registers and stack, and to get the result from %rax

we only keep the number \(k\) of parameters passed into registers (to be used in phase 4)
finally, we have \textbf{new} instructions:

\begin{itemize}
\item \texttt{alloc} \texttt{frame} \rightarrow L \quad \text{allocate the stack frame}
\item \texttt{delete} \texttt{frame} \rightarrow L \quad \text{delete the stack frame}
\quad \quad \text{(note: we do not know its size yet)}
\item \texttt{get} \texttt{param} \ n \ r \rightarrow L \quad \text{access a parameter on stack (with \texttt{n}(%rbp))}
\item \texttt{push} \texttt{param} \ r \rightarrow L \quad \text{push the value of } r
\item \texttt{return} \quad \text{explicit return instruction}
\end{itemize}
we do not change the structure of the control-flow graph; we simply **insert new instructions**

- at the beginning of each function, to
  - allocate the stack frame
  - save the callee-saved registers
  - copy the parameters into the corresponding pseudo-registers

- at the end of each function, to
  - copy the pseudo-register holding the result into `%rax`
  - restore the callee-saved registers
  - delete the stack frame
  - execute `return`

- around each function call, to
  - copy the pseudo-registers holding the parameters into `%rdi`, ... and one the stack before the call
  - copy `%rax` into the pseudo-register holding the result after the call
  - pop the parameters, if any
we translate each RTL instruction to one/several ERTL instructions
mostly the identity operation, except for calls and division
dividend and quotient are in %rax

the RTL instruction

\[ L_1 : \text{binop div } r_1 \ r_2 \rightarrow L \]

becomes three ERTL instructions

\[ L_1 : \text{binop mov } r_2 \ %rax \rightarrow L_2 \]
\[ L_2 : \text{binop div } r_1 \ %rax \rightarrow L_3 \]
\[ L_3 : \text{binop mov } %rax \ r_2 \rightarrow L \]

where \( L_2 \) and \( L_3 \) are fresh labels

(beware of the direction: here we divide \( r_2 \) by \( r_1 \))
we translate the RTL instruction

\[ L_1 : \text{call } r \leftarrow f(r_1, \ldots, r_n) \rightarrow L \]

into a sequence of ERTL instructions

1. copy \( \min(n, 6) \) parameters \( r_1, r_2, \ldots \) into %rdi,%rsi,...
2. if \( n > 6 \), pass other parameters on the stack with push_param
3. execute call \( f(\min(n, 6)) \)
4. copy %rax into \( r \)
5. if \( n > 6 \), pop \( 8 \times (n - 6) \) bytes

that starts at the same label \( L_1 \) and transfers the control at the end to the same label \( L \)
the RTL code

L5: call #3 <- fact(#5) --> L4

is translated into the ERTL code

L5 : mov #5 %rdi --> L12
L12: call fact(1) --> L11
L11: mov %rax #3 --> L4
translating functions

RTL

\[ r \ f(r, \ldots, r) \]

locals : \{ r, \ldots \}

entry : L

exit : L

cfg : ...

ERTL

\[ f(n) \]

locals : \{ r, \ldots \}

entry : L

cfg : ...

Jean-Christophe Filliâtre

INF564 – Compilation

optimizing compiler (1/2) 57
for each callee-saved register, we allocate a fresh pseudo-register to save it, that we add to the local variables of the function

note: for the moment, we do not try to figure out which callee-saved registers will be used by the function
at the function entry, we

- allocate the stack frame with \texttt{alloc-frame}
- save the callee-saved registers
- copy the parameters into their pseudo-registers
to make things simpler, we here assume that callee-saved registers are limited to %rbx and %r12 (in practice, we also have %r13, %r14, %r15)
at function exit, we

- copy the pseudo-register holding the result into %rax
- restore the saved registers
- delete the stack frame
RTL

#2 fact(#1)
entry : L10
exit : L1
locals:
...
L8 : mov $1 #2 --> L1
...
L2 : imul #4 #2 --> L1

ERTL

fact(1)
entry : L17
locals: #7, #8
...
L8 : mov $1 #2 --> L1
...
L2 : imul #4 #2 --> L1
L1 : mov #2 %rax --> L21
L21: mov #7 %rbx --> L20
L20: mov #8 %r12 --> L19
L19: delete_frame --> L18
L18: return
altogether, we get the following ERTL code:

```
fact(1)
   entry : L17
   locals: #7,#8
L17: alloc_frame --> L16
L16: mov %rbx #7 --> L15
L15: mov %r12 #8 --> L14
L14: mov %rdi #1 --> L10
L10: mov #1 #6 --> L9
L9: jle $1 #6 --> L8, L7
L8: mov $1 #2 --> L1
L1: goto --> L22
L22: mov #2 %rax --> L21
L21: mov #7 %rbx --> L20
L20: mov #8 %r12 --> L19
L19: delete_frame --> L18
L18: return
L7: mov #1 #5 --> L6
L6: add $-1 #5 --> L5
L5: goto --> L13
L13: mov #5 %rdi --> L12
L12: call fact(1) --> L11
L11: mov %rax #3 --> L4
L4: mov #1 #4 --> L3
L3: mov #3 #2 --> L2
L2: imul #4 #2 --> L1
```
this is far from being what we think is a good x86-64 code for the factorial

at this point, we have to understand that

• register allocation (phase 4) will try to match physical registers to pseudo-registers to minimize the use of the stack and the number of mov

  if for instance we map #8 to %r12, we remove the two instructions at L15 and L20

• the code is not linearized yet (the graph is simply printed in some arbitrary order); this will be done in phase 5, where we will try to minimize jumps
if we intend to optimize tail calls, it has to be done during the RTL to ERTL translation

indeed, the ERTL instructions will differ, and this change influences the next phase (register allocation)
there is a difficulty, however, if the called function in a tail call does not have the same number of stack parameters or of local variables, since the stack frame has to be modified

at least two solutions

- limit tail call optimization to cases where the stack frame has the same layout; this is the case for recursive calls!
- the caller patches the stack frame and transfers the control after the instructions that allocate the stack frame
phase 4: LTL

the next phase translates ERTL to **LTL** (Location Transfer Language)

the goal is to get rid of pseudo-registers, replacing them with

- physical registers preferably
- stack locations otherwise

this is called **register allocation**
register allocation is complex, and decomposed into several steps

1. we perform a **liveness analysis**
   - it tells when the value contained in a pseudo-register is needed for the remaining of the computation

2. we build an **interference graph**
   - it tells what are the pseudo-registers that cannot be mapped to the same location

3. we allocate registers using a **graph coloring**
   - it maps pseudo-registers to physical registers or stack locations
in the following, a *variable* stands for a pseudo-register or a physical register

**Definition (live variable)**

*Given a program point, a variable is said to be **live** if the value it contains is likely to be used in the remaining of the computation.*

we say “is likely” since “is used” is not decidable; so we seek for a sound over-approximation
live variables are drawn on edges

```
mov $0 a
mov $1 b
L1: mov a c
    mov b a
    add c b
    jl $1000 b L1
mov a %rax
```

```
live variables can be deduced from **definitions** and **uses** of variables by the various instructions

### Definition

*For an instruction at label $l$ in the control-flow graph, we write*

- $\text{def}(l)$ *for the set of variables defined by this instruction,*
- $\text{use}(l)$ *for the set of variables used by this instruction.*

### Example:

For the instruction `add r1 r2` we have

\[
\text{def}(l) = \{r_2\} \quad \text{and} \quad \text{use}(l) = \{r_1, r_2\}
\]
computing live variables

to compute live variables, it is handy to map them to labels in the control-flow graph (instead of edges)

but then we have to distinguish between variables live at entry and variables live at exit of a given instruction

Definition

For an instruction at label $l$ in the control-flow graph, we write

- $\text{in}(l)$ for the set of live variables on the set of incoming edges to $l$,
- $\text{out}(l)$ for the set of live variables on the set of outcoming edges from $l$. 
the equations defining $in(l)$ and $out(l)$ are the following

\[
\begin{align*}
    in(l) &= use(l) \cup (out(l) \setminus \text{def}(l)) \\
    out(l) &= \bigcup_{s \in \text{succ}(l)} in(s)
\end{align*}
\]

these are mutually recursive functions and we seek for the smallest solution.

we are in the case of a monotonous function over a finite domain and thus we can use Tarski’s theorem (see lecture 3)
\[
\begin{align*}
\text{in}(l) &= \text{use}(l) \cup (\text{out}(l) \setminus \text{def}(l)) \\
\text{out}(l) &= \bigcup_{s \in \text{succ}(l)} \text{in}(s)
\end{align*}
\]

<table>
<thead>
<tr>
<th></th>
<th>use</th>
<th>def</th>
<th>in</th>
<th>out</th>
<th>in</th>
<th>out</th>
<th>in</th>
<th>out</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>a</td>
<td>a</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>a</td>
</tr>
<tr>
<td>2</td>
<td>b</td>
<td>b</td>
<td></td>
<td>a</td>
<td></td>
<td></td>
<td></td>
<td>a</td>
</tr>
<tr>
<td>3</td>
<td>a</td>
<td>c</td>
<td>a</td>
<td>a</td>
<td>b</td>
<td></td>
<td>...</td>
<td>a,b</td>
</tr>
<tr>
<td>4</td>
<td>b</td>
<td>a</td>
<td>b</td>
<td>b</td>
<td>b,c</td>
<td></td>
<td>...</td>
<td>b,c</td>
</tr>
<tr>
<td>5</td>
<td>b,c</td>
<td>b</td>
<td>b,c</td>
<td>b,c</td>
<td>b</td>
<td></td>
<td>...</td>
<td>a,b,c</td>
</tr>
<tr>
<td>6</td>
<td>b</td>
<td>b</td>
<td>b</td>
<td>b</td>
<td>a</td>
<td></td>
<td>...</td>
<td>a,b</td>
</tr>
<tr>
<td>7</td>
<td>a</td>
<td>a</td>
<td>a</td>
<td>a</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

we get the fixpoint with 7 iterations
assuming the control-flow graph has $N$ nodes and $N$ variables, a brute force computation has complexity $O(N^3)$ in the worst case

we can improve efficiency in several ways

- traversing the graph in "reverse order" and computing $out$ before $in$ (on the previous example, we converge in 3 iterations instead of 7)
- merging nodes with a unique predecessor and a unique successor (basic blocks)
- using a more subtle algorithm that only recomputes the $in$ and $out$ that may have changed; this is Kildall’s algorithm
Kildall’s algorithm

idea: if $in(l)$ changes, then we only need to redo the computation for the predecessors of $l$

\[
\begin{align*}
\{ 
\quad \text{out}(l) &= \bigcup_{s \in \text{succ}(l)} \text{in}(s) \\
\quad \text{in}(l) &= \text{use}(l) \cup (\text{out}(l) \setminus \text{def}(l))
\}
\]

here is the algorithm:

let $WS$ be a set containing all nodes
while $WS$ is not empty
    remove a node $l$ from $WS$
    old_in $\leftarrow$ in($l$)
    out($l$) $\leftarrow$ ...
    in($l$) $\leftarrow$ ...
    if in($l$) is different from old_in($l$) then
        add all predecessors of $l$ in $WS$
computing the sets $\text{def}(l)$ (definitions) and $\text{use}(l)$ (uses) is straightforward for most instructions.

Examples:

<table>
<thead>
<tr>
<th>Instruction</th>
<th>def</th>
<th>use</th>
</tr>
</thead>
<tbody>
<tr>
<td>mov $n$ $r$</td>
<td>${r}$</td>
<td>$\emptyset$</td>
</tr>
<tr>
<td>mov $r_1$ $r_2$</td>
<td>${r_2}$</td>
<td>${r_1}$</td>
</tr>
<tr>
<td>unop $op$ $r$</td>
<td>${r}$</td>
<td>${r}$</td>
</tr>
<tr>
<td>goto</td>
<td>$\emptyset$</td>
<td>$\emptyset$</td>
</tr>
</tbody>
</table>
this is more subtle for function calls

for a call, we express that any caller-saved register may be erased by the call

\[
\begin{array}{c|c|c}
\text{call } f(k) & \text{def} & \text{use} \\
\hline
\text{caller-saved} & \text{the first } k \text{ registers of } %rdi, %rsi, \ldots, %r9 \\
\end{array}
\]

last, for return, we express that %rax and all callee-saved registers may be used

\[
\begin{array}{c|c|c}
\text{return} & \text{def} & \text{use} \\
\hline
\emptyset & \{ %rax \} \cup \text{callee-saved} \\
\end{array}
\]
this was the ERTL code for fact

```
fact(1)
  entry : L17
  locals: #7,#8
L17: alloc_frame --> L16
L16: mov %rbx #7 --> L15
L15: mov %r12 #8 --> L14
L14: mov %rdi #1 --> L10
L10: mov #1 #6 --> L9
L9 : jle $1 #6 --> L8, L7
L8 : mov $1 #2 --> L1
L1 : goto --> L22
L22: mov #2 %rax --> L21
L21: mov #7 %rbx --> L20
L20: mov #8 %r12 --> L19
L19: delete_frame --> L18
L18: return
L7 : mov #1 #5 --> L6
L6 : add $-1 #5 --> L5
L5 : goto --> L13
L13: mov #5 %rdi --> L12
L12: call fact(1) --> L11
L11: mov %rax #3 --> L4
L4 : mov #1 #4 --> L3
L3 : mov #3 #2 --> L2
L2 : imul #4 #2 --> L1
```
liveness for fact

<table>
<thead>
<tr>
<th>Line</th>
<th>Instruction</th>
<th>In</th>
<th>Out</th>
</tr>
</thead>
<tbody>
<tr>
<td>L17</td>
<td>alloc_frame</td>
<td>%r12,%rbx,%rdi</td>
<td>%r12,%rbx,%rdi</td>
</tr>
<tr>
<td>L16</td>
<td>mov %rbx #7</td>
<td>%r12,%rbx,%rdi</td>
<td>#7,%r12,%rdi</td>
</tr>
<tr>
<td>L15</td>
<td>mov %r12 #8</td>
<td>#7,%r12,%rdi</td>
<td>#7,#8,%rdi</td>
</tr>
<tr>
<td>L14</td>
<td>mov %rdi #1</td>
<td>#7,#8,%rdi</td>
<td>#1,#7,#8</td>
</tr>
<tr>
<td>L13</td>
<td>mov %r12 #1</td>
<td>#1,%7,#8</td>
<td>#1,#6,#7,#8</td>
</tr>
<tr>
<td>L12</td>
<td>jle $1 #6</td>
<td>#7,%r12,%rdi</td>
<td>#1,#7,#8</td>
</tr>
<tr>
<td>L11</td>
<td>mov $1 #2</td>
<td>#7,#8</td>
<td>#2,#7,#8</td>
</tr>
<tr>
<td>L10</td>
<td>mov #1 #6</td>
<td>#1,#7,#8</td>
<td>#1,#6,#7,#8</td>
</tr>
<tr>
<td>L9</td>
<td>jle $1 #6</td>
<td>#7,#8</td>
<td>#2,#7,#8</td>
</tr>
<tr>
<td>L8</td>
<td>mov #1 #2</td>
<td>#7,#8</td>
<td>#1,#5,#7,#8</td>
</tr>
<tr>
<td>L7</td>
<td>mov #1 #5</td>
<td>#2,#7,#8</td>
<td>#1,#5,#7,#8</td>
</tr>
<tr>
<td>L6</td>
<td>add $-1 #5</td>
<td>#1,#5,#7,#8</td>
<td>#1,#5,#7,#8</td>
</tr>
<tr>
<td>L5</td>
<td>goto</td>
<td>#1,#5,#7,#8</td>
<td>#1,#5,#7,#8</td>
</tr>
<tr>
<td>L4</td>
<td>call fact(1)</td>
<td>#1,#5,#7,#8</td>
<td>#3,#4,#7,#8</td>
</tr>
<tr>
<td>L3</td>
<td>mov #3 #2</td>
<td>#1,#3,#7,#8</td>
<td>#2,#4,#7,#8</td>
</tr>
<tr>
<td>L2</td>
<td>imul #4 #2</td>
<td>#1,#3,#7,#8</td>
<td>#2,#7,#8</td>
</tr>
<tr>
<td>L1</td>
<td>goto</td>
<td>#1,#3,#7,#8</td>
<td>#1,#3,#7,#8</td>
</tr>
<tr>
<td>L22</td>
<td>mov #2 %rax</td>
<td>#1,#5,#7,#8</td>
<td>#2,#7,#8</td>
</tr>
<tr>
<td>L21</td>
<td>mov #7 %rbx</td>
<td>#7,#8,%rax</td>
<td>#7,#8,%rbx</td>
</tr>
<tr>
<td>L20</td>
<td>mov #8 %r12</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L19</td>
<td>delete_frame</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L18</td>
<td>return</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L17</td>
<td>alloc_frame</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L16</td>
<td>mov %rbx #7</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L15</td>
<td>mov %r12 #8</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L14</td>
<td>mov %rdi #1</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L13</td>
<td>mov %r12 #1</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L12</td>
<td>call fact(1)</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L11</td>
<td>mov %rax #3</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L10</td>
<td>mov #1 #6</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L9</td>
<td>jle $1 #6</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L8</td>
<td>mov $1 #2</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L7</td>
<td>mov #1 #5</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L6</td>
<td>add $-1 #5</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L5</td>
<td>goto</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L4</td>
<td>call fact(1)</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L3</td>
<td>mov #3 #2</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L2</td>
<td>imul #4 #2</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
<tr>
<td>L1</td>
<td>goto</td>
<td>#8,%rax,%rbx</td>
<td>#8,%rax,%rbx</td>
</tr>
</tbody>
</table>

Jean-Christophe Filliâtre

INF564 – Compilation

optimizing compiler (1/2)
• lab 7
  • mini-Python continued

• lecture 8
  • optimizing compiler 2/2