Revisiting C compiler optimizations
Following up on my previous article on the C compilation pipeline, we will now focus on compiler optimizations: how they affect the generated executable and their consequences for the ability to debug a program.
The compiler, depending on the desired optimization level, applies a set of increasingly aggressive techniques to transform a program's source code into more efficient machine instructions, speeding up execution and reducing resource consumption. There are no free lunches, however: each transformation for the sake of performance slows down compilation and risks making the generated machine code less transparent and more challenging to debug.
Modern compilers are sophisticated tools that do far more than simply translate human-readable code into machine instructions. They analyze, transform, and optimize code in ways that can dramatically differ from the original source. For developers, this means the code running on a machine might look quite different from the original source code. We'll use a simple example program to demonstrate how compiler optimizations can fundamentally affect the generated machine code, and explore the impact on the ability to debug a running program.
A program in its original form
Let's start with a simple toy program:
#include <stdio.h>

int mult(int a, int b) {
    return a * b;
}

int main() {
    int a = 42;
    int b = 12;
    int result = mult(a, b);
    printf("Result = %d\n", result);
    return 0;
}
We can compile this program using gcc without optimizations: gcc -Wall prog.c -o prog (this is the same as using the -O0 flag).
This compiles the program without any special optimization options, and that is reflected in the generated executable.
Since the example was compiled on an x86-64 Linux box, the executable is an ELF1 executable and can be inspected using the readelf tool.
The ELF header can be read using the readelf -h prog command and provides some interesting information about this executable: its endianness, Application Binary Interface (ABI) and architecture:
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1060
Start of program headers: 64 (bytes into file)
Start of section headers: 14744 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 13
Size of section headers: 64 (bytes)
Number of section headers: 31
Section header string table index: 30
In the ELF format the .text section2 contains the code that will be executed at runtime.
The contents of this section are best inspected with the objdump tool3, by running the command objdump -M intel -d -j .text prog, which produces the Intel-syntax x86 assembly code that will be executed by the CPU (you can also use the excellent Compiler Explorer project). Note that because we are using the printf function from the C standard library, the generated assembly contains more than what we have in the source code: essentially the machinery needed to dynamically load printf from the standard library (libc). For now let's ignore that and focus on the assembly code for the main and mult functions:
0000000000001149 <mult>:
1149: f3 0f 1e fa endbr64
114d: 55 push rbp
114e: 48 89 e5 mov rbp,rsp
1151: 89 7d fc mov DWORD PTR [rbp-0x4],edi
1154: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
1157: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
115a: 0f af 45 f8 imul eax,DWORD PTR [rbp-0x8]
115e: 5d pop rbp
115f: c3 ret
0000000000001160 <main>:
1160: f3 0f 1e fa endbr64
1164: 55 push rbp
1165: 48 89 e5 mov rbp,rsp
1168: 48 83 ec 10 sub rsp,0x10
116c: c7 45 f4 2a 00 00 00 mov DWORD PTR [rbp-0xc],0x2a
1173: c7 45 f8 0c 00 00 00 mov DWORD PTR [rbp-0x8],0xc
117a: 8b 55 f8 mov edx,DWORD PTR [rbp-0x8]
117d: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
1180: 89 d6 mov esi,edx
1182: 89 c7 mov edi,eax
1184: e8 c0 ff ff ff call 1149 <mult>
1189: 89 45 fc mov DWORD PTR [rbp-0x4],eax
118c: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
118f: 89 c6 mov esi,eax
1191: 48 8d 3d 6c 0e 00 00 lea rdi,[rip+0xe6c] # 2004 <_IO_stdin_used+0x4>
1198: b8 00 00 00 00 mov eax,0x0
119d: e8 ae fe ff ff call 1050 <printf@plt>
11a2: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
11a5: c9 leave
11a6: c3 ret
11a7: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
11ae: 00 00
This is quite verbose with all the mov instructions, so let's unpack our main function to get a better understanding of the result of compiling our source code. Before we start, it is important to understand how to read the disassembled code:
- Function declarations, composed of an address and a <name>: 0000000000001160 <main>:
- Instructions, composed of the address of the instruction, followed by a hex dump of the machine code as single bytes in memory order, and finally the instruction itself. So the line 1165: 48 89 e5 mov rbp,rsp represents the instruction at address 1165, whose machine code is 48 89 e5, corresponding to the instruction mov rbp,rsp.
- For function arguments, the edi/rdi registers are used for the first argument and the esi/rsi registers for the second. This follows the x86-64 System V AMD64 ABI calling convention.
The code for the main function at address 0000000000001160 does the following:
- Allocate 16 bytes on the stack for local variables with sub rsp,0x10 (instruction 1168);
- Copy the values 42 (0x2A) and 12 (0xC) to the program's stack (instructions 116c and 1173);
- Move those values to the general-purpose registers edx and eax (instructions 117a and 117d);
- And from there to the esi and edi registers, which hold the arguments for our function (instructions 1180 and 1182);
- Call the mult function located at address 1149 (instruction 1184);
- Take the result of the function from register eax and push it to the stack (instructions 1189 and 118c);
- Move it to the esi register (instruction 118f);
- Load the string for printf: the lea rdi,[rip+0xe6c] instruction at 1191 loads the address of the format string relative to the current instruction pointer (position-independent code);
- Call printf at address 1050 (instruction 119d);
- Restore the stack and base pointers (instruction 11a5);
- Return (instruction 11a6).
This follows the source code pretty closely; however, at the end of the day this trivial code just multiplies two constants and prints the result, so the CPU is doing a lot of work for such a simple task.
The effect of compiler optimizations
Indeed it is possible to generate a more efficient set of executable instructions that still produces a correct result. GCC supports several optimization levels that introduce increasingly aggressive optimization techniques. The main ones are4:
- -O0: Preserves debuggability but offers minimal performance gains (the default);
- -O1: The compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time;
- -O2: Additional optimizations increasing both compilation time and the performance of the generated code;
- -O3: Aggressive optimizations that prioritize speed over debugging ease.
In the context of this example, since the code is very simple, let's compile with -O3 so that the effects of the compiler optimizations are quite easy to spot. The code can be compiled with the command: gcc -O3 -Wall prog.c -o prog.
The resulting executable can be inspected with objdump -d -j .text prog (shown here in objdump's default AT&T syntax), and this is the result:
0000000000001060 <main>:
1060: f3 0f 1e fa endbr64
1064: 48 83 ec 08 sub $0x8,%rsp
1068: ba f8 01 00 00 mov $0x1f8,%edx
106d: bf 01 00 00 00 mov $0x1,%edi
1072: 31 c0 xor %eax,%eax
1074: 48 8d 35 89 0f 00 00 lea 0xf89(%rip),%rsi # 2004 <_IO_stdin_used+0x4>
107b: e8 d0 ff ff ff callq 1050 <__printf_chk@plt>
1080: b8 f8 01 00 00 mov $0x1f8,%eax
1085: 48 83 c4 08 add $0x8,%rsp
1089: c3 retq
108a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
There are quite a few notable changes:
- The multiplication is resolved to its final value 0x1f8 (42 * 12 = 504), which gets stored in the edx register (instruction 1068);
- printf is not being called anymore! An attentive reader will note that the function call at instruction 107b points to the __printf_chk function5. This function takes three arguments in this case: a flag set to 1 in the edi register, the format string as the second argument in the rsi register, and the computed value in the edx register;
- The stack pointer is restored by adding back the 8 bytes that were subtracted initially (instruction 1085).
In practical terms this would be roughly the equivalent of the following source code:
#include <stdio.h>

int main() {
    int value = 504;
    printf("Result = %d\n", value);
    return 0;
}
This example, despite its simplicity, shows just how aggressively compilers can optimize the code and how it may end up being significantly different from the original source code.
Impact on debugging and code inspection
Given these effects of compiler optimizations, it's worth considering their impact on a programmer's ability to effectively debug the program.
Debugging a program with no optimizations
To explore this question we will use the classic gdb debugger. It is important to note that by default the compiler does not add debug symbols to the executable, so the first step is compiling the program with debug information enabled (and no optimizations): gcc -Wall -g -O0 prog.c -o prog
This adds debug information to the executable. Using the readelf -S prog
command it is possible to see several new sections related to debug information: .debug_aranges, .debug_info, .debug_abbrev, .debug_line and .debug_str.
To start the program under the debugger use the gdb prog command. This opens the gdb interactive console; for the purpose of this article we will set a breakpoint on a specific line of our source code and then let the program run. The expectation is that the debugger will pause execution at said breakpoint.
Let's start by setting a breakpoint on line 4, the return a * b statement, using the break prog.c:4 command.
(gdb) break 4
Breakpoint 1 at 0x1157: file prog.c, line 4.
Note that because the program was compiled with debug symbols, the list command shows the source code of the program in question:
(gdb) list
1 #include <stdio.h>
2
3 int mult(int a, int b) {
4 return a * b;
5 }
6
7 int main() {
8 int a = 42;
9 int b = 12;
10 int result = mult(a, b);
Now let's run the program with the run command. As expected, execution stops at the breakpoint:
Starting program: /home/felix/dev/scrathpad/c-compile-debug/prog
Breakpoint 1, mult (a=42, b=12) at prog.c:4
4 return a * b;
And we can proceed with the continue command and quit gdb with quit.
Debugging an optimized program
So far this worked exactly as expected. Let's now perform the same exercise with the compiler optimizations cranked up to the max (-O3). Start by compiling with gcc -Wall -O3 -g prog.c -o prog, and run the program with gdb (gdb prog).
On the gdb interactive terminal the code listing (list command) is exactly the same, and we can set the same breakpoint on line 4 (break prog.c:4). If we execute the program with the run command, it simply runs until it finishes!
This is not what we expected from inspecting the source code. However, if we look at the disassembled code, it is fairly clear that the compiler simply does not generate a call to the mult function, and therefore the debug information does not match the executable code.
GDB can also disassemble code, and it is possible to set breakpoints on the addresses of specific instructions. Running the disass main command will disassemble the main function:
(gdb) disass main
Dump of assembler code for function main:
0x0000000000001060 <+0>: endbr64
0x0000000000001064 <+4>: sub $0x8,%rsp
0x0000000000001068 <+8>: mov $0x1f8,%edx
0x000000000000106d <+13>: mov $0x1,%edi
0x0000000000001072 <+18>: xor %eax,%eax
0x0000000000001074 <+20>: lea 0xf89(%rip),%rsi # 0x2004
0x000000000000107b <+27>: callq 0x1050 <__printf_chk@plt>
0x0000000000001080 <+32>: mov $0x1f8,%eax
0x0000000000001085 <+37>: add $0x8,%rsp
0x0000000000001089 <+41>: retq
This matches what we obtained with the objdump program above, and it is possible to set a breakpoint on a specific instruction using its address: break *0x0000000000001068.
Interestingly enough, when we execute the run command we get:
(gdb) run
Starting program: /home/felix/dev/scrathpad/c-compile-debug/prog
Warning:
Cannot insert breakpoint 1.
Cannot access memory at address 0x1068
Something clearly went wrong! If we run disass main again, we see that the addresses of the instructions have changed!
Dump of assembler code for function main:
0x0000555555555060 <+0>: endbr64
0x0000555555555064 <+4>: sub $0x8,%rsp
0x0000555555555068 <+8>: mov $0x1f8,%edx
0x000055555555506d <+13>: mov $0x1,%edi
0x0000555555555072 <+18>: xor %eax,%eax
0x0000555555555074 <+20>: lea 0xf89(%rip),%rsi # 0x555555556004
0x000055555555507b <+27>: callq 0x555555555050 <__printf_chk@plt>
0x0000555555555080 <+32>: mov $0x1f8,%eax
0x0000555555555085 <+37>: add $0x8,%rsp
0x0000555555555089 <+41>: retq
End of assembler dump.
The changing addresses are due to Address Space Layout Randomization (ASLR)6, a security feature in modern operating systems that randomizes the memory layout of a process each time it's executed. This prevents attackers from reliably predicting memory addresses.
A more effective way to set this breakpoint is to use a reference relative to a function: placing a breakpoint on the same instruction can be done with break *main+8, which is much more robust against address randomization.
Wrapping-up
Compiler optimizations are a powerful tool with interesting trade-offs between raw performance and code transparency. Even a trivial program, such as the example in this article, can be affected in profound ways depending on the compilation options.
From a practical perspective this example illustrates the importance of understanding the range of optimization options available, and when to deploy them:
- Use lower optimization levels (-O0 or -O1) during development;
- Switch to higher optimization levels for production builds;
- Compile with debug symbols (-g) during development.
More broadly, it is worth noting that the code we write is not always exactly what gets executed. Modern compilers are sophisticated pieces of technology that transform code and make decisions that can significantly improve performance while maintaining the original program's semantics, though not necessarily any similarity with the source code. So mind the gap, and keep in mind that, both for learning purposes and for cases where high performance is required, it may still be beneficial to examine what the compiler is doing and roll your own assembly code if needed7.
Footnotes
-
Executable and Linkable Format (ELF) is the executable format in Unix-like systems (e.g. Linux and BSD). ↩
-
It is important to distinguish between sections and segments in ELF, as they come into play at different moments. During compilation and linking the compiler organizes the binary into different logical groupings or sections (.text, .data, .debug, etc). During runtime the operating system needs to load the program into memory, and different sections with similar memory protection characteristics are loaded into the same segment (e.g. code has read + execute permissions, data has read + write permissions while read-only data only needs read permission). ↩
-
The objdump tool is capable of disassembling an object file. The options I used in this article are as follows: -d to disassemble; -j to select a specific section to disassemble (in this case I was only interested in the contents of the .text section); -M to configure the disassembler (in particular, I wanted Intel assembly syntax rather than the older, and slightly more complicated, AT&T variant). ↩ -
The fine folks of FFmpeg often leverage specialized instructions to process multiple elements in one go (SIMD), using the latest CPU features when available. In this scenario compiler optimizations may not be as effective as hand-crafted assembly code: the dav1d project showed around a 2x speedup from automatic vectorisation, while the hand-written versions could reach 8x. ↩