Revisiting C compiler optimizations
Following up on my previous article on the C compilation pipeline, we will now focus on compiler optimizations: how they affect the generated executable and their consequences for the ability to debug a program.
The compiler, depending on the desired optimization level, applies a set of increasingly aggressive techniques to transform a program's source code into more efficient machine instructions, speeding up execution and reducing resource consumption. There are no free lunches, however: each transformation for the sake of performance slows down compilation and risks making the generated machine code less transparent and more challenging to debug.
Modern compilers are sophisticated tools that do far more than simply translate human-readable code into machine instructions. They analyze, transform, and optimize code in ways that can dramatically differ from the original source. For developers, this means the code running on a machine might look quite different from the original source code. We'll use a simple example program to demonstrate how compiler optimizations can fundamentally affect the generated machine code, and explore the impact on the ability to debug a running program.
A program in its original form
Let's start with a simple toy program:
#include <stdio.h>

int mult(int a, int b) {
    return a * b;
}

int main() {
    int a = 42;
    int b = 12;
    int result = mult(a, b);
    printf("Result = %d\n", result);
    return 0;
}
We can compile this program using gcc without optimizations: gcc -Wall prog.c -o prog (this is the same as using the -O0 flag).
This compiles the program without any special optimization options, and that is reflected in the generated executable.
Since the example was compiled on an x86-64 Linux box, the executable is an ELF1 executable and can be inspected using the readelf tool.
The ELF header can be read using the readelf -h prog command and provides some interesting information about this executable: its endianness, Application Binary Interface (ABI) and architecture:
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x1060
Start of program headers: 64 (bytes into file)
Start of section headers: 14744 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 13
Size of section headers: 64 (bytes)
Number of section headers: 31
Section header string table index: 30
In the ELF format the .text section2 contains the code that will be executed at runtime.
The contents of this section are best inspected with the objdump tool3, by running the command objdump -M intel -d -j .text prog, which produces the Intel-syntax x86 assembly code that will be executed by the CPU (you can also use the excellent Compiler Explorer project). Note that because we are using the printf function from the C standard library, the generated assembly contains more than what we have in the source code: essentially the machinery needed to dynamically load printf from the standard library (libc). For now let's ignore that and focus on the assembly code for the main and mult functions:
0000000000001149 <mult>:
1149: f3 0f 1e fa endbr64
114d: 55 push rbp
114e: 48 89 e5 mov rbp,rsp
1151: 89 7d fc mov DWORD PTR [rbp-0x4],edi
1154: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
1157: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
115a: 0f af 45 f8 imul eax,DWORD PTR [rbp-0x8]
115e: 5d pop rbp
115f: c3 ret
0000000000001160 <main>:
1160: f3 0f 1e fa endbr64
1164: 55 push rbp
1165: 48 89 e5 mov rbp,rsp
1168: 48 83 ec 10 sub rsp,0x10
116c: c7 45 f4 2a 00 00 00 mov DWORD PTR [rbp-0xc],0x2a
1173: c7 45 f8 0c 00 00 00 mov DWORD PTR [rbp-0x8],0xc
117a: 8b 55 f8 mov edx,DWORD PTR [rbp-0x8]
117d: 8b 45 f4 mov eax,DWORD PTR [rbp-0xc]
1180: 89 d6 mov esi,edx
1182: 89 c7 mov edi,eax
1184: e8 c0 ff ff ff call 1149 <mult>
1189: 89 45 fc mov DWORD PTR [rbp-0x4],eax
118c: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
118f: 89 c6 mov esi,eax
1191: 48 8d 3d 6c 0e 00 00 lea rdi,[rip+0xe6c] # 2004 <_IO_stdin_used+0x4>
1198: b8 00 00 00 00 mov eax,0x0
119d: e8 ae fe ff ff call 1050 <printf@plt>
11a2: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
11a5: c9 leave
11a6: c3 ret
11a7: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
11ae: 00 00
This is quite verbose with all the mov instructions, so let's unpack our main function to get a better understanding of the result of compiling our source code. Before we start, it is important to understand how to read the disassembled code:
- Function declarations, composed of an address and a <name>: 0000000000001160 <main>:
- Instructions, composed of the address of the instruction, followed by a hex dump of the machine code as single bytes in memory order, and finally the instruction itself. So the line 1165: 48 89 e5 mov rbp,rsp represents the instruction at address 1165, whose machine code is 48 89 e5, corresponding to the instruction mov rbp,rsp.
- For function arguments, the edi/rdi registers are used for the first argument and the esi/rsi registers for the second. This follows the x86-64 System V AMD64 ABI calling convention.
The code for the main function at address 0000000000001160 does the following:
- Allocate 16 bytes on the stack for local variables with sub rsp,0x10 (instruction 1168);
- Copy the values 42 (0x2A) and 12 (0xC) to the program's stack (instructions 116c and 1173);
- Move those values to the general-purpose registers edx and eax (instructions 117a and 117d);
- And from there to the esi and edi registers, which hold the arguments for our function (instructions 1180 and 1182);
- Call the mult function located at address 1149 (instruction 1184);
- Take the result of the function from register eax and push it to the stack (instructions 1189 and 118c);
- Move it to the esi register (instruction 118f);
- Load the string for printf: the lea rdi,[rip+0xe6c] instruction at 1191 loads the address of the format string relative to the current instruction pointer (position-independent code);
- Call printf at address 1050 (instruction 119d);
- Restore the stack and base pointers (instruction 11a5);
- Return (instruction 11a6).
This follows the source code pretty closely; however, at the end of the day this trivial code just multiplies two constants and prints the result, so the CPU is doing a lot of work for such a simple task.
The effect of compiler optimizations
Indeed it is possible to generate a more efficient set of executable instructions that still produces a correct result. GCC supports several optimization levels that introduce increasingly aggressive optimization techniques. The main ones are4:
- -O0: Preserves debuggability but offers minimal performance gains (the default);
- -O1: The compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time;
- -O2: Additional optimizations increasing both compilation time and the performance of the generated code;
- -O3: Aggressive optimizations that prioritize speed over debugging ease.
In the context of this example, since the code is very simple, let's compile with -O3 so that the effects of the compiler optimizations are quite easy to spot. The code can be compiled with the command: gcc -O3 -Wall prog.c -o prog.
The resulting executable can be inspected with objdump -d -j .text prog (shown here in objdump's default AT&T syntax), and this is the result:
0000000000001060 <main>:
1060: f3 0f 1e fa endbr64
1064: 48 83 ec 08 sub $0x8,%rsp
1068: ba f8 01 00 00 mov $0x1f8,%edx
106d: bf 01 00 00 00 mov $0x1,%edi
1072: 31 c0 xor %eax,%eax
1074: 48 8d 35 89 0f 00 00 lea 0xf89(%rip),%rsi # 2004 <_IO_stdin_used+0x4>
107b: e8 d0 ff ff ff callq 1050 <__printf_chk@plt>
1080: b8 f8 01 00 00 mov $0x1f8,%eax
1085: 48 83 c4 08 add $0x8,%rsp
1089: c3 retq
108a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
There are quite a few notable changes:
- The multiplication is resolved to its final value 0x1f8 (42 * 12 = 504), which gets stored in the edx register (instruction 1068);
- printf is not being called anymore! An attentive reader will note that the function call at instruction 107b points to the __printf_chk function5. This function takes three arguments in this case: a flag set to 1 in the edi register, the format string as the second argument in the rsi register, and the computed value in the edx register;
- The stack pointer is restored by adding back the 8 bytes that were subtracted initially (instruction 1085).
In practical terms this would be roughly the equivalent of the following source code:
#include <stdio.h>

int main() {
    int value = 504;
    printf("Result = %d\n", value);
    return 0;
}
This example, despite its simplicity, shows just how aggressively compilers can optimize the code and how it may end up being significantly different from the original source code.
Impact on debugging and code inspection
Given these effects of compiler optimizations, it's worth considering their impact on a programmer's ability to effectively debug the program.
Debugging a program with no optimizations
To explore this question we will use the classic gdb debugger. It is important to note that by default the compiler does not add debug symbols to the executable, so the first step is compiling the program with debug information enabled (and no optimizations): gcc -Wall -g -O0 prog.c -o prog
This adds debug information to the executable. Using the readelf -S prog
command it is possible to see several new sections related to debug information: .debug_aranges, .debug_info, .debug_abbrev, .debug_line and .debug_str.
To start the program under the debugger use the gdb prog command. This opens the gdb interactive console; for the purpose of this article we will set a breakpoint on a specific line of our source code and then let the program run. The expectation is that the debugger will pause execution at said breakpoint.
Let's start by setting a breakpoint on line 4, the return a * b statement, using the break prog.c:4 command.
(gdb) break 4
Breakpoint 1 at 0x1157: file prog.c, line 4.
Note that because the program was compiled with debug symbols, the list command shows the source code of the program in question:
(gdb) list
1 #include <stdio.h>
2
3 int mult(int a, int b) {
4 return a * b;
5 }
6
7 int main() {
8 int a = 42;
9 int b = 12;
10 int result = mult(a, b);
Now let's run the program with the run command. As expected, execution stops at the breakpoint:
Starting program: /home/felix/dev/scrathpad/c-compile-debug/prog
Breakpoint 1, mult (a=42, b=12) at prog.c:4
4 return a * b;
And we can proceed with the continue command and quit gdb with quit.
Debugging an optimized program
So far this worked exactly as expected. Let's now perform the same exercise with the compiler optimizations cranked up to the max (-O3). Start by compiling with gcc -Wall -O3 -g prog.c -o prog, and run the program with gdb (gdb prog).
On the gdb interactive terminal the code listing (list command) is exactly the same, and we can set the same breakpoint on line 4 (break prog.c:4). If we execute the program with the run command, it simply runs until it finishes!
This is not what we expected from inspecting the source code. However, if we look at the disassembled code, it is fairly clear that the compiler simply does not generate a call to the mult function, and therefore the debug information does not match the executable code.
GDB can also disassemble code, and it is possible to set breakpoints on the addresses of specific instructions. Running the disass main command will disassemble the main function:
(gdb) disass main
Dump of assembler code for function main:
0x0000000000001060 <+0>: endbr64
0x0000000000001064 <+4>: sub $0x8,%rsp
0x0000000000001068 <+8>: mov $0x1f8,%edx
0x000000000000106d <+13>: mov $0x1,%edi
0x0000000000001072 <+18>: xor %eax,%eax
0x0000000000001074 <+20>: lea 0xf89(%rip),%rsi # 0x2004
0x000000000000107b <+27>: callq 0x1050 <__printf_chk@plt>
0x0000000000001080 <+32>: mov $0x1f8,%eax
0x0000000000001085 <+37>: add $0x8,%rsp
0x0000000000001089 <+41>: retq
This matches what we obtained with the objdump program above, and it is possible to set a breakpoint on a specific instruction using its address: break *0x0000000000001068.
Interestingly enough, when we execute the run command we get:
(gdb) run
Starting program: /home/felix/dev/scrathpad/c-compile-debug/prog
Warning:
Cannot insert breakpoint 1.
Cannot access memory at address 0x1068
Something clearly went wrong! If we run disass main again, we see that the addresses of the instructions have changed!
Dump of assembler code for function main:
0x0000555555555060 <+0>: endbr64
0x0000555555555064 <+4>: sub $0x8,%rsp
0x0000555555555068 <+8>: mov $0x1f8,%edx
0x000055555555506d <+13>: mov $0x1,%edi
0x0000555555555072 <+18>: xor %eax,%eax
0x0000555555555074 <+20>: lea 0xf89(%rip),%rsi # 0x555555556004
0x000055555555507b <+27>: callq 0x555555555050 <__printf_chk@plt>
0x0000555555555080 <+32>: mov $0x1f8,%eax
0x0000555555555085 <+37>: add $0x8,%rsp
0x0000555555555089 <+41>: retq
End of assembler dump.
The changing addresses are due to Address Space Layout Randomization (ASLR)6, a security feature in modern operating systems that randomizes the memory layout of a process each time it's executed. This prevents attackers from reliably predicting memory addresses.
A more effective way to set this breakpoint is to use a reference relative to a function: placing a breakpoint on the same instruction can be done with break *main+8, which is much more robust against address randomization.
Wrapping-up
Compiler optimizations are a powerful tool with interesting trade-offs between raw performance and code transparency. Even a trivial program, such as the example in this article, can be affected in profound ways depending on the compilation options.
From a practical perspective this example illustrates the importance of understanding the range of optimization options available, and when to deploy them:
- Use lower optimization levels (-O0 or -O1) during development;
- Switch to higher optimization levels for production builds;
- Compile with debug symbols (-g) during development.
More broadly, it is worth noting that the code we write is not always exactly what gets executed. Modern compilers are sophisticated pieces of technology that transform code and make decisions that can significantly improve performance while maintaining the original program's semantics, though not necessarily any similarity with the source code. So mind the gap, and keep in mind that, both for learning purposes and for cases where high performance is required, it may still be beneficial to examine what the compiler is doing and roll your own assembly code if needed7.
Footnotes
-
Executable and Linkable Format (ELF) is the executable format in Unix-like systems (e.g. Linux and BSD). ↩
-
It is important to distinguish between sections and segments in ELF, as they come into play at different moments. During compilation and linking the compiler organizes the binary into different logical groupings or sections (.text, .data, .debug, etc). During runtime the operating system needs to load the program into memory, and different sections with similar memory protection characteristics are loaded into the same segment (e.g. code has read + execute permissions, data has read + write permissions while read-only data only needs read permission). ↩
-
The objdump tool is capable of disassembling an object file. The options I used in this article are as follows: -d to disassemble; -j to select a specific section to disassemble (in this case I was only interested in the contents of the .text section); -M to configure the disassembler (in particular, I wanted Intel assembly syntax rather than the older, and slightly more complicated, AT&T variant). ↩ -
The fine folks of FFmpeg often leverage specialized instructions to process multiple elements in one go (SIMD), using the latest CPU features when available. In this scenario compiler optimizations may not be as effective as hand-crafted assembly code: the dav1d project showed around a 2x speedup from automatic vectorisation, while the hand-written versions could reach 8x. ↩