C/C++ Compiler Operations

Sources :
Delroy A. Brinkerhoff : Object-Oriented Programming using C++
Brian Gough, Richard M. Stallman : An Introduction to GCC


The process of translating source code into an executable program is called “compiling the program” or just “compiling”.
We usually view the compilation process as a single action and generally refer to it as such.
Nevertheless, a modern compiler system actually consists of four separate programs:

- Preprocessor
  Expand macros and included header files

- Compiler
  Convert source code to assembly language

- Assembler
  Convert assembly language to machine code

- Linker
  Link object files and binary libraries to create the final executable

So here is the process :

Source Code > Preprocessor > Compiler > Assembler > Linker > Executable Program

A single program usually consists of multiple source code files.
It is both awkward and inconvenient to deal with large programs in a single source code file, and spreading them over multiple files has many advantages:

1. It breaks large, complex programs into smaller, independent conceptual units
Easier to understand, follow and maintain.

2. It allows multiple programmers to work on a single program at the same time
Each programmer works on a separate set of files.

3. It may speed up compilation (depending on the compiler system options used)
The compiler system stores the generated machine code in an object file, one object file for each source code file. If the compiler system keeps the object files and a source code file is unchanged, that file does not need to be recompiled: the linker uses its existing object file.

4. It permits related programs to share files
For example, office suites often include a word processor, a slide show editor, and a spreadsheet.
By maintaining the User Interface code in one shared file, they can present a consistent User Interface.

5. Although less important, it allows software developers to market software as object code organized as (binary black box) libraries, which is useful when supplying code that interfaces with applications.

Preprocessor

The Preprocessor takes the source code, removes the comments, includes headers, and replaces macros.

The preprocessor handles statements or lines of code that begin with the “#” character, which are called “preprocessor directives”.

Note that directives are not C/C++ statements (and therefore do not end with a semicolon) but rather instruct the preprocessor to carry out some action.

For each .c/.cpp file, the preprocessor handles directives that begin with the # character and creates a temporary file to store its output.
The preprocessor reads and processes each file one at a time from top to bottom.
It does not change the content of any of the source files it processes.

The result is a set of files, each containing the source code merged with its header files and with all macros expanded.
By convention, preprocessed files are given the file extension .i for C programs and .ii for C++ programs.
In practice, the preprocessed file is not saved to disk unless the -save-temps option is used.

Two of the most common directives, and the first that we will use, are #include and #define.

The #include Directive

When the preprocessor encounters the #include directive, it opens the header file and adds its contents into the temporary file.
The symbols surrounding the name of the header file are important and determine where the preprocessor looks for the file.

#include <name>
The angle brackets denote a system include file that is part of the compiler itself (think of it as “library” code)
and directs the preprocessor to search for the file where the system header files are located (which varies from one compiler to another and from one Operating System to another).

#include "name.h"
The double quotation marks identify a header file that is written as a part of a program.
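The quotation marks instruct the preprocessor to look for the header file first in the current directory (i.e., in the same directory as the source code file that contains the directive).
With GCC, if the file is not found there, the search continues in the system header directories.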
Header files that a programmer writes as part of an application program typically end with a .h extension.

You might see two kinds of system header files in a C++ program :
Older system header files end with a “.h” extension: <name.h>.
These header files were originally created for C programs, but may also be used with C++.
Newer system header files do not end with an extension: <name>. These may only be used with C++.

File names appearing between < and > refer to system header files.
File names appearing between double quotation marks refer to header files written by the programmer as a part of the program.

Note:
The include directive does not end with a semicolon. By convention, a space separates the directive from the file name, although the preprocessor does not require it.

The #define Directive and Symbolic Constants

The #define directive introduces a programming construct called a macro.
A simple macro only replaces one string of characters with another string.

The #define directive is one (old) way of creating a symbolic constant (also known as a named or manifest constant).
The const and enum keywords are newer techniques for creating constants.
It is a well-accepted naming practice to write the names of symbolic constants with all upper-case characters (this provides a visual clue that the name represents a constant).

Note:
The define directive does not end with a semicolon.
There must be at least one space between the directive and the identifier, and between the identifier and the defined value.
The defined value (the third part of the directive) is optional.

The -E option stops the process after the preprocessing stage.
The output is preprocessed source code, which is sent to the standard output.
Input files that don't require preprocessing are ignored.

$ gcc -E <program_file_1>.c <program_file_2>.c ... <program_file_n>.c

$ g++ -E <program_file_1>.cpp <program_file_2>.cpp ... <program_file_n>.cpp

Compiler

The Compiler translates source code into assembly code for a specific processor.

The Preprocessor processes each source code file one at a time and produces a single temporary file for each source code file.
Similarly, the Compiler processes each temporary file one at a time and produces one assembly code file for each temporary file.

The Compiler also detects syntax errors and provides the diagnostic output programmers use to find and correct those errors.
Despite all that the compiler does, its operation is transparent to programmers for the most part.

The -S option stops the process after the compilation stage, without assembling.
The output is an assembler code file for each non-assembler input file specified.
By default, the assembler file name for a source file is made by replacing the suffix .c, .cpp, .i, .ii, etc., with .s.
Input files that don't require compilation are ignored.

$ gcc -S <program_file_1>.c <program_file_2>.c ... <program_file_n>.c

$ g++ -S <program_file_1>.cpp <program_file_2>.cpp ... <program_file_n>.cpp

Assembler

The Assembler translates assembly code into machine code that the processor understands and can execute, and stores the result in an object file.

When there are calls to external functions in the assembly source file, the Assembler leaves the addresses of the external functions undefined, to be filled in later by the linker.

The -c option compiles and assembles the source files, but does not link them.
The output is an object file for each source file.
By default, the object file name for a source file is made by replacing the suffix .c, .cpp, .i, .ii, .s, etc., with .o.
Unrecognized input files, not requiring compilation or assembly, are ignored.

$ gcc -c <program_file_1>.c <program_file_2>.c ... <program_file_n>.c

$ g++ -c <program_file_1>.cpp <program_file_2>.cpp ... <program_file_n>.cpp

Linker

The final stage of compilation is the linking of object files to create an executable program.

Object files contain machine code and information that the Linker uses to complete its tasks.
(Note that “object” in this context has nothing to do with the objects involved in Object-Oriented Programming)

The Linker takes each object file created by the Assembler and links them together, along with any additional binary libraries (including system and runtime libraries), to form a complete, executable program.

An executable requires many external functions from system and runtime libraries.
These libraries contain functions that are necessary to run a program on a given architecture
(linux-vdso.so.n, libc.so.n, ld-linux-x86-64.so.n (amd64), ld-linux.so.n (i386), etc.)

A library is a binary file (usually not directly executable) containing compiled functions/programming code that may be used/called by other programs/applications.

As a convention, a library name starts with ‘lib’, and the extension determines the type of the library:
.a stands for archive (static library)
.so stands for shared object (dynamic library)

Static Linking :
The linker adds all the libraries the program needs inside the final executable file (content is included).
Static linking may simplify the process of distributing a program to multiple similar environments, since the program already has everything it needs to run. However, an update to any library dependency does not take effect until you perform the whole compilation and linking process again.

Dynamic Linking :
The linker only places a reference to the required libraries in the final program (content is not included).
The actual linking happens when the program is executed (loaded at runtime).
You don’t need to recompile the program when its library dependencies are updated, but they all need to be present/installed on the system for the program to work.

Libraries (binaries) Location

GNU C Library: Shared libraries   (package: libc<n>)
Contains the standard libraries that are used by nearly all programs on the system.

GNU Standard C++ Library v3       (package: libstdc++<n>)
Contains an additional runtime library for C++ programs built with the GNU compiler. 

Symbolic link /lib -> /usr/lib
On Debian 64-bit amd64 architecture:   /lib/x86_64-linux-gnu/
On Debian 32-bit i386 architecture:    /lib/i386-linux-gnu/

List of paths that ld (the linker) will search for libraries.
The directories are searched in the order in which they are specified:
$ ld --verbose | grep SEARCH_DIR | sed 's/; /\n/g'

The name of the executable file depends on the hosting Operating System:
On Linux, Unix, and macOS systems, the linker produces a file named ‘a.out’ by default.
On a Windows computer, the linker produces a file whose name ends with a .exe extension.

Users may also specify a name that overrides the default.

For example, if you want gcc to generate an executable with a specific name, use the -o option followed by the desired name:

$ gcc -o <program_name> <program_file_1>.c <program_file_2>.c ... <program_file_n>.c

$ g++ -o <program_name> <program_file_1>.cpp <program_file_2>.cpp ... <program_file_n>.cpp

When compilation finishes, temporary/intermediate files are removed.

This command shows all shared library dependencies (what libraries the executable requires):

$ ldd <program_name>

readelf displays information about ELF format object files.
The options control what particular information to display.
This program performs a similar function to objdump, but it goes into more detail.

$ readelf -a <program_name>

Loader

This stage happens when the program starts up.
The program is scanned for references to shared libraries.
Any references found are resolved and the libraries are mapped into the program.

The dynamic linker/loader programs ld.so (or ld.so.n) and ld-linux.so (or ld-linux.so.n) find and load the shared objects (shared libraries) needed/used by a program, prepare the program to run, and then run it.

In Debian:
$ ls -l /lib/$( arch )-linux-gnu/ld-linux*

$ <loader_program> <program_name>