Lecture 9: Application Binary Interfaces

Date: 2023-09-19

1. Introduction to Application Binary Interfaces (ABIs)

Hello everyone, today we're diving into a topic that might seem a bit arcane but is crucial if you're working on software that interacts at a low level, be it system software or specialized libraries. That topic is Application Binary Interfaces, or ABIs for short.

What is an ABI?

An ABI is a set of rules that dictate how binary data should be formatted for software components to communicate with each other. In simpler terms, it's a contract between two binary program modules. These modules can be as small as a function or as large as an entire program.

Why Should You Care?

Why is this important? Well, think about it this way. You've written your code in a high-level language like C++. But at the end of the day, that code gets compiled down to machine code, binary instructions that your CPU understands. (Even an interpreted language like Python ends up calling into compiled libraries.) If you're developing libraries or operating systems, or even using them, you need to know how your code will interact at this binary level. That's where ABIs come in.

Core Components

An ABI covers various aspects such as:

  • Data Type Sizes: Specifies the size of basic data types (like integers and floats).

  • Data Alignment: Rules for how data types should be aligned in memory.

  • Calling Conventions: Defines how functions receive parameters and return values.

  • Object File Formats: Specifies the layout of object files (.o, .obj) and executables.

  • System Calls: Describes how to perform system calls to the operating system.
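
To make the first two items concrete, here is a minimal sketch that simply asks the compiler which sizes, alignments, and offsets it chose. The struct Sample and the helper describe_sample are made up for illustration; the actual numbers depend on the platform ABI, so none are hard-coded.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Hypothetical struct used only to illustrate layout rules.
struct Sample {
    char   tag;    // 1 byte, usually followed by padding
    double value;  // typically 8 bytes on common 64-bit ABIs
    int    count;
};

// The layout facts that the platform ABI dictates for Sample.
struct LayoutInfo {
    std::size_t size;
    std::size_t alignment;
    std::size_t value_offset;
    std::size_t count_offset;
};

// Collects size, alignment, and member offsets as seen by this compiler.
LayoutInfo describe_sample() {
    return LayoutInfo{sizeof(Sample), alignof(Sample),
                      offsetof(Sample, value), offsetof(Sample, count)};
}
```

On common 64-bit ABIs the reported size is typically 24 bytes rather than the 13 you'd get by adding up the members; the difference is padding inserted to satisfy the alignment rules.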


2. Name Mangling in C++

Alright, switching gears a bit, let's talk about a concept that's quite relevant to C++: Name Mangling.

What is Name Mangling?

Name mangling is the process of transforming function or variable names in your code into a format that can be understood and processed by the linker. This mainly happens because C++ supports function overloading, which allows multiple functions with the same name but different parameters. The linker, however, needs a unique identifier for each function to resolve references properly.

How Does It Work?

Let's say you have two overloaded functions like this:

void foo(int);
void foo(double);

In your C++ code, it's pretty clear which foo you're referring to based on the arguments you pass. But how does the linker differentiate them? The compiler will mangle these names to produce unique strings that encode the parameter types. For example, under the Itanium C++ ABI that GCC and Clang follow on Linux and macOS, these could be mangled into something like:

  • foo(int) might become _Z3fooi
  • foo(double) might become _Z3food

These mangled names are what actually get stored in the object files, and that's what the linker uses to connect all the pieces of your program together.

Why is This Important?

Understanding name mangling becomes essential when you're doing low-level tasks like dynamic linking or using tools that directly interact with object files or binaries. It's also useful when debugging, especially when you're trying to understand linker errors.

Platform and Compiler Differences

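Remember, mangling schemes are not standardized by the C++ language. GCC and Clang follow the Itanium C++ ABI on most platforms, while MSVC uses its own, entirely different scheme. So, if you're mixing and matching code compiled with different compilers, you could run into issues. Always be aware of the toolchain you're using.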

Demangling

Some tools can convert a mangled name back into its original form, a process known as "demangling." The c++filt command-line utility that ships with GNU binutils is a common example. This is often handy when you're debugging or analyzing binaries.
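
Demangling can also be done from inside a program. The sketch below assumes a GCC or Clang toolchain that follows the Itanium C++ ABI, because it relies on abi::__cxa_demangle from <cxxabi.h>; it will not build with MSVC. The wrapper name demangle is just a convenience chosen for this example.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI only)
#include <cstdlib>   // std::free
#include <string>

// Returns the human-readable form of an Itanium-mangled symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);  // the returned buffer is allocated with malloc
    return result;
}
```

Calling demangle("_Z3fooi") gives back foo(int), and demangle("_Z3food") gives foo(double).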

A Note on extern "C"

In cases where you want to prevent C++ name mangling, usually because you're linking against C code, you can use the extern "C" declaration:

extern "C" void foo(int);

This tells the compiler not to mangle the name, making it easier to link with code from C or other languages that don't support function overloading.
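
In practice, extern "C" usually shows up in a header that both C and C++ compilers need to read. Here is a small sketch of that pattern; the function names add_ints and scale are placeholders invented for this example.

```cpp
// Declarations as they might appear in a header shared by C and C++ code.
#ifdef __cplusplus
extern "C" {
#endif

int add_ints(int a, int b);  // exported under the plain symbol name "add_ints"

#ifdef __cplusplus
}  // extern "C"
#endif

// Definition in a C++ source file. It keeps C linkage because the
// declaration above already specified it.
int add_ints(int a, int b) {
    return a + b;
}

// An ordinary C++ overload set: these names are mangled as usual.
double scale(double x) { return x * 2.0; }
int scale(int x) { return x * 2; }
```

Note the trade-off: because an unmangled name carries no parameter information, two functions with C linkage cannot share a name, so overloads like scale cannot be declared extern "C".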


3. Mangling Special Functions

Moving along, let's focus on some special kinds of functions in C++ that undergo name mangling: constructors, destructors, functions in private namespaces, and functors.

Constructors and Destructors

Constructors and destructors are special member functions in C++ classes responsible for initialization and cleanup, respectively. Just like any other function, their names get mangled to ensure that they are uniquely identifiable by the linker. But there's a little twist.

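In C++, a class can have multiple constructors with different parameter lists. These constructors will undergo name mangling to distinguish them. A class has exactly one destructor, and it takes no parameters. Nonetheless, the destructor's name is still mangled, both for consistency with other member functions and because the class name has to be encoded so that destructors of different classes don't collide. Under the Itanium C++ ABI, a compiler may also emit several variants of the same constructor or destructor (the C1/C2 and D0/D1/D2 codes), which matters when the class is part of a hierarchy.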

For example, consider the following class definition:

class MyClass {
public:
    MyClass(int);
    MyClass(double);
    ~MyClass();
};

The constructors and destructor could be mangled as:

  • MyClass(int) might become something like _ZN7MyClassC1Ei
  • MyClass(double) might become _ZN7MyClassC1Ed
  • ~MyClass() might become _ZN7MyClassD1Ev

Function in Private Namespace

In C++, you can define namespaces to avoid name collisions. A special type of namespace is the "unnamed" or "anonymous" namespace, often referred to as a "private namespace." Functions in a private namespace are only visible within the translation unit where they are defined.

Here's how you'd define a function in an anonymous namespace:

namespace {
    void myPrivateFunction();
}

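Even though the function has internal linkage and can't be referenced from other translation units, the compiler still mangles its name whenever it emits a symbol for it. GCC and Clang, for example, place such functions under a reserved _GLOBAL__N_1 scope in the mangled name.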

Functors

A functor is a C++ class or struct that overloads the function-call operator operator(). While the function-call operator itself is a special kind of member function, it still gets mangled like other member functions.

For instance, consider a functor that adds a constant value:

struct AddValue {
    int value;
    AddValue(int v) : value(v) {}
    int operator()(int x) const { return x + value; }
};

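Because this operator() is a const member function, the Itanium C++ ABI would mangle it as something like _ZNK8AddValueclEi: the K marks the const qualifier, 8 is the length of the name AddValue, and cl is the code for operator().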
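
As far as the linker is concerned, operator() is just another member function; what makes a functor special is only the call syntax. A short sketch of the AddValue functor in use, where add_to_all is a helper invented for this example:

```cpp
#include <algorithm>  // std::transform
#include <vector>

struct AddValue {
    int value;
    AddValue(int v) : value(v) {}
    int operator()(int x) const { return x + value; }
};

// Applies the functor to every element; std::transform calls the
// AddValue object as if it were a plain function.
std::vector<int> add_to_all(const std::vector<int>& input, int amount) {
    std::vector<int> output(input.size());
    std::transform(input.begin(), input.end(), output.begin(), AddValue(amount));
    return output;
}
```

Since operator() is defined inside the class body it is implicitly inline, so a compiler may never emit a standalone symbol for it unless something takes its address or inlining is disabled.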


4. What about Templates?

Templates in C++ introduce some unique challenges and considerations when it comes to ABIs. Let's delve into some specifics.

Not Part of ABI (because compile-time)

Firstly, it's important to note that templates, in their general form, aren't directly part of the ABI because they are a compile-time entity. When you write a template, you're essentially writing a recipe for the compiler to generate code. The actual binary code is not generated until the template is instantiated, meaning that a concrete type or value is supplied for the template's parameters. Only then is the function or class "realized" into machine code.

Template Specializations

However, things change once a template is instantiated or explicitly specialized for concrete arguments. At that point the compiler generates real code. Because it's compiled, it does become a concrete object, and therefore, its name gets mangled and it becomes part of the ABI.

For instance, if you have a template function like this:

template <typename T>
void foo(T t);

And then you specialize it for an int:

template <>
void foo<int>(int t);

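This specialized version could be mangled to something like _Z3fooIiEvT_ under the Itanium C++ ABI, and it becomes part of the ABI. Note that the mangled name records the template argument (the Ii...E part), so it differs from the _Z3fooi of a non-template foo(int).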
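
Here is a minimal, complete sketch of a primary template next to an explicit specialization. The names describe and describe_both are invented for illustration and are not the foo from the lecture example.

```cpp
#include <string>

// Primary template: a recipe. No code is emitted until it is instantiated.
template <typename T>
std::string describe(T) {
    return "generic";
}

// Explicit specialization for int: an ordinary, non-inline function
// definition, so it gets an object-file symbol of its own.
template <>
std::string describe<int>(int) {
    return "int";
}

// Uses one implicit instantiation (double) and the explicit specialization.
std::string describe_both() {
    return describe(3.14) + "/" + describe(42);
}
```

Because the specialization is a regular definition, it belongs in exactly one translation unit (or must be marked inline), just like any other non-template function.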

Possibly Hard to Export, May Require Finessing

Exporting template specializations can be tricky. Generally speaking, templates are instantiated where they're used, meaning that the compiled code is generated in each translation unit that uses the template. However, it's possible to explicitly instantiate a template in one translation unit and use it in others. Explicit instantiation itself is standard C++, but exporting the result from a shared library requires care: it usually involves compiler-specific directives (such as __declspec(dllexport) on MSVC or visibility attributes on GCC and Clang) and might necessitate changes if you're switching compilers or versions.

Examples

Let's take the std::vector class template as an example. If you use std::vector<int> in your code, the compiler will instantiate the std::vector template for int. But what if you want to share this instantiated template across different translation units? Here's how you might explicitly instantiate it:

In a .cpp file:

#include <vector>

template class std::vector<int>;  // explicit instantiation

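Now, this explicitly instantiated template becomes a concrete object, and its methods will have mangled names that form part of the ABI. Any other translation unit that uses std::vector<int> could potentially link to the same compiled object code, assuming you've set up your build process to make that possible. One caveat: the C++ standard only permits explicitly instantiating a standard library template when the instantiation depends on at least one user-defined type, so treat std::vector<int> here as an illustration and apply the technique to your own templates in real code.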
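
The same technique applied to a template you own might look like the sketch below. Box, doubled_via_box, and the file name box.h are all hypothetical, and the header and source parts are shown in a single listing only to keep the example self-contained.

```cpp
// --- what would live in a header (say, a hypothetical box.h) ---
template <typename T>
class Box {
public:
    explicit Box(T v) : value_(v) {}
    T get() const { return value_; }
    T doubled() const;
private:
    T value_;
};

template <typename T>
T Box<T>::doubled() const { return value_ + value_; }

// Explicit instantiation declaration: asks translation units that include
// the header to rely on an instantiation of Box<int> provided elsewhere.
extern template class Box<int>;

// --- what would live in exactly one .cpp file ---
// Explicit instantiation definition: emits the Box<int> members here.
template class Box<int>;

// Small helper that uses the explicitly instantiated Box<int>.
int doubled_via_box(int v) {
    return Box<int>(v).doubled();
}
```

The extern template line is what keeps other translation units from generating their own copies of the non-inline members, so the one in the .cpp file is the copy everyone links against.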


5. Dynamically Loading Libraries

Great, let's jump into the topic of dynamically loading libraries and how this relates to the ABI. Dynamically loading libraries is a practice that offers a lot of flexibility but also comes with its own set of challenges and considerations.

Windows: (.exe, .dll)

On Windows systems, dynamic libraries are generally distributed as compiled binaries with the extensions .dll (Dynamic-Link Library) for the libraries and .exe for executable files. When you're dealing with precompiled .dll files, the ABI is frozen, meaning you must ensure that your application is compatible with the ABI of the library. The names of functions and data structures in the .dll are already mangled, and you usually interact with them through an API provided by the library vendor.

Linux/Mac: (.so, .dylib)

In contrast, on Linux and Mac systems, you're more likely to encounter libraries in source code form, especially if you're using open-source packages. You compile the code yourself with your own toolchain, which means ABI compatibility is generally less of a concern. When the libraries are precompiled, they usually come in the form of .so (Shared Object) files on Linux and .dylib files on macOS. Like .dll files, these have a fixed ABI that you need to be compatible with.

Link-time Binding

Link-time binding happens at build time, when the linker connects your compiled code with the library. In this case, ABI issues such as name mangling and function signatures need to be resolved at the time of compilation and linking. A mismatch in symbol names will result in linker errors, preventing your program from building. Mismatches the linker cannot see, such as a changed data structure layout, will slip through and only show up at run time.

Run-time Binding

On the other hand, run-time binding occurs when a program is already running and decides to load a library dynamically. This is often done using API functions like LoadLibrary on Windows or dlopen on Linux/Mac. With run-time binding, ABI compatibility becomes a run-time issue rather than a compile-time issue. If there's an ABI mismatch, you'll likely experience run-time errors or crashes.

Here, you also often deal with function pointers to access the library's functions. For example, you might use GetProcAddress on Windows or dlsym on Linux/Mac to get a function pointer and then cast it to the correct function signature.
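
A minimal sketch of the dlopen/dlsym pattern on a POSIX system follows. It assumes a Linux machine where the C math library is installed as libm.so.6 (the usual glibc name; other systems use different names), and the wrapper cos_via_dlopen is a name made up for this example.

```cpp
#include <dlfcn.h>  // dlopen, dlsym, dlclose (POSIX)

// Loads the C math library at run time and calls cos() through a
// function pointer. Returns false if the library or symbol is missing.
// "libm.so.6" is an assumption about the host system, not a portable name.
bool cos_via_dlopen(double x, double& result) {
    void* handle = dlopen("libm.so.6", RTLD_LAZY);
    if (handle == nullptr) {
        return false;
    }
    // The lookup uses the plain name "cos" because the function has C linkage.
    using CosFn = double (*)(double);
    CosFn cos_fn = reinterpret_cast<CosFn>(dlsym(handle, "cos"));
    bool ok = (cos_fn != nullptr);
    if (ok) {
        result = cos_fn(x);  // nothing checks that CosFn matches the real signature
    }
    dlclose(handle);
    return ok;
}
```

The cast to CosFn is exactly where an ABI mismatch would hide: the compiler trusts the signature you wrote, so a wrong one fails only when the call is made. On older glibc versions you may also need to link with -ldl.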

Conclusion

Dynamically loading libraries is a technique that adds a great deal of flexibility to a software project, but it also complicates matters of ABI compatibility. Whether you're dealing with precompiled libraries on Windows or more often source code on Linux/Mac, understanding how and when ABI compatibility issues can arise is crucial for avoiding both compile-time and run-time errors.


6. Calling C++ from Other Languages

Nice, let's shift gears and explore how to call C++ code from other languages, like Python, Java, or C#. This is a space where ABI considerations become particularly crucial because each language has its own calling conventions, data types, and memory management rules.

Extern "C" Again

Remember how we talked about extern "C" when discussing name mangling? This comes into play here as well. When you use extern "C", you're telling the C++ compiler to use C-style linkage for the specified functions. This disables name mangling and makes the function accessible using a simple, flat C-style API. This is often the key to making C++ code callable from other languages that can interface with C APIs.

Example:

extern "C" {
    void myFunction(int x) {
        // Your C++ code here
    }
}
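
When the C++ side is a class and not a single function, a common approach is to hide it behind an opaque handle and a handful of extern "C" functions. The sketch below illustrates the idea; the Counter class and all the counter_* names are invented for this example.

```cpp
#include <new>  // std::nothrow

// A C++ class we would like to use from another language.
class Counter {
public:
    void add(int n) { total_ += n; }
    int total() const { return total_; }
private:
    int total_ = 0;
};

// Flat C-style API: only unmangled names and plain C types
// (void*, int) cross the language boundary.
extern "C" {
    // Returns an opaque handle, or a null pointer if allocation fails.
    void* counter_create() {
        return new (std::nothrow) Counter();
    }
    void counter_add(void* handle, int n) {
        static_cast<Counter*>(handle)->add(n);
    }
    int counter_total(const void* handle) {
        return static_cast<const Counter*>(handle)->total();
    }
    // The side that created the object is also the side that destroys it.
    void counter_destroy(void* handle) {
        delete static_cast<Counter*>(handle);
    }
}
```

The caller never sees the layout of Counter, so the class can change without breaking callers as long as these four functions keep their signatures. A caller using Python's ctypes, for example, would need to declare the return type of counter_create as a pointer, since ctypes assumes int by default.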

Language-Specific Bridges

Many languages have specialized libraries or modules designed to bridge the gap between C++ and the target language:

  • Python: You can use ctypes or cffi to call C++ functions marked as extern "C". For more complex interactions, you can use wrappers like SWIG or pybind11.
  • Java: JNI (Java Native Interface) is a framework that allows Java code to call native methods written in C++.
  • C#: P/Invoke can be used to call native C++ methods from C# code.

ABI Mismatches and Data Conversion

One challenge you'll face is that each language has its own set of native data types, and these don't always map cleanly onto C++ types. You'll often have to write wrapper functions that convert between these types.

Also, be wary of ABI mismatches. This can happen if the calling conventions between the languages differ or if the other language expects a certain memory layout that your C++ code doesn't adhere to. For example, C-style APIs expect strings to be null-terminated, while many other languages store a string's length separately and don't guarantee a terminator.

Threading and Memory Management

C++ gives you a lot of control over threading and memory management, but other languages might abstract these details away. When mixing languages, make sure you understand the implications for both threading and memory management to avoid issues like deadlocks or memory leaks.


7. Examples of ABI in C++

Excellent, let's cap off this exploration with some concrete examples that show how ABIs manifest in C++.

Changing Function Signatures

Imagine you have a library with a function void foo(int);, and you change it to void foo(int, int);. The function signature has changed, and with it the mangled symbol name. Already-compiled code that calls foo(int) will now fail to link or load against the new library, and source code that calls it will no longer compile.

Adding Virtual Functions

In C++, virtual function tables (vtables) are used to manage virtual functions. Adding a new virtual function to a base class changes the layout of the vtable, thereby breaking ABI compatibility.

// Version 1
class Base {
public:
    virtual void func1();
};

// Version 2
class Base {
public:
    virtual void func1();
    virtual void newFunc();  // ABI breakage
};

Changing Data Members

If you add, remove, or change the type of a data member in a class, you change the size and layout of that class, which breaks the ABI.

// Version 1
class MyClass {
    int a;
};

// Version 2
class MyClass {
    int a;
    double b;  // ABI breakage
};
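
To see the size change directly, both versions can be put side by side. In this sketch the namespaces v1 and v2 exist only so that the two definitions fit in one file, the members are made public for brevity, and shortfall is a helper invented for the example.

```cpp
#include <cstddef>  // std::size_t

namespace v1 {
struct MyClass {
    int a;
};
}  // namespace v1

namespace v2 {
struct MyClass {
    int a;
    double b;  // new member
};
}  // namespace v2

// The library's new version is larger than what old callers allocate.
static_assert(sizeof(v2::MyClass) > sizeof(v1::MyClass),
              "adding a member grew the object");

// Number of bytes a caller compiled against v1 would fall short by.
std::size_t shortfall() {
    return sizeof(v2::MyClass) - sizeof(v1::MyClass);
}
```

An old caller that allocates sizeof(v1::MyClass) bytes and hands the object to new library code would have that code reading and writing past the end of the allocation.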

Reordering Member Functions

Even if you merely reorder the member functions within a class without changing their actual implementation, you can still break the ABI if any of those functions are virtual. The vtable layout is dependent on the order of declaration of virtual functions.

Enum Values

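Inserting, removing, or reordering enumeration values can break the ABI, as the numerical values associated with the existing enum items may change. Appending a new value at the end leaves existing values untouched, although old code may still not know how to handle it.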

// Version 1
enum Color {
    RED,
    BLUE
};

// Version 2
enum Color {
    RED,
    GREEN,  // ABI breakage
    BLUE
};
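
The effect is easy to demonstrate by placing both versions in one file. As before, the v1 and v2 namespaces are just a device for this sketch, and the two helper functions are invented names.

```cpp
namespace v1 {
enum Color { RED, BLUE };          // RED = 0, BLUE = 1
}  // namespace v1

namespace v2 {
enum Color { RED, GREEN, BLUE };   // RED = 0, GREEN = 1, BLUE = 2
}  // namespace v2

// A value stored or transmitted by code compiled against version 1...
int stored_by_old_client() {
    return v1::BLUE;
}

// ...means something else to code compiled against version 2.
bool old_blue_reads_as_green() {
    return static_cast<v2::Color>(stored_by_old_client()) == v2::GREEN;
}
```

Nothing crashes here, which is what makes this kind of breakage nasty: the old client's BLUE silently arrives as GREEN.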

Template Specializations

As we discussed earlier, template specializations can also be a part of ABI. Changing a specialization can have repercussions, as it changes the generated code that relies on that specialization.

Inline Functions

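At first glance, inline functions seem like they wouldn't affect ABI because they're typically implemented in headers. However, the body of an inline function is compiled into every binary that calls it. If an inline function uses a data member that is later changed or removed, already-compiled callers still carry the old body and will misbehave against the new library until they are recompiled.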