Lecture 7: I/O, Structure Packing, Endianness, alignas

Date: 2023-09-05

1. Basic Data Types in Binary I/O

char

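A char in C++ is a fundamental data type that takes up 1 byte of memory. It's often used for representing individual characters but can also be employed for small integer values. Whether plain char is signed or unsigned is implementation-defined, so prefer signed char or unsigned char when the numeric range matters.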

Reading and Writing with char:

// Writing a char
std::ofstream outFile("file.bin", std::ios::binary);
char ch = 'A';
outFile.put(ch);
outFile.close();

// Reading a char
std::ifstream inFile("file.bin", std::ios::binary);
char chRead;
inFile.get(chRead);
inFile.close();

float

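To write a float in big-endian byte order regardless of the host's endianness (std::reverse comes from <algorithm>; note that it reverses the bytes of f in place, so f no longer holds 3.14 afterwards on a little-endian host):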

std::ofstream outFile("file.bin", std::ios::binary);
float f = 3.14f;
unsigned char* f_ptr = reinterpret_cast<unsigned char*>(&f);

if (isLittleEndian()) { // Assume isLittleEndian() checks system endianness
    std::reverse(f_ptr, f_ptr + sizeof(f));
}

outFile.write(reinterpret_cast<const char*>(f_ptr), sizeof(f));
outFile.close();

To read back a float that was stored in big-endian byte order:

std::ifstream inFile("file.bin", std::ios::binary);
float fRead;
unsigned char* f_ptr = reinterpret_cast<unsigned char*>(&fRead);

inFile.read(reinterpret_cast<char*>(f_ptr), sizeof(fRead));

if (isLittleEndian()) {
    std::reverse(f_ptr, f_ptr + sizeof(fRead));
}

inFile.close();

double

The same logic can be applied to double. You would read or write sizeof(double) bytes (8 on virtually all current platforms) and reverse them if the host's byte order differs from the byte order of the file.

std::ofstream outFile("file.bin", std::ios::binary);
double d = 3.14;
unsigned char* d_ptr = reinterpret_cast<unsigned char*>(&d);

if (isLittleEndian()) { // Assume isLittleEndian() checks system endianness
    std::reverse(d_ptr, d_ptr + sizeof(d));
}

outFile.write(reinterpret_cast<const char*>(d_ptr), sizeof(d));
outFile.close();
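
The snippets above rely on an isLittleEndian() helper that these notes never define. A minimal sketch of one possible implementation follows, together with two hypothetical helpers (toBigEndianBytes and fromBigEndianBytes, names invented for this example) that copy the value into a byte array instead of reversing the variable in place. It assumes the type is trivially copyable and that big-endian is the chosen file byte order.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Runtime check of the host byte order: look at the first byte in memory
// of a 16-bit value whose low-order byte is 1.
bool isLittleEndian() {
    const std::uint16_t probe = 0x0001;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 0x01;
}

// Hypothetical helper: returns the bytes of `value` in big-endian order.
// `value` itself is left untouched.
template <typename T>
std::array<unsigned char, sizeof(T)> toBigEndianBytes(const T& value) {
    static_assert(std::is_trivially_copyable<T>::value, "raw byte copy needs a trivially copyable type");
    std::array<unsigned char, sizeof(T)> bytes;
    std::memcpy(bytes.data(), &value, sizeof(T));
    if (isLittleEndian()) {
        std::reverse(bytes.begin(), bytes.end());
    }
    return bytes;
}

// Hypothetical inverse: rebuilds a value from big-endian bytes.
template <typename T>
T fromBigEndianBytes(std::array<unsigned char, sizeof(T)> bytes) {
    static_assert(std::is_trivially_copyable<T>::value, "raw byte copy needs a trivially copyable type");
    if (isLittleEndian()) {
        std::reverse(bytes.begin(), bytes.end());
    }
    T value;
    std::memcpy(&value, bytes.data(), sizeof(T));
    return value;
}
```

The returned array can be passed to write() as reinterpret_cast<const char*>(bytes.data()) with a length of bytes.size(), which keeps the original variable usable after the write.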

2. Advanced Data Types and Structures

In modern C++ programming, beyond basic types like char, float, and double, you often encounter more complex data types and structures. Understanding these is key to efficiently storing and reading binary data.

Custom Structures

Custom structures in C++ allow you to define a data type that can encapsulate multiple variables under a single name. These variables can be of the same type or different types.

If you have an existing structure that isn't packed and you want to write it to a file in a packed form, or read a packed structure from a file and unpack it into your program, you have a few options:

Manually Serializing and Deserializing

The most straightforward approach is to manually serialize the data. That is, you can write each member of the structure to the file individually, ensuring that you only write the actual data and skip the padding bytes.

Here's a simple example:

Writing:

struct MyStruct {
    int id;
    float value;
};

// Serialize and write to file
std::ofstream outFile("file.bin", std::ios::binary);
MyStruct data = {42, 3.14};

// Manually write each field
outFile.write(reinterpret_cast<const char*>(&data.id), sizeof(data.id));
outFile.write(reinterpret_cast<const char*>(&data.value), sizeof(data.value));
outFile.close();

Reading:

// Deserialize and read from file
std::ifstream inFile("file.bin", std::ios::binary);
MyStruct dataRead;

// Manually read each field
inFile.read(reinterpret_cast<char*>(&dataRead.id), sizeof(dataRead.id));
inFile.read(reinterpret_cast<char*>(&dataRead.value), sizeof(dataRead.value));
inFile.close();

Using a Packed Proxy Structure

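Another approach is to define a packed "proxy" structure that has the same members as the original structure but no padding between them. You can then copy the data between the original and the proxy before writing or after reading. #pragma pack is a compiler extension rather than standard C++, although GCC, Clang, and MSVC all support it.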

Writing:

#pragma pack(push, 1)
struct MyPackedStruct {
    int id;
    float value;
};
#pragma pack(pop)

MyStruct original = {42, 3.14};
MyPackedStruct proxy = {original.id, original.value};

std::ofstream outFile("file.bin", std::ios::binary);
outFile.write(reinterpret_cast<const char*>(&proxy), sizeof(proxy));
outFile.close();

Reading:

MyPackedStruct proxyRead;

std::ifstream inFile("file.bin", std::ios::binary);
inFile.read(reinterpret_cast<char*>(&proxyRead), sizeof(proxyRead));
inFile.close();

MyStruct originalRead = {proxyRead.id, proxyRead.value};

By manually serializing or using a packed proxy, you can ensure that you're reading and writing only the data you intend to, without the interference of padding bytes. Both methods allow you to handle the data in a packed form for I/O operations while using unpacked structures in your program logic.
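
MyStruct above holds an int and a float, which usually have the same size and alignment, so it typically contains no padding to begin with. The sketch below uses a hypothetical Record type (a 1-byte member followed by a 4-byte member) where the difference between the unpacked and packed layouts is actually visible. The names Record, PackedRecord, pack, and paddingBytes are invented for this example, and the exact padding depends on the platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record: the compiler normally inserts padding after `tag`
// so that `count` starts at an address suitable for a 32-bit integer.
struct Record {
    char tag;
    std::int32_t count;
};

// Same members, but packed to 1-byte alignment (compiler extension).
#pragma pack(push, 1)
struct PackedRecord {
    char tag;
    std::int32_t count;
};
#pragma pack(pop)

// The packed layout holds exactly 1 + 4 bytes, with `count` right after `tag`.
static_assert(sizeof(PackedRecord) == 5, "packed layout should have no padding");
static_assert(offsetof(PackedRecord, count) == 1, "count should follow tag directly");

// Copies an unpacked Record into its packed proxy, member by member.
PackedRecord pack(const Record& r) {
    PackedRecord p;
    p.tag = r.tag;
    p.count = r.count;
    return p;
}

// Number of padding bytes that writing Record directly would add per record.
std::size_t paddingBytes() {
    return sizeof(Record) - sizeof(PackedRecord);
}
```

On common 64-bit platforms sizeof(Record) is 8, so writing the unpacked structure directly would store 3 bytes of indeterminate padding per record.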

Arrays and Vectors

Arrays and vectors are collections of elements. While the size of an array is fixed, vectors are dynamic.

For arrays, the padding and alignment considerations are generally simpler compared to custom structures, but there are still some things you need to be aware of.

Basic Arrays

Arrays of basic types like int, float, or char don't have internal padding, so you can write them directly to a file. The byte order and the size of each element still depend on the platform that wrote the file.

Writing:

int numbers[] = {1, 2, 3, 4, 5};
std::ofstream outFile("array_file.bin", std::ios::binary);
outFile.write(reinterpret_cast<const char*>(numbers), sizeof(numbers));
outFile.close();

Reading:

int numbersRead[5];
std::ifstream inFile("array_file.bin", std::ios::binary);
inFile.read(reinterpret_cast<char*>(numbersRead), sizeof(numbersRead));
inFile.close();

Arrays of Structures

If you have an array of custom structures, the padding issues within each structure still apply. You'd handle each element as you would a single structure, possibly using manual serialization or a packed proxy structure as described before.

Writing with Serialization:

std::ofstream outFile("array_struct_file.bin", std::ios::binary);
for (const auto& item : arrayOfStructs) {
    outFile.write(reinterpret_cast<const char*>(&item.id), sizeof(item.id));
    outFile.write(reinterpret_cast<const char*>(&item.value), sizeof(item.value));
}
outFile.close();

Reading with Deserialization:

std::ifstream inFile("array_struct_file.bin", std::ios::binary);
for (auto& item : arrayOfStructsRead) {
    inFile.read(reinterpret_cast<char*>(&item.id), sizeof(item.id));
    inFile.read(reinterpret_cast<char*>(&item.value), sizeof(item.value));
}
inFile.close();

Multi-Dimensional Arrays

In C++, multi-dimensional arrays are contiguous blocks of memory. You can write and read them in the same way as one-dimensional arrays. However, if the multi-dimensional array includes custom structures, you'll have to take padding into consideration just like you would with a one-dimensional array of structures.
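
As a rough illustration of the contiguity point, the sketch below writes a 2x3 int array in a single write() call and reads it back in a single read() call. The function name roundTripGrid and the fixed 2x3 shape are assumptions made for this example.

```cpp
#include <cassert>
#include <fstream>

// Writes a 2x3 grid to `path` in one call, then reads it back into `out`.
// Returns true only if both the write and the read succeeded.
bool roundTripGrid(const char* path, const int (&in)[2][3], int (&out)[2][3]) {
    std::ofstream outFile(path, std::ios::binary);
    // sizeof(in) covers all 6 ints because the rows are stored back to back.
    outFile.write(reinterpret_cast<const char*>(in), sizeof(in));
    outFile.close();  // flush now so a failed write is visible in the stream state
    if (!outFile) {
        return false;
    }

    std::ifstream inFile(path, std::ios::binary);
    inFile.read(reinterpret_cast<char*>(out), sizeof(out));
    return static_cast<bool>(inFile);
}
```

The file contains the elements in row-major order: all of row 0, then all of row 1.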

Vectors

Vectors in C++ are dynamic arrays that handle their own memory management. The actual data in a std::vector is stored in a dynamically allocated array, and you can access this data using the data() member function. Just like arrays, vectors of basic types can generally be written and read directly without worrying about padding.

Basic Vectors

For basic types like int, float, or char, you can write the vector to a file by directly writing its internal array.

Writing:

std::vector<int> vec = {1, 2, 3, 4, 5};
std::ofstream outFile("vector_file.bin", std::ios::binary);
outFile.write(reinterpret_cast<const char*>(vec.data()), vec.size()*sizeof(int));
outFile.close();

Reading:

std::vector<int> vecRead(5);
std::ifstream inFile("vector_file.bin", std::ios::binary);
inFile.read(reinterpret_cast<char*>(vecRead.data()), vecRead.size()*sizeof(int));
inFile.close();

Vectors of Custom Structures

If you have a vector of custom structures, the same rules apply as for arrays of structures: You need to be cautious about padding and alignment. You can serialize and deserialize each structure as you write and read it.

Writing with Serialization:

std::ofstream outFile("vector_struct_file.bin", std::ios::binary);
for (const auto& item : vectorOfStructs) {
    outFile.write(reinterpret_cast<const char*>(&item.id), sizeof(item.id));
    outFile.write(reinterpret_cast<const char*>(&item.value), sizeof(item.value));
}
outFile.close();

Reading with Deserialization:

std::vector<MyStruct> vectorOfStructsRead(5);
std::ifstream inFile("vector_struct_file.bin", std::ios::binary);
for (auto& item : vectorOfStructsRead) {
    inFile.read(reinterpret_cast<char*>(&item.id), sizeof(item.id));
    inFile.read(reinterpret_cast<char*>(&item.value), sizeof(item.value));
}
inFile.close();

Important Consideration: Size Information

One thing to note is that while arrays have a fixed size, vectors are dynamic. When writing a vector to a file, you might also want to store its size so you can read it back into a vector of the correct size. This usually involves writing the size before the actual data and reading it before allocating the vector.
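
A minimal sketch of the size-prefix idea follows. It assumes the element count is stored as a fixed-width 64-bit integer in host byte order ahead of the data; the function names writeVector and readVector are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

// Writes the element count first, then the raw elements.
bool writeVector(const char* path, const std::vector<int>& vec) {
    std::ofstream outFile(path, std::ios::binary);
    const std::uint64_t count = vec.size();  // fixed width, unlike size_t
    outFile.write(reinterpret_cast<const char*>(&count), sizeof(count));
    outFile.write(reinterpret_cast<const char*>(vec.data()),
                  static_cast<std::streamsize>(vec.size() * sizeof(int)));
    outFile.close();
    return static_cast<bool>(outFile);
}

// Reads the count, sizes the vector accordingly, then reads the elements.
bool readVector(const char* path, std::vector<int>& vec) {
    std::ifstream inFile(path, std::ios::binary);
    std::uint64_t count = 0;
    if (!inFile.read(reinterpret_cast<char*>(&count), sizeof(count))) {
        return false;  // file missing or too short to hold the count
    }
    // A real reader should sanity-check `count` before allocating,
    // since a corrupted file could request an enormous vector.
    vec.resize(static_cast<std::size_t>(count));
    inFile.read(reinterpret_cast<char*>(vec.data()),
                static_cast<std::streamsize>(vec.size() * sizeof(int)));
    return static_cast<bool>(inFile);
}
```

With this layout the reader no longer needs to know in advance that the vector has 5 elements, as the earlier reading examples did.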


3. I/O Stream Basics (std::ios)

The C++ Standard Library provides a robust I/O mechanism through the stream classes, like std::ifstream for input file streams and std::ofstream for output file streams. These classes are derived from std::ios, which is the common base class that holds the stream state shared by all of them.

3.1 Using std::ifstream and std::ofstream

Both ifstream and ofstream objects can be created and directly associated with a file in their constructors. Alternatively, you can create an object and then open a file later.

Instantiating and Opening Immediately:

std::ifstream inFile("input_file.txt", std::ios::binary);
std::ofstream outFile("output_file.txt", std::ios::binary);

Instantiating and Opening Later:

std::ifstream inFile;
std::ofstream outFile;
inFile.open("input_file.txt", std::ios::binary);
outFile.open("output_file.txt", std::ios::binary);

Note: The std::ios::binary flag opens the file in binary mode, which is what you'll want for dealing with binary I/O. Without this flag, the stream may translate line endings (for example, '\n' to "\r\n" on Windows), which corrupts binary data.

3.2 Common Member Functions

  • open: Open a file with the given filename and mode flags.
inFile.open("input_file.txt", std::ios::binary);
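  • close: Close the file associated with the stream. The stream's destructor does this automatically when the object goes out of scope; call close() explicitly when you want to check afterwards whether the final flush succeeded.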
inFile.close();
  • write: Write raw data into an output file stream.
outFile.write(reinterpret_cast<const char*>(&data), sizeof(data));
  • read: Read raw data from an input file stream.
inFile.read(reinterpret_cast<char*>(&data), sizeof(data));
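  • eof: Returns true only after a read has already tried to go past the end of the file. Use the read itself as the loop condition rather than looping on !eof(), which runs the loop body once more after the last successful read.
while (inFile.read(reinterpret_cast<char*>(&data), sizeof(data))) { /*...*/ }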
  • good: Returns true if no state flag is set; false if eofbit, failbit, or badbit is set.
if (outFile.good()) { /*...*/ }
  • fail: Returns true if failbit or badbit is set, meaning an operation failed (e.g., a failed format conversion, a short read, or a file that could not be opened).
if (inFile.fail()) { /*...*/ }
  • bad: Returns true if a critical error has occurred (e.g., write error or disk full).
if (outFile.bad()) { /*...*/ }

These are some of the basic functionalities you'll often use with std::ifstream and std::ofstream. Learning how to properly open, close, and check the state of these streams will give you a good foundation for more advanced file I/O tasks.
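
To show how these state functions fit together, here is a small sketch that reads consecutive int values until the file runs out and then inspects the stream state to tell a clean end-of-file from a truncated or unreadable file. The function name readAllInts and the cleanEof output parameter are assumptions made for this example.

```cpp
#include <cassert>
#include <fstream>
#include <vector>

// Reads ints from `path` until a read fails. `cleanEof` is set to true only
// if the loop ended exactly at end-of-file with no partial record left over.
std::vector<int> readAllInts(const char* path, bool& cleanEof) {
    std::vector<int> values;
    std::ifstream inFile(path, std::ios::binary);

    int value = 0;
    // The read is the loop condition: the body runs only for complete records.
    while (inFile.read(reinterpret_cast<char*>(&value), sizeof(value))) {
        values.push_back(value);
    }

    // After the loop, failbit is always set. eof() without bad() and with
    // gcount() == 0 means the last read found no bytes at all.
    cleanEof = inFile.eof() && !inFile.bad() && inFile.gcount() == 0;
    return values;
}
```

If the file cannot be opened, the first read fails without setting eofbit, so cleanEof comes back false and the vector is empty.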

3.3 Handling Stream Exceptions (stream.exceptions)

In C++, you can control which I/O errors throw exceptions using the exceptions member function. This allows you to specify which error flags will trigger an exception (std::ios_base::failure) to be thrown. The flags you can use are:

Flag               Description
std::ios::badbit   Unrecoverable stream error, such as a failed write to the underlying device. The stream's integrity is lost.
std::ios::failbit  An operation failed, such as a format conversion, a short read, or a failed open. The stream is usable again after clear().
std::ios::eofbit   End-of-file was reached during a read. Usually not considered an error by itself.
std::ios::goodbit  The value zero, meaning that none of the other bits are set. It is the absence of flags rather than a flag of its own.

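Each of these flags can be passed to exceptions() to select the conditions that throw. For example, inFile.exceptions(std::ios::badbit | std::ios::failbit); throws if either a critical error occurs (badbit) or an operation fails (failbit). Two consequences are worth knowing. First, a read() that reaches end-of-file before filling its buffer sets failbit along with eofbit, so this mask also throws on a short final read. Second, exceptions() itself throws immediately if the stream already has one of the selected bits set, for example because the file failed to open.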

Here's how you can set a stream to throw exceptions:

std::ifstream inFile("file.txt", std::ios::binary);

// Throw an exception if a critical error occurs or a non-critical I/O operation fails
inFile.exceptions(std::ios::badbit | std::ios::failbit);

After setting the exceptions, you can use a try/catch block to handle them:

try {
    // Perform some I/O operations
    inFile.read(buffer, bufferSize);
}
catch (const std::ios_base::failure& e) {
    std::cerr << "Stream error: " << e.what() << '\n';
    // Handle the error
}

Points to Consider:

  • Setting exceptions can make your I/O code cleaner by removing the need to manually check error bits after each operation. However, it can also make the code more complex due to the use of try/catch.

  • You can also turn off exceptions by resetting the exception mask:

inFile.exceptions(std::ios::goodbit);
  • Be careful when mixing exception-based error handling and manual bit checking on the same stream. With failbit in the exception mask, a read that hits end-of-file throws before any code that checks eof() or gcount() gets a chance to run.
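
The interaction between failbit and end-of-file is easy to miss, so here is a small sketch of it. The hypothetical function readThrows reports whether filling a caller-supplied buffer from a file raised std::ios_base::failure; it checks that the file opened before setting the exception mask, because exceptions() throws on the spot if a selected bit is already set.

```cpp
#include <cassert>
#include <fstream>
#include <ios>
#include <vector>

// Returns true if reading buffer.size() bytes from `path` threw
// std::ios_base::failure, false if the read completed (or the open failed).
bool readThrows(const char* path, std::vector<char>& buffer) {
    std::ifstream inFile(path, std::ios::binary);
    if (!inFile) {
        return false;  // check first: arming a failed stream would throw here
    }
    inFile.exceptions(std::ios::badbit | std::ios::failbit);
    try {
        inFile.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    } catch (const std::ios_base::failure&) {
        // A short read sets eofbit and failbit together, and failbit is masked.
        return true;
    }
    return false;
}
```

For chunked binary reads, where a short final read is expected, a mask of std::ios::badbit alone is often the more practical choice.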

4. Sized Buffer of ifstream

Reading and writing data in chunks, often referred to as buffered I/O, can be an efficient way to handle large files. When using std::ifstream, you can control the size of the buffer you read into, making your file operations more efficient.

Basic Buffer Usage

You typically declare an array or a std::vector as your buffer. The size of this buffer will determine how many bytes you read at a time.

char buffer[1024];  // Array of 1 KB
std::ifstream inFile("large_file.bin", std::ios::binary);

To read into the buffer, you use the read member function:

inFile.read(buffer, sizeof(buffer));

Here, read will attempt to fill the entire buffer. You should always check how many bytes were actually read, especially when reaching the end of the file.

std::streamsize bytesRead = inFile.gcount();

Using Vector as a Dynamic Buffer

If you prefer a dynamic buffer size, a std::vector can be used:

std::vector<char> dynamicBuffer(1024);
inFile.read(dynamicBuffer.data(), dynamicBuffer.size());

Checking Read Status

After performing a read operation, it’s good practice to verify the stream's state:

if (inFile) {
    // All bytes were read successfully
} else {
    // Only some (or possibly none) of the bytes were read
    std::streamsize bytesRead = inFile.gcount();
    // Handle the partial read
}

Performance Considerations

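  1. Buffer Size: The choice of buffer size can noticeably affect performance. Very small buffers pay per-call overhead on every read, while sizes that are a multiple of the filesystem block size or memory page size (commonly 4 KB) tend to work well. Treat any specific number as a starting point and measure.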

  2. System Buffer: It's worth noting that std::ifstream itself uses an internal buffer. However, using your own buffer can still be advantageous for various reasons, such as when you need more fine-grained control over the data you're reading.

  3. Multiple Reads: For very large files, you may perform multiple reads in a loop, processing each buffer full of data as you go.

  4. End of File: Be prepared to handle buffers that are not fully filled, especially the last buffer when reading a file. Use gcount() to determine the actual number of bytes read into the buffer.
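
Points 3 and 4 can be combined into one loop. The sketch below reads a file chunk by chunk and uses gcount() after every read so that the short final chunk is handled like any other. The function name countBytes is invented for this example, and simply counting bytes stands in for whatever per-chunk processing a real program would do.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <vector>

// Reads `path` in chunks of `chunkSize` bytes and returns the total number
// of bytes seen. Returns 0 if the file cannot be opened.
std::size_t countBytes(const char* path, std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::ifstream inFile(path, std::ios::binary);
    std::vector<char> buffer(chunkSize);
    std::size_t total = 0;

    while (inFile) {
        inFile.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        // gcount() is the number of valid bytes in the buffer after this read.
        // It is smaller than chunkSize for the last chunk and may be 0.
        const std::size_t got = static_cast<std::size_t>(inFile.gcount());
        total += got;  // process buffer[0 .. got) here
    }
    return total;
}
```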


5. alignas Keyword: Syntax and Use Cases

The alignas keyword in C++ is used to specify a type's or object's alignment requirement. It can help ensure that data is laid out in memory in a way that's optimal for the target architecture. This is often essential for systems-level programming, SIMD (Single Instruction, Multiple Data) operations, and other performance-critical situations.

Syntax

The basic syntax of alignas is straightforward. You place it before the object or type declaration, like this:

alignas(16) int array[4];  // The 'array' starts at an address that's a multiple of 16

Here, the array will be aligned on a 16-byte boundary.

Use Cases

  1. SIMD Operations: Many SIMD instruction sets require data to be aligned to specific boundaries. For example, some SSE instructions require 16-byte alignment.
alignas(16) float data[4];
  2. Memory-Mapped I/O: In systems programming, you often interact with hardware by reading from or writing to specific memory locations. Ensuring correct alignment can be crucial. Note that alignas on a pointer declaration aligns the pointer variable itself, not the address it holds, so it belongs on the type that describes the registers.
struct alignas(4) DeviceRegisters { volatile uint32_t control; volatile uint32_t status; };  // hypothetical register layout
  3. Cache Line Optimization: Modern CPUs have cache lines, often 64 bytes in size. Aligning data structures so that they do not straddle multiple cache lines can improve performance.
alignas(64) char cacheBuffer[64];
  4. Structures and Classes: You can also use alignas with custom types to force all instances of that type to meet certain alignment requirements.
struct alignas(8) AlignedStruct {
    int a;
    char b;
    // ...
};

Points to Note

  • The alignas keyword can be used both on individual variables and on type definitions.

  • The alignment value must be a power of two, and it cannot be weaker than the alignment the type would have without alignas.

  • You can query the alignment of a type at compile-time using alignof.

static_assert(alignof(AlignedStruct) == 8, "AlignedStruct should be aligned to 8 bytes");
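  • Overaligning data, i.e., aligning it to a stricter boundary than required, is legal but can waste memory through extra padding. Support for alignments larger than alignof(std::max_align_t) is implementation-defined, and before C++17 a plain new expression did not take such alignments into account.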
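
The notes above can be checked with a short sketch. It uses a hypothetical Vec4 type alongside the AlignedStruct from the use cases, verifies the alignment at compile time with alignof, and adds an invented isAligned helper for checking an address at run time. The size assertion assumes a 4-byte float.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-float vector aligned for 16-byte SIMD loads.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Same shape as the AlignedStruct shown in the use cases.
struct alignas(8) AlignedStruct {
    int a;
    char b;
};

static_assert(alignof(Vec4) == 16, "Vec4 should be aligned to 16 bytes");
static_assert(sizeof(Vec4) == 16, "assumes a 4-byte float");
static_assert(alignof(AlignedStruct) == 8, "AlignedStruct should be aligned to 8 bytes");
// sizeof is always a multiple of alignof, so overalignment can add tail padding.
static_assert(sizeof(AlignedStruct) % alignof(AlignedStruct) == 0, "size is padded to the alignment");

// Returns true if the address `p` is a multiple of `alignment`.
bool isAligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);  // power of two
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With a 4-byte int, AlignedStruct holds 5 bytes of data but occupies 8, which is the memory cost the last note refers to.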

6. Memory Mapping

Memory mapping is a technique that allows files or devices to be mapped into the application's address space. This essentially means that a file can be read or written just by accessing memory locations. Memory mapping is often faster than traditional file I/O methods because it minimizes the number of system calls and data copying required.

What It Is and Why It's Useful

  1. Direct Memory Access: You can access file data as if it were in-memory data structures. This can lead to more straightforward and faster code.

  2. Performance: Memory mapping enables quicker file reads and writes by reducing the overhead associated with system calls like read() or write().

  3. Resource Sharing: Multiple processes can map the same file into memory for inter-process communication.

  4. Large Files: Memory mapping makes it easier to work with large files that do not fit into physical memory.

How to Implement in C++

On Unix-like systems, including Linux and macOS, you can use the mmap and munmap system calls. On Windows, the CreateFileMapping and MapViewOfFile functions are used.

POSIX (Linux, macOS)

Here's a simple example using mmap in C++ to read a file:

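#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Open file (returns -1 on failure)
int fd = open("example.txt", O_RDONLY);
if (fd == -1) { /* handle error, see errno */ }

// Get file size
struct stat st;
if (fstat(fd, &st) == -1) { /* handle error */ }
size_t fileSize = static_cast<size_t>(st.st_size);

// Memory-map the file (mmap fails when the length is 0, so check for empty files first)
void* mapped = mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0);
if (mapped == MAP_FAILED) { /* handle error */ }

// Use memory: the file contents are now readable as bytes[0] .. bytes[fileSize - 1]
const char* bytes = static_cast<const char*>(mapped);
// ...

// Cleanup
munmap(mapped, fileSize);
close(fd);

Windows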

On Windows, you'd use something like this:

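#include <Windows.h>

// Open file (CreateFileA takes a narrow string regardless of the UNICODE setting)
HANDLE hFile = CreateFileA("example.txt", GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
if (hFile == INVALID_HANDLE_VALUE) { /* handle error, see GetLastError() */ }

// Create file mapping (a size of 0, 0 maps the whole file; this fails for an empty file)
HANDLE hMap = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
if (hMap == NULL) { /* handle error */ }

// Map view of file
void* mapped = MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
if (mapped == NULL) { /* handle error */ }

// Use memory
// ...

// Cleanup
UnmapViewOfFile(mapped);
CloseHandle(hMap);
CloseHandle(hFile);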

Points to Note

  • Make sure to unmap the memory and close the file descriptors/handles when done.

  • Memory-mapped regions should be accessed within the size of the mapped file to avoid undefined behavior.

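  • Be cautious when using memory mapping for writing, especially if multiple processes are involved. Stores to a shared mapping become visible to other processes without any ordering guarantee, and the operating system writes them back to the file at a time of its choosing unless you force it with msync (POSIX) or FlushViewOfFile (Windows).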