7. Convolution

The first half of this section discusses two perspectives on convolution: (1) the input-centric view, which understands convolution as a sum of scaled and shifted impulse responses, and (2) the output-centric view, which defines convolution as a dot product between the inverted impulse and the input signal. For a more detailed and graphical explanation of these two perspectives, I recommend Steven W. Smith's The Scientist and Engineer's Guide to Digital Signal Processing.

The second half of this section focuses on implementing convolution in STM32CubeIDE using CMSIS-DSP. We'll begin with non-real-time convolution before progressing to real-time convolution. Each type will be explored in two versions: (1) a naive implementation using unoptimized C code and (2) an optimized version employing CMSIS-DSP.

1. Non-Real-Time Convolution

1.1 Convolution from Scratch

We'll start with a from-scratch implementation using (relatively) unoptimized C code. The input signal consists of a 1 kHz sine wave mixed with a 15 kHz sine wave (the actual frequencies depend on the sampling rate). The impulse response is a low-pass filter with a cutoff frequency of 6 kHz. The output signal is the result of convolving the input signal with the impulse response.

The convolution in mathematical terms can be represented as follows:

$y[n] = \sum_{{k=\text{{k\_min}}}}^{\text{{k\_max}}} x[k] \cdot h[n-k]$

Here, $y[n]$ is the output signal, $x[k]$ is the input signal, and $h[n-k]$ is the kernel or impulse response. The limits $\text{{k\_min}}$ and $\text{{k\_max}}$ are determined by the current sample $n$ and the length of the kernel $\text{{kernel\_len}}$ .

$\text{{k\_min}} = \max(0, n - (\text{{kernel\_len}} - 1)), \quad \text{{k\_max}} = n$

Translated into C code, the convolution function looks like this:

static void convolve(float32_t *signal, int signal_len, const float32_t *kernel, int kernel_len, float32_t *result)
{
    for (int n = 0; n < signal_len; n++)
    {
        int k_min, k_max, k;
        result[n] = 0;

        k_min = (n >= kernel_len - 1) ? n - (kernel_len - 1) : 0;
        k_max = n;

        for (k = k_min; k <= k_max; k++)
        {
            result[n] += signal[k] * kernel[n - k];
        }
    }
}

convolution_result

1.2 Convolution with CMSIS-DSP

The ARM CMSIS-DSP library doesn't offer a general-purpose convolution function per se. It focuses on providing optimized functions for common DSP operations, and FIR filtering is one of those. Since FIR filtering is essentially a form of convolution, we can use it for the same purpose.

The arm_fir_instance_f32 of the ARM CMSIS-DSP library is designed to for FIR filtering. Here's a quick rundown of its key components:

numTaps: The number of filter coefficients or the length of the impulse response.
pState: Points to the state buffer, which has to be (numTaps + blockSize - 1) in length. The blockSize is the number of input samples to be processed per call.
pCoeffs: Points to the array of filter coefficients or the impulse response.

Initialization

Before you can use arm_fir_f32 to perform the filtering (convolution), you need to initialize an arm_fir_instance_f32 structure using arm_fir_init_f32:

arm_fir_instance_f32 S;
arm_fir_init_f32(&S, LPF_LEN, (float32_t *)&impulse_response[0], &firStateF32[0], BLOCK_SIZE);

In this line, &S is the pointer to an arm_fir_instance_f32 structure, LPF_LEN is the number of taps (filter coefficients), (float32_t *)&impulse_response[0] is the pointer to the impulse response, and &firStateF32[0] is the pointer to the state buffer. BLOCK_SIZE indicates the number of samples to be processed.

Filtering (Convolution)

After initialization, you can use arm_fir_f32 to perform the actual filtering operation:

arm_fir_f32(&S, &input_signal_f32_1kHz_15kHz[i * BLOCK_SIZE], &output_signal[0], BLOCK_SIZE);

Here, &S is the pointer to the initialized arm_fir_instance_f32 structure. The function takes a block of the input signal and stores the filtered output in output_signal.

Putting It All Together

Together, the initialization and filtering steps look like this:

#include "arm_math.h"

#define BLOCK_SIZE 32
#define NUM_BLOCKS (KHZ1_15_SIG_LEN / BLOCK_SIZE)

extern float32_t input_signal_f32_1kHz_15kHz[KHZ1_15_SIG_LEN];
extern const float32_t  impulse_response[LPF_LEN];

static float32_t firStateF32[BLOCK_SIZE + LPF_LEN - 1];
static arm_fir_instance_f32 S;

int main()
{
    // Enable FPU
    SCB->CPACR |= (0xFU << 20);

    // Initialize FIR instance
    arm_fir_init_f32(&S, LPF_LEN, (float32_t *)&impulse_response[0], &firStateF32[0], BLOCK_SIZE);

    while(1)
    {
        float32_t output_signal[BLOCK_SIZE];

        for (int i = 0; i < NUM_BLOCKS; i++)
        {
            arm_fir_f32(&S, &input_signal_f32_1kHz_15kHz[i * BLOCK_SIZE], &output_signal[0], BLOCK_SIZE);

            for (int j = 0; j < BLOCK_SIZE; j++)
            {
                printf("%f, %f\r\n", input_signal_f32_1kHz_15kHz[i * BLOCK_SIZE + j], output_signal[j]);
            }
        }
    }
}

And the output looks like this:

convolution_cmsis_dsp_result

2. Real-Time Convolution

2.1 Real-Time Convolution from Scratch

The non-real-time convolution in section 1.1 above uses batch processing, that is, the convolution() function takes an entire signal and convolves it with a kernel in one go. This is not practical for real-time processing where you're continuously receiving new data points. To make it real-time, there are serveral changes we need to make:

Circular Buffer: We need to store the previous samples so that we can convolve them with the new samples. Using a circular buffer allows the system to continually accept new data without needing to shift the entire dataset.
Single Sample Processing: The function performs convolution every time a new sample arrives, updating the result immediately. This makes it suitable for real-time processing.

The convolution function now looks like this:

static float32_t circular_buffer[CIRCULAR_BUFFER_SIZE];
static int circular_index = 0;
float32_t result;

static void real_time_convolve(float32_t new_sample, float32_t *result)
{
    // Add the new sample to the circular buffer
    circular_buffer[circular_index] = new_sample;

    // Perform the convolution
    *result = 0;
    for (int i = 0; i < LPF_LEN; i++)
    {
        int j = (circular_index - i + CIRCULAR_BUFFER_SIZE) % CIRCULAR_BUFFER_SIZE;
        *result += circular_buffer[j] * impulse_response[i];
    }

    // Update the circular buffer index
    circular_index = (circular_index + 1) % CIRCULAR_BUFFER_SIZE;
}

In this real-time version, the outer loop of the original convolution() function is essentially "unrolled" across multiple calls to the function, that is, we are yeilding the control to the caller after each iteration of the loop.

Note that we are only emulating real-time processing here. The actual input signal is still accessible in batch form. In real-time processing, the input signal would be a stream of data points, and the convolution would be performed on the fly.

2.2 Real-Time Convolution with CMSIS-DSP

However, processing one sample at a time is not the most efficient use of the ARM Cortex-M4F processor. The arm_fir_f32 function is optimized for block processing, so using it to process just one sample at a time doesn't leverage its full potential for speed and efficiency.

Therefore, to use it in a more efficient manner while still simulating a real-time environment, you could use a smaller buffer that fills up with new samples and then processes them in blocks once the buffer is full. Here's a way to adjust the code:

#include "arm_math.h"

#define BLOCK_SIZE 16
#define CIRCULAR_BUFFER_SIZE (BLOCK_SIZE + LPF_LEN - 1)

extern float32_t input_signal_f32_1kHz_15kHz[KHZ1_15_SIG_LEN];
extern const float32_t  impulse_response[LPF_LEN];

static float32_t firStateF32[CIRCULAR_BUFFER_SIZE];
static arm_fir_instance_f32 S;

int main()
{
    // Enable FPU
    SCB->CPACR |= (0xFU << 20);

    arm_fir_init_f32(&S, LPF_LEN, (float32_t *)&impulse_response[0], &firStateF32[0], BLOCK_SIZE);

    float32_t new_sample;
    float32_t output_signal[BLOCK_SIZE];
    int index = 0, block_index = 0;

    while(1)
    {
        if (block_index < BLOCK_SIZE) {
            // Fill the block with new samples
            new_sample = input_signal_f32_1kHz_15kHz[index % KHZ1_15_SIG_LEN];
            firStateF32[block_index] = new_sample;
            block_index++;
        } else {
            // Block is full, process it
            arm_fir_f32(&S, firStateF32, output_signal, BLOCK_SIZE);
            for (int i = 0; i < BLOCK_SIZE; i++) {
                printf("%f, %f\r\n", firStateF32[i], output_signal[i]);
            }
            block_index = 0;
        }

        index++;
    }
}

The modified code still uses arm_fir_f32 to benefit from optimized block processing, but it adapts the function for a more real-time approach.