Chapter 7: Wake Word Detection

Introduction

Wake word detection is the process of detecting a specific word or phrase in an audio stream. For privacy and power reasons, it is often desirable to perform wake word detection on-device rather than in the cloud. This chapter describes how to build a wake word detection model using TensorFlow Lite and how to deploy it on an Arduino Nano 33 BLE Sense.

The model is trained to distinguish between the words "yes", "no", unknown words, and silence. We will see how it is trained in the next chapter. For now, we will focus on the application code and how to run it on the local machine and on the Arduino Nano 33 BLE Sense.


Application Structure

[Figure: application architecture diagram]

As shown in the diagram above, the application consists of five main components. We will take a look at each of them in detail at the end of this chapter.


Part 1. Testing the Model on Local Machine and Audio Files

There are currently two versions of the micro_speech_test.cc file, one in the main TensorFlow repository and one in the separate TensorFlow Lite Micro repository.

We will focus on the newer version. Before digging into it, let's first understand how to write a test case for TensorFlow Lite Micro.


How to write a test case for TensorFlow Lite Micro

The macro TF_LITE_MICRO_TEST is used in TensorFlow Lite for Microcontrollers to define test cases. It works together with the rest of the Micro Test framework to run tests that verify the behavior of your code.

To use TF_LITE_MICRO_TEST, you generally follow these steps:

  1. Include Necessary Headers: Include the headers needed for testing, especially the one that provides the macro definition.
#include "tensorflow/lite/micro/testing/micro_test.h"
  2. Initialize the Test Suite: Before defining individual test cases, you typically start by declaring the beginning of a test suite using TF_LITE_MICRO_TESTS_BEGIN.
TF_LITE_MICRO_TESTS_BEGIN
  3. Define Test Case: Each test is wrapped inside the TF_LITE_MICRO_TEST macro, followed by the test case function body enclosed within curly braces {}.
TF_LITE_MICRO_TEST(YourTestCaseName) {
    // Your test code here
    // ...

    // You can use TF_LITE_MICRO_EXPECT_XXX macros to perform assertions.
    TF_LITE_MICRO_EXPECT_EQ(42, some_function());
}
  4. End the Test Suite: Close the test suite using TF_LITE_MICRO_TESTS_END.
TF_LITE_MICRO_TESTS_END
  5. (Optional) Check for Test Failure: We can define a macro called TF_LITE_MICRO_CHECK_FAIL(). Its main purpose is to exit a function early if a test assertion has failed.
#define TF_LITE_MICRO_CHECK_FAIL()          \
    do {                                    \
        if (micro_test::did_test_fail) {    \
            return kTfLiteError;            \
        }                                   \
    } while (false)

Here's a simple example:

#include "tensorflow/lite/micro/testing/micro_test.h"

#define TF_LITE_MICRO_CHECK_FAIL()          \
    do {                                    \
        if (micro_test::did_test_fail) {    \
            return kTfLiteError;            \
        }                                   \
    } while (false)

TF_LITE_MICRO_TESTS_BEGIN

TF_LITE_MICRO_TEST(TestSomething) {
    int expected = 42;
    int actual = 40 + 2;
    TF_LITE_MICRO_EXPECT_EQ(expected, actual);
    TF_LITE_MICRO_CHECK_FAIL(); // Check if any test has failed so far and return early if so
}

TF_LITE_MICRO_TESTS_END

Within the test case function, you can use various TF_LITE_MICRO_EXPECT_XXX macros to assert conditions like equality, less than, etc. The test framework will automatically run all the tests enclosed between TF_LITE_MICRO_TESTS_BEGIN and TF_LITE_MICRO_TESTS_END, and report the results.
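
For reference, here is how a few of the other assertion macros look in use; the variable names (status, input_tensor, and so on) are placeholders for this illustration, not part of the test file:

TF_LITE_MICRO_EXPECT_EQ(kTfLiteOk, status);            // values are equal
TF_LITE_MICRO_EXPECT_NE(nullptr, input_tensor);        // values are not equal
TF_LITE_MICRO_EXPECT_TRUE(top_score > threshold);      // condition is true
TF_LITE_MICRO_EXPECT_GT(top_score, second_best_score); // strictly greater than
TF_LITE_MICRO_EXPECT_NEAR(0.5f, probability, 0.05f);   // equal within a tolerance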


Registering Custom Operators

In TensorFlow Lite for Microcontrollers, operations (ops) need to be explicitly registered to be available for model inference. The MicroMutableOpResolver class is used to accomplish this. The micro_speech test code contains two examples of registering ops, using two differently sized instances of MicroMutableOpResolver.

Here's a breakdown:

Type Alias for OpResolver

using MicroSpeechOpResolver = tflite::MicroMutableOpResolver<4>;
using AudioPreprocessorOpResolver = tflite::MicroMutableOpResolver<18>;

We can create type aliases for differently sized MicroMutableOpResolver instances. The template argument is the maximum number of operators the resolver can register: MicroSpeechOpResolver can hold up to 4, while AudioPreprocessorOpResolver can hold up to 18.

RegisterOps Function for MicroSpeech

TfLiteStatus RegisterOps(MicroSpeechOpResolver& op_resolver) {
    TF_LITE_ENSURE_STATUS(op_resolver.AddReshape());
    TF_LITE_ENSURE_STATUS(op_resolver.AddFullyConnected());
    TF_LITE_ENSURE_STATUS(op_resolver.AddDepthwiseConv2D());
    TF_LITE_ENSURE_STATUS(op_resolver.AddSoftmax());
    return kTfLiteOk;
}

We can create a function that takes a MicroSpeechOpResolver reference and registers four types of ops: Reshape, FullyConnected, DepthwiseConv2D, and Softmax. TF_LITE_ENSURE_STATUS() checks that each registration is successful.

RegisterOps Function for AudioPreprocessor

TfLiteStatus RegisterOps(AudioPreprocessorOpResolver& op_resolver) {
    // ...
    // Similar to the previous function but registers different ops
    // ...
    return kTfLiteOk;
}

We can create another function that does the same thing but for a different set of ops, tailored for audio preprocessing. Again, it ensures that each operation is successfully registered.

Usage

Later in the code, you would use these RegisterOps functions to populate instances of MicroSpeechOpResolver and AudioPreprocessorOpResolver. Then, these populated resolvers are provided to the MicroInterpreter so that it knows how to handle each operation when running inference on a model.

Code Example for Usage

MicroSpeechOpResolver micro_op_resolver;
AudioPreprocessorOpResolver audio_op_resolver;

RegisterOps(micro_op_resolver);  // Populate with micro speech ops
RegisterOps(audio_op_resolver);  // Populate with audio preprocess ops

// Instantiate the MicroInterpreter with the chosen resolver
tflite::MicroInterpreter interpreter(model, micro_op_resolver, tensor_arena, tensor_arena_size);  // newer TFLM API; older versions also took an error reporter

This is a fundamental part of setting up TensorFlow Lite for Microcontrollers, as missing op registrations will lead to errors during model inference.


Basic Workflow for Running Inference

Here's a basic tutorial for setting up TensorFlow Lite Micro (TFLite Micro) and running inference on a model. Note that this is a C++ example, and it assumes you've already set up your development environment for TFLite Micro.

Step 1: Include Required Headers

Include the necessary headers for TensorFlow Lite Micro. These will differ based on your project, but for a typical setup, you might have:

#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"
#include "tensorflow/lite/version.h"

Step 2: Prepare the Model

The model should be loaded and its version should be checked to ensure compatibility.

const tflite::Model* model = tflite::GetModel(g_my_model);  // Assuming the array `g_my_model` is defined elsewhere

if (model->version() != TFLITE_SCHEMA_VERSION) {
    // Handle version mismatch
    return;
}

Step 3: Create an Op Resolver

Create an Op Resolver to map operations used in the model to their implementations.

using MicroSpeechOpResolver = tflite::MicroMutableOpResolver<4>;
MicroSpeechOpResolver op_resolver;
// Add operations to resolver
// e.g., op_resolver.AddFullyConnected();

Step 4: Allocate Memory for Tensors

Declare a tensor arena, the block of memory TensorFlow Lite Micro uses to hold the model's tensors and intermediate buffers. The required size depends on your specific model.

constexpr size_t kArenaSize = 28584;
alignas(16) uint8_t g_arena[kArenaSize];

Step 5: Create an Interpreter

Create a MicroInterpreter instance using the model, op resolver, and tensor arena.

tflite::MicroInterpreter interpreter(model, op_resolver, g_arena, kArenaSize);

Step 6: Allocate Tensors

After creating the interpreter, tensors need to be allocated.

if (interpreter.AllocateTensors() != kTfLiteOk) {
    // Handle allocation failure
    return;
}

Step 7: Get Input Tensor

Get the input tensor and populate it with your input data.

TfLiteTensor* input = interpreter.input(0);
// Populate `input` tensor

Step 8: Run Inference

Invoke the interpreter to run inference.

if (interpreter.Invoke() != kTfLiteOk) {
    // Handle invocation error
    return;
}

Step 9: Extract Output

After inference, you can extract the output from the output tensor.

TfLiteTensor* output = interpreter.output(0);
// Use `output` tensor
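
For the micro_speech model, the output is a small vector of per-category scores. Here is a minimal sketch of how one might read them, assuming an int8-quantized output and the four categories described at the start of the chapter (kCategoryCount and kCategoryLabels are illustrative names, and the label order should be checked against the trained model):

TfLiteTensor* output = interpreter.output(0);

// Illustrative labels; verify the order against the model's training setup.
constexpr int kCategoryCount = 4;
const char* kCategoryLabels[kCategoryCount] = {"silence", "unknown", "yes", "no"};

// Dequantize the int8 scores using the tensor's quantization parameters.
float scores[kCategoryCount];
for (int i = 0; i < kCategoryCount; ++i) {
    scores[i] = (output->data.int8[i] - output->params.zero_point) *
                output->params.scale;
}

// Pick the highest-scoring category.
int best_index = 0;
for (int i = 1; i < kCategoryCount; ++i) {
    if (scores[i] > scores[best_index]) {
        best_index = i;
    }
}
// kCategoryLabels[best_index] is the model's best guess for this window.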

Part 2. Deploying the Model on Arduino Nano 33 BLE Sense

Open the Arduino IDE and load the micro_speech example sketch, located under File > Examples > Harvard_TinyMLx (or TensorFlowLite) > micro_speech. Then upload the sketch to the Arduino Nano 33 BLE Sense.


Take a look at the micro_speech example sketch

The sketch looks a bit different from the test code we saw earlier, most likely because the Arduino examples have not yet been updated to the latest TensorFlow Lite Micro API, but the general structure is the same. This sketch is considerably more involved than the "Hello World" example, so let's walk through it in detail.

Headers and Namespace

First, various headers are included that provide the functions and objects necessary for the code, such as TensorFlow Lite Micro classes, audio processing, and command recognition.

#include <TensorFlowLite.h>
#include "audio_provider.h"
...

Global Variables

The code sets up global variables to store instances of important classes and data structures, including the error reporter, the TFLite model, the TFLite interpreter, and several others. All are initialized to nullptr and wrapped in an anonymous namespace, and the pointers are only assigned inside the setup() function. (Why not construct the objects directly at global scope? A likely reason is to avoid non-trivial global constructors and to control initialization order explicitly: the objects are built as static locals inside setup(), and the global pointers simply refer to them.)

namespace {
    tflite::ErrorReporter* error_reporter = nullptr;
    ...
    int32_t previous_time = 0;
    ...
}

setup() Function

The setup() function initializes the essential components (a condensed sketch of the function follows this list):

  1. Error Reporting: It sets up an error reporter.
  2. Model Loading: It maps the model into a usable data structure and checks the version.
  3. Op Resolver: Sets up an op resolver, registering only the operations that are needed.
  4. Interpreter: Builds a MicroInterpreter object.
  5. Tensor Allocation: Allocates memory for the model's tensors and checks dimensions and types for the input tensor.
  6. Feature and Command Recognition: Sets up classes for feature extraction and command recognition.
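
Here is that condensed sketch of setup(). The variable names mirror the ones used by the bundled example (error_reporter, interpreter, model_input, feature_provider, recognizer), but treat the exact signatures as approximate, since they differ slightly between library versions:

void setup() {
    // 1. Error reporting (older TFLM API, as used by the bundled sketch).
    static tflite::MicroErrorReporter micro_error_reporter;
    error_reporter = &micro_error_reporter;

    // 2. Model loading and schema version check.
    model = tflite::GetModel(g_model);
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        TF_LITE_REPORT_ERROR(error_reporter, "Model schema version mismatch.");
        return;
    }

    // 3. Op resolver with only the ops this model needs.
    //    (The real sketch checks the return value of each Add... call.)
    static tflite::MicroMutableOpResolver<4> micro_op_resolver;
    micro_op_resolver.AddDepthwiseConv2D();
    micro_op_resolver.AddFullyConnected();
    micro_op_resolver.AddSoftmax();
    micro_op_resolver.AddReshape();

    // 4. Interpreter, built in place into static storage.
    static tflite::MicroInterpreter static_interpreter(
        model, micro_op_resolver, tensor_arena, kTensorArenaSize, error_reporter);
    interpreter = &static_interpreter;

    // 5. Tensor allocation plus a sanity check on the input tensor.
    if (interpreter->AllocateTensors() != kTfLiteOk) {
        TF_LITE_REPORT_ERROR(error_reporter, "AllocateTensors() failed");
        return;
    }
    model_input = interpreter->input(0);

    // 6. Feature extraction and command recognition helpers.
    static FeatureProvider static_feature_provider(kFeatureElementCount,
                                                   feature_buffer);
    feature_provider = &static_feature_provider;
    static RecognizeCommands static_recognizer(error_reporter);
    recognizer = &static_recognizer;

    previous_time = 0;
}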

loop() Function

The loop() function is where the real action happens (a condensed sketch follows this list):

  1. Feature Data: Populates the feature data based on the current time and audio input.
  2. Inference: If there are new slices of audio, the feature buffer is copied to the model's input tensor, and inference is run.
  3. Command Recognition: The output of the model is processed to recognize a specific command.
  4. Response: Finally, a function is called to respond to the recognized command.
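
And a similarly condensed sketch of loop(), again following the structure of the bundled example rather than reproducing it exactly:

void loop() {
    // 1. Fetch audio and refresh the feature buffer for the elapsed window.
    const int32_t current_time = LatestAudioTimestamp();
    int how_many_new_slices = 0;
    TfLiteStatus feature_status = feature_provider->PopulateFeatureData(
        error_reporter, previous_time, current_time, &how_many_new_slices);
    if (feature_status != kTfLiteOk) {
        TF_LITE_REPORT_ERROR(error_reporter, "Feature generation failed");
        return;
    }
    previous_time = current_time;
    if (how_many_new_slices == 0) {
        return;  // Nothing new to run inference on yet.
    }

    // 2. Copy the features into the input tensor and run inference.
    for (int i = 0; i < kFeatureElementCount; i++) {
        model_input->data.int8[i] = feature_buffer[i];
    }
    if (interpreter->Invoke() != kTfLiteOk) {
        TF_LITE_REPORT_ERROR(error_reporter, "Invoke failed");
        return;
    }

    // 3. Smooth the raw scores over time and look for a confident command.
    TfLiteTensor* output = interpreter->output(0);
    const char* found_command = nullptr;
    uint8_t score = 0;
    bool is_new_command = false;
    if (recognizer->ProcessLatestResults(output, current_time, &found_command,
                                         &score, &is_new_command) != kTfLiteOk) {
        TF_LITE_REPORT_ERROR(error_reporter, "RecognizeCommands failed");
        return;
    }

    // 4. React to the recognized command (LEDs, serial output, and so on).
    RespondToCommand(error_reporter, current_time, found_command, score,
                     is_new_command);
}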

Error Handling

Throughout the code, extensive error handling is implemented via TF_LITE_REPORT_ERROR() calls, making it easier to debug.

Memory Management

Memory is statically allocated for the tensors and features, as evidenced by the tensor_arena and feature_buffer arrays.

constexpr int kTensorArenaSize = 10 * 1024;
uint8_t tensor_arena[kTensorArenaSize];
int8_t feature_buffer[kFeatureElementCount];

In summary, the code provides a complete example of how to set up a real-time audio processing application using TensorFlow Lite Micro, from feature extraction to model inference to command recognition. It's a good example of how to go about deploying machine learning models on edge devices with limited computational resources.

As mentioned at the beginning of this chapter, the application is made up of five main components. Let's take a look at each of them in detail.


1. Audio Provider

The arduino_audio_provider.cpp file implements the audio provider for TensorFlow Lite Micro on Arduino and similar embedded systems. It captures audio data using Pulse Density Modulation (PDM), a scheme that encodes an analog signal, such as audio, into digital form. Unlike Pulse Code Modulation (PCM), where the amplitude of the analog signal is sampled at regular intervals to produce a stream of multi-bit values, a PDM stream has only two states, '1' or '0', with the density of '1's tracking the signal's amplitude. PDM is often used where computational resources are limited, such as on microcontrollers, because it is simpler to implement than more complex modulation schemes. The code reads the microphone data through the PDM library.

Important Variables

  • g_audio_capture_buffer[]: This is where the raw audio samples are stored.
  • g_audio_output_buffer[]: This is the output buffer that will be used by TensorFlow Lite.
  • g_latest_audio_timestamp: A timestamp indicating when the last audio sample was captured.

Initialization (InitAudioRecording)

  • The function InitAudioRecording initializes the PDM microphone.
  • The function PDM.onReceive(CaptureSamples); sets the CaptureSamples function to be called when new audio samples arrive.
  • PDM.begin(1, kAudioSampleFrequency); starts the PDM microphone in mono mode at a given sample frequency.

Capturing Samples (CaptureSamples)

  • This function is called whenever new audio samples are available.
  • PDM.read(g_audio_capture_buffer + capture_index, DEFAULT_PDM_BUFFER_SIZE); reads the audio samples into g_audio_capture_buffer.

Getting Samples (GetAudioSamples)

  • This function is used to retrieve the audio samples captured.
  • It calculates which samples are needed based on start_ms and duration_ms arguments.
  • The samples are then copied to g_audio_output_buffer.

Utility Function (LatestAudioTimestamp)

  • This simply returns the latest audio timestamp (g_latest_audio_timestamp).

Conditional Compilation (#ifndef ARDUINO_EXCLUDE_CODE)

  • Parts of the file are wrapped in #ifndef ARDUINO_EXCLUDE_CODE, so they are only compiled when this implementation applies to the target board.

To sum up, this code sets up an audio interface for TensorFlow Lite on Arduino. It captures audio using PDM and provides that audio data to TensorFlow Lite for processing.
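
To make the ring-buffer bookkeeping concrete, here is a rough sketch of what the PDM callback does. The buffer and constant names match the ones described above (kAudioCaptureBufferSize is the ring buffer length defined in the provider); treat the arithmetic as illustrative rather than a verbatim copy of the file:

// Called by the PDM library whenever a new block of samples is ready.
void CaptureSamples() {
    // DEFAULT_PDM_BUFFER_SIZE is in bytes; each sample is 16 bits.
    const int number_of_samples = DEFAULT_PDM_BUFFER_SIZE / 2;

    // Work out where this block lands in the circular capture buffer.
    const int32_t time_in_ms =
        g_latest_audio_timestamp +
        (number_of_samples / (kAudioSampleFrequency / 1000));
    const int32_t start_sample_offset =
        g_latest_audio_timestamp * (kAudioSampleFrequency / 1000);
    const int capture_index = start_sample_offset % kAudioCaptureBufferSize;

    // Copy the new samples into the ring buffer...
    PDM.read(g_audio_capture_buffer + capture_index, DEFAULT_PDM_BUFFER_SIZE);

    // ...and publish the new timestamp so GetAudioSamples() knows fresh data
    // is available.
    g_latest_audio_timestamp = time_in_ms;
}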


2.1 Feature Provider

The FeatureProvider class serves as a wrapper around the feature generator described in the next section. It manages the temporal aspects of the audio data and decides which portions of the audio stream to process when generating features. Let's break down its functionality:

Constructor

The constructor stores a pointer to an externally allocated buffer, feature_data_, of size feature_size_, and fills it with initial values. This buffer is later populated by the PopulateFeatureData method.

PopulateFeatureData Method

This method is the workhorse. It fetches audio data, processes it, and fills feature_data_ with computed features. Here's a step-by-step:

  1. Check Size: The method starts by checking if feature_size_ matches kFeatureElementCount. If not, an error is reported.

  2. Calculate Steps and Slices: It calculates how many new "slices" of audio data are needed based on the given last_time_in_ms and time_in_ms.

  3. Initialization: If this is the first run, it calls InitializeMicroFeatures to set up the feature extraction pipeline.

  4. Data Shifting: If any old data can be reused, it's shifted to the beginning of feature_data_.

  5. Audio Fetch and Feature Calculation: For new slices needed, it fetches the corresponding audio samples (using GetAudioSamples of the Audio Provider) and calls GenerateMicroFeatures to populate feature_data_.

Error Handling

The method uses TensorFlow Lite's error reporting to flag any issues like size mismatches or initialization errors.

Data Movement

The PopulateFeatureData method reuses existing data efficiently. If only part of feature_data_ needs to be updated, it shifts the still-valid slices toward the start of the buffer and computes features only for the new slices.
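
A simplified sketch of that shifting logic is shown below. It is a hypothetical fragment from inside PopulateFeatureData(): slices_needed stands for the number of new slices required for this time step, and kFeatureSliceCount / kFeatureSliceSize describe the feature buffer's layout in the example.

if (slices_needed < kFeatureSliceCount) {
    const int slices_to_keep = kFeatureSliceCount - slices_needed;
    for (int dest_slice = 0; dest_slice < slices_to_keep; ++dest_slice) {
        int8_t* dest_ptr = feature_data_ + (dest_slice * kFeatureSliceSize);
        const int8_t* src_ptr =
            feature_data_ + ((dest_slice + slices_needed) * kFeatureSliceSize);
        // Slide each still-valid slice toward the front of the buffer.
        for (int i = 0; i < kFeatureSliceSize; ++i) {
            dest_ptr[i] = src_ptr[i];
        }
    }
}
// The remaining `slices_needed` slices at the end of the buffer are then
// filled by calling GetAudioSamples() and GenerateMicroFeatures() on the
// newest audio.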

Final Thoughts

The class is tightly coupled with both the audio capture code in arduino_audio_provider.cpp and the feature generator described next. It fetches audio samples directly and manages the complexity of overlapping, strided feature computation windows.


2.2 Feature Generator

The Feature Generator produces features with a specialized audio processing pipeline that combines an FFT (Fast Fourier Transform), a filter bank, noise reduction, and several other stages, relying heavily on the TensorFlow Lite Micro Frontend Library. Below are the key components of the feature extraction process:

  1. Windowing: The audio signal is divided into overlapping windows, controlled by parameters kFeatureSliceDurationMs and kFeatureSliceStrideMs.

  2. Noise Reduction: The code applies noise reduction techniques with some smoothing and minimum signal remaining ratio.

  3. Filterbank: A filterbank with a specific number of channels (kFeatureSliceSize) is applied to the FFT output. The filter bank focuses on the frequency range between 125.0 and 7500.0 Hz.

  4. PCAN (Per-Channel Automatic Normalization): It's enabled with a strength parameter set to 0.95 and an offset of 80.0.

  5. Log Scaling: Logarithmic scaling is applied with a specific scale shift.

  6. Quantization: Finally, the code quantizes the output features into 8-bit integer values that range from -128 to 127.

The actual audio features are processed by the FrontendProcessSamples function, which is a part of TensorFlow Lite's Micro Frontend Library.
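
Under the hood, this corresponds roughly to filling in a FrontendConfig from the Micro Frontend Library and calling FrontendPopulateState(). The sketch below illustrates that configuration: the field names follow the library, the 125-7500 Hz range and the PCAN strength/offset are the values quoted above, and everything else is left at the library defaults (the example tunes additional fields; see its feature generator source for the exact values).

// Assumed include path for the Micro Frontend Library; it may differ between
// TensorFlow checkouts.
#include "tensorflow/lite/experimental/microfrontend/lib/frontend_util.h"

FrontendConfig config;
FrontendFillConfigWithDefaults(&config);  // start from the library defaults

// Windowing, driven by the slice duration and stride constants.
config.window.size_ms = kFeatureSliceDurationMs;
config.window.step_size_ms = kFeatureSliceStrideMs;

// Filterbank: kFeatureSliceSize channels covering 125-7500 Hz.
config.filterbank.num_channels = kFeatureSliceSize;
config.filterbank.lower_band_limit = 125.0f;
config.filterbank.upper_band_limit = 7500.0f;

// Per-channel automatic gain normalization (PCAN).
config.pcan_gain_control.enable_pcan = 1;
config.pcan_gain_control.strength = 0.95f;
config.pcan_gain_control.offset = 80.0f;

// The example also tunes the noise reduction and log scaling fields
// (config.noise_reduction.* and config.log_scale.*).

// Build the frontend state that FrontendProcessSamples() will use.
FrontendState frontend_state;
FrontendPopulateState(&config, &frontend_state, kAudioSampleFrequency);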


3. Inference

Inference is done by calling the interpreter->Invoke() method. This method runs the model on the input data and populates the output tensor with the results.


4. Command Recognizer

Class Definition and Constructor

The class RecognizeCommands is designed to recognize commands based on audio features processed by TensorFlow Lite. The constructor initializes several private variables, including:

  • error_reporter_: For logging errors.
  • average_window_duration_ms_: The time window for averaging results.
  • detection_threshold_: A threshold for recognizing a command.
  • suppression_ms_: A minimum gap time between recognized commands.
  • minimum_count_: Minimum number of results needed to confirm a command.

ProcessLatestResults Method

The ProcessLatestResults method processes the latest recognition results, which are in the form of a TfLiteTensor.

  1. Input Checks: It first checks whether the dimensions and type of latest_results are as expected. If not, it logs an error and returns.

  2. Timestamp Check: It checks whether the new results are newer than the oldest ones in the queue. Again, if not, it logs an error and returns.

  3. Queue Management: It adds the latest results to a queue (previous_results_) and removes any results that are too old based on average_window_duration_ms_.

  4. Early Exit: If there are fewer results in the queue than minimum_count_, it returns an unreliable result.

  5. Average Calculation: Then it calculates the average scores for each category (like "yes", "no", etc.) across all the results in the queue.

  6. Top Scoring Category: It identifies the category with the highest average score and compares it with the previously recognized command.

  7. Command Recognition and Suppression: It checks if the score of the identified category is above the threshold and if enough time (suppression_ms_) has passed since the last recognized command.

  8. Output: It sets the recognized command, score, and whether it's a new command in the output parameters.

The code makes use of a moving window average and additional criteria like detection_threshold_ and suppression_ms_ to improve command recognition reliability.
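
A simplified sketch of the averaging and thresholding at the heart of this logic is shown below; the container and variable names (previous_results, results_in_window, previous_command_time_, kCategoryLabels) are stand-ins for the example's internal queue, not the exact implementation.

// Average the scores for each category across the results in the window.
int32_t average_scores[kCategoryCount] = {0};
for (int result = 0; result < results_in_window; ++result) {
    const int8_t* scores = previous_results[result].scores;
    for (int category = 0; category < kCategoryCount; ++category) {
        average_scores[category] += scores[category];
    }
}
for (int category = 0; category < kCategoryCount; ++category) {
    average_scores[category] /= results_in_window;
}

// Find the top-scoring category.
int top_index = 0;
for (int category = 1; category < kCategoryCount; ++category) {
    if (average_scores[category] > average_scores[top_index]) {
        top_index = category;
    }
}

// Report a new command only if it clears the detection threshold and enough
// time has passed since the last recognized command.
const bool above_threshold = average_scores[top_index] > detection_threshold_;
const bool enough_time_passed =
    (current_time_ms - previous_command_time_) > suppression_ms_;
*is_new_command = above_threshold && enough_time_passed;
*found_command = kCategoryLabels[top_index];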


5. Command Response

The function RespondToCommand is designed to handle voice command recognition results and provide a visual response using LEDs. Let's break down its functionality:

LED Light Logic

The function uses the built-in LED and the on-board RGB LED to display states (on the Nano 33 BLE Sense the RGB LED is active-low, so writing LOW turns a color on):

  • The green LED turns on when "yes" is heard (digitalWrite(LEDG, LOW);).
  • The red LED turns on when "no" is heard (digitalWrite(LEDR, LOW);).
  • The blue LED turns on when an unknown word is heard (digitalWrite(LEDB, LOW);).

if (is_new_command)
{
    if (found_command[0] == 'y') {
        digitalWrite(LEDG, LOW);  // Green for yes
    }
    if (found_command[0] == 'n') {
        digitalWrite(LEDR, LOW);  // Red for no
    }
    if (found_command[0] == 'u') {
        digitalWrite(LEDB, LOW);  // Blue for unknown
    }
}

LED Off Logic

The variable last_command_time records the last time a command was recognized. If more than 3,000 milliseconds (3 seconds) have passed since then (current_time - last_command_time > 3000), all LEDs are turned off and last_command_time is reset to zero.

if (last_command_time < (current_time - 3000)) {
    last_command_time = 0;
    digitalWrite(LED_BUILTIN, LOW);
    digitalWrite(LEDR, HIGH);
    digitalWrite(LEDG, HIGH);
    digitalWrite(LEDB, HIGH);
}

is_new_command

The is_new_command flag indicates whether a new command was heard. If it is true, the function logs the command and sets last_command_time to current_time.

if (is_new_command) {
    TF_LITE_REPORT_ERROR(error_reporter, "Heard %s (%d) @%dms", found_command,
                        score, current_time);
    last_command_time = current_time;
    // ... (LED logic here)
}