Chapter 8: Training a Wake Word Model

This chapter focuses on this notebook. The notebook itself calls other Python scripts, which are described in this tutorial.


Write a "train" script

Creating your own training script can be broken down into several key steps. For this example, let's stick with TensorFlow since you've been working with it already:

  1. Import Libraries: Import all the required libraries like TensorFlow, NumPy, etc.
import tensorflow as tf
import numpy as np
  2. Parse Arguments: Since you're using command-line arguments, you can use argparse to capture them in your script.
import argparse

parser = argparse.ArgumentParser()
# Note: a required argument never falls back to its default, so supply one
# or the other; here we keep the default and leave the flag optional.
parser.add_argument("--model_architecture",
                    default="my_default_model",
                    type=str,
                    help="What model architecture to use")
#
# add more args here...
#
FLAGS, unparsed = parser.parse_known_args()

Later, you can use FLAGS.model_architecture to read the value of this argument.

  3. Load Data: Read in your dataset.
# Assuming you have some function to load data
train_data, test_data = load_data(FLAGS.data_dir)
  4. Preprocessing: Implement any data preprocessing steps.
# Assuming preprocess function exists
train_data = preprocess(train_data, FLAGS.preprocess)
  5. Model Architecture: Define your neural network model. If you have multiple architectures, you can use a conditional statement to decide which one to use based on the argument passed; see the sketch after this list.
# Assuming you have a function to create model
model = create_model(FLAGS.model_architecture)
  6. Compile Model: Choose your optimizer, loss function, and metrics.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=FLAGS.learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
  7. Training: Train the model using the training data.
# Note that Keras's fit() counts epochs, while the flag name refers to steps.
model.fit(train_data, epochs=FLAGS.how_many_training_steps)
  8. Logging and Saving: Optionally, you can log the training process or save the model.
model.save(FLAGS.train_dir + "/model")
  9. Quantization: If you want to quantize your model, you can use TensorFlow's quantization techniques. The fragment below follows the TF1-style fake-quantization approach; input_data, model_settings, and input_placeholder are defined elsewhere in the full script.
if FLAGS.quantize:
    # Find the numeric range of the input features...
    fingerprint_min, fingerprint_max = input_data.get_features_range(
        model_settings)
    # ...and insert a fake-quantization op that simulates 8-bit precision.
    fingerprint_input = tf.quantization.fake_quant_with_min_max_args(
        input_placeholder, fingerprint_min, fingerprint_max)
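
The architecture selection mentioned in step 5 might look like the following sketch. The architecture names and layer choices here are hypothetical placeholders, not the actual networks from this chapter:

def create_model(architecture):
    # Hypothetical architectures for illustration only.
    if architecture == "small_conv":
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(8, (3, 3), activation="relu",
                                   input_shape=(49, 40, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(4, activation="softmax"),
        ])
    if architecture == "single_fc":
        return tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(49, 40, 1)),
            tf.keras.layers.Dense(4, activation="softmax"),
        ])
    raise ValueError(f"Unknown architecture: {architecture}")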

Your complete script will be a compilation of these parts. After writing the script, you can run it using a command like the following:

python3 train.py --model_architecture=tiny_conv --window_stride=20

Checkpoint Files

Checkpoint files typically store the learned parameters like weights and biases, but they do not include the model architecture by default. You would usually use the model's code or a separate metadata file to rebuild the architecture.
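
As a Keras-style sketch of this idea (using the hypothetical create_model helper from above), only the weights go into the checkpoint, so the architecture must be rebuilt in code before loading:

# Save only the learned parameters, not the architecture.
model.save_weights("/tmp/wake_word_ckpt")

# Restoring requires recreating the architecture first...
restored = create_model("small_conv")
# ...and only then loading the saved weights into it.
restored.load_weights("/tmp/wake_word_ckpt")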

Colab Expiry

Google Colab sessions can expire, and their maximum lifespan is 12 hours. It is therefore crucial to save your model's checkpoints to a persistent location such as Google Drive.
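
In Colab, Google Drive can be mounted directly from the notebook (the checkpoint path below is just an example):

from google.colab import drive

# Mount Drive; you'll be prompted to authorize access.
drive.mount('/content/drive')

# Anything written under /content/drive/MyDrive survives the session.
checkpoint_dir = '/content/drive/MyDrive/wake_word_checkpoints'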

Loading Checkpoints at Start

if FLAGS.start_checkpoint:
    # Restore the saved weights and biases into the session...
    models.load_variables_from_checkpoint(sess, FLAGS.start_checkpoint)
    # ...and resume from the step count recorded at save time.
    start_step = global_step.eval(session=sess)

Here, FLAGS.start_checkpoint is presumably the path to a saved checkpoint. This code restores the model's variables (like weights and biases) from that checkpoint, and the training resumes from the global_step where it left off.

Saving Checkpoints in Training Loop

We can save the model at intervals specified by FLAGS.save_step_interval, as well as at the end of training.

if (training_step % FLAGS.save_step_interval == 0 or
        training_step == training_steps_max):
    checkpoint_path = os.path.join(FLAGS.train_dir, FLAGS.model_architecture + '.ckpt')
    tf.compat.v1.logging.info('Saving to "%s-%d"', checkpoint_path, training_step)
    saver.save(sess, checkpoint_path, global_step=training_step)

The model checkpoints are saved in the directory specified by FLAGS.train_dir, and the checkpoint file has a name based on FLAGS.model_architecture. The global_step=training_step argument means that TensorFlow will append -<step> to the filename, making it easier to identify checkpoints by the step at which they were saved.


Convert Graph to TFLite

.pb file

The .pb extension stands for Protocol Buffers, a language-neutral, platform-neutral method developed by Google for serializing structured data. In the context of TensorFlow, a .pb file usually stores the graph definition along with the weights of the model. There is also a human-readable version of the graph definition, the .pbtxt file. The .pb file is typically used for deployment, while the .pbtxt file is used for debugging. In the train script above, we can save the graph definition as a .pbtxt file using the following code:

tf.compat.v1.train.write_graph(sess.graph_def, FLAGS.train_dir,
                               FLAGS.model_architecture + '.pbtxt')

TensorFlow Lite Converter - TOCO

TOCO stands for TensorFlow Lite Optimizing Converter. It's a command-line utility that converts TensorFlow models into the more compact TensorFlow Lite format, suitable for mobile or embedded platforms.

Here's a more generic version of the command:

toco \
--graph_def_file=<path/to/model.pb> --output_file=<path/to/output.tflite> \
--input_shapes=<input_shape> --input_arrays=<input_node_name> --output_arrays=<output_node_name> \
--inference_type=<inference_type> --mean_values=<mean_values> --std_dev_values=<std_dev_values>
  • --graph_def_file: Specifies the path to the .pb file containing the TensorFlow model.
  • --output_file: Specifies where to write the converted .tflite file.
  • --input_shapes: Describes the shape of the input tensor.
  • --input_arrays and --output_arrays: Specify the names of the input and output nodes in the graph.
  • --inference_type: Sets the type of inference to be used (e.g., FLOAT, QUANTIZED_UINT8).
  • --mean_values and --std_dev_values: These are optional and used for quantization.

The output, in our case, would be a tiny_conv.tflite file, which can be deployed to mobile and other resource-constrained environments.
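
Note that TOCO has since been superseded by the tf.lite.TFLiteConverter Python API in newer TensorFlow releases. A minimal sketch of the modern equivalent (the paths here are examples):

import tensorflow as tf

# Convert a SavedModel directory to a TFLite flatbuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("/content/model")
tflite_model = converter.convert()

with open("/content/tiny_conv.tflite", "wb") as f:
    f.write(tflite_model)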


Converting .tflite file into a C array using xxd

This conversion is useful for embedding the model directly into an application, often for deployment on edge devices with resource constraints.

How it works:

  1. Install xxd: First, it makes sure the xxd utility is installed. This utility is commonly used to produce hex dumps or to convert hex dumps back into binary.
!apt-get -qq install xxd
  2. Convert to C Array: The xxd -i command reads the .tflite model file and outputs a C array that represents the file’s contents in hexadecimal format.
!xxd -i /content/tiny_conv.tflite > /content/tiny_conv.cc
  3. Print the Source File: It then prints the generated C source file to verify the content.
!cat /content/tiny_conv.cc

Output Explanation:

  • unsigned char _content_tiny_conv_tflite[] is the byte array storing the model data.
  • unsigned int _content_tiny_conv_tflite_len = 18208; specifies the length of the byte array, which you'll need when reading the array back into a program.

With this C array, you can now include the model as part of a C or C++ application, such as one running on an embedded system, without needing to read the model from a file. This is often necessary for environments where file system access is either restricted or not available.


Feature Extraction Process

  1. Apply Fourier Transform: For each 30-ms slice of audio, run a Fourier transform after applying a Hann window filter.

  2. Calculate Magnitude: Square the real and imaginary parts of the Fourier transform, sum them up, and then take the square root to find the magnitude for each frequency bucket.

  3. Frequency Bucketing: Given 480 samples in 30 ms (at a rate of 16,000 samples/second), we pad zeros to make it 512 samples for the FFT algorithm. This gives us 256 frequency buckets. To make it more manageable, we average these into 40 buckets. We use the mel frequency scale to give more weight to lower frequencies.

  4. Noise Reduction: Keep a running average of each frequency bucket. Subtract this average from the current bucket value to remove background noise.

  5. Odd-Even Coefficients: The algorithm uses different coefficients for odd and even frequency buckets. This was designed to enhance performance and creates a comb-like pattern in the feature images.

  6. Signal Boosting: Use per-channel amplitude normalization (PCAN) to auto-gain the signal based on average noise levels.

  7. Log Scaling: Apply a logarithmic scale to each bucket value to prevent louder frequencies from overshadowing quieter ones.

This procedure is looped 49 times, each time advancing the 30-ms audio window by 20 ms, to process one second of audio. The result is a 2D array with 40 columns (for each frequency bucket) and 49 rows (for each time slice).
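
The framing arithmetic can be checked with a few lines of NumPy (the audio buffer here is a placeholder):

import numpy as np

SAMPLE_RATE = 16000                     # samples per second
WINDOW = int(SAMPLE_RATE * 0.030)       # 30 ms -> 480 samples
STRIDE = int(SAMPLE_RATE * 0.020)       # 20 ms -> 320 samples

audio = np.zeros(SAMPLE_RATE)           # one second of placeholder audio
num_frames = 1 + (len(audio) - WINDOW) // STRIDE
print(num_frames)                       # 49 time slices

# For a single slice: Hann window, zero-pad 480 -> 512, FFT, magnitude.
frame = audio[:WINDOW] * np.hanning(WINDOW)
magnitudes = np.abs(np.fft.rfft(frame, n=512))
# 257 bins (the DC bin plus the 256 buckets described above), which the
# pipeline then averages down to 40 mel-scaled buckets.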


Model Structure

The model has three main layers:

  1. Convolutional Layer: Contains 8 filters, each sized 8x10 pixels.
  2. Fully Connected Layer: Connects to the convolutional layer's output.
  3. Softmax Layer: Outputs the final probabilities for each category.

For input, the model expects a 4D array with dimensions 1 x 49 x 40 x 1, which represents a spectrogram of 49x40. The extra dimensions are for TensorFlow compatibility.

After convolution, the output shape becomes 1 x 25 x 20 x 8, giving us 4000 values. These feed into the fully connected layer, which then returns 4 outputs corresponding to the categories: "yes," "no," "unknown words," and "silence." The category with the highest probability becomes the model's prediction.

The model is lightweight, occupying only 18 KB and requiring 400,000 arithmetic operations for a single run.
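
A Keras sketch reproducing the shapes above; the 2x2 stride and "same" padding are inferred from the 49x40-to-25x20 reduction, so treat this as an illustration rather than the chapter's exact code:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),        # 49x40 spectrogram
    tf.keras.layers.Conv2D(8, kernel_size=(10, 8),   # 8 filters, 8 wide x 10 high
                           strides=(2, 2), padding="same",
                           activation="relu"),       # -> 1 x 25 x 20 x 8
    tf.keras.layers.Flatten(),                       # -> 4,000 values
    tf.keras.layers.Dense(4, activation="softmax"),  # yes / no / unknown / silence
])
model.summary()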


Additional Factors

Detection Frequency

To increase the likelihood of capturing a complete word within a one-second window, the model needs to run more frequently than once per second. Based on experience, running the model 10 to 15 times per second yields optimal results.

Post-Processing

A special post-processing class is used to improve recognition accuracy. This class takes the raw model outputs and averages the scores over a short time window; a word is recognized only if its category receives multiple high scores within that window and its averaged score exceeds a set threshold. The filtered results are then sent to the CommandResponder, which performs an action based on the platform's capabilities. The sketch below illustrates the averaging idea.
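
A minimal sketch of the averaging idea (this is not the actual post-processing class, and the window size and threshold are made-up values):

from collections import deque

import numpy as np

class ScoreAverager:
    def __init__(self, window=10, threshold=0.8):
        self.history = deque(maxlen=window)   # recent raw model outputs
        self.threshold = threshold

    def process(self, scores):
        self.history.append(np.asarray(scores))
        mean_scores = np.mean(self.history, axis=0)
        best = int(np.argmax(mean_scores))
        if mean_scores[best] >= self.threshold:
            return best       # index of the confidently recognized category
        return None           # no category cleared the threshold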