Here’s how to do keyword spotting using a simple convolutional network


With the rapid development of mobile devices, voice technology is booming like never before. Service providers such as Google offer voice search on the Android platform, while personal assistants such as Microsoft’s “Cortana”, Apple’s “Siri” and Amazon’s “Alexa” use keyword recognition to interact with the system. On Android phones, ‘Ok Google’ relies on the same capability: the device listens for a specific keyword that triggers voice-based commands. Keyword recognition, also known as keyword spotting, is a speech technology that detects the presence of a word or short phrase in an audio stream.

Keyword recognition in a real environment is much more complex than this demonstration. This article focuses on the basic idea of keyword spotting for short one-second audio files. Because convolutional neural networks excel at image classification, we use them for the keyword classification task. To do this, we convert the audio files into spectrograms, which are visual representations of audio, so that a convolutional network can process them. Before moving on to the code, let’s look at the spectrogram and its properties.

What is a spectrogram?

A spectrogram is a detailed view of audio that plots time, frequency, and amplitude together. It can visually reveal broadband, electrical, or intermittent noise in the audio, so you can isolate those problems simply by looking at the diagram. A spectrogram is read as follows: time runs along the x-axis, frequency along the y-axis, and the amplitude of the signal is shown as a heat map or color saturation scale. Spectrograms were originally drawn as black-and-white graphs on paper by a sound spectrograph; today they are generated by software and can use any range of colors.

Spectrograms depict sounds much like a musical score does; the difference is that they map frequencies instead of musical notes. By looking at the frequency distribution over time, we can clearly distinguish each sound element and its harmonic structure. This is particularly useful in acoustic studies when analyzing sounds such as birdsong or musical instruments. The graphs may not look exciting, but they tell you a lot about the audio file even if you aren’t listening to it.

Since CNNs work well on unstructured data such as images, we’ll use a CNN-based model to classify a few keywords. The implementation below shows how to convert the audio files to spectrograms and how to build a CNN model that classifies the keywords. The code follows the official implementation.

Import all dependencies:
 import matplotlib.pyplot as plt
 import numpy as np
 import seaborn as sns
 import tensorflow as tf
 from tensorflow.keras.layers.experimental import preprocessing
 from tensorflow.keras import layers
 from tensorflow.keras import models
 from IPython import display
 from sklearn.metrics import classification_report
 import pathlib
 import os
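 seed = 42
 # fix the random seeds so that shuffling and training are repeatable
 tf.random.set_seed(seed)
 np.random.seed(seed)
 # let tf.data choose the degree of parallelism in the pipelines below
 AUTOTUNE = tf.data.AUTOTUNE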
Load the data and make the train-test split:

The following code imports the speech commands data set, which contains nearly 105,000 WAV files with 30 different keywords. Although the original data set weighs almost 8 GB, we use a small portion of it to save memory and time. The reduced data set contains the keywords “down”, “go”, “left”, “right”, “no”, “stop”, “up” and “yes”.

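 data_dir = pathlib.Path('data/mini_speech_commands')
 if not data_dir.exists():
   # DATASET_URL is a placeholder: set it to the download location of the
   # mini speech commands zip archive before running this cell
   tf.keras.utils.get_file('mini_speech_commands.zip', origin=DATASET_URL,
        extract=True,cache_dir=".", cache_subdir="data")
 # every sub-folder of data_dir holds the clips of one keyword
 labels = np.array([p.name for p in data_dir.iterdir() if p.is_dir()])
 print('Commands', labels)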


Commands ['no' 'right' 'stop' 'go' 'down' 'up' 'yes' 'left']                

Collect the audio file paths into a list:

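 # one entry per WAV file: data_dir/<keyword>/<clip>.wav
 files = tf.io.gfile.glob(str(data_dir) + '/*/*')
 files = tf.random.shuffle(files)
 num_samples = len(files)
 print("number of total samples: ",num_samples)
 print("examples per labels: ",len(tf.io.gfile.listdir(str(data_dir/labels[0]))))
 print("file tensor: ",files[0])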

Train-test split:

 train = files[:6400]
 vali = files[6400:6400+800]
 test = files[-800:] 
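
The slicing above amounts to an 80/10/10 split. Here is a minimal NumPy sketch of the same idea, assuming 8,000 clips in total (the number the three slice sizes add up to). The file names are made up for illustration; only the slicing mirrors the code above.

```python
import numpy as np

# hypothetical file names standing in for the real WAV paths
file_names = np.array([f"clip_{i:04d}.wav" for i in range(8000)])

# shuffle once with a fixed seed so the split is repeatable
rng = np.random.default_rng(42)
shuffled = rng.permutation(file_names)

# 6400 training, 800 validation and 800 test clips, with no overlap
train_files = shuffled[:6400]
vali_files = shuffled[6400:6400 + 800]
test_files = shuffled[-800:]

print(len(train_files), len(vali_files), len(test_files))
```

Shuffling before slicing matters here: the paths come out of the file system grouped by keyword, so an unshuffled split would leave some keywords out of the training set.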
Reading audio files and labels:

The files are first read as binary files, which we then have to convert to tensors. A WAV file contains time series data with a fixed number of samples per second, and each sample represents the amplitude of the audio signal at a specific point in time. tf.audio.decode_wav is used for the conversion; it returns the WAV-encoded audio as a tensor together with the sample rate.

 ## decode the binary WAV contents into a float tensor
 def audio_decode(audio_binary):
   audios,_ = tf.audio.decode_wav(audio_binary)
   # the clips are mono, so drop the trailing channel axis
   return tf.squeeze(audios,axis=-1)
 ## the label is the name of the folder that holds the wave file
 def get_label_(file_path):
   part = tf.strings.split(file_path, os.path.sep)
   return part[-2]
 ## build a supervised (waveform, label) pair from one file path
 def waveform_and_label(file_path):
   label = get_label_(file_path)
   audio_binary = tf.io.read_file(file_path)
   waveforms = audio_decode(audio_binary)
   return waveforms,label
 # apply waveform_and_label to the training files to extract
 # audio-label pairs
 files_data = tf.data.Dataset.from_tensor_slices(train)
 waveform_data = files_data.map(waveform_and_label, num_parallel_calls = AUTOTUNE)
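
To get a feel for what "each sample is an amplitude" means, here is a small sketch that uses only the Python standard library instead of TensorFlow. It writes a short 16-bit mono tone to an in-memory WAV file and reads it back. The tone, its length and the variable names are illustrative assumptions; the scaling to the range [-1, 1] mimics what a float decoder such as tf.audio.decode_wav does.

```python
import io
import math
import struct
import wave

sample_rate = 16000
# 160 samples (0.01 s) of a 440 Hz sine at half of full scale, as 16-bit integers
samples = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / sample_rate))
           for n in range(160)]

# write the samples into an in-memory mono 16-bit WAV file
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)
    wav_out.setframerate(sample_rate)
    wav_out.writeframes(struct.pack('<%dh' % len(samples), *samples))

# read the file back: the header gives the rate, the body gives the samples
buffer.seek(0)
with wave.open(buffer, 'rb') as wav_in:
    rate = wav_in.getframerate()
    raw = wav_in.readframes(wav_in.getnframes())
decoded = struct.unpack('<%dh' % (len(raw) // 2), raw)

# scale the 16-bit integers to floats in [-1, 1]
waveform = [s / 32768.0 for s in decoded]
print(rate, len(waveform), max(abs(v) for v in waveform))
```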
Visualize the waveform with its labels:
 row = 3
 col = 3
 n = row*col
 fig, axes = plt.subplots(row,col, figsize=(10,12))
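 # the loop variable is called label so the global labels array is kept
 for i,(audios,label) in enumerate(waveform_data.take(n)):
   r1 = i// col
   c1 = i % col
   axs = axes[r1][c1]
   # draw the waveform and use its keyword as the subplot title
   axs.plot(audios.numpy())
   axs.set_title(label.numpy().decode('utf-8'))
 plt.show()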
Create a function that returns a spectrogram:

We convert the waveforms to spectrograms with the short-time Fourier transform (STFT), which takes the audio into the time-frequency domain. tf.signal.stft splits the signal into windows of time and performs a Fourier transform on each window, returning a 2D tensor that convolutional layers can be applied to. The STFT produces an array of complex numbers that carry both phase and amplitude information, but we only use the amplitude for modeling, which tf.abs extracts.

Choose frame_length and frame_step so that the output image is almost square. We also zero-pad the shorter clips so that all files have the same length.

 def get_spectogram(waveforms):
   ## zero padding for clips with fewer than 16000 samples
   padding = tf.zeros([16000] - tf.shape(waveforms),dtype=tf.float32)
   ## concatenate the audio with the padding so all clips have equal length
   waveforms = tf.cast(waveforms, tf.float32)
   equal_length = tf.concat([waveforms,padding], 0)
   spectogram = tf.signal.stft(equal_length,frame_length=255,frame_step=128)
   spectogram = tf.abs(spectogram)
   return spectogram 
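
To see where the "almost square" shape comes from, here is a NumPy-only sketch of a magnitude STFT with the same frame_length and frame_step. It is an illustration of the technique, not a re-implementation of tf.signal.stft: the function name stft_magnitude, the symmetric Hann window and the test tone are assumptions made for this example.

```python
import numpy as np

def stft_magnitude(signal, frame_length=255, frame_step=128):
    """Magnitude STFT of a 1-D signal (illustrative NumPy sketch)."""
    # FFT size: the smallest power of two that fits one frame (256 for 255)
    fft_length = 1 << (frame_length - 1).bit_length()
    # number of full frames that fit without padding the end of the signal
    num_frames = 1 + (len(signal) - frame_length) // frame_step
    window = np.hanning(frame_length)
    # cut the signal into overlapping frames and taper each one
    frames = np.stack([
        signal[i * frame_step: i * frame_step + frame_length] * window
        for i in range(num_frames)
    ])
    # one real FFT per frame; keep only the magnitude, drop the phase
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))

# one second of a 440 Hz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

spec = stft_magnitude(tone)
print(spec.shape)  # (124, 129)
```

With 16,000 samples, a frame length of 255 and a step of 128, there are 1 + (16000 - 255) // 128 = 124 frames, and a 256-point real FFT yields 129 frequency bins, hence the nearly square 124 x 129 image.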

Compare the waveform, spectrogram and audio file of a sample:

 # the loop variable is called label so the global labels array is kept
 for waveforms,label in waveform_data.take(2):
   label = label.numpy().decode('utf-8')
   spectogram = get_spectogram(waveforms)
 print('waveform shape:',waveforms.shape)
 print('Spectogram shape:',spectogram.shape)
 print('Audio playback')
 display.display(display.Audio(waveforms, rate=16000)) 

Plot the spectrogram of a sample:

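 def plot_spectogram(spectogram, axs):
   # take the log of the magnitudes and transpose so that time runs
   # along the x-axis; the small epsilon avoids log(0)
   log_scale = np.log(spectogram.T + np.finfo(float).eps)
   height = log_scale.shape[0]
   width = log_scale.shape[1]
   x = np.linspace(0, np.size(spectogram),num=width, dtype=int)
   y = range(height)
   axs.pcolormesh(x,y, log_scale)
 fig,axes = plt.subplots(2,figsize=(12,8))
 time_scale = np.arange(waveforms.shape[0])
 axes[0].plot(time_scale, waveforms.numpy())
 axes[0].set_title('Waveform')
 # the spectrogram of the same clip goes in the second panel
 plot_spectogram(spectogram.numpy(), axes[1])
 axes[1].set_title('Spectrogram')
 plt.show()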

Transform the waveform dataset into a spectrogram dataset with the appropriate labels and visualize the spectrograms:

 def spectogram_and_label(audios,label):
   spectogram = get_spectogram(audios)
   spectogram = tf.expand_dims(spectogram,-1)
   labels_id = tf.argmax(label == labels)
   return spectogram, labels_id 

 spectogram_data = waveform_data.map(spectogram_and_label,num_parallel_calls=AUTOTUNE)
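
The label lookup used above relies on a small trick: comparing one label against the whole array of labels gives a boolean mask, and argmax returns the position of the single True value. A minimal NumPy sketch of the same idea follows, where label_to_id is a hypothetical helper name chosen for this example.

```python
import numpy as np

labels = np.array(['no', 'right', 'stop', 'go', 'down', 'up', 'yes', 'left'])

def label_to_id(label):
    # the comparison yields a boolean mask with one True entry;
    # argmax returns the index of that entry
    return int(np.argmax(label == labels))

print(label_to_id('stop'))  # 2
print(label_to_id('left'))  # 7
```

One caveat of this approach: for a label that is not in the array, the mask is all False and argmax silently returns 0, so unknown labels should be filtered out beforehand.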

 row = 3
 col = 3
 n = row*col
 fig, axes = plt.subplots(row,col, figsize=(10,12))
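 for i,(spectogram,label_id) in enumerate(spectogram_data.take(n)):
   r2 = i// col
   c2 = i % col
   axs = axes[r2][c2]
   # drop the channel axis before plotting and title with the keyword
   plot_spectogram(np.squeeze(spectogram.numpy()), axs)
   axs.set_title(labels[label_id.numpy()])
 plt.show()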
Perform the preprocessing step for the test and validation set:
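 def create_dataset(files):
   # file paths -> (waveform, label) -> (spectrogram, label id)
   files_data = tf.data.Dataset.from_tensor_slices(files)
   output_data = files_data.map(waveform_and_label,num_parallel_calls=AUTOTUNE)
   output_data = output_data.map(spectogram_and_label,num_parallel_calls = AUTOTUNE)
   return output_data
 train_data = spectogram_data
 vali_data = create_dataset(vali)
 test_data = create_dataset(test)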
Build the model:

Batch the datasets and add cache() and prefetch() operations to reduce read latency during training.

 batch_size = 64
 train_data = train_data.batch(batch_size)
 vali_data = vali_data.batch(batch_size)
 train_data = train_data.cache().prefetch(AUTOTUNE)
 vali_data = vali_data.cache().prefetch(AUTOTUNE) 

In addition to the CNN layers, the model also has preprocessing layers for resizing and normalization:

 for spectogram, _ in spectogram_data.take(1):
   input_shape1 = spectogram.shape
 print("input shape:",input_shape1)
 num_labels = len(labels)
 norma_layer = preprocessing.Normalization()
 # compute the mean and variance from the training spectrograms only
 norma_layer.adapt(spectogram_data.map(lambda x,_: x))
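 # only the Conv2D(64) layer is given in this listing; the remaining layers
 # and the compile settings are an assumed minimal completion that follows
 # the description above (resizing, normalization, CNN layers)
 model = models.Sequential([
         layers.Input(shape=input_shape1),
         preprocessing.Resizing(32,32),
         norma_layer,
         layers.Conv2D(64,3, activation='relu'),
         layers.MaxPooling2D(),
         layers.Flatten(),
         layers.Dense(num_labels),
 ])
 model.compile(optimizer='adam', metrics=['accuracy'],
         loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
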
 history = model.fit(train_data,validation_data=vali_data,epochs=10,
        callbacks = tf.keras.callbacks.EarlyStopping(verbose=1,patience=2))
Evaluate the model:
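 test_audios = []
 test_labels = []
 # the loop variable is called label so the global labels array is kept
 for audios,label in test_data:
   test_audios.append(audios.numpy())
   test_labels.append(label.numpy())
 test_audios = np.array(test_audios)
 test_labels = np.array(test_labels)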
 y_predi = np.argmax(model.predict(test_audios),axis=1)
 y_true = test_labels
 test_accuracy = sum(y_predi == y_true) / len(y_true)
 print('Test accuracy:',test_accuracy) 

Test accuracy is around 83%

Present the confusion matrix and classification report:

 confusion_mat = tf.math.confusion_matrix(y_true,y_predi)
 print(classification_report(y_true,y_predi))
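
To make the confusion matrix easier to read, here is a small NumPy sketch that builds one by hand and derives accuracy and per-class recall from it. The labels and predictions are made-up toy values, not results from the model above; rows are true classes and columns are predicted classes.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # matrix[i, j] counts the samples of true class i predicted as class j
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

# toy labels and predictions for three classes (made up for illustration)
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, 3)
# correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
# recall per class: correct predictions divided by the row total
recall = np.diag(cm) / cm.sum(axis=1)

print(cm)
print(accuracy, recall)
```

Per-class recall is the number to look at when one keyword performs worse than the others, because overall accuracy can hide a weak class.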

Run inference on an audio file:

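 # sample_file is the path to one WAV file inside a keyword folder of data_dir
 sample_data = create_dataset([str(sample_file)])
 for spectogram, label in sample_data.batch(1):
   prediction = model(spectogram)
   # softmax turns the raw model outputs into probabilities per keyword
   plt.bar(labels, tf.nn.softmax(prediction[0]))
   plt.title(f'Prediction for "{labels[label[0]]}"')
   plt.show()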
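
The softmax step in the inference code can be sketched in plain NumPy. The logits below are made-up numbers standing in for a model output, and the softmax function here is a standard numerically stable formulation written for this example rather than TensorFlow's implementation.

```python
import numpy as np

def softmax(logits):
    # subtracting the maximum keeps exp() from overflowing
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ['no', 'right', 'stop', 'go', 'down', 'up', 'yes', 'left']
# made-up raw model outputs (logits), one per keyword
logits = np.array([2.0, 0.5, -1.0, 0.1, 0.0, -0.5, 1.0, 0.3])

probs = softmax(logits)
# the predicted keyword is the one with the highest probability
predicted = labels[int(np.argmax(probs))]
print(predicted, round(float(probs.max()), 3))
```

Softmax does not change which keyword wins; it only rescales the outputs so that they sum to one and can be read as probabilities in the bar chart.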


That covers keyword detection with a simple convolutional neural network, using one-second audio files of eight different words. It shows the basic idea of how keyword recognition works; real systems are considerably more complex. For the sample audio file, the model predicts the keyword correctly. The accuracy on the words “no” and “go” is poor, which could be due to class imbalance, as we did not sample the data uniformly across classes. For the remaining classes, the metrics are acceptable.

