With the rapid development of mobile devices, voice-related technology is booming like never before. Many service providers such as Google offer voice search on the Android platform, while personal assistants such as Microsoft's "Cortana", Apple's "Siri" and Amazon's "Alexa" rely on keyword recognition to interact with the user. On Android phones, "Ok Google" uses this function to listen for a specific keyword that triggers voice-based commands. Keyword recognition is a speech technology that detects the presence of a word or short phrase in an audio stream. It is synonymous with keyword spotting.
The real-world setting for keyword recognition is much more complex than this demonstration. This article focuses on the basic idea of keyword recognition for short, one-second audio files. Since convolutional networks excel at image-based classification tasks, we exploit this strength of convolutional neural networks for the keyword recognition/classification task. To do this, we convert our audio files into spectrograms, which are simply visual representations of the audio, so that we can feed them to a convolutional neural network. Before proceeding with the coding, let's look at spectrograms and their properties in more detail.
What is a spectrogram?
A spectrogram is a detailed view of audio that graphs time, frequency, and amplitude. A spectrogram can visually reveal broadband, electrical, or intermittent noise in the audio, so you can isolate those audio problems simply by looking at the plot. A spectrogram is read as follows: it keeps time on the x-axis and frequency on the y-axis, and the amplitude of the signal is shown as a kind of heat map or colour saturation scale. Spectrograms were originally produced as black-and-white plots on paper by a sound spectrograph, but nowadays they are generated by software and can use any range of colours.
Spectrograms depict sounds similarly to a musical score; the difference is that they map frequencies instead of musical notes. By looking at the frequency distribution over time, we can clearly distinguish each sound element and its harmonic structure. This is particularly useful in acoustic studies when analysing sounds such as birdsong or musical instruments. The plots may not look impressive, but they tell you a lot about an audio file without you ever listening to it.
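As a quick illustration (an aside, not part of the tutorial's pipeline), the minimal sketch below plots the spectrogram of a synthetic one-second tone whose pitch rises over time; the signal and all parameter values here are made up purely for demonstration:

import numpy as np
import matplotlib.pyplot as plt

# a hypothetical 1-second signal sampled at 16 kHz whose pitch rises over time
fs = 16000
t = np.linspace(0, 1, fs, endpoint=False)
signal = np.sin(2 * np.pi * (500 + 3500 * t) * t)

# time on the x-axis, frequency on the y-axis, amplitude as colour intensity
plt.specgram(signal, Fs=fs, NFFT=256, noverlap=128)
plt.xlabel('Time (s)')
plt.ylabel('Frequency (Hz)')
plt.title('Spectrogram of a rising tone')
plt.show()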
Since we know how well CNNs perform on unstructured data such as images, we'll use a CNN-based model to classify some keywords. The implementation below shows how to convert the audio files to spectrograms and how to build a CNN model that classifies the keywords. The code follows the official TensorFlow implementation.
Import all dependencies:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display
from sklearn.metrics import classification_report
import pathlib
import os

seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
Load data and train-test split:
The following code imports the Speech Commands dataset, which contains nearly 105,000 WAV files covering 30 different keywords. Since the original dataset weighs almost 8 GB, we use only a small portion of it to save memory and time. The reduced dataset contains the keywords "down", "go", "left", "right", "no", "stop", "up" and "yes".
data_dir = pathlib.Path('data/mini_speech_commands')
if not data_dir.exists():
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True, cache_dir=".", cache_subdir="data")

# list of keyword classes (the README is not a class)
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[commands != 'README.md']
print('Commands', commands)
Output:
Commands ['no' 'right' 'stop' 'go' 'down' 'up' 'yes' 'left']
Extract the audio files into a list;
files = tf.io.gfile.glob(str(data_dir) + '/*/*')
files = tf.random.shuffle(files)
num_samples = len(files)
print("number of total samples: ", num_samples)
print("examples per label: ", len(tf.io.gfile.listdir(str(data_dir/commands[0]))))
print("file tensor: ", files[0])

Train-test split;
train = files[:6400]
vali = files[6400:6400+800]
test = files[-800:]
Reading audio files and labels:
The files are first read as binary files, which we later convert to tensors. A WAV file contains time-series data with a fixed number of samples per second, where each sample represents the amplitude of the audio signal at a specific point in time. We use tf.audio.decode_wav, which returns the WAV-encoded audio as a tensor.
## the binary file will be converted into numerical tensors
def audio_decode(audio_binary):
  audios, _ = tf.audio.decode_wav(audio_binary)
  return tf.squeeze(audios, axis=-1)

## label for each wave file, taken from its parent directory name
def get_label_(file_path):
  part = tf.strings.split(file_path, os.path.sep)
  return part[-2]

## create a supervised training pair: the audio waveform along with its label
def waveform_and_label(file_path):
  labels = get_label_(file_path)
  audio_binary = tf.io.read_file(file_path)
  waveforms = audio_decode(audio_binary)
  return waveforms, labels

# apply waveform_and_label to build the training set of audio-label pairs
# and check the result
AUTOTUNE = tf.data.AUTOTUNE
files_data = tf.data.Dataset.from_tensor_slices(train)
waveform_data = files_data.map(waveform_and_label, num_parallel_calls=AUTOTUNE)
Visualize the waveform with its labels:
row = 3
col = 3
n = row*col
fig, axes = plt.subplots(row, col, figsize=(10,12))
for i, (audios, labels) in enumerate(waveform_data.take(n)):
  r1 = i // col
  c1 = i % col
  axs = axes[r1][c1]
  axs.plot(audios.numpy())
  axs.set_yticks(np.arange(-1.2, 1.2, 0.2))
  labels = labels.numpy().decode('utf-8')
  axs.set_title(labels)
plt.show()

Create a function that returns a spectrogram:
We convert the waveforms to spectrograms using the short-time Fourier transform (STFT), which takes the audio into the time-frequency domain. The STFT, via tf.signal.stft, splits the signal into windows of time and runs a Fourier transform on each window, returning a 2D tensor on which we can apply convolutional layers. The STFT produces an array carrying both phase and amplitude information, but we will only use the amplitude for modelling, which tf.abs extracts.
Choose frame_length and frame_step so that the output image is almost square. We also use zero padding so that all files have the same length.
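As a quick sanity check (an aside, not part of the pipeline), the shape arithmetic below shows why frame_length=255 and frame_step=128 give a nearly square image for a one-second, 16 kHz clip; the numbers assume tf.signal.stft's defaults (no end padding, fft_length rounded up to the next power of two):

num_samples = 16000
frame_length = 255
frame_step = 128
num_frames = 1 + (num_samples - frame_length) // frame_step   # 124 time steps
fft_length = 256                                              # next power of two >= frame_length
num_bins = fft_length // 2 + 1                                # 129 frequency bins
print(num_frames, num_bins)                                   # -> 124 129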
def get_spectogram(waveforms):
  ## pad files with fewer than 16000 samples with zeros
  padding = tf.zeros([16000] - tf.shape(waveforms), dtype=tf.float32)
  ## concatenate the audio with the padding so that all clips have equal length
  waveforms = tf.cast(waveforms, tf.float32)
  equal_length = tf.concat([waveforms, padding], 0)
  spectogram = tf.signal.stft(equal_length, frame_length=255, frame_step=128)
  spectogram = tf.abs(spectogram)
  return spectogram
Compare the waveform, spectrogram and audio file of a sample;
for waveforms, labels in waveform_data.take(2):
  labels = labels.numpy().decode('utf-8')
  spectogram = get_spectogram(waveforms)

print('label:', labels)
print('waveform shape:', waveforms.shape)
print('Spectogram shape:', spectogram.shape)
print('Audio playback')
display.display(display.Audio(waveforms, rate=16000))

Audio file:
Plot the spectrogram of a sample;
def plot_spectogram(spectogram, axs):
  # convert the frequencies to log scale and transpose, so that time is
  # represented on the x-axis
  log_scale = np.log(spectogram.T)
  height = log_scale.shape[0]
  width = log_scale.shape[1]
  x = np.linspace(0, np.size(spectogram), num=width, dtype=int)
  y = range(height)
  axs.pcolormesh(x, y, log_scale)

fig, axes = plt.subplots(2, figsize=(12,8))
time_scale = np.arange(waveforms.shape[0])
axes[0].plot(time_scale, waveforms.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0,16000])
plot_spectogram(spectogram.numpy(), axes[1])
axes[1].set_title('Spectrogram')
plt.show()

Transform the waveform dataset into a spectrogram dataset with appropriate labels and visualize the spectrograms;
def spectogram_and_label(audios, label):
  spectogram = get_spectogram(audios)
  spectogram = tf.expand_dims(spectogram, -1)
  labels_id = tf.argmax(label == commands)
  return spectogram, labels_id
spectogram_data = waveform_data.map(spectogram_and_label,num_parallel_calls=AUTOTUNE)
row = 3
col = 3
n = row*col
fig, axes = plt.subplots(row, col, figsize=(10,12))
for i, (spectogram, label_id) in enumerate(spectogram_data.take(n)):
  r2 = i // col
  c2 = i % col
  axs = axes[r2][c2]
  plot_spectogram(np.squeeze(spectogram.numpy()), axs)
  axs.set_title(commands[label_id.numpy()])
  axs.axis('off')
plt.show()

Perform the preprocessing step for the test and validation set:
def create_dataset(files):
  files_data = tf.data.Dataset.from_tensor_slices(files)
  output_data = files_data.map(waveform_and_label, num_parallel_calls=AUTOTUNE)
  output_data = output_data.map(spectogram_and_label, num_parallel_calls=AUTOTUNE)
  return output_data

train_data = spectogram_data
vali_data = create_dataset(vali)
test_data = create_dataset(test)
Build the model:
Batch the datasets and add cache() and prefetch() operations to reduce latency.
batch_size = 64
train_data = train_data.batch(batch_size)
vali_data = vali_data.batch(batch_size)
train_data = train_data.cache().prefetch(AUTOTUNE)
vali_data = vali_data.cache().prefetch(AUTOTUNE)
In addition to CNN layers, the model also has preprocessing layers such as resizing and normalization;
for spectogram, _ in spectogram_data.take(1):
  input_shape1 = spectogram.shape
print("input shape:", input_shape1)
num_labels = len(commands)

# the normalization layer is adapted on the spectrogram data
norma_layer = preprocessing.Normalization()
norma_layer.adapt(spectogram_data.map(lambda x, _: x))

model = models.Sequential([
        layers.Input(shape=input_shape1),
        preprocessing.Resizing(32, 32),
        norma_layer,
        layers.Conv2D(64, 3, activation='relu'),
        layers.Conv2D(80, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(num_labels)
])
model.summary()

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer="adam", metrics=['accuracy'])
history = model.fit(train_data, validation_data=vali_data, epochs=10,
                    callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2))

Evaluate the model:
test_audios = []
test_labels = []
for audios, labels in test_data:
  test_audios.append(audios.numpy())
  test_labels.append(labels.numpy())
test_audios = np.array(test_audios)
test_labels = np.array(test_labels)
y_predi = np.argmax(model.predict(test_audios), axis=1)
y_true = test_labels
test_accuracy = sum(y_predi == y_true) / len(y_true)
print('Test accuracy:', test_accuracy)
Test accuracy is around 83%
Present confusion matrix and classification report;
print(classification_report(y_true, y_predi))
confusion_mat = tf.math.confusion_matrix(y_true, y_predi)
plt.figure(figsize=(12,10))
sns.heatmap(confusion_mat, xticklabels=commands, yticklabels=commands, annot=True, fmt="g")
plt.xlabel('Prediction')
plt.ylabel('Actual')
plt.show()


Run inference on an audio file;
sample_file = "/content/data/mini_speech_commands/down/004ae714_nohash_0.wav"
sample_data = create_dataset([str(sample_file)])
for spectogram, label in sample_data.batch(1):
  prediction = model(spectogram)
  plt.bar(commands, tf.nn.softmax(prediction[0]))
  plt.title(f'Prediction for "{commands[label[0]]}"')
  plt.show()

Conclusion
This is all about keyword recognition using a simple convolutional neural network, where we used one-second audio files containing eight different words. This covers the basic idea of how keyword recognition works, although a production system is considerably more complex. In terms of performance on the given audio file, the model predicts the keyword correctly. The accuracy for the words "no" and "go" is poor, which could be due to class imbalance, as we did not sample the data uniformly across all classes. For the remaining classes, the metrics are acceptable.