Acoustic scene classification based on three-dimensional multi-channel, feature-correlated deep learning networks


In this section, we first introduce a signal preprocessing approach based on the DWT technique and then extract signal features from frequency-domain representations of the signals, e.g., the STFT spectrogram, mel spectrogram, and chromagram. Extensive data augmentations are then applied to the spectrograms to improve the robustness and performance of the proposed model.


Considering that acoustic signals are usually recorded in dynamic environments, we apply the DWT technique to mitigate the effects of ambient noise and eliminate artifacts introduced into the spectrograms. For the DWT, the set of wavelet functions is derived from a mother wavelet h.


In particular, the Haar wavelet is found to be an effective mother wavelet based on extensive numerical experiments; its basis function \(h_k(z)\) is described as34

$$ h_{k}(z) = h_{pq}(z) = \frac{1}{\sqrt{N}}\left\{ \begin{array}{ll} 2^{p/2}, & (q - 1)/2^{p} \le z < (q - 0.5)/2^{p} \\ -2^{p/2}, & (q - 0.5)/2^{p} \le z < q/2^{p} \\ 0, & \text{otherwise}, \; z \in [0,1] \end{array} \right. $$


where k is uniquely decomposed as k = 2^p + q − 1, and \(h_0(z)\) is defined by \(h_{0}(z) = h_{0,0}(z) = \frac{1}{\sqrt{N}}, z \in [0,1]\). The wavelet decomposition coefficients obtained through mathematical derivations are further compared to a threshold and weighted to form a noise-free reconstruction of the signal. The wavelet decomposition follows a hierarchical rule in that a signal can be factored from high-level decompositions into lower-level approximations. For illustration, Fig. 1 shows the hierarchy diagram of a multi-level wavelet decomposition, where A and D denote the approximation and detail wavelet components, respectively. A comprehensive presentation of the DWT can be found in34.

Figure 1

Three level hierarchy diagram of the DWT.

In Fig. 2 we show a raw waveform and its denoised versions reconstructed by a three-level Haar-based DWT. It is observed that the DWT is effective in reducing high frequency noise while preserving the trend of the signal very well.

Figure 2

Signal denoising by a three-level Haar-based DWT.
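The decompose–threshold–reconstruct pipeline described above can be sketched in NumPy. This is a simplified illustration, not the authors' implementation: the Haar filters are hard-coded, a fixed soft threshold is applied at every level, and the test signal and threshold value are illustrative assumptions.

```python
import numpy as np

def haar_dwt_denoise(x, levels=3, thresh=0.1):
    """Multi-level Haar DWT denoising: decompose, soft-threshold the
    detail coefficients, then reconstruct level by level."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        if len(approx) % 2:                              # pad to even length
            approx = np.append(approx, approx[-1])
        a = (approx[0::2] + approx[1::2]) / np.sqrt(2)   # approximation (A)
        d = (approx[0::2] - approx[1::2]) / np.sqrt(2)   # detail (D)
        details.append(d)
        approx = a
    # soft-threshold the detail coefficients at every level
    details = [np.sign(d) * np.maximum(np.abs(d) - thresh, 0) for d in details]
    # inverse transform, from the deepest level back up
    for d in reversed(details):
        a = approx[: len(d)]
        rec = np.empty(2 * len(d))
        rec[0::2] = (a + d) / np.sqrt(2)
        rec[1::2] = (a - d) / np.sqrt(2)
        approx = rec
    return approx

# a noisy ramp: the low-frequency trend survives, high-frequency noise is damped
t = np.linspace(0, 1, 64)
noisy = t + 0.05 * np.random.default_rng(0).normal(size=64)
clean = haar_dwt_denoise(noisy, levels=3, thresh=0.05)
```

With the threshold set to zero the transform is orthogonal and reconstruction is exact, which is a convenient sanity check.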


The STFT computes the discrete Fourier transform (DFT) over short overlapping windows and provides analytical insight into the correlations between the time-domain and frequency-domain information of an acoustic signal. We perform a framing operation to split the signal into a number of fixed-length clips and compute the DFT over each clip. Small window lengths improve the STFT's temporal resolution and its ability to distinguish closely spaced pulses, but at the expense of lower frequency resolution.
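The framing-plus-DFT view of the STFT can be sketched as follows. This is a minimal version without the centering and padding of library implementations; the 1024-sample window and 512-sample hop match the settings adopted in this paper, while the signal length and window choice below are illustrative.

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """Frame the signal into overlapping windows and take the DFT of each
    frame; returns an (n_fft//2 + 1, n_frames) magnitude spectrogram."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)], axis=1)
    return np.abs(np.fft.rfft(frames, axis=0))   # keep positive frequencies only

rng = np.random.default_rng(0)
spec = stft(rng.normal(size=22050))              # 1 s of noise at 22.05 kHz
```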

In this paper we set the length of the windowed signal to 1024 and the number of samples between adjacent STFT columns (i.e., the hop length) to 512 to achieve an effective compromise between time and frequency resolution. To better analyze ambient sounds, it is preferable to adopt frequency-domain representations in terms of mel-scaled frequencies, which are non-linearly related to physical frequencies and intuitively characterize the human auditory mechanism. Therefore, we pass the frequency components obtained by the fast Fourier transform (FFT) through a bank of mel-scaled triangular bandpass filters over short time windows. Suppose the filter bank has M triangular filters, each with a center frequency f(m), 0 ≤ m < M, and these filters have equal bandwidth in the mel frequency domain; the mel frequency response of each filter is then defined as35

$$ H_{m}(k) = \left\{ \begin{array}{ll} 0, & k < f(m - 1) \\ \dfrac{k - f(m - 1)}{f(m) - f(m - 1)}, & f(m - 1) \le k \le f(m) \\ \dfrac{f(m + 1) - k}{f(m + 1) - f(m)}, & f(m) \le k \le f(m + 1) \\ 0, & k > f(m + 1) \end{array} \right. $$


where f(m) is expressed as

$$ f(m) = \left( \frac{N}{f_{s}} \right) F_{\mathrm{mel}}^{-1} \left( F_{\mathrm{mel}}(f_{l}) + m \frac{F_{\mathrm{mel}}(f_{h}) - F_{\mathrm{mel}}(f_{l})}{M + 1} \right) $$


where \(f_l\) and \(f_h\) represent the lowest and highest frequencies in the filter range, N is the length of the FFT, \(f_s\) denotes the sampling frequency, and \(F_{\mathrm{mel}}\) is given by

$$ F_{\mathrm{mel}}(f) = 1125 \ln \left( 1 + \frac{f}{700} \right) $$
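Using the two formulas above, the FFT-bin center frequencies f(m) can be computed directly. The sketch below is a straightforward transcription; the filter count M, frequency range, FFT length, and sampling rate are illustrative assumptions rather than values fixed by the paper.

```python
import numpy as np

def mel_centers(M=40, f_l=0.0, f_h=11025.0, n_fft=1024, f_s=22050.0):
    """FFT-bin center frequencies f(m), m = 0..M+1: equally spaced points
    on the mel scale mapped back to FFT bins via F_mel^{-1}."""
    mel = lambda f: 1125.0 * np.log(1.0 + f / 700.0)        # F_mel(f)
    imel = lambda m: 700.0 * (np.exp(m / 1125.0) - 1.0)     # F_mel^{-1}(m)
    m = np.arange(M + 2)
    mels = mel(f_l) + m * (mel(f_h) - mel(f_l)) / (M + 1)   # uniform in mel
    return (n_fft / f_s) * imel(mels)                       # back to FFT bins

centers = mel_centers()
```

The M + 2 points supply the left edge, center, and right edge of each of the M triangular filters, so adjacent filters overlap by construction.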


In this paper, frequency-domain analysis is performed using a 20 ms Hamming window with 50% overlap and 1024 FFT bins. Triangular filters spaced linearly on the mel scale are used to convert the STFT spectrogram into the mel spectrogram. In Fig. 3 we present the STFT and mel spectrograms of an ambient sound signal over linear and logarithmic (log) frequency scales, respectively. The spectrograms, which manifest as 2D heatmaps, clearly show the correlations that exist between the time and frequency domains. Note that high-frequency components are effectively suppressed in the log-mel spectrogram. Converting signal waveforms into heatmaps enables us to exploit the enormous potential of CNN models originally developed for image classification tasks.

Figure 3

STFT spectrogram and Mel spectrogram.

In addition to the STFT spectrogram, we derive other frequency-domain features based on expert knowledge of the raw signal. By exploiting their representational differences, we are able to provide supplementary auditory information that helps distinguish between complex acoustic scenes, e.g., the tempogram based on the local autocorrelation of the onset strength envelope. In particular, the chromagram is an efficient representation of the signal, often used in music genre analysis, which projects the entire spectrogram onto 12 bins, each representing a specific semitone of the octave. Based on the chromagram, we further extract uniform local binary pattern (LBP) texture descriptors, normalize each vector, quantize amplitudes based on predefined thresholds, and smooth the result with a sliding window. Extensive numerical experiments have shown that chroma features are robust to acoustic dynamics and provide an intuitive way to perform gain control for signals spanning multiple frequency bins. They are therefore desirable features for audio retrieval applications, and particularly for the task of recognizing subtly different signals. In Fig. 4 we show the chromagram, chroma CQT, and tempogram, respectively, of an ambient sound signal. Compared to the mel spectrogram, the chromagram allows us to perceptually evaluate a sound signal without resorting to physical frequencies by regarding musical notes an octave apart as similar. These properties allow the model to effectively mimic auditory perception by focusing on the most salient parts of the signal.

Figure 4

Chroma CQT, chromagram, and tempogram of an ambient sound signal.
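The folding of spectrogram bins onto 12 pitch classes can be sketched as follows. This is a minimal illustration using nearest-semitone binning with per-frame normalization, not the CQT-based chroma of library implementations, and the reference frequency and sampling rate are illustrative assumptions.

```python
import numpy as np

def chromagram(spec, f_s=22050.0, ref=261.63):
    """Fold a magnitude spectrogram (n_bins x n_frames) onto 12 pitch
    classes, so bins an octave apart land in the same chroma row."""
    n_bins = spec.shape[0]
    freqs = np.arange(n_bins) * f_s / (2 * (n_bins - 1))       # bin frequencies
    chroma = np.zeros((12, spec.shape[1]))
    for b in range(1, n_bins):                                 # skip the DC bin
        pc = int(np.round(12 * np.log2(freqs[b] / ref))) % 12  # semitone class
        chroma[pc] += spec[b]
    # normalize each frame so chroma is robust to overall gain
    return chroma / (chroma.max(axis=0, keepdims=True) + 1e-9)

spec = np.abs(np.random.default_rng(0).normal(size=(513, 10)))
ch = chromagram(spec)
```

Because doubling a frequency adds exactly 12 semitones, a tone and its octave map to the same chroma row, which is the octave-equivalence property described above.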

Data augmentation

It is known that the performance of deep learning models is severely limited by the size of the training data set. To improve the generalization ability of the model, we perform extensive augmentations on the training signals before generating the frequency-domain spectrograms. Conventional time-domain augmentation approaches include random time shifting, Gaussian or colored noise injection, amplitude scaling, and speed/pitch modulation.
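The listed time-domain augmentations can be sketched as below. Speed/pitch modulation is omitted for brevity since it requires resampling, and all parameter values (shift range, noise level, gain range) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, max_shift=4000, noise_std=0.005, gain_range=(0.8, 1.2)):
    """Apply random circular time shifting, Gaussian noise injection,
    and random amplitude scaling to a 1-D signal."""
    x = np.roll(x, rng.integers(-max_shift, max_shift))   # time shift
    x = x + rng.normal(0.0, noise_std, size=x.shape)      # noise injection
    return x * rng.uniform(*gain_range)                   # amplitude scaling

sig = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 22050))
aug = augment(sig)
```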

A unique characteristic of ambient sounds is that signals often overlap. For example, a sound signal recorded in a supermarket consists of several acoustic components plus ambient noise. In this paper we propose to use a frequency-domain augmentation method known as mix-up36,48, which manipulates 2D spectrograms to simulate such overlapping effects. The mix-up approach uses a randomly generated beta-distributed parameter λ to combine two samples from the training data into a sample that was not previously present in the original dataset. The technique accounts for linear relationships between training samples and significantly improves the model's representational capacity and robustness. The method of generating virtual samples is formulated as:

$$ \mu (\tilde{x},\tilde{y} \mid x_{i},y_{i}) = \frac{1}{n}\sum\nolimits_{j}^{n} \mathop{E}\limits_{\lambda} \left[ \delta \left( \tilde{x} = \lambda x_{i} + (1 - \lambda )x_{j},\ \tilde{y} = \lambda y_{i} + (1 - \lambda )y_{j} \right) \right] $$


where λ ~ Beta(α, α), α ∈ (0, ∞), and \(\tilde{x}\), \(\tilde{y}\) are formulated as:

$$ \left\{ \begin{gathered} \tilde{x} = \lambda x_{i} + (1 - \lambda )x_{j} \hfill \\ \tilde{y} = \lambda y_{i} + (1 - \lambda )y_{j} \hfill \\ \end{gathered} \right. $$


where \(x_i\) and \(x_j\) denote raw input vectors, \(y_i\) and \(y_j\) are one-hot label encodings, and \((x_i, y_i)\) and \((x_j, y_j)\) represent two random samples drawn from the training data. After scaling the pixel values to the range 0 to 255, the spectrograms can be viewed equivalently as three-channel red-green-blue (RGB) images. In addition, we apply a series of pixel-level image transformations to enhance the richness of the training data. The coarse dropout operation, as shown in Fig. 5, can be used to simulate the scenario where the signal has no distinct frequency components by randomly masking a small percentage of pixels, i.e., setting their values to zero.

Figure 5

Image-level coarse dropout augmentation applied to Chroma CQT.
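A minimal sketch of the mix-up and coarse dropout operations described above, assuming NumPy arrays for the spectrogram images and one-hot labels; the hole count, patch size, and α value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two spectrograms and their one-hot labels with λ ~ Beta(α, α)."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def coarse_dropout(img, n_holes=8, size=8):
    """Randomly zero out small square patches of a spectrogram image."""
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(n_holes):
        r = rng.integers(0, h - size)
        c = rng.integers(0, w - size)
        out[r:r + size, c:c + size] = 0
    return out

x1, x2 = rng.random((128, 128)), rng.random((128, 128))
y1, y2 = np.eye(10)[3], np.eye(10)[7]
xm, ym = mixup(x1, y1, x2, y2)
xd = coarse_dropout(x1)
```

Note that the mixed label is no longer one-hot but still sums to one, since the two mixing weights λ and 1 − λ add up to unity.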

