Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification

Aditya Dawn
University of Kalyani
Kalyani, India
adityadawn98@gmail.com
Wazib Ansar
A.K.Choudhury School of IT
University of Calcutta
Kolkata, India
waakcs_rs@caluniv.ac.in

Abstract

Environmental Sound Classification is an important problem in sound recognition and is more complicated than speech recognition, as environmental sounds are not well structured with respect to time and frequency. Over the past years, researchers have used various CNN models to learn from audio features such as log mel spectrograms, gammatone spectral coefficients and mel-frequency spectral coefficients extracted from the audio files. In this paper, we propose a new methodology, Two-Level Classification: the Level 1 Classifier classifies the audio signal into a broader class, and the Level 2 Classifiers then find the actual class to which the audio belongs, based on the output of the Level 1 Classifier. We also show the effects of different audio filters, among which a new method, Audio Crop, is introduced in this paper; it gave the highest accuracies in most cases. We used the ESC-50 dataset for our experiments and obtained a maximum accuracy of 78.75% for Level 1 Classification and 98.04% for Level 2 Classification.

Keywords Environmental Sound Classification (ESC) $\cdot$ Audio Crop $\cdot$ Per Channel Energy Normalization (PCEN) $\cdot$ Spectral Gating $\cdot$ Audio Filters $\cdot$ Convolutional Neural Network (CNN)

1 Introduction

Environmental Sound Classification (ESC) has become a challenging task in recent times. The classification and identification of environmental sounds, which include dog barking, birds chirping, knocking on a door, the sound of a vacuum cleaner, car horns, water drops and many other similar sounds, are necessary for developing smart-home appliances [1], security systems [2, 3], etc. to make human life more secure. For example, if mobile devices are able to recognize the sound of a car honking, road and pedestrian accidents could decrease significantly. Similarly, if automatic pet-care systems can identify the sounds of dogs or birds, they can provide the animals with food and water, and if a smart security system can recognize door-knocking sounds, it can notify the owner about the presence of a person at the door. In other words, a well-trained sound classifier will be able to recognize sounds from human surroundings and help make human life safe and sound.

Traditional sound processing is based on sound features like log mel spectrograms with delta information [4], Gammatones [5] and Mel Frequency Cepstral Coefficients (MFCC) [6]. Various Machine Learning and Deep Learning techniques are applied to these features to obtain high-scoring results. Though Machine Learning algorithms were able to obtain good results, in recent years the breakthrough of Deep Neural Networks, mainly Convolutional Neural Networks (CNNs), has been significant. CNNs were used with the sound features to obtain accuracies such as 83.9% [7] and 86.95% [8], and as high as 97.57% [9].

In this paper, we propose a two-level classification method using CNN models from the VGG, ResNet and EfficientNet families, together with audio modifiers like PCEN, Spectral Gating (Noise Removal) and Audio Crop, and Audio Filters like Low Pass Filter, High Pass Filter, Band Pass Filter and Band Stop Filter. The Level 1 classifier classifies the sounds into broader groups such as Animals, Birds, Nature, etc., while the Level 2 classifier assigns the audio signal to its sub-class. For example, if an audio signal is classified into the Animal class by the Level 1 classifier, then the Level 2 classifier is responsible for detecting the actual animal, which might be a dog, a cow or a sheep. After applying the CNN models and the audio modifiers, our method obtained a score of 78.75% for the Level 1 Classification, and the maximum score obtained by the Level 2 Classifiers is 98.04%.

The remaining portion of this article is arranged descriptively and divided into 7 sections. Sound Classification processes used in previous works for the similar problem are explained in Section 2. In Section 3, we have discussed CNN and Audio Modifiers which were used for our work. In Section 4, we have shown the process of our method with a flowchart, with a suitable description. The tools and libraries are discussed in Section 5. Then, the results of our works are shown in Section 6 and discussed in Section 7. Finally, we conclude our paper in Section 8.

2 Related Works

In this section, we discuss the previous works by researchers on the ESC problem.

Piczak [4] extracted the log mel spectrograms for each frame of the audio files and used these log mel spectrograms and their delta information in his proposed CNN model. He obtained an accuracy of 64.5% with this approach. The goal of Piczak’s paper was to evaluate the success of CNNs when applied to ESC tasks.

Agrawal et al. [5] used a TEO-based gammatone feature set for the problem. First, they extracted the gammatone filterbanks from the raw audio files and applied a bandpass filter. Then, they applied a half-wave rectifier on each of the sub-bands, after which the Teager Energy Operator (TEO) was applied again on each of the sub-bands. Finally, they applied short-term averaging to obtain short-term spectral features. The obtained spectrograms were given as input to a CNN architecture similar to that used by Piczak [4], which obtained a score of 81.95%. However, they showed that the TEO-based Gammatone Spectral Coefficients failed to give better results with CNN.

Zhang et al. [10] showed the performance of a Dilated Convolution Network with seven layers and two input channels. They used log mel spectrograms and delta feature spectrograms in the proposed Dilated Convolution Network. They also studied the classification accuracy obtained for ReLU-type activation functions. However, they obtained a classification accuracy of 68.1% on the ESC-50 dataset, and they also noted that the improvement in classification accuracy came at the cost of higher computational complexity and larger storage.

Zhichao et al. [7] proposed a new CNN architecture inspired by VGG Net by using 1-D convolution filters in place of $3\times 3$ convolution filters, to learn local patterns across frequency and time. They extracted log mel spectrograms and gammatone spectrograms, which were used in the proposed CNN architecture, along with their delta information and achieved a classification accuracy of 83.9%.

In [8], Zhichao et al. adopted a convolutional RNN architecture for the problem. First, they used a CNN with a channel temporal attention mechanism in the convolution layers, with log-gammatone spectrograms as input, to extract high-level features, which were then passed to a bidirectional gated recurrent unit to analyse temporal correlations. They showed a classification accuracy of 86.5% on the ESC-50 dataset.

In [11], Ullo et al. proposed a method that uses a hybrid structure made of OAS, STFT, CNN and different classification techniques for the classification of the classes in the ESC-10 dataset and they achieved a classification accuracy of 95.8%.

Mushtaq et al. [9] showed the performances of different data augmentations on the audio files and log mel spectrograms of the original audio files and augmented audio files with transfer learning and obtained an accuracy of 97.57% for ESC-50 dataset. They also showed the performances of distinct pre-trained models, which included ResNet, DenseNet, AlexNet, SqueezeNet and VGG.

Ansar et al. [12] proposed an EfficientNet ensemble with triple-layered approach to eliminate noise for classification of audio signals. They further validated a trade-off between model depth and number of parameters to obtain optimal accuracy through extensive evaluation on a bouquet of models.

It is clear from the previous works that all the new methodologies and models were built to classify the audio files into the 50 classes directly. So in this study, a new approach is proposed: a two-level classification that uses the spectrograms obtained after the application of audio modifiers together with pre-trained CNN models, and which obtained a state-of-the-art accuracy score compared to the works discussed above.

3 Methods

In this section, we discuss the types of CNNs and audio modifiers used in our work.

3.1 Convolutional Neural Network (CNN)

CNNs are widely used in image processing. They can be trained to understand the hidden features of images, because a CNN applies different relevant filters, and its architecture reduces the number of parameters involved and increases the re-usability of the weights. For this reason, the spatial and temporal dependencies of an image are successfully captured by a CNN. CNNs are mostly used for classification and computer vision tasks. Fig 1 (source: towardsdatascience) shows a general architecture of a CNN model.

Figure 1: CNN architecture

CNN models consist mainly of three types of layers:

1. Convolution Layer
2. Pooling Layer
3. Fully-connected Layer

LeCun et al. [13] introduced the first CNN architecture, LeNet-5, for the recognition of handwritten digits (input images were of dimension $32\times 32\times 1$), using the MNIST dataset. LeNet-5 was a very shallow CNN with alternating convolution layers and pooling layers and had only about 60,000 parameters.

AlexNet was then introduced by Krizhevsky et al. [14]. The network was similar to LeNet but instead of alternating convolution layers and pooling layers, AlexNet had all the convolution layers stacked together. Also compared to LeNet-5, this network is much bigger and deeper.

Later, VGGNet was introduced by Simonyan & Zisserman [15]. Earlier, models like AlexNet used high dimensional filter in the initial layers, but VGG changed this by using $3\times 3$ filters.

He et al. [16] presented a residual learning framework where the layers learn residual functions with respect to the inputs received instead of learning un-referenced functions. They showed that this approach is particularly useful for training deeper networks, since residual networks are easier to optimize and gain accuracy from increased depth. The main drawback of this network is that it is expensive to evaluate due to the huge number of parameters.

Various models were thus developed focusing either on performance or on computational efficiency. Tan & Le introduced the EfficientNet [17] model, which addressed both concerns. They proposed a common CNN architecture that is scaled along three dimensions: width, depth and resolution. Width refers to the number of channels present in the various layers, depth refers to the number of layers in the model and resolution refers to the input image size for the model. EfficientNet performs compound scaling of depth, width and resolution together. The compound scaling method enhances the predictive capacity of the network while replicating the base network’s underlying convolutional operations and network structure.

In this paper we have also shown the efficiency of 3 classes of pre-trained models, which are as follows :

• VGG [15]
  – VGG16
  – VGG19
• ResNet [16]
  – ResNet50
  – ResNet101
  – ResNet152
• EfficientNet [17]
  – EfficientNetB0
  – EfficientNetB1
  – EfficientNetB2
  – EfficientNetB3
  – EfficientNetB4

3.2 Spectrogram

A spectrogram, usually depicted as a heatmap, is a visual representation of the spectrum of frequencies of a signal as it varies with time. Spectrograms of some audio files from each class group listed in Table 1 are shown in Fig 4. The spectrograms are passed into the CNN models to learn the audio features.

3.3 Understanding Audio Modifiers Used

3.3.1 Spectral Gating

Spectral Gating, as explained in [18], is a noise-reduction technique comprising several steps. A Fourier transform is applied to a noise-only portion of the audio signal to create a spectral "fingerprint", which is then used as a "gate" to filter the audio signal. Frequency components of the audio signal that are above the gate value are passed, while those below it are removed. Here, we have removed noise based on this principle. The result of Spectral Gating on an audio file is shown in Fig 3, with the waveform and spectrogram.
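The gating principle can be sketched in a few lines of NumPy. This is a deliberately simplified illustration and not the implementation used for the experiments (which relies on the noisereduce library): it assumes non-overlapping rectangular frames, a hard on/off mask, and a gate set to the mean plus `n_std` standard deviations of the noise magnitude in each frequency bin. The function name `spectral_gate` and the parameters `frame` and `n_std` are hypothetical.

```python
import numpy as np

def spectral_gate(audio, noise_clip, frame=512, n_std=2.0):
    """Simplified spectral gating (assumed variant: hard mask, no frame overlap)."""
    def to_frames(x):
        n = len(x) // frame                      # a trailing partial frame is dropped
        return x[: n * frame].reshape(n, frame)

    # "Fingerprint": per-frequency magnitude statistics of the noise-only clip.
    noise_mag = np.abs(np.fft.rfft(to_frames(noise_clip), axis=1))
    gate = noise_mag.mean(axis=0) + n_std * noise_mag.std(axis=0)

    # Keep only the time-frequency bins whose magnitude exceeds the gate.
    spec = np.fft.rfft(to_frames(audio), axis=1)
    mask = np.abs(spec) > gate
    return np.fft.irfft(spec * mask, n=frame, axis=1).ravel()

# Synthetic demo: 4096 samples of noise, then 4096 samples of a 500 Hz tone plus noise.
rng = np.random.default_rng(0)
sr = 8000
tone = np.sin(2 * np.pi * 500 * np.arange(4096) / sr)
noisy = np.concatenate([np.zeros(4096), tone]) + 0.05 * rng.standard_normal(8192)
noise_only = 0.05 * rng.standard_normal(8192)
denoised = spectral_gate(noisy, noise_only)
```

In this demo the noise-only half of `denoised` should carry much less energy than in `noisy`, while the tone survives because its bin lies far above the gate. Practical implementations typically use overlapping windows and smooth the mask to avoid audible artifacts.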

3.3.2 Per Channel Energy Normalization (PCEN)

PCEN was introduced in [19] as an alternative to the log-mel frontend, and its working principle is discussed in [20]. PCEN is the result of a three-component operation:

1. Temporal integration
2. Adaptive Gain Control
3. Dynamic Range Compression

The result of PCEN on two audio files is shown in Fig 5.
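For reference, the three components can be written compactly using the standard PCEN formulation from the literature. Here $E(t,f)$ is the mel filterbank energy at frame $t$ and channel $f$; the smoothing coefficient $s$, gain exponent $\alpha$, bias $\delta$, root $r$ and small constant $\epsilon$ are the usual PCEN parameters, and $G(t,f)$ is an intermediate symbol introduced here only for readability. The parameter values used in the experiments are not stated in this paper.

```latex
\begin{align}
M(t,f) &= (1-s)\,M(t-1,f) + s\,E(t,f) && \text{(temporal integration)}\\
G(t,f) &= \frac{E(t,f)}{\bigl(\epsilon + M(t,f)\bigr)^{\alpha}} && \text{(adaptive gain control)}\\
\mathrm{PCEN}(t,f) &= \bigl(G(t,f) + \delta\bigr)^{r} - \delta^{r} && \text{(dynamic range compression)}
\end{align}
```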

3.3.3 Audio Crop

The main idea behind this feature is to remove the silent (zero-valued) portions of an audio sample and repeat the remaining non-zero portions until the sample spans the maximum time length found among the audio files in the dataset. The method is detailed in Algorithms 1 and 2: Algorithm 1 finds the maximum time length among the audio files, and Algorithm 2 removes the silent portions of each audio file and repeats the remainder. The process of Audio Cropping is shown in Fig 6.
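Algorithms 1 and 2 can be sketched in NumPy as follows. This is an illustrative reading of the method rather than the authors' code: it assumes lengths are measured in samples, treats only exactly zero-valued samples as silence, and divides the target length by the length of the non-silent part so that every output has exactly the maximum length. The function names `max_time_len` and `audio_crop` are hypothetical.

```python
import numpy as np

def max_time_len(signals):
    """Algorithm 1: length (in samples) of the longest signal."""
    return max(len(s) for s in signals)

def audio_crop(audio, max_len):
    """Algorithm 2 (sketch): drop zero samples, then repeat the rest up to max_len."""
    voiced = audio[audio != 0]
    if voiced.size == 0:                  # fully silent clip: nothing to repeat
        return np.zeros(max_len, dtype=audio.dtype)
    q, r = divmod(max_len, voiced.size)   # whole repetitions and leftover samples
    return np.concatenate([np.tile(voiced, q), voiced[:r]])

# Two toy "recordings" with silent (zero) stretches.
clips = [
    np.array([0.0, 0.5, -0.5, 0.0, 0.0, 0.25, 0.0, 0.0]),
    np.array([0.0, 0.0, 1.0, 0.0, 0.0]),
]
target = max_time_len(clips)
cropped = [audio_crop(c, target) for c in clips]
```

Both outputs have `target` samples and contain no zeros, so every spectrogram computed from them covers the same duration without blank regions.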

3.3.4 Audio Filters

Research and development on audio filters for audio modification is discussed in [21]. The filters that we have used in our work modify the audio signals based on frequency.

1. Low Pass Filter: A Low Pass Filter allows the frequencies lower than a cut-off frequency to pass and attenuates the frequencies higher than the cut-off frequency.
2. High Pass Filter: A High Pass Filter allows the frequencies higher than a cut-off frequency to pass and attenuates the frequencies lower than the cut-off frequency.
3. Band Pass Filter: A Band Pass Filter accepts two cut-off frequencies, a low-cut frequency and a high-cut frequency. This filter allows the band of frequencies between the low-cut and high-cut frequencies to pass and attenuates the frequencies lower than the low-cut frequency and higher than the high-cut frequency.
4. Band Stop Filter: A Band Stop Filter also accepts two cut-off frequencies, a low-cut frequency and a high-cut frequency. Unlike the Band Pass Filter, however, this filter attenuates the band of frequencies between the low-cut and high-cut frequencies and allows the frequencies lower than the low-cut frequency and higher than the high-cut frequency to pass.
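The four filter types can be illustrated with SciPy. The sketch below is an assumption about how such filters might be built, since the paper does not state the filter design: it uses a 4th-order Butterworth filter applied forwards and backwards (zero phase), with the 512 Hz and 2048 Hz cut-offs used in this work. The helper names `apply_filter` and `tone_level` are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

SR = 44100  # sampling rate of the ESC-50 recordings

def apply_filter(audio, kind, cutoff, sr=SR, order=4):
    """Zero-phase Butterworth filter; kind is lowpass, highpass, bandpass or bandstop."""
    sos = butter(order, cutoff, btype=kind, fs=sr, output="sos")
    return sosfiltfilt(sos, audio)

def tone_level(audio, freq, sr=SR):
    """Amplitude of the sinusoidal component of `audio` at `freq` Hz."""
    t = np.arange(len(audio)) / sr
    return 2 * abs(np.mean(audio * np.exp(-2j * np.pi * freq * t)))

# One-second test signal: unit tones below, inside and above the 512-2048 Hz band.
t = np.arange(SR) / SR
mix = sum(np.sin(2 * np.pi * f * t) for f in (200, 1000, 5000))

filtered = {
    "lowpass": apply_filter(mix, "lowpass", 512),
    "highpass": apply_filter(mix, "highpass", 2048),
    "bandpass": apply_filter(mix, "bandpass", [512, 2048]),
    "bandstop": apply_filter(mix, "bandstop", [512, 2048]),
}
levels = {name: [tone_level(y, f) for f in (200, 1000, 5000)]
          for name, y in filtered.items()}
```

Each entry of `levels` holds the remaining amplitude of the 200 Hz, 1000 Hz and 5000 Hz tones. The low pass output keeps only the first tone, the high pass output only the last, the band pass output only the middle one, and the band stop output everything except the middle one.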

4 Proposed Method

In this section, we discuss the design of the Two-Level Classification method, which we are proposing in this paper, as shown in Figure 2.

Table 1: New Classification

Class group	Sound classes
Animal	Dog; Sheep; Pig; Cow; Frog; Cat; Insects (flying); Crickets
Birds	Chirping Birds; Rooster; Crow; Hen
Natural Soundscapes	Rain; Sea Waves; Crackling Fire; Wind; Pouring water; Water drops; Thunderstorm
Human	Crying Baby; Sneezing; Clapping; Breathing; Coughing; Footsteps; Laughing; Brushing teeth; Snoring; Drinking, sipping
Machine Sounds	Mouse Click; Keyboard Typing; Washing Machine; Vacuum cleaner
Domestic sounds	Door knock; Toilet flush; Clock alarm; Door, wood creaks; Can opening; Clock tick; Glass breaking
Outdoor noises	Helicopter; Chainsaw; Siren; Car Horn; Engine; Train; Church bells; Airplane; Fireworks; Hand saw

Step 1: The audio files of the ESC-50 dataset are taken as input.

Step 2: In the pre-processing part, we divided the dataset into 7 groups as shown in Table 1. These new divisions were made based on the origin or source of the sounds. For example, Dog, Sheep, Pig, etc. are evidently animals, so they are grouped in the "Animal" class. Rooster, Crow, Hen and Chirping Birds are placed in the "Bird" class. Rain, Sea Waves, Wind, Pouring Water and Thunderstorms are observed in nature, so they are grouped in the "Natural Soundscapes" class. Sneezing, Clapping, Breathing, Drinking, etc. are parts of human behavior, so they are placed in the "Human" group. Mouse Click, Keyboard Typing, Washing Machine and Vacuum Cleaner are sounds originating from particular machines, so they are grouped in the "Machine Sounds" class. Door knock, Toilet flush, Clock alarm, Can opening, etc. are found in domestic environments, so they are grouped in the "Domestic sounds" class. Finally, Helicopter, Chainsaw, Siren, etc. are found in outdoor spaces, so they are grouped in the "Outdoor noises" class.

Step 3: Audio modifiers were then applied to the audio files. The audio modifiers include:
  (a) Spectral Gating
  (b) PCEN
  (c) Audio Crop
  (d) Low Pass Filter
  (e) High Pass Filter
  (f) Band Pass Filter
  (g) Band Stop Filter

Step 4: The spectrograms of the modified audio files are extracted.

Step 5: The extracted spectrograms are then passed to the CNN models mentioned in sub-section 3.1 for the purpose of Level 1 Classification.

Step 6: The output from the Level 1 Classification is noted and used in Algorithm 3.

Step 7: Algorithm 3 then selects the classifier in the Level 2 Classification stage that identifies the actual class to which the audio file belongs.

Note that, we are using 10 CNN models as discussed in Section 3, where we are passing eight different types of spectrogram images - one without filtration and seven with filtrations of Spectral Gating, PCEN, Audio Crop, Low Pass Filter, High Pass Filter, Band Pass Filter and Band Stop Filter. So for each classification, we are comparing the results of 80 different models.
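The routing performed in Steps 5 to 7 can be sketched as a simple dispatch. The classifiers below are stand-in functions, not trained CNNs, and the names `two_level_predict`, `level1_stub` and `level2_stubs` are hypothetical; the fallback to the outdoor model mirrors the final else branch of Algorithm 3.

```python
from typing import Callable, Dict

Classifier = Callable[[object], str]

def two_level_predict(spectrogram, level1: Classifier,
                      level2: Dict[str, Classifier], default: str = "Outdoor") -> str:
    """Run the Level 1 classifier, then the Level 2 classifier of the predicted group."""
    group = level1(spectrogram)
    # Algorithm 3 falls through to the outdoor model when no other group matches.
    return level2.get(group, level2[default])(spectrogram)

# Stand-in classifiers (hypothetical): real ones would wrap the trained CNN models.
def level1_stub(spec):
    return "Animal" if spec["pitch"] < 1000 else "Bird"

level2_stubs = {
    "Animal": lambda spec: "Dog",
    "Bird": lambda spec: "Crow",
    "Outdoor": lambda spec: "Siren",
}

prediction = two_level_predict({"pitch": 400}, level1_stub, level2_stubs)
```

One consequence of this design is that a Level 1 error cannot be corrected at Level 2, because the wrong group-specific classifier is selected.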

Algorithm 1 Max Time Len

Input: Path of the audio files in the data provided
Output: Time length of the longest audio file
1: Time_Lengths $\leftarrow$ An empty list to store the lengths of the audio files
2: for all Audio files do
3:   time_length $\leftarrow$ Length of the audio file
4:   Store time_length in Time_Lengths
5: end for
6: Max_time_len $\leftarrow$ Maximum value in Time_Lengths
7: return Max_time_len

5 Implementation of the Method

5.1 Dataset : ESC-50

We have used the ESC-50 [22] dataset for our work, which is a collection of 2000 recordings with an average duration of 5 seconds and a sampling rate of 44100 Hz. These recordings have been collected from Freesound.org (https://freesound.org/). The dataset consists of 40 recordings for each of the 50 categories.

Algorithm 2 Crop Audios according to the output of algorithm 1

Input: 1. Path of the Audio files in the data provided
       2. Max_time $\leftarrow$ Output of Algorithm 1
Output: Audio files without silent portions
for all Audio Files do
  time_len $\leftarrow$ time length of the audio file
  audio_ $\leftarrow$ audio[audio $\neq$ 0]
  audio_add $\leftarrow$ audio[audio $\neq$ 0]
  q $\leftarrow$ quotient(Max_time $\div$ time_len)
  if q $>$ 1 then
    times $\leftarrow$ q-1
    for time $\in$ times do
      audio_ = concatenate(audio_, audio_add)
    end for
  else
    pass
  end if
  r $\leftarrow$ remainder(Max_time $\div$ time_len)
  if r == 0 then
    pass
  else
    j $\leftarrow$ 0
    while j $\neq$ r do
      audio_ = append(audio_, audio_[j])
      j $\leftarrow$ j+1
    end while
  end if
end for
return Cropped_Audio $\leftarrow$ audio_

These 50 categories are generally grouped into 5 groups, as done in the previous works mentioned in Section 2 that used the ESC-50 dataset. The groups are:

• Animals
• Natural soundscapes & water sounds
• Human, non-speech sounds
• Interior/domestic sounds
• Exterior/urban noises

5.2 Libraries

The work has been done using the Python Programming Language. The Python libraries used for this project are discussed below.

5.2.1 NumPy

The NumPy library (https://numpy.org/) helps in performing complex mathematical operations on arrays and in random number generation. In this project we have used NumPy to randomly split the dataset into Training and Testing Sets, used to train and test the CNN models, respectively.

5.2.2 Pandas

The Pandas library (https://pandas.pydata.org/) helped in manipulating the dataframe and performing some basic operations on it.

Algorithm 3 Classifying Algorithm

Input: Level 1 Classifier Output

Output: Actual_Class

1: if Input == "Animal" then

2: Actual_Class = animal_model.predict()

3: else if Input == "Bird" then

4: Actual_Class = bird_model.predict()

5: else if Input == "Nature" then

6: Actual_Class = nature_model.predict()

7: else if Input == "Human" then

8: Actual_Class = human_model.predict()

9: else if Input == "Machine Sounds" then

10: Actual_Class = machine_sounds_model.predict()

11: else if Input == "Domestic" then

12: Actual_Class = domestic_model.predict()

13: else

14: Actual_Class = outdoor_model.predict()

15: end if

16: return Actual_Class

5.2.3 Matplotlib and Seaborn

Matplotlib (https://matplotlib.org/) and Seaborn (https://seaborn.pydata.org/) were used for plotting.

5.2.4 Librosa

Librosa (https://librosa.org) was used to work with the audio files. Librosa helped in loading the audio files from their respective locations. Librosa also has functions to extract mel spectrograms and functions for audio modification, which were used on the audio samples.

5.2.5 SciPy

SciPy (https://scipy.org/) is a library that provides the functions used to implement the audio filters on the audio samples provided in the data.

5.2.6 Noise Reduce

The noisereduce library (https://pypi.org/project/noisereduce/) is used for Noise Removal from the audio files provided with the dataset.

5.2.7 Tensorflow and Keras

TensorFlow (https://www.tensorflow.org/) and Keras (https://keras.io/) were used to implement the CNN models. The pre-trained models from Keras were used in this work.

5.2.8 Model Hyperparameters

After the pre-trained layers of each pre-trained model, we added a global average pooling layer. After this layer, two dense layers were added, each with 512 units and the ReLU activation function. The kernel initializer of the first dense layer was set to Glorot uniform. Stochastic gradient descent was used as the optimizer during compilation of the models.

Fig 4 shows that, for the thunderstorm, vacuum cleaner, glass breaking and train sounds, the intensity of sound is maximum within the range of 0 – 512 Hz, decreases slightly up to 2048 Hz and starts to fade after that. The spectrograms of chirping birds and clapping show that the intensity is faint up to 128 Hz. Again, in the cases of clapping and dog, maximum intensity is visible mostly within 512 – 4096 Hz. Besides this, the black portions indicate silence in the audio files of dog and glass breaking. Based on these observations, for the Audio Filters we have used 512 Hz as the lower threshold and 2048 Hz as the higher threshold frequency.

Table 2: Cut-off Frequencies for Audio Filters

Audio Filter	Cut-off frequency
Low Pass Filter	512 Hz
High Pass Filter	2048 Hz
Band Pass Filter	512 Hz (lower cut-off), 2048 Hz (higher cut-off)
Band Stop Filter	512 Hz (lower cut-off), 2048 Hz (higher cut-off)

5.3 Implementation Details

For each CNN model, we first divided the data for the respective model in the ratio of 8:2 to create the Training Set and the Testing Set. We then divided the Training Set again in the ratio of 8:2 to obtain the final Training Set and a Validation Set, as shown in Table 3. We used a sampling rate of 44.1 kHz for the audio files.
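The nested 8:2 split can be checked with a little integer arithmetic. The rounding rule below (round the held-out 20% down at each stage) is an assumption, since the paper does not state one; it is chosen because it reproduces the Level 1 sizes and the training set sizes reported in Table 3. The function name `split_sizes` is hypothetical.

```python
def split_sizes(n_samples):
    """Return (train, validation, test) sizes for a nested 8:2 split."""
    n_test = n_samples * 2 // 10      # 20% held out for testing, rounded down
    n_rest = n_samples - n_test
    n_val = n_rest * 2 // 10          # 20% of the remainder for validation, rounded down
    return n_rest - n_val, n_val, n_test

# Total number of samples per classifier, as listed in Table 3.
totals = {
    "Level 1": 2000, "Animals": 320, "Birds": 160, "Nature": 280,
    "Human": 400, "Machine Sounds": 160, "Domestic": 320, "Outdoor": 400,
}
sizes = {name: split_sizes(n) for name, n in totals.items()}
```

Under this rule the smallest groups (Birds and Machine Sounds, 160 samples each) are left with only a few dozen validation and testing samples, so each individual error shifts the reported accuracy by several percentage points.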

6 Results

From Table 3, it is clear that the number of testing samples is small except in the case of Level 1 Classification. We therefore compare the performances of the classifiers and models based on Classification Accuracy only for the Level 1 classifiers (Table 4). In the other cases, we judge based on the Highest Validation Accuracy obtained, as a single misclassification by the model decreases the Classification Accuracy significantly, specifically in the cases of Birds and Machine Sounds classification. Nevertheless, we have provided both the Classification Accuracy and the Validation Accuracy in the result tables of the classification models for Animals (Table 5), Birds (Table 6), Natural Soundscapes (Table 7), Human (Table 8), Machine Sounds (Table 9), Domestic (Table 10) and Outdoor noises (Table 11).

Table 3: Distribution of Samples for Training, Validation and Testing

Mode of Classification	Number of classes	Total Number of Samples	Size of Training Set	Size of Validation Set	Size of Testing Set
Level 1 Classification		2000	1280	320	400
Animals		320	205		
Birds		160	103		
Nature		280	180		
Human		400	256		
Machine Sounds		160	103		
Domestic		320	205		
Outdoor		400	256		

These raw spectrograms shown in Fig 4 are given as input to the CNN models to examine their performances on the audio files, which are shown in the column "No Filter" in the tables showing the results of the classifications performed. The spectrograms obtained with the different audio modifiers like Spectral Gating, PCEN and Audio Crop are shown in Fig 3, 5 and 6, respectively. The CNN models when combined with these audio filters, use these generated spectrograms as input and the results are shown in the columns "Spectral Gating", "PCEN" and "Audio Crop" in the tables showing the results of the classifications performed.

Based on the observations obtained from the raw spectrograms, we have fixed the lower threshold and higher threshold frequencies to 512 and 2048 Hz, respectively. The obtained spectrograms after using Low Pass Filter (threshold = 512 Hz), High Pass Filter (threshold = 2048 Hz), Band Pass Filter (lower threshold = 512 Hz and higher threshold = 2048 Hz) and Band Stop Filter (lower threshold = 512 Hz and higher threshold = 2048 Hz) are passed as input to the CNN models and the obtained results are shown in the columns Low Pass Filter, High Pass Filter, Band Pass Filter and Band Stop Filter in the tables showing the results of the classifications performed.

6.1 Level 1 Classification

From the results of the Level 1 Classification shown in Table 4, it can be seen that the Classification Accuracy was 75.94% with EfficientNetB1 without any filtration; it decreased to 74.75% with EfficientNetB0 and Noise Removal, and decreased further to 50.50% with the application of PCEN. It then increased to 78.75% with the combination of EfficientNetB2 and Audio Crop, which is the maximum Classification Accuracy obtained in the Level 1 Classification. It decreased again to 48.50% with the Low Pass Filter, and reached 60.25% with the High Pass Filter, 55.25% with the Band Pass Filter and 36.50% with the Band Stop Filter.

Table 4: Level-1 Classifier

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	68.75%	62.81%	45.63%	70.63%	45.00%	49.38%	49.38%	33.75%
VGG16	Classification Accuracy	71.00%	66.00%	45.00%	71.50%	43.75%	50.50%	49.50%	31.25%
VGG19	Highest Validation Accuracy	67.50%	68.12%	43.13%	72.50%	48.75%	47.50%	52.81%	35.94%
VGG19	Classification Accuracy	70.50%	70.50%	44.75%	70.50%	42.00%	49.50%	50.75%	34.00%
ResNet50	Highest Validation Accuracy	71.88%	64.38%	43.13%	80.00%	49.69%	49.06%	47.19%	37.50%
ResNet50	Classification Accuracy	71.25%	64.50%	47.00%	79.00%	48.50%	56.25%	46.50%	33.75%
ResNet101	Highest Validation Accuracy	70.31%	67.19%	45.31%	73.75%	48.44%	43.75%	48.75%	36.56%
ResNet101	Classification Accuracy	73.50%	64.50%	47.75%	70.00%	46.00%	45.25%	52.00%	34.75%
ResNet152	Highest Validation Accuracy	67.50%	67.19%	51.88%	75.00%	49.38%	48.75%	49.69%	37.81%
ResNet152	Classification Accuracy	69.50%	68.75%	50.50%	71.25%	45.25%	51.25%	52.50%	36.50%
EfficientNetB0	Highest Validation Accuracy	71.25%	75.31%	44.06%	78.44%	49.69%	53.75%	51.56%	38.75%
EfficientNetB0	Classification Accuracy	75.50%	74.75%	47.25%	76.00%	47.25%	56.75%	53.00%	31.00%
EfficientNetB1	Highest Validation Accuracy	75.94%	70.63%	41.25%	75.94%	48.75%	55.62%	56.25%	38.44%
EfficientNetB1	Classification Accuracy	76.75%	73.25%	43.25%	75.50%	47.50%	56.25%	55.25%	32.50%
EfficientNetB2	Highest Validation Accuracy	72.19%	70.31%	46.56%	79.06%	46.88%	56.25%	54.69%	36.88%
EfficientNetB2	Classification Accuracy	76.25%	72.25%	45.25%	78.75%	45.50%	60.25%	53.00%	32.25%
EfficientNetB3	Highest Validation Accuracy	72.50%	69.69%	43.75%	74.69%	53.75%	55.31%	56.88%	36.25%
EfficientNetB3	Classification Accuracy	76.00%	74.25%	47.00%	77.75%	46.50%	58.00%	53.75%	36.00%
EfficientNetB4	Highest Validation Accuracy	73.44%	71.25%	40.00%	75.00%	47.19%	53.44%	53.75%	39.06%
EfficientNetB4	Classification Accuracy	77.25%	70.75%	41.00%	74.00%	47.00%	55.50%	52.50%	32.50%

6.2 Level 2 Classification

6.2.1 Animal

In the Level 2 Classification of the Animal class, shown in Table 5, the highest validation score was 86.27%, from ResNet50 and ResNet152 with the raw spectrograms of the unfiltered audio files. It decreased to 82.35% with the combination of Noise Removal and EfficientNetB3, and further to 64.71% with PCEN, but increased again to 88.24% with Audio Crop and EfficientNetB2.

With the audio filters the validation accuracy did not improve further: the best scores were 66.67% with the Low Pass Filter, 74.51% with both the High Pass and Band Pass Filters, and 49.02% with the Band Stop Filter. Hence, the Level 2 Classifier for Animal also obtained its highest accuracy from Audio Crop and EfficientNetB2, like the Level 1 Classifier.

6.2.2 Bird

The results of the Level 2 Classification of the Bird class in Table 6 show that the highest validation accuracy obtained was 96.00%, with the following combinations of audio modifier & CNN model:

1. No Filter & VGG16
2. No Filter & EfficientNetB2
3. Noise Removal & EfficientNetB1
4. Audio Crop & ResNet152
5. Audio Crop & EfficientNetB1
6. High Pass Filter & ResNet50

The highest accuracy obtained with the Low Pass Filter and with the Band Pass Filter was 80.00%. The lowest best-case accuracy, 64.00%, was obtained with the Band Stop Filter.

Table 5: Results for Animal class

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	66.67%	70.59%	58.82%	76.47%	64.71%	50.98%	49.02%	47.06%
VGG16	Classification Accuracy	79.69%	71.88%	50.00%	70.31%	48.44%	48.44%	54.69%	42.19%
VGG19	Highest Validation Accuracy	76.47%	68.63%	43.14%	76.47%	58.82%	64.71%	70.59%	39.22%
VGG19	Classification Accuracy	81.25%	71.88%	39.06%	73.44%	46.88%	67.19%	48.44%	35.94%
ResNet50	Highest Validation Accuracy	86.27%	76.47%	49.02%	84.31%	60.78%	60.78%	72.55%	35.29%
ResNet50	Classification Accuracy	82.81%	84.38%	46.88%	79.69%	34.38%	48.44%	68.75%	29.69%
ResNet101	Highest Validation Accuracy	76.47%	74.51%	50.98%	82.35%	66.67%	64.71%	58.82%	35.29%
ResNet101	Classification Accuracy	75.00%	73.44%	34.38%	81.25%	56.25%	57.81%	62.50%	23.44%
ResNet152	Highest Validation Accuracy	86.27%	76.47%	56.86%	86.27%	62.75%	74.51%	66.67%	43.14%
ResNet152	Classification Accuracy	84.38%	75.00%	53.12%	78.12%	50.00%	67.19%	71.88%	34.38%
EfficientNetB0	Highest Validation Accuracy	82.35%	72.55%	64.71%	76.47%	62.75%	64.71%	62.75%	39.22%
EfficientNetB0	Classification Accuracy	84.38%	75.00%	54.69%	78.12%	57.81%	70.31%	54.69%	35.94%
EfficientNetB1	Highest Validation Accuracy	84.31%	70.59%	56.86%	82.35%	60.78%	70.59%	74.51%	45.10%
EfficientNetB1	Classification Accuracy	78.12%	79.69%	50.00%	82.81%	62.50%	70.31%	67.19%	39.06%
EfficientNetB2	Highest Validation Accuracy	80.39%	66.67%	58.82%	88.24%	66.67%	74.51%	66.67%	33.33%
EfficientNetB2	Classification Accuracy	79.69%	76.56%	65.62%	82.81%	59.38%	64.06%	62.50%	37.50%
EfficientNetB3	Highest Validation Accuracy	84.31%	82.35%	60.78%	70.59%	66.67%	58.82%	70.59%	49.02%
EfficientNetB3	Classification Accuracy	89.06%	81.25%	54.69%	81.25%	60.94%	65.62%	64.06%	34.38%
EfficientNetB4	Highest Validation Accuracy	82.35%	68.63%	60.78%	76.47%	62.75%	60.78%	66.67%	31.37%
EfficientNetB4	Classification Accuracy	79.69%	73.44%	53.12%	81.25%	51.56%	68.75%	62.50%	29.69%

Table 6: Results for Bird class

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	96.00%	88.00%	40.00%	84.00%	72.00%	88.00%	76.00%	64.00%
VGG16	Classification Accuracy	87.50%	84.38%	59.38%	87.50%	59.38%	75.00%	68.75%	59.38%
VGG19	Highest Validation Accuracy	88.00%	92.00%	68.00%	92.00%	68.00%	84.00%	76.00%	60.00%
VGG19	Classification Accuracy	75.00%	78.12%	68.75%	81.25%	56.25%	75.00%	81.25%	68.75%
ResNet50	Highest Validation Accuracy	92.00%	92.00%	48.00%	80.00%	80.00%	96.00%	76.00%	64.00%
ResNet50	Classification Accuracy	84.38%	93.75%	78.12%	78.12%	68.75%	81.25%	68.75%	56.25%
ResNet101	Highest Validation Accuracy	88.00%	92.00%	56.00%	88.00%	72.00%	84.00%	72.00%	60.00%
ResNet101	Classification Accuracy	87.50%	81.25%	56.25%	84.38%	65.62%	71.88%	71.88%	50.00%
ResNet152	Highest Validation Accuracy	92.00%	88.00%	44.00%	96.00%	72.00%	88.00%	72.00%	48.00%
ResNet152	Classification Accuracy	93.75%	65.62%	62.50%	90.62%	50.00%	71.88%	75.00%	50.00%
EfficientNetB0	Highest Validation Accuracy	92.00%	84.00%	64.00%	92.00%	72.00%	88.00%	80.00%	60.00%
EfficientNetB0	Classification Accuracy	92.00%	78.12%	71.88%	81.25%	56.25%	87.50%	68.75%	56.25%
EfficientNetB1	Highest Validation Accuracy	92.00%	96.00%	68.00%	96.00%	68.00%	92.00%	72.00%	60.00%
EfficientNetB1	Classification Accuracy	78.12%	81.25%	68.75%	87.50%	65.62%	62.50%	78.12%	65.62%
EfficientNetB2	Highest Validation Accuracy	96.00%	80.00%	52.00%	92.00%	68.00%	80.00%	68.00%	64.00%
EfficientNetB2	Classification Accuracy	90.62%	81.25%	68.75%	84.38%	56.25%	78.12%	71.88%	62.50%
EfficientNetB3	Highest Validation Accuracy	88.00%	84.00%	72.00%	88.00%	68.00%	92.00%	76.00%	60.00%
EfficientNetB3	Classification Accuracy	81.25%	78.12%	65.62%	81.25%	65.62%	78.12%	68.75%	65.62%
EfficientNetB4	Highest Validation Accuracy	92.00%	84.00%	68.00%	88.00%	76.00%	84.00%	76.00%	64.00%
EfficientNetB4	Classification Accuracy	87.50%	78.12%	71.88%	81.25%	62.50%	71.88%	81.25%	62.50%

6.2.3 Natural Soundscapes

For the Level 2 Classifier of the Nature class (Table 7), the highest validation accuracy obtained is 95.45%, reached by three combinations: No Filter & ResNet152, Noise Removal & ResNet152, and Noise Removal & EfficientNetB1. The Band Pass Filter also scored as high as 90.91%, and Audio Crop and PCEN each reached 86.36%. In this case, however, the lowest of the per-filter best validation accuracies was 54.55%, obtained with the Low Pass Filter.

Table 7: Results for Nature class

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	84.09%	86.36%	75.00%	86.36%	47.73%	75.00%	77.27%	45.45%
VGG16	Classification Accuracy	75.00%	85.71%	60.71%	83.93%	69.64%	82.14%	69.64%	41.07%
VGG19	Highest Validation Accuracy	86.36%	93.18%	63.64%	84.09%	52.27%	77.27%	75.00%	45.45%
VGG19	Classification Accuracy	82.14%	85.71%	60.71%	87.50%	53.57%	69.64%	69.64%	39.29%
ResNet50	Highest Validation Accuracy	90.91%	93.18%	70.45%	77.27%	52.27%	79.55%	84.09%	47.73%
ResNet50	Classification Accuracy	83.93%	92.86%	50.00%	87.50%	73.21%	71.43%	66.07%	39.29%
ResNet101	Highest Validation Accuracy	88.64%	88.64%	77.27%	86.36%	50.00%	84.09%	72.73%	47.73%
ResNet101	Classification Accuracy	83.93%	87.50%	60.71%	85.71%	64.29%	75.00%	66.07%	26.79%
ResNet152	Highest Validation Accuracy	95.45%	95.45%	81.82%	86.36%	52.27%	77.27%	90.91%	47.73%
ResNet152	Classification Accuracy	85.71%	92.86%	60.71%	87.50%	71.43%	73.21%	67.86%	26.79%
EfficientNetB0	Highest Validation Accuracy	88.64%	90.91%	81.82%	84.09%	54.55%	77.27%	81.82%	56.82%
EfficientNetB0	Classification Accuracy	83.93%	94.64%	67.86%	89.29%	66.07%	76.79%	67.86%	46.43%
EfficientNetB1	Highest Validation Accuracy	90.91%	95.45%	81.82%	84.09%	45.45%	84.09%	77.27%	47.73%
EfficientNetB1	Classification Accuracy	85.71%	94.64%	66.07%	85.71%	69.64%	80.36%	71.43%	48.21%
EfficientNetB2	Highest Validation Accuracy	88.64%	90.91%	84.09%	81.82%	43.18%	65.91%	79.55%	52.27%
EfficientNetB2	Classification Accuracy	78.57%	92.86%	73.21%	85.71%	66.07%	67.86%	73.21%	51.79%
EfficientNetB3	Highest Validation Accuracy	84.09%	90.91%	86.36%	84.09%	47.73%	86.36%	79.55%	54.55%
EfficientNetB3	Classification Accuracy	75.00%	92.86%	66.07%	85.71%	71.43%	82.14%	75.00%	46.43%
EfficientNetB4	Highest Validation Accuracy	81.82%	90.91%	81.82%	79.55%	54.45%	84.09%	88.64%	50.00%
EfficientNetB4	Classification Accuracy	80.36%	87.50%	60.71%	83.93%	69.64%	78.57%	67.86%	55.36%
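
The per-filter figures quoted for the Nature class can be reproduced from the "Highest Validation Accuracy" rows of Table 7. The following is a minimal sketch, not the authors' code: the values are transcribed from the table as printed, and the names `VAL_ACC`, `FILTERS` and `best_per_filter` are illustrative assumptions. It takes the best score of each filter over all CNN models, then reports the overall best and the filter whose best score is lowest.

```python
# "Highest Validation Accuracy" rows of Table 7 (Nature class), in percent,
# transcribed as printed. Column order follows the table header.
FILTERS = [
    "No Filter", "Noise Removal", "PCEN", "Audio Crop",
    "Low Pass Filter", "High Pass Filter", "Band Pass Filter", "Band Stop Filter",
]

VAL_ACC = {
    "VGG16":          [84.09, 86.36, 75.00, 86.36, 47.73, 75.00, 77.27, 45.45],
    "VGG19":          [86.36, 93.18, 63.64, 84.09, 52.27, 77.27, 75.00, 45.45],
    "ResNet50":       [90.91, 93.18, 70.45, 77.27, 52.27, 79.55, 84.09, 47.73],
    "ResNet101":      [88.64, 88.64, 77.27, 86.36, 50.00, 84.09, 72.73, 47.73],
    "ResNet152":      [95.45, 95.45, 81.82, 86.36, 52.27, 77.27, 90.91, 47.73],
    "EfficientNetB0": [88.64, 90.91, 81.82, 84.09, 54.55, 77.27, 81.82, 56.82],
    "EfficientNetB1": [90.91, 95.45, 81.82, 84.09, 45.45, 84.09, 77.27, 47.73],
    "EfficientNetB2": [88.64, 90.91, 84.09, 81.82, 43.18, 65.91, 79.55, 52.27],
    "EfficientNetB3": [84.09, 90.91, 86.36, 84.09, 47.73, 86.36, 79.55, 54.55],
    "EfficientNetB4": [81.82, 90.91, 81.82, 79.55, 54.45, 84.09, 88.64, 50.00],
}


def best_per_filter(val_acc, filters):
    """Map each filter to (best accuracy, list of models that reach it)."""
    summary = {}
    for col, name in enumerate(filters):
        best = max(row[col] for row in val_acc.values())
        models = [m for m, row in val_acc.items() if row[col] == best]
        summary[name] = (best, models)
    return summary


summary = best_per_filter(VAL_ACC, FILTERS)

# Overall best score, and the filter whose own best score is the lowest.
overall_best = max(acc for acc, _ in summary.values())
weakest_filter = min(summary, key=lambda f: summary[f][0])

for name, (acc, models) in summary.items():
    print(f"{name:<17} {acc:6.2f}%  {', '.join(models)}")
print("Overall best:", overall_best)
print("Weakest filter:", weakest_filter, summary[weakest_filter][0])
```

The same function can be applied to the validation rows of the other Level 2 tables by swapping in their values.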

Table 8: Results for Human class

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	75.00%	71.88%	53.12%	76.56%	56.25%	56.25%	67.19%	43.75%
VGG16	Classification Accuracy	81.25%	80.00%	51.25%	83.75%	47.50%	56.25%	63.75%	38.75%
VGG19	Highest Validation Accuracy	68.75%	71.88%	60.94%	78.12%	62.50%	62.50%	62.50%	43.75%
VGG19	Classification Accuracy	83.75%	83.75%	58.75%	76.25%	53.75%	52.50%	58.75%	38.75%
ResNet50	Highest Validation Accuracy	81.25%	84.38%	51.56%	89.06%	62.50%	67.19%	78.12%	42.19%
ResNet50	Classification Accuracy	83.75%	81.25%	53.75%	81.25%	52.50%	63.75%	67.1%	42.50%
ResNet101	Highest Validation Accuracy	79.69%	82.81%	51.56%	84.38%	56.25%	56.25%	67.19%	40.62%
ResNet101	Classification Accuracy	83.75%	86.25%	43.75%	82.50%	50.00%	55.00%	65.00%	43.75%
ResNet152	Highest Validation Accuracy	82.81%	78.12%	57.81%	84.38%	54.69%	64.06%	65.62%	46.88%
ResNet152	Classification Accuracy	86.25%	80.00%	50.00%	82.50%	57.50%	61.25%	63.75%	46.25%
EfficientNetB0	Highest Validation Accuracy	81.25%	87.50%	68.75%	93.75%	56.25%	65.62%	65.62%	51.56%
EfficientNetB0	Classification Accuracy	88.75%	92.50%	65.00%	85.00%	56.25%	56.25%	65.00%	43.75%
EfficientNetB1	Highest Validation Accuracy	82.81%	82.81%	67.19%	89.06%	56.25%	64.06%	71.88%	42.19%
EfficientNetB1	Classification Accuracy	87.50%	85.00%	63.75%	77.50%	55.00%	65.00%	66.25%	46.25%
EfficientNetB2	Highest Validation Accuracy	81.25%	85.94%	67.19%	90.62%	60.94%	62.50%	68.75%	51.56%
EfficientNetB2	Classification Accuracy	88.75%	91.25%	60.00%	86.25%	51.25%	67.50%	58.75%	50.00%
EfficientNetB3	Highest Validation Accuracy	84.38%	81.25%	70.31%	90.62%	51.56%	67.19%	73.44%	50.00%
EfficientNetB3	Classification Accuracy	91.25%	86.25%	57.50%	86.25%	53.75%	63.75%	67.50%	48.75%
EfficientNetB4	Highest Validation Accuracy	84.38%	82.81%	64.06%	84.38%	57.81%	62.50%	68.75%	45.3%
EfficientNetB4	Classification Accuracy	92.50%	76.25%	58.75%	83.75%	53.75%	71.25%	60.00%	43.75%

6.2.4 Human

The Level 2 Classifier of the Human class obtained its highest validation accuracy, 93.75%, with the combination of Audio Crop and the CNN model EfficientNetB0, as shown in Table 8. It also reached a validation score of 87.50% with Noise Removal. As with the Animal and Bird classes, the lowest of the per-filter best validation accuracies, 51.56%, was obtained with the Band Stop Filter.

6.2.5 Machine Sounds

The results of the Level 2 Classification for Machine Sounds in Table 9 show that the classifier reached its highest validation accuracy, 92.00%, with the following combinations of audio modifiers and CNN models:

1. No Filter & VGG16
2. No Filter & ResNet152
3. No Filter & EfficientNetB3
4. Audio Crop & VGG19
5. Audio Crop & ResNet101
6. Audio Crop & ResNet152
7. Audio Crop & EfficientNetB3

The minimum accuracy score obtained was 64.00% with Band Stop Filter.

Table 9: Results for Machine Sounds class

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	92.00%	84.00%	60.00%	96.00%	64.00%	80.00%	68.00%	60.00%
VGG16	Classification Accuracy	78.12%	56.25%	75.00%	71.88%	59.38%	65.62%	62.50%	50.00%
VGG19	Highest Validation Accuracy	80.00%	80.00%	60.00%	92.00%	72.00%	64.00%	76.00%	52.00%
VGG19	Classification Accuracy	75.00%	68.75%	50.00%	71.88%	78.12%	65.62%	56.25%	46.88%
ResNet50	Highest Validation Accuracy	84.00%	84.00%	80.00%	88.00%	60.00%	84.00%	76.00%	60.00%
ResNet50	Classification Accuracy	87.50%	68.75%	62.50%	68.75%	68.75%	68.75%	65.62%	40.62%
ResNet101	Highest Validation Accuracy	80.00%	80.00%	88.00%	92.00%	64.00%	76.00%	80.00%	52.00%
ResNet101	Classification Accuracy	78.12%	68.75%	62.50%	81.25%	46.88%	65.62%	65.62%	46.88%
ResNet152	Highest Validation Accuracy	92.00%	68.00%	60.00%	92.00%	68.00%	72.00%	72.00%	52.00%
ResNet152	Classification Accuracy	81.25%	78.12%	56.25%	62.50%	62.50%	62.50%	71.88%	46.88%
EfficientNetB0	Highest Validation Accuracy	76.00%	72.00%	80.00%	76.00%	36.00%	84.00%	80.00%	64.00%
EfficientNetB0	Classification Accuracy	75.00%	84.38%	62.50%	78.12%	62.50%	65.62%	65.62%	53.12%
EfficientNetB1	Highest Validation Accuracy	76.00%	68.00%	60.00%	88.00%	40.00%	80.00%	76.00%	56.00%
EfficientNetB1	Classification Accuracy	78.12%	68.75%	50.00%	84.38%	56.25%	71.88%	65.62%	46.88%
EfficientNetB2	Highest Validation Accuracy	80.00%	76.00%	84.00%	80.00%	44.00%	84.00%	76.00%	56.00%
EfficientNetB2	Classification Accuracy	78.12%	81.25%	75.00%	81.25%	59.38%	68.75%	62.50%	43.75%
EfficientNetB3	Highest Validation Accuracy	92.00%	64.00%	80.00%	92.00%	48.00%	84.00%	76.00%	52.00%
EfficientNetB3	Classification Accuracy	81.25%	75.00%	75.00%	81.25%	62.50%	78.12%	59.38%	50.00%
EfficientNetB4	Highest Validation Accuracy	64.00%	80.00%	88.00%	88.00%	48.00%	84.00%	68.00%	56.00%
EfficientNetB4	Classification Accuracy	62.50%	78.12%	75.00%	71.88%	56.25%	71.88%	71.88%	65.62%

Table 10: Results for Domestic class

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	96.08%	94.12%	50.98%	86.27%	60.78%	72.55%	70.59%	45.10%
VGG16	Classification Accuracy	82.81%	84.38%	46.88%	89.06%	57.81%	70.31%	59.38%	43.75%
VGG19	Highest Validation Accuracy	82.35%	88.24%	56.86%	84.31%	64.71%	76.47%	74.51%	50.98%
VGG19	Classification Accuracy	82.81%	90.62%	60.94%	87.50%	62.50%	65.62%	65.62%	37.50%
ResNet50	Highest Validation Accuracy	86.27%	92.16%	62.75%	92.00%	52.94%	72.55%	62.75%	43.14%
ResNet50	Classification Accuracy	78.12%	92.19%	60.94%	94.12%	65.62%	70.31%	73.44%	43.75%
ResNet101	Highest Validation Accuracy	94.12%	96.08%	68.63%	92.16%	58.82%	74.51%	74.51%	47.06%
ResNet101	Classification Accuracy	87.50%	90.62%	71.88%	87.50%	68.75%	68.75%	71.88%	39.06%
ResNet152	Highest Validation Accuracy	96.08%	92.16%	60.78%	96.08%	56.86%	78.43%	68.63%	43.14%
ResNet152	Classification Accuracy	87.50%	89.06%	54.69%	87.50%	65.62%	65.62%	64.06%	34.38%
EfficientNetB0	Highest Validation Accuracy	96.08%	94.12%	72.55%	94.12%	66.67%	68.63%	82.35%	56.86%
EfficientNetB0	Classification Accuracy	84.38%	89.06%	67.19%	90.62%	70.31%	59.38%	68.75%	59.38%
EfficientNetB1	Highest Validation Accuracy	92.16%	96.08%	62.75%	92.16%	62.75%	78.43%	86.27%	56.86%
EfficientNetB1	Classification Accuracy	89.06%	87.50%	48.44%	92.19%	70.31%	62.50%	70.31%	50.00%
EfficientNetB2	Highest Validation Accuracy	94.12%	96.08%	74.51%	94.12%	62.75%	76.47%	84.31%	52.94%
EfficientNetB2	Classification Accuracy	85.94%	90.62%	60.94%	93.75%	64.06%	68.75%	65.62%	56.25%
EfficientNetB3	Highest Validation Accuracy	96.08%	96.08%	70.59%	86.27%	56.86%	82.35%	86.27%	54.90%
EfficientNetB3	Classification Accuracy	89.06%	90.62%	60.94%	85.94%	64.06%	73.44%	73.44%	56.25%
EfficientNetB4	Highest Validation Accuracy	96.08%	98.04%	52.94%	88.24%	50.98%	82.35%	82.35%	43.14%
EfficientNetB4	Classification Accuracy	79.69%	93.75%	62.50%	89.06%	68.75%	70.31%	76.56%	45.31%

6.2.6 Domestic

The results in Table 10 for the Level 2 Classification of the Domestic class show a highest validation accuracy of 98.04%, obtained with Noise Removal and EfficientNetB4. A score of 96.08% was also obtained both without any modifier and with Audio Crop. As in most of the previous Level 2 results, the lowest score, 56.86%, was obtained with the Band Stop Filter.

6.2.7 Outdoor

The Level 2 Classification of the Outdoor class reached its highest validation accuracy, 93.75%, with the combination of Audio Crop and EfficientNetB3, as can be seen from Table 11. It also scored as high as 89.06% with Noise Removal and VGG16, and 85.94% on the spectrograms of the raw audio files with ResNet152. The lowest validation accuracy obtained in this case is 50.00%, with the Band Stop Filter.

Compared with the scores of previous works shown in Table 12, the accuracy scores we have achieved are competitive.

Table 11: Results for Outdoor class

CNN Model	Accuracy	No Filter	Noise Removal	PCEN	Audio Crop	Low Pass Filter	High Pass Filter	Band Pass Filter	Band Stop Filter
VGG16	Highest Validation Accuracy	78.12%	89.06%	59.38%	89.06%	53.12%	68.75%	57.81%	35.94%
VGG16	Classification Accuracy	78.75%	82.50%	48.75%	76.25%	53.75%	47.50%	46.25%	36.25%
VGG19	Highest Validation Accuracy	79.69%	76.56%	60.94%	84.38%	60.94%	60.94%	60.94%	40.62%
VGG19	Classification Accuracy	83.75%	76.25%	50.00%	80.00%	57.50%	42.50%	48.75%	35.00%
ResNet50	Highest Validation Accuracy	82.81%	79.69%	64.06%	87.50%	57.81%	75.00%	62.50%	46.88%
ResNet50	Classification Accuracy	83.75%	81.25%	57.50%	83.75%	63.75%	58.75%	60.00%	37.50%
ResNet101	Highest Validation Accuracy	81.25%	75.00%	59.38%	79.69%	57.81%	76.56%	65.62%	43.75%
ResNet101	Classification Accuracy	83.75%	80.00%	50.00%	77.50%	71.25%	60.00%	58.75%	28.75%
ResNet152	Highest Validation Accuracy	85.94%	78.12%	56.25%	87.50%	48.44%	71.88%	57.81%	43.75%
ResNet152	Classification Accuracy	88.75%	80.00%	55.00%	82.50%	60.00%	55.00%	55.00%	33.75%
EfficientNetB0	Highest Validation Accuracy	84.38%	81.25%	70.31%	89.06%	56.25%	73.44%	68.75%	48.44%
EfficientNetB0	Classification Accuracy	85.00%	85.00%	60.00%	80.00%	65.00%	53.75%	55.00%	36.25%
EfficientNetB1	Highest Validation Accuracy	81.25%	75.00%	60.94%	89.06%	59.38%	75.00%	68.75%	50.00%
EfficientNetB1	Classification Accuracy	82.50%	83.75%	61.25%	81.25%	63.75%	62.50%	60.00%	42.50%
EfficientNetB2	Highest Validation Accuracy	81.25%	79.69%	64.06%	82.81%	60.94%	70.31%	64.06%	45.31%
EfficientNetB2	Classification Accuracy	87.50%	85.00%	53.75%	77.50%	63.75%	58.75%	55.00%	43.75%
EfficientNetB3	Highest Validation Accuracy	84.38%	85.94%	67.19%	93.75%	62.50%	76.56%	68.75%	43.75%
EfficientNetB3	Classification Accuracy	85.00%	82.50%	52.50%	82.50%	63.75%	58.75%	61.25%	30.00%
EfficientNetB4	Highest Validation Accuracy	78.12%	81.25%	67.19%	85.94%	56.25%	70.31%	62.50%	45.31%
EfficientNetB4	Classification Accuracy	82.50%	81.25%	56.25%	82.50%	56.25%	65.00%	58.75%	40.00%

Table 12: Comparison with previous works

Method by	Score on ESC-50
Piczak	64.50%
Agrawal et al.	81.95%
Zhang et al.	68.10%
Zhichao et al.	83.90%
Zhichao et al.	86.50%
Ullo et al.	95.80%
Mushtaq et al.	97.57%
Two-Level Classification	Level 1 - 78.75%
Two-Level Classification	Level 2 - 98.04% (Highest)

7 Discussion

In this section, we discuss the results presented in Section 6. As can be seen from Table 4, the Level 1 Classification obtained its highest Classification Accuracy, 78.75%, with the CNN model EfficientNetB2 and Audio Crop.

For the other classifications, the best results are shown in Table 13. Table 14(b) and Fig 7(b) show the number of times each CNN class achieved the highest validation score, and it is clear from them that EfficientNet worked best in most of the cases. Beyond obtaining the Highest Validation Accuracy in 10 cases, an examination of Tables 4, 5, 6, 7, 8, 9, 10 and 11 in Section 6 shows that, in most cases, each audio modifier obtained its highest score (Classification Accuracy for the Level 1 Classification and Highest Validation Accuracy for the other classifications) with a CNN model belonging to the EfficientNet class.

Turning to the other crucial part of the research, the audio modifiers, Audio Crop gave the best scores in most of the cases, as shown in Table 14(a) and Fig 7(a). As Table 13 shows, Audio Crop obtained the Highest Validation Accuracy for the Animal, Birds, Human, Machine Sounds and Outdoor classifications. For the Nature and Domestic classes (Tables 7 and 10) it did not give the Highest Validation Accuracy, but it performed on par with the other audio modifiers.

Table 13: Best Results obtained with CNN Models and Audio Modifiers

Mode of Classification	CNN Model	Audio Modifier	Highest Validation Accuracy
Animal	EfficientNetB2	Audio Crop	88.24%
Birds	VGG16	No Filter	96.00%
Birds	EfficientNetB2	No Filter	96.00%
Birds	EfficientNetB1	Noise Removal	96.00%
Birds	ResNet152	Audio Crop	96.00%
Birds	EfficientNetB1	Audio Crop	96.00%
Birds	ResNet50	High Pass Filter	96.00%
Nature	ResNet152	No Filter	95.45%
Nature	ResNet152	Noise Removal	95.45%
Nature	EfficientNetB1	Noise Removal	95.45%
Human	EfficientNetB0	Audio Crop	93.75%
Machine Sounds	VGG16	No Filter	92.00%
Machine Sounds	ResNet152	No Filter	92.00%
Machine Sounds	EfficientNetB3	No Filter	92.00%
Machine Sounds	VGG19	Audio Crop	92.00%
Machine Sounds	ResNet101	Audio Crop	92.00%
Machine Sounds	ResNet152	Audio Crop	92.00%
Machine Sounds	EfficientNetB3	Audio Crop	92.00%
Domestic	EfficientNetB4	Noise Removal	98.04%
Outdoor	EfficientNetB3	Audio Crop	93.75%

Table 14: Count of best scores by CNN Models and Audio Modifiers

(a) Count of best scores by Audio Modifiers

Audio Modifier	Number of classifications in which the Highest Validation Accuracy is obtained
No Filter	3
Noise Removal	3
Audio Crop	5
High Pass Filter	1

(b) Count of best scores by CNN Models

CNN Model Class	Number of best scores obtained
VGG	3
ResNet	7
EfficientNet	10
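
The counts in Table 14 can be derived from the entries of Table 13. The sketch below is illustrative rather than the authors' procedure, and it assumes the counting conventions that match the published numbers: Table 14(a) counts each audio modifier once per classification in which it reaches the best score, while Table 14(b) counts every (CNN model, audio modifier) entry by model family. The names `BEST` and `family` are hypothetical.

```python
from collections import Counter

# (classification, CNN model, audio modifier) entries of Table 13.
BEST = [
    ("Animal", "EfficientNetB2", "Audio Crop"),
    ("Birds", "VGG16", "No Filter"),
    ("Birds", "EfficientNetB2", "No Filter"),
    ("Birds", "EfficientNetB1", "Noise Removal"),
    ("Birds", "ResNet152", "Audio Crop"),
    ("Birds", "EfficientNetB1", "Audio Crop"),
    ("Birds", "ResNet50", "High Pass Filter"),
    ("Nature", "ResNet152", "No Filter"),
    ("Nature", "ResNet152", "Noise Removal"),
    ("Nature", "EfficientNetB1", "Noise Removal"),
    ("Human", "EfficientNetB0", "Audio Crop"),
    ("Machine Sounds", "VGG16", "No Filter"),
    ("Machine Sounds", "ResNet152", "No Filter"),
    ("Machine Sounds", "EfficientNetB3", "No Filter"),
    ("Machine Sounds", "VGG19", "Audio Crop"),
    ("Machine Sounds", "ResNet101", "Audio Crop"),
    ("Machine Sounds", "ResNet152", "Audio Crop"),
    ("Machine Sounds", "EfficientNetB3", "Audio Crop"),
    ("Domestic", "EfficientNetB4", "Noise Removal"),
    ("Outdoor", "EfficientNetB3", "Audio Crop"),
]


def family(model):
    """Return the CNN family (VGG, ResNet or EfficientNet) of a model name."""
    for prefix in ("VGG", "ResNet", "EfficientNet"):
        if model.startswith(prefix):
            return prefix
    raise ValueError(f"unknown model family: {model}")


# Table 14(a): count each modifier once per classification.
distinct_pairs = {(cls, modifier) for cls, _, modifier in BEST}
modifier_counts = Counter(modifier for _, modifier in distinct_pairs)

# Table 14(b): count every best-scoring entry by model family.
family_counts = Counter(family(model) for _, model, _ in BEST)

print(dict(modifier_counts))
print(dict(family_counts))
```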

Table 13 lists the best accuracies, which were obtained with No Filter, Audio Crop, Noise Removal and the High Pass Filter. In some cases other audio modifiers also worked well, such as the Low Pass Filter for Birds, the High Pass Filter and Band Pass Filter for Nature, the Band Pass Filter for Human, and the Band Pass Filter and PCEN for Machine Sounds. On the other hand, the four Audio Filters did not work well for the Outdoor class. As for the CNNs, ResNet and VGG showed high scores in the classifications with fewer samples, whereas EfficientNet performed well on the problems with more classes and samples.

8 Conclusion

The main objective of this paper is to propose a Two-Level Sound Classification method for the Environmental Sound Classification problem. Experiments on the ESC-50 dataset show that the Level 1 Classification reached a classification accuracy as high as 78.75%, while the highest validation score obtained by the Level 2 Classification is 98.04%. In addition, this paper shows the efficiencies of different CNN models combined with different audio modifiers and discusses their impact on the audio files. Although we obtained the highest accuracy with Audio Crop, in future we plan to optimize the hyperparameters more accurately, examine the performance of other threshold frequencies with the audio filters, and further improve the performance of the proposed methodology. We hope our method and process of thinking will encourage future researchers to implement them in their research.

References

[1] Michel Vacher, Jean-François Serignat, and Stephane Chaillol. Sound classification in a smart room environment: an approach using gmm and hmm methods. In The 4th IEEE Conference on Speech Technology and Human-Computer Dialogue (SpeD 2007), Publishing House of the Romanian Academy (Bucharest), volume 1, pages 135–146, 2007.
[2] Regunathan Radhakrishnan, Ajay Divakaran, and A Smaragdis. Audio analysis for surveillance applications. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005, pages 158–161. IEEE, 2005.
[3] Richard F Lyon. Machine hearing: An emerging field [exploratory dsp]. IEEE signal processing magazine, 27(5):131–139, 2010.
[4] Karol J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2015.
[5] Dharmesh M Agrawal, Hardik B Sailor, Meet H Soni, and Hemant A Patil. Novel teo-based gammatone features for environmental sound classification. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1809–1813. IEEE, 2017.
[6] F. Beritelli and R. Grasso. A pattern recognition system for environmental sound classification based on mfccs and neural networks. In 2008 2nd International Conference on Signal Processing and Communication Systems, pages 1–4, 2008.
[7] Zhichao Zhang, Shugong Xu, Shan Cao, and Shunqing Zhang. Deep convolutional neural network with mixup for environmental sound classification. In Chinese conference on pattern recognition and computer vision (prcv), pages 356–367. Springer, 2018.
[8] Zhichao Zhang, Shugong Xu, Shunqing Zhang, Tianhao Qiao, and Shan Cao. Learning attentive representations for environmental sound classification. IEEE Access, 7:130327–130339, 2019.
[9] Zohaib Mushtaq, Shun-Feng Su, and Quoc-Viet Tran. Spectral images based environmental sound classification using cnn with meaningful data augmentation. Applied Acoustics, 172:107581, 2021.
[10] Xiaohu Zhang, Yuexian Zou, and Wei Shi. Dilated convolution neural network with leakyrelu for environmental sound classification. In 2017 22nd international conference on digital signal processing (DSP), pages 1–5. IEEE, 2017.
[11] Silvia Liberata Ullo, Smith K Khare, Varun Bajaj, and GR Sinha. Hybrid computerized method for environmental sound classification. IEEE Access, 8:124055–124065, 2020.
[12] Wazib Ansar, Ahan Chatterjee, Saptarsi Goswami, and Amlan Chakrabarti. An efficientnet-based ensemble for bird-call recognition with enhanced noise reduction. SN Computer Science, 5(2):265, 2024.
[13] Yann Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L.D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems (NIPS 1989), Denver, CO, volume 2. Morgan Kaufmann, 1990.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[17] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
[18] Joshua M Inouye, Silvia S Blemker, and David I Inouye. Towards undistorted and noise-free speech in an mri scanner: correlation subtraction followed by spectral noise gating. The Journal of the Acoustical Society of America, 135(3):1019–1022, 2014.
[19] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F Lyon, and Rif A Saurous. Trainable frontend for robust and far-field keyword spotting. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5670–5674. IEEE, 2017.
[20] Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Per-channel energy normalization: Why and how. IEEE Signal Processing Letters, 26(1):39–43, 2018.
[21] RR Porle, NS Ruslan, NM Ghani, NA Arif, SR Ismail, N Parimon, and M Mamat. A survey of filter design for audio noise reduction. J. Adv. Rev. Sci. Res, 12(1):26–44, 2015.
[22] Karol J Piczak. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia, pages 1015–1018, 2015.