Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification
Abstract
Environmental Sound Classification (ESC) is an important problem in sound recognition and is more complicated than speech recognition because environmental sounds are not well structured with respect to time and frequency. Over the past years, researchers have used various CNN models to learn from audio features such as log mel spectrograms, gammatone spectral coefficients and mel-frequency spectral coefficients generated from the audio files. In this paper, we propose a new methodology, Two-Level Classification: the Level 1 Classifier classifies the audio signal into a broader class, and the Level 2 Classifiers find the actual class to which the audio belongs, based on the output of the Level 1 Classifier. We also show the effects of different audio filters, among which a new method, Audio Crop, is introduced in this paper; it gave the highest accuracies in most of the cases. We used the ESC-50 dataset for our experiments and obtained a maximum accuracy of 78.75% for Level 1 Classification and 98.04% for Level 2 Classification.
Keywords: Environmental Sound Classification (ESC), Audio Crop, Per Channel Energy Normalization (PCEN), Spectral Gating, Audio Filters, Convolutional Neural Network (CNN)
1 Introduction
Environmental Sound Classification (ESC) has become a challenging task in recent times. The classification and identification of environmental sounds, such as a dog barking, birds chirping, knocking on a door, a vacuum cleaner, a car horn and water drops, are necessary for developing smart-home appliances [1], security systems [2, 3], etc. to make human life more secure. For example, if mobile devices can recognize the sound of a car honking, road and pedestrian accidents could decrease significantly. If automatic pet-care systems can identify the sounds of dogs or birds, they can provide the animals with food and water. If a smart security system can recognize door-knocking sounds, it can notify the owner about the presence of a person at the door. In other words, a well-trained sound classifier can recognize sounds from the human environment and help make human life safer.
Traditional sound processing is based on sound features such as log mel spectrograms with delta information [4], Gammatones [5] and Mel Frequency Cepstral Coefficients (MFCC) [6]. Various Machine Learning and Deep Learning techniques have been applied to these features to obtain high scores. Although Machine Learning algorithms obtained good results, the recent breakthrough of Deep Neural Networks, mainly Convolutional Neural Networks (CNNs), has been significant. CNNs have been used with these sound features to obtain accuracies such as 83.9% [7] and 86.95% [8], and as high as 97.57% [9].
In this paper, we propose a two-level classification method using CNN models of the classes VGG, ResNet and EfficientNet with audio modifiers like PCEN, Spectral Gating (Noise Removal) and Audio Crop, and Audio Filters like Low Pass Filter, High Pass Filter, Band Pass Filter and Band Stop Filter. The Level 1 classifier classifies the sounds into broader groups such as Animals, Birds and Nature, while the Level 2 classifier assigns the audio signal to its sub-class. For example, if an audio signal is assigned to the Animal class by the Level 1 classifier, the Level 2 classifier is responsible for detecting the actual animal, which might be a dog, a cow or a sheep. After applying the CNN models and the audio modifiers, our method obtained a score of 78.75% for Level 1 Classification, and the maximum score obtained by the Level 2 Classifiers is 98.04%.
The remaining portion of this article is arranged descriptively and divided into 7 sections. Sound Classification processes used in previous works for the similar problem are explained in Section 2. In Section 3, we have discussed CNN and Audio Modifiers which were used for our work. In Section 4, we have shown the process of our method with a flowchart, with a suitable description. The tools and libraries are discussed in Section 5. Then, the results of our works are shown in Section 6 and discussed in Section 7. Finally, we conclude our paper in Section 8.
2 Related Works
In this section, we discuss the previous works by researchers on the ESC problem.
Piczak [4] extracted the log mel spectrograms for each frame of the audio files and used them, together with their delta information, in the CNN model he proposed. He obtained an accuracy of 64.5% with this approach. The goal of Piczak's paper was to evaluate the success of CNNs when applied to ESC tasks.
Agrawal et al. [5] used a TEO-based gammatone feature set for the problem. First, they extracted the gammatone filterbanks from the raw audio files and applied a bandpass filter. Then, they applied a half-wave rectifier to each of the sub-bands, followed by the Teager Energy Operator (TEO). Finally, they applied short-term averaging to obtain short-term spectral features. The resulting spectrograms were given as input to a CNN architecture similar to that used by Piczak [4], which obtained a score of 81.95%. However, they showed that the TEO-based Gammatone Spectral Coefficients failed to give better results with CNN.
Zhang et al. [10] showed the performance of a Dilated Convolution Network with seven layers and two input channels. They used log mel spectrograms and delta feature spectrograms as inputs to the proposed network. They also studied the classification accuracy obtained with ReLU-type activation functions. However, they obtained a classification accuracy of 68.1% on the ESC-50 dataset, and they noted that the improvement in classification accuracy came at the cost of higher computational complexity and larger storage.
Zhichao et al. [7] proposed a new CNN architecture inspired by VGG Net by using 1-D convolution filters in place of convolution filters, to learn local patterns across frequency and time. They extracted log mel spectrograms and gammatone spectrograms, which were used in the proposed CNN architecture, along with their delta information and achieved a classification accuracy of 83.9%.
In [8], Zhichao et al. adopted a convolutional RNN architecture for the problem. First, they used a CNN with a channel temporal attention mechanism in the convolution layers, with log-gammatone spectrograms as input, to extract high-level features. These features were then passed to a bidirectional gated recurrent unit to analyse temporal correlations. They showed a classification accuracy of 86.5% on the ESC-50 dataset.
In [11], Ullo et al. proposed a method that uses a hybrid structure made of OAS, STFT, CNN and different classification techniques for the classification of the classes in the ESC-10 dataset and they achieved a classification accuracy of 95.8%.
Mushtaq et al. [9] showed the performances of different data augmentations on the audio files and log mel spectrograms of the original audio files and augmented audio files with transfer learning and obtained an accuracy of 97.57% for ESC-50 dataset. They also showed the performances of distinct pre-trained models, which included ResNet, DenseNet, AlexNet, SqueezeNet and VGG.
Ansar et al. [12] proposed an EfficientNet ensemble with triple-layered approach to eliminate noise for classification of audio signals. They further validated a trade-off between model depth and number of parameters to obtain optimal accuracy through extensive evaluation on a bouquet of models.
It is clear from the previous works that all the new methodologies and models were built to classify the audio files into the 50 classes directly. In this study, we propose a new approach, a two-level classification, that uses the spectrograms obtained after applying audio modifiers together with pre-trained CNN models, and that obtained a state-of-the-art accuracy score compared to the works discussed above.
3 Methods
In this section, we discuss the types of CNNs and audio modifiers used for our project.
3.1 Convolutional Neural Network (CNN)
CNNs are used in image processing applications. CNNs can be trained well to understand the hidden features of images. This is because a CNN applies different relevant filters, and its architecture reduces the number of parameters involved and increases the re-usability of the weights. For this reason, the spatial and temporal dependencies of an image are successfully captured by a CNN. CNNs are mostly used for classification and computer vision tasks. Fig 1 (source: towardsdatascience) shows a general architecture of a CNN model.

CNN models consist mainly of three types of layers:
1. Convolution Layer
2. Pooling Layer
3. Fully-connected Layer
LeCun et al. [13] introduced the first CNN architecture, LeNet-5, for the recognition of handwritten digits (input images were of dimension 32 × 32), using the MNIST database. LeNet-5 was a very shallow CNN with alternating convolution layers and pooling layers and had only about 60,000 parameters.
AlexNet was then introduced by Krizhevsky et al. [14]. The network was similar to LeNet but instead of alternating convolution layers and pooling layers, AlexNet had all the convolution layers stacked together. Also compared to LeNet-5, this network is much bigger and deeper.
Later, VGGNet was introduced by Simonyan & Zisserman [15]. Earlier models like AlexNet used large filters in the initial layers, but VGG changed this by using small 3 × 3 filters.
He et al. [16] presented a residual learning framework in which the layers learn residual functions with respect to their inputs instead of learning un-referenced functions. They showed that this is particularly useful for training deeper networks, since residual networks are easier to optimize and gain accuracy from increased depth. The main drawback of this network is that it is very expensive to evaluate due to the huge number of parameters.
Various models were thus developed focusing either on performance or on computational efficiency. Tan & Le introduced the EfficientNet [17] model, which addresses both concerns. They proposed a common CNN architecture governed by three parameters: width, depth and resolution. Width refers to the number of channels present in the various layers, depth refers to the number of layers in the model and resolution refers to the input image size for the model. EfficientNet performs compound scaling of the depth, width and resolution together. The compound scaling method only enhances the predictive capacity of the network by replicating the base network's underlying convolutional operations and network structure.
In this paper, we also compare three classes of pre-trained models: VGG (VGG16 and VGG19), ResNet (ResNet50, ResNet101 and ResNet152) and EfficientNet (EfficientNetB0 to EfficientNetB4).
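The compound scaling idea can be illustrated with a small sketch. It assumes the constants reported by Tan & Le for the B0 baseline (alpha = 1.2, beta = 1.1, gamma = 1.15) and a hypothetical helper name `compound_scale`; the released B1 to B4 variants use rounded multipliers, so the printed values are only indicative.

```python
# Compound scaling sketch: one coefficient phi scales depth, width and resolution together.
# ALPHA, BETA, GAMMA are the grid-searched constants reported for the EfficientNet-B0 baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    return depth, width, resolution


def flops_multiplier(phi):
    """FLOPs grow roughly as (alpha * beta^2 * gamma^2) ** phi, i.e. close to 2 ** phi."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
          f"FLOPs x{flops_multiplier(phi):.2f}")
```

The constraint alpha * beta^2 * gamma^2 ≈ 2 means that each unit increase of phi roughly doubles the computational cost.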
3.2 Spectrogram
A spectrogram, usually depicted as a heatmap, is a visual representation of the spectrum of frequencies of a signal as it varies with time. Spectrograms of some audio files from each group listed in Table 1 are shown in Fig 4. The spectrograms are passed into the CNN models to learn the audio features.
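As a minimal sketch of how such a time-frequency image is obtained, the following computes a log-magnitude spectrogram with a short-time Fourier transform in plain NumPy. The function name `log_spectrogram`, the frame size and the hop size are assumptions for illustration; this is a linear-frequency spectrogram, not the mel spectrogram that the Librosa functions used in this work produce.

```python
import numpy as np


def log_spectrogram(signal, n_fft=1024, hop=512):
    """Return a log-magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, time_frames)
    return 20.0 * np.log10(magnitude + 1e-10)          # decibels; offset avoids log(0)


sr = 44100                                   # ESC-50 sampling rate
t = np.arange(sr) / sr                       # one second of audio
tone = np.sin(2 * np.pi * 1000.0 * t)        # synthetic 1 kHz test tone
spec = log_spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())   # frequency bin with the most energy
print(spec.shape, peak_bin * sr / 1024)      # peak frequency lies close to 1000 Hz
```

Each column of `spec` is one time frame and each row one frequency bin, which is the layout a CNN receives as an image.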
3.3 Understanding Audio Modifiers Used
3.3.1 Spectral Gating
Spectral Gating, as explained in [18], is a noise-reduction technique comprising several steps. A Fourier transformation is applied to the noise-only portion of the audio signal to create a spectral "fingerprint", which is then used as a "gate" to filter the audio signal. The frequencies in the audio signal that are above the gate value are passed, while those below it are removed. Here, we have removed noise based on the principle of Spectral Gating. The result of Spectral Gating performed on an audio file is shown in Fig 3 with the waveform and spectrogram.
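The gating principle can be sketched in NumPy as follows. This is an illustrative simplification, not the noisereduce implementation used in this work: the helper names `stft_mag` and `spectral_gate`, the threshold rule (mean plus `n_std` standard deviations of the noise magnitude per frequency bin) and the synthetic signals are all assumptions, and the sketch only masks the magnitude spectrogram without resynthesising audio.

```python
import numpy as np


def stft_mag(x, n_fft=1024, hop=512):
    """Magnitude STFT of shape (freq_bins, frames) using a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T


def spectral_gate(signal, noise_clip, n_std=1.5):
    """Zero every time-frequency bin that does not exceed the noise fingerprint."""
    noise_mag = stft_mag(noise_clip)
    # Per-frequency threshold: the "fingerprint" of the noise-only segment.
    threshold = noise_mag.mean(axis=1) + n_std * noise_mag.std(axis=1)
    sig_mag = stft_mag(signal)
    mask = sig_mag > threshold[:, None]      # True where the signal passes the gate
    return sig_mag * mask, mask


rng = np.random.default_rng(0)
sr = 44100
noise = 0.01 * rng.standard_normal(sr)                    # noise-only segment
tone = np.sin(2 * np.pi * 1000.0 * np.arange(sr) / sr)    # clean 1 kHz tone
gated, mask = spectral_gate(tone + noise, noise_clip=noise)
print(f"fraction of bins kept: {mask.mean():.1%}")
```

In this toy case the bins around 1 kHz pass the gate in every frame, while most noise-only bins are removed.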
3.3.2 Per Channel Energy Normalization (PCEN)
PCEN was introduced in [19] as an alternative to the log-mel frontend, and its working principle is discussed in [20]. PCEN is the result of three component operations:
1. Temporal Integration
2. Adaptive Gain Control
3. Dynamic Range Compression
The result of PCEN on two audio files is shown in Fig 5.
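The three component operations can be written down compactly. The sketch below assumes the commonly cited PCEN formulation, PCEN(t, f) = (E(t, f) / (eps + M(t, f))^alpha + delta)^r - delta^r with M(t, f) = (1 - s) M(t - 1, f) + s E(t, f), and default parameter values taken from the PCEN literature; the function name `pcen` and the toy input are illustrative and may differ from the Librosa settings used in this work.

```python
import numpy as np


def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a non-negative energy spectrogram E of shape (freq_bins, frames)."""
    M = np.empty_like(E, dtype=float)
    M[:, 0] = E[:, 0]                        # initialise the smoother with the first frame
    for t in range(1, E.shape[1]):
        # 1. Temporal integration: first-order IIR low-pass filter along time.
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    # 2. Adaptive gain control: divide by the smoothed energy raised to alpha.
    gained = E / (eps + M) ** alpha
    # 3. Dynamic range compression: offset by delta and take the r-th power.
    return (gained + delta) ** r - delta ** r


E = np.full((4, 50), 3.0)                    # constant background energy
E[:, 25] = 300.0                             # a sudden onset in one frame
out = pcen(E)
print(out[0, 24], out[0, 25])                # the onset stands out against the background
```

Because the gain adapts to the smoothed energy, stationary background is flattened while transient events are emphasised.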
3.3.3 Audio Crop
The main idea behind introducing this feature in our work is to repeat the non-zero portions of an audio sample over the maximum time length of the audio files provided with the data. This method is explained by Algorithms 1 and 2. Algorithm 1 finds the maximum time length among the audio files and Algorithm 2 removes the silent portions of the audio files. The process of Audio Cropping is shown in Fig 6.
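A minimal sketch of this idea is given below. It is not a transcription of Algorithms 1 and 2: the function names `max_length` and `audio_crop` are hypothetical, and silence is assumed to mean samples whose absolute amplitude does not exceed a threshold (exactly zero by default), whereas a practical implementation might detect silence per frame.

```python
import numpy as np


def max_length(signals):
    """Analogue of Algorithm 1: longest clip length (in samples) in the collection."""
    return max(len(s) for s in signals)


def audio_crop(signal, target_len, threshold=0.0):
    """Drop silent samples, then repeat the remainder until target_len is reached."""
    voiced = signal[np.abs(signal) > threshold]
    if voiced.size == 0:                     # fully silent clip: nothing to repeat
        return np.zeros(target_len, dtype=signal.dtype)
    repeats = -(-target_len // voiced.size)  # ceiling division
    return np.tile(voiced, repeats)[:target_len]


clips = [np.array([0.0, 0.5, -0.5, 0.0, 0.0, 0.0]), np.array([0.1, 0.2, 0.3, 0.4])]
target = max_length(clips)
cropped = audio_crop(clips[0], target)
print(cropped)
```

Here the two non-zero samples of the first clip are repeated until the clip again spans the maximum length of six samples.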
3.3.4 Audio Filters
Research and development on Audio Filters for audio modification is discussed in [21]. The filters that we have used in our work modify the audio signals based on frequency.
1. Low Pass Filter: A Low Pass Filter allows the frequencies lower than a cut-off frequency to pass and attenuates the frequencies higher than the cut-off frequency.
2. High Pass Filter: A High Pass Filter allows the frequencies higher than a cut-off frequency to pass and attenuates the frequencies lower than the cut-off frequency.
3. Band Pass Filter: A Band Pass Filter accepts two cut-off frequencies, a low-cut frequency and a high-cut frequency. This filter allows the band of frequencies between the low-cut and high-cut frequencies to pass and attenuates the frequencies lower than the low-cut frequency and higher than the high-cut frequency.
4. Band Stop Filter: A Band Stop Filter also accepts two cut-off frequencies, a low-cut frequency and a high-cut frequency. Unlike the Band Pass Filter, this filter attenuates the band of frequencies between the low-cut and high-cut frequencies and allows the frequencies lower than the low-cut frequency and higher than the high-cut frequency to pass.
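The behaviour of the four filters can be illustrated with an idealised sketch. It uses a brick-wall mask on the FFT bins in NumPy, which is an assumption made for clarity and not the SciPy filter design used in this work; the function name `fft_filter` is hypothetical, and the cut-offs of 512 Hz and 2048 Hz are the values this paper reports for its filters.

```python
import numpy as np


def fft_filter(signal, sr, mode, low=None, high=None):
    """Ideal (brick-wall) filter: zero the FFT bins outside the pass region."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    if mode == "lowpass":
        keep = freqs <= low                      # pass below the cut-off
    elif mode == "highpass":
        keep = freqs >= high                     # pass above the cut-off
    elif mode == "bandpass":
        keep = (freqs >= low) & (freqs <= high)  # pass inside the band
    elif mode == "bandstop":
        keep = (freqs < low) | (freqs > high)    # pass outside the band
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.fft.irfft(spectrum * keep, n=len(signal))


sr = 44100
t = np.arange(sr) / sr
tones = {f: np.sin(2 * np.pi * f * t) for f in (200.0, 1000.0, 5000.0)}
mix = sum(tones.values())                        # 200 Hz + 1000 Hz + 5000 Hz

low_out = fft_filter(mix, sr, "lowpass", low=512.0)
high_out = fft_filter(mix, sr, "highpass", high=2048.0)
band_out = fft_filter(mix, sr, "bandpass", low=512.0, high=2048.0)
stop_out = fft_filter(mix, sr, "bandstop", low=512.0, high=2048.0)
print(np.allclose(low_out, tones[200.0], atol=1e-6),
      np.allclose(band_out, tones[1000.0], atol=1e-6))
```

With these cut-offs the low pass output retains only the 200 Hz tone, the band pass output only the 1000 Hz tone, and the band stop output everything except the 1000 Hz tone. Practical filters have a gradual roll-off rather than this abrupt mask.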
4 Proposed Method
In this section, we discuss the design of the Two-Level Classification method, which we are proposing in this paper, as shown in Figure 2.

Table 1: Grouping of the 50 ESC-50 classes into 7 groups.

| Animal | Birds | Natural Soundscapes | Human | Machine Sounds | Domestic sounds | Outdoor noises |
|---|---|---|---|---|---|---|
| Dog | Chirping Birds | Rain | Crying Baby | Mouse Click | Door knock | Helicopter |
| Sheep | Rooster | Sea Waves | Sneezing | Keyboard Typing | Toilet flush | Chainsaw |
| Pig | Crow | Crackling Fire | Clapping | Washing Machine | Clock alarm | Siren |
| Cow | Hen | Wind | Breathing | Vacuum cleaner | Door, wood creaks | Car Horn |
| Frog | | Pouring water | Coughing | | Can opening | Engine |
| Cat | | Water drops | Footsteps | | Clock tick | Train |
| Insects (flying) | | Thunderstorm | Laughing | | Glass breaking | Church bells |
| Crickets | | | Brushing teeth | | | Airplane |
| | | | Snoring | | | Fireworks |
| | | | Drinking, sipping | | | Hand saw |
- Step 1: The audio files of the ESC-50 dataset are taken as input.
- Step 2: In the pre-processing part, we divided the dataset into 7 groups as shown in table 1. These new divisions were made based on the origin or source of the sounds. For example, Dog, Sheep, Pig, etc. are evidently animals, so they are grouped in the "Animal" class. Rooster, Crow, Hen and Chirping Birds are placed in the "Birds" class. Rain, Sea Waves, Wind, Pouring Water and Thunderstorm are observed in nature, so they are grouped in the "Natural Soundscapes" class. Sneezing, Clapping, Breathing, Drinking, etc. are parts of human behavior, so they are placed in the "Human" group. Mouse Click, Keyboard Typing, Washing Machine and Vacuum Cleaner are sounds originating from particular machines, so they are grouped in the "Machine Sounds" class. Door knock, Toilet flush, Clock alarm, Can opening, etc. are found in domestic environments, so they are grouped in the "Domestic sounds" class. Finally, Helicopter, Chainsaw, Siren, etc. are found in outdoor spaces, so they are grouped in the "Outdoor noises" class.
- Step 3: Audio modifiers were then applied to the audio files. The audio modifiers include:
  (a) Spectral Gating
  (b) PCEN
  (c) Audio Crop
  (d) Low Pass Filter
  (e) High Pass Filter
  (f) Band Pass Filter
  (g) Band Stop Filter
- Step 4: The spectrograms of the modified audio files are extracted.
- Step 5: The extracted spectrograms are then passed to the CNN models mentioned in sub-section 3.1 for the purpose of Level 1 Classification.
- Step 6: The output from the Level 1 Classification is noted and used in Algorithm 3.
- Step 7: Algorithm 3 then selects the classifier in the Level 2 Classification stage that identifies the actual class to which the audio file belongs.
Note that we use 10 CNN models, as discussed in Section 3, and pass eight different types of spectrogram images to each: one without filtration and seven with the filtrations of Spectral Gating, PCEN, Audio Crop, Low Pass Filter, High Pass Filter, Band Pass Filter and Band Stop Filter. So, for each classification task, we compare the results of 80 different models.
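The routing performed between the two levels can be sketched as follows. This is a hypothetical illustration, not a transcription of Algorithm 3: the function `two_level_predict`, the toy classifiers and the fake "pitch" feature are assumptions that stand in for trained CNNs operating on spectrograms.

```python
def two_level_predict(spectrogram, level1_model, level2_models):
    """Route the input to the Level 2 classifier chosen by the Level 1 prediction."""
    group = level1_model(spectrogram)
    if group not in level2_models:
        raise KeyError(f"no Level 2 classifier registered for group {group!r}")
    return group, level2_models[group](spectrogram)


# Toy classifiers keyed on a fake feature so the example runs without any training.
level1 = lambda x: "Animal" if x["pitch"] < 1000 else "Birds"
level2 = {
    "Animal": lambda x: "Dog" if x["pitch"] < 500 else "Cat",
    "Birds": lambda x: "Crow" if x["pitch"] < 2000 else "Chirping Birds",
}

print(two_level_predict({"pitch": 300}, level1, level2))
print(two_level_predict({"pitch": 3000}, level1, level2))

# 10 CNN architectures x 8 input variants = 80 candidate models per classification task.
n_models = 10 * 8
```

One consequence of this design is that a Level 2 classifier only ever sees inputs from its own group, so a Level 1 mistake cannot be corrected at Level 2.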
5 Implementation of the Method
5.1 Dataset : ESC-50
We have used the ESC-50 [22] dataset for our work, which is a collection of 2000 recordings with an average duration of 5 seconds and a sampling frequency of 44100 Hz. These recordings were collected from Freesound.org (https://freesound.org/). The dataset consists of 40 recordings for each of the 50 categories.
Algorithm 2 (Audio Crop) inputs: (1) the path of the audio files in the data provided; (2) Max_time, the output of Algorithm 1.
These 50 categories are generally grouped into 5 groups, as done in the previous works mentioned in Section 2 that used the ESC-50 dataset. The groups are:
- Animals
- Natural soundscapes & water sounds
- Human, non-speech sounds
- Interior/domestic sounds
- Exterior/urban noises
5.2 Libraries
The work has been done using the Python Programming Language. The Python libraries used for this project are discussed below.
5.2.1 NumPy
The NumPy library (https://numpy.org/) helps in performing complex mathematical operations with arrays and random number generation. In this project, we used NumPy to randomly split the dataset into the Training and Testing Sets used to train and test the CNN models, respectively.
5.2.2 Pandas
The Pandas library (https://pandas.pydata.org/) helped in manipulating the dataframe and in performing some basic operations on it.
5.2.3 Matplotlib and Seaborn
Matplotlib (https://matplotlib.org/) and Seaborn (https://seaborn.pydata.org/) were used for plotting.
5.2.4 Librosa
Librosa (https://librosa.org) was used to work with the audio files. Librosa helped in loading the audio files from their respective locations. Librosa also has functions to extract mel spectrograms and functions for audio modification, which were used on the audio samples.
5.2.5 SciPy
SciPy (https://scipy.org/) is a library that provides the functions used to implement the audio filters on the audio samples provided in the data.
5.2.6 Noise Reduce
The noisereduce library (https://pypi.org/project/noisereduce/) is used for Noise Removal from the audio files provided with the dataset.
5.2.7 Tensorflow and Keras
Tensorflow (https://www.tensorflow.org/) and Keras (https://keras.io/) were used to implement the CNN models. The pre-trained models from Keras were used in this work.
5.2.8 Model Hyperparameters
After the pre-trained layers of each pre-trained model, we added a global average pooling layer. After the global average pooling layer, two dense layers were added, each with 512 units and the ReLU activation function. The kernel initializer of the first dense layer was set to Glorot uniform. Stochastic gradient descent was used as the optimizer during compilation of the models.
Fig 4 shows that, for the thunderstorm, vacuum cleaner, glass breaking and train sounds, the intensity of sound is maximum within the range of 0 to 512 Hz, decreases slightly up to 2048 Hz and starts to fade after that. The spectrograms of chirping birds and clapping show that the intensity is faded up to 128 Hz. In the cases of clapping and dog, the maximum intensity is visible mostly within 512 to 4096 Hz. Besides this, the black portions indicate silence in the audio files of dog and glass breaking. Based on these observations, we have used 512 Hz as the lower threshold and 2048 Hz as the higher threshold frequency for the Audio Filters.
Table 2: Cut-off frequencies used for the Audio Filters.

Audio Filter | Cut-off | Frequency
---|---|---
Low Pass Filter | cut-off | 512 Hz
High Pass Filter | cut-off | 2048 Hz
Band Pass Filter | lower cut-off | 512 Hz
Band Pass Filter | higher cut-off | 2048 Hz
Band Stop Filter | lower cut-off | 512 Hz
Band Stop Filter | higher cut-off | 2048 Hz
5.3 Implementation Details
For each CNN model, we first divided the data in the ratio of 8:2 to create the Training Set and the Testing Set. We then divided the Training Set again in the ratio of 8:2 to obtain the final training and validation sets, as shown in table 3. We used a sampling rate of 44.1 kHz for the audio files.
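The arithmetic of the two successive 8:2 splits can be checked with a short sketch. The function name `split_counts` is hypothetical, and rounding the validation share down is an assumption, chosen because it reproduces the sample counts listed in table 3 for the sizes shown here.

```python
def split_counts(n_total):
    """Return (train, validation, test) sizes for two successive 8:2 splits."""
    n_test = n_total * 2 // 10               # 20% held out for testing
    n_remaining = n_total - n_test
    n_val = n_remaining * 2 // 10            # 20% of the remainder, rounded down
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


for name, n in [("Level 1", 2000), ("Birds", 160), ("Nature", 280), ("Human", 400)]:
    print(name, split_counts(n))
```

For the full dataset this gives 1280 training, 320 validation and 400 testing samples, while a 4-class group of 160 recordings is left with only 25 validation and 32 testing samples.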
6 Results
From table 3, it is clear that the number of testing samples is small in every case except Level 1 Classification. We therefore compare the classifiers and models on Classification Accuracy only for the Level 1 classifiers (table 4). In the other cases we judge them on the Highest Validation Accuracy, because a single misclassification decreases the Classification Accuracy significantly, specifically for the Birds and Machine Sounds classifications. Both the Classification Accuracy and the Validation Accuracy are nevertheless reported in the result tables for Animals (table 5), Birds (table 6), Natural Soundscapes (table 7), Human (table 8), Machine Sounds (table 9), Domestic (table 10) and Outdoor noises (table 11).
Table 3: Number of classes and samples for each mode of classification.

Mode of Classification | Number of Classes | Total Samples | Training Samples | Validation Samples | Testing Samples
---|---|---|---|---|---
Level 1 Classification | 7 | 2000 | 1280 | 320 | 400
Animals | 8 | 320 | 205 | 51 | 64
Birds | 4 | 160 | 103 | 25 | 32
Nature | 7 | 280 | 180 | 44 | 56
Human | 10 | 400 | 256 | 64 | 80
Machine Sounds | 4 | 160 | 103 | 25 | 32
Domestic | 8 | 320 | 205 | 51 | 64
Outdoor | 10 | 400 | 256 | 64 | 80
These raw spectrograms shown in Fig 4 are given as input to the CNN models to examine their performances on the audio files, which are shown in the column "No Filter" in the tables showing the results of the classifications performed. The spectrograms obtained with the different audio modifiers like Spectral Gating, PCEN and Audio Crop are shown in Fig 3, 5 and 6, respectively. The CNN models when combined with these audio filters, use these generated spectrograms as input and the results are shown in the columns "Spectral Gating", "PCEN" and "Audio Crop" in the tables showing the results of the classifications performed.
Based on the observations obtained from the raw spectrograms, we have fixed the lower threshold and higher threshold frequencies to 512 and 2048 Hz, respectively. The obtained spectrograms after using Low Pass Filter (threshold = 512 Hz), High Pass Filter (threshold = 2048 Hz), Band Pass Filter (lower threshold = 512 Hz and higher threshold = 2048 Hz) and Band Stop Filter (lower threshold = 512 Hz and higher threshold = 2048 Hz) are passed as input to the CNN models and the obtained results are shown in the columns Low Pass Filter, High Pass Filter, Band Pass Filter and Band Stop Filter in the tables showing the results of the classifications performed.

6.1 Level 1 Classification
From the results of the Level 1 Classification shown in table 4, the Classification Accuracy was 75.94% with EfficientNetB1 without any filtration. It decreased to 74.75% with EfficientNetB0 and Noise Removal, and decreased further to 50.50% with the application of PCEN. It then increased to 78.75% with the combination of EfficientNetB2 and Audio Crop, which is the maximum Classification Accuracy obtained in the Level 1 Classification. It decreased again to 48.50% with the Low Pass Filter and increased to 60.25% with the High Pass Filter. It was 55.25% with the Band Pass Filter and, finally, 36.50% with the Band Stop Filter.









Table 4: Results of the Level 1 Classification.

CNN Model | Accuracy | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter
---|---|---|---|---|---|---|---|---|---
VGG16 | Highest Validation Accuracy | 68.75% | 62.81% | 45.63% | 70.63% | 45.00% | 49.38% | 49.38% | 33.75%
VGG16 | Classification Accuracy | 71.00% | 66.00% | 45.00% | 71.50% | 43.75% | 50.50% | 49.50% | 31.25%
VGG19 | Highest Validation Accuracy | 67.50% | 68.12% | 43.13% | 72.50% | 48.75% | 47.50% | 52.81% | 35.94%
VGG19 | Classification Accuracy | 70.50% | 70.50% | 44.75% | 70.50% | 42.00% | 49.50% | 50.75% | 34.00%
ResNet50 | Highest Validation Accuracy | 71.88% | 64.38% | 43.13% | 80.00% | 49.69% | 49.06% | 47.19% | 37.50%
ResNet50 | Classification Accuracy | 71.25% | 64.50% | 47.00% | 79.00% | 48.50% | 56.25% | 46.50% | 33.75%
ResNet101 | Highest Validation Accuracy | 70.31% | 67.19% | 45.31% | 73.75% | 48.44% | 43.75% | 48.75% | 36.56%
ResNet101 | Classification Accuracy | 73.50% | 64.50% | 47.75% | 70.00% | 46.00% | 45.25% | 52.00% | 34.75%
ResNet152 | Highest Validation Accuracy | 67.50% | 67.19% | 51.88% | 75.00% | 49.38% | 48.75% | 49.69% | 37.81%
ResNet152 | Classification Accuracy | 69.50% | 68.75% | 50.50% | 71.25% | 45.25% | 51.25% | 52.50% | 36.50%
EfficientNetB0 | Highest Validation Accuracy | 71.25% | 75.31% | 44.06% | 78.44% | 49.69% | 53.75% | 51.56% | 38.75%
EfficientNetB0 | Classification Accuracy | 75.50% | 74.75% | 47.25% | 76.00% | 47.25% | 56.75% | 53.00% | 31.00%
EfficientNetB1 | Highest Validation Accuracy | 75.94% | 70.63% | 41.25% | 75.94% | 48.75% | 55.62% | 56.25% | 38.44%
EfficientNetB1 | Classification Accuracy | 76.75% | 73.25% | 43.25% | 75.50% | 47.50% | 56.25% | 55.25% | 32.50%
EfficientNetB2 | Highest Validation Accuracy | 72.19% | 70.31% | 46.56% | 79.06% | 46.88% | 56.25% | 54.69% | 36.88%
EfficientNetB2 | Classification Accuracy | 76.25% | 72.25% | 45.25% | 78.75% | 45.50% | 60.25% | 53.00% | 32.25%
EfficientNetB3 | Highest Validation Accuracy | 72.50% | 69.69% | 43.75% | 74.69% | 53.75% | 55.31% | 56.88% | 36.25%
EfficientNetB3 | Classification Accuracy | 76.00% | 74.25% | 47.00% | 77.75% | 46.50% | 58.00% | 53.75% | 36.00%
EfficientNetB4 | Highest Validation Accuracy | 73.44% | 71.25% | 40.00% | 75.00% | 47.19% | 53.44% | 53.75% | 39.06%
EfficientNetB4 | Classification Accuracy | 77.25% | 70.75% | 41.00% | 74.00% | 47.00% | 55.50% | 52.50% | 32.50%
6.2 Level 2 Classification
6.2.1 Animal
In the Level 2 Classification of the Animal class, shown in table 5, the highest validation accuracy was 86.27%, from ResNet50 and ResNet152 with the raw spectrograms of the unfiltered audio files. It decreased to 82.35% with the combination of Noise Removal and EfficientNetB3, and further to 64.71% with PCEN, but increased again to 88.24% with Audio Crop and EfficientNetB2.
With the audio filters the accuracy did not improve further: the best results were 66.67% with the Low Pass Filter, 74.51% with the High Pass and Band Pass Filters and 49.02% with the Band Stop Filter. Hence, the Level 2 Classifier of Animal also obtained its highest accuracy from Audio Crop and EfficientNetB2, like the Level 1 Classifier.
6.2.2 Bird
The results of the Level 2 Classification of the Bird class in table 6 show that the highest validation accuracy, 96.00%, was obtained with the following combinations of audio modifier and CNN model:
1. No Filter & VGG16
2. No Filter & EfficientNetB2
3. Noise Removal & EfficientNetB1
4. Audio Crop & ResNet152
5. Audio Crop & EfficientNetB1
6. High Pass Filter & ResNet50
The highest validation accuracy reached 80.00% with the Low Pass Filter and with the Band Pass Filter. The Band Stop Filter gave the lowest best score, 64.00%.
Table 5: Results of the Level 2 Classification for the Animal class.

CNN Model | Accuracy | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter
---|---|---|---|---|---|---|---|---|---
VGG16 | Highest Validation Accuracy | 66.67% | 70.59% | 58.82% | 76.47% | 64.71% | 50.98% | 49.02% | 47.06%
VGG16 | Classification Accuracy | 79.69% | 71.88% | 50.00% | 70.31% | 48.44% | 48.44% | 54.69% | 42.19%
VGG19 | Highest Validation Accuracy | 76.47% | 68.63% | 43.14% | 76.47% | 58.82% | 64.71% | 70.59% | 39.22%
VGG19 | Classification Accuracy | 81.25% | 71.88% | 39.06% | 73.44% | 46.88% | 67.19% | 48.44% | 35.94%
ResNet50 | Highest Validation Accuracy | 86.27% | 76.47% | 49.02% | 84.31% | 60.78% | 60.78% | 72.55% | 35.29%
ResNet50 | Classification Accuracy | 82.81% | 84.38% | 46.88% | 79.69% | 34.38% | 48.44% | 68.75% | 29.69%
ResNet101 | Highest Validation Accuracy | 76.47% | 74.51% | 50.98% | 82.35% | 66.67% | 64.71% | 58.82% | 35.29%
ResNet101 | Classification Accuracy | 75.00% | 73.44% | 34.38% | 81.25% | 56.25% | 57.81% | 62.50% | 23.44%
ResNet152 | Highest Validation Accuracy | 86.27% | 76.47% | 56.86% | 86.27% | 62.75% | 74.51% | 66.67% | 43.14%
ResNet152 | Classification Accuracy | 84.38% | 75.00% | 53.12% | 78.12% | 50.00% | 67.19% | 71.88% | 34.38%
EfficientNetB0 | Highest Validation Accuracy | 82.35% | 72.55% | 64.71% | 76.47% | 62.75% | 64.71% | 62.75% | 39.22%
EfficientNetB0 | Classification Accuracy | 84.38% | 75.00% | 54.69% | 78.12% | 57.81% | 70.31% | 54.69% | 35.94%
EfficientNetB1 | Highest Validation Accuracy | 84.31% | 70.59% | 56.86% | 82.35% | 60.78% | 70.59% | 74.51% | 45.10%
EfficientNetB1 | Classification Accuracy | 78.12% | 79.69% | 50.00% | 82.81% | 62.50% | 70.31% | 67.19% | 39.06%
EfficientNetB2 | Highest Validation Accuracy | 80.39% | 66.67% | 58.82% | 88.24% | 66.67% | 74.51% | 66.67% | 33.33%
EfficientNetB2 | Classification Accuracy | 79.69% | 76.56% | 65.62% | 82.81% | 59.38% | 64.06% | 62.50% | 37.50%
EfficientNetB3 | Highest Validation Accuracy | 84.31% | 82.35% | 60.78% | 70.59% | 66.67% | 58.82% | 70.59% | 49.02%
EfficientNetB3 | Classification Accuracy | 89.06% | 81.25% | 54.69% | 81.25% | 60.94% | 65.62% | 64.06% | 34.38%
EfficientNetB4 | Highest Validation Accuracy | 82.35% | 68.63% | 60.78% | 76.47% | 62.75% | 60.78% | 66.67% | 31.37%
EfficientNetB4 | Classification Accuracy | 79.69% | 73.44% | 53.12% | 81.25% | 51.56% | 68.75% | 62.50% | 29.69%
Table 6: Results of the Level 2 Classification for the Bird class.

CNN Model | Accuracy | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter
---|---|---|---|---|---|---|---|---|---
VGG16 | Highest Validation Accuracy | 96.00% | 88.00% | 40.00% | 84.00% | 72.00% | 88.00% | 76.00% | 64.00%
VGG16 | Classification Accuracy | 87.50% | 84.38% | 59.38% | 87.50% | 59.38% | 75.00% | 68.75% | 59.38%
VGG19 | Highest Validation Accuracy | 88.00% | 92.00% | 68.00% | 92.00% | 68.00% | 84.00% | 76.00% | 60.00%
VGG19 | Classification Accuracy | 75.00% | 78.12% | 68.75% | 81.25% | 56.25% | 75.00% | 81.25% | 68.75%
ResNet50 | Highest Validation Accuracy | 92.00% | 92.00% | 48.00% | 80.00% | 80.00% | 96.00% | 76.00% | 64.00%
ResNet50 | Classification Accuracy | 84.38% | 93.75% | 78.12% | 78.12% | 68.75% | 81.25% | 68.75% | 56.25%
ResNet101 | Highest Validation Accuracy | 88.00% | 92.00% | 56.00% | 88.00% | 72.00% | 84.00% | 72.00% | 60.00%
ResNet101 | Classification Accuracy | 87.50% | 81.25% | 56.25% | 84.38% | 65.62% | 71.88% | 71.88% | 50.00%
ResNet152 | Highest Validation Accuracy | 92.00% | 88.00% | 44.00% | 96.00% | 72.00% | 88.00% | 72.00% | 48.00%
ResNet152 | Classification Accuracy | 93.75% | 65.62% | 62.50% | 90.62% | 50.00% | 71.88% | 75.00% | 50.00%
EfficientNetB0 | Highest Validation Accuracy | 92.00% | 84.00% | 64.00% | 92.00% | 72.00% | 88.00% | 80.00% | 60.00%
EfficientNetB0 | Classification Accuracy | 92.00% | 78.12% | 71.88% | 81.25% | 56.25% | 87.50% | 68.75% | 56.25%
EfficientNetB1 | Highest Validation Accuracy | 92.00% | 96.00% | 68.00% | 96.00% | 68.00% | 92.00% | 72.00% | 60.00%
EfficientNetB1 | Classification Accuracy | 78.12% | 81.25% | 68.75% | 87.50% | 65.62% | 62.50% | 78.12% | 65.62%
EfficientNetB2 | Highest Validation Accuracy | 96.00% | 80.00% | 52.00% | 92.00% | 68.00% | 80.00% | 68.00% | 64.00%
EfficientNetB2 | Classification Accuracy | 90.62% | 81.25% | 68.75% | 84.38% | 56.25% | 78.12% | 71.88% | 62.50%
EfficientNetB3 | Highest Validation Accuracy | 88.00% | 84.00% | 72.00% | 88.00% | 68.00% | 92.00% | 76.00% | 60.00%
EfficientNetB3 | Classification Accuracy | 81.25% | 78.12% | 65.62% | 81.25% | 65.62% | 78.12% | 68.75% | 65.62%
EfficientNetB4 | Highest Validation Accuracy | 92.00% | 84.00% | 68.00% | 88.00% | 76.00% | 84.00% | 76.00% | 64.00%
EfficientNetB4 | Classification Accuracy | 87.50% | 78.12% | 71.88% | 81.25% | 62.50% | 71.88% | 81.25% | 62.50%
6.2.3 Natural Soundscapes
Finally, the results in table 7 for the Level 2 Classifier of the Nature class show that the highest validation accuracy, 95.45%, was obtained with three combinations: No Filter & ResNet152, Noise Removal & ResNet152, and Noise Removal & EfficientNetB1. The validation accuracy also reached 90.91% with Band Pass Filter and 86.36% with both Audio Crop and PCEN. In this case, Low Pass Filter performed worst: its best validation accuracy across all CNN models was only 54.55%.
Filtration Mode | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter | |
---|---|---|---|---|---|---|---|---|---|
CNN Models | Accuracies | ||||||||
VGG16 | Highest Validation Accuracy | 84.09% | 86.36% | 75.00% | 86.36% | 47.73% | 75.00% | 77.27% | 45.45% |
Classification Accuracy | 75.00% | 85.71% | 60.71% | 83.93% | 69.64% | 82.14% | 69.64% | 41.07% | |
VGG19 | Highest Validation Accuracy | 86.36% | 93.18% | 63.64% | 84.09% | 52.27% | 77.27% | 75.00% | 45.45% |
Classification Accuracy | 82.14% | 85.71% | 60.71% | 87.50% | 53.57% | 69.64% | 69.64% | 39.29% | |
ResNet50 | Highest Validation Accuracy | 90.91% | 93.18% | 70.45% | 77.27% | 52.27% | 79.55% | 84.09% | 47.73% |
Classification Accuracy | 83.93% | 92.86% | 50.00% | 87.50% | 73.21% | 71.43% | 66.07% | 39.29% | |
ResNet101 | Highest Validation Accuracy | 88.64% | 88.64% | 77.27% | 86.36% | 50.00% | 84.09% | 72.73% | 47.73% |
Classification Accuracy | 83.93% | 87.50% | 60.71% | 85.71% | 64.29% | 75.00% | 66.07% | 26.79% | |
ResNet152 | Highest Validation Accuracy | 95.45% | 95.45% | 81.82% | 86.36% | 52.27% | 77.27% | 90.91% | 47.73% |
Classification Accuracy | 85.71% | 92.86% | 60.71% | 87.50% | 71.43% | 73.21% | 67.86% | 26.79% | |
EfficientNetB0 | Highest Validation Accuracy | 88.64% | 90.91% | 81.82% | 84.09% | 54.55% | 77.27% | 81.82% | 56.82% |
Classification Accuracy | 83.93% | 94.64% | 67.86% | 89.29% | 66.07% | 76.79% | 67.86% | 46.43% | |
EfficientNetB1 | Highest Validation Accuracy | 90.91% | 95.45% | 81.82% | 84.09% | 45.45% | 84.09% | 77.27% | 47.73% |
Classification Accuracy | 85.71% | 94.64% | 66.07% | 85.71% | 69.64% | 80.36% | 71.43% | 48.21% | |
EfficientNetB2 | Highest Validation Accuracy | 88.64% | 90.91% | 84.09% | 81.82% | 43.18% | 65.91% | 79.55% | 52.27% |
Classification Accuracy | 78.57% | 92.86% | 73.21% | 85.71% | 66.07% | 67.86% | 73.21% | 51.79% | |
EfficientNetB3 | Highest Validation Accuracy | 84.09% | 90.91% | 86.36% | 84.09% | 47.73% | 86.36% | 79.55% | 54.55% |
Classification Accuracy | 75.00% | 92.86% | 66.07% | 85.71% | 71.43% | 82.14% | 75.00% | 46.43% | |
EfficientNetB4 | Highest Validation Accuracy | 81.82% | 90.91% | 81.82% | 79.55% | 54.55% | 84.09% | 88.64% | 50.00% |
Classification Accuracy | 80.36% | 87.50% | 60.71% | 83.93% | 69.64% | 78.57% | 67.86% | 55.36% |
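The "minimum highest validation accuracy" reported for the Nature class can be read as a two-step reduction: take the best validation accuracy of each filter across the CNN models, then take the smallest of those per-filter maxima. The following is a minimal Python sketch of that reading. It uses only three rows of the Nature table (VGG16, ResNet152, EfficientNetB0) for brevity, so some per-filter maxima differ from the full table; the names `FILTERS`, `NATURE_VAL_ACC`, `best_per_filter` and `weakest_filter` are illustrative assumptions, not the authors' code.

```python
# Column order of the results tables.
FILTERS = [
    "No Filter", "Noise Removal", "PCEN", "Audio Crop",
    "Low Pass Filter", "High Pass Filter", "Band Pass Filter", "Band Stop Filter",
]

# Highest validation accuracy (%) per model, in FILTERS order.
# Subset of the Nature table, copied for illustration only.
NATURE_VAL_ACC = {
    "VGG16":          [84.09, 86.36, 75.00, 86.36, 47.73, 75.00, 77.27, 45.45],
    "ResNet152":      [95.45, 95.45, 81.82, 86.36, 52.27, 77.27, 90.91, 47.73],
    "EfficientNetB0": [88.64, 90.91, 81.82, 84.09, 54.55, 77.27, 81.82, 56.82],
}


def best_per_filter(table):
    """Return {filter name: best accuracy over all models in the table}."""
    best = {}
    for accuracies in table.values():
        for name, acc in zip(FILTERS, accuracies):
            best[name] = max(best.get(name, 0.0), acc)
    return best


def weakest_filter(table):
    """Return (filter name, accuracy) for the filter whose best score is lowest."""
    best = best_per_filter(table)
    name = min(best, key=best.get)
    return name, best[name]


if __name__ == "__main__":
    best = best_per_filter(NATURE_VAL_ACC)
    print("overall best:", max(best.values()))
    print("weakest filter:", weakest_filter(NATURE_VAL_ACC))
```

On these three rows the overall best is 95.45% and the weakest filter is Low Pass Filter at 54.55%, which agrees with the figures quoted in the text for the Nature class.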
Filtration Mode | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter | |
---|---|---|---|---|---|---|---|---|---|
CNN Models | Accuracies | ||||||||
VGG16 | Highest Validation Accuracy | 75.00% | 71.88% | 53.12% | 76.56% | 56.25% | 56.25% | 67.19% | 43.75% |
Classification Accuracy | 81.25% | 80.00% | 51.25% | 83.75% | 47.5% | 56.25% | 63.75% | 38.75% | |
VGG19 | Highest Validation Accuracy | 68.75% | 71.88% | 60.94% | 78.12% | 62.5% | 62.5% | 62.5% | 43.75% |
Classification Accuracy | 83.75% | 83.75% | 58.75% | 76.25% | 53.75% | 52.5% | 58.75% | 38.75% | |
ResNet50 | Highest Validation Accuracy | 81.25% | 84.38% | 51.56% | 89.06% | 62.5% | 67.19% | 78.12% | 42.19% |
Classification Accuracy | 83.75% | 81.25% | 53.75% | 81.25% | 52.5% | 63.75% | 67.1% | 42.5% | |
ResNet101 | Highest Validation Accuracy | 79.69% | 82.81% | 51.56% | 84.38% | 56.25% | 56.25% | 67.19% | 40.62% |
Classification Accuracy | 83.75% | 86.25% | 43.75% | 82.5% | 50% | 55% | 65% | 43.75% | |
ResNet152 | Highest Validation Accuracy | 82.81% | 78.12% | 57.81% | 84.38% | 54.69% | 64.06% | 65.62% | 46.88% |
Classification Accuracy | 86.25% | 80.00% | 50.00% | 82.5% | 57.5% | 61.25% | 63.75% | 46.25% | |
EfficientNetB0 | Highest Validation Accuracy | 81.25% | 87.50% | 68.75% | 93.75% | 56.25% | 65.62% | 65.62% | 51.56% |
Classification Accuracy | 88.75% | 92.50% | 65.00% | 85% | 56.25% | 56.25% | 65% | 43.75% | |
EfficientNetB1 | Highest Validation Accuracy | 82.81% | 82.81% | 67.19% | 89.06% | 56.25% | 64.06% | 71.88% | 42.19% |
Classification Accuracy | 87.50% | 85.00% | 63.75% | 77.5% | 55% | 65% | 66.25% | 46.25% | |
EfficientNetB2 | Highest Validation Accuracy | 81.25% | 85.94% | 67.19% | 90.62% | 60.94% | 62.5% | 68.75% | 51.56% |
Classification Accuracy | 88.75% | 91.25% | 60.00% | 86.25% | 51.25% | 67.5% | 58.75% | 50% | |
EfficientNetB3 | Highest Validation Accuracy | 84.38% | 81.25% | 70.31% | 90.62% | 51.56% | 67.19% | 73.44% | 50% |
Classification Accuracy | 91.25% | 86.25% | 57.50% | 86.25% | 53.75% | 63.75% | 67.5% | 48.75% | |
EfficientNetB4 | Highest Validation Accuracy | 84.38% | 82.81% | 64.06% | 84.38% | 57.81% | 62.5% | 68.75% | 45.3% |
Classification Accuracy | 92.50% | 76.25% | 58.75% | 83.75% | 53.75% | 71.25% | 60% | 43.75% |
6.2.4 Human
The highest validation accuracy obtained by the Level 2 Classifier of the Human class is 93.75%, as shown in table 8, with the combination of Audio Crop and the CNN model EfficientNetB0. It also reached a validation accuracy of 87.50% with Noise Removal. As in earlier Level 2 classifications, Band Stop Filter performed worst, with a best validation accuracy of only 51.56%.
6.2.5 Machine Sounds
The results in table 9 for the Level 2 Classification of Machine Sounds show that the classifier reached its highest validation accuracy, 92.00%, with the following combinations of audio modifiers and CNN models:
1. No Filter & VGG16
2. No Filter & ResNet152
3. No Filter & EfficientNetB3
4. Audio Crop & VGG19
5. Audio Crop & ResNet101
6. Audio Crop & ResNet152
7. Audio Crop & EfficientNetB3
Band Stop Filter performed worst, with a best validation accuracy of only 64.00%.
Filtration Mode | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter | |
---|---|---|---|---|---|---|---|---|---|
CNN Models | Accuracies | ||||||||
VGG16 | Highest Validation Accuracy | 92.00% | 84.00% | 60.00% | 96.00% | 64.00% | 80.00% | 68.00% | 60.00% |
Classification Accuracy | 78.12% | 56.25% | 75.00% | 71.88% | 59.38% | 65.62% | 62.50% | 50.00% | |
VGG19 | Highest Validation Accuracy | 80.00% | 80.00% | 60.00% | 92.00% | 72.00% | 64.00% | 76.00% | 52.00% |
Classification Accuracy | 75.00% | 68.75% | 50.00% | 71.88% | 78.12% | 65.62% | 56.25% | 46.88% | |
ResNet50 | Highest Validation Accuracy | 84.00% | 84.00% | 80.00% | 88.00% | 60.00% | 84.00% | 76.00% | 60.00% |
Classification Accuracy | 87.50% | 68.75% | 62.50% | 68.75% | 68.75% | 68.75% | 65.62% | 40.62% | |
ResNet101 | Highest Validation Accuracy | 80.00% | 80.00% | 88.00% | 92.00% | 64.00% | 76.00% | 80.00% | 52.00% |
Classification Accuracy | 78.12% | 68.75% | 62.50% | 81.25% | 46.88% | 65.62% | 65.62% | 46.88% | |
ResNet152 | Highest Validation Accuracy | 92.00% | 68.00% | 60.00% | 92.00% | 68.00% | 72.00% | 72.00% | 52.00% |
Classification Accuracy | 81.25% | 78.12% | 56.25% | 62.50% | 62.50% | 62.50% | 71.88% | 46.88% | |
EfficientNetB0 | Highest Validation Accuracy | 76.00% | 72.00% | 80.00% | 76.00% | 36.00% | 84.00% | 80.00% | 64.00% |
Classification Accuracy | 75.00% | 84.38% | 62.50% | 78.12% | 62.50% | 65.62% | 65.62% | 53.12% | |
EfficientNetB1 | Highest Validation Accuracy | 76.00% | 68.00% | 60.00% | 88.00% | 40.00% | 80.00% | 76.00% | 56.00% |
Classification Accuracy | 78.12% | 68.75% | 50.00% | 84.38% | 56.25% | 71.88% | 65.62% | 46.88% | |
EfficientNetB2 | Highest Validation Accuracy | 80.00% | 76.00% | 84.00% | 80.00% | 44.00% | 84.00% | 76.00% | 56.00% |
Classification Accuracy | 78.12% | 81.25% | 75.00% | 81.25% | 59.38% | 68.75% | 62.50% | 43.75% | |
EfficientNetB3 | Highest Validation Accuracy | 92.00% | 64.00% | 80.00% | 92.00% | 48.00% | 84.00% | 76.00% | 52.00% |
Classification Accuracy | 81.25% | 75.00% | 75.00% | 81.25% | 62.50% | 78.12% | 59.38% | 50.00% | |
EfficientNetB4 | Highest Validation Accuracy | 64.00% | 80.00% | 88.00% | 88.00% | 48.00% | 84.00% | 68.00% | 56.00% |
Classification Accuracy | 62.50% | 78.12% | 75.00% | 71.88% | 56.25% | 71.88% | 71.88% | 65.62% |
Filtration Mode | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter | |
---|---|---|---|---|---|---|---|---|---|
CNN Models | Accuracies | ||||||||
VGG16 | Highest Validation Accuracy | 96.08% | 94.12% | 50.98% | 86.27% | 60.78% | 72.55% | 70.59% | 45.10% |
Classification Accuracy | 82.81% | 84.38% | 46.88% | 89.06% | 57.81% | 70.31% | 59.38% | 43.75% | |
VGG19 | Highest Validation Accuracy | 82.35% | 88.24% | 56.86% | 84.31% | 64.71% | 76.47% | 74.51% | 50.98% |
Classification Accuracy | 82.81% | 90.62% | 60.94% | 87.50% | 62.50% | 65.62% | 65.62% | 37.50% | |
ResNet50 | Highest Validation Accuracy | 86.27% | 92.16% | 62.75% | 92.00% | 52.94% | 72.55% | 62.75% | 43.14% |
Classification Accuracy | 78.12% | 92.19% | 60.94% | 94.12% | 65.62% | 70.31% | 73.44% | 43.75% | |
ResNet101 | Highest Validation Accuracy | 94.12% | 96.08% | 68.63% | 92.16% | 58.82% | 74.51% | 74.51% | 47.06% |
Classification Accuracy | 87.50% | 90.62% | 71.88% | 87.50% | 68.75% | 68.75% | 71.88% | 39.06% | |
ResNet152 | Highest Validation Accuracy | 96.08% | 92.16% | 60.78% | 96.08% | 56.86% | 78.43% | 68.63% | 43.14% |
Classification Accuracy | 87.50% | 89.06% | 54.69% | 87.50% | 65.62% | 65.62% | 64.06% | 34.38% | |
EfficientNetB0 | Highest Validation Accuracy | 96.08% | 94.12% | 72.55% | 94.12% | 66.67% | 68.63% | 82.35% | 56.86% |
Classification Accuracy | 84.38% | 89.06% | 67.19% | 90.62% | 70.31% | 59.38% | 68.75% | 59.38% | |
EfficientNetB1 | Highest Validation Accuracy | 92.16% | 96.08% | 62.75% | 92.16% | 62.75% | 78.43% | 86.27% | 56.86% |
Classification Accuracy | 89.06% | 87.50% | 48.44% | 92.19% | 70.31% | 62.50% | 70.31% | 50.00% | |
EfficientNetB2 | Highest Validation Accuracy | 94.12% | 96.08% | 74.51% | 94.12% | 62.75% | 76.47% | 84.31% | 52.94% |
Classification Accuracy | 85.94% | 90.62% | 60.94% | 93.75% | 64.06% | 68.75% | 65.62% | 56.25% | |
EfficientNetB3 | Highest Validation Accuracy | 96.08% | 96.08% | 70.59% | 86.27% | 56.86% | 82.35% | 86.27% | 54.90% |
Classification Accuracy | 89.06% | 90.62% | 60.94% | 85.94% | 64.06% | 73.44% | 73.44% | 56.25% | |
EfficientNetB4 | Highest Validation Accuracy | 96.08% | 98.04% | 52.94% | 88.24% | 50.98% | 82.35% | 82.35% | 43.14% |
Classification Accuracy | 79.69% | 93.75% | 62.50% | 89.06% | 68.75% | 70.31% | 76.56% | 45.31% |
6.2.6 Domestic
The results in table 10 for the Level 2 Classification of the Domestic class show that the highest validation accuracy, 98.04%, was obtained with Noise Removal and EfficientNetB4. The classifier also reached 96.08% both without any modifier and with Audio Crop. As in the earlier Level 2 results, Band Stop Filter gave the lowest per-filter best score, 56.86%.
6.2.7 Outdoor
The Level 2 Classification of the Outdoor class reached its highest validation accuracy, 93.75%, with the combination of Audio Crop and EfficientNetB3, as can be seen from table 11. It also reached high validation accuracies of 89.06% with Noise Removal and VGG16, and 85.94% on spectrograms of the raw audio files (No Filter) with ResNet152. Band Stop Filter again performed worst, with a best validation accuracy of only 50.00%.
Compared with the scores of previous works shown in table 12, the proposed method achieves competitive accuracy scores.
Filtration Mode | No Filter | Noise Removal | PCEN | Audio Crop | Low Pass Filter | High Pass Filter | Band Pass Filter | Band Stop Filter | |
---|---|---|---|---|---|---|---|---|---|
CNN Models | Accuracies | ||||||||
VGG16 | Highest Validation Accuracy | 78.12% | 89.06% | 59.38% | 89.06% | 53.12% | 68.75% | 57.81% | 35.94% |
Classification Accuracy | 78.75% | 82.50% | 48.75% | 76.25% | 53.75% | 47.50% | 46.25% | 36.25% | |
VGG19 | Highest Validation Accuracy | 79.69% | 76.56% | 60.94% | 84.38% | 60.94% | 60.94% | 60.94% | 40.62% |
Classification Accuracy | 83.75% | 76.25% | 50.00% | 80.00% | 57.50% | 42.50% | 48.75% | 35.00% | |
ResNet50 | Highest Validation Accuracy | 82.81% | 79.69% | 64.06% | 87.50% | 57.81% | 75.00% | 62.50% | 46.88% |
Classification Accuracy | 83.75% | 81.25% | 57.50% | 83.75% | 63.75% | 58.75% | 60.00% | 37.50% | |
ResNet101 | Highest Validation Accuracy | 81.25% | 75.00% | 59.38% | 79.69% | 57.81% | 76.56% | 65.62% | 43.75% |
Classification Accuracy | 83.75% | 80.00% | 50.00% | 77.50% | 71.25% | 60.00% | 58.75% | 28.75% | |
ResNet152 | Highest Validation Accuracy | 85.94% | 78.12% | 56.25% | 87.50% | 48.44% | 71.88% | 57.81% | 43.75% |
Classification Accuracy | 88.75% | 80.00% | 55.00% | 82.50% | 60.00% | 55.00% | 55.00% | 33.75% | |
EfficientNetB0 | Highest Validation Accuracy | 84.38% | 81.25% | 70.31% | 89.06% | 56.25% | 73.44% | 68.75% | 48.44% |
Classification Accuracy | 85.00% | 85.00% | 60.00% | 80.00% | 65.00% | 53.75% | 55.00% | 36.25% | |
EfficientNetB1 | Highest Validation Accuracy | 81.25% | 75.00% | 60.94% | 89.06% | 59.38% | 75.00% | 68.75% | 50.00% |
Classification Accuracy | 82.50% | 83.75% | 61.25% | 81.25% | 63.75% | 62.50% | 60.00% | 42.50% | |
EfficientNetB2 | Highest Validation Accuracy | 81.25% | 79.69% | 64.06% | 82.81% | 60.94% | 70.31% | 64.06% | 45.31% |
Classification Accuracy | 87.50% | 85.00% | 53.75% | 77.50% | 63.75% | 58.75% | 55.00% | 43.75% | |
EfficientNetB3 | Highest Validation Accuracy | 84.38% | 85.94% | 67.19% | 93.75% | 62.50% | 76.56% | 68.75% | 43.75% |
Classification Accuracy | 85.00% | 82.50% | 52.50% | 82.50% | 63.75% | 58.75% | 61.25% | 30.00% | |
EfficientNetB4 | Highest Validation Accuracy | 78.12% | 81.25% | 67.19% | 85.94% | 56.25% | 70.31% | 62.50% | 45.31% |
Classification Accuracy | 82.50% | 81.25% | 56.25% | 82.50% | 56.25% | 65.00% | 58.75% | 40.00% |
Method by | Score on ESC-50 |
---|---|
Piczak | 64.50% |
Agrawal et al. | 81.95% |
Zhang et al. | 68.10% |
Zhichao et al. | 83.90% |
Zhichao et al. | 86.50% |
Ullo et al. | 95.80% |
Mushtaq et al. | 97.57% |
Two-Level Classification | Level 1: 78.75% |
Two-Level Classification | Level 2: 98.04% (Highest) |
7 Discussion
This section discusses the results presented in Section 6. As table 4 shows, the Level 1 Classification reached its highest Classification Accuracy, 78.75%, with the CNN model EfficientNetB2 and Audio Crop.
For the other classifications, the best results are shown in table 13. Table 14(b) and Fig 7(b) show the number of times each CNN class achieved the highest validation score, and they make clear that EfficientNet worked best in most cases. Besides obtaining the Highest Validation Accuracy in 10 cases, an examination of tables 4, 5, 6, 7, 8, 9, 10 and 11 in Section 6 shows that most audio modifiers reached their highest Classification Accuracy (for the Level 1 Classification) and their Highest Validation Accuracy (for the other classifications) with a CNN model belonging to the EfficientNet class.
Turning to the other crucial part of this research, the audio modifiers: Audio Crop gave the best scores in most cases, as shown in table 14(a) and Fig 7(a). As table 13 shows, Audio Crop obtained the Highest Validation Accuracy in the Animal, Birds, Human, Machine Sounds and Outdoor classifications. In tables 7 and 10 (Nature and Domestic), Audio Crop did not give the Highest Validation Accuracy, but it performed on par with the other audio modifiers.
Mode of Classification | CNN Model | Audio Modifier | Highest Validation Accuracy |
---|---|---|---|
Animal | EfficientNetB2 | Audio Crop | 88.24% |
Birds | VGG16 | No Filter | 96.00% |
Birds | EfficientNetB2 | No Filter | 96.00% |
Birds | EfficientNetB1 | Noise Removal | 96.00% |
Birds | ResNet152 | Audio Crop | 96.00% |
Birds | EfficientNetB1 | Audio Crop | 96.00% |
Birds | ResNet50 | High Pass Filter | 96.00% |
Nature | ResNet152 | No Filter | 95.45% |
Nature | ResNet152 | Noise Removal | 95.45% |
Nature | EfficientNetB1 | Noise Removal | 95.45% |
Human | EfficientNetB0 | Audio Crop | 93.75% |
Machine Sounds | VGG16 | No Filter | 92.00% |
Machine Sounds | ResNet152 | No Filter | 92.00% |
Machine Sounds | EfficientNetB3 | No Filter | 92.00% |
Machine Sounds | VGG19 | Audio Crop | 92.00% |
Machine Sounds | ResNet101 | Audio Crop | 92.00% |
Machine Sounds | ResNet152 | Audio Crop | 92.00% |
Machine Sounds | EfficientNetB3 | Audio Crop | 92.00% |
Domestic | EfficientNetB4 | Noise Removal | 98.04% |
Outdoor | EfficientNetB3 | Audio Crop | 93.75% |


Audio Modifier | Number of classifications in which the Highest Validation Accuracy was obtained |
---|---|
No Filter | 3 |
Noise Removal | 3 |
Audio Crop | 5 |
High Pass Filter | 1 |
CNN Model Class | Number of times the best score was obtained |
---|---|
VGG | 3 |
ResNet | 7 |
EfficientNet | 10 |
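The counts in tables 14(a) and 14(b) can be reproduced from the best-results table (table 13). The sketch below is one way to do so in Python. The counting rules are inferred from the published numbers rather than stated by the authors: each audio modifier is counted once per classification in which it appears, while each CNN class is counted once per listed model entry. The names `BEST`, `model_family`, `modifier_counts` and `family_counts` are illustrative.

```python
from collections import Counter

# (classification, CNN model, audio modifier) entries transcribed from table 13,
# with the merged cells filled in.
BEST = [
    ("Animal", "EfficientNetB2", "Audio Crop"),
    ("Birds", "VGG16", "No Filter"),
    ("Birds", "EfficientNetB2", "No Filter"),
    ("Birds", "EfficientNetB1", "Noise Removal"),
    ("Birds", "ResNet152", "Audio Crop"),
    ("Birds", "EfficientNetB1", "Audio Crop"),
    ("Birds", "ResNet50", "High Pass Filter"),
    ("Nature", "ResNet152", "No Filter"),
    ("Nature", "ResNet152", "Noise Removal"),
    ("Nature", "EfficientNetB1", "Noise Removal"),
    ("Human", "EfficientNetB0", "Audio Crop"),
    ("Machine Sounds", "VGG16", "No Filter"),
    ("Machine Sounds", "ResNet152", "No Filter"),
    ("Machine Sounds", "EfficientNetB3", "No Filter"),
    ("Machine Sounds", "VGG19", "Audio Crop"),
    ("Machine Sounds", "ResNet101", "Audio Crop"),
    ("Machine Sounds", "ResNet152", "Audio Crop"),
    ("Machine Sounds", "EfficientNetB3", "Audio Crop"),
    ("Domestic", "EfficientNetB4", "Noise Removal"),
    ("Outdoor", "EfficientNetB3", "Audio Crop"),
]


def model_family(model):
    """Map a model name such as 'ResNet152' to its CNN class."""
    for family in ("VGG", "ResNet", "EfficientNet"):
        if model.startswith(family):
            return family
    raise ValueError(f"unknown model: {model}")


# Table 14(a): count each modifier once per classification.
unique_pairs = {(cls, modifier) for cls, _, modifier in BEST}
modifier_counts = Counter(modifier for _, modifier in unique_pairs)

# Table 14(b): count every listed model entry.
family_counts = Counter(model_family(model) for _, model, _ in BEST)

if __name__ == "__main__":
    print(dict(modifier_counts))
    print(dict(family_counts))
```

Under these counting rules the tallies are 5 for Audio Crop, 3 each for No Filter and Noise Removal, and 1 for High Pass Filter, and 10, 7 and 3 for EfficientNet, ResNet and VGG respectively.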
Table 13 lists the best accuracies, which were obtained by CNN models with No Filter, Audio Crop, Noise Removal and High Pass Filter. In some cases, however, other audio modifiers also worked well, such as the Low Pass Filter in Birds, the High Pass Filter and Band Pass Filter in Nature, the Band Pass Filter in Human, and the Band Pass Filter and PCEN in Machine Sounds. On the other hand, the four Audio Filters (Low Pass, High Pass, Band Pass and Band Stop) did not work well for the Outdoor class. As for the CNNs, ResNet and VGG showed high scores in the classifications with fewer samples, whereas EfficientNet performed well on the problems with more classes and samples.
8 Conclusion
The main objective of this paper is to propose a Two-Level Sound Classification method for the Environmental Sound Classification problem. Experiments on the ESC-50 dataset show that the Level 1 Classification reached a classification accuracy as high as 78.75%, while the highest validation accuracy obtained by the Level 2 Classification is 98.04%. In addition, this paper shows the effectiveness of different CNN models combined with different audio modifiers and discusses their impact on the audio files. Although we obtained the highest accuracy with Audio Crop, in future we plan to tune the hyperparameters more carefully, examine the performance of other threshold frequencies for the audio filters, and further improve the performance of the proposed methodology. We hope our method and way of thinking will encourage future researchers to apply them in their own research.
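The two-level scheme summarized above amounts to routing each input through the Level 1 classifier and then through the Level 2 classifier selected by its output. The following is a minimal Python sketch of that routing only. The function `two_level_predict`, the type aliases, and the toy classifiers are hypothetical placeholders standing in for the trained CNN models; they are not the authors' implementation.

```python
from typing import Callable, Dict, Sequence, Tuple

# A spectrogram is represented here as a plain 2-D sequence of floats.
Spectrogram = Sequence[Sequence[float]]
Classifier = Callable[[Spectrogram], str]


def two_level_predict(
    spec: Spectrogram,
    level1: Classifier,
    level2: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Return (broad class, final class) for one input.

    level1 predicts the broad class; the Level 2 classifier registered
    under that broad class then predicts the final class.
    """
    broad = level1(spec)
    if broad not in level2:
        raise KeyError(f"no Level 2 classifier registered for {broad!r}")
    return broad, level2[broad](spec)


# Toy stand-ins for trained models (placeholders, not real CNNs).
def toy_level1(spec: Spectrogram) -> str:
    return "Animal" if spec[0][0] > 0.5 else "Domestic"


toy_level2: Dict[str, Classifier] = {
    "Animal": lambda spec: "dog",
    "Domestic": lambda spec: "vacuum cleaner",
}

if __name__ == "__main__":
    print(two_level_predict([[0.9]], toy_level1, toy_level2))
    print(two_level_predict([[0.1]], toy_level1, toy_level2))
```

In this arrangement an error at Level 1 sends the input to the wrong Level 2 classifier, which is why the Level 1 and Level 2 accuracies are reported separately.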
References
- [1] Michel Vacher, Jean-François Serignat, and Stephane Chaillol. Sound classification in a smart room environment: an approach using GMM and HMM methods. In The 4th IEEE Conference on Speech Technology and Human-Computer Dialogue (SpeD 2007), Publishing House of the Romanian Academy (Bucharest), volume 1, pages 135–146, 2007.
- [2] Regunathan Radhakrishnan, Ajay Divakaran, and A Smaragdis. Audio analysis for surveillance applications. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005, pages 158–161. IEEE, 2005.
- [3] Richard F Lyon. Machine hearing: An emerging field [exploratory DSP]. IEEE Signal Processing Magazine, 27(5):131–139, 2010.
- [4] Karol J. Piczak. Environmental sound classification with convolutional neural networks. In 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2015.
- [5] Dharmesh M Agrawal, Hardik B Sailor, Meet H Soni, and Hemant A Patil. Novel TEO-based gammatone features for environmental sound classification. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1809–1813. IEEE, 2017.
- [6] F. Beritelli and R. Grasso. A pattern recognition system for environmental sound classification based on MFCCs and neural networks. In 2008 2nd International Conference on Signal Processing and Communication Systems, pages 1–4, 2008.
- [7] Zhichao Zhang, Shugong Xu, Shan Cao, and Shunqing Zhang. Deep convolutional neural network with mixup for environmental sound classification. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 356–367. Springer, 2018.
- [8] Zhichao Zhang, Shugong Xu, Shunqing Zhang, Tianhao Qiao, and Shan Cao. Learning attentive representations for environmental sound classification. IEEE Access, 7:130327–130339, 2019.
- [9] Zohaib Mushtaq, Shun-Feng Su, and Quoc-Viet Tran. Spectral images based environmental sound classification using CNN with meaningful data augmentation. Applied Acoustics, 172:107581, 2021.
- [10] Xiaohu Zhang, Yuexian Zou, and Wei Shi. Dilated convolution neural network with LeakyReLU for environmental sound classification. In 2017 22nd International Conference on Digital Signal Processing (DSP), pages 1–5. IEEE, 2017.
- [11] Silvia Liberata Ullo, Smith K Khare, Varun Bajaj, and GR Sinha. Hybrid computerized method for environmental sound classification. IEEE Access, 8:124055–124065, 2020.
- [12] Wazib Ansar, Ahan Chatterjee, Saptarsi Goswami, and Amlan Chakrabarti. An EfficientNet-based ensemble for bird-call recognition with enhanced noise reduction. SN Computer Science, 5(2):265, 2024.
- [13] Yann Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L.D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems (NIPS 1989), Denver, CO, volume 2. Morgan Kaufmann, 1990.
- [14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
- [15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [17] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
- [18] Joshua M Inouye, Silvia S Blemker, and David I Inouye. Towards undistorted and noise-free speech in an MRI scanner: correlation subtraction followed by spectral noise gating. The Journal of the Acoustical Society of America, 135(3):1019–1022, 2014.
- [19] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F Lyon, and Rif A Saurous. Trainable frontend for robust and far-field keyword spotting. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5670–5674. IEEE, 2017.
- [20] Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Per-channel energy normalization: Why and how. IEEE Signal Processing Letters, 26(1):39–43, 2018.
- [21] RR Porle, NS Ruslan, NM Ghani, NA Arif, SR Ismail, N Parimon, and M Mamat. A survey of filter design for audio noise reduction. J. Adv. Rev. Sci. Res, 12(1):26–44, 2015.
- [22] Karol J Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1015–1018, 2015.