
Assessing Audio Quality with Deep Learning
- Posted by Daitan Innovation Team
- On February 12, 2020
- Deep Learning, TensorFlow 2.0, VoIP
How to Train a Deep Learning System to Estimate Mean Opinion Score (MOS) Using TensorFlow 2.0
Introduction
If you’ve ever used VoIP (Voice Over IP) applications like Skype or Hangouts, you know that audio degradation can be a problem. In video or audio conferences, perhaps with clients and prospects, audio quality is important.
“Speech quality” might sound like a subjective concept, but there are some well-known types of degradation that hurt speech intelligibility. By intelligibility, I mean how easily speech can be understood. Some of the degradations that reduce intelligibility include echo, reverberation, and background noise (usually from your colleagues).
One commonly used metric for assessing the quality of an audio signal is the Mean Opinion Score (MOS). MOS is the arithmetic mean of individual ratings given by different users. We’ll talk more about MOS in a bit, but if you’ve used Skype before, you’ll know what I’m talking about.
Most of these VoIP services use a similar strategy to get MOS from users. When a VoIP call ends, the tool asks users to rate their call experience. Most of the time, users report their satisfaction on a scale from one (bad) to five (excellent). Once a statistically significant number of people have rated a given audio sample, the MOS is the average of all ratings.
In this piece, we propose a deep convolutional neural network (ConvNet) to address the problem of MOS estimation. Moreover, we go a step further and train a multi-output ConvNet to estimate MOS and also to classify the type of degradation of a given input. Our code is written in TensorFlow 2.0 and is available on our GitHub page.
The Problem
We formulate the MOS estimation problem as a regression task. In other words, given a set of features representing an audio sample, we want to predict a real value in the range of one to five (standard MOS range). One important thing to keep in mind is that MOS should measure users’ quality of experience. However, as we’re going to see, audio signals are far from stationary. As a consequence, as in the example of Skype asking for users’ feedback, one MOS value may represent different situations.
To see this problem more clearly, imagine a short piece of speech from a VoIP call with an overall MOS of 3.9. Despite being a reasonably good score, there is more than one situation that could explain this final rating. A simple and straightforward example would be a call with consistently good quality from beginning to end. In this situation, the per-interval quality oscillates around the mean of 3.9 with no significant outliers — i.e. a small standard deviation.

However, and here comes the catch: because the arithmetic mean is very sensitive to outliers, the same 3.9 MOS could also be explained by the picture below.

In this scenario, the MOS was high (above the mean of 3.9) for most of the time. However, towards the end of the call, there was a sudden decrease in intelligibility, which made the user rate the final experience lower than expected. This emphasizes the necessity of a different strategy to better measure audio quality. In other words, to get a reliable audio quality measure, we need to alleviate the weakness of the arithmetic mean.

To do that, we can measure MOS by slicing an audio signal into small intervals. For each interval, we estimate a separate (local) MOS. This way, the more finely we slice the audio sample, the denser the distribution of MOS estimates. This, in turn, provides a better final MOS estimation. In the example above, to get the final MOS of the sample audio, we just compute the average of all five estimates. For the sake of the example, that would be 3.216.

Likewise, with a denser distribution of scores, the overall MOS tends to be more representative. For the example above, that would be 3.17.
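To make the averaging concrete, here is a minimal sketch with hypothetical per-interval estimates (the actual values from the figures above are not reproduced): the final MOS is simply the mean of the local estimates.

```python
import numpy as np

# Hypothetical per-interval MOS estimates for a single call; in practice these
# would be the model's local predictions on each audio slice.
local_mos = np.array([4.5, 4.3, 4.2, 1.9, 1.18])

# The final MOS for the whole sample is just the mean of the local estimates.
final_mos = local_mos.mean()
print(round(final_mos, 3))  # 3.216
```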
The Dataset
It’s hard to find human-annotated audio databases for MOS estimation. Here, for simplicity, we chose the TCD-VoIP dataset.
This dataset was designed to aid the development and testing of speech quality assessment systems for VoIP. It contains a set of five types of VoIP degradation along with the corresponding subjective opinion scores (MOS). The dataset focuses on degradations that occur independently of hardware or network, and it is freely available.
The TCD-VoIP covers five types of commonly seen degradations in VoIP applications. These are:
- Background noise
- Intelligible competing speakers
- Echo effects
- Amplitude clipping
- Choppy speech
For each audio sample, there are individual subjective opinion scores from 24 different listeners. The final subjective score (MOS) is then given as the arithmetic mean of the 24 scores. In total, there are 384 audio files with two male and two female speakers. You can see the distribution of speech degradations and MOS in the images below.


Feature Extraction
To extract features from the audio samples, we experimented with the most popular representations used in audio processing. Specifically, we encoded the audio signal into the following feature representations, most of them derived from the Short-Time Fourier Transform (STFT); a short extraction sketch follows the list.
- The spectral magnitude of the STFT
- The Spectrogram of the STFT
- The Mel-Spectrogram
- The Constant-Q transform
- The Mel Frequency Cepstral Coefficients (MFCCs)
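As a rough illustration (librosa is not necessarily the library used in our original code, and the file name and parameters here are placeholders), these representations could be computed as follows:

```python
import numpy as np
import librosa

# Hypothetical input file; any speech sample from the dataset would do.
y, sr = librosa.load("sample.wav", sr=None)

stft = librosa.stft(y, n_fft=512, hop_length=64, window="hamming")
stft_magnitude = np.abs(stft)                        # spectral magnitude of the STFT
spectrogram = stft_magnitude ** 2                    # spectrogram (squared magnitude)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=64)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=64))   # constant-Q transform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=512, hop_length=64)
```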
To keep things short, here we only describe the magnitude of the STFT. Indeed, along with the Spectrogram of the STFT, the magnitude vectors were the most effective representations in our experiments.
The STFT is the most common time-frequency representation for audio signals. The idea is to compute Fourier transforms over small portions of the input signal. Since audio signals are highly non-stationary (especially music signals), the STFT breaks the signal into smaller portions as a way of providing a more robust final representation.

For the STFT, we used a Hamming window covering 512 sample points of the input audio signal. The window moves with a stride (hop size) of 64 points, which corresponds to an 87.5% overlap between consecutive windows. Finally, we take the magnitude of the STFT and use it as the final feature vector. As a side note, to compute the Spectrogram of the STFT, we would just square the magnitude.

Since the audio samples have different lengths, we pad the STFT along the time axis using the “wrap” mode so that all feature matrices have the same shape. This way, each STFT has 257 frequency bins and 1241 time frames.
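A minimal sketch of this pipeline, assuming librosa for the STFT and numpy’s pad with “wrap” mode (the helper name and the trimming of overly long files are our own choices):

```python
import numpy as np
import librosa

def stft_magnitude_features(path, n_fft=512, hop_length=64, target_frames=1241):
    """Compute the STFT magnitude (257 x T) and wrap-pad it to a fixed length in time."""
    y, _ = librosa.load(path, sr=None)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hamming"))
    n_frames = mag.shape[1]
    if n_frames < target_frames:
        # Repeat the signal's own frames ("wrap" mode) until the target length is reached.
        mag = np.pad(mag, ((0, 0), (0, target_frames - n_frames)), mode="wrap")
    else:
        mag = mag[:, :target_frames]
    return mag  # shape: (257, 1241)
```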
The Solution
The final solution is a multi-output ConvNet with approximately 58K trainable parameters. The architecture is built from repeated blocks of:
Convolution → Batch Normalization → ReLU → Max Pooling → Spatial Dropout
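In Keras terms, one such block might look like the sketch below. The 3x3 kernel, 2x2 pooling, and dropout rate are illustrative assumptions, not the exact values from our model.

```python
from tensorflow.keras import layers

def conv_block(x, filters, dropout_rate=0.3):
    """One repeated block: Convolution -> Batch Norm -> ReLU -> Max Pooling -> Spatial Dropout."""
    x = layers.Conv2D(filters, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.SpatialDropout2D(dropout_rate)(x)
    return x
```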
During training, the model receives randomly cropped patches from the STFT magnitude spectrum as input. We used Z-score normalization across the first (frequency) axis to normalize the input patches. This ensures near-zero mean and unit variance across the 257 bins of the STFT vectors.
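A short sketch of that preprocessing step, assuming the normalization statistics are computed over the frequency bins of each time frame (our reading of “normalization across the first axis”):

```python
import numpy as np

def preprocess(stft_mag, patch_frames=416, eps=1e-8):
    """Z-score normalize over the 257 frequency bins, then randomly crop a patch in time."""
    # Statistics are computed over the frequency axis, so each time frame ends up
    # with roughly zero mean and unit variance across its 257 bins.
    mean = stft_mag.mean(axis=0, keepdims=True)
    std = stft_mag.std(axis=0, keepdims=True)
    normalized = (stft_mag - mean) / (std + eps)

    # Randomly crop a fixed-size (257 x 416) patch along the time axis.
    start = np.random.randint(0, normalized.shape[1] - patch_frames + 1)
    return normalized[:, start:start + patch_frames]
```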
Random cropping reduces the amount of computation per training example. Moreover, it augments the training data, thereby reducing overfitting. The model receives patches of fixed size (257 x 416) as input and produces two outputs (a model sketch follows the list below).
- The first output is the MOS estimate optimized as a regression task. Here, we used the Mean Squared Error (MSE) as the objective. The model produces a value between 0 and 1 which corresponds to the normalized MOS.
- The second output is a set of probabilities to classify the signal’s type of degradation. For this classification task, the objective is set to be the cross-entropy loss function. Here, the last Dense layer has 5 neurons (one for each class) and softmax activation.
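Putting the pieces together, a minimal sketch of the two-headed model is shown below. It reuses the conv_block helper from the earlier sketch; the number of blocks, the filter counts, the global average pooling, and the use of integer (sparse) class labels are all assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input patches: 257 frequency bins x 416 time frames, plus a channel dimension.
inputs = keras.Input(shape=(257, 416, 1))
x = conv_block(inputs, 16)   # conv_block as sketched above
x = conv_block(x, 32)
x = conv_block(x, 64)
x = layers.GlobalAveragePooling2D()(x)

# Head 1: normalized MOS in [0, 1], trained with MSE.
mos = layers.Dense(1, activation="sigmoid", name="mos")(x)
# Head 2: probabilities over the five degradation classes, trained with cross-entropy.
degradation = layers.Dense(5, activation="softmax", name="degradation")(x)

model = keras.Model(inputs, [mos, degradation])
model.compile(
    optimizer=keras.optimizers.Adam(),
    loss={"mos": "mse", "degradation": "sparse_categorical_crossentropy"},
)
```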
In the end, we combine the two costs to produce the total loss and minimize it using the Adam optimizer. Check out the reduced model architecture below.

In order to balance the contribution of each task to the final loss, we weight each loss inversely proportionally to its magnitude. Basically, we run a short experiment of 100 epochs and store the individual raw losses per epoch. We then compute the average ratio of one loss to the other, which tells us how much larger (on average) one loss is compared to the other.
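As a rough sketch of that weighting idea (the helper below and its exact scaling are our own illustration, not the production code):

```python
import numpy as np

def balance_loss_weights(mse_per_epoch, ce_per_epoch):
    """Scale each loss inversely proportionally to its average magnitude over a pilot run."""
    ratio = np.mean(ce_per_epoch) / np.mean(mse_per_epoch)
    # If the cross-entropy is (on average) `ratio` times larger than the MSE,
    # boost the MSE term by that factor so both tasks contribute comparably.
    return {"mos": ratio, "degradation": 1.0}

# Usage: pass the result as `loss_weights` to model.compile(...).
```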
Results
Despite the small size of the TCD-VoIP dataset, the results are reasonably good, especially for MOS estimation, which achieved a mean absolute deviation of only 0.06. In the figure below, you can compare the ground-truth MOS values from the test set with the corresponding estimates.

You can also listen to some audio samples (from the test set) and compare the target with the predicted subjective scores.
Below, you can also see the confusion matrix for the degradation classification task. The model managed to assign most of the test examples to the correct class. Nevertheless, the lack of more training data is a key factor in not achieving higher accuracy. The overall balanced accuracy is 75%, with precision and recall of 82% and 77%, respectively.

Conclusions
Even with a relatively small dataset containing five types of common VoIP degradations, the overall result was very good. Sadly, audio databases with statistically significant human-annotated subjective scores are hard to find.
Nonetheless, since deep learning models require vast amounts of data to yield good results, we should not expect this model to generalize well in the wild. Indeed, the TCD-VoIP dataset only contains speech samples from four different speakers. This lack of variability constrains the model to the data it was trained on. However, with larger datasets such as the Voice Conversion Challenge (VCC) 2018 corpus, the recipe presented here stays the same, and we can expect better generalization.
Thanks for reading.
This article was authored by Thalles Santos Silva, AI Software Architect, and the Innovation team at Daitan.
Image courtesy of Mathew A. on Unsplash.