Assessing Audio Quality with Deep Learning

  • Posted by Daitan Innovation Team
  • On February 12, 2020
  • Deep Learning, TensorFlow 2.0, VoIP

How to Train a Deep Learning System to Estimate Mean Opinion Score (MOS) Using TensorFlow 2.0

Introduction

If you’ve ever used VoIP (Voice Over IP) applications like Skype or Hangouts, you know that audio degradation can be a problem. In video or audio conferences, perhaps with clients and prospects, audio quality is important.

“Speech quality” might sound like a subjective concept. But there are some well-known types of degradation that hurt speech intelligibility. By intelligibility, I mean how comprehensible and “pleasant” speech is to the listener. Some of the degradations that reduce intelligibility include echo, reverberation, and background noise (usually from your colleagues).

One commonly used metric to assess the quality of an audio signal is the Mean Opinion Score (MOS). MOS is the arithmetic mean of individual ratings given by different users. We’ll talk more about MOS in a bit, but if you have used Skype before, you’ll know what I’m talking about.

Most of these VoIP services use a similar strategy to get MOS from users. When a VoIP call ends, the tool asks users to rate their call experience. Most of the time, users report their satisfaction on a scale from one (bad) to five (excellent). Once a statistically significant number of people have rated a given audio sample, the MOS is the average of all ratings.

In this piece, we propose a deep convolutional neural network (ConvNet) to address the problem of MOS estimation. Moreover, we go a step further and train a multi-class ConvNet to estimate MOS and also to classify the type of degradation of a given input. Our code is written in TensorFlow 2.0 and is available at our GitHub page.

The Problem

We formulate the MOS estimation problem as a regression task. In other words, given a set of features representing an audio sample, we want to predict a real value in the range of one to five (standard MOS range). One important thing to keep in mind is that MOS should measure users’ quality of experience. However, as we’re going to see, audio signals are far from stationary. As a consequence, as in the example of Skype asking for users’ feedback, one MOS value may represent different situations.

To see this problem more clearly, imagine a short piece of speech from a VoIP call with an overall MOS of 3.9. Despite being a reasonably good score, there is more than one situation that explains this final rating. A simple and straightforward example would be a call with consistently good quality from beginning to end. In this situation, the quality oscillates around the mean of 3.9 with no significant outliers, i.e. a small standard deviation.

Subjective score of 3.9 for a given audio sample. The overall 3.9 MOS describes an audio sample with good quality from start to finish.

However, and here comes the catch, because the arithmetic mean is very sensitive to outliers, the 3.9 MOS could also be explained by the picture below.

Subjective score of 3.9 for a given audio sample. The overall 3.9 MOS describes an audio sample with excellent (above 3.9) MOS for most of the time. A sudden decrease in quality pulls the MOS down.

In this scenario, the MOS was high (above the mean of 3.9) for most of the time. However, towards the end of the call, there was a sudden decrease in intelligibility, which made the user rate the final experience lower than expected. This emphasizes the necessity of a different strategy to better measure audio quality. In other words, to get a reliable audio quality measure, we need to alleviate the weakness of the arithmetic mean.

Slicing an audio signal into small portions and estimating MOS for each separate slice gives a better overall measure of the audio quality.

To do that, we can measure MOS by slicing an audio signal into small intervals. For each interval, we estimate a separate (local) MOS. This way, the more we slice the audio sample, the denser the distribution of MOS estimates. This, in turn, provides a better final MOS estimation. In the example above, to get the final MOS of the sample audio, we just compute the average of all five estimates. For the sake of the example, that would be 3.216.

More slices give a better overall quality measure.

Likewise, with a denser distribution of scores, the overall MOS tends to be more representative. For the example above, that would be 3.17.
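As a minimal sketch of this sliced estimation strategy, the final score is just the mean of the per-slice estimates. Here, estimate_mos is a hypothetical callable standing in for the model described later in this post:

```python
import numpy as np

def sliced_mos(audio, sample_rate, estimate_mos, slice_seconds=2.0):
    """Estimate a local MOS on fixed-length slices and average them into a final score."""
    slice_len = int(slice_seconds * sample_rate)
    # Break the waveform into consecutive slices of equal length.
    slices = [audio[i:i + slice_len] for i in range(0, len(audio), slice_len)]
    # estimate_mos is a stand-in for the trained model introduced later.
    local_scores = [estimate_mos(s) for s in slices if len(s) == slice_len]
    return float(np.mean(local_scores))
```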

The Dataset

It’s hard to find human-annotated audio databases for MOS estimation. Here, for simplicity, we chose the TCD-VoIP dataset.

This dataset was designed to aid the development and testing of speech quality assessment for VoIP systems. It contains a set of five types of VoIP degradations along with their corresponding subjective opinion scores (MOS). The dataset focuses on degradations that occur independently of hardware or network, and it is freely available.

The TCD-VoIP dataset covers five types of degradation commonly seen in VoIP applications. These are:

  • Background noise
  • Intelligible competing speakers
  • Echo effects
  • Amplitude clipping
  • Choppy speech

For each audio sample, there are individual subjective opinion scores from 24 different listeners. Likewise, the final subjective score (MOS) is given as the arithmetic mean across the 24 scores. In total, there are 384 audio files with two male and two female speakers. You can see the distribution of speech-degradations and MOS in the images below.

Feature Extraction

To extract features from the audio samples, we experimented with the most popular representations used in audio processing, most of them derived from the Short-Time Fourier Transform (STFT). Specifically, we encoded the audio signal into five different feature representations:

  • The spectral magnitude of the STFT
  • The Spectrogram of the STFT
  • The Mel-Spectrogram
  • The Constant-Q transform
  • The Mel Frequency Cepstral Coefficients (MFCCs)

To make things shorter, here we only describe the magnitude of the STFT features. Indeed, along with the Spectrogram of the STFT, the magnitude vectors were the most effective representations in our experiments.

The STFT is the most common time-frequency representation for audio signals. The idea is to compute Fourier transforms over small portions of the input signal. Since audio samples are highly non-stationary (music signals especially), the STFT breaks the signal into smaller portions as a way of providing a more robust final representation.

The magnitude vectors of the STFT.

For the STFT, we used a Hamming window that covers 512 sample points of the input audio signal. The window moves with a stride (hop size) of 64 points, which gives an 87.5% overlap between consecutive windows. Finally, we take the magnitude of the STFT and use it as the final feature vector. As a side note, to compute the Spectrogram of the STFT, we would just square the magnitude of the STFT.

Computing the Spectrogram from the STFT magnitude.

Since the audio samples have different lengths, we pad the STFT using the “wrap” mode so that the feature vectors have the same shape. This way, the STFT has 259 frequency bins and 1241 frames in time.
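As a minimal sketch of this feature-extraction step, assuming librosa is used to load the audio and compute the STFT (the padding helper and target-length handling are illustrative; the original implementation lives in the GitHub repository linked above):

```python
import numpy as np
import librosa

def stft_magnitude(path, n_fft=512, hop_length=64, target_frames=1241):
    """Load an audio file and return its STFT magnitude, padded to a fixed number of frames."""
    y, sr = librosa.load(path, sr=None)          # keep the file's native sample rate
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hamming")
    magnitude = np.abs(stft)                     # magnitude vectors used as features
    # spectrogram = magnitude ** 2               # squaring would give the Spectrogram instead
    # Pad (or trim) along the time axis with "wrap" mode so every sample has the same shape.
    frames = magnitude.shape[1]
    if frames < target_frames:
        magnitude = np.pad(magnitude, ((0, 0), (0, target_frames - frames)), mode="wrap")
    return magnitude[:, :target_frames]
```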

The Solution

The final solution consists of a multi-output ConvNet with approximately 58K trainable parameters. The architecture consists of repeated blocks of:

Convolution → Batch Normalization → ReLU → Max Pooling → Spatial Dropout
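As a sketch, one such block could be expressed with the TensorFlow 2.0 Keras API as follows (the filter count, kernel size, pooling size, and dropout rate are illustrative, not the exact values used):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, dropout_rate=0.2):
    """Convolution -> Batch Normalization -> ReLU -> Max Pooling -> Spatial Dropout."""
    x = layers.Conv2D(filters, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.SpatialDropout2D(dropout_rate)(x)
    return x
```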

During training, the model receives randomly cropped patches from the STFT magnitude spectrum as input. We used Z-score normalization across the first axis to normalize the input patches. This ensures a near 0 mean and unit variance across the 259 bins of the STFT vectors.
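A rough sketch of this preprocessing step (the 257 x 416 patch size comes from the next paragraph; the function names and epsilon guard are illustrative):

```python
import numpy as np

PATCH_BINS, PATCH_FRAMES = 257, 416   # fixed patch size fed to the model

def random_patch(magnitude):
    """Randomly crop a fixed-size patch along the time axis of an STFT magnitude matrix."""
    start = np.random.randint(0, magnitude.shape[1] - PATCH_FRAMES + 1)
    return magnitude[:PATCH_BINS, start:start + PATCH_FRAMES]

def zscore_normalize(patch, eps=1e-8):
    """Normalize across the first axis so each frame has near-zero mean and unit variance over the bins."""
    mean = patch.mean(axis=0, keepdims=True)
    std = patch.std(axis=0, keepdims=True)
    return (patch - mean) / (std + eps)
```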

The random-crop preprocessing reduces the computation required per training input. Moreover, it augments the training data, thereby reducing overfitting. The model receives patches of fixed size (257 x 416) as input and produces two outputs.

  • The first output is the MOS estimate optimized as a regression task. Here, we used the Mean Squared Error (MSE) as the objective. The model produces a value between 0 and 1 which corresponds to the normalized MOS.
  • The second output is a set of probabilities to classify the signal’s type of degradation. For this classification task, the objective is set to be the cross-entropy loss function. Here, the last Dense layer has 5 neurons (one for each class) and softmax activation.

In the end, we combine the two costs to produce the total loss and minimize it using the Adam optimizer. Check out the reduced model architecture below.
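As a condensed sketch of how such a two-headed model could be assembled and compiled in TensorFlow 2.0, reusing the conv_block sketch above (layer sizes, output names, and the initial loss weights are illustrative; the actual architecture is in model_build.py in the repository):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(257, 416, 1), num_degradations=5):
    inputs = tf.keras.Input(shape=input_shape)
    x = conv_block(inputs, 16)
    x = conv_block(x, 32)
    x = conv_block(x, 64)
    x = layers.GlobalAveragePooling2D()(x)

    # Regression head: normalized MOS in [0, 1], trained with mean squared error.
    mos = layers.Dense(1, activation="sigmoid", name="mos")(x)
    # Classification head: one probability per degradation type, trained with cross-entropy.
    degradation = layers.Dense(num_degradations, activation="softmax", name="degradation")(x)

    model = tf.keras.Model(inputs, [mos, degradation])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss={"mos": "mse", "degradation": "sparse_categorical_crossentropy"},
        loss_weights={"mos": 1.0, "degradation": 1.0},  # re-weighted as described below
    )
    return model
```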

In order to balance the contribution of each task to the final loss, we weight each loss inversely proportional to its magnitude. Basically, we run a short experiment of 100 epochs and store the individual raw losses per epoch. We then compute, on average, how much larger one loss is than the other, and use that ratio to set the loss weights.
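A rough sketch of that calibration step (assuming history comes from an initial model.fit call on the model sketched above; the key names follow the illustrative output names used there):

```python
import numpy as np

# Per-epoch raw losses recorded during a short 100-epoch calibration run.
mse_losses = np.array(history.history["mos_loss"])
ce_losses = np.array(history.history["degradation_loss"])

# Average ratio between the two losses; used to scale them so both tasks contribute comparably.
ratio = ce_losses.mean() / mse_losses.mean()
loss_weights = {"mos": ratio, "degradation": 1.0}
```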

model_build.py (full implementation in the GitHub repository)

Results

Despite the small size of the TCD-VoIP dataset, results are reasonably good, especially for MOS estimation, which achieved a mean absolute deviation of only 0.06. In the figure below, you can compare the ground-truth MOS from the test set with the corresponding estimates.

Comparing the Ground Truth and predicted MOS estimates.

You can also listen to some audio samples (from the test set) and compare the target with the predicted subjective scores.

Audio sample with a target MOS of 3.5 and estimated score of 3.27
Audio sample with a target MOS of 1.8 and estimated score of 2.09
Audio sample with a target MOS of 1.8 and estimated score of 2.251
Audio sample with a target MOS of 3.1 and estimated score of 3.364

Below, you can also see a confusion matrix for the degradation classification task. The model managed to classify most of the test examples correctly. Nevertheless, the lack of more training data is a key factor in not achieving higher accuracy. The overall balanced accuracy is 75%, with precision and recall of 82% and 77%, respectively.

Confusion matrix for the classification of VoIP common degradation.

Conclusions

Even with a relatively small dataset containing five types of common VoIP degradations, the overall result was very good. Sadly, audio databases with statistically significant human-annotated subjective scores are hard to find.

Nonetheless, since deep learning models require vast amounts of data to yield good results, we should not expect this model to generalize well in the wild. Indeed, the TCD-VoIP dataset only contains speech samples from four different people. This lack of variability makes the model very constrained to the data itself. However, the recipe presented here applies unchanged to larger datasets such as the Voice Conversion Challenge (VCC) 2018, where we could expect better generalization.

Thanks for reading.

This article was authored by AI Software Architect Thalles Santos Silva and the Innovation Team at Daitan.

Image courtesy of Mathew A. on Unsplash.
