
“What did they just say?” Machine learning to the rescue

August 21, 2024 | Martin Walsh
Vice President of Audio Research and Development, Xperi

Streaming content and broadcast technologies have seen remarkable advances in recent years. We can now watch content in 4K HDR at high frame rates on 100” screens that are thinner than a pencil. Audio is broadcast or streamed in high-fidelity, multichannel and object-based surround sound.

Despite these technological advances, more people than ever are having difficulty hearing what people are saying on their TV. In fact, a recent Xperi survey of 1,200 adults revealed 84% of consumers have experienced trouble understanding dialogue during TV shows and movies. In response, over three-quarters (77%) of survey respondents said they use captions/subtitles, with one in three (30%) reporting they are always or often turned on.

Until recently, subtitles have been the only consistent way to help solve problems of dialogue intelligibility, but this is changing. Subjective issues like hearing loss – which people experience to varying degrees and in different ways – and differences in language or regional dialect can explain the increased propensity to enable subtitles when watching TV.

However, the degree to which general subtitle usage has increased implies that other issues with the modern TV-watching experience are causing the average viewer to take on additional cognitive load (which means, essentially, that it takes more brain power to watch with subtitles than without). Can we explain why this is happening, and can the most significant problems related to dialogue intelligibility be addressed?

The growing problem

Ironically, many of the technologies associated with modern TV watching can be detrimental to understanding dialogue. As manufacturers compete to release the biggest and flattest TVs possible, the speakers must also become thinner. Flattening the space around the TV speakers is not conducive to high-quality audio, and signal processing is often necessary to increase the perceived loudness and bass. Audio processing technologies such as virtual surround sound and dynamic range compression may be applied on the TV to compensate for the limitations of speaker sound quality. Unfortunately, these combinations of audio algorithms can affect dialogue intelligibility if not tuned appropriately.

The quality of your TV speakers is not the only reason for poor dialogue intelligibility. For example, more soundtracks are being mixed for multiple channels of audio, with one channel dedicated primarily to dialogue. These so-called immersive sound mixes are often downmixed to stereo when played over a conventional TV. In this downmix, the combined non-dialogue channels can mask the dialogue, potentially making it less intelligible.
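As a rough illustration, consider a conventional ITU-style 5.1-to-stereo downmix. The sketch below is illustrative only (it uses the common -3 dB coefficients and simply omits the LFE channel); actual downmix matrices vary by decoder and content:

```python
import numpy as np

def downmix_5_1_to_stereo(fl, fr, c, ls, rs):
    """ITU-style stereo downmix of a 5.1 mix (LFE omitted for simplicity).

    Each argument is a mono numpy array. The center channel, which usually
    carries most of the dialogue, is attenuated by ~3 dB (x 0.7071) and then
    summed with the full-level front and attenuated surround channels.
    """
    k = 0.7071  # -3 dB
    lo = fl + k * c + k * ls
    ro = fr + k * c + k * rs
    return lo, ro
```

Because the dialogue arrives at roughly -3 dB while full-level front content and attenuated surround content stack on top of it in each output channel, the effective dialogue-to-background ratio in the stereo result can end up lower than in the original mix.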

The average home has also changed in recent years. Today’s homes are more likely to be more reverberant and have more local noise pollution from appliances such as refrigerators, dishwashers, and air conditioners. As a result, environmental sounds can mask meaningful dialogue, especially if that dialogue is reproduced at lower levels, like whispers.

Less-than-ideal solutions exist

Several DSP-based (digital signal processing) solutions have been available in televisions for years, but they have not adequately solved the problem. Traditional dialogue enhancement algorithms boost the audio frequency ranges associated with dialogue. While this can improve intelligibility for those with mild to moderate hearing loss, it does little to help when background sounds are mixed in with the dialogue, since both dialogue and non-dialogue receive the same boost.
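A minimal sketch of that kind of static boost is shown below (the band edges, filter order, and gain are illustrative assumptions, not a description of any particular product):

```python
import numpy as np
from scipy import signal

def speech_band_boost(x, fs, gain_db=6.0, band=(1000.0, 4000.0)):
    """Naive 'dialogue enhancement': boost the band where speech cues
    are concentrated, regardless of what else occupies that band.
    """
    sos = signal.butter(4, band, btype="bandpass", fs=fs, output="sos")
    band_signal = signal.sosfilt(sos, x)
    # Parallel boost: add extra in-band energy on top of the dry signal.
    extra_gain = 10.0 ** (gain_db / 20.0) - 1.0
    return x + extra_gain * band_signal
```

Anything else in the mix with energy in the same band, such as music, effects, or room noise, is boosted by the same amount, which is precisely the limitation described above.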

One solution might be to change how content is delivered to the home with bespoke television mixes that are more dialogue-forward and have a lower dynamic range between quiet and loud sounds. While this methodology may improve the listening experience for some, it may be detrimental to others, mainly due to the subjective nature of the problem. In short, the audio mix should not have to be biased to an assumed worst-case listening scenario.

What if we could tailor the audio to each individual consumer’s preferences and needs?

This scenario would be possible if the dialogue track were available as a separate audio stream at the point of consumption. In this way, only the dialogue would be filtered, and the dialogue-to-background ratio could be increased based on listener preferences, hearing loss, or environmental noise in the home.
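To make the idea concrete, here is a sketch of how separately available dialogue and background stems could be remixed at playback time (the function name and the default boost value are hypothetical):

```python
import numpy as np

def remix(dialogue, background, dialogue_boost_db=9.0):
    """Remix separately delivered (or separated) dialogue and background
    stems, raising the dialogue-to-background ratio by a chosen amount.
    """
    g = 10.0 ** (dialogue_boost_db / 20.0)
    mix = g * dialogue + background
    # Normalize only if the boost would cause clipping.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix
```

The boost could just as easily be driven by a hearing profile or by noise measured in the room rather than by a fixed user preference.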

One approach could be to transmit a separate dialogue channel, delivered independently of the original mix. For example, the dialogue could be carried exclusively in a surround mix’s center channel or sent as a separate audio object. Early pilots of these approaches have so far proved relatively unpopular. They require new workflows from content creators, which have been challenging to adopt; legacy content would have to be remastered to gain similar benefits; and devices such as TVs would need to add support for the new formats.

Machine learning to the rescue

Recent developments in audio machine learning have made it possible to separate audio content into its component parts. New unmixing technologies make it possible to separate dialogue from non-dialogue components of any content, with no requirements placed on the content producer.
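One common family of approaches applies a learned time-frequency mask to the mixture’s spectrogram. The sketch below is illustrative only: `mask_net` stands in for a trained neural network, the parameters are assumptions, and this is not necessarily how any particular product implements separation:

```python
import torch

def separate_dialogue(audio, mask_net, n_fft=1024, hop=256):
    """Sketch of mask-based dialogue separation on a mono waveform tensor.

    `mask_net` is a placeholder for a trained model that predicts, per
    time-frequency bin, how much of the magnitude spectrogram is speech
    (values between 0 and 1).
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    mask = mask_net(spec.abs())            # speech mask in [0, 1]
    dialogue_spec = mask * spec            # keep speech-dominated energy
    background_spec = (1.0 - mask) * spec  # everything else
    dialogue = torch.istft(dialogue_spec, n_fft, hop_length=hop,
                           window=window, length=audio.shape[-1])
    background = torch.istft(background_spec, n_fft, hop_length=hop,
                             window=window, length=audio.shape[-1])
    return dialogue, background
```

Once dialogue and background have been recovered, they can be remixed per listener exactly as in the earlier remix sketch.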

These techniques also make it possible to separate dialogue in any audio content produced over the past century, going back to The Jazz Singer, the first ‘talkie’ motion picture, released in 1927. As mentioned above, once the dialogue has been separated, it can be processed with minimal consequence for the original artistic intent. Bespoke mixes tailored to an assumed worst-case listening scenario are no longer required. The precise amount of processing can be applied while watching the original content, based on user preferences, the nature of the mix, or environmental background noise.

Even more exciting, these techniques can already run on the same machine learning hardware that modern TVs use for image and video processing.

New technology is forthcoming that enables dialogue to be processed independently from the rest of the soundtrack. It will let TV viewers focus less on trying to understand what’s being said on their TV and become more immersed in the story the content creator wanted to tell. In my next article, I’ll outline in greater detail how this technology works and how I believe it will positively impact the experience of watching TV in the future.

Stay up to date on the latest insights from DTS by signing up for their newsletter here.
