Mono to Stereo: How AI Is Breathing New Life into Music
Now that we have discussed how relevant mono-to-stereo technology is, you might be wondering how it works under the hood. It turns out there are different approaches to tackling this problem with AI. In the following, I want to showcase four different methods, ranging from traditional signal processing to generative AI. This is not a complete list of methods, but rather an inspiration for how this task has been solved over the last 20 years.
Traditional Signal Processing: Sound Source Formation
Before machine learning became as popular as it is today, the field of Music Information Retrieval (MIR) was dominated by smart, hand-crafted algorithms. It is no wonder that such approaches also exist for mono-to-stereo upmixing.
The fundamental idea behind a paper from 2007 (Lagrange, Martins, Tzanetakis, [1]) is simple:
If we can find the different sound sources of a recording and extract them from the signal, we can mix them back together for a realistic stereo experience.
This sounds simple, but how can we tell what the sound sources in the signal are? How do we define them so clearly that an algorithm can extract them from the signal? These questions are difficult to answer, and the paper uses a variety of advanced methods to do so. In essence, this is the algorithm they came up with:
- Break the recording into short snippets and identify the peak frequencies (dominant notes) in each snippet
- Identify which peaks belong together (a sound source) using a clustering algorithm
- Decide where each sound source should be placed in the stereo mix (manual step)
- For each sound source, extract its assigned frequencies from the signal
- Mix all extracted sources together to form the final stereo mix
Although the details are quite complex, the intuition is clear: find sources, extract them, mix them back together.
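To make that intuition concrete, here is a heavily simplified sketch of the pipeline in Python. It is not the paper's actual method: I use plain k-means over peak coordinates where the paper uses a much more sophisticated clustering over perceptually motivated similarity cues, and the file name, peak count, and panning positions are made-up placeholders.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

# Load a mono recording (placeholder file name)
y, sr = librosa.load("mono_track.wav", sr=None, mono=True)

# Step 1: break the recording into short snippets (STFT frames)
D = librosa.stft(y, n_fft=2048, hop_length=512)
mag = np.abs(D)

# Step 1 (cont.): keep the strongest frequency peaks per frame
n_peaks = 5
peak_bins = np.argsort(mag, axis=0)[-n_peaks:, :]             # (n_peaks, n_frames)
frames = np.broadcast_to(np.arange(mag.shape[1]), peak_bins.shape)
peaks = np.stack([peak_bins.ravel(), frames.ravel()], axis=1).astype(float)

# Step 2: cluster peaks into "sound sources" (k-means over raw
# frequency/time coordinates is a crude stand-in for the paper's method)
n_sources = 3
labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(peaks)

# Step 3: decide where each source sits in the stereo field (manual step)
pans = [0.1, 0.5, 0.9]                                        # 0 = left, 1 = right

# Steps 4 + 5: extract each source via a binary spectrogram mask,
# pan it, and sum everything into the final stereo mix
left, right = np.zeros_like(D), np.zeros_like(D)
for src in range(n_sources):
    mask = np.zeros(D.shape)
    pts = peaks[labels == src].astype(int)
    mask[pts[:, 0], pts[:, 1]] = 1.0
    left += (1.0 - pans[src]) * D * mask
    right += pans[src] * D * mask

stereo = np.stack([librosa.istft(left, hop_length=512),
                   librosa.istft(right, hop_length=512)])
```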
A Quick Workaround: Source Separation / Stem Splitting
A lot has happened since Lagrange’s 2007 paper. Since Deezer released their stem splitting tool Spleeter in 2019, AI-based source separation systems have become remarkably useful. Leading players such as Lalal.ai or Audioshake make a quick workaround possible:
- Separate a mono recording into its individual instrument stems using a free or commercial stem splitter
- Load the stems into a Digital Audio Workstation (DAW) and mix them together to your liking (a code sketch of this workflow follows below)
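As a rough illustration of this workflow, the sketch below uses Spleeter's Python API for the splitting step and then applies simple constant panning as a stand-in for the manual DAW step. The file names, stem gains, and pan positions are placeholders I made up.

```python
import numpy as np
import soundfile as sf
from spleeter.separator import Separator

# Step 1: split the recording into four stems
# (Spleeter writes one folder per input file into the destination)
separator = Separator("spleeter:4stems")
separator.separate_to_file("mono_track.wav", "stems/")

# Step 2 (stand-in for the DAW): pan each stem and sum into a stereo mix
pans = {"vocals": 0.5, "drums": 0.5, "bass": 0.5, "other": 0.15}  # 0 = left, 1 = right
left = right = 0.0
for stem, pan in pans.items():
    audio, sr = sf.read(f"stems/mono_track/{stem}.wav")
    if audio.ndim > 1:                     # collapse to mono before re-panning
        audio = audio.mean(axis=1)
    left = left + (1.0 - pan) * audio
    right = right + pan * audio

sf.write("stereo_mix.wav", np.stack([left, right], axis=1), sr)
```

In practice, a human would do the second step by ear in a DAW, with far more than static pan positions per stem.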
This technique was already used in a 2011 research paper (see [2]), but it has become much more viable since then due to recent improvements in stem separation tools.
The downside of source separation approaches is that they produce noticeable sound artifacts, since source separation itself is still not flawless. Additionally, these approaches still require manual mixing by humans, making them only semi-automatic.
To fully automate mono-to-stereo upmixing, machine learning is required. By learning from real stereo mixes, an ML system can adopt the mixing style of real human producers.
Machine Learning with Parametric Stereo
One very creative and efficient way of using machine learning for mono-to-stereo upmixing was presented at ISMIR 2023 by Serrà and colleagues [3]. Their work is based on a music compression technique called parametric stereo. Stereo mixes consist of two audio channels, which makes them hard to deliver in low-bandwidth settings such as music streaming, radio broadcasting, or telephone connections.
Parametric stereo is a technique to create stereo sound from a single mono signal by focusing on the important spatial cues our brain uses to determine where sounds are coming from. These cues are:
- How loud a sound is in the left ear vs. the right ear (Interchannel Intensity Difference, IID)
- How in sync it is between left and right in terms of time or phase (Interchannel Time or Phase Difference, ITD/IPD)
- How similar or different the signals are in each ear (Interchannel Correlation, IC)
Using these parameters, a stereo-like experience can be created from nothing more than a mono signal.
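To illustrate how far these cues alone can carry you, here is a toy decoder that applies per-bin intensity differences and a crude decorrelator to a mono signal. The parameter values and the frame-delay decorrelation are made up for illustration; real parametric stereo codecs use carefully designed all-pass decorrelators and band-wise parameters.

```python
import numpy as np
import librosa

# Mono signal (placeholder file name) and its complex spectrogram
y, sr = librosa.load("mono_track.wav", sr=None, mono=True)
D = librosa.stft(y, n_fft=1024, hop_length=256)
n_bins = D.shape[0]

# Made-up spatial parameters: one IID (in dB) and one correlation per bin
iid_db = np.linspace(-6.0, 6.0, n_bins)      # low bins lean left, high bins right
ic = np.full(n_bins, 0.8)                    # fairly correlated channels

# Split the intensity difference between the channels so that
# 20 * log10(gain_l / gain_r) equals iid_db
gain_l = 10.0 ** (iid_db / 40.0)
gain_r = 10.0 ** (-iid_db / 40.0)

# Crude decorrelator: a two-frame delayed copy of the spectrogram
decor = np.roll(D, shift=2, axis=1)
w = np.sqrt(1.0 - ic**2)[:, None]
L = gain_l[:, None] * (ic[:, None] * D + w * decor)
R = gain_r[:, None] * (ic[:, None] * D - w * decor)

stereo = np.stack([librosa.istft(L, hop_length=256),
                   librosa.istft(R, hop_length=256)])
```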
This is the approach the researchers took to develop their mono-to-stereo upmixing model:
- Collect a large dataset of stereo music tracks
- Convert the stereo tracks to parametric stereo (mono + spatial parameters)
- Train a neural network to predict the spatial parameters given a mono recording
- To turn a new mono signal into stereo, use the trained model to infer spatial parameters from the mono signal and combine the two into a parametric stereo experience
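A minimal sketch of steps 2 and 3 in PyTorch might look as follows. The network maps a mono spectrogram to per-band IID and correlation values; all shapes, layer sizes, targets, and the loss are my own assumptions, not the architecture from [3].

```python
import torch
import torch.nn as nn

class Mono2Params(nn.Module):
    """Predict per-band spatial parameters (IID, IC) from a mono spectrogram."""
    def __init__(self, n_bands=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_bands, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.iid_head = nn.Conv1d(128, n_bands, kernel_size=1)  # unbounded, in dB
        self.ic_head = nn.Conv1d(128, n_bands, kernel_size=1)   # squashed to [0, 1]

    def forward(self, mono_spec):                # (batch, n_bands, n_frames)
        h = self.encoder(mono_spec)
        return self.iid_head(h), torch.sigmoid(self.ic_head(h))

model = Mono2Params()

# Dummy "dataset": magnitude spectrograms of both stereo channels, from
# which the mono downmix and the target parameters are derived (step 2)
left_mag = torch.rand(8, 64, 256) + 1e-3
right_mag = torch.rand(8, 64, 256) + 1e-3
mono_spec = 0.5 * (left_mag + right_mag)              # mono downmix
iid_true = 20.0 * torch.log10(left_mag / right_mag)   # intensity difference (dB)
ic_true = torch.rand(8, 64, 256)                      # placeholder correlation

# Step 3: one training step (optimizer omitted for brevity)
iid_pred, ic_pred = model(mono_spec)
loss = (nn.functional.mse_loss(iid_pred, iid_true)
        + nn.functional.mse_loss(ic_pred, ic_true))
loss.backward()
```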
Currently, no code or listening demos seem to be available for this paper. The authors themselves confess that “there is still a gap between professional stereo mixes and the proposed approaches” (p. 6). Still, the paper outlines a creative and efficient way to accomplish fully automated mono-to-stereo upmixing using machine learning.
Generative AI: Transformer-based Synthesis
Now we get to the seemingly most straightforward way to generate stereo from mono: training a generative model that takes a mono input and synthesizes both stereo output channels directly. Although conceptually simple, this is by far the most challenging approach from a technical standpoint. One second of high-resolution audio has 44.1k data points per channel. Generating a three-minute song with stereo channels therefore means generating over 15 million data points (44,100 × 180 × 2 ≈ 15.9 million).
With today's technologies, such as convolutional neural networks, transformers, and neural audio codecs, the complexity of the task is starting to become manageable. There are some papers that chose to generate stereo signals through direct neural synthesis (see [4], [5], [6]). However, only [5] trains a model that can solve mono-to-stereo generation out of the box. My intuition is that there is room for a paper that builds a dedicated model for the “simple” task of mono-to-stereo generation and focuses 100% on solving this objective. Anyone here looking for a PhD topic?
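To make the codec-based framing concrete, here is a toy sketch: a transformer reads the codec tokens of a mono signal and predicts token logits for both stereo channels. The vocabulary size, dimensions, and two-head design are illustrative assumptions rather than the setup of [4], [5], or [6], and a real system would decode the predicted tokens back to audio with the neural codec.

```python
import torch
import torch.nn as nn

class Mono2StereoTransformer(nn.Module):
    """Map mono codec tokens to token logits for the left/right channels."""
    def __init__(self, vocab=1024, dim=256, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.left_head = nn.Linear(dim, vocab)    # logits for left-channel tokens
        self.right_head = nn.Linear(dim, vocab)   # logits for right-channel tokens

    def forward(self, mono_tokens):               # (batch, seq_len) integer tokens
        h = self.backbone(self.embed(mono_tokens))
        return self.left_head(h), self.right_head(h)

model = Mono2StereoTransformer()
mono_tokens = torch.randint(0, 1024, (2, 512))    # dummy codec token batch
left_logits, right_logits = model(mono_tokens)    # (2, 512, 1024) each
```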