Exploring Audio Tracks using ShortTimeFourier, AudioData, and SpatialMedian
Roman Parker
Intern, Wolfram Research
By using the audio functionality in the Wolfram Language alongside SpatialMedian and ShortTimeFourier, it is possible to combine audio tracks, adjust their volume, and mute one track when another is loud, all at a low level, which provides insight into how Audio computations work. The same tasks can also be accomplished with the higher-level Audio functions, including AudioIntervals and AudioGenerator, which offer a simpler and faster alternative in each case. These techniques have several applications, such as playing background music in a video that is muted whenever the people in the video talk, or synthesizing multiple feeds together whose audio levels can be manually adjusted in different sections.
Importing the Data
For this notebook, the Audio Cats and Dogs dataset from the Wolfram Data Repository is used. This dataset contains 277 labeled audio tracks of cats and dogs of varying lengths. It is a perfect dataset for this purpose because it contains many distinguishable tracks that are not all the same length, making it as close to real-world applications of these programs as possible.
Import the data from the Wolfram Data Repository:
In[]:=
testaudio=Normal[ResourceData["Audio Cats and Dogs"]][[All,1]];
Combining Audio Tracks via Different Techniques
Audio files can be combined with several functions, including Mean, RootMeanSquare, and HarmonicMean, or by taking the SpatialMedian of their ShortTimeFouriers (averageAudio). In this section, these techniques are compared and visualized to show which functions provide the highest quality and/or the fastest combination of the Audio tracks.
Define a function which combines Audio files using ShortTimeFourier and SpatialMedian (an exploration of this function appears at the end of the essay):
averageAudio[l_List]:=Module[{samplerate,fourierlist,inlist,indata,dataofaudio,endpoints,cutdata,fourier,reshapen},
  samplerate=ShortTimeFourier[l[[1]],1024]["SampleRate"];
  fourierlist=(ShortTimeFourier[#,1024]["Data"]&)/@l;
  inlist=Flatten/@ReIm[fourierlist];
  indata=Flatten/@inlist;
  dataofaudio=SpatialMedian[SpatialPointData[PadRight[indata]]];
  endpoints=SequencePosition[dataofaudio,Table[0.,1000]];
  cutdata=dataofaudio[[1;;If[Length[endpoints]==0,-1,endpoints[[1,1]]-1]]];
  fourier=(#1+I #2&)@@Transpose[Partition[cutdata,2]];
  reshapen=Partition[fourier,1024];
  Audio[InverseShortTimeFourier[reshapen],SampleRate->samplerate]];
Combine five of the Audio tracks using six different techniques:
In[]:=
differentAverages=Module[{t15=testaudio[[1;;5]]},{Timing[Mean[t15]],Timing[RootMeanSquare[t15]],Timing[averageAudio[t15]],Timing[GeometricMean[t15]],Timing[HarmonicMean[t15]],Timing[ContraharmonicMean[t15]]}]
Out[]=
Timings in seconds (Audio results not shown): Mean 0.390625, RootMeanSquare 0.015625, averageAudio 6.25, GeometricMean 0.0625, HarmonicMean 0.5, ContraharmonicMean 0.1875
Apply a lowpass filter to make the tracks easier to hear:
In[]:=
LowpassFilter[#,]&/@differentAverages[[All,2]]
Out[]=
From a sound-fidelity perspective, Mean and averageAudio are about equal, while the other functions are not ideal; Mean, however, is significantly faster (around 400x on this data and around 600x on the first 20 elements of the list), giving it the advantage in most cases. Adding the lowpass filter makes all of the "bad" tracks significantly better from a fidelity perspective, but it can cut out important information, and they still are not as good as Mean or averageAudio. Next, various visualizations are created to compare the Audio tracks produced by each of the functions.
Plot a spectrogram of each file (adding LowpassFilter makes nearly no difference):
In[]:=
Column[Spectrogram/@differentAverages[[All,2]]]
Out[]=
Make a MatrixPlot of the ShortTimeFourier (time is on the Y axis, frequency on the X axis, color corresponds to value):
In[]:=
MatrixPlot[ShortTimeFourier[#]["Data"]]&/@differentAverages[[All,2]]
Out[]=
Plot the amplitude over time (nearly identical results with lowpass filter):
In[]:=
AudioPlot[#,PlotRange->All]&/@differentAverages[[All,2]]
Out[]=
Plot a periodogram with and without the lowpass filter:
In[]:=
Periodogram/@differentAverages[[All,2]]
Out[]=
In[]:=
Periodogram[LowpassFilter[#,]]&/@differentAverages[[All,2]]
Out[]=
Throughout all of these plots, Mean and averageAudio (the first and third elements) remain the most similar to each other, although adding a LowpassFilter makes the periodograms look more alike. Below, plots are created to show the difference between the two functions more clearly.
Plot the amplitude over time via AudioPlot:
In[]:=
AudioPlot[differentAverages[[1;;3;;2,2]]]
Out[]=
Plot the amplitude over time via a ListPlot of the AudioData:
averageAudio and Mean both seem to average Audio files in very similar ways, but Mean is far faster. However, averageAudio, because it uses SpatialMedian, is less sensitive to an outlier audio track skewing the result. Additionally, averageAudio, as seen here, produces a significantly louder signal where the original Audio is loud, while having a similar volume where it is quiet. Therefore, with messy data, or where it is important to emphasize the louder sections of a track, averageAudio may be more accurate and worth the time tradeoff.
Volume Manipulation
In this section, three audio tracks are used to demonstrate how to manipulate the volume of an audio track. This can be done both by simply multiplying the audio track by a number and by more involved techniques that allow the volume of different parts of a track to be modified. For manipulating the volume of a track at different times, there are two primary techniques. The first is to manipulate the AudioData directly by multiplying sections of it by different values; this is easier to understand and can more easily be adapted to non-Audio data. The second is to use AudioGenerator to create an audio file with the desired amplitudes which, when multiplied by the original file, changes the volumes. This second technique is about twice as fast and works better when manipulating multiple Audio objects but, because it works at a high level with the Audio objects, can only be used for this specific task.
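As a minimal illustration of the first technique, multiplying an Audio object by a constant scales its amplitude uniformly; this sketch uses the first track of testaudio, with the scaling factors chosen purely for illustration:

```wl
(* uniform volume change: every sample is scaled by the same constant *)
quieter = 0.1*testaudio[[1]];  (* one tenth of the original volume *)
louder = 1.5*testaudio[[1]];   (* one and a half times the original volume *)
```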
Define three new audio tracks:
Calculate the mean and average audio:
Modify the volumes of the various tracks:
Define a function to scale the length of an audio track to be the same as the length of another audio track:
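The definition itself is not included in this export; a minimal sketch of such a function, assuming AudioTimeStretch is used for the stretching (the name scaleToLength is illustrative, not the notebook's original):

```wl
(* stretch track a so that its duration matches that of track b *)
scaleToLength[a_Audio, b_Audio] := AudioTimeStretch[a, Duration[b]/Duration[a]]
```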
Demonstrate this function:
Combine a version of track a stretched to the length of track b with track b:
Define a function to allow for changing the volumes of different parts of a track via the AudioData:
Change the volume of the first half of a track to be 0.1 times the original and the second half to be 1.5:
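The low-level version of such a change can be sketched directly on the AudioData; the name and exact signature here are illustrative, not the notebook's original definition:

```wl
(* scale the two halves of a track by different factors via its sample data *)
halfVolumeChange[a_Audio, r1_, r2_] := Module[{data, half},
  data = First[AudioData[a]];      (* samples of the first channel *)
  half = Floor[Length[data]/2];
  Audio[Join[r1*data[[;;half]], r2*data[[half+1;;]]],
    SampleRate -> AudioSampleRate[a]]]
```

For example, halfVolumeChange[testaudio[[1]], 0.1, 1.5] quiets the first half of the track and amplifies the second.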
Create a helper function which finds the timestamp of a certain point in a track based on how far into the track it is:
Create the second version of volumeChanger which works directly with AudioGenerator and the Audio objects:
Change the volume of the first half of a track to be 0.1 times the original and the second half to be 1.5:
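The high-level version can be sketched by generating an amplitude envelope with AudioGenerator and multiplying it into the original track; again, the name and the piecewise envelope are illustrative, not the notebook's exact definition:

```wl
(* multiply the track by a piecewise-constant envelope generated as Audio *)
envelopeVolumeChange[a_Audio, r1_, r2_] :=
  Module[{dur = QuantityMagnitude[Duration[a]]},
    a*AudioGenerator[If[# < dur/2, r1, r2] &, dur,
      SampleRate -> AudioSampleRate[a]]]
```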
Below is a demonstration of a use of volumeChanger, in which the three audio tracks from earlier have their volumes changed at different points before being combined. This code functions equally well with either volumeChanger or volumeChanger2, but here, volumeChanger2 is used due to its speed.
Create a track with muted and volume-changed sections for each of the three recordings:
Combine the three new tracks:
volumeChanger and volumeChanger2 are quite useful for fixing problems that can occur when tracks are recorded at different times or with different microphones, since it is rare for all tracks to have the same volume. Additionally, using these techniques, it is possible to turn down the volume of a track while it contains only background noise and turn it back up when the important parts of the audio begin. volumeChanger2 is generally more useful because it runs faster and works directly with Audio objects, but volumeChanger works with non-Audio data, in case the values of different sections of another list need to be modified.
Volume-Based Track Muting
The following code creates functions which, given two tracks, can automatically mute the first track when the second is loud. This is done by finding all points which are silent (either by finding points near a near-zero AudioData value or by finding areas with low RMSAmplitude), creating a list of areas to mute, and muting those sections. For example, with two tracks of people speaking, one track can be given priority over the other, so that the other track is muted whenever the priority track has noise on it. Once again, two techniques are used: one that directly manipulates AudioData and one that works with the high-level Audio functions. The AudioData version makes the inner workings easier to understand and is more flexible (with a minor modification, sections of the track can have their volume lowered to a proportional value instead of being muted, which the Audio version cannot do easily). However, the Audio version is thousands of times more efficient, making it the best choice for almost all applications.
Define two more audio tracks:
Define a function which detects when a track is loud and creates a list inverse to the amplitude:
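The definition is not shown in this export; one way to sketch such a muting factor is to threshold the absolute sample values, producing 1 where the track is quiet and 0 where it is loud (the function name and threshold are illustrative):

```wl
(* a list that is 0 wherever the track is loud and 1 elsewhere *)
mutingFactor[a_Audio, threshold_:0.1] :=
  If[Abs[#] > threshold, 0, 1] & /@ First[AudioData[a]]
```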
Demonstrate the muting factor of audio track a (scaled to not be longer than track b):
Define a helper function which creates a range from start to end, but the range will not go lower than 1 or higher than the length of the input list:
Demonstrate this function:
Define another helper function which cuts an audio track to the length of another:
Define a helper function which takes an input of a list (such as the output of createMutingFactor) and takes all points within the distance of scale of a point with a value above 0.95 and sets them to 0:
Provide an example of this function:
Create a version of track d which is muted whenever track e is loud enough (AKA when e is speaking):
Define a function to combine two audio tracks, but mute the non-priority track when the priority track is loud:
Define the high-level Audio version of speakover:
speakover works by detecting all points which are "noisy" (have a large enough amplitude) in the priority track and creating a list which is equal to zero at all points near the noisy ones; this list is multiplied by the non-priority track, and the result is averaged with the priority track. speakover2 instead uses AudioIntervals to find all sections which are loud enough, replaces the corresponding sections of the non-priority track with silence, and averages the new track with the priority track. wheretomute takes a very long time to execute on long lists with a large scale value because it builds lists of hundreds of thousands to millions of elements using ReplacePart, which makes speakover2 much faster.
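A high-level sketch of the second approach, assuming the documented interval forms of AudioIntervals, AudioReplace, and AudioTrim (the name speakoverSketch and the threshold are illustrative, not the notebook's exact definitions):

```wl
(* mute the background track wherever the priority track is loud, then mix *)
speakoverSketch[priority_Audio, background_Audio, threshold_:0.05] :=
  Module[{loud, muted},
    loud = AudioIntervals[priority, #RMSAmplitude > threshold &]; (* loud regions, in seconds *)
    muted = Fold[AudioReplace[#1, #2 -> 0*AudioTrim[background, #2]] &,
      background, loud];
    Mean[{priority, muted}]]
```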
Test that function:
A possible application of speakover or speakover2 is when someone wants background audio but does not want it to play while they are speaking; these functions can temporarily mute the background audio while the second track is making noise.
averageAudio: An Exploration
The section below explains how averageAudio works and provides an additional comparison to Mean. Because Mean is used instead of averageAudio in many places in this code, and averageAudio has limited use cases, this section of relatively lesser importance is left for the end.
Calculate the ShortTimeFourier of the data:
Show the first element of this data:
Show the spectrogram:
Extract the complex numbers which make up the ShortTimeFouriers:
Show the data of the first audio track:
A problem with using ShortTimeFourier, however, is that each track takes the form of a list of lists of complex numbers, with a length that varies with the audio track, while SpatialMedian, the function used here for combining audio tracks, needs one-dimensional lists of real numbers which are all the same length. To handle this, each complex number is first split into two real numbers, one for each part; these lists are then flattened and padded with zeroes to be the same length as the other tracks. Then, after averaging these lists, since the original format of the ShortTimeFouriers is known, the list can be "unflattened" by recombining the numbers into their complex form, grouping everything into lists of length 1024, and grouping those lists into one list, which is the ShortTimeFourier. From this, the InverseShortTimeFourier can be taken, giving a list which can be treated as AudioData and converted into audio.
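The round trip between complex and interleaved real data described above can be sketched on a short list (the real pipeline works with lists of length 1024 per frame):

```wl
(* flatten complex numbers to interleaved real parts, then rebuild them *)
complexData = {1 + 2 I, 3 + 4 I, 5 + 6 I};
flat = Flatten[ReIm[complexData]];              (* {1, 2, 3, 4, 5, 6} *)
rebuilt = (#1 + I*#2 &) @@@ Partition[flat, 2]  (* {1+2I, 3+4I, 5+6I} *)
```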
Convert each complex number into two real numbers, one for each part:
Pad all of the data to be the same length, since the audio tracks have varying lengths:
The following computation is relatively intensive, so for time reasons and to avoid a memory overflow, only the first five audio tracks will be averaged. Later on, there is a version of this code which works with all 277 tracks, but it takes around 6.5 minutes on my computer, so it is easier to demonstrate with this version.
Take the first five elements of this list (for time and memory purposes):
Calculate the spatial median of these points:
For the InverseShortTimeFourier to function properly, the audio track needs to be cut to the right length. For this purpose, the locations of 1000 consecutive zeroes are found. Since audio tracks are rarely completely silent, an exact zero is extremely unlikely to appear, and 1000 consecutive zeroes even less so. However, the shorter audio tracks had zeroes appended to their ends to make all the points the same length, so the end of the track should consist of many zeroes. This allows the track's end to be detected.
Find all segments of 1000 zeroes to find where the end of this track is located:
Cut off the blank noise from the end of the track:
Group the list of real numbers back into complex numbers (for example {1,2,3,4} becomes {1+2*I,3+4*I}):
Convert the flattened list back into an array of length 1024:
Create the audio track from this data and show the spectrogram:
Calculate the mean of the audio:
Although the audio tracks sound quite similar, on the spectrogram, there are several visible differences. It seems that the Fourier-based averaging of audio is more sensitive to individual loud points and less sensitive to quieter noises than the mean, which, depending on the tracks used, could be an advantage. Below, this can be seen more vividly by looking directly at the amplitudes of the Audio.
Plot the ShortTimeFourier average compared to the mean:
We can see that wherever the Mean track "spikes" in volume, the ShortTimeFourier track spikes more, while where the audio is quiet, the two tracks are very similar in volume. This means that if there is a significant amount of quiet but noticeable background noise, using averageAudio is likely better, while if all tracks have a similarly constant volume, Mean's increased speed more than makes up for any disadvantages it could have.
Define a version of the function that does all of the steps required to create a ShortTimeFourier average:
Calculate the average audio of all 277 tracks in the dataset (long and sometimes causes a memory overflow):
Get the mean of all the tracks:
Plot the amplitudes of the two tracks:
Here, many of the same patterns from the five-track version play out, but more muddled due to the increased number of tracks. However, when working with lists of Audio objects of this order of magnitude, averageAudio takes several minutes and uses a lot of memory, limiting its usefulness here. In the sections above, averageAudio is used because only a small number of Audio tracks are involved and the tracks contain some background noise, which is slightly more noticeable when using Mean. However, all of that code also works with Mean in place of averageAudio, so in cases where averageAudio is impractical, Mean can be used instead.
Vocabulary