UMX iterative Wiener expectation maximization for non-STFT time-frequency transforms

So, I’m using a different TF transform for my model (the Nonstationary Gabor Transform).

The shape is more like a list of tensors for each frequency bin of analysis:

STFT (pseudocode):

Tensor([1025, 235])
# 1025 frequencies assuming a window size of 2048
# 235 time frames, considering window 2048, overlap 512

My TF transform (pseudocode):

[Tensor([1, 8]), Tensor([1, 16]), Tensor([1, 32]), ...]
# frequency bin 1 has 8 time frames
# frequency bin 2 has 16 time frames
# frequency bin 3 has 32 time frames
...

This is pretty easily adapted in the actual model itself: rather than learning how all the frequencies change with a uniform time resolution, I have 1 mini-openunmix per frequency bin.
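As a toy sketch of the per-bin idea (made-up layer sizes and names, not my actual model):

    import torch
    import torch.nn as nn

    # Toy sketch of "one mini-openunmix per frequency bin": each bin gets its own
    # small recurrent model, run at that bin's own time resolution.
    class PerBinModel(nn.Module):
        def __init__(self, nb_bins, hidden_size=32):
            super().__init__()
            self.lstms = nn.ModuleList(
                [nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
                 for _ in range(nb_bins)]
            )
            self.heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(nb_bins)])

        def forward(self, blocks):
            # blocks: list of magnitude tensors, one per bin, shape (batch, nb_frames_for_this_bin)
            out = []
            for lstm, head, x in zip(self.lstms, self.heads, blocks):
                h, _ = lstm(x.unsqueeze(-1))      # (batch, frames, hidden)
                out.append(head(h).squeeze(-1))   # (batch, frames): predicted magnitude
            return out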

However, I have trouble with the Wiener/EM step, since the code for that is trickier.

Adapting it to fit my new TF transform would be more difficult.

What I do now is:

  1. Get 4 target waveform estimates using my model magnitude prediction + mix phase
  2. Transform the 4 target waveform estimates with the STFT
  3. Apply iterative Wiener + EM with the STFT
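
In code, steps 2 and 3 look roughly like the sketch below (placeholder signals and made-up names; I'm assuming the wiener signature in openunmix.filtering, so check against the installed version):

    import torch
    from openunmix import filtering  # assuming open-unmix-pytorch is installed

    # Sketch of steps 2-3 with placeholder signals; mix/estimates would really come
    # from step 1 (sliCQ-domain magnitude prediction + mix phase, then inverse sliCQ).
    def stft(audio, n_fft=2048, hop=512):
        # (nb_channels, nb_samples) -> complex (nb_frames, nb_bins, nb_channels)
        window = torch.hann_window(n_fft)
        spec = torch.stft(audio, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
        return spec.permute(2, 1, 0)

    mix = torch.randn(2, 44100)                             # placeholder stereo mixture
    estimates = [torch.randn(2, 44100) for _ in range(4)]   # placeholder target waveforms

    mix_stft = torch.view_as_real(stft(mix))                # (frames, bins, channels, 2)
    targets_mag = torch.stack([stft(e).abs() for e in estimates], dim=-1)  # (frames, bins, channels, 4)

    # one EM iteration in the STFT domain (assumed openunmix.filtering.wiener signature)
    filtered = filtering.wiener(targets_mag, mix_stft, iterations=1)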

With 1 iteration, this adds ~0.5-1.0 dB SDR to my score (which is respectable). I'm wondering if this is the expected performance boost from applying iterative Wiener EM.

My concern is that between steps 1 and 2 (using my TF transform for the initial waveform estimates, then transforming those waveforms with the STFT to apply iterative Wiener EM), some error is introduced by passing from one frequency domain to the other through the waveform domain.

Should I try to write the Wiener EM code to fit my own TF transform natively? Or proceed this way?

Hello @sevagh,

~0.5 dB to 1.0 dB sounds fine and is the improvement that we also observe (normally, it is around 0.6 dB for us).

About applying the EM to your Gabor transform: one underlying assumption is that the frequency bins are distributed according to a bivariate (as we have left/right channels) complex Gaussian. Is this also a valid model for your Gabor frequency bins? If yes, then I think you could use the EM and just apply it to each frequency bin individually. I would assume that you lose a little performance (~0.1 dB) by first transforming into the time domain and then back to the frequency domain.
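
For illustration, applying it frequency-bin-wise could look roughly like this (just a sketch with made-up names, assuming the wiener function from openunmix.filtering):

    import torch
    from openunmix import filtering  # assuming open-unmix-pytorch is installed

    # Rough sketch of the frequency-bin-wise idea: treat every ragged bin as a
    # one-bin "STFT" and run the existing Wiener/EM on it.
    # X_blocks: list of complex mixture tensors, one per bin, shape (nb_channels, nb_frames_b)
    # V_blocks: list of target magnitude tensors, shape (nb_channels, nb_frames_b, nb_sources)
    def per_bin_wiener(V_blocks, X_blocks, iterations=1):
        out = []
        for V, X in zip(V_blocks, X_blocks):
            mix = torch.view_as_real(X.permute(1, 0)).unsqueeze(1)  # (frames, 1, channels, 2)
            v = V.permute(1, 0, 2).unsqueeze(1)                     # (frames, 1, channels, sources)
            out.append(filtering.wiener(v, mix, iterations=iterations))
        return out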

Kind regards

Stefan


The Gabor transform (specifically the sliCQ transform, which is the realtime/sliding-window version of the Nonstationary Gabor Transform) has two channels for stereo audio, and the complex coefficients come from the FFT, so I believe it meets those conditions.

I followed your suggestion. I can now apply Wiener EM to the Gabor transform directly (looping over each frequency bin and using the wiener function in open-unmix without much modification - just some reshaping to make each frequency bin look like an STFT).

Alternatively, I can just do it with the STFT (by getting the first estimate of the waveform from the trained network in the sliCQ transform domain and swapping to the STFT transform, as I described first).

Here’s what the SDR values look like (for 1 song from MUSDB18-HQ test set):

1 iteration of STFT Wiener EM:

STFT WIENER
performing bss evaluation
Al James - Schoolboy Facination
 vocals          ==> SDR:   2.727  SIR:   7.992  ISR:   4.126  SAR:   2.082
drums           ==> SDR:   2.702  SIR:   1.920  ISR:   5.649  SAR:   1.765
bass            ==> SDR:   4.082  SIR:  10.806  ISR:   1.770  SAR:   2.128
other           ==> SDR:  -0.939  SIR:  -3.560  ISR:  11.802  SAR:   3.516

real    3m41.839s
user    12m32.746s
sys     0m58.218s

1 iteration of sliCQ Wiener EM:

sliCQ WIENER
performing bss evaluation
Al James - Schoolboy Facination
 vocals          ==> SDR:   2.754  SIR:   7.907  ISR:   4.166  SAR:   2.034
drums           ==> SDR:   2.885  SIR:   1.511  ISR:   5.872  SAR:   1.649
bass            ==> SDR:   4.134  SIR:  11.515  ISR:   1.349  SAR:   2.020
other           ==> SDR:  -0.891  SIR:  -3.382  ISR:  12.241  SAR:   3.458

real    7m59.104s
user    39m28.585s
sys     1m36.704s

2 iterations of STFT Wiener EM:

STFT WIENER
performing bss evaluation
Al James - Schoolboy Facination
 vocals          ==> SDR:   2.368  SIR:   8.594  ISR:   4.081  SAR:   1.094
drums           ==> SDR:   2.669  SIR:   2.600  ISR:   6.071  SAR:   1.466
bass            ==> SDR:   3.892  SIR:  11.786  ISR:   1.576  SAR:   1.796
other           ==> SDR:  -1.620  SIR:  -3.501  ISR:  12.949  SAR:   2.977

real    4m5.906s
user    16m4.784s
sys     1m3.355s

2 iterations of sliCQ Wiener EM:

sliCQ WIENER
performing bss evaluation
Al James - Schoolboy Facination
 vocals          ==> SDR:   2.481  SIR:   8.152  ISR:   4.101  SAR:   1.173
drums           ==> SDR:   2.832  SIR:   1.770  ISR:   6.270  SAR:   1.372
bass            ==> SDR:   3.957  SIR:  12.244  ISR:   1.516  SAR:   1.809
other           ==> SDR:  -1.505  SIR:  -3.357  ISR:  13.263  SAR:   3.029

real    12m55.510s
user    69m42.428s
sys     2m29.679s

The sliCQ Wiener is better (maybe by more than ~0.1 dB), but prohibitively slow, so I can't use it in the competition (the biggest frequency bins have ~20,000 time frames/coefficients). I'm not sure how far I can optimize it :man_shrugging: or whether it's worth the effort.

However, it doesn't seem like a big deal: for these minor gains, 1 iteration of STFT Wiener is good enough.

I’ll try optimizing the Wiener EM code for the Gabor transform, as the performance boost looks to be significant and it might help my competition submissions.

Hello @sevagh,

thanks for this detailed analysis - the results look fine :+1:.

I think you should be able to speed up the computations, as is done for the STFT Wiener filter, by using matrix operations as much as possible. Although there is the problem of a different number of frames for each frequency bin, you could maybe use zero-padding to get the same number - if you normalize correctly, this should give you the same results as doing it frequency-bin-wise. Alternatively, you could look into numba (http://numba.pydata.org/), which could also help.
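
As a small illustration of the zero-padding idea with the simple per-bin shapes from the top of the thread (just a sketch, not tested):

    import torch
    import torch.nn.functional as F

    # illustrative per-bin blocks (real-valued magnitudes here), shapes as in the first post
    blocks = [torch.randn(1, 8), torch.randn(1, 16), torch.randn(1, 32)]

    max_t = max(b.shape[-1] for b in blocks)
    # right-pad each block's time axis with zeros, then stack along the frequency axis
    padded = torch.cat([F.pad(b, (0, max_t - b.shape[-1])) for b in blocks], dim=0)
    # padded: (3, 32) -- a uniform matrix that an STFT-style Wiener/EM can consume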

Kind regards

Stefan

I created a matrix of zeros and simply assigned as many time frames as were available (leaving the rest, to the right, as zeros):

        print('sliCQ WIENER')

        # block-wise wiener
        # assemble it all into a zero-padded matrix

        total_f_bins = 0
        max_t_bins = 0
        for i, X_block in enumerate(X):
            nb_samples, nb_channels, nb_f_bins, nb_slices, nb_t_bins, last_dim = X_block.shape
            total_f_bins += nb_f_bins
            max_t_bins = max(max_t_bins, nb_t_bins)

        X_matrix = torch.zeros((nb_samples, nb_channels, total_f_bins, nb_slices, max_t_bins, last_dim), dtype=X[0].dtype, device=X[0].device)

        freq_start = 0
        for i, X_block in enumerate(X):
            nb_samples, nb_channels, nb_f_bins, nb_slices, nb_t_bins, last_dim = X_block.shape

            # assign up to the defined time bins - to the right will be zeros
            X_matrix[:, :, freq_start:freq_start+nb_f_bins, :, : nb_t_bins, :] = X_block

            freq_start += nb_f_bins

        Xmag_matrix = torch.abs(torch.view_as_complex(X_matrix))
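
The reverse direction (slicing the blocks back out of the padded matrix after the Wiener step) would look roughly like this - a sketch with assumed names, not my exact code:

    # After Wiener/EM on the zero-padded matrix, slice each block's frequency rows
    # and its valid (non-padded) time frames back out into the ragged sliCQ structure.
    # Y_matrix is assumed to share X_matrix's padded layout, possibly with extra trailing dims.
    def unpad_blocks(Y_matrix, block_shapes):
        # block_shapes: list of (nb_f_bins, nb_t_bins) recorded while padding
        Y_blocks = []
        freq_start = 0
        for nb_f_bins, nb_t_bins in block_shapes:
            Y_blocks.append(Y_matrix[:, :, freq_start:freq_start + nb_f_bins, :, :nb_t_bins, ...])
            freq_start += nb_f_bins
        return Y_blocks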

I do the same for all 4 target magnitudes, and it seems to be working fine:

Al James - Schoolboy Facination
 vocals          ==> SDR:   2.753  SIR:   7.953  ISR:   4.158  SAR:   2.024
drums           ==> SDR:   2.864  SIR:   1.677  ISR:   5.835  SAR:   1.657
bass            ==> SDR:   4.140  SIR:  11.599  ISR:   1.350  SAR:   2.016
other           ==> SDR:  -0.893  SIR:  -3.391  ISR:  12.189  SAR:   3.448

real    4m28.061s
user    18m31.605s
sys     1m55.383s

It's a tiny bit slower than the STFT Wiener, but this should pass the Demix challenge without timing out.


Now that I have open-sourced my code, here is the sliCQ Wiener/EM stuff:
