UMX iterative Wiener expectation maximization for non-STFT time-frequency transforms

So, I’m using a different TF transform for my model (the Nonstationary Gabor Transform).

Its output is not a single matrix but a list of tensors, one per analysis frequency bin, each with its own number of time frames:

STFT (pseudocode):

Tensor([1025, 235])
# 1025 frequencies assuming a window size of 2048
# 235 time frames, considering window 2048, overlap 512

My TF transform (pseudocode):

[Tensor([1, 8]), Tensor([1, 16]), Tensor([1, 32]), ...]
# frequency bin 1 has 8 time frames
# frequency bin 2 has 16 time frames
# frequency bin 3 has 32 time frames
...
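To make the difference in layout concrete, here is a small NumPy sketch of the two shapes described above. The frame counts (8, 16, 32) are just the illustrative values from my pseudocode, and the variable names are made up for this example.

```python
import numpy as np

# STFT-like layout: one dense array, every frequency bin shares the same frame count.
stft_like = np.zeros((1025, 235), dtype=np.complex64)

# NSGT-like layout: one array per frequency bin, each with its own frame count.
# The counts below are illustrative, copied from the pseudocode above.
frames_per_bin = [8, 16, 32]
nsgt_like = [np.zeros((1, n), dtype=np.complex64) for n in frames_per_bin]

# A ragged list cannot be stacked into one array without padding or resampling.
shapes = [b.shape for b in nsgt_like]
print(stft_like.shape, shapes)
```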

This is pretty easily adapted in the actual model itself. Rather than learning how all the frequencies change (with a uniform time resolution), I have one mini Open-Unmix per frequency bin.

However, I have trouble in the Wiener/EM step: the code for that is a bit more tricky, and adapting it to fit my new TF transform would be more difficult.

What I do now is:

  1. Get 4 target waveform estimates using my model magnitude prediction + mix phase
  2. Transform the 4 target waveform estimates with the STFT
  3. Apply iterative Wiener + EM with the STFT
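For reference, step 1 can be sketched roughly as follows, assuming the model outputs one magnitude array per frequency bin with the same shape as the mixture coefficients of that bin. The function name `apply_mix_phase` is hypothetical, and the inverse transform back to a waveform (and the later STFT) is left to whatever transform library is in use.

```python
import numpy as np

def apply_mix_phase(pred_mags, mix_coefs):
    """Hypothetical helper: combine predicted magnitudes with the mixture phase.

    pred_mags: list of real arrays, one per frequency bin (ragged frame counts).
    mix_coefs: list of complex arrays with matching shapes (mixture coefficients).
    """
    return [mag * np.exp(1j * np.angle(mix)) for mag, mix in zip(pred_mags, mix_coefs)]

# Toy data with the ragged frame counts from the pseudocode above (illustrative only).
rng = np.random.default_rng(0)
mix_bins = [rng.standard_normal((1, n)) + 1j * rng.standard_normal((1, n)) for n in (8, 16, 32)]
pred_mags = [0.5 * np.abs(m) for m in mix_bins]  # stand-in for a model prediction

est_bins = apply_mix_phase(pred_mags, mix_bins)
# est_bins would then be inverted to a waveform and re-analysed with the STFT (steps 2 and 3).
```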

With 1 iteration, this adds ~0.5-1.0 dB SDR to my score (which is respectable). I’m wondering if this is the performance boost one should expect from applying iterative Wiener EM.

My concern is that between steps 1 and 2 (inverting my TF transform to get the initial waveforms, then transforming those waveforms with the STFT to apply iterative Wiener EM), some error is introduced by passing from one time-frequency domain to the other through the waveform domain.

Should I try to write the Wiener EM code to fit my own TF transform natively? Or proceed this way?

Hello @sevagh,

~0.5 dB to 1.0 dB sounds fine and matches the improvement that we observe as well (normally, it is around 0.6 dB for us).

About applying the EM to your Gabor transform: one underlying assumption is that each frequency bin follows a bivariate complex Gaussian distribution (bivariate because we have left and right channels). Is this also a valid model for your Gabor frequency bins? If yes, then I think you could use the EM and just apply it to each frequency bin individually. I would assume that you lose slightly (~0.1 dB) in performance by first transforming into the time domain and then back to the frequency domain.
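Applying it per frequency bin could look roughly like the NumPy sketch below. It is a simplified point-estimate variant (it re-estimates the power and the 2x2 spatial covariance from the current source estimates and then applies the multichannel Wiener gain), not a copy of any library code, and the names `em_single_bin` and `em_ragged` are hypothetical. It assumes the Gaussian model above holds for each bin and that every bin has enough frames to estimate a spatial covariance.

```python
import numpy as np

def em_single_bin(x, y, iterations=1, eps=1e-10):
    """Hypothetical helper: refine source estimates for ONE frequency bin.

    x: (T, C) complex mixture coefficients (T frames, C channels).
    y: (T, C, J) complex initial estimates for J sources.
    """
    C = x.shape[1]
    y = y.copy()
    for _ in range(iterations):
        # Source power per frame, averaged over channels: (T, J)
        v = np.mean(np.abs(y) ** 2, axis=1)
        # Spatial covariance per source: sum_t y y^H / sum_t v, shape (J, C, C)
        outer = np.einsum('tcj,tdj->jcd', y, np.conj(y))
        R = outer / (eps + v.sum(axis=0))[:, None, None]
        # Mixture covariance per frame, lightly regularised: (T, C, C)
        Cxx = np.einsum('tj,jcd->tcd', v, R) + eps * np.eye(C)
        inv_Cxx = np.linalg.inv(Cxx)
        # Wiener gain W[t, j] = v[t, j] R[j] Cxx[t]^-1, then y_j = W_j x
        W = np.einsum('tj,jcd,tde->tjce', v, R, inv_Cxx)
        y = np.einsum('tjce,te->tcj', W, x)
    return y

def em_ragged(mix_bins, est_bins, iterations=1):
    """Run the per-bin update over a ragged list (one entry per frequency bin)."""
    return [em_single_bin(x, y, iterations) for x, y in zip(mix_bins, est_bins)]

# Toy stereo example: 4 sources, bins with 8, 16 and 32 frames (illustrative values).
rng = np.random.default_rng(0)
mix_bins, est_bins = [], []
for T in (8, 16, 32):
    src = rng.standard_normal((T, 2, 4)) + 1j * rng.standard_normal((T, 2, 4))
    mix_bins.append(src.sum(axis=2))                           # mixture = sum of sources
    est_bins.append(src + 0.1 * rng.standard_normal(src.shape))  # noisy initial estimates

refined = em_ragged(mix_bins, est_bins, iterations=1)
```

Since the Wiener gains of all sources sum to (almost) the identity, the refined estimates in each bin add back up to the mixture, which is a handy sanity check when porting this to a new transform.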

Kind regards

Stefan
