So, I’m using a different time-frequency (TF) transform for my model (the Nonstationary Gabor Transform).
Its output is not a single matrix but a list of tensors, one per analyzed frequency bin:
STFT (pseudocode):
Tensor([1025, 235])
# 1025 frequencies assuming a window size of 2048
# 235 time frames, considering window 2048, overlap 512
My TF transform (pseudocode):
[Tensor([1, 8]), Tensor([1, 16]), Tensor([1, 32]), ...]
# frequency bin 1 has 8 time frames
# frequency bin 2 has 16 time frames
# frequency bin 3 has 32 time frames
...
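For reference, the two layouts can be sketched in NumPy as follows. This is only an illustration of the shapes: the hop of 512 samples is an assumption (one reading of "overlap 512"), `n_samples` is a hypothetical signal length chosen to give 235 centered frames, and only the first three ragged bins from the example are shown.

```python
import numpy as np

n_fft, hop = 2048, 512            # assumption: "overlap 512" read as a hop of 512 samples
n_samples = 234 * hop             # hypothetical length that yields 235 centered frames

n_freqs = n_fft // 2 + 1          # one-sided spectrum: 1025 bins
n_frames = 1 + n_samples // hop   # frame count when the signal is center-padded
stft_like = np.zeros((n_freqs, n_frames), dtype=np.complex64)

# Ragged layout: one array per frequency bin, each with its own frame count.
frames_per_bin = [8, 16, 32]      # first three bins from the example above
ragged = [np.zeros((1, t), dtype=np.complex64) for t in frames_per_bin]

print(stft_like.shape)
print([r.shape for r in ragged])
```

Because the per-bin arrays have different lengths, they cannot be stacked into one matrix without resampling or padding, which is why code written for a uniform grid does not carry over directly.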
Adapting the model itself to this is fairly easy. Rather than one network learning how all the frequencies change (at a uniform time resolution), I have one mini-openunmix per frequency bin.
However, I have trouble with the Wiener/EM step, since the code for that is trickier, and adapting it to fit my new TF transform would be more difficult.
What I do now is:
1. Get 4 target waveform estimates from my model’s magnitude prediction combined with the mix phase
2. Transform the 4 target waveform estimates with the STFT
3. Apply iterative Wiener + EM in the STFT domain
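To make the filtering step concrete, here is a minimal sketch of the simplest form of Wiener filtering on a uniform STFT grid: a single-channel soft mask with no EM iterations. It is a simplified stand-in, not the multichannel EM code referred to above. The function name `wiener_soft_mask` and the random input data are hypothetical; in the pipeline above, `target_mags` would be the magnitudes of the STFTs obtained in the second step.

```python
import numpy as np

def wiener_soft_mask(mix_stft, target_mags, eps=1e-10):
    """Split a mixture STFT using per-target magnitude estimates.

    mix_stft:    complex array, shape (n_freqs, n_frames)
    target_mags: real array, shape (n_targets, n_freqs, n_frames)
    returns:     complex array, shape (n_targets, n_freqs, n_frames)
    """
    power = target_mags ** 2
    # Each target receives the share of the mix equal to its share of total power.
    masks = power / (power.sum(axis=0, keepdims=True) + eps)
    return masks * mix_stft[None, :, :]

# Hypothetical data: 4 targets on a 1025 x 235 STFT grid.
rng = np.random.default_rng(0)
mix = rng.standard_normal((1025, 235)) + 1j * rng.standard_normal((1025, 235))
mags = rng.random((4, 1025, 235)) + 0.1
estimates = wiener_soft_mask(mix, mags)
print(estimates.shape)
```

The masks sum to one in every TF cell, so the target estimates add back up to the mixture. Every step relies on all targets sharing the same `(n_freqs, n_frames)` grid, which is the part that would need rethinking for a ragged transform.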
With 1 iteration, this adds ~0.5-1.0 dB SDR to my score (which is respectable). I’m wondering whether this is the expected performance boost from applying iterative Wiener EM.
My concern is with the hand-off between steps 1 and 2: I use my own TF transform to produce the initial waveforms, then transform those waveforms with the STFT to apply iterative Wiener EM. I suspect some error is introduced by “exchanging” one frequency domain for another through the waveform domain.
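One way to put a number on this kind of error is to measure how much a spectrogram changes after a round trip through the waveform. The sketch below does this with a small hand-written STFT/ISTFT pair standing in for both transforms (an assumption; the NSGT itself is not implemented here), with random noise as hypothetical sources and small, arbitrary window and hop sizes.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Centered STFT with a periodic Hann window; returns shape (n_freqs, n_frames)."""
    win = np.hanning(n_fft + 1)[:-1]
    x = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(X, n_fft=256, hop=64):
    """Weighted overlap-add inverse of stft() above."""
    win = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(X.T, n=n_fft, axis=1) * win
    out = np.zeros(n_fft + hop * (frames.shape[0] - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += win ** 2
    out /= np.maximum(norm, 1e-8)
    return out[n_fft // 2:-(n_fft // 2)]  # drop the center padding

# Hypothetical sources: two noise signals and their mixture.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal((2, 4096))
mix = s1 + s2

# Target magnitude combined with the mix phase, as in step 1.
est = np.abs(stft(s1)) * np.exp(1j * np.angle(stft(mix)))
wav = istft(est)   # go to the waveform domain
re = stft(wav)     # and back to the TF domain
rel_change = np.linalg.norm(re - est) / np.linalg.norm(est)
print(f"relative change after round trip: {rel_change:.3f}")
```

Swapping in the real forward and inverse transforms for `stft` and `istft` would give the same measurement for the actual pipeline.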
Should I try to rewrite the Wiener EM code so it works natively on my own TF transform, or keep proceeding this way?