$\rightarrow~~~ $ REVERB, CHiME series
ANR VocADom pipeline
Overview of the talk and contributions of the thesis
Part I: Speaker localization
Localize the target speaker
    $\rightarrow~~~ $ Use the wake-up word to discriminate the target speaker against interfering speakers
Part II: Speech extraction & separation
Recover speech signals from a reverberant, noisy, multi-speaker recording
    $\rightarrow~~~ $ Analyze the impact of localization errors on speech extraction
    $\rightarrow~~~ $ Speech separation using an iterative strategy
Part III: Explaining neural network outputs
Are some noises better than others for training a speech enhancement network, i.e., do they lead to better network outputs?
    $\rightarrow~~~ $ Use feature attribution methods to explain different model outputs
Part I: Speaker Localization
Signal mixing model
$$
\mathbf{c}_j(t) = \mathbf{a}_j \star s_j(t)
$$
where $\mathbf{a}_j(\tau)$ is the room impulse response
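A minimal numpy sketch of this convolutive model (array shapes and the helper name are my own):

```python
import numpy as np
from scipy.signal import fftconvolve

def spatial_image(rir, source):
    """Convolve a dry source s_j(t) with its multichannel room impulse
    response a_j (shape I x L) to obtain the spatial image c_j (I x T)."""
    return np.stack([fftconvolve(h, source)[:len(source)] for h in rir])
```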
    $\rightarrow~~~ $ Iteratively estimate mask & DOA
Subspace methods
    $\rightarrow~~~ $ MUSIC
Learning-based methods
Pre-DNN based methods
    $\rightarrow~~~ $ GMM, SVM
DNN-based models
    $\rightarrow~~~ $ CNN, CRNN
Generalized Cross Correlation with PHAse Transform (GCC-PHAT)
Compute the weighted cross-correlation between signals at two microphones
Knapp, C. and Carter, G. (1976) The generalized correlation method for estimation of time delay. TASSP
Linear transformation of cosine-sine interchannel phase difference (CSIPD) features
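A minimal sketch of GCC-PHAT delay estimation between two microphone signals (the function name and the small epsilon guard are my own):

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the TDOA between two mic signals via the PHAT-weighted
    cross-correlation, evaluated with FFTs."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12       # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-shift:], cc[:shift + 1]))  # center zero lag
    return (np.argmax(np.abs(cc)) - shift) / fs         # delay in seconds
```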
Generalized Cross Correlation with PHAse Transform (GCC-PHAT) with 2 speakers
$\mathbf{x} = \mathbf{c}_1 + \mathbf{c}_2 + \textbf{noise}$; $\mathbf{c}_j^D \rightarrow$ direct component of the spatial image of speaker $j$
Contribution to speaker localization
Want to localize the speaker who uttered the keyword $\rightarrow$ New Task
With respect to other work
Localize one particular speaker in a mixture, not all
Interested in the speaker who uttered the wake-up word
Challenges
Localization is computed using cues derived from multichannel signals
    $\rightarrow~~~ $ what has text got to do with it?
Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr. Keyword-based speaker localization: Localizing a target speaker in a multi-speaker environment. In Interspeech, Sep 2018, Hyderabad, India.
Idea
Exploit time-frequency bins dominated by the target speaker only
Mixture spectrogram
Mask for target speaker, $\mathcal{M}$
Proposed approach
STEP 1: Wake-up word detection
STEP 2: Obtain the corresponding spectrogram from the phone spectra database
STEP 3: Estimate target mask
STEP 4: CSIPD $\times$ target mask ⇒ [DNN] ⇒ DOA
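A sketch of the features fed to the DOA network in STEP 4 (shapes and naming are mine): the CSIPD features are weighted by the estimated target mask so that target-dominated bins drive the localization cues.

```python
import numpy as np

def masked_csipd(stft_ref, stft_other, target_mask):
    """Cosine-sine interchannel phase differences (CSIPD), weighted by
    the estimated target mask. STFTs and mask have shape F x N."""
    ipd = np.angle(stft_other) - np.angle(stft_ref)
    feats = np.stack([np.cos(ipd), np.sin(ipd)])  # 2 x F x N
    return feats * target_mask                    # keep target-dominated bins
```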
STEP 1: Wake-up word detection
Keyword and alignment found by wake-up word detection system
An HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) system is used in this work
STEP 2: Phone spectra database
Pre-computed by averaging magnitude spectra per phone
Distinct patterns are observed for every phone
Pick spectrum corresponding to the aligned phone
Erdogan, H., Hershey, J. R., Watanabe, S., and Le Roux, J. (2015). Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In ICASSP
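A rough sketch of how such a phone spectra database could be pre-computed from an aligned corpus (the data layout is hypothetical):

```python
import numpy as np
from collections import defaultdict

def build_phone_spectra(corpus):
    """Average magnitude spectra per phone. `corpus` yields pairs of a
    magnitude spectrogram (F x N) and per-frame phone labels (length N)."""
    sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
    for mag, phones in corpus:
        for n, ph in enumerate(phones):
            sums[ph] = sums[ph] + mag[:, n]   # accumulate per-phone spectra
            counts[ph] += 1
    return {ph: sums[ph] / counts[ph] for ph in sums}
```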
Step 3: Target mask
A DNN trained in a supervised fashion is used to estimate the mask
Hershey, J. R., Chen, Z., Le Roux, J., and Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In ICASSP
    $\rightarrow~~~ $ DNN-based methods from raw waveform
Luo, Y. and Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. TASLP
Multichannel speech separation
    $\rightarrow~~~ $ Mask-based beamformers
    $\rightarrow~~~ $ Using phase differences along with magnitude spectra in deep clustering
    $\rightarrow~~~ $ Explicit use of speaker location: TDOA/DOA
Perotin, L., Serizel, R., Vincent, E., and Guérin, A. (2018). Multichannel speech separation with recurrent neural networks from high-order ambisonics recordings. In ICASSP
Chen, Z., Xiao, X., Yoshioka, T., Erdogan, H., Li, J., and Gong, Y. (2018). Multi-Channel overlapped speech recognition with location guided speech extraction network. In SLT
Contributions to speech separation
Use of localization information for speech extraction
Study the impact of localization errors
    $\rightarrow~~~ $ Can large angular distance between speakers compensate for low SIR?
    $\rightarrow~~~ $ Evaluate ASR performance using true speaker location information
Deflation strategy for speech separation
Make speech separation network robust to localization errors
Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr. Analyzing the impact of speaker localization errors on speech separation for automatic speech recognition. In 28th European Signal Processing Conference, Jan 2021, Amsterdam, The Netherlands. (Accepted)
Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr. SLOGD: speaker location guided deflation approach to speech separation. In 45th IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2020, Barcelona, Spain.
Speech extraction given direction-of-arrival information
Step 1: Delay-and-Sum (DS) beamforming using the estimated DOA
Step 2: Estimate a mask corresponding to the target using
Magnitude spectra of the beamformed signal
CSIPD of the beamformed signal with respect to a reference microphone
Step 3: Apply data-dependent beamformer to extract target speech
Delay-and-Sum (DS) beamforming on phase difference
Phase difference of the signal + noise
After DS beamforming
Phase difference at bins dominated by source is zero after DS beamforming
Reduces the phase feature dimension from $I \times (I-1) \times F$ to $2 \times F$
No dependency on the array geometry after DS beamforming
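A minimal STFT-domain delay-and-sum sketch (the far-field assumption, geometry handling, and sign convention are mine):

```python
import numpy as np

def ds_beamform(stft, doa, mic_pos, freqs, c=343.0):
    """Align channels towards a far-field source at azimuth `doa` (radians)
    and average them. stft: I x F x N, mic_pos: I x 2 in meters."""
    direction = np.array([np.cos(doa), np.sin(doa)])
    delays = mic_pos @ direction / c                       # per-mic delays (s)
    steer = np.exp(2j * np.pi * np.outer(delays, freqs))   # I x F steering
    return np.mean(steer[:, :, None] * stft, axis=0)       # F x N output
```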
Dataset
WSJ0-2MIX dataset
$\rightarrow~~~ $ 100% overlap (min version) $|$ No noise or reverberation $|$ Single Channel
WHAM!
$\rightarrow~~~ $ Based on WSJ0-2MIX $|$ Real ambient noise $|$ Single Channel
Multichannel WSJ0-2MIX
$\rightarrow~~~ $ Real and Simulated RIRs $|$ No noise $|$ 8 Channels
Created new dataset: Kinect-WSJ
$\rightarrow~~~ $ Based on the max version of WSJ0-2MIX: overlap is not 100%
$\rightarrow~~~ $ 4 channels with Microsoft Kinect like array geometry
$\rightarrow~~~ $ Real ambient multichannel noise from CHiME-5 dataset
$\rightarrow~~~ $ Angular distance between speakers $>5^\circ$
$\rightarrow~~~ $ Designed to study impact of localization on speech separation
https://github.com/sunits/Reverberated_WSJ_2MIX
Results
[Spectrograms: mixture, true interference mask, true target mask, estimated target mask]
Demo
Simulated data (2 speakers + noise)
    $\rightarrow~~~ $ [Audio demos: mixture $\mathbf{x}$; proposed $\hat{\mathbf{c}}_1$, $\hat{\mathbf{c}}_2$; Conv-TasNet $\hat{\mathbf{c}}_1$, $\hat{\mathbf{c}}_2$; for male-male and female-female mixtures]
Real data
    $\rightarrow~~~ $ [Audio demos: mixture and estimated target]
Different microphone array geometry compared to simulated data
Robustness to DOA estimation errors
Speech extraction performance drops due to localization errors
Iteratively estimate sources using deflation strategy
    $\rightarrow~~~ $ Remove dominant speaker first and then estimate another speaker
Kinoshita, K., Drude, L., Delcroix, M., and Nakatani, T. (2018). Listening to each speaker one by one with recurrent selective hearing networks. In ICASSP
Speaker LOcalization Guided Deflation (SLOGD)
Estimation of the dominant speaker
Permutation invariant training criterion used for training DOA network
$$
\mathcal{L}_{\text{DOA}_1} = \min_{i} \; - \sum_{p=1}^{P}\log\Big(\frac{1}{N}\sum_{n} \mathrm{p}_1(n, \theta_p)\Big)\, \mathbb{I}_{\theta_{i}}(p)
$$
where $\mathbb{I}_{\theta_i}(p)$ is the indicator variable (equal to 1 when $\theta_p = \theta_i$)
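Since the indicator selects a single grid bin $\theta_i$, the sum over $p$ collapses, and the criterion reduces to picking the candidate whose bin has the highest time-averaged posterior. A PyTorch sketch (tensor shapes are assumptions):

```python
import torch

def doa_pit_loss(posterior, candidate_bins):
    """posterior: N x P per-frame DOA posterior p_1(n, theta_p);
    candidate_bins: grid indices of the candidate DOAs theta_i."""
    avg = posterior.mean(dim=0).clamp_min(1e-8)  # (1/N) sum_n p_1(n, theta_p)
    losses = torch.stack([-torch.log(avg[i]) for i in candidate_bins])
    return losses.min()                          # min over candidates i
```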
Speaker LOcalization Guided Deflation (SLOGD)
Estimation of the second speaker
Remove the dominant speaker from the mixture
Estimate the DOA and mask of the non-dominant speaker
Use a data-dependent beamformer to extract sources from masks
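A high-level sketch of the deflation loop; the helper functions stand in for the trained DOA network, mask network, and data-dependent beamformer:

```python
def slogd_separate(mix_stft, n_speakers, estimate_doa, estimate_mask, beamform):
    """Iteratively localize, mask, and remove the currently dominant speaker."""
    residual, sources = mix_stft, []
    for _ in range(n_speakers):
        doa = estimate_doa(residual)         # dominant speaker in the residual
        mask = estimate_mask(residual, doa)
        src = beamform(mix_stft, mask)       # extract from the full mixture
        sources.append(src)
        residual = residual - src            # deflate: remove this speaker
    return sources
```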
Results in word error rate (WER) %
Before separation baseline: $66.5\%$
After separation:
    $\rightarrow~~~ $ Using DOA $|$ True DOA: 35.0 $|$ GCC-PHAT: 54.5 $|$ DOA errors of $5^{\circ}$: 55.9 $|$ $10^{\circ}$: 73.6 $|$ $15^{\circ}$: 75.6
    $\rightarrow~~~ $ Adapting to DOA errors $|$ SLOGD: 44.2
Conv-TasNet: $53.2\%$
Luo, Y. and Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. TASLP
Part III: Explaining Neural Network Outputs
Motivation
Neural networks are black boxes
    $\rightarrow~~~ $ Give impressive performance on multiple tasks
    $\rightarrow~~~ $ Hard to understand reasons for network output
Different ASR results with enhancement models trained using different noises
    $\rightarrow~~~ $ Observed on the CHiME-4 real evaluation dataset
How does noise influence the network?
    $\rightarrow~~~ $ Explain the generalization capability of speech enhancement models
Feature attribution methods
Assign importance to each dimension of the input
For image classification tasks, show which pixels a DNN is looking at
Feature attribution methods
Gradient-based
Gradient of the output with respect to the input
Saliency map
Smooth Grads
Gradients $\times$ Input
Leverage the magnitude and sign of input along with gradients
Integrated Gradients
Layerwise relevance propagation
DeepLift
Deep SHapley Additive Explanations (DeepSHAP)
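For the simplest of these, a gradients $\times$ input sketch in PyTorch (the model and output index are placeholders):

```python
import torch

def grad_x_input(model, x, out_idx):
    """Attribution of each input dimension towards one output unit:
    gradient of that output w.r.t. the input, scaled by the input."""
    x = x.detach().clone().requires_grad_(True)
    y = model(x)
    y.flatten()[out_idx].backward()   # d y[out_idx] / d x
    return (x.grad * x).detach()      # uses input magnitude and sign
```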
Contributions to explaining neural network outputs
Use DeepSHAP to provide importance values for each time-frequency bin
DeepSHAP for regression instead of classification
Derive an objective scalar metric called speech relevance score to quantify feature attributions
Use speech relevance score to explain generalization of speech enhancement models
Computing SHAP values for speech enhancement models
$$\mathbf{x}(t) = \mathbf{c}_1(t) + \text{noise} $$
$\hat{\mathcal{M}} = \mathcal{F}(|\mathbf{X}_1|)$, where $\hat{\mathcal{M}}$ is the estimated mask, $\mathcal{F}$ the model, and $\mathbf{X}_1$ the STFT of $\mathbf{x}_1$
For every $\hat{\mathcal{M}}(n,f)$, compute attributions $\mathbf{\Phi}^{\text{TF}}(n,f)$ for $|\mathbf{X}_1(n',f')| \quad \forall n',f'$
Reduce the number of attributions by computing:
$$
\mathbf{\Phi}^{\text{T}}(n) = \sum_f \mathbf{\Phi}^{\text{TF}}(n,f)
$$
Speech relevance score
A generalizable model should make its decisions based on speech bins, not noise bins
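A numpy sketch of one way to compute such a score, assuming per-bin attributions `phi_tf` and an oracle speech-dominance mask (the exact definition in the thesis may differ):

```python
import numpy as np

def speech_relevance_score(phi_tf, speech_mask):
    """Share of absolute attribution mass on speech-dominated T-F bins,
    in percent. speech_mask is 1 where speech dominates, 0 elsewhere."""
    mass = np.abs(phi_tf)
    return 100.0 * (mass * speech_mask).sum() / (mass.sum() + 1e-12)
```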
Speech relevance scores on simulated dev set of CHiME-4
Speech enhancement model $|$ $\mathcal{F}_\text{CHIME}$ $|$ $\mathcal{F}_\text{SSN}$ $|$ $\mathcal{F}_\text{NET\_SOUND}$
WER, real test set (%) $|$ 11.7 $|$ 14.0 $|$ 15.1
WER, simulated dev set (%) $|$ 6.7 $|$ 7.3 $|$ 7.7
Speech relevance score $\eta$ (%) $|$ 94.8 $|$ 89.6 $|$ 90.3
Baseline WER without enhancement on real test: 25.9%
The better performance of $\mathcal{F}_\text{CHIME}$ is attributed to its better $\eta$ value
WER and $\eta$ for $\mathcal{F}_\text{SSN}$ and $\mathcal{F}_\text{NET\_SOUND}$ are similar
Results: Generalization capability
Experiment
    $\rightarrow~~~ $ Train: Train speech set + matched noise | Test: Train speech set + CHiME noise
    $\rightarrow~~~ $ Train and Test have same speech signals but different noises
Speech relevance scores (%)
Speech enhancement model $|$ $\mathcal{F}_\text{SSN}$ $|$ $\mathcal{F}_\text{NET\_SOUND}$ $|$ $\mathcal{F}_\text{CHIME}$
Train [clean speech + matched noise] $|$ 81.7 $|$ 82.5 $|$ 81.7
Test [clean speech + CHiME noise] $|$ 74.4 $|$ 58.6 $|$ -
Difference $|$ 7.3 $|$ 23.9 $|$ -
$\mathcal{F}_\text{SSN}$ has better generalization capability than $\mathcal{F}_\text{NET\_SOUND}$
Conclusion
Summary of the thesis
Speaker localization
Localize the target speaker who uttered a known text such as the wake-up word
Masks were found to be effective target identifiers
Use of spoken text decreased the gross error rate ($>5^{\circ}$) by 72%
Speech separation
Analyzed performance of speech extraction given true speaker location
Proposed a deflation approach to separate speech using estimated speaker location
The proposed method was shown to outperform Conv-TasNet by 17% relative WER
Summary of the thesis
Explaining neural network model output
Methods to explain the inner working of DNN-based speech enhancement models
Feature attribution method called DeepSHAP was used
Proposed metric to evaluate feature attribution
The low speech relevance score difference of 7.3% with SSN noise, compared to 23.9% with network noise, shows the better generalization of $\mathcal{F}_\text{SSN}$
Future works
Speaker localization
End-to-End localization from raw waveform: Trainable filterbanks instead of STFT
Different target identifiers: Speaker identity instead of text
Speech separation
Joint separation and ASR: Train localization, separation and ASR jointly
Using visual cues for speech separation: Helps localization and ASR
Explaining speech enhancement model output
Using feature attributions to improve model architecture
    $\rightarrow~~~ $ Better $\eta$ ⇒ Better model