NMF-based Improvement of DNN and LSTM Pre-Training for Speech Enhancement

Safari Dehnavi, Razieh; Seyedin, Sanaz

doi:10.61186/itrc.15.3.53

Volume 15, Issue 3 (9-2023) itrc 2023, 15(3): 53-65

Safari Dehnavi R, Seyedin S. NMF-based Improvement of DNN and LSTM Pre-Training for Speech Enhancement. itrc 2023; 15(3): 53-65
URL: http://journal.itrc.ac.ir/article-1-555-en.html

Razieh Safari Dehnavi¹, Sanaz Seyedin²

1- Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran
2- Department of Electrical Engineering, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran, sseyedin@aut.ac.ir

Abstract:

A novel pre-training method is proposed to improve the performance of deep neural networks (DNN) and long short-term memory (LSTM) networks for speech enhancement and to mitigate the local-minimum problem. We propose initializing the last-layer weights of the DNN and LSTM with the transposed non-negative matrix factorization (NMF) basis matrix instead of random values. Because NMF can extract speech features even in the presence of non-stationary noise, this initialization leads to faster and more successful network convergence than previous pre-training methods. We also propose using the NMF basis matrix in the first layer together with another pre-training method. To further improve results, we train an individual model for each noise type, based on a noise classification strategy. Evaluation on TIMIT data shows that the proposed method significantly outperforms the baselines in terms of perceptual evaluation of speech quality (PESQ) and other objective measures. The PESQ gain over the baselines is up to 0.17, an improvement of 3.4%.
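
The last-layer initialization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the multiplicative-update NMF solver, the helper names `nmf_basis` and `init_output_layer`, the toy random "spectrogram", and the convention that the output layer computes `y = h @ weights + bias` are all assumptions made for illustration.

```python
import numpy as np

def nmf_basis(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n_freq x n_frames) as V ~= W @ H.

    Uses Lee-Seung multiplicative updates for the Euclidean cost.
    This solver choice is an assumption; the abstract does not name one.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps    # basis matrix
    H = rng.random((rank, n_frames)) + eps  # activation matrix
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def init_output_layer(W_basis):
    """Hypothetical helper: build last-layer parameters from the NMF basis.

    Assumes the output layer maps `rank` hidden units to `n_freq` outputs
    as y = h @ weights + bias, so the weights are the transposed basis.
    """
    weights = W_basis.T.copy()            # shape (rank, n_freq)
    bias = np.zeros(W_basis.shape[0])     # shape (n_freq,)
    return weights, bias

# Toy non-negative data standing in for a clean-speech magnitude spectrogram.
rng = np.random.default_rng(1)
V = rng.random((64, 300))

W, H = nmf_basis(V, rank=16)
weights, bias = init_output_layer(W)

# If the last hidden layer produced the NMF activations, the initialized
# output layer would reproduce the NMF reconstruction frame by frame.
reconstruction = H.T @ weights + bias     # shape (n_frames, n_freq)
```

Under these assumptions the network starts from an output layer that already maps activation-like hidden features to a spectrum estimate, rather than from random weights; all layers would then be fine-tuned as usual.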

Keywords: pre-training, deep neural networks (DNN), long short-term memory (LSTM), non-negative matrix factorization (NMF), speech enhancement, basis matrix, noise classification

Type of Study: Research | Subject: Communication Technology

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
