DEEP LEARNING IN SPEECH SYNTHESIS SYSTEMS
DOI: 10.31673/2786-8362.2025.013553
Abstract
Deep learning systems automate complex tasks that previously required human intelligence, and do so
with high accuracy. Deep learning uses artificial neural networks with many layers, each of which
processes information at a progressively higher level of abstraction. This allows the system to learn
high-level features of speech such as emotion, intonation, and expressiveness. Modeling these features
makes synthesized speech more natural and easier for listeners to perceive. Compared with traditional
speech synthesis methods, such as formant synthesis, concatenative synthesis, and approaches based on
hidden Markov models (HMMs), deep learning provides much greater flexibility and sound quality. In
traditional systems, speech was assembled from pre-recorded fragments or generated according to
predefined rules, which limited the naturalness, intonational richness, and emotional coloring of the
voice. Deep learning thus overcomes key limitations of traditional approaches and opens up new
opportunities in voice technologies, from text-to-speech to full-fledged emotional communication
between humans and machines. The article considers the main areas of application of deep learning in
speech synthesis and analyzes existing approaches to building synthesis systems, along with their
strengths and weaknesses.
Keywords: deep learning, neural network, synthesized speech