I. Introduction
Speech synthesis has made remarkable strides in recent years, thanks to advances in deep learning and neural networks. Today, we can generate synthetic speech that is often indistinguishable from human speech. Despite these achievements, one research problem remains difficult: cloning the voice of a speaker who was unseen during training. This is a significant challenge for text-to-speech (TTS) models because accurately mimicking a specific speaker's voice requires meticulous tuning of speech factors, including timbre, accent, and other characteristics unique to that speaker.