PromptSpeaker

1. Abstract

Text-guided content generation has recently received extensive attention. In this work, we explore text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model. The prompt encoder predicts a prior distribution from the text description and samples from this distribution to obtain a semantic representation. The Glow model then converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice from the speaker representation. Objective metrics verify that PromptSpeaker can generate new speakers that differ from those in the training set, and subjective evaluation shows that the synthetic voices match the speaker prompts reasonably well.
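The sampling path described in the abstract can be illustrated with a toy sketch. This is not the PromptSpeaker implementation: `toy_prompt_encoder`, `ToyAffineFlow`, and `generate_speakers` are hypothetical names, the encoder is a deterministic placeholder rather than a trained text model, the element-wise affine map only stands in for Glow, and the VITS synthesis step is omitted. The sketch assumes a diagonal Gaussian prior and shows just the data flow: predict a distribution from the prompt, sample a semantic vector, and push it through an invertible map to get a speaker vector.

```python
import math
import random


def toy_prompt_encoder(prompt, dim=4):
    """Hypothetical stand-in for the prompt encoder.

    Returns the mean and log-variance of a diagonal Gaussian prior.
    A real encoder would be a trained text model; here the values are
    derived deterministically from the characters of the prompt.
    """
    rng = random.Random(sum(ord(c) for c in prompt))
    mean = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    log_var = [rng.uniform(-2.0, 0.0) for _ in range(dim)]
    return mean, log_var


def sample_semantic(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


class ToyAffineFlow:
    """Element-wise invertible affine map, used here in place of Glow."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.log_scale = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        self.shift = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    def forward(self, z):
        # Semantic representation -> speaker representation.
        return [math.exp(s) * x + b
                for x, s, b in zip(z, self.log_scale, self.shift)]

    def inverse(self, y):
        # Speaker representation -> semantic representation.
        return [(v - b) * math.exp(-s)
                for v, s, b in zip(y, self.log_scale, self.shift)]


def generate_speakers(prompt, n, seed=0, dim=4):
    """Sample n speaker vectors for one prompt (synthesis is omitted)."""
    mean, log_var = toy_prompt_encoder(prompt, dim)
    flow = ToyAffineFlow(dim)
    rng = random.Random(seed)
    return [flow.forward(sample_semantic(mean, log_var, rng))
            for _ in range(n)]


if __name__ == "__main__":
    for vec in generate_speakers("I want a cute little boy's voice.", n=4):
        print([round(v, 3) for v in vec])
```

Because each call to `sample_semantic` draws a different point from the same prior, one prompt yields several distinct speaker vectors, which mirrors how a single speaker prompt can produce multiple generated speakers.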

2. Training Data

Dataset	Speaker Prompt	Sample-1	Sample-2
Internal Stylistic	我想要一个机灵、童趣、可爱的女声。 (I want a witty, childlike, and lovable female voice.)
Internal Stylistic	我想要一个低沉、温柔、舒缓的女生的声音。 (I want a low, gentle, and soothing female voice.)
AISHELL-3	给我一个成年女生的音色 (Give me the voice of an adult female.)
AISHELL-3	给我一个成年男生的音色 (Give me the voice of an adult male.)
DiDiSpeech	给我一个成年男生的音色 (Give me the voice of an adult male.)
DiDiSpeech	给我一个成年女生的音色 (Give me the voice of an adult female.)

3. Experimental Results

Speaker Prompt	Generated Speaker-1	Generated Speaker-2	Generated Speaker-3	Generated Speaker-4
我想要一个萌萌的小男孩的声音 (I want a cute little boy's voice.)
我想要一个温柔的大姐姐的声音 (I want a gentle elder sister's voice.)
我想要生成甜美可爱的女生的声音 (I want to generate a sweet and lovely girl's voice.)
我想要生成阳光积极的男生的声音 (I want to generate a sunny and positive boy's voice.)
我想要生成可爱小女孩的声音 (I want to generate a cute little girl's voice.)
我想听老大爷的声音 (I want to hear the voice of an old man.)
我想要生成磁性、低沉的男声的声音 (I want to generate a magnetic, low male voice.)