Katsuya

@kn

Yeah, I don't have good intuition for the dataset size required, but my guess is a lot less than TTS, maybe similar to ASR given the one-to-many problem. So I was thinking there are enough public speech datasets (>10k hrs), plus non-speech datasets, which can be mixed together to synthesize training data.
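Rough sketch of what I mean by mixing, assuming the simplest setup: overlay a non-speech clip on a speech clip at a chosen signal-to-noise ratio. The function names, the toy signals, and the 5 dB value are made up for illustration; real data would come from audio files.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so speech-to-noise ratio equals snr_db, then add it."""
    if len(noise) < len(speech):
        # Loop the noise clip so it covers the whole utterance.
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise clip: nothing to mix in
    # SNR in dB is 20 * log10(rms_speech / rms_noise), solved for rms_noise.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy stand-ins for real clips (hypothetical): a 220 Hz tone as "speech"
# and uniform random samples as "non-speech", both at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1, 1) for _ in range(8000)]
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

Sampling a different SNR and a different non-speech clip per example would turn one speech corpus into many distinct training mixtures.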