DiffSinger does not work with the samples of a normal UTAU voicebank (CVVC, VCV or CV). To create a DiffSinger voicebank (well, to record it) you have to sing songs a cappella. If you want to add appends, you should sing those songs in that tone. People usually record 1-3 hours of singing, and around 2 hours is what tends to give a better result. In some cases several languages are recorded too. Using recordings made in UTAU formats is not bad, but it is not recommended, since singing is different from recording for UTAU, and what DiffSinger needs is singing.
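If you want to check how much singing you have recorded so far against the 1-3 hours mentioned above, a small script along these lines could help. It is only a rough sketch: the function names `total_hours` and `describe` are made up for this example, it assumes your takes are plain PCM `.wav` files in one folder, and it is not part of DiffSinger or any of its tools.

```python
import wave
from pathlib import Path


def total_hours(folder):
    """Sum the length of every .wav file under `folder`, in hours.

    Assumes plain PCM WAV files; Python's wave module cannot read
    other encodings (for example 32-bit float) and raises wave.Error.
    """
    seconds = 0.0
    for path in sorted(Path(folder).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # frames / frames-per-second = duration in seconds
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 3600


def describe(hours):
    """Compare a total against the 1-3 hour range people usually record."""
    if hours < 1:
        return "below the usual range"
    if hours <= 3:
        return "within the usual 1-3 hour range"
    return "above the usual range"
```

For example, `describe(total_hours("recordings"))` would report where a folder named `recordings` (a placeholder name) stands. Keep in mind this counts whole files, silence included, so the amount of actual singing will be a bit lower.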