Does anyone know a good engine for making your own text-to-speech in English?

Diongoespew!

eldritch horror
Defender of Defoko
I mean something similar to TALQu but for English, or anything that lets you make your own text-to-speech voice. Bonus points for any similarities to UTAU or other voice synths where you can share your voicebank. I highly doubt there's one like this out there, but I'm very interested to see! T_T
 

Purple Gecko

Momo's Minion
Tacotron 2 (which TALQu is based on) can actually be pretty good for English TTS, though it requires some coding.
There is a tutorial, though:
This is the video on how to set up the data

This is the video on how to train it (the synthesis notebook mentioned does not work; use this instead)

Once it's trained, you can download the voice file, rename it to have the extension .pt, and put it into TALQu. You would need the Pro version of TALQu, though, because an English TTS for some reason needs the pronunciation editor box (the third box below the other input boxes); otherwise it would only accept Japanese text and would sound very cursed.

Hope this helped :smile:
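To illustrate the renaming step (a minimal sketch; the checkpoint filename and the helper name are my own examples, not from the tutorial):

```python
from pathlib import Path

def rename_checkpoint_to_pt(checkpoint: str) -> Path:
    """Give a trained Tacotron 2 checkpoint a .pt extension so TALQu will accept it."""
    src = Path(checkpoint)
    dst = src.with_suffix(".pt")
    src.rename(dst)
    return dst

# e.g. rename_checkpoint_to_pt("checkpoint_10000") -> checkpoint_10000.pt
```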
 

Diongoespew!

eldritch horror
Defender of Defoko
Thread starter
Very helpful, thank you!!!!
 

Nashdas

Momo's Minion
Interesting. Thanks for sharing this!

I spent a rushed couple of hours on it (about 40 min of training the model and 45 short recordings of myself, total file duration less than 5 min), and the results were far from satisfactory.

I ran into problems with TensorFlow (the error message said 'tf.contrib' is not supported in TF 2.0+, despite there being a line to select TF 1.x), and none of the code I tried adding from StackExchange answers resolved it. What worked for me was restarting the runtime a few times. Literally turning it off and on again.
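For context, `tf.contrib` was removed in TensorFlow 2.0, so that error usually means the Colab runtime silently came up on TF 2.x despite the 1.x selection line. A tiny version check (a sketch; the `tf_contrib_missing` helper name is mine) shows the condition the restart was fixing:

```python
def tf_contrib_missing(tf_version: str) -> bool:
    """True if this TensorFlow version has dropped tf.contrib (removed in 2.0)."""
    major = int(tf_version.split(".")[0])
    return major >= 2

# In Colab you would compare against tf.__version__ after the TF 1.x selection line.
print(tf_contrib_missing("2.8.0"))   # True: tf.contrib is gone, restart the runtime
print(tf_contrib_missing("1.15.2"))  # False: tf.contrib should import fine
```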

The generated files were either noise or stuttering. The output is only as good as the input, so I'd advise anyone who wants to try this to use much more than 5 minutes of data.
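Before training, it's worth measuring how much audio you actually have. A quick stdlib sketch (assumes standard PCM `.wav` clips in one folder; the `wavs` folder name is just an example):

```python
import wave
from pathlib import Path

def total_wav_seconds(folder: str) -> float:
    """Sum the durations of all .wav files in a folder, in seconds."""
    total = 0.0
    for path in Path(folder).glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total

if __name__ == "__main__":
    minutes = total_wav_seconds("wavs") / 60
    print(f"Dataset size: {minutes:.1f} min")
```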
 

Nashdas

Momo's Minion
I ran it again for about 1.5 hours, this time with under 9 minutes of recordings. (This experiment was sponsored by both procrastination and impatience.)

Still gives static/noise, but now it can actually do words.

Interestingly, it can output some phrases present in my recordings, but not others. I used some sentences from http://festvox.org/cmu_faf/index.html, several from Project Gutenberg's Sherlock Holmes, and a couple from the English Wikipedia page for UTAU.



For anyone reading this and wondering about distribution: part 2 of the YouTube guide (the training one) shows that the model must be publicly accessible on Google Drive to be used in a Colab synthesis notebook (the place where you input the text to generate speech). So, theoretically, someone could share their voicebank this way.
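On that sharing point: a Colab notebook typically needs a direct-download URL rather than the normal Drive share link. Extracting the file ID and rebuilding the URL is simple (a sketch; the helper name and example ID are mine, and the `uc?id=` form is the one tools like gdown use):

```python
import re

def drive_direct_url(share_url: str) -> str:
    """Turn a 'drive.google.com/file/d/<ID>/view' share link into a direct-download URL."""
    match = re.search(r"/d/([\w-]+)", share_url)
    if not match:
        raise ValueError("No Drive file ID found in URL")
    return f"https://drive.google.com/uc?id={match.group(1)}"

print(drive_direct_url("https://drive.google.com/file/d/1AbC_xyz123/view?usp=sharing"))
# https://drive.google.com/uc?id=1AbC_xyz123
```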
 

Thehyami

Ruko's Ruffians
Defender of Defoko

Thank you so much! :smile:
I didn't know there was a text-to-speech system this easy. I tried to train NNSVS but didn't get a usable result, and it's very limited to Japanese.

Here's my result:

 

Thehyami

Ruko's Ruffians
Defender of Defoko

Did you figure out what was wrong?
 

HoodyP39

Momo's Minion
Hey guys, speaking as a Tacotron 2 expert (and one of the Uberduck contributors): to create a really decent voice that covers most phonemes, I suggest at least about 30 minutes of speech data (total WAV duration); the best is about 1-2 hours of speech data (the more the better!). So roughly 1,000-5,000 WAV files of sentences, or more, should be enough for training.
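As a rough back-of-the-envelope check (my own arithmetic, not from any Tacotron 2 documentation), those duration targets translate into clip counts like this:

```python
def clips_needed(target_minutes: float, avg_clip_seconds: float) -> int:
    """Estimate how many recordings it takes to reach a target total duration."""
    return round(target_minutes * 60 / avg_clip_seconds)

# With ~5-second sentences:
print(clips_needed(30, 5))   # 360 clips for the 30-minute minimum
print(clips_needed(120, 5))  # 1440 clips for two hours
```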

As for what type of reclist can be used for training, I suggest either LJSpeech's transcriptions (https://keithito.com/LJ-Speech-Dataset/) or CMU's publicly available CMU Arctic (http://www.festvox.org/cmu_arctic/cmuarctic.data).
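If you go the LJSpeech route, its `metadata.csv` is pipe-delimited: file ID, raw transcript, normalized transcript. A minimal parser (a sketch; the helper name is mine, and the format is the one LJSpeech documents):

```python
import csv

def parse_ljspeech_metadata(path: str) -> list[tuple[str, str]]:
    """Read LJSpeech-style metadata.csv into (file_id, normalized_text) pairs."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE because transcripts may contain literal double quotes.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            file_id, _raw, normalized = row[0], row[1], row[2]
            pairs.append((file_id, normalized))
    return pairs
```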
 

Thehyami

Ruko's Ruffians
Defender of Defoko

Do you use arpabet or alphabet to train the network? Do you also use the "_" and the "-" symbols? If you do, what do you use them for?
 

Diongoespew!

eldritch horror
Defender of Defoko
Thread starter
I HAVE FOUND THE SOLUTION! I was browsing Anna Nyui's website and discovered she did something with CoeFont. From what I can find, it's a do-it-yourself text-to-speech service where you can make your own voice and use other people's, sometimes even commercially.
It's called COEFONT! Right now I think it only does Japanese, but I found it through Anna Nyui. It's tied to your own account, though, so you can't share accounts, or voicebanks that aren't yours, even if you bought them. Thoughts on this?

Here is an example:

On this page she has samples:

And here's a take on it from someone I think is an UTAU user:



CoeFont website: https://coefont.cloud/
 

HoodyP39

Momo's Minion
You don't really have to use Arpabet for training, which means you can just use normal alphabetic sentences. Arpabet conversion is optional, but if you want it for slightly more accurate pronunciation, it's best to use a Tacotron 2 fork that has Arpabet conversion built in. Also, you don't really use the "_" symbol outside of filenames, and "-" can be used in your training transcript with its normal meaning (for example in words like "air-space" or "merry-go-round").
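To make the Arpabet option concrete: forks that support it usually expect phoneme sequences wrapped in curly braces in the transcript (the convention NVIDIA's Tacotron 2 text frontend uses), looked up in a dictionary like CMUdict. A toy sketch with a tiny hand-written dictionary (`TOY_DICT` and the helper are mine; a real setup would load the full CMUdict):

```python
# Tiny stand-in for CMUdict; a real setup would load the full dictionary.
TOY_DICT = {
    "hello": "HH AH0 L OW1",
    "world": "W ER1 L D",
}

def to_arpabet(sentence: str) -> str:
    """Wrap known words in {curly-brace} Arpabet; leave unknown words as plain text."""
    out = []
    for word in sentence.lower().split():
        if word in TOY_DICT:
            out.append("{" + TOY_DICT[word] + "}")
        else:
            out.append(word)
    return " ".join(out)

print(to_arpabet("hello world"))  # {HH AH0 L OW1} {W ER1 L D}
```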
 

Thehyami

Ruko's Ruffians
Defender of Defoko
Ooh thanks! :smile:
 
