Does anyone know a good engine for making your own text-to-speech in English?

Diongoespew!

eldritch horror
Defender of Defoko
I mean something similar to TALQu but for English, or anything that lets you make your own text-to-speech voice. Bonus points for any similarities to UTAU or other voice synths where you can share your voicebank. I highly doubt there's one like this out there, but I'm very interested to see! T_T
 

Purple Gecko

Momo's Minion
Tacotron 2 (which TALQu is based on) can actually be pretty good for English TTS, though it requires some coding.
There is a tutorial, though:
This is the video on how to set up the data

This is the video on how to train it (the synthesis notebook mentioned does not work; use this instead)

Once it's trained, you can download the voice file, rename it to have the extension .pt, and put it into TALQu. You would need the Pro version of TALQu, though, because an English TTS for some reason needs the pronunciation editor box (the third box below the other input boxes); otherwise it would only accept Japanese text and would sound very cursed.

Hope this helped :smile:
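To illustrate the renaming step (a minimal sketch; the checkpoint filename and the helper name are my own examples, not from the tutorial):

```python
from pathlib import Path

def rename_checkpoint_to_pt(checkpoint: str) -> Path:
    """Give a trained Tacotron 2 checkpoint a .pt extension so TALQu will accept it."""
    src = Path(checkpoint)
    dst = src.with_suffix(".pt")
    src.rename(dst)
    return dst

# e.g. rename_checkpoint_to_pt("checkpoint_10000") -> checkpoint_10000.pt
```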
 

Diongoespew!

eldritch horror
Defender of Defoko
Thread starter
Very helpful, thank you!!!!
 

Nashdas

Momo's Minion
Interesting. Thanks for sharing this!

I spent a rushed couple of hours on it (about 40 min of training the model and 45 short recordings of myself, total file duration less than 5 min), and the results were far from satisfactory.

I ran into problems with TensorFlow (the error message said 'tf.contrib' is not supported in TF 2.0+, despite there being a line to select TF 1.x), and none of the code I tried adding from StackExchange answers resolved it. What worked for me was restarting the runtime a few times. Literally turning it off and on again.
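For context, `tf.contrib` was removed in TensorFlow 2.0, so that error usually means the Colab runtime silently came up on TF 2.x despite the 1.x selection line. A tiny version check (a sketch; the `tf_contrib_missing` helper name is mine) shows the condition the restart was fixing:

```python
def tf_contrib_missing(tf_version: str) -> bool:
    """True if this TensorFlow version has dropped tf.contrib (removed in 2.0)."""
    major = int(tf_version.split(".")[0])
    return major >= 2

# In Colab you would compare against tf.__version__ after the TF 1.x selection line.
print(tf_contrib_missing("2.8.0"))   # True: tf.contrib is gone, restart the runtime
print(tf_contrib_missing("1.15.2"))  # False: tf.contrib should import fine
```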

The generated files were either noise or stuttering. The output is only as good as the input, so I'd advise anyone who wants to try this to use much more than 5 minutes of data.
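Before training, it's worth measuring how much audio you actually have. A quick stdlib sketch (assumes standard PCM `.wav` clips in one folder; the `wavs` folder name is just an example):

```python
import wave
from pathlib import Path

def total_wav_seconds(folder: str) -> float:
    """Sum the durations of all .wav files in a folder, in seconds."""
    total = 0.0
    for path in Path(folder).glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total

if __name__ == "__main__":
    minutes = total_wav_seconds("wavs") / 60
    print(f"Dataset size: {minutes:.1f} min")
```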
 

Nashdas

Momo's Minion
I ran it again for about 1.5 hours, this time with under 9 minutes of recordings. (This experiment was sponsored by both procrastination and impatience.)

Still gives static/noise, but now it can actually do words.

Interestingly, it can output some phrases present in my recordings, but not others. I used some sentences from http://festvox.org/cmu_faf/index.html, several from Project Gutenberg's Sherlock Holmes, and a couple from the English Wikipedia page for UTAU.



For anyone reading this and wondering about distribution: part 2 of the YouTube guide (the training one) shows that the model must be publicly accessible on Google Drive to be used in a Colab synthesis notebook (the place where you input the text to generate speech). So, theoretically, someone could share their voicebank this way.
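On that sharing point: a Colab notebook typically needs a direct-download URL rather than the normal Drive share link. Extracting the file ID and rebuilding the URL is simple (a sketch; the helper name and example ID are mine, and the `uc?id=` form is the one tools like gdown use):

```python
import re

def drive_direct_url(share_url: str) -> str:
    """Turn a 'drive.google.com/file/d/<ID>/view' share link into a direct-download URL."""
    match = re.search(r"/d/([\w-]+)", share_url)
    if not match:
        raise ValueError("No Drive file ID found in URL")
    return f"https://drive.google.com/uc?id={match.group(1)}"

print(drive_direct_url("https://drive.google.com/file/d/1AbC_xyz123/view?usp=sharing"))
# https://drive.google.com/uc?id=1AbC_xyz123
```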
 

Thehyami

Ruko's Ruffians
Defender of Defoko

Thank you so much! :smile:
I didn't know there was a text-to-speech system this easy. I tried to train NNSVS but didn't get a usable result, and it's very limited to Japanese.

Here's my result:

 

Thehyami

Ruko's Ruffians
Defender of Defoko

Did you figure out what was wrong?
 

HoodyP39

Momo's Minion
Hey guys, speaking as a Tacotron 2 expert (and one of the Uberduck contributors): to create a really decent voice that covers most phonemes, I suggest at least about 30 minutes of speech data (total WAV duration); the best is about 1-2 hours of speech data (the more the better!). So roughly 1,000-5,000 WAV files of sentences, or more, should be enough for training.
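As a rough back-of-the-envelope check (my own arithmetic, not from any Tacotron 2 documentation), those duration targets translate into clip counts like this:

```python
def clips_needed(target_minutes: float, avg_clip_seconds: float) -> int:
    """Estimate how many recordings it takes to reach a target total duration."""
    return round(target_minutes * 60 / avg_clip_seconds)

# With ~5-second sentences:
print(clips_needed(30, 5))   # 360 clips for the 30-minute minimum
print(clips_needed(120, 5))  # 1440 clips for two hours
```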

As for what type of reclist can be used for training, I suggest either LJSpeech's transcriptions (https://keithito.com/LJ-Speech-Dataset/) or CMU's publicly available CMU Arctic (http://www.festvox.org/cmu_arctic/cmuarctic.data).
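If you go the LJSpeech route, its `metadata.csv` is pipe-delimited: file ID, raw transcript, normalized transcript. A minimal parser (a sketch; the helper name is mine, and the format is the one LJSpeech documents):

```python
import csv

def parse_ljspeech_metadata(path: str) -> list[tuple[str, str]]:
    """Read LJSpeech-style metadata.csv into (file_id, normalized_text) pairs."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE because transcripts may contain literal double quotes.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            file_id, _raw, normalized = row[0], row[1], row[2]
            pairs.append((file_id, normalized))
    return pairs
```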
 

Thehyami

Ruko's Ruffians
Defender of Defoko

Do you use arpabet or alphabet to train the network? Do you also use the "_" and the "-" symbols? If you do, what do you use them for?
 

Diongoespew!

eldritch horror
Defender of Defoko
Thread starter
I HAVE FOUND THE SOLUTION! I was browsing Anna Nyui's website and discovered she did something with CoeFont. From what I can find, it's a do-it-yourself text-to-speech service where you can make your own voice and use other people's, sometimes even commercially.
It's called COEFONT! Right now I think it only does Japanese, but I found it through Anna Nyui. It's tied to your own account, though, so you can't share accounts, or voicebanks that aren't yours, even if you bought them. Thoughts on this?

Here is an example:

On this page she has samples:

And here's a take on it from someone I think is an UTAU user:



CoeFont website: https://coefont.cloud/
 

HoodyP39

Momo's Minion
You don't really have to use Arpabet for training, which means you can just use normal alphabetic sentences. Arpabet conversion is optional, but if you want it for slightly more accurate pronunciation, it's best to use a Tacotron 2 fork that has Arpabet conversion built in. Also, you don't really use the "_" symbol outside of filenames, and "-" can be used in your training transcript with its normal meaning (for example in words like "air-space" or "merry-go-round").
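To make the Arpabet option concrete: forks that support it usually expect phoneme sequences wrapped in curly braces in the transcript (the convention NVIDIA's Tacotron 2 text frontend uses), looked up in a dictionary like CMUdict. A toy sketch with a tiny hand-written dictionary (`TOY_DICT` and the helper are mine; a real setup would load the full CMUdict):

```python
# Tiny stand-in for CMUdict; a real setup would load the full dictionary.
TOY_DICT = {
    "hello": "HH AH0 L OW1",
    "world": "W ER1 L D",
}

def to_arpabet(sentence: str) -> str:
    """Wrap known words in {curly-brace} Arpabet; leave unknown words as plain text."""
    out = []
    for word in sentence.lower().split():
        if word in TOY_DICT:
            out.append("{" + TOY_DICT[word] + "}")
        else:
            out.append(word)
    return " ".join(out)

print(to_arpabet("hello world"))  # {HH AH0 L OW1} {W ER1 L D}
```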
 

Thehyami

Ruko's Ruffians
Defender of Defoko
Ooh thanks! :smile:
 
